\\d\\d)\\b\"\n",
1035 | "x = re.search(expr,s)\n",
1036 | "x.group('hours')"
1037 | ],
1038 | "execution_count": 32,
1039 | "outputs": [
1040 | {
1041 | "output_type": "execute_result",
1042 | "data": {
1043 | "application/vnd.google.colaboratory.intrinsic+json": {
1044 | "type": "string"
1045 | },
1046 | "text/plain": [
1047 | "'13'"
1048 | ]
1049 | },
1050 | "metadata": {},
1051 | "execution_count": 32
1052 | }
1053 | ]
1054 | },
1055 | {
1056 | "cell_type": "code",
1057 | "metadata": {
1058 | "colab": {
1059 | "base_uri": "https://localhost:8080/"
1060 | },
1061 | "id": "q-2_cgjs6w_T",
1062 | "outputId": "5392c522-8189-4496-e927-ba28dce38d8e"
1063 | },
1064 | "source": [
1065 | "x.span('seconds')"
1066 | ],
1067 | "execution_count": 33,
1068 | "outputs": [
1069 | {
1070 | "output_type": "execute_result",
1071 | "data": {
1072 | "text/plain": [
1073 | "(17, 19)"
1074 | ]
1075 | },
1076 | "metadata": {},
1077 | "execution_count": 33
1078 | }
1079 | ]
1080 | },
1081 | {
1082 | "cell_type": "markdown",
1083 | "metadata": {
1084 | "id": "R0w_4Hf9As-o"
1085 | },
1086 | "source": [
1087 | "#### [What is a non-capturing group in regular expressions?](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions)"
1088 | ]
1089 | },
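{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance (a minimal sketch, not part of the linked answer): `(?:...)` groups a sub-pattern, e.g. for alternation, without creating a capturing group, so it does not appear in the `group()` numbering:\n",
"\n",
"```python\n",
"import re\n",
"m = re.search(r\"(?:Mon|Tues|Wednes)day at (\\d\\d):(\\d\\d)\", \"Meeting Tuesday at 13:30\")\n",
"print(m.groups())  # ('13', '30') - the weekday alternative is matched but not captured\n",
"```"
]
},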
1090 | {
1091 | "cell_type": "markdown",
1092 | "metadata": {
1093 | "id": "VbBDkxMr7XJ5"
1094 | },
1095 | "source": [
1096 | "## Advanced Regular Expressions\n",
1097 | "\n",
1098 | "#### Finding all Matched Substrings\n",
1099 | "\n",
1100 | "The Python module re provides another great method, which other languages like Perl and Java don't provide. If you want to find all the substrings in a string, which match a regular expression, you have to use a loop in Perl and other languages, as can be seen in the following Perl snippet:\n",
1101 | "\n",
1102 | "```perl\n",
1103 | "while ($string =~ m/regex/g) {\n",
1104 | " print \"Found '$&'. Next attempt at character \" . pos($string)+1 . \"\\n\";\n",
1105 | "}\n",
1106 | "```\n",
1107 | "\n",
1108 | "It's a lot easier in Python. No need to loop. We can just use the findall method of the re module:\n",
1109 | "\n",
1110 | "```re.findall(pattern, string[, flags])```\n",
1111 | "\n",
1112 | "Findall returns all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order in which they are found"
1113 | ]
1114 | },
1115 | {
1116 | "cell_type": "code",
1117 | "metadata": {
1118 | "colab": {
1119 | "base_uri": "https://localhost:8080/"
1120 | },
1121 | "id": "vUmjFj9D7RFH",
1122 | "outputId": "a124ad69-bb9f-4d1b-93d9-54f960466295"
1123 | },
1124 | "source": [
1125 | "t=\"A fat cat doesn't eat oat but a rat eats bats.\"\n",
1126 | "mo = re.findall(\"[force]at\", t)\n",
1127 | "print(mo)"
1128 | ],
1129 | "execution_count": 34,
1130 | "outputs": [
1131 | {
1132 | "output_type": "stream",
1133 | "name": "stdout",
1134 | "text": [
1135 | "['fat', 'cat', 'eat', 'oat', 'rat', 'eat']\n"
1136 | ]
1137 | }
1138 | ]
1139 | },
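{
"cell_type": "markdown",
"metadata": {},
"source": [
"findall only returns the matched substrings. If you also need the positions, as the `pos()` call in the Perl snippet above provides, a sketch with `re.finditer`, which yields match objects, could look like this:\n",
"\n",
"```python\n",
"import re\n",
"t = \"A fat cat doesn't eat oat but a rat eats bats.\"\n",
"for m in re.finditer(\"[force]at\", t):\n",
"    print(m.group(), m.start())  # e.g. fat 2, cat 6, ...\n",
"```"
]
},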
1140 | {
1141 | "cell_type": "markdown",
1142 | "metadata": {
1143 | "id": "lo-J5FDU9U-s"
1144 | },
1145 | "source": [
1146 | "If one or more groups are present in the pattern, findall returns a list of groups. This will be a list of tuples if the pattern has more than one group. We demonstrate this in our next example. We have a long string with various Python training courses and their dates. With the first call to findall, we don't use any grouping and receive the complete string as a result. In the next call, we use grouping and findall returns a list of 2-tuples, each having the course name as the first component and the dates as the second component:"
1147 | ]
1148 | },
1149 | {
1150 | "cell_type": "code",
1151 | "metadata": {
1152 | "colab": {
1153 | "base_uri": "https://localhost:8080/"
1154 | },
1155 | "id": "0XjLiZ-T81GW",
1156 | "outputId": "7cebbe03-db15-451c-b67f-30b08895b579"
1157 | },
1158 | "source": [
1159 | "import re\n",
1160 | "courses = \"Python Training Course for Beginners: 15/Aug/2011 - 19/Aug/2011;Python Training Course Intermediate: 12/Dec/2011 - 16/Dec/2011;Python Text Processing Course:31/Oct/2011 - 4/Nov/2011\"\n",
1161 | "items = re.findall(\"[^:]*:[^;]*;?\", courses)\n",
1162 | "items"
1163 | ],
1164 | "execution_count": 35,
1165 | "outputs": [
1166 | {
1167 | "output_type": "execute_result",
1168 | "data": {
1169 | "text/plain": [
1170 | "['Python Training Course for Beginners: 15/Aug/2011 - 19/Aug/2011;',\n",
1171 | " 'Python Training Course Intermediate: 12/Dec/2011 - 16/Dec/2011;',\n",
1172 | " 'Python Text Processing Course:31/Oct/2011 - 4/Nov/2011']"
1173 | ]
1174 | },
1175 | "metadata": {},
1176 | "execution_count": 35
1177 | }
1178 | ]
1179 | },
1180 | {
1181 | "cell_type": "code",
1182 | "metadata": {
1183 | "colab": {
1184 | "base_uri": "https://localhost:8080/"
1185 | },
1186 | "id": "alyQ6ygM9W59",
1187 | "outputId": "1aa67ffa-7e38-4d91-9f2e-3ce8e65a4e6e"
1188 | },
1189 | "source": [
1190 | "items = re.findall(\"([^:]*):([^;]*;?)\", courses)\n",
1191 | "items"
1192 | ],
1193 | "execution_count": 36,
1194 | "outputs": [
1195 | {
1196 | "output_type": "execute_result",
1197 | "data": {
1198 | "text/plain": [
1199 | "[('Python Training Course for Beginners', ' 15/Aug/2011 - 19/Aug/2011;'),\n",
1200 | " ('Python Training Course Intermediate', ' 12/Dec/2011 - 16/Dec/2011;'),\n",
1201 | " ('Python Text Processing Course', '31/Oct/2011 - 4/Nov/2011')]"
1202 | ]
1203 | },
1204 | "metadata": {},
1205 | "execution_count": 36
1206 | }
1207 | ]
1208 | },
1209 | {
1210 | "cell_type": "markdown",
1211 | "metadata": {
1212 | "id": "ugAiXsBP9q0A"
1213 | },
1214 | "source": [
1215 | "#### Alternations\n",
1216 | "\n",
1217 | "In our introduction to regular expressions we had introduced character classes. Character classes offer a choice out of a set of characters. Sometimes we need a choice between several regular expressions. It's a logical \"or\" and that's why the symbol for this construct is the \"|\" symbol. In the following example, we check, if one of the cities London, Paris, Zurich, Konstanz Bern or Strasbourg appear in a string preceded by the word \"location\":"
1218 | ]
1219 | },
1220 | {
1221 | "cell_type": "code",
1222 | "metadata": {
1223 | "colab": {
1224 | "base_uri": "https://localhost:8080/"
1225 | },
1226 | "id": "Zb0aDk9B9ZU4",
1227 | "outputId": "a419899b-e216-4fbe-d0fc-b703fed2165f"
1228 | },
1229 | "source": [
1230 | "# greedy:\n",
1231 | "\n",
1232 | "import re\n",
1233 | "str = \"Course location is London or Paris!\"\n",
1234 | "mo = re.search(r\"location.*(London|Paris|Zurich|Strasbourg)\", str)\n",
1235 | "if mo: print(mo.group())"
1236 | ],
1237 | "execution_count": 40,
1238 | "outputs": [
1239 | {
1240 | "output_type": "stream",
1241 | "name": "stdout",
1242 | "text": [
1243 | "location is London or Paris\n"
1244 | ]
1245 | }
1246 | ]
1247 | },
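{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `.*` above is greedy, so it runs on to the last alternative it can still satisfy (\"Paris\"). A non-greedy sketch for comparison (same string, `.*?` instead of `.*`) stops at the first city:\n",
"\n",
"```python\n",
"import re\n",
"text = \"Course location is London or Paris!\"\n",
"mo = re.search(r\"location.*?(London|Paris|Zurich|Strasbourg)\", text)\n",
"if mo: print(mo.group())  # location is London\n",
"```"
]
},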
1248 | {
1249 | "cell_type": "markdown",
1250 | "metadata": {
1251 | "id": "g-qC64gABE6u"
1252 | },
1253 | "source": [
1254 | "#### Compiling Regular Expressions\n",
1255 | "\n",
1256 | "If you want to use the same regexp more than once in a script, it might be a good idea to use a regular expression object, i.e. the regex is compiled.\n",
1257 | "\n",
1258 | "The general syntax:\n",
1259 | "\n",
1260 | "```re.compile(pattern[, flags])```\n",
1261 | "\n",
1262 | "compile returns a regex object, which can be used later for searching and replacing. The expressions behaviour can be modified by specifying a flag value:\n",
1263 | "\n",
1264 | "|Abbreviation| Full name |\n",
1265 | "|----------- | -----------|\n",
1266 | "| re.I | re.IGNORECASE |\n",
1267 | "| re.L | re.LOCALE |\n",
1268 | "| re.M | re.MULTILINE |\n",
1269 | "| re.S | re.DOTALL |\n",
1270 | "| re.U | re.UNICODE |\n",
1271 | "| re.X | re.VERBOSE |\n",
1272 | "\n",
1273 | "Compiled regular objects usually are not saving much time, because Python internally compiles AND CACHES regexes whenever you use them with re.search() or re.match(). The only extra time a non-compiled regex takes is the time it needs to check the cache, which is a key lookup of a dictionary.\n",
1274 | "\n",
1275 | "A good reason to use them is to separate the definition of a regex from its use.\n",
1276 | "\n",
1277 | "#### Splitting a String With or Without Regular Expressions\n",
1278 | "\n",
1279 | "There is a string method split, which can be used to split a string into a list of substrings:\n",
1280 | "\n",
1281 | "``` str.split([sep[, maxsplit]])```\n",
1282 | "\n",
1283 | "As you can see, the method split has two optional parameters. If none is given (or is None) , a string will be separated into substring using whitespaces as delimiters, i.e. every substring consisting purely of whitespaces is used as a delimiter.\n",
1284 | "\n",
1285 | "\n",
1286 | "\n",
1287 | "We demonstrate this behaviour with a famous quotation by Abraham Lincoln:"
1288 | ]
1289 | },
1290 | {
1291 | "cell_type": "code",
1292 | "metadata": {
1293 | "colab": {
1294 | "base_uri": "https://localhost:8080/"
1295 | },
1296 | "id": "Bx9abbum-OIg",
1297 | "outputId": "c7904b8c-f287-431e-d72a-1e5892e325d4"
1298 | },
1299 | "source": [
1300 | "law_courses = \"Let reverence for the laws be breathed by every American mother to the lisping babe that prattles on her lap. Let it be taught in schools, in seminaries, and in colleges. Let it be written in primers, spelling books, and in almanacs. Let it be preached from the pulpit, proclaimed in legislative halls, and enforced in the courts of justice. And, in short, let it become the political religion of the nation.\"\n",
1301 | "law_courses.split()"
1302 | ],
1303 | "execution_count": 41,
1304 | "outputs": [
1305 | {
1306 | "output_type": "execute_result",
1307 | "data": {
1308 | "text/plain": [
1309 | "['Let',\n",
1310 | " 'reverence',\n",
1311 | " 'for',\n",
1312 | " 'the',\n",
1313 | " 'laws',\n",
1314 | " 'be',\n",
1315 | " 'breathed',\n",
1316 | " 'by',\n",
1317 | " 'every',\n",
1318 | " 'American',\n",
1319 | " 'mother',\n",
1320 | " 'to',\n",
1321 | " 'the',\n",
1322 | " 'lisping',\n",
1323 | " 'babe',\n",
1324 | " 'that',\n",
1325 | " 'prattles',\n",
1326 | " 'on',\n",
1327 | " 'her',\n",
1328 | " 'lap.',\n",
1329 | " 'Let',\n",
1330 | " 'it',\n",
1331 | " 'be',\n",
1332 | " 'taught',\n",
1333 | " 'in',\n",
1334 | " 'schools,',\n",
1335 | " 'in',\n",
1336 | " 'seminaries,',\n",
1337 | " 'and',\n",
1338 | " 'in',\n",
1339 | " 'colleges.',\n",
1340 | " 'Let',\n",
1341 | " 'it',\n",
1342 | " 'be',\n",
1343 | " 'written',\n",
1344 | " 'in',\n",
1345 | " 'primers,',\n",
1346 | " 'spelling',\n",
1347 | " 'books,',\n",
1348 | " 'and',\n",
1349 | " 'in',\n",
1350 | " 'almanacs.',\n",
1351 | " 'Let',\n",
1352 | " 'it',\n",
1353 | " 'be',\n",
1354 | " 'preached',\n",
1355 | " 'from',\n",
1356 | " 'the',\n",
1357 | " 'pulpit,',\n",
1358 | " 'proclaimed',\n",
1359 | " 'in',\n",
1360 | " 'legislative',\n",
1361 | " 'halls,',\n",
1362 | " 'and',\n",
1363 | " 'enforced',\n",
1364 | " 'in',\n",
1365 | " 'the',\n",
1366 | " 'courts',\n",
1367 | " 'of',\n",
1368 | " 'justice.',\n",
1369 | " 'And,',\n",
1370 | " 'in',\n",
1371 | " 'short,',\n",
1372 | " 'let',\n",
1373 | " 'it',\n",
1374 | " 'become',\n",
1375 | " 'the',\n",
1376 | " 'political',\n",
1377 | " 'religion',\n",
1378 | " 'of',\n",
1379 | " 'the',\n",
1380 | " 'nation.']"
1381 | ]
1382 | },
1383 | "metadata": {},
1384 | "execution_count": 41
1385 | }
1386 | ]
1387 | },
1388 | {
1389 | "cell_type": "markdown",
1390 | "metadata": {
1391 | "id": "tjOlb25EDa5v"
1392 | },
1393 | "source": [
1394 | "Now we look at a string, which could stem from an Excel or an OpenOffice calc file. We have seen in our previous example that split takes whitespaces as default separators. We want to split the string in the following little example using semicolons as separators. The only thing we have to do is to use \";\" as an argument of split():"
1395 | ]
1396 | },
1397 | {
1398 | "cell_type": "code",
1399 | "metadata": {
1400 | "colab": {
1401 | "base_uri": "https://localhost:8080/"
1402 | },
1403 | "id": "m_oKZMcQDHaX",
1404 | "outputId": "7553e3e9-7ffd-49fe-8662-74134f748c1c"
1405 | },
1406 | "source": [
1407 | "line = \"James;Miller;teacher;Perl\"\n",
1408 | "line.split(\";\")"
1409 | ],
1410 | "execution_count": 42,
1411 | "outputs": [
1412 | {
1413 | "output_type": "execute_result",
1414 | "data": {
1415 | "text/plain": [
1416 | "['James', 'Miller', 'teacher', 'Perl']"
1417 | ]
1418 | },
1419 | "metadata": {},
1420 | "execution_count": 42
1421 | }
1422 | ]
1423 | },
1424 | {
1425 | "cell_type": "markdown",
1426 | "metadata": {
1427 | "id": "CtFK-OQdDihx"
1428 | },
1429 | "source": [
1430 | "The method split() has another optional parameter: maxsplit. If maxsplit is given, at most maxsplit splits are done. This means that the resulting list will have at most \"maxsplit + 1\" elements. We will illustrate the mode of operation of maxsplit in the next example:"
1431 | ]
1432 | },
1433 | {
1434 | "cell_type": "code",
1435 | "metadata": {
1436 | "colab": {
1437 | "base_uri": "https://localhost:8080/"
1438 | },
1439 | "id": "uqFzU0d5Dc7P",
1440 | "outputId": "e5e38ca8-ee7e-4216-c75f-c8588a22586c"
1441 | },
1442 | "source": [
1443 | "mammon = \"The god of the world's leading religion. The chief temple is in the holy city of New York.\"\n",
1444 | "mammon.split(\" \",3)"
1445 | ],
1446 | "execution_count": 43,
1447 | "outputs": [
1448 | {
1449 | "output_type": "execute_result",
1450 | "data": {
1451 | "text/plain": [
1452 | "['The',\n",
1453 | " 'god',\n",
1454 | " 'of',\n",
1455 | " \"the world's leading religion. The chief temple is in the holy city of New York.\"]"
1456 | ]
1457 | },
1458 | "metadata": {},
1459 | "execution_count": 43
1460 | }
1461 | ]
1462 | },
1463 | {
1464 | "cell_type": "markdown",
1465 | "metadata": {
1466 | "id": "tPHgU5gjDqwb"
1467 | },
1468 | "source": [
1469 | "We used a Blank as a delimiter string in the previous example, which can be a problem: If multiple blanks or whitespaces are connected, split() will split the string after every single blank, so that we will get empty strings and strings with only a tab inside ('\\t') in our result list:"
1470 | ]
1471 | },
1472 | {
1473 | "cell_type": "code",
1474 | "metadata": {
1475 | "colab": {
1476 | "base_uri": "https://localhost:8080/"
1477 | },
1478 | "id": "x-y8ZGy5DkfL",
1479 | "outputId": "1e74c135-269b-4015-c6ed-b0249e724339"
1480 | },
1481 | "source": [
1482 | "mammon = \"The god \\t of the world's leading religion. The chief temple is in the holy city of New York.\"\n",
1483 | "mammon.split(\" \",5)"
1484 | ],
1485 | "execution_count": 44,
1486 | "outputs": [
1487 | {
1488 | "output_type": "execute_result",
1489 | "data": {
1490 | "text/plain": [
1491 | "['The',\n",
1492 | " 'god',\n",
1493 | " '',\n",
1494 | " '\\t',\n",
1495 | " 'of',\n",
1496 | " \"the world's leading religion. The chief temple is in the holy city of New York.\"]"
1497 | ]
1498 | },
1499 | "metadata": {},
1500 | "execution_count": 44
1501 | }
1502 | ]
1503 | },
1504 | {
1505 | "cell_type": "markdown",
1506 | "metadata": {
1507 | "id": "ksQY6uKzD9LS"
1508 | },
1509 | "source": [
1510 | "We can prevent the separation of empty strings by using None as the first argument. Now split will use the default behaviour, i.e. every substring consisting of connected whitespace characters will be taken as one separator:"
1511 | ]
1512 | },
1513 | {
1514 | "cell_type": "code",
1515 | "metadata": {
1516 | "colab": {
1517 | "base_uri": "https://localhost:8080/"
1518 | },
1519 | "id": "4GeKhddDDyjb",
1520 | "outputId": "0ad3c905-b05a-437d-f1af-18d410b662b3"
1521 | },
1522 | "source": [
1523 | "mammon.split(None,5)"
1524 | ],
1525 | "execution_count": 45,
1526 | "outputs": [
1527 | {
1528 | "output_type": "execute_result",
1529 | "data": {
1530 | "text/plain": [
1531 | "['The',\n",
1532 | " 'god',\n",
1533 | " 'of',\n",
1534 | " 'the',\n",
1535 | " \"world's\",\n",
1536 | " 'leading religion. The chief temple is in the holy city of New York.']"
1537 | ]
1538 | },
1539 | "metadata": {},
1540 | "execution_count": 45
1541 | }
1542 | ]
1543 | },
1544 | {
1545 | "cell_type": "markdown",
1546 | "metadata": {
1547 | "id": "v-R8av8gEKfD"
1548 | },
1549 | "source": [
1550 | "#### Regular Expression Split\n",
1551 | "\n",
1552 | "The string method split() is the right tool in many cases, but what, if you want e.g. to get the bare words of a text, i.e. without any special characters and whitespaces. If we want this, we have to use the split function from the re module. We illustrate this method with a short text from the beginning of Metamorphoses by Ovid:"
1553 | ]
1554 | },
1555 | {
1556 | "cell_type": "code",
1557 | "metadata": {
1558 | "colab": {
1559 | "base_uri": "https://localhost:8080/"
1560 | },
1561 | "id": "mKitEEcREIQL",
1562 | "outputId": "05e54512-c90b-4a65-b8ef-726cccdad81d"
1563 | },
1564 | "source": [
1565 | "import re\n",
1566 | "metamorphoses = \"OF bodies chang'd to various forms, I sing: Ye Gods, from whom these miracles did spring, Inspire my numbers with coelestial heat;\"\n",
1567 | "re.split(\"\\W+\", metamorphoses)"
1568 | ],
1569 | "execution_count": 46,
1570 | "outputs": [
1571 | {
1572 | "output_type": "execute_result",
1573 | "data": {
1574 | "text/plain": [
1575 | "['OF',\n",
1576 | " 'bodies',\n",
1577 | " 'chang',\n",
1578 | " 'd',\n",
1579 | " 'to',\n",
1580 | " 'various',\n",
1581 | " 'forms',\n",
1582 | " 'I',\n",
1583 | " 'sing',\n",
1584 | " 'Ye',\n",
1585 | " 'Gods',\n",
1586 | " 'from',\n",
1587 | " 'whom',\n",
1588 | " 'these',\n",
1589 | " 'miracles',\n",
1590 | " 'did',\n",
1591 | " 'spring',\n",
1592 | " 'Inspire',\n",
1593 | " 'my',\n",
1594 | " 'numbers',\n",
1595 | " 'with',\n",
1596 | " 'coelestial',\n",
1597 | " 'heat',\n",
1598 | " '']"
1599 | ]
1600 | },
1601 | "metadata": {},
1602 | "execution_count": 46
1603 | }
1604 | ]
1605 | },
1606 | {
1607 | "cell_type": "markdown",
1608 | "metadata": {
1609 | "id": "axXupIvFEiG-"
1610 | },
1611 | "source": [
1612 | "The following example is a good case, where the regular expression is really superior to the string split. Let's assume that we have data lines with surnames, first names and professions of names. We want to clear the data line of the superfluous and redundant text descriptions, i.e. \"surname: \", \"prename: \" and so on, so that we have solely the surname in the first column, the first name in the second column and the profession in the third column:"
1613 | ]
1614 | },
1615 | {
1616 | "cell_type": "code",
1617 | "metadata": {
1618 | "colab": {
1619 | "base_uri": "https://localhost:8080/"
1620 | },
1621 | "id": "jBXayxTtEZOV",
1622 | "outputId": "9413be21-7828-4609-8be8-cd2a683c1843"
1623 | },
1624 | "source": [
1625 | "import re\n",
1626 | "lines = [\"surname: Obama, prename: Barack, profession: president\", \"surname: Merkel, prename: Angela, profession: chancellor\"]\n",
1627 | "for line in lines:\n",
1628 | " print(re.split(\",* *\\w*: \", line))"
1629 | ],
1630 | "execution_count": 47,
1631 | "outputs": [
1632 | {
1633 | "output_type": "stream",
1634 | "name": "stdout",
1635 | "text": [
1636 | "['', 'Obama', 'Barack', 'president']\n",
1637 | "['', 'Merkel', 'Angela', 'chancellor']\n"
1638 | ]
1639 | }
1640 | ]
1641 | },
1642 | {
1643 | "cell_type": "markdown",
1644 | "metadata": {
1645 | "id": "9-LnRBN8E2j7"
1646 | },
1647 | "source": [
1648 | "We can easily improve the script by using a slice operator, so that we don't have the empty string as the first element of our result lists:"
1649 | ]
1650 | },
1651 | {
1652 | "cell_type": "code",
1653 | "metadata": {
1654 | "colab": {
1655 | "base_uri": "https://localhost:8080/"
1656 | },
1657 | "id": "xkmsfFV9E2Eq",
1658 | "outputId": "c0681369-44c8-4e56-f761-5ad0e37a0fb0"
1659 | },
1660 | "source": [
1661 | "import re\n",
1662 | "lines = [\"surname: Obama, prename: Barack, profession: president\", \"surname: Merkel, prename: Angela, profession: chancellor\"]\n",
1663 | "for line in lines:\n",
1664 | " print(re.split(\",* *\\w*: \", line)[1:])"
1665 | ],
1666 | "execution_count": 48,
1667 | "outputs": [
1668 | {
1669 | "output_type": "stream",
1670 | "name": "stdout",
1671 | "text": [
1672 | "['Obama', 'Barack', 'president']\n",
1673 | "['Merkel', 'Angela', 'chancellor']\n"
1674 | ]
1675 | }
1676 | ]
1677 | },
1678 | {
1679 | "cell_type": "markdown",
1680 | "metadata": {
1681 | "id": "t7tVohopE-fb"
1682 | },
1683 | "source": [
1684 | "#### Search and Replace with sub\n",
1685 | "\n",
1686 | "```re.sub(regex, replacement, subject)```\n",
1687 | "\n",
1688 | "Every match of the regular expression regex in the string subject will be replaced by the string replacement. Example:"
1689 | ]
1690 | },
1691 | {
1692 | "cell_type": "code",
1693 | "metadata": {
1694 | "colab": {
1695 | "base_uri": "https://localhost:8080/"
1696 | },
1697 | "id": "aMXXwUu0EjxE",
1698 | "outputId": "1ff0df1d-c014-4068-c432-559c531d2d05"
1699 | },
1700 | "source": [
1701 | "import re\n",
1702 | "str = \"yes I said yes I will Yes.\"\n",
1703 | "res = re.sub(\"[yY]es\",\"no\", str)\n",
1704 | "print(res)"
1705 | ],
1706 | "execution_count": 49,
1707 | "outputs": [
1708 | {
1709 | "output_type": "stream",
1710 | "name": "stdout",
1711 | "text": [
1712 | "no I said no I will no.\n"
1713 | ]
1714 | }
1715 | ]
1716 | },
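{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small additional sketch: `re.sub` also accepts an optional `count` argument that limits the number of replacements, and the replacement string may refer back to groups of the match with `\\1`, `\\2`, and so on:\n",
"\n",
"```python\n",
"import re\n",
"print(re.sub(\"[yY]es\", \"no\", \"yes I said yes I will Yes.\", count=2))  # no I said no I will Yes.\n",
"print(re.sub(r\"(\\w+) (\\w+)\", r\"\\2 \\1\", \"hello world\"))                # world hello\n",
"```"
]
},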
1717 | {
1718 | "cell_type": "code",
1719 | "metadata": {
1720 | "id": "6h1Jm1PkFKub"
1721 | },
1722 | "source": [
1723 | ""
1724 | ],
1725 | "execution_count": null,
1726 | "outputs": []
1727 | }
1728 | ]
1729 | }
--------------------------------------------------------------------------------
/week6/Lecture_12_requests.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "Lecture 12. requests.ipynb",
7 | "provenance": [],
8 | "collapsed_sections": [],
9 | "authorship_tag": "ABX9TyOz0u2npZ/DvoCJ5fZ5sjOG",
10 | "include_colab_link": true
11 | },
12 | "kernelspec": {
13 | "name": "python3",
14 | "display_name": "Python 3"
15 | },
16 | "language_info": {
17 | "name": "python"
18 | }
19 | },
20 | "cells": [
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {
24 | "id": "view-in-github",
25 | "colab_type": "text"
26 | },
27 | "source": [
28 | "
"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {
34 | "id": "Gf_pWG1fndIo"
35 | },
36 | "source": [
37 | "#### В этом туториале мы попробуем получить ответ от [веб сервиса](http://www.cbs.dtu.dk/services/NetMHCpan/) внутри этого ноутбука, а не через браузер\n",
38 | "\n",
39 | "Импортируем нужные модули:"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "metadata": {
45 | "id": "Mt36MTr-xH4e"
46 | },
47 | "source": [
48 | "import requests"
49 | ],
50 | "execution_count": null,
51 | "outputs": []
52 | },
53 | {
54 | "cell_type": "markdown",
55 | "metadata": {
56 | "id": "qI4Ixs94nnme"
57 | },
58 | "source": [
59 | "#### Ссылка на скрипт на сервере, который обрабатывает изначальный POST запрос:\n",
60 | "\n",
61 | "Всегда, когда вы сабмитите форму в интернете, вы посылаете POST запрос скрипту адрес которого можно узнать просмотрев submitted form data. Загуглите \"how to view submitted form data\" или посмотрите вот это:\n",
62 | "\n",
63 | "https://www.youtube.com/watch?v=SvUqk683mSA\n",
64 | "\n",
65 | "https://wpscholar.com/blog/view-form-data-in-chrome/"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "metadata": {
71 | "id": "ZpcVQTtxxRu6"
72 | },
73 | "source": [
74 | "url = 'http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi'"
75 | ],
76 | "execution_count": null,
77 | "outputs": []
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "metadata": {
82 | "id": "WZll65hWosP8"
83 | },
84 | "source": [
85 | "#### Формируем данные формы (словарь `post_dadta` - это я его так назвал) - как их формировать смотри так же, как адрес обрабатывающего скрипта (ссылки выше) - и отправляем с помощью `requests` этот POST запрос:"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "metadata": {
91 | "id": "_krexwfv3fv3",
92 | "colab": {
93 | "base_uri": "https://localhost:8080/"
94 | },
95 | "outputId": "f8e2c555-dcb8-4492-d80a-b8582716ce76"
96 | },
97 | "source": [
98 | "post_data = {'configfile': '/usr/opt/www/pub/CBS/services/NetMHCpan-4.1/NetMHCpan.cf',\n",
99 | " 'inp': 0,\n",
100 | " 'SEQPASTE': 'NLVPMVATV',\n",
101 | " 'master': 1,\n",
102 | " 'thrs': 0.5,\n",
103 | " 'thrw': 2,\n",
104 | " 'threshold': -99\n",
105 | " }\n",
106 | "r = requests.post(url, data=post_data) # тут хранится ответ\n",
107 | "type(r) # видим, что этот ответ упакован в экземпляр класса Response,\n",
108 | " # который находится в папке (или файле) models, который находится в папке\n",
109 | " # (папка всего модуля) requests"
110 | ],
111 | "execution_count": null,
112 | "outputs": [
113 | {
114 | "output_type": "execute_result",
115 | "data": {
116 | "text/plain": [
117 | "requests.models.Response"
118 | ]
119 | },
120 | "metadata": {
121 | "tags": []
122 | },
123 | "execution_count": 73
124 | }
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "metadata": {
130 | "id": "sN9wj7GVqA_d"
131 | },
132 | "source": [
133 | "# Вот кстати все методы (функции) и атрибуты (свойства),\n",
134 | "# которые есть у этого объекта:\n",
135 | "\n",
136 | "# dir(r)"
137 | ],
138 | "execution_count": 1,
139 | "outputs": []
140 | },
141 | {
142 | "cell_type": "markdown",
143 | "metadata": {
144 | "id": "qFu_C5DCrusV"
145 | },
146 | "source": [
147 | "#### Среди всех прочих есть метод `.content`, по которому получаем ответ сервера в виде строки, которая является `html` размеченным текстом\n",
148 | "\n",
149 | "На этом движке взаимодействие с пользователем устроено так: сначала вы посылаете POST запрос и получаете ответ (это мы сделали). В этом ответе вам приходит информация о `jobid` - набор символов и букв, который вы (пользователи) посылаете уже другому скрипту-обработчику уже методом GET (просто в адресной строке). Мы достаем адрес второго обработчика из `r`, оттуда же достаем `jobid` и снова используем модуль `requests` для создания уже запроса GET:"
150 | ]
151 | },
152 | {
153 | "cell_type": "code",
154 | "metadata": {
155 | "colab": {
156 | "base_uri": "https://localhost:8080/"
157 | },
158 | "id": "-BR-NRSumHSc",
159 | "outputId": "74856b19-38ed-467f-893b-82e7653fe1eb"
160 | },
161 | "source": [
162 | "link = r.content[-120:-45] # адрес второго обработчика (он такой же как и первый)\n",
163 | " # с приделанными к нему данными методом GET - в форме\n",
164 | " # 'key: value' после знака '?'\n",
165 | "print(link)\n",
166 | "jobid = link[-24:]\n",
167 | "url2 = link[:-31]\n",
168 | "print(jobid)\n",
169 | "print(url2)"
170 | ],
171 | "execution_count": null,
172 | "outputs": [
173 | {
174 | "output_type": "stream",
175 | "text": [
176 | "b'http://www.cbs.dtu.dk//cgi-bin/webface2.fcgi?jobid=6101365800005B22C53E54DA'\n",
177 | "b'6101365800005B22C53E54DA'\n",
178 | "b'http://www.cbs.dtu.dk//cgi-bin/webface2.fcgi'\n"
179 | ],
180 | "name": "stdout"
181 | }
182 | ]
183 | },
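{
"cell_type": "markdown",
"metadata": {},
"source": [
"The hard-coded slice indices above are fragile. As a sketch of a more robust alternative (assuming the response still embeds a link of the form `...webface2.fcgi?jobid=...`), the `jobid` could be extracted with a regular expression instead:\n",
"\n",
"```python\n",
"import re\n",
"# r.content is bytes, so we search with a bytes pattern\n",
"m = re.search(rb\"jobid=([A-F0-9]+)\", r.content)\n",
"if m:\n",
"    jobid = m.group(1)  # e.g. b'6101365800005B22C53E54DA'\n",
"```"
]
},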
184 | {
185 | "cell_type": "markdown",
186 | "metadata": {
187 | "id": "wZoa0z_At_B0"
188 | },
189 | "source": [
190 | "#### Ну и наконец создаем новый запрос GET:\n",
191 | "\n",
192 | "Почему именно такой запрос, я узнал опять же просмотрев submitted form data в браузере, как описано по ссылкам выше"
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "metadata": {
198 | "id": "WxkEswry5P-x"
199 | },
200 | "source": [
201 | "get_data = {'jobid': jobid,\n",
202 | " 'wait' : '20'\n",
203 | " }\n",
204 | "result = requests.get(url2, params=get_data)"
205 | ],
206 | "execution_count": null,
207 | "outputs": []
208 | },
209 | {
210 | "cell_type": "code",
211 | "metadata": {
212 | "id": "3Bcn_vZ65VId",
213 | "colab": {
214 | "base_uri": "https://localhost:8080/"
215 | },
216 | "outputId": "d379f2cc-a56a-4393-9099-3267c7d1a06d"
217 | },
218 | "source": [
219 | "result.content # тут находится нужный нам ответ в виде html строки"
220 | ],
221 | "execution_count": null,
222 | "outputs": [
223 | {
224 | "output_type": "execute_result",
225 | "data": {
226 | "text/plain": [
227 | "b'\\n NetMHCpan 4.1 Server - prediction results\\n\\n\\n
\\n\\n \\n | \\n | NetMHCpan Server - prediction results\\n Technical University of Denmark\\n |
\\n
\\n\\n
\\n\\n\\n# NetMHCpan version 4.1b\\n\\n# Tmpdir made /usr/opt/www/webface/tmp/server/netmhcpan/6101365800005B22C53E54DA/netMHCpanWfQpLk\\n# Input is in FSA format\\n\\n# Peptide length 8,9,10,11\\n\\n# Make EL predictions\\n\\nHLA-A02:01 : Distance to training data 0.000 (using nearest neighbor HLA-A02:01)\\n\\n# Rank Threshold for Strong binding peptides 0.500\\n# Rank Threshold for Weak binding peptides 2.000\\n---------------------------------------------------------------------------------------------------------------------------\\n Pos MHC Peptide Core Of Gp Gl Ip Il Icore Identity Score_EL %Rank_EL BindLevel\\n---------------------------------------------------------------------------------------------------------------------------\\n 1 HLA-A*02:01 NLVPMVATV NLVPMVATV 0 0 0 0 0 NLVPMVATV Sequence 0.8323630 0.085 <= SB\\n 1 HLA-A*02:01 NLVPMVAT NLV-PMVAT 0 0 0 3 1 NLVPMVAT Sequence 0.0008310 19.184\\n 2 HLA-A*02:01 LVPMVATV -LVPMVATV 0 0 0 0 1 LVPMVATV Sequence 0.0170850 5.151\\n---------------------------------------------------------------------------------------------------------------------------\\n\\nProtein Sequence. Allele HLA-A*02:01. Number of high binders 1. Number of weak binders 0. Number of peptides 3\\n\\nLink to Allele Frequencies in Worldwide Populations HLA-A02:01\\n-----------------------------------------------------------------------------------\\n\\nExplain the output. Go back.\\n
\\n\\n
\\n\\n\\n'"
228 | ]
229 | },
230 | "metadata": {
231 | "tags": []
232 | },
233 | "execution_count": 93
234 | }
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {
240 | "id": "-TVG5IQRzAKj"
241 | },
242 | "source": [
243 | "#### Сохраним его в виде `html` странице на сервере колаба:"
244 | ]
245 | },
246 | {
247 | "cell_type": "code",
248 | "metadata": {
249 | "colab": {
250 | "base_uri": "https://localhost:8080/"
251 | },
252 | "id": "fLtcPzvanSAg",
253 | "outputId": "6c6ca909-7425-42f3-d228-6f4b5ff7c6f0"
254 | },
255 | "source": [
256 | "with open(\"results.html\", \"w\") as f:\n",
257 | " f.write(str(result.text))\n",
258 | "!ls # удостоверимся, что сохранили"
259 | ],
260 | "execution_count": null,
261 | "outputs": [
262 | {
263 | "output_type": "stream",
264 | "text": [
265 | "results.html sample_data\n"
266 | ],
267 | "name": "stdout"
268 | }
269 | ]
270 | },
271 | {
272 | "cell_type": "markdown",
273 | "metadata": {
274 | "id": "_0eBCrEHzT7R"
275 | },
276 | "source": [
277 | "#### Ну и запустим вебсервер в колабе, чтобы просмотреть наш результат как веб страницу - нужно нажать на появившуюся ссылку и выбрать потом наш `html` файл (чтобы делать потом что-то еще нужно будет остановить вебсервер, остановив ячейку):"
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "metadata": {
283 | "colab": {
284 | "base_uri": "https://localhost:8080/",
285 | "height": 238
286 | },
287 | "id": "l4K3oh1SxNlx",
288 | "outputId": "29dd747a-dcc9-41b2-cc09-ad93a537517e"
289 | },
290 | "source": [
291 | "from google.colab.output import eval_js\n",
292 | "print(eval_js(\"google.colab.kernel.proxyPort(8000)\"))\n",
293 | "# https://z4spb7cvssd-496ff2e9c6d22116-8000-colab.googleusercontent.com/\n",
294 | "!python -m http.server 8000"
295 | ],
296 | "execution_count": null,
297 | "outputs": [
298 | {
299 | "output_type": "stream",
300 | "text": [
301 | "https://mp1ubn0bvld-496ff2e9c6d22116-8000-colab.googleusercontent.com/\n",
302 | "Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...\n",
303 | "127.0.0.1 - - [28/Jul/2021 11:10:43] \"GET / HTTP/1.1\" 200 -\n",
304 | "127.0.0.1 - - [28/Jul/2021 11:10:46] code 404, message File not found\n",
305 | "127.0.0.1 - - [28/Jul/2021 11:10:46] \"GET /favicon.ico HTTP/1.1\" 404 -\n",
306 | "127.0.0.1 - - [28/Jul/2021 11:10:47] \"GET /results.html HTTP/1.1\" 200 -\n",
307 | "127.0.0.1 - - [28/Jul/2021 11:10:50] code 404, message File not found\n",
308 | "127.0.0.1 - - [28/Jul/2021 11:10:50] \"GET /images/m_logo.gif HTTP/1.1\" 404 -\n",
309 | "127.0.0.1 - - [28/Jul/2021 11:10:50] code 404, message File not found\n",
310 | "127.0.0.1 - - [28/Jul/2021 11:10:50] \"GET /favicon.ico HTTP/1.1\" 404 -\n",
311 | "\n",
312 | "Keyboard interrupt received, exiting.\n",
313 | "^C\n"
314 | ],
315 | "name": "stdout"
316 | }
317 | ]
318 | },
319 | {
320 | "cell_type": "markdown",
321 | "metadata": {
322 | "id": "XkguaHDGzyBN"
323 | },
324 | "source": [
325 | "Все."
326 | ]
327 | },
328 | {
329 | "cell_type": "code",
330 | "metadata": {
331 | "id": "QEi2YRTByMZw"
332 | },
333 | "source": [
334 | ""
335 | ],
336 | "execution_count": null,
337 | "outputs": []
338 | }
339 | ]
340 | }
--------------------------------------------------------------------------------