├── README.md ├── week1 ├── Lecture_1_arithmetics.ipynb └── Lecture_2_if_for_while_float_formatted_output.ipynb ├── week2 ├── Lecture_3_types_sequential_copy.ipynb └── Lecture_4_dictionaries_and_sets.ipynb ├── week3 ├── Lecture_5_functions_namespaces.ipynb └── Lecture_6_lambda_recursion.ipynb ├── week4 ├── Lecture_7_files_packages_exceptions.ipynb └── Lecture_8_OOP.ipynb ├── week5 ├── Lecture_10_callable_context_managers_slots_ABC.ipynb └── Lecture_9_descriptors_inheritance_ipynb.ipynb └── week6 ├── Lecture_11_metaclasses_iterators.ipynb ├── Lecture_12_re.ipynb └── Lecture_12_requests.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # Program 2 | 3 | ## 🐍🐍🐍🐍🐍🐍🐍 week1 4 | 5 | #### Lecture 1 6 | 7 | 8 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/justalge/another_python_totorial/blob/main/week1/Lecture_1_arithmetics.ipynb) 9 | 10 | 1. [A few words on the origins of the language. Some differences between Python 2 and Python 3](https://www.python-course.eu/python3_history_and_philosophy.php) 11 | 2. [Getting to know the Python shell](https://www.python-course.eu/python3_interactive.php) 12 | 3. [Running a Python script from the command line, .pyc files, interpreters and compilers](https://www.python-course.eu/python3_execute_script.php) 13 | 4. [Structuring code with indentation](https://www.python-course.eu/python3_blocks.php) 14 | 5. [Integers, arithmetic operators and comparison operators](https://www.python-course.eu/python3_operators.php) 15 | 6. [Input](https://www.python-course.eu/python3_input.php) and [output](https://www.python-course.eu/python3_print.php) 16 | 17 | #### Lecture 2 18 | 19 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/justalge/another_python_totorial/blob/main/week1/Lecture_2_if_for_while_float_formatted_output.ipynb) 20 | 21 | 1. 
[Conditional statements](https://www.python-course.eu/python3_conditional_statements.php) 22 | 2. [Loops 1. While loops](https://www.python-course.eu/python3_loops.php) 23 | 3. [Loops 2. For loops](https://www.python-course.eu/python3_for_loop.php) 24 | 4. [Floating-point numbers in Python](https://tirinox.ru/float-python/) 25 | 5. [Formatted output](https://www.python-course.eu/python3_formatted_output.php) 26 | 27 | ## 🐍🐍🐍🐍🐍🐍🐍 week2 28 | 29 | #### Lecture 3 30 | 31 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/justalge/another_python_totorial/blob/main/week2/Lecture_3_types_sequential_copy.ipynb) 32 | 33 | 1. **(!)** [Variables and data types in Python](https://www.python-course.eu/python3_variables.php) 34 | 2. [Introduction to strings](https://www.python-course.eu/python3_variables.php) 35 | 3. [Basic sequential data types: strings and lists](https://www.python-course.eu/python3_sequential_data_types.php) 36 | 4. [List manipulation](https://www.python-course.eu/python3_list_manipulation.php) 37 | 5. [List comprehensions](https://www.python-course.eu/python3_list_comprehension.php) 38 | 6. [Shallow and deep copying of objects in Python](https://www.python-course.eu/python3_deep_copy.php) 39 | 40 | #### Lecture 4 41 | 42 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/justalge/another_python_totorial/blob/main/week2/Lecture_4_dictionaries_and_sets.ipynb) 43 | 44 | 1. [Dictionaries](https://www.python-course.eu/python3_dictionaries.php) 45 | 2. [Sets and frozen sets](https://www.python-course.eu/python3_sets_frozensets.php) 46 | 3. [Examples of using sets](https://www.python-course.eu/python_sets_example.php) 47 | 4. 
[Examples using loops and dictionaries](https://www.python-course.eu/working_with_python_dictionaries.php) 48 | 49 | ## 🐍🐍🐍🐍🐍🐍🐍 week3 50 | 51 | #### Lecture 5 52 | 53 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/justalge/another_python_totorial/blob/main/week3/Lecture_5_functions_namespaces.ipynb) 54 | 55 | 1. [Functions](https://www.python-course.eu/python3_functions.php) 56 | 2. [Passing arguments to functions in Python](https://www.python-course.eu/python3_passing_arguments.php) 57 | 3. [Function decorators](https://www.python-course.eu/python3_decorators.php) 58 | 4. [Scopes (namespaces)](https://www.python-course.eu/python3_namespaces.php) 59 | 5. [Global and local variables](https://www.python-course.eu/python3_global_vs_local_variables.php) 60 | 61 | #### Lecture 6 62 | 63 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/justalge/another_python_totorial/blob/main/week3/Lecture_6_lambda_recursion.ipynb) 64 | 65 | 1. Built-in sorting in Python 66 | 2. [Lambda functions, filter, reduce, map, zip](https://www.python-course.eu/python3_lambda.php) 67 | 3. [Recursion. Memoization](https://www.python-course.eu/python3_recursive_functions.php) 68 | 4. Towers of Hanoi 69 | 5. [Backtracking](https://leetcode.com/explore/learn/card/recursion-ii/472/backtracking/) 70 | 6. [Converting recursion to iteration. Tail recursion](https://leetcode.com/explore/learn/card/recursion-ii/503/recursion-to-iteration/) 71 | 72 | ## 🐍🐍🐍🐍🐍🐍🐍 week4 73 | 74 | #### Lecture 7 75 | 76 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/justalge/another_python_totorial/blob/main/week4/Lecture_7_files_packages_exceptions.ipynb) 77 | 78 | 1. [Reading and writing files. The pickle module](https://www.python-course.eu/python3_file_management.php) 79 | 2. 
[The module import system in Python](https://www.python-course.eu/python3_modules_and_modular_programming.php) 80 | 3. [Packages in Python](https://www.python-course.eu/python3_packages.php) 81 | 4. Relative import errors: look [here](https://napuzba.com/a/import-error-relative-no-parent) and [here](https://iq-inc.com/importerror-attempted-relative-import/) 82 | 5. [Exceptions](https://www.python-course.eu/python3_exception_handling.php) 83 | 84 | #### Lecture 8 85 | 86 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/justalge/another_python_totorial/blob/main/week4/Lecture_8_OOP.ipynb) 87 | 88 | 1. [OOP. Core ideas. Why?](https://www.python-course.eu/python3_object_oriented_programming.php) 89 | 2. [Classes, objects (class instances), attributes, methods](https://www.python-course.eu/python3_class_and_instance_attributes.php) 90 | 3. [The meanings of underscores in Python](https://dbader.org/blog/meaning-of-underscores-in-python) 91 | 4. [Classes as decorators](https://www.python-course.eu/python3_decorators.php) 92 | 5. [Object properties. Getters and setters](https://www.python-course.eu/python3_properties.php) 93 | 94 | ## 🐍🐍🐍🐍🐍🐍🐍 week5 95 | 96 | #### Lecture 9 97 | 98 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/justalge/another_python_tutorial/blob/main/week5/Lecture_9_descriptors_inheritance_ipynb.ipynb) 99 | 100 | 1. [Descriptors](https://www.python-course.eu/python3_descriptors.php) 101 | 2. [Inheritance](https://www.python-course.eu/python3_inheritance.php) 102 | 3. [Inheritance example](https://www.python-course.eu/python3_inheritance_example.php) 103 | 4. [Multiple inheritance](https://www.python-course.eu/python3_multiple_inheritance.php) 104 | 5. [Multiple inheritance example](https://www.python-course.eu/python3_multiple_inheritance_example.php) 105 | 6. 
[Magic methods and operator overloading](https://www.python-course.eu/python3_magic_methods.php) 106 | 107 | 108 | #### Lecture 10 109 | 110 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/justalge/another_python_tutorial/blob/main/week5/Lecture_10_callable_context_managers_slots_ABC.ipynb) 111 | 112 | 1. [Callable classes](https://www.python-course.eu/callable_instances.php) 113 | 2. [Example: a class for polynomial functions](https://www.python-course.eu/polynomial_class_in_python.php) 114 | 3. [Context managers](https://www.geeksforgeeks.org/context-manager-in-python) 115 | 4. [Slots](https://www.python-course.eu/python3_slots.php) 116 | 5. [Abstract base classes (ABC)](https://www.python-course.eu/python3_abstract_classes.php) 117 | 118 | ## 🐍🐍🐍🐍🐍🐍🐍 week6 119 | 120 | #### Lecture 11 121 | 122 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/justalge/another_python_tutorial/blob/main/week6/Lecture_11_metaclasses_iterators.ipynb) 123 | 124 | 1. [The relationship between class and type](https://www.python-course.eu/python3_classes_and_type.php) 125 | 2. [Metaclasses. Motivation](https://www.python-course.eu/python3_road_to_metaclasses.php) 126 | 3. [Metaclasses](https://www.python-course.eu/python3_metaclasses.php) 127 | 4. [Metaclasses. A usage example](https://www.python-course.eu/python3_count_function_calls.php) 128 | 5. [The difference between an iterator and an iterable](https://www.python-course.eu/python3_iterable_iterator.php) 129 | 6. [Iterators and generators](https://www.python-course.eu/python3_generators.php) 130 | 131 | #### Lecture 12 132 | 133 | 1. [collections](https://realpython.com/python-collections-module/) 134 | 2. [itertools](https://realpython.com/python-itertools/) 135 | 3. 
[requests](https://realpython.com/python-requests/) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/justalge/another_python_tutorial/blob/main/week6/Lecture_12_requests.ipynb) 136 | 4. [re](https://www.python-course.eu/python3_re.php) and [here](https://www.python-course.eu/python3_re_advanced.php) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/justalge/another_python_tutorial/blob/main/week6/Lecture_12_re.ipynb) 137 | -------------------------------------------------------------------------------- /week1/Lecture_2_if_for_while_float_formatted_output.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Lecture 2. if, for, while, float, formatted output.ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [], 9 | "authorship_tag": "ABX9TyNoFsCrRtG/ZXP6iucbrZ85", 10 | "include_colab_link": true 11 | }, 12 | "kernelspec": { 13 | "name": "python3", 14 | "display_name": "Python 3" 15 | }, 16 | "language_info": { 17 | "name": "python" 18 | } 19 | }, 20 | "cells": [ 21 | { 22 | "cell_type": "markdown", 23 | "metadata": { 24 | "id": "view-in-github", 25 | "colab_type": "text" 26 | }, 27 | "source": [ 28 | "\"Open" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": { 34 | "id": "aFLmdm501ukO" 35 | }, 36 | "source": [ 37 | "# Conditionals" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "metadata": { 43 | "id": "1JiSu2-311BW" 44 | }, 45 | "source": [ 46 | "if condition:\n", 47 | " pass\n", 48 | " statement\n", 49 | " statement\n", 50 | " # further statements, if necessary" 51 | ], 52 | "execution_count": null, 53 | "outputs": [] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "metadata": { 58 | "colab": { 59 | "base_uri": "https://localhost:8080/" 60 | }, 61 | "id": "jDfeCFjV11HE", 62 
| "outputId": "f9a53498-3d63-4f83-e1e5-edf35596e136" 63 | }, 64 | "source": [ 65 | "'''\n", 66 | "The following example asks users for their nationality. The indented block will\n", 67 | "only be executed if the nationality is French or Italian. If the user enters\n", 68 | "another nationality, nothing happens.\n", 69 | "'''\n", 70 | "\n", 71 | "person = input(\"Nationality? \")\n", 72 | "if person == \"french\" or person == \"French\":\n", 73 | " print(\"Préférez-vous parler français?\")\n", 74 | "if person == \"italian\" or person == \"Italian\":\n", 75 | " print(\"Preferisci parlare italiano?\")" 76 | ], 77 | "execution_count": null, 78 | "outputs": [ 79 | { 80 | "output_type": "stream", 81 | "name": "stdout", 82 | "text": [ 83 | "Nationality? french\n", 84 | "Préférez-vous parler français?\n" 85 | ] 86 | } 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "metadata": { 92 | "colab": { 93 | "base_uri": "https://localhost:8080/" 94 | }, 95 | "id": "SlGQTpjI3pvR", 96 | "outputId": "1e045ebb-3af3-4b38-a64b-8b131918b075" 97 | }, 98 | "source": [ 99 | "'''\n", 100 | "This little script has a drawback. Suppose that the user enters \"french\" as\n", 101 | "a nationality. In this case, the block of the first \"if\" will be executed.\n", 102 | "Afterwards, the program checks if the second \"if\" can be executed as well.\n", 103 | "This doesn't make sense, because we know that the input \"french\" does not match\n", 104 | "the condition \"italian\" or \"Italian\".\n", 105 | "'''\n", 106 | "\n", 107 | "person = input(\"Nationality? 
\")\n", 108 | "if person == \"french\" or person == \"French\":\n", 109 | " print(\"Préférez-vous parler français?\")\n", 110 | "elif person == \"italian\" or person == \"Italian\":\n", 111 | " print(\"Preferisci parlare italiano?\")\n", 112 | "else:\n", 113 | " print(\"You are neither French nor Italian.\")\n", 114 | " print(\"So, let us speak English!\")\n", 115 | "\n", 116 | "\n", 117 | "# As in our example, most if statements also have \"elif\" and \"else\" branches.\n", 118 | "# This means, there may be more than one \"elif\" branches, but only one \"else\"\n", 119 | "# branch. The \"else\" branch must be at the end of the if statement. Other \"elif\"\n", 120 | "# branches can not be attached after an 'else'. 'if' statements don't need either\n", 121 | "# 'else' nor 'elif' statements." 122 | ], 123 | "execution_count": null, 124 | "outputs": [ 125 | { 126 | "output_type": "stream", 127 | "name": "stdout", 128 | "text": [ 129 | "Nationality? french\n", 130 | "Préférez-vous parler français?\n" 131 | ] 132 | } 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "metadata": { 138 | "id": "BriU6Z1C4lsw" 139 | }, 140 | "source": [ 141 | "# general form:\n", 142 | "\n", 143 | "if condition_1:\n", 144 | " statement_block_1\n", 145 | "elif condition_2:\n", 146 | " statement_block_2\n", 147 | "\n", 148 | "...\n", 149 | "\n", 150 | "elif another_condition: \n", 151 | " another_statement_block\n", 152 | "else:\n", 153 | " else_block" 154 | ], 155 | "execution_count": null, 156 | "outputs": [] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": { 161 | "id": "-IPYWCNg5LaY" 162 | }, 163 | "source": [ 164 | "#### Examples" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "metadata": { 170 | "colab": { 171 | "base_uri": "https://localhost:8080/" 172 | }, 173 | "id": "PQl7NCAl43d_", 174 | "outputId": "25171e28-4da9-4a4a-ddea-c5b3d10478cf" 175 | }, 176 | "source": [ 177 | "'''\n", 178 | "A one-year-old dog is roughly equivalent to a 14-year-old 
human being\n", 179 | "A dog that is two years old corresponds in development to a 22 year old person.\n", 180 | "Each additional dog year is equivalent to five human years.\n", 181 | "\n", 182 | "The following example program in Python requires the age of the dog as the input\n", 183 | "and calculates the age in human years according to the above rule. 'input' is\n", 184 | "a statement, where the program flow stops and waits for user input printing out\n", 185 | "the message in brackets:\n", 186 | "'''\n", 187 | "\n", 188 | "age = int(input(\"Age of the dog: \"))\n", 189 | "print()\n", 190 | "if age < 0:\n", 191 | " print(\"This cannot be true!\")\n", 192 | "elif age == 0:\n", 193 | " print(\"This corresponds to 0 human years!\")\n", 194 | "elif age == 1:\n", 195 | " print(\"Roughly 14 years!\")\n", 196 | "elif age == 2:\n", 197 | " print(\"Approximately 22 years!\")\n", 198 | "else:\n", 199 | " human = 22 + (age -2) * 5\n", 200 | " print(\"Corresponds to \" + str(human) + \" human years!\")" 201 | ], 202 | "execution_count": null, 203 | "outputs": [ 204 | { 205 | "output_type": "stream", 206 | "name": "stdout", 207 | "text": [ 208 | "Age of the dog: 3\n", 209 | "\n", 210 | "Corresponds to 27 human years!\n" 211 | ] 212 | } 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "metadata": { 218 | "id": "WhKZTK2t5tk3", 219 | "colab": { 220 | "base_uri": "https://localhost:8080/" 221 | }, 222 | "outputId": "32f8b4e7-1df2-42c5-c1d2-fe1b24e6774a" 223 | }, 224 | "source": [ 225 | "'''\n", 226 | "We will read in three float numbers in the following program\n", 227 | "and will print out the largest value:\n", 228 | "'''\n", 229 | "\n", 230 | "x = float(input(\"1st Number: \"))\n", 231 | "y = float(input(\"2nd Number: \"))\n", 232 | "z = float(input(\"3rd Number: \"))\n", 233 | "\n", 234 | "if x > y and x > z:\n", 235 | " maximum = x\n", 236 | "elif y > x and y > z:\n", 237 | " maximum = y\n", 238 | "else:\n", 239 | " maximum = z\n", 240 | "\n", 241 | "print(\"The 
maximal value is: \" + str(maximum))" 242 | ], 243 | "execution_count": null, 244 | "outputs": [ 245 | { 246 | "output_type": "stream", 247 | "name": "stdout", 248 | "text": [ 249 | "1st Number: 5\n", 250 | "2nd Number: 7\n", 251 | "3rd Number: 9\n", 252 | "The maximal value is: 9.0\n" 253 | ] 254 | } 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "metadata": { 260 | "id": "_AGCqt2V8PFc" 261 | }, 262 | "source": [ 263 | "7\n", 264 | "'''\n", 265 | "There are other ways to write the conditions like the following one:\n", 266 | "'''\n", 267 | "\n", 268 | "x = float(input(\"1st Number: \"))\n", 269 | "y = float(input(\"2nd Number: \"))\n", 270 | "z = float(input(\"3rd Number: \"))\n", 271 | "\n", 272 | "if x > y:\n", 273 | "    if x > z:\n", 274 | "        maximum = x\n", 275 | "    else:\n", 276 | "        maximum = z\n", 277 | "else:\n", 278 | "    if y > z:\n", 279 | "        maximum = y\n", 280 | "    else:\n", 281 | "        maximum = z\n", 282 | "\n", 283 | "print(\"The maximal value is: \" + str(maximum))" 284 | ], 285 | "execution_count": null, 286 | "outputs": [] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "metadata": { 291 | "id": "uVOyyQOj8WR4" 292 | }, 293 | "source": [ 294 | "'''\n", 295 | "Another way to find the maximum can be seen in the following example. 
We are\n", 296 | "using the built-in function max, which calculates the maximum of a list or\n", 297 | "a tuple of numerical values:\n", 298 | "'''\n", 299 | "\n", 300 | "x = float(input(\"1st Number: \"))\n", 301 | "y = float(input(\"2nd Number: \"))\n", 302 | "z = float(input(\"3rd Number: \"))\n", 303 | "\n", 304 | "maximum = max((x,y,z))\n", 305 | "\n", 306 | "print(\"The maximal value is: \" + str(maximum))" 307 | ], 308 | "execution_count": null, 309 | "outputs": [] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": { 314 | "id": "1yTmssrNBvn6" 315 | }, 316 | "source": [ 317 | "#### Ternary if statement" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "metadata": { 323 | "colab": { 324 | "base_uri": "https://localhost:8080/" 325 | }, 326 | "id": "3rOE5us5BvOQ", 327 | "outputId": "93c0f0fb-51a1-48e5-b8fb-064332fbf474" 328 | }, 329 | "source": [ 330 | "'''\n", 331 | "`The maximum speed is 50 if we are within the city, otherwise it is 100.`\n", 332 | "This can be translated into ternary Python statements\n", 333 | "'''\n", 334 | "\n", 335 | "inside_city_limits = True\n", 336 | "maximum_speed = 50 if inside_city_limits else 100\n", 337 | "print(maximum_speed)" 338 | ], 339 | "execution_count": null, 340 | "outputs": [ 341 | { 342 | "output_type": "stream", 343 | "name": "stdout", 344 | "text": [ 345 | "50\n" 346 | ] 347 | } 348 | ] 349 | }, 350 | { 351 | "cell_type": "markdown", 352 | "metadata": { 353 | "id": "kLv0oPpQChqU" 354 | }, 355 | "source": [ 356 | "# While loops" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": { 362 | "id": "1_2pWe5FGON3" 363 | }, 364 | "source": [ 365 | "Most loops contain a counter or more generally, variables, which change their values in the course of calculation. These variables have to be initialized before the loop is started. The counter or other variables, which can be altered in the body of the loop, are contained in the condition. 
Before the body of the loop is executed, the condition is evaluated. If it evaluates to False, the while loop is finished. In other words, the program flow will continue with the first statement after the while statement, i.e. on the same indentation level as the while loop. If the condition evaluates to True, the body (the indented block below the line with \"while\") gets executed. After the body is finished, the condition will be evaluated again. The body of the loop will be executed as long as the condition yields True." 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "metadata": { 371 | "colab": { 372 | "base_uri": "https://localhost:8080/" 373 | }, 374 | "id": "PukTYSV7GNwq", 375 | "outputId": "ed8128a2-e91e-4a23-8ad4-a74d5596582a" 376 | }, 377 | "source": [ 378 | "# The following small script calculates the sum of the numbers from 1 to 100:\n", 379 | "\n", 380 | "n = 100\n", 381 | "\n", 382 | "total_sum = 0\n", 383 | "counter = 1\n", 384 | "while counter <= n:\n", 385 | "    total_sum += counter\n", 386 | "    counter += 1\n", 387 | "\n", 388 | "print(\"Sum of 1 until \" + str(n) + \" results in \" + str(total_sum))" 389 | ], 390 | "execution_count": null, 391 | "outputs": [ 392 | { 393 | "output_type": "stream", 394 | "name": "stdout", 395 | "text": [ 396 | "Sum of 1 until 100 results in 5050\n" 397 | ] 398 | } 399 | ] 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "metadata": { 404 | "id": "lC0HDiOvGdse" 405 | }, 406 | "source": [ 407 | "#### The `else` part" 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "metadata": { 413 | "id": "VSvrV9DSG0nI" 414 | }, 415 | "source": [ 416 | "The statements in the `else` part are executed when the condition is no longer fulfilled." 
417 | ] 418 | }, 419 | { 420 | "cell_type": "code", 421 | "metadata": { 422 | "id": "2_dM3Esd8e2R" 423 | }, 424 | "source": [ 425 | "while condition:\n", 426 | "    statement_1\n", 427 | "    ...\n", 428 | "    statement_n\n", 429 | "else:\n", 430 | "    statement_1\n", 431 | "    ...\n", 432 | "    statement_n" 433 | ], 434 | "execution_count": null, 435 | "outputs": [] 436 | }, 437 | { 438 | "cell_type": "markdown", 439 | "metadata": { 440 | "id": "xRPDQ3SKG_Re" 441 | }, 442 | "source": [ 443 | "But what is the benefit? To see it, we first need to understand another language construct: the **break** statement." 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": { 449 | "id": "XErrP1SiHRhr" 450 | }, 451 | "source": [ 452 | "#### Premature Termination of a while Loop. The break statement" 453 | ] 454 | }, 455 | { 456 | "cell_type": "markdown", 457 | "metadata": { 458 | "id": "QAkZE0JpHo72" 459 | }, 460 | "source": [ 461 | "With the help of a break statement, a while loop can be left prematurely: as soon as the control flow of the program reaches a break inside a while loop (or any other loop), the loop is left immediately.\n", 462 | "\n", 463 | "Here comes the crucial point: **If a loop is left by break, the else part is not executed**" 464 | ] 465 | }, 466 | { 467 | "cell_type": "code", 468 | "metadata": { 469 | "colab": { 470 | "base_uri": "https://localhost:8080/" 471 | }, 472 | "id": "l-boUiHNHBBW", 473 | "outputId": "aa3561cd-e0a1-49fc-9279-9c06f8e1d940" 474 | }, 475 | "source": [ 476 | "'''\n", 477 | "This behaviour will be illustrated in the following example, a little\n", 478 | "number-guessing game. A human player has to guess a number in the range 1 to n. The\n", 479 | "player inputs a guess. The program informs the player whether this number is\n", 480 | "larger than, smaller than or equal to the secret number, i.e. the number which the program\n", 481 | "has randomly created. 
If the player wants to give up, he or she can input a 0\n", 482 | "or a negative number. Hint: The program needs to create a random number. Therefore\n", 483 | "it's necessary to import the module \"random\".\n", 484 | "'''\n", 485 | "\n", 486 | "import random\n", 487 | "upper_bound = 20\n", 488 | "lower_bound = 1\n", 489 | "to_be_guessed = random.randint(lower_bound, upper_bound)\n", 490 | "guess = 0\n", 491 | "while guess != to_be_guessed:\n", 492 | "    guess = int(input(\"New number: \"))\n", 493 | "    if guess <= 0: # giving up (0 or a negative number)\n", 494 | "        print(\"Sorry that you're giving up!\")\n", 495 | "        break # break out of a loop, don't execute \"else\"\n", 496 | "    if guess < lower_bound or guess > upper_bound:\n", 497 | "        print(\"guess not within boundaries!\")\n", 498 | "    elif guess > to_be_guessed:\n", 499 | "        upper_bound = guess - 1\n", 500 | "        print(\"Number too large\")\n", 501 | "    elif guess < to_be_guessed:\n", 502 | "        lower_bound = guess + 1\n", 503 | "        print(\"Number too small\")\n", 504 | "else:\n", 505 | "    print(\"Congratulations. You made it!\")" 506 | ], 507 | "execution_count": null, 508 | "outputs": [ 509 | { 510 | "output_type": "stream", 511 | "name": "stdout", 512 | "text": [ 513 | "New number: 5\n", 514 | "Number too small\n", 515 | "New number: 14\n", 516 | "Number too small\n", 517 | "New number: 18\n", 518 | "Congratulations. 
You made it!\n" 519 | ] 520 | } 521 | ] 522 | }, 523 | { 524 | "cell_type": "markdown", 525 | "metadata": { 526 | "id": "gkFl4gEDTCwl" 527 | }, 528 | "source": [ 529 | "# For loops" 530 | ] 531 | }, 532 | { 533 | "cell_type": "markdown", 534 | "metadata": { 535 | "id": "FaTl8lMmVDE-" 536 | }, 537 | "source": [ 538 | "A Python `for` loop always iterates over the items of an iterable, such as a list or a string.\n", 539 | "\n", 540 | "General syntax: `for <variable> in <sequence>: <statements> else: <statements>`" 541 | ] 542 | }, 543 | { 544 | "cell_type": "code", 545 | "metadata": { 546 | "colab": { 547 | "base_uri": "https://localhost:8080/" 548 | }, 549 | "id": "Yc6zQs1rIRyY", 550 | "outputId": "48c3c72a-8bc1-486b-926f-ffada9d24a6c" 551 | }, 552 | "source": [ 553 | "# Example of a simple for loop in Python:\n", 554 | "\n", 555 | "languages = [\"C\", \"C++\", \"Perl\", \"Python\"]\n", 556 | "for language in languages:\n", 557 | "    print(language)" 558 | ], 559 | "execution_count": null, 560 | "outputs": [ 561 | { 562 | "output_type": "stream", 563 | "name": "stdout", 564 | "text": [ 565 | "C\n", 566 | "C++\n", 567 | "Perl\n", 568 | "Python\n" 569 | ] 570 | } 571 | ] 572 | }, 573 | { 574 | "cell_type": "markdown", 575 | "metadata": { 576 | "id": "sMbmJ1TdXjMB" 577 | }, 578 | "source": [ 579 | "#### `else`" 580 | ] 581 | }, 582 | { 583 | "cell_type": "code", 584 | "metadata": { 585 | "colab": { 586 | "base_uri": "https://localhost:8080/" 587 | }, 588 | "id": "7buzhcY0VjZ0", 589 | "outputId": "bd6311d7-7ff8-4022-89d8-f19b1e5956a0" 590 | }, 591 | "source": [ 592 | "edibles = [\"bacon\", \"spam\", \"eggs\", \"nuts\"]\n", 593 | "for food in edibles:\n", 594 | "    if food == \"spam\":\n", 595 | "        print(\"No more spam please!\")\n", 596 | "        break\n", 597 | "    print(\"Great, delicious \" + food)\n", 598 | "else:\n", 599 | "    print(\"I am so glad: No spam!\")\n", 600 | "print(\"Finally, I finished stuffing myself\")" 601 | ], 602 | "execution_count": null, 603 | "outputs": [ 604 | { 605 | "output_type": "stream", 606 | "name": "stdout", 607 | 
"text": [ 608 | "Great, delicious bacon\n", 609 | "No more spam please!\n", 610 | "Finally, I finished stuffing myself\n" 611 | ] 612 | } 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": { 618 | "id": "bsuL97LXXxYX" 619 | }, 620 | "source": [ 621 | "`Continue` interrupt current iteration:" 622 | ] 623 | }, 624 | { 625 | "cell_type": "code", 626 | "metadata": { 627 | "colab": { 628 | "base_uri": "https://localhost:8080/" 629 | }, 630 | "id": "7GVJJqKfXiv_", 631 | "outputId": "31ca1288-3662-4391-b437-9e72aa7df215" 632 | }, 633 | "source": [ 634 | "edibles = [\"bacon\", \"spam\", \"eggs\",\"nuts\"]\n", 635 | "for food in edibles:\n", 636 | " if food == \"spam\":\n", 637 | " print(\"No more spam please!\")\n", 638 | " continue\n", 639 | " print(\"Great, delicious \" + food)\n", 640 | "\n", 641 | "print(\"Finally, I finished stuffing myself\")" 642 | ], 643 | "execution_count": null, 644 | "outputs": [ 645 | { 646 | "output_type": "stream", 647 | "name": "stdout", 648 | "text": [ 649 | "Great, delicious bacon\n", 650 | "No more spam please!\n", 651 | "Great, delicious eggs\n", 652 | "Great, delicious nuts\n", 653 | "Finally, I finished stuffing myself\n" 654 | ] 655 | } 656 | ] 657 | }, 658 | { 659 | "cell_type": "markdown", 660 | "metadata": { 661 | "id": "HcdV6_MiYEex" 662 | }, 663 | "source": [ 664 | "#### The `range` function:" 665 | ] 666 | }, 667 | { 668 | "cell_type": "markdown", 669 | "metadata": { 670 | "id": "K2AQuKBcbNfu" 671 | }, 672 | "source": [ 673 | "The built-in function `range()` is the right function to iterate over a sequence of numbers. It generates an iterator of arithmetic progressions - an object which is capable of producing the numbers of arithmetic progression. 
Example:" 674 | ] 675 | }, 676 | { 677 | "cell_type": "code", 678 | "metadata": { 679 | "colab": { 680 | "base_uri": "https://localhost:8080/" 681 | }, 682 | "id": "D1Fl_1JeX9SZ", 683 | "outputId": "f12a833d-8508-4b32-bea0-e754e453c814" 684 | }, 685 | "source": [ 686 | "range(5)" 687 | ], 688 | "execution_count": null, 689 | "outputs": [ 690 | { 691 | "output_type": "execute_result", 692 | "data": { 693 | "text/plain": [ 694 | "range(0, 5)" 695 | ] 696 | }, 697 | "metadata": {}, 698 | "execution_count": 19 699 | } 700 | ] 701 | }, 702 | { 703 | "cell_type": "markdown", 704 | "metadata": { 705 | "id": "oK-1ILRrbkNy" 706 | }, 707 | "source": [ 708 | "You can use it in a `for` loop:" 709 | ] 710 | }, 711 | { 712 | "cell_type": "code", 713 | "metadata": { 714 | "colab": { 715 | "base_uri": "https://localhost:8080/" 716 | }, 717 | "id": "djQuRj65bWAn", 718 | "outputId": "35c51493-6553-4d03-d250-86f8835d4f32" 719 | }, 720 | "source": [ 721 | "for i in range(5):\n", 722 | "    print(i)" 723 | ], 724 | "execution_count": null, 725 | "outputs": [ 726 | { 727 | "output_type": "stream", 728 | "name": "stdout", 729 | "text": [ 730 | "0\n", 731 | "1\n", 732 | "2\n", 733 | "3\n", 734 | "4\n" 735 | ] 736 | } 737 | ] 738 | }, 739 | { 740 | "cell_type": "markdown", 741 | "metadata": { 742 | "id": "gniXTIB-cBnk" 743 | }, 744 | "source": [ 745 | "To get all the numbers in a single container, convert the range object to a list:" 746 | ] 747 | }, 748 | { 749 | "cell_type": "code", 750 | "metadata": { 751 | "colab": { 752 | "base_uri": "https://localhost:8080/" 753 | }, 754 | "id": "pn74EPaXcJ8I", 755 | "outputId": "cc20c063-64d0-4703-9902-a13015cf2fe0" 756 | }, 757 | "source": [ 758 | "list(range(5))" 759 | ], 760 | "execution_count": null, 761 | "outputs": [ 762 | { 763 | "output_type": "execute_result", 764 | "data": { 765 | "text/plain": [ 766 | "[0, 1, 2, 3, 4]" 767 | ] 768 | }, 769 | "metadata": {}, 770 | "execution_count": 21 771 | } 772 | ] 773 | }, 774 | { 775 | "cell_type": "markdown", 
776 | "metadata": { 777 | "id": "Xu4KqQsYb34g" 778 | }, 779 | "source": [ 780 | "The general form of the `range` function takes a start, a stop and a step:" 781 | ] 782 | }, 783 | { 784 | "cell_type": "code", 785 | "metadata": { 786 | "colab": { 787 | "base_uri": "https://localhost:8080/" 788 | }, 789 | "id": "-RwE4rcSbsMz", 790 | "outputId": "0ecca4fe-bacd-4f91-f9f6-9b8d0bd3e42d" 791 | }, 792 | "source": [ 793 | "range(4, 50, 5)" 794 | ], 795 | "execution_count": null, 796 | "outputs": [ 797 | { 798 | "output_type": "execute_result", 799 | "data": { 800 | "text/plain": [ 801 | "range(4, 50, 5)" 802 | ] 803 | }, 804 | "metadata": {}, 805 | "execution_count": 23 806 | } 807 | ] 808 | }, 809 | { 810 | "cell_type": "code", 811 | "metadata": { 812 | "colab": { 813 | "base_uri": "https://localhost:8080/" 814 | }, 815 | "id": "Q4lYOCpQcT2M", 816 | "outputId": "9783c1f3-9459-48be-e73d-3f1cd27da37f" 817 | }, 818 | "source": [ 819 | "# Let's look at all the numbers:\n", 820 | "\n", 821 | "list(range(4, 50, 5))" 822 | ], 823 | "execution_count": null, 824 | "outputs": [ 825 | { 826 | "output_type": "execute_result", 827 | "data": { 828 | "text/plain": [ 829 | "[4, 9, 14, 19, 24, 29, 34, 39, 44, 49]" 830 | ] 831 | }, 832 | "metadata": {}, 833 | "execution_count": 25 834 | } 835 | ] 836 | }, 837 | { 838 | "cell_type": "markdown", 839 | "metadata": { 840 | "id": "zdTDvOyUfmP6" 841 | }, 842 | "source": [ 843 | "#### Examples:" 844 | ] 845 | }, 846 | { 847 | "cell_type": "code", 848 | "metadata": { 849 | "colab": { 850 | "base_uri": "https://localhost:8080/" 851 | }, 852 | "id": "QOBssTvZcYEk", 853 | "outputId": "8522ec33-c1fd-4eb4-f716-87c0a868661c" 854 | }, 855 | "source": [ 856 | "# Calculation of Pythagorean triples. The following program prints all\n", 857 | "# Pythagorean triples (a, b, c) whose legs a <= b are less than a maximal number:\n", 858 | "\n", 859 | "from math import sqrt\n", 860 | "n = int(input(\"Maximal Number? 
\"))\n", 861 | "for a in range(1, n+1):\n", 862 | " for b in range(a, n):\n", 863 | " c_square = a**2 + b**2\n", 864 | " c = int(sqrt(c_square))\n", 865 | " if ((c_square - c**2) == 0):\n", 866 | " print(a, b, c)" 867 | ], 868 | "execution_count": null, 869 | "outputs": [ 870 | { 871 | "output_type": "stream", 872 | "name": "stdout", 873 | "text": [ 874 | "Maximal Number? 6\n", 875 | "3 4 5\n" 876 | ] 877 | } 878 | ] 879 | }, 880 | { 881 | "cell_type": "markdown", 882 | "metadata": { 883 | "id": "Q0JVYya3gIOD" 884 | }, 885 | "source": [ 886 | "#### Iterating over containers with `for` loop:" 887 | ] 888 | }, 889 | { 890 | "cell_type": "code", 891 | "metadata": { 892 | "colab": { 893 | "base_uri": "https://localhost:8080/" 894 | }, 895 | "id": "FybbMBWIf93S", 896 | "outputId": "047207db-69a5-44b0-935c-4ce1c8240825" 897 | }, 898 | "source": [ 899 | "fibonacci = [0, 1, 1, 2, 3, 5, 8, 13, 21]\n", 900 | "for i in range(len(fibonacci)):\n", 901 | " print(i,fibonacci[i])\n", 902 | "print()" 903 | ], 904 | "execution_count": null, 905 | "outputs": [ 906 | { 907 | "output_type": "stream", 908 | "name": "stdout", 909 | "text": [ 910 | "0 0\n", 911 | "1 1\n", 912 | "2 1\n", 913 | "3 2\n", 914 | "4 3\n", 915 | "5 5\n", 916 | "6 8\n", 917 | "7 13\n", 918 | "8 21\n", 919 | "\n" 920 | ] 921 | } 922 | ] 923 | }, 924 | { 925 | "cell_type": "code", 926 | "metadata": { 927 | "colab": { 928 | "base_uri": "https://localhost:8080/" 929 | }, 930 | "id": "8kPqchxjgbCV", 931 | "outputId": "386cc2e8-454f-4cd6-d36f-d1d573372b1a" 932 | }, 933 | "source": [ 934 | "# PREFERRED:\n", 935 | "\n", 936 | "fibonacci = [0, 1, 1, 2, 3, 5, 8, 13, 21]\n", 937 | "for ix, el in enumerate(fibonacci):\n", 938 | " print(ix, el)\n", 939 | "print()" 940 | ], 941 | "execution_count": null, 942 | "outputs": [ 943 | { 944 | "output_type": "stream", 945 | "name": "stdout", 946 | "text": [ 947 | "0 0\n", 948 | "1 1\n", 949 | "2 1\n", 950 | "3 2\n", 951 | "4 3\n", 952 | "5 5\n", 953 | "6 8\n", 954 | "7 13\n", 955 | 
"8 21\n", 956 | "\n" 957 | ] 958 | } 959 | ] 960 | }, 961 | { 962 | "cell_type": "markdown", 963 | "metadata": { 964 | "id": "1-nnKR_og8uq" 965 | }, 966 | "source": [ 967 | "#### List iteration with Side Effects" 968 | ] 969 | }, 970 | { 971 | "cell_type": "markdown", 972 | "metadata": { 973 | "id": "wnkEEBmChAq3" 974 | }, 975 | "source": [ 976 | "If you loop over a list, it's best \n", 977 | "to avoid changing the list in the loop body.\n", 978 | "\n", 979 | "Take a look at the following example:" 980 | ] 981 | }, 982 | { 983 | "cell_type": "code", 984 | "metadata": { 985 | "colab": { 986 | "base_uri": "https://localhost:8080/" 987 | }, 988 | "id": "IQb9gTzmgqSK", 989 | "outputId": "cc0ba65b-c898-40b0-e7b1-c6253646eda8" 990 | }, 991 | "source": [ 992 | "colours = [\"red\"]\n", 993 | "for i in colours:\n", 994 | " if i == \"red\":\n", 995 | " colours = [\"black\"] + colours\n", 996 | " if i == \"black\":\n", 997 | " colours += [\"white\"]\n", 998 | "print(colours)" 999 | ], 1000 | "execution_count": null, 1001 | "outputs": [ 1002 | { 1003 | "output_type": "stream", 1004 | "name": "stdout", 1005 | "text": [ 1006 | "['black', 'red']\n" 1007 | ] 1008 | } 1009 | ] 1010 | }, 1011 | { 1012 | "cell_type": "markdown", 1013 | "metadata": { 1014 | "id": "RNT0UpfmhQqt" 1015 | }, 1016 | "source": [ 1017 | "To avoid these side effects, it's best to work on a copy by using the slicing operator, as can be seen in the next example:" 1018 | ] 1019 | }, 1020 | { 1021 | "cell_type": "code", 1022 | "metadata": { 1023 | "colab": { 1024 | "base_uri": "https://localhost:8080/" 1025 | }, 1026 | "id": "JU7UHKjLhNc3", 1027 | "outputId": "c5acd2fa-2d50-42ae-be3b-725679b4f50b" 1028 | }, 1029 | "source": [ 1030 | "colours = [\"red\"]\n", 1031 | "for i in colours[:]:\n", 1032 | " if i == \"red\":\n", 1033 | " colours += [\"black\"]\n", 1034 | " if i == \"black\":\n", 1035 | " colours += [\"white\"]\n", 1036 | "print(colours)" 1037 | ], 1038 | "execution_count": null, 1039 | "outputs": [ 1040 | 
{ 1041 | "output_type": "stream", 1042 | "name": "stdout", 1043 | "text": [ 1044 | "['red', 'black']\n" 1045 | ] 1046 | } 1047 | ] 1048 | }, 1049 | { 1050 | "cell_type": "markdown", 1051 | "metadata": { 1052 | "id": "PPpJ_YuFhjQr" 1053 | }, 1054 | "source": [ 1055 | "# Float" 1056 | ] 1057 | }, 1058 | { 1059 | "cell_type": "markdown", 1060 | "metadata": { 1061 | "id": "yodeZ01Ch9x-" 1062 | }, 1063 | "source": [ 1064 | "When dealing with floats we should avoid direct comparisons:" 1065 | ] 1066 | }, 1067 | { 1068 | "cell_type": "code", 1069 | "metadata": { 1070 | "colab": { 1071 | "base_uri": "https://localhost:8080/" 1072 | }, 1073 | "id": "PLXRmlS6hXD0", 1074 | "outputId": "75759f93-5621-4089-81cd-5699bf548ae4" 1075 | }, 1076 | "source": [ 1077 | "import math\n", 1078 | "\n", 1079 | "math.e**(math.log(5))" 1080 | ], 1081 | "execution_count": null, 1082 | "outputs": [ 1083 | { 1084 | "output_type": "execute_result", 1085 | "data": { 1086 | "text/plain": [ 1087 | "4.999999999999999" 1088 | ] 1089 | }, 1090 | "metadata": {}, 1091 | "execution_count": 31 1092 | } 1093 | ] 1094 | }, 1095 | { 1096 | "cell_type": "markdown", 1097 | "metadata": { 1098 | "id": "GVJLB96YiPv_" 1099 | }, 1100 | "source": [ 1101 | ".. 
and:" 1102 | ] 1103 | }, 1104 | { 1105 | "cell_type": "code", 1106 | "metadata": { 1107 | "colab": { 1108 | "base_uri": "https://localhost:8080/" 1109 | }, 1110 | "id": "r_au6UatiNE9", 1111 | "outputId": "ce3668ed-17c5-45d0-81c8-26cbf279ca2f" 1112 | }, 1113 | "source": [ 1114 | "math.e**(math.log(5)) == 5" 1115 | ], 1116 | "execution_count": null, 1117 | "outputs": [ 1118 | { 1119 | "output_type": "execute_result", 1120 | "data": { 1121 | "text/plain": [ 1122 | "False" 1123 | ] 1124 | }, 1125 | "metadata": {}, 1126 | "execution_count": 32 1127 | } 1128 | ] 1129 | }, 1130 | { 1131 | "cell_type": "markdown", 1132 | "metadata": { 1133 | "id": "NF35CQb5iXOK" 1134 | }, 1135 | "source": [ 1136 | "We should do the following:" 1137 | ] 1138 | }, 1139 | { 1140 | "cell_type": "code", 1141 | "metadata": { 1142 | "colab": { 1143 | "base_uri": "https://localhost:8080/" 1144 | }, 1145 | "id": "3m8lLccpiVnX", 1146 | "outputId": "9d8ce74f-f334-4edb-d6e9-c53cf63cbb39" 1147 | }, 1148 | "source": [ 1149 | "eps = 1e-5 # small epsilon\n", 1150 | "abs(math.e**(math.log(5)) - 5) < eps" 1151 | ], 1152 | "execution_count": null, 1153 | "outputs": [ 1154 | { 1155 | "output_type": "execute_result", 1156 | "data": { 1157 | "text/plain": [ 1158 | "True" 1159 | ] 1160 | }, 1161 | "metadata": {}, 1162 | "execution_count": 33 1163 | } 1164 | ] 1165 | }, 1166 | { 1167 | "cell_type": "markdown", 1168 | "metadata": { 1169 | "id": "IZ6aYIqEinxs" 1170 | }, 1171 | "source": [ 1172 | "# Formatted output:" 1173 | ] 1174 | }, 1175 | { 1176 | "cell_type": "markdown", 1177 | "metadata": { 1178 | "id": "2JvIurkrxtGL" 1179 | }, 1180 | "source": [ 1181 | "#### 1. The Old Way or the non-existing printf and sprintf" 1182 | ] 1183 | }, 1184 | { 1185 | "cell_type": "markdown", 1186 | "metadata": { 1187 | "id": "nGsVIT5lx0E0" 1188 | }, 1189 | "source": [ 1190 | "For the purpose of using printf-like printing in python the modulo operator \"%\" is overloaded by the string class to perform string formatting. 
Therefore, it is often called the string modulo (or sometimes even the modulus) operator, though it has little in common with the actual modulo calculation on numbers. Another term for it is \"string interpolation\", because it interpolates values of various types (like int, float and so on) into a formatted string. In many cases the string created via the string interpolation mechanism is used for outputting values in a special way. But it can also be used, for example, to create the right format to put the data into a database. \n", 1191 | "\n", 1192 | "Since Python 2.6, the string method format has been available and should be used instead of this old-style formatting. String modulo \"%\" is still available in Python 3 and is still widely used, which is why we cover it in detail in this tutorial: you should be able to understand it when you encounter it in existing Python code. It has not been removed from the language, but for new code you should get used to str.format().\n", 1193 | "\n", 1194 | "The following diagram depicts how the string modulo operator works:\n", 1195 | "\n", 1196 | "![](https://www.python-course.eu/images/string_modulo_overview_800w.webp)\n", 1197 | "\n", 1198 | "![](https://www.python-course.eu/images/format_string_value_set_800w.webp)\n", 1199 | "\n", 1200 | "The format string contains placeholders. There are two of those in our example: \"%5d\" and \"%8.2f\". The general syntax for a format placeholder is\n", 1201 | "\n", 1202 | "`%[flags][width][.precision]type`\n", 1203 | "\n", 1204 | "Conversion types:\n", 1205 | "\n", 1206 | "d --- Signed integer decimal.\n", 1207 | "\n", 1208 | "i --- Signed integer decimal.\n", 1209 | "\n", 1210 | "o --- Unsigned octal.\n", 1211 | "\n", 1212 | "u --- Obsolete and equivalent to 'd', i.e. 
signed integer decimal.\n", 1213 | "\n", 1214 | "x --- Unsigned hexadecimal (lowercase).\n", 1215 | "\n", 1216 | "X --- Unsigned hexadecimal (uppercase).\n", 1217 | "\n", 1218 | "e --- Floating point exponential format (lowercase).\n", 1219 | "\n", 1220 | "E --- Floating point exponential format (uppercase).\n", 1221 | "\n", 1222 | "f --- Floating point decimal format.\n", 1223 | "\n", 1224 | "F --- Floating point decimal format.\n", 1225 | "\n", 1226 | "g --- Same as \"e\" if exponent is less than -4 or not less than precision, \"f\" otherwise.\n", 1227 | "\n", 1228 | "G --- Same as \"E\" if exponent is less than -4 or not less than precision, \"F\" otherwise.\n", 1229 | "\n", 1230 | "c --- Single character (accepts integer or single character string).\n", 1231 | "\n", 1232 | "r --- String (converts any python object using repr()).\n", 1233 | "\n", 1234 | "s --- String (converts any python object using str()).\n", 1235 | "\n", 1236 | "% --- No argument is converted, results in a \"%\" character in the result.\n", 1237 | "\n", 1238 | "Examples:" 1239 | ] 1240 | }, 1241 | { 1242 | "cell_type": "code", 1243 | "metadata": { 1244 | "colab": { 1245 | "base_uri": "https://localhost:8080/" 1246 | }, 1247 | "id": "dMwiQC2tul9U", 1248 | "outputId": "41120038-4781-4013-f9cc-8eb46877cf24" 1249 | }, 1250 | "source": [ 1251 | "print(\"%10.3e\"% (356.08977))\n", 1252 | "print(\"%10.3E\"% (356.08977))\n", 1253 | "print(\"%10o\"% (25))\n", 1254 | "print(\"%10.3o\"% (25))\n", 1255 | "print(\"%10.5o\"% (25))\n", 1256 | "print(\"%5x\"% (47))\n", 1257 | "print(\"%5.4x\"% (47))\n", 1258 | "print(\"%5.4X\"% (47))\n", 1259 | "print(\"Only one percentage sign: %% \" % ())" 1260 | ], 1261 | "execution_count": null, 1262 | "outputs": [ 1263 | { 1264 | "output_type": "stream", 1265 | "name": "stdout", 1266 | "text": [ 1267 | " 3.561e+02\n", 1268 | " 3.561E+02\n", 1269 | "        31\n", 1270 | "       031\n", 1271 | "     00031\n", 1272 | "   2f\n", 1273 | " 002f\n", 1274 | " 002F\n", 1275 | "Only one percentage 
sign: % \n" 1276 | ] 1277 | } 1278 | ] 1279 | }, 1280 | { 1281 | "cell_type": "markdown", 1282 | "metadata": { 1283 | "id": "99-fb3p42b9c" 1284 | }, 1285 | "source": [ 1286 | "Even though it may look so, the formatting is not part of the print function. If you have a closer look at our examples, you will see that we passed a formatted string to the print function. Or to put it in other words: If the string modulo operator is applied to a string, it returns a string. This string in turn is passed in our examples to the print function." 1287 | ] 1288 | }, 1289 | { 1290 | "cell_type": "markdown", 1291 | "metadata": { 1292 | "id": "uLISDNTW2haY" 1293 | }, 1294 | "source": [ 1295 | "#### 2. The Pythonic Way: The string method \"format\"" 1296 | ] 1297 | }, 1298 | { 1299 | "cell_type": "markdown", 1300 | "metadata": { 1301 | "id": "nj7boB7YehTr" 1302 | }, 1303 | "source": [ 1304 | "![](https://www.python-course.eu/images/format_method_positional_parameters_800w.webp)\n", 1305 | "\n", 1306 | "![](https://www.python-course.eu/images/format_method_keyword_parameters_800w.webp)\n", 1307 | "\n", 1308 | "A positional parameter of the format method can be accessed by placing the index of the parameter after the opening brace, e.g. {0} accesses the first parameter, {1} the second one and so on. The index inside of the curly braces can be followed by a colon and a format string, which is similar to the notation of the string modulo, which we had discussed in the beginning of the chapter of our tutorial, e.g. {0:5d} If the positional parameters are used in the order in which they are written, the positional argument specifiers inside of the braces can be omitted, so '{} {} {}' corresponds to '{0} {1} {2}'. But they are needed, if you want to access them in different orders: '{2} {1} {0}'. 
If a brace character has to be printed, it has to be escaped by doubling it: {{ and }}.\n", 1309 | "\n", 1310 | "Additional formatting flags (as used with the string modulo operator):\n", 1311 | "\n", 1312 | "\\# --- Used with the o, x or X specifiers, the value is preceded with 0o, 0x or 0X respectively\n", 1313 | "\n", 1314 | "0 --- The conversion result will be zero padded for numeric values\n", 1315 | "\n", 1316 | "\\- --- The converted value is left adjusted\n", 1317 | "\n", 1318 | "(space) --- If no sign (minus sign e.g.) is going to be written, a blank space is inserted before the value\n", 1319 | "\n", 1320 | "\\+ --- A sign character (\"+\" or \"-\") will precede the conversion (overrides a \"space\" flag)\n", 1321 | "\n", 1322 | "Examples:" 1321 | ] 1322 | }, 1323 | { 1324 | "cell_type": "code", 1325 | "metadata": { 1326 | "id": "SRl_mCj_2Lj2", 1327 | "colab": { 1328 | "base_uri": "https://localhost:8080/" 1329 | }, 1330 | "outputId": "b76f7b4e-e553-45b7-deb5-1b7c9508bdab" 1331 | }, 1332 | "source": [ 1333 | "x = 378\n", 1334 | "print(\"The value is {:06d}\".format(x))\n", 1335 | "\n", 1336 | "x = -378\n", 1337 | "print(\"The value is {:06d}\".format(x))\n", 1338 | "\n", 1339 | "x = 9283472410989128\n", 1340 | "print(\"The value is {:,}\".format(x))\n", 1341 | "\n", 1342 | "x = 92837424798234792.2487\n", 1343 | "print(\"The value is {0:6,f}\".format(x))\n", 1344 | "\n", 1345 | "# Unless a minimum field width is defined, the field width will always be the\n", 1346 | "# same size as the data to fill it, so that the alignment option has no meaning\n", 1347 | "# in this case.\n", 1348 | "\n", 1349 | "\n", 1350 | "# Additionally, we can modify the formatting with the sign option, which is only\n", 1351 | "# valid for number types:\n", 1352 | "\n", 1353 | "# `+` - indicates that a sign should be used for both positive as well as\n", 1354 | "# negative numbers\n", 1355 | "# `-` - indicates that a sign should be used only for negative numbers, which is\n", 1356 | "# the default behavior\n", 1357 | "# `space` - indicates that a leading 
space should be used on positive numbers,\n", 1358 | "# and a minus sign on negative numbers" 1359 | ], 1360 | "execution_count": null, 1361 | "outputs": [ 1362 | { 1363 | "output_type": "stream", 1364 | "name": "stdout", 1365 | "text": [ 1366 | "The value is 000378\n", 1367 | "The value is -00378\n", 1368 | "The value is 9,283,472,410,989,128\n", 1369 | "The value is 92,837,424,798,234,800.000000\n" 1370 | ] 1371 | } 1372 | ] 1373 | }, 1374 | { 1375 | "cell_type": "markdown", 1376 | "metadata": { 1377 | "id": "tFdQj4Z7k4l1" 1378 | }, 1379 | "source": [ 1380 | "#### Other string methods for Formatting" 1381 | ] 1382 | }, 1383 | { 1384 | "cell_type": "code", 1385 | "metadata": { 1386 | "colab": { 1387 | "base_uri": "https://localhost:8080/", 1388 | "height": 37 1389 | }, 1390 | "id": "7afub7n8hr0R", 1391 | "outputId": "6caa603a-dc28-49b2-c4b7-cd7e2de5fa8e" 1392 | }, 1393 | "source": [ 1394 | "s = \"Python\"\n", 1395 | "s.center(10)" 1396 | ], 1397 | "execution_count": null, 1398 | "outputs": [ 1399 | { 1400 | "output_type": "execute_result", 1401 | "data": { 1402 | "application/vnd.google.colaboratory.intrinsic+json": { 1403 | "type": "string" 1404 | }, 1405 | "text/plain": [ 1406 | "' Python '" 1407 | ] 1408 | }, 1409 | "metadata": {}, 1410 | "execution_count": 12 1411 | } 1412 | ] 1413 | }, 1414 | { 1415 | "cell_type": "code", 1416 | "metadata": { 1417 | "colab": { 1418 | "base_uri": "https://localhost:8080/", 1419 | "height": 37 1420 | }, 1421 | "id": "pnOuplsMk_nk", 1422 | "outputId": "5bbbc8c9-ffa9-4acd-aad0-20bfd2a69044" 1423 | }, 1424 | "source": [ 1425 | "s = \"Training\"\n", 1426 | "s.ljust(12)" 1427 | ], 1428 | "execution_count": null, 1429 | "outputs": [ 1430 | { 1431 | "output_type": "execute_result", 1432 | "data": { 1433 | "application/vnd.google.colaboratory.intrinsic+json": { 1434 | "type": "string" 1435 | }, 1436 | "text/plain": [ 1437 | "'Training '" 1438 | ] 1439 | }, 1440 | "metadata": {}, 1441 | "execution_count": 13 1442 | } 1443 | ] 1444 | }, 
1445 | { 1446 | "cell_type": "code", 1447 | "metadata": { 1448 | "colab": { 1449 | "base_uri": "https://localhost:8080/", 1450 | "height": 37 1451 | }, 1452 | "id": "rd4zCt8VlLN-", 1453 | "outputId": "10b1d41f-e7e1-4d67-d5a3-2400855b2537" 1454 | }, 1455 | "source": [ 1456 | "s = \"Programming\"\n", 1457 | "s.rjust(15)" 1458 | ], 1459 | "execution_count": null, 1460 | "outputs": [ 1461 | { 1462 | "output_type": "execute_result", 1463 | "data": { 1464 | "application/vnd.google.colaboratory.intrinsic+json": { 1465 | "type": "string" 1466 | }, 1467 | "text/plain": [ 1468 | "' Programming'" 1469 | ] 1470 | }, 1471 | "metadata": {}, 1472 | "execution_count": 14 1473 | } 1474 | ] 1475 | }, 1476 | { 1477 | "cell_type": "code", 1478 | "metadata": { 1479 | "colab": { 1480 | "base_uri": "https://localhost:8080/", 1481 | "height": 37 1482 | }, 1483 | "id": "8DiZQoQ6lOL6", 1484 | "outputId": "e590a777-f6b4-4033-eb96-293053e4be00" 1485 | }, 1486 | "source": [ 1487 | "account_number = \"43447879\"\n", 1488 | "account_number.zfill(12)" 1489 | ], 1490 | "execution_count": null, 1491 | "outputs": [ 1492 | { 1493 | "output_type": "execute_result", 1494 | "data": { 1495 | "application/vnd.google.colaboratory.intrinsic+json": { 1496 | "type": "string" 1497 | }, 1498 | "text/plain": [ 1499 | "'000043447879'" 1500 | ] 1501 | }, 1502 | "metadata": {}, 1503 | "execution_count": 15 1504 | } 1505 | ] 1506 | }, 1507 | { 1508 | "cell_type": "code", 1509 | "metadata": { 1510 | "colab": { 1511 | "base_uri": "https://localhost:8080/", 1512 | "height": 37 1513 | }, 1514 | "id": "BiTqiZdglQ6S", 1515 | "outputId": "2ae88ac5-a026-450e-aaa0-9de8cfdee11d" 1516 | }, 1517 | "source": [ 1518 | "# the same as previous\n", 1519 | "\n", 1520 | "account_number.rjust(12,\"0\")" 1521 | ], 1522 | "execution_count": null, 1523 | "outputs": [ 1524 | { 1525 | "output_type": "execute_result", 1526 | "data": { 1527 | "application/vnd.google.colaboratory.intrinsic+json": { 1528 | "type": "string" 1529 | }, 1530 | 
"text/plain": [ 1531 | "'000043447879'" 1532 | ] 1533 | }, 1534 | "metadata": {}, 1535 | "execution_count": 16 1536 | } 1537 | ] 1538 | }, 1539 | { 1540 | "cell_type": "markdown", 1541 | "metadata": { 1542 | "id": "4WKZJApxlaU-" 1543 | }, 1544 | "source": [ 1545 | "#### f-strings - formatted string literals (since python 3.6)" 1546 | ] 1547 | }, 1548 | { 1549 | "cell_type": "code", 1550 | "metadata": { 1551 | "colab": { 1552 | "base_uri": "https://localhost:8080/", 1553 | "height": 37 1554 | }, 1555 | "id": "dwz6wHemlY5Q", 1556 | "outputId": "f56a8c7c-9bbe-4505-fa8f-280ae96ca7be" 1557 | }, 1558 | "source": [ 1559 | "price = 11.23\n", 1560 | "f\"Price in Euro: {price}\"" 1561 | ], 1562 | "execution_count": null, 1563 | "outputs": [ 1564 | { 1565 | "output_type": "execute_result", 1566 | "data": { 1567 | "application/vnd.google.colaboratory.intrinsic+json": { 1568 | "type": "string" 1569 | }, 1570 | "text/plain": [ 1571 | "'Price in Euro: 11.23'" 1572 | ] 1573 | }, 1574 | "metadata": {}, 1575 | "execution_count": 17 1576 | } 1577 | ] 1578 | }, 1579 | { 1580 | "cell_type": "code", 1581 | "metadata": { 1582 | "colab": { 1583 | "base_uri": "https://localhost:8080/", 1584 | "height": 37 1585 | }, 1586 | "id": "OYk8zsfzlp7W", 1587 | "outputId": "43b02fd8-1c51-4773-fb48-20ff398a2687" 1588 | }, 1589 | "source": [ 1590 | "f\"Price in Swiss Franks: {price * 1.086}\"" 1591 | ], 1592 | "execution_count": null, 1593 | "outputs": [ 1594 | { 1595 | "output_type": "execute_result", 1596 | "data": { 1597 | "application/vnd.google.colaboratory.intrinsic+json": { 1598 | "type": "string" 1599 | }, 1600 | "text/plain": [ 1601 | "'Price in Swiss Franks: 12.195780000000001'" 1602 | ] 1603 | }, 1604 | "metadata": {}, 1605 | "execution_count": 18 1606 | } 1607 | ] 1608 | }, 1609 | { 1610 | "cell_type": "code", 1611 | "metadata": { 1612 | "colab": { 1613 | "base_uri": "https://localhost:8080/", 1614 | "height": 37 1615 | }, 1616 | "id": "a6qzydhIlwTt", 1617 | "outputId": 
"38dc4918-a366-4d52-87bd-22cd6bf26824" 1618 | }, 1619 | "source": [ 1620 | "f\"Price in Swiss Franks: {price * 1.086:5.2f}\"" 1621 | ], 1622 | "execution_count": null, 1623 | "outputs": [ 1624 | { 1625 | "output_type": "execute_result", 1626 | "data": { 1627 | "application/vnd.google.colaboratory.intrinsic+json": { 1628 | "type": "string" 1629 | }, 1630 | "text/plain": [ 1631 | "'Price in Swiss Franks: 12.20'" 1632 | ] 1633 | }, 1634 | "metadata": {}, 1635 | "execution_count": 19 1636 | } 1637 | ] 1638 | }, 1639 | { 1640 | "cell_type": "code", 1641 | "metadata": { 1642 | "id": "OBrqVAqWlyzP" 1643 | }, 1644 | "source": [ 1645 | "" 1646 | ], 1647 | "execution_count": null, 1648 | "outputs": [] 1649 | } 1650 | ] 1651 | } -------------------------------------------------------------------------------- /week6/Lecture_12_re.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Lecture 12. re.ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [], 9 | "authorship_tag": "ABX9TyMFlKI/JE7+E1+XaO5CH0sk", 10 | "include_colab_link": true 11 | }, 12 | "kernelspec": { 13 | "name": "python3", 14 | "display_name": "Python 3" 15 | }, 16 | "language_info": { 17 | "name": "python" 18 | } 19 | }, 20 | "cells": [ 21 | { 22 | "cell_type": "markdown", 23 | "metadata": { 24 | "id": "view-in-github", 25 | "colab_type": "text" 26 | }, 27 | "source": [ 28 | "\"Open" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": { 34 | "id": "9bQhYGHAbolh" 35 | }, 36 | "source": [ 37 | "## re: Regular Expressions\n", 38 | "\n", 39 | "The term \"regular expression\", sometimes also called regex or regexp, has originated in theoretical computer science. In theoretical computer science, they are used to define a language family with certain characteristics, the so-called regular languages. 
A finite state machine (FSM), which accepts the language defined by a regular expression, exists for every regular expression. You can find an implementation of a [Finite State Machine in Python](https://www.python-course.eu/finite_state_machine.php)\n", 40 | "\n", 41 | "Regular Expressions are used in programming languages to filter texts or text strings. It's possible to check if a text or a string matches a regular expression. A great thing about regular expressions: the core syntax is largely the same across programming and script languages, e.g. Python, Perl, Java, SED, AWK and even X#, although the details differ between dialects.\n", 42 | "\n", 43 | "The first programs which had incorporated the capability to use regular expressions were the Unix tools ed (editor), the stream editor sed and the filter grep ([you SHOULD know this](https://ostechnix.com/the-grep-command-tutorial-with-examples-for-beginners/)).\n", 44 | "\n", 45 | "There is another mechanism in operating systems, which shouldn't be mistaken for regular expressions. Wildcards, also known as globbing, look very similar in their syntax to regular expressions. However, the semantics differ considerably. Globbing is known from many command line shells, like the Bourne shell, the Bash shell or even DOS. In Bash e.g. the command `ls *.txt` lists all files (or even directories) ending with the suffix .txt; in regular expression notation `*.txt` wouldn't make sense, it would have to be written as `.*\\.txt`\n", 46 | "\n", 47 | "#### Introduction\n", 48 | "\n", 49 | "When we introduced the sequential data types, we got to know the \"in\" operator. 
We check in the following example whether the string \"easily\" is a substring of the string \"Regular expressions easily explained!\":" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "metadata": { 55 | "colab": { 56 | "base_uri": "https://localhost:8080/" 57 | }, 58 | "id": "quOyWn0kZmRu", 59 | "outputId": "6292a2dc-6911-4d26-8813-b7183bd3d4a1" 60 | }, 61 | "source": [ 62 | "s = \"Regular expressions easily explained!\"\n", 63 | "\"easily\" in s" 64 | ], 65 | "execution_count": 1, 66 | "outputs": [ 67 | { 68 | "output_type": "execute_result", 69 | "data": { 70 | "text/plain": [ 71 | "True" 72 | ] 73 | }, 74 | "metadata": {}, 75 | "execution_count": 1 76 | } 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": { 82 | "id": "NJUtjHbWdaA4" 83 | }, 84 | "source": [ 85 | "We show step by step with the following diagrams how this matching is performed: We check if the string sub = \"abc\"\n", 86 | "\n", 87 | "![](https://www.python-course.eu/images/regular_expression2.webp)\n", 88 | "\n", 89 | "is contained in the string s = \"xaababcbcd\"\n", 90 | "\n", 91 | "![](https://www.python-course.eu/images/regular_expression1_400w.webp)\n", 92 | "\n", 93 | "By the way, the string sub = \"abc\" can be seen as a regular expression, just a very simple one.\n", 94 | "\n", 95 | "First, we check if the first positions of the two strings match, i.e. s[0] == sub[0]. This is not satisfied in our example. We mark this fact by the colour red:\n", 96 | "\n", 97 | "![](https://www.python-course.eu/images/regular_expression3_400w.webp)\n", 98 | "\n", 99 | "Then we check, if s[1:4] == sub. In other words, we have to check at first, if sub[0] is equal to s[1]. This is true and we mark it with the colour green. Then, we have to compare the next positions. 
s[2] is not equal to sub[1], so we don't have to proceed further with the next position of sub and s:\n", 100 | "\n", 101 | "![](https://www.python-course.eu/images/regular_expression4_400w.webp)\n", 102 | "\n", 103 | "Now we have to check if s[2:5] and sub are equal. The first two positions are equal but not the third:\n", 104 | "\n", 105 | "![](https://www.python-course.eu/images/regular_expression5_400w.webp)\n", 106 | "\n", 107 | "The following steps should be clear without any explanations:\n", 108 | "\n", 109 | "![](https://www.python-course.eu/images/regular_expression6_400w.webp)\n", 110 | "\n", 111 | "Finally, we have a complete match with s[4:7] == sub :\n", 112 | "\n", 113 | "![](https://www.python-course.eu/images/regular_expression7_400w.webp)\n", 114 | "\n", 115 | "#### Representing Regular Expressions in Python\n", 116 | "\n", 117 | "As we have already mentioned in the previous section, we can see the variable \"sub\" from the introduction as a very simple regular expression. If you want to use regular expressions in Python, you have to import the re module, which provides methods and functions to deal with regular expressions.\n", 118 | "\n", 119 | "From other languages you might be used to representing regular expressions within Slashes \"/\", e.g. that's the way Perl, SED or AWK deals with them. In Python there is no special notation. Regular expressions are represented as normal strings.\n", 120 | "\n", 121 | "But this convenience brings along a small problem: The backslash is a special character used in regular expressions, but is also used as an escape character in strings. This implies that Python would first evaluate every backslash of a string and after this - without the necessary backslashes - it would be used as a regular expression. One way to prevent this could be writing every backslash as \"\\\\\" and this way keep it for the evaluation of the regular expression. This can cause extremely clumsy expressions. 
So, a regular expression to match the Windows path \"C:\\\\\\\\programs\" corresponds to a string in regular expression notation with four backslashes, i.e. \"C:\\\\\\\\\\\\\\\\programs\".\n", 122 | "\n", 123 | "The best way to overcome this problem is to mark regular expressions as raw strings. The solution to our Windows path example looks like this as a raw string:\n", 124 | "\n", 125 | "```r\"C:\\\\programs\"```\n", 126 | "\n", 127 | "Let's look at another example, which might be quite disturbing for people who are used to wildcards:\n", 128 | "\n", 129 | "```r\"^a.*\\.html$\"```\n", 130 | "\n", 131 | "The regular expression of our previous example matches all file names (strings) which start with an \"a\" and end with \".html\". We will explain the structure of the example above in detail in the following sections.\n", 132 | "\n", 133 | "#### Syntax of Regular Expressions\n", 134 | "\n", 135 | "```r\"cat\"``` is a regular expression, though a very simple one without any metacharacters. Our RE ```r\"cat\"``` matches, for example, the following string: \"A cat and a rat can't be friends.\"\n", 136 | "\n", 137 | "Interestingly, the previous example already shows a \"favourite\" mistake, frequently made not only by beginners and novices but also by advanced users of regular expressions. The idea of this example is to match strings containing the word \"cat\". We are successful at this, but unfortunately we are matching a lot of other words as well. If we match \"cats\" in a string that might be still okay, but what about all those words containing this character sequence \"cat\"? We match words like \"education\", \"communicate\", \"falsification\", \"ramifications\", \"cattle\" and many more. This is a case of \"over matching\", i.e. 
we receive positive results, which are wrong according to the problem we want to solve.\n", 138 | "\n", 139 | "If we try to fix the previous RE, so that it doesn't create over matching, we might try the expression ```r\" cat \"```. These blanks prevent the matching of the above mentioned words like \"education\", \"falsification\" and \"ramification\", but we fall prey to another mistake. What about the string \"The cat, called Oscar, climbed on the roof.\"? The problem is that we don't expect a comma but only a blank surrounding the word \"cat\".\n", 140 | "\n", 141 | "Before we go on with the description of the syntax of regular expressions, we want to explain how to use them in Python:" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "metadata": { 147 | "colab": { 148 | "base_uri": "https://localhost:8080/" 149 | }, 150 | "id": "4rFDTtSmdWTD", 151 | "outputId": "25080250-d517-4a56-dd20-a7dfc453dc0b" 152 | }, 153 | "source": [ 154 | "import re\n", 155 | "x = re.search(\"cat\", \"A cat and a rat can't be friends.\")\n", 156 | "print(x)" 157 | ], 158 | "execution_count": 2, 159 | "outputs": [ 160 | { 161 | "output_type": "stream", 162 | "name": "stdout", 163 | "text": [ 164 | "<re.Match object; span=(2, 5), match='cat'>\n" 165 | ] 166 | } 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "metadata": { 172 | "colab": { 173 | "base_uri": "https://localhost:8080/" 174 | }, 175 | "id": "4_gCZ5kHiQwl", 176 | "outputId": "b60fc5e1-0722-4b3d-d290-b88d83ba5459" 177 | }, 178 | "source": [ 179 | "x = re.search(\"cow\", \"A cat and a rat can't be friends.\")\n", 180 | "print(x)" 181 | ], 182 | "execution_count": 3, 183 | "outputs": [ 184 | { 185 | "output_type": "stream", 186 | "name": "stdout", 187 | "text": [ 188 | "None\n" 189 | ] 190 | } 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": { 196 | "id": "Du4CEef4iwhF" 197 | }, 198 | "source": [ 199 | "We used the method search from the re module. 
**This is most probably the most important and the most often used method of this module**. re.search(expr,s) checks a string s for an occurrence of a substring which matches the regular expression expr. The first substring (from left), which satisfies this condition will be returned. If a match has been possible, we get a so-called match object as a result, otherwise the value will be None. This method is already enough to use regular expressions in a basic way in Python programs. We can use it in conditional statements: If a regular expression matches, we get a match object returned, which is taken as a True value, and None, which is the return value if it doesn't match, is taken as False:" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "metadata": { 205 | "colab": { 206 | "base_uri": "https://localhost:8080/" 207 | }, 208 | "id": "r4QCpboHidEK", 209 | "outputId": "964bfb75-2a60-4fd9-caf0-4dec4b421bd8" 210 | }, 211 | "source": [ 212 | "if re.search(\"cat\", \"A cat and a rat can't be friends.\"):\n", 213 | "    print(\"Some kind of cat has been found :-)\")\n", 214 | "else:\n", 215 | "    print(\"No cat has been found :-)\")" 216 | ], 217 | "execution_count": 4, 218 | "outputs": [ 219 | { 220 | "output_type": "stream", 221 | "name": "stdout", 222 | "text": [ 223 | "Some kind of cat has been found :-)\n" 224 | ] 225 | } 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "metadata": { 231 | "colab": { 232 | "base_uri": "https://localhost:8080/" 233 | }, 234 | "id": "gPSt361bi4f6", 235 | "outputId": "9c470d04-c83b-4e4a-dd93-b3b4a7638348" 236 | }, 237 | "source": [ 238 | "if re.search(\"cow\", \"A cat and a rat can't be friends.\"):\n", 239 | "    print(\"Cats and Rats and a cow.\")\n", 240 | "else:\n", 241 | "    print(\"No cow around.\")" 242 | ], 243 | "execution_count": 5, 244 | "outputs": [ 245 | { 246 | "output_type": "stream", 247 | "name": "stdout", 248 | "text": [ 249 | "No cow around.\n" 250 | ] 251 | } 252 | ] 253 | }, 254 | { 255 | "cell_type": 
"markdown", 256 | "metadata": { 257 | "id": "Vp3N3vFfjCtX" 258 | }, 259 | "source": [ 260 | "#### Any Character\n", 261 | "\n", 262 | "Let's assume that we have not been interested in the previous example to recognize the word cat, but all three letter words, which end with \"at\". The syntax of regular expressions supplies a metacharacter \".\", which is used like a placeholder for \"any character\". The regular expression of our example can be written like this: r\" .at \" This RE matches three letter words, isolated by blanks, which end in \"at\". Now we get words like \"rat\", \"cat\", \"bat\", \"eat\", \"sat\" and many others.\n", 263 | "\n", 264 | "But what if the text contains \"words\" like \"@at\" or \"3at\"? These words match as well, meaning we have caused over matching again. We will learn a solution in the following section.\n", 265 | "\n", 266 | "#### Character Classes\n", 267 | "\n", 268 | "Square brackets, \"[\" and \"]\", are used to include a character class. [xyz] means e.g. either an \"x\", an \"y\" or a \"z\". Let's look at a more practical example:\n", 269 | "\n", 270 | "```r\"M[ae][iy]er\"```\n", 271 | "\n", 272 | "This is a regular expression, which matches a surname which is quite common in German. A name with the same pronunciation and four different spellings: Maier, Mayer, Meier, Meyer A finite state automata to recognize this expression can be build like this:\n", 273 | "\n", 274 | "![](https://www.python-course.eu/images/finite_state_machine_mayer_400w.webp)\n", 275 | "\n", 276 | "The graph of the finite state machine (FSM) is simplified to keep the design easy. There should be an arrow in the start node pointing back on its own, i.e. if a character other than an upper case \"M\" has been processed, the machine should stay in the start condition. Furthermore, there should be an arrow pointing back from all nodes except the final nodes (the green ones) to the start node, unless the expected letter has been processed. E.g. 
if the machine is in state Ma, after having processed an \"M\" and an \"a\", the machine has to go back to state \"Start\" if any character except \"i\" or \"y\" is read. Those who have problems with this FSM shouldn't worry, since it is not a prerequisite for the rest of the chapter.\n", 277 | "\n", 278 | "Instead of a choice between two characters, we often need a choice between larger character classes. We might need e.g. a class of letters between \"a\" and \"e\" or between \"0\" and \"5\". To manage such character classes, the syntax of regular expressions supplies a metacharacter \"-\". [a-e] is a simplified notation for [abcde], and [0-5] denotes [012345].\n", 279 | "\n", 280 | "The advantage is obvious and even more impressive if we have to turn expressions like \"any uppercase letter\" into regular expressions. So instead of [ABCDEFGHIJKLMNOPQRSTUVWXYZ] we can write [A-Z]. If this is not convincing, write an expression for the character class \"any lower case or uppercase letter\": [A-Za-z]\n", 281 | "\n", 282 | "There is something more to say about the dash we used to mark the beginning and the end of a character range. The dash has a special meaning only if it is used within square brackets, and in this case only if it isn't positioned directly after the opening bracket or immediately in front of the closing bracket. So the expression [-az] is only the choice between the three characters \"-\", \"a\" and \"z\", but no other characters. The same is true for [az-].\n", 283 | "\n", 284 | "The only other special character inside square brackets (character class choice) is the **caret \"^\"**. If it is used directly after an opening square bracket, it negates the choice. [^0-9] denotes the choice \"any character but a digit\". The position of the caret within the square brackets is crucial. If it is not positioned as the first character following the opening square bracket, it has no special meaning. 
[^abc] means anything but an \"a\", \"b\" or \"c\"; [a^bc] means an \"a\", \"b\", \"c\" or a \"^\".\n", 285 | "\n", 286 | "##### Example:\n", 287 | "\n", 288 | "We have a phone list of the Simpsons, yes, the famous Simpsons from the American animated TV series. There are some people with the surname Neu. We are looking for a Neu, but we don't know the first name; we just know that it starts with a J. Let's write a Python script which finds all the lines of the phone book that contain a person with the described surname and a first name starting with J:" 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "metadata": { 294 | "colab": { 295 | "base_uri": "https://localhost:8080/" 296 | }, 297 | "id": "DUSpeMkTi9I1", 298 | "outputId": "5fa13226-2d62-472b-b301-17605409bfe5" 299 | }, 300 | "source": [ 301 | "import re\n", 302 | "\n", 303 | "from urllib.request import urlopen\n", 304 | "with urlopen('https://www.python-course.eu/simpsons_phone_book.txt') as fh:\n", 305 | " for line in fh:\n", 306 | " # line is a byte string, so we decode it from UTF-8 into a str:\n", 307 | " line = line.decode('utf-8').rstrip() \n", 308 | " if re.search(r\"J.*Neu\",line):\n", 309 | " print(line)" 310 | ], 311 | "execution_count": 6, 312 | "outputs": [ 313 | { 314 | "output_type": "stream", 315 | "name": "stdout", 316 | "text": [ 317 | "Jack Neu 555-7666\n", 318 | "Jeb Neu 555-5543\n", 319 | "Jennifer Neu 555-3652\n" 320 | ] 321 | } 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": { 327 | "id": "N6tL8HqdqkVo" 328 | }, 329 | "source": [ 330 | "#### Predefined Character Classes\n", 331 | "\n", 332 | "You might have realized that it can be quite cumbersome to construct certain character classes. A good example is the character class which describes a valid word character. 
These are all lower case and uppercase characters plus all the digits and the underscore, corresponding to the following regular expression: r\"[a-zA-Z0-9_]\"\n", 333 | "\n", 334 | "Predefined character classes (the ASCII sets shown below are exact for bytes patterns or with the re.ASCII flag; for str patterns Python 3 also matches the corresponding Unicode characters):\n", 335 | "\n", 336 | "* `\\d` - Matches any decimal digit; equivalent to the set [0-9]\n", 337 | "* `\\D` - The complement of \\d. It matches any non-digit character; equivalent to the set [^0-9]\n", 338 | "* `\\s` - Matches any whitespace character; equivalent to [ \\t\\n\\r\\f\\v]\n", 339 | "* `\\S` - The complement of \\s. It matches any non-whitespace character; equivalent to [^ \\t\\n\\r\\f\\v]\n", 340 | "* `\\w` - Matches any alphanumeric character or the underscore; equivalent to [a-zA-Z0-9_]. With LOCALE, it will match the set [a-zA-Z0-9_] plus characters defined as letters for the current locale\n", 341 | "* `\\W` - Matches the complement of \\w\n", 342 | "* `\\b` - Matches the empty string, but only at the start or end of a word\n", 343 | "* `\\B` - Matches the empty string, but not at the start or end of a word\n", 344 | "* `\\\\` - Matches a literal backslash\n", 345 | "\n", 346 | "#### Word boundaries\n", 347 | "\n", 348 | "The sequences \\b and \\B from the previous overview are often misunderstood, especially by novices. While the other sequences match characters, e.g. \\w matches characters like \"a\", \"b\", \"m\", \"3\" and so on, \\b and \\B don't match a character. They match empty strings depending on their neighbourhood, i.e. what kind of characters the predecessor and the successor are. So \\b matches any empty string between a \\W and a \\w character and also between a \\w and a \\W character. 
\\B is the complement, i.e. empty strings between \\W and \\W or empty strings between \\w and \\w.\n", 349 | "\n", 350 | "#### Matching Beginning and End\n", 351 | "\n", 352 | "But what if we want to match a regular expression at the beginning of a string and only at the beginning?\n", 353 | "\n", 354 | "The re module of Python provides two functions to match regular expressions. We have already met one of them, i.e. search(). The other has in our opinion a misleading name: match(). It is misleading because match(re_str, s) checks for a match of re_str merely at the beginning of the string. But anyway, match() is the solution to our question, as we can see in the following example:" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "metadata": { 360 | "colab": { 361 | "base_uri": "https://localhost:8080/" 362 | }, 363 | "id": "zAdJo8edqNfo", 364 | "outputId": "88665900-9310-4ea5-e711-b3898d1db40d" 365 | }, 366 | "source": [ 367 | "import re\n", 368 | "s1 = \"Mayer is a very common Name\"\n", 369 | "s2 = \"He is called Meyer but he isn't German.\"\n", 370 | "print(re.search(r\"M[ae][iy]er\", s1))\n", 371 | "print(re.search(r\"M[ae][iy]er\", s2))\n", 372 | "# matches because s1 starts with Mayer:\n", 373 | "print(re.match(r\"M[ae][iy]er\", s1))\n", 374 | "# doesn't match because s2 doesn't start with Mayer, Meyer, Meier or Maier:\n", 375 | "print(re.match(r\"M[ae][iy]er\", s2))" 376 | ], 377 | "execution_count": 7, 378 | "outputs": [ 379 | { 380 | "output_type": "stream", 381 | "name": "stdout", 382 | "text": [ 383 | "<re.Match object; span=(0, 5), match='Mayer'>\n", 384 | "<re.Match object; span=(13, 18), match='Meyer'>\n", 385 | "<re.Match object; span=(0, 5), match='Mayer'>\n", 386 | "None\n" 387 | ] 388 | } 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": { 394 | "id": "L7fnGDYlu-aC" 395 | }, 396 | "source": [ 397 | "So, this is a way to match the start of a string, but it's a Python-specific method, i.e. it can't be used in other languages like Perl, AWK and so on. 
There is a general solution which is a standard for regular expressions:\n", 398 | "\n", 399 | "The caret '^' matches the start of the string, and in MULTILINE mode (explained further down) it also matches immediately after each newline, which the Python method match() doesn't do. To have this meaning, the caret is placed at the start of the regular expression:" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "metadata": { 405 | "colab": { 406 | "base_uri": "https://localhost:8080/" 407 | }, 408 | "id": "Bo8dDR47vBox", 409 | "outputId": "4128e45e-8835-41ea-fc05-d8c7e3950ac7" 410 | }, 411 | "source": [ 412 | "import re\n", 413 | "s1 = \"Mayer is a very common Name\"\n", 414 | "s2 = \"He is called Meyer but he isn't German.\"\n", 415 | "print(re.search(r\"^M[ae][iy]er\", s1))\n", 416 | "print(re.search(r\"^M[ae][iy]er\", s2))" 417 | ], 418 | "execution_count": 8, 419 | "outputs": [ 420 | { 421 | "output_type": "stream", 422 | "name": "stdout", 423 | "text": [ 424 | "<re.Match object; span=(0, 5), match='Mayer'>\n", 425 | "None\n" 426 | ] 427 | } 428 | ] 429 | }, 430 | { 431 | "cell_type": "markdown", 432 | "metadata": { 433 | "id": "6bAHJwJKvyTK" 434 | }, 435 | "source": [ 436 | "But what happens if we concatenate the two strings s1 and s2 in the following way?" 
437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "metadata": { 442 | "id": "fvdO_zGevDeW" 443 | }, 444 | "source": [ 445 | "s = s2 + \"\\n\" + s1" 446 | ], 447 | "execution_count": 9, 448 | "outputs": [] 449 | }, 450 | { 451 | "cell_type": "markdown", 452 | "metadata": { 453 | "id": "-9I0czNnv5Ec" 454 | }, 455 | "source": [ 456 | "Now the string doesn't start with a Maier of any kind, but the name follows a newline character:" 457 | ] 458 | }, 459 | { 460 | "cell_type": "code", 461 | "metadata": { 462 | "colab": { 463 | "base_uri": "https://localhost:8080/" 464 | }, 465 | "id": "kPEhG1aNv3Uo", 466 | "outputId": "13fe6aa4-e936-4a58-9689-0399cb8d5ab3" 467 | }, 468 | "source": [ 469 | "s = s2 + \"\\n\" + s1\n", 470 | "print(re.search(r\"^M[ae][iy]er\", s))" 471 | ], 472 | "execution_count": 10, 473 | "outputs": [ 474 | { 475 | "output_type": "stream", 476 | "name": "stdout", 477 | "text": [ 478 | "None\n" 479 | ] 480 | } 481 | ] 482 | }, 483 | { 484 | "cell_type": "markdown", 485 | "metadata": { 486 | "id": "XSyxSlOBv9dM" 487 | }, 488 | "source": [ 489 | "The name hasn't been found, because only the beginning of the string is checked. 
This changes if we use the multiline mode, which can be activated by passing the following flag to search:" 490 | ] 491 | }, 492 | { 493 | "cell_type": "code", 494 | "metadata": { 495 | "colab": { 496 | "base_uri": "https://localhost:8080/" 497 | }, 498 | "id": "NXZ0fOE_v7AR", 499 | "outputId": "2f91fdea-174f-423e-c2f7-2e0083e749ee" 500 | }, 501 | "source": [ 502 | "print(re.search(r\"^M[ae][iy]er\", s, re.MULTILINE))\n", 503 | "print(re.search(r\"^M[ae][iy]er\", s, re.M))\n", 504 | "print(re.match(r\"^M[ae][iy]er\", s, re.M))" 505 | ], 506 | "execution_count": 11, 507 | "outputs": [ 508 | { 509 | "output_type": "stream", 510 | "name": "stdout", 511 | "text": [ 512 | "<re.Match object; span=(40, 45), match='Mayer'>\n", 513 | "<re.Match object; span=(40, 45), match='Mayer'>\n", 514 | "None\n" 515 | ] 516 | } 517 | ] 518 | }, 519 | { 520 | "cell_type": "markdown", 521 | "metadata": { 522 | "id": "m3ss3H4awFGp" 523 | }, 524 | "source": [ 525 | "The previous example also shows that the multiline mode doesn't affect the match method. match() never checks anything but the beginning of the string for a match.\n", 526 | "\n", 527 | "We have learnt how to match the beginning of a string. What about the end? Of course that's possible too. The dollar sign matches the end of a string or just before the newline at the end of the string. In MULTILINE mode, it also matches before each newline. 
We demonstrate the usage of the \"$\" character in the following example:" 528 | ] 529 | }, 530 | { 531 | "cell_type": "code", 532 | "metadata": { 533 | "colab": { 534 | "base_uri": "https://localhost:8080/" 535 | }, 536 | "id": "wU8D9Rjiv_5B", 537 | "outputId": "ea851cd7-5da1-4b4c-98a1-cb995188e149" 538 | }, 539 | "source": [ 540 | "print(re.search(r\"Python\\.$\",\"I like Python.\"))\n", 541 | "print(re.search(r\"Python\\.$\",\"I like Python and Perl.\"))\n", 542 | "print(re.search(r\"Python\\.$\",\"I like Python.\\nSome prefer Java or Perl.\"))\n", 543 | "print(re.search(r\"Python\\.$\",\"I like Python.\\nSome prefer Java or Perl.\", re.M))" 544 | ], 545 | "execution_count": 12, 546 | "outputs": [ 547 | { 548 | "output_type": "stream", 549 | "name": "stdout", 550 | "text": [ 551 | "<re.Match object; span=(7, 14), match='Python.'>\n", 552 | "None\n", 553 | "None\n", 554 | "<re.Match object; span=(7, 14), match='Python.'>\n" 555 | ] 556 | } 557 | ] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": { 562 | "id": "TpX8Ytj2weSQ" 563 | }, 564 | "source": [ 565 | "#### Optional items\n", 566 | "\n", 567 | "If you thought that our collection of Mayer names was complete, you were wrong. There are others all over the world, e.g. in London and Paris, who dropped their \"e\". So we have four more names [\"Mayr\", \"Meyr\", \"Meir\", \"Mair\"] plus our old set [\"Mayer\", \"Meyer\", \"Meier\", \"Maier\"].\n", 568 | "\n", 569 | "If we try to figure out a fitting regular expression, we realize that we are missing something: a way to tell the computer \"this \"e\" may or may not occur\". A question mark is used as a notation for this. A question mark declares that the preceding character or expression is optional.\n", 570 | "\n", 571 | "The final Mayer-Recognizer now looks like this:\n", 572 | "\n", 573 | "```r\"M[ae][iy]e?r\"```\n", 574 | "\n", 575 | "A subexpression is grouped by round brackets and a question mark following such a group means that this group may or may not exist. 
With the following expression we can match dates like \"Feb 2011\" or \"February 2011\":\n", 576 | "\n", 577 | "```r\"Feb(ruary)? 2011\"```\n", 578 | "\n", 579 | "#### Quantifiers\n", 580 | "\n", 581 | "If you just use what we have introduced so far, you will still miss a lot of things, above all some way of repeating characters or regular expressions. For this purpose, quantifiers are used. We have encountered one in the previous paragraph, i.e. the question mark.\n", 582 | "\n", 583 | "A quantifier after a token, which can be a single character or a group in brackets, specifies how often that preceding element is allowed to occur. The most common quantifiers are:\n", 584 | "\n", 585 | "* the question mark ?\n", 586 | "* the asterisk or star character * ~~(which is derived from the Kleene star)~~\n", 587 | "* and the plus sign + ~~(derived from the Kleene cross)~~\n", 588 | "\n", 589 | "We have already used one of these quantifiers without explaining it, i.e. the asterisk. A star following a character or a subexpression group means that this expression or character may be repeated arbitrarily often, even zero times.\n", 590 | "\n", 591 | "```r\"[0-9]*\"```\n", 592 | "\n", 593 | "The above expression matches any sequence of digits, even the empty string. ```r\".*\"``` matches any sequence of characters and the empty string.\n", 594 | "\n", 595 | "**Exercise:** Write a regular expression which matches strings which start with a sequence of digits - at least one digit - followed by a blank.\n", 596 | "\n", 597 | "**Solution:**\n", 598 | "\n", 599 | "```r\"^[0-9][0-9]* \"```\n", 600 | "\n", 601 | "The plus operator is very convenient to solve the previous exercise. The plus operator is very similar to the star operator, except that the character or subexpression followed by a \"+\" sign has to occur at least once. 
Here follows the solution to our exercise with the plus quantifier.\n", 602 | "\n", 603 | "**Solution with the plus quantifier:**\n", 604 | "\n", 605 | "```r\"^[0-9]+ \"```\n", 606 | "\n", 607 | "If you work with this arsenal of operators for a while, you will inevitably miss the possibility to repeat expressions an exact number of times at some point. Let's assume you want to recognize the last lines of addresses on envelopes in Switzerland. These lines usually contain a four-digit post code followed by a blank and a city name. Let's assume that there is no city name in Switzerland which consists of fewer than 3 letters. We can denote \"at least 3 letters\" by [A-Za-z]{3,}. Now we have to recognize lines with a German post code (5 digits) as well, i.e. the post code can now consist of either four or five digits:\n", 608 | "\n", 609 | "```r\"^[0-9]{4,5} [A-Z][a-z]{2,}\"```\n", 610 | "\n", 611 | "The general syntax is {from,to}, meaning the expression has to appear at least \"from\" times and not more than \"to\" times. {,to} is an abbreviated spelling for {0,to} and {from,} is an abbreviation for \"at least from times but no upper limit\".\n", 612 | "\n", 613 | "#### Grouping\n", 614 | "\n", 615 | "We can group a part of a regular expression by surrounding it with parentheses (round brackets). This way we can apply operators to the complete group instead of a single character.\n", 616 | "\n", 617 | "#### Capturing Groups and Back References\n", 618 | "\n", 619 | "Parentheses (round brackets) not only group subexpressions, they also create back references. The part of the string matched by the grouped part of the regular expression, i.e. the subexpression in parentheses, is stored in a back reference. With the aid of back references we can reuse parts of regular expressions. These stored values can be reused both inside the expression itself and afterwards, once the regular expression has been executed. 
Before we continue with our treatment of back references, we want to insert a paragraph about match objects, which are important for our next examples with back references.\n", 620 | "\n", 621 | "#### A Closer Look at the Match Objects\n", 622 | "\n", 623 | "So far we have just checked whether an expression matched or not. We used the fact that re.search() returns a match object if it matches and None otherwise. We haven't been interested e.g. in what has been matched. The match object contains a lot of data about what has been matched, positions and so on.\n", 624 | "\n", 625 | "A match object provides the methods group(), span(), start() and end(), as can be seen in the following application:" 626 | ] 627 | }, 628 | { 629 | "cell_type": "code", 630 | "metadata": { 631 | "colab": { 632 | "base_uri": "https://localhost:8080/", 633 | "height": 37 634 | }, 635 | "id": "epPA9EqtwceK", 636 | "outputId": "bd3189f8-13ef-4ca9-db8b-da8a586b9a2d" 637 | }, 638 | "source": [ 639 | "import re\n", 640 | "mo = re.search(\"[0-9]+\", \"Customer number: 232454, Date: February 12, 2011\")\n", 641 | "mo.group()" 642 | ], 643 | "execution_count": 13, 644 | "outputs": [ 645 | { 646 | "output_type": "execute_result", 647 | "data": { 648 | "application/vnd.google.colaboratory.intrinsic+json": { 649 | "type": "string" 650 | }, 651 | "text/plain": [ 652 | "'232454'" 653 | ] 654 | }, 655 | "metadata": {}, 656 | "execution_count": 13 657 | } 658 | ] 659 | }, 660 | { 661 | "cell_type": "code", 662 | "metadata": { 663 | "colab": { 664 | "base_uri": "https://localhost:8080/" 665 | }, 666 | "id": "GfuPbjE-0pQ7", 667 | "outputId": "b296ed4b-3b33-4dce-81fe-39a5c8d1c2d5" 668 | }, 669 | "source": [ 670 | "mo.span()" 671 | ], 672 | "execution_count": 14, 673 | "outputs": [ 674 | { 675 | "output_type": "execute_result", 676 | "data": { 677 | "text/plain": [ 678 | "(17, 23)" 679 | ] 680 | }, 681 | "metadata": {}, 682 | "execution_count": 14 683 | } 684 | ] 685 | }, 686 | { 687 | "cell_type": "code", 
688 | "metadata": { 689 | "colab": { 690 | "base_uri": "https://localhost:8080/" 691 | }, 692 | "id": "Csvb124Y0vr9", 693 | "outputId": "3b7a6019-2756-4f04-a41e-a60a2e358d7e" 694 | }, 695 | "source": [ 696 | "mo.start()" 697 | ], 698 | "execution_count": 15, 699 | "outputs": [ 700 | { 701 | "output_type": "execute_result", 702 | "data": { 703 | "text/plain": [ 704 | "17" 705 | ] 706 | }, 707 | "metadata": {}, 708 | "execution_count": 15 709 | } 710 | ] 711 | }, 712 | { 713 | "cell_type": "code", 714 | "metadata": { 715 | "colab": { 716 | "base_uri": "https://localhost:8080/" 717 | }, 718 | "id": "an0o-R3K0xpM", 719 | "outputId": "69f11240-9b71-4039-aa86-c693789631aa" 720 | }, 721 | "source": [ 722 | "mo.end()" 723 | ], 724 | "execution_count": 16, 725 | "outputs": [ 726 | { 727 | "output_type": "execute_result", 728 | "data": { 729 | "text/plain": [ 730 | "23" 731 | ] 732 | }, 733 | "metadata": {}, 734 | "execution_count": 16 735 | } 736 | ] 737 | }, 738 | { 739 | "cell_type": "markdown", 740 | "metadata": { 741 | "id": "119Jnyyc1d8y" 742 | }, 743 | "source": [ 744 | "These methods are not difficult to understand. span() returns a tuple with the start and end position, i.e. the string indices where the regular expression started and ended matching. The methods start() and end() are in a way superfluous, as the information is contained in span(), i.e. span()[0] is equal to start() and span()[1] is equal to end(). group(), if called without an argument, returns the substring which has been matched by the complete regular expression. With the help of group() we can also access the substrings matched by the grouping parentheses: to get the matched substring of the n-th group, we call group() with the argument n: group(n). We can also call group with more than one integer argument, e.g. group(n,m). group(n,m) - provided there exist subgroups n and m - returns a tuple with the matched substrings. 
group(n,m) is equal to (group(n), group(m)):" 745 | ] 746 | }, 747 | { 748 | "cell_type": "code", 749 | "metadata": { 750 | "colab": { 751 | "base_uri": "https://localhost:8080/", 752 | "height": 37 753 | }, 754 | "id": "08Uiymf_06cg", 755 | "outputId": "388f4923-2e5e-464d-c38a-170125b49628" 756 | }, 757 | "source": [ 758 | "import re\n", 759 | "mo = re.search(\"([0-9]+).*: (.*)\", \"Customer number: 232454, Date: February 12, 2011\")\n", 760 | "mo.group()" 761 | ], 762 | "execution_count": 18, 763 | "outputs": [ 764 | { 765 | "output_type": "execute_result", 766 | "data": { 767 | "application/vnd.google.colaboratory.intrinsic+json": { 768 | "type": "string" 769 | }, 770 | "text/plain": [ 771 | "'232454, Date: February 12, 2011'" 772 | ] 773 | }, 774 | "metadata": {}, 775 | "execution_count": 18 776 | } 777 | ] 778 | }, 779 | { 780 | "cell_type": "code", 781 | "metadata": { 782 | "colab": { 783 | "base_uri": "https://localhost:8080/", 784 | "height": 37 785 | }, 786 | "id": "SpKWHw9b1g77", 787 | "outputId": "f99e3a6d-5cde-4db1-ae92-560f1480b0d2" 788 | }, 789 | "source": [ 790 | "mo.group(1)" 791 | ], 792 | "execution_count": 19, 793 | "outputs": [ 794 | { 795 | "output_type": "execute_result", 796 | "data": { 797 | "application/vnd.google.colaboratory.intrinsic+json": { 798 | "type": "string" 799 | }, 800 | "text/plain": [ 801 | "'232454'" 802 | ] 803 | }, 804 | "metadata": {}, 805 | "execution_count": 19 806 | } 807 | ] 808 | }, 809 | { 810 | "cell_type": "code", 811 | "metadata": { 812 | "colab": { 813 | "base_uri": "https://localhost:8080/", 814 | "height": 37 815 | }, 816 | "id": "CtoCiVWM4I3h", 817 | "outputId": "56256895-cc9b-424d-cb96-99e4e937fcb9" 818 | }, 819 | "source": [ 820 | "mo.group(2)" 821 | ], 822 | "execution_count": 20, 823 | "outputs": [ 824 | { 825 | "output_type": "execute_result", 826 | "data": { 827 | "application/vnd.google.colaboratory.intrinsic+json": { 828 | "type": "string" 829 | }, 830 | "text/plain": [ 831 | "'February 12, 2011'" 832 
| ] 833 | }, 834 | "metadata": {}, 835 | "execution_count": 20 836 | } 837 | ] 838 | }, 839 | { 840 | "cell_type": "code", 841 | "metadata": { 842 | "colab": { 843 | "base_uri": "https://localhost:8080/" 844 | }, 845 | "id": "LjPWGAF04MCo", 846 | "outputId": "6243c18f-6888-4f82-f082-455971e60ce1" 847 | }, 848 | "source": [ 849 | "mo.group(2, 1)" 850 | ], 851 | "execution_count": 21, 852 | "outputs": [ 853 | { 854 | "output_type": "execute_result", 855 | "data": { 856 | "text/plain": [ 857 | "('February 12, 2011', '232454')" 858 | ] 859 | }, 860 | "metadata": {}, 861 | "execution_count": 21 862 | } 863 | ] 864 | }, 865 | { 866 | "cell_type": "markdown", 867 | "metadata": { 868 | "id": "tXwt-zZT4gAn" 869 | }, 870 | "source": [ 871 | "XML or HTML tags make a very intuitive example. E.g. let's assume we have a file (called \"tags.txt\") with content like this:\n", 872 | "\n", 873 | "```\n", 874 | "<composer> Wolfgang Amadeus Mozart </composer>\n", 875 | "<author> Samuel Beckett </author>\n", 876 | "<city> London </city>\n", 877 | "```\n", 878 | "\n", 879 | "We want to rewrite this text automatically to\n", 880 | "\n", 881 | "```\n", 882 | "composer: Wolfgang Amadeus Mozart\n", 883 | "author: Samuel Beckett\n", 884 | "city: London\n", 885 | "```\n", 886 | "\n", 887 | "The following little Python script does the trick. The core of this script is the regular expression. This regular expression works like this: It tries to match a less than symbol \"<\". After this it reads lower case letters until it reaches the greater than symbol. Everything encountered between \"<\" and \">\" is stored in a back reference which can be accessed within the expression by writing \\1. Let's assume \\1 contains the value \"composer\". 
When the expression has reached the first \">\", it continues matching, as the original expression had been \"<([a-z]+)>(.*)</\\1>\":" 888 | ] 889 | }, 890 | { 891 | "cell_type": "code", 892 | "metadata": { 893 | "colab": { 894 | "base_uri": "https://localhost:8080/" 895 | }, 896 | "id": "t3WabLLp4OlT", 897 | "outputId": "76b69fc1-44c3-4581-bce3-80b1939a2268" 898 | }, 899 | "source": [ 900 | "text = '''<composer> Wolfgang Amadeus Mozart </composer>\n", 901 | "<author> Samuel Beckett </author>\n", 902 | "<city> London </city>\n", 903 | " '''\n", 904 | "\n", 905 | "with open('tags.txt', 'w') as h:\n", 906 | " print(text, file=h)\n", 907 | "\n", 908 | "!cat tags.txt" 909 | ], 910 | "execution_count": 26, 911 | "outputs": [ 912 | { 913 | "output_type": "stream", 914 | "name": "stdout", 915 | "text": [ 916 | "<composer> Wolfgang Amadeus Mozart </composer>\n", 917 | "<author> Samuel Beckett </author>\n", 918 | "<city> London </city>\n", 919 | " \n" 920 | ] 921 | } 922 | ] 923 | }, 924 | { 925 | "cell_type": "code", 926 | "metadata": { 927 | "colab": { 928 | "base_uri": "https://localhost:8080/" 929 | }, 930 | "id": "wgmtPGuH4-a8", 931 | "outputId": "fdea3eeb-779f-4605-d220-1471da48410e" 932 | }, 933 | "source": [ 934 | "import re\n", 935 | "fh = open(\"tags.txt\")\n", 936 | "for i in fh:\n", 937 | " i = i.strip()\n", 938 | " if i:\n", 939 | " res = re.search(r\"<([a-z]+)>(.*)</\\1>\",i)\n", 940 | " print(res.group(1) + \": \" + res.group(2))" 941 | ], 942 | "execution_count": 30, 943 | "outputs": [ 944 | { 945 | "output_type": "stream", 946 | "name": "stdout", 947 | "text": [ 948 | "composer: Wolfgang Amadeus Mozart \n", 949 | "author: Samuel Beckett \n", 950 | "city: London \n" 951 | ] 952 | } 953 | ] 954 | }, 955 | { 956 | "cell_type": "markdown", 957 | "metadata": { 958 | "id": "Pw2fp6YJ6NR1" 959 | }, 960 | "source": [ 961 | "If there is more than one pair of parentheses (round brackets) inside the expression, the backreferences are numbered \\1, \\2, \\3, in the order of the pairs of parentheses.\n", 962 | "\n", 963 | "**Exercise:** The next Python example makes use of three back 
references. We have an imaginary phone list of the Simpsons in a list. Not all entries contain a phone number, but if a phone number exists it is the first part of an entry. Then, separated by a blank, a surname follows, which is followed by first names. Surname and first name are separated by a comma. The task is to rewrite this list in the following way:\n", 964 | "\n", 965 | "```\n", 966 | "Allison Neu 555-8396\n", 967 | "C. Montgomery Burns \n", 968 | "Lionel Putz 555-5299\n", 969 | "Homer Jay Simpson 555-7334\n", 970 | "```\n", 971 | "\n", 972 | "Python script solving the rearrangement problem:" 973 | ] 974 | }, 975 | { 976 | "cell_type": "code", 977 | "metadata": { 978 | "colab": { 979 | "base_uri": "https://localhost:8080/" 980 | }, 981 | "id": "bS8pNdBC4_0r", 982 | "outputId": "b323fbdb-21de-4a3b-a233-06ec19cb11a5" 983 | }, 984 | "source": [ 985 | "import re\n", 986 | "\n", 987 | "l = [\"555-8396 Neu, Allison\", \n", 988 | " \"Burns, C. Montgomery\", \n", 989 | " \"555-5299 Putz, Lionel\",\n", 990 | " \"555-7334 Simpson, Homer Jay\"]\n", 991 | "\n", 992 | "for i in l:\n", 993 | " res = re.search(r\"([0-9-]*)\\s*([A-Za-z]+),\\s+(.*)\", i)\n", 994 | " print(res.group(3) + \" \" + res.group(2) + \" \" + res.group(1))" 995 | ], 996 | "execution_count": 31, 997 | "outputs": [ 998 | { 999 | "output_type": "stream", 1000 | "name": "stdout", 1001 | "text": [ 1002 | "Allison Neu 555-8396\n", 1003 | "C. Montgomery Burns \n", 1004 | "Lionel Putz 555-5299\n", 1005 | "Homer Jay Simpson 555-7334\n" 1006 | ] 1007 | } 1008 | ] 1009 | }, 1010 | { 1011 | "cell_type": "markdown", 1012 | "metadata": { 1013 | "id": "fL2pV3P86oF1" 1014 | }, 1015 | "source": [ 1016 | "#### Named Backreferences\n", 1017 | "\n", 1018 | "In the previous paragraph we introduced \"Capturing Groups\" and \"Back references\". More precisely, we could have called them \"Numbered Capturing Groups\" and \"Numbered Backreferences\". 
Using named capturing groups instead of \"numbered\" capturing groups allows you to assign descriptive names instead of automatic numbers to the groups. In the following example, we demonstrate this approach by catching the hours, minutes and seconds from a UNIX date string:" 1019 | ] 1020 | }, 1021 | { 1022 | "cell_type": "code", 1023 | "metadata": { 1024 | "colab": { 1025 | "base_uri": "https://localhost:8080/", 1026 | "height": 37 1027 | }, 1028 | "id": "Gf5ElryE6laP", 1029 | "outputId": "dcc71290-30cf-4ccf-e923-dd50a7192328" 1030 | }, 1031 | "source": [ 1032 | "import re\n", 1033 | "s = \"Sun Oct 14 13:47:03 CEST 2012\"\n", 1034 | "expr = r\"\\b(?P<hours>\\d\\d):(?P<minutes>\\d\\d):(?P<seconds>\\d\\d)\\b\"\n", 1035 | "x = re.search(expr,s)\n", 1036 | "x.group('hours')" 1037 | ], 1038 | "execution_count": 32, 1039 | "outputs": [ 1040 | { 1041 | "output_type": "execute_result", 1042 | "data": { 1043 | "application/vnd.google.colaboratory.intrinsic+json": { 1044 | "type": "string" 1045 | }, 1046 | "text/plain": [ 1047 | "'13'" 1048 | ] 1049 | }, 1050 | "metadata": {}, 1051 | "execution_count": 32 1052 | } 1053 | ] 1054 | }, 1055 | { 1056 | "cell_type": "code", 1057 | "metadata": { 1058 | "colab": { 1059 | "base_uri": "https://localhost:8080/" 1060 | }, 1061 | "id": "q-2_cgjs6w_T", 1062 | "outputId": "5392c522-8189-4496-e927-ba28dce38d8e" 1063 | }, 1064 | "source": [ 1065 | "x.span('seconds')" 1066 | ], 1067 | "execution_count": 33, 1068 | "outputs": [ 1069 | { 1070 | "output_type": "execute_result", 1071 | "data": { 1072 | "text/plain": [ 1073 | "(17, 19)" 1074 | ] 1075 | }, 1076 | "metadata": {}, 1077 | "execution_count": 33 1078 | } 1079 | ] 1080 | }, 1081 | { 1082 | "cell_type": "markdown", 1083 | "metadata": { 1084 | "id": "R0w_4Hf9As-o" 1085 | }, 1086 | "source": [ 1087 | "#### [What is a non-capturing group in regular expressions?](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions)" 1088 | ] 1089 | }, 1090 | { 1091 | "cell_type": "markdown", 
1092 | "metadata": { 1093 | "id": "VbBDkxMr7XJ5" 1094 | }, 1095 | "source": [ 1096 | "## Advanced Regular Expressions\n", 1097 | "\n", 1098 | "#### Finding all Matched Substrings\n", 1099 | "\n", 1100 | "The Python module re provides another great method, which other languages like Perl and Java don't provide. If you want to find all the substrings in a string, which match a regular expression, you have to use a loop in Perl and other languages, as can be seen in the following Perl snippet:\n", 1101 | "\n", 1102 | "```perl\n", 1103 | "while ($string =~ m/regex/g) {\n", 1104 | " print \"Found '$&'. Next attempt at character \" . pos($string)+1 . \"\\n\";\n", 1105 | "}\n", 1106 | "```\n", 1107 | "\n", 1108 | "It's a lot easier in Python. No need to loop. We can just use the findall method of the re module:\n", 1109 | "\n", 1110 | "```re.findall(pattern, string[, flags])```\n", 1111 | "\n", 1112 | "Findall returns all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order in which they are found" 1113 | ] 1114 | }, 1115 | { 1116 | "cell_type": "code", 1117 | "metadata": { 1118 | "colab": { 1119 | "base_uri": "https://localhost:8080/" 1120 | }, 1121 | "id": "vUmjFj9D7RFH", 1122 | "outputId": "a124ad69-bb9f-4d1b-93d9-54f960466295" 1123 | }, 1124 | "source": [ 1125 | "t=\"A fat cat doesn't eat oat but a rat eats bats.\"\n", 1126 | "mo = re.findall(\"[force]at\", t)\n", 1127 | "print(mo)" 1128 | ], 1129 | "execution_count": 34, 1130 | "outputs": [ 1131 | { 1132 | "output_type": "stream", 1133 | "name": "stdout", 1134 | "text": [ 1135 | "['fat', 'cat', 'eat', 'oat', 'rat', 'eat']\n" 1136 | ] 1137 | } 1138 | ] 1139 | }, 1140 | { 1141 | "cell_type": "markdown", 1142 | "metadata": { 1143 | "id": "lo-J5FDU9U-s" 1144 | }, 1145 | "source": [ 1146 | "If one or more groups are present in the pattern, findall returns a list of groups. 
This will be a list of tuples if the pattern has more than one group. We demonstrate this in our next example. We have a long string with various Python training courses and their dates. With the first call to findall, we don't use any grouping and receive the complete string as a result. In the next call, we use grouping and findall returns a list of 2-tuples, each having the course name as the first component and the dates as the second component:" 1147 | ] 1148 | }, 1149 | { 1150 | "cell_type": "code", 1151 | "metadata": { 1152 | "colab": { 1153 | "base_uri": "https://localhost:8080/" 1154 | }, 1155 | "id": "0XjLiZ-T81GW", 1156 | "outputId": "7cebbe03-db15-451c-b67f-30b08895b579" 1157 | }, 1158 | "source": [ 1159 | "import re\n", 1160 | "courses = \"Python Training Course for Beginners: 15/Aug/2011 - 19/Aug/2011;Python Training Course Intermediate: 12/Dec/2011 - 16/Dec/2011;Python Text Processing Course:31/Oct/2011 - 4/Nov/2011\"\n", 1161 | "items = re.findall(\"[^:]*:[^;]*;?\", courses)\n", 1162 | "items" 1163 | ], 1164 | "execution_count": 35, 1165 | "outputs": [ 1166 | { 1167 | "output_type": "execute_result", 1168 | "data": { 1169 | "text/plain": [ 1170 | "['Python Training Course for Beginners: 15/Aug/2011 - 19/Aug/2011;',\n", 1171 | " 'Python Training Course Intermediate: 12/Dec/2011 - 16/Dec/2011;',\n", 1172 | " 'Python Text Processing Course:31/Oct/2011 - 4/Nov/2011']" 1173 | ] 1174 | }, 1175 | "metadata": {}, 1176 | "execution_count": 35 1177 | } 1178 | ] 1179 | }, 1180 | { 1181 | "cell_type": "code", 1182 | "metadata": { 1183 | "colab": { 1184 | "base_uri": "https://localhost:8080/" 1185 | }, 1186 | "id": "alyQ6ygM9W59", 1187 | "outputId": "1aa67ffa-7e38-4d91-9f2e-3ce8e65a4e6e" 1188 | }, 1189 | "source": [ 1190 | "items = re.findall(\"([^:]*):([^;]*;?)\", courses)\n", 1191 | "items" 1192 | ], 1193 | "execution_count": 36, 1194 | "outputs": [ 1195 | { 1196 | "output_type": "execute_result", 1197 | "data": { 1198 | "text/plain": [ 1199 | "[('Python 
Training Course for Beginners', ' 15/Aug/2011 - 19/Aug/2011;'),\n", 1200 | " ('Python Training Course Intermediate', ' 12/Dec/2011 - 16/Dec/2011;'),\n", 1201 | " ('Python Text Processing Course', '31/Oct/2011 - 4/Nov/2011')]" 1202 | ] 1203 | }, 1204 | "metadata": {}, 1205 | "execution_count": 36 1206 | } 1207 | ] 1208 | }, 1209 | { 1210 | "cell_type": "markdown", 1211 | "metadata": { 1212 | "id": "ugAiXsBP9q0A" 1213 | }, 1214 | "source": [ 1215 | "#### Alternations\n", 1216 | "\n", 1217 | "In our introduction to regular expressions we introduced character classes. Character classes offer a choice out of a set of characters. Sometimes we need a choice between several regular expressions. It's a logical \"or\", and that's why the symbol for this construct is the \"|\" symbol. In the following example, we check whether one of the cities London, Paris, Zurich or Strasbourg appears in a string preceded by the word \"location\":" 1218 | ] 1219 | }, 1220 | { 1221 | "cell_type": "code", 1222 | "metadata": { 1223 | "colab": { 1224 | "base_uri": "https://localhost:8080/" 1225 | }, 1226 | "id": "Zb0aDk9B9ZU4", 1227 | "outputId": "a419899b-e216-4fbe-d0fc-b703fed2165f" 1228 | }, 1229 | "source": [ 1230 | "# greedy: .* consumes as much as possible, so the match extends to the last city\n", 1231 | "\n", 1232 | "import re\n", 1233 | "text = \"Course location is London or Paris!\"\n", 1234 | "mo = re.search(r\"location.*(London|Paris|Zurich|Strasbourg)\", text)\n", 1235 | "if mo: print(mo.group())" 1236 | ], 1237 | "execution_count": 40, 1238 | "outputs": [ 1239 | { 1240 | "output_type": "stream", 1241 | "name": "stdout", 1242 | "text": [ 1243 | "location is London or Paris\n" 1244 | ] 1245 | } 1246 | ] 1247 | }, 1248 | { 1249 | "cell_type": "markdown", 1250 | "metadata": { 1251 | "id": "g-qC64gABE6u" 1252 | }, 1253 | "source": [ 1254 | "#### Compiling Regular Expressions\n", 1255 | "\n", 1256 | "If you want to use the same regexp more than once in a script, it might be a good idea to use a regular expression object, i.e. 
the regex is compiled.\n", 1257 | "\n", 1258 | "The general syntax:\n", 1259 | "\n", 1260 | "```re.compile(pattern[, flags])```\n", 1261 | "\n", 1262 | "compile returns a regex object, which can be used later for searching and replacing. The expression's behaviour can be modified by specifying a flag value:\n", 1263 | "\n", 1264 | "|Abbreviation| Full name |\n", 1265 | "|----------- | -----------|\n", 1266 | "| re.I | re.IGNORECASE |\n", 1267 | "| re.L | re.LOCALE |\n", 1268 | "| re.M | re.MULTILINE |\n", 1269 | "| re.S | re.DOTALL |\n", 1270 | "| re.U | re.UNICODE |\n", 1271 | "| re.X | re.VERBOSE |\n", 1272 | "\n", 1273 | "Compiled regex objects usually do not save much time, because Python internally compiles and caches regexes whenever you use them with re.search() or re.match(). The only extra time a non-compiled regex takes is the time it needs to check the cache, which is a dictionary key lookup.\n", 1274 | "\n", 1275 | "A good reason to use them is to separate the definition of a regex from its use.\n", 1276 | "\n", 1277 | "#### Splitting a String With or Without Regular Expressions\n", 1278 | "\n", 1279 | "There is a string method split, which can be used to split a string into a list of substrings:\n", 1280 | "\n", 1281 | "``` str.split([sep[, maxsplit]])```\n", 1282 | "\n", 1283 | "As you can see, the method split has two optional parameters. If sep is not given (or is None), the string is split into substrings using whitespace as the delimiter, i.e. 
every substring consisting purely of whitespaces is used as a delimiter.\n", 1284 | "\n", 1285 | "![](https://www.python-course.eu/images/re_split.webp)\n", 1286 | "\n", 1287 | "We demonstrate this behaviour with a famous quotation by Abraham Lincoln:" 1288 | ] 1289 | }, 1290 | { 1291 | "cell_type": "code", 1292 | "metadata": { 1293 | "colab": { 1294 | "base_uri": "https://localhost:8080/" 1295 | }, 1296 | "id": "Bx9abbum-OIg", 1297 | "outputId": "c7904b8c-f287-431e-d72a-1e5892e325d4" 1298 | }, 1299 | "source": [ 1300 | "law_courses = \"Let reverence for the laws be breathed by every American mother to the lisping babe that prattles on her lap. Let it be taught in schools, in seminaries, and in colleges. Let it be written in primers, spelling books, and in almanacs. Let it be preached from the pulpit, proclaimed in legislative halls, and enforced in the courts of justice. And, in short, let it become the political religion of the nation.\"\n", 1301 | "law_courses.split()" 1302 | ], 1303 | "execution_count": 41, 1304 | "outputs": [ 1305 | { 1306 | "output_type": "execute_result", 1307 | "data": { 1308 | "text/plain": [ 1309 | "['Let',\n", 1310 | " 'reverence',\n", 1311 | " 'for',\n", 1312 | " 'the',\n", 1313 | " 'laws',\n", 1314 | " 'be',\n", 1315 | " 'breathed',\n", 1316 | " 'by',\n", 1317 | " 'every',\n", 1318 | " 'American',\n", 1319 | " 'mother',\n", 1320 | " 'to',\n", 1321 | " 'the',\n", 1322 | " 'lisping',\n", 1323 | " 'babe',\n", 1324 | " 'that',\n", 1325 | " 'prattles',\n", 1326 | " 'on',\n", 1327 | " 'her',\n", 1328 | " 'lap.',\n", 1329 | " 'Let',\n", 1330 | " 'it',\n", 1331 | " 'be',\n", 1332 | " 'taught',\n", 1333 | " 'in',\n", 1334 | " 'schools,',\n", 1335 | " 'in',\n", 1336 | " 'seminaries,',\n", 1337 | " 'and',\n", 1338 | " 'in',\n", 1339 | " 'colleges.',\n", 1340 | " 'Let',\n", 1341 | " 'it',\n", 1342 | " 'be',\n", 1343 | " 'written',\n", 1344 | " 'in',\n", 1345 | " 'primers,',\n", 1346 | " 'spelling',\n", 1347 | " 'books,',\n", 1348 | " 'and',\n", 
1349 | " 'in',\n", 1350 | " 'almanacs.',\n", 1351 | " 'Let',\n", 1352 | " 'it',\n", 1353 | " 'be',\n", 1354 | " 'preached',\n", 1355 | " 'from',\n", 1356 | " 'the',\n", 1357 | " 'pulpit,',\n", 1358 | " 'proclaimed',\n", 1359 | " 'in',\n", 1360 | " 'legislative',\n", 1361 | " 'halls,',\n", 1362 | " 'and',\n", 1363 | " 'enforced',\n", 1364 | " 'in',\n", 1365 | " 'the',\n", 1366 | " 'courts',\n", 1367 | " 'of',\n", 1368 | " 'justice.',\n", 1369 | " 'And,',\n", 1370 | " 'in',\n", 1371 | " 'short,',\n", 1372 | " 'let',\n", 1373 | " 'it',\n", 1374 | " 'become',\n", 1375 | " 'the',\n", 1376 | " 'political',\n", 1377 | " 'religion',\n", 1378 | " 'of',\n", 1379 | " 'the',\n", 1380 | " 'nation.']" 1381 | ] 1382 | }, 1383 | "metadata": {}, 1384 | "execution_count": 41 1385 | } 1386 | ] 1387 | }, 1388 | { 1389 | "cell_type": "markdown", 1390 | "metadata": { 1391 | "id": "tjOlb25EDa5v" 1392 | }, 1393 | "source": [ 1394 | "Now we look at a string, which could stem from an Excel or an OpenOffice calc file. We have seen in our previous example that split takes whitespaces as default separators. We want to split the string in the following little example using semicolons as separators. 
The only thing we have to do is to use \";\" as an argument of split():" 1395 | ] 1396 | }, 1397 | { 1398 | "cell_type": "code", 1399 | "metadata": { 1400 | "colab": { 1401 | "base_uri": "https://localhost:8080/" 1402 | }, 1403 | "id": "m_oKZMcQDHaX", 1404 | "outputId": "7553e3e9-7ffd-49fe-8662-74134f748c1c" 1405 | }, 1406 | "source": [ 1407 | "line = \"James;Miller;teacher;Perl\"\n", 1408 | "line.split(\";\")" 1409 | ], 1410 | "execution_count": 42, 1411 | "outputs": [ 1412 | { 1413 | "output_type": "execute_result", 1414 | "data": { 1415 | "text/plain": [ 1416 | "['James', 'Miller', 'teacher', 'Perl']" 1417 | ] 1418 | }, 1419 | "metadata": {}, 1420 | "execution_count": 42 1421 | } 1422 | ] 1423 | }, 1424 | { 1425 | "cell_type": "markdown", 1426 | "metadata": { 1427 | "id": "CtFK-OQdDihx" 1428 | }, 1429 | "source": [ 1430 | "The method split() has another optional parameter: maxsplit. If maxsplit is given, at most maxsplit splits are done. This means that the resulting list will have at most \"maxsplit + 1\" elements. We will illustrate the mode of operation of maxsplit in the next example:" 1431 | ] 1432 | }, 1433 | { 1434 | "cell_type": "code", 1435 | "metadata": { 1436 | "colab": { 1437 | "base_uri": "https://localhost:8080/" 1438 | }, 1439 | "id": "uqFzU0d5Dc7P", 1440 | "outputId": "e5e38ca8-ee7e-4216-c75f-c8588a22586c" 1441 | }, 1442 | "source": [ 1443 | "mammon = \"The god of the world's leading religion. The chief temple is in the holy city of New York.\"\n", 1444 | "mammon.split(\" \",3)" 1445 | ], 1446 | "execution_count": 43, 1447 | "outputs": [ 1448 | { 1449 | "output_type": "execute_result", 1450 | "data": { 1451 | "text/plain": [ 1452 | "['The',\n", 1453 | " 'god',\n", 1454 | " 'of',\n", 1455 | " \"the world's leading religion. 
The chief temple is in the holy city of New York.\"]" 1456 | ] 1457 | }, 1458 | "metadata": {}, 1459 | "execution_count": 43 1460 | } 1461 | ] 1462 | }, 1463 | { 1464 | "cell_type": "markdown", 1465 | "metadata": { 1466 | "id": "tPHgU5gjDqwb" 1467 | }, 1468 | "source": [ 1469 | "We used a single blank as the delimiter string in the previous example, which can be a problem: if several blanks or other whitespace characters occur in a row, split() will split the string after every single blank, so that we get empty strings and strings containing only a tab ('\\t') in our result list:" 1470 | ] 1471 | }, 1472 | { 1473 | "cell_type": "code", 1474 | "metadata": { 1475 | "colab": { 1476 | "base_uri": "https://localhost:8080/" 1477 | }, 1478 | "id": "x-y8ZGy5DkfL", 1479 | "outputId": "1e74c135-269b-4015-c6ed-b0249e724339" 1480 | }, 1481 | "source": [ 1482 | "mammon = \"The god  \\t of the world's leading religion. The chief temple is in the holy city of New York.\"\n", 1483 | "mammon.split(\" \",5)" 1484 | ], 1485 | "execution_count": 44, 1486 | "outputs": [ 1487 | { 1488 | "output_type": "execute_result", 1489 | "data": { 1490 | "text/plain": [ 1491 | "['The',\n", 1492 | " 'god',\n", 1493 | " '',\n", 1494 | " '\\t',\n", 1495 | " 'of',\n", 1496 | " \"the world's leading religion. The chief temple is in the holy city of New York.\"]" 1497 | ] 1498 | }, 1499 | "metadata": {}, 1500 | "execution_count": 44 1501 | } 1502 | ] 1503 | }, 1504 | { 1505 | "cell_type": "markdown", 1506 | "metadata": { 1507 | "id": "ksQY6uKzD9LS" 1508 | }, 1509 | "source": [ 1510 | "We can prevent the separation of empty strings by using None as the first argument. Now split will use the default behaviour, i.e. 
every substring consisting of connected whitespace characters will be taken as one separator:" 1511 | ] 1512 | }, 1513 | { 1514 | "cell_type": "code", 1515 | "metadata": { 1516 | "colab": { 1517 | "base_uri": "https://localhost:8080/" 1518 | }, 1519 | "id": "4GeKhddDDyjb", 1520 | "outputId": "0ad3c905-b05a-437d-f1af-18d410b662b3" 1521 | }, 1522 | "source": [ 1523 | "mammon.split(None,5)" 1524 | ], 1525 | "execution_count": 45, 1526 | "outputs": [ 1527 | { 1528 | "output_type": "execute_result", 1529 | "data": { 1530 | "text/plain": [ 1531 | "['The',\n", 1532 | " 'god',\n", 1533 | " 'of',\n", 1534 | " 'the',\n", 1535 | " \"world's\",\n", 1536 | " 'leading religion. The chief temple is in the holy city of New York.']" 1537 | ] 1538 | }, 1539 | "metadata": {}, 1540 | "execution_count": 45 1541 | } 1542 | ] 1543 | }, 1544 | { 1545 | "cell_type": "markdown", 1546 | "metadata": { 1547 | "id": "v-R8av8gEKfD" 1548 | }, 1549 | "source": [ 1550 | "#### Regular Expression Split\n", 1551 | "\n", 1552 | "The string method split() is the right tool in many cases, but what if you want, for example, just the bare words of a text, without any special characters or whitespace? For this, we have to use the split function from the re module. 
We illustrate this method with a short text from the beginning of Metamorphoses by Ovid:" 1553 | ] 1554 | }, 1555 | { 1556 | "cell_type": "code", 1557 | "metadata": { 1558 | "colab": { 1559 | "base_uri": "https://localhost:8080/" 1560 | }, 1561 | "id": "mKitEEcREIQL", 1562 | "outputId": "05e54512-c90b-4a65-b8ef-726cccdad81d" 1563 | }, 1564 | "source": [ 1565 | "import re\n", 1566 | "metamorphoses = \"OF bodies chang'd to various forms, I sing: Ye Gods, from whom these miracles did spring, Inspire my numbers with coelestial heat;\"\n", 1567 | "re.split(r\"\\W+\", metamorphoses)" 1568 | ], 1569 | "execution_count": 46, 1570 | "outputs": [ 1571 | { 1572 | "output_type": "execute_result", 1573 | "data": { 1574 | "text/plain": [ 1575 | "['OF',\n", 1576 | " 'bodies',\n", 1577 | " 'chang',\n", 1578 | " 'd',\n", 1579 | " 'to',\n", 1580 | " 'various',\n", 1581 | " 'forms',\n", 1582 | " 'I',\n", 1583 | " 'sing',\n", 1584 | " 'Ye',\n", 1585 | " 'Gods',\n", 1586 | " 'from',\n", 1587 | " 'whom',\n", 1588 | " 'these',\n", 1589 | " 'miracles',\n", 1590 | " 'did',\n", 1591 | " 'spring',\n", 1592 | " 'Inspire',\n", 1593 | " 'my',\n", 1594 | " 'numbers',\n", 1595 | " 'with',\n", 1596 | " 'coelestial',\n", 1597 | " 'heat',\n", 1598 | " '']" 1599 | ] 1600 | }, 1601 | "metadata": {}, 1602 | "execution_count": 46 1603 | } 1604 | ] 1605 | }, 1606 | { 1607 | "cell_type": "markdown", 1608 | "metadata": { 1609 | "id": "axXupIvFEiG-" 1610 | }, 1611 | "source": [ 1612 | "The following example is a case where a regular expression is clearly superior to the string split. Let's assume that we have data lines with surnames, first names and professions. We want to clear the data line of the superfluous and redundant text descriptions, i.e. 
\"surname: \", \"prename: \" and so on, so that we have solely the surname in the first column, the first name in the second column and the profession in the third column:" 1613 | ] 1614 | }, 1615 | { 1616 | "cell_type": "code", 1617 | "metadata": { 1618 | "colab": { 1619 | "base_uri": "https://localhost:8080/" 1620 | }, 1621 | "id": "jBXayxTtEZOV", 1622 | "outputId": "9413be21-7828-4609-8be8-cd2a683c1843" 1623 | }, 1624 | "source": [ 1625 | "import re\n", 1626 | "lines = [\"surname: Obama, prename: Barack, profession: president\", \"surname: Merkel, prename: Angela, profession: chancellor\"]\n", 1627 | "for line in lines:\n", 1628 | "    print(re.split(r\",* *\\w*: \", line))" 1629 | ], 1630 | "execution_count": 47, 1631 | "outputs": [ 1632 | { 1633 | "output_type": "stream", 1634 | "name": "stdout", 1635 | "text": [ 1636 | "['', 'Obama', 'Barack', 'president']\n", 1637 | "['', 'Merkel', 'Angela', 'chancellor']\n" 1638 | ] 1639 | } 1640 | ] 1641 | }, 1642 | { 1643 | "cell_type": "markdown", 1644 | "metadata": { 1645 | "id": "9-LnRBN8E2j7" 1646 | }, 1647 | "source": [ 1648 | "We can easily improve the script by using a slice operator, so that we don't have the empty string as the first element of our result lists:" 1649 | ] 1650 | }, 1651 | { 1652 | "cell_type": "code", 1653 | "metadata": { 1654 | "colab": { 1655 | "base_uri": "https://localhost:8080/" 1656 | }, 1657 | "id": "xkmsfFV9E2Eq", 1658 | "outputId": "c0681369-44c8-4e56-f761-5ad0e37a0fb0" 1659 | }, 1660 | "source": [ 1661 | "import re\n", 1662 | "lines = [\"surname: Obama, prename: Barack, profession: president\", \"surname: Merkel, prename: Angela, profession: chancellor\"]\n", 1663 | "for line in lines:\n", 1664 | "    print(re.split(r\",* *\\w*: \", line)[1:])" 1665 | ], 1666 | "execution_count": 48, 1667 | "outputs": [ 1668 | { 1669 | "output_type": "stream", 1670 | "name": "stdout", 1671 | "text": [ 1672 | "['Obama', 'Barack', 'president']\n", 1673 | "['Merkel', 'Angela', 'chancellor']\n" 1674 | ] 1675 | } 
1676 | ] 1677 | }, 1678 | { 1679 | "cell_type": "markdown", 1680 | "metadata": { 1681 | "id": "t7tVohopE-fb" 1682 | }, 1683 | "source": [ 1684 | "#### Search and Replace with sub\n", 1685 | "\n", 1686 | "```re.sub(regex, replacement, subject)```\n", 1687 | "\n", 1688 | "Every match of the regular expression regex in the string subject will be replaced by the string replacement. Example:" 1689 | ] 1690 | }, 1691 | { 1692 | "cell_type": "code", 1693 | "metadata": { 1694 | "colab": { 1695 | "base_uri": "https://localhost:8080/" 1696 | }, 1697 | "id": "aMXXwUu0EjxE", 1698 | "outputId": "1ff0df1d-c014-4068-c432-559c531d2d05" 1699 | }, 1700 | "source": [ 1701 | "import re\n", 1702 | "text = \"yes I said yes I will Yes.\"\n", 1703 | "res = re.sub(\"[yY]es\",\"no\", text)\n", 1704 | "print(res)" 1705 | ], 1706 | "execution_count": 49, 1707 | "outputs": [ 1708 | { 1709 | "output_type": "stream", 1710 | "name": "stdout", 1711 | "text": [ 1712 | "no I said no I will no.\n" 1713 | ] 1714 | } 1715 | ] 1716 | }, 1717 | { 1718 | "cell_type": "code", 1719 | "metadata": { 1720 | "id": "6h1Jm1PkFKub" 1721 | }, 1722 | "source": [ 1723 | "" 1724 | ], 1725 | "execution_count": null, 1726 | "outputs": [] 1727 | } 1728 | ] 1729 | } -------------------------------------------------------------------------------- /week6/Lecture_12_requests.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Lecture 12. 
requests.ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [], 9 | "authorship_tag": "ABX9TyOz0u2npZ/DvoCJ5fZ5sjOG", 10 | "include_colab_link": true 11 | }, 12 | "kernelspec": { 13 | "name": "python3", 14 | "display_name": "Python 3" 15 | }, 16 | "language_info": { 17 | "name": "python" 18 | } 19 | }, 20 | "cells": [ 21 | { 22 | "cell_type": "markdown", 23 | "metadata": { 24 | "id": "view-in-github", 25 | "colab_type": "text" 26 | }, 27 | "source": [ 28 | "\"Open" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": { 34 | "id": "Gf_pWG1fndIo" 35 | }, 36 | "source": [ 37 | "#### In this tutorial we will try to get a response from a [web service](http://www.cbs.dtu.dk/services/NetMHCpan/) from inside this notebook rather than through a browser\n", 38 | "\n", 39 | "Import the required modules:" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "metadata": { 45 | "id": "Mt36MTr-xH4e" 46 | }, 47 | "source": [ 48 | "import requests" 49 | ], 50 | "execution_count": null, 51 | "outputs": [] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": { 56 | "id": "qI4Ixs94nnme" 57 | }, 58 | "source": [ 59 | "#### Link to the server-side script that handles the initial POST request:\n", 60 | "\n", 61 | "Whenever you submit a form on the web, you usually send a POST request to a script whose address you can find by inspecting the submitted form data. 
Search for \"how to view submitted form data\" or take a look at these:\n", 62 | "\n", 63 | "https://www.youtube.com/watch?v=SvUqk683mSA\n", 64 | "\n", 65 | "https://wpscholar.com/blog/view-form-data-in-chrome/" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "metadata": { 71 | "id": "ZpcVQTtxxRu6" 72 | }, 73 | "source": [ 74 | "url = 'http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi'" 75 | ], 76 | "execution_count": null, 77 | "outputs": [] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": { 82 | "id": "WZll65hWosP8" 83 | }, 84 | "source": [ 85 | "#### Build the form data (the `post_data` dictionary, my own name for it; you can find out what to put in it the same way as the address of the handling script, see the links above) and send this POST request with `requests`:" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "metadata": { 91 | "id": "_krexwfv3fv3", 92 | "colab": { 93 | "base_uri": "https://localhost:8080/" 94 | }, 95 | "outputId": "f8e2c555-dcb8-4492-d80a-b8582716ce76" 96 | }, 97 | "source": [ 98 | "post_data = {'configfile': '/usr/opt/www/pub/CBS/services/NetMHCpan-4.1/NetMHCpan.cf',\n", 99 | " 'inp': 0,\n", 100 | " 'SEQPASTE': 'NLVPMVATV',\n", 101 | " 'master': 1,\n", 102 | " 'thrs': 0.5,\n", 103 | " 'thrw': 2,\n", 104 | " 'threshold': -99\n", 105 | " }\n", 106 | "r = requests.post(url, data=post_data) # the response is stored here\n", 107 | "type(r) # we see that the response is wrapped in an instance of the Response class,\n", 108 | " # which is defined in the models module,\n", 109 | " # which lives in the requests package" 110 | ], 111 | "execution_count": null, 112 | "outputs": [ 113 | { 114 | "output_type": "execute_result", 115 | "data": { 116 | "text/plain": [ 117 | "requests.models.Response" 118 | ] 119 | }, 120 | "metadata": { 121 | "tags": [] 122 | }, 123 | "execution_count": 73 124 | } 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "metadata": { 130 | "id": "sN9wj7GVqA_d" 131 | }, 132 | "source": [ 133 | "# By the way, here are all the 
methods (functions) and attributes (properties)\n", 134 | "# that this object has:\n", 135 | "\n", 136 | "# dir(r)" 137 | ], 138 | "execution_count": 1, 139 | "outputs": [] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": { 144 | "id": "qFu_C5DCrusV" 145 | }, 146 | "source": [ 147 | "#### Among others there is the `.content` attribute, which gives us the server response as a byte string containing `html` markup\n", 148 | "\n", 149 | "On this engine, interaction with the user works as follows: first you send a POST request and receive a response (we have done that). The response carries a `jobid`, a string of digits and letters that you (the user) then send to a second handler script, this time with the GET method (simply in the address bar). We extract the address of the second handler from `r`, extract the `jobid` from the same place, and again use the `requests` module, now to make a GET request:" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "metadata": { 155 | "colab": { 156 | "base_uri": "https://localhost:8080/" 157 | }, 158 | "id": "-BR-NRSumHSc", 159 | "outputId": "74856b19-38ed-467f-893b-82e7653fe1eb" 160 | }, 161 | "source": [ 162 | "link = r.content[-120:-45] # address of the second handler (the same as the first one)\n", 163 | " # with the GET data appended to it in the form\n", 164 | " # 'key=value' after the '?' sign\n", 165 | "print(link)\n", 166 | "jobid = link[-24:]\n", 167 | "url2 = link[:-31]\n", 168 | "print(jobid)\n", 169 | "print(url2)" 170 | ], 171 | "execution_count": null, 172 | "outputs": [ 173 | { 174 | "output_type": "stream", 175 | "text": [ 176 | "b'http://www.cbs.dtu.dk//cgi-bin/webface2.fcgi?jobid=6101365800005B22C53E54DA'\n", 177 | "b'6101365800005B22C53E54DA'\n", 178 | "b'http://www.cbs.dtu.dk//cgi-bin/webface2.fcgi'\n" 179 | ], 180 | "name": "stdout" 181 | } 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": { 187 | "id": "wZoa0z_At_B0" 188 | }, 189 | "source": [ 190 | "#### And finally 
we create the new GET request:\n", 191 | "\n", 192 | "Why exactly this request? Again, I found out by inspecting the submitted form data in the browser, as described in the links above" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "metadata": { 198 | "id": "WxkEswry5P-x" 199 | }, 200 | "source": [ 201 | "get_data = {'jobid': jobid,\n", 202 | " 'wait' : '20'\n", 203 | " }\n", 204 | "result = requests.get(url2, params=get_data)" 205 | ], 206 | "execution_count": null, 207 | "outputs": [] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "metadata": { 212 | "id": "3Bcn_vZ65VId", 213 | "colab": { 214 | "base_uri": "https://localhost:8080/" 215 | }, 216 | "outputId": "d379f2cc-a56a-4393-9099-3267c7d1a06d" 217 | }, 218 | "source": [ 219 | "result.content # this holds the response we need, as html bytes" 220 | ], 221 | "execution_count": null, 222 | "outputs": [ 223 | { 224 | "output_type": "execute_result", 225 | "data": { 226 | "text/plain": [ 227 | "b'\n NetMHCpan 4.1 Server - prediction results\n\n\n
\\n\\n
\\n    \\n

NetMHCpan Server - prediction results

\\n

Technical University of Denmark

\\n
\\n
\\n
\\n
\\n
\\n\\n# NetMHCpan version 4.1b\\n\\n# Tmpdir made /usr/opt/www/webface/tmp/server/netmhcpan/6101365800005B22C53E54DA/netMHCpanWfQpLk\\n# Input is in FSA format\\n\\n# Peptide length 8,9,10,11\\n\\n# Make EL predictions\\n\\nHLA-A02:01 : Distance to training data  0.000 (using nearest neighbor HLA-A02:01)\\n\\n# Rank Threshold for Strong binding peptides   0.500\\n# Rank Threshold for Weak binding peptides   2.000\\n---------------------------------------------------------------------------------------------------------------------------\\n Pos         MHC        Peptide      Core Of Gp Gl Ip Il        Icore        Identity  Score_EL %Rank_EL BindLevel\\n---------------------------------------------------------------------------------------------------------------------------\\n   1 HLA-A*02:01      NLVPMVATV NLVPMVATV  0  0  0  0  0    NLVPMVATV        Sequence 0.8323630    0.085 <= SB\\n   1 HLA-A*02:01       NLVPMVAT NLV-PMVAT  0  0  0  3  1     NLVPMVAT        Sequence 0.0008310   19.184\\n   2 HLA-A*02:01       LVPMVATV -LVPMVATV  0  0  0  0  1     LVPMVATV        Sequence 0.0170850    5.151\\n---------------------------------------------------------------------------------------------------------------------------\\n\\nProtein Sequence. Allele HLA-A*02:01. Number of high binders 1. Number of weak binders 0. Number of peptides 3\\n\\nLink to Allele Frequencies in Worldwide Populations HLA-A02:01\\n-----------------------------------------------------------------------------------\\n\\nExplain the output.  Go back.\\n
\\n\\n
\\n\\n\\n'" 228 | ] 229 | }, 230 | "metadata": { 231 | "tags": [] 232 | }, 233 | "execution_count": 93 234 | } 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": { 240 | "id": "-TVG5IQRzAKj" 241 | }, 242 | "source": [ 243 | "#### Сохраним его в виде `html` странице на сервере колаба:" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "metadata": { 249 | "colab": { 250 | "base_uri": "https://localhost:8080/" 251 | }, 252 | "id": "fLtcPzvanSAg", 253 | "outputId": "6c6ca909-7425-42f3-d228-6f4b5ff7c6f0" 254 | }, 255 | "source": [ 256 | "with open(\"results.html\", \"w\") as f:\n", 257 | " f.write(str(result.text))\n", 258 | "!ls # удостоверимся, что сохранили" 259 | ], 260 | "execution_count": null, 261 | "outputs": [ 262 | { 263 | "output_type": "stream", 264 | "text": [ 265 | "results.html sample_data\n" 266 | ], 267 | "name": "stdout" 268 | } 269 | ] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "metadata": { 274 | "id": "_0eBCrEHzT7R" 275 | }, 276 | "source": [ 277 | "#### Ну и запустим вебсервер в колабе, чтобы просмотреть наш результат как веб страницу - нужно нажать на появившуюся ссылку и выбрать потом наш `html` файл (чтобы делать потом что-то еще нужно будет остановить вебсервер, остановив ячейку):" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "metadata": { 283 | "colab": { 284 | "base_uri": "https://localhost:8080/", 285 | "height": 238 286 | }, 287 | "id": "l4K3oh1SxNlx", 288 | "outputId": "29dd747a-dcc9-41b2-cc09-ad93a537517e" 289 | }, 290 | "source": [ 291 | "from google.colab.output import eval_js\n", 292 | "print(eval_js(\"google.colab.kernel.proxyPort(8000)\"))\n", 293 | "# https://z4spb7cvssd-496ff2e9c6d22116-8000-colab.googleusercontent.com/\n", 294 | "!python -m http.server 8000" 295 | ], 296 | "execution_count": null, 297 | "outputs": [ 298 | { 299 | "output_type": "stream", 300 | "text": [ 301 | "https://mp1ubn0bvld-496ff2e9c6d22116-8000-colab.googleusercontent.com/\n", 302 | "Serving HTTP on 
0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...\n", 303 | "127.0.0.1 - - [28/Jul/2021 11:10:43] \"GET / HTTP/1.1\" 200 -\n", 304 | "127.0.0.1 - - [28/Jul/2021 11:10:46] code 404, message File not found\n", 305 | "127.0.0.1 - - [28/Jul/2021 11:10:46] \"GET /favicon.ico HTTP/1.1\" 404 -\n", 306 | "127.0.0.1 - - [28/Jul/2021 11:10:47] \"GET /results.html HTTP/1.1\" 200 -\n", 307 | "127.0.0.1 - - [28/Jul/2021 11:10:50] code 404, message File not found\n", 308 | "127.0.0.1 - - [28/Jul/2021 11:10:50] \"GET /images/m_logo.gif HTTP/1.1\" 404 -\n", 309 | "127.0.0.1 - - [28/Jul/2021 11:10:50] code 404, message File not found\n", 310 | "127.0.0.1 - - [28/Jul/2021 11:10:50] \"GET /favicon.ico HTTP/1.1\" 404 -\n", 311 | "\n", 312 | "Keyboard interrupt received, exiting.\n", 313 | "^C\n" 314 | ], 315 | "name": "stdout" 316 | } 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": { 322 | "id": "XkguaHDGzyBN" 323 | }, 324 | "source": [ 325 | "That's all." 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "metadata": { 331 | "id": "QEi2YRTByMZw" 332 | }, 333 | "source": [ 334 | "" 335 | ], 336 | "execution_count": null, 337 | "outputs": [] 338 | } 339 | ] 340 | } --------------------------------------------------------------------------------
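A side note on the requests notebook above: it pulls the follow-up link out of the response with fixed offsets (`r.content[-120:-45]`, `link[-24:]`, `link[:-31]`), which only works while the page layout and the `jobid` length stay exactly the same. The sketch below shows one possible alternative that uses a named-group regex from the `re` lecture instead. It is an illustration only, not the notebook's method: the helper name `extract_job_link`, the pattern `JOB_LINK`, and the HTML wrapped around the sample link are assumptions, and the real page may embed the link differently. Only the link itself is copied from the notebook output.

```python
import re

# Hypothetical pattern (not from the notebook): a URL followed by '?jobid=<hex digits>'.
# It is a bytes pattern because Response.content is bytes.
JOB_LINK = re.compile(rb'(?P<url>https?://[^\s?<>"]+)\?jobid=(?P<jobid>[0-9A-Fa-f]+)')


def extract_job_link(body):
    """Return (url, jobid) as str, taken from the first '?jobid=' link in body."""
    mo = JOB_LINK.search(body)
    if mo is None:
        raise ValueError("no '?jobid=' link found in the response body")
    return mo.group("url").decode("ascii"), mo.group("jobid").decode("ascii")


# Sample bytes: the link is the one printed in the notebook output,
# the surrounding tag is made up for this illustration.
sample = (b'<meta http-equiv="refresh" content="0; url='
          b'http://www.cbs.dtu.dk//cgi-bin/webface2.fcgi?jobid=6101365800005B22C53E54DA">')

url2, jobid = extract_job_link(sample)
print(url2)
print(jobid)
```

With a live response, the same helper would be called as `extract_job_link(r.content)`, and the two values passed on as in the notebook: `requests.get(url2, params={'jobid': jobid, 'wait': '20'})`.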