├── 1_XNhMJunrlY_Xfei8D8sPeg.png
├── 20_Pandas_Functions_for_80_of_your_Data_Science Tasks.ipynb
├── Best_Practices_To_Use_Pandas_Efficiently_As_A_Data_Scientist.ipynb
├── Data Exploration Becomes Easier & Better With Pandas Profiling.md
├── Datasets
├── Popular_Baby_Names.csv
├── Readme.md
├── baseball_stats.csv
├── pokemon.csv
├── poker_hand.csv
└── restaurant_data.csv
├── How To Eliminate Loops From Your Python Code.ipynb
├── Make_Your_Pandas_Code_1000_Times_Faster_With_This Trick.ipynb
├── README.md
├── Selecting_&_Replacing_Values_In_Pandas_DataFrame_Effectively.ipynb
├── Stop_Looping_Through_Pandas_DataFrames_&_Do_This Instead.ipynb
├── Write Efficient Python Code [Defining and Measuring Code Efficiency].ipynb
└── Write_Efficient_Python_Code_(Optimizing_Your Code).ipynb
/1_XNhMJunrlY_Xfei8D8sPeg.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/youssefHosni/Efficient-Python-for-Data-Scientists-Book/2507c7f72e3cb08ef349047a4ca052bcc7307337/1_XNhMJunrlY_Xfei8D8sPeg.png
--------------------------------------------------------------------------------
/Data Exploration Becomes Easier & Better With Pandas Profiling.md:
--------------------------------------------------------------------------------
1 | # Data Exploration Becomes Easier & Better With Pandas Profiling
2 |
3 | Data exploration is a crucial step in any data analysis or data science project. It allows you to gain a deeper understanding of your data, discover patterns and relationships, and spot potential issues or outliers.
4 |
5 | One of the most popular tools for data exploration is the Python library Pandas. The library provides a powerful set of tools for working with data, including data cleaning, transformation, and visualization. However, even with the powerful capabilities of Pandas, data exploration can still be a time-consuming and tedious task. That's where Pandas Profiling comes in.
6 |
7 | With Pandas Profiling, you can easily generate detailed reports of your data, including summary statistics, missing values, and correlations, making data exploration faster and more efficient. This article will explore how Pandas Profiling can help you improve your data exploration process and make it easier to understand your data.
8 |
9 | ## Table of Contents:
10 | 1. What is Pandas Profiling?
11 | 2. Installation of Pandas Profiling
12 | 3. Pandas Profiling in Action
13 | 4. Drawbacks of Pandas Profiling & How to Overcome Them
14 |
15 | ## 1. What is Pandas Profiling?
16 |
17 | Pandas profiling is a Python library that generates a comprehensive report of a DataFrame, including information about the number of rows and columns, missing values, data types, and other statistics. It can be used to quickly identify potential issues or outliers in the data, and can also be used to generate summary statistics and visualizations of the data.
18 |
19 | The report generated by the pandas profiling library typically includes a variety of information about the dataset, including:
20 |
21 | * Overview: Summary statistics for all columns, including the number of rows, missing values, and data types.
22 | * Variables: Information about each column, including the number of unique values, missing values, and the top frequent values.
23 | * Correlations: Correlation matrix and heatmap, showing the relationship between different variables.
24 | * Distribution: Histograms and kernel density plots for each column, showing the distribution of values.
25 | * Categorical Variables: Bar plots for categorical variables, showing the frequency of each category.
26 | * Numerical Variables: Box plots for numerical variables, showing the distribution of values and outliers.
27 | * Text: Information about text columns, including the number of characters and words.
28 | * File: Information about file columns, including the number of files, and the size of each file.
29 | * High-Cardinality: Information about high-cardinality categorical variables, including their most frequent values.
30 | * Sample: A sample of the data, with the first and last few rows displayed.
31 |
32 | It is worth noting that the report is interactive, and you can drill down into each section for more details.
33 |
34 | ## 2. Installation of Pandas Profiling
35 | To install pandas-profiling, you can use the following command:
36 |
37 |
38 | ```python
39 | import sys
40 |
41 | !"{sys.executable}" -m pip install -U pandas-profiling[notebook]
42 | !jupyter nbextension enable --py widgetsnbextension
43 | ```
44 |
45 | Collecting pandas-profiling[notebook]
46 |   Using cached pandas_profiling-3.6.3-py2.py3-none-any.whl (328 kB)
... (dependency-resolution log trimmed) ...
136 | Successfully installed PyWavelets-1.3.0 htmlmin-0.1.12 imagehash-4.3.1 multimethod-1.9.1 networkx-2.6.3 packaging-23.0 pandas-profiling-3.6.3 phik-0.12.3 pydantic-1.10.4 statsmodels-0.13.5 tangled-up-in-unicode-0.2.0 typeguard-2.13.3 visions-0.7.5
137 |
138 |
139 | Enabling notebook extension jupyter-js-widgets/extension...
140 | - Validating: ok
141 |
142 |
143 |
144 | ```python
145 | import pandas as pd
146 | import pandas_profiling as pp
147 | ```
148 |
149 | ## 3. Pandas Profiling in Action
150 | Let's put Pandas Profiling into action and see how it works. We will use the Popular Baby Names dataset.
151 |
152 |
153 | ```python
154 | Popular_baby_names_df = pd.read_csv('Popular_Baby_Names.csv')
155 | Popular_baby_names_df.head()
156 | ```
157 |
158 |
159 |
160 |
161 |
\n",
283 | " "
284 | ]
285 | },
286 | "metadata": {},
287 | "execution_count": 1
288 | }
289 | ]
290 | },
291 | {
292 | "cell_type": "markdown",
293 | "source": [
294 | "In each poker round, each player has five cards in hand, each one characterized by its symbol, which can be either hearts, diamonds, clubs, or spades, and its rank, which ranges from 1 to 13. The dataset consists of every possible combination of five cards one person can possess.\n",
295 | "* Sn: symbol of the n-th card where: 1 (Hearts), 2 (Diamonds), 3 (Clubs), 4 (Spades)\n",
296 | "* Rn: rank of the n-th card where: 1 (Ace), 2–10, 11 (Jack), 12 (Queen), 13 (King)"
297 | ],
298 | "metadata": {
299 | "id": "vFyaahTCrz4n"
300 | }
301 | },
302 | {
303 | "cell_type": "markdown",
304 | "source": [
305 | "## 1. Why do We need Efficient Coding?\n"
306 | ],
307 | "metadata": {
308 | "id": "YLPUOB6xyxdv"
309 | }
310 | },
311 | {
312 | "cell_type": "markdown",
313 | "source": [
314 | "Efficient code is code that executes faster and with lower memory usage. In this article, we will use the time.time() function to measure the computational time. \n",
315 | "\n",
316 | "This function returns the current time, so we record it in a variable once before the code executes and once after; the difference between the two readings is the code's running time. A simple example is shown in the code below:\n",
317 | "\n"
318 | ],
319 | "metadata": {
320 | "id": "PBLJceGKy1FR"
321 | }
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": null,
326 | "metadata": {
327 | "colab": {
328 | "base_uri": "https://localhost:8080/"
329 | },
330 | "id": "0ronfYdopjMo",
331 | "outputId": "7d2f0a38-b742-40ab-826f-6e8c9276cd41"
332 | },
333 | "outputs": [
334 | {
335 | "output_type": "stream",
336 | "name": "stdout",
337 | "text": [
338 | "Result calculated in 0.0001010894775390625 sec\n"
339 | ]
340 | }
341 | ],
342 | "source": [
343 | "# record time before execution\n",
344 | "start_time = time.time()\n",
345 | "# execute operation\n",
346 | "result = 5 + 2\n",
347 | "# record time after execution\n",
348 | "end_time = time.time()\n",
349 | "print(\"Result calculated in {} sec\".format(end_time - start_time))"
350 | ]
351 | },
352 | {
353 | "cell_type": "markdown",
354 | "source": [
355 | "Let's look at some examples of how efficient coding methods improve runtime. We will calculate the square of each number from zero up to a million, first using a list comprehension and then repeating the same operation with a for loop.\n",
356 | "\n",
357 | "First using list comprehension:\n",
358 | "\n",
359 | "\n"
360 | ],
361 | "metadata": {
362 | "id": "PMjqHZUYy6Im"
363 | }
364 | },
365 | {
366 | "cell_type": "code",
367 | "source": [
368 | "#using List comprehension \n",
369 | "list_comp_start_time = time.time()\n",
370 | "result = [i*i for i in range(0,1000000)]\n",
371 | "list_comp_end_time = time.time()\n",
372 | "print(\"Time using the list_comprehension: {} sec\".format(list_comp_end_time -\n",
373 | "list_comp_start_time))"
374 | ],
375 | "metadata": {
376 | "colab": {
377 | "base_uri": "https://localhost:8080/"
378 | },
379 | "id": "OeXxZOMRy4Y-",
380 | "outputId": "4f25b0ce-dff4-440c-f2d4-f7d703cc7229"
381 | },
382 | "execution_count": null,
383 | "outputs": [
384 | {
385 | "output_type": "stream",
386 | "name": "stdout",
387 | "text": [
388 | "Time using the list_comprehension: 0.12260246276855469 sec\n"
389 | ]
390 | }
391 | ]
392 | },
393 | {
394 | "cell_type": "markdown",
395 | "source": [
396 | "Now we will use a for loop to execute the same operation:\n",
397 | "\n"
398 | ],
399 | "metadata": {
400 | "id": "pqoqyCQbzBTt"
401 | }
402 | },
403 | {
404 | "cell_type": "code",
405 | "source": [
406 | "# Using For loop\n",
407 | "for_loop_start_time= time.time()\n",
408 | "result=[]\n",
409 | "for i in range(0,1000000):\n",
410 | " result.append(i*i)\n",
411 | "for_loop_end_time= time.time()\n",
412 | "print(\"Time using the for loop: {} sec\".format(for_loop_end_time - for_loop_start_time))"
413 | ],
414 | "metadata": {
415 | "colab": {
416 | "base_uri": "https://localhost:8080/"
417 | },
418 | "id": "pVJcr30ZzH9z",
419 | "outputId": "9a409a1b-3e26-454c-af52-e50dff54c165"
420 | },
421 | "execution_count": null,
422 | "outputs": [
423 | {
424 | "output_type": "stream",
425 | "name": "stdout",
426 | "text": [
427 | "Time using the for loop: 0.37175941467285156 sec\n"
428 | ]
429 | }
430 | ]
431 | },
432 | {
433 | "cell_type": "markdown",
434 | "source": [
435 | "We can see a big difference between the two; let's express it as a percentage:\n",
436 | "\n"
437 | ],
438 | "metadata": {
439 | "id": "gGKbSvTiznVq"
440 | }
441 | },
442 | {
443 | "cell_type": "code",
444 | "source": [
445 | "list_comp_time = list_comp_end_time - list_comp_start_time\n",
446 | "for_loop_time = for_loop_end_time - for_loop_start_time\n",
447 | "print(\"Difference in time: {} %\".format((for_loop_time - list_comp_time)/\n",
448 | "list_comp_time*100))"
449 | ],
450 | "metadata": {
451 | "colab": {
452 | "base_uri": "https://localhost:8080/"
453 | },
454 | "id": "Ka7TiiU7zO4j",
455 | "outputId": "e36554c8-0761-4261-d906-5debaa21be6b"
456 | },
457 | "execution_count": null,
458 | "outputs": [
459 | {
460 | "output_type": "stream",
461 | "name": "stdout",
462 | "text": [
463 | "Difference in time: 203.22344778232394 %\n"
464 | ]
465 | }
466 | ]
467 | },
468 | {
469 | "cell_type": "markdown",
470 | "source": [
471 | "Here is another example of the effect of writing efficient code. We would like to calculate the sum of all consecutive integers from 1 to 1 million. One approach uses a closed-form formula: the sum of the integers from 1 up to N is N*(N+1)/2. This problem was famously given to schoolchildren in 19th-century Germany, and a bright student named Carl Friedrich Gauss devised this formula to solve it in seconds.\n",
472 | "\n"
473 | ],
474 | "metadata": {
475 | "id": "8yWk4HrKzrjF"
476 | }
477 | },
478 | {
479 | "cell_type": "code",
480 | "source": [
481 | "def sum_formula(N):\n",
482 | " return N*(N+1)/2\n",
483 | " \n",
484 | "# Using the formula\n",
485 | "formula_start_time = time.time()\n",
486 | "formula_result = sum_formula(1000000)\n",
487 | "formula_end_time = time.time()\n",
488 | "\n",
489 | "print(\"Time using the formula: {} sec\".format(formula_end_time - formula_start_time))"
490 | ],
491 | "metadata": {
492 | "colab": {
493 | "base_uri": "https://localhost:8080/"
494 | },
495 | "id": "zglHUf7t0OPK",
496 | "outputId": "fc6007e5-5a75-4791-b9f6-0b9c6bc179b4"
497 | },
498 | "execution_count": null,
499 | "outputs": [
500 | {
501 | "output_type": "stream",
502 | "name": "stdout",
503 | "text": [
504 | "Time using the formula: 5.8650970458984375e-05 sec\n"
505 | ]
506 | }
507 | ]
508 | },
509 | {
510 | "cell_type": "markdown",
511 | "source": [
512 | "The alternative is brute force: loop over every integer from 1 up to N and add it to a running total. As the timing below shows, this is far less efficient than the formula."
513 | ],
514 | "metadata": {
515 | "id": "Iek6ZUDuzvJe"
516 | }
517 | },
518 | {
519 | "cell_type": "code",
520 | "source": [
521 | "def sum_brute_force(N):\n",
522 | " res = 0\n",
523 | " for i in range(1,N+1):\n",
524 | " res+=i\n",
525 | " return res\n",
526 | "\n",
527 | "# Using brute force\n",
528 | "bf_start_time = time.time()\n",
529 | "bf_result = sum_brute_force(1000000)\n",
530 | "bf_end_time = time.time()\n",
531 | "\n",
532 | "print(\"Time using brute force: {} sec\".format(bf_end_time - bf_start_time))"
533 | ],
534 | "metadata": {
535 | "colab": {
536 | "base_uri": "https://localhost:8080/"
537 | },
538 | "id": "kLHOBkEG012K",
539 | "outputId": "8c61e2ef-2335-49bb-eb3e-e599af4927ad"
540 | },
541 | "execution_count": null,
542 | "outputs": [
543 | {
544 | "output_type": "stream",
545 | "name": "stdout",
546 | "text": [
547 | "Time using brute force: 0.06304192543029785 sec\n"
548 | ]
549 | }
550 | ]
551 | },
552 | {
553 | "cell_type": "markdown",
554 | "source": [
555 | "After running both methods, the formula is roughly a thousand times faster than the brute-force loop, an improvement of over 100,000%, which clearly demonstrates why we need efficient and optimized code, even for simple tasks.\n",
556 | "\n"
557 | ],
558 | "metadata": {
559 | "id": "pD2YNf5Lzxg_"
560 | }
561 | },
562 | {
563 | "cell_type": "markdown",
564 | "source": [
565 | "One of the most inefficient ways to write Python code is to rely on many explicit loops, especially over large data. As a data scientist, you will iterate through your DataFrames extensively, particularly in the data preparation and exploration phases, so it is important to do this efficiently; it will save you a great deal of time and leave room for more important work. We will walk through three methods to make your loops much faster and more efficient:\n",
566 | "\n",
567 | "* Looping using the .iterrows() function\n",
568 | "* Looping using the .apply() function\n",
569 | "* Vectorization\n"
570 | ],
571 | "metadata": {
572 | "id": "T2oxGNcW2SNq"
573 | }
574 | },
575 | {
576 | "cell_type": "markdown",
577 | "source": [
578 | "## 2. Looping effectively using .iterrows()\n",
579 | "Before we talk about how to use the .iterrows() function to improve the looping process, let’s refresh the notion of a generator function.\n",
580 | "\n",
581 | "Generators are a simple tool for creating iterators. Inside the body of a generator, instead of return statements you will find yield statements; there can be one or several. Below is a generator, city_name_generator(), that produces four city names. We assign the generator to the variable city_names for simplicity.\n",
582 | "\n"
583 | ],
584 | "metadata": {
585 | "id": "SQjPBhCL2aDU"
586 | }
587 | },
588 | {
589 | "cell_type": "code",
590 | "source": [
591 | "def city_name_generator():\n",
" yield 'New York'\n",
" yield 'London'\n",
" yield 'Tokyo'\n",
" yield 'Sao Paolo'\n",
596 | "\n",
597 | "city_names = city_name_generator()\n"
598 | ],
599 | "metadata": {
600 | "id": "Gs6UTJjqqLT-"
601 | },
602 | "execution_count": null,
603 | "outputs": []
604 | },
605 | {
606 | "cell_type": "markdown",
607 | "source": [
608 | "To access the elements that the generator yields, we can use Python's next() function. Each call to next() makes the generator produce its next value, until there are no more values to yield. Since we have four cities here, let's call next() four times and see what it returns:\n",
609 | "\n"
610 | ],
611 | "metadata": {
612 | "id": "x7-XgpHX2fH7"
613 | }
614 | },
615 | {
616 | "cell_type": "code",
617 | "source": [
618 | "next(city_names)"
619 | ],
620 | "metadata": {
621 | "colab": {
622 | "base_uri": "https://localhost:8080/",
623 | "height": 35
624 | },
625 | "id": "BdS9iCA2ESQu",
626 | "outputId": "8a25c76a-6fa2-47db-ddf9-977d7b57932f"
627 | },
628 | "execution_count": null,
629 | "outputs": [
630 | {
631 | "output_type": "execute_result",
632 | "data": {
633 | "text/plain": [
634 | "'New York'"
635 | ],
636 | "application/vnd.google.colaboratory.intrinsic+json": {
637 | "type": "string"
638 | }
639 | },
640 | "metadata": {},
641 | "execution_count": 2
642 | }
643 | ]
644 | },
645 | {
646 | "cell_type": "code",
647 | "source": [
648 | "next(city_names)"
649 | ],
650 | "metadata": {
651 | "colab": {
652 | "base_uri": "https://localhost:8080/",
653 | "height": 35
654 | },
655 | "id": "vRT4Eaz1GgUI",
656 | "outputId": "14fae336-6331-48e8-fb83-317d168014eb"
657 | },
658 | "execution_count": null,
659 | "outputs": [
660 | {
661 | "output_type": "execute_result",
662 | "data": {
663 | "text/plain": [
664 | "'London'"
665 | ],
666 | "application/vnd.google.colaboratory.intrinsic+json": {
667 | "type": "string"
668 | }
669 | },
670 | "metadata": {},
671 | "execution_count": 3
672 | }
673 | ]
674 | },
675 | {
676 | "cell_type": "code",
677 | "source": [
678 | "next(city_names)"
679 | ],
680 | "metadata": {
681 | "colab": {
682 | "base_uri": "https://localhost:8080/",
683 | "height": 35
684 | },
685 | "id": "hA6SCwAUGh79",
686 | "outputId": "0eaf8af0-f56f-4534-ffad-ce3c10590f4e"
687 | },
688 | "execution_count": null,
689 | "outputs": [
690 | {
691 | "output_type": "execute_result",
692 | "data": {
693 | "text/plain": [
694 | "'Tokyo'"
695 | ],
696 | "application/vnd.google.colaboratory.intrinsic+json": {
697 | "type": "string"
698 | }
699 | },
700 | "metadata": {},
701 | "execution_count": 4
702 | }
703 | ]
704 | },
705 | {
706 | "cell_type": "code",
707 | "source": [
708 | "next(city_names)"
709 | ],
710 | "metadata": {
711 | "colab": {
712 | "base_uri": "https://localhost:8080/",
713 | "height": 35
714 | },
715 | "id": "rKjG6CzhGjob",
716 | "outputId": "741af17b-186c-4ab0-d475-fe18e49cd5c9"
717 | },
718 | "execution_count": null,
719 | "outputs": [
720 | {
721 | "output_type": "execute_result",
722 | "data": {
723 | "text/plain": [
724 | "'Sao Paolo'"
725 | ],
726 | "application/vnd.google.colaboratory.intrinsic+json": {
727 | "type": "string"
728 | }
729 | },
730 | "metadata": {},
731 | "execution_count": 5
732 | }
733 | ]
734 | },
735 | {
736 | "cell_type": "markdown",
737 | "source": [
738 | "As we can see, each call to next() returns a new city name.\n",
739 | "\n"
740 | ],
741 | "metadata": {
742 | "id": "5FFZfu_K2oP8"
743 | }
744 | },
745 | {
746 | "cell_type": "markdown",
747 | "source": [
748 | "Let's go back to the .iterrows() function. The .iterrows() method is available on every pandas DataFrame. When called, it returns a generator that yields a pair for each row: the first element is the row's index, and the second is a pandas Series holding the row's values; in our poker DataFrame, the Symbol and the Rank of each of the five cards. It is very similar to the enumerate() function, which, when applied to a list, returns each element along with its index.\n",
749 | "\n",
750 | "The most intuitive way to iterate through a pandas DataFrame is to loop over its row indices with the range() function, which is often called crude looping. This is shown in the code below:\n",
751 | "\n"
752 | ],
753 | "metadata": {
754 | "id": "Va18cE2T3Rvw"
755 | }
756 | },
757 | {
758 | "cell_type": "code",
759 | "source": [
760 | "start_time = time.time()\n",
761 | "for index in range(poker_data.shape[0]):\n",
762 | " next\n",
763 | "print(\"Time using range(): {} sec\".format(time.time() - start_time))"
764 | ],
765 | "metadata": {
766 | "colab": {
767 | "base_uri": "https://localhost:8080/"
768 | },
769 | "id": "r55m_fxePy5N",
770 | "outputId": "2432444e-a802-4e92-e7d1-fd0146b578c3"
771 | },
772 | "execution_count": null,
773 | "outputs": [
774 | {
775 | "output_type": "stream",
776 | "name": "stdout",
777 | "text": [
778 | "Time using range(): 0.0036385059356689453 sec\n"
779 | ]
780 | }
781 | ]
782 | },
783 | {
784 | "cell_type": "markdown",
785 | "source": [
786 | "One smarter way to iterate through a pandas DataFrame is to use the **.iterrows()** function, which is designed for this task. We simply define the **‘for’** loop with two iteration variables: one for the index of each row and one for the row's values.\n",
787 | "\n",
788 | "Inside the loop, the bare **next** statement is simply a placeholder that does nothing; the loop itself advances the iterator on each pass."
789 | ],
790 | "metadata": {
791 | "id": "gFaF1Z-13Way"
792 | }
793 | },
794 | {
795 | "cell_type": "code",
796 | "source": [
797 | "data_generator = poker_data.iterrows()\n",
798 | "start_time = time.time()\n",
799 | "for index, values in data_generator:\n",
800 | " next\n",
801 | "print(\"Time using .iterrows(): {} sec\".format(time.time() - start_time))"
802 | ],
803 | "metadata": {
804 | "colab": {
805 | "base_uri": "https://localhost:8080/"
806 | },
807 | "id": "saqXt53lNPDQ",
808 | "outputId": "2ac4354c-25d1-4032-a9eb-df7d81e8a7d2"
809 | },
810 | "execution_count": null,
811 | "outputs": [
812 | {
813 | "output_type": "stream",
814 | "name": "stdout",
815 | "text": [
816 | "Time using .iterrows(): 1.2583379745483398 sec\n"
817 | ]
818 | }
819 | ]
820 | },
821 | {
822 | "cell_type": "markdown",
823 | "source": [
824 | "Comparing the two computational times, we notice that using .iterrows() does not improve the speed of iterating through a pandas DataFrame. It is very useful, though, when we need a cleaner way to access the values of each row while iterating through the dataset.\n"
825 | ],
826 | "metadata": {
827 | "id": "mD6O-NjL3hah"
828 | }
829 | },
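A minimal sketch of that cleaner access pattern, using a small hypothetical DataFrame rather than the poker dataset:

```python
import pandas as pd

# Hypothetical stand-in for two poker hands (two rank columns only).
df = pd.DataFrame({'R1': [10, 11], 'R2': [12, 13]})

# .iterrows() yields (index, Series) pairs, so values can be
# read by column name instead of by position.
for index, row in df.iterrows():
    print(index, row['R1'] + row['R2'])
```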
830 | {
831 | "cell_type": "markdown",
832 | "source": [
833 | "## 3. Looping Effectively Using .apply()"
834 | ],
835 | "metadata": {
836 | "id": "0fqYrl6F3q-_"
837 | }
838 | },
839 | {
840 | "cell_type": "markdown",
841 | "source": [
842 | "Now we will use the **.apply()** function to be able to perform a specific task while iterating through a pandas DataFrame. The **.apply()** function does exactly what it says; it applies another function to the whole DataFrame.\n",
843 | "\n",
844 | "The syntax of the **.apply()** function is simple: we create a mapping, using a lambda function in this case, and then declare the function we want to apply to every cell. Here, we’re applying the square root function to every cell of the DataFrame. In terms of speed, it matches the speed of just using the NumPy sqrt() function over the whole DataFrame.\n"
845 | ],
846 | "metadata": {
847 | "id": "x6Y4ACXY3pDX"
848 | }
849 | },
850 | {
851 | "cell_type": "code",
852 | "source": [
853 | "data_sqrt = poker_data.apply(lambda x: np.sqrt(x), axis=0)\n",
854 | "data_sqrt.head()"
855 | ],
856 | "metadata": {
857 | "colab": {
858 | "base_uri": "https://localhost:8080/",
859 | "height": 206
860 | },
861 | "id": "GVixY9NAP7JI",
862 | "outputId": "447e5a58-91e1-42d1-cfa4-21582df6b580"
863 | },
864 | "execution_count": null,
865 | "outputs": [
866 | {
867 | "output_type": "execute_result",
868 | "data": {
869 | "text/plain": [
870 | " S1 R1 S2 R2 S3 R3 S4 \\\n",
871 | "0 1.000000 3.162278 1.000000 3.316625 1.000000 3.605551 1.000000 \n",
872 | "1 1.414214 3.316625 1.414214 3.605551 1.414214 3.162278 1.414214 \n",
873 | "2 1.732051 3.464102 1.732051 3.316625 1.732051 3.605551 1.732051 \n",
874 | "3 2.000000 3.162278 2.000000 3.316625 2.000000 1.000000 2.000000 \n",
875 | "4 2.000000 1.000000 2.000000 3.605551 2.000000 3.464102 2.000000 \n",
876 | "\n",
877 | " R4 S5 R5 Class \n",
878 | "0 3.464102 1.000000 1.000000 3.0 \n",
879 | "1 3.464102 1.414214 1.000000 3.0 \n",
880 | "2 3.162278 1.732051 1.000000 3.0 \n",
881 | "3 3.605551 2.000000 3.464102 3.0 \n",
882 | "4 3.316625 2.000000 3.162278 3.0 "
883 | ],
884 | "text/html": [
885 | "\n",
1068 | " "
1069 | ]
1070 | },
1071 | "metadata": {},
1072 | "execution_count": 3
1073 | }
1074 | ]
1075 | },
1076 | {
1077 | "cell_type": "markdown",
1078 | "source": [
1079 | "This is a simple example, since the function we applied takes a single cell as its input.\n",
1080 | "\n",
1081 | "But what happens when the function of interest is taking more than one cell as an input? For example, what if we want to calculate the sum of the rank of all the cards in each hand? In this case, we will use the .apply() function the same way as we did before, but we need to add ‘axis=1’ at the end of the line to specify we’re applying the function to each row.\n",
1082 | "\n"
1083 | ],
1084 | "metadata": {
1085 | "id": "dMg4d5SV4IL-"
1086 | }
1087 | },
1088 | {
1089 | "cell_type": "code",
1090 | "source": [
1091 | "apply_start_time = time.time()\n",
1092 | "poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].apply(lambda x: sum(x), axis=1)\n",
1093 | "apply_end_time = time.time()\n",
1094 | "apply_time = apply_end_time - apply_start_time\n",
1095 | "print(\"Time using .apply(): {} sec\".format(apply_time))"
1096 | ],
1097 | "metadata": {
1098 | "colab": {
1099 | "base_uri": "https://localhost:8080/"
1100 | },
1101 | "id": "6dzHsYQdjtfa",
1102 | "outputId": "4e4351c3-174a-4961-83da-1627f5db1e89"
1103 | },
1104 | "execution_count": null,
1105 | "outputs": [
1106 | {
1107 | "output_type": "stream",
1108 | "name": "stdout",
1109 | "text": [
1110 | "Time using .apply(): 0.2000577449798584 sec\n"
1111 | ]
1112 | }
1113 | ]
1114 | },
1115 | {
1116 | "cell_type": "markdown",
1117 | "source": [
1118 | "Next, we will perform the same task with the .iterrows() function we saw previously, and compare the efficiency of the two approaches.\n",
1119 | "\n"
1120 | ],
1121 | "metadata": {
1122 | "id": "6Jg-EIJV4McK"
1123 | }
1124 | },
1125 | {
1126 | "cell_type": "code",
1127 | "source": [
1128 | "for_loop_start_time = time.time()\n",
1129 | "for ind, value in poker_data.iterrows():\n",
1130 | " sum([value[1], value[3], value[5], value[7], value[9]])\n",
1131 | "for_loop_end_time = time.time()\n",
1132 | "\n",
1133 | "for_loop_time = for_loop_end_time - for_loop_start_time\n",
1134 | "print(\"Time using .iterrows(): {} sec\".format(for_loop_time))"
1135 | ],
1136 | "metadata": {
1137 | "colab": {
1138 | "base_uri": "https://localhost:8080/"
1139 | },
1140 | "id": "sOkXe8gxrXOX",
1141 | "outputId": "2b901612-ffb1-438d-952c-04b9ce221c62"
1142 | },
1143 | "execution_count": null,
1144 | "outputs": [
1145 | {
1146 | "output_type": "stream",
1147 | "name": "stdout",
1148 | "text": [
1149 | "Time using .iterrows(): 1.1545953750610352 sec\n"
1150 | ]
1151 | }
1152 | ]
1153 | },
1154 | {
1155 | "cell_type": "markdown",
1156 | "source": [
1157 | "Using the .apply() function is significantly faster than the .iterrows() function, with a magnitude of around 400 percent, which is a massive improvement!\n",
1158 | "\n"
1159 | ],
1160 | "metadata": {
1161 | "id": "V3FUIUts4PBG"
1162 | }
1163 | },
1164 | {
1165 | "cell_type": "code",
1166 | "source": [
1167 | "print('The difference: {} %'.format((for_loop_time - apply_time) / apply_time * 100))"
1168 | ],
1169 | "metadata": {
1170 | "colab": {
1171 | "base_uri": "https://localhost:8080/"
1172 | },
1173 | "id": "JGFa3H3ysxmE",
1174 | "outputId": "566e5293-2da4-4f63-9b0a-8e065689326f"
1175 | },
1176 | "execution_count": null,
1177 | "outputs": [
1178 | {
1179 | "output_type": "stream",
1180 | "name": "stdout",
1181 | "text": [
1182 | "The difference: 477.1310554246618 %\n"
1183 | ]
1184 | }
1185 | ]
1186 | },
1187 | {
1188 | "cell_type": "markdown",
1189 | "source": [
1190 | "As we did with rows, we can do exactly the same thing for the columns; apply one function to each column. By replacing the axis=1 with axis=0, we can apply the sum function on every column.\n",
1191 | "\n"
1192 | ],
1193 | "metadata": {
1194 | "id": "7FE-SES04VZc"
1195 | }
1196 | },
1197 | {
1198 | "cell_type": "code",
1199 | "source": [
1200 | "apply_start_time = time.time()\n",
1201 | "poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].apply(lambda x: sum(x), axis=0)\n",
1202 | "apply_end_time = time.time()\n",
1203 | "apply_time = apply_end_time - apply_start_time\n",
1204 | "print(\"Time using .apply(): {} sec\".format(apply_time))"
1205 | ],
1206 | "metadata": {
1207 | "id": "TqSssrebte0e",
1208 | "colab": {
1209 | "base_uri": "https://localhost:8080/"
1210 | },
1211 | "outputId": "9aa644fc-cb77-4e73-823b-fd749617722e"
1212 | },
1213 | "execution_count": null,
1214 | "outputs": [
1215 | {
1216 | "output_type": "stream",
1217 | "name": "stdout",
1218 | "text": [
1219 | "Time using .apply(): 0.021090030670166016 sec\n"
1220 | ]
1221 | }
1222 | ]
1223 | },
1224 | {
1225 | "cell_type": "markdown",
1226 | "source": [
1227 | "By comparing the **.apply()** function with the native pandas function for summing each column, we can see that pandas’ native .sum() function performs the same operation faster.\n",
1228 | "\n"
1229 | ],
1230 | "metadata": {
1231 | "id": "KybwcTHL4YNl"
1232 | }
1233 | },
1234 | {
1235 | "cell_type": "code",
1236 | "source": [
1237 | "pandas_start_time = time.time()\n",
1238 | "poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].sum(axis=0)\n",
1239 | "pandas_end_time = time.time()\n",
1240 | "pandas_time = pandas_end_time - pandas_start_time\n",
1241 | "print(\"Time using pandas: {} sec\".format(pandas_time))"
1242 | ],
1243 | "metadata": {
1244 | "colab": {
1245 | "base_uri": "https://localhost:8080/"
1246 | },
1247 | "id": "c9N3WsQDP_kF",
1248 | "outputId": "7a395b28-fb3f-4e58-8189-d24e9714fc62"
1249 | },
1250 | "execution_count": null,
1251 | "outputs": [
1252 | {
1253 | "output_type": "stream",
1254 | "name": "stdout",
1255 | "text": [
1256 | "Time using pandas: 0.0039751529693603516 sec\n"
1257 | ]
1258 | }
1259 | ]
1260 | },
1261 | {
1262 | "cell_type": "code",
1263 | "source": [
1264 | "print('The difference: {} %'.format((apply_time - pandas_time) / pandas_time * 100))"
1265 | ],
1266 | "metadata": {
1267 | "colab": {
1268 | "base_uri": "https://localhost:8080/"
1269 | },
1270 | "id": "89czJRWaQC83",
1271 | "outputId": "1e30f4ea-2d55-4035-9def-e2d587b4296d"
1272 | },
1273 | "execution_count": null,
1274 | "outputs": [
1275 | {
1276 | "output_type": "stream",
1277 | "name": "stdout",
1278 | "text": [
1279 | "The difference: 430.54639237089907 %\n"
1280 | ]
1281 | }
1282 | ]
1283 | },
1284 | {
1285 | "cell_type": "markdown",
1286 | "source": [
1287 | "In conclusion, we observe that the .apply() function performs faster when we want to iterate through all the rows of a pandas DataFrame, but is slower when we perform the same operation through a column.\n",
1288 | "\n"
1289 | ],
1290 | "metadata": {
1291 | "id": "ul9vEJjD40cf"
1292 | }
1293 | },
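As a quick sanity check that both orientations of .apply() agree with the native method, here is a sketch on a small, hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame({'R1': [1, 2], 'R2': [3, 4], 'R3': [5, 6]})

# Row-wise: .apply() with axis=1 matches the native .sum(axis=1).
assert (df.apply(lambda x: sum(x), axis=1) == df.sum(axis=1)).all()

# Column-wise: .apply() with axis=0 matches the native .sum(axis=0).
assert (df.apply(lambda x: sum(x), axis=0) == df.sum(axis=0)).all()
```

The two approaches differ only in speed, not in their results.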
1294 | {
1295 | "cell_type": "markdown",
1296 | "source": [
1297 | "## 4. Looping Effectively Using Vectorization"
1298 | ],
1299 | "metadata": {
1300 | "id": "5RjLUtkmUuhy"
1301 | }
1302 | },
1303 | {
1304 | "cell_type": "markdown",
1305 | "source": [
1306 | "To understand how we can reduce the number of iterations performed, recall that the fundamental units of pandas, the DataFrame and the Series, are both based on arrays. Pandas performs more efficiently when an operation is applied to a whole array at once rather than to each value separately or sequentially. This is achieved through vectorization: the process of executing operations on entire arrays.\n",
1307 | "\n",
1308 | "In the code below we want to calculate the sum of the ranks of all the cards in each hand. In order to do that, we slice the poker dataset, keeping only the columns that contain the ranks of each card. Then, we call the built-in .sum() method of the DataFrame, using the parameter axis=1 to denote that we want the sum for each row. Finally, we time the operation.\n",
1309 | "\n"
1310 | ],
1311 | "metadata": {
1312 | "id": "k1sJKs2d49tX"
1313 | }
1314 | },
1315 | {
1316 | "cell_type": "code",
1317 | "source": [
1318 | "start_time_vectorization = time.time()\n",
1319 | "\n",
1320 | "poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].sum(axis=1)\n",
1321 | "end_time_vectorization = time.time()\n",
1322 | "\n",
1323 | "vectorization_time = end_time_vectorization - start_time_vectorization\n",
1324 | "print(\"Time using pandas vectorization: {} sec\".format(vectorization_time))"
1325 | ],
1326 | "metadata": {
1327 | "colab": {
1328 | "base_uri": "https://localhost:8080/"
1329 | },
1330 | "id": "QsTG45mKQ-Mi",
1331 | "outputId": "1acedf98-4a2a-4f9b-d04c-f4dc062baf52"
1332 | },
1333 | "execution_count": null,
1334 | "outputs": [
1335 | {
1336 | "output_type": "stream",
1337 | "name": "stdout",
1338 | "text": [
1339 | "Time using pandas vectorization: 0.009327411651611328 sec\n"
1340 | ]
1341 | }
1342 | ]
1343 | },
1344 | {
1345 | "cell_type": "markdown",
1346 | "source": [
1347 | "We have now seen various methods that apply functions to a DataFrame faster than simply iterating through all of its rows. Our goal is to find the most efficient method for this task.\n",
1348 | "\n"
1349 | ],
1350 | "metadata": {
1351 | "id": "ljv_-ySq7qRL"
1352 | }
1353 | },
1354 | {
1355 | "cell_type": "markdown",
1356 | "source": [
1357 | "Using .iterrows() to loop through the DataFrame:\n"
1358 | ],
1359 | "metadata": {
1360 | "id": "3H338N5h7wgS"
1361 | }
1362 | },
1363 | {
1364 | "cell_type": "code",
1365 | "source": [
1366 | "data_generator = poker_data.iterrows()\n",
1367 | "\n",
1368 | "start_time_iterrows = time.time()\n",
1369 | "\n",
1370 | "for index, value in data_generator:\n",
1371 | " sum([value[1], value[3], value[5], value[7], value[9]])\n",
1372 | "\n",
1373 | "end_time_iterrows = time.time()\n",
1374 | "iterrows_time = end_time_iterrows - start_time_iterrows\n",
1375 | "print(\"Time using .iterrows() {} seconds \" .format(iterrows_time))"
1376 | ],
1377 | "metadata": {
1378 | "colab": {
1379 | "base_uri": "https://localhost:8080/"
1380 | },
1381 | "id": "kyivEyu_V1OO",
1382 | "outputId": "816b1050-4228-42e2-a81c-164b75cab419"
1383 | },
1384 | "execution_count": null,
1385 | "outputs": [
1386 | {
1387 | "output_type": "stream",
1388 | "name": "stdout",
1389 | "text": [
1390 | "Time using .iterrows() 1.1502439975738525 seconds \n"
1391 | ]
1392 | }
1393 | ]
1394 | },
1395 | {
1396 | "cell_type": "markdown",
1397 | "source": [
1398 | "Using the .apply() method:\n"
1399 | ],
1400 | "metadata": {
1401 | "id": "0FURW1ta71Mj"
1402 | }
1403 | },
1404 | {
1405 | "cell_type": "code",
1406 | "source": [
1407 | "start_time_apply = time.time()\n",
1408 | "poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].apply(lambda x: sum(x),axis=1)\n",
1409 | "end_time_apply = time.time()\n",
1410 | "\n",
1411 | "apply_time = end_time_apply - start_time_apply\n",
1412 | "\n",
1413 | "print(\"Time using apply() {} seconds\" .format(apply_time))"
1414 | ],
1415 | "metadata": {
1416 | "colab": {
1417 | "base_uri": "https://localhost:8080/"
1418 | },
1419 | "id": "M4rz9TNIWI1m",
1420 | "outputId": "8434927d-9217-4c07-c08d-4ec2f97ba4db"
1421 | },
1422 | "execution_count": null,
1423 | "outputs": [
1424 | {
1425 | "output_type": "stream",
1426 | "name": "stdout",
1427 | "text": [
1428 | "Time using apply() 0.3497791290283203 seconds\n"
1429 | ]
1430 | }
1431 | ]
1432 | },
1433 | {
1434 | "cell_type": "markdown",
1435 | "source": [
1436 | "Comparing the time it takes to sum the ranks of all the cards in each hand using vectorization, the .iterrows() function, and the .apply() function, we can see that the vectorization method performs much better.\n",
1437 | "\n",
1438 | "We can also use another vectorization method to iterate through the DataFrame effectively: using NumPy arrays to vectorize the DataFrame.\n",
1439 | "\n",
1440 | "The NumPy library, which describes itself as the “fundamental package for scientific computing in Python”, performs operations under the hood in optimized, pre-compiled C code. Like pandas, NumPy operates on arrays, called ndarrays. A major difference between a Series and an ndarray is that the ndarray omits a lot of overhead, such as index labels and per-operation data-type checking. As a result, operations on NumPy arrays can be significantly faster than operations on pandas Series. NumPy arrays can be used in place of pandas Series when the additional functionality offered by the Series isn’t critical.\n",
1441 | "\n",
1442 | "For the problems we explore in this article, we could use NumPy ndarrays instead of pandas Series. The question at stake is whether this would be more efficient or not.\n",
1443 | "\n",
1444 | "Again, we will calculate the sum of the ranks of all the cards in each hand. We convert our rank columns from pandas Series to NumPy arrays simply by using the .values attribute, which returns the underlying data as a NumPy ndarray. As with vectorization on the Series, calling the operation on the NumPy array applies it to the entire array at once.\n",
1445 | "\n"
1446 | ],
1447 | "metadata": {
1448 | "id": "WO9COAX876tv"
1449 | }
1450 | },
1451 | {
1452 | "cell_type": "code",
1453 | "source": [
1454 | "start_time = time.time()\n",
1455 | "\n",
1456 | "poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].values.sum(axis=1)\n",
1457 | "\n",
1458 | "print(\"Time using NumPy vectorization: {} sec\" .format(time.time() - start_time))"
1459 | ],
1460 | "metadata": {
1461 | "colab": {
1462 | "base_uri": "https://localhost:8080/"
1463 | },
1464 | "id": "wJaJBbiI5Q5M",
1465 | "outputId": "27ca2ac9-9a87-4a3d-8b0d-2f1ec772a7d4"
1466 | },
1467 | "execution_count": null,
1468 | "outputs": [
1469 | {
1470 | "output_type": "stream",
1471 | "name": "stdout",
1472 | "text": [
1473 | "Time using NumPy vectorization: 0.001745462417602539 sec\n"
1474 | ]
1475 | }
1476 | ]
1477 | },
1478 | {
1479 | "cell_type": "code",
1480 | "source": [
1481 | "start_time = time.time()\n",
1482 | "poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].sum(axis=1)\n",
1483 | "print(\"Time using the pandas vectorization %s seconds\" % (time.time() - start_time))"
1484 | ],
1485 | "metadata": {
1486 | "colab": {
1487 | "base_uri": "https://localhost:8080/"
1488 | },
1489 | "id": "OVGS_FGb5eeV",
1490 | "outputId": "e8635e7b-ef4a-42a5-b4b2-ddca0778cd56"
1491 | },
1492 | "execution_count": null,
1493 | "outputs": [
1494 | {
1495 | "output_type": "stream",
1496 | "name": "stdout",
1497 | "text": [
1498 | "Time using the pandas vectorization 0.003729104995727539 seconds\n"
1499 | ]
1500 | }
1501 | ]
1502 | },
1503 | {
1504 | "cell_type": "markdown",
1505 | "source": [
1506 | "## 5. Summary of best practices for looping through DataFrame\n",
1507 | "* Using **.iterrows()** does not improve the speed of iterating through the DataFrame, but it provides a cleaner way to access the values of each row.\n",
1508 | "* The **.apply()** function performs faster when we want to iterate through all the rows of a pandas DataFrame, but is slower when we perform the same operation through a column.\n",
1509 | "* Vectorizing over pandas Series achieves the overwhelming majority of optimization needs for everyday calculations. However, if speed is of the highest priority, we can call in reinforcements in the form of the NumPy library."
1510 | ],
1511 | "metadata": {
1512 | "id": "lXeC6MGF8MfY"
1513 | }
1514 | }
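The practices above can be sketched together on randomly generated data (a hypothetical stand-in for the poker dataset); all four approaches produce identical row sums at very different speeds:

```python
import time
import numpy as np
import pandas as pd

# Hypothetical stand-in: 10,000 rows of card ranks.
df = pd.DataFrame(np.random.randint(1, 14, size=(10_000, 5)),
                  columns=['R1', 'R2', 'R3', 'R4', 'R5'])

def timed(label, func):
    # Run func once, print its wall-clock time, and return its result.
    start = time.time()
    result = func()
    print('{}: {:.4f} sec'.format(label, time.time() - start))
    return list(result)

a = timed('.iterrows()  ', lambda: [sum(row) for _, row in df.iterrows()])
b = timed('.apply()     ', lambda: df.apply(sum, axis=1))
c = timed('pandas .sum()', lambda: df.sum(axis=1))
d = timed('NumPy .values', lambda: df.values.sum(axis=1))

assert a == b == c == d  # same result, very different runtimes
```

On typical hardware the timings fall in the same order as in the sections above: crude row iteration is slowest, and vectorization over the underlying NumPy array is fastest.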
1515 | ]
1516 | }
--------------------------------------------------------------------------------
/Write Efficient Python Code [Defining and Measuring Code Efficiency].ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Write Efficient Python Code: Defining & Measuring Code Efficiency"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "As a data scientist, you should spend most of your time gaining insights from data, not waiting for your code to finish running. Writing efficient Python code can help reduce runtime and save computational resources, ultimately freeing you up to do the things that have more impact."
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "Table of Contents:\n",
22 | "1. Define Efficiency\n",
23 | " \n",
24 | " 1.1. What is meant by efficient code?\n",
25 | " \n",
26 | " 1.2. Python Standard Libraries\n",
27 | " \n",
28 | "\n",
29 | "2. Python Code Timing and Profiling\n",
30 | " \n",
31 | " 2.1. Python Runtime Investigation\n",
32 | " \n",
33 | " 2.2. Code profiling for runtime\n",
34 | " \n",
35 | " 2.3. Code profiling for memory use"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "## 1. Define Efficiency "
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "metadata": {},
48 | "source": [
49 | "### 1.1. What is meant by efficient code?"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
56 | "Efficient code satisfies two key criteria. First, efficient code is fast, with a small latency between execution and returning a result. Second, efficient code allocates resources skillfully and isn't subject to unnecessary overhead. Although your definitions of fast runtime and small memory usage may differ depending on the task at hand, the goal of writing efficient code is always to reduce both latency and overhead."
57 | ]
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "metadata": {},
62 | "source": [
63 | "Python is a language that prides itself on code readability, and thus, it comes with its own set of idioms and best practices. Writing Python code the way it was intended is often referred to as Pythonic code. This means the code that you write follows the best practices and guiding principles of Python. Pythonic code tends to be less verbose and easier to interpret. Although Python supports code that doesn't follow its guiding principles, this type of code tends to run slower."
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "As an example, look at the non-Pythonic code below. Not only is this code more verbose than the Pythonic version, but it also takes longer to run. We'll take a closer look at why this is the case later on, but for now, the main takeaway is that Pythonic code is efficient code!"
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": 1,
76 | "metadata": {},
77 | "outputs": [],
78 | "source": [
79 | "numbers = [1,2,3,4,5]\n",
80 | "\n",
81 | "# Non-Pythonic \n",
82 | "doubled_numbers = []\n",
83 | "\n",
84 | "for i in range(len(numbers)):\n",
85 | " doubled_numbers.append(numbers[i]*2)\n",
86 | " \n",
87 | "# Pythonic\n",
88 | "\n",
89 | "doubled_numbers = [x * 2 for x in numbers]"
90 | ]
91 | },
92 | {
93 | "cell_type": "markdown",
94 | "metadata": {},
95 | "source": [
96 | "### 1.2. Python Standard Libraries"
97 | ]
98 | },
99 | {
100 | "cell_type": "markdown",
101 | "metadata": {},
102 | "source": [
103 | "Python Standard Libraries are the built-in components and libraries of Python. These libraries come with every Python installation and are commonly cited as one of Python's greatest strengths. Python also provides a number of built-in types and functions."
104 | ]
105 | },
106 | {
107 | "cell_type": "markdown",
108 | "metadata": {},
109 | "source": [
110 | "Let's start exploring some of these built-in functions:\n",
111 | "* range(): This is a handy tool whenever we want to create a sequence of numbers. Suppose we wanted to create a list of integers from zero to ten. We could explicitly type out each integer, but that is not very efficient. Instead, we can use range() to accomplish this task. We can provide range() with a start and stop value to create this sequence. Or, we can provide just a stop value, in which case the sequence starts at zero. Notice that the stop value is exclusive: the sequence runs up to, but not including, this value. Also note that range() returns a range object, which we can convert into a list and print. The range() function can also accept a start, stop, and step value (in that order)."
112 | ]
113 | },
114 | {
115 | "cell_type": "code",
116 | "execution_count": 2,
117 | "metadata": {},
118 | "outputs": [
119 | {
120 | "name": "stdout",
121 | "output_type": "stream",
122 | "text": [
123 | "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n",
124 | "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n",
125 | "[2, 4, 6, 8, 10]\n"
126 | ]
127 | }
128 | ],
129 | "source": [
130 | "# range(start,stop)\n",
131 | "nums = range(0,11)\n",
132 | "nums_list = list(nums)\n",
133 | "print(nums_list)\n",
134 | "\n",
135 | "# range(stop)\n",
136 | "nums = range(11)\n",
137 | "nums_list = list(nums)\n",
138 | "print(nums_list)\n",
139 | "\n",
140 | "# Using range() with a step value\n",
141 | "\n",
142 | "even_nums = range(2, 11, 2)\n",
143 | "even_nums_list = list(even_nums)\n",
144 | "print(even_nums_list)"
145 | ]
146 | },
147 | {
148 | "cell_type": "markdown",
149 | "metadata": {},
150 | "source": [
151 | "**enumerate():** Another useful built-in function is enumerate(). It creates an index-item pair for each item in the object provided. For example, calling enumerate() on the list letters produces a sequence of indexed values. Similar to range(), enumerate() returns an enumerate object, which can also be converted into a list and printed."
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": 2,
157 | "metadata": {
158 | "scrolled": true
159 | },
160 | "outputs": [
161 | {
162 | "name": "stdout",
163 | "output_type": "stream",
164 | "text": [
165 | "[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]\n"
166 | ]
167 | }
168 | ],
169 | "source": [
170 | "# Creates an indexed list of objects using enumerate\n",
171 | "letters = ['a', 'b', 'c', 'd' ]\n",
172 | "indexed_letters = enumerate(letters)\n",
173 | "indexed_letters_list = list(indexed_letters)\n",
174 | "print(indexed_letters_list)"
175 | ]
176 | },
177 | {
178 | "cell_type": "markdown",
179 | "metadata": {},
180 | "source": [
181 | "We can also specify the starting index of enumerate with the keyword argument start. Here, we tell enumerate to start the index at five by passing start equals five into the function call."
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": 3,
187 | "metadata": {},
188 | "outputs": [
189 | {
190 | "name": "stdout",
191 | "output_type": "stream",
192 | "text": [
193 | "[(5, 'a'), (6, 'b'), (7, 'c'), (8, 'd')]\n"
194 | ]
195 | }
196 | ],
197 | "source": [
198 | "#specify a start value\n",
199 | "letters = ['a', 'b', 'c', 'd' ]\n",
200 | "indexed_letters2 = enumerate(letters, start=5)\n",
201 | "indexed_letters2_list = list(indexed_letters2)\n",
202 | "print(indexed_letters2_list)"
203 | ]
204 | },
205 | {
206 | "cell_type": "markdown",
207 | "metadata": {},
208 | "source": [
209 | "**map():** The last notable built-in function we'll cover is map(). It applies a function to each element in an object. Notice that map() takes two arguments: first, the function you'd like to apply and, second, the object you'd like to apply that function on. Here, we use map() to apply the built-in function round() to each element of the nums list."
210 | ]
211 | },
212 | {
213 | "cell_type": "code",
214 | "execution_count": 4,
215 | "metadata": {},
216 | "outputs": [
217 | {
218 | "name": "stdout",
219 | "output_type": "stream",
220 | "text": [
221 | "[2, 2, 3, 5, 5]\n"
222 | ]
223 | }
224 | ],
225 | "source": [
226 | "nums = [1.5, 2.3, 3.4, 4.6, 5.0]\n",
227 | "rnd_nums = map(round, nums)\n",
228 | "print(list(rnd_nums))"
229 | ]
230 | },
231 | {
232 | "cell_type": "markdown",
233 | "metadata": {},
234 | "source": [
235 | "The map() function can also be used with a lambda, or anonymous, function. Notice here that we can use map() and a lambda expression to apply a function, defined on the fly, to our original list nums. The map() function provides a quick and clean way to apply a function to an object iteratively without writing a for loop."
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": 6,
241 | "metadata": {},
242 | "outputs": [
243 | {
244 | "name": "stdout",
245 | "output_type": "stream",
246 | "text": [
247 | "[1, 4, 9, 16, 25]\n"
248 | ]
249 | }
250 | ],
251 | "source": [
252 | "# map() with lambda \n",
253 | "nums = [1, 2, 3, 4, 5]\n",
254 | "sqrd_nums = map(lambda x: x ** 2, nums)\n",
255 | "print(list(sqrd_nums))"
256 | ]
257 | },
258 | {
259 | "cell_type": "markdown",
260 | "metadata": {},
261 | "source": [
262 | "NumPy, or Numerical Python, is an invaluable Python package for Data Scientists. It is the fundamental package for scientific computing in Python and provides a number of benefits for writing efficient code.\n",
263 | "\n",
264 | "**NumPy arrays** provide a fast and memory-efficient alternative to Python lists. Typically, we import NumPy as np and use np.array() to create a NumPy array."
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "execution_count": 7,
270 | "metadata": {},
271 | "outputs": [
272 | {
273 | "name": "stdout",
274 | "output_type": "stream",
275 | "text": [
276 | "[0, 1, 2, 3, 4]\n",
277 | "[0 1 2 3 4]\n"
278 | ]
279 | }
280 | ],
281 | "source": [
282 | "# python list \n",
283 | "nums_list = list(range(5))\n",
284 | "print(nums_list)\n",
285 | "\n",
286 | "# using a NumPy array as an alternative to a Python list\n",
287 | "import numpy as np\n",
288 | "nums_np = np.array(range(5))\n",
289 | "print(nums_np)"
290 | ]
291 | },
292 | {
293 | "cell_type": "markdown",
294 | "metadata": {},
295 | "source": [
296 | "NumPy arrays are homogeneous, which means that they must contain elements of the same type. We can see the type of an array's elements using the .dtype attribute. Suppose we created an array using a mixture of types: here, we create the array nums_np_floats using the integers 1 and 3 and the float 2.5. Can you spot the difference in the output? The integers now have a trailing dot in the array. That's because NumPy converted the integers to floats to retain the array's homogeneous nature. Using .dtype, we can verify that the elements in this array are floats."
297 | ]
298 | },
299 | {
300 | "cell_type": "code",
301 | "execution_count": 9,
302 | "metadata": {},
303 | "outputs": [
304 | {
305 | "name": "stdout",
306 | "output_type": "stream",
307 | "text": [
308 | "integer numpy array [1 2 3]\n",
309 | "int32\n",
310 | "float numpy array [1. 2.5 3. ]\n",
311 | "float64\n"
312 | ]
313 | }
314 | ],
315 | "source": [
316 | "# NumPy array homogeneity\n",
317 | "nums_np_ints = np.array([1, 2, 3]) \n",
318 | "print('integer numpy array',nums_np_ints)\n",
319 | "print(nums_np_ints.dtype)\n",
320 | "\n",
321 | "nums_np_floats = np.array([1, 2.5, 3])\n",
322 | "print('float numpy array',nums_np_floats)\n",
323 | "print(nums_np_floats.dtype)"
324 | ]
325 | },
326 | {
327 | "cell_type": "markdown",
328 | "metadata": {},
329 | "source": [
330 | "When analyzing data, you'll often want to perform operations over entire collections of values quickly. Say, for example, you'd like to square each number within a list of numbers. It'd be nice if we could simply square the list, and get a list of squared values returned. Unfortunately, Python lists don't support these types of calculations."
331 | ]
332 | },
333 | {
334 | "cell_type": "code",
335 | "execution_count": 10,
336 | "metadata": {},
337 | "outputs": [
338 | {
339 | "ename": "TypeError",
340 | "evalue": "unsupported operand type(s) for ** or pow(): 'list' and 'int'",
341 | "output_type": "error",
342 | "traceback": [
343 | "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
344 | "\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)",
345 | "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;31m# Python lists don't support broadcasting\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2\u001b[0m \u001b[0mnums\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m[\u001b[0m\u001b[1;33m-\u001b[0m\u001b[1;36m2\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m-\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m0\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m1\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m2\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 3\u001b[1;33m \u001b[0mnums\u001b[0m \u001b[1;33m**\u001b[0m \u001b[1;36m2\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
346 | "\u001b[1;31mTypeError\u001b[0m: unsupported operand type(s) for ** or pow(): 'list' and 'int'"
347 | ]
348 | }
349 | ],
350 | "source": [
351 | "# Python lists don't support broadcasting\n",
352 | "nums = [-2, -1, 0, 1, 2]\n",
353 | "nums ** 2"
354 | ]
355 | },
356 | {
357 | "cell_type": "markdown",
358 | "metadata": {},
359 | "source": [
360 | "We could square the values using a list by writing a for loop or using a list comprehension as shown in the code below. But neither of these approaches is the most efficient way of doing this. Here lies the second advantage of NumPy arrays - their broadcasting functionality. NumPy arrays vectorize operations, so they are performed on all elements of an object at once. This allows us to efficiently perform calculations over entire arrays. Let's compare the computational time using these three approaches in the following code:"
361 | ]
362 | },
363 | {
364 | "cell_type": "code",
365 | "execution_count": 24,
366 | "metadata": {},
367 | "outputs": [
368 | {
369 | "name": "stdout",
370 | "output_type": "stream",
371 | "text": [
372 | "Execution time using for loops over list: 0.0020008087158203125 seconds\n",
373 | "Execution time using list comprehension: 0.0020008087158203125 seconds\n",
374 | "Execution time using numpy array broadcasting: 0.0010006427764892578 seconds\n"
375 | ]
376 | }
377 | ],
378 | "source": [
379 | "import time\n",
380 | "\n",
381 | "# define numerical list \n",
382 | "nums = range(0,1000)\n",
383 | "nums = list(nums)\n",
384 | "\n",
385 | "# For loop (inefficient option)\n",
386 | "# get the start time\n",
387 | "st = time.time()\n",
388 | "\n",
389 | "sqrd_nums = []\n",
390 | "for num in nums:\n",
391 | " sqrd_nums.append(num ** 2)\n",
392 | "#print(sqrd_nums)\n",
393 | "\n",
394 | "# get the end time\n",
395 | "et = time.time()\n",
396 | "\n",
397 | "# get the execution time\n",
398 | "elapsed_time = et - st\n",
399 | "print('Execution time using for loops over list:', elapsed_time, 'seconds')\n",
400 | "\n",
401 | "\n",
402 | "# List comprehension (better option but not best)\n",
403 | "# get the start time\n",
404 | "st = time.time()\n",
405 | "\n",
406 | "sqrd_nums = [num ** 2 for num in nums]\n",
407 | "#print(sqrd_nums)\n",
408 | "\n",
409 | "# get the end time\n",
410 | "et = time.time()\n",
411 |     "# get the execution time\n",
411 |     "elapsed_time = et - st\n",
411 |     "print('Execution time using list comprehension:', elapsed_time, 'seconds')\n",
412 | "\n",
413 | "\n",
414 | "# using numpy array broadcasting\n",
415 | "\n",
416 | "# define the numpy array \n",
417 | "nums_np = np.arange(0,1000)\n",
418 | "# get the start time\n",
419 | "st = time.time()\n",
420 | "\n",
421 | "nums_np ** 2\n",
422 | "\n",
423 | "# get the end time\n",
424 | "et = time.time()\n",
425 | "\n",
426 | "# get the execution time\n",
427 | "elapsed_time = et - st\n",
428 | "print('Execution time using numpy array broadcasting:', elapsed_time, 'seconds')\n"
429 | ]
430 | },
431 | {
432 | "cell_type": "markdown",
433 | "metadata": {},
434 | "source": [
435 |     "We can see that the first two approaches take roughly the same time, while using NumPy broadcasting in the third approach roughly halves the runtime on this machine. Keep in mind that single-run timings like these are noisy; the next section covers more reliable timing tools."
436 | ]
437 | },
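A steadier way to make the same comparison is the standard-library timeit module, which repeats each measurement many times instead of relying on a single time.time() delta. This is a minimal sketch of the list-comprehension vs. broadcasting comparison above (the variable names are illustrative):

```python
import timeit

import numpy as np

# Repeating each statement many times gives far more stable numbers
# than a single start/stop timestamp pair.
nums = list(range(1000))
nums_np = np.arange(1000)

loop_time = timeit.timeit(lambda: [n ** 2 for n in nums], number=1000)
np_time = timeit.timeit(lambda: nums_np ** 2, number=1000)

print(f"list comprehension: {loop_time:.4f}s, numpy broadcasting: {np_time:.4f}s")
```

On typical hardware the broadcast version comes out well ahead of the list comprehension, not merely twice as fast.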
438 | {
439 | "cell_type": "markdown",
440 | "metadata": {},
441 | "source": [
442 | "Another advantage of NumPy arrays is their indexing capabilities. When comparing basic indexing between a one-dimensional array and a list, the capabilities are identical. When using two-dimensional arrays and lists, the advantages of arrays are clear. To return the second item of the first row in our two-dimensional object, the array syntax is [0,1]. The analogous list syntax is a bit more verbose as you have to surround both the zero and one with square brackets [0][1]. To return the first column values in the 2-d object, the array syntax is [:,0]. Lists don't support this type of syntax, so we must use a list comprehension to return columns."
443 | ]
444 | },
445 | {
446 | "cell_type": "code",
447 | "execution_count": 27,
448 | "metadata": {},
449 | "outputs": [
450 | {
451 | "name": "stdout",
452 | "output_type": "stream",
453 | "text": [
454 | "2\n",
455 | "2\n",
456 | "[1, 4]\n",
457 | "[1 4]\n"
458 | ]
459 | }
460 | ],
461 | "source": [
462 | "#2-D list\n",
463 | "nums2 = [ [1, 2, 3],\n",
464 | " [4, 5, 6] ]\n",
465 | "\n",
466 | "\n",
467 | "# 2-D array\n",
468 | "nums2_np = np.array(nums2)\n",
469 | "\n",
470 | "# printing the second item of the first row \n",
471 | "print(nums2[0][1])\n",
472 | "print(nums2_np[0,1])\n",
473 | "\n",
474 |     "# printing the first column values \n",
475 | "\n",
476 | "print([row[0] for row in nums2])\n",
477 | "print(nums2_np[:,0])"
478 | ]
479 | },
480 | {
481 | "cell_type": "markdown",
482 | "metadata": {},
483 | "source": [
484 |     "NumPy arrays also support a special technique called boolean indexing. Suppose we wanted to gather only the positive numbers from the sequence listed here. With an array, we can create a boolean mask using a simple inequality, and indexing the array is as simple as enclosing this inequality in square brackets. To do the same with a list, we need to write a for loop to filter it or use a list comprehension. In either case, using a NumPy array for indexing is less verbose and has a faster runtime."
485 | ]
486 | },
487 | {
488 | "cell_type": "code",
489 | "execution_count": 46,
490 | "metadata": {},
491 | "outputs": [
492 | {
493 | "name": "stdout",
494 | "output_type": "stream",
495 | "text": [
496 | "[1 2]\n",
497 | "[1, 2]\n",
498 | "[1, 2]\n"
499 | ]
500 | }
501 | ],
502 | "source": [
503 | "nums = [-2, -1, 0, 1, 2]\n",
504 | "nums_np = np.array(nums)\n",
505 | "\n",
506 | "# Boolean indexing\n",
507 | "print(nums_np[nums_np > 0])\n",
508 | "\n",
509 | "# No boolean indexing for lists\n",
510 | "# For loop (inefficient option)\n",
511 | "\n",
512 | "\n",
513 | "pos = []\n",
514 | "for num in nums:\n",
515 | " if num > 0:\n",
516 | " pos.append(num)\n",
517 | "print(pos)\n",
518 | "\n",
519 | "\n",
520 | "# List comprehension (better option but not best)\n",
521 | "pos = [num for num in nums if num > 0]\n",
522 | "print(pos)"
523 | ]
524 | },
525 | {
526 | "cell_type": "markdown",
527 | "metadata": {},
528 | "source": [
529 |     "## 2. Python Code Timing and Profiling"
530 | ]
531 | },
532 | {
533 | "cell_type": "markdown",
534 | "metadata": {},
535 | "source": [
536 | "In the second section of the article, you will learn how to gather and compare runtimes between different coding approaches. You'll practice using the line_profiler and memory_profiler packages to profile your code base and spot bottlenecks. Then, you'll put your learnings to practice by replacing these bottlenecks with efficient Python code."
537 | ]
538 | },
539 | {
540 | "cell_type": "markdown",
541 | "metadata": {},
542 | "source": [
543 | "### 2.1. Python Runtime Investigation "
544 | ]
545 | },
546 | {
547 | "cell_type": "markdown",
548 | "metadata": {},
549 | "source": [
550 |     "As mentioned in the previous chapter, efficient code means fast code. To measure how fast our code is, we need to be able to measure its runtime. Comparing runtimes between two code bases that effectively do the same thing allows us to pick the code with the better performance. By gathering and analyzing runtimes, we can be sure to implement the code that is fastest and thus most efficient.\n",
551 | "\n",
552 | "To compare runtimes, we need to be able to compute the runtime for a line or multiple lines of code. IPython comes with some handy built-in magic commands we can use to time our code. Magic commands are enhancements that have been added on top of the normal Python syntax. These commands are prefixed with the percentage sign. If you aren't familiar with magic commands take a moment to review the documentation. \n",
553 | "\n",
554 | "\n",
555 | "Let's start with this example: we want to inspect the runtime for selecting 1,000 random numbers between zero and one using NumPy's **random.rand()** function. Using %timeit just requires adding the magic command before the line of code we want to analyze. That's it! One simple command to gather runtimes."
556 | ]
557 | },
558 | {
559 | "cell_type": "code",
560 | "execution_count": 2,
561 | "metadata": {},
562 | "outputs": [
563 | {
564 | "name": "stdout",
565 | "output_type": "stream",
566 | "text": [
567 | "26.6 µs ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
568 | ]
569 | }
570 | ],
571 | "source": [
572 | "%timeit rand_nums = np.random.rand(1000)"
573 | ]
574 | },
575 | {
576 | "cell_type": "markdown",
577 | "metadata": {},
578 | "source": [
579 |     "As we can see, %timeit provides an average of timing statistics. This is one of its advantages: we can also see that multiple runs and loops were generated. %timeit runs the provided code multiple times to estimate its average execution time, which gives a more accurate representation of the actual runtime than relying on a single iteration. The mean and standard deviation displayed in the output summarize the runtimes across those runs."
580 | ]
581 | },
582 | {
583 | "cell_type": "markdown",
584 | "metadata": {},
585 | "source": [
586 | "#### Specifying number of runs/loops"
587 | ]
588 | },
589 | {
590 | "cell_type": "markdown",
591 | "metadata": {},
592 | "source": [
593 | "The number of runs represents how many iterations you'd like to use to estimate the runtime. The number of loops represents how many times you'd like the code to be executed per run. We can specify the number of runs, using the -r flag, and the number of loops, using the -n flag. Here, we use -r2, to set the number of runs to two and -n10, to set the number of loops to ten. In this example, %timeit would execute our random number selection 20 times in order to estimate runtime (2 runs each with 10 executions)."
594 | ]
595 | },
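Outside IPython, the standard-library timeit module offers an analogue of these flags: timeit.repeat takes a repeat argument (like -r) and a number argument (like -n). A minimal sketch of the "-r2 -n10" example above:

```python
import timeit

# Standard-library analogue of "%timeit -r2 -n10":
# repeat=2 runs, each executing the statement number=10 times.
per_run_totals = timeit.repeat(
    "np.random.rand(1000)", setup="import numpy as np", repeat=2, number=10
)

# Each entry is the TOTAL time for one run of 10 executions,
# so divide by number to get per-loop averages.
print([t / 10 for t in per_run_totals])
```

Note that unlike %timeit, timeit.repeat reports the total per run rather than a per-loop mean, so the division at the end is needed to compare the two.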
596 | {
597 | "cell_type": "code",
598 | "execution_count": 3,
599 | "metadata": {},
600 | "outputs": [
601 | {
602 | "name": "stdout",
603 | "output_type": "stream",
604 | "text": [
605 | "90 µs ± 2.64 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)\n"
606 | ]
607 | }
608 | ],
609 | "source": [
610 | "# Set number of runs to 2 (-r2)\n",
611 | "# Set number of loops to 10 (-n10)\n",
612 | "%timeit -r2 -n10 rand_nums = np.random.rand(1000)"
613 | ]
614 | },
615 | {
616 | "cell_type": "markdown",
617 | "metadata": {},
618 | "source": [
619 |     "Another cool feature of %timeit is its ability to run on single or multiple lines of code. In line magic mode (a single line of code), one percentage sign is used; in cell magic mode (multiple lines of code), two percentage signs are used."
620 | ]
621 | },
622 | {
623 | "cell_type": "code",
624 | "execution_count": 10,
625 | "metadata": {},
626 | "outputs": [
627 | {
628 | "name": "stdout",
629 | "output_type": "stream",
630 | "text": [
631 | "1.77 µs ± 62.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)\n"
632 | ]
633 | }
634 | ],
635 | "source": [
636 | "# Single line of code\n",
637 | "%timeit nums = [x for x in range(10)]"
638 | ]
639 | },
640 | {
641 | "cell_type": "code",
642 | "execution_count": 13,
643 | "metadata": {},
644 | "outputs": [
645 | {
646 | "name": "stdout",
647 | "output_type": "stream",
648 | "text": [
649 | "2.32 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
650 | ]
651 | }
652 | ],
653 | "source": [
654 | "%%timeit\n",
655 | "# Multiple lines of code\n",
656 | "nums = []\n",
657 | "for x in range(10):\n",
658 | " nums.append(x)"
659 | ]
660 | },
661 | {
662 | "cell_type": "markdown",
663 | "metadata": {},
664 | "source": [
665 | "We can also save the output of %timeit into a variable using the -o flag. This allows us to dig deeper into the output and see things like the time for each run, the best time for all runs, and the worst time for all runs."
666 | ]
667 | },
668 | {
669 | "cell_type": "code",
670 | "execution_count": 16,
671 | "metadata": {},
672 | "outputs": [
673 | {
674 | "name": "stdout",
675 | "output_type": "stream",
676 | "text": [
677 | "21.6 µs ± 991 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n",
678 | "The timings for all the 7 runs [1.9878090000020164e-05, 2.1997419999934208e-05, 2.3160939999979746e-05, 2.157510999995793e-05, 2.0585660000051577e-05, 2.2036539999953675e-05, 2.1905299999980344e-05]\n",
679 | "The best timing is 1.9878090000020164e-05\n",
680 |     "The worst timing is 2.3160939999979746e-05\n"
681 | ]
682 | }
683 | ],
684 | "source": [
685 | "# Saving the output to a variable and exploring them\n",
686 | "\n",
687 | "times = %timeit -o rand_nums = np.random.rand(1000)\n",
688 |     "print('The timings for all the 7 runs', times.timings)\n",
689 |     "print('The best timing is', times.best)\n",
690 |     "print('The worst timing is', times.worst)"
691 | ]
692 | },
693 | {
694 | "cell_type": "markdown",
695 | "metadata": {},
696 | "source": [
697 | "### 2.2. Code profiling for runtime"
698 | ]
699 | },
700 | {
701 | "cell_type": "markdown",
702 | "metadata": {},
703 | "source": [
704 | "We've covered how to time the code using the magic command %timeit, which works well with bite-sized code. But, what if we wanted to time a large code base or see the line-by-line runtimes within a function? In this section, we'll cover a concept called code profiling that allows us to analyze code more efficiently.\n",
705 | "\n",
706 | "Code profiling is a technique used to describe how long, and how often, various parts of a program are executed. The beauty of a code profiler is its ability to gather summary statistics on individual pieces of our code without using magic commands like %timeit. We'll focus on the line_profiler package to profile a function's runtime line-by-line. \n",
707 | "\n",
708 | "Since this package isn't a part of Python's Standard Library, we need to install it separately. This can easily be done with a pip install command as shown in the code below."
709 | ]
710 | },
711 | {
712 | "cell_type": "code",
713 | "execution_count": 17,
714 | "metadata": {},
715 | "outputs": [
716 | {
717 | "name": "stdout",
718 | "output_type": "stream",
719 | "text": [
720 | "Collecting line_profiler\n",
721 | " Downloading line_profiler-3.5.1-cp37-cp37m-win_amd64.whl (52 kB)\n",
722 | "Installing collected packages: line-profiler\n",
723 | "Successfully installed line-profiler-3.5.1\n"
724 | ]
725 | }
726 | ],
727 | "source": [
728 | "!pip install line_profiler"
729 | ]
730 | },
731 | {
732 | "cell_type": "markdown",
733 | "metadata": {},
734 | "source": [
735 |     "Let's explore line_profiler with an example. Suppose we have a list of names along with each person's height (in centimeters) and weight (in kilograms) loaded as NumPy arrays."
736 | ]
737 | },
738 | {
739 | "cell_type": "code",
740 | "execution_count": 21,
741 | "metadata": {},
742 | "outputs": [],
743 | "source": [
744 | "names = ['Ahmed', 'Mohammed', 'Youssef']\n",
745 | "hts = np.array([188.0, 191.0, 185.0])\n",
746 | "wts = np.array([ 95.0, 100.0, 75.0])"
747 | ]
748 | },
749 | {
750 | "cell_type": "markdown",
751 | "metadata": {},
752 | "source": [
753 | "We will then develop a function called convert_units that converts each person's height from centimeters to inches and weight from kilograms to pounds."
754 | ]
755 | },
756 | {
757 | "cell_type": "code",
758 | "execution_count": 24,
759 | "metadata": {},
760 | "outputs": [
761 | {
762 | "data": {
763 | "text/plain": [
764 | "{'Ahmed': (74.01559999999999, 209.4389),\n",
765 | " 'Mohammed': (75.19669999999999, 220.462),\n",
766 | " 'Youssef': (72.8345, 165.3465)}"
767 | ]
768 | },
769 | "execution_count": 24,
770 | "metadata": {},
771 | "output_type": "execute_result"
772 | }
773 | ],
774 | "source": [
775 | "def convert_units(names, heights, weights):\n",
776 | " new_hts = [ht * 0.39370 for ht in heights]\n",
777 | " new_wts = [wt * 2.20462 for wt in weights]\n",
778 | " data = {}\n",
779 | " for i,name in enumerate(names):\n",
780 | " data[name] = (new_hts[i], new_wts[i])\n",
781 | " return data\n",
782 | "convert_units(names, hts, wts)"
783 | ]
784 | },
785 | {
786 | "cell_type": "markdown",
787 | "metadata": {},
788 | "source": [
789 | "If we wanted to get an estimated runtime of this function, we could use %timeit. But, this will only give us the total execution time. What if we wanted to see how long each line within the function took to run? One solution is to use %timeit on each individual line of our convert_units function. But, that's a lot of manual work and not very efficient."
790 | ]
791 | },
792 | {
793 | "cell_type": "code",
794 | "execution_count": 26,
795 | "metadata": {},
796 | "outputs": [
797 | {
798 | "name": "stdout",
799 | "output_type": "stream",
800 | "text": [
801 | "13.1 µs ± 787 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
802 | ]
803 | }
804 | ],
805 | "source": [
806 | "%timeit convert_units(names, hts, wts)"
807 | ]
808 | },
809 | {
810 | "cell_type": "markdown",
811 | "metadata": {},
812 | "source": [
813 | "Instead, we can profile our function with the line_profiler package. To use this package, we first need to load it into our session. We can do this using the command %load_ext followed by line_profiler."
814 | ]
815 | },
816 | {
817 | "cell_type": "markdown",
818 | "metadata": {},
819 | "source": [
820 | "Now, we can use the magic command %lprun, from line_profiler, to gather runtimes for individual lines of code within the convert_units function. %lprun uses a special syntax. First, we use the -f flag to indicate we'd like to profile a function. Then, we specify the name of the function we'd like to profile. Note, the name of the function is passed without any parentheses. Finally, we provide the exact function call we'd like to profile by including any arguments that are needed. This is shown in the code below:"
821 | ]
822 | },
823 | {
824 | "cell_type": "code",
825 | "execution_count": 25,
826 | "metadata": {
827 | "scrolled": true
828 | },
829 | "outputs": [
830 | {
831 | "name": "stdout",
832 | "output_type": "stream",
833 | "text": [
834 | "The line_profiler extension is already loaded. To reload it, use:\n",
835 | " %reload_ext line_profiler\n"
836 | ]
837 | }
838 | ],
839 | "source": [
840 | "%load_ext line_profiler\n",
841 | "%lprun -f convert_units convert_units(names, hts, wts)"
842 | ]
843 | },
844 | {
845 | "cell_type": "markdown",
846 | "metadata": {},
847 | "source": [
848 | "### 2.3. Code profiling for memory use"
849 | ]
850 | },
851 | {
852 | "cell_type": "markdown",
853 | "metadata": {},
854 | "source": [
855 | "We've defined efficient code as code that has a minimal runtime and a small memory footprint. So far, we've only covered how to inspect the runtime of our code. In this section, we'll cover a few techniques on how to evaluate our code's memory usage."
856 | ]
857 | },
858 | {
859 | "cell_type": "markdown",
860 | "metadata": {},
861 | "source": [
862 |     "One basic approach for inspecting memory consumption is Python's built-in sys module. This module contains system-specific functions, including sys.getsizeof(), which returns the size of an object in bytes. It is a quick way to see the size of an individual object."
863 | ]
864 | },
865 | {
866 | "cell_type": "code",
867 | "execution_count": 27,
868 | "metadata": {},
869 | "outputs": [
870 | {
871 | "data": {
872 | "text/plain": [
873 | "9112"
874 | ]
875 | },
876 | "execution_count": 27,
877 | "metadata": {},
878 | "output_type": "execute_result"
879 | }
880 | ],
881 | "source": [
882 | "import sys\n",
883 | "nums_list = [*range(1000)]\n",
884 | "sys.getsizeof(nums_list)"
885 | ]
886 | },
887 | {
888 | "cell_type": "code",
889 | "execution_count": 28,
890 | "metadata": {},
891 | "outputs": [
892 | {
893 | "data": {
894 | "text/plain": [
895 | "4096"
896 | ]
897 | },
898 | "execution_count": 28,
899 | "metadata": {},
900 | "output_type": "execute_result"
901 | }
902 | ],
903 | "source": [
904 | "nums_np = np.array(range(1000))\n",
905 | "sys.getsizeof(nums_np)"
906 | ]
907 | },
908 | {
909 | "cell_type": "markdown",
910 | "metadata": {},
911 | "source": [
912 |     "We can see that the memory allocation of the list is more than double that of the NumPy array. However, sys.getsizeof() only gives us the size of an individual object. If we want to inspect the line-by-line memory usage of our code, we can use a code profiler, just as we did for runtime. We'll use the memory_profiler package, which is very similar to the line_profiler package: it can be installed via pip and comes with a handy magic command (%mprun) that uses the same syntax as %lprun."
913 | ]
914 | },
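One caveat worth knowing before moving on: sys.getsizeof() is shallow. For a container it reports only the container's own footprint, not the objects it references, so the numbers above understate the true cost of nested structures. A quick sketch (the variable names are illustrative):

```python
import sys

# sys.getsizeof() counts only the outer list's own footprint,
# not the three inner lists it references.
outer = [[0] * 1000 for _ in range(3)]

shallow = sys.getsizeof(outer)  # just the outer list object
deep = shallow + sum(sys.getsizeof(inner) for inner in outer)

print(shallow, deep)
```

The deep total is far larger than the shallow one; for truly nested data, a recursive traversal (or a memory profiler) is needed to get honest numbers.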
915 | {
916 | "cell_type": "markdown",
917 | "metadata": {},
918 | "source": [
919 |     "First, let's install the memory_profiler package:"
920 | ]
921 | },
922 | {
923 | "cell_type": "code",
924 | "execution_count": 29,
925 | "metadata": {
926 | "scrolled": true
927 | },
928 | "outputs": [
929 | {
930 | "name": "stdout",
931 | "output_type": "stream",
932 | "text": [
933 | "Collecting memory_profiler\n",
934 | " Downloading memory_profiler-0.60.0.tar.gz (38 kB)\n",
935 | "Requirement already satisfied: psutil in c:\\users\\youss\\anaconda3\\envs\\new_enviroment\\lib\\site-packages (from memory_profiler) (5.7.2)\n",
936 | "Building wheels for collected packages: memory-profiler\n",
937 | " Building wheel for memory-profiler (setup.py): started\n",
938 | " Building wheel for memory-profiler (setup.py): finished with status 'done'\n",
939 | " Created wheel for memory-profiler: filename=memory_profiler-0.60.0-py3-none-any.whl size=31279 sha256=288154cda37cbe0ab88effb1aaa56beba4a185ce47eb84b745a1603b06b8294b\n",
940 | " Stored in directory: c:\\users\\youss\\appdata\\local\\pip\\cache\\wheels\\67\\2b\\fb\\326e30d638c538e69a5eb0aa47f4223d979f502bbdb403950f\n",
941 | "Successfully built memory-profiler\n",
942 | "Installing collected packages: memory-profiler\n",
943 | "Successfully installed memory-profiler-0.60.0\n"
944 | ]
945 | }
946 | ],
947 | "source": [
948 | "!pip install memory_profiler"
949 | ]
950 | },
951 | {
952 | "cell_type": "markdown",
953 | "metadata": {},
954 | "source": [
955 |     "To apply %mprun to a function and measure its memory allocation, the function must be imported from a separate physical file rather than defined in the IPython console. So first we will create a utils_funcs.py file and define the convert_units function in it; then we will import the function from that file and apply %mprun to it."
956 | ]
957 | },
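The contents of utils_funcs.py are not shown in the notebook; assuming it simply mirrors the convert_units function defined earlier, the file could look like this:

```python
# utils_funcs.py -- sketch of the helper module; assumes the same
# conversion factors used in the notebook (cm -> inches, kg -> pounds)
def convert_units(names, heights, weights):
    """Return {name: (height_in_inches, weight_in_pounds)}."""
    new_hts = [ht * 0.39370 for ht in heights]
    new_wts = [wt * 2.20462 for wt in weights]
    data = {}
    for i, name in enumerate(names):
        data[name] = (new_hts[i], new_wts[i])
    return data
```

Saving this next to the notebook lets `from utils_funcs import convert_units` succeed, which is the prerequisite for profiling it with %mprun.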
958 | {
959 | "cell_type": "code",
960 | "execution_count": 38,
961 | "metadata": {},
962 | "outputs": [
963 | {
964 | "name": "stdout",
965 | "output_type": "stream",
966 | "text": [
967 | "The memory_profiler extension is already loaded. To reload it, use:\n",
968 | " %reload_ext memory_profiler\n",
969 | "\n"
970 | ]
971 | }
972 | ],
973 | "source": [
974 | "from utils_funcs import convert_units\n",
975 | "\n",
976 | "%load_ext memory_profiler\n",
977 | "%mprun -f convert_units convert_units(names, hts, wts)"
978 | ]
979 | }
980 | ],
981 | "metadata": {
982 | "kernelspec": {
983 | "display_name": "Python 3",
984 | "language": "python",
985 | "name": "python3"
986 | },
987 | "language_info": {
988 | "codemirror_mode": {
989 | "name": "ipython",
990 | "version": 3
991 | },
992 | "file_extension": ".py",
993 | "mimetype": "text/x-python",
994 | "name": "python",
995 | "nbconvert_exporter": "python",
996 | "pygments_lexer": "ipython3",
997 | "version": "3.7.9"
998 | }
999 | },
1000 | "nbformat": 4,
1001 | "nbformat_minor": 4
1002 | }
1003 |
--------------------------------------------------------------------------------