├── .gitignore
├── 1_a_create_a_dataframe_from_dictonary.ipynb
├── 1_b_create_a_dataframe_by_iterating_and_inserting_rows_.ipynb
├── 1_c_create_dataframe_from_a_csv_file.ipynb
├── 1_d_change_column_names.ipynb
├── 1_e_selecting_columns_or_choosing_columns.ipynb
├── 1_f_drop_or_delete_a_column.ipynb
├── 1_g_create_a_dataframe_with_randomly_generated_data.ipynb
├── 2_a_iterate_over_a_dataframe.ipynb
├── 2_b_apply_a_function_row_wise.ipynb
├── 2_c_apply_a_function_to_a_column.ipynb
├── 2_d_find_and_replace_a_value_in_dataframe_column.ipynb
├── 3_a_merge_dataframes_by_joining_columns.ipynb
├── 3_b_merge_dataframe_by_columns_on_index.ipynb
├── 3_c_merge_two_dataframes_and_split_again.ipynb
├── 3_d_group_by_and_interate.ipynb
├── 4_a_get_binary_or_logical_columns_from_dataframe.ipynb
├── 4_b_convert_categorical_columns_to_label_encoded_columns_or_integer_column.ipynb
├── 4_c_reduce_dimension_of_categorical_column.ipynb
├── 4_d_convert_categorical_columns_to_one_hot_encoded_columns.ipynb
├── 5_a_split_a_column_into_multiple_columns_based_on_delimiter.ipynb
├── 5_b_split_a_column_into_multiple_columns_one_hot_encoding.ipynb
├── 6_a_extending_dataframe_capabilities.ipynb
├── LICENSE
├── README.md
└── data
    ├── sample_data.csv
    └── sample_data_2.csv
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | env/
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 |
27 | # PyInstaller
28 | # Usually these files are written by a python script from a template
29 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
30 | *.manifest
31 | *.spec
32 |
33 | # Installer logs
34 | pip-log.txt
35 | pip-delete-this-directory.txt
36 |
37 | # Unit test / coverage reports
38 | htmlcov/
39 | .tox/
40 | .coverage
41 | .coverage.*
42 | .cache
43 | nosetests.xml
44 | coverage.xml
45 | *,cover
46 | .hypothesis/
47 |
48 | # Translations
49 | *.mo
50 | *.pot
51 |
52 | # Django stuff:
53 | *.log
54 | local_settings.py
55 |
56 | # Flask stuff:
57 | instance/
58 | .webassets-cache
59 |
60 | # Scrapy stuff:
61 | .scrapy
62 |
63 | # Sphinx documentation
64 | docs/_build/
65 |
66 | # PyBuilder
67 | target/
68 |
69 | # IPython Notebook
70 | .ipynb_checkpoints
71 |
72 | # pyenv
73 | .python-version
74 |
75 | # celery beat schedule file
76 | celerybeat-schedule
77 |
78 | # dotenv
79 | .env
80 |
81 | # virtualenv
82 | venv/
83 | ENV/
84 |
85 | # Spyder project settings
86 | .spyderproject
87 |
88 | # Rope project settings
89 | .ropeproject
90 |
--------------------------------------------------------------------------------
/1_a_create_a_dataframe_from_dictonary.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Create A DataFrame from dictionary"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 2,
24 | "metadata": {
25 | "collapsed": true
26 | },
27 | "outputs": [],
28 | "source": [
29 | "data = [{'name': 'vikash', 'age': 27}, {'name': 'Satyam', 'age': 14}]"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 3,
35 | "metadata": {},
36 | "outputs": [],
37 | "source": [
38 | "df = pd.DataFrame(data)"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 4,
44 | "metadata": {},
45 | "outputs": [
46 | {
47 | "data": {
86 | "text/plain": [
87 | " age name\n",
88 | "0 27 vikash\n",
89 | "1 14 Satyam"
90 | ]
91 | },
92 | "execution_count": 4,
93 | "metadata": {},
94 | "output_type": "execute_result"
95 | }
96 | ],
97 | "source": [
98 | "df"
99 | ]
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {
104 | "collapsed": true
105 | },
106 | "source": [
107 | "## If the Dictionary is nested you first need to normalize it"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": 5,
113 | "metadata": {
114 | "collapsed": true
115 | },
116 | "outputs": [],
117 | "source": [
118 | "from pandas import json_normalize"
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "execution_count": 6,
124 | "metadata": {
125 | "collapsed": true
126 | },
127 | "outputs": [],
128 | "source": [
129 | "data = [\n",
130 | " {\n",
131 | " 'name': {\n",
132 | " 'first': 'vikash',\n",
133 | " 'last': 'singh'\n",
134 | " },\n",
135 | " 'age': 27\n",
136 | " },\n",
137 | " {\n",
138 | " 'name': {\n",
139 | " 'first': 'satyam',\n",
140 | " 'last': 'singh'\n",
141 | " },\n",
142 | " 'age': 14\n",
143 | " }\n",
144 | "]"
145 | ]
146 | },
147 | {
148 | "cell_type": "code",
149 | "execution_count": 7,
150 | "metadata": {},
151 | "outputs": [],
152 | "source": [
153 | "df = json_normalize(data)"
154 | ]
155 | },
156 | {
157 | "cell_type": "code",
158 | "execution_count": 8,
159 | "metadata": {},
160 | "outputs": [
161 | {
162 | "data": {
204 | "text/plain": [
205 | " age name.first name.last\n",
206 | "0 27 vikash singh\n",
207 | "1 14 satyam singh"
208 | ]
209 | },
210 | "execution_count": 8,
211 | "metadata": {},
212 | "output_type": "execute_result"
213 | }
214 | ],
215 | "source": [
216 | "df"
217 | ]
218 | },
219 | {
220 | "cell_type": "code",
221 | "execution_count": null,
222 | "metadata": {
223 | "collapsed": true
224 | },
225 | "outputs": [],
226 | "source": []
227 | }
228 | ],
229 | "metadata": {
230 | "anaconda-cloud": {},
231 | "kernelspec": {
232 | "display_name": "Python [conda root]",
233 | "language": "python",
234 | "name": "conda-root-py"
235 | },
236 | "language_info": {
237 | "codemirror_mode": {
238 | "name": "ipython",
239 | "version": 3
240 | },
241 | "file_extension": ".py",
242 | "mimetype": "text/x-python",
243 | "name": "python",
244 | "nbconvert_exporter": "python",
245 | "pygments_lexer": "ipython3",
246 | "version": "3.5.3"
247 | }
248 | },
249 | "nbformat": 4,
250 | "nbformat_minor": 1
251 | }
252 |
--------------------------------------------------------------------------------
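The nested-dictionary case in the notebook above can also be sketched with the top-level `pd.json_normalize`, assuming pandas 1.0 or newer (the notebook itself was recorded on an older release). The variable names `records`, `flat`, and `simple` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Same nested records as in the notebook.
records = [
    {"name": {"first": "vikash", "last": "singh"}, "age": 27},
    {"name": {"first": "satyam", "last": "singh"}, "age": 14},
]

# json_normalize flattens nested keys into dotted column names and
# already returns a DataFrame, so no extra from_dict call is needed.
flat = pd.json_normalize(records)

# A flat list of dictionaries can go straight to the constructor.
simple = pd.DataFrame([{"name": "vikash", "age": 27}, {"name": "Satyam", "age": 14}])

print(sorted(flat.columns))
print(simple.shape)
```

Column order may differ between pandas versions, so the sketch sorts the names before printing.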
/1_b_create_a_dataframe_by_iterating_and_inserting_rows_.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "Create A DataFrame by iterating and inserting rows"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 3,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "from random import randint"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 4,
25 | "metadata": {
26 | "collapsed": true
27 | },
28 | "outputs": [],
29 | "source": [
30 | "columns = ['a', 'b', 'c']"
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": 5,
36 | "metadata": {
37 | "collapsed": false
38 | },
39 | "outputs": [],
40 | "source": [
41 | "df = pd.DataFrame(columns=columns)"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": 6,
47 | "metadata": {
48 | "collapsed": false
49 | },
50 | "outputs": [],
51 | "source": [
52 | "for i in range(5):\n",
53 | " df.loc[i] = [randint(-1,1) for n in range(3)]"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 7,
59 | "metadata": {
60 | "collapsed": false
61 | },
62 | "outputs": [
63 | {
64 | "data": {
111 | "text/plain": [
112 | " a b c\n",
113 | "0 0.0 -1.0 -1.0\n",
114 | "1 1.0 1.0 0.0\n",
115 | "2 1.0 -1.0 1.0\n",
116 | "3 -1.0 1.0 1.0\n",
117 | "4 -1.0 -1.0 1.0"
118 | ]
119 | },
120 | "execution_count": 7,
121 | "metadata": {},
122 | "output_type": "execute_result"
123 | }
124 | ],
125 | "source": [
126 | "df"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": null,
132 | "metadata": {
133 | "collapsed": true
134 | },
135 | "outputs": [],
136 | "source": []
137 | }
138 | ],
139 | "metadata": {
140 | "anaconda-cloud": {},
141 | "kernelspec": {
142 | "display_name": "Python [default]",
143 | "language": "python",
144 | "name": "python3"
145 | },
146 | "language_info": {
147 | "codemirror_mode": {
148 | "name": "ipython",
149 | "version": 3
150 | },
151 | "file_extension": ".py",
152 | "mimetype": "text/x-python",
153 | "name": "python",
154 | "nbconvert_exporter": "python",
155 | "pygments_lexer": "ipython3",
156 | "version": "3.5.2"
157 | }
158 | },
159 | "nbformat": 4,
160 | "nbformat_minor": 0
161 | }
162 |
--------------------------------------------------------------------------------
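Growing a frame with `df.loc[i] = ...` as in the notebook above works, but each assignment may reallocate, and the recorded output shows the integers coming back as floats. One possible alternative, shown here only as a sketch, is to collect the rows in a plain list and build the frame once. The fixed seed is an assumption added so the example is repeatable.

```python
from random import randint, seed

import pandas as pd

seed(0)  # assumption: fixed seed only so the sketch is repeatable

columns = ["a", "b", "c"]

# Collect all rows first, then construct the DataFrame in one step.
rows = [[randint(-1, 1) for _ in columns] for _ in range(5)]
df = pd.DataFrame(rows, columns=columns)

print(df.dtypes)
```

Building the frame in one call lets pandas infer an integer dtype for every column.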
/1_c_create_dataframe_from_a_csv_file.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "Create A DataFrame from csv file"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 2,
24 | "metadata": {
25 | "collapsed": false
26 | },
27 | "outputs": [],
28 | "source": [
29 | "df = pd.read_csv('./data/sample_data.csv')"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 3,
35 | "metadata": {
36 | "collapsed": false
37 | },
38 | "outputs": [
39 | {
40 | "data": {
75 | "text/plain": [
76 | " col_1 col_2 target\n",
77 | "0 0.11 0.22 1\n",
78 | "1 0.10 0.20 1\n",
79 | "2 0.90 0.80 0"
80 | ]
81 | },
82 | "execution_count": 3,
83 | "metadata": {},
84 | "output_type": "execute_result"
85 | }
86 | ],
87 | "source": [
88 | "df"
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": 4,
94 | "metadata": {
95 | "collapsed": true
96 | },
97 | "outputs": [],
98 | "source": [
99 | "df_2 = pd.read_csv('./data/sample_data_2.csv', header=None)"
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": 5,
105 | "metadata": {
106 | "collapsed": false
107 | },
108 | "outputs": [
109 | {
110 | "data": {
145 | "text/plain": [
146 | " 0 1 2\n",
147 | "0 0.11 0.22 1\n",
148 | "1 0.10 0.20 1\n",
149 | "2 0.90 0.80 0"
150 | ]
151 | },
152 | "execution_count": 5,
153 | "metadata": {},
154 | "output_type": "execute_result"
155 | }
156 | ],
157 | "source": [
158 | "df_2"
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": 6,
164 | "metadata": {
165 | "collapsed": true
166 | },
167 | "outputs": [],
168 | "source": [
169 | "df_2.columns = ['col_1', 'col_2', 'target']"
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": 7,
175 | "metadata": {
176 | "collapsed": false
177 | },
178 | "outputs": [
179 | {
180 | "data": {
215 | "text/plain": [
216 | " col_1 col_2 target\n",
217 | "0 0.11 0.22 1\n",
218 | "1 0.10 0.20 1\n",
219 | "2 0.90 0.80 0"
220 | ]
221 | },
222 | "execution_count": 7,
223 | "metadata": {},
224 | "output_type": "execute_result"
225 | }
226 | ],
227 | "source": [
228 | "df_2"
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": null,
234 | "metadata": {
235 | "collapsed": true
236 | },
237 | "outputs": [],
238 | "source": []
239 | }
240 | ],
241 | "metadata": {
242 | "anaconda-cloud": {},
243 | "kernelspec": {
244 | "display_name": "Python [default]",
245 | "language": "python",
246 | "name": "python3"
247 | },
248 | "language_info": {
249 | "codemirror_mode": {
250 | "name": "ipython",
251 | "version": 3
252 | },
253 | "file_extension": ".py",
254 | "mimetype": "text/x-python",
255 | "name": "python",
256 | "nbconvert_exporter": "python",
257 | "pygments_lexer": "ipython3",
258 | "version": "3.5.2"
259 | }
260 | },
261 | "nbformat": 4,
262 | "nbformat_minor": 0
263 | }
264 |
--------------------------------------------------------------------------------
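For the header-less file in the notebook above, `pd.read_csv` can assign the column names while reading, which avoids the separate `df_2.columns = ...` step. The sketch below is self-contained: `csv_text` is an assumed stand-in for `./data/sample_data_2.csv`, inferred from the output shown in the notebook, since the file itself is not reproduced here.

```python
import io

import pandas as pd

# Assumed file contents, inferred from the notebook's displayed output.
csv_text = "0.11,0.22,1\n0.10,0.20,1\n0.90,0.80,0\n"

# header=None says the first line is data; names= supplies the labels.
df_2 = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["col_1", "col_2", "target"],
)

print(df_2)
```

With a real file, the path string replaces `io.StringIO(csv_text)`.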
/1_d_change_column_names.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "Change column names in dataframe"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "df = pd.DataFrame([['AA', 1, 'a'],['BB', 2, 'a'],['CC', 3, 'a']], columns = ['name','value','salue'])"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": false
27 | },
28 | "outputs": [
29 | {
30 | "data": {
65 | "text/plain": [
66 | " name value salue\n",
67 | "0 AA 1 a\n",
68 | "1 BB 2 a\n",
69 | "2 CC 3 a"
70 | ]
71 | },
72 | "execution_count": 2,
73 | "metadata": {},
74 | "output_type": "execute_result"
75 | }
76 | ],
77 | "source": [
78 | "df"
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": 3,
84 | "metadata": {
85 | "collapsed": false
86 | },
87 | "outputs": [],
88 | "source": [
89 | "df.columns = [df.columns[0]] + ['prefix_' + val for val in df.columns[1:]]"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": 4,
95 | "metadata": {
96 | "collapsed": false
97 | },
98 | "outputs": [
99 | {
100 | "data": {
101 | "text/plain": [
102 | "array(['name', 'prefix_value', 'prefix_salue'], dtype=object)"
103 | ]
104 | },
105 | "execution_count": 4,
106 | "metadata": {},
107 | "output_type": "execute_result"
108 | }
109 | ],
110 | "source": [
111 | "df.columns.values"
112 | ]
113 | },
114 | {
115 | "cell_type": "code",
116 | "execution_count": 5,
117 | "metadata": {
118 | "collapsed": false
119 | },
120 | "outputs": [
121 | {
122 | "data": {
157 | "text/plain": [
158 | " name prefix_value prefix_salue\n",
159 | "0 AA 1 a\n",
160 | "1 BB 2 a\n",
161 | "2 CC 3 a"
162 | ]
163 | },
164 | "execution_count": 5,
165 | "metadata": {},
166 | "output_type": "execute_result"
167 | }
168 | ],
169 | "source": [
170 | "df"
171 | ]
172 | },
173 | {
174 | "cell_type": "code",
175 | "execution_count": null,
176 | "metadata": {
177 | "collapsed": true
178 | },
179 | "outputs": [],
180 | "source": []
181 | }
182 | ],
183 | "metadata": {
184 | "anaconda-cloud": {},
185 | "kernelspec": {
186 | "display_name": "Python [default]",
187 | "language": "python",
188 | "name": "python3"
189 | },
190 | "language_info": {
191 | "codemirror_mode": {
192 | "name": "ipython",
193 | "version": 3
194 | },
195 | "file_extension": ".py",
196 | "mimetype": "text/x-python",
197 | "name": "python",
198 | "nbconvert_exporter": "python",
199 | "pygments_lexer": "ipython3",
200 | "version": "3.5.2"
201 | }
202 | },
203 | "nbformat": 4,
204 | "nbformat_minor": 0
205 | }
206 |
--------------------------------------------------------------------------------
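The prefixing step in the notebook above can also be written with `DataFrame.rename`, which returns a new frame and leaves the original labels untouched. This is a sketch of one alternative, not the notebook's own approach; the name `renamed` is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [["AA", 1, "a"], ["BB", 2, "a"], ["CC", 3, "a"]],
    columns=["name", "value", "salue"],
)

# Map every column except the first to its prefixed name.
mapping = {col: "prefix_" + col for col in df.columns[1:]}
renamed = df.rename(columns=mapping)

print(list(renamed.columns))
```

Because `rename` takes a mapping, only the listed columns change and a mistyped key is ignored rather than renaming the wrong column.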
/1_e_selecting_columns_or_choosing_columns.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "Selecting and picking columns from a Dataframe"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "df = pd.DataFrame([['AA', \"temp\", 1],['BB', \"temp\", 2],['CC', \"temp\", 3]], columns = ['name','temp', 'value'])"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": false
27 | },
28 | "outputs": [
29 | {
30 | "data": {
65 | "text/plain": [
66 | " name temp value\n",
67 | "0 AA temp 1\n",
68 | "1 BB temp 2\n",
69 | "2 CC temp 3"
70 | ]
71 | },
72 | "execution_count": 2,
73 | "metadata": {},
74 | "output_type": "execute_result"
75 | }
76 | ],
77 | "source": [
78 | "df"
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": 3,
84 | "metadata": {
85 | "collapsed": false
86 | },
87 | "outputs": [],
88 | "source": [
89 | "df = df[[\"name\",\"temp\"]]"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": 4,
95 | "metadata": {
96 | "collapsed": false
97 | },
98 | "outputs": [
99 | {
100 | "data": {
131 | "text/plain": [
132 | " name temp\n",
133 | "0 AA temp\n",
134 | "1 BB temp\n",
135 | "2 CC temp"
136 | ]
137 | },
138 | "execution_count": 4,
139 | "metadata": {},
140 | "output_type": "execute_result"
141 | }
142 | ],
143 | "source": [
144 | "df"
145 | ]
146 | },
147 | {
148 | "cell_type": "code",
149 | "execution_count": null,
150 | "metadata": {
151 | "collapsed": true
152 | },
153 | "outputs": [],
154 | "source": []
155 | }
156 | ],
157 | "metadata": {
158 | "anaconda-cloud": {},
159 | "kernelspec": {
160 | "display_name": "Python [conda root]",
161 | "language": "python",
162 | "name": "conda-root-py"
163 | },
164 | "language_info": {
165 | "codemirror_mode": {
166 | "name": "ipython",
167 | "version": 3
168 | },
169 | "file_extension": ".py",
170 | "mimetype": "text/x-python",
171 | "name": "python",
172 | "nbconvert_exporter": "python",
173 | "pygments_lexer": "ipython3",
174 | "version": "3.5.2"
175 | }
176 | },
177 | "nbformat": 4,
178 | "nbformat_minor": 0
179 | }
180 |
--------------------------------------------------------------------------------
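Besides selecting by an explicit list of names as in the notebook above, columns can be picked by dtype. The sketch below shows both; `subset` and `numeric` are illustrative names, and the `.copy()` is an optional precaution for when the selection will be modified later.

```python
import pandas as pd

df = pd.DataFrame(
    [["AA", "temp", 1], ["BB", "temp", 2], ["CC", "temp", 3]],
    columns=["name", "temp", "value"],
)

# Select by label; the double brackets keep the result a DataFrame.
subset = df[["name", "temp"]].copy()

# Select by dtype: only the numeric columns.
numeric = df.select_dtypes(include="number")

print(list(subset.columns), list(numeric.columns))
```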
/1_f_drop_or_delete_a_column.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "Drop columns or pop columns from a dataframe"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "df = pd.DataFrame([['AA', 1],['BB', 2],['CC', 3]], columns = ['name','value'])"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": false
27 | },
28 | "outputs": [
29 | {
30 | "data": {
61 | "text/plain": [
62 | " name value\n",
63 | "0 AA 1\n",
64 | "1 BB 2\n",
65 | "2 CC 3"
66 | ]
67 | },
68 | "execution_count": 2,
69 | "metadata": {},
70 | "output_type": "execute_result"
71 | }
72 | ],
73 | "source": [
74 | "df"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": 3,
80 | "metadata": {
81 | "collapsed": true
82 | },
83 | "outputs": [],
84 | "source": [
85 | "df.drop('value', axis=1, inplace=True)"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": 4,
91 | "metadata": {
92 | "collapsed": false
93 | },
94 | "outputs": [
95 | {
96 | "data": {
123 | "text/plain": [
124 | " name\n",
125 | "0 AA\n",
126 | "1 BB\n",
127 | "2 CC"
128 | ]
129 | },
130 | "execution_count": 4,
131 | "metadata": {},
132 | "output_type": "execute_result"
133 | }
134 | ],
135 | "source": [
136 | "df"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": 5,
142 | "metadata": {
143 | "collapsed": true
144 | },
145 | "outputs": [],
146 | "source": [
147 | "df = pd.DataFrame([['AA', 1],['BB', 2],['CC', 3]], columns = ['name','value'])"
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": 6,
153 | "metadata": {
154 | "collapsed": false
155 | },
156 | "outputs": [
157 | {
158 | "data": {
189 | "text/plain": [
190 | " name value\n",
191 | "0 AA 1\n",
192 | "1 BB 2\n",
193 | "2 CC 3"
194 | ]
195 | },
196 | "execution_count": 6,
197 | "metadata": {},
198 | "output_type": "execute_result"
199 | }
200 | ],
201 | "source": [
202 | "df"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": 7,
208 | "metadata": {
209 | "collapsed": true
210 | },
211 | "outputs": [],
212 | "source": [
213 | "values = df.pop('value')"
214 | ]
215 | },
216 | {
217 | "cell_type": "code",
218 | "execution_count": 8,
219 | "metadata": {
220 | "collapsed": false
221 | },
222 | "outputs": [
223 | {
224 | "data": {
251 | "text/plain": [
252 | " name\n",
253 | "0 AA\n",
254 | "1 BB\n",
255 | "2 CC"
256 | ]
257 | },
258 | "execution_count": 8,
259 | "metadata": {},
260 | "output_type": "execute_result"
261 | }
262 | ],
263 | "source": [
264 | "df"
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "execution_count": 9,
270 | "metadata": {
271 | "collapsed": false
272 | },
273 | "outputs": [
274 | {
275 | "data": {
276 | "text/plain": [
277 | "0 1\n",
278 | "1 2\n",
279 | "2 3\n",
280 | "Name: value, dtype: int64"
281 | ]
282 | },
283 | "execution_count": 9,
284 | "metadata": {},
285 | "output_type": "execute_result"
286 | }
287 | ],
288 | "source": [
289 | "values"
290 | ]
291 | },
292 | {
293 | "cell_type": "code",
294 | "execution_count": null,
295 | "metadata": {
296 | "collapsed": true
297 | },
298 | "outputs": [],
299 | "source": []
300 | }
301 | ],
302 | "metadata": {
303 | "anaconda-cloud": {},
304 | "kernelspec": {
305 | "display_name": "Python [default]",
306 | "language": "python",
307 | "name": "python3"
308 | },
309 | "language_info": {
310 | "codemirror_mode": {
311 | "name": "ipython",
312 | "version": 3
313 | },
314 | "file_extension": ".py",
315 | "mimetype": "text/x-python",
316 | "name": "python",
317 | "nbconvert_exporter": "python",
318 | "pygments_lexer": "ipython3",
319 | "version": "3.5.2"
320 | }
321 | },
322 | "nbformat": 4,
323 | "nbformat_minor": 0
324 | }
325 |
--------------------------------------------------------------------------------
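The two removal styles from the notebook above differ in what they return. A small sketch, assuming pandas 0.21 or newer for the `columns=` keyword: `drop` without `inplace` returns a new frame, while `pop` removes the column in place and returns it as a Series. The name `without_value` is illustrative.

```python
import pandas as pd

df = pd.DataFrame([["AA", 1], ["BB", 2], ["CC", 3]], columns=["name", "value"])

# drop returns a new frame; df still has both columns afterwards.
without_value = df.drop(columns=["value"])

# pop removes the column from df itself and hands it back.
values = df.pop("value")

print(list(without_value.columns), list(df.columns), values.tolist())
```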
/1_g_create_a_dataframe_with_randomly_generated_data.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "Create A DataFrame with randomly generated data"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "import numpy as np"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": false
27 | },
28 | "outputs": [],
29 | "source": [
30 | "df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))"
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": 3,
36 | "metadata": {
37 | "collapsed": false
38 | },
39 | "outputs": [
40 | {
41 | "data": {
129 | "text/plain": [
130 | " A B C D\n",
131 | "0 2 3 0 6\n",
132 | "1 8 0 6 6\n",
133 | "2 6 5 4 6\n",
134 | "3 4 2 1 3\n",
135 | "4 4 0 2 6\n",
136 | "5 5 6 1 2\n",
137 | "6 1 7 2 8\n",
138 | "7 4 4 3 6\n",
139 | "8 4 2 6 2\n",
140 | "9 3 1 6 6"
141 | ]
142 | },
143 | "execution_count": 3,
144 | "metadata": {},
145 | "output_type": "execute_result"
146 | }
147 | ],
148 | "source": [
149 | "df"
150 | ]
151 | },
152 | {
153 | "cell_type": "code",
154 | "execution_count": null,
155 | "metadata": {
156 | "collapsed": true
157 | },
158 | "outputs": [],
159 | "source": []
160 | }
161 | ],
162 | "metadata": {
163 | "anaconda-cloud": {},
164 | "kernelspec": {
165 | "display_name": "Python [conda root]",
166 | "language": "python",
167 | "name": "conda-root-py"
168 | },
169 | "language_info": {
170 | "codemirror_mode": {
171 | "name": "ipython",
172 | "version": 3
173 | },
174 | "file_extension": ".py",
175 | "mimetype": "text/x-python",
176 | "name": "python",
177 | "nbconvert_exporter": "python",
178 | "pygments_lexer": "ipython3",
179 | "version": "3.5.2"
180 | }
181 | },
182 | "nbformat": 4,
183 | "nbformat_minor": 1
184 | }
185 |
--------------------------------------------------------------------------------
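The random frame in the notebook above changes on every run. If repeatable data is wanted, one option is NumPy's seeded `Generator`, sketched below under the assumption of NumPy 1.17 or newer. The helper `make_frame` and its `seed_value` parameter are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd


def make_frame(seed_value):
    """Build a 10x4 frame of integers in [0, 10) from a seeded generator."""
    rng = np.random.default_rng(seed_value)
    data = rng.integers(0, 10, size=(10, 4))  # upper bound is exclusive
    return pd.DataFrame(data, columns=list("ABCD"))


df = make_frame(0)
print(df.shape)
```

Calling `make_frame` twice with the same seed yields identical frames, which makes downstream examples easier to compare.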
/2_a_iterate_over_a_dataframe.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "Iterate over a dataframe by rows"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "df = pd.DataFrame([['AA', 1],['BB', 2],['CC', 3]], columns = ['name','value'])"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": false
27 | },
28 | "outputs": [
29 | {
30 | "data": {
61 | "text/plain": [
62 | " name value\n",
63 | "0 AA 1\n",
64 | "1 BB 2\n",
65 | "2 CC 3"
66 | ]
67 | },
68 | "execution_count": 2,
69 | "metadata": {},
70 | "output_type": "execute_result"
71 | }
72 | ],
73 | "source": [
74 | "df"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": 3,
80 | "metadata": {
81 | "collapsed": false
82 | },
83 | "outputs": [
84 | {
85 | "name": "stdout",
86 | "output_type": "stream",
87 | "text": [
88 | "AA 1\n",
89 | "BB 2\n",
90 | "CC 3\n"
91 | ]
92 | }
93 | ],
94 | "source": [
95 | "for i, row in df.iterrows():\n",
96 | " print(row['name'], row['value'])"
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": 14,
102 | "metadata": {
103 | "collapsed": false
104 | },
105 | "outputs": [
106 | {
107 | "name": "stdout",
108 | "output_type": "stream",
109 | "text": [
110 | "Pandas(Index=0, name='AA', value=1)\n",
111 | "Pandas(Index=1, name='BB', value=2)\n",
112 | "Pandas(Index=2, name='CC', value=3)\n"
113 | ]
114 | }
115 | ],
116 | "source": [
117 | "for row in df.itertuples():\n",
118 | " print(row)"
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "execution_count": null,
124 | "metadata": {
125 | "collapsed": true
126 | },
127 | "outputs": [],
128 | "source": []
129 | }
130 | ],
131 | "metadata": {
132 | "anaconda-cloud": {},
133 | "kernelspec": {
134 | "display_name": "Python [conda root]",
135 | "language": "python",
136 | "name": "conda-root-py"
137 | },
138 | "language_info": {
139 | "codemirror_mode": {
140 | "name": "ipython",
141 | "version": 3
142 | },
143 | "file_extension": ".py",
144 | "mimetype": "text/x-python",
145 | "name": "python",
146 | "nbconvert_exporter": "python",
147 | "pygments_lexer": "ipython3",
148 | "version": "3.5.2"
149 | }
150 | },
151 | "nbformat": 4,
152 | "nbformat_minor": 0
153 | }
154 |
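A minimal sketch extending the notebook above: `itertuples` yields namedtuples, so column values can be read as attributes instead of by key. The `pairs` variable is a hypothetical name introduced for illustration, and attribute access assumes the column names are valid Python identifiers, as they are here.

```python
import pandas as pd

df = pd.DataFrame([['AA', 1], ['BB', 2], ['CC', 3]], columns=['name', 'value'])

# index=False drops the leading Index field from each namedtuple,
# leaving only the columns, which are readable as attributes.
pairs = [(row.name, row.value) for row in df.itertuples(index=False)]
print(pairs)
```

For row-by-row loops, `itertuples` is generally the lighter option because `iterrows` builds a full Series for every row.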
--------------------------------------------------------------------------------
/2_b_apply_a_function_row_wise.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "apply a function to dataframe rows"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": false
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "df = pd.DataFrame([['AA', 1],['BB', 2],['CC', 3]], columns = ['name','value'])"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": false
27 | },
28 | "outputs": [
29 | {
30 | "data": {
61 | "text/plain": [
62 | " name value\n",
63 | "0 AA 1\n",
64 | "1 BB 2\n",
65 | "2 CC 3"
66 | ]
67 | },
68 | "execution_count": 2,
69 | "metadata": {},
70 | "output_type": "execute_result"
71 | }
72 | ],
73 | "source": [
74 | "df"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": 3,
80 | "metadata": {
81 | "collapsed": true
82 | },
83 | "outputs": [],
84 | "source": [
85 | "def function_1(val_1, val_2):\n",
86 | " return val_1 + str(val_2)"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": 4,
92 | "metadata": {
93 | "collapsed": false
94 | },
95 | "outputs": [],
96 | "source": [
97 | "df['col_a'] = df.apply(lambda row: function_1(row['name'], row['value']), axis=1)"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 5,
103 | "metadata": {
104 | "collapsed": false
105 | },
106 | "outputs": [
107 | {
108 | "data": {
143 | "text/plain": [
144 | " name value col_a\n",
145 | "0 AA 1 AA1\n",
146 | "1 BB 2 BB2\n",
147 | "2 CC 3 CC3"
148 | ]
149 | },
150 | "execution_count": 5,
151 | "metadata": {},
152 | "output_type": "execute_result"
153 | }
154 | ],
155 | "source": [
156 | "df"
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": 6,
162 | "metadata": {
163 | "collapsed": true
164 | },
165 | "outputs": [],
166 | "source": [
167 | "def function_2(row):\n",
168 | " return row['value'] * 2"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": 7,
174 | "metadata": {
175 | "collapsed": true
176 | },
177 | "outputs": [],
178 | "source": [
179 | "df['col_b'] = df.apply(function_2, axis=1)"
180 | ]
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": 8,
185 | "metadata": {
186 | "collapsed": false
187 | },
188 | "outputs": [
189 | {
190 | "data": {
229 | "text/plain": [
230 | " name value col_a col_b\n",
231 | "0 AA 1 AA1 2\n",
232 | "1 BB 2 BB2 4\n",
233 | "2 CC 3 CC3 6"
234 | ]
235 | },
236 | "execution_count": 8,
237 | "metadata": {},
238 | "output_type": "execute_result"
239 | }
240 | ],
241 | "source": [
242 | "df"
243 | ]
244 | }
245 | ],
246 | "metadata": {
247 | "anaconda-cloud": {},
248 | "kernelspec": {
249 | "display_name": "Python [conda root]",
250 | "language": "python",
251 | "name": "conda-root-py"
252 | },
253 | "language_info": {
254 | "codemirror_mode": {
255 | "name": "ipython",
256 | "version": 3
257 | },
258 | "file_extension": ".py",
259 | "mimetype": "text/x-python",
260 | "name": "python",
261 | "nbconvert_exporter": "python",
262 | "pygments_lexer": "ipython3",
263 | "version": "3.5.2"
264 | }
265 | },
266 | "nbformat": 4,
267 | "nbformat_minor": 0
268 | }
269 |
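A possible alternative to the row-wise `apply` calls in the notebook above, sketched under the assumption that the function bodies are simple enough to express as column operations. Both new columns are computed on whole columns at once, which avoids calling a Python function per row.

```python
import pandas as pd

df = pd.DataFrame([['AA', 1], ['BB', 2], ['CC', 3]], columns=['name', 'value'])

# Same result as function_1: concatenate name with the stringified value.
df['col_a'] = df['name'] + df['value'].astype(str)

# Same result as function_2: double the value column.
df['col_b'] = df['value'] * 2

print(df)
```

Row-wise `apply` remains the fallback when the logic cannot be written as column arithmetic.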
--------------------------------------------------------------------------------
/2_c_apply_a_function_to_a_column.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "apply a function to a dataframe column"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": false
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "df = pd.DataFrame([['AA', 1], ['BB', 2], ['CC', 3]], columns=['name', 'value'])"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": false
27 | },
28 | "outputs": [
29 | {
30 | "data": {
61 | "text/plain": [
62 | " name value\n",
63 | "0 AA 1\n",
64 | "1 BB 2\n",
65 | "2 CC 3"
66 | ]
67 | },
68 | "execution_count": 2,
69 | "metadata": {},
70 | "output_type": "execute_result"
71 | }
72 | ],
73 | "source": [
74 | "df"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": 3,
80 | "metadata": {
81 | "collapsed": true
82 | },
83 | "outputs": [],
84 | "source": [
85 | "def function_1(val_1):\n",
86 | " return \"prefix_\" + str(val_1)"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": 4,
92 | "metadata": {
93 | "collapsed": false
94 | },
95 | "outputs": [],
96 | "source": [
97 | "df['name'] = df['name'].map(function_1)"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 5,
103 | "metadata": {
104 | "collapsed": false
105 | },
106 | "outputs": [
107 | {
108 | "data": {
139 | "text/plain": [
140 | " name value\n",
141 | "0 prefix_AA 1\n",
142 | "1 prefix_BB 2\n",
143 | "2 prefix_CC 3"
144 | ]
145 | },
146 | "execution_count": 5,
147 | "metadata": {},
148 | "output_type": "execute_result"
149 | }
150 | ],
151 | "source": [
152 | "df"
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": null,
158 | "metadata": {
159 | "collapsed": true
160 | },
161 | "outputs": [],
162 | "source": []
163 | }
164 | ],
165 | "metadata": {
166 | "anaconda-cloud": {},
167 | "kernelspec": {
168 | "display_name": "Python [conda root]",
169 | "language": "python",
170 | "name": "conda-root-py"
171 | },
172 | "language_info": {
173 | "codemirror_mode": {
174 | "name": "ipython",
175 | "version": 3
176 | },
177 | "file_extension": ".py",
178 | "mimetype": "text/x-python",
179 | "name": "python",
180 | "nbconvert_exporter": "python",
181 | "pygments_lexer": "ipython3",
182 | "version": "3.5.2"
183 | }
184 | },
185 | "nbformat": 4,
186 | "nbformat_minor": 0
187 | }
188 |
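Two related variations on the notebook above, as a hedged sketch. The first does the prefixing with plain string concatenation on the column. The second shows that `Series.map` also accepts a dict. The `prefixed` and `codes` names and the lookup values are hypothetical, chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame([['AA', 1], ['BB', 2], ['CC', 3]], columns=['name', 'value'])

# Prefix every value without defining a helper function.
prefixed = 'prefix_' + df['name']

# map with a dict acts as a lookup table; values with no matching key
# ('CC' here) become NaN in the result.
codes = df['name'].map({'AA': 10, 'BB': 20})

print(prefixed.tolist())
print(codes.isna().tolist())
```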
--------------------------------------------------------------------------------
/2_d_find_and_replace_a_value_in_dataframe_column.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "Find a value in a dataframe column and replace it with another value"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 2,
24 | "metadata": {
25 | "collapsed": true
26 | },
27 | "outputs": [],
28 | "source": [
29 | "df = pd.DataFrame([['One', 'Two'], ['Four', 'Abcd'], ['One', 'Bcd'], ['Five', 'Cd']], columns=['A', 'B'])"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 3,
35 | "metadata": {
36 | "collapsed": false
37 | },
38 | "outputs": [
39 | {
40 | "data": {
76 | "text/plain": [
77 | " A B\n",
78 | "0 One Two\n",
79 | "1 Four Abcd\n",
80 | "2 One Bcd\n",
81 | "3 Five Cd"
82 | ]
83 | },
84 | "execution_count": 3,
85 | "metadata": {},
86 | "output_type": "execute_result"
87 | }
88 | ],
89 | "source": [
90 | "df"
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": 4,
96 | "metadata": {
97 | "collapsed": false
98 | },
99 | "outputs": [],
100 | "source": [
101 | "df.loc[df['A'] == 'One', 'A'] = 0"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": 5,
107 | "metadata": {
108 | "collapsed": false
109 | },
110 | "outputs": [
111 | {
112 | "data": {
148 | "text/plain": [
149 | " A B\n",
150 | "0 0 Two\n",
151 | "1 Four Abcd\n",
152 | "2 0 Bcd\n",
153 | "3 Five Cd"
154 | ]
155 | },
156 | "execution_count": 5,
157 | "metadata": {},
158 | "output_type": "execute_result"
159 | }
160 | ],
161 | "source": [
162 | "df"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": null,
168 | "metadata": {
169 | "collapsed": true
170 | },
171 | "outputs": [],
172 | "source": []
173 | }
174 | ],
175 | "metadata": {
176 | "anaconda-cloud": {},
177 | "kernelspec": {
178 | "display_name": "Python [conda root]",
179 | "language": "python",
180 | "name": "conda-root-py"
181 | },
182 | "language_info": {
183 | "codemirror_mode": {
184 | "name": "ipython",
185 | "version": 3
186 | },
187 | "file_extension": ".py",
188 | "mimetype": "text/x-python",
189 | "name": "python",
190 | "nbconvert_exporter": "python",
191 | "pygments_lexer": "ipython3",
192 | "version": "3.5.2"
193 | }
194 | },
195 | "nbformat": 4,
196 | "nbformat_minor": 0
197 | }
198 |
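The boolean-mask assignment in the notebook above handles one value at a time. When several values need replacing, `Series.replace` with a dict may be more convenient. This is a sketch, and the replacement strings are made up for illustration. Note that `replace` returns a new Series and leaves the original column untouched unless the result is assigned back.

```python
import pandas as pd

df = pd.DataFrame(
    [['One', 'Two'], ['Four', 'Abcd'], ['One', 'Bcd'], ['Five', 'Cd']],
    columns=['A', 'B'],
)

# Replace several exact values in one call; unmatched values pass through.
replaced = df['A'].replace({'One': 'Uno', 'Five': 'Cinco'})

print(replaced.tolist())
```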
--------------------------------------------------------------------------------
/3_a_merge_dataframes_by_joining_columns.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "Merge dataframes by joining on a column"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 2,
24 | "metadata": {
25 | "collapsed": false
26 | },
27 | "outputs": [
28 | {
29 | "data": {
55 | "text/plain": [
56 | " A B\n",
57 | "0 1 3\n",
58 | "1 2 4"
59 | ]
60 | },
61 | "execution_count": 2,
62 | "metadata": {},
63 | "output_type": "execute_result"
64 | }
65 | ],
66 | "source": [
67 | "df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])\n",
68 | "df"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 3,
74 | "metadata": {
75 | "collapsed": false
76 | },
77 | "outputs": [
78 | {
79 | "data": {
105 | "text/plain": [
106 | " A C\n",
107 | "0 1 5\n",
108 | "1 1 6"
109 | ]
110 | },
111 | "execution_count": 3,
112 | "metadata": {},
113 | "output_type": "execute_result"
114 | }
115 | ],
116 | "source": [
117 | "df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])\n",
118 | "df2"
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "execution_count": 4,
124 | "metadata": {
125 | "collapsed": false
126 | },
127 | "outputs": [
128 | {
129 | "data": {
164 | "text/plain": [
165 | " A B C\n",
166 | "0 1 3 5.0\n",
167 | "1 1 3 6.0\n",
168 | "2 2 4 NaN"
169 | ]
170 | },
171 | "execution_count": 4,
172 | "metadata": {},
173 | "output_type": "execute_result"
174 | }
175 | ],
176 | "source": [
177 | "df.merge(df2, how='left', on='A') # merges on columns A"
178 | ]
179 | },
180 | {
181 | "cell_type": "code",
182 | "execution_count": 5,
183 | "metadata": {
184 | "collapsed": false
185 | },
186 | "outputs": [],
187 | "source": [
188 | "df2.drop_duplicates(subset=['A'], inplace=True)"
189 | ]
190 | },
191 | {
192 | "cell_type": "code",
193 | "execution_count": 6,
194 | "metadata": {
195 | "collapsed": false
196 | },
197 | "outputs": [
198 | {
199 | "data": {
228 | "text/plain": [
229 | " A B C\n",
230 | "0 1 3 5.0\n",
231 | "1 2 4 NaN"
232 | ]
233 | },
234 | "execution_count": 6,
235 | "metadata": {},
236 | "output_type": "execute_result"
237 | }
238 | ],
239 | "source": [
240 | "df.merge(df2, how='left', on='A')"
241 | ]
242 | },
243 | {
244 | "cell_type": "code",
245 | "execution_count": null,
246 | "metadata": {
247 | "collapsed": true
248 | },
249 | "outputs": [],
250 | "source": []
251 | }
252 | ],
253 | "metadata": {
254 | "anaconda-cloud": {},
255 | "kernelspec": {
256 | "display_name": "Python [default]",
257 | "language": "python",
258 | "name": "python3"
259 | },
260 | "language_info": {
261 | "codemirror_mode": {
262 | "name": "ipython",
263 | "version": 3
264 | },
265 | "file_extension": ".py",
266 | "mimetype": "text/x-python",
267 | "name": "python",
268 | "nbconvert_exporter": "python",
269 | "pygments_lexer": "ipython3",
270 | "version": "3.5.2"
271 | }
272 | },
273 | "nbformat": 4,
274 | "nbformat_minor": 0
275 | }
276 |
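The notebook above shows a left merge growing from two rows to three because the key `A` is duplicated in `df2`. A hedged sketch of two `merge` parameters that can surface this: `indicator=True` records where each row came from, and `validate` raises an error when the key relationship is not the expected one. Both are assumed to be available, which holds for recent pandas releases but not necessarily for the version the notebook was run with. `duplicates_detected` is a hypothetical flag for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])
df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])

# indicator=True adds a _merge column: 'both' for matched rows,
# 'left_only' for rows of df with no partner in df2.
merged = df.merge(df2, how='left', on='A', indicator=True)
print(merged)

# validate='one_to_one' requires unique keys on both sides, so the
# duplicated A == 1 in df2 triggers a MergeError.
try:
    df.merge(df2, how='left', on='A', validate='one_to_one')
    duplicates_detected = False
except pd.errors.MergeError:
    duplicates_detected = True

print(duplicates_detected)
```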
--------------------------------------------------------------------------------
/3_b_merge_dataframe_by_columns_on_index.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "Merge dataframes column-wise, aligning rows on the index"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 2,
24 | "metadata": {
25 | "collapsed": false
26 | },
27 | "outputs": [
28 | {
29 | "data": {
55 | "text/plain": [
56 | " A B\n",
57 | "0 1 3\n",
58 | "1 2 4"
59 | ]
60 | },
61 | "execution_count": 2,
62 | "metadata": {},
63 | "output_type": "execute_result"
64 | }
65 | ],
66 | "source": [
67 | "df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])\n",
68 | "df"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 3,
74 | "metadata": {
75 | "collapsed": false
76 | },
77 | "outputs": [
78 | {
79 | "data": {
105 | "text/plain": [
106 | " A D\n",
107 | "0 1 5\n",
108 | "1 1 6"
109 | ]
110 | },
111 | "execution_count": 3,
112 | "metadata": {},
113 | "output_type": "execute_result"
114 | }
115 | ],
116 | "source": [
117 | "df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'D'])\n",
118 | "df2"
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "execution_count": 4,
124 | "metadata": {
125 | "collapsed": false
126 | },
127 | "outputs": [
128 | {
129 | "data": {
161 | "text/plain": [
162 | " A B A D\n",
163 | "0 1 3 1 5\n",
164 | "1 2 4 1 6"
165 | ]
166 | },
167 | "execution_count": 4,
168 | "metadata": {},
169 | "output_type": "execute_result"
170 | }
171 | ],
172 | "source": [
173 | "pd.concat([df, df2], axis=1)"
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": null,
179 | "metadata": {
180 | "collapsed": true
181 | },
182 | "outputs": [],
183 | "source": []
184 | }
185 | ],
186 | "metadata": {
187 | "anaconda-cloud": {},
188 | "kernelspec": {
189 | "display_name": "Python [default]",
190 | "language": "python",
191 | "name": "python3"
192 | },
193 | "language_info": {
194 | "codemirror_mode": {
195 | "name": "ipython",
196 | "version": 3
197 | },
198 | "file_extension": ".py",
199 | "mimetype": "text/x-python",
200 | "name": "python",
201 | "nbconvert_exporter": "python",
202 | "pygments_lexer": "ipython3",
203 | "version": "3.5.2"
204 | }
205 | },
206 | "nbformat": 4,
207 | "nbformat_minor": 0
208 | }
209 |
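The `pd.concat(..., axis=1)` result in the notebook above ends up with two columns named `A`, which makes later selection by name ambiguous. One way around this, sketched here with arbitrarily chosen suffix strings, is `DataFrame.join`, which also aligns on the index but renames overlapping columns.

```python
import pandas as pd

df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])
df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'D'])

# join aligns rows on the index; the suffixes are applied only to
# column names that appear in both frames ('A' here).
joined = df.join(df2, lsuffix='_left', rsuffix='_right')

print(list(joined.columns))
```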
--------------------------------------------------------------------------------
/3_c_merge_two_dataframes_and_split_again.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "Merge two dataframes and split them again.\n",
8 | "Useful for combining train and test data into a single panel,\n",
9 | "applying transformations to the panel in one go,\n",
10 | "and then splitting the panel back into train and test dataframes."
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": 1,
16 | "metadata": {
17 | "collapsed": true
18 | },
19 | "outputs": [],
20 | "source": [
21 | "import pandas as pd\n",
22 | "import numpy as np"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 2,
28 | "metadata": {
29 | "collapsed": false
30 | },
31 | "outputs": [],
32 | "source": [
33 | "ts1 = [1,2,3,4]\n",
34 | "ts2 = [6,7,8,9]\n",
35 | "d = {'col_1': ts1, 'col_2': ts2}"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 3,
41 | "metadata": {
42 | "collapsed": false
43 | },
44 | "outputs": [
45 | {
46 | "data": {
47 | "text/plain": [
48 | "{'col_1': [1, 2, 3, 4], 'col_2': [6, 7, 8, 9]}"
49 | ]
50 | },
51 | "execution_count": 3,
52 | "metadata": {},
53 | "output_type": "execute_result"
54 | }
55 | ],
56 | "source": [
57 | "d"
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": 4,
63 | "metadata": {
64 | "collapsed": false
65 | },
66 | "outputs": [],
67 | "source": [
68 | "df_1 = pd.DataFrame(data=d)"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 5,
74 | "metadata": {
75 | "collapsed": false
76 | },
77 | "outputs": [
78 | {
79 | "data": {
115 | "text/plain": [
116 | " col_1 col_2\n",
117 | "0 1 6\n",
118 | "1 2 7\n",
119 | "2 3 8\n",
120 | "3 4 9"
121 | ]
122 | },
123 | "execution_count": 5,
124 | "metadata": {},
125 | "output_type": "execute_result"
126 | }
127 | ],
128 | "source": [
129 | "df_1"
130 | ]
131 | },
132 | {
133 | "cell_type": "code",
134 | "execution_count": 6,
135 | "metadata": {
136 | "collapsed": true
137 | },
138 | "outputs": [],
139 | "source": [
140 | "df_2 = pd.DataFrame(np.random.randn(3, 2), columns=['col_1', 'col_2'])"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": 7,
146 | "metadata": {
147 | "collapsed": false
148 | },
149 | "outputs": [
150 | {
151 | "data": {
152 | "text/html": [
153 | "\n",
154 | "
\n",
155 | " \n",
156 | " \n",
157 | " | \n",
158 | " col_1 | \n",
159 | " col_2 | \n",
160 | "
\n",
161 | " \n",
162 | " \n",
163 | " \n",
164 | " 0 | \n",
165 | " 0.654547 | \n",
166 | " -1.201099 | \n",
167 | "
\n",
168 | " \n",
169 | " 1 | \n",
170 | " -0.088006 | \n",
171 | " -0.049599 | \n",
172 | "
\n",
173 | " \n",
174 | " 2 | \n",
175 | " 0.609881 | \n",
176 | " -1.003260 | \n",
177 | "
\n",
178 | " \n",
179 | "
\n",
180 | "
"
181 | ],
182 | "text/plain": [
183 | " col_1 col_2\n",
184 | "0 0.654547 -1.201099\n",
185 | "1 -0.088006 -0.049599\n",
186 | "2 0.609881 -1.003260"
187 | ]
188 | },
189 | "execution_count": 7,
190 | "metadata": {},
191 | "output_type": "execute_result"
192 | }
193 | ],
194 | "source": [
195 | "df_2"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 8,
201 | "metadata": {
202 | "collapsed": true
203 | },
204 | "outputs": [],
205 | "source": [
206 | "df_all = pd.concat((df_1, df_2), axis=0, ignore_index=True)"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": 9,
212 | "metadata": {
213 | "collapsed": false
214 | },
215 | "outputs": [
216 | {
217 | "data": {
268 | "text/plain": [
269 | " col_1 col_2\n",
270 | "0 1.000000 6.000000\n",
271 | "1 2.000000 7.000000\n",
272 | "2 3.000000 8.000000\n",
273 | "3 4.000000 9.000000\n",
274 | "4 0.654547 -1.201099\n",
275 | "5 -0.088006 -0.049599\n",
276 | "6 0.609881 -1.003260"
277 | ]
278 | },
279 | "execution_count": 9,
280 | "metadata": {},
281 | "output_type": "execute_result"
282 | }
283 | ],
284 | "source": [
285 | "df_all"
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": 10,
291 | "metadata": {
292 | "collapsed": false
293 | },
294 | "outputs": [
295 | {
296 | "name": "stdout",
297 | "output_type": "stream",
298 | "text": [
299 | "(4, 2)\n",
300 | "(3, 2)\n",
301 | "(7, 2)\n"
302 | ]
303 | }
304 | ],
305 | "source": [
306 | "print(df_1.shape)\n",
307 | "print(df_2.shape)\n",
308 | "print(df_all.shape)"
309 | ]
310 | },
311 | {
312 | "cell_type": "code",
313 | "execution_count": 11,
314 | "metadata": {
315 | "collapsed": false
316 | },
317 | "outputs": [],
318 | "source": [
319 | "df_train = df_all[:df_1.shape[0]]\n",
320 | "df_test = df_all[df_1.shape[0]:]"
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": 12,
326 | "metadata": {
327 | "collapsed": false
328 | },
329 | "outputs": [
330 | {
331 | "name": "stdout",
332 | "output_type": "stream",
333 | "text": [
334 | "(4, 2)\n",
335 | "(3, 2)\n",
336 | "(7, 2)\n"
337 | ]
338 | }
339 | ],
340 | "source": [
341 | "print(df_train.shape)\n",
342 | "print(df_test.shape)\n",
343 | "print(df_all.shape)"
344 | ]
345 | },
346 | {
347 | "cell_type": "code",
348 | "execution_count": null,
349 | "metadata": {
350 | "collapsed": true
351 | },
352 | "outputs": [],
353 | "source": []
354 | }
355 | ],
356 | "metadata": {
357 | "anaconda-cloud": {},
358 | "kernelspec": {
359 | "display_name": "Python [default]",
360 | "language": "python",
361 | "name": "python3"
362 | },
363 | "language_info": {
364 | "codemirror_mode": {
365 | "name": "ipython",
366 | "version": 3
367 | },
368 | "file_extension": ".py",
369 | "mimetype": "text/x-python",
370 | "name": "python",
371 | "nbconvert_exporter": "python",
372 | "pygments_lexer": "ipython3",
373 | "version": "3.5.2"
374 | }
375 | },
376 | "nbformat": 4,
377 | "nbformat_minor": 0
378 | }
379 |
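The notebook above splits the combined frame by row position, which depends on remembering the length of the first frame. An alternative sketch labels each part with `keys` when concatenating, so the parts can be recovered by label after the shared transformation. The fixed numbers in `df_2` stand in for the random data, and `col_3` is a hypothetical transformation added for illustration.

```python
import pandas as pd

df_1 = pd.DataFrame({'col_1': [1, 2, 3, 4], 'col_2': [6, 7, 8, 9]})
df_2 = pd.DataFrame({'col_1': [0.5, -0.1, 0.6], 'col_2': [-1.2, 0.0, -1.0]})

# keys adds an outer index level naming the source of each row.
df_all = pd.concat([df_1, df_2], keys=['train', 'test'])

# Apply a transformation to the combined panel in one go.
df_all['col_3'] = df_all['col_1'] * 2

# Select each part back by its outer index label.
df_train = df_all.loc['train']
df_test = df_all.loc['test']

print(df_train.shape, df_test.shape)
```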
--------------------------------------------------------------------------------
/3_d_group_by_and_interate.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "perform group by on dataframe and iterate on the grouped result"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 2,
24 | "metadata": {
25 | "collapsed": true
26 | },
27 | "outputs": [],
28 | "source": [
29 | "classes = [\"class 1\"] * 5 + [\"class 2\"] * 5\n",
30 | "sub_class = ['c1','c2','c2','c1','c3'] + ['c1','c2','c3','c2','c3']\n",
31 | "vals = [1,3,5,1,3] + [2,6,7,5,2]\n",
32 | "p_df = pd.DataFrame({\"class\": classes, \"sub_class\": sub_class, \"vals\": vals})"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 3,
38 | "metadata": {
39 | "collapsed": false
40 | },
41 | "outputs": [
42 | {
43 | "data": {
120 | "text/plain": [
121 | " class sub_class vals\n",
122 | "0 class 1 c1 1\n",
123 | "1 class 1 c2 3\n",
124 | "2 class 1 c2 5\n",
125 | "3 class 1 c1 1\n",
126 | "4 class 1 c3 3\n",
127 | "5 class 2 c1 2\n",
128 | "6 class 2 c2 6\n",
129 | "7 class 2 c3 7\n",
130 | "8 class 2 c2 5\n",
131 | "9 class 2 c3 2"
132 | ]
133 | },
134 | "execution_count": 3,
135 | "metadata": {},
136 | "output_type": "execute_result"
137 | }
138 | ],
139 | "source": [
140 | "p_df"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": 4,
146 | "metadata": {
147 | "collapsed": false
148 | },
149 | "outputs": [],
150 | "source": [
151 | "grouped = p_df.groupby(['class', 'sub_class'])['vals'].median()"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": 5,
157 | "metadata": {
158 | "collapsed": false
159 | },
160 | "outputs": [
161 | {
162 | "data": {
163 | "text/plain": [
164 | "class sub_class\n",
165 | "class 1 c1 1.0\n",
166 | " c2 4.0\n",
167 | " c3 3.0\n",
168 | "class 2 c1 2.0\n",
169 | " c2 5.5\n",
170 | " c3 4.5\n",
171 | "Name: vals, dtype: float64"
172 | ]
173 | },
174 | "execution_count": 5,
175 | "metadata": {},
176 | "output_type": "execute_result"
177 | }
178 | ],
179 | "source": [
180 | "grouped"
181 | ]
182 | },
183 | {
184 | "cell_type": "code",
185 | "execution_count": 6,
186 | "metadata": {
187 | "collapsed": false
188 | },
189 | "outputs": [
190 | {
191 | "name": "stdout",
192 | "output_type": "stream",
193 | "text": [
194 | "class 1 : c1 : 1.0\n",
195 | "class 1 : c2 : 4.0\n",
196 | "class 1 : c3 : 3.0\n",
197 | "class 2 : c1 : 2.0\n",
198 | "class 2 : c2 : 5.5\n",
199 | "class 2 : c3 : 4.5\n"
200 | ]
201 | }
202 | ],
203 | "source": [
204 | "for index_val, value in grouped.items():\n",
205 | " class_name, sub_class_name = index_val\n",
206 | " print(class_name, \":\", sub_class_name, \":\", value)"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": null,
212 | "metadata": {
213 | "collapsed": true
214 | },
215 | "outputs": [],
216 | "source": []
217 | }
218 | ],
219 | "metadata": {
220 | "anaconda-cloud": {},
221 | "kernelspec": {
222 | "display_name": "Python [conda root]",
223 | "language": "python",
224 | "name": "conda-root-py"
225 | },
226 | "language_info": {
227 | "codemirror_mode": {
228 | "name": "ipython",
229 | "version": 3
230 | },
231 | "file_extension": ".py",
232 | "mimetype": "text/x-python",
233 | "name": "python",
234 | "nbconvert_exporter": "python",
235 | "pygments_lexer": "ipython3",
236 | "version": "3.5.2"
237 | }
238 | },
239 | "nbformat": 4,
240 | "nbformat_minor": 0
241 | }
242 |
--------------------------------------------------------------------------------
/4_a_get_binary_or_logical_columns_from_dataframe.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "Get the binary columns (exactly two distinct non-null values) from a dataframe"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 2,
24 | "metadata": {
25 | "collapsed": true
26 | },
27 | "outputs": [],
28 | "source": [
29 | "df = pd.DataFrame({'col_1': [1, 0, 1, None], \n",
30 | " 'col_2': [1.2, 3.1, 4.4, 5.5], \n",
31 | " 'col_3': [1, 2, 3, 4], \n",
32 | " 'col_4': ['a', 'b', 'c', 'd']})"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 3,
38 | "metadata": {
39 | "collapsed": false
40 | },
41 | "outputs": [
42 | {
43 | "data": {
89 | "text/plain": [
90 | " col_1 col_2 col_3 col_4\n",
91 | "0 1.0 1.2 1 a\n",
92 | "1 0.0 3.1 2 b\n",
93 | "2 1.0 4.4 3 c\n",
94 | "3 NaN 5.5 4 d"
95 | ]
96 | },
97 | "execution_count": 3,
98 | "metadata": {},
99 | "output_type": "execute_result"
100 | }
101 | ],
102 | "source": [
103 | "df"
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": 4,
109 | "metadata": {
110 | "collapsed": false
111 | },
112 | "outputs": [],
113 | "source": [
114 | "bool_cols = [col for col in df if df[col].dropna().nunique() == 2]"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": 5,
120 | "metadata": {
121 | "collapsed": false
122 | },
123 | "outputs": [
124 | {
125 | "data": {
126 | "text/plain": [
127 | "['col_1']"
128 | ]
129 | },
130 | "execution_count": 5,
131 | "metadata": {},
132 | "output_type": "execute_result"
133 | }
134 | ],
135 | "source": [
136 | "bool_cols"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": 6,
142 | "metadata": {
143 | "collapsed": false
144 | },
145 | "outputs": [
146 | {
147 | "data": {
178 | "text/plain": [
179 | " col_1\n",
180 | "0 1.0\n",
181 | "1 0.0\n",
182 | "2 1.0\n",
183 | "3 NaN"
184 | ]
185 | },
186 | "execution_count": 6,
187 | "metadata": {},
188 | "output_type": "execute_result"
189 | }
190 | ],
191 | "source": [
192 | "df[bool_cols]"
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "execution_count": null,
198 | "metadata": {
199 | "collapsed": true
200 | },
201 | "outputs": [],
202 | "source": []
203 | }
204 | ],
205 | "metadata": {
206 | "anaconda-cloud": {},
207 | "kernelspec": {
208 | "display_name": "Python [default]",
209 | "language": "python",
210 | "name": "python3"
211 | },
212 | "language_info": {
213 | "codemirror_mode": {
214 | "name": "ipython",
215 | "version": 3
216 | },
217 | "file_extension": ".py",
218 | "mimetype": "text/x-python",
219 | "name": "python",
220 | "nbconvert_exporter": "python",
221 | "pygments_lexer": "ipython3",
222 | "version": "3.5.2"
223 | }
224 | },
225 | "nbformat": 4,
226 | "nbformat_minor": 0
227 | }
228 |
--------------------------------------------------------------------------------
/4_b_convert_categorical_columns_to_label_encoded_columns_or_integer_column.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "Get the categorical columns from a dataframe and convert them to integer codes using label encoding"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "from sklearn.preprocessing import LabelEncoder"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": true
27 | },
28 | "outputs": [],
29 | "source": [
30 | "df = pd.DataFrame({'col_1': [1, 0, 1, None], \n",
31 | " 'col_2': [1.2, 3.1, 4.4, 5.5], \n",
32 | " 'col_3': [1, 2, 3, 4], \n",
33 | " 'col_4': ['a', 'b', 'c', 'd']})"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 3,
39 | "metadata": {
40 | "collapsed": false
41 | },
42 | "outputs": [
43 | {
44 | "data": {
90 | "text/plain": [
91 | " col_1 col_2 col_3 col_4\n",
92 | "0 1.0 1.2 1 a\n",
93 | "1 0.0 3.1 2 b\n",
94 | "2 1.0 4.4 3 c\n",
95 | "3 NaN 5.5 4 d"
96 | ]
97 | },
98 | "execution_count": 3,
99 | "metadata": {},
100 | "output_type": "execute_result"
101 | }
102 | ],
103 | "source": [
104 | "df"
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": 4,
110 | "metadata": {
111 | "collapsed": false
112 | },
113 | "outputs": [
114 | {
115 | "name": "stdout",
116 | "output_type": "stream",
117 | "text": [
118 | "<class 'pandas.core.frame.DataFrame'>\n",
119 | "RangeIndex: 4 entries, 0 to 3\n",
120 | "Data columns (total 4 columns):\n",
121 | "col_1 3 non-null float64\n",
122 | "col_2 4 non-null float64\n",
123 | "col_3 4 non-null int64\n",
124 | "col_4 4 non-null object\n",
125 | "dtypes: float64(2), int64(1), object(1)\n",
126 | "memory usage: 208.0+ bytes\n"
127 | ]
128 | }
129 | ],
130 | "source": [
131 | "df.info()"
132 | ]
133 | },
134 | {
135 | "cell_type": "code",
136 | "execution_count": 5,
137 | "metadata": {
138 | "collapsed": true
139 | },
140 | "outputs": [],
141 | "source": [
142 | "bool_cols = [col for col in df if df[col].dropna().nunique() == 2]"
143 | ]
144 | },
145 | {
146 | "cell_type": "code",
147 | "execution_count": 6,
148 | "metadata": {
149 | "collapsed": false
150 | },
151 | "outputs": [],
152 | "source": [
153 | "for col in bool_cols:\n",
154 | " label = LabelEncoder()\n",
155 | " label.fit(list(df[col].values.astype(\"str\")))\n",
156 | " df[col] = label.transform(list(df[col].values.astype(\"str\")))\n"
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": 7,
162 | "metadata": {
163 | "collapsed": false
164 | },
165 | "outputs": [
166 | {
167 | "name": "stdout",
168 | "output_type": "stream",
169 | "text": [
170 | "<class 'pandas.core.frame.DataFrame'>\n",
171 | "RangeIndex: 4 entries, 0 to 3\n",
172 | "Data columns (total 4 columns):\n",
173 | "col_1 4 non-null int64\n",
174 | "col_2 4 non-null float64\n",
175 | "col_3 4 non-null int64\n",
176 | "col_4 4 non-null object\n",
177 | "dtypes: float64(1), int64(2), object(1)\n",
178 | "memory usage: 208.0+ bytes\n"
179 | ]
180 | }
181 | ],
182 | "source": [
183 | "df.info()"
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": 8,
189 | "metadata": {
190 | "collapsed": false
191 | },
192 | "outputs": [
193 | {
194 | "data": {
240 | "text/plain": [
241 | " col_1 col_2 col_3 col_4\n",
242 | "0 1 1.2 1 a\n",
243 | "1 0 3.1 2 b\n",
244 | "2 1 4.4 3 c\n",
245 | "3 2 5.5 4 d"
246 | ]
247 | },
248 | "execution_count": 8,
249 | "metadata": {},
250 | "output_type": "execute_result"
251 | }
252 | ],
253 | "source": [
254 | "df"
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "execution_count": null,
260 | "metadata": {
261 | "collapsed": true
262 | },
263 | "outputs": [],
264 | "source": []
265 | }
266 | ],
267 | "metadata": {
268 | "anaconda-cloud": {},
269 | "kernelspec": {
270 | "display_name": "Python [default]",
271 | "language": "python",
272 | "name": "python3"
273 | },
274 | "language_info": {
275 | "codemirror_mode": {
276 | "name": "ipython",
277 | "version": 3
278 | },
279 | "file_extension": ".py",
280 | "mimetype": "text/x-python",
281 | "name": "python",
282 | "nbconvert_exporter": "python",
283 | "pygments_lexer": "ipython3",
284 | "version": "3.5.2"
285 | }
286 | },
287 | "nbformat": 4,
288 | "nbformat_minor": 0
289 | }
290 |
--------------------------------------------------------------------------------
/4_c_reduce_dimension_of_categorical_column.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "Sometimes a column in a dataframe has high dimensionality (many distinct values).\n",
8 | "E.g., a categorical column whose 20 most frequent values cover 80% of the cases,\n",
9 | " with the rest forming a long tail.\n",
10 | "In such a case we can replace the long-tail values with 'other', based on a count cutoff."
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": 1,
16 | "metadata": {
17 | "collapsed": true
18 | },
19 | "outputs": [],
20 | "source": [
21 | "import pandas as pd"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 2,
27 | "metadata": {
28 | "collapsed": false
29 | },
30 | "outputs": [],
31 | "source": [
32 | "df = pd.DataFrame({'groups': ['group 1','group 2','group 1','group 2','group 3','group 4','group 5','group 1','group 2','group 5'], \n",
33 | " 'vals': [1,2,3,4,5,6,7,8,9,10]})"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 3,
39 | "metadata": {
40 | "collapsed": false
41 | },
42 | "outputs": [
43 | {
44 | "data": {
110 | "text/plain": [
111 | " groups vals\n",
112 | "0 group 1 1\n",
113 | "1 group 2 2\n",
114 | "2 group 1 3\n",
115 | "3 group 2 4\n",
116 | "4 group 3 5\n",
117 | "5 group 4 6\n",
118 | "6 group 5 7\n",
119 | "7 group 1 8\n",
120 | "8 group 2 9\n",
121 | "9 group 5 10"
122 | ]
123 | },
124 | "execution_count": 3,
125 | "metadata": {},
126 | "output_type": "execute_result"
127 | }
128 | ],
129 | "source": [
130 | "df"
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": 4,
136 | "metadata": {
137 | "collapsed": false
138 | },
139 | "outputs": [
140 | {
141 | "data": {
142 | "text/plain": [
143 | "group 1 3\n",
144 | "group 2 3\n",
145 | "group 5 2\n",
146 | "group 4 1\n",
147 | "group 3 1\n",
148 | "Name: groups, dtype: int64"
149 | ]
150 | },
151 | "execution_count": 4,
152 | "metadata": {},
153 | "output_type": "execute_result"
154 | }
155 | ],
156 | "source": [
157 | "df['groups'].value_counts()"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": 5,
163 | "metadata": {
164 | "collapsed": false
165 | },
166 | "outputs": [],
167 | "source": [
168 | "high_dim_columns = ['groups']\n",
169 | "\n",
170 | "for column in high_dim_columns:\n",
171 | " counts = df[column].value_counts()\n",
172 | " rare_values = counts[counts <= 2].index\n",
173 | " df.loc[df[column].isin(rare_values), column] = 'other'"
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": 6,
179 | "metadata": {
180 | "collapsed": false
181 | },
182 | "outputs": [
183 | {
184 | "data": {
250 | "text/plain": [
251 | " groups vals\n",
252 | "0 group 1 1\n",
253 | "1 group 2 2\n",
254 | "2 group 1 3\n",
255 | "3 group 2 4\n",
256 | "4 other 5\n",
257 | "5 other 6\n",
258 | "6 other 7\n",
259 | "7 group 1 8\n",
260 | "8 group 2 9\n",
261 | "9 other 10"
262 | ]
263 | },
264 | "execution_count": 6,
265 | "metadata": {},
266 | "output_type": "execute_result"
267 | }
268 | ],
269 | "source": [
270 | "df"
271 | ]
272 | },
273 | {
274 | "cell_type": "code",
275 | "execution_count": null,
276 | "metadata": {
277 | "collapsed": true
278 | },
279 | "outputs": [],
280 | "source": []
281 | }
282 | ],
283 | "metadata": {
284 | "anaconda-cloud": {},
285 | "kernelspec": {
286 | "display_name": "Python [conda root]",
287 | "language": "python",
288 | "name": "conda-root-py"
289 | },
290 | "language_info": {
291 | "codemirror_mode": {
292 | "name": "ipython",
293 | "version": 3
294 | },
295 | "file_extension": ".py",
296 | "mimetype": "text/x-python",
297 | "name": "python",
298 | "nbconvert_exporter": "python",
299 | "pygments_lexer": "ipython3",
300 | "version": "3.5.2"
301 | }
302 | },
303 | "nbformat": 4,
304 | "nbformat_minor": 0
305 | }
306 |
--------------------------------------------------------------------------------
/4_d_convert_categorical_columns_to_one_hot_encoded_columns.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "Convert categorical columns to one hot encoded columns"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 2,
24 | "metadata": {
25 | "collapsed": false
26 | },
27 | "outputs": [],
28 | "source": [
29 | "\n",
30 | "df = pd.DataFrame({'sex': ['M', 'F', 'M', 'F'], \n",
31 | " 'col_2': [1.2, 3.1, 4.4, 5.5], \n",
32 | " 'col_3': [1, 2, 3, 4], \n",
33 | " 'col_4': ['a', 'b', 'c', 'd']})"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 3,
39 | "metadata": {
40 | "collapsed": false
41 | },
42 | "outputs": [
43 | {
44 | "data": {
90 | "text/plain": [
91 | " col_2 col_3 col_4 sex\n",
92 | "0 1.2 1 a M\n",
93 | "1 3.1 2 b F\n",
94 | "2 4.4 3 c M\n",
95 | "3 5.5 4 d F"
96 | ]
97 | },
98 | "execution_count": 3,
99 | "metadata": {},
100 | "output_type": "execute_result"
101 | }
102 | ],
103 | "source": [
104 | "df"
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": 4,
110 | "metadata": {
111 | "collapsed": false
112 | },
113 | "outputs": [],
114 | "source": [
115 | "categorical_variables = ['sex']\n",
116 | "\n",
117 | "for variable in categorical_variables:\n",
118 | " # Fill missing data with the word \"Missing\"\n",
119 | " df[variable] = df[variable].fillna(\"Missing\")\n",
120 | " # Create array of dummies\n",
121 | " dummies = pd.get_dummies(df[variable], prefix=variable)\n",
122 | " # Update dataframe to include dummies and drop the main variable\n",
123 | " df = pd.concat([df, dummies], axis=1)\n",
124 | " df.drop([variable], axis=1, inplace=True)"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": 5,
130 | "metadata": {
131 | "collapsed": false
132 | },
133 | "outputs": [
134 | {
135 | "data": {
186 | "text/plain": [
187 | " col_2 col_3 col_4 sex_F sex_M\n",
188 | "0 1.2 1 a 0.0 1.0\n",
189 | "1 3.1 2 b 1.0 0.0\n",
190 | "2 4.4 3 c 0.0 1.0\n",
191 | "3 5.5 4 d 1.0 0.0"
192 | ]
193 | },
194 | "execution_count": 5,
195 | "metadata": {},
196 | "output_type": "execute_result"
197 | }
198 | ],
199 | "source": [
200 | "df"
201 | ]
202 | },
203 | {
204 | "cell_type": "code",
205 | "execution_count": null,
206 | "metadata": {
207 | "collapsed": true
208 | },
209 | "outputs": [],
210 | "source": []
211 | }
212 | ],
213 | "metadata": {
214 | "anaconda-cloud": {},
215 | "kernelspec": {
216 | "display_name": "Python [conda root]",
217 | "language": "python",
218 | "name": "conda-root-py"
219 | },
220 | "language_info": {
221 | "codemirror_mode": {
222 | "name": "ipython",
223 | "version": 3
224 | },
225 | "file_extension": ".py",
226 | "mimetype": "text/x-python",
227 | "name": "python",
228 | "nbconvert_exporter": "python",
229 | "pygments_lexer": "ipython3",
230 | "version": "3.5.2"
231 | }
232 | },
233 | "nbformat": 4,
234 | "nbformat_minor": 0
235 | }
236 |
--------------------------------------------------------------------------------
/5_a_split_a_column_into_multiple_columns_based_on_delimiter.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "Split a text column into multiple columns based on a delimiter."
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 2,
24 | "metadata": {
25 | "collapsed": true
26 | },
27 | "outputs": [],
28 | "source": [
29 | "data = [{'test': 'vikash|Arpit', 'val': 6},\n",
30 | " {'test': 'vikash_1|arpit|Vinayp', 'val': 3},\n",
31 | " {'test': 'arpit|vinayp', 'val': 2}]"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 3,
37 | "metadata": {
38 | "collapsed": true
39 | },
40 | "outputs": [],
41 | "source": [
42 | "df = pd.DataFrame.from_dict(data, orient='columns')"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 4,
48 | "metadata": {
49 | "collapsed": false
50 | },
51 | "outputs": [
52 | {
53 | "data": {
84 | "text/plain": [
85 | " test val\n",
86 | "0 vikash|Arpit 6\n",
87 | "1 vikash_1|arpit|Vinayp 3\n",
88 | "2 arpit|vinayp 2"
89 | ]
90 | },
91 | "execution_count": 4,
92 | "metadata": {},
93 | "output_type": "execute_result"
94 | }
95 | ],
96 | "source": [
97 | "df"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 5,
103 | "metadata": {
104 | "collapsed": false
105 | },
106 | "outputs": [
107 | {
108 | "data": {
143 | "text/plain": [
144 | " 0 1 2\n",
145 | "0 arpit vikash NaN\n",
146 | "1 vinayp arpit vikash_1\n",
147 | "2 vinayp arpit NaN"
148 | ]
149 | },
150 | "execution_count": 5,
151 | "metadata": {},
152 | "output_type": "execute_result"
153 | }
154 | ],
155 | "source": [
156 | "df['test'].apply(lambda x: pd.Series([i for i in reversed(x.lower().split('|'))]))"
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": null,
162 | "metadata": {
163 | "collapsed": true
164 | },
165 | "outputs": [],
166 | "source": []
167 | }
168 | ],
169 | "metadata": {
170 | "anaconda-cloud": {},
171 | "kernelspec": {
172 | "display_name": "Python [conda root]",
173 | "language": "python",
174 | "name": "conda-root-py"
175 | },
176 | "language_info": {
177 | "codemirror_mode": {
178 | "name": "ipython",
179 | "version": 3
180 | },
181 | "file_extension": ".py",
182 | "mimetype": "text/x-python",
183 | "name": "python",
184 | "nbconvert_exporter": "python",
185 | "pygments_lexer": "ipython3",
186 | "version": "3.5.2"
187 | }
188 | },
189 | "nbformat": 4,
190 | "nbformat_minor": 1
191 | }
192 |
--------------------------------------------------------------------------------
/5_b_split_a_column_into_multiple_columns_one_hot_encoding.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "Split a text column into multiple columns based on a delimiter.\n",
8 | "Then convert the values into one-hot encoded columns.\n",
9 | "Basically, this converts a multi-valued categorical variable into one-hot encoded values."
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 1,
15 | "metadata": {
16 | "collapsed": true
17 | },
18 | "outputs": [],
19 | "source": [
20 | "import pandas as pd"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 2,
26 | "metadata": {
27 | "collapsed": true
28 | },
29 | "outputs": [],
30 | "source": [
31 | "data = [{'test': 'vikash|Arpit', 'val': 6},\n",
32 | " {'test': 'vikash_1|arpit|Vinayp', 'val': 3},\n",
33 | " {'test': 'arpit|vinayp', 'val': 2}]"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 3,
39 | "metadata": {
40 | "collapsed": true
41 | },
42 | "outputs": [],
43 | "source": [
44 | "df = pd.DataFrame.from_dict(data, orient='columns')"
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": 4,
50 | "metadata": {
51 | "collapsed": false
52 | },
53 | "outputs": [
54 | {
55 | "data": {
86 | "text/plain": [
87 | " test val\n",
88 | "0 vikash|Arpit 6\n",
89 | "1 vikash_1|arpit|Vinayp 3\n",
90 | "2 arpit|vinayp 2"
91 | ]
92 | },
93 | "execution_count": 4,
94 | "metadata": {},
95 | "output_type": "execute_result"
96 | }
97 | ],
98 | "source": [
99 | "df"
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": 5,
105 | "metadata": {
106 | "collapsed": false
107 | },
108 | "outputs": [],
109 | "source": [
110 | "chosen_columns = set()\n",
111 | "for idx, row in df.iterrows():\n",
112 | " for val in str(row['test']).lower().split('|'):\n",
113 | " chosen_columns.add(val.strip())"
114 | ]
115 | },
116 | {
117 | "cell_type": "code",
118 | "execution_count": 6,
119 | "metadata": {
120 | "collapsed": false
121 | },
122 | "outputs": [],
123 | "source": [
124 | "chosen_columns_list = list(chosen_columns)"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": 7,
130 | "metadata": {
131 | "collapsed": false
132 | },
133 | "outputs": [],
134 | "source": [
135 | "chosen_columns_list.sort(key=len, reverse=True)"
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": 8,
141 | "metadata": {
142 | "collapsed": false
143 | },
144 | "outputs": [
145 | {
146 | "data": {
147 | "text/plain": [
148 | "['vikash_1', 'vinayp', 'vikash', 'arpit']"
149 | ]
150 | },
151 | "execution_count": 8,
152 | "metadata": {},
153 | "output_type": "execute_result"
154 | }
155 | ],
156 | "source": [
157 | "chosen_columns_list"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": 9,
163 | "metadata": {
164 | "collapsed": false
165 | },
166 | "outputs": [],
167 | "source": [
168 | "def get_one_hot_encoded_column(col_value):\n",
169 | "    col_value = col_value.lower()\n",
170 | "    new_col_value = ''\n",
171 | "    for val in chosen_columns_list:\n",
172 | "        if val in col_value.split('|'):\n",
173 | "            col_value = col_value.replace(val, '')\n",
174 | "            new_col_value += '1,'\n",
175 | "        else:\n",
176 | "            new_col_value += '0,'\n",
177 | "    return new_col_value[:-1]"
178 | ]
179 | },
180 | {
181 | "cell_type": "code",
182 | "execution_count": 10,
183 | "metadata": {
184 | "collapsed": false
185 | },
186 | "outputs": [],
187 | "source": [
188 | "df['test_new'] = df['test'].map(get_one_hot_encoded_column)"
189 | ]
190 | },
191 | {
192 | "cell_type": "code",
193 | "execution_count": 11,
194 | "metadata": {
195 | "collapsed": false
196 | },
197 | "outputs": [
198 | {
199 | "data": {
234 | "text/plain": [
235 | " test val test_new\n",
236 | "0 vikash|Arpit 6 0,0,1,1\n",
237 | "1 vikash_1|arpit|Vinayp 3 1,1,0,1\n",
238 | "2 arpit|vinayp 2 0,1,0,1"
239 | ]
240 | },
241 | "execution_count": 11,
242 | "metadata": {},
243 | "output_type": "execute_result"
244 | }
245 | ],
246 | "source": [
247 | "df"
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": 12,
253 | "metadata": {
254 | "collapsed": false
255 | },
256 | "outputs": [],
257 | "source": [
258 | "df2 = df['test_new'].apply(lambda x: pd.Series([i for i in x.lower().split(',')]))"
259 | ]
260 | },
261 | {
262 | "cell_type": "code",
263 | "execution_count": 15,
264 | "metadata": {
265 | "collapsed": false
266 | },
267 | "outputs": [
268 | {
269 | "data": {
308 | "text/plain": [
309 | " vikash_1 vinayp vikash arpit\n",
310 | "0 0 0 1 1\n",
311 | "1 1 1 0 1\n",
312 | "2 0 1 0 1"
313 | ]
314 | },
315 | "execution_count": 15,
316 | "metadata": {},
317 | "output_type": "execute_result"
318 | }
319 | ],
320 | "source": [
321 | "df2"
322 | ]
323 | },
324 | {
325 | "cell_type": "code",
326 | "execution_count": 16,
327 | "metadata": {
328 | "collapsed": false
329 | },
330 | "outputs": [],
331 | "source": [
332 | "df2.columns = chosen_columns_list"
333 | ]
334 | },
335 | {
336 | "cell_type": "code",
337 | "execution_count": 17,
338 | "metadata": {
339 | "collapsed": false
340 | },
341 | "outputs": [
342 | {
343 | "data": {
382 | "text/plain": [
383 | " vikash_1 vinayp vikash arpit\n",
384 | "0 0 0 1 1\n",
385 | "1 1 1 0 1\n",
386 | "2 0 1 0 1"
387 | ]
388 | },
389 | "execution_count": 17,
390 | "metadata": {},
391 | "output_type": "execute_result"
392 | }
393 | ],
394 | "source": [
395 | "df2"
396 | ]
397 | },
398 | {
399 | "cell_type": "code",
400 | "execution_count": 18,
401 | "metadata": {
402 | "collapsed": false
403 | },
404 | "outputs": [
405 | {
406 | "name": "stdout",
407 | "output_type": "stream",
408 | "text": [
409 | "\n",
410 | "RangeIndex: 3 entries, 0 to 2\n",
411 | "Data columns (total 4 columns):\n",
412 | "vikash_1 3 non-null object\n",
413 | "vinayp 3 non-null object\n",
414 | "vikash 3 non-null object\n",
415 | "arpit 3 non-null object\n",
416 | "dtypes: object(4)\n",
417 | "memory usage: 176.0+ bytes\n"
418 | ]
419 | }
420 | ],
421 | "source": [
422 | "df2.info()"
423 | ]
424 | },
425 | {
426 | "cell_type": "code",
427 | "execution_count": 19,
428 | "metadata": {
429 | "collapsed": true
430 | },
431 | "outputs": [],
432 | "source": [
433 | "df2 = df2.apply(pd.to_numeric)"
434 | ]
435 | },
436 | {
437 | "cell_type": "code",
438 | "execution_count": 20,
439 | "metadata": {
440 | "collapsed": false
441 | },
442 | "outputs": [
443 | {
444 | "name": "stdout",
445 | "output_type": "stream",
446 | "text": [
447 | "\n",
448 | "RangeIndex: 3 entries, 0 to 2\n",
449 | "Data columns (total 4 columns):\n",
450 | "vikash_1 3 non-null int64\n",
451 | "vinayp 3 non-null int64\n",
452 | "vikash 3 non-null int64\n",
453 | "arpit 3 non-null int64\n",
454 | "dtypes: int64(4)\n",
455 | "memory usage: 176.0 bytes\n"
456 | ]
457 | }
458 | ],
459 | "source": [
460 | "df2.info()"
461 | ]
462 | },
463 | {
464 | "cell_type": "code",
465 | "execution_count": 28,
466 | "metadata": {
467 | "collapsed": false
468 | },
469 | "outputs": [],
470 | "source": [
471 | "df_new = pd.concat([df, df2], axis=1)"
472 | ]
473 | },
474 | {
475 | "cell_type": "code",
476 | "execution_count": 29,
477 | "metadata": {
478 | "collapsed": false
479 | },
480 | "outputs": [],
481 | "source": [
482 | "df_new.drop(['test', 'test_new'], inplace=True, axis=1)"
483 | ]
484 | },
485 | {
486 | "cell_type": "code",
487 | "execution_count": 30,
488 | "metadata": {
489 | "collapsed": false
490 | },
491 | "outputs": [
492 | {
493 | "data": {
536 | "text/plain": [
537 | " val vikash_1 vinayp vikash arpit\n",
538 | "0 6 0 0 1 1\n",
539 | "1 3 1 1 0 1\n",
540 | "2 2 0 1 0 1"
541 | ]
542 | },
543 | "execution_count": 30,
544 | "metadata": {},
545 | "output_type": "execute_result"
546 | }
547 | ],
548 | "source": [
549 | "df_new"
550 | ]
551 | },
552 | {
553 | "cell_type": "code",
554 | "execution_count": null,
555 | "metadata": {
556 | "collapsed": true
557 | },
558 | "outputs": [],
559 | "source": []
560 | }
561 | ],
562 | "metadata": {
563 | "anaconda-cloud": {},
564 | "kernelspec": {
565 | "display_name": "Python [conda root]",
566 | "language": "python",
567 | "name": "conda-root-py"
568 | },
569 | "language_info": {
570 | "codemirror_mode": {
571 | "name": "ipython",
572 | "version": 3
573 | },
574 | "file_extension": ".py",
575 | "mimetype": "text/x-python",
576 | "name": "python",
577 | "nbconvert_exporter": "python",
578 | "pygments_lexer": "ipython3",
579 | "version": "3.5.2"
580 | }
581 | },
582 | "nbformat": 4,
583 | "nbformat_minor": 1
584 | }
585 |
--------------------------------------------------------------------------------
/6_a_extending_dataframe_capabilities.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "Add functionality to create a DataFrame from the string representation of a dict"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 2,
13 | "metadata": {
14 | "collapsed": true
15 | },
16 | "outputs": [],
17 | "source": [
18 | "import ast\n",
19 | "def str_to_dict(string):\n",
20 | "    return ast.literal_eval(string)"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 3,
26 | "metadata": {
27 | "collapsed": true
28 | },
29 | "outputs": [],
30 | "source": [
31 | "import pandas as pd"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 4,
37 | "metadata": {
38 | "collapsed": false
39 | },
40 | "outputs": [],
41 | "source": [
42 | "class MySubClass(pd.DataFrame):\n",
43 | "    def from_str(self, string):\n",
44 | "        df_obj = super().from_dict(str_to_dict(string))\n",
45 | "        df_obj.my_string_attribute = string\n",
46 | "        return df_obj"
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 5,
52 | "metadata": {
53 | "collapsed": true
54 | },
55 | "outputs": [],
56 | "source": [
57 | "data = \"{'col_1' : ['a','b'], 'col2': [1, 2]}\""
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": 7,
63 | "metadata": {
64 | "collapsed": false
65 | },
66 | "outputs": [],
67 | "source": [
68 | "obj = MySubClass().from_str(data)"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 8,
74 | "metadata": {
75 | "collapsed": false
76 | },
77 | "outputs": [
78 | {
79 | "data": {
80 | "text/plain": [
81 | "__main__.MySubClass"
82 | ]
83 | },
84 | "execution_count": 8,
85 | "metadata": {},
86 | "output_type": "execute_result"
87 | }
88 | ],
89 | "source": [
90 | "type(obj)"
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": 9,
96 | "metadata": {
97 | "collapsed": false
98 | },
99 | "outputs": [
100 | {
101 | "data": {
127 | "text/plain": [
128 | " col2 col_1\n",
129 | "0 1 a\n",
130 | "1 2 b"
131 | ]
132 | },
133 | "execution_count": 9,
134 | "metadata": {},
135 | "output_type": "execute_result"
136 | }
137 | ],
138 | "source": [
139 | "obj"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": 10,
145 | "metadata": {
146 | "collapsed": false
147 | },
148 | "outputs": [
149 | {
150 | "data": {
151 | "text/plain": [
152 | "\"{'col_1' : ['a','b'], 'col2': [1, 2]}\""
153 | ]
154 | },
155 | "execution_count": 10,
156 | "metadata": {},
157 | "output_type": "execute_result"
158 | }
159 | ],
160 | "source": [
161 | "obj.my_string_attribute"
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": 8,
167 | "metadata": {
168 | "collapsed": true
169 | },
170 | "outputs": [],
171 | "source": [
172 | "sales = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200, 'Mar': 140},\n",
173 | " {'account': 'Alpha Co', 'Jan': 200, 'Feb': 210, 'Mar': 215},\n",
174 | " {'account': 'Blue Inc', 'Jan': 50, 'Feb': 90, 'Mar': 95 }]\n",
175 | "df = MySubClass(sales)"
176 | ]
177 | },
178 | {
179 | "cell_type": "code",
180 | "execution_count": 9,
181 | "metadata": {
182 | "collapsed": false
183 | },
184 | "outputs": [
185 | {
186 | "data": {
225 | "text/plain": [
226 | " Feb Jan Mar account\n",
227 | "0 200 150 140 Jones LLC\n",
228 | "1 210 200 215 Alpha Co\n",
229 | "2 90 50 95 Blue Inc"
230 | ]
231 | },
232 | "execution_count": 9,
233 | "metadata": {},
234 | "output_type": "execute_result"
235 | }
236 | ],
237 | "source": [
238 | "df"
239 | ]
240 | },
241 | {
242 | "cell_type": "code",
243 | "execution_count": 10,
244 | "metadata": {
245 | "collapsed": false
246 | },
247 | "outputs": [
248 | {
249 | "data": {
250 | "text/plain": [
251 | "__main__.MySubClass"
252 | ]
253 | },
254 | "execution_count": 10,
255 | "metadata": {},
256 | "output_type": "execute_result"
257 | }
258 | ],
259 | "source": [
260 | "type(df)"
261 | ]
262 | },
263 | {
264 | "cell_type": "code",
265 | "execution_count": null,
266 | "metadata": {
267 | "collapsed": true
268 | },
269 | "outputs": [],
270 | "source": []
271 | }
272 | ],
273 | "metadata": {
274 | "anaconda-cloud": {},
275 | "kernelspec": {
276 | "display_name": "Python [conda root]",
277 | "language": "python",
278 | "name": "conda-root-py"
279 | },
280 | "language_info": {
281 | "codemirror_mode": {
282 | "name": "ipython",
283 | "version": 3
284 | },
285 | "file_extension": ".py",
286 | "mimetype": "text/x-python",
287 | "name": "python",
288 | "nbconvert_exporter": "python",
289 | "pygments_lexer": "ipython3",
290 | "version": "3.5.2"
291 | }
292 | },
293 | "nbformat": 4,
294 | "nbformat_minor": 1
295 | }
296 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2016 Vikash Singh
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ### Basics of the pandas Python library
3 |
4 | 1. Basic Dataframes Operations
5 | - Create Dataframe from a dictionary [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/1_a_create_a_dataframe_from_dictonary.ipynb)
6 | - Create Dataframe by inserting rows in an iterative way [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/1_b_create_a_dataframe_by_iterating_and_inserting_rows_.ipynb)
7 | - Create Dataframe from a csv file [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/1_c_create_dataframe_from_a_csv_file.ipynb)
8 | - Change Dataframe column names [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/1_d_change_column_names.ipynb)
9 | - Choose specific columns from a DataFrame [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/1_e_selecting_columns_or_choosing_columns.ipynb)
10 | - Delete (drop) or extract columns from a Dataframe [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/1_f_drop_or_delete_a_column.ipynb)
11 | - Create Dataframe with randomly generated data [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/1_g_create_a_dataframe_with_randomly_generated_data.ipynb)
12 |
13 | 2. Manipulating Dataframe
14 | - Iterate over a dataframe [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/2_a_iterate_over_a_dataframe.ipynb)
15 | - Apply function to dataframe row wise [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/2_b_apply_a_function_row_wise.ipynb)
16 | - Apply function to a specific column of dataframe [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/2_c_apply_a_function_to_a_column.ipynb)
17 | - Find and replace a value in Dataframe [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/2_d_find_and_replace_a_value_in_dataframe_column.ipynb)
18 |
19 | 3. Split and Merge Dataframes
20 | - Merge dataframes by columns using join [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/3_a_merge_dataframes_by_joining_columns.ipynb)
21 | - Merge dataframes by columns on index [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/3_b_merge_dataframe_by_columns_on_index.ipynb)
22 | - Merge dataframe and split again [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/3_c_merge_two_dataframes_and_split_again.ipynb)
23 | - Group by a dataframe and iterate over grouped series [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/3_d_group_by_and_interate.ipynb)
24 |
25 | 4. Convert columns
26 | - Get binary columns in a DataFrame [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/4_a_get_binary_or_logical_columns_from_dataframe.ipynb)
27 | - Convert categorical columns to integer columns using label encoding [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/4_b_convert_categorical_columns_to_label_encoded_columns_or_integer_column.ipynb)
28 | - Reduce high dimensionality of a categorical column [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/4_c_reduce_dimension_of_categorical_column.ipynb)
29 | - Convert categorical column to one hot encoded column [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/4_d_convert_categorical_columns_to_one_hot_encoded_columns.ipynb)
30 |
31 | 5. Split column
32 | - Split column using a delimiter [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/5_a_split_a_column_into_multiple_columns_based_on_delimiter.ipynb)
33 | - Split a column using a delimiter and one hot encode the values [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/5_b_split_a_column_into_multiple_columns_one_hot_encoding.ipynb)
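
A minimal sketch of the delimiter split with one hot encoding, using the same sample values as the 5_b notebook. It swaps the notebook's hand-written loop for the built-in `Series.str.get_dummies`, so treat it as an alternative and not as the notebook's own method; the variable name `dummies` is only illustrative.

```python
import pandas as pd

# Same sample data as the 5_b notebook: a '|'-delimited text column plus a value column.
df = pd.DataFrame({
    "test": ["vikash|Arpit", "vikash_1|arpit|Vinayp", "arpit|vinayp"],
    "val": [6, 3, 2],
})

# Lower-case first so 'Arpit' and 'arpit' map to one column,
# then build one 0/1 indicator column per distinct token.
dummies = df["test"].str.lower().str.get_dummies(sep="|")

# Drop the original text column and append the indicator columns.
df_new = pd.concat([df.drop(columns="test"), dummies], axis=1)
print(df_new)
```

Because each token is matched as a whole, `vikash` and `vikash_1` end up in separate columns without the longest-first sorting that the notebook relies on.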
34 |
35 | 6. Adding capabilities to pandas.DataFrame class
36 | - Extend pandas.DataFrame class to store additional value(s). [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/6_a_extending_dataframe_capabilities.ipynb)
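
A hedged sketch of one way to make the extra attribute survive derived frames, following the subclassing hooks pandas documents (`_metadata` and `_constructor`). The class name `StringBackedFrame` is hypothetical and is not the class used in the 6_a notebook; which operations propagate the attribute can differ between pandas versions.

```python
import ast

import pandas as pd


class StringBackedFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass that remembers the string it was built from."""

    # Attribute names listed here are copied onto frames derived from this one.
    _metadata = ["my_string_attribute"]

    @property
    def _constructor(self):
        # Ask pandas to build results as StringBackedFrame instead of plain DataFrame.
        return StringBackedFrame

    @classmethod
    def from_str(cls, string):
        # literal_eval parses Python literals only; it does not execute code.
        frame = cls(ast.literal_eval(string))
        frame.my_string_attribute = string
        return frame


data = "{'col_1': ['a', 'b'], 'col2': [1, 2]}"
obj = StringBackedFrame.from_str(data)
copied = obj.copy()
print(type(copied).__name__, copied.my_string_attribute)
```

Making `from_str` a classmethod lets it be called as `StringBackedFrame.from_str(data)` without first creating an empty instance.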
37 |
--------------------------------------------------------------------------------
/data/sample_data.csv:
--------------------------------------------------------------------------------
1 | col_1,col_2,target
2 | 0.11,0.22,1
3 | 0.1,0.2,1
4 | 0.9,0.8,0
--------------------------------------------------------------------------------
/data/sample_data_2.csv:
--------------------------------------------------------------------------------
1 | 0.11,0.22,1
2 | 0.1,0.2,1
3 | 0.9,0.8,0
--------------------------------------------------------------------------------