├── .gitignore ├── 1_a_create_a_dataframe_from_dictonary.ipynb ├── 1_b_create_a_dataframe_by_iterating_and_inserting_rows_.ipynb ├── 1_c_create_dataframe_from_a_csv_file.ipynb ├── 1_d_change_column_names.ipynb ├── 1_e_selecting_columns_or_choosing_columns.ipynb ├── 1_f_drop_or_delete_a_column.ipynb ├── 1_g_create_a_dataframe_with_randomly_generated_data.ipynb ├── 2_a_iterate_over_a_dataframe.ipynb ├── 2_b_apply_a_function_row_wise.ipynb ├── 2_c_apply_a_function_to_a_column.ipynb ├── 2_d_find_and_replace_a_value_in_dataframe_column.ipynb ├── 3_a_merge_dataframes_by_joining_columns.ipynb ├── 3_b_merge_dataframe_by_columns_on_index.ipynb ├── 3_c_merge_two_dataframes_and_split_again.ipynb ├── 3_d_group_by_and_interate.ipynb ├── 4_a_get_binary_or_logical_columns_from_dataframe.ipynb ├── 4_b_convert_categorical_columns_to_label_encoded_columns_or_integer_column.ipynb ├── 4_c_reduce_dimension_of_categorical_column.ipynb ├── 4_d_convert_categorical_columns_to_one_hot_encoded_columns.ipynb ├── 5_a_split_a_column_into_multiple_columns_based_on_delimiter.ipynb ├── 5_b_split_a_column_into_multiple_columns_one_hot_encoding.ipynb ├── 6_a_extending_dataframe_capabilities.ipynb ├── LICENSE ├── README.md └── data ├── sample_data.csv └── sample_data_2.csv /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | local_settings.py 55 | 56 | # Flask stuff: 57 | instance/ 58 | .webassets-cache 59 | 60 | # Scrapy stuff: 61 | .scrapy 62 | 63 | # Sphinx documentation 64 | docs/_build/ 65 | 66 | # PyBuilder 67 | target/ 68 | 69 | # IPython Notebook 70 | .ipynb_checkpoints 71 | 72 | # pyenv 73 | .python-version 74 | 75 | # celery beat schedule file 76 | celerybeat-schedule 77 | 78 | # dotenv 79 | .env 80 | 81 | # virtualenv 82 | venv/ 83 | ENV/ 84 | 85 | # Spyder project settings 86 | .spyderproject 87 | 88 | # Rope project settings 89 | .ropeproject 90 | -------------------------------------------------------------------------------- /1_a_create_a_dataframe_from_dictonary.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Create A DataFrame from dictionary" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": { 25 | "collapsed": true 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "data = [{'name': 'vikash', 'age': 27}, {'name': 'Satyam', 'age': 14}]" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 3, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "df = pd.DataFrame.from_dict(data, orient='columns')" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 4, 44 | "metadata": {}, 45 | "outputs": [ 46 | 
{ 47 | "data": { 48 | "text/html": [ 49 | "
   age    name
0   27  vikash
1   14  Satyam
" 85 | ], 86 | "text/plain": [ 87 | " age name\n", 88 | "0 27 vikash\n", 89 | "1 14 Satyam" 90 | ] 91 | }, 92 | "execution_count": 4, 93 | "metadata": {}, 94 | "output_type": "execute_result" 95 | } 96 | ], 97 | "source": [ 98 | "df" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": { 104 | "collapsed": true 105 | }, 106 | "source": [ 107 | "## If the dictionary is nested, normalize it first" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": 5, 113 | "metadata": { 114 | "collapsed": true 115 | }, 116 | "outputs": [], 117 | "source": [ 118 | "from pandas import json_normalize" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 6, 124 | "metadata": { 125 | "collapsed": true 126 | }, 127 | "outputs": [], 128 | "source": [ 129 | "data = [\n", 130 | " {\n", 131 | " 'name': {\n", 132 | " 'first': 'vikash',\n", 133 | " 'last': 'singh'\n", 134 | " },\n", 135 | " 'age': 27\n", 136 | " },\n", 137 | " {\n", 138 | " 'name': {\n", 139 | " 'first': 'satyam',\n", 140 | " 'last': 'singh'\n", 141 | " },\n", 142 | " 'age': 14\n", 143 | " }\n", 144 | "]" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 7, 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "df = json_normalize(data)" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 8, 159 | "metadata": {}, 160 | "outputs": [ 161 | { 162 | "data": { 163 | "text/html": [ 164 | "
   age name.first name.last
0   27     vikash     singh
1   14     satyam     singh
" 203 | ], 204 | "text/plain": [ 205 | " age name.first name.last\n", 206 | "0 27 vikash singh\n", 207 | "1 14 satyam singh" 208 | ] 209 | }, 210 | "execution_count": 8, 211 | "metadata": {}, 212 | "output_type": "execute_result" 213 | } 214 | ], 215 | "source": [ 216 | "df" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": { 223 | "collapsed": true 224 | }, 225 | "outputs": [], 226 | "source": [] 227 | } 228 | ], 229 | "metadata": { 230 | "anaconda-cloud": {}, 231 | "kernelspec": { 232 | "display_name": "Python [conda root]", 233 | "language": "python", 234 | "name": "conda-root-py" 235 | }, 236 | "language_info": { 237 | "codemirror_mode": { 238 | "name": "ipython", 239 | "version": 3 240 | }, 241 | "file_extension": ".py", 242 | "mimetype": "text/x-python", 243 | "name": "python", 244 | "nbconvert_exporter": "python", 245 | "pygments_lexer": "ipython3", 246 | "version": "3.5.3" 247 | } 248 | }, 249 | "nbformat": 4, 250 | "nbformat_minor": 1 251 | } 252 | -------------------------------------------------------------------------------- /1_b_create_a_dataframe_by_iterating_and_inserting_rows_.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "Create A DataFrame by iterating and inserting rows" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 3, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "from random import randint" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 4, 25 | "metadata": { 26 | "collapsed": true 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "columns = ['a', 'b', 'c']" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 5, 36 | "metadata": { 37 | "collapsed": false 38 | }, 39 | "outputs": [], 40 | "source": [ 41 | "df = pd.DataFrame(columns=columns)" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 6, 47 | "metadata": { 48 | "collapsed": false 49 | }, 50 | "outputs": [], 51 | "source": [ 52 | "for i in range(5):\n", 53 | "    df.loc[i] = [randint(-1,1) for n in range(3)]" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 7, 59 | "metadata": { 60 | "collapsed": false 61 | }, 62 | "outputs": [ 63 | { 64 | "data": { 65 | "text/html": [ 66 | "
     a    b    c
0  0.0 -1.0 -1.0
1  1.0  1.0  0.0
2  1.0 -1.0  1.0
3 -1.0  1.0  1.0
4 -1.0 -1.0  1.0
" 110 | ], 111 | "text/plain": [ 112 | " a b c\n", 113 | "0 0.0 -1.0 -1.0\n", 114 | "1 1.0 1.0 0.0\n", 115 | "2 1.0 -1.0 1.0\n", 116 | "3 -1.0 1.0 1.0\n", 117 | "4 -1.0 -1.0 1.0" 118 | ] 119 | }, 120 | "execution_count": 7, 121 | "metadata": {}, 122 | "output_type": "execute_result" 123 | } 124 | ], 125 | "source": [ 126 | "df" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": { 133 | "collapsed": true 134 | }, 135 | "outputs": [], 136 | "source": [] 137 | } 138 | ], 139 | "metadata": { 140 | "anaconda-cloud": {}, 141 | "kernelspec": { 142 | "display_name": "Python [default]", 143 | "language": "python", 144 | "name": "python3" 145 | }, 146 | "language_info": { 147 | "codemirror_mode": { 148 | "name": "ipython", 149 | "version": 3 150 | }, 151 | "file_extension": ".py", 152 | "mimetype": "text/x-python", 153 | "name": "python", 154 | "nbconvert_exporter": "python", 155 | "pygments_lexer": "ipython3", 156 | "version": "3.5.2" 157 | } 158 | }, 159 | "nbformat": 4, 160 | "nbformat_minor": 0 161 | } 162 | -------------------------------------------------------------------------------- /1_c_create_dataframe_from_a_csv_file.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "Create a DataFrame from a CSV file" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": { 25 | "collapsed": false 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "df = pd.read_csv('./data/sample_data.csv', index_col=False)" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 3, 35 | "metadata": { 36 | "collapsed": false 37 | }, 38 | "outputs": [ 39 | { 40 | "data": { 41 | "text/html": [ 42 | "
   col_1  col_2  target
0   0.11   0.22       1
1   0.10   0.20       1
2   0.90   0.80       0
" 74 | ], 75 | "text/plain": [ 76 | " col_1 col_2 target\n", 77 | "0 0.11 0.22 1\n", 78 | "1 0.10 0.20 1\n", 79 | "2 0.90 0.80 0" 80 | ] 81 | }, 82 | "execution_count": 3, 83 | "metadata": {}, 84 | "output_type": "execute_result" 85 | } 86 | ], 87 | "source": [ 88 | "df" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 4, 94 | "metadata": { 95 | "collapsed": true 96 | }, 97 | "outputs": [], 98 | "source": [ 99 | "df_2 = pd.read_csv('./data/sample_data_2.csv', index_col=False, header=None)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 5, 105 | "metadata": { 106 | "collapsed": false 107 | }, 108 | "outputs": [ 109 | { 110 | "data": { 111 | "text/html": [ 112 | "
      0     1  2
0  0.11  0.22  1
1  0.10  0.20  1
2  0.90  0.80  0
" 144 | ], 145 | "text/plain": [ 146 | " 0 1 2\n", 147 | "0 0.11 0.22 1\n", 148 | "1 0.10 0.20 1\n", 149 | "2 0.90 0.80 0" 150 | ] 151 | }, 152 | "execution_count": 5, 153 | "metadata": {}, 154 | "output_type": "execute_result" 155 | } 156 | ], 157 | "source": [ 158 | "df_2" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 6, 164 | "metadata": { 165 | "collapsed": true 166 | }, 167 | "outputs": [], 168 | "source": [ 169 | "df_2.columns = ['col_1', 'col_2', 'taget']" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": 7, 175 | "metadata": { 176 | "collapsed": false 177 | }, 178 | "outputs": [ 179 | { 180 | "data": { 181 | "text/html": [ 182 | "
   col_1  col_2  taget
0   0.11   0.22      1
1   0.10   0.20      1
2   0.90   0.80      0
" 214 | ], 215 | "text/plain": [ 216 | " col_1 col_2 taget\n", 217 | "0 0.11 0.22 1\n", 218 | "1 0.10 0.20 1\n", 219 | "2 0.90 0.80 0" 220 | ] 221 | }, 222 | "execution_count": 7, 223 | "metadata": {}, 224 | "output_type": "execute_result" 225 | } 226 | ], 227 | "source": [ 228 | "df_2" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": null, 234 | "metadata": { 235 | "collapsed": true 236 | }, 237 | "outputs": [], 238 | "source": [] 239 | } 240 | ], 241 | "metadata": { 242 | "anaconda-cloud": {}, 243 | "kernelspec": { 244 | "display_name": "Python [default]", 245 | "language": "python", 246 | "name": "python3" 247 | }, 248 | "language_info": { 249 | "codemirror_mode": { 250 | "name": "ipython", 251 | "version": 3 252 | }, 253 | "file_extension": ".py", 254 | "mimetype": "text/x-python", 255 | "name": "python", 256 | "nbconvert_exporter": "python", 257 | "pygments_lexer": "ipython3", 258 | "version": "3.5.2" 259 | } 260 | }, 261 | "nbformat": 4, 262 | "nbformat_minor": 0 263 | } 264 | -------------------------------------------------------------------------------- /1_d_change_column_names.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "Change column names in dataframe" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "df = pd.DataFrame([['AA', 1, 'a'],['BB', 2, 'a'],['CC', 3, 'a']], columns = ['name','value','salue'])" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "metadata": { 26 | "collapsed": false 27 | }, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/html": [ 32 | "
  name  value salue
0   AA      1     a
1   BB      2     a
2   CC      3     a
" 64 | ], 65 | "text/plain": [ 66 | " name value salue\n", 67 | "0 AA 1 a\n", 68 | "1 BB 2 a\n", 69 | "2 CC 3 a" 70 | ] 71 | }, 72 | "execution_count": 2, 73 | "metadata": {}, 74 | "output_type": "execute_result" 75 | } 76 | ], 77 | "source": [ 78 | "df" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 3, 84 | "metadata": { 85 | "collapsed": false 86 | }, 87 | "outputs": [], 88 | "source": [ 89 | "df.columns = [df.columns[0]] + ['prefix_' + col for col in df.columns[1:]]" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 4, 95 | "metadata": { 96 | "collapsed": false 97 | }, 98 | "outputs": [ 99 | { 100 | "data": { 101 | "text/plain": [ 102 | "array(['name', 'prefix_value', 'prefix_salue'], dtype=object)" 103 | ] 104 | }, 105 | "execution_count": 4, 106 | "metadata": {}, 107 | "output_type": "execute_result" 108 | } 109 | ], 110 | "source": [ 111 | "df.columns.values" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 5, 117 | "metadata": { 118 | "collapsed": false 119 | }, 120 | "outputs": [ 121 | { 122 | "data": { 123 | "text/html": [ 124 | "
  name  prefix_value prefix_salue
0   AA             1            a
1   BB             2            a
2   CC             3            a
" 156 | ], 157 | "text/plain": [ 158 | " name prefix_value prefix_salue\n", 159 | "0 AA 1 a\n", 160 | "1 BB 2 a\n", 161 | "2 CC 3 a" 162 | ] 163 | }, 164 | "execution_count": 5, 165 | "metadata": {}, 166 | "output_type": "execute_result" 167 | } 168 | ], 169 | "source": [ 170 | "df" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": { 177 | "collapsed": true 178 | }, 179 | "outputs": [], 180 | "source": [] 181 | } 182 | ], 183 | "metadata": { 184 | "anaconda-cloud": {}, 185 | "kernelspec": { 186 | "display_name": "Python [default]", 187 | "language": "python", 188 | "name": "python3" 189 | }, 190 | "language_info": { 191 | "codemirror_mode": { 192 | "name": "ipython", 193 | "version": 3 194 | }, 195 | "file_extension": ".py", 196 | "mimetype": "text/x-python", 197 | "name": "python", 198 | "nbconvert_exporter": "python", 199 | "pygments_lexer": "ipython3", 200 | "version": "3.5.2" 201 | } 202 | }, 203 | "nbformat": 4, 204 | "nbformat_minor": 0 205 | } 206 | -------------------------------------------------------------------------------- /1_e_selecting_columns_or_choosing_columns.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "Selecting and picking columns from a Dataframe" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "df = pd.DataFrame([['AA', \"temp\", 1],['BB', \"temp\", 2],['CC', \"temp\", 3]], columns = ['name','temp', 'value'])" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "metadata": { 26 | "collapsed": false 27 | }, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/html": [ 32 | "
  name  temp  value
0   AA  temp      1
1   BB  temp      2
2   CC  temp      3
" 64 | ], 65 | "text/plain": [ 66 | " name temp value\n", 67 | "0 AA temp 1\n", 68 | "1 BB temp 2\n", 69 | "2 CC temp 3" 70 | ] 71 | }, 72 | "execution_count": 2, 73 | "metadata": {}, 74 | "output_type": "execute_result" 75 | } 76 | ], 77 | "source": [ 78 | "df" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 3, 84 | "metadata": { 85 | "collapsed": false 86 | }, 87 | "outputs": [], 88 | "source": [ 89 | "df = df[[\"name\",\"temp\"]]" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 4, 95 | "metadata": { 96 | "collapsed": false 97 | }, 98 | "outputs": [ 99 | { 100 | "data": { 101 | "text/html": [ 102 | "
  name  temp
0   AA  temp
1   BB  temp
2   CC  temp
" 130 | ], 131 | "text/plain": [ 132 | " name temp\n", 133 | "0 AA temp\n", 134 | "1 BB temp\n", 135 | "2 CC temp" 136 | ] 137 | }, 138 | "execution_count": 4, 139 | "metadata": {}, 140 | "output_type": "execute_result" 141 | } 142 | ], 143 | "source": [ 144 | "df" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": { 151 | "collapsed": true 152 | }, 153 | "outputs": [], 154 | "source": [] 155 | } 156 | ], 157 | "metadata": { 158 | "anaconda-cloud": {}, 159 | "kernelspec": { 160 | "display_name": "Python [conda root]", 161 | "language": "python", 162 | "name": "conda-root-py" 163 | }, 164 | "language_info": { 165 | "codemirror_mode": { 166 | "name": "ipython", 167 | "version": 3 168 | }, 169 | "file_extension": ".py", 170 | "mimetype": "text/x-python", 171 | "name": "python", 172 | "nbconvert_exporter": "python", 173 | "pygments_lexer": "ipython3", 174 | "version": "3.5.2" 175 | } 176 | }, 177 | "nbformat": 4, 178 | "nbformat_minor": 0 179 | } 180 | -------------------------------------------------------------------------------- /1_f_drop_or_delete_a_column.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "Drop columns or pop columns from a dataframe" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "df = pd.DataFrame([['AA', 1],['BB', 2],['CC', 3]], columns = ['name','value'])" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "metadata": { 26 | "collapsed": false 27 | }, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/html": [ 32 | "
  name  value
0   AA      1
1   BB      2
2   CC      3
" 60 | ], 61 | "text/plain": [ 62 | " name value\n", 63 | "0 AA 1\n", 64 | "1 BB 2\n", 65 | "2 CC 3" 66 | ] 67 | }, 68 | "execution_count": 2, 69 | "metadata": {}, 70 | "output_type": "execute_result" 71 | } 72 | ], 73 | "source": [ 74 | "df" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 3, 80 | "metadata": { 81 | "collapsed": true 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "df.drop('value', axis=1, inplace=True)" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 4, 91 | "metadata": { 92 | "collapsed": false 93 | }, 94 | "outputs": [ 95 | { 96 | "data": { 97 | "text/html": [ 98 | "
  name
0   AA
1   BB
2   CC
" 122 | ], 123 | "text/plain": [ 124 | " name\n", 125 | "0 AA\n", 126 | "1 BB\n", 127 | "2 CC" 128 | ] 129 | }, 130 | "execution_count": 4, 131 | "metadata": {}, 132 | "output_type": "execute_result" 133 | } 134 | ], 135 | "source": [ 136 | "df" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 5, 142 | "metadata": { 143 | "collapsed": true 144 | }, 145 | "outputs": [], 146 | "source": [ 147 | "df = pd.DataFrame([['AA', 1],['BB', 2],['CC', 3]], columns = ['name','value'])" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 6, 153 | "metadata": { 154 | "collapsed": false 155 | }, 156 | "outputs": [ 157 | { 158 | "data": { 159 | "text/html": [ 160 | "
  name  value
0   AA      1
1   BB      2
2   CC      3
" 188 | ], 189 | "text/plain": [ 190 | " name value\n", 191 | "0 AA 1\n", 192 | "1 BB 2\n", 193 | "2 CC 3" 194 | ] 195 | }, 196 | "execution_count": 6, 197 | "metadata": {}, 198 | "output_type": "execute_result" 199 | } 200 | ], 201 | "source": [ 202 | "df" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 7, 208 | "metadata": { 209 | "collapsed": true 210 | }, 211 | "outputs": [], 212 | "source": [ 213 | "values = df.pop('value')" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 8, 219 | "metadata": { 220 | "collapsed": false 221 | }, 222 | "outputs": [ 223 | { 224 | "data": { 225 | "text/html": [ 226 | "
  name
0   AA
1   BB
2   CC
" 250 | ], 251 | "text/plain": [ 252 | " name\n", 253 | "0 AA\n", 254 | "1 BB\n", 255 | "2 CC" 256 | ] 257 | }, 258 | "execution_count": 8, 259 | "metadata": {}, 260 | "output_type": "execute_result" 261 | } 262 | ], 263 | "source": [ 264 | "df" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 9, 270 | "metadata": { 271 | "collapsed": false 272 | }, 273 | "outputs": [ 274 | { 275 | "data": { 276 | "text/plain": [ 277 | "0 1\n", 278 | "1 2\n", 279 | "2 3\n", 280 | "Name: value, dtype: int64" 281 | ] 282 | }, 283 | "execution_count": 9, 284 | "metadata": {}, 285 | "output_type": "execute_result" 286 | } 287 | ], 288 | "source": [ 289 | "values" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": null, 295 | "metadata": { 296 | "collapsed": true 297 | }, 298 | "outputs": [], 299 | "source": [] 300 | } 301 | ], 302 | "metadata": { 303 | "anaconda-cloud": {}, 304 | "kernelspec": { 305 | "display_name": "Python [default]", 306 | "language": "python", 307 | "name": "python3" 308 | }, 309 | "language_info": { 310 | "codemirror_mode": { 311 | "name": "ipython", 312 | "version": 3 313 | }, 314 | "file_extension": ".py", 315 | "mimetype": "text/x-python", 316 | "name": "python", 317 | "nbconvert_exporter": "python", 318 | "pygments_lexer": "ipython3", 319 | "version": "3.5.2" 320 | } 321 | }, 322 | "nbformat": 4, 323 | "nbformat_minor": 0 324 | } 325 | -------------------------------------------------------------------------------- /1_g_create_a_dataframe_with_randomly_generated_data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "Create A DataFrame with randomly generated data" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "import numpy as np" 20 
| ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "metadata": { 26 | "collapsed": false 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 3, 36 | "metadata": { 37 | "collapsed": false 38 | }, 39 | "outputs": [ 40 | { 41 | "data": { 42 | "text/html": [ 43 | "
   A  B  C  D
0  2  3  0  6
1  8  0  6  6
2  6  5  4  6
3  4  2  1  3
4  4  0  2  6
5  5  6  1  2
6  1  7  2  8
7  4  4  3  6
8  4  2  6  2
9  3  1  6  6
" 128 | ], 129 | "text/plain": [ 130 | " A B C D\n", 131 | "0 2 3 0 6\n", 132 | "1 8 0 6 6\n", 133 | "2 6 5 4 6\n", 134 | "3 4 2 1 3\n", 135 | "4 4 0 2 6\n", 136 | "5 5 6 1 2\n", 137 | "6 1 7 2 8\n", 138 | "7 4 4 3 6\n", 139 | "8 4 2 6 2\n", 140 | "9 3 1 6 6" 141 | ] 142 | }, 143 | "execution_count": 3, 144 | "metadata": {}, 145 | "output_type": "execute_result" 146 | } 147 | ], 148 | "source": [ 149 | "df" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "metadata": { 156 | "collapsed": true 157 | }, 158 | "outputs": [], 159 | "source": [] 160 | } 161 | ], 162 | "metadata": { 163 | "anaconda-cloud": {}, 164 | "kernelspec": { 165 | "display_name": "Python [conda root]", 166 | "language": "python", 167 | "name": "conda-root-py" 168 | }, 169 | "language_info": { 170 | "codemirror_mode": { 171 | "name": "ipython", 172 | "version": 3 173 | }, 174 | "file_extension": ".py", 175 | "mimetype": "text/x-python", 176 | "name": "python", 177 | "nbconvert_exporter": "python", 178 | "pygments_lexer": "ipython3", 179 | "version": "3.5.2" 180 | } 181 | }, 182 | "nbformat": 4, 183 | "nbformat_minor": 1 184 | } 185 | -------------------------------------------------------------------------------- /2_a_iterate_over_a_dataframe.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "Iterate over a dataframe by rows" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "df = pd.DataFrame([['AA', 1],['BB', 2],['CC', 3]], columns = ['name','value'])" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "metadata": { 26 | "collapsed": false 27 | }, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/html": [ 32 | "
  name  value
0   AA      1
1   BB      2
2   CC      3
" 60 | ], 61 | "text/plain": [ 62 | " name value\n", 63 | "0 AA 1\n", 64 | "1 BB 2\n", 65 | "2 CC 3" 66 | ] 67 | }, 68 | "execution_count": 2, 69 | "metadata": {}, 70 | "output_type": "execute_result" 71 | } 72 | ], 73 | "source": [ 74 | "df" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 3, 80 | "metadata": { 81 | "collapsed": false 82 | }, 83 | "outputs": [ 84 | { 85 | "name": "stdout", 86 | "output_type": "stream", 87 | "text": [ 88 | "AA 1\n", 89 | "BB 2\n", 90 | "CC 3\n" 91 | ] 92 | } 93 | ], 94 | "source": [ 95 | "for i, row in df.iterrows():\n", 96 | " print(row['name'], row['value'])" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 14, 102 | "metadata": { 103 | "collapsed": false 104 | }, 105 | "outputs": [ 106 | { 107 | "name": "stdout", 108 | "output_type": "stream", 109 | "text": [ 110 | "Pandas(Index=0, name='AA', value=1)\n", 111 | "Pandas(Index=1, name='BB', value=2)\n", 112 | "Pandas(Index=2, name='CC', value=3)\n" 113 | ] 114 | } 115 | ], 116 | "source": [ 117 | "for row in df.itertuples():\n", 118 | " print(row)" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": null, 124 | "metadata": { 125 | "collapsed": true 126 | }, 127 | "outputs": [], 128 | "source": [] 129 | } 130 | ], 131 | "metadata": { 132 | "anaconda-cloud": {}, 133 | "kernelspec": { 134 | "display_name": "Python [conda root]", 135 | "language": "python", 136 | "name": "conda-root-py" 137 | }, 138 | "language_info": { 139 | "codemirror_mode": { 140 | "name": "ipython", 141 | "version": 3 142 | }, 143 | "file_extension": ".py", 144 | "mimetype": "text/x-python", 145 | "name": "python", 146 | "nbconvert_exporter": "python", 147 | "pygments_lexer": "ipython3", 148 | "version": "3.5.2" 149 | } 150 | }, 151 | "nbformat": 4, 152 | "nbformat_minor": 0 153 | } 154 | -------------------------------------------------------------------------------- /2_b_apply_a_function_row_wise.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "apply a function to dataframe rows" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "df = pd.DataFrame([['AA', 1],['BB', 2],['CC', 3]], columns = ['name','value'])" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "metadata": { 26 | "collapsed": false 27 | }, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/html": [ 32 | "
" 60 | ], 61 | "text/plain": [ 62 | " name value\n", 63 | "0 AA 1\n", 64 | "1 BB 2\n", 65 | "2 CC 3" 66 | ] 67 | }, 68 | "execution_count": 2, 69 | "metadata": {}, 70 | "output_type": "execute_result" 71 | } 72 | ], 73 | "source": [ 74 | "df" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 3, 80 | "metadata": { 81 | "collapsed": true 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "def function_1(val_1, val_2):\n", 86 | " return val_1 + str(val_2)" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 4, 92 | "metadata": { 93 | "collapsed": false 94 | }, 95 | "outputs": [], 96 | "source": [ 97 | "df['col_a'] = df.apply(lambda row: function_1(row['name'], row['value']), axis=1)" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 5, 103 | "metadata": { 104 | "collapsed": false 105 | }, 106 | "outputs": [ 107 | { 108 | "data": { 109 | "text/html": [ 110 | "
" 142 | ], 143 | "text/plain": [ 144 | " name value col_a\n", 145 | "0 AA 1 AA1\n", 146 | "1 BB 2 BB2\n", 147 | "2 CC 3 CC3" 148 | ] 149 | }, 150 | "execution_count": 5, 151 | "metadata": {}, 152 | "output_type": "execute_result" 153 | } 154 | ], 155 | "source": [ 156 | "df" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 6, 162 | "metadata": { 163 | "collapsed": true 164 | }, 165 | "outputs": [], 166 | "source": [ 167 | "def function_2(row):\n", 168 | " return row['value'] * 2" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 7, 174 | "metadata": { 175 | "collapsed": true 176 | }, 177 | "outputs": [], 178 | "source": [ 179 | "df['col_b'] = df.apply(lambda row: function_2(row), axis=1)" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 8, 185 | "metadata": { 186 | "collapsed": false 187 | }, 188 | "outputs": [ 189 | { 190 | "data": { 191 | "text/html": [ 192 | "
" 228 | ], 229 | "text/plain": [ 230 | " name value col_a col_b\n", 231 | "0 AA 1 AA1 2\n", 232 | "1 BB 2 BB2 4\n", 233 | "2 CC 3 CC3 6" 234 | ] 235 | }, 236 | "execution_count": 8, 237 | "metadata": {}, 238 | "output_type": "execute_result" 239 | } 240 | ], 241 | "source": [ 242 | "df" 243 | ] 244 | } 245 | ], 246 | "metadata": { 247 | "anaconda-cloud": {}, 248 | "kernelspec": { 249 | "display_name": "Python [conda root]", 250 | "language": "python", 251 | "name": "conda-root-py" 252 | }, 253 | "language_info": { 254 | "codemirror_mode": { 255 | "name": "ipython", 256 | "version": 3 257 | }, 258 | "file_extension": ".py", 259 | "mimetype": "text/x-python", 260 | "name": "python", 261 | "nbconvert_exporter": "python", 262 | "pygments_lexer": "ipython3", 263 | "version": "3.5.2" 264 | } 265 | }, 266 | "nbformat": 4, 267 | "nbformat_minor": 0 268 | } 269 | -------------------------------------------------------------------------------- /2_c_apply_a_function_to_a_column.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "apply a function to a dataframe column" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "df = pd.DataFrame([['AA', 1], ['BB', 2], ['CC', 3]], columns=['name', 'value'])" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "metadata": { 26 | "collapsed": false 27 | }, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/html": [ 32 | "
" 60 | ], 61 | "text/plain": [ 62 | " name value\n", 63 | "0 AA 1\n", 64 | "1 BB 2\n", 65 | "2 CC 3" 66 | ] 67 | }, 68 | "execution_count": 2, 69 | "metadata": {}, 70 | "output_type": "execute_result" 71 | } 72 | ], 73 | "source": [ 74 | "df" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 3, 80 | "metadata": { 81 | "collapsed": true 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "def function_1(val_1):\n", 86 | " return \"prefix_\" + str(val_1)" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 4, 92 | "metadata": { 93 | "collapsed": false 94 | }, 95 | "outputs": [], 96 | "source": [ 97 | "df['name'] = df['name'].map(function_1)" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 5, 103 | "metadata": { 104 | "collapsed": false 105 | }, 106 | "outputs": [ 107 | { 108 | "data": { 109 | "text/html": [ 110 | "
" 138 | ], 139 | "text/plain": [ 140 | " name value\n", 141 | "0 prefix_AA 1\n", 142 | "1 prefix_BB 2\n", 143 | "2 prefix_CC 3" 144 | ] 145 | }, 146 | "execution_count": 5, 147 | "metadata": {}, 148 | "output_type": "execute_result" 149 | } 150 | ], 151 | "source": [ 152 | "df" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": { 159 | "collapsed": true 160 | }, 161 | "outputs": [], 162 | "source": [] 163 | } 164 | ], 165 | "metadata": { 166 | "anaconda-cloud": {}, 167 | "kernelspec": { 168 | "display_name": "Python [conda root]", 169 | "language": "python", 170 | "name": "conda-root-py" 171 | }, 172 | "language_info": { 173 | "codemirror_mode": { 174 | "name": "ipython", 175 | "version": 3 176 | }, 177 | "file_extension": ".py", 178 | "mimetype": "text/x-python", 179 | "name": "python", 180 | "nbconvert_exporter": "python", 181 | "pygments_lexer": "ipython3", 182 | "version": "3.5.2" 183 | } 184 | }, 185 | "nbformat": 4, 186 | "nbformat_minor": 0 187 | } 188 | -------------------------------------------------------------------------------- /2_d_find_and_replace_a_value_in_dataframe_column.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "find a value in dataframe and replace it with another value" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": { 25 | "collapsed": true 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "df = pd.DataFrame([['One', 'Two'], ['Four', 'Abcd'], ['One', 'Bcd'], ['Five', 'Cd']], columns=['A', 'B'])" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 3, 35 | "metadata": { 36 | "collapsed": false 37 | }, 38 | 
"outputs": [ 39 | { 40 | "data": { 41 | "text/html": [ 42 | "
" 75 | ], 76 | "text/plain": [ 77 | " A B\n", 78 | "0 One Two\n", 79 | "1 Four Abcd\n", 80 | "2 One Bcd\n", 81 | "3 Five Cd" 82 | ] 83 | }, 84 | "execution_count": 3, 85 | "metadata": {}, 86 | "output_type": "execute_result" 87 | } 88 | ], 89 | "source": [ 90 | "df" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 4, 96 | "metadata": { 97 | "collapsed": false 98 | }, 99 | "outputs": [], 100 | "source": [ 101 | "df.loc[df['A'] == 'One', 'A'] = 0" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 5, 107 | "metadata": { 108 | "collapsed": false 109 | }, 110 | "outputs": [ 111 | { 112 | "data": { 113 | "text/html": [ 114 | "
" 147 | ], 148 | "text/plain": [ 149 | " A B\n", 150 | "0 0 Two\n", 151 | "1 Four Abcd\n", 152 | "2 0 Bcd\n", 153 | "3 Five Cd" 154 | ] 155 | }, 156 | "execution_count": 5, 157 | "metadata": {}, 158 | "output_type": "execute_result" 159 | } 160 | ], 161 | "source": [ 162 | "df" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "metadata": { 169 | "collapsed": true 170 | }, 171 | "outputs": [], 172 | "source": [] 173 | } 174 | ], 175 | "metadata": { 176 | "anaconda-cloud": {}, 177 | "kernelspec": { 178 | "display_name": "Python [conda root]", 179 | "language": "python", 180 | "name": "conda-root-py" 181 | }, 182 | "language_info": { 183 | "codemirror_mode": { 184 | "name": "ipython", 185 | "version": 3 186 | }, 187 | "file_extension": ".py", 188 | "mimetype": "text/x-python", 189 | "name": "python", 190 | "nbconvert_exporter": "python", 191 | "pygments_lexer": "ipython3", 192 | "version": "3.5.2" 193 | } 194 | }, 195 | "nbformat": 4, 196 | "nbformat_minor": 0 197 | } 198 | -------------------------------------------------------------------------------- /3_a_merge_dataframes_by_joining_columns.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "Merge dataframes by joining on a column" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": { 25 | "collapsed": false 26 | }, 27 | "outputs": [ 28 | { 29 | "data": { 30 | "text/html": [ 31 | "
" 54 | ], 55 | "text/plain": [ 56 | " A B\n", 57 | "0 1 3\n", 58 | "1 2 4" 59 | ] 60 | }, 61 | "execution_count": 2, 62 | "metadata": {}, 63 | "output_type": "execute_result" 64 | } 65 | ], 66 | "source": [ 67 | "df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])\n", 68 | "df" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 3, 74 | "metadata": { 75 | "collapsed": false 76 | }, 77 | "outputs": [ 78 | { 79 | "data": { 80 | "text/html": [ 81 | "
" 104 | ], 105 | "text/plain": [ 106 | " A C\n", 107 | "0 1 5\n", 108 | "1 1 6" 109 | ] 110 | }, 111 | "execution_count": 3, 112 | "metadata": {}, 113 | "output_type": "execute_result" 114 | } 115 | ], 116 | "source": [ 117 | "df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])\n", 118 | "df2" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 4, 124 | "metadata": { 125 | "collapsed": false 126 | }, 127 | "outputs": [ 128 | { 129 | "data": { 130 | "text/html": [ 131 | "
" 163 | ], 164 | "text/plain": [ 165 | " A B C\n", 166 | "0 1 3 5.0\n", 167 | "1 1 3 6.0\n", 168 | "2 2 4 NaN" 169 | ] 170 | }, 171 | "execution_count": 4, 172 | "metadata": {}, 173 | "output_type": "execute_result" 174 | } 175 | ], 176 | "source": [ 177 | "df.merge(df2, how='left', on='A') # merges on columns A" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 5, 183 | "metadata": { 184 | "collapsed": false 185 | }, 186 | "outputs": [], 187 | "source": [ 188 | "df2.drop_duplicates(subset=['A'], inplace=True)" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 6, 194 | "metadata": { 195 | "collapsed": false 196 | }, 197 | "outputs": [ 198 | { 199 | "data": { 200 | "text/html": [ 201 | "
" 227 | ], 228 | "text/plain": [ 229 | " A B C\n", 230 | "0 1 3 5.0\n", 231 | "1 2 4 NaN" 232 | ] 233 | }, 234 | "execution_count": 6, 235 | "metadata": {}, 236 | "output_type": "execute_result" 237 | } 238 | ], 239 | "source": [ 240 | "df.merge(df2, how='left', on='A')" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": null, 246 | "metadata": { 247 | "collapsed": true 248 | }, 249 | "outputs": [], 250 | "source": [] 251 | } 252 | ], 253 | "metadata": { 254 | "anaconda-cloud": {}, 255 | "kernelspec": { 256 | "display_name": "Python [default]", 257 | "language": "python", 258 | "name": "python3" 259 | }, 260 | "language_info": { 261 | "codemirror_mode": { 262 | "name": "ipython", 263 | "version": 3 264 | }, 265 | "file_extension": ".py", 266 | "mimetype": "text/x-python", 267 | "name": "python", 268 | "nbconvert_exporter": "python", 269 | "pygments_lexer": "ipython3", 270 | "version": "3.5.2" 271 | } 272 | }, 273 | "nbformat": 4, 274 | "nbformat_minor": 0 275 | } 276 | -------------------------------------------------------------------------------- /3_b_merge_dataframe_by_columns_on_index.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "merge dataframe by columns using index" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": { 25 | "collapsed": false 26 | }, 27 | "outputs": [ 28 | { 29 | "data": { 30 | "text/html": [ 31 | "
" 54 | ], 55 | "text/plain": [ 56 | " A B\n", 57 | "0 1 3\n", 58 | "1 2 4" 59 | ] 60 | }, 61 | "execution_count": 2, 62 | "metadata": {}, 63 | "output_type": "execute_result" 64 | } 65 | ], 66 | "source": [ 67 | "df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])\n", 68 | "df" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 3, 74 | "metadata": { 75 | "collapsed": false 76 | }, 77 | "outputs": [ 78 | { 79 | "data": { 80 | "text/html": [ 81 | "
" 104 | ], 105 | "text/plain": [ 106 | " A D\n", 107 | "0 1 5\n", 108 | "1 1 6" 109 | ] 110 | }, 111 | "execution_count": 3, 112 | "metadata": {}, 113 | "output_type": "execute_result" 114 | } 115 | ], 116 | "source": [ 117 | "df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'D'])\n", 118 | "df2" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 4, 124 | "metadata": { 125 | "collapsed": false 126 | }, 127 | "outputs": [ 128 | { 129 | "data": { 130 | "text/html": [ 131 | "
" 160 | ], 161 | "text/plain": [ 162 | " A B A D\n", 163 | "0 1 3 1 5\n", 164 | "1 2 4 1 6" 165 | ] 166 | }, 167 | "execution_count": 4, 168 | "metadata": {}, 169 | "output_type": "execute_result" 170 | } 171 | ], 172 | "source": [ 173 | "pd.concat([df, df2], axis=1)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": { 180 | "collapsed": true 181 | }, 182 | "outputs": [], 183 | "source": [] 184 | } 185 | ], 186 | "metadata": { 187 | "anaconda-cloud": {}, 188 | "kernelspec": { 189 | "display_name": "Python [default]", 190 | "language": "python", 191 | "name": "python3" 192 | }, 193 | "language_info": { 194 | "codemirror_mode": { 195 | "name": "ipython", 196 | "version": 3 197 | }, 198 | "file_extension": ".py", 199 | "mimetype": "text/x-python", 200 | "name": "python", 201 | "nbconvert_exporter": "python", 202 | "pygments_lexer": "ipython3", 203 | "version": "3.5.2" 204 | } 205 | }, 206 | "nbformat": 4, 207 | "nbformat_minor": 0 208 | } 209 | -------------------------------------------------------------------------------- /3_c_merge_two_dataframes_and_split_again.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "Merge two dataframes and split them again.\n", 8 | "Useful for merging train and test data into a single panel.\n", 9 | "Transformations can then be applied to the whole panel in one go.\n", 10 | "Finally, split the panel back into train and test dataframes." 
11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 1, 16 | "metadata": { 17 | "collapsed": true 18 | }, 19 | "outputs": [], 20 | "source": [ 21 | "import pandas as pd\n", 22 | "import numpy as np" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 2, 28 | "metadata": { 29 | "collapsed": false 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "ts1 = [1,2,3,4]\n", 34 | "ts2 = [6,7,8,9]\n", 35 | "d = {'col_1': ts1, 'col_2': ts2}" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 3, 41 | "metadata": { 42 | "collapsed": false 43 | }, 44 | "outputs": [ 45 | { 46 | "data": { 47 | "text/plain": [ 48 | "{'col_1': [1, 2, 3, 4], 'col_2': [6, 7, 8, 9]}" 49 | ] 50 | }, 51 | "execution_count": 3, 52 | "metadata": {}, 53 | "output_type": "execute_result" 54 | } 55 | ], 56 | "source": [ 57 | "d" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 4, 63 | "metadata": { 64 | "collapsed": false 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "df_1 = pd.DataFrame(data=d)" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 5, 74 | "metadata": { 75 | "collapsed": false 76 | }, 77 | "outputs": [ 78 | { 79 | "data": { 80 | "text/html": [ 81 | "
" 114 | ], 115 | "text/plain": [ 116 | " col_1 col_2\n", 117 | "0 1 6\n", 118 | "1 2 7\n", 119 | "2 3 8\n", 120 | "3 4 9" 121 | ] 122 | }, 123 | "execution_count": 5, 124 | "metadata": {}, 125 | "output_type": "execute_result" 126 | } 127 | ], 128 | "source": [ 129 | "df_1" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 6, 135 | "metadata": { 136 | "collapsed": true 137 | }, 138 | "outputs": [], 139 | "source": [ 140 | "df_2 = pd.DataFrame(np.random.randn(3, 2), columns=['col_1', 'col_2'])" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 7, 146 | "metadata": { 147 | "collapsed": false 148 | }, 149 | "outputs": [ 150 | { 151 | "data": { 152 | "text/html": [ 153 | "
" 181 | ], 182 | "text/plain": [ 183 | " col_1 col_2\n", 184 | "0 0.654547 -1.201099\n", 185 | "1 -0.088006 -0.049599\n", 186 | "2 0.609881 -1.003260" 187 | ] 188 | }, 189 | "execution_count": 7, 190 | "metadata": {}, 191 | "output_type": "execute_result" 192 | } 193 | ], 194 | "source": [ 195 | "df_2" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 8, 201 | "metadata": { 202 | "collapsed": true 203 | }, 204 | "outputs": [], 205 | "source": [ 206 | "df_all = pd.concat((df_1, df_2), axis=0, ignore_index=True)" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": 9, 212 | "metadata": { 213 | "collapsed": false 214 | }, 215 | "outputs": [ 216 | { 217 | "data": { 218 | "text/html": [ 219 | "
" 267 | ], 268 | "text/plain": [ 269 | " col_1 col_2\n", 270 | "0 1.000000 6.000000\n", 271 | "1 2.000000 7.000000\n", 272 | "2 3.000000 8.000000\n", 273 | "3 4.000000 9.000000\n", 274 | "4 0.654547 -1.201099\n", 275 | "5 -0.088006 -0.049599\n", 276 | "6 0.609881 -1.003260" 277 | ] 278 | }, 279 | "execution_count": 9, 280 | "metadata": {}, 281 | "output_type": "execute_result" 282 | } 283 | ], 284 | "source": [ 285 | "df_all" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 10, 291 | "metadata": { 292 | "collapsed": false 293 | }, 294 | "outputs": [ 295 | { 296 | "name": "stdout", 297 | "output_type": "stream", 298 | "text": [ 299 | "(4, 2)\n", 300 | "(3, 2)\n", 301 | "(7, 2)\n" 302 | ] 303 | } 304 | ], 305 | "source": [ 306 | "print(df_1.shape)\n", 307 | "print(df_2.shape)\n", 308 | "print(df_all.shape)" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": 11, 314 | "metadata": { 315 | "collapsed": false 316 | }, 317 | "outputs": [], 318 | "source": [ 319 | "df_train = df_all[:df_1.shape[0]]\n", 320 | "df_test = df_all[df_1.shape[0]:]" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": 12, 326 | "metadata": { 327 | "collapsed": false 328 | }, 329 | "outputs": [ 330 | { 331 | "name": "stdout", 332 | "output_type": "stream", 333 | "text": [ 334 | "(4, 2)\n", 335 | "(3, 2)\n", 336 | "(7, 2)\n" 337 | ] 338 | } 339 | ], 340 | "source": [ 341 | "print(df_train.shape)\n", 342 | "print(df_test.shape)\n", 343 | "print(df_all.shape)" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": null, 349 | "metadata": { 350 | "collapsed": true 351 | }, 352 | "outputs": [], 353 | "source": [] 354 | } 355 | ], 356 | "metadata": { 357 | "anaconda-cloud": {}, 358 | "kernelspec": { 359 | "display_name": "Python [default]", 360 | "language": "python", 361 | "name": "python3" 362 | }, 363 | "language_info": { 364 | "codemirror_mode": { 365 | "name": "ipython", 366 | "version": 3 367 | }, 
368 | "file_extension": ".py", 369 | "mimetype": "text/x-python", 370 | "name": "python", 371 | "nbconvert_exporter": "python", 372 | "pygments_lexer": "ipython3", 373 | "version": "3.5.2" 374 | } 375 | }, 376 | "nbformat": 4, 377 | "nbformat_minor": 0 378 | } 379 | -------------------------------------------------------------------------------- /3_d_group_by_and_interate.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "perform group by on dataframe and iterate on the grouped result" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": { 25 | "collapsed": true 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "classes = [\"class 1\"] * 5 + [\"class 2\"] * 5\n", 30 | "sub_class = ['c1','c2','c2','c1','c3'] + ['c1','c2','c3','c2','c3']\n", 31 | "vals = [1,3,5,1,3] + [2,6,7,5,2]\n", 32 | "p_df = pd.DataFrame({\"class\": classes, \"sub_class\": sub_class, \"vals\": vals})" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 3, 38 | "metadata": { 39 | "collapsed": false 40 | }, 41 | "outputs": [ 42 | { 43 | "data": { 44 | "text/html": [ 45 | "
" 119 | ], 120 | "text/plain": [ 121 | " class sub_class vals\n", 122 | "0 class 1 c1 1\n", 123 | "1 class 1 c2 3\n", 124 | "2 class 1 c2 5\n", 125 | "3 class 1 c1 1\n", 126 | "4 class 1 c3 3\n", 127 | "5 class 2 c1 2\n", 128 | "6 class 2 c2 6\n", 129 | "7 class 2 c3 7\n", 130 | "8 class 2 c2 5\n", 131 | "9 class 2 c3 2" 132 | ] 133 | }, 134 | "execution_count": 3, 135 | "metadata": {}, 136 | "output_type": "execute_result" 137 | } 138 | ], 139 | "source": [ 140 | "p_df" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 4, 146 | "metadata": { 147 | "collapsed": false 148 | }, 149 | "outputs": [], 150 | "source": [ 151 | "grouped = p_df.groupby(['class', 'sub_class'])['vals'].median()" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 5, 157 | "metadata": { 158 | "collapsed": false 159 | }, 160 | "outputs": [ 161 | { 162 | "data": { 163 | "text/plain": [ 164 | "class sub_class\n", 165 | "class 1 c1 1.0\n", 166 | " c2 4.0\n", 167 | " c3 3.0\n", 168 | "class 2 c1 2.0\n", 169 | " c2 5.5\n", 170 | " c3 4.5\n", 171 | "Name: vals, dtype: float64" 172 | ] 173 | }, 174 | "execution_count": 5, 175 | "metadata": {}, 176 | "output_type": "execute_result" 177 | } 178 | ], 179 | "source": [ 180 | "grouped" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 6, 186 | "metadata": { 187 | "collapsed": false 188 | }, 189 | "outputs": [ 190 | { 191 | "name": "stdout", 192 | "output_type": "stream", 193 | "text": [ 194 | "class 1 : c1 : 1.0\n", 195 | "class 1 : c2 : 4.0\n", 196 | "class 1 : c3 : 3.0\n", 197 | "class 2 : c1 : 2.0\n", 198 | "class 2 : c2 : 5.5\n", 199 | "class 2 : c3 : 4.5\n" 200 | ] 201 | } 202 | ], 203 | "source": [ 204 | "for index_val, value in grouped.items():\n", 205 | " class_name, sub_class_name = index_val\n", 206 | " print(class_name, \":\", sub_class_name, \":\", value)" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "metadata": { 213 | 
"collapsed": true 214 | }, 215 | "outputs": [], 216 | "source": [] 217 | } 218 | ], 219 | "metadata": { 220 | "anaconda-cloud": {}, 221 | "kernelspec": { 222 | "display_name": "Python [conda root]", 223 | "language": "python", 224 | "name": "conda-root-py" 225 | }, 226 | "language_info": { 227 | "codemirror_mode": { 228 | "name": "ipython", 229 | "version": 3 230 | }, 231 | "file_extension": ".py", 232 | "mimetype": "text/x-python", 233 | "name": "python", 234 | "nbconvert_exporter": "python", 235 | "pygments_lexer": "ipython3", 236 | "version": "3.5.2" 237 | } 238 | }, 239 | "nbformat": 4, 240 | "nbformat_minor": 0 241 | } 242 | -------------------------------------------------------------------------------- /4_a_get_binary_or_logical_columns_from_dataframe.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "Get columns which are binary from a dataframe" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": { 25 | "collapsed": true 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "df = pd.DataFrame({'col_1': [1, 0, 1, None], \n", 30 | " 'col_2': [1.2, 3.1, 4.4, 5.5], \n", 31 | " 'col_3': [1, 2, 3, 4], \n", 32 | " 'col_4': ['a', 'b', 'c', 'd']})" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 3, 38 | "metadata": { 39 | "collapsed": false 40 | }, 41 | "outputs": [ 42 | { 43 | "data": { 44 | "text/html": [ 45 | "
" 88 | ], 89 | "text/plain": [ 90 | " col_1 col_2 col_3 col_4\n", 91 | "0 1.0 1.2 1 a\n", 92 | "1 0.0 3.1 2 b\n", 93 | "2 1.0 4.4 3 c\n", 94 | "3 NaN 5.5 4 d" 95 | ] 96 | }, 97 | "execution_count": 3, 98 | "metadata": {}, 99 | "output_type": "execute_result" 100 | } 101 | ], 102 | "source": [ 103 | "df" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 4, 109 | "metadata": { 110 | "collapsed": false 111 | }, 112 | "outputs": [], 113 | "source": [ 114 | "bool_cols = [col for col in df if len(df[[col]].dropna()[col].unique()) == 2]" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": 5, 120 | "metadata": { 121 | "collapsed": false 122 | }, 123 | "outputs": [ 124 | { 125 | "data": { 126 | "text/plain": [ 127 | "['col_1']" 128 | ] 129 | }, 130 | "execution_count": 5, 131 | "metadata": {}, 132 | "output_type": "execute_result" 133 | } 134 | ], 135 | "source": [ 136 | "bool_cols" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 6, 142 | "metadata": { 143 | "collapsed": false 144 | }, 145 | "outputs": [ 146 | { 147 | "data": { 148 | "text/html": [ 149 | "
" 177 | ], 178 | "text/plain": [ 179 | " col_1\n", 180 | "0 1.0\n", 181 | "1 0.0\n", 182 | "2 1.0\n", 183 | "3 NaN" 184 | ] 185 | }, 186 | "execution_count": 6, 187 | "metadata": {}, 188 | "output_type": "execute_result" 189 | } 190 | ], 191 | "source": [ 192 | "df[bool_cols]" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": { 199 | "collapsed": true 200 | }, 201 | "outputs": [], 202 | "source": [] 203 | } 204 | ], 205 | "metadata": { 206 | "anaconda-cloud": {}, 207 | "kernelspec": { 208 | "display_name": "Python [default]", 209 | "language": "python", 210 | "name": "python3" 211 | }, 212 | "language_info": { 213 | "codemirror_mode": { 214 | "name": "ipython", 215 | "version": 3 216 | }, 217 | "file_extension": ".py", 218 | "mimetype": "text/x-python", 219 | "name": "python", 220 | "nbconvert_exporter": "python", 221 | "pygments_lexer": "ipython3", 222 | "version": "3.5.2" 223 | } 224 | }, 225 | "nbformat": 4, 226 | "nbformat_minor": 0 227 | } 228 | -------------------------------------------------------------------------------- /4_b_convert_categorical_columns_to_label_encoded_columns_or_integer_column.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "get columns from dataframe which are categorical and convert them using label encoding" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "from sklearn.preprocessing import LabelEncoder" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "metadata": { 26 | "collapsed": true 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "df = pd.DataFrame({'col_1': [1, 0, 1, None], \n", 31 | " 'col_2': [1.2, 3.1, 4.4, 5.5], \n", 32 | " 'col_3': [1, 2, 3, 4], \n", 33 | " 'col_4': 
['a', 'b', 'c', 'd']})" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 3, 39 | "metadata": { 40 | "collapsed": false 41 | }, 42 | "outputs": [ 43 | { 44 | "data": { 45 | "text/html": [ 46 | "
" 89 | ], 90 | "text/plain": [ 91 | " col_1 col_2 col_3 col_4\n", 92 | "0 1.0 1.2 1 a\n", 93 | "1 0.0 3.1 2 b\n", 94 | "2 1.0 4.4 3 c\n", 95 | "3 NaN 5.5 4 d" 96 | ] 97 | }, 98 | "execution_count": 3, 99 | "metadata": {}, 100 | "output_type": "execute_result" 101 | } 102 | ], 103 | "source": [ 104 | "df" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 4, 110 | "metadata": { 111 | "collapsed": false 112 | }, 113 | "outputs": [ 114 | { 115 | "name": "stdout", 116 | "output_type": "stream", 117 | "text": [ 118 | "\n", 119 | "RangeIndex: 4 entries, 0 to 3\n", 120 | "Data columns (total 4 columns):\n", 121 | "col_1 3 non-null float64\n", 122 | "col_2 4 non-null float64\n", 123 | "col_3 4 non-null int64\n", 124 | "col_4 4 non-null object\n", 125 | "dtypes: float64(2), int64(1), object(1)\n", 126 | "memory usage: 208.0+ bytes\n" 127 | ] 128 | } 129 | ], 130 | "source": [ 131 | "df.info()" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 5, 137 | "metadata": { 138 | "collapsed": true 139 | }, 140 | "outputs": [], 141 | "source": [ 142 | "bool_cols = [col for col in df if len(df[[col]].dropna()[col].unique()) == 2]" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": 6, 148 | "metadata": { 149 | "collapsed": false 150 | }, 151 | "outputs": [], 152 | "source": [ 153 | "for col in bool_cols:\n", 154 | " label = LabelEncoder()\n", 155 | " label.fit(list(df[col].values.astype(\"str\")))\n", 156 | " df[col] = label.transform(list(df[col].values.astype(\"str\")))\n" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 7, 162 | "metadata": { 163 | "collapsed": false 164 | }, 165 | "outputs": [ 166 | { 167 | "name": "stdout", 168 | "output_type": "stream", 169 | "text": [ 170 | "\n", 171 | "RangeIndex: 4 entries, 0 to 3\n", 172 | "Data columns (total 4 columns):\n", 173 | "col_1 4 non-null int64\n", 174 | "col_2 4 non-null float64\n", 175 | "col_3 4 non-null int64\n", 176 | 
"col_4 4 non-null object\n", 177 | "dtypes: float64(1), int64(2), object(1)\n", 178 | "memory usage: 208.0+ bytes\n" 179 | ] 180 | } 181 | ], 182 | "source": [ 183 | "df.info()" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 8, 189 | "metadata": { 190 | "collapsed": false 191 | }, 192 | "outputs": [ 193 | { 194 | "data": { 195 | "text/html": [ 196 | "
" 239 | ], 240 | "text/plain": [ 241 | " col_1 col_2 col_3 col_4\n", 242 | "0 1 1.2 1 a\n", 243 | "1 0 3.1 2 b\n", 244 | "2 1 4.4 3 c\n", 245 | "3 2 5.5 4 d" 246 | ] 247 | }, 248 | "execution_count": 8, 249 | "metadata": {}, 250 | "output_type": "execute_result" 251 | } 252 | ], 253 | "source": [ 254 | "df" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "metadata": { 261 | "collapsed": true 262 | }, 263 | "outputs": [], 264 | "source": [] 265 | } 266 | ], 267 | "metadata": { 268 | "anaconda-cloud": {}, 269 | "kernelspec": { 270 | "display_name": "Python [default]", 271 | "language": "python", 272 | "name": "python3" 273 | }, 274 | "language_info": { 275 | "codemirror_mode": { 276 | "name": "ipython", 277 | "version": 3 278 | }, 279 | "file_extension": ".py", 280 | "mimetype": "text/x-python", 281 | "name": "python", 282 | "nbconvert_exporter": "python", 283 | "pygments_lexer": "ipython3", 284 | "version": "3.5.2" 285 | } 286 | }, 287 | "nbformat": 4, 288 | "nbformat_minor": 0 289 | } 290 | -------------------------------------------------------------------------------- /4_c_reduce_dimension_of_categorical_column.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "Sometimes columns in dataframe have high dimentionality.\n", 8 | "eg: some categorical column with 20 most frequent values covering 80% of the cases.\n", 9 | " Rest being long tail.\n", 10 | "In such case we can convert long tail part into others based on some cut off of count." 
11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 1, 16 | "metadata": { 17 | "collapsed": true 18 | }, 19 | "outputs": [], 20 | "source": [ 21 | "import pandas as pd" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 2, 27 | "metadata": { 28 | "collapsed": false 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "df = pd.DataFrame({'groups': ['group 1','group 2','group 1','group 2','group 3','group 4','group 5','group 1','group 2','group 5'], \n", 33 | " 'vals': [1,2,3,4,5,6,7,8,9,10]})" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 3, 39 | "metadata": { 40 | "collapsed": false 41 | }, 42 | "outputs": [ 43 | { 44 | "data": { 45 | "text/html": [ 46 | "
" 109 | ], 110 | "text/plain": [ 111 | " groups vals\n", 112 | "0 group 1 1\n", 113 | "1 group 2 2\n", 114 | "2 group 1 3\n", 115 | "3 group 2 4\n", 116 | "4 group 3 5\n", 117 | "5 group 4 6\n", 118 | "6 group 5 7\n", 119 | "7 group 1 8\n", 120 | "8 group 2 9\n", 121 | "9 group 5 10" 122 | ] 123 | }, 124 | "execution_count": 3, 125 | "metadata": {}, 126 | "output_type": "execute_result" 127 | } 128 | ], 129 | "source": [ 130 | "df" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 4, 136 | "metadata": { 137 | "collapsed": false 138 | }, 139 | "outputs": [ 140 | { 141 | "data": { 142 | "text/plain": [ 143 | "group 1 3\n", 144 | "group 2 3\n", 145 | "group 5 2\n", 146 | "group 4 1\n", 147 | "group 3 1\n", 148 | "Name: groups, dtype: int64" 149 | ] 150 | }, 151 | "execution_count": 4, 152 | "metadata": {}, 153 | "output_type": "execute_result" 154 | } 155 | ], 156 | "source": [ 157 | "df['groups'].value_counts()" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 5, 163 | "metadata": { 164 | "collapsed": false 165 | }, 166 | "outputs": [], 167 | "source": [ 168 | "high_dim_columns = ['groups']\n", 169 | "\n", 170 | "for column in high_dim_columns:\n", 171 | " a = pd.DataFrame(df[column].value_counts() <= 2)\n", 172 | " unique_values = a.index[a[column]].values\n", 173 | " df.loc[df[column].isin(unique_values), column] = 'other'" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 6, 179 | "metadata": { 180 | "collapsed": false 181 | }, 182 | "outputs": [ 183 | { 184 | "data": { 185 | "text/html": [ 186 | "
" 249 | ], 250 | "text/plain": [ 251 | " groups vals\n", 252 | "0 group 1 1\n", 253 | "1 group 2 2\n", 254 | "2 group 1 3\n", 255 | "3 group 2 4\n", 256 | "4 other 5\n", 257 | "5 other 6\n", 258 | "6 other 7\n", 259 | "7 group 1 8\n", 260 | "8 group 2 9\n", 261 | "9 other 10" 262 | ] 263 | }, 264 | "execution_count": 6, 265 | "metadata": {}, 266 | "output_type": "execute_result" 267 | } 268 | ], 269 | "source": [ 270 | "df" 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": null, 276 | "metadata": { 277 | "collapsed": true 278 | }, 279 | "outputs": [], 280 | "source": [] 281 | } 282 | ], 283 | "metadata": { 284 | "anaconda-cloud": {}, 285 | "kernelspec": { 286 | "display_name": "Python [conda root]", 287 | "language": "python", 288 | "name": "conda-root-py" 289 | }, 290 | "language_info": { 291 | "codemirror_mode": { 292 | "name": "ipython", 293 | "version": 3 294 | }, 295 | "file_extension": ".py", 296 | "mimetype": "text/x-python", 297 | "name": "python", 298 | "nbconvert_exporter": "python", 299 | "pygments_lexer": "ipython3", 300 | "version": "3.5.2" 301 | } 302 | }, 303 | "nbformat": 4, 304 | "nbformat_minor": 0 305 | } 306 | -------------------------------------------------------------------------------- /4_d_convert_categorical_columns_to_one_hot_encoded_columns.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "Convert categorical columns to one hot encoded columns" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": { 25 | "collapsed": false 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "\n", 30 | "df = pd.DataFrame({'sex': ['M', 'F', 'M', 'F'], \n", 31 | " 'col_2': [1.2, 3.1, 
4.4, 5.5], \n", 32 | " 'col_3': [1, 2, 3, 4], \n", 33 | " 'col_4': ['a', 'b', 'c', 'd']})" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 3, 39 | "metadata": { 40 | "collapsed": false 41 | }, 42 | "outputs": [ 43 | { 44 | "data": { 45 | "text/html": [ 46 | "
" 89 | ], 90 | "text/plain": [ 91 | " col_2 col_3 col_4 sex\n", 92 | "0 1.2 1 a M\n", 93 | "1 3.1 2 b F\n", 94 | "2 4.4 3 c M\n", 95 | "3 5.5 4 d F" 96 | ] 97 | }, 98 | "execution_count": 3, 99 | "metadata": {}, 100 | "output_type": "execute_result" 101 | } 102 | ], 103 | "source": [ 104 | "df" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 4, 110 | "metadata": { 111 | "collapsed": false 112 | }, 113 | "outputs": [], 114 | "source": [ 115 | "categorical_variables = ['sex']\n", 116 | "\n", 117 | "for variable in categorical_variables:\n", 118 | " # Fill missing data with the word \"Missing\"\n", 119 | " df[variable].fillna(\"Missing\", inplace=True)\n", 120 | " # Create array of dummies\n", 121 | " dummies = pd.get_dummies(df[variable], prefix=variable)\n", 122 | " # Update dataframe to include dummies and drop the main variable\n", 123 | " df = pd.concat([df, dummies], axis=1)\n", 124 | " df.drop([variable], axis=1, inplace=True)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": 5, 130 | "metadata": { 131 | "collapsed": false 132 | }, 133 | "outputs": [ 134 | { 135 | "data": { 136 | "text/html": [ 137 | "
" 185 | ], 186 | "text/plain": [ 187 | " col_2 col_3 col_4 sex_F sex_M\n", 188 | "0 1.2 1 a 0.0 1.0\n", 189 | "1 3.1 2 b 1.0 0.0\n", 190 | "2 4.4 3 c 0.0 1.0\n", 191 | "3 5.5 4 d 1.0 0.0" 192 | ] 193 | }, 194 | "execution_count": 5, 195 | "metadata": {}, 196 | "output_type": "execute_result" 197 | } 198 | ], 199 | "source": [ 200 | "df" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": null, 206 | "metadata": { 207 | "collapsed": true 208 | }, 209 | "outputs": [], 210 | "source": [] 211 | } 212 | ], 213 | "metadata": { 214 | "anaconda-cloud": {}, 215 | "kernelspec": { 216 | "display_name": "Python [conda root]", 217 | "language": "python", 218 | "name": "conda-root-py" 219 | }, 220 | "language_info": { 221 | "codemirror_mode": { 222 | "name": "ipython", 223 | "version": 3 224 | }, 225 | "file_extension": ".py", 226 | "mimetype": "text/x-python", 227 | "name": "python", 228 | "nbconvert_exporter": "python", 229 | "pygments_lexer": "ipython3", 230 | "version": "3.5.2" 231 | } 232 | }, 233 | "nbformat": 4, 234 | "nbformat_minor": 0 235 | } 236 | -------------------------------------------------------------------------------- /5_a_split_a_column_into_multiple_columns_based_on_delimiter.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "Split a text column into multiple column based on some delimiter. 
" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": { 25 | "collapsed": true 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "data = [{'test': 'vikash|Arpit', 'val': 6},\n", 30 | " {'test': 'vikash_1|arpit|Vinayp', 'val': 3},\n", 31 | " {'test': 'arpit|vinayp', 'val': 2}]" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 3, 37 | "metadata": { 38 | "collapsed": true 39 | }, 40 | "outputs": [], 41 | "source": [ 42 | "df = pd.DataFrame.from_dict(data, orient='columns')" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 4, 48 | "metadata": { 49 | "collapsed": false 50 | }, 51 | "outputs": [ 52 | { 53 | "data": { 54 | "text/html": [ 55 | "
" 83 | ], 84 | "text/plain": [ 85 | " test val\n", 86 | "0 vikash|Arpit 6\n", 87 | "1 vikash_1|arpit|Vinayp 3\n", 88 | "2 arpit|vinayp 2" 89 | ] 90 | }, 91 | "execution_count": 4, 92 | "metadata": {}, 93 | "output_type": "execute_result" 94 | } 95 | ], 96 | "source": [ 97 | "df" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 5, 103 | "metadata": { 104 | "collapsed": false 105 | }, 106 | "outputs": [ 107 | { 108 | "data": { 109 | "text/html": [ 110 | "
" 142 | ], 143 | "text/plain": [ 144 | " 0 1 2\n", 145 | "0 arpit vikash NaN\n", 146 | "1 vinayp arpit vikash_1\n", 147 | "2 vinayp arpit NaN" 148 | ] 149 | }, 150 | "execution_count": 5, 151 | "metadata": {}, 152 | "output_type": "execute_result" 153 | } 154 | ], 155 | "source": [ 156 | "df['test'].apply(lambda x: pd.Series([i for i in reversed(x.lower().split('|'))]))" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": { 163 | "collapsed": true 164 | }, 165 | "outputs": [], 166 | "source": [] 167 | } 168 | ], 169 | "metadata": { 170 | "anaconda-cloud": {}, 171 | "kernelspec": { 172 | "display_name": "Python [conda root]", 173 | "language": "python", 174 | "name": "conda-root-py" 175 | }, 176 | "language_info": { 177 | "codemirror_mode": { 178 | "name": "ipython", 179 | "version": 3 180 | }, 181 | "file_extension": ".py", 182 | "mimetype": "text/x-python", 183 | "name": "python", 184 | "nbconvert_exporter": "python", 185 | "pygments_lexer": "ipython3", 186 | "version": "3.5.2" 187 | } 188 | }, 189 | "nbformat": 4, 190 | "nbformat_minor": 1 191 | } 192 | -------------------------------------------------------------------------------- /5_b_split_a_column_into_multiple_columns_one_hot_encoding.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "Split a text column into multiple column based on some delimiter.\n", 8 | "Then convert the values into one hot encoded columns.\n", 9 | "Basically converting a categorical variable into one hot encoded values." 
10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": { 16 | "collapsed": true 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "import pandas as pd" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 2, 26 | "metadata": { 27 | "collapsed": true 28 | }, 29 | "outputs": [], 30 | "source": [ 31 | "data = [{'test': 'vikash|Arpit', 'val': 6},\n", 32 | " {'test': 'vikash_1|arpit|Vinayp', 'val': 3},\n", 33 | " {'test': 'arpit|vinayp', 'val': 2}]" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 3, 39 | "metadata": { 40 | "collapsed": true 41 | }, 42 | "outputs": [], 43 | "source": [ 44 | "df = pd.DataFrame.from_dict(data, orient='columns')" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 4, 50 | "metadata": { 51 | "collapsed": false 52 | }, 53 | "outputs": [ 54 | { 55 | "data": { 56 | "text/html": [ 57 | "
" 85 | ], 86 | "text/plain": [ 87 | " test val\n", 88 | "0 vikash|Arpit 6\n", 89 | "1 vikash_1|arpit|Vinayp 3\n", 90 | "2 arpit|vinayp 2" 91 | ] 92 | }, 93 | "execution_count": 4, 94 | "metadata": {}, 95 | "output_type": "execute_result" 96 | } 97 | ], 98 | "source": [ 99 | "df" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 5, 105 | "metadata": { 106 | "collapsed": false 107 | }, 108 | "outputs": [], 109 | "source": [ 110 | "chosen_columns = set()\n", 111 | "for idx, row in df.iterrows():\n", 112 | " for val in str(row['test']).lower().split('|'):\n", 113 | " chosen_columns.add(val.strip())" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 6, 119 | "metadata": { 120 | "collapsed": false 121 | }, 122 | "outputs": [], 123 | "source": [ 124 | "chosen_columns_list = list(chosen_columns)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": 7, 130 | "metadata": { 131 | "collapsed": false 132 | }, 133 | "outputs": [], 134 | "source": [ 135 | "chosen_columns_list.sort(key=len, reverse=True)" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 8, 141 | "metadata": { 142 | "collapsed": false 143 | }, 144 | "outputs": [ 145 | { 146 | "data": { 147 | "text/plain": [ 148 | "['vikash_1', 'vinayp', 'vikash', 'arpit']" 149 | ] 150 | }, 151 | "execution_count": 8, 152 | "metadata": {}, 153 | "output_type": "execute_result" 154 | } 155 | ], 156 | "source": [ 157 | "chosen_columns_list" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 9, 163 | "metadata": { 164 | "collapsed": false 165 | }, 166 | "outputs": [], 167 | "source": [ 168 | "def get_one_hot_encoded_column(col_value):\n", 169 | " col_value = col_value.lower()\n", 170 | " new_col_value = ''\n", 171 | " for val in chosen_columns_list:\n", 172 | " if val in col_value.split('|'):\n", 173 | " col_value = col_value.replace(val, '')\n", 174 | " new_col_value += '1,'\n", 175 | " else:\n", 176 | " 
new_col_value += '0,'\n", 177 | " return new_col_value[:-1]" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 10, 183 | "metadata": { 184 | "collapsed": false 185 | }, 186 | "outputs": [], 187 | "source": [ 188 | "df['test_new'] = df['test'].map(get_one_hot_encoded_column)" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 11, 194 | "metadata": { 195 | "collapsed": false 196 | }, 197 | "outputs": [ 198 | { 199 | "data": { 200 | "text/html": [ 201 | "
" 233 | ], 234 | "text/plain": [ 235 | " test val test_new\n", 236 | "0 vikash|Arpit 6 0,0,1,1\n", 237 | "1 vikash_1|arpit|Vinayp 3 1,1,0,1\n", 238 | "2 arpit|vinayp 2 0,1,0,1" 239 | ] 240 | }, 241 | "execution_count": 11, 242 | "metadata": {}, 243 | "output_type": "execute_result" 244 | } 245 | ], 246 | "source": [ 247 | "df" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": 12, 253 | "metadata": { 254 | "collapsed": false 255 | }, 256 | "outputs": [], 257 | "source": [ 258 | "df2 = df['test_new'].apply(lambda x: pd.Series([i for i in x.lower().split(',')]))" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": 15, 264 | "metadata": { 265 | "collapsed": false 266 | }, 267 | "outputs": [ 268 | { 269 | "data": { 270 | "text/html": [ 271 | "
" 307 | ], 308 | "text/plain": [ 309 | " vikash_1 vinayp vikash arpit\n", 310 | "0 0 0 1 1\n", 311 | "1 1 1 0 1\n", 312 | "2 0 1 0 1" 313 | ] 314 | }, 315 | "execution_count": 15, 316 | "metadata": {}, 317 | "output_type": "execute_result" 318 | } 319 | ], 320 | "source": [ 321 | "df2" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 16, 327 | "metadata": { 328 | "collapsed": false 329 | }, 330 | "outputs": [], 331 | "source": [ 332 | "df2.columns = chosen_columns_list" 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": 17, 338 | "metadata": { 339 | "collapsed": false 340 | }, 341 | "outputs": [ 342 | { 343 | "data": { 344 | "text/html": [ 345 | "
\n", 346 | "\n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | "
   vikash_1  vinayp  vikash  arpit
0         0       0       1      1
1         1       1       0      1
2         0       1       0      1
\n", 380 | "
" 381 | ], 382 | "text/plain": [ 383 | " vikash_1 vinayp vikash arpit\n", 384 | "0 0 0 1 1\n", 385 | "1 1 1 0 1\n", 386 | "2 0 1 0 1" 387 | ] 388 | }, 389 | "execution_count": 17, 390 | "metadata": {}, 391 | "output_type": "execute_result" 392 | } 393 | ], 394 | "source": [ 395 | "df2" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": 18, 401 | "metadata": { 402 | "collapsed": false 403 | }, 404 | "outputs": [ 405 | { 406 | "name": "stdout", 407 | "output_type": "stream", 408 | "text": [ 409 | "\n", 410 | "RangeIndex: 3 entries, 0 to 2\n", 411 | "Data columns (total 4 columns):\n", 412 | "vikash_1 3 non-null object\n", 413 | "vinayp 3 non-null object\n", 414 | "vikash 3 non-null object\n", 415 | "arpit 3 non-null object\n", 416 | "dtypes: object(4)\n", 417 | "memory usage: 176.0+ bytes\n" 418 | ] 419 | } 420 | ], 421 | "source": [ 422 | "df2.info()" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": 19, 428 | "metadata": { 429 | "collapsed": true 430 | }, 431 | "outputs": [], 432 | "source": [ 433 | "df2 = df2.apply(pd.to_numeric)" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": 20, 439 | "metadata": { 440 | "collapsed": false 441 | }, 442 | "outputs": [ 443 | { 444 | "name": "stdout", 445 | "output_type": "stream", 446 | "text": [ 447 | "\n", 448 | "RangeIndex: 3 entries, 0 to 2\n", 449 | "Data columns (total 4 columns):\n", 450 | "vikash_1 3 non-null int64\n", 451 | "vinayp 3 non-null int64\n", 452 | "vikash 3 non-null int64\n", 453 | "arpit 3 non-null int64\n", 454 | "dtypes: int64(4)\n", 455 | "memory usage: 176.0 bytes\n" 456 | ] 457 | } 458 | ], 459 | "source": [ 460 | "df2.info()" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": 28, 466 | "metadata": { 467 | "collapsed": false 468 | }, 469 | "outputs": [], 470 | "source": [ 471 | "df_new = pd.concat([df, df2], axis=1)" 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": 29, 
477 | "metadata": { 478 | "collapsed": false 479 | }, 480 | "outputs": [], 481 | "source": [ 482 | "df_new.drop(['test', 'test_new'], inplace=True, axis=1)" 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": 30, 488 | "metadata": { 489 | "collapsed": false 490 | }, 491 | "outputs": [ 492 | { 493 | "data": { 494 | "text/html": [ 495 | "
\n", 496 | "\n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | "
   val  vikash_1  vinayp  vikash  arpit
0    6         0       0       1      1
1    3         1       1       0      1
2    2         0       1       0      1
\n", 534 | "
" 535 | ], 536 | "text/plain": [ 537 | " val vikash_1 vinayp vikash arpit\n", 538 | "0 6 0 0 1 1\n", 539 | "1 3 1 1 0 1\n", 540 | "2 2 0 1 0 1" 541 | ] 542 | }, 543 | "execution_count": 30, 544 | "metadata": {}, 545 | "output_type": "execute_result" 546 | } 547 | ], 548 | "source": [ 549 | "df_new" 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": null, 555 | "metadata": { 556 | "collapsed": true 557 | }, 558 | "outputs": [], 559 | "source": [] 560 | } 561 | ], 562 | "metadata": { 563 | "anaconda-cloud": {}, 564 | "kernelspec": { 565 | "display_name": "Python [conda root]", 566 | "language": "python", 567 | "name": "conda-root-py" 568 | }, 569 | "language_info": { 570 | "codemirror_mode": { 571 | "name": "ipython", 572 | "version": 3 573 | }, 574 | "file_extension": ".py", 575 | "mimetype": "text/x-python", 576 | "name": "python", 577 | "nbconvert_exporter": "python", 578 | "pygments_lexer": "ipython3", 579 | "version": "3.5.2" 580 | } 581 | }, 582 | "nbformat": 4, 583 | "nbformat_minor": 1 584 | } 585 | -------------------------------------------------------------------------------- /6_a_extending_dataframe_capabilities.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "Adding functionality so we can create dataframe from string representation of dict" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 2, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import ast\n", 19 | "def str_to_dict(string):\n", 20 | " return ast.literal_eval(string) " 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 3, 26 | "metadata": { 27 | "collapsed": true 28 | }, 29 | "outputs": [], 30 | "source": [ 31 | "import pandas as pd" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 4, 37 | "metadata": { 38 | "collapsed": false 39 | 
}, 40 | "outputs": [], 41 | "source": [ 42 | "class MySubClass(pd.DataFrame):\n", 43 | " def from_str(self, string):\n", 44 | " df_obj = super().from_dict(str_to_dict(string))\n", 45 | " df_obj.my_string_attribute = string\n", 46 | " return df_obj" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 5, 52 | "metadata": { 53 | "collapsed": true 54 | }, 55 | "outputs": [], 56 | "source": [ 57 | "data = \"{'col_1' : ['a','b'], 'col2': [1, 2]}\"" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 7, 63 | "metadata": { 64 | "collapsed": false 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "obj = MySubClass().from_str(data)" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 8, 74 | "metadata": { 75 | "collapsed": false 76 | }, 77 | "outputs": [ 78 | { 79 | "data": { 80 | "text/plain": [ 81 | "__main__.MySubClass" 82 | ] 83 | }, 84 | "execution_count": 8, 85 | "metadata": {}, 86 | "output_type": "execute_result" 87 | } 88 | ], 89 | "source": [ 90 | "type(obj)" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 9, 96 | "metadata": { 97 | "collapsed": false 98 | }, 99 | "outputs": [ 100 | { 101 | "data": { 102 | "text/html": [ 103 | "
\n", 104 | "\n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | "
   col2 col_1
0     1     a
1     2     b
\n", 125 | "
" 126 | ], 127 | "text/plain": [ 128 | " col2 col_1\n", 129 | "0 1 a\n", 130 | "1 2 b" 131 | ] 132 | }, 133 | "execution_count": 9, 134 | "metadata": {}, 135 | "output_type": "execute_result" 136 | } 137 | ], 138 | "source": [ 139 | "obj" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 10, 145 | "metadata": { 146 | "collapsed": false 147 | }, 148 | "outputs": [ 149 | { 150 | "data": { 151 | "text/plain": [ 152 | "\"{'col_1' : ['a','b'], 'col2': [1, 2]}\"" 153 | ] 154 | }, 155 | "execution_count": 10, 156 | "metadata": {}, 157 | "output_type": "execute_result" 158 | } 159 | ], 160 | "source": [ 161 | "obj.my_string_attribute" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 8, 167 | "metadata": { 168 | "collapsed": true 169 | }, 170 | "outputs": [], 171 | "source": [ 172 | "sales = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200, 'Mar': 140},\n", 173 | " {'account': 'Alpha Co', 'Jan': 200, 'Feb': 210, 'Mar': 215},\n", 174 | " {'account': 'Blue Inc', 'Jan': 50, 'Feb': 90, 'Mar': 95 }]\n", 175 | "df = MySubClass(sales)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 9, 181 | "metadata": { 182 | "collapsed": false 183 | }, 184 | "outputs": [ 185 | { 186 | "data": { 187 | "text/html": [ 188 | "
\n", 189 | "\n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | "
   Feb  Jan  Mar    account
0  200  150  140  Jones LLC
1  210  200  215   Alpha Co
2   90   50   95   Blue Inc
\n", 223 | "
" 224 | ], 225 | "text/plain": [ 226 | " Feb Jan Mar account\n", 227 | "0 200 150 140 Jones LLC\n", 228 | "1 210 200 215 Alpha Co\n", 229 | "2 90 50 95 Blue Inc" 230 | ] 231 | }, 232 | "execution_count": 9, 233 | "metadata": {}, 234 | "output_type": "execute_result" 235 | } 236 | ], 237 | "source": [ 238 | "df" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 10, 244 | "metadata": { 245 | "collapsed": false 246 | }, 247 | "outputs": [ 248 | { 249 | "data": { 250 | "text/plain": [ 251 | "__main__.MySubClass" 252 | ] 253 | }, 254 | "execution_count": 10, 255 | "metadata": {}, 256 | "output_type": "execute_result" 257 | } 258 | ], 259 | "source": [ 260 | "type(df)" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": { 267 | "collapsed": true 268 | }, 269 | "outputs": [], 270 | "source": [] 271 | } 272 | ], 273 | "metadata": { 274 | "anaconda-cloud": {}, 275 | "kernelspec": { 276 | "display_name": "Python [conda root]", 277 | "language": "python", 278 | "name": "conda-root-py" 279 | }, 280 | "language_info": { 281 | "codemirror_mode": { 282 | "name": "ipython", 283 | "version": 3 284 | }, 285 | "file_extension": ".py", 286 | "mimetype": "text/x-python", 287 | "name": "python", 288 | "nbconvert_exporter": "python", 289 | "pygments_lexer": "ipython3", 290 | "version": "3.5.2" 291 | } 292 | }, 293 | "nbformat": 4, 294 | "nbformat_minor": 1 295 | } 296 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2016 Vikash Singh 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 
9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ### Basics of Pandas python library 2 | ======= 3 | 4 | 1. 
Basic DataFrame Operations 5 | - Create DataFrame from a dictionary [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/1_a_create_a_dataframe_from_dictonary.ipynb) 6 | - Create DataFrame by inserting rows in an iterative way [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/1_b_create_a_dataframe_by_iterating_and_inserting_rows_.ipynb) 7 | - Create DataFrame with randomly generated data [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/1_g_create_a_dataframe_with_randomly_generated_data.ipynb) 8 | - Create DataFrame from a csv file [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/1_c_create_dataframe_from_a_csv_file.ipynb) 9 | - Change DataFrame column names [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/1_d_change_column_names.ipynb) 10 | - Choose specific columns from a DataFrame [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/1_e_selecting_columns_or_choosing_columns.ipynb) 11 | - Drop or delete columns from a DataFrame [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/1_f_drop_or_delete_a_column.ipynb) 12 | 13 | 2. Manipulating DataFrame 14 | - Iterate over a DataFrame [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/2_a_iterate_over_a_dataframe.ipynb) 15 | - Apply a function to a DataFrame row-wise [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/2_b_apply_a_function_row_wise.ipynb) 16 | - Apply a function to a specific column of a DataFrame [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/2_c_apply_a_function_to_a_column.ipynb) 17 | - Find and replace a value in a DataFrame column [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/2_d_find_and_replace_a_value_in_dataframe_column.ipynb) 18 | 19 | 3. 
Split and Merge DataFrames 20 | - Merge DataFrames by columns using join [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/3_a_merge_dataframes_by_joining_columns.ipynb) 21 | - Merge DataFrames by columns on index [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/3_b_merge_dataframe_by_columns_on_index.ipynb) 22 | - Merge two DataFrames and split them again [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/3_c_merge_two_dataframes_and_split_again.ipynb) 23 | - Group a DataFrame and iterate over the grouped series [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/3_d_group_by_and_interate.ipynb) 24 | 25 | 4. Convert columns 26 | - Get binary columns in a DataFrame [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/4_a_get_binary_or_logical_columns_from_dataframe.ipynb) 27 | - Convert categorical columns to integer columns using label encoding [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/4_b_convert_categorical_columns_to_label_encoded_columns_or_integer_column.ipynb) 28 | - Reduce the dimensionality of a categorical column [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/4_c_reduce_dimension_of_categorical_column.ipynb) 29 | - Convert a categorical column to one-hot encoded columns [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/4_d_convert_categorical_columns_to_one_hot_encoded_columns.ipynb) 30 | 31 | 5. Split column 32 | - Split a column using a delimiter [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/5_a_split_a_column_into_multiple_columns_based_on_delimiter.ipynb) 33 | - Split a column using a delimiter and one-hot encode the values [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/5_b_split_a_column_into_multiple_columns_one_hot_encoding.ipynb) 34 | 35 | 6. 
Adding capabilities to pandas.DataFrame class 36 | - Extend pandas.DataFrame class to store additional value(s). [Sample Notebook](https://github.com/vi3k6i5/pandas_basics/blob/master/6_a_extending_dataframe_capabilities.ipynb) 37 | -------------------------------------------------------------------------------- /data/sample_data.csv: -------------------------------------------------------------------------------- 1 | col_1,col_2,target 2 | 0.11,0.22,1 3 | 0.1,0.2,1 4 | 0.9,0.8,0 -------------------------------------------------------------------------------- /data/sample_data_2.csv: -------------------------------------------------------------------------------- 1 | 0.11,0.22,1 2 | 0.1,0.2,1 3 | 0.9,0.8,0 --------------------------------------------------------------------------------