├── 01_16_av_automated_data_profiling_MS_WORD.ipynb ├── 02_av_tidy_data_with_python.ipynb ├── README.md ├── SAMPLE_FULL_DPD_Image_MSWORD.PNG └── data_profile_df_MS_WORD.docx /01_16_av_automated_data_profiling_MS_WORD.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "# 1. Automated Data Profiling In Python" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "**Author : Anandakumar Varatharajah**\n", 17 | "
\n", 18 | "***http://www.analyticsinsights.ninja***" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "Version : 0.17 \n", 26 | "Date : 14 July 2019 \n", 27 | "License : MIT License" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "The main objective of this notebook is **only** to understand raw data profile. i.e. data type, min & max values, ranges, unique values, etc. \n", 35 | "In consequent notebooks we will explore further on how to make decisions to make the data tidy and perform the data transformations based on the understanding of the data profile.\n", 36 | "
\n", 37 | "The code is largely kept generic so that it could be used with any shape of data. " 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "# The Game Changer - Data Profile Dataframe (DPD)" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "The game changer for exploratory data analysis is the final ***Data Profile Dataframe*** that is generated which combines ***all*** the information required to inform data cleaning, tidy data and optimisations (memory and processing) decisions. \n", 52 | "Instead of using various Pandas commands at different instances and going back and forth to cross refer information, Data Profile Dataframe brings all information into a single dataframe. This will be very useful when reviewing the data profile with the business subject matter or other team members as all information related to data profile is in a single easy to understand format.\n", 53 | "\n", 54 | "![image.png](https://raw.githubusercontent.com/AnalyticsInsightsNinja/Python_TidyData/master/SAMPLE_FULL_DPD_Image_MSWORD.PNG)\n" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "Understanding the data is **the critical step** in preparing the data to be used for analytics. As many experts will point out the data preparation and transforming the data into a tidy format takes about 80% of the effort in any data analytics or data analysis project.
\n", 62 | "***Understanding the data requires good understanding of the domain and/or access to a subject matter expert (SME) to help make decisions about data quality and data usage:***\n", 63 | "* What are the columns and what do they mean?\n", 64 | "* How to interpret each columns and possible values of a column?\n", 65 | "* Should the columns be renamed (and cleaned e.g. trim)?\n", 66 | "* Are there columns that may have similar information that could be dropped in favour of one master column?\n", 67 | "* Can columns with no values (or all empty) be dropped?\n", 68 | "* Can columns which have more than certain threshold of blank values be dropped?\n", 69 | "* How can the missing values be filled and can it be filled meaningfully?\n", 70 | "* Can rows that have missing values for certain columns or combination of columns be dropped? i.e. the row is meaningless wihtout those values.\n", 71 | "* Can the numeric data type columns be converted / down casted to optimise memory usage based on the data values?\n", 72 | " - or will there be outliers possibly in future data sets that we cannot do this?\n", 73 | " - can the min and max values be used to determine the lowest possible data type?\n", 74 | "* Can some string/object columns be converted to Category types?\n", 75 | " - based on count of unique values\n", 76 | "* Can any columns be discarded that may not be required for analytics?" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "# Environment setup" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "It is recommended best practice to document the execution environment. \n", 91 | "e.g. When the initial version of this notebook was developed in Azure Notebooks (Jupyter) the environment was documented in the code. When the notebook was exported to local PC JupyterLab and then imported back into Azure Notebook, the Kernal changed to an older version and some code did not work. Having the initital versions documented in comments saved a lot of effort in trying to understand what went wrong.\n" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 1, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "# Get the date of execution\n", 101 | "import datetime\n", 102 | "date_generated = datetime.datetime.now()\n" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 2, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "from platform import python_version \n", 112 | "# use python_version() to get the version. This is used in the final DPD HTML\n", 113 | "# 3.6.6 in Azure Notebooks in April 2019\n" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 3, 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "import pandas as pd\n", 123 | "# use pd.__version__ to get the pandas version. 
116 | {
117 | "cell_type": "code",
118 | "execution_count": 3,
119 | "metadata": {},
120 | "outputs": [],
121 | "source": [
122 | "import pandas as pd\n",
123 | "# use pd.__version__ to get the pandas version. This is used in the final DPD HTML\n",
124 | "# Pandas version 0.22.0 in Azure Notebooks in April 2019\n",
125 | "\n",
126 | "# set maximum number of columns to display in notebook\n",
127 | "pd.set_option('display.max_columns', 250)\n",
128 | "\n",
129 | "# To check whether a column is numeric type\n",
130 | "from pandas.api.types import is_numeric_dtype\n",
131 | "\n",
132 | "# To check whether a column is object/string type\n",
133 | "from pandas.api.types import is_string_dtype\n"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": 4,
139 | "metadata": {},
140 | "outputs": [],
141 | "source": [
142 | "import numpy as np\n"
143 | ]
144 | },
145 | {
146 | "cell_type": "code",
147 | "execution_count": 6,
148 | "metadata": {},
149 | "outputs": [],
150 | "source": [
151 | "# Import the graph packages\n",
152 | "import matplotlib.pyplot as plt\n",
153 | "plt.rcParams.update({'figure.max_open_warning': 0})\n",
154 | "%matplotlib inline\n",
155 | "\n",
156 | "import seaborn as sns\n",
157 | "# Seaborn version 0.9.0 in Azure Notebooks in April 2019\n",
158 | "# use sns.__version__ to get the seaborn version. This is used in the final DPD HTML"
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": null,
164 | "metadata": {},
165 | "outputs": [],
166 | "source": [
167 | "# This library is required to generate the MS Word document\n",
168 | "from docx import Document\n",
169 | "from docx.shared import Inches, Pt\n",
170 | "from docx.enum.text import WD_ALIGN_PARAGRAPH #used to align str(number) in cells "
171 | ]
172 | },
173 | {
174 | "cell_type": "markdown",
175 | "metadata": {},
176 | "source": [
177 | "# Raw data file exploration"
178 | ]
179 | },
180 | {
181 | "cell_type": "markdown",
182 | "metadata": {},
183 | "source": [
184 | "The raw data file used in this notebook is a Titanic passengers dataset hosted in the author's Sample_Analytics_Data GitHub repository (downloaded by the curl command below). \n",
185 | "The raw data should be in a format that can be loaded into pandas, i.e. if any rows need to be skipped, column headers mapped, etc., this should be handled in the pandas.read_csv code block."
186 | ]
187 | },
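
As a minimal sketch of the kind of handling the cell above refers to (the file name, skiprows value and header cleaning are illustrative assumptions, not part of the original notebook):

```python
import pandas as pd

# Hypothetical untidy export: two preamble rows sit above the real header.
df_raw = pd.read_csv(
    "raw_export.csv",   # assumed file name
    skiprows=2,         # skip the preamble rows
    header=0,           # the next row supplies the column names
    thousands=",",      # parse "1,234"-style numbers
)
# Map/clean the column headers straight after the load
df_raw.columns = [c.strip().lower().replace(" ", "_") for c in df_raw.columns]
```
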
188 | {
189 | "cell_type": "code",
190 | "execution_count": 7,
191 | "metadata": {
192 | "scrolled": true
193 | },
194 | "outputs": [
195 | {
196 | "name": "stderr",
197 | "output_type": "stream",
198 | "text": [
199 | "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
200 | "                                 Dload  Upload   Total   Spent    Left  Speed\n",
201 | "\n",
202 | "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\n",
203 | "100 44227  100 44227    0     0   396k      0 --:--:-- --:--:-- --:--:--  396k\n"
204 | ]
205 | }
206 | ],
207 | "source": [
208 | "# Download data file from Github site using curl and save it to local disk -o \"filename\"\n",
209 | "!curl -o \"mydataset.csv\" \"https://raw.githubusercontent.com/AnalyticsInsightsNinja/Sample_Analytics_Data/master/titanic.csv\" \n",
210 | "# Data file to be loaded\n",
211 | "raw_data_file = \"mydataset.csv\""
212 | ]
213 | },
214 | {
215 | "cell_type": "code",
216 | "execution_count": 8,
217 | "metadata": {},
218 | "outputs": [],
219 | "source": [
220 | "# Use Pandas to load the data file into a dataframe\n",
221 | "try:\n",
222 | "    df = pd.read_csv(raw_data_file, thousands=',')\n",
223 | "except FileNotFoundError:\n",
224 | "    print(\"Error: Data file not found!\")\n"
225 | ]
226 | },
227 | {
228 | "cell_type": "markdown",
229 | "metadata": {},
230 | "source": [
231 | "**Note:** If the raw data is a big file of several GBs in size, it may not be possible to load the whole file into memory. One possibility is using the pandas API on Spark (PySpark). \n",
232 | "Other options to load data incrementally and optimise the data by converting data types will be demonstrated in a separate notebook."
233 | ]
234 | },
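
Pending that separate notebook, a rough sketch of incremental loading with plain pandas (the chunk size is an arbitrary assumption):

```python
import pandas as pd

# Read the file in fixed-size chunks instead of all at once
row_count = 0
for chunk in pd.read_csv("mydataset.csv", thousands=",", chunksize=100_000):
    # aggregate, filter or downcast each chunk here, keeping only what is needed
    row_count += len(chunk)
print(f"Processed {row_count:,} rows without holding the full file in memory.")
```
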
\n", 232 | "Other options to load data incrementally and optimise the data by converting data types will be demonstrated in a seperate notebook." 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 9, 238 | "metadata": { 239 | "scrolled": true 240 | }, 241 | "outputs": [ 242 | { 243 | "data": { 244 | "text/html": [ 245 | "
\n", 246 | "\n", 259 | "\n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | "
SurvivedPclassNameSexAgeSiblings_Spouses_AboardParents_Children AboardFare
69202Mr. Charles Henry Chapmanmale52.00013.50
23003Mr. Bengt Edvin Larssonmale29.0007.78
15513Miss. Katherine Gilnaghfemale16.0007.73
87902Mr. Frederick James Banfieldmale28.00010.50
213Miss. Laina Heikkinenfemale26.0007.92
\n", 331 | "
" 332 | ], 333 | "text/plain": [ 334 | " Survived Pclass Name Sex Age \\\n", 335 | "692 0 2 Mr. Charles Henry Chapman male 52.0 \n", 336 | "230 0 3 Mr. Bengt Edvin Larsson male 29.0 \n", 337 | "155 1 3 Miss. Katherine Gilnagh female 16.0 \n", 338 | "879 0 2 Mr. Frederick James Banfield male 28.0 \n", 339 | "2 1 3 Miss. Laina Heikkinen female 26.0 \n", 340 | "\n", 341 | " Siblings_Spouses_Aboard Parents_Children Aboard Fare \n", 342 | "692 0 0 13.50 \n", 343 | "230 0 0 7.78 \n", 344 | "155 0 0 7.73 \n", 345 | "879 0 0 10.50 \n", 346 | "2 0 0 7.92 " 347 | ] 348 | }, 349 | "execution_count": 9, 350 | "metadata": {}, 351 | "output_type": "execute_result" 352 | } 353 | ], 354 | "source": [ 355 | "# Sample raw data rows from dataset\n", 356 | "df.sample(5).round(2)" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "# Memory Usage Analysis" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": 10, 369 | "metadata": {}, 370 | "outputs": [], 371 | "source": [ 372 | "# Check whether the file is obatined from url.\n", 373 | "# If from url, then skip file size in disk check\n", 374 | "\n", 375 | "if \"http\" in raw_data_file:\n", 376 | " file_size = float('nan')\n", 377 | "else:\n", 378 | " # Calculating file size (in MB) on disk\n", 379 | " import os\n", 380 | "\n", 381 | " file_size = (os.stat(raw_data_file).st_size / 1024 **2)\n", 382 | " #This is used in the DPD HTML\n" 383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": 11, 388 | "metadata": {}, 389 | "outputs": [], 390 | "source": [ 391 | "# Calculate dataset size in memory (MB)\n", 392 | "df_mem = df.memory_usage(deep=True).sum() / 1024**2\n", 393 | "#This is used in the DPD HTML\n" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": 12, 399 | "metadata": {}, 400 | "outputs": [], 401 | "source": [ 402 | "# Calclulate dataset size increase in memory (MB)\n", 403 | "sz_increase = ((df_mem - file_size) / file_size)\n", 404 | "#This is used in the DPD HTML" 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": 13, 410 | "metadata": {}, 411 | "outputs": [], 412 | "source": [ 413 | "# Plot the memory usage \n", 414 | "# Create a dictionary from the variables and convert to Pandas DataFrame\n", 415 | "# Use DataFrame's ploting capabilities\n", 416 | "raw_data_dict = {\"File on disk\":file_size, \"Dataset in memroy\": df_mem}\n", 417 | "raw_data_plot = pd.DataFrame.from_dict(raw_data_dict, orient='index').reset_index()\n", 418 | "\n", 419 | "# Pandas DataFrame plot\n", 420 | "raw_data_plot.plot(kind='bar',\\\n", 421 | " x=\"index\" ,\\\n", 422 | " y=0, \\\n", 423 | " legend=False, \\\n", 424 | " title='Data size increase from disk to memory')\n", 425 | "# plt.subplots_adjust(wspace=0.4, hspace=0.35)\n", 426 | "plt.xticks(rotation=0)\n", 427 | "\n", 428 | "# Save the figure\n", 429 | "plt.savefig('fig_df_tot_memory.png', dpi=50)\n", 430 | "plt.close('all')\n" 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": 14, 436 | "metadata": {}, 437 | "outputs": [], 438 | "source": [ 439 | "# Get memory used by each column in the raw data dataset in MB\n", 440 | "# This will be later merged with the DPD\n", 441 | "mem_used_dtypes = pd.DataFrame(df.memory_usage(deep=True) / 1024**2)\n", 442 | "\n", 443 | "# Rename column\n", 444 | "mem_used_dtypes.rename(columns={ 0:'memory'}, inplace=True)\n", 445 | "\n", 446 | "# Drop index memory usage since this is not required when merging with Data 
450 | {
451 | "cell_type": "markdown",
452 | "metadata": {},
453 | "source": [
454 | "# Constructing The Data Profile Dataframe (DPD) - The Game Changer "
455 | ]
456 | },
457 | {
458 | "cell_type": "code",
459 | "execution_count": 15,
460 | "metadata": {},
461 | "outputs": [],
462 | "source": [
463 | "# Number of rows of the DPD will be the count of columns in the raw data dataframe,\n",
464 | "# since there will be one row for each column\n",
465 | "no_of_rows = len(df.columns)\n",
466 | "\n",
467 | "\n",
468 | "# Constructing the data_qlt_df dataframe and pre-assigning the columns\n",
469 | "# Pre-assigning the number of rows the dataframe would have is memory and processing efficient\n",
470 | "# This is a better approach than continuous append or concat operations on the dataframe\n",
471 | "\n",
472 | "data_qlt_df = pd.DataFrame(index=np.arange(0, no_of_rows), \\\n",
473 | "                           columns=('column_name', 'col_data_type', 'col_memory','non_null_values', \\\n",
474 | "                                    'unique_values_count', 'column_dtype')\n",
475 | "                          )\n",
476 | "\n",
477 | "\n",
478 | "# Add rows to the data_qlt_df dataframe\n",
479 | "for ind, cols in enumerate(df.columns):\n",
480 | "    # Count of unique values in the column\n",
481 | "    col_unique_count = df[cols].nunique()\n",
482 | "    \n",
483 | "    data_qlt_df.loc[ind] = [cols, \\\n",
484 | "                            df[cols].dtype, \\\n",
485 | "                            mem_used_dtypes['memory'][ind], \\\n",
486 | "                            df[cols].count(), \\\n",
487 | "                            col_unique_count, \\\n",
488 | "                            cols + '~'+ str(df[cols].dtype)\n",
489 | "                           ]\n"
490 | ]
491 | },
492 | {
493 | "cell_type": "code",
494 | "execution_count": 16,
495 | "metadata": {},
496 | "outputs": [],
497 | "source": [
498 | "# Use describe() to get column stats of raw dataframe\n",
499 | "# This will be merged with the DPD\n",
500 | "raw_num_df = df.describe().T.round(2)\n"
501 | ]
502 | },
503 | {
504 | "cell_type": "code",
505 | "execution_count": 17,
506 | "metadata": {},
507 | "outputs": [],
508 | "source": [
509 | "#----- Key Step ---------------\n",
510 | "# Merging the df.describe() output with the rest of the info to create a single Data Profile Dataframe\n",
511 | "data_qlt_df = pd.merge(data_qlt_df, raw_num_df, how='left', left_on='column_name', right_index=True)\n"
512 | ]
513 | },
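
The next cell leans heavily on groupby().transform; a tiny self-contained illustration (not from the notebook) of what transform('sum') returns compared to a plain aggregation:

```python
import pandas as pd

demo = pd.DataFrame({"dtype": ["int64", "int64", "object"],
                     "mem":   [1.0, 2.0, 4.0]})

# transform('sum') returns one value per row, aligned to the original index,
# so the group total can sit beside each row without a separate merge.
demo["dtype_total"] = demo.groupby("dtype")["mem"].transform("sum")
# dtype_total -> 3.0, 3.0, 4.0
```
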
514 | {
515 | "cell_type": "code",
516 | "execution_count": 18,
517 | "metadata": {},
518 | "outputs": [],
519 | "source": [
520 | "# Calculate percentage of non-null values over total number of values\n",
521 | "data_qlt_df['%_of_non_nulls'] = (data_qlt_df['non_null_values']/df.shape[0])*100\n",
522 | "\n",
523 | "# Calculate null values for the column\n",
524 | "data_qlt_df['null_values'] = df.shape[0] - data_qlt_df['non_null_values']\n",
525 | "\n",
526 | "# Calculate percentage of null values over total number of values\n",
527 | "data_qlt_df['%_of_nulls'] = 100 - data_qlt_df['%_of_non_nulls']\n",
528 | "\n",
529 | "# Calculate percentage of each column's memory usage compared to the total memory used by the raw data dataframe\n",
530 | "data_qlt_df['%_of_total_memory'] = data_qlt_df['col_memory'] / data_qlt_df['col_memory'].sum() * 100\n",
531 | "\n",
532 | "# Calculate the total memory used by a given group of data type\n",
533 | "# See Notes section at the bottom of this notebook for advantages of using the 'transform' function with groupby\n",
534 | "data_qlt_df[\"dtype_total\"] = data_qlt_df.groupby('col_data_type')[\"col_memory\"].transform('sum')\n",
535 | "\n",
536 | "# Calculate the percentage memory used by each column compared to the total memory used by its data type group\n",
537 | "# the above can be merged into one calculation if we do not need the total as a separate column\n",
538 | "#data_qlt_df[\"%_of_dtype_mem2\"] = data_qlt_df[\"Dtype Memory\"] / (data_qlt_df.groupby('Data Type')[\"Dtype Memory\"].transform('sum')) * 100\n",
539 | "data_qlt_df[\"%_of_dtype_mem\"] = data_qlt_df[\"col_memory\"] / data_qlt_df[\"dtype_total\"] * 100\n",
540 | "\n",
541 | "# Calculate the percentage memory used by each group of data type out of the total memory used by the dataset\n",
542 | "data_qlt_df[\"dtype_%_total_mem\"] = data_qlt_df[\"dtype_total\"] / df_mem * 100\n",
543 | "\n",
544 | "# Calculate the count of each data type\n",
545 | "data_qlt_df[\"dtype_count\"] = data_qlt_df.groupby('col_data_type')[\"col_data_type\"].transform('count')\n",
546 | "\n",
547 | "# Calculate the total count of column values\n",
548 | "data_qlt_df[\"count\"] = data_qlt_df['null_values'] + data_qlt_df['non_null_values']"
549 | ]
550 | },
551 | {
552 | "cell_type": "code",
553 | "execution_count": 19,
554 | "metadata": {},
555 | "outputs": [],
556 | "source": [
557 | "# Reorder the Data Profile Dataframe columns\n",
558 | "data_qlt_df = data_qlt_df[\n",
559 | "              ['column_name', 'col_data_type', 'col_memory', '%_of_dtype_mem', '%_of_total_memory',\\\n",
560 | "               'dtype_count', 'dtype_total', 'dtype_%_total_mem', 'non_null_values', '%_of_non_nulls',\\\n",
561 | "               'null_values', '%_of_nulls', 'unique_values_count', 'count', 'mean', 'std', 'min', '25%',\\\n",
562 | "               '50%', '75%', 'max']\n",
563 | "              ]\n"
564 | ]
565 | },
566 | {
567 | "cell_type": "markdown",
568 | "metadata": {},
569 | "source": [
570 | "**The above data quality dataframe summarises all the information required for making data quality decisions.** \n",
571 | "Though the info() and describe() methods cover some of this, having all the relevant information in one dataframe makes data quality exploration much easier. This dataframe can be used for summarising information and for plotting, to enhance the ease of the Data Understanding effort."
572 | ]
573 | },
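
Once the DPD exists, typical data-quality review questions reduce to one-line queries; a small usage sketch (the thresholds are illustrative assumptions):

```python
# Columns that are mostly empty and may be candidates for dropping
mostly_null = data_qlt_df.loc[data_qlt_df["%_of_nulls"] > 75, "column_name"]

# Object columns with few unique values: candidates for the Category type
category_candidates = data_qlt_df.loc[
    (data_qlt_df["col_data_type"] == "object")
    & (data_qlt_df["unique_values_count"] < 50),
    "column_name",
]
```
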
574 | {
575 | "cell_type": "markdown",
576 | "metadata": {},
577 | "source": [
578 | "# Plot Memory Usage Analysis"
579 | ]
580 | },
581 | {
582 | "cell_type": "code",
583 | "execution_count": 20,
584 | "metadata": {},
585 | "outputs": [],
586 | "source": [
587 | "# Plot count of column data types and memory used by each datatype\n",
588 | "plt_dtype = data_qlt_df.groupby('col_data_type')[['dtype_count', 'dtype_total', 'dtype_%_total_mem']].last().sort_values(by='dtype_count')\n",
589 | "\n",
590 | "fig1, (ax, ax2) = plt.subplots(ncols=2, figsize=(10,5))\n",
591 | "plt.subplots_adjust(wspace=0.4, hspace=0.35, bottom=0.20)\n",
592 | "\n",
593 | "plt_dtype.plot(kind='bar', y='dtype_count', use_index=True, legend=False, ax=ax, title='Count of columns by data type')\n",
594 | "\n",
595 | "plt_dtype.plot(kind='bar', y='dtype_total', use_index=True, legend=False, ax=ax2, title='Memory used by data type')\n",
596 | "\n",
597 | "fig1.savefig(\"fig_cols_memory.png\", dpi=50)\n",
598 | "plt.close('all')"
599 | ]
600 | },
601 | {
602 | "cell_type": "code",
603 | "execution_count": 21,
604 | "metadata": {},
605 | "outputs": [],
606 | "source": [
607 | "# Memory used by columns of raw data dataframe\n",
608 | "fig2, ax = plt.subplots(ncols=1, figsize=(15,5))\n",
609 | "plt.subplots_adjust(wspace=0.4, hspace=0.35, bottom=0.30)\n",
610 | "\n",
611 | "# Memory used by object data type\n",
612 | "(data_qlt_df[data_qlt_df['col_data_type'] == 'object']\n",
613 | "     .sort_values(by='col_memory', ascending=False)\n",
614 | "     .plot(kind=\"bar\", \n",
615 | "           x=\"column_name\", \n",
616 | "           y=\"col_memory\", \n",
617 | "           title=\"Memory (MB) usage by columns of object data type\",\n",
618 | "           legend=False, ax=ax)\n",
619 | ")\n",
620 | "plt.xticks(rotation=35)\n",
621 | "fig2.savefig(\"fig_object_cols_memory.png\", dpi=50)\n",
622 | "plt.close('all')\n",
623 | "\n",
624 | "# Memory used by non-object data type\n",
625 | "fig2, ax1 = plt.subplots(ncols=1, figsize=(15,5))\n",
626 | "plt.subplots_adjust(wspace=0.4, hspace=0.35, bottom=0.30)\n",
627 | "\n",
628 | "(data_qlt_df[data_qlt_df['col_data_type'] != 'object']\n",
629 | "     .sort_values(by='col_memory', ascending=False)\n",
630 | "     .plot(kind=\"bar\", \n",
631 | "           x=\"column_name\", \n",
632 | "           y=\"col_memory\", \n",
633 | "           title=\"Memory (MB) usage by columns of non-object data type\",\n",
634 | "           legend=False, ax=ax1)\n",
635 | ")\n",
636 | "plt.xticks(rotation=35)\n",
637 | "\n",
638 | "fig2.savefig(\"fig_non_object_cols_memory.png\", dpi=50)\n",
639 | "plt.close('all')\n"
640 | ]
641 | },
642 | {
643 | "cell_type": "markdown",
644 | "metadata": {},
645 | "source": [
646 | "# Generate data profile graphs for 'numerical' columns"
647 | ]
648 | },
649 | {
650 | "cell_type": "code",
651 | "execution_count": 22,
652 | "metadata": {
653 | "scrolled": true
654 | },
655 | "outputs": [
656 | {
657 | "name": "stdout",
658 | "output_type": "stream",
659 | "text": [
660 | "1 of 6 completed  Survived\n",
661 | "2 of 6 completed  Pclass\n",
662 | "3 of 6 completed  Age\n",
663 | "4 of 6 completed  Siblings_Spouses_Aboard\n",
664 | "5 of 6 completed  Parents_Children Aboard\n",
665 | "6 of 6 completed  Fare\n"
666 | ]
667 | }
668 | ],
669 | "source": [
670 | "import numpy as np\n",
671 | "from matplotlib.patches import Rectangle\n",
672 | "\n",
673 | "# Get the list of numeric columns from raw dataframe\n",
674 | "# need this: from pandas.api.types import is_numeric_dtype\n",
675 | "# get numeric columns which are not empty\n",
676 | "num_cols = [cols for cols in df.columns if is_numeric_dtype(df[cols]) and len(df[cols].dropna())>0]\n",
677 | "\n",
678 | "iter_len = len(num_cols)\n",
679 | "\n",
680 | "# For each numeric column in the list\n",
681 | "for x, col_name in enumerate(num_cols):\n",
682 | "    print(x+1, \" of \", iter_len, \" completed \", col_name)\n",
683 | "    \n",
684 | "    # Create a copy of the column values without nulls or NA\n",
685 | "    no_null_col = df[col_name].dropna()\n",
686 | "    \n",
687 | "    \n",
688 | "    # Calculate the 25th, 75th and 95th percentiles of the values\n",
689 | "    q25 = np.percentile(no_null_col, 25)\n",
690 | "    q75 = np.percentile(no_null_col, 75) \n",
691 | "    q95 = np.percentile(no_null_col, 95)\n",
692 | "    \n",
693 | "    # Plot the graphs\n",
694 | "    fig3 = plt.figure(figsize=(20,15))\n",
695 | "    fig3.suptitle(\"Profile of column \" + col_name, fontsize=25) #Title for the whole figure\n",
696 | "    plt.subplots_adjust(wspace=0.4, hspace=0.35)\n",
697 | "\n",
698 | "    ax1 = fig3.add_subplot(2,3,1)\n",
699 | "    ax1.set_title(\"Box plot for all the values\", fontsize=20)\n",
700 | "    plt.setp(ax1.get_xticklabels(), ha=\"right\", rotation=35)\n",
701 | "    plt.setp(ax1.get_yticklabels(), ha=\"right\", fontsize=15)\n",
702 | "    ax1.boxplot(no_null_col)\n",
703 | "\n",
704 | "    ax1 = fig3.add_subplot(2,3,2)\n",
705 | "    ax1.set_title(\"Distribution of all values\", fontsize=20)\n",
706 | "    plt.setp(ax1.get_xticklabels(), ha=\"right\", rotation=35, fontsize=15)\n",
707 | "    plt.setp(ax1.get_yticklabels(), ha=\"right\", fontsize=15)\n",
708 | "    ax1.hist(no_null_col)\n",
709 | "\n",
710 | "    ax1 = fig3.add_subplot(2,3,3)\n",
711 | "    ax1.set_title(\"Boxplot for quartiles (all values)\", fontsize=20)\n",
712 | "    if len(no_null_col.value_counts()) >= 4:\n",
713 | "        df[u'quartiles'] = pd.qcut(\n",
714 | "                        df[col_name],\n",
715 | "                        4, duplicates='drop')\n",
716 | "        df.boxplot(column= col_name, by=u'quartiles', ax = ax1)\n",
717 | "    plt.setp(ax1.get_xticklabels(), ha=\"right\", rotation=35, fontsize=15)\n",
718 | "    plt.setp(ax1.get_yticklabels(), ha=\"right\", fontsize=15)\n",
719 | "\n",
720 | "    ax1 = fig3.add_subplot(2,3,4)\n",
721 | "    ax1.set_title(\"Box plot without outliers\", fontsize=20)\n",
722 | "    plt.setp(ax1.get_xticklabels(), ha=\"right\", rotation=35, fontsize=15)\n",
723 | "    plt.setp(ax1.get_yticklabels(), ha=\"right\", fontsize=15)\n",
724 | "    ax1.boxplot(no_null_col, showfliers=False)\n",
725 | "\n",
726 | "    ax1 = fig3.add_subplot(2,3,5)\n",
727 | "    ax1.set_title(\"Violin plot (<95th percentile)\", fontsize=20)\n",
728 | "    plt.setp(ax1.get_xticklabels(), ha=\"right\", rotation=35, fontsize=15)\n",
729 | "    plt.setp(ax1.get_yticklabels(), ha=\"right\", fontsize=15)\n",
730 | "    ax1.violinplot(no_null_col[no_null_col <= q95])\n",
731 | "\n",
732 | "    \n",
733 | "    #Histogram with bin ranges, counts and percentile colours\n",
734 | "    ax1 = fig3.add_subplot(2,3,6)\n",
735 | "    ax1.set_title(\"Histogram (<95th percentile)\", fontsize=20)\n",
736 | "    plt.setp(ax1.get_xticklabels(), ha=\"right\", rotation=35, fontsize=15)\n",
737 | "    plt.setp(ax1.get_yticklabels(), ha=\"right\", fontsize=15)\n",
738 | "\n",
739 | "    # Take only the data less than the 95th percentile\n",
740 | "    data = no_null_col[no_null_col <= q95]\n",
741 | "\n",
742 | "    # Colours for different percentiles\n",
743 | "    perc_25_colour = 'gold'\n",
744 | "    perc_50_colour = 'mediumaquamarine'\n",
745 | "    perc_75_colour = 'deepskyblue'\n",
746 | "    perc_95_colour = 'peachpuff'\n",
747 | "\n",
748 | "    '''\n",
749 | "    counts  = numpy.ndarray of count of data points for each bin/column in the histogram\n",
750 | "    bins    = numpy.ndarray of bin edge/range values\n",
751 | "    patches = a list of Patch objects.\n",
752 | "              each Patch object contains a Rectangle object. \n",
753 | "              e.g. Rectangle(xy=(-2.51953, 0), width=0.501013, height=3, angle=0)\n",
754 | "    '''\n",
755 | "    counts, bins, patches = ax1.hist(data, bins=10, facecolor=perc_50_colour, edgecolor='gray')\n",
756 | "\n",
757 | "    # Set the ticks to be at the edges of the bins.\n",
758 | "    ax1.set_xticks(bins.round(2))\n",
759 | "    plt.xticks(rotation=70, fontsize=15)\n",
760 | "\n",
761 | "    # Change the colours of bars at the edges\n",
762 | "    for patch, leftside, rightside in zip(patches, bins[:-1], bins[1:]):\n",
763 | "        if rightside < q25:\n",
764 | "            patch.set_facecolor(perc_25_colour)\n",
765 | "        elif leftside > q95:\n",
766 | "            patch.set_facecolor(perc_95_colour)\n",
767 | "        elif leftside > q75:\n",
768 | "            patch.set_facecolor(perc_75_colour)\n",
769 | "\n",
770 | "    # Calculate bar centre to display the count of data points and %\n",
771 | "    bin_x_centers = 0.5 * np.diff(bins) + bins[:-1]\n",
772 | "    bin_y_centers = ax1.get_yticks()[1] * 0.25\n",
773 | "\n",
774 | "    # Display the count of data points and % for each bar in histogram\n",
775 | "    for i in range(len(bins)-1):\n",
776 | "        bin_label = \"{0:,}\".format(counts[i]) + \"  ({0:,.2f}%)\".format((counts[i]/counts.sum())*100)\n",
777 | "        plt.text(bin_x_centers[i], bin_y_centers, bin_label, rotation=90, rotation_mode='anchor')\n",
778 | "\n",
779 | "    #create legend\n",
780 | "    handles = [Rectangle((0,0),1,1,color=c,ec=\"k\") for c in [perc_25_colour, perc_50_colour, perc_75_colour, perc_95_colour]]\n",
781 | "    labels = [\"0-25 Percentile\",\"25-75 Percentile\", \"75-95 Percentile\", \">95 Percentile\"]\n",
782 | "    plt.legend(handles, labels, bbox_to_anchor=(0.5, 0., 0.85, 0.99))\n",
783 | "    \n",
784 | "\n",
785 | "    fig3.suptitle(\"Profile of column \" + col_name, fontsize=25) #Title for the whole figure\n",
786 | "    fig_name = 'fig_' + col_name\n",
787 | "    fig3.savefig(fig_name, dpi=50)\n",
788 | "    plt.close('all')\n",
789 | "    \n",
790 | "# plt.show()\n",
791 | "\n",
792 | "df.drop(u'quartiles', axis=1, inplace=True)"
793 | ]
794 | },
795 | {
796 | "cell_type": "markdown",
797 | "metadata": {},
798 | "source": [
799 | "# Generate data profile graphs for 'object' columns"
800 | ]
801 | },
802 | {
803 | "cell_type": "code",
804 | "execution_count": 23,
805 | "metadata": {
806 | "scrolled": false
807 | },
808 | "outputs": [
809 | {
810 | "name": "stdout",
811 | "output_type": "stream",
812 | "text": [
813 | "1 of 2 completed  Name\n",
814 | "2 of 2 completed  Sex\n"
815 | ]
816 | }
817 | ],
818 | "source": [
819 | "# Get the list of object columns from raw dataframe\n",
820 | "# get object columns which are not empty\n",
821 | "obj_cols = [cols for cols in df.columns if is_string_dtype(df[cols]) and len(df[cols].dropna())>0]\n",
822 | "\n",
823 | "iter_len = len(obj_cols)\n",
824 | "\n",
825 | "\n",
826 | "# For each object column in the list\n",
827 | "for x, col_name in enumerate(obj_cols):\n",
828 | "    print(x+1, \" of \", iter_len, \" completed \", col_name)\n",
829 | "    \n",
830 | "    # Create a copy of the column values without nulls or NA\n",
831 | "    no_null_col = df[col_name].dropna()\n",
832 | "\n",
833 | "    values_freq_threshold = 25\n",
834 | "    col_unique_count = df[col_name].nunique()\n",
835 | "    \n",
836 | "    # Normalised frequency of each unique value (used for the top-25 bar chart)\n",
837 | "    col_unique_vals = df[col_name].value_counts(normalize=True, sort=True)\n",
838 | "    \n",
839 | "    # Plot the graphs\n",
840 | "    fig4 = plt.figure(figsize=(20,7))\n",
841 | "    fig4.suptitle(\"Profile of column \" + col_name, fontsize=25) #Title for the whole figure\n",
842 | "    plt.subplots_adjust(wspace=0.4, hspace=0.35, bottom=0.35)\n",
843 | "\n",
844 | "    ax1 = fig4.add_subplot(1,1,1)\n",
845 | "    ax1.set_title(\"Bar chart for top 25 values\", fontsize=20)\n",
846 | "    plt.setp(ax1.get_xticklabels(), ha=\"right\", rotation=45, fontsize=15)\n",
847 | "    plt.setp(ax1.get_yticklabels(), ha=\"right\", fontsize=15)\n",
848 | "    \n",
849 | "    col_unique_vals.head(values_freq_threshold).sort_values(ascending=False).plot.bar()\n",
850 | "    plt.xticks(rotation=75)\n",
851 | "    for p in ax1.patches:\n",
852 | "        ax1.annotate(str(round(p.get_height(),2)), (p.get_x() * 1.005, p.get_height() * 1.005), fontsize=15)\n",
853 | "    \n",
854 | "    fig4.suptitle(\"Profile of column \" + col_name, fontsize=25) #Title for the whole figure\n",
855 | "    fig_name = 'fig_' + col_name\n",
856 | "    fig4.savefig(fig_name, dpi= 50)\n",
857 | "\n",
858 | "    plt.close('all')\n",
859 | "# plt.show()"
860 | ]
861 | },
862 | {
863 | "cell_type": "markdown",
864 | "metadata": {},
865 | "source": [
866 | "# Candidate columns for Category type"
867 | ]
868 | },
869 | {
870 | "cell_type": "markdown",
871 | "metadata": {},
872 | "source": [
873 | "Analysing how many unique values an 'object' column has will be useful to determine which columns are good candidates for the *Categorical* data type. In combination with the total memory used by the 'object' data type and by each 'object' column, decisions can be made on converting them to Category type."
874 | ]
875 | },
876 | {
877 | "cell_type": "code",
878 | "execution_count": 24,
879 | "metadata": {},
880 | "outputs": [],
881 | "source": [
882 | "# Create a df and a column for % of memory used by each object column\n",
883 | "cardn_df = data_qlt_df[data_qlt_df['col_data_type'] == 'object'][['column_name', 'col_memory', '%_of_dtype_mem', '%_of_total_memory', 'unique_values_count']]\n",
884 | "\n",
885 | "cardn_df = cardn_df.sort_values('unique_values_count')\n"
886 | ]
887 | },
888 | {
889 | "cell_type": "markdown",
890 | "metadata": {},
891 | "source": [
892 | "# Candidate columns for down casting type"
893 | ]
894 | },
895 | {
896 | "cell_type": "code",
897 | "execution_count": 25,
898 | "metadata": {},
899 | "outputs": [],
900 | "source": [
901 | "# Create a df and a column for % of memory used by each non-object (numeric) column\n",
902 | "num_cardn_df = data_qlt_df[data_qlt_df['col_data_type'] != 'object'][['column_name', 'col_memory', '%_of_dtype_mem', '%_of_total_memory', 'unique_values_count']]\n",
903 | "\n",
904 | "num_cardn_df = num_cardn_df.sort_values('unique_values_count')"
905 | ]
906 | },
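
The candidate lists above are not acted on in this notebook; a rough sketch (the cardinality rule and downcast targets are assumptions) of what the conversions could look like:

```python
# Convert low-cardinality object columns to the 'category' type
for col in cardn_df["column_name"]:
    if df[col].nunique() / len(df) < 0.5:   # assumed cardinality threshold
        df[col] = df[col].astype("category")

# Downcast numeric columns to the smallest type that holds their values
for col in num_cardn_df["column_name"]:
    if pd.api.types.is_float_dtype(df[col]):
        df[col] = pd.to_numeric(df[col], downcast="float")
    elif pd.api.types.is_integer_dtype(df[col]):
        df[col] = pd.to_numeric(df[col], downcast="integer")
```
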
907 | {
908 | "cell_type": "markdown",
909 | "metadata": {},
910 | "source": [
911 | "# Columns with high percentage of null values"
912 | ]
913 | },
914 | {
915 | "cell_type": "code",
916 | "execution_count": 26,
917 | "metadata": {},
918 | "outputs": [],
919 | "source": [
920 | "# The empty values threshold can be set to a lower/higher value depending on the size of the data sets \n",
921 | "threshold_perc = 0.75\n",
922 | "col_vals_threshold = df.shape[0] * threshold_perc"
923 | ]
924 | },
925 | {
926 | "cell_type": "code",
927 | "execution_count": 27,
928 | "metadata": {},
929 | "outputs": [],
930 | "source": [
931 | "null_vals_df = data_qlt_df[data_qlt_df['non_null_values'] < col_vals_threshold][['column_name', 'col_data_type', 'col_memory', 'non_null_values', '%_of_non_nulls', 'null_values', '%_of_nulls']]\n",
932 | "\n",
933 | "# .style.format({'dtype_memory': \"{:,.2f}\", 'non_null_values': \"{:,.2f}\", '%_of_non_nulls': \"{:,.2f}\", 'null_values': \"{:,.2f}\", '%_of_nulls': \"{:,.2f}\", 'unique_values_count': \"{:,.2f}\"})"
934 | ]
935 | },
936 | {
937 | "cell_type": "markdown",
938 | "metadata": {},
939 | "source": [
940 | "# Generate the Correlation plot"
941 | ]
942 | },
943 | {
944 | "cell_type": "code",
945 | "execution_count": 28,
946 | "metadata": {},
947 | "outputs": [],
948 | "source": [
949 | "f, ax = plt.subplots(figsize=(15, 10))\n",
950 | "plt.subplots_adjust(bottom=0.35)\n",
951 | "plt.autoscale()\n",
952 | "\n",
953 | "corr_data = df.corr()\n",
954 | "sns.heatmap(corr_data,\n",
955 | "            mask=np.zeros_like(corr_data, dtype=bool), \n",
956 | "            cmap=sns.diverging_palette(20, 220, as_cmap=True),\n",
957 | "            vmin=-1, vmax=1,\n",
958 | "            square=True, \n",
959 | "            ax=ax)\n",
960 | "\n",
961 | "fig_name = 'fig_cor_plot.png'\n",
962 | "f.savefig(fig_name, dpi=70)\n",
963 | "# plt.show()\n",
964 | "plt.close('all')"
965 | ]
966 | },
967 | {
968 | "cell_type": "markdown",
969 | "metadata": {},
970 | "source": [
971 | "### Importing the image for the document"
972 | ]
973 | },
974 | {
975 | "cell_type": "code",
976 | "execution_count": 29,
977 | "metadata": {},
978 | "outputs": [],
979 | "source": [
980 | "import requests\n",
981 | "\n",
982 | "image_url = \"https://raw.githubusercontent.com/AnalyticsInsightsNinja/Python_TidyData/master/SAMPLE_FULL_DPD_Image_MSWORD.PNG\"\n",
983 | "\n",
984 | "Picture_request = requests.get(image_url)\n",
985 | "if Picture_request.status_code == 200:\n",
986 | "    with open(\"msword_output.jpg\", 'wb') as f:\n",
987 | "        f.write(Picture_request.content)"
988 | ]
989 | },
990 | {
991 | "cell_type": "markdown",
992 | "metadata": {},
993 | "source": [
994 | "# Construct the MS Word document"
995 | ]
996 | },
997 | {
998 | "cell_type": "code",
999 | "execution_count": 30,
1000 | "metadata": {},
1001 | "outputs": [],
1002 | "source": [
1003 | "# Make sure you have the docx package and it is imported\n",
1004 | "# see the environment setup section\n",
1005 | "\n",
1006 | "#Create Document object\n",
1007 | "document = Document()\n",
1008 | "\n",
1009 | "# Add Title\n",
1010 | "document.add_heading('Data Profile Dataframe - Notebook v0.17 - 14 July 2019', 0)\n",
1011 | "document.add_heading(raw_data_file, 0)\n",
1012 | "\n",
1013 | "# Cover page paragraph\n",
1014 | "p = document.add_paragraph('The main objective of this notebook is ')\n",
1015 | "p.add_run('only').bold = True\n",
1016 | "p.add_run(' to understand the raw data profile, i.e. data type, min & max values, ranges, unique values, etc.')\n",
1017 | "p = document.add_paragraph('In subsequent notebooks we will explore how to make decisions to make \\\n",
1018 | "the data tidy and perform the data transformations based on the understanding of the data profile.')\n",
1019 | "p = document.add_paragraph('')\n",
1020 | "p.add_run('The code is largely kept generic so that it could be used with any shape of data.').italic = True\n"
1021 | ]
1022 | },
1023 | {
1024 | "cell_type": "code",
1025 | "execution_count": 31,
1026 | "metadata": {},
1027 | "outputs": [
1028 | {
1029 | "data": {
1030 | "text/plain": [
1031 | ""
1032 | ]
1033 | },
1034 | "execution_count": 31,
1035 | "metadata": {},
1036 | "output_type": "execute_result"
1037 | }
1038 | ],
1039 | "source": [
1040 | "# Page 2\n",
1041 | "document.add_page_break()\n",
1042 | "# Heading 1\n",
1043 | "document.add_heading('The Game Changer - Data Profile Dataframe (DPD)', level=1)\n",
1044 | "p = document.add_paragraph('The game changer for exploratory data analysis is the final')\n",
1045 | "p.add_run(' Data Profile Dataframe').bold = True\n",
1046 | "p.add_run(' that is generated, which combines ')\n",
1047 | "p.add_run('all').bold = True\n",
1048 | "p.add_run(' the information required to inform data cleaning, tidying and optimisation (memory and processing) decisions.\\\n",
1049 | " Instead of using various Pandas commands at different instances and going back and forth to cross-refer information, the Data Profile Dataframe brings all information into a single dataframe.\\\n",
1050 | " This will be very useful when reviewing the data profile with the business subject matter expert or other team members as all information related to the data profile is in a single, easy to understand format.')\n",
1051 | "\n",
1052 | "document.add_picture('msword_output.jpg', height=Inches(4), width=Inches(4))\n",
1053 | "\n",
1054 | "document.add_page_break()\n",
1055 | "p = document.add_paragraph('Understanding the data is ')\n",
1056 | "p.add_run('the critical step').bold = True\n",
1057 | "p.add_run(' in preparing the data to be used for analytics.\\\n",
1058 | " As many experts point out, data preparation and transforming the data into a tidy format takes about 80% of the effort in any data analytics or data analysis project.')\n",
1059 | "p = document.add_paragraph('')\n",
1060 | "p.add_run('Understanding the data requires good understanding of the domain and/or access to a subject \\\n",
1061 | "matter expert (SME) to help make decisions about data quality and data usage:').bold = True\n",
1062 | "\n",
1063 | "document.add_paragraph(\n",
1064 | "    'What are the columns and what do they mean?', style='List Bullet'\n",
1065 | ")\n",
1066 | "document.add_paragraph(\n",
1067 | "    'How should each column and its possible values be interpreted?', style='List Bullet'\n",
1068 | ")\n",
1069 | "document.add_paragraph(\n",
1070 | "    'Should the columns be renamed (and cleaned, e.g. trimmed)?', style='List Bullet'\n",
1071 | ")\n",
1072 | "document.add_paragraph(\n",
1073 | "    'Are there columns that may have similar information that could be dropped in favour of one master column?', style='List Bullet'\n",
1074 | ")\n",
1075 | "document.add_paragraph(\n",
1076 | "    'Can columns with no values (or all empty) be dropped?', style='List Bullet'\n",
1077 | ")\n",
1078 | "document.add_paragraph(\n",
1079 | "    'Can columns which have more than a certain threshold of blank values be dropped?', style='List Bullet'\n",
1080 | ")\n",
1081 | "document.add_paragraph(\n",
1082 | "    'Can rows that have missing values for certain columns or combinations of columns be dropped?', style='List Bullet'\n",
1083 | ")\n",
1084 | "document.add_paragraph(\n",
1085 | "    'i.e. the row is meaningless without those values.', style='List Continue'\n",
1086 | ")\n",
1087 | "document.add_paragraph(\n",
1088 | "    'Can the numeric data type columns be converted / downcast to optimise memory usage based on the data values?', style='List Bullet'\n",
1089 | ")\n",
1090 | "document.add_paragraph(\n",
1091 | "    'or might future data sets contain outliers that make this unsafe?', style='List Bullet 2'\n",
1092 | ")\n",
1093 | "document.add_paragraph(\n",
1094 | "    'Can the min and max values be used to determine the lowest possible data type?', style='List Bullet 2'\n",
1095 | ")\n",
1096 | "document.add_paragraph(\n",
1097 | "    'Can some string/object columns be converted to Category types?', style='List Bullet'\n",
1098 | ")\n",
1099 | "document.add_paragraph(\n",
1100 | "    'based on count of unique values', style='List Bullet 2'\n",
1101 | ")\n",
1102 | "document.add_paragraph(\n",
1103 | "    'Can any columns be discarded that may not be required for analytics?', style='List Bullet'\n",
1104 | ")\n"
1105 | ]
1106 | },
1107 | {
1108 | "cell_type": "markdown",
1109 | "metadata": {},
1110 | "source": [
1111 | "# Word - Data profile summary"
1112 | ]
1113 | },
1114 | {
1115 | "cell_type": "code",
1116 | "execution_count": 32,
1117 | "metadata": {},
1118 | "outputs": [
1119 | {
1120 | "data": {
1121 | "text/plain": [
1122 | ""
1123 | ]
1124 | },
1125 | "execution_count": 32,
1126 | "metadata": {},
1127 | "output_type": "execute_result"
1128 | }
1129 | ],
1130 | "source": [
1131 | "document.add_page_break()\n",
1132 | "document.add_heading('Columns Data Profile Summary', 0)"
1133 | ]
1134 | },
1135 | {
1136 | "cell_type": "code",
1137 | "execution_count": 33,
1138 | "metadata": {},
1139 | "outputs": [],
1140 | "source": [
1141 | "# Page 4\n",
1142 | "p = document.add_paragraph(' ')\n",
1143 | "\n",
1144 | "# Heading 1\n",
1145 | "document.add_heading('Dataset shape', level=1)\n",
1146 | "\n",
1147 | "table = document.add_table(rows=2, cols=2, style = 'Medium Shading 1 Accent 3')\n",
1148 | "\n",
1149 | "# Header row\n",
1150 | "cell = table.cell(0, 0)\n",
1151 | "cell.text = 'No.of rows'\n",
1152 | "cell_font = cell.paragraphs[0].runs[0].font\n",
1153 | "cell_font.size = Pt(11)\n",
1154 | "cell_font.bold = True\n",
1155 | "\n",
1156 | "cell = table.cell(0, 1)\n",
1157 | "cell.text = 'No.of columns'\n",
1158 | "cell_font = cell.paragraphs[0].runs[0].font\n",
1159 | "cell_font.size = Pt(11)\n",
1160 | "cell_font.bold = True\n",
1161 | "\n",
1162 | "# Values\n",
1163 | "cell = table.cell(1, 0)\n",
1164 | "cell.text = F'{df.shape[0] :,}'\n",
1165 | "cell_font = cell.paragraphs[0].runs[0].font\n",
1166 | "cell_font.size = Pt(11)\n",
1167 | "cell_font.bold = False\n",
1168 | "\n",
1169 | "cell = table.cell(1, 1)\n",
1170 | "cell.text = F'{df.shape[1] :,}'\n",
1171 | "cell_font = cell.paragraphs[0].runs[0].font\n",
1172 | "cell_font.size = Pt(11)\n",
1173 | "cell_font.bold = False\n"
1174 | ]
1175 | },
1176 | {
1177 | "cell_type": "code",
1178 | "execution_count": 34,
1179 | "metadata": {},
1180 | "outputs": [],
1181 | "source": [
1182 | "# Page 4a\n",
1183 | "# document.add_page_break()\n",
1184 | "p = document.add_paragraph(' ')\n",
1185 | "\n",
1186 | "# Heading 1\n",
1187 | "document.add_heading('Dataframe columns summary', level=1)\n",
1188 | "\n",
1189 | "# Reshape the column data type dataframe into a form that can be printed in MS Word\n",
1190 | "data = round(data_qlt_df[['column_name','col_data_type', 'non_null_values', 'null_values', 'count']], 2)\n",
1191 | "\n",
1192 | "# add a table to the end and create a reference variable\n",
1193 | "# extra row is so we can add the header row\n",
1194 | "table = document.add_table(data.shape[0]+1, data.shape[1], style='Medium Shading 1 Accent 3')\n",
1195 | "\n",
1196 | "# add the header rows.\n",
1197 | "for j in range(data.shape[1]):\n",
1198 | "\n",
1199 | "    #header row, first two columns\n",
1200 | "    if j <= 1:\n",
1201 | "        cell = table.cell(0, j)\n",
1202 | "        cell.text = F'{data.columns[j]}'\n",
1203 | "        cell_font = cell.paragraphs[0].runs[0].font\n",
1204 | "        cell_font.size = Pt(11)\n",
1205 | "        cell_font.bold = True\n",
1206 | "    else:\n",
1207 | "        cell = table.cell(0, j)\n",
1208 | "        cell.text = F'{data.columns[j]}'\n",
1209 | "        cell_font = cell.paragraphs[0].runs[0].font\n",
1210 | "        cell_font.size = Pt(11)\n",
1211 | "        cell_font.bold = True\n",
1212 | "        cell.paragraphs[0].alignment= WD_ALIGN_PARAGRAPH.RIGHT\n",
1213 | "    \n",
1214 | "    \n",
1215 | "# add the rest of the data frame\n",
1216 | "for i in range(data.shape[0]):\n",
1217 | "    for j in range(data.shape[1]):\n",
1218 | "        if j <= 1:\n",
1219 | "            cell = table.cell(i+1, j)\n",
1220 | "            cell.text = F'{data.values[i,j]}'\n",
1221 | "            cell_font = cell.paragraphs[0].runs[0].font\n",
1222 | "            cell_font.size = Pt(11)\n",
1223 | "            cell_font.bold = False \n",
1224 | "        else:\n",
1225 | "            cell = table.cell(i+1, j)\n",
1226 | "            cell.text = F'{data.values[i,j] :,}'\n",
1227 | "            cell_font = cell.paragraphs[0].runs[0].font\n",
1228 | "            cell_font.size = Pt(11)\n",
1229 | "            cell_font.bold = False \n",
1230 | "            cell.paragraphs[0].alignment= WD_ALIGN_PARAGRAPH.RIGHT \n"
1231 | ]
1232 | },
1233 | {
1234 | "cell_type": "markdown",
1235 | "metadata": {},
1236 | "source": [
1237 | "# Word - Column memory usage profile"
1238 | ]
1239 | },
1240 | {
1241 | "cell_type": "code",
1242 | "execution_count": 35,
1243 | "metadata": {},
1244 | "outputs": [
1245 | {
1246 | "data": {
1247 | "text/plain": [
1248 | ""
1249 | ]
1250 | },
1251 | "execution_count": 35,
1252 | "metadata": {},
1253 | "output_type": "execute_result"
1254 | }
1255 | ],
1256 | "source": [
1257 | "document.add_page_break()\n",
1258 | "document.add_heading('Memory Usage Profile', 0)"
1259 | ]
1260 | },
1261 | {
1262 | "cell_type": "code",
1263 | "execution_count": 36,
1264 | "metadata": {},
1265 | "outputs": [
1266 | {
1267 | "data": {
1268 | "text/plain": [
1269 | ""
1270 | ]
1271 | },
1272 | "execution_count": 36,
1273 | "metadata": {},
1274 | "output_type": "execute_result"
1275 | }
1276 | ],
1277 | "source": [
1278 | "# Page 5\n",
1279 | "p = document.add_paragraph(' ')\n",
1280 | "\n",
1281 | "# Heading 1\n",
1282 | "document.add_heading('Data file size on disk vs. 
dataset size in memory', level=1)\n",
1283 | "\n",
1284 | "# Create table\n",
1285 | "table = document.add_table(rows=3, cols=2, style = 'Medium Shading 1 Accent 3')\n",
1286 | "\n",
1287 | "# Add column headers\n",
1288 | "cell = table.cell(0,0)\n",
1289 | "cell.text = 'Description'\n",
1290 | "cell_font = cell.paragraphs[0].runs[0].font\n",
1291 | "cell_font.size = Pt(11)\n",
1292 | "cell_font.bold = True \n",
1293 | "\n",
1294 | "cell = table.cell(0,1)\n",
1295 | "cell.text = 'Size in MB'\n",
1296 | "cell_font = cell.paragraphs[0].runs[0].font\n",
1297 | "cell_font.size = Pt(11)\n",
1298 | "cell_font.bold = True \n",
1299 | "cell.paragraphs[0].alignment= WD_ALIGN_PARAGRAPH.RIGHT \n",
1300 | "\n",
1301 | "# Add values : Value Line 1\n",
1302 | "cell = table.cell(1,0)\n",
1303 | "cell.text = 'Data file size on disk'\n",
1304 | "cell_font = cell.paragraphs[0].runs[0].font\n",
1305 | "cell_font.size = Pt(11)\n",
1306 | "cell_font.bold = False \n",
1307 | "\n",
1308 | "cell = table.cell(1,1)\n",
1309 | "cell.text = F'{round(file_size, 2) :,.2f}'\n",
1310 | "cell_font = cell.paragraphs[0].runs[0].font\n",
1311 | "cell_font.size = Pt(11)\n",
1312 | "cell_font.bold = False \n",
1313 | "cell.paragraphs[0].alignment= WD_ALIGN_PARAGRAPH.RIGHT \n",
1314 | "\n",
1315 | "# Add values : Value Line 2\n",
1316 | "cell = table.cell(2,0)\n",
1317 | "cell.text = 'Dataset size in memory'\n",
1318 | "cell_font = cell.paragraphs[0].runs[0].font\n",
1319 | "cell_font.size = Pt(11)\n",
1320 | "cell_font.bold = False \n",
1321 | "\n",
1322 | "cell = table.cell(2,1)\n",
1323 | "cell.text = F'{round(df_mem, 2) :,.2f}'\n",
1324 | "cell_font = cell.paragraphs[0].runs[0].font\n",
1325 | "cell_font.size = Pt(11)\n",
1326 | "cell_font.bold = False \n",
1327 | "cell.paragraphs[0].alignment= WD_ALIGN_PARAGRAPH.RIGHT \n",
1328 | "\n",
1329 | "# Memory increase\n",
1330 | "p = document.add_paragraph('')\n",
1331 | "p = document.add_paragraph('Dataset increase in memory : ')\n",
1332 | "p.add_run(str(round(sz_increase*100, 2)) + '%').bold = True\n",
1333 | "\n",
1334 | "# Add graph\n",
1335 | "document.add_picture('fig_df_tot_memory.png', height=Inches(3), width=Inches(3))\n"
1336 | ]
1337 | },
1338 | {
1339 | "cell_type": "code",
1340 | "execution_count": 37,
1341 | "metadata": {},
1342 | "outputs": [],
1343 | "source": [
1344 | "# Page 6\n",
1345 | "document.add_page_break()\n",
1346 | "\n",
1347 | "# Heading 1\n",
1348 | "document.add_heading('Dataframe column types and size in memory', level=1)\n",
1349 | "\n",
1350 | "# Reshape the column data type dataframe into a form that can be printed in MS Word\n",
1351 | "# Using .reset_index() will make the index a column\n",
1352 | "data = round(plt_dtype.reset_index(), 2)\n",
1353 | "\n",
1354 | "\n",
1355 | "# add a table to the end and create a reference variable\n",
1356 | "# extra row is so we can add the header row\n",
1357 | "table = document.add_table(data.shape[0]+1, data.shape[1], style = 'Medium Shading 1 Accent 3')\n",
1358 | "\n",
1359 | "# add the header rows.\n",
1360 | "for j in range(data.shape[1]):\n",
1361 | "    #header row, first column\n",
1362 | "    if j == 0:\n",
1363 | "        cell = table.cell(0, j)\n",
1364 | "        cell.text = F'{data.columns[j]}'\n",
1365 | "        cell_font = cell.paragraphs[0].runs[0].font\n",
1366 | "        cell_font.size = Pt(11)\n",
1367 | "        cell_font.bold = True\n",
1368 | "    else:\n",
1369 | "        cell = table.cell(0, j)\n",
1370 | "        cell.text = F'{data.columns[j]}'\n",
1371 | "        cell_font = cell.paragraphs[0].runs[0].font\n",
1372 | "        cell_font.size = 
Pt(11)\n",
1373 | "        cell_font.bold = True\n",
1374 | "        cell.paragraphs[0].alignment= WD_ALIGN_PARAGRAPH.RIGHT\n",
1375 | "    \n",
1376 | "    \n",
1377 | "# add the rest of the data frame\n",
1378 | "for i in range(data.shape[0]):\n",
1379 | "    for j in range(data.shape[1]):\n",
1380 | "        if j == 0:\n",
1381 | "            cell = table.cell(i+1, j)\n",
1382 | "            cell.text = F'{data.values[i,j]}'\n",
1383 | "            cell_font = cell.paragraphs[0].runs[0].font\n",
1384 | "            cell_font.size = Pt(11)\n",
1385 | "            cell_font.bold = False \n",
1386 | "        else:\n",
1387 | "            cell = table.cell(i+1, j)\n",
1388 | "            cell.text = F'{data.values[i,j] :,.2f}'\n",
1389 | "            cell_font = cell.paragraphs[0].runs[0].font\n",
1390 | "            cell_font.size = Pt(11)\n",
1391 | "            cell_font.bold = False \n",
1392 | "            cell.paragraphs[0].alignment= WD_ALIGN_PARAGRAPH.RIGHT \n",
1393 | "\n",
1394 | "\n",
1395 | "p = document.add_paragraph(' ')\n",
1396 | "p = document.add_paragraph('\"col_data_type\" : Column data type')\n",
1397 | "p = document.add_paragraph('\"dtype_count\" : Number of columns in the dataset of the given data type')\n",
1398 | "p = document.add_paragraph('\"dtype_total\" : Total memory in MB for the given data type')\n",
1399 | "p = document.add_paragraph('\"dtype_%_total_mem\" : Percentage of the memory used by the given data type out of the total memory used by the dataset')\n",
1400 | "\n",
1401 | "document.add_picture('fig_cols_memory.png', height=Inches(3), width=Inches(6))\n",
1402 | "\n",
1403 | "p = document.add_paragraph('In memory-heavy datasets the above information can shed light on which data type to focus on if you need to optimise memory usage.')\n",
1404 | "p = document.add_paragraph('e.g. perhaps convert \"object\" columns to the \"category\" type if the cardinality is low, or downcast \"float64\" to a smaller float type.')\n",
1405 | "p = document.add_paragraph('These decisions need further information on column cardinality and max/min values, which are covered in the next few sections.')\n"
1406 | ]
1407 | },
1408 | {
1409 | "cell_type": "code",
1410 | "execution_count": 38,
1411 | "metadata": {},
1412 | "outputs": [],
1413 | "source": [
1414 | "# Page 7\n",
1415 | "document.add_page_break()\n",
1416 | "\n",
1417 | "# Heading 1\n",
1418 | "document.add_heading('Memory used by \"object\" data type', level=1)\n",
1419 | "\n",
1420 | "\n",
1421 | "# Reshape the column data type dataframe into a form that can be printed in MS Word\n",
1422 | "# Using .reset_index() will make the index a column\n",
1423 | "data = round(cardn_df.sort_values(\"unique_values_count\"), 2)\n",
1424 | "\n",
1425 | "\n",
1426 | "# add a table to the end and create a reference variable\n",
1427 | "# extra row is so we can add the header row\n",
1428 | "table = document.add_table(data.shape[0]+1, data.shape[1], style = 'Medium Shading 1 Accent 3')\n",
1429 | "\n",
1430 | "# add the header rows.\n",
1431 | "for j in range(data.shape[1]):\n",
1432 | "    #header row, first column\n",
1433 | "    if j == 0:\n",
1434 | "        cell = table.cell(0, j)\n",
1435 | "        cell.text = F'{data.columns[j]}'\n",
1436 | "        cell_font = cell.paragraphs[0].runs[0].font\n",
1437 | "        cell_font.size = Pt(11)\n",
1438 | "        cell_font.bold = True\n",
1439 | "    else:\n",
1440 | "        cell = table.cell(0, j)\n",
1441 | "        cell.text = F'{data.columns[j]}'\n",
1442 | "        cell_font = cell.paragraphs[0].runs[0].font\n",
1443 | "        cell_font.size = Pt(11)\n",
1444 | "        cell_font.bold = True\n",
1445 | "        cell.paragraphs[0].alignment= WD_ALIGN_PARAGRAPH.RIGHT\n",
1446 | "    \n",
1447 | "    \n",
1448 | "# add the rest of the data frame\n",
1449 | "for i in range(data.shape[0]):\n",
1450 | "    for j in range(data.shape[1]):\n",
1451 | "        if j == 0:\n",
1452 | "            cell = table.cell(i+1, j)\n",
1453 | "            cell.text = F'{data.values[i,j]}'\n",
1454 | "            cell_font = cell.paragraphs[0].runs[0].font\n",
1455 | "            cell_font.size = Pt(11)\n",
1456 | "            cell_font.bold = False \n",
1457 | "        else:\n",
1458 | "            cell = table.cell(i+1, j)\n",
1459 | "            cell.text = F'{data.values[i,j] :,.2f}'\n",
1460 | "            cell_font = cell.paragraphs[0].runs[0].font\n",
1461 | "            cell_font.size = Pt(11)\n",
1462 | "            cell_font.bold = False \n",
1463 | "            cell.paragraphs[0].alignment= WD_ALIGN_PARAGRAPH.RIGHT \n",
1464 | "\n",
1465 | "\n",
1466 | "p = document.add_paragraph(' ')\n",
1467 | "p = document.add_paragraph('\"column_name\" : Name of the column in the dataframe')\n",
1468 | "p = document.add_paragraph('\"col_memory\" : Memory used by the given column')\n",
1469 | "p = document.add_paragraph('\"%_of_dtype_mem\" : Percentage of memory used by the given column out of memory used by the column data type')\n",
1470 | "p = document.add_paragraph('\"%_of_total_memory\" : Percentage of the memory used by the given column out of the total memory used by the dataset')\n",
1471 | "p = document.add_paragraph('\"unique_values_count\" : Count of the unique values for the given column')\n",
1472 | "\n",
1473 | "document.add_picture('fig_object_cols_memory.png', height=Inches(3), width=Inches(6))\n",
1474 | "\n",
1475 | "p = document.add_paragraph(' ')\n",
1476 | "p = document.add_paragraph(\"Analysing how many unique values an 'object' column has will be useful to determine \\\n",
1477 | "which columns are good candidates for the *Categorical* data type. In combination with the total memory used by the 'object' \\\n",
1478 | "data type and each 'object' data type column, decisions can be made on converting them to Category type. \\\n",
1479 | "Object or string data type columns with low cardinality are suitable for the Category type. \")\n",
1480 | "p.add_run(\"The threshold of 'low cardinality' depends on the domain of the data and data usage patterns.\").bold = True"
1481 | ]
1482 | },
1483 | {
1484 | "cell_type": "code",
1485 | "execution_count": 39,
1486 | "metadata": {},
1487 | "outputs": [],
1488 | "source": [
1489 | "# Page 8\n",
1490 | "document.add_page_break()\n",
1491 | "\n",
1492 | "# Heading 1\n",
1493 | "document.add_heading('Memory used by \"Non-Object\" data type', level=1)\n",
1494 | "\n",
1495 | "# Reshape the column data type dataframe into a form that can be printed in MS Word\n",
1496 | "# Using .reset_index() will make the index a column\n",
1497 | "data = round(num_cardn_df.sort_values(\"unique_values_count\"), 2)\n",
1498 | "\n",
1499 | "# add a table to the end and create a reference variable\n",
1500 | "# extra row is so we can add the header row\n",
1501 | "table = document.add_table(data.shape[0]+1, data.shape[1], style = 'Medium Shading 1 Accent 3')\n",
1502 | "\n",
1503 | "# add the header rows.\n",
1504 | "for j in range(data.shape[1]):\n",
1505 | "    #header row, first column\n",
1506 | "    if j == 0:\n",
1507 | "        cell = table.cell(0, j)\n",
1508 | "        cell.text = F'{data.columns[j]}'\n",
1509 | "        cell_font = cell.paragraphs[0].runs[0].font\n",
1510 | "        cell_font.size = Pt(11)\n",
1511 | "        cell_font.bold = True\n",
1512 | "    else:\n",
1513 | "        cell = table.cell(0, j)\n",
1514 | "        cell.text = F'{data.columns[j]}'\n",
1515 | "        cell_font = cell.paragraphs[0].runs[0].font\n",
1516 | "        cell_font.size = Pt(11)\n",
1517 | "        
1483 | { 1484 | "cell_type": "code", 1485 | "execution_count": 39, 1486 | "metadata": {}, 1487 | "outputs": [], 1488 | "source": [ 1489 | "# Page 8\n", 1490 | "document.add_page_break()\n", 1491 | "\n", 1492 | "# Heading 1\n", 1493 | "document.add_heading('Memory used by \"Non-Object\" data type', level=1)\n", 1494 | "\n", 1495 | "# Reshape the numeric-column cardinality dataframe into a form that can be printed in MS Word\n", 1496 | "# Using .reset_index() will make the index a column\n", 1497 | "data = round(num_cardn_df.sort_values(\"unique_values_count\"), 2)\n", 1498 | "\n", 1499 | "# add a table to the end and create a reference variable\n", 1500 | "# extra row is so we can add the header row\n", 1501 | "table = document.add_table(data.shape[0]+1, data.shape[1], style = 'Medium Shading 1 Accent 3')\n", 1502 | "\n", 1503 | "# add the header row.\n", 1504 | "for j in range(data.shape[1]):\n", 1505 | "    # header row, first column\n", 1506 | "    if j == 0:\n", 1507 | "        cell = table.cell(0, j)\n", 1508 | "        cell.text = F'{data.columns[j]}'\n", 1509 | "        cell_font = cell.paragraphs[0].runs[0].font\n", 1510 | "        cell_font.size = Pt(11)\n", 1511 | "        cell_font.bold = True\n", 1512 | "    else:\n", 1513 | "        cell = table.cell(0, j)\n", 1514 | "        cell.text = F'{data.columns[j]}'\n", 1515 | "        cell_font = cell.paragraphs[0].runs[0].font\n", 1516 | "        cell_font.size = Pt(11)\n", 1517 | "        cell_font.bold = True\n", 1518 | "        cell.paragraphs[0].alignment = WD_ALIGN_PARAGRAPH.RIGHT\n", 1519 | "    \n", 1520 | "    \n", 1521 | "# add the rest of the data frame\n", 1522 | "for i in range(data.shape[0]):\n", 1523 | "    for j in range(data.shape[1]):\n", 1524 | "        if j == 0:\n", 1525 | "            cell = table.cell(i+1, j)\n", 1526 | "            cell.text = F'{data.values[i,j]}'\n", 1527 | "            cell_font = cell.paragraphs[0].runs[0].font\n", 1528 | "            cell_font.size = Pt(11)\n", 1529 | "            cell_font.bold = False\n", 1530 | "        else:\n", 1531 | "            cell = table.cell(i+1, j)\n", 1532 | "            cell.text = F'{data.values[i,j] :,.2f}'\n", 1533 | "            cell_font = cell.paragraphs[0].runs[0].font\n", 1534 | "            cell_font.size = Pt(11)\n", 1535 | "            cell_font.bold = False\n", 1536 | "            cell.paragraphs[0].alignment = WD_ALIGN_PARAGRAPH.RIGHT\n", 1537 | "\n", 1538 | "\n", 1539 | "p = document.add_paragraph(' ')\n", 1540 | "p = document.add_paragraph('\"column_name\" : Name of the column in the dataframe')\n", 1541 | "p = document.add_paragraph('\"col_memory\" : Memory used by the given column')\n", 1542 | "p = document.add_paragraph('\"%_of_dtype_mem\" : Percentage of memory used by the given column out of the memory used by the column data type')\n", 1543 | "p = document.add_paragraph('\"%_of_total_memory\" : Percentage of the memory used by the given column out of the total memory used by the dataset')\n", 1544 | "\n", 1545 | "document.add_picture('fig_non_object_cols_memory.png', height=Inches(3), width=Inches(6))\n", 1546 | "\n", 1547 | "p = document.add_paragraph(' ')\n", 1548 | "p = document.add_paragraph(\"By analysing the min and max values of the numeric columns, decisions can be made to downcast the data types to more memory-efficient storage types.\")\n" 1549 | ] 1550 | },
1551 | { 1552 | "cell_type": "code", 1553 | "execution_count": 40, 1554 | "metadata": {}, 1555 | "outputs": [], 1556 | "source": [ 1557 | "# Page 9\n", 1558 | "document.add_page_break()\n", 1559 | "\n", 1560 | "# Heading 1\n", 1561 | "document.add_heading('Columns with non-null values less than ' + \"{:,.2f}\".format(threshold_perc*100) + '%', level=1)\n", 1562 | "\n", 1563 | "p = document.add_paragraph('The columns should contain at least ' + \"{:,.0f}\".format(col_vals_threshold) + ' (' + \"{:,.2f}\".format((col_vals_threshold/df.shape[0])*100) + '%) non-empty rows out of '+ \"{:,}\".format(df.shape[0]) + ' rows to be considered useful.')\n", 1564 | "p = document.add_paragraph('The non-empty values threshold can be set using the threshold_perc variable in the code.')\n", 1565 | "\n", 1566 | "\n", 1567 | "# Reshape the null-values dataframe into a form that can be printed in MS Word\n", 1568 | "# Using .reset_index() will make the index a column\n", 1569 | "data = round(null_vals_df.sort_values(\"non_null_values\"), 2)\n", 1570 | "\n", 1571 | "# add a table to the end and create a reference variable\n", 1572 | "# extra row is so we can add the header row\n", 1573 | "table = document.add_table(data.shape[0]+1, data.shape[1], style = 'Medium Shading 1 Accent 3')\n", 1574 | "\n", 1575 | "# add the header row.\n", 1576 | "for j in range(data.shape[1]):\n", 1577 | "    # header row, first two columns\n", 1578 | "    if j <= 1:\n", 1579 | "        cell = table.cell(0, j)\n", 1580 | "        cell.text = F'{data.columns[j]}'\n", 1581 | "        cell_font = cell.paragraphs[0].runs[0].font\n", 1582 | "        cell_font.size = Pt(11)\n", 1583 | "        cell_font.bold = True\n", 1584 | "    else:\n", 1585 | "        cell = table.cell(0, j)\n", 1586 | "        cell.text = F'{data.columns[j]}'\n", 1587 | "        cell_font = cell.paragraphs[0].runs[0].font\n", 1588 | "        cell_font.size = Pt(11)\n", 1589 | "        cell_font.bold = True\n", 1590 | "        cell.paragraphs[0].alignment = WD_ALIGN_PARAGRAPH.RIGHT\n", 1591 | "    \n", 1592 | "    \n", 1593 | "# add the rest of the data frame\n", 1594 | "for i in range(data.shape[0]):\n", 1595 | "    for j in range(data.shape[1]):\n", 1596 | "        if j <= 1:\n", 1597 | "            cell = table.cell(i+1, j)\n", 1598 | "            cell.text = F'{data.values[i,j]}'\n", 1599 | "            cell_font = cell.paragraphs[0].runs[0].font\n", 1600 | "            cell_font.size = Pt(11)\n", 1601 | "            cell_font.bold = False\n", 1602 | "        else:\n", 1603 | "            cell = table.cell(i+1, j)\n", 1604 | "            cell.text = F'{data.values[i,j] :,.2f}'\n", 1605 | "            cell_font = cell.paragraphs[0].runs[0].font\n", 1606 | "            cell_font.size = Pt(11)\n", 1607 | "            cell_font.bold = False\n", 1608 | "            cell.paragraphs[0].alignment = WD_ALIGN_PARAGRAPH.RIGHT\n", 1609 | "    \n", 1610 | "\n", 1611 | "\n", 1612 | "p = document.add_paragraph(' ')\n", 1613 | "p = document.add_paragraph('\"column_name\" : Name of the column in the dataframe')\n", 1614 | "p = document.add_paragraph('\"col_data_type\" : Data type of the given column')\n", 1615 | "p = document.add_paragraph('\"col_memory\" : Memory used by the given column')\n", 1616 | "p = document.add_paragraph('\"non_null_values\" : Count of non-null values in the given column')\n", 1617 | "p = document.add_paragraph('\"%_of_non_nulls\" : Percentage of the non-null values out of total values for the given column')\n", 1618 | "p = document.add_paragraph('\"null_values\" : Count of null values in the given column')\n", 1619 | "p = document.add_paragraph('\"%_of_nulls\" : Percentage of the null values out of total values for the given column')\n", 1620 | "\n", 1621 | "p = document.add_paragraph(' ')\n", 1622 | "p = document.add_paragraph(\"Generally, columns with a large percentage of empty values can be *dropped* from the dataset as they will not add any value to the analysis.\")\n", 1623 | "p = document.add_paragraph('')\n", 1624 | "p.add_run('But this depends on the domain of the dataset and the usage pattern of the columns/data.').bold = True\n" 1625 | ] 1626 | },
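Pages 8 and 9 above recommend two follow-up transformations without showing them: downcasting numeric columns based on their actual value ranges, and dropping columns whose non-null share falls below the `threshold_perc` used in the report. A minimal sketch of both, again assuming the profiled DataFrame `df` (the helper names are invented here):

```python
import pandas as pd

def drop_sparse_columns(df, threshold_perc=0.25):
    """Keep only columns whose non-null fraction meets the threshold (sketch)."""
    return df.loc[:, df.notnull().mean() >= threshold_perc]

def downcast_numeric(df):
    """Downcast numeric columns to the smallest dtype their values allow (sketch)."""
    out = df.copy()
    for col in out.select_dtypes(include="integer").columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include="float").columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

df_slim = downcast_numeric(drop_sparse_columns(df, threshold_perc=0.25))
```

`pd.to_numeric(..., downcast=...)` chooses the type from the column's actual min/max values, mirroring the advice above; whether future data might overflow the smaller type is the domain question the document raises.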
"metadata": {}, 1691 | "outputs": [], 1692 | "source": [ 1693 | "# ind = 1 # to be taken from iterrows loop later\n", 1694 | "for ind in range(data_qlt_df.shape[0]):\n", 1695 | " document.add_page_break()\n", 1696 | " \n", 1697 | " # Create table for column profile details\n", 1698 | " table = document.add_table(rows=6, cols=6, style = 'Medium Shading 1 Accent 3' )\n", 1699 | " \n", 1700 | " # Merge cells in header row for COlumn Name\n", 1701 | " for y in range(len(table.rows[0].cells)-1):\n", 1702 | " a = table.cell(0,y)\n", 1703 | " b = table.cell(0,y+1)\n", 1704 | " a.merge(b)\n", 1705 | "\n", 1706 | " # Merge cells in detail rows spanning 2 cells x 3 \n", 1707 | " for row in range(1,6):\n", 1708 | " a = table.cell(row,0)\n", 1709 | " b = table.cell(row,1)\n", 1710 | " a.merge(b)\n", 1711 | " a = table.cell(row,2)\n", 1712 | " b = table.cell(row,3)\n", 1713 | " a.merge(b)\n", 1714 | " a = table.cell(row,4)\n", 1715 | " b = table.cell(row,5)\n", 1716 | " a.merge(b)\n", 1717 | "\n", 1718 | "\n", 1719 | " #*** ADD VALUES TO TABLE ***#\n", 1720 | " # Cell 0,0 (merged 6 cells): Header - Column Name\n", 1721 | " cell = table.cell(0, 0)\n", 1722 | " cell.text = data_qlt_df[\"column_name\"][ind]\n", 1723 | " cell_font = cell.paragraphs[0].runs[0].font\n", 1724 | " cell_font.size = Pt(15)\n", 1725 | " cell_font.bold = True\n", 1726 | "\n", 1727 | " # Cell 1,0: Blank\n", 1728 | " cell = table.cell(1, 1)\n", 1729 | " cell.text = \"TBD Column :\\n\"\n", 1730 | " cell_font = cell.paragraphs[0].runs[0].font\n", 1731 | " cell_font.size = Pt(11)\n", 1732 | " cell_font.bold = True\n", 1733 | " p = cell.paragraphs[0].add_run('no value')\n", 1734 | " cell_font2 = cell.paragraphs[0].runs[1].font\n", 1735 | " cell_font2.size = Pt(12)\n", 1736 | " cell_font2.bold = False\n", 1737 | "\n", 1738 | " # Cell 1,0: Column data type\n", 1739 | " cell = table.cell(1, 3)\n", 1740 | " cell.text = 'Data Type : \\n'\n", 1741 | " cell_font = cell.paragraphs[0].runs[0].font\n", 1742 | " cell_font.size = Pt(11)\n", 1743 | " cell_font.bold = True\n", 1744 | " p = cell.paragraphs[0].add_run(str(data_qlt_df[\"col_data_type\"][ind]))\n", 1745 | " cell_font2 = cell.paragraphs[0].runs[1].font\n", 1746 | " cell_font2.size = Pt(12)\n", 1747 | " cell_font2.bold = False\n", 1748 | "\n", 1749 | " # Cell 1,1: Count of toal values in the column\n", 1750 | " cell = table.cell(1, 5)\n", 1751 | " cell.text = 'Values Count : \\n'\n", 1752 | " cell_font = cell.paragraphs[0].runs[0].font\n", 1753 | " cell_font.size = Pt(11)\n", 1754 | " cell_font.bold = True\n", 1755 | " p = cell.paragraphs[0].add_run(F'{data_qlt_df[\"count\"][ind] :,.0f}')\n", 1756 | " cell_font2 = cell.paragraphs[0].runs[1].font\n", 1757 | " cell_font2.size = Pt(11)\n", 1758 | " cell_font2.bold = False\n", 1759 | "\n", 1760 | " # Cell 2,0: Count of unique values in the column\n", 1761 | " cell = table.cell(2, 1)\n", 1762 | " cell.text = 'Unique Values Count : \\n'\n", 1763 | " cell_font = cell.paragraphs[0].runs[0].font\n", 1764 | " cell_font.size = Pt(11)\n", 1765 | " cell_font.bold = True\n", 1766 | " unique_per = (data_qlt_df[\"unique_values_count\"][ind] / data_qlt_df[\"count\"][ind]) * 100\n", 1767 | " p = cell.paragraphs[0].add_run(F'{data_qlt_df[\"unique_values_count\"][ind] :,.0f}' + \" \" + F'({unique_per :,.2f}%)' )\n", 1768 | " cell_font2 = cell.paragraphs[0].runs[1].font\n", 1769 | " cell_font2.size = Pt(11)\n", 1770 | " cell_font2.bold = False\n", 1771 | "\n", 1772 | " # Cell 2,1: Count of non-null values in the column\n", 1773 | " cell = table.cell(2, 3)\n", 
1774 | " cell.text = 'Non-Null Values Count : \\n'\n", 1775 | " cell_font = cell.paragraphs[0].runs[0].font\n", 1776 | " cell_font.size = Pt(11)\n", 1777 | " cell_font.bold = True\n", 1778 | " p = cell.paragraphs[0].add_run(F'{data_qlt_df[\"non_null_values\"][ind] :,.0f}' + \" \" + F' ({data_qlt_df[\"%_of_non_nulls\"][ind] :,.2f}%)' )\n", 1779 | " cell_font2 = cell.paragraphs[0].runs[1].font\n", 1780 | " cell_font2.size = Pt(11)\n", 1781 | " cell_font2.bold = False \n", 1782 | "\n", 1783 | " # Cell 2,2: Count of null values in the column\n", 1784 | " cell = table.cell(2, 5)\n", 1785 | " cell.text = 'Null Values Count : \\n'\n", 1786 | " cell_font = cell.paragraphs[0].runs[0].font\n", 1787 | " cell_font.size = Pt(11)\n", 1788 | " cell_font.bold = True\n", 1789 | " p = cell.paragraphs[0].add_run(F'{data_qlt_df[\"null_values\"][ind] :,.0f}' + \" \" + F' ({data_qlt_df[\"%_of_nulls\"][ind] :,.2f}%)' )\n", 1790 | " cell_font2 = cell.paragraphs[0].runs[1].font\n", 1791 | " cell_font2.size = Pt(11)\n", 1792 | " cell_font2.bold = False\n", 1793 | "\n", 1794 | " # Cell 3,0: Min of values in the column\n", 1795 | " cell = table.cell(3, 1)\n", 1796 | " cell.text = 'Min : \\n'\n", 1797 | " cell_font = cell.paragraphs[0].runs[0].font\n", 1798 | " cell_font.size = Pt(11)\n", 1799 | " cell_font.bold = True\n", 1800 | " p = cell.paragraphs[0].add_run(F'{data_qlt_df[\"min\"][ind] :,.2f}' )\n", 1801 | " cell_font2 = cell.paragraphs[0].runs[1].font\n", 1802 | " cell_font2.size = Pt(11)\n", 1803 | " cell_font2.bold = False\n", 1804 | "\n", 1805 | " # Cell 3,1: Mean of values in the column\n", 1806 | " cell = table.cell(3, 3)\n", 1807 | " cell.text = 'Mean : \\n'\n", 1808 | " cell_font = cell.paragraphs[0].runs[0].font\n", 1809 | " cell_font.size = Pt(11)\n", 1810 | " cell_font.bold = True\n", 1811 | " p = cell.paragraphs[0].add_run(F'{data_qlt_df[\"mean\"][ind] :,.2f}' )\n", 1812 | " cell_font2 = cell.paragraphs[0].runs[1].font\n", 1813 | " cell_font2.size = Pt(11)\n", 1814 | " cell_font2.bold = False\n", 1815 | "\n", 1816 | " # Cell 3,3: Max of values in the column\n", 1817 | " cell = table.cell(3, 5)\n", 1818 | " cell.text = 'Max : \\n'\n", 1819 | " cell_font = cell.paragraphs[0].runs[0].font\n", 1820 | " cell_font.size = Pt(11)\n", 1821 | " cell_font.bold = True\n", 1822 | " p = cell.paragraphs[0].add_run(F'{data_qlt_df[\"max\"][ind] :,.2f}' )\n", 1823 | " cell_font2 = cell.paragraphs[0].runs[1].font\n", 1824 | " cell_font2.size = Pt(11)\n", 1825 | " cell_font2.bold = False\n", 1826 | "\n", 1827 | " # Cell 4,1: 25th Percentile of values in the column\n", 1828 | " cell = table.cell(4, 1)\n", 1829 | " cell.text = '25th Percentile : \\n'\n", 1830 | " cell_font = cell.paragraphs[0].runs[0].font\n", 1831 | " cell_font.size = Pt(11)\n", 1832 | " cell_font.bold = True\n", 1833 | " p = cell.paragraphs[0].add_run(F'{data_qlt_df[\"25%\"][ind] :,.2f}' )\n", 1834 | " cell_font2 = cell.paragraphs[0].runs[1].font\n", 1835 | " cell_font2.size = Pt(11)\n", 1836 | " cell_font2.bold = False\n", 1837 | "\n", 1838 | " # Cell 4,2: 50th Percentile of values in the column\n", 1839 | " cell = table.cell(4, 3)\n", 1840 | " cell.text = '50th Percentile : \\n'\n", 1841 | " cell_font = cell.paragraphs[0].runs[0].font\n", 1842 | " cell_font.size = Pt(11)\n", 1843 | " cell_font.bold = True\n", 1844 | " p = cell.paragraphs[0].add_run(F'{data_qlt_df[\"50%\"][ind] :,.2f}' )\n", 1845 | " cell_font2 = cell.paragraphs[0].runs[1].font\n", 1846 | " cell_font2.size = Pt(11)\n", 1847 | " cell_font2.bold = False\n", 1848 | "\n", 1849 | " # Cell 
4,3: 75th Percentile of values in the column\n", 1850 | " cell = table.cell(4, 5)\n", 1851 | " cell.text = '75th Percentile : \\n'\n", 1852 | " cell_font = cell.paragraphs[0].runs[0].font\n", 1853 | " cell_font.size = Pt(11)\n", 1854 | " cell_font.bold = True\n", 1855 | " p = cell.paragraphs[0].add_run(F'{data_qlt_df[\"75%\"][ind] :,.2f}' )\n", 1856 | " cell_font2 = cell.paragraphs[0].runs[1].font\n", 1857 | " cell_font2.size = Pt(11)\n", 1858 | " cell_font2.bold = False\n", 1859 | "\n", 1860 | " # Cell 5,1: Memory used by the column values\n", 1861 | " cell = table.cell(5, 1)\n", 1862 | " cell.text = 'Column Memory : \\n'\n", 1863 | " cell_font = cell.paragraphs[0].runs[0].font\n", 1864 | " cell_font.size = Pt(11)\n", 1865 | " cell_font.bold = True\n", 1866 | " p = cell.paragraphs[0].add_run(F'{data_qlt_df[\"col_memory\"][ind] :,.2} MB' )\n", 1867 | " cell_font2 = cell.paragraphs[0].runs[1].font\n", 1868 | " cell_font2.size = Pt(11)\n", 1869 | " cell_font2.bold = False\n", 1870 | "\n", 1871 | " # Cell 5,2: Memory used by the column values vs. memory used by the data type\n", 1872 | " cell = table.cell(5, 3)\n", 1873 | " cell.text = 'As % of Dtype Memory : \\n'\n", 1874 | " cell_font = cell.paragraphs[0].runs[0].font\n", 1875 | " cell_font.size = Pt(11)\n", 1876 | " cell_font.bold = True\n", 1877 | " p = cell.paragraphs[0].add_run(F'{data_qlt_df[\"%_of_dtype_mem\"][ind] :.2f}%' )\n", 1878 | " cell_font2 = cell.paragraphs[0].runs[1].font\n", 1879 | " cell_font2.size = Pt(11)\n", 1880 | " cell_font2.bold = False \n", 1881 | "\n", 1882 | " # Cell 5,3: Memory used by the column values vs. memory used by the data type\n", 1883 | " cell = table.cell(5, 5)\n", 1884 | " cell.text = 'As % of DF Memory : \\n'\n", 1885 | " cell_font = cell.paragraphs[0].runs[0].font\n", 1886 | " cell_font.size = Pt(11)\n", 1887 | " cell_font.bold = True\n", 1888 | " p = cell.paragraphs[0].add_run(F'{data_qlt_df[\"%_of_total_memory\"][ind] :.2f}%' )\n", 1889 | " cell_font2 = cell.paragraphs[0].runs[1].font\n", 1890 | " cell_font2.size = Pt(11)\n", 1891 | " cell_font2.bold = False\n", 1892 | "\n", 1893 | " p = document.add_paragraph(' ')\n", 1894 | " p = document.add_paragraph(' ')\n", 1895 | "\n", 1896 | " fig_name = 'fig_' + data_qlt_df['column_name'][ind] + '.png'\n", 1897 | " document.add_picture(fig_name, height=Inches(3.5), width=Inches(6))\n" 1898 | ] 1899 | }, 1900 | { 1901 | "cell_type": "code", 1902 | "execution_count": 44, 1903 | "metadata": {}, 1904 | "outputs": [], 1905 | "source": [ 1906 | "# save the doc\n", 1907 | "document.save('data_profile_df_MS_WORD.docx')" 1908 | ] 1909 | }, 1910 | { 1911 | "cell_type": "code", 1912 | "execution_count": 45, 1913 | "metadata": {}, 1914 | "outputs": [ 1915 | { 1916 | "name": "stdout", 1917 | "output_type": "stream", 1918 | "text": [ 1919 | "Document generated!\n" 1920 | ] 1921 | } 1922 | ], 1923 | "source": [ 1924 | "print(\"Document generated!\")" 1925 | ] 1926 | }, 1927 | { 1928 | "cell_type": "code", 1929 | "execution_count": null, 1930 | "metadata": {}, 1931 | "outputs": [], 1932 | "source": [] 1933 | } 1934 | ], 1935 | "metadata": { 1936 | "kernelspec": { 1937 | "display_name": "Python 3", 1938 | "language": "python", 1939 | "name": "python3" 1940 | }, 1941 | "language_info": { 1942 | "codemirror_mode": { 1943 | "name": "ipython", 1944 | "version": 3 1945 | }, 1946 | "file_extension": ".py", 1947 | "mimetype": "text/x-python", 1948 | "name": "python", 1949 | "nbconvert_exporter": "python", 1950 | "pygments_lexer": "ipython3", 1951 | "version": "3.7.1" 1952 | } 
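The four table-writing cells above (pages 6 to 9 of the generated document) repeat the same header and body loops, differing only in how many leading label columns stay left-aligned. They could be factored into a single helper; a sketch only, assuming the notebook's python-docx imports (`add_df_table` and `_style_cell` are names invented here):

```python
from docx.shared import Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH

def _style_cell(cell, text, bold, right_align):
    # Setting .text creates a single run; style that run and align the paragraph.
    cell.text = text
    font = cell.paragraphs[0].runs[0].font
    font.size = Pt(11)
    font.bold = bold
    if right_align:
        cell.paragraphs[0].alignment = WD_ALIGN_PARAGRAPH.RIGHT

def add_df_table(document, data, n_label_cols=1, style='Medium Shading 1 Accent 3'):
    """Write a DataFrame as a Word table: bold header, labels left, numbers right (sketch)."""
    table = document.add_table(data.shape[0] + 1, data.shape[1], style=style)
    for j, col in enumerate(data.columns):
        _style_cell(table.cell(0, j), f'{col}', bold=True, right_align=j >= n_label_cols)
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            val = data.values[i, j]
            text = f'{val}' if j < n_label_cols else f'{val:,.2f}'
            _style_cell(table.cell(i + 1, j), text, bold=False, right_align=j >= n_label_cols)
    return table
```

Pages 6 to 8 would call it with `n_label_cols=1` and page 9 with `n_label_cols=2`.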
1953 | }, 1954 | "nbformat": 4, 1955 | "nbformat_minor": 2 1956 | } 1957 | -------------------------------------------------------------------------------- /02_av_tidy_data_with_python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tidy Data with Python\n", 8 | "It is good practice to display the Python and library versions, so it is clear which versions the code was developed against" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "metadata": {}, 15 | "outputs": [ 16 | { 17 | "name": "stdout", 18 | "output_type": "stream", 19 | "text": [ 20 | "3.7.1\n" 21 | ] 22 | } 23 | ], 24 | "source": [ 25 | "# Python version \n", 26 | "import platform\n", 27 | "print(platform.python_version())" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 2, 33 | "metadata": {}, 34 | "outputs": [ 35 | { 36 | "data": { 37 | "text/plain": [ 38 | "'0.23.4'" 39 | ] 40 | }, 41 | "execution_count": 2, 42 | "metadata": {}, 43 | "output_type": "execute_result" 44 | } 45 | ], 46 | "source": [ 47 | "# pandas version\n", 48 | "import pandas as pd\n", 49 | "pd.__version__" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "## Importing the data file\n", 57 | "In this example we read data from an Excel file, from the sheet named 'UntidySheet1'" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 3, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "# Read the Excel file data from sheet 'UntidySheet1'\n", 67 | "\n", 68 | "df = pd.read_excel(\n", 69 | "    'C:\\\\50_AnandSamples\\\\SAMPLE DATA\\\\Excel Sample Data\\\\UnTidayData.xlsx', \n", 70 | "    sheet_name='UntidySheet1'\n", 71 | ")" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 4, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "data": { 81 | "text/plain": [ 82 | "(3, 5)" 83 | ] 84 | }, 85 | "execution_count": 4, 86 | "metadata": {}, 87 | "output_type": "execute_result" 88 | } 89 | ], 90 | "source": [ 91 | "# Dataset info\n", 92 | "\n", 93 | "df.shape\n" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 5, 99 | "metadata": {}, 100 | "outputs": [ 101 | { 102 | "data": { 103 | "text/html": [ 104 | "
\n", 105 | "\n", 118 | "\n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | "
Store Name10000>2000020000-50000m20-40f20-40
2Store 3141689
0Store 12412
1Store 281056
\n", 156 | "
" 157 | ], 158 | "text/plain": [ 159 | " Store Name 10000>20000 20000-50000 m20-40 f20-40\n", 160 | "2 Store 3 14 16 8 9\n", 161 | "0 Store 1 2 4 1 2\n", 162 | "1 Store 2 8 10 5 6" 163 | ] 164 | }, 165 | "execution_count": 5, 166 | "metadata": {}, 167 | "output_type": "execute_result" 168 | } 169 | ], 170 | "source": [ 171 | "# Sample data\n", 172 | "\n", 173 | "df.sample(3)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": {}, 179 | "source": [ 180 | "# Tidy Data\n", 181 | "## By Hadley Wickham \n", 182 | " \n", 183 | "The paper \"Tidy Data\" by Hadley Wickham states:\n", 184 | "It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu and Johnson 2003). \n", 185 | "**Data tidying:** is structuring datasets to facilitate analysis. The principles of tidy data are closely tied to those of relational databases and Codd’s relational algebra (Codd 1990), but are framed in a language familiar to statisticians. \n", 186 | "**In tidy data:**\n", 187 | " 1. Each variable forms a column.\n", 188 | " 2. Each observation forms a row.\n", 189 | " 3. Each type of observational unit forms a table.\n", 190 | "\n", 191 | "This section describes the five most common problems with messy datasets, along with their remedies:\n", 192 | " 1. Column headers are values, not variable names. \n", 193 | " 2. Multiple variables are stored in one column. \n", 194 | " 3. Variables are stored in both rows and columns. \n", 195 | " 4. Multiple types of observational units are stored in the same table. \n", 196 | " 5. A single observational unit is stored in multiple tables.\n" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "## 1. Column headers are values, not variable names." 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "A common type of messy dataset is tabular data designed for presentation, where variables form both the rows and columns, and column headers are values, not variable names.\n", 211 | "\n", 212 | "In the below sample dataset, columns have customer income range values in the column header and respective count of customers as row values." 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": 6, 218 | "metadata": {}, 219 | "outputs": [ 220 | { 221 | "data": { 222 | "text/html": [ 223 | "
\n", 224 | "\n", 237 | "\n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | "
Store Name10000>2000020000-50000m20-40f20-40
0Store 12412
1Store 281056
\n", 267 | "
" 268 | ], 269 | "text/plain": [ 270 | " Store Name 10000>20000 20000-50000 m20-40 f20-40\n", 271 | "0 Store 1 2 4 1 2\n", 272 | "1 Store 2 8 10 5 6" 273 | ] 274 | }, 275 | "execution_count": 6, 276 | "metadata": {}, 277 | "output_type": "execute_result" 278 | } 279 | ], 280 | "source": [ 281 | "# Sample data\n", 282 | "df.sample(2)" 283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": {}, 288 | "source": [ 289 | "To tidy it, we need to **melt**, or **stack it**. In other words, we need to turn columns into rows. While this is often described as making wide datasets long or tall.\n", 290 | "The result of melting is a **molten dataset**. " 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 7, 296 | "metadata": {}, 297 | "outputs": [ 298 | { 299 | "data": { 300 | "text/html": [ 301 | "
\n", 302 | "\n", 315 | "\n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | "
Store Namem20-40f20-40Customer Income RangeCount of Customers
0Store 11210000>200002
1Store 25610000>200008
2Store 38910000>2000014
3Store 11220000-500004
4Store 25620000-5000010
5Store 38920000-5000016
\n", 377 | "
" 378 | ], 379 | "text/plain": [ 380 | " Store Name m20-40 f20-40 Customer Income Range Count of Customers\n", 381 | "0 Store 1 1 2 10000>20000 2\n", 382 | "1 Store 2 5 6 10000>20000 8\n", 383 | "2 Store 3 8 9 10000>20000 14\n", 384 | "3 Store 1 1 2 20000-50000 4\n", 385 | "4 Store 2 5 6 20000-50000 10\n", 386 | "5 Store 3 8 9 20000-50000 16" 387 | ] 388 | }, 389 | "execution_count": 7, 390 | "metadata": {}, 391 | "output_type": "execute_result" 392 | } 393 | ], 394 | "source": [ 395 | "df = df.melt(\n", 396 | " id_vars=['Store Name', 'm20-40', 'f20-40'], \n", 397 | " var_name='Customer Income Range', \n", 398 | " value_name='Count of Customers'\n", 399 | ")\n", 400 | "df" 401 | ] 402 | }, 403 | { 404 | "cell_type": "markdown", 405 | "metadata": {}, 406 | "source": [ 407 | "## 2. Multiple variables are stored in one column." 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": 8, 413 | "metadata": {}, 414 | "outputs": [ 415 | { 416 | "data": { 417 | "text/html": [ 418 | "
\n", 419 | "\n", 432 | "\n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | "
Store Namem20-40f20-40Customer Income RangeCount of Customers
3Store 11220000-500004
5Store 38920000-5000016
\n", 462 | "
" 463 | ], 464 | "text/plain": [ 465 | " Store Name m20-40 f20-40 Customer Income Range Count of Customers\n", 466 | "3 Store 1 1 2 20000-50000 4\n", 467 | "5 Store 3 8 9 20000-50000 16" 468 | ] 469 | }, 470 | "execution_count": 8, 471 | "metadata": {}, 472 | "output_type": "execute_result" 473 | } 474 | ], 475 | "source": [ 476 | "df.sample(2)" 477 | ] 478 | }, 479 | { 480 | "cell_type": "markdown", 481 | "metadata": {}, 482 | "source": [ 483 | "The dataset now have passed step 1, but still the data is messy as the demographic information is combined into a single column title. e.g. m20-40 means 'Male' of 'Age range between 20 to 40'.\n", 484 | "To tidy this we need to separate the gender information and age range into seperate columns.\n", 485 | "\n", 486 | "Cleaning and Tidying Data in Pandas - Daniel Chen\n", 487 | "https://www.youtube.com/watch?v=iYie42M1ZyU![image.png](attachment:image.png) 45:52 - 56:04" 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": 9, 493 | "metadata": { 494 | "scrolled": true 495 | }, 496 | "outputs": [ 497 | { 498 | "data": { 499 | "text/html": [ 500 | "
\n", 501 | "\n", 514 | "\n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | "
Store NameCustomer Income RangeCount of CustomersDemographicsCount of Demographics
0Store 110000>200002m20-401
1Store 210000>200008m20-405
2Store 310000>2000014m20-408
3Store 120000-500004m20-401
4Store 220000-5000010m20-405
5Store 320000-5000016m20-408
6Store 110000>200002f20-402
7Store 210000>200008f20-406
8Store 310000>2000014f20-409
9Store 120000-500004f20-402
10Store 220000-5000010f20-406
11Store 320000-5000016f20-409
\n", 624 | "
" 625 | ], 626 | "text/plain": [ 627 | " Store Name Customer Income Range Count of Customers Demographics \\\n", 628 | "0 Store 1 10000>20000 2 m20-40 \n", 629 | "1 Store 2 10000>20000 8 m20-40 \n", 630 | "2 Store 3 10000>20000 14 m20-40 \n", 631 | "3 Store 1 20000-50000 4 m20-40 \n", 632 | "4 Store 2 20000-50000 10 m20-40 \n", 633 | "5 Store 3 20000-50000 16 m20-40 \n", 634 | "6 Store 1 10000>20000 2 f20-40 \n", 635 | "7 Store 2 10000>20000 8 f20-40 \n", 636 | "8 Store 3 10000>20000 14 f20-40 \n", 637 | "9 Store 1 20000-50000 4 f20-40 \n", 638 | "10 Store 2 20000-50000 10 f20-40 \n", 639 | "11 Store 3 20000-50000 16 f20-40 \n", 640 | "\n", 641 | " Count of Demographics \n", 642 | "0 1 \n", 643 | "1 5 \n", 644 | "2 8 \n", 645 | "3 1 \n", 646 | "4 5 \n", 647 | "5 8 \n", 648 | "6 2 \n", 649 | "7 6 \n", 650 | "8 9 \n", 651 | "9 2 \n", 652 | "10 6 \n", 653 | "11 9 " 654 | ] 655 | }, 656 | "execution_count": 9, 657 | "metadata": {}, 658 | "output_type": "execute_result" 659 | } 660 | ], 661 | "source": [ 662 | "# Unpivot/Melt the demographic column into rows (wide to long)\n", 663 | "\n", 664 | "df = df.melt(\n", 665 | " id_vars = ['Store Name', 'Customer Income Range', 'Count of Customers'],\n", 666 | " var_name='Demographics', \n", 667 | " value_name='Count of Demographics'\n", 668 | ")\n", 669 | "df" 670 | ] 671 | }, 672 | { 673 | "cell_type": "code", 674 | "execution_count": 10, 675 | "metadata": {}, 676 | "outputs": [ 677 | { 678 | "data": { 679 | "text/plain": [ 680 | "(12, 5)" 681 | ] 682 | }, 683 | "execution_count": 10, 684 | "metadata": {}, 685 | "output_type": "execute_result" 686 | } 687 | ], 688 | "source": [ 689 | "df.shape" 690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "execution_count": 11, 695 | "metadata": {}, 696 | "outputs": [ 697 | { 698 | "data": { 699 | "text/html": [ 700 | "
\n", 701 | "\n", 714 | "\n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | " \n", 804 | " \n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | "
Store NameCustomer Income RangeCount of CustomersDemographicsCount of DemographicsGenderAge Range
0Store 110000>200002m20-401Male20-40
1Store 210000>200008m20-405Male20-40
2Store 310000>2000014m20-408Male20-40
3Store 120000-500004m20-401Male20-40
4Store 220000-5000010m20-405Male20-40
5Store 320000-5000016m20-408Male20-40
6Store 110000>200002f20-402Female20-40
7Store 210000>200008f20-406Female20-40
8Store 310000>2000014f20-409Female20-40
9Store 120000-500004f20-402Female20-40
10Store 220000-5000010f20-406Female20-40
11Store 320000-5000016f20-409Female20-40
\n", 850 | "
" 851 | ], 852 | "text/plain": [ 853 | " Store Name Customer Income Range Count of Customers Demographics \\\n", 854 | "0 Store 1 10000>20000 2 m20-40 \n", 855 | "1 Store 2 10000>20000 8 m20-40 \n", 856 | "2 Store 3 10000>20000 14 m20-40 \n", 857 | "3 Store 1 20000-50000 4 m20-40 \n", 858 | "4 Store 2 20000-50000 10 m20-40 \n", 859 | "5 Store 3 20000-50000 16 m20-40 \n", 860 | "6 Store 1 10000>20000 2 f20-40 \n", 861 | "7 Store 2 10000>20000 8 f20-40 \n", 862 | "8 Store 3 10000>20000 14 f20-40 \n", 863 | "9 Store 1 20000-50000 4 f20-40 \n", 864 | "10 Store 2 20000-50000 10 f20-40 \n", 865 | "11 Store 3 20000-50000 16 f20-40 \n", 866 | "\n", 867 | " Count of Demographics Gender Age Range \n", 868 | "0 1 Male 20-40 \n", 869 | "1 5 Male 20-40 \n", 870 | "2 8 Male 20-40 \n", 871 | "3 1 Male 20-40 \n", 872 | "4 5 Male 20-40 \n", 873 | "5 8 Male 20-40 \n", 874 | "6 2 Female 20-40 \n", 875 | "7 6 Female 20-40 \n", 876 | "8 9 Female 20-40 \n", 877 | "9 2 Female 20-40 \n", 878 | "10 6 Female 20-40 \n", 879 | "11 9 Female 20-40 " 880 | ] 881 | }, 882 | "execution_count": 11, 883 | "metadata": {}, 884 | "output_type": "execute_result" 885 | } 886 | ], 887 | "source": [ 888 | "# Use the .str to convert column into string and perform string functions\n", 889 | "# Gender can be extracted by extracting the first element/character of the string\n", 890 | "\n", 891 | "df['Gender'] = df['Demographics'].str[0]\n", 892 | "\n", 893 | "# Extract the age range by extracting the 2nd element/character onwards from the string\n", 894 | "\n", 895 | "df['Age Range'] = df['Demographics'].str[1:]\n", 896 | "\n", 897 | "# Replace 'm' and 'f' with Male and Femal\n", 898 | "\n", 899 | "df['Gender'] = df['Gender'].str.replace('m', 'Male')\n", 900 | "df['Gender'] = df['Gender'].str.replace('f', 'Female')\n", 901 | "\n", 902 | "df" 903 | ] 904 | }, 905 | { 906 | "cell_type": "markdown", 907 | "metadata": {}, 908 | "source": [ 909 | "# 3. Variables are stored in both rows and columns. " 910 | ] 911 | }, 912 | { 913 | "cell_type": "code", 914 | "execution_count": 12, 915 | "metadata": {}, 916 | "outputs": [ 917 | { 918 | "data": { 919 | "text/html": [ 920 | "
\n", 921 | "\n", 934 | "\n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | "
Store NameSale TypeJanFebMar
0Store 1Max Sale100200300
1Store 1Min Sale101112
2Store 2Max Sale102202302
3Store 2Min Sale131415
4Store 3Max Sale104204304
\n", 988 | "
" 989 | ], 990 | "text/plain": [ 991 | " Store Name Sale Type Jan Feb Mar\n", 992 | "0 Store 1 Max Sale 100 200 300\n", 993 | "1 Store 1 Min Sale 10 11 12\n", 994 | "2 Store 2 Max Sale 102 202 302\n", 995 | "3 Store 2 Min Sale 13 14 15\n", 996 | "4 Store 3 Max Sale 104 204 304" 997 | ] 998 | }, 999 | "execution_count": 12, 1000 | "metadata": {}, 1001 | "output_type": "execute_result" 1002 | } 1003 | ], 1004 | "source": [ 1005 | "# Load a new dataset for this example from same Excel file but different sheet\n", 1006 | "\n", 1007 | "df2 = pd.read_excel(\n", 1008 | " 'C:\\\\50_AnandSamples\\\\SAMPLE DATA\\\\Excel Sample Data\\\\UnTidayData.xlsx', \n", 1009 | " sheet_name='UntidySheet2'\n", 1010 | ")\n", 1011 | "df2.head()" 1012 | ] 1013 | }, 1014 | { 1015 | "cell_type": "markdown", 1016 | "metadata": {}, 1017 | "source": [ 1018 | "In this dataset there is two rows for each store, a row for Max Sale and a row for Min Sale. To tidy the data, We need to **Pivot** this column so that the rows moves to columns (long to wide).\n", 1019 | "Also months are in columns with sales as row values. To tidy these columns, we need Melt / UnPivot these columns to transform them to rows (wide to long)." 1020 | ] 1021 | }, 1022 | { 1023 | "cell_type": "code", 1024 | "execution_count": 13, 1025 | "metadata": {}, 1026 | "outputs": [ 1027 | { 1028 | "data": { 1029 | "text/html": [ 1030 | "
\n", 1031 | "\n", 1044 | "\n", 1045 | " \n", 1046 | " \n", 1047 | " \n", 1048 | " \n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " \n", 1064 | " \n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | " \n", 1072 | " \n", 1073 | " \n", 1074 | " \n", 1075 | " \n", 1076 | " \n", 1077 | " \n", 1078 | " \n", 1079 | " \n", 1080 | " \n", 1081 | " \n", 1082 | " \n", 1083 | " \n", 1084 | " \n", 1085 | " \n", 1086 | " \n", 1087 | " \n", 1088 | " \n", 1089 | " \n", 1090 | " \n", 1091 | " \n", 1092 | " \n", 1093 | " \n", 1094 | " \n", 1095 | " \n", 1096 | " \n", 1097 | " \n", 1098 | " \n", 1099 | " \n", 1100 | " \n", 1101 | " \n", 1102 | " \n", 1103 | " \n", 1104 | " \n", 1105 | " \n", 1106 | " \n", 1107 | " \n", 1108 | " \n", 1109 | " \n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " \n", 1160 | " \n", 1161 | " \n", 1162 | " \n", 1163 | " \n", 1164 | " \n", 1165 | " \n", 1166 | " \n", 1167 | " \n", 1168 | " \n", 1169 | " \n", 1170 | " \n", 1171 | " \n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " \n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | " \n", 1199 | " \n", 1200 | " \n", 1201 | " \n", 1202 | " \n", 1203 | " \n", 1204 | " \n", 1205 | " \n", 1206 | " \n", 1207 | " \n", 1208 | " \n", 1209 | " \n", 1210 | " \n", 1211 | " \n", 1212 | " \n", 1213 | " \n", 1214 | " \n", 1215 | " \n", 1216 | " \n", 1217 | " \n", 1218 | " \n", 1219 | " \n", 1220 | " \n", 1221 | " \n", 1222 | " \n", 1223 | " \n", 1224 | "
Store NameSale TypeMonthSales
0Store 1Max SaleJan100
1Store 1Min SaleJan10
2Store 2Max SaleJan102
3Store 2Min SaleJan13
4Store 3Max SaleJan104
5Store 3Min SaleJan16
6Store 4Max SaleJan106
7Store 4Min SaleJan19
8Store 1Max SaleFeb200
9Store 1Min SaleFeb11
10Store 2Max SaleFeb202
11Store 2Min SaleFeb14
12Store 3Max SaleFeb204
13Store 3Min SaleFeb17
14Store 4Max SaleFeb206
15Store 4Min SaleFeb20
16Store 1Max SaleMar300
17Store 1Min SaleMar12
18Store 2Max SaleMar302
19Store 2Min SaleMar15
20Store 3Max SaleMar304
21Store 3Min SaleMar18
22Store 4Max SaleMar306
23Store 4Min SaleMar21
\n", 1225 | "
" 1226 | ], 1227 | "text/plain": [ 1228 | " Store Name Sale Type Month Sales\n", 1229 | "0 Store 1 Max Sale Jan 100\n", 1230 | "1 Store 1 Min Sale Jan 10\n", 1231 | "2 Store 2 Max Sale Jan 102\n", 1232 | "3 Store 2 Min Sale Jan 13\n", 1233 | "4 Store 3 Max Sale Jan 104\n", 1234 | "5 Store 3 Min Sale Jan 16\n", 1235 | "6 Store 4 Max Sale Jan 106\n", 1236 | "7 Store 4 Min Sale Jan 19\n", 1237 | "8 Store 1 Max Sale Feb 200\n", 1238 | "9 Store 1 Min Sale Feb 11\n", 1239 | "10 Store 2 Max Sale Feb 202\n", 1240 | "11 Store 2 Min Sale Feb 14\n", 1241 | "12 Store 3 Max Sale Feb 204\n", 1242 | "13 Store 3 Min Sale Feb 17\n", 1243 | "14 Store 4 Max Sale Feb 206\n", 1244 | "15 Store 4 Min Sale Feb 20\n", 1245 | "16 Store 1 Max Sale Mar 300\n", 1246 | "17 Store 1 Min Sale Mar 12\n", 1247 | "18 Store 2 Max Sale Mar 302\n", 1248 | "19 Store 2 Min Sale Mar 15\n", 1249 | "20 Store 3 Max Sale Mar 304\n", 1250 | "21 Store 3 Min Sale Mar 18\n", 1251 | "22 Store 4 Max Sale Mar 306\n", 1252 | "23 Store 4 Min Sale Mar 21" 1253 | ] 1254 | }, 1255 | "execution_count": 13, 1256 | "metadata": {}, 1257 | "output_type": "execute_result" 1258 | } 1259 | ], 1260 | "source": [ 1261 | "# Melt the months\n", 1262 | "\n", 1263 | "df2 = df2.melt(\n", 1264 | " id_vars = ['Store Name', 'Sale Type'],\n", 1265 | " var_name = 'Month',\n", 1266 | " value_name = 'Sales'\n", 1267 | ")\n", 1268 | "df2" 1269 | ] 1270 | }, 1271 | { 1272 | "cell_type": "code", 1273 | "execution_count": 14, 1274 | "metadata": {}, 1275 | "outputs": [ 1276 | { 1277 | "data": { 1278 | "text/html": [ 1279 | "
\n", 1280 | "\n", 1293 | "\n", 1294 | " \n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | " \n", 1310 | " \n", 1311 | " \n", 1312 | " \n", 1313 | " \n", 1314 | " \n", 1315 | " \n", 1316 | " \n", 1317 | " \n", 1318 | " \n", 1319 | " \n", 1320 | " \n", 1321 | " \n", 1322 | " \n", 1323 | " \n", 1324 | " \n", 1325 | " \n", 1326 | " \n", 1327 | " \n", 1328 | " \n", 1329 | " \n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | " \n", 1336 | " \n", 1337 | " \n", 1338 | " \n", 1339 | " \n", 1340 | " \n", 1341 | " \n", 1342 | " \n", 1343 | " \n", 1344 | " \n", 1345 | " \n", 1346 | " \n", 1347 | " \n", 1348 | " \n", 1349 | " \n", 1350 | " \n", 1351 | " \n", 1352 | " \n", 1353 | " \n", 1354 | " \n", 1355 | " \n", 1356 | " \n", 1357 | " \n", 1358 | " \n", 1359 | " \n", 1360 | " \n", 1361 | " \n", 1362 | " \n", 1363 | " \n", 1364 | " \n", 1365 | " \n", 1366 | " \n", 1367 | " \n", 1368 | " \n", 1369 | " \n", 1370 | " \n", 1371 | " \n", 1372 | " \n", 1373 | " \n", 1374 | "
Sale TypeMax SaleMin Sale
Store NameMonth
Store 1Feb20011
Jan10010
Mar30012
Store 2Feb20214
Jan10213
Mar30215
Store 3Feb20417
Jan10416
Mar30418
Store 4Feb20620
Jan10619
Mar30621
\n", 1375 | "
" 1376 | ], 1377 | "text/plain": [ 1378 | "Sale Type Max Sale Min Sale\n", 1379 | "Store Name Month \n", 1380 | "Store 1 Feb 200 11\n", 1381 | " Jan 100 10\n", 1382 | " Mar 300 12\n", 1383 | "Store 2 Feb 202 14\n", 1384 | " Jan 102 13\n", 1385 | " Mar 302 15\n", 1386 | "Store 3 Feb 204 17\n", 1387 | " Jan 104 16\n", 1388 | " Mar 304 18\n", 1389 | "Store 4 Feb 206 20\n", 1390 | " Jan 106 19\n", 1391 | " Mar 306 21" 1392 | ] 1393 | }, 1394 | "execution_count": 14, 1395 | "metadata": {}, 1396 | "output_type": "execute_result" 1397 | } 1398 | ], 1399 | "source": [ 1400 | "# To pivot the Sale Type column use pivot_table rather than pivot function as pivot_table can handle duplicates\n", 1401 | "\n", 1402 | "df2 = df2.pivot_table(\n", 1403 | " index = ['Store Name', 'Month'],\n", 1404 | " columns = 'Sale Type',\n", 1405 | " values = 'Sales'\n", 1406 | ")\n", 1407 | "df2" 1408 | ] 1409 | }, 1410 | { 1411 | "cell_type": "code", 1412 | "execution_count": 15, 1413 | "metadata": {}, 1414 | "outputs": [ 1415 | { 1416 | "data": { 1417 | "text/html": [ 1418 | "
\n", 1419 | "\n", 1432 | "\n", 1433 | " \n", 1434 | " \n", 1435 | " \n", 1436 | " \n", 1437 | " \n", 1438 | " \n", 1439 | " \n", 1440 | " \n", 1441 | " \n", 1442 | " \n", 1443 | " \n", 1444 | " \n", 1445 | " \n", 1446 | " \n", 1447 | " \n", 1448 | " \n", 1449 | " \n", 1450 | " \n", 1451 | " \n", 1452 | " \n", 1453 | " \n", 1454 | " \n", 1455 | " \n", 1456 | " \n", 1457 | " \n", 1458 | " \n", 1459 | " \n", 1460 | " \n", 1461 | " \n", 1462 | " \n", 1463 | " \n", 1464 | " \n", 1465 | " \n", 1466 | " \n", 1467 | " \n", 1468 | " \n", 1469 | " \n", 1470 | " \n", 1471 | " \n", 1472 | " \n", 1473 | " \n", 1474 | " \n", 1475 | " \n", 1476 | " \n", 1477 | " \n", 1478 | " \n", 1479 | " \n", 1480 | " \n", 1481 | " \n", 1482 | " \n", 1483 | " \n", 1484 | " \n", 1485 | " \n", 1486 | " \n", 1487 | " \n", 1488 | " \n", 1489 | " \n", 1490 | " \n", 1491 | " \n", 1492 | " \n", 1493 | " \n", 1494 | " \n", 1495 | " \n", 1496 | " \n", 1497 | " \n", 1498 | " \n", 1499 | " \n", 1500 | " \n", 1501 | " \n", 1502 | " \n", 1503 | " \n", 1504 | " \n", 1505 | " \n", 1506 | " \n", 1507 | " \n", 1508 | " \n", 1509 | " \n", 1510 | " \n", 1511 | " \n", 1512 | " \n", 1513 | " \n", 1514 | " \n", 1515 | " \n", 1516 | " \n", 1517 | " \n", 1518 | " \n", 1519 | " \n", 1520 | " \n", 1521 | " \n", 1522 | " \n", 1523 | " \n", 1524 | " \n", 1525 | " \n", 1526 | " \n", 1527 | " \n", 1528 | "
Sale TypeStore NameMonthMax SaleMin Sale
0Store 1Feb20011
1Store 1Jan10010
2Store 1Mar30012
3Store 2Feb20214
4Store 2Jan10213
5Store 2Mar30215
6Store 3Feb20417
7Store 3Jan10416
8Store 3Mar30418
9Store 4Feb20620
10Store 4Jan10619
11Store 4Mar30621
\n", 1529 | "
" 1530 | ], 1531 | "text/plain": [ 1532 | "Sale Type Store Name Month Max Sale Min Sale\n", 1533 | "0 Store 1 Feb 200 11\n", 1534 | "1 Store 1 Jan 100 10\n", 1535 | "2 Store 1 Mar 300 12\n", 1536 | "3 Store 2 Feb 202 14\n", 1537 | "4 Store 2 Jan 102 13\n", 1538 | "5 Store 2 Mar 302 15\n", 1539 | "6 Store 3 Feb 204 17\n", 1540 | "7 Store 3 Jan 104 16\n", 1541 | "8 Store 3 Mar 304 18\n", 1542 | "9 Store 4 Feb 206 20\n", 1543 | "10 Store 4 Jan 106 19\n", 1544 | "11 Store 4 Mar 306 21" 1545 | ] 1546 | }, 1547 | "execution_count": 15, 1548 | "metadata": {}, 1549 | "output_type": "execute_result" 1550 | } 1551 | ], 1552 | "source": [ 1553 | "# The above has a hierarchical index\n", 1554 | "# To get a flat dataset we can use 'reset_index()'\n", 1555 | "\n", 1556 | "df2 = df2.reset_index()\n", 1557 | "\n", 1558 | "df2" 1559 | ] 1560 | }, 1561 | { 1562 | "cell_type": "markdown", 1563 | "metadata": {}, 1564 | "source": [ 1565 | "# 4. Multiple types of observational units are stored in the same table. \n", 1566 | "# 5. A single observational unit is stored in multiple tables." 1567 | ] 1568 | }, 1569 | { 1570 | "cell_type": "markdown", 1571 | "metadata": {}, 1572 | "source": [ 1573 | "Item 4 and 5 are to do with normalising the dataset for storage (in database) and then merging them back to create the flat data set.\n", 1574 | "This is not covered here but you can further data from Hadley Wickham's paper \"Tiday Data\"." 1575 | ] 1576 | } 1577 | ], 1578 | "metadata": { 1579 | "kernelspec": { 1580 | "display_name": "Python 3", 1581 | "language": "python", 1582 | "name": "python3" 1583 | }, 1584 | "language_info": { 1585 | "codemirror_mode": { 1586 | "name": "ipython", 1587 | "version": 3 1588 | }, 1589 | "file_extension": ".py", 1590 | "mimetype": "text/x-python", 1591 | "name": "python", 1592 | "nbconvert_exporter": "python", 1593 | "pygments_lexer": "ipython3", 1594 | "version": "3.7.1" 1595 | } 1596 | }, 1597 | "nbformat": 4, 1598 | "nbformat_minor": 2 1599 | } 1600 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Pandas Data Exploration Journey 2 | The Pandas Data Exploration Journey series will consist of following notebooks : 3 | 1. Data Profiling (01_16_av_automated_data_profiling_MS_WORD.ipynb) 4 | 2. Data Quality and Transformations Decisions 5 | 3. Tidy Data and Data Tranformations 6 | 4. Analytics & Visualisations 7 | 8 | ## Objective 9 | The main objective of this notebook is only to understand the data using Data Profiling. ***The output is a MS Woprd document which documents all the data profiling infoprmation and plots for the given data file.*** 10 | The code is largely kept generic so that it could be used with any shape of data. Any data quality or data tidying recommendations will be dealt in other notebooks. 11 | 12 | # Data Profile Dataframe (DPD) - The Game Changer 13 | The game changer for exploratory data analysis is the final ***Data Profile Dataframe*** that is generated which combines ***all*** the information required to inform data cleaning, tidy data and optimisations (memory and processing) decisions. 14 | Instead of using various Pandas commands at different instances and going back and forth to cross refer information, Data Profile Dataframe brings all information into a single dataframe. 
This will be very useful when reviewing the data profile with business subject matter experts or other team members, as all information related to the data profile is in a single, easy to understand format. 15 | 16 | ![image.png](SAMPLE_FULL_DPD_Image_MSWORD.PNG) 17 | 18 | 19 | Understanding the data is **the critical step** in preparing the data to be used for analytics. As many experts point out, data preparation and transforming the data into a tidy format take about 80% of the effort in any data analytics or data analysis project.<br>
20 | ***Understanding the data requires good understanding of the domain and/or access to a subject matter expert (SME) to help make decisions about data quality and data usage:*** 21 | * What are the columns and what do they mean? 22 | * How should each column and its possible values be interpreted? 23 | * Should the columns be renamed (and cleaned e.g. trim)? 24 | * Are there columns that may have similar information that could be dropped in favour of one master column? 25 | * Can columns with no values (or all empty) be dropped? 26 | * Can columns which have more than a certain threshold of blank values be dropped? 27 | * How can the missing values be filled, and can they be filled meaningfully? 28 | * Can rows that have missing values for certain columns or combinations of columns be dropped? i.e. the row is meaningless without those values. 29 | * Can the numeric data type columns be converted / downcast to optimise memory usage based on the data values? 30 | - or might future data sets contain outliers that prevent this? 31 | - can the min and max values be used to determine the lowest possible data type? 32 | * Can some string/object columns be converted to Category types? 33 | - based on count of unique values 34 | * Can any columns be discarded that may not be required for analytics? 35 | -------------------------------------------------------------------------------- /SAMPLE_FULL_DPD_Image_MSWORD.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AnalyticsInsightsNinja/Python_TidyData/ecca74d2633503d5755f6659dda8c2b4b30c759a/SAMPLE_FULL_DPD_Image_MSWORD.PNG -------------------------------------------------------------------------------- /data_profile_df_MS_WORD.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AnalyticsInsightsNinja/Python_TidyData/ecca74d2633503d5755f6659dda8c2b4b30c759a/data_profile_df_MS_WORD.docx --------------------------------------------------------------------------------
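The second notebook closes by naming, without demonstrating, tidy-data problems 4 and 5: multiple observational units stored in one table, and a single observational unit spread across multiple tables. A minimal sketch of both directions, using hypothetical store/sales tables invented for illustration:

```python
import pandas as pd

# A flat table mixing two observational units: the store attribute repeats on every sales row
flat = pd.DataFrame({
    'Store Name': ['Store 1', 'Store 1', 'Store 2'],
    'City':       ['City A',  'City A',  'City B'],   # hypothetical store attribute
    'Month':      ['Jan',     'Feb',     'Jan'],
    'Sales':      [100, 200, 102],
})

# Problem 4 remedy: normalise - one table per observational unit
stores = flat[['Store Name', 'City']].drop_duplicates()
sales = flat[['Store Name', 'Month', 'Sales']]

# Problem 5 remedy: merge the tables back into a single flat dataset for analysis
flat_again = sales.merge(stores, on='Store Name')
```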