├── README.md ├── Sales_Analysis_Code_Along.ipynb ├── Sales_Analysis_Code_Along_Solution.ipynb ├── data └── data.xlsx └── requirements.txt /README.md: -------------------------------------------------------------------------------- 1 | 2 | # Solve Real-World Data Science Tasks in Python | Data Analysis with Pandas & Plotly (Full Tutorial) 3 | 4 | In this tutorial, we are going to work through a real-world data science/analysis project with Python. 5 | 6 | We will be using the following Python libraries: 7 | - Pandas 8 | - Pandas Profiling Report 9 | - AutoViz 10 | - Plotly 11 | 12 | After loading the dataset, we will do some initial exploratory data analysis to get a feel for it. 13 | I am going to show you several useful pandas functions that you can apply to any kind of dataset you might deal with. 14 | 15 | There are also many libraries available nowadays that make exploratory data analysis much easier. I will show you my two favorites, which generate automated reports in just a few lines of code. 16 | Those reports are a great starting point before we move on to answering real-world business questions. 17 | 18 | While answering those questions, we will cover a wide range of pandas functions. Additionally, we will code our own Python helper function, which we will use in the deep-dive & visualization section. All the charts we create will be interactive and have a clean design. 19 | 20 | We will cover the following chart types: 21 | - Histogram 22 | - Box Plot 23 | - Bar Charts 24 | - Scatter Plot 25 | - Line Chart 26 | 27 | Feel free to code along with me. In the project files, you will also find an exercise Notebook that includes all the tasks we are going to solve. 
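To give a rough idea of the helper function mentioned above, here is a minimal sketch of a group-by helper. It is not the solution from the notebook: the function name `group_by_sum` and the tiny sample DataFrame are made up for illustration, and the column names `Segment`, `Sales`, and `Profit` are assumed from the task descriptions in the exercise Notebook.

```python
import pandas as pd


def group_by_sum(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return total Sales and Profit for each unique value of `column`.

    Hypothetical helper: assumes `df` has numeric 'Sales' and 'Profit' columns.
    """
    # as_index=False keeps the grouping column as a regular column,
    # which is convenient when passing the result to a plotting function.
    return df.groupby(column, as_index=False)[["Sales", "Profit"]].sum()


# Made-up sample data; the real project loads data/data.xlsx instead.
sample = pd.DataFrame(
    {
        "Segment": ["Consumer", "Corporate", "Consumer"],
        "Sales": [100.0, 250.0, 50.0],
        "Profit": [20.0, -10.0, 5.0],
    }
)

by_segment = group_by_sum(sample, "Segment")
print(by_segment)
```

Passing the column name as an argument means the same function can be reused for other groupings (for example by sub-category) without rewriting the aggregation each time.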
28 | 29 | ## Video Tutorial 30 | 31 | [![YouTube Video](https://img.youtube.com/vi/ZI9T2O7XYxY/0.jpg)](https://youtu.be/ZI9T2O7XYxY) 32 | 33 | ## Requirements 34 | ``` 35 | autoviz==0.0.81 36 | numpy==1.19.3 37 | openpyxl==3.0.5 38 | pandas==1.2.0 39 | pandas-profiling==2.9.0 40 | plotly==4.14.1 41 | plotly-express==0.4.1 42 | xlrd==2.0.1 43 | ``` 44 | 45 | 46 | ## 🤓 Check Out My Excel Add-ins 47 | I've developed some handy Excel add-ins that you might find useful: 48 | 49 | - 📊 **[Dashboard Add-in](https://pythonandvba.com/grafly)**: Easily create interactive and visually appealing dashboards. 50 | - 🎨 **[Cartoon Charts Add-In](https://pythonandvba.com/cuteplots)**: Create engaging and fun cartoon-style charts. 51 | - 🤪 **[Emoji Add-in](https://pythonandvba.com/emojify)**: Add a touch of fun to your spreadsheets with emojis. 52 | - 🛠️ **[MyToolBelt Add-in](https://pythonandvba.com/mytoolbelt)**: A versatile toolbelt for Excel, featuring: 53 | - Creation of Pandas DataFrames and Jupyter Notebooks from Excel ranges 54 | - ChatGPT integration for advanced data analysis 55 | - And much more! 56 | 57 | 58 | 59 | ## 🤝 Connect with Me 60 | - 📺 **YouTube:** [CodingIsFun](https://youtube.com/c/CodingIsFun) 61 | - 🌐 **Website:** [PythonAndVBA](https://pythonandvba.com) 62 | - 💬 **Discord:** [Join the Community](https://pythonandvba.com/discord) 63 | - 💼 **LinkedIn:** [Sven Bosau](https://www.linkedin.com/in/sven-bosau/) 64 | - 📸 **Instagram:** [sven_bosau](https://www.instagram.com/sven_bosau/) 65 | 66 | ## ☕ Support 67 | If you appreciate the project and wish to encourage its continued development, consider [supporting my work](https://pythonandvba.com/coffee-donation). 68 | [![ko-fi](https://ko-fi.com/img/githubbutton_sm.svg)](https://pythonandvba.com/coffee-donation) 69 | 70 | ## Feedback & Collaboration 71 | For feedback, suggestions, or potential collaboration opportunities, reach out at contact@pythonandvba.com. 
72 | ![Logo](https://www.pythonandvba.com/banner-img) 73 | 74 | -------------------------------------------------------------------------------- /Sales_Analysis_Code_Along.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "attachments": { 5 | "image.png": { 6 | "image/png": "" 7 | } 8 | }, 9 | "cell_type": "markdown", 10 | "metadata": {}, 11 | "source": [ 12 | "![image.png](attachment:image.png)" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "##### SCENARIO:\n", 20 | "\n", 21 | "> __You are working as a Data Analyst/Scientist in the 'Coding Is Fun Corp.' The CEO wants you to have a look at the commercial data for this year & to present your findings.__ 👩‍💻\n", 22 | "___" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "# Import Libraries & Load Dataset" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "### Imports" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "#### Imports" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 1, 49 | "metadata": {}, 50 | "outputs": [], 51 | "source": [ 52 | "# Version:\n", 53 | "# --Python 3.8.5-- \n", 54 | "# autoviz==0.0.81\n", 55 | "# numpy==1.19.3\n", 56 | "# openpyxl==3.0.5\n", 57 | "# pandas==1.2.0\n", 58 | "# pandas-profiling==2.9.0\n", 59 | "# plotly==4.14.1\n", 60 | "# plotly-express==0.4.1\n", 61 | "# xlrd==2.0.1\n", 62 | "\n", 63 | "# Imports:\n" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "#### Plotly Template Settings" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 2, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "# -- Settings Plotly template\n", 80 | "# Reference Link:\n", 81 | "# https://plotly.com/python/templates/\n", 82 | "# Try other themes: 
'plotly_dark', 'plotly_white', 'ggplot2', 'seaborn', 'simple_white'\n" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "### Load DataFrame" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "**Load DataFrame and store it in a variable called \"df\"**" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "**Inspect first 5 rows of the DataFrame**" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": { 117 | "scrolled": true 118 | }, 119 | "outputs": [], 120 | "source": [] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "# Explore Dataset" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "## Traditionally" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 3, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "# Basic Info about DataFrame\n" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": 4, 148 | "metadata": { 149 | "scrolled": true 150 | }, 151 | "outputs": [], 152 | "source": [ 153 | "# Describe Method\n" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 5, 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "# Get a view of unique values in column, e.g. 
'Ship Mode'\n" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 6, 168 | "metadata": {}, 169 | "outputs": [], 170 | "source": [ 171 | "# NaN count for each column\n" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "## Automated Reports" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "#### Pandas Profiling Report" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": 8, 191 | "metadata": {}, 192 | "outputs": [], 193 | "source": [ 194 | "# Generate Pandas Profiling Report\n", 195 | "\n", 196 | "\n", 197 | "# View in Notebook\n" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 9, 203 | "metadata": {}, 204 | "outputs": [], 205 | "source": [ 206 | "# Export Pandas Profiling Report to HTML\n" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "#### AutoViz Report" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "# Data Preparation & Analysis" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "### 🚩 TASKS:\n", 235 | "- What was the highest Sale in 2020?\n", 236 | "- What is the average discount rate of chairs?\n", 237 | "- Add extra columns to separate Year & Month from the Order Date\n", 238 | "- Add a new column to calculate the Profit Margin for each sales record\n", 239 | "- Export manipulated dataframe to Excel\n", 240 | "- Create a new dataframe to reflect total Profit & Sales by Sub-Category\n", 241 | "- Develop a function to return a dataframe which is grouped by a particular column (as an input)" 242 | ] 243 | }, 244 | { 245 | "cell_type": "markdown", 246 | "metadata": {}, 247 | "source": [ 
248 | "**What was the highest Sale?**" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 10, 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "# Highest Sale\n" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "**What is the average Discount of chairs?**" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 12, 270 | "metadata": {}, 271 | "outputs": [], 272 | "source": [ 273 | "# Create Boolean mask\n", 274 | "\n", 275 | "\n", 276 | "# Use Boolean mask to filter dataframe\n" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "**Add an extra column for \"Order Month\" & \"Order Year\"**" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": null, 289 | "metadata": {}, 290 | "outputs": [], 291 | "source": [] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "**Add a new column to calculate the Profit Margin for each sales record**" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "metadata": {}, 304 | "outputs": [], 305 | "source": [] 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "metadata": {}, 310 | "source": [ 311 | "**Export manipulated dataframe back to Excel**" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": null, 317 | "metadata": {}, 318 | "outputs": [], 319 | "source": [] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "metadata": {}, 324 | "source": [ 325 | "#### Total Profit & Sales by Sub-Category" 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": 14, 331 | "metadata": {}, 332 | "outputs": [], 333 | "source": [ 334 | "# Group By Sub-Category [SUM]\n", 335 | "\n", 336 | "# Reset Index\n" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": {}, 342 | "source": [ 343 | "#### Develop a function to 
return a dataframe which is grouped by a particular column (as an input)" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": 16, 349 | "metadata": {}, 350 | "outputs": [], 351 | "source": [ 352 | "# Groupby as a function\n", 353 | "\n", 354 | " \n", 355 | "# Group DataFrame by Segment\n" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "# Further Deep Dive & Visualization" 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": {}, 368 | "source": [ 369 | "### 🚩 Objective: \n", 370 | "- Further Analysis/Deep Dive using various kinds of Charts\n", 371 | "- Prepare/Refactor Dataframe for different Chart Types\n", 372 | "- Generate & Export 'Ready-To-Present' Charts: Clean & Interactive\n", 373 | "-----\n", 374 | "#### 📊 Chart Types:\n", 375 | "- [x] Histogram\n", 376 | "- [x] Boxplot\n", 377 | "- [x] Various Barplots\n", 378 | "- [x] Scatterplot\n", 379 | "- [x] Linechart" 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": {}, 385 | "source": [ 386 | "**Distribution Sales [Histogram]**" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": 17, 392 | "metadata": {}, 393 | "outputs": [], 394 | "source": [ 395 | "# Quick Stats Overview for Sales\n" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": 18, 401 | "metadata": {}, 402 | "outputs": [], 403 | "source": [ 404 | "# Create Chart\n", 405 | "\n", 406 | "# Plot Chart\n" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "**Show the distribution and skewness of Sales [Boxplot]**" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": 19, 419 | "metadata": {}, 420 | "outputs": [], 421 | "source": [ 422 | "# Create Chart\n", 423 | "\n", 424 | "# Plot Chart\n" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "**Plot Sales by 
Sub-Category**" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": 20, 437 | "metadata": {}, 438 | "outputs": [], 439 | "source": [ 440 | "# Create Dataframe\n" 441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": 22, 446 | "metadata": {}, 447 | "outputs": [], 448 | "source": [ 449 | "# Create Chart\n", 450 | "\n", 451 | "\n", 452 | "# Display Plot\n", 453 | "\n", 454 | "\n", 455 | "# Export Chart to HTML\n" 456 | ] 457 | }, 458 | { 459 | "cell_type": "markdown", 460 | "metadata": {}, 461 | "source": [ 462 | "**Plot Profit by Sub-Category**" 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": 23, 468 | "metadata": {}, 469 | "outputs": [], 470 | "source": [ 471 | "# Create Chart\n", 472 | "\n", 473 | "\n", 474 | "# Display Plot\n", 475 | "\n", 476 | "\n", 477 | "# Export Chart to HTML\n" 478 | ] 479 | }, 480 | { 481 | "cell_type": "markdown", 482 | "metadata": {}, 483 | "source": [ 484 | "**Plot Sales & Profit by Sub-Category**" 485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": 25, 490 | "metadata": {}, 491 | "outputs": [], 492 | "source": [ 493 | "# Create Chart\n", 494 | "\n", 495 | "\n", 496 | "# Display Plot\n", 497 | "\n", 498 | "\n", 499 | "# Export Chart to HTML\n" 500 | ] 501 | }, 502 | { 503 | "cell_type": "markdown", 504 | "metadata": {}, 505 | "source": [ 506 | "#### Inspect Negative Profit of Tables" 507 | ] 508 | }, 509 | { 510 | "cell_type": "markdown", 511 | "metadata": {}, 512 | "source": [ 513 | "Is there any linear correlation between Sales/Profit & Discount? 
[Scatterplot]" 514 | ] 515 | }, 516 | { 517 | "cell_type": "code", 518 | "execution_count": 26, 519 | "metadata": {}, 520 | "outputs": [], 521 | "source": [ 522 | "# Create Chart\n", 523 | "\n", 524 | "\n", 525 | "# Display Plot\n", 526 | "\n", 527 | "\n", 528 | "# Export Chart to HTML\n" 529 | ] 530 | }, 531 | { 532 | "cell_type": "markdown", 533 | "metadata": {}, 534 | "source": [ 535 | "**Check mean Discount by Sub-Category**" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": 28, 541 | "metadata": {}, 542 | "outputs": [], 543 | "source": [ 544 | "# Create new dataframe: Group by 'Sub-Category' and aggregate the mean of 'Discount'\n", 545 | "\n", 546 | "\n", 547 | "# Display first 5 rows of new dataframe\n" 548 | ] 549 | }, 550 | { 551 | "cell_type": "markdown", 552 | "metadata": {}, 553 | "source": [ 554 | "**Plot Mean Discount by Sub-Category**" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": 29, 560 | "metadata": {}, 561 | "outputs": [], 562 | "source": [ 563 | "# Create Chart\n", 564 | "\n", 565 | "\n", 566 | "# Display Plot\n", 567 | "\n", 568 | "\n", 569 | "# Export Chart to HTML\n" 570 | ] 571 | }, 572 | { 573 | "cell_type": "markdown", 574 | "metadata": {}, 575 | "source": [ 576 | "**Plot Sales & Profit Development for the year 2020**" 577 | ] 578 | }, 579 | { 580 | "cell_type": "code", 581 | "execution_count": 31, 582 | "metadata": {}, 583 | "outputs": [], 584 | "source": [ 585 | "# Sort Values by Order Date\n", 586 | "\n", 587 | "# Add cumulative Sales & Profit\n", 588 | "\n", 589 | "# Print tail & head of sorted dataframe\n" 590 | ] 591 | }, 592 | { 593 | "cell_type": "code", 594 | "execution_count": 32, 595 | "metadata": {}, 596 | "outputs": [], 597 | "source": [ 598 | "# Create Chart\n", 599 | "\n", 600 | "# Display Plot\n", 601 | "\n", 602 | "# Export Chart to HTML\n" 603 | ] 604 | } 605 | ], 606 | "metadata": { 607 | "kernelspec": { 608 | "display_name": "Python 3", 609 | "language": "python", 
610 | "name": "python3" 611 | }, 612 | "language_info": { 613 | "codemirror_mode": { 614 | "name": "ipython", 615 | "version": 3 616 | }, 617 | "file_extension": ".py", 618 | "mimetype": "text/x-python", 619 | "name": "python", 620 | "nbconvert_exporter": "python", 621 | "pygments_lexer": "ipython3", 622 | "version": "3.8.5" 623 | }, 624 | "toc": { 625 | "base_numbering": 1, 626 | "nav_menu": {}, 627 | "number_sections": false, 628 | "sideBar": true, 629 | "skip_h1_title": false, 630 | "title_cell": "Table of Contents", 631 | "title_sidebar": "Contents", 632 | "toc_cell": false, 633 | "toc_position": {}, 634 | "toc_section_display": true, 635 | "toc_window_display": false 636 | } 637 | }, 638 | "nbformat": 4, 639 | "nbformat_minor": 4 640 | } 641 | -------------------------------------------------------------------------------- /data/data.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sven-Bo/data-analysis-python/aab2589e354bd475a1c342711b873010050cef99/data/data.xlsx -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | autoviz==0.0.81 2 | numpy==1.19.3 3 | openpyxl==3.0.5 4 | pandas==1.2.0 5 | pandas-profiling==2.9.0 6 | plotly==4.14.1 7 | plotly-express==0.4.1 8 | xlrd==2.0.1 9 | --------------------------------------------------------------------------------