├── .python-version ├── docs ├── style.css ├── methods │ ├── ratio.md │ ├── funnel.md │ ├── total.md │ ├── ratio.ipynb │ ├── total.ipynb │ └── funnel.ipynb ├── installation.md ├── index.md ├── index.ipynb ├── theme │ ├── README.md │ ├── LICENSE │ └── main.html └── examples │ ├── ibis.md │ ├── simple-revenue-funnel.md │ ├── fashion-brand-co2e.md │ ├── simple-revenue-funnel.ipynb │ └── iowa-whiskey-sales.md ├── .gitattributes ├── icanexplain ├── datasets │ ├── iowa_whiskey_sales.csv.zip │ ├── product_footprints.csv.gz │ └── us_general_election_popular_vote.csv.zip ├── datasets.py ├── test_sum.py ├── test_mean.py └── __init__.py ├── .gitignore ├── CONTRIBUTING.md ├── .github ├── workflows │ ├── code-quality.yml │ └── unit-tests.yml └── actions │ └── install-env │ └── action.yml ├── Makefile ├── pre-commit-hooks └── check_pinned_actions.sh ├── mkdocs.yml ├── pyproject.toml ├── .pre-commit-config.yaml ├── README.md └── LICENSE /.python-version: -------------------------------------------------------------------------------- 1 | 3.11.8 2 | -------------------------------------------------------------------------------- /docs/style.css: -------------------------------------------------------------------------------- 1 | h1 { 2 | color: red; 3 | } 4 | -------------------------------------------------------------------------------- /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /docs/methods/ratio.md: -------------------------------------------------------------------------------- 1 | # Ratio decomposition 2 | 3 | 4 | ```python 5 | 6 | ``` 7 | -------------------------------------------------------------------------------- /docs/installation.md: -------------------------------------------------------------------------------- 1 | # Installation 2 | 3 | ```sh 4 | pip 
install icanexplain 5 | ``` 6 | -------------------------------------------------------------------------------- /docs/methods/funnel.md: -------------------------------------------------------------------------------- 1 | # Funnel decomposition 2 | 3 | 4 | ```python 5 | # Funnel 6 | ``` 7 | -------------------------------------------------------------------------------- /docs/methods/total.md: -------------------------------------------------------------------------------- 1 | # Total decomposition 2 | 3 | $$ 4 | \sum_{i=1}^n \frac{1}{2} 5 | $$ 6 | 7 | 8 | ```python 9 | 10 | ``` 11 | -------------------------------------------------------------------------------- /icanexplain/datasets/iowa_whiskey_sales.csv.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/carbonfact/icanexplain/HEAD/icanexplain/datasets/iowa_whiskey_sales.csv.zip -------------------------------------------------------------------------------- /icanexplain/datasets/product_footprints.csv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/carbonfact/icanexplain/HEAD/icanexplain/datasets/product_footprints.csv.gz -------------------------------------------------------------------------------- /icanexplain/datasets/us_general_election_popular_vote.csv.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/carbonfact/icanexplain/HEAD/icanexplain/datasets/us_general_election_popular_vote.csv.zip -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | docs/public 2 | docs/resources 3 | docs/.hugo_build.lock 4 | *.pyc 5 | /*.ipynb 6 | .ipynb_checkpoints 7 | .DS_Store 8 | .pytest_cache 9 | *.ddb 10 | *.ddb.wal 11 | dist/ 12 | site/ 13 | 
-------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing 2 | 3 | ```sh 4 | # Prepare virtual environment 5 | git clone https://github.com/carbonfact/icanexplain 6 | cd icanexplain 7 | poetry install 8 | poetry shell 9 | 10 | # Install pre-commit hooks 11 | pre-commit install --hook-type pre-push 12 | pre-commit run --all-files 13 | 14 | # Run tests 15 | pytest 16 | 17 | # Serve docs locally 18 | make docs 19 | ``` 20 | -------------------------------------------------------------------------------- /.github/workflows/code-quality.yml: -------------------------------------------------------------------------------- 1 | name: Code quality 2 | 3 | on: 4 | pull_request: 5 | branches: 6 | - "*" 7 | push: 8 | branches: 9 | - main 10 | 11 | jobs: 12 | run: 13 | runs-on: ubuntu-latest 14 | steps: 15 | - uses: actions/checkout@f43a0e5ff2bd294095638e18286ca9a3d1956744 16 | - uses: ./.github/actions/install-env 17 | - name: Run pre-commit on all files 18 | run: poetry run pre-commit run --all-files 19 | -------------------------------------------------------------------------------- /.github/workflows/unit-tests.yml: -------------------------------------------------------------------------------- 1 | name: Unit tests 2 | 3 | on: 4 | pull_request: 5 | branches: 6 | - "*" 7 | push: 8 | branches: 9 | - main 10 | 11 | jobs: 12 | run: 13 | runs-on: ubuntu-latest 14 | strategy: 15 | matrix: 16 | python-version: ["3.10", "3.11", "3.12"] 17 | steps: 18 | - uses: actions/checkout@f43a0e5ff2bd294095638e18286ca9a3d1956744 19 | - uses: ./.github/actions/install-env 20 | with: 21 | python-version: ${{ matrix.python-version }} 22 | - run: poetry run pytest 23 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | 
execute-notebooks: 2 | poetry run jupyter nbconvert --execute --to notebook --inplace docs/*.ipynb 3 | poetry run jupyter nbconvert --execute --to notebook --inplace docs/examples/*.ipynb 4 | poetry run jupyter nbconvert --execute --to notebook --inplace docs/methods/*.ipynb 5 | 6 | render-notebooks: 7 | poetry run jupyter nbconvert --to markdown docs/*.ipynb 8 | poetry run jupyter nbconvert --to markdown docs/examples/*.ipynb 9 | poetry run jupyter nbconvert --to markdown docs/methods/*.ipynb 10 | 11 | docs: execute-notebooks render-notebooks 12 | poetry run mkdocs serve 13 | 14 | publish-docs: execute-notebooks render-notebooks 15 | poetry run mkdocs gh-deploy 16 | -------------------------------------------------------------------------------- /docs/methods/ratio.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Ratio decomposition" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [] 16 | } 17 | ], 18 | "metadata": { 19 | "language_info": { 20 | "codemirror_mode": { 21 | "name": "ipython", 22 | "version": 3 23 | }, 24 | "file_extension": ".py", 25 | "mimetype": "text/x-python", 26 | "name": "python", 27 | "nbconvert_exporter": "python", 28 | "pygments_lexer": "ipython3", 29 | "version": "3.11.4" 30 | } 31 | }, 32 | "nbformat": 4, 33 | "nbformat_minor": 2 34 | } 35 | -------------------------------------------------------------------------------- /pre-commit-hooks/check_pinned_actions.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Only run if .github/ files are staged 4 | staged_github_files=$(git diff --cached --name-only --diff-filter=ACM | grep '^\.github/') 5 | if [ -z "$staged_github_files" ]; then 6 | exit 0 7 | fi 8 | 9 | # Check for unpinned external GitHub Actions (not using 
commit SHA) 10 | offenders=$(echo "$staged_github_files" | grep -E '\.github/(workflows|actions)/' | 11 | xargs grep -E "uses:[[:space:]]*[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+@" | 12 | grep -v "\.github/actions" | 13 | grep -v -E "@[0-9a-f]{40}($|[^0-9a-f])") 14 | 15 | if [ -n "$offenders" ]; then 16 | echo "❌ Error: Detected external GitHub Actions that are not pinned to a commit SHA." >&2 17 | echo "Please update your workflows accordingly to prevent supply chain attacks!" >&2 18 | echo "Offending lines:" >&2 19 | echo "$offenders" >&2 20 | exit 1 21 | fi -------------------------------------------------------------------------------- /docs/methods/total.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Total decomposition" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "$$\n", 15 | "\\sum_{i=1}^n \\frac{1}{2}\n", 16 | "$$" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [] 25 | } 26 | ], 27 | "metadata": { 28 | "language_info": { 29 | "codemirror_mode": { 30 | "name": "ipython", 31 | "version": 3 32 | }, 33 | "file_extension": ".py", 34 | "mimetype": "text/x-python", 35 | "name": "python", 36 | "nbconvert_exporter": "python", 37 | "pygments_lexer": "ipython3", 38 | "version": "3.11.4" 39 | } 40 | }, 41 | "nbformat": 4, 42 | "nbformat_minor": 2 43 | } 44 | -------------------------------------------------------------------------------- /mkdocs.yml: -------------------------------------------------------------------------------- 1 | site_name: icanexplain 2 | repo_name: carbonfact/icanexplain 3 | repo_url: https://github.com/carbonfact/icanexplain 4 | 5 | nav: 6 | - Introduction: 7 | - index.md 8 | - installation.md 9 | - Examples: 10 | - examples/iowa-whiskey-sales.md # total 11 | - 
examples/fashion-brand-co2e.md # rate 12 | - examples/simple-revenue-funnel.md # funnel 13 | - examples/ibis.md 14 | theme: 15 | name: material 16 | font: 17 | text: Noto Sans Mono 18 | features: 19 | - navigation.tabs 20 | - tables 21 | - content.code.copy 22 | palette: 23 | primary: black 24 | icon: 25 | logo: material/chart-tree 26 | 27 | markdown_extensions: 28 | - pymdownx.highlight: 29 | anchor_linenums: true 30 | line_spans: __span 31 | pygments_lang_class: true 32 | - pymdownx.inlinehilite 33 | - pymdownx.snippets 34 | - pymdownx.superfences 35 | - toc: 36 | permalink: true 37 | permalink_title: null 38 | - pymdownx.arithmatex: 39 | generic: true 40 | -------------------------------------------------------------------------------- /docs/methods/funnel.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Funnel decomposition" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "execution": { 15 | "iopub.execute_input": "2024-09-25T08:40:25.341828Z", 16 | "iopub.status.busy": "2024-09-25T08:40:25.341418Z", 17 | "iopub.status.idle": "2024-09-25T08:40:25.360667Z", 18 | "shell.execute_reply": "2024-09-25T08:40:25.360217Z" 19 | } 20 | }, 21 | "outputs": [], 22 | "source": [ 23 | "# Funnel" 24 | ] 25 | } 26 | ], 27 | "metadata": { 28 | "language_info": { 29 | "codemirror_mode": { 30 | "name": "ipython", 31 | "version": 3 32 | }, 33 | "file_extension": ".py", 34 | "mimetype": "text/x-python", 35 | "name": "python", 36 | "nbconvert_exporter": "python", 37 | "pygments_lexer": "ipython3", 38 | "version": "3.11.4" 39 | } 40 | }, 41 | "nbformat": 4, 42 | "nbformat_minor": 2 43 | } 44 | -------------------------------------------------------------------------------- /icanexplain/datasets.py: -------------------------------------------------------------------------------- 1 | from __future__ 
import annotations 2 | 3 | import pathlib 4 | 5 | import pandas as pd 6 | 7 | DATASETS_DIR = pathlib.Path(__file__).parent / "datasets" 8 | 9 | 10 | def load_product_footprints(): 11 | return pd.read_csv(DATASETS_DIR / "product_footprints.csv.gz") 12 | 13 | 14 | def load_us_general_election_popular_vote(): 15 | return pd.read_csv(DATASETS_DIR / "us_general_election_popular_vote.csv.zip") 16 | 17 | 18 | def load_world_demography(): 19 | return pd.read_csv(DATASETS_DIR / "world_demography.csv") 20 | 21 | 22 | def load_iowa_whiskey_sales(): 23 | """Iowa whiskey sales. 24 | 25 | This dataset contains the sales of whiskey in the state of Iowa, USA. The data comes from 26 | Iowa's Open Data Portal. 27 | 28 | For the sake of example, the data is limited to 2012, 2016, and 2020. The data is also limited 29 | to a sample of 50,000 sales records. 30 | 31 | References 32 | ---------- 33 | [1] https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy/about_data 34 | 35 | """ 36 | return pd.read_csv(DATASETS_DIR / "iowa_whiskey_sales.csv.zip") 37 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "icanexplain" 3 | version = "0.3.0" 4 | description = "Explain why metrics change by unpacking them" 5 | authors = ["Max Halford "] 6 | readme = "README.md" 7 | 8 | [tool.poetry.dependencies] 9 | python = "^3.10" 10 | ibis-framework = "^9.5.0" 11 | altair = "^5.3.0" 12 | 13 | [tool.poetry.group.dev.dependencies] 14 | ruff = "^0.3.2" 15 | jupyter = "^1.0.0" 16 | pandas = "^2.2.1" 17 | pytest = "^8.2.1" 18 | names = "^0.3.0" 19 | ibis-framework = {extras = ["duckdb", "pandas"], version = "^9.5.0"} 20 | mkdocs = "^1.6.0" 21 | pygments = "^2.18.0" 22 | vega-datasets = "^0.9.0" 23 | mkdocs-material = "^9.5.26" 24 | polars = "^1.0.0" 25 | pre-commit = "^3.8.0" 26 | mypy = "^1.11.2" 27 | 28 | [tool.pytest.ini_options] 29 | 
addopts = [ 30 | "--doctest-modules", 31 | "--doctest-glob=README.md", 32 | "--doctest-glob=docs/api/*.md", 33 | "--verbose", 34 | "--color=yes", 35 | "--strict-markers", 36 | ] 37 | doctest_optionflags = "NORMALIZE_WHITESPACE NUMBER ELLIPSIS" 38 | 39 | [build-system] 40 | requires = ["poetry-core"] 41 | build-backend = "poetry.core.masonry.api" 42 | -------------------------------------------------------------------------------- /docs/index.md: -------------------------------------------------------------------------------- 1 | # Welcome 2 | 3 | Well met, fellow data analyst! 4 | 5 | If you're like me, then you're used to pesky stakeholders, who ask you why a metric changed. These kind of questions are tricky to answer confidently. It usually ends with you sharing a few other related metrics, giving some context, and providing a weak explanation. All the while hoping the stakeholder will be satisfied (or fed up) and go away 😮‍💨 6 | 7 | This isn't a good situation to be in. But what if you could tell *exactly* why a metric changed? Wouldn't that be great? 🤩 8 | 9 | `icanexplain` is a Python package. It provides a framework to break a metric down into drivers. It attributes the change in a metric to its drivers. Instead of just measuring the evolution of each driver, we can exactly quantify how much of the metric's evolution is due to each driver. 10 | 11 | The best way to understand how `icanexplain` works is to see it in action, by checking out the [examples](examples/iowa-whiskey-sales/). 12 | 13 | `icanexplain` works with [pandas](https://pandas.pydata.org/) and [Polars](https://pola.rs/) out of the box. Additionally, it can run against other backends (e.g. SQL) because it is implemented with [Ibis](https://ibis-project.org/). Check out [this example](examples/ibis/) for more information. 14 | 15 |
16 | 17 | 18 | -------------------------------------------------------------------------------- /.pre-commit-config.yaml: -------------------------------------------------------------------------------- 1 | files: icanexplain 2 | repos: 3 | - repo: https://github.com/pre-commit/pre-commit-hooks 4 | rev: v4.4.0 5 | hooks: 6 | - id: check-json 7 | - id: check-yaml 8 | 9 | - repo: https://github.com/astral-sh/ruff-pre-commit 10 | # Ruff version. 11 | rev: v0.5.7 12 | hooks: 13 | # Run the linter. 14 | - id: ruff 15 | types_or: [python, pyi, jupyter] 16 | args: [--fix] 17 | # Run the formatter. 18 | - id: ruff-format 19 | types_or: [python, pyi, jupyter] 20 | 21 | - repo: https://github.com/pre-commit/mirrors-mypy 22 | rev: "v1.1.1" 23 | hooks: 24 | - id: mypy 25 | args: 26 | - "--config-file=pyproject.toml" 27 | - "--python-version=3.11" 28 | additional_dependencies: 29 | - pandera[mypy] 30 | - types-python-slugify 31 | - types-paramiko 32 | - types-requests 33 | 34 | # strip output from jupyter notebooks 35 | - repo: https://github.com/kynan/nbstripout 36 | rev: 0.7.1 37 | hooks: 38 | - id: nbstripout 39 | 40 | - repo: local 41 | hooks: 42 | - id: check-external-actions-pinned 43 | name: Check GitHub Actions are pinned 44 | entry: pre-commit-hooks/check_pinned_actions.sh 45 | language: script 46 | pass_filenames: false 47 | -------------------------------------------------------------------------------- /.github/actions/install-env/action.yml: -------------------------------------------------------------------------------- 1 | name: Install Python env 2 | 3 | inputs: 4 | python-version: 5 | required: true 6 | description: "Python version" 7 | 8 | runs: 9 | using: "composite" 10 | steps: 11 | - name: Check out repository 12 | uses: actions/checkout@v3 13 | 14 | - name: Set up python 15 | id: setup-python 16 | uses: actions/setup-python@v5 17 | with: 18 | python-version: ${{ inputs.python-version }} 19 | 20 | - name: Load cached venv 21 | id: cached-poetry-dependencies 
22 | uses: actions/cache@v4 23 | with: 24 | path: .venv 25 | key: venv-${{ runner.os }}-${{ hashFiles('poetry.lock') }}-${{ hashFiles('.github/actions/install-env/action.yml') }}-${{ steps.setup-python.outputs.python-version }} 26 | 27 | - name: Load cached .local 28 | id: cached-dotlocal 29 | uses: actions/cache@v4 30 | with: 31 | path: ~/.local 32 | key: dotlocal-${{ runner.os }}-${{ hashFiles('.github/actions/install-env/action.yml') }}-${{ steps.setup-python.outputs.python-version }} 33 | 34 | - name: Install Python poetry 35 | uses: snok/install-poetry@v1 36 | with: 37 | virtualenvs-create: true 38 | virtualenvs-in-project: true 39 | installer-parallel: true 40 | virtualenvs-path: .venv 41 | if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true' 42 | 43 | - name: Install dependencies 44 | shell: bash 45 | if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true' 46 | run: poetry install --no-interaction 47 | 48 | - name: Activate environment 49 | shell: bash 50 | run: source .venv/bin/activate 51 | -------------------------------------------------------------------------------- /docs/index.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Welcome" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Well met, fellow data analyst!\n", 15 | "\n", 16 | "If you're like me, then you're used to pesky stakeholders, who ask you why a metric changed. These kind of questions are tricky to answer confidently. It usually ends with you sharing a few other related metrics, giving some context, and providing a weak explanation. All the while hoping the stakeholder will be satisfied (or fed up) and go away 😮‍💨\n", 17 | "\n", 18 | "This isn't a good situation to be in. But what if you could tell *exactly* why a metric changed? Wouldn't that be great? 
🤩\n", 19 | "\n", 20 | "`icanexplain` is a Python package. It provides a framework to break a metric down into drivers. It attributes the change in a metric to its drivers. Instead of just measuring the evolution of each driver, we can exactly quantify how much of the metric's evolution is due to each driver.\n", 21 | "\n", 22 | "The best way to understand how `icanexplain` works is to see it in action, by checking out the [examples](examples/iowa-whiskey-sales/).\n", 23 | "\n", 24 | "`icanexplain` works with [pandas](https://pandas.pydata.org/) and [Polars](https://pola.rs/) out of the box. Additionally, it can run against other backends (e.g. SQL) because it is implemented with [Ibis](https://ibis-project.org/). Check out [this example](examples/ibis/) for more information.\n", 25 | "\n", 26 | "
" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [] 33 | } 34 | ], 35 | "metadata": { 36 | "language_info": { 37 | "codemirror_mode": { 38 | "name": "ipython", 39 | "version": 3 40 | }, 41 | "file_extension": ".py", 42 | "mimetype": "text/x-python", 43 | "name": "python", 44 | "nbconvert_exporter": "python", 45 | "pygments_lexer": "ipython3", 46 | "version": "3.11.4" 47 | } 48 | }, 49 | "nbformat": 4, 50 | "nbformat_minor": 2 51 | } 52 | -------------------------------------------------------------------------------- /docs/theme/README.md: -------------------------------------------------------------------------------- 1 | # Kilsbergen 2 | 3 | A clean [MkDocs][mkdocs] theme. 4 | 5 | This theme is designed for [Tako][tako], [Pris][pris], and [Noblit][noblit]. 6 | It is not flexible on purpose: it supports everything I need, and nothing more. 7 | 8 | ## Demos 9 | 10 | * [Musium documentation][musium-docs] 11 | * [Noblit documentation][noblit-docs] 12 | * [Pris documentation][pris-docs] 13 | * [RCL documentation][rcl-docs] 14 | * [Squiller documentation][squiller-docs] 15 | * [Tako documentation][tako-docs] 16 | 17 | ## Features 18 | 19 | * Responsive design 20 | * Zero javascript 21 | 22 | ## Usage 23 | 24 | One easy way to use this theme, is to add it as a Git submodule to your `docs` 25 | directory, e.g. at `docs/theme`. Then add the following in your `mkdocs.yml`: 26 | 27 | ```yaml 28 | theme: 29 | name: null 30 | custom_dir: docs/theme 31 | ``` 32 | 33 | This theme requires MkDocs 1.1 or later. For earlier versions, delete this 34 | `README.md` to work around [this bug][readmebug]. 
35 | 36 | To enable anchors next to section headings, add the following to your 37 | `mkdocs.yml`: 38 | 39 | ```yaml 40 | markdown_extensions: 41 | - toc: 42 | permalink: true 43 | permalink_title: null 44 | ``` 45 | 46 | To enable syntax highlighting, ensure that `pygmentize` is available, and add 47 | the following to your `mkdocs.yml`: 48 | 49 | ```yaml 50 | markdown_extensions: 51 | - codehilite 52 | ``` 53 | 54 | See also [the python-markdown list of extensions][exts]. 55 | 56 | [readmebug]: https://github.com/mkdocs/mkdocs/issues/1766 57 | [exts]: https://python-markdown.github.io/extensions/ 58 | 59 | ## License 60 | 61 | Kilsbergen is licensed under the [Apache 2.0][apache2] license. In the generated 62 | documentation, it is fine to just link to this readme from a comment. 63 | 64 | [apache2]: https://www.apache.org/licenses/LICENSE-2.0 65 | [mkdocs]: https://www.mkdocs.org/ 66 | [musium-docs]: https://docs.ruuda.nl/musium/ 67 | [noblit-docs]: https://docs.ruuda.nl/noblit/ 68 | [noblit]: https://github.com/ruuda/noblit 69 | [pris-docs]: https://docs.ruuda.nl/pris/ 70 | [pris]: https://github.com/ruuda/pris 71 | [rcl-docs]: https://docs.ruuda.nl/rcl/ 72 | [squiller-docs]: https://docs.ruuda.nl/squiller/ 73 | [tako-docs]: https://docs.ruuda.nl/tako/ 74 | [tako]: https://github.com/ruuda/tako 75 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # icanexplain 2 | 3 |

4 | 5 | 6 | tests 7 | 8 | 9 | 10 | 11 | code_quality 12 | 13 | 14 | 15 | 16 | documentation 17 | 18 | 19 | 20 | 21 | pypi 22 | 23 | 24 | 25 | 26 | license 27 | 28 |

29 | 30 | _Explain why metrics change by unpacking them_ 31 | 32 | This library is here to help with the difficult task of explaining why a metric changes. It's particularly useful for analysts, data scientists, analytics engineers, and business intelligence professionals who need to understand the drivers of a metric's change. 33 | 34 | This README provides a small introduction. For more information, please refer to the [documentation](https://carbonfact.github.io/icanexplain). 35 | 36 | Check out [this blog post](https://maxhalford.github.io/blog/kpi-evolution-decomposition/) for some in-depth explanation. 37 | 38 | ## Quickstart 39 | 40 | Let's say you're an analyst at an Airbnb-like company. You're tasked with analyzing year-over-year revenue growth. You have obtained the following dataset: 41 | 42 | ```py 43 | >>> import pandas as pd 44 | >>> fmt_currency = lambda x: '' if pd.isna(x) else '${:,.0f}'.format(x) 45 | 46 | >>> revenue = pd.DataFrame.from_dict([ 47 | ... {'year': 2019, 'bookings': 1_000, 'revenue_per_booking': 200}, 48 | ... {'year': 2020, 'bookings': 1_000, 'revenue_per_booking': 220}, 49 | ... {'year': 2021, 'bookings': 1_500, 'revenue_per_booking': 220}, 50 | ... {'year': 2022, 'bookings': 1_700, 'revenue_per_booking': 225}, 51 | ... ]) 52 | >>> ( 53 | ... revenue 54 | ... .assign(bookings=revenue.bookings.apply('{:,d}'.format)) 55 | ... .assign(revenue_per_booking=revenue.revenue_per_booking.apply(fmt_currency)) 56 | ... .set_index('year') 57 | ... ) 58 | bookings revenue_per_booking 59 | year 60 | 2019 1,000 $200 61 | 2020 1,000 $220 62 | 2021 1,500 $220 63 | 2022 1,700 $225 64 | 65 | ``` 66 | 67 | It's quite straightforward to calculate the revenue for each year, and then to measure the year-over-year growth: 68 | 69 | ```py 70 | >>> ( 71 | ... revenue 72 | ... .assign(revenue=revenue.eval('bookings * revenue_per_booking')) 73 | ... .assign(growth=lambda x: x.revenue.diff()) 74 | ... 
.assign(bookings=revenue.bookings.apply('{:,d}'.format)) 75 | ... .assign(revenue_per_booking=revenue.revenue_per_booking.apply(fmt_currency)) 76 | ... .assign(revenue=lambda x: x.revenue.apply(fmt_currency)) 77 | ... .assign(growth=lambda x: x.growth.apply(fmt_currency)) 78 | ... .set_index('year') 79 | ... ) 80 | bookings revenue_per_booking revenue growth 81 | year 82 | 2019 1,000 $200 $200,000 83 | 2020 1,000 $220 $220,000 $20,000 84 | 2021 1,500 $220 $330,000 $110,000 85 | 2022 1,700 $225 $382,500 $52,500 86 | 87 | ``` 88 | 89 | Growth can be due to two factors: an increase in the number of bookings, or an increase in the revenue per booking. The icanexplain library can decompose the growth into these two factors. First, let's install the package: 90 | 91 | ```sh 92 | pip install icanexplain 93 | ``` 94 | 95 | Then, we can use the `SumExplainer` to decompose the growth: 96 | 97 | ```py 98 | >>> import icanexplain as ice 99 | >>> explainer = ice.SumExplainer( 100 | ... fact='revenue_per_booking', 101 | ... period='year', 102 | ... count='bookings' 103 | ... ) 104 | >>> explanation = explainer(revenue) 105 | >>> explanation.map(fmt_currency) 106 | inner mix 107 | year 108 | 2020 $20,000 $0 109 | 2021 $0 $110,000 110 | 2022 $7,500 $45,000 111 | 112 | ``` 113 | 114 | Here's how to interpret this explanation: 115 | 116 | - From 2019 to 2020, the revenue growth was entirely due to an increase in the revenue per booking. The number of bookings was exactly the same. Therefore, the $20,000 is entirely due to the inner effect (increase in revenue per booking). 117 | - From 2020 to 2021, the revenue growth was entirely due to an increase in the number of bookings. The revenue per booking was exactly the same. Therefore, the $110,000 is entirely due to the mix effect (increase in bookings). 118 | - From 2021 to 2022, there was a $52,500 revenue growth. The revenue per booking only went up by $5, so most of the increase is due to the higher number of bookings.
The inner effect is $7,500 while the mix effect is $45,000. 119 | 120 | Here's a visual representation of this last interpretation: 121 | 122 |

123 | example 124 |

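For the curious, the inner/mix figures from the quickstart can be re-derived by hand. The sketch below assumes the standard sum-decomposition convention — inner effect = previous period's bookings × change in revenue per booking, mix effect = change in bookings × current revenue per booking — which reproduces the `SumExplainer` output above; it is an illustration of the arithmetic, not necessarily the library's internal implementation:

```python
# Hand-check of the inner/mix decomposition from the quickstart.
# Assumed convention: inner = old_bookings * (new_rate - old_rate)
#                     mix   = (new_bookings - old_bookings) * new_rate
# so that inner + mix equals the year-over-year revenue growth exactly.
rows = [
    (2019, 1_000, 200),
    (2020, 1_000, 220),
    (2021, 1_500, 220),
    (2022, 1_700, 225),
]

for (year0, n0, r0), (year1, n1, r1) in zip(rows, rows[1:]):
    inner = n0 * (r1 - r0)          # effect of the rate changing
    mix = (n1 - n0) * r1            # effect of the volume changing
    growth = n1 * r1 - n0 * r0      # actual year-over-year growth
    assert inner + mix == growth    # the split is exact, by construction
    print(year1, inner, mix)
# 2020 20000 0
# 2021 0 110000
# 2022 7500 45000
```

The assertion is the whole point of the decomposition: the two effects always add up exactly to the observed growth, with no unexplained remainder.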
125 | 126 | ## Contributing 127 | 128 | Feel free to reach out to [max@carbonfact.com](mailto:max@carbonfact.com) if you want to know more and/or contribute 🤗 129 | 130 | Check out the [contribution guidelines](CONTRIBUTING.md) to get started. 131 | 132 | ## License 133 | 134 | icanexplain is free and open-source software licensed under the Apache License, Version 2.0. 135 | -------------------------------------------------------------------------------- /icanexplain/test_sum.py: -------------------------------------------------------------------------------- 1 | import random 2 | import pandas as pd 3 | 4 | 5 | def make_claims() -> pd.DataFrame: 6 | random.seed(42) 7 | 8 | # Function to generate a random cost based on the claim type and year 9 | def generate_claim_cost(claim_type, year): 10 | if claim_type == "Dentist": 11 | base_cost = 100 12 | elif claim_type == "Psychiatrist": 13 | base_cost = 150 14 | elif claim_type == "General Physician": 15 | base_cost = 80 16 | elif claim_type == "Physiotherapy": 17 | base_cost = 120 18 | else: 19 | base_cost = 50 20 | 21 | # Adjust cost based on year 22 | if year == 2021: 23 | base_cost *= 1.2 24 | elif year == 2023: 25 | base_cost *= 1.5 26 | 27 | # Add some random variation 28 | cost = random.uniform(base_cost - 20, base_cost + 20) 29 | return round(cost, 2) 30 | 31 | # Generating sample data 32 | claim_types = ["Dentist", "Psychiatrist", "General Physician", "Physiotherapy"] 33 | years = [2021, 2022, 2023] 34 | people = [ 35 | "John", 36 | "Jane", 37 | "Michael", 38 | "Emily", 39 | "William", 40 | "Emma", 41 | "Daniel", 42 | "Olivia", 43 | "Lucas", 44 | "Ava", 45 | ] 46 | 47 | data = [] 48 | for year in years: 49 | for person in people: 50 | num_claims = random.randint( 51 | 1, 5 52 | ) # Random number of claims per person per year 53 | for _ in range(num_claims): 54 | claim_type = random.choice(claim_types) 55 | cost = generate_claim_cost(claim_type, year) 56 | date = pd.to_datetime( 57 | f"{random.randint(1, 
12)}/{random.randint(1, 28)}/{year}", 58 | format="%m/%d/%Y", 59 | ) 60 | data.append([person, claim_type, date, year, cost]) 61 | 62 | # Create the DataFrame 63 | columns = ["person", "claim_type", "date", "year", "amount"] 64 | claims = pd.DataFrame(data, columns=columns) 65 | 66 | return claims 67 | 68 | 69 | def test_claims(): 70 | """ 71 | 72 | >>> import icanexplain as ice 73 | 74 | >>> claims = make_claims() 75 | >>> claims.head() 76 | person claim_type date year amount 77 | 0 John Dentist 2021-04-08 2021 129.66 78 | 1 Jane Dentist 2021-09-03 2021 127.07 79 | 2 Jane Physiotherapy 2021-02-07 2021 125.27 80 | 3 Michael Dentist 2021-12-21 2021 122.45 81 | 4 Michael Physiotherapy 2021-10-09 2021 132.82 82 | 83 | The goal is to explain the evolution of total claims amount over time. Let's take a look at the 84 | yearly evolution. 85 | 86 | >>> ( 87 | ... claims 88 | ... .groupby('year') 89 | ... .agg({'amount': 'sum'}) 90 | ... .assign(diff=lambda x: x.amount.diff()) 91 | ... .reset_index() 92 | ... ) 93 | year amount diff 94 | 0 2021 3814.54 NaN 95 | 1 2022 2890.29 -924.25 96 | 2 2023 4178.03 1287.74 97 | 98 | The theory is that the figures we find should add up to the same yearly total, however we 99 | explanation the metric. 100 | 101 | >>> explainer = ice.SumExplainer( 102 | ... fact='amount', 103 | ... period='year', 104 | ... group='claim_type' 105 | ... ) 106 | >>> explanation = explainer(claims) 107 | >>> explanation 108 | inner mix 109 | year claim_type 110 | 2022 Dentist -170.700000 -311.240000 111 | General Physician -95.053333 249.693333 112 | Physiotherapy -122.880000 -339.450000 113 | Psychiatrist -282.030000 147.410000 114 | 2023 Dentist 338.180000 480.330000 115 | General Physician 313.151429 -236.051429 116 | Physiotherapy 185.125000 524.575000 117 | Psychiatrist 544.140000 -861.710000 118 | 119 | Let's check that the sum of the inner and mix columns add up as expected. 120 | 121 | >>> ( 122 | ... explanation 123 | ... .groupby('year') 124 | ... 
.apply(lambda x: (x.inner + x.mix).sum(), include_groups=False) 125 | ... .rename('diff') 126 | ... .reset_index() 127 | ... ) 128 | year diff 129 | 0 2022 -924.25 130 | 1 2023 1287.74 131 | 132 | """ 133 | 134 | 135 | def test_claims_with_gaps(): 136 | """ 137 | 138 | In practice, dimension values don't always appear for each period of time. It's good to check 139 | that the implementation can handle such cases. 140 | 141 | >>> import icanexplain as ice 142 | 143 | >>> claims = make_claims() 144 | >>> claims = claims.drop(index=claims.query('year == 2021 and claim_type == "Dentist"').index) 145 | >>> claims = claims.drop(index=claims.query('year == 2022 and claim_type == "Physiotherapy"').index) 146 | 147 | >>> ( 148 | ... claims 149 | ... .groupby('year') 150 | ... .agg({'amount': 'sum'}) 151 | ... .assign(diff=lambda x: x.amount.diff()) 152 | ... .reset_index() 153 | ... ) 154 | year amount diff 155 | 0 2021 2710.12 NaN 156 | 1 2022 2550.84 -159.28 157 | 2 2023 4178.03 1627.19 158 | 159 | >>> explainer = ice.SumExplainer( 160 | ... fact='amount', 161 | ... period='year', 162 | ... group='claim_type' 163 | ... ) 164 | >>> explanation = explainer(claims) 165 | >>> explanation 166 | inner mix 167 | year claim_type 168 | 2022 Dentist 0.000000 622.480000 169 | General Physician -95.053333 249.693333 170 | Physiotherapy -801.780000 -0.000000 171 | Psychiatrist -282.030000 147.410000 172 | 2023 Dentist 338.180000 480.330000 173 | General Physician 313.151429 -236.051429 174 | Physiotherapy 0.000000 1049.150000 175 | Psychiatrist 544.140000 -861.710000 176 | 177 | >>> ( 178 | ... explanation 179 | ... .groupby('year') 180 | ... .apply(lambda x: (x.inner + x.mix).sum(), include_groups=False) 181 | ... .rename('diff') 182 | ... .reset_index() 183 | ... 
) 184 | year diff 185 | 0 2022 -159.28 186 | 1 2023 1627.19 187 | 188 | """ 189 | 190 | 191 | def test_agg_vs_samples(): 192 | """ 193 | 194 | We want to check that explaining with a sample-by-sample approach gives the same results as 195 | explaining with an aggregated table. 196 | 197 | >>> import icanexplain as ice 198 | 199 | >>> claims = make_claims() 200 | >>> ( 201 | ... claims 202 | ... .groupby('year') 203 | ... .agg({'amount': 'sum'}) 204 | ... .assign(diff=lambda x: x.amount.diff()) 205 | ... .reset_index() 206 | ... ) 207 | year amount diff 208 | 0 2021 3814.54 NaN 209 | 1 2022 2890.29 -924.25 210 | 2 2023 4178.03 1287.74 211 | 212 | Sample by sample. 213 | 214 | >>> explainer = ice.SumExplainer( 215 | ... fact='amount', 216 | ... period='year', 217 | ... group='claim_type' 218 | ... ) 219 | >>> explanation = explainer(claims) 220 | >>> explanation 221 | inner mix 222 | year claim_type 223 | 2022 Dentist -170.700000 -311.240000 224 | General Physician -95.053333 249.693333 225 | Physiotherapy -122.880000 -339.450000 226 | Psychiatrist -282.030000 147.410000 227 | 2023 Dentist 338.180000 480.330000 228 | General Physician 313.151429 -236.051429 229 | Physiotherapy 185.125000 524.575000 230 | Psychiatrist 544.140000 -861.710000 231 | 232 | >>> ( 233 | ... explanation 234 | ... .groupby('year') 235 | ... .apply(lambda x: (x.inner + x.mix).sum(), include_groups=False) 236 | ... .rename('diff') 237 | ... .reset_index() 238 | ... ) 239 | year diff 240 | 0 2022 -924.25 241 | 1 2023 1287.74 242 | 243 | Aggregate. 244 | 245 | >>> claims_agg = ( 246 | ... claims 247 | ... .groupby(['year', 'claim_type']) 248 | ... ['amount'].agg(['mean', 'count']) 249 | ... .reset_index() 250 | ... 
) 251 | >>> claims_agg 252 | year claim_type mean count 253 | 0 2021 Dentist 122.713333 9 254 | 1 2021 General Physician 99.073333 6 255 | 2 2021 Physiotherapy 133.630000 6 256 | 3 2021 Psychiatrist 187.700000 7 257 | 4 2022 Dentist 103.746667 6 258 | 5 2022 General Physician 83.231111 9 259 | 6 2022 Physiotherapy 113.150000 3 260 | 7 2022 Psychiatrist 147.410000 8 261 | 8 2023 Dentist 160.110000 9 262 | 9 2023 General Physician 118.025714 7 263 | 10 2023 Physiotherapy 174.858333 6 264 | 11 2023 Psychiatrist 215.427500 4 265 | 266 | >>> explainer = ice.SumExplainer( 267 | ... fact='mean', 268 | ... period='year', 269 | ... group='claim_type', 270 | ... count='count' 271 | ... ) 272 | >>> explanation = explainer(claims_agg) 273 | >>> explanation 274 | inner mix 275 | year claim_type 276 | 2022 Dentist -170.700000 -311.240000 277 | General Physician -95.053333 249.693333 278 | Physiotherapy -122.880000 -339.450000 279 | Psychiatrist -282.030000 147.410000 280 | 2023 Dentist 338.180000 480.330000 281 | General Physician 313.151429 -236.051429 282 | Physiotherapy 185.125000 524.575000 283 | Psychiatrist 544.140000 -861.710000 284 | 285 | >>> ( 286 | ... explanation 287 | ... .groupby('year') 288 | ... .apply(lambda x: (x.inner + x.mix).sum(), include_groups=False) 289 | ... .rename('diff') 290 | ... .reset_index() 291 | ... ) 292 | year diff 293 | 0 2022 -924.25 294 | 1 2023 1287.74 295 | 296 | """ 297 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 
11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 
47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. 
Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. 
In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. 
We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /docs/theme/LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. 
For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 
48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. 
Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. 
You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 
123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. 
In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | APPENDIX: How to apply the Apache License to your work. 180 | 181 | To apply the Apache License to your work, attach the following 182 | boilerplate notice, with the fields enclosed by brackets "[]" 183 | replaced with your own identifying information. (Don't include 184 | the brackets!) The text should be enclosed in the appropriate 185 | comment syntax for the file format. 
We also recommend that a 186 | file or class name and description of purpose be included on the 187 | same "printed page" as the copyright notice for easier 188 | identification within third-party archives. 189 | 190 | Copyright [yyyy] [name of copyright owner] 191 | 192 | Licensed under the Apache License, Version 2.0 (the "License"); 193 | you may not use this file except in compliance with the License. 194 | You may obtain a copy of the License at 195 | 196 | http://www.apache.org/licenses/LICENSE-2.0 197 | 198 | Unless required by applicable law or agreed to in writing, software 199 | distributed under the License is distributed on an "AS IS" BASIS, 200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 201 | See the License for the specific language governing permissions and 202 | limitations under the License. 203 | -------------------------------------------------------------------------------- /icanexplain/test_mean.py: -------------------------------------------------------------------------------- 1 | import collections 2 | import random 3 | import names # type: ignore[import] 4 | import pandas as pd 5 | 6 | 7 | def make_claims() -> pd.DataFrame: 8 | random.seed(42) 9 | 10 | # Function to generate a random cost based on the claim type and year 11 | def generate_claim_cost(claim_type, year): 12 | if claim_type == "Dentist": 13 | base_cost = 100 14 | elif claim_type == "Psychiatrist": 15 | base_cost = 150 16 | 17 | # Adjust cost based on year 18 | if year == 2021: 19 | base_cost *= 1.2 20 | elif year == 2023: 21 | base_cost *= 1.5 22 | 23 | # Add some random variation 24 | cost = random.uniform(base_cost - 20, base_cost + 20) 25 | return round(cost, 2) 26 | 27 | # Generating sample data 28 | claim_types = ["Dentist", "Psychiatrist"] 29 | years = [2021, 2022, 2023, 2024] 30 | people = ["John", "Jane", "Michael", "Emily", "William"] 31 | 32 | data = [] 33 | for year in years: 34 | new_people = ( 35 | [names.get_first_name() for _ in 
range(random.randint(1, 3))] 36 | if year > 2021 37 | else [] 38 | ) 39 | existing_people = [person for person in people if random.random() > 0.3] 40 | people_this_year = existing_people + new_people 41 | people.extend(new_people) 42 | 43 | for person in people_this_year: 44 | num_claims = random.randint( 45 | 1, 5 46 | ) # Random number of claims per existing customer per year 47 | for _ in range(num_claims): 48 | claim_type = random.choice(claim_types) 49 | cost = generate_claim_cost(claim_type, year) 50 | date = pd.to_datetime( 51 | f"{random.randint(1, 12)}/{random.randint(1, 28)}/{year}", 52 | format="%m/%d/%Y", 53 | ) 54 | data.append([person, claim_type, date, year, cost]) 55 | 56 | # Create the DataFrame 57 | columns = ["person", "claim_type", "date", "year", "amount"] 58 | claims = pd.DataFrame(data, columns=columns) 59 | 60 | # Indicate whether people are existing, new, or returning 61 | years_seen = collections.defaultdict(set) 62 | statuses = [] 63 | for claim in claims.to_dict(orient="records"): 64 | years_seen[claim["person"]].add(claim["year"]) 65 | if claim["year"] - 1 in years_seen[claim["person"]]: 66 | statuses.append("EXISTING") 67 | elif any(year < claim["year"] for year in years_seen[claim["person"]]): 68 | statuses.append("RETURNING") 69 | elif not { 70 | year for year in years_seen[claim["person"]] if year != claim["year"] 71 | }: 72 | statuses.append("NEW") 73 | 74 | claims["status"] = statuses 75 | 76 | return claims 77 | 78 | 79 | def test_claims(): 80 | """ 81 | 82 | >>> import icanexplain as ice 83 | 84 | >>> claims = make_claims() 85 | >>> claims.head() 86 | person claim_type date year amount status 87 | 0 John Dentist 2021-01-01 2021 123.62 NEW 88 | 1 John Dentist 2021-09-20 2021 108.75 NEW 89 | 2 John Dentist 2021-12-21 2021 122.45 NEW 90 | 3 John Psychiatrist 2021-10-09 2021 168.82 NEW 91 | 4 John Dentist 2021-03-23 2021 130.35 NEW 92 | 93 | The goal is to explain how the mean evolved over time. Let's take a look at it. 
94 | 95 | >>> ( 96 | ... claims 97 | ... .groupby('year') 98 | ... .agg({'amount': 'mean'}) 99 | ... .assign(diff=lambda x: x.amount.diff()) 100 | ... .reset_index() 101 | ... ) 102 | year amount diff 103 | 0 2021 145.808889 NaN 104 | 1 2022 112.676667 -33.132222 105 | 2 2023 173.043667 60.367000 106 | 3 2024 122.920625 -50.123042 107 | 108 | Here's the breakdown by claim type: 109 | 110 | >>> ( 111 | ... claims 112 | ... .groupby(['year', 'claim_type']) 113 | ... ['amount'].agg(['mean', 'count']) 114 | ... .reset_index() 115 | ... ) 116 | year claim_type mean count 117 | 0 2021 Dentist 122.87200 5 118 | 1 2021 Psychiatrist 174.48000 4 119 | 2 2022 Dentist 98.37500 4 120 | 3 2022 Psychiatrist 141.28000 2 121 | 4 2023 Dentist 148.36500 20 122 | 5 2023 Psychiatrist 222.40100 10 123 | 6 2024 Dentist 97.68250 8 124 | 7 2024 Psychiatrist 148.15875 8 125 | 126 | The theory is that no matter how we break down the metric, the figures we find should add up to the same 127 | yearly total. 128 | 129 | >>> explainer = ice.MeanExplainer( 130 | ... fact='amount', 131 | ... period='year', 132 | ... group='claim_type' 133 | ... ) 134 | >>> explanation = explainer(claims) 135 | >>> explanation 136 | inner mix 137 | year claim_type 138 | 2022 Dentist -16.331333 1.132815 139 | Psychiatrist -11.066667 -6.867037 140 | 2023 Dentist 33.326667 -0.000000 141 | Psychiatrist 27.040333 -0.000000 142 | 2024 Dentist -25.341250 -4.240729 143 | Psychiatrist -37.121125 16.580063 144 | 145 | Let's check that the inner and mix columns add up as expected. 146 | 147 | >>> ( 148 | ... explanation 149 | ... .groupby('year') 150 | ... .apply(lambda x: (x.inner + x.mix).sum(), include_groups=False) 151 | ... .rename('diff') 152 | ... .reset_index() 153 | ... 
) 154 | year diff 155 | 0 2022 -33.132222 156 | 1 2023 60.367000 157 | 2 2024 -50.123042 158 | 159 | """ 160 | 161 | 162 | def test_claims_with_gaps(): 163 | """ 164 | 165 | >>> import icanexplain as ice 166 | 167 | >>> claims = make_claims() 168 | >>> claims = claims.drop(index=claims.query('year == 2021 and claim_type == "Dentist"').index) 169 | >>> claims = claims.drop(index=claims.query('year == 2022 and claim_type == "Psychiatrist"').index) 170 | >>> claims = claims.drop(index=claims.query('year == 2023 and claim_type == "Psychiatrist"').index) 171 | 172 | >>> ( 173 | ... claims 174 | ... .groupby('year') 175 | ... .agg({'amount': 'mean'}) 176 | ... .assign(diff=lambda x: x.amount.diff()) 177 | ... .reset_index() 178 | ... ) 179 | year amount diff 180 | 0 2021 174.480000 NaN 181 | 1 2022 98.375000 -76.105000 182 | 2 2023 148.365000 49.990000 183 | 3 2024 122.920625 -25.444375 184 | 185 | >>> explainer = ice.MeanExplainer( 186 | ... fact='amount', 187 | ... period='year', 188 | ... group='claim_type' 189 | ... ) 190 | >>> explainer(claims) 191 | inner mix 192 | year claim_type 193 | 2022 Dentist 98.375000 -98.375000 194 | Psychiatrist -0.000000 -76.105000 195 | 2023 Dentist 49.990000 -0.000000 196 | Psychiatrist 0.000000 -0.000000 197 | 2024 Dentist -25.341250 -12.722187 198 | Psychiatrist 74.079375 -61.460313 199 | 200 | """ 201 | 202 | 203 | def test_clicks(): 204 | """ 205 | 206 | >>> import pandas as pd 207 | >>> traffic_agg = pd.DataFrame([ 208 | ... {'timestamp': '2018-01-01', 'dim': 'A', 'clicks': 150, 'impressions': 1000}, 209 | ... {'timestamp': '2018-01-01', 'dim': 'B', 'clicks': 150, 'impressions': 2000}, 210 | ... {'timestamp': '2018-02-01', 'dim': 'A', 'clicks': 200, 'impressions': 1000}, 211 | ... {'timestamp': '2018-02-01', 'dim': 'B', 'clicks': 300, 'impressions': 2000}, 212 | ... {'timestamp': '2019-01-01', 'dim': 'A', 'clicks': 120, 'impressions': 1100}, 213 | ... 
{'timestamp': '2019-01-01', 'dim': 'B', 'clicks': 200, 'impressions': 2150}, 214 | ... {'timestamp': '2019-02-01', 'dim': 'A', 'clicks': 242, 'impressions': 1100}, 215 | ... {'timestamp': '2019-02-01', 'dim': 'B', 'clicks': 323, 'impressions': 2150}, 216 | ... ]) 217 | >>> traffic_agg['timestamp'] = pd.to_datetime(traffic_agg['timestamp']) 218 | 219 | The figures are aggregated, which isn't the usual expected format. A first solution is to 220 | expand the data into individual samples. 221 | 222 | >>> import itertools 223 | >>> traffic = pd.DataFrame(itertools.chain(*[ 224 | ... [{'timestamp': r['timestamp'], 'dim': r['dim'], 'click': True} for _ in range(r['clicks'])] + 225 | ... [{'timestamp': r['timestamp'], 'dim': r['dim'], 'click': False} for _ in range(r['impressions'] - r['clicks'])] 226 | ... for r in traffic_agg.to_dict(orient='records') 227 | ... ])) 228 | >>> traffic.head() 229 | timestamp dim click 230 | 0 2018-01-01 A True 231 | 1 2018-01-01 A True 232 | 2 2018-01-01 A True 233 | 3 2018-01-01 A True 234 | 4 2018-01-01 A True 235 | 236 | >>> traffic = traffic.assign( 237 | ... year=traffic.timestamp.dt.year, 238 | ... month=traffic.timestamp.dt.month 239 | ... ) 240 | >>> traffic.groupby(['timestamp', 'dim'])['click'].agg(['sum', 'size']) 241 | sum size 242 | timestamp dim 243 | 2018-01-01 A 150 1000 244 | B 150 2000 245 | 2018-02-01 A 200 1000 246 | B 300 2000 247 | 2019-01-01 A 120 1100 248 | B 200 2150 249 | 2019-02-01 A 242 1100 250 | B 323 2150 251 | 252 | >>> import icanexplain as ice 253 | 254 | >>> explainer = ice.MeanExplainer( 255 | ... fact='click', 256 | ... period=['year', 'month'], 257 | ... group='dim', 258 | ... ) 259 | >>> explainer(traffic) 260 | inner mix 261 | year month dim 262 | 2019 1 A -0.013846 0.000264 263 | B 0.011923 0.000120 264 | 2 A 0.006769 0.000134 265 | B 0.000154 0.000122 266 | 267 | We can also make this work with the aggregate data. Here's how: 268 | 269 | >>> traffic_agg = traffic_agg.assign( 270 | ... 
click_rate=lambda x: x['clicks'] / x['impressions'], 271 | ... year=traffic_agg.timestamp.dt.year, 272 | ... month=traffic_agg.timestamp.dt.month 273 | ... ) 274 | >>> explainer = ice.MeanExplainer( 275 | ... fact='click_rate', 276 | ... count='impressions', 277 | ... period=['year', 'month'], 278 | ... group='dim', 279 | ... ) 280 | >>> explainer(traffic_agg) 281 | inner mix 282 | year month dim 283 | 2019 1 A -0.013846 0.000264 284 | B 0.011923 0.000120 285 | 2 A 0.006769 0.000134 286 | B 0.000154 0.000122 287 | 288 | """ 289 | 290 | 291 | def test_agg_vs_samples(): 292 | """ 293 | 294 | We want to check that explaining with a sample by sample approach gives the same results as 295 | explaining with an aggregated table. 296 | 297 | >>> import icanexplain as ice 298 | 299 | >>> claims = make_claims() 300 | >>> ( 301 | ... claims 302 | ... .groupby('year') 303 | ... .agg({'amount': 'mean'}) 304 | ... .assign(diff=lambda x: x.amount.diff()) 305 | ... .reset_index() 306 | ... ) 307 | year amount diff 308 | 0 2021 145.808889 NaN 309 | 1 2022 112.676667 -33.132222 310 | 2 2023 173.043667 60.367000 311 | 3 2024 122.920625 -50.123042 312 | 313 | Sample by sample. 314 | 315 | >>> explainer = ice.MeanExplainer( 316 | ... fact='amount', 317 | ... period='year', 318 | ... group='claim_type' 319 | ... ) 320 | >>> explanation = explainer(claims) 321 | >>> explanation 322 | inner mix 323 | year claim_type 324 | 2022 Dentist -16.331333 1.132815 325 | Psychiatrist -11.066667 -6.867037 326 | 2023 Dentist 33.326667 -0.000000 327 | Psychiatrist 27.040333 -0.000000 328 | 2024 Dentist -25.341250 -4.240729 329 | Psychiatrist -37.121125 16.580063 330 | 331 | >>> ( 332 | ... explanation 333 | ... .groupby('year') 334 | ... .apply(lambda x: (x.inner + x.mix).sum(), include_groups=False) 335 | ... .rename('diff') 336 | ... .reset_index() 337 | ... ) 338 | year diff 339 | 0 2022 -33.132222 340 | 1 2023 60.367000 341 | 2 2024 -50.123042 342 | 343 | Aggregate.
344 | 345 | >>> claims_agg = ( 346 | ... claims 347 | ... .groupby(['year', 'claim_type']) 348 | ... ['amount'].agg(['mean', 'count']) 349 | ... .reset_index() 350 | ... ) 351 | >>> claims_agg 352 | year claim_type mean count 353 | 0 2021 Dentist 122.87200 5 354 | 1 2021 Psychiatrist 174.48000 4 355 | 2 2022 Dentist 98.37500 4 356 | 3 2022 Psychiatrist 141.28000 2 357 | 4 2023 Dentist 148.36500 20 358 | 5 2023 Psychiatrist 222.40100 10 359 | 6 2024 Dentist 97.68250 8 360 | 7 2024 Psychiatrist 148.15875 8 361 | 362 | >>> explainer = ice.MeanExplainer( 363 | ... fact='mean', 364 | ... period='year', 365 | ... group='claim_type', 366 | ... count='count' 367 | ... ) 368 | >>> explanation = explainer(claims_agg) 369 | >>> explanation 370 | inner mix 371 | year claim_type 372 | 2022 Dentist -16.331333 1.132815 373 | Psychiatrist -11.066667 -6.867037 374 | 2023 Dentist 33.326667 -0.000000 375 | Psychiatrist 27.040333 -0.000000 376 | 2024 Dentist -25.341250 -4.240729 377 | Psychiatrist -37.121125 16.580063 378 | 379 | >>> ( 380 | ... explanation 381 | ... .groupby('year') 382 | ... .apply(lambda x: (x.inner + x.mix).sum(), include_groups=False) 383 | ... .rename('diff') 384 | ... .reset_index() 385 | ... ) 386 | year diff 387 | 0 2022 -33.132222 388 | 1 2023 60.367000 389 | 2 2024 -50.123042 390 | 391 | """ 392 | -------------------------------------------------------------------------------- /docs/theme/main.html: -------------------------------------------------------------------------------- 1 | {# 2 | Kilsbergen -- A clean MkDocs theme. 3 | Copyright 2019 Ruud van Asseldonk. 4 | 5 | Licensed under the Apache License, Version 2.0 (the "License"); 6 | you may not use this file except in compliance with the License. 7 | A copy of the License has been included in the root of the repository. 8 | #} 9 | 10 | 11 | 12 | 17 | 18 | 19 | {% if page.title %}{{ page.title }} — {% endif %}{{ config.site_name }} 20 | 23 | 24 | 25 | 26 | 430 | 431 | 432 |
433 |
434 | 440 |
441 | {{ page.content }} 442 |
443 | {% if page.next_page or page.previous_page -%} 444 | 452 | {% endif -%} 453 |
454 | 487 |
488 | 489 | 490 | -------------------------------------------------------------------------------- /docs/examples/ibis.md: -------------------------------------------------------------------------------- 1 | # Different backend support with Ibis 🐦 2 | 3 | icanexplain is implemented with [Ibis](https://github.com/ibis-project/ibis). This means that it is framework agnostic, and can work with different backends. This example shows how to use it with [DuckDB](https://duckdb.org/). 4 | 5 | 6 | ```python 7 | import ibis 8 | import icanexplain as ice 9 | 10 | products_df = ice.datasets.load_product_footprints() 11 | con = ibis.connect("duckdb://example.ddb") 12 | con.create_table( 13 | "products", products_df, overwrite=True 14 | ) 15 | ``` 16 | 17 | 18 | 19 | 20 |
DatabaseTable: example.main.products
 21 |   year       int64
 22 |   category   string
 23 |   product_id string
 24 |   footprint  float64
 25 |   units      int64
 26 | 
27 | 28 | 29 | 30 | 31 | 32 | ```python 33 | con = ibis.connect("duckdb://example.ddb") 34 | con.list_tables() 35 | ``` 36 | 37 | 38 | 39 | 40 | ['products'] 41 | 42 | 43 | 44 | 45 | ```python 46 | ibis.options.interactive = True 47 | products = con.table("products") 48 | products.head() 49 | ``` 50 | 51 | 52 | 53 | 54 |
┏━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
 55 | ┃ year   category  product_id  footprint  units ┃
 56 | ┡━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
 57 | │ int64stringstringfloat64int64 │
 58 | ├───────┼──────────┼────────────┼───────────┼───────┤
 59 | │  2021DRESS   848be709  96.04803 │
 60 | │  2021DRESS   658f92b3  58.153367 │
 61 | │  2021DRESS   3a26f323  82.94240 │
 62 | │  2021DRESS   6221dca6  85.94432 │
 63 | │  2021DRESS   46864ac5  84.99816 │
 64 | └───────┴──────────┴────────────┴───────────┴───────┘
 65 | 
66 | 67 | 68 | 69 | 70 | 71 | ```python 72 | explainer = ice.SumExplainer( 73 | fact='footprint', 74 | count='units', 75 | group='category', 76 | period='year' 77 | ) 78 | explanation = explainer(products) 79 | explanation 80 | ``` 81 | 82 | 83 | 84 | 85 |
┏━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
 86 | ┃ year   category  inner          mix           ┃
 87 | ┡━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
 88 | │ int64stringfloat64float64       │
 89 | ├───────┼──────────┼───────────────┼───────────────┤
 90 | │  2022DRESS   3.931932e+06-1.881370e+07 │
 91 | │  2022JACKET  -1.510008e+07-9.238617e+07 │
 92 | │  2022PANTS   4.002506e+075.295190e+07 │
 93 | │  2022SHIRT   -1.484809e+06-5.791456e+06 │
 94 | │  2022SWEATER -2.676209e+071.181504e+07 │
 95 | │  2022TSHIRT  6.650940e+06-2.311836e+07 │
 96 | │  2023DRESS   -4.078094e+06-1.240339e+07 │
 97 | │  2023JACKET  -6.793317e+06-4.924036e+07 │
 98 | │  2023PANTS   -1.636299e+07-2.295608e+08 │
 99 | │  2023SHIRT   8.920908e+05-4.019144e+06 │
100 | │      │
101 | └───────┴──────────┴───────────────┴───────────────┘
102 | 
103 | 104 | 105 | 106 | 107 | 108 | ```python 109 | type(explanation) 110 | ``` 111 | 112 | 113 | 114 | 115 | ibis.expr.types.relations.Table 116 | 117 | 118 | 119 | 120 | ```python 121 | explanation.execute() 122 | ``` 123 | 124 | 125 | 126 | 127 |
128 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 |
yearcategoryinnermix
02022DRESS3.931932e+06-1.881370e+07
12022JACKET-1.510008e+07-9.238617e+07
22022PANTS4.002506e+075.295190e+07
32022SHIRT-1.484809e+06-5.791456e+06
42022SWEATER-2.676209e+071.181504e+07
52022TSHIRT6.650940e+06-2.311836e+07
62023DRESS-4.078094e+06-1.240339e+07
72023JACKET-6.793317e+06-4.924036e+07
82023PANTS-1.636299e+07-2.295608e+08
92023SHIRT8.920908e+05-4.019144e+06
102023SWEATER-5.701391e+06-1.130507e+08
112023TSHIRT-1.150391e+07-8.391323e+07
238 |
239 | 240 | 241 | 242 | 243 | ```python 244 | ibis.to_sql(explanation) 245 | ``` 246 | 247 | 248 | 249 | 250 | ```sql 251 | SELECT 252 | * 253 | FROM ( 254 | SELECT 255 | "t9"."year", 256 | "t9"."category", 257 | "t9"."count_lag" * ( 258 | "t9"."mean" - "t9"."mean_lag" 259 | ) AS "inner", 260 | ( 261 | "t9"."count" - "t9"."count_lag" 262 | ) * "t9"."mean" AS "mix" 263 | FROM ( 264 | SELECT 265 | "t8"."category", 266 | "t8"."year", 267 | "t8"."mean", 268 | "t8"."count", 269 | LAG("t8"."mean", 1) OVER (PARTITION BY "t8"."category" ORDER BY "t8"."year" ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS "mean_lag", 270 | LAG("t8"."count", 1) OVER (PARTITION BY "t8"."category" ORDER BY "t8"."year" ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS "count_lag" 271 | FROM ( 272 | SELECT 273 | "t7"."category", 274 | "t7"."year", 275 | COALESCE("t7"."mean", 0) AS "mean", 276 | COALESCE("t7"."count", 0) AS "count" 277 | FROM ( 278 | SELECT 279 | "t3"."category", 280 | "t4"."year", 281 | "t6"."mean", 282 | "t6"."count" 283 | FROM ( 284 | SELECT DISTINCT 285 | "t0"."category" 286 | FROM "products" AS "t0" 287 | ) AS "t3" 288 | CROSS JOIN ( 289 | SELECT DISTINCT 290 | "t0"."year" 291 | FROM "products" AS "t0" 292 | ) AS "t4" 293 | LEFT OUTER JOIN ( 294 | SELECT 295 | "t0"."category", 296 | "t0"."year", 297 | SUM("t0"."footprint" * "t0"."units") / SUM("t0"."units") AS "mean", 298 | SUM("t0"."units") AS "count" 299 | FROM "products" AS "t0" 300 | GROUP BY 301 | 1, 302 | 2 303 | ) AS "t6" 304 | ON "t3"."category" = "t6"."category" AND "t4"."year" = "t6"."year" 305 | ) AS "t7" 306 | ) AS "t8" 307 | ) AS "t9" 308 | ORDER BY 309 | "t9"."year" ASC, 310 | "t9"."category" ASC 311 | ) AS "t10" 312 | WHERE 313 | "t10"."year" IS NOT NULL 314 | AND "t10"."category" IS NOT NULL 315 | AND "t10"."inner" IS NOT NULL 316 | AND "t10"."mix" IS NOT NULL 317 | ``` 318 | 319 | 320 | -------------------------------------------------------------------------------- 
/icanexplain/__init__.py: -------------------------------------------------------------------------------- 1 | import abc 2 | import functools 3 | import operator 4 | 5 | import ibis # type: ignore[import] 6 | 7 | from . import datasets 8 | 9 | __all__ = ["FunnelExplainer", "MeanExplainer", "SumExplainer", "datasets"] 10 | 11 | 12 | def cartesian_product(table, columns): 13 | return functools.reduce( 14 | lambda x, y: x.cross_join(y), [table[[d]].distinct() for d in columns] 15 | ) 16 | 17 | 18 | def is_pandas_dataframe(obj): 19 | try: 20 | import pandas as pd 21 | 22 | return isinstance(obj, pd.DataFrame) 23 | except ImportError: 24 | # pandas is not installed 25 | return False 26 | 27 | 28 | def is_polars_dataframe(obj): 29 | try: 30 | import polars as pl # type: ignore[import] 31 | 32 | return isinstance(obj, pl.DataFrame) 33 | except ImportError: 34 | # polars is not installed 35 | return False 36 | 37 | 38 | def coerce_table(method): 39 | @functools.wraps(method) 40 | def _impl(self, table): 41 | if is_pandas_dataframe(table): 42 | return method(self, ibis.memtable(table[self._necessary_columns])) 43 | if is_polars_dataframe(table): 44 | return method(self, ibis.memtable(table[self._necessary_columns])) 45 | return method(self, table) 46 | 47 | return _impl 48 | 49 | 50 | class Unpacker(abc.ABC): 51 | def __init__( 52 | self, 53 | fact: str, 54 | period: str | list[str], 55 | group: str | list[str] | None = None, 56 | count: str | None = None, 57 | ): 58 | self.fact = fact 59 | self.period = [period] if isinstance(period, str) else period 60 | self.group = [group] if isinstance(group, str) else (group if group else []) 61 | self.count = count 62 | 63 | @abc.abstractproperty 64 | def _necessary_columns(self): 65 | pass 66 | 67 | @abc.abstractmethod 68 | def _explanation(self, table: ibis.Table): 69 | pass 70 | 71 | @abc.abstractmethod 72 | def _format(self, explanation: ibis.Table): 73 | pass 74 | 75 | def __call__(self, table): 76 | explanation = 
self._explanation(table) 77 | explanation_fmt = self._format(explanation) 78 | if is_pandas_dataframe(table): 79 | return explanation_fmt.execute().set_index([*self.period, *self.group]) 80 | if is_polars_dataframe(table): 81 | return explanation_fmt.execute() 82 | return explanation_fmt 83 | 84 | def plot(self, table): 85 | import altair as alt # type: ignore[import] 86 | import pandas as pd 87 | 88 | explanation = self._explanation(table) 89 | if not isinstance(explanation, pd.DataFrame): 90 | explanation = explanation.execute() 91 | 92 | charts = [] 93 | total = pd.DataFrame({"label": [], "total": []}) 94 | 95 | for i, (period, period_explanation) in enumerate( 96 | explanation.sort_values(self.period).groupby(self.period) 97 | ): 98 | if i > 0: 99 | contributions = pd.concat( 100 | [ 101 | ( 102 | period_explanation[[*self.period, *self.group, "inner"]] 103 | .rename(columns={"inner": "impact"}) 104 | .assign(kind="inner") 105 | ), 106 | ( 107 | period_explanation[[*self.period, *self.group, "mix"]] 108 | .rename(columns={"mix": "impact"}) 109 | .assign(kind="mix") 110 | ), 111 | ] 112 | ) 113 | prev_total_value = total["total"].iloc[-1] 114 | contributions = contributions.sort_values("impact", ascending=False) 115 | contributions["end"] = ( 116 | prev_total_value + contributions["impact"].cumsum() 117 | ) 118 | contributions["start"] = ( 119 | contributions["end"].shift(1).fillna(prev_total_value) 120 | ) 121 | label_cols = [*self.period, *self.group, "kind"] 122 | contributions["label"] = contributions[label_cols].agg( 123 | lambda x: " • ".join(x.astype(str)), axis="columns" 124 | ) 125 | contributions["is_positive"] = contributions["impact"] > 0 126 | 127 | chart = ( 128 | alt.Chart(contributions) 129 | .mark_bar() 130 | .encode( 131 | y=alt.Y("label:O", sort=None, axis=alt.Axis(title=None)), 132 | x=alt.X("start:Q", axis=alt.Axis(title=self.fact)), 133 | x2="end:Q", 134 | color=alt.Color( 135 | "is_positive:N", 136 | scale=alt.Scale( 137 | domain=[True, 
False], range=["green", "red"] 138 | ), 139 | legend=None, 140 | ), 141 | tooltip=[*label_cols, "impact"], 142 | ) 143 | ) 144 | charts.append(chart) 145 | 146 | total = pd.DataFrame( 147 | { 148 | "label": [period], 149 | "total": ( 150 | [ 151 | ( 152 | period_explanation["count"] 153 | * period_explanation["ratio"] 154 | ).sum() 155 | / period_explanation["count"].sum() 156 | ] 157 | if isinstance(self, MeanExplainer) 158 | else [(period_explanation["total"]).sum()] 159 | ), 160 | } 161 | ) 162 | 163 | chart = ( 164 | alt.Chart(total) 165 | .mark_bar() 166 | .encode(x="total:Q", y=alt.X("label:O", sort=None), tooltip=["total"]) 167 | ) 168 | charts.append(chart) 169 | 170 | return alt.layer(*charts).interactive() 171 | 172 | 173 | class SumExplainer(Unpacker): 174 | @property 175 | def _necessary_columns(self): 176 | return [ 177 | self.fact, 178 | *self.period, 179 | *self.group, 180 | *([self.count] if self.count else []), 181 | ] 182 | 183 | @coerce_table 184 | def _explanation(self, table): 185 | explanation = table.aggregate( 186 | by=[*self.group, *self.period], 187 | mean=( 188 | (table[self.fact] * table[self.count]).sum() / table[self.count].sum() 189 | if self.count 190 | else table[self.fact].mean() 191 | ), 192 | count=(table[self.count].sum() if self.count else table[self.fact].count()), 193 | ) 194 | 195 | # Artificially add rows with 0s when there are no data points for a given group at a given 196 | # period. For instance, there might not be any dentist claims in 2022, but if there are some in 197 | # 2021, then we want to have a 0 recorded so that we can measure the difference. 198 | cart = cartesian_product(table, [*self.group, *self.period]) 199 | explanation = cart.left_join(explanation, cart.columns)[explanation.columns] 200 | explanation = explanation.mutate( 201 | mean=explanation["mean"].fill_null(0), 202 | count=explanation["count"].fill_null(0), 203 | ) 204 | 205 | # Calculate lag values 206 | # Usually one or more group keys are provided.
But if they aren't, there is no need to 207 | # aggregate the data by the group keys. In this case, we can just calculate the lag values 208 | # along the period column. 209 | # TODO: it would be nice to have a more elegant way to handle this. 210 | if lag_key := [*self.group, *self.period[1:]]: 211 | explanation = ( 212 | explanation.group_by(lag_key) 213 | .order_by(self.period) 214 | .mutate( 215 | mean_lag=explanation["mean"].lag(1), 216 | count_lag=explanation["count"].lag(1), 217 | ) 218 | ) 219 | else: 220 | explanation = explanation.order_by(self.period).mutate( 221 | mean_lag=explanation["mean"].lag(1), 222 | count_lag=explanation["count"].lag(1), 223 | ) 224 | 225 | # Calculate the inner and mix effects 226 | return explanation.mutate( 227 | total=explanation["count"] * explanation["mean"], 228 | inner=explanation["count_lag"] 229 | * (explanation["mean"] - explanation["mean_lag"]), 230 | mix=(explanation["count"] - explanation["count_lag"]) * explanation["mean"], 231 | ) 232 | 233 | def _format(self, explanation): 234 | return ( 235 | explanation.order_by([*self.period, *self.group]) 236 | .select([*self.period, *self.group, "inner", "mix"]) 237 | .drop_null(how="any") 238 | ) 239 | 240 | 241 | class MeanExplainer(Unpacker): 242 | @property 243 | def _necessary_columns(self): 244 | return [ 245 | self.fact, 246 | *self.period, 247 | *self.group, 248 | *([self.count] if self.count else []), 249 | ] 250 | 251 | @coerce_table 252 | def _explanation(self, table): 253 | explanation = table.aggregate( 254 | by=[*self.group, *self.period], 255 | sum=( 256 | (table[self.fact] * table[self.count]).sum() 257 | if self.count 258 | else table[self.fact].sum() 259 | ), 260 | count=(table[self.count].sum() if self.count else table[self.fact].count()), 261 | ) 262 | 263 | # Artificially add rows with 0s when there are no data points for a given group at a given 264 | # period. 
For instance, there might not be any dentist claims in 2022, but if there are some in 265 | # 2021, then we want to have a 0 recorded so that we can measure the difference. 266 | cart = cartesian_product(table, [*self.group, *self.period]) 267 | explanation = cart.left_join(explanation, cart.columns)[explanation.columns] 268 | explanation = explanation.mutate( 269 | sum=explanation["sum"].fill_null(0), count=explanation["count"].fill_null(0) 270 | ) 271 | explanation = explanation.mutate( 272 | # https://ibis-project.org/reference/expression-generic#ibis.expr.types.generic.Value.nullif 273 | ratio=(explanation["sum"] / explanation["count"].nullif(0)).fill_null(0) 274 | ) 275 | explanation = explanation.mutate(ratio=explanation["ratio"]) 276 | 277 | yearly_figures = explanation.group_by(self.period).aggregate( 278 | sum_sum=explanation["sum"].sum(), count_sum=explanation["count"].sum() 279 | ) 280 | explanation = explanation.left_join(yearly_figures, self.period) 281 | explanation = explanation.mutate( 282 | share=explanation["count"] / explanation["count_sum"], 283 | global_ratio=explanation["sum_sum"] / explanation["count_sum"], 284 | ) 285 | 286 | # Calculate lag values 287 | # 🐲 It's a bit tricky: when more than one period column is provided, it 288 | # affects the lag calculation. For instance, if we have year and month, then we want to 289 | # calculate the lag for the same month in the previous year. This only applies to the case 290 | # where the period is a list of columns.
291 | if by := [*self.group, *self.period[1:]]: 292 | explanation = ( 293 | explanation.group_by(by) 294 | .order_by(self.period) 295 | .mutate( 296 | ratio_lag=explanation["ratio"].lag(1), 297 | share_lag=explanation["share"].lag(1), 298 | global_ratio_lag=explanation["global_ratio"].lag(1), 299 | ) 300 | ) 301 | else: 302 | explanation = explanation.order_by(self.period).mutate( 303 | ratio_lag=explanation["ratio"].lag(1), 304 | share_lag=explanation["share"].lag(1), 305 | global_ratio_lag=explanation["global_ratio"].lag(1), 306 | ) 307 | 308 | # Calculate the inner and mix effects 309 | return explanation.mutate( 310 | inner=explanation["share"] 311 | * (explanation["ratio"] - explanation["ratio_lag"]), 312 | mix=(explanation["share"] - explanation["share_lag"]) 313 | * (explanation["ratio_lag"] - explanation["global_ratio"]), 314 | ) 315 | 316 | def _format(self, explanation): 317 | return ( 318 | explanation.order_by([*self.period, *self.group]) 319 | .select([*self.period, *self.group, "inner", "mix"]) 320 | .drop_null(how="any") 321 | ) 322 | 323 | 324 | class FunnelExplainer(Unpacker): 325 | def __init__( 326 | self, funnel: list[str], period: str | list[str], group: str | list[str] 327 | ): 328 | self.funnel = funnel 329 | self.period = [period] if isinstance(period, str) else period 330 | self.group = [group] if isinstance(group, str) else group 331 | 332 | @property 333 | def _necessary_columns(self): 334 | return [*self.funnel, *self.period, *self.group] 335 | 336 | @coerce_table 337 | def _explanation(self, table): 338 | # Sum events by period and dimensions 339 | explanation = table.group_by([*self.period, *self.group]).aggregate( 340 | **{step: table[step].sum() for step in self.funnel} 341 | ) 342 | ratios = { 343 | (f"{num}_over_{den}" if den else num): (num, den) 344 | for den, num in [(None, self.funnel[0]), *zip(self.funnel, self.funnel[1:])] 345 | } 346 | ratio_names = list(ratios) 347 | 348 | explanation = explanation.mutate( 349 | **{ 350 | 
ratio_name: explanation[num] / explanation[den] 351 | for ratio_name, (num, den) in ratios.items() 352 | if den 353 | } 354 | ) 355 | 356 | explanation = ( 357 | explanation.group_by([*self.group, *self.period[1:]]) 358 | .order_by(self.period) 359 | .mutate( 360 | **{ 361 | f"{ratio_name}_lag": explanation[ratio_name].lag(1) 362 | for ratio_name in ratios 363 | } 364 | ) 365 | ) 366 | 367 | explanation = explanation.mutate( 368 | **{ 369 | f"{ratio_name}_contribution": functools.reduce( 370 | operator.mul, 371 | [ 372 | *[explanation[ratio_name] for ratio_name in ratio_names[:i]], 373 | explanation[ratio_names[i]] 374 | - explanation[f"{ratio_names[i]}_lag"], 375 | *[ 376 | explanation[f"{ratio_name}_lag"] 377 | for ratio_name in ratio_names[i + 1 :] 378 | ], 379 | ], 380 | ) 381 | for i, ratio_name in enumerate(ratio_names) 382 | } 383 | ) 384 | 385 | return explanation 386 | 387 | def _format(self, explanation): 388 | return ( 389 | explanation.order_by([*self.period, *self.group]) 390 | .select( 391 | [ 392 | *self.period, 393 | *self.group, 394 | *[ 395 | col 396 | for col in explanation.schema() 397 | if col.endswith("_contribution") 398 | ], 399 | ] 400 | ) 401 | .drop_null(how="any") 402 | ) 403 | -------------------------------------------------------------------------------- /docs/examples/simple-revenue-funnel.md: -------------------------------------------------------------------------------- 1 | # Simple revenue funnel 🛒 2 | 3 | We look at a toy website funnel in this example. Imagine a fictitious website that sells stuff. Users go to the website, are presented with items, can add them to their cart, and then can buy them. 
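A funnel like this obeys a simple multiplicative identity, which is what the decomposition later in this example exploits: revenue equals the number of impressions multiplied by the click rate, the conversion rate, and the average spend per conversion. As a quick sanity check, here is that identity verified on the January 2018 figures for group A (a standalone sketch, not part of the library):

```python
# Funnel identity: revenue = impressions
#                            * (clicks / impressions)
#                            * (conversions / clicks)
#                            * (revenue / conversions)
impressions, clicks, conversions, revenue = 1000, 150, 120, 8600.0

click_rate = clicks / impressions       # 0.15
conversion_rate = conversions / clicks  # 0.8
average_spend = revenue / conversions   # dollars per conversion

# The intermediate terms telescope, so the product recovers total revenue.
reconstructed = impressions * click_rate * conversion_rate * average_spend
assert abs(reconstructed - revenue) < 1e-6
```

Because this identity holds in every period, attributing a change in revenue between two periods to each factor amounts to varying one factor at a time while holding the others fixed — which is what `FunnelExplainer` automates.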
4 | 5 | 6 | ```python 7 | import pandas as pd 8 | import locale 9 | 10 | locale.setlocale(locale.LC_MONETARY, 'en_US.UTF-8') 11 | def fmt_currency(x): 12 | return locale.currency(x, grouping=True) 13 | 14 | traffic = pd.DataFrame({ 15 | 'date': ['2018-01-01', '2018-01-01', '2018-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2018-02-01', '2018-02-01', '2018-02-01', '2019-02-01', '2019-02-01', '2019-02-01'], 16 | 'group': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'], 17 | 'impressions': [1000, 2000, 2500, 1000, 2150, 2000, 50, 2000, 2500, 2500, 2150, 2000], 18 | 'clicks': [150, 150, 250, 120, 200, 400, 20, 300, 250, 1000, 323, 320], 19 | 'conversions': [120, 150, 125, 160, 145, 166, 10, 150, 125, 500, 145, 166], 20 | 'revenue': ['$8,600', '$9,400', '$10,750', '$9,055', '$8,739', '$10,147', '$500', '$11,400', '$8,750', '$50,000', '$10,739', '$12,147'], 21 | }) 22 | traffic['date'] = pd.to_datetime(traffic['date']) 23 | traffic['revenue'] = traffic['revenue'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float) 24 | traffic.style.format({'revenue': fmt_currency, 'date': lambda x: x.strftime('%Y-%m-%d')}, na_rep='N/A') 25 | ``` 26 | 27 | 28 | 29 | 30 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 |
 dategroupimpressionsclicksconversionsrevenue
02018-01-01A1000150120$8,600.00
12018-01-01B2000150150$9,400.00
22018-01-01C2500250125$10,750.00
32019-01-01A1000120160$9,055.00
42019-01-01B2150200145$8,739.00
52019-01-01C2000400166$10,147.00
62018-02-01A502010$500.00
72018-02-01B2000300150$11,400.00
82018-02-01C2500250125$8,750.00
92019-02-01A25001000500$50,000.00
102019-02-01B2150323145$10,739.00
112019-02-01C2000320166$12,147.00
155 | 156 | 157 | 158 | 159 | The users are bucketed into 3 groups: A, B, C. We've also bucketed impressions/clicks/conversions/revenue figures by month of the year. 160 | 161 | We're interested in understanding how the metrics evolve over time. The basic method is to calculate each metric separately. To keep things simple, we can do this for each year. 162 | 163 | 164 | ```python 165 | pd.DataFrame({ 166 | 'impressions': ( 167 | traffic 168 | .assign(year=traffic.date.dt.year) 169 | .groupby('year') 170 | .impressions.sum() 171 | ), 172 | 'click_rate': ( 173 | traffic 174 | .assign(year=traffic.date.dt.year) 175 | .groupby('year') 176 | .apply(lambda x: x.clicks.sum() / x.impressions.sum(), include_groups=False) 177 | ), 178 | 'conversion_rate': ( 179 | traffic 180 | .assign(year=traffic.date.dt.year) 181 | .groupby('year') 182 | .apply(lambda x: x.conversions.sum() / x.clicks.sum(), include_groups=False) 183 | ), 184 | 'average_spend': ( 185 | traffic 186 | .assign(year=traffic.date.dt.year) 187 | .groupby('year') 188 | .apply(lambda x: x.revenue.sum() / x.conversions.sum(), include_groups=False) 189 | ), 190 | 'revenue': ( 191 | traffic 192 | .assign(year=traffic.date.dt.year) 193 | .groupby('year') 194 | .revenue.sum() 195 | ) 196 | }).style.format({'average_spend': fmt_currency, 'revenue': fmt_currency}, na_rep='') 197 | ``` 198 | 199 | 200 | 201 | 202 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 |
 impressionsclick_rateconversion_rateaverage_spendrevenue
year     
2018100500.1114430.607143$72.65$49,400.00
2019118000.2002540.542531$78.65$100,827.00
242 | 243 | 244 | 245 | 246 | In and of itself, this is already quite interesting. However, what we really want to know is how the change of each metric contributes to the change in revenue. This is where icanexplain comes in. 247 | 248 | 249 | ```python 250 | import icanexplain as ice 251 | 252 | explainer = ice.FunnelExplainer( 253 | funnel=['impressions', 'clicks', 'conversions', 'revenue'], 254 | period='year', 255 | group=['month', 'group'] 256 | ) 257 | traffic = traffic.assign( 258 | month=traffic.date.dt.month, 259 | year=traffic.date.dt.year 260 | ) 261 | explanation = explainer(traffic) 262 | explanation.style.format(fmt_currency).set_properties(**{'text-align': 'right'}) 263 | ``` 264 | 265 | 266 | 267 | 268 | 273 | 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 | 282 | 283 | 284 | 285 | 286 | 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 | 302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | 318 | 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | 330 | 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 339 | 340 | 341 |
   impressions_contributionclicks_over_impressions_contributionconversions_over_clicks_contributionrevenue_over_conversions_contribution
yearmonthgroup    
20191A$0.00-$1,720.00$4,586.67-$2,411.67
B$705.00$2,428.33-$3,446.67-$347.67
C-$2,150.00$8,600.00-$2,924.00-$4,129.00
2A$24,500.00$0.00$0.00$25,000.00
B$855.00$19.00-$1,254.00-$281.00
C-$1,750.00$4,200.00$420.00$527.00
342 | 343 | 344 | 345 | 346 | This is powerful, because it allows us to understand the drivers of revenue growth. For example, between January 2018 and January 2019, revenue went up by $8,600 due to an increase in clicks for group C. This is more insightful than just saying that their click rate went up. 347 | 348 | One thing to keep in mind is that contributions sum up to the overall difference between two periods. This means that it's easy to unit test that the contributions are correct: 349 | 350 | 351 | ```python 352 | ( 353 | explanation 354 | .groupby('year').sum().sum(axis=1) 355 | .to_frame('sum') 356 | .style.format(fmt_currency) 357 | ) 358 | ``` 359 | 360 | 361 | 362 | 363 | 365 | 366 | 367 | 368 | 369 | 370 | 371 | 372 | 373 | 374 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 |
 sum
year 
2019$51,427.00
383 | 384 | 385 | 386 | 387 | Of course, it would be more interesting to apply this methodology to some real data. One example is the [Google Analytics dataset sample](https://developers.google.com/analytics/bigquery/web-ecommerce-demo-dataset), which is publicly available in BigQuery. 388 | -------------------------------------------------------------------------------- /docs/examples/fashion-brand-co2e.md: -------------------------------------------------------------------------------- 1 | # Fashion brand CO2e emissions 👟 2 | 3 | Fashion brands increasingly have to be aware of, and report on, their environmental footprint. 4 | 5 | The following dataset comes from a real fashion brand, and has been anonymized. Each row represents a product manufactured in a given year. 6 | 7 | 8 | ```python 9 | import icanexplain as ice 10 | 11 | def fmt_CO2e(kg): 12 | if abs(kg) < 1e3: 13 | return f'{kg:,.2f}kgCO2e' 14 | return f'{kg / 1e6:,.1f}ktCO2e' 15 | 16 | products = ice.datasets.load_product_footprints() 17 | products.sample(5).style.format({'footprint': fmt_CO2e, 'units': '{:,d}'}) 18 | ``` 19 | 20 | 21 | 22 | 23 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 |
 yearcategoryproduct_idfootprintunits
905122022TSHIRTcea264427.62kgCO2e1,486
460752022JACKETd17ec41538.43kgCO2e2,254
518492022PANTSd5531c9b41.55kgCO2e8
128182021PANTS335f31e313.53kgCO2e4
648702022PANTSe5562fe829.16kgCO2e576
79 | 80 | 81 | 82 | 83 | The `footprint` column indicates the product's carbon footprint in kgCO2e. The `units` column corresponds to the number of units produced. 84 | 85 | Companies usually report their emissions on a yearly basis. A useful yearly summary is the average footprint per unit: multiply each product's footprint by its number of units, sum the results, and divide by the total number of units produced. 86 | 87 | 88 | ```python 89 | ( 90 | products 91 | .groupby('year') 92 | .apply(lambda g: (g['footprint'] * g['units']).sum() / g['units'].sum(), include_groups=False) 93 | .to_frame('average') 94 | .assign(diff=lambda x: x.average.diff()) 95 | .style.format(fmt_CO2e, na_rep='') 96 | ) 97 | ``` 98 | 99 | 100 | 101 | 102 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 |
 averagediff
year  
202121.95kgCO2e
202221.71kgCO2e-0.24kgCO2e
202322.74kgCO2e1.03kgCO2e
135 | 136 | 137 | 138 | 139 | The average footprint went down between 2021 and 2022. It then went back up in 2023. Of course, we want to understand why. When they see this, fashion brands all ask the same question: why? 140 | 141 | The overall average footprint can change for two reasons: 142 | 143 | 1. The average footprint per product category evolved. 144 | 2. The mix of product categories evolved. 145 | 146 | The second reason is called the *mix effect*. For instance, let's say t-shirts have a lower footprint than jackets. If the share of jackets produced in 2023 is higher than in 2022, the average footprint will go up. 147 | 148 | The jackets in 2023 aren't necessarily the same as those of 2022. They could be more sustainable, and have a lower footprint. This is the tricky part: we need to disentangle the mix effect from the evolution of the footprint of each product category. That is the value proposition of this package. 149 | 150 | 151 | ```python 152 | explainer = ice.MeanExplainer( 153 | fact='footprint', 154 | count='units', 155 | period='year', 156 | group='category', 157 | ) 158 | explanation = explainer(products) 159 | explanation.style.format({'inner': fmt_CO2e, 'mix': fmt_CO2e}, na_rep='') 160 | ``` 161 | 162 | 163 | 164 | 165 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 |
  innermix
yearcategory  
2022DRESS0.05kgCO2e-0.14kgCO2e
JACKET-0.17kgCO2e-0.69kgCO2e
PANTS0.61kgCO2e0.20kgCO2e
SHIRT-0.02kgCO2e0.00kgCO2e
SWEATER-0.39kgCO2e-0.09kgCO2e
TSHIRT0.08kgCO2e0.30kgCO2e
2023DRESS-0.08kgCO2e0.51kgCO2e
JACKET-0.13kgCO2e0.97kgCO2e
PANTS-0.22kgCO2e-0.09kgCO2e
SHIRT0.02kgCO2e-0.03kgCO2e
SWEATER-0.06kgCO2e0.36kgCO2e
TSHIRT-0.16kgCO2e-0.06kgCO2e
247 | 248 | 249 | 250 | 251 | Here's the meaning of each column: 252 | 253 | - `inner` is the contribution of the change in each category's average footprint per unit. A negative inner value means the category's footprint per unit went down, which pulled the overall average down. For instance, the inner contributions are negative or near zero for every category in 2023: each category improved, even though the overall average went up. 254 | - `mix` is the contribution of the change in each category's share of the units produced. A negative mix value means production shifted away from that category in a way that pulled the overall average down. 255 | 256 | A convenient way to read these values is to use a waterfall chart. 257 | 258 | 259 | ```python 260 | explainer.plot(products) 261 | ``` 262 | 263 | 264 | 265 | 266 | 267 | 278 |
279 | 332 | 333 | 334 | 335 | This is better than reporting the average footprint and the number of units produced separately. It's more informative to quantify how each effect contributed to the change in the average footprint. The waterfall confirms that the 2022 decrease is mostly a mix effect: production shifted towards lower-footprint categories. In 2023, the mix shifted back towards higher-footprint categories, which more than offset the fact that the average footprint went down within each category. Importantly, each one of these effects is calculated, and not just assumed. 336 | 337 | It's natural to want to deepen the analysis. For instance: 338 | 339 | 1. Why is there a significant inner contribution for pants in 2022? Is it because the materials are less sustainable? Or because the pants got heavier? 340 | 2. The 2023 increase is mainly a mix effect. Can it be broken down into marketing segments? For instance, is it mainly driven by online or in-person sales? How does it break down by country? 341 | 342 | These questions hint at the interactive aspect of this kind of analysis. Once you break down a metric's evolution along a dimension, the next steps are to break down the metric (question 1) and/or include another dimension (question 2). 343 | 344 |
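The guarantee that inner and mix contributions sum to the change in the overall average can be checked by hand. The sketch below is library-free and uses made-up numbers; the attribution convention shown (current-period shares for the inner effect, baseline averages for the mix effect) is one standard choice and may differ in detail from what icanexplain implements.

```python
# Sketch of a mean decomposition into "inner" and "mix" effects.
# The per-category data is hypothetical: (units produced, average
# footprint per unit in kgCO2e).

year_0 = {'TSHIRT': (1000, 5.0), 'JACKET': (400, 40.0)}
year_1 = {'TSHIRT': (800, 4.5), 'JACKET': (600, 38.0)}

def overall_mean(data):
    total_units = sum(n for n, _ in data.values())
    return sum(n * mu for n, mu in data.values()) / total_units

def decompose(before, after):
    n0 = sum(n for n, _ in before.values())
    n1 = sum(n for n, _ in after.values())
    effects = {}
    for cat in before:
        w0, mu0 = before[cat][0] / n0, before[cat][1]
        w1, mu1 = after[cat][0] / n1, after[cat][1]
        inner = w1 * (mu1 - mu0)  # the category's own average changed
        mix = (w1 - w0) * mu0     # the category's share of units changed
        effects[cat] = (inner, mix)
    return effects

effects = decompose(year_0, year_1)
diff = overall_mean(year_1) - overall_mean(year_0)

# The contributions telescope: summing inner + mix over all categories
# reproduces the change in the overall average exactly.
assert abs(sum(i + m for i, m in effects.values()) - diff) < 1e-9
```

Because the two terms telescope when summed over categories, the identity holds for any data, which is exactly what makes this kind of decomposition easy to unit test.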
345 | -------------------------------------------------------------------------------- /docs/examples/simple-revenue-funnel.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Simple revenue funnel 🛒" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "We look at a toy website funnel in this example. Imagine a fictitious website that sells stuff. Users go to the website, are presented with items, can add them to their cart, and then can buy them." 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": { 21 | "execution": { 22 | "iopub.execute_input": "2024-09-25T08:40:22.306476Z", 23 | "iopub.status.busy": "2024-09-25T08:40:22.305631Z", 24 | "iopub.status.idle": "2024-09-25T08:40:22.376497Z", 25 | "shell.execute_reply": "2024-09-25T08:40:22.376158Z" 26 | } 27 | }, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/html": [ 32 | "\n", 34 | "\n", 35 | " \n", 36 | " \n", 37 | " \n", 38 | " \n", 39 | " \n", 40 | " \n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | 
" \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | "
 dategroupimpressionsclicksconversionsrevenue
02018-01-01A1000150120$8,600.00
12018-01-01B2000150150$9,400.00
22018-01-01C2500250125$10,750.00
32019-01-01A1000120160$9,055.00
42019-01-01B2150200145$8,739.00
52019-01-01C2000400166$10,147.00
62018-02-01A502010$500.00
72018-02-01B2000300150$11,400.00
82018-02-01C2500250125$8,750.00
92019-02-01A25001000500$50,000.00
102019-02-01B2150323145$10,739.00
112019-02-01C2000320166$12,147.00
\n" 157 | ], 158 | "text/plain": [ 159 | "" 160 | ] 161 | }, 162 | "execution_count": 1, 163 | "metadata": {}, 164 | "output_type": "execute_result" 165 | } 166 | ], 167 | "source": [ 168 | "import pandas as pd\n", 169 | "import locale\n", 170 | "\n", 171 | "locale.setlocale(locale.LC_MONETARY, 'en_US.UTF-8')\n", 172 | "def fmt_currency(x):\n", 173 | " return locale.currency(x, grouping=True)\n", 174 | "\n", 175 | "traffic = pd.DataFrame({\n", 176 | " 'date': ['2018-01-01', '2018-01-01', '2018-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2018-02-01', '2018-02-01', '2018-02-01', '2019-02-01', '2019-02-01', '2019-02-01'],\n", 177 | " 'group': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],\n", 178 | " 'impressions': [1000, 2000, 2500, 1000, 2150, 2000, 50, 2000, 2500, 2500, 2150, 2000],\n", 179 | " 'clicks': [150, 150, 250, 120, 200, 400, 20, 300, 250, 1000, 323, 320],\n", 180 | " 'conversions': [120, 150, 125, 160, 145, 166, 10, 150, 125, 500, 145, 166],\n", 181 | " 'revenue': ['$8,600', '$9,400', '$10,750', '$9,055', '$8,739', '$10,147', '$500', '$11,400', '$8,750', '$50,000', '$10,739', '$12,147'],\n", 182 | "})\n", 183 | "traffic['date'] = pd.to_datetime(traffic['date'])\n", 184 | "traffic['revenue'] = traffic['revenue'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)\n", 185 | "traffic.style.format({'revenue': fmt_currency, 'date': lambda x: x.strftime('%Y-%m-%d')}, na_rep='N/A')" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "The users are bucketed into 3 groups: A, B, C. We've also bucketed impressions/clicks/conversions/revenue figures by month of the year.\n", 193 | "\n", 194 | "We're interested in understanding how the metrics evolve over time. The basic method is to calculate each metric separately. To keep things simple, we can do this for each year." 
195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 2, 200 | "metadata": { 201 | "execution": { 202 | "iopub.execute_input": "2024-09-25T08:40:22.378509Z", 203 | "iopub.status.busy": "2024-09-25T08:40:22.378303Z", 204 | "iopub.status.idle": "2024-09-25T08:40:22.398464Z", 205 | "shell.execute_reply": "2024-09-25T08:40:22.398209Z" 206 | } 207 | }, 208 | "outputs": [ 209 | { 210 | "data": { 211 | "text/html": [ 212 | "\n", 214 | "\n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | "
 impressionsclick_rateconversion_rateaverage_spendrevenue
year     
2018100500.1114430.607143$72.65$49,400.00
2019118000.2002540.542531$78.65$100,827.00
\n" 252 | ], 253 | "text/plain": [ 254 | "" 255 | ] 256 | }, 257 | "execution_count": 2, 258 | "metadata": {}, 259 | "output_type": "execute_result" 260 | } 261 | ], 262 | "source": [ 263 | "pd.DataFrame({\n", 264 | " 'impressions': (\n", 265 | " traffic\n", 266 | " .assign(year=traffic.date.dt.year)\n", 267 | " .groupby('year')\n", 268 | " .impressions.sum()\n", 269 | " ),\n", 270 | " 'click_rate': (\n", 271 | " traffic\n", 272 | " .assign(year=traffic.date.dt.year)\n", 273 | " .groupby('year')\n", 274 | " .apply(lambda x: x.clicks.sum() / x.impressions.sum(), include_groups=False)\n", 275 | " ),\n", 276 | " 'conversion_rate': (\n", 277 | " traffic\n", 278 | " .assign(year=traffic.date.dt.year)\n", 279 | " .groupby('year')\n", 280 | " .apply(lambda x: x.conversions.sum() / x.clicks.sum(), include_groups=False)\n", 281 | " ),\n", 282 | " 'average_spend': (\n", 283 | " traffic\n", 284 | " .assign(year=traffic.date.dt.year)\n", 285 | " .groupby('year')\n", 286 | " .apply(lambda x: x.revenue.sum() / x.conversions.sum(), include_groups=False)\n", 287 | " ),\n", 288 | " 'revenue': (\n", 289 | " traffic\n", 290 | " .assign(year=traffic.date.dt.year)\n", 291 | " .groupby('year')\n", 292 | " .revenue.sum()\n", 293 | " )\n", 294 | "}).style.format({'average_spend': fmt_currency, 'revenue': fmt_currency}, na_rep='')" 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "In and of itself, this is already quite interesting. However, what we really want to know is how the change of each metric contributes to the change in revenue. This is where icanexplain comes in." 
302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": 3, 307 | "metadata": { 308 | "execution": { 309 | "iopub.execute_input": "2024-09-25T08:40:22.399934Z", 310 | "iopub.status.busy": "2024-09-25T08:40:22.399828Z", 311 | "iopub.status.idle": "2024-09-25T08:40:22.728790Z", 312 | "shell.execute_reply": "2024-09-25T08:40:22.728507Z" 313 | } 314 | }, 315 | "outputs": [ 316 | { 317 | "data": { 318 | "text/html": [ 319 | "\n", 324 | "\n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | "
   impressions_contributionclicks_over_impressions_contributionconversions_over_clicks_contributionrevenue_over_conversions_contribution
yearmonthgroup    
20191A$0.00-$1,720.00$4,586.67-$2,411.67
B$705.00$2,428.33-$3,446.67-$347.67
C-$2,150.00$8,600.00-$2,924.00-$4,129.00
2A$24,500.00$0.00$0.00$25,000.00
B$855.00$19.00-$1,254.00-$281.00
C-$1,750.00$4,200.00$420.00$527.00
\n" 393 | ], 394 | "text/plain": [ 395 | "" 396 | ] 397 | }, 398 | "execution_count": 3, 399 | "metadata": {}, 400 | "output_type": "execute_result" 401 | } 402 | ], 403 | "source": [ 404 | "import icanexplain as ice\n", 405 | "\n", 406 | "explainer = ice.FunnelExplainer(\n", 407 | "    funnel=['impressions', 'clicks', 'conversions', 'revenue'],\n", 408 | "    period='year',\n", 409 | "    group=['month', 'group']\n", 410 | ")\n", 411 | "traffic = traffic.assign(\n", 412 | "    month=traffic.date.dt.month,\n", 413 | "    year=traffic.date.dt.year\n", 414 | ")\n", 415 | "explanation = explainer(traffic)\n", 416 | "explanation.style.format(fmt_currency).set_properties(**{'text-align': 'right'})" 417 | ] 418 | }, 419 | { 420 | "cell_type": "markdown", 421 | "metadata": {}, 422 | "source": [ 423 | "This is powerful, because it allows us to understand the drivers of revenue growth. For example, between January 2018 and January 2019, revenue went up by $8,600 due to an increase in clicks for group C. This is more insightful than just saying that their click rate went up." 424 | ] 425 | }, 426 | { 427 | "cell_type": "markdown", 428 | "metadata": {}, 429 | "source": [ 430 | "One thing to keep in mind is that contributions sum up to the overall difference between two periods. 
This means that it's easy to unit test that the contributions are correct:" 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": 4, 436 | "metadata": { 437 | "execution": { 438 | "iopub.execute_input": "2024-09-25T08:40:22.730483Z", 439 | "iopub.status.busy": "2024-09-25T08:40:22.730381Z", 440 | "iopub.status.idle": "2024-09-25T08:40:22.742096Z", 441 | "shell.execute_reply": "2024-09-25T08:40:22.741750Z" 442 | } 443 | }, 444 | "outputs": [ 445 | { 446 | "data": { 447 | "text/html": [ 448 | "\n", 450 | "\n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | "
 sum
year 
2019$51,427.00
\n" 468 | ], 469 | "text/plain": [ 470 | "" 471 | ] 472 | }, 473 | "execution_count": 4, 474 | "metadata": {}, 475 | "output_type": "execute_result" 476 | } 477 | ], 478 | "source": [ 479 | "(\n", 480 | " explanation\n", 481 | " .groupby('year').sum().sum(axis=1)\n", 482 | " .to_frame('sum')\n", 483 | " .style.format(fmt_currency)\n", 484 | ")" 485 | ] 486 | }, 487 | { 488 | "cell_type": "markdown", 489 | "metadata": {}, 490 | "source": [ 491 | "Of course, it would be more interesting to apply this methodology to some real data. One example is the [Google Analytics dataset sample](https://developers.google.com/analytics/bigquery/web-ecommerce-demo-dataset) which is publicly available in BigQuery. " 492 | ] 493 | } 494 | ], 495 | "metadata": { 496 | "kernelspec": { 497 | "display_name": "Python 3", 498 | "language": "python", 499 | "name": "python3" 500 | }, 501 | "language_info": { 502 | "codemirror_mode": { 503 | "name": "ipython", 504 | "version": 3 505 | }, 506 | "file_extension": ".py", 507 | "mimetype": "text/x-python", 508 | "name": "python", 509 | "nbconvert_exporter": "python", 510 | "pygments_lexer": "ipython3", 511 | "version": "3.11.4" 512 | } 513 | }, 514 | "nbformat": 4, 515 | "nbformat_minor": 2 516 | } 517 | -------------------------------------------------------------------------------- /docs/examples/iowa-whiskey-sales.md: -------------------------------------------------------------------------------- 1 | # Iowa whiskey sales 🥃 2 | 3 | Let's look at whiskey sales in Iowa. This is a subset of the data from the [Iowa Liquor Sales dataset](https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy). 
4 | 5 | 6 | ```python 7 | import icanexplain as ice 8 | 9 | sales = ice.datasets.load_iowa_whiskey_sales() 10 | sales.head().style.format() 11 | ``` 12 | 13 | 14 | 15 | 16 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 |
 datecategoryvendorsales_amountprice_per_bottlebottles_soldbottle_volume_mlyear
02012-06-04CANADIAN WHISKIESCONSTELLATION WINE COMPANY, INC.94.02000015.670000617502012
12016-01-05STRAIGHT BOURBON WHISKIESCAMPARI(SKYY)18.7600009.38000023752016
22016-05-25CANADIAN WHISKIESDIAGEO AMERICAS11.03000011.03000013002016
32016-01-20CANADIAN WHISKIESPHILLIPS BEVERAGE COMPANY33.84000011.28000037502016
42012-03-19CANADIAN WHISKIESCONSTELLATION WINE COMPANY, INC.94.02000015.670000617502012
90 | 91 | 92 | 93 | 94 | The `sales_amount` column represents the bill a customer paid for a given transaction. We can sum it and group by year to see how the total sales amount evolves over time. 95 | 96 | 97 | ```python 98 | import locale 99 | 100 | locale.setlocale(locale.LC_MONETARY, 'en_US.UTF-8') 101 | def fmt_currency(x): 102 | return locale.currency(x, grouping=True) 103 | 104 | ( 105 | sales.groupby('year')['sales_amount'] 106 | .sum() 107 | .to_frame() 108 | .assign(diff=lambda x: x.sales_amount.diff()) 109 | .style.format(lambda x: fmt_currency(x) if x > 0 else '') 110 | ) 111 | ``` 112 | 113 | 114 | 115 | 116 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 |
 sales_amountdiff
year  
2012$1,842,098.86
2016$2,298,505.88$456,407.02
2020$3,378,164.43$1,079,658.55
149 | 150 | 151 | 152 | 153 | Ok, but why? Well, we can use icanexplain to break down the evolution into two effects: 154 | 155 | 1. The inner effect: how much the average transaction value changed. 156 | 2. The mix effect: how much the number of transactions changed. 157 | 158 | 159 | ```python 160 | import icanexplain as ice 161 | 162 | explainer = ice.SumExplainer( 163 | fact='sales_amount', 164 | period='year', 165 | group='category' 166 | ) 167 | explanation = explainer(sales) 168 | ( 169 | explanation.style 170 | .format(lambda x: fmt_currency(x) if x > 0 else '$0') 171 | .set_properties(**{'text-align': 'right'}) 172 | ) 173 | ``` 174 | 175 | 176 | 177 | 178 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 | 282 |
  innermix
yearcategory  
2016BLENDED WHISKIES$17,854.43$7,356.77
CANADIAN WHISKIES$0$225,902.66
CORN WHISKIES$0$4,113.90
IRISH WHISKIES$22,144.48$75,122.83
SCOTCH WHISKIES$19,591.97$0
SINGLE BARREL BOURBON WHISKIES$1,852.03$6,375.43
STRAIGHT BOURBON WHISKIES$107,144.93$97,934.50
STRAIGHT RYE WHISKIES$0$0
2020BLENDED WHISKIES$83,342.60$59,768.58
CANADIAN WHISKIES$224,022.62$149,363.35
CORN WHISKIES$1,517.48$1,453.26
IRISH WHISKIES$0$67,344.41
SCOTCH WHISKIES$19,840.48$0
SINGLE BARREL BOURBON WHISKIES$11,958.32$3,819.27
STRAIGHT BOURBON WHISKIES$167,864.46$268,064.74
STRAIGHT RYE WHISKIES$0$64,056.43
283 | 284 | 285 | 286 | 287 | For instance, we see that the average transaction amount for blended whiskies contributed a $17,854 increase in sales from 2012 to 2016. This is the inner effect. The mix effect for blended whiskies, on the other hand, contributed a $7,356 increase in sales. 288 | 289 | Here's another example: the mix effect for Canadian whiskies is $225,902. This value represents the increase due to the extra Canadian whiskey transactions. The inner effect, on the other hand, is $0. This means that the average transaction value for Canadian whiskies did not change between 2012 and 2016, and therefore didn't contribute to the increase in sales. 290 | 291 | A visual way to interpret the above table is to use a waterfall chart. The idea is that the contributions sum to the difference between two periods. In this case, the difference in sales from 2012 to 2016 is $456,407. The waterfall chart shows how the inner and mix effects contributed to this difference. 292 | 293 | 294 | ```python 295 | explainer.plot(sales) 296 | ``` 297 | 298 | 299 | 300 | 301 | 302 | 313 |
314 | 367 | 368 | 369 | --------------------------------------------------------------------------------
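The same kind of accounting identity underpins the waterfall above: total sales are transaction count times average transaction value, summed over categories. Here is a minimal, library-free sketch with made-up figures (the category names echo the table, but the numbers are hypothetical, and icanexplain's exact attribution convention may differ): the inner effect varies the average value at the new transaction counts, while the mix effect varies the counts at the old average value.

```python
# Sketch of decomposing a change in total sales into an "inner" effect
# (average transaction value) and a "mix" effect (number of transactions).
# Hypothetical data: (transaction count, average transaction value in $).

before = {'BLENDED': (200, 15.0), 'CANADIAN': (500, 20.0)}
after = {'BLENDED': (220, 16.5), 'CANADIAN': (650, 20.0)}

def total(data):
    return sum(n * avg for n, avg in data.values())

def decompose(before, after):
    effects = {}
    for cat in before:
        n0, a0 = before[cat]
        n1, a1 = after[cat]
        inner = n1 * (a1 - a0)  # same counts, different average value
        mix = (n1 - n0) * a0    # more (or fewer) transactions, at the old average
        effects[cat] = (inner, mix)
    return effects

effects = decompose(before, after)

# Canadian whiskies: the average value didn't move, so the inner effect is
# zero and the whole change comes from extra transactions (the mix effect).
assert effects['CANADIAN'] == (0.0, 3000.0)

# As always, the per-category contributions sum to the change in the total.
assert abs(sum(i + m for i, m in effects.values()) - (total(after) - total(before))) < 1e-9
```

This summing-to-the-total property is what makes the waterfall chart read cleanly: the bars stack up to exactly the year-over-year difference.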