├── mara_schema
    ├── ui
    │   ├── __init__.py
    │   ├── static
    │   │   ├── mara-schema.css
    │   │   └── data-set-sql-query.js
    │   ├── graph.py
    │   └── views.py
    ├── config.py
    ├── example
    │   ├── __init__.py
    │   ├── entities
    │   │   ├── product.py
    │   │   ├── product_category.py
    │   │   ├── order_item.py
    │   │   ├── order.py
    │   │   └── customer.py
    │   ├── data_sets
    │   │   ├── products.py
    │   │   ├── customers.py
    │   │   └── order_items.py
    │   └── dimensional-schema.sql
    ├── __init__.py
    ├── attribute.py
    ├── metric.py
    ├── entity.py
    ├── data_set.py
    └── sql_generation.py
├── docs
    ├── changes.md
    ├── requirements.txt
    ├── _static
    │   ├── favicon.ico
    │   ├── mara-animal.jpg
    │   ├── mara-schema.png
    │   ├── mara-schema-sql-generation.gif
    │   ├── mara-schema-data-set-visualization.png
    │   └── example-dimensional-database-schema.svg
    ├── license.rst
    ├── installation.rst
    ├── Makefile
    ├── config.rst
    ├── design.rst
    ├── api.rst
    ├── conf.py
    ├── index.rst
    ├── artifact-generation.rst
    └── example.rst
├── setup.py
├── pyproject.toml
├── .gitignore
├── .pre-commit-config.yaml
├── .readthedocs.yaml
├── setup.cfg
├── CHANGELOG.md
├── Makefile
├── .github
    └── workflows
        └── build.yaml
├── LICENSE
└── README.md


/mara_schema/ui/__init__.py:
--------------------------------------------------------------------------------
1 | 


--------------------------------------------------------------------------------
/docs/changes.md:
--------------------------------------------------------------------------------
1 | ```{include} ../CHANGELOG.md
2 | ```
3 | 


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup
2 | 
3 | setup()
4 | 


--------------------------------------------------------------------------------
/docs/requirements.txt:
--------------------------------------------------------------------------------
1 | sphinx==4.5.0
2 | myst-parser==0.18.0
3 | 


--------------------------------------------------------------------------------
/docs/_static/favicon.ico:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mara/mara-schema/HEAD/docs/_static/favicon.ico


--------------------------------------------------------------------------------
/docs/_static/mara-animal.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mara/mara-schema/HEAD/docs/_static/mara-animal.jpg


--------------------------------------------------------------------------------
/docs/_static/mara-schema.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mara/mara-schema/HEAD/docs/_static/mara-schema.png


--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [build-system]
2 | requires = ["setuptools >= 40.6.0", "wheel"]
3 | build-backend = "setuptools.build_meta"
4 | 


--------------------------------------------------------------------------------
/docs/_static/mara-schema-sql-generation.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mara/mara-schema/HEAD/docs/_static/mara-schema-sql-generation.gif


--------------------------------------------------------------------------------
/docs/_static/mara-schema-data-set-visualization.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mara/mara-schema/HEAD/docs/_static/mara-schema-data-set-visualization.png


--------------------------------------------------------------------------------
/mara_schema/config.py:
--------------------------------------------------------------------------------
1 | import functools
2 | from .data_set import DataSet
3 | 
4 | @functools.lru_cache(maxsize=None)
5 | def data_sets() -> [DataSet]:
6 |     """Returns all available data_sets."""
7 |     from .example import example_data_sets
8 |     return example_data_sets()
9 | 


--------------------------------------------------------------------------------
/docs/license.rst:
--------------------------------------------------------------------------------
 1 | License
 2 | =======
 3 | 
 4 | MIT Source License
 5 | ---------------------------
 6 | 
 7 | The MIT license applies to all files in the Mara repository
 8 | and source distribution. This includes Mara's source code, the
 9 | examples, and tests, as well as the documentation.
10 | 
11 | .. include:: ../LICENSE
12 | 


--------------------------------------------------------------------------------
/mara_schema/example/__init__.py:
--------------------------------------------------------------------------------
1 | from ..data_set import DataSet
2 | 
3 | 
4 | def example_data_sets() -> [DataSet]:
5 |     from .data_sets.customers import customers_data_set
6 |     from .data_sets.order_items import order_items_data_set
7 |     from .data_sets.products import products_data_set
8 |     return [order_items_data_set, customers_data_set, products_data_set]
9 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | # Byte-compiled / optimized / DLL files
 2 | __pycache__/
 3 | *.py[cod]
 4 | *$py.class
 5 | 
 6 | # Distribution / packaging
 7 | build/
 8 | dist/
 9 | *.egg-info/
10 | .eggs/
11 | 
12 | # Unit test / coverage reports
13 | .pytest_cache/
14 | 
15 | # Sphinx documentation
16 | docs/_build/
17 | 
18 | # Dev tools
19 | .idea
20 | 
21 | # Environments
22 | /.venv
23 | 


--------------------------------------------------------------------------------
/docs/installation.rst:
--------------------------------------------------------------------------------
 1 | Installation
 2 | ============
 3 | 
 4 | To use the library directly, use pip:
 5 | 
 6 | ``$ pip install mara-schema``
 7 | 
 8 | or
 9 | 
10 | ``$ pip install git+https://github.com/mara/mara-schema.git``
11 | 
12 | For an example of an integration into a Flask application, have a look at the `Mara Example Project 1 <https://github.com/mara/mara-example-project-1>`_.
13 | 


--------------------------------------------------------------------------------
/.pre-commit-config.yaml:
--------------------------------------------------------------------------------
 1 | # See https://pre-commit.com for more information
 2 | # See https://pre-commit.com/hooks.html for more hooks
 3 | repos:
 4 | -   repo: https://github.com/pre-commit/pre-commit-hooks
 5 |     rev: v3.2.0
 6 |     hooks:
 7 |     -   id: trailing-whitespace
 8 |     -   id: end-of-file-fixer
 9 |     -   id: check-toml
10 |     -   id: check-yaml
11 |     -   id: check-added-large-files
12 | 


--------------------------------------------------------------------------------
/.readthedocs.yaml:
--------------------------------------------------------------------------------
 1 | # Read the Docs configuration file
 2 | # See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
 3 | 
 4 | version: 2
 5 | 
 6 | build:
 7 |   os: ubuntu-20.04
 8 |   tools:
 9 |     python: "3.10"
10 | 
11 | sphinx:
12 |    configuration: docs/conf.py
13 | 
14 | python:
15 |    install:
16 |      - requirements: docs/requirements.txt
17 |      - method: pip
18 |        path: .
19 | 


--------------------------------------------------------------------------------
/mara_schema/example/entities/product.py:
--------------------------------------------------------------------------------
 1 | from mara_schema.entity import Entity, Type
 2 | 
 3 | product_entity = Entity(
 4 |     name='Product',
 5 |     description='Products that were sold or in stock at least once',
 6 |     schema_name='dim')
 7 | 
 8 | product_entity.add_attribute(
 9 |     name='SKU',
10 |     description='The ID of a product as defined in the PIM system',
11 |     high_cardinality=True,
12 |     column_name='sku',
13 |     type=Type.ID)
14 | 
15 | from .product_category import product_category_entity
16 | 
17 | product_entity.link_entity(target_entity=product_category_entity)
18 | 


--------------------------------------------------------------------------------
/mara_schema/example/entities/product_category.py:
--------------------------------------------------------------------------------
 1 | from mara_schema.entity import Entity
 2 | 
 3 | product_category_entity = Entity(
 4 |     name='Product category',
 5 |     description='A broad categorization of products as defined by the purchasing team',
 6 |     schema_name='dim')
 7 | 
 8 | product_category_entity.add_attribute(
 9 |     name='Level 1',
10 |     description='One of the 6 main product categories',
11 |     column_name='main_category')
12 | 
13 | product_category_entity.add_attribute(
14 |     name='Level 2',
15 |     description='The second level category of a product',
16 |     column_name='sub_category_1')
17 | 


--------------------------------------------------------------------------------
/mara_schema/ui/static/mara-schema.css:
--------------------------------------------------------------------------------
 1 | /*
 2 | As of 2021-01-25, supported in Chrome, Firefox, and Edge, but not Safari; it doesn't need any extra CSS
 3 | classes on the anchor
 4 | https://css-tricks.com/fixed-headers-on-page-links-and-overlapping-content-oh-my/
 5 | */
 6 | html {
 7 |   scroll-padding-top: 90px; /* height of sticky header + head of table */
 8 | }
 9 | 
10 | /* For styling a trailing <a href="#id" id="id">¶</a> so it's light grey but gets the blue + underline on hover */
11 | html a.anchor-link-sign {
12 |     color: #bfbfbf;
13 | }
14 | 
15 | html a.anchor-link-sign:hover {
16 |     cursor: pointer;
17 |     color: #0275d8;
18 | }
19 | 


--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
 1 | [metadata]
 2 | name = mara-schema
 3 | version = attr: mara_schema.__version__
 4 | url = https://github.com/mara/mara-schema
 5 | description = Mapping of DWH database tables to business entities, attributes & metrics in Python, with automatic creation of flattened tables
 6 | long_description = file: README.md
 7 | long_description_content_type = text/markdown
 8 | author = Mara contributors
 9 | license = MIT
10 | 
11 | [options]
12 | packages = mara_schema
13 | python_requires = >= 3.6
14 | install_requires =
15 |     flask
16 |     graphviz
17 |     mara-page
18 |     sqlalchemy
19 | 
20 | [options.package_data]
21 | mara_schema = **/*.py, ui/static/**/*, example/**/*.sql
22 | 


--------------------------------------------------------------------------------
/mara_schema/__init__.py:
--------------------------------------------------------------------------------
 1 | __version__ = '1.2.1'
 2 | 
 3 | def MARA_CONFIG_MODULES():
 4 |     from . import config
 5 |     return [config]
 6 | 
 7 | 
 8 | def MARA_CLICK_COMMANDS():
 9 |     return []
10 | 
11 | 
12 | def MARA_FLASK_BLUEPRINTS():
13 |     from .ui import views
14 |     return [views.blueprint]
15 | 
16 | 
17 | def MARA_AUTOMIGRATE_SQLALCHEMY_MODELS():
18 |     return []
19 | 
20 | 
21 | def MARA_ACL_RESOURCES():
22 |     from .ui import views
23 |     return {
24 |         'Schema': views.acl_resource_schema
25 |     }
26 | 
27 | 
28 | def MARA_NAVIGATION_ENTRIES():
29 |     from .ui import views
30 |     return {
31 |         'Schema': views.schema_navigation_entry()
32 |     }
33 | 


--------------------------------------------------------------------------------
/docs/Makefile:
--------------------------------------------------------------------------------
 1 | # Minimal makefile for Sphinx documentation
 2 | #
 3 | 
 4 | # You can set these variables from the command line, and also
 5 | # from the environment for the first two.
 6 | SPHINXOPTS    ?=
 7 | SPHINXBUILD   ?= sphinx-build
 8 | SOURCEDIR     = .
 9 | BUILDDIR      = _build
10 | 
11 | # Put it first so that "make" without argument is like "make help".
12 | help:
13 | 	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
14 | 
15 | .PHONY: help Makefile
16 | 
17 | # Catch-all target: route all unknown targets to Sphinx using the new
18 | # "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
19 | %: Makefile
20 | 	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
21 | 


--------------------------------------------------------------------------------
/mara_schema/example/entities/order_item.py:
--------------------------------------------------------------------------------
 1 | from mara_schema.attribute import Type
 2 | from mara_schema.entity import Entity
 3 | 
 4 | order_item_entity = Entity(
 5 |     name='Order item',
 6 |     description='Individual products sold as part of an order',
 7 |     schema_name='dim')
 8 | 
 9 | order_item_entity.add_attribute(
10 |     name='Order item ID',
11 |     description='The ID of the order item in the backend',
12 |     column_name='order_item_id',
13 |     type=Type.ID,
14 |     high_cardinality=True)
15 | 
16 | from .order import order_entity
17 | from .product import product_entity
18 | 
19 | order_item_entity.link_entity(target_entity=order_entity, prefix='')
20 | order_item_entity.link_entity(target_entity=product_entity)
21 | 
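
The `link_entity` calls above pass no `fk_column`, while the example SQL schema names the matching columns `order_fk` and `product_fk`. The sketch below shows one naming rule that would produce those names. It is an assumption for illustration: `default_fk_column` is a hypothetical helper, not the library's function, and the changelog notes that the real foreign key naming is patchable.

```python
import re


def default_fk_column(entity_name: str) -> str:
    """Hypothetical rule: lower-case the entity name, turn runs of
    non-alphanumeric characters into '_', and append '_fk'."""
    normalized = re.sub(r'[^a-z0-9]+', '_', entity_name.lower()).strip('_')
    return f'{normalized}_fk'


for name in ['Order', 'Product', 'Product category']:
    print(name, '->', default_fk_column(name))
```

Links that need a different column, such as `first_order_fk` on the customer entity, pass `fk_column` explicitly instead.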


--------------------------------------------------------------------------------
/docs/config.rst:
--------------------------------------------------------------------------------
 1 | Configuration
 2 | =============
 3 | 
 4 | 
 5 | Extension Configuration Values
 6 | ------------------------------
 7 | 
 8 | The following configuration values are used by this extension. They are defined as Python functions in ``mara_schema.config``
 9 | and can be overridden with the `monkey patch`_ from `Mara App`_. An example can be found `here <https://github.com/mara/mara-example-project-1/blob/master/app/local_setup.py.example>`_.
10 | 
11 | .. _monkey patch: https://github.com/mara/mara-app/blob/master/mara_app/monkey_patch.py
12 | .. _Mara App: https://github.com/mara/mara-app
13 | 
14 | 
15 | .. py:data:: data_sets
16 | 
17 |     Returns all available data_sets.
18 | 
19 |     Default: ``mara_schema.example.example_data_sets()``
20 | 


--------------------------------------------------------------------------------
/CHANGELOG.md:
--------------------------------------------------------------------------------
 1 | # Changelog
 2 | 
 3 | ## 1.2.1 (2022-06-25)
 4 | 
 5 | - makes the foreign key column naming patchable (#23)
 6 | 
 7 | ## 1.2.0 (2022-06-21)
 8 | 
 9 | - add data set sql generation support for other SQL engines (#22)
10 | - using the default db alias engine in view if possible (#21)
11 | - fix missing files in PyPI package
12 | 
13 | ## 1.1.1 (2022-06-20)
14 | 
15 | - use client-side rendering for graphviz fallback (#20)
16 | 
17 | ## 1.1.0 (2021-05-12)
18 | 
19 | - Add 'more...' links after the description and add an anchor to metrics and attributes (#9)
20 | - Cast 0.0 to double precision (#13)
21 | - Add parameter star_schema_transitive_fks (#14)
22 | 
23 | ## 1.0.1 (2020-07-10)
24 | 
25 | - Update documentation and example
26 | 
27 | 
28 | ## 1.0.0 (2020-06-29)
29 | 
30 | - Initial release
31 | 


--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
 1 | MODULE_NAME=mara_schema
 2 | 
 3 | 
 4 | all:
 5 | 	# builds virtual env. and starts install in it
 6 | 	make .venv/bin/python
 7 | 	make install
 8 | 
 9 | 
10 | install:
11 | 	# install of module
12 | 	.venv/bin/pip install .
13 | 
14 | 
15 | publish:
16 | 	# manually publishing the package
17 | 	.venv/bin/pip install build twine
18 | 	.venv/bin/python -m build
19 | 	.venv/bin/twine upload dist/*
20 | 
21 | 
22 | clean:
23 | 	# clean up
24 | 	rm -rf .venv/ build/ dist/ ${MODULE_NAME}.egg-info/ .pytest_cache/ .eggs/
25 | 
26 | 
27 | .PYTHON3:=$(shell PATH='$(subst $(CURDIR)/.venv/bin:,,$(PATH))' which python3)
28 | 
29 | .venv/bin/python:
30 | 	mkdir -p .venv
31 | 	cd .venv && $(.PYTHON3) -m venv --copies --prompt='[$(shell basename `pwd`)/.venv]' .
32 | 
33 | 	.venv/bin/python -m pip install --upgrade pip
34 | 


--------------------------------------------------------------------------------
/mara_schema/example/data_sets/products.py:
--------------------------------------------------------------------------------
 1 | from mara_schema.data_set import DataSet, Aggregation
 2 | 
 3 | from ..entities.product import product_entity
 4 | 
 5 | products_data_set = DataSet(entity=product_entity, name='Products')
 6 | 
 7 | products_data_set.add_simple_metric(
 8 |     name='Revenue last 30 days',
 9 |     description='The revenue generated from the product in the last 30 days',
10 |     aggregation=Aggregation.SUM,
11 |     column_name='revenue_last_30_days',
12 |     important_field=True)
13 | 
14 | products_data_set.add_simple_metric(
15 |     name='# Items on stock',
16 |     description='How many items of the products are in stock according to the ERP (at the time of the last DWH import)',
17 |     column_name='number_of_items_on_stock',
18 |     aggregation=Aggregation.SUM,
19 |     important_field=True)
20 | 


--------------------------------------------------------------------------------
/docs/design.rst:
--------------------------------------------------------------------------------
 1 | Design Decisions
 2 | ================
 3 | 
 4 | Schema sync to front-ends
 5 | -------------------------
 6 | When reporting tools have a Metadata API (e.g. Metabase, Tableau) or can read schema definitions from text files (e.g. Looker, Mondrian), then it's easy to sync definitions with them. The `Mara Metabase <https://github.com/mara/mara-metabase>`_ package contains a function for syncing Mara Schema definitions with Metabase and the `Mara Mondrian <https://github.com/mara/mara-mondrian>`_ package contains a generator for a Mondrian schema.
 7 | 
 8 | We welcome contributions for creating Looker `LookML files <https://docs.looker.com/data-modeling/getting-started/file-types-in-project>`_, for syncing definitions with Tableau, and for syncing with any other BI front-end.
 9 | 
10 | We also see potential for automatically creating data guides in wikis or other documentation tools.
11 | 


--------------------------------------------------------------------------------
/.github/workflows/build.yaml:
--------------------------------------------------------------------------------
 1 | name: mara-schema
 2 | 
 3 | on:
 4 |   push:
 5 |     branches:
 6 |       - main
 7 |   pull_request:
 8 |     branches:
 9 |       - main
10 | 
11 | jobs:
12 |   build:
13 |     runs-on: ubuntu-latest
14 |     strategy:
15 |       matrix:
16 |         python-version: ['3.7', '3.8', '3.9', '3.10', '3.11']
17 |     steps:
18 |       - name: Checkout code
19 |         uses: actions/checkout@v2
20 |       - name: Setup python
21 |         uses: actions/setup-python@v2
22 |         with:
23 |           python-version: ${{ matrix.python-version }}
24 |       - name: Install package
25 |         env:
26 |           pythonversion: ${{ matrix.python-version }}
27 |         run: |
28 |           python -c "import sys; print(sys.version)"
29 |           pip install .
30 |           echo Finished successful build with Python $pythonversion
31 | #      - name: Test with pytest
32 | #        run: |
33 | #          pytest -v tests
34 | 


--------------------------------------------------------------------------------
/mara_schema/example/data_sets/customers.py:
--------------------------------------------------------------------------------
 1 | from mara_schema.data_set import DataSet, Aggregation
 2 | 
 3 | from ..entities.customer import customer_entity
 4 | 
 5 | customers_data_set = DataSet(entity=customer_entity, name='Customers')
 6 | 
 7 | customers_data_set.exclude_path(['Order', 'Customer'])
 8 | 
 9 | customers_data_set.add_simple_metric(
10 |     name='# Orders',
11 |     description='Number of orders placed by the customer',
12 |     aggregation=Aggregation.SUM,
13 |     column_name='number_of_orders',
14 |     important_field=True)
15 | 
16 | customers_data_set.add_simple_metric(
17 |     name='CLV',
18 |     description='The lifetime revenue generated from items purchased by this customer',
19 |     aggregation=Aggregation.SUM,
20 |     column_name='revenue_lifetime',
21 |     important_field=True)
22 | 
23 | customers_data_set.add_composed_metric(
24 |     name='AOV',
25 |     description='The average revenue per order of the customer',
26 |     formula='[CLV] / [# Orders]')
27 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | Copyright (c) 2020-2022 Mara contributors
 2 | 
 3 | Permission is hereby granted, free of charge, to any person obtaining a copy
 4 | of this software and associated documentation files (the "Software"), to deal
 5 | in the Software without restriction, including without limitation the rights
 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 7 | copies of the Software, and to permit persons to whom the Software is
 8 | furnished to do so, subject to the following conditions:
 9 | 
10 | The above copyright notice and this permission notice shall be included in all
11 | copies or substantial portions of the Software.
12 | 
13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
19 | SOFTWARE.
20 | 


--------------------------------------------------------------------------------
/mara_schema/example/entities/order.py:
--------------------------------------------------------------------------------
 1 | from mara_schema.attribute import Type
 2 | from mara_schema.entity import Entity
 3 | 
 4 | order_entity = Entity(
 5 |     name='Order',
 6 |     description='Valid orders for which an invoice was created',
 7 |     schema_name='dim')
 8 | 
 9 | order_entity.add_attribute(
10 |     name='Order ID',
11 |     description='The invoice number of the order as stored in the backend',
12 |     column_name='order_id',
13 |     type=Type.ID,
14 |     important_field=True,
15 |     high_cardinality=True)
16 | 
17 | order_entity.add_attribute(
18 |     name='Order date',
19 |     description='The date when the order was placed (stored in the backend)',
20 |     column_name='order_date',
21 |     type=Type.DATE,
22 |     important_field=True)
23 | 
24 | order_entity.add_attribute(
25 |     name='Status',
26 |     description='The current status of the order (new, paid, shipped, etc.)',
27 |     column_name='status',
28 |     accessible_via_entity_link=False,
29 |     type=Type.ENUM)
30 | 
31 | from .customer import customer_entity
32 | 
33 | order_entity.link_entity(
34 |     target_entity=customer_entity,
35 |     description='The customer who placed the order')
36 | 


--------------------------------------------------------------------------------
/mara_schema/ui/static/data-set-sql-query.js:
--------------------------------------------------------------------------------
 1 | var DataSetSqlQuery = function (baseUrl) {
 2 | 
 3 |     function localStorageKey(param) {
 4 |         return 'mara-schem-sql-query-param-' + param;
 5 |     }
 6 | 
 7 |     $('.param-checkbox').each(function (n, checkbox) {
 8 |         var storedValue = localStorage.getItem(localStorageKey(checkbox.value));
 9 |         if (storedValue == 'false') {
10 |             checkbox.checked = false;
11 |         } else if (storedValue == 'true') {
12 |             checkbox.checked = true;
13 |         } else {
14 |             checkbox.checked = checkbox.value != 'star schema';
15 |         }
16 |     });
17 | 
18 |     function updateUI() {
19 |         var selectedParams = [];
20 |         $('.param-checkbox').each(function (n, checkbox) {
21 |             if (checkbox.checked) {
22 |                 selectedParams.push(checkbox.value);
23 |             }
24 |             localStorage.setItem(localStorageKey(checkbox.value), checkbox.checked);
25 |         });
26 | 
27 |         var url = baseUrl;
28 |         if (selectedParams.length > 0) {
29 |             url += '/' + selectedParams.join('/')
30 |         }
31 |         loadContentAsynchronously('sql-container', url);
32 |     }
33 | 
34 |     $('.param-checkbox').change(updateUI);
35 | 
36 |     updateUI();
37 | };
38 | 


--------------------------------------------------------------------------------
/docs/api.rst:
--------------------------------------------------------------------------------
 1 | API
 2 | ===
 3 | 
 4 | .. module:: mara_schema
 5 | 
 6 | This part of the documentation covers all the interfaces of Mara Schema. For
 7 | parts where the package depends on external libraries, we document the most
 8 | important ones right here and provide links to the canonical documentation.
 9 | 
10 | 
11 | Entities
12 | --------
13 | 
14 | .. module:: mara_schema.entity
15 | 
16 | .. autoclass:: Entity
17 |     :members:
18 |     :special-members: __init__
19 | 
20 | .. autoclass:: EntityLink
21 |     :special-members: __init__
22 | 
23 | 
24 | Attributes
25 | ----------
26 | 
27 | .. module:: mara_schema.attribute
28 | 
29 | .. autoclass:: Attribute
30 |     :members:
31 |     :special-members: __init__
32 | 
33 | .. autoclass:: Type
34 | 
35 | .. autofunction:: normalize_name
36 | 
37 | 
38 | Data sets
39 | ---------
40 | 
41 | .. module:: mara_schema.data_set
42 | 
43 | .. autoclass:: DataSet
44 |     :members:
45 |     :special-members: __init__
46 | 
47 | 
48 | Metrics
49 | -------
50 | 
51 | .. module:: mara_schema.metric
52 | 
53 | .. autoclass:: Aggregation
54 | 
55 | .. autoclass:: NumberFormat
56 | 
57 | .. autoclass:: SimpleMetric
58 |     :members:
59 |     :special-members: __init__
60 | 
61 | .. autoclass:: ComposedMetric
62 |     :members:
63 |     :special-members: __init__
64 | 
65 | 
66 | SQL Generation
67 | --------------
68 | 
69 | .. module:: mara_schema.sql_generation
70 | 
71 | .. autofunction:: data_set_sql_query
72 | 


--------------------------------------------------------------------------------
/mara_schema/example/entities/customer.py:
--------------------------------------------------------------------------------
 1 | from mara_schema.attribute import Type
 2 | from mara_schema.entity import Entity
 3 | 
 4 | customer_entity = Entity(
 5 |     name='Customer',
 6 |     description='People that made at least one order or that subscribed to the newsletter',
 7 |     schema_name='dim')
 8 | 
 9 | customer_entity.add_attribute(
10 |     name='Customer ID',
11 |     description='The ID of the customer as defined in the backend',
12 |     column_name='customer_id',
13 |     type=Type.ID,
14 |     high_cardinality=True,
15 |     important_field=True)
16 | 
17 | customer_entity.add_attribute(
18 |     name='Email',
19 |     description='The email of the customer',
20 |     column_name='email',
21 |     personal_data=True,
22 |     high_cardinality=True,
23 |     accessible_via_entity_link=False)
24 | 
25 | customer_entity.add_attribute(
26 |     name='Duration since first order',
27 |     description='The number of days since the first order was placed',
28 |     type=Type.DURATION,
29 |     column_name='duration_since_first_order',
30 |     accessible_via_entity_link=False)
31 | 
32 | from .order import order_entity
33 | from .product_category import product_category_entity
34 | 
35 | customer_entity.link_entity(
36 |     target_entity=product_category_entity,
37 |     fk_column='favourite_product_category_fk',
38 |     prefix='Favourite product category',
39 |     description='The category of the most purchased product (by revenue) of the customer')
40 | 
41 | customer_entity.link_entity(
42 |     target_entity=order_entity,
43 |     fk_column='first_order_fk',
44 |     prefix='First order')
45 | 


--------------------------------------------------------------------------------
/mara_schema/example/data_sets/order_items.py:
--------------------------------------------------------------------------------
 1 | from mara_schema.data_set import DataSet, Aggregation
 2 | 
 3 | from ..entities.order_item import order_item_entity
 4 | 
 5 | order_items_data_set = DataSet(entity=order_item_entity, name='Order items')
 6 | 
 7 | order_items_data_set.include_attributes(['Order', 'Customer', 'Order'], ['Order date'])
 8 | 
 9 | order_items_data_set.add_simple_metric(
10 |     name='# Order items',
11 |     description='The number of ordered products',
12 |     column_name='order_item_id',
13 |     aggregation=Aggregation.COUNT)
14 | 
15 | order_items_data_set.add_simple_metric(
16 |     name='# Orders',
17 |     description='The number of valid orders (orders with an invoice)',
18 |     column_name='order_fk',
19 |     aggregation=Aggregation.DISTINCT_COUNT,
20 |     important_field=True)
21 | 
22 | order_items_data_set.add_simple_metric(
23 |     name='Product revenue',
24 |     description='The price of the ordered products as shown in the cart',
25 |     aggregation=Aggregation.SUM,
26 |     column_name='product_revenue',
27 |     important_field=True)
28 | 
29 | order_items_data_set.add_simple_metric(
30 |     name='Shipping revenue',
31 |     description='Revenue generated based on the price of the items and delivery fee',
32 |     aggregation=Aggregation.SUM,
33 |     column_name='shipping_revenue')
34 | 
35 | order_items_data_set.add_composed_metric(
36 |     name='Revenue',
37 |     description='The total cart value of the order',
38 |     formula='[Product revenue] + [Shipping revenue]',
39 |     important_field=True)
40 | 
41 | order_items_data_set.add_composed_metric(
42 |     name='AOV',
43 |     description='The average revenue per order. Attention: not meaningful when split by product',
44 |     formula='[Revenue] / [# Orders]',
45 |     important_field=True)
46 | 
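
To make the aggregations above concrete, here is a small worked example in plain Python. The rows are made up and only mimic the shape of `dim.order_item`. The code mirrors what `COUNT`, `DISTINCT_COUNT`, and `SUM` mean for these metrics and is not the library's SQL generation.

```python
# Hypothetical rows shaped like dim.order_item; the values are made up.
rows = [
    {'order_item_id': 1, 'order_fk': 10, 'product_revenue': 20.0, 'shipping_revenue': 5.0},
    {'order_item_id': 2, 'order_fk': 10, 'product_revenue': 30.0, 'shipping_revenue': 0.0},
    {'order_item_id': 3, 'order_fk': 11, 'product_revenue': 40.0, 'shipping_revenue': 5.0},
]

# Simple metrics: one aggregation over one column each.
number_of_order_items = sum(1 for row in rows if row['order_item_id'] is not None)  # COUNT
number_of_orders = len({row['order_fk'] for row in rows})  # DISTINCT_COUNT
product_revenue = sum(row['product_revenue'] for row in rows)  # SUM
shipping_revenue = sum(row['shipping_revenue'] for row in rows)  # SUM

# Composed metrics: arithmetic over already aggregated metrics.
revenue = product_revenue + shipping_revenue  # [Product revenue] + [Shipping revenue]
aov = revenue / number_of_orders  # [Revenue] / [# Orders]

print(number_of_order_items, number_of_orders, revenue, aov)
```

Two of the three items belong to order 10, which is why `# Orders` uses a distinct count on `order_fk` rather than a plain count.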


--------------------------------------------------------------------------------
/mara_schema/example/dimensional-schema.sql:
--------------------------------------------------------------------------------
 1 | DROP SCHEMA IF EXISTS dim CASCADE;
 2 | CREATE SCHEMA dim;
 3 | 
 4 | CREATE TYPE dim.STATUS AS ENUM ('New', 'Paid', 'Shipped', 'Returned', 'Refunded');
 5 | 
 6 | CREATE TABLE dim.order
 7 | (
 8 |     order_id    INTEGER PRIMARY KEY,
 9 |     customer_fk INTEGER    NOT NULL,
10 |     order_date  TIMESTAMP WITH TIME ZONE,
11 |     status      dim.STATUS NOT NULL
12 | );
13 | 
14 | CREATE TABLE dim.order_item
15 | (
16 |     order_item_id    INTEGER PRIMARY KEY,
17 |     order_fk         INTEGER          NOT NULL,
18 |     product_fk       INTEGER          NOT NULL,
19 | 
20 |     product_revenue  DOUBLE PRECISION NOT NULL,
21 |     shipping_revenue DOUBLE PRECISION NOT NULL
22 | );
23 | 
24 | CREATE TABLE dim.customer
25 | (
26 |     customer_id                   INTEGER PRIMARY KEY,
27 |     email                         TEXT NOT NULL,
28 |     duration_since_first_order    INTEGER,
29 |     first_order_fk                INTEGER,
30 |     favourite_product_category_fk INTEGER,
31 | 
32 |     number_of_orders              INTEGER,
33 |     revenue_lifetime              DOUBLE PRECISION
34 | );
35 | 
36 | CREATE TABLE dim.product
37 | (
38 |     product_id          INTEGER PRIMARY KEY,
39 |     sku                 TEXT    NOT NULL,
40 |     product_category_fk INTEGER NOT NULL,
41 |     revenue_all_time    DOUBLE PRECISION
42 | );
43 | 
44 | CREATE TABLE dim.product_category
45 | (
46 |     product_category_id INTEGER PRIMARY KEY,
47 |     level_1             TEXT NOT NULL,
48 |     level_2             TEXT NOT NULL
49 | );
50 | 
51 | 
52 | ALTER TABLE dim.order
53 |     ADD FOREIGN KEY (customer_fk) REFERENCES dim.customer (customer_id);
54 | ALTER TABLE dim.order_item
55 |     ADD FOREIGN KEY (order_fk) REFERENCES dim.order (order_id);
56 | ALTER TABLE dim.order_item
57 |     ADD FOREIGN KEY (product_fk) REFERENCES dim.product (product_id);
58 | ALTER TABLE dim.customer
59 |     ADD FOREIGN KEY (first_order_fk) REFERENCES dim.order (order_id);
60 | ALTER TABLE dim.customer
61 |     ADD FOREIGN KEY (favourite_product_category_fk)
62 |         REFERENCES dim.product_category (product_category_id);
63 | ALTER TABLE dim.product
64 |     ADD FOREIGN KEY (product_category_fk) REFERENCES dim.product_category (product_category_id);
65 | 


--------------------------------------------------------------------------------
/docs/conf.py:
--------------------------------------------------------------------------------
 1 | # This file only contains a selection of the most common options. For a full
 2 | # list see the documentation:
 3 | # https://www.sphinx-doc.org/en/master/usage/configuration.html
 4 | 
 5 | # -- Path setup --------------------------------------------------------------
 6 | 
 7 | # If extensions (or modules to document with autodoc) are in another directory,
 8 | # add these directories to sys.path here. If the directory is relative to the
 9 | # documentation root, use os.path.abspath to make it absolute, like shown here.
10 | #
11 | # import os
12 | # import sys
13 | # sys.path.insert(0, os.path.abspath('.'))
14 | 
15 | 
16 | # -- Project information -----------------------------------------------------
17 | 
18 | project = 'Mara Schema'
19 | copyright = '2020-2022, Mara contributors'
20 | author = 'Mara contributors'
21 | 
22 | # The short X.Y version.
23 | from mara_schema import __version__
24 | version = __version__
25 | # The full version, including alpha/beta/rc tags
26 | release = version
27 | 
28 | 
29 | # -- General configuration ---------------------------------------------------
30 | 
31 | # Add any Sphinx extension module names here, as strings. They can be
32 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
33 | # ones.
34 | extensions = [
35 |     'sphinx.ext.autodoc',
36 |     'myst_parser',
37 | ]
38 | 
39 | # Add any paths that contain templates here, relative to this directory.
40 | templates_path = ['_templates']
41 | 
42 | # List of patterns, relative to source directory, that match files and
43 | # directories to ignore when looking for source files.
44 | # This pattern also affects html_static_path and html_extra_path.
45 | exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
46 | 
47 | 
48 | # -- Options for HTML output -------------------------------------------------
49 | 
50 | # The theme to use for HTML and HTML Help pages.  See the documentation for
51 | # a list of builtin themes.
52 | #
53 | html_theme = 'alabaster'
54 | 
55 | # Add any paths that contain custom static files (such as style sheets) here,
56 | # relative to this directory. They are copied after the builtin static files,
57 | # so a file named "default.css" will overwrite the builtin "default.css".
58 | html_static_path = ['_static']
59 | html_favicon = "_static/favicon.ico"
60 | html_logo = "_static/mara-animal.jpg"
61 | html_title = f"Mara Schema Documentation ({version})"
62 | 


--------------------------------------------------------------------------------
/docs/index.rst:
--------------------------------------------------------------------------------
 1 | .. rst-class:: hide-header
 2 | 
 3 | Mara Schema documentation
 4 | =========================
 5 | 
 6 | Welcome to Mara Schema's documentation. The Mara Schema package is a Python-based mapping of physical data warehouse tables to logical business entities (a.k.a. "cubes", "models", "data sets", etc.). It comes with:
 7 | 
 8 | - SQL query generation for flattening normalized database tables into wide tables for various analytics front-ends
 9 | - a Flask-based visualization of the schema that can serve as documentation of the business definitions of a data warehouse (a.k.a. "data dictionary" or "data guide")
10 | - the possibility to sync schemas to reporting front-ends that have meta-data APIs (e.g. Metabase, Looker, Tableau)
11 | 
12 | .. image:: _static/mara-schema.png
13 | 
14 | Have a look at a real-world application of Mara Schema in the `Mara Example Project 1 <https://github.com/mara/mara-example-project-1>`_.
15 | 
16 | 
17 | **Why** should I use Mara Schema?
18 | 
19 | 1. **Definition of analytical business entities as code**: There are many solutions for documenting the company-wide definitions of attributes & metrics for the users of a data warehouse. These can range from simple spreadsheets or wikis to metadata management tools inside reporting front-ends. However, these definitions can quickly get out of sync when columns are added or changed in the underlying data warehouse. Mara Schema makes it possible to deploy definition changes together with changes in the underlying ETL processes, so that the definitions stay in sync with the underlying data warehouse schema.
20 | 
21 | 2. **Automatic generation of aggregates / artifacts**: When a company wants to enforce a *single source of truth* in their data warehouse, then a heavily normalized Kimball-style `snowflake schema <https://en.wikipedia.org/wiki/Snowflake_schema>`_ is still the weapon of choice. It enforces an agreed-upon unified modelling of business entities across domains and ensures referential consistency. However, snowflake schemas are not ideal for analytics or data science because they require a lot of joins. Most analytical databases and reporting tools nowadays work better with pre-flattened wide tables. Creating such flattened tables is an error-prone and dull activity, but with Mara Schema one can automate most of the work in creating flattened data set tables in the ETL.
22 | 
23 | 
24 | User's Guide
25 | ------------
26 | 
27 | This part of the documentation gives step-by-step instructions on how to use this extension.
28 | 
29 | .. toctree::
30 |    :maxdepth: 2
31 | 
32 |    installation
33 |    example
34 |    artifact-generation
35 |    config
36 | 
37 | 
38 | API Reference
39 | -------------
40 | 
41 | If you are looking for information on a specific function, class or
42 | method, this part of the documentation is for you.
43 | 
44 | .. toctree::
45 |    :maxdepth: 2
46 | 
47 |    api
48 | 
49 | Additional Notes
50 | ----------------
51 | 
52 | Design notes, legal information and the changelog are here for the interested.
53 | 
54 | .. toctree::
55 |    :maxdepth: 2
56 | 
57 |    design
58 |    license
59 |    changes
60 | 


--------------------------------------------------------------------------------
/docs/artifact-generation.rst:
--------------------------------------------------------------------------------
 1 | Artifact generation
 2 | ===================
 3 | 
 4 | The function ``data_set_sql_query`` in `mara_schema/sql_generation.py <mara_schema/sql_generation.py>`_ can be used to flatten the entities of a data set into a wide data set table:
 5 | 
 6 | .. code-block:: python
 7 | 
 8 |     data_set_sql_query(data_set=order_items_data_set, human_readable_columns=True, pre_computed_metrics=False, star_schema=False, personal_data=False, high_cardinality_attributes=True)
 9 | 
10 | The resulting SELECT statement can be used to create a data set table that is tailored for use in Metabase:
11 | 
12 | .. code-block:: sql
13 | 
14 |     SELECT
15 |         order_item.order_item_id AS "Order item ID",
16 | 
17 |         "order".order_id AS "Order ID",
18 |         "order".order_date AS "Order date",
19 | 
20 |         order_customer.customer_id AS "Customer ID",
21 | 
22 |         order_customer_favourite_product_category.main_category AS "Customer favourite product category level 1",
23 |         order_customer_favourite_product_category.sub_category_1 AS "Customer favourite product category level 2",
24 | 
25 |         order_customer_first_order.order_date AS "Customer first order date",
26 | 
27 |         product.sku AS "Product SKU",
28 | 
29 |         product_product_category.main_category AS "Product category level 1",
30 |         product_product_category.sub_category_1 AS "Product category level 2",
31 | 
32 |         order_item.order_item_id AS "# Order items",
33 |         order_item.order_fk AS "# Orders",
34 |         order_item.product_revenue AS "Product revenue",
35 |         order_item.shipping_revenue AS "Shipping revenue"
36 | 
37 |     FROM dim.order_item order_item
38 |     LEFT JOIN dim."order" "order" ON order_item.order_fk = "order".order_id
39 |     LEFT JOIN dim.customer order_customer ON "order".customer_fk = order_customer.customer_id
40 |     LEFT JOIN dim.product_category order_customer_favourite_product_category ON order_customer.favourite_product_category_fk = order_customer_favourite_product_category.product_category_id
41 |     LEFT JOIN dim."order" order_customer_first_order ON order_customer.first_order_fk = order_customer_first_order.order_id
42 |     LEFT JOIN dim.product product ON order_item.product_fk = product.product_id
43 |     LEFT JOIN dim.product_category product_product_category ON product.product_category_fk = product_product_category.product_category_id
44 | 
45 | Please note that ``data_set_sql_query`` only returns SQL SELECT statements; executing them is left to the ETL of the data warehouse. `Here <https://github.com/mara/mara-example-project-1/tree/main/app/pipelines/generate_artifacts/metabase.py>`_ is an example of creating data set tables for Metabase using `Mara Pipelines <https://github.com/mara/mara-pipelines>`_.
46 | 
47 | 
48 | 
49 | There are several parameters for controlling the output of the ``data_set_sql_query`` function:
50 | 
51 |  - ``human_readable_columns``: Whether to use "Customer name" rather than "customer_name" as column name
52 |  - ``pre_computed_metrics``: Whether to pre-compute composed metrics, counts and distinct counts on row level
53 |  - ``star_schema``: Whether to add foreign keys to the tables of linked entities rather than including their attributes
54 |  - ``personal_data``: Whether to include attributes that are marked as personal data
55 |  - ``high_cardinality_attributes``: Whether to include attributes that are marked to have a high cardinality
56 | 
57 | .. image:: _static/mara-schema-sql-generation.gif
58 |     :alt: Mara schema SQL generation
59 | 
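The human-readable aliases in the SELECT statement above (e.g. "Customer first order date") come from joining the prefixes of the entity links on the path with the attribute name. The following is a minimal, standard-library-only sketch of that naming logic, restated from ``Attribute.prefixed_name`` and ``normalize_name`` in ``mara_schema/attribute.py``. It omits the length-limiting hash step, the function name ``readable_column_name`` is made up for illustration, and the link prefixes passed in are assumptions about the example entities rather than values read from them.

```python
import re


def first_lower(string: str) -> str:
    """Lowercase the first letter unless the string starts with two capitals (e.g. "SKU")."""
    if re.match(r'[A-Z]{2}', string):
        return string
    return string[:1].lower() + string[1:]


def readable_column_name(link_prefixes, attribute_name: str) -> str:
    """Hypothetical helper: build a column alias from link prefixes and an attribute name."""
    # Lower-case the prefixes of all entity links on the path, then append the attribute name
    parts = [prefix.lower() for prefix in link_prefixes] + [first_lower(attribute_name)]
    name = ' '.join(parts)

    # Collapse repeated adjacent words, e.g. "first order order date" -> "first order date"
    name = re.sub(r'\b(\w+)( \1\b)+', r'\1', name).strip()

    # Collapse runs of whitespace left over from empty prefixes
    name = re.sub(r'\s\s+', ' ', name)

    # Capitalize the first letter only, leaving the rest untouched
    return name[:1].upper() + name[1:]


# Assumed path: order item -> order (empty prefix) -> customer -> first order
print(readable_column_name(['', 'Customer', 'First order'], 'Order date'))  # Customer first order date
print(readable_column_name(['Product'], 'SKU'))  # Product SKU
```

An empty prefix on a link simply contributes nothing to the alias, which is why the order attributes above are not prefixed with "Order" twice.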


--------------------------------------------------------------------------------
/mara_schema/attribute.py:
--------------------------------------------------------------------------------
 1 | import enum
 2 | import re
 3 | 
 4 | import typing
 5 | 
 6 | 
 7 | class Type(enum.EnumMeta):
 8 |     """
 9 |     Attribute types that need special treatment in artifact creation
10 |         Type.ID: A numeric ID that is converted to text in a flattened table so that it can be filtered
11 |         Type.DATE: Date attribute as a foreign_key to a date dimension
12 |         Type.DURATION: Duration attribute as a foreign_key to a duration dimension
13 |         Type.ENUM: An attribute that is converted to text in a flattened table.
14 |         Type.ARRAY: Attribute of type array
15 |     """
16 |     DATE = 'date'
17 |     DURATION = 'duration'
18 |     ID = 'id'
19 |     ENUM = 'enum'
20 |     ARRAY = 'array'
21 | 
22 | 
23 | class Attribute():
24 |     """A property of an entity, corresponds to a column in the underlying dimensional table"""
25 | 
26 |     def __init__(self, name: str, description: str, column_name: str,
27 |                  accessible_via_entity_link: bool, type: 'Type' = None, high_cardinality: bool = False,
28 |                  personal_data: bool = False,
29 |                  important_field: bool = False,
30 |                  more_url: typing.Optional[str] = None,
31 |                  ) -> None:
32 |         """See documentation of function Entity.add_attribute"""
33 |         self.name = name
34 |         self.description = description
35 |         self.column_name = column_name
36 |         self.type = type
37 |         self.high_cardinality = high_cardinality
38 |         self.personal_data = personal_data
39 |         self.important_field = important_field
40 |         self.accessible_via_entity_link = accessible_via_entity_link
41 |         self.more_url = more_url
42 | 
43 |     def __repr__(self) -> str:
44 |         return f'<Attribute "{self.name}">'
45 | 
46 |     def prefixed_name(self, path: typing.Tuple['EntityLink'] = None) -> str:
47 |         """Generate a meaningful business name by concatenating the prefix of entity link instances and original
48 |         name of attribute. """
49 | 
50 |         def first_lower(string: str = ''):
51 |             """Lowercase the first letter unless the first two letters are both capitalized"""
52 |             if not re.match(r'([A-Z]){2}', string):
53 |                 return string[:1].lower() + string[1:]
54 |             else:
55 |                 return string
56 | 
57 |         if path:
58 |             prefix = ' '.join([entity_link.prefix.lower() for entity_link in path])
59 |             return normalize_name(prefix + ' ' + first_lower(self.name))
60 |         else:
61 |             return normalize_name(self.name)
62 | 
63 | 
64 | def normalize_name(name: str, max_length: int = 63) -> str:
65 |     """
66 |     Makes "Foo bar baz" out of "foo bar bar baz"
67 |     Args:
68 |         name: the name to normalize
69 |         max_length: optionally limit length by replacing too long part with a hash of the name
70 |     """
71 | 
72 |     def first_letter_capitalize(string: str) -> str:
73 |         return string[:1].upper() + string[1:]
74 | 
75 |     # Remove repeating words from the generated name, e.g. "First booking booking ID" -> "First booking ID"
76 |     name = re.sub(r'\b(\w+)( \1\b)+', r'\1', name).strip()
77 | 
78 |     # Remove duplicate whitespace
79 |     name = re.sub(r'\s\s+', ' ', name)
80 | 
81 |     name = first_letter_capitalize(name)
82 | 
83 |     # Limit length
84 | 
85 |     if max_length and len(name) > max_length:
86 |         import hashlib
87 |         m = hashlib.md5()
88 |         m.update(name.encode('utf-8'))
89 |         return name[:(max_length - 8)] + m.hexdigest()[:8]
90 | 
91 |     return name
92 | 


--------------------------------------------------------------------------------
/mara_schema/metric.py:
--------------------------------------------------------------------------------
  1 | import abc
  2 | import enum
  3 | 
  4 | import typing
  5 | 
  6 | 
  7 | class Aggregation(enum.EnumMeta):
  8 |     """Aggregation methods for metrics"""
  9 |     SUM = 'sum'
 10 |     AVERAGE = 'avg'
 11 |     COUNT = 'count'
 12 |     DISTINCT_COUNT = 'distinct-count'
 13 | 
 14 | 
 15 | class NumberFormat(enum.EnumMeta):
 16 |     """How to format values"""
 17 |     STANDARD = 'Standard'
 18 |     CURRENCY = 'Currency'
 19 |     PERCENT = 'Percent'
 20 | 
 21 | 
 22 | class Metric(abc.ABC):
 23 |     def __init__(self, name: str, description: str, data_set: 'DataSet',
 24 |                  important_field: bool = False) -> None:
 25 |         """
 26 |         A numeric aggregation on columns of an entity table.
 27 | 
 28 |         Args:
 29 |             name: How the metric is displayed in front-ends, e.g. "Revenue after cancellations"
 30 |             description: A meaningful business definition of the metric
 31 |             important_field: It refers to key business metrics.
 32 |         """
 33 |         self.name = name
 34 |         self.description = description
 35 |         self.data_set = data_set
 36 |         self.important_field = important_field
 37 | 
 38 |     @abc.abstractmethod
 39 |     def display_formula(self):
 40 |         """Returns a documentation string for displaying the formula in the frontend"""
 41 |         pass
 42 | 
 43 | 
 44 | class SimpleMetric(Metric):
 45 |     def __init__(self, name: str, description: str, data_set: 'DataSet',
 46 |                  column_name: str, aggregation: Aggregation, important_field: bool = False,
 47 |                  number_format: NumberFormat = NumberFormat.STANDARD,
 48 |                  more_url: typing.Optional[str] = None):
 49 |         """
 50 |         A metric that is computed as a direct aggregation on an entity table column
 51 |         Args:
 52 |             name: How the metric is displayed in front-ends, e.g. "Revenue after cancellations"
 53 |             description: A meaningful business definition of the metric
 54 |             data_set: The data set that contains the metric
 55 |             column_name: The column that the aggregation is based on
 56 |             aggregation: The aggregation method to use
 57 |             important_field: It refers to key business metrics.
 58 |             number_format: How to format the values of the metric. Defaults to NumberFormat.STANDARD.
 59 |         """
 60 |         super().__init__(name, description, data_set)
 61 |         self.column_name = column_name
 62 |         self.aggregation = aggregation
 63 |         self.important_field = important_field
 64 |         self.number_format = number_format
 65 |         self.more_url = more_url
 66 | 
 67 |     def __repr__(self) -> str:
 68 |         return f'<Metric "{self.name}": {self.display_formula()}>'
 69 | 
 70 |     def display_formula(self) -> str:
 71 |         return f"{self.aggregation}({self.column_name})"
 72 | 
 73 | 
 74 | class ComposedMetric(Metric):
 75 |     def __init__(self, name: str, description: str, data_set: 'DataSet',
 76 |                  parent_metrics: [Metric], formula_template: str, important_field: bool = False,
 77 |                  number_format: NumberFormat = NumberFormat.STANDARD,
 78 |                  more_url: typing.Optional[str] = None) -> None:
 79 |         """
 80 |         A metric that is based on a list of simple metrics.
 81 |         Args:
 82 |             name: How the metric is displayed in front-ends, e.g. "Revenue after cancellations"
 83 |             description: A meaningful business definition of the metric
 84 |             data_set: The data set that contains the metric
 85 |             parent_metrics: The parent metrics that this metric is composed of
 86 |             formula_template: How to compose the parent metrics, with '{}' as placeholders
 87 |                 Examples: '{} + {}', '{} / ({} + {})'
 88 |             important_field: It refers to key business metrics.
 89 |             number_format: How to format the values of the metric. Defaults to NumberFormat.STANDARD.
 90 |         """
 91 |         super().__init__(name, description, data_set)
 92 |         self.parent_metrics = parent_metrics
 93 |         self.formula_template = formula_template
 94 |         self.important_field = important_field
 95 |         self.number_format = number_format
 96 |         self.more_url = more_url
 97 | 
 98 |     def __repr__(self) -> str:
 99 |         return f'<ComposedMetric "{self.name}": {self.display_formula()}>'
100 | 
101 |     def display_formula(self) -> str:
102 |         return self.formula_template.format(*[f'[{metric.name}]' for metric in self.parent_metrics])
103 | 


--------------------------------------------------------------------------------
/mara_schema/ui/graph.py:
--------------------------------------------------------------------------------
  1 | import graphviz
  2 | from mara_page.xml import _
  3 | 
  4 | from ..data_set import DataSet
  5 | from ..metric import ComposedMetric
  6 | 
  7 | link_color_ = '#0275d8'
  8 | font_size_ = '10.5px'
  9 | line_color_ = '#888888'
 10 | edge_arrow_size_ = '0.7'
 11 | 
 12 | 
 13 | def overview_graph():
 14 |     from .. import config
 15 |     from .views import data_set_url
 16 | 
 17 |     all_entities = set()
 18 | 
 19 |     for data_set in config.data_sets():
 20 |         all_entities.update(data_set.entity.connected_entities())
 21 | 
 22 |     graph = graphviz.Digraph(engine='neato', graph_attr={})
 23 | 
 24 |     for entity in all_entities:
 25 |         data_set = entity.data_set
 26 | 
 27 |         graph.node(name=entity.name,
 28 |                    label=entity.name.replace(' ', '\n'),
 29 |                    fontname=' ',
 30 |                    fontsize=font_size_,
 31 |                    fontcolor=link_color_ if data_set else '#222222',
 32 |                    href=data_set_url(data_set) if data_set else None,
 33 |                    color='transparent',
 34 |                    tooltip=entity.description)
 35 | 
 36 |         for entity_link in entity.entity_links:
 37 |             label = entity_link.prefix.lower().replace(entity_link.target_entity.name.lower(),'').title()
 38 |             graph.edge(entity.name,
 39 |                        entity_link.target_entity.name,
 40 |                        headlabel=label.replace(' ', '\n') if label else None,
 41 |                        labelfloat='true',
 42 |                        labeldistance='2.5',
 43 |                        labelfontsize='9.0',
 44 |                        fontcolor='#cccccc',
 45 |                        arrowsize=edge_arrow_size_,
 46 |                        color=line_color_)
 47 | 
 48 |     return _render_graph(graph)
 49 | 
 50 | 
 51 | def data_set_graph(data_set: DataSet) -> str:
 52 |     from .views import data_set_url
 53 | 
 54 |     paths = data_set.paths_to_connected_entities()
 55 |     if not paths:
 56 |         return ''
 57 | 
 58 |     graph = graphviz.Digraph(engine='neato', graph_attr={})
 59 | 
 60 |     graph.node(name='root',
 61 |                label=data_set.entity.name.replace(' ', '\n'),
 62 |                fontname=' ',
 63 |                fontsize=font_size_,
 64 |                color='#888888',
 65 |                height='0.1',
 66 |                fontcolor='#222222',
 67 |                style='dotted',
 68 |                shape='rectangle',
 69 |                tooltip=data_set.entity.description
 70 |                )
 71 | 
 72 |     for path in paths:
 73 |         entity_link = path[-1]
 74 | 
 75 |         data_set = entity_link.target_entity.data_set
 76 |         graph.node(name=str(path),
 77 |                    label=entity_link.target_entity.name.replace(' ', '\n'),
 78 |                    fontname=' ',
 79 |                    fontsize=font_size_,
 80 |                    color='transparent',
 81 |                    height='0.1',
 82 |                    href=data_set_url(data_set) if data_set else None,
 83 |                    fontcolor=link_color_ if data_set else None,
 84 |                    tooltip=entity_link.target_entity.description
 85 |                    )
 86 | 
 87 |         label = entity_link.prefix.lower().replace(entity_link.target_entity.name.lower(),'').title()
 88 |         graph.edge('root' if len(path) == 1 else str(path[:-1]), str(path),
 89 |                    color=line_color_,
 90 |                    headlabel=label.replace(' ', '\n') if label else None,
 91 |                    labelfloat='true',
 92 |                    labeldistance='2.5',
 93 |                    labelfontsize='9.0',
 94 |                    fontcolor='#cccccc',
 95 |                    arrowsize=edge_arrow_size_)
 96 | 
 97 |     return _render_graph(graph)
 98 | 
 99 | 
100 | def metrics_graph(data_set: DataSet) -> str:
101 |     graph = graphviz.Digraph(engine='dot', graph_attr={'rankdir': 'TD',
102 |                                                        'ranksep': '0.2',
103 |                                                        'nodesep': '0.15',
104 |                                                        'splines': 'true'
105 |                                                        })
106 | 
107 |     connected_metrics = set()
108 |     for metric in data_set.metrics.values():
109 |         if isinstance(metric, ComposedMetric):
110 |             connected_metrics.add(metric)
111 |             for parent_metric in metric.parent_metrics:
112 |                 connected_metrics.add(parent_metric)
113 |                 graph.edge(parent_metric.name,
114 |                            metric.name,
115 |                            color=line_color_,
116 |                            arrowsize=edge_arrow_size_)
117 | 
118 |     for metric in connected_metrics:
119 |         graph.node(name=metric.name,
120 |                    label=metric.name.replace(' ', '\n'),
121 |                    fontname=' ',
122 |                    fontcolor='#222222',
123 |                    fontsize=font_size_,
124 |                    color='transparent',
125 |                    height='0.1',
126 |                    tooltip=f'{metric.description}\n\n{metric.display_formula()}')
127 |     return _render_graph(graph)
128 | 
129 | 
130 | def _render_graph(graph: graphviz.Digraph) -> str:
131 |     try:
132 |         return graph.pipe('svg').decode('utf-8')
133 |     except graphviz.backend.ExecutableNotFound as e:
134 |         import uuid
135 |         # This exception occurs when the graphviz tools are not found.
136 |         # We use here a fallback to client-side rendering using the javascript library d3-graphviz.
137 |         graph_id = f'dependency_graph_{uuid.uuid4().hex}'
138 |         escaped_graph_source = graph.source.replace("`","\\`")
139 |         return str(_.div(id=graph_id)[
140 |             _.tt(style="color:red")[str(e)],
141 |         ]) + str(_.script[
142 |             f'div=d3.select("#{graph_id}");',
143 |             'graph=div.graphviz();',
144 |             'div.text("");',
145 |             f'graph.renderDot(`{escaped_graph_source}`);',
146 |         ])
147 | 


--------------------------------------------------------------------------------
/mara_schema/entity.py:
--------------------------------------------------------------------------------
  1 | import typing
  2 | 
  3 | from .attribute import Attribute, Type
  4 | 
  5 | 
  6 | class Entity():
  7 |     def __init__(self, name: str, description: str,
  8 |                  schema_name: str, table_name: str = None,
  9 |                  pk_column_name: str = None):
 10 |         """
 11 |         A business object with attributes and links to other entities, corresponds to a table in the dimensional schema
 12 | 
 13 |         Args:
 14 |             name: A short noun phrase that captures the nature of the entity.  E.g. "Customer", "Order item"
 15 |             description: A short text that helps to understand the underlying business process.
 16 |                 E.g. "People who registered through the web site or installed the app"
 17 |             schema_name: The database schema of the underlying table in the dimensional schema, e.g. "xy_dim"
 18 |             table_name: The name of the underlying table in the dimensional schema, e.g. "order_item".
 19 |                 Defaults to the lower-cased entity name with spaces replaced by underscores
 20 |             pk_column_name: The primary key column in the underlying table, defaults to table_name + '_id'
 21 |         """
 22 |         self.name = name
 23 |         self.description = description
 24 |         self.schema_name = schema_name
 25 |         self.table_name = table_name or name.lower().replace(' ', '_')
 26 |         self.pk_column_name = pk_column_name or f'{self.table_name}_id'
 27 | 
 28 |         self.attributes = []
 29 |         self.entity_links = []
 30 |         self.data_set = None  # the data set that contains the entity
 31 | 
 32 |     def __repr__(self) -> str:
 33 |         return f'<Entity "{self.name}">'
 34 | 
 35 |     def add_attribute(self, name: str, description: str, column_name: str = None, type: Type = None,
 36 |                       high_cardinality: bool = False, personal_data: bool = False, important_field: bool = False,
 37 |                       accessible_via_entity_link: bool = True,
 38 |                       more_url: typing.Optional[str] = None,
 39 |                       ) -> None:
 40 |         """
 41 |         Adds a property based on a column in the underlying dimensional table to the entity
 42 | 
 43 |         Args:
 44 |             name: How the attribute is displayed in front-ends, e.g. "Order date"
 45 |             description: A meaningful business definition of the attribute. E.g. "The date when the order was placed"
 46 |             column_name: The name of the column in the underlying database table.
 47 |                 Defaults to the lower-cased name with whitespace replaced by underscores.
 48 |             type: The type of the attribute, see definition of `Type` enum
 49 |             high_cardinality: It refers to columns with values that are very uncommon or unique. Defaults to False.
 50 |             personal_data: It refers to person related data, e.g. "Email address", "Name".
 51 |             important_field: A field that highlights the data set. Shown by default in overviews
 52 |             accessible_via_entity_link: If False, then this attribute is excluded from data sets that are not
 53 |                      based on this entity.
 54 |             more_url: URL (as string) which should be appended as a `more...` link in the UI.
 55 |         """
 56 |         self.attributes.append(
 57 |             Attribute(
 58 |                 name=name,
 59 |                 description=description,
 60 |                 column_name=column_name or name.lower().replace(' ', '_'),
 61 |                 accessible_via_entity_link=accessible_via_entity_link,
 62 |                 type=type,
 63 |                 high_cardinality=high_cardinality,
 64 |                 personal_data=personal_data,
 65 |                 important_field=important_field,
 66 |                 more_url=more_url,
 67 |             ))
 68 | 
 69 |     def remove_attribute(self, name: str) -> None:
 70 |         """
 71 |         Removes a property based on a column in the underlying dimensional table from the entity
 72 | 
 73 |         Args:
 74 |             name: How the attribute is displayed in front-ends, e.g. "Order date"
 75 |         """
 76 |         self.attributes.remove(self.find_attribute(name))
 77 | 
 78 |     def link_entity(self, target_entity: 'Entity', fk_column: str = None,
 79 |                     prefix: str = None, description=None) -> None:
 80 |         """
 81 |         Adds a link from the entity to another entity, corresponds to a foreign key relationship
 82 | 
 83 |         Args:
 84 |             target_entity: The referenced entity, e.g. an "Order" entity
 85 |             fk_column: The foreign key column in the source entity, e.g. "first_order_fk" in the "customer" table
 86 |             prefix: Attributes from the linked entity will be prefixed with this, e.g "First order".
 87 |                     Defaults to the name of the linked entity.
 88 |             description: A short explanation for the relation between the entity and target entity
 89 |         """
 90 |         self.entity_links.append(
 91 |             EntityLink(
 92 |                 target_entity=target_entity,
 93 |                 fk_column=fk_column or f'{target_entity.table_name}_fk',
 94 |                 prefix=prefix if prefix is not None else target_entity.name,
 95 |                 description=description))
 96 | 
 97 |     def find_entity_link(self, target_entity_name: str, prefix: str = None) -> 'EntityLink':
 98 |         """Find an EntityLink by its target entity name and, optionally, its prefix."""
 99 | 
100 |         entity_links = [entity_link for entity_link in self.entity_links
101 |                         if entity_link.target_entity.name == target_entity_name
102 |                         and (prefix is None or prefix == entity_link.prefix)]
103 | 
104 |         if not entity_links:
105 |             raise LookupError(f"""Linked entity "{target_entity_name}" / "{prefix or ''}" not found in {self}""")
106 | 
107 |         if len(entity_links) > 1:
108 |             raise LookupError(f"""Multiple linked entities found for "{target_entity_name}" / "{prefix}" """)
109 | 
110 |         return entity_links[0]
111 | 
112 |     def find_attribute(self, attribute_name: str) -> Attribute:
113 |         """Find an attribute by its name"""
114 |         attribute = next((attribute for attribute in self.attributes if attribute.name == attribute_name), None)
115 |         if not attribute:
116 |             raise KeyError(f'Attribute "{attribute_name}" not found in {self}')
117 |         return attribute
118 | 
119 |     def connected_entities(self) -> ['Entity']:
120 |         """ Find all recursively linked entities. """
121 |         result = set([self])
122 | 
123 |         def traverse_graph(entity: Entity):
124 |             for link in entity.entity_links:
125 |                 if link.target_entity not in result:
126 |                     result.add(link.target_entity)
127 |                     traverse_graph(link.target_entity)
128 | 
129 |         traverse_graph(self)
130 | 
131 |         return result
132 | 
133 | 
134 | class EntityLink():
135 |     def __init__(self, target_entity: Entity, prefix: str,
136 |                  description: str = None, fk_column: str = None) -> None:
137 |         """
138 |         A link from an entity to another entity, corresponds to a foreign key relationship
139 | 
140 |         Args:
141 |             target_entity: The referenced entity, e.g. an "Order" entity
142 |             prefix: Attributes from the linked entity will be prefixed with this, e.g. "First order".
143 |             description: A short explanation for the relation between the entity and target entity
144 |             fk_column: The foreign key column in the source entity, e.g. "first_order_fk" in the "customer" table
145 |         """
146 |         self.target_entity = target_entity
147 |         self.prefix = prefix
148 |         self.description = description
149 |         self.fk_column = fk_column or f'{target_entity.table_name}_fk'
150 | 
151 |     def __repr__(self) -> str:
152 |         return f'<EntityLink "{self.target_entity.name}" / "{self.prefix}">'
153 | 


--------------------------------------------------------------------------------
/docs/example.rst:
--------------------------------------------------------------------------------
  1 | Example
  2 | =======
  3 | 
  4 | Let's consider the following toy example of a dimensional schema in the data warehouse of a hypothetical e-commerce company:
  5 | 
  6 | .. image:: _static/example-dimensional-database-schema.svg
  7 |     :alt: Example dimensional star schema
  8 | 
  9 | Each box is a database table with its columns, and the lines between tables show the foreign key constraints. That's a classic Kimball-style `snowflake schema <https://en.wikipedia.org/wiki/Snowflake_schema>`_ and it requires a proper modelling / ETL layer in your data warehouse. A script that creates these example tables in PostgreSQL can be found in `example/dimensional-schema.sql <https://github.com/mara/mara-schema/blob/main/mara_schema/example/dimensional-schema.sql>`_.
 10 | 
 11 | It's a prototypical data warehouse schema for B2C e-commerce: There are orders composed of individual product purchases (order items) made by customers. There are circular references: Orders have a customer, and customers have a first order. Order items have a product (and thus a product category) and customers have a favourite product category.
 12 | 
 13 | The respective entity and data set definitions for this database schema can be found in the `mara_schema/example <https://github.com/mara/mara-schema/tree/main/mara_schema/example>`_ directory.
 14 | 
 15 | Entities
 16 | --------
 17 | 
 18 | In Mara Schema, each business-relevant table in the dimensional schema is mapped to an `Entity <https://github.com/mara/mara-schema/blob/main/mara_schema/entity.py>`_. In dimensional modelling terms, entities can be both fact tables and dimensions. For example, a customer entity can be a dimension of an order items data set (a.k.a. "cube", "model", "data mart") and also the basis of a customer data set of its own.
 19 | 
 20 | Here's a `shortened <https://github.com/mara/mara-schema/blob/main/mara_schema/example/entities/order_item.py>`_ definition of the "Order item" entity based on the ``dim.order_item`` table:
 21 | 
 22 | .. code-block:: python
 23 | 
 24 |     from mara_schema.entity import Entity
 25 | 
 26 |     order_item_entity = Entity(
 27 |         name='Order item',
 28 |         description='Individual products sold as part of an order',
 29 |         schema_name='dim')
 30 | 
 31 | It assumes that there is an ``order_item`` table in the ``dim`` schema of the data warehouse, with ``order_item_id`` as the primary key. The optional ``table_name`` and ``pk_column_name`` parameters can be used when another naming scheme for tables and primary keys is used.
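
The naming convention described above can be sketched as follows. This is only an illustration: ``default_table_name`` and ``default_pk_column_name`` are hypothetical helpers that are not part of Mara Schema, and the derivation (lower-casing the entity name and replacing spaces with underscores) is an assumption modelled on how ``add_attribute`` derives its default ``column_name``.

```python
def default_table_name(entity_name: str) -> str:
    """Hypothetical helper: derive a table name from an entity name."""
    # Assumption: same rule that add_attribute applies to column names
    return entity_name.lower().replace(' ', '_')


def default_pk_column_name(entity_name: str) -> str:
    """Hypothetical helper: derive the primary key column from an entity name."""
    return f'{default_table_name(entity_name)}_id'


print(default_table_name('Order item'))
print(default_pk_column_name('Order item'))
```

For the "Order item" entity this yields the ``order_item`` table and the ``order_item_id`` primary key mentioned in the text.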
 32 | 
 33 | Attributes
 34 | ----------
 35 | 
 36 | `Attributes <https://github.com/mara/mara-schema/blob/main/mara_schema/attribute.py>`_ represent facts about an entity. They correspond to the non-numerical columns in a fact or dimension table:
 37 | 
 38 | .. code-block:: python
 39 | 
 40 |     from mara_schema.attribute import Type
 41 | 
 42 |     order_item_entity.add_attribute(
 43 |         name='Order item ID',
 44 |         description='The ID of the order item in the backend',
 45 |         column_name='order_item_id',
 46 |         type=Type.ID,
 47 |         high_cardinality=True)
 48 | 
 49 | They come with a descriptive name (as shown in reporting front-ends), a description and a ``column_name`` in the underlying database table.
 50 | 
 51 | There are several parameters for controlling the generation of artifact tables and the visibility in front-ends:
 52 | - Setting ``personal_data`` to ``True`` means that the attribute contains personally identifiable information and thus should be hidden from most users.
 53 | - When ``high_cardinality`` is ``True``, then the attribute is hidden in front-ends that cannot deal well with dimensions with a lot of values.
 54 | - The ``type`` parameter controls how some fields are treated in artifact creation. See `mara_schema/attribute.py#L7 <https://github.com/mara/mara-schema/blob/main/mara_schema/attribute.py#L7>`_.
 55 | - An attribute with ``important_field`` set to ``True`` is highlighted and shown by default in overviews.
 56 | - When ``accessible_via_entity_link`` is ``False``, then the attribute will be hidden in data sets that use the entity as a dimension.
 57 | 
 58 | Linking entities
 59 | ----------------
 60 | 
 61 | The attributes of the dimensions of an entity are recursively linked with the ``link_entity`` method:
 62 | 
 63 | .. code-block:: python
 64 | 
 65 |     from .order import order_entity
 66 |     from .product import product_entity
 67 | 
 68 |     order_item_entity.link_entity(target_entity=order_entity, prefix='')
 69 |     order_item_entity.link_entity(target_entity=product_entity)
 70 | 
 71 | This pulls in attributes of other entities that are connected to an entity table via foreign key columns. When the other entity is called "Foo bar", then it's assumed that there is a ``foo_bar_fk`` column in the entity table (this can be overridden with the ``fk_column`` parameter). The optional ``prefix`` controls how linked attributes are named (e.g. "First order date" vs "Order date") and also helps to disambiguate when there are multiple links from one entity to another.
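
The defaults applied by ``link_entity`` can be illustrated with a small standalone sketch. ``resolve_link_defaults`` is a hypothetical function written for this example; it only mirrors the two fallback expressions used in ``mara_schema/entity.py`` and takes plain strings instead of ``Entity`` objects.

```python
from typing import Optional, Tuple


def resolve_link_defaults(target_name: str, target_table_name: str,
                          fk_column: Optional[str] = None,
                          prefix: Optional[str] = None) -> Tuple[str, str]:
    """Hypothetical helper returning the (fk_column, prefix) a link would use."""
    # Fall back to "<target table>_fk" when no foreign key column is given
    resolved_fk = fk_column or f'{target_table_name}_fk'
    # Only None triggers the default, so an empty string is kept as "no prefix"
    resolved_prefix = prefix if prefix is not None else target_name
    return resolved_fk, resolved_prefix


print(resolve_link_defaults('Product', 'product'))
print(resolve_link_defaults('Order', 'order', prefix=''))
print(resolve_link_defaults('Order', 'order',
                            fk_column='first_order_fk', prefix='First order'))
```

Because the prefix default is only applied for ``None``, passing ``prefix=''`` (as in the "Order" link above) leaves the linked attribute names unprefixed.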
 72 | 
 73 | Data Sets
 74 | ---------
 75 | 
 76 | Once all entities and their relationships are established, `Data Sets <https://github.com/mara/mara-schema/blob/main/mara_schema/data_set.py>`_ (a.k.a. "cubes", "models" or "data marts") add metrics and attributes from linked entities to an entity:
 77 | 
 78 | .. code-block:: python
 79 | 
 80 |     from mara_schema.data_set import DataSet
 81 | 
 82 |     from ..entities.order_item import order_item_entity
 83 | 
 84 |     order_items_data_set = DataSet(entity=order_item_entity, name='Order items')
 85 | 
 86 | 
 87 | There are two kinds of `Metrics <https://github.com/mara/mara-schema/blob/main/mara_schema/metric.py>`_ (a.k.a. "Measures") in Mara Schema: simple metrics and composed metrics. Simple metrics are computed as direct aggregations on an entity table column:
 88 | 
 89 | .. code-block:: python
 90 | 
 91 |     from mara_schema.data_set import Aggregation
 92 | 
 93 |     order_items_data_set.add_simple_metric(
 94 |         name='# Orders',
 95 |         description='The number of valid orders (orders with an invoice)',
 96 |         column_name='order_fk',
 97 |         aggregation=Aggregation.DISTINCT_COUNT,
 98 |         important_field=True)
 99 | 
100 |     order_items_data_set.add_simple_metric(
101 |         name='Product revenue',
102 |         description='The price of the ordered products as shown in the cart',
103 |         aggregation=Aggregation.SUM,
104 |         column_name='product_revenue',
105 |         important_field=True)
106 | 
107 | In this example the metric "# Orders" is defined as the distinct count on the ``order_fk`` column, and "Product revenue" as the sum of the ``product_revenue`` column.
108 | 
109 | Composed metrics are built from other metrics (both simple and composed) like this:
110 | 
111 | .. code-block:: python
112 | 
113 |     order_items_data_set.add_composed_metric(
114 |         name='Revenue',
115 |         description='The total cart value of the order',
116 |         formula='[Product revenue] + [Shipping revenue]',
117 |         important_field=True)
118 | 
119 |     order_items_data_set.add_composed_metric(
120 |         name='AOV',
121 |         description='The average revenue per order. Attention: not meaningful when split by product',
122 |         formula='[Revenue] / [# Orders]',
123 |         important_field=True)
124 | 
125 | The ``formula`` parameter takes simple algebraic expressions (``+``, ``-``, ``*``, ``/`` and parentheses) with the names of the parent metrics in square brackets, e.g. ``([a] + [b]) / [c]``.
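
As a rough sketch of how such a formula can be taken apart, the snippet below follows the approach used in ``add_composed_metric`` in ``mara_schema/data_set.py``: the formula is split on the bracketed names, giving the list of parent metric names and a template with ``{}`` placeholders. The function name ``parse_formula`` is made up for this example and does not exist in the library.

```python
import re
from typing import List, Tuple


def parse_formula(formula: str) -> Tuple[List[str], str]:
    """Hypothetical helper: split a formula into metric names and a template."""
    # Collapse line breaks and repeated whitespace
    cleaned = re.sub(r"\s\s+", " ", formula.strip().replace('\n', ''))
    # The capturing group keeps the bracket contents at the odd positions
    parts = re.split(r'\[(.*?)\]', cleaned)
    metric_names = parts[1::2]
    # The even positions are the text between metrics; join them with placeholders
    template = '{}'.join(parts[0::2])
    return metric_names, template


names, template = parse_formula('([a] + [b]) / [c]')
print(names)     # ['a', 'b', 'c']
print(template)  # ({} + {}) / {}
```

Each name found this way has to refer to a metric that was already added to the data set, which is why "Revenue" must be defined before "AOV" in the example above.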
126 | 
127 | Excluding specific entity links
128 | -------------------------------
129 | 
130 | With complex snowflake schemas the graph of linked entities can become rather big. To avoid cluttering data sets with unnecessary attributes, Mara Schema has a way to exclude entire entity links:
131 | 
132 | ``customers_data_set.exclude_path(['Order', 'Customer'])``
133 | 
134 | This means that the customer of the first order of a customer will not be part of the customers data set. Similarly, it is possible to limit the list of attributes from a linked entity:
135 | 
136 | ``order_items_data_set.include_attributes(['Order', 'Customer', 'Order'], ['Order date'])``
137 | 
138 | Here only the order date of the first order of the customer of the order will be included in the data set.
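
To see why excluding a path matters in a schema with circular references, here is a simplified, self-contained sketch of the traversal idea behind ``paths_to_connected_entities``. It is not the library code: entities are plain strings, the ``links`` dictionary is a reduced version of the example schema, and ``enumerate_paths`` is a hypothetical name. A link is never followed twice within one path, and excluded paths are not followed at all.

```python
# Reduced version of the example schema: entity name -> linked entity names
links = {
    'Order item': ['Order', 'Product'],
    'Order': ['Customer'],
    'Customer': ['Order', 'Product category'],
    'Product': ['Product category'],
    'Product category': [],
}


def enumerate_paths(start, excluded=frozenset()):
    """Hypothetical helper: list all link paths reachable from `start`."""
    result = []

    def traverse(entity, current_path):
        for target in links[entity]:
            edge = (entity, target)
            path = current_path + (edge,)
            # Skip links already used in this path (cycles) and excluded paths
            if edge not in current_path and path not in excluded:
                result.append(path)
                traverse(target, path)

    traverse(start, ())
    # Show each path as the list of target entity names, like a path spec
    return [[target for _, target in path] for path in result]


print(enumerate_paths('Customer'))

# Counterpart of exclude_path(['Order', 'Customer']) on the customers data set
excluded = {(('Customer', 'Order'), ('Order', 'Customer'))}
print(enumerate_paths('Customer', excluded))
```

Without the exclusion, the customer of the first order (and that customer's favourite product category) shows up as additional paths; with it, only the direct links from "Customer" remain.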
139 | 


--------------------------------------------------------------------------------
/docs/_static/example-dimensional-database-schema.svg:
--------------------------------------------------------------------------------
  1 | <?xml version="1.0" encoding="UTF-8" standalone="no"?>
  2 | <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
  3 |  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
  4 | <!-- Generated by graphviz version 2.43.0 (0)
  5 |  -->
  6 | <!-- Title: %3 Pages: 1 -->
  7 | <svg width="416pt" height="348pt"
  8 |  viewBox="0.00 0.00 416.00 348.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
  9 | <g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 344)">
 10 | <title>%3</title>
 11 | <polygon fill="white" stroke="transparent" points="-4,4 -4,-344 412,-344 412,4 -4,4"/>
 12 | <!-- dim.customer -->
 13 | <g id="node1" class="node">
 14 | <title>dim.customer</title>
 15 | <polygon fill="#ffffcc" stroke="transparent" points="256,-238 256,-336 400,-336 400,-238 256,-238"/>
 16 | <text text-anchor="start" x="258" y="-327" font-family="Helvetica, Arial, sans-serif" font-weight="bold" text-decoration="underline" font-size="10.00" fill="#555555"> customer </text>
 17 | <text text-anchor="start" x="258" y="-314" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> customer_id </text>
 18 | <text text-anchor="start" x="258" y="-302" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> email </text>
 19 | <text text-anchor="start" x="258" y="-290" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> duration_since_first_order </text>
 20 | <text text-anchor="start" x="258" y="-278" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> first_order_fk </text>
 21 | <text text-anchor="start" x="257.96" y="-266" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> favourite_product_category_fk </text>
 22 | <text text-anchor="start" x="258" y="-254" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> number_of_orders </text>
 23 | <text text-anchor="start" x="258" y="-242" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> revenue_lifetime </text>
 24 | <polygon fill="none" stroke="black" points="256,-238 256,-336 400,-336 400,-238 256,-238"/>
 25 | </g>
 26 | <!-- dim.order -->
 27 | <g id="node2" class="node">
 28 | <title>dim.order</title>
 29 | <polygon fill="#ffffcc" stroke="transparent" points="256,-94 256,-156 320,-156 320,-94 256,-94"/>
 30 | <text text-anchor="start" x="258" y="-147" font-family="Helvetica, Arial, sans-serif" font-weight="bold" text-decoration="underline" font-size="10.00" fill="#555555"> order </text>
 31 | <text text-anchor="start" x="258" y="-134" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> order_id </text>
 32 | <text text-anchor="start" x="257.99" y="-122" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> customer_fk </text>
 33 | <text text-anchor="start" x="258" y="-110" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> order_date </text>
 34 | <text text-anchor="start" x="258" y="-98" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> status </text>
 35 | <polygon fill="none" stroke="black" points="256,-94 256,-156 320,-156 320,-94 256,-94"/>
 36 | </g>
 37 | <!-- dim.customer&#45;&gt;dim.order -->
 38 | <g id="edge4" class="edge">
 39 | <title>dim.customer&#45;&gt;dim.order</title>
 40 | <path fill="none" stroke="#888888" d="M321.39,-233.78C316.92,-213.2 311.11,-189.9 305.51,-170.15"/>
 41 | <polygon fill="#888888" stroke="#888888" points="308.85,-169.11 302.7,-160.48 302.13,-171.06 308.85,-169.11"/>
 42 | </g>
 43 | <!-- dim.product_category -->
 44 | <g id="node5" class="node">
 45 | <title>dim.product_category</title>
 46 | <polygon fill="#ffffcc" stroke="transparent" points="132,-172 132,-222 232,-222 232,-172 132,-172"/>
 47 | <text text-anchor="start" x="134" y="-213" font-family="Helvetica, Arial, sans-serif" font-weight="bold" text-decoration="underline" font-size="10.00" fill="#555555"> product_category </text>
 48 | <text text-anchor="start" x="133.91" y="-200" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> product_category_id </text>
 49 | <text text-anchor="start" x="134" y="-188" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> level_1 </text>
 50 | <text text-anchor="start" x="134" y="-176" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> level_2 </text>
 51 | <polygon fill="none" stroke="black" points="132,-172 132,-222 232,-222 232,-172 132,-172"/>
 52 | </g>
 53 | <!-- dim.customer&#45;&gt;dim.product_category -->
 54 | <g id="edge5" class="edge">
 55 | <title>dim.customer&#45;&gt;dim.product_category</title>
 56 | <path fill="none" stroke="#888888" d="M247.74,-237.52C244.36,-235.44 241,-233.37 237.69,-231.33"/>
 57 | <polygon fill="#888888" stroke="#888888" points="239.42,-228.29 229.07,-226.02 235.75,-234.25 239.42,-228.29"/>
 58 | </g>
 59 | <!-- dim.order&#45;&gt;dim.customer -->
 60 | <g id="edge2" class="edge">
 61 | <title>dim.order&#45;&gt;dim.customer</title>
 62 | <path fill="none" stroke="#888888" d="M290.93,-160.31C294.5,-178.97 299.95,-202.45 305.65,-224.1"/>
 63 | <polygon fill="#888888" stroke="#888888" points="302.3,-225.15 308.28,-233.9 309.07,-223.34 302.3,-225.15"/>
 64 | </g>
 65 | <!-- dim.status -->
 66 | <g id="node6" class="node">
 67 | <title>dim.status</title>
 68 | <polygon fill="#ffffcc" stroke="transparent" points="345,-57 345,-71 381,-71 381,-57 345,-57"/>
 69 | <text text-anchor="start" x="346.88" y="-62" font-family="Helvetica, Arial, sans-serif" font-weight="bold" text-decoration="underline" font-size="10.00" fill="#555555"> status </text>
 70 | <polygon fill="none" stroke="black" points="345,-57 345,-71 381,-71 381,-57 345,-57"/>
 71 | </g>
 72 | <!-- dim.order&#45;&gt;dim.status -->
 73 | <g id="edge3" class="edge">
 74 | <title>dim.order&#45;&gt;dim.status</title>
 75 | <path fill="none" stroke="#888888" d="M328.13,-92.36C329.75,-91.04 331.37,-89.73 332.96,-88.43"/>
 76 | <polygon fill="#888888" stroke="#888888" points="335.17,-91.15 340.72,-82.12 330.75,-85.72 335.17,-91.15"/>
 77 | </g>
 78 | <!-- dim.order_item -->
 79 | <g id="node3" class="node">
 80 | <title>dim.order_item</title>
 81 | <polygon fill="#ffffcc" stroke="transparent" points="182,-4 182,-78 270,-78 270,-4 182,-4"/>
 82 | <text text-anchor="start" x="184" y="-69" font-family="Helvetica, Arial, sans-serif" font-weight="bold" text-decoration="underline" font-size="10.00" fill="#555555"> order_item </text>
 83 | <text text-anchor="start" x="184" y="-56" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> order_item_id </text>
 84 | <text text-anchor="start" x="184" y="-44" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> order_fk </text>
 85 | <text text-anchor="start" x="184" y="-32" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> product_fk </text>
 86 | <text text-anchor="start" x="184" y="-20" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> product_revenue </text>
 87 | <text text-anchor="start" x="183.75" y="-8" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> shipping_revenue </text>
 88 | <polygon fill="none" stroke="black" points="182,-4 182,-78 270,-78 270,-4 182,-4"/>
 89 | </g>
 90 | <!-- dim.order_item&#45;&gt;dim.order -->
 91 | <g id="edge6" class="edge">
 92 | <title>dim.order_item&#45;&gt;dim.order</title>
 93 | <path fill="none" stroke="#888888" d="M256.27,-82.02C256.4,-82.18 256.52,-82.35 256.64,-82.52"/>
 94 | <polygon fill="#888888" stroke="#888888" points="253.4,-84.02 262.16,-89.98 259.03,-79.86 253.4,-84.02"/>
 95 | </g>
 96 | <!-- dim.product -->
 97 | <g id="node4" class="node">
 98 | <title>dim.product</title>
 99 | <polygon fill="#ffffcc" stroke="transparent" points="8,-94 8,-156 108,-156 108,-94 8,-94"/>
100 | <text text-anchor="start" x="10" y="-147" font-family="Helvetica, Arial, sans-serif" font-weight="bold" text-decoration="underline" font-size="10.00" fill="#555555"> product </text>
101 | <text text-anchor="start" x="10" y="-134" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> product_id </text>
102 | <text text-anchor="start" x="10" y="-122" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> sku </text>
103 | <text text-anchor="start" x="9.92" y="-110" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> product_category_fk </text>
104 | <text text-anchor="start" x="10" y="-98" font-family="Helvetica, Arial, sans-serif" font-size="10.00" fill="#555555"> revenue_all_time </text>
105 | <polygon fill="none" stroke="black" points="8,-94 8,-156 108,-156 108,-94 8,-94"/>
106 | </g>
107 | <!-- dim.order_item&#45;&gt;dim.product -->
108 | <g id="edge1" class="edge">
109 | <title>dim.order_item&#45;&gt;dim.product</title>
110 | <path fill="none" stroke="#888888" d="M173.76,-67.12C158.47,-74.77 141.52,-83.24 125.42,-91.29"/>
111 | <polygon fill="#888888" stroke="#888888" points="123.43,-88.37 116.05,-95.97 126.56,-94.63 123.43,-88.37"/>
112 | </g>
113 | <!-- dim.product&#45;&gt;dim.product_category -->
114 | <g id="edge7" class="edge">
115 | <title>dim.product&#45;&gt;dim.product_category</title>
116 | <path fill="none" stroke="#888888" d="M116.01,-158.68C118.32,-160.03 120.65,-161.38 122.97,-162.72"/>
117 | <polygon fill="#888888" stroke="#888888" points="121.45,-165.89 131.85,-167.88 124.96,-159.83 121.45,-165.89"/>
118 | </g>
119 | </g>
120 | </svg>
121 | 


--------------------------------------------------------------------------------
/mara_schema/data_set.py:
--------------------------------------------------------------------------------
  1 | import re
  2 | import typing
  3 | 
  4 | from .attribute import Attribute
  5 | from .entity import Entity, EntityLink
  6 | from .metric import NumberFormat, Aggregation, SimpleMetric, ComposedMetric
  7 | 
  8 | 
  9 | class DataSet():
 10 |     def __init__(self, entity: Entity, name: str):
 11 |         """
 12 |         An entity with its metrics and recursively linked entities.
 13 | 
 14 |         Args:
 15 |             entity: The underlying entity with its attributes and linked other entities
 16 |             name: The name of the data set.
 17 |         """
 18 |         self.entity = entity
 19 |         self.name = name
 20 | 
 21 |         self.entity.data_set = self
 22 |         self.metrics = {}
 23 |         self.excluded_paths = set()
 24 |         self.included_attributes = {}
 25 |         self.excluded_attributes = {}
 26 | 
 27 |     def __repr__(self) -> str:
 28 |         return f'<DataSet "{self.entity.name}">'
 29 | 
 30 |     def add_simple_metric(self, name: str, description: str, column_name: str, aggregation: Aggregation,
 31 |                           important_field: bool = False,
 32 |                           number_format: NumberFormat = NumberFormat.STANDARD,
 33 |                           more_url: typing.Optional[str] = None,
 34 |                           ):
 35 |         """
 36 |         Add a metric that is computed as a direct aggregation on an entity table column
 37 | 
 38 |         Args:
 39 |             name: How the metric is displayed in front-ends, e.g. "Revenue after cancellations"
 40 |             description: A meaningful business definition of the metric
 41 |             column_name: The column that the aggregation is based on
 42 |             aggregation: The aggregation method to use
 43 |             important_field: It refers to key business metrics.
 44 |             number_format: The way to format a string. Defaults to NumberFormat.STANDARD.
 45 |             more_url: URL (as string) which should be appended as a `more...` link in the UI.
 46 |         """
 47 |         if name in self.metrics:
 48 |             raise ValueError(f'Metric "{name}" already exists in data set "{self.name}"')
 49 | 
 50 |         self.metrics[name] = SimpleMetric(
 51 |             name=name,
 52 |             description=description,
 53 |             data_set=self,
 54 |             column_name=column_name,
 55 |             aggregation=aggregation,
 56 |             important_field=important_field,
 57 |             number_format=number_format,
 58 |             more_url=more_url,
 59 |         )
 60 | 
 61 |     def add_composed_metric(self, name: str, description: str, formula: str, important_field: bool = False,
 62 |                             number_format: NumberFormat = NumberFormat.STANDARD,
 63 |                             more_url: typing.Optional[str] = None,
 64 |                             ):
 65 |         """
 66 |         Add a metric that is computed from other metrics (simple or composed) of the data set.
 67 | 
 68 |         Args:
 69 |             name: How the metric is displayed in front-ends, e.g. "Revenue after cancellations"
 70 |             description: A meaningful business definition of the metric
 71 |             formula: How to compute the metric. Examples: [Metric A] + [Metric B],  [Metric A] / ([Metric B] + [Metric C])
 72 |             important_field: It refers to key business metrics.
 73 |             number_format: The way to format a string. Defaults to NumberFormat.STANDARD.
 74 |             more_url: URL (as string) which should be appended as a `more...` link in the UI.
 75 |         """
 76 |         if name in self.metrics:
 77 |             raise ValueError(f'Metric "{name}" already exists in data set "{self.name}"')
 78 | 
 79 |         # ' [a] \n + [b]' -> '[a] + [b]'
 80 |         formula_cleaned = re.sub(r"\s\s+", " ", formula.strip().replace('\n', ''))
 81 | 
 82 |         # split '[a] + [b]' -> ['', 'a', ' + ', 'b', '']
 83 |         formula_split = re.split(r'\[(.*?)\]', formula_cleaned)
 84 | 
 85 |         parent_metrics = []
 86 |         for metric_name in formula_split[1::2]:  # [1::2]: start at the second element, take every 2nd (the metric names)
 87 |             if metric_name not in self.metrics:
 88 |                 raise ValueError(f'Could not find metric "{metric_name}" in data set "{self.name}"')
 89 |             parent_metrics.append(self.metrics[metric_name])
 90 | 
 91 |         self.metrics[name] = ComposedMetric(name=name,
 92 |                                             description=description,
 93 |                                             data_set=self,
 94 |                                             parent_metrics=parent_metrics,
 95 |                                             formula_template='{}'.join(formula_split[0::2]),
 96 |                                             important_field=important_field,
 97 |                                             number_format=number_format,
 98 |                                             more_url=more_url,
 99 |                                             )
100 | 
101 |     _PathSpec = typing.TypeVar('_PathSpec', typing.Sequence[typing.Union[str, typing.Tuple[str, str]]], bytes)
102 | 
103 |     def _parse_path(self, entity: Entity, path: _PathSpec) -> typing.Union[
104 |         typing.Tuple, typing.Tuple[EntityLink, EntityLink]]:
105 |         """
106 |         Helper function for parsing path specifications into a tuple of entity link instances
107 | 
108 |         Args:
109 |             entity: the entity for which to resolve the entity links
110 |             path: How to get to the entity from the entity of the data set.
111 |                   A list of either strings (target entity names) or tuples of strings (target entity name + prefix).
112 |                   Example: ['Entity 1', ('Entity 2', 'Prefix'), 'Entity 3']
113 |         """
114 |         if not path:
115 |             return ()
116 | 
117 |         if not (isinstance(path[0], str) or (isinstance(path[0], tuple) and len(path[0]) == 2)):
118 |             raise TypeError(f'Expecting a string or a tuple of two strings, got: {path[0]}')
119 | 
120 |         target_entity_name, prefix = (path[0], None) if isinstance(path[0], str) else path[0]
121 |         entity_link = entity.find_entity_link(target_entity_name, prefix)
122 | 
123 |         return (entity_link,) + self._parse_path(entity_link.target_entity, path[1::])
124 | 
125 |     def exclude_path(self, path: _PathSpec):
126 |         """
127 |         Exclude a connected entity from generated data set tables by specifying the entity links to that entity
128 | 
129 |         Args:
130 |             path: How to get to the entity from the data set entity.
131 |                   A list of either strings (target entity names) or tuples of strings (target entity name + prefix).
132 |                   Example: ['Entity 1', ('Entity 2', 'Prefix'), 'Entity 3']
133 |         """
134 |         self.excluded_paths.add(self._parse_path(self.entity, path))
135 | 
136 |     def exclude_attributes(self, path: _PathSpec, attribute_names: [str] = None):
137 |         """
138 |         Exclude attributes of a connected entity in generated data set tables.
139 | 
140 |         Args:
141 |             path: How to get to the entity from the data set entity.
142 |                   A list of either strings (target entity names) or tuples of strings (target entity name + prefix).
143 |                   Example: ['Entity 1', ('Entity 2', 'Prefix'), 'Entity 3']
144 |             attribute_names: A list of names of attributes to be excluded. If not provided, then exclude all attributes
145 |         """
146 |         entity_links = self._parse_path(self.entity, path)
147 |         entity = entity_links[-1].target_entity
148 | 
149 |         if not attribute_names:
150 |             self.excluded_attributes[entity_links] = entity.attributes
151 |         else:
152 |             self.excluded_attributes[entity_links] = [entity.find_attribute(attribute_name) for attribute_name in
153 |                                                       attribute_names]
154 | 
155 |     def include_attributes(self, path: _PathSpec, attribute_names: [str]):
156 |         """
157 |         Exclude all attributes except the explicitly included ones of a connected entity in generated data set tables.
158 | 
159 |         Args:
160 |             path: How to get to the entity from the data set entity.
161 |                   A list of either strings (target entity names) or tuples of strings (target entity name + prefix).
162 |                   Example: ['Entity 1', ('Entity 2', 'Prefix'), 'Entity 3']
163 |             attribute_names: A list of names of attributes to be included.
164 |         """
165 |         entity_links = self._parse_path(self.entity, path)
166 | 
167 |         self.included_attributes[entity_links] = [entity_links[-1].target_entity.find_attribute(attribute_name) for
168 |                                                   attribute_name in attribute_names]
169 | 
170 |     def paths_to_connected_entities(self) -> [(EntityLink,)]:
171 |         """
172 |         Get all possible paths to connected entities (tuples of entity links)
173 |         - that are not explicitly excluded
174 |         - that are not beyond the max link depth or that are explicitly included
175 |         """
176 | 
177 |         paths = []
178 | 
179 |         def _append_path_including_subpaths(paths, path) -> typing.List[typing.Tuple[EntityLink]]:
180 |             """Append a path and its subpaths to the list of paths, if they do not already exist. A subpath always starts
181 |             at the beginning of the path: (1,2,3) -> (1,), (1,2), (1,2,3)
182 |             """
183 |             for i in range(len(path)):
184 |                 if path[:i + 1] not in paths:
185 |                     paths.append(path[:i + 1])
186 |             return paths
187 | 
188 |         def traverse_graph(entity: Entity, current_path: tuple):
189 |             for entity_link in entity.entity_links:
190 |                 path = current_path + (entity_link,)
191 | 
192 |                 if (entity_link not in current_path  # check for cycles in path
193 |                     and (path not in self.excluded_paths)  # explicitly excluded paths
194 |                 ):
195 |                     _append_path_including_subpaths(paths, path)
196 |                     traverse_graph(entity_link.target_entity, path)
197 | 
198 |         traverse_graph(self.entity, ())
199 | 
200 |         return paths
201 | 
202 |     def connected_attributes(self, include_personal_data: bool = True) -> {(EntityLink,): {str: Attribute}}:
203 |         """
204 |         Returns all attributes with their prefixed name from all connected entities.
205 | 
206 |         Args:
207 |             include_personal_data: If False, then exclude fields that are marked as personal data
208 | 
209 |         Returns:
210 |             A dictionary with the paths as keys and dictionaries of prefixed attribute names and
211 |             attributes as values. Example:
212 |                 {(<EntityLink 1>, <EntityLink 2>): {'Prefixed attribute 1 name': <Attribute 1>,
213 |                                                     'Prefixed attribute 2 name': <Attribute 2>},
214 |                  ..}
215 |         """
216 |         result = {(): {attribute.prefixed_name(): attribute for attribute in self.entity.attributes}}
217 | 
218 |         for path in self.paths_to_connected_entities():
219 |             result[path] = {}
220 |             entity = path[-1].target_entity
221 |             for attribute in entity.attributes:
222 |                 if ((path in self.included_attributes and attribute in self.included_attributes[path])
223 |                     or (path not in self.included_attributes)) \
224 |                     and ((path in self.excluded_attributes and attribute not in self.excluded_attributes[path])
225 |                          or (path not in self.excluded_attributes)) \
226 |                     and attribute.accessible_via_entity_link and (
227 |                     include_personal_data or not attribute.personal_data):
228 |                     result[path][attribute.prefixed_name(path)] = attribute
229 |         return result
230 | 
231 |     def id(self):
232 |         """Returns a representation that can be used in urls"""
233 |         from html import escape
234 |         return escape(self.name.replace(' ', '_').lower())
235 | 


--------------------------------------------------------------------------------
/mara_schema/sql_generation.py:
--------------------------------------------------------------------------------
  1 | import re
  2 | 
  3 | import sqlalchemy
  4 | import sqlalchemy.engine
  5 | 
  6 | from .attribute import Type, normalize_name
  7 | from .data_set import DataSet
  8 | from .entity import EntityLink
  9 | from .metric import SimpleMetric, Aggregation
 10 | 
 11 | 
 12 | def data_set_sql_query(data_set: DataSet,
 13 |                        human_readable_columns=True,
 14 |                        pre_computed_metrics=True,
 15 |                        star_schema: bool = False,
 16 |                        star_schema_transitive_fks: bool = True,
 17 |                        personal_data=True,
 18 |                        high_cardinality_attributes=True,
 19 |                        engine: sqlalchemy.engine.Engine = None) -> str:
 20 |     """
 21 |     Returns a SQL select statement that flattens all linked entities of a data set into a wide table
 22 | 
 23 |     Args:
 24 |         data_set: the data set to flatten
 25 |         human_readable_columns: Whether to use "Customer name" rather than "customer_name" as column name
 26 |         pre_computed_metrics: Whether to pre-compute composed metrics, counts and distinct counts on row level
 27 |         star_schema: Whether to add foreign keys to the tables of linked entities rather than including their attributes.
 28 |         star_schema_transitive_fks: Whether to include all attributes of all transitively linked entities. When False,
 29 |             only their respective foreign keys are included. Defaults to True.
 30 |             Example for star_schema_transitive_fks = False:
 31 |                 SELECT order.id,
 32 |                        order.date,
 33 |                        order.price,
 34 | 
 35 |                        customer.customer_fk,
 36 | 
 37 |                        store.store_fk
 38 |                FROM order
 39 |                  LEFT JOIN customer
 40 |                  LEFT JOIN store
 41 |         personal_data: Whether to include attributes that are marked as personal data
 42 |         high_cardinality_attributes: Whether to include attributes that are marked to have a high cardinality
 43 |         engine: A sqlalchemy engine that is used to quote database identifiers. Defaults to a PostgreSQL engine.
 44 | 
 45 |     Returns:
 46 |         A string containing the select statement
 47 |     """
 48 |     engine = engine or sqlalchemy.create_engine('postgresql+psycopg2://')
 49 | 
 50 |     def quote(name) -> str:
 51 |         """Quote a column or table name for the specified database engine"""
 52 |         return engine.dialect.identifier_preparer.quote(name)
 53 | 
 54 |     # alias for the underlying table of the entity of the data set
 55 |     entity_table_alias = database_identifier(data_set.entity.name)
 56 | 
 57 |     # progressively build the query
 58 |     query = 'SELECT'
 59 | 
 60 |     column_definitions = []
 61 | 
 62 |     # Iterate all connected entities
 63 |     for path, attributes in data_set.connected_attributes().items():
 64 |         first = True  # for adding an empty line between each entity
 65 | 
 66 |         # helper function for adding a column
 67 |         def add_column_definition(table_alias: str, column_name: str, column_alias: str,
 68 |                                   cast_to_text: bool, first: bool, custom_column_expression: str = None):
 69 |             column_definition = '\n    ' if first else '    '
 70 |             column_definition += custom_column_expression or f'{quote(table_alias)}.{quote(column_name)}'
 71 |             if cast_to_text:
 72 |                 if engine.url.drivername.startswith('postgresql'):
 73 |                     column_definition += '::TEXT'
 74 |                 elif engine.url.drivername.startswith('bigquery'):
 75 |                     column_definition = f'CAST({column_definition} AS STRING)'
 76 |                 elif engine.url.drivername.startswith('mssql'):
 77 |                     column_definition = f'CAST({column_definition} AS NVARCHAR)'
 78 |                 else:
 79 |                     raise NotImplementedError(f'Casting to text is not implemented for engine {engine.url.drivername}')
 80 |             if column_alias != column_name:
 81 |                 column_definition += f' AS {quote(column_alias)}'
 82 |             column_definitions.append(column_definition)
 83 | 
 84 |             return False
 85 | 
 86 |         if star_schema and path:  # create a foreign key to the last entity of the path
 87 |             first = add_column_definition(
 88 |                 table_alias=table_alias_for_path(path[:-1]) if len(path) > 1 else entity_table_alias,
 89 |                 column_name=path[-1].fk_column,
 90 |                 column_alias=(normalize_name(' '.join([entity_link.prefix or entity_link.target_entity.name
 91 |                                                        for entity_link in path]))
 92 |                               if human_readable_columns else foreign_key_column_name(table_alias_for_path(path))),
 93 |                 cast_to_text=False, first=first)
 94 | 
 95 |         # Add columns for all attributes
 96 |         # Always add all columns for the first object (i.e. the original dataset) as indicated by path == ()
 97 |         if star_schema_transitive_fks or path == ():
 98 |             for name, attribute in attributes.items():
 99 |                 if attribute.personal_data and not personal_data:
100 |                     continue
101 |                 if attribute.high_cardinality and not high_cardinality_attributes:
102 |                     continue
103 | 
104 |                 table_alias = table_alias_for_path(path) if path else entity_table_alias
105 |                 column_name = attribute.column_name
106 |                 column_alias = name if human_readable_columns else database_identifier(name)
107 |                 custom_column_expression = None
108 | 
109 |                 if star_schema:  # Add foreign keys for dates and durations
110 |                     if attribute.type == Type.DATE:
111 |                         if engine.url.drivername.startswith('postgresql'):
112 |                             custom_column_expression = f"TO_CHAR({quote(table_alias)}.{quote(column_name)}, 'YYYYMMDD') :: INTEGER"
113 |                         elif engine.url.drivername.startswith('bigquery'):
114 |                             custom_column_expression = f"CAST(FORMAT_DATE('%Y%m%d',{quote(table_alias)}.{quote(column_name)}) AS INT64)"
115 |                         elif engine.url.drivername.startswith('mssql'):
116 |                             custom_column_expression = f"CAST(CONVERT(char(8),{quote(table_alias)}.{quote(column_name)},112) AS INT)"
117 |                         else:
118 |                             raise NotImplementedError(f'Star schema casting of DATE attributes is not implemented for engine {engine.url.drivername}')
119 |                         column_alias = name if human_readable_columns else foreign_key_column_name(database_identifier(name))
120 |                     elif attribute.type == Type.DURATION:
121 |                         column_alias = name if human_readable_columns else foreign_key_column_name(database_identifier(name))
122 |                     elif not path:
123 |                         pass  # Add attributes of data set entity
124 |                     else:
125 |                         continue  # Exclude attributes from linked entities
126 | 
127 |                 first = add_column_definition(table_alias=table_alias, column_name=column_name,
128 |                                               column_alias=column_alias,
129 |                                               cast_to_text=attribute.type == Type.ENUM, first=first,
130 |                                               custom_column_expression=custom_column_expression)
131 | 
132 |         # Only add foreign key columns of linked entities
133 |         elif star_schema_transitive_fks is False and path:
134 |             first = add_column_definition(
135 |                 table_alias=table_alias_for_path(path[:-1]) if len(path) > 1 else entity_table_alias,
136 |                 column_name=path[-1].fk_column,
137 |                 column_alias=foreign_key_column_name(table_alias_for_path(path)),
138 |                 cast_to_text=False, first=first)
139 |         else:
140 |             assert False, 'This should not happen.'
141 | 
142 | 
143 |     # helper function for pre-computing composed metrics
144 |     def sql_formula(metric):
145 |         if isinstance(metric, SimpleMetric):
146 |             if metric.aggregation in [Aggregation.DISTINCT_COUNT, Aggregation.COUNT]:
147 |                 # for counts and distinct counts, return 1 if the column is not null and 0 otherwise
148 |                 if engine.url.drivername.startswith('postgresql'):
149 |                     return f'({quote(entity_table_alias)}.{quote(metric.column_name)} IS NOT NULL) ::INTEGER :: SMALLINT'
150 |                 elif engine.url.drivername.startswith('bigquery'):
151 |                     return f'CAST({quote(entity_table_alias)}.{quote(metric.column_name)} IS NOT NULL AS INT64)'
152 |                 else:
153 |                     return f'CASE WHEN {quote(entity_table_alias)}.{quote(metric.column_name)} IS NOT NULL THEN 1 ELSE 0 END'
154 |             else:
155 |                 # Coalesce with 0 so that metrics that combine simple metrics work (in SQL, `1 + NULL` is `NULL`)
156 |                 return f'COALESCE({quote(entity_table_alias)}.{quote(metric.column_name)}, 0)'
157 |         else:
158 |             if '/' in metric.formula_template:  # avoid divisions by 0
159 |                 if engine.url.drivername.startswith('postgresql'):
160 |                     return metric.formula_template.format(
161 |                         *[f'(NULLIF({sql_formula(metric)}, 0.0 :: DOUBLE PRECISION))' for metric in metric.parent_metrics])
162 |                 else:
163 |                     return metric.formula_template.format(
164 |                         *[f'(NULLIF({sql_formula(metric)}, 0.0))' for metric in metric.parent_metrics])
165 | 
166 |             else:  # render metric template
167 |                 return metric.formula_template.format(
168 |                     *[f'({sql_formula(metric)})' for metric in metric.parent_metrics])
169 | 
170 |     first = True
171 |     for name, metric in data_set.metrics.items():
172 |         column_alias = metric.name if human_readable_columns else database_identifier(metric.name)
173 | 
174 |         if pre_computed_metrics:
175 |             column_definition = f'    {sql_formula(metric)} AS {quote(column_alias)}'
176 |         elif isinstance(metric, SimpleMetric):
177 |             column_definition = f'    {quote(entity_table_alias)}.{quote(metric.column_name)}'
178 |             if column_alias != metric.column_name:
179 |                 column_definition += f' AS {quote(column_alias)}'
180 |         else:
181 |             continue
182 | 
183 |         if first:
184 |             column_definition = '\n' + column_definition
185 |             first = False
186 |         column_definitions.append(column_definition)
187 | 
188 |     # add column definitions to SELECT part
189 |     query += ',\n'.join(column_definitions)
190 | 
191 |     # add FROM part for entity table
192 |     query += f'\n\nFROM {quote(data_set.entity.schema_name)}.{quote(data_set.entity.table_name)} {quote(entity_table_alias)}'
193 | 
194 |     # Add LEFT JOIN statements
195 |     for path in data_set.paths_to_connected_entities():
196 |         left_alias = table_alias_for_path(path[:-1]) if len(path) > 1 else database_identifier(data_set.entity.name)
197 |         right_alias = table_alias_for_path(path)
198 |         entity_link = path[-1]
199 |         target_entity = entity_link.target_entity
200 | 
201 |         query += f'\nLEFT JOIN {quote(target_entity.schema_name)}.{quote(target_entity.table_name)} {quote(right_alias)}'
202 |         query += f' ON {quote(left_alias)}.{quote(path[-1].fk_column)} = {quote(right_alias)}.{quote(target_entity.pk_column_name)}'
203 | 
204 |     return query
205 | 
206 | 
207 | def database_identifier(name) -> str:
208 |     """Turns a string into something that can be used as a table or column name"""
209 |     return re.sub('[^0-9a-z]+', '_', name.lower())
210 | 
211 | 
212 | def table_alias_for_path(path: (EntityLink,)) -> str:
213 |     """Turns `(<EntityLink 'Customer'>, <EntityLink 'First order'>,)` into `customer_first_order` """
214 |     return database_identifier('_'.join([entity_link.prefix or entity_link.target_entity.name
215 |                                          for entity_link in path]))
216 | 
217 | 
218 | def foreign_key_column_name(name) -> str:
219 |     """Turns a table alias into a foreign key column name"""
220 |     return f'{name}_fk'
221 | 


--------------------------------------------------------------------------------
/mara_schema/ui/views.py:
--------------------------------------------------------------------------------
  1 | """Documentation of data sets and entities"""
  2 | 
  3 | import functools
  4 | import re
  5 | from html import escape
  6 | 
  7 | import flask
  8 | import unicodedata
  9 | from mara_page import acl, navigation, response, bootstrap, _, html
 10 | 
 11 | from ..data_set import DataSet
 12 | 
 13 | # The flask blueprint for the schema documentation pages
 14 | blueprint = flask.Blueprint('mara_schema', __name__, static_folder='static',
 15 |                             template_folder='templates', url_prefix='/schema')
 16 | 
 17 | # Defines an ACL resource (needs to be handled by the application)
 18 | acl_resource_schema = acl.AclResource(name='Schema')
 19 | 
 20 | 
 21 | def data_set_url(data_set: DataSet) -> str:
 22 |     return flask.url_for('mara_schema.data_set_page', id=data_set.id())
 23 | 
 24 | 
 25 | _slugify_strip_re = re.compile(r'[^\w\s-]')
 26 | _slugify_hyphenate_re = re.compile(r'[-\s]+')
 27 | 
 28 | 
 29 | # from https://github.com/django/django/blob/0382ecfe020b4c51b4c01e4e9a21892771e66941/django/utils/text.py
 30 | # Under BSD license
 31 | def slugify(value, allow_unicode=False):
 32 |     """
 33 |     Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated
 34 |     dashes to single dashes. Remove characters that aren't alphanumerics,
 35 |     underscores, or hyphens. Convert to lowercase. Also strip leading and
 36 |     trailing whitespace, dashes, and underscores.
 37 |     """
 38 |     value = str(value)
 39 |     if allow_unicode:
 40 |         value = unicodedata.normalize('NFKC', value)
 41 |     else:
 42 |         value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
 43 |     value = re.sub(_slugify_strip_re, '', value.lower())
 44 |     return re.sub(_slugify_hyphenate_re, '-', value).strip('-_')
 45 | 
 46 | 
 47 | def schema_navigation_entry() -> navigation.NavigationEntry:
 48 |     """Defines a part of the navigation tree (needs to be handled by the application).
 49 | 
 50 |     Returns:
 51 |         A mara NavigationEntry object.
 52 | 
 53 |     """
 54 |     from .. import config
 55 | 
 56 |     return navigation.NavigationEntry(
 57 |         label='Data sets', icon='book',
 58 |         description='Documentation of attributes and metrics of all data sets',
 59 |         children=[navigation.NavigationEntry(label='Overview', icon='list',
 60 |                                              uri_fn=lambda: flask.url_for('mara_schema.index_page'))]
 61 |                  + [navigation.NavigationEntry(label=data_set.name, icon='book',
 62 |                                                description=data_set.entity.description,
 63 |                                                uri_fn=lambda data_set=data_set: flask.url_for(
 64 |                                                    'mara_schema.data_set_page', id=data_set.id()))
 65 |                     for data_set in config.data_sets()])
 66 | 
 67 | 
 68 | @blueprint.route('')
 69 | @acl.require_permission(acl_resource_schema)
 70 | def index_page() -> response.Response:
 71 |     """Renders the overview page"""
 72 |     from .. import config
 73 | 
 74 |     return response.Response(
 75 | 
 76 |         html=[
 77 |             bootstrap.card(
 78 |                 header_left='Entities & their relations',
 79 |                 body=html.asynchronous_content(flask.url_for('mara_schema.overview_graph'))),
 80 |             bootstrap.card(
 81 |                 header_left='Data sets',
 82 |                 body=bootstrap.table(
 83 |                     ['Name', 'Description'],
 84 |                     [_.tr[_.td[_.a(href=flask.url_for('mara_schema.data_set_page',
 85 |                                                       id=data_set.id()))[
 86 |                         escape(data_set.name)]],
 87 |                           _.td[_.i[escape(data_set.entity.description)]],
 88 |                      ] for data_set in config.data_sets()]),
 89 |             )],
 90 |         title='Data sets documentation',
 91 |         css_files=[flask.url_for('mara_schema.static', filename='mara-schema.css')]
 92 |     )
 93 | 
 94 | 
 95 | @blueprint.route('/<id>')
 96 | @acl.require_permission(acl_resource_schema)
 97 | def data_set_page(id: str) -> response.Response:
 98 |     """Renders the pages for individual data sets"""
 99 |     from .. import config
100 | 
101 |     data_set = next((data_set for data_set in config.data_sets() if data_set.id() == id), None)
102 |     if not data_set:
103 |         flask.flash(f'Could not find data set "{id}"', category='warning')
104 |         return flask.redirect(flask.url_for('mara_schema.index_page'))
105 | 
106 |     base_url = flask.url_for('mara_schema.data_set_sql_query', id=data_set.id())
107 | 
108 |     def attribute_rows(data_set: DataSet) -> []:
109 |         rows = []
110 |         for path, attributes in data_set.connected_attributes().items():
111 |             if path:
112 |                 rows.append(_.tr[_.td(colspan=3, style='border-top:none; padding-top: 20px;')[
113 |                     [['→ ',
114 |                       _.a(href=data_set_url(entity.data_set))[link_title] if entity.data_set else link_title,
115 |                       ' &nbsp;']
116 |                      for entity, link_title
117 |                      in [(entity_link.target_entity, entity_link.prefix or entity_link.target_entity.name)
118 |                          for entity_link in path]],
119 |                     [' &nbsp;&nbsp;', _.i[path[-1].description]] if path[-1].description else ''
120 |                 ]])
121 |             for prefixed_name, attribute in attributes.items():
122 |                 attribute_link_id = slugify(f'attribute {path[-1].target_entity.name if path else ""} {attribute.name}')
123 |                 rows.append(_.tr(id=attribute_link_id)[
124 |                                 _.td[
125 |                                     escape(prefixed_name),
126 |                                     ' ',
127 |                                     _.a(class_='anchor-link-sign',
128 |                                         href=f'#{attribute_link_id}')['¶'],
129 | 
130 |                                 ],
131 |                                 _.td[[_.i[escape(attribute.description)]] +
132 |                                      ([' (', _.a(href=attribute.more_url)['more...'], ')']
133 |                                       if attribute.more_url else [])],
134 |                                 _.td[_.tt[escape(
135 |                                     f'{path[-1].target_entity.table_name + "." if path else ""}{attribute.column_name}')]]])
136 |         return rows
137 | 
138 |     def metrics_rows(data_set: DataSet) -> []:
139 |         rows = []
140 |         for metric in data_set.metrics.values():
141 |             metric_link_id = slugify(f'metric {metric.name}')
142 | 
143 |             rows.append([_.tr(id=metric_link_id)[
144 |                             _.td[
145 |                                 escape(metric.name),
146 |                                 ' ',
147 |                                 _.a(class_='anchor-link-sign',
148 |                                     href=f'#{metric_link_id}')['¶'],
149 | 
150 |                             ],
151 |                             _.td[[_.i[escape(metric.description)]] +
152 |                                  ([' (', _.a(href=metric.more_url)['more...'], ')'] if metric.more_url else [])
153 |                                  ],
154 |                             _.td[_.code[escape(metric.display_formula())]]
155 |                         ]])
156 |         return rows
157 | 
158 |     return response.Response(
159 |         html=[bootstrap.card(
160 |             header_left=_.i[escape(data_set.entity.description)],
161 |             body=[
162 |                 _.p['Entity table: ',
163 |                     _.code[escape(f'{data_set.entity.schema_name}.{data_set.entity.table_name}')]],
164 |                 html.asynchronous_content(flask.url_for('mara_schema.data_set_graph', id=data_set.id())),
165 |             ]),
166 |             bootstrap.card(
167 |                 header_left='Metrics',
168 |                 body=[
169 |                     html.asynchronous_content(flask.url_for('mara_schema.metrics_graph', id=data_set.id())),
170 |                     bootstrap.table(
171 |                         ['Name', 'Description', 'Computation'],
172 |                         metrics_rows(data_set)
173 |                     ),
174 |                 ]),
175 |             bootstrap.card(
176 |                 header_left='Attributes',
177 |                 body=bootstrap.table(["Name", "Description", "Column name"], attribute_rows(data_set))),
178 |             bootstrap.card(
179 |                 header_left=['Data set sql query: &nbsp;',
180 |                              [_.div(class_='form-check form-check-inline')[
181 |                                   "&nbsp;&nbsp; ",
182 |                                   _.label(class_='form-check-label')[
183 |                                       _.input(class_="form-check-input param-checkbox", type="checkbox",
184 |                                               value=param)[
185 |                                           ''], ' ', param]]
186 |                               for param in [
187 |                                   'human readable columns',
188 |                                   'pre-computed metrics',
189 |                                   'star schema',
190 |                                   'star_schema_transitive_fks',
191 |                                   'personal data',
192 |                                   'high cardinality attributes',
193 |                               ]]],
194 |                 body=[_.div(id='sql-container')[html.asynchronous_content(base_url, 'sql-container')],
195 |                       _.script['''
196 | document.addEventListener('DOMContentLoaded', function() {
197 |     DataSetSqlQuery("''' + base_url + '''");
198 | });
199 | ''']])
200 |         ],
201 |         title=f'Data set "{data_set.name}"',
202 |         js_files=[flask.url_for('mara_schema.static', filename='data-set-sql-query.js')],
203 |         css_files=[flask.url_for('mara_schema.static', filename='mara-schema.css')],
204 |     )
205 | 
206 | 
207 | @blueprint.route('/<id>/_data_set_sql_query', defaults={'params': ''})
208 | @blueprint.route('/<id>/_data_set_sql_query/<path:params>')
209 | def data_set_sql_query(id: str, params: str) -> response.Response:
210 |     from .. import config
211 |     from ..sql_generation import data_set_sql_query
212 | 
213 |     params = set(params.split('/'))
214 |     data_set = next((data_set for data_set in config.data_sets() if data_set.id() == id), None)
215 |     if not data_set:
216 |         return f'Could not find data set "{id}"'
217 | 
218 |     # using the engine of the default db from mara_pipelines.config.default_db_alias()
219 |     engine = None
220 |     try:
221 |         # since mara_pipelines and mara_db is not a default requirement of module mara_schema,
222 |         # we use a try/except clause
223 |         import mara_db.sqlalchemy_engine
224 |         import mara_pipelines.config
225 |         engine = mara_db.sqlalchemy_engine.engine(mara_pipelines.config.default_db_alias())
226 |     except (ImportError, ModuleNotFoundError, NotImplementedError):
227 |         pass
228 | 
229 |     sql = data_set_sql_query(data_set,
230 |                              pre_computed_metrics='pre-computed metrics' in params,
231 |                              human_readable_columns='human readable columns' in params,
232 |                              personal_data='personal data' in params,
233 |                              high_cardinality_attributes='high cardinality attributes' in params,
234 |                              star_schema='star schema' in params,
235 |                              star_schema_transitive_fks='star_schema_transitive_fks' in params,
236 |                              engine=engine)
237 |     return str(_.div[html.highlight_syntax(sql, 'sql')])
238 | 
239 | 
240 | @blueprint.route('/_overview_graph')
241 | @acl.require_permission(acl_resource_schema, do_abort=False)
242 | @functools.lru_cache(maxsize=None)
243 | def overview_graph() -> str:
244 |     """Returns a graph of all the defined entities and data sets"""
245 |     from .graph import overview_graph
246 | 
247 |     return overview_graph()
248 | 
249 | 
250 | @blueprint.route('/<id>/_data_set_graph')
251 | @acl.require_permission(acl_resource_schema)
252 | def data_set_graph(id: str) -> str:
253 |     """Renders a graph with all the linked entities of an individual data set"""
254 |     from .. import config
255 |     from .graph import data_set_graph
256 | 
257 |     data_set = next((data_set for data_set in config.data_sets() if data_set.id() == id), None)
258 |     if not data_set:
259 |         return f'Could not find data set "{id}"'
260 | 
261 |     return data_set_graph(data_set)
262 | 
263 | 
264 | @blueprint.route('/<id>/_metrics_graph')
265 | @acl.require_permission(acl_resource_schema)
266 | def metrics_graph(id: str) -> str:
267 |     """Renders a visualization of all composed metrics of a data set"""
268 |     from .. import config
269 |     from .graph import metrics_graph
270 | 
271 |     data_set = next((data_set for data_set in config.data_sets() if data_set.id() == id), None)
272 |     if not data_set:
273 |         return f'Could not find data set "{id}"'
274 | 
275 |     return metrics_graph(data_set)
276 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Mara Schema
  2 | 
  3 | [![Build Status](https://github.com/mara/mara-schema/actions/workflows/build.yaml/badge.svg)](https://github.com/mara/mara-schema/actions/workflows/build.yaml)
  4 | [![PyPI - License](https://img.shields.io/pypi/l/mara-schema.svg)](https://github.com/mara/mara-schema/blob/main/LICENSE)
  5 | [![PyPI version](https://badge.fury.io/py/mara-schema.svg)](https://badge.fury.io/py/mara-schema)
  6 | [![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](https://communityinviter.com/apps/mara-users/public-invite)
  7 | 
  8 | Python-based mapping of physical data warehouse tables to logical business entities (a.k.a. "cubes", "models", "data sets", etc.). It comes with
  9 | - SQL query generation for flattening normalized database tables into wide tables for various analytics front-ends
 10 | - a Flask-based visualization of the schema that can serve as documentation of the business definitions of a data warehouse (a.k.a. "data dictionary" or "data guide")
 11 | - the possibility to sync schemas to reporting front-ends that have meta-data APIs (e.g. Metabase, Looker, Tableau)
 12 | 
 13 | &nbsp;
 14 | 
 15 | ![Mara Schema overview](https://github.com/mara/mara-schema/raw/main/docs/_static/mara-schema.png)
 16 | 
 17 | &nbsp;
 18 | 
 19 | Have a look at a real-world application of Mara Schema in the [Mara Example Project 1](https://github.com/mara/mara-example-project-1).
 20 | 
 21 | &nbsp;
 22 | 
 23 | **Why** should I use Mara Schema?
 24 | 
 25 | 1. **Definition of analytical business entities as code**: There are many solutions for documenting the company-wide definitions of attributes & metrics for the users of a data warehouse. These can range from simple spreadsheets or wikis to metadata management tools inside reporting front-ends. However, these definitions can quickly get out of sync when columns are added or changed in the underlying data warehouse. Mara Schema lets you deploy definition changes together with changes in the underlying ETL processes, so that the definitions stay in sync with the underlying data warehouse schema.
 26 | 
 27 | 
 28 | 2. **Automatic generation of aggregates / artifacts**: When a company wants to enforce a *single source of truth* in their data warehouse, then a heavily normalized Kimball-style [snowflake schema](https://en.wikipedia.org/wiki/Snowflake_schema) is still the weapon of choice. It enforces an agreed-upon unified modelling of business entities across domains and ensures referential consistency. However, snowflake schemas are not ideal for analytics or data science because they require a lot of joins. Most analytical databases and reporting tools nowadays work better with pre-flattened wide tables. Creating such flattened tables is an error-prone and dull activity, but with Mara Schema one can automate most of the work in creating flattened data set tables in the ETL.
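
To make the flattening idea concrete, here is a minimal, self-contained sketch of how a chain of foreign key links can be turned into one wide `SELECT` with `LEFT JOIN`s. It does not use Mara Schema itself: the `Link` tuple, the `flatten_query` function and all table and column names are hypothetical and only illustrate the technique. Unlike the real SQL generation, it does not quote identifiers or handle metrics.

```python
from typing import List, NamedTuple


class Link(NamedTuple):
    """A hypothetical foreign key link from an already joined table to a target table."""
    left_alias: str      # alias of the table that holds the foreign key
    fk_column: str       # foreign key column on the left table
    table: str           # schema-qualified target table
    alias: str           # alias of the target table in the query
    pk_column: str       # primary key column of the target table
    columns: List[str]   # columns of the target table to include


def flatten_query(table: str, alias: str, columns: List[str], links: List[Link]) -> str:
    """Builds a SELECT that joins all linked tables into one wide result."""
    select = [f'{alias}.{column}' for column in columns]
    joins = []
    for link in links:
        # prefix linked columns with the table alias to keep column names unique
        select += [f'{link.alias}.{column} AS {link.alias}_{column}' for column in link.columns]
        joins.append(f'LEFT JOIN {link.table} {link.alias} '
                     f'ON {link.left_alias}.{link.fk_column} = {link.alias}.{link.pk_column}')
    return ('SELECT\n    ' + ',\n    '.join(select)
            + f'\nFROM {table} {alias}\n' + '\n'.join(joins))


# Assumed toy tables: order items link to products, products link to product categories
query = flatten_query(
    table='dim.order_item', alias='order_item', columns=['order_item_id', 'revenue'],
    links=[Link('order_item', 'product_fk', 'dim.product', 'product', 'product_id', ['sku']),
           Link('product', 'product_category_fk', 'dim.product_category', 'product_category',
                'product_category_id', ['main_category'])])
print(query)
```

The point of the sketch is that each additional link only adds one join and a few prefixed columns, which is why this work lends itself to automation.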
 29 | 
 30 | &nbsp;
 31 | 
 32 | ## Installation
 33 | 
 34 | To use the library directly, use pip:
 35 | 
 36 | ```
 37 | pip install mara-schema
 38 | ```
 39 | 
 40 | or
 41 | 
 42 | ```
 43 | pip install git+https://github.com/mara/mara-schema.git
 44 | ```
 45 | 
 46 | &nbsp;
 47 | 
 48 | ## Defining entities, attributes, metrics & data sets
 49 | 
 50 | Let's consider the following toy example of a dimensional schema in the data warehouse of a hypothetical e-commerce company:
 51 | 
 52 | ![Example dimensional star schema](https://github.com/mara/mara-schema/raw/main/docs/_static/example-dimensional-database-schema.svg)
 53 | 
 54 | Each box is a database table with its columns, and the lines between tables show the foreign key constraints. That's a classic Kimball-style [snowflake schema](https://en.wikipedia.org/wiki/Snowflake_schema) and it requires a proper modelling / ETL layer in your data warehouse. A script that creates these example tables in PostgreSQL can be found in [example/dimensional-schema.sql](https://github.com/mara/mara-schema/blob/main/mara_schema/example/dimensional-schema.sql).
 55 | 
 56 | It's a prototypical data warehouse schema for B2C e-commerce: There are orders composed of individual product purchases (order items) made by customers. There are circular references: Orders have a customer, and customers have a first order. Order items have a product (and thus a product category) and customers have a favourite product category.
 57 | 
 58 | The respective entity and data set definitions for this database schema can be found in the [mara_schema/example](https://github.com/mara/mara-schema/tree/main/mara_schema/example) directory.
 59 | 
 60 | &nbsp;
 61 | 
 62 | In Mara Schema, each business relevant table in the dimensional schema is mapped to an [Entity](https://github.com/mara/mara-schema/blob/main/mara_schema/entity.py). In dimensional modelling terms, entities can be both fact tables and dimensions. For example, a customer entity can be a dimension of an order items data set (a.k.a. "cube", "model", "data mart") and a customer data set of its own.
 63 | 
 64 | Here's a [shortened](https://github.com/mara/mara-schema/blob/main/mara_schema/example/entities/order_item.py) definition of the "Order item" entity based on the `dim.order_item` table:
 65 | 
 66 | ```python
 67 | from mara_schema.entity import Entity
 68 | 
 69 | order_item_entity = Entity(
 70 |     name='Order item',
 71 |     description='Individual products sold as part of an order',
 72 |     schema_name='dim')
 73 | ```
 74 | 
 75 | It assumes that there is an `order_item` table in the `dim` schema of the data warehouse, with `order_item_id` as the primary key. The optional `table_name` and `pk_column_name` parameters can be used when another naming scheme for tables and primary keys is used.
 76 | 
 77 | &nbsp;
 78 | 
 79 | [Attributes](https://github.com/mara/mara-schema/blob/main/mara_schema/attribute.py) represent facts about an entity. They correspond to the non-numerical columns in a fact or dimension table:
 80 | 
 81 | ```python
 82 | from mara_schema.attribute import Type
 83 | 
 84 | order_item_entity.add_attribute(
 85 |     name='Order item ID',
 86 |     description='The ID of the order item in the backend',
 87 |     column_name='order_item_id',
 88 |     type=Type.ID,
 89 |     high_cardinality=True)
 90 | ```
 91 | 
 92 | They come with a descriptive name (as shown in reporting front-ends), a description and a `column_name` in the underlying database table.
 93 | 
 94 | There are several parameters for controlling the generation of artifact tables and the visibility in front-ends:
 95 | - Setting `personal_data` to `True` means that the attribute contains personally identifiable information and thus should be hidden from most users.
 96 | - When `high_cardinality` is `True`, the attribute is hidden in front-ends that cannot deal well with dimensions that have a lot of values.
 97 | - The `type` attribute controls how some fields are treated in artifact creation. See [mara_schema/attribute.py#L7](https://github.com/mara/mara-schema/blob/main/mara_schema/attribute.py#L7).
 98 | - An attribute marked as `important_field` is highlighted in the data set and shown by default in overviews.
 99 | - When `accessible_via_entity_link` is `False`, the attribute is hidden in data sets that use the entity as a dimension.
100 | 
101 | &nbsp;
102 | 
103 | The attributes of the dimensions of an entity are recursively linked with the `link_entity` method:
104 | 
105 | ```python
106 | from .order import order_entity
107 | from .product import product_entity
108 | 
109 | order_item_entity.link_entity(target_entity=order_entity, prefix='')
110 | order_item_entity.link_entity(target_entity=product_entity)
111 | ```
112 | 
113 | This pulls in attributes of other entities that are connected to an entity table via foreign key columns. When the other entity is called "Foo bar", it is assumed that there is a `foo_bar_fk` column in the entity table (this can be overridden with the `fk_column` parameter). The optional `prefix` controls how linked attributes are named (e.g. "First order date" vs "Order date") and also helps to disambiguate when there are multiple links from one entity to another.
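The naming defaults described so far (table name, primary key column and foreign key column derived from the entity name) can be sketched as follows. This is only an illustration of the convention as stated in the text; `default_names` is a hypothetical helper, not a function of Mara Schema, and the library's actual derivation may differ in details.

```python
def default_names(entity_name: str) -> dict:
    """Derive the default database names for an entity name.

    Hypothetical helper that mirrors the convention described in the text:
    "Foo bar" -> table `foo_bar`, primary key `foo_bar_id`, foreign key `foo_bar_fk`.
    It is not part of Mara Schema.
    """
    # lower-case the name and replace spaces with underscores
    base = entity_name.lower().replace(' ', '_')
    return {
        'table_name': base,
        'pk_column_name': f'{base}_id',
        'fk_column': f'{base}_fk',
    }


if __name__ == '__main__':
    print(default_names('Order item'))
```

When a warehouse follows a different naming scheme, the `table_name`, `pk_column_name` and `fk_column` parameters mentioned above are the place to state the real names explicitly.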
114 | 
115 | &nbsp;
116 | 
117 | Once all entities and their relationships are established, [Data Sets](https://github.com/mara/mara-schema/blob/main/mara_schema/data_set.py) (a.k.a "cubes", "models" or "data marts") add metrics and attributes from linked entities to an entity:
118 | 
119 | ```python
120 | from mara_schema.data_set import DataSet
121 | 
122 | from ..entities.order_item import order_item_entity
123 | 
124 | order_items_data_set = DataSet(entity=order_item_entity, name='Order items')
125 | ```
126 | 
127 | &nbsp;
128 | 
129 | There are two kinds of [Metrics](https://github.com/mara/mara-schema/blob/main/mara_schema/metric.py) (a.k.a "Measures") in Mara Schema: simple metrics and composed metrics. Simple metrics are computed as direct aggregations on an entity table column:
130 | 
131 | ```python
132 | from mara_schema.data_set import Aggregation
133 | 
134 | order_items_data_set.add_simple_metric(
135 |     name='# Orders',
136 |     description='The number of valid orders (orders with an invoice)',
137 |     column_name='order_fk',
138 |     aggregation=Aggregation.DISTINCT_COUNT,
139 |     important_field=True)
140 | 
141 | order_items_data_set.add_simple_metric(
142 |     name='Product revenue',
143 |     description='The price of the ordered products as shown in the cart',
144 |     aggregation=Aggregation.SUM,
145 |     column_name='product_revenue',
146 |     important_field=True)
147 | ```
148 | 
149 | In this example the metric "# Orders" is defined as the distinct count on the `order_fk` column, and "Product revenue" as the sum of the `product_revenue` column.
150 | 
151 | Composed metrics are built from other metrics (both simple and composed) like this:
152 | 
153 | ```python
154 | order_items_data_set.add_composed_metric(
155 |     name='Revenue',
156 |     description='The total cart value of the order',
157 |     formula='[Product revenue] + [Shipping revenue]',
158 |     important_field=True)
159 | 
160 | order_items_data_set.add_composed_metric(
161 |     name='AOV',
162 |     description='The average revenue per order. Attention: not meaningful when split by product',
163 |     formula='[Revenue] / [# Orders]',
164 |     important_field=True)
165 | ```
166 | 
167 | The `formula` parameter takes simple algebraic expressions (`+`, `-`, `*`, `/` and parentheses) with the names of the parent metrics in square brackets, e.g. `([a] + [b]) / [c]`.
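To make the formula syntax concrete, here is a minimal sketch that extracts the parent metric names from a formula and evaluates it for given metric values. Both functions are hypothetical and written only for illustration; how Mara Schema itself processes formulas is not shown in this document.

```python
import re

# matches a metric name enclosed in square brackets, e.g. "[# Orders]"
METRIC_REFERENCE = re.compile(r'\[([^\]]+)\]')


def parent_metric_names(formula: str) -> list:
    """Return the metric names referenced in a formula, in order of appearance."""
    return METRIC_REFERENCE.findall(formula)


def evaluate_formula(formula: str, values: dict) -> float:
    """Evaluate a formula by substituting each [metric] with its value.

    Hypothetical helper, not part of Mara Schema.
    """
    # replace every [name] with its parenthesized numeric value
    expression = METRIC_REFERENCE.sub(
        lambda match: f'({float(values[match.group(1)])!r})', formula)
    # only numbers, whitespace, + - * / and parentheses may remain
    if not re.fullmatch(r'[\d\s.+\-*/()e]*', expression):
        raise ValueError(f'Unsupported formula: {formula}')
    return eval(expression, {'__builtins__': {}}, {})


if __name__ == '__main__':
    print(parent_metric_names('[Revenue] / [# Orders]'))
    print(evaluate_formula('([a] + [b]) / [c]', {'a': 1, 'b': 2, 'c': 4}))
```

Checking the substituted expression against a whitelist of characters before calling `eval` keeps the sketch restricted to the algebraic operators named above.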
168 | 
169 | &nbsp;
170 | 
171 | With complex snowflake schemas the graph of linked entities can become rather big. To avoid cluttering data sets with unnecessary attributes, Mara Schema has a way for excluding entire entity links:
172 | 
173 | ```python
174 | customers_data_set.exclude_path(['Order', 'Customer'])
175 | ```
176 | 
177 | This means that the customer of the first order of a customer will not be part of the customers data set. Similarly, it is possible to limit the list of attributes from a linked entity:
178 | 
179 | ```python
180 | order_items_data_set.include_attributes(['Order', 'Customer', 'Order'], ['Order date'])
181 | ```
182 | 
183 | Here only the order date of the first order of the customer of the order will be included in the data set.
184 | 
185 | &nbsp;
186 | 
187 | ## Visualization
188 | 
189 | Mara Schema comes with an (optional) Flask-based visualization that documents the metrics and attributes of all data sets:
190 | 
191 | ![Mara schema data set visualization](https://github.com/mara/mara-schema/raw/main/docs/_static/mara-schema-data-set-visualization.png)
192 | 
193 | When made available to business users, this can serve as the "data dictionary", "data guide" or "data catalog" of a company.
194 | 
195 | &nbsp;
196 | 
197 | ## Artifact generation
198 | 
199 | The function `data_set_sql_query` in [mara_schema/sql_generation.py](https://github.com/mara/mara-schema/blob/main/mara_schema/sql_generation.py) can be used to flatten the entities of a data set into a wide data set table:
200 | 
201 | ```python
202 | data_set_sql_query(data_set=order_items_data_set, human_readable_columns=True, pre_computed_metrics=False,
203 |                    star_schema=False, personal_data=False, high_cardinality_attributes=True)
204 | ```
205 | 
206 | The resulting SELECT statement can be used for creating a data set table that is specifically tailored for the use in Metabase:
207 | 
208 | ```sql
209 | SELECT
210 |     order_item.order_item_id AS "Order item ID",
211 | 
212 |     "order".order_id AS "Order ID",
213 |     "order".order_date AS "Order date",
214 | 
215 |     order_customer.customer_id AS "Customer ID",
216 | 
217 |     order_customer_favourite_product_category.main_category AS "Customer favourite product category level 1",
218 |     order_customer_favourite_product_category.sub_category_1 AS "Customer favourite product category level 2",
219 | 
220 |     order_customer_first_order.order_date AS "Customer first order date",
221 | 
222 |     product.sku AS "Product SKU",
223 | 
224 |     product_product_category.main_category AS "Product category level 1",
225 |     product_product_category.sub_category_1 AS "Product category level 2",
226 | 
227 |     order_item.order_item_id AS "# Order items",
228 |     order_item.order_fk AS "# Orders",
229 |     order_item.product_revenue AS "Product revenue",
230 |     order_item.revenue AS "Shipping revenue"
231 | 
232 | FROM dim.order_item order_item
233 | LEFT JOIN dim."order" "order" ON order_item.order_fk = "order".order_id
234 | LEFT JOIN dim.customer order_customer ON "order".customer_fk = order_customer.customer_id
235 | LEFT JOIN dim.product_category order_customer_favourite_product_category ON order_customer.favourite_product_category_fk = order_customer_favourite_product_category.product_category_id
236 | LEFT JOIN dim."order" order_customer_first_order ON order_customer.first_order_fk = order_customer_first_order.order_id
237 | LEFT JOIN dim.product product ON order_item.product_fk = product.product_id
238 | LEFT JOIN dim.product_category product_product_category ON product.product_category_fk = product_product_category.product_category_id
239 | ```
240 | 
241 | Please note that `data_set_sql_query` only returns SQL SELECT statements; executing these statements is left to the ETL of the data warehouse. [Here](https://github.com/mara/mara-example-project-1/tree/main/app/pipelines/generate_artifacts/metabase.py) is an example for creating data set tables for Metabase using [Mara Pipelines](https://github.com/mara/mara-pipelines).
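As a rough sketch of what "executing these statements in the ETL" can look like, the generated SELECT can be wrapped into a statement that (re-)creates a flattened table. The helper `create_table_statement` and the target table name `metabase.order_item` are assumptions made for this illustration; they are not part of Mara Schema or of the linked example project.

```python
def create_table_statement(select_statement: str, table_name: str) -> str:
    """Wrap a SELECT statement into PostgreSQL DDL that rebuilds a table.

    Hypothetical helper: drops the target table if it exists and
    re-creates it from the query result.
    """
    # remove surrounding whitespace and a trailing semicolon, if any
    query = select_statement.strip().rstrip(';')
    return (f'DROP TABLE IF EXISTS {table_name};\n'
            f'CREATE TABLE {table_name} AS\n'
            f'{query};')


if __name__ == '__main__':
    # in a real pipeline the SELECT would come from data_set_sql_query(...)
    example_select = 'SELECT order_item.order_item_id AS "Order item ID"\nFROM dim.order_item order_item'
    print(create_table_statement(example_select, 'metabase.order_item'))
```

The resulting string can then be run by whatever mechanism the ETL already uses for executing SQL.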
242 | 
243 | &nbsp;
244 | 
245 | There are several parameters for controlling the output of the `data_set_sql_query` function:
246 | 
247 |  - `human_readable_columns`: Whether to use "Customer name" rather than "customer_name" as column name
248 |  - `pre_computed_metrics`: Whether to pre-compute composed metrics, counts and distinct counts on row level
249 |  - `star_schema`: Whether to add foreign keys to the tables of linked entities rather than including their attributes
250 |  - `personal_data`: Whether to include attributes that are marked as personal data
251 |  - `high_cardinality_attributes`: Whether to include attributes that are marked to have a high cardinality
252 | 
253 | ![Mara schema SQL generation](https://github.com/mara/mara-schema/raw/main/docs/_static/mara-schema-sql-generation.gif)
254 | 
255 | 
256 | ## Schema sync to front-ends
257 | 
258 | When reporting tools have a Metadata API (e.g. Metabase, Tableau) or can read schema definitions from text files (e.g. Looker, Mondrian), then it's easy to sync definitions with them. The [Mara Metabase](https://github.com/mara/mara-metabase) package contains a function for syncing Mara Schema definitions with Metabase and the [Mara Mondrian](https://github.com/mara/mara-mondrian) package contains a generator for a Mondrian schema.
259 | 
260 | We welcome contributions for creating Looker LookML files, for syncing definitions with Tableau, and for syncing with any other BI front-end.
261 | 
262 | We also see potential for automatically creating data guides in other wikis or documentation tools.
263 | 
264 | 
265 | ## Installation
266 | 
279 | For an example of integrating Mara Schema into a Flask application, have a look at the [Mara Example Project 1](https://github.com/mara/mara-example-project-1).
280 | 
281 | &nbsp;
282 | 
283 | ## Links
284 | 
285 | * Documentation: https://mara-schema.readthedocs.io/
286 | * Changes: https://mara-schema.readthedocs.io/en/stable/changes.html
287 | * PyPI Releases: https://pypi.org/project/mara-schema/
288 | * Source Code: https://github.com/mara/mara-schema
289 | * Issue Tracker: https://github.com/mara/mara-schema/issues
290 | 


--------------------------------------------------------------------------------