├── mara_schema ├── ui │ ├── __init__.py │ ├── static │ │ ├── mara-schema.css │ │ └── data-set-sql-query.js │ ├── graph.py │ └── views.py ├── config.py ├── example │ ├── __init__.py │ ├── entities │ │ ├── product.py │ │ ├── product_category.py │ │ ├── order_item.py │ │ ├── order.py │ │ └── customer.py │ ├── data_sets │ │ ├── products.py │ │ ├── customers.py │ │ └── order_items.py │ └── dimensional-schema.sql ├── __init__.py ├── attribute.py ├── metric.py ├── entity.py ├── data_set.py └── sql_generation.py ├── docs ├── changes.md ├── requirements.txt ├── _static │ ├── favicon.ico │ ├── mara-animal.jpg │ ├── mara-schema.png │ ├── mara-schema-sql-generation.gif │ ├── mara-schema-data-set-visualization.png │ └── example-dimensional-database-schema.svg ├── license.rst ├── installation.rst ├── Makefile ├── config.rst ├── design.rst ├── api.rst ├── conf.py ├── index.rst ├── artifact-generation.rst └── example.rst ├── setup.py ├── pyproject.toml ├── .gitignore ├── .pre-commit-config.yaml ├── .readthedocs.yaml ├── setup.cfg ├── CHANGELOG.md ├── Makefile ├── .github └── workflows │ └── build.yaml ├── LICENSE └── README.md /mara_schema/ui/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /docs/changes.md: -------------------------------------------------------------------------------- 1 | ```{include} ../CHANGELOG.md 2 | ``` 3 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | setup() 4 | -------------------------------------------------------------------------------- /docs/requirements.txt: -------------------------------------------------------------------------------- 1 | sphinx==4.5.0 2 | myst-parser==0.18.0 3 | 
-------------------------------------------------------------------------------- /docs/_static/favicon.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mara/mara-schema/HEAD/docs/_static/favicon.ico -------------------------------------------------------------------------------- /docs/_static/mara-animal.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mara/mara-schema/HEAD/docs/_static/mara-animal.jpg -------------------------------------------------------------------------------- /docs/_static/mara-schema.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mara/mara-schema/HEAD/docs/_static/mara-schema.png -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [build-system] 2 | requires = ["setuptools >= 40.6.0", "wheel"] 3 | build-backend = "setuptools.build_meta" 4 | -------------------------------------------------------------------------------- /docs/_static/mara-schema-sql-generation.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mara/mara-schema/HEAD/docs/_static/mara-schema-sql-generation.gif -------------------------------------------------------------------------------- /docs/_static/mara-schema-data-set-visualization.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mara/mara-schema/HEAD/docs/_static/mara-schema-data-set-visualization.png -------------------------------------------------------------------------------- /mara_schema/config.py: -------------------------------------------------------------------------------- 1 | import functools 2 | from .data_set import DataSet 3 | 4 
| @functools.lru_cache(maxsize=None) 5 | def data_sets() -> [DataSet]: 6 | """Returns all available data_sets.""" 7 | from .example import example_data_sets 8 | return example_data_sets() 9 | -------------------------------------------------------------------------------- /docs/license.rst: -------------------------------------------------------------------------------- 1 | License 2 | ======= 3 | 4 | MIT Source License 5 | --------------------------- 6 | 7 | The MIT license applies to all files in the Mara repository 8 | and source distribution. This includes Mara's source code, the 9 | examples, and tests, as well as the documentation. 10 | 11 | .. include:: ../LICENSE 12 | -------------------------------------------------------------------------------- /mara_schema/example/__init__.py: -------------------------------------------------------------------------------- 1 | from ..data_set import DataSet 2 | 3 | 4 | def example_data_sets() -> [DataSet]: 5 | from .data_sets.customers import customers_data_set 6 | from .data_sets.order_items import order_items_data_set 7 | from .data_sets.products import products_data_set 8 | return [order_items_data_set, customers_data_set, products_data_set] 9 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # Distribution / packaging 7 | build/ 8 | dist/ 9 | *.egg-info/ 10 | .eggs/ 11 | 12 | # Unit test / coverage reports 13 | .pytest_cache/ 14 | 15 | # Sphinx documentation 16 | docs/_build/ 17 | 18 | # Dev tools 19 | .idea 20 | 21 | # Environments 22 | /.venv 23 | -------------------------------------------------------------------------------- /docs/installation.rst: -------------------------------------------------------------------------------- 1 | Installation 2 | ============ 3 | 4 | To use the library 
directly, use pip: 5 | 6 | ``$ pip install mara-schema`` 7 | 8 | or 9 | 10 | ``$ pip install git+https://github.com/mara/mara-schema.git`` 11 | 12 | For an example of an integration into a flask application, have a look at the `Mara Example Project 1 `_. 13 | -------------------------------------------------------------------------------- /.pre-commit-config.yaml: -------------------------------------------------------------------------------- 1 | # See https://pre-commit.com for more information 2 | # See https://pre-commit.com/hooks.html for more hooks 3 | repos: 4 | - repo: https://github.com/pre-commit/pre-commit-hooks 5 | rev: v3.2.0 6 | hooks: 7 | - id: trailing-whitespace 8 | - id: end-of-file-fixer 9 | - id: check-toml 10 | - id: check-yaml 11 | - id: check-added-large-files 12 | -------------------------------------------------------------------------------- /.readthedocs.yaml: -------------------------------------------------------------------------------- 1 | # Read the Docs configuration file 2 | # See https://docs.readthedocs.io/en/stable/config-file/v2.html for details 3 | 4 | version: 2 5 | 6 | build: 7 | os: ubuntu-20.04 8 | tools: 9 | python: "3.10" 10 | 11 | sphinx: 12 | configuration: docs/conf.py 13 | 14 | python: 15 | install: 16 | - requirements: docs/requirements.txt 17 | - method: pip 18 | path: . 
19 | -------------------------------------------------------------------------------- /mara_schema/example/entities/product.py: -------------------------------------------------------------------------------- 1 | from mara_schema.entity import Entity, Type 2 | 3 | product_entity = Entity( 4 | name='Product', 5 | description='Products that were at least once sold or once on stock', 6 | schema_name='dim') 7 | 8 | product_entity.add_attribute( 9 | name='SKU', 10 | description='The ID of a product as defined in the PIM system', 11 | high_cardinality=True, 12 | column_name='sku', 13 | type=Type.ID) 14 | 15 | from .product_category import product_category_entity 16 | 17 | product_entity.link_entity(target_entity=product_category_entity) 18 | -------------------------------------------------------------------------------- /mara_schema/example/entities/product_category.py: -------------------------------------------------------------------------------- 1 | from mara_schema.entity import Entity 2 | 3 | product_category_entity = Entity( 4 | name='Product category', 5 | description='A broad categorization of products as defined by the purchasing team', 6 | schema_name='dim') 7 | 8 | product_category_entity.add_attribute( 9 | name='Level 1', 10 | description='One of the 6 main product categories', 11 | column_name='main_category') 12 | 13 | product_category_entity.add_attribute( 14 | name='Level 2', 15 | description='The second level category of a product', 16 | column_name='sub_category_1') 17 | -------------------------------------------------------------------------------- /mara_schema/ui/static/mara-schema.css: -------------------------------------------------------------------------------- 1 | /* 2 | As of 2021-01-25, supported in Chrome, Firefox, Edge, but not Safari and doesn't need any extra css 3 | classes on the anchor 4 | https://css-tricks.com/fixed-headers-on-page-links-and-overlapping-content-oh-my/ 5 | */ 6 | html { 7 | scroll-padding-top: 90px; /* height of
sticky header + head of table */ 8 | } 9 | 10 | /* For styling a trailing so it's light grey but gets the blue + underline on hover */ 11 | html a.anchor-link-sign { 12 | color: #bfbfbf; 13 | } 14 | 15 | html a.anchor-link-sign:hover { 16 | cursor: pointer; 17 | color: #0275d8; 18 | } 19 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [metadata] 2 | name = mara-schema 3 | version = attr: mara_schema.__version__ 4 | url = https://github.com/mara/mara-schema 5 | description = Mapping of DWH database tables to business entities, attributes & metrics in Python, with automatic creation of flattened tables 6 | long_description = file: README.md 7 | long_description_content_type = text/markdown 8 | author = Mara contributors 9 | license = MIT 10 | 11 | [options] 12 | packages = mara_schema 13 | python_requires = >= 3.6 14 | install_requires = 15 | flask 16 | graphviz 17 | mara-page 18 | sqlalchemy 19 | 20 | [options.package_data] 21 | mara_schema = **/*.py, ui/static/**/*, example/**/*.sql 22 | -------------------------------------------------------------------------------- /mara_schema/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = '1.2.1' 2 | 3 | def MARA_CONFIG_MODULES(): 4 | from . 
import config 5 | return [config] 6 | 7 | 8 | def MARA_CLICK_COMMANDS(): 9 | return [] 10 | 11 | 12 | def MARA_FLASK_BLUEPRINTS(): 13 | from .ui import views 14 | return [views.blueprint] 15 | 16 | 17 | def MARA_AUTOMIGRATE_SQLALCHEMY_MODELS(): 18 | return [] 19 | 20 | 21 | def MARA_ACL_RESOURCES(): 22 | from .ui import views 23 | return { 24 | 'Schema': views.acl_resource_schema 25 | } 26 | 27 | 28 | def MARA_NAVIGATION_ENTRIES(): 29 | from .ui import views 30 | return { 31 | 'Schema': views.schema_navigation_entry() 32 | } 33 | -------------------------------------------------------------------------------- /docs/Makefile: -------------------------------------------------------------------------------- 1 | # Minimal makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line, and also 5 | # from the environment for the first two. 6 | SPHINXOPTS ?= 7 | SPHINXBUILD ?= sphinx-build 8 | SOURCEDIR = . 9 | BUILDDIR = _build 10 | 11 | # Put it first so that "make" without argument is like "make help". 12 | help: 13 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 14 | 15 | .PHONY: help Makefile 16 | 17 | # Catch-all target: route all unknown targets to Sphinx using the new 18 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). 
19 | %: Makefile 20 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 21 | -------------------------------------------------------------------------------- /mara_schema/example/entities/order_item.py: -------------------------------------------------------------------------------- 1 | from mara_schema.attribute import Type 2 | from mara_schema.entity import Entity 3 | 4 | order_item_entity = Entity( 5 | name='Order item', 6 | description='Individual products sold as part of an order', 7 | schema_name='dim') 8 | 9 | order_item_entity.add_attribute( 10 | name='Order item ID', 11 | description='The ID of the order item in the backend', 12 | column_name='order_item_id', 13 | type=Type.ID, 14 | high_cardinality=True) 15 | 16 | from .order import order_entity 17 | from .product import product_entity 18 | 19 | order_item_entity.link_entity(target_entity=order_entity, prefix='') 20 | order_item_entity.link_entity(target_entity=product_entity) 21 | -------------------------------------------------------------------------------- /docs/config.rst: -------------------------------------------------------------------------------- 1 | Configuration 2 | ============= 3 | 4 | 5 | Extension Configuration Values 6 | ------------------------------ 7 | 8 | The following configuration values are used by this extension. They are defined as python functions in ``mara_schema.config`` 9 | and can be changed with the `monkey patch`_ from `Mara App`_. An example can be found `here `_. 10 | 11 | .. _monkey patch: https://github.com/mara/mara-app/blob/master/mara_app/monkey_patch.py 12 | .. _Mara App: https://github.com/mara/mara-app 13 | 14 | 15 | .. py:data:: data_sets 16 | 17 | Returns all available data_sets. 
18 | 19 | Default: ``mara_schema.example.example_data_sets()`` 20 | -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | 3 | ## 1.2.1 (2022-06-25) 4 | 5 | - makes the foreign key column naming patchable (#23) 6 | 7 | ## 1.2.0 (2022-06-21) 8 | 9 | - add data set sql generation support for other SQL engines (#22) 10 | - using the default db alias engine in view if possible (#21) 11 | - fix missing files in PyPI package 12 | 13 | ## 1.1.1 (2022-06-20) 14 | 15 | - use client-side rendering for graphviz fallback (#20) 16 | 17 | ## 1.1.0 (2021-05-12) 18 | 19 | - Add 'more...' links after the description and add an anchor to metrics and attributes (#9) 20 | - Cast 0.0 to double precision (#13) 21 | - Add parameter star_schema_transitive_fks (#14) 22 | 23 | ## 1.0.1 (2020-07-10) 24 | 25 | - Update documentation and example 26 | 27 | 28 | ## 1.0.0 (2020-06-29) 29 | 30 | - Initial release 31 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | MODULE_NAME=mara_schema 2 | 3 | 4 | all: 5 | # builds virtual env. and starts install in it 6 | make .venv/bin/python 7 | make install 8 | 9 | 10 | install: 11 | # install of module 12 | .venv/bin/pip install . 13 | 14 | 15 | publish: 16 | # manually publishing the package 17 | .venv/bin/pip install build twine 18 | .venv/bin/python -m build 19 | .venv/bin/twine upload dist/* 20 | 21 | 22 | clean: 23 | # clean up 24 | rm -rf .venv/ build/ dist/ ${MODULE_NAME}.egg-info/ .pytest_cache/ .eggs/ 25 | 26 | 27 | .PYTHON3:=$(shell PATH='$(subst $(CURDIR)/.venv/bin:,,$(PATH))' which python3) 28 | 29 | .venv/bin/python: 30 | mkdir -p .venv 31 | cd .venv && $(.PYTHON3) -m venv --copies --prompt='[$(shell basename `pwd`)/.venv]' . 
32 | 33 | .venv/bin/python -m pip install --upgrade pip 34 | -------------------------------------------------------------------------------- /mara_schema/example/data_sets/products.py: -------------------------------------------------------------------------------- 1 | from mara_schema.data_set import DataSet, Aggregation 2 | 3 | from ..entities.product import product_entity 4 | 5 | products_data_set = DataSet(entity=product_entity, name='Products') 6 | 7 | products_data_set.add_simple_metric( 8 | name='Revenue last 30 days', 9 | description='The revenue generated from the product in the last 30 days', 10 | aggregation=Aggregation.SUM, 11 | column_name='revenue_last_30_days', 12 | important_field=True) 13 | 14 | products_data_set.add_simple_metric( 15 | name='# Items on stock', 16 | description='How many items of the products are in stock according to the ERP (at the time of the last DWH import)', 17 | column_name='number_of_items_on_stock', 18 | aggregation=Aggregation.SUM, 19 | important_field=True) 20 | -------------------------------------------------------------------------------- /docs/design.rst: -------------------------------------------------------------------------------- 1 | Design Decisions 2 | ================ 3 | 4 | Schema sync to front-ends 5 | ------------------------- 6 | When reporting tools have a Metadata API (e.g. Metabase, Tableau) or can read schema definitions from text files (e.g. Looker, Mondrian), then it's easy to sync definitions with them. The `Mara Metabase `_ package contains a function for syncing Mara Schema definitions with Metabase and the `Mara Mondrian `_ package contains a generator for a Mondrian schema. 7 | 8 | We welcome contributions for creating Looker `LookML files `_, for syncing definitions with Tableau, and for syncing with any other BI front-end. 9 | 10 | Also, we see a potential for automatically creating data guides in other Wikis or documentation tools. 
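The data-guide idea mentioned above can be sketched in a few lines. This is a minimal illustration only: `Field`, `DataSetDoc` and `render_data_guide` are hypothetical names invented for this sketch and are not part of Mara Schema; the real `DataSet` objects carry more information and are not used here. The example values are taken from the customers example data set.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical, simplified stand-ins for schema definitions (assumption:
# a name plus a description is enough for a basic data guide entry).
@dataclass
class Field:
    name: str
    description: str


@dataclass
class DataSetDoc:
    name: str
    attributes: List[Field] = field(default_factory=list)
    metrics: List[Field] = field(default_factory=list)


def render_data_guide(doc: DataSetDoc) -> str:
    """Render one data set as a small Markdown page."""
    lines = [f"# {doc.name}", ""]
    for title, fields in (("Attributes", doc.attributes), ("Metrics", doc.metrics)):
        if not fields:
            continue  # skip empty sections entirely
        lines += [f"## {title}", ""]
        lines += [f"- **{f.name}**: {f.description}" for f in fields]
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


customers = DataSetDoc(
    name="Customers",
    attributes=[Field("Customer ID", "The ID of the customer as defined in the backend")],
    metrics=[
        Field("# Orders", "Number of orders placed by the customer"),
        Field("CLV", "The lifetime revenue generated from items purchased by this customer"),
    ],
)

guide = render_data_guide(customers)
print(guide)
```

The resulting text could then be pushed to a wiki or written to a docs folder; how the definitions are read from actual Mara Schema objects is left open here.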
11 | -------------------------------------------------------------------------------- /.github/workflows/build.yaml: -------------------------------------------------------------------------------- 1 | name: mara-schema 2 | 3 | on: 4 | push: 5 | branches: 6 | - main 7 | pull_request: 8 | branches: 9 | - main 10 | 11 | jobs: 12 | build: 13 | runs-on: ubuntu-latest 14 | strategy: 15 | matrix: 16 | python-version: ['3.7', '3.8', '3.9', '3.10', '3.11'] 17 | steps: 18 | - name: Checkout code 19 | uses: actions/checkout@v2 20 | - name: Setup python 21 | uses: actions/setup-python@v2 22 | with: 23 | python-version: ${{ matrix.python-version }} 24 | - name: Install package 25 | env: 26 | pythonversion: ${{ matrix.python-version }} 27 | run: | 28 | python -c "import sys; print(sys.version)" 29 | pip install . 30 | echo Finished successful build with Python $pythonversion 31 | # - name: Test with pytest 32 | # run: | 33 | # pytest -v tests 34 | -------------------------------------------------------------------------------- /mara_schema/example/data_sets/customers.py: -------------------------------------------------------------------------------- 1 | from mara_schema.data_set import DataSet, Aggregation 2 | 3 | from ..entities.customer import customer_entity 4 | 5 | customers_data_set = DataSet(entity=customer_entity, name='Customers') 6 | 7 | customers_data_set.exclude_path(['Order', 'Customer']) 8 | 9 | customers_data_set.add_simple_metric( 10 | name='# Orders', 11 | description='Number of orders placed by the customer', 12 | aggregation=Aggregation.SUM, 13 | column_name='number_of_orders', 14 | important_field=True) 15 | 16 | customers_data_set.add_simple_metric( 17 | name='CLV', 18 | description='The lifetime revenue generated from items purchased by this customer', 19 | aggregation=Aggregation.SUM, 20 | column_name='revenue_lifetime', 21 | important_field=True) 22 | 23 | customers_data_set.add_composed_metric( 24 | name='AOV', 25 | description='The average revenue per
order of the customer', 26 | formula='[CLV] / [# Orders]') 27 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2020-2022 Mara contributors 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in all 11 | copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 19 | SOFTWARE. 
20 | -------------------------------------------------------------------------------- /mara_schema/example/entities/order.py: -------------------------------------------------------------------------------- 1 | from mara_schema.attribute import Type 2 | from mara_schema.entity import Entity 3 | 4 | order_entity = Entity( 5 | name='Order', 6 | description='Valid orders for which an invoice was created', 7 | schema_name='dim') 8 | 9 | order_entity.add_attribute( 10 | name='Order ID', 11 | description='The invoice number of the order as stored in the backend', 12 | column_name='order_id', 13 | type=Type.ID, 14 | important_field=True, 15 | high_cardinality=True) 16 | 17 | order_entity.add_attribute( 18 | name='Order date', 19 | description='The date when the order was placed (stored in the backend)', 20 | column_name='order_date', 21 | type=Type.DATE, 22 | important_field=True) 23 | 24 | order_entity.add_attribute( 25 | name='Status', 26 | description='The current status of the order (new, paid, shipped, etc.)', 27 | column_name='status', 28 | accessible_via_entity_link=False, 29 | type=Type.ENUM) 30 | 31 | from .customer import customer_entity 32 | 33 | order_entity.link_entity( 34 | target_entity=customer_entity, 35 | description='The customer who placed the order') 36 | -------------------------------------------------------------------------------- /mara_schema/ui/static/data-set-sql-query.js: -------------------------------------------------------------------------------- 1 | var DataSetSqlQuery = function (baseUrl) { 2 | 3 | function localStorageKey(param) { 4 | return 'mara-schem-sql-query-param-' + param; 5 | } 6 | 7 | $('.param-checkbox').each(function (n, checkbox) { 8 | var storedValue = localStorage.getItem(localStorageKey(checkbox.value)); 9 | if (storedValue == 'false') { 10 | checkbox.checked = false; 11 | } else if (storedValue == 'true') { 12 | checkbox.checked = true; 13 | } else { 14 | checkbox.checked = checkbox.value != 'star schema'; 15 | } 16 | 
}); 17 | 18 | function updateUI() { 19 | var selectedParams = []; 20 | $('.param-checkbox').each(function (n, checkbox) { 21 | if (checkbox.checked) { 22 | selectedParams.push(checkbox.value); 23 | } 24 | localStorage.setItem(localStorageKey(checkbox.value), checkbox.checked); 25 | }); 26 | 27 | var url = baseUrl; 28 | if (selectedParams.length > 0) { 29 | url += '/' + selectedParams.join('/') 30 | } 31 | loadContentAsynchronously('sql-container', url); 32 | } 33 | 34 | $('.param-checkbox').change(updateUI); 35 | 36 | updateUI(); 37 | }; 38 | -------------------------------------------------------------------------------- /docs/api.rst: -------------------------------------------------------------------------------- 1 | API 2 | === 3 | 4 | .. module:: mara_schema 5 | 6 | This part of the documentation covers all the interfaces of Mara Schema. For 7 | parts where the package depends on external libraries, we document the most 8 | important right here and provide links to the canonical documentation. 9 | 10 | 11 | Entities 12 | -------- 13 | 14 | .. module:: mara_schema.entity 15 | 16 | .. autoclass:: Entity 17 | :members: 18 | :special-members: __init__ 19 | 20 | .. autoclass:: EntityLink 21 | :special-members: __init__ 22 | 23 | 24 | Attributes 25 | ---------- 26 | 27 | .. module:: mara_schema.attribute 28 | 29 | .. autoclass:: Attribute 30 | :members: 31 | :special-members: __init__ 32 | 33 | .. autoclass:: Type 34 | 35 | .. autofunction:: normalize_name 36 | 37 | 38 | Data sets 39 | --------- 40 | 41 | .. module:: mara_schema.data_set 42 | 43 | .. autoclass:: DataSet 44 | :members: 45 | :special-members: __init__ 46 | 47 | 48 | Metrics 49 | ------- 50 | 51 | .. module:: mara_schema.metric 52 | 53 | .. autoclass:: Aggregation 54 | 55 | .. autoclass:: NumberFormat 56 | 57 | .. autoclass:: SimpleMetric 58 | :members: 59 | :special-members: __init__ 60 | 61 | .. 
autoclass:: ComposedMetric 62 | :members: 63 | :special-members: __init__ 64 | 65 | 66 | SQL Generation 67 | -------------- 68 | 69 | .. module:: mara_schema.sql_generation 70 | 71 | .. autofunction:: data_set_sql_query 72 | -------------------------------------------------------------------------------- /mara_schema/example/entities/customer.py: -------------------------------------------------------------------------------- 1 | from mara_schema.attribute import Type 2 | from mara_schema.entity import Entity 3 | 4 | customer_entity = Entity( 5 | name='Customer', 6 | description='People that made at least one order or that subscribed to the newsletter', 7 | schema_name='dim') 8 | 9 | customer_entity.add_attribute( 10 | name='Customer ID', 11 | description='The ID of the customer as defined in the backend', 12 | column_name='customer_id', 13 | type=Type.ID, 14 | high_cardinality=True, 15 | important_field=True) 16 | 17 | customer_entity.add_attribute( 18 | name='Email', 19 | description='The email of the customer', 20 | column_name='email', 21 | personal_data=True, 22 | high_cardinality=True, 23 | accessible_via_entity_link=False) 24 | 25 | customer_entity.add_attribute( 26 | name='Duration since first order', 27 | description='The number of days since the first order was placed', 28 | type=Type.DURATION, 29 | column_name='duration_since_first_order', 30 | accessible_via_entity_link=False) 31 | 32 | from .order import order_entity 33 | from .product_category import product_category_entity 34 | 35 | customer_entity.link_entity( 36 | target_entity=product_category_entity, 37 | fk_column='favourite_product_category_fk', 38 | prefix='Favourite product category', 39 | description='The category of the most purchased product (by revenue) of the customer') 40 | 41 | customer_entity.link_entity( 42 | target_entity=order_entity, 43 | fk_column='first_order_fk', 44 | prefix='First order') 45 | -------------------------------------------------------------------------------- 
/mara_schema/example/data_sets/order_items.py: -------------------------------------------------------------------------------- 1 | from mara_schema.data_set import DataSet, Aggregation 2 | 3 | from ..entities.order_item import order_item_entity 4 | 5 | order_items_data_set = DataSet(entity=order_item_entity, name='Order items') 6 | 7 | order_items_data_set.include_attributes(['Order', 'Customer', 'Order'], ['Order date']) 8 | 9 | order_items_data_set.add_simple_metric( 10 | name='# Order items', 11 | description='The number of ordered products', 12 | column_name='order_item_id', 13 | aggregation=Aggregation.COUNT) 14 | 15 | order_items_data_set.add_simple_metric( 16 | name='# Orders', 17 | description='The number of valid orders (orders with an invoice)', 18 | column_name='order_fk', 19 | aggregation=Aggregation.DISTINCT_COUNT, 20 | important_field=True) 21 | 22 | order_items_data_set.add_simple_metric( 23 | name='Product revenue', 24 | description='The price of the ordered products as shown in the cart', 25 | aggregation=Aggregation.SUM, 26 | column_name='product_revenue', 27 | important_field=True) 28 | 29 | order_items_data_set.add_simple_metric( 30 | name='Shipping revenue', 31 | description='Revenue generated based on the price of the items and delivery fee', 32 | aggregation=Aggregation.SUM, 33 | column_name='shipping_revenue') 34 | 35 | order_items_data_set.add_composed_metric( 36 | name='Revenue', 37 | description='The total cart value of the order', 38 | formula='[Product revenue] + [Shipping revenue]', 39 | important_field=True) 40 | 41 | order_items_data_set.add_composed_metric( 42 | name='AOV', 43 | description='The average revenue per order. 
Attention: not meaningful when split by product', 44 | formula='[Revenue] / [# Orders]', 45 | important_field=True) 46 | -------------------------------------------------------------------------------- /mara_schema/example/dimensional-schema.sql: -------------------------------------------------------------------------------- 1 | DROP SCHEMA IF EXISTS dim CASCADE; 2 | CREATE SCHEMA dim; 3 | 4 | CREATE TYPE dim.STATUS AS ENUM ('New', 'Paid', 'Shipped', 'Returned', 'Refunded'); 5 | 6 | CREATE TABLE dim.order 7 | ( 8 | order_id INTEGER PRIMARY KEY, 9 | customer_fk INTEGER NOT NULL, 10 | order_date TIMESTAMP WITH TIME ZONE, 11 | status dim.STATUS NOT NULL 12 | ); 13 | 14 | CREATE TABLE dim.order_item 15 | ( 16 | order_item_id INTEGER PRIMARY KEY, 17 | order_fk INTEGER NOT NULL, 18 | product_fk INTEGER NOT NULL, 19 | 20 | product_revenue DOUBLE PRECISION NOT NULL, 21 | shipping_revenue DOUBLE PRECISION NOT NULL 22 | ); 23 | 24 | CREATE TABLE dim.customer 25 | ( 26 | customer_id INTEGER PRIMARY KEY, 27 | email TEXT NOT NULL, 28 | duration_since_first_order INTEGER, 29 | first_order_fk INTEGER, 30 | favourite_product_category_fk INTEGER, 31 | 32 | number_of_orders INTEGER, 33 | revenue_lifetime DOUBLE PRECISION 34 | ); 35 | 36 | CREATE TABLE dim.product 37 | ( 38 | product_id INTEGER PRIMARY KEY, 39 | sku TEXT NOT NULL, 40 | product_category_fk INTEGER NOT NULL, 41 | revenue_all_time DOUBLE PRECISION 42 | ); 43 | 44 | CREATE TABLE dim.product_category 45 | ( 46 | product_category_id INTEGER PRIMARY KEY, 47 | level_1 TEXT NOT NULL, 48 | level_2 TEXT NOT NULL 49 | ); 50 | 51 | 52 | ALTER TABLE dim.order 53 | ADD FOREIGN KEY (customer_fk) REFERENCES dim.customer (customer_id); 54 | ALTER TABLE dim.order_item 55 | ADD FOREIGN KEY (order_fk) REFERENCES dim.order (order_id); 56 | ALTER TABLE dim.order_item 57 | ADD FOREIGN KEY (product_fk) REFERENCES dim.product (product_id); 58 | ALTER TABLE dim.customer 59 | ADD FOREIGN KEY (first_order_fk) REFERENCES dim.order (order_id); 60 
| ALTER TABLE dim.customer 61 | ADD FOREIGN KEY (favourite_product_category_fk) 62 | REFERENCES dim.product_category (product_category_id); 63 | ALTER TABLE dim.product 64 | ADD FOREIGN KEY (product_category_fk) REFERENCES dim.product_category (product_category_id); 65 | -------------------------------------------------------------------------------- /docs/conf.py: -------------------------------------------------------------------------------- 1 | # This file only contains a selection of the most common options. For a full 2 | # list see the documentation: 3 | # https://www.sphinx-doc.org/en/master/usage/configuration.html 4 | 5 | # -- Path setup -------------------------------------------------------------- 6 | 7 | # If extensions (or modules to document with autodoc) are in another directory, 8 | # add these directories to sys.path here. If the directory is relative to the 9 | # documentation root, use os.path.abspath to make it absolute, like shown here. 10 | # 11 | # import os 12 | # import sys 13 | # sys.path.insert(0, os.path.abspath('.')) 14 | 15 | 16 | # -- Project information ----------------------------------------------------- 17 | 18 | project = 'Mara Schema' 19 | copyright = '2020-2022, Mara contributors' 20 | author = 'Mara contributors' 21 | 22 | # The short X.Y version. 23 | from mara_schema import __version__ 24 | version = __version__ 25 | # The full version, including alpha/beta/rc tags 26 | release = version 27 | 28 | 29 | # -- General configuration --------------------------------------------------- 30 | 31 | # Add any Sphinx extension module names here, as strings. They can be 32 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 33 | # ones. 34 | extensions = [ 35 | 'sphinx.ext.autodoc', 36 | 'myst_parser', 37 | ] 38 | 39 | # Add any paths that contain templates here, relative to this directory. 
40 | templates_path = ['_templates'] 41 | 42 | # List of patterns, relative to source directory, that match files and 43 | # directories to ignore when looking for source files. 44 | # This pattern also affects html_static_path and html_extra_path. 45 | exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store'] 46 | 47 | 48 | # -- Options for HTML output ------------------------------------------------- 49 | 50 | # The theme to use for HTML and HTML Help pages. See the documentation for 51 | # a list of builtin themes. 52 | # 53 | html_theme = 'alabaster' 54 | 55 | # Add any paths that contain custom static files (such as style sheets) here, 56 | # relative to this directory. They are copied after the builtin static files, 57 | # so a file named "default.css" will overwrite the builtin "default.css". 58 | html_static_path = ['_static'] 59 | html_favicon = "_static/favicon.ico" 60 | html_logo = "_static/mara-animal.jpg" 61 | html_title = f"Mara Schema Documentation ({version})" 62 | -------------------------------------------------------------------------------- /docs/index.rst: -------------------------------------------------------------------------------- 1 | .. rst-class:: hide-header 2 | 3 | Mara Schema documentation 4 | ========================= 5 | 6 | Welcome to Mara Schema's documentation. The Mara Schema package is a Python based mapping of physical data warehouse 7 | tables to logical business entities (a.k.a. "cubes", "models", "data sets", etc.). It comes with 8 | - sql query generation for flattening normalized database tables into wide tables for various analytics front-ends 9 | - a flask based visualization of the schema that can serve as a documentation of the business definitions of a data warehouse (a.k.a "data dictionary" or "data guide") 10 | - the possibility to sync schemas to reporting front-ends that have meta-data APIs (e.g. Metabase, Looker, Tableau) 11 | 12 | .. 
image:: _static/mara-schema.png 13 | 14 | Have a look at a real-world application of Mara Schema in the `Mara Example Project 1 `_. 15 | 16 | 17 | **Why** should I use Mara Schema? 18 | 19 | 1. **Definition of analytical business entities as code**: There are many solutions for documenting the company-wide definitions of attributes & metrics for the users of a data warehouse. These can range from simple spreadsheets or wikis to metadata management tools inside reporting front-ends. However, these definitions can quickly get out of sync when new columns are added or changed in the underlying data warehouse. Mara Schema allows to deploy definition changes together with changes in the underlying ETL processes so that all definitions will always be in sync with the underlying data warehouse schema. 20 | 21 | 2. **Automatic generation of aggregates / artifacts**: When a company wants to enforce a *single source of truth* in their data warehouse, then a heavily normalized Kimball-style `snowflake schema `_ is still the weapon of choice. It enforces an agreed-upon unified modelling of business entities across domains and ensures referential consistency. However, snowflake schemas are not ideal for analytics or data science because they require a lot of joins. Most analytical databases and reporting tools nowadays work better with pre-flattened wide tables. Creating such flattened tables is an error-prone and dull activity, but with Mara Schema one can automate most of the work in creating flattened data set tables in the ETL. 22 | 23 | 24 | User's Guide 25 | ------------ 26 | 27 | This part of the documentation focuses on step-by-step instructions how to use this extension. 28 | 29 | .. toctree:: 30 | :maxdepth: 2 31 | 32 | installation 33 | example 34 | artifact-generation 35 | config 36 | 37 | 38 | API Reference 39 | ------------- 40 | 41 | If you are looking for information on a specific function, class or 42 | method, this part of the documentation is for you. 
43 | 44 | .. toctree:: 45 | :maxdepth: 2 46 | 47 | api 48 | 49 | Additional Notes 50 | ---------------- 51 | 52 | Legal information and changelog are here for the interested. 53 | 54 | .. toctree:: 55 | :maxdepth: 2 56 | 57 | design 58 | license 59 | changes 60 | -------------------------------------------------------------------------------- /docs/artifact-generation.rst: -------------------------------------------------------------------------------- 1 | Artifact generation 2 | =================== 3 | 4 | The function ``data_set_sql_query`` in `mara_schema/sql_generation.py `_ can be used to flatten the entities of a data set into a wide data set table: 5 | .. code-block:: python 6 | 7 | data_set_sql_query(data_set=order_items_data_set, human_readable_columns=True, pre_computed_metrics=False, 8 | star_schema=False, personal_data=False, high_cardinality_attributes=True) 9 | 10 | The resulting SELECT statement can be used for creating a data set table that is specifically tailored for the use in Metabase: 11 | 12 | .. 
code-block:: sql 13 | 14 | SELECT 15 | order_item.order_item_id AS "Order item ID", 16 | 17 | "order".order_id AS "Order ID", 18 | "order".order_date AS "Order date", 19 | 20 | order_customer.customer_id AS "Customer ID", 21 | 22 | order_customer_favourite_product_category.main_category AS "Customer favourite product category level 1", 23 | order_customer_favourite_product_category.sub_category_1 AS "Customer favourite product category level 2", 24 | 25 | order_customer_first_order.order_date AS "Customer first order date", 26 | 27 | product.sku AS "Product SKU", 28 | 29 | product_product_category.main_category AS "Product category level 1", 30 | product_product_category.sub_category_1 AS "Product category level 2", 31 | 32 | order_item.order_item_id AS "# Order items", 33 | order_item.order_fk AS "# Orders", 34 | order_item.product_revenue AS "Product revenue", 35 | order_item.revenue AS "Shipping revenue" 36 | 37 | FROM dim.order_item order_item 38 | LEFT JOIN dim."order" "order" ON order_item.order_fk = "order".order_id 39 | LEFT JOIN dim.customer order_customer ON "order".customer_fk = order_customer.customer_id 40 | LEFT JOIN dim.product_category order_customer_favourite_product_category ON order_customer.favourite_product_category_fk = order_customer_favourite_product_category.product_category_id 41 | LEFT JOIN dim."order" order_customer_first_order ON order_customer.first_order_fk = order_customer_first_order.order_id 42 | LEFT JOIN dim.product product ON order_item.product_fk = product.product_id 43 | LEFT JOIN dim.product_category product_product_category ON product.product_category_fk = product_product_category.product_category_id 44 | 45 | Please note that the ``data_set_sql_query`` only returns SQL select statements, it's a matter of executing these statements somewhere in the ETL of the Data Warehouse. `Here `_ is an example for creating data set tables for Metabase using `Mara Pipelines `_. 
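
As a rough illustration of that last step, the returned SELECT string can be wrapped in a ``CREATE TABLE ... AS`` statement before it is handed to the ETL. This is only a sketch: the helper name ``create_table_as``, the ``metabase`` target schema and the shortened SELECT are assumptions made for this example and are not part of Mara Schema.

```python
def create_table_as(select_statement: str, schema_name: str, table_name: str) -> str:
    """Wrap a SELECT statement in DROP / CREATE TABLE AS (PostgreSQL syntax).

    Hypothetical helper for illustration, not a Mara Schema function.
    """
    qualified_name = f'{schema_name}."{table_name}"'
    # Remove surrounding whitespace and a trailing semicolon so the SELECT can be embedded
    select_body = select_statement.strip().rstrip(';')
    return (f'DROP TABLE IF EXISTS {qualified_name};\n'
            f'CREATE TABLE {qualified_name} AS\n'
            f'{select_body};')


# Stand-in for the string returned by data_set_sql_query(...)
select_statement = 'SELECT order_item.order_item_id AS "Order item ID" FROM dim.order_item order_item'

ddl = create_table_as(select_statement, schema_name='metabase', table_name='order_items')
print(ddl)
```

The resulting string would then be run by whatever executes SQL in the pipeline; how and where that happens is outside the scope of ``data_set_sql_query``.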
46 | 47 | 48 | 49 | There are several parameters for controlling the output of the `data_set_sql_query` function: 50 | 51 | - ``human_readable_columns``: Whether to use "Customer name" rather than "customer_name" as column name 52 | - ``pre_computed_metrics``: Whether to pre-compute composed metrics, counts and distinct counts on row level 53 | - ``star_schema``: Whether to add foreign keys to the tables of linked entities rather than including their attributes 54 | - ``personal_data``: Whether to include attributes that are marked as personal data 55 | - ``high_cardinality_attributes``: Whether to include attributes that are marked to have a high cardinality 56 | 57 | .. image:: _static/mara-schema-sql-generation.gif 58 | :alt: Mara schema SQL generation 59 | -------------------------------------------------------------------------------- /mara_schema/attribute.py: -------------------------------------------------------------------------------- 1 | import enum 2 | import re 3 | 4 | import typing 5 | 6 | 7 | class Type(enum.EnumMeta): 8 | """ 9 | Attribute types that need special treatment in artifact creation 10 | Type.ID: A numeric ID that is converted to text in a flattened table so that it can be filtered 11 | Type.DATE: Date attribute as a foreign_key to a date dimension 12 | Type.DURATION: Duration attribute as a foreign_key to a duration dimension 13 | Type.ENUM: Attributes that is converted to text in a flattened table. 
14 | Type.ARRAY: Attribute of type array 15 | """ 16 | DATE = 'date' 17 | DURATION = 'duration' 18 | ID = 'id' 19 | ENUM = 'enum' 20 | ARRAY = 'array' 21 | 22 | 23 | class Attribute(): 24 | """A property of an entity, corresponds to a column in the underlying dimensional table""" 25 | 26 | def __init__(self, name: str, description: str, column_name: str, 27 | accessible_via_entity_link: bool, type: 'Type' = None, high_cardinality: bool = False, 28 | personal_data: bool = False, 29 | important_field: bool = False, 30 | more_url: typing.Optional[str] = None, 31 | ) -> None: 32 | """See documentation of function Entity.add_attribute""" 33 | self.name = name 34 | self.description = description 35 | self.column_name = column_name 36 | self.type = type 37 | self.high_cardinality = high_cardinality 38 | self.personal_data = personal_data 39 | self.important_field = important_field 40 | self.accessible_via_entity_link = accessible_via_entity_link 41 | self.more_url = more_url 42 | 43 | def __repr__(self) -> str: 44 | return f'' 45 | 46 | def prefixed_name(self, path: typing.Tuple['EntityLink'] = None) -> str: 47 | """Generate a meaningful business name by concatenating the prefix of entity link instances and original 48 | name of attribute. 
""" 49 | 50 | def first_lower(string: str = ''): 51 | """Lowercase first letter if the first two letter are not capitalized """ 52 | if not re.match(r'([A-Z]){2}', string): 53 | return string[:1].lower() + string[1:] 54 | else: 55 | return string 56 | 57 | if path: 58 | prefix = ' '.join([entity_link.prefix.lower() for entity_link in path]) 59 | return normalize_name(prefix + ' ' + first_lower(self.name)) 60 | else: 61 | return normalize_name(self.name) 62 | 63 | 64 | def normalize_name(name: str, max_length: int = 63) -> str: 65 | """ 66 | Makes "Foo bar baz" out of "foo bar bar baz" 67 | Args: 68 | name: the name to normalize 69 | max_length: optionally limit length by replacing too long part with a hash of the name 70 | """ 71 | 72 | def first_letter_capitalize(string: str) -> str: 73 | return string[0].upper() + string[1::] 74 | 75 | # Remove repeating words from the generated name, e.g. "First booking booking ID" -> "First booking ID" 76 | name = re.sub(r'\b(\w+)( \1\b)+', r'\1', name).strip() 77 | 78 | # Remove duplicate whitespace 79 | name = re.sub('\s\s+', ' ', name) 80 | 81 | name = first_letter_capitalize(name) 82 | 83 | # Limit length 84 | 85 | if max_length and len(name) > max_length: 86 | import hashlib 87 | m = hashlib.md5() 88 | m.update(name.encode('utf-8')) 89 | return name[:(max_length - 8)] + m.hexdigest()[:8] 90 | 91 | return name 92 | -------------------------------------------------------------------------------- /mara_schema/metric.py: -------------------------------------------------------------------------------- 1 | import abc 2 | import enum 3 | 4 | import typing 5 | 6 | 7 | class Aggregation(enum.EnumMeta): 8 | """Aggregation methods for metrics""" 9 | SUM = 'sum' 10 | AVERAGE = 'avg' 11 | COUNT = 'count' 12 | DISTINCT_COUNT = 'distinct-count' 13 | 14 | 15 | class NumberFormat(enum.EnumMeta): 16 | """How to format values""" 17 | STANDARD = 'Standard' 18 | CURRENCY = 'Currency' 19 | PERCENT = 'Percent' 20 | 21 | 22 | class 
Metric(abc.ABC): 23 | def __init__(self, name: str, description: str, data_set: 'DataSet', 24 | important_field: bool = False) -> None: 25 | """ 26 | A numeric aggregation on columns of an entity table. 27 | 28 | Args: 29 | name: How the metric is displayed in front-ends, e.g. "Revenue after cancellations" 30 | description: A meaningful business definition of the metric 31 | important_field: It refers to key business metrics. 32 | """ 33 | self.name = name 34 | self.description = description 35 | self.data_set = data_set 36 | self.important_field = important_field 37 | 38 | @abc.abstractmethod 39 | def display_formula(self): 40 | """Returns a documentation string for displaying the formula in the frontend""" 41 | pass 42 | 43 | 44 | class SimpleMetric(Metric): 45 | def __init__(self, name: str, description: str, data_set: 'DataSet', 46 | column_name: str, aggregation: Aggregation, important_field: bool = False, 47 | number_format: NumberFormat = NumberFormat.STANDARD, 48 | more_url: typing.Optional[str] = None): 49 | """ 50 | A metric that is computed as a direct aggregation on a entity table column 51 | Args: 52 | name: How the metric is displayed in front-ends, e.g. "Revenue after cancellations" 53 | description: A meaningful business definition of the metric 54 | data_set: The data set that contains the metric 55 | column_name: The column that the aggregation is based on 56 | aggregation: The aggregation method to use 57 | important_field: It refers to key business metrics. 58 | number_format: The way to format a string. Defaults to NumberFormat.STANDARD. 
59 | """ 60 | super().__init__(name, description, data_set) 61 | self.column_name = column_name 62 | self.aggregation = aggregation 63 | self.important_field = important_field 64 | self.number_format = number_format 65 | self.more_url = more_url 66 | 67 | def __repr__(self) -> str: 68 | return f'' 69 | 70 | def display_formula(self) -> str: 71 | return f"{self.aggregation}({self.column_name})" 72 | 73 | 74 | class ComposedMetric(Metric): 75 | def __init__(self, name: str, description: str, data_set: 'DataSet', 76 | parent_metrics: [Metric], formula_template: str, important_field: bool = False, 77 | number_format: NumberFormat = NumberFormat.STANDARD, 78 | more_url: typing.Optional[str] = None) -> None: 79 | """ 80 | A metric that is based on a list of simple metrics. 81 | Args: 82 | name: How the metric is displayed in front-ends, e.g. "Revenue after cancellations" 83 | description: A meaningful business definition of the metric 84 | data_set: The data set that contains the metric 85 | parent_metrics: The parent metrics that this metric is composed of 86 | formula_template: How to compose the parent metrics, with '{}' as placeholders 87 | Examples: '{} + {}', '{} / ({} + {})' 88 | important_field: It refers to key business metrics. 89 | number_format: The way to format a string. Defaults to NumberFormat.STANDARD. 
90 | """ 91 | super().__init__(name, description, data_set) 92 | self.parent_metrics = parent_metrics 93 | self.formula_template = formula_template 94 | self.important_field = important_field 95 | self.number_format = number_format 96 | self.more_url = more_url 97 | 98 | def __repr__(self) -> str: 99 | return f'' 100 | 101 | def display_formula(self) -> str: 102 | return self.formula_template.format(*[f'[{metric.name}]' for metric in self.parent_metrics]) 103 | -------------------------------------------------------------------------------- /mara_schema/ui/graph.py: -------------------------------------------------------------------------------- 1 | import graphviz 2 | from mara_page.xml import _ 3 | 4 | from ..data_set import DataSet 5 | from ..metric import ComposedMetric 6 | 7 | link_color_ = '#0275d8' 8 | font_size_ = '10.5px' 9 | line_color_ = '#888888' 10 | edge_arrow_size_ = '0.7' 11 | 12 | 13 | def overview_graph(): 14 | from .. import config 15 | from .views import data_set_url 16 | 17 | all_entities = set() 18 | 19 | for data_set in config.data_sets(): 20 | all_entities.update(data_set.entity.connected_entities()) 21 | 22 | graph = graphviz.Digraph(engine='neato', graph_attr={}) 23 | 24 | for entity in all_entities: 25 | data_set = entity.data_set 26 | 27 | graph.node(name=entity.name, 28 | label=entity.name.replace(' ', '\n'), 29 | fontname=' ', 30 | fontsize=font_size_, 31 | fontcolor=link_color_ if data_set else '#222222', 32 | href=data_set_url(data_set) if data_set else None, 33 | color='transparent', 34 | tooltip=entity.description) 35 | 36 | for entity_link in entity.entity_links: 37 | label = entity_link.prefix.lower().replace(entity_link.target_entity.name.lower(),'').title() 38 | graph.edge(entity.name, 39 | entity_link.target_entity.name, 40 | headlabel=label.replace(' ', '\n') if label else None, 41 | labelfloat='true', 42 | labeldistance='2.5', 43 | labelfontsize='9.0', 44 | fontcolor='#cccccc', 45 | arrowsize=edge_arrow_size_, 46 | 
color=line_color_) 47 | 48 | return _render_graph(graph) 49 | 50 | 51 | def data_set_graph(data_set: DataSet) -> str: 52 | from .views import data_set_url 53 | 54 | paths = data_set.paths_to_connected_entities() 55 | if not paths: 56 | return '' 57 | 58 | graph = graphviz.Digraph(engine='neato', graph_attr={}) 59 | 60 | graph.node(name='root', 61 | label=data_set.entity.name.replace(' ', '\n'), 62 | fontname=' ', 63 | fontsize=font_size_, 64 | color='#888888', 65 | height='0.1', 66 | fontcolor='#222222', 67 | style='dotted', 68 | shape='rectangle', 69 | tooltip=data_set.entity.description 70 | ) 71 | 72 | for path in paths: 73 | entity_link = path[-1] 74 | 75 | data_set = entity_link.target_entity.data_set 76 | graph.node(name=str(path), 77 | label=entity_link.target_entity.name.replace(' ', '\n'), 78 | fontname=' ', 79 | fontsize=font_size_, 80 | color='transparent', 81 | height='0.1', 82 | href=data_set_url(data_set) if data_set else None, 83 | fontcolor=link_color_ if data_set else None, 84 | tooltip=entity_link.target_entity.description 85 | ) 86 | 87 | label = entity_link.prefix.lower().replace(entity_link.target_entity.name.lower(),'').title() 88 | graph.edge('root' if len(path) == 1 else str(path[:-1]), str(path), 89 | color=line_color_, 90 | headlabel=label.replace(' ', '\n') if label else None, 91 | labelfloat='true', 92 | labeldistance='2.5', 93 | labelfontsize='9.0', 94 | fontcolor='#cccccc', 95 | arrowsize=edge_arrow_size_) 96 | 97 | return _render_graph(graph) 98 | 99 | 100 | def metrics_graph(data_set: DataSet) -> str: 101 | graph = graphviz.Digraph(engine='dot', graph_attr={'rankdir': 'TD', 102 | 'ranksep': '0.2', 103 | 'nodesep': '0.15', 104 | 'splines': 'true' 105 | }) 106 | 107 | connected_metrics = set() 108 | for metric in data_set.metrics.values(): 109 | if isinstance(metric, ComposedMetric): 110 | connected_metrics.add(metric) 111 | for parent_metric in metric.parent_metrics: 112 | connected_metrics.add(parent_metric) 113 | 
graph.edge(parent_metric.name, 114 | metric.name, 115 | color=line_color_, 116 | arrowsize=edge_arrow_size_) 117 | 118 | for metric in connected_metrics: 119 | graph.node(name=metric.name, 120 | label=metric.name.replace(' ', '\n'), 121 | fontname=' ', 122 | fontcolor='#222222', 123 | fontsize=font_size_, 124 | color='transparent', 125 | height='0.1', 126 | tooltip=f'{metric.description}\n\n{metric.display_formula()}') 127 | return _render_graph(graph) 128 | 129 | 130 | def _render_graph(graph: graphviz.Digraph) -> str: 131 | try: 132 | return graph.pipe('svg').decode('utf-8') 133 | except graphviz.backend.ExecutableNotFound as e: 134 | import uuid 135 | # This exception occurs when the graphviz tools are not found. 136 | # We use here a fallback to client-side rendering using the javascript library d3-graphviz. 137 | graph_id = f'dependency_graph_{uuid.uuid4().hex}' 138 | escaped_graph_source = graph.source.replace("`","\\`") 139 | return str(_.div(id=graph_id)[ 140 | _.tt(style="color:red")[str(e)], 141 | ]) + str(_.script[ 142 | f'div=d3.select("#{graph_id}");', 143 | 'graph=div.graphviz();', 144 | 'div.text("");', 145 | f'graph.renderDot(`{escaped_graph_source}`);', 146 | ]) 147 | -------------------------------------------------------------------------------- /mara_schema/entity.py: -------------------------------------------------------------------------------- 1 | import typing 2 | 3 | from .attribute import Attribute, Type 4 | 5 | 6 | class Entity(): 7 | def __init__(self, name: str, description: str, 8 | schema_name: str, table_name: str = None, 9 | pk_column_name: str = None): 10 | """ 11 | A business object with attributes and links to other entities, corresponds to a table in the dimensional schema 12 | 13 | Args: 14 | name: A short noun phrase that captures the nature of the entity. E.g. "Customer", "Order item" 15 | description: A short text that helps to understand the underlying business process. 16 | E.g. 
"People who registered through the web site or installed the app" 17 | schema_name: The database schema of the underlying table in the dimensional schema, e.g. "xy_dim" 18 | table_name: The name of the underlying table in the dimensional schema, e.g. "order_item". 19 | Defaults to the lower-cased entity name with spaces replaced by underscores 20 | pk_column_name: The primary key column in the underlying table, defaults to table_name + '_id' 21 | """ 22 | self.name = name 23 | self.description = description 24 | self.schema_name = schema_name 25 | self.table_name = table_name or name.lower().replace(' ', '_') 26 | self.pk_column_name = pk_column_name or f'{self.table_name}_id' 27 | 28 | self.attributes = [] 29 | self.entity_links = [] 30 | self.data_set = None # the data set that contains the entity 31 | 32 | def __repr__(self) -> str: 33 | return f'' 34 | 35 | def add_attribute(self, name: str, description: str, column_name: str = None, type: Type = None, 36 | high_cardinality: bool = False, personal_data: bool = False, important_field: bool = False, 37 | accessible_via_entity_link: bool = True, 38 | more_url: typing.Optional[str] = None, 39 | ) -> None: 40 | """ 41 | Adds a property based on a column in the underlying dimensional table to the entity 42 | 43 | Args: 44 | name: How the attribute is displayed in front-ends, e.g. "Order date" 45 | description: A meaningful business definition of the attribute. E.g. "The date when the order was placed" 46 | column_name: The name of the column in the underlying database table. 47 | Defaults to the lower-cased name with white-spaced replaced by underscores. 48 | type: The type of the attribute, see definition of `Type` enum 49 | high_cardinality: It refers to columns with values that are very uncommon or unique. Defaults to False. 50 | personal_data: It refers to person related data, e.g. "Email address", "Name". 51 | important_field: A field that highlights the the data set. 
Shown by default in overviews 52 | accessible_via_entity_link: If False, then this attribute is excluded from data sets that are not 53 | based on this entity. 54 | more_url: URL (as string) which should be appended as a `more...` link in the UI. 55 | """ 56 | self.attributes.append( 57 | Attribute( 58 | name=name, 59 | description=description, 60 | column_name=column_name or name.lower().replace(' ', '_'), 61 | accessible_via_entity_link=accessible_via_entity_link, 62 | type=type, 63 | high_cardinality=high_cardinality, 64 | personal_data=personal_data, 65 | important_field=important_field, 66 | more_url=more_url, 67 | )) 68 | 69 | def remove_attribute(self, name: str) -> None: 70 | """ 71 | Removes a property based on a column in the underlying dimensional table from the entity 72 | 73 | Args: 74 | name: How the attribute is displayed in front-ends, e.g. "Order date" 75 | """ 76 | self.attributes.remove(self.find_attribute(name)) 77 | 78 | def link_entity(self, target_entity: 'Entity', fk_column: str = None, 79 | prefix: str = None, description=None) -> None: 80 | """ 81 | Adds a link from the entity to another entity, corresponds to a foreign key relationship 82 | 83 | Args: 84 | target_entity: The referenced entity, e.g. an "Order" entity 85 | fk_column: The foreign key column in the source entity, e.g. "first_order_fk" in the "customer" table 86 | prefix: Attributes from the linked entity will be prefixed with this, e.g "First order". 87 | Defaults to the name of the linked entity. 
88 | description: A short explanation for the relation between the entity and target entity 89 | """ 90 | self.entity_links.append( 91 | EntityLink( 92 | target_entity=target_entity, 93 | fk_column=fk_column or f'{target_entity.table_name}_fk', 94 | prefix=prefix if prefix is not None else target_entity.name, 95 | description=description)) 96 | 97 | def find_entity_link(self, target_entity_name: str, prefix: str = None) -> 'EntityLink': 98 | """Find an EntityLink by its target entity name or prefix.""" 99 | 100 | entity_links = [entity_link for entity_link in self.entity_links 101 | if entity_link.target_entity.name == target_entity_name 102 | and (prefix is None or prefix == entity_link.prefix)] 103 | 104 | if not entity_links: 105 | raise LookupError(f"""Linked entity "{target_entity_name}" / "{prefix or ''}" not found in {self}""") 106 | 107 | if len(entity_links) > 1: 108 | raise LookupError(f"""Multiple linked entities found for "{target_entity_name}" / "{prefix}" """) 109 | 110 | return entity_links[0] 111 | 112 | def find_attribute(self, attribute_name: str) -> Attribute: 113 | """Find an attribute by its name""" 114 | attribute = next((attribute for attribute in self.attributes if attribute.name == attribute_name), None) 115 | if not attribute: 116 | raise KeyError(f'Attribute "{attribute_name}" not found in f{self}') 117 | return attribute 118 | 119 | def connected_entities(self) -> ['Entity']: 120 | """ Find all recursively linked entities. 
""" 121 | result = set([self]) 122 | 123 | def traverse_graph(entity: Entity): 124 | for link in entity.entity_links: 125 | if link.target_entity not in result: 126 | result.add(link.target_entity) 127 | traverse_graph(link.target_entity) 128 | 129 | traverse_graph(self) 130 | 131 | return result 132 | 133 | 134 | class EntityLink(): 135 | def __init__(self, target_entity: Entity, prefix: str, 136 | description: str = None, fk_column: str = None) -> None: 137 | """ 138 | A link from an entity to another entity, corresponds to a foreign key relationship 139 | 140 | Args: 141 | target_entity: The referenced entity, e.g. an "Order" entity 142 | prefix: Attributes from the linked entity will be prefixed with this, e.g "First order". 143 | description: A short explanation for the relation between the entity and target entity 144 | fk_column: The foreign key column in the source entity, e.g. "first_order_fk" in the "customer" table 145 | """ 146 | self.target_entity = target_entity 147 | self.prefix = prefix 148 | self.description = description 149 | self.fk_column = fk_column or f'{target_entity.table_name}_fk' 150 | 151 | def __repr__(self) -> str: 152 | return f'' 153 | -------------------------------------------------------------------------------- /docs/example.rst: -------------------------------------------------------------------------------- 1 | Example 2 | ======= 3 | 4 | Let's consider the following toy example of a dimensional schema in the data warehouse of a hypothetical e-commerce company: 5 | 6 | .. image:: _static/example-dimensional-database-schema.svg 7 | :alt: Example dimensional star schema 8 | 9 | Each box is a database table with its columns, and the lines between tables show the foreign key constraints. That's a classic Kimball style `snowflake schema `_ and it requires a proper modelling / ETL layer in your data warehouse. A script that creates these example tables in PostgreSQL can be found in `example/dimensional-schema.sql `_. 
10 | 11 | It's a prototypical data warehouse schema for B2C e-commerce: There are orders composed of individual product purchases (order items) made by customers. There are circular references: Orders have a customer, and customers have a first order. Order items have a product (and thus a product category) and customers have a favourite product category. 12 | 13 | The respective entity and data set definitions for this database schema can be found in the `mara_schema/example `_ directory. 14 | 15 | Entities 16 | -------- 17 | 18 | In Mara Schema, each business-relevant table in the dimensional schema is mapped to an `Entity `_. In dimensional modelling terms, entities can be both fact tables and dimensions. For example, a customer entity can be a dimension of an order items data set (a.k.a. "cube", "model", "data mart") and a customer data set of its own. 19 | 20 | Here's a `shortened `_ definition of the "Order item" entity based on the ``dim.order_item`` table: 21 | 22 | .. code-block:: python 23 | 24 | from mara_schema.entity import Entity 25 | 26 | order_item_entity = Entity( 27 | name='Order item', 28 | description='Individual products sold as part of an order', 29 | schema_name='dim') 30 | 31 | It assumes that there is an ``order_item`` table in the ``dim`` schema of the data warehouse, with ``order_item_id`` as the primary key. The optional ``table_name`` and ``pk_column_name`` parameters can be used when another naming scheme for tables and primary keys is used. 32 | 33 | Attributes 34 | ---------- 35 | 36 | `Attributes `_ represent facts about an entity. They correspond to the non-numerical columns in a fact or dimension table: 37 | 38 | ..
code-block:: python 39 | 40 | from mara_schema.attribute import Type 41 | 42 | order_item_entity.add_attribute( 43 | name='Order item ID', 44 | description='The ID of the order item in the backend', 45 | column_name='order_item_id', 46 | type=Type.ID, 47 | high_cardinality=True) 48 | 49 | They come with a descriptive name (as shown in reporting front-ends), a description and a ``column_name`` in the underlying database table. 50 | 51 | There are several parameters for controlling the generation of artifact tables and the visibility in front-ends: 52 | - Setting ``personal_data`` to ``True`` means that the attribute contains personally identifiable information and thus should be hidden from most users. 53 | - When ``high_cardinality`` is ``True``, then the attribute is hidden in front-ends that cannot deal well with dimensions with many values. 54 | - The ``type`` attribute controls how some fields are treated in artifact creation. See `mara_schema/attribute.py#L7 `_. 55 | - An ``important_field`` highlights the data set and is shown by default in overviews. 56 | - When ``accessible_via_entity_link`` is ``False``, then the attribute will be hidden in data sets that use the entity as a dimension. 57 | 58 | Linking entities 59 | ---------------- 60 | 61 | The attributes of the dimensions of an entity are recursively linked with the ``link_entity`` method: 62 | 63 | .. code-block:: python 64 | 65 | from .order import order_entity 66 | from .product import product_entity 67 | 68 | order_item_entity.link_entity(target_entity=order_entity, prefix='') 69 | order_item_entity.link_entity(target_entity=product_entity) 70 | 71 | This pulls in attributes of other entities that are connected to an entity table via foreign key columns. When the other entity is called "Foo bar", then it's assumed that there is a ``foo_bar_fk`` in the entity table (can be overwritten with the ``fk_column`` parameter). The optional ``prefix`` controls how linked attributes are named (e.g.
"First order date" vs "Order date") and also helps to disambiguate when there are multiple links from one entity to another. 72 | 73 | Data Sets 74 | --------- 75 | 76 | Once all entities and their relationships are established, `Data Sets `_ (a.k.a "cubes", "models" or "data marts") add metrics and attributes from linked entities to an entity: 77 | 78 | .. code-block:: python 79 | 80 | from mara_schema.data_set import DataSet 81 | 82 | from ..entities.order_item import order_item_entity 83 | 84 | order_items_data_set = DataSet(entity=order_item_entity, name='Order items') 85 | 86 | 87 | There are two kinds of `Metrics `_ (a.k.a "Measures") in Mara Schema: simple metrics and composed metrics. Simple metrics are computed as direct aggregations on an entity table column: 88 | 89 | .. code-block:: python 90 | 91 | from mara_schema.data_set import Aggregation 92 | 93 | order_items_data_set.add_simple_metric( 94 | name='# Orders', 95 | description='The number of valid orders (orders with an invoice)', 96 | column_name='order_fk', 97 | aggregation=Aggregation.DISTINCT_COUNT, 98 | important_field=True) 99 | 100 | order_items_data_set.add_simple_metric( 101 | name='Product revenue', 102 | description='The price of the ordered products as shown in the cart', 103 | aggregation=Aggregation.SUM, 104 | column_name='product_revenue', 105 | important_field=True) 106 | 107 | In this example the metric "# Orders" is defined as the distinct count on the ``order_fk`` column, and "Product revenue" as the sum of the ``product_revenue`` column. 108 | 109 | Composed metrics are built from other metrics (both simple and composed) like this: 110 | 111 | .. 
code-block:: python 112 | 113 | order_items_data_set.add_composed_metric( 114 | name='Revenue', 115 | description='The total cart value of the order', 116 | formula='[Product revenue] + [Shipping revenue]', 117 | important_field=True) 118 | 119 | order_items_data_set.add_composed_metric( 120 | name='AOV', 121 | description='The average revenue per order. Attention: not meaningful when split by product', 122 | formula='[Revenue] / [# Orders]', 123 | important_field=True) 124 | 125 | The ``formula`` parameter takes simple algebraic expressions (``+``, ``-``, ``*``, ``/`` and parentheses) with the names of the parent metrics in square brackets, e.g. ``([a] + [b]) / [c]``. 126 | 127 | Excluding specific entity links 128 | ------------------------------- 129 | 130 | With complex snowflake schemas the graph of linked entities can become rather big. To avoid cluttering data sets with unnecessary attributes, Mara Schema has a way to exclude entire entity links: 131 | 132 | ``customers_data_set.exclude_path(['Order', 'Customer'])`` 133 | 134 | This means that the customer of the first order of a customer will not be part of the customers data set. Similarly, it is possible to limit the list of attributes from a linked entity: 135 | 136 | ``order_items_data_set.include_attributes(['Order', 'Customer', 'Order'], ['Order date'])`` 137 | 138 | Here only the order date of the first order of the customer of the order will be included in the data set. 
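As an illustration of how a ``formula`` string of a composed metric can be split into a template and the names of its parent metrics, here is a minimal standalone sketch. It follows the regular-expression approach visible in ``DataSet.add_composed_metric``, but ``split_formula`` is a hypothetical helper name chosen for this example and is not part of the Mara Schema API; it also skips the check that each referenced metric actually exists in the data set.

```python
import re


def split_formula(formula: str):
    """Split '[a] + [b]' into a format template and the referenced metric names.

    Hypothetical helper for illustration only; not part of the Mara Schema API.
    """
    # collapse line breaks and runs of whitespace: ' [a] \n + [b]' -> '[a] + [b]'
    cleaned = re.sub(r'\s\s+', ' ', formula.strip().replace('\n', ''))
    # '[a] + [b]' -> ['', 'a', ' + ', 'b', '']
    parts = re.split(r'\[(.*?)\]', cleaned)
    # every second element (starting at index 1) is a metric name
    metric_names = parts[1::2]
    # the remaining pieces are joined with '{}' placeholders
    template = '{}'.join(parts[0::2])
    return template, metric_names


template, names = split_formula('([Product revenue] + [Shipping revenue]) / [# Orders]')
print(template)
print(names)
```

The template can later be filled with one SQL expression per parent metric via ``str.format``, which is why the order of the extracted names matters.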
139 | -------------------------------------------------------------------------------- /docs/_static/example-dimensional-database-schema.svg: -------------------------------------------------------------------------------- 1 | [SVG diagram of the example dimensional schema; the markup is not recoverable, only the text labels survive. Tables: dim.customer (customer_id, email, duration_since_first_order, first_order_fk, favourite_product_category_fk, number_of_orders, revenue_lifetime); dim.order (order_id, customer_fk, order_date, status); dim.product_category (product_category_id, level_1, level_2); dim.status; dim.order_item (order_item_id, order_fk, product_fk, product_revenue, shipping_revenue); dim.product (product_id, sku, product_category_fk, revenue_all_time). Edges: dim.customer->dim.order, dim.customer->dim.product_category, dim.order->dim.customer, dim.order->dim.status, dim.order_item->dim.order, dim.order_item->dim.product, dim.product->dim.product_category] -------------------------------------------------------------------------------- /mara_schema/data_set.py: -------------------------------------------------------------------------------- 1 | import re 2 | import typing 3 | 4 | from .attribute import Attribute 5 | from .entity import Entity, EntityLink 6 | from .metric import NumberFormat, Aggregation, SimpleMetric, ComposedMetric 7 | 8 | 9 | class DataSet(): 10 | def __init__(self, entity: Entity, name: str): 11 | """ 12 | An entity with its metrics and 
recursively linked entities. 13 | 14 | Args: 15 | entity: The underlying entity with its attributes and linked other entities 16 | name: The name of the data set. 17 | """ 18 | self.entity = entity 19 | self.name = name 20 | 21 | self.entity.data_set = self 22 | self.metrics = {} 23 | self.excluded_paths = set() 24 | self.included_attributes = {} 25 | self.excluded_attributes = {} 26 | 27 | def __repr__(self) -> str: 28 | return f'' 29 | 30 | def add_simple_metric(self, name: str, description: str, column_name: str, aggregation: Aggregation, 31 | important_field: bool = False, 32 | number_format: NumberFormat = NumberFormat.STANDARD, 33 | more_url: typing.Optional[str] = None, 34 | ): 35 | """ 36 | Add a metric that is computed as a direct aggregation on an entity table column 37 | 38 | Args: 39 | name: How the metric is displayed in front-ends, e.g. "Revenue after cancellations" 40 | description: A meaningful business definition of the metric 41 | column_name: The column that the aggregation is based on 42 | aggregation: The aggregation method to use 43 | important_field: Whether the metric is a key business metric. 44 | number_format: How to format the metric value. Defaults to NumberFormat.STANDARD. 45 | more_url: URL (as string) which should be appended as a `more...` link in the UI. 46 | """ 47 | if name in self.metrics: 48 | raise ValueError(f'Metric "{name}" already exists in data set "{self.name}"') 49 | 50 | self.metrics[name] = SimpleMetric( 51 | name=name, 52 | description=description, 53 | data_set=self, 54 | column_name=column_name, 55 | aggregation=aggregation, 56 | important_field=important_field, 57 | number_format=number_format, 58 | more_url=more_url, 59 | ) 60 | 61 | def add_composed_metric(self, name: str, description: str, formula: str, important_field: bool = False, 62 | number_format: NumberFormat = NumberFormat.STANDARD, 63 | more_url: typing.Optional[str] = None, 64 | ): 65 | """ 66 | Add a metric that is based on other metrics (simple or composed). 
67 | 68 | Args: 69 | name: How the metric is displayed in front-ends, e.g. "Revenue after cancellations" 70 | description: A meaningful business definition of the metric 71 | formula: How to compute the metric. Examples: [Metric A] + [Metric B], [Metric A] / ([Metric B] + [Metric C]) 72 | important_field: Whether the metric is a key business metric. 73 | number_format: How to format the metric value. Defaults to NumberFormat.STANDARD. 74 | more_url: URL (as string) which should be appended as a `more...` link in the UI. 75 | """ 76 | if name in self.metrics: 77 | raise ValueError(f'Metric "{name}" already exists in data set "{self.name}"') 78 | 79 | # ' [a] \n + [b]' -> '[a] + [b]' (raw string avoids an invalid escape sequence warning) 80 | formula_cleaned = re.sub(r"\s\s+", " ", formula.strip().replace('\n', '')) 81 | 82 | # split '[a] + [b]' -> ['', 'a', ' + ', 'b', ''] 83 | formula_split = re.split(r'\[(.*?)\]', formula_cleaned) 84 | 85 | parent_metrics = [] 86 | for metric_name in formula_split[1::2]: # 1::2: start at the second element, take every 2nd 87 | if metric_name not in self.metrics: 88 | raise ValueError(f'Could not find metric "{metric_name}" in data set "{self.name}"') 89 | parent_metrics.append(self.metrics[metric_name]) 90 | 91 | self.metrics[name] = ComposedMetric(name=name, 92 | description=description, 93 | data_set=self, 94 | parent_metrics=parent_metrics, 95 | formula_template='{}'.join(formula_split[0::2]), 96 | important_field=important_field, 97 | number_format=number_format, 98 | more_url=more_url, 99 | ) 100 | 101 | _PathSpec = typing.TypeVar('_PathSpec', typing.Sequence[typing.Union[str, typing.Tuple[str, str]]], bytes) 102 | 103 | def _parse_path(self, entity: Entity, path: _PathSpec) -> typing.Union[ 104 | typing.Tuple, typing.Tuple[EntityLink, EntityLink]]: 105 | """ 106 | Helper function for parsing path specifications into a tuple of entity link instances 107 | 108 | Args: 109 | entity: the entity for which to resolve the entity links 110 | path: How to get to the entity from the entity of the data set. 
111 | A list of either strings (target entity names) or tuples of strings (target entity name + prefix). 112 | Example: ['Entity 1', ('Entity 2', 'Prefix'), 'Entity 3'] 113 | """ 114 | if not path: 115 | return () 116 | 117 | if not (isinstance(path[0], str) or (isinstance(path[0], tuple) and len(path[0]) == 2)): 118 | raise TypeError(f'Expecting a string or a tuple of two strings, got: {path[0]}') 119 | 120 | target_entity_name, prefix = (path[0], None) if isinstance(path[0], str) else path[0] 121 | entity_link = entity.find_entity_link(target_entity_name, prefix) 122 | 123 | return (entity_link,) + self._parse_path(entity_link.target_entity, path[1::]) 124 | 125 | def exclude_path(self, path: _PathSpec): 126 | """ 127 | Exclude a connected entity from generated data set tables by specifying the entity links to that entity 128 | 129 | Args: 130 | path: How to get to the entity from the data set entity. 131 | A list of either strings (target entity names) or tuples of strings (target entity name + prefix). 132 | Example: ['Entity 1', ('Entity 2', 'Prefix'), 'Entity 3'] 133 | """ 134 | self.excluded_paths.add(self._parse_path(self.entity, path)) 135 | 136 | def exclude_attributes(self, path: _PathSpec, attribute_names: [str] = None): 137 | """ 138 | Exclude attributes of a connected entity in generated data set tables. 139 | 140 | Args: 141 | path: How to get to the entity from the data set entity. 142 | A list of either strings (target entity names) or tuples of strings (target entity name + prefix). 143 | Example: ['Entity 1', ('Entity 2', 'Prefix'), 'Entity 3'] 144 | attribute_names: A list of name of attributes to be excluded. 
If not provided, then exclude all attributes 145 | """ 146 | entity_links = self._parse_path(self.entity, path) 147 | entity = entity_links[-1].target_entity 148 | 149 | if not attribute_names: 150 | self.excluded_attributes[entity_links] = entity.attributes 151 | else: 152 | self.excluded_attributes[entity_links] = [entity.find_attribute(attribute_name) for attribute_name in 153 | attribute_names] 154 | 155 | def include_attributes(self, path: _PathSpec, attribute_names: [str]): 156 | """ 157 | Exclude all attributes except the explicitly included ones of a connected entity in generated data set tables. 158 | 159 | Args: 160 | path: How to get to the entity from the data set entity. 161 | A list of either strings (target entity names) or tuples of strings (target entity name + prefix). 162 | Example: ['Entity 1', ('Entity 2', 'Prefix'), 'Entity 3'] 163 | attribute_names: A list of names of attributes to be included. 164 | """ 165 | entity_links = self._parse_path(self.entity, path) 166 | 167 | self.included_attributes[entity_links] = [entity_links[-1].target_entity.find_attribute(attribute_name) for 168 | attribute_name in attribute_names] 169 | 170 | def paths_to_connected_entities(self) -> [(EntityLink,)]: 171 | """ 172 | Get all possible paths to connected entities (tuples of entity links) 173 | - that are not explicitly excluded 174 | - that are not beyond the max link depth or that are explicitly included 175 | """ 176 | 177 | paths = [] 178 | 179 | def _append_path_including_subpaths(paths, path) -> typing.List[typing.Tuple[EntityLink]]: 180 | """Append a path and its subpaths to the list of paths, if they do not already exist. 
A subpath always starts 181 | at the beginning of the path: (1,2,3) -> (1,), (1,2), (1,2,3) 182 | """ 183 | for i in range(len(path)): 184 | if path[:i + 1] not in paths: 185 | paths.append(path[:i + 1]) 186 | return paths 187 | 188 | def traverse_graph(entity: Entity, current_path: tuple): 189 | for entity_link in entity.entity_links: 190 | path = current_path + (entity_link,) 191 | 192 | if (entity_link not in current_path # check for circles in path 193 | and (path not in self.excluded_paths) # explicitly excluded paths 194 | ): 195 | _append_path_including_subpaths(paths, path) 196 | traverse_graph(entity_link.target_entity, path) 197 | 198 | traverse_graph(self.entity, ()) 199 | 200 | return paths 201 | 202 | def connected_attributes(self, include_personal_data: bool = True) -> {(EntityLink,): {str: Attribute}}: 203 | """ 204 | Returns all attributes with their prefixed name from all connected entities. 205 | 206 | Args: 207 | include_personal_data: If False, then exclude fields that are marked as personal data 208 | 209 | Returns: 210 | A dictionary with the paths as keys and dictionaries of prefixed attribute names and 211 | attributes as values. 
Example: 212 | {(, , 213 | 'Prefixed attribute 2 name': }, 214 | ..} 215 | """ 216 | result = {(): {attribute.prefixed_name(): attribute for attribute in self.entity.attributes}} 217 | 218 | for path in self.paths_to_connected_entities(): 219 | result[path] = {} 220 | entity = path[-1].target_entity 221 | for attribute in entity.attributes: 222 | if ((path in self.included_attributes and attribute in self.included_attributes[path]) 223 | or (path not in self.included_attributes)) \ 224 | and ((path in self.excluded_attributes and attribute not in self.excluded_attributes[path]) 225 | or (path not in self.excluded_attributes)) \ 226 | and attribute.accessible_via_entity_link and ( 227 | include_personal_data or not attribute.personal_data): 228 | result[path][attribute.prefixed_name(path)] = attribute 229 | return result 230 | 231 | def id(self): 232 | """Returns a representation that can be used in urls""" 233 | from html import escape 234 | return escape(self.name.replace(' ', '_').lower()) 235 | -------------------------------------------------------------------------------- /mara_schema/sql_generation.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | import sqlalchemy 4 | import sqlalchemy.engine 5 | 6 | from .attribute import Type, normalize_name 7 | from .data_set import DataSet 8 | from .entity import EntityLink 9 | from .metric import SimpleMetric, Aggregation 10 | 11 | 12 | def data_set_sql_query(data_set: DataSet, 13 | human_readable_columns=True, 14 | pre_computed_metrics=True, 15 | star_schema: bool = False, 16 | star_schema_transitive_fks: bool = True, 17 | personal_data=True, 18 | high_cardinality_attributes=True, 19 | engine: sqlalchemy.engine.Engine = None) -> str: 20 | """ 21 | Returns a SQL select statement that flattens all linked entities of a data set into a wide table 22 | 23 | Args: 24 | data_set: the data set to flatten 25 | human_readable_columns: Whether to use "Customer name" rather 
than "customer_name" as column name 26 | pre_computed_metrics: Whether to pre-compute composed metrics, counts and distinct counts on row level 27 | star_schema: Whether to add foreign keys to the tables of linked entities rather than including their attributes. 28 | star_schema_transitive_fks: Whether to include all attributes of all transitively linked entities. When False, 29 | only their respective foreign keys are included. Defaults to True. 30 | Example for star_schema_transitive_fks = False: 31 | SELECT order.id 32 | order.date 33 | order.price 34 | 35 | customer.customer_fk 36 | 37 | store.store_fk 38 | FROM order 39 | LEFT JOIN customer 40 | LEFT JOIN store 41 | personal_data: Whether to include attributes that are marked as personal data 42 | high_cardinality_attributes: Whether to include attributes that are marked to have a high cardinality 43 | engine: A sqlalchemy engine that is used to quote database identifiers. Defaults to a PostgreSQL engine. 44 | 45 | Returns: 46 | A string containing the select statement 47 | """ 48 | engine = engine or sqlalchemy.create_engine('postgresql+psycopg2://') 49 | 50 | def quote(name) -> str: 51 | """Quote a column or table name for the specified database engine""" 52 | return engine.dialect.identifier_preparer.quote(name) 53 | 54 | # alias for the underlying table of the entity of the data set 55 | entity_table_alias = database_identifier(data_set.entity.name) 56 | 57 | # progressively build the query 58 | query = 'SELECT' 59 | 60 | column_definitions = [] 61 | 62 | # Iterate all connected entities 63 | for path, attributes in data_set.connected_attributes().items(): 64 | first = True # for adding an empty line between each entity 65 | 66 | # helper function for adding a column 67 | def add_column_definition(table_alias: str, column_name: str, column_alias: str, 68 | cast_to_text: bool, first: bool, custom_column_expression: str = None): 69 | column_definition = '\n ' if first else ' ' 70 | column_definition += 
custom_column_expression or f'{quote(table_alias)}.{quote(column_name)}' 71 | if cast_to_text: 72 | if engine.url.drivername.startswith('postgresql'): 73 | column_definition += '::TEXT' 74 | elif engine.url.drivername.startswith('bigquery'): 75 | column_definition = f'CAST({column_definition} AS STRING)' 76 | elif engine.url.drivername.startswith('mssql'): 77 | column_definition = f'CAST({column_definition} AS NVARCHAR)' 78 | else: 79 | raise NotImplementedError(f'Casting to text is not implemented for engine {engine.url.drivername}') 80 | if column_alias != column_name: 81 | column_definition += f' AS {quote(column_alias)}' 82 | column_definitions.append(column_definition) 83 | 84 | return False 85 | 86 | if star_schema and path: # create a foreign key to the last entity of the path 87 | first = add_column_definition( 88 | table_alias=table_alias_for_path(path[:-1]) if len(path) > 1 else entity_table_alias, 89 | column_name=path[-1].fk_column, 90 | column_alias=(normalize_name(' '.join([entity_link.prefix or entity_link.target_entity.name 91 | for entity_link in path])) 92 | if human_readable_columns else foreign_key_column_name(table_alias_for_path(path))), 93 | cast_to_text=False, first=first) 94 | 95 | # Add columns for all attributes 96 | # Always add all columns for the first object (i.e. 
the original dataset) as indicated by path == () 97 | if star_schema_transitive_fks or path == (): 98 | for name, attribute in attributes.items(): 99 | if attribute.personal_data and not personal_data: 100 | continue 101 | if attribute.high_cardinality and not high_cardinality_attributes: 102 | continue 103 | 104 | table_alias = table_alias_for_path(path) if path else entity_table_alias 105 | column_name = attribute.column_name 106 | column_alias = name if human_readable_columns else database_identifier(name) 107 | custom_column_expression = None 108 | 109 | if star_schema: # Add foreign keys for dates and durations 110 | if attribute.type == Type.DATE: 111 | if engine.url.drivername.startswith('postgresql'): 112 | custom_column_expression = f"TO_CHAR({quote(table_alias)}.{quote(column_name)}, 'YYYYMMDD') :: INTEGER" 113 | elif engine.url.drivername.startswith('bigquery'): 114 | custom_column_expression = f"CAST(FORMAT_DATE('%Y%m%d',{quote(table_alias)}.{quote(column_name)}) AS INT64)" 115 | elif engine.url.drivername.startswith('mssql'): 116 | custom_column_expression = f"CAST(CONVERT(char(8),{quote(table_alias)}.{quote(column_name)},112) AS INT)" 117 | else: 118 | raise NotImplementedError(f'Star schema casting of DATE attributes is not implemented for engine {engine.url.drivername}') 119 | column_alias = name if human_readable_columns else foreign_key_column_name(database_identifier(name)) 120 | elif attribute.type == Type.DURATION: 121 | column_alias = name if human_readable_columns else foreign_key_column_name(database_identifier(name)) 122 | elif not path: 123 | pass # Add attributes of data set entity 124 | else: 125 | continue # Exclude attributes from linked entities 126 | 127 | first = add_column_definition(table_alias=table_alias, column_name=column_name, 128 | column_alias=column_alias, 129 | cast_to_text=attribute.type == Type.ENUM, first=first, 130 | custom_column_expression=custom_column_expression) 131 | 132 | # Only add foreign key columns of 
linked entities 133 | elif star_schema_transitive_fks is False and path: 134 | first = add_column_definition( 135 | table_alias=table_alias_for_path(path[:-1]) if len(path) > 1 else entity_table_alias, 136 | column_name=path[-1].fk_column, 137 | column_alias=foreign_key_column_name(table_alias_for_path(path)), 138 | cast_to_text=False, first=first) 139 | else: 140 | assert False, 'This should not happen.' 141 | 142 | 143 | # helper function for pre-computing composed metrics 144 | def sql_formula(metric): 145 | if isinstance(metric, SimpleMetric): 146 | if metric.aggregation in [Aggregation.DISTINCT_COUNT, Aggregation.COUNT]: 147 | # for distinct counts, return 1::SMALLINT if the expression is not null 148 | if engine.url.drivername.startswith('postgresql'): 149 | return f'({quote(entity_table_alias)}.{quote(metric.column_name)} IS NOT NULL) ::INTEGER :: SMALLINT' 150 | elif engine.url.drivername.startswith('bigquery'): 151 | return f'CAST({quote(entity_table_alias)}.{quote(metric.column_name)} IS NOT NULL AS INT64)' 152 | else: 153 | return f'CASE WHEN {quote(entity_table_alias)}.{quote(metric.column_name)} IS NOT NULL THEN 1 ELSE 0 END' 154 | else: 155 | # Coalesce with 0 so that metrics that combine simplemetrics work ( in SQL `1 + NULL` is `NULL` ) 156 | return f'COALESCE({quote(entity_table_alias)}.{quote(metric.column_name)}, 0)' 157 | else: 158 | if '/' in metric.formula_template: # avoid divisions by 0 159 | if engine.url.drivername.startswith('postgresql'): 160 | return metric.formula_template.format( 161 | *[f'(NULLIF({sql_formula(metric)}, 0.0 :: DOUBLE PRECISION))' for metric in metric.parent_metrics]) 162 | else: 163 | return metric.formula_template.format( 164 | *[f'(NULLIF({sql_formula(metric)}, 0.0))' for metric in metric.parent_metrics]) 165 | 166 | else: # render metric template 167 | return metric.formula_template.format( 168 | *[f'({sql_formula(metric)})' for metric in metric.parent_metrics]) 169 | 170 | first = True 171 | for name, metric in 
data_set.metrics.items(): 172 | column_alias = metric.name if human_readable_columns else database_identifier(metric.name) 173 | 174 | if pre_computed_metrics: 175 | column_definition = f' {sql_formula(metric)} AS {quote(column_alias)}' 176 | elif isinstance(metric, SimpleMetric): 177 | column_definition = f' {quote(entity_table_alias)}.{quote(metric.column_name)}' 178 | if column_alias != metric.column_name: 179 | column_definition += f' AS {quote(column_alias)}' 180 | else: 181 | continue 182 | 183 | if first: 184 | column_definition = '\n' + column_definition 185 | first = False 186 | column_definitions.append(column_definition) 187 | 188 | # add column definitions to SELECT part 189 | query += ',\n'.join(column_definitions) 190 | 191 | # add FROM part for entity table 192 | query += f'\n\nFROM {quote(data_set.entity.schema_name)}.{quote(data_set.entity.table_name)} {quote(entity_table_alias)}' 193 | 194 | # Add LEFT JOIN statements 195 | for path in data_set.paths_to_connected_entities(): 196 | left_alias = table_alias_for_path(path[:-1]) if len(path) > 1 else database_identifier(data_set.entity.name) 197 | right_alias = table_alias_for_path(path) 198 | entity_link = path[-1] 199 | target_entity = entity_link.target_entity 200 | 201 | query += f'\nLEFT JOIN {quote(target_entity.schema_name)}.{quote(target_entity.table_name)} {quote(right_alias)}' 202 | query += f' ON {quote(left_alias)}.{quote(path[-1].fk_column)} = {quote(right_alias)}.{quote(target_entity.pk_column_name)}' 203 | 204 | return query 205 | 206 | 207 | def database_identifier(name) -> str: 208 | """Turns a string into something that can be used as a table or column name""" 209 | return re.sub('[^0-9a-z]+', '_', name.lower()) 210 | 211 | 212 | def table_alias_for_path(path: (EntityLink,)) -> str: 213 | """Turns `(, ,)` into `customer_first_order` """ 214 | return database_identifier('_'.join([entity_link.prefix or entity_link.target_entity.name 215 | for entity_link in path])) 216 | 217 | 218 | 
def foreign_key_column_name(name) -> str: 219 | """Turns a table alias into a foreign key column name""" 220 | return f'{name}_fk' 221 | -------------------------------------------------------------------------------- /mara_schema/ui/views.py: -------------------------------------------------------------------------------- 1 | """Documentation of data sets and entities""" 2 | 3 | import functools 4 | import re 5 | from html import escape 6 | 7 | import flask 8 | import unicodedata 9 | from mara_page import acl, navigation, response, bootstrap, _, html 10 | 11 | from ..data_set import DataSet 12 | 13 | # The flask blueprint that does 14 | blueprint = flask.Blueprint('mara_schema', __name__, static_folder='static', 15 | template_folder='templates', url_prefix='/schema') 16 | 17 | # Defines an ACL resource (needs to be handled by the application) 18 | acl_resource_schema = acl.AclResource(name='Schema') 19 | 20 | 21 | def data_set_url(data_set: DataSet) -> str: 22 | return flask.url_for('mara_schema.data_set_page', id=data_set.id()) 23 | 24 | 25 | _slugify_strip_re = re.compile(r'[^\w\s-]') 26 | _slugify_hyphenate_re = re.compile(r'[-\s]+') 27 | 28 | 29 | # from https://github.com/django/django/blob/0382ecfe020b4c51b4c01e4e9a21892771e66941/django/utils/text.py 30 | # Under BSD license 31 | def slugify(value, allow_unicode=False): 32 | """ 33 | Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated 34 | dashes to single dashes. Remove characters that aren't alphanumerics, 35 | underscores, or hyphens. Convert to lowercase. Also strip leading and 36 | trailing whitespace, dashes, and underscores. 
37 | """ 38 | value = str(value) 39 | if allow_unicode: 40 | value = unicodedata.normalize('NFKC', value) 41 | else: 42 | value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii') 43 | value = re.sub(_slugify_strip_re, '', value.lower()) 44 | return re.sub(_slugify_hyphenate_re, '-', value).strip('-_') 45 | 46 | 47 | def schema_navigation_entry() -> navigation.NavigationEntry: 48 | """Defines a part of the navigation tree (needs to be handled by the application). 49 | 50 | Returns: 51 | A mara NavigationEntry object. 52 | 53 | """ 54 | from .. import config 55 | 56 | return navigation.NavigationEntry( 57 | label='Data sets', icon='book', 58 | description='Documentation of attributes and metrics of all data sets', 59 | children=[navigation.NavigationEntry(label='Overview', icon='list', 60 | uri_fn=lambda: flask.url_for('mara_schema.index_page'))] 61 | + [navigation.NavigationEntry(label=data_set.name, icon='book', 62 | description=data_set.entity.description, 63 | uri_fn=lambda data_set=data_set: flask.url_for( 64 | 'mara_schema.data_set_page', id=data_set.id())) 65 | for data_set in config.data_sets()]) 66 | 67 | 68 | @blueprint.route('') 69 | @acl.require_permission(acl_resource_schema) 70 | def index_page() -> response.Response: 71 | """Renders the overview page""" 72 | from .. 
import config 73 | 74 | return response.Response( 75 | 76 | html=[ 77 | bootstrap.card( 78 | header_left='Entities & their relations', 79 | body=html.asynchronous_content(flask.url_for('mara_schema.overview_graph'))), 80 | bootstrap.card( 81 | header_left='Data sets', 82 | body=bootstrap.table( 83 | ['Name', 'Description'], 84 | [_.tr[_.td[_.a(href=flask.url_for('mara_schema.data_set_page', 85 | id=data_set.id()))[ 86 | escape(data_set.name)]], 87 | _.td[_.i[escape(data_set.entity.description)]], 88 | ] for data_set in config.data_sets()]), 89 | )], 90 | title='Data sets documentation', 91 | css_files=[flask.url_for('mara_schema.static', filename='schema.css')] 92 | ) 93 | 94 | 95 | @blueprint.route('/') 96 | @acl.require_permission(acl_resource_schema) 97 | def data_set_page(id: str) -> response.Response: 98 | """Renders the pages for individual data sets""" 99 | from .. import config 100 | 101 | data_set = next((data_set for data_set in config.data_sets() if data_set.id() == id), None) 102 | if not data_set: 103 | flask.flash(f'Could not find data set "{id}"', category='warning') 104 | return flask.redirect(flask.url_for('mara_schema.index_page')) 105 | 106 | base_url = flask.url_for('mara_schema.data_set_sql_query', id=data_set.id()) 107 | 108 | def attribute_rows(data_set: DataSet) -> []: 109 | rows = [] 110 | for path, attributes in data_set.connected_attributes().items(): 111 | if path: 112 | rows.append(_.tr[_.td(colspan=3, style='border-top:none; padding-top: 20px;')[ 113 | [['→ ', 114 | _.a(href=data_set_url(entity.data_set))[link_title] if entity.data_set else link_title, 115 | '  '] 116 | for entity, link_title 117 | in [(entity_link.target_entity, entity_link.prefix or entity_link.target_entity.name) 118 | for entity_link in path]], 119 | ['   ', _.i[path[-1].description]] if path[-1].description else '' 120 | ]]) 121 | for prefixed_name, attribute in attributes.items(): 122 | attribute_link_id = slugify(f'attribute {path[-1].target_entity.name if path 
else ""} {attribute.name}') 123 | rows.append(_.tr(id=attribute_link_id)[ 124 | _.td[ 125 | escape(prefixed_name), 126 | ' ', 127 | _.a(class_='anchor-link-sign', 128 | href=f'#{attribute_link_id}')['¶'], 129 | 130 | ], 131 | _.td[[_.i[escape(attribute.description)]] + 132 | ([' (', _.a(href=attribute.more_url)['more...'], ')'] 133 | if attribute.more_url else [])], 134 | _.td[_.tt[escape( 135 | f'{path[-1].target_entity.table_name + "." if path else ""}{attribute.column_name}')]]]) 136 | return rows 137 | 138 | def metrics_rows(data_set: DataSet) -> []: 139 | rows = [] 140 | for metric in data_set.metrics.values(): 141 | metric_link_id = slugify(f'metric {metric.name}') 142 | 143 | rows.append([_.tr(id=metric_link_id)[ 144 | _.td[ 145 | escape(metric.name), 146 | ' ', 147 | _.a(class_='anchor-link-sign', 148 | href=f'#{metric_link_id}')['¶'], 149 | 150 | ], 151 | _.td[[_.i[escape(metric.description)]] + 152 | ([' (', _.a(href=metric.more_url)['more...'], ')'] if metric.more_url else []) 153 | ], 154 | _.td[_.code[escape(metric.display_formula())]] 155 | ]]) 156 | return rows 157 | 158 | return response.Response( 159 | html=[bootstrap.card( 160 | header_left=_.i[escape(data_set.entity.description)], 161 | body=[ 162 | _.p['Entity table: ', 163 | _.code[escape(f'{data_set.entity.schema_name}.{data_set.entity.table_name}')]], 164 | html.asynchronous_content(flask.url_for('mara_schema.data_set_graph', id=data_set.id())), 165 | ]), 166 | bootstrap.card( 167 | header_left='Metrics', 168 | body=[ 169 | html.asynchronous_content(flask.url_for('mara_schema.metrics_graph', id=data_set.id())), 170 | bootstrap.table( 171 | ['Name', 'Description', 'Computation'], 172 | metrics_rows(data_set) 173 | ), 174 | ]), 175 | bootstrap.card( 176 | header_left='Attributes', 177 | body=bootstrap.table(["Name", "Description", "Column name"], attribute_rows(data_set))), 178 | bootstrap.card( 179 | header_left=['Data set sql query:  ', 180 | [_.div(class_='form-check form-check-inline')[ 181 
| "   ", 182 | _.label(class_='form-check-label')[ 183 | _.input(class_="form-check-input param-checkbox", type="checkbox", 184 | value=param)[ 185 | ''], ' ', param]] 186 | for param in [ 187 | 'human readable columns', 188 | 'pre-computed metrics', 189 | 'star schema', 190 | 'star_schema_transitive_fks', 191 | 'personal data', 192 | 'high cardinality attributes', 193 | ]]], 194 | body=[_.div(id='sql-container')[html.asynchronous_content(base_url, 'sql-container')], 195 | _.script[''' 196 | document.addEventListener('DOMContentLoaded', function() { 197 | DataSetSqlQuery("''' + base_url + '''"); 198 | }); 199 | ''']]) 200 | ], 201 | title=f'Data set "{data_set.name}"', 202 | js_files=[flask.url_for('mara_schema.static', filename='data-set-sql-query.js')], 203 | css_files=[flask.url_for('mara_schema.static', filename='mara-schema.css')], 204 | ) 205 | 206 | 207 | @blueprint.route('//_data_set_sql_query', defaults={'params': ''}) 208 | @blueprint.route('//_data_set_sql_query/') 209 | def data_set_sql_query(id: str, params: [str]) -> response.Response: 210 | from .. 
import config 211 | from ..sql_generation import data_set_sql_query 212 | 213 | params = set(params.split('/')) 214 | data_set = next((data_set for data_set in config.data_sets() if data_set.id() == id), None) 215 | if not data_set: 216 | return f'Could not find data set "{id}"' 217 | 218 | # using the engine of the default db from mara_pipelines.config.default_db_alias() 219 | engine = None 220 | try: 221 | # since mara_pipelines and mara_db are not default requirements of module mara_schema, 222 | # we use a try/except clause 223 | import mara_db.sqlalchemy_engine 224 | import mara_pipelines.config 225 | engine = mara_db.sqlalchemy_engine.engine(mara_pipelines.config.default_db_alias()) 226 | except (ImportError, ModuleNotFoundError, NotImplementedError): # a tuple is needed to catch all three 227 | pass 228 | 229 | sql = data_set_sql_query(data_set, 230 | pre_computed_metrics='pre-computed metrics' in params, 231 | human_readable_columns='human readable columns' in params, 232 | personal_data='personal data' in params, 233 | high_cardinality_attributes='high cardinality attributes' in params, 234 | star_schema='star schema' in params, 235 | star_schema_transitive_fks='star_schema_transitive_fks' in params, 236 | engine=engine) 237 | return str(_.div[html.highlight_syntax(sql, 'sql')]) 238 | 239 | 240 | @blueprint.route('/_overview_graph') 241 | @acl.require_permission(acl_resource_schema, do_abort=False) 242 | @functools.lru_cache(maxsize=None) 243 | def overview_graph() -> str: 244 | """Returns a graph of all the defined entities and data sets""" 245 | from .graph import overview_graph 246 | 247 | return overview_graph() 248 | 249 | 250 | @blueprint.route('//_data_set_graph') 251 | @acl.require_permission(acl_resource_schema) 252 | def data_set_graph(id: str) -> str: 253 | """Renders a graph with all the linked entities of an individual data set""" 254 | from .. 
import config 255 | from .graph import data_set_graph 256 | 257 | data_set = next((data_set for data_set in config.data_sets() if data_set.id() == id), None) 258 | if not data_set: 259 | return f'Could not find data set "{id}"' 260 | 261 | return data_set_graph(data_set) 262 | 263 | 264 | @blueprint.route('/<id>/_metrics_graph') 265 | @acl.require_permission(acl_resource_schema) 266 | def metrics_graph(id: str) -> str: 267 | """Renders a visualization of all composed metrics of a data set""" 268 | from .. import config 269 | from .graph import metrics_graph 270 | 271 | data_set = next((data_set for data_set in config.data_sets() if data_set.id() == id), None) 272 | if not data_set: 273 | return f'Could not find data set "{id}"' 274 | 275 | return metrics_graph(data_set) 276 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Mara Schema 2 | 3 | [![Build Status](https://github.com/mara/mara-schema/actions/workflows/build.yaml/badge.svg)](https://github.com/mara/mara-schema/actions/workflows/build.yaml) 4 | [![PyPI - License](https://img.shields.io/pypi/l/mara-schema.svg)](https://github.com/mara/mara-schema/blob/main/LICENSE) 5 | [![PyPI version](https://badge.fury.io/py/mara-schema.svg)](https://badge.fury.io/py/mara-schema) 6 | [![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](https://communityinviter.com/apps/mara-users/public-invite) 7 | 8 | Python-based mapping of physical data warehouse tables to logical business entities (a.k.a. "cubes", "models", "data sets", etc.).
It comes with 9 | - SQL query generation for flattening normalized database tables into wide tables for various analytics front-ends 10 | - a Flask-based visualization of the schema that can serve as documentation of the business definitions of a data warehouse (a.k.a. "data dictionary" or "data guide") 11 | - the possibility to sync schemas to reporting front-ends that have meta-data APIs (e.g. Metabase, Looker, Tableau) 12 | 13 |   14 | 15 | ![Mara Schema overview](https://github.com/mara/mara-schema/raw/main/docs/_static/mara-schema.png) 16 | 17 |   18 | 19 | Have a look at a real-world application of Mara Schema in the [Mara Example Project 1](https://github.com/mara/mara-example-project-1). 20 | 21 |   22 | 23 | **Why** should I use Mara Schema? 24 | 25 | 1. **Definition of analytical business entities as code**: There are many solutions for documenting the company-wide definitions of attributes & metrics for the users of a data warehouse. These can range from simple spreadsheets or wikis to metadata management tools inside reporting front-ends. However, these definitions can quickly get out of sync when new columns are added or changed in the underlying data warehouse. Mara Schema allows you to deploy definition changes together with changes in the underlying ETL processes so that all definitions will always be in sync with the underlying data warehouse schema. 26 | 27 | 28 | 2. **Automatic generation of aggregates / artifacts**: When a company wants to enforce a *single source of truth* in their data warehouse, then a heavily normalized Kimball-style [snowflake schema](https://en.wikipedia.org/wiki/Snowflake_schema) is still the weapon of choice. It enforces an agreed-upon unified modelling of business entities across domains and ensures referential consistency. However, snowflake schemas are not ideal for analytics or data science because they require a lot of joins.
Most analytical databases and reporting tools nowadays work better with pre-flattened wide tables. Creating such flattened tables is an error-prone and dull activity, but with Mara Schema one can automate most of the work in creating flattened data set tables in the ETL. 29 | 30 |   31 | 32 | ## Installation 33 | 34 | To use the library directly, use pip: 35 | 36 | ``` 37 | pip install mara-schema 38 | ``` 39 | 40 | or 41 | 42 | ``` 43 | pip install git+https://github.com/mara/mara-schema.git 44 | ``` 45 | 46 |   47 | 48 | ## Defining entities, attributes, metrics & data sets 49 | 50 | Let's consider the following toy example of a dimensional schema in the data warehouse of a hypothetical e-commerce company: 51 | 52 | ![Example dimensional star schema](https://github.com/mara/mara-schema/raw/main/docs/_static/example-dimensional-database-schema.svg) 53 | 54 | Each box is a database table with its columns, and the lines between tables show the foreign key constraints. That's a classic Kimball style [snowflake schema](https://en.wikipedia.org/wiki/Snowflake_schema) and it requires a proper modelling / ETL layer in your data warehouse. A script that creates these example tables in PostgreSQL can be found in [example/dimensional-schema.sql](https://github.com/mara/mara-schema/blob/main/mara_schema/example/dimensional-schema.sql). 55 | 56 | It's a prototypical data warehouse schema for B2C e-commerce: There are orders composed of individual product purchases (order items) made by customers. There are circular references: Orders have a customer, and customers have a first order. Order items have a product (and thus a product category) and customers have a favourite product category. 57 | 58 | The respective entity and data set definitions for this database schema can be found in the [mara_schema/example](https://github.com/mara/mara-schema/tree/main/mara_schema/example) directory. 
59 | 60 |   61 | 62 | In Mara Schema, each business-relevant table in the dimensional schema is mapped to an [Entity](https://github.com/mara/mara-schema/blob/main/mara_schema/entity.py). In dimensional modelling terms, entities can be both fact tables and dimensions. For example, a customer entity can be a dimension of an order items data set (a.k.a. "cube", "model", "data mart") and also the basis of a customer data set of its own. 63 | 64 | Here's a [shortened](https://github.com/mara/mara-schema/blob/main/mara_schema/example/entities/order_item.py) definition of the "Order item" entity based on the `dim.order_item` table: 65 | 66 | ```python 67 | from mara_schema.entity import Entity 68 | 69 | order_item_entity = Entity( 70 | name='Order item', 71 | description='Individual products sold as part of an order', 72 | schema_name='dim') 73 | ``` 74 | 75 | It assumes that there is an `order_item` table in the `dim` schema of the data warehouse, with `order_item_id` as the primary key. The optional `table_name` and `pk_column_name` parameters can be used when another naming scheme for tables and primary keys is used. 76 | 77 |   78 | 79 | [Attributes](https://github.com/mara/mara-schema/blob/main/mara_schema/attribute.py) represent facts about an entity. They correspond to the non-numerical columns in a fact or dimension table: 80 | 81 | ```python 82 | from mara_schema.attribute import Type 83 | 84 | order_item_entity.add_attribute( 85 | name='Order item ID', 86 | description='The ID of the order item in the backend', 87 | column_name='order_item_id', 88 | type=Type.ID, 89 | high_cardinality=True) 90 | ``` 91 | 92 | They come with a descriptive name (as shown in reporting front-ends), a description and a `column_name` in the underlying database table.
93 | 94 | There are several parameters for controlling the generation of artifact tables and the visibility in front-ends: 95 | - Setting `personal_data` to `True` means that the attribute contains personally identifiable information and thus should be hidden from most users. 96 | - When `high_cardinality` is `True`, then the attribute is hidden in front-ends that cannot deal well with dimensions with a lot of values. 97 | - The `type` parameter controls how some fields are treated in artifact creation. See [mara_schema/attribute.py#L7](https://github.com/mara/mara-schema/blob/main/mara_schema/attribute.py#L7). 98 | - An `important_field` highlights the data set and is shown by default in overviews. 99 | - When `accessible_via_entity_link` is `False`, then the attribute will be hidden in data sets that use the entity as a dimension. 100 | 101 |   102 | 103 | The attributes of the dimensions of an entity are recursively linked with the `link_entity` method: 104 | 105 | ```python 106 | from .order import order_entity 107 | from .product import product_entity 108 | 109 | order_item_entity.link_entity(target_entity=order_entity, prefix='') 110 | order_item_entity.link_entity(target_entity=product_entity) 111 | ``` 112 | 113 | This pulls in attributes of other entities that are connected to an entity table via foreign key columns. When the other entity is called "Foo bar", then it's assumed that there is a `foo_bar_fk` column in the entity table (this can be overridden with the `fk_column` parameter). The optional `prefix` controls how linked attributes are named (e.g. "First order date" vs "Order date") and also helps to disambiguate when there are multiple links from one entity to another.
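
The default naming conventions mentioned so far (table name, primary key column and foreign key column derived from the entity name) can be sketched in plain Python. This is only an illustration of the conventions as described in this README, not the library's actual implementation: the helper names `to_snake_case`, `default_table_name`, `default_pk_column` and `default_fk_column` are hypothetical and not part of the Mara Schema API.

```python
def to_snake_case(entity_name: str) -> str:
    """Lower-case an entity name and join its words with underscores."""
    return '_'.join(entity_name.lower().split())


def default_table_name(entity_name: str) -> str:
    """Table assumed for an entity when no `table_name` is given (assumption)."""
    return to_snake_case(entity_name)


def default_pk_column(entity_name: str) -> str:
    """Primary key column assumed when no `pk_column_name` is given (assumption)."""
    return f'{to_snake_case(entity_name)}_id'


def default_fk_column(target_entity_name: str) -> str:
    """Foreign key column assumed when no `fk_column` is given (assumption)."""
    return f'{to_snake_case(target_entity_name)}_fk'


if __name__ == '__main__':
    # Print the derived names for a few entity names used in this README
    for name in ('Order item', 'Product', 'Foo bar'):
        print(name, default_table_name(name), default_pk_column(name), default_fk_column(name))
```

When a table deviates from these conventions, the explicit `table_name`, `pk_column_name` and `fk_column` parameters described above are the way to state the real names.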
114 | 115 |   116 | 117 | Once all entities and their relationships are established, [Data Sets](https://github.com/mara/mara-schema/blob/main/mara_schema/data_set.py) (a.k.a "cubes", "models" or "data marts") add metrics and attributes from linked entities to an entity: 118 | 119 | ```python 120 | from mara_schema.data_set import DataSet 121 | 122 | from ..entities.order_item import order_item_entity 123 | 124 | order_items_data_set = DataSet(entity=order_item_entity, name='Order items') 125 | ``` 126 | 127 |   128 | 129 | There are two kinds of [Metrics](https://github.com/mara/mara-schema/blob/main/mara_schema/metric.py) (a.k.a "Measures") in Mara Schema: simple metrics and composed metrics. Simple metrics are computed as direct aggregations on an entity table column: 130 | 131 | ```python 132 | from mara_schema.data_set import Aggregation 133 | 134 | order_items_data_set.add_simple_metric( 135 | name='# Orders', 136 | description='The number of valid orders (orders with an invoice)', 137 | column_name='order_fk', 138 | aggregation=Aggregation.DISTINCT_COUNT, 139 | important_field=True) 140 | 141 | order_items_data_set.add_simple_metric( 142 | name='Product revenue', 143 | description='The price of the ordered products as shown in the cart', 144 | aggregation=Aggregation.SUM, 145 | column_name='product_revenue', 146 | important_field=True) 147 | ``` 148 | 149 | In this example the metric "# Orders" is defined as the distinct count on the `order_fk` column, and "Product revenue" as the sum of the `product_revenue` column. 150 | 151 | Composed metrics are built from other metrics (both simple and composed) like this: 152 | 153 | ```python 154 | order_items_data_set.add_composed_metric( 155 | name='Revenue', 156 | description='The total cart value of the order', 157 | formula='[Product revenue] + [Shipping revenue]', 158 | important_field=True) 159 | 160 | order_items_data_set.add_composed_metric( 161 | name='AOV', 162 | description='The average revenue per order. 
Attention: not meaningful when split by product', 163 | formula='[Revenue] / [# Orders]', 164 | important_field=True) 165 | ``` 166 | 167 | The `formula` parameter takes simple algebraic expressions (`+`, `-`, `*`, `/` and parentheses) with the names of the parent metrics in square brackets, e.g. `([a] + [b]) / [c]`. 168 | 169 |   170 | 171 | With complex snowflake schemas the graph of linked entities can become rather big. To avoid cluttering data sets with unnecessary attributes, Mara Schema provides a way to exclude entire entity links: 172 | 173 | ```python 174 | customers_data_set.exclude_path(['Order', 'Customer']) 175 | ``` 176 | 177 | This means that the customer of the first order of a customer will not be part of the customers data set. Similarly, it is possible to limit the list of attributes from a linked entity: 178 | 179 | ```python 180 | order_items_data_set.include_attributes(['Order', 'Customer', 'Order'], ['Order date']) 181 | ``` 182 | 183 | Here only the order date of the first order of the customer of the order will be included in the data set. 184 | 185 |   186 | 187 | ## Visualization 188 | 189 | Mara Schema comes with an optional Flask-based visualization that documents the metrics and attributes of all data sets: 190 | 191 | ![Mara schema data set visualization](https://github.com/mara/mara-schema/raw/main/docs/_static/mara-schema-data-set-visualization.png) 192 | 193 | When made available to business users, this can serve as the "data dictionary", "data guide" or "data catalog" of a company.
194 | 195 |   196 | 197 | ## Artifact generation 198 | 199 | The function `data_set_sql_query` in [mara_schema/sql_generation.py](https://github.com/mara/mara-schema/blob/main/mara_schema/sql_generation.py) can be used to flatten the entities of a data set into a wide data set table: 200 | 201 | ```python 202 | data_set_sql_query(data_set=order_items_data_set, human_readable_columns=True, pre_computed_metrics=False, 203 | star_schema=False, personal_data=False, high_cardinality_attributes=True) 204 | ``` 205 | 206 | The resulting SELECT statement can be used for creating a data set table that is specifically tailored for the use in Metabase: 207 | 208 | ```sql 209 | SELECT 210 | order_item.order_item_id AS "Order item ID", 211 | 212 | "order".order_id AS "Order ID", 213 | "order".order_date AS "Order date", 214 | 215 | order_customer.customer_id AS "Customer ID", 216 | 217 | order_customer_favourite_product_category.main_category AS "Customer favourite product category level 1", 218 | order_customer_favourite_product_category.sub_category_1 AS "Customer favourite product category level 2", 219 | 220 | order_customer_first_order.order_date AS "Customer first order date", 221 | 222 | product.sku AS "Product SKU", 223 | 224 | product_product_category.main_category AS "Product category level 1", 225 | product_product_category.sub_category_1 AS "Product category level 2", 226 | 227 | order_item.order_item_id AS "# Order items", 228 | order_item.order_fk AS "# Orders", 229 | order_item.product_revenue AS "Product revenue", 230 | order_item.revenue AS "Shipping revenue" 231 | 232 | FROM dim.order_item order_item 233 | LEFT JOIN dim."order" "order" ON order_item.order_fk = "order".order_id 234 | LEFT JOIN dim.customer order_customer ON "order".customer_fk = order_customer.customer_id 235 | LEFT JOIN dim.product_category order_customer_favourite_product_category ON order_customer.favourite_product_category_fk = order_customer_favourite_product_category.product_category_id 
236 | LEFT JOIN dim."order" order_customer_first_order ON order_customer.first_order_fk = order_customer_first_order.order_id 237 | LEFT JOIN dim.product product ON order_item.product_fk = product.product_id 238 | LEFT JOIN dim.product_category product_product_category ON product.product_category_fk = product_product_category.product_category_id 239 | ``` 240 | 241 | Please note that `data_set_sql_query` only returns SQL SELECT statements; executing these statements is left to the ETL of the data warehouse. [Here](https://github.com/mara/mara-example-project-1/tree/main/app/pipelines/generate_artifacts/metabase.py) is an example of creating data set tables for Metabase using [Mara Pipelines](https://github.com/mara/mara-pipelines). 242 | 243 |   244 | 245 | There are several parameters for controlling the output of the `data_set_sql_query` function: 246 | 247 | - `human_readable_columns`: Whether to use "Customer name" rather than "customer_name" as column name 248 | - `pre_computed_metrics`: Whether to pre-compute composed metrics, counts and distinct counts at row level 249 | - `star_schema`: Whether to add foreign keys to the tables of linked entities rather than including their attributes 250 | - `personal_data`: Whether to include attributes that are marked as personal data 251 | - `high_cardinality_attributes`: Whether to include attributes that are marked as having a high cardinality 252 | 253 | ![Mara schema SQL generation](https://github.com/mara/mara-schema/raw/main/docs/_static/mara-schema-sql-generation.gif) 254 | 255 | 256 | ## Schema sync to front-ends 257 | 258 | When reporting tools have a Metadata API (e.g. Metabase, Tableau) or can read schema definitions from text files (e.g. Looker, Mondrian), then it's easy to sync definitions with them.
The [Mara Metabase](https://github.com/mara/mara-metabase) package contains a function for syncing Mara Schema definitions with Metabase and the [Mara Mondrian](https://github.com/mara/mara-mondrian) package contains a generator for a Mondrian schema. 259 | 260 | We welcome contributions for creating Looker LookML files, for syncing definitions with Tableau, and for syncing with any other BI front-end. 261 | 262 | Also, we see a potential for automatically creating data guides in other Wikis or documentation tools. 263 | 264 | 265 | ## Installation 266 | 267 | To use the library directly, use pip: 268 | 269 | ``` 270 | pip install mara-schema 271 | ``` 272 | 273 | or 274 | 275 | ``` 276 | pip install git+https://github.com/mara/mara-schema.git 277 | ``` 278 | 279 | For an example of an integration into a flask application, have a look at the [Mara Example Project 1](https://github.com/mara/mara-example-project-1). 280 | 281 |   282 | 283 | ## Links 284 | 285 | * Documentation: https://mara-schema.readthedocs.io/ 286 | * Changes: https://mara-schema.readthedocs.io/en/stable/changes.html 287 | * PyPI Releases: https://pypi.org/project/mara-schema/ 288 | * Source Code: https://github.com/mara/mara-schema 289 | * Issue Tracker: https://github.com/mara/mara-schema/issues 290 | --------------------------------------------------------------------------------