├── LICENSE
├── README.md
├── diagrams
│   ├── material
│   │   ├── AWS-Cloud-Development-Kit_Icon_64_Squid.svg
│   │   ├── Apache_Hive_logo.svg
│   │   ├── Data-Analytics-1.png
│   │   ├── Spring-Boot-REST-API.webp
│   │   ├── amazon-ecr-avis-prix-alternatives-logiciel.webp
│   │   ├── amazon-elastic-container-seeklogo.com.svg
│   │   ├── azure-adls.png
│   │   ├── data-factory-logo.png
│   │   ├── fastapi-logo.png
│   │   ├── grpc-icon-color.png
│   │   ├── icon-aws-amazon-eks.svg
│   │   ├── icon-tableau-1.png
│   │   ├── logo-qlik.png
│   │   ├── power-bi-logo-2.jpg
│   │   ├── powerbi-logo-1.png
│   │   └── rest-api-1.svg
│   ├── snapshots
│   │   ├── Data Platform - 001.png
│   │   ├── Data Platform - 002 - Modern data stack.png
│   │   ├── Data Platform - 003 - Data products.png
│   │   ├── Data Platform - Principles - Data Engineering - 2023-01 - v1.0.png
│   │   ├── Data Platform - Principles - Data Engineering - 2023-03 - v2.0.png
│   │   ├── Data Platform - Principles - Data Engineering - 2023-04 - v2.1.png
│   │   ├── Data Platform - Principles - Data Lake In and Out - 2023-01 - v1.0.png
│   │   ├── Data Platform - Principles - Data Lake In and Out - 2023-04 - v2.0.png
│   │   └── Data Platform - Trends.png
│   └── src
│       ├── Data Platform - Principles - Data Engineering - 2023-01 - v1.0.excalidraw
│       ├── Data Platform - Principles - Data Engineering - 2023-03 - v2.0.excalidraw
│       ├── Data Platform - Principles - Data Engineering - 2023-04 - v2.1.excalidraw
│       ├── Data Platform - Principles - Data Engineering - latest.excalidraw
│       ├── Data Platform - Principles - Data Lake In and Out - 2023-01 - v1.0.excalidraw
│       ├── Data Platform - Principles - Data Lake In and Out - 2023-04 - v2.0.excalidraw
│       ├── Data Platform - Principles - Data Lake In and Out - latest.excalidraw
│       ├── Data Platform - Principles - Data Science - latest.excalidraw
│       └── da-library.excalidrawlib
└── material
    └── README.md
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2023 data-engineering-helpers
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Data platform - Architecture principles
2 | =======================================
3 |
4 | # Overview
5 | [This project](https://github.com/data-engineering-helpers/architecture-principles)
6 | is a collaborative effort to specify architecture principles and diagrams
7 | for a typical data platform built on the so-called Modern Data Stack (MDS).
8 |
9 | Even though the members of the GitHub organization may be employed by
10 | some companies, they speak in a personal capacity and do not represent
11 | these companies.
12 |
13 | ## References
14 | * [Material for the Data platform - Architecture principles](material/)
15 | * Specifications/principles for a
16 | [data engineering pipeline deployment tool](https://github.com/data-engineering-helpers/data-pipeline-deployment)
17 | + [`dppctl`, the Data Processing Pipeline (DPP) CLI utility](https://github.com/data-engineering-helpers/dppctl), a Minimum Viable Product (MVP) in Go
18 | * [Material for the Data platform - Data contracts](https://github.com/data-engineering-helpers/data-contracts/blob/main/README.md)
19 | * [Material for the Data platform - Data quality](https://github.com/data-engineering-helpers/data-quality/blob/main/README.md)
20 | * [Material for the Data platform - Data-lakes, data warehouses, data lake-houses](https://github.com/data-engineering-helpers/data-lakehouse)
21 | * [Material for the Data platform - Modern Data Stack (MDS) in a box](https://github.com/data-engineering-helpers/mds-in-a-box/blob/main/README.md)
22 |
23 | # Diagrams
24 |
25 | ## Data lake - ins and outs
26 | * [Data engineering Excalidraw diagram online - Data platform principles for data lake ins and outs](https://excalidraw.com/#json=mv7jSkpTewcQb_S4raJ5G,S6aAoK8gA3VroJ5ai8Kb6w)
27 |
28 | * [Excalidraw source on GitHub - Data platform principles for data lake ins and outs](diagrams/src/Data%20Platform%20-%20Principles%20-%20Data%20Lake%20In%20and%20Out%20-%20latest.excalidraw)
29 |
30 | 
31 |
32 | ## Data engineering
33 | * [Data engineering Excalidraw diagram online - Data platform principles for Data Engineering](https://excalidraw.com/#json=UPsnozgpMAxRaz3feC23y,n478x5MVcgCz1XTZ7h9qHw)
34 |
35 | * [Excalidraw source on GitHub - Data platform principles for Data Engineering](diagrams/src/Data%20Platform%20-%20Principles%20-%20Data%20Engineering%20-%20latest.excalidraw)
36 |
37 | 
38 |
39 | # Principles
40 |
41 | ## Production vs non-production
42 | In summary, production and non-production environments should be
43 | separated as strictly as possible, ideally by a kind of "Chinese wall":
44 | * Non-production environments should not have access, by design,
45 | to production resources, including production data
46 | * The only allowed exception is the publication of non-sensitive data
47 | by the production environment (_e.g._, Spark processes) to non-production
48 | storage (_e.g._, S3 buckets)
49 | * As data scientists, analysts and engineers have to be able to work
50 | on realistic data sets, the above principles mean that teams must invest
51 | in ways to create non-sensitive data from production data. To do so,
52 | several processes are possible (_e.g._, anonymisation, obfuscation, aggregation,
53 | data generation, simulation). Some specialized companies, such as
54 | [Statice+Anonos](https://www.statice.ai/), help in generating
55 | realistic, non-sensitive data sets.
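As an illustration of two of the processes listed above, the following minimal Python sketch pseudonymises an identifier with a keyed hash and aggregates rows per country. The record layout, the `SECRET_KEY` value and the function names are assumptions made for the example; keyed hashing is pseudonymisation rather than full anonymisation, so a real project would need a proper privacy review.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical secret: in practice it would stay in a production-only
# secret store and never be shared with non-production environments.
SECRET_KEY = b"replace-with-a-production-only-secret"


def pseudonymise(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256), truncated to 16 hex chars."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def to_non_sensitive(records: list[dict]) -> list[dict]:
    """Drop direct identifiers; keep a pseudonym and coarse attributes only."""
    return [
        {
            "customer": pseudonymise(r["email"]),
            "country": r["country"],
            "amount_band": "high" if r["amount"] >= 100 else "low",
        }
        for r in records
    ]


def aggregate_by_country(records: list[dict]) -> dict[str, int]:
    """Aggregation: publish row counts per country instead of individual rows."""
    return dict(Counter(r["country"] for r in records))


# Made-up rows standing in for production data.
production_rows = [
    {"email": "alice@example.com", "country": "FR", "amount": 120.0},
    {"email": "bob@example.com", "country": "FR", "amount": 35.5},
    {"email": "alice@example.com", "country": "DE", "amount": 80.0},
]
published = to_non_sensitive(production_rows)
```

The same customer maps to the same pseudonym, so joins remain possible on the published data without exposing the e-mail address.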
56 |
57 | ## Persistency of data files
58 | Once data files are written to cloud object storage (_e.g._, AWS S3,
59 | Azure ADLS, Google GCS), they must never be overwritten. The data files
60 | are stored persistently and versioned (_e.g._, with [Delta](https://delta.io/)),
61 | and must be kept for as long as legally and
62 | technically possible.
63 |
64 | That principle is the same as the one in the
65 | [Change Data Capture (CDC)](https://en.wikipedia.org/wiki/Change_data_capture)
66 | mechanism:
67 | * Snapshots of a given data set are taken regularly
68 | * Snapshots take the shape of Parquet/Delta data files. Functionally, a snapshot
69 | is similar to a picture taken with a camera: it corresponds to the latest
70 | state of the data set, consistent and instantaneous (there is no history in
71 | a snapshot)
72 | * Snapshots must be versioned. Usually, it is enough to add the time-stamp
73 | of when the snapshot was taken to the file-path/URI of the snapshot data files
74 | * The succession of the snapshots corresponds to the succession of the versions
75 | of the data set
76 | * Snapshot data files must be persistent: they must never be overwritten
77 | * The history may be rebuilt from the succession of the snapshots
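The snapshot rules above (time-stamp in the path, never overwrite) can be sketched in Python as follows. This is only an illustration: the local file system stands in for cloud object storage, and the `snapshot=<time-stamp>` layout, the file name and the function names are assumptions, not a convention defined by this project.

```python
from datetime import datetime, timezone
from pathlib import Path
import tempfile


def snapshot_path(root: Path, dataset: str, taken_at: datetime) -> Path:
    """Build a versioned path by embedding the snapshot time-stamp (UTC) in it."""
    stamp = taken_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return root / dataset / f"snapshot={stamp}" / "data.parquet"


def write_snapshot(path: Path, payload: bytes) -> None:
    """Write a snapshot file once; refuse to overwrite an existing one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "xb" raises FileExistsError if the file already exists.
    with open(path, "xb") as handle:
        handle.write(payload)


def list_versions(root: Path, dataset: str) -> list[str]:
    """Return the snapshot versions of a data set, oldest first."""
    return sorted(p.name for p in (root / dataset).glob("snapshot=*"))


# Usage on a temporary directory, with placeholder bytes instead of real Parquet.
root = Path(tempfile.mkdtemp())
first = snapshot_path(root, "customers", datetime(2023, 1, 15, tzinfo=timezone.utc))
second = snapshot_path(root, "customers", datetime(2023, 4, 1, tzinfo=timezone.utc))
write_snapshot(first, b"placeholder bytes for the January snapshot")
write_snapshot(second, b"placeholder bytes for the April snapshot")
```

Because the time-stamp format sorts lexicographically, listing the snapshot directories in order gives the succession of versions of the data set.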
78 |
79 | The [Delta format](https://delta.io/) applies that principle of keeping persistent
80 | snapshots/versions of a given data set, while sparing users the need to version data sets
81 | and to guard against overwrites themselves. With Delta, one can store and "overwrite" data sets,
82 | while in practice the data files are versioned snapshot data files and
83 | the transaction log is kept along those snapshots/versions so as to allow
84 | rebuilding the history.
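To make the idea concrete, here is a toy Python model of a table where every "overwrite" adds a new immutable version plus a log entry. It is a deliberately simplified sketch, not how Delta is implemented, and the class and method names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyVersionedTable:
    """Toy model only: each write adds an immutable version and a log entry."""

    versions: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def overwrite(self, rows: list) -> int:
        """'Overwrite' the table; earlier versions are left untouched."""
        self.versions.append([dict(r) for r in rows])
        version = len(self.versions) - 1
        self.log.append({"version": version, "operation": "overwrite", "rows": len(rows)})
        return version

    def read(self, version: Optional[int] = None) -> list:
        """Read the latest version by default, or any earlier one."""
        if not self.versions:
            raise ValueError("the table has no version yet")
        index = len(self.versions) - 1 if version is None else version
        return [dict(r) for r in self.versions[index]]


table = ToyVersionedTable()
v0 = table.overwrite([{"id": 1, "status": "new"}])
v1 = table.overwrite([{"id": 1, "status": "shipped"}, {"id": 2, "status": "new"}])
```

Reading `table.read()` returns the latest state, while `table.read(v0)` still returns the first one, which is what allows the history to be rebuilt.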
85 |
86 | ## Format of data files
87 | The format of the structured data files must be Delta wherever possible,
88 | and Parquet only when Delta is not possible. No other data format is allowed
89 | on the data lake for structured data.
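A format policy like this one can be checked automatically. The sketch below, in Python, classifies a local path as Delta (a directory holding a `_delta_log` sub-directory, where Delta keeps its transaction log) or Parquet (a file with the `.parquet` suffix). The function name and the `not-allowed` label are assumptions, and the check is simplified: it uses the local file system and does not handle directories of plain Parquet files.

```python
from pathlib import Path
import tempfile


def storage_format(path: Path) -> str:
    """Classify a path as 'delta', 'parquet' or 'not-allowed' (simplified check)."""
    if path.is_dir() and (path / "_delta_log").is_dir():
        return "delta"
    if path.is_file() and path.suffix == ".parquet":
        return "parquet"
    return "not-allowed"


# Usage on a temporary directory with empty placeholder files.
lake = Path(tempfile.mkdtemp())
(lake / "orders" / "_delta_log").mkdir(parents=True)
(lake / "legacy.parquet").touch()
(lake / "export.csv").touch()

report = {p.name: storage_format(p) for p in sorted(lake.iterdir())}
```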
90 |
91 | ## Data processing, from files to files
92 | Any data processing task:
93 | 1. Makes use of software artifacts (_e.g._, Python wheels,
94 | Scala JARs, dbt SQL artifacts)
95 | 2. Takes, as input, data files, which are, as mentioned above,
96 | persistent and versioned, _i.e._, which will never be overwritten
97 | and which are uniquely identified by their version
98 | 3. Generates, as output, data files, which have to be, as mentioned above,
99 | persistent and versioned, _i.e._, which will never be overwritten
100 | and which are uniquely identified by their version
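The three points above can be sketched as a small Python data model: a task run ties a software artifact to versioned inputs and one versioned output, and planning a run fails if the output version already exists. The bucket name, the `version=` path layout and all class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedDataset:
    name: str
    version: str
    root: str = "s3://example-bucket"  # hypothetical bucket

    @property
    def uri(self) -> str:
        """The version is part of the URI, so it uniquely identifies the files."""
        return f"{self.root}/{self.name}/version={self.version}"


@dataclass(frozen=True)
class TaskRun:
    artifact: str  # e.g. a Python wheel or a Scala JAR
    inputs: tuple
    output: VersionedDataset


def plan_run(artifact: str, inputs: list, output: VersionedDataset, existing_uris: set) -> TaskRun:
    """Plan a file-to-file run; refuse to target an output that already exists."""
    if output.uri in existing_uris:
        raise FileExistsError(f"{output.uri} already exists and must not be overwritten")
    return TaskRun(artifact=artifact, inputs=tuple(inputs), output=output)


raw = VersionedDataset("orders_raw", "2023-04-01")
run = plan_run(
    artifact="orders_pipeline-1.2.0-py3-none-any.whl",  # made-up artifact name
    inputs=[raw],
    output=VersionedDataset("orders_clean", "2023-04-02"),
    existing_uris={raw.uri},
)
```

Recording the artifact and the exact input versions alongside the output makes a run reproducible: the same artifact applied to the same inputs should yield the same output.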
101 |
102 | ## Capitalization on data processing software
103 | We capitalize on the (source code of the) software used to process data,
104 | rather than on the prepared data sets. The software project is instantiated
105 | from a template (_e.g._, with
106 | [Cookiecutter](https://github.com/cookiecutter/cookiecutter)/[Cruft](https://cruft.github.io/cruft/))
107 | and managed through a Git repository.
108 | The Git repository may be audited, including its level of compliance
109 | with the template and its evolutions.
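As a hedged illustration of such an audit, the Python sketch below reads the `.cruft.json` file that Cruft keeps at the root of a generated project and compares the recorded template commit with the latest known one. It assumes that the file exposes `template` and `commit` keys, which should be checked against the Cruft documentation; the function names and the sample values are invented. Cruft also ships a `cruft check` command that serves a similar purpose.

```python
import json
from pathlib import Path
import tempfile


def template_state(project_dir: Path) -> dict:
    """Read the template URL and commit recorded in .cruft.json (assumed keys)."""
    data = json.loads((project_dir / ".cruft.json").read_text(encoding="utf-8"))
    return {"template": data["template"], "commit": data["commit"]}


def is_up_to_date(project_dir: Path, latest_template_commit: str) -> bool:
    """True when the project was last synchronised with the latest template commit."""
    return template_state(project_dir)["commit"] == latest_template_commit


# Usage with a made-up project directory and made-up commit identifiers.
project = Path(tempfile.mkdtemp())
(project / ".cruft.json").write_text(
    json.dumps({"template": "https://example.com/org/template.git", "commit": "abc123"}),
    encoding="utf-8",
)
compliant = is_up_to_date(project, "abc123")
outdated = is_up_to_date(project, "def456")
```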
110 |
111 | ## End-to-end data responsibility
112 | Data domain data engineering teams are responsible for the (quality
113 | and service level agreements of the) delivered data sets.
114 | That responsibility includes checking (and potentially fixing) the quality
115 | of the source data sets. As an illustration,
116 | if [medallion](https://www.databricks.com/glossary/medallion-architecture)
117 | (silver/gold/insight) data sets were manufactured cars,
118 | the responsibility would encompass the quality of every single part
119 | (_e.g._, tires, windshield).
120 | The data engineering teams cannot shift the blame for the quality
121 | of the silver/gold/insight data sets onto the quality of the source data sets:
122 | they have to fix the quality of the source data sets if needed.
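Checking the quality of a source data set can start with very simple rules. The following Python sketch reports missing required values and duplicate keys; the rules, the column names and the function name are assumptions chosen for the example, not checks prescribed by this project.

```python
def check_source(rows: list, required: list, key: str) -> list:
    """Return a list of human-readable quality issues found in the source rows."""
    issues = []
    seen = set()
    for index, row in enumerate(rows):
        # Completeness: every required column must hold a non-empty value.
        for column in required:
            if row.get(column) in (None, ""):
                issues.append(f"row {index}: missing {column}")
        # Uniqueness: the key must not repeat across rows.
        value = row.get(key)
        if value in seen:
            issues.append(f"row {index}: duplicate {key}={value}")
        seen.add(value)
    return issues


source_rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 1, "email": ""},
    {"id": 2, "email": "b@example.com"},
]
issues = check_source(source_rows, required=["email"], key="id")
```

A pipeline could refuse to publish silver/gold/insight data sets while `issues` is not empty, or route the offending rows to a quarantine area to be fixed.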
123 |
124 | ## Data lake
125 | * The purpose of the data lake is to serve as a centralized and scalable repository
126 | for storing data from various sources
127 | * Data sets are materialized as both:
128 | + Data files with an open standard of storage, namely [Delta](https://delta.io/) whenever possible,
129 | or [Parquet](https://parquet.apache.org/) when Delta is not possible
130 | + Tables in databases. These tables are served through a standard and open API,
131 | namely [Hive Metastore](https://cwiki.apache.org/confluence/display/hive/design#Design-Metastore).
132 | [AWS Glue](https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore)
133 | and [GCP Dataproc](https://cloud.google.com/dataproc-metastore/docs/hive-metastore) both implement
134 | the Hive Metastore API. In the documentation, the terms Glue/Dataproc databases and tables are used interchangeably
135 | with Hive Metastore databases and tables. The databases and tables are actually metadata (_i.e._,
136 | data about the data itself, such as table names and descriptions, column names, types and descriptions);
137 | the underlying data are stored in Parquet/Delta, as explained in the point just above
138 | * The modern data lake is structured around the so-called
139 | [medallion architecture](https://www.advancinganalytics.co.uk/blog/medallion-architecture),
140 | representing different levels of data "refinement": Bronze, Silver, Gold and Insight. Each level has its own rules
141 | and conventions that should be applied systematically and this page serves as a reference of these rules
142 | (and should therefore be kept constantly up to date)
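The distinction between table metadata and the underlying data files can be sketched with a few Python data classes. The field names, the database name and the storage location below are hypothetical and do not mirror the Hive Metastore schema; the point is only that a table entry describes the data and points to where the Parquet/Delta files live.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    BRONZE = "bronze"
    SILVER = "silver"
    GOLD = "gold"
    INSIGHT = "insight"


@dataclass(frozen=True)
class Column:
    name: str
    type: str
    description: str = ""


@dataclass(frozen=True)
class TableMetadata:
    database: str
    name: str
    layer: Layer
    location: str  # where the Delta/Parquet data files are stored
    columns: tuple
    description: str = ""

    @property
    def qualified_name(self) -> str:
        return f"{self.database}.{self.name}"


orders = TableMetadata(
    database="sales_silver",
    name="orders",
    layer=Layer.SILVER,
    location="s3://example-bucket/silver/sales/orders",  # made-up location
    columns=(
        Column("order_id", "bigint", "Unique order identifier"),
        Column("amount", "decimal(10,2)"),
    ),
    description="Cleaned orders",
)
```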
143 |
144 |
--------------------------------------------------------------------------------
/diagrams/material/AWS-Cloud-Development-Kit_Icon_64_Squid.svg:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/diagrams/material/Apache_Hive_logo.svg:
--------------------------------------------------------------------------------
1 |
2 |
52 |
--------------------------------------------------------------------------------
/diagrams/material/Data-Analytics-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/Data-Analytics-1.png
--------------------------------------------------------------------------------
/diagrams/material/Spring-Boot-REST-API.webp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/Spring-Boot-REST-API.webp
--------------------------------------------------------------------------------
/diagrams/material/amazon-ecr-avis-prix-alternatives-logiciel.webp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/amazon-ecr-avis-prix-alternatives-logiciel.webp
--------------------------------------------------------------------------------
/diagrams/material/amazon-elastic-container-seeklogo.com.svg:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
--------------------------------------------------------------------------------
/diagrams/material/azure-adls.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/azure-adls.png
--------------------------------------------------------------------------------
/diagrams/material/data-factory-logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/data-factory-logo.png
--------------------------------------------------------------------------------
/diagrams/material/fastapi-logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/fastapi-logo.png
--------------------------------------------------------------------------------
/diagrams/material/grpc-icon-color.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/grpc-icon-color.png
--------------------------------------------------------------------------------
/diagrams/material/icon-aws-amazon-eks.svg:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/diagrams/material/icon-tableau-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/icon-tableau-1.png
--------------------------------------------------------------------------------
/diagrams/material/logo-qlik.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/logo-qlik.png
--------------------------------------------------------------------------------
/diagrams/material/power-bi-logo-2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/power-bi-logo-2.jpg
--------------------------------------------------------------------------------
/diagrams/material/powerbi-logo-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/powerbi-logo-1.png
--------------------------------------------------------------------------------
/diagrams/material/rest-api-1.svg:
--------------------------------------------------------------------------------
1 |
4 |
--------------------------------------------------------------------------------
/diagrams/snapshots/Data Platform - 001.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/snapshots/Data Platform - 001.png
--------------------------------------------------------------------------------
/diagrams/snapshots/Data Platform - 002 - Modern data stack.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/snapshots/Data Platform - 002 - Modern data stack.png
--------------------------------------------------------------------------------
/diagrams/snapshots/Data Platform - 003 - Data products.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/snapshots/Data Platform - 003 - Data products.png
--------------------------------------------------------------------------------
/diagrams/snapshots/Data Platform - Principles - Data Engineering - 2023-01 - v1.0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/snapshots/Data Platform - Principles - Data Engineering - 2023-01 - v1.0.png
--------------------------------------------------------------------------------
/diagrams/snapshots/Data Platform - Principles - Data Engineering - 2023-03 - v2.0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/snapshots/Data Platform - Principles - Data Engineering - 2023-03 - v2.0.png
--------------------------------------------------------------------------------
/diagrams/snapshots/Data Platform - Principles - Data Engineering - 2023-04 - v2.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/snapshots/Data Platform - Principles - Data Engineering - 2023-04 - v2.1.png
--------------------------------------------------------------------------------
/diagrams/snapshots/Data Platform - Principles - Data Lake In and Out - 2023-01 - v1.0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/snapshots/Data Platform - Principles - Data Lake In and Out - 2023-01 - v1.0.png
--------------------------------------------------------------------------------
/diagrams/snapshots/Data Platform - Principles - Data Lake In and Out - 2023-04 - v2.0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/snapshots/Data Platform - Principles - Data Lake In and Out - 2023-04 - v2.0.png
--------------------------------------------------------------------------------
/diagrams/snapshots/Data Platform - Trends.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/snapshots/Data Platform - Trends.png
--------------------------------------------------------------------------------
/material/README.md:
--------------------------------------------------------------------------------
1 | Material for the Data platform Architecture principles
2 | ======================================================
3 |
4 | # Overview
5 | [This page](https://github.com/data-engineering-helpers/architecture-principles/blob/main/material/README.md)
6 | collects some material for the
7 | [Data platform - Architecture principles](https://github.com/data-engineering-helpers/architecture-principles).
8 |
9 | # Articles
10 |
11 | ## Wikipedia
12 |
13 | ### Software engineering / software development
14 | * [Wikipedia - Software development](https://en.wikipedia.org/wiki/Software_development)
15 | * [Wikipedia - Software engineering](https://en.wikipedia.org/wiki/Software_engineering)
16 | * [Wikipedia - Computer programming](https://en.wikipedia.org/wiki/Computer_programming)
17 | * [Wikipedia - Software documentation](https://en.wikipedia.org/wiki/Software_documentation)
18 | * [Wikipedia - Software testing](https://en.wikipedia.org/wiki/Software_testing)
19 | * [Wikipedia - Software bugs](https://en.wikipedia.org/wiki/Software_bugs)
20 | * [Wikipedia - Software framework](https://en.wikipedia.org/wiki/Software_framework)
21 | * [Wikipedia - Application software](https://en.wikipedia.org/wiki/Application_software)
22 | * [Wikipedia - Software development process](https://en.wikipedia.org/wiki/Software_development_process)
23 |
24 | ### Data engineering
25 | * [Wikipedia - Data engineering](https://en.wikipedia.org/wiki/Data_engineering)
26 | * [Wikipedia - Data processing](https://en.wikipedia.org/wiki/Data_processing)
27 | * [Wikipedia - Data cleaning](https://en.wikipedia.org/wiki/Data_cleaning)
28 | * [Wikipedia - Data analytics](https://en.wikipedia.org/wiki/Data_analytics)
29 | * [Wikipedia - Data science](https://en.wikipedia.org/wiki/Data_science)
30 |
31 | ## The 13 software engineering laws
32 | * Title: The 13 software engineering laws
33 | * Date: May 2025
34 | * Author: Anton Zaides
35 | ([Anton Zaides on LinkedIn](https://www.linkedin.com/in/anton-zaides/),
36 | [Anton Zaides on Substack](https://substack.com/@antonzaides))
37 | * Link to the article on Substack:
38 | https://newsletter.manager.dev/p/the-13-software-engineering-laws
39 |
40 | ## Medallion Architecture in a Data Product World
41 | * Title: Medallion Architecture in a Data Product World
42 | * Date: Apr. 2025
43 | * Author: Elliott Cordo
44 | ([Elliott Cordo on LinkedIn](https://www.linkedin.com/in/elliott-cordo/),
45 | [Elliott Cordo on Medium](https://medium.com/@datafutures))
46 | * Link to the article on Medium:
47 | https://medium.com/datafutures/medallion-architecture-in-a-data-product-world-3758d17b6cf6
48 |
49 | ## Data Products: A Case Against Medallion Architecture
50 | * Title: Data Products: A Case Against Medallion Architecture
51 | * Date: Feb. 2025
52 | * Authors:
53 | * Animesh Kumar
54 | * Shubhanshu Jain
55 | * Samadrita Ghosh
56 | * Link to the article on Medium:
57 | https://medium.com/@community_md101/data-products-a-case-against-medallion-architecture-139096ceea08
58 |
59 | ## 5 Data Engineering mistakes
60 | * Title: 5 Data Engineering mistakes, and what to do about them
61 | * Date: July 2024
62 | * Author: Daniel Beach
63 | ([Daniel Beach on LinkedIn](https://www.linkedin.com/in/daniel-beach-6ab8b4132/))
64 | * Link to the article on Substack:
65 | https://dataengineeringcentral.substack.com/p/5-data-engineering-mistakes
66 | * The 5 common mistakes:
67 | * Not embracing simple architecture and design
68 | * Not having a good local development environment
69 | * Not having a good orchestration and dependency management tool
70 | * Not testing code and pipelines before release
71 | * Not doing something hard
72 |
73 | ## The Rise of the Data Platform Engineer
74 | * Title: The Rise of the Data Platform Engineer
75 | * Date: June 2024
76 | * Author: Pedram Navid
77 | ([Pedram Navid on LinkedIn](https://www.linkedin.com/in/pedramnavid/))
78 | * Link to the article:
79 | https://databased.pedramnavid.com/p/the-rise-of-the-data-platform-engineer
80 |
81 | ## Data Council 2024: The future data stack is composable
82 | * Title: Data Council 2024: The future data stack is composable, and other hot takes
83 | * Date: April 2024
84 | * Author: Chase Roberts
85 | ([Chase Roberts on LinkedIn](https://www.linkedin.com/in/chasecroberts/),
86 | [Chase Roberts on Medium](https://chsrbrts.medium.com/))
87 | * Link to the article:
88 | https://medium.com/vvus/data-council-2024-the-future-data-stack-is-composable-and-other-hot-takes-b6c5f2429e22
89 |
90 | ## Data Lakehouses, Post-Modern Data Stacks and Enabling Gen AI
91 | * Title: Data Lakehouses, Post-Modern Data Stacks and Enabling Gen AI: The Rittman Analytics Guide to Modernising Data Analytics in 2024
92 | * Date: April 2024
93 | * Author: Mark Rittman
94 | ([Mark Rittman on LinkedIn](https://www.linkedin.com/in/markrittman/),
95 | [Mark Rittman on Medium](https://markrittman.medium.com/))
96 | * Link to the article:
97 | https://blog.rittmananalytics.com/data-lakehouses-post-modern-data-stacks-and-enabling-gen-ai-the-rittman-analytics-guide-to-b102027b8cf8
98 |
99 | ## Why data teams are adopting declarative stateful pipelines
100 | * Title: Why Data Teams Are Adopting Declarative (Stateful) Pipelines
101 | * Date: November 2023
102 | * Author: Iaroslav Zeigerman
103 | * Link to the article:
104 | https://tobikodata.com/why-data-teams-are-adopting-declarative-pipelines.html
105 | * Publisher: tobiko blog
106 |
107 | ## The next big step forwards for analytics engineering
108 | * Title: The next big step forwards for analytics engineering
109 | * Date: April 2023
110 | * Author: Tristan Handy
111 | ([Tristan Handy on LinkedIn](https://www.linkedin.com/in/tristanhandy/),
112 | [Tristan Handy on dbt's web site](https://www.getdbt.com/author/tristan-handy/))
113 | * Link to the article:
114 | https://www.getdbt.com/blog/analytics-engineering-next-step-forwards/
115 | * Publisher: dbt Labs
116 |
117 | ## Data Materialization is a Convergence Problem
118 | * Title: Data Materialization is a Convergence Problem
119 | * Date: April 2023
120 | * Author: Alex Rasmussen
121 | ([Alex Rasmussen on LinkedIn](https://www.linkedin.com/in/alexras/),
122 | [Alex Rasmussen on Substack](https://substack.com/profile/4434805-alex-rasmussen))
123 | * Link to the article:
124 | https://stkbailey.substack.com/p/data-materialization-is-a-convergence
125 | * Publisher: Substack
126 |
127 | ## Principles - Alex Ewerlöf's guiding principles
128 | * Title: My guiding principles after 20 years of programming
129 | * Date: January 2020
130 | * Author: Alex Ewerlöf
131 | ([Alex Ewerlöf on LinkedIn](https://www.linkedin.com/in/alexewerlof/),
132 | [Alex Ewerlöf on Substack](https://substack.com/profile/87732486-alex-ewerlof))
133 | * Link to the article:
134 | https://alexewerlof.medium.com/my-guiding-principles-after-20-years-of-programming-a087dc55596c
135 | * Publisher: Medium
136 |
137 | ## Zalando's Engineering principles
138 | * Title: Zalando's Engineering principles
139 | * GitHub repository: https://github.com/zalando/engineering-principles
140 | * Overview: in March 2015, we have adopted this set of principles for tech and architecture:
141 | * Microservices
142 | * API First
143 | * REST
144 | * Cloud
145 | * Software as a Service (SaaS)
146 |
147 | ## Design docs
148 |
149 | ### Writing design docs for data pipelines
150 | * Title: Writing design docs for data pipelines
151 | * Date: May 2023
152 | * Author: Mahdi Karabiben
153 | ([Mahdi Karabiben on LinkedIn](https://www.linkedin.com/in/mahdikarabiben/),
154 | [Mahdi Karabiben on Medium](https://mahdiqb.medium.com/about))
155 | * Link to the article:
156 | https://towardsdatascience.com/writing-design-docs-for-data-pipelines-d49550f95580
157 |
158 | ### Design docs at Google
159 | * Link to the presentation page:
160 | https://www.industrialempathy.com/posts/design-docs-at-google/
161 |
162 | ## Principles - Functional Data Engineering
163 |
164 | ### Functional Data Engineering — a modern paradigm for batch data processing
165 | * Title: Functional Data Engineering — a modern paradigm for batch data processing
166 | * Author: Maxime Beauchemin
167 | * Date: January 2018
168 | * Link to the article:
169 | https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a
170 | * Publisher: Medium
171 |
172 | ### Transform Your Data Processing with Functional Data Engineering
173 | * Author: Raphaël Mansuy (https://www.linkedin.com/in/raphaelmansuy/)
174 | * Date: January 2023
175 | * On LinkedIn feed:
176 | https://www.linkedin.com/posts/raphaelmansuy_dataengineering-functionalprogramming-activity-7025661748233797632-TQqZ/
177 |
178 | ### Functional data engineering - a blueprint
179 | * Author: Ananth Packkildurai (https://www.linkedin.com/in/ananthdurai/)
180 | * Date: December 2022
181 | * Link to the article:
182 | https://www.dataengineeringweekly.com/p/functional-data-engineering-a-blueprint
183 | * Publisher: Substack
184 |
185 | ### Why developers are falling in love with functional programming
186 | * Author: Ari Joury
187 | * Date: August 2020
188 | * Link to the article:
189 | https://towardsdatascience.com/why-developers-are-falling-in-love-with-functional-programming-13514df4048e
190 | * Publisher: Medium
191 |
192 | ### Data centric manifesto
193 | * Homepage: http://www.datacentricmanifesto.org/
194 | * Principles: http://www.datacentricmanifesto.org/principles/
195 |
196 | ## Principles - Orchestration
197 |
198 | ### Don't use Apache Airflow in that way
199 | * Title: Don’t Use Apache Airflow in That Way
200 | * Date: May 2023
201 | * Author: Ansam Yousry
202 | ([Ansam Yousry on LinkedIn](https://www.linkedin.com/in/ansam-yousry-34b32b116/),
203 | [Ansam Yousry on Medium](https://medium.com/@ansam.yousry))
204 | * Link to the article:
205 | https://medium.com/illumination/what-apache-airflow-is-not-e9dc9722500b
206 | * Publisher: Medium
207 | * Summary: Airflow is not a data streaming or processing tool, but it is a good orchestrator
208 | for managing data pipelines. Airflow integrates well with specialized data tools, allowing building
209 | complete and scalable data pipeline solutions
210 |
211 | ## Principles - Documentation
212 |
213 | ### Atlassian - The importance of documentation
214 | * Title: The importance of documentation (because it’s way more than a formality)
215 | * Link to the page:
216 | https://www.atlassian.com/work-management/knowledge-sharing/documentation/importance-of-documentation
217 | * Publisher: Atlassian
218 | * Documentation should be your best friend:
219 | + A single source of truth saves time and energy
220 | + Documentation is essential to quality and process control
221 | + Documentation cuts down duplicative work
222 | + It makes hiring and onboarding so much easier
223 | + A single source of truth makes everyone smarter
224 |
225 | ### Docs as Code
226 | * Title: Docs as Code
227 | * Author: [Eric Holscher](https://ericholscher.com/)
228 | * Link to the page:
229 | https://www.writethedocs.org/guide/docs-as-code/
230 | * Publisher: [Write the Docs](https://www.writethedocs.org/)
231 | * Documentation as Code (Docs as Code) refers to a philosophy that you should be writing documentation
232 | with the same tools as code:
233 | + Issue Trackers
234 | + Version Control (Git)
235 | + Plain Text Markup (Markdown, reStructuredText, Asciidoc)
236 | + Code Reviews
237 | + Automated Tests
238 | * This means following the same workflows as development teams, and being integrated in the product team.
239 | It enables a culture where writers and developers both feel ownership of documentation, and work together
240 | to make it as good as possible.
241 |
242 | ### Help engineers become better writers
243 | * Title: Using Vale to help engineers become better writers
244 | * Date: April 2023
245 | * Author: [François Violette](https://www.linkedin.com/in/francoisviolette/)
246 | * Link to the article:
247 | https://engineering.contentsquare.com/2023/using-vale-to-help-engineers-become-better-writers/
248 | * Publisher: Contentsquare
249 |
250 | ### How we manage documentation at Funding Circle for our Data Platform
251 | * Title: How we manage documentation at Funding Circle for our Data Platform
252 | * Date: April 2023
253 | * Author: Nikolajs Skrjabins (
254 | [Nikolajs Skrjabins on LinkedIn](https://www.linkedin.com/in/nikolajs-skrjabins/),
255 | [Nikolajs Skrjabins on Medium](https://nikolajs-skrjabins.medium.com/))
256 | * Link to the article:
257 | https://medium.com/funding-circle/how-we-manage-documentation-at-funding-circle-for-our-data-platform-960a422b9b2e
258 | * Publisher: Medium
259 |
260 | ## Principles - Infrastructure as code
261 |
262 | ### SST design principles
263 | * SST home page: https://sst.dev
264 | * Page covering the design principles: https://docs.sst.dev/design-principles
265 | + Zero config
266 | + Progressive disclosure
267 | + Attaching permissions
268 | + Having an escape hatch
269 |
270 | ### We Need a Data Engineering-Specific Language
271 | * Title: We Need a Data Engineering-Specific Language
272 | * Date: February 2024
273 | * Author: Julien Hurault
274 | ([Julien Hurault on LinkedIn](https://www.linkedin.com/in/julienhuraultanalytics/))
275 | * Link to the article:
276 | https://juhache.substack.com/p/we-need-a-data-engineering-specific
277 | * Publisher: Substack
278 |
279 | # Books
280 |
281 | ## Ship It!
282 | * Link on Amazon: https://www.amazon.com/Ship-Practical-Successful-Pragmatic-Programmers-ebook/dp/B01F2OUIY6
283 | * Authors: Jared Richardson, William A. Gwaltney
284 | * Date: 21 June 2005
285 | * Publisher: Pragmatic Bookshelf; 1st edition
286 | * ISBN-10: 0974514047 / ISBN-13: 978-0974514048
287 |
288 | ## Open collection of architecture books
289 | * A few software architecture books, according to Thilina Ashen Gamage in 2020:
290 | https://medium.com/@ThilinaAshenGamage/the-best-software-architecture-books-of-all-time-b82b63bb853b
291 |
292 | # Blogs / websites
293 |
294 | ## Martin Fowler
295 | * [Martin Fowler on Wikipedia](https://en.wikipedia.org/wiki/Martin_Fowler_(software_engineer))
296 | * [Martin Fowler profile page on his blog](http://martinfowler.com/aboutMe.html)
297 | * [Martin Fowler on ThoughtWorks](https://www.thoughtworks.com/profiles/leaders/martin-fowler)
298 | * Main blog: http://martinfowler.com
299 | * Data management section: https://martinfowler.com/data/
300 |
301 | ## Zhamak Dehghani
302 | * [Zhamak Dehghani on LinkedIn](https://www.linkedin.com/in/zhamak-dehghani/)
303 | * [Zhamak Dehghani on Twitter](https://twitter.com/zhamakd)
304 | * Zhamak introduced the concept of
305 | [Data Mesh](https://en.wikipedia.org/wiki/Data_mesh)
306 | in [a famous article on Martin Fowler's website](https://martinfowler.com/articles/data-monolith-to-mesh.html)
307 | in May 2019
308 |
309 | ## Joel Spolsky
310 | * [Joel Spolsky on Wikipedia](https://en.wikipedia.org/wiki/Joel_Spolsky)
311 | * Main blog: https://www.joelonsoftware.com/
312 |
313 | ## GitHub -> data-engineer-handbook
314 | Repository gathering resources for building data engineering skills
315 | * GitHub repo: https://github.com/DataEngineer-io/data-engineer-handbook
316 |
317 | # Frameworks
318 |
319 | ## Twelve-Factor (12factor) Application
320 | * Homepage: https://12factor.net/
321 |
322 | ## Clean architecture
323 | One of the main principles of clean architecture is to keep your options open: dependencies can be
324 | changed, added, or removed as often as needed.
325 | A good demo project can be found on
326 | [GitHub by Mattia Battiston](https://github.com/mattia-battiston/clean-architecture-example).
327 |
328 | ### Articles
329 | * Articles:
330 | + Aug. 2023:
331 | [Implementing clean architecture solutions: A practical example](https://developers.redhat.com/articles/2023/08/08/implementing-clean-architecture-solutions-practical-example)
332 | + Apr. 2023:
333 | [My advice for building maintainable, clean architecture](https://developers.redhat.com/articles/2023/04/17/my-advice-building-maintainable-clean-architecture)
334 | + Apr. 2023:
335 | [My advice for transitioning to a clean architecture platform](https://developers.redhat.com/articles/2023/04/17/my-advice-transitioning-clean-architecture-platform)
336 | * Authors:
337 | + Maarten Vandeperre
338 | ([Maarten Vandeperre on LinkedIn](https://www.linkedin.com/in/maarten-vandeperre-8780743b/),
339 | [Maarten Vandeperre on Red Hat Developer](https://developers.redhat.com/author/maarten-vandeperre))
340 | + [Kevin Dubois](https://developers.redhat.com/author/kevin-dubois)
341 | * Publisher: Red Hat Developer
342 |
343 | ## Backstage
344 | * GitHub page for Backstage: https://github.com/backstage/backstage
345 | + Getting started: https://backstage.io/docs/getting-started/
346 | * Backstage TechDocs: https://backstage.io/docs/features/techdocs/
347 | * Backstage software templates: https://backstage.io/docs/features/software-templates/
348 | * Backstage software catalog: https://backstage.io/docs/features/software-catalog/
349 |
350 | ## Cookiecutter
351 | * Cookiecutter home page: https://cookiecutter.readthedocs.io/en/stable/
352 | + GitHub page: https://github.com/cookiecutter/cookiecutter
353 | * Cruft, bringing Git-based workflows on top of Cookiecutter:
354 | + Cruft home page: https://cruft.github.io/cruft/
355 | + GitHub page: https://github.com/cruft/cruft/
356 |
357 | ## docToolchain
358 | * Homepage: https://doctoolchain.org/docToolchain
359 |
360 | # Practices around architecture
361 |
362 | ## OpenLineage
363 | OpenLineage feature proposals. For instance,
364 | https://github.com/OpenLineage/OpenLineage/blob/main/proposals/336/PROPOSALS.md
365 |
366 |
367 |
--------------------------------------------------------------------------------