├── LICENSE ├── README.md ├── diagrams ├── material │ ├── AWS-Cloud-Development-Kit_Icon_64_Squid.svg │ ├── Apache_Hive_logo.svg │ ├── Data-Analytics-1.png │ ├── Spring-Boot-REST-API.webp │ ├── amazon-ecr-avis-prix-alternatives-logiciel.webp │ ├── amazon-elastic-container-seeklogo.com.svg │ ├── azure-adls.png │ ├── data-factory-logo.png │ ├── fastapi-logo.png │ ├── grpc-icon-color.png │ ├── icon-aws-amazon-eks.svg │ ├── icon-tableau-1.png │ ├── logo-qlik.png │ ├── power-bi-logo-2.jpg │ ├── powerbi-logo-1.png │ └── rest-api-1.svg ├── snapshots │ ├── Data Platform - 001.png │ ├── Data Platform - 002 - Modern data stack.png │ ├── Data Platform - 003 - Data products.png │ ├── Data Platform - Principles - Data Engineering - 2023-01 - v1.0.png │ ├── Data Platform - Principles - Data Engineering - 2023-03 - v2.0.png │ ├── Data Platform - Principles - Data Engineering - 2023-04 - v2.1.png │ ├── Data Platform - Principles - Data Lake In and Out - 2023-01 - v1.0.png │ ├── Data Platform - Principles - Data Lake In and Out - 2023-04 - v2.0.png │ └── Data Platform - Trends.png └── src │ ├── Data Platform - Principles - Data Engineering - 2023-01 - v1.0.excalidraw │ ├── Data Platform - Principles - Data Engineering - 2023-03 - v2.0.excalidraw │ ├── Data Platform - Principles - Data Engineering - 2023-04 - v2.1.excalidraw │ ├── Data Platform - Principles - Data Engineering - latest.excalidraw │ ├── Data Platform - Principles - Data Lake In and Out - 2023-01 - v1.0.excalidraw │ ├── Data Platform - Principles - Data Lake In and Out - 2023-04 - v2.0.excalidraw │ ├── Data Platform - Principles - Data Lake In and Out - latest.excalidraw │ ├── Data Platform - Principles - Data Science - latest.excalidraw │ └── da-library.excalidrawlib └── material └── README.md /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 data-engineering-helpers 4 | 5 | Permission is hereby granted, free of charge, to any person 
obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Data platform - Architecture principles 2 | ======================================= 3 | 4 | # Overview 5 | [This project](https://github.com/data-engineering-helpers/architecture-principles) 6 | intends to collaborate on specifying architecture principles and diagrams 7 | for a typical data platform with the so-called Modern Data Stack (MDS). 8 | 9 | Even though the members of the GitHub organization may be employed by 10 | some companies, they speak on their personal behalf and do not represent 11 | these companies. 
12 | 13 | ## References 14 | * [Material for the Data platform - Architecture principles](material/) 15 | * Specifications/principles for a 16 | [data engineering pipeline deployment tool](https://github.com/data-engineering-helpers/data-pipeline-deployment) 17 | + [`dppctl`, the Data Processing Pipeline (DPP) CLI utility](https://github.com/data-engineering-helpers/dppctl), a Minimum Viable Product (MVP) in Go 18 | * [Material for the Data platform - Data contracts](https://github.com/data-engineering-helpers/data-contracts/blob/main/README.md) 19 | * [Material for the Data platform - Data quality](https://github.com/data-engineering-helpers/data-quality/blob/main/README.md) 20 | * [Material for the Data platform - Data-lakes, data warehouses, data lake-houses](https://github.com/data-engineering-helpers/data-lakehouse) 21 | * [Material for the Data platform - Modern Data Stack (MDS) in a box](https://github.com/data-engineering-helpers/mds-in-a-box/blob/main/README.md) 22 | 23 | # Diagrams 24 | 25 | ## Data lake - ins and outs 26 | * [Data engineering Excalidraw diagram online - Data platform principles for data lake ins and outs](https://excalidraw.com/#json=mv7jSkpTewcQb_S4raJ5G,S6aAoK8gA3VroJ5ai8Kb6w) 27 | 28 | * [Excalidraw source on GitHub - Data platform principles for data lake ins and outs](diagrams/src/Data%20Platform%20-%20Principles%20-%20Data%20Lake%20In%20and%20Out%20-%20latest.excalidraw) 29 | 30 | ![Data platform principles for data lake ins and outs](diagrams/snapshots/Data%20Platform%20-%20Principles%20-%20Data%20Lake%20In%20and%20Out%20-%202023-04%20-%20v2.0.png) 31 | 32 | ## Data engineering 33 | * [Data engineering Excalidraw diagram online - Data platform principles for Data Engineering](https://excalidraw.com/#json=UPsnozgpMAxRaz3feC23y,n478x5MVcgCz1XTZ7h9qHw) 34 | 35 | * [Excalidraw source on GitHub - Data platform principles for Data
Engineering](diagrams/src/Data%20Platform%20-%20Principles%20-%20Data%20Engineering%20-%20latest.excalidraw) 36 | 37 | ![Data Platform - Principles - Data Engineering](diagrams/snapshots/Data%20Platform%20-%20Principles%20-%20Data%20Engineering%20-%202023-04%20-%20v2.1.png) 38 | 39 | # Principles 40 | 41 | ## Production vs non-production 42 | In summary, production and non-production environments should be 43 | as separate as possible, ideally isolated by a kind of "Chinese wall": 44 | * By design, non-production environments should not have access 45 | to production resources, including production data 46 | * The only allowed cross-environment task is the publication of non-sensitive data 47 | by production environments (_e.g._, Spark processes) to non-production 48 | storage (S3 buckets) 49 | * As data scientists, analysts and engineers have to be able to work 50 | on realistic data sets, the above principles mean that teams must invest 51 | in ways to create non-sensitive data from production data. To that end, 52 | several processes are possible (_e.g._, anonymisation, obfuscation, aggregation, 53 | data generation, simulation). Some specialized companies, such as 54 | [Statice+Anonos](https://www.statice.ai/), help generate 55 | realistic non-sensitive data sets. 56 | 57 | ## Persistency of data files 58 | Once the data files are written to S3, they must never be overwritten. 59 | The data files are stored in S3 in a persistent way, versioned (_e.g._, 60 | with [Delta](https://delta.io/)), and must be kept on cloud object storage 61 | (_e.g._, AWS S3, Azure ADLS, Google GCS) as long as legally and 62 | technically possible. 63 | 64 | This principle is the same as the one behind the 65 | [Change Data Capture (CDC)](https://en.wikipedia.org/wiki/Change_data_capture) 66 | mechanism: 67 | * Snapshots of a given data set are taken at regular intervals 68 | * Snapshots take the shape of Parquet/Delta data files.
Functionally, a snapshot 69 | is similar to a photograph: it captures the state of the data set 70 | at the time it is taken, consistent and instantaneous (there is no history in 71 | a snapshot) 72 | * Snapshots must be versioned. Usually, it is enough to add the timestamp 73 | of when the snapshot was taken to the file path/URI of the snapshot data files 74 | * The succession of the snapshots corresponds to the succession of the versions 75 | of the data set 76 | * Snapshot data files must be persistent: they must never be overwritten 77 | * The history may be rebuilt from the succession of the snapshots 78 | 79 | The [Delta format](https://delta.io/) applies that principle of keeping persistent 80 | snapshots/versions of a given data set, while abstracting away the need to version 81 | data sets explicitly and to avoid overwriting them. With Delta, one can store and "overwrite" data sets, 82 | while in practice the data files are actually versioned snapshot data files and 83 | the transaction log is kept alongside those snapshots/versions so that 84 | the history can be rebuilt. 85 | 86 | ## Format of data files 87 | The format of the structured data files must be Delta wherever possible, 88 | and Parquet only when Delta is not possible. No other data format is allowed 89 | on the data lake for structured data. 90 | 91 | ## Data processing, from files to files 92 | Any data processing task: 93 | 1. Makes use of software artifacts (_e.g._, Python wheels, 94 | Scala JARs, dbt SQL artifacts) 95 | 2. Takes, as input, data files, which are, as mentioned above, 96 | persistent and versioned, _i.e._, which will never be overwritten 97 | and which are uniquely identified by their version 98 | 3.
Generates, as output, data files, which have to be, as mentioned above, 99 | persistent and versioned, _i.e._, which will never be overwritten 100 | and which are uniquely identified by their version 101 | 102 | ## Capitalization on data processing software 103 | We capitalize on the (source code of the) software used to process data, 104 | rather than on the prepared data sets. The software project is instantiated 105 | from a template (_e.g._, with 106 | [Cookiecutter](https://github.com/cookiecutter/cookiecutter)/[Cruft](https://cruft.github.io/cruft/)) 107 | and managed through a Git repository. 108 | The Git repository may be audited, including its level of compliance 109 | with the (evolutions of the) template. 110 | 111 | ## End-to-end data responsibility 112 | The data engineering teams of each data domain are responsible for the (quality 113 | and service level agreements of the) delivered data sets. 114 | That responsibility includes checking (and potentially fixing) the quality 115 | of the source data sets. As an illustration, 116 | if [medallion](https://www.databricks.com/glossary/medallion-architecture) 117 | (silver/gold/insight) data sets were manufactured cars, 118 | the responsibility would encompass the quality of every single part 119 | (_e.g._, tires, windshield). 120 | The data engineering teams cannot shift their responsibility for the quality 121 | of the silver/gold/insight data sets onto the quality of the source data sets: 122 | they have to fix the quality of the source data sets if needed. 123 | 124 | ## Data lake 125 | * The purpose of the data lake is to serve as a centralized and scalable repository 126 | for storing data from various sources 127 | * Data sets are materialized as both: 128 | + Data files with an open standard of storage, namely [Delta](https://delta.io/) whenever possible, 129 | or [Parquet](https://parquet.apache.org/) when Delta is not possible 130 | + Tables in databases.
These tables are served through a standard and open API, 131 | namely [Hive Metastore](https://cwiki.apache.org/confluence/display/hive/design#Design-Metastore). 132 | [AWS Glue](https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore) 133 | and [GCP Dataproc](https://cloud.google.com/dataproc-metastore/docs/hive-metastore) both implement 134 | the Hive Metastore API. In the documentation, Glue/Dataproc databases and tables may be used interchangeably 135 | with Hive Metastore databases and tables. The databases and tables are actually metadata (_i.e._, 136 | data about the data itself, such as table names and descriptions, column names, types and descriptions); 137 | the underlying data are stored in Parquet/Delta, as explained in the point just above 138 | * The modern data lake is structured around the so-called 139 | [medallion architecture](https://www.advancinganalytics.co.uk/blog/medallion-architecture), 140 | representing different levels of data "refinement": Bronze, Silver, Gold and Insight. Each level has its own rules 141 | and conventions, which should be applied systematically; this page serves as the reference for these rules 142 | (and should therefore be kept up to date) 143 | 144 | -------------------------------------------------------------------------------- /diagrams/material/AWS-Cloud-Development-Kit_Icon_64_Squid.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | AWS-Cloud-Development-Kit_64_Squid 5 | Created with Sketch.
-------------------------------------------------------------------------------- /diagrams/material/Apache_Hive_logo.svg: -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- /diagrams/material/Data-Analytics-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/Data-Analytics-1.png -------------------------------------------------------------------------------- /diagrams/material/Spring-Boot-REST-API.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/Spring-Boot-REST-API.webp -------------------------------------------------------------------------------- /diagrams/material/amazon-ecr-avis-prix-alternatives-logiciel.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/amazon-ecr-avis-prix-alternatives-logiciel.webp -------------------------------------------------------------------------------- /diagrams/material/amazon-elastic-container-seeklogo.com.svg: -------------------------------------------------------------------------------- 4 | image/svg+xml -------------------------------------------------------------------------------- /diagrams/material/azure-adls.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/azure-adls.png -------------------------------------------------------------------------------- /diagrams/material/data-factory-logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/data-factory-logo.png -------------------------------------------------------------------------------- /diagrams/material/fastapi-logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/fastapi-logo.png -------------------------------------------------------------------------------- /diagrams/material/grpc-icon-color.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/grpc-icon-color.png -------------------------------------------------------------------------------- /diagrams/material/icon-aws-amazon-eks.svg: -------------------------------------------------------------------------------- 1 |
icon-aws-amazon-eks -------------------------------------------------------------------------------- /diagrams/material/icon-tableau-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/icon-tableau-1.png -------------------------------------------------------------------------------- /diagrams/material/logo-qlik.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/logo-qlik.png -------------------------------------------------------------------------------- /diagrams/material/power-bi-logo-2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/power-bi-logo-2.jpg -------------------------------------------------------------------------------- /diagrams/material/powerbi-logo-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/material/powerbi-logo-1.png -------------------------------------------------------------------------------- /diagrams/material/rest-api-1.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | -------------------------------------------------------------------------------- /diagrams/snapshots/Data Platform - 001.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/snapshots/Data Platform - 001.png -------------------------------------------------------------------------------- /diagrams/snapshots/Data Platform - 002 - Modern data stack.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/snapshots/Data Platform - 002 - Modern data stack.png -------------------------------------------------------------------------------- /diagrams/snapshots/Data Platform - 003 - Data products.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/snapshots/Data Platform - 003 - Data products.png -------------------------------------------------------------------------------- /diagrams/snapshots/Data Platform - Principles - Data Engineering - 2023-01 - v1.0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/snapshots/Data Platform - Principles - Data Engineering - 2023-01 - v1.0.png -------------------------------------------------------------------------------- /diagrams/snapshots/Data Platform - Principles - Data Engineering - 2023-03 - v2.0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/snapshots/Data Platform - Principles - Data Engineering - 2023-03 - v2.0.png -------------------------------------------------------------------------------- 
/diagrams/snapshots/Data Platform - Principles - Data Engineering - 2023-04 - v2.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/snapshots/Data Platform - Principles - Data Engineering - 2023-04 - v2.1.png -------------------------------------------------------------------------------- /diagrams/snapshots/Data Platform - Principles - Data Lake In and Out - 2023-01 - v1.0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/snapshots/Data Platform - Principles - Data Lake In and Out - 2023-01 - v1.0.png -------------------------------------------------------------------------------- /diagrams/snapshots/Data Platform - Principles - Data Lake In and Out - 2023-04 - v2.0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/snapshots/Data Platform - Principles - Data Lake In and Out - 2023-04 - v2.0.png -------------------------------------------------------------------------------- /diagrams/snapshots/Data Platform - Trends.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/data-engineering-helpers/architecture-principles/bedf3e8bdb99759c15f8cfeb9e66767fed34e97a/diagrams/snapshots/Data Platform - Trends.png -------------------------------------------------------------------------------- /material/README.md: -------------------------------------------------------------------------------- 1 | Material for the Data platform Architecture principles 2 | ====================================================== 3 | 
4 | # Overview 5 | [That page](https://github.com/data-engineering-helpers/architecture-principles/blob/main/material/README.md) 6 | collects some material for the 7 | [Data platform - Architecture principles](https://github.com/data-engineering-helpers/architecture-principles). 8 | 9 | # Articles 10 | 11 | ## Wikipedia 12 | 13 | ### Software engineering / software development 14 | * [Wikipedia - Software development](https://en.wikipedia.org/wiki/Software_development) 15 | * [Wikipedia - Software engineering](https://en.wikipedia.org/wiki/Software_engineering) 16 | * [Wikipedia - Computer programming](https://en.wikipedia.org/wiki/Computer_programming) 17 | * [Wikipedia - Software documentation](https://en.wikipedia.org/wiki/Software_documentation) 18 | * [Wikipedia - Software testing](https://en.wikipedia.org/wiki/Software_testing) 19 | * [Wikipedia - Software bugs](https://en.wikipedia.org/wiki/Software_bugs) 20 | * [Wikipedia - Software framework](https://en.wikipedia.org/wiki/Software_framework) 21 | * [Wikipedia - Application software](https://en.wikipedia.org/wiki/Application_software) 22 | * [Wikipedia - Software development process](https://en.wikipedia.org/wiki/Software_development_process) 23 | 24 | ### Data engineering 25 | * [Wikipedia - Data engineering](https://en.wikipedia.org/wiki/Data_engineering) 26 | * [Wikipedia - Data processing](https://en.wikipedia.org/wiki/Data_processing) 27 | * [Wikipedia - Data cleaning](https://en.wikipedia.org/wiki/Data_cleaning) 28 | * [Wikipedia - Data analytics](https://en.wikipedia.org/wiki/Data_analytics) 29 | * [Wikipedia - Data science](https://en.wikipedia.org/wiki/Data_science) 30 | 31 | ## The 13 software engineering laws 32 | * Title: The 13 software engineering laws 33 | * Date: May 2025 34 | * Author: Anton Zaides 35 | ([Anton Zaides on LinkedIn](https://www.linkedin.com/in/anton-zaides/), 36 | [Anton Zaides on Substack](https://substack.com/@antonzaides)) 37 | * Link to the article on Substack: 38 | 
https://newsletter.manager.dev/p/the-13-software-engineering-laws 39 | 40 | ## Medallion Architecture in a Data Product World 41 | * Title: Medallion Architecture in a Data Product World 42 | * Date: April 2025 43 | * Author: Elliott Cordo 44 | ([Elliott Cordo on LinkedIn](https://www.linkedin.com/in/elliott-cordo/), 45 | [Elliott Cordo on Medium](https://medium.com/@datafutures)) 46 | * Link to the article on Medium: 47 | https://medium.com/datafutures/medallion-architecture-in-a-data-product-world-3758d17b6cf6 48 | 49 | ## Data Products: A Case Against Medallion Architecture 50 | * Title: Data Products: A Case Against Medallion Architecture 51 | * Date: February 2025 52 | * Authors: 53 | * Animesh Kumar 54 | * Shubhanshu Jain 55 | * Samadrita Ghosh 56 | * Link to the article on Medium: 57 | https://medium.com/@community_md101/data-products-a-case-against-medallion-architecture-139096ceea08 58 | 59 | ## 5 Data Engineering mistakes 60 | * Title: 5 Data Engineering mistakes, and what to do about them 61 | * Date: July 2024 62 | * Author: Daniel Beach 63 | ([Daniel Beach on LinkedIn](https://www.linkedin.com/in/daniel-beach-6ab8b4132/)) 64 | * Link to the article on Substack: 65 | https://dataengineeringcentral.substack.com/p/5-data-engineering-mistakes 66 | * The 5 common mistakes: 67 | * Not embracing simple architecture and design 68 | * Not having a good local development environment 69 | * Not having a good orchestration and dependency management tool 70 | * Not testing code and pipelines before release 71 | * Not doing something hard 72 | 73 | ## The Rise of the Data Platform Engineer 74 | * Title: The Rise of the Data Platform Engineer 75 | * Date: June 2024 76 | * Author: Pedram Navid 77 | ([Pedram Navid on LinkedIn](https://www.linkedin.com/in/pedramnavid/)) 78 | * Link to the article: 79 | https://databased.pedramnavid.com/p/the-rise-of-the-data-platform-engineer 80 | 81 | ## Data Council 2024: The future data stack is composable 82 | * Title: Data Council 2024:
The future data stack is composable, and other hot takes 83 | * Date: April 2024 84 | * Author: Chase Roberts 85 | ([Chase Roberts on LinkedIn](https://www.linkedin.com/in/chasecroberts/), 86 | [Chase Roberts on Medium](https://chsrbrts.medium.com/)) 87 | * Link to the article: 88 | https://medium.com/vvus/data-council-2024-the-future-data-stack-is-composable-and-other-hot-takes-b6c5f2429e22 89 | 90 | ## Data Lakehouses, Post-Modern Data Stacks and Enabling Gen AI 91 | * Title: Data Lakehouses, Post-Modern Data Stacks and Enabling Gen AI: The Rittman Analytics Guide to Modernising Data Analytics in 2024 92 | * Date: April 2024 93 | * Author: Mark Rittman 94 | ([Mark Rittman on LinkedIn](https://www.linkedin.com/in/markrittman/), 95 | [Mark Rittman on Medium](https://markrittman.medium.com/)) 96 | * Link to the article: 97 | https://blog.rittmananalytics.com/data-lakehouses-post-modern-data-stacks-and-enabling-gen-ai-the-rittman-analytics-guide-to-b102027b8cf8 98 | 99 | ## Why data teams are adopting declarative stateful pipelines 100 | * Title: Why Data Teams Are Adopting Declarative (Stateful) Pipelines 101 | * Date: November 2023 102 | * Author: Iaroslav Zeigerman 103 | * Link to the article: 104 | https://tobikodata.com/why-data-teams-are-adopting-declarative-pipelines.html 105 | * Publisher: Tobiko Data blog 106 | 107 | ## The next big step forwards for analytics engineering 108 | * Title: The next big step forwards for analytics engineering 109 | * Date: April 2023 110 | * Author: Tristan Handy 111 | ([Tristan Handy on LinkedIn](https://www.linkedin.com/in/tristanhandy/), 112 | [Tristan Handy on the dbt web site](https://www.getdbt.com/author/tristan-handy/)) 113 | * Link to the article: 114 | https://www.getdbt.com/blog/analytics-engineering-next-step-forwards/ 115 | * Publisher: dbt Labs 116 | 117 | ## Data Materialization is a Convergence Problem 118 | * Title: Data Materialization is a Convergence Problem 119 | * Date: April 2023 120 | * Author: Alex Rasmussen 121 |
([Alex Rasmussen on LinkedIn](https://www.linkedin.com/in/alexras/), 122 | [Alex Rasmussen on Substack](https://substack.com/profile/4434805-alex-rasmussen)) 123 | * Link to the article: 124 | https://stkbailey.substack.com/p/data-materialization-is-a-convergence 125 | * Publisher: Substack 126 | 127 | ## Principles - Alex Ewerlöf's guiding principles 128 | * Title: My guiding principles after 20 years of programming 129 | * Date: January 2020 130 | * Author: Alex Ewerlöf 131 | ([Alex Ewerlöf on LinkedIn](https://www.linkedin.com/in/alexewerlof/), 132 | [Alex Ewerlöf on Substack](https://substack.com/profile/87732486-alex-ewerlof)) 133 | * Link to the article: 134 | https://alexewerlof.medium.com/my-guiding-principles-after-20-years-of-programming-a087dc55596c 135 | * Publisher: Medium 136 | 137 | ## Zalando's Engineering principles 138 | * Title: Zalando's Engineering principles 139 | * GitHub repository: https://github.com/zalando/engineering-principles 140 | * Overview: in March 2015, we have adopted this set of principles for tech and architecture: 141 | * Microservices 142 | * API First 143 | * REST 144 | * Cloud 145 | * Software as a Service (SaaS) 146 | 147 | ## Design docs 148 | 149 | ### Writing design docs for data pipelines 150 | * Title: Writing design docs for data pipelines 151 | * Date: May 2023 152 | * Author: Mahdi Karabiben 153 | ([Mahdi Karabiben on LinkedIn](https://www.linkedin.com/in/mahdikarabiben/), 154 | [Mahdi Karabiben on Medium](https://mahdiqb.medium.com/about)) 155 | * Link to the article: 156 | https://towardsdatascience.com/writing-design-docs-for-data-pipelines-d49550f95580 157 | 158 | ### Design docs at Google 159 | * Link to the presentation page: 160 | https://www.industrialempathy.com/posts/design-docs-at-google/ 161 | 162 | ## Principles - Functional Data Engineering 163 | 164 | ### Functional Data Engineering — a modern paradigm for batch data processing 165 | * Title: Functional Data Engineering — a modern paradigm for batch 
data processing 166 | * Author: Maxime Beauchemin 167 | * Date: January 2018 168 | * Link to the article: 169 | https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a 170 | * Publisher: Medium 171 | 172 | ### Transform Your Data Processing with Functional Data Engineering 173 | * Author: Raphaël Mansuy (https://www.linkedin.com/in/raphaelmansuy/) 174 | * Date: January 2023 175 | * On LinkedIn feed: 176 | https://www.linkedin.com/posts/raphaelmansuy_dataengineering-functionalprogramming-activity-7025661748233797632-TQqZ/ 177 | 178 | ### Functional data engineering - a blueprint 179 | * Author: Ananth Packkildurai (https://www.linkedin.com/in/ananthdurai/) 180 | * Date: December 2022 181 | * Link to the article: 182 | https://www.dataengineeringweekly.com/p/functional-data-engineering-a-blueprint 183 | * Publisher: Substack 184 | 185 | ### Why developers are falling in love with functional programming 186 | * Author: Ari Joury 187 | * Date: August 2020 188 | * Link to the article: 189 | https://towardsdatascience.com/why-developers-are-falling-in-love-with-functional-programming-13514df4048e 190 | * Publisher: Medium 191 | 192 | ### Data centric manifesto 193 | * Homepage: http://www.datacentricmanifesto.org/ 194 | * Principles: http://www.datacentricmanifesto.org/principles/ 195 | 196 | ## Principles - Orchestration 197 | 198 | ### Don't use Apache Airflow in that way 199 | * Title: Don’t Use Apache Airflow in That Way 200 | * Date: May 2023 201 | * Author: Ansam Yousry ( 202 | [Ansam Yousry on LinkedIn](https://www.linkedin.com/in/ansam-yousry-34b32b116/), 203 | [Ansam Yousry on Medium](https://medium.com/@ansam.yousry)) 204 | * Link to the article: 205 | https://medium.com/illumination/what-apache-airflow-is-not-e9dc9722500b 206 | * Publisher: Medium 207 | * Summary: Airflow is not a data streaming or processing tool, but it is a good orchestrator 208 | for managing data pipelines. 
  Airflow integrates well with specialized data tools, which makes it possible to build
  complete and scalable data pipeline solutions.

## Principles - Documentation

### Atlassian - The importance of documentation
* Title: The importance of documentation (because it’s way more than a formality)
* Link to the page:
  https://www.atlassian.com/work-management/knowledge-sharing/documentation/importance-of-documentation
* Publisher: Atlassian
* Documentation should be your best friend:
  + A single source of truth saves time and energy
  + Documentation is essential to quality and process control
  + Documentation cuts down duplicative work
  + It makes hiring and onboarding so much easier
  + A single source of truth makes everyone smarter

### Docs as Code
* Title: Docs as Code
* Author: [Eric Holscher](https://ericholscher.com/)
* Link to the page:
  https://www.writethedocs.org/guide/docs-as-code/
* Publisher: [Write the Docs](https://www.writethedocs.org/)
* Documentation as Code (Docs as Code) refers to a philosophy that you should be writing documentation
  with the same tools as code:
  + Issue Trackers
  + Version Control (Git)
  + Plain Text Markup (Markdown, reStructuredText, AsciiDoc)
  + Code Reviews
  + Automated Tests
* This means following the same workflows as development teams, and being integrated in the product team.
  It enables a culture where writers and developers both feel ownership of documentation, and work together
  to make it as good as possible.
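
As a minimal sketch of the "Automated Tests" item of the Docs as Code list, the following Python function checks that relative links in a Markdown document point to files that exist. It assumes the documentation is stored as Markdown in the same Git repository as the code. The function name `find_broken_relative_links` and the regular expression are illustrative assumptions, not part of the Write the Docs guide.

```python
import re
from pathlib import Path

# Matches inline Markdown links such as [text](target).
# Hypothetical, simplified pattern: it does not handle targets with spaces or parentheses.
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def find_broken_relative_links(markdown_text: str, base_dir: Path) -> list[str]:
    """Return the relative link targets that do not exist under base_dir."""
    broken = []
    for target in LINK_PATTERN.findall(markdown_text):
        # External links and in-page anchors are out of scope for this check
        if target.startswith(("http://", "https://", "mailto:", "#")):
            continue
        # Drop any trailing anchor (file.md#section) before checking the file
        path_part = target.split("#", 1)[0]
        if not (base_dir / path_part).exists():
            broken.append(target)
    return broken
```

A check of this kind could run in the same CI pipeline as the code tests, so that a broken link fails the build like any other regression.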

### Help engineers become better writers
* Title: Using Vale to help engineers become better writers
* Date: April 2023
* Author: [François Violette](https://www.linkedin.com/in/francoisviolette/)
* Link to the article:
  https://engineering.contentsquare.com/2023/using-vale-to-help-engineers-become-better-writers/
* Publisher: Contentsquare

### How we manage documentation at Funding Circle for our Data Platform
* Title: How we manage documentation at Funding Circle for our Data Platform
* Date: April 2023
* Author: Nikolajs Skrjabins
  ([Nikolajs Skrjabins on LinkedIn](https://www.linkedin.com/in/nikolajs-skrjabins/),
  [Nikolajs Skrjabins on Medium](https://nikolajs-skrjabins.medium.com/))
* Link to the article:
  https://medium.com/funding-circle/how-we-manage-documentation-at-funding-circle-for-our-data-platform-960a422b9b2e
* Publisher: Medium

## Principles - Infrastructure as code

### SST design principles
* SST home page: https://sst.dev
* Page covering the design principles: https://docs.sst.dev/design-principles
  + Zero config
  + Progressive disclosure
  + Attaching permissions
  + Having an escape hatch

### We Need a Data Engineering-Specific Language
* Title: We Need a Data Engineering-Specific Language
* Date: February 2024
* Author: Julien Hurault
  ([Julien Hurault on LinkedIn](https://www.linkedin.com/in/julienhuraultanalytics/))
* Link to the article:
  https://juhache.substack.com/p/we-need-a-data-engineering-specific
* Publisher: Substack

# Books

## Ship It!
* Link on Amazon: https://www.amazon.com/Ship-Practical-Successful-Pragmatic-Programmers-ebook/dp/B01F2OUIY6
* Authors: Jared Richardson, William A.
  Gwaltney
* Date: 21 June 2005
* Publisher: Pragmatic Bookshelf; 1st edition
* ISBN-10: 0974514047 / ISBN-13: 978-0974514048

## Open collection of architecture books
* A few software architecture books, according to Thilina Ashen Gamage in 2020:
  https://medium.com/@ThilinaAshenGamage/the-best-software-architecture-books-of-all-time-b82b63bb853b

# Blogs / websites

## Martin Fowler
* [Martin Fowler on Wikipedia](https://en.wikipedia.org/wiki/Martin_Fowler_(software_engineer))
* [Martin Fowler profile page on his blog](http://martinfowler.com/aboutMe.html)
* [Martin Fowler on ThoughtWorks](https://www.thoughtworks.com/profiles/leaders/martin-fowler)
* Main blog: http://martinfowler.com
* Data management section: https://martinfowler.com/data/

## Zhamak Dehghani
* [Zhamak Dehghani on LinkedIn](https://www.linkedin.com/in/zhamak-dehghani/)
* [Zhamak Dehghani on Twitter](https://twitter.com/zhamakd)
* Zhamak Dehghani introduced the concept of
  [Data Mesh](https://en.wikipedia.org/wiki/Data_mesh)
  in [a famous article on Martin Fowler's website](https://martinfowler.com/articles/data-monolith-to-mesh.html)
  in May 2019

## Joel Spolsky
* [Joel Spolsky on Wikipedia](https://en.wikipedia.org/wiki/Joel_Spolsky)
* Main blog: https://www.joelonsoftware.com/

## GitHub -> data-engineer-handbook
Repository with multiple resources for building data engineering skills
* GitHub repo: https://github.com/DataEngineer-io/data-engineer-handbook

# Frameworks

## Twelve-Factor (12factor) Application
* Homepage: https://12factor.net/

## Clean architecture
One of the main principles of clean architecture is nothing more than keeping your options open, so that
you can change, add, or remove dependencies as often as you want. We offer two analogies to illustrate this point.
A good demo project can be found on
[GitHub by Mattia Battiston](https://github.com/mattia-battiston/clean-architecture-example).

### Articles
* Articles:
  + Aug. 2023:
    [Implementing clean architecture solutions: A practical example](https://developers.redhat.com/articles/2023/08/08/implementing-clean-architecture-solutions-practical-example)
  + Apr. 2023:
    [My advice for building maintainable, clean architecture](https://developers.redhat.com/articles/2023/04/17/my-advice-building-maintainable-clean-architecture)
  + Apr. 2023:
    [My advice for transitioning to a clean architecture platform](https://developers.redhat.com/articles/2023/04/17/my-advice-transitioning-clean-architecture-platform)
* Authors:
  + Maarten Vandeperre
    ([Maarten Vandeperre on LinkedIn](https://www.linkedin.com/in/maarten-vandeperre-8780743b/),
    [Maarten Vandeperre on Red Hat Developer](https://developers.redhat.com/author/maarten-vandeperre))
  + [Kevin Dubois](https://developers.redhat.com/author/kevin-dubois)
* Publisher: Red Hat Developer

## Backstage
* GitHub page for Backstage: https://github.com/backstage/backstage
  + Getting started: https://backstage.io/docs/getting-started/
* Backstage TechDocs: https://backstage.io/docs/features/techdocs/
* Backstage software templates: https://backstage.io/docs/features/software-templates/
* Backstage software catalog: https://backstage.io/docs/features/software-catalog/

## Cookiecutter
* Cookiecutter home page: https://cookiecutter.readthedocs.io/en/stable/
  + GitHub page: https://github.com/cookiecutter/cookiecutter
* Cruft, bringing Git-based workflows on top of Cookiecutter:
  + Cruft home page: https://cruft.github.io/cruft/
  + GitHub page: https://github.com/cruft/cruft/

## docToolchain
* Homepage: https://doctoolchain.org/docToolchain

# Practices around architecture

## OpenLineage
OpenLineage feature proposals, for instance:
https://github.com/OpenLineage/OpenLineage/blob/main/proposals/336/PROPOSALS.md


--------------------------------------------------------------------------------