├── .github
    └── FUNDING.yml
├── CHANGELOG.md
├── README.md
├── img
    ├── extras.png
    ├── roadmap.png
    └── title.png
└── text
    ├── extras.md
    └── roadmap.md


/.github/FUNDING.yml:
--------------------------------------------------------------------------------
1 | # These are supported funding model platforms
2 | 
3 | # github: [datastacktv, alexandraabbas]
4 | custom: https://paypal.me/alexandraabbas
5 | 


--------------------------------------------------------------------------------
/CHANGELOG.md:
--------------------------------------------------------------------------------
 1 | ## Roadmap 2021
 2 | 
 3 | ### Update 2021-01-15
 4 | 
 5 | * Added text version for visually impaired users (issue #10)
 6 | * Math & statistics basics have been added to CS fundamentals (issue #22)
 7 | * Dimensional modelling has been added to Database fundamentals
 8 | * Added section for Object storage (issue #7)
 9 | * Azure Cosmos DB has been added to Document databases
10 | * Apache Impala has been moved from Batch processing to Data Warehouses
11 | * Azure Synapse Analytics (issue #18) and ClickHouse (issue #24) have been added to Data Warehouses
12 | * Lambda & Kappa architectures have been added to Cluster computing fundamentals (issue #31)
13 | * Azure Data Lake has been added to Managed Hadoop
14 | * Apache NiFi has been added to Hybrid data processing
15 | * Cloud specific messaging services have been added to Messaging (issue #8)
16 | * Luigi has been added to Workflow scheduling
17 | * AWS CDK has replaced AWS CloudFormation in Infrastructure provisioning (issue #4, issue #6)
18 | * Power BI has been added to data visualisation tools (issue #29)
19 | * MLflow has been added to Machine Learning Ops (issue #30)
20 | 
21 | ## Roadmap 2020
22 | 
23 | [Modern Data Engineer Roadmap 2020](https://github.com/datastacktv/data-engineer-roadmap/tree/8b1ccdce4524961bfd37495de20117c47766b1eb)


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | 
 2 | ![Modern Data Engineer Roadmap 2021](img/title.png)
 3 | 
 4 | > Roadmap to becoming a data engineer in 2021
 5 | 
 6 | [![Twitter](https://img.shields.io/badge/-Twitter-1DA1F2)](https://twitter.com/datastacktv)
 7 | [![YouTube](https://img.shields.io/badge/-YouTube-FF0000)](http://youtube.com/c/datastacktv)
 8 | [![Website](https://img.shields.io/badge/-Website-565CD8)](https://datastack.tv/)
 9 | [![Jobs](https://img.shields.io/badge/-Jobs-ffdf4b)](https://datastackjobs.com/)
10 | 
11 | This roadmap aims to give a **complete picture of the modern data engineering landscape** and serve as a **study guide** for aspiring data engineers.
12 | 
13 | ***
14 | 
15 | <h3 align="center"><strong>Note to beginners</strong></h3>
16 | 
17 | > Beginners shouldn’t feel overwhelmed by the vast number of tools and frameworks listed here. A typical data engineer masters a subset of these tools over several years, depending on their company and career choices.
18 | 
19 | ***
20 | 
21 | 🔥  We just launched [**Data Stack Jobs**](https://datastackjobs.com/) — a clean and simple job site for Data Stack Engineers!
22 | 
23 | > [Text version for visually impaired users](text/roadmap.md)
24 | 
25 | ![Data Engineer Roadmap](img/roadmap.png)
26 | 
27 | ## Nice to have 😎
28 | 
29 | > [Text version for visually impaired users](text/extras.md)
30 | 
31 | ![Data Engineer Roadmap Extras](img/extras.png)
32 | 
33 | ## Contributions are welcome 💜
34 | 
35 | Please raise an issue to discuss your suggestions or open a pull request to propose improvements.
36 | 
37 | ## Reviewers 🔎
38 | 
39 | Huge thank you to [@whydidithavetobebugs](https://github.com/whydidithavetobebugs), [@sawidis](https://github.com/sawidis), [@marclamberti](https://github.com/marclamberti) and [@mpyeager](https://github.com/mpyeager) for reviewing this roadmap.
40 | 
41 | ## About us 👋🏼
42 | 
43 | [datastack.tv](https://datastack.tv/) is the learning platform for the modern data stack. We create concise screencast video tutorials for data engineers. [**Browse our courses here!**](https://datastack.tv/courses.html)
44 | 
45 | ## License 🗞
46 | 
47 | > Copyright © 2021 Alexandra Abbas — <hello@datastack.tv>
48 | 


--------------------------------------------------------------------------------
/img/extras.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datastacktv/data-engineer-roadmap/53e47f5780a5af7c919b791e91bf8c14439644ee/img/extras.png


--------------------------------------------------------------------------------
/img/roadmap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datastacktv/data-engineer-roadmap/53e47f5780a5af7c919b791e91bf8c14439644ee/img/roadmap.png


--------------------------------------------------------------------------------
/img/title.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datastacktv/data-engineer-roadmap/53e47f5780a5af7c919b791e91bf8c14439644ee/img/title.png


--------------------------------------------------------------------------------
/text/extras.md:
--------------------------------------------------------------------------------
 1 | > Text version for visually impaired users
 2 | 
 3 | *Note: Data engineers often work closely with data scientists, data analysts and machine learning engineers. It’s good to have a basic understanding of the tools they use.*
 4 | 
 5 | * Visualise data
 6 | 	* Tableau [general recommendation]
 7 | 	* Looker [personal recommendation]
 8 | 	* Grafana [general recommendation]
 9 | 	* Jupyter Notebook [general recommendation]
10 | 	* Microsoft Power BI
11 | 
12 | * Machine Learning fundamentals
13 | 	* Terminology [general recommendation]
14 | 		* Supervised vs unsupervised learning
15 | 		* Classification vs regression
16 | 		* Evaluation metrics
17 | 	* scikit-learn [general recommendation]
18 | 	* TensorFlow [personal recommendation]
19 | 	* Keras [personal recommendation]
20 | 	* PyTorch [general recommendation]
21 | 
22 | * Machine Learning Ops
23 | 	* TensorFlow Extended (TFX) [general recommendation]
24 | 	* Kubeflow [personal recommendation]
25 | 	* MLflow
26 | 	* Amazon SageMaker
27 | 	* Google Cloud AI Platform
28 | 
29 | *Note: Keep learning...*
30 | 


--------------------------------------------------------------------------------
/text/roadmap.md:
--------------------------------------------------------------------------------
  1 | > Text version for visually impaired users
  2 | 
  3 | # Data Engineer in 2021
  4 | 
  5 | * CS fundamentals
  6 | 	* Basic terminal usage [general recommendation]
  7 | 	* Data structures & algorithms [general recommendation]
  8 | 	* APIs [general recommendation]
  9 | 	* REST [general recommendation]
 10 | 	* Structured vs unstructured data [general recommendation]
 11 | 	* Serialisation
 12 | 	* Linux [general recommendation]
 13 | 		* CLI
 14 | 		* Vim
 15 | 		* Shell scripting
 16 | 		* Cronjobs
 17 | 	* How does the computer work? [general recommendation]
 18 | 	* How does the Internet work? [general recommendation]
 19 | 	* Git — Version control [general recommendation]
 20 | 	* Math & statistics basics [general recommendation]
 21 | 
 22 | *Note: Git is used for tracking changes in source code and coordinating work among programmers. In your day-to-day work you will use a hosted Git service such as GitHub, GitLab or Bitbucket.*
 23 | 
 24 | * Learn a programming language
 25 | 	* Python [personal recommendation]
 26 | 	* Java [general recommendation]
 27 | 	* Scala
 28 | 	* Go
 29 | 
 30 | *Note: Learn how to write clean, extensible code. Spend some time understanding programming paradigms (functional vs. OOP) and best practices (design patterns, YAGNI, stateful vs stateless applications). Get familiar with an IDE or code editor like VSCode.*
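
As one illustration of the stateful vs stateless distinction mentioned above, here is a minimal Python sketch. The names `StatefulDeduplicator` and `is_new` are hypothetical and chosen only for this example; they do not come from any tool on this roadmap.

```python
class StatefulDeduplicator:
    """Stateful: the object remembers every record ID between calls."""

    def __init__(self):
        self._seen = set()

    def is_new(self, record_id):
        # The answer depends on the call history stored inside the object.
        if record_id in self._seen:
            return False
        self._seen.add(record_id)
        return True


def is_new(record_id, seen):
    """Stateless: the answer depends only on the arguments passed in.

    The caller owns the `seen` collection (for example, loaded from a
    database), so any worker can handle any request.
    """
    return record_id not in seen
```

The stateless version is usually easier to test and to scale horizontally, because no single process has to hold the history in memory.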
 31 | 
 32 | * Testing
 33 | 	* Unit testing [general recommendation]
 34 | 	* Integration testing [general recommendation]
 35 | 	* Functional testing [general recommendation]
 36 | 
 37 | * Database fundamentals
 38 | 	* SQL [general recommendation]
 39 | 	* Normalisation [general recommendation]
 40 | 	* ACID transactions [general recommendation]
 41 | 	* CAP theorem [general recommendation]
 42 | 	* OLTP vs OLAP [general recommendation]
 43 | 	* Horizontal vs vertical scaling [general recommendation]
 44 | 	* Dimensional modelling [general recommendation]
 45 | 
 46 | * Relational databases
 47 | 	* MySQL [general recommendation]
 48 | 	* PostgreSQL [general recommendation]
 49 | 	* MariaDB
 50 | 	* Amazon Aurora
 51 | 
 52 | * Non-relational databases
 53 | 	* Document databases
 54 | 		* MongoDB [general recommendation]
 55 | 		* Elasticsearch [general recommendation]
 56 | 		* Apache CouchDB
 57 | 		* Azure Cosmos DB
 58 | 	* Wide column databases
 59 | 		* Apache Cassandra [general recommendation]
 60 | 		* Apache HBase [general recommendation]
 61 | 		* Google Cloud Bigtable [personal recommendation]
 62 | 	* Graph databases
 63 | 		* Neo4j
 64 | 		* Amazon Neptune
 65 | 	* Key-value stores
 66 | 		* Redis [personal recommendation]
 67 | 		* Memcached
 68 | 		* Amazon DynamoDB [general recommendation]
 69 | 
 70 | *Note: Understand the difference between Document, Wide column, Graph and Key-value NoSQL databases. We recommend mastering one database from each category.*
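
To make the four categories more concrete, the sketch below models each one with plain Python data structures. It only illustrates the shape of the data and does not use any real database client. All sample records and the helper names `find_by_skill` and `followed_by` are hypothetical.

```python
import json

# Key-value store: an opaque value fetched by exactly one key.
key_value = {"user:1": json.dumps({"name": "Ada", "city": "London"})}

# Document database: nested, self-describing records that can be queried by field.
documents = [
    {"_id": 1, "name": "Ada", "skills": ["SQL", "Python"]},
    {"_id": 2, "name": "Grace", "skills": ["Scala"]},
]

# Wide column database: row key -> column family -> column -> value.
# Rows in the same table may carry different columns.
wide_column = {
    "user:1": {"profile": {"name": "Ada"}, "logins": {"2021-01-15": "web"}},
    "user:2": {"profile": {"name": "Grace"}},
}

# Graph database: nodes plus typed edges between them.
nodes = {1: "Ada", 2: "Grace"}
edges = [(1, "FOLLOWS", 2)]


def find_by_skill(docs, skill):
    """Document-style query: filter on a field inside each record."""
    return [doc["name"] for doc in docs if skill in doc["skills"]]


def followed_by(node_id):
    """Graph-style query: traverse outgoing FOLLOWS edges from one node."""
    return [nodes[dst] for src, rel, dst in edges
            if src == node_id and rel == "FOLLOWS"]
```

In this sketch a key-value lookup needs the exact key, a document query can filter on inner fields, a wide column row is addressed by row key and column family, and a graph query follows relationships.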
 71 | 
 72 | * Data warehouses
 73 | 	* Snowflake [general recommendation]
 74 | 	* Presto
 75 | 	* Apache Hive
 76 | 	* Apache Impala
 77 | 	* Amazon Redshift [general recommendation]
 78 | 	* Google BigQuery [personal recommendation]
 79 | 	* Azure Synapse Analytics
 80 | 	* ClickHouse
 81 | 
 82 | * Object storage
 83 | 	* Amazon S3 [general recommendation]
 84 | 	* Azure Blob Storage
 85 | 	* Google Cloud Storage
 86 | 	* Apache Ozone
 87 | 
 88 | * Cluster computing fundamentals
 89 | 	* Apache Hadoop [general recommendation]
 90 | 	* HDFS [general recommendation]
 91 | 	* MapReduce [general recommendation]
 92 | 	* Lambda & Kappa architectures
 93 | 	* Managed Hadoop [general recommendation]
 94 | 		* Amazon EMR
 95 | 		* Google Dataproc
 96 | 		* Azure Data Lake
 97 | 
 98 | *Note: Most modern data processing frameworks are based on Apache Hadoop and MapReduce to some extent. Understanding these concepts can help you learn modern data processing frameworks much more quickly.*
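
The MapReduce idea can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce phases using a word count, not the Hadoop API. The function names are hypothetical and chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

In a real cluster the map and reduce calls run in parallel on many machines and the shuffle moves data over the network. Here everything runs in one process so that only the data flow is visible.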
 99 | 
100 | * Data processing
101 | 	* Batch
102 | 		* Apache Pig [general recommendation]
103 | 		* Apache Arrow
104 | 		* data build tool [personal recommendation]
105 | 	* Hybrid
106 | 		* Apache Spark [general recommendation]
107 | 		* Apache Beam [personal recommendation]
108 | 		* Apache Flink [general recommendation]
109 | 		* Apache NiFi
110 | 	* Streaming
111 | 		* Apache Kafka [personal recommendation]
112 | 		* Apache Storm [general recommendation]
113 | 		* Apache Samza
114 | 		* Amazon Kinesis
115 | 
116 | *Note: Hybrid frameworks are able to process both batch and streaming data. Batch data processing is often done by analytical data warehouse applications. See the Data warehouses section for more.*
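
The sketch below shows, in plain Python, how one piece of aggregation logic can serve both modes. It illustrates the concept only and is not how any listed framework is implemented. The names `aggregate_stream` and `aggregate_batch` and the event shape are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, int]  # hypothetical event shape: (key, amount)


def aggregate_stream(events: Iterable[Event]) -> Iterator[Dict[str, int]]:
    """Streaming: emit an updated snapshot of the totals after every event."""
    totals: Counter = Counter()
    for key, amount in events:
        totals[key] += amount
        yield dict(totals)


def aggregate_batch(events: Iterable[Event]) -> Dict[str, int]:
    """Batch: consume the whole bounded data set and return only the final totals."""
    result: Dict[str, int] = {}
    for result in aggregate_stream(events):
        pass
    return result
```

The batch function reuses the streaming logic and simply waits for the input to end, which is the basic idea behind treating a batch as a bounded stream.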
117 | 
118 | * Messaging
119 | 	* RabbitMQ [general recommendation]
120 | 	* Apache ActiveMQ
121 | 	* Amazon SNS & SQS
122 | 	* Google Cloud Pub/Sub
123 | 	* Azure Service Bus
124 | 
125 | * Workflow scheduling
126 | 	* Apache Airflow [personal recommendation]
127 | 	* Google Cloud Composer
128 | 	* Apache Oozie
129 | 	* Luigi
130 | 
131 | *Note: Cloud Composer is a managed Apache Airflow service on Google Cloud Platform.*
132 | 
133 | * Monitoring and observability for data pipelines
134 | 	* Prometheus [general recommendation]
135 | 	* Datadog [general recommendation]
136 | 	* Sentry [general recommendation]
137 | 	* Monte Carlo
138 | 	* Datafold
139 | 	* Soda Data
140 | 	* StatsD
141 | 
142 | * Networking
143 | 	* Protocols [general recommendation]
144 | 		* HTTP / HTTPS
145 | 		* TCP
146 | 		* SSH
147 | 		* IP
148 | 		* DNS
149 | 	* Firewalls [general recommendation]
150 | 	* VPN [general recommendation]
151 | 	* VPC [general recommendation]
152 | 
153 | * Infrastructure as Code
154 | 	* Containers
155 | 		* Docker [personal recommendation]
156 | 		* LXC
157 | 	* Container orchestration
158 | 		* Kubernetes [general recommendation]
159 | 		* Docker Swarm
160 | 		* Apache Mesos
161 | 		* Google Kubernetes Engine (GKE) [general recommendation]
162 | 	* Infrastructure provisioning
163 | 		* Terraform [personal recommendation]
164 | 		* Pulumi
165 | 		* AWS CDK [general recommendation]
166 | 
167 | * CI/CD
168 | 	* GitHub Actions [general recommendation]
169 | 	* Jenkins [general recommendation]
170 | 
171 | * Identity and access management
172 | 	* Active Directory [general recommendation]
173 | 	* Azure Active Directory
174 | 
175 | * Data security & privacy
176 | 	* Legal compliance [general recommendation]
177 | 	* Encryption [general recommendation]
178 | 	* Key management [general recommendation]
179 | 	* Data governance & integrity
180 | 


--------------------------------------------------------------------------------