├── .github └── FUNDING.yml ├── CHANGELOG.md ├── README.md ├── img ├── extras.png ├── roadmap.png └── title.png └── text ├── extras.md └── roadmap.md /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | # These are supported funding model platforms 2 | 3 | # github: [datastacktv, alexandraabbas] 4 | custom: https://paypal.me/alexandraabbas 5 | -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | ## Roadmap 2021 2 | 3 | ### Update 2021-01-15 4 | 5 | * Added text version for visually impaired users (issue #10) 6 | * Math & statistics basics have been added to CS fundamentals (issue #22) 7 | * Dimensional modelling has been added to Database fundamentals 8 | * Added section for Object storage (issue #7) 9 | * Azure CosmosDB has been added to Document databases 10 | * Apache Impala has been moved from Batch processing to Data Warehouses 11 | * Azure Synapse Analytics (issue #18) and ClickHouse (issue #24) have been added to Data Warehouses 12 | * Lambda & Kappa architectures have been added to Cluster computing fundamentals (issue #31) 13 | * Azure Data Lake has been added to Managed Hadoop 14 | * Apache NiFi has been added to Hybrid data processing 15 | * Cloud specific messaging services have been added to Messaging (issue #8) 16 | * Luigi has been added to Workflow scheduling 17 | * AWS CDK has replaced AWS CloudFormation in Infrastructure provisioning (issue #4, issue #6) 18 | * Power BI has been added to data visualisation tools (issue #29) 19 | * MLflow has been added to Machine Learning Ops (issue #30) 20 | 21 | ## Roadmap 2020 22 | 23 | [Modern Data Engineer Roadmap 2020](https://github.com/datastacktv/data-engineer-roadmap/tree/8b1ccdce4524961bfd37495de20117c47766b1eb) -------------------------------------------------------------------------------- /README.md: 
-------------------------------------------------------------------------------- 1 | 2 | ![Modern Data Engineer Roadmap 2021](img/title.png) 3 | 4 | > Roadmap to becoming a data engineer in 2021 5 | 6 | [![Twitter](https://img.shields.io/badge/-Twitter-1DA1F2)](https://twitter.com/datastacktv) 7 | [![YouTube](https://img.shields.io/badge/-YouTube-FF0000)](http://youtube.com/c/datastacktv) 8 | [![Website](https://img.shields.io/badge/-Website-565CD8)](https://datastack.tv/) 9 | [![Jobs](https://img.shields.io/badge/-Jobs-ffdf4b)](https://datastackjobs.com/) 10 | 11 | This roadmap aims to give a **complete picture of the modern data engineering landscape** and serve as a **study guide** for aspiring data engineers. 12 | 13 | *** 14 | 15 |

Note to beginners

16 | 17 | > Beginners shouldn’t feel overwhelmed by the vast number of tools and frameworks listed here. A typical data engineer masters a subset of these tools over several years, depending on their company and career choices. 18 | 19 | *** 20 | 21 | 🔥 We just launched [**Data Stack Jobs**](https://datastackjobs.com/) — a clean and simple job site for Data Stack Engineers! 22 | 23 | > [Text version for visually impaired users](text/roadmap.md) 24 | 25 | ![Data Engineer Roadmap](img/roadmap.png) 26 | 27 | ## Nice to have 😎 28 | 29 | > [Text version for visually impaired users](text/extras.md) 30 | 31 | ![Data Engineer Roadmap Extras](img/extras.png) 32 | 33 | ## Contributions are welcome 💜 34 | 35 | Please raise an issue to discuss your suggestions or open a Pull Request to propose improvements. 36 | 37 | ## Reviewers 🔎 38 | 39 | Huge thank you to [@whydidithavetobebugs](https://github.com/whydidithavetobebugs), [@sawidis](https://github.com/sawidis), [@marclamberti](https://github.com/marclamberti) and [@mpyeager](https://github.com/mpyeager) for reviewing this roadmap. 40 | 41 | ## About us 👋🏼 42 | 43 | [datastack.tv](https://datastack.tv/) is the learning platform for the modern data stack. We create concise screencast video tutorials for data engineers. 
[**Browse our courses here!**](https://datastack.tv/courses.html) 44 | 45 | ## License 🗞 46 | 47 | > Copyright © 2021 Alexandra Abbas — 48 | -------------------------------------------------------------------------------- /img/extras.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datastacktv/data-engineer-roadmap/53e47f5780a5af7c919b791e91bf8c14439644ee/img/extras.png -------------------------------------------------------------------------------- /img/roadmap.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datastacktv/data-engineer-roadmap/53e47f5780a5af7c919b791e91bf8c14439644ee/img/roadmap.png -------------------------------------------------------------------------------- /img/title.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datastacktv/data-engineer-roadmap/53e47f5780a5af7c919b791e91bf8c14439644ee/img/title.png -------------------------------------------------------------------------------- /text/extras.md: -------------------------------------------------------------------------------- 1 | > Text version for visually impaired users 2 | 3 | *Note: Data engineers often work closely with Data scientists, Data analysts and Machine Learning engineers. 
It’s good to have a basic understanding of the tools they use.* 4 | 5 | * Visualise data 6 | * Tableau [general recommendation] 7 | * Looker [personal recommendation] 8 | * Grafana [general recommendation] 9 | * Jupyter Notebook [general recommendation] 10 | * Microsoft Power BI 11 | 12 | * Machine Learning fundamentals 13 | * Terminology [general recommendation] 14 | * Supervised vs unsupervised learning 15 | * Classification vs regression 16 | * Evaluation metrics 17 | * scikit-learn [general recommendation] 18 | * TensorFlow [personal recommendation] 19 | * Keras [personal recommendation] 20 | * PyTorch [general recommendation] 21 | 22 | * Machine Learning Ops 23 | * TensorFlow Extended (TFX) [general recommendation] 24 | * Kubeflow [personal recommendation] 25 | * MLflow 26 | * Amazon SageMaker 27 | * Google Cloud AI Platform 28 | 29 | *Note: Keep learning...* 30 | -------------------------------------------------------------------------------- /text/roadmap.md: -------------------------------------------------------------------------------- 1 | > Text version for visually impaired users 2 | 3 | # Data Engineer in 2021 4 | 5 | * CS fundamentals 6 | * Basic terminal usage [general recommendation] 7 | * Data structures & algorithms [general recommendation] 8 | * APIs [general recommendation] 9 | * REST [general recommendation] 10 | * Structured vs unstructured data [general recommendation] 11 | * Serialisation 12 | * Linux [general recommendation] 13 | * CLI 14 | * Vim 15 | * Shell scripting 16 | * Cronjobs 17 | * How does the computer work? [general recommendation] 18 | * How does the Internet work? [general recommendation] 19 | * Git — Version control [general recommendation] 20 | * Math & statistics basics [general recommendation] 21 | 22 | *Note: Git is used for tracking changes in source code and coordinating work among programmers. 
In your day-to-day work you will use a hosted Git service such as GitHub, GitLab or Bitbucket.* 23 | 24 | * Learn a programming language 25 | * Python [personal recommendation] 26 | * Java [general recommendation] 27 | * Scala 28 | * Go 29 | 30 | *Note: Learn how to write clean, extensible code. Spend some time understanding programming paradigms (functional vs. OOP) and best practices (design patterns, YAGNI, stateful vs stateless applications). Get familiar with an IDE or code editor like VS Code.* 31 | 32 | * Testing 33 | * Unit testing [general recommendation] 34 | * Integration testing [general recommendation] 35 | * Functional testing [general recommendation] 36 | 37 | * Database fundamentals 38 | * SQL [general recommendation] 39 | * Normalisation [general recommendation] 40 | * ACID transactions [general recommendation] 41 | * CAP theorem [general recommendation] 42 | * OLTP vs OLAP [general recommendation] 43 | * Horizontal vs vertical scaling [general recommendation] 44 | * Dimensional modelling [general recommendation] 45 | 46 | * Relational databases 47 | * MySQL [general recommendation] 48 | * PostgreSQL [general recommendation] 49 | * MariaDB 50 | * Amazon Aurora 51 | 52 | * Non-relational databases 53 | * Document databases 54 | * MongoDB [general recommendation] 55 | * Elasticsearch [general recommendation] 56 | * Apache CouchDB 57 | * Azure CosmosDB 58 | * Wide column databases 59 | * Apache Cassandra [general recommendation] 60 | * Apache HBase [general recommendation] 61 | * Google Cloud Bigtable [personal recommendation] 62 | * Graph databases 63 | * Neo4j 64 | * Amazon Neptune 65 | * Key-value stores 66 | * Redis [personal recommendation] 67 | * Memcached 68 | * Amazon DynamoDB [general recommendation] 69 | 70 | *Note: Understand the difference between Document, Wide column, Graph and Key-value NoSQL databases. 
We recommend mastering one database from each category.* 71 | 72 | * Data warehouses 73 | * Snowflake [general recommendation] 74 | * Presto 75 | * Apache Hive 76 | * Apache Impala 77 | * Amazon Redshift [general recommendation] 78 | * Google BigQuery [personal recommendation] 79 | * Azure Synapse 80 | * ClickHouse 81 | 82 | * Object storage 83 | * AWS S3 [general recommendation] 84 | * Azure Blob Storage 85 | * Google Cloud Storage 86 | * Apache Ozone 87 | 88 | * Cluster computing fundamentals 89 | * Apache Hadoop [general recommendation] 90 | * HDFS [general recommendation] 91 | * MapReduce [general recommendation] 92 | * Lambda & Kappa architectures 93 | * Managed Hadoop [general recommendation] 94 | * Amazon EMR 95 | * Google Dataproc 96 | * Azure Data Lake 97 | 98 | *Note: Most modern data processing frameworks are based on Apache Hadoop and MapReduce to some extent. Understanding these concepts can help you learn modern data processing frameworks much quicker.* 99 | 100 | * Data processing 101 | * Batch 102 | * Apache Pig [general recommendation] 103 | * Apache Arrow 104 | * data build tool [personal recommendation] 105 | * Hybrid 106 | * Apache Spark [general recommendation] 107 | * Apache Beam [personal recommendation] 108 | * Apache Flink [general recommendation] 109 | * Apache NiFi 110 | * Streaming 111 | * Apache Kafka [personal recommendation] 112 | * Apache Storm [general recommendation] 113 | * Apache Samza 114 | * Amazon Kinesis 115 | 116 | *Note: Hybrid frameworks are able to process both batch and streaming data. Batch data processing is often done by analytical data warehouse applications. 
See the Data warehouses section for more.* 117 | 118 | * Messaging 119 | * RabbitMQ [general recommendation] 120 | * Apache ActiveMQ 121 | * Amazon SNS & SQS 122 | * Google Pub/Sub 123 | * Azure Service Bus 124 | 125 | * Workflow scheduling 126 | * Apache Airflow [personal recommendation] 127 | * Google Cloud Composer 128 | * Apache Oozie 129 | * Luigi 130 | 131 | *Note: Cloud Composer is a managed Apache Airflow service on Google Cloud Platform.* 132 | 133 | * Monitoring and observability for data pipelines 134 | * Prometheus [general recommendation] 135 | * Datadog [general recommendation] 136 | * Sentry [general recommendation] 137 | * Monte Carlo 138 | * Datafold 139 | * Soda Data 140 | * StatsD 141 | 142 | * Networking 143 | * Protocols [general recommendation] 144 | * HTTP / HTTPS 145 | * TCP 146 | * SSH 147 | * IP 148 | * DNS 149 | * Firewalls [general recommendation] 150 | * VPN [general recommendation] 151 | * VPC [general recommendation] 152 | 153 | * Infrastructure as Code 154 | * Containers 155 | * Docker [personal recommendation] 156 | * LXC 157 | * Container orchestration 158 | * Kubernetes [general recommendation] 159 | * Docker Swarm 160 | * Apache Mesos 161 | * Google Kubernetes Engine (GKE) [general recommendation] 162 | * Infrastructure provisioning 163 | * Terraform [personal recommendation] 164 | * Pulumi 165 | * AWS CDK [general recommendation] 166 | 167 | * CI/CD 168 | * GitHub Actions [general recommendation] 169 | * Jenkins [general recommendation] 170 | 171 | * Identity and access management 172 | * Active Directory [general recommendation] 173 | * Azure Active Directory 174 | 175 | * Data security & privacy 176 | * Legal compliance [general recommendation] 177 | * Encryption [general recommendation] 178 | * Key management [general recommendation] 179 | * Data governance & integrity 180 | --------------------------------------------------------------------------------