├── CONTRIBUTING.md
└── Readme.md

/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guidelines
2 |
3 | Thank you for your interest in contributing to the Big Data Engineering Roadmap repository! We welcome contributions from the community to make this roadmap more comprehensive and valuable for aspiring Big Data Engineers.
4 |
5 | ## Ways to Contribute
6 |
7 | There are several ways you can contribute to this repository:
8 |
9 | 1. **Suggest New Resources**: If you know of any helpful resources, such as books, tutorials, blogs, or open-source projects related to Big Data Engineering, please feel free to submit them for inclusion in the relevant sections of the roadmap.
10 |
11 | 2. **Improve Existing Resources**: If you find any outdated or inaccurate information in the existing resources, or if you have better explanations or examples, please submit a pull request with your improvements.
12 |
13 | 3. **Report Issues**: If you encounter any issues or errors in the resources or the repository structure, please open an issue in the repository's issue tracker. Be sure to provide detailed information about the issue, including any error messages or screenshots, if applicable.
14 |
15 | 4. **Contribute Code Examples or Projects**: If you have any code examples, sample projects, or practical exercises related to Big Data Engineering concepts, you can contribute them to the repository. These additions can greatly benefit aspiring Big Data Engineers by providing hands-on learning opportunities.
16 |
17 | ## Contributing Process
18 |
19 | 1. **Fork the Repository**: Start by forking this repository to your GitHub account.
20 |
21 | 2. **Create a Branch**: Create a new branch for your contributions. Use a descriptive name that reflects the changes you plan to make.
22 |
23 | 3. **Make Changes**: Make the necessary changes or additions to the repository.
Be sure to follow the established coding conventions and documentation guidelines.
24 |
25 | 4. **Test Your Changes**: Test your changes thoroughly to ensure they work as intended and do not introduce any regressions.
26 |
27 | 5. **Commit Your Changes**: Commit your changes with a descriptive commit message that explains the purpose of the changes.
28 |
29 | 6. **Push to Your Fork**: Push your changes to your forked repository.
30 |
31 | 7. **Submit a Pull Request**: Submit a pull request from your forked repository to the main repository. In the pull request description, provide a detailed explanation of the changes you've made and the motivation behind them.
32 |
33 | 8. **Review Process**: Your pull request will be reviewed by the maintainers of the repository. They may request changes or provide feedback, in which case you should make the necessary updates and push them to your branch.
34 |
35 | 9. **Merge**: Once your pull request has been approved, it will be merged into the main repository.
36 |
37 | ## Code of Conduct
38 |
39 | Please note that by contributing to this repository, you agree to follow our [Code of Conduct](CODE_OF_CONDUCT.md).
40 |
41 | ## License
42 |
43 | By contributing to this repository, you agree that your contributions will be licensed under the [MIT License](LICENSE).
44 |
45 | Thank you for your contributions! We appreciate your efforts to make this Big Data Engineering Roadmap a valuable resource for the community.
46 |
--------------------------------------------------------------------------------
/Readme.md:
--------------------------------------------------------------------------------
1 | # Big Data Engineering Roadmap
2 |
3 | This repository serves as a comprehensive guide for individuals aspiring to become Big Data Engineers. It provides a detailed roadmap, recommended learning resources, and a collection of open-source projects to help you develop the necessary skills and gain hands-on experience.
4 | ### Some of the resource links are no longer working, and I am updating them. If you want to add a link, please create a pull request.
5 |
6 | ## Table of Contents
7 |
8 | - [Introduction](#introduction)
9 | - [Programming Languages](#programming-languages)
10 |   - [Python](#python)
11 |   - [Scala/Java](#scalajava)
12 | - [Data Processing Frameworks](#data-processing-frameworks)
13 |   - [Apache Spark](#apache-spark)
14 |   - [Apache Hadoop](#apache-hadoop)
15 | - [Data Storage and Querying](#data-storage-and-querying)
16 |   - [Databases](#databases)
17 |   - [Data Warehousing](#data-warehousing)
18 | - [Data Streaming and Messaging](#data-streaming-and-messaging)
19 |   - [Apache Kafka](#apache-kafka)
20 |   - [Apache Flink/Apache Storm](#apache-flinkapache-storm)
21 | - [Data Orchestration and Workflow Management](#data-orchestration-and-workflow-management)
22 |   - [Apache Airflow](#apache-airflow)
23 | - [Cloud Computing](#cloud-computing)
24 |   - [AWS](#aws)
25 |   - [Azure](#azure)
26 |   - [GCP](#gcp)
27 | - [Data Modeling and ETL/ELT](#data-modeling-and-etlelt)
28 |   - [Data Modeling](#data-modeling)
29 |   - [ETL/ELT](#etlelt)
30 | - [Data Visualization and Reporting](#data-visualization-and-reporting)
31 | - [Soft Skills](#soft-skills)
32 | - [Projects and Certifications](#projects-and-certifications)
33 | - [Interview Preparation](#interview-preparation)
34 | - [Contributing](#contributing)
35 |
36 | ## Introduction
37 |
38 | This section will provide an overview of the Big Data Engineering field, its importance, and the role of a Big Data Engineer.
39 |
40 | ## Programming Languages
41 |
42 | ### Python
43 |
44 | - [Python for Data Analysis](https://www.oreilly.com/library/view/python-for-data/9781491957662/) - Free book from O'Reilly covering Python for data analysis.
45 | - [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) - Free book that covers the essential knowledge for working with data in Python.
46 | - [Python for Data Analysis Video Series](https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1IHykI9tC7mXXhxmr) - Video tutorials from Corey Schafer.
47 |
48 | ### Scala/Java
49 |
50 | - [Scala Programming Language](https://scala-lang.org/documentation/) - Official documentation and learning resources for Scala.
51 | - [Java Programming Language](https://docs.oracle.com/javase/tutorial/) - Official Java tutorials from Oracle.
52 |
53 | ## Data Processing Frameworks
54 |
55 | ### Apache Spark
56 |
57 | - [Apache Spark Official Documentation](https://spark.apache.org/docs/latest/)
58 | - [Learning Spark: Lightning-Fast Data Analytics](https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf) - Free book from Databricks (requires email signup).
59 | - [Spark Programming Guide](https://www.oreilly.com/library/view/spark-programming-guide/9781786462718/) - Book from O'Reilly.
60 |
61 | ### Apache Hadoop
62 |
63 | - [Apache Hadoop Official Documentation](https://hadoop.apache.org/docs/stable/)
64 | - [Hadoop: The Definitive Guide](https://www.oreilly.com/library/view/hadoop-the-definitive/9781491901687/) - Book from O'Reilly.
65 |
66 | ## Data Storage and Querying
67 |
68 | ### Databases
69 |
70 | - [PostgreSQL Tutorial](https://www.postgresqltutorial.com/) - Free comprehensive PostgreSQL tutorial.
71 | - [MySQL Tutorial](https://www.mysqltutorial.org/) - Free MySQL tutorial for beginners.
72 | - [MongoDB University](https://university.mongodb.com/) - Free online courses and certifications for MongoDB.
73 | - [Apache Cassandra Documentation](https://cassandra.apache.org/doc/) - Official documentation for Apache Cassandra.
74 | - [HBase Reference Guide](https://hbase.apache.org/book.html) - Official reference guide for Apache HBase.
75 |
76 | ### Data Warehousing
77 |
78 | - [Apache Hive Tutorial](https://cwiki.apache.org/confluence/display/Hive/Tutorial) - Official Apache Hive tutorial.
79 | - [Presto Documentation](https://prestodb.io/docs/current/) - Official documentation for Presto.
80 | - [Apache Impala Documentation](https://impala.apache.org/docs/build/) - Official documentation for Apache Impala.
81 |
82 | ## Data Streaming and Messaging
83 |
84 | ### Apache Kafka
85 |
86 | - [Apache Kafka Documentation](https://kafka.apache.org/documentation/) - Official Apache Kafka documentation.
87 | - [Kafka: The Definitive Guide](https://www.oreilly.com/library/view/kafka-the-definitive/9781491936153/) - Book from O'Reilly.
88 | - [Kafka Streams Documentation](https://kafka.apache.org/documentation/streams/) - Official documentation for Kafka Streams.
89 |
90 | ### Apache Flink/Apache Storm
91 |
92 | - [Apache Flink Documentation](https://nightlies.apache.org/flink/flink-docs-release-1.16/) - Official Apache Flink documentation.
93 | - [Apache Storm Documentation](https://storm.apache.org/releases/current/index.html) - Official Apache Storm documentation.
94 |
95 | ## Data Orchestration and Workflow Management
96 |
97 | ### Apache Airflow
98 |
99 | - [Apache Airflow Documentation](https://airflow.apache.org/docs/) - Official Apache Airflow documentation.
100 | - [Airflow Tutorial](https://airflow.apache.org/tutorial.html) - Official Airflow tutorial.
101 | - [Mastering Apache Airflow](https://www.oreilly.com/library/view/mastering-apache-airflow/9781492086314/) - Book from O'Reilly.
102 |
103 | ## Cloud Computing
104 |
105 | ### AWS
106 |
107 | - [AWS Big Data Services](https://aws.amazon.com/big-data/) - Overview of AWS big data services.
108 | - [Amazon EMR Documentation](https://docs.aws.amazon.com/emr/) - Official documentation for Amazon EMR.
109 | - [Amazon S3 Documentation](https://docs.aws.amazon.com/s3/) - Official documentation for Amazon S3.
110 | - [Amazon Athena Documentation](https://docs.aws.amazon.com/athena/) - Official documentation for Amazon Athena.
111 | - [Amazon Redshift Documentation](https://docs.aws.amazon.com/redshift/) - Official documentation for Amazon Redshift.
112 |
113 | ### Azure
114 |
115 | - [Azure Data Services](https://azure.microsoft.com/en-us/product-categories/analytics/) - Overview of Azure data and analytics services.
116 | - [Azure HDInsight Documentation](https://docs.microsoft.com/en-us/azure/hdinsight/) - Official documentation for Azure HDInsight.
117 | - [Azure Data Lake Storage Documentation](https://docs.microsoft.com/en-us/azure/data-lake-store/) - Official documentation for Azure Data Lake Storage.
118 | - [Azure Synapse Analytics Documentation](https://docs.microsoft.com/en-us/azure/synapse-analytics/) - Official documentation for Azure Synapse Analytics.
119 |
120 | ### GCP
121 |
122 | - [Google Cloud Data Services](https://cloud.google.com/products/data-analytics) - Overview of Google Cloud data and analytics services.
123 | - [Google Cloud Dataproc Documentation](https://cloud.google.com/dataproc/docs/) - Official documentation for Google Cloud Dataproc.
124 | - [Google Cloud Dataflow Documentation](https://cloud.google.com/dataflow/docs/) - Official documentation for Google Cloud Dataflow.
125 | - [Google BigQuery Documentation](https://cloud.google.com/bigquery/docs/) - Official documentation for Google BigQuery.
126 |
127 | ## Data Modeling and ETL/ELT
128 |
129 | ### Data Modeling
130 |
131 | - [Data Modeling for Data Warehouses](https://www.amazon.com/Data-Modeling-Data-Warehouses-Vaughan/dp/1555581167) - Book by Len Silverston and Paul Agnew.
132 | - [Data Vault Modeling Guide](https://www.oreilly.com/library/view/data-vault-modeling/9781119613565/) - Book from O'Reilly.
133 |
134 | ### ETL/ELT
135 |
136 | - [ETL/ELT with Python](https://www.oreilly.com/library/view/etlelt-with-python/9781492047025/) - Book from O'Reilly.
137 | - [Apache NiFi Documentation](https://nifi.apache.org/docs.html) - Official documentation for Apache NiFi.
138 | - [Talend Open Studio Documentation](https://help.talend.com/r/en-US/8.0/) - Official documentation for Talend Open Studio.
139 |
140 | ## Data Visualization and Reporting
141 |
142 | - [Tableau Desktop Resources](https://www.tableau.com/learn/training) - Free training resources for Tableau Desktop.
143 | - [Power BI Documentation](https://docs.microsoft.com/en-us/power-bi/) - Official documentation for Microsoft Power BI.
144 | - [Apache Superset Documentation](https://superset.apache.org/docs/) - Official documentation for Apache Superset.
145 |
146 | ## Soft Skills
147 |
148 | - [Problem-Solving Techniques](https://www.mindtools.com/pages/article/newTMC_00.htm) - Resources for developing problem-solving skills.
149 | - [Effective Communication Skills](https://www.mindtools.com/CommSkll/CommunicationIntro.htm) - Resources for improving communication skills.
150 | - [Collaboration and Teamwork](https://www.mindtools.com/pages/article/newTMM_84.htm) - Resources for enhancing collaboration and teamwork.
151 |
152 | ## Projects and Certifications
153 |
154 | - [AWS Certified Big Data Specialty](https://aws.amazon.com/certification/certified-big-data-specialty/) - AWS Certified Big Data Specialty certification.
155 | - [Azure Data Engineer Associate](https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer) - Azure Data Engineer Associate certification.
156 | - [Google Cloud Professional Data Engineer](https://cloud.google.com/certification/data-engineer) - Google Cloud Professional Data Engineer certification.
157 |
158 | ## Interview Preparation
159 |
160 | - [Data Engineering Interview Questions](https://data-flair.training/blogs/data-engineering-interview-questions/) - Collection of data engineering interview questions.
161 | - [System Design Interview Questions](https://www.educative.io/courses/grokking-the-system-design-interview) - System design interview questions and resources.
162 |
163 | ## Contributing
164 |
165 | If you have any suggestions, improvements, or additional resources to share, please feel free to contribute to this repository. Follow the [Contributing Guidelines](CONTRIBUTING.md) to get started.
166 |
--------------------------------------------------------------------------------