├── Uber Data Analytics End-to-End └── readme.md ├── Youtube Data Analysis End-to-End └── readme.md ├── COVID-19 Data Analysis End-to-End Project └── readme.md ├── Olympics data analytics - end to end Azure data engineering └── readme.md ├── Stock Market Real-Time Data Analysis Using Kafka ├── Architecture.jpg ├── readme.md ├── command_kafka.txt ├── KafkaConsumer.ipynb └── KafkaProducer.ipynb ├── Indian Stock Market Real-Time Data Analysis and Visualization └── readme.md ├── twitter-airflow-data-engineering-project └── readme.md └── README.md /Uber Data Analytics End-to-End/readme.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Youtube Data Analysis End-to-End/readme.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /COVID-19 Data Analysis End-to-End Project/readme.md: -------------------------------------------------------------------------------- 1 | # COVID-19 Data Analysis Project 2 | 3 | -------------------------------------------------------------------------------- /Olympics data analytics - end to end Azure data engineering/readme.md: -------------------------------------------------------------------------------- 1 | ## Olympics Data Analytics Project 2 | 3 | -------------------------------------------------------------------------------- /Stock Market Real-Time Data Analysis Using Kafka/Architecture.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cybergeekgyan/Data-Engineering-Portfolio/HEAD/Stock Market Real-Time Data Analysis Using Kafka/Architecture.jpg -------------------------------------------------------------------------------- /Indian Stock Market Real-Time Data Analysis and Visualization/readme.md: -------------------------------------------------------------------------------- 1 | # Indian Stock Market Real-Time Data Analysis and Visualization 2 | 3 | ## Requirements 4 | 5 | - Python 6 | - Azure Account 7 | - Stock API 8 | - Kafka 9 | 10 | 11 | 12 | ## Steps 13 | 14 | 15 | -------------------------------------------------------------------------------- /twitter-airflow-data-engineering-project/readme.md: -------------------------------------------------------------------------------- 1 | ## Twitter Data Pipeline using Airflow 2 | 3 | `Pandas` `Python` `Tweepy` `Airflow` `AWS` 4 | 5 | ## Steps 6 | 7 | - Extract data from Twitter using the Twitter API --> `Tweepy` 8 | - Create a data pipeline to run these automation task using --> `Apache Airflow` 9 | - Tranforming the data using Pandas to store into AWS S3 10 | 11 | 12 | -------------------------------------------------------------------------------- /Stock Market Real-Time Data Analysis Using Kafka/readme.md: -------------------------------------------------------------------------------- 1 | # Stock Market Real-Time Data Analysis Using Kafka 2 | 3 | End-to-End Indian Stock Market Data Engineering Project on Real-time stock market data using Kafka 4 | 5 | ## Architecture 6 | 7 | ![architecture](https://github.com/cybergeekgyan/Data-Engineering-Portfolio/blob/main/Stock%20Market%20Real-Time%20Data%20Analysis%20Using%20Kafka/Architecture.jpg) 8 | 9 | ## Pre-requisite 10 | 11 | - Python --> Programming Language 12 | - AWS Account --> Cloud Service 13 | - AWS S3 --> Object Storage 14 | - EC2 15 | - Apache Kafka 16 | - Apache Airflow 17 | - Glue Crawler & Glue Catalog 18 | - Athena 19 | - SQL 20 | -------------------------------------------------------------------------------- /Stock Market Real-Time Data Analysis Using Kafka/command_kafka.txt: -------------------------------------------------------------------------------- 1 | wget https://downloads.apache.org/kafka/3.3.1/kafka_2.12-3.3.1.tgz 2 | tar -xvf kafka_2.12-3.3.1.tgz 3 | 4 | 5 | ----------------------- 6 | java -version 7 | sudo yum install java-1.8.0-openjdk 8 | java -version 9 | cd kafka_2.12-3.3.1 10 | 11 | Start Zoo-keeper: 12 | ------------------------------- 13 | bin/zookeeper-server-start.sh config/zookeeper.properties 14 | 15 | Open another window to start kafka 16 | But first ssh to to your ec2 machine as done above 17 | 18 | 19 | Start Kafka-server: 20 | ---------------------------------------- 21 | Duplicate the session & enter in a new console -- 22 | export KAFKA_HEAP_OPTS="-Xmx256M -Xms128M" 23 | cd kafka_2.12-3.3.1 24 | bin/kafka-server-start.sh config/server.properties 25 | 26 | It is pointing to private server , change server.properties so that it can run in public IP 27 | 28 | To do this , you can follow any of the 2 approaches shared belwo -- 29 | Do a "sudo nano config/server.properties" - change ADVERTISED_LISTENERS to public ip of the EC2 instance 30 | 31 | 32 | Create the topic: 33 | ----------------------------- 34 | Duplicate the session & enter in a new console -- 35 | cd kafka_2.12-3.3.1 36 | bin/kafka-topics.sh --create --topic demo_testing2 --bootstrap-server {Put the Public IP of your EC2 Instance:9092} --replication-factor 1 --partitions 1 37 | 38 | Start Producer: 39 | -------------------------- 40 | bin/kafka-console-producer.sh --topic demo_testing2 --bootstrap-server {Put the Public IP of your EC2 Instance:9092} 41 | 42 | Start Consumer: 43 | ------------------------- 44 | Duplicate the session & enter in a new console -- 45 | cd kafka_2.12-3.3.1 46 | bin/kafka-console-consumer.sh --topic demo_testing2 --bootstrap-server {Put the Public IP of your EC2 Instance:9092} 47 | -------------------------------------------------------------------------------- /Stock Market Real-Time Data Analysis Using Kafka/KafkaConsumer.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "b6675043", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "from kafka import KafkaConsumer\n", 11 | "from time import sleep\n", 12 | "from json import dumps,loads\n", 13 | "import json\n", 14 | "from s3fs import S3FileSystem" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "id": "9eeff3ef", 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "consumer = KafkaConsumer(\n", 25 | " 'demo_test',\n", 26 | " bootstrap_servers=[':9092'], #add your IP here\n", 27 | " value_deserializer=lambda x: loads(x.decode('utf-8')))" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": null, 33 | "id": "eda5a608", 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "# for c in consumer:\n", 38 | "# print(c.value)" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "id": "8d60dc6c", 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "s3 = S3FileSystem()" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "id": "0f135e81", 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "for count, i in enumerate(consumer):\n", 59 | " with s3.open(\"s3://kafka-stock-market-tutorial-youtube-darshil/stock_market_{}.json\".format(count), 'w') as file:\n", 60 | " json.dump(i.value, file) " 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "id": "7b811cb6", 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [] 70 | } 71 | ], 72 | "metadata": { 73 | "kernelspec": { 74 | "display_name": "Python 3 (ipykernel)", 75 | "language": "python", 76 | "name": "python3" 77 | }, 78 | "language_info": { 79 | "codemirror_mode": { 80 | "name": "ipython", 81 | "version": 3 82 | }, 83 | "file_extension": ".py", 84 | "mimetype": "text/x-python", 85 | "name": "python", 86 | "nbconvert_exporter": "python", 87 | "pygments_lexer": "ipython3", 88 | "version": "3.10.0" 89 | } 90 | }, 91 | "nbformat": 4, 92 | "nbformat_minor": 5 93 | } 94 | -------------------------------------------------------------------------------- /Stock Market Real-Time Data Analysis Using Kafka/KafkaProducer.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "e2d3f594", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "pip install kafka-python" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "id": "f19405f0", 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "import pandas as pd\n", 21 | "from kafka import KafkaProducer\n", 22 | "from time import sleep\n", 23 | "from json import dumps\n", 24 | "import json" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "id": "b483a0e4", 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "producer = KafkaProducer(bootstrap_servers=[':9092'], #change ip here\n", 35 | " value_serializer=lambda x: \n", 36 | " dumps(x).encode('utf-8'))" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "id": "0c30b915", 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "producer.send('demo_test', value={'surnasdasdame':'parasdasdmar'})" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "id": "cc8d45aa", 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "df = pd.read_csv(\"data/indexProcessed.csv\")" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "id": "113a2516", 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "df.head()" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "id": "4c7ec0be", 73 | "metadata": {}, 74 | "outputs": [], 75 | "source": [ 76 | "while True:\n", 77 | " dict_stock = df.sample(1).to_dict(orient=\"records\")[0]\n", 78 | " producer.send('demo_test', value=dict_stock)\n", 79 | " sleep(1)" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "id": "ed71c0e4", 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "producer.flush() #clear data from kafka server" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "id": "5991d10f", 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "id": "3632d61d", 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [] 107 | } 108 | ], 109 | "metadata": { 110 | "kernelspec": { 111 | "display_name": "Python 3 (ipykernel)", 112 | "language": "python", 113 | "name": "python3" 114 | }, 115 | "language_info": { 116 | "codemirror_mode": { 117 | "name": "ipython", 118 | "version": 3 119 | }, 120 | "file_extension": ".py", 121 | "mimetype": "text/x-python", 122 | "name": "python", 123 | "nbconvert_exporter": "python", 124 | "pygments_lexer": "ipython3", 125 | "version": "3.10.0" 126 | } 127 | }, 128 | "nbformat": 4, 129 | "nbformat_minor": 5 130 | } 131 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 🔥Data-Engineering-Resources and Projects 2 | 3 | - https://github.com/DataEngineer-io/data-engineer-handbook 4 | 5 | 6 | ### Data Engineering Workflow 7 | 8 | Screenshot 2024-01-26 at 12 50 44 AM 9 | 10 | 11 | ### Big Data Architecture 12 | 13 | Screenshot 2024-01-26 at 12 53 23 AM 14 | 15 | 16 | ## 📚Books 17 | 18 | - [Designing Data-Intensive Applications]() 19 | - [Fundamentals of Data Engineering]() 20 | - [The Data Warehouse Toolkit]() 21 | - [Cracking the Data Engineering Interview]() 22 | - [Data Engineering with Python]() 23 | - [Data Pipelines with Apache Airflow]() 24 | - [The Data Warehouse Toolkit]() 25 | - [Big Data: Principles and Best Practices of Scalable Real-Time Data Systems]() 26 | - [Designing Data-Intensive Applications]() 27 | 28 | 29 | ## 🧰Tools for Data Engineers 30 | 31 | - **Basic Skills**: `Linux`, `Git & GitHub`, `Computer Networking`, `Cloud Computing`, `Network & Security`, `Agile Development` 32 | 33 | - **Advanced Skills (Good to Know)**: ` Data Lake & Data WareHouse Concepts`, `REST APIs`, `Databases(SQL & NoSQL)` 34 | 35 | - **Programming Languages**: `Python`, `SQL`, `Java`, `Scala` 36 | 37 | - **Databases**: `PostgreSQL`, `MongoDB`, `Neo4j`, `Redis`, `Cassandra`, `Apache HBase`, `Snowflake`, `InfluxDB` 38 | 39 | - Data Ingestion: `Apache Kafka`, `Flume`, `Logstash`, `Airbyte`, `Apache Spark`, `Talend`, `Informatica` 40 | 41 | - Data Tranformation: `Python`, `Pandas`, `SQL`, `Apache Spark`, `Hive`, `dbt`, `Matillion`, `Pig` 42 | 43 | - Data Preprocessing: `Apache Spark`, `Apache Hadoop`, `Apache Flink` 44 | 45 | - Data Orchestration: `Apache Airflow`, `Luigi` 46 | 47 | - Data Storage: `Data Lake`: AWS S3, Azure Blob Storage, Google Cloud Storage, `Data Warehouse`: Snowflake, Google BigQuery, Amazon Redshift, Apache Hive 48 | 49 | - Data Visualization: `Tableau`, `PowerBI`, `Looker` 50 | 51 | - DataOps: `Docker`, `Kubernetes`, `Jenkins` 52 | 53 | 54 | ### Here's what's on the menu: 👇 55 | 56 | - 🐍 Python, 57 | - 📊 SQL, 58 | - 🛠️ MySQL, 59 | - 🌳 MongoDB, 60 | - 🔥 PySpark, 61 | - 🎈 Bash, 62 | - 🌬️ Airflow, 63 | - ☕ Apache Kafka, 64 | - 🐙 Git, 65 | - 🐈 GitHub, 66 | - ⚙️ CICD basics, 67 | - 🏬 Data Warehousing, 68 | - 🛠️ DBT, 69 | - 🌊 Data Lakes, 70 | - 📘 DataBricks, 71 | - ☁️ Azure Databricks, 72 | - ❄️ Snowflake, 73 | - 🌪️ Apache NiFi, 74 | - 🌐 Debezium 75 | 76 | 77 | 1. Master Python: https://lnkd.in/d-pZPyf5 78 | 79 | 2. Learn SQL: https://lnkd.in/dzAiRF-x 80 | 81 | 3. Get hands-on with MySQL: https://lnkd.in/ddpSkUhc 82 | 83 | 4. Dive into MongoDB: https://lnkd.in/dHQ4VC2E 84 | 85 | 5. Master PySpark: https://lnkd.in/d7fgs7dE 86 | 87 | 6. Discover Bash, Airflow & Kafka: https://lnkd.in/dDhuEqQE 88 | 89 | 7. Master Git & GitHub: https://lnkd.in/dqJ7J3kN 90 | 91 | 8. Understand CICD basics: https://lnkd.in/dcfKBmCa 92 | 93 | 9. Decode Data Warehousing: https://lnkd.in/dPVRDJT5 94 | 95 | 10. Learn DBT: https://lnkd.in/eG9eaEuE 96 | 97 | 11. Understand Data Lakes: https://lnkd.in/dtZKJ4d6 98 | 99 | 12. Explore DataBricks: https://lnkd.in/dCBiQXPR 100 | 101 | 13. Learn Azure Databricks: https://lnkd.in/dzmwBs4Y 102 | 103 | 14. Master Snowflake: https://lnkd.in/dDBeddVy 104 | 105 | 15. Explore Apache NiFi: https://lnkd.in/de7bvnSt 106 | 107 | 108 | ## 📙Projects 109 | 110 | | Sr. No. | Projects | Description | Tech Stack | Tags | Code Link | 111 | |--------|-----------|--------------|-----------|-----|------------| 112 | | 01. | [Build ETL Pipeline Using AWS Cloud]() | 113 | | 02. | [Covid Data Analysis Project]() | 114 | | 03. | [Twitter Data Pipeline using Airflow and AWS]() | 115 | | 04. | [YouTube Data Analysis (End-To-End Data Engineering Project)]() | 116 | | 05. | [Olympic Data Analytics: End-To-End Azure Data Engineering Project]() | 117 | | 06. | [Uber Data Analytics Project On GCP]() | 118 | | 07. | [Data Ingestion and ETL Pipeline using Azure]() | 119 | | 08. | [Indian Stock Market Real-Time Data Processing, Analysis & Visualization using Azure Stream Analytics]() | 120 | | 09. | [Simple Stock Market ETL Process with SQL]() | 121 | 122 | ## 🔶 Free Learning Resources 123 | 124 | | Tools | Link| Used for | Official Docs | Youtube | 125 | | ------|---------|---------------|----------|-----| 126 | | DBMS | - [MySQL](https://lnkd.in/ddpSkUhc) - [MongoDB](https://lnkd.in/dHQ4VC2E) 127 | | SQL | https://lnkd.in/dzAiRF-x 128 | | Python | https://lnkd.in/d-pZPyf5 | 129 | | Linux | 130 | | Data Warehouse & Lake Concepts | - [Data Warehouse](https://lnkd.in/dPVRDJT5) - [Data Lakes](https://lnkd.in/dtZKJ4d6) 131 | | Data Pipelines | 132 | | DBT | https://lnkd.in/eG9eaEuE 133 | | PySpark | https://lnkd.in/d7fgs7dE 134 | | Kafka | 135 | | Apache Nifi | https://lnkd.in/de7bvnSt 136 | | Airflow | 137 | | Databricks | https://lnkd.in/dCBiQXPR 138 | | Snowflake | https://lnkd.in/dDBeddVy 139 | | Cloud Computing Concepts | 140 | | Distributed Systems fundamentals | 141 | | AWS | 142 | | Azure | 143 | | GCP | 144 | | Git & GitHub | https://lnkd.in/dqJ7J3kN 145 | | CI/CD | https://lnkd.in/dcfKBmCa 146 | | Jenkins | 147 | | Github Actions | 148 | | Terraform | 149 | | Sonarqube | 150 | | Docker | 151 | | Kubernetes | 152 | | Power BI | 153 | | Tableau | 154 | | Apache Superset | 155 | | Prometheus | 156 | | Graphana | 157 | | Datadog | 158 | 159 | 160 | ## 💼 Read Real-World Case Studies -> Tech Blogs 161 | 1. Netflix - https://netflixtechblog.medium.com/ 162 | 2. AWS - https://aws.amazon.com/solutions/case-studies/ 163 | 3. GCP - https://cloud.google.com/customers 164 | 4. Azure - https://azure.microsoft.com/en-us/resources/customer-stories/ 165 | 5. Spotify - https://engineering.atspotify.com/category/data/ 166 | 6. MongoDB - https://www.mongodb.com/blog/all 167 | 7. Swiggy - https://bytes.swiggy.com/the-swiggy-delivery-challenge-part-one-6a2abb4f82f6 168 | - https://bytes.swiggy.com/swiggy-distance-service-9868dcf613f4 169 | - https://bytes.swiggy.com/the-tech-that-brings-you-your-food-1a7926229886 170 | 8. Zomato - https://blog.zomato.com/ 171 | 172 | 173 | 174 | --------------------------------------------------------------------------------