├── Uber Data Analytics End-to-End
│   └── readme.md
├── Youtube Data Analysis End-to-End
│   └── readme.md
├── COVID-19 Data Analysis End-to-End Project
│   └── readme.md
├── Olympics data analytics - end to end Azure data engineering
│   └── readme.md
├── Stock Market Real-Time Data Analysis Using Kafka
│   ├── Architecture.jpg
│   ├── readme.md
│   ├── command_kafka.txt
│   ├── KafkaConsumer.ipynb
│   └── KafkaProducer.ipynb
├── Indian Stock Market Real-Time Data Analysis and Visualization
│   └── readme.md
├── twitter-airflow-data-engineering-project
│   └── readme.md
└── README.md
/Uber Data Analytics End-to-End/readme.md:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/Youtube Data Analysis End-to-End/readme.md:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/COVID-19 Data Analysis End-to-End Project/readme.md:
--------------------------------------------------------------------------------
1 | # COVID-19 Data Analysis Project
2 |
3 |
--------------------------------------------------------------------------------
/Olympics data analytics - end to end Azure data engineering/readme.md:
--------------------------------------------------------------------------------
1 | ## Olympics Data Analytics Project
2 |
3 |
--------------------------------------------------------------------------------
/Stock Market Real-Time Data Analysis Using Kafka/Architecture.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cybergeekgyan/Data-Engineering-Portfolio/HEAD/Stock%20Market%20Real-Time%20Data%20Analysis%20Using%20Kafka/Architecture.jpg
--------------------------------------------------------------------------------
/Indian Stock Market Real-Time Data Analysis and Visualization/readme.md:
--------------------------------------------------------------------------------
1 | # Indian Stock Market Real-Time Data Analysis and Visualization
2 |
3 | ## Requirements
4 |
5 | - Python
6 | - Azure Account
7 | - Stock API
8 | - Kafka
9 |
10 |
11 |
12 | ## Steps
13 |
14 |
15 |
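As a rough starting point, the ingestion step might look like the minimal sketch below (the stock API endpoint, topic name, and broker address are placeholders, not part of this repo):

```python
# ingestion sketch -- illustrative only; swap in your actual stock API and broker details
import json
import time

import requests
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["<broker-host>:9092"],  # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # hypothetical REST endpoint returning the latest quote as JSON
    quote = requests.get("https://example.com/api/quote?symbol=RELIANCE").json()
    producer.send("indian_stocks", value=quote)  # placeholder topic name
    time.sleep(1)                                # poll roughly once per second
```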
--------------------------------------------------------------------------------
/twitter-airflow-data-engineering-project/readme.md:
--------------------------------------------------------------------------------
1 | ## Twitter Data Pipeline using Airflow
2 |
3 | `Pandas` `Python` `Tweepy` `Airflow` `AWS`
4 |
5 | ## Steps
6 |
7 | - Extract data from Twitter using the Twitter API --> `Tweepy`
8 | - Create a data pipeline to automate these tasks using --> `Apache Airflow`
9 | - Transform the data using `Pandas` and store it in AWS S3 (see the DAG sketch below)
10 |
11 |
12 |
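A minimal sketch of the orchestration piece, assuming the Tweepy extraction and Pandas transformation live in a `run_twitter_etl()` function (the module name, schedule, and dates below are illustrative, not taken from this repo):

```python
# twitter_dag.py -- illustrative sketch of the Airflow DAG
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# hypothetical module holding the Tweepy extract + Pandas transform + S3 upload logic
from twitter_etl import run_twitter_etl

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="twitter_dag",
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # pull and transform fresh tweets once a day
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_twitter_etl",
        python_callable=run_twitter_etl,
    )
```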
--------------------------------------------------------------------------------
/Stock Market Real-Time Data Analysis Using Kafka/readme.md:
--------------------------------------------------------------------------------
1 | # Stock Market Real-Time Data Analysis Using Kafka
2 |
3 | An end-to-end Indian stock market data engineering project that analyzes real-time market data using Kafka
4 |
5 | ## Architecture
6 |
7 | 
8 |
9 | ## Prerequisites
10 |
11 | - Python --> Programming Language
12 | - AWS Account --> Cloud Service
13 | - AWS S3 --> Object Storage
14 | - EC2 --> Virtual Server
15 | - Apache Kafka --> Real-Time Streaming
16 | - Apache Airflow --> Workflow Orchestration
17 | - Glue Crawler & Glue Catalog --> Schema Discovery & Metadata Store
18 | - Athena --> Serverless SQL Query Engine (example query below)
19 | - SQL --> Query Language
20 |
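Once the Glue Crawler has catalogued the JSON files that the consumer lands in S3, the data can be queried straight from Athena. A minimal sketch (the database, table, and column names below are illustrative, not taken from this repo):

```sql
-- hypothetical Glue Catalog database/table built over the S3 JSON files
SELECT "index", "date", "close"
FROM stock_market_db.stock_market_data
ORDER BY "date" DESC
LIMIT 10;
```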
--------------------------------------------------------------------------------
/Stock Market Real-Time Data Analysis Using Kafka/command_kafka.txt:
--------------------------------------------------------------------------------
1 | wget https://downloads.apache.org/kafka/3.3.1/kafka_2.12-3.3.1.tgz
2 | tar -xvf kafka_2.12-3.3.1.tgz
3 |
4 |
5 | -----------------------
6 | java -version
7 | sudo yum install java-1.8.0-openjdk
8 | java -version
9 | cd kafka_2.12-3.3.1
10 |
11 | Start ZooKeeper:
12 | -------------------------------
13 | bin/zookeeper-server-start.sh config/zookeeper.properties
14 |
15 | Open another window to start Kafka,
16 | but first SSH into your EC2 machine as done above
17 |
18 |
19 | Start Kafka-server:
20 | ----------------------------------------
21 | Duplicate the session & enter in a new console --
22 | export KAFKA_HEAP_OPTS="-Xmx256M -Xms128M"
23 | cd kafka_2.12-3.3.1
24 | bin/kafka-server-start.sh config/server.properties
25 |
26 | By default the broker advertises its private address; change server.properties so that it is reachable on the public IP
27 |
28 | To do this, edit the broker config as shown below --
29 | Do a "sudo nano config/server.properties" and change advertised.listeners to the public IP of the EC2 instance
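For example (the address below is a placeholder for your instance's actual public IP):
advertised.listeners=PLAINTEXT://<EC2-public-IP>:9092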
30 |
31 |
32 | Create the topic:
33 | -----------------------------
34 | Duplicate the session & enter in a new console --
35 | cd kafka_2.12-3.3.1
36 | bin/kafka-topics.sh --create --topic demo_testing2 --bootstrap-server {Put the Public IP of your EC2 Instance:9092} --replication-factor 1 --partitions 1
37 |
38 | Start Producer:
39 | --------------------------
40 | bin/kafka-console-producer.sh --topic demo_testing2 --bootstrap-server {Put the Public IP of your EC2 Instance:9092}
41 |
42 | Start Consumer:
43 | -------------------------
44 | Duplicate the session & enter in a new console --
45 | cd kafka_2.12-3.3.1
46 | bin/kafka-console-consumer.sh --topic demo_testing2 --bootstrap-server {Put the Public IP of your EC2 Instance:9092}
47 |
--------------------------------------------------------------------------------
/Stock Market Real-Time Data Analysis Using Kafka/KafkaConsumer.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "id": "b6675043",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "from kafka import KafkaConsumer\n",
11 | "from time import sleep\n",
12 | "from json import dumps,loads\n",
13 | "import json\n",
14 | "from s3fs import S3FileSystem"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": null,
20 | "id": "9eeff3ef",
21 | "metadata": {},
22 | "outputs": [],
23 | "source": [
24 | "consumer = KafkaConsumer(\n",
25 | " 'demo_test',\n",
26 | " bootstrap_servers=[':9092'], #add your IP here\n",
27 | " value_deserializer=lambda x: loads(x.decode('utf-8')))"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": null,
33 | "id": "eda5a608",
34 | "metadata": {},
35 | "outputs": [],
36 | "source": [
37 | "# for c in consumer:\n",
38 | "# print(c.value)"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": null,
44 | "id": "8d60dc6c",
45 | "metadata": {},
46 | "outputs": [],
47 | "source": [
48 | "s3 = S3FileSystem()"
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": null,
54 | "id": "0f135e81",
55 | "metadata": {},
56 | "outputs": [],
57 | "source": [
58 | "for count, i in enumerate(consumer):\n",
59 | " with s3.open(\"s3://kafka-stock-market-tutorial-youtube-darshil/stock_market_{}.json\".format(count), 'w') as file:\n",
60 | " json.dump(i.value, file) "
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "execution_count": null,
66 | "id": "7b811cb6",
67 | "metadata": {},
68 | "outputs": [],
69 | "source": []
70 | }
71 | ],
72 | "metadata": {
73 | "kernelspec": {
74 | "display_name": "Python 3 (ipykernel)",
75 | "language": "python",
76 | "name": "python3"
77 | },
78 | "language_info": {
79 | "codemirror_mode": {
80 | "name": "ipython",
81 | "version": 3
82 | },
83 | "file_extension": ".py",
84 | "mimetype": "text/x-python",
85 | "name": "python",
86 | "nbconvert_exporter": "python",
87 | "pygments_lexer": "ipython3",
88 | "version": "3.10.0"
89 | }
90 | },
91 | "nbformat": 4,
92 | "nbformat_minor": 5
93 | }
94 |
--------------------------------------------------------------------------------
/Stock Market Real-Time Data Analysis Using Kafka/KafkaProducer.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "id": "e2d3f594",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "pip install kafka-python"
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": null,
16 | "id": "f19405f0",
17 | "metadata": {},
18 | "outputs": [],
19 | "source": [
20 | "import pandas as pd\n",
21 | "from kafka import KafkaProducer\n",
22 | "from time import sleep\n",
23 | "from json import dumps\n",
24 | "import json"
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": null,
30 | "id": "b483a0e4",
31 | "metadata": {},
32 | "outputs": [],
33 | "source": [
34 | "producer = KafkaProducer(bootstrap_servers=[':9092'], #change ip here\n",
35 | " value_serializer=lambda x: \n",
36 | " dumps(x).encode('utf-8'))"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": null,
42 | "id": "0c30b915",
43 | "metadata": {},
44 | "outputs": [],
45 | "source": [
46 | "producer.send('demo_test', value={'surnasdasdame':'parasdasdmar'})"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": null,
52 | "id": "cc8d45aa",
53 | "metadata": {},
54 | "outputs": [],
55 | "source": [
56 | "df = pd.read_csv(\"data/indexProcessed.csv\")"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": null,
62 | "id": "113a2516",
63 | "metadata": {},
64 | "outputs": [],
65 | "source": [
66 | "df.head()"
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": null,
72 | "id": "4c7ec0be",
73 | "metadata": {},
74 | "outputs": [],
75 | "source": [
76 | "while True:\n",
77 | " dict_stock = df.sample(1).to_dict(orient=\"records\")[0]\n",
78 | " producer.send('demo_test', value=dict_stock)\n",
79 | " sleep(1)"
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": null,
85 | "id": "ed71c0e4",
86 | "metadata": {},
87 | "outputs": [],
88 | "source": [
89 | "producer.flush() #clear data from kafka server"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": null,
95 | "id": "5991d10f",
96 | "metadata": {},
97 | "outputs": [],
98 | "source": []
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": null,
103 | "id": "3632d61d",
104 | "metadata": {},
105 | "outputs": [],
106 | "source": []
107 | }
108 | ],
109 | "metadata": {
110 | "kernelspec": {
111 | "display_name": "Python 3 (ipykernel)",
112 | "language": "python",
113 | "name": "python3"
114 | },
115 | "language_info": {
116 | "codemirror_mode": {
117 | "name": "ipython",
118 | "version": 3
119 | },
120 | "file_extension": ".py",
121 | "mimetype": "text/x-python",
122 | "name": "python",
123 | "nbconvert_exporter": "python",
124 | "pygments_lexer": "ipython3",
125 | "version": "3.10.0"
126 | }
127 | },
128 | "nbformat": 4,
129 | "nbformat_minor": 5
130 | }
131 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # 🔥Data-Engineering-Resources and Projects
2 |
3 | - https://github.com/DataEngineer-io/data-engineer-handbook
4 |
5 |
6 | ### Data Engineering Workflow
7 |
8 |
9 |
10 |
11 | ### Big Data Architecture
12 |
13 |
14 |
15 |
16 | ## 📚Books
17 |
18 | - [Designing Data-Intensive Applications]()
19 | - [Fundamentals of Data Engineering]()
20 | - [The Data Warehouse Toolkit]()
21 | - [Cracking the Data Engineering Interview]()
22 | - [Data Engineering with Python]()
23 | - [Data Pipelines with Apache Airflow]()
24 | - [Big Data: Principles and Best Practices of Scalable Real-Time Data Systems]()
27 |
28 |
29 | ## 🧰Tools for Data Engineers
30 |
31 | - **Basic Skills**: `Linux`, `Git & GitHub`, `Computer Networking`, `Cloud Computing`, `Network & Security`, `Agile Development`
32 |
33 | - **Advanced Skills (Good to Know)**: `Data Lake & Data Warehouse Concepts`, `REST APIs`, `Databases (SQL & NoSQL)`
34 |
35 | - **Programming Languages**: `Python`, `SQL`, `Java`, `Scala`
36 |
37 | - **Databases**: `PostgreSQL`, `MongoDB`, `Neo4j`, `Redis`, `Cassandra`, `Apache HBase`, `Snowflake`, `InfluxDB`
38 |
39 | - **Data Ingestion**: `Apache Kafka`, `Flume`, `Logstash`, `Airbyte`, `Apache Spark`, `Talend`, `Informatica`
40 |
41 | - **Data Transformation**: `Python`, `Pandas`, `SQL`, `Apache Spark`, `Hive`, `dbt`, `Matillion`, `Pig`
42 |
43 | - **Data Processing**: `Apache Spark`, `Apache Hadoop`, `Apache Flink`
44 |
45 | - **Data Orchestration**: `Apache Airflow`, `Luigi`
46 |
47 | - **Data Storage**: Data Lake (`AWS S3`, `Azure Blob Storage`, `Google Cloud Storage`); Data Warehouse (`Snowflake`, `Google BigQuery`, `Amazon Redshift`, `Apache Hive`)
48 |
49 | - **Data Visualization**: `Tableau`, `PowerBI`, `Looker`
50 |
51 | - **DataOps**: `Docker`, `Kubernetes`, `Jenkins`
52 |
53 |
54 | ### Here's what's on the menu: 👇
55 |
56 | - 🐍 Python,
57 | - 📊 SQL,
58 | - 🛠️ MySQL,
59 | - 🌳 MongoDB,
60 | - 🔥 PySpark,
61 | - 🎈 Bash,
62 | - 🌬️ Airflow,
63 | - ☕ Apache Kafka,
64 | - 🐙 Git,
65 | - 🐈 GitHub,
66 | - ⚙️ CICD basics,
67 | - 🏬 Data Warehousing,
68 | - 🛠️ DBT,
69 | - 🌊 Data Lakes,
70 | - 📘 DataBricks,
71 | - ☁️ Azure Databricks,
72 | - ❄️ Snowflake,
73 | - 🌪️ Apache NiFi,
74 | - 🌐 Debezium
75 |
76 |
77 | 1. Master Python: https://lnkd.in/d-pZPyf5
78 |
79 | 2. Learn SQL: https://lnkd.in/dzAiRF-x
80 |
81 | 3. Get hands-on with MySQL: https://lnkd.in/ddpSkUhc
82 |
83 | 4. Dive into MongoDB: https://lnkd.in/dHQ4VC2E
84 |
85 | 5. Master PySpark: https://lnkd.in/d7fgs7dE
86 |
87 | 6. Discover Bash, Airflow & Kafka: https://lnkd.in/dDhuEqQE
88 |
89 | 7. Master Git & GitHub: https://lnkd.in/dqJ7J3kN
90 |
91 | 8. Understand CICD basics: https://lnkd.in/dcfKBmCa
92 |
93 | 9. Decode Data Warehousing: https://lnkd.in/dPVRDJT5
94 |
95 | 10. Learn DBT: https://lnkd.in/eG9eaEuE
96 |
97 | 11. Understand Data Lakes: https://lnkd.in/dtZKJ4d6
98 |
99 | 12. Explore DataBricks: https://lnkd.in/dCBiQXPR
100 |
101 | 13. Learn Azure Databricks: https://lnkd.in/dzmwBs4Y
102 |
103 | 14. Master Snowflake: https://lnkd.in/dDBeddVy
104 |
105 | 15. Explore Apache NiFi: https://lnkd.in/de7bvnSt
106 |
107 |
108 | ## 📙Projects
109 |
110 | | Sr. No. | Projects | Description | Tech Stack | Tags | Code Link |
111 | |--------|-----------|--------------|-----------|-----|------------|
112 | | 01. | [Build ETL Pipeline Using AWS Cloud]() | | | | |
113 | | 02. | [Covid Data Analysis Project]() | | | | |
114 | | 03. | [Twitter Data Pipeline using Airflow and AWS]() | | | | |
115 | | 04. | [YouTube Data Analysis (End-To-End Data Engineering Project)]() | | | | |
116 | | 05. | [Olympic Data Analytics: End-To-End Azure Data Engineering Project]() | | | | |
117 | | 06. | [Uber Data Analytics Project On GCP]() | | | | |
118 | | 07. | [Data Ingestion and ETL Pipeline using Azure]() | | | | |
119 | | 08. | [Indian Stock Market Real-Time Data Processing, Analysis & Visualization using Azure Stream Analytics]() | | | | |
120 | | 09. | [Simple Stock Market ETL Process with SQL]() | | | | |
121 |
122 | ## 🔶 Free Learning Resources
123 |
124 | | Tools | Link | Used for | Official Docs | YouTube |
125 | | ------|---------|---------------|----------|-----|
126 | | DBMS | [MySQL](https://lnkd.in/ddpSkUhc), [MongoDB](https://lnkd.in/dHQ4VC2E) | | | |
127 | | SQL | https://lnkd.in/dzAiRF-x | | | |
128 | | Python | https://lnkd.in/d-pZPyf5 | | | |
129 | | Linux | | | | |
130 | | Data Warehouse & Lake Concepts | [Data Warehouse](https://lnkd.in/dPVRDJT5), [Data Lakes](https://lnkd.in/dtZKJ4d6) | | | |
131 | | Data Pipelines | | | | |
132 | | DBT | https://lnkd.in/eG9eaEuE | | | |
133 | | PySpark | https://lnkd.in/d7fgs7dE | | | |
134 | | Kafka | | | | |
135 | | Apache NiFi | https://lnkd.in/de7bvnSt | | | |
136 | | Airflow | | | | |
137 | | Databricks | https://lnkd.in/dCBiQXPR | | | |
138 | | Snowflake | https://lnkd.in/dDBeddVy | | | |
139 | | Cloud Computing Concepts | | | | |
140 | | Distributed Systems Fundamentals | | | | |
141 | | AWS | | | | |
142 | | Azure | | | | |
143 | | GCP | | | | |
144 | | Git & GitHub | https://lnkd.in/dqJ7J3kN | | | |
145 | | CI/CD | https://lnkd.in/dcfKBmCa | | | |
146 | | Jenkins | | | | |
147 | | GitHub Actions | | | | |
148 | | Terraform | | | | |
149 | | SonarQube | | | | |
150 | | Docker | | | | |
151 | | Kubernetes | | | | |
152 | | Power BI | | | | |
153 | | Tableau | | | | |
154 | | Apache Superset | | | | |
155 | | Prometheus | | | | |
156 | | Grafana | | | | |
157 | | Datadog | | | | |
158 |
159 |
160 | ## 💼 Read Real-World Case Studies -> Tech Blogs
161 | 1. Netflix - https://netflixtechblog.medium.com/
162 | 2. AWS - https://aws.amazon.com/solutions/case-studies/
163 | 3. GCP - https://cloud.google.com/customers
164 | 4. Azure - https://azure.microsoft.com/en-us/resources/customer-stories/
165 | 5. Spotify - https://engineering.atspotify.com/category/data/
166 | 6. MongoDB - https://www.mongodb.com/blog/all
167 | 7. Swiggy - https://bytes.swiggy.com/the-swiggy-delivery-challenge-part-one-6a2abb4f82f6
168 |    - https://bytes.swiggy.com/swiggy-distance-service-9868dcf613f4
169 |    - https://bytes.swiggy.com/the-tech-that-brings-you-your-food-1a7926229886
170 | 8. Zomato - https://blog.zomato.com/
171 |
172 |
173 |
174 |
--------------------------------------------------------------------------------