├── Uber Data Analytics End-to-End
│   └── readme.md
├── Youtube Data Analysis End-to-End
│   └── readme.md
├── COVID-19 Data Analysis End-to-End Project
│   └── readme.md
├── Olympics data analytics - end to end Azure data engineering
│   └── readme.md
├── Stock Market Real-Time Data Analysis Using Kafka
│   ├── Architecture.jpg
│   ├── readme.md
│   ├── command_kafka.txt
│   ├── KafkaConsumer.ipynb
│   └── KafkaProducer.ipynb
├── Indian Stock Market Real-Time Data Analysis and Visualization
│   └── readme.md
├── twitter-airflow-data-engineering-project
│   └── readme.md
└── README.md

--------------------------------------------------------------------------------
/Uber Data Analytics End-to-End/readme.md:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Youtube Data Analysis End-to-End/readme.md:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/COVID-19 Data Analysis End-to-End Project/readme.md:
--------------------------------------------------------------------------------
# COVID-19 Data Analysis Project

--------------------------------------------------------------------------------
/Olympics data analytics - end to end Azure data engineering/readme.md:
--------------------------------------------------------------------------------
## Olympics Data Analytics Project

--------------------------------------------------------------------------------
/Stock Market Real-Time Data Analysis Using Kafka/Architecture.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cybergeekgyan/Data-Engineering-Portfolio/HEAD/Stock Market Real-Time Data Analysis Using Kafka/Architecture.jpg

--------------------------------------------------------------------------------
/Indian Stock Market Real-Time Data Analysis and Visualization/readme.md:
--------------------------------------------------------------------------------
# Indian Stock Market Real-Time Data Analysis and Visualization

## Requirements

- Python
- Azure Account
- Stock API
- Kafka

## Steps

--------------------------------------------------------------------------------
/twitter-airflow-data-engineering-project/readme.md:
--------------------------------------------------------------------------------
## Twitter Data Pipeline using Airflow

`Pandas` `Python` `Tweepy` `Airflow` `AWS`

## Steps

- Extract data from Twitter using the Twitter API --> `Tweepy`
- Create a data pipeline that automates these tasks --> `Apache Airflow`
- Transform the data with Pandas and store it in AWS S3 (a minimal sketch of this ETL step follows the list)
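
The sketch below shows what the extract-transform-load step could look like. It is a hedged example, not this project's exact code: it assumes Tweepy v4's `Client` with a bearer token, uses an example Twitter handle, and writes to a hypothetical S3 bucket (`df.to_csv("s3://...")` works only when `s3fs` is installed). In the full project, a function like this would be wrapped in an Airflow task.

```python
import pandas as pd
import tweepy

BEARER_TOKEN = "<your-bearer-token>"           # from the Twitter developer portal
S3_PATH = "s3://my-twitter-bucket/tweets.csv"  # hypothetical bucket name


def run_twitter_etl():
    # Extract: pull recent tweets for one user via the v2 API
    client = tweepy.Client(bearer_token=BEARER_TOKEN)
    user = client.get_user(username="nasa")    # example handle
    tweets = client.get_users_tweets(
        id=user.data.id,
        max_results=100,
        tweet_fields=["created_at", "public_metrics"],
    )

    # Transform: flatten the tweet objects into a DataFrame
    rows = [
        {
            "id": t.id,
            "text": t.text,
            "created_at": t.created_at,
            "likes": t.public_metrics["like_count"],
        }
        for t in tweets.data or []
    ]
    df = pd.DataFrame(rows)

    # Load: pandas writes straight to S3 when s3fs is installed
    df.to_csv(S3_PATH, index=False)


if __name__ == "__main__":
    run_twitter_etl()
```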
--------------------------------------------------------------------------------
/Stock Market Real-Time Data Analysis Using Kafka/readme.md:
--------------------------------------------------------------------------------
# Stock Market Real-Time Data Analysis Using Kafka

An end-to-end data engineering project that streams and analyzes real-time Indian stock market data using Kafka.

## Architecture

![architecture](https://github.com/cybergeekgyan/Data-Engineering-Portfolio/blob/main/Stock%20Market%20Real-Time%20Data%20Analysis%20Using%20Kafka/Architecture.jpg)

## Prerequisites

- Python --> programming language
- AWS Account --> cloud services
- AWS S3 --> object storage
- EC2 --> hosts the Kafka broker
- Apache Kafka
- Apache Airflow
- Glue Crawler & Glue Catalog
- Athena
- SQL
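
## Querying the data with Athena

Once the consumer has landed JSON records in S3 and a Glue crawler has cataloged them, Athena can query the files in place. A minimal sketch — the database and table names below are hypothetical (they depend on how you configure the crawler), and the column names depend on what the crawler infers from the producer's records:

```sql
-- stock_db.stock_market are hypothetical names created by your Glue crawler
SELECT *
FROM stock_db.stock_market
LIMIT 10;
```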
"source": [ 58 | "for count, i in enumerate(consumer):\n", 59 | " with s3.open(\"s3://kafka-stock-market-tutorial-youtube-darshil/stock_market_{}.json\".format(count), 'w') as file:\n", 60 | " json.dump(i.value, file) " 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "id": "7b811cb6", 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [] 70 | } 71 | ], 72 | "metadata": { 73 | "kernelspec": { 74 | "display_name": "Python 3 (ipykernel)", 75 | "language": "python", 76 | "name": "python3" 77 | }, 78 | "language_info": { 79 | "codemirror_mode": { 80 | "name": "ipython", 81 | "version": 3 82 | }, 83 | "file_extension": ".py", 84 | "mimetype": "text/x-python", 85 | "name": "python", 86 | "nbconvert_exporter": "python", 87 | "pygments_lexer": "ipython3", 88 | "version": "3.10.0" 89 | } 90 | }, 91 | "nbformat": 4, 92 | "nbformat_minor": 5 93 | } 94 | -------------------------------------------------------------------------------- /Stock Market Real-Time Data Analysis Using Kafka/KafkaProducer.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "e2d3f594", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "pip install kafka-python" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "id": "f19405f0", 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "import pandas as pd\n", 21 | "from kafka import KafkaProducer\n", 22 | "from time import sleep\n", 23 | "from json import dumps\n", 24 | "import json" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "id": "b483a0e4", 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "producer = KafkaProducer(bootstrap_servers=[':9092'], #change ip here\n", 35 | " value_serializer=lambda x: \n", 36 | " dumps(x).encode('utf-8'))" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "id": "0c30b915", 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "producer.send('demo_test', value={'surnasdasdame':'parasdasdmar'})" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "id": "cc8d45aa", 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "df = pd.read_csv(\"data/indexProcessed.csv\")" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "id": "113a2516", 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "df.head()" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "id": "4c7ec0be", 73 | "metadata": {}, 74 | "outputs": [], 75 | "source": [ 76 | "while True:\n", 77 | " dict_stock = df.sample(1).to_dict(orient=\"records\")[0]\n", 78 | " producer.send('demo_test', value=dict_stock)\n", 79 | " sleep(1)" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "id": "ed71c0e4", 86 | "metadata": {}, 87 | "outputs": [], 88 | "source": [ 89 | "producer.flush() #clear data from kafka server" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "id": "5991d10f", 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "id": "3632d61d", 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [] 107 | } 108 | ], 109 | "metadata": { 110 | "kernelspec": { 111 | "display_name": "Python 3 (ipykernel)", 112 | "language": "python", 

Create the topic:
-----------------------------
Duplicate the session & enter in a new console --
cd kafka_2.12-3.3.1
bin/kafka-topics.sh --create --topic demo_testing2 --bootstrap-server {public IP of your EC2 instance}:9092 --replication-factor 1 --partitions 1

Start producer:
--------------------------
bin/kafka-console-producer.sh --topic demo_testing2 --bootstrap-server {public IP of your EC2 instance}:9092

Start consumer:
-------------------------
Duplicate the session & enter in a new console --
cd kafka_2.12-3.3.1
bin/kafka-console-consumer.sh --topic demo_testing2 --bootstrap-server {public IP of your EC2 instance}:9092
--------------------------------------------------------------------------------
/Stock Market Real-Time Data Analysis Using Kafka/KafkaConsumer.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6675043",
   "metadata": {},
   "outputs": [],
   "source": [
    "from kafka import KafkaConsumer\n",
    "from json import loads\n",
    "import json\n",
    "from s3fs import S3FileSystem"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9eeff3ef",
   "metadata": {},
   "outputs": [],
   "source": [
    "# subscribe to the topic the producer writes to\n",
    "consumer = KafkaConsumer(\n",
    "    'demo_test',\n",
    "    bootstrap_servers=[':9092'],  # add your broker's public IP here\n",
    "    value_deserializer=lambda x: loads(x.decode('utf-8')))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eda5a608",
   "metadata": {},
   "outputs": [],
   "source": [
    "# sanity check: print incoming messages\n",
    "# for c in consumer:\n",
    "#     print(c.value)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d60dc6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# picks up AWS credentials from the environment\n",
    "s3 = S3FileSystem()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f135e81",
   "metadata": {},
   "outputs": [],
   "source": [
    "# write each consumed record to S3 as its own JSON file\n",
    "for count, i in enumerate(consumer):\n",
    "    with s3.open(\"s3://kafka-stock-market-tutorial-youtube-darshil/stock_market_{}.json\".format(count), 'w') as file:\n",
    "        json.dump(i.value, file)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
--------------------------------------------------------------------------------
/Stock Market Real-Time Data Analysis Using Kafka/KafkaProducer.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e2d3f594",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install kafka-python"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f19405f0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from kafka import KafkaProducer\n",
    "from time import sleep\n",
    "from json import dumps"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b483a0e4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# serialize each message as UTF-8 encoded JSON\n",
    "producer = KafkaProducer(bootstrap_servers=[':9092'],  # add your broker's public IP here\n",
    "                         value_serializer=lambda x:\n",
    "                         dumps(x).encode('utf-8'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0c30b915",
   "metadata": {},
   "outputs": [],
   "source": [
    "# smoke test: send one dummy message to the topic\n",
    "producer.send('demo_test', value={'test_key': 'test_value'})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cc8d45aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"data/indexProcessed.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "113a2516",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c7ec0be",
   "metadata": {},
   "outputs": [],
   "source": [
    "# stream one random row per second to simulate a live market feed\n",
    "while True:\n",
    "    dict_stock = df.sample(1).to_dict(orient=\"records\")[0]\n",
    "    producer.send('demo_test', value=dict_stock)\n",
    "    sleep(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed71c0e4",
   "metadata": {},
   "outputs": [],
   "source": [
    "producer.flush()  # block until all buffered messages reach the broker"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# 🔥Data-Engineering-Resources and Projects

- https://github.com/DataEngineer-io/data-engineer-handbook


### Data Engineering Workflow

*(workflow diagram)*


### Big Data Architecture

*(architecture diagram)*


## 📚Books

- [Designing Data-Intensive Applications]()
- [Fundamentals of Data Engineering]()
- [The Data Warehouse Toolkit]()
- [Cracking the Data Engineering Interview]()
- [Data Engineering with Python]()
- [Data Pipelines with Apache Airflow]()
- [Big Data: Principles and Best Practices of Scalable Real-Time Data Systems]()


## 🧰Tools for Data Engineers

- **Basic Skills**: `Linux`, `Git & GitHub`, `Computer Networking`, `Cloud Computing`, `Network & Security`, `Agile Development`

- **Advanced Skills (Good to Know)**: `Data Lake & Data Warehouse Concepts`, `REST APIs`, `Databases (SQL & NoSQL)`

- **Programming Languages**: `Python`, `SQL`, `Java`, `Scala`

- **Databases**: `PostgreSQL`, `MongoDB`, `Neo4j`, `Redis`, `Cassandra`, `Apache HBase`, `Snowflake`, `InfluxDB`

- **Data Ingestion**: `Apache Kafka`, `Flume`, `Logstash`, `Airbyte`, `Apache Spark`, `Talend`, `Informatica`

- **Data Transformation**: `Python`, `Pandas`, `SQL`, `Apache Spark`, `Hive`, `dbt`, `Matillion`, `Pig`

- **Data Processing**: `Apache Spark`, `Apache Hadoop`, `Apache Flink`

- **Data Orchestration**: `Apache Airflow`, `Luigi` (see the DAG sketch after this list)

- **Data Storage**: `Data Lake`: AWS S3, Azure Blob Storage, Google Cloud Storage; `Data Warehouse`: Snowflake, Google BigQuery, Amazon Redshift, Apache Hive

- **Data Visualization**: `Tableau`, `Power BI`, `Looker`

- **DataOps**: `Docker`, `Kubernetes`, `Jenkins`
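
To make the orchestration entry concrete, here is a minimal Airflow DAG sketch. It is a hedged example rather than code from any project in this repo: `twitter_etl_dag` and `run_twitter_etl` are hypothetical names (the callable stands in for an ETL function like the one sketched in the Twitter project readme), and it assumes Airflow 2.x.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_twitter_etl():
    # hypothetical placeholder: extract tweets, transform with Pandas, load to S3
    ...


with DAG(
    dag_id="twitter_etl_dag",        # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",      # run once per day
    catchup=False,                   # do not backfill past runs
) as dag:
    PythonOperator(
        task_id="run_twitter_etl",
        python_callable=run_twitter_etl,
    )
```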

### Here's what's on the menu: 👇

- 🐍 Python
- 📊 SQL
- 🛠️ MySQL
- 🌳 MongoDB
- 🔥 PySpark
- 🎈 Bash
- 🌬️ Airflow
- ☕ Apache Kafka
- 🐙 Git
- 🐈 GitHub
- ⚙️ CI/CD basics
- 🏬 Data Warehousing
- 🛠️ dbt
- 🌊 Data Lakes
- 📘 Databricks
- ☁️ Azure Databricks
- ❄️ Snowflake
- 🌪️ Apache NiFi
- 🌐 Debezium


1. Master Python: https://lnkd.in/d-pZPyf5
2. Learn SQL: https://lnkd.in/dzAiRF-x
3. Get hands-on with MySQL: https://lnkd.in/ddpSkUhc
4. Dive into MongoDB: https://lnkd.in/dHQ4VC2E
5. Master PySpark: https://lnkd.in/d7fgs7dE
6. Discover Bash, Airflow & Kafka: https://lnkd.in/dDhuEqQE
7. Master Git & GitHub: https://lnkd.in/dqJ7J3kN
8. Understand CI/CD basics: https://lnkd.in/dcfKBmCa
9. Decode Data Warehousing: https://lnkd.in/dPVRDJT5
10. Learn dbt: https://lnkd.in/eG9eaEuE
11. Understand Data Lakes: https://lnkd.in/dtZKJ4d6
12. Explore Databricks: https://lnkd.in/dCBiQXPR
13. Learn Azure Databricks: https://lnkd.in/dzmwBs4Y
14. Master Snowflake: https://lnkd.in/dDBeddVy
15. Explore Apache NiFi: https://lnkd.in/de7bvnSt


## 📙Projects

| Sr. No. | Projects | Description | Tech Stack | Tags | Code Link |
|---------|----------|-------------|------------|------|-----------|
| 01. | [Build ETL Pipeline Using AWS Cloud]() | | | | |
| 02. | [Covid Data Analysis Project]() | | | | |
| 03. | [Twitter Data Pipeline using Airflow and AWS]() | | | | |
| 04. | [YouTube Data Analysis (End-To-End Data Engineering Project)]() | | | | |
| 05. | [Olympic Data Analytics: End-To-End Azure Data Engineering Project]() | | | | |
| 06. | [Uber Data Analytics Project On GCP]() | | | | |
| 07. | [Data Ingestion and ETL Pipeline using Azure]() | | | | |
| 08. | [Indian Stock Market Real-Time Data Processing, Analysis & Visualization using Azure Stream Analytics]() | | | | |
| 09. | [Simple Stock Market ETL Process with SQL]() | | | | |

## 🔶 Free Learning Resources

| Tools | Link | Used for | Official Docs | Youtube |
|-------|------|----------|---------------|---------|
| DBMS | [MySQL](https://lnkd.in/ddpSkUhc), [MongoDB](https://lnkd.in/dHQ4VC2E) | | | |
| SQL | https://lnkd.in/dzAiRF-x | | | |
| Python | https://lnkd.in/d-pZPyf5 | | | |
| Linux | | | | |
| Data Warehouse & Lake Concepts | [Data Warehouse](https://lnkd.in/dPVRDJT5), [Data Lakes](https://lnkd.in/dtZKJ4d6) | | | |
| Data Pipelines | | | | |
| dbt | https://lnkd.in/eG9eaEuE | | | |
| PySpark | https://lnkd.in/d7fgs7dE | | | |
| Kafka | | | | |
| Apache NiFi | https://lnkd.in/de7bvnSt | | | |
| Airflow | | | | |
| Databricks | https://lnkd.in/dCBiQXPR | | | |
| Snowflake | https://lnkd.in/dDBeddVy | | | |
| Cloud Computing Concepts | | | | |
| Distributed Systems Fundamentals | | | | |
| AWS | | | | |
| Azure | | | | |
| GCP | | | | |
| Git & GitHub | https://lnkd.in/dqJ7J3kN | | | |
| CI/CD | https://lnkd.in/dcfKBmCa | | | |
| Jenkins | | | | |
| GitHub Actions | | | | |
| Terraform | | | | |
| SonarQube | | | | |
| Docker | | | | |
| Kubernetes | | | | |
| Power BI | | | | |
| Tableau | | | | |
| Apache Superset | | | | |
| Prometheus | | | | |
| Grafana | | | | |
| Datadog | | | | |


## 💼 Read Real-World Case Studies -> Tech Blogs

1. Netflix - https://netflixtechblog.medium.com/
2. AWS - https://aws.amazon.com/solutions/case-studies/
3. GCP - https://cloud.google.com/customers
4. Azure - https://azure.microsoft.com/en-us/resources/customer-stories/
5. Spotify - https://engineering.atspotify.com/category/data/
6. MongoDB - https://www.mongodb.com/blog/all
7. Swiggy - https://bytes.swiggy.com/the-swiggy-delivery-challenge-part-one-6a2abb4f82f6
   - https://bytes.swiggy.com/swiggy-distance-service-9868dcf613f4
   - https://bytes.swiggy.com/the-tech-that-brings-you-your-food-1a7926229886
8. Zomato - https://blog.zomato.com/

--------------------------------------------------------------------------------