├── architecture.jpeg
├── s3_cli_command.sh
├── lambda_function.py
├── README.md
└── pyspark_code.py

/architecture.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/darshilparmar/dataengineering-youtube-analysis-project/HEAD/architecture.jpeg

--------------------------------------------------------------------------------
/s3_cli_command.sh:
--------------------------------------------------------------------------------
# Replace the bucket name with your own bucket name

# Copy all JSON reference data to a single location:
aws s3 cp . s3://de-on-youtube-raw-useast1-dev/youtube/raw_statistics_reference_data/ --recursive --exclude "*" --include "*.json"

# Copy each region's data file to its own location, following the Hive-style partition pattern:
aws s3 cp CAvideos.csv s3://de-on-youtube-raw-useast1-dev/youtube/raw_statistics/region=ca/
aws s3 cp DEvideos.csv s3://de-on-youtube-raw-useast1-dev/youtube/raw_statistics/region=de/
aws s3 cp FRvideos.csv s3://de-on-youtube-raw-useast1-dev/youtube/raw_statistics/region=fr/
aws s3 cp GBvideos.csv s3://de-on-youtube-raw-useast1-dev/youtube/raw_statistics/region=gb/
aws s3 cp INvideos.csv s3://de-on-youtube-raw-useast1-dev/youtube/raw_statistics/region=in/
aws s3 cp JPvideos.csv s3://de-on-youtube-raw-useast1-dev/youtube/raw_statistics/region=jp/
aws s3 cp KRvideos.csv s3://de-on-youtube-raw-useast1-dev/youtube/raw_statistics/region=kr/
aws s3 cp MXvideos.csv s3://de-on-youtube-raw-useast1-dev/youtube/raw_statistics/region=mx/
aws s3 cp RUvideos.csv s3://de-on-youtube-raw-useast1-dev/youtube/raw_statistics/region=ru/
aws s3 cp USvideos.csv s3://de-on-youtube-raw-useast1-dev/youtube/raw_statistics/region=us/

--------------------------------------------------------------------------------
/lambda_function.py:
--------------------------------------------------------------------------------
import awswrangler as wr
import pandas as pd
import urllib.parse
import os

# Target location, Glue catalog names, and write behaviour are supplied via environment variables
os_input_s3_cleansed_layer = os.environ['s3_cleansed_layer']
os_input_glue_catalog_db_name = os.environ['glue_catalog_db_name']
os_input_glue_catalog_table_name = os.environ['glue_catalog_table_name']
os_input_write_data_operation = os.environ['write_data_operation']


def lambda_handler(event, context):
    # Get the bucket and object key from the S3 event notification
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    try:
        # Create a DataFrame from the JSON object's content
        df_raw = wr.s3.read_json('s3://{}/{}'.format(bucket, key))

        # Extract the required columns by flattening the 'items' array
        df_step_1 = pd.json_normalize(df_raw['items'])

        # Write to S3 as Parquet and register/update the table in the Glue catalog
        wr_response = wr.s3.to_parquet(
            df=df_step_1,
            path=os_input_s3_cleansed_layer,
            dataset=True,
            database=os_input_glue_catalog_db_name,
            table=os_input_glue_catalog_table_name,
            mode=os_input_write_data_operation
        )

        return wr_response
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e
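
# --- Local invocation sketch (added for illustration; not part of the deployed function) ---
# A minimal S3 put-notification payload for exercising lambda_handler by hand.
# The bucket name comes from s3_cli_command.sh; the object key is only an assumed
# example of a category reference file, and the four environment variables read at
# the top of this module must be set before the module is imported.
if __name__ == "__main__":
    test_event = {
        "Records": [
            {
                "s3": {
                    "bucket": {"name": "de-on-youtube-raw-useast1-dev"},
                    "object": {"key": "youtube/raw_statistics_reference_data/CA_category_id.json"},
                }
            }
        ]
    }
    print(lambda_handler(test_event, None))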
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Data Engineering YouTube Analysis Project by Darshil Parmar

## Overview

This project securely manages, streamlines, and analyzes structured and semi-structured YouTube video data based on video categories and trending metrics.

## Project Goals
1. Data Ingestion — Build a mechanism to ingest data from different sources
2. ETL System — The data arrives in raw format; transform it into the proper format
3. Data Lake — Data comes from multiple sources, so we need a centralized repository to store it
4. Scalability — As the size of our data increases, we need to make sure our system scales with it
5. Cloud — We can't process vast amounts of data on a local machine, so we need the cloud; in this case, we will use AWS
6. Reporting — Build a dashboard to get answers to the questions we asked earlier

## Services we will be using
1. Amazon S3: An object storage service that provides industry-leading scalability, data availability, security, and performance.
2. AWS IAM: Identity and Access Management, which enables us to manage access to AWS services and resources securely.
3. QuickSight: Amazon QuickSight is a scalable, serverless, embeddable, machine-learning-powered business intelligence (BI) service built for the cloud.
4. AWS Glue: A serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
5. AWS Lambda: A compute service that lets you run code without provisioning or managing servers.
6. AWS Athena: An interactive query service for data in S3; there is no need to load the data because it stays in S3.

## Dataset Used
This Kaggle dataset contains statistics (CSV files) on daily trending YouTube videos, collected over many months. Up to 200 trending videos are recorded per day across many regions, and each region's data sits in its own CSV file. The fields include the video title, channel title, publish time, tags, views, likes and dislikes, description, and comment count. Each region also has an associated JSON file containing the category_id mapping, which differs by region; a short sketch of how the CSV and JSON files fit together follows the dataset link below.

https://www.kaggle.com/datasets/datasnaek/youtube-new
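
To make the relationship between the two file types concrete, here is a small local sketch (not part of the pipeline) that flattens a category reference JSON the same way `lambda_function.py` does and joins the category titles onto one region's CSV. The file names and the `items`/`snippet.title` layout are assumptions about the Kaggle download, so adjust them to the files you actually have.

```python
import json
import pandas as pd

# Assumed file names from the Kaggle download; pick any region you like.
with open("CA_category_id.json") as f:
    reference = json.load(f)

# Same flattening step the Lambda applies: one row per category.
categories = pd.json_normalize(reference["items"])[["id", "snippet.title"]]
categories.columns = ["category_id", "category_title"]

videos = pd.read_csv("CAvideos.csv")

# Align key types before joining (the CSV stores category_id as an integer).
videos["category_id"] = videos["category_id"].astype(str)
videos = videos.merge(categories, on="category_id", how="left")
print(videos[["title", "category_title"]].head())
```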

## Architecture Diagram

![Architecture Diagram](architecture.jpeg)

## Complete Tutorial
I have created a detailed 3+ hour tutorial on this project, in which you will build everything from start to finish.

https://youtu.be/yZKJFKu49Dk

--------------------------------------------------------------------------------
/pyspark_code.py:
--------------------------------------------------------------------------------
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

############################### Added by Carlos ###############################
from awsglue.dynamicframe import DynamicFrame


## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "db_youtube_raw", table_name = "raw_statistics", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
############################### Added by Darshil ###############################
# Read only the selected region partitions; the predicate is pushed down so the other partitions are never scanned
predicate_pushdown = "region in ('ca','gb','us')"

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db_youtube_raw", table_name = "raw_statistics", transformation_ctx = "datasource0", push_down_predicate = predicate_pushdown)

## @type: ApplyMapping
## @args: [mapping = [("video_id", "string", "video_id", "string"), ("trending_date", "string", "trending_date", "string"), ("title", "string", "title", "string"), ("channel_title", "string", "channel_title", "string"), ("category_id", "long", "category_id", "long"), ("publish_time", "string", "publish_time", "string"), ("tags", "string", "tags", "string"), ("views", "long", "views", "long"), ("likes", "long", "likes", "long"), ("dislikes", "long", "dislikes", "long"), ("comment_count", "long", "comment_count", "long"), ("thumbnail_link", "string", "thumbnail_link", "string"), ("comments_disabled", "boolean", "comments_disabled", "boolean"), ("ratings_disabled", "boolean", "ratings_disabled", "boolean"), ("video_error_or_removed", "boolean", "video_error_or_removed", "boolean"), ("description", "string", "description", "string"), ("region", "string", "region", "string")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("video_id", "string", "video_id", "string"), ("trending_date", "string", "trending_date", "string"), ("title", "string", "title", "string"), ("channel_title", "string", "channel_title", "string"), ("category_id", "long", "category_id", "long"), ("publish_time", "string", "publish_time", "string"), ("tags", "string", "tags", "string"), ("views", "long", "views", "long"), ("likes", "long", "likes", "long"), ("dislikes", "long", "dislikes", "long"), ("comment_count", "long", "comment_count", "long"), ("thumbnail_link", "string", "thumbnail_link", "string"), ("comments_disabled", "boolean", "comments_disabled", "boolean"), ("ratings_disabled", "boolean", "ratings_disabled", "boolean"), ("video_error_or_removed", "boolean", "video_error_or_removed", "boolean"), ("description", "string", "description", "string"), ("region", "string", "region", "string")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_struct", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://bigdata-on-youtube-cleansed-euwest1-14317621-dev/youtube/raw_statistics/"}, format = "parquet", transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]

############################### Added by Darshil ###############################
# Make sure you copy only what is needed

# Coalesce to a single output file, convert back to a DynamicFrame, and write the cleansed data to S3 partitioned by region
datasink1 = dropnullfields3.toDF().coalesce(1)
df_final_output = DynamicFrame.fromDF(datasink1, glueContext, "df_final_output")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = df_final_output, connection_type = "s3", connection_options = {"path": "s3://de-on-youtube-cleansed-useast1-dev/youtube/raw_statistics/", "partitionKeys": ["region"]}, format = "parquet", transformation_ctx = "datasink4")

############################### Added by Darshil ###############################
job.commit()

--------------------------------------------------------------------------------