├── requirements.txt ├── data_dbt ├── seeds │ └── .gitkeep ├── tests │ └── .gitkeep ├── analyses │ └── .gitkeep ├── macros │ ├── .gitkeep │ └── capitalize_replace.sql ├── snapshots │ └── .gitkeep ├── models │ ├── staging │ │ ├── stg_ratings.sql │ │ ├── stg_books.sql │ │ ├── stg_users.sql │ │ └── schema.yml │ └── core │ │ ├── dim_rating_by_countries.sql │ │ ├── facts_full_ratings.sql │ │ ├── schema.yml │ │ └── dim_rating_by_age_range.sql ├── README.md └── dbt_project.yml ├── pyrightconfig.json ├── terraform ├── locals.tf ├── variables.tf ├── .terraform.lock.hcl └── main.tf ├── screenshots ├── run_dag.png ├── dag_graph.png ├── airflow_home.png ├── architecture.png ├── dags_index.png ├── metabase_home.png ├── DE_2024_Dashboard.pdf ├── DE_2024_Dashboard.png └── bigquery_schema_1.png ├── .gitignore ├── data_airflow └── dags │ ├── sql │ └── load-dwh.sql │ └── book-recommendation-dag.py ├── LICENSE ├── docker-compose.yaml └── README.md /requirements.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /data_dbt/seeds/.gitkeep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /data_dbt/tests/.gitkeep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /data_dbt/analyses/.gitkeep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /data_dbt/macros/.gitkeep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /data_dbt/snapshots/.gitkeep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pyrightconfig.json: -------------------------------------------------------------------------------- 1 | { 2 | "venvPath": ".", 3 | "venv": ".venv" 4 | } 5 | -------------------------------------------------------------------------------- /terraform/locals.tf: -------------------------------------------------------------------------------- 1 | locals { 2 | DE_2004_PROJECT_DATALAKE = "book_recommendation_datalake" 3 | } 4 | -------------------------------------------------------------------------------- /screenshots/run_dag.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iamraphson/DE-2024-project-book-recommendation/HEAD/screenshots/run_dag.png -------------------------------------------------------------------------------- /screenshots/dag_graph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iamraphson/DE-2024-project-book-recommendation/HEAD/screenshots/dag_graph.png -------------------------------------------------------------------------------- /screenshots/airflow_home.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iamraphson/DE-2024-project-book-recommendation/HEAD/screenshots/airflow_home.png -------------------------------------------------------------------------------- /screenshots/architecture.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/iamraphson/DE-2024-project-book-recommendation/HEAD/screenshots/architecture.png -------------------------------------------------------------------------------- /screenshots/dags_index.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iamraphson/DE-2024-project-book-recommendation/HEAD/screenshots/dags_index.png -------------------------------------------------------------------------------- /screenshots/metabase_home.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iamraphson/DE-2024-project-book-recommendation/HEAD/screenshots/metabase_home.png -------------------------------------------------------------------------------- /screenshots/DE_2024_Dashboard.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iamraphson/DE-2024-project-book-recommendation/HEAD/screenshots/DE_2024_Dashboard.pdf -------------------------------------------------------------------------------- /screenshots/DE_2024_Dashboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iamraphson/DE-2024-project-book-recommendation/HEAD/screenshots/DE_2024_Dashboard.png -------------------------------------------------------------------------------- /screenshots/bigquery_schema_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iamraphson/DE-2024-project-book-recommendation/HEAD/screenshots/bigquery_schema_1.png -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | data_airflow/logs/* 2 | .venv/* 3 | *__pycache__* 4 | .env 5 | terraform/.terraform/* 6 | terraform/terraform.tfstate 7 | terraform/terraform.tfstate.backup 8 | logs/* 9 | data_dbt/target/* 10 | data_dbt/dbt_packages/* 11 | data_dbt/logs/* 12 | -------------------------------------------------------------------------------- /data_dbt/models/staging/stg_ratings.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | materialized='view' 4 | ) 5 | }} 6 | 7 | with ratingsdata as ( 8 | SELECT 9 | * 10 | FROM 11 | {{ source ('staging', 'ratings') }} 12 | ) 13 | 14 | select 15 | user_id, 16 | isbn, 17 | rating 18 | from ratingsdata 19 | -------------------------------------------------------------------------------- /data_dbt/models/staging/stg_books.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | materialized='view' 4 | ) 5 | }} 6 | 7 | with booksdata as ( 8 | SELECT 9 | * 10 | FROM 11 | {{ source ('staging', 'books') }} where year_of_publication <> 0 12 | ) 13 | 14 | select 15 | isbn, 16 | book_title, 17 | book_author, 18 | year_of_publication, 19 | publisher 20 | from booksdata 21 | -------------------------------------------------------------------------------- /data_dbt/models/core/dim_rating_by_countries.sql: -------------------------------------------------------------------------------- 1 | {{ config(materialized='table') }} 2 | 3 | with ratings as ( 4 | select * from {{ ref('stg_ratings') }} 5 | ), 6 | users as ( 7 | select* from {{ ref('stg_users') }} 8 | ) 9 | 10 | 
select 11 | users.country as country, 12 | count(ratings.rating) as total_ratings 13 | from ratings 14 | join users on ratings.user_id = users.user_id 15 | group by country 16 | -------------------------------------------------------------------------------- /data_dbt/macros/capitalize_replace.sql: -------------------------------------------------------------------------------- 1 | 2 | {% macro capitalize_replace(column_name) %} 3 | {% if target.type == 'bigquery' %} 4 | replace(INITCAP({{ column_name }}), '"', '') 5 | {% elif target.type == 'postgres' %} 6 | replace(INITCAP({{ column_name }}), '"', '') 7 | {% elif target.type == 'snowflake' %} 8 | replace(INITCAP({{ column_name }}), '"', '') 9 | {% else %} 10 | replace(UPPER(LEFT({{ column_name }}, 1)) || LOWER(SUBSTRING({{ column_name }}, 2)), '"', '') 11 | {% endif %} 12 | {% endmacro %} 13 | -------------------------------------------------------------------------------- /data_dbt/README.md: -------------------------------------------------------------------------------- 1 | Welcome to your new dbt project! 2 | 3 | ### Using the starter project 4 | 5 | Try running the following commands: 6 | - dbt run 7 | - dbt test 8 | 9 | 10 | ### Resources: 11 | - Learn more about dbt [in the docs](https://docs.getdbt.com/docs/introduction) 12 | - Check out [Discourse](https://discourse.getdbt.com/) for commonly asked questions and answers 13 | - Join the [chat](https://community.getdbt.com/) on Slack for live discussions and support 14 | - Find [dbt events](https://events.getdbt.com) near you 15 | - Check out [the blog](https://blog.getdbt.com/) for the latest news on dbt's development and best practices 16 | -------------------------------------------------------------------------------- /data_dbt/models/core/facts_full_ratings.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | materialized='table' 4 | ) 5 | }} 6 | 7 | with ratings as ( 8 | select * from {{ ref('stg_ratings') }} 9 | ), 10 | books as( 11 | select* from {{ ref('stg_books') }} 12 | ), 13 | users as ( 14 | select* from {{ ref('stg_users') }} 15 | ) 16 | 17 | 18 | select 19 | ratings.user_id as user_id, 20 | ratings.isbn as isbn, 21 | ratings.rating as rating, 22 | users.age as age, 23 | users.city as city, 24 | users.state as state, 25 | users.country as country, 26 | books.book_title as title, 27 | books.book_author as author, 28 | books.year_of_publication as year_of_publication, 29 | books.publisher as publisher 30 | from ratings 31 | join users on ratings.user_id = users.user_id 32 | join books on ratings.isbn = books.isbn 33 | -------------------------------------------------------------------------------- /data_airflow/dags/sql/load-dwh.sql: -------------------------------------------------------------------------------- 1 | -- books table 2 | DROP TABLE IF EXISTS `{{ BOOK_RECOMMENDATION_WH }}.books`; 3 | 4 | CREATE TABLE IF NOT EXISTS 5 | `{{ BOOK_RECOMMENDATION_WH }}.books` 6 | CLUSTER BY 7 | `year_of_publication`, 8 | `publisher`, 9 | `book_author` AS 10 | SELECT 11 | * 12 | FROM 13 | `{{BOOK_RECOMMENDATION_WH_EXT}}.books`; 14 | 15 | 16 | -- users 17 | DROP TABLE IF EXISTS `{{ BOOK_RECOMMENDATION_WH }}.users`; 18 | 19 | CREATE TABLE IF NOT EXISTS 20 | `{{ BOOK_RECOMMENDATION_WH }}.users` 21 | CLUSTER BY 22 | `country`, 23 | `state`, 24 | `city` AS 25 | SELECT 26 | * 27 | FROM 28 | `{{BOOK_RECOMMENDATION_WH_EXT}}.users`; 29 | 30 | -- rating 31 | DROP TABLE IF EXISTS `{{ BOOK_RECOMMENDATION_WH }}.ratings`; 32 | 33 | 
CREATE TABLE IF NOT EXISTS 34 | `{{ BOOK_RECOMMENDATION_WH }}.ratings` 35 | CLUSTER BY 36 | `user_id`, 37 | `isbn` AS 38 | SELECT 39 | * 40 | FROM 41 | `{{BOOK_RECOMMENDATION_WH_EXT}}.ratings`; 42 | -------------------------------------------------------------------------------- /data_dbt/models/core/schema.yml: -------------------------------------------------------------------------------- 1 | version: 2 2 | 3 | models: 4 | - name: facts_full_ratings 5 | description: > 6 | This fact models contains table combination from rating, books and users. 7 | This will help dashboard presentations. 8 | - name: dim_rating_by_countries 9 | description: Aggregated table of all rating by countries. 10 | columns: 11 | - name: country 12 | data_type: string 13 | description: Column for countries 14 | 15 | - name: total_ratings 16 | data_type: numeric 17 | description: Total rating from countries 18 | 19 | - name: dim_rating_by_age_range 20 | description: Aggregated table of all rating from some age range. 21 | columns: 22 | - name: age_range 23 | data_type: string 24 | description: List of age range 25 | 26 | - name: number_of_rating 27 | data_type: numeric 28 | description: Number of rating from age range 29 | -------------------------------------------------------------------------------- /terraform/variables.tf: -------------------------------------------------------------------------------- 1 | variable "project" { 2 | type = string 3 | description = "GCP project ID" 4 | default = "radiant-gateway-412001" 5 | } 6 | 7 | variable "region" { 8 | type = string 9 | description = "Region for GCP resources. Choose as per your location: https://cloud.google.com/about/locations" 10 | default = "us-west1" 11 | } 12 | 13 | variable "storage_class" { 14 | type = string 15 | description = "The Storage Class of the new bucket. Ref: https://cloud.google.com/storage/docs/storage-classes" 16 | default = "STANDARD" 17 | } 18 | 19 | variable "book_recommendataion_ext_datasets" { 20 | type = string 21 | description = "Dataset in BigQuery where raw data (external tables) will be loaded." 22 | default = "book_recommendataion_wh" 23 | } 24 | 25 | variable "book_recommendation_analytics_datasets" { 26 | type = string 27 | description = "Dataset in BigQuery where raw data (from Google Cloud Storage and DBT) will be loaded." 
28 | default = "book_recommendation_analytics" 29 | } 30 | -------------------------------------------------------------------------------- /data_dbt/models/core/dim_rating_by_age_range.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | materialized='table' 4 | ) 5 | }} 6 | 7 | WITH age_range_ratings AS ( 8 | SELECT 9 | CASE 10 | WHEN age = 0 THEN 'Unknown age' 11 | WHEN age BETWEEN 1 AND 9 THEN '1-9' 12 | WHEN age BETWEEN 10 AND 19 THEN '10-19' 13 | WHEN age BETWEEN 20 AND 29 THEN '20-29' 14 | WHEN age BETWEEN 30 AND 39 THEN '30-39' 15 | WHEN age BETWEEN 40 AND 49 THEN '40-49' 16 | WHEN age BETWEEN 50 AND 59 THEN '50-59' 17 | WHEN age BETWEEN 60 AND 69 THEN '60-69' 18 | WHEN age BETWEEN 70 AND 79 THEN '70-79' 19 | WHEN age BETWEEN 80 AND 89 THEN '80-89' 20 | ELSE '90+' 21 | END AS age_range, 22 | age, 23 | rating 24 | FROM 25 | {{ ref('facts_full_ratings') }} 26 | ) 27 | 28 | SELECT 29 | age_range, 30 | COUNT(rating) AS number_of_rating 31 | FROM 32 | age_range_ratings 33 | GROUP BY 34 | age_range 35 | ORDER BY 36 | MIN(age) 37 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Olusegun Ayeni 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /terraform/.terraform.lock.hcl: -------------------------------------------------------------------------------- 1 | # This file is maintained automatically by "terraform init". 2 | # Manual edits may be lost in future updates. 
3 | 4 | provider "registry.terraform.io/hashicorp/google" { 5 | version = "5.21.0" 6 | constraints = "5.21.0" 7 | hashes = [ 8 | "h1:XFwEEqXVi0PaGzYI2p65DxfWggnQItX5AsRCveTOmT4=", 9 | "zh:4185b880504af117898f6b0c50fd8459ac4b9d5fd7c2ceaf6fc5b18d4d920978", 10 | "zh:56bbb4ae9cfbd1a9c3008e911605f1edccfa6a1048d88dba0c92f347441d274d", 11 | "zh:59246baa783208f5b51ad9d4a4008a327128e234f4a8d1d629cf0af6ae6a9249", 12 | "zh:989e7e07a46e486f791a82f20bf0a2f73b64464fe6a97925edc164eeee33d980", 13 | "zh:9945cce3c36e4e95c74a2f71c38cb0d042014bef555fdeb07e6d92bc624b7567", 14 | "zh:b276a2e6ba9a9d2cd3127b6cc9a9cdf6cca2db1fbe4ad4a5332025ae3c7c9bb6", 15 | "zh:d1af7f76ef64a808dcaeabbeb74f27a7925665082209db019ca79e6e06fe3ab2", 16 | "zh:d7954e905704b4f158c592e0d8c3d8d54a9edd6f8392d2fa3dfc9f0fe29795d8", 17 | "zh:e85724a917887ac00112ca4edafdc2b233c787c2892f386dafa9dfd3215083c0", 18 | "zh:ebadb8e5b387914e118ecbf83e08a72d034fe069e9e5b0cefa857b758479f835", 19 | "zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c", 20 | "zh:fb38ef67430bcf8e07144f77b76b47df35acd38f5e553fe7104ecfe54378bb9e", 21 | ] 22 | } 23 | -------------------------------------------------------------------------------- /data_dbt/dbt_project.yml: -------------------------------------------------------------------------------- 1 | # Name your project! Project names should contain only lowercase characters 2 | # and underscores. A good package name should reflect your organization's 3 | # name or the intended use of these models 4 | name: "data_dbt" 5 | version: "1.0.0" 6 | config-version: 2 7 | 8 | # This setting configures which "profile" dbt uses for this project. 9 | profile: "data_dbt_book_recommendation" 10 | 11 | # These configurations specify where dbt should look for different types of files. 12 | # The `model-paths` config, for example, states that models in this project can be 13 | # found in the "models/" directory. You probably won't need to change these! 14 | model-paths: ["models"] 15 | analysis-paths: ["analyses"] 16 | test-paths: ["tests"] 17 | seed-paths: ["seeds"] 18 | macro-paths: ["macros"] 19 | snapshot-paths: ["snapshots"] 20 | 21 | clean-targets: # directories to be removed by `dbt clean` 22 | - "target" 23 | - "dbt_packages" 24 | 25 | # Configuring models 26 | # Full documentation: https://docs.getdbt.com/docs/configuring-models 27 | 28 | # In this example config, we tell dbt to build all models in the example/ 29 | # directory as views. These settings can be overridden in the individual model 30 | # files using the `{{ config(...) }}` macro. 
31 | models: 32 | data_dbt: 33 | # Config indicated by + and applies to all files under models/example/ 34 | staging: 35 | +materialized: view 36 | -------------------------------------------------------------------------------- /terraform/main.tf: -------------------------------------------------------------------------------- 1 | terraform { 2 | required_providers { 3 | google = { 4 | source = "hashicorp/google" 5 | version = "5.21.0" 6 | } 7 | } 8 | } 9 | 10 | provider "google" { 11 | project = var.project 12 | region = var.region 13 | } 14 | 15 | resource "google_storage_bucket" "book_recommendation_datalake" { 16 | name = "${local.DE_2004_PROJECT_DATALAKE}_${var.project}" 17 | location = var.region 18 | 19 | storage_class = var.storage_class 20 | uniform_bucket_level_access = true 21 | public_access_prevention = "enforced" 22 | 23 | versioning { 24 | enabled = true 25 | } 26 | 27 | lifecycle_rule { 28 | action { 29 | type = "Delete" 30 | } 31 | condition { 32 | age = 10 //days 33 | } 34 | } 35 | 36 | force_destroy = true 37 | } 38 | 39 | resource "google_bigquery_dataset" "book_recommendataion_ext_dataset" { 40 | project = var.project 41 | location = var.region 42 | dataset_id = var.book_recommendataion_ext_datasets 43 | delete_contents_on_destroy = true 44 | } 45 | 46 | resource "google_bigquery_dataset" "book_recommendation_analytics_dataset" { 47 | project = var.project 48 | location = var.region 49 | dataset_id = var.book_recommendation_analytics_datasets 50 | delete_contents_on_destroy = true 51 | } 52 | -------------------------------------------------------------------------------- /data_dbt/models/staging/stg_users.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | materialized='view' 4 | ) 5 | }} 6 | 7 | with usersdata as ( 8 | SELECT 9 | * 10 | FROM 11 | {{ source ('staging', 'users') }} 12 | WHERE LOWER(country) NOT IN('', 'n/a') AND LOWER(country) not like '%n/a%' 13 | ) 14 | 15 | select 16 | user_id, 17 | age, 18 | {{ capitalize_replace('city') }} AS city, 19 | {{ capitalize_replace('state') }} AS state, 20 | CASE 21 | WHEN lower(country) IN ('usa', 'united states', 'united state', 'us', 'u.s.a.', 'america', 'u.s.a>', 'united staes', 'united states of america', 'Csa', 'San Franicsco', 'U.S. 
Of A.') THEN 'United states' 22 | WHEN lower(country) IN ('italia', 'l`italia', 'ferrara') THEN 'Italy' 23 | WHEN lower(country) IN ('u.a.e') THEN 'United Arab Emirates' 24 | WHEN lower(country) IN ('c.a.', 'canada') THEN 'Canada' 25 | WHEN lower(country) IN ('nz') THEN 'New Zealand' 26 | WHEN lower(country) IN ('urugua') THEN 'Uruguay' 27 | WHEN lower(country) IN ('p.r.china', 'china') THEN 'China' 28 | WHEN lower(country) IN ('trinidad and tobago', 'tobago') THEN 'Trinidad And Tobago' 29 | WHEN lower(country) IN ('united kingdom', 'u.k.', 'england', 'wales', 'united kindgonm') THEN 'United Kingdom' 30 | ELSE {{ capitalize_replace('country') }} 31 | END AS country 32 | from usersdata 33 | where ( 34 | lower(country) not like '%far away%' AND 35 | lower(country) not in ( 36 | 'quit', 37 | 'here and there', 38 | 'everywhere and anywhere', 39 | 'x', 40 | 'k1c7b1', 41 | 'we`re global!', 42 | '中国', 43 | 'lkjlj', 44 | '', 45 | 'Space', 46 | 'Tdzimi', 47 | 'Usa (Currently Living In England)', 48 | 'Ua', 49 | 'Universe' 50 | ) 51 | ) 52 | -------------------------------------------------------------------------------- /data_dbt/models/staging/schema.yml: -------------------------------------------------------------------------------- 1 | version: 2 2 | 3 | sources: 4 | - name: staging 5 | database: radiant-gateway-412001 6 | schema: book_recommendation_analytics 7 | tables: 8 | - name: books 9 | - name: ratings 10 | - name: users 11 | models: 12 | - name: stg_users 13 | description: List of Users that used the book store. 14 | columns: 15 | - name: user_id 16 | data_type: numeric 17 | description: id assigned to the user. 18 | - name: age 19 | date_type: numeric 20 | description: Age of the user. 21 | 22 | - name: city 23 | date_type: string 24 | description: city of the user. 25 | 26 | - name: state 27 | date_type: string 28 | description: state of the user. 29 | 30 | - name: state 31 | date_type: string 32 | description: state of the user. 33 | 34 | - name: country 35 | date_type: string 36 | description: Ratings given to books by users. 37 | 38 | - name: stg_books 39 | description: List of books in the book store. 40 | columns: 41 | - name: isbn 42 | data_type: string 43 | description: isbn of the book. 44 | 45 | - name: book_title 46 | data_type: string 47 | description: Title of the book. 48 | 49 | - name: book_author 50 | data_type: string 51 | description: Author of the book. 52 | 53 | - name: year_of_publication 54 | data_type: numeric 55 | description: The year the book was published. 56 | 57 | - name: publisher 58 | data_type: string 59 | description: publisher of the book 60 | 61 | - name: stg_ratings 62 | description: List of books in the book store. 63 | columns: 64 | - name: user_id 65 | data_type: numeric 66 | description: id of the user that gave the rating. 67 | 68 | - name: isbn 69 | data_type: string 70 | description: isbn of the book that possesses the rating. 71 | 72 | - name: rating 73 | data_type: numeric 74 | description: book's rating given by a user. 
75 | -------------------------------------------------------------------------------- /data_airflow/dags/book-recommendation-dag.py: -------------------------------------------------------------------------------- 1 | import os 2 | import datetime as dt 3 | import logging; 4 | import pandas as pd 5 | import pyarrow.fs as pafs 6 | from airflow import DAG 7 | from airflow.operators.bash import BashOperator 8 | from airflow.operators.python import PythonOperator 9 | from airflow.utils.task_group import TaskGroup 10 | from airflow.providers.google.cloud.operators.bigquery import BigQueryCreateExternalTableOperator, BigQueryInsertJobOperator 11 | 12 | AIRFLOW_HOME = os.environ.get('AIRFLOW_HOME', '/opt/airflow/') 13 | GCP_PROJECT_ID = os.environ.get('GCP_PROJECT_ID') 14 | BOOK_RECOMMENDATION_BUCKET = os.environ.get('GCP_BOOK_RECOMMENDATION_BUCKET') 15 | BOOK_RECOMMENDATION_WH_EXT = os.environ.get('GCP_BOOK_RECOMMENDATION_WH_EXT_DATASET') 16 | BOOK_RECOMMENDATION_WH = os.environ.get('GCP_BOOK_RECOMMENDATION_WH_DATASET') 17 | GCP_CREDENTIALS = os.environ.get('GOOGLE_APPLICATION_CREDENTIALS') 18 | 19 | dataset_download_path = f'{AIRFLOW_HOME}/book-recommendation-dataset/' 20 | parquet_store_path = f'{dataset_download_path}pq/' 21 | 22 | gcs_store_dir = '/book-recommendation-pq' 23 | gcs_pq_store_path = f'{BOOK_RECOMMENDATION_BUCKET}{gcs_store_dir}' 24 | 25 | book_recommendation_datasets = ['books', 'ratings', 'users'] 26 | 27 | book_dtype = { 28 | 'isbn': pd.StringDtype(), 29 | 'book_title': pd.StringDtype(), 30 | 'book_author': pd.StringDtype(), 31 | 'year_of_publication': pd.Int64Dtype(), 32 | 'publisher': pd.StringDtype(), 33 | } 34 | 35 | user_dtype = { 36 | 'user_id': pd.Int64Dtype(), 37 | 'age': pd.Int64Dtype(), 38 | 'location': pd.StringDtype() 39 | } 40 | 41 | rating_dtype = { 42 | 'user_id': pd.Int64Dtype(), 43 | 'isbn': pd.StringDtype(), 44 | 'rating': pd.Int64Dtype(), 45 | } 46 | 47 | def do_clean_to_parquet(): 48 | if not os.path.exists(parquet_store_path): 49 | os.makedirs(parquet_store_path) 50 | 51 | for filename in os.listdir(dataset_download_path): 52 | if filename.endswith('.csv'): 53 | dataset_df = pd.read_csv(f'{dataset_download_path}{filename}') 54 | 55 | if filename.startswith('Books'): 56 | dataset_df = dataset_df.drop(columns=[ 57 | 'Image-URL-S', 58 | 'Image-URL-M', 59 | 'Image-URL-L' 60 | ], axis='columns') 61 | 62 | dataset_df = dataset_df.rename(mapper={ 63 | 'Book-Title': 'book_title', 64 | 'Book-Author': 'book_author', 65 | 'Year-Of-Publication': 'year_of_publication', 66 | 'Publisher': 'publisher', 67 | 'ISBN': 'isbn' 68 | }, axis='columns') 69 | 70 | dataset_df['year_of_publication'] = pd.to_numeric(dataset_df['year_of_publication'], errors='coerce') 71 | dataset_df = dataset_df.dropna(subset=['year_of_publication']) 72 | dataset_df = dataset_df.astype(book_dtype) 73 | elif filename.startswith('Users'): 74 | dataset_df = dataset_df.rename(mapper={ 75 | 'User-ID': 'user_id', 76 | 'Age': 'age', 77 | 'Location': 'location' 78 | }, axis='columns') 79 | 80 | dataset_df = dataset_df.astype(user_dtype) 81 | 82 | dataset_df['location_data'] = dataset_df['location'].apply(lambda x: [x.strip() for x in x.split(',')]) #split by a comma and trim 83 | dataset_df['location_data'] = dataset_df['location_data'].apply(lambda values: [val for val in reversed(values) if val is not None][:3][::-1]) 84 | 85 | dataset_df[['city', 'state', 'country']] = pd.DataFrame(dataset_df['location_data'].tolist()) 86 | dataset_df.drop(columns=['location', 'location_data'], inplace=True) 87 | 
dataset_df['age'].fillna(0, inplace=True) 88 | elif filename.startswith('Ratings'): 89 | dataset_df = dataset_df.rename(mapper={ 90 | 'User-ID': 'user_id', 91 | 'ISBN': 'isbn', 92 | 'Book-Rating': 'rating' 93 | }, axis='columns') 94 | 95 | dataset_df.astype(rating_dtype) 96 | 97 | dataset_df['rating'].fillna(0, inplace=True) 98 | else: 99 | continue 100 | 101 | 102 | print('dataset_df.columns', dataset_df.columns) 103 | parquet_filename = filename.lower().replace('.csv', '.parquet') 104 | parquet_loc = f'{parquet_store_path}{parquet_filename}' 105 | 106 | dataset_df.reset_index(drop=True, inplace=True) 107 | dataset_df.to_parquet(parquet_loc) 108 | 109 | logging.info('Done cleaning up!') 110 | 111 | 112 | def do_upload_pq_to_gcs(): 113 | gcs = pafs.GcsFileSystem() 114 | dir_info = gcs.get_file_info(gcs_pq_store_path) 115 | if dir_info.type != pafs.FileType.NotFound: 116 | gcs.delete_dir(gcs_pq_store_path) 117 | 118 | gcs.create_dir(gcs_pq_store_path) 119 | pafs.copy_files( 120 | source=parquet_store_path, 121 | destination=gcs_pq_store_path, 122 | destination_filesystem=gcs 123 | ) 124 | 125 | logging.info('Copied parquet to gsc') 126 | 127 | 128 | default_args = { 129 | 'owner': 'iamraphson', 130 | 'depends_on_past': False, 131 | 'retries': 2, 132 | 'retry_delay': dt.timedelta(minutes=1), 133 | } 134 | 135 | with DAG( 136 | 'Book-Recommendation-DAG', 137 | default_args=default_args, 138 | description='DAG for book recommendation dataset', 139 | tags=['Book Recommendation'], 140 | user_defined_macros={ 141 | 'BOOK_RECOMMENDATION_WH_EXT': BOOK_RECOMMENDATION_WH_EXT, 142 | 'BOOK_RECOMMENDATION_WH': BOOK_RECOMMENDATION_WH 143 | } 144 | ) as dag: 145 | install_pip_packages_task = BashOperator( 146 | task_id='install_pip_packages_task', 147 | bash_command='pip install --user kaggle' 148 | ) 149 | 150 | pulldown_dataset_task = BashOperator( 151 | task_id='pulldown_dataset_task', 152 | bash_command=f'kaggle datasets download arashnic/book-recommendation-dataset --path {dataset_download_path} --unzip' 153 | ) 154 | 155 | do_clean_to_parquet_task = PythonOperator( 156 | task_id='do_clean_to_parquet_task', 157 | python_callable=do_clean_to_parquet 158 | ) 159 | 160 | do_upload_pq_to_gcs_task = PythonOperator( 161 | task_id='do_upload_pq_to_gcs_task', 162 | python_callable=do_upload_pq_to_gcs 163 | ) 164 | 165 | with TaskGroup('create-external-table-group-tasks') as create_external_table_group_task: 166 | for dataset in book_recommendation_datasets: 167 | BigQueryCreateExternalTableOperator( 168 | task_id=f'bq_external_{dataset}_table_task', 169 | table_resource={ 170 | 'tableReference': { 171 | 'projectId': GCP_PROJECT_ID, 172 | 'datasetId': BOOK_RECOMMENDATION_WH, 173 | 'tableId': dataset, 174 | }, 175 | 'externalDataConfiguration': { 176 | 'autodetect': True, 177 | 'sourceFormat': 'PARQUET', 178 | 'sourceUris': [f'gs://{gcs_pq_store_path}/{dataset}.parquet'], 179 | }, 180 | }, 181 | ) 182 | 183 | create_table_partitions_task = BigQueryInsertJobOperator( 184 | task_id = 'create_table_partitions_task', 185 | configuration={ 186 | 'query': { 187 | 'query': "{% include 'sql/load-dwh.sql' %}", 188 | 'useLegacySql': False, 189 | } 190 | } 191 | ) 192 | 193 | 194 | clean_up_dataset_store_task = BashOperator( 195 | task_id='clean_up_dataset_store_task', 196 | bash_command=f"rm -rf {dataset_download_path}" 197 | ) 198 | 199 | uninstall_pip_packge_task = BashOperator( 200 | task_id='uninstall_pip_package_task', 201 | bash_command=f"pip uninstall --yes kaggle" 202 | ) 203 | 204 | 
install_pip_packages_task.set_downstream(pulldown_dataset_task) 205 | pulldown_dataset_task.set_downstream(do_clean_to_parquet_task) 206 | do_clean_to_parquet_task.set_downstream(do_upload_pq_to_gcs_task) 207 | do_upload_pq_to_gcs_task.set_downstream(create_external_table_group_task) 208 | create_external_table_group_task.set_downstream(create_table_partitions_task) 209 | create_table_partitions_task.set_downstream(clean_up_dataset_store_task) 210 | create_table_partitions_task.set_downstream(uninstall_pip_packge_task) 211 | -------------------------------------------------------------------------------- /docker-compose.yaml: -------------------------------------------------------------------------------- 1 | x-airflow-common: &airflow-common 2 | image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.8.3} 3 | # build: . 4 | environment: &airflow-common-env 5 | AIRFLOW__CORE__EXECUTOR: CeleryExecutor 6 | AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow 7 | AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow 8 | AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0 9 | AIRFLOW__CORE__FERNET_KEY: "" 10 | AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: "true" 11 | AIRFLOW__CORE__LOAD_EXAMPLES: "false" 12 | AIRFLOW__API__AUTH_BACKENDS: "airflow.api.auth.backend.basic_auth,airflow.api.auth.backend.session" 13 | AIRFLOW__SCHEDULER__ENABLE_HEALTH_CHECK: "true" 14 | _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-} 15 | KAGGLE_USERNAME: "$KAGGLE_USERNAME" 16 | KAGGLE_KEY: "$KAGGLE_TOKEN" 17 | GCP_PROJECT_ID: "$GCP_PROJECT_ID" 18 | GCP_BOOK_RECOMMENDATION_BUCKET: "$GCP_BOOK_RECOMMENDATION_BUCKET" 19 | GOOGLE_APPLICATION_CREDENTIALS: "/.google/credentials/dezoomcamp-2024.json" 20 | AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: "google-cloud-platform://?extra__google_cloud_platform__key_path=$GOOGLE_APPLICATION_CREDENTIALS" 21 | GCP_BOOK_RECOMMENDATION_WH_DATASET: "$GCP_BOOK_RECOMMENDATION_WH_DATASET" 22 | GCP_BOOK_RECOMMENDATION_WH_EXT_DATASET: "$GCP_BOOK_RECOMMENDATION_WH_EXT_DATASET" 23 | volumes: 24 | - ${AIRFLOW_PROJ_DIR:-.}/data_airflow/dags:/opt/airflow/dags 25 | - ${AIRFLOW_PROJ_DIR:-.}/data_airflow/logs:/opt/airflow/logs 26 | - ${AIRFLOW_PROJ_DIR:-.}/data_airflow/config:/opt/airflow/config 27 | - ${AIRFLOW_PROJ_DIR:-.}/data_airflow/plugins:/opt/airflow/plugins 28 | - ~/.google/credentials/:/.google/credentials/:ro 29 | user: "${AIRFLOW_UID:-50000}:0" 30 | depends_on: &airflow-common-depends-on 31 | redis: 32 | condition: service_healthy 33 | postgres: 34 | condition: service_healthy 35 | 36 | services: 37 | metabase: 38 | image: metabase/metabase:latest 39 | container_name: book-recommendation-metabase 40 | ports: 41 | - 1460:3000 42 | postgres: 43 | image: postgres:13 44 | container_name: airflow-postgres 45 | environment: 46 | POSTGRES_USER: airflow 47 | POSTGRES_PASSWORD: airflow 48 | POSTGRES_DB: airflow 49 | volumes: 50 | - postgres-db-volume:/var/lib/postgresql/data 51 | healthcheck: 52 | test: ["CMD", "pg_isready", "-U", "airflow"] 53 | interval: 10s 54 | retries: 5 55 | start_period: 5s 56 | restart: always 57 | 58 | redis: 59 | image: redis:latest 60 | container_name: airflow-redis 61 | expose: 62 | - 6379 63 | healthcheck: 64 | test: ["CMD", "redis-cli", "ping"] 65 | interval: 10s 66 | timeout: 30s 67 | retries: 50 68 | start_period: 30s 69 | restart: always 70 | 71 | airflow-webserver: 72 | <<: *airflow-common 73 | container_name: airflow-webserver 74 | command: webserver 75 | ports: 76 | - "8080:8080" 77 | healthcheck: 
78 | test: ["CMD", "curl", "--fail", "http://localhost:8080/health"] 79 | interval: 30s 80 | timeout: 10s 81 | retries: 5 82 | start_period: 30s 83 | restart: always 84 | depends_on: 85 | <<: *airflow-common-depends-on 86 | airflow-init: 87 | condition: service_completed_successfully 88 | 89 | airflow-scheduler: 90 | <<: *airflow-common 91 | container_name: airflow-scheduler 92 | command: scheduler 93 | healthcheck: 94 | test: ["CMD", "curl", "--fail", "http://localhost:8974/health"] 95 | interval: 30s 96 | timeout: 10s 97 | retries: 5 98 | start_period: 30s 99 | restart: always 100 | depends_on: 101 | <<: *airflow-common-depends-on 102 | airflow-init: 103 | condition: service_completed_successfully 104 | 105 | airflow-worker: 106 | <<: *airflow-common 107 | container_name: airflow-worker 108 | command: celery worker 109 | healthcheck: 110 | # yamllint disable rule:line-length 111 | test: 112 | - "CMD-SHELL" 113 | - 'celery --app airflow.providers.celery.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}" || celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"' 114 | interval: 30s 115 | timeout: 10s 116 | retries: 5 117 | start_period: 30s 118 | environment: 119 | <<: *airflow-common-env 120 | # Required to handle warm shutdown of the celery workers properly 121 | # See https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation 122 | DUMB_INIT_SETSID: "0" 123 | restart: always 124 | depends_on: 125 | <<: *airflow-common-depends-on 126 | airflow-init: 127 | condition: service_completed_successfully 128 | 129 | airflow-triggerer: 130 | <<: *airflow-common 131 | container_name: airflow-triggerer 132 | command: triggerer 133 | healthcheck: 134 | test: 135 | [ 136 | "CMD-SHELL", 137 | 'airflow jobs check --job-type TriggererJob --hostname "$${HOSTNAME}"', 138 | ] 139 | interval: 30s 140 | timeout: 10s 141 | retries: 5 142 | start_period: 30s 143 | restart: always 144 | depends_on: 145 | <<: *airflow-common-depends-on 146 | airflow-init: 147 | condition: service_completed_successfully 148 | 149 | airflow-init: 150 | <<: *airflow-common 151 | container_name: airflow-init 152 | entrypoint: /bin/bash 153 | # yamllint disable rule:line-length 154 | command: 155 | - -c 156 | - | 157 | if [[ -z "${AIRFLOW_UID}" ]]; then 158 | echo 159 | echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m" 160 | echo "If you are on Linux, you SHOULD follow the instructions below to set " 161 | echo "AIRFLOW_UID environment variable, otherwise files will be owned by root." 162 | echo "For other operating systems you can get rid of the warning with manually created .env file:" 163 | echo " See: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#setting-the-right-airflow-user" 164 | echo 165 | fi 166 | one_meg=1048576 167 | mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg)) 168 | cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat) 169 | disk_available=$$(df / | tail -1 | awk '{print $$4}') 170 | warning_resources="false" 171 | if (( mem_available < 4000 )) ; then 172 | echo 173 | echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m" 174 | echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))" 175 | echo 176 | warning_resources="true" 177 | fi 178 | if (( cpus_available < 2 )); then 179 | echo 180 | echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m" 181 | echo "At least 2 CPUs recommended. 
You have $${cpus_available}" 182 | echo 183 | warning_resources="true" 184 | fi 185 | if (( disk_available < one_meg * 10 )); then 186 | echo 187 | echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m" 188 | echo "At least 10 GBs recommended. You have $$(numfmt --to iec $$((disk_available * 1024 )))" 189 | echo 190 | warning_resources="true" 191 | fi 192 | if [[ $${warning_resources} == "true" ]]; then 193 | echo 194 | echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m" 195 | echo "Please follow the instructions to increase amount of resources available:" 196 | echo " https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#before-you-begin" 197 | echo 198 | fi 199 | mkdir -p /sources/data_airflow/logs /sources/data_airflow/dags /sources/data_airflow/plugins 200 | chown -R "${AIRFLOW_UID}:0" /sources/data_airflow/{logs,dags,plugins} 201 | exec /entrypoint airflow version 202 | # yamllint enable rule:line-length 203 | environment: 204 | <<: *airflow-common-env 205 | _AIRFLOW_DB_MIGRATE: "true" 206 | _AIRFLOW_WWW_USER_CREATE: "true" 207 | _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow} 208 | _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow} 209 | _PIP_ADDITIONAL_REQUIREMENTS: "" 210 | user: "0:0" 211 | volumes: 212 | - ${AIRFLOW_PROJ_DIR:-.}:/sources 213 | 214 | airflow-cli: 215 | <<: *airflow-common 216 | container_name: airflow-cli 217 | profiles: 218 | - debug 219 | environment: 220 | <<: *airflow-common-env 221 | CONNECTION_CHECK_MAX_COUNT: "0" 222 | # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252 223 | command: 224 | - bash 225 | - -c 226 | - airflow 227 | 228 | # You can enable flower by adding "--profile flower" option e.g. docker-compose --profile flower up 229 | # or by explicitly targeted on the command line e.g. docker-compose up flower. 230 | # See: https://docs.docker.com/compose/profiles/ 231 | flower: 232 | <<: *airflow-common 233 | command: celery flower 234 | container_name: airflow-flower 235 | profiles: 236 | - flower 237 | ports: 238 | - "5555:5555" 239 | healthcheck: 240 | test: ["CMD", "curl", "--fail", "http://localhost:5555/"] 241 | interval: 30s 242 | timeout: 10s 243 | retries: 5 244 | start_period: 30s 245 | restart: always 246 | depends_on: 247 | <<: *airflow-common-depends-on 248 | airflow-init: 249 | condition: service_completed_successfully 250 | 251 | volumes: 252 | postgres-db-volume: 253 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data Pipeline Project for Book Recommendation 2 | 3 |
4 | <details><summary>Table of Contents</summary>
5 |   <ol>
6 |     <li>
7 |       <a href="#introduction">Introduction</a>
8 |       <ul>
9 |         <li><a href="#built-with">Built With</a></li>
10 |       </ul>
11 |     </li>
12 |
13 |     <li>
14 |       <a href="#project-architecture">Project Architecture</a>
15 |     </li>
16 |
17 |     <li>
18 |       <a href="#getting-started">Getting Started</a>
19 |       <ul>
20 |         <li><a href="#prerequisites">Prerequisites</a></li>
21 |         <li><a href="#create-a-google-cloud-project">Create a Google Cloud Project</a></li>
22 |         <li><a href="#set-up-kaggle">Set up kaggle</a></li>
23 |         <li><a href="#set-up-the-infrastructure-on-gcp-with-terraform">Set up the infrastructure on GCP with Terraform</a></li>
24 |         <li><a href="#set-up-airflow-and-metabase">Set up Airflow and Metabase</a></li>
25 |       </ul>
26 |     </li>
27 |
28 |     <li>
29 |       <a href="#data-ingestion">Data Ingestion</a>
30 |     </li>
31 |
32 |     <li>
33 |       <a href="#data-transformation">Data Transformation</a>
34 |     </li>
35 |
36 |     <li>
37 |       <a href="#data-visualization">Data Visualization</a>
38 |     </li>
39 |
40 |     <li>
41 |       <a href="#contact">Contact</a>
42 |     </li>
43 |
44 |     <li>
45 |       <a href="#acknowledgments">Acknowledgments</a>
46 |     </li>
47 |   </ol>
48 | </details>
49 |
50 | ## Introduction
51 |
52 | This project is part of the [Data Engineering Zoomcamp](https://github.com/DataTalksClub/data-engineering-zoomcamp). As part of the project, I developed a data pipeline to load and process data from a Kaggle dataset containing bookstore information for a book recommendation system. The dataset can be accessed [on Kaggle](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset/).
53 |
54 | This dataset offers book ratings from users of various ages and geographical locations. It comprises three files: Users.csv, which includes age and location data of bookstore users; Books.csv, containing information such as authors, titles, and ISBNs; and Ratings.csv, which details the ratings given by users for each book. Additional information about the dataset is available on Kaggle.
55 |
56 | The primary objective of this project is to establish a streamlined data pipeline for obtaining, storing, cleansing, and visualizing data automatically. This pipeline aims to address various queries, such as identifying top-rated publishers and authors, and analyzing ratings based on geographical location.
57 |
58 | Given that the data is static, the data pipeline operates as a one-time process.
59 |
60 | ### Built With
61 |
62 | - Dataset repo: [Kaggle](https://www.kaggle.com)
63 | - Infrastructure as Code: [Terraform](https://www.terraform.io/)
64 | - Workflow Orchestration: [Airflow](https://airflow.apache.org)
65 | - Data Lake: [Google Cloud Storage](https://cloud.google.com/storage)
66 | - Data Warehouse: [Google BigQuery](https://cloud.google.com/bigquery)
67 | - Transformation: [DBT](https://www.getdbt.com/)
68 | - Visualisation: [Metabase](https://www.metabase.com/)
69 | - Programming Languages: Python and SQL
70 |
71 | ## Project Architecture
72 |
73 | ![architecture](./screenshots/architecture.png)
74 | Cloud infrastructure is set up with Terraform.
75 |
76 | Airflow runs in a local Docker container.
77 |
78 | ## Getting Started
79 |
80 | ### Prerequisites
81 |
82 | 1. A [Google Cloud Platform](https://cloud.google.com/) account.
83 | 2. A [Kaggle](https://www.kaggle.com/) account.
84 | 3. Install VSCode, [Zed](https://zed.dev/), or any other IDE that works for you.
85 | 4. [Install Terraform](https://www.terraform.io/downloads)
86 | 5. [Install Docker Desktop](https://docs.docker.com/get-docker/)
87 | 6. [Install Google Cloud SDK](https://cloud.google.com/sdk)
88 | 7. Clone this repository onto your local machine.
89 |
90 | ### Create a Google Cloud Project
91 |
92 | - Go to [Google Cloud](https://console.cloud.google.com/) and create a new project.
93 | - Get the project ID and set the `GCP_PROJECT_ID` environment variable in the `.env` file located in the root directory.
94 | - Create a [Service account](https://cloud.google.com/iam/docs/service-account-overview) with the following roles:
95 |   - `BigQuery Admin`
96 |   - `Storage Admin`
97 |   - `Storage Object Admin`
98 |   - `Viewer`
99 | - Download the Service Account credentials and store them in `$HOME/.google/credentials/`.
100 | - You need to activate the following APIs [here](https://console.cloud.google.com/apis/library/browse):
101 |   - Cloud Storage API
102 |   - BigQuery API
103 | - Assign the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of your JSON credentials file, such that `GOOGLE_APPLICATION_CREDENTIALS` will be $HOME/.google/credentials/.json
104 | - Add this line to the end of the `.bashrc` file:
105 | ```bash
106 | export GOOGLE_APPLICATION_CREDENTIALS=${HOME}/.google/google_credentials.json
107 | ```
108 | - Activate the environment variable by running `source .bashrc`.
109 |
110 | ### Set up kaggle
111 |
112 | - A detailed description of how to authenticate is found [here](https://www.kaggle.com/docs/api).
113 | - Define the environment variables `KAGGLE_USERNAME` and `KAGGLE_TOKEN` in the `.env` file located in the root directory. Note: `KAGGLE_TOKEN` is the same as `KAGGLE_KEY`.
114 |
115 | ### Set up the infrastructure on GCP with Terraform
116 |
117 | - Using Zed or VSCode, open the cloned project `DE-2024-project-book-recommendation`.
118 | - To customize the default values of `variable "project"` and `variable "region"` to your preferred project ID and region, you have two options: either edit `terraform/variables.tf` directly and modify the values, or set the environment variables `TF_VAR_project` and `TF_VAR_region`.
119 | - Open a terminal at the root of the project.
120 | - Change the directory to the terraform folder using the command `cd terraform`.
121 | - Set an alias: `alias tf='terraform'`
122 | - Initialise Terraform: `tf init`
123 | - Plan the infrastructure: `tf plan`
124 | - Apply the changes: `tf apply`
125 |
126 | ### Set up Airflow and Metabase
127 |
128 | - Please confirm that the following environment variables are configured in `.env` in the root directory of the project (a sample `.env` is sketched at the end of this document).
129 |   - `AIRFLOW_UID`. The default value is 50000.
130 |   - `KAGGLE_USERNAME`. This should be set from the [Set up kaggle](#set-up-kaggle) section.
131 |   - `KAGGLE_TOKEN`. This should be set from the [Set up kaggle](#set-up-kaggle) section too.
132 |   - `GCP_PROJECT_ID`. This should be set from the [Create a Google Cloud Project](#create-a-google-cloud-project) section.
133 |   - `GCP_BOOK_RECOMMENDATION_BUCKET=book_recommendation_datalake_`
134 |   - `GCP_BOOK_RECOMMENDATION_WH_DATASET=book_recommendation_analytics`
135 |   - `GCP_BOOK_RECOMMENDATION_WH_EXT_DATASET=book_recommendataion_wh`
136 | - Run `docker-compose up`.
137 | - Access the Airflow dashboard by visiting `http://localhost:8080/` in your web browser. The interface will resemble the following. Use the username and password `airflow` to log in.
138 |
139 | ![Airflow](./screenshots/airflow_home.png)
140 |
141 | - Visit `http://localhost:1460` in your web browser to access the Metabase dashboard. The interface will resemble the following. You will need to sign up to use the UI.
142 |
143 | ![Metabase](./screenshots/metabase_home.png)
144 |
145 | ## Data Ingestion
146 |
147 | Once you've completed all the steps outlined in the previous section, you should be able to view the Airflow dashboard in your web browser. The list of DAGs will be displayed as shown below.
148 | ![DAGS](./screenshots/dags_index.png)
149 | Below is the DAG's graph.
150 | ![DAG Graph](./screenshots/dag_graph.png)
151 | To run the DAG, click on the play button (Figure 1).
152 | ![Run Graph](./screenshots/run_dag.png)
153 |
154 | ## Data Transformation
155 |
156 | - Navigate to the root directory of the project in the terminal and then change the directory to the data_dbt folder using the command `cd data_dbt`.
157 | - Generate a profiles.yml file within `${HOME}/.dbt` and define a profile for this project as shown below.
158 |
159 | ```yaml
160 | data_dbt_book_recommendation:
161 |   outputs:
162 |     dev:
163 |       dataset: book_recommendation_analytics
164 |       fixed_retries: 1
165 |       keyfile:
166 |       location:
167 |       method: service-account
168 |       priority: interactive
169 |       project:
170 |       threads: 6
171 |       timeout_seconds: 300
172 |       type: bigquery
173 |   target: dev
174 | ```
175 |
176 | - To run all models, run `dbt run -t dev`.
177 | - Navigate to your Google [BigQuery](https://console.cloud.google.com/bigquery) project by clicking on this link. There, you'll find all the tables and views created by DBT.
178 | ![Big Query](./screenshots/bigquery_schema_1.png)
179 |
180 | ## Data Visualization
181 |
182 | Please watch the [provided video tutorial](https://youtu.be/BnLkrA7a6gM&) to configure your Metabase database connection with BigQuery. You have the flexibility to customize your dashboard according to your preferences. Additionally, this linked [PDF](./screenshots/DE_2024_Dashboard.pdf) contains a complete screenshot of the dashboard I created.
183 |
184 | ![Dashboard](./screenshots/DE_2024_Dashboard.png)
185 |
186 | ## Contact
187 |
188 | Twitter: [@iamraphson](https://twitter.com/iamraphson)
189 |
190 | ## Acknowledgments
191 |
192 | I would like to extend my heartfelt gratitude to the organizers of the [Data Engineering Zoomcamp](https://github.com/DataTalksClub/data-engineering-zoomcamp) for providing such a valuable course. The insights I gained have been instrumental in broadening my understanding of the field of Data Engineering. Additionally, I want to express my appreciation to my colleagues with whom I took the course. Thank you all for your support and collaboration throughout this journey.
193 |
194 | 🦅
195 |
--------------------------------------------------------------------------------
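For convenience, here is a sketch of what a complete `.env` in the project root might look like, pulling together the variables referenced in the setup sections above and read by `docker-compose.yaml`. All values shown are placeholders; substitute your own Kaggle credentials and GCP project ID, and make sure the bucket and dataset names match what Terraform actually created for your project.

```bash
# Illustrative .env (placeholder values only)
AIRFLOW_UID=50000

# Kaggle API credentials (see "Set up kaggle")
KAGGLE_USERNAME=your-kaggle-username
KAGGLE_TOKEN=your-kaggle-api-key

# Google Cloud settings (see "Create a Google Cloud Project" and terraform/variables.tf)
GCP_PROJECT_ID=your-gcp-project-id
GCP_BOOK_RECOMMENDATION_BUCKET=book_recommendation_datalake_your-gcp-project-id
GCP_BOOK_RECOMMENDATION_WH_DATASET=book_recommendation_analytics
GCP_BOOK_RECOMMENDATION_WH_EXT_DATASET=book_recommendataion_wh
```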