├── requirements.txt ├── data_dbt ├── seeds │ └── .gitkeep ├── tests │ └── .gitkeep ├── analyses │ └── .gitkeep ├── macros │ ├── .gitkeep │ └── capitalize_replace.sql ├── snapshots │ └── .gitkeep ├── models │ ├── staging │ │ ├── stg_ratings.sql │ │ ├── stg_books.sql │ │ ├── stg_users.sql │ │ └── schema.yml │ └── core │ │ ├── dim_rating_by_countries.sql │ │ ├── facts_full_ratings.sql │ │ ├── schema.yml │ │ └── dim_rating_by_age_range.sql ├── README.md └── dbt_project.yml ├── pyrightconfig.json ├── terraform ├── locals.tf ├── variables.tf ├── .terraform.lock.hcl └── main.tf ├── screenshots ├── run_dag.png ├── dag_graph.png ├── airflow_home.png ├── architecture.png ├── dags_index.png ├── metabase_home.png ├── DE_2024_Dashboard.pdf ├── DE_2024_Dashboard.png └── bigquery_schema_1.png ├── .gitignore ├── data_airflow └── dags │ ├── sql │ └── load-dwh.sql │ └── book-recommendation-dag.py ├── LICENSE ├── docker-compose.yaml └── README.md /requirements.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /data_dbt/seeds/.gitkeep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /data_dbt/tests/.gitkeep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /data_dbt/analyses/.gitkeep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /data_dbt/macros/.gitkeep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /data_dbt/snapshots/.gitkeep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pyrightconfig.json: -------------------------------------------------------------------------------- 1 | { 2 | "venvPath": ".", 3 | "venv": ".venv" 4 | } 5 | -------------------------------------------------------------------------------- /terraform/locals.tf: -------------------------------------------------------------------------------- 1 | locals { 2 | DE_2004_PROJECT_DATALAKE = "book_recommendation_datalake" 3 | } 4 | -------------------------------------------------------------------------------- /screenshots/run_dag.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iamraphson/DE-2024-project-book-recommendation/HEAD/screenshots/run_dag.png -------------------------------------------------------------------------------- /screenshots/dag_graph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iamraphson/DE-2024-project-book-recommendation/HEAD/screenshots/dag_graph.png -------------------------------------------------------------------------------- /screenshots/airflow_home.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iamraphson/DE-2024-project-book-recommendation/HEAD/screenshots/airflow_home.png -------------------------------------------------------------------------------- /screenshots/architecture.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/iamraphson/DE-2024-project-book-recommendation/HEAD/screenshots/architecture.png -------------------------------------------------------------------------------- /screenshots/dags_index.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iamraphson/DE-2024-project-book-recommendation/HEAD/screenshots/dags_index.png -------------------------------------------------------------------------------- /screenshots/metabase_home.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iamraphson/DE-2024-project-book-recommendation/HEAD/screenshots/metabase_home.png -------------------------------------------------------------------------------- /screenshots/DE_2024_Dashboard.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iamraphson/DE-2024-project-book-recommendation/HEAD/screenshots/DE_2024_Dashboard.pdf -------------------------------------------------------------------------------- /screenshots/DE_2024_Dashboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iamraphson/DE-2024-project-book-recommendation/HEAD/screenshots/DE_2024_Dashboard.png -------------------------------------------------------------------------------- /screenshots/bigquery_schema_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iamraphson/DE-2024-project-book-recommendation/HEAD/screenshots/bigquery_schema_1.png -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | data_airflow/logs/* 2 | .venv/* 3 | *__pycache__* 4 | .env 5 | terraform/.terraform/* 6 | terraform/terraform.tfstate 7 | terraform/terraform.tfstate.backup 8 | logs/* 9 | data_dbt/target/* 10 | data_dbt/dbt_packages/* 11 | data_dbt/logs/* 12 | -------------------------------------------------------------------------------- /data_dbt/models/staging/stg_ratings.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | materialized='view' 4 | ) 5 | }} 6 | 7 | with ratingsdata as ( 8 | SELECT 9 | * 10 | FROM 11 | {{ source ('staging', 'ratings') }} 12 | ) 13 | 14 | select 15 | user_id, 16 | isbn, 17 | rating 18 | from ratingsdata 19 | -------------------------------------------------------------------------------- /data_dbt/models/staging/stg_books.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | materialized='view' 4 | ) 5 | }} 6 | 7 | with booksdata as ( 8 | SELECT 9 | * 10 | FROM 11 | {{ source ('staging', 'books') }} where year_of_publication <> 0 12 | ) 13 | 14 | select 15 | isbn, 16 | book_title, 17 | book_author, 18 | year_of_publication, 19 | publisher 20 | from booksdata 21 | -------------------------------------------------------------------------------- /data_dbt/models/core/dim_rating_by_countries.sql: -------------------------------------------------------------------------------- 1 | {{ config(materialized='table') }} 2 | 3 | with ratings as ( 4 | select * from {{ ref('stg_ratings') }} 5 | ), 6 | users as ( 7 | select* from {{ ref('stg_users') }} 8 | ) 9 | 10 | 
select 11 | users.country as country, 12 | count(ratings.rating) as total_ratings 13 | from ratings 14 | join users on ratings.user_id = users.user_id 15 | group by country 16 | -------------------------------------------------------------------------------- /data_dbt/macros/capitalize_replace.sql: -------------------------------------------------------------------------------- 1 | 2 | {% macro capitalize_replace(column_name) %} 3 | {% if target.type == 'bigquery' %} 4 | replace(INITCAP({{ column_name }}), '"', '') 5 | {% elif target.type == 'postgres' %} 6 | replace(INITCAP({{ column_name }}), '"', '') 7 | {% elif target.type == 'snowflake' %} 8 | replace(INITCAP({{ column_name }}), '"', '') 9 | {% else %} 10 | replace(UPPER(LEFT({{ column_name }}, 1)) || LOWER(SUBSTRING({{ column_name }}, 2)), '"', '') 11 | {% endif %} 12 | {% endmacro %} 13 | -------------------------------------------------------------------------------- /data_dbt/README.md: -------------------------------------------------------------------------------- 1 | Welcome to your new dbt project! 2 | 3 | ### Using the starter project 4 | 5 | Try running the following commands: 6 | - dbt run 7 | - dbt test 8 | 9 | 10 | ### Resources: 11 | - Learn more about dbt [in the docs](https://docs.getdbt.com/docs/introduction) 12 | - Check out [Discourse](https://discourse.getdbt.com/) for commonly asked questions and answers 13 | - Join the [chat](https://community.getdbt.com/) on Slack for live discussions and support 14 | - Find [dbt events](https://events.getdbt.com) near you 15 | - Check out [the blog](https://blog.getdbt.com/) for the latest news on dbt's development and best practices 16 | -------------------------------------------------------------------------------- /data_dbt/models/core/facts_full_ratings.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | materialized='table' 4 | ) 5 | }} 6 | 7 | with ratings as ( 8 | select * from {{ ref('stg_ratings') }} 9 | ), 10 | books as( 11 | select* from {{ ref('stg_books') }} 12 | ), 13 | users as ( 14 | select* from {{ ref('stg_users') }} 15 | ) 16 | 17 | 18 | select 19 | ratings.user_id as user_id, 20 | ratings.isbn as isbn, 21 | ratings.rating as rating, 22 | users.age as age, 23 | users.city as city, 24 | users.state as state, 25 | users.country as country, 26 | books.book_title as title, 27 | books.book_author as author, 28 | books.year_of_publication as year_of_publication, 29 | books.publisher as publisher 30 | from ratings 31 | join users on ratings.user_id = users.user_id 32 | join books on ratings.isbn = books.isbn 33 | -------------------------------------------------------------------------------- /data_airflow/dags/sql/load-dwh.sql: -------------------------------------------------------------------------------- 1 | -- books table 2 | DROP TABLE IF EXISTS `{{ BOOK_RECOMMENDATION_WH }}.books`; 3 | 4 | CREATE TABLE IF NOT EXISTS 5 | `{{ BOOK_RECOMMENDATION_WH }}.books` 6 | CLUSTER BY 7 | `year_of_publication`, 8 | `publisher`, 9 | `book_author` AS 10 | SELECT 11 | * 12 | FROM 13 | `{{BOOK_RECOMMENDATION_WH_EXT}}.books`; 14 | 15 | 16 | -- users 17 | DROP TABLE IF EXISTS `{{ BOOK_RECOMMENDATION_WH }}.users`; 18 | 19 | CREATE TABLE IF NOT EXISTS 20 | `{{ BOOK_RECOMMENDATION_WH }}.users` 21 | CLUSTER BY 22 | `country`, 23 | `state`, 24 | `city` AS 25 | SELECT 26 | * 27 | FROM 28 | `{{BOOK_RECOMMENDATION_WH_EXT}}.users`; 29 | 30 | -- rating 31 | DROP TABLE IF EXISTS `{{ BOOK_RECOMMENDATION_WH }}.ratings`; 32 | 33 | 
CREATE TABLE IF NOT EXISTS 34 | `{{ BOOK_RECOMMENDATION_WH }}.ratings` 35 | CLUSTER BY 36 | `user_id`, 37 | `isbn` AS 38 | SELECT 39 | * 40 | FROM 41 | `{{BOOK_RECOMMENDATION_WH_EXT}}.ratings`; 42 | -------------------------------------------------------------------------------- /data_dbt/models/core/schema.yml: -------------------------------------------------------------------------------- 1 | version: 2 2 | 3 | models: 4 | - name: facts_full_ratings 5 | description: > 6 | This fact models contains table combination from rating, books and users. 7 | This will help dashboard presentations. 8 | - name: dim_rating_by_countries 9 | description: Aggregated table of all rating by countries. 10 | columns: 11 | - name: country 12 | data_type: string 13 | description: Column for countries 14 | 15 | - name: total_ratings 16 | data_type: numeric 17 | description: Total rating from countries 18 | 19 | - name: dim_rating_by_age_range 20 | description: Aggregated table of all rating from some age range. 21 | columns: 22 | - name: age_range 23 | data_type: string 24 | description: List of age range 25 | 26 | - name: number_of_rating 27 | data_type: numeric 28 | description: Number of rating from age range 29 | -------------------------------------------------------------------------------- /terraform/variables.tf: -------------------------------------------------------------------------------- 1 | variable "project" { 2 | type = string 3 | description = "GCP project ID" 4 | default = "radiant-gateway-412001" 5 | } 6 | 7 | variable "region" { 8 | type = string 9 | description = "Region for GCP resources. Choose as per your location: https://cloud.google.com/about/locations" 10 | default = "us-west1" 11 | } 12 | 13 | variable "storage_class" { 14 | type = string 15 | description = "The Storage Class of the new bucket. Ref: https://cloud.google.com/storage/docs/storage-classes" 16 | default = "STANDARD" 17 | } 18 | 19 | variable "book_recommendataion_ext_datasets" { 20 | type = string 21 | description = "Dataset in BigQuery where raw data (external tables) will be loaded." 22 | default = "book_recommendataion_wh" 23 | } 24 | 25 | variable "book_recommendation_analytics_datasets" { 26 | type = string 27 | description = "Dataset in BigQuery where raw data (from Google Cloud Storage and DBT) will be loaded." 
28 | default = "book_recommendation_analytics" 29 | } 30 | -------------------------------------------------------------------------------- /data_dbt/models/core/dim_rating_by_age_range.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | materialized='table' 4 | ) 5 | }} 6 | 7 | WITH age_range_ratings AS ( 8 | SELECT 9 | CASE 10 | WHEN age = 0 THEN 'Unknown age' 11 | WHEN age BETWEEN 1 AND 9 THEN '1-9' 12 | WHEN age BETWEEN 10 AND 19 THEN '10-19' 13 | WHEN age BETWEEN 20 AND 29 THEN '20-29' 14 | WHEN age BETWEEN 30 AND 39 THEN '30-39' 15 | WHEN age BETWEEN 40 AND 49 THEN '40-49' 16 | WHEN age BETWEEN 50 AND 59 THEN '50-59' 17 | WHEN age BETWEEN 60 AND 69 THEN '60-69' 18 | WHEN age BETWEEN 70 AND 79 THEN '70-79' 19 | WHEN age BETWEEN 80 AND 89 THEN '80-89' 20 | ELSE '90+' 21 | END AS age_range, 22 | age, 23 | rating 24 | FROM 25 | {{ ref('facts_full_ratings') }} 26 | ) 27 | 28 | SELECT 29 | age_range, 30 | COUNT(rating) AS number_of_rating 31 | FROM 32 | age_range_ratings 33 | GROUP BY 34 | age_range 35 | ORDER BY 36 | MIN(age) 37 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Olusegun Ayeni 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /terraform/.terraform.lock.hcl: -------------------------------------------------------------------------------- 1 | # This file is maintained automatically by "terraform init". 2 | # Manual edits may be lost in future updates. 
3 | 4 | provider "registry.terraform.io/hashicorp/google" { 5 | version = "5.21.0" 6 | constraints = "5.21.0" 7 | hashes = [ 8 | "h1:XFwEEqXVi0PaGzYI2p65DxfWggnQItX5AsRCveTOmT4=", 9 | "zh:4185b880504af117898f6b0c50fd8459ac4b9d5fd7c2ceaf6fc5b18d4d920978", 10 | "zh:56bbb4ae9cfbd1a9c3008e911605f1edccfa6a1048d88dba0c92f347441d274d", 11 | "zh:59246baa783208f5b51ad9d4a4008a327128e234f4a8d1d629cf0af6ae6a9249", 12 | "zh:989e7e07a46e486f791a82f20bf0a2f73b64464fe6a97925edc164eeee33d980", 13 | "zh:9945cce3c36e4e95c74a2f71c38cb0d042014bef555fdeb07e6d92bc624b7567", 14 | "zh:b276a2e6ba9a9d2cd3127b6cc9a9cdf6cca2db1fbe4ad4a5332025ae3c7c9bb6", 15 | "zh:d1af7f76ef64a808dcaeabbeb74f27a7925665082209db019ca79e6e06fe3ab2", 16 | "zh:d7954e905704b4f158c592e0d8c3d8d54a9edd6f8392d2fa3dfc9f0fe29795d8", 17 | "zh:e85724a917887ac00112ca4edafdc2b233c787c2892f386dafa9dfd3215083c0", 18 | "zh:ebadb8e5b387914e118ecbf83e08a72d034fe069e9e5b0cefa857b758479f835", 19 | "zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c", 20 | "zh:fb38ef67430bcf8e07144f77b76b47df35acd38f5e553fe7104ecfe54378bb9e", 21 | ] 22 | } 23 | -------------------------------------------------------------------------------- /data_dbt/dbt_project.yml: -------------------------------------------------------------------------------- 1 | # Name your project! Project names should contain only lowercase characters 2 | # and underscores. A good package name should reflect your organization's 3 | # name or the intended use of these models 4 | name: "data_dbt" 5 | version: "1.0.0" 6 | config-version: 2 7 | 8 | # This setting configures which "profile" dbt uses for this project. 9 | profile: "data_dbt_book_recommendation" 10 | 11 | # These configurations specify where dbt should look for different types of files. 12 | # The `model-paths` config, for example, states that models in this project can be 13 | # found in the "models/" directory. You probably won't need to change these! 14 | model-paths: ["models"] 15 | analysis-paths: ["analyses"] 16 | test-paths: ["tests"] 17 | seed-paths: ["seeds"] 18 | macro-paths: ["macros"] 19 | snapshot-paths: ["snapshots"] 20 | 21 | clean-targets: # directories to be removed by `dbt clean` 22 | - "target" 23 | - "dbt_packages" 24 | 25 | # Configuring models 26 | # Full documentation: https://docs.getdbt.com/docs/configuring-models 27 | 28 | # In this example config, we tell dbt to build all models in the example/ 29 | # directory as views. These settings can be overridden in the individual model 30 | # files using the `{{ config(...) }}` macro. 
31 | models: 32 | data_dbt: 33 | # Config indicated by + and applies to all files under models/example/ 34 | staging: 35 | +materialized: view 36 | -------------------------------------------------------------------------------- /terraform/main.tf: -------------------------------------------------------------------------------- 1 | terraform { 2 | required_providers { 3 | google = { 4 | source = "hashicorp/google" 5 | version = "5.21.0" 6 | } 7 | } 8 | } 9 | 10 | provider "google" { 11 | project = var.project 12 | region = var.region 13 | } 14 | 15 | resource "google_storage_bucket" "book_recommendation_datalake" { 16 | name = "${local.DE_2004_PROJECT_DATALAKE}_${var.project}" 17 | location = var.region 18 | 19 | storage_class = var.storage_class 20 | uniform_bucket_level_access = true 21 | public_access_prevention = "enforced" 22 | 23 | versioning { 24 | enabled = true 25 | } 26 | 27 | lifecycle_rule { 28 | action { 29 | type = "Delete" 30 | } 31 | condition { 32 | age = 10 //days 33 | } 34 | } 35 | 36 | force_destroy = true 37 | } 38 | 39 | resource "google_bigquery_dataset" "book_recommendataion_ext_dataset" { 40 | project = var.project 41 | location = var.region 42 | dataset_id = var.book_recommendataion_ext_datasets 43 | delete_contents_on_destroy = true 44 | } 45 | 46 | resource "google_bigquery_dataset" "book_recommendation_analytics_dataset" { 47 | project = var.project 48 | location = var.region 49 | dataset_id = var.book_recommendation_analytics_datasets 50 | delete_contents_on_destroy = true 51 | } 52 | -------------------------------------------------------------------------------- /data_dbt/models/staging/stg_users.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | materialized='view' 4 | ) 5 | }} 6 | 7 | with usersdata as ( 8 | SELECT 9 | * 10 | FROM 11 | {{ source ('staging', 'users') }} 12 | WHERE LOWER(country) NOT IN('', 'n/a') AND LOWER(country) not like '%n/a%' 13 | ) 14 | 15 | select 16 | user_id, 17 | age, 18 | {{ capitalize_replace('city') }} AS city, 19 | {{ capitalize_replace('state') }} AS state, 20 | CASE 21 | WHEN lower(country) IN ('usa', 'united states', 'united state', 'us', 'u.s.a.', 'america', 'u.s.a>', 'united staes', 'united states of america', 'Csa', 'San Franicsco', 'U.S. 
Of A.') THEN 'United states' 22 | WHEN lower(country) IN ('italia', 'l`italia', 'ferrara') THEN 'Italy' 23 | WHEN lower(country) IN ('u.a.e') THEN 'United Arab Emirates' 24 | WHEN lower(country) IN ('c.a.', 'canada') THEN 'Canada' 25 | WHEN lower(country) IN ('nz') THEN 'New Zealand' 26 | WHEN lower(country) IN ('urugua') THEN 'Uruguay' 27 | WHEN lower(country) IN ('p.r.china', 'china') THEN 'China' 28 | WHEN lower(country) IN ('trinidad and tobago', 'tobago') THEN 'Trinidad And Tobago' 29 | WHEN lower(country) IN ('united kingdom', 'u.k.', 'england', 'wales', 'united kindgonm') THEN 'United Kingdom' 30 | ELSE {{ capitalize_replace('country') }} 31 | END AS country 32 | from usersdata 33 | where ( 34 | lower(country) not like '%far away%' AND 35 | lower(country) not in ( 36 | 'quit', 37 | 'here and there', 38 | 'everywhere and anywhere', 39 | 'x', 40 | 'k1c7b1', 41 | 'we`re global!', 42 | '中国', 43 | 'lkjlj', 44 | '', 45 | 'Space', 46 | 'Tdzimi', 47 | 'Usa (Currently Living In England)', 48 | 'Ua', 49 | 'Universe' 50 | ) 51 | ) 52 | -------------------------------------------------------------------------------- /data_dbt/models/staging/schema.yml: -------------------------------------------------------------------------------- 1 | version: 2 2 | 3 | sources: 4 | - name: staging 5 | database: radiant-gateway-412001 6 | schema: book_recommendation_analytics 7 | tables: 8 | - name: books 9 | - name: ratings 10 | - name: users 11 | models: 12 | - name: stg_users 13 | description: List of Users that used the book store. 14 | columns: 15 | - name: user_id 16 | data_type: numeric 17 | description: id assigned to the user. 18 | - name: age 19 | date_type: numeric 20 | description: Age of the user. 21 | 22 | - name: city 23 | date_type: string 24 | description: city of the user. 25 | 26 | - name: state 27 | date_type: string 28 | description: state of the user. 29 | 30 | - name: state 31 | date_type: string 32 | description: state of the user. 33 | 34 | - name: country 35 | date_type: string 36 | description: Ratings given to books by users. 37 | 38 | - name: stg_books 39 | description: List of books in the book store. 40 | columns: 41 | - name: isbn 42 | data_type: string 43 | description: isbn of the book. 44 | 45 | - name: book_title 46 | data_type: string 47 | description: Title of the book. 48 | 49 | - name: book_author 50 | data_type: string 51 | description: Author of the book. 52 | 53 | - name: year_of_publication 54 | data_type: numeric 55 | description: The year the book was published. 56 | 57 | - name: publisher 58 | data_type: string 59 | description: publisher of the book 60 | 61 | - name: stg_ratings 62 | description: List of books in the book store. 63 | columns: 64 | - name: user_id 65 | data_type: numeric 66 | description: id of the user that gave the rating. 67 | 68 | - name: isbn 69 | data_type: string 70 | description: isbn of the book that possesses the rating. 71 | 72 | - name: rating 73 | data_type: numeric 74 | description: book's rating given by a user. 
75 | -------------------------------------------------------------------------------- /data_airflow/dags/book-recommendation-dag.py: -------------------------------------------------------------------------------- 1 | import os 2 | import datetime as dt 3 | import logging; 4 | import pandas as pd 5 | import pyarrow.fs as pafs 6 | from airflow import DAG 7 | from airflow.operators.bash import BashOperator 8 | from airflow.operators.python import PythonOperator 9 | from airflow.utils.task_group import TaskGroup 10 | from airflow.providers.google.cloud.operators.bigquery import BigQueryCreateExternalTableOperator, BigQueryInsertJobOperator 11 | 12 | AIRFLOW_HOME = os.environ.get('AIRFLOW_HOME', '/opt/airflow/') 13 | GCP_PROJECT_ID = os.environ.get('GCP_PROJECT_ID') 14 | BOOK_RECOMMENDATION_BUCKET = os.environ.get('GCP_BOOK_RECOMMENDATION_BUCKET') 15 | BOOK_RECOMMENDATION_WH_EXT = os.environ.get('GCP_BOOK_RECOMMENDATION_WH_EXT_DATASET') 16 | BOOK_RECOMMENDATION_WH = os.environ.get('GCP_BOOK_RECOMMENDATION_WH_DATASET') 17 | GCP_CREDENTIALS = os.environ.get('GOOGLE_APPLICATION_CREDENTIALS') 18 | 19 | dataset_download_path = f'{AIRFLOW_HOME}/book-recommendation-dataset/' 20 | parquet_store_path = f'{dataset_download_path}pq/' 21 | 22 | gcs_store_dir = '/book-recommendation-pq' 23 | gcs_pq_store_path = f'{BOOK_RECOMMENDATION_BUCKET}{gcs_store_dir}' 24 | 25 | book_recommendation_datasets = ['books', 'ratings', 'users'] 26 | 27 | book_dtype = { 28 | 'isbn': pd.StringDtype(), 29 | 'book_title': pd.StringDtype(), 30 | 'book_author': pd.StringDtype(), 31 | 'year_of_publication': pd.Int64Dtype(), 32 | 'publisher': pd.StringDtype(), 33 | } 34 | 35 | user_dtype = { 36 | 'user_id': pd.Int64Dtype(), 37 | 'age': pd.Int64Dtype(), 38 | 'location': pd.StringDtype() 39 | } 40 | 41 | rating_dtype = { 42 | 'user_id': pd.Int64Dtype(), 43 | 'isbn': pd.StringDtype(), 44 | 'rating': pd.Int64Dtype(), 45 | } 46 | 47 | def do_clean_to_parquet(): 48 | if not os.path.exists(parquet_store_path): 49 | os.makedirs(parquet_store_path) 50 | 51 | for filename in os.listdir(dataset_download_path): 52 | if filename.endswith('.csv'): 53 | dataset_df = pd.read_csv(f'{dataset_download_path}{filename}') 54 | 55 | if filename.startswith('Books'): 56 | dataset_df = dataset_df.drop(columns=[ 57 | 'Image-URL-S', 58 | 'Image-URL-M', 59 | 'Image-URL-L' 60 | ], axis='columns') 61 | 62 | dataset_df = dataset_df.rename(mapper={ 63 | 'Book-Title': 'book_title', 64 | 'Book-Author': 'book_author', 65 | 'Year-Of-Publication': 'year_of_publication', 66 | 'Publisher': 'publisher', 67 | 'ISBN': 'isbn' 68 | }, axis='columns') 69 | 70 | dataset_df['year_of_publication'] = pd.to_numeric(dataset_df['year_of_publication'], errors='coerce') 71 | dataset_df = dataset_df.dropna(subset=['year_of_publication']) 72 | dataset_df = dataset_df.astype(book_dtype) 73 | elif filename.startswith('Users'): 74 | dataset_df = dataset_df.rename(mapper={ 75 | 'User-ID': 'user_id', 76 | 'Age': 'age', 77 | 'Location': 'location' 78 | }, axis='columns') 79 | 80 | dataset_df = dataset_df.astype(user_dtype) 81 | 82 | dataset_df['location_data'] = dataset_df['location'].apply(lambda x: [x.strip() for x in x.split(',')]) #split by a comma and trim 83 | dataset_df['location_data'] = dataset_df['location_data'].apply(lambda values: [val for val in reversed(values) if val is not None][:3][::-1]) 84 | 85 | dataset_df[['city', 'state', 'country']] = pd.DataFrame(dataset_df['location_data'].tolist()) 86 | dataset_df.drop(columns=['location', 'location_data'], inplace=True) 87 | 
dataset_df['age'].fillna(0, inplace=True) 88 | elif filename.startswith('Ratings'): 89 | dataset_df = dataset_df.rename(mapper={ 90 | 'User-ID': 'user_id', 91 | 'ISBN': 'isbn', 92 | 'Book-Rating': 'rating' 93 | }, axis='columns') 94 | 95 | dataset_df.astype(rating_dtype) 96 | 97 | dataset_df['rating'].fillna(0, inplace=True) 98 | else: 99 | continue 100 | 101 | 102 | print('dataset_df.columns', dataset_df.columns) 103 | parquet_filename = filename.lower().replace('.csv', '.parquet') 104 | parquet_loc = f'{parquet_store_path}{parquet_filename}' 105 | 106 | dataset_df.reset_index(drop=True, inplace=True) 107 | dataset_df.to_parquet(parquet_loc) 108 | 109 | logging.info('Done cleaning up!') 110 | 111 | 112 | def do_upload_pq_to_gcs(): 113 | gcs = pafs.GcsFileSystem() 114 | dir_info = gcs.get_file_info(gcs_pq_store_path) 115 | if dir_info.type != pafs.FileType.NotFound: 116 | gcs.delete_dir(gcs_pq_store_path) 117 | 118 | gcs.create_dir(gcs_pq_store_path) 119 | pafs.copy_files( 120 | source=parquet_store_path, 121 | destination=gcs_pq_store_path, 122 | destination_filesystem=gcs 123 | ) 124 | 125 | logging.info('Copied parquet to gsc') 126 | 127 | 128 | default_args = { 129 | 'owner': 'iamraphson', 130 | 'depends_on_past': False, 131 | 'retries': 2, 132 | 'retry_delay': dt.timedelta(minutes=1), 133 | } 134 | 135 | with DAG( 136 | 'Book-Recommendation-DAG', 137 | default_args=default_args, 138 | description='DAG for book recommendation dataset', 139 | tags=['Book Recommendation'], 140 | user_defined_macros={ 141 | 'BOOK_RECOMMENDATION_WH_EXT': BOOK_RECOMMENDATION_WH_EXT, 142 | 'BOOK_RECOMMENDATION_WH': BOOK_RECOMMENDATION_WH 143 | } 144 | ) as dag: 145 | install_pip_packages_task = BashOperator( 146 | task_id='install_pip_packages_task', 147 | bash_command='pip install --user kaggle' 148 | ) 149 | 150 | pulldown_dataset_task = BashOperator( 151 | task_id='pulldown_dataset_task', 152 | bash_command=f'kaggle datasets download arashnic/book-recommendation-dataset --path {dataset_download_path} --unzip' 153 | ) 154 | 155 | do_clean_to_parquet_task = PythonOperator( 156 | task_id='do_clean_to_parquet_task', 157 | python_callable=do_clean_to_parquet 158 | ) 159 | 160 | do_upload_pq_to_gcs_task = PythonOperator( 161 | task_id='do_upload_pq_to_gcs_task', 162 | python_callable=do_upload_pq_to_gcs 163 | ) 164 | 165 | with TaskGroup('create-external-table-group-tasks') as create_external_table_group_task: 166 | for dataset in book_recommendation_datasets: 167 | BigQueryCreateExternalTableOperator( 168 | task_id=f'bq_external_{dataset}_table_task', 169 | table_resource={ 170 | 'tableReference': { 171 | 'projectId': GCP_PROJECT_ID, 172 | 'datasetId': BOOK_RECOMMENDATION_WH, 173 | 'tableId': dataset, 174 | }, 175 | 'externalDataConfiguration': { 176 | 'autodetect': True, 177 | 'sourceFormat': 'PARQUET', 178 | 'sourceUris': [f'gs://{gcs_pq_store_path}/{dataset}.parquet'], 179 | }, 180 | }, 181 | ) 182 | 183 | create_table_partitions_task = BigQueryInsertJobOperator( 184 | task_id = 'create_table_partitions_task', 185 | configuration={ 186 | 'query': { 187 | 'query': "{% include 'sql/load-dwh.sql' %}", 188 | 'useLegacySql': False, 189 | } 190 | } 191 | ) 192 | 193 | 194 | clean_up_dataset_store_task = BashOperator( 195 | task_id='clean_up_dataset_store_task', 196 | bash_command=f"rm -rf {dataset_download_path}" 197 | ) 198 | 199 | uninstall_pip_packge_task = BashOperator( 200 | task_id='uninstall_pip_package_task', 201 | bash_command=f"pip uninstall --yes kaggle" 202 | ) 203 | 204 | 
install_pip_packages_task.set_downstream(pulldown_dataset_task) 205 | pulldown_dataset_task.set_downstream(do_clean_to_parquet_task) 206 | do_clean_to_parquet_task.set_downstream(do_upload_pq_to_gcs_task) 207 | do_upload_pq_to_gcs_task.set_downstream(create_external_table_group_task) 208 | create_external_table_group_task.set_downstream(create_table_partitions_task) 209 | create_table_partitions_task.set_downstream(clean_up_dataset_store_task) 210 | create_table_partitions_task.set_downstream(uninstall_pip_packge_task) 211 | -------------------------------------------------------------------------------- /docker-compose.yaml: -------------------------------------------------------------------------------- 1 | x-airflow-common: &airflow-common 2 | image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.8.3} 3 | # build: . 4 | environment: &airflow-common-env 5 | AIRFLOW__CORE__EXECUTOR: CeleryExecutor 6 | AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow 7 | AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow 8 | AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0 9 | AIRFLOW__CORE__FERNET_KEY: "" 10 | AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: "true" 11 | AIRFLOW__CORE__LOAD_EXAMPLES: "false" 12 | AIRFLOW__API__AUTH_BACKENDS: "airflow.api.auth.backend.basic_auth,airflow.api.auth.backend.session" 13 | AIRFLOW__SCHEDULER__ENABLE_HEALTH_CHECK: "true" 14 | _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-} 15 | KAGGLE_USERNAME: "$KAGGLE_USERNAME" 16 | KAGGLE_KEY: "$KAGGLE_TOKEN" 17 | GCP_PROJECT_ID: "$GCP_PROJECT_ID" 18 | GCP_BOOK_RECOMMENDATION_BUCKET: "$GCP_BOOK_RECOMMENDATION_BUCKET" 19 | GOOGLE_APPLICATION_CREDENTIALS: "/.google/credentials/dezoomcamp-2024.json" 20 | AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: "google-cloud-platform://?extra__google_cloud_platform__key_path=$GOOGLE_APPLICATION_CREDENTIALS" 21 | GCP_BOOK_RECOMMENDATION_WH_DATASET: "$GCP_BOOK_RECOMMENDATION_WH_DATASET" 22 | GCP_BOOK_RECOMMENDATION_WH_EXT_DATASET: "$GCP_BOOK_RECOMMENDATION_WH_EXT_DATASET" 23 | volumes: 24 | - ${AIRFLOW_PROJ_DIR:-.}/data_airflow/dags:/opt/airflow/dags 25 | - ${AIRFLOW_PROJ_DIR:-.}/data_airflow/logs:/opt/airflow/logs 26 | - ${AIRFLOW_PROJ_DIR:-.}/data_airflow/config:/opt/airflow/config 27 | - ${AIRFLOW_PROJ_DIR:-.}/data_airflow/plugins:/opt/airflow/plugins 28 | - ~/.google/credentials/:/.google/credentials/:ro 29 | user: "${AIRFLOW_UID:-50000}:0" 30 | depends_on: &airflow-common-depends-on 31 | redis: 32 | condition: service_healthy 33 | postgres: 34 | condition: service_healthy 35 | 36 | services: 37 | metabase: 38 | image: metabase/metabase:latest 39 | container_name: book-recommendation-metabase 40 | ports: 41 | - 1460:3000 42 | postgres: 43 | image: postgres:13 44 | container_name: airflow-postgres 45 | environment: 46 | POSTGRES_USER: airflow 47 | POSTGRES_PASSWORD: airflow 48 | POSTGRES_DB: airflow 49 | volumes: 50 | - postgres-db-volume:/var/lib/postgresql/data 51 | healthcheck: 52 | test: ["CMD", "pg_isready", "-U", "airflow"] 53 | interval: 10s 54 | retries: 5 55 | start_period: 5s 56 | restart: always 57 | 58 | redis: 59 | image: redis:latest 60 | container_name: airflow-redis 61 | expose: 62 | - 6379 63 | healthcheck: 64 | test: ["CMD", "redis-cli", "ping"] 65 | interval: 10s 66 | timeout: 30s 67 | retries: 50 68 | start_period: 30s 69 | restart: always 70 | 71 | airflow-webserver: 72 | <<: *airflow-common 73 | container_name: airflow-webserver 74 | command: webserver 75 | ports: 76 | - "8080:8080" 77 | healthcheck: 
78 | test: ["CMD", "curl", "--fail", "http://localhost:8080/health"] 79 | interval: 30s 80 | timeout: 10s 81 | retries: 5 82 | start_period: 30s 83 | restart: always 84 | depends_on: 85 | <<: *airflow-common-depends-on 86 | airflow-init: 87 | condition: service_completed_successfully 88 | 89 | airflow-scheduler: 90 | <<: *airflow-common 91 | container_name: airflow-scheduler 92 | command: scheduler 93 | healthcheck: 94 | test: ["CMD", "curl", "--fail", "http://localhost:8974/health"] 95 | interval: 30s 96 | timeout: 10s 97 | retries: 5 98 | start_period: 30s 99 | restart: always 100 | depends_on: 101 | <<: *airflow-common-depends-on 102 | airflow-init: 103 | condition: service_completed_successfully 104 | 105 | airflow-worker: 106 | <<: *airflow-common 107 | container_name: airflow-worker 108 | command: celery worker 109 | healthcheck: 110 | # yamllint disable rule:line-length 111 | test: 112 | - "CMD-SHELL" 113 | - 'celery --app airflow.providers.celery.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}" || celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"' 114 | interval: 30s 115 | timeout: 10s 116 | retries: 5 117 | start_period: 30s 118 | environment: 119 | <<: *airflow-common-env 120 | # Required to handle warm shutdown of the celery workers properly 121 | # See https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation 122 | DUMB_INIT_SETSID: "0" 123 | restart: always 124 | depends_on: 125 | <<: *airflow-common-depends-on 126 | airflow-init: 127 | condition: service_completed_successfully 128 | 129 | airflow-triggerer: 130 | <<: *airflow-common 131 | container_name: airflow-triggerer 132 | command: triggerer 133 | healthcheck: 134 | test: 135 | [ 136 | "CMD-SHELL", 137 | 'airflow jobs check --job-type TriggererJob --hostname "$${HOSTNAME}"', 138 | ] 139 | interval: 30s 140 | timeout: 10s 141 | retries: 5 142 | start_period: 30s 143 | restart: always 144 | depends_on: 145 | <<: *airflow-common-depends-on 146 | airflow-init: 147 | condition: service_completed_successfully 148 | 149 | airflow-init: 150 | <<: *airflow-common 151 | container_name: airflow-init 152 | entrypoint: /bin/bash 153 | # yamllint disable rule:line-length 154 | command: 155 | - -c 156 | - | 157 | if [[ -z "${AIRFLOW_UID}" ]]; then 158 | echo 159 | echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m" 160 | echo "If you are on Linux, you SHOULD follow the instructions below to set " 161 | echo "AIRFLOW_UID environment variable, otherwise files will be owned by root." 162 | echo "For other operating systems you can get rid of the warning with manually created .env file:" 163 | echo " See: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#setting-the-right-airflow-user" 164 | echo 165 | fi 166 | one_meg=1048576 167 | mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg)) 168 | cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat) 169 | disk_available=$$(df / | tail -1 | awk '{print $$4}') 170 | warning_resources="false" 171 | if (( mem_available < 4000 )) ; then 172 | echo 173 | echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m" 174 | echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))" 175 | echo 176 | warning_resources="true" 177 | fi 178 | if (( cpus_available < 2 )); then 179 | echo 180 | echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m" 181 | echo "At least 2 CPUs recommended. 
You have $${cpus_available}" 182 | echo 183 | warning_resources="true" 184 | fi 185 | if (( disk_available < one_meg * 10 )); then 186 | echo 187 | echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m" 188 | echo "At least 10 GBs recommended. You have $$(numfmt --to iec $$((disk_available * 1024 )))" 189 | echo 190 | warning_resources="true" 191 | fi 192 | if [[ $${warning_resources} == "true" ]]; then 193 | echo 194 | echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m" 195 | echo "Please follow the instructions to increase amount of resources available:" 196 | echo " https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#before-you-begin" 197 | echo 198 | fi 199 | mkdir -p /sources/data_airflow/logs /sources/data_airflow/dags /sources/data_airflow/plugins 200 | chown -R "${AIRFLOW_UID}:0" /sources/data_airflow/{logs,dags,plugins} 201 | exec /entrypoint airflow version 202 | # yamllint enable rule:line-length 203 | environment: 204 | <<: *airflow-common-env 205 | _AIRFLOW_DB_MIGRATE: "true" 206 | _AIRFLOW_WWW_USER_CREATE: "true" 207 | _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow} 208 | _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow} 209 | _PIP_ADDITIONAL_REQUIREMENTS: "" 210 | user: "0:0" 211 | volumes: 212 | - ${AIRFLOW_PROJ_DIR:-.}:/sources 213 | 214 | airflow-cli: 215 | <<: *airflow-common 216 | container_name: airflow-cli 217 | profiles: 218 | - debug 219 | environment: 220 | <<: *airflow-common-env 221 | CONNECTION_CHECK_MAX_COUNT: "0" 222 | # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252 223 | command: 224 | - bash 225 | - -c 226 | - airflow 227 | 228 | # You can enable flower by adding "--profile flower" option e.g. docker-compose --profile flower up 229 | # or by explicitly targeted on the command line e.g. docker-compose up flower. 230 | # See: https://docs.docker.com/compose/profiles/ 231 | flower: 232 | <<: *airflow-common 233 | command: celery flower 234 | container_name: airflow-flower 235 | profiles: 236 | - flower 237 | ports: 238 | - "5555:5555" 239 | healthcheck: 240 | test: ["CMD", "curl", "--fail", "http://localhost:5555/"] 241 | interval: 30s 242 | timeout: 10s 243 | retries: 5 244 | start_period: 30s 245 | restart: always 246 | depends_on: 247 | <<: *airflow-common-depends-on 248 | airflow-init: 249 | condition: service_completed_successfully 250 | 251 | volumes: 252 | postgres-db-volume: 253 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data Pipeline Project for Book Recommendation 2 | 3 |
4 | <details><summary>Table of Contents</summary>
5 |   <ol>
6 |     <li>
7 |       <a href="#introduction">Introduction</a>
8 |       <ul>
9 |         <li><a href="#built-with">Built With</a></li>
10 |       </ul>
11 |     </li>
12 |
13 |     <li>
14 |       <a href="#project-architecture">Project Architecture</a>
15 |     </li>
16 |
17 |     <li>
18 |       <a href="#getting-started">Getting Started</a>
19 |       <ul>
20 |         <li><a href="#prerequisites">Prerequisites</a></li>
21 |         <li><a href="#create-a-google-cloud-project">Create a Google Cloud Project</a></li>
22 |         <li><a href="#set-up-kaggle">Set up kaggle</a></li>
23 |         <li><a href="#set-up-the-infrastructure-on-gcp-with-terraform">Set up the infrastructure on GCP with Terraform</a></li>
24 |         <li><a href="#set-up-airflow-and-metabase">Set up Airflow and Metabase</a></li>
25 |       </ul>
26 |     </li>
27 |
28 |     <li>
29 |       <a href="#data-ingestion">Data Ingestion</a>
30 |     </li>
31 |
32 |     <li>
33 |       <a href="#data-transformation">Data Transformation</a>
34 |     </li>
35 |
36 |     <li>
37 |       <a href="#data-visualization">Data Visualization</a>
38 |     </li>
39 |
40 |     <li>
41 |       <a href="#contact">Contact</a>
42 |     </li>
43 |
44 |     <li>
45 |       <a href="#acknowledgments">Acknowledgments</a>
46 |     </li>
47 |   </ol>
48 | </details>
49 |
50 | ## Introduction
51 |
52 | This project is part of the [Data Engineering Zoomcamp](https://github.com/DataTalksClub/data-engineering-zoomcamp). As part of the project, I developed a data pipeline to load and process data from a Kaggle dataset containing bookstore information for a book recommendation system. The dataset can be accessed [on Kaggle](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset/).
53 |
54 | This dataset offers book ratings from users of various ages and geographical locations. It comprises three files: Users.csv, which includes age and location data of bookstore users; Books.csv, containing information such as authors, titles, and ISBNs; and Ratings.csv, which details the ratings given by users for each book. Additional information about the dataset is available on Kaggle.
55 |
56 | The primary objective of this project is to establish a streamlined data pipeline for obtaining, storing, cleansing, and visualizing data automatically. This pipeline aims to address various queries, such as identifying top-rated publishers and authors, and analyzing ratings based on geographical location.
57 |
58 | Given that the data is static, the data pipeline operates as a one-time process.
59 |
60 | ### Built With
61 |
62 | - Dataset repo: [Kaggle](https://www.kaggle.com)
63 | - Infrastructure as Code: [Terraform](https://www.terraform.io/)
64 | - Workflow Orchestration: [Airflow](https://airflow.apache.org)
65 | - Data Lake: [Google Cloud Storage](https://cloud.google.com/storage)
66 | - Data Warehouse: [Google BigQuery](https://cloud.google.com/bigquery)
67 | - Transformation: [DBT](https://www.getdbt.com/)
68 | - Visualisation: [Metabase](https://www.metabase.com/)
69 | - Programming Languages: Python and SQL
70 |
71 | ## Project Architecture
72 |
73 | ![architecture](./screenshots/architecture.png)
74 | Cloud infrastructure is set up with Terraform.
75 |
76 | Airflow runs in a local Docker container.
77 |
78 | ## Getting Started
79 |
80 | ### Prerequisites
81 |
82 | 1. A [Google Cloud Platform](https://cloud.google.com/) account.
83 | 2. A [Kaggle](https://www.kaggle.com/) account.
84 | 3. Install VSCode, [Zed](https://zed.dev/), or any other IDE that works for you.
85 | 4. [Install Terraform](https://www.terraform.io/downloads)
86 | 5. [Install Docker Desktop](https://docs.docker.com/get-docker/)
87 | 6. [Install Google Cloud SDK](https://cloud.google.com/sdk)
88 | 7. Clone this repository onto your local machine.
89 |
90 | ### Create a Google Cloud Project
91 |
92 | - Go to [Google Cloud](https://console.cloud.google.com/) and create a new project.
93 | - Get the project ID and set the `GCP_PROJECT_ID` environment variable in the `.env` file located in the root directory.
94 | - Create a [Service account](https://cloud.google.com/iam/docs/service-account-overview) with the following roles:
95 |   - `BigQuery Admin`
96 |   - `Storage Admin`
97 |   - `Storage Object Admin`
98 |   - `Viewer`
99 | - Download the Service Account credentials and store them in `$HOME/.google/credentials/`.
100 | - You need to activate the following APIs [here](https://console.cloud.google.com/apis/library/browse):
101 |   - Cloud Storage API
102 |   - BigQuery API
103 | - Assign the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of your JSON credentials file, such that `GOOGLE_APPLICATION_CREDENTIALS` will be $HOME/.google/credentials/.json
104 | - Add this line to the end of the `.bashrc` file:
105 | ```bash
106 | export GOOGLE_APPLICATION_CREDENTIALS=${HOME}/.google/google_credentials.json
107 | ```
108 | - Activate the environment variable by running `source .bashrc`.
109 |
110 | ### Set up kaggle
111 |
112 | - A detailed description of how to authenticate is found [here](https://www.kaggle.com/docs/api).
113 | - Define the environment variables `KAGGLE_USERNAME` and `KAGGLE_TOKEN` in the `.env` file located in the root directory. Note: `KAGGLE_TOKEN` is the same as `KAGGLE_KEY`.
114 |
115 | ### Set up the infrastructure on GCP with Terraform
116 |
117 | - Using Zed or VSCode, open the cloned project `DE-2024-project-book-recommendation`.
118 | - To customize the default values of `variable "project"` and `variable "region"` to your preferred project ID and region, you have two options: either edit `terraform/variables.tf` directly and modify the values, or set the environment variables `TF_VAR_project` and `TF_VAR_region`.
119 | - Open a terminal at the root of the project.
120 | - Change the directory to the terraform folder using the command `cd terraform`.
121 | - Set an alias: `alias tf='terraform'`
122 | - Initialise Terraform: `tf init`
123 | - Plan the infrastructure: `tf plan`
124 | - Apply the changes: `tf apply`
125 |
126 | ### Set up Airflow and Metabase
127 |
128 | - Please confirm that the following environment variables are configured in `.env` in the root directory of the project (a sample `.env` is sketched at the end of this document).
129 |   - `AIRFLOW_UID`. The default value is 50000.
130 |   - `KAGGLE_USERNAME`. This should be set from the [Set up kaggle](#set-up-kaggle) section.
131 |   - `KAGGLE_TOKEN`. This should be set from the [Set up kaggle](#set-up-kaggle) section too.
132 |   - `GCP_PROJECT_ID`. This should be set from the [Create a Google Cloud Project](#create-a-google-cloud-project) section.
133 |   - `GCP_BOOK_RECOMMENDATION_BUCKET=book_recommendation_datalake_`
134 |   - `GCP_BOOK_RECOMMENDATION_WH_DATASET=book_recommendation_analytics`
135 |   - `GCP_BOOK_RECOMMENDATION_WH_EXT_DATASET=book_recommendataion_wh`
136 | - Run `docker-compose up`.
137 | - Access the Airflow dashboard by visiting `http://localhost:8080/` in your web browser. The interface will resemble the following. Use the username and password `airflow` to log in.
138 |
139 | ![Airflow](./screenshots/airflow_home.png)
140 |
141 | - Visit `http://localhost:1460` in your web browser to access the Metabase dashboard. The interface will resemble the following. You will need to sign up to use the UI.
142 |
143 | ![Metabase](./screenshots/metabase_home.png)
144 |
145 | ## Data Ingestion
146 |
147 | Once you've completed all the steps outlined in the previous section, you should be able to view the Airflow dashboard in your web browser. The list of DAGs will be displayed as shown below.
148 | ![DAGS](./screenshots/dags_index.png)
149 | Below is the DAG's graph.
150 | ![DAG Graph](./screenshots/dag_graph.png)
151 | To run the DAG, click on the play button (Figure 1).
152 | ![Run Graph](./screenshots/run_dag.png)
153 |
154 | ## Data Transformation
155 |
156 | - Navigate to the root directory of the project in the terminal and then change the directory to the data_dbt folder using the command `cd data_dbt`.
157 | - Generate a profiles.yml file within `${HOME}/.dbt` and define a profile for this project as shown below.
158 |
159 | ```yaml
160 | data_dbt_book_recommendation:
161 |   outputs:
162 |     dev:
163 |       dataset: book_recommendation_analytics
164 |       fixed_retries: 1
165 |       keyfile:
166 |       location:
167 |       method: service-account
168 |       priority: interactive
169 |       project:
170 |       threads: 6
171 |       timeout_seconds: 300
172 |       type: bigquery
173 |   target: dev
174 | ```
175 |
176 | - To run all models, run `dbt run -t dev`.
177 | - Navigate to your Google [BigQuery](https://console.cloud.google.com/bigquery) project by clicking on this link. There, you'll find all the tables and views created by DBT.
178 | ![Big Query](./screenshots/bigquery_schema_1.png)
179 |
180 | ## Data Visualization
181 |
182 | Please watch the [provided video tutorial](https://youtu.be/BnLkrA7a6gM&) to configure your Metabase database connection with BigQuery. You have the flexibility to customize your dashboard according to your preferences. Additionally, this linked [PDF](./screenshots/DE_2024_Dashboard.pdf) contains a complete screenshot of the dashboard I created.
183 |
184 | ![Dashboard](./screenshots/DE_2024_Dashboard.png)
185 |
186 | ## Contact
187 |
188 | Twitter: [@iamraphson](https://twitter.com/iamraphson)
189 |
190 | ## Acknowledgments
191 |
192 | I would like to extend my heartfelt gratitude to the organizers of the [Data Engineering Zoomcamp](https://github.com/DataTalksClub/data-engineering-zoomcamp) for providing such a valuable course. The insights I gained have been instrumental in broadening my understanding of the field of Data Engineering. Additionally, I want to express my appreciation to my colleagues with whom I took the course. Thank you all for your support and collaboration throughout this journey.
193 |
194 | 🦅
195 |
--------------------------------------------------------------------------------
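For convenience, here is a sketch of what a complete `.env` in the project root might look like, pulling together the variables referenced in the setup sections above and read by `docker-compose.yaml`. All values shown are placeholders; substitute your own Kaggle credentials and GCP project ID, and make sure the bucket and dataset names match what Terraform actually created for your project.

```bash
# Illustrative .env (placeholder values only)
AIRFLOW_UID=50000

# Kaggle API credentials (see "Set up kaggle")
KAGGLE_USERNAME=your-kaggle-username
KAGGLE_TOKEN=your-kaggle-api-key

# Google Cloud settings (see "Create a Google Cloud Project" and terraform/variables.tf)
GCP_PROJECT_ID=your-gcp-project-id
GCP_BOOK_RECOMMENDATION_BUCKET=book_recommendation_datalake_your-gcp-project-id
GCP_BOOK_RECOMMENDATION_WH_DATASET=book_recommendation_analytics
GCP_BOOK_RECOMMENDATION_WH_EXT_DATASET=book_recommendataion_wh
```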