├── .gitignore ├── dags ├── .airflowignore └── tayloro │ └── business_entity_dag.py ├── capstone_queries ├── graffana │ ├── colorado_city_metrics │ │ ├── variable_cities.sql │ │ ├── likely_city_crimes_during_day.sql │ │ ├── likely_city_crimes_during_night.sql │ │ └── seasonal_city_crime_rate_by_month_1997_2020.sql │ ├── colorado_county_metrics │ │ ├── county_police_agency_crime_count.sql │ │ ├── seasonal_average_crime_trends_for_county.sql │ │ ├── county_crime_count_per_category_1997_2020.sql │ │ ├── crime_density_per_county_1997_2020.sql │ │ ├── county_population_vs_crime_trends_1997_2020.sql │ │ └── county_crime_rates_vs_median_income.sql │ └── business_search │ │ └── business_search_query.sql ├── gold │ ├── create_colorado_county_agency_crime_counts.sql │ ├── create_colorado_city_crime_counts_1997_2020.sql │ ├── create_colorado_avg_age_per_crime_category_1997_2020.sql │ ├── normalized_crime_and_population_trends.sql │ ├── create_county_crime_per_capita_with_coordinates.sql │ ├── total_count_of_crimes_by_county.sql │ ├── create_colorado_crime_type_distribution_by_city.sql │ ├── create_colorado_crime_category_totals_per_county_1997_2020.sql │ ├── create_colorado_county_average_income_1997_2020.sql │ ├── colorado_crime_population_trends.sql │ ├── create_colorado_crime_vs_median_household_income_1997_2020.sql │ ├── create_colorado_city_crime_time_likelihood.sql │ ├── create_colorado_city_seasonal_crime_rates_1997_2020.sql │ └── create_colorado_county_seasonal_crime_rates_1997_2020.sql ├── silver │ ├── delete_malformed_business_entities.sql │ ├── update_principal_zip_code_formatting.sql │ ├── update_colorado_business_entity_county.sql │ ├── create_colorado_business_entities_with_ranking.sql │ ├── create_colorado_crimes_2016_2020_with_cities.sql │ ├── create_colorado_income_1997_2020.sql │ ├── create_total_population_per_county_1997_2020.sql │ ├── combine_city_and_county_data_for_business_entities.sql │ ├── create_colorado_county_total_crime_per_capita_1997_2020.sql │ 
├── household_income_and_total_crimes_1997_2020.sql │ ├── create_colorado_final_county_tier_rank.sql │ ├── calculate_crime_percentile_for_categories.sql │ ├── calculate_colorado_household_income_tier_rank.sql │ ├── calculate_final_county_tier_rank.sql │ ├── create_yearly_county_crime_per_capita_1997_2020.sql │ ├── calculate_population_crime_per_capita_tier.sql │ ├── calculate_population_crime_per_capita_rank.sql │ ├── create_colorado_crime_tier_county_rank.sql │ ├── create_colorado_crimes_with_cities.sql │ ├── create_colorado_crime_weighted_scores.sql │ ├── create_colorado_crimes.sql │ └── clean_address_colorado_business_entity_data.sql └── bronze │ ├── write_colorado_county_coordinates_iceberg_from_S3.py │ ├── write_colorado_crimes_1997_2020_iceberg_from_S3.py │ ├── write_colorado_income_iceberg_from_S3.py │ └── write_colorado_crimes_2016_2020_iceberg_from_S3.py ├── scripts ├── getS3bucketfilelist.py ├── upload_capstone_csv_data_to_s3.py ├── modify_colorado_city_county_zip_csv.py ├── create_crime_rate_per_capita.py ├── create_colorado_crimes_table.py ├── create_colorado_business_entities.py ├── write_to_iceberg_from_S3.py └── business_entity_address_validation.py ├── Capstone-data-dictionary.csv └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | .env -------------------------------------------------------------------------------- /dags/.airflowignore: -------------------------------------------------------------------------------- 1 | community -------------------------------------------------------------------------------- /capstone_queries/graffana/colorado_city_metrics/variable_cities.sql: -------------------------------------------------------------------------------- 1 | select distinct city from tayloro.colorado_crimes_with_cities order by city asc -------------------------------------------------------------------------------- 
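The Grafana queries in this repo rely on dashboard variables (`$city`, `$county`) that Grafana interpolates into the SQL text before it reaches the query engine; `variable_cities.sql` above is the query that populates the `$city` dropdown. As a rough sketch of that behavior (not Grafana's actual implementation — `render_query` and its replace-longest-first rule are hypothetical), the substitution amounts to a template replace:

```python
def render_query(template: str, variables: dict[str, str]) -> str:
    """Substitute Grafana-style $var placeholders into a SQL template.

    Hypothetical helper: Grafana performs this interpolation internally;
    this only illustrates how queries like "... where city = '$city'"
    behave once a dashboard variable is selected.
    """
    rendered = template
    # Replace longer names first so e.g. $county_name never collides with $county.
    for name in sorted(variables, key=len, reverse=True):
        rendered = rendered.replace(f"${name}", variables[name])
    return rendered


template = (
    "SELECT offense_category_name "
    "FROM academy.tayloro.colorado_city_crime_time_likelihood "
    "WHERE city = '$city' AND is_day_likely = true"
)
print(render_query(template, {"city": "Denver"}))
```

Note that plain string substitution is what makes quoting inside the template (`'$city'`) matter: the variable value is spliced in verbatim.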
/capstone_queries/graffana/colorado_city_metrics/likely_city_crimes_during_day.sql: -------------------------------------------------------------------------------- 1 | SELECT offense_category_name FROM academy.tayloro.colorado_city_crime_time_likelihood where city = '$city' and is_day_likely = true -------------------------------------------------------------------------------- /capstone_queries/graffana/colorado_city_metrics/likely_city_crimes_during_night.sql: -------------------------------------------------------------------------------- 1 | SELECT offense_category_name FROM academy.tayloro.colorado_city_crime_time_likelihood where city = '$city' and is_night_likely = true -------------------------------------------------------------------------------- /capstone_queries/graffana/colorado_county_metrics/county_police_agency_crime_count.sql: -------------------------------------------------------------------------------- 1 | SELECT agency_name, total_crimes FROM academy.tayloro.colorado_county_agency_crime_counts where county_name = LOWER('$county') order by total_crimes desc 2 | -------------------------------------------------------------------------------- /capstone_queries/graffana/colorado_city_metrics/seasonal_city_crime_rate_by_month_1997_2020.sql: -------------------------------------------------------------------------------- 1 | SELECT city, month_name, avg_monthly_crimes FROM academy.tayloro.colorado_city_seasonal_crime_rates_1997_2020 WHERE city = 'Denver' order by crime_month asc 2 | -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_county_agency_crime_counts.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_county_agency_crime_counts AS 2 | SELECT 3 | county_name, 4 | agency_name, 5 | COUNT(*) AS total_crimes 6 | FROM tayloro.colorado_crimes 7 | GROUP BY county_name, agency_name 8 | ORDER BY total_crimes DESC; 
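The `GROUP BY county_name, agency_name` / `COUNT(*)` pattern in `create_colorado_county_agency_crime_counts.sql` above maps directly onto counting key pairs. A minimal Python sketch of the same aggregation — the sample rows are invented stand-ins for `tayloro.colorado_crimes`:

```python
from collections import Counter

# Toy rows standing in for tayloro.colorado_crimes (values invented).
crimes = [
    {"county_name": "denver", "agency_name": "Denver PD"},
    {"county_name": "denver", "agency_name": "Denver PD"},
    {"county_name": "boulder", "agency_name": "Boulder PD"},
]

# GROUP BY county_name, agency_name + COUNT(*) == a Counter over key pairs.
totals = Counter((r["county_name"], r["agency_name"]) for r in crimes)

# ORDER BY total_crimes DESC == most_common().
for (county, agency), total in totals.most_common():
    print(county, agency, total)
```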
-------------------------------------------------------------------------------- /capstone_queries/graffana/colorado_county_metrics/seasonal_average_crime_trends_for_county.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | county_name, 3 | month_name, 4 | avg_monthly_crimes 5 | FROM academy.tayloro.colorado_county_seasonal_crime_rates_1997_2020 6 | WHERE county_name = LOWER('$county') 7 | ORDER BY crime_month ASC 8 | -------------------------------------------------------------------------------- /capstone_queries/silver/delete_malformed_business_entities.sql: -------------------------------------------------------------------------------- 1 | DELETE FROM tayloro.colorado_business_entities 2 | WHERE principalstate = 'CO' 3 | AND ( 4 | CASE 5 | WHEN principalzipcode LIKE '%-%' THEN SUBSTRING(principalzipcode, 1, 5) 6 | ELSE principalzipcode 7 | END 8 | ) NOT LIKE '8%'; -------------------------------------------------------------------------------- /capstone_queries/silver/update_principal_zip_code_formatting.sql: -------------------------------------------------------------------------------- 1 | UPDATE tayloro.colorado_business_entities 2 | SET principalzipcode = CASE 3 | WHEN principalzipcode LIKE '%-%' THEN SUBSTRING(principalzipcode, 1, 5) 4 | ELSE principalzipcode 5 | END 6 | WHERE principalstate = 'CO' 7 | AND NOT regexp_like(principalzipcode, '^\d{5}$'); -------------------------------------------------------------------------------- /capstone_queries/graffana/colorado_county_metrics/county_crime_count_per_category_1997_2020.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | offense_category_name, -- Crime type 3 | crime_count -- Count of crimes per category 4 | FROM 5 | academy.tayloro.colorado_crime_category_totals_per_county_1997_2020 6 | WHERE county_name = LOWER('$county') -- Grafana variable for county selection 
-------------------------------------------------------------------------------- /capstone_queries/graffana/colorado_county_metrics/crime_density_per_county_1997_2020.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | county, 3 | cent_lat AS lat, 4 | cent_long AS lon, 5 | crime_per_100k, 6 | LN(1 + crime_per_100k) AS crime_log_scale -- Use LN() for natural log (PostgreSQL, DuckDB, etc.) 7 | FROM academy.tayloro.colorado_county_crime_per_capita_with_coordinates -------------------------------------------------------------------------------- /capstone_queries/silver/update_colorado_business_entity_county.sql: -------------------------------------------------------------------------------- 1 | UPDATE tayloro.colorado_business_entities 2 | SET principalcounty = ( 3 | SELECT ccz.county 4 | FROM tayloro.colorado_city_county_zip ccz 5 | WHERE TRIM(LOWER(principalcity)) = TRIM(LOWER(ccz.city)) 6 | AND TRIM(CAST(principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 7 | ) 8 | WHERE principalstate = 'CO'; -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_city_crime_counts_1997_2020.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_city_crime_counts_1997_2020 AS 2 | SELECT 3 | city, 4 | COUNT(*) AS crime_count 5 | FROM 6 | academy.tayloro.colorado_crimes_with_cities 7 | WHERE 8 | EXTRACT(YEAR FROM incident_date) BETWEEN 1997 AND 2020 9 | and city is not null 10 | GROUP BY 11 | city 12 | ORDER BY 13 | crime_count DESC; -------------------------------------------------------------------------------- /capstone_queries/silver/create_colorado_business_entities_with_ranking.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_business_entities_with_ranking AS 2 | SELECT 3 | be.*, 4 | fr.crime_rank, 5 | 
fr.income_rank, 6 | fr.population_rank, 7 | fr.final_rank 8 | FROM tayloro.colorado_business_entities be 9 | LEFT JOIN tayloro.colorado_final_county_tier_rank fr 10 | ON LOWER(be.principalcounty) = LOWER(fr.county); -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_avg_age_per_crime_category_1997_2020.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_avg_age_per_crime_category_1997_2020 AS 2 | SELECT 3 | city, 4 | offense_category_name, 5 | ROUND(AVG(age_num), 2) AS avg_age 6 | FROM tayloro.colorado_crimes_with_cities 7 | WHERE age_num IS NOT NULL 8 | and city is not null 9 | GROUP BY city, offense_category_name 10 | ORDER BY city, offense_category_name; -------------------------------------------------------------------------------- /capstone_queries/graffana/colorado_county_metrics/county_population_vs_crime_trends_1997_2020.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | DATE_PARSE(CAST(year AS VARCHAR) || '-01-01', '%Y-%m-%d') AS time, -- Convert year to date for Grafana 3 | total_crimes, -- Raw total crimes per year 4 | total_population -- Raw total population per year 5 | FROM academy.tayloro.colorado_crime_population_trends 6 | WHERE county_name = '$county' 7 | ORDER BY year -------------------------------------------------------------------------------- /capstone_queries/gold/normalized_crime_and_population_trends.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | DATE_PARSE(CAST(year AS VARCHAR) || '-01-01', '%Y-%m-%d') AS time, -- Convert year to date for Grafana 3 | total_crimes, -- Raw total crimes per year 4 | total_population -- Raw total population per year 5 | FROM academy.tayloro.colorado_crime_population_trends 6 | WHERE county_name = '$county' 7 | ORDER BY year 8 | 9 | 10 | 11 | 12 | 13 | 
-------------------------------------------------------------------------------- /capstone_queries/gold/create_county_crime_per_capita_with_coordinates.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_county_crime_per_capita_with_coordinates AS 2 | select ccc.label AS county, 3 | crime_per_100k, 4 | ccc.cent_lat, 5 | ccc.cent_long 6 | from tayloro.colorado_county_crime_per_capita_1997_2020 crpc 7 | inner join tayloro.colorado_county_coordinates_raw ccc 8 | on LOWER(crpc.county_name) = LOWER(ccc.label) -------------------------------------------------------------------------------- /capstone_queries/gold/total_count_of_crimes_by_county.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | offense_category_name, -- Crime type 3 | COUNT(*) AS crime_count -- Count of crimes per category 4 | FROM 5 | academy.tayloro.colorado_crimes 6 | WHERE 7 | EXTRACT(YEAR FROM incident_date) BETWEEN 1997 AND 2020 8 | AND county_name = '$county' -- Grafana variable for county selection 9 | GROUP BY 10 | offense_category_name 11 | ORDER BY 12 | crime_count DESC; -------------------------------------------------------------------------------- /capstone_queries/silver/create_colorado_crimes_2016_2020_with_cities.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_crimes_2016_2020_with_cities AS 2 | WITH unique_city_mapping AS ( 3 | SELECT DISTINCT city 4 | FROM tayloro.colorado_city_county_zip 5 | ) 6 | SELECT 7 | ccr.*, 8 | ucm.city AS city 9 | FROM tayloro.colorado_crimes_2016_2020_raw ccr 10 | INNER JOIN unique_city_mapping ucm 11 | ON LOWER(TRIM(ccr.pub_agency_name)) = LOWER(TRIM(ucm.city)); -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_crime_type_distribution_by_city.sql: 
-------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_crime_type_distribution_by_city AS 2 | SELECT 3 | city, 4 | crime_against, -- e.g., Person, Property, Society 5 | COUNT(*) AS total_crimes 6 | FROM tayloro.colorado_crimes_with_cities 7 | WHERE city IS NOT NULL 8 | and crime_against is not null 9 | and crime_against <> 'Not a Crime' 10 | GROUP BY city, crime_against 11 | ORDER BY city, total_crimes DESC; -------------------------------------------------------------------------------- /capstone_queries/graffana/business_search/business_search_query.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | entityname AS Company_Name, 3 | principaladdress1 AS Address, 4 | principalcity AS City, 5 | principalstate AS State, 6 | principalcounty AS County 7 | FROM 8 | academy.tayloro.colorado_business_entities_with_ranking 9 | WHERE 10 | LOWER(entityname) LIKE LOWER('%$entityname%') AND principalstate = 'CO' -- Case-insensitive search 11 | ORDER BY 12 | entityname -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_crime_category_totals_per_county_1997_2020.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_crime_category_totals_per_county_1997_2020 AS 2 | SELECT 3 | offense_category_name, -- Crime type 4 | COUNT(*) AS crime_count, -- Count of crimes per category 5 | county_name 6 | FROM 7 | academy.tayloro.colorado_crimes 8 | WHERE 9 | EXTRACT(YEAR FROM incident_date) BETWEEN 1997 AND 2020 10 | GROUP BY 11 | offense_category_name, county_name 12 | ORDER BY 13 | crime_count DESC; -------------------------------------------------------------------------------- /capstone_queries/graffana/colorado_county_metrics/county_crime_rates_vs_median_income.sql: -------------------------------------------------------------------------------- 1 
| SELECT 2 | DATE_PARSE(CAST(year AS VARCHAR) || '-01-01', '%Y-%m-%d') AS time, -- Ensure date formatting for Grafana 3 | CAST(MAX(crime_per_100k) AS DOUBLE) AS crime_per_100k, -- Max ensures single value per year 4 | CAST(MAX(median_household_income) AS DOUBLE) AS median_household_income -- Max ensures single value per year 5 | FROM academy.tayloro.colorado_crime_vs_median_household_income_1997_2020 6 | WHERE county_name = '$county' 7 | GROUP BY year 8 | ORDER BY year -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_county_average_income_1997_2020.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | county, 3 | ROUND(AVG(CASE WHEN inctype = 1 THEN income END), 2) AS avg_total_personal_income, -- Total Personal Income 4 | ROUND(AVG(CASE WHEN inctype = 2 THEN income END), 2) AS avg_per_capita_income, -- Per Capita Personal Income 5 | ROUND(AVG(CASE WHEN inctype = 3 THEN income END), 2) AS avg_median_household_income -- Median Household Income 6 | FROM academy.tayloro.colorado_income_1997_2020 7 | WHERE periodyear BETWEEN 1997 AND 2020 8 | GROUP BY county 9 | ORDER BY county; -------------------------------------------------------------------------------- /capstone_queries/silver/create_colorado_income_1997_2020.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_income_1997_2020 AS 2 | WITH colorado_income_1997_2020 AS ( 3 | SELECT 4 | TRIM(REPLACE(areaname, 'County', '')) AS county, 5 | incdesc, 6 | inctype, 7 | incsource, 8 | income, 9 | incrank, 10 | periodyear 11 | FROM tayloro.colorado_income_raw 12 | WHERE stateabbrv = 'CO' 13 | AND areatyname = 'County' 14 | AND periodyear BETWEEN 1997 AND 2020 -- Filter for years 1997-2020 15 | ) 16 | SELECT * FROM colorado_income_1997_2020; -------------------------------------------------------------------------------- 
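The `AVG(CASE WHEN inctype = … THEN income END)` pattern in `create_colorado_county_average_income_1997_2020.sql` above pivots the three income types into separate columns by averaging only the rows matching each type (non-matching rows become NULL and are skipped by `AVG`). A small Python sketch of that conditional-average idea, over invented sample rows:

```python
from collections import defaultdict

# Sample rows mirroring colorado_income_1997_2020 (values are made up).
rows = [
    {"county": "Boulder", "inctype": 3, "income": 78000},
    {"county": "Boulder", "inctype": 3, "income": 82000},
    {"county": "Boulder", "inctype": 2, "income": 55000},
]

def conditional_avg(rows, inctype):
    """AVG(CASE WHEN inctype = ? THEN income END): average only matching rows,
    returning None when no rows match (as SQL AVG over all-NULL would)."""
    vals = [r["income"] for r in rows if r["inctype"] == inctype]
    return round(sum(vals) / len(vals), 2) if vals else None

by_county = defaultdict(list)
for r in rows:
    by_county[r["county"]].append(r)

for county, group in sorted(by_county.items()):
    print(county,
          conditional_avg(group, 2),   # avg per-capita personal income
          conditional_avg(group, 3))   # avg median household income
```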
/capstone_queries/silver/create_total_population_per_county_1997_2020.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE academy.tayloro.colorado_total_population_per_county_1997_2020 AS 2 | WITH filtered_population AS ( 3 | SELECT 4 | county, 5 | year, 6 | SUM(total_population) AS total_population -- Aggregating population in case of duplicates 7 | FROM tayloro.colorado_population_raw 8 | WHERE year BETWEEN 1997 AND 2020 9 | AND data_type = 'Estimate' -- Ensuring only estimated values 10 | GROUP BY county, year 11 | ) 12 | SELECT * FROM filtered_population 13 | ORDER BY county, year; -------------------------------------------------------------------------------- /capstone_queries/gold/colorado_crime_population_trends.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | DATE_PARSE(CAST(EXTRACT(YEAR FROM incident_date) AS VARCHAR) || '-01-01', '%Y-%m-%d') AS time, -- Convert year to date for Grafana 3 | county_name, 4 | COUNT(*) AS total_crimes 5 | FROM academy.tayloro.colorado_crimes 6 | WHERE county_name = '$county' 7 | AND incident_date IS NOT NULL -- Exclude records with null incident_date 8 | AND county_name IS NOT NULL 9 | GROUP BY EXTRACT(YEAR FROM incident_date), county_name -- Group by the extracted year and county 10 | ORDER BY EXTRACT(YEAR FROM incident_date), county_name -- Order by the extracted year and county 11 | -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_crime_vs_median_household_income_1997_2020.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_crime_vs_median_household_income_1997_2020 AS 2 | SELECT 3 | c.county_name, 4 | c.year, -- Ensure the year column is included 5 | c.total_crimes, 6 | c.total_population, 7 | (c.total_crimes * 100000.0 / NULLIF(c.total_population, 0)) AS crime_per_100k, -- Crime 
rate per 100k residents 8 | i.income AS median_household_income -- Median household income per year 9 | FROM academy.tayloro.colorado_crime_population_trends c 10 | LEFT JOIN academy.tayloro.colorado_income_1997_2020 i 11 | ON LOWER(c.county_name) = LOWER(i.county) 12 | AND c.year = i.periodyear 13 | AND i.inctype = 3 -- 3 corresponds to "Median Household Income" 14 | ORDER BY c.year; -------------------------------------------------------------------------------- /capstone_queries/silver/combine_city_and_county_data_for_business_entities.sql: -------------------------------------------------------------------------------- 1 | INSERT INTO tayloro.colorado_business_entities 2 | SELECT b.entityid, 3 | b.entityname, 4 | b.principaladdress1, 5 | b.principaladdress2, 6 | b.principalcity, 7 | c.county AS principalcounty, -- Mapping county from cities dataset 8 | b.principalstate, 9 | b.principalzipcode, 10 | b.principalcountry, 11 | b.entitystatus, 12 | b.jurisdictonofformation, 13 | b.entitytype, 14 | CASE 15 | WHEN regexp_like(TRIM(b.entityformdate), '^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$') 16 | THEN CAST(date_parse(TRIM(b.entityformdate), '%m/%d/%Y') AS DATE) 17 | ELSE NULL 18 | END AS entityformdate -- Convert only valid dates 19 | FROM tayloro.colorado_business_entities_raw b 20 | INNER JOIN tayloro.colorado_cities_and_counties c 21 | ON LOWER(TRIM(b.principalcity)) = LOWER(TRIM(c.city)) -------------------------------------------------------------------------------- /capstone_queries/silver/create_colorado_county_total_crime_per_capita_1997_2020.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_county_crime_per_capita_1997_2020 AS 2 | WITH crimes_total AS ( 3 | SELECT 4 | county_name, 5 | COUNT(*) AS total_crimes 6 | FROM tayloro.colorado_crimes 7 | WHERE incident_date IS NOT NULL 8 | AND YEAR(incident_date) BETWEEN 1997 AND 2020 9 | GROUP BY county_name 10 | ), 11 | population_total AS ( 12 | SELECT 
13 | county AS county_name, 14 | SUM(TOTALPOPULATION) AS total_population 15 | FROM tayloro.colorado_population_raw 16 | WHERE DATATYPE = 'Estimate' 17 | AND YEAR BETWEEN 1997 AND 2020 18 | GROUP BY county 19 | ), 20 | crime_rate_per_capita AS ( 21 | SELECT 22 | c.county_name, 23 | c.total_crimes, 24 | p.total_population, 25 | (c.total_crimes * 100000.0 / p.total_population) AS crime_per_100k 26 | FROM crimes_total c 27 | JOIN population_total p 28 | ON LOWER(TRIM(c.county_name)) = LOWER(TRIM(p.county_name)) 29 | ) 30 | SELECT * FROM crime_rate_per_capita; -------------------------------------------------------------------------------- /capstone_queries/silver/household_income_and_total_crimes_1997_2020.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | DATE_PARSE(CAST(i.year AS VARCHAR) || '-01-01', '%Y-%m-%d') AS time, -- Convert year to date for Grafana 3 | i.county, 4 | i.household_income_per_capita, 5 | c.total_crimes 6 | FROM ( 7 | -- Per Capita Personal Income query (inctype = 2) 8 | SELECT 9 | TRIM(REPLACE(areaname, 'County', '')) AS county, 10 | periodyear AS year, 11 | income AS household_income_per_capita 12 | FROM academy.tayloro.colorado_income_raw 13 | WHERE areatyname = 'County' 14 | AND periodyear BETWEEN 1997 AND 2020 15 | AND inctype = 2 16 | ) i 17 | LEFT JOIN ( 18 | -- Total Crimes Query 19 | SELECT 20 | EXTRACT(YEAR FROM incident_date) AS year, 21 | county_name AS county, 22 | COUNT(*) AS total_crimes 23 | FROM academy.tayloro.colorado_crimes 24 | WHERE incident_date IS NOT NULL 25 | AND county_name IS NOT NULL 26 | GROUP BY EXTRACT(YEAR FROM incident_date), county_name 27 | ) c 28 | ON i.year = c.year 29 | AND i.county = c.county 30 | WHERE i.county = 'Boulder' -- TODO: replace hard-coded county with a dynamic filter (e.g. '$county') 31 | ORDER BY i.year -------------------------------------------------------------------------------- /capstone_queries/silver/create_colorado_final_county_tier_rank.sql:
-------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_final_county_tier_rank AS 2 | WITH combined_ranks AS ( 3 | SELECT 4 | COALESCE(c.county_name, i.county, p.county) AS county, 5 | COALESCE(c.overall_crime_tier, 0) AS crime_rank, 6 | COALESCE(i.income_tier, 0) AS income_rank, 7 | COALESCE(p.crime_per_capita_tier, 0) AS population_rank 8 | FROM academy.tayloro.colorado_crime_tier_county_rank c 9 | FULL OUTER JOIN academy.tayloro.colorado_income_household_tier_rank i 10 | ON LOWER(c.county_name) = LOWER(i.county) 11 | FULL OUTER JOIN academy.tayloro.colorado_population_crime_per_capita_rank p 12 | ON LOWER(c.county_name) = LOWER(p.county) 13 | ), 14 | average_rank AS ( 15 | SELECT 16 | county, 17 | crime_rank, 18 | income_rank, 19 | population_rank, 20 | -- Apply weights: crime is weighted 2x, income and population 1x each. 21 | ROUND((2 * crime_rank + income_rank + population_rank) / 4.0) AS final_rank 22 | FROM combined_ranks 23 | WHERE crime_rank > 0 AND income_rank > 0 AND population_rank > 0 24 | ) 25 | SELECT * 26 | FROM average_rank 27 | ORDER BY final_rank ASC; -------------------------------------------------------------------------------- /capstone_queries/silver/calculate_crime_percentile_for_categories.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE IF NOT EXISTS tayloro.colorado_crime_tier_stolen_property AS 2 | WITH crime_data AS ( 3 | -- Compute total crime count per county from 1997-2020 4 | SELECT 5 | county_name, 6 | COUNT(*) AS total_crimes 7 | FROM academy.tayloro.colorado_crimes 8 | WHERE offense_category_name = 'Stolen Property Offenses' 9 | AND EXTRACT(YEAR FROM incident_date) BETWEEN 1997 AND 2020 10 | GROUP BY county_name 11 | ), 12 | crime_percentiles AS ( 13 | -- Assign percentile ranks based on crime count 14 | SELECT 15 | county_name, 16 | total_crimes, 17 | PERCENT_RANK() OVER (ORDER BY total_crimes) AS crime_percentile
18 | FROM crime_data 19 | ), 20 | crime_tiers AS ( 21 | -- Categorize counties into Tiers 22 | SELECT 23 | county_name, 24 | total_crimes, 25 | crime_percentile, 26 | CASE 27 | WHEN crime_percentile >= 0.75 THEN 4 -- Very High Crime 28 | WHEN crime_percentile >= 0.50 THEN 3 -- High Crime 29 | WHEN crime_percentile >= 0.25 THEN 2 -- Moderate Crime 30 | ELSE 1 -- Low Crime 31 | END AS crime_tier 32 | FROM crime_percentiles 33 | ) 34 | SELECT * FROM crime_tiers; -------------------------------------------------------------------------------- /scripts/getS3bucketfilelist.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | import os 3 | import sys 4 | from dotenv import load_dotenv 5 | 6 | # Add the parent directory to the module search path so aws_secret_manager can be imported 7 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 8 | 9 | from aws_secret_manager import get_secret 10 | 11 | 12 | # Load environment variables from .env file 13 | load_dotenv() 14 | 15 | # Fetch AWS keys from environment variables 16 | aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 17 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 18 | tabular_credential = get_secret("TABULAR_CREDENTIAL") 19 | catalog_name = get_secret("CATALOG_NAME") 20 | 21 | # S3 bucket name 22 | s3_bucket = "< zach bucket name >" 23 | 24 | # Initialize the S3 client 25 | s3 = boto3.client( 26 | "s3", 27 | aws_access_key_id=aws_access_key_id, 28 | aws_secret_access_key=aws_secret_access_key 29 | ) 30 | 31 | # Bucket and prefix (reuse the bucket defined above) 32 | bucket_name = s3_bucket 33 | prefix = "tayloro-capstone-data/" 34 | 35 | # List objects in the S3 bucket directory 36 | response = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix) 37 | 38 | # Extract and print file names 39 | if "Contents" in response: 40 | file_names = [obj["Key"] for obj in response["Contents"]] 41 | print("\n".join(file_names)) 42 | else: 43 | print("No files found.")
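A caveat on the listing script above: a single `list_objects_v2` call returns at most 1,000 keys, so a large prefix needs pagination (boto3's `get_paginator("list_objects_v2")`). The page-flattening step can be sketched in pure Python — `keys_from_pages` is a hypothetical helper, and the fake pages below stand in for real responses:

```python
def keys_from_pages(pages):
    """Flatten 'Contents' entries from list_objects_v2 response pages.

    Each page is a dict shaped like a boto3 list_objects_v2 response;
    pages with no matching objects omit the 'Contents' key entirely,
    which is why .get() with a default is used.
    """
    keys = []
    for page in pages:
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# With boto3 this would be driven by a paginator, e.g.:
#   paginator = s3.get_paginator("list_objects_v2")
#   keys = keys_from_pages(paginator.paginate(Bucket=bucket_name, Prefix=prefix))
# Here the helper is exercised with fake pages instead.
fake_pages = [
    {"Contents": [{"Key": "tayloro-capstone-data/a.csv"},
                  {"Key": "tayloro-capstone-data/b.csv"}]},
    {},  # an empty page has no 'Contents' key
]
print(keys_from_pages(fake_pages))
```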
-------------------------------------------------------------------------------- /capstone_queries/silver/calculate_colorado_household_income_tier_rank.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_income_household_tier_rank AS 2 | WITH income_aggregated AS ( 3 | -- Step 1: Aggregate income per county across all years and compute average 4 | SELECT 5 | county, 6 | AVG(income) AS avg_median_household_income 7 | FROM academy.tayloro.colorado_income_1997_2020 8 | WHERE inctype = 3 -- 3 represents "Median Household Income" 9 | GROUP BY county 10 | ), 11 | income_percentiles AS ( 12 | -- Step 2: Compute percentile rank for average income per county 13 | SELECT 14 | county, 15 | avg_median_household_income, 16 | PERCENT_RANK() OVER (ORDER BY avg_median_household_income) AS income_percentile 17 | FROM income_aggregated 18 | ), 19 | income_tiers AS ( 20 | -- Step 3: Assign tiers based on percentile rankings 21 | SELECT 22 | county, 23 | avg_median_household_income, 24 | income_percentile, 25 | CASE 26 | WHEN income_percentile >= 0.75 THEN 1 -- Highest income counties (Top 25%) 27 | WHEN income_percentile >= 0.50 THEN 2 -- Middle-High (50%-75%) 28 | WHEN income_percentile >= 0.25 THEN 3 -- Middle-Low (25%-50%) 29 | ELSE 4 -- Lowest income counties (Bottom 25%) 30 | END AS income_tier 31 | FROM income_percentiles 32 | ) 33 | SELECT * FROM income_tiers; 34 | -------------------------------------------------------------------------------- /capstone_queries/silver/calculate_final_county_tier_rank.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_final_county_tier_rank AS 2 | WITH combined_ranks AS ( 3 | SELECT 4 | COALESCE(c.county_name, i.county, p.county) AS county, 5 | COALESCE(c.overall_crime_tier, 0) AS crime_rank, 6 | COALESCE(i.income_tier, 0) AS income_rank, 7 | COALESCE(p.crime_per_capita_tier, 0) AS population_rank 8 | FROM
academy.tayloro.colorado_crime_tier_county_rank c 9 | FULL OUTER JOIN academy.tayloro.colorado_income_household_tier_rank i 10 | ON LOWER(c.county_name) = LOWER(i.county) 11 | FULL OUTER JOIN academy.tayloro.colorado_population_crime_per_capita_rank p 12 | ON LOWER(c.county_name) = LOWER(p.county) 13 | ), 14 | average_rank AS ( 15 | SELECT 16 | county, 17 | crime_rank, 18 | income_rank, 19 | population_rank, 20 | ROUND( 21 | (crime_rank + income_rank + population_rank) / 22 | NULLIF((CASE WHEN crime_rank > 0 THEN 1 ELSE 0 END 23 | + CASE WHEN income_rank > 0 THEN 1 ELSE 0 END 24 | + CASE WHEN population_rank > 0 THEN 1 ELSE 0 END), 0) 25 | ) AS final_rank -- Calculate the average rank and round to nearest whole number 26 | FROM combined_ranks 27 | WHERE crime_rank > 0 AND income_rank > 0 AND population_rank > 0 -- Ensure all three ranks are present (non-zero) 28 | ) 29 | SELECT * FROM average_rank 30 | ORDER BY final_rank ASC; -- Order by the final average rank 31 | -------------------------------------------------------------------------------- /capstone_queries/silver/create_yearly_county_crime_per_capita_1997_2020.sql: -------------------------------------------------------------------------------- 1 | INSERT INTO tayloro.coloraodo_county_crime_rate_per_capita 2 | WITH crimes_per_year AS ( 3 | SELECT 4 | county_name, 5 | YEAR(incident_date) AS periodyear, -- Extract year from incident_date 6 | COUNT(*) AS total_crimes 7 | FROM tayloro.colorado_crimes 8 | WHERE 9 | incident_date IS NOT NULL 10 | AND YEAR(incident_date) BETWEEN 1997 AND 2020 -- Filter within date range 11 | GROUP BY county_name, YEAR(incident_date) -- Group by county and year 12 | ), 13 | population_per_year AS ( 14 | SELECT 15 | county AS county_name, 16 | YEAR AS periodyear, 17 | SUM(TOTALPOPULATION) AS total_population -- Aggregate population by year and county 18 | FROM tayloro.colorado_population_raw 19 | WHERE DATATYPE = 'Estimate' -- Use only estimated population data 20 | GROUP BY county, YEAR -- Group by
county and year 21 | ), 22 | crime_rate_per_capita AS ( 23 | SELECT 24 | c.county_name, 25 | c.periodyear, 26 | c.total_crimes, 27 | p.total_population, 28 | (c.total_crimes * 100000.0 / p.total_population) AS crime_per_100k -- Crime per 100,000 people 29 | FROM crimes_per_year c 30 | JOIN population_per_year p 31 | ON LOWER(TRIM(c.county_name)) = LOWER(TRIM(p.county_name)) -- Ensure proper join 32 | AND c.periodyear = p.periodyear -- Match years 33 | ) 34 | SELECT * FROM crime_rate_per_capita; -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_city_crime_time_likelihood.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_city_crime_time_likelihood AS 2 | WITH crime_counts AS ( 3 | SELECT 4 | city, 5 | offense_category_name, 6 | CASE 7 | WHEN incident_hour BETWEEN 6 AND 18 THEN 'Day' 8 | ELSE 'Night' 9 | END AS time_of_day, 10 | COUNT(*) AS total_crimes 11 | FROM tayloro.colorado_crimes_with_cities 12 | WHERE incident_hour IS NOT NULL 13 | AND offense_category_name IS NOT NULL 14 | AND city IS NOT NULL 15 | GROUP BY 16 | city, 17 | offense_category_name, 18 | CASE 19 | WHEN incident_hour BETWEEN 6 AND 18 THEN 'Day' 20 | ELSE 'Night' 21 | END 22 | ), 23 | crime_pivot AS ( 24 | SELECT 25 | city, 26 | offense_category_name, 27 | AVG(CASE WHEN time_of_day = 'Day' THEN total_crimes ELSE NULL END) AS avg_day_crimes, 28 | AVG(CASE WHEN time_of_day = 'Night' THEN total_crimes ELSE NULL END) AS avg_night_crimes 29 | FROM crime_counts 30 | GROUP BY city, offense_category_name 31 | ) 32 | SELECT 33 | city, 34 | offense_category_name, 35 | avg_day_crimes, 36 | avg_night_crimes, 37 | CASE 38 | WHEN avg_night_crimes > avg_day_crimes THEN TRUE 39 | ELSE FALSE 40 | END AS is_night_likely, 41 | CASE 42 | WHEN avg_day_crimes > avg_night_crimes THEN TRUE 43 | ELSE FALSE 44 | END AS is_day_likely 45 | FROM crime_pivot 46 | ORDER BY city, 
offense_category_name; 47 | -------------------------------------------------------------------------------- /scripts/upload_capstone_csv_data_to_s3.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | 5 | # Load environment variables from the .env file 6 | load_dotenv() 7 | 8 | # Add the parent directory of "upload_to_s3.py" to the module search path 9 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 10 | 11 | from upload_to_s3 import upload_to_s3 12 | 13 | # Correct the relative path to skip the "include" directory 14 | LOCAL_FILE = os.path.abspath( 15 | os.path.join( 16 | os.path.dirname(__file__), # Current script's directory 17 | "../../../capstone_data/Colorado_County_Boundaries.csv" # Adjusted to point to capstone_data 18 | ) 19 | ) 20 | 21 | print(f"Resolved path: {LOCAL_FILE}") 22 | 23 | # Define the S3 bucket name and file path 24 | S3_BUCKET = "" # Replace with your actual S3 bucket name 25 | S3_FILE_PATH = "tayloro-capstone-data/Colorado_County_Boundaries.csv" 26 | 27 | def main(): 28 | # Ensure the file exists locally before uploading 29 | if not os.path.exists(LOCAL_FILE): 30 | print(f"Error: Local file not found at {LOCAL_FILE}") 31 | return 32 | 33 | file_size = os.path.getsize(LOCAL_FILE) 34 | print(f"File size: {file_size / (1024 * 1024):.2f} MB") 35 | 36 | #Upload the file to S3 37 | s3_path = upload_to_s3(LOCAL_FILE, S3_BUCKET, S3_FILE_PATH) 38 | 39 | # Verify and print the result 40 | if s3_path: 41 | print(f"File successfully uploaded to {s3_path}") 42 | else: 43 | print("Failed to upload the file to S3.") 44 | 45 | if __name__ == "__main__": 46 | main() -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_city_seasonal_crime_rates_1997_2020.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE 
tayloro.colorado_city_seasonal_crime_rates_1997_2020 AS 2 | WITH monthly_crimes AS ( 3 | SELECT 4 | city, 5 | MONTH(incident_date) AS crime_month, -- Numeric month (1-12) 6 | CASE 7 | WHEN MONTH(incident_date) = 1 THEN 'January' 8 | WHEN MONTH(incident_date) = 2 THEN 'February' 9 | WHEN MONTH(incident_date) = 3 THEN 'March' 10 | WHEN MONTH(incident_date) = 4 THEN 'April' 11 | WHEN MONTH(incident_date) = 5 THEN 'May' 12 | WHEN MONTH(incident_date) = 6 THEN 'June' 13 | WHEN MONTH(incident_date) = 7 THEN 'July' 14 | WHEN MONTH(incident_date) = 8 THEN 'August' 15 | WHEN MONTH(incident_date) = 9 THEN 'September' 16 | WHEN MONTH(incident_date) = 10 THEN 'October' 17 | WHEN MONTH(incident_date) = 11 THEN 'November' 18 | WHEN MONTH(incident_date) = 12 THEN 'December' 19 | END AS month_name, 20 | COUNT(*) AS total_crimes 21 | FROM tayloro.colorado_crimes_with_cities 22 | WHERE incident_date IS NOT NULL 23 | GROUP BY city, MONTH(incident_date) -- Ensure grouping includes month 24 | ), 25 | normalized_crime AS ( 26 | SELECT 27 | mc.city, 28 | mc.crime_month, 29 | mc.month_name, 30 | AVG(mc.total_crimes) AS avg_monthly_crimes 31 | FROM monthly_crimes mc 32 | GROUP BY mc.city, mc.crime_month, mc.month_name -- Ensure grouping includes crime_month 33 | ) 34 | SELECT 35 | city, 36 | crime_month, 37 | month_name, 38 | avg_monthly_crimes 39 | FROM normalized_crime 40 | ORDER BY city, crime_month; -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_county_seasonal_crime_rates_1997_2020.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_county_seasonal_crime_rates_1997_2020 AS 2 | WITH monthly_crimes AS ( 3 | SELECT 4 | county_name, 5 | MONTH(incident_date) AS crime_month, -- Numeric month (1-12) 6 | CASE 7 | WHEN MONTH(incident_date) = 1 THEN 'January' 8 | WHEN MONTH(incident_date) = 2 THEN 'February' 9 | WHEN MONTH(incident_date) = 3 THEN 'March' 
10 | WHEN MONTH(incident_date) = 4 THEN 'April' 11 | WHEN MONTH(incident_date) = 5 THEN 'May' 12 | WHEN MONTH(incident_date) = 6 THEN 'June' 13 | WHEN MONTH(incident_date) = 7 THEN 'July' 14 | WHEN MONTH(incident_date) = 8 THEN 'August' 15 | WHEN MONTH(incident_date) = 9 THEN 'September' 16 | WHEN MONTH(incident_date) = 10 THEN 'October' 17 | WHEN MONTH(incident_date) = 11 THEN 'November' 18 | WHEN MONTH(incident_date) = 12 THEN 'December' 19 | END AS month_name, 20 | COUNT(*) AS total_crimes 21 | FROM tayloro.colorado_crimes 22 | WHERE incident_date IS NOT NULL 23 | GROUP BY county_name, MONTH(incident_date) -- Ensure grouping includes month 24 | ), 25 | normalized_crime AS ( 26 | SELECT 27 | mc.county_name, 28 | mc.crime_month, 29 | mc.month_name, 30 | AVG(mc.total_crimes) AS avg_monthly_crimes 31 | FROM monthly_crimes mc 32 | GROUP BY mc.county_name, mc.crime_month, mc.month_name -- Ensure grouping includes crime_month 33 | ) 34 | SELECT 35 | county_name, 36 | crime_month, 37 | month_name, 38 | avg_monthly_crimes 39 | FROM normalized_crime 40 | ORDER BY county_name, crime_month; 41 | -------------------------------------------------------------------------------- /scripts/modify_colorado_city_county_zip_csv.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import csv 4 | 5 | # Correct the relative path to skip the "include" directory 6 | LOCAL_FILE = os.path.abspath( 7 | os.path.join( 8 | os.path.dirname(__file__), # Current script's directory 9 | "../../../capstone_data/colorado_city_county_zip.csv" # Adjusted to point to capstone_data 10 | ) 11 | ) 12 | 13 | print(f"Resolved path: {LOCAL_FILE}") 14 | 15 | # Determine the output file path (we'll write the updated CSV in the same folder) 16 | output_file = os.path.join(os.path.dirname(LOCAL_FILE), "colorado_city_county_zip_updated.csv") 17 | print(f"Output file path: {output_file}") 18 | 19 | # Use utf-8-sig to handle potential BOM in the CSV 
file. 20 | with open(LOCAL_FILE, 'r', newline='', encoding='utf-8-sig') as infile: 21 | reader = csv.DictReader(infile) 22 | fieldnames = reader.fieldnames 23 | if not fieldnames: 24 | print("Error: No header row found in the CSV.") 25 | sys.exit(1) 26 | 27 | print("Found headers:", fieldnames) 28 | 29 | # Locate the ZIP Code header in a case-insensitive manner. 30 | zip_header = None 31 | for header in fieldnames: 32 | if header.strip().lower() == "zip code": 33 | zip_header = header 34 | break 35 | 36 | if not zip_header: 37 | print("Error: 'ZIP Code' column not found in the CSV.") 38 | sys.exit(1) 39 | 40 | updated_rows = [] 41 | for row in reader: 42 | zip_value = row[zip_header] 43 | # Remove the prefix "ZIP Code " if it appears 44 | if zip_value.startswith("ZIP Code "): 45 | row[zip_header] = zip_value.replace("ZIP Code ", "", 1) 46 | updated_rows.append(row) 47 | 48 | # Write the updated rows into a new CSV file. 49 | with open(output_file, 'w', newline='', encoding='utf-8') as outfile: 50 | writer = csv.DictWriter(outfile, fieldnames=fieldnames) 51 | writer.writeheader() 52 | writer.writerows(updated_rows) 53 | 54 | print(f"CSV updated successfully: {output_file}") 55 | -------------------------------------------------------------------------------- /capstone_queries/silver/calculate_population_crime_per_capita_tier.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE academy.tayloro.colorado_crime_per_capita_rank AS 2 | WITH crime_per_capita_per_year AS ( 3 | -- Step 1: Compute Crime Per Capita for Each Year (1997-2000) 4 | SELECT 5 | LOWER(TRIM(c.county_name)) AS county, -- Standardize county name 6 | EXTRACT(YEAR FROM c.incident_date) AS year, 7 | COUNT(*) * 1000.0 / p.total_population AS crime_per_1000 8 | FROM academy.tayloro.colorado_crimes c 9 | JOIN academy.tayloro.colorado_total_population_per_county_1997_2000 p 10 | ON LOWER(TRIM(c.county_name)) = LOWER(TRIM(p.county)) -- Match counties with 
case-insensitivity and trimming 11 | AND EXTRACT(YEAR FROM c.incident_date) = p.year 12 | WHERE p.total_population > 0 -- Avoid division by zero 13 | GROUP BY LOWER(TRIM(c.county_name)), EXTRACT(YEAR FROM c.incident_date), p.total_population 14 | ), 15 | average_crime_per_capita AS ( 16 | -- Step 2: Compute Average Crime Per Capita for 1997-2000 17 | SELECT 18 | county, 19 | AVG(crime_per_1000) AS avg_crime_per_1000 20 | FROM crime_per_capita_per_year 21 | GROUP BY county 22 | ), 23 | crime_distribution AS ( 24 | -- Step 3: Compute Percentile Rank Based on Average Crime Per Capita 25 | SELECT 26 | county, 27 | avg_crime_per_1000, 28 | PERCENT_RANK() OVER (ORDER BY avg_crime_per_1000) AS crime_percentile 29 | FROM average_crime_per_capita 30 | ), 31 | crime_tiers AS ( 32 | -- Step 4: Assign Crime Per Capita Tiers 33 | SELECT 34 | county, 35 | avg_crime_per_1000, 36 | crime_percentile, 37 | CASE 38 | WHEN crime_percentile >= 0.75 THEN 4 -- Highest crime per capita (Top 25%) 39 | WHEN crime_percentile >= 0.50 THEN 3 -- Medium-high (50%-75%) 40 | WHEN crime_percentile >= 0.25 THEN 2 -- Medium-low (25%-50%) 41 | ELSE 1 -- Lowest crime per capita (Bottom 25%) 42 | END AS crime_per_capita_tier 43 | FROM crime_distribution 44 | ) 45 | SELECT * FROM crime_tiers; 46 | -------------------------------------------------------------------------------- /capstone_queries/silver/calculate_population_crime_per_capita_rank.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_population_crime_per_capita_rank AS 2 | WITH crime_per_capita_per_year AS ( 3 | -- Step 1: Compute Crime Per Capita for Each Year (1997-2020) 4 | SELECT 5 | LOWER(c.county_name) AS county, -- Standardizing county name for consistent join 6 | EXTRACT(YEAR FROM c.incident_date) AS year, 7 | COUNT(*) * 1000.0 / p.total_population AS crime_per_1000 8 | FROM academy.tayloro.colorado_crimes c 9 | JOIN
academy.tayloro.colorado_total_population_per_county_1997_2020 p 10 | ON LOWER(c.county_name) = LOWER(p.county) -- Ensure lowercase match 11 | AND EXTRACT(YEAR FROM c.incident_date) = p.year 12 | WHERE p.total_population > 0 -- Avoid division by zero 13 | GROUP BY LOWER(c.county_name), EXTRACT(YEAR FROM c.incident_date), p.total_population 14 | ), 15 | average_crime_per_capita AS ( 16 | -- Step 2: Compute Average Crime Per Capita for 1997-2020 17 | SELECT 18 | county, 19 | AVG(crime_per_1000) AS avg_crime_per_1000 20 | FROM crime_per_capita_per_year 21 | GROUP BY county 22 | ), 23 | crime_distribution AS ( 24 | -- Step 3: Compute Percentile Rank Based on Average Crime Per Capita 25 | SELECT 26 | county, 27 | avg_crime_per_1000, 28 | PERCENT_RANK() OVER (ORDER BY avg_crime_per_1000) AS crime_percentile 29 | FROM average_crime_per_capita 30 | ), 31 | crime_tiers AS ( 32 | -- Step 4: Assign Crime Per Capita Tiers and Capitalize First Letter 33 | SELECT 34 | CONCAT(UPPER(SUBSTRING(county, 1, 1)), LOWER(SUBSTRING(county, 2))) AS county, -- Capitalize first letter 35 | avg_crime_per_1000, 36 | crime_percentile, 37 | CASE 38 | WHEN crime_percentile >= 0.75 THEN 4 -- Highest crime per capita (Top 25%) 39 | WHEN crime_percentile >= 0.50 THEN 3 -- Medium-high (50%-75%) 40 | WHEN crime_percentile >= 0.25 THEN 2 -- Medium-low (25%-50%) 41 | ELSE 1 -- Lowest crime per capita (Bottom 25%) 42 | END AS crime_per_capita_tier 43 | FROM crime_distribution 44 | ) 45 | SELECT * FROM crime_tiers; 46 | -------------------------------------------------------------------------------- /scripts/create_crime_rate_per_capita.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | from pyspark.sql import SparkSession 5 | from pyspark.conf import SparkConf 6 | from pyspark.sql.functions import regexp_replace 7 | 8 | # Add the parent directory of "upload_to_s3.py" to the module search path 9 |
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 10 | 11 | from aws_secret_manager import get_secret 12 | 13 | 14 | # Load environment variables from .env file 15 | load_dotenv() 16 | 17 | # Fetch AWS keys from environment variables 18 | aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 19 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 20 | tabular_credential = get_secret("TABULAR_CREDENTIAL") 21 | catalog_name = get_secret("CATALOG_NAME") 22 | 23 | # S3 bucket name 24 | s3_bucket = "< zach bucket name >" 25 | 26 | # Define required packages 27 | extra_packages = [ 28 | "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2", # Iceberg runtime 29 | "software.amazon.awssdk:bundle:2.17.178", # AWS SDK bundle 30 | "software.amazon.awssdk:url-connection-client:2.17.178", # URL connection client 31 | "org.apache.hadoop:hadoop-aws:3.3.4" # Hadoop AWS connector 32 | ] 33 | 34 | # Initialize SparkConf 35 | conf = SparkConf() 36 | conf.set('spark.jars.packages', ','.join(extra_packages)) 37 | conf.set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') 38 | conf.set('spark.sql.defaultCatalog', catalog_name) 39 | conf.set(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') 40 | conf.set(f'spark.sql.catalog.{catalog_name}.credential', tabular_credential) 41 | conf.set(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog') 42 | conf.set(f'spark.sql.catalog.{catalog_name}.warehouse', catalog_name) 43 | conf.set(f'spark.sql.catalog.{catalog_name}.uri', 'https://api.tabular.io/ws/') 44 | conf.set('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') 45 | conf.set('spark.sql.catalog.spark_catalog.type', 'hive') 46 | conf.set('spark.executor.memory', '16g') # Adjust the value based on your system's capacity 47 | conf.set('spark.sql.shuffle.partitions', '200') 48 | conf.set('spark.driver.memory', '16g') 49 
| 50 | 51 | 52 | # Initialize SparkSession 53 | spark = SparkSession.builder \ 54 | .appName("Load S3 data into Iceberg") \ 55 | .config(conf=conf) \ 56 | .getOrCreate() 57 | 58 | # Set AWS credentials for S3 access 59 | spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key_id) 60 | spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_access_key) 61 | spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com") 62 | 63 | spark.sql(""" 64 | CREATE TABLE IF NOT EXISTS tayloro.coloraodo_county_crime_rate_per_capita ( 65 | county_name STRING, 66 | periodyear INTEGER, 67 | total_crimes INTEGER, 68 | total_population INTEGER, 69 | crime_per_100k DOUBLE -- Crime per 100,000 people 70 | ) 71 | USING iceberg 72 | PARTITIONED BY (county_name); 73 | """) 74 | 75 | print("Iceberg table created successfully (if it didn't already exist).") -------------------------------------------------------------------------------- /capstone_queries/silver/create_colorado_crime_tier_county_rank.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_crime_tier_county_rank AS 2 | WITH county_scores AS ( 3 | SELECT 4 | county_name, 5 | -- Select the crime tiers for each category 6 | MAX(CASE WHEN table_name = 'colorado_crime_tier_property_destruction' THEN crime_tier ELSE NULL END) AS property_destruction_tier, 7 | MAX(CASE WHEN table_name = 'colorado_crime_tier_burglary' THEN crime_tier ELSE NULL END) AS burglary_tier, 8 | MAX(CASE WHEN table_name = 'colorado_crime_tier_larceny_theft' THEN crime_tier ELSE NULL END) AS larceny_theft_tier, 9 | MAX(CASE WHEN table_name = 'colorado_crime_tier_vehicle_theft' THEN crime_tier ELSE NULL END) AS vehicle_theft_tier, 10 | MAX(CASE WHEN table_name = 'colorado_crime_tier_robbery' THEN crime_tier ELSE NULL END) AS robbery_tier, 11 | MAX(CASE WHEN table_name = 'colorado_crime_tier_arson' THEN crime_tier ELSE NULL END) AS arson_tier, 12 | 
MAX(CASE WHEN table_name = 'colorado_crime_tier_stolen_property' THEN crime_tier ELSE NULL END) AS stolen_property_tier, 13 | -- Calculate the average of all crime tiers 14 | ROUND( 15 | AVG( 16 | CASE 17 | WHEN table_name IN ( 18 | 'colorado_crime_tier_property_destruction', 19 | 'colorado_crime_tier_burglary', 20 | 'colorado_crime_tier_larceny_theft', 21 | 'colorado_crime_tier_vehicle_theft', 22 | 'colorado_crime_tier_robbery', 23 | 'colorado_crime_tier_arson', 24 | 'colorado_crime_tier_stolen_property' 25 | ) THEN crime_tier 26 | ELSE NULL 27 | END 28 | ) 29 | ) AS overall_crime_tier -- Round to the nearest whole number 30 | FROM ( 31 | -- Union all 7 crime tier tables 32 | SELECT county_name, 'colorado_crime_tier_property_destruction' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_property_destruction 33 | UNION ALL 34 | SELECT county_name, 'colorado_crime_tier_burglary' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_burglary 35 | UNION ALL 36 | SELECT county_name, 'colorado_crime_tier_larceny_theft' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_larceny_theft 37 | UNION ALL 38 | SELECT county_name, 'colorado_crime_tier_vehicle_theft' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_vehicle_theft 39 | UNION ALL 40 | SELECT county_name, 'colorado_crime_tier_robbery' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_robbery 41 | UNION ALL 42 | SELECT county_name, 'colorado_crime_tier_arson' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_arson 43 | UNION ALL 44 | SELECT county_name, 'colorado_crime_tier_stolen_property' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_stolen_property 45 | ) combined 46 | GROUP BY county_name 47 | ) 48 | SELECT * FROM county_scores 49 | ORDER BY overall_crime_tier DESC; 50 | -------------------------------------------------------------------------------- /scripts/create_colorado_crimes_table.py: 
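The tier queries above all follow one pattern: rank counties by a metric with PERCENT_RANK(), bucket the percentile into quartile tiers 1-4, then average the tiers across categories. A minimal Python sketch of that bucketing logic, useful for sanity-checking expected tiers by hand; the county names and crime-per-1000 values are hypothetical, not taken from the pipeline:

```python
def percent_rank(values):
    """Mimic SQL PERCENT_RANK(): (rank - 1) / (n - 1); 0.0 for a single row."""
    n = len(values)
    ranked = sorted(values)
    if n == 1:
        return {values[0]: 0.0}
    # index() of a sorted list gives the first position of a value,
    # which matches PERCENT_RANK's treatment of ties
    return {v: ranked.index(v) / (n - 1) for v in values}

def crime_tier(percentile):
    """Mirror of the CASE expression used in the tier queries."""
    if percentile >= 0.75:
        return 4  # Highest crime per capita (top 25%)
    if percentile >= 0.50:
        return 3  # Medium-high (50%-75%)
    if percentile >= 0.25:
        return 2  # Medium-low (25%-50%)
    return 1      # Lowest crime per capita (bottom 25%)

# Hypothetical avg_crime_per_1000 values for four counties
counties = {"adams": 12.0, "boulder": 5.0, "denver": 20.0, "eagle": 8.0}
pct = percent_rank(list(counties.values()))
tiers = {c: crime_tier(pct[v]) for c, v in counties.items()}
```

With four counties the percentiles land at 0, 1/3, 2/3, and 1, so every quartile tier appears exactly once; with real data, ties in `avg_crime_per_1000` share a percentile and therefore a tier.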
-------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | from pyspark.sql import SparkSession 5 | from pyspark.conf import SparkConf 6 | from pyspark.sql.functions import regexp_replace 7 | 8 | # Add the parent directory of "upload_to_s3.py" to the module search path 9 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 10 | 11 | from aws_secret_manager import get_secret 12 | 13 | 14 | # Load environment variables from .env file 15 | load_dotenv() 16 | 17 | # Fetch AWS keys from environment variables 18 | aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 19 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 20 | tabular_credential = get_secret("TABULAR_CREDENTIAL") 21 | catalog_name = get_secret("CATALOG_NAME") 22 | 23 | # S3 bucket name 24 | s3_bucket = "< zach bucket name >" 25 | 26 | # Define required packages 27 | extra_packages = [ 28 | "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2", # Iceberg runtime 29 | "software.amazon.awssdk:bundle:2.17.178", # AWS SDK bundle 30 | "software.amazon.awssdk:url-connection-client:2.17.178", # URL connection client 31 | "org.apache.hadoop:hadoop-aws:3.3.4" # Hadoop AWS connector 32 | ] 33 | 34 | # Initialize SparkConf 35 | conf = SparkConf() 36 | conf.set('spark.jars.packages', ','.join(extra_packages)) 37 | conf.set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') 38 | conf.set('spark.sql.defaultCatalog', catalog_name) 39 | conf.set(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') 40 | conf.set(f'spark.sql.catalog.{catalog_name}.credential', tabular_credential) 41 | conf.set(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog') 42 | conf.set(f'spark.sql.catalog.{catalog_name}.warehouse', catalog_name) 43 | conf.set(f'spark.sql.catalog.{catalog_name}.uri', 
'https://api.tabular.io/ws/') 44 | conf.set('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') 45 | conf.set('spark.sql.catalog.spark_catalog.type', 'hive') 46 | conf.set('spark.executor.memory', '16g') # Adjust the value based on your system's capacity 47 | conf.set('spark.sql.shuffle.partitions', '200') 48 | conf.set('spark.driver.memory', '16g') 49 | 50 | 51 | 52 | # Initialize SparkSession 53 | spark = SparkSession.builder \ 54 | .appName("Load S3 data into Iceberg") \ 55 | .config(conf=conf) \ 56 | .getOrCreate() 57 | 58 | # Set AWS credentials for S3 access 59 | spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key_id) 60 | spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_access_key) 61 | spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com") 62 | 63 | 64 | # Create the Iceberg table if it doesn't already exist 65 | spark.sql("""
 66 | CREATE TABLE IF NOT EXISTS `tayloro-academy-warehouse`.tayloro.colorado_crimes ( 67 | agency_name STRING, 68 | county_name STRING, 69 | incident_date DATE, 70 | incident_hour INTEGER, 71 | offense_name STRING, 72 | crime_against STRING, 73 | offense_category_name STRING, 74 | age_num INTEGER 75 | ) 76 | USING iceberg 77 | PARTITIONED BY (county_name) 78 | """) 79 | print("Iceberg table created successfully (if it didn't already exist).") 80 | -------------------------------------------------------------------------------- /scripts/create_colorado_business_entities.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | from pyspark.sql import SparkSession 5 | from pyspark.conf import SparkConf 6 | from pyspark.sql.functions import regexp_replace 7 | 8 | # Add the parent directory of "upload_to_s3.py" to the module search path 9 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 10 | 11 | from
aws_secret_manager import get_secret 12 | 13 | 14 | # Load environment variables from .env file 15 | load_dotenv() 16 | 17 | # Fetch AWS keys from environment variables 18 | aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 19 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 20 | tabular_credential = get_secret("TABULAR_CREDENTIAL") 21 | catalog_name = get_secret("CATALOG_NAME") 22 | 23 | # S3 bucket name 24 | s3_bucket = "< zach bucket name >" 25 | 26 | # Define required packages 27 | extra_packages = [ 28 | "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2", # Iceberg runtime 29 | "software.amazon.awssdk:bundle:2.17.178", # AWS SDK bundle 30 | "software.amazon.awssdk:url-connection-client:2.17.178", # URL connection client 31 | "org.apache.hadoop:hadoop-aws:3.3.4" # Hadoop AWS connector 32 | ] 33 | 34 | # Initialize SparkConf 35 | conf = SparkConf() 36 | conf.set('spark.jars.packages', ','.join(extra_packages)) 37 | conf.set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') 38 | conf.set('spark.sql.defaultCatalog', catalog_name) 39 | conf.set(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') 40 | conf.set(f'spark.sql.catalog.{catalog_name}.credential', tabular_credential) 41 | conf.set(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog') 42 | conf.set(f'spark.sql.catalog.{catalog_name}.warehouse', catalog_name) 43 | conf.set(f'spark.sql.catalog.{catalog_name}.uri', 'https://api.tabular.io/ws/') 44 | conf.set('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') 45 | conf.set('spark.sql.catalog.spark_catalog.type', 'hive') 46 | conf.set('spark.executor.memory', '16g') # Adjust the value based on your system's capacity 47 | conf.set('spark.sql.shuffle.partitions', '200') 48 | conf.set('spark.driver.memory', '16g') 49 | 50 | 51 | 52 | # Initialize SparkSession 53 | spark = SparkSession.builder \ 54 | 
.appName("Load S3 data into Iceberg") \ 55 | .config(conf=conf) \ 56 | .getOrCreate() 57 | 58 | # Set AWS credentials for S3 access 59 | spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key_id) 60 | spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_access_key) 61 | spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com") 62 | 63 | spark.sql(""" 64 | CREATE TABLE IF NOT EXISTS `tayloro-academy-warehouse`.tayloro.colorado_business_entities ( 65 | entityid LONG, 66 | entityname STRING, 67 | principaladdress1 STRING, 68 | principaladdress2 STRING, 69 | principalcity STRING, 70 | principalcounty STRING, -- Mapped county from cities dataset 71 | principalstate STRING, 72 | principalzipcode STRING, 73 | principalcountry STRING, 74 | entitystatus STRING, 75 | jurisdictonofformation STRING, 76 | entitytype STRING, 77 | entityformdate DATE -- Ensure stored as DATE type 78 | ) 79 | USING iceberg 80 | PARTITIONED BY (principalcounty) 81 | """) 82 | 83 | print("Iceberg table created successfully (if it didn't already exist).") 84 | 85 | -------------------------------------------------------------------------------- /scripts/write_to_iceberg_from_S3.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | from pyspark.sql import SparkSession 5 | from pyspark.conf import SparkConf 6 | from pyspark.sql.functions import regexp_replace 7 | 8 | # Add the parent directory of "upload_to_s3.py" to the module search path 9 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 10 | 11 | from aws_secret_manager import get_secret 12 | 13 | 14 | # Load environment variables from .env file 15 | load_dotenv() 16 | 17 | # Fetch AWS keys from environment variables 18 | aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 19 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 20 | tabular_credential = 
get_secret("TABULAR_CREDENTIAL") 21 | catalog_name = get_secret("CATALOG_NAME") 22 | 23 | # S3 bucket name 24 | s3_bucket = "zachwilsonsorganization-522" 25 | 26 | # Define required packages 27 | extra_packages = [ 28 | "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2", # Iceberg runtime 29 | "software.amazon.awssdk:bundle:2.17.178", # AWS SDK bundle 30 | "software.amazon.awssdk:url-connection-client:2.17.178", # URL connection client 31 | "org.apache.hadoop:hadoop-aws:3.3.4" # Hadoop AWS connector 32 | ] 33 | 34 | # Initialize SparkConf 35 | conf = SparkConf() 36 | conf.set('spark.jars.packages', ','.join(extra_packages)) 37 | conf.set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') 38 | conf.set('spark.sql.defaultCatalog', catalog_name) 39 | conf.set(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') 40 | conf.set(f'spark.sql.catalog.{catalog_name}.credential', tabular_credential) 41 | conf.set(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog') 42 | conf.set(f'spark.sql.catalog.{catalog_name}.warehouse', catalog_name) 43 | conf.set(f'spark.sql.catalog.{catalog_name}.uri', 'https://api.tabular.io/ws/') 44 | conf.set('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') 45 | conf.set('spark.sql.catalog.spark_catalog.type', 'hive') 46 | conf.set('spark.executor.memory', '16g') # Adjust the value based on your system's capacity 47 | conf.set('spark.sql.shuffle.partitions', '200') 48 | conf.set('spark.driver.memory', '16g') 49 | 50 | 51 | 52 | # Initialize SparkSession 53 | spark = SparkSession.builder \ 54 | .appName("Load S3 data into Iceberg") \ 55 | .config(conf=conf) \ 56 | .getOrCreate() 57 | 58 | # Set AWS credentials for S3 access 59 | spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key_id) 60 | spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_access_key) 61 | 
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com") 62 | 63 | # Verify SparkSession 64 | print("SparkSession initialized successfully:", spark) 65 | 66 | # S3 path to your input file 67 | file_path = f"s3a://{s3_bucket}/tayloro-capstone-data/Colorado_County_Boundaries.csv" 68 | 69 | # Read the CSV file from the S3 bucket 70 | df = spark.read.format("csv") \ 71 | .option("header", "true") \ 72 | .option("inferSchema", "true") \ 73 | .load(file_path) 74 | 75 | 76 | df.printSchema() 77 | 78 | 79 | spark.sql(""" 80 | CREATE TABLE IF NOT EXISTS `eczachly-academy-warehouse`.tayloro.colorado_county_coordinates_raw ( 81 | county STRING, 82 | label STRING, 83 | cent_lat DOUBLE, 84 | cent_long DOUBLE 85 | ) 86 | USING iceberg 87 | PARTITIONED BY (county) 88 | """) 89 | 90 | # Write to the Iceberg table with Parquet format and overwrite partitions 91 | df.writeTo("`eczachly-academy-warehouse`.tayloro.colorado_county_coordinates_raw") \ 92 | .tableProperty("write.format.default", "parquet") \ 93 | .overwritePartitions() 94 | -------------------------------------------------------------------------------- /capstone_queries/bronze/write_colorado_county_coordinates_iceberg_from_S3.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | from pyspark.sql import SparkSession 5 | from pyspark.conf import SparkConf 6 | from pyspark.sql.functions import regexp_replace 7 | 8 | # Add the parent directory of "upload_to_s3.py" to the module search path 9 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 10 | 11 | from aws_secret_manager import get_secret 12 | 13 | 14 | # Load environment variables from .env file 15 | load_dotenv() 16 | 17 | # Fetch AWS keys from environment variables 18 | aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 19 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 20 | tabular_credential = 
get_secret("TABULAR_CREDENTIAL") 21 | catalog_name = get_secret("CATALOG_NAME") 22 | 23 | # S3 bucket name 24 | s3_bucket = "zachwilsonsorganization-522" 25 | 26 | # Define required packages 27 | extra_packages = [ 28 | "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2", # Iceberg runtime 29 | "software.amazon.awssdk:bundle:2.17.178", # AWS SDK bundle 30 | "software.amazon.awssdk:url-connection-client:2.17.178", # URL connection client 31 | "org.apache.hadoop:hadoop-aws:3.3.4" # Hadoop AWS connector 32 | ] 33 | 34 | # Initialize SparkConf 35 | conf = SparkConf() 36 | conf.set('spark.jars.packages', ','.join(extra_packages)) 37 | conf.set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') 38 | conf.set('spark.sql.defaultCatalog', catalog_name) 39 | conf.set(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') 40 | conf.set(f'spark.sql.catalog.{catalog_name}.credential', tabular_credential) 41 | conf.set(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog') 42 | conf.set(f'spark.sql.catalog.{catalog_name}.warehouse', catalog_name) 43 | conf.set(f'spark.sql.catalog.{catalog_name}.uri', 'https://api.tabular.io/ws/') 44 | conf.set('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') 45 | conf.set('spark.sql.catalog.spark_catalog.type', 'hive') 46 | conf.set('spark.executor.memory', '16g') # Adjust the value based on your system's capacity 47 | conf.set('spark.sql.shuffle.partitions', '200') 48 | conf.set('spark.driver.memory', '16g') 49 | 50 | 51 | 52 | # Initialize SparkSession 53 | spark = SparkSession.builder \ 54 | .appName("Load S3 data into Iceberg") \ 55 | .config(conf=conf) \ 56 | .getOrCreate() 57 | 58 | # Set AWS credentials for S3 access 59 | spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key_id) 60 | spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_access_key) 61 | 
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com") 62 | 63 | # Verify SparkSession 64 | print("SparkSession initialized successfully:", spark) 65 | 66 | # S3 path to your input file 67 | file_path = f"s3a://{s3_bucket}/tayloro-capstone-data/Colorado_County_Boundaries.csv" 68 | 69 | # Read the CSV file from the S3 bucket 70 | df = spark.read.format("csv") \ 71 | .option("header", "true") \ 72 | .option("inferSchema", "true") \ 73 | .load(file_path) 74 | 75 | 76 | df.printSchema() 77 | 78 | 79 | spark.sql(""" 80 | CREATE TABLE IF NOT EXISTS `eczachly-academy-warehouse`.tayloro.colorado_counties_coordinates_raw ( 81 | county STRING, 82 | label STRING, 83 | cent_lat DOUBLE, 84 | cent_long DOUBLE 85 | ) 86 | USING iceberg 87 | PARTITIONED BY (county) 88 | """) 89 | 90 | # Write to the Iceberg table with Parquet format and overwrite partitions 91 | df.writeTo("`eczachly-academy-warehouse`.tayloro.colorado_counties_coordinates_raw") \ 92 | .tableProperty("write.format.default", "parquet") \ 93 | .overwritePartitions() 94 | -------------------------------------------------------------------------------- /capstone_queries/bronze/write_colorado_crimes_1997_2020_iceberg_from_S3.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | from pyspark.sql import SparkSession 5 | from pyspark.conf import SparkConf 6 | from pyspark.sql.functions import regexp_replace 7 | 8 | # Add the parent directory of "upload_to_s3.py" to the module search path 9 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 10 | 11 | from aws_secret_manager import get_secret 12 | 13 | 14 | # Load environment variables from .env file 15 | load_dotenv() 16 | 17 | # Fetch AWS keys from environment variables 18 | aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 19 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 20 | tabular_credential = 
get_secret("TABULAR_CREDENTIAL") 21 | catalog_name = get_secret("CATALOG_NAME") 22 | 23 | # S3 bucket name 24 | s3_bucket = "zachwilsonsorganization-522" 25 | 26 | # Define required packages 27 | extra_packages = [ 28 | "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2", # Iceberg runtime 29 | "software.amazon.awssdk:bundle:2.17.178", # AWS SDK bundle 30 | "software.amazon.awssdk:url-connection-client:2.17.178", # URL connection client 31 | "org.apache.hadoop:hadoop-aws:3.3.4" # Hadoop AWS connector 32 | ] 33 | 34 | # Initialize SparkConf 35 | conf = SparkConf() 36 | conf.set('spark.jars.packages', ','.join(extra_packages)) 37 | conf.set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') 38 | conf.set('spark.sql.defaultCatalog', catalog_name) 39 | conf.set(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') 40 | conf.set(f'spark.sql.catalog.{catalog_name}.credential', tabular_credential) 41 | conf.set(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog') 42 | conf.set(f'spark.sql.catalog.{catalog_name}.warehouse', catalog_name) 43 | conf.set(f'spark.sql.catalog.{catalog_name}.uri', 'https://api.tabular.io/ws/') 44 | conf.set('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') 45 | conf.set('spark.sql.catalog.spark_catalog.type', 'hive') 46 | conf.set('spark.executor.memory', '16g') # Adjust the value based on your system's capacity 47 | conf.set('spark.sql.shuffle.partitions', '200') 48 | conf.set('spark.driver.memory', '16g') 49 | 50 | 51 | 52 | # Initialize SparkSession 53 | spark = SparkSession.builder \ 54 | .appName("Load S3 data into Iceberg") \ 55 | .config(conf=conf) \ 56 | .getOrCreate() 57 | 58 | # Set AWS credentials for S3 access 59 | spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key_id) 60 | spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_access_key) 61 | 
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com") 62 | 63 | # Verify SparkSession 64 | print("SparkSession initialized successfully:", spark) 65 | 66 | # S3 path to your input file 67 | file_path = f"s3a://{s3_bucket}/tayloro-capstone-data/Colorado_Crimes_1995_2015.csv" 68 | 69 | # Read the CSV file from the S3 bucket 70 | df = spark.read.format("csv") \ 71 | .option("header", "true") \ 72 | .option("inferSchema", "true") \ 73 | .load(file_path) 74 | 75 | 76 | df.printSchema() 77 | 78 | 79 | spark.sql(""" 80 | CREATE TABLE IF NOT EXISTS `eczachly-academy-warehouse`.tayloro.colorado_crimes_1995_2015_raw ( 81 | pub_agency_name STRING, 82 | county_name STRING, 83 | incident_date STRING, 84 | incident_hour INTEGER, 85 | offense_name STRING, 86 | crime_against STRING, 87 | offense_category_name STRING, 88 | offense_group STRING, 89 | age_num INTEGER 90 | ) 91 | USING iceberg 92 | PARTITIONED BY (county_name) 93 | """) 94 | 95 | # Write to the Iceberg table with Parquet format and overwrite partitions 96 | df.writeTo("`eczachly-academy-warehouse`.tayloro.colorado_crimes_1995_2015_raw") \ 97 | .tableProperty("write.format.default", "parquet") \ 98 | .overwritePartitions() 99 | -------------------------------------------------------------------------------- /capstone_queries/bronze/write_colorado_income_iceberg_from_S3.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | from pyspark.sql import SparkSession 5 | from pyspark.conf import SparkConf 6 | from pyspark.sql.functions import regexp_replace 7 | 8 | # Add the parent directory of "upload_to_s3.py" to the module search path 9 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 10 | 11 | from aws_secret_manager import get_secret 12 | 13 | 14 | # Load environment variables from .env file 15 | load_dotenv() 16 | 17 | # Fetch AWS keys from environment variables 18 | 
aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 19 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 20 | tabular_credential = get_secret("TABULAR_CREDENTIAL") 21 | catalog_name = get_secret("CATALOG_NAME") 22 | 23 | # S3 bucket name 24 | s3_bucket = "zachwilsonsorganization-522" 25 | 26 | # Define required packages 27 | extra_packages = [ 28 | "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2", # Iceberg runtime 29 | "software.amazon.awssdk:bundle:2.17.178", # AWS SDK bundle 30 | "software.amazon.awssdk:url-connection-client:2.17.178", # URL connection client 31 | "org.apache.hadoop:hadoop-aws:3.3.4" # Hadoop AWS connector 32 | ] 33 | 34 | # Initialize SparkConf 35 | conf = SparkConf() 36 | conf.set('spark.jars.packages', ','.join(extra_packages)) 37 | conf.set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') 38 | conf.set('spark.sql.defaultCatalog', catalog_name) 39 | conf.set(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') 40 | conf.set(f'spark.sql.catalog.{catalog_name}.credential', tabular_credential) 41 | conf.set(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog') 42 | conf.set(f'spark.sql.catalog.{catalog_name}.warehouse', catalog_name) 43 | conf.set(f'spark.sql.catalog.{catalog_name}.uri', 'https://api.tabular.io/ws/') 44 | conf.set('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') 45 | conf.set('spark.sql.catalog.spark_catalog.type', 'hive') 46 | conf.set('spark.executor.memory', '16g') # Adjust the value based on your system's capacity 47 | conf.set('spark.sql.shuffle.partitions', '200') 48 | conf.set('spark.driver.memory', '16g') 49 | 50 | 51 | 52 | # Initialize SparkSession 53 | spark = SparkSession.builder \ 54 | .appName("Load S3 data into Iceberg") \ 55 | .config(conf=conf) \ 56 | .getOrCreate() 57 | 58 | # Set AWS credentials for S3 access 59 | 
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key_id) 60 | spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_access_key) 61 | spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com") 62 | 63 | # Verify SparkSession 64 | print("SparkSession initialized successfully:", spark) 65 | 66 | # S3 path to your input file 67 | file_path = f"s3a://{s3_bucket}/tayloro-capstone-data/Colorado_Crimes_2016_2020.csv" 68 | 69 | # Read the CSV file from the S3 bucket 70 | df = spark.read.format("csv") \ 71 | .option("header", "true") \ 72 | .option("inferSchema", "true") \ 73 | .load(file_path) 74 | 75 | 76 | df.printSchema() 77 | 78 | 79 | # spark.sql(""" 80 | # CREATE TABLE IF NOT EXISTS `eczachly-academy-warehouse`.tayloro.colorado_income_raw ( 81 | # pub_agency_name STRING, 82 | # county_name STRING, 83 | # incident_date STRING, 84 | # incident_hour INTEGER, 85 | # offense_name STRING, 86 | # crime_against STRING, 87 | # offense_category_name STRING, 88 | # offense_group STRING, 89 | # age_num INTEGER 90 | # ) 91 | # USING iceberg 92 | # PARTITIONED BY (county_name) 93 | # """) 94 | # 95 | # # Write to the Iceberg table with Parquet format and overwrite partitions 96 | # df.writeTo("`eczachly-academy-warehouse`.tayloro.colorado_income_raw") \ 97 | # .tableProperty("write.format.default", "parquet") \ 98 | # .overwritePartitions() 99 | -------------------------------------------------------------------------------- /capstone_queries/bronze/write_colorado_crimes_2016_2020_iceberg_from_S3.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | from pyspark.sql import SparkSession 5 | from pyspark.conf import SparkConf 6 | from pyspark.sql.functions import regexp_replace 7 | 8 | # Add the parent directory of "upload_to_s3.py" to the module search path 9 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), 
"../"))) 10 | 11 | from aws_secret_manager import get_secret 12 | 13 | 14 | # Load environment variables from .env file 15 | load_dotenv() 16 | 17 | # Fetch AWS keys from environment variables 18 | aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 19 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 20 | tabular_credential = get_secret("TABULAR_CREDENTIAL") 21 | catalog_name = get_secret("CATALOG_NAME") 22 | 23 | # S3 bucket name 24 | s3_bucket = "zachwilsonsorganization-522" 25 | 26 | # Define required packages 27 | extra_packages = [ 28 | "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2", # Iceberg runtime 29 | "software.amazon.awssdk:bundle:2.17.178", # AWS SDK bundle 30 | "software.amazon.awssdk:url-connection-client:2.17.178", # URL connection client 31 | "org.apache.hadoop:hadoop-aws:3.3.4" # Hadoop AWS connector 32 | ] 33 | 34 | # Initialize SparkConf 35 | conf = SparkConf() 36 | conf.set('spark.jars.packages', ','.join(extra_packages)) 37 | conf.set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') 38 | conf.set('spark.sql.defaultCatalog', catalog_name) 39 | conf.set(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') 40 | conf.set(f'spark.sql.catalog.{catalog_name}.credential', tabular_credential) 41 | conf.set(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog') 42 | conf.set(f'spark.sql.catalog.{catalog_name}.warehouse', catalog_name) 43 | conf.set(f'spark.sql.catalog.{catalog_name}.uri', 'https://api.tabular.io/ws/') 44 | conf.set('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') 45 | conf.set('spark.sql.catalog.spark_catalog.type', 'hive') 46 | conf.set('spark.executor.memory', '16g') # Adjust the value based on your system's capacity 47 | conf.set('spark.sql.shuffle.partitions', '200') 48 | conf.set('spark.driver.memory', '16g') 49 | 50 | 51 | 52 | # Initialize SparkSession 53 | spark = 
SparkSession.builder \ 54 | .appName("Load S3 data into Iceberg") \ 55 | .config(conf=conf) \ 56 | .getOrCreate() 57 | 58 | # Set AWS credentials for S3 access 59 | spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key_id) 60 | spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_access_key) 61 | spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com") 62 | 63 | # Verify SparkSession 64 | print("SparkSession initialized successfully:", spark) 65 | 66 | # S3 path to your input file 67 | file_path = f"s3a://{s3_bucket}/tayloro-capstone-data/Colorado_Crimes_2016_2020.csv" 68 | 69 | # Read the CSV file from the S3 bucket 70 | df = spark.read.format("csv") \ 71 | .option("header", "true") \ 72 | .option("inferSchema", "true") \ 73 | .load(file_path) 74 | 75 | 76 | df.printSchema() 77 | 78 | 79 | print(f"Catalog Name: {catalog_name}") 80 | 81 | 82 | spark.sql(""" 83 | CREATE TABLE IF NOT EXISTS `eczachly-academy-warehouse`.tayloro.colorado_crimes_2016_2020_raw_new ( 84 | pub_agency_name STRING, 85 | county_name STRING, 86 | incident_date STRING, 87 | incident_hour INTEGER, 88 | offense_name STRING, 89 | crime_against STRING, 90 | offense_category_name STRING, 91 | offense_group STRING, 92 | age_num INTEGER 93 | ) 94 | USING iceberg 95 | PARTITIONED BY (county_name) 96 | """) 97 | 98 | # Write to the Iceberg table with Parquet format and overwrite partitions 99 | df.writeTo("`eczachly-academy-warehouse`.tayloro.colorado_crimes_2016_2020_raw_new") \ 100 | .tableProperty("write.format.default", "parquet") \ 101 | .overwritePartitions() 102 | -------------------------------------------------------------------------------- /capstone_queries/silver/create_colorado_crimes_with_cities.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_crimes_with_cities AS 2 | SELECT 3 | -- Clean and coalesce agency names from both datasets 4 | COALESCE( 5 | trim( 6 | 
regexp_replace( 7 | regexp_replace( 8 | regexp_replace( 9 | regexp_replace(a.agency_name, '(?i)\\bDrug Enforcement Team\\b', 'Drug Task Force'), 10 | '(?i)\\s+Police Department\\b', '' 11 | ), 12 | '(?i)\\s+County Sheriff(?:''s Office)?\\b', '' 13 | ), 14 | '(?i)\\s+Sheriff(?:''s Office)?\\b', '' 15 | ) 16 | ), 17 | trim( 18 | regexp_replace( 19 | regexp_replace( 20 | regexp_replace( 21 | regexp_replace(b.pub_agency_name, '(?i)\\bDrug Enforcement Team\\b', 'Drug Task Force'), 22 | '(?i)\\s+Police Department\\b', '' 23 | ), 24 | '(?i)\\s+County Sheriff(?:''s Office)?\\b', '' 25 | ), 26 | '(?i)\\s+Sheriff(?:''s Office)?\\b', '' 27 | ) 28 | ) 29 | ) AS agency_name, 30 | 31 | -- Standardize and clean county names with first letter capitalized 32 | COALESCE( 33 | UPPER(SUBSTRING(LOWER(TRIM(a.primary_county)), 1, 1)) || LOWER(SUBSTRING(LOWER(TRIM(a.primary_county)), 2)), 34 | UPPER(SUBSTRING(LOWER(TRIM(b.county_name)), 1, 1)) || LOWER(SUBSTRING(LOWER(TRIM(b.county_name)), 2)) 35 | ) AS county_name, 36 | 37 | -- Combine and clean incident dates, converting them to DATE 38 | COALESCE( 39 | CAST(date_parse(TRIM(a.incident_date), '%m/%d/%Y') AS DATE), 40 | CAST(date_parse(TRIM(b.incident_date), '%m/%d/%Y') AS DATE) 41 | ) AS incident_date, 42 | 43 | -- Standardize incident hour 44 | COALESCE(a.incident_hour, b.incident_hour) AS incident_hour, 45 | 46 | -- Capitalize offense names (first letter capitalized, rest lowercase) 47 | COALESCE( 48 | UPPER(SUBSTRING(LOWER(TRIM(a.offense_name)), 1, 1)) || LOWER(SUBSTRING(LOWER(TRIM(a.offense_name)), 2)), 49 | UPPER(SUBSTRING(LOWER(TRIM(b.offense_name)), 1, 1)) || LOWER(SUBSTRING(LOWER(TRIM(b.offense_name)), 2)) 50 | ) AS offense_name, 51 | 52 | -- Clean crime_against and offense_category_name 53 | COALESCE(a.crime_against, b.crime_against) AS crime_against, 54 | COALESCE(a.offense_category_name, b.offense_category_name) AS offense_category_name, 55 | 56 | -- Clean age_num 57 | COALESCE(a.age_num, b.age_num) AS age_num, 58 | 59 | -- 
Coalesce the city values: use city_name from the 1997–2015 data when available; otherwise, use city from the 2016–2020 data 60 | COALESCE(a.city_name, b.city) AS city 61 | FROM 62 | tayloro.colorado_crimes_1997_2015_raw a 63 | FULL OUTER JOIN 64 | tayloro.colorado_crimes_2016_2020_with_cities b 65 | ON 66 | -- Clean agency names in the join condition so they match properly 67 | LOWER( 68 | trim( 69 | regexp_replace( 70 | regexp_replace( 71 | regexp_replace( 72 | regexp_replace(a.agency_name, '(?i)\\bDrug Enforcement Team\\b', 'Drug Task Force'), 73 | '(?i)\\s+Police Department\\b', '' 74 | ), 75 | '(?i)\\s+County Sheriff(?:''s Office)?\\b', '' 76 | ), 77 | '(?i)\\s+Sheriff(?:''s Office)?\\b', '' 78 | ) 79 | ) 80 | ) = 81 | LOWER( 82 | trim( 83 | regexp_replace( 84 | regexp_replace( 85 | regexp_replace( 86 | regexp_replace(b.pub_agency_name, '(?i)\\bDrug Enforcement Team\\b', 'Drug Task Force'), 87 | '(?i)\\s+Police Department\\b', '' 88 | ), 89 | '(?i)\\s+County Sheriff(?:''s Office)?\\b', '' 90 | ), 91 | '(?i)\\s+Sheriff(?:''s Office)?\\b', '' 92 | ) 93 | ) 94 | ) 95 | AND LOWER(TRIM(a.primary_county)) = LOWER(TRIM(b.county_name)) 96 | AND CAST(date_parse(TRIM(a.incident_date), '%m/%d/%Y') AS DATE) = CAST(date_parse(TRIM(b.incident_date), '%m/%d/%Y') AS DATE) 97 | AND a.incident_hour = b.incident_hour 98 | AND a.offense_name = b.offense_name; -------------------------------------------------------------------------------- /capstone_queries/silver/create_colorado_crime_weighted_scores.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_crime_weighted_scores AS 2 | WITH crime_weighted AS ( 3 | SELECT 4 | TRIM(SPLIT_PART(county_name, ';', 1)) AS county_name, -- Extract first county if multiple are listed 5 | EXTRACT(YEAR FROM incident_date) AS year, -- Correctly extract year for grouping 6 | SUM(CASE WHEN offense_category_name = 'Destruction/Damage/Vandalism of Property' THEN 10 ELSE 0 END) AS 
destruction_vandalism, 7 | SUM(CASE WHEN offense_category_name = 'Burglary/Breaking & Entering' THEN 9 ELSE 0 END) AS burglary, 8 | SUM(CASE WHEN offense_category_name = 'Larceny/Theft Offenses' THEN 8 ELSE 0 END) AS larceny_theft, 9 | SUM(CASE WHEN offense_category_name = 'Motor Vehicle Theft' THEN 7 ELSE 0 END) AS motor_vehicle_theft, 10 | SUM(CASE WHEN offense_category_name = 'Robbery' THEN 7 ELSE 0 END) AS robbery, 11 | SUM(CASE WHEN offense_category_name = 'Arson' THEN 7 ELSE 0 END) AS arson, 12 | SUM(CASE WHEN offense_category_name = 'Stolen Property Offenses' THEN 6 ELSE 0 END) AS stolen_property, 13 | SUM(CASE WHEN offense_category_name = 'Counterfeiting/Forgery' THEN 5 ELSE 0 END) AS counterfeiting_forgery, 14 | SUM(CASE WHEN offense_category_name = 'Fraud Offenses' THEN 5 ELSE 0 END) AS fraud, 15 | SUM(CASE WHEN offense_category_name = 'Bribery' THEN 4 ELSE 0 END) AS bribery, 16 | SUM(CASE WHEN offense_category_name = 'Extortion/Blackmail' THEN 4 ELSE 0 END) AS extortion_blackmail, 17 | SUM(CASE WHEN offense_category_name = 'Embezzlement' THEN 3 ELSE 0 END) AS embezzlement, 18 | SUM(CASE WHEN offense_category_name = 'Weapon Law Violations' THEN 3 ELSE 0 END) AS weapon_law_violations, 19 | SUM(CASE WHEN offense_category_name = 'Drug/Narcotic Offenses' THEN 3 ELSE 0 END) AS drug_offenses, 20 | SUM(CASE WHEN offense_category_name = 'Prostitution Offenses' THEN 2 ELSE 0 END) AS prostitution_offenses, 21 | SUM(CASE WHEN offense_category_name = 'Kidnapping/Abduction' THEN 2 ELSE 0 END) AS kidnapping_abduction, 22 | SUM(CASE WHEN offense_category_name = 'Sex Offenses' THEN 2 ELSE 0 END) AS sex_offenses, 23 | SUM(CASE WHEN offense_category_name = 'Human Trafficking' THEN 1 ELSE 0 END) AS human_trafficking, 24 | SUM(CASE WHEN offense_category_name = 'Gambling Offenses' THEN 1 ELSE 0 END) AS gambling_offenses, 25 | SUM(CASE WHEN offense_category_name = 'Animal Cruelty' THEN 1 ELSE 0 END) AS animal_cruelty, 26 | SUM( 27 | CASE 28 | WHEN offense_category_name = 
'Destruction/Damage/Vandalism of Property' THEN 10 29 | WHEN offense_category_name = 'Burglary/Breaking & Entering' THEN 9 30 | WHEN offense_category_name = 'Larceny/Theft Offenses' THEN 8 31 | WHEN offense_category_name = 'Motor Vehicle Theft' THEN 7 32 | WHEN offense_category_name = 'Robbery' THEN 7 33 | WHEN offense_category_name = 'Arson' THEN 7 34 | WHEN offense_category_name = 'Stolen Property Offenses' THEN 6 35 | WHEN offense_category_name = 'Counterfeiting/Forgery' THEN 5 36 | WHEN offense_category_name = 'Fraud Offenses' THEN 5 37 | WHEN offense_category_name = 'Bribery' THEN 4 38 | WHEN offense_category_name = 'Extortion/Blackmail' THEN 4 39 | WHEN offense_category_name = 'Embezzlement' THEN 3 40 | WHEN offense_category_name = 'Weapon Law Violations' THEN 3 41 | WHEN offense_category_name = 'Drug/Narcotic Offenses' THEN 3 42 | WHEN offense_category_name = 'Prostitution Offenses' THEN 2 43 | WHEN offense_category_name = 'Kidnapping/Abduction' THEN 2 44 | WHEN offense_category_name = 'Sex Offenses' THEN 2 45 | WHEN offense_category_name = 'Human Trafficking' THEN 1 46 | WHEN offense_category_name = 'Gambling Offenses' THEN 1 47 | WHEN offense_category_name = 'Animal Cruelty' THEN 1 48 | ELSE 0 49 | END 50 | ) AS total_crime_score 51 | FROM tayloro.colorado_crimes 52 | WHERE TRIM(county_name) IS NOT NULL 53 | AND TRIM(county_name) <> '' 54 | AND incident_date IS NOT NULL 55 | GROUP BY county_name, EXTRACT(YEAR FROM incident_date) -- Fix: Use correct grouping 56 | ) 57 | SELECT * FROM crime_weighted; -------------------------------------------------------------------------------- /capstone_queries/silver/create_colorado_crimes.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_crimes AS 2 | SELECT 3 | -- Clean and coalesce agency names using regex replacements: 4 | COALESCE( 5 | trim( 6 | regexp_replace( 7 | regexp_replace( 8 | regexp_replace( 9 | regexp_replace(a.agency_name, '(?i)\\bDrug 
Enforcement Team\\b', 'Drug Task Force'), 10 | '(?i)\\s+Police Department\\b', '' 11 | ), 12 | '(?i)\\s+County Sheriff(?:''s Office)?\\b', '' 13 | ), 14 | '(?i)\\s+Sheriff(?:''s Office)?\\b', '' 15 | ) 16 | ), 17 | trim( 18 | regexp_replace( 19 | regexp_replace( 20 | regexp_replace( 21 | regexp_replace(b.pub_agency_name, '(?i)\\bDrug Enforcement Team\\b', 'Drug Task Force'), 22 | '(?i)\\s+Police Department\\b', '' 23 | ), 24 | '(?i)\\s+County Sheriff(?:''s Office)?\\b', '' 25 | ), 26 | '(?i)\\s+Sheriff(?:''s Office)?\\b', '' 27 | ) 28 | ) 29 | ) AS agency_name, 30 | 31 | -- Normalize county names: convert to lowercase, replace semicolons with commas, 32 | -- remove quotes, split on commas, trim each element, sort, and rejoin. 33 | COALESCE( 34 | array_join( 35 | array_sort( 36 | transform( 37 | split( 38 | regexp_replace(regexp_replace(lower(trim(a.primary_county)), ';', ','), '"', '') 39 | , ',' 40 | ), 41 | x -> trim(x) 42 | ) 43 | ), 44 | ', ' 45 | ), 46 | array_join( 47 | array_sort( 48 | transform( 49 | split( 50 | regexp_replace(regexp_replace(lower(trim(b.county_name)), ';', ','), '"', '') 51 | , ',' 52 | ), 53 | x -> trim(x) 54 | ) 55 | ), 56 | ', ' 57 | ) 58 | ) AS county_name, 59 | 60 | -- Parse and coalesce incident dates 61 | COALESCE( 62 | CAST(date_parse(trim(a.incident_date), '%m/%d/%Y') AS DATE), 63 | CAST(date_parse(trim(b.incident_date), '%m/%d/%Y') AS DATE) 64 | ) AS incident_date, 65 | 66 | -- Standardize incident hour 67 | COALESCE(a.incident_hour, b.incident_hour) AS incident_hour, 68 | 69 | -- Capitalize offense names (first letter uppercase, rest lowercase) 70 | COALESCE( 71 | UPPER(SUBSTRING(LOWER(trim(a.offense_name)), 1, 1)) || LOWER(SUBSTRING(LOWER(trim(a.offense_name)), 2)), 72 | UPPER(SUBSTRING(LOWER(trim(b.offense_name)), 1, 1)) || LOWER(SUBSTRING(LOWER(trim(b.offense_name)), 2)) 73 | ) AS offense_name, 74 | 75 | -- Coalesce crime_against and offense_category_name 76 | COALESCE(a.crime_against, b.crime_against) AS crime_against, 77 | 
COALESCE(a.offense_category_name, b.offense_category_name) AS offense_category_name, 78 | 79 | -- Coalesce age_num 80 | COALESCE(a.age_num, b.age_num) AS age_num 81 | FROM 82 | tayloro.colorado_crimes_1997_2015_raw a 83 | FULL OUTER JOIN 84 | tayloro.colorado_crimes_2016_2020_raw b 85 | ON 86 | -- Join on the cleaned agency names: 87 | LOWER( 88 | trim( 89 | regexp_replace( 90 | regexp_replace( 91 | regexp_replace( 92 | regexp_replace(a.agency_name, '(?i)\\bDrug Enforcement Team\\b', 'Drug Task Force'), 93 | '(?i)\\s+Police Department\\b', '' 94 | ), 95 | '(?i)\\s+County Sheriff(?:''s Office)?\\b', '' 96 | ), 97 | '(?i)\\s+Sheriff(?:''s Office)?\\b', '' 98 | ) 99 | ) 100 | ) = 101 | LOWER( 102 | trim( 103 | regexp_replace( 104 | regexp_replace( 105 | regexp_replace( 106 | regexp_replace(b.pub_agency_name, '(?i)\\bDrug Enforcement Team\\b', 'Drug Task Force'), 107 | '(?i)\\s+Police Department\\b', '' 108 | ), 109 | '(?i)\\s+County Sheriff(?:''s Office)?\\b', '' 110 | ), 111 | '(?i)\\s+Sheriff(?:''s Office)?\\b', '' 112 | ) 113 | ) 114 | ) 115 | AND 116 | -- Join on normalized county names: 117 | LOWER( 118 | array_join( 119 | array_sort( 120 | transform( 121 | split( 122 | regexp_replace(regexp_replace(lower(trim(a.primary_county)), ';', ','), '"', '') 123 | , ',' 124 | ), 125 | x -> trim(x) 126 | ) 127 | ), 128 | ', ' 129 | ) 130 | ) = LOWER( 131 | array_join( 132 | array_sort( 133 | transform( 134 | split( 135 | regexp_replace(regexp_replace(lower(trim(b.county_name)), ';', ','), '"', '') 136 | , ',' 137 | ), 138 | x -> trim(x) 139 | ) 140 | ), 141 | ', ' 142 | ) 143 | ) 144 | AND CAST(date_parse(trim(a.incident_date), '%m/%d/%Y') AS DATE) = CAST(date_parse(trim(b.incident_date), '%m/%d/%Y') AS DATE) 145 | AND a.incident_hour = b.incident_hour 146 | AND a.offense_name = b.offense_name; 147 | -------------------------------------------------------------------------------- /capstone_queries/silver/clean_address_colorado_business_entity_data.sql: 
-------------------------------------------------------------------------------- 1 | SELECT 2 | cbe.principalcity, 3 | COUNT(*) AS cnt 4 | FROM tayloro.colorado_business_entities cbe 5 | WHERE cbe.principalstate = 'CO' 6 | AND NOT EXISTS ( 7 | SELECT 1 8 | FROM tayloro.colorado_city_county_zip ccz 9 | WHERE LOWER(ccz.city) = LOWER(cbe.principalcity) 10 | ) 11 | GROUP BY cbe.principalcity 12 | ORDER BY cbe.principalcity; 13 | 14 | SELECT 15 | cbe.principalzipcode, 16 | cbe.principalcity, 17 | COUNT(*) AS record_count 18 | FROM tayloro.colorado_business_entities cbe 19 | LEFT JOIN tayloro.colorado_city_county_zip ccz 20 | ON TRIM(LOWER(cbe.principalcity)) = TRIM(LOWER(ccz.city)) 21 | AND TRIM(CAST(cbe.principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 22 | WHERE cbe.principalstate = 'CO' 23 | AND ccz.city IS NULL 24 | AND cbe.principalzipcode IS NOT NULL 25 | AND cbe.principalcity IS NOT NULL 26 | GROUP BY 27 | cbe.principalzipcode, 28 | cbe.principalcity 29 | ORDER BY 30 | record_count DESC; 31 | 32 | 33 | 34 | 35 | SELECT * FROM tayloro.colorado_business_entities WHERE principalzipcode = '80401'; 36 | 37 | SELECT * FROM tayloro.colorado_city_county_zip WHERE LOWER(city) LIKE '%timn%' ORDER BY zip_code; 38 | SELECT * FROM tayloro.colorado_city_county_zip WHERE zip_code = 80303; 39 | 40 | INSERT INTO tayloro.colorado_city_county_zip (city, zip_code, county) 41 | VALUES ('Evans', 80634, 'Weld'), 42 | ('Evans', 80645, 'Weld'); 43 | 44 | SELECT DISTINCT principalzipcode, principalcounty 45 | FROM tayloro.colorado_business_entities 46 | WHERE LOWER(principalcounty) LIKE '%douglas%'; 47 | 48 | SELECT COUNT(*) AS unmatched_count 49 | FROM tayloro.colorado_business_entities cbe 50 | LEFT JOIN tayloro.colorado_city_county_zip ccz 51 | ON TRIM(LOWER(cbe.principalcity)) = TRIM(LOWER(ccz.city)) 52 | AND TRIM(CAST(cbe.principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 53 | WHERE cbe.principalstate = 'CO' 54 | AND ccz.city IS NULL; 55 | 56 | 57 | 58
| UPDATE tayloro.colorado_business_entities 59 | SET principalcity = 'Peyton' 60 | WHERE principalstate = 'CO' 61 | AND principalcity = 'peyton co'; 62 | 63 | 64 | UPDATE tayloro.colorado_business_entities 65 | SET principalcity = 'Fort Collins' 66 | WHERE principalstate = 'CO' 67 | AND principalzipcode = '80524'; 68 | 69 | 70 | 71 | SELECT * 72 | FROM tayloro.colorado_business_entities cbe 73 | LEFT JOIN tayloro.colorado_city_county_zip ccz 74 | ON LOWER(cbe.principalcity) = LOWER(ccz.city) 75 | AND cbe.principalzipcode = CAST(ccz.zip_code AS VARCHAR(5)) 76 | WHERE cbe.principalstate = 'CO' 77 | AND ccz.city IS NULL; 78 | 79 | 80 | SELECT * 81 | FROM tayloro.colorado_business_entities 82 | WHERE principalzipcode = '80422' 83 | AND LOWER(principalcity) <> 'black hawk'; 84 | 85 | SELECT 86 | CASE 87 | WHEN ccz.city IS NULL THEN 'No Match' 88 | ELSE 'Match' 89 | END AS match_status, 90 | COUNT(*) AS count 91 | FROM tayloro.colorado_business_entities cbe 92 | LEFT JOIN tayloro.colorado_city_county_zip ccz 93 | ON LOWER(cbe.principalcity) = LOWER(ccz.city) 94 | AND cbe.principalzipcode = CAST(ccz.zip_code AS VARCHAR(5)) 95 | WHERE cbe.principalstate = 'CO' 96 | GROUP BY 97 | CASE 98 | WHEN ccz.city IS NULL THEN 'No Match' 99 | ELSE 'Match' 100 | END; 101 | 102 | 103 | SELECT * 104 | FROM tayloro.colorado_business_entities cbe 105 | LEFT JOIN tayloro.colorado_city_county_zip ccz 106 | ON TRIM(LOWER(cbe.principalcity)) = TRIM(LOWER(ccz.city)) 107 | AND TRIM(CAST(cbe.principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 108 | WHERE cbe.principalstate = 'CO' 109 | AND ccz.city IS NULL; 110 | 111 | SELECT * FROM tayloro.colorado_business_entities WHERE principalcounty LIKE '%County%'; 112 | 113 | 114 | 115 | 116 | UPDATE tayloro.colorado_business_entities 117 | SET 118 | principalcity = ( 119 | SELECT geo_city 120 | FROM tayloro.colorado_temp_geocoded_entities 121 | WHERE entityid = tayloro.colorado_business_entities.entityid 122 | ), 123 | principalzipcode = (
124 | SELECT geo_postcode 125 | FROM tayloro.colorado_temp_geocoded_entities 126 | WHERE entityid = tayloro.colorado_business_entities.entityid 127 | ), 128 | principalcounty = ( 129 | SELECT replace(geo_county, ' County', '') 130 | FROM tayloro.colorado_temp_geocoded_entities 131 | WHERE entityid = tayloro.colorado_business_entities.entityid 132 | ) 133 | WHERE entityid IN ( 134 | SELECT entityid 135 | FROM tayloro.colorado_temp_geocoded_entities 136 | ); 137 | 138 | 139 | SELECT cbe.principalcity, ctge.geo_city, cbe.principalcounty, ctge.geo_county, cbe.principalzipcode, ctge.geo_postcode FROM tayloro.colorado_temp_geocoded_entities ctge 140 | JOIN tayloro.colorado_business_entities cbe 141 | ON ctge.entityid = cbe.entityid; 142 | 143 | UPDATE tayloro.colorado_business_entities 144 | SET 145 | principalcity = ( 146 | SELECT geo_city 147 | FROM tayloro.colorado_temp_geocoded_entities 148 | WHERE entityid = tayloro.colorado_business_entities.entityid 149 | ), 150 | principalzipcode = ( 151 | SELECT geo_postcode 152 | FROM tayloro.colorado_temp_geocoded_entities 153 | WHERE entityid = tayloro.colorado_business_entities.entityid 154 | ), 155 | principalcounty = ( 156 | SELECT replace(geo_county, ' County', '') 157 | FROM tayloro.colorado_temp_geocoded_entities 158 | WHERE entityid = tayloro.colorado_business_entities.entityid 159 | ) 160 | WHERE entityid IN ( 161 | SELECT entityid 162 | FROM tayloro.colorado_temp_geocoded_entities 163 | ); 164 | 165 | SELECT 166 | CASE 167 | WHEN ccz.city IS NOT NULL THEN 'Match' 168 | ELSE 'No Match' 169 | END AS match_status, 170 | COUNT(*) AS count 171 | FROM tayloro.colorado_business_entities cbe 172 | LEFT JOIN tayloro.colorado_city_county_zip ccz 173 | ON TRIM(LOWER(cbe.principalcity)) = TRIM(LOWER(ccz.city)) 174 | AND TRIM(CAST(cbe.principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 175 | WHERE cbe.principalstate = 'CO' 176 | GROUP BY 177 | CASE WHEN ccz.city IS NOT NULL THEN 'Match' ELSE 'No Match' END; 178 |
179 | 180 | 181 | 182 | SELECT DISTINCT 183 | cbe.principalzipcode, 184 | cbe.principalcity 185 | FROM tayloro.colorado_business_entities cbe 186 | LEFT JOIN tayloro.colorado_city_county_zip ccz 187 | ON TRIM(LOWER(cbe.principalcity)) = TRIM(LOWER(ccz.city)) 188 | AND TRIM(CAST(cbe.principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 189 | WHERE cbe.principalstate = 'CO' 190 | AND ccz.city IS NULL 191 | ORDER BY cbe.principalcity, cbe.principalzipcode; 192 | 193 | 194 | DELETE FROM tayloro.colorado_business_entities 195 | WHERE principalstate = 'CO' 196 | AND entityid IN ( 197 | SELECT cbe.entityid 198 | FROM tayloro.colorado_business_entities cbe 199 | LEFT JOIN tayloro.colorado_city_county_zip ccz 200 | ON TRIM(LOWER(cbe.principalcity)) = TRIM(LOWER(ccz.city)) 201 | AND TRIM(CAST(cbe.principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 202 | WHERE cbe.principalstate = 'CO' 203 | AND ccz.city IS NULL 204 | ); 205 | 206 | SELECT county, COUNT(*) AS count 207 | FROM tayloro.colorado_city_county_zip 208 | WHERE LOWER(county) LIKE '%county%' 209 | GROUP BY county 210 | ORDER BY count DESC; 211 | 212 | SELECT 213 | cbe.entityid, 214 | cbe.principalcity, 215 | cbe.principalzipcode, 216 | ccz.city AS ref_city, 217 | ccz.zip_code AS ref_zip, 218 | ccz.county AS ref_county 219 | FROM tayloro.colorado_business_entities cbe 220 | JOIN tayloro.colorado_city_county_zip ccz 221 | ON TRIM(LOWER(cbe.principalcity)) = TRIM(LOWER(ccz.city)) 222 | AND TRIM(CAST(cbe.principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 223 | WHERE cbe.principalstate = 'CO' 224 | LIMIT 100; 225 | 226 | 227 | UPDATE tayloro.colorado_business_entities 228 | SET principalcounty = ( 229 | SELECT ccz.county 230 | FROM tayloro.colorado_city_county_zip ccz 231 | WHERE TRIM(LOWER(principalcity)) = TRIM(LOWER(ccz.city)) 232 | AND TRIM(CAST(principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 233 | ) 234 | WHERE principalstate = 'CO'; 235 | 236 | CREATE 
TABLE tayloro.colorado_city_county_zip_new AS 237 | SELECT DISTINCT zip_code, city, county 238 | FROM tayloro.colorado_city_county_zip; 239 | 240 | SELECT COUNT(*) AS original_count FROM tayloro.colorado_city_county_zip; 241 | SELECT COUNT(*) AS distinct_count FROM tayloro.colorado_city_county_zip_new; 242 | 243 | DROP TABLE tayloro.colorado_city_county_zip; 244 | ALTER TABLE tayloro.colorado_city_county_zip_new RENAME TO tayloro.colorado_city_county_zip; 245 | 246 | 247 | SELECT principalcity, principalzipcode, principalcounty 248 | FROM tayloro.colorado_business_entities 249 | WHERE principalstate = 'CO' 250 | LIMIT 20; 251 | -------------------------------------------------------------------------------- /scripts/business_entity_address_validation.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | from pyspark.sql import SparkSession 5 | from pyspark.conf import SparkConf 6 | from pyspark.sql import functions as F 7 | from pyspark.sql.functions import concat_ws, col, udf 8 | from pyspark.sql.types import StructType, StructField, StringType 9 | import requests 10 | 11 | # Add the parent directory of "upload_to_s3.py" to the module search path 12 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 13 | 14 | from aws_secret_manager import get_secret 15 | 16 | # Load environment variables from .env file 17 | load_dotenv() 18 | 19 | # Fetch AWS keys and catalog credentials from environment variables/secrets 20 | aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 21 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 22 | tabular_credential = get_secret("TABULAR_CREDENTIAL") 23 | catalog_name = get_secret("CATALOG_NAME") 24 | 25 | # S3 bucket name (if needed) 26 | s3_bucket = "< zach bucket name >" 27 | 28 | # Define required packages 29 | extra_packages = [ 30 | "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2", 31 | 
"software.amazon.awssdk:bundle:2.17.178", 32 | "software.amazon.awssdk:url-connection-client:2.17.178", 33 | "org.apache.hadoop:hadoop-aws:3.3.4" 34 | ] 35 | 36 | # Initialize SparkConf 37 | conf = SparkConf() 38 | conf.set('spark.jars.packages', ','.join(extra_packages)) 39 | conf.set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') 40 | conf.set('spark.sql.defaultCatalog', catalog_name) 41 | conf.set(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') 42 | conf.set(f'spark.sql.catalog.{catalog_name}.credential', tabular_credential) 43 | conf.set(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog') 44 | conf.set(f'spark.sql.catalog.{catalog_name}.warehouse', catalog_name) 45 | conf.set(f'spark.sql.catalog.{catalog_name}.uri', 'https://api.tabular.io/ws/') 46 | conf.set('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') 47 | conf.set('spark.sql.catalog.spark_catalog.type', 'hive') 48 | conf.set('spark.executor.memory', '16g') 49 | conf.set('spark.sql.shuffle.partitions', '200') 50 | conf.set('spark.driver.memory', '16g') 51 | 52 | # Initialize SparkSession 53 | spark = SparkSession.builder \ 54 | .appName("Geoapify Address Validation with EntityID") \ 55 | .config(conf=conf) \ 56 | .getOrCreate() 57 | 58 | # (Optional) Set AWS credentials if reading from S3; if not needed, these can be omitted. 
59 | spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key_id) 60 | spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_access_key) 61 | spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com") 62 | 63 | print("SparkSession initialized successfully:", spark) 64 | 65 | # ------------------------------------------------------------------------------ 66 | # Read the required tables into DataFrames 67 | # ------------------------------------------------------------------------------ 68 | # Business Entities DataFrame 69 | cbeDF = spark.table("tayloro.colorado_business_entities") 70 | # Reference table (city, county, zip) DataFrame 71 | cczDF = spark.table("tayloro.colorado_city_county_zip") 72 | 73 | # ------------------------------------------------------------------------------ 74 | # Prepare DataFrames for Joining 75 | # ------------------------------------------------------------------------------ 76 | # For cbeDF: create lower-case version of principalcity and ensure ZIP is a string. 77 | cbe_prepped = cbeDF.withColumn("city_lower", F.lower(F.col("principalcity"))) \ 78 | .withColumn("zip_str", F.col("principalzipcode")) 79 | # For cczDF: create lower-case version of city and cast zip_code to string. 
80 | ccz_prepped = cczDF.withColumn("city_lower", F.lower(F.col("city"))) \ 81 | .withColumn("zip_str", F.col("zip_code").cast("string")) 82 | 83 | # Join the DataFrames on the prepped columns 84 | enrichedDF = cbe_prepped.join(ccz_prepped, on=["city_lower", "zip_str"], how="left") 85 | 86 | # Filter for Colorado records where the reference table did not match (i.e., unmatched addresses) 87 | unmatchedDF = enrichedDF.filter( 88 | (F.col("principalstate") == "CO") & 89 | (F.col("city").isNull()) # "city" from the reference table (ccz) 90 | ) 91 | 92 | print("Total unmatched rows:", unmatchedDF.count()) 93 | 94 | # ------------------------------------------------------------------------------ 95 | # Build the Full Address and Define the Geocoding UDF 96 | # ------------------------------------------------------------------------------ 97 | # Create a full address column. You can adjust the concatenation as needed. 98 | unmatchedDF = unmatchedDF.withColumn( 99 | "full_address", 100 | concat_ws(", ", F.col("principaladdress1"), F.col("principalcity"), F.col("principalzipcode")) 101 | ) 102 | 103 | # Define a Python function to call the Geoapify API and extract city, county, state, and postcode. 
104 | def geocode_address(address): 105 | if not address: 106 | return (None, None, None, None) 107 | base_url = "https://api.geoapify.com/v1/geocode/search" 108 | params = {"text": address, "apiKey": os.getenv("GEOAPIFY_API_KEY")}  # api_key was undefined here; env var name is an assumption 109 | try: 110 | resp = requests.get(base_url, params=params, timeout=10)  # params= lets requests URL-encode the address 111 | if resp.status_code == 200: 112 | data = resp.json() 113 | if "features" in data and len(data["features"]) > 0: 114 | properties = data["features"][0].get("properties", {}) 115 | city = properties.get("city", None) 116 | county = properties.get("county", None) 117 | state = properties.get("state", None) 118 | postcode = properties.get("postcode", None) 119 | return (city, county, state, postcode) 120 | return (None, None, None, None) 121 | except Exception: 122 | return (None, None, None, None) 123 | 124 | # Define the schema for the geocode response as a struct 125 | geocode_schema = StructType([ 126 | StructField("geo_city", StringType(), True), 127 | StructField("geo_county", StringType(), True), 128 | StructField("geo_state", StringType(), True), 129 | StructField("geo_postcode", StringType(), True) 130 | ]) 131 | 132 | # Register the UDF 133 | geocode_udf = udf(geocode_address, geocode_schema) 134 | 135 | # Apply the UDF to the DataFrame to get geocoded data 136 | geocodedDF = unmatchedDF.withColumn("geocode", geocode_udf(F.col("full_address"))) 137 | 138 | # Extract the individual fields from the geocode struct and retain the entityid 139 | geocodedDF = geocodedDF.withColumn("geo_city", F.col("geocode.geo_city")) \ 140 | .withColumn("geo_county", F.col("geocode.geo_county")) \ 141 | .withColumn("geo_state", F.col("geocode.geo_state")) \ 142 | .withColumn("geo_postcode", F.col("geocode.geo_postcode")) \ 143 | .drop("geocode") 144 | 145 | # ------------------------------------------------------------------------------ 146 | # Select only the columns that match the target table schema 147 | # Target Table Columns: entityid, principaladdress1, principalcity, principalzipcode, full_address, 148 | # 
geo_city, geo_county, geo_state, geo_postcode 149 | # ------------------------------------------------------------------------------ 150 | outputDF = geocodedDF.select( 151 | "entityid", 152 | "principaladdress1", 153 | "principalcity", 154 | "principalzipcode", 155 | "full_address", 156 | "geo_city", 157 | "geo_county", 158 | "geo_state", 159 | "geo_postcode" 160 | ) 161 | 162 | # (Optional) Show sample results 163 | outputDF.show(20, truncate=False) 164 | 165 | # ------------------------------------------------------------------------------ 166 | # Create the target Iceberg table if it doesn't exist 167 | # ------------------------------------------------------------------------------ 168 | spark.sql(""" 169 | CREATE TABLE IF NOT EXISTS `tayloro-academy-warehouse`.tayloro.colorado_temp_geocoded_entities ( 170 | entityid BIGINT, 171 | principaladdress1 STRING, 172 | principalcity STRING, 173 | principalzipcode STRING, 174 | full_address STRING, 175 | geo_city STRING, 176 | geo_county STRING, 177 | geo_state STRING, 178 | geo_postcode STRING 179 | ) 180 | USING iceberg 181 | """) 182 | 183 | # ------------------------------------------------------------------------------ 184 | # Write the selected output DataFrame to the Iceberg table 185 | # ------------------------------------------------------------------------------ 186 | outputDF.writeTo("`tayloro-academy-warehouse`.tayloro.colorado_temp_geocoded_entities") \ 187 | .tableProperty("write.format.default", "parquet") \ 188 | .overwritePartitions() 189 | 190 | print("Geocoded data successfully written to Iceberg table: `tayloro-academy-warehouse`.tayloro.colorado_temp_geocoded_entities") 191 | -------------------------------------------------------------------------------- /Capstone-data-dictionary.csv: -------------------------------------------------------------------------------- 1 | schema,table,attribute,type,level,Data Lineage Source 2 | 
tayloro,colorado_avg_age_per_crime_category_1997_2020,city,varchar,gold,colorado_crimes_with_cities 3 | tayloro,colorado_avg_age_per_crime_category_1997_2020,offense_category_name,varchar,gold,colorado_crimes_with_cities 4 | tayloro,colorado_avg_age_per_crime_category_1997_2020,avg_age,double,gold,colorado_crimes_with_cities 5 | tayloro,colorado_business_entities,entityid,bigint,silver,colorado_business_entities_raw 6 | tayloro,colorado_business_entities,entityname,varchar,silver,colorado_business_entities_raw 7 | tayloro,colorado_business_entities,principaladdress1,varchar,silver,colorado_business_entities_raw 8 | tayloro,colorado_business_entities,principaladdress2,varchar,silver,colorado_business_entities_raw 9 | tayloro,colorado_business_entities,principalcity,varchar,silver,colorado_business_entities_raw 10 | tayloro,colorado_business_entities,principalcounty,varchar,silver,colorado_business_entities_raw 11 | tayloro,colorado_business_entities,principalstate,varchar,silver,colorado_business_entities_raw 12 | tayloro,colorado_business_entities,principalzipcode,varchar,silver,colorado_business_entities_raw 13 | tayloro,colorado_business_entities,principalcountry,varchar,silver,colorado_business_entities_raw 14 | tayloro,colorado_business_entities,entitystatus,varchar,silver,colorado_business_entities_raw 15 | tayloro,colorado_business_entities,jurisdictonofformation,varchar,silver,colorado_business_entities_raw 16 | tayloro,colorado_business_entities,entitytype,varchar,silver,colorado_business_entities_raw 17 | tayloro,colorado_business_entities,entityformdate,date,silver,colorado_business_entities_raw 18 | tayloro,colorado_business_entities_duplicates,entityid,bigint,silver,colorado_business_entities_stg_{{yesterday_ds}} 19 | tayloro,colorado_business_entities_duplicates,entityname,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 20 | 
tayloro,colorado_business_entities_duplicates,principaladdress1,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 21 | tayloro,colorado_business_entities_duplicates,principaladdress2,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 22 | tayloro,colorado_business_entities_duplicates,principalcity,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 23 | tayloro,colorado_business_entities_duplicates,principalcounty,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 24 | tayloro,colorado_business_entities_duplicates,principalstate,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 25 | tayloro,colorado_business_entities_duplicates,principalzipcode,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 26 | tayloro,colorado_business_entities_duplicates,principalcountry,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 27 | tayloro,colorado_business_entities_duplicates,entitystatus,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 28 | tayloro,colorado_business_entities_duplicates,jurisdictonofformation,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 29 | tayloro,colorado_business_entities_duplicates,entitytype,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 30 | tayloro,colorado_business_entities_duplicates,entityformdate,date,silver,colorado_business_entities_stg_{{yesterday_ds}} 31 | tayloro,colorado_business_entities_misfit_entities,entityid,bigint,silver,colorado_business_entities_stg_{{yesterday_ds}} 32 | tayloro,colorado_business_entities_misfit_entities,entityname,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 33 | tayloro,colorado_business_entities_misfit_entities,principaladdress1,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 34 | tayloro,colorado_business_entities_misfit_entities,principaladdress2,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 35 | 
tayloro,colorado_business_entities_misfit_entities,principalcity,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 36 | tayloro,colorado_business_entities_misfit_entities,principalcounty,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 37 | tayloro,colorado_business_entities_misfit_entities,principalstate,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 38 | tayloro,colorado_business_entities_misfit_entities,principalzipcode,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 39 | tayloro,colorado_business_entities_misfit_entities,principalcountry,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 40 | tayloro,colorado_business_entities_misfit_entities,entitystatus,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 41 | tayloro,colorado_business_entities_misfit_entities,jurisdictonofformation,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 42 | tayloro,colorado_business_entities_misfit_entities,entitytype,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 43 | tayloro,colorado_business_entities_misfit_entities,entityformdate,date,silver,colorado_business_entities_stg_{{yesterday_ds}} 44 | tayloro,colorado_business_entities_raw,entityid,bigint,bronze,source 45 | tayloro,colorado_business_entities_raw,entityname,varchar,bronze,source 46 | tayloro,colorado_business_entities_raw,principaladdress1,varchar,bronze,source 47 | tayloro,colorado_business_entities_raw,principaladdress2,varchar,bronze,source 48 | tayloro,colorado_business_entities_raw,principalcity,varchar,bronze,source 49 | tayloro,colorado_business_entities_raw,principalstate,varchar,bronze,source 50 | tayloro,colorado_business_entities_raw,principalzipcode,varchar,bronze,source 51 | tayloro,colorado_business_entities_raw,principalcountry,varchar,bronze,source 52 | tayloro,colorado_business_entities_raw,entitystatus,varchar,bronze,source 53 | 
tayloro,colorado_business_entities_raw,jurisdictonofformation,varchar,bronze,source 54 | tayloro,colorado_business_entities_raw,entitytype,varchar,bronze,source 55 | tayloro,colorado_business_entities_raw,entityformdate,varchar,bronze,source 56 | tayloro,colorado_business_entities_with_ranking,entityid,bigint,silver,colorado_business_entities 57 | tayloro,colorado_business_entities_with_ranking,entityname,varchar,silver,colorado_business_entities 58 | tayloro,colorado_business_entities_with_ranking,principaladdress1,varchar,silver,colorado_business_entities 59 | tayloro,colorado_business_entities_with_ranking,principaladdress2,varchar,silver,colorado_business_entities 60 | tayloro,colorado_business_entities_with_ranking,principalcity,varchar,silver,colorado_business_entities 61 | tayloro,colorado_business_entities_with_ranking,principalcounty,varchar,silver,colorado_business_entities 62 | tayloro,colorado_business_entities_with_ranking,principalstate,varchar,silver,colorado_business_entities 63 | tayloro,colorado_business_entities_with_ranking,principalzipcode,varchar,silver,colorado_business_entities 64 | tayloro,colorado_business_entities_with_ranking,principalcountry,varchar,silver,colorado_business_entities 65 | tayloro,colorado_business_entities_with_ranking,entitystatus,varchar,silver,colorado_business_entities 66 | tayloro,colorado_business_entities_with_ranking,jurisdictonofformation,varchar,silver,colorado_business_entities 67 | tayloro,colorado_business_entities_with_ranking,entitytype,varchar,silver,colorado_business_entities 68 | tayloro,colorado_business_entities_with_ranking,entityformdate,date,silver,colorado_business_entities 69 | tayloro,colorado_business_entities_with_ranking,crime_rank,double,silver,colorado_business_entities 70 | tayloro,colorado_business_entities_with_ranking,income_rank,integer,silver,colorado_business_entities 71 | tayloro,colorado_business_entities_with_ranking,population_rank,integer,silver,colorado_business_entities 72 | 
tayloro,colorado_business_entities_with_ranking,final_rank,double,silver,colorado_business_entities 73 | tayloro,colorado_cities_and_counties,city,varchar,silver,source 74 | tayloro,colorado_cities_and_counties,county,varchar,silver,source 75 | tayloro,colorado_city_county_zip,zip_code,integer,bronze,source 76 | tayloro,colorado_city_county_zip,city,varchar,bronze,source 77 | tayloro,colorado_city_county_zip,county,varchar,bronze,source 78 | tayloro,colorado_city_crime_counts_1997_2020,city,varchar,gold,colorado_crimes 79 | tayloro,colorado_city_crime_counts_1997_2020,crime_count,bigint,gold,colorado_crimes 80 | tayloro,colorado_city_crime_time_likelihood,city,varchar,gold,colorado_crimes 81 | tayloro,colorado_city_crime_time_likelihood,offense_category_name,varchar,gold,colorado_crimes_with_cities 82 | tayloro,colorado_city_crime_time_likelihood,avg_day_crimes,double,gold,colorado_crimes_with_cities 83 | tayloro,colorado_city_crime_time_likelihood,avg_night_crimes,double,gold,colorado_crimes_with_cities 84 | tayloro,colorado_city_crime_time_likelihood,is_night_likely,boolean,gold,colorado_crimes_with_cities 85 | tayloro,colorado_city_crime_time_likelihood,is_day_likely,boolean,gold,colorado_crimes_with_cities 86 | tayloro,colorado_city_seasonal_crime_rates_1997_2020,city,varchar,gold,colorado_crimes_with_cities 87 | tayloro,colorado_city_seasonal_crime_rates_1997_2020,crime_month,bigint,gold,colorado_crimes_with_cities 88 | tayloro,colorado_city_seasonal_crime_rates_1997_2020,month_name,varchar,gold,colorado_crimes_with_cities 89 | tayloro,colorado_city_seasonal_crime_rates_1997_2020,avg_monthly_crimes,double,gold,colorado_crimes_with_cities 90 | tayloro,colorado_county_agency_crime_counts,county_name,varchar,gold,colorado_crimes 91 | tayloro,colorado_county_agency_crime_counts,agency_name,varchar,gold,colorado_crimes 92 | tayloro,colorado_county_agency_crime_counts,total_crimes,bigint,gold,colorado_crimes 93 | 
tayloro,colorado_county_average_income_1997_2020,county_name,varchar,gold,colorado_crimes 94 | tayloro,colorado_county_average_income_1997_2020,agency_name,varchar,gold,colorado_crimes 95 | tayloro,colorado_county_average_income_1997_2020,total_crimes,bigint,gold,colorado_crimes 96 | tayloro,colorado_county_coordinates_raw,county,varchar,bronze,source 97 | tayloro,colorado_county_coordinates_raw,label,varchar,bronze,source 98 | tayloro,colorado_county_coordinates_raw,cent_lat,double,bronze,source 99 | tayloro,colorado_county_coordinates_raw,cent_long,double,bronze,source 100 | tayloro,colorado_county_crime_per_capita_1997_2020,county_name,varchar,silver,"colorado_population_raw, colorado_crimes" 101 | tayloro,colorado_county_crime_per_capita_1997_2020,total_crimes,bigint,silver,"colorado_population_raw, colorado_crimes" 102 | tayloro,colorado_county_crime_per_capita_1997_2020,total_population,bigint,silver,"colorado_population_raw, colorado_crimes" 103 | tayloro,colorado_county_crime_per_capita_1997_2020,crime_per_100k,decimal,silver,"colorado_population_raw, colorado_crimes" 104 | tayloro,colorado_county_crime_per_capita_with_coordinates,county,varchar,gold,"colorado_county_crime_per_capita_1997_2020, colorado_county_coordinates_raw" 105 | tayloro,colorado_county_crime_per_capita_with_coordinates,crime_per_100k,"decimal(26,1)",gold,"colorado_county_crime_per_capita_1997_2020, colorado_county_coordinates_raw" 106 | tayloro,colorado_county_crime_per_capita_with_coordinates,cent_lat,double,gold,"colorado_county_crime_per_capita_1997_2020, colorado_county_coordinates_raw" 107 | tayloro,colorado_county_crime_per_capita_with_coordinates,cent_long,double,gold,"colorado_county_crime_per_capita_1997_2020, colorado_county_coordinates_raw" 108 | tayloro,colorado_county_crime_vs_population_1997_2020,county_name,varchar,silver,"colorado_population_raw, colorado_crimes" 109 | 
tayloro,colorado_county_crime_vs_population_1997_2020,crime_year,bigint,silver,"colorado_population_raw, colorado_crimes" 110 | tayloro,colorado_county_crime_vs_population_1997_2020,total_crimes,bigint,silver,"colorado_population_raw, colorado_crimes" 111 | tayloro,colorado_county_crime_vs_population_1997_2020,total_population,bigint,silver,"colorado_population_raw, colorado_crimes" 112 | tayloro,colorado_county_crime_vs_population_1997_2020,crime_per_100k,"decimal(26,1)",silver,"colorado_population_raw, colorado_crimes" 113 | tayloro,colorado_county_seasonal_crime_rates_1997_2020,county_name,varchar,gold,colorado_crimes 114 | tayloro,colorado_county_seasonal_crime_rates_1997_2020,crime_month,bigint,gold,colorado_crimes 115 | tayloro,colorado_county_seasonal_crime_rates_1997_2020,month_name,varchar,gold,colorado_crimes 116 | tayloro,colorado_county_seasonal_crime_rates_1997_2020,avg_monthly_crimes,double,gold,colorado_crimes 117 | tayloro,colorado_crime_category_totals_per_county_1997_2020,offense_category_name,varchar,gold,colorado_crimes 118 | tayloro,colorado_crime_category_totals_per_county_1997_2020,crime_count,bigint,gold,colorado_crimes 119 | tayloro,colorado_crime_category_totals_per_county_1997_2020,county_name,varchar,gold,colorado_crimes 120 | tayloro,colorado_crime_population_trends,year,integer,silver,"colorado_population_raw, colorado_crimes" 121 | tayloro,colorado_crime_population_trends,county_name,varchar,silver,"colorado_population_raw, colorado_crimes" 122 | tayloro,colorado_crime_population_trends,total_crimes,bigint,silver,"colorado_population_raw, colorado_crimes" 123 | tayloro,colorado_crime_population_trends,total_population,bigint,silver,"colorado_population_raw, colorado_crimes" 124 | tayloro,colorado_crime_tier_arson,county_name,varchar,silver,colorado_crimes 125 | tayloro,colorado_crime_tier_arson,total_crimes,bigint,silver,colorado_crimes 126 | tayloro,colorado_crime_tier_arson,crime_percentile,double,silver,colorado_crimes 127 | 
tayloro,colorado_crime_tier_arson,crime_tier,integer,silver,colorado_crimes 128 | tayloro,colorado_crime_tier_burglary,county_name,varchar,silver,colorado_crimes 129 | tayloro,colorado_crime_tier_burglary,total_crimes,bigint,silver,colorado_crimes 130 | tayloro,colorado_crime_tier_burglary,crime_percentile,double,silver,colorado_crimes 131 | tayloro,colorado_crime_tier_burglary,crime_tier,integer,silver,colorado_crimes 132 | tayloro,colorado_crime_tier_county_rank,county_name,varchar,silver,"colorado_crime_tier_property_destruction, colorado_crime_tier_burglary, 133 | colorado_crime_tier_larceny_theft, colorado_crime_tier_vehicle_theft, 134 | colorado_crime_tier_robbery, colorado_crime_tier_arson, colorado_crime_tier_stolen_property" 135 | tayloro,colorado_crime_tier_county_rank,property_destruction_tier,integer,silver,"colorado_crime_tier_property_destruction, colorado_crime_tier_burglary, 136 | colorado_crime_tier_larceny_theft, colorado_crime_tier_vehicle_theft, 137 | colorado_crime_tier_robbery, colorado_crime_tier_arson, colorado_crime_tier_stolen_property" 138 | tayloro,colorado_crime_tier_county_rank,burglary_tier,integer,silver,"colorado_crime_tier_property_destruction, colorado_crime_tier_burglary, 139 | colorado_crime_tier_larceny_theft, colorado_crime_tier_vehicle_theft, 140 | colorado_crime_tier_robbery, colorado_crime_tier_arson, colorado_crime_tier_stolen_property" 141 | tayloro,colorado_crime_tier_county_rank,larceny_theft_tier,integer,silver,"colorado_crime_tier_property_destruction, colorado_crime_tier_burglary, 142 | colorado_crime_tier_larceny_theft, colorado_crime_tier_vehicle_theft, 143 | colorado_crime_tier_robbery, colorado_crime_tier_arson, colorado_crime_tier_stolen_property" 144 | tayloro,colorado_crime_tier_county_rank,vehicle_theft_tier,integer,silver,"colorado_crime_tier_property_destruction, colorado_crime_tier_burglary, 145 | colorado_crime_tier_larceny_theft, colorado_crime_tier_vehicle_theft, 146 | colorado_crime_tier_robbery, 
colorado_crime_tier_arson, colorado_crime_tier_stolen_property" 147 | tayloro,colorado_crime_tier_county_rank,robbery_tier,integer,silver,"colorado_crime_tier_property_destruction, colorado_crime_tier_burglary, 148 | colorado_crime_tier_larceny_theft, colorado_crime_tier_vehicle_theft, 149 | colorado_crime_tier_robbery, colorado_crime_tier_arson, colorado_crime_tier_stolen_property" 150 | tayloro,colorado_crime_tier_county_rank,arson_tier,integer,silver,"colorado_crime_tier_property_destruction, colorado_crime_tier_burglary, 151 | colorado_crime_tier_larceny_theft, colorado_crime_tier_vehicle_theft, 152 | colorado_crime_tier_robbery, colorado_crime_tier_arson, colorado_crime_tier_stolen_property" 153 | tayloro,colorado_crime_tier_county_rank,stolen_property_tier,integer,silver,"colorado_crime_tier_property_destruction, colorado_crime_tier_burglary, 154 | colorado_crime_tier_larceny_theft, colorado_crime_tier_vehicle_theft, 155 | colorado_crime_tier_robbery, colorado_crime_tier_arson, colorado_crime_tier_stolen_property" 156 | tayloro,colorado_crime_tier_county_rank,overall_crime_tier,double,silver,"colorado_crime_tier_property_destruction, colorado_crime_tier_burglary, 157 | colorado_crime_tier_larceny_theft, colorado_crime_tier_vehicle_theft, 158 | colorado_crime_tier_robbery, colorado_crime_tier_arson, colorado_crime_tier_stolen_property" 159 | tayloro,colorado_crime_tier_larceny_theft,county_name,varchar,silver,colorado_crimes 160 | tayloro,colorado_crime_tier_larceny_theft,total_crimes,bigint,silver,colorado_crimes 161 | tayloro,colorado_crime_tier_larceny_theft,crime_percentile,double,silver,colorado_crimes 162 | tayloro,colorado_crime_tier_larceny_theft,crime_tier,integer,silver,colorado_crimes 163 | tayloro,colorado_crime_tier_property_destruction,county_name,varchar,silver,colorado_crimes 164 | tayloro,colorado_crime_tier_property_destruction,total_crimes,bigint,silver,colorado_crimes 165 | 
tayloro,colorado_crime_tier_property_destruction,crime_percentile,double,silver,colorado_crimes 166 | tayloro,colorado_crime_tier_property_destruction,crime_tier,integer,silver,colorado_crimes 167 | tayloro,colorado_crime_tier_robbery,county_name,varchar,silver,colorado_crimes 168 | tayloro,colorado_crime_tier_robbery,total_crimes,bigint,silver,colorado_crimes 169 | tayloro,colorado_crime_tier_robbery,crime_percentile,double,silver,colorado_crimes 170 | tayloro,colorado_crime_tier_robbery,crime_tier,integer,silver,colorado_crimes 171 | tayloro,colorado_crime_tier_stolen_property,county_name,varchar,silver,colorado_crimes 172 | tayloro,colorado_crime_tier_stolen_property,total_crimes,bigint,silver,colorado_crimes 173 | tayloro,colorado_crime_tier_stolen_property,crime_percentile,double,silver,colorado_crimes 174 | tayloro,colorado_crime_tier_stolen_property,crime_tier,integer,silver,colorado_crimes 175 | tayloro,colorado_crime_tier_vehicle_theft,county_name,varchar,silver,colorado_crimes 176 | tayloro,colorado_crime_tier_vehicle_theft,total_crimes,bigint,silver,colorado_crimes 177 | tayloro,colorado_crime_tier_vehicle_theft,crime_percentile,double,silver,colorado_crimes 178 | tayloro,colorado_crime_tier_vehicle_theft,crime_tier,integer,silver,colorado_crimes 179 | tayloro,colorado_crime_type_distribution_by_city,city,varchar,gold,colorado_crimes_with_cities 180 | tayloro,colorado_crime_type_distribution_by_city,crime_against,varchar,gold,colorado_crimes_with_cities 181 | tayloro,colorado_crime_type_distribution_by_city,total_crimes,bigint,gold,colorado_crimes_with_cities 182 | tayloro,colorado_crime_vs_median_household_income_1997_2020,county_name,varchar,gold,"colorado_income_1997_2020, colorado_crimes" 183 | tayloro,colorado_crime_vs_median_household_income_1997_2020,year,integer,gold,"colorado_income_1997_2020, colorado_crimes" 184 | tayloro,colorado_crime_vs_median_household_income_1997_2020,total_crimes,bigint,gold,"colorado_income_1997_2020, colorado_crimes" 
185 | tayloro,colorado_crime_vs_median_household_income_1997_2020,total_population,bigint,gold,"colorado_income_1997_2020, colorado_crimes" 186 | tayloro,colorado_crime_vs_median_household_income_1997_2020,crime_per_100k,"decimal(26,1)",gold,"colorado_income_1997_2020, colorado_crimes" 187 | tayloro,colorado_crime_vs_median_household_income_1997_2020,median_household_income,bigint,gold,"colorado_income_1997_2020, colorado_crimes" 188 | tayloro,colorado_crimes,agency_name,varchar,silver,"colorado_crimes_1997_2015_raw, colorado_crimes_2016_2020_raw" 189 | tayloro,colorado_crimes,county_name,varchar,silver,"colorado_crimes_1997_2015_raw, colorado_crimes_2016_2020_raw" 190 | tayloro,colorado_crimes,incident_date,date,silver,"colorado_crimes_1997_2015_raw, colorado_crimes_2016_2020_raw" 191 | tayloro,colorado_crimes,incident_hour,integer,silver,"colorado_crimes_1997_2015_raw, colorado_crimes_2016_2020_raw" 192 | tayloro,colorado_crimes,offense_name,varchar,silver,"colorado_crimes_1997_2015_raw, colorado_crimes_2016_2020_raw" 193 | tayloro,colorado_crimes,crime_against,varchar,silver,"colorado_crimes_1997_2015_raw, colorado_crimes_2016_2020_raw" 194 | tayloro,colorado_crimes,offense_category_name,varchar,silver,"colorado_crimes_1997_2015_raw, colorado_crimes_2016_2020_raw" 195 | tayloro,colorado_crimes,age_num,integer,silver,"colorado_crimes_1997_2015_raw, colorado_crimes_2016_2020_raw" 196 | tayloro,colorado_crimes_1997_2015_raw,agency_name,varchar,bronze,source 197 | tayloro,colorado_crimes_1997_2015_raw,agency_type_name,varchar,bronze,source 198 | tayloro,colorado_crimes_1997_2015_raw,city_name,varchar,bronze,source 199 | tayloro,colorado_crimes_1997_2015_raw,primary_county,varchar,bronze,source 200 | tayloro,colorado_crimes_1997_2015_raw,incident_hour,integer,bronze,source 201 | tayloro,colorado_crimes_1997_2015_raw,offense_name,varchar,bronze,source 202 | tayloro,colorado_crimes_1997_2015_raw,crime_against,varchar,bronze,source 203 | 
tayloro,colorado_crimes_1997_2015_raw,offense_category_name,varchar,bronze,source 204 | tayloro,colorado_crimes_1997_2015_raw,age_num,integer,bronze,source 205 | tayloro,colorado_crimes_1997_2015_raw,incident_date,varchar,bronze,source 206 | tayloro,colorado_crimes_2016_2020_raw,pub_agency_name,varchar,bronze,source 207 | tayloro,colorado_crimes_2016_2020_raw,county_name,varchar,bronze,source 208 | tayloro,colorado_crimes_2016_2020_raw,incident_date,varchar,bronze,source 209 | tayloro,colorado_crimes_2016_2020_raw,incident_hour,integer,bronze,source 210 | tayloro,colorado_crimes_2016_2020_raw,offense_name,varchar,bronze,source 211 | tayloro,colorado_crimes_2016_2020_raw,crime_against,varchar,bronze,source 212 | tayloro,colorado_crimes_2016_2020_raw,offense_category_name,varchar,bronze,source 213 | tayloro,colorado_crimes_2016_2020_raw,offense_group,varchar,bronze,source 214 | tayloro,colorado_crimes_2016_2020_raw,age_num,integer,bronze,source 215 | tayloro,colorado_crimes_2016_2020_with_cities,pub_agency_name,varchar,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 216 | tayloro,colorado_crimes_2016_2020_with_cities,county_name,varchar,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 217 | tayloro,colorado_crimes_2016_2020_with_cities,incident_date,varchar,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 218 | tayloro,colorado_crimes_2016_2020_with_cities,incident_hour,integer,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 219 | tayloro,colorado_crimes_2016_2020_with_cities,offense_name,varchar,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 220 | tayloro,colorado_crimes_2016_2020_with_cities,crime_against,varchar,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 221 | tayloro,colorado_crimes_2016_2020_with_cities,offense_category_name,varchar,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 222 | 
tayloro,colorado_crimes_2016_2020_with_cities,offense_group,varchar,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 223 | tayloro,colorado_crimes_2016_2020_with_cities,age_num,integer,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 224 | tayloro,colorado_crimes_2016_2020_with_cities,city,varchar,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 225 | tayloro,colorado_crimes_with_cities,agency_name,varchar,silver,"colorado_crimes_2016_2020_with_cities, colorado_crimes_1997_2015_raw" 226 | tayloro,colorado_crimes_with_cities,county_name,varchar,silver,"colorado_crimes_2016_2020_with_cities, colorado_crimes_1997_2015_raw" 227 | tayloro,colorado_crimes_with_cities,incident_date,date,silver,"colorado_crimes_2016_2020_with_cities, colorado_crimes_1997_2015_raw" 228 | tayloro,colorado_crimes_with_cities,incident_hour,integer,silver,"colorado_crimes_2016_2020_with_cities, colorado_crimes_1997_2015_raw" 229 | tayloro,colorado_crimes_with_cities,offense_name,varchar,silver,"colorado_crimes_2016_2020_with_cities, colorado_crimes_1997_2015_raw" 230 | tayloro,colorado_crimes_with_cities,crime_against,varchar,silver,"colorado_crimes_2016_2020_with_cities, colorado_crimes_1997_2015_raw" 231 | tayloro,colorado_crimes_with_cities,offense_category_name,varchar,silver,"colorado_crimes_2016_2020_with_cities, colorado_crimes_1997_2015_raw" 232 | tayloro,colorado_crimes_with_cities,age_num,integer,silver,"colorado_crimes_2016_2020_with_cities, colorado_crimes_1997_2015_raw" 233 | tayloro,colorado_crimes_with_cities,city,varchar,silver,"colorado_crimes_2016_2020_with_cities, colorado_crimes_1997_2015_raw" 234 | tayloro,colorado_final_county_tier_rank,county,varchar,silver,"colorado_crime_tier_county_rank, colorado_income_household_tier_rank, 235 | colorado_population_crime_per_capita_rank" 236 | tayloro,colorado_final_county_tier_rank,crime_rank,double,silver,"colorado_crime_tier_county_rank, colorado_income_household_tier_rank, 237 | 
colorado_population_crime_per_capita_rank" 238 | tayloro,colorado_final_county_tier_rank,income_rank,integer,silver,"colorado_crime_tier_county_rank, colorado_income_household_tier_rank, 239 | colorado_population_crime_per_capita_rank" 240 | tayloro,colorado_final_county_tier_rank,population_rank,integer,silver,"colorado_crime_tier_county_rank, colorado_income_household_tier_rank, 241 | colorado_population_crime_per_capita_rank" 242 | tayloro,colorado_final_county_tier_rank,final_rank,double,silver,"colorado_crime_tier_county_rank, colorado_income_household_tier_rank, 243 | colorado_population_crime_per_capita_rank" 244 | tayloro,colorado_income_1997_2020,county,varchar,silver,colorado_income_raw 245 | tayloro,colorado_income_1997_2020,incdesc,varchar,silver,colorado_income_raw 246 | tayloro,colorado_income_1997_2020,inctype,integer,silver,colorado_income_raw 247 | tayloro,colorado_income_1997_2020,incsource,integer,silver,colorado_income_raw 248 | tayloro,colorado_income_1997_2020,income,bigint,silver,colorado_income_raw 249 | tayloro,colorado_income_1997_2020,incrank,integer,silver,colorado_income_raw 250 | tayloro,colorado_income_1997_2020,periodyear,integer,silver,colorado_income_raw 251 | tayloro,colorado_income_household_tier_rank,county,varchar,silver,colorado_income_raw 252 | tayloro,colorado_income_household_tier_rank,avg_median_household_income,double,silver,colorado_income_raw 253 | tayloro,colorado_income_household_tier_rank,income_percentile,double,silver,colorado_income_raw 254 | tayloro,colorado_income_household_tier_rank,income_tier,integer,silver,colorado_income_raw 255 | tayloro,colorado_income_raw,stateabbrv,varchar,bronze,source 256 | tayloro,colorado_income_raw,statename,varchar,bronze,source 257 | tayloro,colorado_income_raw,stfips,integer,bronze,source 258 | tayloro,colorado_income_raw,areatyname,varchar,bronze,source 259 | tayloro,colorado_income_raw,areaname,varchar,bronze,source 260 | 
tayloro,colorado_income_raw,areatype,integer,bronze,source 261 | tayloro,colorado_income_raw,area,integer,bronze,source 262 | tayloro,colorado_income_raw,periodyear,integer,bronze,source 263 | tayloro,colorado_income_raw,periodtype,integer,bronze,source 264 | tayloro,colorado_income_raw,pertypdesc,varchar,bronze,source 265 | tayloro,colorado_income_raw,period,integer,bronze,source 266 | tayloro,colorado_income_raw,inctype,integer,bronze,source 267 | tayloro,colorado_income_raw,incdesc,varchar,bronze,source 268 | tayloro,colorado_income_raw,incsource,integer,bronze,source 269 | tayloro,colorado_income_raw,incsrcdesc,varchar,bronze,source 270 | tayloro,colorado_income_raw,income,bigint,bronze,source 271 | tayloro,colorado_income_raw,incrank,integer,bronze,source 272 | tayloro,colorado_income_raw,population,integer,bronze,source 273 | tayloro,colorado_income_raw,releasedate,integer,bronze,source 274 | tayloro,colorado_population_crime_per_capita_rank,county,varchar,silver,"colorado_population_raw, colorado_crimes" 275 | tayloro,colorado_population_crime_per_capita_rank,avg_crime_per_1000,"decimal(24,1)",silver,"colorado_population_raw, colorado_crimes" 276 | tayloro,colorado_population_crime_per_capita_rank,crime_percentile,double,silver,"colorado_population_raw, colorado_crimes" 277 | tayloro,colorado_population_crime_per_capita_rank,crime_per_capita_tier,integer,silver,"colorado_population_raw, colorado_crimes" 278 | tayloro,colorado_population_raw,id,integer,bronze,source 279 | tayloro,colorado_population_raw,county,varchar,bronze,source 280 | tayloro,colorado_population_raw,fipscode,integer,bronze,source 281 | tayloro,colorado_population_raw,year,integer,bronze,source 282 | tayloro,colorado_population_raw,age,integer,bronze,source 283 | tayloro,colorado_population_raw,malepopulation,integer,bronze,source 284 | tayloro,colorado_population_raw,femalepopulation,integer,bronze,source 285 | tayloro,colorado_population_raw,totalpopulation,integer,bronze,source 286 | 
tayloro,colorado_population_raw,datatype,varchar,bronze,source 287 | tayloro,colorado_temp_geocoded_entities,entityid,bigint,silver,"colorado_business_entities, Geoapify API" 288 | tayloro,colorado_temp_geocoded_entities,principaladdress1,varchar,silver,"colorado_business_entities, Geoapify API" 289 | tayloro,colorado_temp_geocoded_entities,principalcity,varchar,silver,"colorado_business_entities, Geoapify API" 290 | tayloro,colorado_temp_geocoded_entities,principalzipcode,varchar,silver,"colorado_business_entities, Geoapify API" 291 | tayloro,colorado_temp_geocoded_entities,full_address,varchar,silver,"colorado_business_entities, Geoapify API" 292 | tayloro,colorado_temp_geocoded_entities,geo_city,varchar,silver,"colorado_business_entities, Geoapify API" 293 | tayloro,colorado_temp_geocoded_entities,geo_county,varchar,silver,"colorado_business_entities, Geoapify API" 294 | tayloro,colorado_temp_geocoded_entities,geo_state,varchar,silver,"colorado_business_entities, Geoapify API" 295 | tayloro,colorado_temp_geocoded_entities,geo_postcode,varchar,silver,"colorado_business_entities, Geoapify API" 296 | tayloro,colorado_total_population_per_county_1997_2020,year,integer,gold,colorado_population_raw 297 | tayloro,colorado_total_population_per_county_1997_2020,county,varchar,gold,colorado_population_raw 298 | tayloro,colorado_total_population_per_county_1997_2020,total_population,bigint,gold,colorado_population_raw 299 | tayloro,coloraodo_county_crime_rate_per_capita,county_name,varchar,silver,"colorado_population_raw, colorado_crimes" 300 | tayloro,coloraodo_county_crime_rate_per_capita,periodyear,integer,silver,"colorado_population_raw, colorado_crimes" 301 | tayloro,coloraodo_county_crime_rate_per_capita,total_crimes,integer,silver,"colorado_population_raw, colorado_crimes" 302 | tayloro,coloraodo_county_crime_rate_per_capita,total_population,integer,silver,"colorado_population_raw, colorado_crimes" 303 | 
tayloro,coloraodo_county_crime_rate_per_capita,crime_per_100k,double,silver,"colorado_population_raw, colorado_crimes" -------------------------------------------------------------------------------- /dags/tayloro/business_entity_dag.py: -------------------------------------------------------------------------------- 1 | from airflow.decorators import dag 2 | from airflow.operators.python import PythonOperator 3 | from datetime import timedelta 4 | from include.eczachly.trino_queries import execute_trino_query, run_trino_query_dq_check 5 | from airflow.models import Variable 6 | import os 7 | import requests 8 | from datetime import datetime 9 | from dotenv import load_dotenv 10 | load_dotenv() # This loads the .env file so that os.getenv() works 11 | 12 | local_script_path = os.path.join("include", 'eczachly/scripts/kafka_read_example.py') 13 | 14 | tabular_credential = Variable.get("TABULAR_CREDENTIAL") 15 | 16 | DATA_COLORADO_GOV_KEY = Variable.get("DATA_COLORADO_GOV_KEY") 17 | GEOAPIFY_KEY = Variable.get("GEOAPIFY_KEY") 18 | 19 | 20 | @dag( 21 | description="A DAG that processes Colorado business entity data daily at 6 AM MT.", 22 | default_args={ 23 | "owner": "Taylor Ortiz", 24 | "retries": 0, 25 | "execution_timeout": timedelta(hours=1), 26 | }, 27 | start_date=datetime(2025, 2, 23), 28 | max_active_runs=1, # Ensures only one active DAG run at a time 29 | schedule_interval="0 13 * * *", # Runs at 13:00 UTC = 6 AM MT (MST) 30 | catchup=False, # Prevents backfilling past dates 31 | tags=["tayloro", "capstone"] 32 | ) 33 | 34 | def business_entity_dag(): 35 | yesterday = '{{ yesterday_ds }}' 36 | schema = 'tayloro' 37 | production_table = f'{schema}.colorado_business_entities' 38 | 39 | # Staging and reference table names (Airflow macros resolve at render time) 40 | business_entities_with_ranking = f"{production_table}_with_ranking" 41 | final_county_tier_rank = f'{schema}.colorado_final_county_tier_rank' 42 | staging_table = 
f"{production_table}_stg_{{{{ yesterday_ds_nodash }}}}" 43 | dupes_staging_table = f"{production_table}_dupes_stg_{{{{ yesterday_ds_nodash }}}}" 44 | dupes_table = f"{production_table}_duplicates" 45 | geo_validation_staging_table = f"{production_table}_geo_validation_stg_{{{{ yesterday_ds_nodash }}}}" 46 | misfit_entities_table = f"{production_table}_misfit_entities" 47 | added_city_county_zip = f"{production_table}_added_city_count_zip_stg_{{{{ yesterday_ds_nodash }}}}" 48 | 49 | geoapify_key = GEOAPIFY_KEY 50 | data_colorado_gov_api_key = DATA_COLORADO_GOV_KEY 51 | 52 | 53 | # Function to transform API data into records 54 | def transform_entity_data_to_records(response): 55 | records = [] 56 | for result in response: 57 | # Extract data while defaulting to None for missing fields 58 | records.append({ 59 | "entityid": result.get("entityid"), 60 | "entityname": result.get("entityname"), 61 | "principaladdress1": result.get("principaladdress1"), 62 | "principaladdress2": result.get("principaladdress2"), 63 | "principalcity": result.get("principalcity"), 64 | "principalstate": result.get("principalstate"), 65 | "principalzipcode": result.get("principalzipcode"), 66 | "principalcountry": result.get("principalcountry"), 67 | "entitystatus": result.get("entitystatus"), 68 | "jurisdictonofformation": result.get("jurisdictonofformation"), 69 | "entitytype": result.get("entitytype"), 70 | "entityformdate": result.get("entityformdate") 71 | }) 72 | return records 73 | 74 | # Function to fetch business entity data for yesterday from the data.colorado.gov API 75 | def fetch_business_entity_data(yesterday, api_token): 76 | print('what is the value of the token? 
', api_token) 77 | print('Grabbing Colorado Business Entity records for ', yesterday) 78 | formatted_date = f"{yesterday}T00:00:00.000" 79 | base_url = f"https://data.colorado.gov/resource/4ykn-tg5h.json?$$app_token={api_token}" 80 | params = { 81 | "entityformdate": formatted_date, 82 | "entitystatus": "Good Standing", 83 | "principalstate": "CO", 84 | } 85 | query_string = "&".join(f"{key}={value.replace(' ', '%20')}" for key, value in params.items()) 86 | # Append with an ampersand (&) instead of a question mark 87 | final_url = f"{base_url}&{query_string}" 88 | print('what is final url? ', final_url) 89 | response = requests.get(final_url) 90 | 91 | if response.status_code == 200 and len(response.json()) > 0: 92 | records = transform_entity_data_to_records(response.json()) 93 | print("Fetched Records:", records) 94 | return records 95 | else: 96 | print(f"No data fetched for {yesterday}") 97 | return [] 98 | def insert_into_staging_table(records, staging_table): 99 | if not records: 100 | print("No records to insert.") 101 | return 102 | 103 | # Build the insert query 104 | insert_query = f""" 105 | INSERT INTO {staging_table} ( 106 | entityid, entityname, principaladdress1, principaladdress2, 107 | principalcity, principalcounty, principalstate, principalzipcode, 108 | principalcountry, entitystatus, jurisdictonofformation, entitytype, entityformdate 109 | ) VALUES 110 | """ 111 | values = [] 112 | for record in records: 113 | # Escape single quotes in strings 114 | entityname = record['entityname'].replace("'", "''") 115 | principaladdress1 = record['principaladdress1'].replace("'", "''") 116 | principaladdress2 = (record.get('principaladdress2') or '').replace("'", "''") 117 | principalcity = record['principalcity'].replace("'", "''") 118 | principalstate = record['principalstate'] 119 | principalzipcode = record['principalzipcode'] 120 | principalcountry = record['principalcountry'].replace("'", "''") 121 | entitystatus = record['entitystatus'].replace("'", 
"''") 122 | jurisdictonofformation = record['jurisdictonofformation'].replace("'", "''") 123 | entitytype = record['entitytype'].replace("'", "''") 124 | 125 | # Convert entityformdate to proper DATE format 126 | raw_date = record['entityformdate'] 127 | try: 128 | # Parse and format the date as YYYY-MM-DD 129 | formatted_date = datetime.strptime(raw_date[:10], '%Y-%m-%d').strftime('%Y-%m-%d') 130 | except ValueError: 131 | print(f"Invalid date format: {raw_date}") 132 | continue 133 | 134 | # Format each record as a SQL value 135 | values.append( 136 | f"({record['entityid']}, '{entityname}', '{principaladdress1}', '{principaladdress2}', " 137 | f"'{principalcity}', NULL, '{principalstate}', '{principalzipcode}', " 138 | f"'{principalcountry}', '{entitystatus}', '{jurisdictonofformation}', '{entitytype}', " 139 | f"DATE '{formatted_date}')" 140 | ) 141 | insert_query += ", ".join(values) 142 | 143 | try: 144 | execute_trino_query(query=insert_query) 145 | print(f"Successfully inserted {len(records)} records into {staging_table}") 146 | except Exception as e: 147 | print(f"Failed to insert records: {e}") 148 | 149 | # Fetch and insert data 150 | def fetch_and_insert_data(yesterday, staging_table, api_token): 151 | print('True') 152 | records = fetch_business_entity_data(yesterday, api_token) 153 | insert_into_staging_table(records, staging_table) 154 | 155 | def transform_geocode_response(response_json, entityid): 156 | """ 157 | Parses the Geoapify API JSON response (which is already a list of features) 158 | and extracts the desired fields. 159 | 160 | Returns a list of tuples with the following structure: 161 | (entityid, geo_city, geo_county, geo_state, geo_postcode, formatted_address) 162 | """ 163 | records = [] 164 | print('what is the value of response_json? 
', response_json) 165 | # Loop directly over the list of feature dictionaries 166 | for feature in response_json: 167 | # Ensure feature is a dict before processing 168 | if not isinstance(feature, dict): 169 | continue 170 | props = feature.get("properties", {}) 171 | # Extract desired fields from properties 172 | geo_city = props.get("city") 173 | geo_county = props.get("county") 174 | geo_state = props.get("state") 175 | geo_postcode = props.get("postcode") 176 | formatted_address = props.get("formatted") 177 | # Append a tuple including the entityid for later reference 178 | records.append((entityid, geo_city, geo_county, geo_state, geo_postcode, formatted_address)) 179 | return records 180 | 181 | def process_and_insert_address_data(aggregated_records, geo_table): 182 | """ 183 | Receives an aggregated list of tuples where each tuple is: 184 | (entityid, geo_city, geo_county, geo_state, geo_postcode, formatted_address) 185 | Builds and executes an INSERT query to load these records into the temporary geo table. 186 | Assumes that the first column (entityid) is numeric (BIGINT) and the rest are strings. 187 | Applies a final check on the zip code (geo_postcode) to ensure it is either a valid 5-digit string 188 | or is blank. 189 | """ 190 | if not aggregated_records: 191 | print("No records to insert from API responses.") 192 | return 193 | 194 | values_list = [] 195 | for rec in aggregated_records: 196 | rec_values = [] 197 | for idx, field in enumerate(rec): 198 | # Final check for the zip code (geo_postcode, index 4) 199 | if idx == 4: 200 | if field is not None: 201 | field = str(field).strip() 202 | if len(field) >= 5: 203 | candidate = field[:5] 204 | # Use candidate if it's all digits; otherwise blank it out. 205 | if candidate.isdigit(): 206 | field = candidate 207 | else: 208 | field = '' 209 | else: 210 | field = '' 211 | # For the first field (entityid), assume it's numeric and do not quote. 
212 | if idx == 0: 213 | rec_values.append(str(field)) 214 | else: 215 | if field is None: 216 | rec_values.append("NULL") 217 | else: 218 | rec_values.append("'" + str(field).replace("'", "''") + "'") 219 | values_list.append("(" + ", ".join(rec_values) + ")") 220 | values_clause = ", ".join(values_list) 221 | 222 | insert_query = f""" 223 | INSERT INTO {geo_table} 224 | (entityid, geo_city, geo_county, geo_state, geo_postcode, formatted_address) 225 | VALUES {values_clause} 226 | """ 227 | execute_trino_query(query=insert_query) 228 | print(f"Inserted {len(aggregated_records)} records into {geo_table}.") 229 | 230 | 231 | def fetch_validated_address(addr, city, state, zip, api_key): 232 | base_url = "https://api.geoapify.com/v1/geocode/search" 233 | url = f"{base_url}?text={addr}, {city} {state} {zip}&apiKey={api_key}" 234 | 235 | response = requests.get(url) 236 | print(f"API call to {url} returned status: {response.status_code}") 237 | 238 | if response.status_code != 200: 239 | print("API call failed with status code", response.status_code) 240 | return None 241 | 242 | data = response.json() 243 | features = data.get("features", []) 244 | 245 | if not features: 246 | print("No features returned in API response.") 247 | return None 248 | 249 | validated_features = [] 250 | for feature in features: 251 | props = feature.get("properties", {}) 252 | # The rank object lives inside the properties 253 | rank = props.get("rank", {}) 254 | match_type = rank.get("match_type") 255 | confidence = rank.get("confidence") 256 | confidence_city = rank.get("confidence_city_level") 257 | confidence_street = rank.get("confidence_street_level") 258 | city = props.get("city") 259 | county = props.get("county") 260 | zip = props.get("postcode") 261 | 262 | # Check each feature's rank details 263 | if (match_type == "full_match" 264 | and confidence == 1 265 | and confidence_city == 1 266 | and confidence_street == 1 267 | and city 268 | and county 269 | and zip): 270 | 
validated_features.append(feature) 271 | 272 | if validated_features: 273 | # If at least one validated feature is found, you can return them. 274 | # For example, you could choose the first one or return all validated features. 275 | print(f"Found {len(validated_features)} validated feature(s).") 276 | return validated_features # or, for example: return validated_features[0] 277 | else: 278 | print("No feature meets the validation criteria.") 279 | return None 280 | 281 | 282 | def process_unmatched_entities_task(staging, geo_table, api_key): 283 | """ 284 | 1. Queries the staging table for records where principalcounty is NULL. 285 | 2. For each record, constructs the full address and calls the address API. 286 | 3. Aggregates the API responses while retaining the entityid. 287 | 4. Inserts the aggregated geocoded data into the temporary geo table. 288 | """ 289 | # Modify the query to select entityid, principaladdress1, and principalstate. 290 | query = f""" 291 | SELECT entityid, principaladdress1, principalcity, principalstate, principalzipcode 292 | FROM {staging} 293 | WHERE principalstate = 'CO' 294 | AND principalcounty IS NULL 295 | """ 296 | results = execute_trino_query(query=query) 297 | print("Unmatched records fetched:", results) 298 | 299 | aggregated_records = [] 300 | for row in results: 301 | entityid = row[0] 302 | address = row[1] 303 | city = row[2] 304 | state = row[3] 305 | zip = row[4] 306 | 307 | # Call the address validation API 308 | api_response = fetch_validated_address(address, city, state, zip, api_key) 309 | if api_response: 310 | # Transform the API response; include the entityid with each record 311 | recs = transform_geocode_response(api_response, entityid) 312 | aggregated_records.extend(recs) 313 | 314 | # Insert the aggregated geocoded records into the temporary geo table 315 | process_and_insert_address_data(aggregated_records, geo_table) 316 | 317 | def check_validated_address_combos(staging, geo_table, 
added_city_county_zip): 318 | """ 319 | Queries the validated address combinations by joining the temporary geo validation table 320 | with the staging table based on entityid. It selects distinct combinations of: 321 | (zip_code, city, county, entityid) 322 | that are not already present in the reference table (colorado_city_county_zip). 323 | 324 | Parameters: 325 | staging (str): Fully resolved name of the staging table. 326 | geo_table (str): Fully resolved name of the temporary geo validation table. 327 | added_city_county_zip (str): Fully resolved name of the staging table that receives new combinations. 328 | Returns: 329 | None. Inserts any new validated combinations into the added_city_county_zip staging table. 330 | """ 331 | query = f""" 332 | SELECT DISTINCT 333 | CAST(g.geo_postcode AS INTEGER) AS zip_code, 334 | g.geo_city AS city, 335 | replace(g.geo_county, ' County', '') AS county, 336 | s.entityid 337 | FROM {geo_table} g 338 | JOIN {staging} s 339 | ON g.entityid = s.entityid 340 | WHERE s.principalstate = 'CO' 341 | AND NOT EXISTS ( 342 | SELECT 1 343 | FROM tayloro.colorado_city_county_zip r 344 | WHERE TRIM(LOWER(r.city)) = TRIM(LOWER(g.geo_city)) 345 | AND TRIM(CAST(r.zip_code AS VARCHAR)) = TRIM(g.geo_postcode) 346 | ) 347 | """ 348 | results = execute_trino_query(query=query) 349 | 350 | if not results: 351 | print("No validated address combinations to insert.") 352 | return 353 | 354 | # Build the INSERT query using zip_code, city, county (ignoring entityid) 355 | values_list = [] 356 | for row in results: 357 | # row structure: [zip_code, city, county, entityid] 358 | zip_code = str(row[0]) # Numeric, so no quotes 359 | city = "'" + str(row[1]).replace("'", "''") + "'" 360 | county = "'" + str(row[2]).replace("'", "''").replace(" County", "") + "'" 361 | values_list.append(f"({zip_code}, {city}, {county})") 362 | 363 | values_clause = ", ".join(values_list) 364 | insert_query = f""" 365 | INSERT INTO {added_city_county_zip} (zip_code, city, county) 366 | VALUES {values_clause} 367 | """ 368 | execute_trino_query(query=insert_query) 369 | print(f"Inserted {len(results)} validated 
address combinations into {added_city_county_zip}.") 370 | 371 | 372 | create_business_entities_yesterday_stage = PythonOperator( 373 | task_id="create_business_entities_yesterday_stage", 374 | python_callable=execute_trino_query, 375 | op_kwargs={ 376 | 'query': f""" 377 | CREATE TABLE IF NOT EXISTS {staging_table} ( 378 | entityid BIGINT, 379 | entityname VARCHAR, 380 | principaladdress1 VARCHAR, 381 | principaladdress2 VARCHAR, 382 | principalcity VARCHAR, 383 | principalcounty VARCHAR, -- Mapped county from cities dataset 384 | principalstate VARCHAR, 385 | principalzipcode VARCHAR, 386 | principalcountry VARCHAR, 387 | entitystatus VARCHAR, 388 | jurisdictonofformation VARCHAR, 389 | entitytype VARCHAR, 390 | entityformdate DATE -- Ensure stored as DATE type 391 | ) 392 | WITH ( 393 | partitioning = ARRAY['principalcounty'] -- Partition by principalcounty 394 | ) 395 | """ 396 | } 397 | ) 398 | 399 | create_misfit_business_entities_table = PythonOperator( 400 | task_id="create_misfit_business_entities_table", 401 | python_callable=execute_trino_query, 402 | op_kwargs={ 403 | 'query': f""" 404 | CREATE TABLE IF NOT EXISTS {misfit_entities_table} ( 405 | entityid BIGINT, 406 | entityname VARCHAR, 407 | principaladdress1 VARCHAR, 408 | principaladdress2 VARCHAR, 409 | principalcity VARCHAR, 410 | principalcounty VARCHAR, -- Mapped county from cities dataset 411 | principalstate VARCHAR, 412 | principalzipcode VARCHAR, 413 | principalcountry VARCHAR, 414 | entitystatus VARCHAR, 415 | jurisdictonofformation VARCHAR, 416 | entitytype VARCHAR, 417 | entityformdate DATE -- Ensure stored as DATE type 418 | ) 419 | """ 420 | } 421 | ) 422 | 423 | create_duplicates_table = PythonOperator( 424 | task_id="create_duplicates_table", 425 | python_callable=execute_trino_query, 426 | op_kwargs={ 427 | 'query': f""" 428 | CREATE TABLE IF NOT EXISTS {dupes_table} ( 429 | entityid BIGINT, 430 | entityname VARCHAR, 431 | principaladdress1 VARCHAR, 432 | principaladdress2 VARCHAR, 
433 | principalcity VARCHAR, 434 | principalcounty VARCHAR, -- Mapped county from cities dataset 435 | principalstate VARCHAR, 436 | principalzipcode VARCHAR, 437 | principalcountry VARCHAR, 438 | entitystatus VARCHAR, 439 | jurisdictonofformation VARCHAR, 440 | entitytype VARCHAR, 441 | entityformdate DATE -- Ensure stored as DATE type 442 | ) 443 | """ 444 | } 445 | ) 446 | 447 | create_dupes_staging_table = PythonOperator( 448 | task_id="create_dupes_staging_table", 449 | python_callable=execute_trino_query, 450 | op_kwargs={ 451 | 'query': f""" 452 | CREATE TABLE IF NOT EXISTS {dupes_staging_table} ( 453 | entityid BIGINT, 454 | entityname VARCHAR, 455 | principaladdress1 VARCHAR, 456 | principaladdress2 VARCHAR, 457 | principalcity VARCHAR, 458 | principalcounty VARCHAR, -- Mapped county from cities dataset 459 | principalstate VARCHAR, 460 | principalzipcode VARCHAR, 461 | principalcountry VARCHAR, 462 | entitystatus VARCHAR, 463 | jurisdictonofformation VARCHAR, 464 | entitytype VARCHAR, 465 | entityformdate DATE -- Ensure stored as DATE type 466 | ) 467 | WITH ( 468 | partitioning = ARRAY['principalcounty'] -- Partition by principalcounty 469 | ) 470 | """ 471 | } 472 | ) 473 | 474 | 475 | create_geo_validation_staging_table = PythonOperator( 476 | task_id="create_geo_validation_staging_table", 477 | python_callable=execute_trino_query, 478 | op_kwargs={ 479 | 'query': f""" 480 | CREATE TABLE IF NOT EXISTS {geo_validation_staging_table} ( 481 | entityid BIGINT, 482 | geo_city VARCHAR, 483 | geo_county VARCHAR, 484 | geo_state VARCHAR, 485 | geo_postcode VARCHAR, 486 | formatted_address VARCHAR 487 | ) 488 | WITH ( 489 | partitioning = ARRAY['geo_county'] 490 | ) 491 | """ 492 | } 493 | ) 494 | 495 | create_addded_city_county_zip_staging_table = PythonOperator( 496 | task_id="create_addded_city_county_zip_staging_table", 497 | python_callable=execute_trino_query, 498 | op_kwargs={ 499 | 'query': f""" 500 | CREATE TABLE IF NOT EXISTS {added_city_county_zip} ( 
501 | zip_code INT, 502 | city VARCHAR, 503 | county VARCHAR 504 | ) 505 | """ 506 | } 507 | ) 508 | 509 | fetch_and_insert_business_entities_yesterday = PythonOperator( 510 | task_id="fetch_and_insert_business_entities_yesterday", 511 | python_callable=fetch_and_insert_data, 512 | op_kwargs={ 513 | 'yesterday': yesterday, 514 | 'staging_table': staging_table, 515 | 'api_token': data_colorado_gov_api_key 516 | } 517 | ) 518 | 519 | 520 | filter_invalid_addresses = PythonOperator( 521 | task_id="filter_invalid_addresses", 522 | python_callable=execute_trino_query, 523 | op_kwargs={ 524 | 'query': f""" 525 | DELETE FROM {staging_table} 526 | WHERE principalcity IS NULL OR principalzipcode IS NULL 527 | """ 528 | } 529 | ) 530 | 531 | reformat_zip_code = PythonOperator( 532 | task_id="reformat_zip_code", 533 | python_callable=execute_trino_query, 534 | op_kwargs={ 535 | 'query': f""" 536 | UPDATE {staging_table} 537 | SET principalzipcode = 538 | CASE 539 | WHEN LENGTH(TRIM(principalzipcode)) >= 5 540 | AND TRY_CAST(SUBSTR(TRIM(principalzipcode), 1, 5) AS INTEGER) IS NOT NULL 541 | THEN SUBSTR(TRIM(principalzipcode), 1, 5) 542 | ELSE '' 543 | END 544 | WHERE principalstate = 'CO' 545 | """ 546 | } 547 | ) 548 | 549 | 550 | 551 | update_county_on_staging_table = PythonOperator( 552 | task_id="update_county_on_staging_table", 553 | python_callable=execute_trino_query, 554 | op_kwargs={ 555 | 'query': f""" 556 | UPDATE {staging_table} 557 | SET principalcounty = ( 558 | SELECT ccz.county 559 | FROM tayloro.colorado_city_county_zip ccz 560 | WHERE TRIM(LOWER({staging_table}.principalcity)) = TRIM(LOWER(ccz.city)) 561 | AND TRIM(CAST({staging_table}.principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 562 | ) 563 | WHERE principalstate = 'CO' 564 | """ 565 | } 566 | ) 567 | 568 | process_unmatched_entities_task = PythonOperator( 569 | task_id="process_unmatched_entities_task", 570 | python_callable=process_unmatched_entities_task, 571 | op_kwargs={ 572 | 
'staging': staging_table, 573 | 'geo_table': geo_validation_staging_table, 574 | 'api_key': geoapify_key 575 | } 576 | ) 577 | 578 | check_validated_address_combos_task = PythonOperator( 579 | task_id="check_validated_address_combos", 580 | python_callable=check_validated_address_combos, 581 | op_kwargs={ 582 | "staging": staging_table, # resolved staging table name 583 | "geo_table": geo_validation_staging_table, # resolved temporary geo validation table name 584 | "added_city_county_zip": added_city_county_zip 585 | } 586 | ) 587 | 588 | insert_validated_address_combos_task = PythonOperator( 589 | task_id="insert_validated_address_combos", 590 | python_callable=execute_trino_query, 591 | op_kwargs={ 592 | 'query': f""" 593 | INSERT INTO tayloro.colorado_city_county_zip (zip_code, city, county) 594 | SELECT a.zip_code, a.city, a.county 595 | FROM {added_city_county_zip} a 596 | WHERE a.zip_code IS NOT NULL 597 | AND a.city IS NOT NULL 598 | AND a.county IS NOT NULL 599 | AND NOT EXISTS ( 600 | SELECT 1 601 | FROM tayloro.colorado_city_county_zip r 602 | WHERE r.zip_code = a.zip_code 603 | AND TRIM(LOWER(r.city)) = TRIM(LOWER(a.city)) 604 | AND TRIM(LOWER(r.county)) = TRIM(LOWER(a.county)) 605 | ) 606 | """ 607 | } 608 | ) 609 | 610 | 611 | override_from_validated_addresses = PythonOperator( 612 | task_id="override_from_validated_addresses", 613 | python_callable=execute_trino_query, 614 | op_kwargs={ 615 | 'query': f""" 616 | UPDATE tayloro.colorado_business_entities 617 | SET 618 | principalcity = ( 619 | SELECT geo_city 620 | FROM tayloro.colorado_temp_geocoded_entities 621 | WHERE entityid = tayloro.colorado_business_entities.entityid 622 | ), 623 | principalzipcode = ( 624 | SELECT geo_postcode 625 | FROM tayloro.colorado_temp_geocoded_entities 626 | WHERE entityid = tayloro.colorado_business_entities.entityid 627 | ), 628 | principalcounty = ( 629 | SELECT geo_county 630 | FROM tayloro.colorado_temp_geocoded_entities 631 | WHERE entityid = 
tayloro.colorado_business_entities.entityid 632 | ) 633 | WHERE entityid IN ( 634 | SELECT entityid 635 | FROM tayloro.colorado_temp_geocoded_entities 636 | ) 637 | AND ( 638 | principalcity <> ( 639 | SELECT geo_city 640 | FROM tayloro.colorado_temp_geocoded_entities 641 | WHERE entityid = tayloro.colorado_business_entities.entityid 642 | ) 643 | OR principalzipcode <> ( 644 | SELECT geo_postcode 645 | FROM tayloro.colorado_temp_geocoded_entities 646 | WHERE entityid = tayloro.colorado_business_entities.entityid 647 | ) 648 | OR principalcounty <> ( 649 | SELECT geo_county 650 | FROM tayloro.colorado_temp_geocoded_entities 651 | WHERE entityid = tayloro.colorado_business_entities.entityid 652 | ) 653 | ) 654 | """ 655 | } 656 | ) 657 | 658 | retry_update_county_on_staging_table = PythonOperator( 659 | task_id="retry_update_county_on_staging_table", 660 | python_callable=execute_trino_query, 661 | op_kwargs={ 662 | 'query': f""" 663 | UPDATE {staging_table} 664 | SET principalcounty = ( 665 | SELECT ccz.county 666 | FROM tayloro.colorado_city_county_zip ccz 667 | WHERE TRIM(LOWER({staging_table}.principalcity)) = TRIM(LOWER(ccz.city)) 668 | AND TRIM(CAST({staging_table}.principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 669 | ) 670 | WHERE principalstate = 'CO' 671 | AND principalcounty IS NULL 672 | """ 673 | } 674 | ) 675 | 676 | identify_duplicates_task = PythonOperator( 677 | task_id="identify_duplicates_task", 678 | python_callable=execute_trino_query, 679 | op_kwargs={ 680 | 'query': f""" 681 | INSERT INTO {dupes_staging_table} 682 | SELECT DISTINCT 683 | s.entityid, 684 | s.entityname, 685 | s.principaladdress1, 686 | s.principaladdress2, 687 | s.principalcity, 688 | s.principalcounty, 689 | s.principalstate, 690 | s.principalzipcode, 691 | s.principalcountry, 692 | s.entitystatus, 693 | s.jurisdictonofformation, 694 | s.entitytype, 695 | s.entityformdate 696 | FROM {staging_table} s 697 | INNER JOIN {production_table} p 698 | ON 
s.entityname = p.entityname 699 | AND s.principaladdress1 = p.principaladdress1 700 | AND s.principaladdress2 = p.principaladdress2 701 | AND s.principalcity = p.principalcity 702 | AND s.principalcounty = p.principalcounty 703 | AND s.principalstate = p.principalstate 704 | AND s.principalzipcode = p.principalzipcode 705 | AND s.principalcountry = p.principalcountry 706 | AND s.entitystatus = p.entitystatus 707 | AND s.jurisdictonofformation = p.jurisdictonofformation 708 | AND s.entitytype = p.entitytype 709 | AND s.entityformdate = p.entityformdate 710 | """ 711 | } 712 | ) 713 | 714 | run_dq_check_null_values = PythonOperator( 715 | task_id="run_dq_check_null_values", 716 | python_callable=run_trino_query_dq_check, 717 | op_kwargs={ 718 | 'query': f""" 719 | SELECT COUNT(*) 720 | FROM {staging_table} 721 | WHERE entityname IS NULL 722 | OR principalcity IS NULL 723 | OR entitytype IS NULL 724 | OR entityformdate IS NULL 725 | """ 726 | } 727 | ) 728 | 729 | run_dq_check_principalstate = PythonOperator( 730 | task_id="run_dq_check_principalstate", 731 | python_callable=run_trino_query_dq_check, 732 | op_kwargs={ 733 | 'query': f""" 734 | SELECT COUNT(*) 735 | FROM {staging_table} 736 | WHERE principalstate <> 'CO' OR principalstate IS NULL 737 | """ 738 | } 739 | ) 740 | 741 | run_dq_check_entityformdate = PythonOperator( 742 | task_id="run_dq_check_entityformdate", 743 | python_callable=run_trino_query_dq_check, 744 | op_kwargs={ 745 | 'query': f""" 746 | SELECT COUNT(*) 747 | FROM {staging_table} 748 | WHERE entityformdate <> DATE('{yesterday}') OR entityformdate IS NULL 749 | """ 750 | } 751 | ) 752 | 753 | run_dq_check_no_unexpected_duplicates = PythonOperator( 754 | task_id="run_dq_check_no_unexpected_duplicates", 755 | python_callable=run_trino_query_dq_check, 756 | op_kwargs={ 757 | 'query': f""" 758 | SELECT entityid, COUNT(*) AS count 759 | FROM {staging_table} 760 | WHERE entityid NOT IN ( 761 | SELECT entityid FROM {dupes_staging_table} 762 | ) 763 | GROUP 
BY entityid 764 | HAVING COUNT(*) > 1 765 | """ 766 | } 767 | ) 768 | 769 | insert_new_records_task = PythonOperator( 770 | task_id="insert_new_records_task", 771 | python_callable=execute_trino_query, 772 | op_kwargs={ 773 | 'query': f""" 774 | INSERT INTO {production_table} ( 775 | entityid, entityname, principaladdress1, principaladdress2, principalcity, 776 | principalcounty, principalstate, principalzipcode, principalcountry, entitystatus, 777 | jurisdictonofformation, entitytype, entityformdate 778 | ) 779 | SELECT 780 | entityid, entityname, principaladdress1, principaladdress2, principalcity, 781 | principalcounty, principalstate, principalzipcode, principalcountry, entitystatus, 782 | jurisdictonofformation, entitytype, entityformdate 783 | FROM {staging_table} 784 | WHERE entityid NOT IN ( 785 | SELECT entityid FROM {dupes_staging_table} 786 | ) 787 | AND principalcounty IS NOT NULL 788 | """ 789 | } 790 | ) 791 | 792 | insert_business_entities_with_ranking = PythonOperator( 793 | task_id="insert_business_entities_with_ranking", 794 | python_callable=execute_trino_query, 795 | op_kwargs={ 796 | 'query': f""" 797 | INSERT INTO {business_entities_with_ranking} 798 | SELECT 799 | stg.*, 800 | fr.crime_rank, 801 | fr.income_rank, 802 | fr.population_rank, 803 | fr.final_rank 804 | FROM {staging_table} stg 805 | LEFT JOIN {final_county_tier_rank} fr 806 | ON LOWER(stg.principalcounty) = LOWER(fr.county) 807 | WHERE stg.entityid NOT IN ( 808 | SELECT entityid FROM {dupes_staging_table} 809 | ) 810 | AND stg.principalcounty IS NOT NULL 811 | """ 812 | } 813 | ) 814 | 815 | 816 | insert_duplicates = PythonOperator( 817 | task_id="insert_duplicates", 818 | python_callable=execute_trino_query, 819 | op_kwargs={ 820 | 'query': f""" 821 | INSERT INTO {dupes_table} 822 | SELECT 823 | entityid, 824 | entityname, 825 | principaladdress1, 826 | principaladdress2, 827 | principalcity, 828 | principalcounty, 829 | principalstate, 830 | principalzipcode, 831 | 
principalcountry, 832 | entitystatus, 833 | jurisdictonofformation, 834 | entitytype, 835 | entityformdate 836 | FROM {dupes_staging_table} 837 | """ 838 | } 839 | ) 840 | 841 | insert_misfits = PythonOperator( 842 | task_id="insert_misfits", 843 | python_callable=execute_trino_query, 844 | op_kwargs={ 845 | 'query': f""" 846 | INSERT INTO {misfit_entities_table} 847 | SELECT 848 | entityid, 849 | entityname, 850 | principaladdress1, 851 | principaladdress2, 852 | principalcity, 853 | principalcounty, 854 | principalstate, 855 | principalzipcode, 856 | principalcountry, 857 | entitystatus, 858 | jurisdictonofformation, 859 | entitytype, 860 | entityformdate 861 | FROM {staging_table} 862 | WHERE principalcounty IS NULL 863 | """ 864 | } 865 | ) 866 | 867 | drop_staging_table_task = PythonOperator( 868 | task_id="drop_staging_table_task", 869 | python_callable=execute_trino_query, 870 | op_kwargs={ 871 | 'query': f"DROP TABLE IF EXISTS {staging_table}" 872 | } 873 | ) 874 | 875 | drop_dupes_staging_table_task = PythonOperator( 876 | task_id="drop_dupes_staging_table_task", 877 | python_callable=execute_trino_query, 878 | op_kwargs={ 879 | 'query': f"DROP TABLE IF EXISTS {dupes_staging_table}" 880 | } 881 | ) 882 | 883 | drop_geo_validation_staging_table = PythonOperator( 884 | task_id="drop_geo_validation_staging_table", 885 | python_callable=execute_trino_query, 886 | op_kwargs={ 887 | 'query': f"DROP TABLE IF EXISTS {geo_validation_staging_table}" 888 | } 889 | ) 890 | 891 | drop_added_city_county_zip_staging_table = PythonOperator( 892 | task_id="drop_added_city_county_zip_staging_table", 893 | python_callable=execute_trino_query, 894 | op_kwargs={ 895 | 'query': f"DROP TABLE IF EXISTS {added_city_county_zip}" 896 | } 897 | ) 898 | 899 | ( 900 | create_business_entities_yesterday_stage 901 | >> create_misfit_business_entities_table 902 | >> create_dupes_staging_table 903 | >> create_duplicates_table 904 | >> create_geo_validation_staging_table 905 | >> 
create_addded_city_county_zip_staging_table 906 | >> fetch_and_insert_business_entities_yesterday 907 | >> filter_invalid_addresses 908 | >> reformat_zip_code 909 | >> update_county_on_staging_table 910 | >> process_unmatched_entities_task 911 | >> check_validated_address_combos_task 912 | >> insert_validated_address_combos_task 913 | >> override_from_validated_addresses 914 | >> retry_update_county_on_staging_table 915 | >> identify_duplicates_task 916 | >> run_dq_check_null_values 917 | >> run_dq_check_principalstate 918 | >> run_dq_check_entityformdate 919 | >> run_dq_check_no_unexpected_duplicates 920 | >> insert_new_records_task 921 | >> insert_business_entities_with_ranking 922 | >> insert_duplicates 923 | >> insert_misfits 924 | >> drop_staging_table_task 925 | >> drop_dupes_staging_table_task 926 | >> drop_geo_validation_staging_table 927 | >> drop_added_city_county_zip_staging_table 928 | ) 929 | 930 | 931 | business_entity_dag() 932 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
Screenshot 2025-02-27 at 6 01 10 PM

A data engineering capstone project that builds a program for Colorado's OEDIT (Office of Economic Development and International Trade) to allocate security system subsidies to businesses using historical crime, population, and income data trends from 1997-2020.

### About

> 🟢 **UPDATE –** On 5/22/25, Zach informed me that my Data Engineering capstone was selected as **Best Capstone** in the bootcamp! His feedback is below:

---

> **“Taylor Ortiz’s capstone project, which won Best Capstone in our January 2025 Data Engineering bootcamp, demonstrated exceptional skills in data processing, quality validation, and solving real-world business challenges. His pipeline effectively addressed entity search and subsidy classification while handling data inconsistencies and deduplication. The well-structured documentation, including DAG runs and dashboards, highlighted his strong technical abilities and clear communication. Taylor would be a valuable asset to any data engineering team.”**
>
> — *Zach Wilson - dataexpert.io*

My name is Taylor Ortiz and I enrolled in Zach Wilson's Dataexpert.io Data Engineering bootcamp as part of the January 2025 cohort. Part of the requirements for achieving the highest certification in the bootcamp is to complete a capstone project that showcases the skills attained during the bootcamp and the general skills that I possess as a data engineer and data architect. The following capstone business case, KPIs, and use cases are fictional and made up by me for the sake of the exercise (although I think it has merit for an actual State of Colorado OEDIT program one day). The technology stack used was built entirely from scratch using publicly accessible data from [data.colorado.gov](https://data.colorado.gov) and [CDPHE Open Data](https://data-cdphe.opendata.arcgis.com/). Enjoy!
### Features

* 9,048,771 rows of source data extracted
* A 28-task DAG (Directed Acyclic Graph) using Apache Airflow data pipeline orchestration, running in an Astronomer cloud production environment
* Data visualization of 16 required KPIs and use cases, as well as overall Colorado crime density, using Grafana Labs dashboards
* A user experience for Colorado business entities to search for their business, understand the tier structure, and discover what subsidy tier their business qualifies for using Grafana Labs dashboards
* Apache Spark jobs with Amazon Web Services (AWS) and the Geoapify Geocoding API
* Medallion architecture data design patterns
* Comprehensive architecture diagram
* Comprehensive data dictionary displaying all data sources and manipulations used

## Table of Contents
1. [Capstone Requirements](#capstone-requirements)
2. [Project Overview](#project-overview)
   1. [Problem Statement](#problem-statement)
   2. [KPIs and Use Cases](#kpis-and-use-cases)
      1. [County Level](#county-level)
      2. [City Level](#city-level)
   3. [Data Sources](#data-sources)
   4. [Data Tables](#data-tables)
   5. [Technologies Used](#technologies-used)
   6. [Architecture](#architecture)
3. [Business Entity Tier Ranking](#business-entity-tier-ranking)
   1. [Subsidy Tiers](#subsidy-tiers)
   2. [How Tiers Are Calculated and Assigned](#how-tiers-are-calculated-and-assigned)
      1. [Step 1: Identify Criteria](#step-1-identify-criteria)
      2. [Step 2: Establish Individual Rankings](#step-2-establish-individual-rankings)
      3. [Step 3: Merge Individual Crime Rankings to Form a Unified Crime Tier County Rank](#merge-individual-crime-rankings)
      4. [Step 4: Merge Rankings for Final County Tier Rank](#final-county-tier)
      5. [Step 5: Backfill Business Entities Table with Final Ranking](#backfill-business-entities)
4. [KPI and Use Case Visualizations](#kpi-and-use-case-visualizations)
   1. [Colorado County Dashboard](#colorado-county-dashboard)
   2. [Colorado City Dashboard](#colorado-city-dashboard)
   3. [Colorado Crime Density Dashboard](#colorado-crime-density-dashboard)
5. [Business Entity Data Pipeline](#business-entity-data-pipeline)
   1. [Business Entity Daily DAG](#business-entity-daily-dag)
   2. [Task Descriptions](#task-descriptions)
   3. [DAG Flow Order](#dag-flow-order)
   4. [Astronomer Cloud Deployment](#astronomer-cloud-deployment)
6. [Putting It All Together](#putting-it-all-together)
   1. [Business Entity Search Dashboard](#business-entity-search-dashboard)
7. [Challenges and Findings](#challenges-and-findings)
   1. [In-depth Summary of Business Entities Data Cleaning and Update Process](#business-entity-data-cleaning)
   2. [Out-of-Memory (OOM) Exceptions and Spark Configuration Adjustments](#out-of-memory-spark)
   3. [Challenge: Ambiguous City and County Matching](#ambiguous-matching)
   4. [Inconsistent Zip Code Formats Causing Casting Errors](#inconsistent-zip-format)
8. [Next Steps and Closing Thoughts](#next-steps-and-closing-thoughts)

## Capstone Requirements

* Identify a problem you'd like to solve.
* Scope the project and choose datasets.
* Use at least 2 different data sources and formats (e.g., CSV, JSON, APIs), with at least 1 million rows.
* Define your use case: analytics table, relational database, dashboard, etc.
* Choose your tech stack and justify your choices.
* Explore and assess the data.
* Identify and resolve data quality issues.
* Document your data cleaning steps.
* Create a data model diagram and dictionary.
* Build ETL pipelines to transform and load the data.
* Run pipelines and perform data quality checks.
* Visualize or present the data through dashboards, applications, or reports.
* Your data pipeline must be running in a production cloud environment.

## Project Overview

### Problem Statement

* The State of Colorado has hired your data firm to develop internal back-office visualizations and a front-facing business user experience to support a critical initiative for the Office of Economic Development and International Trade (OEDIT). The program, B.A.S.E. (Business Assistance for Security Enhancements), is designed to evaluate and qualify businesses across the state for security system subsidies to protect their business based on historical crime trends, income, and population data by city and county. Historical data from 1997 to 2020, as well as new data extracted daily, will be analyzed to determine which security tiers businesses qualify for.
* Additionally, the State of Colorado has sixteen KPIs and use cases they would like to see visualized out of the extracted, transformed, and aggregated data to assist them in planning, funding, and outreach initiatives.
* This tool will allow OEDIT to automatically notify active businesses in good standing about the subsidies they qualify for, streamlining the application process for business owners seeking to participate. Lastly, the State requires the creation of an intuitive user experience that enables businesses to search for their assigned tier, providing them with easy access to their eligibility information.

### KPIs and Use Cases
<details>
<summary>County Level KPIs and Use Cases</summary>

1. **KPI:** Set a goal for each police agency within a given county to reduce crime by 5% within the next fiscal year, based on a baseline crime count from historical data.
   - **Use Case:** Calculate crime count for police agencies in a given county to get a baseline number.

2. **KPI:** Bring 100% free self-defense and resident safety programs to lower-income counties.
   - **Use Case:** Calculate the average median household income per county.

3. **KPI:** Evaluate additional police agency support required in counties that show a correlation between rising population and rising crime.
   - **Use Case:** Show population trends for each county alongside corresponding crime trends by year.

4. **KPI:** Identify counties with an average annual population growth rate exceeding 10% from 1997–2020, and adjust state support allocations to improve funding alignment for resident programs.
   - **Use Case:** Show population trends for each county from 1997–2020.

5. **KPI:** Inform police agencies of which crime categories are most prevalent in their respective county.
   - **Use Case:** Calculate the totals for each crime category per county from 1997 to 2020 to gauge frequency.

6. **KPI:** Enable police agencies to be proactive by identifying which months of the year have a higher volume of crime.
   - **Use Case:** Show average seasonal crime trends for each month by county.

7. **KPI:** Provide an interactive visual representation of crime density across counties for the entire state.
   - **Use Case:** Create a geo-map of crime density for counties from 1997–2020.

8. **KPI:** Drive further marketing outreach for supportive safety programs in lower-income counties.
   - **Use Case:** Show average income per capita for counties.

9. **KPI:** Display crime rates compared to median household income to pinpoint high-risk areas.
   - **Use Case:** Compare crime data with median household income for each county.

10. **KPI:** Illustrate crime type distribution by county by Property, Person, and Society to better understand where to best allocate safety resources.
    - **Use Case:** Analyze the distribution of different crime types across each county from 1997–2020 for Property, Person, and Society crimes.

</details>
<details>
<summary>City Level KPIs and Use Cases</summary>

1. **KPI:** Deploy an interactive dashboard displaying seasonal crime trends for each city to detect seasonal peaks to guide targeted patrol planning.
   - **Use Case:** Show seasonal crime trends for the year in each city.

2. **KPI:** Identify and report the top three crime categories most prevalent during daytime (6 AM–6 PM) in each city to optimize resource allocation during peak hours.
   - **Use Case:** Determine which crimes are more likely to happen during the day for a city.

3. **KPI:** Calculate the average age of individuals involved in each crime category across cities to support the development of tailored intervention programs.
   - **Use Case:** Compute the average age for crime categories across cities.

4. **KPI:** Identify and report the top three crime categories most common at night (6 PM–6 AM) in each city to inform optimized night patrol scheduling.
   - **Use Case:** Determine which crimes are more likely to happen at night for a city.

5. **KPI:** Develop a dynamic visualization that shows the percentage distribution of crime types by city by Property, Person, and Society to support targeted law enforcement initiatives.
   - **Use Case:** Display crime type distribution by city.

6. **KPI:** Provide an analysis dashboard showing average crime trends by day of the week for each city, highlighting peak crime days to drive strategic patrol scheduling.
   - **Use Case:** Show crime trends on average by day of the week to determine when to patrol more.

</details>
### Data Sources

- [Crimes in Colorado (2016-2020)](https://data.colorado.gov/Public-Safety/Crimes-in-Colorado/j6g4-gayk/about_data): Offenses in Colorado for 2016 through 2020 by Agency from the FBI's Crime Data Explorer.
  - Number of rows: 3.1M
- [Crimes in Colorado (1997-2015)](https://data.colorado.gov/Public-Safety/Crimes-in-Colorado-1997-to-2015/6vnq-az4b/about_data): Crime stats for the State of Colorado from 1997 to 2015. Data provided by the CDPS and the FBI's Crime Data Explorer (CDE).
  - Number of rows: 4.95M
- [Personal Income in Colorado](https://data.colorado.gov/Labor-and-Employment/Personal-Income-in-Colorado/2cpa-vbur/about_data): Income (per capita or total) for each county by year with rank and population. From the Colorado Department of Labor and Employment (CDLE), since 1969.
  - Number of rows: 10k
- [Population Projections in Colorado](https://data.colorado.gov/Demographics/Population-Projections-in-Colorado/q5vp-adf3/about_data): Actual and predicted population data by gender and age from the Department of Local Affairs (DOLA), from 1990 to 2040.
  - Number of rows: 382k
- [Business Entities in Colorado](https://data.colorado.gov/Business/Business-Entities-in-Colorado/4ykn-tg5h/about_data): Colorado Business Entities (corporations, LLCs, etc.) registered with the Colorado Department of State (CDOS) since 1864.
  - Number of rows: 2.81M
- [Colorado County Boundaries](https://data-cdphe.opendata.arcgis.com/datasets/CDPHE::colorado-county-boundaries/about): This feature class contains county boundaries for all 64 Colorado counties and 2010 US Census attributes data describing the population within each county.
  - Number of rows: 64

### Data Tables
<details>
<summary>Bronze Layer (raw) data</summary>

1. **colorado_business_entities_raw**
   - **Number of rows:** 984,368
2. **colorado_city_county_zip_raw**
   - **Number of rows:** 760
3. **colorado_county_coordinates_raw**
   - **Number of rows:** 64
4. **colorado_crimes_1997_2015_raw**
   - **Number of rows:** 4,952,282
5. **colorado_crimes_2016_2020_raw**
   - **Number of rows:** 3,101,365
6. **colorado_income_raw**
   - **Number of rows:** 9,932

</details>
<details>
<summary>Silver Layer (transform) data</summary>

1. **colorado_business_entities**
   - **Number of rows:** 817,324 (grows via the Airflow pipeline)
2. **colorado_business_entities_duplicates**
   - **Number of rows:** 721 (grows via the Airflow pipeline)
3. **colorado_business_entities_misfit_entities**
   - **Number of rows:** 71 (grows via the Airflow pipeline)
4. **colorado_business_entities_with_ranking**
   - **Number of rows:** 817,324 (grows via the Airflow pipeline)
5. **colorado_crime_population_trends**
   - **Number of rows:** 1,536
6. **colorado_crime_tier_arson**
   - **Number of rows:** 72
7. **colorado_crime_tier_burglary**
   - **Number of rows:** 79
8. **colorado_crime_tier_county_rank**
   - **Number of rows:** 79
9. **colorado_crime_tier_larceny_theft**
   - **Number of rows:** 79
10. **colorado_crime_tier_property_destruction**
    - **Number of rows:** 79
11. **colorado_crime_tier_robbery**
    - **Number of rows:** 70
12. **colorado_crime_tier_stolen_property**
    - **Number of rows:** 75
13. **colorado_crime_tier_vehicle_theft**
    - **Number of rows:** 79
14. **colorado_crimes**
    - **Number of rows:** 8,053,647
15. **colorado_crimes_2016_2020_with_cities**
    - **Number of rows:** 2,771,984
16. **colorado_crimes_with_cities**
    - **Number of rows:** 7,724,266
17. **colorado_final_county_tier_rank**
    - **Number of rows:** 64
18. **colorado_income_1997_2020**
    - **Number of rows:** 4,604
19. **colorado_income_household_tier_rank**
    - **Number of rows:** 64
20. **colorado_population_raw**
    - **Number of rows:** 381,504
21. **colorado_temp_geocoded_entities**
    - **Number of rows:** 2,033

</details>
246 | Gold Layer (aggregate) data 247 | 248 | 1. **colorado_avg_age_per_crime_category_1997_2020** 249 | - **Number of rows:** 3,154 250 | 2. **colorado_city_crime_counts_1997_2020** 251 | - **Number of rows:** 182 252 | 3. **colorado_city_crime_time_likelihood** 253 | - **Number of rows:** 3,206 254 | 4. **colorado_city_seasonal_crime_rates_1997_2020** 255 | - **Number of rows:** 2,196 256 | 5. **colorado_county_agency_crime_counts** 257 | - **Number of rows:** 526 258 | 6. **colorado_county_average_income_1997_2020** 259 | - **Number of rows:** 64 260 | 7. **colorado_county_crime_per_capita_1997_2020** 261 | - **Number of rows:** 64 262 | 8. **colorado_county_crime_per_capita_with_coordinates** 263 | - **Number of rows:** 64 264 | 9. **colorado_county_crime_vs_population_1997_2020** 265 | - **Number of rows:** 1,665 266 | 10. **colorado_county_seasonal_crime_rates_1997_2020** 267 | - **Number of rows:** 911 268 | 11. **colorado_crime_category_totals_per_county_1997_2020** 269 | - **Number of rows:** 1,520 270 | 12. **colorado_crime_type_distribution_by_city** 271 | - **Number of rows:** 549 272 | 13. **colorado_crime_vs_median_household_income_1997_2020** 273 | - **Number of rows:** 1,536 274 | 14. **colorado_population_crime_per_capita_rank** 275 | - **Number of rows:** 64 276 | 15. **colorado_total_population_per_county_1997_2020** 277 | - **Number of rows:** 1,536 278 | 16. **coloraodo_county_crime_rate_per_capita** 279 | - **Number of rows:** 1,458 280 |
281 | 282 | Access the entire [capstone data dictionary](https://github.com/taylor-ortiz/dataexpert-data-engineering-capstone/blob/main/Capstone-data-dictionary.csv) for more detailed info on these datasets used. 283 | 284 | ### Technologies Used 285 | - **Python:** Programming language used for Spark jobs with AWS and Iceberg loads, Geoapify APIs and python callables in Airflow orchestration 286 | - **SQL:** fundamental for querying, transforming, and aggregating structured data. 287 | - **Apache Airflow:** orchestration tool that manages, schedules and automates the daily busines entity extract from the Colorado Information Marketplace 288 | - **AWS:** S3 bucket storage for CSV extracts 289 | - **Tabular:** data lake operations 290 | - **Trino:** distributed SQL query engine for medallion architecture 291 | - **Geoapify:** geocoding api for address validation of business entities 292 | - **Grafana:** dashboards for data visualizations of KPIs, use cases and business user experience 293 | - **Astronomer:** data orchestraton platform that provides a managed service for Apache Airflow orchestrations in the cloud 294 | - **Apache Spark:** open source distributed computing system for processing source extracts from AWS to Iceberg and backfilling / verifying address data with Geoapify 295 | 296 | ### Architecture 297 | ![B A S E Future State Diagram (4)](https://github.com/user-attachments/assets/78f8aaf2-af71-44d2-a08c-52d9b0d766da) 298 | 299 | ## Business Entity Tier Ranking 300 | 301 | ### Subsidy Tiers 302 | 303 | Tiers are represented as a range of 1 through 4 in the B.A.S.E. program. 1 indicates the lowest level need and 4 indicates the highest level need for security system subsidies. Each subsidy tier below offers a gradual increase of security system services based on the tier that your business qualifies for. 
The idea is that tiers will be backfilled and assigned on the source business entities dataset and as new business entities are added daily to the Colorado Information Marketplace, those records will also be assigned a tier through the Airflow orchestration that we will cover below. However, how are tiers actually calculated and assigned? Read on! 304 | 305 | Screenshot 2025-02-28 at 4 44 20 PM 306 | 307 | ### How Tiers Are Calculated and Assigned 308 | 309 |
<details>
<summary>Step 1: Identify Criteria</summary>

- **Crimes:**
  Out of all the crime categories that exist, I chose 7 that closely resemble property-related or adjacent crimes that would factor into the ranking:
  - Destruction/Damage/Vandalism of Property
  - Burglary/Breaking & Entering
  - Larceny/Theft Offenses
  - Motor Vehicle Theft
  - Robbery
  - Arson
  - Stolen Property Offenses

- **Population:**
  Identify crime per capita (per 1,000 residents) for all counties between the years of 1997 and 2020.

- **Income:**
  Identify median household income for all counties between the years of 1997 and 2020.

</details>
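The population criterion above reduces to a simple normalization of crime counts by residents. A minimal sketch of that calculation (the function name and sample numbers are mine for illustration, not taken from the pipeline):

```python
def crime_rate_per_1000(crime_count: int, population: int) -> float:
    """Return crime incidents per 1,000 residents for one county-year."""
    if population <= 0:
        raise ValueError("population must be positive")
    return crime_count / population * 1000

# Hypothetical county-year: 4,200 qualifying incidents among 120,000 residents
rate = crime_rate_per_1000(4200, 120000)  # 35.0 incidents per 1,000 residents
```

In the pipeline this happens in SQL by joining the crime tables to the population tables, but the per-1,000 scaling is the same.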
<details>
<summary>Step 2: Establish Individual Rankings</summary>

For each metric, we transform raw data into a standardized ranking by following a similar process:

#### Crime Categories
For each of the 7 selected crime categories:
- **Aggregate Data:**
  Count the number of incidents per county.
- **Compute Percentile Ranks:**
  Use a percentile function (e.g., `PERCENT_RANK()`) to determine each county’s standing relative to others.
- **Assign Tiers:**
  Counties in the top 25% (percentile >= 0.75) receive an assignment of 4; counties at or above the 50th percentile but below the 75th receive a 3; counties at or above the 25th but below the 50th receive a 2; and counties below the 25th percentile receive a 1.

#### Population-Adjusted Crime (Crime Per Capita)
We adjust raw crime counts by county population to calculate the crime rate per 1,000 residents. This involves:
- **Calculating Yearly Crime Rates:**
  Join crime data with population figures.
- **Averaging:**
  Compute an average crime rate per county over the selected years.
- **Ranking:**
  The same rubric applies: the top 25% receive a 4, the 50th–75th percentile band receives a 3, the 25th–50th band receives a 2, and the bottom 25% receive a 1.

#### Income
For median household income, we:
- **Aggregate Income Data:**
  Average the income per county across the years.
- **Compute Percentile Ranks:**
  Establish how each county compares to others.
- **Assign Tiers:**
  Income is the inverse of the previous two. We actually want to rank the highest-income counties the lowest because they would require less subsidy assistance than lower-income counties. Therefore, counties in the top 25% receive an assignment of 1, the 50th–75th percentile band receives a 2, the 25th–50th band receives a 3, and the bottom 25% receive a 4.

Below is an example distribution. While this is not a perfect distribution, I was pretty happy with the result from a public dataset. With a mostly even distribution based on percentage, we can start assigning tiers based on the rubric mentioned above.

Screenshot 2025-02-28 at 5 24 50 PM
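The percentile-to-tier rubric, including its inversion for income, can be sketched in a few lines of Python. This is an illustration of the rules as stated above, not code from the pipeline:

```python
def tier_from_percentile(p: float, inverse: bool = False) -> int:
    """Map a PERCENT_RANK() value in [0, 1] to a subsidy tier 1-4.

    For crime and crime per capita, a high percentile means high need (tier 4).
    For income the scale is inverted: high-income counties get tier 1.
    """
    if p >= 0.75:
        tier = 4
    elif p >= 0.50:
        tier = 3
    elif p >= 0.25:
        tier = 2
    else:
        tier = 1
    return 5 - tier if inverse else tier

tier_from_percentile(0.90)                # 4: a top-25% crime county
tier_from_percentile(0.90, inverse=True)  # 1: a top-25% income county
```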
Below is an example query for evaluating the Destruction/Damage/Vandalism of Property crime category and assigning a tier to counties:

```sql
CREATE TABLE tayloro.colorado_crime_tier_destruction_property AS
WITH crime_aggregated AS (
    -- Step 1: Aggregate property destruction crime counts per county
    SELECT
        county,
        COUNT(*) AS total_property_destruction_crimes
    FROM academy.tayloro.colorado_crimes
    WHERE offense_category_name = 'Destruction/Damage/Vandalism of Property'
    GROUP BY county
),
crime_percentiles AS (
    -- Step 2: Compute percentile rank for property destruction crime counts per county
    SELECT
        county,
        total_property_destruction_crimes,
        PERCENT_RANK() OVER (ORDER BY total_property_destruction_crimes) AS crime_percentile
    FROM crime_aggregated
),
crime_tiers AS (
    -- Step 3: Assign tiers based on percentile rankings
    SELECT
        county,
        total_property_destruction_crimes,
        crime_percentile,
        CASE
            WHEN crime_percentile >= 0.75 THEN 1 -- Counties with the highest counts (Top 25%)
            WHEN crime_percentile >= 0.50 THEN 2 -- Next 25% (50%-75%)
            WHEN crime_percentile >= 0.25 THEN 3 -- Next 25% (25%-50%)
            ELSE 4                               -- Counties with the lowest counts (Bottom 25%)
        END AS crime_tier
    FROM crime_percentiles
)
SELECT * FROM crime_tiers;
```

Screenshot 2025-02-28 at 9 07 04 PM

</details>
411 | Step 3: Merge Individual Crime Rankings to Form a Unified Crime Tier County Rank 412 |
413 | 414 | After calculating individual crime tiers for each of the 7 selected crime categories, the next step is to merge these rankings into a single unified score per county. This unified score, referred to as the **overall crime tier**, is computed by averaging the individual tiers for each county and then rounding the result to the nearest whole number. The process involves the following steps: 415 | 416 | - **Unioning the Individual Tables:** 417 | All individual crime tier tables (e.g., for property destruction, burglary, larceny/theft, etc.) are combined using a `UNION ALL`. Each record is tagged with its corresponding table name for identification. 418 | 419 | - **Pivoting and Aggregating Data:** 420 | The combined dataset is grouped by county. For each crime category, the query extracts the crime tier and computes the average of these tiers across all categories. 421 | 422 | - **Final Ordering:** 423 | The resulting unified table is sorted by the overall crime tier in descending order, thereby highlighting counties with the highest aggregated crime tiers. 
Below is the SQL query that performs these operations:

```sql
CREATE TABLE tayloro.colorado_crime_tier_county_rank AS
WITH county_scores AS (
    SELECT
        county_name,
        -- Select the crime tiers for each category
        MAX(CASE WHEN table_name = 'colorado_crime_tier_property_destruction' THEN crime_tier ELSE NULL END) AS property_destruction_tier,
        MAX(CASE WHEN table_name = 'colorado_crime_tier_burglary' THEN crime_tier ELSE NULL END) AS burglary_tier,
        MAX(CASE WHEN table_name = 'colorado_crime_tier_larceny_theft' THEN crime_tier ELSE NULL END) AS larceny_theft_tier,
        MAX(CASE WHEN table_name = 'colorado_crime_tier_vehicle_theft' THEN crime_tier ELSE NULL END) AS vehicle_theft_tier,
        MAX(CASE WHEN table_name = 'colorado_crime_tier_robbery' THEN crime_tier ELSE NULL END) AS robbery_tier,
        MAX(CASE WHEN table_name = 'colorado_crime_tier_arson' THEN crime_tier ELSE NULL END) AS arson_tier,
        MAX(CASE WHEN table_name = 'colorado_crime_tier_stolen_property' THEN crime_tier ELSE NULL END) AS stolen_property_tier,
        -- Calculate the average of all crime tiers
        ROUND(
            AVG(
                CASE
                    WHEN table_name IN (
                        'colorado_crime_tier_property_destruction',
                        'colorado_crime_tier_burglary',
                        'colorado_crime_tier_larceny_theft',
                        'colorado_crime_tier_vehicle_theft',
                        'colorado_crime_tier_robbery',
                        'colorado_crime_tier_arson',
                        'colorado_crime_tier_stolen_property'
                    ) THEN crime_tier
                    ELSE NULL
                END
            )
        ) AS overall_crime_tier -- Round to the nearest whole number
    FROM (
        -- Union all 7 crime tier tables
        SELECT county_name, 'colorado_crime_tier_property_destruction' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_property_destruction
        UNION ALL
        SELECT county_name, 'colorado_crime_tier_burglary' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_burglary
        UNION ALL
        SELECT county_name, 'colorado_crime_tier_larceny_theft' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_larceny_theft
        UNION ALL
        SELECT county_name, 'colorado_crime_tier_vehicle_theft' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_vehicle_theft
        UNION ALL
        SELECT county_name, 'colorado_crime_tier_robbery' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_robbery
        UNION ALL
        SELECT county_name, 'colorado_crime_tier_arson' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_arson
        UNION ALL
        SELECT county_name, 'colorado_crime_tier_stolen_property' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_stolen_property
    ) combined
    GROUP BY county_name
)
SELECT * FROM county_scores
ORDER BY overall_crime_tier DESC;
```

Screenshot 2025-02-28 at 9 20 22 PM
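Conceptually, the pivot-and-average step boils down to the following small Python sketch. The county names and tier values here are made up for illustration; they are not drawn from the real tables.

```python
from statistics import mean

# Hypothetical per-category crime tiers for two counties (illustrative only).
tiers = {
    "Denver": {"burglary": 5, "larceny_theft": 5, "robbery": 4},
    "Gilpin": {"burglary": 1, "larceny_theft": 2, "robbery": 1},
}

# Mirror of the SQL: average the per-category tiers for each county,
# then round to the nearest whole number to get the overall crime tier.
overall = {
    county: round(mean(by_category.values()))
    for county, by_category in tiers.items()
}

print(overall)  # {'Denver': 5, 'Gilpin': 1}
```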
### Step 4: Merge Rankings for Final County Tier Rank
In this step, we combine the overall crime tier county rank, the income tier rank, and the population (crime per capita) rank to generate a final composite ranking for each county. Since our primary focus is on crime, we weigh the crime rank a little heavier in our final calculation.

The merging process involves:
- **Combining Ranks:**
  We perform a full outer join on the three individual ranking tables to ensure every county is represented.
- **Handling Missing Values:**
  `COALESCE` is used to manage missing values by defaulting to 0 when a rank is not available.
- **Calculating the Final Rank:**
  The final rank is computed as the average of the three rankings. With crime being our primary focus, its rank is given slightly more influence in this average.
- **Final Ordering:**
  Counties are ordered by the final average rank (in ascending order) to highlight those with the highest overall tier.

Below is the SQL query that accomplishes these steps:

```sql
CREATE TABLE tayloro.colorado_final_county_tier_rank AS
WITH combined_ranks AS (
    SELECT
        COALESCE(c.county_name, i.county, p.county) AS county,
        COALESCE(c.overall_crime_tier, 0) AS crime_rank,
        COALESCE(i.income_tier, 0) AS income_rank,
        COALESCE(p.crime_per_capita_tier, 0) AS population_rank
    FROM academy.tayloro.colorado_crime_tier_county_rank c
    FULL OUTER JOIN academy.tayloro.colorado_income_household_tier_rank i
        ON LOWER(c.county_name) = LOWER(i.county)
    FULL OUTER JOIN academy.tayloro.colorado_population_crime_per_capita_rank p
        ON LOWER(c.county_name) = LOWER(p.county)
),
average_rank AS (
    SELECT
        county,
        crime_rank,
        income_rank,
        population_rank,
        ROUND(
            (crime_rank + income_rank + population_rank) /
            NULLIF((CASE WHEN crime_rank > 0 THEN 1 ELSE 0 END
                  + CASE WHEN income_rank > 0 THEN 1 ELSE 0 END
                  + CASE WHEN population_rank > 0 THEN 1 ELSE 0 END), 0)
        ) AS final_rank -- Calculate the average rank and round to the nearest whole number
    FROM combined_ranks
    WHERE crime_rank > 0 AND income_rank > 0 AND population_rank > 0 -- Ensure all ranks are present
)
SELECT * FROM average_rank
ORDER BY final_rank ASC; -- Order by the final average rank
```

Screenshot 2025-02-28 at 9 21 32 PM
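The `NULLIF`-guarded average can be sketched in plain Python. This is an illustration of the logic only, not code from the pipeline: only ranks that are present (non-zero after `COALESCE`) count toward the denominator, and an all-missing row yields no rank at all rather than a division-by-zero error.

```python
def final_rank(crime_rank, income_rank, population_rank):
    """Average only the ranks that are present; None if all are missing."""
    ranks = [crime_rank, income_rank, population_rank]
    present = [r for r in ranks if r > 0]
    if not present:                    # NULLIF(..., 0) -> NULL in the SQL
        return None
    return round(sum(present) / len(present))

print(final_rank(3, 5, 4))  # 4
print(final_rank(3, 0, 0))  # 3 -- only the crime rank contributes
```

In the query itself the `WHERE` clause already requires all three ranks to be positive, so the guard is defensive; it keeps the expression safe if that filter is ever relaxed.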
### Step 5: Backfill Business Entities Table with Final Ranking
In this final step, we join the final county tier rank table with the business entities table. This allows us to backfill all business entities with their corresponding ranking information based on county. The query uses a `LEFT JOIN` to ensure that every business entity is retained, matching on a case-insensitive comparison of county names.

```sql
CREATE TABLE tayloro.colorado_business_entities_with_ranking AS
SELECT
    be.*,
    fr.crime_rank,
    fr.income_rank,
    fr.population_rank,
    fr.final_rank
FROM tayloro.colorado_business_entities be
LEFT JOIN tayloro.colorado_final_county_tier_rank fr
    ON LOWER(be.principalcounty) = LOWER(fr.county);
```

Screenshot 2025-02-28 at 9 39 08 PM
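The `LEFT JOIN` semantics can be illustrated with a small Python sketch; the entity names and rank values below are hypothetical, not real rows from the tables.

```python
# Every business entity is kept; an entity whose county has no rank row
# simply gets None for the rank columns (the SQL NULL equivalent).
entities = [
    {"entityname": "Lost Coffee LLC", "principalcounty": "Arapahoe"},
    {"entityname": "Trail Gear Co", "principalcounty": "San Juan"},
]
ranks = {"arapahoe": {"final_rank": 3}}  # case-insensitive key, as in the SQL

enriched = [
    {**e, **ranks.get(e["principalcounty"].lower(), {"final_rank": None})}
    for e in entities
]
print([e["final_rank"] for e in enriched])  # [3, None]
```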
Below is a visualization of how these datasets were merged together to form a cohesive ranking across the three main categories.

![B A S E Future State Diagram (5)](https://github.com/user-attachments/assets/d665e2b9-ae7f-4bd7-85f0-6c5f5fedd3b6)

## KPI and Use Case Visualizations

These dashboards represent the 16 KPIs and use cases that were provided to the data firm by the state. Please note as well that Grafana allows us to create dynamic variables that let a user choose a city or county and fetch data for those values in real time. Since these are static images, that interactivity is not represented here, but it is a huge component of the user experience.

### Colorado County Dashboard

This county dashboard shows the following use cases:
- Calculate crime count for police agencies in a given county to get a baseline number.
- Calculate the average median household income per county.
- Show population trends for each county alongside corresponding crime trends by year.
- Show population trends for each county from 1997–2020.
- Calculate the totals for each crime category per county from 1997 to 2020 to gauge frequency.
- Show average seasonal crime trends for each month by county.
- Create a geo-map of crime density for counties from 1997–2020.
- Show average income per capita for counties.
- Compare crime data with median household income for each county.
- Analyze the distribution of different crime types across each county from 1997–2020 for Property, Person, and Society crimes.

Screenshot 2025-02-28 at 5 07 06 PM

### Colorado City Dashboard

This city dashboard shows the following use cases:

- Show seasonal crime trends for the year in each city.
- Determine which crimes are more likely to happen during the day for a city.
- Compute the average age for crime categories across cities.
- Determine which crimes are more likely to happen at night for a city.
- Display crime type distribution by city.
- Show crime trends on average by day of the week to determine when to patrol more.

Screenshot 2025-02-28 at 5 05 24 PM

### Colorado Crime Density Dashboard

This dashboard shows overall crime density as crime per capita (per 100k residents) across the entire State of Colorado. It's cool to see the larger areas represent the highest density, and it's not surprising that many of those are in the Denver metro area.

Screenshot 2025-02-23 at 9 33 19 PM

### Business Entity Data Pipeline

#### Business Entity Daily DAG

The final part of the equation is our data pipeline that evaluates newly incoming business entities daily. This DAG calls out to the public Colorado Information Marketplace API every day at 6 AM MT and fetches yesterday's Colorado Business Entity data for processing. It then cleans the data, joins matching county data based on ZIP and city, and assigns an appropriate subsidy tier. This DAG involves several tasks to ingest, clean, validate, enrich, and load data from the data.colorado.gov API into production tables. Below is a breakdown of each task and its purpose.

---

#### Task Descriptions

1. **create_business_entities_yesterday_stage**
   *Description:* Creates a staging table for business entities for yesterday's data. This table is partitioned by `principalcounty` and sets up the structure to temporarily hold the data for further processing.

2. **create_misfit_business_entities_table**
   *Description:* Creates a table for misfit business entities that do not meet the expected data patterns. This helps segregate records that require special attention or further review.

3. **create_dupes_staging_table**
   *Description:* Creates a staging table for capturing potential duplicate records from the business entities data. It mirrors the structure of the main table and is partitioned by `principalcounty`.

4. **create_duplicates_table**
   *Description:* Creates a permanent table to store confirmed duplicate records from the business entities dataset.

5. **create_geo_validation_staging_table**
   *Description:* Creates a temporary table to hold the results of geo-validated address data. This table is partitioned by `geo_county` and is used to store responses from the Geoapify API.

6. **create_addded_city_county_zip_staging_table**
   *Description:* Creates a staging table to store newly discovered city/county/zip combinations from geo-validation. This table is later used to update the reference data for address mapping.

7. **fetch_and_insert_business_entities_yesterday**
   *Description:* Fetches business entity data for yesterday from the data.colorado.gov API and inserts the records into the staging table. This task calls the API, transforms the response into records, and executes an INSERT query.

8. **filter_invalid_addresses**
   *Description:* Deletes records from the staging table that have NULL values in critical address fields (`principalcity` or `principalzipcode`), ensuring only valid addresses proceed to the next steps.

9. **reformat_zip_code**
   *Description:* Reformats the `principalzipcode` field in the staging table to a standard 5-digit format, ensuring consistency for downstream processing.

10. **update_county_on_staging_table**
    *Description:* Updates the `principalcounty` field in the staging table by joining with the cleaned reference table (`colorado_city_county_zip`). The matching is done using normalized (trimmed and lowercased) city names and ZIP codes.

11. **process_unmatched_entities_task**
    *Description:* Processes records from the staging table where `principalcounty` is still NULL. It calls the Geoapify API to fetch validated address data for these unmatched entities, transforms the API responses, and inserts the results into the geo validation staging table.

12. **check_validated_address_combos**
    *Description:* Checks validated address combinations by joining the geo validation staging table with the staging table. This task identifies new city/county/zip combinations not present in the reference table.

13. **insert_validated_address_combos**
    *Description:* Inserts the new, validated city/county/zip combinations into the reference table (`colorado_city_county_zip`). This enhances the mapping accuracy for future data processing.

14. **override_from_validated_addresses**
    *Description:* Updates the main business entities table by overriding existing address fields with geo-validated data. This ensures that the most accurate and standardized addresses are stored.

15. **retry_update_county_on_staging_table**
    *Description:* Reattempts the update of the `principalcounty` field on the staging table for any records that remain unmatched after initial processing.

16. **identify_duplicates_task**
    *Description:* Identifies duplicate records within the staging table by comparing key address and entity fields. The duplicates are then inserted into the duplicates staging table.

17. **run_dq_check_null_values**
    *Description:* Runs a data quality check that counts records with NULL values in critical fields (`entityname`, `principalcity`, `entitytype`, `entityformdate`) to ensure data completeness.

18. **run_dq_check_principalstate**
    *Description:* Executes a data quality check to verify that all records in the staging table have `principalstate` equal to 'CO'.

19. **run_dq_check_entityformdate**
    *Description:* Performs a data quality check to ensure that the `entityformdate` in the staging table matches the expected date (yesterday's date).

20. **run_dq_check_no_unexpected_duplicates**
    *Description:* Checks for any unexpected duplicates in the staging table by grouping on `entityid` and counting occurrences greater than one.

21. **insert_new_records_task**
    *Description:* Inserts new business entity records from the staging table into the production table, excluding any records that have been flagged as duplicates and ensuring that `principalcounty` is populated.

22. **insert_business_entities_with_ranking**
    *Description:* Inserts business entity records along with their associated ranking information by joining the staging table with the final county tier rank table.

23. **insert_duplicates**
    *Description:* Inserts duplicate records from the duplicates staging table into the permanent duplicates table for further review or archival.

24. **insert_misfits**
    *Description:* Inserts misfit records (those without a valid `principalcounty`) from the staging table into the misfits table, isolating records that could not be processed normally.

25. **drop_staging_table_task**
    *Description:* Drops the staging table to clean up temporary data once processing is complete.

26. **drop_dupes_staging_table_task**
    *Description:* Drops the duplicates staging table to remove temporary duplicate data after it has been archived.

27. **drop_geo_validation_staging_table**
    *Description:* Drops the geo validation staging table that was used for holding geo API responses once the data has been processed and merged.

28. **drop_added_city_county_zip_staging_table**
    *Description:* Drops the staging table for newly added city/county/zip combinations, finalizing the cleanup of temporary tables.

---

#### DAG Flow Order

The task order looks like this:

Screenshot 2025-02-28 at 10 29 57 PM

#### Astronomer Cloud Deployment

One of the core requirements from Zach for the capstone project is that it must be running in a production cloud environment to count. Below are snapshots of my DAG in the Astronomer production environment running daily.

Screenshot 2025-02-28 at 6 23 12 AM

This capture shows the successes and failures of my DAG runs as I worked through errors and bugs.

Screenshot 2025-02-28 at 6 22 22 AM

### Putting It All Together

#### Business Entity Search Dashboard
The dashboard below enables a business owner to search for their business entity and see information about their business, the available security system subsidy tiers, and exactly which tier their business qualifies for. In this example, Lost Coffee could search for information about their business and learn that they have been assigned a tier ranking of 3 based on all of the criteria explained above.
Screenshot 2025-02-28 at 5 08 32 PM

### Challenges and Findings
#### In-depth Summary of Business Entities Data Cleaning and Update Process
**Challenges Faced:**
- **Mismatches in Data:** Many records in the `colorado_business_entities` table lacked corresponding county information when joined with the reference table (`colorado_city_county_zip`).
- **Inconsistent Data Formatting:** Typos, case discrepancies, and extra spaces in city names and ZIP codes led to failed matches.
- **Incomplete Reference Data:** Missing city/ZIP combinations and duplicate entries in the reference table required augmentation and deduplication to ensure accurate mapping.

**Process Overview:**

1. **Initial Analysis and Identification of Issues:**
   - Ran queries to identify records in `colorado_business_entities` with no matching county in the reference table.
   - Examined unmatched city/ZIP pairs to determine the most frequent discrepancies.

2. **Data Correction in the Business Entities Table:**
   - Performed manual corrections for known typos and inconsistencies (e.g., updating "peyton co" to "Peyton", correcting ZIP 80524 entries to "Fort Collins").
   - Verified corrections by running SELECT queries against both the business entities and reference datasets.

3. **Assessing Match Quality:**
   - Executed queries to compare the count of matched versus unmatched records, revealing a high number of mismatches that required further attention.

4. **Used the Geoapify API to Resolve Address Issues:**
   - Leveraged Geoapify's Geocoding API to validate address discrepancies and improve the address, city, county, and ZIP attributes of non-matching business entities.
   - One funny thing: while processing the remaining unmatched records, I totally blew through the daily Geoapify limits, and I very much appreciate their service for letting me do so.
   - Screenshot 2025-03-01 at 7 48 32 AM

5. **Inserting Missing Reference Data:**
   - Augmented the reference table by inserting missing city/ZIP combinations (e.g., added entries for "Evans" in Weld County).
   - Inspected the reference table to ensure that the inserted data met quality standards.

6. **Deduplicating the Reference Table:**
   - Identified duplicate rows for certain city/ZIP pairs (e.g., "Burlington, 80807" and "New Raymer, 80742").
   - Created a new deduplicated version of the reference table and replaced the original with the cleaned version.

7. **Updating the Business Entities with County Data:**
   - Employed a final update query (using a correlated subquery) to backfill the `principalcounty` field in `colorado_business_entities` by matching normalized city and ZIP code data from the cleaned reference table.
   - Verified the update by confirming that previously unmatched records were now successfully mapped.

8. **Final Outcome:**
   - **Normalized Addresses:** Data cleaning (trimming, lowercasing, casting ZIP codes) ensured consistent comparisons.
   - **Accurate Mapping:** Manual corrections and enhanced reference data led to a complete and unique set of city/ZIP combinations.
   - **Successful Update:** The final update query backfilled the `principalcounty` field for all records, reducing unmatched counts to zero.
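The normalized matching described above can be sketched in Python. The reference rows, the `normalize` helper, and the five-character ZIP truncation are illustrative assumptions for this sketch, not the production update query:

```python
def normalize(city, zip_code):
    """Trim and lowercase the city; keep only the leading 5-digit ZIP."""
    return (city.strip().lower(), str(zip_code).strip()[:5])

# Illustrative slice of the cleaned reference table.
reference = {
    normalize("Evans", "80620"): "Weld",
    normalize("Fort Collins", "80524"): "Larimer",
}

# A messy business-entity row still matches after normalization.
entity = {"principalcity": "  FORT COLLINS ", "principalzipcode": "80524-1234"}
county = reference.get(
    normalize(entity["principalcity"], entity["principalzipcode"])
)
print(county)  # Larimer
```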
#### Out-of-Memory (OOM) Exceptions and Spark Configuration Adjustments
When processing large datasets (over a million rows) using Spark, I frequently encountered out-of-memory exceptions, especially when fetching large CSV files from S3 and writing to Iceberg. The default Spark settings were insufficient for these workloads.

To address this challenge, I reviewed the Spark documentation and applied recommendations from the bootcamp. I adjusted the Spark configuration to allocate more memory to both the executors and the driver, and to optimize shuffle operations. The following settings helped mitigate the memory issues:

```python
from pyspark import SparkConf

conf = SparkConf()
conf.set('spark.executor.memory', '16g')         # Increase executor memory based on your system's capacity
conf.set('spark.driver.memory', '16g')           # Increase driver memory to handle larger workloads
conf.set('spark.sql.shuffle.partitions', '200')  # Optimize the number of shuffle partitions
```

By increasing the memory allocation and tuning the shuffle partitions, I was able to process the large CSV files efficiently and write to Iceberg without running into out-of-memory errors.
#### Challenge: Ambiguous City and County Matching
**Description:**
Initially, my orchestration relied solely on a city and county dataset to map business entities to their respective counties. However, an issue arose where cities that span multiple counties were being randomly assigned to just one county.

**Example:**
- **Aurora:** This city spans Douglas County, Arapahoe County, and Denver County.
- Without additional disambiguation, business entities in Aurora were incorrectly mapped to a single county at random, leading to inaccurate county assignments.

**Resolution:**
- I enhanced the reference dataset by adding a **ZIP code** column.
- By matching on both the city and ZIP code of a business entity's address, the mapping became much more precise.
- This change ensured that each business entity is matched to the correct county based on the distinct combination of city and ZIP code.

**Outcome:**
Incorporating the ZIP code significantly improved the accuracy of county assignments, eliminating the randomness in cases where cities span multiple counties.
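A minimal sketch of the fix: with only the city as the key, Aurora could resolve to any of its counties; adding the ZIP code makes the lookup key unique. The ZIP-to-county pairings below are placeholders for illustration, not verified assignments.

```python
# City-only lookup is ambiguous: one key, several candidate counties.
ambiguous = {"aurora": ["Douglas", "Arapahoe", "Denver"]}

# (city, zip) lookup is unique. Placeholder pairings, not real data.
resolved = {
    ("aurora", "80011"): "Arapahoe",
    ("aurora", "80016"): "Douglas",
}

def lookup(city, zip_code):
    """Case-insensitive (city, zip) match; None when no combination exists."""
    return resolved.get((city.strip().lower(), zip_code))

print(lookup("Aurora", "80016"))  # Douglas
```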
#### Inconsistent Zip Code Formats Causing Casting Errors
**Description:**
Initially, the ETL process transformed API responses without validating the `principalzipcode` field. This oversight led to casting errors when attempting to convert zip code values in unexpected formats (e.g., `"80137-6828"`) into an integer for insertion into the staging table.

**Example:**
- **Problem Record:** A record with a `principalzipcode` value like `"80137-6828"` resulted in the following error during insertion:

      TrinoUserError: TrinoUserError(type=USER_ERROR, name=INVALID_CAST_ARGUMENT, message="Cannot cast '80137-6828' to INT", query_id=...)

**Resolution:**
- **Data Transformation Enhancement:**
  I updated the `transform_entity_data_to_records` function to validate the zip code using a regular expression. Only zip codes matching exactly five digits (`\d{5}`) are accepted:
  - Valid zip codes are converted to an integer.
  - Invalid zip codes are set to Python's `None`.

- **SQL Query Adjustment:**
  In the `insert_into_staging_table` function, I modified the query construction to check for `None`. If a zip code is `None`, it is replaced with the SQL literal `NULL` (without quotes) in the final query. This ensures that only a valid integer or a proper `NULL` value is inserted into the `principalzipcode` column.

**Outcome:**
By separating data validation from SQL query construction, this solution prevents casting errors and maintains data integrity. Only correctly formatted 5-digit zip codes (or `NULL` for invalid ones) are passed to the database, resulting in a robust and error-free ETL process.
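The validation-then-literal split described above can be sketched like this. The helper names (`clean_zip`, `zip_sql_literal`) are simplified stand-ins for the logic inside `transform_entity_data_to_records` and `insert_into_staging_table`, not the DAG's actual functions:

```python
import re

# Accept only exactly five digits; everything else becomes None.
ZIP_RE = re.compile(r"^\d{5}$")

def clean_zip(raw):
    """Validate during transformation: int for a 5-digit zip, else None."""
    if raw is not None and ZIP_RE.match(str(raw).strip()):
        return int(raw)
    return None

def zip_sql_literal(raw):
    """Render for query construction: a bare NULL, never a quoted string."""
    cleaned = clean_zip(raw)
    return "NULL" if cleaned is None else str(cleaned)

print(clean_zip("80137-6828"))   # None -- hyphenated ZIP+4 is rejected
print(zip_sql_literal("80137"))  # 80137
```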
## Next Steps and Closing Thoughts

### Next Steps

Here is a running list of things that I would continue to refine, improve, or introduce as new features:
- Address-level tier assignments that are specific to the latitude and longitude of the business location
- Understanding residential vs. commercial businesses and how to delineate the security subsidy between them
- Automation to handle the duplicates and misfits tables instead of remedying them manually
- Machine learning built into the population projections data that uses forecasted population growth to predict crime trends in respective counties

### Closing Thoughts

In closing, this capstone project was easily one of the hardest projects that I have ever done by myself end to end. I pushed myself, accomplished more than I ever thought I would, and had so much fun in the process. The most exciting thing about it is that while I only focused on 16 KPIs and use cases that I made up, these datasets could serve hundreds of other exciting and interesting use cases. I spent countless hours over 8 weeks finding the right datasets, refining the data into the KPIs and use cases you just reviewed, and building the experience that allows business owners to search for their qualifying subsidy tiers. I am so proud of myself and what I was able to accomplish from free data extracts. It was really cool to focus on data that is in my own back yard here in Colorado. I accomplished things I never thought I would, and I cannot wait to see what is next for my career after this.