├── .gitignore ├── dags ├── .airflowignore └── tayloro │ └── business_entity_dag.py ├── capstone_queries ├── graffana │ ├── colorado_city_metrics │ │ ├── variable_cities.sql │ │ ├── likely_city_crimes_during_day.sql │ │ ├── likely_city_crimes_during_night.sql │ │ └── seasonal_city_crime_rate_by_month_1997_2020.sql │ ├── colorado_county_metrics │ │ ├── county_police_agency_crime_count.sql │ │ ├── seasonal_average_crime_trends_for_county.sql │ │ ├── county_crime_count_per_category_1997_2020.sql │ │ ├── crime_density_per_county_1997_2020.sql │ │ ├── county_population_vs_crime_trends_1997_2020.sql │ │ └── county_crime_rates_vs_median_income.sql │ └── business_search │ │ └── business_search_query.sql ├── gold │ ├── create_colorado_county_agency_crime_counts.sql │ ├── create_colorado_city_crime_counts_1997_2020.sql │ ├── create_colorado_avg_age_per_crime_category_1997_2020.sql │ ├── normalized_crime_and_population_trends.sql │ ├── create_county_crime_per_capita_with_coordinates.sql │ ├── total_count_of_crimes_by_county.sql │ ├── create_colorado_crime_type_distribution_by_city.sql │ ├── create_colorado_crime_category_totals_per_county_1997_2020.sql │ ├── create_colorado_county_average_income_1997_2020.sql │ ├── colorado_crime_population_trends.sql │ ├── create_colorado_crime_vs_median_household_income_1997_2020.sql │ ├── create_colorado_city_crime_time_likelihood.sql │ ├── create_colorado_city_seasonal_crime_rates_1997_2020.sql │ └── create_colorado_county_seasonal_crime_rates_1997_2020.sql ├── silver │ ├── delete_malformed_business_entities.sql │ ├── update_principal_zip_code_formatting.sql │ ├── update_colorado_business_entity_county.sql │ ├── create_colorado_business_entities_with_ranking.sql │ ├── create_colorado_crimes_2016_2020_with_cities.sql │ ├── create_colorado_income_1997_2020.sql │ ├── create_total_population_per_county_1997_2020.sql │ ├── combine_city_and_county_data_for_business_entities.sql │ ├── create_colorado_county_total_crime_per_capita_1997_2020.sql │ 
├── household_income_and_total_crimes_1997_2020.sql │ ├── create_colorado_final_county_tier_rank.sql │ ├── calculate_crime_percentile_for_categories.sql │ ├── calculate_colorado_household_income_tier_rank.sql │ ├── calculate_final_county_tier_rank.sql │ ├── create_yearly_county_crime_per_capita_1997_2020.sql │ ├── calculate_population_crime_per_capita_tier.sql │ ├── calculate_population_crime_per_capita_rank.sql │ ├── create_colorado_crime_tier_county_rank.sql │ ├── create_colorado_crimes_with_cities.sql │ ├── create_colorado_crime_weighted_scores.sql │ ├── create_colorado_crimes.sql │ └── clean_address_colorado_business_entity_data.sql └── bronze │ ├── write_colorado_county_coordinates_iceberg_from_S3.py │ ├── write_colorado_crimes_1997_2020_iceberg_from_S3.py │ ├── write_colorado_income_iceberg_from_S3.py │ └── write_colorado_crimes_2016_2020_iceberg_from_S3.py ├── scripts ├── getS3bucketfilelist.py ├── upload_capstone_csv_data_to_s3.py ├── modify_colorado_city_county_zip_csv.py ├── create_crime_rate_per_capita.py ├── create_colorado_crimes_table.py ├── create_colorado_business_entities.py ├── write_to_iceberg_from_S3.py └── business_entity_address_validation.py ├── Capstone-data-dictionary.csv └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | .env -------------------------------------------------------------------------------- /dags/.airflowignore: -------------------------------------------------------------------------------- 1 | community -------------------------------------------------------------------------------- /capstone_queries/graffana/colorado_city_metrics/variable_cities.sql: -------------------------------------------------------------------------------- 1 | select distinct city from tayloro.colorado_crimes_with_cities order by city asc -------------------------------------------------------------------------------- 
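The Grafana queries in this repo rely on dashboard variables (`$city`, `$county`) that Grafana interpolates into the SQL text before it reaches the query engine; `variable_cities.sql` above is the query that populates the `$city` dropdown. As a rough sketch of that behavior (not Grafana's actual implementation — `render_query` and its replace-longest-first rule are hypothetical), the substitution amounts to a template replace:

```python
def render_query(template: str, variables: dict[str, str]) -> str:
    """Substitute Grafana-style $var placeholders into a SQL template.

    Hypothetical helper: Grafana performs this interpolation internally;
    this only illustrates how queries like "... where city = '$city'"
    behave once a dashboard variable is selected.
    """
    rendered = template
    # Replace longer names first so e.g. $county_name never collides with $county.
    for name in sorted(variables, key=len, reverse=True):
        rendered = rendered.replace(f"${name}", variables[name])
    return rendered


template = (
    "SELECT offense_category_name "
    "FROM academy.tayloro.colorado_city_crime_time_likelihood "
    "WHERE city = '$city' AND is_day_likely = true"
)
print(render_query(template, {"city": "Denver"}))
```

Note that plain string substitution is what makes quoting inside the template (`'$city'`) matter: the variable value is spliced in verbatim.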
/capstone_queries/graffana/colorado_city_metrics/likely_city_crimes_during_day.sql: -------------------------------------------------------------------------------- 1 | SELECT offense_category_name FROM academy.tayloro.colorado_city_crime_time_likelihood where city = '$city' and is_day_likely = true -------------------------------------------------------------------------------- /capstone_queries/graffana/colorado_city_metrics/likely_city_crimes_during_night.sql: -------------------------------------------------------------------------------- 1 | SELECT offense_category_name FROM academy.tayloro.colorado_city_crime_time_likelihood where city = '$city' and is_night_likely = true -------------------------------------------------------------------------------- /capstone_queries/graffana/colorado_county_metrics/county_police_agency_crime_count.sql: -------------------------------------------------------------------------------- 1 | SELECT agency_name, total_crimes FROM academy.tayloro.colorado_county_agency_crime_counts where county_name = LOWER('$county') order by total_crimes desc 2 | -------------------------------------------------------------------------------- /capstone_queries/graffana/colorado_city_metrics/seasonal_city_crime_rate_by_month_1997_2020.sql: -------------------------------------------------------------------------------- 1 | SELECT city, month_name, avg_monthly_crimes FROM academy.tayloro.colorado_city_seasonal_crime_rates_1997_2020 WHERE city = 'Denver' order by crime_month asc 2 | -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_county_agency_crime_counts.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_county_agency_crime_counts AS 2 | SELECT 3 | county_name, 4 | agency_name, 5 | COUNT(*) AS total_crimes 6 | FROM tayloro.colorado_crimes 7 | GROUP BY county_name, agency_name 8 | ORDER BY total_crimes DESC; 
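The `GROUP BY county_name, agency_name` / `COUNT(*)` pattern in `create_colorado_county_agency_crime_counts.sql` above maps directly onto counting key pairs. A minimal Python sketch of the same aggregation — the sample rows are invented stand-ins for `tayloro.colorado_crimes`:

```python
from collections import Counter

# Toy rows standing in for tayloro.colorado_crimes (values invented).
crimes = [
    {"county_name": "denver", "agency_name": "Denver PD"},
    {"county_name": "denver", "agency_name": "Denver PD"},
    {"county_name": "boulder", "agency_name": "Boulder PD"},
]

# GROUP BY county_name, agency_name + COUNT(*) == a Counter over key pairs.
totals = Counter((r["county_name"], r["agency_name"]) for r in crimes)

# ORDER BY total_crimes DESC == most_common().
for (county, agency), total in totals.most_common():
    print(county, agency, total)
```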
-------------------------------------------------------------------------------- /capstone_queries/graffana/colorado_county_metrics/seasonal_average_crime_trends_for_county.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | county_name, 3 | month_name, 4 | avg_monthly_crimes 5 | FROM academy.tayloro.colorado_county_seasonal_crime_rates_1997_2020 6 | WHERE county_name = LOWER('$county') 7 | ORDER BY crime_month ASC 8 | -------------------------------------------------------------------------------- /capstone_queries/silver/delete_malformed_business_entities.sql: -------------------------------------------------------------------------------- 1 | DELETE FROM tayloro.colorado_business_entities 2 | WHERE principalstate = 'CO' 3 | AND ( 4 | CASE 5 | WHEN principalzipcode LIKE '%-%' THEN SUBSTRING(principalzipcode, 1, 5) 6 | ELSE principalzipcode 7 | END 8 | ) NOT LIKE '8%'; -------------------------------------------------------------------------------- /capstone_queries/silver/update_principal_zip_code_formatting.sql: -------------------------------------------------------------------------------- 1 | UPDATE tayloro.colorado_business_entities 2 | SET principalzipcode = CASE 3 | WHEN principalzipcode LIKE '%-%' THEN SUBSTRING(principalzipcode, 1, 5) 4 | ELSE principalzipcode 5 | END 6 | WHERE principalstate = 'CO' 7 | AND NOT regexp_like(principalzipcode, '^\d{5}$'); -------------------------------------------------------------------------------- /capstone_queries/graffana/colorado_county_metrics/county_crime_count_per_category_1997_2020.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | offense_category_name, -- Crime type 3 | crime_count -- Count of crimes per category 4 | FROM 5 | academy.tayloro.colorado_crime_category_totals_per_county_1997_2020 6 | WHERE county_name = LOWER('$county') -- Grafana variable for county selection 
-------------------------------------------------------------------------------- /capstone_queries/graffana/colorado_county_metrics/crime_density_per_county_1997_2020.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | county, 3 | cent_lat AS lat, 4 | cent_long AS lon, 5 | crime_per_100k, 6 | LN(1 + crime_per_100k) AS crime_log_scale -- Use LN() for natural log (PostgreSQL, DuckDB, etc.) 7 | FROM academy.tayloro.colorado_county_crime_per_capita_with_coordinates -------------------------------------------------------------------------------- /capstone_queries/silver/update_colorado_business_entity_county.sql: -------------------------------------------------------------------------------- 1 | UPDATE tayloro.colorado_business_entities 2 | SET principalcounty = ( 3 | SELECT ccz.county 4 | FROM tayloro.colorado_city_county_zip ccz 5 | WHERE TRIM(LOWER(principalcity)) = TRIM(LOWER(ccz.city)) 6 | AND TRIM(CAST(principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 7 | ) 8 | WHERE principalstate = 'CO'; -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_city_crime_counts_1997_2020.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_city_crime_counts_1997_2020 AS 2 | SELECT 3 | city, 4 | COUNT(*) AS crime_count 5 | FROM 6 | academy.tayloro.colorado_crimes_with_cities 7 | WHERE 8 | EXTRACT(YEAR FROM incident_date) BETWEEN 1997 AND 2020 9 | and city is not null 10 | GROUP BY 11 | city 12 | ORDER BY 13 | crime_count DESC; -------------------------------------------------------------------------------- /capstone_queries/silver/create_colorado_business_entities_with_ranking.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_business_entities_with_ranking AS 2 | SELECT 3 | be.*, 4 | fr.crime_rank, 5 | 
fr.income_rank, 6 | fr.population_rank, 7 | fr.final_rank 8 | FROM tayloro.colorado_business_entities be 9 | LEFT JOIN tayloro.colorado_final_county_tier_rank fr 10 | ON LOWER(be.principalcounty) = LOWER(fr.county); -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_avg_age_per_crime_category_1997_2020.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_avg_age_per_crime_category_1997_2020 AS 2 | SELECT 3 | city, 4 | offense_category_name, 5 | ROUND(AVG(age_num), 2) AS avg_age 6 | FROM tayloro.colorado_crimes_with_cities 7 | WHERE age_num IS NOT NULL 8 | and city is not null 9 | GROUP BY city, offense_category_name 10 | ORDER BY city, offense_category_name; -------------------------------------------------------------------------------- /capstone_queries/graffana/colorado_county_metrics/county_population_vs_crime_trends_1997_2020.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | DATE_PARSE(CAST(year AS VARCHAR) || '-01-01', '%Y-%m-%d') AS time, -- Convert year to date for Grafana 3 | total_crimes, -- Raw total crimes per year 4 | total_population -- Raw total population per year 5 | FROM academy.tayloro.colorado_crime_population_trends 6 | WHERE county_name = '$county' 7 | ORDER BY year -------------------------------------------------------------------------------- /capstone_queries/gold/normalized_crime_and_population_trends.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | DATE_PARSE(CAST(year AS VARCHAR) || '-01-01', '%Y-%m-%d') AS time, -- Convert year to date for Grafana 3 | total_crimes, -- Raw total crimes per year 4 | total_population -- Raw total population per year 5 | FROM academy.tayloro.colorado_crime_population_trends 6 | WHERE county_name = '$county' 7 | ORDER BY year 8 | 9 | 10 | 11 | 12 | 13 | 
-------------------------------------------------------------------------------- /capstone_queries/gold/create_county_crime_per_capita_with_coordinates.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_county_crime_per_capita_with_coordinates AS 2 | select ccc.label AS county, 3 | crime_per_100k, 4 | ccc.cent_lat, 5 | ccc.cent_long 6 | from tayloro.colorado_county_crime_per_capita_1997_2020 crpc 7 | inner join tayloro.colorado_county_coordinates_raw ccc 8 | on LOWER(crpc.county_name) = LOWER(ccc.label) -------------------------------------------------------------------------------- /capstone_queries/gold/total_count_of_crimes_by_county.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | offense_category_name, -- Crime type 3 | COUNT(*) AS crime_count -- Count of crimes per category 4 | FROM 5 | academy.tayloro.colorado_crimes 6 | WHERE 7 | EXTRACT(YEAR FROM incident_date) BETWEEN 1997 AND 2020 8 | AND county_name = '$county' -- Grafana variable for county selection 9 | GROUP BY 10 | offense_category_name 11 | ORDER BY 12 | crime_count DESC; -------------------------------------------------------------------------------- /capstone_queries/silver/create_colorado_crimes_2016_2020_with_cities.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_crimes_2016_2020_with_cities AS 2 | WITH unique_city_mapping AS ( 3 | SELECT DISTINCT city 4 | FROM tayloro.colorado_city_county_zip 5 | ) 6 | SELECT 7 | ccr.*, 8 | ucm.city AS city 9 | FROM tayloro.colorado_crimes_2016_2020_raw ccr 10 | INNER JOIN unique_city_mapping ucm 11 | ON LOWER(TRIM(ccr.pub_agency_name)) = LOWER(TRIM(ucm.city)); -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_crime_type_distribution_by_city.sql: 
-------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_crime_type_distribution_by_city AS 2 | SELECT 3 | city, 4 | crime_against, -- e.g., Person, Property, Society 5 | COUNT(*) AS total_crimes 6 | FROM tayloro.colorado_crimes_with_cities 7 | WHERE city IS NOT NULL 8 | and crime_against is not null 9 | and crime_against <> 'Not a Crime' 10 | GROUP BY city, crime_against 11 | ORDER BY city, total_crimes DESC; -------------------------------------------------------------------------------- /capstone_queries/graffana/business_search/business_search_query.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | entityname AS Company_Name, 3 | principaladdress1 AS Address, 4 | principalcity AS City, 5 | principalstate AS State, 6 | principalcounty AS County 7 | FROM 8 | academy.tayloro.colorado_business_entities_with_ranking 9 | WHERE 10 | LOWER(entityname) LIKE LOWER('%$entityname%') AND principalstate = 'CO' -- Case-insensitive search 11 | ORDER BY 12 | entityname -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_crime_category_totals_per_county_1997_2020.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_crime_category_totals_per_county_1997_2020 AS 2 | SELECT 3 | offense_category_name, -- Crime type 4 | COUNT(*) AS crime_count, -- Count of crimes per category 5 | county_name 6 | FROM 7 | academy.tayloro.colorado_crimes 8 | WHERE 9 | EXTRACT(YEAR FROM incident_date) BETWEEN 1997 AND 2020 10 | GROUP BY 11 | offense_category_name, county_name 12 | ORDER BY 13 | crime_count DESC; -------------------------------------------------------------------------------- /capstone_queries/graffana/colorado_county_metrics/county_crime_rates_vs_median_income.sql: -------------------------------------------------------------------------------- 1 
| SELECT 2 | DATE_PARSE(CAST(year AS VARCHAR) || '-01-01', '%Y-%m-%d') AS time, -- Ensure date formatting for Grafana 3 | CAST(MAX(crime_per_100k) AS DOUBLE) AS crime_per_100k, -- Max ensures single value per year 4 | CAST(MAX(median_household_income) AS DOUBLE) AS median_household_income -- Max ensures single value per year 5 | FROM academy.tayloro.colorado_crime_vs_median_household_income_1997_2020 6 | WHERE county_name = '$county' 7 | GROUP BY year 8 | ORDER BY year -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_county_average_income_1997_2020.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | county, 3 | ROUND(AVG(CASE WHEN inctype = 1 THEN income END), 2) AS avg_total_personal_income, -- Total Personal Income 4 | ROUND(AVG(CASE WHEN inctype = 2 THEN income END), 2) AS avg_per_capita_income, -- Per Capita Personal Income 5 | ROUND(AVG(CASE WHEN inctype = 3 THEN income END), 2) AS avg_median_household_income -- Median Household Income 6 | FROM academy.tayloro.colorado_income_1997_2020 7 | WHERE periodyear BETWEEN 1997 AND 2020 8 | GROUP BY county 9 | ORDER BY county; -------------------------------------------------------------------------------- /capstone_queries/silver/create_colorado_income_1997_2020.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_income_1997_2020 AS 2 | WITH colorado_income_1997_2020 AS ( 3 | SELECT 4 | TRIM(REPLACE(areaname, 'County', '')) AS county, 5 | incdesc, 6 | inctype, 7 | incsource, 8 | income, 9 | incrank, 10 | periodyear 11 | FROM tayloro.colorado_income_raw 12 | WHERE stateabbrv = 'CO' 13 | AND areatyname = 'County' 14 | AND periodyear BETWEEN 1997 AND 2020 -- Filter for years 1997-2020 15 | ) 16 | SELECT * FROM colorado_income_1997_2020; -------------------------------------------------------------------------------- 
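The `AVG(CASE WHEN inctype = … THEN income END)` pattern in `create_colorado_county_average_income_1997_2020.sql` above pivots the three income types into separate columns by averaging only the rows matching each type (non-matching rows become NULL and are skipped by `AVG`). A small Python sketch of that conditional-average idea, over invented sample rows:

```python
from collections import defaultdict

# Sample rows mirroring colorado_income_1997_2020 (values are made up).
rows = [
    {"county": "Boulder", "inctype": 3, "income": 78000},
    {"county": "Boulder", "inctype": 3, "income": 82000},
    {"county": "Boulder", "inctype": 2, "income": 55000},
]

def conditional_avg(rows, inctype):
    """AVG(CASE WHEN inctype = ? THEN income END): average only matching rows,
    returning None when no rows match (as SQL AVG over all-NULL would)."""
    vals = [r["income"] for r in rows if r["inctype"] == inctype]
    return round(sum(vals) / len(vals), 2) if vals else None

by_county = defaultdict(list)
for r in rows:
    by_county[r["county"]].append(r)

for county, group in sorted(by_county.items()):
    print(county,
          conditional_avg(group, 2),   # avg per-capita personal income
          conditional_avg(group, 3))   # avg median household income
```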
/capstone_queries/silver/create_total_population_per_county_1997_2020.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE academy.tayloro.colorado_total_population_per_county_1997_2020 AS 2 | WITH filtered_population AS ( 3 | SELECT 4 | county, 5 | year, 6 | SUM(total_population) AS total_population -- Aggregating population in case of duplicates 7 | FROM tayloro.colorado_population_raw 8 | WHERE year BETWEEN 1997 AND 2020 9 | AND data_type = 'Estimate' -- Ensuring only estimated values 10 | GROUP BY county, year 11 | ) 12 | SELECT * FROM filtered_population 13 | ORDER BY county, year; -------------------------------------------------------------------------------- /capstone_queries/gold/colorado_crime_population_trends.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | DATE_PARSE(CAST(EXTRACT(YEAR FROM incident_date) AS VARCHAR) || '-01-01', '%Y-%m-%d') AS time, -- Convert year to date for Grafana 3 | county_name, 4 | COUNT(*) AS total_crimes 5 | FROM academy.tayloro.colorado_crimes 6 | WHERE county_name = '$county' 7 | AND incident_date IS NOT NULL -- Exclude records with null incident_date 8 | AND county_name IS NOT NULL 9 | GROUP BY EXTRACT(YEAR FROM incident_date), county_name -- Group by the extracted year and county 10 | ORDER BY EXTRACT(YEAR FROM incident_date), county_name -- Order by the extracted year and county 11 | -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_crime_vs_median_household_income_1997_2020.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_crime_vs_median_household_income_1997_2020 AS 2 | SELECT 3 | c.county_name, 4 | c.year, -- Ensure the year column is included 5 | c.total_crimes, 6 | c.total_population, 7 | (c.total_crimes * 100000.0 / NULLIF(c.total_population, 0)) AS crime_per_100k, -- Crime 
rate per 100k residents 8 | i.income AS median_household_income -- Median household income per year 9 | FROM academy.tayloro.colorado_crime_population_trends c 10 | LEFT JOIN academy.tayloro.colorado_income_1997_2020 i 11 | ON LOWER(c.county_name) = LOWER(i.county) 12 | AND c.year = i.periodyear 13 | AND i.inctype = 3 -- 3 corresponds to "Median Household Income" 14 | ORDER BY c.year; -------------------------------------------------------------------------------- /capstone_queries/silver/combine_city_and_county_data_for_business_entities.sql: -------------------------------------------------------------------------------- 1 | INSERT INTO tayloro.colorado_business_entities 2 | SELECT b.entityid, 3 | b.entityname, 4 | b.principaladdress1, 5 | b.principaladdress2, 6 | b.principalcity, 7 | c.county AS principalcounty, -- Mapping county from cities dataset 8 | b.principalstate, 9 | b.principalzipcode, 10 | b.principalcountry, 11 | b.entitystatus, 12 | b.jurisdictonofformation, 13 | b.entitytype, 14 | CASE 15 | WHEN regexp_like(TRIM(b.entityformdate), '^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$') 16 | THEN CAST(date_parse(TRIM(b.entityformdate), '%m/%d/%Y') AS DATE) 17 | ELSE NULL 18 | END AS entityformdate -- Convert only valid dates 19 | FROM tayloro.colorado_business_entities_raw b 20 | INNER JOIN tayloro.colorado_cities_and_counties c 21 | ON LOWER(TRIM(b.principalcity)) = LOWER(TRIM(c.city)) -------------------------------------------------------------------------------- /capstone_queries/silver/create_colorado_county_total_crime_per_capita_1997_2020.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_county_crime_per_capita_1997_2020 AS 2 | WITH crimes_total AS ( 3 | SELECT 4 | county_name, 5 | COUNT(*) AS total_crimes 6 | FROM tayloro.colorado_crimes 7 | WHERE incident_date IS NOT NULL 8 | AND YEAR(incident_date) BETWEEN 1997 AND 2020 9 | GROUP BY county_name 10 | ), 11 | population_total AS ( 12 | SELECT 
13 | county AS county_name, 14 | SUM(TOTALPOPULATION) AS total_population 15 | FROM tayloro.colorado_population_raw 16 | WHERE DATATYPE = 'Estimate' 17 | AND YEAR BETWEEN 1997 AND 2020 18 | GROUP BY county 19 | ), 20 | crime_rate_per_capita AS ( 21 | SELECT 22 | c.county_name, 23 | c.total_crimes, 24 | p.total_population, 25 | (c.total_crimes * 100000.0 / p.total_population) AS crime_per_100k 26 | FROM crimes_total c 27 | JOIN population_total p 28 | ON LOWER(TRIM(c.county_name)) = LOWER(TRIM(p.county_name)) 29 | ) 30 | SELECT * FROM crime_rate_per_capita; -------------------------------------------------------------------------------- /capstone_queries/silver/household_income_and_total_crimes_1997_2020.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | DATE_PARSE(CAST(i.year AS VARCHAR) || '-01-01', '%Y-%m-%d') AS time, -- Convert year to date for Grafana 3 | i.county, 4 | i.household_income_per_capita, 5 | c.total_crimes 6 | FROM ( 7 | -- Per Capita Personal Income query (inctype = 2) 8 | SELECT 9 | TRIM(REPLACE(areaname, 'County', '')) AS county, 10 | periodyear AS year, 11 | income AS household_income_per_capita 12 | FROM academy.tayloro.colorado_income_raw 13 | WHERE areatyname = 'County' 14 | AND periodyear BETWEEN 1997 AND 2020 15 | AND inctype = 2 16 | ) i 17 | LEFT JOIN ( 18 | -- Total Crimes Query 19 | SELECT 20 | EXTRACT(YEAR FROM incident_date) AS year, 21 | county_name AS county, 22 | COUNT(*) AS total_crimes 23 | FROM academy.tayloro.colorado_crimes 24 | WHERE incident_date IS NOT NULL 25 | AND county_name IS NOT NULL 26 | GROUP BY EXTRACT(YEAR FROM incident_date), county_name 27 | ) c 28 | ON i.year = c.year 29 | AND i.county = c.county 30 | WHERE i.county = 'Boulder' -- TODO: replace hard-coded county with a dynamic filter (e.g. '$county') 31 | ORDER BY i.year -------------------------------------------------------------------------------- /capstone_queries/silver/create_colorado_final_county_tier_rank.sql:
-------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_final_county_tier_rank AS 2 | WITH combined_ranks AS ( 3 | SELECT 4 | COALESCE(c.county_name, i.county, p.county) AS county, 5 | COALESCE(c.overall_crime_tier, 0) AS crime_rank, 6 | COALESCE(i.income_tier, 0) AS income_rank, 7 | COALESCE(p.crime_per_capita_tier, 0) AS population_rank 8 | FROM academy.tayloro.colorado_crime_tier_county_rank c 9 | FULL OUTER JOIN academy.tayloro.colorado_income_household_tier_rank i 10 | ON LOWER(c.county_name) = LOWER(i.county) 11 | FULL OUTER JOIN academy.tayloro.colorado_population_crime_per_capita_rank p 12 | ON LOWER(c.county_name) = LOWER(p.county) 13 | ), 14 | average_rank AS ( 15 | SELECT 16 | county, 17 | crime_rank, 18 | income_rank, 19 | population_rank, 20 | -- Apply weights: crime is weighted 2x, income and population 1x each. 21 | ROUND((2 * crime_rank + income_rank + population_rank) / 4.0) AS final_rank 22 | FROM combined_ranks 23 | WHERE crime_rank > 0 AND income_rank > 0 AND population_rank > 0 24 | ) 25 | SELECT * 26 | FROM average_rank 27 | ORDER BY final_rank ASC; -------------------------------------------------------------------------------- /capstone_queries/silver/calculate_crime_percentile_for_categories.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE IF NOT EXISTS tayloro.colorado_crime_tier_stolen_property AS 2 | WITH crime_data AS ( 3 | -- Compute total crime count per county from 1997-2020 4 | SELECT 5 | county_name, 6 | COUNT(*) AS total_crimes 7 | FROM academy.tayloro.colorado_crimes 8 | WHERE offense_category_name = 'Stolen Property Offenses' 9 | AND EXTRACT(YEAR FROM incident_date) BETWEEN 1997 AND 2020 10 | GROUP BY county_name 11 | ), 12 | crime_percentiles AS ( 13 | -- Assign percentile ranks based on crime count 14 | SELECT 15 | county_name, 16 | total_crimes, 17 | PERCENT_RANK() OVER (ORDER BY total_crimes) AS crime_percentile
18 | FROM crime_data 19 | ), 20 | crime_tiers AS ( 21 | -- Categorize counties into Tiers 22 | SELECT 23 | county_name, 24 | total_crimes, 25 | crime_percentile, 26 | CASE 27 | WHEN crime_percentile >= 0.75 THEN 4 -- Very High Crime 28 | WHEN crime_percentile >= 0.50 THEN 3 -- High Crime 29 | WHEN crime_percentile >= 0.25 THEN 2 -- Moderate Crime 30 | ELSE 1 -- Low Crime 31 | END AS crime_tier 32 | FROM crime_percentiles 33 | ) 34 | SELECT * FROM crime_tiers; -------------------------------------------------------------------------------- /scripts/getS3bucketfilelist.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | import os 3 | import sys 4 | from dotenv import load_dotenv 5 | 6 | # Add the parent directory to the module search path so aws_secret_manager can be imported 7 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 8 | 9 | from aws_secret_manager import get_secret 10 | 11 | 12 | # Load environment variables from .env file 13 | load_dotenv() 14 | 15 | # Fetch AWS keys from environment variables 16 | aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 17 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 18 | tabular_credential = get_secret("TABULAR_CREDENTIAL") 19 | catalog_name = get_secret("CATALOG_NAME") 20 | 21 | # S3 bucket name 22 | s3_bucket = "< zach bucket name >" 23 | 24 | # Initialize the S3 client 25 | s3 = boto3.client( 26 | "s3", 27 | aws_access_key_id=aws_access_key_id, 28 | aws_secret_access_key=aws_secret_access_key 29 | ) 30 | 31 | # Bucket and prefix (reuse the bucket defined above) 32 | bucket_name = s3_bucket 33 | prefix = "tayloro-capstone-data/" 34 | 35 | # List objects in the S3 bucket directory 36 | response = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix) 37 | 38 | # Extract and print file names 39 | if "Contents" in response: 40 | file_names = [obj["Key"] for obj in response["Contents"]] 41 | print("\n".join(file_names)) 42 | else: 43 | print("No files found.")
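A caveat on the listing script above: a single `list_objects_v2` call returns at most 1,000 keys, so a large prefix needs pagination (boto3's `get_paginator("list_objects_v2")`). The page-flattening step can be sketched in pure Python — `keys_from_pages` is a hypothetical helper, and the fake pages below stand in for real responses:

```python
def keys_from_pages(pages):
    """Flatten 'Contents' entries from list_objects_v2 response pages.

    Each page is a dict shaped like a boto3 list_objects_v2 response;
    pages with no matching objects omit the 'Contents' key entirely,
    which is why .get() with a default is used.
    """
    keys = []
    for page in pages:
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# With boto3 this would be driven by a paginator, e.g.:
#   paginator = s3.get_paginator("list_objects_v2")
#   keys = keys_from_pages(paginator.paginate(Bucket=bucket_name, Prefix=prefix))
# Here the helper is exercised with fake pages instead.
fake_pages = [
    {"Contents": [{"Key": "tayloro-capstone-data/a.csv"},
                  {"Key": "tayloro-capstone-data/b.csv"}]},
    {},  # an empty page has no 'Contents' key
]
print(keys_from_pages(fake_pages))
```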
-------------------------------------------------------------------------------- /capstone_queries/silver/calculate_colorado_household_income_tier_rank.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_income_household_tier_rank AS 2 | WITH income_aggregated AS ( 3 | -- Step 1: Aggregate income per county across all years and compute average 4 | SELECT 5 | county, 6 | AVG(income) AS avg_median_household_income 7 | FROM academy.tayloro.colorado_income_1997_2020 8 | WHERE inctype = 3 -- 3 represents "Median Household Income" 9 | GROUP BY county 10 | ), 11 | income_percentiles AS ( 12 | -- Step 2: Compute percentile rank for average income per county 13 | SELECT 14 | county, 15 | avg_median_household_income, 16 | PERCENT_RANK() OVER (ORDER BY avg_median_household_income) AS income_percentile 17 | FROM income_aggregated 18 | ), 19 | income_tiers AS ( 20 | -- Step 3: Assign tiers based on percentile rankings 21 | SELECT 22 | county, 23 | avg_median_household_income, 24 | income_percentile, 25 | CASE 26 | WHEN income_percentile >= 0.75 THEN 1 -- Highest income counties (Top 25%) 27 | WHEN income_percentile >= 0.50 THEN 2 -- Middle-High (50%-75%) 28 | WHEN income_percentile >= 0.25 THEN 3 -- Middle-Low (25%-50%) 29 | ELSE 4 -- Lowest income counties (Bottom 25%) 30 | END AS income_tier 31 | FROM income_percentiles 32 | ) 33 | SELECT * FROM income_tiers; 34 | -------------------------------------------------------------------------------- /capstone_queries/silver/calculate_final_county_tier_rank.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_final_county_tier_rank AS 2 | WITH combined_ranks AS ( 3 | SELECT 4 | COALESCE(c.county_name, i.county, p.county) AS county, 5 | COALESCE(c.overall_crime_tier, 0) AS crime_rank, 6 | COALESCE(i.income_tier, 0) AS income_rank, 7 | COALESCE(p.crime_per_capita_tier, 0) AS population_rank 8 | FROM
academy.tayloro.colorado_crime_tier_county_rank c 9 | FULL OUTER JOIN academy.tayloro.colorado_income_household_tier_rank i 10 | ON LOWER(c.county_name) = LOWER(i.county) 11 | FULL OUTER JOIN academy.tayloro.colorado_population_crime_per_capita_rank p 12 | ON LOWER(c.county_name) = LOWER(p.county) 13 | ), 14 | average_rank AS ( 15 | SELECT 16 | county, 17 | crime_rank, 18 | income_rank, 19 | population_rank, 20 | ROUND( 21 | (crime_rank + income_rank + population_rank) / 22 | NULLIF((CASE WHEN crime_rank > 0 THEN 1 ELSE 0 END 23 | + CASE WHEN income_rank > 0 THEN 1 ELSE 0 END 24 | + CASE WHEN population_rank > 0 THEN 1 ELSE 0 END), 0) 25 | ) AS final_rank -- Calculate the average rank and round to nearest whole number 26 | FROM combined_ranks 27 | WHERE crime_rank > 0 AND income_rank > 0 AND population_rank > 0 -- Ensure all three ranks are present (non-zero) 28 | ) 29 | SELECT * FROM average_rank 30 | ORDER BY final_rank ASC; -- Order by the final average rank 31 | -------------------------------------------------------------------------------- /capstone_queries/silver/create_yearly_county_crime_per_capita_1997_2020.sql: -------------------------------------------------------------------------------- 1 | INSERT INTO tayloro.coloraodo_county_crime_rate_per_capita 2 | WITH crimes_per_year AS ( 3 | SELECT 4 | county_name, 5 | YEAR(incident_date) AS periodyear, -- Extract year from incident_date 6 | COUNT(*) AS total_crimes 7 | FROM tayloro.colorado_crimes 8 | WHERE 9 | incident_date IS NOT NULL 10 | AND YEAR(incident_date) BETWEEN 1997 AND 2020 -- Filter within date range 11 | GROUP BY county_name, YEAR(incident_date) -- Group by county and year 12 | ), 13 | population_per_year AS ( 14 | SELECT 15 | county AS county_name, 16 | YEAR AS periodyear, 17 | SUM(TOTALPOPULATION) AS total_population -- Aggregate population by year and county 18 | FROM tayloro.colorado_population_raw 19 | WHERE DATATYPE = 'Estimate' -- Use only estimated population data 20 | GROUP BY county, YEAR -- Group by
county and year 21 | ), 22 | crime_rate_per_capita AS ( 23 | SELECT 24 | c.county_name, 25 | c.periodyear, 26 | c.total_crimes, 27 | p.total_population, 28 | (c.total_crimes * 100000.0 / p.total_population) AS crime_per_100k -- Crime per 100,000 people 29 | FROM crimes_per_year c 30 | JOIN population_per_year p 31 | ON LOWER(TRIM(c.county_name)) = LOWER(TRIM(p.county_name)) -- Ensure proper join 32 | AND c.periodyear = p.periodyear -- Match years 33 | ) 34 | SELECT * FROM crime_rate_per_capita; -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_city_crime_time_likelihood.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_city_crime_time_likelihood AS 2 | WITH crime_counts AS ( 3 | SELECT 4 | city, 5 | offense_category_name, 6 | CASE 7 | WHEN incident_hour BETWEEN 6 AND 18 THEN 'Day' 8 | ELSE 'Night' 9 | END AS time_of_day, 10 | COUNT(*) AS total_crimes 11 | FROM tayloro.colorado_crimes_with_cities 12 | WHERE incident_hour IS NOT NULL 13 | AND offense_category_name IS NOT NULL 14 | AND city IS NOT NULL 15 | GROUP BY 16 | city, 17 | offense_category_name, 18 | CASE 19 | WHEN incident_hour BETWEEN 6 AND 18 THEN 'Day' 20 | ELSE 'Night' 21 | END 22 | ), 23 | crime_pivot AS ( 24 | SELECT 25 | city, 26 | offense_category_name, 27 | AVG(CASE WHEN time_of_day = 'Day' THEN total_crimes ELSE NULL END) AS avg_day_crimes, 28 | AVG(CASE WHEN time_of_day = 'Night' THEN total_crimes ELSE NULL END) AS avg_night_crimes 29 | FROM crime_counts 30 | GROUP BY city, offense_category_name 31 | ) 32 | SELECT 33 | city, 34 | offense_category_name, 35 | avg_day_crimes, 36 | avg_night_crimes, 37 | CASE 38 | WHEN avg_night_crimes > avg_day_crimes THEN TRUE 39 | ELSE FALSE 40 | END AS is_night_likely, 41 | CASE 42 | WHEN avg_day_crimes > avg_night_crimes THEN TRUE 43 | ELSE FALSE 44 | END AS is_day_likely 45 | FROM crime_pivot 46 | ORDER BY city, 
offense_category_name; 47 | -------------------------------------------------------------------------------- /scripts/upload_capstone_csv_data_to_s3.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | 5 | # Load environment variables from the .env file 6 | load_dotenv() 7 | 8 | # Add the parent directory of "upload_to_s3.py" to the module search path 9 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 10 | 11 | from upload_to_s3 import upload_to_s3 12 | 13 | # Correct the relative path to skip the "include" directory 14 | LOCAL_FILE = os.path.abspath( 15 | os.path.join( 16 | os.path.dirname(__file__), # Current script's directory 17 | "../../../capstone_data/Colorado_County_Boundaries.csv" # Adjusted to point to capstone_data 18 | ) 19 | ) 20 | 21 | print(f"Resolved path: {LOCAL_FILE}") 22 | 23 | # Define the S3 bucket name and file path 24 | S3_BUCKET = "" # Replace with your actual S3 bucket name 25 | S3_FILE_PATH = "tayloro-capstone-data/Colorado_County_Boundaries.csv" 26 | 27 | def main(): 28 | # Ensure the file exists locally before uploading 29 | if not os.path.exists(LOCAL_FILE): 30 | print(f"Error: Local file not found at {LOCAL_FILE}") 31 | return 32 | 33 | file_size = os.path.getsize(LOCAL_FILE) 34 | print(f"File size: {file_size / (1024 * 1024):.2f} MB") 35 | 36 | #Upload the file to S3 37 | s3_path = upload_to_s3(LOCAL_FILE, S3_BUCKET, S3_FILE_PATH) 38 | 39 | # Verify and print the result 40 | if s3_path: 41 | print(f"File successfully uploaded to {s3_path}") 42 | else: 43 | print("Failed to upload the file to S3.") 44 | 45 | if __name__ == "__main__": 46 | main() -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_city_seasonal_crime_rates_1997_2020.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE 
tayloro.colorado_city_seasonal_crime_rates_1997_2020 AS 2 | WITH monthly_crimes AS ( 3 | SELECT 4 | city, 5 | MONTH(incident_date) AS crime_month, -- Numeric month (1-12) 6 | CASE 7 | WHEN MONTH(incident_date) = 1 THEN 'January' 8 | WHEN MONTH(incident_date) = 2 THEN 'February' 9 | WHEN MONTH(incident_date) = 3 THEN 'March' 10 | WHEN MONTH(incident_date) = 4 THEN 'April' 11 | WHEN MONTH(incident_date) = 5 THEN 'May' 12 | WHEN MONTH(incident_date) = 6 THEN 'June' 13 | WHEN MONTH(incident_date) = 7 THEN 'July' 14 | WHEN MONTH(incident_date) = 8 THEN 'August' 15 | WHEN MONTH(incident_date) = 9 THEN 'September' 16 | WHEN MONTH(incident_date) = 10 THEN 'October' 17 | WHEN MONTH(incident_date) = 11 THEN 'November' 18 | WHEN MONTH(incident_date) = 12 THEN 'December' 19 | END AS month_name, 20 | COUNT(*) AS total_crimes 21 | FROM tayloro.colorado_crimes_with_cities 22 | WHERE incident_date IS NOT NULL 23 | GROUP BY city, MONTH(incident_date) -- Ensure grouping includes month 24 | ), 25 | normalized_crime AS ( 26 | SELECT 27 | mc.city, 28 | mc.crime_month, 29 | mc.month_name, 30 | AVG(mc.total_crimes) AS avg_monthly_crimes 31 | FROM monthly_crimes mc 32 | GROUP BY mc.city, mc.crime_month, mc.month_name -- Ensure grouping includes crime_month 33 | ) 34 | SELECT 35 | city, 36 | crime_month, 37 | month_name, 38 | avg_monthly_crimes 39 | FROM normalized_crime 40 | ORDER BY city, crime_month; -------------------------------------------------------------------------------- /capstone_queries/gold/create_colorado_county_seasonal_crime_rates_1997_2020.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_county_seasonal_crime_rates_1997_2020 AS 2 | WITH monthly_crimes AS ( 3 | SELECT 4 | county_name, 5 | MONTH(incident_date) AS crime_month, -- Numeric month (1-12) 6 | CASE 7 | WHEN MONTH(incident_date) = 1 THEN 'January' 8 | WHEN MONTH(incident_date) = 2 THEN 'February' 9 | WHEN MONTH(incident_date) = 3 THEN 'March' 
10 | WHEN MONTH(incident_date) = 4 THEN 'April' 11 | WHEN MONTH(incident_date) = 5 THEN 'May' 12 | WHEN MONTH(incident_date) = 6 THEN 'June' 13 | WHEN MONTH(incident_date) = 7 THEN 'July' 14 | WHEN MONTH(incident_date) = 8 THEN 'August' 15 | WHEN MONTH(incident_date) = 9 THEN 'September' 16 | WHEN MONTH(incident_date) = 10 THEN 'October' 17 | WHEN MONTH(incident_date) = 11 THEN 'November' 18 | WHEN MONTH(incident_date) = 12 THEN 'December' 19 | END AS month_name, 20 | COUNT(*) AS total_crimes 21 | FROM tayloro.colorado_crimes 22 | WHERE incident_date IS NOT NULL 23 | GROUP BY county_name, MONTH(incident_date) -- Ensure grouping includes month 24 | ), 25 | normalized_crime AS ( 26 | SELECT 27 | mc.county_name, 28 | mc.crime_month, 29 | mc.month_name, 30 | AVG(mc.total_crimes) AS avg_monthly_crimes 31 | FROM monthly_crimes mc 32 | GROUP BY mc.county_name, mc.crime_month, mc.month_name -- Ensure grouping includes crime_month 33 | ) 34 | SELECT 35 | county_name, 36 | crime_month, 37 | month_name, 38 | avg_monthly_crimes 39 | FROM normalized_crime 40 | ORDER BY county_name, crime_month; 41 | -------------------------------------------------------------------------------- /scripts/modify_colorado_city_county_zip_csv.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import csv 4 | 5 | # Correct the relative path to skip the "include" directory 6 | LOCAL_FILE = os.path.abspath( 7 | os.path.join( 8 | os.path.dirname(__file__), # Current script's directory 9 | "../../../capstone_data/colorado_city_county_zip.csv" # Adjusted to point to capstone_data 10 | ) 11 | ) 12 | 13 | print(f"Resolved path: {LOCAL_FILE}") 14 | 15 | # Determine the output file path (we'll write the updated CSV in the same folder) 16 | output_file = os.path.join(os.path.dirname(LOCAL_FILE), "colorado_city_county_zip_updated.csv") 17 | print(f"Output file path: {output_file}") 18 | 19 | # Use utf-8-sig to handle potential BOM in the CSV 
file. 20 | with open(LOCAL_FILE, 'r', newline='', encoding='utf-8-sig') as infile: 21 | reader = csv.DictReader(infile) 22 | fieldnames = reader.fieldnames 23 | if not fieldnames: 24 | print("Error: No header row found in the CSV.") 25 | sys.exit(1) 26 | 27 | print("Found headers:", fieldnames) 28 | 29 | # Locate the ZIP Code header in a case-insensitive manner. 30 | zip_header = None 31 | for header in fieldnames: 32 | if header.strip().lower() == "zip code": 33 | zip_header = header 34 | break 35 | 36 | if not zip_header: 37 | print("Error: 'ZIP Code' column not found in the CSV.") 38 | sys.exit(1) 39 | 40 | updated_rows = [] 41 | for row in reader: 42 | zip_value = row[zip_header] 43 | # Remove the prefix "ZIP Code " if it appears 44 | if zip_value.startswith("ZIP Code "): 45 | row[zip_header] = zip_value.replace("ZIP Code ", "", 1) 46 | updated_rows.append(row) 47 | 48 | # Write the updated rows into a new CSV file. 49 | with open(output_file, 'w', newline='', encoding='utf-8') as outfile: 50 | writer = csv.DictWriter(outfile, fieldnames=fieldnames) 51 | writer.writeheader() 52 | writer.writerows(updated_rows) 53 | 54 | print(f"CSV updated successfully: {output_file}") 55 | -------------------------------------------------------------------------------- /capstone_queries/silver/calculate_population_crime_per_capita_tier.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE academy.tayloro.colorado_crime_per_capita_rank AS 2 | WITH crime_per_capita_per_year AS ( 3 | -- Step 1: Compute Crime Per Capita for Each Year (1997-2000) 4 | SELECT 5 | LOWER(TRIM(c.county_name)) AS county, -- Standardize county name 6 | EXTRACT(YEAR FROM c.incident_date) AS year, 7 | COUNT(*) * 1000.0 / p.total_population AS crime_per_1000 8 | FROM academy.tayloro.colorado_crimes c 9 | JOIN academy.tayloro.colorado_total_population_per_county_1997_2000 p 10 | ON LOWER(TRIM(c.county_name)) = LOWER(TRIM(p.county)) -- Match counties with 
case-insensitivity and trimming 11 | AND EXTRACT(YEAR FROM c.incident_date) = p.year 12 | WHERE p.total_population > 0 -- Avoid division by zero 13 | GROUP BY LOWER(TRIM(c.county_name)), EXTRACT(YEAR FROM c.incident_date), p.total_population 14 | ), 15 | average_crime_per_capita AS ( 16 | -- Step 2: Compute Average Crime Per Capita for 1997-2000 17 | SELECT 18 | county, 19 | AVG(crime_per_1000) AS avg_crime_per_1000 20 | FROM crime_per_capita_per_year 21 | GROUP BY county 22 | ), 23 | crime_distribution AS ( 24 | -- Step 3: Compute Percentile Rank Based on Average Crime Per Capita 25 | SELECT 26 | county, 27 | avg_crime_per_1000, 28 | PERCENT_RANK() OVER (ORDER BY avg_crime_per_1000) AS crime_percentile 29 | FROM average_crime_per_capita 30 | ), 31 | crime_tiers AS ( 32 | -- Step 4: Assign Crime Per Capita Tiers 33 | SELECT 34 | county, 35 | avg_crime_per_1000, 36 | crime_percentile, 37 | CASE 38 | WHEN crime_percentile >= 0.75 THEN 4 -- Highest crime per capita (Top 25%) 39 | WHEN crime_percentile >= 0.50 THEN 3 -- Medium-high (50%-75%) 40 | WHEN crime_percentile >= 0.25 THEN 2 -- Medium-low (25%-50%) 41 | ELSE 1 -- Lowest crime per capita (Bottom 25%) 42 | END AS crime_per_capita_tier 43 | FROM crime_distribution 44 | ) 45 | SELECT * FROM crime_tiers; 46 | -------------------------------------------------------------------------------- /capstone_queries/silver/calculate_population_crime_per_capita_rank.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_population_crime_per_capita_rank AS 2 | WITH crime_per_capita_per_year AS ( 3 | -- Step 1: Compute Crime Per Capita for Each Year (1997-2020) 4 | SELECT 5 | LOWER(c.county_name) AS county, -- Standardizing county name for consistent join 6 | EXTRACT(YEAR FROM c.incident_date) AS year, 7 | COUNT(*) * 1000.0 / p.total_population AS crime_per_1000 8 | FROM academy.tayloro.colorado_crimes c 9 | JOIN
academy.tayloro.colorado_total_population_per_county_1997_2020 p 10 | ON LOWER(c.county_name) = LOWER(p.county) -- Ensure lowercase match 11 | AND EXTRACT(YEAR FROM c.incident_date) = p.year 12 | WHERE p.total_population > 0 -- Avoid division by zero 13 | GROUP BY LOWER(c.county_name), EXTRACT(YEAR FROM c.incident_date), p.total_population 14 | ), 15 | average_crime_per_capita AS ( 16 | -- Step 2: Compute Average Crime Per Capita for 1997-2020 17 | SELECT 18 | county, 19 | AVG(crime_per_1000) AS avg_crime_per_1000 20 | FROM crime_per_capita_per_year 21 | GROUP BY county 22 | ), 23 | crime_distribution AS ( 24 | -- Step 3: Compute Percentile Rank Based on Average Crime Per Capita 25 | SELECT 26 | county, 27 | avg_crime_per_1000, 28 | PERCENT_RANK() OVER (ORDER BY avg_crime_per_1000) AS crime_percentile 29 | FROM average_crime_per_capita 30 | ), 31 | crime_tiers AS ( 32 | -- Step 4: Assign Crime Per Capita Tiers and Capitalize First Letter 33 | SELECT 34 | CONCAT(UPPER(SUBSTRING(county, 1, 1)), LOWER(SUBSTRING(county, 2))) AS county, -- Capitalize first letter 35 | avg_crime_per_1000, 36 | crime_percentile, 37 | CASE 38 | WHEN crime_percentile >= 0.75 THEN 4 -- Highest crime per capita (Top 25%) 39 | WHEN crime_percentile >= 0.50 THEN 3 -- Medium-high (50%-75%) 40 | WHEN crime_percentile >= 0.25 THEN 2 -- Medium-low (25%-50%) 41 | ELSE 1 -- Lowest crime per capita (Bottom 25%) 42 | END AS crime_per_capita_tier 43 | FROM crime_distribution 44 | ) 45 | SELECT * FROM crime_tiers; 46 | -------------------------------------------------------------------------------- /scripts/create_crime_rate_per_capita.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | from pyspark.sql import SparkSession 5 | from pyspark.conf import SparkConf 6 | from pyspark.sql.functions import regexp_replace 7 | 8 | # Add the parent directory of "upload_to_s3.py" to the module search path 9 |
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 10 | 11 | from aws_secret_manager import get_secret 12 | 13 | 14 | # Load environment variables from .env file 15 | load_dotenv() 16 | 17 | # Fetch AWS keys from environment variables 18 | aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 19 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 20 | tabular_credential = get_secret("TABULAR_CREDENTIAL") 21 | catalog_name = get_secret("CATALOG_NAME") 22 | 23 | # S3 bucket name 24 | s3_bucket = "< zach bucket name >" 25 | 26 | # Define required packages 27 | extra_packages = [ 28 | "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2", # Iceberg runtime 29 | "software.amazon.awssdk:bundle:2.17.178", # AWS SDK bundle 30 | "software.amazon.awssdk:url-connection-client:2.17.178", # URL connection client 31 | "org.apache.hadoop:hadoop-aws:3.3.4" # Hadoop AWS connector 32 | ] 33 | 34 | # Initialize SparkConf 35 | conf = SparkConf() 36 | conf.set('spark.jars.packages', ','.join(extra_packages)) 37 | conf.set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') 38 | conf.set('spark.sql.defaultCatalog', catalog_name) 39 | conf.set(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') 40 | conf.set(f'spark.sql.catalog.{catalog_name}.credential', tabular_credential) 41 | conf.set(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog') 42 | conf.set(f'spark.sql.catalog.{catalog_name}.warehouse', catalog_name) 43 | conf.set(f'spark.sql.catalog.{catalog_name}.uri', 'https://api.tabular.io/ws/') 44 | conf.set('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') 45 | conf.set('spark.sql.catalog.spark_catalog.type', 'hive') 46 | conf.set('spark.executor.memory', '16g') # Adjust the value based on your system's capacity 47 | conf.set('spark.sql.shuffle.partitions', '200') 48 | conf.set('spark.driver.memory', '16g') 49 
| 50 | 51 | 52 | # Initialize SparkSession 53 | spark = SparkSession.builder \ 54 | .appName("Load S3 data into Iceberg") \ 55 | .config(conf=conf) \ 56 | .getOrCreate() 57 | 58 | # Set AWS credentials for S3 access 59 | spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key_id) 60 | spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_access_key) 61 | spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com") 62 | 63 | spark.sql(""" 64 | CREATE TABLE IF NOT EXISTS tayloro.coloraodo_county_crime_rate_per_capita ( 65 | county_name STRING, 66 | periodyear INTEGER, 67 | total_crimes INTEGER, 68 | total_population INTEGER, 69 | crime_per_100k DOUBLE -- Crime per 100,000 people 70 | ) 71 | USING iceberg 72 | PARTITIONED BY (county_name); 73 | """) 74 | 75 | print("Iceberg table created successfully (if it didn't already exist).") -------------------------------------------------------------------------------- /capstone_queries/silver/create_colorado_crime_tier_county_rank.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_crime_tier_county_rank AS 2 | WITH county_scores AS ( 3 | SELECT 4 | county_name, 5 | -- Select the crime tiers for each category 6 | MAX(CASE WHEN table_name = 'colorado_crime_tier_property_destruction' THEN crime_tier ELSE NULL END) AS property_destruction_tier, 7 | MAX(CASE WHEN table_name = 'colorado_crime_tier_burglary' THEN crime_tier ELSE NULL END) AS burglary_tier, 8 | MAX(CASE WHEN table_name = 'colorado_crime_tier_larceny_theft' THEN crime_tier ELSE NULL END) AS larceny_theft_tier, 9 | MAX(CASE WHEN table_name = 'colorado_crime_tier_vehicle_theft' THEN crime_tier ELSE NULL END) AS vehicle_theft_tier, 10 | MAX(CASE WHEN table_name = 'colorado_crime_tier_robbery' THEN crime_tier ELSE NULL END) AS robbery_tier, 11 | MAX(CASE WHEN table_name = 'colorado_crime_tier_arson' THEN crime_tier ELSE NULL END) AS arson_tier, 12 | 
MAX(CASE WHEN table_name = 'colorado_crime_tier_stolen_property' THEN crime_tier ELSE NULL END) AS stolen_property_tier, 13 | -- Calculate the average of all crime tiers 14 | ROUND( 15 | AVG( 16 | CASE 17 | WHEN table_name IN ( 18 | 'colorado_crime_tier_property_destruction', 19 | 'colorado_crime_tier_burglary', 20 | 'colorado_crime_tier_larceny_theft', 21 | 'colorado_crime_tier_vehicle_theft', 22 | 'colorado_crime_tier_robbery', 23 | 'colorado_crime_tier_arson', 24 | 'colorado_crime_tier_stolen_property' 25 | ) THEN crime_tier 26 | ELSE NULL 27 | END 28 | ) 29 | ) AS overall_crime_tier -- Round to the nearest whole number 30 | FROM ( 31 | -- Union all 7 crime tier tables 32 | SELECT county_name, 'colorado_crime_tier_property_destruction' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_property_destruction 33 | UNION ALL 34 | SELECT county_name, 'colorado_crime_tier_burglary' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_burglary 35 | UNION ALL 36 | SELECT county_name, 'colorado_crime_tier_larceny_theft' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_larceny_theft 37 | UNION ALL 38 | SELECT county_name, 'colorado_crime_tier_vehicle_theft' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_vehicle_theft 39 | UNION ALL 40 | SELECT county_name, 'colorado_crime_tier_robbery' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_robbery 41 | UNION ALL 42 | SELECT county_name, 'colorado_crime_tier_arson' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_arson 43 | UNION ALL 44 | SELECT county_name, 'colorado_crime_tier_stolen_property' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_stolen_property 45 | ) combined 46 | GROUP BY county_name 47 | ) 48 | SELECT * FROM county_scores 49 | ORDER BY overall_crime_tier DESC; 50 | -------------------------------------------------------------------------------- /scripts/create_colorado_crimes_table.py: 
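The tier queries above all follow one pattern: rank counties by a metric with PERCENT_RANK(), bucket the percentile into quartile tiers 1-4, then average the tiers across categories. A minimal Python sketch of that bucketing logic, useful for sanity-checking expected tiers by hand; the county names and crime-per-1000 values are hypothetical, not taken from the pipeline:

```python
def percent_rank(values):
    """Mimic SQL PERCENT_RANK(): (rank - 1) / (n - 1); 0.0 for a single row."""
    n = len(values)
    ranked = sorted(values)
    if n == 1:
        return {values[0]: 0.0}
    # index() of a sorted list gives the first position of a value,
    # which matches PERCENT_RANK's treatment of ties
    return {v: ranked.index(v) / (n - 1) for v in values}

def crime_tier(percentile):
    """Mirror of the CASE expression used in the tier queries."""
    if percentile >= 0.75:
        return 4  # Highest crime per capita (top 25%)
    if percentile >= 0.50:
        return 3  # Medium-high (50%-75%)
    if percentile >= 0.25:
        return 2  # Medium-low (25%-50%)
    return 1      # Lowest crime per capita (bottom 25%)

# Hypothetical avg_crime_per_1000 values for four counties
counties = {"adams": 12.0, "boulder": 5.0, "denver": 20.0, "eagle": 8.0}
pct = percent_rank(list(counties.values()))
tiers = {c: crime_tier(pct[v]) for c, v in counties.items()}
```

With four counties the percentiles land at 0, 1/3, 2/3, and 1, so every quartile tier appears exactly once; with real data, ties in `avg_crime_per_1000` share a percentile and therefore a tier.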
-------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | from pyspark.sql import SparkSession 5 | from pyspark.conf import SparkConf 6 | from pyspark.sql.functions import regexp_replace 7 | 8 | # Add the parent directory of "upload_to_s3.py" to the module search path 9 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 10 | 11 | from aws_secret_manager import get_secret 12 | 13 | 14 | # Load environment variables from .env file 15 | load_dotenv() 16 | 17 | # Fetch AWS keys from environment variables 18 | aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 19 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 20 | tabular_credential = get_secret("TABULAR_CREDENTIAL") 21 | catalog_name = get_secret("CATALOG_NAME") 22 | 23 | # S3 bucket name 24 | s3_bucket = "< zach bucket name >" 25 | 26 | # Define required packages 27 | extra_packages = [ 28 | "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2", # Iceberg runtime 29 | "software.amazon.awssdk:bundle:2.17.178", # AWS SDK bundle 30 | "software.amazon.awssdk:url-connection-client:2.17.178", # URL connection client 31 | "org.apache.hadoop:hadoop-aws:3.3.4" # Hadoop AWS connector 32 | ] 33 | 34 | # Initialize SparkConf 35 | conf = SparkConf() 36 | conf.set('spark.jars.packages', ','.join(extra_packages)) 37 | conf.set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') 38 | conf.set('spark.sql.defaultCatalog', catalog_name) 39 | conf.set(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') 40 | conf.set(f'spark.sql.catalog.{catalog_name}.credential', tabular_credential) 41 | conf.set(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog') 42 | conf.set(f'spark.sql.catalog.{catalog_name}.warehouse', catalog_name) 43 | conf.set(f'spark.sql.catalog.{catalog_name}.uri', 
'https://api.tabular.io/ws/') 44 | conf.set('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') 45 | conf.set('spark.sql.catalog.spark_catalog.type', 'hive') 46 | conf.set('spark.executor.memory', '16g') # Adjust the value based on your system's capacity 47 | conf.set('spark.sql.shuffle.partitions', '200') 48 | conf.set('spark.driver.memory', '16g') 49 | 50 | 51 | 52 | # Initialize SparkSession 53 | spark = SparkSession.builder \ 54 | .appName("Load S3 data into Iceberg") \ 55 | .config(conf=conf) \ 56 | .getOrCreate() 57 | 58 | # Set AWS credentials for S3 access 59 | spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key_id) 60 | spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_access_key) 61 | spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com") 62 | 63 | 64 | # Create the Iceberg table if it doesn't already exist 65 | spark.sql("""
 66 | CREATE TABLE IF NOT EXISTS `tayloro-academy-warehouse`.tayloro.colorado_crimes ( 67 | agency_name STRING, 68 | county_name STRING, 69 | incident_date DATE, 70 | incident_hour INTEGER, 71 | offense_name STRING, 72 | crime_against STRING, 73 | offense_category_name STRING, 74 | age_num INTEGER 75 | ) 76 | USING iceberg 77 | PARTITIONED BY (county_name) 78 | """) 79 | print("Iceberg table created successfully (if it didn't already exist).") 80 | -------------------------------------------------------------------------------- /scripts/create_colorado_business_entities.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | from pyspark.sql import SparkSession 5 | from pyspark.conf import SparkConf 6 | from pyspark.sql.functions import regexp_replace 7 | 8 | # Add the parent directory of "upload_to_s3.py" to the module search path 9 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 10 | 11 | from
aws_secret_manager import get_secret 12 | 13 | 14 | # Load environment variables from .env file 15 | load_dotenv() 16 | 17 | # Fetch AWS keys from environment variables 18 | aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 19 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 20 | tabular_credential = get_secret("TABULAR_CREDENTIAL") 21 | catalog_name = get_secret("CATALOG_NAME") 22 | 23 | # S3 bucket name 24 | s3_bucket = "< zach bucket name >" 25 | 26 | # Define required packages 27 | extra_packages = [ 28 | "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2", # Iceberg runtime 29 | "software.amazon.awssdk:bundle:2.17.178", # AWS SDK bundle 30 | "software.amazon.awssdk:url-connection-client:2.17.178", # URL connection client 31 | "org.apache.hadoop:hadoop-aws:3.3.4" # Hadoop AWS connector 32 | ] 33 | 34 | # Initialize SparkConf 35 | conf = SparkConf() 36 | conf.set('spark.jars.packages', ','.join(extra_packages)) 37 | conf.set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') 38 | conf.set('spark.sql.defaultCatalog', catalog_name) 39 | conf.set(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') 40 | conf.set(f'spark.sql.catalog.{catalog_name}.credential', tabular_credential) 41 | conf.set(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog') 42 | conf.set(f'spark.sql.catalog.{catalog_name}.warehouse', catalog_name) 43 | conf.set(f'spark.sql.catalog.{catalog_name}.uri', 'https://api.tabular.io/ws/') 44 | conf.set('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') 45 | conf.set('spark.sql.catalog.spark_catalog.type', 'hive') 46 | conf.set('spark.executor.memory', '16g') # Adjust the value based on your system's capacity 47 | conf.set('spark.sql.shuffle.partitions', '200') 48 | conf.set('spark.driver.memory', '16g') 49 | 50 | 51 | 52 | # Initialize SparkSession 53 | spark = SparkSession.builder \ 54 | 
.appName("Load S3 data into Iceberg") \ 55 | .config(conf=conf) \ 56 | .getOrCreate() 57 | 58 | # Set AWS credentials for S3 access 59 | spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key_id) 60 | spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_access_key) 61 | spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com") 62 | 63 | spark.sql(""" 64 | CREATE TABLE IF NOT EXISTS `tayloro-academy-warehouse`.tayloro.colorado_business_entities ( 65 | entityid LONG, 66 | entityname STRING, 67 | principaladdress1 STRING, 68 | principaladdress2 STRING, 69 | principalcity STRING, 70 | principalcounty STRING, -- Mapped county from cities dataset 71 | principalstate STRING, 72 | principalzipcode STRING, 73 | principalcountry STRING, 74 | entitystatus STRING, 75 | jurisdictonofformation STRING, 76 | entitytype STRING, 77 | entityformdate DATE -- Ensure stored as DATE type 78 | ) 79 | USING iceberg 80 | PARTITIONED BY (principalcounty) 81 | """) 82 | 83 | print("Iceberg table created successfully (if it didn't already exist).") 84 | 85 | -------------------------------------------------------------------------------- /scripts/write_to_iceberg_from_S3.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | from pyspark.sql import SparkSession 5 | from pyspark.conf import SparkConf 6 | from pyspark.sql.functions import regexp_replace 7 | 8 | # Add the parent directory of "upload_to_s3.py" to the module search path 9 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 10 | 11 | from aws_secret_manager import get_secret 12 | 13 | 14 | # Load environment variables from .env file 15 | load_dotenv() 16 | 17 | # Fetch AWS keys from environment variables 18 | aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 19 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 20 | tabular_credential = 
get_secret("TABULAR_CREDENTIAL") 21 | catalog_name = get_secret("CATALOG_NAME") 22 | 23 | # S3 bucket name 24 | s3_bucket = "zachwilsonsorganization-522" 25 | 26 | # Define required packages 27 | extra_packages = [ 28 | "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2", # Iceberg runtime 29 | "software.amazon.awssdk:bundle:2.17.178", # AWS SDK bundle 30 | "software.amazon.awssdk:url-connection-client:2.17.178", # URL connection client 31 | "org.apache.hadoop:hadoop-aws:3.3.4" # Hadoop AWS connector 32 | ] 33 | 34 | # Initialize SparkConf 35 | conf = SparkConf() 36 | conf.set('spark.jars.packages', ','.join(extra_packages)) 37 | conf.set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') 38 | conf.set('spark.sql.defaultCatalog', catalog_name) 39 | conf.set(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') 40 | conf.set(f'spark.sql.catalog.{catalog_name}.credential', tabular_credential) 41 | conf.set(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog') 42 | conf.set(f'spark.sql.catalog.{catalog_name}.warehouse', catalog_name) 43 | conf.set(f'spark.sql.catalog.{catalog_name}.uri', 'https://api.tabular.io/ws/') 44 | conf.set('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') 45 | conf.set('spark.sql.catalog.spark_catalog.type', 'hive') 46 | conf.set('spark.executor.memory', '16g') # Adjust the value based on your system's capacity 47 | conf.set('spark.sql.shuffle.partitions', '200') 48 | conf.set('spark.driver.memory', '16g') 49 | 50 | 51 | 52 | # Initialize SparkSession 53 | spark = SparkSession.builder \ 54 | .appName("Load S3 data into Iceberg") \ 55 | .config(conf=conf) \ 56 | .getOrCreate() 57 | 58 | # Set AWS credentials for S3 access 59 | spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key_id) 60 | spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_access_key) 61 | 
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com") 62 | 63 | # Verify SparkSession 64 | print("SparkSession initialized successfully:", spark) 65 | 66 | # S3 path to your input file 67 | file_path = f"s3a://{s3_bucket}/tayloro-capstone-data/Colorado_County_Boundaries.csv" 68 | 69 | # Read the CSV file from the S3 bucket 70 | df = spark.read.format("csv") \ 71 | .option("header", "true") \ 72 | .option("inferSchema", "true") \ 73 | .load(file_path) 74 | 75 | 76 | df.printSchema() 77 | 78 | 79 | spark.sql(""" 80 | CREATE TABLE IF NOT EXISTS `eczachly-academy-warehouse`.tayloro.colorado_county_coordinates_raw ( 81 | county STRING, 82 | label STRING, 83 | cent_lat DOUBLE, 84 | cent_long DOUBLE 85 | ) 86 | USING iceberg 87 | PARTITIONED BY (county) 88 | """) 89 | 90 | # Write to the Iceberg table with Parquet format and overwrite partitions 91 | df.writeTo("`eczachly-academy-warehouse`.tayloro.colorado_county_coordinates_raw") \ 92 | .tableProperty("write.format.default", "parquet") \ 93 | .overwritePartitions() 94 | -------------------------------------------------------------------------------- /capstone_queries/bronze/write_colorado_county_coordinates_iceberg_from_S3.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | from pyspark.sql import SparkSession 5 | from pyspark.conf import SparkConf 6 | from pyspark.sql.functions import regexp_replace 7 | 8 | # Add the parent directory of "upload_to_s3.py" to the module search path 9 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 10 | 11 | from aws_secret_manager import get_secret 12 | 13 | 14 | # Load environment variables from .env file 15 | load_dotenv() 16 | 17 | # Fetch AWS keys from environment variables 18 | aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 19 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 20 | tabular_credential = 
get_secret("TABULAR_CREDENTIAL") 21 | catalog_name = get_secret("CATALOG_NAME") 22 | 23 | # S3 bucket name 24 | s3_bucket = "zachwilsonsorganization-522" 25 | 26 | # Define required packages 27 | extra_packages = [ 28 | "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2", # Iceberg runtime 29 | "software.amazon.awssdk:bundle:2.17.178", # AWS SDK bundle 30 | "software.amazon.awssdk:url-connection-client:2.17.178", # URL connection client 31 | "org.apache.hadoop:hadoop-aws:3.3.4" # Hadoop AWS connector 32 | ] 33 | 34 | # Initialize SparkConf 35 | conf = SparkConf() 36 | conf.set('spark.jars.packages', ','.join(extra_packages)) 37 | conf.set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') 38 | conf.set('spark.sql.defaultCatalog', catalog_name) 39 | conf.set(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') 40 | conf.set(f'spark.sql.catalog.{catalog_name}.credential', tabular_credential) 41 | conf.set(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog') 42 | conf.set(f'spark.sql.catalog.{catalog_name}.warehouse', catalog_name) 43 | conf.set(f'spark.sql.catalog.{catalog_name}.uri', 'https://api.tabular.io/ws/') 44 | conf.set('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') 45 | conf.set('spark.sql.catalog.spark_catalog.type', 'hive') 46 | conf.set('spark.executor.memory', '16g') # Adjust the value based on your system's capacity 47 | conf.set('spark.sql.shuffle.partitions', '200') 48 | conf.set('spark.driver.memory', '16g') 49 | 50 | 51 | 52 | # Initialize SparkSession 53 | spark = SparkSession.builder \ 54 | .appName("Load S3 data into Iceberg") \ 55 | .config(conf=conf) \ 56 | .getOrCreate() 57 | 58 | # Set AWS credentials for S3 access 59 | spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key_id) 60 | spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_access_key) 61 | 
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com") 62 | 63 | # Verify SparkSession 64 | print("SparkSession initialized successfully:", spark) 65 | 66 | # S3 path to your input file 67 | file_path = f"s3a://{s3_bucket}/tayloro-capstone-data/Colorado_County_Boundaries.csv" 68 | 69 | # Read the CSV file from the S3 bucket 70 | df = spark.read.format("csv") \ 71 | .option("header", "true") \ 72 | .option("inferSchema", "true") \ 73 | .load(file_path) 74 | 75 | 76 | df.printSchema() 77 | 78 | 79 | spark.sql(""" 80 | CREATE TABLE IF NOT EXISTS `eczachly-academy-warehouse`.tayloro.colorado_counties_coordinates_raw ( 81 | county STRING, 82 | label STRING, 83 | cent_lat DOUBLE, 84 | cent_long DOUBLE 85 | ) 86 | USING iceberg 87 | PARTITIONED BY (county) 88 | """) 89 | 90 | # Write to the Iceberg table with Parquet format and overwrite partitions 91 | df.writeTo("`eczachly-academy-warehouse`.tayloro.colorado_counties_coordinates_raw") \ 92 | .tableProperty("write.format.default", "parquet") \ 93 | .overwritePartitions() 94 | -------------------------------------------------------------------------------- /capstone_queries/bronze/write_colorado_crimes_1997_2020_iceberg_from_S3.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | from pyspark.sql import SparkSession 5 | from pyspark.conf import SparkConf 6 | from pyspark.sql.functions import regexp_replace 7 | 8 | # Add the parent directory of "upload_to_s3.py" to the module search path 9 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 10 | 11 | from aws_secret_manager import get_secret 12 | 13 | 14 | # Load environment variables from .env file 15 | load_dotenv() 16 | 17 | # Fetch AWS keys from environment variables 18 | aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 19 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 20 | tabular_credential = 
get_secret("TABULAR_CREDENTIAL") 21 | catalog_name = get_secret("CATALOG_NAME") 22 | 23 | # S3 bucket name 24 | s3_bucket = "zachwilsonsorganization-522" 25 | 26 | # Define required packages 27 | extra_packages = [ 28 | "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2", # Iceberg runtime 29 | "software.amazon.awssdk:bundle:2.17.178", # AWS SDK bundle 30 | "software.amazon.awssdk:url-connection-client:2.17.178", # URL connection client 31 | "org.apache.hadoop:hadoop-aws:3.3.4" # Hadoop AWS connector 32 | ] 33 | 34 | # Initialize SparkConf 35 | conf = SparkConf() 36 | conf.set('spark.jars.packages', ','.join(extra_packages)) 37 | conf.set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') 38 | conf.set('spark.sql.defaultCatalog', catalog_name) 39 | conf.set(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') 40 | conf.set(f'spark.sql.catalog.{catalog_name}.credential', tabular_credential) 41 | conf.set(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog') 42 | conf.set(f'spark.sql.catalog.{catalog_name}.warehouse', catalog_name) 43 | conf.set(f'spark.sql.catalog.{catalog_name}.uri', 'https://api.tabular.io/ws/') 44 | conf.set('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') 45 | conf.set('spark.sql.catalog.spark_catalog.type', 'hive') 46 | conf.set('spark.executor.memory', '16g') # Adjust the value based on your system's capacity 47 | conf.set('spark.sql.shuffle.partitions', '200') 48 | conf.set('spark.driver.memory', '16g') 49 | 50 | 51 | 52 | # Initialize SparkSession 53 | spark = SparkSession.builder \ 54 | .appName("Load S3 data into Iceberg") \ 55 | .config(conf=conf) \ 56 | .getOrCreate() 57 | 58 | # Set AWS credentials for S3 access 59 | spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key_id) 60 | spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_access_key) 61 | 
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com") 62 | 63 | # Verify SparkSession 64 | print("SparkSession initialized successfully:", spark) 65 | 66 | # S3 path to your input file 67 | file_path = f"s3a://{s3_bucket}/tayloro-capstone-data/Colorado_Crimes_1995_2015.csv" 68 | 69 | # Read the CSV file from the S3 bucket 70 | df = spark.read.format("csv") \ 71 | .option("header", "true") \ 72 | .option("inferSchema", "true") \ 73 | .load(file_path) 74 | 75 | 76 | df.printSchema() 77 | 78 | 79 | spark.sql(""" 80 | CREATE TABLE IF NOT EXISTS `eczachly-academy-warehouse`.tayloro.colorado_crimes_1995_2015_raw ( 81 | pub_agency_name STRING, 82 | county_name STRING, 83 | incident_date STRING, 84 | incident_hour INTEGER, 85 | offense_name STRING, 86 | crime_against STRING, 87 | offense_category_name STRING, 88 | offense_group STRING, 89 | age_num INTEGER 90 | ) 91 | USING iceberg 92 | PARTITIONED BY (county_name) 93 | """) 94 | 95 | # Write to the Iceberg table with Parquet format and overwrite partitions 96 | df.writeTo("`eczachly-academy-warehouse`.tayloro.colorado_crimes_1995_2015_raw") \ 97 | .tableProperty("write.format.default", "parquet") \ 98 | .overwritePartitions() 99 | -------------------------------------------------------------------------------- /capstone_queries/bronze/write_colorado_income_iceberg_from_S3.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | from pyspark.sql import SparkSession 5 | from pyspark.conf import SparkConf 6 | from pyspark.sql.functions import regexp_replace 7 | 8 | # Add the parent directory of "upload_to_s3.py" to the module search path 9 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 10 | 11 | from aws_secret_manager import get_secret 12 | 13 | 14 | # Load environment variables from .env file 15 | load_dotenv() 16 | 17 | # Fetch AWS keys from environment variables 18 | 
aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 19 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 20 | tabular_credential = get_secret("TABULAR_CREDENTIAL") 21 | catalog_name = get_secret("CATALOG_NAME") 22 | 23 | # S3 bucket name 24 | s3_bucket = "zachwilsonsorganization-522" 25 | 26 | # Define required packages 27 | extra_packages = [ 28 | "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2", # Iceberg runtime 29 | "software.amazon.awssdk:bundle:2.17.178", # AWS SDK bundle 30 | "software.amazon.awssdk:url-connection-client:2.17.178", # URL connection client 31 | "org.apache.hadoop:hadoop-aws:3.3.4" # Hadoop AWS connector 32 | ] 33 | 34 | # Initialize SparkConf 35 | conf = SparkConf() 36 | conf.set('spark.jars.packages', ','.join(extra_packages)) 37 | conf.set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') 38 | conf.set('spark.sql.defaultCatalog', catalog_name) 39 | conf.set(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') 40 | conf.set(f'spark.sql.catalog.{catalog_name}.credential', tabular_credential) 41 | conf.set(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog') 42 | conf.set(f'spark.sql.catalog.{catalog_name}.warehouse', catalog_name) 43 | conf.set(f'spark.sql.catalog.{catalog_name}.uri', 'https://api.tabular.io/ws/') 44 | conf.set('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') 45 | conf.set('spark.sql.catalog.spark_catalog.type', 'hive') 46 | conf.set('spark.executor.memory', '16g') # Adjust the value based on your system's capacity 47 | conf.set('spark.sql.shuffle.partitions', '200') 48 | conf.set('spark.driver.memory', '16g') 49 | 50 | 51 | 52 | # Initialize SparkSession 53 | spark = SparkSession.builder \ 54 | .appName("Load S3 data into Iceberg") \ 55 | .config(conf=conf) \ 56 | .getOrCreate() 57 | 58 | # Set AWS credentials for S3 access 59 | 
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key_id) 60 | spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_access_key) 61 | spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com") 62 | 63 | # Verify SparkSession 64 | print("SparkSession initialized successfully:", spark) 65 | 66 | # S3 path to your input file 67 | file_path = f"s3a://{s3_bucket}/tayloro-capstone-data/Colorado_Crimes_2016_2020.csv" 68 | 69 | # Read the CSV file from the S3 bucket 70 | df = spark.read.format("csv") \ 71 | .option("header", "true") \ 72 | .option("inferSchema", "true") \ 73 | .load(file_path) 74 | 75 | 76 | df.printSchema() 77 | 78 | 79 | # spark.sql(""" 80 | # CREATE TABLE IF NOT EXISTS `eczachly-academy-warehouse`.tayloro.colorado_income_raw ( 81 | # pub_agency_name STRING, 82 | # county_name STRING, 83 | # incident_date STRING, 84 | # incident_hour INTEGER, 85 | # offense_name STRING, 86 | # crime_against STRING, 87 | # offense_category_name STRING, 88 | # offense_group STRING, 89 | # age_num INTEGER 90 | # ) 91 | # USING iceberg 92 | # PARTITIONED BY (county_name) 93 | # """) 94 | # 95 | # # Write to the Iceberg table with Parquet format and overwrite partitions 96 | # df.writeTo("`eczachly-academy-warehouse`.tayloro.colorado_income_raw") \ 97 | # .tableProperty("write.format.default", "parquet") \ 98 | # .overwritePartitions() 99 | -------------------------------------------------------------------------------- /capstone_queries/bronze/write_colorado_crimes_2016_2020_iceberg_from_S3.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | from pyspark.sql import SparkSession 5 | from pyspark.conf import SparkConf 6 | from pyspark.sql.functions import regexp_replace 7 | 8 | # Add the parent directory of "upload_to_s3.py" to the module search path 9 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), 
"../"))) 10 | 11 | from aws_secret_manager import get_secret 12 | 13 | 14 | # Load environment variables from .env file 15 | load_dotenv() 16 | 17 | # Fetch AWS keys from environment variables 18 | aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 19 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 20 | tabular_credential = get_secret("TABULAR_CREDENTIAL") 21 | catalog_name = get_secret("CATALOG_NAME") 22 | 23 | # S3 bucket name 24 | s3_bucket = "zachwilsonsorganization-522" 25 | 26 | # Define required packages 27 | extra_packages = [ 28 | "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2", # Iceberg runtime 29 | "software.amazon.awssdk:bundle:2.17.178", # AWS SDK bundle 30 | "software.amazon.awssdk:url-connection-client:2.17.178", # URL connection client 31 | "org.apache.hadoop:hadoop-aws:3.3.4" # Hadoop AWS connector 32 | ] 33 | 34 | # Initialize SparkConf 35 | conf = SparkConf() 36 | conf.set('spark.jars.packages', ','.join(extra_packages)) 37 | conf.set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') 38 | conf.set('spark.sql.defaultCatalog', catalog_name) 39 | conf.set(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') 40 | conf.set(f'spark.sql.catalog.{catalog_name}.credential', tabular_credential) 41 | conf.set(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog') 42 | conf.set(f'spark.sql.catalog.{catalog_name}.warehouse', catalog_name) 43 | conf.set(f'spark.sql.catalog.{catalog_name}.uri', 'https://api.tabular.io/ws/') 44 | conf.set('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') 45 | conf.set('spark.sql.catalog.spark_catalog.type', 'hive') 46 | conf.set('spark.executor.memory', '16g') # Adjust the value based on your system's capacity 47 | conf.set('spark.sql.shuffle.partitions', '200') 48 | conf.set('spark.driver.memory', '16g') 49 | 50 | 51 | 52 | # Initialize SparkSession 53 | spark = 
SparkSession.builder \ 54 | .appName("Load S3 data into Iceberg") \ 55 | .config(conf=conf) \ 56 | .getOrCreate() 57 | 58 | # Set AWS credentials for S3 access 59 | spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key_id) 60 | spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_access_key) 61 | spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com") 62 | 63 | # Verify SparkSession 64 | print("SparkSession initialized successfully:", spark) 65 | 66 | # S3 path to your input file 67 | file_path = f"s3a://{s3_bucket}/tayloro-capstone-data/Colorado_Crimes_2016_2020.csv" 68 | 69 | # Read the CSV file from the S3 bucket 70 | df = spark.read.format("csv") \ 71 | .option("header", "true") \ 72 | .option("inferSchema", "true") \ 73 | .load(file_path) 74 | 75 | 76 | df.printSchema() 77 | 78 | 79 | print(f"Catalog Name: {catalog_name}") 80 | 81 | 82 | spark.sql(""" 83 | CREATE TABLE IF NOT EXISTS `eczachly-academy-warehouse`.tayloro.colorado_crimes_2016_2020_raw_new ( 84 | pub_agency_name STRING, 85 | county_name STRING, 86 | incident_date STRING, 87 | incident_hour INTEGER, 88 | offense_name STRING, 89 | crime_against STRING, 90 | offense_category_name STRING, 91 | offense_group STRING, 92 | age_num INTEGER 93 | ) 94 | USING iceberg 95 | PARTITIONED BY (county_name) 96 | """) 97 | 98 | # Write to the Iceberg table with Parquet format and overwrite partitions 99 | df.writeTo("`eczachly-academy-warehouse`.tayloro.colorado_crimes_2016_2020_raw_new") \ 100 | .tableProperty("write.format.default", "parquet") \ 101 | .overwritePartitions() 102 | -------------------------------------------------------------------------------- /capstone_queries/silver/create_colorado_crimes_with_cities.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_crimes_with_cities AS 2 | SELECT 3 | -- Clean and coalesce agency names from both datasets 4 | COALESCE( 5 | trim( 6 | 
regexp_replace( 7 | regexp_replace( 8 | regexp_replace( 9 | regexp_replace(a.agency_name, '(?i)\\bDrug Enforcement Team\\b', 'Drug Task Force'), 10 | '(?i)\\s+Police Department\\b', '' 11 | ), 12 | '(?i)\\s+County Sheriff(?:''s Office)?\\b', '' 13 | ), 14 | '(?i)\\s+Sheriff(?:''s Office)?\\b', '' 15 | ) 16 | ), 17 | trim( 18 | regexp_replace( 19 | regexp_replace( 20 | regexp_replace( 21 | regexp_replace(b.pub_agency_name, '(?i)\\bDrug Enforcement Team\\b', 'Drug Task Force'), 22 | '(?i)\\s+Police Department\\b', '' 23 | ), 24 | '(?i)\\s+County Sheriff(?:''s Office)?\\b', '' 25 | ), 26 | '(?i)\\s+Sheriff(?:''s Office)?\\b', '' 27 | ) 28 | ) 29 | ) AS agency_name, 30 | 31 | -- Standardize and clean county names with first letter capitalized 32 | COALESCE( 33 | UPPER(SUBSTRING(LOWER(TRIM(a.primary_county)), 1, 1)) || LOWER(SUBSTRING(LOWER(TRIM(a.primary_county)), 2)), 34 | UPPER(SUBSTRING(LOWER(TRIM(b.county_name)), 1, 1)) || LOWER(SUBSTRING(LOWER(TRIM(b.county_name)), 2)) 35 | ) AS county_name, 36 | 37 | -- Combine and clean incident dates, converting them to DATE 38 | COALESCE( 39 | CAST(date_parse(TRIM(a.incident_date), '%m/%d/%Y') AS DATE), 40 | CAST(date_parse(TRIM(b.incident_date), '%m/%d/%Y') AS DATE) 41 | ) AS incident_date, 42 | 43 | -- Standardize incident hour 44 | COALESCE(a.incident_hour, b.incident_hour) AS incident_hour, 45 | 46 | -- Capitalize offense names (first letter capitalized, rest lowercase) 47 | COALESCE( 48 | UPPER(SUBSTRING(LOWER(TRIM(a.offense_name)), 1, 1)) || LOWER(SUBSTRING(LOWER(TRIM(a.offense_name)), 2)), 49 | UPPER(SUBSTRING(LOWER(TRIM(b.offense_name)), 1, 1)) || LOWER(SUBSTRING(LOWER(TRIM(b.offense_name)), 2)) 50 | ) AS offense_name, 51 | 52 | -- Clean crime_against and offense_category_name 53 | COALESCE(a.crime_against, b.crime_against) AS crime_against, 54 | COALESCE(a.offense_category_name, b.offense_category_name) AS offense_category_name, 55 | 56 | -- Clean age_num 57 | COALESCE(a.age_num, b.age_num) AS age_num, 58 | 59 | -- 
Coalesce the city values: use city_name from the 1997–2015 data when available; otherwise, use city from the 2016–2020 data 60 | COALESCE(a.city_name, b.city) AS city 61 | FROM 62 | tayloro.colorado_crimes_1997_2015_raw a 63 | FULL OUTER JOIN 64 | tayloro.colorado_crimes_2016_2020_with_cities b 65 | ON 66 | -- Clean agency names in the join condition so they match properly 67 | LOWER( 68 | trim( 69 | regexp_replace( 70 | regexp_replace( 71 | regexp_replace( 72 | regexp_replace(a.agency_name, '(?i)\\bDrug Enforcement Team\\b', 'Drug Task Force'), 73 | '(?i)\\s+Police Department\\b', '' 74 | ), 75 | '(?i)\\s+County Sheriff(?:''s Office)?\\b', '' 76 | ), 77 | '(?i)\\s+Sheriff(?:''s Office)?\\b', '' 78 | ) 79 | ) 80 | ) = 81 | LOWER( 82 | trim( 83 | regexp_replace( 84 | regexp_replace( 85 | regexp_replace( 86 | regexp_replace(b.pub_agency_name, '(?i)\\bDrug Enforcement Team\\b', 'Drug Task Force'), 87 | '(?i)\\s+Police Department\\b', '' 88 | ), 89 | '(?i)\\s+County Sheriff(?:''s Office)?\\b', '' 90 | ), 91 | '(?i)\\s+Sheriff(?:''s Office)?\\b', '' 92 | ) 93 | ) 94 | ) 95 | AND LOWER(TRIM(a.primary_county)) = LOWER(TRIM(b.county_name)) 96 | AND CAST(date_parse(TRIM(a.incident_date), '%m/%d/%Y') AS DATE) = CAST(date_parse(TRIM(b.incident_date), '%m/%d/%Y') AS DATE) 97 | AND a.incident_hour = b.incident_hour 98 | AND a.offense_name = b.offense_name; -------------------------------------------------------------------------------- /capstone_queries/silver/create_colorado_crime_weighted_scores.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_crime_weighted_scores AS 2 | WITH crime_weighted AS ( 3 | SELECT 4 | TRIM(SPLIT_PART(county_name, ';', 1)) AS county_name, -- Extract first county if multiple are listed 5 | EXTRACT(YEAR FROM incident_date) AS year, -- Correctly extract year for grouping 6 | SUM(CASE WHEN offense_category_name = 'Destruction/Damage/Vandalism of Property' THEN 10 ELSE 0 END) AS 
destruction_vandalism, 7 | SUM(CASE WHEN offense_category_name = 'Burglary/Breaking & Entering' THEN 9 ELSE 0 END) AS burglary, 8 | SUM(CASE WHEN offense_category_name = 'Larceny/Theft Offenses' THEN 8 ELSE 0 END) AS larceny_theft, 9 | SUM(CASE WHEN offense_category_name = 'Motor Vehicle Theft' THEN 7 ELSE 0 END) AS motor_vehicle_theft, 10 | SUM(CASE WHEN offense_category_name = 'Robbery' THEN 7 ELSE 0 END) AS robbery, 11 | SUM(CASE WHEN offense_category_name = 'Arson' THEN 7 ELSE 0 END) AS arson, 12 | SUM(CASE WHEN offense_category_name = 'Stolen Property Offenses' THEN 6 ELSE 0 END) AS stolen_property, 13 | SUM(CASE WHEN offense_category_name = 'Counterfeiting/Forgery' THEN 5 ELSE 0 END) AS counterfeiting_forgery, 14 | SUM(CASE WHEN offense_category_name = 'Fraud Offenses' THEN 5 ELSE 0 END) AS fraud, 15 | SUM(CASE WHEN offense_category_name = 'Bribery' THEN 4 ELSE 0 END) AS bribery, 16 | SUM(CASE WHEN offense_category_name = 'Extortion/Blackmail' THEN 4 ELSE 0 END) AS extortion_blackmail, 17 | SUM(CASE WHEN offense_category_name = 'Embezzlement' THEN 3 ELSE 0 END) AS embezzlement, 18 | SUM(CASE WHEN offense_category_name = 'Weapon Law Violations' THEN 3 ELSE 0 END) AS weapon_law_violations, 19 | SUM(CASE WHEN offense_category_name = 'Drug/Narcotic Offenses' THEN 3 ELSE 0 END) AS drug_offenses, 20 | SUM(CASE WHEN offense_category_name = 'Prostitution Offenses' THEN 2 ELSE 0 END) AS prostitution_offenses, 21 | SUM(CASE WHEN offense_category_name = 'Kidnapping/Abduction' THEN 2 ELSE 0 END) AS kidnapping_abduction, 22 | SUM(CASE WHEN offense_category_name = 'Sex Offenses' THEN 2 ELSE 0 END) AS sex_offenses, 23 | SUM(CASE WHEN offense_category_name = 'Human Trafficking' THEN 1 ELSE 0 END) AS human_trafficking, 24 | SUM(CASE WHEN offense_category_name = 'Gambling Offenses' THEN 1 ELSE 0 END) AS gambling_offenses, 25 | SUM(CASE WHEN offense_category_name = 'Animal Cruelty' THEN 1 ELSE 0 END) AS animal_cruelty, 26 | SUM( 27 | CASE 28 | WHEN offense_category_name = 
'Destruction/Damage/Vandalism of Property' THEN 10 29 | WHEN offense_category_name = 'Burglary/Breaking & Entering' THEN 9 30 | WHEN offense_category_name = 'Larceny/Theft Offenses' THEN 8 31 | WHEN offense_category_name = 'Motor Vehicle Theft' THEN 7 32 | WHEN offense_category_name = 'Robbery' THEN 7 33 | WHEN offense_category_name = 'Arson' THEN 7 34 | WHEN offense_category_name = 'Stolen Property Offenses' THEN 6 35 | WHEN offense_category_name = 'Counterfeiting/Forgery' THEN 5 36 | WHEN offense_category_name = 'Fraud Offenses' THEN 5 37 | WHEN offense_category_name = 'Bribery' THEN 4 38 | WHEN offense_category_name = 'Extortion/Blackmail' THEN 4 39 | WHEN offense_category_name = 'Embezzlement' THEN 3 40 | WHEN offense_category_name = 'Weapon Law Violations' THEN 3 41 | WHEN offense_category_name = 'Drug/Narcotic Offenses' THEN 3 42 | WHEN offense_category_name = 'Prostitution Offenses' THEN 2 43 | WHEN offense_category_name = 'Kidnapping/Abduction' THEN 2 44 | WHEN offense_category_name = 'Sex Offenses' THEN 2 45 | WHEN offense_category_name = 'Human Trafficking' THEN 1 46 | WHEN offense_category_name = 'Gambling Offenses' THEN 1 47 | WHEN offense_category_name = 'Animal Cruelty' THEN 1 48 | ELSE 0 49 | END 50 | ) AS total_crime_score 51 | FROM tayloro.colorado_crimes 52 | WHERE TRIM(county_name) IS NOT NULL 53 | AND TRIM(county_name) <> '' 54 | AND incident_date IS NOT NULL 55 | GROUP BY county_name, EXTRACT(YEAR FROM incident_date) -- Fix: Use correct grouping 56 | ) 57 | SELECT * FROM crime_weighted; -------------------------------------------------------------------------------- /capstone_queries/silver/create_colorado_crimes.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE tayloro.colorado_crimes AS 2 | SELECT 3 | -- Clean and coalesce agency names using regex replacements: 4 | COALESCE( 5 | trim( 6 | regexp_replace( 7 | regexp_replace( 8 | regexp_replace( 9 | regexp_replace(a.agency_name, '(?i)\\bDrug 
Enforcement Team\\b', 'Drug Task Force'), 10 | '(?i)\\s+Police Department\\b', '' 11 | ), 12 | '(?i)\\s+County Sheriff(?:''s Office)?\\b', '' 13 | ), 14 | '(?i)\\s+Sheriff(?:''s Office)?\\b', '' 15 | ) 16 | ), 17 | trim( 18 | regexp_replace( 19 | regexp_replace( 20 | regexp_replace( 21 | regexp_replace(b.pub_agency_name, '(?i)\\bDrug Enforcement Team\\b', 'Drug Task Force'), 22 | '(?i)\\s+Police Department\\b', '' 23 | ), 24 | '(?i)\\s+County Sheriff(?:''s Office)?\\b', '' 25 | ), 26 | '(?i)\\s+Sheriff(?:''s Office)?\\b', '' 27 | ) 28 | ) 29 | ) AS agency_name, 30 | 31 | -- Normalize county names: convert to lowercase, replace semicolons with commas, 32 | -- remove quotes, split on commas, trim each element, sort, and rejoin. 33 | COALESCE( 34 | array_join( 35 | array_sort( 36 | transform( 37 | split( 38 | regexp_replace(regexp_replace(lower(trim(a.primary_county)), ';', ','), '"', '') 39 | , ',' 40 | ), 41 | x -> trim(x) 42 | ) 43 | ), 44 | ', ' 45 | ), 46 | array_join( 47 | array_sort( 48 | transform( 49 | split( 50 | regexp_replace(regexp_replace(lower(trim(b.county_name)), ';', ','), '"', '') 51 | , ',' 52 | ), 53 | x -> trim(x) 54 | ) 55 | ), 56 | ', ' 57 | ) 58 | ) AS county_name, 59 | 60 | -- Parse and coalesce incident dates 61 | COALESCE( 62 | CAST(date_parse(trim(a.incident_date), '%m/%d/%Y') AS DATE), 63 | CAST(date_parse(trim(b.incident_date), '%m/%d/%Y') AS DATE) 64 | ) AS incident_date, 65 | 66 | -- Standardize incident hour 67 | COALESCE(a.incident_hour, b.incident_hour) AS incident_hour, 68 | 69 | -- Capitalize offense names (first letter uppercase, rest lowercase) 70 | COALESCE( 71 | UPPER(SUBSTRING(LOWER(trim(a.offense_name)), 1, 1)) || LOWER(SUBSTRING(LOWER(trim(a.offense_name)), 2)), 72 | UPPER(SUBSTRING(LOWER(trim(b.offense_name)), 1, 1)) || LOWER(SUBSTRING(LOWER(trim(b.offense_name)), 2)) 73 | ) AS offense_name, 74 | 75 | -- Coalesce crime_against and offense_category_name 76 | COALESCE(a.crime_against, b.crime_against) AS crime_against, 77 | 
COALESCE(a.offense_category_name, b.offense_category_name) AS offense_category_name, 78 | 79 | -- Coalesce age_num 80 | COALESCE(a.age_num, b.age_num) AS age_num 81 | FROM 82 | tayloro.colorado_crimes_1997_2015_raw a 83 | FULL OUTER JOIN 84 | tayloro.colorado_crimes_2016_2020_raw b 85 | ON 86 | -- Join on the cleaned agency names: 87 | LOWER( 88 | trim( 89 | regexp_replace( 90 | regexp_replace( 91 | regexp_replace( 92 | regexp_replace(a.agency_name, '(?i)\\bDrug Enforcement Team\\b', 'Drug Task Force'), 93 | '(?i)\\s+Police Department\\b', '' 94 | ), 95 | '(?i)\\s+County Sheriff(?:''s Office)?\\b', '' 96 | ), 97 | '(?i)\\s+Sheriff(?:''s Office)?\\b', '' 98 | ) 99 | ) 100 | ) = 101 | LOWER( 102 | trim( 103 | regexp_replace( 104 | regexp_replace( 105 | regexp_replace( 106 | regexp_replace(b.pub_agency_name, '(?i)\\bDrug Enforcement Team\\b', 'Drug Task Force'), 107 | '(?i)\\s+Police Department\\b', '' 108 | ), 109 | '(?i)\\s+County Sheriff(?:''s Office)?\\b', '' 110 | ), 111 | '(?i)\\s+Sheriff(?:''s Office)?\\b', '' 112 | ) 113 | ) 114 | ) 115 | AND 116 | -- Join on normalized county names: 117 | LOWER( 118 | array_join( 119 | array_sort( 120 | transform( 121 | split( 122 | regexp_replace(regexp_replace(lower(trim(a.primary_county)), ';', ','), '"', '') 123 | , ',' 124 | ), 125 | x -> trim(x) 126 | ) 127 | ), 128 | ', ' 129 | ) 130 | ) = LOWER( 131 | array_join( 132 | array_sort( 133 | transform( 134 | split( 135 | regexp_replace(regexp_replace(lower(trim(b.county_name)), ';', ','), '"', '') 136 | , ',' 137 | ), 138 | x -> trim(x) 139 | ) 140 | ), 141 | ', ' 142 | ) 143 | ) 144 | AND CAST(date_parse(trim(a.incident_date), '%m/%d/%Y') AS DATE) = CAST(date_parse(trim(b.incident_date), '%m/%d/%Y') AS DATE) 145 | AND a.incident_hour = b.incident_hour 146 | AND a.offense_name = b.offense_name; 147 | -------------------------------------------------------------------------------- /capstone_queries/silver/clean_address_colorado_business_entity_data.sql: 
-------------------------------------------------------------------------------- 1 | SELECT 2 | cbe.principalcity, 3 | COUNT(*) AS cnt 4 | FROM tayloro.colorado_business_entities cbe 5 | WHERE cbe.principalstate = 'CO' 6 | AND NOT EXISTS ( 7 | SELECT 1 8 | FROM tayloro.colorado_city_county_zip ccz 9 | WHERE LOWER(ccz.city) = LOWER(cbe.principalcity) 10 | ) 11 | GROUP BY cbe.principalcity 12 | ORDER BY cbe.principalcity; 13 | 14 | SELECT 15 | cbe.principalzipcode, 16 | cbe.principalcity, 17 | COUNT(*) AS record_count 18 | FROM tayloro.colorado_business_entities cbe 19 | LEFT JOIN tayloro.colorado_city_county_zip ccz 20 | ON TRIM(LOWER(cbe.principalcity)) = TRIM(LOWER(ccz.city)) 21 | AND TRIM(CAST(cbe.principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 22 | WHERE cbe.principalstate = 'CO' 23 | AND ccz.city IS NULL 24 | AND cbe.principalzipcode IS NOT NULL 25 | AND cbe.principalcity IS NOT NULL 26 | GROUP BY 27 | cbe.principalzipcode, 28 | cbe.principalcity 29 | ORDER BY 30 | record_count DESC; 31 | 32 | 33 | 34 | 35 | SELECT * FROM tayloro.colorado_business_entities WHERE principalzipcode = '80401'; 36 | 37 | SELECT * FROM tayloro.colorado_city_county_zip WHERE LOWER(city) LIKE '%timn%' ORDER BY zip_code; 38 | SELECT * FROM tayloro.colorado_city_county_zip WHERE zip_code = 80303; 39 | 40 | INSERT INTO tayloro.colorado_city_county_zip (city, zip_code, county) 41 | VALUES ('Evans', 80634, 'Weld'), 42 | ('Evans', 80645, 'Weld'); 43 | 44 | SELECT DISTINCT principalzipcode, principalcounty 45 | FROM tayloro.colorado_business_entities 46 | WHERE LOWER(principalcounty) LIKE '%douglas%'; 47 | 48 | SELECT COUNT(*) AS unmatched_count 49 | FROM tayloro.colorado_business_entities cbe 50 | LEFT JOIN tayloro.colorado_city_county_zip ccz 51 | ON TRIM(LOWER(cbe.principalcity)) = TRIM(LOWER(ccz.city)) 52 | AND TRIM(CAST(cbe.principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 53 | WHERE cbe.principalstate = 'CO' 54 | AND ccz.city IS NULL; 55 | 56 | 57 | 58
| UPDATE tayloro.colorado_business_entities 59 | SET principalcity = 'Peyton' 60 | WHERE principalstate = 'CO' 61 | AND principalcity = 'peyton co'; 62 | 63 | 64 | UPDATE tayloro.colorado_business_entities 65 | SET principalcity = 'Fort Collins' 66 | WHERE principalstate = 'CO' 67 | AND principalzipcode = '80524'; 68 | 69 | 70 | 71 | SELECT * 72 | FROM tayloro.colorado_business_entities cbe 73 | LEFT JOIN tayloro.colorado_city_county_zip ccz 74 | ON LOWER(cbe.principalcity) = LOWER(ccz.city) 75 | AND cbe.principalzipcode = CAST(ccz.zip_code AS VARCHAR(5)) 76 | WHERE cbe.principalstate = 'CO' 77 | AND ccz.city IS NULL; 78 | 79 | 80 | SELECT * 81 | FROM tayloro.colorado_business_entities 82 | WHERE principalzipcode = '80422' 83 | AND LOWER(principalcity) <> 'black hawk'; 84 | 85 | SELECT 86 | CASE 87 | WHEN ccz.city IS NULL THEN 'No Match' 88 | ELSE 'Match' 89 | END AS match_status, 90 | COUNT(*) AS count 91 | FROM tayloro.colorado_business_entities cbe 92 | LEFT JOIN tayloro.colorado_city_county_zip ccz 93 | ON LOWER(cbe.principalcity) = LOWER(ccz.city) 94 | AND cbe.principalzipcode = CAST(ccz.zip_code AS VARCHAR(5)) 95 | WHERE cbe.principalstate = 'CO' 96 | GROUP BY 97 | CASE 98 | WHEN ccz.city IS NULL THEN 'No Match' 99 | ELSE 'Match' 100 | END; 101 | 102 | 103 | SELECT * 104 | FROM tayloro.colorado_business_entities cbe 105 | LEFT JOIN tayloro.colorado_city_county_zip ccz 106 | ON TRIM(LOWER(cbe.principalcity)) = TRIM(LOWER(ccz.city)) 107 | AND TRIM(CAST(cbe.principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 108 | WHERE cbe.principalstate = 'CO' 109 | AND ccz.city IS NULL; 110 | 111 | SELECT * FROM tayloro.colorado_business_entities WHERE principalcounty LIKE '%County%'; 112 | 113 | 114 | 115 | 116 | UPDATE tayloro.colorado_business_entities 117 | SET 118 | principalcity = ( 119 | SELECT geo_city 120 | FROM tayloro.colorado_temp_geocoded_entities 121 | WHERE entityid = tayloro.colorado_business_entities.entityid 122 | ), 123 | principalzipcode = (
124 | SELECT geo_postcode 125 | FROM tayloro.colorado_temp_geocoded_entities 126 | WHERE entityid = tayloro.colorado_business_entities.entityid 127 | ), 128 | principalcounty = ( 129 | SELECT replace(geo_county, ' County', '') 130 | FROM tayloro.colorado_temp_geocoded_entities 131 | WHERE entityid = tayloro.colorado_business_entities.entityid 132 | ) 133 | WHERE entityid IN ( 134 | SELECT entityid 135 | FROM tayloro.colorado_temp_geocoded_entities 136 | ); 137 | 138 | 139 | SELECT cbe.principalcity, ctge.geo_city, cbe.principalcounty, ctge.geo_county, cbe.principalzipcode, ctge.geo_postcode FROM tayloro.colorado_temp_geocoded_entities ctge 140 | JOIN tayloro.colorado_business_entities cbe 141 | ON ctge.entityid = cbe.entityid; 142 | 143 | UPDATE tayloro.colorado_business_entities 144 | SET 145 | principalcity = ( 146 | SELECT geo_city 147 | FROM tayloro.colorado_temp_geocoded_entities 148 | WHERE entityid = tayloro.colorado_business_entities.entityid 149 | ), 150 | principalzipcode = ( 151 | SELECT geo_postcode 152 | FROM tayloro.colorado_temp_geocoded_entities 153 | WHERE entityid = tayloro.colorado_business_entities.entityid 154 | ), 155 | principalcounty = ( 156 | SELECT replace(geo_county, ' County', '') 157 | FROM tayloro.colorado_temp_geocoded_entities 158 | WHERE entityid = tayloro.colorado_business_entities.entityid 159 | ) 160 | WHERE entityid IN ( 161 | SELECT entityid 162 | FROM tayloro.colorado_temp_geocoded_entities 163 | ); 164 | 165 | SELECT 166 | CASE 167 | WHEN ccz.city IS NOT NULL THEN 'Match' 168 | ELSE 'No Match' 169 | END AS match_status, 170 | COUNT(*) AS count 171 | FROM tayloro.colorado_business_entities cbe 172 | LEFT JOIN tayloro.colorado_city_county_zip ccz 173 | ON TRIM(LOWER(cbe.principalcity)) = TRIM(LOWER(ccz.city)) 174 | AND TRIM(CAST(cbe.principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 175 | WHERE cbe.principalstate = 'CO' 176 | GROUP BY 177 | CASE WHEN ccz.city IS NOT NULL THEN 'Match' ELSE 'No Match' END; 178 |
179 | 180 | 181 | 182 | SELECT DISTINCT 183 | cbe.principalzipcode, 184 | cbe.principalcity 185 | FROM tayloro.colorado_business_entities cbe 186 | LEFT JOIN tayloro.colorado_city_county_zip ccz 187 | ON TRIM(LOWER(cbe.principalcity)) = TRIM(LOWER(ccz.city)) 188 | AND TRIM(CAST(cbe.principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 189 | WHERE cbe.principalstate = 'CO' 190 | AND ccz.city IS NULL 191 | ORDER BY cbe.principalcity, cbe.principalzipcode; 192 | 193 | 194 | DELETE FROM tayloro.colorado_business_entities 195 | WHERE principalstate = 'CO' 196 | AND entityid IN ( 197 | SELECT cbe.entityid 198 | FROM tayloro.colorado_business_entities cbe 199 | LEFT JOIN tayloro.colorado_city_county_zip ccz 200 | ON TRIM(LOWER(cbe.principalcity)) = TRIM(LOWER(ccz.city)) 201 | AND TRIM(CAST(cbe.principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 202 | WHERE cbe.principalstate = 'CO' 203 | AND ccz.city IS NULL 204 | ); 205 | 206 | SELECT county, COUNT(*) AS count 207 | FROM tayloro.colorado_city_county_zip 208 | WHERE LOWER(county) LIKE '%county%' 209 | GROUP BY county 210 | ORDER BY count DESC; 211 | 212 | SELECT 213 | cbe.entityid, 214 | cbe.principalcity, 215 | cbe.principalzipcode, 216 | ccz.city AS ref_city, 217 | ccz.zip_code AS ref_zip, 218 | ccz.county AS ref_county 219 | FROM tayloro.colorado_business_entities cbe 220 | JOIN tayloro.colorado_city_county_zip ccz 221 | ON TRIM(LOWER(cbe.principalcity)) = TRIM(LOWER(ccz.city)) 222 | AND TRIM(CAST(cbe.principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 223 | WHERE cbe.principalstate = 'CO' 224 | LIMIT 100; 225 | 226 | 227 | UPDATE tayloro.colorado_business_entities 228 | SET principalcounty = ( 229 | SELECT ccz.county 230 | FROM tayloro.colorado_city_county_zip ccz 231 | WHERE TRIM(LOWER(principalcity)) = TRIM(LOWER(ccz.city)) 232 | AND TRIM(CAST(principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 233 | ) 234 | WHERE principalstate = 'CO'; 235 | 236 | CREATE 
TABLE tayloro.colorado_city_county_zip_new AS 237 | SELECT DISTINCT zip_code, city, county 238 | FROM tayloro.colorado_city_county_zip; 239 | 240 | SELECT COUNT(*) AS original_count FROM tayloro.colorado_city_county_zip; 241 | SELECT COUNT(*) AS distinct_count FROM tayloro.colorado_city_county_zip_new; 242 | 243 | DROP TABLE tayloro.colorado_city_county_zip; 244 | ALTER TABLE tayloro.colorado_city_county_zip_new RENAME TO tayloro.colorado_city_county_zip; 245 | 246 | 247 | SELECT principalcity, principalzipcode, principalcounty 248 | FROM tayloro.colorado_business_entities 249 | WHERE principalstate = 'CO' 250 | LIMIT 20; 251 | -------------------------------------------------------------------------------- /scripts/business_entity_address_validation.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from dotenv import load_dotenv 4 | from pyspark.sql import SparkSession 5 | from pyspark.conf import SparkConf 6 | from pyspark.sql import functions as F 7 | from pyspark.sql.functions import concat_ws, col, udf 8 | from pyspark.sql.types import StructType, StructField, StringType 9 | import requests 10 | 11 | # Add the parent directory of "upload_to_s3.py" to the module search path 12 | sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../"))) 13 | 14 | from aws_secret_manager import get_secret 15 | 16 | # Load environment variables from .env file 17 | load_dotenv() 18 | 19 | # Fetch AWS keys and catalog credentials from environment variables/secrets 20 | aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID") 21 | aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY") 22 | tabular_credential = get_secret("TABULAR_CREDENTIAL") 23 | catalog_name = get_secret("CATALOG_NAME") 24 | 25 | # S3 bucket name (if needed) 26 | s3_bucket = "< zach bucket name >" 27 | 28 | # Define required packages 29 | extra_packages = [ 30 | "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2", 31 | 
"software.amazon.awssdk:bundle:2.17.178", 32 | "software.amazon.awssdk:url-connection-client:2.17.178", 33 | "org.apache.hadoop:hadoop-aws:3.3.4" 34 | ] 35 | 36 | # Initialize SparkConf 37 | conf = SparkConf() 38 | conf.set('spark.jars.packages', ','.join(extra_packages)) 39 | conf.set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') 40 | conf.set('spark.sql.defaultCatalog', catalog_name) 41 | conf.set(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') 42 | conf.set(f'spark.sql.catalog.{catalog_name}.credential', tabular_credential) 43 | conf.set(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog') 44 | conf.set(f'spark.sql.catalog.{catalog_name}.warehouse', catalog_name) 45 | conf.set(f'spark.sql.catalog.{catalog_name}.uri', 'https://api.tabular.io/ws/') 46 | conf.set('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') 47 | conf.set('spark.sql.catalog.spark_catalog.type', 'hive') 48 | conf.set('spark.executor.memory', '16g') 49 | conf.set('spark.sql.shuffle.partitions', '200') 50 | conf.set('spark.driver.memory', '16g') 51 | 52 | # Initialize SparkSession 53 | spark = SparkSession.builder \ 54 | .appName("Geoapify Address Validation with EntityID") \ 55 | .config(conf=conf) \ 56 | .getOrCreate() 57 | 58 | # (Optional) Set AWS credentials if reading from S3; if not needed, these can be omitted. 
59 | spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key_id) 60 | spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_access_key) 61 | spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com") 62 | 63 | print("SparkSession initialized successfully:", spark) 64 | 65 | # ------------------------------------------------------------------------------ 66 | # Read the required tables into DataFrames 67 | # ------------------------------------------------------------------------------ 68 | # Business Entities DataFrame 69 | cbeDF = spark.table("tayloro.colorado_business_entities") 70 | # Reference table (city, county, zip) DataFrame 71 | cczDF = spark.table("tayloro.colorado_city_county_zip") 72 | 73 | # ------------------------------------------------------------------------------ 74 | # Prepare DataFrames for Joining 75 | # ------------------------------------------------------------------------------ 76 | # For cbeDF: create lower-case version of principalcity and ensure ZIP is a string. 77 | cbe_prepped = cbeDF.withColumn("city_lower", F.lower(F.col("principalcity"))) \ 78 | .withColumn("zip_str", F.col("principalzipcode")) 79 | # For cczDF: create lower-case version of city and cast zip_code to string. 
80 | ccz_prepped = cczDF.withColumn("city_lower", F.lower(F.col("city"))) \ 81 | .withColumn("zip_str", F.col("zip_code").cast("string")) 82 | 83 | # Join the DataFrames on the prepped columns 84 | enrichedDF = cbe_prepped.join(ccz_prepped, on=["city_lower", "zip_str"], how="left") 85 | 86 | # Filter for Colorado records where the reference table did not match (i.e., unmatched addresses) 87 | unmatchedDF = enrichedDF.filter( 88 | (F.col("principalstate") == "CO") & 89 | (F.col("city").isNull()) # "city" from the reference table (ccz) 90 | ) 91 | 92 | print("Total unmatched rows:", unmatchedDF.count()) 93 | 94 | # ------------------------------------------------------------------------------ 95 | # Build the Full Address and Define the Geocoding UDF 96 | # ------------------------------------------------------------------------------ 97 | # Create a full address column. You can adjust the concatenation as needed. 98 | unmatchedDF = unmatchedDF.withColumn( 99 | "full_address", 100 | concat_ws(", ", F.col("principaladdress1"), F.col("principalcity"), F.col("principalzipcode")) 101 | ) 102 | 103 | # Define a Python function to call the Geoapify API and extract city, county, state, and postcode. 
104 | def geocode_address(address): 105 | if not address: 106 | return (None, None, None, None) 107 | base_url = "https://api.geoapify.com/v1/geocode/search" 108 | params = {"text": address, "apiKey": os.getenv("GEOAPIFY_API_KEY")}  # api_key was undefined here; env var name is an assumption 109 | try: 110 | resp = requests.get(base_url, params=params, timeout=10)  # params= lets requests URL-encode the address 111 | if resp.status_code == 200: 112 | data = resp.json() 113 | if "features" in data and len(data["features"]) > 0: 114 | properties = data["features"][0].get("properties", {}) 115 | city = properties.get("city", None) 116 | county = properties.get("county", None) 117 | state = properties.get("state", None) 118 | postcode = properties.get("postcode", None) 119 | return (city, county, state, postcode) 120 | return (None, None, None, None) 121 | except Exception: 122 | return (None, None, None, None) 123 | 124 | # Define the schema for the geocode response as a struct 125 | geocode_schema = StructType([ 126 | StructField("geo_city", StringType(), True), 127 | StructField("geo_county", StringType(), True), 128 | StructField("geo_state", StringType(), True), 129 | StructField("geo_postcode", StringType(), True) 130 | ]) 131 | 132 | # Register the UDF 133 | geocode_udf = udf(geocode_address, geocode_schema) 134 | 135 | # Apply the UDF to the DataFrame to get geocoded data 136 | geocodedDF = unmatchedDF.withColumn("geocode", geocode_udf(F.col("full_address"))) 137 | 138 | # Extract the individual fields from the geocode struct and retain the entityid 139 | geocodedDF = geocodedDF.withColumn("geo_city", F.col("geocode.geo_city")) \ 140 | .withColumn("geo_county", F.col("geocode.geo_county")) \ 141 | .withColumn("geo_state", F.col("geocode.geo_state")) \ 142 | .withColumn("geo_postcode", F.col("geocode.geo_postcode")) \ 143 | .drop("geocode") 144 | 145 | # ------------------------------------------------------------------------------ 146 | # Select only the columns that match the target table schema 147 | # Target Table Columns: entityid, principaladdress1, principalcity, principalzipcode, full_address, 148 | # 
geo_city, geo_county, geo_state, geo_postcode 149 | # ------------------------------------------------------------------------------ 150 | outputDF = geocodedDF.select( 151 | "entityid", 152 | "principaladdress1", 153 | "principalcity", 154 | "principalzipcode", 155 | "full_address", 156 | "geo_city", 157 | "geo_county", 158 | "geo_state", 159 | "geo_postcode" 160 | ) 161 | 162 | # (Optional) Show sample results 163 | outputDF.show(20, truncate=False) 164 | 165 | # ------------------------------------------------------------------------------ 166 | # Create the target Iceberg table if it doesn't exist 167 | # ------------------------------------------------------------------------------ 168 | spark.sql(""" 169 | CREATE TABLE IF NOT EXISTS `tayloro-academy-warehouse`.tayloro.colorado_temp_geocoded_entities ( 170 | entityid BIGINT, 171 | principaladdress1 STRING, 172 | principalcity STRING, 173 | principalzipcode STRING, 174 | full_address STRING, 175 | geo_city STRING, 176 | geo_county STRING, 177 | geo_state STRING, 178 | geo_postcode STRING 179 | ) 180 | USING iceberg 181 | """) 182 | 183 | # ------------------------------------------------------------------------------ 184 | # Write the selected output DataFrame to the Iceberg table 185 | # ------------------------------------------------------------------------------ 186 | outputDF.writeTo("`tayloro-academy-warehouse`.tayloro.colorado_temp_geocoded_entities") \ 187 | .tableProperty("write.format.default", "parquet") \ 188 | .overwritePartitions() 189 | 190 | print("Geocoded data successfully written to Iceberg table: `tayloro-academy-warehouse`.tayloro.colorado_temp_geocoded_entities") 191 | -------------------------------------------------------------------------------- /Capstone-data-dictionary.csv: -------------------------------------------------------------------------------- 1 | schema,table,attribute,type,level,Data Lineage Source 2 | 
tayloro,colorado_avg_age_per_crime_category_1997_2020,city,varchar,gold,colorado_crimes_with_cities 3 | tayloro,colorado_avg_age_per_crime_category_1997_2020,offense_category_name,varchar,gold,colorado_crimes_with_cities 4 | tayloro,colorado_avg_age_per_crime_category_1997_2020,avg_age,double,gold,colorado_crimes_with_cities 5 | tayloro,colorado_business_entities,entityid,bigint,silver,colorado_business_entities_raw 6 | tayloro,colorado_business_entities,entityname,varchar,silver,colorado_business_entities_raw 7 | tayloro,colorado_business_entities,principaladdress1,varchar,silver,colorado_business_entities_raw 8 | tayloro,colorado_business_entities,principaladdress2,varchar,silver,colorado_business_entities_raw 9 | tayloro,colorado_business_entities,principalcity,varchar,silver,colorado_business_entities_raw 10 | tayloro,colorado_business_entities,principalcounty,varchar,silver,colorado_business_entities_raw 11 | tayloro,colorado_business_entities,principalstate,varchar,silver,colorado_business_entities_raw 12 | tayloro,colorado_business_entities,principalzipcode,varchar,silver,colorado_business_entities_raw 13 | tayloro,colorado_business_entities,principalcountry,varchar,silver,colorado_business_entities_raw 14 | tayloro,colorado_business_entities,entitystatus,varchar,silver,colorado_business_entities_raw 15 | tayloro,colorado_business_entities,jurisdictonofformation,varchar,silver,colorado_business_entities_raw 16 | tayloro,colorado_business_entities,entitytype,varchar,silver,colorado_business_entities_raw 17 | tayloro,colorado_business_entities,entityformdate,date,silver,colorado_business_entities_raw 18 | tayloro,colorado_business_entities_duplicates,entityid,bigint,silver,colorado_business_entities_stg_{{yesterday_ds}} 19 | tayloro,colorado_business_entities_duplicates,entityname,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 20 | 
tayloro,colorado_business_entities_duplicates,principaladdress1,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 21 | tayloro,colorado_business_entities_duplicates,principaladdress2,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 22 | tayloro,colorado_business_entities_duplicates,principalcity,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 23 | tayloro,colorado_business_entities_duplicates,principalcounty,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 24 | tayloro,colorado_business_entities_duplicates,principalstate,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 25 | tayloro,colorado_business_entities_duplicates,principalzipcode,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 26 | tayloro,colorado_business_entities_duplicates,principalcountry,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 27 | tayloro,colorado_business_entities_duplicates,entitystatus,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 28 | tayloro,colorado_business_entities_duplicates,jurisdictonofformation,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 29 | tayloro,colorado_business_entities_duplicates,entitytype,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 30 | tayloro,colorado_business_entities_duplicates,entityformdate,date,silver,colorado_business_entities_stg_{{yesterday_ds}} 31 | tayloro,colorado_business_entities_misfit_entities,entityid,bigint,silver,colorado_business_entities_stg_{{yesterday_ds}} 32 | tayloro,colorado_business_entities_misfit_entities,entityname,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 33 | tayloro,colorado_business_entities_misfit_entities,principaladdress1,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 34 | tayloro,colorado_business_entities_misfit_entities,principaladdress2,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 35 | 
tayloro,colorado_business_entities_misfit_entities,principalcity,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 36 | tayloro,colorado_business_entities_misfit_entities,principalcounty,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 37 | tayloro,colorado_business_entities_misfit_entities,principalstate,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 38 | tayloro,colorado_business_entities_misfit_entities,principalzipcode,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 39 | tayloro,colorado_business_entities_misfit_entities,principalcountry,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 40 | tayloro,colorado_business_entities_misfit_entities,entitystatus,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 41 | tayloro,colorado_business_entities_misfit_entities,jurisdictonofformation,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 42 | tayloro,colorado_business_entities_misfit_entities,entitytype,varchar,silver,colorado_business_entities_stg_{{yesterday_ds}} 43 | tayloro,colorado_business_entities_misfit_entities,entityformdate,date,silver,colorado_business_entities_stg_{{yesterday_ds}} 44 | tayloro,colorado_business_entities_raw,entityid,bigint,bronze,source 45 | tayloro,colorado_business_entities_raw,entityname,varchar,bronze,source 46 | tayloro,colorado_business_entities_raw,principaladdress1,varchar,bronze,source 47 | tayloro,colorado_business_entities_raw,principaladdress2,varchar,bronze,source 48 | tayloro,colorado_business_entities_raw,principalcity,varchar,bronze,source 49 | tayloro,colorado_business_entities_raw,principalstate,varchar,bronze,source 50 | tayloro,colorado_business_entities_raw,principalzipcode,varchar,bronze,source 51 | tayloro,colorado_business_entities_raw,principalcountry,varchar,bronze,source 52 | tayloro,colorado_business_entities_raw,entitystatus,varchar,bronze,source 53 | 
tayloro,colorado_business_entities_raw,jurisdictonofformation,varchar,bronze,source 54 | tayloro,colorado_business_entities_raw,entitytype,varchar,bronze,source 55 | tayloro,colorado_business_entities_raw,entityformdate,varchar,bronze,source 56 | tayloro,colorado_business_entities_with_ranking,entityid,bigint,silver,colorado_business_entities 57 | tayloro,colorado_business_entities_with_ranking,entityname,varchar,silver,colorado_business_entities 58 | tayloro,colorado_business_entities_with_ranking,principaladdress1,varchar,silver,colorado_business_entities 59 | tayloro,colorado_business_entities_with_ranking,principaladdress2,varchar,silver,colorado_business_entities 60 | tayloro,colorado_business_entities_with_ranking,principalcity,varchar,silver,colorado_business_entities 61 | tayloro,colorado_business_entities_with_ranking,principalcounty,varchar,silver,colorado_business_entities 62 | tayloro,colorado_business_entities_with_ranking,principalstate,varchar,silver,colorado_business_entities 63 | tayloro,colorado_business_entities_with_ranking,principalzipcode,varchar,silver,colorado_business_entities 64 | tayloro,colorado_business_entities_with_ranking,principalcountry,varchar,silver,colorado_business_entities 65 | tayloro,colorado_business_entities_with_ranking,entitystatus,varchar,silver,colorado_business_entities 66 | tayloro,colorado_business_entities_with_ranking,jurisdictonofformation,varchar,silver,colorado_business_entities 67 | tayloro,colorado_business_entities_with_ranking,entitytype,varchar,silver,colorado_business_entities 68 | tayloro,colorado_business_entities_with_ranking,entityformdate,date,silver,colorado_business_entities 69 | tayloro,colorado_business_entities_with_ranking,crime_rank,double,silver,colorado_business_entities 70 | tayloro,colorado_business_entities_with_ranking,income_rank,integer,silver,colorado_business_entities 71 | tayloro,colorado_business_entities_with_ranking,population_rank,integer,silver,colorado_business_entities 72 | 
tayloro,colorado_business_entities_with_ranking,final_rank,double,silver,colorado_business_entities 73 | tayloro,colorado_cities_and_counties,city,varchar,silver,source 74 | tayloro,colorado_cities_and_counties,county,varchar,silver,source 75 | tayloro,colorado_city_county_zip,zip_code,integer,bronze,source 76 | tayloro,colorado_city_county_zip,city,varchar,bronze,source 77 | tayloro,colorado_city_county_zip,county,varchar,bronze,source 78 | tayloro,colorado_city_crime_counts_1997_2020,city,varchar,gold,colorado_crimes 79 | tayloro,colorado_city_crime_counts_1997_2020,crime_count,bigint,gold,colorado_crimes 80 | tayloro,colorado_city_crime_time_likelihood,city,varchar,gold,colorado_crimes 81 | tayloro,colorado_city_crime_time_likelihood,offense_category_name,varchar,gold,colorado_crimes_with_cities 82 | tayloro,colorado_city_crime_time_likelihood,avg_day_crimes,double,gold,colorado_crimes_with_cities 83 | tayloro,colorado_city_crime_time_likelihood,avg_night_crimes,double,gold,colorado_crimes_with_cities 84 | tayloro,colorado_city_crime_time_likelihood,is_night_likely,boolean,gold,colorado_crimes_with_cities 85 | tayloro,colorado_city_crime_time_likelihood,is_day_likely,boolean,gold,colorado_crimes_with_cities 86 | tayloro,colorado_city_seasonal_crime_rates_1997_2020,city,varchar,gold,colorado_crimes_with_cities 87 | tayloro,colorado_city_seasonal_crime_rates_1997_2020,crime_month,bigint,gold,colorado_crimes_with_cities 88 | tayloro,colorado_city_seasonal_crime_rates_1997_2020,month_name,varchar,gold,colorado_crimes_with_cities 89 | tayloro,colorado_city_seasonal_crime_rates_1997_2020,avg_monthly_crimes,double,gold,colorado_crimes_with_cities 90 | tayloro,colorado_county_agency_crime_counts,county_name,varchar,gold,colorado_crimes 91 | tayloro,colorado_county_agency_crime_counts,agency_name,varchar,gold,colorado_crimes 92 | tayloro,colorado_county_agency_crime_counts,total_crimes,bigint,gold,colorado_crimes 93 | 
tayloro,colorado_county_average_income_1997_2020,county_name,varchar,gold,colorado_crimes 94 | tayloro,colorado_county_average_income_1997_2020,agency_name,varchar,gold,colorado_crimes 95 | tayloro,colorado_county_average_income_1997_2020,total_crimes,bigint,gold,colorado_crimes 96 | tayloro,colorado_county_coordinates_raw,county,varchar,bronze,source 97 | tayloro,colorado_county_coordinates_raw,label,varchar,bronze,source 98 | tayloro,colorado_county_coordinates_raw,cent_lat,double,bronze,source 99 | tayloro,colorado_county_coordinates_raw,cent_long,double,bronze,source 100 | tayloro,colorado_county_crime_per_capita_1997_2020,county_name,varchar,silver,"colorado_population_raw, colorado_crimes" 101 | tayloro,colorado_county_crime_per_capita_1997_2020,total_crimes,bigint,silver,"colorado_population_raw, colorado_crimes" 102 | tayloro,colorado_county_crime_per_capita_1997_2020,total_population,bigint,silver,"colorado_population_raw, colorado_crimes" 103 | tayloro,colorado_county_crime_per_capita_1997_2020,crime_per_100k,decimal,silver,"colorado_population_raw, colorado_crimes" 104 | tayloro,colorado_county_crime_per_capita_with_coordinates,county,varchar,gold,"colorado_county_crime_per_capita_1997_2020, colorado_county_coordinates_raw" 105 | tayloro,colorado_county_crime_per_capita_with_coordinates,crime_per_100k,"decimal(26,1)",gold,"colorado_county_crime_per_capita_1997_2020, colorado_county_coordinates_raw" 106 | tayloro,colorado_county_crime_per_capita_with_coordinates,cent_lat,double,gold,"colorado_county_crime_per_capita_1997_2020, colorado_county_coordinates_raw" 107 | tayloro,colorado_county_crime_per_capita_with_coordinates,cent_long,double,gold,"colorado_county_crime_per_capita_1997_2020, colorado_county_coordinates_raw" 108 | tayloro,colorado_county_crime_vs_population_1997_2020,county_name,varchar,silver,"colorado_population_raw, colorado_crimes" 109 | 
tayloro,colorado_county_crime_vs_population_1997_2020,crime_year,bigint,silver,"colorado_population_raw, colorado_crimes" 110 | tayloro,colorado_county_crime_vs_population_1997_2020,total_crimes,bigint,silver,"colorado_population_raw, colorado_crimes" 111 | tayloro,colorado_county_crime_vs_population_1997_2020,total_population,bigint,silver,"colorado_population_raw, colorado_crimes" 112 | tayloro,colorado_county_crime_vs_population_1997_2020,crime_per_100k,"decimal(26,1)",silver,"colorado_population_raw, colorado_crimes" 113 | tayloro,colorado_county_seasonal_crime_rates_1997_2020,county_name,varchar,gold,colorado_crimes 114 | tayloro,colorado_county_seasonal_crime_rates_1997_2020,crime_month,bigint,gold,colorado_crimes 115 | tayloro,colorado_county_seasonal_crime_rates_1997_2020,month_name,varchar,gold,colorado_crimes 116 | tayloro,colorado_county_seasonal_crime_rates_1997_2020,avg_monthly_crimes,double,gold,colorado_crimes 117 | tayloro,colorado_crime_category_totals_per_county_1997_2020,offense_category_name,varchar,gold,colorado_crimes 118 | tayloro,colorado_crime_category_totals_per_county_1997_2020,crime_count,bigint,gold,colorado_crimes 119 | tayloro,colorado_crime_category_totals_per_county_1997_2020,county_name,varchar,gold,colorado_crimes 120 | tayloro,colorado_crime_population_trends,year,integer,silver,"colorado_population_raw, colorado_crimes" 121 | tayloro,colorado_crime_population_trends,county_name,varchar,silver,"colorado_population_raw, colorado_crimes" 122 | tayloro,colorado_crime_population_trends,total_crimes,bigint,silver,"colorado_population_raw, colorado_crimes" 123 | tayloro,colorado_crime_population_trends,total_population,bigint,silver,"colorado_population_raw, colorado_crimes" 124 | tayloro,colorado_crime_tier_arson,county_name,varchar,silver,colorado_crimes 125 | tayloro,colorado_crime_tier_arson,total_crimes,bigint,silver,colorado_crimes 126 | tayloro,colorado_crime_tier_arson,crime_percentile,double,silver,colorado_crimes 127 | 
tayloro,colorado_crime_tier_arson,crime_tier,integer,silver,colorado_crimes 128 | tayloro,colorado_crime_tier_burglary,county_name,varchar,silver,colorado_crimes 129 | tayloro,colorado_crime_tier_burglary,total_crimes,bigint,silver,colorado_crimes 130 | tayloro,colorado_crime_tier_burglary,crime_percentile,double,silver,colorado_crimes 131 | tayloro,colorado_crime_tier_burglary,crime_tier,integer,silver,colorado_crimes 132 | tayloro,colorado_crime_tier_county_rank,county_name,varchar,silver,"colorado_crime_tier_property_destruction, colorado_crime_tier_burglary, 133 | colorado_crime_tier_larceny_theft, colorado_crime_tier_vehicle_theft, 134 | colorado_crime_tier_robbery, colorado_crime_tier_arson, colorado_crime_tier_stolen_property" 135 | tayloro,colorado_crime_tier_county_rank,property_destruction_tier,integer,silver,"colorado_crime_tier_property_destruction, colorado_crime_tier_burglary, 136 | colorado_crime_tier_larceny_theft, colorado_crime_tier_vehicle_theft, 137 | colorado_crime_tier_robbery, colorado_crime_tier_arson, colorado_crime_tier_stolen_property" 138 | tayloro,colorado_crime_tier_county_rank,burglary_tier,integer,silver,"colorado_crime_tier_property_destruction, colorado_crime_tier_burglary, 139 | colorado_crime_tier_larceny_theft, colorado_crime_tier_vehicle_theft, 140 | colorado_crime_tier_robbery, colorado_crime_tier_arson, colorado_crime_tier_stolen_property" 141 | tayloro,colorado_crime_tier_county_rank,larceny_theft_tier,integer,silver,"colorado_crime_tier_property_destruction, colorado_crime_tier_burglary, 142 | colorado_crime_tier_larceny_theft, colorado_crime_tier_vehicle_theft, 143 | colorado_crime_tier_robbery, colorado_crime_tier_arson, colorado_crime_tier_stolen_property" 144 | tayloro,colorado_crime_tier_county_rank,vehicle_theft_tier,integer,silver,"colorado_crime_tier_property_destruction, colorado_crime_tier_burglary, 145 | colorado_crime_tier_larceny_theft, colorado_crime_tier_vehicle_theft, 146 | colorado_crime_tier_robbery, 
colorado_crime_tier_arson, colorado_crime_tier_stolen_property" 147 | tayloro,colorado_crime_tier_county_rank,robbery_tier,integer,silver,"colorado_crime_tier_property_destruction, colorado_crime_tier_burglary, 148 | colorado_crime_tier_larceny_theft, colorado_crime_tier_vehicle_theft, 149 | colorado_crime_tier_robbery, colorado_crime_tier_arson, colorado_crime_tier_stolen_property" 150 | tayloro,colorado_crime_tier_county_rank,arson_tier,integer,silver,"colorado_crime_tier_property_destruction, colorado_crime_tier_burglary, 151 | colorado_crime_tier_larceny_theft, colorado_crime_tier_vehicle_theft, 152 | colorado_crime_tier_robbery, colorado_crime_tier_arson, colorado_crime_tier_stolen_property" 153 | tayloro,colorado_crime_tier_county_rank,stolen_property_tier,integer,silver,"colorado_crime_tier_property_destruction, colorado_crime_tier_burglary, 154 | colorado_crime_tier_larceny_theft, colorado_crime_tier_vehicle_theft, 155 | colorado_crime_tier_robbery, colorado_crime_tier_arson, colorado_crime_tier_stolen_property" 156 | tayloro,colorado_crime_tier_county_rank,overall_crime_tier,double,silver,"colorado_crime_tier_property_destruction, colorado_crime_tier_burglary, 157 | colorado_crime_tier_larceny_theft, colorado_crime_tier_vehicle_theft, 158 | colorado_crime_tier_robbery, colorado_crime_tier_arson, colorado_crime_tier_stolen_property" 159 | tayloro,colorado_crime_tier_larceny_theft,county_name,varchar,silver,colorado_crimes 160 | tayloro,colorado_crime_tier_larceny_theft,total_crimes,bigint,silver,colorado_crimes 161 | tayloro,colorado_crime_tier_larceny_theft,crime_percentile,double,silver,colorado_crimes 162 | tayloro,colorado_crime_tier_larceny_theft,crime_tier,integer,silver,colorado_crimes 163 | tayloro,colorado_crime_tier_property_destruction,county_name,varchar,silver,colorado_crimes 164 | tayloro,colorado_crime_tier_property_destruction,total_crimes,bigint,silver,colorado_crimes 165 | 
tayloro,colorado_crime_tier_property_destruction,crime_percentile,double,silver,colorado_crimes 166 | tayloro,colorado_crime_tier_property_destruction,crime_tier,integer,silver,colorado_crimes 167 | tayloro,colorado_crime_tier_robbery,county_name,varchar,silver,colorado_crimes 168 | tayloro,colorado_crime_tier_robbery,total_crimes,bigint,silver,colorado_crimes 169 | tayloro,colorado_crime_tier_robbery,crime_percentile,double,silver,colorado_crimes 170 | tayloro,colorado_crime_tier_robbery,crime_tier,integer,silver,colorado_crimes 171 | tayloro,colorado_crime_tier_stolen_property,county_name,varchar,silver,colorado_crimes 172 | tayloro,colorado_crime_tier_stolen_property,total_crimes,bigint,silver,colorado_crimes 173 | tayloro,colorado_crime_tier_stolen_property,crime_percentile,double,silver,colorado_crimes 174 | tayloro,colorado_crime_tier_stolen_property,crime_tier,integer,silver,colorado_crimes 175 | tayloro,colorado_crime_tier_vehicle_theft,county_name,varchar,silver,colorado_crimes 176 | tayloro,colorado_crime_tier_vehicle_theft,total_crimes,bigint,silver,colorado_crimes 177 | tayloro,colorado_crime_tier_vehicle_theft,crime_percentile,double,silver,colorado_crimes 178 | tayloro,colorado_crime_tier_vehicle_theft,crime_tier,integer,silver,colorado_crimes 179 | tayloro,colorado_crime_type_distribution_by_city,city,varchar,gold,colorado_crimes_with_cities 180 | tayloro,colorado_crime_type_distribution_by_city,crime_against,varchar,gold,colorado_crimes_with_cities 181 | tayloro,colorado_crime_type_distribution_by_city,total_crimes,bigint,gold,colorado_crimes_with_cities 182 | tayloro,colorado_crime_vs_median_household_income_1997_2020,county_name,varchar,gold,"colorado_income_1997_2020, colorado_crimes" 183 | tayloro,colorado_crime_vs_median_household_income_1997_2020,year,integer,gold,"colorado_income_1997_2020, colorado_crimes" 184 | tayloro,colorado_crime_vs_median_household_income_1997_2020,total_crimes,bigint,gold,"colorado_income_1997_2020, colorado_crimes" 
185 | tayloro,colorado_crime_vs_median_household_income_1997_2020,total_population,bigint,gold,"colorado_income_1997_2020, colorado_crimes" 186 | tayloro,colorado_crime_vs_median_household_income_1997_2020,crime_per_100k,"decimal(26,1)",gold,"colorado_income_1997_2020, colorado_crimes" 187 | tayloro,colorado_crime_vs_median_household_income_1997_2020,median_household_income,bigint,gold,"colorado_income_1997_2020, colorado_crimes" 188 | tayloro,colorado_crimes,agency_name,varchar,silver,"colorado_crimes_1997_2015_raw, colorado_crimes_2016_2020_raw" 189 | tayloro,colorado_crimes,county_name,varchar,silver,"colorado_crimes_1997_2015_raw, colorado_crimes_2016_2020_raw" 190 | tayloro,colorado_crimes,incident_date,date,silver,"colorado_crimes_1997_2015_raw, colorado_crimes_2016_2020_raw" 191 | tayloro,colorado_crimes,incident_hour,integer,silver,"colorado_crimes_1997_2015_raw, colorado_crimes_2016_2020_raw" 192 | tayloro,colorado_crimes,offense_name,varchar,silver,"colorado_crimes_1997_2015_raw, colorado_crimes_2016_2020_raw" 193 | tayloro,colorado_crimes,crime_against,varchar,silver,"colorado_crimes_1997_2015_raw, colorado_crimes_2016_2020_raw" 194 | tayloro,colorado_crimes,offense_category_name,varchar,silver,"colorado_crimes_1997_2015_raw, colorado_crimes_2016_2020_raw" 195 | tayloro,colorado_crimes,age_num,integer,silver,"colorado_crimes_1997_2015_raw, colorado_crimes_2016_2020_raw" 196 | tayloro,colorado_crimes_1997_2015_raw,agency_name,varchar,bronze,source 197 | tayloro,colorado_crimes_1997_2015_raw,agency_type_name,varchar,bronze,source 198 | tayloro,colorado_crimes_1997_2015_raw,city_name,varchar,bronze,source 199 | tayloro,colorado_crimes_1997_2015_raw,primary_county,varchar,bronze,source 200 | tayloro,colorado_crimes_1997_2015_raw,incident_hour,integer,bronze,source 201 | tayloro,colorado_crimes_1997_2015_raw,offense_name,varchar,bronze,source 202 | tayloro,colorado_crimes_1997_2015_raw,crime_against,varchar,bronze,source 203 | 
tayloro,colorado_crimes_1997_2015_raw,offense_category_name,varchar,bronze,source 204 | tayloro,colorado_crimes_1997_2015_raw,age_num,integer,bronze,source 205 | tayloro,colorado_crimes_1997_2015_raw,incident_date,varchar,bronze,source 206 | tayloro,colorado_crimes_2016_2020_raw,pub_agency_name,varchar,bronze,source 207 | tayloro,colorado_crimes_2016_2020_raw,county_name,varchar,bronze,source 208 | tayloro,colorado_crimes_2016_2020_raw,incident_date,varchar,bronze,source 209 | tayloro,colorado_crimes_2016_2020_raw,incident_hour,integer,bronze,source 210 | tayloro,colorado_crimes_2016_2020_raw,offense_name,varchar,bronze,source 211 | tayloro,colorado_crimes_2016_2020_raw,crime_against,varchar,bronze,source 212 | tayloro,colorado_crimes_2016_2020_raw,offense_category_name,varchar,bronze,source 213 | tayloro,colorado_crimes_2016_2020_raw,offense_group,varchar,bronze,source 214 | tayloro,colorado_crimes_2016_2020_raw,age_num,integer,bronze,source 215 | tayloro,colorado_crimes_2016_2020_with_cities,pub_agency_name,varchar,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 216 | tayloro,colorado_crimes_2016_2020_with_cities,county_name,varchar,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 217 | tayloro,colorado_crimes_2016_2020_with_cities,incident_date,varchar,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 218 | tayloro,colorado_crimes_2016_2020_with_cities,incident_hour,integer,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 219 | tayloro,colorado_crimes_2016_2020_with_cities,offense_name,varchar,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 220 | tayloro,colorado_crimes_2016_2020_with_cities,crime_against,varchar,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 221 | tayloro,colorado_crimes_2016_2020_with_cities,offense_category_name,varchar,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 222 | 
tayloro,colorado_crimes_2016_2020_with_cities,offense_group,varchar,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 223 | tayloro,colorado_crimes_2016_2020_with_cities,age_num,integer,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 224 | tayloro,colorado_crimes_2016_2020_with_cities,city,varchar,silver,"colorado_city_county_zip, colorado_crimes_2016_2020_raw" 225 | tayloro,colorado_crimes_with_cities,agency_name,varchar,silver,"colorado_crimes_2016_2020_with_cities, colorado_crimes_1997_2015_raw" 226 | tayloro,colorado_crimes_with_cities,county_name,varchar,silver,"colorado_crimes_2016_2020_with_cities, colorado_crimes_1997_2015_raw" 227 | tayloro,colorado_crimes_with_cities,incident_date,date,silver,"colorado_crimes_2016_2020_with_cities, colorado_crimes_1997_2015_raw" 228 | tayloro,colorado_crimes_with_cities,incident_hour,integer,silver,"colorado_crimes_2016_2020_with_cities, colorado_crimes_1997_2015_raw" 229 | tayloro,colorado_crimes_with_cities,offense_name,varchar,silver,"colorado_crimes_2016_2020_with_cities, colorado_crimes_1997_2015_raw" 230 | tayloro,colorado_crimes_with_cities,crime_against,varchar,silver,"colorado_crimes_2016_2020_with_cities, colorado_crimes_1997_2015_raw" 231 | tayloro,colorado_crimes_with_cities,offense_category_name,varchar,silver,"colorado_crimes_2016_2020_with_cities, colorado_crimes_1997_2015_raw" 232 | tayloro,colorado_crimes_with_cities,age_num,integer,silver,"colorado_crimes_2016_2020_with_cities, colorado_crimes_1997_2015_raw" 233 | tayloro,colorado_crimes_with_cities,city,varchar,silver,"colorado_crimes_2016_2020_with_cities, colorado_crimes_1997_2015_raw" 234 | tayloro,colorado_final_county_tier_rank,county,varchar,silver,"colorado_crime_tier_county_rank, colorado_income_household_tier_rank, 235 | colorado_population_crime_per_capita_rank" 236 | tayloro,colorado_final_county_tier_rank,crime_rank,double,silver,"colorado_crime_tier_county_rank, colorado_income_household_tier_rank, 237 | 
colorado_population_crime_per_capita_rank" 238 | tayloro,colorado_final_county_tier_rank,income_rank,integer,silver,"colorado_crime_tier_county_rank, colorado_income_household_tier_rank, 239 | colorado_population_crime_per_capita_rank" 240 | tayloro,colorado_final_county_tier_rank,population_rank,integer,silver,"colorado_crime_tier_county_rank, colorado_income_household_tier_rank, 241 | colorado_population_crime_per_capita_rank" 242 | tayloro,colorado_final_county_tier_rank,final_rank,double,silver,"colorado_crime_tier_county_rank, colorado_income_household_tier_rank, 243 | colorado_population_crime_per_capita_rank" 244 | tayloro,colorado_income_1997_2020,county,varchar,silver,colorado_income_raw 245 | tayloro,colorado_income_1997_2020,incdesc,varchar,silver,colorado_income_raw 246 | tayloro,colorado_income_1997_2020,inctype,integer,silver,colorado_income_raw 247 | tayloro,colorado_income_1997_2020,incsource,integer,silver,colorado_income_raw 248 | tayloro,colorado_income_1997_2020,income,bigint,silver,colorado_income_raw 249 | tayloro,colorado_income_1997_2020,incrank,integer,silver,colorado_income_raw 250 | tayloro,colorado_income_1997_2020,periodyear,integer,silver,colorado_income_raw 251 | tayloro,colorado_income_household_tier_rank,county,varchar,silver,colorado_income_raw 252 | tayloro,colorado_income_household_tier_rank,avg_median_household_income,double,silver,colorado_income_raw 253 | tayloro,colorado_income_household_tier_rank,income_percentile,double,silver,colorado_income_raw 254 | tayloro,colorado_income_household_tier_rank,income_tier,integer,silver,colorado_income_raw 255 | tayloro,colorado_income_raw,stateabbrv,varchar,bronze,source 256 | tayloro,colorado_income_raw,statename,varchar,bronze,source 257 | tayloro,colorado_income_raw,stfips,integer,bronze,source 258 | tayloro,colorado_income_raw,areatyname,varchar,bronze,source 259 | tayloro,colorado_income_raw,areaname,varchar,bronze,source 260 | 
tayloro,colorado_income_raw,areatype,integer,bronze,source 261 | tayloro,colorado_income_raw,area,integer,bronze,source 262 | tayloro,colorado_income_raw,periodyear,integer,bronze,source 263 | tayloro,colorado_income_raw,periodtype,integer,bronze,source 264 | tayloro,colorado_income_raw,pertypdesc,varchar,bronze,source 265 | tayloro,colorado_income_raw,period,integer,bronze,source 266 | tayloro,colorado_income_raw,inctype,integer,bronze,source 267 | tayloro,colorado_income_raw,incdesc,varchar,bronze,source 268 | tayloro,colorado_income_raw,incsource,integer,bronze,source 269 | tayloro,colorado_income_raw,incsrcdesc,varchar,bronze,source 270 | tayloro,colorado_income_raw,income,bigint,bronze,source 271 | tayloro,colorado_income_raw,incrank,integer,bronze,source 272 | tayloro,colorado_income_raw,population,integer,bronze,source 273 | tayloro,colorado_income_raw,releasedate,integer,bronze,source 274 | tayloro,colorado_population_crime_per_capita_rank,county,varchar,silver,"colorado_population_raw, colorado_crimes" 275 | tayloro,colorado_population_crime_per_capita_rank,avg_crime_per_1000,"decimal(24,1)",silver,"colorado_population_raw, colorado_crimes" 276 | tayloro,colorado_population_crime_per_capita_rank,crime_percentile,double,silver,"colorado_population_raw, colorado_crimes" 277 | tayloro,colorado_population_crime_per_capita_rank,crime_per_capita_tier,integer,silver,"colorado_population_raw, colorado_crimes" 278 | tayloro,colorado_population_raw,id,integer,bronze,source 279 | tayloro,colorado_population_raw,county,varchar,bronze,source 280 | tayloro,colorado_population_raw,fipscode,integer,bronze,source 281 | tayloro,colorado_population_raw,year,integer,bronze,source 282 | tayloro,colorado_population_raw,age,integer,bronze,source 283 | tayloro,colorado_population_raw,malepopulation,integer,bronze,source 284 | tayloro,colorado_population_raw,femalepopulation,integer,bronze,source 285 | tayloro,colorado_population_raw,totalpopulation,integer,bronze,source 286 | 
tayloro,colorado_population_raw,datatype,varchar,bronze,source 287 | tayloro,colorado_temp_geocoded_entities,entityid,bigint,silver,"colorado_business_entities, Geoapify API" 288 | tayloro,colorado_temp_geocoded_entities,principaladdress1,varchar,silver,"colorado_business_entities, Geoapify API" 289 | tayloro,colorado_temp_geocoded_entities,principalcity,varchar,silver,"colorado_business_entities, Geoapify API" 290 | tayloro,colorado_temp_geocoded_entities,principalzipcode,varchar,silver,"colorado_business_entities, Geoapify API" 291 | tayloro,colorado_temp_geocoded_entities,full_address,varchar,silver,"colorado_business_entities, Geoapify API" 292 | tayloro,colorado_temp_geocoded_entities,geo_city,varchar,silver,"colorado_business_entities, Geoapify API" 293 | tayloro,colorado_temp_geocoded_entities,geo_county,varchar,silver,"colorado_business_entities, Geoapify API" 294 | tayloro,colorado_temp_geocoded_entities,geo_state,varchar,silver,"colorado_business_entities, Geoapify API" 295 | tayloro,colorado_temp_geocoded_entities,geo_postcode,varchar,silver,"colorado_business_entities, Geoapify API" 296 | tayloro,colorado_total_population_per_county_1997_2020,year,integer,gold,colorado_population_raw 297 | tayloro,colorado_total_population_per_county_1997_2020,county,varchar,gold,colorado_population_raw 298 | tayloro,colorado_total_population_per_county_1997_2020,total_population,bigint,gold,colorado_population_raw 299 | tayloro,coloraodo_county_crime_rate_per_capita,county_name,varchar,silver,"colorado_population_raw, colorado_crimes" 300 | tayloro,coloraodo_county_crime_rate_per_capita,periodyear,integer,silver,"colorado_population_raw, colorado_crimes" 301 | tayloro,coloraodo_county_crime_rate_per_capita,total_crimes,integer,silver,"colorado_population_raw, colorado_crimes" 302 | tayloro,coloraodo_county_crime_rate_per_capita,total_population,integer,silver,"colorado_population_raw, colorado_crimes" 303 | 
tayloro,coloraodo_county_crime_rate_per_capita,crime_per_100k,double,silver,"colorado_population_raw, colorado_crimes" -------------------------------------------------------------------------------- /dags/tayloro/business_entity_dag.py: -------------------------------------------------------------------------------- 1 | from airflow.decorators import dag 2 | from airflow.operators.python import PythonOperator 3 | from datetime import timedelta 4 | from include.eczachly.trino_queries import execute_trino_query, run_trino_query_dq_check 5 | from airflow.models import Variable 6 | import os 7 | import requests 8 | from datetime import datetime 9 | from dotenv import load_dotenv 10 | load_dotenv() # This loads the .env file so that os.getenv() works 11 | 12 | local_script_path = os.path.join("include", 'eczachly/scripts/kafka_read_example.py') 13 | 14 | tabular_credential = Variable.get("TABULAR_CREDENTIAL") 15 | 16 | DATA_COLORADO_GOV_KEY = Variable.get("DATA_COLORADO_GOV_KEY") 17 | GEOAPIFY_KEY = Variable.get("GEOAPIFY_KEY") 18 | 19 | 20 | @dag( 21 | description="A DAG that processes Colorado business entity data daily at 6 AM MT.", 22 | default_args={ 23 | "owner": "Taylor Ortiz", 24 | "retries": 0, 25 | "execution_timeout": timedelta(hours=1), 26 | }, 27 | start_date=datetime(2025, 2, 23), 28 | max_active_runs=1, # Ensures only one active DAG run at a time 29 | schedule_interval="0 13 * * *", # Runs at 13:00 UTC = 6 AM MT (MST) 30 | catchup=False, # Prevents backfilling past dates 31 | tags=["tayloro", "capstone"] 32 | ) 33 | 34 | def business_entity_dag(): 35 | yesterday = '{{ yesterday_ds }}' 36 | schema = 'tayloro' 37 | production_table = f'{schema}.colorado_business_entities' 38 | 39 | # Staging and reference table names (Airflow macros resolve at render time) 40 | business_entities_with_ranking = f"{production_table}_with_ranking" 41 | final_county_tier_rank = f'{schema}.colorado_final_county_tier_rank' 42 | staging_table = 
f"{production_table}_stg_{{{{ yesterday_ds_nodash }}}}" 43 | dupes_staging_table = f"{production_table}_dupes_stg_{{{{ yesterday_ds_nodash }}}}" 44 | dupes_table = f"{production_table}_duplicates" 45 | geo_validation_staging_table = f"{production_table}_geo_validation_stg_{{{{ yesterday_ds_nodash }}}}" 46 | misfit_entities_table = f"{production_table}_misfit_entities" 47 | added_city_county_zip = f"{production_table}_added_city_count_zip_stg_{{{{ yesterday_ds_nodash }}}}" 48 | 49 | geoapify_key = GEOAPIFY_KEY 50 | data_colorado_gov_api_key = DATA_COLORADO_GOV_KEY 51 | 52 | 53 | # Function to transform API data into records 54 | def transform_entity_data_to_records(response): 55 | records = [] 56 | for result in response: 57 | # Extract data while defaulting to None for missing fields 58 | records.append({ 59 | "entityid": result.get("entityid"), 60 | "entityname": result.get("entityname"), 61 | "principaladdress1": result.get("principaladdress1"), 62 | "principaladdress2": result.get("principaladdress2"), 63 | "principalcity": result.get("principalcity"), 64 | "principalstate": result.get("principalstate"), 65 | "principalzipcode": result.get("principalzipcode"), 66 | "principalcountry": result.get("principalcountry"), 67 | "entitystatus": result.get("entitystatus"), 68 | "jurisdictonofformation": result.get("jurisdictonofformation"), 69 | "entitytype": result.get("entitytype"), 70 | "entityformdate": result.get("entityformdate") 71 | }) 72 | return records 73 | 74 | # Function to fetch business entity data for yesterday from the data.colorado.gov API 75 | def fetch_business_entity_data(yesterday, api_token): 76 | print('what is the value of the token? 
', api_token) 77 | print('Grabbing Colorado Business Entity records for ', yesterday) 78 | formatted_date = f"{yesterday}T00:00:00.000" 79 | base_url = f"https://data.colorado.gov/resource/4ykn-tg5h.json?$$app_token={api_token}" 80 | params = { 81 | "entityformdate": formatted_date, 82 | "entitystatus": "Good Standing", 83 | "principalstate": "CO", 84 | } 85 | query_string = "&".join(f"{key}={value.replace(' ', '%20')}" for key, value in params.items()) 86 | # Append with an ampersand (&) instead of a question mark 87 | final_url = f"{base_url}&{query_string}" 88 | print('what is final url? ', final_url) 89 | response = requests.get(final_url) 90 | 91 | if response.status_code == 200 and len(response.json()) > 0: 92 | records = transform_entity_data_to_records(response.json()) 93 | print("Fetched Records:", records) 94 | return records 95 | else: 96 | print(f"No data fetched for {yesterday}") 97 | return [] 98 | def insert_into_staging_table(records, staging_table): 99 | if not records: 100 | print("No records to insert.") 101 | return 102 | 103 | # Build the insert query 104 | insert_query = f""" 105 | INSERT INTO {staging_table} ( 106 | entityid, entityname, principaladdress1, principaladdress2, 107 | principalcity, principalcounty, principalstate, principalzipcode, 108 | principalcountry, entitystatus, jurisdictonofformation, entitytype, entityformdate 109 | ) VALUES 110 | """ 111 | values = [] 112 | for record in records: 113 | # Escape single quotes in strings 114 | entityname = record['entityname'].replace("'", "''") 115 | principaladdress1 = record['principaladdress1'].replace("'", "''") 116 | principaladdress2 = (record.get('principaladdress2') or '').replace("'", "''") 117 | principalcity = record['principalcity'].replace("'", "''") 118 | principalstate = record['principalstate'] 119 | principalzipcode = record['principalzipcode'] 120 | principalcountry = record['principalcountry'].replace("'", "''") 121 | entitystatus = record['entitystatus'].replace("'", 
"''") 122 | jurisdictonofformation = record['jurisdictonofformation'].replace("'", "''") 123 | entitytype = record['entitytype'].replace("'", "''") 124 | 125 | # Convert entityformdate to proper DATE format 126 | raw_date = record['entityformdate'] 127 | try: 128 | # Parse and format the date as YYYY-MM-DD 129 | formatted_date = datetime.strptime(raw_date[:10], '%Y-%m-%d').strftime('%Y-%m-%d') 130 | except ValueError: 131 | print(f"Invalid date format: {raw_date}") 132 | continue 133 | 134 | # Format each record as a SQL value 135 | values.append( 136 | f"({record['entityid']}, '{entityname}', '{principaladdress1}', '{principaladdress2}', " 137 | f"'{principalcity}', NULL, '{principalstate}', '{principalzipcode}', " 138 | f"'{principalcountry}', '{entitystatus}', '{jurisdictonofformation}', '{entitytype}', " 139 | f"DATE '{formatted_date}')" 140 | ) 141 | insert_query += ", ".join(values) 142 | 143 | try: 144 | execute_trino_query(query=insert_query) 145 | print(f"Successfully inserted {len(records)} records into {staging_table}") 146 | except Exception as e: 147 | print(f"Failed to insert records: {e}") 148 | 149 | # Fetch and insert data 150 | def fetch_and_insert_data(yesterday, staging_table, api_token): 151 | print('True') 152 | records = fetch_business_entity_data(yesterday, api_token) 153 | insert_into_staging_table(records, staging_table) 154 | 155 | def transform_geocode_response(response_json, entityid): 156 | """ 157 | Parses the Geoapify API JSON response (which is already a list of features) 158 | and extracts the desired fields. 159 | 160 | Returns a list of tuples with the following structure: 161 | (entityid, geo_city, geo_county, geo_state, geo_postcode, formatted_address) 162 | """ 163 | records = [] 164 | print('what is the value of response_json? 
', response_json) 165 | # Loop directly over the list of feature dictionaries 166 | for feature in response_json: 167 | # Ensure feature is a dict before processing 168 | if not isinstance(feature, dict): 169 | continue 170 | props = feature.get("properties", {}) 171 | # Extract desired fields from properties 172 | geo_city = props.get("city") 173 | geo_county = props.get("county") 174 | geo_state = props.get("state") 175 | geo_postcode = props.get("postcode") 176 | formatted_address = props.get("formatted") 177 | # Append a tuple including the entityid for later reference 178 | records.append((entityid, geo_city, geo_county, geo_state, geo_postcode, formatted_address)) 179 | return records 180 | 181 | def process_and_insert_address_data(aggregated_records, geo_table): 182 | """ 183 | Receives an aggregated list of tuples where each tuple is: 184 | (entityid, geo_city, geo_county, geo_state, geo_postcode, formatted_address) 185 | Builds and executes an INSERT query to load these records into the temporary geo table. 186 | Assumes that the first column (entityid) is numeric (BIGINT) and the rest are strings. 187 | Applies a final check on the zip code (geo_postcode) to ensure it is either a valid 5-digit string 188 | or is blank. 189 | """ 190 | if not aggregated_records: 191 | print("No records to insert from API responses.") 192 | return 193 | 194 | values_list = [] 195 | for rec in aggregated_records: 196 | rec_values = [] 197 | for idx, field in enumerate(rec): 198 | # Final check for the zip code (geo_postcode, index 4) 199 | if idx == 4: 200 | if field is not None: 201 | field = str(field).strip() 202 | if len(field) >= 5: 203 | candidate = field[:5] 204 | # Use candidate if it's all digits; otherwise blank it out. 205 | if candidate.isdigit(): 206 | field = candidate 207 | else: 208 | field = '' 209 | else: 210 | field = '' 211 | # For the first field (entityid), assume it's numeric and do not quote. 
212 | if idx == 0: 213 | rec_values.append(str(field)) 214 | else: 215 | if field is None: 216 | rec_values.append("NULL") 217 | else: 218 | rec_values.append("'" + str(field).replace("'", "''") + "'") 219 | values_list.append("(" + ", ".join(rec_values) + ")") 220 | values_clause = ", ".join(values_list) 221 | 222 | insert_query = f""" 223 | INSERT INTO {geo_table} 224 | (entityid, geo_city, geo_county, geo_state, geo_postcode, formatted_address) 225 | VALUES {values_clause} 226 | """ 227 | execute_trino_query(query=insert_query) 228 | print(f"Inserted {len(aggregated_records)} records into {geo_table}.") 229 | 230 | 231 | def fetch_validated_address(addr, city, state, zip, api_key): 232 | base_url = "https://api.geoapify.com/v1/geocode/search" 233 | url = f"{base_url}?text={addr}, {city} {state} {zip}&apiKey={api_key}" 234 | 235 | response = requests.get(url) 236 | print(f"API call to {url} returned status: {response.status_code}") 237 | 238 | if response.status_code != 200: 239 | print("API call failed with status code", response.status_code) 240 | return None 241 | 242 | data = response.json() 243 | features = data.get("features", []) 244 | 245 | if not features: 246 | print("No features returned in API response.") 247 | return None 248 | 249 | validated_features = [] 250 | for feature in features: 251 | props = feature.get("properties", {}) 252 | # The rank object lives inside the properties 253 | rank = props.get("rank", {}) 254 | match_type = rank.get("match_type") 255 | confidence = rank.get("confidence") 256 | confidence_city = rank.get("confidence_city_level") 257 | confidence_street = rank.get("confidence_street_level") 258 | city = props.get("city") 259 | county = props.get("county") 260 | zip = props.get("postcode") 261 | 262 | # Check each feature's rank details 263 | if (match_type == "full_match" 264 | and confidence == 1 265 | and confidence_city == 1 266 | and confidence_street == 1 267 | and city 268 | and county 269 | and zip): 270 | 
validated_features.append(feature) 271 | 272 | if validated_features: 273 | # If at least one validated feature is found, you can return them. 274 | # For example, you could choose the first one or return all validated features. 275 | print(f"Found {len(validated_features)} validated feature(s).") 276 | return validated_features # or, for example: return validated_features[0] 277 | else: 278 | print("No feature meets the validation criteria.") 279 | return None 280 | 281 | 282 | def process_unmatched_entities_task(staging, geo_table, api_key): 283 | """ 284 | 1. Queries the staging table for records where principalcounty is NULL. 285 | 2. For each record, constructs the full address and calls the address API. 286 | 3. Aggregates the API responses while retaining the entityid. 287 | 4. Inserts the aggregated geocoded data into the temporary geo table. 288 | """ 289 | # Modify the query to select entityid, principaladdress1, and principalstate. 290 | query = f""" 291 | SELECT entityid, principaladdress1, principalcity, principalstate, principalzipcode 292 | FROM {staging} 293 | WHERE principalstate = 'CO' 294 | AND principalcounty IS NULL 295 | """ 296 | results = execute_trino_query(query=query) 297 | print("Unmatched records fetched:", results) 298 | 299 | aggregated_records = [] 300 | for row in results: 301 | entityid = row[0] 302 | address = row[1] 303 | city = row[2] 304 | state = row[3] 305 | zip = row[4] 306 | 307 | # Call the address validation API 308 | api_response = fetch_validated_address(address, city, state, zip, api_key) 309 | if api_response: 310 | # Transform the API response; include the entityid with each record 311 | recs = transform_geocode_response(api_response, entityid) 312 | aggregated_records.extend(recs) 313 | 314 | # Insert the aggregated geocoded records into the temporary geo table 315 | process_and_insert_address_data(aggregated_records, geo_table) 316 | 317 | def check_validated_address_combos(staging, geo_table, 
added_city_county_zip): 318 | """ 319 | Queries the validated address combinations by joining the temporary geo validation table 320 | with the staging table based on entityid. It selects distinct combinations of: 321 | (zip_code, city, county, entityid) 322 | that are not already present in the reference table (colorado_city_county_zip). 323 | 324 | Parameters: 325 | staging (str): Fully resolved name of the staging table. 326 | geo_table (str): Fully resolved name of the temporary geo validation table. 327 | added_city_county_zip (str): Fully resolved name of the staging table that receives new combinations. 328 | Returns: 329 | None. Inserts any new validated combinations into the added_city_county_zip staging table. 330 | """ 331 | query = f""" 332 | SELECT DISTINCT 333 | CAST(g.geo_postcode AS INTEGER) AS zip_code, 334 | g.geo_city AS city, 335 | replace(g.geo_county, ' County', '') AS county, 336 | s.entityid 337 | FROM {geo_table} g 338 | JOIN {staging} s 339 | ON g.entityid = s.entityid 340 | WHERE s.principalstate = 'CO' 341 | AND NOT EXISTS ( 342 | SELECT 1 343 | FROM tayloro.colorado_city_county_zip r 344 | WHERE TRIM(LOWER(r.city)) = TRIM(LOWER(g.geo_city)) 345 | AND TRIM(CAST(r.zip_code AS VARCHAR)) = TRIM(g.geo_postcode) 346 | ) 347 | """ 348 | results = execute_trino_query(query=query) 349 | 350 | if not results: 351 | print("No validated address combinations to insert.") 352 | return 353 | 354 | # Build the INSERT query using zip_code, city, county (ignoring entityid) 355 | values_list = [] 356 | for row in results: 357 | # row structure: [zip_code, city, county, entityid] 358 | zip_code = str(row[0]) # Numeric, so no quotes 359 | city = "'" + str(row[1]).replace("'", "''") + "'" 360 | county = "'" + str(row[2]).replace("'", "''").replace(" County", "") + "'" 361 | values_list.append(f"({zip_code}, {city}, {county})") 362 | 363 | values_clause = ", ".join(values_list) 364 | insert_query = f""" 365 | INSERT INTO {added_city_county_zip} (zip_code, city, county) 366 | VALUES {values_clause} 367 | """ 368 | execute_trino_query(query=insert_query) 369 | print(f"Inserted {len(results)} validated 
address combinations into {added_city_county_zip}.") 370 | 371 | 372 | create_business_entities_yesterday_stage = PythonOperator( 373 | task_id="create_business_entities_yesterday_stage", 374 | python_callable=execute_trino_query, 375 | op_kwargs={ 376 | 'query': f""" 377 | CREATE TABLE IF NOT EXISTS {staging_table} ( 378 | entityid BIGINT, 379 | entityname VARCHAR, 380 | principaladdress1 VARCHAR, 381 | principaladdress2 VARCHAR, 382 | principalcity VARCHAR, 383 | principalcounty VARCHAR, -- Mapped county from cities dataset 384 | principalstate VARCHAR, 385 | principalzipcode VARCHAR, 386 | principalcountry VARCHAR, 387 | entitystatus VARCHAR, 388 | jurisdictonofformation VARCHAR, 389 | entitytype VARCHAR, 390 | entityformdate DATE -- Ensure stored as DATE type 391 | ) 392 | WITH ( 393 | partitioning = ARRAY['principalcounty'] -- Partition by principalcounty 394 | ) 395 | """ 396 | } 397 | ) 398 | 399 | create_misfit_business_entities_table = PythonOperator( 400 | task_id="create_misfit_business_entities_table", 401 | python_callable=execute_trino_query, 402 | op_kwargs={ 403 | 'query': f""" 404 | CREATE TABLE IF NOT EXISTS {misfit_entities_table} ( 405 | entityid BIGINT, 406 | entityname VARCHAR, 407 | principaladdress1 VARCHAR, 408 | principaladdress2 VARCHAR, 409 | principalcity VARCHAR, 410 | principalcounty VARCHAR, -- Mapped county from cities dataset 411 | principalstate VARCHAR, 412 | principalzipcode VARCHAR, 413 | principalcountry VARCHAR, 414 | entitystatus VARCHAR, 415 | jurisdictonofformation VARCHAR, 416 | entitytype VARCHAR, 417 | entityformdate DATE -- Ensure stored as DATE type 418 | ) 419 | """ 420 | } 421 | ) 422 | 423 | create_duplicates_table = PythonOperator( 424 | task_id="create_duplicates_table", 425 | python_callable=execute_trino_query, 426 | op_kwargs={ 427 | 'query': f""" 428 | CREATE TABLE IF NOT EXISTS {dupes_table} ( 429 | entityid BIGINT, 430 | entityname VARCHAR, 431 | principaladdress1 VARCHAR, 432 | principaladdress2 VARCHAR, 
433 | principalcity VARCHAR, 434 | principalcounty VARCHAR, -- Mapped county from cities dataset 435 | principalstate VARCHAR, 436 | principalzipcode VARCHAR, 437 | principalcountry VARCHAR, 438 | entitystatus VARCHAR, 439 | jurisdictonofformation VARCHAR, 440 | entitytype VARCHAR, 441 | entityformdate DATE -- Ensure stored as DATE type 442 | ) 443 | """ 444 | } 445 | ) 446 | 447 | create_dupes_staging_table = PythonOperator( 448 | task_id="create_dupes_staging_table", 449 | python_callable=execute_trino_query, 450 | op_kwargs={ 451 | 'query': f""" 452 | CREATE TABLE IF NOT EXISTS {dupes_staging_table} ( 453 | entityid BIGINT, 454 | entityname VARCHAR, 455 | principaladdress1 VARCHAR, 456 | principaladdress2 VARCHAR, 457 | principalcity VARCHAR, 458 | principalcounty VARCHAR, -- Mapped county from cities dataset 459 | principalstate VARCHAR, 460 | principalzipcode VARCHAR, 461 | principalcountry VARCHAR, 462 | entitystatus VARCHAR, 463 | jurisdictonofformation VARCHAR, 464 | entitytype VARCHAR, 465 | entityformdate DATE -- Ensure stored as DATE type 466 | ) 467 | WITH ( 468 | partitioning = ARRAY['principalcounty'] -- Partition by principalcounty 469 | ) 470 | """ 471 | } 472 | ) 473 | 474 | 475 | create_geo_validation_staging_table = PythonOperator( 476 | task_id="create_geo_validation_staging_table", 477 | python_callable=execute_trino_query, 478 | op_kwargs={ 479 | 'query': f""" 480 | CREATE TABLE IF NOT EXISTS {geo_validation_staging_table} ( 481 | entityid BIGINT, 482 | geo_city VARCHAR, 483 | geo_county VARCHAR, 484 | geo_state VARCHAR, 485 | geo_postcode VARCHAR, 486 | formatted_address VARCHAR 487 | ) 488 | WITH ( 489 | partitioning = ARRAY['geo_county'] 490 | ) 491 | """ 492 | } 493 | ) 494 | 495 | create_addded_city_county_zip_staging_table = PythonOperator( 496 | task_id="create_addded_city_county_zip_staging_table", 497 | python_callable=execute_trino_query, 498 | op_kwargs={ 499 | 'query': f""" 500 | CREATE TABLE IF NOT EXISTS {added_city_county_zip} ( 
501 | zip_code INT, 502 | city VARCHAR, 503 | county VARCHAR 504 | ) 505 | """ 506 | } 507 | ) 508 | 509 | fetch_and_insert_business_entities_yesterday = PythonOperator( 510 | task_id="fetch_and_insert_business_entities_yesterday", 511 | python_callable=fetch_and_insert_data, 512 | op_kwargs={ 513 | 'yesterday': yesterday, 514 | 'staging_table': staging_table, 515 | 'api_token': data_colorado_gov_api_key 516 | } 517 | ) 518 | 519 | 520 | filter_invalid_addresses = PythonOperator( 521 | task_id="filter_invalid_addresses", 522 | python_callable=execute_trino_query, 523 | op_kwargs={ 524 | 'query': f""" 525 | DELETE FROM {staging_table} 526 | WHERE principalcity IS NULL OR principalzipcode IS NULL 527 | """ 528 | } 529 | ) 530 | 531 | reformat_zip_code = PythonOperator( 532 | task_id="reformat_zip_code", 533 | python_callable=execute_trino_query, 534 | op_kwargs={ 535 | 'query': f""" 536 | UPDATE {staging_table} 537 | SET principalzipcode = 538 | CASE 539 | WHEN LENGTH(TRIM(principalzipcode)) >= 5 540 | AND TRY_CAST(SUBSTR(TRIM(principalzipcode), 1, 5) AS INTEGER) IS NOT NULL 541 | THEN SUBSTR(TRIM(principalzipcode), 1, 5) 542 | ELSE '' 543 | END 544 | WHERE principalstate = 'CO' 545 | """ 546 | } 547 | ) 548 | 549 | 550 | 551 | update_county_on_staging_table = PythonOperator( 552 | task_id="update_county_on_staging_table", 553 | python_callable=execute_trino_query, 554 | op_kwargs={ 555 | 'query': f""" 556 | UPDATE {staging_table} 557 | SET principalcounty = ( 558 | SELECT ccz.county 559 | FROM tayloro.colorado_city_county_zip ccz 560 | WHERE TRIM(LOWER({staging_table}.principalcity)) = TRIM(LOWER(ccz.city)) 561 | AND TRIM(CAST({staging_table}.principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 562 | ) 563 | WHERE principalstate = 'CO' 564 | """ 565 | } 566 | ) 567 | 568 | process_unmatched_entities_task = PythonOperator( 569 | task_id="process_unmatched_entities_task", 570 | python_callable=process_unmatched_entities_task, 571 | op_kwargs={ 572 | 
'staging': staging_table, 573 | 'geo_table': geo_validation_staging_table, 574 | 'api_key': geoapify_key 575 | } 576 | ) 577 | 578 | check_validated_address_combos_task = PythonOperator( 579 | task_id="check_validated_address_combos", 580 | python_callable=check_validated_address_combos, 581 | op_kwargs={ 582 | "staging": staging_table, # resolved staging table name 583 | "geo_table": geo_validation_staging_table, # resolved temporary geo validation table name 584 | "added_city_county_zip": added_city_county_zip 585 | } 586 | ) 587 | 588 | insert_validated_address_combos_task = PythonOperator( 589 | task_id="insert_validated_address_combos", 590 | python_callable=execute_trino_query, 591 | op_kwargs={ 592 | 'query': f""" 593 | INSERT INTO tayloro.colorado_city_county_zip (zip_code, city, county) 594 | SELECT a.zip_code, a.city, a.county 595 | FROM {added_city_county_zip} a 596 | WHERE a.zip_code IS NOT NULL 597 | AND a.city IS NOT NULL 598 | AND a.county IS NOT NULL 599 | AND NOT EXISTS ( 600 | SELECT 1 601 | FROM tayloro.colorado_city_county_zip r 602 | WHERE r.zip_code = a.zip_code 603 | AND TRIM(LOWER(r.city)) = TRIM(LOWER(a.city)) 604 | AND TRIM(LOWER(r.county)) = TRIM(LOWER(a.county)) 605 | ) 606 | """ 607 | } 608 | ) 609 | 610 | 611 | override_from_validated_addresses = PythonOperator( 612 | task_id="override_from_validated_addresses", 613 | python_callable=execute_trino_query, 614 | op_kwargs={ 615 | 'query': f""" 616 | UPDATE tayloro.colorado_business_entities 617 | SET 618 | principalcity = ( 619 | SELECT geo_city 620 | FROM tayloro.colorado_temp_geocoded_entities 621 | WHERE entityid = tayloro.colorado_business_entities.entityid 622 | ), 623 | principalzipcode = ( 624 | SELECT geo_postcode 625 | FROM tayloro.colorado_temp_geocoded_entities 626 | WHERE entityid = tayloro.colorado_business_entities.entityid 627 | ), 628 | principalcounty = ( 629 | SELECT geo_county 630 | FROM tayloro.colorado_temp_geocoded_entities 631 | WHERE entityid = 
tayloro.colorado_business_entities.entityid 632 | ) 633 | WHERE entityid IN ( 634 | SELECT entityid 635 | FROM tayloro.colorado_temp_geocoded_entities 636 | ) 637 | AND ( 638 | principalcity <> ( 639 | SELECT geo_city 640 | FROM tayloro.colorado_temp_geocoded_entities 641 | WHERE entityid = tayloro.colorado_business_entities.entityid 642 | ) 643 | OR principalzipcode <> ( 644 | SELECT geo_postcode 645 | FROM tayloro.colorado_temp_geocoded_entities 646 | WHERE entityid = tayloro.colorado_business_entities.entityid 647 | ) 648 | OR principalcounty <> ( 649 | SELECT geo_county 650 | FROM tayloro.colorado_temp_geocoded_entities 651 | WHERE entityid = tayloro.colorado_business_entities.entityid 652 | ) 653 | ) 654 | """ 655 | } 656 | ) 657 | 658 | retry_update_county_on_staging_table = PythonOperator( 659 | task_id="retry_update_county_on_staging_table", 660 | python_callable=execute_trino_query, 661 | op_kwargs={ 662 | 'query': f""" 663 | UPDATE {staging_table} 664 | SET principalcounty = ( 665 | SELECT ccz.county 666 | FROM tayloro.colorado_city_county_zip ccz 667 | WHERE TRIM(LOWER({staging_table}.principalcity)) = TRIM(LOWER(ccz.city)) 668 | AND TRIM(CAST({staging_table}.principalzipcode AS VARCHAR)) = TRIM(CAST(ccz.zip_code AS VARCHAR)) 669 | ) 670 | WHERE principalstate = 'CO' 671 | AND principalcounty IS NULL 672 | """ 673 | } 674 | ) 675 | 676 | identify_duplicates_task = PythonOperator( 677 | task_id="identify_duplicates_task", 678 | python_callable=execute_trino_query, 679 | op_kwargs={ 680 | 'query': f""" 681 | INSERT INTO {dupes_staging_table} 682 | SELECT DISTINCT 683 | s.entityid, 684 | s.entityname, 685 | s.principaladdress1, 686 | s.principaladdress2, 687 | s.principalcity, 688 | s.principalcounty, 689 | s.principalstate, 690 | s.principalzipcode, 691 | s.principalcountry, 692 | s.entitystatus, 693 | s.jurisdictonofformation, 694 | s.entitytype, 695 | s.entityformdate 696 | FROM {staging_table} s 697 | INNER JOIN {production_table} p 698 | ON 
s.entityname = p.entityname 699 | AND s.principaladdress1 = p.principaladdress1 700 | AND s.principaladdress2 = p.principaladdress2 701 | AND s.principalcity = p.principalcity 702 | AND s.principalcounty = p.principalcounty 703 | AND s.principalstate = p.principalstate 704 | AND s.principalzipcode = p.principalzipcode 705 | AND s.principalcountry = p.principalcountry 706 | AND s.entitystatus = p.entitystatus 707 | AND s.jurisdictonofformation = p.jurisdictonofformation 708 | AND s.entitytype = p.entitytype 709 | AND s.entityformdate = p.entityformdate 710 | """ 711 | } 712 | ) 713 | 714 | run_dq_check_null_values = PythonOperator( 715 | task_id="run_dq_check_null_values", 716 | python_callable=run_trino_query_dq_check, 717 | op_kwargs={ 718 | 'query': f""" 719 | SELECT COUNT(*) 720 | FROM {staging_table} 721 | WHERE entityname IS NULL 722 | OR principalcity IS NULL 723 | OR entitytype IS NULL 724 | OR entityformdate IS NULL 725 | """ 726 | } 727 | ) 728 | 729 | run_dq_check_principalstate = PythonOperator( 730 | task_id="run_dq_check_principalstate", 731 | python_callable=run_trino_query_dq_check, 732 | op_kwargs={ 733 | 'query': f""" 734 | SELECT COUNT(*) 735 | FROM {staging_table} 736 | WHERE principalstate <> 'CO' OR principalstate IS NULL 737 | """ 738 | } 739 | ) 740 | 741 | run_dq_check_entityformdate = PythonOperator( 742 | task_id="run_dq_check_entityformdate", 743 | python_callable=run_trino_query_dq_check, 744 | op_kwargs={ 745 | 'query': f""" 746 | SELECT COUNT(*) 747 | FROM {staging_table} 748 | WHERE entityformdate <> DATE('{yesterday}') OR entityformdate IS NULL 749 | """ 750 | } 751 | ) 752 | 753 | run_dq_check_no_unexpected_duplicates = PythonOperator( 754 | task_id="run_dq_check_no_unexpected_duplicates", 755 | python_callable=run_trino_query_dq_check, 756 | op_kwargs={ 757 | 'query': f""" 758 | SELECT entityid, COUNT(*) AS count 759 | FROM {staging_table} 760 | WHERE entityid NOT IN ( 761 | SELECT entityid FROM {dupes_staging_table} 762 | ) 763 | GROUP 
BY entityid 764 | HAVING COUNT(*) > 1 765 | """ 766 | } 767 | ) 768 | 769 | insert_new_records_task = PythonOperator( 770 | task_id="insert_new_records_task", 771 | python_callable=execute_trino_query, 772 | op_kwargs={ 773 | 'query': f""" 774 | INSERT INTO {production_table} ( 775 | entityid, entityname, principaladdress1, principaladdress2, principalcity, 776 | principalcounty, principalstate, principalzipcode, principalcountry, entitystatus, 777 | jurisdictonofformation, entitytype, entityformdate 778 | ) 779 | SELECT 780 | entityid, entityname, principaladdress1, principaladdress2, principalcity, 781 | principalcounty, principalstate, principalzipcode, principalcountry, entitystatus, 782 | jurisdictonofformation, entitytype, entityformdate 783 | FROM {staging_table} 784 | WHERE entityid NOT IN ( 785 | SELECT entityid FROM {dupes_staging_table} 786 | ) 787 | AND principalcounty IS NOT NULL 788 | """ 789 | } 790 | ) 791 | 792 | insert_business_entities_with_ranking = PythonOperator( 793 | task_id="insert_business_entities_with_ranking", 794 | python_callable=execute_trino_query, 795 | op_kwargs={ 796 | 'query': f""" 797 | INSERT INTO {business_entities_with_ranking} 798 | SELECT 799 | stg.*, 800 | fr.crime_rank, 801 | fr.income_rank, 802 | fr.population_rank, 803 | fr.final_rank 804 | FROM {staging_table} stg 805 | LEFT JOIN {final_county_tier_rank} fr 806 | ON LOWER(stg.principalcounty) = LOWER(fr.county) 807 | WHERE stg.entityid NOT IN ( 808 | SELECT entityid FROM {dupes_staging_table} 809 | ) 810 | AND stg.principalcounty IS NOT NULL 811 | """ 812 | } 813 | ) 814 | 815 | 816 | insert_duplicates = PythonOperator( 817 | task_id="insert_duplicates", 818 | python_callable=execute_trino_query, 819 | op_kwargs={ 820 | 'query': f""" 821 | INSERT INTO {dupes_table} 822 | SELECT 823 | entityid, 824 | entityname, 825 | principaladdress1, 826 | principaladdress2, 827 | principalcity, 828 | principalcounty, 829 | principalstate, 830 | principalzipcode, 831 | 
principalcountry, 832 | entitystatus, 833 | jurisdictonofformation, 834 | entitytype, 835 | entityformdate 836 | FROM {dupes_staging_table} 837 | """ 838 | } 839 | ) 840 | 841 | insert_misfits = PythonOperator( 842 | task_id="insert_misfits", 843 | python_callable=execute_trino_query, 844 | op_kwargs={ 845 | 'query': f""" 846 | INSERT INTO {misfit_entities_table} 847 | SELECT 848 | entityid, 849 | entityname, 850 | principaladdress1, 851 | principaladdress2, 852 | principalcity, 853 | principalcounty, 854 | principalstate, 855 | principalzipcode, 856 | principalcountry, 857 | entitystatus, 858 | jurisdictonofformation, 859 | entitytype, 860 | entityformdate 861 | FROM {staging_table} 862 | WHERE principalcounty IS NULL 863 | """ 864 | } 865 | ) 866 | 867 | drop_staging_table_task = PythonOperator( 868 | task_id="drop_staging_table_task", 869 | python_callable=execute_trino_query, 870 | op_kwargs={ 871 | 'query': f"DROP TABLE IF EXISTS {staging_table}" 872 | } 873 | ) 874 | 875 | drop_dupes_staging_table_task = PythonOperator( 876 | task_id="drop_dupes_staging_table_task", 877 | python_callable=execute_trino_query, 878 | op_kwargs={ 879 | 'query': f"DROP TABLE IF EXISTS {dupes_staging_table}" 880 | } 881 | ) 882 | 883 | drop_geo_validation_staging_table = PythonOperator( 884 | task_id="drop_geo_validation_staging_table", 885 | python_callable=execute_trino_query, 886 | op_kwargs={ 887 | 'query': f"DROP TABLE IF EXISTS {geo_validation_staging_table}" 888 | } 889 | ) 890 | 891 | drop_added_city_county_zip_staging_table = PythonOperator( 892 | task_id="drop_added_city_county_zip_staging_table", 893 | python_callable=execute_trino_query, 894 | op_kwargs={ 895 | 'query': f"DROP TABLE IF EXISTS {added_city_county_zip}" 896 | } 897 | ) 898 | 899 | ( 900 | create_business_entities_yesterday_stage 901 | >> create_misfit_business_entities_table 902 | >> create_dupes_staging_table 903 | >> create_duplicates_table 904 | >> create_geo_validation_staging_table 905 | >> 
create_addded_city_county_zip_staging_table 906 | >> fetch_and_insert_business_entities_yesterday 907 | >> filter_invalid_addresses 908 | >> reformat_zip_code 909 | >> update_county_on_staging_table 910 | >> process_unmatched_entities_task 911 | >> check_validated_address_combos_task 912 | >> insert_validated_address_combos_task 913 | >> override_from_validated_addresses 914 | >> retry_update_county_on_staging_table 915 | >> identify_duplicates_task 916 | >> run_dq_check_null_values 917 | >> run_dq_check_principalstate 918 | >> run_dq_check_entityformdate 919 | >> run_dq_check_no_unexpected_duplicates 920 | >> insert_new_records_task 921 | >> insert_business_entities_with_ranking 922 | >> insert_duplicates 923 | >> insert_misfits 924 | >> drop_staging_table_task 925 | >> drop_dupes_staging_table_task 926 | >> drop_geo_validation_staging_table 927 | >> drop_added_city_county_zip_staging_table 928 | ) 929 | 930 | 931 | business_entity_dag() 932 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
Screenshot 2025-02-27 at 6 01 10 PM

A data engineering capstone project that builds a program for Colorado's OEDIT (Office of Economic Development and International Trade) to allocate security system subsidies to businesses using historical crime, population, and income data trends from 1997-2020.

### About

> 🟢 **UPDATE –** On 5/22/25, Zach informed me that my Data Engineering capstone was selected as **Best Capstone** in the bootcamp! His feedback is below:

---

> **“Taylor Ortiz’s capstone project, which won Best Capstone in our January 2025 Data Engineering bootcamp, demonstrated exceptional skills in data processing, quality validation, and solving real-world business challenges. His pipeline effectively addressed entity search and subsidy classification while handling data inconsistencies and deduplication. The well-structured documentation, including DAG runs and dashboards, highlighted his strong technical abilities and clear communication. Taylor would be a valuable asset to any data engineering team.”**
>
> — *Zach Wilson - dataexpert.io*

My name is Taylor Ortiz and I enrolled in Zach Wilson's Dataexpert.io Data Engineering bootcamp as part of the January 2025 cohort. Part of the requirements for achieving the highest certification in the bootcamp is to complete a capstone project that showcases the skills attained during the bootcamp and the general skills that I possess as a data engineer and data architect. The following capstone business case, KPIs, and use cases are fictional and made up by me for the sake of the exercise (although I think it has merit for an actual State of Colorado OEDIT program one day). The technology stack used was built entirely from scratch using publicly accessible data from [data.colorado.gov](https://data.colorado.gov) and [CDPHE Open Data](https://data-cdphe.opendata.arcgis.com/). Enjoy!
### Features

* 9,048,771 rows of source data extracted
* A 28-task DAG (Directed Acyclic Graph) using Apache Airflow data pipeline orchestration, running in an Astronomer cloud production environment
* Data visualization of 16 required KPIs and use cases, as well as overall Colorado crime density, using Grafana Labs dashboards
* A user experience for Colorado business entities to search for their business, understand the tier structure, and discover what subsidy tier their business qualifies for using Grafana Labs dashboards
* Apache Spark jobs with Amazon Web Services (AWS) and the Geoapify Geocoding API
* Medallion architecture data design patterns
* Comprehensive architecture diagram
* Comprehensive data dictionary displaying all data sources and manipulations used

## Table of Contents
1. [Capstone Requirements](#capstone-requirements)
2. [Project Overview](#project-overview)
   1. [Problem Statement](#problem-statement)
   2. [KPIs and Use Cases](#kpis-and-use-cases)
      1. [County Level](#county-level)
      2. [City Level](#city-level)
   3. [Data Sources](#data-sources)
   4. [Data Tables](#data-tables)
   5. [Technologies Used](#technologies-used)
   6. [Architecture](#architecture)
3. [Business Entity Tier Ranking](#business-entity-tier-ranking)
   1. [Subsidy Tiers](#subsidy-tiers)
   2. [How Tiers Are Calculated and Assigned](#how-tiers-are-calculated-and-assigned)
      1. [Step 1: Identify Criteria](#step-1-identify-criteria)
      2. [Step 2: Establish Individual Rankings](#step-2-establish-individual-rankings)
      3. [Step 3: Merge Individual Crime Rankings to Form a Unified Crime Tier County Rank](#merge-individual-crime-rankings)
      4. [Step 4: Merge Rankings for Final County Tier Rank](#final-county-tier)
      5. [Step 5: Backfill Business Entities Table with Final Ranking](#backfill-business-entities)
4. [KPI and Use Case Visualizations](#kpi-and-use-case-visualizations)
   1. [Colorado County Dashboard](#colorado-county-dashboard)
   2. [Colorado City Dashboard](#colorado-city-dashboard)
   3. [Colorado Crime Density Dashboard](#colorado-crime-density-dashboard)
5. [Business Entity Data Pipeline](#business-entity-data-pipeline)
   1. [Business Entity Daily DAG](#business-entity-daily-dag)
   2. [Task Descriptions](#task-descriptions)
   3. [DAG Flow Order](#dag-flow-order)
   4. [Astronomer Cloud Deployment](#astronomer-cloud-deployment)
6. [Putting It All Together](#putting-it-all-together)
   1. [Business Entity Search Dashboard](#business-entity-search-dashboard)
7. [Challenges and Findings](#challenges-and-findings)
   1. [In-depth Summary of Business Entities Data Cleaning and Update Process](#business-entity-data-cleaning)
   2. [Out-of-Memory (OOM) Exceptions and Spark Configuration Adjustments](#out-of-memory-spark)
   3. [Challenge: Ambiguous City and County Matching](#ambiguous-matching)
   4. [Inconsistent Zip Code Formats Causing Casting Errors](#inconsistent-zip-format)
8. [Next Steps and Closing Thoughts](#next-steps-and-closing-thoughts)

## Capstone Requirements

* Identify a problem you'd like to solve.
* Scope the project and choose datasets.
* Use at least 2 different data sources and formats (e.g., CSV, JSON, APIs), with at least 1 million rows.
* Define your use case: analytics table, relational database, dashboard, etc.
* Choose your tech stack and justify your choices.
* Explore and assess the data.
* Identify and resolve data quality issues.
* Document your data cleaning steps.
* Create a data model diagram and dictionary.
* Build ETL pipelines to transform and load the data.
* Run pipelines and perform data quality checks.
* Visualize or present the data through dashboards, applications, or reports.
* Your data pipeline must be running in a production cloud environment.

## Project Overview

### Problem Statement

* The State of Colorado has hired your data firm to develop internal back-office visualizations and a front-facing business user experience to support a critical initiative for the Office of Economic Development and International Trade (OEDIT). The program, B.A.S.E. (Business Assistance for Security Enhancements), is designed to evaluate and qualify businesses across the state for security system subsidies to protect their business based on historical crime trends, income, and population data by city and county. Historical data from 1997 to 2020, as well as new data extracted daily, will be analyzed to determine which security tiers businesses qualify for.
* Additionally, the State of Colorado has sixteen KPIs and use cases they would like to see visualized out of the extracted, transformed, and aggregated data to assist them in planning, funding, and outreach initiatives.
* This tool will allow OEDIT to automatically notify active businesses in good standing about the subsidies they qualify for, streamlining the application process for business owners seeking to participate. Lastly, the State requires the creation of an intuitive user experience that enables businesses to search for their assigned tier, providing them with easy access to their eligibility information.

### KPIs and Use Cases
<details>
<summary>County Level KPIs and Use Cases</summary>

1. **KPI:** Set a goal for each police agency within a given county to reduce crime by 5% within the next fiscal year, based on a baseline crime count from historical data.
   - **Use Case:** Calculate crime count for police agencies in a given county to get a baseline number.

2. **KPI:** Bring 100% free self-defense and resident safety programs to lower-income counties.
   - **Use Case:** Calculate the average median household income per county.

3. **KPI:** Evaluate additional police agency support required in counties that show a correlation between rising population and rising crime.
   - **Use Case:** Show population trends for each county alongside corresponding crime trends by year.

4. **KPI:** Identify counties with an average annual population growth rate exceeding 10% from 1997–2020, and adjust state support allocations to improve funding alignment for resident programs.
   - **Use Case:** Show population trends for each county from 1997–2020.

5. **KPI:** Inform police agencies of which crime categories are most prevalent in their respective county.
   - **Use Case:** Calculate the totals for each crime category per county from 1997 to 2020 to gauge frequency.

6. **KPI:** Enable police agencies to be proactive by identifying which months of the year have a higher volume of crime.
   - **Use Case:** Show average seasonal crime trends for each month by county.

7. **KPI:** Provide an interactive visual representation of crime density across counties for the entire state.
   - **Use Case:** Create a geo-map of crime density for counties from 1997–2020.

8. **KPI:** Drive further marketing outreach for supportive safety programs in lower-income counties.
   - **Use Case:** Show average income per capita for counties.

9. **KPI:** Display crime rates compared to median household income to pinpoint high-risk areas.
   - **Use Case:** Compare crime data with median household income for each county.

10. **KPI:** Illustrate crime type distribution by county by Property, Person, and Society to better understand where to best allocate safety resources.
    - **Use Case:** Analyze the distribution of different crime types across each county from 1997–2020 for Property, Person, and Society crimes.

</details>
<details>
<summary>City Level KPIs and Use Cases</summary>

1. **KPI:** Deploy an interactive dashboard displaying seasonal crime trends for each city to detect seasonal peaks to guide targeted patrol planning.
   - **Use Case:** Show seasonal crime trends for the year in each city.

2. **KPI:** Identify and report the top three crime categories most prevalent during daytime (6 AM–6 PM) in each city to optimize resource allocation during peak hours.
   - **Use Case:** Determine which crimes are more likely to happen during the day for a city.

3. **KPI:** Calculate the average age of individuals involved in each crime category across cities to support the development of tailored intervention programs.
   - **Use Case:** Compute the average age for crime categories across cities.

4. **KPI:** Identify and report the top three crime categories most common at night (6 PM–6 AM) in each city to inform optimized night patrol scheduling.
   - **Use Case:** Determine which crimes are more likely to happen at night for a city.

5. **KPI:** Develop a dynamic visualization that shows the percentage distribution of crime types by city by Property, Person, and Society to support targeted law enforcement initiatives.
   - **Use Case:** Display crime type distribution by city.

6. **KPI:** Provide an analysis dashboard showing average crime trends by day of the week for each city, highlighting peak crime days to drive strategic patrol scheduling.
   - **Use Case:** Show crime trends on average by day of the week to determine when to patrol more.

</details>
### Data Sources

- [Crimes in Colorado (2016-2020)](https://data.colorado.gov/Public-Safety/Crimes-in-Colorado/j6g4-gayk/about_data): Offenses in Colorado for 2016 through 2020 by Agency from the FBI's Crime Data Explorer.
  - Number of rows: 3.1M
- [Crimes in Colorado (1997-2015)](https://data.colorado.gov/Public-Safety/Crimes-in-Colorado-1997-to-2015/6vnq-az4b/about_data): Crime stats for the State of Colorado from 1997 to 2015. Data provided by the CDPS and the FBI's Crime Data Explorer (CDE).
  - Number of rows: 4.95M
- [Personal Income in Colorado](https://data.colorado.gov/Labor-and-Employment/Personal-Income-in-Colorado/2cpa-vbur/about_data): Income (per capita or total) for each county by year with rank and population. From the Colorado Department of Labor and Employment (CDLE), since 1969.
  - Number of rows: 10k
- [Population Projections in Colorado](https://data.colorado.gov/Demographics/Population-Projections-in-Colorado/q5vp-adf3/about_data): Actual and predicted population data by gender and age from the Department of Local Affairs (DOLA), from 1990 to 2040.
  - Number of rows: 382k
- [Business Entities in Colorado](https://data.colorado.gov/Business/Business-Entities-in-Colorado/4ykn-tg5h/about_data): Colorado Business Entities (corporations, LLCs, etc.) registered with the Colorado Department of State (CDOS) since 1864.
  - Number of rows: 2.81M
- [Colorado County Boundaries](https://data-cdphe.opendata.arcgis.com/datasets/CDPHE::colorado-county-boundaries/about): This feature class contains county boundaries for all 64 Colorado counties and 2010 US Census attributes data describing the population within each county.
  - Number of rows: 64

### Data Tables
<details>
<summary>Bronze Layer (raw) data</summary>

1. **colorado_business_entities_raw**
   - **Number of rows:** 984,368
2. **colorado_city_county_zip_raw**
   - **Number of rows:** 760
3. **colorado_county_coordinates_raw**
   - **Number of rows:** 64
4. **colorado_crimes_1997_2015_raw**
   - **Number of rows:** 4,952,282
5. **colorado_crimes_2016_2020_raw**
   - **Number of rows:** 3,101,365
6. **colorado_income_raw**
   - **Number of rows:** 9,932

</details>
<details>
<summary>Silver Layer (transform) data</summary>

1. **colorado_business_entities**
   - **Number of rows:** 817,324 (grows via the Airflow pipeline)
2. **colorado_business_entities_duplicates**
   - **Number of rows:** 721 (grows via the Airflow pipeline)
3. **colorado_business_entities_misfit_entities**
   - **Number of rows:** 71 (grows via the Airflow pipeline)
4. **colorado_business_entities_with_ranking**
   - **Number of rows:** 817,324 (grows via the Airflow pipeline)
5. **colorado_crime_population_trends**
   - **Number of rows:** 1,536
6. **colorado_crime_tier_arson**
   - **Number of rows:** 72
7. **colorado_crime_tier_burglary**
   - **Number of rows:** 79
8. **colorado_crime_tier_county_rank**
   - **Number of rows:** 79
9. **colorado_crime_tier_larceny_theft**
   - **Number of rows:** 79
10. **colorado_crime_tier_property_destruction**
    - **Number of rows:** 79
11. **colorado_crime_tier_robbery**
    - **Number of rows:** 70
12. **colorado_crime_tier_stolen_property**
    - **Number of rows:** 75
13. **colorado_crime_tier_vehicle_theft**
    - **Number of rows:** 79
14. **colorado_crimes**
    - **Number of rows:** 8,053,647
15. **colorado_crimes_2016_2020_with_cities**
    - **Number of rows:** 2,771,984
16. **colorado_crimes_with_cities**
    - **Number of rows:** 7,724,266
17. **colorado_final_county_tier_rank**
    - **Number of rows:** 64
18. **colorado_income_1997_2020**
    - **Number of rows:** 4,604
19. **colorado_income_household_tier_rank**
    - **Number of rows:** 64
20. **colorado_population_raw**
    - **Number of rows:** 381,504
21. **colorado_temp_geocoded_entities**
    - **Number of rows:** 2,033

</details>
246 | Gold Layer (aggregate) data 247 | 248 | 1. **colorado_avg_age_per_crime_category_1997_2020** 249 | - **Number of rows:** 3,154 250 | 2. **colorado_city_crime_counts_1997_2020** 251 | - **Number of rows:** 182 252 | 3. **colorado_city_crime_time_likelihood** 253 | - **Number of rows:** 3,206 254 | 4. **colorado_city_seasonal_crime_rates_1997_2020** 255 | - **Number of rows:** 2,196 256 | 5. **colorado_county_agency_crime_counts** 257 | - **Number of rows:** 526 258 | 6. **colorado_county_average_income_1997_2020** 259 | - **Number of rows:** 64 260 | 7. **colorado_county_crime_per_capita_1997_2020** 261 | - **Number of rows:** 64 262 | 8. **colorado_county_crime_per_capita_with_coordinates** 263 | - **Number of rows:** 64 264 | 9. **colorado_county_crime_vs_population_1997_2020** 265 | - **Number of rows:** 1,665 266 | 10. **colorado_county_seasonal_crime_rates_1997_2020** 267 | - **Number of rows:** 911 268 | 11. **colorado_crime_category_totals_per_county_1997_2020** 269 | - **Number of rows:** 1,520 270 | 12. **colorado_crime_type_distribution_by_city** 271 | - **Number of rows:** 549 272 | 13. **colorado_crime_vs_median_household_income_1997_2020** 273 | - **Number of rows:** 1,536 274 | 14. **colorado_population_crime_per_capita_rank** 275 | - **Number of rows:** 64 276 | 15. **colorado_total_population_per_county_1997_2020** 277 | - **Number of rows:** 1,536 278 | 16. **coloraodo_county_crime_rate_per_capita** 279 | - **Number of rows:** 1,458 280 |
281 | 282 | Access the entire [capstone data dictionary](https://github.com/taylor-ortiz/dataexpert-data-engineering-capstone/blob/main/Capstone-data-dictionary.csv) for more detailed info on these datasets used. 283 | 284 | ### Technologies Used 285 | - **Python:** Programming language used for Spark jobs with AWS and Iceberg loads, Geoapify APIs and python callables in Airflow orchestration 286 | - **SQL:** fundamental for querying, transforming, and aggregating structured data. 287 | - **Apache Airflow:** orchestration tool that manages, schedules and automates the daily busines entity extract from the Colorado Information Marketplace 288 | - **AWS:** S3 bucket storage for CSV extracts 289 | - **Tabular:** data lake operations 290 | - **Trino:** distributed SQL query engine for medallion architecture 291 | - **Geoapify:** geocoding api for address validation of business entities 292 | - **Grafana:** dashboards for data visualizations of KPIs, use cases and business user experience 293 | - **Astronomer:** data orchestraton platform that provides a managed service for Apache Airflow orchestrations in the cloud 294 | - **Apache Spark:** open source distributed computing system for processing source extracts from AWS to Iceberg and backfilling / verifying address data with Geoapify 295 | 296 | ### Architecture 297 | ![B A S E Future State Diagram (4)](https://github.com/user-attachments/assets/78f8aaf2-af71-44d2-a08c-52d9b0d766da) 298 | 299 | ## Business Entity Tier Ranking 300 | 301 | ### Subsidy Tiers 302 | 303 | Tiers are represented as a range of 1 through 4 in the B.A.S.E. program. 1 indicates the lowest level need and 4 indicates the highest level need for security system subsidies. Each subsidy tier below offers a gradual increase of security system services based on the tier that your business qualifies for. 
The idea is that tiers will be backfilled and assigned on the source business entities dataset and as new business entities are added daily to the Colorado Information Marketplace, those records will also be assigned a tier through the Airflow orchestration that we will cover below. However, how are tiers actually calculated and assigned? Read on! 304 | 305 | Screenshot 2025-02-28 at 4 44 20 PM 306 | 307 | ### How Tiers Are Calculated and Assigned 308 | 309 |
<details>
<summary>Step 1: Identify Criteria</summary>

- **Crimes:**
  Out of all the crime categories that exist, I chose 7 that closely resemble property-related or adjacent crimes that would factor into the ranking:
  - Destruction/Damage/Vandalism of Property
  - Burglary/Breaking & Entering
  - Larceny/Theft Offenses
  - Motor Vehicle Theft
  - Robbery
  - Arson
  - Stolen Property Offenses

- **Population:**
  Identify crime per capita (per 1,000 residents) for all counties between the years of 1997 and 2020.

- **Income:**
  Identify median household income for all counties between the years of 1997 and 2020.

</details>
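The population criterion above reduces to a simple normalization of crime counts by residents. A minimal sketch of that calculation (the function name and sample numbers are mine for illustration, not taken from the pipeline):

```python
def crime_rate_per_1000(crime_count: int, population: int) -> float:
    """Return crime incidents per 1,000 residents for one county-year."""
    if population <= 0:
        raise ValueError("population must be positive")
    return crime_count / population * 1000

# Hypothetical county-year: 4,200 qualifying incidents among 120,000 residents
rate = crime_rate_per_1000(4200, 120000)  # 35.0 incidents per 1,000 residents
```

In the pipeline this happens in SQL by joining the crime tables to the population tables, but the per-1,000 scaling is the same.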
<details>
<summary>Step 2: Establish Individual Rankings</summary>

For each metric, we transform raw data into a standardized ranking by following a similar process:

#### Crime Categories
For each of the 7 selected crime categories:
- **Aggregate Data:**
  Count the number of incidents per county.
- **Compute Percentile Ranks:**
  Use a percentile function (e.g., `PERCENT_RANK()`) to determine each county’s standing relative to others.
- **Assign Tiers:**
  Counties in the top 25% (percentile >= 0.75) receive an assignment of 4; counties at or above the 50th percentile but below the 75th receive a 3; counties at or above the 25th but below the 50th receive a 2; and counties below the 25th percentile receive a 1.

#### Population-Adjusted Crime (Crime Per Capita)
We adjust raw crime counts by county population to calculate the crime rate per 1,000 residents. This involves:
- **Calculating Yearly Crime Rates:**
  Join crime data with population figures.
- **Averaging:**
  Compute an average crime rate per county over the selected years.
- **Ranking:**
  The same rubric applies: the top 25% receive a 4, the 50th–75th percentile band receives a 3, the 25th–50th band receives a 2, and the bottom 25% receive a 1.

#### Income
For median household income, we:
- **Aggregate Income Data:**
  Average the income per county across the years.
- **Compute Percentile Ranks:**
  Establish how each county compares to others.
- **Assign Tiers:**
  Income is the inverse of the previous two. We actually want to rank the highest-income counties the lowest because they would require less subsidy assistance than lower-income counties. Therefore, counties in the top 25% receive an assignment of 1, the 50th–75th percentile band receives a 2, the 25th–50th band receives a 3, and the bottom 25% receive a 4.

Below is an example distribution. While this is not a perfect distribution, I was pretty happy with the result from a public dataset. With a mostly even distribution based on percentage, we can start assigning tiers based on the rubric mentioned above.

Screenshot 2025-02-28 at 5 24 50 PM
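The percentile-to-tier rubric, including its inversion for income, can be sketched in a few lines of Python. This is an illustration of the rules as stated above, not code from the pipeline:

```python
def tier_from_percentile(p: float, inverse: bool = False) -> int:
    """Map a PERCENT_RANK() value in [0, 1] to a subsidy tier 1-4.

    For crime and crime per capita, a high percentile means high need (tier 4).
    For income the scale is inverted: high-income counties get tier 1.
    """
    if p >= 0.75:
        tier = 4
    elif p >= 0.50:
        tier = 3
    elif p >= 0.25:
        tier = 2
    else:
        tier = 1
    return 5 - tier if inverse else tier

tier_from_percentile(0.90)                # 4: a top-25% crime county
tier_from_percentile(0.90, inverse=True)  # 1: a top-25% income county
```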
Below is an example query for evaluating the Destruction/Damage/Vandalism of Property crime category and assigning a tier to counties:

```sql
CREATE TABLE tayloro.colorado_crime_tier_destruction_property AS
WITH crime_aggregated AS (
    -- Step 1: Aggregate property destruction crime counts per county
    SELECT
        county,
        COUNT(*) AS total_property_destruction_crimes
    FROM academy.tayloro.colorado_crimes
    WHERE offense_category_name = 'Destruction/Damage/Vandalism of Property'
    GROUP BY county
),
crime_percentiles AS (
    -- Step 2: Compute percentile rank for property destruction crime counts per county
    SELECT
        county,
        total_property_destruction_crimes,
        PERCENT_RANK() OVER (ORDER BY total_property_destruction_crimes) AS crime_percentile
    FROM crime_aggregated
),
crime_tiers AS (
    -- Step 3: Assign tiers based on percentile rankings
    SELECT
        county,
        total_property_destruction_crimes,
        crime_percentile,
        CASE
            WHEN crime_percentile >= 0.75 THEN 1 -- Counties with the highest counts (Top 25%)
            WHEN crime_percentile >= 0.50 THEN 2 -- Next 25% (50%-75%)
            WHEN crime_percentile >= 0.25 THEN 3 -- Next 25% (25%-50%)
            ELSE 4                               -- Counties with the lowest counts (Bottom 25%)
        END AS crime_tier
    FROM crime_percentiles
)
SELECT * FROM crime_tiers;
```

Screenshot 2025-02-28 at 9 07 04 PM

</details>
411 | Step 3: Merge Individual Crime Rankings to Form a Unified Crime Tier County Rank 412 |
413 | 414 | After calculating individual crime tiers for each of the 7 selected crime categories, the next step is to merge these rankings into a single unified score per county. This unified score, referred to as the **overall crime tier**, is computed by averaging the individual tiers for each county and then rounding the result to the nearest whole number. The process involves the following steps: 415 | 416 | - **Unioning the Individual Tables:** 417 | All individual crime tier tables (e.g., for property destruction, burglary, larceny/theft, etc.) are combined using a `UNION ALL`. Each record is tagged with its corresponding table name for identification. 418 | 419 | - **Pivoting and Aggregating Data:** 420 | The combined dataset is grouped by county. For each crime category, the query extracts the crime tier and computes the average of these tiers across all categories. 421 | 422 | - **Final Ordering:** 423 | The resulting unified table is sorted by the overall crime tier in descending order, thereby highlighting counties with the highest aggregated crime tiers. 
Below is the SQL query that performs these operations:

```sql
CREATE TABLE tayloro.colorado_crime_tier_county_rank AS
WITH county_scores AS (
    SELECT
        county_name,
        -- Select the crime tiers for each category
        MAX(CASE WHEN table_name = 'colorado_crime_tier_property_destruction' THEN crime_tier ELSE NULL END) AS property_destruction_tier,
        MAX(CASE WHEN table_name = 'colorado_crime_tier_burglary' THEN crime_tier ELSE NULL END) AS burglary_tier,
        MAX(CASE WHEN table_name = 'colorado_crime_tier_larceny_theft' THEN crime_tier ELSE NULL END) AS larceny_theft_tier,
        MAX(CASE WHEN table_name = 'colorado_crime_tier_vehicle_theft' THEN crime_tier ELSE NULL END) AS vehicle_theft_tier,
        MAX(CASE WHEN table_name = 'colorado_crime_tier_robbery' THEN crime_tier ELSE NULL END) AS robbery_tier,
        MAX(CASE WHEN table_name = 'colorado_crime_tier_arson' THEN crime_tier ELSE NULL END) AS arson_tier,
        MAX(CASE WHEN table_name = 'colorado_crime_tier_stolen_property' THEN crime_tier ELSE NULL END) AS stolen_property_tier,
        -- Calculate the average of all crime tiers
        ROUND(
            AVG(
                CASE
                    WHEN table_name IN (
                        'colorado_crime_tier_property_destruction',
                        'colorado_crime_tier_burglary',
                        'colorado_crime_tier_larceny_theft',
                        'colorado_crime_tier_vehicle_theft',
                        'colorado_crime_tier_robbery',
                        'colorado_crime_tier_arson',
                        'colorado_crime_tier_stolen_property'
                    ) THEN crime_tier
                    ELSE NULL
                END
            )
        ) AS overall_crime_tier -- Round to the nearest whole number
    FROM (
        -- Union all 7 crime tier tables
        SELECT county_name, 'colorado_crime_tier_property_destruction' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_property_destruction
        UNION ALL
        SELECT county_name, 'colorado_crime_tier_burglary' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_burglary
        UNION ALL
        SELECT county_name, 'colorado_crime_tier_larceny_theft' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_larceny_theft
        UNION ALL
        SELECT county_name, 'colorado_crime_tier_vehicle_theft' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_vehicle_theft
        UNION ALL
        SELECT county_name, 'colorado_crime_tier_robbery' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_robbery
        UNION ALL
        SELECT county_name, 'colorado_crime_tier_arson' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_arson
        UNION ALL
        SELECT county_name, 'colorado_crime_tier_stolen_property' AS table_name, crime_tier FROM academy.tayloro.colorado_crime_tier_stolen_property
    ) combined
    GROUP BY county_name
)
SELECT * FROM county_scores
ORDER BY overall_crime_tier DESC;
```

Screenshot 2025-02-28 at 9 20 22 PM
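Conceptually, the pivot-and-average step boils down to the following small Python sketch. The county names and tier values here are made up for illustration; they are not drawn from the real tables.

```python
from statistics import mean

# Hypothetical per-category crime tiers for two counties (illustrative only).
tiers = {
    "Denver": {"burglary": 5, "larceny_theft": 5, "robbery": 4},
    "Gilpin": {"burglary": 1, "larceny_theft": 2, "robbery": 1},
}

# Mirror of the SQL: average the per-category tiers for each county,
# then round to the nearest whole number to get the overall crime tier.
overall = {
    county: round(mean(by_category.values()))
    for county, by_category in tiers.items()
}

print(overall)  # {'Denver': 5, 'Gilpin': 1}
```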
### Step 4: Merge Rankings for Final County Tier Rank
In this step, we combine the overall crime tier county rank, the income tier rank, and the population (crime per capita) rank to generate a final composite ranking for each county. Since our primary focus is on crime, we weigh the crime rank a little heavier in our final calculation.

The merging process involves:
- **Combining Ranks:**
  We perform a full outer join on the three individual ranking tables to ensure every county is represented.
- **Handling Missing Values:**
  `COALESCE` is used to manage missing values by defaulting to 0 when a rank is not available.
- **Calculating the Final Rank:**
  The final rank is computed as the average of the three rankings. With crime being our primary focus, its rank is given slightly more influence in this average.
- **Final Ordering:**
  Counties are ordered by the final average rank (in ascending order) to highlight those with the highest overall tier.

Below is the SQL query that accomplishes these steps:

```sql
CREATE TABLE tayloro.colorado_final_county_tier_rank AS
WITH combined_ranks AS (
    SELECT
        COALESCE(c.county_name, i.county, p.county) AS county,
        COALESCE(c.overall_crime_tier, 0) AS crime_rank,
        COALESCE(i.income_tier, 0) AS income_rank,
        COALESCE(p.crime_per_capita_tier, 0) AS population_rank
    FROM academy.tayloro.colorado_crime_tier_county_rank c
    FULL OUTER JOIN academy.tayloro.colorado_income_household_tier_rank i
        ON LOWER(c.county_name) = LOWER(i.county)
    FULL OUTER JOIN academy.tayloro.colorado_population_crime_per_capita_rank p
        ON LOWER(c.county_name) = LOWER(p.county)
),
average_rank AS (
    SELECT
        county,
        crime_rank,
        income_rank,
        population_rank,
        ROUND(
            (crime_rank + income_rank + population_rank) /
            NULLIF((CASE WHEN crime_rank > 0 THEN 1 ELSE 0 END
                  + CASE WHEN income_rank > 0 THEN 1 ELSE 0 END
                  + CASE WHEN population_rank > 0 THEN 1 ELSE 0 END), 0)
        ) AS final_rank -- Calculate the average rank and round to the nearest whole number
    FROM combined_ranks
    WHERE crime_rank > 0 AND income_rank > 0 AND population_rank > 0 -- Ensure all ranks are present
)
SELECT * FROM average_rank
ORDER BY final_rank ASC; -- Order by the final average rank
```

Screenshot 2025-02-28 at 9 21 32 PM
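The `NULLIF`-guarded average can be sketched in plain Python. This is an illustration of the logic only, not code from the pipeline: only ranks that are present (non-zero after `COALESCE`) count toward the denominator, and an all-missing row yields no rank at all rather than a division-by-zero error.

```python
def final_rank(crime_rank, income_rank, population_rank):
    """Average only the ranks that are present; None if all are missing."""
    ranks = [crime_rank, income_rank, population_rank]
    present = [r for r in ranks if r > 0]
    if not present:                    # NULLIF(..., 0) -> NULL in the SQL
        return None
    return round(sum(present) / len(present))

print(final_rank(3, 5, 4))  # 4
print(final_rank(3, 0, 0))  # 3 -- only the crime rank contributes
```

In the query itself the `WHERE` clause already requires all three ranks to be positive, so the guard is defensive; it keeps the expression safe if that filter is ever relaxed.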
### Step 5: Backfill Business Entities Table with Final Ranking
In this final step, we join the final county tier rank table with the business entities table. This allows us to backfill all business entities with their corresponding ranking information based on county. The query uses a `LEFT JOIN` to ensure that every business entity is retained, matching on a case-insensitive comparison of county names.

```sql
CREATE TABLE tayloro.colorado_business_entities_with_ranking AS
SELECT
    be.*,
    fr.crime_rank,
    fr.income_rank,
    fr.population_rank,
    fr.final_rank
FROM tayloro.colorado_business_entities be
LEFT JOIN tayloro.colorado_final_county_tier_rank fr
    ON LOWER(be.principalcounty) = LOWER(fr.county);
```

Screenshot 2025-02-28 at 9 39 08 PM
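The `LEFT JOIN` semantics can be illustrated with a small Python sketch; the entity names and rank values below are hypothetical, not real rows from the tables.

```python
# Every business entity is kept; an entity whose county has no rank row
# simply gets None for the rank columns (the SQL NULL equivalent).
entities = [
    {"entityname": "Lost Coffee LLC", "principalcounty": "Arapahoe"},
    {"entityname": "Trail Gear Co", "principalcounty": "San Juan"},
]
ranks = {"arapahoe": {"final_rank": 3}}  # case-insensitive key, as in the SQL

enriched = [
    {**e, **ranks.get(e["principalcounty"].lower(), {"final_rank": None})}
    for e in entities
]
print([e["final_rank"] for e in enriched])  # [3, None]
```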
Below is a visualization of how these datasets were merged together to form a cohesive ranking across the three main categories.

![B A S E Future State Diagram (5)](https://github.com/user-attachments/assets/d665e2b9-ae7f-4bd7-85f0-6c5f5fedd3b6)

## KPI and Use Case Visualizations

These dashboards represent the 16 KPIs and use cases that were provided to the data firm by the state. Please note as well that Grafana allows us to create dynamic variables that let a user choose a city or county and fetch data for those values in real time. Since these are static images, that interactivity is not represented here, but it is a huge component of the user experience.

### Colorado County Dashboard

This county dashboard shows the following use cases:
- Calculate crime count for police agencies in a given county to get a baseline number.
- Calculate the average median household income per county.
- Show population trends for each county alongside corresponding crime trends by year.
- Show population trends for each county from 1997–2020.
- Calculate the totals for each crime category per county from 1997 to 2020 to gauge frequency.
- Show average seasonal crime trends for each month by county.
- Create a geo-map of crime density for counties from 1997–2020.
- Show average income per capita for counties.
- Compare crime data with median household income for each county.
- Analyze the distribution of different crime types across each county from 1997–2020 for Property, Person, and Society crimes.

Screenshot 2025-02-28 at 5 07 06 PM

### Colorado City Dashboard

This city dashboard shows the following use cases:

- Show seasonal crime trends for the year in each city.
- Determine which crimes are more likely to happen during the day for a city.
- Compute the average age for crime categories across cities.
- Determine which crimes are more likely to happen at night for a city.
- Display crime type distribution by city.
- Show crime trends on average by day of the week to determine when to patrol more.

Screenshot 2025-02-28 at 5 05 24 PM

### Colorado Crime Density Dashboard

This dashboard shows overall crime density as crime per capita (per 100k residents) across the entire State of Colorado. It's cool to see the larger areas represent the highest density, and it's not surprising that many of those are in the Denver metro area.

Screenshot 2025-02-23 at 9 33 19 PM

### Business Entity Data Pipeline

#### Business Entity Daily DAG

The final part of the equation is our data pipeline that evaluates newly incoming business entities daily. This DAG calls out to the public Colorado Information Marketplace API every day at 6 AM MT and fetches yesterday's Colorado Business Entity data for processing. It then cleans the data, joins matching county data based on ZIP and city, and assigns an appropriate subsidy tier. This DAG involves several tasks to ingest, clean, validate, enrich, and load data from the data.colorado.gov API into production tables. Below is a breakdown of each task and its purpose.

---

#### Task Descriptions

1. **create_business_entities_yesterday_stage**
   *Description:* Creates a staging table for business entities for yesterday's data. This table is partitioned by `principalcounty` and sets up the structure to temporarily hold the data for further processing.

2. **create_misfit_business_entities_table**
   *Description:* Creates a table for misfit business entities that do not meet the expected data patterns. This helps segregate records that require special attention or further review.

3. **create_dupes_staging_table**
   *Description:* Creates a staging table for capturing potential duplicate records from the business entities data. It mirrors the structure of the main table and is partitioned by `principalcounty`.

4. **create_duplicates_table**
   *Description:* Creates a permanent table to store confirmed duplicate records from the business entities dataset.

5. **create_geo_validation_staging_table**
   *Description:* Creates a temporary table to hold the results of geo-validated address data. This table is partitioned by `geo_county` and is used to store responses from the Geoapify API.

6. **create_addded_city_county_zip_staging_table**
   *Description:* Creates a staging table to store newly discovered city/county/zip combinations from geo-validation. This table is later used to update the reference data for address mapping.

7. **fetch_and_insert_business_entities_yesterday**
   *Description:* Fetches business entity data for yesterday from the data.colorado.gov API and inserts the records into the staging table. This task calls the API, transforms the response into records, and executes an INSERT query.

8. **filter_invalid_addresses**
   *Description:* Deletes records from the staging table that have NULL values in critical address fields (`principalcity` or `principalzipcode`), ensuring only valid addresses proceed to the next steps.

9. **reformat_zip_code**
   *Description:* Reformats the `principalzipcode` field in the staging table to a standard 5-digit format, ensuring consistency for downstream processing.

10. **update_county_on_staging_table**
    *Description:* Updates the `principalcounty` field in the staging table by joining with the cleaned reference table (`colorado_city_county_zip`). The matching is done using normalized (trimmed and lowercased) city names and ZIP codes.

11. **process_unmatched_entities_task**
    *Description:* Processes records from the staging table where `principalcounty` is still NULL. It calls the Geoapify API to fetch validated address data for these unmatched entities, transforms the API responses, and inserts the results into the geo validation staging table.

12. **check_validated_address_combos**
    *Description:* Checks validated address combinations by joining the geo validation staging table with the staging table. This task identifies new city/county/zip combinations not present in the reference table.

13. **insert_validated_address_combos**
    *Description:* Inserts the new, validated city/county/zip combinations into the reference table (`colorado_city_county_zip`). This enhances the mapping accuracy for future data processing.

14. **override_from_validated_addresses**
    *Description:* Updates the main business entities table by overriding existing address fields with geo-validated data. This ensures that the most accurate and standardized addresses are stored.

15. **retry_update_county_on_staging_table**
    *Description:* Reattempts the update of the `principalcounty` field on the staging table for any records that remain unmatched after initial processing.

16. **identify_duplicates_task**
    *Description:* Identifies duplicate records within the staging table by comparing key address and entity fields. The duplicates are then inserted into the duplicates staging table.

17. **run_dq_check_null_values**
    *Description:* Runs a data quality check that counts records with NULL values in critical fields (`entityname`, `principalcity`, `entitytype`, `entityformdate`) to ensure data completeness.

18. **run_dq_check_principalstate**
    *Description:* Executes a data quality check to verify that all records in the staging table have `principalstate` equal to 'CO'.

19. **run_dq_check_entityformdate**
    *Description:* Performs a data quality check to ensure that the `entityformdate` in the staging table matches the expected date (yesterday's date).

20. **run_dq_check_no_unexpected_duplicates**
    *Description:* Checks for any unexpected duplicates in the staging table by grouping on `entityid` and counting occurrences greater than one.

21. **insert_new_records_task**
    *Description:* Inserts new business entity records from the staging table into the production table, excluding any records that have been flagged as duplicates and ensuring that `principalcounty` is populated.

22. **insert_business_entities_with_ranking**
    *Description:* Inserts business entity records along with their associated ranking information by joining the staging table with the final county tier rank table.

23. **insert_duplicates**
    *Description:* Inserts duplicate records from the duplicates staging table into the permanent duplicates table for further review or archival.

24. **insert_misfits**
    *Description:* Inserts misfit records (those without a valid `principalcounty`) from the staging table into the misfits table, isolating records that could not be processed normally.

25. **drop_staging_table_task**
    *Description:* Drops the staging table to clean up temporary data once processing is complete.

26. **drop_dupes_staging_table_task**
    *Description:* Drops the duplicates staging table to remove temporary duplicate data after it has been archived.

27. **drop_geo_validation_staging_table**
    *Description:* Drops the geo validation staging table that was used for holding geo API responses once the data has been processed and merged.

28. **drop_added_city_county_zip_staging_table**
    *Description:* Drops the staging table for newly added city/county/zip combinations, finalizing the cleanup of temporary tables.

---

#### DAG Flow Order

The task order looks like this:

Screenshot 2025-02-28 at 10 29 57 PM

#### Astronomer Cloud Deployment

One of the core requirements from Zach for the capstone project is that it must be running in a production cloud environment to count. Below are snapshots of my DAG in the Astronomer production environment running daily.

Screenshot 2025-02-28 at 6 23 12 AM

This capture shows the successes and failures of my DAG runs as I worked through errors and bugs.

Screenshot 2025-02-28 at 6 22 22 AM

### Putting It All Together

#### Business Entity Search Dashboard
The dashboard below enables a business owner to search for their business entity and see information about their business, the available security system subsidy tiers, and exactly which tier their business qualifies for. In this example, Lost Coffee could search for information about their business and learn that they have been assigned a tier ranking of 3 based on all of the criteria explained above.
Screenshot 2025-02-28 at 5 08 32 PM

### Challenges and Findings
#### In-depth Summary of Business Entities Data Cleaning and Update Process
**Challenges Faced:**
- **Mismatches in Data:** Many records in the `colorado_business_entities` table lacked corresponding county information when joined with the reference table (`colorado_city_county_zip`).
- **Inconsistent Data Formatting:** Typos, case discrepancies, and extra spaces in city names and ZIP codes led to failed matches.
- **Incomplete Reference Data:** Missing city/ZIP combinations and duplicate entries in the reference table required augmentation and deduplication to ensure accurate mapping.

**Process Overview:**

1. **Initial Analysis and Identification of Issues:**
   - Ran queries to identify records in `colorado_business_entities` with no matching county in the reference table.
   - Examined unmatched city/ZIP pairs to determine the most frequent discrepancies.

2. **Data Correction in the Business Entities Table:**
   - Performed manual corrections for known typos and inconsistencies (e.g., updating "peyton co" to "Peyton", correcting ZIP 80524 entries to "Fort Collins").
   - Verified corrections by running SELECT queries against both the business entities and reference datasets.

3. **Assessing Match Quality:**
   - Executed queries to compare the count of matched versus unmatched records, revealing a high number of mismatches that required further attention.

4. **Used the Geoapify API to Resolve Address Issues:**
   - Leveraged Geoapify's Geocoding API to validate address discrepancies and improve the address, city, county, and ZIP attributes of non-matching business entities.
   - One funny thing: while processing the remaining unmatched records, I totally blew through the daily Geoapify limits, and I very much appreciate their service for letting me do so.
   - Screenshot 2025-03-01 at 7 48 32 AM

5. **Inserting Missing Reference Data:**
   - Augmented the reference table by inserting missing city/ZIP combinations (e.g., added entries for "Evans" in Weld County).
   - Inspected the reference table to ensure that the inserted data met quality standards.

6. **Deduplicating the Reference Table:**
   - Identified duplicate rows for certain city/ZIP pairs (e.g., "Burlington, 80807" and "New Raymer, 80742").
   - Created a new deduplicated version of the reference table and replaced the original with the cleaned version.

7. **Updating the Business Entities with County Data:**
   - Employed a final update query (using a correlated subquery) to backfill the `principalcounty` field in `colorado_business_entities` by matching normalized city and ZIP code data from the cleaned reference table.
   - Verified the update by confirming that previously unmatched records were now successfully mapped.

8. **Final Outcome:**
   - **Normalized Addresses:** Data cleaning (trimming, lowercasing, casting ZIP codes) ensured consistent comparisons.
   - **Accurate Mapping:** Manual corrections and enhanced reference data led to a complete and unique set of city/ZIP combinations.
   - **Successful Update:** The final update query backfilled the `principalcounty` field for all records, reducing unmatched counts to zero.
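The normalized matching described above can be sketched in Python. The reference rows, the `normalize` helper, and the five-character ZIP truncation are illustrative assumptions for this sketch, not the production update query:

```python
def normalize(city, zip_code):
    """Trim and lowercase the city; keep only the leading 5-digit ZIP."""
    return (city.strip().lower(), str(zip_code).strip()[:5])

# Illustrative slice of the cleaned reference table.
reference = {
    normalize("Evans", "80620"): "Weld",
    normalize("Fort Collins", "80524"): "Larimer",
}

# A messy business-entity row still matches after normalization.
entity = {"principalcity": "  FORT COLLINS ", "principalzipcode": "80524-1234"}
county = reference.get(
    normalize(entity["principalcity"], entity["principalzipcode"])
)
print(county)  # Larimer
```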
#### Out-of-Memory (OOM) Exceptions and Spark Configuration Adjustments
When processing large datasets (over a million rows) using Spark, I frequently encountered out-of-memory exceptions, especially when fetching large CSV files from S3 and writing to Iceberg. The default Spark settings were insufficient for these workloads.

To address this challenge, I reviewed the Spark documentation and applied recommendations from the bootcamp. I adjusted the Spark configuration to allocate more memory to both the executors and the driver, and to optimize shuffle operations. The following settings helped mitigate the memory issues:

```python
from pyspark import SparkConf

conf = SparkConf()
conf.set('spark.executor.memory', '16g')         # Increase executor memory based on your system's capacity
conf.set('spark.driver.memory', '16g')           # Increase driver memory to handle larger workloads
conf.set('spark.sql.shuffle.partitions', '200')  # Optimize the number of shuffle partitions
```

By increasing the memory allocation and tuning the shuffle partitions, I was able to process the large CSV files efficiently and write to Iceberg without running into out-of-memory errors.
#### Challenge: Ambiguous City and County Matching
**Description:**
Initially, my orchestration relied solely on a city and county dataset to map business entities to their respective counties. However, an issue arose where cities that span multiple counties were being randomly assigned to just one county.

**Example:**
- **Aurora:** This city spans Douglas County, Arapahoe County, and Denver County.
- Without additional disambiguation, business entities in Aurora were incorrectly mapped to a single county at random, leading to inaccurate county assignments.

**Resolution:**
- I enhanced the reference dataset by adding a **ZIP code** column.
- By matching on both the city and ZIP code of a business entity's address, the mapping became much more precise.
- This change ensured that each business entity is matched to the correct county based on the distinct combination of city and ZIP code.

**Outcome:**
Incorporating the ZIP code significantly improved the accuracy of county assignments, eliminating the randomness in cases where cities span multiple counties.
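A minimal sketch of the fix: with only the city as the key, Aurora could resolve to any of its counties; adding the ZIP code makes the lookup key unique. The ZIP-to-county pairings below are placeholders for illustration, not verified assignments.

```python
# City-only lookup is ambiguous: one key, several candidate counties.
ambiguous = {"aurora": ["Douglas", "Arapahoe", "Denver"]}

# (city, zip) lookup is unique. Placeholder pairings, not real data.
resolved = {
    ("aurora", "80011"): "Arapahoe",
    ("aurora", "80016"): "Douglas",
}

def lookup(city, zip_code):
    """Case-insensitive (city, zip) match; None when no combination exists."""
    return resolved.get((city.strip().lower(), zip_code))

print(lookup("Aurora", "80016"))  # Douglas
```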
#### Inconsistent Zip Code Formats Causing Casting Errors
**Description:**
Initially, the ETL process transformed API responses without validating the `principalzipcode` field. This oversight led to casting errors when attempting to convert zip code values in unexpected formats (e.g., `"80137-6828"`) into an integer for insertion into the staging table.

**Example:**
- **Problem Record:** A record with a `principalzipcode` value like `"80137-6828"` resulted in the following error during insertion:

      TrinoUserError: TrinoUserError(type=USER_ERROR, name=INVALID_CAST_ARGUMENT, message="Cannot cast '80137-6828' to INT", query_id=...)

**Resolution:**
- **Data Transformation Enhancement:**
  I updated the `transform_entity_data_to_records` function to validate the zip code using a regular expression. Only zip codes matching exactly five digits (`\d{5}`) are accepted:
  - Valid zip codes are converted to an integer.
  - Invalid zip codes are set to Python's `None`.

- **SQL Query Adjustment:**
  In the `insert_into_staging_table` function, I modified the query construction to check for `None`. If a zip code is `None`, it is replaced with the SQL literal `NULL` (without quotes) in the final query. This ensures that only a valid integer or a proper `NULL` value is inserted into the `principalzipcode` column.

**Outcome:**
By separating data validation from SQL query construction, this solution prevents casting errors and maintains data integrity. Only correctly formatted 5-digit zip codes (or `NULL` for invalid ones) are passed to the database, resulting in a robust and error-free ETL process.
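The validation-then-literal split described above can be sketched like this. The helper names (`clean_zip`, `zip_sql_literal`) are simplified stand-ins for the logic inside `transform_entity_data_to_records` and `insert_into_staging_table`, not the DAG's actual functions:

```python
import re

# Accept only exactly five digits; everything else becomes None.
ZIP_RE = re.compile(r"^\d{5}$")

def clean_zip(raw):
    """Validate during transformation: int for a 5-digit zip, else None."""
    if raw is not None and ZIP_RE.match(str(raw).strip()):
        return int(raw)
    return None

def zip_sql_literal(raw):
    """Render for query construction: a bare NULL, never a quoted string."""
    cleaned = clean_zip(raw)
    return "NULL" if cleaned is None else str(cleaned)

print(clean_zip("80137-6828"))   # None -- hyphenated ZIP+4 is rejected
print(zip_sql_literal("80137"))  # 80137
```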
## Next Steps and Closing Thoughts

### Next Steps

Here is a running list of things that I would continue to refine, improve, or introduce as new features:
- Address-level tier assignments that are specific to the latitude and longitude of the business location
- Understanding residential vs. commercial businesses and how to delineate the security subsidy between them
- Automation to handle the duplicates and misfits tables instead of remedying them manually
- Machine learning built into the population projections data that uses forecasted population growth to predict crime trends in respective counties

### Closing Thoughts

In closing, this capstone project was easily one of the hardest projects that I have ever done by myself end to end. I pushed myself, accomplished more than I ever thought I would, and had so much fun in the process. The most exciting thing about it is that while I only focused on 16 KPIs and use cases that I made up, these datasets could serve hundreds of other exciting and interesting use cases. I spent countless hours over 8 weeks finding the right datasets, refining the data into the KPIs and use cases you just reviewed, and building the experience that allows business owners to search for their qualifying subsidy tiers. I am so proud of myself and what I was able to accomplish from free data extracts. It was really cool to focus on data that is in my own back yard here in Colorado. I accomplished things I never thought I would, and I cannot wait to see what is next for my career after this.