├── .github ├── CODEOWNERS └── pull_request_template.md ├── .gitignore ├── README.md ├── analyses ├── .gitkeep ├── driver_bets.sql ├── lap_time_analysis_python_worksheet.py └── pit_stops_analysis_python_worksheet.py ├── dbt_project.yml ├── macros ├── .gitkeep ├── convert_laptime.sql ├── macros.yml └── test_all_values_gte_zero.sql ├── models ├── column_definitions.md ├── core │ ├── core.yml │ ├── core_table_definitions.md │ ├── dim_circuits.sql │ ├── dim_constructors.sql │ ├── dim_drivers.sql │ ├── dim_races.sql │ ├── dim_seasons.sql │ ├── dim_status.sql │ ├── fct_constructor_results.sql │ ├── fct_constructor_standings.sql │ ├── fct_driver_standings.sql │ ├── fct_lap_times.sql │ ├── fct_pit_stops.sql │ ├── fct_qualifying.sql │ ├── fct_results.sql │ └── fct_sprint_results.sql ├── marts │ ├── aggregates │ │ ├── agg_fastest_pit_stops_by_constructor.py │ │ ├── agg_lap_times_moving_avg.py │ │ └── aggregates.yml │ ├── mrt.md │ ├── mrt.yml │ ├── mrt_lap_times_years.sql │ ├── mrt_pit_stops.sql │ ├── mrt_results.sql │ └── mrt_results_circuits.sql ├── ml │ ├── prep_encoding_splitting │ │ ├── covariate_encoding.py │ │ ├── encoding_mapping.py │ │ ├── hold_out_dataset_for_prediction.py │ │ ├── ml_data_prep.py │ │ └── training_testing_dataset.py │ └── training_and_prediction │ │ ├── apply_prediction_to_position.py │ │ └── train_model_to_predict_position.py ├── overview.md └── staging │ └── formula1 │ ├── f1_sources.yml │ ├── stage_model_descriptions.md │ ├── staging.yml │ ├── stg_circuits.sql │ ├── stg_constructor_results.sql │ ├── stg_constructor_standings.sql │ ├── stg_constructors.sql │ ├── stg_driver_standings.sql │ ├── stg_drivers.sql │ ├── stg_lap_times.sql │ ├── stg_pit_stops.sql │ ├── stg_qualifying.sql │ ├── stg_races.sql │ ├── stg_results.sql │ ├── stg_seasons.sql │ ├── stg_sprint_results.sql │ └── stg_status.sql ├── packages.yml ├── seeds └── .gitkeep ├── setup ├── existing_account_snowflake_grants.sql ├── pc_snowflake_grants.sql └── setup_script_s3_to_snowflake.sql ├── snapshots └── .gitkeep └── tests ├── .gitkeep ├── lap_times_moving_avg_assert_positive_or_null.sql └── macro_pit_stops_mean_is_positive.sql /.github/CODEOWNERS: -------------------------------------------------------------------------------- 1 | # Owning team for this repo 2 | * @dbt-labs/solutions-architects 3 | -------------------------------------------------------------------------------- /.github/pull_request_template.md: -------------------------------------------------------------------------------- 1 | 7 | 8 | ## Description & motivation 9 | 13 | 14 | ## To-do before merge 15 | 22 | 23 | ## Screenshots: 24 | 28 | 29 | ## Validation of models: 30 | 35 | 36 | ## Changes to existing models: 37 | 42 | 43 | ## Checklist: 44 | 50 | - [ ] My pull request represents one logical piece of work. 51 | - [ ] My commits are related to the pull request and look clean. 52 | - [ ] My SQL follows the [dbt Labs style guide](https://github.com/dbt-labs/corp/blob/master/dbt_style_guide.md). 53 | - [ ] I have materialized my models appropriately. 54 | - [ ] I have added appropriate tests and documentation to any new models. 55 | - [ ] I have updated the README file. 56 | {%- if project.warehouse == 'redshift' %} 57 | - [ ] I have added sort and dist keys to models materialized as tables. 58 | - [ ] I have validated the SQL in any late-binding views. 59 | {% endif %} -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | target/ 3 | dbt_packages/ 4 | logs/ 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | ## About the project 3 | This repository is the code for supporting both a [dbt guide](https://docs.getdbt.com/guides/dbt-ecosystem/dbt-python-snowpark/1-overview-dbt-python-snowpark) and [Snowflake quickstart](https://quickstarts.snowflake.com/guide/leverage_dbt_cloud_to_generate_ml_ready_pipelines_using_snowpark_python/index.html?index=..%2F..index#0). 4 | This repository showcases how to combine both SQL and python together in the same data pipeline to run both analytics and machine learning models on dbt Cloud. We are able to blend these together seamlessly using Snowpark for python on Snowflake. 5 | The project utilizes open source Formula 1 data. This repository, python-snowpark-formula1, is code complete and contains all models end to end. 6 | 7 | 8 | This repo create insights for: 9 | - Finding the lap time average and rolling average through the years 10 | - Predicting the position of each driver based on a decade of data 11 | 12 | First we fetch the Formula 1 data, hosted on a dbt Labs public S3 bucket. The code for this is under the `setup` folder in the file `setup_script_s3_to_snowflake.sql`. 13 | We will create a Snowflake Stage for our CSV files then use Snowflake's COPY INTO function to copy the data in from our CSV files into tables. The Formula 1 is available on Kaggle. The data is originally compiled from the Ergast Developer API. 14 | We will not be building the full pipeline as part of this workshop. Instead we will leverage an exisitng repo, fork it, and focus on our machine learning pipeline. 15 | 16 | ## Step by step quickstart 17 | You can follow along how to build out some of the pipeline's code by following the [snowflake quickstart Leverage dbt Cloud to Generate ML ready pipelines using snowpark python](https://quickstarts.snowflake.com/guide/leverage_dbt_cloud_to_generate_ml_ready_pipelines_using_snowpark_python/index.html?index=..%2F..index#0). 18 | The quickstart is abridged for time, meaning not every file in this repo is built from scratch. The quickstart utilizes a separate forked repository that has models (.sql and .py files) dropped from this repository (python-snowpark-formula1). 19 | 20 | ## Getting started 21 | You do not need to create any local setup to replicate and run this project! 22 | 23 | All you need to get started are: 24 | 1. A Snowflake account with ACCOUNTADMIN access (you can spin up a trial account) 25 | 2. A GitHub Account 26 | 27 | A dbt cloud account will be setup through a process called Partner Connect. This creates a dbt environment to login to, in browswer, that interpolates dbt code! 28 | 29 | If you'd like to run this project locally, no problem! That setup won't be covered here and you should be familiar with using dbt core and Snowflake notebook connections locally. 30 | 31 | ## What you can learn by exploring this repo: 32 | - How to use dbt with Snowflake to build scalable transformations using SQL and Python 33 | - How to use dbt SQL to prepare your data from sources to encoding 34 | - How to train a model in dbt python and use it for future prediction 35 | 36 | ## Prerequisites 37 | To get the most out this repo it's best to have: 38 | - Basic to intermediate SQL and python. 39 | - Basic understanding of dbt fundamentals. We recommend the [dbt Fundamentals course](https://courses.getdbt.com/courses/fundamentals) if you're interested. 40 | - High level understanding of machine learning processes (encoding, training, testing) 41 | - Simple ML algorithms — we will use logistic regression to keep the focus on the workflow, not algorithms! 42 | 43 | **Bonus:** if you have completed the dbt workshop, [Accelerating Data Teams with Snowflake and dbt Cloud Hands On Lab](https://quickstarts.snowflake.com/guide/accelerating_data_teams_with_snowflake_and_dbt_cloud_hands_on_lab/index.html?index=..%2F..index#0), to have hands on keyboard practice with concepts like the source, ref, tests, and docs in dbt. 44 | By having completed that workshop, you will gain the most of this dbt python + snowpark workshop. 45 | 46 | ## Roadmap and future improvements 47 | We will be updating this project from time to time for code cleanup, automation, and developing best practices. 48 | So far the list of future improvements is as follows: 49 | [ ] label encorder clean up for numeric variables
50 | [ ] ohe for the categorical variables
51 | [ ] multi-class accuracy
52 | [ ] saving accurary to a separate file to monitor model degradation/decay
53 | [ ] trying out https://github.com/omnata-labs/dbt-ml-preprocessing for some of preprocessing? 54 | 55 | ## Contributing 56 | Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated!! 57 | Really, any changes you'd like to see made to this project come to life is awesome! 58 | 59 | If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again! 60 | 61 | To make a contribution follow these steps: 62 | 1. Fork the Project 63 | 2. Create your Feature Branch (git checkout -b feature/AmazingFeature) 64 | 3. Commit your Changes (git commit -m 'Add some AmazingFeature') 65 | 4. Push to the Branch (git push origin feature/AmazingFeature) 66 | 5. Open a Pull Request 67 | 68 | ## Contact 69 | Hope Watson — [LinkedIn](https://www.linkedin.com/in/hopewatson/) 70 | 71 | ## Acknowledgments 72 | 73 | I want to give a massive thanks to [Eda Johnson](https://www.linkedin.com/in/eda-johnson-saa-csa-pmp-0a2783/) at Snowflake. She was immensely helpful and in the python models you can see where her proof of concept served as the workflow we iterated on in this project. 74 | 75 | Thanks to dbt Labs and my former manager and colleague, [Amy Chen](https://www.linkedin.com/in/yuanamychen/), for giving me the time to continue building skills and becoming technically more competent each day. 76 | -------------------------------------------------------------------------------- /analyses/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dbt-labs/python-snowpark-formula1/b6caebae7914a119c66ac16a4430f5e10faf41a8/analyses/.gitkeep -------------------------------------------------------------------------------- /analyses/driver_bets.sql: -------------------------------------------------------------------------------- 1 | with predictions as ( 2 | select * from {{ ref('apply_prediction_to_position') }} 3 | ), 4 | 5 | encoding_mapping as ( 6 | select * from {{ ref('encoding_mapping') }} 7 | ) 8 | 9 | select 10 | original_driver_value, 11 | AVG(position_predicted) as avg_position_predicted, 12 | encoded_driver_value, 13 | predictions.driver 14 | 15 | from predictions 16 | join encoding_mapping on predictions.driver=encoding_mapping.encoded_driver_value 17 | group by 1,3,4 18 | order by avg_position_predicted 19 | -------------------------------------------------------------------------------- /analyses/lap_time_analysis_python_worksheet.py: -------------------------------------------------------------------------------- 1 | # The Snowpark package is required for Python Worksheets. 2 | # You can add more packages by selecting them using the Packages control and then importing them. 3 | 4 | import snowflake.snowpark as snowpark 5 | import pandas as pd 6 | 7 | def main(session: snowpark.Session): 8 | # Your code goes here, inside the "main" handler. 9 | tableName = 'MRT_LAP_TIMES_YEARS' 10 | dataframe = session.table(tableName) 11 | lap_times = dataframe.to_pandas() 12 | 13 | # print table 14 | print(lap_times) 15 | 16 | # describe the data 17 | lap_times["LAP_TIME_SECONDS"] = lap_times["LAP_TIME_MILLISECONDS"]/1000 18 | lap_time_trends = lap_times.groupby(by="RACE_YEAR")["LAP_TIME_SECONDS"].mean().to_frame() 19 | lap_time_trends.reset_index(inplace=True) 20 | lap_time_trends["LAP_MOVING_AVG_5_YEARS"] = lap_time_trends["LAP_TIME_SECONDS"].rolling(5).mean() 21 | lap_time_trends.columns = lap_time_trends.columns.str.upper() 22 | 23 | final_df = session.create_dataframe(lap_time_trends) 24 | # Return value will appear in the Results tab. 25 | return final_df -------------------------------------------------------------------------------- /analyses/pit_stops_analysis_python_worksheet.py: -------------------------------------------------------------------------------- 1 | # The Snowpark package is required for Python Worksheets. 2 | # You can add more packages by selecting them using the Packages control and then importing them. 3 | 4 | import snowflake.snowpark as snowpark 5 | from snowflake.snowpark.functions import col 6 | import pandas as pd 7 | import numpy as np 8 | 9 | def main(session: snowpark.Session): 10 | # Your code goes here, inside the "main" handler. 11 | tableName = 'PIT_STOPS_JOINED' 12 | dataframe = session.table(tableName) 13 | pit_stops_joined = dataframe.to_pandas() 14 | 15 | # print table 16 | print(pit_stops_joined) 17 | 18 | # provide year so we do not hardcode dates 19 | year=2021 20 | 21 | # describe the data 22 | pit_stops_joined["PIT_STOP_SECONDS"] = pit_stops_joined["PIT_STOP_MILLISECONDS"]/1000 23 | fastest_pit_stops = pit_stops_joined[(pit_stops_joined["RACE_YEAR"]==year)].groupby(by="CONSTRUCTOR_NAME")["PIT_STOP_SECONDS"].describe().sort_values(by='mean') 24 | fastest_pit_stops.reset_index(inplace=True) 25 | fastest_pit_stops.columns = fastest_pit_stops.columns.str.upper() 26 | 27 | final_df = session.create_dataframe(fastest_pit_stops) 28 | # Return value will appear in the Results tab. 29 | return final_df -------------------------------------------------------------------------------- /dbt_project.yml: -------------------------------------------------------------------------------- 1 | name: 'snowflake_python_workshop' 2 | version: '1.5.0' 3 | require-dbt-version: '>=1.3.0' 4 | config-version: 2 5 | 6 | # This setting configures which "profile" dbt uses for this project. 7 | profile: 'default' 8 | 9 | # These configurations specify where dbt should look for different types of files. 10 | # The `source-paths` config, for example, states that models in this project can be 11 | # found in the "models/" directory. You probably won't need to change these! 12 | model-paths: ["models"] 13 | analysis-paths: ["analyses"] 14 | test-paths: ["tests"] 15 | seed-paths: ["seeds"] 16 | macro-paths: ["macros"] 17 | snapshot-paths: ["snapshots"] 18 | 19 | target-path: "target" # directory which will store compiled SQL files 20 | clean-targets: # directories to be removed by `dbt clean` 21 | - "target" 22 | - "dbt_packages" 23 | 24 | models: 25 | snowflake_python_workshop: 26 | staging: 27 | +docs: 28 | node_color: "CadetBlue" 29 | 30 | marts: 31 | +materialized: table 32 | aggregates: 33 | +docs: 34 | node_color: "Maroon" 35 | +tags: "bi" 36 | 37 | core: 38 | +materialized: table 39 | +docs: 40 | node_color: "#800080" 41 | 42 | ml: 43 | +materialized: table 44 | prep_encoding_splitting: 45 | +docs: 46 | node_color: "Indigo" 47 | training_and_prediction: 48 | +docs: 49 | node_color: "Black" -------------------------------------------------------------------------------- /macros/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dbt-labs/python-snowpark-formula1/b6caebae7914a119c66ac16a4430f5e10faf41a8/macros/.gitkeep -------------------------------------------------------------------------------- /macros/convert_laptime.sql: -------------------------------------------------------------------------------- 1 | {% macro convert_laptime(column_name) %} 2 | TIME_FROM_PARTS( 3 | 00, -- hours (always 0) 4 | IFF(CONTAINS({{ column_name }}, ':'), 5 | LPAD(SPLIT_PART({{ column_name }}, ':', 1), 2, '0'), 6 | 00), -- minutes 7 | IFF(CONTAINS({{ column_name }}, ':'), 8 | SPLIT_PART(SPLIT_PART({{ column_name }}, ':', 2), '.', 1), 9 | SPLIT_PART({{ column_name }}, '.', 1)), -- seconds 10 | IFF(CONTAINS({{ column_name }}, ':'), 11 | RPAD(SPLIT_PART(SPLIT_PART({{ column_name }}, ':', 2), '.', 2), 9, '0'), 12 | RPAD(SPLIT_PART({{ column_name }}, '.', 2), 9, '0')) -- nanoseconds 13 | ) 14 | {% endmacro %} -------------------------------------------------------------------------------- /macros/macros.yml: -------------------------------------------------------------------------------- 1 | version: 2 2 | 3 | macros: 4 | - name: convert_laptime 5 | description: "This macro converts formatted time string into actual time variable type." -------------------------------------------------------------------------------- /macros/test_all_values_gte_zero.sql: -------------------------------------------------------------------------------- 1 | {% macro test_all_values_gte_zero(table, column) %} 2 | 3 | select * from {{ ref(table) }} where {{ column }} < 0 4 | 5 | {% endmacro %} -------------------------------------------------------------------------------- /models/column_definitions.md: -------------------------------------------------------------------------------- 1 | {% docs circuit_id_def %} 2 | 3 | This column is the unique identifier for the circuit. 4 | 5 | {% enddocs %} 6 | 7 | {% docs circuit_ref_def %} 8 | 9 | This column contains the circuit reference code. 10 | 11 | {% enddocs %} 12 | 13 | {% docs circuit_name_def %} 14 | 15 | This column contains the name of the circuit. 16 | 17 | {% enddocs %} 18 | 19 | {% docs circuit_location_def %} 20 | 21 | This column contains the location of the circuit. 22 | 23 | {% enddocs %} 24 | 25 | {% docs circuit_country_def %} 26 | 27 | This column contains the country where the circuit is located. 28 | 29 | {% enddocs %} 30 | 31 | {% docs circuit_lat_def %} 32 | 33 | This column contains the latitude of the circuit. 34 | 35 | {% enddocs %} 36 | 37 | {% docs circuit_lng_def %} 38 | 39 | This column contains the longitude of the circuit. 40 | 41 | {% enddocs %} 42 | 43 | {% docs circuit_alt_def %} 44 | 45 | This column contains the altitude of the circuit. 46 | 47 | {% enddocs %} 48 | 49 | {% docs circuit_url_def %} 50 | 51 | This column contains the URL to the circuit's Wikipedia page. 52 | 53 | {% enddocs %} 54 | 55 | {% docs constructor_results_id_def %} 56 | 57 | This column is the unique identifier for each constructor result. 58 | 59 | {% enddocs %} 60 | 61 | {% docs race_id_def %} 62 | 63 | This column is the unique identifier for each race. 64 | 65 | {% enddocs %} 66 | 67 | {% docs constructor_points_def %} 68 | 69 | This column contains the points scored by the constructor in the race. 70 | 71 | {% enddocs %} 72 | 73 | {% docs constructor_status_def %} 74 | 75 | This column contains the constructor's status (e.g., "+1 Lap", "Finished", etc.). 76 | 77 | {% enddocs %} 78 | 79 | {% docs constructor_ref_def %} 80 | 81 | This column contains the constructor reference code. 82 | 83 | {% enddocs %} 84 | 85 | {% docs constructor_name_def %} 86 | 87 | This column contains the name of the constructor. 88 | 89 | {% enddocs %} 90 | 91 | {% docs constructor_url_def %} 92 | 93 | This column contains the URL to the constructor's Wikipedia page. 94 | 95 | {% enddocs %} 96 | 97 | {% docs driver_standings_id_def %} 98 | 99 | This column is the unique identifier for each driver standing. 100 | 101 | {% enddocs %} 102 | 103 | {% docs driver_points_def %} 104 | 105 | This column contains the total points scored by the driver in the season. 106 | 107 | {% enddocs %} 108 | 109 | {% docs driver_position_def %} 110 | 111 | This column contains the driver's position in the standings. 112 | 113 | {% enddocs %} 114 | 115 | {% docs position_text_def %} 116 | 117 | This column contains the text representation of the driver's position in the standings. 118 | 119 | {% enddocs %} 120 | 121 | {% docs driver_wins_def %} 122 | 123 | This column contains the number of wins by the driver in the season. 124 | 125 | {% enddocs %} 126 | 127 | {% docs driver_ref_def %} 128 | 129 | This column contains the driver reference code. 130 | 131 | {% enddocs %} 132 | 133 | {% docs driver_number_def %} 134 | 135 | This column contains the driver's racing number. 136 | 137 | {% enddocs %} 138 | 139 | {% docs drivercode_def %} 140 | 141 | This column contains the driver's three-letter code. 142 | 143 | {% enddocs %} 144 | 145 | {% docs driverforename_def %} 146 | 147 | This column contains the driver's first name. 148 | 149 | {% enddocs %} 150 | 151 | {% docs driver_surname_def %} 152 | 153 | This column contains the driver's last name. 154 | 155 | {% enddocs %} 156 | 157 | {% docs driver_dob_def %} 158 | 159 | This column contains the driver's date of birth. 160 | 161 | {% enddocs %} 162 | 163 | {% docs driver_nationality_def %} 164 | 165 | This column contains the driver's nationality. 166 | 167 | {% enddocs %} 168 | 169 | {% docs driver_url_def %} 170 | 171 | This column contains the URL to the driver's Wikipedia page. 172 | 173 | {% enddocs %} 174 | 175 | {% docs year_def %} 176 | The year in which the event (e.g., race, season) took place. 177 | {% enddocs %} 178 | 179 | {% docs round_def %} 180 | The round number of the race within the season. 181 | {% enddocs %} 182 | 183 | {% docs race_name_def %} 184 | The official name of the event (e.g., race, circuit). 185 | {% enddocs %} 186 | 187 | {% docs date_def %} 188 | The date of the event (e.g., race, season start). 189 | {% enddocs %} 190 | 191 | {% docs start_time_def %} 192 | The starting time of the race or the total race time for the driver. 193 | {% enddocs %} 194 | 195 | {% docs url_def %} 196 | The URL of the event's (e.g., race, season, driver) Wikipedia page. 197 | {% enddocs %} 198 | 199 | {% docs number_def %} 200 | The car number of the driver. 201 | {% enddocs %} 202 | 203 | {% docs grid_def %} 204 | The starting grid position of the driver. 205 | {% enddocs %} 206 | 207 | {% docs position_def %} 208 | The finishing position of the driver. 209 | {% enddocs %} 210 | 211 | {% docs position_order_def %} 212 | The finishing position order of the driver. 213 | {% enddocs %} 214 | 215 | {% docs points_def %} 216 | The points earned by the driver or constructor in the race or season. 217 | {% enddocs %} 218 | 219 | {% docs laps_def %} 220 | The number of laps completed by the driver. 221 | {% enddocs %} 222 | 223 | {% docs race_milliseconds_def %} 224 | The total race time for the driver in milliseconds. 225 | {% enddocs %} 226 | 227 | {% docs results_milliseconds_def %} 228 | The total race time for the driver in milliseconds. 229 | {% enddocs %} 230 | 231 | {% docs fastest_lap_def %} 232 | The lap number of the driver's fastest lap. 233 | {% enddocs %} 234 | 235 | {% docs rank_def %} 236 | The rank of the driver's fastest lap time among all drivers. 237 | {% enddocs %} 238 | 239 | {% docs fastest_lap_time_def %} 240 | The driver's fastest lap time. 241 | {% enddocs %} 242 | 243 | {% docs fastest_lap_speed_def %} 244 | The driver's average speed during their fastest lap. 245 | {% enddocs %} 246 | 247 | {% docs result_id_def %} 248 | This column is the unique identifier/primary key for the result table. 249 | {% enddocs %} 250 | 251 | {% docs driver_id_def %} 252 | The unique identifier of the driver. 253 | {% enddocs %} 254 | 255 | {% docs constructor_id_def %} 256 | The unique identifier of the constructor/team. 257 | {% enddocs %} 258 | 259 | {% docs status_id_def %} 260 | The unique identifier of the driver's finishing status (e.g., finished, retired). 261 | {% enddocs %} 262 | 263 | {% docs lap_def %} 264 | The lap number within the race. 265 | {% enddocs %} 266 | 267 | {% docs lap_time_def %} 268 | The lap time for the driver in the format 'MM:SS.mmm'. 269 | {% enddocs %} 270 | 271 | {% docs lap_time_milliseconds_def %} 272 | The lap time for the driver in milliseconds. 273 | {% enddocs %} 274 | 275 | {% docs contructor_name_def %} 276 | The name of the constructor/team. 277 | {% enddocs %} 278 | 279 | {% docs constructor_nationality_def %} 280 | The nationality of the constructor/team. 281 | {% enddocs %} 282 | 283 | {% docs driver_ref_num_def %} 284 | The driver's permanent reference number. 285 | {% enddocs %} 286 | 287 | {% docs driver_code_def %} 288 | The driver's three-letter abbreviation code. 289 | {% enddocs %} 290 | 291 | {% docs driver_forename_def %} 292 | The driver's first name. 293 | {% enddocs %} 294 | 295 | {% docs constructor_wins_def %} 296 | 297 | This column contains the number of wins by the constructor in the season. 298 | 299 | {% enddocs %} 300 | 301 | {% docs official_lap_time_def %} 302 | The lap time for the driver in the time format. 303 | {% enddocs %} 304 | 305 | {% docs stop_number_def %} 306 | The count of pit stops during the race. 307 | {% enddocs %} 308 | 309 | {% docs pit_stop_time_def %} 310 | Formatted pit stop time (of day) 311 | {% enddocs %} 312 | 313 | {% docs pit_stop_duration_seconds_def %} 314 | Formatted time (duration) of pitstop. 315 | {% enddocs %} 316 | 317 | {% docs pit_stop_duration_def %} 318 | Pit stop time in TIME variable. 319 | {% enddocs %} 320 | 321 | {% docs pit_stop_milliseconds_def %} 322 | NUMBER for the milliseconds spent during pit stop. 323 | {% enddocs %} 324 | 325 | {% docs qualifying_position_def %} 326 | Final qualifying position. 327 | {% enddocs %} 328 | 329 | {% docs q1_time_def %} 330 | Qualifying 1 time (MM:SS.mmm format). 331 | {% enddocs %} 332 | 333 | {% docs q2_time_def %} 334 | Qualifying 2 time (MM:SS.mmm format). 335 | {% enddocs %} 336 | 337 | {% docs q3_time_def %} 338 | Qualifying 3 time (MM:SS.mmm format). 339 | {% enddocs %} 340 | 341 | {% docs race_year_def %} 342 | The year in which the race took place. 343 | {% enddocs %} 344 | 345 | {% docs race_round_def %} 346 | The round number of the race within the championship. 347 | {% enddocs %} 348 | 349 | {% docs race_date_def %} 350 | Date of the race event (YYYY-MM-DD format). 351 | {% enddocs %} 352 | 353 | {% docs race_time_def %} 354 | Scheduled time of the race event (HH:MM:SS format). 355 | {% enddocs %} 356 | 357 | {% docs race_url_def %} 358 | Wikipedia URL for the specific race event. 359 | {% enddocs %} 360 | 361 | {% docs fp1_date_def %} 362 | Date of the first free practice session (YYYY-MM-DD format). 363 | {% enddocs %} 364 | 365 | {% docs fp1_time_def %} 366 | Scheduled time of the first free practice session (HH:MM:SS format). 367 | {% enddocs %} 368 | 369 | {% docs fp2_date_def %} 370 | Date of the second free practice session (YYYY-MM-DD format). 371 | {% enddocs %} 372 | 373 | {% docs fp2_time_def %} 374 | Scheduled time of the second free practice session (HH:MM:SS format). 375 | {% enddocs %} 376 | 377 | {% docs fp3_date_def %} 378 | Date of the third free practice session (YYYY-MM-DD format). 379 | {% enddocs %} 380 | 381 | {% docs fp3_time_def %} 382 | Scheduled time of the third free practice session (HH:MM:SS format). 383 | {% enddocs %} 384 | 385 | {% docs quali_date_def %} 386 | Date of the qualifying session (YYYY-MM-DD format). 387 | {% enddocs %} 388 | 389 | {% docs quali_time_def %} 390 | Scheduled time of the qualifying session (HH:MM:SS format). 391 | {% enddocs %} 392 | 393 | {% docs sprint_date_def %} 394 | Date of the sprint race (YYYY-MM-DD format), if applicable. 395 | {% enddocs %} 396 | 397 | {% docs sprint_time_def %} 398 | Scheduled time of the sprint race (HH:MM:SS format), if applicable. 399 | {% enddocs %} 400 | 401 | {% docs driver_rank_def %} 402 | The rank of the driver for the event. 403 | {% enddocs %} 404 | 405 | {% docs status_def %} 406 | The status during a race such as how many laps down, collision, mechanical issue, finish, etc. 407 | {% enddocs %} 408 | 409 | -------------------------------------------------------------------------------- /models/core/core.yml: -------------------------------------------------------------------------------- 1 | version: 2 2 | 3 | models: 4 | - name: dim_circuits 5 | description: '{{ doc("dim_circuits") }}' 6 | columns: 7 | - name: dim_circuit_id 8 | description: "The primary key for this table, each circuit is distinct" 9 | tests: 10 | - unique 11 | - not_null 12 | - name: circuit_id 13 | description: '{{ doc("circuit_id_def") }}' 14 | - name: circuit_ref 15 | description: '{{ doc("circuit_ref_def") }}' 16 | - name: circuit_name 17 | description: '{{ doc("circuit_name_def") }}' 18 | - name: circuit_location 19 | description: '{{ doc("circuit_location_def") }}' 20 | - name: circuit_country 21 | description: '{{ doc("circuit_country_def") }}' 22 | - name: latitude 23 | description: '{{ doc("circuit_lat_def") }}' 24 | - name: longitude 25 | description: '{{ doc("circuit_lng_def") }}' 26 | - name: altitude 27 | description: '{{ doc("circuit_alt_def") }}' 28 | - name: circuit_url 29 | description: '{{ doc("circuit_url_def") }}' 30 | 31 | - name: fct_constructor_results 32 | description: "Fact data related to constructor (ie brand) results. This is at a race level but race info isn't in this model, just the race_id." 33 | columns: 34 | - name: constructor_results_id 35 | description: "The primary key for this table, result has its own id." 36 | tests: 37 | - unique 38 | - not_null 39 | - name: race_id 40 | description: '{{ doc("race_id_def") }}' 41 | - name: constructor_id 42 | description: '{{ doc("constructor_id_def") }}' 43 | - name: constructor_points 44 | description: '{{ doc("constructor_points_def") }}' 45 | - name: status 46 | description: '{{ doc("constructor_status_def") }}' 47 | 48 | - name: fct_constructor_standings 49 | description: "Fact data related to constructor (ie brand) results. This is at a race level but race info isn't in this model, just the race_id." 50 | columns: 51 | - name: constructor_standings_id 52 | description: "The primary key for this table, result has its own id." 53 | tests: 54 | - unique 55 | - not_null 56 | - name: race_id 57 | description: '{{ doc("race_id_def") }}' 58 | - name: constructor_id 59 | description: '{{ doc("constructor_id_def") }}' 60 | - name: points 61 | description: '{{ doc("constructor_points_def") }}' 62 | - name: constructor_position 63 | description: '{{ doc("constructor_status_def") }}' 64 | - name: position_text 65 | description: '{{ doc("constructor_points_def") }}' 66 | - name: wins 67 | description: '{{ doc("constructor_wins_def") }}' 68 | 69 | - name: dim_constructors 70 | description: "Dimensional data for the constructors (teams)." 71 | columns: 72 | - name: dim_constructor_id 73 | description: "The primary key for this table, each team has its own id." 74 | tests: 75 | - unique 76 | - not_null 77 | - name: constructor_id 78 | description: '{{ doc("constructor_id_def") }}' 79 | - name: constructor_ref 80 | description: '{{ doc("constructor_ref_def") }}' 81 | - name: constructor_name 82 | description: '{{ doc("constructor_name_def") }}' 83 | - name: constructor_nationality 84 | description: '{{ doc("constructor_nationality_def") }}' 85 | 86 | - name: fct_driver_standings 87 | description: "Fact data for driver standings by race." 88 | columns: 89 | - name: driver_standings_id 90 | description: "The primary key for this table." 91 | tests: 92 | - unique 93 | - not_null 94 | - name: race_id 95 | description: '{{ doc("race_id_def") }}' 96 | - name: driver_id 97 | description: '{{ doc("driver_id_def") }}' 98 | - name: driver_points 99 | description: '{{ doc("driver_points_def") }}' 100 | - name: driver_position 101 | description: '{{ doc("driver_position_def") }}' 102 | - name: position_text 103 | description: '{{ doc("position_text_def") }}' 104 | - name: driver_wins 105 | description: '{{ doc("driver_wins_def") }}' 106 | 107 | - name: dim_drivers 108 | description: "Dimensional model stage for driver attributes." 109 | columns: 110 | - name: driver_id 111 | description: "The primary key for this table." 112 | tests: 113 | - unique 114 | - not_null 115 | - name: driver_code 116 | description: '{{ doc("driver_code_def") }}' 117 | - name: driver_ref 118 | description: '{{ doc("driver_ref_def") }}' 119 | - name: driver_number 120 | description: '{{ doc("driver_number_def") }}' 121 | - name: forename 122 | description: '{{ doc("driver_forename_def") }}' 123 | - name: surname 124 | description: '{{ doc("driver_surname_def") }}' 125 | - name: date_of_birth 126 | description: '{{ doc("driver_dob_def") }}' 127 | - name: driver_nationality 128 | description: '{{ doc("driver_nationality_def") }}' 129 | 130 | - name: fct_lap_times 131 | description: "Fact data around every lap." 132 | columns: 133 | - name: lap_times_id 134 | description: "The primary key for this table." 135 | tests: 136 | - unique 137 | - not_null 138 | - name: race_id 139 | description: '{{ doc("race_id_def") }}' 140 | - name: driver_id 141 | description: '{{ doc("driver_id_def") }}' 142 | - name: lap 143 | description: '{{ doc("laps_def") }}' 144 | - name: driver_position 145 | description: '{{ doc("driver_position_def") }}' 146 | - name: lap_time_formatted 147 | description: '{{ doc("lap_time_def") }}' 148 | - name: official_laptime 149 | description: '{{ doc("official_lap_time_def") }}' 150 | - name: lap_time_milliseconds 151 | description: '{{ doc("lap_time_milliseconds_def") }}' 152 | 153 | - name: fct_pit_stops 154 | description: "Fact data around pitstop times." 155 | columns: 156 | - name: pitstop_id 157 | description: "The primary key for this table." 158 | tests: 159 | - unique 160 | - not_null 161 | - name: race_id 162 | description: '{{ doc("race_id_def") }}' 163 | - name: driver_id 164 | description: '{{ doc("driver_id_def") }}' 165 | - name: stop_number 166 | description: '{{ doc("laps_def") }}' 167 | - name: lap 168 | description: '{{ doc("stop_number_def") }}' 169 | - name: pit_stop_time 170 | description: '{{ doc("pit_stop_time_def") }}' 171 | - name: pit_stop_duration_seconds 172 | description: '{{ doc("pit_stop_duration_seconds_def") }}' 173 | - name: pit_stop_duration 174 | description: '{{ doc("pit_stop_duration_def") }}' 175 | - name: pit_stop_milliseconds 176 | description: '{{ doc("pit_stop_milliseconds_def") }}' 177 | 178 | - name: fct_qualifying 179 | description: "Fact data around qualifying times." 180 | columns: 181 | - name: qualifying_id 182 | description: "The primary key for this table." 183 | tests: 184 | - unique 185 | - not_null 186 | - name: race_id 187 | description: '{{ doc("race_id_def") }}' 188 | - name: driver_id 189 | description: '{{ doc("driver_id_def") }}' 190 | - name: constructor_id 191 | description: '{{ doc("constructor_id_def") }}' 192 | - name: driver_number 193 | description: '{{ doc("driver_number_def") }}' 194 | - name: qualifying_position 195 | description: '{{ doc("qualifying_position_def") }}' 196 | - name: q1_time 197 | description: '{{ doc("q1_time_def") }}' 198 | - name: q2_time 199 | description: '{{ doc("q2_time_def") }}' 200 | - name: q3_time 201 | description: '{{ doc("q3_time_def") }}' 202 | 203 | - name: fct_qualifying_pivoted 204 | description: "Fact data around qualifying times." 205 | columns: 206 | - name: qualifying_pivoted_id 207 | description: "The primary key for this table." 208 | tests: 209 | - unique 210 | - not_null 211 | - name: qualifying_round 212 | description: "The round of qualifying, must be distinct values 1, 2 or 3." 213 | tests: 214 | - accepted_values: 215 | values: ['1', '2', '3'] 216 | 217 | - name: dim_races 218 | description: "Dimensional data around race events." 219 | columns: 220 | - name: race_id 221 | description: "The primary key for this table." 222 | tests: 223 | - unique 224 | - not_null 225 | - name: race_year 226 | description: '{{ doc("race_year_def") }}' 227 | - name: race_round 228 | description: '{{ doc("race_round_def") }}' 229 | - name: circuit_id 230 | description: '{{ doc("circuit_id_def") }}' 231 | - name: race_name 232 | description: '{{ doc("race_name_def") }}' 233 | - name: race_date 234 | description: '{{ doc("race_date_def") }}' 235 | - name: race_time 236 | description: '{{ doc("race_time_def") }}' 237 | - name: race_url 238 | description: '{{ doc("race_url_def") }}' 239 | - name: fp1_date 240 | description: '{{ doc("fp1_date_def") }}' 241 | - name: fp1_time 242 | description: '{{ doc("fp1_time_def") }}' 243 | - name: fp2_date 244 | description: '{{ doc("fp2_date_def") }}' 245 | - name: fp2_time 246 | description: '{{ doc("fp2_time_def") }}' 247 | - name: fp3_date 248 | description: '{{ doc("fp3_date_def") }}' 249 | - name: fp3_time 250 | description: '{{ doc("fp3_time_def") }}' 251 | - name: quali_date 252 | description: '{{ doc("quali_date_def") }}' 253 | - name: quali_time 254 | description: '{{ doc("quali_time_def") }}' 255 | - name: sprint_date 256 | description: '{{ doc("sprint_date_def") }}' 257 | - name: sprint_time 258 | description: '{{ doc("sprint_time_def") }}' 259 | 260 | - name: fct_results 261 | description: "Fact data around qualifying times." 262 | columns: 263 | - name: fct_result_id 264 | description: "The primary key for this table." 265 | tests: 266 | - unique 267 | - not_null 268 | - name: result_id 269 | description: '{{ doc("result_id_def") }}' 270 | - name: race_id 271 | description: '{{ doc("race_id_def") }}' 272 | - name: driver_id 273 | description: '{{ doc("driver_id_def") }}' 274 | - name: constructor_id 275 | description: '{{ doc("constructor_id_def") }}' 276 | - name: driver_number 277 | description: '{{ doc("driver_number_def") }}' 278 | - name: grid 279 | description: '{{ doc("grid_def") }}' 280 | - name: driver_position 281 | description: '{{ doc("driver_position_def") }}' 282 | - name: position_text 283 | description: '{{ doc("position_text_def") }}' 284 | - name: position_order 285 | description: '{{ doc("position_order_def") }}' 286 | - name: points 287 | description: '{{ doc("points_def") }}' 288 | - name: laps 289 | description: '{{ doc("laps_def") }}' 290 | - name: race_time 291 | description: '{{ doc("race_time_def") }}' 292 | - name: milliseconds 293 | description: '{{ doc("results_milliseconds_def") }}' 294 | - name: fastest_lap 295 | description: '{{ doc("fastest_lap_def") }}' 296 | - name: driver_rank 297 | description: '{{ doc("driver_rank_def") }}' 298 | - name: fastest_lap_time 299 | description: '{{ doc("fastest_lap_time_def") }}' 300 | - name: fastest_lap_speed 301 | description: '{{ doc("fastest_lap_speed_def") }}' 302 | - name: status_id 303 | description: '{{ doc("status_id_def") }}' 304 | 305 | - name: dim_seasons 306 | description: "Season information" 307 | columns: 308 | - name: season_year 309 | description: "The primary key for this table." 310 | tests: 311 | - unique 312 | - not_null 313 | - name: season_url 314 | description: "The Wiki URL for F1 Season" 315 | 316 | - name: fct_sprint_results 317 | description: "Fact data around qualifying times." 318 | columns: 319 | - name: result_id 320 | description: "The primary key for this table." 321 | tests: 322 | - unique 323 | - not_null 324 | - name: race_id 325 | description: '{{ doc("race_id_def") }}' 326 | - name: driver_id 327 | description: '{{ doc("driver_id_def") }}' 328 | - name: constructor_id 329 | description: '{{ doc("constructor_id_def") }}' 330 | - name: driver_number 331 | description: '{{ doc("driver_number_def") }}' 332 | - name: grid 333 | description: '{{ doc("grid_def") }}' 334 | - name: driver_position 335 | description: '{{ doc("driver_position_def") }}' 336 | - name: position_text 337 | description: '{{ doc("position_text_def") }}' 338 | - name: position_order 339 | description: '{{ doc("position_order_def") }}' 340 | - name: points 341 | description: '{{ doc("points_def") }}' 342 | - name: laps 343 | description: '{{ doc("laps_def") }}' 344 | - name: sprint_time 345 | description: '{{ doc("sprint_time_def") }}' 346 | - name: milliseconds 347 | description: '{{ doc("results_milliseconds_def") }}' 348 | - name: fastest_lap 349 | description: '{{ doc("fastest_lap_def") }}' 350 | - name: fastest_lap_time 351 | description: '{{ doc("fastest_lap_time_def") }}' 352 | - name: status_id 353 | description: '{{ doc("status_id_def") }}' 354 | 355 | 356 | - name: dim_status 357 | description: "Status codes mapping." 358 | columns: 359 | - name: status_id 360 | description: "The primary key for this table." 361 | tests: 362 | - unique 363 | - not_null 364 | - name: status 365 | description: '{{ doc("status_def") }}' -------------------------------------------------------------------------------- /models/core/core_table_definitions.md: -------------------------------------------------------------------------------- 1 | {% docs dim_circuits %} 2 | 3 | This table contains dimensional data related to F1 circuits. 4 | 5 | {% enddocs %} -------------------------------------------------------------------------------- /models/core/dim_circuits.sql: -------------------------------------------------------------------------------- 1 | {{ config( 2 | tags=["dimension", "track"] 3 | ) }} 4 | 5 | WITH circuits AS ( 6 | SELECT * FROM {{ ref('stg_circuits') }} 7 | ) 8 | 9 | SELECT 10 | {{ dbt_utils.generate_surrogate_key(['circuit_id']) }} AS dim_circuit_id, 11 | * 12 | FROM circuits -------------------------------------------------------------------------------- /models/core/dim_constructors.sql: -------------------------------------------------------------------------------- 1 | WITH constructors AS ( 2 | SELECT * FROM {{ ref('stg_constructors') }} 3 | ) 4 | 5 | SELECT 6 | {{ dbt_utils.generate_surrogate_key(['constructor_id']) }} AS dim_constructor_id, 7 | constructor_id AS constructor_id, 8 | constructor_ref AS constructor_ref, 9 | constructor_name AS constructor_name, 10 | constructor_nationality AS constructor_nationality 11 | --constructor_url AS constructor_url 12 | FROM constructors -------------------------------------------------------------------------------- /models/core/dim_drivers.sql: -------------------------------------------------------------------------------- 1 | WITH drivers AS ( 2 | SELECT 3 | driver_id AS driver_id, 4 | driver_ref AS driver_ref, 5 | driver_number AS driver_number, 6 | driver_code AS driver_code, 7 | forename AS forename, 8 | surname AS surname, 9 | date_of_birth AS date_of_birth, 10 | driver_nationality AS driver_nationality 11 | --driver_url AS driver_url 12 | FROM {{ ref('stg_drivers') }} 13 | ) 14 | 15 | SELECT * FROM drivers -------------------------------------------------------------------------------- /models/core/dim_races.sql: -------------------------------------------------------------------------------- 1 | WITH races AS ( 2 | SELECT 3 | race_id AS race_id, 4 | race_year AS race_year, 5 | race_round AS race_round, 6 | circuit_id AS circuit_id, 7 | race_name AS race_name, 8 | race_date AS race_date, 9 | race_time AS race_time, 10 | --race_url AS race_url, 11 | fp1_date AS fp1_date, 12 | fp1_time AS fp1_time, 13 | fp2_date AS fp2_date, 14 | fp2_time AS fp2_time, 15 | fp3_date AS fp3_date, 16 | fp3_time AS fp3_time, 17 | quali_date AS quali_date, 18 | quali_time AS quali_time, 19 | sprint_date AS sprint_date, 20 | sprint_time AS sprint_time 21 | FROM {{ ref('stg_races') }} 22 | ) 23 | 24 | SELECT * FROM races -------------------------------------------------------------------------------- /models/core/dim_seasons.sql: -------------------------------------------------------------------------------- 1 | WITH seasons AS ( 2 | SELECT 3 | season_year AS season_year, 4 | season_url AS season_url 5 | FROM {{ ref('stg_seasons') }} 6 | ) 7 | 8 | SELECT * FROM seasons -------------------------------------------------------------------------------- /models/core/dim_status.sql: -------------------------------------------------------------------------------- 1 | WITH status AS ( 2 | SELECT 3 | status_id AS status_id, 4 | status AS status 5 | FROM {{ ref('stg_status') }} 6 | ) 7 | 8 | SELECT * FROM status -------------------------------------------------------------------------------- /models/core/fct_constructor_results.sql: -------------------------------------------------------------------------------- 1 | WITH constructor_results AS ( 2 | SELECT * FROM {{ ref('stg_constructor_results') }} 3 | ) 4 | 5 | SELECT 6 | constructor_results_id, 7 | race_id, 8 | constructor_id, 9 | constructor_points, 10 | status 11 | FROM constructor_results -------------------------------------------------------------------------------- /models/core/fct_constructor_standings.sql: -------------------------------------------------------------------------------- 1 | WITH constructor_standings AS ( 2 | SELECT 3 | constructor_standings_id AS constructor_standings_id, 4 | race_id AS race_id, 5 | constructor_id AS constructor_id, 6 | points AS points, 7 | constructor_position AS constructor_position, 8 | position_text AS position_text, 9 | wins AS wins 10 | FROM {{ ref('stg_constructor_standings') }} 11 | ) 12 | 13 | SELECT * FROM constructor_standings -------------------------------------------------------------------------------- /models/core/fct_driver_standings.sql: -------------------------------------------------------------------------------- 1 | WITH driver_standings AS ( 2 | SELECT 3 | driver_standings_id AS driver_standings_id, 4 | race_id AS race_id, 5 | driver_id AS driver_id, 6 | driver_points AS driver_points, 7 | driver_position AS driver_position, 8 | position_text AS position_text, 9 | driver_wins AS driver_wins 10 | FROM {{ ref('stg_driver_standings') }} 11 | ) 12 | 13 | SELECT * FROM driver_standings -------------------------------------------------------------------------------- /models/core/fct_lap_times.sql: -------------------------------------------------------------------------------- 1 | WITH lap_times AS ( 2 | SELECT 3 | {{ dbt_utils.generate_surrogate_key(['race_id', 'driver_id', 'lap']) }} AS lap_times_id, 4 | race_id AS race_id, 5 | driver_id AS driver_id, 6 | lap AS lap, 7 | driver_position AS driver_position, 8 | lap_time_formatted AS lap_time_formatted, 9 | official_laptime AS official_laptime, 10 | lap_time_milliseconds AS lap_time_milliseconds 11 | FROM {{ ref('stg_lap_times') }} 12 | ) 13 | 14 | SELECT * FROM lap_times -------------------------------------------------------------------------------- /models/core/fct_pit_stops.sql: -------------------------------------------------------------------------------- 1 | WITH pitstops AS ( 2 | SELECT 3 | {{ dbt_utils.generate_surrogate_key(['race_id', 'driver_id', 'stop_number']) }} AS pitstop_id, 4 | race_id AS race_id, 5 | driver_id AS driver_id, 6 | stop_number AS stop_number, 7 | lap AS lap, 8 | pit_stop_time AS pit_stop_time, 9 | pit_stop_duration_seconds AS pit_stop_duration_seconds, 10 | pit_stop_duration AS pit_stop_duration, 11 | pit_stop_milliseconds AS pit_stop_milliseconds, 12 | max(stop_number) over (partition by race_id,driver_id) as total_pit_stops_per_race 13 | FROM {{ ref('stg_pit_stops') }} 14 | ) 15 | 16 | SELECT * FROM pitstops -------------------------------------------------------------------------------- /models/core/fct_qualifying.sql: -------------------------------------------------------------------------------- 1 | {{ config( 2 | enabled=false 3 | ) }} 4 | WITH qualifying AS ( 5 | SELECT 6 | qualifying_id AS qualifying_id, 7 | race_id AS race_id, 8 | driver_id AS driver_id, 9 | constructor_id AS constructor_id, 10 | driver_number AS driver_number, 11 | qualifying_position AS qualifying_position, 12 | q1_time AS q1_time, 13 | q2_time AS q2_time, 14 | q3_time AS q3_time 15 | FROM {{ ref('stg_qualifying') }} 16 | ) 17 | 18 | SELECT * FROM qualifying 19 | -------------------------------------------------------------------------------- /models/core/fct_results.sql: -------------------------------------------------------------------------------- 1 | {{ config( 2 | tags=["race", "laptimes"], 3 | materialized = "incremental", 4 | unique_key = "result_id" 5 | ) }} 6 | 7 | WITH results AS ( 8 | SELECT 9 | {{ dbt_utils.generate_surrogate_key(['result_id']) }} AS fct_result_id, 10 | result_id AS result_id, 11 | race_id AS race_id, 12 | driver_id AS driver_id, 13 | constructor_id AS constructor_id, 14 | driver_number AS driver_number, 15 | grid AS grid, 16 | driver_position AS driver_position, 17 | position_text AS position_text, 18 | position_order AS position_order, 19 | points AS points, 20 | laps AS laps, 21 | race_time AS race_time, 22 | milliseconds AS milliseconds, 23 | fastest_lap AS fastest_lap, 24 | driver_rank AS driver_rank, 25 | fastest_lap_time AS fastest_lap_time, 26 | fastest_lap_speed AS fastest_lap_speed, 27 | status_id AS status_id 28 | FROM {{ ref('stg_results') }} 29 | ) 30 | 31 | SELECT * FROM results -------------------------------------------------------------------------------- /models/core/fct_sprint_results.sql: -------------------------------------------------------------------------------- 1 | WITH sprint_results AS ( 2 | SELECT 3 | result_id AS result_id, 4 | race_id AS race_id, 5 | driver_id AS driver_id, 6 | constructor_id AS constructor_id, 7 | driver_number AS driver_number, 8 | grid AS grid, 9 | driver_position AS driver_position, 10 | position_text AS position_text, 11 | position_order AS position_order, 12 | points AS points, 13 | laps AS laps, 14 | sprint_time AS sprint_time, 15 | milliseconds AS milliseconds, 16 | fastest_lap AS fastest_lap, 17 | fastest_lap_time AS fastest_lap_time, 18 | status_id AS status_id 19 | FROM {{ ref('stg_sprint_results') }} 20 | ) 21 | 22 | SELECT * FROM sprint_results -------------------------------------------------------------------------------- /models/marts/aggregates/agg_fastest_pit_stops_by_constructor.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | def model(dbt, session): 5 | # dbt configuration 6 | dbt.config(packages=["pandas","numpy"]) 7 | 8 | # get upstream data 9 | pit_stops_joined = dbt.ref("mrt_pit_stops").to_pandas() 10 | 11 | # provide year so we do not hardcode dates 12 | year=2021 13 | 14 | # describe the data 15 | pit_stops_joined["PIT_STOP_SECONDS"] = pit_stops_joined["PIT_STOP_MILLISECONDS"]/1000 16 | fastest_pit_stops = pit_stops_joined[(pit_stops_joined["RACE_YEAR"]==year)].groupby(by="CONSTRUCTOR_NAME")["PIT_STOP_SECONDS"].describe().sort_values(by='mean') 17 | fastest_pit_stops.reset_index(inplace=True) 18 | fastest_pit_stops.columns = fastest_pit_stops.columns.str.upper() 19 | 20 | return fastest_pit_stops.round(2) -------------------------------------------------------------------------------- /models/marts/aggregates/agg_lap_times_moving_avg.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | def model(dbt, session): 4 | # dbt configuration 5 | dbt.config(packages=["pandas"]) 6 | 7 | # get upstream data 8 | lap_times = dbt.ref("mrt_lap_times_years").to_pandas() 9 | 10 | # describe the data 11 | lap_times["LAP_TIME_SECONDS"] = lap_times["LAP_TIME_MILLISECONDS"]/1000 12 | lap_time_trends = lap_times.groupby(by="RACE_YEAR")["LAP_TIME_SECONDS"].mean().to_frame() 13 | lap_time_trends.reset_index(inplace=True) 14 | lap_time_trends["LAP_MOVING_AVG_5_YEARS"] = lap_time_trends["LAP_TIME_SECONDS"].rolling(5).mean() 15 | lap_time_trends.columns = lap_time_trends.columns.str.upper() 16 | 17 | return lap_time_trends.round(1) -------------------------------------------------------------------------------- /models/marts/aggregates/aggregates.yml: -------------------------------------------------------------------------------- 1 | version: 2 2 | 3 | models: 4 | - name: fastest_pit_stops_by_constructor 5 | description: Use the python .describe() method to retrieve summary statistics table about pit stops by constructor. Sort by average stop time ascending so the first row returns the fastest constructor. 6 | columns: 7 | - name: constructor_name 8 | description: team that makes the car 9 | tests: 10 | - unique 11 | 12 | - name: lap_times_moving_avg 13 | description: Use the python .rolling() method to calculate the 5 year rolling average of pit stop times alongside the average for each year. 14 | columns: 15 | - name: race_year 16 | description: year of the race 17 | tests: 18 | - relationships: 19 | to: ref('int_lap_times_years') 20 | field: race_year 21 | -------------------------------------------------------------------------------- /models/marts/mrt.md: -------------------------------------------------------------------------------- 1 | # the intent of this .md is to allow for multi-line long form explanations for our intermediate transformations 2 | 3 | # below are descriptions 4 | {% docs int_results %} In this query we want to join out other important information about the race results to have a human readable table about results, races, drivers, constructors, and status. 5 | We will have 4 left joins onto our results table. {% enddocs %} 6 | 7 | {% docs int_pit_stops %} There are many pit stops within one race, aka a M:1 relationship. 8 | We want to aggregate this so we can properly join pit stop information without creating a fanout. {% enddocs %} 9 | 10 | {% docs int_lap_times_years %} Lap times are done per lap. We need to join them out to the race year to understand yearly lap time trends. {% enddocs %} -------------------------------------------------------------------------------- /models/marts/mrt.yml: -------------------------------------------------------------------------------- 1 | version: 2 2 | 3 | models: 4 | - name: int_results 5 | description: '{{ doc("int_results") }}' 6 | 7 | - name: int_pit_stops 8 | description: '{{ doc("int_pit_stops") }}' 9 | 10 | - name: int_lap_times_years 11 | description: '{{ doc("int_lap_times_years") }}' -------------------------------------------------------------------------------- /models/marts/mrt_lap_times_years.sql: -------------------------------------------------------------------------------- 1 | with lap_times as ( 2 | 3 | select * from {{ ref('fct_lap_times') }} 4 | 5 | ), 6 | races as ( 7 | 8 | select * from {{ ref('dim_races') }} 9 | ), 10 | expanded_lap_times_by_year as ( 11 | select 12 | lap_times.race_id, 13 | driver_id, 14 | race_year, 15 | lap, 16 | lap_time_milliseconds 17 | from lap_times 18 | left join races 19 | on lap_times.race_id = races.race_id 20 | where lap_time_milliseconds is not null 21 | ) 22 | select * from expanded_lap_times_by_year -------------------------------------------------------------------------------- /models/marts/mrt_pit_stops.sql: -------------------------------------------------------------------------------- 1 | with base_results as ( 2 | 3 | select * from {{ ref('mrt_results') }} 4 | 5 | ), 6 | 7 | pit_stops as ( 8 | 9 | select * from {{ ref('fct_pit_stops') }} 10 | 11 | ), 12 | 13 | pit_stops_joined as ( 14 | 15 | select 16 | base_results.race_id, 17 | race_year, 18 | base_results.driver_id, 19 | constructor_id, 20 | constructor_name, 21 | stop_number, 22 | lap, 23 | pit_stop_time, 24 | pit_stop_duration_seconds, 25 | pit_stop_milliseconds, 26 | total_pit_stops_per_race 27 | from base_results 28 | left join pit_stops 29 | on base_results.race_id=pit_stops.race_id and base_results.driver_id=pit_stops.driver_id 30 | ) 31 | select * from pit_stops_joined -------------------------------------------------------------------------------- /models/marts/mrt_results.sql: -------------------------------------------------------------------------------- 1 | with results as ( 2 | 3 | select * from {{ ref('fct_results') }} 4 | 5 | ), 6 | 7 | races as ( 8 | 9 | select * from {{ ref('dim_races') }} 10 | 11 | ), 12 | 13 | drivers as ( 14 | 15 | select * from {{ ref('dim_drivers') }} 16 | 17 | ), 18 | 19 | constructors as ( 20 | 21 | select * from {{ ref('dim_constructors') }} 22 | 23 | ), 24 | 25 | status as ( 26 | 27 | select * from {{ ref('dim_status') }} 28 | 29 | ), 30 | 31 | int_results as ( 32 | select 33 | result_id, 34 | results.race_id, 35 | race_year, 36 | race_round, 37 | circuit_id, 38 | race_name, 39 | race_date, 40 | results.race_time, 41 | results.driver_id, 42 | results.driver_number, 43 | forename ||' '|| surname as driver, 44 | cast(datediff('year', date_of_birth, race_date) as int) as drivers_age_years, 45 | driver_nationality, 46 | results.constructor_id, 47 | constructor_name, 48 | constructor_nationality, 49 | grid, 50 | driver_position, 51 | position_text, 52 | position_order, 53 | points, 54 | laps, 55 | milliseconds, 56 | fastest_lap, 57 | driver_rank, 58 | fastest_lap_time, 59 | fastest_lap_speed, 60 | results.status_id, 61 | status, 62 | case when driver_position is null then 1 else 0 end as dnf_flag 63 | from results 64 | left join races 65 | on results.race_id=races.race_id 66 | left join drivers 67 | on results.driver_id = drivers.driver_id 68 | left join constructors 69 | on results.constructor_id = constructors.constructor_id 70 | left join status 71 | on results.status_id = status.status_id 72 | ) 73 | 74 | select * from int_results 75 | -------------------------------------------------------------------------------- /models/marts/mrt_results_circuits.sql: -------------------------------------------------------------------------------- 1 | with int_results as ( 2 | 3 | select * from {{ ref('mrt_results') }} 4 | 5 | ), 6 | 7 | int_pit_stops as ( 8 | select 9 | race_id, 10 | driver_id, 11 | max(total_pit_stops_per_race) as total_pit_stops_per_race 12 | from {{ ref('mrt_pit_stops') }} 13 | group by 1,2 14 | 15 | ), 16 | 17 | circuits as ( 18 | 19 | select * from {{ ref('dim_circuits') }} 20 | 21 | ), 22 | 23 | base_results as ( 24 | select 25 | result_id, 26 | int_results.race_id, 27 | race_year, 28 | race_round, 29 | int_results.circuit_id, 30 | int_results.race_name, 31 | circuit_ref, 32 | circuit_location, 33 | circuit_country, 34 | latitude, 35 | longitude, 36 | altitude, 37 | total_pit_stops_per_race, 38 | race_date, 39 | race_time, 40 | int_results.driver_id, 41 | driver, 42 | driver_number, 43 | drivers_age_years, 44 | driver_nationality, 45 | constructor_id, 46 | constructor_name, 47 | constructor_nationality, 48 | grid, 49 | driver_position, 50 | position_text, 51 | position_order, 52 | points, 53 | laps, 54 | milliseconds, 55 | fastest_lap, 56 | driver_rank, 57 | fastest_lap_time, 58 | fastest_lap_speed, 59 | status_id, 60 | status, 61 | dnf_flag 62 | from int_results 63 | left join circuits 64 | on int_results.circuit_id=circuits.circuit_id 65 | left join int_pit_stops 66 | on int_results.driver_id=int_pit_stops.driver_id and int_results.race_id=int_pit_stops.race_id 67 | ) 68 | 69 | select * from base_results -------------------------------------------------------------------------------- /models/ml/prep_encoding_splitting/covariate_encoding.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from sklearn.preprocessing import StandardScaler,LabelEncoder,OneHotEncoder 4 | from sklearn.linear_model import LogisticRegression 5 | 6 | def model(dbt, session): 7 | # dbt configuration 8 | dbt.config(packages=["pandas","numpy","scikit-learn"]) 9 | 10 | # get upstream data 11 | data = dbt.ref("ml_data_prep").to_pandas() 12 | 13 | # list out covariates we want to use in addition to outcome variable we are modeling - position 14 | covariates = data[['RACE_YEAR','RACE_NAME','GRID','CONSTRUCTOR_NAME','DRIVER','DRIVERS_AGE_YEARS','DRIVER_CONFIDENCE','CONSTRUCTOR_RELAIBLITY','TOTAL_PIT_STOPS_PER_RACE','ACTIVE_DRIVER','ACTIVE_CONSTRUCTOR', 'DRIVER_POSITION']] 15 | 16 | # filter covariates on active drivers and constructors 17 | # use fil_cov as short for "filtered_covariates" 18 | fil_cov = covariates[(covariates['ACTIVE_DRIVER']==1)&(covariates['ACTIVE_CONSTRUCTOR']==1)] 19 | 20 | # Encode categorical variables using LabelEncoder 21 | # TODO: we'll update this to both ohe in the future for non-ordinal variables! 22 | le = LabelEncoder() 23 | fil_cov['RACE_NAME'] = le.fit_transform(fil_cov['RACE_NAME']) 24 | fil_cov['CONSTRUCTOR_NAME'] = le.fit_transform(fil_cov['CONSTRUCTOR_NAME']) 25 | fil_cov['DRIVER'] = le.fit_transform(fil_cov['DRIVER']) 26 | fil_cov['TOTAL_PIT_STOPS_PER_RACE'] = le.fit_transform(fil_cov['TOTAL_PIT_STOPS_PER_RACE']) 27 | 28 | # Simply target variable "position" to represent 3 meaningful categories in Formula1 29 | # 1. Podium position 2. Points for team 3. Nothing - no podium or points! 30 | def position_index(x): 31 | if x<4: 32 | return 1 33 | if x>10: 34 | return 3 35 | else : 36 | return 2 37 | 38 | # we are dropping the columns that we filtered on in addition to our training variable 39 | encoded_data = fil_cov.drop(['ACTIVE_DRIVER','ACTIVE_CONSTRUCTOR'],1) 40 | encoded_data['POSITION_LABEL']= encoded_data['DRIVER_POSITION'].apply(lambda x: position_index(x)) 41 | encoded_data_grouped_target = encoded_data.drop(['DRIVER_POSITION'],1) 42 | 43 | return encoded_data_grouped_target -------------------------------------------------------------------------------- /models/ml/prep_encoding_splitting/encoding_mapping.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from sklearn.preprocessing import StandardScaler,LabelEncoder,OneHotEncoder 4 | from sklearn.linear_model import LogisticRegression 5 | 6 | def model(dbt, session): 7 | # dbt configuration 8 | dbt.config(packages=["pandas","numpy","scikit-learn"]) 9 | 10 | # get upstream data 11 | data = dbt.ref("ml_data_prep").to_pandas() 12 | 13 | # list out covariates we want to use in addition to outcome variable we are modeling - position 14 | covariates = data[['RACE_YEAR','RACE_NAME','GRID','CONSTRUCTOR_NAME','DRIVER','DRIVERS_AGE_YEARS','DRIVER_CONFIDENCE','CONSTRUCTOR_RELAIBLITY','TOTAL_PIT_STOPS_PER_RACE','ACTIVE_DRIVER','ACTIVE_CONSTRUCTOR', 'DRIVER_POSITION']] 15 | 16 | # filter covariates on active drivers and constructors 17 | # use fil_cov as short for "filtered_covariates" 18 | fil_cov = covariates[(covariates['ACTIVE_DRIVER']==1)&(covariates['ACTIVE_CONSTRUCTOR']==1)] 19 | 20 | # Encode categorical variables using LabelEncoder 21 | # TODO: we'll update this to both ohe in the future for non-ordinal variables! 22 | le = LabelEncoder() 23 | fil_cov['DRIVER'] = le.fit_transform(fil_cov['DRIVER']) 24 | 25 | encoding_mapping = pd.DataFrame({'original_driver_value': le.classes_, 'encoded_driver_value': range(len(le.classes_))}) 26 | encoding_mapping.columns = encoding_mapping.columns.str.upper() 27 | 28 | 29 | 30 | return encoding_mapping 31 | 32 | 33 | -------------------------------------------------------------------------------- /models/ml/prep_encoding_splitting/hold_out_dataset_for_prediction.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | def model(dbt, session): 4 | # dbt configuration 5 | dbt.config(packages=["pandas"], tags="predict") 6 | 7 | # get upstream data 8 | encoding = dbt.ref("covariate_encoding").to_pandas() 9 | 10 | # variable for year instead of hardcoding it 11 | year=2020 12 | 13 | # filter the data based on the specified year 14 | hold_out_dataset = encoding.loc[encoding['RACE_YEAR'] == year] 15 | 16 | return hold_out_dataset 17 | 18 | -------------------------------------------------------------------------------- /models/ml/prep_encoding_splitting/ml_data_prep.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | def model(dbt, session): 4 | # dbt configuration 5 | dbt.config(packages=["pandas"]) 6 | 7 | # get upstream data 8 | fct_results = dbt.ref("mrt_results_circuits").to_pandas() 9 | 10 | # provide years so we do not hardcode dates in filter command 11 | start_year=2010 12 | end_year=2020 13 | 14 | # describe the data for a full decade 15 | data = fct_results.loc[fct_results['RACE_YEAR'].between(start_year, end_year)] 16 | 17 | # convert string to an integer 18 | # data['POSITION'] = data['POSITION'].astype(float) 19 | 20 | # we cannot have nulls if we want to use total pit stops 21 | data['TOTAL_PIT_STOPS_PER_RACE'] = data['TOTAL_PIT_STOPS_PER_RACE'].fillna(0) 22 | 23 | # some of the constructors changed their name over the year so replacing old names with current name 24 | mapping = {'Force India': 'Racing Point', 'Sauber': 'Alfa Romeo', 'Lotus F1': 'Renault', 'Toro Rosso': 'AlphaTauri'} 25 | data['CONSTRUCTOR_NAME'].replace(mapping, inplace=True) 26 | 27 | # create confidence metrics for drivers and constructors 28 | dnf_by_driver = data.groupby('DRIVER').sum()['DNF_FLAG'] 29 | driver_race_entered = data.groupby('DRIVER').count()['DNF_FLAG'] 30 | driver_dnf_ratio = (dnf_by_driver/driver_race_entered) 31 | driver_confidence = 1-driver_dnf_ratio 32 | driver_confidence_dict = dict(zip(driver_confidence.index,driver_confidence)) 33 | 34 | dnf_by_constructor = data.groupby('CONSTRUCTOR_NAME').sum()['DNF_FLAG'] 35 | constructor_race_entered = data.groupby('CONSTRUCTOR_NAME').count()['DNF_FLAG'] 36 | constructor_dnf_ratio = (dnf_by_constructor/constructor_race_entered) 37 | constructor_relaiblity = 1-constructor_dnf_ratio 38 | constructor_relaiblity_dict = dict(zip(constructor_relaiblity.index,constructor_relaiblity)) 39 | 40 | data['DRIVER_CONFIDENCE'] = data['DRIVER'].apply(lambda x:driver_confidence_dict[x]) 41 | data['CONSTRUCTOR_RELAIBLITY'] = data['CONSTRUCTOR_NAME'].apply(lambda x:constructor_relaiblity_dict[x]) 42 | 43 | #removing retired drivers and constructors 44 | active_constructors = ['Renault', 'Williams', 'McLaren', 'Ferrari', 'Mercedes', 45 | 'AlphaTauri', 'Racing Point', 'Alfa Romeo', 'Red Bull', 46 | 'Haas F1 Team'] 47 | active_drivers = ['Daniel Ricciardo', 'Kevin Magnussen', 'Carlos Sainz', 48 | 'Valtteri Bottas', 'Lance Stroll', 'George Russell', 49 | 'Lando Norris', 'Sebastian Vettel', 'Kimi Räikkönen', 50 | 'Charles Leclerc', 'Lewis Hamilton', 'Daniil Kvyat', 51 | 'Max Verstappen', 'Pierre Gasly', 'Alexander Albon', 52 | 'Sergio Pérez', 'Esteban Ocon', 'Antonio Giovinazzi', 53 | 'Romain Grosjean','Nicholas Latifi'] 54 | 55 | # create flags for active drivers and constructors so we can filter downstream 56 | data['ACTIVE_DRIVER'] = data['DRIVER'].apply(lambda x: int(x in active_drivers)) 57 | data['ACTIVE_CONSTRUCTOR'] = data['CONSTRUCTOR_NAME'].apply(lambda x: int(x in active_constructors)) 58 | 59 | return data -------------------------------------------------------------------------------- /models/ml/prep_encoding_splitting/training_testing_dataset.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | def model(dbt, session): 4 | 5 | # dbt configuration 6 | dbt.config(packages=["pandas"], tags="train") 7 | 8 | # get upstream data 9 | encoding = dbt.ref("covariate_encoding").to_pandas() 10 | 11 | # provide years so we do not hardcode dates in filter command 12 | start_year=2010 13 | end_year=2019 14 | 15 | # describe the data for a full decade 16 | train_test_dataset = encoding.loc[encoding['RACE_YEAR'].between(start_year, end_year)] 17 | 18 | return train_test_dataset -------------------------------------------------------------------------------- /models/ml/training_and_prediction/apply_prediction_to_position.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import joblib 3 | import pandas as pd 4 | import os 5 | from snowflake.snowpark import types as T 6 | 7 | DB_STAGE = 'MODELSTAGE' 8 | version = '1.0' 9 | # The name of the model file 10 | model_file_path = 'driver_position_'+version 11 | model_file_packaged = 'driver_position_'+version+'.joblib' 12 | 13 | # This is a local directory, used for storing the various artifacts locally 14 | LOCAL_TEMP_DIR = f'/tmp/driver_position' 15 | DOWNLOAD_DIR = os.path.join(LOCAL_TEMP_DIR, 'download') 16 | TARGET_MODEL_DIR_PATH = os.path.join(LOCAL_TEMP_DIR, 'ml_model') 17 | TARGET_LIB_PATH = os.path.join(LOCAL_TEMP_DIR, 'lib') 18 | 19 | # The feature columns that were used during model training 20 | # and that will be used during prediction 21 | FEATURE_COLS = [ 22 | "RACE_YEAR" 23 | ,"RACE_NAME" 24 | ,"GRID" 25 | ,"CONSTRUCTOR_NAME" 26 | ,"DRIVER" 27 | ,"DRIVERS_AGE_YEARS" 28 | ,"DRIVER_CONFIDENCE" 29 | ,"CONSTRUCTOR_RELAIBLITY" 30 | ,"TOTAL_PIT_STOPS_PER_RACE"] 31 | 32 | def register_udf_for_prediction(p_predictor ,p_session ,p_dbt): 33 | 34 | # The prediction udf 35 | 36 | def predict_position(p_df: T.PandasDataFrame[int, int, int, int, 37 | int, int, int, int, int]) -> T.PandasSeries[int]: 38 | # Snowpark currently does not set the column name in the input dataframe 39 | # The default col names are like 0,1,2,... Hence we need to reset the column 40 | # names to the features that we initially used for training. 41 | p_df.columns = [*FEATURE_COLS] 42 | 43 | # Perform prediction. this returns an array object 44 | pred_array = p_predictor.predict(p_df) 45 | # Convert to series 46 | df_predicted = pd.Series(pred_array) 47 | return df_predicted 48 | 49 | # The list of packages that will be used by UDF 50 | udf_packages = p_dbt.config.get('packages') 51 | 52 | predict_position_udf = p_session.udf.register( 53 | predict_position 54 | ,name=f'predict_position' 55 | ,packages = udf_packages 56 | ) 57 | return predict_position_udf 58 | 59 | def download_models_and_libs_from_stage(p_session): 60 | p_session.file.get(f'@{DB_STAGE}/{model_file_path}/{model_file_packaged}', DOWNLOAD_DIR) 61 | 62 | def load_model(p_session): 63 | # Load the model and initialize the predictor 64 | model_fl_path = os.path.join(DOWNLOAD_DIR, model_file_packaged) 65 | predictor = joblib.load(model_fl_path) 66 | return predictor 67 | 68 | # ------------------------------- 69 | def model(dbt, session): 70 | dbt.config( 71 | packages = ['snowflake-snowpark-python' ,'scipy','scikit-learn' ,'pandas' ,'numpy'], 72 | materialized = "table", 73 | tags = "predict", 74 | use_anonymous_sproc=True 75 | ) 76 | session._use_scoped_temp_objects = False 77 | download_models_and_libs_from_stage(session) 78 | predictor = load_model(session) 79 | predict_position_udf = register_udf_for_prediction(predictor, session ,dbt) 80 | 81 | # Retrieve the data, and perform the prediction 82 | hold_out_df = (dbt.ref("hold_out_dataset_for_prediction") 83 | .select(*FEATURE_COLS) 84 | ) 85 | trained_model_file = dbt.ref("train_model_to_predict_position") 86 | 87 | # Perform prediction. 88 | new_predictions_df = hold_out_df.withColumn("position_predicted" 89 | ,predict_position_udf(*FEATURE_COLS) 90 | ) 91 | 92 | return new_predictions_df -------------------------------------------------------------------------------- /models/ml/training_and_prediction/train_model_to_predict_position.py: -------------------------------------------------------------------------------- 1 | import snowflake.snowpark.functions as F 2 | from sklearn.model_selection import train_test_split 3 | import pandas as pd 4 | from sklearn.metrics import confusion_matrix, balanced_accuracy_score 5 | import io 6 | from sklearn.linear_model import LogisticRegression 7 | from joblib import dump, load 8 | import joblib 9 | import logging 10 | import sys 11 | from joblib import dump, load 12 | 13 | logger = logging.getLogger("mylog") 14 | 15 | def save_file(session, model, path, dest_filename): 16 | input_stream = io.BytesIO() 17 | joblib.dump(model, input_stream) 18 | session._conn.upload_stream(input_stream, path, dest_filename) 19 | return "successfully created file: " + path 20 | 21 | def model(dbt, session): 22 | dbt.config( 23 | packages = ['numpy','scikit-learn','pandas','numpy','joblib','cachetools'], 24 | materialized = "table", 25 | tags = "train" 26 | ) 27 | # Create a stage in Snowflake to save our model file 28 | session.sql('create or replace stage MODELSTAGE').collect() 29 | 30 | #session._use_scoped_temp_objects = False 31 | version = "1.0" 32 | logger.info('Model training version: ' + version) 33 | 34 | # read in our training and testing upstream dataset 35 | test_train_df = dbt.ref("training_testing_dataset") 36 | 37 | # cast snowpark df to pandas df 38 | test_train_pd_df = test_train_df.to_pandas() 39 | target_col = "POSITION_LABEL" 40 | 41 | # split out covariate predictors, x, from our target column position_label, y. 42 | split_X = test_train_pd_df.drop([target_col], axis=1) 43 | split_y = test_train_pd_df[target_col] 44 | 45 | # Split out our training and test data into proportions 46 | X_train, X_test, y_train, y_test = train_test_split(split_X, split_y, train_size=0.7, random_state=42) 47 | train = [X_train, y_train] 48 | test = [X_test, y_test] 49 | # now we are only training our one model to deploy 50 | # we are keeping the focus on the workflows and not algorithms for this lab! 51 | model = LogisticRegression() 52 | 53 | # fit the preprocessing pipeline and the model together 54 | model.fit(X_train, y_train) 55 | y_pred = model.predict_proba(X_test)[:,1] 56 | predictions = [round(value) for value in y_pred] 57 | balanced_accuracy = balanced_accuracy_score(y_test, predictions) 58 | 59 | # Save the model to a stage 60 | save_file(session, model, "@MODELSTAGE/driver_position_"+version, "driver_position_"+version+".joblib" ) 61 | logger.info('Model artifact:' + "@MODELSTAGE/driver_position_"+version+".joblib") 62 | 63 | # Take our pandas training and testing dataframes and put them back into snowpark dataframes 64 | snowpark_train_df = session.write_pandas(pd.concat(train, axis=1, join='inner'), "train_table", auto_create_table=True, create_temp_table=True) 65 | snowpark_test_df = session.write_pandas(pd.concat(test, axis=1, join='inner'), "test_table", auto_create_table=True, create_temp_table=True) 66 | 67 | # Union our training and testing data together and add a column indicating train vs test rows 68 | return snowpark_train_df.with_column("DATASET_TYPE", F.lit("train")).union(snowpark_test_df.with_column("DATASET_TYPE", F.lit("test"))) -------------------------------------------------------------------------------- /models/overview.md: -------------------------------------------------------------------------------- 1 | {% docs __overview__ %} 2 | 3 | ## About Dataset 4 | 5 | ### Context 6 | This is a dataset pulled from Kaggle called [Formula 1 World Championship (1950 - 2023)](https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020?select=status.csv). 7 | 8 | Formula 1 (a.k.a. F1 or Formula One) is the highest class of single-seater auto racing sanctioned by the Fédération Internationale de l'Automobile (FIA) and owned by the Formula One Group. The FIA Formula One World Championship has been one of the premier forms of racing around the world since its inaugural season in 1950. The word "formula" in the name refers to the set of rules to which all participants' cars must conform. A Formula One season consists of a series of races, known as Grands Prix, which take place worldwide on purpose-built circuits and on public roads. 9 | 10 | ### Content 11 | The dataset consists of all information on the Formula 1 races, drivers, constructors, qualifying, circuits, lap times, pit stops, championships from 1950 till the latest 2023 season. 12 | 13 | ### Acknowledgements 14 | The data is compiled from [Ergast](http://ergast.com/mrd/). 15 | 16 | ### Inspiration 17 | 18 | [![Alt text](https://images.hgmsites.net/hug/mercedes-benz_100691933_h.jpg)](https://images.hgmsites.net) 19 | 20 | *"Races are won at the track. Championships are won at the factory."* - Mercedes (2019) 21 | 22 | With the amount of data being captured, analyzed and used to design, build and drive the Formula 1 cars is astounding. It is a global sport being followed by millions of people worldwide and it is very fascinating to see drivers pushing their limit in these vehicles to become the fastest racers in the world! 23 | 24 | {% enddocs %} -------------------------------------------------------------------------------- /models/staging/formula1/f1_sources.yml: -------------------------------------------------------------------------------- 1 | version: 2 2 | 3 | sources: 4 | - name: formula1 5 | description: formula 1 datasets with normalized tables 6 | database: formula1 7 | schema: raw 8 | tables: 9 | 10 | - name: circuits 11 | description: One record per circuit, which is the specific race course. 12 | columns: 13 | - name: circuit_id 14 | tests: 15 | - unique 16 | - not_null 17 | description: Unique identifier for the circuit 18 | - name: circuit_ref 19 | description: Unique circuit reference 20 | - name: name 21 | description: Name of the circuit 22 | - name: location 23 | description: City or town where the circuit is located 24 | - name: country 25 | description: Country where the circuit is located 26 | - name: lat 27 | description: Latitude of the circuit 28 | - name: lng 29 | description: Longitude of the circuit 30 | - name: alt 31 | description: Altitude of the circuit in meters 32 | - name: url 33 | description: Wikipedia URL for the circuit 34 | 35 | 36 | - name: constructor_results 37 | description: This table contains race results for constructors. 38 | columns: 39 | - name: constructor_results_id 40 | description: Unique identifier for the constructor race result 41 | - name: race_id 42 | description: Unique identifier for the race 43 | - name: constructor_id 44 | description: Unique identifier for the constructor 45 | - name: points 46 | description: Points scored by the constructor in the race 47 | - name: status 48 | description: Overall status of the constructor's cars in the race 49 | 50 | 51 | - name: constructor_standings 52 | description: This table contains historical Formula 1 constructor standings. 53 | columns: 54 | - name: constructor_standings_id 55 | description: Unique identifier for the constructor standings entry 56 | - name: race_id 57 | description: Unique identifier for the race 58 | - name: constructor_id 59 | description: Unique identifier for the constructor 60 | - name: points 61 | description: Accumulated points for the constructor at this point in the season 62 | - name: position 63 | description: Current position of the constructor in the standings 64 | - name: wins 65 | description: Number of race wins for the constructor at this point in the season 66 | 67 | 68 | - name: constructors 69 | description: One record per constructor. Constructors are the teams that build their formula 1 cars. 70 | columns: 71 | - name: constructor_id 72 | description: Unique identifier for the constructor 73 | tests: 74 | - unique 75 | - not_null 76 | - name: constructor_ref 77 | description: Unique constructor reference 78 | - name: name 79 | description: Constructor name 80 | - name: nationality 81 | description: Constructor's nationality 82 | - name: url 83 | description: Wikipedia URL for the constructor 84 | 85 | - name: driver_standings 86 | description: This table contains historical Formula 1 driver standings. 87 | columns: 88 | - name: driver_standings_id 89 | description: Unique identifier for the driver standings entry 90 | - name: race_id 91 | description: Unique identifier for the race 92 | - name: driver_id 93 | description: Unique identifier for the driver 94 | - name: points 95 | description: Accumulated points for the driver at this point in the season 96 | - name: position 97 | description: Current position of the driver in the standings 98 | - name: wins 99 | description: Number of race wins for the driver at this point in the season 100 | 101 | 102 | - name: drivers 103 | description: One record per driver. This table gives details about the driver. 104 | columns: 105 | - name: driver_id 106 | tests: 107 | - unique 108 | - not_null 109 | description: Unique identifier for the driver 110 | - name: driver_ref 111 | description: Unique driver reference 112 | - name: number 113 | description: Driver number 114 | - name: code 115 | description: Driver code 116 | - name: forename 117 | description: Driver's first name 118 | - name: surname 119 | description: Driver's last name 120 | - name: dob 121 | description: Driver's date of birth 122 | - name: nationality 123 | description: Driver's nationality 124 | - name: url 125 | description: Wikipedia URL for the driver 126 | 127 | - name: lap_times 128 | description: One row per lap in each race. Lap times started being recorded in this dataset in 1984 and joined through driver_id. 129 | columns: 130 | - name: race_id 131 | description: Unique identifier for the race 132 | - name: driver_id 133 | description: Unique identifier for the driver 134 | - name: lap 135 | description: Lap number 136 | - name: position 137 | description: Position of the driver at the end of the lap 138 | - name: time 139 | description: Lap time in minutes, seconds, and milliseconds (MM:SS.mmm format) 140 | - name: milliseconds 141 | description: Lap time in milliseconds 142 | 143 | - name: pit_stops 144 | description: One row per pit stop. Pit stops do not have their own id column, the combination of the race_id and driver_id identify the pit stop. 145 | columns: 146 | - name: race_id 147 | description: Unique identifier for the race 148 | - name: driver_id 149 | description: Unique identifier for the driver 150 | - name: stop 151 | description: Sequential number of the pit stop for the driver in the race 152 | tests: 153 | - accepted_values: 154 | values: [1,2,3,4,5,6,7,8] 155 | quote: false 156 | - name: lap 157 | description: Lap number during which the pit stop occurred 158 | - name: time 159 | description: Time of the day when the pit stop took place (HH:MM:SS format) 160 | - name: duration 161 | description: Duration of the pit stop in seconds (SS.mmm format) 162 | - name: milliseconds 163 | description: Duration of the pit stop in milliseconds 164 | 165 | - name: qualifying 166 | description: This table contains qualifying session results for each driver. 167 | columns: 168 | - name: qualifying_id 169 | description: Unique identifier for the qualifying result 170 | - name: race_id 171 | description: Unique identifier for the race 172 | - name: driver_id 173 | description: Unique identifier for the driver 174 | - name: constructor_id 175 | description: Unique identifier for the constructor 176 | - name: number 177 | description: Car number 178 | - name: position 179 | description: Final qualifying position 180 | - name: q1 181 | description: Qualifying 1 time (MM:SS.mmm format) 182 | - name: q2 183 | description: Qualifying 2 time (MM:SS.mmm format), if applicable 184 | - name: q3 185 | description: Qualifying 3 time (MM:SS.mmm format), if applicable 186 | 187 | - name: races 188 | description: One race per row. Importantly this table contains the race year to understand trends. 189 | columns: 190 | - name: race_id 191 | tests: 192 | - unique 193 | - not_null 194 | description: Unique identifier for the race 195 | - name: year 196 | description: Year the race took place 197 | - name: round 198 | description: Round number in the championship 199 | - name: circuit_id 200 | description: Unique identifier for the circuit 201 | - name: name 202 | description: Name of the race 203 | - name: date 204 | description: Date of the race 205 | - name: time 206 | description: Time of the race 207 | - name: url 208 | description: Wikipedia URL for the race 209 | 210 | - name: results 211 | description: One row per result. The main table that we join out for grid and position variables. 212 | columns: 213 | - name: result_id 214 | description: Unique identifier for the race result 215 | tests: 216 | - unique 217 | - not_null 218 | - name: race_id 219 | description: Unique identifier for the race 220 | - name: driver_id 221 | description: Unique identifier for the driver 222 | - name: constructor_id 223 | description: Unique identifier for the constructor 224 | - name: number 225 | description: Car number 226 | - name: grid 227 | description: Starting grid position 228 | - name: position 229 | description: Finishing position 230 | - name: position_text 231 | description: Text representation of finishing position 232 | - name: position_order 233 | description: Order of finishing position 234 | - name: points 235 | description: Points scored in the race 236 | - name: laps 237 | description: Number of laps completed 238 | - name: time 239 | description: Time taken to complete the race 240 | - name: milliseconds 241 | description: Time taken to complete the race in milliseconds 242 | - name: fastest_lap 243 | description: Lap number of the driver's fastest lap 244 | - name: rank 245 | description: Rank of the driver's fastest lap time 246 | - name: fastest_lap_time 247 | description: Time of the driver's fastest lap 248 | - name: fastest_lap_speed 249 | description: Speed of the driver's fastest lap 250 | - name: status_id 251 | description: Unique identifier for the driver's race status 252 | 253 | - name: seasons 254 | description: This table contains information about Formula 1 seasons. 255 | columns: 256 | - name: year 257 | description: The year of the season 258 | - name: url 259 | description: Wikipedia URL for the season 260 | 261 | - name: sprint_results 262 | description: This table contains sprint race results for drivers and constructors. 263 | columns: 264 | - name: sprint_result_id 265 | description: Unique identifier for the sprint race result 266 | - name: race_id 267 | description: Unique identifier for the race 268 | - name: driver_id 269 | description: Unique identifier for the driver 270 | - name: constructor_id 271 | description: Unique identifier for the constructor 272 | - name: grid 273 | description: Starting grid position 274 | - name: position 275 | description: Finishing position 276 | - name: position_order 277 | description: Order of finishing position 278 | - name: points 279 | description: Points scored in the sprint race 280 | - name: laps 281 | description: Number of laps completed 282 | - name: time 283 | description: Time taken to complete the sprint race 284 | - name: milliseconds 285 | description: Time taken to complete the sprint race in milliseconds 286 | - name: status_id 287 | description: Unique identifier for the driver's sprint race status 288 | 289 | - name: status 290 | description: One status per row. The status contextualizes whether the race was finished or what issues arose e.g. collisions, engine, etc. 291 | columns: 292 | - name: status_id 293 | tests: 294 | - unique 295 | - not_null 296 | description: Unique identifier for the status 297 | - name: status 298 | description: Description of the status, explaining the reason for the status (e.g., finished, retired, disqualified) -------------------------------------------------------------------------------- /models/staging/formula1/stage_model_descriptions.md: -------------------------------------------------------------------------------- 1 | {% docs stg_circuit_desc %} 2 | [![Spa](https://www.spa-francorchamps.be/sites/default/files/image/16.03.2022_logo_05.jpg)](https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.spa-francorchamps.be) 3 | 4 | 5 | This is the staging model which has dimensional data for each F1 circuit. 6 | 7 | {% enddocs %} 8 | 9 | {% docs stg_constructors_desc %} 10 | [![Constructors](https://www.grandprix247.com/wp-content/uploads/2020/11/pic1.jpg)](https://www.grandprix247.com/wp-content/uploads/2020/11/pic1.jpg) 11 | 12 | 13 | This is the staging model for constructor dimensional data. 14 | 15 | {% enddocs %} -------------------------------------------------------------------------------- /models/staging/formula1/staging.yml: -------------------------------------------------------------------------------- 1 | version: 2 2 | 3 | exposures: 4 | 5 | - name: Formula1 6 | label: F1 7 | type: analysis 8 | maturity: high 9 | url: https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020 10 | description: > 11 | This exposure is to document the source of the data. 12 | owner: 13 | name: Ian Cooper 14 | email: ian.cooper@dbtlabs.com 15 | 16 | models: 17 | - name: stg_circuits 18 | description: '{{ doc("stg_circuit_desc") }}' 19 | columns: 20 | - name: circuit_id 21 | description: "The primary key for this table, each circuit is distinct" 22 | tests: 23 | - unique 24 | - not_null 25 | - name: circuit_ref 26 | description: '{{ doc("circuit_ref_def") }}' 27 | - name: circuit_name 28 | description: '{{ doc("circuit_name_def") }}' 29 | - name: circuit_location 30 | description: '{{ doc("circuit_location_def") }}' 31 | - name: circuit_country 32 | description: '{{ doc("circuit_country_def") }}' 33 | - name: latitude 34 | description: '{{ doc("circuit_lat_def") }}' 35 | - name: longitude 36 | description: '{{ doc("circuit_lng_def") }}' 37 | - name: altitude 38 | description: '{{ doc("circuit_alt_def") }}' 39 | - name: circuit_url 40 | description: '{{ doc("circuit_url_def") }}' 41 | 42 | - name: stg_constructor_results 43 | description: "Fact data related to constructor (ie brand) results. This is at a race level but race info isn't in this model, just the race_id." 44 | columns: 45 | - name: constructor_results_id 46 | description: "The primary key for this table, result has its own id." 47 | tests: 48 | - unique 49 | - not_null 50 | - name: race_id 51 | description: '{{ doc("race_id_def") }}' 52 | - name: constructor_id 53 | description: '{{ doc("constructor_id_def") }}' 54 | - name: constructor_points 55 | description: '{{ doc("constructor_points_def") }}' 56 | - name: status 57 | description: '{{ doc("constructor_status_def") }}' 58 | 59 | - name: stg_constructor_standings 60 | description: "Fact data related to constructor (ie brand) results. This is at a race level but race info isn't in this model, just the race_id." 61 | columns: 62 | - name: constructor_standings_id 63 | description: "The primary key for this table, result has its own id." 64 | tests: 65 | - unique 66 | - not_null 67 | - name: race_id 68 | description: '{{ doc("race_id_def") }}' 69 | - name: constructor_id 70 | description: '{{ doc("constructor_id_def") }}' 71 | - name: points 72 | description: '{{ doc("constructor_points_def") }}' 73 | - name: constructor_position 74 | description: '{{ doc("constructor_status_def") }}' 75 | - name: position_text 76 | description: '{{ doc("constructor_points_def") }}' 77 | - name: wins 78 | description: '{{ doc("constructor_wins_def") }}' 79 | 80 | - name: stg_constructors 81 | description: '{{ doc("stg_constructors_desc") }}' 82 | columns: 83 | - name: constructor_id 84 | description: "The primary key for this table, each team has its own id." 85 | tests: 86 | - unique 87 | - not_null 88 | - name: constructor_ref 89 | description: '{{ doc("constructor_ref_def") }}' 90 | - name: constructor_name 91 | description: '{{ doc("constructor_name_def") }}' 92 | - name: constructor_nationality 93 | description: '{{ doc("constructor_nationality_def") }}' 94 | - name: constructor_url 95 | description: '{{ doc("constructor_url_def") }}' 96 | 97 | - name: stg_driver_standings 98 | description: "Fact data for driver standings by race." 99 | columns: 100 | - name: driver_standings_id 101 | description: "The primary key for this table." 102 | tests: 103 | - unique 104 | - not_null 105 | - name: race_id 106 | description: '{{ doc("race_id_def") }}' 107 | - name: driver_id 108 | description: '{{ doc("driver_id_def") }}' 109 | - name: driver_code 110 | description: '{{ doc("driver_code_def") }}' 111 | - name: driver_points 112 | description: '{{ doc("driver_points_def") }}' 113 | - name: driver_position 114 | description: '{{ doc("driver_position_def") }}' 115 | - name: position_text 116 | description: '{{ doc("position_text_def") }}' 117 | - name: driver_wins 118 | description: '{{ doc("driver_wins_def") }}' 119 | 120 | - name: stg_drivers 121 | description: "Dimensional model stage for driver attributes." 122 | columns: 123 | - name: driver_id 124 | description: "The primary key for this table." 125 | tests: 126 | - unique 127 | - not_null 128 | - name: driver_ref 129 | description: '{{ doc("driver_ref_def") }}' 130 | - name: driver_number 131 | description: '{{ doc("driver_number_def") }}' 132 | - name: forename 133 | description: '{{ doc("driver_forename_def") }}' 134 | - name: surname 135 | description: '{{ doc("driver_surname_def") }}' 136 | - name: date_of_birth 137 | description: '{{ doc("driver_dob_def") }}' 138 | - name: driver_nationality 139 | description: '{{ doc("driver_nationality_def") }}' 140 | - name: driver_url 141 | description: '{{ doc("driver_url_def") }}' 142 | 143 | - name: stg_lap_times 144 | description: "Fact data around every lap." 145 | columns: 146 | - name: lap_times_id 147 | description: "The primary key for this table." 148 | tests: 149 | - unique 150 | - not_null 151 | - name: race_id 152 | description: '{{ doc("race_id_def") }}' 153 | - name: driver_id 154 | description: '{{ doc("driver_id_def") }}' 155 | - name: lap 156 | description: '{{ doc("laps_def") }}' 157 | - name: driver_position 158 | description: '{{ doc("driver_position_def") }}' 159 | - name: lap_time_formatted 160 | description: '{{ doc("lap_time_def") }}' 161 | - name: official_laptime 162 | description: '{{ doc("official_lap_time_def") }}' 163 | - name: lap_time_milliseconds 164 | description: '{{ doc("lap_time_milliseconds_def") }}' 165 | 166 | - name: stg_pit_stops 167 | description: "Fact data around pitstop times." 168 | columns: 169 | - name: pitstop_id 170 | description: "The primary key for this table." 171 | tests: 172 | - unique 173 | - not_null 174 | - name: race_id 175 | description: '{{ doc("race_id_def") }}' 176 | - name: driver_id 177 | description: '{{ doc("driver_id_def") }}' 178 | - name: stop_number 179 | description: '{{ doc("laps_def") }}' 180 | - name: lap 181 | description: '{{ doc("stop_number_def") }}' 182 | - name: pit_stop_time 183 | description: '{{ doc("pit_stop_time_def") }}' 184 | - name: pit_stop_duration_seconds 185 | description: '{{ doc("pit_stop_duration_seconds_def") }}' 186 | - name: pit_stop_duration 187 | description: '{{ doc("pit_stop_duration_def") }}' 188 | - name: pit_stop_milliseconds 189 | description: '{{ doc("pit_stop_milliseconds_def") }}' 190 | 191 | - name: stg_qualifying 192 | description: "Fact data around qualifying times." 193 | columns: 194 | - name: qualifying_id 195 | description: "The primary key for this table." 196 | tests: 197 | - unique 198 | - not_null 199 | - name: race_id 200 | description: '{{ doc("race_id_def") }}' 201 | - name: driver_id 202 | description: '{{ doc("driver_id_def") }}' 203 | - name: constructor_id 204 | description: '{{ doc("constructor_id_def") }}' 205 | - name: driver_number 206 | description: '{{ doc("driver_number_def") }}' 207 | - name: qualifying_position 208 | description: '{{ doc("qualifying_position_def") }}' 209 | - name: q1_time 210 | description: '{{ doc("q1_time_def") }}' 211 | - name: q2_time 212 | description: '{{ doc("q2_time_def") }}' 213 | - name: q3_time 214 | description: '{{ doc("q3_time_def") }}' 215 | 216 | - name: stg_races 217 | description: "Fact data around qualifying times." 218 | columns: 219 | - name: race_id 220 | description: "The primary key for this table." 221 | tests: 222 | - unique 223 | - not_null 224 | - name: race_year 225 | description: '{{ doc("race_year_def") }}' 226 | - name: race_round 227 | description: '{{ doc("race_round_def") }}' 228 | - name: circuit_id 229 | description: '{{ doc("circuit_id_def") }}' 230 | - name: race_name 231 | description: '{{ doc("race_name_def") }}' 232 | - name: race_date 233 | description: '{{ doc("race_date_def") }}' 234 | - name: race_time 235 | description: '{{ doc("race_time_def") }}' 236 | - name: race_url 237 | description: '{{ doc("race_url_def") }}' 238 | - name: fp1_date 239 | description: '{{ doc("fp1_date_def") }}' 240 | - name: fp1_time 241 | description: '{{ doc("fp1_time_def") }}' 242 | - name: fp2_date 243 | description: '{{ doc("fp2_date_def") }}' 244 | - name: fp2_time 245 | description: '{{ doc("fp2_time_def") }}' 246 | - name: fp3_date 247 | description: '{{ doc("fp3_date_def") }}' 248 | - name: fp3_time 249 | description: '{{ doc("fp3_time_def") }}' 250 | - name: quali_date 251 | description: '{{ doc("quali_date_def") }}' 252 | - name: quali_time 253 | description: '{{ doc("quali_time_def") }}' 254 | - name: sprint_date 255 | description: '{{ doc("sprint_date_def") }}' 256 | - name: sprint_time 257 | description: '{{ doc("sprint_time_def") }}' 258 | 259 | - name: stg_results 260 | description: "Fact data around qualifying times." 261 | columns: 262 | - name: result_id 263 | description: "The primary key for this table." 264 | tests: 265 | - unique 266 | - not_null 267 | - name: race_id 268 | description: '{{ doc("race_id_def") }}' 269 | - name: driver_id 270 | description: '{{ doc("driver_id_def") }}' 271 | - name: constructor_id 272 | description: '{{ doc("constructor_id_def") }}' 273 | - name: driver_number 274 | description: '{{ doc("driver_number_def") }}' 275 | - name: grid 276 | description: '{{ doc("grid_def") }}' 277 | - name: driver_position 278 | description: '{{ doc("driver_position_def") }}' 279 | - name: position_text 280 | description: '{{ doc("position_text_def") }}' 281 | - name: position_order 282 | description: '{{ doc("position_order_def") }}' 283 | - name: points 284 | description: '{{ doc("points_def") }}' 285 | - name: laps 286 | description: '{{ doc("laps_def") }}' 287 | - name: race_time 288 | description: '{{ doc("race_time_def") }}' 289 | - name: milliseconds 290 | description: '{{ doc("results_milliseconds_def") }}' 291 | - name: fastest_lap 292 | description: '{{ doc("fastest_lap_def") }}' 293 | - name: driver_rank 294 | description: '{{ doc("driver_rank_def") }}' 295 | - name: fastest_lap_time 296 | description: '{{ doc("fastest_lap_time_def") }}' 297 | - name: fastest_lap_speed 298 | description: '{{ doc("fastest_lap_speed_def") }}' 299 | - name: status_id 300 | description: '{{ doc("status_id_def") }}' 301 | 302 | 303 | - name: stg_seasons 304 | description: "Season information" 305 | columns: 306 | - name: season_year 307 | description: "The primary key for this table." 308 | tests: 309 | - unique 310 | - not_null 311 | - name: season_url 312 | description: "The Wiki URL for F1 Season" 313 | 314 | - name: stg_sprint_results 315 | description: "Fact data around qualifying times." 316 | columns: 317 | - name: result_id 318 | description: "The primary key for this table." 319 | tests: 320 | - unique 321 | - not_null 322 | - name: race_id 323 | description: '{{ doc("race_id_def") }}' 324 | - name: driver_id 325 | description: '{{ doc("driver_id_def") }}' 326 | - name: constructor_id 327 | description: '{{ doc("constructor_id_def") }}' 328 | - name: driver_number 329 | description: '{{ doc("driver_number_def") }}' 330 | - name: grid 331 | description: '{{ doc("grid_def") }}' 332 | - name: driver_position 333 | description: '{{ doc("driver_position_def") }}' 334 | - name: position_text 335 | description: '{{ doc("position_text_def") }}' 336 | - name: position_order 337 | description: '{{ doc("position_order_def") }}' 338 | - name: points 339 | description: '{{ doc("points_def") }}' 340 | - name: laps 341 | description: '{{ doc("laps_def") }}' 342 | - name: sprint_time 343 | description: '{{ doc("sprint_time_def") }}' 344 | - name: milliseconds 345 | description: '{{ doc("results_milliseconds_def") }}' 346 | - name: fastest_lap 347 | description: '{{ doc("fastest_lap_def") }}' 348 | - name: fastest_lap_time 349 | description: '{{ doc("fastest_lap_time_def") }}' 350 | - name: status_id 351 | description: '{{ doc("status_id_def") }}' 352 | 353 | 354 | - name: stg_status 355 | description: "Status codes mapping." 356 | columns: 357 | - name: status_id 358 | description: "The primary key for this table." 359 | tests: 360 | - unique 361 | - not_null 362 | - name: status 363 | description: '{{ doc("status_def") }}' -------------------------------------------------------------------------------- /models/staging/formula1/stg_circuits.sql: -------------------------------------------------------------------------------- 1 | with 2 | circuits as (select * from {{ source('formula1', 'circuits') }}), 3 | renamed as ( 4 | select 5 | circuit_id as circuit_id, 6 | circuit_ref as circuit_ref, 7 | name as circuit_name, 8 | location as circuit_location, 9 | country as circuit_country, 10 | lat as latitude, 11 | lng as longitude, 12 | to_number(iff(contains(alt, 'N'), null, alt)) as altitude, 13 | url as circuit_url 14 | from circuits 15 | ) 16 | select * 17 | from renamed 18 | -------------------------------------------------------------------------------- /models/staging/formula1/stg_constructor_results.sql: -------------------------------------------------------------------------------- 1 | with 2 | constructor_results as ( 3 | select * from {{ source('formula1', 'constructor_results') }} 4 | ), 5 | renamed as ( 6 | select 7 | constructor_results_id as constructor_results_id, 8 | race_id as race_id, 9 | constructor_id as constructor_id, 10 | points as constructor_points, 11 | iff(contains(status, '\N'), null, status) as status 12 | from constructor_results 13 | ) 14 | select * 15 | from renamed 16 | -------------------------------------------------------------------------------- /models/staging/formula1/stg_constructor_standings.sql: -------------------------------------------------------------------------------- 1 | with 2 | constructor_standings as ( 3 | select * from {{ source('formula1', 'constructor_standings') }} 4 | ), 5 | renamed as ( 6 | select 7 | constructor_standings_id as constructor_standings_id, 8 | race_id as race_id, 9 | constructor_id as constructor_id, 10 | points as points, 11 | "POSITION" as constructor_position, 12 | position_text as position_text, 13 | wins as wins 14 | from constructor_standings 15 | ) 16 | select * 17 | from renamed 18 | -------------------------------------------------------------------------------- /models/staging/formula1/stg_constructors.sql: -------------------------------------------------------------------------------- 1 | with 2 | 3 | constructors as (select * from {{ source('formula1', 'constructors') }}), 4 | 5 | renamed as ( 6 | select 7 | constructor_id as constructor_id, 8 | constructor_ref as constructor_ref, 9 | name as constructor_name, 10 | nationality as constructor_nationality, 11 | url as constructor_url 12 | from constructors 13 | ) 14 | 15 | select * 16 | from renamed 17 | -------------------------------------------------------------------------------- /models/staging/formula1/stg_driver_standings.sql: -------------------------------------------------------------------------------- 1 | with 2 | 3 | driver_standings as (select * from {{ source('formula1', 'driver_standings') }}), 4 | 5 | renamed as ( 6 | select 7 | driver_standings_id as driver_standings_id, 8 | race_id as race_id, 9 | driver_id as driver_id, 10 | points as driver_points, 11 | "POSITION" as driver_position, 12 | position_text as position_text, 13 | wins as driver_wins 14 | from driver_standings 15 | ) 16 | 17 | select * 18 | from renamed 19 | -------------------------------------------------------------------------------- /models/staging/formula1/stg_drivers.sql: -------------------------------------------------------------------------------- 1 | with 2 | 3 | drivers as (select * from {{ source('formula1', 'drivers') }}), 4 | 5 | renamed as ( 6 | select 7 | driver_id as driver_id, 8 | driver_ref as driver_ref, 9 | number as driver_number, 10 | code as driver_code, 11 | forename, 12 | surname, 13 | dob as date_of_birth, 14 | nationality as driver_nationality, 15 | url as driver_url 16 | from drivers 17 | ) 18 | 19 | select * 20 | from renamed 21 | -------------------------------------------------------------------------------- /models/staging/formula1/stg_lap_times.sql: -------------------------------------------------------------------------------- 1 | with 2 | 3 | lap_times as (select * from {{ source('formula1', 'lap_times') }}), 4 | 5 | renamed as ( 6 | select 7 | race_id as race_id, 8 | driver_id as driver_id, 9 | lap, 10 | "POSITION" as driver_position, 11 | "TIME" as lap_time_formatted, 12 | {{ convert_laptime("lap_time_formatted") }} as official_laptime, 13 | milliseconds as lap_time_milliseconds 14 | from lap_times 15 | ) 16 | 17 | select 18 | {{ dbt_utils.generate_surrogate_key(["race_id", "driver_id", "lap"]) }} 19 | as lap_times_id, 20 | * 21 | from renamed 22 | -------------------------------------------------------------------------------- /models/staging/formula1/stg_pit_stops.sql: -------------------------------------------------------------------------------- 1 | with 2 | 3 | pit_stops as (select * from {{ source('formula1', 'pit_stops') }}), 4 | 5 | renamed as ( 6 | select 7 | race_id as race_id, 8 | driver_id as driver_id, 9 | stop as stop_number, 10 | lap, 11 | "TIME" as pit_stop_time, 12 | duration as pit_stop_duration_seconds, 13 | {{ convert_laptime("pit_stop_duration_seconds") }} as pit_stop_duration, 14 | milliseconds as pit_stop_milliseconds 15 | from pit_stops 16 | ) 17 | 18 | select 19 | {{ dbt_utils.generate_surrogate_key(["race_id", "driver_id", "stop_number"]) }} 20 | as pitstop_id, 21 | * 22 | from renamed 23 | -------------------------------------------------------------------------------- /models/staging/formula1/stg_qualifying.sql: -------------------------------------------------------------------------------- 1 | with 2 | qualifying as (select * from {{ source("formula1", "qualifying") }}), 3 | 4 | renamed as ( 5 | select 6 | qualifying_id as qualifying_id, 7 | race_id as race_id, 8 | driver_id as driver_id, 9 | constructor_id as constructor_id, 10 | number as driver_number, 11 | "POSITION" as qualifying_position, 12 | iff(contains(q1, '\N'), null, {{ convert_laptime("q1") }}) as q1_time, 13 | iff(contains(q2, '\N'), null, {{ convert_laptime("q2") }}) as q2_time, 14 | iff(contains(q3, '\N'), null, {{ convert_laptime("q3") }}) as q3_time 15 | from qualifying 16 | ) 17 | 18 | select * 19 | from renamed 20 | -------------------------------------------------------------------------------- /models/staging/formula1/stg_races.sql: -------------------------------------------------------------------------------- 1 | with 2 | 3 | races as ( 4 | 5 | select * from {{ source('formula1','races') }} 6 | 7 | ), 8 | 9 | renamed as ( 10 | select 11 | race_id as race_id, 12 | "YEAR" as race_year, 13 | "ROUND" as race_round, 14 | circuit_id as circuit_id, 15 | name as race_name, 16 | "DATE" as race_date, 17 | "TIME" as race_time, 18 | url as race_url, 19 | fp1_date, 20 | fp1_time, 21 | fp2_date, 22 | fp2_time, 23 | fp3_date, 24 | fp3_time, 25 | quali_date, 26 | quali_time, 27 | sprint_date, 28 | sprint_time 29 | from races 30 | ) 31 | 32 | select * from renamed -------------------------------------------------------------------------------- /models/staging/formula1/stg_results.sql: -------------------------------------------------------------------------------- 1 | with 2 | 3 | results as (select * from {{ source('formula1', 'results') }}), 4 | 5 | renamed as ( 6 | select 7 | result_id as result_id, 8 | race_id as race_id, 9 | driver_id as driver_id, 10 | constructor_id as constructor_id, 11 | number as driver_number, 12 | grid, 13 | --position::int as position, 14 | iff(contains(position, '\N'), null, position) as driver_position, 15 | position_text as position_text, 16 | position_order as position_order, 17 | points, 18 | laps, 19 | iff(contains("TIME", '\N'), null, "TIME") as race_time, 20 | iff(contains(milliseconds, '\N'), null, milliseconds) as milliseconds, 21 | iff(contains(fastest_lap, '\N'), null, fastest_lap) as fastest_lap, 22 | "RANK" as driver_rank, 23 | iff(contains(fastest_lap_time, '\N'),null,{{ convert_laptime("fastest_lap_time") }}) as fastest_lap_time, 24 | iff(contains(fastest_lap_speed, '\N'), null, fastest_lap_speed) as fastest_lap_speed, 25 | status_id as status_id 26 | from results 27 | ) 28 | 29 | select * 30 | from renamed 31 | -------------------------------------------------------------------------------- /models/staging/formula1/stg_seasons.sql: -------------------------------------------------------------------------------- 1 | with 2 | seasons as (select * from {{ source('formula1', 'seasons') }}), 3 | renamed as ( 4 | select 5 | -- need surogate key with race_id, driver_id and stop 6 | "YEAR" as season_year, url as season_url 7 | from seasons 8 | ) 9 | select * 10 | from renamed 11 | -------------------------------------------------------------------------------- /models/staging/formula1/stg_sprint_results.sql: -------------------------------------------------------------------------------- 1 | with sprint_results as (select * from {{ source('formula1', 'sprint_results') }}), 2 | 3 | renamed as ( 4 | select 5 | result_id as result_id, 6 | race_id as race_id, 7 | driver_id as driver_id, 8 | constructor_id as constructor_id, 9 | number as driver_number, 10 | grid as grid, 11 | iff(contains("POSITION", '\N'), null, "POSITION") as driver_position, 12 | position_text as position_text, 13 | position_order as position_order, 14 | points as points, 15 | laps as laps, 16 | iff(contains("TIME", '\N'), null, "TIME") as sprint_time, 17 | iff(contains(milliseconds, '\N'), null, milliseconds) as milliseconds, 18 | iff(contains(fastest_lap, '\N'), null, fastest_lap) as fastest_lap, 19 | iff( 20 | contains(fastest_lap_time, '\N'), 21 | null, 22 | {{ convert_laptime("fastest_lap_time") }} 23 | ) as fastest_lap_time, 24 | status_id as status_id 25 | from sprint_results 26 | ) 27 | 28 | select * 29 | from renamed 30 | -------------------------------------------------------------------------------- /models/staging/formula1/stg_status.sql: -------------------------------------------------------------------------------- 1 | WITH status AS ( 2 | 3 | SELECT * FROM {{ source('formula1','status') }} 4 | 5 | ), 6 | 7 | renamed AS ( 8 | SELECT 9 | status_id AS status_id, 10 | status 11 | FROM status 12 | ) 13 | 14 | SELECT * FROM renamed -------------------------------------------------------------------------------- /packages.yml: -------------------------------------------------------------------------------- 1 | packages: 2 | - package: dbt-labs/dbt_utils 3 | version: 1.0.0 -------------------------------------------------------------------------------- /seeds/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dbt-labs/python-snowpark-formula1/b6caebae7914a119c66ac16a4430f5e10faf41a8/seeds/.gitkeep -------------------------------------------------------------------------------- /setup/existing_account_snowflake_grants.sql: -------------------------------------------------------------------------------- 1 | /* 2 | If you already have an existing account or accidentally drop your original database, you might need to grant or regrant permissions. 3 | This file is here in the project for these circumstances. A great video on this setup can be found at: https://www.youtube.com/watch?v=kbCkwhySV_I. 4 | The code below assumes that the tranformer role and transforming warehouse have already been created in Snowflake. 5 | */ 6 | 7 | /* 8 | Giving access to database, schema, and all tables within the schema to our transformer role. 9 | The permissions here are for the raw data that we read in. 10 | */ 11 | grant usage on database formula1 to role transformer; 12 | grant usage on schema formula1.raw to role transformer; 13 | grant select on all tables in schema formula1.raw to role transformer; 14 | 15 | /* 16 | Giving access to database, schema, and all tables within the schema to our transformer role. 17 | The permissions here are for database and schema for the transformed data that we will write to creating new schemas and tables. 18 | */ 19 | grant usage on database analytics to role transformer; 20 | grant modify on database analytics to role transformer; 21 | grant monitor on database analytics to role transformer; 22 | grant create schema on database analytics to role transformer; 23 | 24 | 25 | /* 26 | Give permission for our transformer role to use the warehouse to run our compute and commands on. 27 | */ 28 | grant operate on warehouse transforming to role transformer; 29 | grant usage on warehouse transforming to role transformer; -------------------------------------------------------------------------------- /setup/pc_snowflake_grants.sql: -------------------------------------------------------------------------------- 1 | /* 2 | If you accidentally drop your original database after setting up partner connect, you might need to grant or regrant permissions to build your dbt models. 3 | This file is here in the project for these circumstances. A great video on this setup can be found at: https://www.youtube.com/watch?v=kbCkwhySV_I. 4 | The code below assumes that the PC_DBT_ROLE role and PC_DBT_WH warehouse have already been created in Snowflake as part of the partner connect process. 5 | */ 6 | 7 | /* 8 | Giving access to database, schema, and all tables within the schema to our transformer role. 9 | The permissions here are for the raw data that we read in. 10 | */ 11 | grant usage on database formula1 to role PC_DBT_ROLE; 12 | grant usage on schema formula1.raw to role PC_DBT_ROLE; 13 | grant select on all tables in schema formula1.raw to role PC_DBT_ROLE; 14 | 15 | /* 16 | Giving access to database, schema, and all tables within the schema to our PC_DBT_ROLE role. 17 | The permissions here are for database and schema for the transformed data that we will write to creating new schemas and tables. 18 | */ 19 | grant usage on database PC_DBT_DB to role PC_DBT_ROLE; 20 | grant modify on database PC_DBT_DB to role PC_DBT_ROLE; 21 | grant monitor on database PC_DBT_DB to role PC_DBT_ROLE; 22 | grant create schema on database PC_DBT_DB to role PC_DBT_ROLE; 23 | 24 | 25 | /* 26 | Give permission for our PC_DBT_ROLE role to use the PC_DBT_WH warehouse to run our compute and commands on. 27 | */ 28 | grant operate on warehouse PC_DBT_WH to role PC_DBT_ROLE; 29 | grant usage on warehouse PC_DBT_WH to role PC_DBT_ROLE; -------------------------------------------------------------------------------- /setup/setup_script_s3_to_snowflake.sql: -------------------------------------------------------------------------------- 1 | /* 2 | This is our setup script to create a new database for the Formula1 data in Snowflake. 3 | We are copying data from a public s3 bucket into snowflake by defining our csv format and snowflake stage. 4 | */ 5 | -- create and define our formula1 database 6 | create or replace database formula1; 7 | use database formula1; 8 | create or replace schema raw; 9 | use schema raw; 10 | 11 | --define our file format for reading in the csvs 12 | create or replace file format csvformat 13 | type = csv 14 | field_delimiter =',' 15 | field_optionally_enclosed_by = '"', 16 | skip_header=1; 17 | 18 | -- 19 | create or replace stage formula1_stage 20 | file_format = csvformat 21 | url = 's3://formula1-dbt-cloud-python-demo/formula1-kaggle-data/'; 22 | 23 | -- load in the 8 tables we need for our demo 24 | -- we are first creating the table then copying our data in from s3 25 | -- think of this as an empty container or shell that we are then filling 26 | 27 | --CIRCUITS 28 | create or replace table formula1.raw.circuits ( 29 | CIRCUIT_ID NUMBER(38,0), 30 | CIRCUIT_REF VARCHAR(16777216), 31 | NAME VARCHAR(16777216), 32 | LOCATION VARCHAR(16777216), 33 | COUNTRY VARCHAR(16777216), 34 | LAT FLOAT, 35 | LNG FLOAT, 36 | ALT NUMBER(38,0), 37 | URL VARCHAR(16777216) 38 | ); 39 | -- copy our data from public s3 bucket into our tables 40 | copy into circuits 41 | from @formula1_stage/circuits.csv 42 | on_error='continue'; 43 | 44 | --CONSTRUCTOR RESULTS 45 | create or replace table formula1.raw.constructor_results ( 46 | CONSTRUCTOR_RESULTS_ID NUMBER(38,0), 47 | RACE_ID NUMBER(38,0), 48 | CONSTRUCTOR_ID NUMBER(38,0), 49 | POINTS NUMBER(38,0), 50 | STATUS VARCHAR(16777216) 51 | ); 52 | copy into constructor_results 53 | from @formula1_stage/constructor_results.csv 54 | on_error='continue'; 55 | 56 | --CONSTRUCTOR STANDINGS 57 | create or replace table formula1.raw.constructor_standings ( 58 | CONSTRUCTOR_STANDINGS_ID NUMBER(38,0), 59 | RACE_ID NUMBER(38,0), 60 | CONSTRUCTOR_ID NUMBER(38,0), 61 | POINTS NUMBER(38,0), 62 | POSITION FLOAT, 63 | POSITION_TEXT VARCHAR(16777216), 64 | WINS NUMBER(38,0) 65 | ); 66 | copy into constructor_standings 67 | from @formula1_stage/constructor_standings.csv 68 | on_error='continue'; 69 | 70 | --CONSTRUCTORS 71 | create or replace table formula1.raw.constructors ( 72 | CONSTRUCTOR_ID NUMBER(38,0), 73 | CONSTRUCTOR_REF VARCHAR(16777216), 74 | NAME VARCHAR(16777216), 75 | NATIONALITY VARCHAR(16777216), 76 | URL VARCHAR(16777216) 77 | ); 78 | copy into constructors 79 | from @formula1_stage/constructors.csv 80 | on_error='continue'; 81 | 82 | --DRIVER STANDINGS 83 | create or replace table formula1.raw.driver_standings ( 84 | DRIVER_STANDINGS_ID NUMBER(38,0), 85 | RACE_ID NUMBER(38,0), 86 | DRIVER_ID NUMBER(38,0), 87 | POINTS NUMBER(38,0), 88 | POSITION FLOAT, 89 | POSITION_TEXT VARCHAR(16777216), 90 | WINS NUMBER(38,0) 91 | 92 | ); 93 | copy into driver_standings 94 | from @formula1_stage/driver_standings.csv 95 | on_error='continue'; 96 | 97 | --DRIVERS 98 | create or replace table formula1.raw.drivers ( 99 | DRIVER_ID NUMBER(38,0), 100 | DRIVER_REF VARCHAR(16777216), 101 | NUMBER VARCHAR(16777216), 102 | CODE VARCHAR(16777216), 103 | FORENAME VARCHAR(16777216), 104 | SURNAME VARCHAR(16777216), 105 | DOB DATE, 106 | NATIONALITY VARCHAR(16777216), 107 | URL VARCHAR(16777216) 108 | ); 109 | copy into drivers 110 | from @formula1_stage/drivers.csv 111 | on_error='continue'; 112 | 113 | --LAP TIMES 114 | create or replace table formula1.raw.lap_times ( 115 | RACE_ID NUMBER(38,0), 116 | DRIVER_ID NUMBER(38,0), 117 | LAP NUMBER(38,0), 118 | POSITION FLOAT, 119 | TIME VARCHAR(16777216), 120 | MILLISECONDS NUMBER(38,0) 121 | ); 122 | copy into lap_times 123 | from @formula1_stage/lap_times.csv 124 | on_error='continue'; 125 | 126 | --PIT STOPS 127 | create or replace table formula1.raw.pit_stops ( 128 | RACE_ID NUMBER(38,0), 129 | DRIVER_ID NUMBER(38,0), 130 | STOP NUMBER(38,0), 131 | LAP NUMBER(38,0), 132 | TIME VARCHAR(16777216), 133 | DURATION VARCHAR(16777216), 134 | MILLISECONDS NUMBER(38,0) 135 | ); 136 | copy into pit_stops 137 | from @formula1_stage/pit_stops.csv 138 | on_error='continue'; 139 | 140 | --QUALIFYING 141 | create or replace table formula1.raw.qualifying ( 142 | QUALIFYING_ID NUMBER(38,0), 143 | RACE_ID NUMBER(38,0), 144 | DRIVER_ID NUMBER(38,0), 145 | CONSTRUCTOR_ID NUMBER(38,0), 146 | NUMBER NUMBER(38,0), 147 | POSITION FLOAT, 148 | Q1 VARCHAR(16777216), 149 | Q2 VARCHAR(16777216), 150 | Q3 VARCHAR(16777216) 151 | ); 152 | copy into qualifying 153 | from @formula1_stage/qualifying.csv 154 | on_error='continue'; 155 | 156 | --RACES 157 | create or replace table formula1.raw.races ( 158 | RACE_ID NUMBER(38,0), 159 | YEAR NUMBER(38,0), 160 | ROUND NUMBER(38,0), 161 | CIRCUIT_ID NUMBER(38,0), 162 | NAME VARCHAR(16777216), 163 | DATE DATE, 164 | TIME VARCHAR(16777216), 165 | URL VARCHAR(16777216), 166 | FP1_DATE VARCHAR(16777216), 167 | FP1_TIME VARCHAR(16777216), 168 | FP2_DATE VARCHAR(16777216), 169 | FP2_TIME VARCHAR(16777216), 170 | FP3_DATE VARCHAR(16777216), 171 | FP3_TIME VARCHAR(16777216), 172 | QUALI_DATE VARCHAR(16777216), 173 | QUALI_TIME VARCHAR(16777216), 174 | SPRINT_DATE VARCHAR(16777216), 175 | SPRINT_TIME VARCHAR(16777216) 176 | ); 177 | copy into races 178 | from @formula1_stage/races.csv 179 | on_error='continue'; 180 | 181 | --RESULTS 182 | create or replace table formula1.raw.results ( 183 | RESULT_ID NUMBER(38,0), 184 | RACE_ID NUMBER(38,0), 185 | DRIVER_ID NUMBER(38,0), 186 | CONSTRUCTOR_ID NUMBER(38,0), 187 | NUMBER NUMBER(38,0), 188 | GRID NUMBER(38,0), 189 | POSITION FLOAT, 190 | POSITION_TEXT VARCHAR(16777216), 191 | POSITION_ORDER NUMBER(38,0), 192 | POINTS NUMBER(38,0), 193 | LAPS NUMBER(38,0), 194 | TIME VARCHAR(16777216), 195 | MILLISECONDS NUMBER(38,0), 196 | FASTEST_LAP NUMBER(38,0), 197 | RANK NUMBER(38,0), 198 | FASTEST_LAP_TIME VARCHAR(16777216), 199 | FASTEST_LAP_SPEED FLOAT, 200 | STATUS_ID NUMBER(38,0) 201 | ); 202 | copy into results 203 | from @formula1_stage/results.csv 204 | on_error='continue'; 205 | 206 | --SEASONS 207 | create or replace table formula1.raw.seasons ( 208 | YEAR NUMBER(38,0), 209 | URL VARCHAR(16777216) 210 | ); 211 | copy into seasons 212 | from @formula1_stage/seasons.csv 213 | on_error='continue'; 214 | 215 | --SPRINT RESULTS 216 | create or replace table formula1.raw.sprint_results ( 217 | RESULT_ID NUMBER(38,0), 218 | RACE_ID NUMBER(38,0), 219 | DRIVER_ID NUMBER(38,0), 220 | CONSTRUCTOR_ID NUMBER(38,0), 221 | NUMBER NUMBER(38,0), 222 | GRID NUMBER(38,0), 223 | POSITION FLOAT, 224 | POSITION_TEXT VARCHAR(16777216), 225 | POSITION_ORDER NUMBER(38,0), 226 | POINTS NUMBER(38,0), 227 | LAPS NUMBER(38,0), 228 | TIME VARCHAR(16777216), 229 | MILLISECONDS NUMBER(38,0), 230 | FASTEST_LAP VARCHAR(16777216), 231 | FASTEST_LAP_TIME VARCHAR(16777216), 232 | STATUS_ID NUMBER(38,0) 233 | ); 234 | copy into sprint_results 235 | from @formula1_stage/sprint_results.csv 236 | on_error='continue'; 237 | 238 | --STATUS 239 | create or replace table formula1.raw.status ( 240 | STATUS_ID NUMBER(38,0), 241 | STATUS VARCHAR(16777216) 242 | ); 243 | copy into status 244 | from @formula1_stage/status.csv 245 | on_error='continue'; -------------------------------------------------------------------------------- /snapshots/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dbt-labs/python-snowpark-formula1/b6caebae7914a119c66ac16a4430f5e10faf41a8/snapshots/.gitkeep -------------------------------------------------------------------------------- /tests/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dbt-labs/python-snowpark-formula1/b6caebae7914a119c66ac16a4430f5e10faf41a8/tests/.gitkeep -------------------------------------------------------------------------------- /tests/lap_times_moving_avg_assert_positive_or_null.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | enabled=true, 4 | severity='error', 5 | tags = ['bi'] 6 | ) 7 | }} 8 | 9 | with lap_times_moving_avg as ( select * from {{ ref('lap_times_moving_avg') }} ) 10 | 11 | select * 12 | from lap_times_moving_avg 13 | where lap_moving_avg_5_years < 0 and lap_moving_avg_5_years is not null -------------------------------------------------------------------------------- /tests/macro_pit_stops_mean_is_positive.sql: -------------------------------------------------------------------------------- 1 | {{ 2 | config( 3 | enabled=true, 4 | severity='warn', 5 | tags = ['bi'] 6 | ) 7 | }} 8 | 9 | 10 | {{ test_all_values_gte_zero('fastest_pit_stops_by_constructor', 'mean') }} --------------------------------------------------------------------------------