├── .gitignore ├── 02_ingest ├── README.md ├── bqload.sh ├── download.sh ├── ingest.sh ├── ingest_from_crsbucket.sh ├── monthlyupdate │ ├── .dockerignore │ ├── .gitignore │ ├── 01_setup_svc_acct.sh │ ├── 02_deploy_cr.sh │ ├── 03_call_cr.sh │ ├── 04_next_month.sh │ ├── 05_setup_cron.sh │ ├── Dockerfile │ ├── ingest_flights.py │ ├── main.py │ └── requirements.txt ├── raw_download.sh └── upload.sh ├── 03_sqlstudio ├── README.md ├── contingency.sh ├── contingency1.sql ├── contingency2.sql ├── contingency3.sql ├── contingency4.sql ├── create_table.sql ├── create_views.sh └── create_views.sql ├── 04_streaming ├── .gitignore ├── README.md ├── design │ ├── airport_schema.json │ ├── mktbl.sh │ └── queries.txt ├── ingest_from_crsbucket.sh ├── realtime │ ├── avg01.py │ ├── avg02.py │ └── avg03.py ├── simulate │ ├── .gitignore │ ├── airports.csv.gz │ ├── simulate.py │ └── simulate_may2015.sh └── transform │ ├── airports.csv.gz │ ├── bqsample.sh │ ├── df01.py │ ├── df02.py │ ├── df03.py │ ├── df04.py │ ├── df05.py │ ├── df06.py │ ├── df07.py │ ├── flights_sample.json │ ├── install_packages.sh │ ├── setup.py │ └── stage_airports_file.sh ├── 05_bqnotebook ├── README.md ├── create_trainday.sh ├── exploration.ipynb ├── queries.txt └── trainday.txt ├── 06_dataproc ├── README.md ├── bayes_on_spark.py ├── create_cluster.sh ├── create_personal_cluster.sh ├── decrease_cluster.sh ├── delete_cluster.sh ├── increase_cluster.sh ├── install_on_cluster.sh ├── quantization.ipynb └── submit_serverless.sh ├── 07_sparkml ├── README.md ├── autoscale.yaml ├── create_large_cluster.sh ├── experiment.py ├── graphs.ipynb ├── logistic.py ├── logistic_regression.ipynb └── submit_spark.sh ├── 08_bqml ├── README.md ├── bqml_logistic.ipynb ├── bqml_nonlinear.ipynb ├── bqml_timetxf.ipynb └── bqml_timewindow.ipynb ├── 09_vertexai ├── .gitignore ├── README.md ├── call_predict.sh ├── example_input.json ├── flights_model.png └── flights_model_tf2.ipynb ├── 10_mlops ├── README.md ├── call_predict.py ├── explanation-metadata.json ├── ingest_from_crsbucket.sh ├── model.py └── train_on_vertexai.py ├── 11_realtime ├── .gitignore ├── README.md ├── alldata_sample.json ├── change_ch10_files.py ├── create_sample_input.sh ├── create_traindata.py ├── evaluation.ipynb ├── flightstxf │ ├── __init__.py │ └── flights_transforms.py ├── make_predictions.py ├── setup.py └── simevents_sample.json ├── 12_fulldataset └── README.md ├── COPYRIGHT ├── LICENSE ├── README.md └── cover_edition2.jpg /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints 2 | -------------------------------------------------------------------------------- /02_ingest/README.md: -------------------------------------------------------------------------------- 1 | # 2. Ingesting data onto the Cloud 2 | 3 | ### Create a bucket 4 | * Go to the Storage section of the GCP web console and create a new bucket 5 | 6 | ### Populate your bucket with the data you will need for the book 7 | 8 | * Open CloudShell and git clone this repo: 9 | ``` 10 | git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp 11 | ``` 12 | * Go to the 02_ingest folder of the repo 13 | * Edit ./ingest.sh to reflect the years you want to process (at minimum, you need 2015) 14 | * Execute ./ingest.sh bucketname 15 | 16 | ### [Optional] Scheduling monthly downloads 17 | * Go to the 02_ingest/monthlyupdate folder in the repo. 
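For example, assuming the repo was cloned into your home directory in CloudShell:
```
cd ~/data-science-on-gcp/02_ingest/monthlyupdate
```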
18 | * Run the command `pip3 install google-cloud-storage google-cloud-bigquery` 19 | * Run the command `gcloud auth application-default login` 20 | * Try ingesting one month using the Python script: `./ingest_flights.py --debug --bucket your-bucket-name --year 2015 --month 02` 21 | * Set up a service account called svc-monthly-ingest by running `./01_setup_svc_acct.sh` 22 | * Now, try running the ingest script as the service account: 23 | * Visit the Service Accounts section of the GCP Console: https://console.cloud.google.com/iam-admin/serviceaccounts 24 | * Select the newly created service account svc-monthly-ingest and click Manage Keys 25 | * Add key (Create a new JSON key) and download it to a file named tempkey.json 26 | * Run `gcloud auth activate-service-account --key-file tempkey.json` 27 | * Try ingesting one month `./ingest_flights.py --bucket $BUCKET --year 2015 --month 03 --debug` 28 | * Go back to running command as yourself using `gcloud auth login` 29 | * Deploy to Cloud Run: `./02_deploy_cr.sh` 30 | * Test that you can invoke the function using Cloud Run: `./03_call_cr.sh` 31 | * Test that the functionality to get the next month works: `./04_next_month.sh` 32 | * Set up a Cloud Scheduler job to invoke Cloud Run every month: `./05_setup_cron.sh` 33 | * Visit the GCP Console for Cloud Run and Cloud Scheduler and delete the Cloud Run instance and the scheduled task—you won’t need them any further. 34 | -------------------------------------------------------------------------------- /02_ingest/bqload.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ "$#" -ne 2 ]; then 4 | echo "Usage: ./bqload.sh csv-bucket-name YEAR" 5 | exit 6 | fi 7 | 8 | BUCKET=$1 9 | YEAR=$2 10 | 11 | 
SCHEMA=Year:STRING,Quarter:STRING,Month:STRING,DayofMonth:STRING,DayOfWeek:STRING,FlightDate:DATE,Reporting_Airline:STRING,DOT_ID_Reporting_Airline:STRING,IATA_CODE_Reporting_Airline:STRING,Tail_Number:STRING,Flight_Number_Reporting_Airline:STRING,OriginAirportID:STRING,OriginAirportSeqID:STRING,OriginCityMarketID:STRING,Origin:STRING,OriginCityName:STRING,OriginState:STRING,OriginStateFips:STRING,OriginStateName:STRING,OriginWac:STRING,DestAirportID:STRING,DestAirportSeqID:STRING,DestCityMarketID:STRING,Dest:STRING,DestCityName:STRING,DestState:STRING,DestStateFips:STRING,DestStateName:STRING,DestWac:STRING,CRSDepTime:STRING,DepTime:STRING,DepDelay:STRING,DepDelayMinutes:STRING,DepDel15:STRING,DepartureDelayGroups:STRING,DepTimeBlk:STRING,TaxiOut:STRING,WheelsOff:STRING,WheelsOn:STRING,TaxiIn:STRING,CRSArrTime:STRING,ArrTime:STRING,ArrDelay:STRING,ArrDelayMinutes:STRING,ArrDel15:STRING,ArrivalDelayGroups:STRING,ArrTimeBlk:STRING,Cancelled:STRING,CancellationCode:STRING,Diverted:STRING,CRSElapsedTime:STRING,ActualElapsedTime:STRING,AirTime:STRING,Flights:STRING,Distance:STRING,DistanceGroup:STRING,CarrierDelay:STRING,WeatherDelay:STRING,NASDelay:STRING,SecurityDelay:STRING,LateAircraftDelay:STRING,FirstDepTime:STRING,TotalAddGTime:STRING,LongestAddGTime:STRING,DivAirportLandings:STRING,DivReachedDest:STRING,DivActualElapsedTime:STRING,DivArrDelay:STRING,DivDistance:STRING,Div1Airport:STRING,Div1AirportID:STRING,Div1AirportSeqID:STRING,Div1WheelsOn:STRING,Div1TotalGTime:STRING,Div1LongestGTime:STRING,Div1WheelsOff:STRING,Div1TailNum:STRING,Div2Airport:STRING,Div2AirportID:STRING,Div2AirportSeqID:STRING,Div2WheelsOn:STRING,Div2TotalGTime:STRING,Div2LongestGTime:STRING,Div2WheelsOff:STRING,Div2TailNum:STRING,Div3Airport:STRING,Div3AirportID:STRING,Div3AirportSeqID:STRING,Div3WheelsOn:STRING,Div3TotalGTime:STRING,Div3LongestGTime:STRING,Div3WheelsOff:STRING,Div3TailNum:STRING,Div4Airport:STRING,Div4AirportID:STRING,Div4AirportSeqID:STRING,Div4WheelsOn:STRING,Div4TotalGTime:STRING,Div4LongestGTime:STRING,Div4WheelsOff:STRING,Div4TailNum:STRING,Div5Airport:STRING,Div5AirportID:STRING,Div5AirportSeqID:STRING,Div5WheelsOn:STRING,Div5TotalGTime:STRING,Div5LongestGTime:STRING,Div5WheelsOff:STRING,Div5TailNum:STRING 12 | 13 | # create dataset if not exists 14 | PROJECT=$(gcloud config get-value project) 15 | #bq --project_id $PROJECT rm -f ${PROJECT}:dsongcp.flights_raw 16 | bq --project_id $PROJECT show dsongcp || bq mk --sync dsongcp 17 | 18 | for MONTH in `seq -w 1 12`; do 19 | 20 | CSVFILE=gs://${BUCKET}/flights/raw/${YEAR}${MONTH}.csv 21 | bq --project_id $PROJECT --sync \ 22 | load --time_partitioning_field=FlightDate --time_partitioning_type=MONTH \ 23 | --source_format=CSV --ignore_unknown_values --skip_leading_rows=1 --schema=$SCHEMA \ 24 | --replace ${PROJECT}:dsongcp.flights_raw\$${YEAR}${MONTH} $CSVFILE 25 | 26 | done 27 | 28 | -------------------------------------------------------------------------------- /02_ingest/download.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Note that we have commented out the BTS website, and are instead 4 | # using a mirror. 
This is because the BTS website is frequently down 5 | SOURCE=https://storage.googleapis.com/data-science-on-gcp/edition2/raw 6 | #SOURCE=https://transtats.bts.gov/PREZIP 7 | 8 | if test "$#" -ne 2; then 9 | echo "Usage: ./download.sh year month" 10 | echo " eg: ./download.sh 2015 1" 11 | exit 12 | fi 13 | 14 | YEAR=$1 15 | MONTH=$2 16 | BASEURL="${SOURCE}/On_Time_Reporting_Carrier_On_Time_Performance_1987_present" 17 | echo "Downloading YEAR=$YEAR ... MONTH=$MONTH ... from $BASEURL" 18 | 19 | 20 | MONTH2=$(printf "%02d" $MONTH) 21 | 22 | TMPDIR=$(mktemp -d) 23 | 24 | ZIPFILE=${TMPDIR}/${YEAR}_${MONTH2}.zip 25 | echo $ZIPFILE 26 | 27 | curl -o $ZIPFILE ${BASEURL}_${YEAR}_${MONTH}.zip 28 | unzip -d $TMPDIR $ZIPFILE 29 | 30 | mv $TMPDIR/*.csv ./${YEAR}${MONTH2}.csv 31 | rm -rf $TMPDIR 32 | -------------------------------------------------------------------------------- /02_ingest/ingest.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ "$#" -ne 1 ]; then 4 | echo "Usage: ./ingest.sh destination-bucket-name" 5 | exit 6 | fi 7 | 8 | export BUCKET=$1 9 | 10 | # get zip files from BTS, extract csv files 11 | for YEAR in `seq 2015 2015`; do 12 | for MONTH in `seq 1 12`; do 13 | bash download.sh $YEAR $MONTH 14 | # upload the raw CSV files to our GCS bucket 15 | bash upload.sh $BUCKET 16 | rm *.csv 17 | done 18 | # load the CSV files into BigQuery as string columns 19 | bash bqload.sh $BUCKET $YEAR 20 | done 21 | 22 | 23 | # verify that things worked 24 | bq query --nouse_legacy_sql \ 25 | 'SELECT DISTINCT year, month FROM dsongcp.flights_raw ORDER BY year ASC, CAST(month AS INTEGER) ASC' 26 | 27 | bq query --nouse_legacy_sql \ 28 | 'SELECT year, month, COUNT(*) AS num_flights FROM dsongcp.flights_raw GROUP BY year, month ORDER BY year ASC, CAST(month AS INTEGER) ASC' 29 | -------------------------------------------------------------------------------- /02_ingest/ingest_from_crsbucket.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ "$#" -ne 1 ]; then 4 | echo "Usage: ./ingest_from_crsbucket.sh destination-bucket-name" 5 | exit 6 | fi 7 | 8 | BUCKET=$1 9 | FROM=gs://data-science-on-gcp/edition2/flights/raw 10 | TO=gs://$BUCKET/flights/raw 11 | 12 | CMD="gsutil -m cp " 13 | for MONTH in `seq -w 1 12`; do 14 | CMD="$CMD ${FROM}/2015${MONTH}.csv" 15 | done 16 | CMD="$CMD ${FROM}/201601.csv $TO" 17 | 18 | echo $CMD 19 | $CMD 20 | -------------------------------------------------------------------------------- /02_ingest/monthlyupdate/.dockerignore: -------------------------------------------------------------------------------- 1 | Dockerfile 2 | README.md 3 | *.pyc 4 | *.pyo 5 | *.pyd 6 | __pycache__ 7 | .pytest_cache 8 | .git 9 | .gitignore 10 | tempkey.json 11 | -------------------------------------------------------------------------------- /02_ingest/monthlyupdate/.gitignore: -------------------------------------------------------------------------------- 1 | tempkey.json 2 | -------------------------------------------------------------------------------- /02_ingest/monthlyupdate/01_setup_svc_acct.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | SVC_ACCT=svc-monthly-ingest 4 | PROJECT_ID=$(gcloud config get-value project) 5 | BUCKET=${PROJECT_ID}-cf-staging 6 | REGION=us-central1 7 | SVC_PRINCIPAL=serviceAccount:${SVC_ACCT}@${PROJECT_ID}.iam.gserviceaccount.com 8 | 9 | gsutil ls gs://$BUCKET || gsutil mb -l 
$REGION gs://$BUCKET 10 | gsutil uniformbucketlevelaccess set on gs://$BUCKET 11 | 12 | gcloud iam service-accounts create $SVC_ACCT --display-name "flights monthly ingest" 13 | 14 | # make the service account the admin of the bucket 15 | # it can read/write/list/delete etc. on only this bucket 16 | gsutil iam ch ${SVC_PRINCIPAL}:roles/storage.admin gs://$BUCKET 17 | 18 | # ability to create/delete partitions etc in BigQuery table 19 | bq --project_id=${PROJECT_ID} query --nouse_legacy_sql \ 20 | "GRANT \`roles/bigquery.dataOwner\` ON SCHEMA dsongcp TO '$SVC_PRINCIPAL' " 21 | 22 | gcloud projects add-iam-policy-binding ${PROJECT_ID} \ 23 | --member ${SVC_PRINCIPAL} \ 24 | --role roles/bigquery.jobUser 25 | 26 | # At this point, test running as service account 27 | # download a json key from the console (temporarily) 28 | # either add this to .gcloudignore and .gitignore or put it in a different directory! 29 | # gcloud auth activate-service-account --key-file tempkey.json 30 | # ./ingest_flights.py --bucket $BUCKET --year 2015 --month 03 --debug 31 | # after this, go back to being yourself with gcloud auth login 32 | 33 | # Make sure the sevice account can invoke cloud functions 34 | gcloud projects add-iam-policy-binding ${PROJECT_ID} \ 35 | --member ${SVC_PRINCIPAL} \ 36 | --role roles/run.invoker 37 | -------------------------------------------------------------------------------- /02_ingest/monthlyupdate/02_deploy_cr.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # same as in setup_svc_acct 4 | NAME=ingest-flights-monthly 5 | SVC_ACCT=svc-monthly-ingest 6 | PROJECT_ID=$(gcloud config get-value project) 7 | REGION=us-central1 8 | SVC_EMAIL=${SVC_ACCT}@${PROJECT_ID}.iam.gserviceaccount.com 9 | 10 | #gcloud functions deploy $URL \ 11 | # --entry-point ingest_flights --runtime python37 --trigger-http \ 12 | # --timeout 540s --service-account ${SVC_EMAIL} --no-allow-unauthenticated 13 | 14 | gcloud run deploy $NAME --region $REGION --source=$(pwd) \ 15 | --platform=managed --service-account ${SVC_EMAIL} --no-allow-unauthenticated \ 16 | --timeout 12m \ 17 | 18 | -------------------------------------------------------------------------------- /02_ingest/monthlyupdate/03_call_cr.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # same as deploy_cr.sh 4 | NAME=ingest-flights-monthly 5 | 6 | PROJECT_ID=$(gcloud config get-value project) 7 | BUCKET=${PROJECT_ID}-cf-staging 8 | 9 | URL=$(gcloud run services describe ingest-flights-monthly --format 'value(status.url)') 10 | echo $URL 11 | 12 | # Feb 2015 13 | echo {\"year\":\"2015\"\,\"month\":\"02\"\,\"bucket\":\"${BUCKET}\"\} > /tmp/message 14 | 15 | curl -k -X POST $URL \ 16 | -H "Authorization: Bearer $(gcloud auth print-identity-token)" \ 17 | -H "Content-Type:application/json" --data-binary @/tmp/message 18 | 19 | -------------------------------------------------------------------------------- /02_ingest/monthlyupdate/04_next_month.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # same as deploy_cr.sh 4 | NAME=ingest-flights-monthly 5 | 6 | PROJECT_ID=$(gcloud config get-value project) 7 | BUCKET=${PROJECT_ID}-cf-staging 8 | 9 | URL=$(gcloud run services describe ingest-flights-monthly --format 'value(status.url)') 10 | echo $URL 11 | 12 | # next month 13 | echo "Getting month that follows ... 
(removing 12 if needed, so there is something to get) " 14 | gsutil rm -rf gs://$BUCKET/flights/raw/201512.csv.gz 15 | gsutil ls gs://$BUCKET/flights/raw 16 | echo {\"bucket\":\"${BUCKET}\"\} > /tmp/message 17 | cat /tmp/message 18 | 19 | curl -k -X POST $URL \ 20 | -H "Authorization: Bearer $(gcloud auth print-identity-token)" \ 21 | -H "Content-Type:application/json" --data-binary @/tmp/message 22 | 23 | echo "Done" 24 | -------------------------------------------------------------------------------- /02_ingest/monthlyupdate/05_setup_cron.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # same as in setup_svc_acct.sh and call_cr.sh 4 | NAME=ingest-flights-monthly 5 | PROJECT_ID=$(gcloud config get-value project) 6 | BUCKET=${PROJECT_ID}-cf-staging 7 | SVC_ACCT=svc-monthly-ingest 8 | SVC_EMAIL=${SVC_ACCT}@${PROJECT_ID}.iam.gserviceaccount.com 9 | 10 | SVC_URL=$(gcloud run services describe ingest-flights-monthly --format 'value(status.url)') 11 | echo $SVC_URL 12 | echo $SVC_EMAIL 13 | 14 | # note that there is no year or month. The service looks for next month in that case. 15 | echo {\"bucket\":\"${BUCKET}\"\} > /tmp/message 16 | cat /tmp/message 17 | 18 | gcloud scheduler jobs create http monthlyupdate \ 19 | --description "Ingest flights using Cloud Run" \ 20 | --schedule="8 of month 10:00" --time-zone "America/New_York" \ 21 | --uri=$SVC_URL --http-method POST \ 22 | --oidc-service-account-email $SVC_EMAIL --oidc-token-audience=$SVC_URL \ 23 | --max-backoff=7d \ 24 | --max-retry-attempts=5 \ 25 | --max-retry-duration=2d \ 26 | --min-backoff=12h \ 27 | --headers="Content-Type=application/json" \ 28 | --message-body-from-file=/tmp/message 29 | 30 | 31 | # To try this out, go to Console and do two things: 32 | # in Service Accounts, give yourself the ability to impersonate this service account (ServiceAccountUser) 33 | # in Cloud Scheduler, click "Run Now" 34 | -------------------------------------------------------------------------------- /02_ingest/monthlyupdate/Dockerfile: -------------------------------------------------------------------------------- 1 | # Use the official lightweight Python image. 2 | # https://hub.docker.com/_/python 3 | FROM python:3.6-slim 4 | 5 | # Allow statements and log messages to immediately appear in the Knative logs 6 | ENV PYTHONUNBUFFERED True 7 | 8 | # Copy local code to the container image. 9 | ENV APP_HOME /app 10 | WORKDIR $APP_HOME 11 | COPY . ./ 12 | 13 | # Install production dependencies. 14 | RUN pip install --no-cache-dir -r requirements.txt 15 | 16 | # Run the web service on container startup. Here we use the gunicorn 17 | # webserver, with one worker process and 8 threads. 18 | # For environments with multiple CPU cores, increase the number of workers 19 | # to be equal to the cores available. 20 | # Timeout is set to 0 to disable the timeouts of the workers to allow Cloud Run to handle instance scaling. 21 | CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app 22 | -------------------------------------------------------------------------------- /02_ingest/monthlyupdate/ingest_flights.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2016-2021 Google Inc. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 
7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | import os 18 | import gzip 19 | import shutil 20 | import logging 21 | import os.path 22 | import zipfile 23 | import datetime 24 | import tempfile 25 | from google.cloud import storage 26 | from google.cloud.storage import Blob 27 | from google.cloud import bigquery 28 | 29 | SOURCE = "https://storage.googleapis.com/data-science-on-gcp/edition2/raw" 30 | #SOURCE = "https://transtats.bts.gov/PREZIP" 31 | 32 | 33 | def urlopen(url): 34 | from urllib.request import urlopen as impl 35 | import ssl 36 | 37 | ctx_no_secure = ssl.create_default_context() 38 | ctx_no_secure.set_ciphers('HIGH:!DH:!aNULL') 39 | ctx_no_secure.check_hostname = False 40 | ctx_no_secure.verify_mode = ssl.CERT_NONE 41 | return impl(url, context=ctx_no_secure) 42 | 43 | 44 | def download(year: str, month: str, destdir: str): 45 | """ 46 | Downloads on-time performance data and returns local filename 47 | year e.g.'2015' 48 | month e.g. '01 for January 49 | """ 50 | logging.info('Requesting data for {}-{}-*'.format(year, month)) 51 | 52 | url = os.path.join(SOURCE, 53 | "On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{}_{}.zip".format(year, int(month))) 54 | logging.debug("Trying to download {}".format(url)) 55 | 56 | filename = os.path.join(destdir, "{}{}.zip".format(year, month)) 57 | with open(filename, "wb") as fp: 58 | response = urlopen(url) 59 | fp.write(response.read()) 60 | logging.debug("{} saved".format(filename)) 61 | return filename 62 | 63 | 64 | def zip_to_csv(filename, destdir): 65 | """ 66 | Extracts the CSV file from the zip file into the destdir 67 | """ 68 | zip_ref = zipfile.ZipFile(filename, 'r') 69 | cwd = os.getcwd() 70 | os.chdir(destdir) 71 | zip_ref.extractall() 72 | os.chdir(cwd) 73 | csvfile = os.path.join(destdir, zip_ref.namelist()[0]) 74 | zip_ref.close() 75 | logging.info("Extracted {}".format(csvfile)) 76 | 77 | # now gzip for faster upload to bucket 78 | gzipped = csvfile + ".gz" 79 | with open(csvfile, 'rb') as ifp: 80 | with gzip.open(gzipped, 'wb') as ofp: 81 | shutil.copyfileobj(ifp, ofp) 82 | logging.info("Compressed into {}".format(gzipped)) 83 | 84 | return gzipped 85 | 86 | 87 | def upload(csvfile, bucketname, blobname): 88 | """ 89 | Uploads the CSV file into the bucket with the given blobname 90 | """ 91 | client = storage.Client() 92 | bucket = client.get_bucket(bucketname) 93 | logging.info(bucket) 94 | blob = Blob(blobname, bucket) 95 | logging.debug('Uploading {} ...'.format(csvfile)) 96 | blob.upload_from_filename(csvfile) 97 | gcslocation = 'gs://{}/{}'.format(bucketname, blobname) 98 | logging.info('Uploaded {} ...'.format(gcslocation)) 99 | return gcslocation 100 | 101 | 102 | def bqload(gcsfile, year, month): 103 | """ 104 | Loads the CSV file in GCS into BigQuery, replacing the existing data in that partition 105 | """ 106 | client = bigquery.Client() 107 | # truncate existing partition ... 
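    # the '$YYYYMM' suffix on the table name below is a BigQuery partition decorator:
    # together with WRITE_TRUNCATE, the load replaces only that month's partition rather than the whole table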
108 | table_ref = client.dataset('dsongcp').table('flights_raw${}{}'.format(year, month)) 109 | job_config = bigquery.LoadJobConfig() 110 | job_config.source_format = 'CSV' 111 | job_config.write_disposition = 'WRITE_TRUNCATE' 112 | job_config.ignore_unknown_values = True 113 | job_config.time_partitioning = bigquery.table.TimePartitioning('MONTH', 'FlightDate') 114 | job_config.skip_leading_rows = 1 115 | job_config.schema = [ 116 | bigquery.SchemaField(col_and_type.split(':')[0], col_and_type.split(':')[1]) #, mode='required') 117 | for col_and_type in 118 | "Year:STRING,Quarter:STRING,Month:STRING,DayofMonth:STRING,DayOfWeek:STRING,FlightDate:DATE,Reporting_Airline:STRING,DOT_ID_Reporting_Airline:STRING,IATA_CODE_Reporting_Airline:STRING,Tail_Number:STRING,Flight_Number_Reporting_Airline:STRING,OriginAirportID:STRING,OriginAirportSeqID:STRING,OriginCityMarketID:STRING,Origin:STRING,OriginCityName:STRING,OriginState:STRING,OriginStateFips:STRING,OriginStateName:STRING,OriginWac:STRING,DestAirportID:STRING,DestAirportSeqID:STRING,DestCityMarketID:STRING,Dest:STRING,DestCityName:STRING,DestState:STRING,DestStateFips:STRING,DestStateName:STRING,DestWac:STRING,CRSDepTime:STRING,DepTime:STRING,DepDelay:STRING,DepDelayMinutes:STRING,DepDel15:STRING,DepartureDelayGroups:STRING,DepTimeBlk:STRING,TaxiOut:STRING,WheelsOff:STRING,WheelsOn:STRING,TaxiIn:STRING,CRSArrTime:STRING,ArrTime:STRING,ArrDelay:STRING,ArrDelayMinutes:STRING,ArrDel15:STRING,ArrivalDelayGroups:STRING,ArrTimeBlk:STRING,Cancelled:STRING,CancellationCode:STRING,Diverted:STRING,CRSElapsedTime:STRING,ActualElapsedTime:STRING,AirTime:STRING,Flights:STRING,Distance:STRING,DistanceGroup:STRING,CarrierDelay:STRING,WeatherDelay:STRING,NASDelay:STRING,SecurityDelay:STRING,LateAircraftDelay:STRING,FirstDepTime:STRING,TotalAddGTime:STRING,LongestAddGTime:STRING,DivAirportLandings:STRING,DivReachedDest:STRING,DivActualElapsedTime:STRING,DivArrDelay:STRING,DivDistance:STRING,Div1Airport:STRING,Div1AirportID:STRING,Div1AirportSeqID:STRING,Div1WheelsOn:STRING,Div1TotalGTime:STRING,Div1LongestGTime:STRING,Div1WheelsOff:STRING,Div1TailNum:STRING,Div2Airport:STRING,Div2AirportID:STRING,Div2AirportSeqID:STRING,Div2WheelsOn:STRING,Div2TotalGTime:STRING,Div2LongestGTime:STRING,Div2WheelsOff:STRING,Div2TailNum:STRING,Div3Airport:STRING,Div3AirportID:STRING,Div3AirportSeqID:STRING,Div3WheelsOn:STRING,Div3TotalGTime:STRING,Div3LongestGTime:STRING,Div3WheelsOff:STRING,Div3TailNum:STRING,Div4Airport:STRING,Div4AirportID:STRING,Div4AirportSeqID:STRING,Div4WheelsOn:STRING,Div4TotalGTime:STRING,Div4LongestGTime:STRING,Div4WheelsOff:STRING,Div4TailNum:STRING,Div5Airport:STRING,Div5AirportID:STRING,Div5AirportSeqID:STRING,Div5WheelsOn:STRING,Div5TotalGTime:STRING,Div5LongestGTime:STRING,Div5WheelsOff:STRING,Div5TailNum:STRING".split(',') 119 | ] 120 | load_job = client.load_table_from_uri(gcsfile, table_ref, job_config=job_config) 121 | load_job.result() # waits for table load to complete 122 | 123 | if load_job.state != 'DONE': 124 | raise load_job.exception() 125 | 126 | return table_ref, load_job.output_rows 127 | 128 | 129 | def ingest(year, month, bucket): 130 | ''' 131 | ingest flights data from BTS website to Google Cloud Storage 132 | return table, numrows on success. 
133 | raises exception if this data is not on BTS website 134 | ''' 135 | tempdir = tempfile.mkdtemp(prefix='ingest_flights') 136 | try: 137 | zipfile = download(year, month, tempdir) 138 | bts_csv = zip_to_csv(zipfile, tempdir) 139 | gcsloc = 'flights/raw/{}{}.csv.gz'.format(year, month) 140 | gcsloc = upload(bts_csv, bucket, gcsloc) 141 | return bqload(gcsloc, year, month) 142 | finally: 143 | logging.debug('Cleaning up by removing {}'.format(tempdir)) 144 | shutil.rmtree(tempdir) 145 | 146 | 147 | def next_month(bucketname): 148 | ''' 149 | Finds which months are on GCS, and returns next year,month to download 150 | ''' 151 | client = storage.Client() 152 | bucket = client.get_bucket(bucketname) 153 | blobs = list(bucket.list_blobs(prefix='flights/raw/')) 154 | files = [blob.name for blob in blobs if 'csv' in blob.name] # csv files only 155 | lastfile = os.path.basename(files[-1]) 156 | logging.debug('The latest file on GCS is {}'.format(lastfile)) 157 | year = lastfile[:4] 158 | month = lastfile[4:6] 159 | return compute_next_month(year, month) 160 | 161 | 162 | def compute_next_month(year, month): 163 | dt = datetime.datetime(int(year), int(month), 15) # 15th of month 164 | dt = dt + datetime.timedelta(30) # will always go to next month 165 | logging.debug('The next month is {}'.format(dt)) 166 | return '{}'.format(dt.year), '{:02d}'.format(dt.month) 167 | 168 | 169 | if __name__ == '__main__': 170 | import argparse 171 | 172 | parser = argparse.ArgumentParser(description='ingest flights data from BTS website to Google Cloud Storage') 173 | parser.add_argument('--bucket', help='GCS bucket to upload data to', required=True) 174 | parser.add_argument('--year', help='Example: 2015. If not provided, defaults to getting next month') 175 | parser.add_argument('--month', help='Specify 01 for January. If not provided, defaults to getting next month') 176 | parser.add_argument('--debug', dest='debug', action='store_true', help='Specify if you want debug messages') 177 | 178 | try: 179 | args = parser.parse_args() 180 | if args.debug: 181 | logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.DEBUG) 182 | else: 183 | logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.INFO) 184 | 185 | if args.year is None or args.month is None: 186 | year_, month_ = next_month(args.bucket) 187 | else: 188 | year_ = args.year 189 | month_ = args.month 190 | logging.debug('Ingesting year={} month={}'.format(year_, month_)) 191 | tableref, numrows = ingest(year_, month_, args.bucket) 192 | logging.info('Success ... ingested {} rows to {}'.format(numrows, tableref)) 193 | except Exception as e: 194 | logging.exception("Try again later?") 195 | -------------------------------------------------------------------------------- /02_ingest/monthlyupdate/main.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Copyright 2016-2021 Google Inc. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | import os 18 | import logging 19 | from flask import Flask 20 | from flask import request, escape 21 | from ingest_flights import ingest, next_month 22 | 23 | app = Flask(__name__) 24 | 25 | 26 | @app.route("/", methods=['POST']) 27 | def ingest_flights(): 28 | # noinspection PyBroadException 29 | try: 30 | logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.INFO) 31 | json = request.get_json(force=True) # https://stackoverflow.com/questions/53216177/http-triggering-cloud-function-with-cloud-scheduler/60615210#60615210 32 | 33 | year = escape(json['year']) if 'year' in json else None 34 | month = escape(json['month']) if 'month' in json else None 35 | bucket = escape(json['bucket']) # required 36 | 37 | if year is None or month is None or len(year) == 0 or len(month) == 0: 38 | year, month = next_month(bucket) 39 | logging.debug('Ingesting year={} month={}'.format(year, month)) 40 | tableref, numrows = ingest(year, month, bucket) 41 | ok = 'Success ... ingested {} rows to {}'.format(numrows, tableref) 42 | logging.info(ok) 43 | return ok 44 | except Exception as e: 45 | logging.exception("Failed to ingest ... try again later?") 46 | 47 | 48 | if __name__ == "__main__": 49 | app.run(debug=True, host="0.0.0.0", port=int(os.environ.get("PORT", 8080))) 50 | -------------------------------------------------------------------------------- /02_ingest/monthlyupdate/requirements.txt: -------------------------------------------------------------------------------- 1 | Flask==2.0.1 2 | google-cloud-storage==1.42.0 3 | google-cloud-bigquery==2.25.1 4 | gunicorn==20.1.0 5 | 6 | 7 | -------------------------------------------------------------------------------- /02_ingest/raw_download.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #export YEAR=${YEAR:=2015} 4 | SOURCE=https://transtats.bts.gov/PREZIP 5 | 6 | OUTDIR=raw 7 | mkdir -p $OUTDIR 8 | 9 | for YEAR in `seq 2019 2019`; do 10 | for MONTH in `seq 1 12`; do 11 | 12 | FILE=On_Time_Reporting_Carrier_On_Time_Performance_1987_present_${YEAR}_${MONTH}.zip 13 | curl -k -o ${OUTDIR}/${FILE} ${SOURCE}/${FILE} 14 | 15 | done 16 | done 17 | -------------------------------------------------------------------------------- /02_ingest/upload.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ "$#" -ne 1 ]; then 4 | echo "Usage: ./upload.sh destination-bucket-name" 5 | exit 6 | fi 7 | 8 | BUCKET=$1 9 | 10 | echo "Uploading to bucket $BUCKET..." 11 | gsutil -m cp *.csv gs://$BUCKET/flights/raw/ 12 | #gsutil -m acl ch -R -g allUsers:R gs://$BUCKET/flights/raw 13 | #gsutil -m acl ch -R -g google.com:R gs://$BUCKET/flights/raw 14 | -------------------------------------------------------------------------------- /03_sqlstudio/README.md: -------------------------------------------------------------------------------- 1 | # 3. 
Creating compelling dashboards 2 | 3 | ### Catch up to Chapter 2 4 | If you have not already done so, load the raw data into a BigQuery dataset: 5 | * Go to the Storage section of the GCP web console and create a new bucket 6 | * Open CloudShell and git clone this repo: 7 | ``` 8 | git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp 9 | ``` 10 | * Then, run: 11 | ``` 12 | cd data-science-on-gcp/02_ingest 13 | ./ingest.sh bucketname 14 | ``` 15 | 16 | 17 | ### Optional: Load the data into PostgreSQL 18 | * Navigate to https://console.cloud.google.com/sql 19 | * Select Create Instance 20 | * Choose PostgreSQL and then fill out the form as follows: 21 | * Call the instance flights 22 | * Generate a strong password by clicking GENERATE 23 | * Choose the default PostgreSQL version 24 | * Choose the region where your bucket of CSV data exists 25 | * Choose a single zone instance 26 | * Choose a standard machine type with 2 vCPU 27 | * Click Create Instance 28 | * Type (change bucket as necessary): 29 | ``` 30 | gsutil cp create_table.sql \ 31 | gs://cloud-training-demos-ml/flights/ch3/create_table.sql 32 | ``` 33 | * Create empty table using web console: 34 | * navigate to databases section of Cloud SQL and create a new database called bts 35 | * navigate to flights instance and select IMPORT 36 | * Specify location of create_table.sql in your bucket 37 | * Specify that you want to create a table in the database bts 38 | * Load the CSV files into this table: 39 | * Browse to 201501.csv in your bucket 40 | * Specify CSV as the format 41 | * bts as the database 42 | * flights as the table 43 | * In Cloud Shell, connect to database and run queries 44 | * Connect to the database using one of these two commands (the first if you don't need a SQL proxy, the second if you do -- you'll typically need a SQL proxy if your organization has set up a security rule to allow access only to authorized networks): 45 | * ```gcloud sql connect flights --user=postgres``` 46 | * OR ```gcloud beta sql connect flights --user=postgres``` 47 | * In the prompt, type ```\c bts;``` 48 | * Type in the following query: 49 | ``` 50 | SELECT "Origin", COUNT(*) AS num_flights 51 | FROM flights GROUP BY "Origin" 52 | ORDER BY num_flights DESC 53 | LIMIT 5; 54 | ``` 55 | * Add more months of CSV data and notice that the performance degrades. 56 | Once you are done, delete the Cloud SQL instance since you will not need it for the rest of the book. 57 | 58 | ### Creating view in BigQuery 59 | * Run the script 60 | ```./create_views.sh``` 61 | * Compute the contingency table for various thresholds by running the script 62 | ``` 63 | ./contingency.sh 64 | ``` 65 | 66 | ### Building a dashboard 67 | Follow the steps in the main text of the chapter to set up a Data Studio dashboard and create charts. 
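If you first want to sanity-check the BigQuery view that the charts are built on, the top-airports query shown above for PostgreSQL has a straightforward equivalent against the dsongcp.flights view created by ./create_views.sh; something along these lines should work in the BigQuery console:
```
SELECT ORIGIN, COUNT(*) AS num_flights
FROM dsongcp.flights
GROUP BY ORIGIN
ORDER BY num_flights DESC
LIMIT 5
```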
68 | 69 | -------------------------------------------------------------------------------- /03_sqlstudio/contingency.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | PROJECT=$(gcloud config get-value project) 4 | cat contingency4.sql \ 5 | | bq --project_id $PROJECT query --nouse_legacy_sql 6 | -------------------------------------------------------------------------------- /03_sqlstudio/contingency1.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | COUNT(*) AS true_positives 3 | FROM dsongcp.flights 4 | WHERE dep_delay < 15 AND arr_delay < 15 5 | -------------------------------------------------------------------------------- /03_sqlstudio/contingency2.sql: -------------------------------------------------------------------------------- 1 | DECLARE THRESH INT64; 2 | SET THRESH = 15; 3 | 4 | SELECT 5 | COUNTIF(dep_delay < THRESH AND arr_delay < 15) AS true_positives, 6 | COUNTIF(dep_delay < THRESH AND arr_delay >= 15) AS false_positives, 7 | COUNTIF(dep_delay >= THRESH AND arr_delay < 15) AS false_negatives, 8 | COUNTIF(dep_delay >= THRESH AND arr_delay >= 15) AS true_negatives, 9 | COUNT(*) AS total 10 | FROM dsongcp.flights 11 | WHERE arr_delay IS NOT NULL AND dep_delay IS NOT NULL 12 | -------------------------------------------------------------------------------- /03_sqlstudio/contingency3.sql: -------------------------------------------------------------------------------- 1 | SELECT 2 | THRESH, 3 | COUNTIF(dep_delay < THRESH AND arr_delay < 15) AS true_positives, 4 | COUNTIF(dep_delay < THRESH AND arr_delay >= 15) AS false_positives, 5 | COUNTIF(dep_delay >= THRESH AND arr_delay < 15) AS false_negatives, 6 | COUNTIF(dep_delay >= THRESH AND arr_delay >= 15) AS true_negatives, 7 | COUNT(*) AS total 8 | FROM dsongcp.flights, UNNEST([5, 10, 11, 12, 13, 15, 20]) AS THRESH 9 | WHERE arr_delay IS NOT NULL AND dep_delay IS NOT NULL 10 | GROUP BY THRESH 11 | -------------------------------------------------------------------------------- /03_sqlstudio/contingency4.sql: -------------------------------------------------------------------------------- 1 | WITH contingency_table AS ( 2 | SELECT 3 | THRESH, 4 | COUNTIF(dep_delay < THRESH AND arr_delay < 15) AS true_positives, 5 | COUNTIF(dep_delay < THRESH AND arr_delay >= 15) AS false_positives, 6 | COUNTIF(dep_delay >= THRESH AND arr_delay < 15) AS false_negatives, 7 | COUNTIF(dep_delay >= THRESH AND arr_delay >= 15) AS true_negatives, 8 | COUNT(*) AS total 9 | FROM dsongcp.flights, UNNEST([5, 10, 11, 12, 13, 15, 20]) AS THRESH 10 | WHERE arr_delay IS NOT NULL AND dep_delay IS NOT NULL 11 | GROUP BY THRESH 12 | ) 13 | 14 | SELECT 15 | ROUND((true_positives + true_negatives)/total, 2) AS accuracy, 16 | ROUND(false_positives/(true_positives+false_positives), 2) AS fpr, 17 | ROUND(false_negatives/(false_negatives+true_negatives), 2) AS fnr, 18 | * 19 | FROM contingency_table 20 | -------------------------------------------------------------------------------- /03_sqlstudio/create_table.sql: -------------------------------------------------------------------------------- 1 | drop table if exists flights; 2 | 3 | CREATE TABLE flights ( 4 | "Year" TEXT, 5 | "Quarter" TEXT, 6 | "Month" TEXT, 7 | "DayofMonth" TEXT, 8 | "DayOfWeek" TEXT, 9 | "FlightDate" TEXT, 10 | "Reporting_Airline" TEXT, 11 | "DOT_ID_Reporting_Airline" TEXT, 12 | "IATA_CODE_Reporting_Airline" TEXT, 13 | "Tail_Number" TEXT, 14 | "Flight_Number_Reporting_Airline" TEXT, 15 | 
"OriginAirportID" TEXT, 16 | "OriginAirportSeqID" TEXT, 17 | "OriginCityMarketID" TEXT, 18 | "Origin" TEXT, 19 | "OriginCityName" TEXT, 20 | "OriginState" TEXT, 21 | "OriginStateFips" TEXT, 22 | "OriginStateName" TEXT, 23 | "OriginWac" TEXT, 24 | "DestAirportID" TEXT, 25 | "DestAirportSeqID" TEXT, 26 | "DestCityMarketID" TEXT, 27 | "Dest" TEXT, 28 | "DestCityName" TEXT, 29 | "DestState" TEXT, 30 | "DestStateFips" TEXT, 31 | "DestStateName" TEXT, 32 | "DestWac" TEXT, 33 | "CRSDepTime" TEXT, 34 | "DepTime" TEXT, 35 | "DepDelay" TEXT, 36 | "DepDelayMinutes" TEXT, 37 | "DepDel15" TEXT, 38 | "DepartureDelayGroups" TEXT, 39 | "DepTimeBlk" TEXT, 40 | "TaxiOut" TEXT, 41 | "WheelsOff" TEXT, 42 | "WheelsOn" TEXT, 43 | "TaxiIn" TEXT, 44 | "CRSArrTime" TEXT, 45 | "ArrTime" TEXT, 46 | "ArrDelay" TEXT, 47 | "ArrDelayMinutes" TEXT, 48 | "ArrDel15" TEXT, 49 | "ArrivalDelayGroups" TEXT, 50 | "ArrTimeBlk" TEXT, 51 | "Cancelled" TEXT, 52 | "CancellationCode" TEXT, 53 | "Diverted" TEXT, 54 | "CRSElapsedTime" TEXT, 55 | "ActualElapsedTime" TEXT, 56 | "AirTime" TEXT, 57 | "Flights" TEXT, 58 | "Distance" TEXT, 59 | "DistanceGroup" TEXT, 60 | "CarrierDelay" TEXT, 61 | "WeatherDelay" TEXT, 62 | "NASDelay" TEXT, 63 | "SecurityDelay" TEXT, 64 | "LateAircraftDelay" TEXT, 65 | "FirstDepTime" TEXT, 66 | "TotalAddGTime" TEXT, 67 | "LongestAddGTime" TEXT, 68 | "DivAirportLandings" TEXT, 69 | "DivReachedDest" TEXT, 70 | "DivActualElapsedTime" TEXT, 71 | "DivArrDelay" TEXT, 72 | "DivDistance" TEXT, 73 | "Div1Airport" TEXT, 74 | "Div1AirportID" TEXT, 75 | "Div1AirportSeqID" TEXT, 76 | "Div1WheelsOn" TEXT, 77 | "Div1TotalGTime" TEXT, 78 | "Div1LongestGTime" TEXT, 79 | "Div1WheelsOff" TEXT, 80 | "Div1TailNum" TEXT, 81 | "Div2Airport" TEXT, 82 | "Div2AirportID" TEXT, 83 | "Div2AirportSeqID" TEXT, 84 | "Div2WheelsOn" TEXT, 85 | "Div2TotalGTime" TEXT, 86 | "Div2LongestGTime" TEXT, 87 | "Div2WheelsOff" TEXT, 88 | "Div2TailNum" TEXT, 89 | "Div3Airport" TEXT, 90 | "Div3AirportID" TEXT, 91 | "Div3AirportSeqID" TEXT, 92 | "Div3WheelsOn" TEXT, 93 | "Div3TotalGTime" TEXT, 94 | "Div3LongestGTime" TEXT, 95 | "Div3WheelsOff" TEXT, 96 | "Div3TailNum" TEXT, 97 | "Div4Airport" TEXT, 98 | "Div4AirportID" TEXT, 99 | "Div4AirportSeqID" TEXT, 100 | "Div4WheelsOn" TEXT, 101 | "Div4TotalGTime" TEXT, 102 | "Div4LongestGTime" TEXT, 103 | "Div4WheelsOff" TEXT, 104 | "Div4TailNum" TEXT, 105 | "Div5Airport" TEXT, 106 | "Div5AirportID" TEXT, 107 | "Div5AirportSeqID" TEXT, 108 | "Div5WheelsOn" TEXT, 109 | "Div5TotalGTime" TEXT, 110 | "Div5LongestGTime" TEXT, 111 | "Div5WheelsOff" TEXT, 112 | "Div5TailNum" TEXT, 113 | "junk" TEXT 114 | ); 115 | -------------------------------------------------------------------------------- /03_sqlstudio/create_views.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | PROJECT=$(gcloud config get-value project) 4 | cat create_views.sql | bq --project_id $PROJECT query --nouse_legacy_sql 5 | -------------------------------------------------------------------------------- /03_sqlstudio/create_views.sql: -------------------------------------------------------------------------------- 1 | CREATE OR REPLACE VIEW dsongcp.flights 2 | -- CREATE MATERIALIZED VIEW dsongcp.flights 3 | -- PARTITION BY DATE_TRUNC(FL_DATE, MONTH) 4 | AS 5 | SELECT 6 | FlightDate AS FL_DATE, 7 | Reporting_Airline AS UNIQUE_CARRIER, 8 | OriginAirportSeqID AS ORIGIN_AIRPORT_SEQ_ID, 9 | Origin AS ORIGIN, 10 | DestAirportSeqID AS DEST_AIRPORT_SEQ_ID, 11 | Dest AS DEST, 12 | CRSDepTime AS CRS_DEP_TIME, 
13 | DepTime AS DEP_TIME, 14 | CAST(DepDelay AS FLOAT64) AS DEP_DELAY, 15 | CAST(TaxiOut AS FLOAT64) AS TAXI_OUT, 16 | WheelsOff AS WHEELS_OFF, 17 | WheelsOn AS WHEELS_ON, 18 | CAST(TaxiIn AS FLOAT64) AS TAXI_IN, 19 | CRSArrTime AS CRS_ARR_TIME, 20 | ArrTime AS ARR_TIME, 21 | CAST(ArrDelay AS FLOAT64) AS ARR_DELAY, 22 | IF(Cancelled = '1.00', True, False) AS CANCELLED, 23 | IF(Diverted = '1.00', True, False) AS DIVERTED, 24 | DISTANCE 25 | FROM dsongcp.flights_raw; 26 | 27 | CREATE OR REPLACE VIEW dsongcp.delayed_10 AS 28 | SELECT * FROM dsongcp.flights WHERE dep_delay >= 10; 29 | 30 | CREATE OR REPLACE VIEW dsongcp.delayed_15 AS 31 | SELECT * FROM dsongcp.flights WHERE dep_delay >= 15; 32 | 33 | CREATE OR REPLACE VIEW dsongcp.delayed_20 AS 34 | SELECT * FROM dsongcp.flights WHERE dep_delay >= 20; 35 | 36 | 37 | -------------------------------------------------------------------------------- /04_streaming/.gitignore: -------------------------------------------------------------------------------- 1 | .* 2 | -------------------------------------------------------------------------------- /04_streaming/README.md: -------------------------------------------------------------------------------- 1 | # 4. Streaming data: publication and ingest 2 | 3 | ### Catch up until Chapter 3 if necessary 4 | * Go to the Storage section of the GCP web console and create a new bucket 5 | * Open CloudShell and git clone this repo: 6 | ``` 7 | git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp 8 | ``` 9 | * Then, run: 10 | ``` 11 | cd data-science-on-gcp/02_ingest 12 | ./ingest_from_crsbucket bucketname 13 | ``` 14 | * Run: 15 | ``` 16 | cd ../03_sqlstudio 17 | ./create_views.sh 18 | ``` 19 | 20 | ### Batch processing transformation in DataFlow 21 | * Setup: 22 | ``` 23 | cd transform; ./install_packages.sh 24 | ``` 25 | * Parsing airports data: 26 | ``` 27 | ./df01.py 28 | head extracted_airports-00000* 29 | rm extracted_airports-* 30 | ``` 31 | * Adding timezone information: 32 | ``` 33 | ./df02.py 34 | head airports_with_tz-00000* 35 | rm airports_with_tz-* 36 | ``` 37 | * Converting times to UTC: 38 | ``` 39 | ./df03.py 40 | head -3 all_flights-00000* 41 | ``` 42 | * Correcting dates: 43 | ``` 44 | ./df04.py 45 | head -3 all_flights-00000* 46 | rm all_flights-* 47 | ``` 48 | * Create events: 49 | ``` 50 | ./df05.py 51 | head -3 all_events-00000* 52 | rm all_events-* 53 | ``` 54 | * Read/write to Cloud: 55 | ``` 56 | ./stage_airports_file.sh BUCKETNAME 57 | ./df06.py --project PROJECT --bucket BUCKETNAME 58 | ``` 59 | Look for new tables in BigQuery (flights_simevents) 60 | * Run on Cloud: 61 | ``` 62 | ./df07.py --project PROJECT --bucket BUCKETNAME --region us-central1 63 | ``` 64 | * Go to the GCP web console and wait for the Dataflow ch04timecorr job to finish. It might take between 30 minutes and 2+ hours depending on the quota associated with your project (you can change the quota by going to https://console.cloud.google.com/iam-admin/quotas). 
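If you prefer to check on the job from the command line rather than the web console, something like this should list it (adjust the region if you launched the pipeline elsewhere):
```
gcloud dataflow jobs list --region us-central1 --status active
```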
65 | * Then, navigate to the BigQuery console and type in: 66 | ``` 67 | SELECT 68 | ORIGIN, 69 | DEP_TIME, 70 | DEST, 71 | ARR_TIME, 72 | ARR_DELAY, 73 | EVENT_TIME, 74 | EVENT_TYPE 75 | FROM 76 | dsongcp.flights_simevents 77 | WHERE 78 | (DEP_DELAY > 15 and ORIGIN = 'SEA') or 79 | (ARR_DELAY > 15 and DEST = 'SEA') 80 | ORDER BY EVENT_TIME ASC 81 | LIMIT 82 | 5 83 | 84 | ``` 85 | ### Simulate event stream 86 | * In CloudShell, run 87 | ``` 88 | cd simulate 89 | python3 ./simulate.py --startTime '2015-05-01 00:00:00 UTC' --endTime '2015-05-04 00:00:00 UTC' --speedFactor=30 --project $DEVSHELL_PROJECT_ID 90 | ``` 91 | 92 | ### Real-time Stream Processing 93 | * In another CloudShell tab, run avg01.py: 94 | ``` 95 | cd realtime 96 | ./avg01.py --project PROJECT --bucket BUCKETNAME --region us-central1 97 | ``` 98 | * In about a minute, you can query events from the BigQuery console: 99 | ``` 100 | SELECT * FROM dsongcp.streaming_events 101 | ORDER BY EVENT_TIME DESC 102 | LIMIT 5 103 | ``` 104 | * Stop avg01.py by hitting Ctrl+C 105 | * Run avg02.py: 106 | ``` 107 | ./avg02.py --project PROJECT --bucket BUCKETNAME --region us-central1 108 | ``` 109 | * In about 5 min, you can query from the BigQuery console: 110 | ``` 111 | SELECT * FROM dsongcp.streaming_delays 112 | ORDER BY END_TIME DESC 113 | LIMIT 5 114 | ``` 115 | * Look at how often the data is coming in: 116 | ``` 117 | SELECT END_TIME, num_flights 118 | FROM dsongcp.streaming_delays 119 | ORDER BY END_TIME DESC 120 | LIMIT 5 121 | ``` 122 | * It's likely that the pipeline will be stuck. You need to run this on Dataflow. 123 | * Stop avg02.py by hitting Ctrl+C 124 | * In BigQuery, truncate the table: 125 | ``` 126 | TRUNCATE TABLE dsongcp.streaming_delays 127 | ``` 128 | * Run avg03.py: 129 | ``` 130 | ./avg03.py --project PROJECT --bucket BUCKETNAME --region us-central1 131 | ``` 132 | * Go to the GCP web console in the Dataflow section and monitor the job. 133 | * Once the job starts writing to BigQuery, run this query and save this as a view: 134 | ``` 135 | SELECT * FROM dsongcp.streaming_delays 136 | WHERE AIRPORT = 'ATL' 137 | ORDER BY END_TIME DESC 138 | ``` 139 | * Create a view of the latest arrival delay by airport: 140 | ``` 141 | CREATE OR REPLACE VIEW dsongcp.airport_delays AS 142 | WITH delays AS ( 143 | SELECT d.*, a.LATITUDE, a.LONGITUDE 144 | FROM dsongcp.streaming_delays d 145 | JOIN dsongcp.airports a USING(AIRPORT) 146 | WHERE a.AIRPORT_IS_LATEST = 1 147 | ) 148 | 149 | SELECT 150 | AIRPORT, 151 | CONCAT(LATITUDE, ',', LONGITUDE) AS LOCATION, 152 | ARRAY_AGG( 153 | STRUCT(AVG_ARR_DELAY, AVG_DEP_DELAY, NUM_FLIGHTS, END_TIME) 154 | ORDER BY END_TIME DESC LIMIT 1) AS a 155 | FROM delays 156 | GROUP BY AIRPORT, LONGITUDE, LATITUDE 157 | 158 | ``` 159 | * Follow the steps in the chapter to connect to Data Studio and create a GeoMap. 160 | * Stop the simulation program in CloudShell. 161 | * From the GCP web console, stop the Dataflow streaming pipeline. 
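Alternatively, the streaming pipeline can be cancelled from the command line; look up the job ID of ch04avgdelay first and substitute it below (shown here assuming the job was launched in us-central1):
```
gcloud dataflow jobs list --region us-central1 --status active
gcloud dataflow jobs cancel JOB_ID --region us-central1
```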
162 | 163 | -------------------------------------------------------------------------------- /04_streaming/design/airport_schema.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "mode": "NULLABLE", 4 | "name": "AIRPORT_SEQ_ID", 5 | "type": "INTEGER" 6 | }, 7 | { 8 | "mode": "NULLABLE", 9 | "name": "AIRPORT_ID", 10 | "type": "INTEGER" 11 | }, 12 | { 13 | "mode": "NULLABLE", 14 | "name": "AIRPORT", 15 | "type": "STRING" 16 | }, 17 | { 18 | "mode": "NULLABLE", 19 | "name": "DISPLAY_AIRPORT_NAME", 20 | "type": "STRING" 21 | }, 22 | { 23 | "mode": "NULLABLE", 24 | "name": "DISPLAY_AIRPORT_CITY_NAME_FULL", 25 | "type": "STRING" 26 | }, 27 | { 28 | "mode": "NULLABLE", 29 | "name": "AIRPORT_WAC_SEQ_ID2", 30 | "type": "INTEGER" 31 | }, 32 | { 33 | "mode": "NULLABLE", 34 | "name": "AIRPORT_WAC", 35 | "type": "INTEGER" 36 | }, 37 | { 38 | "mode": "NULLABLE", 39 | "name": "AIRPORT_COUNTRY_NAME", 40 | "type": "STRING" 41 | }, 42 | { 43 | "mode": "NULLABLE", 44 | "name": "AIRPORT_COUNTRY_CODE_ISO", 45 | "type": "STRING" 46 | }, 47 | { 48 | "mode": "NULLABLE", 49 | "name": "AIRPORT_STATE_NAME", 50 | "type": "STRING" 51 | }, 52 | { 53 | "mode": "NULLABLE", 54 | "name": "AIRPORT_STATE_CODE", 55 | "type": "STRING" 56 | }, 57 | { 58 | "mode": "NULLABLE", 59 | "name": "AIRPORT_STATE_FIPS", 60 | "type": "INTEGER" 61 | }, 62 | { 63 | "mode": "NULLABLE", 64 | "name": "CITY_MARKET_SEQ_ID", 65 | "type": "INTEGER" 66 | }, 67 | { 68 | "mode": "NULLABLE", 69 | "name": "CITY_MARKET_ID", 70 | "type": "INTEGER" 71 | }, 72 | { 73 | "mode": "NULLABLE", 74 | "name": "DISPLAY_CITY_MARKET_NAME_FULL", 75 | "type": "STRING" 76 | }, 77 | { 78 | "mode": "NULLABLE", 79 | "name": "CITY_MARKET_WAC_SEQ_ID2", 80 | "type": "INTEGER" 81 | }, 82 | { 83 | "mode": "NULLABLE", 84 | "name": "CITY_MARKET_WAC", 85 | "type": "INTEGER" 86 | }, 87 | { 88 | "mode": "NULLABLE", 89 | "name": "LAT_DEGREES", 90 | "type": "INTEGER" 91 | }, 92 | { 93 | "mode": "NULLABLE", 94 | "name": "LAT_HEMISPHERE", 95 | "type": "STRING" 96 | }, 97 | { 98 | "mode": "NULLABLE", 99 | "name": "LAT_MINUTES", 100 | "type": "INTEGER" 101 | }, 102 | { 103 | "mode": "NULLABLE", 104 | "name": "LAT_SECONDS", 105 | "type": "INTEGER" 106 | }, 107 | { 108 | "mode": "NULLABLE", 109 | "name": "LATITUDE", 110 | "type": "FLOAT" 111 | }, 112 | { 113 | "mode": "NULLABLE", 114 | "name": "LON_DEGREES", 115 | "type": "INTEGER" 116 | }, 117 | { 118 | "mode": "NULLABLE", 119 | "name": "LON_HEMISPHERE", 120 | "type": "STRING" 121 | }, 122 | { 123 | "mode": "NULLABLE", 124 | "name": "LON_MINUTES", 125 | "type": "INTEGER" 126 | }, 127 | { 128 | "mode": "NULLABLE", 129 | "name": "LON_SECONDS", 130 | "type": "INTEGER" 131 | }, 132 | { 133 | "mode": "NULLABLE", 134 | "name": "LONGITUDE", 135 | "type": "FLOAT" 136 | }, 137 | { 138 | "mode": "NULLABLE", 139 | "name": "UTC_LOCAL_TIME_VARIATION", 140 | "type": "INTEGER" 141 | }, 142 | { 143 | "mode": "NULLABLE", 144 | "name": "AIRPORT_START_DATE", 145 | "type": "DATE" 146 | }, 147 | { 148 | "mode": "NULLABLE", 149 | "name": "AIRPORT_THRU_DATE", 150 | "type": "DATE" 151 | }, 152 | { 153 | "mode": "NULLABLE", 154 | "name": "AIRPORT_IS_CLOSED", 155 | "type": "INTEGER" 156 | }, 157 | { 158 | "mode": "NULLABLE", 159 | "name": "AIRPORT_IS_LATEST", 160 | "type": "INTEGER" 161 | }, 162 | { 163 | "mode": "NULLABLE", 164 | "name": "string_field_32", 165 | "type": "STRING" 166 | } 167 | ] 168 | -------------------------------------------------------------------------------- /04_streaming/design/mktbl.sh: 
-------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | bq mk --external_table_definition=./airport_schema.json@CSV=gs://data-science-on-gcp/edition2/raw/airports.csv dsongcp.airports_gcs 3 | -------------------------------------------------------------------------------- /04_streaming/design/queries.txt: -------------------------------------------------------------------------------- 1 | SELECT 2 | AIRPORT_SEQ_ID, AIRPORT_ID, AIRPORT, DISPLAY_AIRPORT_NAME, 3 | LAT_DEGREES, LAT_HEMISPHERE, LAT_MINUTES, LAT_SECONDS, LATITUDE 4 | FROM dsongcp.airports_gcs 5 | WHERE DISPLAY_AIRPORT_NAME LIKE '%Seattle%' 6 | 7 | 8 | SELECT 9 | AIRPORT, LATITUDE, LONGITUDE 10 | FROM dsongcp.airports_gcs 11 | WHERE AIRPORT_IS_LATEST = 1 AND AIRPORT = 'DFW' 12 | 13 | -------------------------------------------------------------------------------- /04_streaming/ingest_from_crsbucket.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ "$#" -ne 1 ]; then 4 | echo "Usage: ./ingest_from_crsbucket.sh destination-bucket-name" 5 | exit 6 | fi 7 | 8 | BUCKET=$1 9 | FROM=gs://data-science-on-gcp/edition2/flights/tzcorr 10 | TO=gs://$BUCKET/flights/tzcorr 11 | 12 | #sharded files 13 | CMD="gsutil -m cp " 14 | for SHARD in `seq -w 0 26`; do 15 | CMD="$CMD ${FROM}/all_flights-000${SHARD}-of-00026" 16 | done 17 | CMD="$CMD $TO" 18 | echo $CMD 19 | $CMD 20 | 21 | # load tzcorr into BigQuery 22 | PROJECT=$(gcloud config get-value project) 23 | bq --project_id $PROJECT \ 24 | load --source_format=NEWLINE_DELIMITED_JSON --autodetect ${PROJECT}:dsongcp.flights_tzcorr \ 25 | ${TO}/all_flights-* 26 | 27 | cd transform 28 | 29 | # airports.csv 30 | ./stage_airports_file.sh ${BUCKET} 31 | -------------------------------------------------------------------------------- /04_streaming/realtime/avg01.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2021 Google Inc. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 
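# avg01: reads the 'arrived' and 'departed' event streams from Pub/Sub, merges them,
# and writes the raw events to the BigQuery table dsongcp.streaming_events.
# The pipeline arguments specify the DirectRunner, so this runs locally (e.g. in CloudShell).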
16 | 17 | import apache_beam as beam 18 | import json 19 | 20 | 21 | def run(project, bucket, region): 22 | argv = [ 23 | '--project={0}'.format(project), 24 | '--job_name=ch04avgdelay', 25 | '--streaming', 26 | '--save_main_session', 27 | '--staging_location=gs://{0}/flights/staging/'.format(bucket), 28 | '--temp_location=gs://{0}/flights/temp/'.format(bucket), 29 | '--setup_file=./setup.py', 30 | '--autoscaling_algorithm=THROUGHPUT_BASED', 31 | '--max_num_workers=8', 32 | '--region={}'.format(region), 33 | '--runner=DirectRunner' 34 | ] 35 | 36 | with beam.Pipeline(argv=argv) as pipeline: 37 | events = {} 38 | 39 | for event_name in ['arrived', 'departed']: 40 | topic_name = "projects/{}/topics/{}".format(project, event_name) 41 | 42 | events[event_name] = (pipeline 43 | | 'read:{}'.format(event_name) >> beam.io.ReadFromPubSub(topic=topic_name) 44 | | 'parse:{}'.format(event_name) >> beam.Map(lambda s: json.loads(s)) 45 | ) 46 | 47 | all_events = (events['arrived'], events['departed']) | beam.Flatten() 48 | 49 | flights_schema = ','.join([ 50 | 'FL_DATE:date,UNIQUE_CARRIER:string,ORIGIN_AIRPORT_SEQ_ID:string,ORIGIN:string', 51 | 'DEST_AIRPORT_SEQ_ID:string,DEST:string,CRS_DEP_TIME:timestamp,DEP_TIME:timestamp', 52 | 'DEP_DELAY:float,TAXI_OUT:float,WHEELS_OFF:timestamp,WHEELS_ON:timestamp,TAXI_IN:float', 53 | 'CRS_ARR_TIME:timestamp,ARR_TIME:timestamp,ARR_DELAY:float,CANCELLED:boolean', 54 | 'DIVERTED:boolean,DISTANCE:float', 55 | 'DEP_AIRPORT_LAT:float,DEP_AIRPORT_LON:float,DEP_AIRPORT_TZOFFSET:float', 56 | 'ARR_AIRPORT_LAT:float,ARR_AIRPORT_LON:float,ARR_AIRPORT_TZOFFSET:float']) 57 | events_schema = ','.join([flights_schema, 'EVENT_TYPE:string,EVENT_TIME:timestamp']) 58 | 59 | schema = events_schema 60 | 61 | (all_events 62 | | 'bqout' >> beam.io.WriteToBigQuery( 63 | 'dsongcp.streaming_events', schema=schema, 64 | create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED 65 | ) 66 | ) 67 | 68 | 69 | if __name__ == '__main__': 70 | import argparse 71 | 72 | parser = argparse.ArgumentParser(description='Run pipeline on the cloud') 73 | parser.add_argument('-p', '--project', help='Unique project ID', required=True) 74 | parser.add_argument('-b', '--bucket', help='Bucket where gs://BUCKET/flights/airports/airports.csv.gz exists', 75 | required=True) 76 | parser.add_argument('-r', '--region', 77 | help='Region in which to run the Dataflow job. Choose the same region as your bucket.', 78 | required=True) 79 | 80 | args = vars(parser.parse_args()) 81 | 82 | run(project=args['project'], bucket=args['bucket'], region=args['region']) 83 | -------------------------------------------------------------------------------- /04_streaming/realtime/avg02.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2021 Google Inc. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 
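# avg02: like avg01, but keys each event by airport, applies a sliding window
# (60-minute windows every 5 minutes), and writes average arrival/departure delays
# per airport to dsongcp.streaming_delays. Still uses the DirectRunner.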
16 | 17 | import apache_beam as beam 18 | import logging 19 | import json 20 | import numpy as np 21 | 22 | DATETIME_FORMAT = '%Y-%m-%dT%H:%M:%S' 23 | 24 | 25 | def compute_stats(airport, events): 26 | arrived = [event['ARR_DELAY'] for event in events if event['EVENT_TYPE'] == 'arrived'] 27 | avg_arr_delay = float(np.mean(arrived)) if len(arrived) > 0 else None 28 | 29 | departed = [event['DEP_DELAY'] for event in events if event['EVENT_TYPE'] == 'departed'] 30 | avg_dep_delay = float(np.mean(departed)) if len(departed) > 0 else None 31 | 32 | num_flights = len(events) 33 | start_time = min([event['EVENT_TIME'] for event in events]) 34 | latest_time = max([event['EVENT_TIME'] for event in events]) 35 | 36 | return { 37 | 'AIRPORT': airport, 38 | 'AVG_ARR_DELAY': avg_arr_delay, 39 | 'AVG_DEP_DELAY': avg_dep_delay, 40 | 'NUM_FLIGHTS': num_flights, 41 | 'START_TIME': start_time, 42 | 'END_TIME': latest_time 43 | } 44 | 45 | 46 | def by_airport(event): 47 | if event['EVENT_TYPE'] == 'departed': 48 | return event['ORIGIN'], event 49 | else: 50 | return event['DEST'], event 51 | 52 | 53 | def run(project, bucket, region): 54 | argv = [ 55 | '--project={0}'.format(project), 56 | '--job_name=ch04avgdelay', 57 | '--streaming', 58 | '--save_main_session', 59 | '--staging_location=gs://{0}/flights/staging/'.format(bucket), 60 | '--temp_location=gs://{0}/flights/temp/'.format(bucket), 61 | '--autoscaling_algorithm=THROUGHPUT_BASED', 62 | '--max_num_workers=8', 63 | '--region={}'.format(region), 64 | '--runner=DirectRunner' 65 | ] 66 | 67 | with beam.Pipeline(argv=argv) as pipeline: 68 | events = {} 69 | 70 | for event_name in ['arrived', 'departed']: 71 | topic_name = "projects/{}/topics/{}".format(project, event_name) 72 | 73 | events[event_name] = (pipeline 74 | | 'read:{}'.format(event_name) >> beam.io.ReadFromPubSub( 75 | topic=topic_name, timestamp_attribute='EventTimeStamp') 76 | | 'parse:{}'.format(event_name) >> beam.Map(lambda s: json.loads(s)) 77 | ) 78 | 79 | all_events = (events['arrived'], events['departed']) | beam.Flatten() 80 | 81 | stats = (all_events 82 | | 'byairport' >> beam.Map(by_airport) 83 | | 'window' >> beam.WindowInto(beam.window.SlidingWindows(60 * 60, 5 * 60)) 84 | | 'group' >> beam.GroupByKey() 85 | | 'stats' >> beam.Map(lambda x: compute_stats(x[0], x[1])) 86 | ) 87 | 88 | stats_schema = ','.join(['AIRPORT:string,AVG_ARR_DELAY:float,AVG_DEP_DELAY:float', 89 | 'NUM_FLIGHTS:int64,START_TIME:timestamp,END_TIME:timestamp']) 90 | (stats 91 | | 'bqout' >> beam.io.WriteToBigQuery( 92 | 'dsongcp.streaming_delays', schema=stats_schema, 93 | create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED 94 | ) 95 | ) 96 | 97 | 98 | if __name__ == '__main__': 99 | import argparse 100 | 101 | parser = argparse.ArgumentParser(description='Run pipeline on the cloud') 102 | parser.add_argument('-p', '--project', help='Unique project ID', required=True) 103 | parser.add_argument('-b', '--bucket', help='Bucket where gs://BUCKET/flights/airports/airports.csv.gz exists', 104 | required=True) 105 | parser.add_argument('-r', '--region', 106 | help='Region in which to run the Dataflow job. 
Choose the same region as your bucket.', 107 | required=True) 108 | 109 | args = vars(parser.parse_args()) 110 | 111 | run(project=args['project'], bucket=args['bucket'], region=args['region']) 112 | -------------------------------------------------------------------------------- /04_streaming/realtime/avg03.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2021 Google Inc. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | import apache_beam as beam 18 | import logging 19 | import json 20 | import numpy as np 21 | 22 | DATETIME_FORMAT = '%Y-%m-%dT%H:%M:%S' 23 | 24 | 25 | def compute_stats(airport, events): 26 | arrived = [event['ARR_DELAY'] for event in events if event['EVENT_TYPE'] == 'arrived'] 27 | avg_arr_delay = float(np.mean(arrived)) if len(arrived) > 0 else None 28 | 29 | departed = [event['DEP_DELAY'] for event in events if event['EVENT_TYPE'] == 'departed'] 30 | avg_dep_delay = float(np.mean(departed)) if len(departed) > 0 else None 31 | 32 | num_flights = len(events) 33 | start_time = min([event['EVENT_TIME'] for event in events]) 34 | latest_time = max([event['EVENT_TIME'] for event in events]) 35 | 36 | return { 37 | 'AIRPORT': airport, 38 | 'AVG_ARR_DELAY': avg_arr_delay, 39 | 'AVG_DEP_DELAY': avg_dep_delay, 40 | 'NUM_FLIGHTS': num_flights, 41 | 'START_TIME': start_time, 42 | 'END_TIME': latest_time 43 | } 44 | 45 | 46 | def by_airport(event): 47 | if event['EVENT_TYPE'] == 'departed': 48 | return event['ORIGIN'], event 49 | else: 50 | return event['DEST'], event 51 | 52 | 53 | def run(project, bucket, region): 54 | argv = [ 55 | '--project={0}'.format(project), 56 | '--job_name=ch04avgdelay', 57 | '--streaming', 58 | '--save_main_session', 59 | '--staging_location=gs://{0}/flights/staging/'.format(bucket), 60 | '--temp_location=gs://{0}/flights/temp/'.format(bucket), 61 | '--autoscaling_algorithm=THROUGHPUT_BASED', 62 | '--max_num_workers=8', 63 | '--region={}'.format(region), 64 | '--runner=DataflowRunner' 65 | ] 66 | 67 | with beam.Pipeline(argv=argv) as pipeline: 68 | events = {} 69 | 70 | for event_name in ['arrived', 'departed']: 71 | topic_name = "projects/{}/topics/{}".format(project, event_name) 72 | 73 | events[event_name] = (pipeline 74 | | 'read:{}'.format(event_name) >> beam.io.ReadFromPubSub( 75 | topic=topic_name, timestamp_attribute='EventTimeStamp') 76 | | 'parse:{}'.format(event_name) >> beam.Map(lambda s: json.loads(s)) 77 | ) 78 | 79 | all_events = (events['arrived'], events['departed']) | beam.Flatten() 80 | 81 | stats = (all_events 82 | | 'byairport' >> beam.Map(by_airport) 83 | | 'window' >> beam.WindowInto(beam.window.SlidingWindows(60 * 60, 5 * 60)) 84 | | 'group' >> beam.GroupByKey() 85 | | 'stats' >> beam.Map(lambda x: compute_stats(x[0], x[1])) 86 | ) 87 | 88 | stats_schema = ','.join(['AIRPORT:string,AVG_ARR_DELAY:float,AVG_DEP_DELAY:float', 89 | 'NUM_FLIGHTS:int64,START_TIME:timestamp,END_TIME:timestamp']) 90 | (stats 91 
| | 'bqout' >> beam.io.WriteToBigQuery( 92 | 'dsongcp.streaming_delays', schema=stats_schema, 93 | create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED 94 | ) 95 | ) 96 | 97 | 98 | if __name__ == '__main__': 99 | import argparse 100 | 101 | parser = argparse.ArgumentParser(description='Run pipeline on the cloud') 102 | parser.add_argument('-p', '--project', help='Unique project ID', required=True) 103 | parser.add_argument('-b', '--bucket', help='Bucket where gs://BUCKET/flights/airports/airports.csv.gz exists', 104 | required=True) 105 | parser.add_argument('-r', '--region', 106 | help='Region in which to run the Dataflow job. Choose the same region as your bucket.', 107 | required=True) 108 | 109 | args = vars(parser.parse_args()) 110 | 111 | run(project=args['project'], bucket=args['bucket'], region=args['region']) 112 | -------------------------------------------------------------------------------- /04_streaming/simulate/.gitignore: -------------------------------------------------------------------------------- 1 | *-?????-of-????? 2 | *.egg-info 3 | .* 4 | -------------------------------------------------------------------------------- /04_streaming/simulate/airports.csv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GoogleCloudPlatform/data-science-on-gcp/652564b9feeeaab331ce27fdd672b8226ba1e837/04_streaming/simulate/airports.csv.gz -------------------------------------------------------------------------------- /04_streaming/simulate/simulate.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2016 Google Inc. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | import time 18 | import pytz 19 | import logging 20 | import argparse 21 | import datetime 22 | import google.cloud.pubsub_v1 as pubsub # Use v1 of the API 23 | import google.cloud.bigquery as bq 24 | 25 | TIME_FORMAT = '%Y-%m-%d %H:%M:%S %Z' 26 | RFC3339_TIME_FORMAT = '%Y-%m-%dT%H:%M:%S-00:00' 27 | 28 | def publish(publisher, topics, allevents, notify_time): 29 | timestamp = notify_time.strftime(RFC3339_TIME_FORMAT) 30 | for key in topics: # 'departed', 'arrived', etc. 
31 | topic = topics[key] 32 | events = allevents[key] 33 | # the client automatically batches 34 | logging.info('Publishing {} {} till {}'.format(len(events), key, timestamp)) 35 | for event_data in events: 36 | publisher.publish(topic, event_data.encode(), EventTimeStamp=timestamp) 37 | 38 | def notify(publisher, topics, rows, simStartTime, programStart, speedFactor): 39 | # sleep computation 40 | def compute_sleep_secs(notify_time): 41 | time_elapsed = (datetime.datetime.utcnow() - programStart).total_seconds() 42 | sim_time_elapsed = (notify_time - simStartTime).total_seconds() / speedFactor 43 | to_sleep_secs = sim_time_elapsed - time_elapsed 44 | return to_sleep_secs 45 | 46 | tonotify = {} 47 | for key in topics: 48 | tonotify[key] = list() 49 | 50 | for row in rows: 51 | event_type, notify_time, event_data = row 52 | 53 | # how much time should we sleep? 54 | if compute_sleep_secs(notify_time) > 1: 55 | # notify the accumulated tonotify 56 | publish(publisher, topics, tonotify, notify_time) 57 | for key in topics: 58 | tonotify[key] = list() 59 | 60 | # recompute sleep, since notification takes a while 61 | to_sleep_secs = compute_sleep_secs(notify_time) 62 | if to_sleep_secs > 0: 63 | logging.info('Sleeping {} seconds'.format(to_sleep_secs)) 64 | time.sleep(to_sleep_secs) 65 | tonotify[event_type].append(event_data) 66 | 67 | # left-over records; notify again 68 | publish(publisher, topics, tonotify, notify_time) 69 | 70 | 71 | if __name__ == '__main__': 72 | parser = argparse.ArgumentParser(description='Send simulated flight events to Cloud Pub/Sub') 73 | parser.add_argument('--startTime', help='Example: 2015-05-01 00:00:00 UTC', required=True) 74 | parser.add_argument('--endTime', help='Example: 2015-05-03 00:00:00 UTC', required=True) 75 | parser.add_argument('--project', help='your project id, to create pubsub topic', required=True) 76 | parser.add_argument('--speedFactor', help='Example: 60 implies 1 hour of data sent to Cloud Pub/Sub in 1 minute', required=True, type=float) 77 | parser.add_argument('--jitter', help='type of jitter to add: None, uniform, exp are the three options', default='None') 78 | 79 | # set up BigQuery bqclient 80 | logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.INFO) 81 | args = parser.parse_args() 82 | bqclient = bq.Client(args.project) 83 | bqclient.get_table('dsongcp.flights_simevents') # throws exception on failure 84 | 85 | # jitter? 
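    # Three jitter options, used to shift each event's NOTIFY_TIME and mimic
    # real-world notification latency:
    #   exp     : -LN(RAND()*0.99 + 0.01)*30 + 90.5, i.e. a roughly exponential
    #             extra delay (about 30 s on average) on top of a ~90 s baseline
    #   uniform : 90.5 + RAND()*30, i.e. uniformly between about 90 and 120 s
    #   None    : 0 seconds (events are notified at their recorded event time)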
86 | if args.jitter == 'exp': 87 | jitter = 'CAST (-LN(RAND()*0.99 + 0.01)*30 + 90.5 AS INT64)' 88 | elif args.jitter == 'uniform': 89 | jitter = 'CAST(90.5 + RAND()*30 AS INT64)' 90 | else: 91 | jitter = '0' 92 | 93 | 94 | # run the query to pull simulated events 95 | querystr = """ 96 | SELECT 97 | EVENT_TYPE, 98 | TIMESTAMP_ADD(EVENT_TIME, INTERVAL @jitter SECOND) AS NOTIFY_TIME, 99 | EVENT_DATA 100 | FROM 101 | dsongcp.flights_simevents 102 | WHERE 103 | EVENT_TIME >= @startTime 104 | AND EVENT_TIME < @endTime 105 | ORDER BY 106 | EVENT_TIME ASC 107 | """ 108 | job_config = bq.QueryJobConfig( 109 | query_parameters=[ 110 | bq.ScalarQueryParameter("jitter", "INT64", jitter), 111 | bq.ScalarQueryParameter("startTime", "TIMESTAMP", args.startTime), 112 | bq.ScalarQueryParameter("endTime", "TIMESTAMP", args.endTime), 113 | ] 114 | ) 115 | rows = bqclient.query(querystr, job_config=job_config) 116 | 117 | # create one Pub/Sub notification topic for each type of event 118 | publisher = pubsub.PublisherClient() 119 | topics = {} 120 | for event_type in ['wheelsoff', 'arrived', 'departed']: 121 | topics[event_type] = publisher.topic_path(args.project, event_type) 122 | try: 123 | publisher.get_topic(topic=topics[event_type]) 124 | logging.info("Already exists: {}".format(topics[event_type])) 125 | except: 126 | logging.info("Creating {}".format(topics[event_type])) 127 | publisher.create_topic(name=topics[event_type]) 128 | 129 | 130 | # notify about each row in the dataset 131 | programStartTime = datetime.datetime.utcnow() 132 | simStartTime = datetime.datetime.strptime(args.startTime, TIME_FORMAT).replace(tzinfo=pytz.UTC) 133 | logging.info('Simulation start time is {}'.format(simStartTime)) 134 | notify(publisher, topics, rows, simStartTime, programStartTime, args.speedFactor) 135 | -------------------------------------------------------------------------------- /04_streaming/simulate/simulate_may2015.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | python3 simulate.py --project $(gcloud config get-value project) --startTime '2015-05-01 00:00:00 UTC' --endTime '2015-06-01 00:00:00 UTC' --speedFactor 30 3 | -------------------------------------------------------------------------------- /04_streaming/transform/airports.csv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GoogleCloudPlatform/data-science-on-gcp/652564b9feeeaab331ce27fdd672b8226ba1e837/04_streaming/transform/airports.csv.gz -------------------------------------------------------------------------------- /04_streaming/transform/bqsample.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if test "$#" -ne 1; then 4 | echo "Usage: ./bqsample.sh bucket-name" 5 | echo " eg: ./bqsample.sh cloud-training-demos-ml" 6 | exit 7 | fi 8 | 9 | BUCKET=$1 10 | PROJECT=$(gcloud config get-value project) 11 | 12 | bq --project_id=$PROJECT query --destination_table dsongcp.flights_sample --replace --nouse_legacy_sql \ 13 | 'SELECT * FROM dsongcp.flights WHERE RAND() < 0.001' 14 | 15 | bq --project_id=$PROJECT extract --destination_format=NEWLINE_DELIMITED_JSON \ 16 | dsongcp.flights_sample gs://${BUCKET}/flights/ch4/flights_sample.json 17 | 18 | gsutil cp gs://${BUCKET}/flights/ch4/flights_sample.json . 
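# The sampled file downloaded above (flights_sample.json), together with the
# airports.csv.gz checked into this directory, provides the local inputs that
# the DirectRunner pipelines df01.py-df05.py read; the full dataset is processed
# on Dataflow later by df07.py.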
19 | -------------------------------------------------------------------------------- /04_streaming/transform/df01.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2019 Google Inc. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | import apache_beam as beam 18 | import csv 19 | 20 | if __name__ == '__main__': 21 | with beam.Pipeline('DirectRunner') as pipeline: 22 | airports = (pipeline 23 | | beam.io.ReadFromText('airports.csv.gz') 24 | | beam.Map(lambda line: next(csv.reader([line]))) 25 | | beam.Map(lambda fields: (fields[0], (fields[21], fields[26]))) 26 | ) 27 | 28 | (airports 29 | | beam.Map(lambda airport_data: '{},{}'.format(airport_data[0], ','.join(airport_data[1]))) 30 | | beam.io.WriteToText('extracted_airports') 31 | ) 32 | -------------------------------------------------------------------------------- /04_streaming/transform/df02.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2016 Google Inc. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | import apache_beam as beam 18 | import csv 19 | 20 | 21 | def addtimezone(lat, lon): 22 | try: 23 | import timezonefinder 24 | tf = timezonefinder.TimezoneFinder() 25 | tz = tf.timezone_at(lng=float(lon), lat=float(lat)) 26 | if tz is None: 27 | tz = 'UTC' 28 | return lat, lon, tz 29 | except ValueError: 30 | return lat, lon, 'TIMEZONE' # header 31 | 32 | 33 | if __name__ == '__main__': 34 | with beam.Pipeline('DirectRunner') as pipeline: 35 | airports = (pipeline 36 | | beam.io.ReadFromText('airports.csv.gz') 37 | | beam.Filter(lambda line: "United States" in line) 38 | | beam.Map(lambda line: next(csv.reader([line]))) 39 | | beam.Map(lambda fields: (fields[0], addtimezone(fields[21], fields[26]))) 40 | ) 41 | 42 | airports | beam.Map(lambda f: '{},{}'.format(f[0], ','.join(f[1]))) | beam.io.textio.WriteToText( 43 | 'airports_with_tz') 44 | -------------------------------------------------------------------------------- /04_streaming/transform/df03.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2016 Google Inc. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 
7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | import apache_beam as beam 18 | import logging 19 | import csv 20 | import json 21 | 22 | 23 | def addtimezone(lat, lon): 24 | try: 25 | import timezonefinder 26 | tf = timezonefinder.TimezoneFinder() 27 | return lat, lon, tf.timezone_at(lng=float(lon), lat=float(lat)) 28 | # return (lat, lon, 'America/Los_Angeles') # FIXME 29 | except ValueError: 30 | return lat, lon, 'TIMEZONE' # header 31 | 32 | 33 | def as_utc(date, hhmm, tzone): 34 | try: 35 | if len(hhmm) > 0 and tzone is not None: 36 | import datetime, pytz 37 | loc_tz = pytz.timezone(tzone) 38 | loc_dt = loc_tz.localize(datetime.datetime.strptime(date, '%Y-%m-%d'), is_dst=False) 39 | # can't just parse hhmm because the data contains 2400 and the like ... 40 | loc_dt += datetime.timedelta(hours=int(hhmm[:2]), minutes=int(hhmm[2:])) 41 | utc_dt = loc_dt.astimezone(pytz.utc) 42 | return utc_dt.strftime('%Y-%m-%d %H:%M:%S') 43 | else: 44 | return '' # empty string corresponds to canceled flights 45 | except ValueError as e: 46 | logging.exception('{} {} {}'.format(date, hhmm, tzone)) 47 | raise e 48 | 49 | 50 | def tz_correct(line, airport_timezones): 51 | fields = json.loads(line) 52 | try: 53 | # convert all times to UTC 54 | dep_airport_id = fields["ORIGIN_AIRPORT_SEQ_ID"] 55 | arr_airport_id = fields["DEST_AIRPORT_SEQ_ID"] 56 | dep_timezone = airport_timezones[dep_airport_id][2] 57 | arr_timezone = airport_timezones[arr_airport_id][2] 58 | 59 | for f in ["CRS_DEP_TIME", "DEP_TIME", "WHEELS_OFF"]: 60 | fields[f] = as_utc(fields["FL_DATE"], fields[f], dep_timezone) 61 | for f in ["WHEELS_ON", "CRS_ARR_TIME", "ARR_TIME"]: 62 | fields[f] = as_utc(fields["FL_DATE"], fields[f], arr_timezone) 63 | 64 | yield json.dumps(fields) 65 | except KeyError as e: 66 | logging.exception(" Ignoring " + line + " because airport is not known") 67 | 68 | 69 | if __name__ == '__main__': 70 | with beam.Pipeline('DirectRunner') as pipeline: 71 | airports = (pipeline 72 | | 'airports:read' >> beam.io.ReadFromText('airports.csv.gz') 73 | | beam.Filter(lambda line: "United States" in line) 74 | | 'airports:fields' >> beam.Map(lambda line: next(csv.reader([line]))) 75 | | 'airports:tz' >> beam.Map(lambda fields: (fields[0], addtimezone(fields[21], fields[26]))) 76 | ) 77 | 78 | flights = (pipeline 79 | | 'flights:read' >> beam.io.ReadFromText('flights_sample.json') 80 | | 'flights:tzcorr' >> beam.FlatMap(tz_correct, beam.pvalue.AsDict(airports)) 81 | ) 82 | 83 | flights | beam.io.textio.WriteToText('all_flights') 84 | -------------------------------------------------------------------------------- /04_streaming/transform/df04.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2016 Google Inc. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 
7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | import apache_beam as beam 18 | import logging 19 | import csv 20 | import json 21 | 22 | 23 | def addtimezone(lat, lon): 24 | try: 25 | import timezonefinder 26 | tf = timezonefinder.TimezoneFinder() 27 | lat = float(lat) 28 | lon = float(lon) 29 | return lat, lon, tf.timezone_at(lng=lon, lat=lat) 30 | except ValueError: 31 | return lat, lon, 'TIMEZONE' # header 32 | 33 | 34 | def as_utc(date, hhmm, tzone): 35 | """ 36 | Returns date corrected for timezone, and the tzoffset 37 | """ 38 | try: 39 | if len(hhmm) > 0 and tzone is not None: 40 | import datetime, pytz 41 | loc_tz = pytz.timezone(tzone) 42 | loc_dt = loc_tz.localize(datetime.datetime.strptime(date, '%Y-%m-%d'), is_dst=False) 43 | # can't just parse hhmm because the data contains 2400 and the like ... 44 | loc_dt += datetime.timedelta(hours=int(hhmm[:2]), minutes=int(hhmm[2:])) 45 | utc_dt = loc_dt.astimezone(pytz.utc) 46 | return utc_dt.strftime('%Y-%m-%d %H:%M:%S'), loc_dt.utcoffset().total_seconds() 47 | else: 48 | return '', 0 # empty string corresponds to canceled flights 49 | except ValueError as e: 50 | logging.exception('{} {} {}'.format(date, hhmm, tzone)) 51 | raise e 52 | 53 | 54 | def add_24h_if_before(arrtime, deptime): 55 | import datetime 56 | if len(arrtime) > 0 and len(deptime) > 0 and arrtime < deptime: 57 | adt = datetime.datetime.strptime(arrtime, '%Y-%m-%d %H:%M:%S') 58 | adt += datetime.timedelta(hours=24) 59 | return adt.strftime('%Y-%m-%d %H:%M:%S') 60 | else: 61 | return arrtime 62 | 63 | 64 | def tz_correct(line, airport_timezones): 65 | fields = json.loads(line) 66 | try: 67 | # convert all times to UTC 68 | dep_airport_id = fields["ORIGIN_AIRPORT_SEQ_ID"] 69 | arr_airport_id = fields["DEST_AIRPORT_SEQ_ID"] 70 | dep_timezone = airport_timezones[dep_airport_id][2] 71 | arr_timezone = airport_timezones[arr_airport_id][2] 72 | 73 | for f in ["CRS_DEP_TIME", "DEP_TIME", "WHEELS_OFF"]: 74 | fields[f], deptz = as_utc(fields["FL_DATE"], fields[f], dep_timezone) 75 | for f in ["WHEELS_ON", "CRS_ARR_TIME", "ARR_TIME"]: 76 | fields[f], arrtz = as_utc(fields["FL_DATE"], fields[f], arr_timezone) 77 | 78 | for f in ["WHEELS_OFF", "WHEELS_ON", "CRS_ARR_TIME", "ARR_TIME"]: 79 | fields[f] = add_24h_if_before(fields[f], fields["DEP_TIME"]) 80 | 81 | fields["DEP_AIRPORT_LAT"] = airport_timezones[dep_airport_id][0] 82 | fields["DEP_AIRPORT_LON"] = airport_timezones[dep_airport_id][1] 83 | fields["DEP_AIRPORT_TZOFFSET"] = deptz 84 | fields["ARR_AIRPORT_LAT"] = airport_timezones[arr_airport_id][0] 85 | fields["ARR_AIRPORT_LON"] = airport_timezones[arr_airport_id][1] 86 | fields["ARR_AIRPORT_TZOFFSET"] = arrtz 87 | yield json.dumps(fields) 88 | except KeyError as e: 89 | logging.exception(" Ignoring " + line + " because airport is not known") 90 | 91 | 92 | if __name__ == '__main__': 93 | with beam.Pipeline('DirectRunner') as pipeline: 94 | airports = (pipeline 95 | | 'airports:read' >> beam.io.ReadFromText('airports.csv.gz') 96 | | beam.Filter(lambda line: "United States" in line) 97 | | 'airports:fields' >> beam.Map(lambda line: next(csv.reader([line]))) 98 | | 
'airports:tz' >> beam.Map(lambda fields: (fields[0], addtimezone(fields[21], fields[26]))) 99 | ) 100 | 101 | flights = (pipeline 102 | | 'flights:read' >> beam.io.ReadFromText('flights_sample.json') 103 | | 'flights:tzcorr' >> beam.FlatMap(tz_correct, beam.pvalue.AsDict(airports)) 104 | ) 105 | 106 | flights | beam.io.textio.WriteToText('all_flights') 107 | -------------------------------------------------------------------------------- /04_streaming/transform/df05.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2016 Google Inc. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | import apache_beam as beam 18 | import logging 19 | import csv 20 | import json 21 | 22 | DATETIME_FORMAT = '%Y-%m-%d %H:%M:%S' 23 | 24 | 25 | def addtimezone(lat, lon): 26 | try: 27 | import timezonefinder 28 | tf = timezonefinder.TimezoneFinder() 29 | lat = float(lat) 30 | lon = float(lon) 31 | return lat, lon, tf.timezone_at(lng=lon, lat=lat) 32 | except ValueError: 33 | return lat, lon, 'TIMEZONE' # header 34 | 35 | 36 | def as_utc(date, hhmm, tzone): 37 | """ 38 | Returns date corrected for timezone, and the tzoffset 39 | """ 40 | try: 41 | if len(hhmm) > 0 and tzone is not None: 42 | import datetime, pytz 43 | loc_tz = pytz.timezone(tzone) 44 | loc_dt = loc_tz.localize(datetime.datetime.strptime(date, '%Y-%m-%d'), is_dst=False) 45 | # can't just parse hhmm because the data contains 2400 and the like ... 
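            # (adding HHMM to local midnight as a timedelta handles that case:
            #  a value of 2400 simply rolls over into the next day)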
46 | loc_dt += datetime.timedelta(hours=int(hhmm[:2]), minutes=int(hhmm[2:])) 47 | utc_dt = loc_dt.astimezone(pytz.utc) 48 | return utc_dt.strftime(DATETIME_FORMAT), loc_dt.utcoffset().total_seconds() 49 | else: 50 | return '', 0 # empty string corresponds to canceled flights 51 | except ValueError as e: 52 | logging.exception('{} {} {}'.format(date, hhmm, tzone)) 53 | raise e 54 | 55 | 56 | def add_24h_if_before(arrtime, deptime): 57 | import datetime 58 | if len(arrtime) > 0 and len(deptime) > 0 and arrtime < deptime: 59 | adt = datetime.datetime.strptime(arrtime, DATETIME_FORMAT) 60 | adt += datetime.timedelta(hours=24) 61 | return adt.strftime(DATETIME_FORMAT) 62 | else: 63 | return arrtime 64 | 65 | 66 | def tz_correct(fields, airport_timezones): 67 | try: 68 | # convert all times to UTC 69 | dep_airport_id = fields["ORIGIN_AIRPORT_SEQ_ID"] 70 | arr_airport_id = fields["DEST_AIRPORT_SEQ_ID"] 71 | dep_timezone = airport_timezones[dep_airport_id][2] 72 | arr_timezone = airport_timezones[arr_airport_id][2] 73 | 74 | for f in ["CRS_DEP_TIME", "DEP_TIME", "WHEELS_OFF"]: 75 | fields[f], deptz = as_utc(fields["FL_DATE"], fields[f], dep_timezone) 76 | for f in ["WHEELS_ON", "CRS_ARR_TIME", "ARR_TIME"]: 77 | fields[f], arrtz = as_utc(fields["FL_DATE"], fields[f], arr_timezone) 78 | 79 | for f in ["WHEELS_OFF", "WHEELS_ON", "CRS_ARR_TIME", "ARR_TIME"]: 80 | fields[f] = add_24h_if_before(fields[f], fields["DEP_TIME"]) 81 | 82 | fields["DEP_AIRPORT_LAT"] = airport_timezones[dep_airport_id][0] 83 | fields["DEP_AIRPORT_LON"] = airport_timezones[dep_airport_id][1] 84 | fields["DEP_AIRPORT_TZOFFSET"] = deptz 85 | fields["ARR_AIRPORT_LAT"] = airport_timezones[arr_airport_id][0] 86 | fields["ARR_AIRPORT_LON"] = airport_timezones[arr_airport_id][1] 87 | fields["ARR_AIRPORT_TZOFFSET"] = arrtz 88 | yield fields 89 | except KeyError as e: 90 | logging.exception(f"Ignoring {fields} because airport is not known") 91 | 92 | 93 | def get_next_event(fields): 94 | if len(fields["DEP_TIME"]) > 0: 95 | event = dict(fields) # copy 96 | event["EVENT_TYPE"] = "departed" 97 | event["EVENT_TIME"] = fields["DEP_TIME"] 98 | for f in ["TAXI_OUT", "WHEELS_OFF", "WHEELS_ON", "TAXI_IN", "ARR_TIME", "ARR_DELAY", "DISTANCE"]: 99 | event.pop(f, None) # not knowable at departure time 100 | yield event 101 | if len(fields["ARR_TIME"]) > 0: 102 | event = dict(fields) 103 | event["EVENT_TYPE"] = "arrived" 104 | event["EVENT_TIME"] = fields["ARR_TIME"] 105 | yield event 106 | 107 | 108 | def run(): 109 | with beam.Pipeline('DirectRunner') as pipeline: 110 | airports = (pipeline 111 | | 'airports:read' >> beam.io.ReadFromText('airports.csv.gz') 112 | | beam.Filter(lambda line: "United States" in line) 113 | | 'airports:fields' >> beam.Map(lambda line: next(csv.reader([line]))) 114 | | 'airports:tz' >> beam.Map(lambda fields: (fields[0], addtimezone(fields[21], fields[26]))) 115 | ) 116 | 117 | flights = (pipeline 118 | | 'flights:read' >> beam.io.ReadFromText('flights_sample.json') 119 | | 'flights:parse' >> beam.Map(lambda line: json.loads(line)) 120 | | 'flights:tzcorr' >> beam.FlatMap(tz_correct, beam.pvalue.AsDict(airports)) 121 | ) 122 | 123 | (flights 124 | | 'flights:tostring' >> beam.Map(lambda fields: json.dumps(fields)) 125 | | 'flights:out' >> beam.io.textio.WriteToText('all_flights') 126 | ) 127 | 128 | events = flights | beam.FlatMap(get_next_event) 129 | 130 | (events 131 | | 'events:tostring' >> beam.Map(lambda fields: json.dumps(fields)) 132 | | 'events:out' >> beam.io.textio.WriteToText('all_events') 133 | ) 134 | 
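# Like df01.py-df04.py, this pipeline runs locally on DirectRunner against the
# sampled inputs (airports.csv.gz and flights_sample.json). It writes the
# time-corrected flight records to sharded text files named all_flights-* and
# the derived 'departed'/'arrived' events to all_events-*.
# A minimal local run (a sketch; assumes bqsample.sh has already produced
# flights_sample.json in this directory):
#
#   python3 df05.py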
135 | 136 | if __name__ == '__main__': 137 | run() 138 | -------------------------------------------------------------------------------- /04_streaming/transform/df06.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2016 Google Inc. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | import apache_beam as beam 18 | import logging 19 | import csv 20 | import json 21 | 22 | 23 | DATETIME_FORMAT = '%Y-%m-%dT%H:%M:%S' 24 | 25 | 26 | def addtimezone(lat, lon): 27 | try: 28 | import timezonefinder 29 | tf = timezonefinder.TimezoneFinder() 30 | lat = float(lat) 31 | lon = float(lon) 32 | return lat, lon, tf.timezone_at(lng=lon, lat=lat) 33 | except ValueError: 34 | return lat, lon, 'TIMEZONE' # header 35 | 36 | 37 | def as_utc(date, hhmm, tzone): 38 | """ 39 | Returns date corrected for timezone, and the tzoffset 40 | """ 41 | try: 42 | if len(hhmm) > 0 and tzone is not None: 43 | import datetime, pytz 44 | loc_tz = pytz.timezone(tzone) 45 | loc_dt = loc_tz.localize(datetime.datetime.strptime(date, '%Y-%m-%d'), is_dst=False) 46 | # can't just parse hhmm because the data contains 2400 and the like ... 47 | loc_dt += datetime.timedelta(hours=int(hhmm[:2]), minutes=int(hhmm[2:])) 48 | utc_dt = loc_dt.astimezone(pytz.utc) 49 | return utc_dt.strftime(DATETIME_FORMAT), loc_dt.utcoffset().total_seconds() 50 | else: 51 | return '', 0 # empty string corresponds to canceled flights 52 | except ValueError as e: 53 | logging.exception('{} {} {}'.format(date, hhmm, tzone)) 54 | raise e 55 | 56 | 57 | def add_24h_if_before(arrtime, deptime): 58 | import datetime 59 | if len(arrtime) > 0 and len(deptime) > 0 and arrtime < deptime: 60 | adt = datetime.datetime.strptime(arrtime, DATETIME_FORMAT) 61 | adt += datetime.timedelta(hours=24) 62 | return adt.strftime(DATETIME_FORMAT) 63 | else: 64 | return arrtime 65 | 66 | 67 | def tz_correct(fields, airport_timezones): 68 | fields['FL_DATE'] = fields['FL_DATE'].strftime('%Y-%m-%d') # convert to a string so JSON code works 69 | try: 70 | # convert all times to UTC 71 | dep_airport_id = fields["ORIGIN_AIRPORT_SEQ_ID"] 72 | arr_airport_id = fields["DEST_AIRPORT_SEQ_ID"] 73 | 74 | dep_timezone = airport_timezones[dep_airport_id][2] 75 | arr_timezone = airport_timezones[arr_airport_id][2] 76 | 77 | for f in ["CRS_DEP_TIME", "DEP_TIME", "WHEELS_OFF"]: 78 | fields[f], deptz = as_utc(fields["FL_DATE"], fields[f], dep_timezone) 79 | for f in ["WHEELS_ON", "CRS_ARR_TIME", "ARR_TIME"]: 80 | fields[f], arrtz = as_utc(fields["FL_DATE"], fields[f], arr_timezone) 81 | 82 | for f in ["WHEELS_OFF", "WHEELS_ON", "CRS_ARR_TIME", "ARR_TIME"]: 83 | fields[f] = add_24h_if_before(fields[f], fields["DEP_TIME"]) 84 | 85 | fields["DEP_AIRPORT_LAT"] = airport_timezones[dep_airport_id][0] 86 | fields["DEP_AIRPORT_LON"] = airport_timezones[dep_airport_id][1] 87 | fields["DEP_AIRPORT_TZOFFSET"] = deptz 88 | fields["ARR_AIRPORT_LAT"] = airport_timezones[arr_airport_id][0] 
89 | fields["ARR_AIRPORT_LON"] = airport_timezones[arr_airport_id][1] 90 | fields["ARR_AIRPORT_TZOFFSET"] = arrtz 91 | yield fields 92 | except KeyError: 93 | #logging.exception(f"Ignoring {fields} because airport is not known") 94 | pass 95 | 96 | except KeyError: 97 | logging.exception("Ignoring field because airport is not known") 98 | 99 | 100 | def get_next_event(fields): 101 | if len(fields["DEP_TIME"]) > 0: 102 | event = dict(fields) # copy 103 | event["EVENT_TYPE"] = "departed" 104 | event["EVENT_TIME"] = fields["DEP_TIME"] 105 | for f in ["TAXI_OUT", "WHEELS_OFF", "WHEELS_ON", "TAXI_IN", "ARR_TIME", "ARR_DELAY", "DISTANCE"]: 106 | event.pop(f, None) # not knowable at departure time 107 | yield event 108 | if len(fields["WHEELS_OFF"]) > 0: 109 | event = dict(fields) # copy 110 | event["EVENT_TYPE"] = "wheelsoff" 111 | event["EVENT_TIME"] = fields["WHEELS_OFF"] 112 | for f in ["WHEELS_ON", "TAXI_IN", "ARR_TIME", "ARR_DELAY", "DISTANCE"]: 113 | event.pop(f, None) # not knowable at departure time 114 | yield event 115 | if len(fields["ARR_TIME"]) > 0: 116 | event = dict(fields) 117 | event["EVENT_TYPE"] = "arrived" 118 | event["EVENT_TIME"] = fields["ARR_TIME"] 119 | yield event 120 | 121 | 122 | def create_event_row(fields): 123 | featdict = dict(fields) # copy 124 | featdict['EVENT_DATA'] = json.dumps(fields) 125 | return featdict 126 | 127 | 128 | def run(project, bucket): 129 | argv = [ 130 | '--project={0}'.format(project), 131 | '--staging_location=gs://{0}/flights/staging/'.format(bucket), 132 | '--temp_location=gs://{0}/flights/temp/'.format(bucket), 133 | '--runner=DirectRunner' 134 | ] 135 | airports_filename = 'gs://{}/flights/airports/airports.csv.gz'.format(bucket) 136 | flights_output = 'gs://{}/flights/tzcorr/all_flights'.format(bucket) 137 | 138 | with beam.Pipeline(argv=argv) as pipeline: 139 | airports = (pipeline 140 | | 'airports:read' >> beam.io.ReadFromText(airports_filename) 141 | | beam.Filter(lambda line: "United States" in line) 142 | | 'airports:fields' >> beam.Map(lambda line: next(csv.reader([line]))) 143 | | 'airports:tz' >> beam.Map(lambda fields: (fields[0], addtimezone(fields[21], fields[26]))) 144 | ) 145 | 146 | flights = (pipeline 147 | | 'flights:read' >> beam.io.ReadFromBigQuery( 148 | query='SELECT * FROM dsongcp.flights WHERE rand() < 0.001', use_standard_sql=True) 149 | | 'flights:tzcorr' >> beam.FlatMap(tz_correct, beam.pvalue.AsDict(airports)) 150 | ) 151 | 152 | (flights 153 | | 'flights:tostring' >> beam.Map(lambda fields: json.dumps(fields)) 154 | | 'flights:gcsout' >> beam.io.textio.WriteToText(flights_output) 155 | ) 156 | 157 | flights_schema = ','.join([ 158 | 'FL_DATE:date', 159 | 'UNIQUE_CARRIER:string', 160 | 'ORIGIN_AIRPORT_SEQ_ID:string', 161 | 'ORIGIN:string', 162 | 'DEST_AIRPORT_SEQ_ID:string', 163 | 'DEST:string', 164 | 'CRS_DEP_TIME:timestamp', 165 | 'DEP_TIME:timestamp', 166 | 'DEP_DELAY:float', 167 | 'TAXI_OUT:float', 168 | 'WHEELS_OFF:timestamp', 169 | 'WHEELS_ON:timestamp', 170 | 'TAXI_IN:float', 171 | 'CRS_ARR_TIME:timestamp', 172 | 'ARR_TIME:timestamp', 173 | 'ARR_DELAY:float', 174 | 'CANCELLED:boolean', 175 | 'DIVERTED:boolean', 176 | 'DISTANCE:float', 177 | 'DEP_AIRPORT_LAT:float', 178 | 'DEP_AIRPORT_LON:float', 179 | 'DEP_AIRPORT_TZOFFSET:float', 180 | 'ARR_AIRPORT_LAT:float', 181 | 'ARR_AIRPORT_LON:float', 182 | 'ARR_AIRPORT_TZOFFSET:float', 183 | 'Year:string']) 184 | 185 | # autodetect on JSON works, but is less reliable 186 | #flights_schema = 'SCHEMA_AUTODETECT' 187 | 188 | (flights 189 | | 'flights:bqout' >> 
beam.io.WriteToBigQuery( 190 | 'dsongcp.flights_tzcorr', 191 | schema=flights_schema, 192 | write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE, 193 | create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED 194 | ) 195 | ) 196 | 197 | events = flights | beam.FlatMap(get_next_event) 198 | events_schema = ','.join([flights_schema, 'EVENT_TYPE:string,EVENT_TIME:timestamp,EVENT_DATA:string']) 199 | 200 | (events 201 | | 'events:totablerow' >> beam.Map(lambda fields: create_event_row(fields)) 202 | | 'events:bqout' >> beam.io.WriteToBigQuery( 203 | 'dsongcp.flights_simevents', schema=events_schema, 204 | write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE, 205 | create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED 206 | ) 207 | ) 208 | 209 | if __name__ == '__main__': 210 | import argparse 211 | 212 | parser = argparse.ArgumentParser(description='Run pipeline on the cloud') 213 | parser.add_argument('-p', '--project', help='Unique project ID', required=True) 214 | parser.add_argument('-b', '--bucket', help='Bucket where gs://BUCKET/flights/airports/airports.csv.gz exists', 215 | required=True) 216 | 217 | args = vars(parser.parse_args()) 218 | 219 | print("Correcting timestamps and writing to BigQuery dataset") 220 | 221 | run(project=args['project'], bucket=args['bucket']) 222 | -------------------------------------------------------------------------------- /04_streaming/transform/df07.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2016 Google Inc. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | import apache_beam as beam 18 | import logging 19 | import csv 20 | import json 21 | 22 | DATETIME_FORMAT = '%Y-%m-%dT%H:%M:%S' 23 | 24 | 25 | def addtimezone(lat, lon): 26 | try: 27 | import timezonefinder 28 | tf = timezonefinder.TimezoneFinder() 29 | lat = float(lat) 30 | lon = float(lon) 31 | return lat, lon, tf.timezone_at(lng=lon, lat=lat) 32 | except ValueError: 33 | return lat, lon, 'TIMEZONE' # header 34 | 35 | 36 | def as_utc(date, hhmm, tzone): 37 | """ 38 | Returns date corrected for timezone, and the tzoffset 39 | """ 40 | try: 41 | if len(hhmm) > 0 and tzone is not None: 42 | import datetime, pytz 43 | loc_tz = pytz.timezone(tzone) 44 | loc_dt = loc_tz.localize(datetime.datetime.strptime(date, '%Y-%m-%d'), is_dst=False) 45 | # can't just parse hhmm because the data contains 2400 and the like ... 
46 | loc_dt += datetime.timedelta(hours=int(hhmm[:2]), minutes=int(hhmm[2:])) 47 | utc_dt = loc_dt.astimezone(pytz.utc) 48 | return utc_dt.strftime(DATETIME_FORMAT), loc_dt.utcoffset().total_seconds() 49 | else: 50 | return '', 0 # empty string corresponds to canceled flights 51 | except ValueError as e: 52 | logging.exception('{} {} {}'.format(date, hhmm, tzone)) 53 | raise e 54 | 55 | 56 | def add_24h_if_before(arrtime, deptime): 57 | import datetime 58 | if len(arrtime) > 0 and len(deptime) > 0 and arrtime < deptime: 59 | adt = datetime.datetime.strptime(arrtime, DATETIME_FORMAT) 60 | adt += datetime.timedelta(hours=24) 61 | return adt.strftime(DATETIME_FORMAT) 62 | else: 63 | return arrtime 64 | 65 | 66 | def airport_timezone(airport_id, airport_timezones): 67 | if airport_id in airport_timezones: 68 | return airport_timezones[airport_id] 69 | else: 70 | return '37.41', '-92.35', u'America/Chicago' 71 | 72 | 73 | def tz_correct(fields, airport_timezones): 74 | fields['FL_DATE'] = fields['FL_DATE'].strftime('%Y-%m-%d') # convert to a string so JSON code works 75 | 76 | # convert all times to UTC 77 | dep_airport_id = fields["ORIGIN_AIRPORT_SEQ_ID"] 78 | arr_airport_id = fields["DEST_AIRPORT_SEQ_ID"] 79 | fields["DEP_AIRPORT_LAT"], fields["DEP_AIRPORT_LON"], dep_timezone = airport_timezone(dep_airport_id, 80 | airport_timezones) 81 | fields["ARR_AIRPORT_LAT"], fields["ARR_AIRPORT_LON"], arr_timezone = airport_timezone(arr_airport_id, 82 | airport_timezones) 83 | 84 | for f in ["CRS_DEP_TIME", "DEP_TIME", "WHEELS_OFF"]: 85 | fields[f], deptz = as_utc(fields["FL_DATE"], fields[f], dep_timezone) 86 | for f in ["WHEELS_ON", "CRS_ARR_TIME", "ARR_TIME"]: 87 | fields[f], arrtz = as_utc(fields["FL_DATE"], fields[f], arr_timezone) 88 | 89 | for f in ["WHEELS_OFF", "WHEELS_ON", "CRS_ARR_TIME", "ARR_TIME"]: 90 | fields[f] = add_24h_if_before(fields[f], fields["DEP_TIME"]) 91 | 92 | fields["DEP_AIRPORT_TZOFFSET"] = deptz 93 | fields["ARR_AIRPORT_TZOFFSET"] = arrtz 94 | yield fields 95 | 96 | 97 | def get_next_event(fields): 98 | if len(fields["DEP_TIME"]) > 0: 99 | event = dict(fields) # copy 100 | event["EVENT_TYPE"] = "departed" 101 | event["EVENT_TIME"] = fields["DEP_TIME"] 102 | for f in ["TAXI_OUT", "WHEELS_OFF", "WHEELS_ON", "TAXI_IN", "ARR_TIME", "ARR_DELAY", "DISTANCE"]: 103 | event.pop(f, None) # not knowable at departure time 104 | yield event 105 | if len(fields["WHEELS_OFF"]) > 0: 106 | event = dict(fields) # copy 107 | event["EVENT_TYPE"] = "wheelsoff" 108 | event["EVENT_TIME"] = fields["WHEELS_OFF"] 109 | for f in ["WHEELS_ON", "TAXI_IN", "ARR_TIME", "ARR_DELAY", "DISTANCE"]: 110 | event.pop(f, None) # not knowable at departure time 111 | yield event 112 | if len(fields["ARR_TIME"]) > 0: 113 | event = dict(fields) 114 | event["EVENT_TYPE"] = "arrived" 115 | event["EVENT_TIME"] = fields["ARR_TIME"] 116 | yield event 117 | 118 | 119 | def create_event_row(fields): 120 | featdict = dict(fields) # copy 121 | featdict['EVENT_DATA'] = json.dumps(fields) 122 | return featdict 123 | 124 | 125 | def run(project, bucket, region): 126 | argv = [ 127 | '--project={0}'.format(project), 128 | '--job_name=ch04timecorr', 129 | '--save_main_session', 130 | '--staging_location=gs://{0}/flights/staging/'.format(bucket), 131 | '--temp_location=gs://{0}/flights/temp/'.format(bucket), 132 | '--setup_file=./setup.py', 133 | '--autoscaling_algorithm=THROUGHPUT_BASED', 134 | '--max_num_workers=8', 135 | '--region={}'.format(region), 136 | '--runner=DataflowRunner' 137 | ] 138 | airports_filename = 
'gs://{}/flights/airports/airports.csv.gz'.format(bucket) 139 | flights_output = 'gs://{}/flights/tzcorr/all_flights'.format(bucket) 140 | 141 | with beam.Pipeline(argv=argv) as pipeline: 142 | airports = (pipeline 143 | | 'airports:read' >> beam.io.ReadFromText(airports_filename) 144 | | 'airports:onlyUSA' >> beam.Filter(lambda line: "United States" in line) 145 | | 'airports:fields' >> beam.Map(lambda line: next(csv.reader([line]))) 146 | | 'airports:tz' >> beam.Map(lambda fields: (fields[0], addtimezone(fields[21], fields[26]))) 147 | ) 148 | 149 | flights = (pipeline 150 | | 'flights:read' >> beam.io.ReadFromBigQuery( 151 | query='SELECT * FROM dsongcp.flights', use_standard_sql=True) 152 | | 'flights:tzcorr' >> beam.FlatMap(tz_correct, beam.pvalue.AsDict(airports)) 153 | ) 154 | 155 | (flights 156 | | 'flights:tostring' >> beam.Map(lambda fields: json.dumps(fields)) 157 | | 'flights:gcsout' >> beam.io.textio.WriteToText(flights_output) 158 | ) 159 | 160 | flights_schema = ','.join([ 161 | 'FL_DATE:date,UNIQUE_CARRIER:string,ORIGIN_AIRPORT_SEQ_ID:string,ORIGIN:string', 162 | 'DEST_AIRPORT_SEQ_ID:string,DEST:string,CRS_DEP_TIME:timestamp,DEP_TIME:timestamp', 163 | 'DEP_DELAY:float,TAXI_OUT:float,WHEELS_OFF:timestamp,WHEELS_ON:timestamp,TAXI_IN:float', 164 | 'CRS_ARR_TIME:timestamp,ARR_TIME:timestamp,ARR_DELAY:float,CANCELLED:boolean', 165 | 'DIVERTED:boolean,DISTANCE:float', 166 | 'DEP_AIRPORT_LAT:float,DEP_AIRPORT_LON:float,DEP_AIRPORT_TZOFFSET:float', 167 | 'ARR_AIRPORT_LAT:float,ARR_AIRPORT_LON:float,ARR_AIRPORT_TZOFFSET:float']) 168 | flights | 'flights:bqout' >> beam.io.WriteToBigQuery( 169 | 'dsongcp.flights_tzcorr', schema=flights_schema, 170 | write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE, 171 | create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED 172 | ) 173 | 174 | events = flights | beam.FlatMap(get_next_event) 175 | events_schema = ','.join([flights_schema, 'EVENT_TYPE:string,EVENT_TIME:timestamp,EVENT_DATA:string']) 176 | 177 | (events 178 | | 'events:totablerow' >> beam.Map(lambda fields: create_event_row(fields)) 179 | | 'events:bqout' >> beam.io.WriteToBigQuery( 180 | 'dsongcp.flights_simevents', schema=events_schema, 181 | write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE, 182 | create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED 183 | ) 184 | ) 185 | 186 | 187 | if __name__ == '__main__': 188 | import argparse 189 | 190 | parser = argparse.ArgumentParser(description='Run pipeline on the cloud') 191 | parser.add_argument('-p', '--project', help='Unique project ID', required=True) 192 | parser.add_argument('-b', '--bucket', help='Bucket where gs://BUCKET/flights/airports/airports.csv.gz exists', 193 | required=True) 194 | parser.add_argument('-r', '--region', 195 | help='Region in which to run the Dataflow job. 
Choose the same region as your bucket.', 196 | required=True) 197 | 198 | args = vars(parser.parse_args()) 199 | 200 | print("Correcting timestamps and writing to BigQuery dataset") 201 | 202 | run(project=args['project'], bucket=args['bucket'], region=args['region']) 203 | -------------------------------------------------------------------------------- /04_streaming/transform/install_packages.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | python3 -m pip install --upgrade pip 4 | python3 -m pip cache purge 5 | python3 -m pip install --upgrade timezonefinder pytz 'apache-beam[gcp]' 6 | 7 | echo "If this script fails, please try installing it in a virtualenv" 8 | echo "virtualenv ~/beam_env" 9 | echo "source ~/beam_env/bin/activate" 10 | echo "./install_packages.sh" 11 | -------------------------------------------------------------------------------- /04_streaming/transform/setup.py: -------------------------------------------------------------------------------- 1 | # 2 | # Licensed to the Apache Software Foundation (ASF) under one or more 3 | # contributor license agreements. See the NOTICE file distributed with 4 | # this work for additional information regarding copyright ownership. 5 | # The ASF licenses this file to You under the Apache License, Version 2.0 6 | # (the "License"); you may not use this file except in compliance with 7 | # the License. You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | # 17 | 18 | """Setup.py module for the workflow's worker utilities. 19 | 20 | All the workflow related code is gathered in a package that will be built as a 21 | source distribution, staged in the staging area for the workflow being run and 22 | then installed in the workers when they start running. 23 | 24 | This behavior is triggered by specifying the --setup_file command line option 25 | when running the workflow for remote execution. 26 | """ 27 | 28 | from distutils.command.build import build as _build 29 | import subprocess 30 | 31 | import setuptools 32 | 33 | 34 | # This class handles the pip install mechanism. 35 | class build(_build): # pylint: disable=invalid-name 36 | """A build command class that will be invoked during package install. 37 | 38 | The package built using the current setup.py will be staged and later 39 | installed in the worker using `pip install package'. This class will be 40 | instantiated during install for this specific scenario and will trigger 41 | running the custom commands specified. 42 | """ 43 | sub_commands = _build.sub_commands + [('CustomCommands', None)] 44 | 45 | 46 | # Some custom command to run during setup. The command is not essential for this 47 | # workflow. It is used here as an example. Each command will spawn a child 48 | # process. Typically, these commands will include steps to install non-Python 49 | # packages. 
For instance, to install a C++-based library libjpeg62 the following 50 | # two commands will have to be added: 51 | # 52 | # ['apt-get', 'update'], 53 | # ['apt-get', '--assume-yes', install', 'libjpeg62'], 54 | # 55 | # First, note that there is no need to use the sudo command because the setup 56 | # script runs with appropriate access. 57 | # Second, if apt-get tool is used then the first command needs to be 'apt-get 58 | # update' so the tool refreshes itself and initializes links to download 59 | # repositories. Without this initial step the other apt-get install commands 60 | # will fail with package not found errors. Note also --assume-yes option which 61 | # shortcuts the interactive confirmation. 62 | # 63 | # The output of custom commands (including failures) will be logged in the 64 | # worker-startup log. 65 | CUSTOM_COMMANDS = [ 66 | ] 67 | 68 | 69 | class CustomCommands(setuptools.Command): 70 | """A setuptools Command class able to run arbitrary commands.""" 71 | 72 | def initialize_options(self): 73 | pass 74 | 75 | def finalize_options(self): 76 | pass 77 | 78 | def RunCustomCommand(self, command_list): 79 | print('Running command: %s' % command_list) 80 | p = subprocess.Popen( 81 | command_list, 82 | stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) 83 | # Can use communicate(input='y\n'.encode()) if the command run requires 84 | # some confirmation. 85 | stdout_data, _ = p.communicate() 86 | print('Command output: %s' % stdout_data) 87 | if p.returncode != 0: 88 | raise RuntimeError( 89 | 'Command %s failed: exit code: %s' % (command_list, p.returncode)) 90 | 91 | def run(self): 92 | for command in CUSTOM_COMMANDS: 93 | self.RunCustomCommand(command) 94 | 95 | 96 | # Configure the required packages and scripts to install. 97 | # Note that the Python Dataflow containers come with numpy already installed 98 | # so this dependency will not trigger anything to be installed unless a version 99 | # restriction is specified. 100 | REQUIRED_PACKAGES = [ 101 | 'timezonefinder', 102 | 'pytz' 103 | ] 104 | 105 | setuptools.setup( 106 | name='flightsdf', 107 | version='0.0.1', 108 | description='Data Science on GCP flights analysis pipelines', 109 | install_requires=REQUIRED_PACKAGES, 110 | packages=setuptools.find_packages(), 111 | py_modules=['df07'], 112 | cmdclass={ 113 | # Command class instantiated and run during pip install scenarios. 114 | 'build': build, 115 | 'CustomCommands': CustomCommands, 116 | } 117 | ) 118 | -------------------------------------------------------------------------------- /04_streaming/transform/stage_airports_file.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if test "$#" -ne 1; then 4 | echo "Usage: ./bqsample.sh bucket-name" 5 | echo " eg: ./bqsample.sh cloud-training-demos-ml" 6 | exit 7 | fi 8 | 9 | BUCKET=$1 10 | PROJECT=$(gcloud config get-value project) 11 | 12 | gsutil cp airports.csv.gz gs://${BUCKET}/flights/airports/airports.csv.gz 13 | 14 | bq --project_id=$PROJECT load \ 15 | --autodetect --replace --source_format=CSV \ 16 | dsongcp.airports gs://${BUCKET}/flights/airports/airports.csv.gz -------------------------------------------------------------------------------- /05_bqnotebook/README.md: -------------------------------------------------------------------------------- 1 | # 5. 
Interactive data exploration 2 | 3 | ### Catch up from previous chapters if necessary 4 | If you didn't go through Chapters 2-4, the simplest way to catch up is to copy data from my bucket: 5 | * Go to the Storage section of the GCP web console and create a new bucket 6 | * Open CloudShell and git clone this repo: 7 | ``` 8 | git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp 9 | ``` 10 | * Then, run: 11 | ``` 12 | cd data-science-on-gcp/02_ingest 13 | ./ingest_from_crsbucket bucketname 14 | ./bqload.sh (csv-bucket-name) YEAR 15 | ``` 16 | * Run: 17 | ``` 18 | cd ../03_sqlstudio 19 | ./create_views.sh 20 | ``` 21 | * Run: 22 | ``` 23 | cd ../04_streaming 24 | ./ingest_from_crsbucket.sh 25 | ``` 26 | 27 | ## Try out queries 28 | * In BigQuery, query the time corrected files created in Chapter 4: 29 | ``` 30 | SELECT 31 | ORIGIN, 32 | AVG(DEP_DELAY) AS dep_delay, 33 | AVG(ARR_DELAY) AS arr_delay, 34 | COUNT(ARR_DELAY) AS num_flights 35 | FROM 36 | dsongcp.flights_tzcorr 37 | GROUP BY 38 | ORIGIN 39 | ``` 40 | * Try out the other queries in queries.txt in this directory. 41 | 42 | * Navigate to the Vertex AI Workbench part of the GCP console. 43 | 44 | * Start a new managed notebook. Then, copy and paste cells from exploration.ipynb and click Run to execute the code. 45 | 46 | * Create the trainday table BigQuery table and CSV file as you will need it later 47 | ``` 48 | ./create_trainday.sh 49 | ``` 50 | -------------------------------------------------------------------------------- /05_bqnotebook/create_trainday.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ "$#" -ne 1 ]; then 4 | echo "Usage: ./create_trainday.sh destination-bucket-name" 5 | exit 6 | fi 7 | 8 | BUCKET=$1 9 | 10 | cat trainday.txt | bq query --nouse_legacy_sql 11 | 12 | bq extract dsongcp.trainday gs://${BUCKET}/flights/trainday.csv 13 | -------------------------------------------------------------------------------- /05_bqnotebook/queries.txt: -------------------------------------------------------------------------------- 1 | SELECT 2 | ORIGIN, 3 | AVG(DEP_DELAY) AS dep_delay, 4 | AVG(ARR_DELAY) AS arr_delay, 5 | COUNT(ARR_DELAY) AS num_flights 6 | FROM 7 | dsongcp.flights_tzcorr 8 | GROUP BY 9 | ORIGIN 10 | 11 | ________________________________________________________________ 12 | 13 | WITH all_airports AS ( 14 | SELECT 15 | ORIGIN, 16 | AVG(DEP_DELAY) AS dep_delay, 17 | AVG(ARR_DELAY) AS arr_delay, 18 | COUNT(ARR_DELAY) AS num_flights 19 | FROM 20 | dsongcp.flights_tzcorr 21 | GROUP BY 22 | ORIGIN 23 | ) 24 | 25 | SELECT * FROM all_airports WHERE num_flights > 3650 26 | ORDER BY dep_delay DESC 27 | 28 | ________________________________________________________________ 29 | 30 | WITH all_airports AS ( 31 | SELECT 32 | ORIGIN, 33 | AVG(DEP_DELAY) AS dep_delay, 34 | AVG(ARR_DELAY) AS arr_delay, 35 | COUNT(ARR_DELAY) AS num_flights 36 | FROM 37 | dsongcp.flights_tzcorr 38 | WHERE EXTRACT(MONTH FROM FL_DATE) = 1 39 | GROUP BY 40 | ORIGIN 41 | ) 42 | 43 | SELECT * FROM all_airports WHERE num_flights > 310 44 | ORDER BY dep_delay DESC 45 | 46 | ________________________________________________________________ -------------------------------------------------------------------------------- /05_bqnotebook/trainday.txt: -------------------------------------------------------------------------------- 1 | CREATE OR REPLACE TABLE dsongcp.trainday AS 2 | 3 | SELECT 4 | FL_DATE, 5 | IF(ABS(MOD(FARM_FINGERPRINT(CAST(FL_DATE AS STRING)), 100)) < 70, 6 | 
'True', 'False') AS is_train_day 7 | FROM ( 8 | SELECT 9 | DISTINCT(FL_DATE) AS FL_DATE 10 | FROM 11 | dsongcp.flights_tzcorr) 12 | ORDER BY 13 | FL_DATE 14 | -------------------------------------------------------------------------------- /06_dataproc/README.md: -------------------------------------------------------------------------------- 1 | # 6. Bayes Classifier on Cloud Dataproc 2 | 3 | To repeat the steps in this chapter, follow these steps. 4 | 5 | ### Catch up from Chapters 2-5 6 | If you didn't go through Chapters 2-5, the simplest way to catch up is to copy data from my bucket: 7 | * Open CloudShell and git clone this repo: 8 | ``` 9 | git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp 10 | ``` 11 | * Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name. 12 | * Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name. 13 | * Go to the 05_bqnotebook folder of the repo, run the script to load data into BigQuery: 14 | ``` 15 | bash create_trainday.sh BUCKET-NAME 16 | ``` 17 | 18 | ### Create Dataproc cluster 19 | In CloudShell: 20 | * Clone the repository if you haven't already done so: 21 | ``` 22 | git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp 23 | ``` 24 | * Change to the directory for this chapter: 25 | ``` 26 | cd data-science-on-gcp/06_dataproc 27 | ``` 28 | * Create the Dataproc cluster to run jobs on, specifying the name of your bucket and a 29 | zone in the region that the bucket is in. (You created this bucket in Chapter 2) 30 | ``` 31 | ./create_cluster.sh 32 | ``` 33 | *Note:* Make sure that the compute zone is in the same region as the bucket, otherwise you will incur network egress charges. 34 | 35 | ### Interactive development 36 | * Navigate to the Dataproc section of the GCP web console and click on "Web Interfaces". 37 | 38 | * Click on JupyterLab 39 | 40 | * In JupyterLab, navigate to /LocalDisks/home/dataproc/data-science-on-gcp 41 | 42 | * Open 06_dataproc/quantization.ipynb. Click Run | Clear All Outputs. Then run the cells one by one. 43 | 44 | * [optional] make the changes suggested in the notebook to run on the full dataset. Note that you might have to 45 | reduce numbers to fit into your quota. 46 | 47 | ### Delete the cluster 48 | * Delete the cluster either from the GCP web console or by typing in CloudShell, ```./delete_cluster.sh ``` 49 | 50 | ### Serverless workflow 51 | * Visit https://console.cloud.google.com/networking/networks/list 52 | * Select the "default" network in your region and allow private Google access 53 | * Run ./submit_serverless.sh 54 | 55 | -------------------------------------------------------------------------------- /06_dataproc/bayes_on_spark.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2021 Google Inc. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | import logging 18 | import pandas as pd 19 | import numpy as np 20 | from pyspark.sql import SparkSession 21 | import pyspark.sql.functions as F 22 | 23 | 24 | def run_bayes(BUCKET): 25 | spark = SparkSession \ 26 | .builder \ 27 | .appName("Bayes classification using Spark") \ 28 | .getOrCreate() 29 | 30 | # read flights data 31 | inputs = 'gs://{}/flights/tzcorr/all_flights-*'.format(BUCKET) # FULL 32 | flights = spark.read.json(inputs) 33 | flights.createOrReplaceTempView('flights') 34 | 35 | # which days are training days? 36 | traindays = spark.read \ 37 | .option("header", "true") \ 38 | .option("inferSchema", "true") \ 39 | .csv('gs://{}/flights/trainday.csv'.format(BUCKET)) 40 | traindays.createOrReplaceTempView('traindays') 41 | 42 | # create training dataset 43 | statement = """ 44 | SELECT 45 | f.FL_DATE AS date, 46 | CAST(distance AS FLOAT) AS distance, 47 | dep_delay, 48 | IF(arr_delay < 15, 1, 0) AS ontime 49 | FROM flights f 50 | JOIN traindays t 51 | ON f.FL_DATE == t.FL_DATE 52 | WHERE 53 | t.is_train_day AND 54 | f.dep_delay IS NOT NULL 55 | ORDER BY 56 | f.dep_delay DESC 57 | """ 58 | flights = spark.sql(statement) 59 | 60 | # quantiles 61 | distthresh = flights.approxQuantile('distance', list(np.arange(0, 1.0, 0.2)), 0.02) 62 | distthresh[-1] = float('inf') 63 | delaythresh = range(10, 20) 64 | logging.info("Computed distance thresholds: {}".format(distthresh)) 65 | 66 | # bayes in each bin 67 | df = pd.DataFrame(columns=['dist_thresh', 'delay_thresh', 'frac_ontime']) 68 | for m in range(0, len(distthresh) - 1): 69 | for n in range(0, len(delaythresh) - 1): 70 | bdf = flights[(flights['distance'] >= distthresh[m]) 71 | & (flights['distance'] < distthresh[m + 1]) 72 | & (flights['dep_delay'] >= delaythresh[n]) 73 | & (flights['dep_delay'] < delaythresh[n + 1])] 74 | ontime_frac = bdf.agg(F.sum('ontime')).collect()[0][0] / bdf.agg(F.count('ontime')).collect()[0][0] 75 | print(m, n, ontime_frac) 76 | df = df.append({ 77 | 'dist_thresh': distthresh[m], 78 | 'delay_thresh': delaythresh[n], 79 | 'frac_ontime': ontime_frac 80 | }, ignore_index=True) 81 | 82 | # lookup table 83 | df['score'] = abs(df['frac_ontime'] - 0.7) 84 | bayes = df.sort_values(['score']).groupby('dist_thresh').head(1).sort_values('dist_thresh') 85 | bayes.to_csv('gs://{}/flights/bayes.csv'.format(BUCKET), index=False) 86 | logging.info("Wrote lookup table: {}".format(bayes.head())) 87 | 88 | 89 | if __name__ == '__main__': 90 | import argparse 91 | 92 | parser = argparse.ArgumentParser(description='Create Bayes lookup table') 93 | parser.add_argument('--bucket', help='GCS bucket to read/write data', required=True) 94 | parser.add_argument('--debug', dest='debug', action='store_true', help='Specify if you want debug messages') 95 | 96 | args = parser.parse_args() 97 | if args.debug: 98 | logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.DEBUG) 99 | else: 100 | logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.INFO) 101 | 102 | run_bayes(args.bucket) 103 | -------------------------------------------------------------------------------- /06_dataproc/create_cluster.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ "$#" -ne 2 ]; then 4 | echo "Usage: ./create_cluster.sh bucket-name region" 5 | exit 6 | fi 7 | 8 | PROJECT=$(gcloud config get-value project) 9 | BUCKET=$1 10 | REGION=$2 
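# What follows: stage install_on_cluster.sh (with the Dataproc username
# "dataproc" substituted in) to the bucket, then create a cluster named
# ch6cluster with one n1-standard-4 master, two n1-standard-4 workers,
# 500 GB boot disks, the Jupyter optional component, the component gateway,
# and the staged script as an initialization action.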
11 | EMAIL=$3 12 | INSTALL=gs://$BUCKET/flights/dataproc/install_on_cluster.sh 13 | 14 | # upload install file 15 | sed "s/CHANGE_TO_USER_NAME/dataproc/g" install_on_cluster.sh > /tmp/install_on_cluster.sh 16 | gsutil cp /tmp/install_on_cluster.sh $INSTALL 17 | 18 | # create cluster 19 | gcloud dataproc clusters create ch6cluster \ 20 | --enable-component-gateway \ 21 | --region ${REGION} --zone ${REGION}-a \ 22 | --master-machine-type n1-standard-4 \ 23 | --master-boot-disk-size 500 --num-workers 2 \ 24 | --worker-machine-type n1-standard-4 \ 25 | --worker-boot-disk-size 500 \ 26 | --optional-components JUPYTER --project $PROJECT \ 27 | --initialization-actions=$INSTALL \ 28 | --scopes https://www.googleapis.com/auth/cloud-platform 29 | 30 | -------------------------------------------------------------------------------- /06_dataproc/create_personal_cluster.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ "$#" -ne 3 ]; then 4 | echo "Usage: ./create_cluster.sh bucket-name region user_email" 5 | exit 6 | fi 7 | 8 | PROJECT=$(gcloud config get-value project) 9 | BUCKET=$1 10 | REGION=$2 11 | EMAIL=$3 12 | INSTALL=gs://$BUCKET/flights/dataproc/install_on_cluster.sh 13 | 14 | # upload install file 15 | sed "s/CHANGE_TO_USER_NAME/$USER/g" install_on_cluster.sh > /tmp/install_on_cluster.sh 16 | gsutil cp /tmp/install_on_cluster.sh $INSTALL 17 | 18 | # create cluster 19 | gcloud dataproc clusters create ch6cluster \ 20 | --enable-component-gateway \ 21 | --region ${REGION} --zone ${REGION}-a \ 22 | --properties dataproc:dataproc.personal-auth.user=$EMAIL \ 23 | --master-machine-type n1-standard-4 \ 24 | --master-boot-disk-size 500 --num-workers 2 \ 25 | --worker-machine-type n1-standard-4 \ 26 | --worker-boot-disk-size 500 \ 27 | --optional-components JUPYTER --project $PROJECT \ 28 | --initialization-actions=$INSTALL \ 29 | --scopes https://www.googleapis.com/auth/cloud-platform 30 | 31 | 32 | echo "Once cluster is up, please run the following command to inject your auth into the cluster." 
33 | echo "gcloud dataproc clusters enable-personal-auth-session --region=$REGION ch6cluster" 34 | -------------------------------------------------------------------------------- /06_dataproc/decrease_cluster.sh: -------------------------------------------------------------------------------- 1 | gcloud dataproc clusters update ch6cluster\ 2 | --num-secondary-workers=0 --num-workers=2 --region=us-central1 3 | -------------------------------------------------------------------------------- /06_dataproc/delete_cluster.sh: -------------------------------------------------------------------------------- 1 | if [ "$#" -ne 1 ]; then 2 | echo "Usage: ./delete_cluster.sh region" 3 | exit 4 | fi 5 | 6 | gcloud dataproc clusters delete ch6cluster --region $1 -------------------------------------------------------------------------------- /06_dataproc/increase_cluster.sh: -------------------------------------------------------------------------------- 1 | gcloud dataproc clusters update ch6cluster\ 2 | --num-secondary-workers=3 --num-workers=4 --region=us-central1 3 | -------------------------------------------------------------------------------- /06_dataproc/install_on_cluster.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # install Google Python client on all nodes 4 | apt-get -y update 5 | apt-get install python-dev 6 | apt-get install -y python-pip 7 | pip install --upgrade google-api-python-client 8 | 9 | # git clone on Master 10 | USER=CHANGE_TO_USER_NAME # the username that dataproc runs as 11 | ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role) 12 | if [[ "${ROLE}" == 'Master' ]]; then 13 | cd home/$USER 14 | git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp 15 | chown -R $USER data-science-on-gcp 16 | fi 17 | -------------------------------------------------------------------------------- /06_dataproc/submit_serverless.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ "$#" -ne 2 ]; then 4 | echo "Usage: ./submit_serverless.sh bucket-name region" 5 | exit 6 | fi 7 | 8 | BUCKET=$1 9 | REGION=$2 10 | 11 | # Note: The "default" network in the region needs to be enabled 12 | # for private Google access 13 | # https://cloud.google.com/vpc/docs/configure-private-google-access#config-pga 14 | 15 | gsutil cp bayes_on_spark.py gs://$BUCKET/ 16 | 17 | gcloud beta dataproc batches submit pyspark \ 18 | --project=$(gcloud config get-value project) \ 19 | --region=$REGION \ 20 | gs://${BUCKET}/bayes_on_spark.py \ 21 | -- \ 22 | --bucket ${BUCKET} # --debug 23 | -------------------------------------------------------------------------------- /07_sparkml/README.md: -------------------------------------------------------------------------------- 1 | # 7. Machine Learning: Logistic regression on Spark 2 | 3 | ### Catch up from previous chapters if necessary 4 | If you didn't go through Chapters 2-6, the simplest way to catch up is to copy data from my bucket: 5 | 6 | #### Catch up from Chapters 2-5 7 | * Open CloudShell and git clone this repo: 8 | ``` 9 | git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp 10 | ``` 11 | * Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name. 12 | * Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name. 
13 | * Go to the 05_bqnotebook folder of the repo, run the script to load data into BigQuery: 14 | ``` 15 | bash create_trainday.sh 16 | ``` 17 | 18 | #### [Optional] Catch up from Chapter 6 19 | * Use the instructions in the Chapter 6 README to: 20 | * launch a minimal Cloud Dataproc cluster with initialization actions for Jupyter (`./create_cluster.sh BUCKET ZONE`) 21 | 22 | * Start a new notebook and in a cell, download a read-only clone of this repository: 23 | ``` 24 | %bash 25 | git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp 26 | rm -rf data-science-on-gcp/.git 27 | ``` 28 | * Browse to data-science-on-gcp/07_sparkml_and_bqml/logistic_regression.ipynb 29 | and run the cells in the notebook (change the BUCKET appropriately). 30 | 31 | ## This Chapter 32 | ### Logistic regression using Spark 33 | * Launch a large Dataproc cluster: 34 | ``` 35 | ./create_large_cluster.sh BUCKET ZONE 36 | ``` 37 | * If it fails with quota issues, get increased quota. If you can't have more quota, 38 | reduce the number of workers appropriately. 39 | 40 | * Submit a Spark job to run the full dataset (change the BUCKET appropriately). 41 | ``` 42 | ./submit_spark.sh BUCKET logistic.py 43 | ``` 44 | 45 | ### Feature engineering 46 | * Submit a Spark job to do experimentation: `./submit_spark.sh BUCKET experiment.py` 47 | 48 | ### Cleanup 49 | * Delete the cluster either from the GCP web console or by typing in CloudShell, `../06_dataproc/delete_cluster.sh` 50 | 51 | 52 | -------------------------------------------------------------------------------- /07_sparkml/autoscale.yaml: -------------------------------------------------------------------------------- 1 | workerConfig: 2 | minInstances: 10 3 | maxInstances: 30 4 | secondaryWorkerConfig: 5 | maxInstances: 20 6 | basicAlgorithm: 7 | cooldownPeriod: 15m 8 | yarnConfig: 9 | scaleUpFactor: 0.05 10 | scaleDownFactor: 1.0 11 | gracefulDecommissionTimeout: 1h 12 | -------------------------------------------------------------------------------- /07_sparkml/create_large_cluster.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ "$#" -ne 2 ]; then 4 | echo "Usage: ./create_cluster.sh bucket-name region" 5 | exit 6 | fi 7 | 8 | PROJECT=$(gcloud config get-value project) 9 | BUCKET=$1 10 | REGION=$2 11 | 12 | # create cluster 13 | gcloud dataproc clusters create ch7cluster \ 14 | --enable-component-gateway \ 15 | --region ${REGION} --zone ${REGION}-a \ 16 | --master-machine-type n1-standard-4 \ 17 | --master-boot-disk-size 500 \ 18 | --num-workers 30 --num-secondary-workers 20 \ 19 | --worker-machine-type n1-standard-8 \ 20 | --worker-boot-disk-size 500 \ 21 | --project $PROJECT \ 22 | --scopes https://www.googleapis.com/auth/cloud-platform 23 | 24 | gcloud dataproc autoscaling-policies import experiment-policy \ 25 | --source=autoscale.yaml --region=$REGION 26 | 27 | gcloud dataproc clusters update ch7cluster \ 28 | --autoscaling-policy=experiment-policy --region=$REGION 29 | -------------------------------------------------------------------------------- /07_sparkml/experiment.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2021 Google Inc. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 
7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | from pyspark.mllib.classification import LogisticRegressionWithLBFGS 18 | from pyspark.mllib.regression import LabeledPoint 19 | from pyspark.sql import SparkSession 20 | from pyspark import SparkContext 21 | import logging 22 | import numpy as np 23 | 24 | NUM_PARTITIONS = 1000 25 | 26 | def get_category(hour): 27 | if hour < 6 or hour > 20: 28 | return [1, 0, 0] # night 29 | if hour < 10: 30 | return [0, 1, 0] # morning 31 | if hour < 17: 32 | return [0, 0, 1] # mid-day 33 | else: 34 | return [0, 0, 0] # evening 35 | 36 | 37 | def get_local_hour(timestamp, correction): 38 | import datetime 39 | TIME_FORMAT = '%Y-%m-%d %H:%M:%S' 40 | timestamp = timestamp.replace('T', ' ') # incase different 41 | t = datetime.datetime.strptime(timestamp, TIME_FORMAT) 42 | d = datetime.timedelta(seconds=correction) 43 | t = t + d 44 | # return [t.hour] # raw 45 | # theta = np.radians(360 * t.hour / 24.0) # von-Miyes 46 | # return [np.sin(theta), np.cos(theta)] 47 | return get_category(t.hour) # bucketize 48 | 49 | 50 | def eval(labelpred): 51 | ''' 52 | data = (label, pred) 53 | data[0] = label 54 | data[1] = pred 55 | ''' 56 | cancel = labelpred.filter(lambda data: data[1] < 0.7) 57 | nocancel = labelpred.filter(lambda data: data[1] >= 0.7) 58 | corr_cancel = cancel.filter(lambda data: data[0] == int(data[1] >= 0.7)).count() 59 | corr_nocancel = nocancel.filter(lambda data: data[0] == int(data[1] >= 0.7)).count() 60 | 61 | cancel_denom = cancel.count() 62 | nocancel_denom = nocancel.count() 63 | if cancel_denom == 0: 64 | cancel_denom = 1 65 | if nocancel_denom == 0: 66 | nocancel_denom = 1 67 | 68 | totsqe = labelpred.map( 69 | lambda data: (data[0] - data[1]) * (data[0] - data[1]) 70 | ).sum() 71 | rmse = np.sqrt(totsqe / float(cancel.count() + nocancel.count())) 72 | 73 | return { 74 | 'rmse': rmse, 75 | 'total_cancel': cancel.count(), 76 | 'correct_cancel': float(corr_cancel) / cancel_denom, 77 | 'total_noncancel': nocancel.count(), 78 | 'correct_noncancel': float(corr_nocancel) / nocancel_denom 79 | } 80 | 81 | 82 | def run_experiment(BUCKET, SCALE_AND_CLIP, WITH_TIME, WITH_ORIGIN): 83 | # Create spark session 84 | sc = SparkContext('local', 'experimentation') 85 | spark = SparkSession \ 86 | .builder \ 87 | .appName("Logistic regression w/ Spark ML") \ 88 | .getOrCreate() 89 | 90 | # read dataset 91 | traindays = spark.read \ 92 | .option("header", "true") \ 93 | .csv('gs://{}/flights/trainday.csv'.format(BUCKET)) 94 | traindays.createOrReplaceTempView('traindays') 95 | 96 | #inputs = 'gs://{}/flights/tzcorr/all_flights-00000-*'.format(BUCKET) # 1/30th 97 | inputs = 'gs://{}/flights/tzcorr/all_flights-*'.format(BUCKET) # FULL 98 | flights = spark.read.json(inputs) 99 | 100 | # this view can now be queried 101 | flights.createOrReplaceTempView('flights') 102 | 103 | # separate training and validation data 104 | from pyspark.sql.functions import rand 105 | SEED=13 106 | traindays = traindays.withColumn("holdout", rand(SEED) > 0.8) # 80% of data is for training 107 | traindays.createOrReplaceTempView('traindays') 108 | 109 | # logistic regression 110 | 
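    # The training query joins the time-corrected flights to the traindays
    # table, keeps only training-day rows outside the ~20% random holdout,
    # drops cancelled and diverted flights, and pulls the raw columns that
    # to_example() below turns into features (plus ARR_DELAY, which becomes
    # the on-time label).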
trainquery = """ 111 | SELECT 112 | ORIGIN, DEP_DELAY, TAXI_OUT, ARR_DELAY, DISTANCE, DEP_TIME, DEP_AIRPORT_TZOFFSET 113 | FROM flights f 114 | JOIN traindays t 115 | ON f.FL_DATE == t.FL_DATE 116 | WHERE 117 | t.is_train_day == 'True' AND 118 | t.holdout == False AND 119 | f.CANCELLED == 'False' AND 120 | f.DIVERTED == 'False' 121 | """ 122 | traindata = spark.sql(trainquery).repartition(NUM_PARTITIONS) 123 | 124 | def to_example(fields): 125 | features = [ 126 | fields['DEP_DELAY'], 127 | fields['DISTANCE'], 128 | fields['TAXI_OUT'], 129 | ] 130 | 131 | if SCALE_AND_CLIP: 132 | def clip(x): 133 | if x < -1: 134 | return -1 135 | if x > 1: 136 | return 1 137 | return x 138 | features = [ 139 | clip(float(fields['DEP_DELAY']) / 30), 140 | clip((float(fields['DISTANCE']) / 1000) - 1), 141 | clip((float(fields['TAXI_OUT']) / 10) - 1), 142 | ] 143 | 144 | if WITH_TIME: 145 | features.extend( 146 | get_local_hour(fields['DEP_TIME'], fields['DEP_AIRPORT_TZOFFSET'])) 147 | 148 | if WITH_ORIGIN: 149 | features.extend(fields['origin_onehot']) 150 | 151 | return LabeledPoint( 152 | float(fields['ARR_DELAY'] < 15), #ontime 153 | features) 154 | 155 | def add_origin(df, trained_model=None): 156 | from pyspark.ml.feature import OneHotEncoder, StringIndexer 157 | if not trained_model: 158 | indexer = StringIndexer(inputCol='ORIGIN', outputCol='origin_index') 159 | trained_model = indexer.fit(df) 160 | indexed = trained_model.transform(df) 161 | encoder = OneHotEncoder(inputCol='origin_index', outputCol='origin_onehot') 162 | return trained_model, encoder.fit(indexed).transform(indexed) 163 | 164 | if WITH_ORIGIN: 165 | index_model, traindata = add_origin(traindata) 166 | 167 | examples = traindata.rdd.map(to_example) 168 | lrmodel = LogisticRegressionWithLBFGS.train(examples, intercept=True) 169 | lrmodel.clearThreshold() # return probabilities 170 | 171 | # save model 172 | MODEL_FILE='gs://' + BUCKET + '/flights/sparkmloutput/model' 173 | lrmodel.save(sc, MODEL_FILE) 174 | logging.info("Saved trained model to {}".format(MODEL_FILE)) 175 | 176 | # evaluate model on the heldout data 177 | evalquery = trainquery.replace("t.holdout == False", "t.holdout == True") 178 | evaldata = spark.sql(evalquery).repartition(NUM_PARTITIONS) 179 | if WITH_ORIGIN: 180 | evaldata = add_origin(evaldata, index_model) 181 | examples = evaldata.rdd.map(to_example) 182 | labelpred = examples.map(lambda p: (p.label, lrmodel.predict(p.features))) 183 | 184 | 185 | logging.info(eval(labelpred)) 186 | 187 | 188 | 189 | if __name__ == '__main__': 190 | import argparse 191 | 192 | parser = argparse.ArgumentParser(description='Run experiments with different features in Spark') 193 | parser.add_argument('--bucket', help='GCS bucket to read/write data', required=True) 194 | parser.add_argument('--debug', dest='debug', action='store_true', help='Specify if you want debug messages') 195 | 196 | args = parser.parse_args() 197 | if args.debug: 198 | logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.DEBUG) 199 | else: 200 | logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.INFO) 201 | 202 | run_experiment(args.bucket, SCALE_AND_CLIP=False, WITH_TIME=False, WITH_ORIGIN=False) 203 | 204 | -------------------------------------------------------------------------------- /07_sparkml/logistic.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2021 Google Inc. 
4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | from pyspark.mllib.classification import LogisticRegressionWithLBFGS 18 | from pyspark.mllib.regression import LabeledPoint 19 | from pyspark.sql import SparkSession 20 | from pyspark import SparkContext 21 | import logging 22 | 23 | def run_logistic(BUCKET): 24 | # Create spark session 25 | sc = SparkContext('local', 'logistic') 26 | spark = SparkSession \ 27 | .builder \ 28 | .appName("Logistic regression w/ Spark ML") \ 29 | .getOrCreate() 30 | 31 | # read dataset 32 | traindays = spark.read \ 33 | .option("header", "true") \ 34 | .csv('gs://{}/flights/trainday.csv'.format(BUCKET)) 35 | traindays.createOrReplaceTempView('traindays') 36 | 37 | # inputs = 'gs://{}/flights/tzcorr/all_flights-00000-*'.format(BUCKET) # 1/30th 38 | inputs = 'gs://{}/flights/tzcorr/all_flights-*'.format(BUCKET) # FULL 39 | flights = spark.read.json(inputs) 40 | 41 | # this view can now be queried ... 42 | flights.createOrReplaceTempView('flights') 43 | 44 | 45 | # logistic regression 46 | trainquery = """ 47 | SELECT 48 | DEP_DELAY, TAXI_OUT, ARR_DELAY, DISTANCE 49 | FROM flights f 50 | JOIN traindays t 51 | ON f.FL_DATE == t.FL_DATE 52 | WHERE 53 | t.is_train_day == 'True' AND 54 | f.CANCELLED == 'False' AND 55 | f.DIVERTED == 'False' 56 | """ 57 | traindata = spark.sql(trainquery) 58 | 59 | def to_example(fields): 60 | return LabeledPoint(\ 61 | float(fields['ARR_DELAY'] < 15), #ontime \ 62 | [ \ 63 | fields['DEP_DELAY'], # DEP_DELAY \ 64 | fields['TAXI_OUT'], # TAXI_OUT \ 65 | fields['DISTANCE'], # DISTANCE \ 66 | ]) 67 | 68 | examples = traindata.rdd.map(to_example) 69 | lrmodel = LogisticRegressionWithLBFGS.train(examples, intercept=True) 70 | lrmodel.setThreshold(0.7) 71 | 72 | # save model 73 | MODEL_FILE='gs://{}/flights/sparkmloutput/model'.format(BUCKET) 74 | lrmodel.save(sc, MODEL_FILE) 75 | logging.info('Logistic regression model saved in {}'.format(MODEL_FILE)) 76 | 77 | # evaluate 78 | testquery = trainquery.replace("t.is_train_day == 'True'","t.is_train_day == 'False'") 79 | testdata = spark.sql(testquery) 80 | examples = testdata.rdd.map(to_example) 81 | 82 | # Evaluate model 83 | lrmodel.clearThreshold() # so it returns probabilities 84 | labelpred = examples.map(lambda p: (p.label, lrmodel.predict(p.features))) 85 | logging.info('All flights: {}'.format(eval_model(labelpred))) 86 | 87 | 88 | # keep only those examples near the decision threshold 89 | labelpred = labelpred.filter(lambda data: data[1] > 0.65 and data[1] < 0.75) 90 | logging.info('Flights near decision threshold: {}'.format(eval_model(labelpred))) 91 | 92 | def eval_model(labelpred): 93 | ''' 94 | data = (label, pred) 95 | data[0] = label 96 | data[1] = pred 97 | ''' 98 | cancel = labelpred.filter(lambda data: data[1] < 0.7) 99 | nocancel = labelpred.filter(lambda data: data[1] >= 0.7) 100 | corr_cancel = cancel.filter(lambda data: data[0] == int(data[1] >= 0.7)).count() 101 | corr_nocancel = nocancel.filter(lambda data: data[0] == 
int(data[1] >= 0.7)).count() 102 | 103 | cancel_denom = cancel.count() 104 | nocancel_denom = nocancel.count() 105 | if cancel_denom == 0: 106 | cancel_denom = 1 107 | if nocancel_denom == 0: 108 | nocancel_denom = 1 109 | return { 110 | 'total_cancel': cancel.count(), 111 | 'correct_cancel': float(corr_cancel)/cancel_denom, 112 | 'total_noncancel': nocancel.count(), 113 | 'correct_noncancel': float(corr_nocancel)/nocancel_denom 114 | } 115 | 116 | 117 | if __name__ == '__main__': 118 | import argparse 119 | 120 | parser = argparse.ArgumentParser(description='Run logistic regression in Spark') 121 | parser.add_argument('--bucket', help='GCS bucket to read/write data', required=True) 122 | parser.add_argument('--debug', dest='debug', action='store_true', help='Specify if you want debug messages') 123 | 124 | args = parser.parse_args() 125 | if args.debug: 126 | logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.DEBUG) 127 | else: 128 | logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.INFO) 129 | 130 | run_logistic(args.bucket) 131 | -------------------------------------------------------------------------------- /07_sparkml/submit_spark.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ "$#" -ne 3 ]; then 4 | echo "Usage: ./submit_spark_to_cluster.sh bucket-name region pyspark-file" 5 | exit 6 | fi 7 | 8 | BUCKET=$1 9 | REGION=$2 10 | PYSPARK=$3 11 | 12 | OUTDIR=gs://$BUCKET/flights/sparkmloutput 13 | 14 | gsutil -m rm -r $OUTDIR 15 | 16 | # submit to existing cluster 17 | gsutil cp $PYSPARK $OUTDIR/$PYSPARK 18 | gcloud dataproc jobs submit pyspark \ 19 | --cluster ch7cluster --region $REGION \ 20 | $OUTDIR/$PYSPARK \ 21 | -- \ 22 | --bucket $BUCKET 23 | -------------------------------------------------------------------------------- /08_bqml/README.md: -------------------------------------------------------------------------------- 1 | # 8. Machine Learning with BigQuery ML 2 | 3 | ### Catch up from previous chapters if necessary 4 | If you didn't go through Chapters 2-7, the simplest way to catch up is to copy data from my bucket: 5 | 6 | #### Catch up from Chapters 2-7 7 | * Open CloudShell and git clone this repo: 8 | ``` 9 | git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp 10 | ``` 11 | * Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name. 12 | * Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name. 13 | * Go to the 05_bqnotebook folder of the repo, run the script to load data into BigQuery: 14 | ``` 15 | bash create_trainday.sh 16 | ``` 17 | 18 | ## This Chapter 19 | 20 | ### Vertex AI Workbench 21 | * Open a new notebook in Vertex AI Workbench from https://console.cloud.google.com/vertex-ai/workbench 22 | * Launch a new terminal window and type in it: 23 | ``` 24 | git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp 25 | ``` 26 | * In the navigation pane on the left, navigate to data-science-on-gcp/08_bqml 27 | 28 | ### Logistic regression using BigQuery ML 29 | * Open the notebook bqml_logistic.ipynb 30 | * Edit | Clear All Outputs 31 | * Run through the cells one-by-one, reading the commentary, looking at the code, and examining the output. 
32 | * Close the notebook 33 | * Click on the square icon on the left-most bar to view Running Terminals and Notebooks 34 | * Stop the notebook 35 | 36 | ### Other notebooks 37 | * Repeat the steps above for the following notebooks (in order) 38 | * bqml_nonlinear.ipynb 39 | * bqml_timewindow.ipynb 40 | * bqml_timetxf.ipynb 41 | 42 | -------------------------------------------------------------------------------- /09_vertexai/.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | trained_model 3 | babyweight 4 | -------------------------------------------------------------------------------- /09_vertexai/README.md: -------------------------------------------------------------------------------- 1 | # Machine Learning Classifier using TensorFlow 2 | 3 | ### Catch up from previous chapters if necessary 4 | If you didn't go through Chapters 2-7, the simplest way to catch up is to copy data from my bucket: 5 | 6 | #### Catch up from Chapters 2-7 7 | * Open CloudShell and git clone this repo: 8 | ``` 9 | git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp 10 | ``` 11 | * Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name. 12 | * Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name. 13 | * Go to the 05_bqnotebook folder of the repo, run the script to load data into BigQuery: 14 | ``` 15 | bash create_trainday.sh 16 | ``` 17 | 18 | ## This Chapter 19 | 20 | * Open a new notebook in Vertex AI Workbench from https://console.cloud.google.com/vertex-ai/workbench 21 | * Launch a new terminal window and type in it: 22 | ``` 23 | git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp 24 | ``` 25 | * In the navigation pane on the left, navigate to data-science-on-gcp/09_vertexai 26 | * Open the notebook flights_model_tf2.ipynb and run the cells. Note that the notebook has 27 | DEVELOP_MODE=True and so it will train on a very, very small amount of data. This is just 28 | to make sure the code works. 29 | 30 | 31 | ## Articles 32 | Some of the content in this chapter was published as blog posts (links below). 33 | 34 | * [Giving Vertex AI, the New Unified ML Platform on Google Cloud, a Spin](https://towardsdatascience.com/giving-vertex-ai-the-new-unified-ml-platform-on-google-cloud-a-spin-35e0f3852f25): 35 | Why do we need it, how good is the code-free ML training, really, and what does all this mean for data science jobs? 
36 | * [How to Deploy a TensorFlow Model to Vertex AI](https://towardsdatascience.com/how-to-deploy-a-tensorflow-model-to-vertex-ai-87d9ae1df56): Working with saved models and endpoints in Vertex AI 37 | 38 | 39 | -------------------------------------------------------------------------------- /09_vertexai/call_predict.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | REGION=us-central1 4 | ENDPOINT_NAME=flights 5 | 6 | ENDPOINT_ID=$(gcloud ai endpoints list --region=$REGION \ 7 | --format='value(ENDPOINT_ID)' --filter=display_name=${ENDPOINT_NAME} \ 8 | --sort-by=creationTimeStamp | tail -1) 9 | echo $ENDPOINT_ID 10 | gcloud ai endpoints predict $ENDPOINT_ID --region=$REGION --json-request=example_input.json 11 | -------------------------------------------------------------------------------- /09_vertexai/example_input.json: -------------------------------------------------------------------------------- 1 | {"instances": [ 2 | {"dep_hour": 2, "is_weekday": 1, "dep_delay": 40, "taxi_out": 17, "distance": 41, "carrier": "AS", "dep_airport_lat": 58.42527778, "dep_airport_lon": -135.7075, "arr_airport_lat": 58.35472222, "arr_airport_lon": -134.57472222, "origin": "GST", "dest": "JNU"}, 3 | {"dep_hour": 22, "is_weekday": 0, "dep_delay": -7, "taxi_out": 7, "distance": 201, "carrier": "HA", "dep_airport_lat": 21.97611111, "dep_airport_lon": -159.33888889, "arr_airport_lat": 20.89861111, "arr_airport_lon": -156.43055556, "origin": "LIH", "dest": "OGG"} 4 | ]} 5 | -------------------------------------------------------------------------------- /09_vertexai/flights_model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GoogleCloudPlatform/data-science-on-gcp/652564b9feeeaab331ce27fdd672b8226ba1e837/09_vertexai/flights_model.png -------------------------------------------------------------------------------- /10_mlops/README.md: -------------------------------------------------------------------------------- 1 | # Machine Learning Classifier using TensorFlow 2 | 3 | ### Catch up from previous chapters if necessary 4 | If you didn't go through Chapters 2-7, the simplest way to catch up is to copy data from my bucket: 5 | 6 | #### Catch up from Chapters 2-7 7 | * Open CloudShell and git clone this repo: 8 | ``` 9 | git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp 10 | ``` 11 | * Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name. 12 | * Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name. 13 | * Go to the 05_bqnotebook folder of the repo, run the script to load data into BigQuery: 14 | ``` 15 | bash create_trainday.sh 16 | ``` 17 | * In this (10_mlops) folder, run the program ./ingest_from_crsbucket.sh and specify your bucket name. 
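If you want to run all of the catch-up steps in one shot, a minimal sketch looks like this (BUCKET_NAME is a placeholder for the bucket you created in Chapter 2; each script is the one referenced in the list above):
```
BUCKET_NAME=your-bucket-name

git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
cd data-science-on-gcp

(cd 02_ingest && ./ingest_from_crsbucket.sh $BUCKET_NAME)
(cd 04_streaming && ./ingest_from_crsbucket.sh $BUCKET_NAME)
(cd 05_bqnotebook && bash create_trainday.sh $BUCKET_NAME)
(cd 10_mlops && ./ingest_from_crsbucket.sh $BUCKET_NAME)
```
These scripts only copy already-prepared files from the book's bucket (and create the trainday table with CREATE OR REPLACE), so re-running them is harmless.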
18 | 19 | ## This Chapter 20 | 21 | In CloudShell, do the following steps: 22 | 23 | * Install the aiplatform library 24 | ``` 25 | pip3 install google-cloud-aiplatform cloudml-hypertune kfp 26 | ``` 27 | * Try running the standalone model file on a small sample: 28 | ``` 29 | python3 model.py --bucket --develop 30 | ``` 31 | * [Optional] Run a Vertex AI Pipeline on the small sample (will take about ten minutes): 32 | ``` 33 | python3 train_on_vertexai.py --project --bucket --develop 34 | ``` 35 | * Train on the full dataset using Vertex AI: 36 | ``` 37 | python3 train_on_vertexai.py --project --bucket 38 | ``` 39 | * Try calling the model using bash: 40 | ``` 41 | cd ../09_vertexai 42 | bash ./call_predict.sh 43 | cd ../10_mlops 44 | ``` 45 | * Try calling the model using Python: 46 | ``` 47 | python3 call_predict.py 48 | ``` 49 | * [Optional] Train an AutoML model using Vertex AI: 50 | ``` 51 | python3 train_on_vertexai.py --project --bucket --automl 52 | ``` 53 | * [Optional] Hyperparameter tune the custom model using Vertex AI: 54 | ``` 55 | python3 train_on_vertexai.py --project --bucket --num_hparam_trials 10 56 | ``` 57 | 58 | 59 | ## Articles 60 | Some of the content in this chapter was published as blog posts (links below). 61 | 62 | To try out the code in the articles without going through the chapter, copy the necessary data to your bucket: 63 | ``` 64 | gsutil cp gs://data-science-on-gcp/edition2/ch9/data/all.csv gs://BUCKET/ch9/data/all.csv 65 | ``` 66 | 67 | Now you will be able to run model.py and train_on_vertexai.py as in the directions above. 68 | 69 | * [Developing and Deploying a Machine Learning Model on Vertex AI using Python](https://medium.com/@lakshmanok/developing-and-deploying-a-machine-learning-model-on-vertex-ai-using-python-865b535814f8): Write training pipelines that will make your MLOps team happy 70 | * [How to build an MLOps pipeline for hyperparameter tuning in Vertex AI](https://lakshmanok.medium.com/how-to-build-an-mlops-pipeline-for-hyperparameter-tuning-in-vertex-ai-45cc2faf4ff5): 71 | Best practices to set up your model and orchestrator for hyperparameter tuning 72 | 73 | 74 | -------------------------------------------------------------------------------- /10_mlops/call_predict.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
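# This script looks up the most recently created Vertex AI endpoint whose
# display name is 'flights' and sends it two hard-coded example flights
# (the same instances as 09_vertexai/example_input.json) for online
# prediction, then prints the returned predictions.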
14 | 15 | import sys, json 16 | from google.cloud import aiplatform 17 | from google.cloud.aiplatform import gapic as aip 18 | 19 | ENDPOINT_NAME = 'flights' 20 | 21 | if __name__ == '__main__': 22 | 23 | endpoints = aiplatform.Endpoint.list( 24 | filter='display_name="{}"'.format(ENDPOINT_NAME), 25 | order_by='create_time desc' 26 | ) 27 | if len(endpoints) == 0: 28 | print("No endpoint named {}".format(ENDPOINT_NAME)) 29 | sys.exit(-1) 30 | 31 | endpoint = endpoints[0] 32 | 33 | input_data = {"instances": [ 34 | {"dep_hour": 2, "is_weekday": 1, "dep_delay": 40, "taxi_out": 17, "distance": 41, "carrier": "AS", 35 | "dep_airport_lat": 58.42527778, "dep_airport_lon": -135.7075, "arr_airport_lat": 58.35472222, 36 | "arr_airport_lon": -134.57472222, "origin": "GST", "dest": "JNU"}, 37 | {"dep_hour": 22, "is_weekday": 0, "dep_delay": -7, "taxi_out": 7, "distance": 201, "carrier": "HA", 38 | "dep_airport_lat": 21.97611111, "dep_airport_lon": -159.33888889, "arr_airport_lat": 20.89861111, 39 | "arr_airport_lon": -156.43055556, "origin": "LIH", "dest": "OGG"} 40 | ]} 41 | 42 | preds = endpoint.predict(input_data['instances']) 43 | print(preds) 44 | 45 | 46 | 47 | -------------------------------------------------------------------------------- /10_mlops/explanation-metadata.json: -------------------------------------------------------------------------------- 1 | { 2 | "inputs": { 3 | "dep_delay": { 4 | "inputTensorName": "dep_delay" 5 | }, 6 | "taxi_out": { 7 | "inputTensorName": "taxi_out" 8 | }, 9 | "distance": { 10 | "inputTensorName": "distance" 11 | }, 12 | "dep_hour": { 13 | "inputTensorName": "dep_hour" 14 | }, 15 | "is_weekday": { 16 | "inputTensorName": "is_weekday" 17 | }, 18 | "dep_airport_lat": { 19 | "inputTensorName": "dep_airport_lat" 20 | }, 21 | "dep_airport_lon": { 22 | "inputTensorName": "dep_airport_lon" 23 | }, 24 | "arr_airport_lat": { 25 | "inputTensorName": "arr_airport_lat" 26 | }, 27 | "arr_airport_lon": { 28 | "inputTensorName": "arr_airport_lon" 29 | }, 30 | "carrier": { 31 | "inputTensorName": "carrier" 32 | }, 33 | "origin": { 34 | "inputTensorName": "origin" 35 | }, 36 | "dest": { 37 | "inputTensorName": "dest" 38 | } 39 | }, 40 | "outputs": { 41 | "pred": { 42 | "outputTensorName": "pred" 43 | } 44 | } 45 | } 46 | -------------------------------------------------------------------------------- /10_mlops/ingest_from_crsbucket.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ "$#" -ne 1 ]; then 4 | echo "Usage: ./ingest_from_crsbucket.sh destination-bucket-name" 5 | exit 6 | fi 7 | 8 | BUCKET_NAME=$1 9 | 10 | for split in train eval all; do 11 | gsutil cp gs://data-science-on-gcp/edition2/ch9/data/${split}.csv gs://$BUCKET_NAME/ch9/data/${split}.csv 12 | done 13 | -------------------------------------------------------------------------------- /10_mlops/train_on_vertexai.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017-2021 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | import argparse 16 | import logging 17 | from datetime import datetime 18 | import tensorflow as tf 19 | 20 | from google.cloud import aiplatform 21 | from google.cloud.aiplatform import gapic as aip 22 | from google.cloud.aiplatform import hyperparameter_tuning as hpt 23 | from kfp.v2 import compiler, dsl 24 | 25 | ENDPOINT_NAME = 'flights' 26 | 27 | 28 | def train_custom_model(data_set, timestamp, develop_mode, cpu_only_mode, tf_version, extra_args=None): 29 | # Set up training and deployment infra 30 | 31 | if cpu_only_mode: 32 | train_image='us-docker.pkg.dev/vertex-ai/training/tf-cpu.{}:latest'.format(tf_version) 33 | deploy_image='us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.{}:latest'.format(tf_version) 34 | else: 35 | train_image = "us-docker.pkg.dev/vertex-ai/training/tf-gpu.{}:latest".format(tf_version) 36 | deploy_image = "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.{}:latest".format(tf_version) 37 | 38 | # train 39 | model_display_name = '{}-{}'.format(ENDPOINT_NAME, timestamp) 40 | job = aiplatform.CustomTrainingJob( 41 | display_name='train-{}'.format(model_display_name), 42 | script_path="model.py", 43 | container_uri=train_image, 44 | requirements=['cloudml-hypertune'], # any extra Python packages 45 | model_serving_container_image_uri=deploy_image 46 | ) 47 | model_args = [ 48 | '--bucket', BUCKET, 49 | ] 50 | if develop_mode: 51 | model_args += ['--develop'] 52 | if extra_args: 53 | model_args += extra_args 54 | 55 | if cpu_only_mode: 56 | model = job.run( 57 | dataset=data_set, 58 | # See https://googleapis.dev/python/aiplatform/latest/aiplatform.html# 59 | predefined_split_column_name='data_split', 60 | model_display_name=model_display_name, 61 | args=model_args, 62 | replica_count=1, 63 | machine_type='n1-standard-4', 64 | sync=develop_mode 65 | ) 66 | else: 67 | model = job.run( 68 | dataset=data_set, 69 | # See https://googleapis.dev/python/aiplatform/latest/aiplatform.html# 70 | predefined_split_column_name='data_split', 71 | model_display_name=model_display_name, 72 | args=model_args, 73 | replica_count=1, 74 | machine_type='n1-standard-4', 75 | # See https://cloud.google.com/vertex-ai/docs/general/locations#accelerators 76 | accelerator_type=aip.AcceleratorType.NVIDIA_TESLA_T4.name, 77 | accelerator_count=1, 78 | sync=develop_mode 79 | ) 80 | return model 81 | 82 | 83 | def train_automl_model(data_set, timestamp, develop_mode): 84 | # train 85 | model_display_name = '{}-{}'.format(ENDPOINT_NAME, timestamp) 86 | job = aiplatform.AutoMLTabularTrainingJob( 87 | display_name='train-{}'.format(model_display_name), 88 | optimization_prediction_type='classification' 89 | ) 90 | model = job.run( 91 | dataset=data_set, 92 | # See https://googleapis.dev/python/aiplatform/latest/aiplatform.html# 93 | predefined_split_column_name='data_split', 94 | target_column='ontime', 95 | model_display_name=model_display_name, 96 | budget_milli_node_hours=(300 if develop_mode else 2000), 97 | disable_early_stopping=False, 98 | export_evaluated_data_items=True, 99 | export_evaluated_data_items_bigquery_destination_uri='{}:dsongcp.ch9_automl_evaluated'.format(PROJECT), 100 | export_evaluated_data_items_override_destination=True, 101 | sync=develop_mode 102 | ) 103 | return model 104 | 105 | 106 | def do_hyperparameter_tuning(data_set, timestamp, develop_mode, cpu_only_mode, tf_version): 107 | # Vertex AI services require regional API endpoints. 
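    # Below: pick the prebuilt Vertex AI training container (CPU-only or GPU)
    # for the requested TensorFlow version, define a single trial as a
    # CustomJob that runs model.py on a reduced sample, let the
    # HyperparameterTuningJob search train_batch_size, nbuckets and
    # dnn_hidden_units to minimize val_rmse, and finally retrain on the full
    # data with the best trial's parameters via train_custom_model().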
108 | if cpu_only_mode: 109 | train_image='us-docker.pkg.dev/vertex-ai/training/tf-cpu.{}:latest'.format(tf_version) 110 | else: 111 | train_image = "us-docker.pkg.dev/vertex-ai/training/tf-gpu.{}:latest".format(tf_version) 112 | 113 | # a single trial job 114 | model_display_name = '{}-{}'.format(ENDPOINT_NAME, timestamp) 115 | if cpu_only_mode: 116 | trial_job = aiplatform.CustomJob.from_local_script( 117 | display_name='train-{}'.format(model_display_name), 118 | script_path="model.py", 119 | container_uri=train_image, 120 | args=[ 121 | '--bucket', BUCKET, 122 | '--skip_full_eval', # no need to evaluate on test data set 123 | '--num_epochs', '10', 124 | '--num_examples', '500000' # 1/10 actual size to finish faster 125 | ], 126 | requirements=['cloudml-hypertune'], # any extra Python packages 127 | replica_count=1, 128 | machine_type='n1-standard-4' 129 | ) 130 | else: 131 | trial_job = aiplatform.CustomJob.from_local_script( 132 | display_name='train-{}'.format(model_display_name), 133 | script_path="model.py", 134 | container_uri=train_image, 135 | args=[ 136 | '--bucket', BUCKET, 137 | '--skip_full_eval', # no need to evaluate on test data set 138 | '--num_epochs', '10', 139 | '--num_examples', '500000' # 1/10 actual size to finish faster 140 | ], 141 | requirements=['cloudml-hypertune'], # any extra Python packages 142 | replica_count=1, 143 | machine_type='n1-standard-4', 144 | # See https://cloud.google.com/vertex-ai/docs/general/locations#accelerators 145 | accelerator_type=aip.AcceleratorType.NVIDIA_TESLA_T4.name, 146 | accelerator_count=1, 147 | ) 148 | 149 | # the tuning job 150 | hparam_job = aiplatform.HyperparameterTuningJob( 151 | # See https://googleapis.dev/python/aiplatform/latest/aiplatform.html# 152 | display_name='hparam-{}'.format(model_display_name), 153 | custom_job=trial_job, 154 | metric_spec={'val_rmse': 'minimize'}, 155 | parameter_spec={ 156 | "train_batch_size": hpt.IntegerParameterSpec(min=16, max=256, scale='log'), 157 | "nbuckets": hpt.IntegerParameterSpec(min=5, max=10, scale='linear'), 158 | "dnn_hidden_units": hpt.CategoricalParameterSpec(values=["64,16", "64,16,4", "64,64,64,8", "256,64,16"]) 159 | }, 160 | max_trial_count=2 if develop_mode else NUM_HPARAM_TRIALS, 161 | parallel_trial_count=2, 162 | search_algorithm=None, # Bayesian 163 | ) 164 | 165 | hparam_job.run(sync=True) # has to finish before we can get trials. 166 | 167 | # get the parameters corresponding to the best trial 168 | best = sorted(hparam_job.trials, key=lambda x: x.final_measurement.metrics[0].value)[0] 169 | logging.info('Best trial: {}'.format(best)) 170 | best_params = [] 171 | for param in best.parameters: 172 | best_params.append('--{}'.format(param.parameter_id)) 173 | 174 | if param.parameter_id in ["train_batch_size", "nbuckets"]: 175 | # hparam returns 10.0 even though it's an integer param. so round it. 176 | # but CustomTrainingJob makes integer args into floats. 
so make it a string 177 | best_params.append(str(int(round(param.value)))) 178 | else: 179 | # string or float parameters 180 | best_params.append(param.value) 181 | 182 | # run the best trial to completion 183 | logging.info('Launching full training job with {}'.format(best_params)) 184 | return train_custom_model(data_set, timestamp, develop_mode, cpu_only_mode, tf_version, extra_args=best_params) 185 | 186 | 187 | @dsl.pipeline(name="flights-ch9-pipeline", 188 | description="ds-on-gcp ch9 flights pipeline" 189 | ) 190 | def main(): 191 | aiplatform.init(project=PROJECT, location=REGION, staging_bucket='gs://{}'.format(BUCKET)) 192 | 193 | # create data set 194 | all_files = tf.io.gfile.glob('gs://{}/ch9/data/all*.csv'.format(BUCKET)) 195 | logging.info("Training on {}".format(all_files)) 196 | data_set = aiplatform.TabularDataset.create( 197 | display_name='data-{}'.format(ENDPOINT_NAME), 198 | gcs_source=all_files 199 | ) 200 | if TF_VERSION is not None: 201 | tf_version = TF_VERSION.replace(".", "-") 202 | else: 203 | tf_version = '2-' + tf.__version__[2:3] 204 | 205 | # train 206 | if AUTOML: 207 | model = train_automl_model(data_set, TIMESTAMP, DEVELOP_MODE) 208 | elif NUM_HPARAM_TRIALS > 1: 209 | model = do_hyperparameter_tuning(data_set, TIMESTAMP, DEVELOP_MODE, CPU_ONLY_MODE, tf_version) 210 | else: 211 | model = train_custom_model(data_set, TIMESTAMP, DEVELOP_MODE, CPU_ONLY_MODE, tf_version) 212 | 213 | # create endpoint if it doesn't already exist 214 | endpoints = aiplatform.Endpoint.list( 215 | filter='display_name="{}"'.format(ENDPOINT_NAME), 216 | order_by='create_time desc', 217 | project=PROJECT, location=REGION, 218 | ) 219 | if len(endpoints) > 0: 220 | endpoint = endpoints[0] # most recently created 221 | else: 222 | endpoint = aiplatform.Endpoint.create( 223 | display_name=ENDPOINT_NAME, project=PROJECT, location=REGION, 224 | sync=DEVELOP_MODE 225 | ) 226 | 227 | # deploy 228 | model.deploy( 229 | endpoint=endpoint, 230 | traffic_split={"0": 100}, 231 | machine_type='n1-standard-2', 232 | min_replica_count=1, 233 | max_replica_count=1, 234 | sync=DEVELOP_MODE 235 | ) 236 | 237 | if DEVELOP_MODE: 238 | model.wait() 239 | 240 | 241 | def run_pipeline(): 242 | compiler.Compiler().compile(pipeline_func=main, package_path='flights_pipeline.json') 243 | 244 | job = aip.PipelineJob( 245 | display_name="{}-pipeline".format(ENDPOINT_NAME), 246 | template_path="{}_pipeline.json".format(ENDPOINT_NAME), 247 | pipeline_root="{}/pipeline_root/intro".format(BUCKET), 248 | enable_caching=False 249 | ) 250 | 251 | job.run() 252 | 253 | 254 | if __name__ == '__main__': 255 | parser = argparse.ArgumentParser() 256 | 257 | parser.add_argument( 258 | '--bucket', 259 | help='Data will be read from gs://BUCKET/ch9/data and checkpoints will be in gs://BUCKET/ch9/trained_model', 260 | required=True 261 | ) 262 | parser.add_argument( 263 | '--region', 264 | help='Where to run the trainer', 265 | default='us-central1' 266 | ) 267 | parser.add_argument( 268 | '--project', 269 | help='Project to be billed', 270 | required=True 271 | ) 272 | parser.add_argument( 273 | '--develop', 274 | help='Train on a small subset in development', 275 | dest='develop', 276 | action='store_true') 277 | parser.set_defaults(develop=False) 278 | parser.add_argument( 279 | '--automl', 280 | help='Train an AutoML Table, instead of using model.py', 281 | dest='automl', 282 | action='store_true') 283 | parser.set_defaults(automl=False) 284 | parser.add_argument( 285 | '--num_hparam_trials', 286 | help='Number of 
hyperparameter trials. 0/1 means no hyperparam. Ignored if --automl is set.', 287 | type=int, 288 | default=0) 289 | parser.add_argument( 290 | '--pipeline', 291 | help='Run as pipeline', 292 | dest='pipeline', 293 | action='store_true') 294 | parser.add_argument( 295 | '--cpuonly', 296 | help='Run without GPU', 297 | dest='cpuonly', 298 | action='store_true') 299 | parser.set_defaults(cpuonly=False) 300 | parser.add_argument( 301 | '--tfversion', 302 | help='TensorFlow version to use' 303 | ) 304 | 305 | # parse args 306 | logging.getLogger().setLevel(logging.INFO) 307 | args = parser.parse_args().__dict__ 308 | BUCKET = args['bucket'] 309 | PROJECT = args['project'] 310 | REGION = args['region'] 311 | DEVELOP_MODE = args['develop'] 312 | CPU_ONLY_MODE = args['cpuonly'] 313 | TF_VERSION = args['tfversion'] 314 | AUTOML = args['automl'] 315 | NUM_HPARAM_TRIALS = args['num_hparam_trials'] 316 | TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S") 317 | 318 | if args['pipeline']: 319 | run_pipeline() 320 | else: 321 | main() -------------------------------------------------------------------------------- /11_realtime/.gitignore: -------------------------------------------------------------------------------- 1 | model.py 2 | train_on_vertexai.py 3 | *.egg-info 4 | call_predict.py 5 | -------------------------------------------------------------------------------- /11_realtime/README.md: -------------------------------------------------------------------------------- 1 | # Machine Learning on Streaming Pipelines 2 | 3 | ### Catch up from previous chapters if necessary 4 | If you didn't go through Chapters 2-9, the simplest way to catch up is to copy data from my bucket: 5 | 6 | #### Catch up from Chapters 2-9 7 | * Open CloudShell and git clone this repo: 8 | ``` 9 | git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp 10 | ``` 11 | * Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name. 12 | * Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name. 13 | * Go to the 05_bqnotebook folder of the repo, run the program ./create_trainday.sh and specify your bucket name. 14 | * Go to the 10_mlops folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name. 15 | 16 | #### From CloudShell 17 | * Install the Python libraries you'll need 18 | ``` 19 | pip3 install google-cloud-aiplatform cloudml-hypertune pyfarmhash 20 | ``` 21 | * [Optional] Create a small, local sample of BigQuery datasets for local experimentation: 22 | ``` 23 | bash create_sample_input.sh 24 | ``` 25 | * [Optional] Run a local pipeline to create a training dataset: 26 | ``` 27 | python3 create_traindata.py --input local 28 | ``` 29 | Verify the results: 30 | ``` 31 | cat /tmp/all_data* 32 | ``` 33 | * Run a Dataflow pipeline to create the full training dataset: 34 | ``` 35 | python3 create_traindata.py --input bigquery --project --bucket --region 36 | ``` 37 | Note if you get an error similar to: 38 | ``` 39 | AttributeError: Can't get attribute '_create_code' on 40 | ``` 41 | it is because the global version of your modules are ahead/behind of what Apache Beam on the server requires. 
Make sure to submit Apache Beam code to Dataflow from a pristine virtual environment that has only the modules you need: 42 | ``` 43 | python -m venv ~/beamenv 44 | source ~/beamenv/bin/activate 45 | pip install apache-beam[gcp] google-cloud-aiplatform cloudml-hypertune pyfarmhash pyparsing==2.4.2 46 | python3 create_traindata.py ... 47 | ``` 48 | Note that beamenv is only for submitting to Dataflow. Run train_on_vertexai.py and other code directly in the terminal. 49 | * Run script that copies over the Ch10 model.py and train_on_vertexai.py files and makes the necessary changes: 50 | ``` 51 | python3 change_ch10_files.py 52 | ``` 53 | * [Optional] Train an AutoML model on the enriched dataset: 54 | ``` 55 | python3 train_on_vertexai.py --automl --project --bucket --region 56 | ``` 57 | Verify performance by running the following BigQuery query: 58 | ``` 59 | SELECT 60 | SQRT(SUM( 61 | (CAST(ontime AS FLOAT64) - predicted_ontime.scores[OFFSET(0)])* 62 | (CAST(ontime AS FLOAT64) - predicted_ontime.scores[OFFSET(0)]) 63 | )/COUNT(*)) 64 | FROM dsongcp.ch11_automl_evaluated 65 | ``` 66 | * Train custom ML model on the enriched dataset: 67 | ``` 68 | python3 train_on_vertexai.py --project --bucket --region 69 | ``` 70 | Look at the logs of the log to determine the final RMSE. 71 | * Run a local pipeline to invoke predictions: 72 | ``` 73 | python3 make_predictions.py --input local 74 | ``` 75 | Verify the results: 76 | ``` 77 | cat /tmp/predictions* 78 | ``` 79 | * [Optional] Run a pipeline on full BigQuery dataset to invoke predictions: 80 | ``` 81 | python3 make_predictions.py --input bigquery --project --bucket --region 82 | ``` 83 | Verify the results 84 | ``` 85 | gsutil cat gs://BUCKET/flights/ch11/predictions* | head -5 86 | ``` 87 | * [Optional] Simulate real-time pipeline and check to see if predictions are being made 88 | 89 | 90 | In one terminal, type: 91 | ``` 92 | cd ../04_streaming/simulate 93 | python3 ./simulate.py --startTime '2015-05-01 00:00:00 UTC' \ 94 | --endTime '2015-05-04 00:00:00 UTC' --speedFactor=30 --project 95 | ``` 96 | 97 | In another terminal type: 98 | ``` 99 | python3 make_predictions.py --input pubsub \ 100 | --project --bucket --region 101 | ``` 102 | 103 | Ensure that the pipeline starts, check that output elements are starting to be written out, do: 104 | ``` 105 | gsutil ls gs://BUCKET/flights/ch11/predictions* 106 | ``` 107 | Make sure to go to the GCP Console and stop the Dataflow pipeline. 108 | 109 | 110 | * Simulate real-time pipeline and try out different jagger etc. 111 | 112 | In one terminal, type: 113 | ``` 114 | cd ../04_streaming/simulate 115 | python3 ./simulate.py --startTime '2015-02-01 00:00:00 UTC' \ 116 | --endTime '2015-02-03 00:00:00 UTC' --speedFactor=30 --project 117 | ``` 118 | 119 | In another terminal type: 120 | ``` 121 | python3 make_predictions.py --input pubsub --output bigquery \ 122 | --project --bucket --region 123 | ``` 124 | 125 | Ensure that the pipeline starts, look at BigQuery: 126 | ``` 127 | SELECT * FROM dsongcp.streaming_preds ORDER BY event_time DESC LIMIT 10 128 | ``` 129 | When done, make sure to go to the GCP Console and stop the Dataflow pipeline. 
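If you prefer to stop the streaming job from the command line instead of the console, something like the following should work (REGION is the region you passed to make_predictions.py, and JOB_ID comes from the list command):
```
gcloud dataflow jobs list --region=REGION --status=active
gcloud dataflow jobs cancel JOB_ID --region=REGION
```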
130 | 131 | Note: If you are going to try it a second time around, delete the BigQuery sink, or simulate with a different time range 132 | ``` 133 | bq rm -f dsongcp.streaming_preds 134 | ``` 135 | 136 | -------------------------------------------------------------------------------- /11_realtime/change_ch10_files.py: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | import os 16 | 17 | CHANGES = [ 18 | # both 19 | ("ch9", "ch11"), 20 | 21 | # train_on_vertexai.py 22 | ("ENDPOINT_NAME = 'flights'", "ENDPOINT_NAME = 'flights-ch11'"), 23 | 24 | # model.py 25 | ("arr_airport_lat,arr_airport_lon", "arr_airport_lat,arr_airport_lon,avg_dep_delay,avg_taxi_out"), 26 | ("43.41694444, -124.24694444, 39.86166667, -104.67305556, 'TRAIN'", 27 | "43.41694444, -124.24694444, 39.86166667, -104.67305556, -3.0, 5.0, 'TRAIN'"), 28 | 29 | # call_predict.py 30 | ('"carrier": "AS"', '"carrier": "AS", "avg_dep_delay": -3.0, "avg_taxi_out": 5.0'), 31 | ('"carrier": "HA"', '"carrier": "HA", "avg_dep_delay": 3.0, "avg_taxi_out": 8.0'), 32 | ] 33 | 34 | for filename in ['train_on_vertexai.py', 'model.py', 'call_predict.py']: 35 | in_filename = os.path.join('../10_mlops', filename) 36 | with open(in_filename, "r") as ifp: 37 | with open(filename, "w") as ofp: 38 | ofp.write("#### DO NOT EDIT! Autogenerated from {}".format(in_filename)) 39 | for line in ifp.readlines(): 40 | for change in CHANGES: 41 | new_line = line.replace(change[0], change[1]) 42 | if new_line != line: 43 | print('<<' + line + '>>' + new_line) 44 | line = new_line 45 | ofp.write(line) 46 | 47 | print("*** Wrote out {}".format(filename)) 48 | -------------------------------------------------------------------------------- /11_realtime/create_sample_input.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | bq query --nouse_legacy_sql --format=sparse \ 3 | "SELECT EVENT_DATA FROM dsongcp.flights_simevents WHERE EVENT_TYPE = 'wheelsoff' AND EVENT_TIME BETWEEN '2015-03-10T10:00:00' AND '2015-03-10T14:00:00' " \ 4 | | grep FL_DATE \ 5 | > simevents_sample.json 6 | 7 | 8 | bq query --nouse_legacy_sql --format=json \ 9 | "SELECT * FROM dsongcp.flights_tzcorr WHERE DEP_TIME BETWEEN '2015-03-10T10:00:00' AND '2015-03-10T14:00:00' " \ 10 | | sed 's/\[//g' | sed 's/\]//g' | sed s'/\},/\}\n/g' \ 11 | > alldata_sample.json 12 | -------------------------------------------------------------------------------- /11_realtime/create_traindata.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2021 Google Inc. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 
7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | import apache_beam as beam 18 | import logging 19 | import os 20 | import json 21 | 22 | from flightstxf import flights_transforms as ftxf 23 | 24 | CSV_HEADER = 'ontime,dep_delay,taxi_out,distance,origin,dest,dep_hour,is_weekday,carrier,dep_airport_lat,dep_airport_lon,arr_airport_lat,arr_airport_lon,avg_dep_delay,avg_taxi_out,data_split' 25 | 26 | 27 | def dict_to_csv(f): 28 | try: 29 | yield ','.join([str(x) for x in f.values()]) 30 | except Exception as e: 31 | logging.warning('Ignoring {} because: {}'.format(f, e), exc_info=True) 32 | pass 33 | 34 | 35 | def run(project, bucket, region, input): 36 | if input == 'local': 37 | logging.info('Running locally on small extract') 38 | argv = [ 39 | '--runner=DirectRunner' 40 | ] 41 | flights_output = '/tmp/' 42 | else: 43 | logging.info('Running in the cloud on full dataset input={}'.format(input)) 44 | argv = [ 45 | '--project={0}'.format(project), 46 | '--job_name=ch11traindata', 47 | # '--save_main_session', # not needed as we are running as a package now 48 | '--staging_location=gs://{0}/flights/staging/'.format(bucket), 49 | '--temp_location=gs://{0}/flights/temp/'.format(bucket), 50 | '--setup_file=./setup.py', 51 | '--autoscaling_algorithm=THROUGHPUT_BASED', 52 | '--max_num_workers=20', 53 | # '--max_num_workers=4', '--worker_machine_type=m1-ultramem-40', '--disk_size_gb=500', # for full 2015-2019 dataset 54 | '--region={}'.format(region), 55 | '--runner=DataflowRunner' 56 | ] 57 | flights_output = 'gs://{}/ch11/data/'.format(bucket) 58 | 59 | with beam.Pipeline(argv=argv) as pipeline: 60 | 61 | # read the event stream 62 | if input == 'local': 63 | input_file = './alldata_sample.json' 64 | logging.info("Reading from {} ... Writing to {}".format(input_file, flights_output)) 65 | events = ( 66 | pipeline 67 | | 'read_input' >> beam.io.ReadFromText(input_file) 68 | | 'parse_input' >> beam.Map(lambda line: json.loads(line)) 69 | ) 70 | elif input == 'bigquery': 71 | input_table = 'dsongcp.flights_tzcorr' 72 | logging.info("Reading from {} ... Writing to {}".format(input_table, flights_output)) 73 | events = ( 74 | pipeline 75 | | 'read_input' >> beam.io.ReadFromBigQuery(table=input_table) 76 | ) 77 | else: 78 | logging.error("Unknown input type {}".format(input)) 79 | return 80 | 81 | # events -> features. 
See ./flights_transforms.py for the code shared between training & prediction 82 | features = ftxf.transform_events_to_features(events) 83 | 84 | # shuffle globally so that we are not at mercy of TensorFlow's shuffle buffer 85 | features = ( 86 | features 87 | | 'into_global' >> beam.WindowInto(beam.window.GlobalWindows()) 88 | | 'shuffle' >> beam.util.Reshuffle() 89 | ) 90 | 91 | # write out 92 | for split in ['ALL', 'TRAIN', 'VALIDATE', 'TEST']: 93 | feats = features 94 | if split != 'ALL': 95 | feats = feats | 'only_{}'.format(split) >> beam.Filter(lambda f: f['data_split'] == split) 96 | ( 97 | feats 98 | | '{}_to_string'.format(split) >> beam.FlatMap(dict_to_csv) 99 | | '{}_to_gcs'.format(split) >> beam.io.textio.WriteToText(os.path.join(flights_output, split.lower()), 100 | file_name_suffix='.csv', header=CSV_HEADER, 101 | # workaround b/207384805 102 | num_shards=1) 103 | ) 104 | 105 | 106 | if __name__ == '__main__': 107 | import argparse 108 | 109 | parser = argparse.ArgumentParser(description='Create training CSV file that includes time-aggregate features') 110 | parser.add_argument('-p', '--project', help='Project to be billed for Dataflow job. Omit if running locally.') 111 | parser.add_argument('-b', '--bucket', help='Training data will be written to gs://BUCKET/flights/ch11/') 112 | parser.add_argument('-r', '--region', help='Region to run Dataflow job. Choose the same region as your bucket.') 113 | parser.add_argument('-i', '--input', help='local OR bigquery', required=True) 114 | 115 | logging.getLogger().setLevel(logging.INFO) 116 | args = vars(parser.parse_args()) 117 | 118 | if args['input'] != 'local': 119 | if not args['bucket'] or not args['project'] or not args['region']: 120 | print("Project, Bucket, Region are needed in order to run on the cloud on full dataset.") 121 | parser.print_help() 122 | parser.exit() 123 | 124 | run(project=args['project'], bucket=args['bucket'], region=args['region'], input=args['input']) 125 | -------------------------------------------------------------------------------- /11_realtime/flightstxf/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GoogleCloudPlatform/data-science-on-gcp/652564b9feeeaab331ce27fdd672b8226ba1e837/11_realtime/flightstxf/__init__.py -------------------------------------------------------------------------------- /11_realtime/flightstxf/flights_transforms.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2021 Google Inc. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 
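# ---------------------------------------------------------------------------
# Shared Beam transforms used by both create_traindata.py (training) and
# make_predictions.py (batch/streaming prediction).
#
# transform_events_to_features(), defined at the bottom of this module:
#   1. timestamps each flight event with its WHEELS_OFF time (assign_timestamp)
#   2. drops cancelled or diverted flights (is_normal_operation)
#   3. keys events by ORIGIN airport and groups them into 60-minute sliding
#      windows that advance every 5 minutes (WINDOW_DURATION, WINDOW_EVERY)
#   4. computes the per-airport averages AVG_DEP_DELAY and AVG_TAXI_OUT for
#      each window and emits each event exactly once -- from the window whose
#      first 5 minutes contain it (add_stats)
#   5. converts each enriched event into the model-input dict; for training it
#      also adds the 'ontime' label and a farmhash-based data_split, while for
#      prediction it carries the WHEELS_OFF timestamp through as event_time
#      (create_features_and_label)
# ---------------------------------------------------------------------------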
16 | 17 | import apache_beam as beam 18 | import datetime as dt 19 | import logging 20 | import numpy as np 21 | import farmhash # pip install pyfarmhash 22 | 23 | DATETIME_FORMAT = '%Y-%m-%d %H:%M:%S' 24 | WINDOW_DURATION = 60 * 60 25 | WINDOW_EVERY = 5 * 60 26 | 27 | 28 | def get_data_split(fl_date): 29 | fl_date_str = str(fl_date) 30 | # Use farm fingerprint just like in BigQuery 31 | x = np.abs(np.uint64(farmhash.fingerprint64(fl_date_str)).astype('int64') % 100) 32 | if x < 60: 33 | data_split = 'TRAIN' 34 | elif x < 80: 35 | data_split = 'VALIDATE' 36 | else: 37 | data_split = 'TEST' 38 | return data_split 39 | 40 | 41 | def get_data_split_2019(fl_date): 42 | fl_date_str = str(fl_date) 43 | if fl_date_str > '2019': 44 | data_split = 'TEST' 45 | else: 46 | # Use farm fingerprint just like in BigQuery 47 | x = np.abs(np.uint64(farmhash.fingerprint64(fl_date_str)).astype('int64') % 100) 48 | if x < 95: 49 | data_split = 'TRAIN' 50 | else: 51 | data_split = 'VALIDATE' 52 | return data_split 53 | 54 | 55 | def to_datetime(event_time): 56 | if isinstance(event_time, str): 57 | # In BigQuery, this is a datetime.datetime. In JSON, it's a string 58 | # sometimes it has a T separating the date, sometimes it doesn't 59 | # Handle all the possibilities 60 | event_time = dt.datetime.strptime(event_time.replace('T', ' '), DATETIME_FORMAT) 61 | return event_time 62 | 63 | 64 | def approx_miles_between(lat1, lon1, lat2, lon2): 65 | # convert to radians 66 | lat1 = float(lat1) * np.pi / 180.0 67 | lat2 = float(lat2) * np.pi / 180.0 68 | lon1 = float(lon1) * np.pi / 180.0 69 | lon2 = float(lon2) * np.pi / 180.0 70 | 71 | # apply Haversine formula 72 | d_lat = lat2 - lat1 73 | d_lon = lon2 - lon1 74 | a = (pow(np.sin(d_lat / 2), 2) + 75 | pow(np.sin(d_lon / 2), 2) * 76 | np.cos(lat1) * np.cos(lat2)); 77 | c = 2 * np.arcsin(np.sqrt(a)) 78 | return float(6371 * c * 0.621371) # miles 79 | 80 | 81 | def create_features_and_label(event, for_training): 82 | try: 83 | model_input = {} 84 | 85 | if for_training: 86 | model_input.update({ 87 | 'ontime': 1.0 if float(event['ARR_DELAY'] or 0) < 15 else 0, 88 | }) 89 | 90 | # features for both training and prediction 91 | model_input.update({ 92 | # same as in ch9 93 | 'dep_delay': event['DEP_DELAY'], 94 | 'taxi_out': event['TAXI_OUT'], 95 | # distance is not in wheelsoff 96 | 'distance': approx_miles_between(event['DEP_AIRPORT_LAT'], event['DEP_AIRPORT_LON'], 97 | event['ARR_AIRPORT_LAT'], event['ARR_AIRPORT_LON']), 98 | 'origin': event['ORIGIN'], 99 | 'dest': event['DEST'], 100 | 'dep_hour': to_datetime(event['DEP_TIME']).hour, 101 | 'is_weekday': 1.0 if to_datetime(event['DEP_TIME']).isoweekday() < 6 else 0.0, 102 | 'carrier': event['UNIQUE_CARRIER'], 103 | 'dep_airport_lat': event['DEP_AIRPORT_LAT'], 104 | 'dep_airport_lon': event['DEP_AIRPORT_LON'], 105 | 'arr_airport_lat': event['ARR_AIRPORT_LAT'], 106 | 'arr_airport_lon': event['ARR_AIRPORT_LON'], 107 | # newly computed averages 108 | 'avg_dep_delay': event['AVG_DEP_DELAY'], 109 | 'avg_taxi_out': event['AVG_TAXI_OUT'], 110 | 111 | }) 112 | 113 | if for_training: 114 | model_input.update({ 115 | # training data split 116 | 'data_split': get_data_split(event['FL_DATE']) 117 | }) 118 | else: 119 | model_input.update({ 120 | # prediction output should include timestamp 121 | 'event_time': event['WHEELS_OFF'] 122 | }) 123 | 124 | yield model_input 125 | except Exception as e: 126 | # if any key is not present, don't use for training 127 | logging.warning('Ignoring {} because: {}'.format(event, e), exc_info=True) 
128 | pass 129 | 130 | 131 | def compute_mean(events, col_name): 132 | values = [float(event[col_name]) for event in events if col_name in event and event[col_name]] 133 | return float(np.mean(values)) if len(values) > 0 else None 134 | 135 | 136 | def add_stats(element, window=beam.DoFn.WindowParam): 137 | # result of a group-by, so this will be called once for each airport and window 138 | # all averages here are by airport 139 | airport = element[0] 140 | events = element[1] 141 | 142 | # how late are flights leaving? 143 | avg_dep_delay = compute_mean(events, 'DEP_DELAY') 144 | avg_taxiout = compute_mean(events, 'TAXI_OUT') 145 | 146 | # remember that an event will be present for 60 minutes, but we want to emit 147 | # it only if it has just arrived (if it is within 5 minutes of the start of the window) 148 | emit_end_time = window.start + WINDOW_EVERY 149 | for event in events: 150 | event_time = to_datetime(event['WHEELS_OFF']).timestamp() 151 | if event_time < emit_end_time: 152 | event_plus_stat = event.copy() 153 | event_plus_stat['AVG_DEP_DELAY'] = avg_dep_delay 154 | event_plus_stat['AVG_TAXI_OUT'] = avg_taxiout 155 | yield event_plus_stat 156 | 157 | 158 | def assign_timestamp(event): 159 | try: 160 | event_time = to_datetime(event['WHEELS_OFF']) 161 | yield beam.window.TimestampedValue(event, event_time.timestamp()) 162 | except: 163 | pass 164 | 165 | 166 | def is_normal_operation(event): 167 | for flag in ['CANCELLED', 'DIVERTED']: 168 | if flag in event: 169 | s = str(event[flag]).lower() 170 | if s == 'true': 171 | return False; # cancelled or diverted 172 | return True # normal operation 173 | 174 | 175 | def transform_events_to_features(events, for_training=True): 176 | # events are assigned the time at which predictions will have to be made -- the wheels off time 177 | events = events | 'assign_time' >> beam.FlatMap(assign_timestamp) 178 | events = events | 'remove_cancelled' >> beam.Filter(is_normal_operation) 179 | 180 | # compute stats by airport, and add to events 181 | features = ( 182 | events 183 | | 'window' >> beam.WindowInto(beam.window.SlidingWindows(WINDOW_DURATION, WINDOW_EVERY)) 184 | | 'by_airport' >> beam.Map(lambda x: (x['ORIGIN'], x)) 185 | | 'group_by_airport' >> beam.GroupByKey() 186 | | 'events_and_stats' >> beam.FlatMap(add_stats) 187 | | 'events_to_features' >> beam.FlatMap(lambda x: create_features_and_label(x, for_training)) 188 | ) 189 | 190 | return features 191 | -------------------------------------------------------------------------------- /11_realtime/make_predictions.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2021 Google Inc. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 
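# ---------------------------------------------------------------------------
# Prediction pipeline. Reads flight events from a local JSON sample, from
# BigQuery (dsongcp.flights_simevents), or from the Pub/Sub topic 'wheelsoff';
# converts them to model features with flightstxf.flights_transforms
# (for_training=False); batches the feature dicts; invokes the Vertex AI
# endpoint named 'flights-ch11' via FlightsModelInvoker below; and writes the
# features plus the predicted prob_ontime either to CSV files (under /tmp or
# gs://BUCKET/flights/ch11/) or to the BigQuery table dsongcp.streaming_preds.
# See the argparse flags at the bottom of this file for the available options.
# ---------------------------------------------------------------------------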
16 | 17 | import apache_beam as beam 18 | import logging 19 | import json 20 | import os 21 | 22 | from flightstxf import flights_transforms as ftxf 23 | 24 | 25 | CSV_HEADER = 'event_time,dep_delay,taxi_out,distance,origin,dest,dep_hour,is_weekday,carrier,dep_airport_lat,dep_airport_lon,arr_airport_lat,arr_airport_lon,avg_dep_delay,avg_taxi_out,prob_ontime' 26 | 27 | 28 | # class FlightsModelSharedInvoker(beam.DoFn): 29 | # # https://beam.apache.org/releases/pydoc/2.24.0/apache_beam.utils.shared.html 30 | # def __init__(self, shared_handle): 31 | # self._shared_handle = shared_handle 32 | # 33 | # def process(self, input_data): 34 | # def create_endpoint(): 35 | # from google.cloud import aiplatform 36 | # endpoint_name = 'flights-ch10' 37 | # endpoints = aiplatform.Endpoint.list( 38 | # filter='display_name="{}"'.format(endpoint_name), 39 | # order_by='create_time desc' 40 | # ) 41 | # if len(endpoints) == 0: 42 | # raise EnvironmentError("No endpoint named {}".format(endpoint_name)) 43 | # logging.info("Found endpoint {}".format(endpoints[0])) 44 | # return endpoints[0] 45 | # 46 | # # get already created endpoint if possible 47 | # endpoint = self._shared_handle.acquire(create_endpoint) 48 | # 49 | # # call predictions and pull out probability 50 | # logging.info("Invoking ML model on {} flights".format(len(input_data))) 51 | # predictions = endpoint.predict(input_data).predictions 52 | # for idx, input_instance in enumerate(input_data): 53 | # result = input_instance.copy() 54 | # result['prob_ontime'] = predictions[idx][0] 55 | # yield result 56 | 57 | 58 | class FlightsModelInvoker(beam.DoFn): 59 | def __init__(self): 60 | self.endpoint = None 61 | 62 | def setup(self): 63 | from google.cloud import aiplatform 64 | endpoint_name = 'flights-ch11' 65 | endpoints = aiplatform.Endpoint.list( 66 | filter='display_name="{}"'.format(endpoint_name), 67 | order_by='create_time desc' 68 | ) 69 | if len(endpoints) == 0: 70 | raise EnvironmentError("No endpoint named {}".format(endpoint_name)) 71 | logging.info("Found endpoint {}".format(endpoints[0])) 72 | self.endpoint = endpoints[0] 73 | 74 | def process(self, input_data): 75 | # call predictions and pull out probability 76 | logging.info("Invoking ML model on {} flights".format(len(input_data))) 77 | # drop inputs not needed by model 78 | features = [x.copy() for x in input_data] 79 | for f in features: 80 | f.pop('event_time') 81 | # call model 82 | predictions = self.endpoint.predict(features).predictions 83 | for idx, input_instance in enumerate(input_data): 84 | result = input_instance.copy() 85 | result['prob_ontime'] = predictions[idx][0] 86 | yield result 87 | 88 | 89 | def run(project, bucket, region, source, sink): 90 | if source == 'local': 91 | logging.info('Running locally on small extract') 92 | argv = [ 93 | '--project={0}'.format(project), 94 | '--runner=DirectRunner' 95 | ] 96 | flights_output = '/tmp/predictions' 97 | else: 98 | logging.info('Running in the cloud on full dataset input={}'.format(source)) 99 | argv = [ 100 | '--project={0}'.format(project), 101 | '--job_name=ch10predictions', 102 | '--save_main_session', 103 | '--staging_location=gs://{0}/flights/staging/'.format(bucket), 104 | '--temp_location=gs://{0}/flights/temp/'.format(bucket), 105 | '--setup_file=./setup.py', 106 | '--autoscaling_algorithm=THROUGHPUT_BASED', 107 | '--max_num_workers=8', 108 | '--region={}'.format(region), 109 | '--runner=DataflowRunner' 110 | ] 111 | if source == 'pubsub': 112 | logging.info("Turning on streaming. 
Cancel the pipeline from GCP console") 113 | argv += ['--streaming'] 114 | flights_output = 'gs://{}/flights/ch11/predictions'.format(bucket) 115 | 116 | with beam.Pipeline(argv=argv) as pipeline: 117 | 118 | # read the event stream 119 | if source == 'local': 120 | input_file = './simevents_sample.json' 121 | logging.info("Reading from {} ... Writing to {}".format(input_file, flights_output)) 122 | events = ( 123 | pipeline 124 | | 'read_input' >> beam.io.ReadFromText(input_file) 125 | | 'parse_input' >> beam.Map(lambda line: json.loads(line)) 126 | ) 127 | elif source == 'bigquery': 128 | input_query = ("SELECT EVENT_DATA FROM dsongcp.flights_simevents " + 129 | "WHERE EVENT_TIME BETWEEN '2015-03-01' AND '2015-03-02'") 130 | logging.info("Reading from {} ... Writing to {}".format(input_query, flights_output)) 131 | events = ( 132 | pipeline 133 | | 'read_input' >> beam.io.ReadFromBigQuery(query=input_query, use_standard_sql=True) 134 | | 'parse_input' >> beam.Map(lambda row: json.loads(row['EVENT_DATA'])) 135 | ) 136 | elif source == 'pubsub': 137 | input_topic = "projects/{}/topics/wheelsoff".format(project) 138 | logging.info("Reading from {} ... Writing to {}".format(input_topic, flights_output)) 139 | events = ( 140 | pipeline 141 | | 'read_input' >> beam.io.ReadFromPubSub(topic=input_topic, 142 | timestamp_attribute='EventTimeStamp') 143 | | 'parse_input' >> beam.Map(lambda s: json.loads(s)) 144 | ) 145 | else: 146 | logging.error("Unknown input type {}".format(source)) 147 | return 148 | 149 | # events -> features. See ./flights_transforms.py for the code shared between training & prediction 150 | features = ftxf.transform_events_to_features(events, for_training=False) 151 | 152 | # call model endpoint 153 | # shared_handle = beam.utils.shared.Shared() 154 | preds = ( 155 | features 156 | | 'into_global' >> beam.WindowInto(beam.window.GlobalWindows()) 157 | | 'batch_instances' >> beam.BatchElements(min_batch_size=1, max_batch_size=64) 158 | | 'model_predict' >> beam.ParDo(FlightsModelInvoker()) 159 | ) 160 | 161 | # write it out 162 | if sink == 'file': 163 | (preds 164 | | 'to_string' >> beam.Map(lambda f: ','.join([str(x) for x in f.values()])) 165 | | 'to_gcs' >> beam.io.textio.WriteToText(flights_output, 166 | file_name_suffix='.csv', header=CSV_HEADER, 167 | # workaround b/207384805 168 | num_shards=1) 169 | ) 170 | elif sink == 'bigquery': 171 | preds_schema = ','.join([ 172 | 'event_time:timestamp', 173 | 'prob_ontime:float', 174 | 'dep_delay:float', 175 | 'taxi_out:float', 176 | 'distance:float', 177 | 'origin:string', 178 | 'dest:string', 179 | 'dep_hour:integer', 180 | 'is_weekday:integer', 181 | 'carrier:string', 182 | 'dep_airport_lat:float,dep_airport_lon:float', 183 | 'arr_airport_lat:float,arr_airport_lon:float', 184 | 'avg_dep_delay:float', 185 | 'avg_taxi_out:float', 186 | ]) 187 | (preds 188 | | 'to_bigquery' >> beam.io.WriteToBigQuery( 189 | 'dsongcp.streaming_preds', schema=preds_schema, 190 | # write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE, 191 | create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED, 192 | method='STREAMING_INSERTS' 193 | ) 194 | ) 195 | else: 196 | logging.error("Unknown output type {}".format(sink)) 197 | return 198 | 199 | 200 | if __name__ == '__main__': 201 | import argparse 202 | 203 | parser = argparse.ArgumentParser(description='Create training CSV file that includes time-aggregate features') 204 | parser.add_argument('-p', '--project', help='Project to be billed for Dataflow/BigQuery', required=True) 205 | 
parser.add_argument('-b', '--bucket', help='data will be read from written to gs://BUCKET/flights/ch11/') 206 | parser.add_argument('-r', '--region', help='Region to run Dataflow job. Choose the same region as your bucket.') 207 | parser.add_argument('-i', '--input', help='local, bigquery OR pubsub', required=True) 208 | parser.add_argument('-o', '--output', help='file, bigquery OR bigtable', default='file') 209 | 210 | logging.getLogger().setLevel(logging.INFO) 211 | args = vars(parser.parse_args()) 212 | 213 | if args['input'] != 'local': 214 | if not args['bucket'] or not args['project'] or not args['region']: 215 | print("Project, Bucket, Region are needed in order to run on the cloud on full dataset.") 216 | parser.print_help() 217 | parser.exit() 218 | 219 | run(project=args['project'], bucket=args['bucket'], region=args['region'], 220 | source=args['input'], sink=args['output']) 221 | -------------------------------------------------------------------------------- /11_realtime/setup.py: -------------------------------------------------------------------------------- 1 | # 2 | # Licensed to the Apache Software Foundation (ASF) under one or more 3 | # contributor license agreements. See the NOTICE file distributed with 4 | # this work for additional information regarding copyright ownership. 5 | # The ASF licenses this file to You under the Apache License, Version 2.0 6 | # (the "License"); you may not use this file except in compliance with 7 | # the License. You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | # 17 | 18 | """Setup.py module for the workflow's worker utilities. 19 | 20 | All the workflow related code is gathered in a package that will be built as a 21 | source distribution, staged in the staging area for the workflow being run and 22 | then installed in the workers when they start running. 23 | 24 | This behavior is triggered by specifying the --setup_file command line option 25 | when running the workflow for remote execution. 26 | """ 27 | 28 | from distutils.command.build import build as _build 29 | import subprocess 30 | 31 | import setuptools 32 | 33 | 34 | # This class handles the pip install mechanism. 35 | class build(_build): # pylint: disable=invalid-name 36 | """A build command class that will be invoked during package install. 37 | 38 | The package built using the current setup.py will be staged and later 39 | installed in the worker using `pip install package'. This class will be 40 | instantiated during install for this specific scenario and will trigger 41 | running the custom commands specified. 42 | """ 43 | sub_commands = _build.sub_commands + [('CustomCommands', None)] 44 | 45 | 46 | # Some custom command to run during setup. The command is not essential for this 47 | # workflow. It is used here as an example. Each command will spawn a child 48 | # process. Typically, these commands will include steps to install non-Python 49 | # packages. 
For instance, to install a C++-based library libjpeg62 the following 50 | # two commands will have to be added: 51 | # 52 | # ['apt-get', 'update'], 53 | # ['apt-get', '--assume-yes', install', 'libjpeg62'], 54 | # 55 | # First, note that there is no need to use the sudo command because the setup 56 | # script runs with appropriate access. 57 | # Second, if apt-get tool is used then the first command needs to be 'apt-get 58 | # update' so the tool refreshes itself and initializes links to download 59 | # repositories. Without this initial step the other apt-get install commands 60 | # will fail with package not found errors. Note also --assume-yes option which 61 | # shortcuts the interactive confirmation. 62 | # 63 | # The output of custom commands (including failures) will be logged in the 64 | # worker-startup log. 65 | CUSTOM_COMMANDS = [ 66 | ] 67 | 68 | 69 | class CustomCommands(setuptools.Command): 70 | """A setuptools Command class able to run arbitrary commands.""" 71 | 72 | def initialize_options(self): 73 | pass 74 | 75 | def finalize_options(self): 76 | pass 77 | 78 | def RunCustomCommand(self, command_list): 79 | print ('Running command: %s' % command_list) 80 | p = subprocess.Popen( 81 | command_list, 82 | stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) 83 | # Can use communicate(input='y\n'.encode()) if the command run requires 84 | # some confirmation. 85 | stdout_data, _ = p.communicate() 86 | print ('Command output: %s' % stdout_data) 87 | if p.returncode != 0: 88 | raise RuntimeError( 89 | 'Command %s failed: exit code: %s' % (command_list, p.returncode)) 90 | 91 | def run(self): 92 | for command in CUSTOM_COMMANDS: 93 | self.RunCustomCommand(command) 94 | 95 | 96 | # Configure the required packages and scripts to install. 97 | # Note that the Python Dataflow containers come with numpy already installed 98 | # so this dependency will not trigger anything to be installed unless a version 99 | # restriction is specified. 100 | REQUIRED_PACKAGES = [ 101 | 'pyfarmhash', 102 | 'google-cloud-aiplatform', 103 | 'cloudml-hypertune', 104 | 'dill==0.3.1.1' 105 | ] 106 | 107 | 108 | setuptools.setup( 109 | name='flightsdf', 110 | version='0.0.1', 111 | description='Data Science on GCP flights training and prediction pipelines', 112 | install_requires=REQUIRED_PACKAGES, 113 | packages=setuptools.find_packages(), 114 | cmdclass={ 115 | # Command class instantiated and run during pip install scenarios. 116 | 'build': build, 117 | 'CustomCommands': CustomCommands, 118 | } 119 | ) 120 | -------------------------------------------------------------------------------- /12_fulldataset/README.md: -------------------------------------------------------------------------------- 1 | # Full Dataset 2 | 3 | #### [Optional] Train on 2015-2018 and evaluate on 2019 4 | Note that this will take many hours and require significant resources. 5 | There is a reason why I have worked with only 1 year of data so far in the book. 6 | * [5 min] Erase the current contents of your bucket and BigQuery dataset: 7 | ``` 8 | gsutil -m rm -rf gs://BUCKET/* 9 | bq rm -r -f dsongcp 10 | ``` 11 | * [28h or 2 min] Create Training Dataset OR Copy it from my bucket 12 | * [28 hours] Create Training Dataset 13 | * [30 min] Ingest raw files: 14 | * cd 02_ingest 15 | * Edit the YEARS in 02_ingest/ingest.sh to process 2015 to 2019. 
16 | * Run ./ingest.sh program 17 | * [2 min] Create views 18 | * cd ../03_sqlstudio 19 | * ./create_views.sh 20 | * [40 min] Do time correction 21 | * cd ../04_streaming/transform 22 | * ./stage_airports_file.sh $BUCKET 23 | * Increase number of workers in df07.py to 20 or the limit of your quota 24 | * python3 df07.py --project $PROJECT --bucket $BUCKET --region $REGION 25 | * [26 hours] Create training dataset 26 | * cd ../11_realtime 27 | * Edit flightstxf/create_traindata.py changing the line 28 | ``` 29 | 'data_split': get_data_split(event['FL_DATE']) 30 | ``` 31 | to 32 | ``` 33 | 'data_split': get_data_split_2019(event['FL_DATE']) 34 | ``` 35 | * Change the worker type to m1-ultramem-40 and disksize to 500 GB in the run() method of create_traindata.py. 36 | * Create full training dataset 37 | ``` 38 | python3 create_traindata.py --input bigquery --project $PROJECT --bucket $BUCKET --region $REGION 39 | ``` 40 | * [2 min] Copy the full training data set from my bucket: 41 | ``` 42 | gsutil cp \ 43 | gs://data-science-on-gcp/edition2/ch12_fulldataset/all-00000-of-00001.csv \ 44 | gs://$BUCKET/ch11/data/all-00000-of-00001.csv 45 | ``` 46 | 47 | * [5 hr] Train AutoML model so that we have evaluation statistics in BigQuery: 48 | ``` 49 | cd 11_realtime 50 | python3 train_on_vertexai.py --automl --project $PROJECT --bucket $BUCKET --region $REGION 51 | ``` 52 | * Open the notebook evaluation.ipynb in Vertex Workbench and run the cells. 53 | -------------------------------------------------------------------------------- /COPYRIGHT: -------------------------------------------------------------------------------- 1 | Copyright Google Inc. 2016 2 | Licensed under the Apache License, Version 2.0 (the "License"); 3 | you may not use this file except in compliance with the License. 4 | You may obtain a copy of the License at 5 | http://www.apache.org/licenses/LICENSE-2.0 6 | Unless required by applicable law or agreed to in writing, software 7 | distributed under the License is distributed on an "AS IS" BASIS, 8 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 9 | See the License for the specific language governing permissions and 10 | limitations under the License. 11 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 
25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. 
If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. 
Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # data-science-on-gcp 2 | 3 | Source code accompanying book: 4 | 5 | 6 | 7 | 10 | 15 | 18 | 19 | 20 | 23 | 28 | 31 |
8 | 9 | 11 | Data Science on the Google Cloud Platform, 2nd Edition
12 | Valliappa Lakshmanan
13 | O'Reilly, Apr 2022 14 |
16 | Branch 2nd Edition [also main] 17 |
21 | 22 | 24 | Data Science on the Google Cloud Platform
25 | Valliappa Lakshmanan
26 | O'Reilly, Jan 2017 27 |
29 | Branch edition1_tf2 (obsolete, and will not be maintained) 30 |
32 | 33 | ### Try out the code on Google Cloud Platform 34 | Open in Cloud Shell 35 | 36 | The code on Qwiklabs (see below) is **continually tested**, and this repo is kept up-to-date. 37 | 38 | If the code doesn't work for you, I recommend that you try the corresponding Qwiklab lab to see if there is some step that you missed. 39 | If you still have problems, please leave feedback in Qwiklabs, or file an issue in this repo. 40 | 41 | ### Try out the code on Qwiklabs 42 | 43 | - [Data Science on the Google Cloud Platform Quest](https://google.qwiklabs.com/quests/43) 44 | - [Data Science on Google Cloud Platform: Machine Learning Quest](https://google.qwiklabs.com/quests/50) 45 | 46 | 47 | 48 | ### Purchase book 49 | [Read on-line or download PDF of book](https://www.oreilly.com/library/view/data-science-on/9781098118945/) 50 | 51 | [Buy on Amazon.com](https://www.amazon.com/Data-Science-Google-Cloud-Platform-dp-1098118952/dp/1098118952/) 52 | 53 | ### Updates to book 54 | I updated the book in Nov 2019 with TensorFlow 2.0, Cloud Functions, and BigQuery ML. 55 | -------------------------------------------------------------------------------- /cover_edition2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GoogleCloudPlatform/data-science-on-gcp/652564b9feeeaab331ce27fdd672b8226ba1e837/cover_edition2.jpg --------------------------------------------------------------------------------