├── .gitignore
├── 02_ingest
    ├── README.md
    ├── bqload.sh
    ├── download.sh
    ├── ingest.sh
    ├── ingest_from_crsbucket.sh
    ├── monthlyupdate
    │   ├── .dockerignore
    │   ├── .gitignore
    │   ├── 01_setup_svc_acct.sh
    │   ├── 02_deploy_cr.sh
    │   ├── 03_call_cr.sh
    │   ├── 04_next_month.sh
    │   ├── 05_setup_cron.sh
    │   ├── Dockerfile
    │   ├── ingest_flights.py
    │   ├── main.py
    │   └── requirements.txt
    ├── raw_download.sh
    └── upload.sh
├── 03_sqlstudio
    ├── README.md
    ├── contingency.sh
    ├── contingency1.sql
    ├── contingency2.sql
    ├── contingency3.sql
    ├── contingency4.sql
    ├── create_table.sql
    ├── create_views.sh
    └── create_views.sql
├── 04_streaming
    ├── .gitignore
    ├── README.md
    ├── design
    │   ├── airport_schema.json
    │   ├── mktbl.sh
    │   └── queries.txt
    ├── ingest_from_crsbucket.sh
    ├── realtime
    │   ├── avg01.py
    │   ├── avg02.py
    │   └── avg03.py
    ├── simulate
    │   ├── .gitignore
    │   ├── airports.csv.gz
    │   ├── simulate.py
    │   └── simulate_may2015.sh
    └── transform
    │   ├── airports.csv.gz
    │   ├── bqsample.sh
    │   ├── df01.py
    │   ├── df02.py
    │   ├── df03.py
    │   ├── df04.py
    │   ├── df05.py
    │   ├── df06.py
    │   ├── df07.py
    │   ├── flights_sample.json
    │   ├── install_packages.sh
    │   ├── setup.py
    │   └── stage_airports_file.sh
├── 05_bqnotebook
    ├── README.md
    ├── create_trainday.sh
    ├── exploration.ipynb
    ├── queries.txt
    └── trainday.txt
├── 06_dataproc
    ├── README.md
    ├── bayes_on_spark.py
    ├── create_cluster.sh
    ├── create_personal_cluster.sh
    ├── decrease_cluster.sh
    ├── delete_cluster.sh
    ├── increase_cluster.sh
    ├── install_on_cluster.sh
    ├── quantization.ipynb
    └── submit_serverless.sh
├── 07_sparkml
    ├── README.md
    ├── autoscale.yaml
    ├── create_large_cluster.sh
    ├── experiment.py
    ├── graphs.ipynb
    ├── logistic.py
    ├── logistic_regression.ipynb
    └── submit_spark.sh
├── 08_bqml
    ├── README.md
    ├── bqml_logistic.ipynb
    ├── bqml_nonlinear.ipynb
    ├── bqml_timetxf.ipynb
    └── bqml_timewindow.ipynb
├── 09_vertexai
    ├── .gitignore
    ├── README.md
    ├── call_predict.sh
    ├── example_input.json
    ├── flights_model.png
    └── flights_model_tf2.ipynb
├── 10_mlops
    ├── README.md
    ├── call_predict.py
    ├── explanation-metadata.json
    ├── ingest_from_crsbucket.sh
    ├── model.py
    └── train_on_vertexai.py
├── 11_realtime
    ├── .gitignore
    ├── README.md
    ├── alldata_sample.json
    ├── change_ch10_files.py
    ├── create_sample_input.sh
    ├── create_traindata.py
    ├── evaluation.ipynb
    ├── flightstxf
    │   ├── __init__.py
    │   └── flights_transforms.py
    ├── make_predictions.py
    ├── setup.py
    └── simevents_sample.json
├── 12_fulldataset
    └── README.md
├── COPYRIGHT
├── LICENSE
├── README.md
└── cover_edition2.jpg


/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints
2 | 


--------------------------------------------------------------------------------
/02_ingest/README.md:
--------------------------------------------------------------------------------
 1 | # 2. Ingesting data onto the Cloud
 2 | 
 3 | ### Create a bucket
 4 | * Go to the Storage section of the GCP web console and create a new bucket
 5 | 
 6 | ### Populate your bucket with the data you will need for the book
 7 | 
 8 | * Open CloudShell and git clone this repo:
 9 |     ```
10 |     git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
11 |     ```
12 | * Go to the 02_ingest folder of the repo
13 | * Edit ./ingest.sh to reflect the years you want to process (at minimum, you need 2015)
14 | * Execute ./ingest.sh bucketname
15 | 
16 | ### [Optional] Scheduling monthly downloads
17 | * Go to the 02_ingest/monthlyupdate folder in the repo.
18 | * Run the command `pip3 install google-cloud-storage google-cloud-bigquery`
19 | * Run the command `gcloud auth application-default login`
20 | * Try ingesting one month using the Python script: `./ingest_flights.py --debug --bucket your-bucket-name --year 2015 --month 02` 
21 | * Set up a service account called svc-monthly-ingest by running `./01_setup_svc_acct.sh`
22 | * Now, try running the ingest script as the service account:
23 |   * Visit the Service Accounts section of the GCP Console: https://console.cloud.google.com/iam-admin/serviceaccounts
24 |   * Select the newly created service account svc-monthly-ingest and click Manage Keys
25 |   * Add key (Create a new JSON key) and download it to a file named tempkey.json
26 |   * Run `gcloud auth activate-service-account --key-file tempkey.json`
27 |   * Try ingesting one month `./ingest_flights.py --bucket $BUCKET --year 2015 --month 03 --debug`
28 |   * Go back to running command as yourself using `gcloud auth login`
29 | * Deploy to Cloud Run: `./02_deploy_cr.sh`
30 | * Test that you can invoke the function using Cloud Run: `./03_call_cr.sh`
31 | * Test that the functionality to get the next month works: `./04_next_month.sh`
32 | * Set up a Cloud Scheduler job to invoke Cloud Run every month: `./05_setup_cron.sh`
33 | * Visit the GCP Console for Cloud Run and Cloud Scheduler and delete the Cloud Run instance and the scheduled task—you won’t need them any further.
34 | 


--------------------------------------------------------------------------------
/02_ingest/bqload.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if [ "$#" -ne 2 ]; then
 4 |     echo "Usage: ./bqload.sh  csv-bucket-name YEAR"
 5 |     exit
 6 | fi
 7 | 
 8 | BUCKET=$1
 9 | YEAR=$2
10 | 
11 | SCHEMA=Year:STRING,Quarter:STRING,Month:STRING,DayofMonth:STRING,DayOfWeek:STRING,FlightDate:DATE,Reporting_Airline:STRING,DOT_ID_Reporting_Airline:STRING,IATA_CODE_Reporting_Airline:STRING,Tail_Number:STRING,Flight_Number_Reporting_Airline:STRING,OriginAirportID:STRING,OriginAirportSeqID:STRING,OriginCityMarketID:STRING,Origin:STRING,OriginCityName:STRING,OriginState:STRING,OriginStateFips:STRING,OriginStateName:STRING,OriginWac:STRING,DestAirportID:STRING,DestAirportSeqID:STRING,DestCityMarketID:STRING,Dest:STRING,DestCityName:STRING,DestState:STRING,DestStateFips:STRING,DestStateName:STRING,DestWac:STRING,CRSDepTime:STRING,DepTime:STRING,DepDelay:STRING,DepDelayMinutes:STRING,DepDel15:STRING,DepartureDelayGroups:STRING,DepTimeBlk:STRING,TaxiOut:STRING,WheelsOff:STRING,WheelsOn:STRING,TaxiIn:STRING,CRSArrTime:STRING,ArrTime:STRING,ArrDelay:STRING,ArrDelayMinutes:STRING,ArrDel15:STRING,ArrivalDelayGroups:STRING,ArrTimeBlk:STRING,Cancelled:STRING,CancellationCode:STRING,Diverted:STRING,CRSElapsedTime:STRING,ActualElapsedTime:STRING,AirTime:STRING,Flights:STRING,Distance:STRING,DistanceGroup:STRING,CarrierDelay:STRING,WeatherDelay:STRING,NASDelay:STRING,SecurityDelay:STRING,LateAircraftDelay:STRING,FirstDepTime:STRING,TotalAddGTime:STRING,LongestAddGTime:STRING,DivAirportLandings:STRING,DivReachedDest:STRING,DivActualElapsedTime:STRING,DivArrDelay:STRING,DivDistance:STRING,Div1Airport:STRING,Div1AirportID:STRING,Div1AirportSeqID:STRING,Div1WheelsOn:STRING,Div1TotalGTime:STRING,Div1LongestGTime:STRING,Div1WheelsOff:STRING,Div1TailNum:STRING,Div2Airport:STRING,Div2AirportID:STRING,Div2AirportSeqID:STRING,Div2WheelsOn:STRING,Div2TotalGTime:STRING,Div2LongestGTime:STRING,Div2WheelsOff:STRING,Div2TailNum:STRING,Div3Airport:STRING,Div3AirportID:STRING,Div3AirportSeqID:STRING,Div3WheelsOn:STRING,Div3TotalGTime:STRING,Div3LongestGTime:STRING,Div3WheelsOff:STRING,Div3TailNum:STRING,Div4Airport:STRING,Div4AirportID:STRING,Div4AirportSeqID:STRING,Div4WheelsOn:STRING,Div4TotalGTime:STRING,Div4LongestGTime:STRING,Div4WheelsOff:STRING,Div4TailNum:STRING,Div5Airport:STRING,Div5AirportID:STRING,Div5AirportSeqID:STRING,Div5WheelsOn:STRING,Div5TotalGTime:STRING,Div5LongestGTime:STRING,Div5WheelsOff:STRING,Div5TailNum:STRING
12 | 
13 | # create dataset if not exists
14 | PROJECT=$(gcloud config get-value project)
15 | #bq --project_id $PROJECT rm -f ${PROJECT}:dsongcp.flights_raw
16 | bq --project_id $PROJECT show dsongcp || bq mk --sync dsongcp
17 | 
18 | for MONTH in `seq -w 1 12`; do
19 | 
20 | CSVFILE=gs://${BUCKET}/flights/raw/${YEAR}${MONTH}.csv
21 | bq --project_id $PROJECT  --sync \
22 |    load --time_partitioning_field=FlightDate --time_partitioning_type=MONTH \
23 |    --source_format=CSV --ignore_unknown_values --skip_leading_rows=1 --schema=$SCHEMA \
24 |    --replace ${PROJECT}:dsongcp.flights_raw\$${YEAR}${MONTH} $CSVFILE
25 | 
26 | done
27 | 
28 | 


--------------------------------------------------------------------------------
/02_ingest/download.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # Note that we have commented out the BTS website, and are instead
 4 | # using a mirror. This is because the BTS website is frequently down
 5 | SOURCE=https://storage.googleapis.com/data-science-on-gcp/edition2/raw
 6 | #SOURCE=https://transtats.bts.gov/PREZIP
 7 | 
 8 | if test "$#" -ne 2; then
 9 |    echo "Usage: ./download.sh year month"
10 |    echo "   eg: ./download.sh 2015 1"
11 |    exit
12 | fi
13 | 
14 | YEAR=$1
15 | MONTH=$2
16 | BASEURL="${SOURCE}/On_Time_Reporting_Carrier_On_Time_Performance_1987_present"
17 | echo "Downloading YEAR=$YEAR ...  MONTH=$MONTH ... from $BASEURL"
18 | 
19 | 
20 | MONTH2=$(printf "%02d" $MONTH)
21 | 
22 | TMPDIR=$(mktemp -d)
23 | 
24 | ZIPFILE=${TMPDIR}/${YEAR}_${MONTH2}.zip
25 | echo $ZIPFILE
26 | 
27 | curl -o $ZIPFILE ${BASEURL}_${YEAR}_${MONTH}.zip
28 | unzip -d $TMPDIR $ZIPFILE
29 | 
30 | mv $TMPDIR/*.csv ./${YEAR}${MONTH2}.csv
31 | rm -rf $TMPDIR
32 | 


--------------------------------------------------------------------------------
/02_ingest/ingest.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if [ "$#" -ne 1 ]; then
 4 |     echo "Usage: ./ingest.sh  destination-bucket-name"
 5 |     exit
 6 | fi
 7 | 
 8 | export BUCKET=$1
 9 | 
10 | # get zip files from BTS, extract csv files
11 | for YEAR in `seq 2015 2015`; do
12 |    for MONTH in `seq 1 12`; do
13 |       bash download.sh $YEAR $MONTH
14 |       # upload the raw CSV files to our GCS bucket
15 |       bash upload.sh $BUCKET
16 |       rm *.csv
17 |    done
18 |    # load the CSV files into BigQuery as string columns
19 |    bash bqload.sh $BUCKET $YEAR
20 | done
21 | 
22 | 
23 | # verify that things worked
24 | bq query --nouse_legacy_sql \
25 |   'SELECT DISTINCT year, month FROM dsongcp.flights_raw ORDER BY year ASC, CAST(month AS INTEGER) ASC'
26 | 
27 | bq query --nouse_legacy_sql \
28 |   'SELECT year, month, COUNT(*) AS num_flights FROM dsongcp.flights_raw GROUP BY year, month ORDER BY year ASC, CAST(month AS INTEGER) ASC'
29 | 


--------------------------------------------------------------------------------
/02_ingest/ingest_from_crsbucket.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if [ "$#" -ne 1 ]; then
 4 |     echo "Usage: ./ingest_from_crsbucket.sh  destination-bucket-name"
 5 |     exit
 6 | fi
 7 | 
 8 | BUCKET=$1
 9 | FROM=gs://data-science-on-gcp/edition2/flights/raw
10 | TO=gs://$BUCKET/flights/raw
11 | 
12 | CMD="gsutil -m cp "
13 | for MONTH in `seq -w 1 12`; do
14 |   CMD="$CMD ${FROM}/2015${MONTH}.csv"
15 | done
16 | CMD="$CMD ${FROM}/201601.csv $TO"
17 | 
18 | echo $CMD
19 | $CMD
20 | 


--------------------------------------------------------------------------------
/02_ingest/monthlyupdate/.dockerignore:
--------------------------------------------------------------------------------
 1 | Dockerfile
 2 | README.md
 3 | *.pyc
 4 | *.pyo
 5 | *.pyd
 6 | __pycache__
 7 | .pytest_cache
 8 | .git
 9 | .gitignore
10 | tempkey.json
11 | 


--------------------------------------------------------------------------------
/02_ingest/monthlyupdate/.gitignore:
--------------------------------------------------------------------------------
1 | tempkey.json
2 | 


--------------------------------------------------------------------------------
/02_ingest/monthlyupdate/01_setup_svc_acct.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | SVC_ACCT=svc-monthly-ingest
 4 | PROJECT_ID=$(gcloud config get-value project)
 5 | BUCKET=${PROJECT_ID}-cf-staging
 6 | REGION=us-central1
 7 | SVC_PRINCIPAL=serviceAccount:${SVC_ACCT}@${PROJECT_ID}.iam.gserviceaccount.com
 8 | 
 9 | gsutil ls gs://$BUCKET || gsutil mb -l $REGION gs://$BUCKET
10 | gsutil uniformbucketlevelaccess set on gs://$BUCKET
11 | 
12 | gcloud iam service-accounts create $SVC_ACCT --display-name "flights monthly ingest"
13 | 
14 | # make the service account the admin of the bucket
15 | # it can read/write/list/delete etc. on only this bucket
16 | gsutil iam ch ${SVC_PRINCIPAL}:roles/storage.admin gs://$BUCKET
17 | 
18 | # ability to create/delete partitions etc in BigQuery table
19 | bq --project_id=${PROJECT_ID} query --nouse_legacy_sql \
20 |   "GRANT \`roles/bigquery.dataOwner\` ON SCHEMA dsongcp TO '$SVC_PRINCIPAL' "
21 | 
22 | gcloud projects add-iam-policy-binding ${PROJECT_ID} \
23 |   --member ${SVC_PRINCIPAL} \
24 |   --role roles/bigquery.jobUser
25 | 
26 | # At this point, test running as service account
27 | # download a json key from the console (temporarily)
28 | # either add this to .gcloudignore and .gitignore or put it in a different directory!
29 | # gcloud auth activate-service-account --key-file tempkey.json
30 | # ./ingest_flights.py --bucket $BUCKET --year 2015 --month 03 --debug
31 | # after this, go back to being yourself with gcloud auth login
32 | 
33 | # Make sure the sevice account can invoke cloud functions
34 | gcloud projects add-iam-policy-binding ${PROJECT_ID} \
35 |   --member ${SVC_PRINCIPAL} \
36 |   --role roles/run.invoker
37 | 


--------------------------------------------------------------------------------
/02_ingest/monthlyupdate/02_deploy_cr.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # same as in setup_svc_acct
 4 | NAME=ingest-flights-monthly
 5 | SVC_ACCT=svc-monthly-ingest
 6 | PROJECT_ID=$(gcloud config get-value project)
 7 | REGION=us-central1
 8 | SVC_EMAIL=${SVC_ACCT}@${PROJECT_ID}.iam.gserviceaccount.com
 9 | 
10 | #gcloud functions deploy $URL \
11 | #    --entry-point ingest_flights --runtime python37 --trigger-http \
12 | #    --timeout 540s --service-account ${SVC_EMAIL} --no-allow-unauthenticated
13 | 
14 | gcloud run deploy $NAME --region $REGION --source=$(pwd) \
15 |     --platform=managed --service-account ${SVC_EMAIL} --no-allow-unauthenticated \
16 |     --timeout 12m \
17 | 
18 | 


--------------------------------------------------------------------------------
/02_ingest/monthlyupdate/03_call_cr.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # same as deploy_cr.sh
 4 | NAME=ingest-flights-monthly
 5 | 
 6 | PROJECT_ID=$(gcloud config get-value project)
 7 | BUCKET=${PROJECT_ID}-cf-staging
 8 | 
 9 | URL=$(gcloud run services describe ingest-flights-monthly --format 'value(status.url)')
10 | echo $URL
11 | 
12 | # Feb 2015
13 | echo {\"year\":\"2015\"\,\"month\":\"02\"\,\"bucket\":\"${BUCKET}\"\} > /tmp/message
14 | 
15 | curl -k -X POST $URL \
16 |    -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
17 |    -H "Content-Type:application/json" --data-binary @/tmp/message
18 | 
19 | 


--------------------------------------------------------------------------------
/02_ingest/monthlyupdate/04_next_month.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # same as deploy_cr.sh
 4 | NAME=ingest-flights-monthly
 5 | 
 6 | PROJECT_ID=$(gcloud config get-value project)
 7 | BUCKET=${PROJECT_ID}-cf-staging
 8 | 
 9 | URL=$(gcloud run services describe ingest-flights-monthly --format 'value(status.url)')
10 | echo $URL
11 | 
12 | # next month
13 | echo "Getting month that follows ... (removing 12 if needed, so there is something to get) "
14 | gsutil rm -rf gs://$BUCKET/flights/raw/201512.csv.gz
15 | gsutil ls gs://$BUCKET/flights/raw
16 | echo {\"bucket\":\"${BUCKET}\"\} > /tmp/message
17 | cat /tmp/message
18 | 
19 | curl -k -X POST $URL \
20 |    -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
21 |    -H "Content-Type:application/json" --data-binary @/tmp/message
22 | 
23 | echo "Done"
24 | 


--------------------------------------------------------------------------------
/02_ingest/monthlyupdate/05_setup_cron.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # same as in setup_svc_acct.sh and call_cr.sh
 4 | NAME=ingest-flights-monthly
 5 | PROJECT_ID=$(gcloud config get-value project)
 6 | BUCKET=${PROJECT_ID}-cf-staging
 7 | SVC_ACCT=svc-monthly-ingest
 8 | SVC_EMAIL=${SVC_ACCT}@${PROJECT_ID}.iam.gserviceaccount.com
 9 | 
10 | SVC_URL=$(gcloud run services describe ingest-flights-monthly --format 'value(status.url)')
11 | echo $SVC_URL
12 | echo $SVC_EMAIL
13 | 
14 | # note that there is no year or month. The service looks for next month in that case.
15 | echo {\"bucket\":\"${BUCKET}\"\} > /tmp/message
16 | cat /tmp/message
17 | 
18 | gcloud scheduler jobs create http monthlyupdate \
19 |        --description "Ingest flights using Cloud Run" \
20 |        --schedule="8 of month 10:00" --time-zone "America/New_York" \
21 |        --uri=$SVC_URL --http-method POST \
22 |        --oidc-service-account-email $SVC_EMAIL --oidc-token-audience=$SVC_URL \
23 |        --max-backoff=7d \
24 |        --max-retry-attempts=5 \
25 |        --max-retry-duration=2d \
26 |        --min-backoff=12h \
27 |        --headers="Content-Type=application/json" \
28 |        --message-body-from-file=/tmp/message
29 | 
30 | 
31 | # To try this out, go to Console and do two things:
32 | #    in Service Accounts, give yourself the ability to impersonate this service account (ServiceAccountUser)
33 | #    in Cloud Scheduler, click "Run Now"
34 | 


--------------------------------------------------------------------------------
/02_ingest/monthlyupdate/Dockerfile:
--------------------------------------------------------------------------------
 1 | # Use the official lightweight Python image.
 2 | # https://hub.docker.com/_/python
 3 | FROM python:3.6-slim
 4 | 
 5 | # Allow statements and log messages to immediately appear in the Knative logs
 6 | ENV PYTHONUNBUFFERED True
 7 | 
 8 | # Copy local code to the container image.
 9 | ENV APP_HOME /app
10 | WORKDIR $APP_HOME
11 | COPY . ./
12 | 
13 | # Install production dependencies.
14 | RUN pip install --no-cache-dir -r requirements.txt
15 | 
16 | # Run the web service on container startup. Here we use the gunicorn
17 | # webserver, with one worker process and 8 threads.
18 | # For environments with multiple CPU cores, increase the number of workers
19 | # to be equal to the cores available.
20 | # Timeout is set to 0 to disable the timeouts of the workers to allow Cloud Run to handle instance scaling.
21 | CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app
22 | 


--------------------------------------------------------------------------------
/02_ingest/monthlyupdate/ingest_flights.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | # Copyright 2016-2021 Google Inc.
  4 | #
  5 | # Licensed under the Apache License, Version 2.0 (the "License");
  6 | # you may not use this file except in compliance with the License.
  7 | # You may obtain a copy of the License at
  8 | #
  9 | #     http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | 
 17 | import os
 18 | import gzip
 19 | import shutil
 20 | import logging
 21 | import os.path
 22 | import zipfile
 23 | import datetime
 24 | import tempfile
 25 | from google.cloud import storage
 26 | from google.cloud.storage import Blob
 27 | from google.cloud import bigquery
 28 | 
 29 | SOURCE = "https://storage.googleapis.com/data-science-on-gcp/edition2/raw"
 30 | #SOURCE = "https://transtats.bts.gov/PREZIP"
 31 | 
 32 | 
 33 | def urlopen(url):
 34 |     from urllib.request import urlopen as impl
 35 |     import ssl
 36 | 
 37 |     ctx_no_secure = ssl.create_default_context()
 38 |     ctx_no_secure.set_ciphers('HIGH:!DH:!aNULL')
 39 |     ctx_no_secure.check_hostname = False
 40 |     ctx_no_secure.verify_mode = ssl.CERT_NONE
 41 |     return impl(url, context=ctx_no_secure)
 42 | 
 43 | 
 44 | def download(year: str, month: str, destdir: str):
 45 |     """
 46 |     Downloads on-time performance data and returns local filename
 47 |     year e.g.'2015'
 48 |     month e.g. '01 for January
 49 |     """
 50 |     logging.info('Requesting data for {}-{}-*'.format(year, month))
 51 | 
 52 |     url = os.path.join(SOURCE,
 53 |                        "On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{}_{}.zip".format(year, int(month)))
 54 |     logging.debug("Trying to download {}".format(url))
 55 | 
 56 |     filename = os.path.join(destdir, "{}{}.zip".format(year, month))
 57 |     with open(filename, "wb") as fp:
 58 |         response = urlopen(url)
 59 |         fp.write(response.read())
 60 |     logging.debug("{} saved".format(filename))
 61 |     return filename
 62 | 
 63 | 
 64 | def zip_to_csv(filename, destdir):
 65 |     """
 66 |     Extracts the CSV file from the zip file into the destdir
 67 |     """
 68 |     zip_ref = zipfile.ZipFile(filename, 'r')
 69 |     cwd = os.getcwd()
 70 |     os.chdir(destdir)
 71 |     zip_ref.extractall()
 72 |     os.chdir(cwd)
 73 |     csvfile = os.path.join(destdir, zip_ref.namelist()[0])
 74 |     zip_ref.close()
 75 |     logging.info("Extracted {}".format(csvfile))
 76 | 
 77 |     # now gzip for faster upload to bucket
 78 |     gzipped = csvfile + ".gz"
 79 |     with open(csvfile, 'rb') as ifp:
 80 |         with gzip.open(gzipped, 'wb') as ofp:
 81 |             shutil.copyfileobj(ifp, ofp)
 82 |     logging.info("Compressed into {}".format(gzipped))
 83 | 
 84 |     return gzipped
 85 | 
 86 | 
 87 | def upload(csvfile, bucketname, blobname):
 88 |     """
 89 |     Uploads the CSV file into the bucket with the given blobname
 90 |     """
 91 |     client = storage.Client()
 92 |     bucket = client.get_bucket(bucketname)
 93 |     logging.info(bucket)
 94 |     blob = Blob(blobname, bucket)
 95 |     logging.debug('Uploading {} ...'.format(csvfile))
 96 |     blob.upload_from_filename(csvfile)
 97 |     gcslocation = 'gs://{}/{}'.format(bucketname, blobname)
 98 |     logging.info('Uploaded {} ...'.format(gcslocation))
 99 |     return gcslocation
100 | 
101 | 
102 | def bqload(gcsfile, year, month):
103 |     """
104 |     Loads the CSV file in GCS into BigQuery, replacing the existing data in that partition
105 |     """
106 |     client = bigquery.Client()
107 |     # truncate existing partition ...
108 |     table_ref = client.dataset('dsongcp').table('flights_raw${}{}'.format(year, month))
109 |     job_config = bigquery.LoadJobConfig()
110 |     job_config.source_format = 'CSV'
111 |     job_config.write_disposition = 'WRITE_TRUNCATE'
112 |     job_config.ignore_unknown_values = True
113 |     job_config.time_partitioning = bigquery.table.TimePartitioning('MONTH', 'FlightDate')
114 |     job_config.skip_leading_rows = 1
115 |     job_config.schema = [
116 |         bigquery.SchemaField(col_and_type.split(':')[0], col_and_type.split(':')[1])  #, mode='required')
117 |         for col_and_type in
118 |         "Year:STRING,Quarter:STRING,Month:STRING,DayofMonth:STRING,DayOfWeek:STRING,FlightDate:DATE,Reporting_Airline:STRING,DOT_ID_Reporting_Airline:STRING,IATA_CODE_Reporting_Airline:STRING,Tail_Number:STRING,Flight_Number_Reporting_Airline:STRING,OriginAirportID:STRING,OriginAirportSeqID:STRING,OriginCityMarketID:STRING,Origin:STRING,OriginCityName:STRING,OriginState:STRING,OriginStateFips:STRING,OriginStateName:STRING,OriginWac:STRING,DestAirportID:STRING,DestAirportSeqID:STRING,DestCityMarketID:STRING,Dest:STRING,DestCityName:STRING,DestState:STRING,DestStateFips:STRING,DestStateName:STRING,DestWac:STRING,CRSDepTime:STRING,DepTime:STRING,DepDelay:STRING,DepDelayMinutes:STRING,DepDel15:STRING,DepartureDelayGroups:STRING,DepTimeBlk:STRING,TaxiOut:STRING,WheelsOff:STRING,WheelsOn:STRING,TaxiIn:STRING,CRSArrTime:STRING,ArrTime:STRING,ArrDelay:STRING,ArrDelayMinutes:STRING,ArrDel15:STRING,ArrivalDelayGroups:STRING,ArrTimeBlk:STRING,Cancelled:STRING,CancellationCode:STRING,Diverted:STRING,CRSElapsedTime:STRING,ActualElapsedTime:STRING,AirTime:STRING,Flights:STRING,Distance:STRING,DistanceGroup:STRING,CarrierDelay:STRING,WeatherDelay:STRING,NASDelay:STRING,SecurityDelay:STRING,LateAircraftDelay:STRING,FirstDepTime:STRING,TotalAddGTime:STRING,LongestAddGTime:STRING,DivAirportLandings:STRING,DivReachedDest:STRING,DivActualElapsedTime:STRING,DivArrDelay:STRING,DivDistance:STRING,Div1Airport:STRING,Div1AirportID:STRING,Div1AirportSeqID:STRING,Div1WheelsOn:STRING,Div1TotalGTime:STRING,Div1LongestGTime:STRING,Div1WheelsOff:STRING,Div1TailNum:STRING,Div2Airport:STRING,Div2AirportID:STRING,Div2AirportSeqID:STRING,Div2WheelsOn:STRING,Div2TotalGTime:STRING,Div2LongestGTime:STRING,Div2WheelsOff:STRING,Div2TailNum:STRING,Div3Airport:STRING,Div3AirportID:STRING,Div3AirportSeqID:STRING,Div3WheelsOn:STRING,Div3TotalGTime:STRING,Div3LongestGTime:STRING,Div3WheelsOff:STRING,Div3TailNum:STRING,Div4Airport:STRING,Div4AirportID:STRING,Div4AirportSeqID:STRING,Div4WheelsOn:STRING,Div4TotalGTime:STRING,Div4LongestGTime:STRING,Div4WheelsOff:STRING,Div4TailNum:STRING,Div5Airport:STRING,Div5AirportID:STRING,Div5AirportSeqID:STRING,Div5WheelsOn:STRING,Div5TotalGTime:STRING,Div5LongestGTime:STRING,Div5WheelsOff:STRING,Div5TailNum:STRING".split(',')
119 |     ]
120 |     load_job = client.load_table_from_uri(gcsfile, table_ref, job_config=job_config)
121 |     load_job.result()  # waits for table load to complete
122 | 
123 |     if load_job.state != 'DONE':
124 |         raise load_job.exception()
125 | 
126 |     return table_ref, load_job.output_rows
127 | 
128 | 
129 | def ingest(year, month, bucket):
130 |     '''
131 |    ingest flights data from BTS website to Google Cloud Storage
132 |    return table, numrows on success.
133 |    raises exception if this data is not on BTS website
134 |    '''
135 |     tempdir = tempfile.mkdtemp(prefix='ingest_flights')
136 |     try:
137 |         zipfile = download(year, month, tempdir)
138 |         bts_csv = zip_to_csv(zipfile, tempdir)
139 |         gcsloc = 'flights/raw/{}{}.csv.gz'.format(year, month)
140 |         gcsloc = upload(bts_csv, bucket, gcsloc)
141 |         return bqload(gcsloc, year, month)
142 |     finally:
143 |         logging.debug('Cleaning up by removing {}'.format(tempdir))
144 |         shutil.rmtree(tempdir)
145 | 
146 | 
147 | def next_month(bucketname):
148 |     '''
149 |      Finds which months are on GCS, and returns next year,month to download
150 |    '''
151 |     client = storage.Client()
152 |     bucket = client.get_bucket(bucketname)
153 |     blobs = list(bucket.list_blobs(prefix='flights/raw/'))
154 |     files = [blob.name for blob in blobs if 'csv' in blob.name]  # csv files only
155 |     lastfile = os.path.basename(files[-1])
156 |     logging.debug('The latest file on GCS is {}'.format(lastfile))
157 |     year = lastfile[:4]
158 |     month = lastfile[4:6]
159 |     return compute_next_month(year, month)
160 | 
161 | 
162 | def compute_next_month(year, month):
163 |     dt = datetime.datetime(int(year), int(month), 15)  # 15th of month
164 |     dt = dt + datetime.timedelta(30)  # will always go to next month
165 |     logging.debug('The next month is {}'.format(dt))
166 |     return '{}'.format(dt.year), '{:02d}'.format(dt.month)
167 | 
168 | 
169 | if __name__ == '__main__':
170 |     import argparse
171 | 
172 |     parser = argparse.ArgumentParser(description='ingest flights data from BTS website to Google Cloud Storage')
173 |     parser.add_argument('--bucket', help='GCS bucket to upload data to', required=True)
174 |     parser.add_argument('--year', help='Example: 2015.  If not provided, defaults to getting next month')
175 |     parser.add_argument('--month', help='Specify 01 for January. If not provided, defaults to getting next month')
176 |     parser.add_argument('--debug', dest='debug', action='store_true', help='Specify if you want debug messages')
177 | 
178 |     try:
179 |         args = parser.parse_args()
180 |         if args.debug:
181 |             logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.DEBUG)
182 |         else:
183 |             logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.INFO)
184 | 
185 |         if args.year is None or args.month is None:
186 |             year_, month_ = next_month(args.bucket)
187 |         else:
188 |             year_ = args.year
189 |             month_ = args.month
190 |         logging.debug('Ingesting year={} month={}'.format(year_, month_))
191 |         tableref, numrows = ingest(year_, month_, args.bucket)
192 |         logging.info('Success ... ingested {} rows to {}'.format(numrows, tableref))
193 |     except Exception as e:
194 |         logging.exception("Try again later?")
195 | 


--------------------------------------------------------------------------------
/02_ingest/monthlyupdate/main.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | # Copyright 2016-2021 Google Inc.
 4 | #
 5 | # Licensed under the Apache License, Version 2.0 (the "License");
 6 | # you may not use this file except in compliance with the License.
 7 | # You may obtain a copy of the License at
 8 | #
 9 | #     http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | 
17 | import os
18 | import logging
19 | from flask import Flask
20 | from flask import request, escape
21 | from ingest_flights import ingest, next_month
22 | 
23 | app = Flask(__name__)
24 | 
25 | 
26 | @app.route("/", methods=['POST'])
27 | def ingest_flights():
28 |     # noinspection PyBroadException
29 |     try:
30 |         logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.INFO)
31 |         json = request.get_json(force=True) # https://stackoverflow.com/questions/53216177/http-triggering-cloud-function-with-cloud-scheduler/60615210#60615210
32 | 
33 |         year = escape(json['year']) if 'year' in json else None
34 |         month = escape(json['month']) if 'month' in json else None
35 |         bucket = escape(json['bucket'])  # required
36 | 
37 |         if year is None or month is None or len(year) == 0 or len(month) == 0:
38 |             year, month = next_month(bucket)
39 |         logging.debug('Ingesting year={} month={}'.format(year, month))
40 |         tableref, numrows = ingest(year, month, bucket)
41 |         ok = 'Success ... ingested {} rows to {}'.format(numrows, tableref)
42 |         logging.info(ok)
43 |         return ok
44 |     except Exception as e:
45 |         logging.exception("Failed to ingest ... try again later?")
46 | 
47 | 
48 | if __name__ == "__main__":
49 |     app.run(debug=True, host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
50 | 


--------------------------------------------------------------------------------
/02_ingest/monthlyupdate/requirements.txt:
--------------------------------------------------------------------------------
1 | Flask==2.0.1
2 | google-cloud-storage==1.42.0
3 | google-cloud-bigquery==2.25.1
4 | gunicorn==20.1.0
5 | 
6 | 
7 | 


--------------------------------------------------------------------------------
/02_ingest/raw_download.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | #export YEAR=${YEAR:=2015}
 4 | SOURCE=https://transtats.bts.gov/PREZIP
 5 | 
 6 | OUTDIR=raw
 7 | mkdir -p $OUTDIR
 8 | 
 9 | for YEAR in `seq 2019 2019`; do
10 | for MONTH in `seq 1 12`; do
11 | 
12 |   FILE=On_Time_Reporting_Carrier_On_Time_Performance_1987_present_${YEAR}_${MONTH}.zip
13 |   curl -k -o ${OUTDIR}/${FILE}  ${SOURCE}/${FILE}
14 | 
15 | done
16 | done
17 | 


--------------------------------------------------------------------------------
/02_ingest/upload.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if [ "$#" -ne 1 ]; then
 4 |     echo "Usage: ./upload.sh  destination-bucket-name"
 5 |     exit
 6 | fi
 7 | 
 8 | BUCKET=$1
 9 | 
10 | echo "Uploading to bucket $BUCKET..."
11 | gsutil -m cp *.csv gs://$BUCKET/flights/raw/
12 | #gsutil -m acl ch -R -g allUsers:R gs://$BUCKET/flights/raw
13 | #gsutil -m acl ch -R -g google.com:R gs://$BUCKET/flights/raw
14 | 


--------------------------------------------------------------------------------
/03_sqlstudio/README.md:
--------------------------------------------------------------------------------
 1 | # 3. Creating compelling dashboards
 2 | 
 3 | ### Catch up to Chapter 2
 4 | If you have not already done so, load the raw data into a BigQuery dataset:
 5 | * Go to the Storage section of the GCP web console and create a new bucket
 6 | * Open CloudShell and git clone this repo:
 7 |     ```
 8 |     git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
 9 |     ```
10 | * Then, run:
11 |     ```
12 |     cd data-science-on-gcp/02_ingest
13 |     ./ingest.sh bucketname
14 |     ```
15 | 
16 | 
17 | ### Optional: Load the data into PostgreSQL
18 | * Navigate to https://console.cloud.google.com/sql
19 | * Select Create Instance
20 | * Choose PostgreSQL and then fill out the form as follows:
21 |   * Call the instance flights
22 |   * Generate a strong password by clicking GENERATE
23 |   * Choose the default PostgreSQL version
24 |   * Choose the region where your bucket of CSV data exists
25 |   * Choose a single zone instance
26 |   * Choose a standard machine type with 2 vCPU
27 |   * Click Create Instance
28 | *  Type (change bucket as necessary):
29 |   ```
30 |    gsutil cp create_table.sql \
31 |     gs://cloud-training-demos-ml/flights/ch3/create_table.sql
32 |   ```
33 | * Create empty table using web console:
34 |   * navigate to databases section of Cloud SQL and create a new database called bts
35 |   * navigate to flights instance and select IMPORT
36 |   * Specify location of create_table.sql in your bucket
37 |   * Specify that you want to create a table in the database bts
38 | * Load the CSV files into this table:
39 |   * Browse to 201501.csv in your bucket
40 |   * Specify CSV as the format
41 |   * bts as the database
42 |   * flights as the table
43 | * In Cloud Shell, connect to database and run queries
44 |   * Connect to the database using one of these two commands (the first if you don't need a SQL proxy, the second if you do -- you'll typically need a SQL proxy if your organization has set up a security rule to allow access only to authorized networks):
45 |     * ```gcloud sql connect flights --user=postgres```
46 |     * OR ```gcloud beta sql connect flights --user=postgres```
47 |   * In the prompt, type ```\c bts;```
48 |   * Type in the following query:
49 |   ``` 
50 |   SELECT "Origin", COUNT(*) AS num_flights 
51 |   FROM flights GROUP BY "Origin" 
52 |   ORDER BY num_flights DESC 
53 |   LIMIT 5;
54 |   ```
55 | * Add more months of CSV data and notice that the performance degrades.
56 | Once you are done, delete the Cloud SQL instance since you will not need it for the rest of the book.
57 | 
58 | ### Creating view in BigQuery
59 | * Run the script 
60 |   ```./create_views.sh```
61 | * Compute the contingency table for various thresholds by running the script 
62 |   ```
63 |   ./contingency.sh
64 |   ```
65 | 
66 | ### Building a dashboard
67 | Follow the steps in the main text of the chapter to set up a Data Studio dashboard and create charts.
68 | 
69 | 


--------------------------------------------------------------------------------
/03_sqlstudio/contingency.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | PROJECT=$(gcloud config get-value project)
4 | cat contingency4.sql \
5 |    | bq --project_id $PROJECT query --nouse_legacy_sql
6 | 


--------------------------------------------------------------------------------
/03_sqlstudio/contingency1.sql:
--------------------------------------------------------------------------------
1 | SELECT 
2 |     COUNT(*) AS true_positives
3 | FROM dsongcp.flights
4 | WHERE dep_delay < 15 AND arr_delay < 15
5 | 


--------------------------------------------------------------------------------
/03_sqlstudio/contingency2.sql:
--------------------------------------------------------------------------------
 1 | DECLARE THRESH INT64;
 2 | SET THRESH = 15;
 3 |  
 4 | SELECT 
 5 |     COUNTIF(dep_delay < THRESH AND arr_delay < 15) AS true_positives,
 6 |     COUNTIF(dep_delay < THRESH AND arr_delay >= 15) AS false_positives,
 7 |     COUNTIF(dep_delay >= THRESH AND arr_delay < 15) AS false_negatives,
 8 |     COUNTIF(dep_delay >= THRESH AND arr_delay >= 15) AS true_negatives,
 9 |     COUNT(*) AS total
10 | FROM dsongcp.flights
11 | WHERE arr_delay IS NOT NULL AND dep_delay IS NOT NULL
12 | 


--------------------------------------------------------------------------------
/03_sqlstudio/contingency3.sql:
--------------------------------------------------------------------------------
 1 | SELECT 
 2 |     THRESH,
 3 |     COUNTIF(dep_delay < THRESH AND arr_delay < 15) AS true_positives,
 4 |     COUNTIF(dep_delay < THRESH AND arr_delay >= 15) AS false_positives,
 5 |     COUNTIF(dep_delay >= THRESH AND arr_delay < 15) AS false_negatives,
 6 |     COUNTIF(dep_delay >= THRESH AND arr_delay >= 15) AS true_negatives,
 7 |     COUNT(*) AS total
 8 | FROM dsongcp.flights, UNNEST([5, 10, 11, 12, 13, 15, 20]) AS THRESH
 9 | WHERE arr_delay IS NOT NULL AND dep_delay IS NOT NULL
10 | GROUP BY THRESH
11 | 


--------------------------------------------------------------------------------
/03_sqlstudio/contingency4.sql:
--------------------------------------------------------------------------------
 1 | WITH contingency_table AS (
 2 |     SELECT 
 3 |         THRESH,
 4 |         COUNTIF(dep_delay < THRESH AND arr_delay < 15) AS true_positives,
 5 |         COUNTIF(dep_delay < THRESH AND arr_delay >= 15) AS false_positives,
 6 |         COUNTIF(dep_delay >= THRESH AND arr_delay < 15) AS false_negatives,
 7 |         COUNTIF(dep_delay >= THRESH AND arr_delay >= 15) AS true_negatives,
 8 |         COUNT(*) AS total
 9 |     FROM dsongcp.flights, UNNEST([5, 10, 11, 12, 13, 15, 20]) AS THRESH
10 |     WHERE arr_delay IS NOT NULL AND dep_delay IS NOT NULL
11 |     GROUP BY THRESH
12 | )
13 | 
14 | SELECT
15 |    ROUND((true_positives + true_negatives)/total, 2) AS accuracy,
16 |    ROUND(false_positives/(true_positives+false_positives), 2) AS fpr,
17 |    ROUND(false_negatives/(false_negatives+true_negatives), 2) AS fnr,
18 |    *
19 | FROM contingency_table
20 | 


--------------------------------------------------------------------------------
/03_sqlstudio/create_table.sql:
--------------------------------------------------------------------------------
  1 | drop table if exists flights;
  2 | 
  3 | CREATE TABLE flights (
  4 |   "Year" TEXT,
  5 |   "Quarter" TEXT,
  6 |   "Month" TEXT,
  7 |   "DayofMonth" TEXT,
  8 |   "DayOfWeek" TEXT,
  9 |   "FlightDate" TEXT,
 10 |   "Reporting_Airline" TEXT,
 11 |   "DOT_ID_Reporting_Airline" TEXT,
 12 |   "IATA_CODE_Reporting_Airline" TEXT,
 13 |   "Tail_Number" TEXT,
 14 |   "Flight_Number_Reporting_Airline" TEXT,
 15 |   "OriginAirportID" TEXT,
 16 |   "OriginAirportSeqID" TEXT,
 17 |   "OriginCityMarketID" TEXT,
 18 |   "Origin" TEXT,
 19 |   "OriginCityName" TEXT,
 20 |   "OriginState" TEXT,
 21 |   "OriginStateFips" TEXT,
 22 |   "OriginStateName" TEXT,
 23 |   "OriginWac" TEXT,
 24 |   "DestAirportID" TEXT,
 25 |   "DestAirportSeqID" TEXT,
 26 |   "DestCityMarketID" TEXT,
 27 |   "Dest" TEXT,
 28 |   "DestCityName" TEXT,
 29 |   "DestState" TEXT,
 30 |   "DestStateFips" TEXT,
 31 |   "DestStateName" TEXT,
 32 |   "DestWac" TEXT,
 33 |   "CRSDepTime" TEXT,
 34 |   "DepTime" TEXT,
 35 |   "DepDelay" TEXT,
 36 |   "DepDelayMinutes" TEXT,
 37 |   "DepDel15" TEXT,
 38 |   "DepartureDelayGroups" TEXT,
 39 |   "DepTimeBlk" TEXT,
 40 |   "TaxiOut" TEXT,
 41 |   "WheelsOff" TEXT,
 42 |   "WheelsOn" TEXT,
 43 |   "TaxiIn" TEXT,
 44 |   "CRSArrTime" TEXT,
 45 |   "ArrTime" TEXT,
 46 |   "ArrDelay" TEXT,
 47 |   "ArrDelayMinutes" TEXT,
 48 |   "ArrDel15" TEXT,
 49 |   "ArrivalDelayGroups" TEXT,
 50 |   "ArrTimeBlk" TEXT,
 51 |   "Cancelled" TEXT,
 52 |   "CancellationCode" TEXT,
 53 |   "Diverted" TEXT,
 54 |   "CRSElapsedTime" TEXT,
 55 |   "ActualElapsedTime" TEXT,
 56 |   "AirTime" TEXT,
 57 |   "Flights" TEXT,
 58 |   "Distance" TEXT,
 59 |   "DistanceGroup" TEXT,
 60 |   "CarrierDelay" TEXT,
 61 |   "WeatherDelay" TEXT,
 62 |   "NASDelay" TEXT,
 63 |   "SecurityDelay" TEXT,
 64 |   "LateAircraftDelay" TEXT,
 65 |   "FirstDepTime" TEXT,
 66 |   "TotalAddGTime" TEXT,
 67 |   "LongestAddGTime" TEXT,
 68 |   "DivAirportLandings" TEXT,
 69 |   "DivReachedDest" TEXT,
 70 |   "DivActualElapsedTime" TEXT,
 71 |   "DivArrDelay" TEXT,
 72 |   "DivDistance" TEXT,
 73 |   "Div1Airport" TEXT,
 74 |   "Div1AirportID" TEXT,
 75 |   "Div1AirportSeqID" TEXT,
 76 |   "Div1WheelsOn" TEXT,
 77 |   "Div1TotalGTime" TEXT,
 78 |   "Div1LongestGTime" TEXT,
 79 |   "Div1WheelsOff" TEXT,
 80 |   "Div1TailNum" TEXT,
 81 |   "Div2Airport" TEXT,
 82 |   "Div2AirportID" TEXT,
 83 |   "Div2AirportSeqID" TEXT,
 84 |   "Div2WheelsOn" TEXT,
 85 |   "Div2TotalGTime" TEXT,
 86 |   "Div2LongestGTime" TEXT,
 87 |   "Div2WheelsOff" TEXT,
 88 |   "Div2TailNum" TEXT,
 89 |   "Div3Airport" TEXT,
 90 |   "Div3AirportID" TEXT,
 91 |   "Div3AirportSeqID" TEXT,
 92 |   "Div3WheelsOn" TEXT,
 93 |   "Div3TotalGTime" TEXT,
 94 |   "Div3LongestGTime" TEXT,
 95 |   "Div3WheelsOff" TEXT,
 96 |   "Div3TailNum" TEXT,
 97 |   "Div4Airport" TEXT,
 98 |   "Div4AirportID" TEXT,
 99 |   "Div4AirportSeqID" TEXT,
100 |   "Div4WheelsOn" TEXT,
101 |   "Div4TotalGTime" TEXT,
102 |   "Div4LongestGTime" TEXT,
103 |   "Div4WheelsOff" TEXT,
104 |   "Div4TailNum" TEXT,
105 |   "Div5Airport" TEXT,
106 |   "Div5AirportID" TEXT,
107 |   "Div5AirportSeqID" TEXT,
108 |   "Div5WheelsOn" TEXT,
109 |   "Div5TotalGTime" TEXT,
110 |   "Div5LongestGTime" TEXT,
111 |   "Div5WheelsOff" TEXT,
112 |   "Div5TailNum" TEXT,
113 |   "junk" TEXT
114 | );
115 | 


--------------------------------------------------------------------------------
/03_sqlstudio/create_views.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | PROJECT=$(gcloud config get-value project)
4 | cat create_views.sql | bq --project_id $PROJECT query --nouse_legacy_sql
5 | 


--------------------------------------------------------------------------------
/03_sqlstudio/create_views.sql:
--------------------------------------------------------------------------------
 1 | CREATE OR REPLACE VIEW dsongcp.flights
 2 | -- CREATE MATERIALIZED VIEW dsongcp.flights
 3 | -- PARTITION BY DATE_TRUNC(FL_DATE, MONTH)
 4 | AS
 5 | SELECT
 6 |   FlightDate AS FL_DATE,
 7 |   Reporting_Airline AS UNIQUE_CARRIER,
 8 |   OriginAirportSeqID AS ORIGIN_AIRPORT_SEQ_ID,
 9 |   Origin AS ORIGIN,
10 |   DestAirportSeqID AS DEST_AIRPORT_SEQ_ID,
11 |   Dest AS DEST,
12 |   CRSDepTime AS CRS_DEP_TIME,
13 |   DepTime AS DEP_TIME,
14 |   CAST(DepDelay AS FLOAT64) AS DEP_DELAY,
15 |   CAST(TaxiOut AS FLOAT64) AS TAXI_OUT,
16 |   WheelsOff AS WHEELS_OFF,
17 |   WheelsOn AS WHEELS_ON,
18 |   CAST(TaxiIn AS FLOAT64) AS TAXI_IN,
19 |   CRSArrTime AS CRS_ARR_TIME,
20 |   ArrTime AS ARR_TIME,
21 |   CAST(ArrDelay AS FLOAT64) AS ARR_DELAY,
22 |   IF(Cancelled = '1.00', True, False) AS CANCELLED,
23 |   IF(Diverted = '1.00', True, False) AS DIVERTED,
24 |   DISTANCE
25 | FROM dsongcp.flights_raw;
26 | 
27 | CREATE OR REPLACE VIEW dsongcp.delayed_10 AS
28 | SELECT * FROM dsongcp.flights WHERE dep_delay >= 10;
29 | 
30 | CREATE OR REPLACE VIEW dsongcp.delayed_15 AS
31 | SELECT * FROM dsongcp.flights WHERE dep_delay >= 15;
32 | 
33 | CREATE OR REPLACE VIEW dsongcp.delayed_20 AS
34 | SELECT * FROM dsongcp.flights WHERE dep_delay >= 20;
35 | 
36 | 
37 | 


--------------------------------------------------------------------------------
/04_streaming/.gitignore:
--------------------------------------------------------------------------------
1 | .*
2 | 


--------------------------------------------------------------------------------
/04_streaming/README.md:
--------------------------------------------------------------------------------
  1 | # 4. Streaming data: publication and ingest
  2 | 
  3 | ### Catch up until Chapter 3 if necessary
  4 | * Go to the Storage section of the GCP web console and create a new bucket
  5 | * Open CloudShell and git clone this repo:
  6 |     ```
  7 |     git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
  8 |     ```
  9 | * Then, run:
 10 | ```
 11 | cd data-science-on-gcp/02_ingest
 12 | ./ingest_from_crsbucket bucketname
 13 | ```
 14 | * Run:
 15 | ```
 16 | cd ../03_sqlstudio
 17 | ./create_views.sh
 18 | ```
 19 | 
 20 | ### Batch processing transformation in DataFlow
 21 | * Setup:
 22 |     ```
 23 | 	cd transform; ./install_packages.sh
 24 |     ```
 25 | * Parsing airports data:
 26 | 	```
 27 | 	./df01.py
 28 | 	head extracted_airports-00000*
 29 | 	rm extracted_airports-*
 30 | 	```
 31 | * Adding timezone information:
 32 | 	```
 33 | 	./df02.py
 34 | 	head airports_with_tz-00000*
 35 | 	rm airports_with_tz-*
 36 | 	```
 37 | * Converting times to UTC:
 38 | 	```
 39 | 	./df03.py
 40 | 	head -3 all_flights-00000*
 41 | 	```
 42 | * Correcting dates:
 43 | 	```
 44 | 	./df04.py
 45 | 	head -3 all_flights-00000*
 46 | 	rm all_flights-*
 47 | 	```
 48 | * Create events:
 49 | 	```
 50 | 	./df05.py
 51 | 	head -3 all_events-00000*
 52 | 	rm all_events-*
 53 | 	```  
 54 | * Read/write to Cloud:
 55 | 	```
 56 |     ./stage_airports_file.sh BUCKETNAME
 57 | 	./df06.py --project PROJECT --bucket BUCKETNAME
 58 | 	``` 
 59 |     Look for new tables in BigQuery (flights_simevents)
 60 | * Run on Cloud:
 61 | 	```
 62 | 	./df07.py --project PROJECT --bucket BUCKETNAME --region us-central1
 63 | 	``` 
 64 | * Go to the GCP web console and wait for the Dataflow ch04timecorr job to finish. It might take between 30 minutes and 2+ hours depending on the quota associated with your project (you can change the quota by going to https://console.cloud.google.com/iam-admin/quotas).
 65 | * Then, navigate to the BigQuery console and type in:
 66 | 	```
 67 |         SELECT
 68 |           ORIGIN,
 69 |           DEP_TIME,
 70 |           DEST,
 71 |           ARR_TIME,
 72 |           ARR_DELAY,
 73 |           EVENT_TIME,
 74 |           EVENT_TYPE
 75 |         FROM
 76 |           dsongcp.flights_simevents
 77 |         WHERE
 78 |           (DEP_DELAY > 15 and ORIGIN = 'SEA') or
 79 |           (ARR_DELAY > 15 and DEST = 'SEA')
 80 |         ORDER BY EVENT_TIME ASC
 81 |         LIMIT
 82 |           5
 83 | 
 84 | 	```
 85 | ### Simulate event stream
 86 | * In CloudShell, run
 87 | 	```
 88 |     cd simulate
 89 | 	python3 ./simulate.py --startTime '2015-05-01 00:00:00 UTC' --endTime '2015-05-04 00:00:00 UTC' --speedFactor=30 --project $DEVSHELL_PROJECT_ID
 90 |     ```
 91 |  
 92 | ### Real-time Stream Processing
 93 | * In another CloudShell tab, run avg01.py:
 94 | 	```
 95 | 	cd realtime
 96 | 	./avg01.py --project PROJECT --bucket BUCKETNAME --region us-central1
 97 | 	```
 98 | * In about a minute, you can query events from the BigQuery console:
 99 | 	```
100 | 	SELECT * FROM dsongcp.streaming_events
101 | 	ORDER BY EVENT_TIME DESC
102 |     LIMIT 5
103 | 	```
104 | * Stop avg01.py by hitting Ctrl+C
105 | * Run avg02.py:
106 | 	```
107 | 	./avg02.py --project PROJECT --bucket BUCKETNAME --region us-central1
108 | 	```
109 | * In about 5 min, you can query from the BigQuery console:
110 | 	```
111 | 	SELECT * FROM dsongcp.streaming_delays
112 | 	ORDER BY END_TIME DESC
113 |     LIMIT 5
114 | 	``` 
115 | * Look at how often the data is coming in:
116 | 	```
117 |     SELECT END_TIME, num_flights
118 |     FROM dsongcp.streaming_delays
119 |     ORDER BY END_TIME DESC
120 |     LIMIT 5
121 | 	``` 
122 | * It's likely that the pipeline will be stuck. You need to run this on Dataflow.
123 | * Stop avg02.py by hitting Ctrl+C
124 | * In BigQuery, truncate the table:
125 | 	```
126 | 	TRUNCATE TABLE dsongcp.streaming_delays
127 | 	``` 
128 | * Run avg03.py:
129 | 	```
130 | 	./avg03.py --project PROJECT --bucket BUCKETNAME --region us-central1
131 | 	```
132 | * Go to the GCP web console in the Dataflow section and monitor the job.
133 | * Once the job starts writing to BigQuery, run this query and save this as a view:
134 | 	```
135 | 	SELECT * FROM dsongcp.streaming_delays
136 |     WHERE AIRPORT = 'ATL'
137 |     ORDER BY END_TIME DESC
138 | 	```
139 | * Create a view of the latest arrival delay by airport:
140 | 	```
141 |     CREATE OR REPLACE VIEW dsongcp.airport_delays AS
142 |     WITH delays AS (
143 |         SELECT d.*, a.LATITUDE, a.LONGITUDE
144 |         FROM dsongcp.streaming_delays d
145 |         JOIN dsongcp.airports a USING(AIRPORT) 
146 |         WHERE a.AIRPORT_IS_LATEST = 1
147 |     )
148 |      
149 |     SELECT 
150 |         AIRPORT,
151 |         CONCAT(LATITUDE, ',', LONGITUDE) AS LOCATION,
152 |         ARRAY_AGG(
153 |             STRUCT(AVG_ARR_DELAY, AVG_DEP_DELAY, NUM_FLIGHTS, END_TIME)
154 |             ORDER BY END_TIME DESC LIMIT 1) AS a
155 |     FROM delays
156 |     GROUP BY AIRPORT, LONGITUDE, LATITUDE
157 | 
158 | 	```   
159 | * Follow the steps in the chapter to connect to Data Studio and create a GeoMap.
160 | * Stop the simulation program in CloudShell.
161 | * From the GCP web console, stop the Dataflow streaming pipeline.
162 | 
163 | 


--------------------------------------------------------------------------------
/04_streaming/design/airport_schema.json:
--------------------------------------------------------------------------------
  1 | [
  2 |       {
  3 |         "mode": "NULLABLE",
  4 |         "name": "AIRPORT_SEQ_ID",
  5 |         "type": "INTEGER"
  6 |       },
  7 |       {
  8 |         "mode": "NULLABLE",
  9 |         "name": "AIRPORT_ID",
 10 |         "type": "INTEGER"
 11 |       },
 12 |       {
 13 |         "mode": "NULLABLE",
 14 |         "name": "AIRPORT",
 15 |         "type": "STRING"
 16 |       },
 17 |       {
 18 |         "mode": "NULLABLE",
 19 |         "name": "DISPLAY_AIRPORT_NAME",
 20 |         "type": "STRING"
 21 |       },
 22 |       {
 23 |         "mode": "NULLABLE",
 24 |         "name": "DISPLAY_AIRPORT_CITY_NAME_FULL",
 25 |         "type": "STRING"
 26 |       },
 27 |       {
 28 |         "mode": "NULLABLE",
 29 |         "name": "AIRPORT_WAC_SEQ_ID2",
 30 |         "type": "INTEGER"
 31 |       },
 32 |       {
 33 |         "mode": "NULLABLE",
 34 |         "name": "AIRPORT_WAC",
 35 |         "type": "INTEGER"
 36 |       },
 37 |       {
 38 |         "mode": "NULLABLE",
 39 |         "name": "AIRPORT_COUNTRY_NAME",
 40 |         "type": "STRING"
 41 |       },
 42 |       {
 43 |         "mode": "NULLABLE",
 44 |         "name": "AIRPORT_COUNTRY_CODE_ISO",
 45 |         "type": "STRING"
 46 |       },
 47 |       {
 48 |         "mode": "NULLABLE",
 49 |         "name": "AIRPORT_STATE_NAME",
 50 |         "type": "STRING"
 51 |       },
 52 |       {
 53 |         "mode": "NULLABLE",
 54 |         "name": "AIRPORT_STATE_CODE",
 55 |         "type": "STRING"
 56 |       },
 57 |       {
 58 |         "mode": "NULLABLE",
 59 |         "name": "AIRPORT_STATE_FIPS",
 60 |         "type": "INTEGER"
 61 |       },
 62 |       {
 63 |         "mode": "NULLABLE",
 64 |         "name": "CITY_MARKET_SEQ_ID",
 65 |         "type": "INTEGER"
 66 |       },
 67 |       {
 68 |         "mode": "NULLABLE",
 69 |         "name": "CITY_MARKET_ID",
 70 |         "type": "INTEGER"
 71 |       },
 72 |       {
 73 |         "mode": "NULLABLE",
 74 |         "name": "DISPLAY_CITY_MARKET_NAME_FULL",
 75 |         "type": "STRING"
 76 |       },
 77 |       {
 78 |         "mode": "NULLABLE",
 79 |         "name": "CITY_MARKET_WAC_SEQ_ID2",
 80 |         "type": "INTEGER"
 81 |       },
 82 |       {
 83 |         "mode": "NULLABLE",
 84 |         "name": "CITY_MARKET_WAC",
 85 |         "type": "INTEGER"
 86 |       },
 87 |       {
 88 |         "mode": "NULLABLE",
 89 |         "name": "LAT_DEGREES",
 90 |         "type": "INTEGER"
 91 |       },
 92 |       {
 93 |         "mode": "NULLABLE",
 94 |         "name": "LAT_HEMISPHERE",
 95 |         "type": "STRING"
 96 |       },
 97 |       {
 98 |         "mode": "NULLABLE",
 99 |         "name": "LAT_MINUTES",
100 |         "type": "INTEGER"
101 |       },
102 |       {
103 |         "mode": "NULLABLE",
104 |         "name": "LAT_SECONDS",
105 |         "type": "INTEGER"
106 |       },
107 |       {
108 |         "mode": "NULLABLE",
109 |         "name": "LATITUDE",
110 |         "type": "FLOAT"
111 |       },
112 |       {
113 |         "mode": "NULLABLE",
114 |         "name": "LON_DEGREES",
115 |         "type": "INTEGER"
116 |       },
117 |       {
118 |         "mode": "NULLABLE",
119 |         "name": "LON_HEMISPHERE",
120 |         "type": "STRING"
121 |       },
122 |       {
123 |         "mode": "NULLABLE",
124 |         "name": "LON_MINUTES",
125 |         "type": "INTEGER"
126 |       },
127 |       {
128 |         "mode": "NULLABLE",
129 |         "name": "LON_SECONDS",
130 |         "type": "INTEGER"
131 |       },
132 |       {
133 |         "mode": "NULLABLE",
134 |         "name": "LONGITUDE",
135 |         "type": "FLOAT"
136 |       },
137 |       {
138 |         "mode": "NULLABLE",
139 |         "name": "UTC_LOCAL_TIME_VARIATION",
140 |         "type": "INTEGER"
141 |       },
142 |       {
143 |         "mode": "NULLABLE",
144 |         "name": "AIRPORT_START_DATE",
145 |         "type": "DATE"
146 |       },
147 |       {
148 |         "mode": "NULLABLE",
149 |         "name": "AIRPORT_THRU_DATE",
150 |         "type": "DATE"
151 |       },
152 |       {
153 |         "mode": "NULLABLE",
154 |         "name": "AIRPORT_IS_CLOSED",
155 |         "type": "INTEGER"
156 |       },
157 |       {
158 |         "mode": "NULLABLE",
159 |         "name": "AIRPORT_IS_LATEST",
160 |         "type": "INTEGER"
161 |       },
162 |       {
163 |         "mode": "NULLABLE",
164 |         "name": "string_field_32",
165 |         "type": "STRING"
166 |       }
167 | ]
168 | 


--------------------------------------------------------------------------------
/04_streaming/design/mktbl.sh:
--------------------------------------------------------------------------------
1 | #!/bin/sh
2 | bq mk --external_table_definition=./airport_schema.json@CSV=gs://data-science-on-gcp/edition2/raw/airports.csv dsongcp.airports_gcs
3 | 


--------------------------------------------------------------------------------
/04_streaming/design/queries.txt:
--------------------------------------------------------------------------------
 1 | SELECT 
 2 | AIRPORT_SEQ_ID, AIRPORT_ID, AIRPORT, DISPLAY_AIRPORT_NAME,
 3 | LAT_DEGREES, LAT_HEMISPHERE, LAT_MINUTES, LAT_SECONDS, LATITUDE
 4 | FROM dsongcp.airports_gcs
 5 | WHERE DISPLAY_AIRPORT_NAME LIKE '%Seattle%'
 6 | 
 7 | 
 8 | SELECT 
 9 |   AIRPORT, LATITUDE, LONGITUDE
10 | FROM dsongcp.airports_gcs
11 | WHERE AIRPORT_IS_LATEST = 1 AND AIRPORT = 'DFW'
12 | 
13 | 


--------------------------------------------------------------------------------
/04_streaming/ingest_from_crsbucket.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if [ "$#" -ne 1 ]; then
 4 |     echo "Usage: ./ingest_from_crsbucket.sh  destination-bucket-name"
 5 |     exit
 6 | fi
 7 | 
 8 | BUCKET=$1
 9 | FROM=gs://data-science-on-gcp/edition2/flights/tzcorr
10 | TO=gs://$BUCKET/flights/tzcorr
11 | 
12 | #sharded files
13 | CMD="gsutil -m cp "
14 | for SHARD in `seq -w 0 26`; do
15 |   CMD="$CMD ${FROM}/all_flights-000${SHARD}-of-00026"
16 | done
17 | CMD="$CMD $TO"
18 | echo $CMD
19 | $CMD
20 | 
21 | # load tzcorr into BigQuery
22 | PROJECT=$(gcloud config get-value project)
23 | bq --project_id $PROJECT \
24 |   load --source_format=NEWLINE_DELIMITED_JSON --autodetect ${PROJECT}:dsongcp.flights_tzcorr \
25 |   ${TO}/all_flights-*
26 | 
27 | cd transform
28 | 
29 | # airports.csv
30 | ./stage_airports_file.sh ${BUCKET}
31 | 


--------------------------------------------------------------------------------
/04_streaming/realtime/avg01.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | 
 3 | # Copyright 2021 Google Inc.
 4 | #
 5 | # Licensed under the Apache License, Version 2.0 (the "License");
 6 | # you may not use this file except in compliance with the License.
 7 | # You may obtain a copy of the License at
 8 | #
 9 | #     http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | 
17 | import apache_beam as beam
18 | import json
19 | 
20 | 
21 | def run(project, bucket, region):
22 |     argv = [
23 |         '--project={0}'.format(project),
24 |         '--job_name=ch04avgdelay',
25 |         '--streaming',
26 |         '--save_main_session',
27 |         '--staging_location=gs://{0}/flights/staging/'.format(bucket),
28 |         '--temp_location=gs://{0}/flights/temp/'.format(bucket),
29 |         '--setup_file=./setup.py',
30 |         '--autoscaling_algorithm=THROUGHPUT_BASED',
31 |         '--max_num_workers=8',
32 |         '--region={}'.format(region),
33 |         '--runner=DirectRunner'
34 |     ]
35 | 
36 |     with beam.Pipeline(argv=argv) as pipeline:
37 |         events = {}
38 | 
39 |         for event_name in ['arrived', 'departed']:
40 |             topic_name = "projects/{}/topics/{}".format(project, event_name)
41 | 
42 |             events[event_name] = (pipeline
43 |                                   | 'read:{}'.format(event_name) >> beam.io.ReadFromPubSub(topic=topic_name)
44 |                                   | 'parse:{}'.format(event_name) >> beam.Map(lambda s: json.loads(s))
45 |                                   )
46 | 
47 |         all_events = (events['arrived'], events['departed']) | beam.Flatten()
48 | 
49 |         flights_schema = ','.join([
50 |             'FL_DATE:date,UNIQUE_CARRIER:string,ORIGIN_AIRPORT_SEQ_ID:string,ORIGIN:string',
51 |             'DEST_AIRPORT_SEQ_ID:string,DEST:string,CRS_DEP_TIME:timestamp,DEP_TIME:timestamp',
52 |             'DEP_DELAY:float,TAXI_OUT:float,WHEELS_OFF:timestamp,WHEELS_ON:timestamp,TAXI_IN:float',
53 |             'CRS_ARR_TIME:timestamp,ARR_TIME:timestamp,ARR_DELAY:float,CANCELLED:boolean',
54 |             'DIVERTED:boolean,DISTANCE:float',
55 |             'DEP_AIRPORT_LAT:float,DEP_AIRPORT_LON:float,DEP_AIRPORT_TZOFFSET:float',
56 |             'ARR_AIRPORT_LAT:float,ARR_AIRPORT_LON:float,ARR_AIRPORT_TZOFFSET:float'])
57 |         events_schema = ','.join([flights_schema, 'EVENT_TYPE:string,EVENT_TIME:timestamp'])
58 | 
59 |         schema = events_schema
60 | 
61 |         (all_events
62 |          | 'bqout' >> beam.io.WriteToBigQuery(
63 |                     'dsongcp.streaming_events', schema=schema,
64 |                     create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
65 |                 )
66 |          )
67 | 
68 | 
69 | if __name__ == '__main__':
70 |     import argparse
71 | 
72 |     parser = argparse.ArgumentParser(description='Run pipeline on the cloud')
73 |     parser.add_argument('-p', '--project', help='Unique project ID', required=True)
74 |     parser.add_argument('-b', '--bucket', help='Bucket where gs://BUCKET/flights/airports/airports.csv.gz exists',
75 |                         required=True)
76 |     parser.add_argument('-r', '--region',
77 |                         help='Region in which to run the Dataflow job. Choose the same region as your bucket.',
78 |                         required=True)
79 | 
80 |     args = vars(parser.parse_args())
81 | 
82 |     run(project=args['project'], bucket=args['bucket'], region=args['region'])
83 | 


--------------------------------------------------------------------------------
/04_streaming/realtime/avg02.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | # Copyright 2021 Google Inc.
  4 | #
  5 | # Licensed under the Apache License, Version 2.0 (the "License");
  6 | # you may not use this file except in compliance with the License.
  7 | # You may obtain a copy of the License at
  8 | #
  9 | #     http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | 
 17 | import apache_beam as beam
 18 | import logging
 19 | import json
 20 | import numpy as np
 21 | 
 22 | DATETIME_FORMAT = '%Y-%m-%dT%H:%M:%S'
 23 | 
 24 | 
 25 | def compute_stats(airport, events):
 26 |     arrived = [event['ARR_DELAY'] for event in events if event['EVENT_TYPE'] == 'arrived']
 27 |     avg_arr_delay = float(np.mean(arrived)) if len(arrived) > 0 else None
 28 | 
 29 |     departed = [event['DEP_DELAY'] for event in events if event['EVENT_TYPE'] == 'departed']
 30 |     avg_dep_delay = float(np.mean(departed)) if len(departed) > 0 else None
 31 | 
 32 |     num_flights = len(events)
 33 |     start_time = min([event['EVENT_TIME'] for event in events])
 34 |     latest_time = max([event['EVENT_TIME'] for event in events])
 35 | 
 36 |     return {
 37 |         'AIRPORT': airport,
 38 |         'AVG_ARR_DELAY': avg_arr_delay,
 39 |         'AVG_DEP_DELAY': avg_dep_delay,
 40 |         'NUM_FLIGHTS': num_flights,
 41 |         'START_TIME': start_time,
 42 |         'END_TIME': latest_time
 43 |     }
 44 | 
 45 | 
 46 | def by_airport(event):
 47 |     if event['EVENT_TYPE'] == 'departed':
 48 |         return event['ORIGIN'], event
 49 |     else:
 50 |         return event['DEST'], event
 51 | 
 52 | 
 53 | def run(project, bucket, region):
 54 |     argv = [
 55 |         '--project={0}'.format(project),
 56 |         '--job_name=ch04avgdelay',
 57 |         '--streaming',
 58 |         '--save_main_session',
 59 |         '--staging_location=gs://{0}/flights/staging/'.format(bucket),
 60 |         '--temp_location=gs://{0}/flights/temp/'.format(bucket),
 61 |         '--autoscaling_algorithm=THROUGHPUT_BASED',
 62 |         '--max_num_workers=8',
 63 |         '--region={}'.format(region),
 64 |         '--runner=DirectRunner'
 65 |     ]
 66 | 
 67 |     with beam.Pipeline(argv=argv) as pipeline:
 68 |         events = {}
 69 | 
 70 |         for event_name in ['arrived', 'departed']:
 71 |             topic_name = "projects/{}/topics/{}".format(project, event_name)
 72 | 
 73 |             events[event_name] = (pipeline
 74 |                                   | 'read:{}'.format(event_name) >> beam.io.ReadFromPubSub(
 75 |                                                 topic=topic_name, timestamp_attribute='EventTimeStamp')
 76 |                                   | 'parse:{}'.format(event_name) >> beam.Map(lambda s: json.loads(s))
 77 |                                   )
 78 | 
 79 |         all_events = (events['arrived'], events['departed']) | beam.Flatten()
 80 | 
 81 |         stats = (all_events
 82 |                  | 'byairport' >> beam.Map(by_airport)
 83 |                  | 'window' >> beam.WindowInto(beam.window.SlidingWindows(60 * 60, 5 * 60))
 84 |                  | 'group' >> beam.GroupByKey()
 85 |                  | 'stats' >> beam.Map(lambda x: compute_stats(x[0], x[1]))
 86 |         )
 87 | 
 88 |         stats_schema = ','.join(['AIRPORT:string,AVG_ARR_DELAY:float,AVG_DEP_DELAY:float',
 89 |                                  'NUM_FLIGHTS:int64,START_TIME:timestamp,END_TIME:timestamp'])
 90 |         (stats
 91 |          | 'bqout' >> beam.io.WriteToBigQuery(
 92 |                     'dsongcp.streaming_delays', schema=stats_schema,
 93 |                     create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
 94 |                 )
 95 |          )
 96 | 
 97 | 
 98 | if __name__ == '__main__':
 99 |     import argparse
100 | 
101 |     parser = argparse.ArgumentParser(description='Run pipeline on the cloud')
102 |     parser.add_argument('-p', '--project', help='Unique project ID', required=True)
103 |     parser.add_argument('-b', '--bucket', help='Bucket where gs://BUCKET/flights/airports/airports.csv.gz exists',
104 |                         required=True)
105 |     parser.add_argument('-r', '--region',
106 |                         help='Region in which to run the Dataflow job. Choose the same region as your bucket.',
107 |                         required=True)
108 | 
109 |     args = vars(parser.parse_args())
110 | 
111 |     run(project=args['project'], bucket=args['bucket'], region=args['region'])
112 | 


--------------------------------------------------------------------------------
/04_streaming/realtime/avg03.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | # Copyright 2021 Google Inc.
  4 | #
  5 | # Licensed under the Apache License, Version 2.0 (the "License");
  6 | # you may not use this file except in compliance with the License.
  7 | # You may obtain a copy of the License at
  8 | #
  9 | #     http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | 
 17 | import apache_beam as beam
 18 | import logging
 19 | import json
 20 | import numpy as np
 21 | 
 22 | DATETIME_FORMAT = '%Y-%m-%dT%H:%M:%S'
 23 | 
 24 | 
 25 | def compute_stats(airport, events):
 26 |     arrived = [event['ARR_DELAY'] for event in events if event['EVENT_TYPE'] == 'arrived']
 27 |     avg_arr_delay = float(np.mean(arrived)) if len(arrived) > 0 else None
 28 | 
 29 |     departed = [event['DEP_DELAY'] for event in events if event['EVENT_TYPE'] == 'departed']
 30 |     avg_dep_delay = float(np.mean(departed)) if len(departed) > 0 else None
 31 | 
 32 |     num_flights = len(events)
 33 |     start_time = min([event['EVENT_TIME'] for event in events])
 34 |     latest_time = max([event['EVENT_TIME'] for event in events])
 35 | 
 36 |     return {
 37 |         'AIRPORT': airport,
 38 |         'AVG_ARR_DELAY': avg_arr_delay,
 39 |         'AVG_DEP_DELAY': avg_dep_delay,
 40 |         'NUM_FLIGHTS': num_flights,
 41 |         'START_TIME': start_time,
 42 |         'END_TIME': latest_time
 43 |     }
 44 | 
 45 | 
 46 | def by_airport(event):
 47 |     if event['EVENT_TYPE'] == 'departed':
 48 |         return event['ORIGIN'], event
 49 |     else:
 50 |         return event['DEST'], event
 51 | 
 52 | 
 53 | def run(project, bucket, region):
 54 |     argv = [
 55 |         '--project={0}'.format(project),
 56 |         '--job_name=ch04avgdelay',
 57 |         '--streaming',
 58 |         '--save_main_session',
 59 |         '--staging_location=gs://{0}/flights/staging/'.format(bucket),
 60 |         '--temp_location=gs://{0}/flights/temp/'.format(bucket),
 61 |         '--autoscaling_algorithm=THROUGHPUT_BASED',
 62 |         '--max_num_workers=8',
 63 |         '--region={}'.format(region),
 64 |         '--runner=DataflowRunner'
 65 |     ]
 66 | 
 67 |     with beam.Pipeline(argv=argv) as pipeline:
 68 |         events = {}
 69 | 
 70 |         for event_name in ['arrived', 'departed']:
 71 |             topic_name = "projects/{}/topics/{}".format(project, event_name)
 72 | 
 73 |             events[event_name] = (pipeline
 74 |                                   | 'read:{}'.format(event_name) >> beam.io.ReadFromPubSub(
 75 |                                                 topic=topic_name, timestamp_attribute='EventTimeStamp')
 76 |                                   | 'parse:{}'.format(event_name) >> beam.Map(lambda s: json.loads(s))
 77 |                                   )
 78 | 
 79 |         all_events = (events['arrived'], events['departed']) | beam.Flatten()
 80 | 
 81 |         stats = (all_events
 82 |                  | 'byairport' >> beam.Map(by_airport)
 83 |                  | 'window' >> beam.WindowInto(beam.window.SlidingWindows(60 * 60, 5 * 60))
 84 |                  | 'group' >> beam.GroupByKey()
 85 |                  | 'stats' >> beam.Map(lambda x: compute_stats(x[0], x[1]))
 86 |         )
 87 | 
 88 |         stats_schema = ','.join(['AIRPORT:string,AVG_ARR_DELAY:float,AVG_DEP_DELAY:float',
 89 |                                  'NUM_FLIGHTS:int64,START_TIME:timestamp,END_TIME:timestamp'])
 90 |         (stats
 91 |          | 'bqout' >> beam.io.WriteToBigQuery(
 92 |                     'dsongcp.streaming_delays', schema=stats_schema,
 93 |                     create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
 94 |                 )
 95 |          )
 96 | 
 97 | 
 98 | if __name__ == '__main__':
 99 |     import argparse
100 | 
101 |     parser = argparse.ArgumentParser(description='Run pipeline on the cloud')
102 |     parser.add_argument('-p', '--project', help='Unique project ID', required=True)
103 |     parser.add_argument('-b', '--bucket', help='Bucket where gs://BUCKET/flights/airports/airports.csv.gz exists',
104 |                         required=True)
105 |     parser.add_argument('-r', '--region',
106 |                         help='Region in which to run the Dataflow job. Choose the same region as your bucket.',
107 |                         required=True)
108 | 
109 |     args = vars(parser.parse_args())
110 | 
111 |     run(project=args['project'], bucket=args['bucket'], region=args['region'])
112 | 


--------------------------------------------------------------------------------
/04_streaming/simulate/.gitignore:
--------------------------------------------------------------------------------
1 | *-?????-of-?????
2 | *.egg-info
3 | .*
4 | 


--------------------------------------------------------------------------------
/04_streaming/simulate/airports.csv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GoogleCloudPlatform/data-science-on-gcp/652564b9feeeaab331ce27fdd672b8226ba1e837/04_streaming/simulate/airports.csv.gz


--------------------------------------------------------------------------------
/04_streaming/simulate/simulate.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | # Copyright 2016 Google Inc.
  4 | #
  5 | # Licensed under the Apache License, Version 2.0 (the "License");
  6 | # you may not use this file except in compliance with the License.
  7 | # You may obtain a copy of the License at
  8 | #
  9 | #     http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | 
 17 | import time
 18 | import pytz
 19 | import logging
 20 | import argparse
 21 | import datetime
 22 | import google.cloud.pubsub_v1 as pubsub # Use v1 of the API
 23 | import google.cloud.bigquery as bq
 24 | 
 25 | TIME_FORMAT = '%Y-%m-%d %H:%M:%S %Z'
 26 | RFC3339_TIME_FORMAT = '%Y-%m-%dT%H:%M:%S-00:00'
 27 | 
 28 | def publish(publisher, topics, allevents, notify_time):
 29 |    timestamp = notify_time.strftime(RFC3339_TIME_FORMAT)
 30 |    for key in topics:  # 'departed', 'arrived', etc.
 31 |       topic = topics[key]
 32 |       events = allevents[key]
 33 |       # the client automatically batches
 34 |       logging.info('Publishing {} {} till {}'.format(len(events), key, timestamp))
 35 |       for event_data in events:
 36 |           publisher.publish(topic, event_data.encode(), EventTimeStamp=timestamp)
 37 | 
 38 | def notify(publisher, topics, rows, simStartTime, programStart, speedFactor):
 39 |    # sleep computation
 40 |    def compute_sleep_secs(notify_time):
 41 |         time_elapsed = (datetime.datetime.utcnow() - programStart).total_seconds()
 42 |         sim_time_elapsed = (notify_time - simStartTime).total_seconds() / speedFactor
 43 |         to_sleep_secs = sim_time_elapsed - time_elapsed
 44 |         return to_sleep_secs
 45 | 
 46 |    tonotify = {}
 47 |    for key in topics:
 48 |      tonotify[key] = list()
 49 | 
 50 |    for row in rows:
 51 |        event_type, notify_time, event_data = row
 52 | 
 53 |        # how much time should we sleep?
 54 |        if compute_sleep_secs(notify_time) > 1:
 55 |           # notify the accumulated tonotify
 56 |           publish(publisher, topics, tonotify, notify_time)
 57 |           for key in topics:
 58 |              tonotify[key] = list()
 59 | 
 60 |           # recompute sleep, since notification takes a while
 61 |           to_sleep_secs = compute_sleep_secs(notify_time)
 62 |           if to_sleep_secs > 0:
 63 |              logging.info('Sleeping {} seconds'.format(to_sleep_secs))
 64 |              time.sleep(to_sleep_secs)
 65 |        tonotify[event_type].append(event_data)
 66 | 
 67 |    # left-over records; notify again
 68 |    publish(publisher, topics, tonotify, notify_time)
 69 | 
 70 | 
 71 | if __name__ == '__main__':
 72 |    parser = argparse.ArgumentParser(description='Send simulated flight events to Cloud Pub/Sub')
 73 |    parser.add_argument('--startTime', help='Example: 2015-05-01 00:00:00 UTC', required=True)
 74 |    parser.add_argument('--endTime', help='Example: 2015-05-03 00:00:00 UTC', required=True)
 75 |    parser.add_argument('--project', help='your project id, to create pubsub topic', required=True)
 76 |    parser.add_argument('--speedFactor', help='Example: 60 implies 1 hour of data sent to Cloud Pub/Sub in 1 minute', required=True, type=float)
 77 |    parser.add_argument('--jitter', help='type of jitter to add: None, uniform, exp  are the three options', default='None')
 78 | 
 79 |    # set up BigQuery bqclient
 80 |    logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.INFO)
 81 |    args = parser.parse_args()
 82 |    bqclient = bq.Client(args.project)
 83 |    bqclient.get_table('dsongcp.flights_simevents')  # throws exception on failure
 84 | 
 85 |    # jitter?
 86 |    if args.jitter == 'exp':
 87 |       jitter = 'CAST (-LN(RAND()*0.99 + 0.01)*30 + 90.5 AS INT64)'
 88 |    elif args.jitter == 'uniform':
 89 |       jitter = 'CAST(90.5 + RAND()*30 AS INT64)'
 90 |    else:
 91 |       jitter = '0'
 92 | 
 93 | 
 94 |    # run the query to pull simulated events
 95 |    querystr = """
 96 | SELECT
 97 |   EVENT_TYPE,
 98 |   TIMESTAMP_ADD(EVENT_TIME, INTERVAL @jitter SECOND) AS NOTIFY_TIME,
 99 |   EVENT_DATA
100 | FROM
101 |   dsongcp.flights_simevents
102 | WHERE
103 |   EVENT_TIME >= @startTime
104 |   AND EVENT_TIME < @endTime
105 | ORDER BY
106 |   EVENT_TIME ASC
107 | """
108 |    job_config = bq.QueryJobConfig(
109 |        query_parameters=[
110 |            bq.ScalarQueryParameter("jitter", "INT64", jitter),
111 |            bq.ScalarQueryParameter("startTime", "TIMESTAMP", args.startTime),
112 |            bq.ScalarQueryParameter("endTime", "TIMESTAMP", args.endTime),
113 |        ]
114 |    )
115 |    rows = bqclient.query(querystr, job_config=job_config)
116 | 
117 |    # create one Pub/Sub notification topic for each type of event
118 |    publisher = pubsub.PublisherClient()
119 |    topics = {}
120 |    for event_type in ['wheelsoff', 'arrived', 'departed']:
121 |        topics[event_type] = publisher.topic_path(args.project, event_type)
122 |        try:
123 |            publisher.get_topic(topic=topics[event_type])
124 |            logging.info("Already exists: {}".format(topics[event_type]))
125 |        except:
126 |            logging.info("Creating {}".format(topics[event_type]))
127 |            publisher.create_topic(name=topics[event_type])
128 | 
129 | 
130 |    # notify about each row in the dataset
131 |    programStartTime = datetime.datetime.utcnow()
132 |    simStartTime = datetime.datetime.strptime(args.startTime, TIME_FORMAT).replace(tzinfo=pytz.UTC)
133 |    logging.info('Simulation start time is {}'.format(simStartTime))
134 |    notify(publisher, topics, rows, simStartTime, programStartTime, args.speedFactor)
135 | 


--------------------------------------------------------------------------------
/04_streaming/simulate/simulate_may2015.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | python3 simulate.py --project $(gcloud config get-value project) --startTime '2015-05-01 00:00:00 UTC' --endTime '2015-06-01 00:00:00 UTC' --speedFactor 30
3 | 


--------------------------------------------------------------------------------
/04_streaming/transform/airports.csv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GoogleCloudPlatform/data-science-on-gcp/652564b9feeeaab331ce27fdd672b8226ba1e837/04_streaming/transform/airports.csv.gz


--------------------------------------------------------------------------------
/04_streaming/transform/bqsample.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if test "$#" -ne 1; then
 4 |    echo "Usage: ./bqsample.sh bucket-name"
 5 |    echo "   eg: ./bqsample.sh cloud-training-demos-ml"
 6 |    exit
 7 | fi
 8 | 
 9 | BUCKET=$1
10 | PROJECT=$(gcloud config get-value project)
11 | 
12 | bq --project_id=$PROJECT query --destination_table dsongcp.flights_sample --replace --nouse_legacy_sql \
13 |    'SELECT * FROM dsongcp.flights WHERE RAND() < 0.001'
14 | 
15 | bq --project_id=$PROJECT extract --destination_format=NEWLINE_DELIMITED_JSON \
16 |    dsongcp.flights_sample  gs://${BUCKET}/flights/ch4/flights_sample.json
17 | 
18 | gsutil cp gs://${BUCKET}/flights/ch4/flights_sample.json .
19 | 


--------------------------------------------------------------------------------
/04_streaming/transform/df01.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | 
 3 | # Copyright 2019 Google Inc.
 4 | #
 5 | # Licensed under the Apache License, Version 2.0 (the "License");
 6 | # you may not use this file except in compliance with the License.
 7 | # You may obtain a copy of the License at
 8 | #
 9 | #     http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | 
17 | import apache_beam as beam
18 | import csv
19 | 
20 | if __name__ == '__main__':
21 |     with beam.Pipeline('DirectRunner') as pipeline:
22 |         airports = (pipeline
23 |                     | beam.io.ReadFromText('airports.csv.gz')
24 |                     | beam.Map(lambda line: next(csv.reader([line])))
25 |                     | beam.Map(lambda fields: (fields[0], (fields[21], fields[26])))
26 |                     )
27 | 
28 |         (airports
29 |          | beam.Map(lambda airport_data: '{},{}'.format(airport_data[0], ','.join(airport_data[1])))
30 |          | beam.io.WriteToText('extracted_airports')
31 |          )
32 | 


--------------------------------------------------------------------------------
/04_streaming/transform/df02.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | 
 3 | # Copyright 2016 Google Inc.
 4 | #
 5 | # Licensed under the Apache License, Version 2.0 (the "License");
 6 | # you may not use this file except in compliance with the License.
 7 | # You may obtain a copy of the License at
 8 | #
 9 | #     http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | 
17 | import apache_beam as beam
18 | import csv
19 | 
20 | 
21 | def addtimezone(lat, lon):
22 |     try:
23 |         import timezonefinder
24 |         tf = timezonefinder.TimezoneFinder()
25 |         tz = tf.timezone_at(lng=float(lon), lat=float(lat))
26 |         if tz is None:
27 |             tz = 'UTC'
28 |         return lat, lon, tz
29 |     except ValueError:
30 |         return lat, lon, 'TIMEZONE'  # header
31 | 
32 | 
33 | if __name__ == '__main__':
34 |     with beam.Pipeline('DirectRunner') as pipeline:
35 |         airports = (pipeline
36 |                     | beam.io.ReadFromText('airports.csv.gz')
37 |                     | beam.Filter(lambda line: "United States" in line)
38 |                     | beam.Map(lambda line: next(csv.reader([line])))
39 |                     | beam.Map(lambda fields: (fields[0], addtimezone(fields[21], fields[26])))
40 |                     )
41 | 
42 |         airports | beam.Map(lambda f: '{},{}'.format(f[0], ','.join(f[1]))) | beam.io.textio.WriteToText(
43 |             'airports_with_tz')
44 | 


--------------------------------------------------------------------------------
/04_streaming/transform/df03.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | 
 3 | # Copyright 2016 Google Inc.
 4 | #
 5 | # Licensed under the Apache License, Version 2.0 (the "License");
 6 | # you may not use this file except in compliance with the License.
 7 | # You may obtain a copy of the License at
 8 | #
 9 | #     http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | 
17 | import apache_beam as beam
18 | import logging
19 | import csv
20 | import json
21 | 
22 | 
23 | def addtimezone(lat, lon):
24 |     try:
25 |         import timezonefinder
26 |         tf = timezonefinder.TimezoneFinder()
27 |         return lat, lon, tf.timezone_at(lng=float(lon), lat=float(lat))
28 |         # return (lat, lon, 'America/Los_Angeles') # FIXME
29 |     except ValueError:
30 |         return lat, lon, 'TIMEZONE'  # header
31 | 
32 | 
33 | def as_utc(date, hhmm, tzone):
34 |     try:
35 |         if len(hhmm) > 0 and tzone is not None:
36 |             import datetime, pytz
37 |             loc_tz = pytz.timezone(tzone)
38 |             loc_dt = loc_tz.localize(datetime.datetime.strptime(date, '%Y-%m-%d'), is_dst=False)
39 |             # can't just parse hhmm because the data contains 2400 and the like ...
40 |             loc_dt += datetime.timedelta(hours=int(hhmm[:2]), minutes=int(hhmm[2:]))
41 |             utc_dt = loc_dt.astimezone(pytz.utc)
42 |             return utc_dt.strftime('%Y-%m-%d %H:%M:%S')
43 |         else:
44 |             return ''  # empty string corresponds to canceled flights
45 |     except ValueError as e:
46 |         logging.exception('{} {} {}'.format(date, hhmm, tzone))
47 |         raise e
48 | 
49 | 
50 | def tz_correct(line, airport_timezones):
51 |     fields = json.loads(line)
52 |     try:
53 |         # convert all times to UTC
54 |         dep_airport_id = fields["ORIGIN_AIRPORT_SEQ_ID"]
55 |         arr_airport_id = fields["DEST_AIRPORT_SEQ_ID"]
56 |         dep_timezone = airport_timezones[dep_airport_id][2]
57 |         arr_timezone = airport_timezones[arr_airport_id][2]
58 | 
59 |         for f in ["CRS_DEP_TIME", "DEP_TIME", "WHEELS_OFF"]:
60 |             fields[f] = as_utc(fields["FL_DATE"], fields[f], dep_timezone)
61 |         for f in ["WHEELS_ON", "CRS_ARR_TIME", "ARR_TIME"]:
62 |             fields[f] = as_utc(fields["FL_DATE"], fields[f], arr_timezone)
63 | 
64 |         yield json.dumps(fields)
65 |     except KeyError as e:
66 |         logging.exception(" Ignoring " + line + " because airport is not known")
67 | 
68 | 
69 | if __name__ == '__main__':
70 |     with beam.Pipeline('DirectRunner') as pipeline:
71 |         airports = (pipeline
72 |                     | 'airports:read' >> beam.io.ReadFromText('airports.csv.gz')
73 |                     | beam.Filter(lambda line: "United States" in line)
74 |                     | 'airports:fields' >> beam.Map(lambda line: next(csv.reader([line])))
75 |                     | 'airports:tz' >> beam.Map(lambda fields: (fields[0], addtimezone(fields[21], fields[26])))
76 |                     )
77 | 
78 |         flights = (pipeline
79 |                    | 'flights:read' >> beam.io.ReadFromText('flights_sample.json')
80 |                    | 'flights:tzcorr' >> beam.FlatMap(tz_correct, beam.pvalue.AsDict(airports))
81 |                    )
82 | 
83 |         flights | beam.io.textio.WriteToText('all_flights')
84 | 


--------------------------------------------------------------------------------
/04_streaming/transform/df04.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | # Copyright 2016 Google Inc.
  4 | #
  5 | # Licensed under the Apache License, Version 2.0 (the "License");
  6 | # you may not use this file except in compliance with the License.
  7 | # You may obtain a copy of the License at
  8 | #
  9 | #     http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | 
 17 | import apache_beam as beam
 18 | import logging
 19 | import csv
 20 | import json
 21 | 
 22 | 
 23 | def addtimezone(lat, lon):
 24 |     try:
 25 |         import timezonefinder
 26 |         tf = timezonefinder.TimezoneFinder()
 27 |         lat = float(lat)
 28 |         lon = float(lon)
 29 |         return lat, lon, tf.timezone_at(lng=lon, lat=lat)
 30 |     except ValueError:
 31 |         return lat, lon, 'TIMEZONE'  # header
 32 | 
 33 | 
 34 | def as_utc(date, hhmm, tzone):
 35 |     """
 36 |    Returns date corrected for timezone, and the tzoffset
 37 |    """
 38 |     try:
 39 |         if len(hhmm) > 0 and tzone is not None:
 40 |             import datetime, pytz
 41 |             loc_tz = pytz.timezone(tzone)
 42 |             loc_dt = loc_tz.localize(datetime.datetime.strptime(date, '%Y-%m-%d'), is_dst=False)
 43 |             # can't just parse hhmm because the data contains 2400 and the like ...
 44 |             loc_dt += datetime.timedelta(hours=int(hhmm[:2]), minutes=int(hhmm[2:]))
 45 |             utc_dt = loc_dt.astimezone(pytz.utc)
 46 |             return utc_dt.strftime('%Y-%m-%d %H:%M:%S'), loc_dt.utcoffset().total_seconds()
 47 |         else:
 48 |             return '', 0  # empty string corresponds to canceled flights
 49 |     except ValueError as e:
 50 |         logging.exception('{} {} {}'.format(date, hhmm, tzone))
 51 |         raise e
 52 | 
 53 | 
 54 | def add_24h_if_before(arrtime, deptime):
 55 |     import datetime
 56 |     if len(arrtime) > 0 and len(deptime) > 0 and arrtime < deptime:
 57 |         adt = datetime.datetime.strptime(arrtime, '%Y-%m-%d %H:%M:%S')
 58 |         adt += datetime.timedelta(hours=24)
 59 |         return adt.strftime('%Y-%m-%d %H:%M:%S')
 60 |     else:
 61 |         return arrtime
 62 | 
 63 | 
 64 | def tz_correct(line, airport_timezones):
 65 |     fields = json.loads(line)
 66 |     try:
 67 |         # convert all times to UTC
 68 |         dep_airport_id = fields["ORIGIN_AIRPORT_SEQ_ID"]
 69 |         arr_airport_id = fields["DEST_AIRPORT_SEQ_ID"]
 70 |         dep_timezone = airport_timezones[dep_airport_id][2]
 71 |         arr_timezone = airport_timezones[arr_airport_id][2]
 72 | 
 73 |         for f in ["CRS_DEP_TIME", "DEP_TIME", "WHEELS_OFF"]:
 74 |             fields[f], deptz = as_utc(fields["FL_DATE"], fields[f], dep_timezone)
 75 |         for f in ["WHEELS_ON", "CRS_ARR_TIME", "ARR_TIME"]:
 76 |             fields[f], arrtz = as_utc(fields["FL_DATE"], fields[f], arr_timezone)
 77 | 
 78 |         for f in ["WHEELS_OFF", "WHEELS_ON", "CRS_ARR_TIME", "ARR_TIME"]:
 79 |             fields[f] = add_24h_if_before(fields[f], fields["DEP_TIME"])
 80 | 
 81 |         fields["DEP_AIRPORT_LAT"] = airport_timezones[dep_airport_id][0]
 82 |         fields["DEP_AIRPORT_LON"] = airport_timezones[dep_airport_id][1]
 83 |         fields["DEP_AIRPORT_TZOFFSET"] = deptz
 84 |         fields["ARR_AIRPORT_LAT"] = airport_timezones[arr_airport_id][0]
 85 |         fields["ARR_AIRPORT_LON"] = airport_timezones[arr_airport_id][1]
 86 |         fields["ARR_AIRPORT_TZOFFSET"] = arrtz
 87 |         yield json.dumps(fields)
 88 |     except KeyError as e:
 89 |         logging.exception(" Ignoring " + line + " because airport is not known")
 90 | 
 91 | 
 92 | if __name__ == '__main__':
 93 |     with beam.Pipeline('DirectRunner') as pipeline:
 94 |         airports = (pipeline
 95 |                     | 'airports:read' >> beam.io.ReadFromText('airports.csv.gz')
 96 |                     | beam.Filter(lambda line: "United States" in line)
 97 |                     | 'airports:fields' >> beam.Map(lambda line: next(csv.reader([line])))
 98 |                     | 'airports:tz' >> beam.Map(lambda fields: (fields[0], addtimezone(fields[21], fields[26])))
 99 |                     )
100 | 
101 |         flights = (pipeline
102 |                    | 'flights:read' >> beam.io.ReadFromText('flights_sample.json')
103 |                    | 'flights:tzcorr' >> beam.FlatMap(tz_correct, beam.pvalue.AsDict(airports))
104 |                    )
105 | 
106 |         flights | beam.io.textio.WriteToText('all_flights')
107 | 


--------------------------------------------------------------------------------
/04_streaming/transform/df05.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | # Copyright 2016 Google Inc.
  4 | #
  5 | # Licensed under the Apache License, Version 2.0 (the "License");
  6 | # you may not use this file except in compliance with the License.
  7 | # You may obtain a copy of the License at
  8 | #
  9 | #     http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | 
 17 | import apache_beam as beam
 18 | import logging
 19 | import csv
 20 | import json
 21 | 
 22 | DATETIME_FORMAT = '%Y-%m-%d %H:%M:%S'
 23 | 
 24 | 
 25 | def addtimezone(lat, lon):
 26 |     try:
 27 |         import timezonefinder
 28 |         tf = timezonefinder.TimezoneFinder()
 29 |         lat = float(lat)
 30 |         lon = float(lon)
 31 |         return lat, lon, tf.timezone_at(lng=lon, lat=lat)
 32 |     except ValueError:
 33 |         return lat, lon, 'TIMEZONE'  # header
 34 | 
 35 | 
 36 | def as_utc(date, hhmm, tzone):
 37 |     """
 38 |    Returns date corrected for timezone, and the tzoffset
 39 |    """
 40 |     try:
 41 |         if len(hhmm) > 0 and tzone is not None:
 42 |             import datetime, pytz
 43 |             loc_tz = pytz.timezone(tzone)
 44 |             loc_dt = loc_tz.localize(datetime.datetime.strptime(date, '%Y-%m-%d'), is_dst=False)
 45 |             # can't just parse hhmm because the data contains 2400 and the like ...
 46 |             loc_dt += datetime.timedelta(hours=int(hhmm[:2]), minutes=int(hhmm[2:]))
 47 |             utc_dt = loc_dt.astimezone(pytz.utc)
 48 |             return utc_dt.strftime(DATETIME_FORMAT), loc_dt.utcoffset().total_seconds()
 49 |         else:
 50 |             return '', 0  # empty string corresponds to canceled flights
 51 |     except ValueError as e:
 52 |         logging.exception('{} {} {}'.format(date, hhmm, tzone))
 53 |         raise e
 54 | 
 55 | 
 56 | def add_24h_if_before(arrtime, deptime):
 57 |     import datetime
 58 |     if len(arrtime) > 0 and len(deptime) > 0 and arrtime < deptime:
 59 |         adt = datetime.datetime.strptime(arrtime, DATETIME_FORMAT)
 60 |         adt += datetime.timedelta(hours=24)
 61 |         return adt.strftime(DATETIME_FORMAT)
 62 |     else:
 63 |         return arrtime
 64 | 
 65 | 
 66 | def tz_correct(fields, airport_timezones):
 67 |     try:
 68 |         # convert all times to UTC
 69 |         dep_airport_id = fields["ORIGIN_AIRPORT_SEQ_ID"]
 70 |         arr_airport_id = fields["DEST_AIRPORT_SEQ_ID"]
 71 |         dep_timezone = airport_timezones[dep_airport_id][2]
 72 |         arr_timezone = airport_timezones[arr_airport_id][2]
 73 | 
 74 |         for f in ["CRS_DEP_TIME", "DEP_TIME", "WHEELS_OFF"]:
 75 |             fields[f], deptz = as_utc(fields["FL_DATE"], fields[f], dep_timezone)
 76 |         for f in ["WHEELS_ON", "CRS_ARR_TIME", "ARR_TIME"]:
 77 |             fields[f], arrtz = as_utc(fields["FL_DATE"], fields[f], arr_timezone)
 78 | 
 79 |         for f in ["WHEELS_OFF", "WHEELS_ON", "CRS_ARR_TIME", "ARR_TIME"]:
 80 |             fields[f] = add_24h_if_before(fields[f], fields["DEP_TIME"])
 81 | 
 82 |         fields["DEP_AIRPORT_LAT"] = airport_timezones[dep_airport_id][0]
 83 |         fields["DEP_AIRPORT_LON"] = airport_timezones[dep_airport_id][1]
 84 |         fields["DEP_AIRPORT_TZOFFSET"] = deptz
 85 |         fields["ARR_AIRPORT_LAT"] = airport_timezones[arr_airport_id][0]
 86 |         fields["ARR_AIRPORT_LON"] = airport_timezones[arr_airport_id][1]
 87 |         fields["ARR_AIRPORT_TZOFFSET"] = arrtz
 88 |         yield fields
 89 |     except KeyError as e:
 90 |         logging.exception(f"Ignoring {fields} because airport is not known")
 91 | 
 92 | 
 93 | def get_next_event(fields):
 94 |     if len(fields["DEP_TIME"]) > 0:
 95 |         event = dict(fields)  # copy
 96 |         event["EVENT_TYPE"] = "departed"
 97 |         event["EVENT_TIME"] = fields["DEP_TIME"]
 98 |         for f in ["TAXI_OUT", "WHEELS_OFF", "WHEELS_ON", "TAXI_IN", "ARR_TIME", "ARR_DELAY", "DISTANCE"]:
 99 |             event.pop(f, None)  # not knowable at departure time
100 |         yield event
101 |     if len(fields["ARR_TIME"]) > 0:
102 |         event = dict(fields)
103 |         event["EVENT_TYPE"] = "arrived"
104 |         event["EVENT_TIME"] = fields["ARR_TIME"]
105 |         yield event
106 | 
107 | 
108 | def run():
109 |     with beam.Pipeline('DirectRunner') as pipeline:
110 |         airports = (pipeline
111 |                     | 'airports:read' >> beam.io.ReadFromText('airports.csv.gz')
112 |                     | beam.Filter(lambda line: "United States" in line)
113 |                     | 'airports:fields' >> beam.Map(lambda line: next(csv.reader([line])))
114 |                     | 'airports:tz' >> beam.Map(lambda fields: (fields[0], addtimezone(fields[21], fields[26])))
115 |                     )
116 | 
117 |         flights = (pipeline
118 |                    | 'flights:read' >> beam.io.ReadFromText('flights_sample.json')
119 |                    | 'flights:parse' >> beam.Map(lambda line: json.loads(line))
120 |                    | 'flights:tzcorr' >> beam.FlatMap(tz_correct, beam.pvalue.AsDict(airports))
121 |                    )
122 | 
123 |         (flights
124 |          | 'flights:tostring' >> beam.Map(lambda fields: json.dumps(fields))
125 |          | 'flights:out' >> beam.io.textio.WriteToText('all_flights')
126 |          )
127 | 
128 |         events = flights | beam.FlatMap(get_next_event)
129 | 
130 |         (events
131 |          | 'events:tostring' >> beam.Map(lambda fields: json.dumps(fields))
132 |          | 'events:out' >> beam.io.textio.WriteToText('all_events')
133 |          )
134 | 
135 | 
136 | if __name__ == '__main__':
137 |     run()
138 | 


--------------------------------------------------------------------------------
/04_streaming/transform/df06.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | # Copyright 2016 Google Inc.
  4 | #
  5 | # Licensed under the Apache License, Version 2.0 (the "License");
  6 | # you may not use this file except in compliance with the License.
  7 | # You may obtain a copy of the License at
  8 | #
  9 | #     http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | 
 17 | import apache_beam as beam
 18 | import logging
 19 | import csv
 20 | import json
 21 | 
 22 | 
 23 | DATETIME_FORMAT = '%Y-%m-%dT%H:%M:%S'
 24 | 
 25 | 
 26 | def addtimezone(lat, lon):
 27 |     try:
 28 |         import timezonefinder
 29 |         tf = timezonefinder.TimezoneFinder()
 30 |         lat = float(lat)
 31 |         lon = float(lon)
 32 |         return lat, lon, tf.timezone_at(lng=lon, lat=lat)
 33 |     except ValueError:
 34 |         return lat, lon, 'TIMEZONE'  # header
 35 | 
 36 | 
 37 | def as_utc(date, hhmm, tzone):
 38 |     """
 39 |     Returns date corrected for timezone, and the tzoffset
 40 |     """
 41 |     try:
 42 |         if len(hhmm) > 0 and tzone is not None:
 43 |             import datetime, pytz
 44 |             loc_tz = pytz.timezone(tzone)
 45 |             loc_dt = loc_tz.localize(datetime.datetime.strptime(date, '%Y-%m-%d'), is_dst=False)
 46 |             # can't just parse hhmm because the data contains 2400 and the like ...
 47 |             loc_dt += datetime.timedelta(hours=int(hhmm[:2]), minutes=int(hhmm[2:]))
 48 |             utc_dt = loc_dt.astimezone(pytz.utc)
 49 |             return utc_dt.strftime(DATETIME_FORMAT), loc_dt.utcoffset().total_seconds()
 50 |         else:
 51 |             return '', 0  # empty string corresponds to canceled flights
 52 |     except ValueError as e:
 53 |         logging.exception('{} {} {}'.format(date, hhmm, tzone))
 54 |         raise e
 55 | 
 56 | 
 57 | def add_24h_if_before(arrtime, deptime):
 58 |     import datetime
 59 |     if len(arrtime) > 0 and len(deptime) > 0 and arrtime < deptime:
 60 |         adt = datetime.datetime.strptime(arrtime, DATETIME_FORMAT)
 61 |         adt += datetime.timedelta(hours=24)
 62 |         return adt.strftime(DATETIME_FORMAT)
 63 |     else:
 64 |         return arrtime
 65 | 
 66 | 
 67 | def tz_correct(fields, airport_timezones):
 68 |     fields['FL_DATE'] = fields['FL_DATE'].strftime('%Y-%m-%d')  # convert to a string so JSON code works
 69 |     try:
 70 |         # convert all times to UTC
 71 |         dep_airport_id = fields["ORIGIN_AIRPORT_SEQ_ID"]
 72 |         arr_airport_id = fields["DEST_AIRPORT_SEQ_ID"]
 73 | 
 74 |         dep_timezone = airport_timezones[dep_airport_id][2]
 75 |         arr_timezone = airport_timezones[arr_airport_id][2]
 76 | 
 77 |         for f in ["CRS_DEP_TIME", "DEP_TIME", "WHEELS_OFF"]:
 78 |             fields[f], deptz = as_utc(fields["FL_DATE"], fields[f], dep_timezone)
 79 |         for f in ["WHEELS_ON", "CRS_ARR_TIME", "ARR_TIME"]:
 80 |             fields[f], arrtz = as_utc(fields["FL_DATE"], fields[f], arr_timezone)
 81 | 
 82 |         for f in ["WHEELS_OFF", "WHEELS_ON", "CRS_ARR_TIME", "ARR_TIME"]:
 83 |             fields[f] = add_24h_if_before(fields[f], fields["DEP_TIME"])
 84 | 
 85 |         fields["DEP_AIRPORT_LAT"] = airport_timezones[dep_airport_id][0]
 86 |         fields["DEP_AIRPORT_LON"] = airport_timezones[dep_airport_id][1]
 87 |         fields["DEP_AIRPORT_TZOFFSET"] = deptz
 88 |         fields["ARR_AIRPORT_LAT"] = airport_timezones[arr_airport_id][0]
 89 |         fields["ARR_AIRPORT_LON"] = airport_timezones[arr_airport_id][1]
 90 |         fields["ARR_AIRPORT_TZOFFSET"] = arrtz
 91 |         yield fields
 92 |     except KeyError:
 93 |         #logging.exception(f"Ignoring {fields} because airport is not known")
 94 |         pass
 95 | 
 96 |     except KeyError:
 97 |         logging.exception("Ignoring field because airport is not known")
 98 | 
 99 | 
100 | def get_next_event(fields):
101 |     if len(fields["DEP_TIME"]) > 0:
102 |         event = dict(fields)  # copy
103 |         event["EVENT_TYPE"] = "departed"
104 |         event["EVENT_TIME"] = fields["DEP_TIME"]
105 |         for f in ["TAXI_OUT", "WHEELS_OFF", "WHEELS_ON", "TAXI_IN", "ARR_TIME", "ARR_DELAY", "DISTANCE"]:
106 |             event.pop(f, None)  # not knowable at departure time
107 |         yield event
108 |     if len(fields["WHEELS_OFF"]) > 0:
109 |         event = dict(fields)  # copy
110 |         event["EVENT_TYPE"] = "wheelsoff"
111 |         event["EVENT_TIME"] = fields["WHEELS_OFF"]
112 |         for f in ["WHEELS_ON", "TAXI_IN", "ARR_TIME", "ARR_DELAY", "DISTANCE"]:
113 |             event.pop(f, None)  # not knowable at departure time
114 |         yield event
115 |     if len(fields["ARR_TIME"]) > 0:
116 |         event = dict(fields)
117 |         event["EVENT_TYPE"] = "arrived"
118 |         event["EVENT_TIME"] = fields["ARR_TIME"]
119 |         yield event
120 | 
121 | 
122 | def create_event_row(fields):
123 |     featdict = dict(fields)  # copy
124 |     featdict['EVENT_DATA'] = json.dumps(fields)
125 |     return featdict
126 | 
127 | 
128 | def run(project, bucket):
129 |     argv = [
130 |         '--project={0}'.format(project),
131 |         '--staging_location=gs://{0}/flights/staging/'.format(bucket),
132 |         '--temp_location=gs://{0}/flights/temp/'.format(bucket),
133 |         '--runner=DirectRunner'
134 |     ]
135 |     airports_filename = 'gs://{}/flights/airports/airports.csv.gz'.format(bucket)
136 |     flights_output = 'gs://{}/flights/tzcorr/all_flights'.format(bucket)
137 | 
138 |     with beam.Pipeline(argv=argv) as pipeline:
139 |         airports = (pipeline
140 |                     | 'airports:read' >> beam.io.ReadFromText(airports_filename)
141 |                     | beam.Filter(lambda line: "United States" in line)
142 |                     | 'airports:fields' >> beam.Map(lambda line: next(csv.reader([line])))
143 |                     | 'airports:tz' >> beam.Map(lambda fields: (fields[0], addtimezone(fields[21], fields[26])))
144 |                     )
145 | 
146 |         flights = (pipeline
147 |                    | 'flights:read' >> beam.io.ReadFromBigQuery(
148 |                     query='SELECT * FROM dsongcp.flights WHERE rand() < 0.001', use_standard_sql=True)
149 |                    | 'flights:tzcorr' >> beam.FlatMap(tz_correct, beam.pvalue.AsDict(airports))
150 |                    )
151 | 
152 |         (flights
153 |          | 'flights:tostring' >> beam.Map(lambda fields: json.dumps(fields))
154 |          | 'flights:gcsout' >> beam.io.textio.WriteToText(flights_output)
155 |          )
156 |         
157 |         flights_schema = ','.join([
158 |             'FL_DATE:date',
159 |             'UNIQUE_CARRIER:string',
160 |             'ORIGIN_AIRPORT_SEQ_ID:string',
161 |             'ORIGIN:string',
162 |             'DEST_AIRPORT_SEQ_ID:string',
163 |             'DEST:string',
164 |             'CRS_DEP_TIME:timestamp',
165 |             'DEP_TIME:timestamp',
166 |             'DEP_DELAY:float',
167 |             'TAXI_OUT:float',
168 |             'WHEELS_OFF:timestamp',
169 |             'WHEELS_ON:timestamp',
170 |             'TAXI_IN:float',
171 |             'CRS_ARR_TIME:timestamp',
172 |             'ARR_TIME:timestamp',
173 |             'ARR_DELAY:float',
174 |             'CANCELLED:boolean',
175 |             'DIVERTED:boolean',
176 |             'DISTANCE:float',
177 |             'DEP_AIRPORT_LAT:float',
178 |             'DEP_AIRPORT_LON:float',
179 |             'DEP_AIRPORT_TZOFFSET:float',
180 |             'ARR_AIRPORT_LAT:float',
181 |             'ARR_AIRPORT_LON:float',
182 |             'ARR_AIRPORT_TZOFFSET:float',
183 |             'Year:string'])
184 |         
185 |         # autodetect on JSON works, but is less reliable
186 |         #flights_schema = 'SCHEMA_AUTODETECT'
187 |         
188 |         (flights 
189 |          | 'flights:bqout' >> beam.io.WriteToBigQuery(
190 |                 'dsongcp.flights_tzcorr', 
191 |                 schema=flights_schema,
192 |                 write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
193 |                 create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
194 |                 )
195 |         )
196 |         
197 |         events = flights | beam.FlatMap(get_next_event)
198 |         events_schema = ','.join([flights_schema, 'EVENT_TYPE:string,EVENT_TIME:timestamp,EVENT_DATA:string'])
199 | 
200 |         (events
201 |          | 'events:totablerow' >> beam.Map(lambda fields: create_event_row(fields))
202 |          | 'events:bqout' >> beam.io.WriteToBigQuery(
203 |                 'dsongcp.flights_simevents', schema=events_schema,
204 |                 write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
205 |                 create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
206 |                 )
207 |         )
208 |         
209 | if __name__ == '__main__':
210 |     import argparse
211 | 
212 |     parser = argparse.ArgumentParser(description='Run pipeline on the cloud')
213 |     parser.add_argument('-p', '--project', help='Unique project ID', required=True)
214 |     parser.add_argument('-b', '--bucket', help='Bucket where gs://BUCKET/flights/airports/airports.csv.gz exists',
215 |                         required=True)
216 | 
217 |     args = vars(parser.parse_args())
218 | 
219 |     print("Correcting timestamps and writing to BigQuery dataset")
220 | 
221 |     run(project=args['project'], bucket=args['bucket'])
222 | 


--------------------------------------------------------------------------------
/04_streaming/transform/df07.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | # Copyright 2016 Google Inc.
  4 | #
  5 | # Licensed under the Apache License, Version 2.0 (the "License");
  6 | # you may not use this file except in compliance with the License.
  7 | # You may obtain a copy of the License at
  8 | #
  9 | #     http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | 
 17 | import apache_beam as beam
 18 | import logging
 19 | import csv
 20 | import json
 21 | 
 22 | DATETIME_FORMAT = '%Y-%m-%dT%H:%M:%S'
 23 | 
 24 | 
 25 | def addtimezone(lat, lon):
 26 |     try:
 27 |         import timezonefinder
 28 |         tf = timezonefinder.TimezoneFinder()
 29 |         lat = float(lat)
 30 |         lon = float(lon)
 31 |         return lat, lon, tf.timezone_at(lng=lon, lat=lat)
 32 |     except ValueError:
 33 |         return lat, lon, 'TIMEZONE'  # header
 34 | 
 35 | 
 36 | def as_utc(date, hhmm, tzone):
 37 |     """
 38 |     Returns date corrected for timezone, and the tzoffset
 39 |     """
 40 |     try:
 41 |         if len(hhmm) > 0 and tzone is not None:
 42 |             import datetime, pytz
 43 |             loc_tz = pytz.timezone(tzone)
 44 |             loc_dt = loc_tz.localize(datetime.datetime.strptime(date, '%Y-%m-%d'), is_dst=False)
 45 |             # can't just parse hhmm because the data contains 2400 and the like ...
 46 |             loc_dt += datetime.timedelta(hours=int(hhmm[:2]), minutes=int(hhmm[2:]))
 47 |             utc_dt = loc_dt.astimezone(pytz.utc)
 48 |             return utc_dt.strftime(DATETIME_FORMAT), loc_dt.utcoffset().total_seconds()
 49 |         else:
 50 |             return '', 0  # empty string corresponds to canceled flights
 51 |     except ValueError as e:
 52 |         logging.exception('{} {} {}'.format(date, hhmm, tzone))
 53 |         raise e
 54 | 
 55 | 
 56 | def add_24h_if_before(arrtime, deptime):
 57 |     import datetime
 58 |     if len(arrtime) > 0 and len(deptime) > 0 and arrtime < deptime:
 59 |         adt = datetime.datetime.strptime(arrtime, DATETIME_FORMAT)
 60 |         adt += datetime.timedelta(hours=24)
 61 |         return adt.strftime(DATETIME_FORMAT)
 62 |     else:
 63 |         return arrtime
 64 | 
 65 | 
 66 | def airport_timezone(airport_id, airport_timezones):
 67 |     if airport_id in airport_timezones:
 68 |         return airport_timezones[airport_id]
 69 |     else:
 70 |         return '37.41', '-92.35', u'America/Chicago'
 71 | 
 72 | 
 73 | def tz_correct(fields, airport_timezones):
 74 |     fields['FL_DATE'] = fields['FL_DATE'].strftime('%Y-%m-%d')  # convert to a string so JSON code works
 75 | 
 76 |     # convert all times to UTC
 77 |     dep_airport_id = fields["ORIGIN_AIRPORT_SEQ_ID"]
 78 |     arr_airport_id = fields["DEST_AIRPORT_SEQ_ID"]
 79 |     fields["DEP_AIRPORT_LAT"], fields["DEP_AIRPORT_LON"], dep_timezone = airport_timezone(dep_airport_id,
 80 |                                                                                           airport_timezones)
 81 |     fields["ARR_AIRPORT_LAT"], fields["ARR_AIRPORT_LON"], arr_timezone = airport_timezone(arr_airport_id,
 82 |                                                                                           airport_timezones)
 83 | 
 84 |     for f in ["CRS_DEP_TIME", "DEP_TIME", "WHEELS_OFF"]:
 85 |         fields[f], deptz = as_utc(fields["FL_DATE"], fields[f], dep_timezone)
 86 |     for f in ["WHEELS_ON", "CRS_ARR_TIME", "ARR_TIME"]:
 87 |         fields[f], arrtz = as_utc(fields["FL_DATE"], fields[f], arr_timezone)
 88 | 
 89 |     for f in ["WHEELS_OFF", "WHEELS_ON", "CRS_ARR_TIME", "ARR_TIME"]:
 90 |         fields[f] = add_24h_if_before(fields[f], fields["DEP_TIME"])
 91 | 
 92 |     fields["DEP_AIRPORT_TZOFFSET"] = deptz
 93 |     fields["ARR_AIRPORT_TZOFFSET"] = arrtz
 94 |     yield fields
 95 | 
 96 | 
 97 | def get_next_event(fields):
 98 |     if len(fields["DEP_TIME"]) > 0:
 99 |         event = dict(fields)  # copy
100 |         event["EVENT_TYPE"] = "departed"
101 |         event["EVENT_TIME"] = fields["DEP_TIME"]
102 |         for f in ["TAXI_OUT", "WHEELS_OFF", "WHEELS_ON", "TAXI_IN", "ARR_TIME", "ARR_DELAY", "DISTANCE"]:
103 |             event.pop(f, None)  # not knowable at departure time
104 |         yield event
105 |     if len(fields["WHEELS_OFF"]) > 0:
106 |         event = dict(fields)  # copy
107 |         event["EVENT_TYPE"] = "wheelsoff"
108 |         event["EVENT_TIME"] = fields["WHEELS_OFF"]
109 |         for f in ["WHEELS_ON", "TAXI_IN", "ARR_TIME", "ARR_DELAY", "DISTANCE"]:
110 |             event.pop(f, None)  # not knowable at departure time
111 |         yield event
112 |     if len(fields["ARR_TIME"]) > 0:
113 |         event = dict(fields)
114 |         event["EVENT_TYPE"] = "arrived"
115 |         event["EVENT_TIME"] = fields["ARR_TIME"]
116 |         yield event
117 | 
118 | 
119 | def create_event_row(fields):
120 |     featdict = dict(fields)  # copy
121 |     featdict['EVENT_DATA'] = json.dumps(fields)
122 |     return featdict
123 | 
124 | 
125 | def run(project, bucket, region):
126 |     argv = [
127 |         '--project={0}'.format(project),
128 |         '--job_name=ch04timecorr',
129 |         '--save_main_session',
130 |         '--staging_location=gs://{0}/flights/staging/'.format(bucket),
131 |         '--temp_location=gs://{0}/flights/temp/'.format(bucket),
132 |         '--setup_file=./setup.py',
133 |         '--autoscaling_algorithm=THROUGHPUT_BASED',
134 |         '--max_num_workers=8',
135 |         '--region={}'.format(region),
136 |         '--runner=DataflowRunner'
137 |     ]
138 |     airports_filename = 'gs://{}/flights/airports/airports.csv.gz'.format(bucket)
139 |     flights_output = 'gs://{}/flights/tzcorr/all_flights'.format(bucket)
140 | 
141 |     with beam.Pipeline(argv=argv) as pipeline:
142 |         airports = (pipeline
143 |                     | 'airports:read' >> beam.io.ReadFromText(airports_filename)
144 |                     | 'airports:onlyUSA' >> beam.Filter(lambda line: "United States" in line)
145 |                     | 'airports:fields' >> beam.Map(lambda line: next(csv.reader([line])))
146 |                     | 'airports:tz' >> beam.Map(lambda fields: (fields[0], addtimezone(fields[21], fields[26])))
147 |                     )
148 | 
149 |         flights = (pipeline
150 |                    | 'flights:read' >> beam.io.ReadFromBigQuery(
151 |                     query='SELECT * FROM dsongcp.flights', use_standard_sql=True)
152 |                    | 'flights:tzcorr' >> beam.FlatMap(tz_correct, beam.pvalue.AsDict(airports))
153 |                    )
154 | 
155 |         (flights
156 |          | 'flights:tostring' >> beam.Map(lambda fields: json.dumps(fields))
157 |          | 'flights:gcsout' >> beam.io.textio.WriteToText(flights_output)
158 |          )
159 | 
160 |         flights_schema = ','.join([
161 |             'FL_DATE:date,UNIQUE_CARRIER:string,ORIGIN_AIRPORT_SEQ_ID:string,ORIGIN:string',
162 |             'DEST_AIRPORT_SEQ_ID:string,DEST:string,CRS_DEP_TIME:timestamp,DEP_TIME:timestamp',
163 |             'DEP_DELAY:float,TAXI_OUT:float,WHEELS_OFF:timestamp,WHEELS_ON:timestamp,TAXI_IN:float',
164 |             'CRS_ARR_TIME:timestamp,ARR_TIME:timestamp,ARR_DELAY:float,CANCELLED:boolean',
165 |             'DIVERTED:boolean,DISTANCE:float',
166 |             'DEP_AIRPORT_LAT:float,DEP_AIRPORT_LON:float,DEP_AIRPORT_TZOFFSET:float',
167 |             'ARR_AIRPORT_LAT:float,ARR_AIRPORT_LON:float,ARR_AIRPORT_TZOFFSET:float'])
168 |         flights | 'flights:bqout' >> beam.io.WriteToBigQuery(
169 |             'dsongcp.flights_tzcorr', schema=flights_schema,
170 |             write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
171 |             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
172 |         )
173 | 
174 |         events = flights | beam.FlatMap(get_next_event)
175 |         events_schema = ','.join([flights_schema, 'EVENT_TYPE:string,EVENT_TIME:timestamp,EVENT_DATA:string'])
176 | 
177 |         (events
178 |          | 'events:totablerow' >> beam.Map(lambda fields: create_event_row(fields))
179 |          | 'events:bqout' >> beam.io.WriteToBigQuery(
180 |                     'dsongcp.flights_simevents', schema=events_schema,
181 |                     write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
182 |                     create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
183 |                 )
184 |          )
185 | 
186 | 
187 | if __name__ == '__main__':
188 |     import argparse
189 | 
190 |     parser = argparse.ArgumentParser(description='Run pipeline on the cloud')
191 |     parser.add_argument('-p', '--project', help='Unique project ID', required=True)
192 |     parser.add_argument('-b', '--bucket', help='Bucket where gs://BUCKET/flights/airports/airports.csv.gz exists',
193 |                         required=True)
194 |     parser.add_argument('-r', '--region',
195 |                         help='Region in which to run the Dataflow job. Choose the same region as your bucket.',
196 |                         required=True)
197 | 
198 |     args = vars(parser.parse_args())
199 | 
200 |     print("Correcting timestamps and writing to BigQuery dataset")
201 | 
202 |     run(project=args['project'], bucket=args['bucket'], region=args['region'])
203 | 


--------------------------------------------------------------------------------
/04_streaming/transform/install_packages.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | python3 -m pip install --upgrade pip
 4 | python3 -m pip cache purge
 5 | python3 -m pip install --upgrade timezonefinder pytz 'apache-beam[gcp]'
 6 | 
 7 | echo "If this script fails, please try installing it in a virtualenv"
 8 | echo "virtualenv ~/beam_env"
 9 | echo "source ~/beam_env/bin/activate"
10 | echo "./install_packages.sh"
11 | 


--------------------------------------------------------------------------------
/04_streaming/transform/setup.py:
--------------------------------------------------------------------------------
  1 | #
  2 | # Licensed to the Apache Software Foundation (ASF) under one or more
  3 | # contributor license agreements.  See the NOTICE file distributed with
  4 | # this work for additional information regarding copyright ownership.
  5 | # The ASF licenses this file to You under the Apache License, Version 2.0
  6 | # (the "License"); you may not use this file except in compliance with
  7 | # the License.  You may obtain a copy of the License at
  8 | #
  9 | #    http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | #
 17 | 
 18 | """Setup.py module for the workflow's worker utilities.
 19 | 
 20 | All the workflow related code is gathered in a package that will be built as a
 21 | source distribution, staged in the staging area for the workflow being run and
 22 | then installed in the workers when they start running.
 23 | 
 24 | This behavior is triggered by specifying the --setup_file command line option
 25 | when running the workflow for remote execution.
 26 | """
 27 | 
 28 | from distutils.command.build import build as _build
 29 | import subprocess
 30 | 
 31 | import setuptools
 32 | 
 33 | 
 34 | # This class handles the pip install mechanism.
 35 | class build(_build):  # pylint: disable=invalid-name
 36 |     """A build command class that will be invoked during package install.
 37 | 
 38 |   The package built using the current setup.py will be staged and later
 39 |   installed in the worker using `pip install package'. This class will be
 40 |   instantiated during install for this specific scenario and will trigger
 41 |   running the custom commands specified.
 42 |   """
 43 |     sub_commands = _build.sub_commands + [('CustomCommands', None)]
 44 | 
 45 | 
 46 | # Some custom command to run during setup. The command is not essential for this
 47 | # workflow. It is used here as an example. Each command will spawn a child
 48 | # process. Typically, these commands will include steps to install non-Python
 49 | # packages. For instance, to install a C++-based library libjpeg62 the following
 50 | # two commands will have to be added:
 51 | #
 52 | #     ['apt-get', 'update'],
 53 | #     ['apt-get', '--assume-yes', install', 'libjpeg62'],
 54 | #
 55 | # First, note that there is no need to use the sudo command because the setup
 56 | # script runs with appropriate access.
 57 | # Second, if apt-get tool is used then the first command needs to be 'apt-get
 58 | # update' so the tool refreshes itself and initializes links to download
 59 | # repositories.  Without this initial step the other apt-get install commands
 60 | # will fail with package not found errors. Note also --assume-yes option which
 61 | # shortcuts the interactive confirmation.
 62 | #
 63 | # The output of custom commands (including failures) will be logged in the
 64 | # worker-startup log.
 65 | CUSTOM_COMMANDS = [
 66 | ]
 67 | 
 68 | 
 69 | class CustomCommands(setuptools.Command):
 70 |     """A setuptools Command class able to run arbitrary commands."""
 71 | 
 72 |     def initialize_options(self):
 73 |         pass
 74 | 
 75 |     def finalize_options(self):
 76 |         pass
 77 | 
 78 |     def RunCustomCommand(self, command_list):
 79 |         print('Running command: %s' % command_list)
 80 |         p = subprocess.Popen(
 81 |             command_list,
 82 |             stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
 83 |         # Can use communicate(input='y\n'.encode()) if the command run requires
 84 |         # some confirmation.
 85 |         stdout_data, _ = p.communicate()
 86 |         print('Command output: %s' % stdout_data)
 87 |         if p.returncode != 0:
 88 |             raise RuntimeError(
 89 |                 'Command %s failed: exit code: %s' % (command_list, p.returncode))
 90 | 
 91 |     def run(self):
 92 |         for command in CUSTOM_COMMANDS:
 93 |             self.RunCustomCommand(command)
 94 | 
 95 | 
 96 | # Configure the required packages and scripts to install.
 97 | # Note that the Python Dataflow containers come with numpy already installed
 98 | # so this dependency will not trigger anything to be installed unless a version
 99 | # restriction is specified.
100 | REQUIRED_PACKAGES = [
101 |     'timezonefinder',
102 |     'pytz'
103 | ]
104 | 
105 | setuptools.setup(
106 |     name='flightsdf',
107 |     version='0.0.1',
108 |     description='Data Science on GCP flights analysis pipelines',
109 |     install_requires=REQUIRED_PACKAGES,
110 |     packages=setuptools.find_packages(),
111 |     py_modules=['df07'],
112 |     cmdclass={
113 |         # Command class instantiated and run during pip install scenarios.
114 |         'build': build,
115 |         'CustomCommands': CustomCommands,
116 |     }
117 | )
118 | 


--------------------------------------------------------------------------------
/04_streaming/transform/stage_airports_file.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if test "$#" -ne 1; then
 4 |    echo "Usage: ./bqsample.sh bucket-name"
 5 |    echo "   eg: ./bqsample.sh cloud-training-demos-ml"
 6 |    exit
 7 | fi
 8 | 
 9 | BUCKET=$1
10 | PROJECT=$(gcloud config get-value project)
11 | 
12 | gsutil cp airports.csv.gz gs://${BUCKET}/flights/airports/airports.csv.gz
13 | 
14 | bq --project_id=$PROJECT load \
15 |    --autodetect --replace --source_format=CSV \
16 |    dsongcp.airports gs://${BUCKET}/flights/airports/airports.csv.gz


--------------------------------------------------------------------------------
/05_bqnotebook/README.md:
--------------------------------------------------------------------------------
 1 | # 5. Interactive data exploration
 2 | 
 3 | ### Catch up from previous chapters if necessary
 4 | If you didn't go through Chapters 2-4, the simplest way to catch up is to copy data from my bucket:
 5 | * Go to the Storage section of the GCP web console and create a new bucket
 6 | * Open CloudShell and git clone this repo:
 7 |     ```
 8 |     git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
 9 |     ```
10 | * Then, run:
11 |     ```
12 |     cd data-science-on-gcp/02_ingest
13 |     ./ingest_from_crsbucket bucketname
14 |     ./bqload.sh  (csv-bucket-name) YEAR 
15 |     ```
16 | * Run:
17 |     ```
18 |     cd ../03_sqlstudio
19 |     ./create_views.sh
20 |     ```
21 | * Run:
22 |     ```
23 |     cd ../04_streaming
24 |     ./ingest_from_crsbucket.sh
25 |     ```
26 | 
27 | ## Try out queries
28 | * In BigQuery, query the time corrected files created in Chapter 4:
29 |     ```
30 |     SELECT
31 |        ORIGIN,
32 |        AVG(DEP_DELAY) AS dep_delay,
33 |        AVG(ARR_DELAY) AS arr_delay,
34 |        COUNT(ARR_DELAY) AS num_flights
35 |      FROM
36 |        dsongcp.flights_tzcorr
37 |      GROUP BY
38 |        ORIGIN
39 |     ```
40 | * Try out the other queries in queries.txt in this directory.
41 | 
42 | * Navigate to the Vertex AI Workbench part of the GCP console.
43 | 
44 | * Start a new managed notebook. Then, copy and paste cells from <a href="exploration.ipynb">exploration.ipynb</a> and click Run to execute the code.
45 | 
46 | * Create the trainday table BigQuery table and CSV file as you will need it later
47 |     ```
48 |     ./create_trainday.sh
49 |     ```
50 | 


--------------------------------------------------------------------------------
/05_bqnotebook/create_trainday.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if [ "$#" -ne 1 ]; then
 4 |     echo "Usage: ./create_trainday.sh  destination-bucket-name"
 5 |     exit
 6 | fi
 7 | 
 8 | BUCKET=$1
 9 | 
10 | cat trainday.txt | bq query --nouse_legacy_sql
11 | 
12 | bq extract dsongcp.trainday gs://${BUCKET}/flights/trainday.csv
13 | 


--------------------------------------------------------------------------------
/05_bqnotebook/queries.txt:
--------------------------------------------------------------------------------
 1 | SELECT
 2 |    ORIGIN,
 3 |    AVG(DEP_DELAY) AS dep_delay,
 4 |    AVG(ARR_DELAY) AS arr_delay,
 5 |    COUNT(ARR_DELAY) AS num_flights
 6 |  FROM
 7 |    dsongcp.flights_tzcorr
 8 |  GROUP BY
 9 |    ORIGIN
10 | 
11 | ________________________________________________________________
12 | 
13 | WITH all_airports AS (
14 |   SELECT
15 |     ORIGIN,
16 |     AVG(DEP_DELAY) AS dep_delay,
17 |     AVG(ARR_DELAY) AS arr_delay,
18 |     COUNT(ARR_DELAY) AS num_flights
19 |   FROM
20 |     dsongcp.flights_tzcorr
21 |   GROUP BY
22 |     ORIGIN
23 | )
24 |  
25 | SELECT * FROM all_airports WHERE num_flights > 3650
26 | ORDER BY dep_delay DESC
27 | 
28 | ________________________________________________________________
29 | 
30 | WITH all_airports AS (
31 |   SELECT
32 |     ORIGIN,
33 |     AVG(DEP_DELAY) AS dep_delay,
34 |     AVG(ARR_DELAY) AS arr_delay,
35 |     COUNT(ARR_DELAY) AS num_flights
36 |   FROM
37 |     dsongcp.flights_tzcorr
38 |   WHERE EXTRACT(MONTH FROM FL_DATE) = 1
39 |   GROUP BY
40 |     ORIGIN
41 | )
42 | 
43 | SELECT * FROM all_airports WHERE num_flights > 310
44 | ORDER BY dep_delay DESC
45 | 
46 | ________________________________________________________________


--------------------------------------------------------------------------------
/05_bqnotebook/trainday.txt:
--------------------------------------------------------------------------------
 1 | CREATE OR REPLACE TABLE dsongcp.trainday AS
 2 | 
 3 | SELECT
 4 |   FL_DATE,
 5 |   IF(ABS(MOD(FARM_FINGERPRINT(CAST(FL_DATE AS STRING)), 100)) < 70,
 6 |      'True', 'False') AS is_train_day
 7 | FROM (
 8 |   SELECT
 9 |     DISTINCT(FL_DATE) AS FL_DATE
10 |   FROM
11 |     dsongcp.flights_tzcorr)
12 | ORDER BY
13 |   FL_DATE
14 | 


--------------------------------------------------------------------------------
/06_dataproc/README.md:
--------------------------------------------------------------------------------
 1 | # 6. Bayes Classifier on Cloud Dataproc
 2 | 
 3 | To repeat the steps in this chapter, follow these steps.
 4 | 
 5 | ### Catch up from Chapters 2-5
 6 | If you didn't go through Chapters 2-5, the simplest way to catch up is to copy data from my bucket:
 7 | * Open CloudShell and git clone this repo:
 8 |     ```
 9 |     git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
10 |     ```
11 | * Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
12 | * Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
13 | * Go to the 05_bqnotebook folder of the repo, run the script to load data into BigQuery:
14 | 	```
15 | 	bash create_trainday.sh BUCKET-NAME
16 | 	```
17 | 
18 | ### Create Dataproc cluster
19 | In CloudShell:
20 | * Clone the repository if you haven't already done so:
21 |     ```
22 |     git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
23 |     ```
24 | * Change to the directory for this chapter:
25 |     ```
26 |     cd data-science-on-gcp/06_dataproc
27 |     ```
28 | * Create the Dataproc cluster to run jobs on, specifying the name of your bucket and a 
29 |   zone in the region that the bucket is in. (You created this bucket in Chapter 2)
30 |    ```
31 |     ./create_cluster.sh <BUCKET-NAME>  <COMPUTE-ZONE>
32 |     ```
33 | *Note:* Make sure that the compute zone is in the same region as the bucket, otherwise you will incur network egress charges.
34 | 
35 | ### Interactive development
36 | * Navigate to the Dataproc section of the GCP web console and click on "Web Interfaces".
37 | 
38 | * Click on JupyterLab
39 | 
40 | * In JupyterLab, navigate to /LocalDisks/home/dataproc/data-science-on-gcp
41 | 
42 | * Open 06_dataproc/quantization.ipynb. Click Run | Clear All Outputs. Then run the cells one by one.
43 |  
44 | * [optional] make the changes suggested in the notebook to run on the full dataset.  Note that you might have to
45 |   reduce numbers to fit into your quota.
46 |   
47 | ### Delete the cluster
48 | * Delete the cluster either from the GCP web console or by typing in CloudShell, ```./delete_cluster.sh <YOUR REGION>```
49 | 
50 | ### Serverless workflow
51 | * Visit https://console.cloud.google.com/networking/networks/list
52 | * Select the "default" network in your region and allow private Google access
53 | * Run ./submit_serverless.sh
54 |  
55 | 


--------------------------------------------------------------------------------
/06_dataproc/bayes_on_spark.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | # Copyright 2021 Google Inc.
  4 | #
  5 | # Licensed under the Apache License, Version 2.0 (the "License");
  6 | # you may not use this file except in compliance with the License.
  7 | # You may obtain a copy of the License at
  8 | #
  9 | #     http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | 
 17 | import logging
 18 | import pandas as pd
 19 | import numpy as np
 20 | from pyspark.sql import SparkSession
 21 | import pyspark.sql.functions as F
 22 | 
 23 | 
 24 | def run_bayes(BUCKET):
 25 |     spark = SparkSession \
 26 |         .builder \
 27 |         .appName("Bayes classification using Spark") \
 28 |         .getOrCreate()
 29 | 
 30 |     # read flights data
 31 |     inputs = 'gs://{}/flights/tzcorr/all_flights-*'.format(BUCKET)  # FULL
 32 |     flights = spark.read.json(inputs)
 33 |     flights.createOrReplaceTempView('flights')
 34 | 
 35 |     # which days are training days?
 36 |     traindays = spark.read \
 37 |         .option("header", "true") \
 38 |         .option("inferSchema", "true") \
 39 |         .csv('gs://{}/flights/trainday.csv'.format(BUCKET))
 40 |     traindays.createOrReplaceTempView('traindays')
 41 | 
 42 |     # create training dataset
 43 |     statement = """
 44 |     SELECT
 45 |       f.FL_DATE AS date,
 46 |       CAST(distance AS FLOAT) AS distance,
 47 |       dep_delay,
 48 |       IF(arr_delay < 15, 1, 0) AS ontime
 49 |     FROM flights f
 50 |     JOIN traindays t
 51 |     ON f.FL_DATE == t.FL_DATE
 52 |     WHERE
 53 |       t.is_train_day AND
 54 |       f.dep_delay IS NOT NULL
 55 |     ORDER BY
 56 |       f.dep_delay DESC
 57 |     """
 58 |     flights = spark.sql(statement)
 59 | 
 60 |     # quantiles
 61 |     distthresh = flights.approxQuantile('distance', list(np.arange(0, 1.0, 0.2)), 0.02)
 62 |     distthresh[-1] = float('inf')
 63 |     delaythresh = range(10, 20)
 64 |     logging.info("Computed distance thresholds: {}".format(distthresh))
 65 | 
 66 |     # bayes in each bin
 67 |     df = pd.DataFrame(columns=['dist_thresh', 'delay_thresh', 'frac_ontime'])
 68 |     for m in range(0, len(distthresh) - 1):
 69 |         for n in range(0, len(delaythresh) - 1):
 70 |             bdf = flights[(flights['distance'] >= distthresh[m])
 71 |                           & (flights['distance'] < distthresh[m + 1])
 72 |                           & (flights['dep_delay'] >= delaythresh[n])
 73 |                           & (flights['dep_delay'] < delaythresh[n + 1])]
 74 |             ontime_frac = bdf.agg(F.sum('ontime')).collect()[0][0] / bdf.agg(F.count('ontime')).collect()[0][0]
 75 |             print(m, n, ontime_frac)
 76 |             df = df.append({
 77 |                 'dist_thresh': distthresh[m],
 78 |                 'delay_thresh': delaythresh[n],
 79 |                 'frac_ontime': ontime_frac
 80 |             }, ignore_index=True)
 81 | 
 82 |     # lookup table
 83 |     df['score'] = abs(df['frac_ontime'] - 0.7)
 84 |     bayes = df.sort_values(['score']).groupby('dist_thresh').head(1).sort_values('dist_thresh')
 85 |     bayes.to_csv('gs://{}/flights/bayes.csv'.format(BUCKET), index=False)
 86 |     logging.info("Wrote lookup table: {}".format(bayes.head()))
 87 | 
 88 | 
 89 | if __name__ == '__main__':
 90 |     import argparse
 91 | 
 92 |     parser = argparse.ArgumentParser(description='Create Bayes lookup table')
 93 |     parser.add_argument('--bucket', help='GCS bucket to read/write data', required=True)
 94 |     parser.add_argument('--debug', dest='debug', action='store_true', help='Specify if you want debug messages')
 95 | 
 96 |     args = parser.parse_args()
 97 |     if args.debug:
 98 |         logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.DEBUG)
 99 |     else:
100 |         logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.INFO)
101 | 
102 |     run_bayes(args.bucket)
103 | 


--------------------------------------------------------------------------------
/06_dataproc/create_cluster.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if [ "$#" -ne 2 ]; then
 4 |     echo "Usage: ./create_cluster.sh  bucket-name  region"
 5 |     exit
 6 | fi
 7 | 
 8 | PROJECT=$(gcloud config get-value project)
 9 | BUCKET=$1
10 | REGION=$2
11 | EMAIL=$3
12 | INSTALL=gs://$BUCKET/flights/dataproc/install_on_cluster.sh
13 | 
14 | # upload install file
15 | sed "s/CHANGE_TO_USER_NAME/dataproc/g" install_on_cluster.sh > /tmp/install_on_cluster.sh
16 | gsutil cp /tmp/install_on_cluster.sh $INSTALL
17 | 
18 | # create cluster
19 | gcloud dataproc clusters create ch6cluster \
20 |   --enable-component-gateway \
21 |   --region ${REGION} --zone ${REGION}-a \
22 |   --master-machine-type n1-standard-4 \
23 |   --master-boot-disk-size 500 --num-workers 2 \
24 |   --worker-machine-type n1-standard-4 \
25 |   --worker-boot-disk-size 500 \
26 |   --optional-components JUPYTER --project $PROJECT \
27 |   --initialization-actions=$INSTALL \
28 |   --scopes https://www.googleapis.com/auth/cloud-platform
29 | 
30 | 


--------------------------------------------------------------------------------
/06_dataproc/create_personal_cluster.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if [ "$#" -ne 3 ]; then
 4 |     echo "Usage: ./create_cluster.sh  bucket-name  region  user_email"
 5 |     exit
 6 | fi
 7 | 
 8 | PROJECT=$(gcloud config get-value project)
 9 | BUCKET=$1
10 | REGION=$2
11 | EMAIL=$3
12 | INSTALL=gs://$BUCKET/flights/dataproc/install_on_cluster.sh
13 | 
14 | # upload install file
15 | sed "s/CHANGE_TO_USER_NAME/$USER/g" install_on_cluster.sh > /tmp/install_on_cluster.sh
16 | gsutil cp /tmp/install_on_cluster.sh $INSTALL
17 | 
18 | # create cluster
19 | gcloud dataproc clusters create ch6cluster \
20 |   --enable-component-gateway \
21 |   --region ${REGION} --zone ${REGION}-a \
22 |   --properties dataproc:dataproc.personal-auth.user=$EMAIL \
23 |   --master-machine-type n1-standard-4 \
24 |   --master-boot-disk-size 500 --num-workers 2 \
25 |   --worker-machine-type n1-standard-4 \
26 |   --worker-boot-disk-size 500 \
27 |   --optional-components JUPYTER --project $PROJECT \
28 |   --initialization-actions=$INSTALL \
29 |   --scopes https://www.googleapis.com/auth/cloud-platform
30 | 
31 | 
32 | echo "Once cluster is up, please run the following command to inject your auth into the cluster."
33 | echo "gcloud dataproc clusters enable-personal-auth-session --region=$REGION ch6cluster"
34 | 


--------------------------------------------------------------------------------
/06_dataproc/decrease_cluster.sh:
--------------------------------------------------------------------------------
1 | gcloud dataproc clusters update ch6cluster\
2 |    --num-secondary-workers=0 --num-workers=2 --region=us-central1
3 | 


--------------------------------------------------------------------------------
/06_dataproc/delete_cluster.sh:
--------------------------------------------------------------------------------
1 | if [ "$#" -ne 1 ]; then
2 |     echo "Usage: ./delete_cluster.sh region"
3 |     exit
4 | fi
5 | 
6 | gcloud dataproc clusters delete ch6cluster --region $1


--------------------------------------------------------------------------------
/06_dataproc/increase_cluster.sh:
--------------------------------------------------------------------------------
1 | gcloud dataproc clusters update ch6cluster\
2 |    --num-secondary-workers=3 --num-workers=4 --region=us-central1
3 | 


--------------------------------------------------------------------------------
/06_dataproc/install_on_cluster.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # install Google Python client on all nodes
 4 | apt-get -y update
 5 | apt-get install python-dev
 6 | apt-get install -y python-pip
 7 | pip install --upgrade google-api-python-client
 8 | 
 9 | # git clone on Master
10 | USER=CHANGE_TO_USER_NAME # the username that dataproc runs as
11 | ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
12 | if [[ "${ROLE}" == 'Master' ]]; then
13 |   cd home/$USER
14 |   git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
15 |   chown -R $USER data-science-on-gcp
16 | fi
17 | 


--------------------------------------------------------------------------------
/06_dataproc/submit_serverless.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if [ "$#" -ne 2 ]; then
 4 |    echo "Usage:   ./submit_serverless.sh bucket-name region"
 5 |    exit
 6 | fi
 7 | 
 8 | BUCKET=$1
 9 | REGION=$2
10 | 
11 | # Note: The "default" network in the region needs to be enabled
12 | # for private Google access
13 | # https://cloud.google.com/vpc/docs/configure-private-google-access#config-pga
14 | 
15 | gsutil cp bayes_on_spark.py gs://$BUCKET/
16 | 
17 | gcloud beta dataproc batches submit pyspark \
18 |    --project=$(gcloud config get-value project) \
19 |    --region=$REGION \
20 |    gs://${BUCKET}/bayes_on_spark.py \
21 |    -- \
22 |    --bucket ${BUCKET} # --debug
23 | 


--------------------------------------------------------------------------------
/07_sparkml/README.md:
--------------------------------------------------------------------------------
 1 | # 7. Machine Learning: Logistic regression on Spark
 2 | 
 3 | ### Catch up from previous chapters if necessary
 4 | If you didn't go through Chapters 2-6, the simplest way to catch up is to copy data from my bucket:
 5 | 
 6 | #### Catch up from Chapters 2-5
 7 | * Open CloudShell and git clone this repo:
 8 |     ```
 9 |     git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
10 |     ```
11 | * Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
12 | * Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
13 | * Go to the 05_bqnotebook folder of the repo, run the script to load data into BigQuery:
14 | 	```
15 | 	bash create_trainday.sh <BUCKET-NAME>
16 | 	```
17 | 
18 | #### [Optional] Catch up from Chapter 6
19 | * Use the instructions in the <a href="../06_dataproc/README.md">Chapter 6 README</a> to:
20 |   * launch a minimal Cloud Dataproc cluster with initialization actions for Jupyter (`./create_cluster.sh BUCKET ZONE`)
21 | 
22 | * Start a new notebook and in a cell, download a read-only clone of this repository:
23 |     ```
24 |     %bash
25 |     git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
26 |     rm -rf data-science-on-gcp/.git
27 |     ```
28 | * Browse to data-science-on-gcp/07_sparkml_and_bqml/logistic_regression.ipynb
29 |   and run the cells in the notebook (change the BUCKET appropriately).
30 | 
31 | ## This Chapter
32 | ### Logistic regression using Spark
33 | * Launch a large Dataproc cluster:
34 |     ```
35 |     ./create_large_cluster.sh BUCKET ZONE
36 |     ```
37 | * If it fails with quota issues, get increased quota. If you can't have more quota, 
38 |   reduce the number of workers appropriately.
39 | 
40 | * Submit a Spark job to run the full dataset (change the BUCKET appropriately).
41 |     ```
42 |     ./submit_spark.sh BUCKET logistic.py
43 |     ```
44 | 
45 | ### Feature engineering
46 | * Submit a Spark job to do experimentation: `./submit_spark.sh BUCKET experiment.py`
47 | 
48 | ### Cleanup
49 | * Delete the cluster either from the GCP web console or by typing in CloudShell, `../06_dataproc/delete_cluster.sh`
50 | 
51 | 
52 | 


--------------------------------------------------------------------------------
/07_sparkml/autoscale.yaml:
--------------------------------------------------------------------------------
 1 | workerConfig:
 2 |   minInstances: 10
 3 |   maxInstances: 30
 4 | secondaryWorkerConfig:
 5 |   maxInstances: 20
 6 | basicAlgorithm:
 7 |   cooldownPeriod: 15m
 8 |   yarnConfig:
 9 |     scaleUpFactor: 0.05
10 |     scaleDownFactor: 1.0
11 |     gracefulDecommissionTimeout: 1h
12 | 


--------------------------------------------------------------------------------
/07_sparkml/create_large_cluster.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if [ "$#" -ne 2 ]; then
 4 |     echo "Usage: ./create_cluster.sh  bucket-name  region"
 5 |     exit
 6 | fi
 7 | 
 8 | PROJECT=$(gcloud config get-value project)
 9 | BUCKET=$1
10 | REGION=$2
11 | 
12 | # create cluster
13 | gcloud dataproc clusters create ch7cluster \
14 |   --enable-component-gateway \
15 |   --region ${REGION} --zone ${REGION}-a \
16 |   --master-machine-type n1-standard-4 \
17 |   --master-boot-disk-size 500 \
18 |   --num-workers 30 --num-secondary-workers 20 \
19 |   --worker-machine-type n1-standard-8 \
20 |   --worker-boot-disk-size 500 \
21 |   --project $PROJECT \
22 |   --scopes https://www.googleapis.com/auth/cloud-platform
23 | 
24 | gcloud dataproc autoscaling-policies import experiment-policy \
25 |   --source=autoscale.yaml --region=$REGION
26 | 
27 | gcloud dataproc clusters update ch7cluster \
28 |   --autoscaling-policy=experiment-policy --region=$REGION
29 | 


--------------------------------------------------------------------------------
/07_sparkml/experiment.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | # Copyright 2021 Google Inc.
  4 | #
  5 | # Licensed under the Apache License, Version 2.0 (the "License");
  6 | # you may not use this file except in compliance with the License.
  7 | # You may obtain a copy of the License at
  8 | #
  9 | #     http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | 
 17 | from pyspark.mllib.classification import LogisticRegressionWithLBFGS
 18 | from pyspark.mllib.regression import LabeledPoint
 19 | from pyspark.sql import SparkSession
 20 | from pyspark import SparkContext
 21 | import logging
 22 | import numpy as np
 23 | 
 24 | NUM_PARTITIONS = 1000
 25 | 
 26 | def get_category(hour):
 27 |     if hour < 6 or hour > 20:
 28 |         return [1, 0, 0]  # night
 29 |     if hour < 10:
 30 |         return [0, 1, 0]  # morning
 31 |     if hour < 17:
 32 |         return [0, 0, 1]  # mid-day
 33 |     else:
 34 |         return [0, 0, 0]  # evening
 35 | 
 36 | 
 37 | def get_local_hour(timestamp, correction):
 38 |     import datetime
 39 |     TIME_FORMAT = '%Y-%m-%d %H:%M:%S'
 40 |     timestamp = timestamp.replace('T', ' ')  # incase different
 41 |     t = datetime.datetime.strptime(timestamp, TIME_FORMAT)
 42 |     d = datetime.timedelta(seconds=correction)
 43 |     t = t + d
 44 |     # return [t.hour]    # raw
 45 |     # theta = np.radians(360 * t.hour / 24.0)  # von-Miyes
 46 |     # return [np.sin(theta), np.cos(theta)]
 47 |     return get_category(t.hour)  # bucketize
 48 | 
 49 | 
 50 | def eval(labelpred):
 51 |     '''
 52 |         data = (label, pred)
 53 |             data[0] = label
 54 |             data[1] = pred
 55 |     '''
 56 |     cancel = labelpred.filter(lambda data: data[1] < 0.7)
 57 |     nocancel = labelpred.filter(lambda data: data[1] >= 0.7)
 58 |     corr_cancel = cancel.filter(lambda data: data[0] == int(data[1] >= 0.7)).count()
 59 |     corr_nocancel = nocancel.filter(lambda data: data[0] == int(data[1] >= 0.7)).count()
 60 | 
 61 |     cancel_denom = cancel.count()
 62 |     nocancel_denom = nocancel.count()
 63 |     if cancel_denom == 0:
 64 |         cancel_denom = 1
 65 |     if nocancel_denom == 0:
 66 |         nocancel_denom = 1
 67 | 
 68 |     totsqe = labelpred.map(
 69 |         lambda data: (data[0] - data[1]) * (data[0] - data[1])
 70 |     ).sum()
 71 |     rmse = np.sqrt(totsqe / float(cancel.count() + nocancel.count()))
 72 | 
 73 |     return {
 74 |         'rmse': rmse,
 75 |         'total_cancel': cancel.count(),
 76 |         'correct_cancel': float(corr_cancel) / cancel_denom,
 77 |         'total_noncancel': nocancel.count(),
 78 |         'correct_noncancel': float(corr_nocancel) / nocancel_denom
 79 |     }
 80 | 
 81 | 
 82 | def run_experiment(BUCKET, SCALE_AND_CLIP, WITH_TIME, WITH_ORIGIN):
 83 |     # Create spark session
 84 |     sc = SparkContext('local', 'experimentation')
 85 |     spark = SparkSession \
 86 |         .builder \
 87 |         .appName("Logistic regression w/ Spark ML") \
 88 |         .getOrCreate()
 89 | 
 90 |     # read dataset
 91 |     traindays = spark.read \
 92 |         .option("header", "true") \
 93 |         .csv('gs://{}/flights/trainday.csv'.format(BUCKET))
 94 |     traindays.createOrReplaceTempView('traindays')
 95 | 
 96 |     #inputs = 'gs://{}/flights/tzcorr/all_flights-00000-*'.format(BUCKET) # 1/30th
 97 |     inputs = 'gs://{}/flights/tzcorr/all_flights-*'.format(BUCKET)  # FULL
 98 |     flights = spark.read.json(inputs)
 99 | 
100 |     # this view can now be queried
101 |     flights.createOrReplaceTempView('flights')
102 | 
103 |     # separate training and validation data
104 |     from pyspark.sql.functions import rand
105 |     SEED=13
106 |     traindays = traindays.withColumn("holdout", rand(SEED) > 0.8)  # 80% of data is for training
107 |     traindays.createOrReplaceTempView('traindays')
108 | 
109 |     # logistic regression
110 |     trainquery = """
111 |     SELECT
112 |         ORIGIN, DEP_DELAY, TAXI_OUT, ARR_DELAY, DISTANCE, DEP_TIME, DEP_AIRPORT_TZOFFSET
113 |     FROM flights f
114 |     JOIN traindays t
115 |     ON f.FL_DATE == t.FL_DATE
116 |     WHERE
117 |       t.is_train_day == 'True' AND
118 |       t.holdout == False AND
119 |       f.CANCELLED == 'False' AND 
120 |       f.DIVERTED == 'False'
121 |     """
122 |     traindata = spark.sql(trainquery).repartition(NUM_PARTITIONS)
123 | 
124 |     def to_example(fields):
125 |         features = [
126 |             fields['DEP_DELAY'],
127 |             fields['DISTANCE'],
128 |             fields['TAXI_OUT'],
129 |         ]
130 | 
131 |         if SCALE_AND_CLIP:
132 |             def clip(x):
133 |                 if x < -1:
134 |                     return -1
135 |                 if x > 1:
136 |                     return 1
137 |                 return x
138 |             features = [
139 |                 clip(float(fields['DEP_DELAY']) / 30),
140 |                 clip((float(fields['DISTANCE']) / 1000) - 1),
141 |                 clip((float(fields['TAXI_OUT']) / 10) - 1),
142 |             ]
143 | 
144 |         if WITH_TIME:
145 |             features.extend(
146 |                 get_local_hour(fields['DEP_TIME'], fields['DEP_AIRPORT_TZOFFSET']))
147 | 
148 |         if WITH_ORIGIN:
149 |             features.extend(fields['origin_onehot'])
150 | 
151 |         return LabeledPoint(
152 |               float(fields['ARR_DELAY'] < 15), #ontime
153 |               features)
154 | 
155 |     def add_origin(df, trained_model=None):
156 |         from pyspark.ml.feature import OneHotEncoder, StringIndexer
157 |         if not trained_model:
158 |             indexer = StringIndexer(inputCol='ORIGIN', outputCol='origin_index')
159 |             trained_model = indexer.fit(df)
160 |         indexed = trained_model.transform(df)
161 |         encoder = OneHotEncoder(inputCol='origin_index', outputCol='origin_onehot')
162 |         return trained_model, encoder.fit(indexed).transform(indexed)
163 | 
164 |     if WITH_ORIGIN:
165 |         index_model, traindata = add_origin(traindata)
166 | 
167 |     examples = traindata.rdd.map(to_example)
168 |     lrmodel = LogisticRegressionWithLBFGS.train(examples, intercept=True)
169 |     lrmodel.clearThreshold()  # return probabilities
170 | 
171 |     # save model
172 |     MODEL_FILE='gs://' + BUCKET + '/flights/sparkmloutput/model'
173 |     lrmodel.save(sc, MODEL_FILE)
174 |     logging.info("Saved trained model to {}".format(MODEL_FILE))
175 | 
176 |     # evaluate model on the heldout data
177 |     evalquery = trainquery.replace("t.holdout == False", "t.holdout == True")
178 |     evaldata = spark.sql(evalquery).repartition(NUM_PARTITIONS)
179 |     if WITH_ORIGIN:
180 |         evaldata = add_origin(evaldata, index_model)
181 |     examples = evaldata.rdd.map(to_example)
182 |     labelpred = examples.map(lambda p: (p.label, lrmodel.predict(p.features)))
183 | 
184 | 
185 |     logging.info(eval(labelpred))
186 | 
187 | 
188 | 
189 | if __name__ == '__main__':
190 |     import argparse
191 | 
192 |     parser = argparse.ArgumentParser(description='Run experiments with different features in Spark')
193 |     parser.add_argument('--bucket', help='GCS bucket to read/write data', required=True)
194 |     parser.add_argument('--debug', dest='debug', action='store_true', help='Specify if you want debug messages')
195 | 
196 |     args = parser.parse_args()
197 |     if args.debug:
198 |         logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.DEBUG)
199 |     else:
200 |         logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.INFO)
201 | 
202 |     run_experiment(args.bucket, SCALE_AND_CLIP=False, WITH_TIME=False, WITH_ORIGIN=False)
203 | 
204 | 


--------------------------------------------------------------------------------
/07_sparkml/logistic.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | # Copyright 2021 Google Inc.
  4 | #
  5 | # Licensed under the Apache License, Version 2.0 (the "License");
  6 | # you may not use this file except in compliance with the License.
  7 | # You may obtain a copy of the License at
  8 | #
  9 | #     http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | 
 17 | from pyspark.mllib.classification import LogisticRegressionWithLBFGS
 18 | from pyspark.mllib.regression import LabeledPoint
 19 | from pyspark.sql import SparkSession
 20 | from pyspark import SparkContext
 21 | import logging
 22 | 
 23 | def run_logistic(BUCKET):
 24 |     # Create spark session
 25 |     sc = SparkContext('local', 'logistic')
 26 |     spark = SparkSession \
 27 |         .builder \
 28 |         .appName("Logistic regression w/ Spark ML") \
 29 |         .getOrCreate()
 30 | 
 31 |     # read dataset
 32 |     traindays = spark.read \
 33 |         .option("header", "true") \
 34 |         .csv('gs://{}/flights/trainday.csv'.format(BUCKET))
 35 |     traindays.createOrReplaceTempView('traindays')
 36 | 
 37 |     # inputs = 'gs://{}/flights/tzcorr/all_flights-00000-*'.format(BUCKET)  # 1/30th
 38 |     inputs = 'gs://{}/flights/tzcorr/all_flights-*'.format(BUCKET)  # FULL
 39 |     flights = spark.read.json(inputs)
 40 | 
 41 |     # this view can now be queried ...
 42 |     flights.createOrReplaceTempView('flights')
 43 | 
 44 | 
 45 |     # logistic regression
 46 |     trainquery = """
 47 |     SELECT
 48 |       DEP_DELAY, TAXI_OUT, ARR_DELAY, DISTANCE
 49 |     FROM flights f
 50 |     JOIN traindays t
 51 |     ON f.FL_DATE == t.FL_DATE
 52 |     WHERE
 53 |       t.is_train_day == 'True' AND
 54 |       f.CANCELLED == 'False' AND 
 55 |       f.DIVERTED == 'False'
 56 |     """
 57 |     traindata = spark.sql(trainquery)
 58 | 
 59 |     def to_example(fields):
 60 |       return LabeledPoint(\
 61 |                   float(fields['ARR_DELAY'] < 15), #ontime \
 62 |                   [ \
 63 |                       fields['DEP_DELAY'], # DEP_DELAY \
 64 |                       fields['TAXI_OUT'], # TAXI_OUT \
 65 |                       fields['DISTANCE'], # DISTANCE \
 66 |                   ])
 67 | 
 68 |     examples = traindata.rdd.map(to_example)
 69 |     lrmodel = LogisticRegressionWithLBFGS.train(examples, intercept=True)
 70 |     lrmodel.setThreshold(0.7)
 71 | 
 72 |     # save model
 73 |     MODEL_FILE='gs://{}/flights/sparkmloutput/model'.format(BUCKET)
 74 |     lrmodel.save(sc, MODEL_FILE)
 75 |     logging.info('Logistic regression model saved in {}'.format(MODEL_FILE))
 76 | 
 77 |     # evaluate
 78 |     testquery = trainquery.replace("t.is_train_day == 'True'","t.is_train_day == 'False'")
 79 |     testdata = spark.sql(testquery)
 80 |     examples = testdata.rdd.map(to_example)
 81 | 
 82 |     # Evaluate model
 83 |     lrmodel.clearThreshold() # so it returns probabilities
 84 |     labelpred = examples.map(lambda p: (p.label, lrmodel.predict(p.features)))
 85 |     logging.info('All flights: {}'.format(eval_model(labelpred)))
 86 | 
 87 | 
 88 |     # keep only those examples near the decision threshold
 89 |     labelpred = labelpred.filter(lambda data: data[1] > 0.65 and data[1] < 0.75)
 90 |     logging.info('Flights near decision threshold: {}'.format(eval_model(labelpred)))
 91 | 
 92 | def eval_model(labelpred):
 93 |     '''
 94 |             data = (label, pred)
 95 |                 data[0] = label
 96 |                 data[1] = pred
 97 |     '''
 98 |     cancel = labelpred.filter(lambda data: data[1] < 0.7)
 99 |     nocancel = labelpred.filter(lambda data: data[1] >= 0.7)
100 |     corr_cancel = cancel.filter(lambda data: data[0] == int(data[1] >= 0.7)).count()
101 |     corr_nocancel = nocancel.filter(lambda data: data[0] == int(data[1] >= 0.7)).count()
102 | 
103 |     cancel_denom = cancel.count()
104 |     nocancel_denom = nocancel.count()
105 |     if cancel_denom == 0:
106 |         cancel_denom = 1
107 |     if nocancel_denom == 0:
108 |         nocancel_denom = 1
109 |     return {
110 |         'total_cancel': cancel.count(),
111 |         'correct_cancel': float(corr_cancel)/cancel_denom,
112 |         'total_noncancel': nocancel.count(),
113 |         'correct_noncancel': float(corr_nocancel)/nocancel_denom
114 |     }
115 | 
116 | 
117 | if __name__ == '__main__':
118 |     import argparse
119 | 
120 |     parser = argparse.ArgumentParser(description='Run logistic regression in Spark')
121 |     parser.add_argument('--bucket', help='GCS bucket to read/write data', required=True)
122 |     parser.add_argument('--debug', dest='debug', action='store_true', help='Specify if you want debug messages')
123 | 
124 |     args = parser.parse_args()
125 |     if args.debug:
126 |         logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.DEBUG)
127 |     else:
128 |         logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.INFO)
129 | 
130 |     run_logistic(args.bucket)
131 | 


--------------------------------------------------------------------------------
/07_sparkml/submit_spark.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if [ "$#" -ne 3 ]; then
 4 |     echo "Usage: ./submit_spark_to_cluster.sh  bucket-name  region  pyspark-file"
 5 |     exit
 6 | fi
 7 | 
 8 | BUCKET=$1
 9 | REGION=$2
10 | PYSPARK=$3
11 | 
12 | OUTDIR=gs://$BUCKET/flights/sparkmloutput
13 | 
14 | gsutil -m rm -r $OUTDIR
15 | 
16 | # submit to existing cluster
17 | gsutil cp $PYSPARK $OUTDIR/$PYSPARK
18 | gcloud dataproc jobs submit pyspark \
19 |    --cluster ch7cluster --region $REGION \
20 |    $OUTDIR/$PYSPARK \
21 |    -- \
22 |    --bucket $BUCKET
23 | 


--------------------------------------------------------------------------------
/08_bqml/README.md:
--------------------------------------------------------------------------------
 1 | # 8. Machine Learning with BigQuery ML
 2 | 
 3 | ### Catch up from previous chapters if necessary
 4 | If you didn't go through Chapters 2-7, the simplest way to catch up is to copy data from my bucket:
 5 | 
 6 | #### Catch up from Chapters 2-7
 7 | * Open CloudShell and git clone this repo:
 8 |     ```
 9 |     git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
10 |     ```
11 | * Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
12 | * Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
13 | * Go to the 05_bqnotebook folder of the repo, run the script to load data into BigQuery:
14 | 	```
15 | 	bash create_trainday.sh <BUCKET-NAME>
16 | 	```
17 |  
18 | ## This Chapter
19 | 
20 | ### Vertex AI Workbench
21 | * Open a new notebook in Vertex AI Workbench from https://console.cloud.google.com/vertex-ai/workbench
22 | * Launch a new terminal window and type in it:
23 | ```
24 | git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
25 | ```
26 | * In the navigation pane on the left, navigate to data-science-on-gcp/08_bqml
27 | 
28 | ### Logistic regression using BigQuery ML
29 | * Open the notebook bqml_logistic.ipynb
30 | * Edit | Clear All Outputs
31 | * Run through the cells one-by-one, reading the commentary, looking at the code, and examining the output.
32 | * Close the notebook
33 | * Click on the square icon on the left-most bar to view Running Terminals and Notebooks
34 | * Stop the notebook
35 | 
36 | ### Other notebooks
37 | * Repeat the steps above for the following notebooks (in order)
38 |   * bqml_nonlinear.ipynb
39 |   * bqml_timewindow.ipynb
40 |   * bqml_timetxf.ipynb
41 |   
42 | 


--------------------------------------------------------------------------------
/09_vertexai/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | trained_model
3 | babyweight
4 | 


--------------------------------------------------------------------------------
/09_vertexai/README.md:
--------------------------------------------------------------------------------
 1 | # Machine Learning Classifier using TensorFlow
 2 | 
 3 | ### Catch up from previous chapters if necessary
 4 | If you didn't go through Chapters 2-7, the simplest way to catch up is to copy data from my bucket:
 5 | 
 6 | #### Catch up from Chapters 2-7
 7 | * Open CloudShell and git clone this repo:
 8 |     ```
 9 |     git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
10 |     ```
11 | * Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
12 | * Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
13 | * Go to the 05_bqnotebook folder of the repo, run the script to load data into BigQuery:
14 | 	```
15 | 	bash create_trainday.sh <BUCKET-NAME>
16 | 	```
17 |  
18 | ## This Chapter
19 | 
20 | * Open a new notebook in Vertex AI Workbench from https://console.cloud.google.com/vertex-ai/workbench
21 | * Launch a new terminal window and type in it:
22 | ```
23 | git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
24 | ```
25 | * In the navigation pane on the left, navigate to data-science-on-gcp/09_vertexai
26 | * Open the notebook flights_model_tf2.ipynb and run the cells.  Note that the notebook has
27 | DEVELOP_MODE=True and so it will train on a very, very small amount of data. This is just
28 | to make sure the code works.
29 | 
30 | 
31 | ## Articles
32 | Some of the content in this chapter was published as blog posts (links below).
33 | 
34 | * [Giving Vertex AI, the New Unified ML Platform on Google Cloud, a Spin](https://towardsdatascience.com/giving-vertex-ai-the-new-unified-ml-platform-on-google-cloud-a-spin-35e0f3852f25):
35 | Why do we need it, how good is the code-free ML training, really, and what does all this mean for data science jobs?
36 | * [How to Deploy a TensorFlow Model to Vertex AI](https://towardsdatascience.com/how-to-deploy-a-tensorflow-model-to-vertex-ai-87d9ae1df56): Working with saved models and endpoints in Vertex AI
37 | 
38 | 
39 | 


--------------------------------------------------------------------------------
/09_vertexai/call_predict.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | REGION=us-central1
 4 | ENDPOINT_NAME=flights
 5 | 
 6 | ENDPOINT_ID=$(gcloud ai endpoints list --region=$REGION \
 7 |               --format='value(ENDPOINT_ID)' --filter=display_name=${ENDPOINT_NAME} \
 8 |               --sort-by=creationTimeStamp | tail -1)
 9 | echo $ENDPOINT_ID
10 | gcloud ai endpoints predict $ENDPOINT_ID --region=$REGION --json-request=example_input.json
11 | 


--------------------------------------------------------------------------------
/09_vertexai/example_input.json:
--------------------------------------------------------------------------------
1 | {"instances": [
2 |   {"dep_hour": 2, "is_weekday": 1, "dep_delay": 40, "taxi_out": 17, "distance": 41, "carrier": "AS", "dep_airport_lat": 58.42527778, "dep_airport_lon": -135.7075, "arr_airport_lat": 58.35472222, "arr_airport_lon": -134.57472222, "origin": "GST", "dest": "JNU"},
3 |   {"dep_hour": 22, "is_weekday": 0, "dep_delay": -7, "taxi_out": 7, "distance": 201, "carrier": "HA", "dep_airport_lat": 21.97611111, "dep_airport_lon": -159.33888889, "arr_airport_lat": 20.89861111, "arr_airport_lon": -156.43055556, "origin": "LIH", "dest": "OGG"}
4 | ]}
5 | 


--------------------------------------------------------------------------------
/09_vertexai/flights_model.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GoogleCloudPlatform/data-science-on-gcp/652564b9feeeaab331ce27fdd672b8226ba1e837/09_vertexai/flights_model.png


--------------------------------------------------------------------------------
/10_mlops/README.md:
--------------------------------------------------------------------------------
 1 | # Machine Learning Classifier using TensorFlow
 2 | 
 3 | ### Catch up from previous chapters if necessary
 4 | If you didn't go through Chapters 2-7, the simplest way to catch up is to copy data from my bucket:
 5 | 
 6 | #### Catch up from Chapters 2-7
 7 | * Open CloudShell and git clone this repo:
 8 |     ```
 9 |     git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
10 |     ```
11 | * Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
12 | * Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
13 | * Go to the 05_bqnotebook folder of the repo, run the script to load data into BigQuery:
14 | 	```
15 | 	bash create_trainday.sh <BUCKET-NAME>
16 | 	```
17 | * In this (10_mlops) folder, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
18 |     
19 | ## This Chapter
20 | 
21 | In CloudShell, do the following steps:
22 | 
23 | * Install the aiplatform library
24 |     ```
25 |     pip3 install google-cloud-aiplatform cloudml-hypertune kfp
26 |     ```
27 | * Try running the standalone model file on a small sample:
28 |     ```
29 |     python3 model.py  --bucket <bucket-name> --develop
30 |     ```
31 | * [Optional] Run a Vertex AI Pipeline on the small sample (will take about ten minutes):
32 |     ```
33 |     python3 train_on_vertexai.py --project <project> --bucket <bucket-name> --develop
34 |     ```
35 | * Train on the full dataset using Vertex AI:
36 |     ```
37 |     python3 train_on_vertexai.py --project <project> --bucket <bucket-name>
38 |     ```
39 | * Try calling the model using bash:
40 |     ```
41 |     cd ../09_vertexai
42 |     bash ./call_predict.sh
43 |     cd ../10_mlops
44 |     ```
45 | * Try calling the model using Python:
46 |     ```
47 |     python3 call_predict.py
48 |     ```
49 | * [Optional] Train an AutoML model using Vertex AI:
50 |     ```
51 |     python3 train_on_vertexai.py --project <project> --bucket <bucket-name> --automl
52 |     ```
53 | * [Optional] Hyperparameter tune the custom model using Vertex AI:
54 |     ```
55 |     python3 train_on_vertexai.py --project <project> --bucket <bucket-name> --num_hparam_trials 10
56 |     ```
57 | 
58 | 
59 | ## Articles
60 | Some of the content in this chapter was published as blog posts (links below).
61 | 
62 | To try out the code in the articles without going through the chapter, copy the necessary data to your bucket:
63 |   ```
64 |   gsutil cp gs://data-science-on-gcp/edition2/ch9/data/all.csv gs://BUCKET/ch9/data/all.csv
65 | ```
66 | 
67 | Now you will be able to run model.py and train_on_vertexai.py as in the directions above.
68 | 
69 | * [Developing and Deploying a Machine Learning Model on Vertex AI using Python](https://medium.com/@lakshmanok/developing-and-deploying-a-machine-learning-model-on-vertex-ai-using-python-865b535814f8): Write training pipelines that will make your MLOps team happy
70 | * [How to build an MLOps pipeline for hyperparameter tuning in Vertex AI](https://lakshmanok.medium.com/how-to-build-an-mlops-pipeline-for-hyperparameter-tuning-in-vertex-ai-45cc2faf4ff5):
71 | Best practices to set up your model and orchestrator for hyperparameter tuning
72 | 
73 | 
74 | 


--------------------------------------------------------------------------------
/10_mlops/call_predict.py:
--------------------------------------------------------------------------------
 1 | # Copyright 2021 Google Inc. All Rights Reserved.
 2 | #
 3 | # Licensed under the Apache License, Version 2.0 (the "License");
 4 | # you may not use this file except in compliance with the License.
 5 | # You may obtain a copy of the License at
 6 | #
 7 | #     http://www.apache.org/licenses/LICENSE-2.0
 8 | #
 9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | 
15 | import sys, json
16 | from google.cloud import aiplatform
17 | from google.cloud.aiplatform import gapic as aip
18 | 
19 | ENDPOINT_NAME = 'flights'
20 | 
21 | if __name__ == '__main__':
22 | 
23 |     endpoints = aiplatform.Endpoint.list(
24 |         filter='display_name="{}"'.format(ENDPOINT_NAME),
25 |         order_by='create_time desc'
26 |     )
27 |     if len(endpoints) == 0:
28 |         print("No endpoint named {}".format(ENDPOINT_NAME))
29 |         sys.exit(-1)
30 |     
31 |     endpoint = endpoints[0]
32 |     
33 |     input_data = {"instances": [
34 |         {"dep_hour": 2, "is_weekday": 1, "dep_delay": 40, "taxi_out": 17, "distance": 41, "carrier": "AS",
35 |          "dep_airport_lat": 58.42527778, "dep_airport_lon": -135.7075, "arr_airport_lat": 58.35472222,
36 |          "arr_airport_lon": -134.57472222, "origin": "GST", "dest": "JNU"},
37 |         {"dep_hour": 22, "is_weekday": 0, "dep_delay": -7, "taxi_out": 7, "distance": 201, "carrier": "HA",
38 |          "dep_airport_lat": 21.97611111, "dep_airport_lon": -159.33888889, "arr_airport_lat": 20.89861111,
39 |          "arr_airport_lon": -156.43055556, "origin": "LIH", "dest": "OGG"}
40 |     ]}
41 | 
42 |     preds = endpoint.predict(input_data['instances'])
43 |     print(preds)
44 |     
45 | 
46 | 
47 | 


--------------------------------------------------------------------------------
/10_mlops/explanation-metadata.json:
--------------------------------------------------------------------------------
 1 | {
 2 |   "inputs": {
 3 |     "dep_delay": {
 4 |       "inputTensorName": "dep_delay"
 5 |     },
 6 |     "taxi_out": {
 7 |       "inputTensorName": "taxi_out"
 8 |     },
 9 |     "distance": {
10 |       "inputTensorName": "distance"
11 |     },
12 |     "dep_hour": {
13 |       "inputTensorName": "dep_hour"
14 |     },
15 |     "is_weekday": {
16 |       "inputTensorName": "is_weekday"
17 |     },
18 |     "dep_airport_lat": {
19 |       "inputTensorName": "dep_airport_lat"
20 |     },
21 |     "dep_airport_lon": {
22 |       "inputTensorName": "dep_airport_lon"
23 |     },
24 |     "arr_airport_lat": {
25 |       "inputTensorName": "arr_airport_lat"
26 |     },
27 |     "arr_airport_lon": {
28 |       "inputTensorName": "arr_airport_lon"
29 |     },
30 |     "carrier": {
31 |       "inputTensorName": "carrier"
32 |     },
33 |     "origin": {
34 |       "inputTensorName": "origin"
35 |     },
36 |     "dest": {
37 |       "inputTensorName": "dest"
38 |     }
39 |   },
40 |   "outputs": {
41 |     "pred": {
42 |       "outputTensorName": "pred"
43 |     }
44 |   }
45 | }
46 | 


--------------------------------------------------------------------------------
/10_mlops/ingest_from_crsbucket.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | if [ "$#" -ne 1 ]; then
 4 |     echo "Usage: ./ingest_from_crsbucket.sh  destination-bucket-name"
 5 |     exit
 6 | fi
 7 | 
 8 | BUCKET_NAME=$1
 9 | 
10 | for split in train eval all; do
11 |    gsutil cp gs://data-science-on-gcp/edition2/ch9/data/${split}.csv gs://$BUCKET_NAME/ch9/data/${split}.csv
12 | done
13 | 


--------------------------------------------------------------------------------
/10_mlops/train_on_vertexai.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2017-2021 Google Inc. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #     http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | import argparse
 16 | import logging
 17 | from datetime import datetime
 18 | import tensorflow as tf
 19 | 
 20 | from google.cloud import aiplatform
 21 | from google.cloud.aiplatform import gapic as aip
 22 | from google.cloud.aiplatform import hyperparameter_tuning as hpt
 23 | from kfp.v2 import compiler, dsl
 24 | 
 25 | ENDPOINT_NAME = 'flights'
 26 | 
 27 | 
 28 | def train_custom_model(data_set, timestamp, develop_mode, cpu_only_mode, tf_version, extra_args=None):
 29 |     # Set up training and deployment infra
 30 |     
 31 |     if cpu_only_mode:
 32 |         train_image='us-docker.pkg.dev/vertex-ai/training/tf-cpu.{}:latest'.format(tf_version)
 33 |         deploy_image='us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.{}:latest'.format(tf_version)
 34 |     else:
 35 |         train_image = "us-docker.pkg.dev/vertex-ai/training/tf-gpu.{}:latest".format(tf_version)
 36 |         deploy_image = "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.{}:latest".format(tf_version)
 37 | 
 38 |     # train
 39 |     model_display_name = '{}-{}'.format(ENDPOINT_NAME, timestamp)
 40 |     job = aiplatform.CustomTrainingJob(
 41 |         display_name='train-{}'.format(model_display_name),
 42 |         script_path="model.py",
 43 |         container_uri=train_image,
 44 |         requirements=['cloudml-hypertune'],  # any extra Python packages
 45 |         model_serving_container_image_uri=deploy_image
 46 |     )
 47 |     model_args = [
 48 |         '--bucket', BUCKET,
 49 |     ]
 50 |     if develop_mode:
 51 |         model_args += ['--develop']
 52 |     if extra_args:
 53 |         model_args += extra_args
 54 |     
 55 |     if cpu_only_mode:
 56 |         model = job.run(
 57 |             dataset=data_set,
 58 |             # See https://googleapis.dev/python/aiplatform/latest/aiplatform.html#
 59 |             predefined_split_column_name='data_split',
 60 |             model_display_name=model_display_name,
 61 |             args=model_args,
 62 |             replica_count=1,
 63 |             machine_type='n1-standard-4',
 64 |             sync=develop_mode
 65 |         )
 66 |     else:
 67 |         model = job.run(
 68 |             dataset=data_set,
 69 |             # See https://googleapis.dev/python/aiplatform/latest/aiplatform.html#
 70 |             predefined_split_column_name='data_split',
 71 |             model_display_name=model_display_name,
 72 |             args=model_args,
 73 |             replica_count=1,
 74 |             machine_type='n1-standard-4',
 75 |             # See https://cloud.google.com/vertex-ai/docs/general/locations#accelerators
 76 |             accelerator_type=aip.AcceleratorType.NVIDIA_TESLA_T4.name,
 77 |             accelerator_count=1,
 78 |             sync=develop_mode
 79 |         )
 80 |     return model
 81 | 
 82 | 
 83 | def train_automl_model(data_set, timestamp, develop_mode):
 84 |     # train
 85 |     model_display_name = '{}-{}'.format(ENDPOINT_NAME, timestamp)
 86 |     job = aiplatform.AutoMLTabularTrainingJob(
 87 |         display_name='train-{}'.format(model_display_name),
 88 |         optimization_prediction_type='classification'
 89 |     )
 90 |     model = job.run(
 91 |         dataset=data_set,
 92 |         # See https://googleapis.dev/python/aiplatform/latest/aiplatform.html#
 93 |         predefined_split_column_name='data_split',
 94 |         target_column='ontime',
 95 |         model_display_name=model_display_name,
 96 |         budget_milli_node_hours=(300 if develop_mode else 2000),
 97 |         disable_early_stopping=False,
 98 |         export_evaluated_data_items=True,
 99 |         export_evaluated_data_items_bigquery_destination_uri='{}:dsongcp.ch9_automl_evaluated'.format(PROJECT),
100 |         export_evaluated_data_items_override_destination=True,
101 |         sync=develop_mode
102 |     )
103 |     return model
104 | 
105 | 
106 | def do_hyperparameter_tuning(data_set, timestamp, develop_mode, cpu_only_mode, tf_version):
107 |     # Vertex AI services require regional API endpoints.
108 |     if cpu_only_mode:
109 |         train_image='us-docker.pkg.dev/vertex-ai/training/tf-cpu.{}:latest'.format(tf_version)
110 |     else: 
111 |         train_image = "us-docker.pkg.dev/vertex-ai/training/tf-gpu.{}:latest".format(tf_version)
112 | 
113 |     # a single trial job
114 |     model_display_name = '{}-{}'.format(ENDPOINT_NAME, timestamp)
115 |     if cpu_only_mode:
116 |         trial_job = aiplatform.CustomJob.from_local_script(
117 |             display_name='train-{}'.format(model_display_name),
118 |             script_path="model.py",
119 |             container_uri=train_image,
120 |             args=[
121 |                 '--bucket', BUCKET,
122 |                 '--skip_full_eval',  # no need to evaluate on test data set
123 |                 '--num_epochs', '10',
124 |                 '--num_examples', '500000'  # 1/10 actual size to finish faster
125 |             ],
126 |             requirements=['cloudml-hypertune'],  # any extra Python packages
127 |             replica_count=1,
128 |             machine_type='n1-standard-4'
129 |         )
130 |     else:
131 |         trial_job = aiplatform.CustomJob.from_local_script(
132 |             display_name='train-{}'.format(model_display_name),
133 |             script_path="model.py",
134 |             container_uri=train_image,
135 |             args=[
136 |                 '--bucket', BUCKET,
137 |                 '--skip_full_eval',  # no need to evaluate on test data set
138 |                 '--num_epochs', '10',
139 |                 '--num_examples', '500000'  # 1/10 actual size to finish faster
140 |             ],
141 |             requirements=['cloudml-hypertune'],  # any extra Python packages
142 |             replica_count=1,
143 |             machine_type='n1-standard-4',
144 |             # See https://cloud.google.com/vertex-ai/docs/general/locations#accelerators
145 |             accelerator_type=aip.AcceleratorType.NVIDIA_TESLA_T4.name,
146 |             accelerator_count=1,
147 |         )
148 | 
149 |     # the tuning job
150 |     hparam_job = aiplatform.HyperparameterTuningJob(
151 |         # See https://googleapis.dev/python/aiplatform/latest/aiplatform.html#
152 |         display_name='hparam-{}'.format(model_display_name),
153 |         custom_job=trial_job,
154 |         metric_spec={'val_rmse': 'minimize'},
155 |         parameter_spec={
156 |             "train_batch_size": hpt.IntegerParameterSpec(min=16, max=256, scale='log'),
157 |             "nbuckets": hpt.IntegerParameterSpec(min=5, max=10, scale='linear'),
158 |             "dnn_hidden_units": hpt.CategoricalParameterSpec(values=["64,16", "64,16,4", "64,64,64,8", "256,64,16"])
159 |         },
160 |         max_trial_count=2 if develop_mode else NUM_HPARAM_TRIALS,
161 |         parallel_trial_count=2,
162 |         search_algorithm=None,  # Bayesian
163 |     )
164 | 
165 |     hparam_job.run(sync=True)  # has to finish before we can get trials.
166 | 
167 |     # get the parameters corresponding to the best trial
168 |     best = sorted(hparam_job.trials, key=lambda x: x.final_measurement.metrics[0].value)[0]
169 |     logging.info('Best trial: {}'.format(best))
170 |     best_params = []
171 |     for param in best.parameters:
172 |         best_params.append('--{}'.format(param.parameter_id))
173 | 
174 |         if param.parameter_id in ["train_batch_size", "nbuckets"]:
175 |             # hparam returns 10.0 even though it's an integer param. so round it.
176 |             # but CustomTrainingJob makes integer args into floats. so make it a string
177 |             best_params.append(str(int(round(param.value))))
178 |         else:
179 |             # string or float parameters
180 |             best_params.append(param.value)
181 | 
182 |     # run the best trial to completion
183 |     logging.info('Launching full training job with {}'.format(best_params))
184 |     return train_custom_model(data_set, timestamp, develop_mode, cpu_only_mode, tf_version, extra_args=best_params)
185 | 
186 | 
187 | @dsl.pipeline(name="flights-ch9-pipeline",
188 |               description="ds-on-gcp ch9 flights pipeline"
189 | )
190 | def main():
191 |     aiplatform.init(project=PROJECT, location=REGION, staging_bucket='gs://{}'.format(BUCKET))
192 | 
193 |     # create data set
194 |     all_files = tf.io.gfile.glob('gs://{}/ch9/data/all*.csv'.format(BUCKET))
195 |     logging.info("Training on {}".format(all_files))
196 |     data_set = aiplatform.TabularDataset.create(
197 |         display_name='data-{}'.format(ENDPOINT_NAME),
198 |         gcs_source=all_files
199 |     )
200 |     if TF_VERSION is not None:
201 |         tf_version = TF_VERSION.replace(".", "-")
202 |     else:
203 |         tf_version = '2-' + tf.__version__[2:3]
204 | 
205 |     # train
206 |     if AUTOML:
207 |         model = train_automl_model(data_set, TIMESTAMP, DEVELOP_MODE)
208 |     elif NUM_HPARAM_TRIALS > 1:
209 |         model = do_hyperparameter_tuning(data_set, TIMESTAMP, DEVELOP_MODE, CPU_ONLY_MODE, tf_version)
210 |     else:
211 |         model = train_custom_model(data_set, TIMESTAMP, DEVELOP_MODE, CPU_ONLY_MODE, tf_version)
212 | 
213 |     # create endpoint if it doesn't already exist
214 |     endpoints = aiplatform.Endpoint.list(
215 |         filter='display_name="{}"'.format(ENDPOINT_NAME),
216 |         order_by='create_time desc',
217 |         project=PROJECT, location=REGION,
218 |     )
219 |     if len(endpoints) > 0:
220 |         endpoint = endpoints[0]  # most recently created
221 |     else:
222 |         endpoint = aiplatform.Endpoint.create(
223 |             display_name=ENDPOINT_NAME, project=PROJECT, location=REGION,
224 |             sync=DEVELOP_MODE
225 |         )
226 | 
227 |     # deploy
228 |     model.deploy(
229 |         endpoint=endpoint,
230 |         traffic_split={"0": 100},
231 |         machine_type='n1-standard-2',
232 |         min_replica_count=1,
233 |         max_replica_count=1,
234 |         sync=DEVELOP_MODE
235 |     )
236 | 
237 |     if DEVELOP_MODE:
238 |         model.wait()
239 | 
240 | 
241 | def run_pipeline():
242 |     compiler.Compiler().compile(pipeline_func=main, package_path='flights_pipeline.json')
243 | 
244 |     job = aip.PipelineJob(
245 |         display_name="{}-pipeline".format(ENDPOINT_NAME),
246 |         template_path="{}_pipeline.json".format(ENDPOINT_NAME),
247 |         pipeline_root="{}/pipeline_root/intro".format(BUCKET),
248 |         enable_caching=False
249 |     )
250 | 
251 |     job.run()
252 | 
253 | 
254 | if __name__ == '__main__':
255 |     parser = argparse.ArgumentParser()
256 | 
257 |     parser.add_argument(
258 |         '--bucket',
259 |         help='Data will be read from gs://BUCKET/ch9/data and checkpoints will be in gs://BUCKET/ch9/trained_model',
260 |         required=True
261 |     )
262 |     parser.add_argument(
263 |         '--region',
264 |         help='Where to run the trainer',
265 |         default='us-central1'
266 |     )
267 |     parser.add_argument(
268 |         '--project',
269 |         help='Project to be billed',
270 |         required=True
271 |     )
272 |     parser.add_argument(
273 |         '--develop',
274 |         help='Train on a small subset in development',
275 |         dest='develop',
276 |         action='store_true')
277 |     parser.set_defaults(develop=False)
278 |     parser.add_argument(
279 |         '--automl',
280 |         help='Train an AutoML Table, instead of using model.py',
281 |         dest='automl',
282 |         action='store_true')
283 |     parser.set_defaults(automl=False)
284 |     parser.add_argument(
285 |         '--num_hparam_trials',
286 |         help='Number of hyperparameter trials. 0/1 means no hyperparam. Ignored if --automl is set.',
287 |         type=int,
288 |         default=0)
289 |     parser.add_argument(
290 |         '--pipeline',
291 |         help='Run as pipeline',
292 |         dest='pipeline',
293 |         action='store_true')
294 |     parser.add_argument(
295 |         '--cpuonly',
296 |         help='Run without GPU',
297 |         dest='cpuonly',
298 |         action='store_true')
299 |     parser.set_defaults(cpuonly=False)
300 |     parser.add_argument(
301 |         '--tfversion',
302 |         help='TensorFlow version to use'
303 |     )
304 | 
305 |     # parse args
306 |     logging.getLogger().setLevel(logging.INFO)
307 |     args = parser.parse_args().__dict__
308 |     BUCKET = args['bucket']
309 |     PROJECT = args['project']
310 |     REGION = args['region']
311 |     DEVELOP_MODE = args['develop']
312 |     CPU_ONLY_MODE = args['cpuonly']
313 |     TF_VERSION = args['tfversion']    
314 |     AUTOML = args['automl']
315 |     NUM_HPARAM_TRIALS = args['num_hparam_trials']
316 |     TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
317 | 
318 |     if args['pipeline']:
319 |         run_pipeline()
320 |     else:
321 |         main()


--------------------------------------------------------------------------------
/11_realtime/.gitignore:
--------------------------------------------------------------------------------
1 | model.py
2 | train_on_vertexai.py
3 | *.egg-info
4 | call_predict.py
5 | 


--------------------------------------------------------------------------------
/11_realtime/README.md:
--------------------------------------------------------------------------------
  1 | # Machine Learning on Streaming Pipelines
  2 | 
  3 | ### Catch up from previous chapters if necessary
  4 | If you didn't go through Chapters 2-9, the simplest way to catch up is to copy data from my bucket:
  5 | 
  6 | #### Catch up from Chapters 2-9
  7 | * Open CloudShell and git clone this repo:
  8 |     ```
  9 |     git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
 10 |     ```
 11 | * Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
 12 | * Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
 13 | * Go to the 05_bqnotebook folder of the repo, run the program ./create_trainday.sh and specify your bucket name.
 14 | * Go to the 10_mlops folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
 15 | 
 16 | #### From CloudShell
 17 | * Install the Python libraries you'll need
 18 |     ```
 19 |     pip3 install google-cloud-aiplatform cloudml-hypertune pyfarmhash
 20 |     ```
 21 | * [Optional] Create a small, local sample of BigQuery datasets for local experimentation:
 22 |     ```
 23 |     bash create_sample_input.sh
 24 |     ```
 25 | * [Optional] Run a local pipeline to create a training dataset:
 26 |     ```
 27 |     python3 create_traindata.py --input local
 28 |     ```
 29 |    Verify the results:
 30 |    ```
 31 |    cat /tmp/all_data*
 32 |    ```
 33 | * Run a Dataflow pipeline to create the full training dataset:
 34 |   ```
 35 |     python3 create_traindata.py --input bigquery --project <PROJECT> --bucket <BUCKET> --region <REGION>
 36 |   ```
 37 |   Note if you get an error similar to:
 38 |   ```
 39 |   AttributeError: Can't get attribute '_create_code' on <module 'dill._dill' from '/usr/local/lib/python3.7/site-packages/dill/_dill.py'>
 40 |   ```
 41 |   it is because the global version of your modules are ahead/behind of what Apache Beam on the server requires. Make sure to submit Apache Beam code to Dataflow from a pristine virtual environment that has only the modules you need:
 42 |   ```
 43 |   python -m venv ~/beamenv
 44 |   source ~/beamenv/bin/activate
 45 |   pip install apache-beam[gcp] google-cloud-aiplatform cloudml-hypertune pyfarmhash pyparsing==2.4.2
 46 |   python3 create_traindata.py ...
 47 |   ```
 48 |   Note that beamenv is only for submitting to Dataflow. Run train_on_vertexai.py and other code directly in the terminal.
 49 | * Run script that copies over the Ch10 model.py and train_on_vertexai.py files and makes the necessary changes:
 50 |   ```
 51 |   python3 change_ch10_files.py
 52 |   ```
 53 | * [Optional] Train an AutoML model on the enriched dataset:
 54 |   ```
 55 |   python3 train_on_vertexai.py --automl --project <PROJECT> --bucket <BUCKET> --region <REGION>
 56 |   ```
 57 |   Verify performance by running the following BigQuery query:
 58 |   ```
 59 |   SELECT  
 60 |   SQRT(SUM(
 61 |       (CAST(ontime AS FLOAT64) - predicted_ontime.scores[OFFSET(0)])*
 62 |       (CAST(ontime AS FLOAT64) - predicted_ontime.scores[OFFSET(0)])
 63 |       )/COUNT(*))
 64 |   FROM dsongcp.ch11_automl_evaluated
 65 |   ```
 66 | * Train custom ML model on the enriched dataset:
 67 |   ```
 68 |   python3 train_on_vertexai.py --project <PROJECT> --bucket <BUCKET> --region <REGION>
 69 |   ```
 70 |   Look at the logs of the log to determine the final RMSE.
 71 | * Run a local pipeline to invoke predictions:
 72 |     ```
 73 |     python3 make_predictions.py --input local
 74 |     ```
 75 |    Verify the results:
 76 |    ```
 77 |    cat /tmp/predictions*
 78 |    ```
 79 | * [Optional] Run a pipeline on full BigQuery dataset to invoke predictions:
 80 |     ```
 81 |     python3 make_predictions.py --input bigquery --project <PROJECT> --bucket <BUCKET> --region <REGION>
 82 |     ```
 83 |    Verify the results
 84 |    ```
 85 |    gsutil cat gs://BUCKET/flights/ch11/predictions* | head -5
 86 |    ```
 87 | * [Optional] Simulate real-time pipeline and check to see if predictions are being made
 88 | 
 89 |   
 90 |    In one terminal, type:
 91 |     ```
 92 |   cd ../04_streaming/simulate
 93 |   python3 ./simulate.py --startTime '2015-05-01 00:00:00 UTC' \
 94 |            --endTime '2015-05-04 00:00:00 UTC' --speedFactor=30 --project <PROJECT>
 95 |     ```
 96 |    
 97 |   In another terminal type:
 98 |     ```
 99 |     python3 make_predictions.py --input pubsub \
100 |            --project <PROJECT> --bucket <BUCKET> --region <REGION>
101 |     ```
102 |   
103 |   Ensure that the pipeline starts, check that output elements are starting to be written out, do:
104 |    ```
105 |    gsutil ls gs://BUCKET/flights/ch11/predictions*
106 |    ```
107 |    Make sure to go to the GCP Console and stop the Dataflow pipeline.
108 | 
109 |   
110 | * Simulate real-time pipeline and try out different jagger etc.
111 | 
112 |   In one terminal, type:
113 |     ```
114 |   cd ../04_streaming/simulate
115 |   python3 ./simulate.py --startTime '2015-02-01 00:00:00 UTC' \
116 |            --endTime '2015-02-03 00:00:00 UTC' --speedFactor=30 --project <PROJECT>
117 |     ```
118 |    
119 |   In another terminal type:
120 |     ```
121 |     python3 make_predictions.py --input pubsub --output bigquery \
122 |            --project <PROJECT> --bucket <BUCKET> --region <REGION>
123 |     ```
124 |   
125 |   Ensure that the pipeline starts, look at BigQuery:
126 |    ```
127 |    SELECT * FROM dsongcp.streaming_preds ORDER BY event_time DESC LIMIT 10
128 |    ```
129 |    When done, make sure to go to the GCP Console and stop the Dataflow pipeline.
130 |    
131 |    Note: If you are going to try it a second time around, delete the BigQuery sink, or simulate with a different time range
132 |    ```
133 |    bq rm -f dsongcp.streaming_preds
134 |    ```
135 |   
136 | 


--------------------------------------------------------------------------------
/11_realtime/change_ch10_files.py:
--------------------------------------------------------------------------------
 1 | # Copyright 2021 Google Inc. All Rights Reserved.
 2 | #
 3 | # Licensed under the Apache License, Version 2.0 (the "License");
 4 | # you may not use this file except in compliance with the License.
 5 | # You may obtain a copy of the License at
 6 | #
 7 | #     http://www.apache.org/licenses/LICENSE-2.0
 8 | #
 9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | 
15 | import os
16 | 
17 | CHANGES = [
18 |     # both
19 |     ("ch9", "ch11"),
20 | 
21 |     # train_on_vertexai.py
22 |     ("ENDPOINT_NAME = 'flights'", "ENDPOINT_NAME = 'flights-ch11'"),
23 | 
24 |     # model.py
25 |     ("arr_airport_lat,arr_airport_lon", "arr_airport_lat,arr_airport_lon,avg_dep_delay,avg_taxi_out"),
26 |     ("43.41694444, -124.24694444, 39.86166667, -104.67305556, 'TRAIN'",
27 |      "43.41694444, -124.24694444, 39.86166667, -104.67305556, -3.0, 5.0, 'TRAIN'"),
28 | 
29 |     # call_predict.py
30 |     ('"carrier": "AS"', '"carrier": "AS", "avg_dep_delay": -3.0, "avg_taxi_out": 5.0'),
31 |     ('"carrier": "HA"', '"carrier": "HA", "avg_dep_delay": 3.0, "avg_taxi_out": 8.0'),
32 | ]
33 | 
34 | for filename in ['train_on_vertexai.py', 'model.py', 'call_predict.py']:
35 |     in_filename = os.path.join('../10_mlops', filename)
36 |     with open(in_filename, "r") as ifp:
37 |         with open(filename, "w") as ofp:
38 |             ofp.write("#### DO NOT EDIT! Autogenerated from {}".format(in_filename))
39 |             for line in ifp.readlines():
40 |                 for change in CHANGES:
41 |                     new_line = line.replace(change[0], change[1])
42 |                     if new_line != line:
43 |                         print('<<' + line + '>>' + new_line)
44 |                         line = new_line
45 |                 ofp.write(line)
46 | 
47 |     print("*** Wrote out {}".format(filename))
48 | 


--------------------------------------------------------------------------------
/11_realtime/create_sample_input.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | bq query --nouse_legacy_sql --format=sparse \
 3 |     "SELECT EVENT_DATA FROM dsongcp.flights_simevents WHERE EVENT_TYPE = 'wheelsoff' AND EVENT_TIME BETWEEN '2015-03-10T10:00:00' AND '2015-03-10T14:00:00' " \
 4 |     | grep FL_DATE \
 5 |     > simevents_sample.json
 6 | 
 7 | 
 8 | bq query --nouse_legacy_sql --format=json \
 9 |     "SELECT * FROM dsongcp.flights_tzcorr WHERE DEP_TIME BETWEEN '2015-03-10T10:00:00' AND '2015-03-10T14:00:00' " \
10 |     | sed 's/\[//g' | sed 's/\]//g' | sed s'/\},/\}\n/g' \
11 |     > alldata_sample.json
12 | 


--------------------------------------------------------------------------------
/11_realtime/create_traindata.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | # Copyright 2021 Google Inc.
  4 | #
  5 | # Licensed under the Apache License, Version 2.0 (the "License");
  6 | # you may not use this file except in compliance with the License.
  7 | # You may obtain a copy of the License at
  8 | #
  9 | #     http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | 
 17 | import apache_beam as beam
 18 | import logging
 19 | import os
 20 | import json
 21 | 
 22 | from flightstxf import flights_transforms as ftxf
 23 | 
 24 | CSV_HEADER = 'ontime,dep_delay,taxi_out,distance,origin,dest,dep_hour,is_weekday,carrier,dep_airport_lat,dep_airport_lon,arr_airport_lat,arr_airport_lon,avg_dep_delay,avg_taxi_out,data_split'
 25 | 
 26 | 
 27 | def dict_to_csv(f):
 28 |     try:
 29 |         yield ','.join([str(x) for x in f.values()])
 30 |     except Exception as e:
 31 |         logging.warning('Ignoring {} because: {}'.format(f, e), exc_info=True)
 32 |         pass
 33 | 
 34 | 
 35 | def run(project, bucket, region, input):
 36 |     if input == 'local':
 37 |         logging.info('Running locally on small extract')
 38 |         argv = [
 39 |             '--runner=DirectRunner'
 40 |         ]
 41 |         flights_output = '/tmp/'
 42 |     else:
 43 |         logging.info('Running in the cloud on full dataset input={}'.format(input))
 44 |         argv = [
 45 |             '--project={0}'.format(project),
 46 |             '--job_name=ch11traindata',
 47 |             # '--save_main_session', # not needed as we are running as a package now
 48 |             '--staging_location=gs://{0}/flights/staging/'.format(bucket),
 49 |             '--temp_location=gs://{0}/flights/temp/'.format(bucket),
 50 |             '--setup_file=./setup.py',
 51 |             '--autoscaling_algorithm=THROUGHPUT_BASED',
 52 |             '--max_num_workers=20',
 53 |             # '--max_num_workers=4', '--worker_machine_type=m1-ultramem-40', '--disk_size_gb=500',  # for full 2015-2019 dataset
 54 |             '--region={}'.format(region),
 55 |             '--runner=DataflowRunner'
 56 |         ]
 57 |         flights_output = 'gs://{}/ch11/data/'.format(bucket)
 58 | 
 59 |     with beam.Pipeline(argv=argv) as pipeline:
 60 | 
 61 |         # read the event stream
 62 |         if input == 'local':
 63 |             input_file = './alldata_sample.json'
 64 |             logging.info("Reading from {} ... Writing to {}".format(input_file, flights_output))
 65 |             events = (
 66 |                     pipeline
 67 |                     | 'read_input' >> beam.io.ReadFromText(input_file)
 68 |                     | 'parse_input' >> beam.Map(lambda line: json.loads(line))
 69 |             )
 70 |         elif input == 'bigquery':
 71 |             input_table = 'dsongcp.flights_tzcorr'
 72 |             logging.info("Reading from {} ... Writing to {}".format(input_table, flights_output))
 73 |             events = (
 74 |                     pipeline
 75 |                     | 'read_input' >> beam.io.ReadFromBigQuery(table=input_table)
 76 |             )
 77 |         else:
 78 |             logging.error("Unknown input type {}".format(input))
 79 |             return
 80 | 
 81 |         # events -> features.  See ./flights_transforms.py for the code shared between training & prediction
 82 |         features = ftxf.transform_events_to_features(events)
 83 | 
 84 |         # shuffle globally so that we are not at mercy of TensorFlow's shuffle buffer
 85 |         features = (
 86 |             features
 87 |             | 'into_global' >> beam.WindowInto(beam.window.GlobalWindows())
 88 |             | 'shuffle' >> beam.util.Reshuffle()
 89 |         )
 90 | 
 91 |         # write out
 92 |         for split in ['ALL', 'TRAIN', 'VALIDATE', 'TEST']:
 93 |             feats = features
 94 |             if split != 'ALL':
 95 |                 feats = feats | 'only_{}'.format(split) >> beam.Filter(lambda f: f['data_split'] == split)
 96 |             (
 97 |                 feats
 98 |                 | '{}_to_string'.format(split) >> beam.FlatMap(dict_to_csv)
 99 |                 | '{}_to_gcs'.format(split) >> beam.io.textio.WriteToText(os.path.join(flights_output, split.lower()),
100 |                                                                           file_name_suffix='.csv', header=CSV_HEADER,
101 |                                                                           # workaround b/207384805
102 |                                                                           num_shards=1)
103 |             )
104 | 
105 | 
106 | if __name__ == '__main__':
107 |     import argparse
108 | 
109 |     parser = argparse.ArgumentParser(description='Create training CSV file that includes time-aggregate features')
110 |     parser.add_argument('-p', '--project', help='Project to be billed for Dataflow job. Omit if running locally.')
111 |     parser.add_argument('-b', '--bucket', help='Training data will be written to gs://BUCKET/flights/ch11/')
112 |     parser.add_argument('-r', '--region', help='Region to run Dataflow job. Choose the same region as your bucket.')
113 |     parser.add_argument('-i', '--input', help='local OR bigquery', required=True)
114 | 
115 |     logging.getLogger().setLevel(logging.INFO)
116 |     args = vars(parser.parse_args())
117 | 
118 |     if args['input'] != 'local':
119 |         if not args['bucket'] or not args['project'] or not args['region']:
120 |             print("Project, Bucket, Region are needed in order to run on the cloud on full dataset.")
121 |             parser.print_help()
122 |             parser.exit()
123 | 
124 |     run(project=args['project'], bucket=args['bucket'], region=args['region'], input=args['input'])
125 | 


--------------------------------------------------------------------------------
/11_realtime/flightstxf/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GoogleCloudPlatform/data-science-on-gcp/652564b9feeeaab331ce27fdd672b8226ba1e837/11_realtime/flightstxf/__init__.py


--------------------------------------------------------------------------------
/11_realtime/flightstxf/flights_transforms.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | # Copyright 2021 Google Inc.
  4 | #
  5 | # Licensed under the Apache License, Version 2.0 (the "License");
  6 | # you may not use this file except in compliance with the License.
  7 | # You may obtain a copy of the License at
  8 | #
  9 | #     http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | 
 17 | import apache_beam as beam
 18 | import datetime as dt
 19 | import logging
 20 | import numpy as np
 21 | import farmhash  # pip install pyfarmhash
 22 | 
 23 | DATETIME_FORMAT = '%Y-%m-%d %H:%M:%S'
 24 | WINDOW_DURATION = 60 * 60
 25 | WINDOW_EVERY = 5 * 60
 26 | 
 27 | 
 28 | def get_data_split(fl_date):
 29 |     fl_date_str = str(fl_date)
 30 |     # Use farm fingerprint just like in BigQuery
 31 |     x = np.abs(np.uint64(farmhash.fingerprint64(fl_date_str)).astype('int64') % 100)
 32 |     if x < 60:
 33 |         data_split = 'TRAIN'
 34 |     elif x < 80:
 35 |         data_split = 'VALIDATE'
 36 |     else:
 37 |         data_split = 'TEST'
 38 |     return data_split
 39 | 
 40 | 
 41 | def get_data_split_2019(fl_date):
 42 |     fl_date_str = str(fl_date)
 43 |     if fl_date_str > '2019':
 44 |         data_split = 'TEST'
 45 |     else:
 46 |         # Use farm fingerprint just like in BigQuery
 47 |         x = np.abs(np.uint64(farmhash.fingerprint64(fl_date_str)).astype('int64') % 100)
 48 |         if x < 95:
 49 |             data_split = 'TRAIN'
 50 |         else:
 51 |             data_split = 'VALIDATE'
 52 |     return data_split
 53 | 
 54 | 
 55 | def to_datetime(event_time):
 56 |     if isinstance(event_time, str):
 57 |         # In BigQuery, this is a datetime.datetime.  In JSON, it's a string
 58 |         # sometimes it has a T separating the date, sometimes it doesn't
 59 |         # Handle all the possibilities
 60 |         event_time = dt.datetime.strptime(event_time.replace('T', ' '), DATETIME_FORMAT)
 61 |     return event_time
 62 | 
 63 | 
 64 | def approx_miles_between(lat1, lon1, lat2, lon2):
 65 |     # convert to radians
 66 |     lat1 = float(lat1) * np.pi / 180.0
 67 |     lat2 = float(lat2) * np.pi / 180.0
 68 |     lon1 = float(lon1) * np.pi / 180.0
 69 |     lon2 = float(lon2) * np.pi / 180.0
 70 | 
 71 |     # apply Haversine formula
 72 |     d_lat = lat2 - lat1
 73 |     d_lon = lon2 - lon1
 74 |     a = (pow(np.sin(d_lat / 2), 2) +
 75 |          pow(np.sin(d_lon / 2), 2) *
 76 |          np.cos(lat1) * np.cos(lat2));
 77 |     c = 2 * np.arcsin(np.sqrt(a))
 78 |     return float(6371 * c * 0.621371)  # miles
 79 | 
 80 | 
 81 | def create_features_and_label(event, for_training):
 82 |     try:
 83 |         model_input = {}
 84 | 
 85 |         if for_training:
 86 |             model_input.update({
 87 |                 'ontime': 1.0 if float(event['ARR_DELAY'] or 0) < 15 else 0,
 88 |             })
 89 | 
 90 |         # features for both training and prediction
 91 |         model_input.update({
 92 |             # same as in ch9
 93 |             'dep_delay': event['DEP_DELAY'],
 94 |             'taxi_out': event['TAXI_OUT'],
 95 |             # distance is not in wheelsoff
 96 |             'distance': approx_miles_between(event['DEP_AIRPORT_LAT'], event['DEP_AIRPORT_LON'],
 97 |                                              event['ARR_AIRPORT_LAT'], event['ARR_AIRPORT_LON']),
 98 |             'origin': event['ORIGIN'],
 99 |             'dest': event['DEST'],
100 |             'dep_hour': to_datetime(event['DEP_TIME']).hour,
101 |             'is_weekday': 1.0 if to_datetime(event['DEP_TIME']).isoweekday() < 6 else 0.0,
102 |             'carrier': event['UNIQUE_CARRIER'],
103 |             'dep_airport_lat': event['DEP_AIRPORT_LAT'],
104 |             'dep_airport_lon': event['DEP_AIRPORT_LON'],
105 |             'arr_airport_lat': event['ARR_AIRPORT_LAT'],
106 |             'arr_airport_lon': event['ARR_AIRPORT_LON'],
107 |             # newly computed averages
108 |             'avg_dep_delay': event['AVG_DEP_DELAY'],
109 |             'avg_taxi_out': event['AVG_TAXI_OUT'],
110 | 
111 |         })
112 | 
113 |         if for_training:
114 |             model_input.update({
115 |                 # training data split
116 |                 'data_split': get_data_split(event['FL_DATE'])
117 |             })
118 |         else:
119 |             model_input.update({
120 |                 # prediction output should include timestamp
121 |                 'event_time': event['WHEELS_OFF']
122 |             })
123 | 
124 |         yield model_input
125 |     except Exception as e:
126 |         # if any key is not present, don't use for training
127 |         logging.warning('Ignoring {} because: {}'.format(event, e), exc_info=True)
128 |         pass
129 | 
130 | 
131 | def compute_mean(events, col_name):
132 |     values = [float(event[col_name]) for event in events if col_name in event and event[col_name]]
133 |     return float(np.mean(values)) if len(values) > 0 else None
134 | 
135 | 
136 | def add_stats(element, window=beam.DoFn.WindowParam):
137 |     # result of a group-by, so this will be called once for each airport and window
138 |     # all averages here are by airport
139 |     airport = element[0]
140 |     events = element[1]
141 | 
142 |     # how late are flights leaving?
143 |     avg_dep_delay = compute_mean(events, 'DEP_DELAY')
144 |     avg_taxiout = compute_mean(events, 'TAXI_OUT')
145 | 
146 |     # remember that an event will be present for 60 minutes, but we want to emit
147 |     # it only if it has just arrived (if it is within 5 minutes of the start of the window)
148 |     emit_end_time = window.start + WINDOW_EVERY
149 |     for event in events:
150 |         event_time = to_datetime(event['WHEELS_OFF']).timestamp()
151 |         if event_time < emit_end_time:
152 |             event_plus_stat = event.copy()
153 |             event_plus_stat['AVG_DEP_DELAY'] = avg_dep_delay
154 |             event_plus_stat['AVG_TAXI_OUT'] = avg_taxiout
155 |             yield event_plus_stat
156 | 
157 | 
158 | def assign_timestamp(event):
159 |     try:
160 |         event_time = to_datetime(event['WHEELS_OFF'])
161 |         yield beam.window.TimestampedValue(event, event_time.timestamp())
162 |     except:
163 |         pass
164 | 
165 | 
166 | def is_normal_operation(event):
167 |     for flag in ['CANCELLED', 'DIVERTED']:
168 |         if flag in event:
169 |             s = str(event[flag]).lower()
170 |             if s == 'true':
171 |                 return False;  # cancelled or diverted
172 |     return True  # normal operation
173 | 
174 | 
175 | def transform_events_to_features(events, for_training=True):
176 |     # events are assigned the time at which predictions will have to be made -- the wheels off time
177 |     events = events | 'assign_time' >> beam.FlatMap(assign_timestamp)
178 |     events = events | 'remove_cancelled' >> beam.Filter(is_normal_operation)
179 | 
180 |     # compute stats by airport, and add to events
181 |     features = (
182 |             events
183 |             | 'window' >> beam.WindowInto(beam.window.SlidingWindows(WINDOW_DURATION, WINDOW_EVERY))
184 |             | 'by_airport' >> beam.Map(lambda x: (x['ORIGIN'], x))
185 |             | 'group_by_airport' >> beam.GroupByKey()
186 |             | 'events_and_stats' >> beam.FlatMap(add_stats)
187 |             | 'events_to_features' >> beam.FlatMap(lambda x: create_features_and_label(x, for_training))
188 |     )
189 | 
190 |     return features
191 | 


--------------------------------------------------------------------------------
/11_realtime/make_predictions.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | # Copyright 2021 Google Inc.
  4 | #
  5 | # Licensed under the Apache License, Version 2.0 (the "License");
  6 | # you may not use this file except in compliance with the License.
  7 | # You may obtain a copy of the License at
  8 | #
  9 | #     http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | 
 17 | import apache_beam as beam
 18 | import logging
 19 | import json
 20 | import os
 21 | 
 22 | from flightstxf import flights_transforms as ftxf
 23 | 
 24 | 
 25 | CSV_HEADER = 'event_time,dep_delay,taxi_out,distance,origin,dest,dep_hour,is_weekday,carrier,dep_airport_lat,dep_airport_lon,arr_airport_lat,arr_airport_lon,avg_dep_delay,avg_taxi_out,prob_ontime'
 26 | 
 27 | 
 28 | # class FlightsModelSharedInvoker(beam.DoFn):
 29 | #     # https://beam.apache.org/releases/pydoc/2.24.0/apache_beam.utils.shared.html
 30 | #     def __init__(self, shared_handle):
 31 | #         self._shared_handle = shared_handle
 32 | #
 33 | #     def process(self, input_data):
 34 | #         def create_endpoint():
 35 | #             from google.cloud import aiplatform
 36 | #             endpoint_name = 'flights-ch10'
 37 | #             endpoints = aiplatform.Endpoint.list(
 38 | #                 filter='display_name="{}"'.format(endpoint_name),
 39 | #                 order_by='create_time desc'
 40 | #             )
 41 | #             if len(endpoints) == 0:
 42 | #                 raise EnvironmentError("No endpoint named {}".format(endpoint_name))
 43 | #             logging.info("Found endpoint {}".format(endpoints[0]))
 44 | #             return endpoints[0]
 45 | #
 46 | #         # get already created endpoint if possible
 47 | #         endpoint = self._shared_handle.acquire(create_endpoint)
 48 | #
 49 | #         # call predictions and pull out probability
 50 | #         logging.info("Invoking ML model on {} flights".format(len(input_data)))
 51 | #         predictions = endpoint.predict(input_data).predictions
 52 | #         for idx, input_instance in enumerate(input_data):
 53 | #             result = input_instance.copy()
 54 | #             result['prob_ontime'] = predictions[idx][0]
 55 | #             yield result
 56 | 
 57 | 
 58 | class FlightsModelInvoker(beam.DoFn):
 59 |     def __init__(self):
 60 |         self.endpoint = None
 61 | 
 62 |     def setup(self):
 63 |         from google.cloud import aiplatform
 64 |         endpoint_name = 'flights-ch11'
 65 |         endpoints = aiplatform.Endpoint.list(
 66 |             filter='display_name="{}"'.format(endpoint_name),
 67 |             order_by='create_time desc'
 68 |         )
 69 |         if len(endpoints) == 0:
 70 |             raise EnvironmentError("No endpoint named {}".format(endpoint_name))
 71 |         logging.info("Found endpoint {}".format(endpoints[0]))
 72 |         self.endpoint = endpoints[0]
 73 | 
 74 |     def process(self, input_data):
 75 |         # call predictions and pull out probability
 76 |         logging.info("Invoking ML model on {} flights".format(len(input_data)))
 77 |         # drop inputs not needed by model
 78 |         features = [x.copy() for x in input_data]
 79 |         for f in features:
 80 |             f.pop('event_time')
 81 |         # call model
 82 |         predictions = self.endpoint.predict(features).predictions
 83 |         for idx, input_instance in enumerate(input_data):
 84 |             result = input_instance.copy()
 85 |             result['prob_ontime'] = predictions[idx][0]
 86 |             yield result
 87 | 
 88 | 
 89 | def run(project, bucket, region, source, sink):
 90 |     if source == 'local':
 91 |         logging.info('Running locally on small extract')
 92 |         argv = [
 93 |             '--project={0}'.format(project),
 94 |             '--runner=DirectRunner'
 95 |         ]
 96 |         flights_output = '/tmp/predictions'
 97 |     else:
 98 |         logging.info('Running in the cloud on full dataset input={}'.format(source))
 99 |         argv = [
100 |             '--project={0}'.format(project),
101 |             '--job_name=ch10predictions',
102 |             '--save_main_session',
103 |             '--staging_location=gs://{0}/flights/staging/'.format(bucket),
104 |             '--temp_location=gs://{0}/flights/temp/'.format(bucket),
105 |             '--setup_file=./setup.py',
106 |             '--autoscaling_algorithm=THROUGHPUT_BASED',
107 |             '--max_num_workers=8',
108 |             '--region={}'.format(region),
109 |             '--runner=DataflowRunner'
110 |         ]
111 |         if source == 'pubsub':
112 |             logging.info("Turning on streaming. Cancel the pipeline from GCP console")
113 |             argv += ['--streaming']
114 |         flights_output = 'gs://{}/flights/ch11/predictions'.format(bucket)
115 | 
116 |     with beam.Pipeline(argv=argv) as pipeline:
117 | 
118 |         # read the event stream
119 |         if source == 'local':
120 |             input_file = './simevents_sample.json'
121 |             logging.info("Reading from {} ... Writing to {}".format(input_file, flights_output))
122 |             events = (
123 |                     pipeline
124 |                     | 'read_input' >> beam.io.ReadFromText(input_file)
125 |                     | 'parse_input' >> beam.Map(lambda line: json.loads(line))
126 |             )
127 |         elif source == 'bigquery':
128 |             input_query = ("SELECT EVENT_DATA FROM dsongcp.flights_simevents " +
129 |                            "WHERE EVENT_TIME BETWEEN '2015-03-01' AND '2015-03-02'")
130 |             logging.info("Reading from {} ... Writing to {}".format(input_query, flights_output))
131 |             events = (
132 |                     pipeline
133 |                     | 'read_input' >> beam.io.ReadFromBigQuery(query=input_query, use_standard_sql=True)
134 |                     | 'parse_input' >> beam.Map(lambda row: json.loads(row['EVENT_DATA']))
135 |             )
136 |         elif source == 'pubsub':
137 |             input_topic = "projects/{}/topics/wheelsoff".format(project)
138 |             logging.info("Reading from {} ... Writing to {}".format(input_topic, flights_output))
139 |             events = (
140 |                     pipeline
141 |                     | 'read_input' >> beam.io.ReadFromPubSub(topic=input_topic,
142 |                                                              timestamp_attribute='EventTimeStamp')
143 |                     | 'parse_input' >> beam.Map(lambda s: json.loads(s))
144 |             )
145 |         else:
146 |             logging.error("Unknown input type {}".format(source))
147 |             return
148 | 
149 |         # events -> features.  See ./flights_transforms.py for the code shared between training & prediction
150 |         features = ftxf.transform_events_to_features(events, for_training=False)
151 | 
152 |         # call model endpoint
153 |         # shared_handle = beam.utils.shared.Shared()
154 |         preds = (
155 |                 features
156 |                 | 'into_global' >> beam.WindowInto(beam.window.GlobalWindows())
157 |                 | 'batch_instances' >> beam.BatchElements(min_batch_size=1, max_batch_size=64)
158 |                 | 'model_predict' >> beam.ParDo(FlightsModelInvoker())
159 |         )
160 | 
161 |         # write it out
162 |         if sink == 'file':
163 |             (preds
164 |              | 'to_string' >> beam.Map(lambda f: ','.join([str(x) for x in f.values()]))
165 |              | 'to_gcs' >> beam.io.textio.WriteToText(flights_output,
166 |                                                       file_name_suffix='.csv', header=CSV_HEADER,
167 |                                                       # workaround b/207384805
168 |                                                       num_shards=1)
169 |              )
170 |         elif sink == 'bigquery':
171 |             preds_schema = ','.join([
172 |                 'event_time:timestamp',
173 |                 'prob_ontime:float',
174 |                 'dep_delay:float',
175 |                 'taxi_out:float',
176 |                 'distance:float',
177 |                 'origin:string',
178 |                 'dest:string',
179 |                 'dep_hour:integer',
180 |                 'is_weekday:integer',
181 |                 'carrier:string',
182 |                 'dep_airport_lat:float,dep_airport_lon:float',
183 |                 'arr_airport_lat:float,arr_airport_lon:float',
184 |                 'avg_dep_delay:float',
185 |                 'avg_taxi_out:float',
186 |             ])
187 |             (preds
188 |              | 'to_bigquery' >> beam.io.WriteToBigQuery(
189 |                         'dsongcp.streaming_preds', schema=preds_schema,
190 |                         # write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
191 |                         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
192 |                         method='STREAMING_INSERTS'
193 |                     )
194 |              )
195 |         else:
196 |             logging.error("Unknown output type {}".format(sink))
197 |             return
198 | 
199 | 
200 | if __name__ == '__main__':
201 |     import argparse
202 | 
203 |     parser = argparse.ArgumentParser(description='Create training CSV file that includes time-aggregate features')
204 |     parser.add_argument('-p', '--project', help='Project to be billed for Dataflow/BigQuery', required=True)
205 |     parser.add_argument('-b', '--bucket', help='data will be read from written to gs://BUCKET/flights/ch11/')
206 |     parser.add_argument('-r', '--region', help='Region to run Dataflow job. Choose the same region as your bucket.')
207 |     parser.add_argument('-i', '--input', help='local, bigquery OR pubsub', required=True)
208 |     parser.add_argument('-o', '--output', help='file, bigquery OR bigtable', default='file')
209 | 
210 |     logging.getLogger().setLevel(logging.INFO)
211 |     args = vars(parser.parse_args())
212 | 
213 |     if args['input'] != 'local':
214 |         if not args['bucket'] or not args['project'] or not args['region']:
215 |             print("Project, Bucket, Region are needed in order to run on the cloud on full dataset.")
216 |             parser.print_help()
217 |             parser.exit()
218 | 
219 |     run(project=args['project'], bucket=args['bucket'], region=args['region'],
220 |         source=args['input'], sink=args['output'])
221 | 


--------------------------------------------------------------------------------
/11_realtime/setup.py:
--------------------------------------------------------------------------------
  1 | #
  2 | # Licensed to the Apache Software Foundation (ASF) under one or more
  3 | # contributor license agreements.  See the NOTICE file distributed with
  4 | # this work for additional information regarding copyright ownership.
  5 | # The ASF licenses this file to You under the Apache License, Version 2.0
  6 | # (the "License"); you may not use this file except in compliance with
  7 | # the License.  You may obtain a copy of the License at
  8 | #
  9 | #    http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | #
 17 | 
 18 | """Setup.py module for the workflow's worker utilities.
 19 | 
 20 | All the workflow related code is gathered in a package that will be built as a
 21 | source distribution, staged in the staging area for the workflow being run and
 22 | then installed in the workers when they start running.
 23 | 
 24 | This behavior is triggered by specifying the --setup_file command line option
 25 | when running the workflow for remote execution.
 26 | """
 27 | 
 28 | from distutils.command.build import build as _build
 29 | import subprocess
 30 | 
 31 | import setuptools
 32 | 
 33 | 
 34 | # This class handles the pip install mechanism.
 35 | class build(_build):  # pylint: disable=invalid-name
 36 |   """A build command class that will be invoked during package install.
 37 | 
 38 |   The package built using the current setup.py will be staged and later
 39 |   installed in the worker using `pip install package'. This class will be
 40 |   instantiated during install for this specific scenario and will trigger
 41 |   running the custom commands specified.
 42 |   """
 43 |   sub_commands = _build.sub_commands + [('CustomCommands', None)]
 44 | 
 45 | 
 46 | # Some custom command to run during setup. The command is not essential for this
 47 | # workflow. It is used here as an example. Each command will spawn a child
 48 | # process. Typically, these commands will include steps to install non-Python
 49 | # packages. For instance, to install a C++-based library libjpeg62 the following
 50 | # two commands will have to be added:
 51 | #
 52 | #     ['apt-get', 'update'],
 53 | #     ['apt-get', '--assume-yes', install', 'libjpeg62'],
 54 | #
 55 | # First, note that there is no need to use the sudo command because the setup
 56 | # script runs with appropriate access.
 57 | # Second, if apt-get tool is used then the first command needs to be 'apt-get
 58 | # update' so the tool refreshes itself and initializes links to download
 59 | # repositories.  Without this initial step the other apt-get install commands
 60 | # will fail with package not found errors. Note also --assume-yes option which
 61 | # shortcuts the interactive confirmation.
 62 | #
 63 | # The output of custom commands (including failures) will be logged in the
 64 | # worker-startup log.
 65 | CUSTOM_COMMANDS = [
 66 | ]
 67 | 
 68 | 
 69 | class CustomCommands(setuptools.Command):
 70 |   """A setuptools Command class able to run arbitrary commands."""
 71 | 
 72 |   def initialize_options(self):
 73 |     pass
 74 | 
 75 |   def finalize_options(self):
 76 |     pass
 77 | 
 78 |   def RunCustomCommand(self, command_list):
 79 |     print ('Running command: %s' % command_list)
 80 |     p = subprocess.Popen(
 81 |         command_list,
 82 |         stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
 83 |     # Can use communicate(input='y\n'.encode()) if the command run requires
 84 |     # some confirmation.
 85 |     stdout_data, _ = p.communicate()
 86 |     print ('Command output: %s' % stdout_data)
 87 |     if p.returncode != 0:
 88 |       raise RuntimeError(
 89 |           'Command %s failed: exit code: %s' % (command_list, p.returncode))
 90 | 
 91 |   def run(self):
 92 |     for command in CUSTOM_COMMANDS:
 93 |       self.RunCustomCommand(command)
 94 | 
 95 | 
 96 | # Configure the required packages and scripts to install.
 97 | # Note that the Python Dataflow containers come with numpy already installed
 98 | # so this dependency will not trigger anything to be installed unless a version
 99 | # restriction is specified.
100 | REQUIRED_PACKAGES = [
101 |     'pyfarmhash',
102 |     'google-cloud-aiplatform',
103 |     'cloudml-hypertune',
104 |      'dill==0.3.1.1'
105 | ]
106 | 
107 | 
108 | setuptools.setup(
109 |     name='flightsdf',
110 |     version='0.0.1',
111 |     description='Data Science on GCP flights training and prediction pipelines',
112 |     install_requires=REQUIRED_PACKAGES,
113 |     packages=setuptools.find_packages(),
114 |     cmdclass={
115 |         # Command class instantiated and run during pip install scenarios.
116 |         'build': build,
117 |         'CustomCommands': CustomCommands,
118 |         }
119 |     )
120 | 


--------------------------------------------------------------------------------
/12_fulldataset/README.md:
--------------------------------------------------------------------------------
 1 | # Full Dataset
 2 | 
 3 | #### [Optional] Train on 2015-2018 and evaluate on 2019
 4 | Note that this will take many hours and require significant resources.
 5 | There is a reason why I have worked with only 1 year of data so far in the book.
 6 | * [5 min] Erase the current contents of your bucket and BigQuery dataset:
 7 |   ```
 8 |   gsutil -m rm -rf gs://BUCKET/*
 9 |   bq rm -r -f dsongcp
10 |   ```
11 | * [28h or 2 min] Create Training Dataset OR Copy it from my bucket
12 |   * [28 hours] Create Training Dataset
13 |     * [30 min] Ingest raw files:
14 |       * cd 02_ingest
15 |       * Edit the YEARS in 02_ingest/ingest.sh to process 2015 to 2019.
16 |       * Run ./ingest.sh program
17 |     * [2 min] Create views
18 |       * cd ../03_sqlstudio
19 |       * ./create_views.sh
20 |     * [40 min] Do time correction
21 |       * cd ../04_streaming/transform
22 |       * ./stage_airports_file.sh $BUCKET
23 |       * Increase number of workers in df07.py to 20 or the limit of your quota
24 |       * python3 df07.py --project $PROJECT --bucket $BUCKET --region $REGION 
25 |     * [26 hours] Create training dataset
26 |       * cd ../11_realtime
27 |       * Edit flightstxf/create_traindata.py changing the line
28 |         ```
29 |         'data_split': get_data_split(event['FL_DATE'])
30 |         ```
31 |         to
32 |         ```
33 |         'data_split': get_data_split_2019(event['FL_DATE'])
34 |         ```
35 |       * Change the worker type to m1-ultramem-40 and disksize to 500 GB in the run() method of create_traindata.py.
36 |       * Create full training dataset
37 |         ```
38 |         python3 create_traindata.py --input bigquery --project $PROJECT --bucket $BUCKET --region $REGION
39 |         ```
40 |   * [2 min] Copy the full training data set from my bucket:
41 |       ```
42 |       gsutil cp \
43 |          gs://data-science-on-gcp/edition2/ch12_fulldataset/all-00000-of-00001.csv \
44 |          gs://$BUCKET/ch11/data/all-00000-of-00001.csv
45 |       ```
46 |  
47 | * [5 hr] Train AutoML model so that we have evaluation statistics in BigQuery:
48 |   ```
49 |   cd 11_realtime
50 |   python3 train_on_vertexai.py --automl --project $PROJECT --bucket $BUCKET --region $REGION
51 |   ```
52 | * Open the notebook evaluation.ipynb in Vertex Workbench and run the cells.
53 | 


--------------------------------------------------------------------------------
/COPYRIGHT:
--------------------------------------------------------------------------------
 1 | Copyright Google Inc. 2016
 2 | Licensed under the Apache License, Version 2.0 (the "License");
 3 | you may not use this file except in compliance with the License.
 4 | You may obtain a copy of the License at
 5 | http://www.apache.org/licenses/LICENSE-2.0
 6 | Unless required by applicable law or agreed to in writing, software
 7 | distributed under the License is distributed on an "AS IS" BASIS,
 8 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 9 | See the License for the specific language governing permissions and
10 | limitations under the License.
11 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 |                                  Apache License
  2 |                            Version 2.0, January 2004
  3 |                         http://www.apache.org/licenses/
  4 | 
  5 |    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
  6 | 
  7 |    1. Definitions.
  8 | 
  9 |       "License" shall mean the terms and conditions for use, reproduction,
 10 |       and distribution as defined by Sections 1 through 9 of this document.
 11 | 
 12 |       "Licensor" shall mean the copyright owner or entity authorized by
 13 |       the copyright owner that is granting the License.
 14 | 
 15 |       "Legal Entity" shall mean the union of the acting entity and all
 16 |       other entities that control, are controlled by, or are under common
 17 |       control with that entity. For the purposes of this definition,
 18 |       "control" means (i) the power, direct or indirect, to cause the
 19 |       direction or management of such entity, whether by contract or
 20 |       otherwise, or (ii) ownership of fifty percent (50%) or more of the
 21 |       outstanding shares, or (iii) beneficial ownership of such entity.
 22 | 
 23 |       "You" (or "Your") shall mean an individual or Legal Entity
 24 |       exercising permissions granted by this License.
 25 | 
 26 |       "Source" form shall mean the preferred form for making modifications,
 27 |       including but not limited to software source code, documentation
 28 |       source, and configuration files.
 29 | 
 30 |       "Object" form shall mean any form resulting from mechanical
 31 |       transformation or translation of a Source form, including but
 32 |       not limited to compiled object code, generated documentation,
 33 |       and conversions to other media types.
 34 | 
 35 |       "Work" shall mean the work of authorship, whether in Source or
 36 |       Object form, made available under the License, as indicated by a
 37 |       copyright notice that is included in or attached to the work
 38 |       (an example is provided in the Appendix below).
 39 | 
 40 |       "Derivative Works" shall mean any work, whether in Source or Object
 41 |       form, that is based on (or derived from) the Work and for which the
 42 |       editorial revisions, annotations, elaborations, or other modifications
 43 |       represent, as a whole, an original work of authorship. For the purposes
 44 |       of this License, Derivative Works shall not include works that remain
 45 |       separable from, or merely link (or bind by name) to the interfaces of,
 46 |       the Work and Derivative Works thereof.
 47 | 
 48 |       "Contribution" shall mean any work of authorship, including
 49 |       the original version of the Work and any modifications or additions
 50 |       to that Work or Derivative Works thereof, that is intentionally
 51 |       submitted to Licensor for inclusion in the Work by the copyright owner
 52 |       or by an individual or Legal Entity authorized to submit on behalf of
 53 |       the copyright owner. For the purposes of this definition, "submitted"
 54 |       means any form of electronic, verbal, or written communication sent
 55 |       to the Licensor or its representatives, including but not limited to
 56 |       communication on electronic mailing lists, source code control systems,
 57 |       and issue tracking systems that are managed by, or on behalf of, the
 58 |       Licensor for the purpose of discussing and improving the Work, but
 59 |       excluding communication that is conspicuously marked or otherwise
 60 |       designated in writing by the copyright owner as "Not a Contribution."
 61 | 
 62 |       "Contributor" shall mean Licensor and any individual or Legal Entity
 63 |       on behalf of whom a Contribution has been received by Licensor and
 64 |       subsequently incorporated within the Work.
 65 | 
 66 |    2. Grant of Copyright License. Subject to the terms and conditions of
 67 |       this License, each Contributor hereby grants to You a perpetual,
 68 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 69 |       copyright license to reproduce, prepare Derivative Works of,
 70 |       publicly display, publicly perform, sublicense, and distribute the
 71 |       Work and such Derivative Works in Source or Object form.
 72 | 
 73 |    3. Grant of Patent License. Subject to the terms and conditions of
 74 |       this License, each Contributor hereby grants to You a perpetual,
 75 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 76 |       (except as stated in this section) patent license to make, have made,
 77 |       use, offer to sell, sell, import, and otherwise transfer the Work,
 78 |       where such license applies only to those patent claims licensable
 79 |       by such Contributor that are necessarily infringed by their
 80 |       Contribution(s) alone or by combination of their Contribution(s)
 81 |       with the Work to which such Contribution(s) was submitted. If You
 82 |       institute patent litigation against any entity (including a
 83 |       cross-claim or counterclaim in a lawsuit) alleging that the Work
 84 |       or a Contribution incorporated within the Work constitutes direct
 85 |       or contributory patent infringement, then any patent licenses
 86 |       granted to You under this License for that Work shall terminate
 87 |       as of the date such litigation is filed.
 88 | 
 89 |    4. Redistribution. You may reproduce and distribute copies of the
 90 |       Work or Derivative Works thereof in any medium, with or without
 91 |       modifications, and in Source or Object form, provided that You
 92 |       meet the following conditions:
 93 | 
 94 |       (a) You must give any other recipients of the Work or
 95 |           Derivative Works a copy of this License; and
 96 | 
 97 |       (b) You must cause any modified files to carry prominent notices
 98 |           stating that You changed the files; and
 99 | 
100 |       (c) You must retain, in the Source form of any Derivative Works
101 |           that You distribute, all copyright, patent, trademark, and
102 |           attribution notices from the Source form of the Work,
103 |           excluding those notices that do not pertain to any part of
104 |           the Derivative Works; and
105 | 
106 |       (d) If the Work includes a "NOTICE" text file as part of its
107 |           distribution, then any Derivative Works that You distribute must
108 |           include a readable copy of the attribution notices contained
109 |           within such NOTICE file, excluding those notices that do not
110 |           pertain to any part of the Derivative Works, in at least one
111 |           of the following places: within a NOTICE text file distributed
112 |           as part of the Derivative Works; within the Source form or
113 |           documentation, if provided along with the Derivative Works; or,
114 |           within a display generated by the Derivative Works, if and
115 |           wherever such third-party notices normally appear. The contents
116 |           of the NOTICE file are for informational purposes only and
117 |           do not modify the License. You may add Your own attribution
118 |           notices within Derivative Works that You distribute, alongside
119 |           or as an addendum to the NOTICE text from the Work, provided
120 |           that such additional attribution notices cannot be construed
121 |           as modifying the License.
122 | 
123 |       You may add Your own copyright statement to Your modifications and
124 |       may provide additional or different license terms and conditions
125 |       for use, reproduction, or distribution of Your modifications, or
126 |       for any such Derivative Works as a whole, provided Your use,
127 |       reproduction, and distribution of the Work otherwise complies with
128 |       the conditions stated in this License.
129 | 
130 |    5. Submission of Contributions. Unless You explicitly state otherwise,
131 |       any Contribution intentionally submitted for inclusion in the Work
132 |       by You to the Licensor shall be under the terms and conditions of
133 |       this License, without any additional terms or conditions.
134 |       Notwithstanding the above, nothing herein shall supersede or modify
135 |       the terms of any separate license agreement you may have executed
136 |       with Licensor regarding such Contributions.
137 | 
138 |    6. Trademarks. This License does not grant permission to use the trade
139 |       names, trademarks, service marks, or product names of the Licensor,
140 |       except as required for reasonable and customary use in describing the
141 |       origin of the Work and reproducing the content of the NOTICE file.
142 | 
143 |    7. Disclaimer of Warranty. Unless required by applicable law or
144 |       agreed to in writing, Licensor provides the Work (and each
145 |       Contributor provides its Contributions) on an "AS IS" BASIS,
146 |       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 |       implied, including, without limitation, any warranties or conditions
148 |       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 |       PARTICULAR PURPOSE. You are solely responsible for determining the
150 |       appropriateness of using or redistributing the Work and assume any
151 |       risks associated with Your exercise of permissions under this License.
152 | 
153 |    8. Limitation of Liability. In no event and under no legal theory,
154 |       whether in tort (including negligence), contract, or otherwise,
155 |       unless required by applicable law (such as deliberate and grossly
156 |       negligent acts) or agreed to in writing, shall any Contributor be
157 |       liable to You for damages, including any direct, indirect, special,
158 |       incidental, or consequential damages of any character arising as a
159 |       result of this License or out of the use or inability to use the
160 |       Work (including but not limited to damages for loss of goodwill,
161 |       work stoppage, computer failure or malfunction, or any and all
162 |       other commercial damages or losses), even if such Contributor
163 |       has been advised of the possibility of such damages.
164 | 
165 |    9. Accepting Warranty or Additional Liability. While redistributing
166 |       the Work or Derivative Works thereof, You may choose to offer,
167 |       and charge a fee for, acceptance of support, warranty, indemnity,
168 |       or other liability obligations and/or rights consistent with this
169 |       License. However, in accepting such obligations, You may act only
170 |       on Your own behalf and on Your sole responsibility, not on behalf
171 |       of any other Contributor, and only if You agree to indemnify,
172 |       defend, and hold each Contributor harmless for any liability
173 |       incurred by, or claims asserted against, such Contributor by reason
174 |       of your accepting any such warranty or additional liability.
175 | 
176 |    END OF TERMS AND CONDITIONS
177 | 
178 |    APPENDIX: How to apply the Apache License to your work.
179 | 
180 |       To apply the Apache License to your work, attach the following
181 |       boilerplate notice, with the fields enclosed by brackets "[]"
182 |       replaced with your own identifying information. (Don't include
183 |       the brackets!)  The text should be enclosed in the appropriate
184 |       comment syntax for the file format. We also recommend that a
185 |       file or class name and description of purpose be included on the
186 |       same "printed page" as the copyright notice for easier
187 |       identification within third-party archives.
188 | 
189 |    Copyright [yyyy] [name of copyright owner]
190 | 
191 |    Licensed under the Apache License, Version 2.0 (the "License");
192 |    you may not use this file except in compliance with the License.
193 |    You may obtain a copy of the License at
194 | 
195 |        http://www.apache.org/licenses/LICENSE-2.0
196 | 
197 |    Unless required by applicable law or agreed to in writing, software
198 |    distributed under the License is distributed on an "AS IS" BASIS,
199 |    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 |    See the License for the specific language governing permissions and
201 |    limitations under the License.
202 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # data-science-on-gcp
 2 | 
 3 | Source code accompanying book:
 4 | 
 5 | <table>
 6 | <tr>
 7 |   <td>
 8 |   <img src="cover_edition2.jpg" height="100"/>
 9 |   </td>
10 |   <td>
11 |   <a href="https://www.amazon.com/Data-Science-Google-Cloud-Platform/dp/1098118952/">Data Science on the Google Cloud Platform, 2nd Edition</a> <br/>
12 |   Valliappa Lakshmanan <br/>
13 |   O'Reilly, Apr 2022
14 |   </td>
15 |   <td>
16 |   Branch <a href="https://github.com/GoogleCloudPlatform/data-science-on-gcp/">2nd Edition</a> [also main]
17 |   </td>
18 | </tr>
19 | <tr>
20 |   <td>
21 |   <img src="https://images-na.ssl-images-amazon.com/images/I/51dgw%2BCYSOL._SX379_BO1,204,203,200_.jpg" height="100"/>
22 |   </td>
23 |   <td>
24 |   Data Science on the Google Cloud Platform <br/>
25 |   Valliappa Lakshmanan <br/>
26 |   O'Reilly, Jan 2017
27 |   </td>
28 |   <td>
29 |   Branch <a href="https://github.com/GoogleCloudPlatform/data-science-on-gcp/tree/edition1_tf2">edition1_tf2</a> (obsolete, and will not be maintained)
30 |   </td>
31 | </table>
32 | 
33 | ### Try out the code on Google Cloud Platform
34 | <a href="https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/GoogleCloudPlatform/data-science-on-gcp&page=editor&open_in_editor=README.md"> <img alt="Open in Cloud Shell" src ="http://gstatic.com/cloudssh/images/open-btn.png"></a>
35 | 
36 | The code on Qwiklabs (see below) is **continually tested**, and this repo is kept up-to-date.
37 | 
38 | If the code doesn't work for you, I recommend that you try the corresponding Qwiklab lab to see if there is some step that you missed.
39 | If you still have problems, please leave feedback in Qwiklabs, or file an issue in this repo.
40 | 
41 | ### Try out the code on Qwiklabs
42 | 
43 | - [Data Science on the Google Cloud Platform Quest](https://google.qwiklabs.com/quests/43)
44 | - [Data Science on Google Cloud Platform: Machine Learning Quest](https://google.qwiklabs.com/quests/50)
45 | 
46 | 
47 | 
48 | ### Purchase book
49 | [Read on-line or download PDF of book](https://www.oreilly.com/library/view/data-science-on/9781098118945/)
50 | 
51 | [Buy on Amazon.com](https://www.amazon.com/Data-Science-Google-Cloud-Platform-dp-1098118952/dp/1098118952/)
52 | 
53 | ### Updates to book
54 | I updated the book in Nov 2019 with TensorFlow 2.0, Cloud Functions, and BigQuery ML.
55 | 


--------------------------------------------------------------------------------
/cover_edition2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GoogleCloudPlatform/data-science-on-gcp/652564b9feeeaab331ce27fdd672b8226ba1e837/cover_edition2.jpg


--------------------------------------------------------------------------------