├── images
│   ├── tmp
│   ├── sql_struct.jpg
│   ├── storage-table.jpg
│   └── cloud-dataflow-beam.jpg
└── README.md

/images/tmp:
--------------------------------------------------------------------------------
1 | 
2 | 
--------------------------------------------------------------------------------
/images/sql_struct.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pekoto-zz/GCP-Big-Data-ML/HEAD/images/sql_struct.jpg
--------------------------------------------------------------------------------
/images/storage-table.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pekoto-zz/GCP-Big-Data-ML/HEAD/images/storage-table.jpg
--------------------------------------------------------------------------------
/images/cloud-dataflow-beam.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pekoto-zz/GCP-Big-Data-ML/HEAD/images/cloud-dataflow-beam.jpg
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # GCP-Big-Data-ML
2 | Notes for the Google Cloud Platform Big Data and Machine Learning Fundamentals course.
3 | 
4 | https://www.coursera.org/learn/gcp-big-data-ml-fundamentals/home/welcome
5 | 
6 | https://console.cloud.google.com
7 | 
8 | # WEEK 1
9 | 
10 | ## 1. Google Cloud Architecture
11 | 
12 | -----------------------------------------
13 | 2. Big Data/ML Products
14 | -----------------------------------------
15 | 1. Compute Power | Storage | Networking
16 | -----------------------------------------
17 | 0. Security
18 | -----------------------------------------
19 | 
20 | What each layer provides:
21 | 2. Abstracts away scaling and infrastructure
22 | 
23 | 1. Processes, stores, and delivers data pipelines, ML models, etc.
24 | 
25 | 0. Auth
26 | 
27 | ## 2. Creating a VM on Compute Engine
28 | 
29 | ### 2.1 Setting up the VM
30 | 
31 | 1. https://cloud.google.com
32 | 2. Compute --> Compute Engine --> VM Instances
33 | 3. Create a new VM
34 | 
35 | 3.1 Allow full access to cloud APIs
36 | 
37 | 3.2 Since we will access the VM through SSH, we don't need to allow HTTP or HTTPS traffic
38 | 
39 | 4. Click SSH to connect
40 | 
41 | At this stage, the new VM has no software.
42 | 
43 | We can just install software as normal:
44 | 
45 | `sudo apt-get install git`
46 | 
47 | (APT: Advanced Package Tool, a package manager)
48 | 
49 | ### 2.2 Sample earthquake data
50 | 
51 | After installing git, we can pull down the data from our repo:
52 | 
53 | `git clone https://www.github.com/GoogleCloudPlatform/training-data-analyst`
54 | 
55 | Course materials are in:
56 | 
57 | `training-data-analyst/courses/bdml_fundamentals`
58 | 
59 | Go to the Earthquake sample in `earthquakevm`.
60 | 
61 | The `ingest.sh` shell script ingests the data (view it with `less ingest.sh`). It basically just deletes any existing data and then `wget`s a new CSV file containing the data.
62 | 
63 | Now, `transform.py` contains a Python script that parses the CSV file and creates a PNG from it using `matplotlib` (
64 | [matplotlib notes](https://nbviewer.jupyter.org/github/pekoto/MyJupyterNotes/blob/master/Python%20for%20Data%20Analysis.ipynb)).
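`transform.py` itself isn't reproduced in these notes, but the idea is roughly the following. This is a minimal sketch only -- the file name and CSV column names are illustrative, not the actual script's:

```python
# Rough sketch of a CSV-to-PNG transform (not the course's transform.py).
import csv

import matplotlib
matplotlib.use("Agg")  # render off-screen; the VM has no display attached
import matplotlib.pyplot as plt

lats, lons, mags = [], [], []
with open("earthquakes.csv") as f:          # file produced by ingest.sh
    for row in csv.DictReader(f):
        lats.append(float(row["latitude"]))
        lons.append(float(row["longitude"]))
        mags.append(float(row["mag"]))

# Plot each quake at its location, sized by magnitude, then save to disk.
plt.scatter(lons, lats, s=[m * 10 for m in mags], alpha=0.3)
plt.title("Earthquakes by location and magnitude")
plt.savefig("earthquakes.png")              # the image we later copy to the bucket
```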
65 | 
66 | ([File details](https://github.com/GoogleCloudPlatform/datalab-samples/blob/master/basemap/earthquakes.ipynb))
67 | 
68 | Run `./install_missing.sh` to get the missing Python libraries, and run the scripts mentioned above to get the CSV and generate the image.
69 | 
70 | ### 2.3 Transferring the data to bucket storage
71 | 
72 | Now that we have generated the image, let's get it off the VM so we can delete the VM.
73 | 
74 | To do this, we need to create some storage:
75 | Storage > Browser > Create Bucket
76 | 
77 | Now, to view our bucket, we can use:
78 | 
79 | `gsutil ls gs://[bucketname]` (this is why bucket names need to be globally unique)
80 | 
81 | To copy our data to the bucket, we can use:
82 | 
83 | `gsutil cp earthquakes.* gs://[bucketname]`
84 | 
85 | ### 2.4 Stopping/deleting the VM
86 | 
87 | Now that we're finished with our VM, we can either:
88 | 
89 | __STOP__
90 | Stop the machine. You will still pay for the disk, but not the processing power.
91 | 
92 | __DELETE__
93 | Delete the machine. You won't pay for anything, but obviously you will lose all of the data.
94 | 
95 | ### 2.5 Viewing the assets
96 | 
97 | Now, we want to make our assets in storage publicly available.
98 | To do this:
99 | Select Files > Permissions > Add Members > `allUsers` > Grant `Storage Object Viewer` role.
100 | 
101 | Now, you can use the public link to [view your assets](https://storage.googleapis.com/earthquakeg/earthquakes.htm).
102 | 
103 | ## 3. Storage
104 | 
105 | There are 4 storage classes:
106 | 
107 | 1. __Multiregional__: Optimized for geo-redundancy and end-user latency
108 | 2. __Regional__: High-performance local access (common for data pipelines you want to run, rather than give global access)
109 | 3. __Nearline__: Data accessed less than once a month
110 | 4. __Coldline__: Data accessed less than once a year
111 | 
112 | ## 4. Edge Network (networking)
113 | 
114 | An edge node receives the user's request and passes it to the nearest Google data center.
115 | 
116 | Consider node types (Hadoop):
117 | 
118 | 1. Master node: Controls which nodes perform which tasks. Most work is assigned to...
119 | 2. Worker node: Stores data and performs calculations
120 | 3. Edge node: Facilitates communication between users and master/worker nodes
121 | 
122 | __Edge computing__: Brings computation and data storage closer to the location where it's needed. In contrast to cloud computing, edge computing performs decentralized computing at the edge of the network.
123 | 
124 | __Edge Node (aka Google Global Cache - GGC)__: Points close to the user. Network operators and ISPs deploy Google servers inside their network. E.g., YouTube videos could be cached on these edge nodes.
125 | 
126 | __Edge Point of Presence__: Locations where Google connects its network to the rest of the internet.
127 | 
128 | https://peering.google.com/#/
129 | 
130 | ## 5. Security
131 | 
132 | Use Google IAM, etc. to provide security.
133 | 
134 | BigQuery data is encrypted.
135 | 
136 | ## 6. Big Data Tool History
137 | 
138 | 1. __GFS__: Data can be stored in a distributed fashion
139 | 2. __MapReduce__: Distributed, large-scale processing across server clusters (Hadoop is an open-source implementation)
140 | 3. __BigTable__: Records high volumes of data
141 | 4. __Dremel__: Breaks data into small chunks (shards) and compresses them into a columnar format, then uses a query optimizer to process queries in parallel (the service auto-manages data imbalances and scale)
142 | 5. __Colossus, TensorFlow__: And more...
143 | 144 | ## (Note on Columnar Databases) 145 | 146 | Consider typical row-oriented databases. The data is stored in a row. This is great when you want to query multiple columns from a single row. 147 | 148 | However, what if you want to get all of the data from all rows from a single column? 149 | 150 | For this, we need to read every row, picking out just the columns we want. For example, we want the average age of all the people. This will be slower in row-oriented databases, because even an index on the age column wouldn't help. The DB will just do a sequential scan. 151 | 152 | Instead we can store the data column-wise. How does it know which columns to join together into a single set? Each column has a link back to the row number. Or, in terms of implementation, each column could be stored in a separate file. Then, each column for a given row is stored at the same offset in a given file. When you scan a column and find the data you want, the rest of the data for that record will be at the same offset in the other files. 153 | 154 | 155 | __Row oriented__ 156 | ``` 157 | RowId EmpId Lastname Firstname Salary 158 | 001 10 Smith Joe 40000 159 | 002 12 Jones Mary 50000 160 | 003 11 Johnson Cathy 44000 161 | 004 22 Jones Bob 55000 162 | ``` 163 | 164 | __Column oriented__ 165 | ``` 166 | 10:001,12:002,11:003,22:004; 167 | Smith:001,Jones:002,Johnson:003,Jones:004; 168 | Joe:001,Mary:002,Cathy:003,Bob:004; 169 | 40000:001,50000:002,44000:003,55000:004; 170 | ``` 171 | 172 | In terms of IO improvements, if you have 100 rows with 100 columns, it will be the difference between reading 100x2 vs. 100x100. 173 | 174 | It also becomes easier to horizontally scale -- make a new file to store more of that column. 175 | 176 | It is also easier to add a column -- just add a new file. 177 | 178 | ## Lab 1 179 | 180 | __Public datasets__ 181 | 182 | Go to `BigQuery > Resources > Add Data > Explore public datasets` to add publically available datasets. 183 | 184 | You can query the datasets using SQL syntax: 185 | 186 | ``` 187 | SELECT 188 | name, gender, 189 | SUM(number) AS total 190 | FROM 191 | `bigquery-public-data.usa_names.usa_1910_2013` 192 | GROUP BY 193 | name, gender 194 | ORDER BY 195 | total DESC 196 | LIMIT 197 | 10 198 | ``` 199 | 200 | Before you run the query, the query validator in the bottom right will show much data is going to be run. 201 | 202 | __Creating your own dataset__ 203 | 204 | Resources seem to have the following structure: 205 | 206 | `Resources > Project > Dataset > Table` 207 | 208 | To add your own dataset: 209 | 210 | `Resources > click Project ID > Create dataset` 211 | (For example babynames) 212 | 213 | Then, we can add a table to this dataset. 214 | 215 | `Click dataset name > Create table` 216 | 217 | * Source > upload from file 218 | * File format > CSV 219 | * Table name > names_2014 220 | * Schema > name:string,gender:string,count:integer 221 | 222 | Creating the table will run a job. The table will be created after the job completes. 223 | 224 | Click preview to see some of the data. 225 | 226 | Then you can query the table like before. 227 | 228 | SQL Syntax for BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/ 229 | 230 | ## 7. GCP Approaches 231 | 232 | Google Cloud Platform applications: 233 | 234 | 1. __Compute Engine__: Infrastructure as a service. Run virtual machines on demand in the cloud. 235 | 2. __Google Kubernetes Engine (GKE)__: Clusters of machines running containers (code packages with dependencies) 236 | 3. 
__App Engine__: Platform as a service (PaaS). Run code in the cloud without worrying about infrastructure.
237 | 4. __Cloud Functions__: Serverless environment. Functions as a service (FaaS). Executes code in response to events.
238 | 
239 | App Engine: long-lived web applications that auto-scale to billions of users.
240 | 
241 | Cloud Functions: code triggered by an event, such as a new file being uploaded.
242 | 
243 | __Serverless__
244 | 
245 | Although this can mean different things, typically it now means functions as a service (FaaS). I.e., application code is hosted by a third party, eliminating the need for server and hardware management. Applications are broken into functions that are scaled automatically.
246 | 
247 | ## 8. Recommendation Systems
248 | 
249 | A recommendation system requires 3 things:
250 | 
251 | * Data
252 | * Model
253 | * Training/serving infrastructure
254 | 
255 | _Point_: Train the model on data, not rules.
256 | 
257 | __RankBrain__
258 | 
259 | ML applied to Google search. Rather than use heuristics (e.g., if in California and q='giants', show the California Giants), use previous data to train an ML model and display results.
260 | 
261 | __Building a recommendation system__
262 | 
263 | 1. Ingest existing data (input and output, e.g., user ratings of tagged data)
264 | 2. Train a model to predict the output (e.g., user rating)
265 | 3. Provide recommendations: rate all of the unrated products, show the top n.
266 | 
267 | Ratings will be based on:
268 | 
269 | 1. Who is this user like?
270 | 2. Is this a good product? (other ratings?)
271 | 3. Predicted rating = user-preference * item-quality
272 | 
273 | Models can typically be updated once a day or once a week. It does not need to be streaming.
274 | Once computed, the recommendations can be stored in Cloud SQL.
275 | 
276 | So: compute the ratings using a batch job (__Dataproc__), and store them in __Cloud SQL__ (see the Spark sketch just before Lab 2 below).
277 | 
278 | __Storage systems__
279 | 
280 | Roughly:
281 | 
282 | 1. __Cloud Storage__: Global file system _(unstructured)_
283 | 2. __Cloud SQL__: Relational database (transactional/relational data accessed through SQL) _(structured/transactional)_
284 | 2.1 __Cloud Spanner__: Like Cloud SQL, but horizontally scalable (for when you have more than a few GB of data or need more than a few DB instances)
285 | 3. __Datastore__: Transactional, NoSQL object-oriented database _(structured/transactional)_
286 | 4. __Bigtable__: High-throughput, NoSQL, __append-only__ data (not transactional) _(millisecond-latency analytics)_
287 | 5. __BigQuery__: SQL data warehouse to power analytics _(seconds-latency analytics)_
288 | 
289 | ![image](https://raw.githubusercontent.com/pekoto/GCP-Big-Data-ML/master/images/storage-table.jpg)
290 | 
291 | __Hadoop Ecosystem__
292 | 
293 | 1. __Hadoop__: MapReduce framework (HDFS)
294 | 2. __Pig__: Scripting language that can be compiled into MapReduce jobs
295 | 3. __Hive__: Data warehousing system and query language. Makes data on a distributed file system look like an SQL DB.
296 | 4. __Spark__: Lets you run queries on your data; also supports ML, etc.
297 | 
298 | __Storage__
299 | 
300 | HDFS is used for working storage -- storage during the processing of the job.
301 | But all input and output data will be stored in Cloud Storage.
302 | So, because we store data off-cluster, the cluster only has to be up for the duration of the job.
303 | 
304 | (Recall, you can shut down the compute nodes when not using them, so save your data in Cloud Storage, etc., instead of on the compute node disk.)
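Lab 2 below trains the recommendation model with Spark on Dataproc. The lab's `train_and_apply.py` isn't reproduced in these notes; as a minimal sketch of the idea, collaborative filtering with Spark ML's ALS might look something like this (the column names and in-line rows are made up -- in the lab, the ratings would come from the Cloud SQL `Rating` table instead):

```python
# Illustrative only: a tiny ALS (collaborative filtering) job of the kind
# you would submit to Dataproc. Not the lab's actual train_and_apply.py.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("rentals-recommendations").getOrCreate()

# In the lab, ratings are read from Cloud SQL; here we fake a few rows.
ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 1.0), (1, 10, 5.0), (1, 12, 3.0)],
    ["userId", "accoId", "rating"],
)

als = ALS(userCol="userId", itemCol="accoId", ratingCol="rating",
          rank=10, maxIter=10, coldStartStrategy="drop")
model = als.fit(ratings)

# Top 5 accommodations per user -- these rows would be written back to the
# Recommendation table in Cloud SQL for serving.
model.recommendForAllUsers(5).show(truncate=False)
```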
305 | 
306 | ## Lab 2
307 | 
308 | __Creating Cloud SQL DBs__
309 | 
310 | First, let's create a Cloud SQL instance to hold all of the recommendation data:
311 | 
312 | `SQL > Create Instance > MySQL > Instance ID = 'rentals'`
313 | 
314 | Now the script creates 3 tables:
315 | 
316 | * Accommodation: Basic details
317 | * Rating: 1-to-many for accommodation-to-ratings and user-to-ratings
318 | * Recommendation: This will be populated by the recommendation engine
319 | 
320 | `Connect to this instance > Connect using Cloud Shell`
321 | 
322 | A `gcloud sql connect` command will be prepopulated to connect to the DB, so you can then run commands as required.
323 | 
324 | Use `show databases;` to show the registered DBs.
325 | 
326 | Run the script to create the `recommendation_spark` database and underlying tables.
327 | 
328 | __Ingesting Data__
329 | 
330 | Now, before we can populate the Cloud SQL tables we just created, we need to stage the CSV files containing the data on Cloud Storage.
331 | To do this, we can either use Cloud Shell commands or the Console UI.
332 | 
333 | Using the Console UI:
334 | 
335 | `Storage > Browser > Create Bucket > Upload CSV files`
336 | 
337 | Now, click the `Import` function on the Cloud SQL page to populate the SQL tables from the CSV files.
338 | 
339 | After querying the SQL tables to make sure they populated correctly, type `exit` to exit.
340 | 
341 | __Dataproc__
342 | 
343 | Now we will use Dataproc to train the ML model based on previous ratings.
344 | 
345 | We need to launch Dataproc and configure it so each machine in the cluster can access Cloud SQL.
346 | 
347 | Dataproc lets you provision Apache Hadoop clusters.
348 | 
349 | After provisioning your cluster, run a bash script to patch each machine so that its IP is authorized to access the Cloud SQL instance.
350 | 
351 | Copy over the Python recommendation model.
352 | 
353 | You edit code via Cloud Shell by using: `cloudshell edit train_and_apply.py`
354 | 
355 | Now run the job via Dataproc:
356 | 
357 | `Dataproc console > Jobs > Submit job > Enter file containing job code/set parameters`
358 | 
359 | The job will run through to populate the recommendations.
360 | 
361 | ## 9. BigQuery
362 | 
363 | Petabyte-scale data warehouse. Two services in one:
364 | * SQL query engine (serverless -- fully managed)
365 | * Managed data storage
366 | 
367 | 
368 | * Pay for data stored and queries run, or a flat tier
369 | * Analysis engine: takes in data and runs analysis/model building
370 | * Can connect to BigQuery from other tools
371 | 
372 | __Sample Query__
373 | 
374 | `Cloud console > BigQuery > Create dataset > Create table (can upload from storage, etc.)`
375 | 
376 | When writing queries, use the following format:
377 | 
378 | `FROM [project (name? or id?)].[dataset].[table]`
379 | 
380 | E.g.:
381 | 
382 | ````
383 | SELECT COUNT(*) AS total_trips
384 | FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
385 | ````
386 | 
387 | (Click on this query to view the table info)
388 | 
389 | Click the down arrow > Run selected to run just the selected part of the query
390 | 
391 | __Architecture__
392 | 
393 | ----------------------------                         ----------------------------
394 | BigQuery Storage Service    <-- Petabit network -->  BigQuery Query Service
395 | ----------------------------                         ----------------------------
396 | Project                                              Runs queries
397 |   Dataset A | Dataset B                              Connectors to other products
398 |     Table 1 |   Table 1*
399 |     Table 2 |   ...
400 |     ...     |
401 | ----------------------------                         ----------------------------
402 | * Tables are stored as compressed
403 |   columns in Colossus
404 | * Supports streaming data ingest
405 | 
406 | 
407 | __BigQuery Tips__
408 | 
409 | * Ctrl/cmd-click on a table name: view the table
410 | * In Details, click on a field name to insert it into the query
411 | * Click More > Format to automatically format the query
412 | * Explore in Data Studio > visualize the data
413 | * Save query > save the query in your project
414 | * `CREATE OR REPLACE TABLE [dataset].[tablename] AS [SQL QUERY]` to save the data into a table, saving you having to rerun the query every time
415 | * In the above, you could replace TABLE with VIEW to just store the query itself. Helpful if the data is changing a lot.
416 | 
417 | __Cloud Dataprep__
418 | 
419 | Reports on data quality. Provides data cleansing, etc.
420 | 
421 | ## 10. SQL Arrays and Structs
422 | 
423 | Splitting the data into different tables requires joins or, possibly, denormalization.
424 | 
425 | To avoid this, we can use two features:
426 | 
427 | __SQL Structs (Records)__
428 | 
429 | A struct is a datatype that is essentially a collection of fields. You can think of it as a table inside another table.
430 | 
431 | __Array Datatype__
432 | 
433 | Lets a single field in a row hold multiple values.
434 | 
435 | ![image](https://raw.githubusercontent.com/pekoto/GCP-Big-Data-ML/master/images/sql_struct.jpg)
436 | 
437 | ## 11. ML Model
438 | 
439 | Some terms...
440 | 
441 | * __Instance/observation__: A row of data in the table
442 | * __Label__: The correct answer, known historically (e.g., how much this customer spent); in future data, this is what you want to predict
443 | * __Feature columns__: The other columns in the table (i.e., used as model input; not what you want to predict)
444 | * __One-hot encoding__: Turning enums (categorical values) into a matrix of 0s and 1s so as not to skew the model
445 | 
446 | ![image](https://cdn-images-1.medium.com/max/2400/1*Ac4z1rWWuU0TzxJRUM62WA.jpeg)
447 | 
448 | 
449 | __BigQuery ML (BQML)__
450 | 
451 | In BigQuery, we can build models in SQL.
452 | 
453 | First, build the model in SQL:
454 | 
455 | ````
456 | CREATE MODEL numbikes.model
457 | OPTIONS
458 | (model_type='linear_reg', labels=['num_trips']) AS
459 | WITH bike_data AS
460 | (
461 | SELECT COUNT(*) AS num_trips,
462 | ...
463 | ````
464 | 
465 | Second, write a SQL prediction query:
466 | 
467 | ````
468 | SELECT predicted_num_trips, num_trips, trip_date
469 | FROM
470 | ml.PREDICT(MODEL 'numbikes.model...
471 | 
472 | ````
473 | 
474 | BigQuery's SQL models will:
475 | 
476 | 1. Auto-tune the learning rate
477 | 2. Auto-split the data into training and test sets
478 | (though these hyperparameters can be set manually too)
479 | 
480 | __Process__
481 | 
482 | The general process looks like this:
483 | 
484 | 1. Get data into BigQuery
485 | 2. Preprocess features (select features) -- create the training set
486 | 3. Create the model in BigQuery (`CREATE MODEL`)
487 | 4. Evaluate the model
488 | 5. Make predictions with the model (`ML.predict`)
489 | 
490 | __BQML Cheatsheet__
491 | 
492 | * __Label__: Alias a column as 'label', or specify column(s) in OPTIONS using input_label_cols (reminder: labels are known in the training data, but are what you want to predict for future data)
493 | * __Feature__: Table columns used in the SQL SELECT statement (`SELECT * FROM ML.FEATURE_INFO(MODEL ``mydataset.mymodel``)` to get info about a column after the model is trained)
494 | * __Model__: An object created in BigQuery that resides in a BigQuery dataset
495 | * __Model Types__: Linear regression (predict a numeric field), logistic regression (discrete class -- high or low, spam or not spam, etc.) (`CREATE OR REPLACE MODEL [dataset].[model] OPTIONS(model_type='[type]') AS ...`)
496 | * __Training Progress__: `SELECT * FROM ML.TRAINING_INFO(MODEL ``mydataset.mymodel``)`
497 | * __Inspect Weights__: `SELECT * FROM ML.WEIGHTS(MODEL ``mydataset.mymodel``, ())`
498 | * __Evaluation__: `SELECT * FROM ML.EVALUATE(MODEL ``mydataset.mymodel``)`
499 | * __Prediction__: `SELECT * FROM ML.PREDICT(MODEL ``mydataset.mymodel``, ())`
500 | 
501 | ## Lab 3
502 | 
503 | Get the conversion rate:
504 | 
505 | ```sql
506 | WITH visitors AS(
507 | SELECT
508 | COUNT(DISTINCT fullVisitorId) as total_visitors
509 | FROM `data-to-insights.ecommerce.web_analytics`
510 | ),
511 | 
512 | purchasers AS(
513 | SELECT
514 | COUNT(DISTINCT fullVisitorId) as total_purchasers
515 | FROM `data-to-insights.ecommerce.web_analytics`
516 | WHERE totals.transactions IS NOT NULL
517 | )
518 | 
519 | SELECT
520 | total_visitors,
521 | total_purchasers,
522 | total_purchasers/total_visitors as conversion_rate
523 | FROM visitors, purchasers
524 | ```
525 | 
526 | Find the top 5 selling products:
527 | 
528 | ```sql
529 | SELECT
530 | p.v2ProductName,
531 | p.v2ProductCategory,
532 | SUM(p.productQuantity) AS units_sold,
533 | ROUND(SUM(p.localProductRevenue/1000000),2) AS revenue
534 | FROM `data-to-insights.ecommerce.web_analytics`,
535 | UNNEST(hits) AS h,
536 | UNNEST(h.product) AS p
537 | GROUP BY 1,2
538 | ORDER BY revenue DESC
539 | LIMIT 5;
540 | ```
541 | 
542 | This query:
543 | 1. Creates a table containing a row for each of the hits: `UNNEST(hits)`
544 | 2. Creates a table containing all of the products in those hits: `UNNEST(h.product)`
545 | 3. Selects the various fields
546 | 4. Groups by product name and product category
547 | 
548 | The `UNNEST` keyword takes an array and returns a table with a single row for each element in the array.
549 | `WITH OFFSET` retains the original ordering of the array (as an extra column).
550 | 
551 | ```
552 | SELECT *
553 | FROM UNNEST(['foo', 'bar', 'baz', 'qux', 'corge', 'garply', 'waldo', 'fred'])
554 | AS element
555 | WITH OFFSET AS offset
556 | ORDER BY offset;
557 | 
558 | +----------+--------+
559 | | element  | offset |
560 | +----------+--------+
561 | | foo      | 0      |
562 | | bar      | 1      |
563 | | baz      | 2      |
564 | | qux      | 3      |
565 | | corge    | 4      |
566 | | garply   | 5      |
567 | | waldo    | 6      |
568 | | fred     | 7      |
569 | +----------+--------+
570 | 
571 | ```
572 | 
573 | `GROUP BY 1,2` refers to the first and second items in the SELECT list.
574 | 
575 | Analytics schema:
576 | 
577 | https://support.google.com/analytics/answer/3437719?hl=en
578 | 
579 | __Create the model__
580 | 
581 | ```sql
582 | CREATE OR REPLACE MODEL `ecommerce.classification_model`
583 | OPTIONS
584 | (
585 | model_type='logistic_reg', # Since we want to classify as A/B
586 | labels = ['will_buy_on_return_visit'] # Set the thing we want to predict
587 | )
588 | AS
589 | 
590 | #standardSQL
591 | SELECT
592 | * EXCEPT(fullVisitorId)
593 | FROM
594 | 
595 | # features
596 | (SELECT
597 | fullVisitorId,
598 | IFNULL(totals.bounces, 0) AS bounces,
599 | IFNULL(totals.timeOnSite, 0) AS time_on_site
600 | FROM
601 | `data-to-insights.ecommerce.web_analytics`
602 | WHERE
603 | totals.newVisits = 1
604 | AND date BETWEEN '20160801' AND '20170430') # train on first 9 months
605 | JOIN
606 | (SELECT
607 | fullvisitorid,
608 | IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
609 | FROM
610 | `data-to-insights.ecommerce.web_analytics`
611 | GROUP BY fullvisitorid)
612 | USING (fullVisitorId)
613 | ;
614 | ```
615 | 
616 | After running, the query will create a new ML model at `project:dataset.model`.
617 | 
618 | Note there are two uses of `EXCEPT`: `SELECT * EXCEPT(col)` (as in the query above) excludes a column from the results, while the `EXCEPT DISTINCT` set operator returns the rows in the left query that are not in the right query.
619 | 
620 | For example (the set operator):
621 | 
622 | ```sql
623 | WITH a AS (
624 | SELECT * FROM UNNEST([1,2,3,4]) AS n
625 | 
626 | ), b AS (
627 | SELECT * FROM UNNEST([4,5,6,7]) AS n)
628 | 
629 | SELECT * FROM a
630 | 
631 | EXCEPT DISTINCT
632 | 
633 | SELECT * FROM b
634 | 
635 | -- | n |
636 | -- | 1 |
637 | -- | 2 |
638 | -- | 3 |
639 | ```
640 | 
641 | 
642 | __Evaluate the model__
643 | 
644 | One measure we can use to evaluate the model is the receiver operating characteristic (ROC) curve.
645 | Essentially, this shows the quality of a binary classifier by plotting the true positive rate against the false positive rate.
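The lab computes this with `ML.EVALUATE` below. Purely as an illustration of the metric itself (outside BigQuery, with made-up labels and scores), ROC AUC can be computed like this:

```python
# Illustrative only: ROC AUC for a binary classifier, using scikit-learn.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                 # actual outcomes (bought on return visit?)
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]   # model's predicted probabilities

print(roc_auc_score(y_true, y_score))       # 1.0 = perfect ranking, 0.5 = random guessing
```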
646 | 647 | We want to get the area under the curve as close as possible to 1.0 648 | 649 | https://cdn.qwiklabs.com/GNW5Bw%2B8bviep9OK201QGPzaAEnKKyoIkDChUHeVdFw%3D 650 | 651 | ```sql 652 | SELECT 653 | roc_auc, 654 | CASE 655 | WHEN roc_auc > .9 THEN 'good' 656 | WHEN roc_auc > .8 THEN 'fair' 657 | WHEN roc_auc > .7 THEN 'not great' 658 | ELSE 'poor' END AS model_quality 659 | FROM 660 | ML.EVALUATE(MODEL ecommerce.classification_model, ( 661 | 662 | SELECT 663 | * EXCEPT(fullVisitorId) 664 | FROM 665 | 666 | # features 667 | (SELECT 668 | fullVisitorId, 669 | IFNULL(totals.bounces, 0) AS bounces, 670 | IFNULL(totals.timeOnSite, 0) AS time_on_site 671 | FROM 672 | `data-to-insights.ecommerce.web_analytics` 673 | WHERE 674 | totals.newVisits = 1 675 | AND date BETWEEN '20170501' AND '20170630') # eval on 2 months 676 | JOIN 677 | (SELECT 678 | fullvisitorid, 679 | IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit 680 | FROM 681 | `data-to-insights.ecommerce.web_analytics` 682 | GROUP BY fullvisitorid) 683 | USING (fullVisitorId) 684 | 685 | )); 686 | ``` 687 | 688 | roc_auc is a queryable field 689 | 690 | __Improving the model__ 691 | 692 | We can improve the model by adding more features: 693 | 694 | ```sql 695 | CREATE OR REPLACE MODEL `ecommerce.classification_model_2` 696 | OPTIONS 697 | (model_type='logistic_reg', labels = ['will_buy_on_return_visit']) AS 698 | 699 | WITH all_visitor_stats AS ( 700 | SELECT 701 | fullvisitorid, 702 | IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit 703 | FROM `data-to-insights.ecommerce.web_analytics` 704 | GROUP BY fullvisitorid 705 | ) 706 | 707 | # add in new features 708 | SELECT * EXCEPT(unique_session_id) FROM ( 709 | 710 | SELECT 711 | CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id, 712 | 713 | # labels 714 | will_buy_on_return_visit, 715 | 716 | MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress, 717 | 718 | # behavior on the site 719 | IFNULL(totals.bounces, 0) AS bounces, 720 | IFNULL(totals.timeOnSite, 0) AS time_on_site, 721 | totals.pageviews, 722 | 723 | # where the visitor came from 724 | trafficSource.source, 725 | trafficSource.medium, 726 | channelGrouping, 727 | 728 | # mobile or desktop 729 | device.deviceCategory, 730 | 731 | # geographic 732 | IFNULL(geoNetwork.country, "") AS country 733 | 734 | FROM `data-to-insights.ecommerce.web_analytics`, 735 | UNNEST(hits) AS h 736 | 737 | JOIN all_visitor_stats USING(fullvisitorid) 738 | 739 | WHERE 1=1 740 | # only predict for new visits 741 | AND totals.newVisits = 1 742 | AND date BETWEEN '20160801' AND '20170430' # train 9 months 743 | 744 | GROUP BY 745 | unique_session_id, 746 | will_buy_on_return_visit, 747 | bounces, 748 | time_on_site, 749 | totals.pageviews, 750 | trafficSource.source, 751 | trafficSource.medium, 752 | channelGrouping, 753 | device.deviceCategory, 754 | country 755 | ); 756 | ``` 757 | 758 | Point: Ensure you use the same training data. Otherwise differences could be due to differences in input, rather than model improvements. 759 | 760 | Now we have a better model, we can make predictions. 
761 | 762 | __Make predictions__ 763 | 764 | ```sql 765 | SELECT 766 | * 767 | FROM 768 | ml.PREDICT(MODEL `ecommerce.classification_model_2`, 769 | ( 770 | 771 | WITH all_visitor_stats AS ( 772 | SELECT 773 | fullvisitorid, 774 | IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit 775 | FROM `data-to-insights.ecommerce.web_analytics` 776 | GROUP BY fullvisitorid 777 | ) 778 | 779 | SELECT 780 | CONCAT(fullvisitorid, '-',CAST(visitId AS STRING)) AS unique_session_id, 781 | 782 | # labels 783 | will_buy_on_return_visit, 784 | 785 | MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress, 786 | 787 | # behavior on the site 788 | IFNULL(totals.bounces, 0) AS bounces, 789 | IFNULL(totals.timeOnSite, 0) AS time_on_site, 790 | totals.pageviews, 791 | 792 | # where the visitor came from 793 | trafficSource.source, 794 | trafficSource.medium, 795 | channelGrouping, 796 | 797 | # mobile or desktop 798 | device.deviceCategory, 799 | 800 | # geographic 801 | IFNULL(geoNetwork.country, "") AS country 802 | 803 | FROM `data-to-insights.ecommerce.web_analytics`, 804 | UNNEST(hits) AS h 805 | 806 | JOIN all_visitor_stats USING(fullvisitorid) 807 | 808 | WHERE 809 | # only predict for new visits 810 | totals.newVisits = 1 811 | AND date BETWEEN '20170701' AND '20170801' # test 1 month 812 | 813 | GROUP BY 814 | unique_session_id, 815 | will_buy_on_return_visit, 816 | bounces, 817 | time_on_site, 818 | totals.pageviews, 819 | trafficSource.source, 820 | trafficSource.medium, 821 | channelGrouping, 822 | device.deviceCategory, 823 | country 824 | ) 825 | 826 | ) 827 | 828 | ORDER BY 829 | predicted_will_buy_on_return_visit DESC; 830 | ``` 831 | 832 | # WEEK 2 833 | 834 | ## Data Pipelines 835 | 836 | Ingesting real-time data poses a number of challenges: 837 | 838 | 1. Have to scale in real time 839 | 2. Have to deal with data being late 840 | 3. Have to deal with bad data coming in real time (duplicates, missing data, etc.) 841 | 842 | ## Cloud Pub/Sub 843 | 844 | Distributed messaging system to handle real-time messaging. 845 | 846 | -------------------------------------------------------------------------------------------------------------------- 847 | 848 | (sensor data, etc.) > Cloud Pub/Sub > Cloud Dataflow > BigQuery/Cloud Storage > Exploration/Visualization Apps 849 | [Ingests data, [Subscribes to 850 | publishes to Cloud Pub/Sub] 851 | subscribers 852 | -------------------------------------------------------------------------------------------------------------------- 853 | 854 | _Serverless Big Data Pipeline_ 855 | 856 | __Cloud Pub/Sub Architecture__ 857 | 858 | Cloud Pub/Sub uses __topics__. These are like channels that is publishes. Subscribers can listen to these topics and pick up messages that are published. 859 | 860 | For example: 861 | 862 | 1. Setup Pub/Sub with a topic called "HR" 863 | 2. When a new worker joins, the HR system publishes a "NEW HIRE" event to the "HR" topic 864 | 3. Then, downstream applications (facilities, badge activation system) who are subscribed to this topic can get the message and take action as appropriate 865 | 866 | ## Cloud Dataflow 867 | 868 | __Apache Beam__ 869 | 870 | * Used to implement batch or streaming data processing jobs. 
871 | * Pipelines written in Java, Python, or Go
872 | * Creates a model representation of the code which is portable across many __runners__
873 | * Runners pass models to an execution environment, which can run on many different engines (e.g., Spark, Cloud Dataflow)
874 | * Transformations can be done in parallel, making pipelines scalable
875 | 
876 | __Workflow with Cloud Dataflow__
877 | 
878 | ![image](https://raw.githubusercontent.com/pekoto/GCP-Big-Data-ML/master/images/cloud-dataflow-beam.jpg)
879 | 
880 | 1. Write code to create a model in Beam
881 | 2. Beam passes a job to Cloud Dataflow
882 | 3. Once it receives the job, the Cloud Dataflow service:
883 | * Optimizes the execution graph
884 | * Schedules work out to workers in a distributed fashion
885 | * Auto-heals if workers encounter errors
886 | * Connects to data sinks to produce results (e.g., BigQuery)
887 | 
888 | A number of template pipelines are available:
889 | 
890 | https://github.com/googlecloudplatform/dataflowtemplates
891 | 
892 | ## Data Studio
893 | 
894 | * Provides data visualization
895 | * Data is live, not just a static image
896 | * Click `Explore in Data Studio` in BigQuery
897 | 
898 | __Creating a report__
899 | 
900 | 1. Create a new report
901 | 2. Select a data source (a report can have multiple data sources)
902 | 3. Create charts: click and draw
903 | 
904 | Data Studio uses _Dimension_ and _Metric_ chips.
905 | 
906 | * __Dimensions__: Categories or buckets of information (area code, etc.). Shown in green.
907 | * __Metrics__: Measure dimension values: measurements, aggregations, counts, etc. Shown in blue.
908 | 
909 | Use __calculated fields__ to create your own metrics.
910 | 
911 | ## Lab 1
912 | 
913 | __Cloud Pub/Sub topics__
914 | 
915 | Cloud Pub/Sub decouples senders and receivers. Senders send messages to a Cloud Pub/Sub _topic_, and receivers subscribe to this topic.
916 | 
917 | __Create a BigQuery dataset__
918 | 
919 | Messages published into Pub/Sub will be stored in BigQuery.
920 | 
921 | 1. In Cloud Shell, run `bq mk taxirides` to create a dataset called `taxirides` in BigQuery
922 | 2. Now create a table inside the dataset:
923 | 
924 | ```
925 | bq mk \
926 | --time_partitioning_field timestamp \
927 | --schema ride_id:string,point_idx:integer,latitude:float,longitude:float,\
928 | timestamp:timestamp,meter_reading:float,meter_increment:float,ride_status:string,\
929 | passenger_count:integer -t taxirides.realtime
930 | ```
931 | 
932 | __Create a Cloud Storage bucket__
933 | 
934 | 1. Storage > Create Bucket
935 | 2. The name must be globally unique (e.g., the project id)
936 | 
937 | __Set up the Cloud Dataflow pipeline__
938 | 
939 | 1. Navigation > Dataflow
940 | 2. Create job from template
941 | 3. Type > Cloud Pub/Sub Topic to BigQuery template
942 | 4. Input topic > projects/pubsub-public-data/topics/taxirides-realtime
943 | 5. Output table > qwiklabs-gcp-72fdadec78efe24c:taxirides.realtime
944 | 6. Temporary location > gs://qwiklabs-gcp-72fdadec78efe24c/tmp/
945 | 7. Click run job
946 | 
947 | Cloud Dataflow will now show a visualization of the Dataflow job running.
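The lab uses the ready-made Cloud Pub/Sub Topic to BigQuery template, so no pipeline code has to be written. Just for reference, a hand-rolled Beam pipeline doing roughly the same thing might look like the sketch below (the topic and table are the lab's; the code itself is illustrative, not the template's source):

```python
# Illustrative Apache Beam (Python SDK) pipeline: stream taxi rides from the
# public Pub/Sub topic into the taxirides.realtime BigQuery table.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # plus --project, --runner=DataflowRunner, etc.

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
              topic="projects/pubsub-public-data/topics/taxirides-realtime")
        | "ParseJson" >> beam.Map(json.loads)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
              "taxirides.realtime",  # table already created with bq mk above
              create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
              write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```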
948 | 949 | __Analyze streaming data__ 950 | 951 | You can check the data in BigQuery: 952 | 953 | `SELECT * FROM taxirides.realtime LIMIT 10;` 954 | 955 | See aggregated per/minute ride data: 956 | 957 | ```sql 958 | WITH streaming_data AS ( 959 | 960 | SELECT 961 | timestamp, 962 | TIMESTAMP_TRUNC(timestamp, HOUR, 'UTC') AS hour, 963 | TIMESTAMP_TRUNC(timestamp, MINUTE, 'UTC') AS minute, 964 | TIMESTAMP_TRUNC(timestamp, SECOND, 'UTC') AS second, 965 | ride_id, 966 | latitude, 967 | longitude, 968 | meter_reading, 969 | ride_status, 970 | passenger_count 971 | FROM 972 | taxirides.realtime 973 | WHERE ride_status = 'dropoff' 974 | ORDER BY timestamp DESC 975 | LIMIT 100000 976 | 977 | ) 978 | 979 | # calculate aggregations on stream for reporting: 980 | SELECT 981 | ROW_NUMBER() OVER() AS dashboard_sort, 982 | minute, 983 | COUNT(DISTINCT ride_id) AS total_rides, 984 | SUM(meter_reading) AS total_revenue, 985 | SUM(passenger_count) AS total_passengers 986 | FROM streaming_data 987 | GROUP BY minute, timestamp 988 | ``` 989 | 990 | __Explore in Data Studio__ 991 | 992 | 1. Click `Explore in Data Studio` 993 | 2. Set up dimensions and metrics as desired 994 | 995 | Once finished, stop the Cloud Dataflow pipeline job. 996 | 997 | ## Approaches to ML 998 | 999 | 1. Use pre-built AI 1000 | - Lack enough data to build your own model 1001 | 1002 | 2. Add custom models 1003 | - Requires 100,000~millions of records of sample data 1004 | 1005 | 3. Create new models 1006 | 1007 | 1008 | Use pre-built ML building blocks. E.g., https://console.cloud.google.com/vision/ 1009 | For example, use vision API to extract text and translate API to tranlate it. 1010 | 1011 | ## AutoML 1012 | 1013 | Use to extend the capabilities of the AI building block without code. 1014 | For example, extend Vision API to recognize cloud types by uploading photos of clouds with type labels. 1015 | A __confusion matrix__ shows the % of labels that were correctly/incorrectly labelled. 1016 | 1017 | Uses __neural architecture search__ to build several models and extract the best one. 1018 | 1019 | ## Lab 2 1020 | 1021 | __Get an API key__ 1022 | 1023 | * APIs and Services > Credentials > Create Credentials > API Key 1024 | 1025 | In Cloud Shell, set the API Key as an environment variable: 1026 | 1027 | `export API_KEY=` 1028 | 1029 | __Create storage bucket__ 1030 | 1031 | * Storage > Create bucket > 1032 | 1033 | Make the bucket publically available: 1034 | 1035 | `gsutil acl ch -u AllUsers:R gs://qwiklabs-gcp-005f9de6234f0e59/*` 1036 | 1037 | Then click the public link to check it worked. 1038 | 1039 | __Create JSON request__ 1040 | 1041 | You can use Emacs in Cloud Shell: 1042 | 1043 | 1. `emacs` 1044 | 2. `c-X c-F` to create file (this finds file, but type in the file name you want to create and it will create it as an empty buffer) 1045 | 3. `c-X c-C` to save file and kill terminal 1046 | 1047 | (nano seems to work better in Cloud Shell) 1048 | 1049 | __Send the request__ 1050 | 1051 | Use `curl` to send the request: 1052 | 1053 | `curl -s -X POST -H "Content-Type: application/json" --data-binary @request.json https://vision.googleapis.com/v1/images:annotate?key=${API_KEY}` 1054 | 1055 | Now, the response will provide some labels based on the pretrained model, but what if we want the model to be able to detect our own labels? In that case, we can feed in some of our own training data. To do this custom training, we can use _AutoML_. 
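As an aside, the same request that the lab's `request.json` and `curl` command send can also be built and posted from Python. A rough sketch (the bucket and object names are placeholders):

```python
# Illustrative only: the Vision API images:annotate call made by the lab's
# curl command, rebuilt in Python. Bucket/object names are placeholders.
import os
import requests

request_body = {
    "requests": [{
        "image": {"source": {"gcsImageUri": "gs://your-bucket-name/demo-image.jpg"}},
        "features": [{"type": "LABEL_DETECTION", "maxResults": 10}],
    }]
}

resp = requests.post(
    "https://vision.googleapis.com/v1/images:annotate",
    params={"key": os.environ["API_KEY"]},   # the API key exported earlier
    json=request_body,
)
print(resp.json())   # labels from the pretrained model
```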
1056 | 
1057 | __AutoML__
1058 | 
1059 | Once set up, AutoML will create a new bucket with the suffix `-vcm`.
1060 | 1. Bind a new environment variable to this bucket:
1061 | 
1062 | `export BUCKET=`
1063 | 
1064 | 2. Copy over the training data:
1065 | 
1066 | `gsutil -m cp -r gs://automl-codelab-clouds/* gs://${BUCKET}`
1067 | 
1068 | 3. Now we need to set up a CSV file that tells AutoML where to find each image and the labels associated with each image. We just copy it over:
1069 | 
1070 | `gsutil -m cp -r gs://automl-codelab-clouds/* gs://${BUCKET}`
1071 | 
1072 | 4. Now copy the file to your bucket:
1073 | 
1074 | `gsutil cp ./data.csv gs://${BUCKET}`
1075 | 
1076 | 5. Now, back in the AutoML Vision UI, click `New Dataset > clouds > Select a CSV file on Cloud Storage > Create Dataset`
1077 | 
1078 | The images will be imported from the CSV. After the images have been imported, you can view them, check their labels, and change the labels if needed.
1079 | 
1080 | (You want _at least_ 100 images for training.)
1081 | 
1082 | 6. Click train to start training.
1083 | 
1084 | After the model has finished training, you can check the accuracy of the model:
1085 | 
1086 | * __Precision__: What proportion of positive identifications were correct?
1087 | * __Recall__: What proportion of actual positives was identified correctly?
1088 | (1.0 = a perfect score)
1089 | 
1090 | __Generating Predictions__
1091 | 
1092 | Now that we've trained our model, we can use it to make some predictions on unseen data.
1093 | 
1094 | 1. Go to the `Predict` tab
1095 | 
1096 | 2. Upload images to see predictions
1097 | 
1098 | ## Building a Custom Model
1099 | 
1100 | There are three ways to build custom models in GCP:
1101 | 
1102 | 1. Using SQL models (BigQuery ML)
1103 | 2. AutoML
1104 | 3. ML Engine Notebook (Jupyter) (you can write your own model with Keras)
1105 | 
1106 | ## Further Courses
1107 | 
1108 | * From Data to Insights
1109 | * Data Engineering
1110 | * ML on GCP
1111 | * Adv ML on GCP
1112 | 
1113 | --------------------------------------------------------------------------------