├── README.md ├── articles ├── data-lifecycle-cloud-platform.md └── hadoop-gcp-migration-overview.md ├── case-study ├── flowlogistic.md └── mjtelco.md ├── coursera ├── building-resilient-streaming-systems-gcp │ ├── Module 1: Architecture of Streaming Analytics Pipelines │ │ ├── 01-what-is-streaming.md │ │ ├── 02-dealing-with-unordered-late-data.md │ │ ├── 03-derive-insights-from-data.md │ │ ├── 04-lab-derive-insights.md │ │ ├── _6fea563bb557571a99ef8afbff6df258_Lab1.pdf │ │ └── handling-variable-data-volumes.md │ ├── Module 2: Ingesting Variable Volumes │ │ ├── 01-what-is-pubsub.md │ │ ├── 02-how-it-works-topics-and-subscriptions.md │ │ ├── 03-lab-review.md │ │ ├── codelab-publish-streaming-data-into-pub-sub.md │ │ └── img │ │ │ ├── Screen Shot 2018-03-21 at 3.13.18 PM.png │ │ │ ├── Screen Shot 2018-03-21 at 3.13.23 PM.png │ │ │ ├── Screen Shot 2018-03-21 at 3.16.34 PM.png │ │ │ ├── Screen Shot 2018-03-21 at 3.17.24 PM.png │ │ │ ├── Screen Shot 2018-03-21 at 3.18.16 PM.png │ │ │ ├── Screen Shot 2018-03-21 at 3.19.34 PM.png │ │ │ ├── Screen Shot 2018-03-21 at 3.25.11 PM.png │ │ │ ├── Screen Shot 2018-03-21 at 3.28.25 PM.png │ │ │ ├── Screen Shot 2018-03-21 at 3.32.38 PM.png │ │ │ ├── Screen Shot 2018-03-21 at 3.36.25 PM.png │ │ │ ├── Screen Shot 2018-03-21 at 3.52.57 PM.png │ │ │ └── Screen Shot 2018-03-21 at 4.21.48 PM 1.png │ ├── Module 3: Implementing Streaming Pipelines │ │ ├── 01-what-is-google-cloud-dataflow.md │ │ ├── 02-challenges-in-stream-processing.md │ │ ├── 03-build-a-stream-processing-pipeline-for-live-traffic-data.md │ │ ├── 04-handle-late-data-watermarks-triggers-accumulation.md │ │ ├── codelab-streaming-data-pipelines.md │ │ └── img │ │ │ ├── Screen Shot 2018-03-23 at 10.29.26 AM.png │ │ │ ├── Screen Shot 2018-03-23 at 10.34.51 AM.png │ │ │ ├── Screen Shot 2018-03-23 at 10.36.36 AM.png │ │ │ ├── Screen Shot 2018-03-23 at 10.37.54 AM.png │ │ │ ├── Screen Shot 2018-03-23 at 10.38.28 AM.png │ │ │ ├── Screen Shot 2018-03-23 at 10.40.02 AM.png │ │ │ ├── 
Screen Shot 2018-03-23 at 10.41.30 AM.png │ │ │ ├── Screen Shot 2018-03-23 at 10.49.25 AM.png │ │ │ ├── Screen Shot 2018-03-26 at 12.26.56 PM.png │ │ │ ├── Screen Shot 2018-03-26 at 12.27.10 PM.png │ │ │ ├── Screen Shot 2018-03-26 at 12.28.07 PM.png │ │ │ ├── Screen Shot 2018-03-26 at 12.32.26 PM.png │ │ │ ├── Screen Shot 2018-03-26 at 12.35.00 PM.png │ │ │ ├── Screen Shot 2018-03-26 at 12.38.51 PM.png │ │ │ ├── Screen Shot 2018-03-26 at 12.39.48 PM.png │ │ │ ├── Screen Shot 2018-03-26 at 12.43.58 PM.png │ │ │ ├── Screen Shot 2018-03-26 at 12.44.33 PM.png │ │ │ ├── Screen Shot 2018-03-26 at 12.45.19 PM.png │ │ │ ├── Screen Shot 2018-03-26 at 12.47.18 PM.png │ │ │ ├── Screen Shot 2018-03-26 at 12.48.33 PM.png │ │ │ ├── Screen Shot 2018-03-26 at 12.50.47 PM.png │ │ │ ├── Screen Shot 2018-03-26 at 12.54.12 PM.png │ │ │ ├── Screen Shot 2018-03-27 at 5.08.36 PM.png │ │ │ ├── Screen Shot 2018-03-27 at 5.09.07 PM.png │ │ │ ├── Screen Shot 2018-03-27 at 5.11.22 PM.png │ │ │ ├── Screen Shot 2018-03-27 at 5.14.31 PM.png │ │ │ ├── Screen Shot 2018-03-27 at 5.14.44 PM.png │ │ │ ├── Screen Shot 2018-03-27 at 5.16.20 PM.png │ │ │ ├── Screen Shot 2018-03-27 at 5.17.01 PM.png │ │ │ ├── Screen Shot 2018-03-27 at 5.17.50 PM.png │ │ │ ├── Screen Shot 2018-03-27 at 5.18.48 PM.png │ │ │ ├── Screen Shot 2018-03-27 at 5.22.24 PM.png │ │ │ ├── Screen Shot 2018-03-27 at 5.23.24 PM.png │ │ │ ├── Screen Shot 2018-03-27 at 5.25.51 PM.png │ │ │ ├── Screen Shot 2018-03-27 at 5.27.20 PM.png │ │ │ ├── Screen Shot 2018-03-27 at 5.28.27 PM.png │ │ │ ├── Screen Shot 2018-03-27 at 5.29.26 PM.png │ │ │ └── Screen Shot 2018-03-27 at 5.58.20 PM.png │ ├── Module 4: Streaming analytics and dashboards │ │ ├── 00-codelab-streaming-analytics-dashboards.md │ │ ├── 01-streaming-analytics-dashboards.md │ │ └── img │ │ │ ├── Screen Shot 2018-03-28 at 2.51.42 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 2.53.48 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 2.55.30 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 2.57.40 
PM.png │ │ │ ├── Screen Shot 2018-03-28 at 2.59.04 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 3.00.17 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 3.00.47 PM.png │ │ │ └── Screen Shot 2018-03-28 at 3.07.00 PM.png │ ├── Module 5: Handling Throughput and Latency Requirements │ │ ├── 00-codelab-streaming-into-bigtable-at-low-latency.md │ │ ├── 01-module-intro-and-agenda.md │ │ ├── 02-what-is-cloud-spanner.md │ │ ├── 03-what-is-bigtable.md │ │ ├── 04-designing-for-bigtable.md │ │ ├── 05-ingesting-into-bigtable.md │ │ ├── 06-rebalance-strategy-distribute-storage.md │ │ └── img │ │ │ ├── Screen Shot 2018-03-28 at 3.29.53 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 3.38.46 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 3.39.00 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 3.39.34 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 3.41.16 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 3.42.28 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 3.43.21 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 3.43.22 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 3.45.53 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 3.46.49 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 3.48.39 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 3.54.31 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 3.57.53 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 3.59.14 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 4.09.20 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 4.10.50 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 4.14.49 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 4.21.16 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 4.21.29 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 4.22.04 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 4.22.21 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 4.22.42 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 4.23.15 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 4.23.36 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 4.24.43 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 4.24.53 PM.png │ │ │ ├── Screen Shot 2018-03-28 at 4.25.24 PM.png │ │ │ ├── Updates │ │ │ └── commit yesterday 
│ ├── README.md │ └── img │ │ ├── Screen Shot 2018-03-20 at 10.37.56 AM.png │ │ ├── Screen Shot 2018-03-20 at 10.39.00 AM.png │ │ ├── Screen Shot 2018-03-20 at 10.44.43 AM.png │ │ ├── Screen Shot 2018-03-20 at 10.46.45 AM.png │ │ ├── Screen Shot 2018-03-20 at 10.47.57 AM.png │ │ ├── Screen Shot 2018-03-20 at 10.50.07 AM.png │ │ ├── Screen Shot 2018-03-20 at 10.53.24 AM.png │ │ ├── Screen Shot 2018-03-20 at 10.55.31 AM.png │ │ ├── Screen Shot 2018-03-20 at 10.58.54 AM.png │ │ ├── Screen Shot 2018-03-20 at 10.59.38 AM.png │ │ ├── Screen Shot 2018-03-20 at 11.02.26 AM.png │ │ ├── Screen Shot 2018-03-20 at 11.02.56 AM.png │ │ ├── Screen Shot 2018-03-20 at 11.04.23 AM.png │ │ ├── Screen Shot 2018-03-20 at 11.05.56 AM.png │ │ └── Screen Shot 2018-03-20 at 11.16.49 AM.png ├── gcp-big-data-ml-fundamentals │ ├── 01-intro.txt │ ├── 02-welcome-to-foundations-of-gcp-compute-and-storage.txt │ ├── 03-what-is-the-google-cloud-platform.txt │ ├── 04-gcp-big-data-products.txt │ ├── 05-usage-scenarios.txt │ ├── 06-cpus-on-demand.txt │ ├── 07-lab-2a-review.txt │ ├── 08-a-global-filesystem.txt │ ├── 10-lab-2b-review.txt │ ├── 11-module-2-resources.txt │ ├── Module-3-Data-Analysis-on-the-Cloud │ │ ├── 01-intro-to-managed-services-for-common-use-cases.txt │ │ ├── 02-stepping-stones-to-transformation.txt │ │ ├── 03-your-sql-database-in-the-cloud.txt │ │ ├── 04-lab-3a-review.txt │ │ ├── 05-managed-hadoop-in-the-cloud.txt │ │ ├── 06-lab-3b-review.txt │ │ ├── 07-module-3-review.txt │ │ ├── README.MD │ │ └── img │ │ │ ├── Screen Shot 2017-12-14 at 4.52.32 PM.png │ │ │ ├── Screen Shot 2017-12-15 at 1.01.51 PM.png │ │ │ ├── Screen Shot 2017-12-15 at 1.02.05 PM.png │ │ │ ├── Screen Shot 2017-12-15 at 1.02.37 PM.png │ │ │ ├── Screen Shot 2017-12-15 at 1.03.31 PM.png │ │ │ ├── Screen Shot 2017-12-15 at 1.04.27 PM.png │ │ │ ├── Screen Shot 2017-12-15 at 1.05.39 PM.png │ │ │ ├── Screen Shot 2017-12-15 at 1.06.37 PM.png │ │ │ ├── Screen Shot 2017-12-15 at 1.07.55 PM.png │ │ │ ├── Screen Shot 
2017-12-15 at 1.11.52 PM.png │ │ │ ├── Screen Shot 2017-12-15 at 1.21.27 PM.png │ │ │ ├── Screen Shot 2017-12-15 at 1.21.43 PM.png │ │ │ ├── Screen Shot 2017-12-15 at 1.22.48 PM.png │ │ │ ├── Screen Shot 2017-12-15 at 12.55.39 PM.png │ │ │ ├── Screen Shot 2017-12-15 at 12.55.52 PM.png │ │ │ ├── Screen Shot 2017-12-15 at 12.58.04 PM.png │ │ │ ├── Screen Shot 2017-12-15 at 4.43.12 PM.png │ │ │ ├── Screen Shot 2017-12-15 at 5.30.59 PM.png │ │ │ ├── Screen Shot 2017-12-15 at 5.31.20 PM.png │ │ │ ├── Screen Shot 2017-12-15 at 5.33.07 PM.png │ │ │ ├── Screen Shot 2017-12-15 at 5.38.02 PM.png │ │ │ └── Screen Shot 2017-12-17 at 4.34.05 PM.png │ ├── Module-4-Scaling-Data-Analysis-Compute-with-GCP │ │ ├── demandforecast.ipynb │ │ ├── fast-random-access.txt │ │ ├── fully-build-machine-learning-models.txt │ │ ├── interactive-iterative-development-demo.txt │ │ ├── intro-to-scaling-data-analysis-change-how-you-compute-with-gcp.txt │ │ ├── lab-4a-review.txt │ │ ├── lab-4b-overview.txt │ │ ├── lab-4c-overview.txt │ │ ├── machine-learning-with-tensorflow.txt │ │ ├── mlapis.ipynb │ │ ├── module-4-review.txt │ │ ├── training-and-creating-a-neural-network-model.txt │ │ └── warehouse-and-interactively-query-petabytes.txt │ ├── Module-5-Data-Processing-Architectures-Scalable-Ingest-Transform-and-Load │ │ ├── intro-to-data-processing-architectures.txt │ │ ├── message-oriented-architectures.txt │ │ ├── module-5-review.txt │ │ └── serverless-data-pipelines.txt │ ├── img │ │ ├── Screen Shot 2017-12-14 at 12.04.10 PM.png │ │ ├── Screen Shot 2017-12-14 at 12.05.55 PM.png │ │ ├── Screen Shot 2017-12-14 at 12.08.19 PM.png │ │ ├── Screen Shot 2017-12-14 at 12.16.30 PM.png │ │ ├── Screen Shot 2017-12-14 at 12.16.37 PM.png │ │ ├── Screen Shot 2017-12-14 at 12.20.45 PM.png │ │ ├── Screen Shot 2017-12-14 at 12.21.26 PM.png │ │ ├── Screen Shot 2017-12-14 at 12.22.16 PM.png │ │ ├── Screen Shot 2017-12-14 at 12.24.00 PM.png │ │ ├── Screen Shot 2017-12-14 at 12.28.40 PM.png │ │ ├── Screen Shot 
2017-12-14 at 12.29.51 PM.png │ │ ├── Screen Shot 2017-12-14 at 12.33.01 PM.png │ │ ├── Screen Shot 2017-12-14 at 12.34.22 PM.png │ │ ├── Screen Shot 2017-12-14 at 12.37.44 PM.png │ │ ├── Screen Shot 2017-12-14 at 12.39.09 PM.png │ │ ├── Screen Shot 2017-12-14 at 12.48.15 PM.png │ │ ├── Screen Shot 2017-12-14 at 12.51.27 PM.png │ │ ├── Screen Shot 2017-12-14 at 12.52.46 PM.png │ │ ├── Screen Shot 2017-12-14 at 12.53.38 PM.png │ │ ├── Screen Shot 2017-12-14 at 12.55.02 PM.png │ │ ├── Screen Shot 2017-12-14 at 12.55.10 PM.png │ │ ├── Screen Shot 2017-12-14 at 3.37.50 PM.png │ │ ├── Screen Shot 2017-12-14 at 3.43.46 PM.png │ │ ├── Screen Shot 2017-12-14 at 3.46.04 PM.png │ │ ├── Screen Shot 2017-12-14 at 3.48.17 PM.png │ │ ├── Screen Shot 2017-12-14 at 3.51.35 PM.png │ │ ├── Screen Shot 2017-12-14 at 3.52.09 PM.png │ │ ├── Screen Shot 2017-12-14 at 4.06.47 PM.png │ │ ├── Screen Shot 2017-12-14 at 4.07.06 PM.png │ │ ├── Screen Shot 2017-12-14 at 4.09.20 PM.png │ │ ├── Screen Shot 2017-12-14 at 4.15.02 PM.png │ │ ├── Screen Shot 2017-12-14 at 4.18.33 PM.png │ │ ├── Screen Shot 2017-12-14 at 4.21.31 PM.png │ │ ├── Screen Shot 2017-12-14 at 4.22.41 PM.png │ │ ├── Screen Shot 2017-12-16 at 7.18.56 PM.png │ │ ├── Screen Shot 2017-12-16 at 7.19.07 PM.png │ │ ├── Screen Shot 2017-12-16 at 7.20.37 PM.png │ │ ├── Screen Shot 2017-12-16 at 7.22.56 PM.png │ │ ├── Screen Shot 2017-12-16 at 7.26.20 PM.png │ │ ├── Screen Shot 2017-12-16 at 7.28.34 PM.png │ │ ├── Screen Shot 2017-12-16 at 7.30.27 PM.png │ │ ├── Screen Shot 2017-12-16 at 7.31.38 PM.png │ │ ├── Screen Shot 2017-12-16 at 7.31.55 PM.png │ │ ├── Screen Shot 2017-12-16 at 7.32.14 PM.png │ │ ├── Screen Shot 2017-12-16 at 7.33.45 PM.png │ │ ├── Screen Shot 2017-12-16 at 7.36.18 PM.png │ │ ├── Screen Shot 2017-12-16 at 7.46.38 PM.png │ │ ├── Screen Shot 2017-12-16 at 7.49.05 PM.png │ │ ├── Screen Shot 2017-12-16 at 7.49.31 PM.png │ │ ├── Screen Shot 2017-12-16 at 7.53.03 PM.png │ │ ├── Screen Shot 2017-12-16 at 7.53.13 PM.png 
│ │ ├── Screen Shot 2017-12-16 at 7.56.13 PM.png │ │ ├── Screen Shot 2017-12-16 at 8.58.52 PM.png │ │ ├── Screen Shot 2017-12-16 at 8.59.52 PM.png │ │ ├── Screen Shot 2017-12-16 at 9.08.46 PM.png │ │ ├── Screen Shot 2017-12-16 at 9.10.01 PM.png │ │ ├── Screen Shot 2017-12-16 at 9.10.40 PM.png │ │ ├── Screen Shot 2017-12-16 at 9.13.58 PM.png │ │ ├── Screen Shot 2017-12-16 at 9.14.19 PM.png │ │ ├── Screen Shot 2017-12-16 at 9.14.27 PM.png │ │ ├── Screen Shot 2017-12-16 at 9.14.41 PM.png │ │ ├── Screen Shot 2017-12-16 at 9.16.18 PM.png │ │ ├── Screen Shot 2017-12-16 at 9.20.29 PM.png │ │ ├── Screen Shot 2017-12-16 at 9.20.38 PM.png │ │ ├── Screen Shot 2017-12-16 at 9.22.26 PM.png │ │ ├── Screen Shot 2017-12-16 at 9.24.04 PM.png │ │ ├── Screen Shot 2017-12-16 at 9.49.26 PM.png │ │ ├── Screen Shot 2017-12-16 at 9.52.03 PM.png │ │ ├── Screen Shot 2017-12-17 at 12.06.29 PM.png │ │ ├── Screen Shot 2017-12-17 at 4.16.29 PM.png │ │ ├── Screen Shot 2017-12-17 at 4.17.52 PM.png │ │ ├── Screen Shot 2017-12-17 at 4.20.41 PM.png │ │ ├── Screen Shot 2017-12-17 at 4.22.25 PM.png │ │ ├── Screen Shot 2017-12-17 at 4.22.38 PM.png │ │ ├── Screen Shot 2017-12-17 at 4.26.53 PM.png │ │ ├── Screen Shot 2017-12-17 at 4.28.27 PM.png │ │ ├── Screen Shot 2017-12-17 at 4.32.05 PM.png │ │ ├── Screen Shot 2017-12-17 at 4.32.29 PM.png │ │ ├── Screen Shot 2017-12-17 at 4.33.35 PM.png │ │ ├── Screen Shot 2017-12-17 at 4.34.17 PM.png │ │ ├── Screen Shot 2017-12-17 at 4.34.49 PM.png │ │ ├── Screen Shot 2017-12-17 at 4.35.21 PM.png │ │ ├── Screen Shot 2017-12-17 at 4.36.00 PM.png │ │ └── cards.gif │ └── summary-of-gcp-big-data-and-ml.txt ├── leveraging-unstructured-data-dataproc-gcp │ ├── Data-Engineering-01_Dataproc-_Unstructured_.pdf │ ├── Module 1: Introduction to Cloud Dataproc │ │ ├── 01-why-unstructured-data.txt │ │ ├── 02-what-data-do-enterprises-analyze.txt │ │ ├── 03-even-google-skipped-unstructured-data.txt │ │ ├── 04-considering-counting-problems.txt │ │ ├── 05-why-cloud-dataproc.txt │ │ ├── 
06-cluster-provisioning-considerations.txt │ │ ├── 07-imagine-your-cluster-provisioning.txt │ │ ├── 08-dataproc-eases-hadoop-management.txt │ │ ├── 09-create-a-cluster-from-the-web.txt │ │ ├── 10-cluster-configurations-and-preemptible-workers.txt │ │ ├── 11-customizing-a-dataproc-cluster.txt │ │ ├── 12-lab-creating-a-dataproc-cluster-review.txt │ │ ├── 13-creating-custom-machine-types.txt │ │ ├── Data-Engineering-01_Dataproc-_Unstructured_.pdf │ │ ├── lab.md │ │ └── quiz │ │ │ └── Screen Shot 2018-01-02 at 3.12.25 PM.png │ ├── Module 2: Running Dataproc jobs │ │ ├── 01-overview-of-running-dataproc-jobs.txt │ │ ├── 02-why-ssh-into-a-cluster.txt │ │ ├── 03-lab-review.txt │ │ ├── 04-separation-of-storage-and-compute.txt │ │ ├── 05-moving-to-a-serverless-world.txt │ │ ├── 06-submitting-jobs-with-dataproc-and-cloud-shell.txt │ │ ├── 07-lab2b.txt │ │ ├── Screen Shot 2018-01-03 at 4.43.16 PM.png │ │ └── running-jobs.md │ ├── Module 3: Leveraging GCP │ │ ├── 01-introduction-to-leveraging-gcp.txt │ │ ├── 02-leveraging-google-cloud-platform-pt-1.txt │ │ ├── 03-leveraging-google-cloud-platform-pt-2.txt │ │ ├── 04-codelab-leveraging-unstructured-data-part-4.txt │ │ ├── 05-lab-review.txt │ │ └── 06-bigquery-support.txt │ └── Module 4: Analyzing Unstructured Data │ │ ├── 01-introduction-to-analyzing-unstructured-data.txt │ │ ├── 02-infuse-your-business-with-machine-learning.txt │ │ └── 03-lab.txt ├── prepare-gcp-exam │ ├── Screen Shot 2019-02-03 at 11.12.48 AM.png │ ├── Screen Shot 2019-02-03 at 11.12.56 AM.png │ ├── Screen Shot 2019-02-03 at 11.14.16 AM.png │ ├── Screen Shot 2019-02-03 at 2.16.44 PM.png │ ├── Screen Shot 2019-02-03 at 2.19.01 PM.png │ ├── Screen Shot 2019-02-03 at 2.45.59 PM.png │ ├── Screen Shot 2019-02-03 at 3.16.30 PM.png │ ├── Screen Shot 2019-02-03 at 3.18.25 PM.png │ ├── Screen Shot 2019-02-03 at 3.19.55 PM.png │ ├── Screen Shot 2019-02-03 at 4.05.44 PM.png │ ├── Screen Shot 2019-02-03 at 4.22.06 PM.png │ ├── Screen Shot 2019-02-03 at 4.23.43 PM.png │ ├── 
Screen Shot 2019-02-03 at 4.25.20 PM.png │ ├── Screen Shot 2019-02-03 at 4.26.11 PM.png │ ├── Screen Shot 2019-02-03 at 4.26.28 PM.png │ ├── Screen Shot 2019-02-03 at 4.27.06 PM.png │ ├── Screen Shot 2019-02-03 at 4.27.50 PM.png │ ├── Screen Shot 2019-02-03 at 4.28.07 PM.png │ ├── Screen Shot 2019-02-03 at 4.30.51 PM.png │ ├── Screen Shot 2019-02-03 at 4.31.36 PM.png │ ├── Screen Shot 2019-02-03 at 4.39.53 PM.png │ ├── Screen Shot 2019-02-03 at 4.40.35 PM.png │ ├── Screen Shot 2019-02-04 at 1.17.57 PM.png │ ├── Screen Shot 2019-02-04 at 1.56.09 PM.png │ ├── Screen Shot 2019-02-04 at 12.37.56 PM.png │ ├── Screen Shot 2019-02-04 at 12.41.29 PM.png │ ├── Screen Shot 2019-02-04 at 12.41.58 PM.png │ ├── Screen Shot 2019-02-04 at 12.46.13 PM.png │ ├── Screen Shot 2019-02-04 at 12.49.20 PM.png │ ├── Screen Shot 2019-02-04 at 12.49.26 PM.png │ ├── Screen Shot 2019-02-04 at 2.02.42 PM.png │ ├── Screen Shot 2019-02-04 at 2.04.20 PM.png │ ├── Screen Shot 2019-02-04 at 2.08.28 PM.png │ ├── Screen Shot 2019-02-04 at 2.25.27 PM.png │ └── Screen Shot 2019-02-04 at 2.26.02 PM.png ├── serverless-data-analysis-bigquery-cloud-dataflow-gcp │ ├── Module 1: Serverless Data Analysis with BigQuery │ │ ├── 00-cpb101-bigquery-query.txt │ │ ├── 01-course-introduction.txt │ │ ├── 02-who-is-a-data-engineer.txt │ │ ├── 03-course-overview-and-agenda.txt │ │ ├── 04-what-is-bigquery.txt │ │ ├── 05-evaluating-bigquery.txt │ │ ├── 06-architecting-a-bigquery-project.txt │ │ ├── 07-running-a-query.txt │ │ ├── 08-lab-serverless-data-analysis-java-python-part-1.txt │ │ ├── 10-load-and-export-data.txt │ │ ├── 11-complex-queries-and-functions.txt │ │ ├── 12-advanced-capabilities-in-bigquery.txt │ │ ├── 13-processing-bigquery-data-types.txt │ │ ├── 14-standard-sql-and-window-functions.txt │ │ ├── 15-user-defined-functions.txt │ │ ├── 17-optimize-for-performance-and-pricing.txt │ │ ├── 18-tables-and-partitioning-for-performance.txt │ │ └── 19-bigquery-plans-and-categories.txt │ ├── Module 2: Autoscaling 
Data Processing Pipelines with Dataflow │ │ ├── 00-links.txt │ │ ├── 01-dataflow-and-its-capabilities.txt │ │ ├── 02-write-data-pipelines-in-java-and-python.txt │ │ ├── 03-execute-data-pipelines-in-java-and-python.txt │ │ ├── 04-lab-4-review.txt │ │ ├── 05-mapreduce-and-parallel-processing.txt │ │ └── 06-transforms-in-cloud-dataflow.txt │ └── images │ │ ├── Screen Shot 2018-01-04 at 2.29.44 PM.png │ │ ├── Screen Shot 2018-01-04 at 2.32.56 PM.png │ │ ├── Screen Shot 2018-01-04 at 3.23.17 PM.png │ │ ├── Screen Shot 2018-01-05 at 2.43.03 PM.png │ │ ├── Screen Shot 2018-01-24 at 3.09.40 PM.png │ │ ├── Screen Shot 2018-01-24 at 3.14.06 PM.png │ │ ├── Screen Shot 2018-01-24 at 3.20.16 PM.png │ │ ├── Screen Shot 2018-01-25 at 3.27.53 PM.png │ │ ├── Screen Shot 2018-01-25 at 4.07.18 PM.png │ │ ├── Screen Shot 2018-01-26 at 12.37.32 PM.png │ │ ├── Screen Shot 2018-01-26 at 12.38.10 PM.png │ │ ├── Screen Shot 2018-01-26 at 12.40.20 PM.png │ │ ├── Screen Shot 2018-01-26 at 12.43.24 PM.png │ │ ├── Screen Shot 2018-01-26 at 12.45.19 PM.png │ │ ├── Screen Shot 2018-01-26 at 12.47.06 PM.png │ │ ├── Screen Shot 2018-01-26 at 3.59.15 PM.png │ │ ├── Screen Shot 2018-01-26 at 4.11.15 PM.png │ │ └── Screen Shot 2018-01-26 at 4.18.46 PM.png └── serverless-machine-learning-gcp │ ├── Module 1: Getting Started with Machine Learning │ ├── 00-create-machine-learning-datasets-lab-1a.md │ ├── 01-module-1-overview.md │ ├── 02-what-is-machine-learning-ml.md │ ├── 03-playing-with-machine-learning-ml.md │ ├── 04-a-neural-network-playground.ml │ ├── 05-combinations-and-hierarchies-of-features.md │ ├── 06-engineering-features-layers-and-neurons.md │ ├── 07-the-reality-of-machine-learning.md │ ├── 08-covering-all-use-cases.md │ ├── 09-negative-examples-and-near-misses.md │ ├── 10-explore-the-data-and-fix-problems.md │ ├── 11-think-carefully-about-error-metrics.md │ ├── 12-classification-accuracy.md │ ├── 13-changing-the-model-threshold-pt-2.md │ ├── 
14-creating-machine-learning-datasets-for-regression-problems.md │ ├── 15-split-dataset-and-model-experimentation.md │ ├── 16-evaluating-the-final-model.md │ ├── 17-create-ml-datasets-lab-overview.md │ └── what-is-machine-learning-ml.pdf │ ├── Module 2: Building ML models with Tensorflow │ ├── 00-getting-started-with-tensorflow-lab-2a.md │ ├── 01-module-2-overview.md │ ├── 02-building-machine-learning-models-with-tensorflow.md │ ├── 03-tensorflow-lab-review.md │ ├── 04-tensorflow-for-machine-learning-lab.md │ ├── 05-tf-learn-lab-review.md │ ├── 06-gaining-more-flexibility-lab.md │ ├── 07-tensorflow-on-big-data-lab-review.md │ ├── 08-the-experiment-framework-lab.md │ └── Screen Shot 2018-02-16 at 3.57.43 PM.png │ ├── Module 3: Scaling ML models with Cloud ML Engine │ ├── 01-scaling-tf-models-with-cloud-ml-engine.md │ ├── 02-why-cloud-ml-engine.md │ └── 03-package-up-a-tensorflow-model.md │ ├── Module 4: Feature Engineering │ ├── 01-creating-good-features.md │ ├── 02-value-should-be-known-for-prediction.md │ ├── 03-quiz-value-knowable-or-not.md │ ├── 04-numeric-with-meaningful-magnitude.md │ ├── 05-4-enough-examples.md │ ├── 06-raw-data-to-numeric-features.md │ ├── 07-good-features-bring-human-insight-to-problems.md │ └── 08-build-effective-ml-with-model-architectures.md │ └── README.md ├── data-engineer-certificate-exam-guide.md ├── know ├── bigquery.md └── bigquery │ ├── best-practices-performance-input.md │ ├── best-practices-performance-patterns.md │ ├── life-of-a-bigquery-streaming-insert.md │ └── partitioned-tables.md ├── lab ├── README.md ├── bigquery-for-data-analysis │ └── Introduction_to_SQL_for_BigQuery_and_Cloud_SQL.md └── data-engineering │ ├── Analyzing_Natality_Data_Using_Datalab_and_BigQuery.md │ ├── Building_an_IoT_Analytics_Pipeline_on_Google_Cloud_Platform.md │ ├── Cloud_TPU_Qwik_Start.md │ ├── Launching_Dataproc_Jobs_with_Cloud_Composer.md │ ├── Predict_Housing_Prices_with_Tensorflow_and_Cloud_ML_Engine.md │ ├── 
Predict_Visitor_Purchases_with_a_Classification_Model_in_BQML.md │ ├── Run_a_Big_Data_Text_Processing_Pipeline_in_Cloud_Dataflow.md │ ├── Simulating_a_Data_Warehouse_in_the_Cloud_Using_BigQuery_and_Dataflow.md │ ├── Weather_Data_in_BigQuery.md │ └── Working_with_Google_Cloud_Dataprep.md ├── practice-exam └── data-engineer.md └── study-guide.md /coursera/building-resilient-streaming-systems-gcp/Module 1: Architecture of Streaming Analytics Pipelines/_6fea563bb557571a99ef8afbff6df258_Lab1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 1: Architecture of Streaming Analytics Pipelines/_6fea563bb557571a99ef8afbff6df258_Lab1.pdf -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 1: Architecture of Streaming Analytics Pipelines/handling-variable-data-volumes.md: -------------------------------------------------------------------------------- 1 | ## Variable volumes makes it possible to derive real-time insights from growing data 2 | So challenge number one, variable volumes. So volumes keep changing, and so you need the ability of your ingest to scale. So think again of our credit card scenario. The volume of transactions changes between holidays, nights, etc. 3 | And in spite of variable volumes, our ingest application shouldn't crash. It needs to be available. 4 | At the same time, we need any messages that we receive to remain received. 5 | We shouldn't have people making a purchase and somehow, we've lost the record of when they've made that purchase. In other words, it needs to be durable. So available, our ingest shouldn't crash. And durable, we shouldn't lose these messages. Both of these are very important. 
6 | So when we talk about scaling here, we're essentially talking about being able to deal with faults as clients, servers, storage systems, etc., fail unexpectedly. So how do you do that? How can you be fault tolerant while dealing with spikes in data, changes in the volume of data? 7 | 8 | ## Tightly coupled services propagate failures 9 | What you cannot do is tightly couple the sender and the receiver. 10 | If the sender is directly sending messages to the receiver, and the receiver gets overwhelmed and crashes, then either all of those messages get lost, so it's not durable in other words, or the sender itself gets overwhelmed, so the sender won't be available in other words. So if you directly couple the sender with the receiver, you either have durability problems or you have availability problems. 11 | So this is the case when you have, for example, a fan-in. So you have multiple senders sending messages to a receiver. 12 | This diagram illustrates what happens when one of the senders has a problem. Perhaps it's sending lots and lots of messages. Now, the receiver is faced with that problem and so the receiver crashes, and that causes the whole system to go away. 13 | This third diagram illustrates a fan-out scenario. So the sender is sending messages to multiple receivers, and this illustrates what happens if a receiver has a problem. 14 | Now, the sender has to keep storing those messages for this receiver and at some point it gets overwhelmed, and it crashes, and brings the whole system down with it. So in other words, if you have tight coupling between a sender and a receiver 15 | then errors propagate and that's not a good thing. 16 | 17 | ## Loosely-coupled systems scale better 18 | So the solution to this is to buffer. So don't send messages directly from the publisher to the subscriber. But instead, send messages to a message bus. 19 | And the message bus buffers these messages. And this architecture scales a lot better. 
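The buffering idea above can be sketched in a few lines of Python. This is only an in-process illustration, assuming a standard-library `queue.Queue` as a stand-in for the message bus; it is not Cloud Pub/Sub and offers no real durability. The names `publisher`, `subscriber`, `run_demo`, and the `sensor-…` message strings are hypothetical and do not come from the course.

```python
import queue
import threading


def publisher(bus, messages):
    """Put each message on the bus; the publisher never talks to a subscriber."""
    for message in messages:
        bus.put(message)


def subscriber(bus, received, stop):
    """Drain the bus until the stop marker arrives."""
    while True:
        message = bus.get()
        if message is stop:
            break
        received.append(message)


def run_demo():
    bus = queue.Queue()   # stand-in for the message bus (assumption)
    stop = object()       # hypothetical end-of-stream marker
    received = []

    # Fan-in: two publishers send while no subscriber is running yet.
    publishers = [
        threading.Thread(
            target=publisher,
            args=(bus, [f"sensor-{i}:{n}" for n in range(3)]),
        )
        for i in range(2)
    ]
    for thread in publishers:
        thread.start()
    for thread in publishers:
        thread.join()

    # The bus held every message while the subscriber was away.
    buffered = bus.qsize()

    # The subscriber "comes back" and consumes what was buffered.
    consumer = threading.Thread(target=subscriber, args=(bus, received, stop))
    consumer.start()
    bus.put(stop)
    consumer.join()
    return buffered, received


if __name__ == "__main__":
    buffered, received = run_demo()
    print(buffered, len(received))
```

Because the publishers only ever touch the bus, a slow or absent subscriber does not block or crash them in this sketch; a managed service would add persistence and scaling on top of the same shape.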
20 | The architecture handles variable volumes. All the scenarios, fan-in, fan-out, right? Whether it's fan-in, whether it's fan-out, it all still works. So, the message bus takes care of durability. It takes care of availability. So subscribers, publishers, and the message bus are loosely coupled now. And because they're loosely coupled, it scales a lot better. 21 | So if a subscriber goes away, the message bus basically just keeps those messages until the subscriber comes back and starts consuming those messages again. Or if a publisher goes away, that's not a problem either: when the publisher comes back, it starts sending messages again. 22 | So, this is pretty tolerant to failures at every stage. 23 | -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/codelab-publish-streaming-data-into-pub-sub.md: -------------------------------------------------------------------------------- 1 | Overview 2 | Google Cloud Pub/Sub is a fully-managed real-time messaging service that allows you to send and receive messages between independent applications. Use Cloud Pub/Sub to publish and subscribe to data from multiple sources, then use Google Cloud Dataflow to understand your data, all in real time. 3 | 4 | In this lab you will simulate your traffic sensor data into a Pubsub topic, to be processed later by a Dataflow pipeline before finally ending up in a BigQuery table for further analysis. 
5 | 6 | What you learn 7 | In this lab, you will learn how to: 8 | 9 | Create a Pubsub topic and subscription 10 | Simulate your traffic sensor data into Pubsub 11 | Begin the lab 12 | https://codelabs.developers.google.com/codelabs/cpb104-pubsub/ 13 | -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.13.18 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.13.18 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.13.23 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.13.23 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.16.34 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.16.34 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 2: 
Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.17.24 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.17.24 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.18.16 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.18.16 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.19.34 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.19.34 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.25.11 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.25.11 PM.png 
-------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.28.25 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.28.25 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.32.38 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.32.38 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.36.25 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.36.25 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.52.57 PM.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 3.52.57 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 4.21.48 PM 1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 2: Ingesting Variable Volumes/img/Screen Shot 2018-03-21 at 4.21.48 PM 1.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-23 at 10.29.26 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-23 at 10.29.26 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-23 at 10.34.51 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-23 at 10.34.51 AM.png -------------------------------------------------------------------------------- 
/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-23 at 10.36.36 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-23 at 10.36.36 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-23 at 10.37.54 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-23 at 10.37.54 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-23 at 10.38.28 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-23 at 10.38.28 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-23 at 10.40.02 AM.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-23 at 10.40.02 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-23 at 10.41.30 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-23 at 10.41.30 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-23 at 10.49.25 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-23 at 10.49.25 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.26.56 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.26.56 PM.png -------------------------------------------------------------------------------- 
/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.27.10 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.27.10 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.28.07 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.28.07 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.32.26 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.32.26 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.35.00 PM.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.35.00 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.38.51 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.38.51 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.39.48 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.39.48 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.43.58 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.43.58 PM.png -------------------------------------------------------------------------------- 
/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.44.33 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.44.33 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.45.19 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.45.19 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.47.18 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.47.18 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.48.33 PM.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.48.33 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.50.47 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.50.47 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.54.12 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-26 at 12.54.12 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.08.36 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.08.36 PM.png -------------------------------------------------------------------------------- 
/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.09.07 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.09.07 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.11.22 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.11.22 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.14.31 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.14.31 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.14.44 PM.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.14.44 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.16.20 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.16.20 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.17.01 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.17.01 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.17.50 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.17.50 PM.png -------------------------------------------------------------------------------- 
/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.18.48 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.18.48 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.22.24 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.22.24 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.23.24 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.23.24 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.25.51 PM.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.25.51 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.27.20 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.27.20 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.28.27 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.28.27 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.29.26 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.29.26 PM.png -------------------------------------------------------------------------------- 
/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.58.20 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 3: Implementing Streaming Pipelines/img/Screen Shot 2018-03-27 at 5.58.20 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 4: Streaming analytics and dashboards/img/Screen Shot 2018-03-28 at 2.51.42 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 4: Streaming analytics and dashboards/img/Screen Shot 2018-03-28 at 2.51.42 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 4: Streaming analytics and dashboards/img/Screen Shot 2018-03-28 at 2.53.48 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 4: Streaming analytics and dashboards/img/Screen Shot 2018-03-28 at 2.53.48 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 4: Streaming analytics and dashboards/img/Screen Shot 2018-03-28 at 2.55.30 PM.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 4: Streaming analytics and dashboards/img/Screen Shot 2018-03-28 at 2.55.30 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 4: Streaming analytics and dashboards/img/Screen Shot 2018-03-28 at 2.57.40 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 4: Streaming analytics and dashboards/img/Screen Shot 2018-03-28 at 2.57.40 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 4: Streaming analytics and dashboards/img/Screen Shot 2018-03-28 at 2.59.04 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 4: Streaming analytics and dashboards/img/Screen Shot 2018-03-28 at 2.59.04 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 4: Streaming analytics and dashboards/img/Screen Shot 2018-03-28 at 3.00.17 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 4: Streaming analytics and dashboards/img/Screen Shot 2018-03-28 at 3.00.17 PM.png -------------------------------------------------------------------------------- 
/coursera/building-resilient-streaming-systems-gcp/Module 4: Streaming analytics and dashboards/img/Screen Shot 2018-03-28 at 3.00.47 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 4: Streaming analytics and dashboards/img/Screen Shot 2018-03-28 at 3.00.47 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 4: Streaming analytics and dashboards/img/Screen Shot 2018-03-28 at 3.07.00 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 4: Streaming analytics and dashboards/img/Screen Shot 2018-03-28 at 3.07.00 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/01-module-intro-and-agenda.md: -------------------------------------------------------------------------------- 1 | So far in this course, we have looked at streaming data and how to build resilient streaming pipelines. We looked at how to ingest variable volumes of data, we looked at how to process data that could be late or unordered using Dataflow, and then we looked at how to run queries on data even as it's streaming in using BigQuery, and how to display that data with Data Studio. But what we haven't yet looked at is other options as far as the sink is concerned. 
BigQuery is a very good general-purpose solution, something that would work in most cases that you're worried about, but every once in a while you will come across a situation where the latency of BigQuery is going to be problematic. In BigQuery, the data that's streaming in is available in a matter of seconds, and sometimes you will want lower latency than that. You'll want your information to be available in a matter of milliseconds, for example, or microseconds. You may also have throughput issues, where the throughput of BigQuery, which is about 100,000 records a second, may not be enough, and you may want to deal with higher throughput. And so, what we're going to be looking at in this final chapter is how to handle such throughput and latency requirements. When BigQuery is not enough, where do you go? We will talk about Cloud Spanner and we'll talk about Bigtable. These are going to be two of the options that we could consider, and then we'll spend a lot of time looking at Bigtable. We'll look at how to design for Bigtable. Specifically, how to design schemas and how to design the row key for Bigtable. We'll look at how to ingest data into Bigtable. We'll do a lab that essentially takes a Dataflow pipeline that is currently streaming into BigQuery and modifies it so that it still streams the average speeds into BigQuery, but streams the current conditions, which is 30 times more data, into Bigtable. And then finally, we'll look at some performance considerations. 2 | -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/03-what-is-bigtable.md: -------------------------------------------------------------------------------- 1 | ## Bigtable: big, fast, autoscaling NoSQL 2 | So, Bigtable is big, it's fast, it's autoscaling, and it's NoSQL. So it's big. How big? 
You can deal with more than a terabyte of data. This data can be semi-structured or structured. And it's really meant for data that's very fast changing, that has a very high throughput, and you use it when you don't need transactions, when you don't need strong relational semantics. So what do people use Bigtable for? It tends to be used a lot for time series data, financial data, sensor data, data with a natural ordering in terms of time. It also gets used for real-time processing, asynchronous batch operations, and increasingly, it is getting used for data involving machine learning algorithms, especially machine learning algorithms that do continuous training. So with Cloud Bigtable, you get global availability. It's like Pub/Sub again. You can put your service and data where you want, but unlike Pub/Sub, unlike BigQuery, it is not cluster-free. So you're thinking in terms of a cluster of nodes with Bigtable. As with everything else on GCP, your data in Bigtable is encrypted in flight and at rest, you get full control of the data, and you get the Identity and Access Management that's common to the platform. And you get redundant autoscaling storage. So all the data that you put into Bigtable is durable, it's replicated, and you can get access to it. So like BigQuery, Bigtable also separates out computing and storage. What do I mean by that? Didn't I just say that Bigtable uses clusters? Well, it uses clusters, but those clusters only contain pointers to the data. They don't contain the data itself. So the clusters consist of nodes, these nodes all contain the metadata, and the data itself remains on Colossus. It remains on Google Cloud storage. So, think in terms of data that's stored in contiguous rows. You basically have rows of data, and those rows of data that are stored contiguously, we talk of them as tablets. So all of those tablets of data are stored on GCS. 
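The idea of tablets as contiguous runs of rows, with nodes holding only references to them, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Bigtable is actually implemented: the names `Tablet`, `Node`, `split_into_tablets`, `assign`, and `scan` are all hypothetical, and the fixed tablet size is chosen only for illustration.

```python
from bisect import bisect_right


class Tablet:
    """Hypothetical stand-in for a tablet: a contiguous, sorted run of rows."""

    def __init__(self, rows):
        self.rows = sorted(rows)           # list of (row_key, value) pairs
        self.start_key = self.rows[0][0]   # smallest row key in this tablet


class Node:
    """Hypothetical node: holds references to tablets, never copies of the rows."""

    def __init__(self):
        self.tablets = []


def split_into_tablets(rows, tablet_size):
    """Sort rows by key and cut them into contiguous tablets of a fixed size."""
    ordered = sorted(rows)
    return [Tablet(ordered[i:i + tablet_size])
            for i in range(0, len(ordered), tablet_size)]


def assign(tablets, nodes):
    """Hand out tablet references to nodes round-robin (an assumed policy)."""
    for i, tablet in enumerate(tablets):
        nodes[i % len(nodes)].tablets.append(tablet)


def scan(tablets, start_key, end_key):
    """Return rows with start_key <= key < end_key and the number of tablets read."""
    starts = [t.start_key for t in tablets]
    first = max(bisect_right(starts, start_key) - 1, 0)
    result, touched = [], 0
    for tablet in tablets[first:]:
        if tablet.start_key >= end_key:
            break  # later tablets only hold larger keys
        touched += 1
        result.extend((k, v) for k, v in tablet.rows if start_key <= k < end_key)
    return result, touched


rows = [(f"sensor{i:03d}", i) for i in range(12)]
tablets = split_into_tablets(rows, tablet_size=4)
nodes = [Node(), Node()]
assign(tablets, nodes)

# A range scan over neighbouring keys reads a single tablet.
matches, touched = scan(tablets, "sensor004", "sensor007")

# Rebalancing moves a reference between nodes; the rows themselves stay put.
nodes[1].tablets.append(nodes[0].tablets.pop())
```

In this sketch, rows whose keys sort next to each other land in the same tablet, so a range scan touches few tablets, and moving a tablet between nodes is cheap because only a reference changes hands.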
And what the nodes contain is essentially pointers to those tablets of data. And so whenever the clients want to do processing, they basically say, "Here's the processing I want to do," and the nodes carry out the processing. The data gets shuffled into those nodes to do the computation. But when the nodes read the data, they essentially read contiguous rows of data. This is going to be important when we think about the design of Bigtable and how to optimize its performance. 3 | -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.29.53 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.29.53 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.38.46 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.38.46 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.39.00 PM.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.39.00 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.39.34 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.39.34 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.41.16 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.41.16 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.42.28 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.42.28 PM.png 
-------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.43.21 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.43.21 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.43.22 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.43.22 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.45.53 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.45.53 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.46.49 PM.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.46.49 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.48.39 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.48.39 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.54.31 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.54.31 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.57.53 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency 
Requirements/img/Screen Shot 2018-03-28 at 3.57.53 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.59.14 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 3.59.14 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.09.20 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.09.20 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.10.50 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.10.50 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.14.49 PM.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.14.49 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.21.16 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.21.16 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.21.29 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.21.29 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.22.04 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency 
Requirements/img/Screen Shot 2018-03-28 at 4.22.04 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.22.21 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.22.21 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.22.42 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.22.42 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.23.15 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.23.15 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.23.36 PM.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.23.36 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.24.43 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.24.43 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.24.53 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.24.53 PM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Screen Shot 2018-03-28 at 4.25.24 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency 
Requirements/img/Screen Shot 2018-03-28 at 4.25.24 PM.png
--------------------------------------------------------------------------------
/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/Updates:
--------------------------------------------------------------------------------
1 | Script that will cd into the charts directory
2 | Script that will call git update
3 |
4 | First, I recommend against using git pull. Instead, create a safer git up alias:
5 | https://stackoverflow.com/q/15316601/712605
6 |
7 | git config --global alias.up '!git remote update -p; git merge --ff-only @{u}'
8 | See this answer for an explanation of git up.
9 |
10 | Then you can safely script it:
11 |
12 | #!/bin/sh
13 | for repo in repo1 repo2 repo3 repo4; do
14 |     (cd "${repo}" && git checkout master && git up)
15 | done
16 |
--------------------------------------------------------------------------------
/coursera/building-resilient-streaming-systems-gcp/Module 5: Handling Throughput and Latency Requirements/img/commit yesterday:
--------------------------------------------------------------------------------
1 | git commit --amend --date="2018-03-13 12:00:00"
2 |
3 | Enhancement
4 | [incubator/fluentd-elasticsearch] Adjust the configMap to allow configurable system.conf
5 |
6 | We would like to propose adding the `containers.input.conf` data configuration,
7 | to the `values.yaml` file, so that we can override the default value.
8 |
9 | Currently the `containers.input.conf` defines the path as `path /var/log/containers/*.log`
10 | on some Kubernetes environments this path does not exist.
11 | -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/README.md: -------------------------------------------------------------------------------- 1 | # Building Resilient Streaming Systems on Google Cloud Platform 2 | 3 | About this course: This 1-week, accelerated on-demand course builds upon Google Cloud Platform Big Data and Machine Learning Fundamentals. Through a combination of video lectures, demonstrations, and hands-on labs, you'll learn how to build streaming data pipelines using Google Cloud Pub/Sub and Dataflow to enable real-time decision making. You will also learn how to build dashboards to render tailored output for various stakeholder audiences. 4 | 5 | Prerequisites: 6 | • Google Cloud Platform Big Data and Machine Learning Fundamentals (or equivalent experience) 7 | • Some knowledge of Java 8 | 9 | Objectives: 10 | • Understand use-cases for real-time streaming analytics 11 | • Use the Google Cloud Pub/Sub asynchronous messaging service to manage data events 12 | • Write streaming pipelines and run transformations where necessary 13 | • Get familiar with both sides of a streaming pipeline: production and consumption 14 | • Interoperate Dataflow, BigQuery and Cloud Pub/Sub for real-time streaming and analysis 15 | 16 | 17 | Welcome to this course on resilient streaming applications on Google Cloud Platform. The first chapter is on the architecture of streaming analytics pipelines. This course discusses what stream processing is, how it fits into a big data architecture, when stream processing makes sense, and what Google Cloud technologies and products you can choose from to build a resilient streaming data processing solution. We'll also discuss the challenges associated with stream data processing. There are three key challenges: handling variable data volumes, dealing with unordered or late data, and deriving insights from data even as it's streaming in.
And in this chapter we'll also do a lab, which is a pen and paper one, where you will fit some typical streaming scenarios into this streaming architecture. 18 | 19 | ## Module 1: Architecture of Streaming Analytics Pipelines 20 | ### What Is Streaming? 21 | ### Challenge #1: Handling Variable Data Volumes 22 | ### Challenge #2: Dealing with Unordered/Late Data 23 | ### Challenge #3: Derive Insights From Data 24 | ### Lab: Derive Insights From Data 25 | 26 | ## Module 2: Ingesting Variable Volumes 27 | ## Module 3: Implementing Streaming Pipelines 28 | ## Module 4: Streaming analytics and dashboards 29 | ## Module 5: Handling Throughput and Latency Requirements 30 | -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.37.56 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.37.56 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.39.00 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.39.00 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.44.43 AM.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.44.43 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.46.45 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.46.45 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.47.57 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.47.57 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.50.07 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.50.07 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.53.24 AM.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.53.24 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.55.31 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.55.31 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.58.54 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.58.54 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.59.38 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 10.59.38 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 11.02.26 AM.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 11.02.26 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 11.02.56 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 11.02.56 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 11.04.23 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 11.04.23 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 11.05.56 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 11.05.56 AM.png -------------------------------------------------------------------------------- /coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 11.16.49 AM.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/building-resilient-streaming-systems-gcp/img/Screen Shot 2018-03-20 at 11.16.49 AM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/02-welcome-to-foundations-of-gcp-compute-and-storage.txt: -------------------------------------------------------------------------------- 1 | Let's start talking about the foundations of GCP. And the foundations of GCP lie in its computing and storage infrastructure. Any computer consists of computing, and storage, and networking to connect the computing and storage. The cloud computer is also a computer. It's a global computer, but it also contains a compute engine. It contains storage, and networking that you don't directly interact with but that's there in any case, for you to connect the computing that you are doing with the data that you have stored. So in this module we look at the foundations of Google Cloud Platform: Compute Engine and Cloud Storage. 2 | -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/11-module-2-resources.txt: -------------------------------------------------------------------------------- 1 | 0:00 2 | So, before I leave you here, let's take a look at a few of the resources. So listed there is the documentation: cloud.google.com/compute is on Compute Engine, /storage is on storage, and /pricing is on the pricing calculator. This is actually a pretty cool thing and something that you want to get very familiar with. So let's go ahead and go to the pricing calculator. And, the pricing calculator, so there's our calculator. 3 | 0:35 4 | So let's go to the calculator. 5 | 0:39 6 | And let's say that I have a workload that requires 15 instances for doing something that's a Linux thing; it's a regular VM.
And the VM type is going to be n1-standard-4, and it's going to use four SSDs, and it's going to be running, let me put that again, it's going to be running three hours a day for three days a week, and add this to the estimate. 7 | 1:21 8 | And this is going to cost us $381 a month at the time that I'm recording this. Again, go back and check the pricing; pricing keeps changing. Usually it keeps dropping, but it just keeps changing. And let's say in addition to that we want Cloud Storage: we want to store within a region and, let's say, we want to store 100 terabytes of data. And we can add that to the estimate and that's basically what that's going to cost us. 9 | 1:59 10 | In addition to the pricing, the other thing that I'm going to talk about here is something called the Cloud Launcher. 11 | 2:05 12 | Remember that we decided that we're going to create a Compute Engine VM and be able to install software on it. Well, lots of times, the software that you want to install on a machine has already been installed by someone else. So let's say, for example, that we want something from Cloud Launcher that has WordPress. 13 | 2:30 14 | Okay, there it is, right? So here is WordPress, Click to Deploy. And we can say, I want to basically launch WordPress on Compute Engine. And that's basically going to give us a single VM with WordPress already installed on it, and this is typically done by partners of Google that take a Compute Engine VM and provide what's called a Deployment Manager script. This is just a configuration file, so you can use this to even customize your own virtual machines. So you could basically say, I'm going to create a virtual machine and install these pieces of software and do these installation steps. And that's basically what my deployment script is. So what's being used for the Cloud Launcher can also be extremely helpful if you're doing your own.
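The usage figures entered into the pricing calculator in this walk-through (15 instances, three hours a day, three days a week, quoted at $381 a month at recording time) can be turned into instance-hours with a few lines of arithmetic. The sketch below is illustrative only: the function names and the 52-weeks-over-12-months convention are assumptions, not the calculator's documented method, and the implied rate simply divides the quoted total by the hours, so it bundles the SSDs and any discounts together.

```python
# Rough sketch of the usage arithmetic behind a calculator-style estimate.
# Function names and the weeks-per-month convention are assumptions made
# for illustration; they are not taken from the GCP pricing calculator.

WEEKS_PER_YEAR = 52
MONTHS_PER_YEAR = 12

def monthly_instance_hours(instances, hours_per_day, days_per_week):
    """Average instance-hours per month for a part-time workload."""
    weekly_hours = instances * hours_per_day * days_per_week
    return weekly_hours * WEEKS_PER_YEAR / MONTHS_PER_YEAR

def implied_hourly_rate(monthly_cost, instance_hours):
    """All-in cost per instance-hour implied by a quoted monthly total."""
    return monthly_cost / instance_hours

hours = monthly_instance_hours(instances=15, hours_per_day=3, days_per_week=3)
rate = implied_hourly_rate(monthly_cost=381, instance_hours=hours)
print(f"{hours:.0f} instance-hours per month")
print(f"about ${rate:.2f} per instance-hour, SSDs included")
```

Because prices change, as the transcript notes, the quoted total should be treated as an input to re-check rather than a constant.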
15 | 16 | Resources 17 | 18 | Compute Engine: https://cloud.google.com/compute/ 19 | Storage: https://cloud.google.com/storage/ 20 | Pricing: https://cloud.google.com/pricing/ 21 | Cloud Launcher: https://cloud.google.com/launcher/ 22 | Pricing Philosophy: https://cloud.google.com/pricing/philosophy/ 23 | -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/01-intro-to-managed-services-for-common-use-cases.txt: -------------------------------------------------------------------------------- 1 | In this module, we'll look at how to do data analysis on the Cloud using tools that you're probably familiar with: a relational database and the big data tools that are part of the Hadoop ecosystem. What we are looking at is how you can take existing programs that work with a relational database, in particular a MySQL database, or with Hadoop, in particular a Spark program, and migrate them so that you can run those programs on Google Cloud. Essentially what Google Cloud provides you in those cases are managed services. So, we will look at managed services for common use cases. 2 | -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/07-module-3-review.txt: -------------------------------------------------------------------------------- 1 | 0:00 2 | So let's do a quick module review. Relational databases are a good choice when you need which of the following? Streaming high-throughput writes, no. What about this is problematic for relational databases? 3 | 0:18 4 | It's the high throughput. Relational databases may not be able to support very, very high throughput. I mean, again, it depends on the throughput that you care about, but because they're transactional, writes tend to be relatively slow
compared to storage mechanisms that don’t try to manage transactions. 5 | 0:40 6 | Second, fast queries on terabytes of data. 7 | 0:45 8 | Now, the problem here is terabytes; it's this quantity. Relational databases scale pretty well to a few hundred gigabytes, but not more than that. And even at a few hundred gigabytes, you start running into problems of scale. 9 | 1:01 10 | Third, aggregations on unstructured data. No, the problem here is unstructured. Aggregations are no problem. We can do sum, we can do average, we can do count. Those are all very standard SQL keywords. The problem is unstructured. You cannot store a column of pixels from an image in a relational database. That's just not an appropriate use of a relational database. 11 | 1:30 12 | Transactional updates on relatively small datasets. Absolutely. The reason you want to use a relational database is if you need transactions and if your datasets are a few hundred megs to a few gigs; that's the sweet spot for a relational database. It's perfect to use for that, and that's a huge fraction of your use cases, but you want to be aware of the exceptions, the cases where relational databases may not be the best choice: if you have high throughput, if you have terabytes of data, or if you have unstructured data. For those cases, we will talk about alternatives in the next chapter. 13 | 2:16 14 | Next question: Cloud SQL and Cloud Dataproc offer familiar tools. MySQL, in the case of Cloud SQL. 15 | 2:26 16 | Hadoop, Pig, Hive and Spark in the case of Cloud Dataproc. So what's the value-add provided by GCP? 17 | 2:35 18 | A, it's the same API, but Google implements it better. 19 | 2:39 20 | No, it's the exact same code. In fact, if we find improvements, we basically go back and check them into the original open source code base. So it is the exact same code. We're not making any changes. So that answers the second question too: there are no Google proprietary extensions.
If there are bug fixes, they go back into the main branch. 21 | 3:03 22 | Third, fully managed versions of the software offer no-ops, absolutely. So you basically get fully managed versions of MySQL, or of Spark that just run on the Dataproc cluster. You don't need to install anything, you don't have to worry about managing those kinds of infrastructure, that's all taken care of for you. 23 | 3:25 24 | Google infrastructure offers reliability and cost savings. Absolutely. As we talked about, and this is very relevant in the case of something like Dataproc, the way you approach Hadoop and Pig and Hive jobs changes as soon as you think about putting your data on Cloud Storage and reading it from your Dataproc instance. 25 | -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/README.MD: -------------------------------------------------------------------------------- 1 | https://cloud.google.com/sql/ 2 | https://cloud.google.com/dataproc/ 3 | https://cloud.google.com/solutions/ 4 | -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-14 at 4.52.32 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-14 at 4.52.32 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.01.51 PM.png: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.01.51 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.02.05 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.02.05 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.02.37 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.02.37 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.03.31 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.03.31 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.04.27 PM.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.04.27 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.05.39 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.05.39 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.06.37 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.06.37 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.07.55 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.07.55 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 
2017-12-15 at 1.11.52 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.11.52 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.21.27 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.21.27 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.21.43 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.21.43 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.22.48 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 1.22.48 PM.png -------------------------------------------------------------------------------- 
/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 12.55.39 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 12.55.39 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 12.55.52 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 12.55.52 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 12.58.04 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 12.58.04 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 4.43.12 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 4.43.12 PM.png 
-------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 5.30.59 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 5.30.59 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 5.31.20 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 5.31.20 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 5.33.07 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 5.33.07 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-15 at 5.38.02 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 
2017-12-15 at 5.38.02 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-17 at 4.34.05 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/Module-3-Data-Analysis-on-the-Cloud/img/Screen Shot 2017-12-17 at 4.34.05 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-4-Scaling-Data-Analysis-Compute-with-GCP/intro-to-scaling-data-analysis-change-how-you-compute-with-gcp.txt: -------------------------------------------------------------------------------- 1 | 0:01 2 | Scaling Data Analysis. In this module, we look at things that are more transformational, things that you may not have done exactly the same way before. So we'll start out by looking at how to do fast random access. And if you're looking at fast random access, you need fast querying. You've probably used a relational database before; we look at other options, and how you can accomplish them on Google Cloud. Similarly, we look at how to do interactive, iterative development, but to do all of your processing on the Cloud using a notebook format. 3 | 0:39 4 | Then, we look at how to warehouse your data, and to carry out interactive querying of extremely large datasets, of petabyte-scale datasets, but still do your querying interactively. And having done that, the next thing that we look at, and continuing on this whole idea of doing the things that are transformational, is that we look at TensorFlow, and how you can do machine learning with TensorFlow, again, on Google Cloud. And then, finally, we will look at pre-built machine learning models that are available.
You can take those machine learning models and incorporate them into your own applications. 5 | -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-4-Scaling-Data-Analysis-Compute-with-GCP/module-4-review.txt: -------------------------------------------------------------------------------- 1 | 0:01 2 | So we've looked at a variety of transformational technologies in this chapter. So let's quickly go through and review them. So let's match the use case on the left with the product on the right. So if we want to search for objects by attribute value, which of these things will we use? Well, we are searching for objects, so it should be one of the data technologies. So is it Bigtable, or is it Datastore? 3 | 0:30 4 | Well, because you're searching by attribute value, it has to be Datastore. Remember that with Bigtable, you can only search by key. High-throughput writes of wide-column data. Well, that is Bigtable, right, because it's supporting high-throughput writes. Warehousing structured data. So what's the data warehouse technology on Google Cloud? That's, which one, BigQuery. To create and test new machine learning methods. Well, if you're writing new machine learning methods, then TensorFlow. Develop Big Data algorithms interactively in Python. Well, interactive development in Python is done best with Datalab. 5 | 1:15 6 | No-ops, custom machine learning applications at scale. No-ops ML at scale, then that's a role for Cloud ML. Automatically reject inappropriate image content. 7 | 1:33 8 | Rejecting image content where it is inappropriate. Well, that could be the Vision API. So you could use the Vision API to basically see if this is safe content or not safe content. Build an application to monitor the Spanish Twitter feed, assuming that you're not in Spain.
The added assumption is that this is a foreign language, and you want to basically figure out what's being said in that foreign language. That's a Natural Language API. Transcribe customer support calls. Well, how do you transcribe? That will be going from speech to text, so that will be the Speech API. 9 | -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-5-Data-Processing-Architectures-Scalable-Ingest-Transform-and-Load/intro-to-data-processing-architectures.txt: -------------------------------------------------------------------------------- 1 | In this last module of the class, we'll do a very quick overview of message-oriented architectures and serverless data pipelines. We will look at serverless data pipelines, in particular Cloud Dataflow, in a lot more detail in the course on serverless data analysis, which is part of the data engineer track. In this module though, we will look at it very, very briefly, very quickly. So you know what it is and you know where to find it if you need it. 2 | -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-5-Data-Processing-Architectures-Scalable-Ingest-Transform-and-Load/message-oriented-architectures.txt: -------------------------------------------------------------------------------- 1 | Asynchronous processing is a great way to absorb shock and change. For example, say you have an application and you've built this application for 100 users. And in order for it to handle 100 users, it's doing all of these things that it needs to do, and then you basically get a spike in usage. Instead of getting 100 users, you now suddenly have 4,000 users. At this point you have two approaches, one is for your application to just crash. The other is for you to kind of queue things up, such that people get responses but they're a little bit delayed.
And that's what asynchronous processing helps you do. The idea is that when somebody submits a job, it goes in. And rather than giving them a response immediately, you basically have the receiving code and the processing code separated. And they're separated by a messaging system. So in other words, new requests come in, they go into a message queue and then you have consumers of this message queue actually processing these requests. So something like asynchronous processing is a very common design paradigm if you need to build highly available systems. The idea being that any request that's sent to the system will get processed. Well, what happens if you have an outage? Well, if you need high availability, then asynchronous processing is one solution. Another thing is to basically balance load across multiple workers, balance high throughput. And the third reason to do that is to reduce coupling. You may have the people producing messages separate from the people consuming those messages. And rather than having the system producing messages held up by the fact that the people receiving the message aren't yet able to accommodate a new library or new way of doing things, you can basically separate them, reduce the coupling by using an asynchronous system, by using a message queue. And this allows more agility within your organization. It's a great way to reduce latency so that you can accept requests really close to the network edge. The idea being that the person making the request doesn't have to make the request all the way up to the service. And instead, they can make a request to the closest point on the network. And then the request can travel on an internal Google fiber all the way through. And it's a good way for you to manage consistency. Such that you can apply the exact same security policies to message processing, regardless of where the message comes from.
Whereas, if you're relying on the client to process these messages, then you may have some issues. So on GCP, the way you can do message-oriented architectures is to use Cloud Pub/Sub. Cloud Pub/Sub offers reliable, real-time messaging that's accessible through HTTP, right? So you can have your HR system basically sending a new hire event or vendor office sending a new contractor event and these are decoupled sources. They know nothing about each other, but they publish their events to a common Pub/Sub topic, the HR topic. And then you could have multiple consumers, each of whom has a subscription to this HR topic. Some of these subscriptions could be pull subscriptions. In other words, whenever the system, the client is ready to process a new message, it goes ahead and asks, are there any new messages? Or it could be a push in which, basically, the client system says, call this endpoint whenever there's a new message for me. And that new endpoint would get called by Pub/Sub whenever there's a new message. So in this way, Pub/Sub can give you reliable delivery, you can get completely decoupled workers. And so Pub/Sub is a good way to handle asynchronous processing. 2 | -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/Module-5-Data-Processing-Architectures-Scalable-Ingest-Transform-and-Load/module-5-review.txt: -------------------------------------------------------------------------------- 1 | So quick module review. We looked at two use cases, the way to decouple producers and consumers of data in complex systems in larger organizations is Pub/Sub. And the way to create scalable, fault-tolerant multi-step processing of data is Cloud Dataflow. I know we've gone through these very quickly. We just talked about the system architecture. Come take the data engineer course, and you'll be able to get deeper hands-on experience with both of these products.
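The decoupling described in this module (independent publishers, one topic, several subscriptions that are either pulled or pushed) can be illustrated with a small in-memory toy. This is only a sketch of the idea, not the Cloud Pub/Sub client library: the `Topic` class, its method names, and the subscription names `directory` and `badges` are all hypothetical, and it has no durability, acknowledgements, or retries.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional


class Topic:
    """Toy in-memory topic (hypothetical; NOT the Cloud Pub/Sub API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Each pull subscription gets its own queue of undelivered messages.
        self._pull_queues: Dict[str, Deque[str]] = {}
        # Each push subscription registers an endpoint (here, a callable).
        self._push_endpoints: Dict[str, Callable[[str], None]] = {}

    def subscribe_pull(self, subscription: str) -> None:
        self._pull_queues[subscription] = deque()

    def subscribe_push(self, subscription: str,
                       endpoint: Callable[[str], None]) -> None:
        self._push_endpoints[subscription] = endpoint

    def publish(self, message: str) -> None:
        # Every subscription receives its own copy of the message.
        for queue in self._pull_queues.values():
            queue.append(message)
        for endpoint in self._push_endpoints.values():
            endpoint(message)

    def pull(self, subscription: str) -> Optional[str]:
        # The consumer asks for work when it is ready; None means "nothing new".
        queue = self._pull_queues[subscription]
        return queue.popleft() if queue else None


def demo() -> Dict[str, List[str]]:
    hr_topic = Topic("hr")
    badge_events: List[str] = []
    hr_topic.subscribe_pull("directory")                    # pull consumer
    hr_topic.subscribe_push("badges", badge_events.append)  # push consumer

    # Two publishers that know nothing about each other.
    hr_topic.publish("new hire: alice")       # e.g. from the HR system
    hr_topic.publish("new contractor: bob")   # e.g. from the vendor office

    directory_events: List[str] = []
    while (msg := hr_topic.pull("directory")) is not None:
        directory_events.append(msg)
    return {"directory": directory_events, "badges": badge_events}


if __name__ == "__main__":
    print(demo())
```

In this toy, the push subscriber sees each message at publish time, while the pull subscriber's messages wait in its queue until it asks for them, which is the "slightly delayed but not dropped" behavior the lecture attributes to asynchronous processing.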
2 | 3 | Cloud Pub/Sub: https://cloud.google.com/pubsub/ 4 | Cloud Dataflow: https://cloud.google.com/dataflow/ 5 | Reliable task scheduling on Google Compute Engine: https://cloud.google.com/solutions/reliable-task-scheduling-compute-engine 6 | Real-time data analysis with Kubernetes, Cloud Pub/Sub, and BigQuery: https://cloud.google.com/solutions/real-time/kubernetes-pubsub-bigquery 7 | Processing logs at scale using Cloud Dataflow: https://cloud.google.com/solutions/processing-logs-at-scale-using-dataflow 8 | -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.04.10 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.04.10 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.05.55 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.05.55 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.08.19 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.08.19 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.16.30 PM.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.16.30 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.16.37 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.16.37 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.20.45 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.20.45 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.21.26 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.21.26 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.22.16 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 
12.22.16 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.24.00 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.24.00 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.28.40 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.28.40 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.29.51 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.29.51 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.33.01 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.33.01 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.34.22 PM.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.34.22 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.37.44 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.37.44 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.39.09 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.39.09 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.48.15 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.48.15 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.51.27 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 
12.51.27 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.52.46 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.52.46 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.53.38 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.53.38 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.55.02 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.55.02 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.55.10 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 12.55.10 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 3.37.50 PM.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 3.37.50 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 3.43.46 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 3.43.46 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 3.46.04 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 3.46.04 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 3.48.17 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 3.48.17 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 3.51.35 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 3.51.35 
PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 3.52.09 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 3.52.09 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 4.06.47 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 4.06.47 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 4.07.06 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 4.07.06 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 4.09.20 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 4.09.20 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 4.15.02 PM.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 4.15.02 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 4.18.33 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 4.18.33 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 4.21.31 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 4.21.31 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 4.22.41 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-14 at 4.22.41 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.18.56 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.18.56 PM.png 
-------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.19.07 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.19.07 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.20.37 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.20.37 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.22.56 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.22.56 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.26.20 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.26.20 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.28.34 PM.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.28.34 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.30.27 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.30.27 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.31.38 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.31.38 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.31.55 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.31.55 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.32.14 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.32.14 PM.png 
-------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.33.45 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.33.45 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.36.18 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.36.18 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.46.38 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.46.38 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.49.05 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.49.05 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.49.31 PM.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.49.31 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.53.03 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.53.03 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.53.13 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.53.13 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.56.13 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 7.56.13 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 8.58.52 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 8.58.52 PM.png 
-------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 8.59.52 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 8.59.52 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.08.46 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.08.46 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.10.01 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.10.01 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.10.40 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.10.40 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.13.58 PM.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.13.58 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.14.19 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.14.19 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.14.27 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.14.27 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.14.41 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.14.41 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.16.18 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.16.18 PM.png 
-------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.20.29 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.20.29 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.20.38 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.20.38 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.22.26 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.22.26 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.24.04 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.24.04 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.49.26 PM.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.49.26 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.52.03 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-16 at 9.52.03 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 12.06.29 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 12.06.29 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.16.29 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.16.29 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.17.52 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.17.52 PM.png 
-------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.20.41 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.20.41 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.22.25 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.22.25 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.22.38 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.22.38 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.26.53 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.26.53 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.28.27 PM.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.28.27 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.32.05 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.32.05 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.32.29 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.32.29 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.33.35 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.33.35 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.34.17 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.34.17 PM.png 
-------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.34.49 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.34.49 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.35.21 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.35.21 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.36.00 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/Screen Shot 2017-12-17 at 4.36.00 PM.png -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/img/cards.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/gcp-big-data-ml-fundamentals/img/cards.gif -------------------------------------------------------------------------------- /coursera/gcp-big-data-ml-fundamentals/summary-of-gcp-big-data-and-ml.txt: -------------------------------------------------------------------------------- 1 | In summary then, we've looked at google infrastructure. 
We basically looked at how Google provides global infrastructure in terms of global data centers, a global network, edge locations in a variety of countries, software-defined networking and so on, such that you can build applications similar to the way we do, on the GCP platform. 2 | 3 | And we talked about how the Cloud is evolving, where you may have been doing everything on location, on-premise, and migrated to a virtualized data center. And if you're using Cloud at the level of provisioning virtual machines, and spinning up a VM every time you need to run a job, you're essentially doing the second wave of cloud, but you're still managing these virtual machines, this infrastructure. And in this course we encourage you to look beyond managing your own machines to this third wave of completely elastic, automated services and scalable data. The idea being, whether you are writing SQL queries on BigQuery, or data pipelines on Dataflow, or you're building machine learning models with Cloud Machine Learning, you are using no-ops auto-scaling services on GCP, so that you are focused on the business applications, and Google's focused on the infrastructure. 4 | 5 | So the point of this is that typical big data processing involves programming, provisioning resources, handling growing scale, working on how to manage reliability, deploying, configuring, looking at how much of your machines are getting used, figuring out how to optimize the use of those machines, tuning the performance, monitoring these machines, and in the time that you have left, creating new features and programming new things. On GCP though, working with Big Data is programming. This helps you focus on insight, not on managing and provisioning infrastructure. 
6 | 7 | So in summary, GCP offers you ways to spend less on ops and administration, incorporate real-time data into apps and architectures, apply machine learning broadly and easily, and as an end goal, create citizen data scientists, so that everybody in your organization can work with data. That should be your goal and that's something that GCP can enable for you. 8 | 9 | 10 | Big data and machine learning blog: https://cloud.google.com/blog/big-data/ 11 | 12 | Google Cloud Platform blog: https://cloudplatform.googleblog.com/ 13 | 14 | Google Cloud Platform curated articles: https://medium.com/google-cloud 15 | -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Data-Engineering-01_Dataproc-_Unstructured_.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/leveraging-unstructured-data-dataproc-gcp/Data-Engineering-01_Dataproc-_Unstructured_.pdf -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Module 1: Introduction to Cloud Dataproc/01-why-unstructured-data.txt: -------------------------------------------------------------------------------- 1 | Welcome to Google Cloud Platform, this is Grant Morell. Today we're going to look at Cloud Dataproc, which is Google's managed Hadoop infrastructure. 2 | 3 | Our agenda is to talk about unstructured data, why we would use Cloud Dataproc, how to create a Cloud Dataproc cluster, and some customizations to the computing nodes of the cluster. 4 | 5 | So let's begin. So for the last several years Google has been creating Google Street View by driving specially outfitted cars with lots of cameras on top around cities and creating street views of those cities. 
When they initially did this, they were doing it for use with Google Maps and Google Street View, but since the data set has been created now and is actually being updated rather frequently, it could be applied to so many other things beyond what's happening with it just within Google. 6 | 7 | Look at this image. What can we tell? Well, we have street signs. We have street names. We have street numbers. We have business names. We have traffic signals. We have business facades and frontage. We have a one-time snapshot of how many people are in front of it, and presumably we've got the time of day and the date it was taken, that sort of thing. 8 | 9 | Now I've heard of organizations in New York that are actually using Street View to document street parking spaces, disabled parking spaces, curbs, curbs that support wheelchairs, sidewalks, bus stops, windows and even trees, yeah, trees. By combining satellite views they can identify trees from overhead, so using Google Maps. Then they move down to Street View, because from the map they can get the latitude and longitude coordinates, and then use the Vision API to identify the type of that tree. That way they can actually look at the types of trees and know the population of different types of trees around a city. 10 | 11 | Wow! That's pretty cool. The alternative would be to have somebody walk around the city and actually document them all. Now, with something like Google Street View, the streets can be virtually driven and business opportunities can be looked for. Going beyond what we've identified so far, we could look at windows, we could look at railings. We could look at stonework, we could look at painting. You could look at grass mowing that needs to be done. So there's a lot of data that could actually be determined by just processing an image. Now some big data involves just pure counting. 12 | 13 | Did error rates go down when a bug fix was applied? 
Something that a computer could count. Or a retail store calculating delays in payment processing to identify stores experiencing problems with their credit card machines, and things like that. Stuff that literally is purely quantitative. So it's got a quantity related to it. We need to collect that quantity and we need to look for thresholds. But then there are those problems that require a lot of counting plus human insight into the data. So if I were to ask a question: OK, my programmers are checking in code, but are there any specific programmers checking in low-quality code? Well, how would we know? 14 | 15 | Obviously we would need to create some sort of metric, or we would have to set aside some sort of metric based on the data we have. We just haven't processed it yet. Now we might ask, well, which of our stores is lacking parking spaces? How would we know? Well, that would take video footage of our parking lot, identifying when our parking spaces are in use and when they are not, and whether there is ever a time of day where we exceed the parking spaces that we actually have allocated and people are waiting or people are leaving. 16 | -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Module 1: Introduction to Cloud Dataproc/02-what-data-do-enterprises-analyze.txt: -------------------------------------------------------------------------------- 1 | If you look at the kinds of data that you have in your company. 2 | 3 | What kind of sources of data do you have? 4 | 5 | And what kind of data do you actually analyze? 6 | 7 | You might think of data that you have and that you analyze, data that you don't have but wish you had, data that you have but for whatever reason don't analyze, and data that is probably held by a third party that you could go and acquire. But let's key off of one aspect of it. 
The aspect of data that you have that you don't analyze. Why is that? Why would you have data you're storing it, you're keeping it around. You have it it's accessible. But, for whatever reason you don't analyze it. So what are some of those reasons? 8 | 9 | Maybe, the the amount of data is too large for you to analyze, maybe it's volume. Maybe it's veracity, maybe you're not sure about how good that data is. Is it worth analyzing? Anything else? Well, perhaps it's the fact that it's coming so fast that trying to keep up with that stream of data. The velocity of that data could be too high. 10 | 11 | Well, that's one reason. But, I think that in many cases, it's not just about volume or veracity or about velocity. It's something more fundamental than that. It has to do with the kinds of tools that we're familiar with. We're very familiar with analyzing structured data. If it's data in a database, it's a relational data no problem at all. But, if it's data that's not structured, lots of times we just leave it completely unanalyzed. So, no I'm a Google professional services and I talk to customers quite a bit. And, I posed this question to them, what kind of data that you have that you don't analyze? And, they think about it and they talk about data that's too large you know that's not things that are arriving too fast. But, then I say, "well what do you do about your e-mails?" Surely your customers send you e-mails. What do you do about them? And they look at me like I'm crazy. Who analyzes e-mails? How about newsgroups? What do you do with newsgroups? How about photographs that your technicians take on the field? Nope. So, if you have inspections being carried out and you have photographs what do you do with those? This has to do with the tools that we have available to us; unstructured data, images, free form text. It's genuinely hard. And so, what we're going to be looking at next in the data engineering course, is how to deal with unstructured. 
12 | -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Module 1: Introduction to Cloud Dataproc/03-even-google-skipped-unstructured-data.txt: -------------------------------------------------------------------------------- 1 | So even at Google we face this problem. There was a time that someone at Google had this great idea that we take cameras, put them on top of cars and drive them through every street in the world. Even more amazing to me someone higher up at Google thought that was a really good idea and they would put money into this. So this actually happened, and there were cars that drove through a number of streets in the world. Those photographs that were taken by those cars got surfaced in Google Maps. The way that these were used was that now you would have directions and you say you're going there and there'd be this little picture where it showed this is where you're going, this is what the storefront looks like et cetera. So that was good. It was basically collected and it was basically used and shown to whoever is using Google maps. But that was pretty much it. We didn't have any technology in place to do anything else with that data. So we had all this imagery from all over the world and we're doing nothing with it. But then over time, deep learning came along. Our machine learning improved and we said, "Hey, how about we go back and look at all the imagery that we've collected over time from street view maps?" And we did. We went back to those images and we're looking at those we could say, "Oh, well, here is a street sign. Here is a street number. Here is a business. Here's a road." And we could actually then make it a feedback loop. We could look at these images that were collected by cars driving through and we could basically use them to enhance our maps. That's essentially what you're talking about. 
You may have unstructured data lying around in your company that you're using maybe for one specific purpose. And usually the purpose is to surface it to a human user, because a human user can look at that unstructured data and make sense of it. But you don't have any automated programs that analyze that unstructured data. So what we want to look at is how we can use machine learning, the advances that have happened in machine learning, to analyze those images, that free-form text, etc. When we are looking at that, the thing to realize is that you don't need to start from scratch. You don't have to build an image recognition model in order to take advantage of all the image data that you have. You can use pre-built models and essentially apply them. Those are what we will call the machine learning APIs. So whether it's analyzing free-form text or analyzing images, we will be able to take advantage of pre-built machine learning models. We'll apply those machine learning models to data that we have around, and we will extract information that we can then use. 2 | -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Module 1: Introduction to Cloud Dataproc/04-considering-counting-problems.txt: -------------------------------------------------------------------------------- 1 | So when you think about analytical tools and much of the analysis that we do, they're all about counting. When you look at MapReduce, what are the things that we do on a MapReduce system? Essentially we count things. When I say this to people, they look at me, and they say, no, we do something more complex than that. But think about it, just think about it. Suppose you're trying to figure out, well, are we experiencing delays in payment processing? Surely a delay in payment processing is not just counting.
But what is it that you actually do when you're trying to figure out delays in payment processing? You're going through every transaction, and for every transaction, you're figuring out how long it took to make the payment. And you're taking that and maybe you're adding it up over all the transactions. You're finding a mean of some kind. That's counting. You're adding, you're counting, or you're looking at how many transactions took longer than three days to pay, right? That's essentially a counting problem. And this is what I would call an easy counting problem, and MapReduce is full of easy counting problems. You go through a very large set of data, you look for specific things, and you count the number of times they happen. 2 | 3 | And as long as the things that you are looking for are easy to analyze, it becomes a pretty easy counting problem. 4 | 5 | But there are other kinds of problems that are a little bit harder. Suppose I were to ask you, not about delays in payment processing, but, say, how often are your programmers checking in low quality code? That is also a counting problem. How often someone is doing something is a counting problem. But the thing that you're counting, low quality code, is extremely hard to just extract from data, right? Your data is the code. But how can you look at code and say, is this low quality or high quality, okay? That's what makes it a harder counting problem. If the thing that you're trying to detect is harder, then it makes it a harder counting problem, but it's still a counting problem. Lots of analytical tools are about counting problems. But when we look at counting problems, depending on what it is that we count, they could be easy counting problems, or they could be much harder counting problems.
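
The "easy counting problem" described above, payment delays, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the transactions are made-up data, and `payment_delay_stats` is a hypothetical helper name, not something from the course or from any MapReduce library.

```python
from datetime import date

# Hypothetical transactions as (invoice_date, payment_date) pairs; made-up data.
transactions = [
    (date(2018, 1, 1), date(2018, 1, 2)),
    (date(2018, 1, 1), date(2018, 1, 6)),
    (date(2018, 1, 3), date(2018, 1, 4)),
    (date(2018, 1, 5), date(2018, 1, 10)),
]


def payment_delay_stats(rows, threshold_days=3):
    """Return (mean delay in days, number of payments slower than the threshold)."""
    # "Map" step: turn every transaction into a delay in days.
    delays = [(paid - invoiced).days for invoiced, paid in rows]
    # "Reduce" step: add the delays up and count the slow ones.
    mean_delay = sum(delays) / len(delays)
    slow = sum(1 for d in delays if d > threshold_days)
    return mean_delay, slow


mean_delay, slow_count = payment_delay_stats(transactions)
print(mean_delay, slow_count)
```

The harder counting problem has the same shape: only the per-record step changes, from a subtraction of two dates to some judgment about whether a piece of code is low quality.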
6 | -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Module 1: Introduction to Cloud Dataproc/06-cluster-provisioning-considerations.txt: -------------------------------------------------------------------------------- 1 | So what's the problem with running everything on one cluster? 2 | 3 | Think about it this way. What happens if you don't have enough computing power on that cluster for the number of jobs that you want to run? 4 | 5 | Or conversely, what happens if you have way too much computing power for the active processes that you want to run? 6 | 7 | And realize that the same cluster at some times may be too much and at other times it may be too little. 8 | 9 | So what you really need, to avoid over-provisioning or under-provisioning a cluster, is to dynamically provision that cluster. 10 | -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Module 1: Introduction to Cloud Dataproc/07-imagine-your-cluster-provisioning.txt: -------------------------------------------------------------------------------- 1 | Imagine that you are using a MapReduce cluster, and you realize that you've underprovisioned it, and you need to go get four or five more machines. What do you have to do when you get those five machines? Can you just add them to the cluster? 2 | 3 | No, you have to do something else. You have to take the data that you originally had, and you need to reshard the data. You need to take the data, move it around, copy it around. 4 | 5 | Now, all the time you're spending recopying the data, redistributing the data, what is the business impact that the cluster is having? 6 | 7 | Nothing. It is not actually solving a business problem. All you're doing is moving data around. 8 | 9 | Or think about a cluster that primarily gets used during work hours.
And someone says, well, I'm noticing that you're really not using this cluster between 10 PM and 6 AM. So let's go find jobs that we can run on this cluster during those hours, and now you're basically beating down every door in your company trying to increase the utilization of your cluster. 10 | 11 | Is that a good use of your time? 12 | 13 | What would you rather spend your time doing? 14 | 15 | Managing resource allocation and provisioning or doing analytics? 16 | -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Module 1: Introduction to Cloud Dataproc/08-dataproc-eases-hadoop-management.txt: -------------------------------------------------------------------------------- 1 | Dataproc eases Hadoop management. Let's start on this slide here from the left. So, when you're running Hadoop on premises you've got your code to worry about. You've got the monitoring and health of the servers. You've got your developer integration. You've got your scaling. You've got to submit jobs. You've got to handle connectivity. You've got to worry about deployment and creation. Now, you can move a little bit farther: you could have a vendor do all that for you. And there are lots of vendors that will provide you with turn-key clusters. So they handle a lot of the setup and configuration, for a price. But because you're running it on premises, you still have to worry about all the power, you still have physical servers, you've got all that on site to work with.
Now, you could use bdutil, which is a free open source toolkit that can help with some of this sort of stuff, and you can basically build a bit of a solution where some of the pieces can be leveraged from the cloud, such as Google Cloud Platform connectivity, deployment and creation, while you still manage your custom code, the monitoring, the developer integration, the scaling and job submission. The ultimate goal, and with Google Cloud Dataproc that's what we're going to be able to achieve, is basically to offload monitoring and health, developer integration, scaling, job submission, connectivity, deployment and creation, all of that, over to Google Cloud Platform. So we can really concentrate on our custom code pieces, and everything else we just use as a service and pay for. So what does a typical Dataproc deployment involve? Well, we go into Google Cloud Platform; we could use the web console, we could use gcloud or we could use the APIs. We basically request that a cluster be created with several parameters, like how many nodes we want in it and the size of those nodes. We then click Create, and about 90 seconds later our cluster is ready to go. OK, you're saying, "Grant, is that really true?" Well, you bet. Let's give it a try. 2 | -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Module 1: Introduction to Cloud Dataproc/09-create-a-cluster-from-the-web.txt: -------------------------------------------------------------------------------- 1 | So, here's the general layout of what the Google Cloud Platform looks like right now. Now I will note that just in the last several weeks some of these future regions and zones have actually come online. So this map is always changing and I encourage you to go and take a look at the Google Cloud Platform Zones and Regions page. Now, let me point you to that really quick.
I'm just going to go to my browser, grab a new tab and search for GCP zones and regions. And almost always the first article that comes up is the Google Zones and Regions page. When you click on it and scroll down a little bit, it'll show you all the current locations where you can run Google Cloud Platform resources, although your results may vary a little bit. 2 | 3 | There are certain regions that can't support some Google products, but in the case of Dataproc, it's just Compute Engine, so it's already supported in all these regions. For example, of the latest ones that came up, europe-west2 I just saw pop up last week, and us-east4 came online fairly recently too. So what we're talking about with this map is that we want to make sure that we put our compute nodes where our data lives. Now, Google runs a network within a zone, so within a Google Cloud Platform zone the data moving between storage and compute is literally passing across what Google calls their petabit bisection network. They didn't build petabit network cards, but when they aggregate all the bandwidth and look at the capabilities, it's on the petabit scale. Now, between zones and also between regions you have less bandwidth. Zones in the same region are basically different buildings, or different segments of buildings with isolation between them, so the bandwidth between them isn't quite at the petabit scale. And then when you go between regions you're looking at fiber optic connections spanning the globe and Google transmitting data across those network segments. So that's why, again, it's really important to choose where your data is relative to where your Dataproc cluster is.
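
The advice above, keep the cluster close to the data, can be turned into a small pre-flight check. The sketch below is only an illustration: `region_of` and `is_colocated` are hypothetical helper names, and it relies on the assumption that a zone name is the region name plus a one-letter suffix (for example `us-east4-a` in region `us-east4`). It does not call any Google Cloud API, and it treats a multi-region bucket location such as `US` as "not the same region".

```python
def region_of(zone: str) -> str:
    """Return the region part of a zone name, e.g. 'us-east4-a' -> 'us-east4'."""
    # Assumption: a zone name is '<region>-<letter>', so drop the last segment.
    return zone.rsplit("-", 1)[0]


def is_colocated(cluster_zone: str, bucket_location: str) -> bool:
    """True when the cluster's region matches the bucket's (regional) location."""
    # Bucket locations are often written in upper case, so compare case-insensitively.
    return region_of(cluster_zone).lower() == bucket_location.lower()


print(is_colocated("europe-west2-b", "EUROPE-WEST2"))  # same region
print(is_colocated("us-east4-a", "europe-west2"))      # different regions
```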
4 | -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Module 1: Introduction to Cloud Dataproc/Data-Engineering-01_Dataproc-_Unstructured_.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/leveraging-unstructured-data-dataproc-gcp/Module 1: Introduction to Cloud Dataproc/Data-Engineering-01_Dataproc-_Unstructured_.pdf -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Module 1: Introduction to Cloud Dataproc/lab.md: -------------------------------------------------------------------------------- 1 | ## Creating Dataproc Clusters 2 | In this lab, you will create, customize and delete Dataproc clusters using the Web console and the command line interface (CLI). You will also connect to the cluster using SSH, and run a couple simple jobs. You will also access the cluster's Hadoop and HDFS services from the browser. 
3 | 4 | ## What you learn 5 | In this lab, you: 6 | 7 | * Create a Dataproc cluster from the Web console 8 | * SSH into the cluster and run PySpark jobs 9 | * Add a firewall rule that allows access to your cluster from the browser 10 | * Create, manage and delete Dataproc clusters from the CLI 11 | 12 | Begin the lab 13 | 14 | https://codelabs.developers.google.com/codelabs/cpb102-creating-dataproc-clusters/ 15 | -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Module 1: Introduction to Cloud Dataproc/quiz/Screen Shot 2018-01-02 at 3.12.25 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/leveraging-unstructured-data-dataproc-gcp/Module 1: Introduction to Cloud Dataproc/quiz/Screen Shot 2018-01-02 at 3.12.25 PM.png -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Module 2: Running Dataproc jobs/01-overview-of-running-dataproc-jobs.txt: -------------------------------------------------------------------------------- 1 | Welcome to Google Cloud Platform. This is Grant Moyle. In this video, we are going to look at running Dataproc jobs on Google Cloud Platform. Dataproc is Google's managed Hadoop infrastructure, so it's a fabulous platform for running your open source parallel computing jobs. In a previous video we discussed how to create a stateless cluster in less than 90 seconds using Dataproc. In this video we're going to discuss how to run our jobs on Dataproc clusters, how to submit jobs via high-level APIs, and also how to optimize the use of Google Cloud Storage as the home for your persistent data instead of the Hadoop file system.
Once a Dataproc cluster has been created and then started, you can SSH into the master node, or in the case of a highly available cluster, any of the master nodes. From there, you can interact directly with Hadoop and run Pig, Spark, or Hive jobs, since all of these are preloaded. You can also interact directly with the Hadoop file system, or HDFS, as we'll hear it referred to, as that has been provisioned on your master and persistent worker nodes. The only nodes that don't have HDFS will be the preemptible nodes, because they could be preempted at any time, so it doesn't make sense to have their disks associated with the Hadoop file system. Now the best way to experience SSH-ing into a cluster is to do it yourself. So, I'm going to just turn you loose on a lab, and in this lab, you're going to create a Dataproc cluster using Google Cloud Shell, create a Google Cloud Storage bucket and clone a repository. From there, you connect to the master node via SSH and run the PySpark read-evaluate-print loop (REPL) interpreter. Then you'll run a Pig job that'll use data stored in the file system. And then you'll destroy the cluster since you'll be all done with it. So pause the video and go complete the lab titled Leveraging Unstructured Data Part 2 2 | -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Module 2: Running Dataproc jobs/02-why-ssh-into-a-cluster.txt: -------------------------------------------------------------------------------- 1 | So you can SSH into the cluster and you can run PySpark. Why would you do that? Why would you want to SSH into the cluster, to that particular machine, and run a program in the interpreter? Normally, the way you interact with a cluster is that you use a notebook or you submit a job to the Spark cluster, but there are times when it's advantageous to be able to simply drop into the shell of the machine.
Drop into Spark so that you can do a quick exploration. You can try a couple of lines of Spark code. You can basically see if something works, if the data is in place, how long something takes. For that kind of quick experimentation, it can be very helpful to be able to SSH into the cluster, start the PySpark interpreter and try out your quick experiments. 2 | -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Module 2: Running Dataproc jobs/Screen Shot 2018-01-03 at 4.43.16 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/leveraging-unstructured-data-dataproc-gcp/Module 2: Running Dataproc jobs/Screen Shot 2018-01-03 at 4.43.16 PM.png -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Module 2: Running Dataproc jobs/running-jobs.md: -------------------------------------------------------------------------------- 1 | ## Lab 2a: Running Pig and Spark programs 2 | 3 | In this lab, you will run Pig and Spark programs on a Dataproc cluster. 4 | 5 | ## What you learn 6 | In this lab, you: 7 | 8 | * SSH into the cluster to run Pig and Spark jobs 9 | * Create a Cloud Storage bucket to store job input files 10 | * Work with HDFS 11 | 12 | Begin the lab 13 | 14 | https://codelabs.developers.google.com/codelabs/cpb102-running-pig-spark/ 15 | 16 | Google Cloud Dataproc supports running jobs written in Apache Pig, Apache Hive, Apache Spark, and other tools commonly used in the Apache Hadoop ecosystem. 17 | 18 | For development purposes, you can SSH into the cluster master and execute jobs using the PySpark Read-Evaluate-Print Loop (REPL) interpreter. 19 | 20 | Let's take a look at how this works.
21 | 22 | ./replace_and_upload.sh gcp365-191023 23 | gsutil -m cp gs://gcp365-191023/unstructured/pet-details.* . 24 | -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Module 3: Leveraging GCP/01-introduction-to-leveraging-gcp.txt: -------------------------------------------------------------------------------- 1 | 2 | So far in this course, what we've been looking at is migration from an on-premises Hadoop cluster to GCP. So we just said, take the code that already runs on your on-premises cluster, the Spark or Pig or Hive or whatever code there is. The one change that we're asking you to make is to change where you're reading from. Don't read from HDFS; instead read from GCS. And by replacing that read from HDFS with a read from GCS, all of a sudden things get much better economically. You don't need to have a cluster up all the time. You don't need to have a cluster that people are competing for. You don't need to over-provision the cluster for those small periods of peak usage. You get much better economic benefits, you get ephemeral clusters, by doing one simple thing: changing your input from HDFS to GCS. So that's a recommendation. If you're migrating Hadoop workloads from on-premises to a public cloud, replace HDFS reads with GCS reads, so that you take advantage of the extremely high sustained read performance that GCS provides. And this allows you to create clusters that are job specific. So you create a cluster, you run the job and you delete the cluster. So that's basically what the last couple of chapters have been about. But now, migration has been done. You are on GCP. What else can you do? What we are going to look at now is how to leverage the power of GCP beyond migrating the on-premises jobs. So you've done that.
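
To make the "replace HDFS reads with GCS reads" recommendation concrete, here is a minimal sketch of rewriting an input path. It is an illustration only: `to_gcs_uri` is a hypothetical helper, the bucket name and file paths are made up, and it assumes the data has already been copied to the bucket under the same relative path. In a real job the change is usually just editing the input URI that the Spark, Pig or Hive code reads from.

```python
from urllib.parse import urlparse


def to_gcs_uri(hdfs_uri: str, bucket: str) -> str:
    """Map an hdfs:// input path to the same path in a Cloud Storage bucket."""
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"not an HDFS URI: {hdfs_uri}")
    # Drop the namenode host and port (if any) and keep only the file path.
    return f"gs://{bucket}{parsed.path}"


# Hypothetical paths and bucket name, for illustration only.
print(to_gcs_uri("hdfs://namenode:8020/data/pets.csv", "my-bucket"))
print(to_gcs_uri("hdfs:///data/pets.csv", "my-bucket"))
```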
You now have your on-premises jobs, you've migrated them to the Google Cloud and they're now running on the Google Cloud. Great. But now that you have jobs that are running on Google Cloud, let's not treat your Hadoop and Spark jobs in isolation. Let's look at how you can take advantage of the rest of the capabilities of the platform. 3 | -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Module 3: Leveraging GCP/02-leveraging-google-cloud-platform-pt-1.txt: -------------------------------------------------------------------------------- 1 | Welcome to Google Cloud Platform. This is Grant Moyle. In this video we're going to take a look at leveraging Google Cloud Platform for processing our big data jobs on Dataproc, and also at customizing the initialization of our Dataproc master and worker nodes so that we can include other open source software or custom initialization. So far in the Dataproc videos, we have discussed building a stateless Hadoop cluster in less than 90 seconds. We've discussed running standard Hadoop, Spark, Pig, and Hive, which are all included in the base Dataproc configuration, and submitting jobs via high-level APIs. In this video, we're going to look at customizing the initialization of our Dataproc clusters using scripts or programs that we're going to load into Cloud Storage. And we're also going to take a look at how BigQuery and other Google Cloud Platform services can be leveraged from Dataproc. So let's look at customizing our cluster.
2 | -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Module 3: Leveraging GCP/04-codelab-leveraging-unstructured-data-part-4.txt: -------------------------------------------------------------------------------- 1 | Lab Overview: Leveraging Google Cloud Platform Services 2 | In this lab, you will create a Dataproc cluster that includes Datalab and the Google Python Client API. You will then create iPython notebooks that integrate with BigQuery and storage and utilize Spark. 3 | 4 | What you learn 5 | In this lab, you: 6 | 7 | Create a Dataproc cluster with an Initialization Action that installs Google Cloud Datalab 8 | Run Jupyter Notebooks on the Dataproc cluster using Google Cloud Datalab 9 | Create Python and PySpark jobs that utilize Google Cloud Storage, BigQuery and Spark. 10 | Begin the lab 11 | https://codelabs.developers.google.com/codelabs/cpb102-dataproc-with-gcp/ 12 | 13 | 14 | -------------------------------------------------------------------------------- /coursera/leveraging-unstructured-data-dataproc-gcp/Module 4: Analyzing Unstructured Data/01-introduction-to-analyzing-unstructured-data.txt: -------------------------------------------------------------------------------- 1 | 2 | Welcome to Google Cloud Platform, this is Grant Moyle. In this video, we're going to take a look at analyzing unstructured data on Google Cloud Platform using the machine learning based APIs. If you watched the previous video on Dataproc, you might recall this slide. Humans are great at deriving insight by looking at charts and graphs and pictures. Computers are great at counting things, but how do we bridge the gap between the two? Well, that's where machine learning fits in: modelling the way humans learn, except we want to do it with computers. As was said in Wired magazine in May of 2016, soon, we won't program computers, we'll train them like dogs.
Well, how do you train a dog, or maybe a cat or a spouse? Well, you reward good behaviour and you discipline bad behaviour. Well, it's not quite like that, it's more like how our children learn. You show them lots of pictures of dogs and tell them that they're dogs. Then, you show them lots of pictures of other animals that are not dogs and slowly, their brain builds a model for it. Rinse and repeat, this time for cats and birds and snakes and cows. I recall when both my children were working on their smell neural network. We'd be driving along in the car with the windows down and would pass a cow pasture, and you'd hear, "ew, what's that smell?" Well, that was their neural network being formed. They'd associate the smell of the cows with the word cow. 3 | 4 | Now, 20 years ago, while I was a graduate student, I took a semester on machine learning, but the problem back then was that we didn't have the computing power. Back then, if our lab got a $100,000 grant, I could buy six servers. So essentially, I got six CPUs, built some simple learning models and crunched away at data for days upon days. But we could never really touch vision or speech or all that sort of stuff. These days, you don't need to build those; they are available for us to use directly, as in the Vision API, the Speech API, the Natural Language API or the Translate API. And more APIs are being added all the time. For example, you might have heard about the newly announced Jobs API, which is plug and play for helping companies do hiring and resume and job application processing. Pretty cool stuff.
5 | -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 11.12.48 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 11.12.48 AM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 11.12.56 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 11.12.56 AM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 11.14.16 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 11.14.16 AM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 2.16.44 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 2.16.44 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 2.19.01 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 
2019-02-03 at 2.19.01 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 2.45.59 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 2.45.59 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 3.16.30 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 3.16.30 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 3.18.25 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 3.18.25 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 3.19.55 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 3.19.55 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.05.44 PM.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.05.44 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.22.06 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.22.06 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.23.43 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.23.43 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.25.20 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.25.20 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.26.11 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.26.11 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.26.28 PM.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.26.28 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.27.06 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.27.06 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.27.50 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.27.50 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.28.07 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.28.07 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.30.51 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.30.51 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 
4.31.36 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.31.36 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.39.53 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.39.53 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.40.35 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-03 at 4.40.35 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 1.17.57 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 1.17.57 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 1.56.09 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 1.56.09 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 
2019-02-04 at 12.37.56 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 12.37.56 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 12.41.29 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 12.41.29 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 12.41.58 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 12.41.58 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 12.46.13 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 12.46.13 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 12.49.20 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 12.49.20 PM.png -------------------------------------------------------------------------------- 
/coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 12.49.26 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 12.49.26 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 2.02.42 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 2.02.42 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 2.04.20 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 2.04.20 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 2.08.28 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 2.08.28 PM.png -------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 2.25.27 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 2.25.27 PM.png 
-------------------------------------------------------------------------------- /coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 2.26.02 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/prepare-gcp-exam/Screen Shot 2019-02-04 at 2.26.02 PM.png -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/Module 1: Serverless Data Analysis with BigQuery/00-cpb101-bigquery-query.txt: -------------------------------------------------------------------------------- 1 | https://codelabs.developers.google.com/codelabs/cpb101-bigquery-query/#0 2 | 3 | What you learn 4 | In this lab, you: 5 | 6 | Create and run a query 7 | Modify the query to add clauses, subqueries, built-in functions and joins. 8 | 9 | 10 | Load a CSV file into a BigQuery table using the web UI 11 | Load a JSON file into a BigQuery table using the CLI 12 | 13 | https://codelabs.developers.google.com/codelabs/cpb101-bigquery-data/ 14 | 15 | In this lab, you write a query that uses advanced SQL concepts: 16 | 17 | Nested fields 18 | Regular expressions 19 | With statement 20 | Group and Having 21 | 22 | https://codelabs.developers.google.com/codelabs/cpb101-advanced-queries/#0 23 | 24 | https://medium.com/@hoffa/the-top-weekend-languages-according-to-githubs-code-6022ea2e33e8#.8oj2rp804 25 | -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/Module 1: Serverless Data Analysis with BigQuery/01-course-introduction.txt: -------------------------------------------------------------------------------- 1 | 0:00 2 | In this first course of the data engineer track, you will gain an overview of the data and machine learning parts of Google Cloud platform, and not just a cursory overview. 
So what you're going to be doing in each module is two-pronged. At one level, we will look at some products that will help you accomplish certain things on Google Cloud, but at the same time, we look at very specific use cases, very common use cases that involve machine learning, that involve data processing, that involve data analysis. So we look at how to accomplish those use cases using these particular products, and we'll do this over and over again. And by the time we are done with this course, you will have gotten a pretty good overview of all the moving parts of the data and ML parts of the platform. 3 | -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/Module 1: Serverless Data Analysis with BigQuery/02-who-is-a-data-engineer.txt: -------------------------------------------------------------------------------- 1 | So, who's a data engineer? A data engineer is someone who enables decision making. Typically they do this by building data pipelines: ingesting data, processing data, building tools to analyze data, building dashboards, building machine learning models. So, a data engineer's job is to enable decision-making within the company in a very systematic way. And in order to be a good data engineer, you needed to know both programming and statistics to a great deal of depth. But with the advent of cloud services, particularly the fully managed, auto-scaling services on Google Cloud, the amount of infrastructure that you need to know and the amount of programming that you need to know have gotten a lot simpler. At the same time, the statistics realm has also gotten a lot simpler. You now have libraries that take care of a lot of the low-level programming that you had to do, and a lot of the mathematical concepts that you had to know, to the extent that you can now program with data.
You can build statistical machine learning models a lot more simply when you're building with these libraries and packages. So, what has happened is that over time, the amount of programming that you needed to know has gotten simplified. The amount of statistics that you need to know has gotten simplified. And what that means is that you can now look at somebody with the skills of a data engineer who can build these data pipelines and go all the way to building statistical machine learning models. So, that's what we're going to be talking about in this set of courses. 2 | -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/Module 1: Serverless Data Analysis with BigQuery/08-lab-serverless-data-analysis-java-python-part-1.txt: -------------------------------------------------------------------------------- 1 | So, in this section what we've done is that we've basically started out very simple – looking at a table, looking at a preview of the table, finding columns, creating queries, using clauses, using inner selects, using some built-in functions like LPAD, etc., and doing joins to find the fraction of flights that depart late from New York's LaGuardia Airport, restricted to rainy days. Right. So let's go ahead and try this out. This will step you through the process of doing this, learning how to run BigQuery queries. Again, the labs in these courses are not meant to show you something new; they are meant to reinforce the topics that I'm covering in the lectures. Right. All of the labs are about reinforcement, about taking these examples and showing them in a holistic way, because when you look at it in a slide, I'm just going to be showing you snippets. I'm not going to be showing you the entire code. So the labs are a way for you to go look at the entire code.
So please go ahead and look at the entire code, figure out how all of the pieces fit together, and then what I strongly recommend you do is that you go to BigQuery itself. Notice that in BigQuery, right, you basically have a variety of other projects. If you are in BigQuery and you don't see these projects, what you can do is switch to a project. So you can say Display Project and then you can type the name of a project and I... One of the ones that I would say you try is BigQuery samples. Okay. And you can now say display it, and at that point BigQuery samples will show up on the left-hand side of your BigQuery console with a variety of datasets there. Pick one of these datasets and try to do your analysis on it. Try to see how you would write a query to look at the Nasdaq stock quote for the Google stock, right, and figure out something about Google stock, for example. Right? So go ahead and try something out. And that's the way you learn. Right? But first, before you try something out on your own, you should probably look at a holistic example of how something is put together, and that's the purpose of these labs. So go ahead and do this lab and then we will start back up. 2 | -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/Module 1: Serverless Data Analysis with BigQuery/15-user-defined-functions.txt: -------------------------------------------------------------------------------- 1 | You also have a variety of date and time functions that operate on timestamps. We looked at being able to extract the date from a timestamp, for example. The thing to realize is that BigQuery uses epoch time; it understands UTC. You can create a date using year, month, and day. You can create a date using a timestamp. You can also create a datetime object using separate date and time objects.
The standard SQL UDFs are scalar, right? So they basically give you functionality, okay, the ability to do things like loops and non-trivial string parsing. Let's take a look at an example of a SQL function. So you can do CREATE TEMPORARY FUNCTION. The name of the function is addFourAndDivide. x is an INT64, y is an INT64. Those are the two parameters, and it's AS, so you're now defining it, ((x + 4) / y). WITH numbers AS SELECT 1 AS val UNION ALL, SELECT 3 AS val UNION ALL, SELECT 4 AS val UNION ALL. So we're basically selecting val and addFourAndDivide(val, 2) AS result FROM numbers. So we basically have a numbers table and we are just using it, but the key thing to do is look at the first part, which is how you define a temporary function, addFourAndDivide. You essentially define the function, define its parameters, define their types, you say AS, and then you basically write what that function needs to do. 2 | 3 | You can also create external UDF components. For example, here the language is JavaScript. Here's a temporary function, multiplyInputs, but it's not written in SQL, it's written in JavaScript. And so the rest of it between the triple quotes is all JavaScript. You can write JavaScript UDFs, you can write SQL UDFs. There are some constraints as far as user-defined functions are concerned. UDFs can't return way too much data – they need to return reasonable amounts of data, less than 5 MB. You can't run lots of JavaScript UDFs concurrently – you can only run six at the time that we're recording here. You can't do native code JavaScript – only things that can run within a sandbox. And JavaScript is a 32-bit language inside BigQuery, so even though BigQuery has INT64s, it's only going to use the most significant 32 bits.
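
The SQL UDF narrated above can be written out as a minimal sketch. The query text is reconstructed from the spoken description, so its exact layout is an assumption; the transcript trails off after the third UNION ALL, so only the three values it names (1, 3, 4) appear here. The Python names `add_four_and_divide` and `run_locally` are hypothetical stand-ins that mirror the arithmetic locally and do not call BigQuery.

```python
# Sketch only: UDF_QUERY reconstructs the query as narrated in the lecture.
# It is kept as a string for reference and is not sent anywhere.
UDF_QUERY = """
CREATE TEMPORARY FUNCTION addFourAndDivide(x INT64, y INT64) AS ((x + 4) / y);

WITH numbers AS (
  SELECT 1 AS val UNION ALL
  SELECT 3 AS val UNION ALL
  SELECT 4 AS val
)
SELECT val, addFourAndDivide(val, 2) AS result
FROM numbers;
"""


def add_four_and_divide(x: int, y: int) -> float:
    """Local mirror of the UDF body ((x + 4) / y)."""
    return (x + 4) / y


def run_locally(values):
    """Mimic SELECT val, addFourAndDivide(val, 2) for each value."""
    return [(val, add_four_and_divide(val, 2)) for val in values]


if __name__ == "__main__":
    for val, result in run_locally([1, 3, 4]):
        print(val, result)
```

The point to take from the sketch is the shape of the definition: function name, typed parameters, AS, then the expression. The local mirror is only there so the arithmetic can be checked without a BigQuery project.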
4 | -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/Module 1: Serverless Data Analysis with BigQuery/19-bigquery-plans-and-categories.txt: -------------------------------------------------------------------------------- 1 | So, when you look at a BigQuery plan, you're looking for any stage where there is a significant difference between the average and the max time. Whenever you see this, it indicates a significant data skew. And one of the ways that you can fix this is by removing the tail, for example with a HAVING clause, filtering those rows out so that you're not processing the tail. You also then look at the amount of time that's been spent waiting, and that might indicate that you want to do your filtering earlier. You should also look at the time that's spent on CPU tasks, on the compute, and whether there's a lot of time spent on the compute. And again, what I mean here is that you look at these stages, and if it turns out – in this case, the longest stage is a wait – but let's say the longest stage, the thing that's completely filled, is a compute, then you know that for that stage, that computation function is the one that's taking the longest period of time, and that's the thing that you should look at optimizing. The other thing that you can do is that you can monitor BigQuery with Stackdriver. So it's a fully interactive GUI. You can create dashboards, you can look at how many slots were utilized, what queries are in flight, how many bytes were uploaded or stored. In terms of pricing BigQuery, basically there are three categories of BigQuery pricing. First of all, the things that are free: loading data into BigQuery, exporting data from BigQuery, any queries on metadata. How many rows are there in this table? How many columns are there in this table? What are the names of the columns? Queries like that, they're always free.
Any cached query is free, right? So if you run a query and you run the exact same query in that project, it's free, right? Anything that has already been cached and you're getting it back is free. The caches are per user for privacy reasons. So if you have two users in a project, they don't share the cache, okay? So it's a per user cache, but 2 | 3 | any query whose results are returned from the cache is free. Any query that has an error is also free. 4 | 5 | Now things that are not free, storage and processing. So you get charged based on the amount of data in the table. If you have streaming data, you get charged based on the ingest rate of that streaming data. And you will get an automatic discount for data that you haven't edited in a while. So you automatically get a discount for older data. The other category that you have for pricing is processing, okay? So for processing prices, you can have on-demand pricing. Basically, each query we look at the amount of data that that query processes. And the charge for the query is based on the amount of data that's being processed, with the first terabyte a month being free. So it's data that you're processing beyond 1 terabyte is what you pay for. You also have a flat rate plan, but the flat rate plan is something that you talk to your sales team about. And the third thing is that you may have some functions that are extremely high compute, especially JavaScript functions that are very high compute. You have to opt in to run them. And so there's a special charge for those. 6 | 7 | So quick resources, the BigQuery documentation, tutorials on BigQuery, pricing information on BigQuery and client libraries for BigQuery. This is how you interact with BigQuery from Python, Java, etc. 
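
The on-demand rule described above (pay only for data processed beyond the first terabyte each month, with cached and failed queries free) can be sketched as a small calculation. The per-terabyte price is not stated in the transcript, so `price_per_tb` is a hypothetical parameter, and treating a terabyte as 1024^4 bytes is also an assumption.

```python
TB = 1024 ** 4  # assumption: binary terabyte; the transcript does not define it
FREE_TB_PER_MONTH = 1  # "the first terabyte a month being free"


def billable_bytes(bytes_processed, from_cache=False, had_error=False):
    """Bytes that count toward the bill for a single query.

    Cached results and queries that end in an error are free.
    """
    if from_cache or had_error:
        return 0
    return bytes_processed


def monthly_on_demand_charge(queries, price_per_tb):
    """Charge for one month of queries under on-demand pricing.

    queries: iterable of (bytes_processed, from_cache, had_error) tuples.
    price_per_tb: hypothetical price per terabyte processed.
    """
    total = sum(billable_bytes(b, c, e) for b, c, e in queries)
    chargeable = max(0, total - FREE_TB_PER_MONTH * TB)
    return chargeable / TB * price_per_tb


if __name__ == "__main__":
    month = [
        (2 * TB, False, False),  # normal query, counts in full
        (5 * TB, True, False),   # served from cache, free
        (1 * TB, False, True),   # failed with an error, free
    ]
    print(monthly_on_demand_charge(month, price_per_tb=5.0))
```

Storage and streaming-ingest charges are left out on purpose; the sketch covers only the processing category.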
8 | 9 | 10 | https://cloud.google.com/bigquery/docs/ 11 | 12 | https://cloud.google.com/bigquery/docs/tutorials 13 | 14 | https://cloud.google.com/bigquery/pricing 15 | 16 | https://cloud.google.com/bigquery/client-libraries 17 | -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/Module 2: Autoscaling Data Processing Pipelines with Dataflow/00-links.txt: -------------------------------------------------------------------------------- 1 | For more resources on Dataflow, there are the docs on Dataflow, and the example that we looked at of finding Java projects that need help by looking at large amounts of code on GitHub. 2 | 3 | It's written up in this Medium post, so you can go ahead and look at that. And there's a solution paper that looks at how to process logs at scale using Cloud Dataflow. 4 | 5 | Thank you very much. 6 | 7 | https://cloud.google.com/dataflow/ 8 | 9 | https://medium.com/google-cloud/popular-java-projects-on-github-that-could-use-some-help-analyzed-using-bigquery-and-dataflow-dbd5753827f4#.t82wsxd2c 10 | 11 | https://cloud.google.com/solutions/processing-logs-at-scale-using-dataflow 12 | -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/Module 2: Autoscaling Data Processing Pipelines with Dataflow/06-transforms-in-cloud-dataflow.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/Module 2: Autoscaling Data Processing Pipelines with Dataflow/06-transforms-in-cloud-dataflow.txt -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-04 at 2.29.44 PM.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-04 at 2.29.44 PM.png -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-04 at 2.32.56 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-04 at 2.32.56 PM.png -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-04 at 3.23.17 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-04 at 3.23.17 PM.png -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-05 at 2.43.03 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-05 at 2.43.03 PM.png -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-24 at 3.09.40 PM.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-24 at 3.09.40 PM.png -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-24 at 3.14.06 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-24 at 3.14.06 PM.png -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-24 at 3.20.16 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-24 at 3.20.16 PM.png -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-25 at 3.27.53 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-25 at 3.27.53 PM.png -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-25 at 4.07.18 PM.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-25 at 4.07.18 PM.png -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-26 at 12.37.32 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-26 at 12.37.32 PM.png -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-26 at 12.38.10 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-26 at 12.38.10 PM.png -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-26 at 12.40.20 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-26 at 12.40.20 PM.png -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-26 at 12.43.24 PM.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-26 at 12.43.24 PM.png -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-26 at 12.45.19 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-26 at 12.45.19 PM.png -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-26 at 12.47.06 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-26 at 12.47.06 PM.png -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-26 at 3.59.15 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-26 at 3.59.15 PM.png -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-26 at 4.11.15 PM.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-26 at 4.11.15 PM.png -------------------------------------------------------------------------------- /coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-26 at 4.18.46 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-data-analysis-bigquery-cloud-dataflow-gcp/images/Screen Shot 2018-01-26 at 4.18.46 PM.png -------------------------------------------------------------------------------- /coursera/serverless-machine-learning-gcp/Module 1: Getting Started with Machine Learning/00-create-machine-learning-datasets-lab-1a.md: -------------------------------------------------------------------------------- 1 | Overview 2 | In this lab, you will: 3 | 4 | Explore a dataset using BigQuery and Datalab 5 | Sample the dataset and create training, validation, and testing datasets for local development of TensorFlow models 6 | Create a benchmark to evaluate the performance of ML against 7 | What you need 8 | To complete this lab, you need: 9 | 10 | A Google Cloud Platform project (if not, please sign up for a free trial and come back here). 11 | Begin the lab 12 | https://codelabs.developers.google.com/codelabs/dataeng-machine-learning/ 13 | 14 | Note: You should only complete Parts 1-5 of this Codelab and then return to this course. 
15 | -------------------------------------------------------------------------------- /coursera/serverless-machine-learning-gcp/Module 1: Getting Started with Machine Learning/01-module-1-overview.md: -------------------------------------------------------------------------------- 1 | In this module, we'll talk about what machine learning is, how to build effective machine learning models, and how to make sure they're effective by evaluating them. And then, we'll discover that in order to evaluate machine learning models, we need to build multiple machine learning datasets, so we'll learn how to do that. And then, we'll basically go ahead and do a lab on exploring some data and building a machine learning model to predict the taxi fare amount from one location to another. This is the dataset that we'll use while we are building our machine learning models and evaluating them through the rest of the course. 2 | -------------------------------------------------------------------------------- /coursera/serverless-machine-learning-gcp/Module 1: Getting Started with Machine Learning/05-combinations-and-hierarchies-of-features.md: -------------------------------------------------------------------------------- 1 | 0:01 2 | So we couldn't use a single line to separate these. What did we have to do? We had to add a layer in between; and the layer in between is essentially a combination of features. So each of those neurons ends up combining the X1s and X2s in a specific way, and in this case because my activation function was a linear activation function, it ended up being a line. So now you have the first neuron essentially forming that line, the next neuron forming this line, and the third neuron forming that line, and this guy, this neuron, its weight is almost zero so it ends up not mattering. So those are the three neurons and they together form a triangle, each one of them is its own line. Okay. So, every neuron essentially is a combination of input features.
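As a rough illustration of the point above (not taken from the course material), each hidden neuron with a linear activation can be written as a weighted sum of x1 and x2, which defines one line. The weights, biases, and the names `linear_neuron` and `inside_triangle` below are invented for this sketch, and the final "all positive" check stands in for whatever the output layer would actually learn.

```python
# Illustrative sketch only: the weights and biases are made up, not from the course.

def linear_neuron(x1, x2, w1, w2, b):
    """A neuron with a linear activation: a weighted sum of its inputs."""
    return w1 * x1 + w2 * x2 + b

# Three hidden neurons; each one is positive on one side of a line.
HIDDEN_NEURONS = [
    (0.0, 1.0, 1.0),    # positive above the line x2 = -1
    (-1.0, -1.0, 1.0),  # positive below the line x1 + x2 = 1
    (1.0, -1.0, 1.0),   # positive below the line x2 = x1 + 1
]

def inside_triangle(x1, x2):
    """Combine the three lines: a point is inside if every neuron is positive."""
    return all(linear_neuron(x1, x2, w1, w2, b) > 0
               for w1, w2, b in HIDDEN_NEURONS)

print(inside_triangle(0.0, 0.0))  # True: the origin lies inside all three lines
print(inside_triangle(3.0, 0.0))  # False: fails the x1 + x2 = 1 line
```

A fourth neuron whose weights are all close to zero would contribute almost nothing to the combination, which matches the remark that it "ends up not mattering".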
3 | 0:55 4 | But how about this? 5 | 0:57 6 | Can you now use a set of lines to do a separation? Now, there's no way that you can draw a polygon, because if you had dozens of lines, you would have a polygon with dozens of sides, but it's still a polygon. A single polygon is not going to work. But on the other hand, if each of the ones in the first layer is a single line and each of the ones in the second layer then is a combination of lines, and a combination of lines becomes a polygon, then what we have because of the second layer is a set of polygons. So now we can basically see that each of these is now basically capturing a single polygon and now we have a bunch of polygons together to separate out the blue from the orange. But again one thing that we've got to realize is that the neural network is only learning from the data that you have provided. As humans, we know that this blue set of dots keeps going up. But it wasn't there in the data and because it was not there in the data, what the ML model has learned has been a polygon that is just completely truncated there. So that's something important to realize: a neural network is only as good as the input data that you provided. If you provide data beyond what its training inputs covered, the results are going to be completely up in the air. You don't know what they are. So let's go ahead and try this as well. 7 | 2:51 8 | So here's our spiral. 9 | 2:56 10 | And we can see that over time the model is going to learn one polygon. And then it's going to move on to learning the next polygon, and the next polygon, and so on. And then over time it's going to have a bunch of polygons that it's learned. And those together will form a pretty good separation between the blue dots and the orange dots. 11 | 3:28 12 | And you can see the gradient descent at work as the initial polygon gets created and slowly grows to encompass the other points.
And at this point we've kind of stabilized it. Now obviously this is sub-ideal at the very top, but then we had no points. We had no orange point, had no in between, and so we basically had them. If we had an orange point in the in between, then the network would have learned to kind of truncate at this location. But in this case it doesn't, so it doesn't know that. So again, the completeness of your data set, the coverage of your data set is extremely important. Because all that we're doing is that we're learning from examples. 13 | -------------------------------------------------------------------------------- /coursera/serverless-machine-learning-gcp/Module 1: Getting Started with Machine Learning/07-the-reality-of-machine-learning.md: -------------------------------------------------------------------------------- 1 | 0:00 2 | In this module we'll look at What Makes an Effective Machine Learning Model. 3 | 0:07 4 | In the popular imagination what machine learning is is a way by which you have lots of data, doesn't matter what the data is, just have lots of it. You take that data and you somehow throw that into a fancy model with complex math that's multi-dimensional space, and out pops something magical, magical results. 5 | 0:33 6 | In reality though, machine learning is going to be about collecting data. It's not going to be about data that you just happen to have lying around. That somehow magically works the way you want it to work, instead you have to purposefully collect the data. And the data that you collect, you typically have to organize it. And the organization of the data is going to reflect your insight into the problem, insight into what it is that you're trying to predict. And having organised the data, then you will create a model, and model also will be quite purposeful in terms of what is it you want to predict. 
And the model is going to be capturing your insights into the problem, but once you have that model, then you can apply all of the fancy machinery to flesh out this model from data. So you have data that you've collected and you have a model that expresses the thing that you want to be able to predict, your problem domain, and then you can use machine learning to fit the parameters of this model. And, at the end, what you get is going to be magical. But in order to get that, you're going to be spending a lot of time and a lot of effort. 7 | -------------------------------------------------------------------------------- /coursera/serverless-machine-learning-gcp/Module 1: Getting Started with Machine Learning/08-covering-all-use-cases.md: -------------------------------------------------------------------------------- 1 | 0:00 2 | What are the kinds of things that you would have to do? When you collect your data, you have to make sure that the data that you're collecting covers all possible use cases. So for example, let's say you have this issue that you want to look at photographs of the sky and figure out if there are clouds. So you want to take photographs. Let's say you have photographs and you want to figure out if there are clouds. One way to do this is to say, let me go ahead and find a treasure trove of thousands of photographs or millions of photographs and let's see if I can find clouds in them. 3 | 0:43 4 | That's just data that you have lying around, found data and you hope to make it work. 5 | 0:50 6 | This doesn't always work well, instead, you have to be purposeful about how you collect the data. So one thing that you might say is, what kinds of clouds are there? And you could go talk to a human expert, and in this case the National Weather Service and NASA have gone out and you'll have all of this taxonomy of clouds, the types of clouds there are. 
And then you can make sure that in your treasure trove of images, all of these types of clouds are actually included in those sets of images. If you hadn't been purposeful, for example, you may not have any images of say tornadoes, or you may not have any images of cirrus clouds. And that might be a problem when you try to go take this model and you put it into production. 7 | -------------------------------------------------------------------------------- /coursera/serverless-machine-learning-gcp/Module 1: Getting Started with Machine Learning/09-negative-examples-and-near-misses.md: -------------------------------------------------------------------------------- 1 | 0:00 2 | It's not going to be enough for you to take all the positive cases, all the types of clouds that the National Weather Service or NASA tells you about. These are the types of clouds there are and you make sure that you have images for each type of cloud. You also have to put yourself in the shoes of the machine learning algorithm. And the way to think about it is an ML algorithm is like a two-year-old. It only knows exactly what it has seen before. 3 | 0:30 4 | So you have to think in terms of what kinds of negative examples are going to confuse a machine learning model. So you might think, for example, about what happens if there are cartoons. What happens if there are just pictures of white clouds on a blue background? 5 | 0:48 6 | Is that going to be thought of as a cloud? And if, in your domain, in your problem, in the thing that you want to solve, these images should not be clouds, then the model shouldn't call them clouds. And it's the same thing about whether it's cotton balls in a field, or chemical entrails, or sheep, or a close-up of a marble floor. You have to ensure that you have negative examples, near misses, included in your data set. So your data set has to be complete in terms of positive cases. It also has to include negative cases that are close, that are near misses.
7 | -------------------------------------------------------------------------------- /coursera/serverless-machine-learning-gcp/Module 1: Getting Started with Machine Learning/10-explore-the-data-and-fix-problems.md: -------------------------------------------------------------------------------- 1 | 0:00 2 | You also have to look at the data that you have and you have to see which things about the data don't quite fit. Now, this is something that if you worked with data for a long time you know, right? You know that you have to go find outliers in your data. But the traditional statistical approach to outliers is often to simply throw them away. So you have your data set and you take your outliers and you say well these are things that are way beyond the normal distribution of my data and therefore I'm going to discard them and not use them in my analysis. 3 | 0:39 4 | With machine learning though, before you throw out outliers you should do one 5 | 0:46 6 | thing, you should try to see if you can collect enough of these outliers. What do I mean by that? Let's say, for example, you have some temperature trends that you are following in your data, and your temperature goes up and down, but every once in a while you see this huge, deep drop, and it happens in maybe one in 15,000 samples. This is a negative spike and it can be very tempting to simply throw out those outliers and move on. It's 1 in 15,000, we don't need it. But a better approach would be to say, why does this happen? And it may turn out that this happens because of daylight saving time changes. So every time there is a change in the time zone, or in the time offset to UTC, that's when this thing happens, and then that explains something to you about your data. And you have to go and make sure that you have enough of those examples in your training dataset. This is important, because at prediction time, you will still want to be able to deal with these things.
One of the key things to remember is that a machine learning algorithm is built on historical data. But it is built in order to carry out predictions in real time. And the less hand-holding you need to do, the fewer points you have to throw away, the more robust your prediction algorithm is going to be. So you want to try to make sure that if there are issues with your data, you find out why those issues exist. And sometimes your solution is not to throw away these kinds of outliers. Your solution is to go back and collect more data such that this outlier is no longer an outlier, it is something that can be reasoned about and it can be something that the ML model can learn to anticipate and handle gracefully. 7 | -------------------------------------------------------------------------------- /coursera/serverless-machine-learning-gcp/Module 1: Getting Started with Machine Learning/14-creating-machine-learning-datasets-for-regression-problems.md: -------------------------------------------------------------------------------- 1 | 0:00 2 | So, now that we've looked at a number of machine learning concepts, let's go on and do a lab to create machine learning datasets. The problem that we're going to try to solve is a problem of predicting the taxi fare. The idea is that you want to go ahead and you want to predict the fare amount. Let's say we want to predict a fare amount. And you have a dataset that relates a fare amount to the distance that has been traveled. So what is the error measure that we're going to use to optimize this? Well, this is a regression problem, because we are predicting a continuous number, the fare amount, and therefore, we should use mean squared error. So great. We'll use mean squared error to optimize this regression problem. Now here is model number one. So we've gotten all of those points, each of these points reflects one data point. So, we had somebody travel this distance, and they paid this amount in fares.
And you have a bunch of data like this, and you've basically fitted a line to it, and that's your model prediction, and it turns out that the root mean squared error is 22. The reason we're talking about root mean squared error is that it allows us to talk in terms of dollars, rather than squared dollars, which is not very intuitive. But the mean squared error and root mean squared error are very closely related; one is just the square root of the other. So let's say we have a fare amount and our error in predicting the fare amount, a root mean squared error, is 22. So here is a second model, and this model has a root mean squared error of zero. Which model is better? I think intuitively, we all feel that the first model is better than the second model, because the second model doesn't look as if it would generalize very well. In other words, if we have some new data, remember the purpose of a machine learning model is to be able to predict for new data, for unlabeled data. So, we go get some data that were not used in training, and intuitively we feel that the first model is going to be more general than the second model, which is going to be overfit to the data that was used in training. And indeed, for model one, the old RMSE in training was 22, and the new RMSE on this new dataset that we're using is also about 22; that they are so similar indicates the model is pretty good. So, model one generalizes very well, but model two, now the squiggles do not pass through these new points anymore, and the old RMSE was zero, and the new RMSE is 32, and that is a big red flag to us. It's a red flag that the error on the new dataset is so different from the error on the training dataset.
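The comparison between the two models can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the (distance, fare) pairs are invented rather than drawn from the taxi dataset, `fit_line` stands in for model one, and `memorize` is a deliberately overfit stand-in for the squiggly model two (it simply returns the fare of the closest training distance).

```python
import math

def rmse(predict, points):
    """Root mean squared error of predict(distance) over (distance, fare) pairs."""
    squared = [(predict(d) - fare) ** 2 for d, fare in points]
    return math.sqrt(sum(squared) / len(squared))

def fit_line(points):
    """Model one: ordinary least-squares fit of fare = slope * distance + intercept."""
    n = len(points)
    mean_d = sum(d for d, _ in points) / n
    mean_f = sum(f for _, f in points) / n
    slope = (sum((d - mean_d) * (f - mean_f) for d, f in points)
             / sum((d - mean_d) ** 2 for d, _ in points))
    return lambda d: slope * d + (mean_f - slope * mean_d)

def memorize(points):
    """Model two (overfit stand-in): return the fare of the nearest training distance."""
    return lambda d: min(points, key=lambda p: abs(p[0] - d))[1]

# Made-up data for illustration only.
train = [(1, 7.0), (2, 6.0), (3, 12.0), (4, 10.0), (5, 16.0), (6, 14.0)]
new = [(1.4, 5.5), (2.6, 11.0), (3.4, 9.0), (4.6, 14.5), (5.4, 12.5)]

line, memo = fit_line(train), memorize(train)
for name, model in [("line", line), ("memorized", memo)]:
    print(name, "train RMSE:", round(rmse(model, train), 2),
          "new RMSE:", round(rmse(model, new), 2))
```

On this toy data the memorizing model has zero training error but a clearly larger error on the new points, while the line's error is about the same on both sets, which is the red flag pattern described above.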
3 | -------------------------------------------------------------------------------- /coursera/serverless-machine-learning-gcp/Module 1: Getting Started with Machine Learning/15-split-dataset-and-model-experimentation.md: -------------------------------------------------------------------------------- 1 | 0:00 2 | So, we can formalize this by taking our original dataset and splitting it. So, we can take our original dataset and we say, here's my training data, here's what I'll call my validation data, and then, I can choose how complex my model needs to be by looking at the performance of the model on new data, data that was not used in training. So the idea is, you take your model, you train your model on the training data, and you evaluate the model on the validation data, and you look at the error, and you increase the model complexity, you go back and you train the model, you evaluate the model, and you look at the error again. So, maybe if this were our dataset, and we initially started off with a model that's a straight line, then we increase the model complexity, we get something that looks like that, and then we increase the model complexity some more, and we get something that looks like that, and then we go back and evaluate this model on the validation dataset, we learn that the first model is an underfit, the second model is perfect, and the third model again, is overfit, right? The error basically starts to increase in the validation dataset. So, at that point then we can say the right level of model complexity is the one in the middle. So, the idea behind this is that you can take your data and you can split your data into training datasets, and validation datasets, and use this to decide how complex your model needs to be. So in other words, this is something that we'll call hyperparameter tuning, and the hyperparameter here is the model complexity: how many neurons do we have in our neural network?
How many layers do we have in our neural network? That's how you increase the complexity of a model. And you can keep increasing it until the point when your errors on the validation dataset start to increase. 3 | -------------------------------------------------------------------------------- /coursera/serverless-machine-learning-gcp/Module 1: Getting Started with Machine Learning/16-evaluating-the-final-model.md: -------------------------------------------------------------------------------- 1 | 0:00 2 | But, once you have done this, though, you no longer have an independent dataset. And if you want to evaluate the final model, you want to say, okay, here's the final model that I have, how well does it perform? Well, you've kind of used all this data, this validation data, in order to choose the model parameters. So, you cannot use the performance on the validation data as a measure of how well your model works. What you will have to do is to go get some more data. Some completely independent data, and one way to do this is to wait for the passage of time. Maybe you're creating a model and you're training it on historical data from 2016, and then once you have 2017 data, you use that as your test data set, and then you're ready to go. But, another way to do this would be to take your original data and split it as training data and validation data and change the splits maybe ten times. This is called cross-validation and the average of all of those tells you how good your model is doing. 
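A minimal sketch of the repeated-split idea, assuming a k-fold scheme (one common way to "change the splits", not necessarily the exact procedure the course uses). The data is synthetic, and `fit_line`, `rmse`, and `cross_validate` are illustrative helper names, with a least-squares line standing in for the model being evaluated.

```python
import math

def fit_line(points):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return lambda x: slope * x + (mean_y - slope * mean_x)

def rmse(predict, points):
    """Root mean squared error of predict(x) over (x, y) pairs."""
    return math.sqrt(sum((predict(x) - y) ** 2 for x, y in points) / len(points))

def cross_validate(points, k):
    """Average validation RMSE over k different training/validation splits."""
    scores = []
    for fold in range(k):
        # Hold out every k-th point (offset by fold) and train on the rest.
        validation = points[fold::k]
        training = [p for i, p in enumerate(points) if i % k != fold]
        scores.append(rmse(fit_line(training), validation))
    return sum(scores) / k

# Synthetic data: a straight line plus an alternating +/-1 disturbance.
data = [(x, 2.5 + 1.8 * x + (1.0 if x % 2 else -1.0)) for x in range(1, 21)]
print("mean validation RMSE over 10 splits:", round(cross_validate(data, 10), 2))
```

Every point is used for validation exactly once, so the averaged score does not depend on one lucky or unlucky split.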
3 | -------------------------------------------------------------------------------- /coursera/serverless-machine-learning-gcp/Module 1: Getting Started with Machine Learning/what-is-machine-learning-ml.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-machine-learning-gcp/Module 1: Getting Started with Machine Learning/what-is-machine-learning-ml.pdf -------------------------------------------------------------------------------- /coursera/serverless-machine-learning-gcp/Module 2: Building ML models with Tensorflow/00-getting-started-with-tensorflow-lab-2a.md: -------------------------------------------------------------------------------- 1 | In this lab, you will learn the following on how the TensorFlow Python API works: 2 | 3 | Building a graph 4 | Running a graph 5 | Feeding values into a graph 6 | Find area of a triangle using TensorFlow 7 | Begin the Lab 8 | https://codelabs.developers.google.com/codelabs/dataeng-machine-learning/index.html?index=#5 9 | 10 | Note: Only complete Part 6 of the Codelab and then return to this course. 11 | 12 | 13 | In this lab, you will implement a simple machine learning model using tf.learn: 14 | 15 | Read .csv data into a Pandas dataframe 16 | Implement a Linear Regression model in TensorFlow 17 | Train the model 18 | Evaluate the model 19 | Predict with the model 20 | Repeat with a Deep Neural Network model in TensorFlow 21 | Begin the Lab 22 | https://codelabs.developers.google.com/codelabs/dataeng-machine-learning/index.html?index=#6 23 | 24 | Note: Only complete Part 7 of the Codelab and then return to this course. 
25 | 26 | 27 | In this lab, you will learn how to: 28 | 29 | Read from a potentially large file in batches 30 | Do a wildcard match on filenames 31 | Break the one-to-one relationship between inputs and features 32 | Begin the Lab 33 | https://codelabs.developers.google.com/codelabs/dataeng-machine-learning/index.html?index=#7 34 | 35 | Note: Only complete Part 8 of the Codelab and then return to this course. 36 | -------------------------------------------------------------------------------- /coursera/serverless-machine-learning-gcp/Module 2: Building ML models with Tensorflow/01-module-2-overview.md: -------------------------------------------------------------------------------- 1 | In this module we will learn how to build machine learning models using TensorFlow. So, we'll talk what TensorFlow is, we'll look at TensorFlow code, how to get started with it. How to do to TensorFlow graphs, how to look at loops in TensorFlow, what kind of loops there are and how to monitor machine learning training as it goes on using TensorBoard. 2 | 3 | In this module, we will look at what TensorFlow is and how to use TensorFlow for machine learning. And then we will look at how to write distributed TensorFlow models so that we can carry out machine learning training on a number of machines, rather than just on a single machine. So what is TensorFlow? 
4 | -------------------------------------------------------------------------------- /coursera/serverless-machine-learning-gcp/Module 2: Building ML models with Tensorflow/Screen Shot 2018-02-16 at 3.57.43 PM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jorwalk/data-engineering-gcp/0b0e55dad8f9f0409fdd0ddeab3eb79c5fd16f68/coursera/serverless-machine-learning-gcp/Module 2: Building ML models with Tensorflow/Screen Shot 2018-02-16 at 3.57.43 PM.png -------------------------------------------------------------------------------- /coursera/serverless-machine-learning-gcp/Module 3: Scaling ML models with Cloud ML Engine/01-scaling-tf-models-with-cloud-ml-engine.md: -------------------------------------------------------------------------------- 1 | 0:00 2 | In the previous module, we looked at how to write a TensorFlow model using the high-level Estimator API. In this module, we will take one such model and we look at how to scale it out, how to run it in a distributed way, so that you're not running it on one machine, you're running it on a whole host of machines. 3 | 0:24 4 | Now here's the thing. All the hard work is done. Once you've written a TensorFlow model using the Estimator API, taking that model and distributing it is simply a bunch of gcloud commands. So this module is, essentially, just scripting. So we're going to swim through this rather quickly. 5 | 0:48 6 | So, what are we going to do in this module? We're going to, essentially, look at how to build an effective machine learning model, part one. And part one of that is how to take a TensorFlow model that you have, and how to scale it out, how to run it on many machines. That's, essentially, the way you're dealing with big data. Instead of running it all on one machine, where it takes days, you look at how to scale it out on many machines. And that's what we're going to do.
But if you're going to do scaling, why do we need Cloud ML Engine? 7 | 8 | 9 | 0:01 10 | In the previous module, we looked at how to build a machine learning model using TensorFlow. But the model that we built was a relatively small model because the data set that it could process was very small. All of that data was loaded into memory, and it processed it. So what's left now? We need to look at how to build effective machine learning models. And one of the key ways that you make a machine learning model effective is to have it work on larger data sets. So in this module we look at the first step, which is how to build an effective machine learning model on big data. And you do that by scaling it out. So we look at Cloud Machine Learning, Cloud ML, which lets you take a TensorFlow model and scale it so that it can run on very large data sets. After this module, the next module that we look at will be on feature engineering, which is the second way to build an effective machine learning model. And the third way to build an effective machine learning model is to change the model architecture. We'll look at that next. But in this module, we'll look at how to scale up a machine learning model so that it can work on large data sets using Cloud ML. 11 | -------------------------------------------------------------------------------- /coursera/serverless-machine-learning-gcp/Module 4: Feature Engineering/08-build-effective-ml-with-model-architectures.md: -------------------------------------------------------------------------------- 1 | 0:00 2 | So at this point, we've talked about how to do feature engineering, how to create features beyond real-valued numbers. The last thing I want to talk about is model architecture. When we go back and look at our ice cream We had a feature that is the price, And it was represented 3 | 0:25 4 | Just one, right?
The price is represented It's a real-valued column, But the employeeId, 5 | 0:38 6 | So how many input neurons 7 | 0:42 8 | If we have 25 employees in our ice Let's just say it's 25, 25 employees, 25 columns. So we talk of these things 9 | 0:59 10 | Because we have to one-hot encode And that blows things up, So price is a dense field. It's dense, it's a continuous number. And employeeId is a categorical feature, 11 | 1:19 12 | Now, if you have an image So let's say we have an image. Every pixel value is dense. It's a number, it's very dense. R, G, B, and A, so there are four numbers. It's very dense, so subtracting, and doing all the things That's what a neural network is great at. It's at adding and multiplying, and But this is what your sparse feature looks like. In every row, there's a one somewhere, but in general, it's a zero, right? Something like this. Adding them and subtracting them. So if you take two rows here and you add them or you subtract them, it's still pretty much gonna be all zeros. So this is something that's very hard for a neural network to deal with, because there are so many weights that essentially have no impact, because you multiply a weight by zero, it's still zero. So you're basically gonna get stuck in some local minimum, and not be able to get out of it. But this is the kind of thing that a linear model is gonna do very, very, very well, because as we talked about with the taxi color and identifying taxis, it's very easy if you have a sparse thing to just assign high weights to certain things. So, for example, in our ice cream example, you may say, this particular employee, no matter what they do, no matter how long the customers wait, they're very chatty, they're very helpful, they're smiling all the time and the customer is always gonna be happy. So you can basically go ahead and assign weights to sparse features very easily. A linear model works very well for sparse features.
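The sparsity argument can be seen directly with a small sketch. It assumes the 25-employee example from the transcript; the specific employee IDs and the helper name `one_hot` are hypothetical.

```python
def one_hot(index, size):
    """Encode a category index as a vector containing a single 1."""
    return [1 if i == index else 0 for i in range(size)]

NUM_EMPLOYEES = 25  # the transcript's example: 25 employees, 25 columns

row_a = one_hot(3, NUM_EMPLOYEES)   # hypothetical employee IDs
row_b = one_hot(17, NUM_EMPLOYEES)

# Adding two one-hot rows still leaves almost every position at zero.
summed = [a + b for a, b in zip(row_a, row_b)]
print("columns:", len(row_a))
print("non-zero entries in one row:", sum(row_a))
print("zeros after adding two rows:", summed.count(0), "of", NUM_EMPLOYEES)
```

A single categorical feature turns into 25 input columns, and in any given row all but one of the corresponding weights are multiplied by zero.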
So, our neural network works very well for 13 | 3:26 14 | But for sparse features, we want to take our input and So we want to have a linear weights to just specific inputs. 15 | 3:42 16 | But in reality, of dense features and wide features. Some real valued continuous numbers and So which one do we use? Do we use a linear model Or do we use a deep model because 17 | 4:09 18 | Can we have both? Sure, what if you were to take all And say, there are some inputs And those inputs, I will pass through so we can do all of our fine grained And I'll take all my sparse features and 19 | 4:42 20 | So can we have our cake and eat it too? 21 | 4:46 22 | And that's exactly what a wide and a deep neural net linear combined So wide and deep model essentially says, tell me which ones are sparse, so Tell me which ones are dense, so that I And tell me how many layers and So this is a wide and deep network. So rather than it being black and white, We're going to take our features and decide which ones of them are dense, And then, we're going to create 23 | 5:41 24 | So this is essentially a constructor for It asks you for the deep columns and and you just specify them. 
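To illustrate the wide-and-deep idea without depending on any particular TensorFlow version, here is a small plain-Python sketch of the forward pass. It is an assumption-laden toy, not the estimator the lecture refers to: all weights are invented, and `wide_and_deep`, `dense_layer`, and `relu` are illustrative names. The sparse employeeId goes through a linear ("wide") path, and the dense price goes through a hidden layer ("deep") path.

```python
def relu(x):
    """Rectified linear activation."""
    return x if x > 0 else 0.0

def dense_layer(inputs, weights, biases):
    """One fully connected ReLU layer; weights holds one row per output unit."""
    return [relu(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def wide_and_deep(employee_id, price, wide_weights, hidden_w, hidden_b, out_w, bias):
    """Sum a linear score for the sparse input and a network score for the dense input."""
    # Wide path: a one-hot vector times a weight vector is just a lookup.
    wide = wide_weights[employee_id]
    # Deep path: the dense feature passes through a hidden layer.
    hidden = dense_layer([price], hidden_w, hidden_b)
    deep = sum(w * h for w, h in zip(out_w, hidden))
    return wide + deep + bias

# Made-up parameters: employee 7 gets a high weight of their own.
wide_weights = [0.0] * 25
wide_weights[7] = 2.0
hidden_w, hidden_b = [[0.5], [-0.5]], [0.0, 1.0]
out_w, bias = [1.0, 1.0], 0.1

for employee in (0, 7):
    score = wide_and_deep(employee, 4.0, wide_weights, hidden_w, hidden_b, out_w, bias)
    print("employee", employee, "score", round(score, 2))
```

The design choice mirrors the transcript: the model is told which inputs are sparse (handled by per-category linear weights) and which are dense (handled by hidden layers), rather than forcing one architecture onto both.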
25 | -------------------------------------------------------------------------------- /know/bigquery.md: -------------------------------------------------------------------------------- 1 | # BigQuery 2 | 3 | * [Introduction to Partitioned Tables](./bigquery/partitioned-tables.md) 4 | * [Avoiding SQL Anti-Patterns](./bigquery/best-practices-performance-patterns.md) 5 | * [Managing Input Data and Data Sources](./bigquery/best-practices-performance-input.md) 6 | * [Life of a BigQuery streaming insert](./bigquery/life-of-a-bigquery-streaming-insert.md) 7 | * [Access Control](https://cloud.google.com/bigquery/docs/access-control) 8 | * [BigQuery Best Practices: Controlling Costs](https://cloud.google.com/bigquery/docs/best-practices-costs) 9 | * [BigQuery Best Practices: Optimizing Storage](https://cloud.google.com/bigquery/docs/best-practices-storage) 10 | * [Creating an Authorized View in BigQuery](https://cloud.google.com/bigquery/docs/share-access-views) 11 | * [Improve BigQuery ingestion times 10x by using Avro source format](https://cloud.google.com/blog/products/gcp/improve-bigquery-ingestion-times-10x-by-using-avro-source-format) 12 | * [Querying Multiple Tables Using a Wildcard Table](https://cloud.google.com/bigquery/docs/querying-wildcard-tables) 13 | -------------------------------------------------------------------------------- /lab/README.md: -------------------------------------------------------------------------------- 1 | # Google Cloud Training Lab 2 | 3 | ## Data Engineering 4 | This advanced-level quest is unique amongst the other Qwiklabs offerings. The labs have been curated to give IT professionals hands-on practice with topics and services that appear in the [Google Cloud Certified Professional Data Engineer Certification](https://cloud.google.com/certification/data-engineer). From Big Query, to Dataproc, to Tensorflow, this quest is composed of specific labs that will put your GCP data engineering knowledge to the test. 
Be aware that while practice with these labs will increase your skills and abilities, you will need other preparation too. The exam is quite challenging, and external study, experience, or a background in cloud data engineering is recommended.

### [Weather Data in BigQuery](https://google.qwiklabs.com/focuses/609?parent=catalog)

In this lab you analyze historical weather observations using BigQuery and use weather data in conjunction with other datasets. This lab is part of a series of labs on processing scientific data.

### [Analyzing Natality Data Using Datalab and BigQuery](https://google.qwiklabs.com/focuses/604?parent=catalog)

In this lab you analyze a large (137 million rows) natality dataset using Google BigQuery and Cloud Datalab. This lab is part of a series of labs on processing scientific data.

### [Bigtable: Qwik Start - Hbase Shell](https://google.qwiklabs.com/focuses/580?parent=catalog)

This hands-on lab will show you how to use the HBase shell to connect to a Cloud Bigtable instance. Watch the short video [Bigtable: Qwik Start - Qwiklabs Preview](https://youtu.be/unre6cmOvvQ).

### [Predict Housing Prices with Tensorflow and Cloud ML Engine](https://google.qwiklabs.com/focuses/3644?parent=catalog)

In this lab you will build an end-to-end machine learning solution using TensorFlow and Cloud ML Engine, and leverage the cloud for distributed training and online prediction.

### [Run a Big Data Text Processing Pipeline in Cloud Dataflow](https://google.qwiklabs.com/focuses/608?parent=catalog)

In this lab you will use Google Cloud Dataflow to create a Maven project with the Cloud Dataflow SDK, and run a distributed word count pipeline using the Google Cloud Platform Console.

### [Launching Dataproc Jobs with Cloud Composer](https://google.qwiklabs.com/focuses/3357?parent=catalog)

In this lab you'll use Google Cloud Composer to automate the transform and load steps of an ETL data pipeline.

### [Building an IoT Analytics Pipeline on Google Cloud Platform](https://google.qwiklabs.com/focuses/605?parent=catalog)

This lab shows you how to connect and manage devices using Cloud IoT Core, ingest the stream of information using Cloud Pub/Sub, process the IoT data using Cloud Dataflow, and use BigQuery to analyze the IoT data.

### [Working with Google Cloud Dataprep](https://google.qwiklabs.com/focuses/610?parent=catalog)

Cloud Dataprep is Google's self-service data preparation tool. In this lab, you will learn how to use Cloud Dataprep to clean and enrich multiple datasets using a mock use case scenario of customer info and purchase history.

### [Simulating a Data Warehouse in the Cloud Using BigQuery and Dataflow](https://google.qwiklabs.com/focuses/3506?parent=catalog)

In this lab you build several data pipelines that will ingest data from the USA Babynames dataset into BigQuery, simulating a batch transformation.

### [Predict Visitor Purchases with a Classification Model in BQML](https://google.qwiklabs.com/focuses/1794?parent=catalog)

In this lab you will use a newly available ecommerce dataset to run some typical queries that businesses would want to know about their customers' purchasing habits.
--------------------------------------------------------------------------------