├── Chapter16 ├── DataEngGlueCWS3CuratedZoneRole-trust-policy.json ├── CFN-glue_job-streams_by_category.cfn ├── Glue-streaming_views_by_category.py └── README.md ├── Chapter03 ├── test.csv ├── DataEngLambdaS3CWGluePolicy.json ├── CSVtoParquetLambda.py └── README.md ├── Chapter10 ├── dataeng-random-failure-generator.py ├── dataeng-check-file-ext.py ├── README.md └── ProcessFileStateMachine.json ├── Chapter08 └── README.md ├── LICENSE ├── Chapter13 ├── website-reviews-analysis-role.py └── README.md ├── Chapter02 └── README.md ├── Chapter01 └── README.md ├── Chapter07 └── README.md ├── Chapter17 └── README.md ├── Chapter05 ├── Data-Engineering-Whiteboard-Template.drawio ├── README.md ├── Data-Engineering-Whiteboard-Completed-Notes.drawio └── Data-Engineering-Completed-Whiteboard.drawio ├── Chapter12 └── README.md ├── Chapter06 ├── mysql-ec2loader.cfn └── README.md ├── Chapter04 ├── README.md └── AthenaAccessCleanZoneDB ├── Chapter15 └── README.md ├── Chapter09 └── README.md ├── Chapter11 └── README.md ├── Chapter14 └── README.md └── README.md /Chapter16/DataEngGlueCWS3CuratedZoneRole-trust-policy.json: -------------------------------------------------------------------------------- 1 | { 2 | "Version": "2012-10-17", 3 | "Statement": [ 4 | { 5 | "Effect": "Allow", 6 | "Principal": { 7 | "Service": [ 8 | "glue.amazonaws.com", 9 | "cloudformation.amazonaws.com" 10 | ] 11 | }, 12 | "Action": "sts:AssumeRole" 13 | } 14 | ] 15 | } 16 | -------------------------------------------------------------------------------- /Chapter03/test.csv: -------------------------------------------------------------------------------- 1 | Name,favorite_num 2 | Vrinda,22 3 | Tracy,28 4 | Gareth,23 5 | Chris,16 6 | Emma,14 7 | Carlos,7 8 | Cooper,11 9 | Praful,4 10 | David,33 11 | Shilpa,2 12 | Gary,18 13 | Sean,20 14 | Ha-yoon,9 15 | Elizabeth,8 16 | Mary,1 17 | Chen,15 18 | Janet,22 19 | Mariusz,25 20 | Romain,11 21 | Matt,25 22 | Brendan,19 23 | Roger,2 24 | Jack,7 25 | Sachin,17 26 | 
Francisco,5 27 | -------------------------------------------------------------------------------- /Chapter10/dataeng-random-failure-generator.py: -------------------------------------------------------------------------------- 1 | from random import randint 2 | 3 | def lambda_handler(event, context): 4 | print('Processing') 5 | #Our ETL code would go here 6 | value = randint(0, 2) 7 | # We now divide 10 by our random number. 8 | # If the random number is 0, our function will fail 9 | newval = 10 / value 10 | print(f'New Value is: {newval}') 11 | return(newval) 12 | -------------------------------------------------------------------------------- /Chapter10/dataeng-check-file-ext.py: -------------------------------------------------------------------------------- 1 | import urllib.parse 2 | import json 3 | import os 4 | print('Loading function') 5 | def lambda_handler(event, context): 6 | print("Received event: " + json.dumps(event, indent=2)) 7 | # Get the bucket and object key from the event and extract the file extension 8 | bucket = event['detail']['bucket']['name'] 9 | key = urllib.parse.unquote_plus(event['detail']['object']['key'], encoding='utf-8') 10 | filename, file_extension = os.path.splitext(key) 11 | print(f'File extension is: {file_extension}') 12 | payload = { 13 | "file_extension": file_extension, 14 | "bucket": bucket, 15 | "key": key 16 | } 17 | return payload 18 | -------------------------------------------------------------------------------- /Chapter08/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 8 - Identifying and Enabling Data Consumers 2 | 3 | In this chapter, we explored a variety of data consumers that you are likely to find in most 4 | organizations, including business users, data analysts, and data scientists. We briefly 5 | examined their roles, and then looked at the types of AWS services that each of them is 6 | likely to use to work with data. 
7 | 8 | ## Hands-on Activity 9 | In the hands-on section of this chapter, we took on the role of a data analyst, tasked 10 | with creating a mailing list for the marketing department. We used data that had been 11 | imported from a MySQL database into S3 in a previous chapter, joined two of the tables 12 | from that database, and transformed the data in some of the columns. Then, we wrote the 13 | newly transformed dataset out to Amazon S3 as a CSV file. 14 | 15 | #### Configuring new datasets for AWS Glue DataBrew 16 | - AWS Management Console - Glue DataBrew: https://console.aws.amazon.com/databrew 17 | 18 | 19 | -------------------------------------------------------------------------------- /Chapter03/DataEngLambdaS3CWGluePolicy.json: -------------------------------------------------------------------------------- 1 | { 2 | "Version": "2012-10-17", 3 | "Statement": [ 4 | { 5 | "Effect": "Allow", 6 | "Action": [ 7 | "logs:PutLogEvents", 8 | "logs:CreateLogGroup", 9 | "logs:CreateLogStream" 10 | ], 11 | "Resource": "arn:aws:logs:*:*:*" 12 | }, 13 | { 14 | "Effect": "Allow", 15 | "Action": [ 16 | "s3:*" 17 | ], 18 | "Resource": [ 19 | "arn:aws:s3:::dataeng-landing-zone-INITIALS/*", 20 | "arn:aws:s3:::dataeng-landing-zone-INITIALS", 21 | "arn:aws:s3:::dataeng-clean-zone-INITIALS/*", 22 | "arn:aws:s3:::dataeng-clean-zone-INITIALS" 23 | ] 24 | }, 25 | { 26 | "Effect": "Allow", 27 | "Action": [ 28 | "glue:*" 29 | ], 30 | "Resource": "*" 31 | } 32 | ] 33 | } 34 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Packt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, 
distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Chapter13/website-reviews-analysis-role.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | import json 3 | comprehend = boto3.client(service_name='comprehend', 4 | region_name='us-east-2') 5 | 6 | def lambda_handler(event, context): 7 | for record in event['Records']: 8 | payload = record["body"] 9 | print(str(payload)) 10 | 11 | print('Calling DetectSentiment') 12 | response = comprehend.detect_sentiment(Text=payload, 13 | LanguageCode='en') 14 | sentiment = response['Sentiment'] 15 | sentiment_score = response['SentimentScore'] 16 | print(f'SENTIMENT: {sentiment}') 17 | print(f'SENTIMENT SCORE: {sentiment_score}') 18 | 19 | print('Calling DetectEntities') 20 | response = comprehend.detect_entities(Text=payload, 21 | LanguageCode='en') 22 | #print(response['Entities']) 23 | for entity in response['Entities']: 24 | entity_text = entity['Text'] 25 | entity_type = entity['Type'] 26 | print( 27 | f'ENTITY: {entity_text}, ' 28 | f'ENTITY TYPE: {entity_type}' 29 | ) 30 | return 31 | 
-------------------------------------------------------------------------------- /Chapter02/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 2 - Data Management Architectures for Analytics 2 | In this chapter, we learned about the foundational architectural concepts that are typically applied when designing real-life analytics data management and processing solutions. 3 | We also did a deep-dive into three analytics data management architectures that are popular today: data warehouses, data lakes, and data lakehouses. 4 | 5 | ## Hands-on Activity 6 | In the ***hands-on activity*** section, you learnt how to access the AWS Command Line Interface (CLI) via AWS CloudShell, and then used the CLI to create Amazon S3 buckets (storage containers in the Amazon S3 service) which we will use in later chapters. 7 | 8 | ### Links 9 | - **Learn more about S3 bucket naming rules:** https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucketnamingrules.html 10 | - **AWS CloudShell service:** https://us-east-2.console.aws.amazon.com/cloudshell/home 11 | 12 | ### Commands 13 | #### Create a new Amazon S3 bucket 14 | The following command, when run via the AWS CLI, creates a new bucket called *dataeng-test-bucket-123*. If a bucket with this name already exists the command will fail, so you need to ensure you provide a globally unique name. 15 | 16 | ``` 17 | aws s3 mb s3://dataeng-test-bucket-123 18 | ``` 19 | -------------------------------------------------------------------------------- /Chapter01/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 1 - An Introduction to Data Engineering 2 | In this chapter we reviewed how data is becoming an increasingly important asset for organizations, and looked at some of the core challenges of working with big data. We then reviewed some of the data related roles that are commonly seen in organizations today. 
3 | 4 | ## Links 5 | - **AWS Data Lake Definition:** https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/ 6 | 7 | ## Hands-on Activity 8 | 9 | In the ***hands-on activity*** section of this chapter, you were given step-by-step instructions to guide you in creating a new AWS account. No code was written and no policies were configured in this chapter. 10 | 11 | ### Links 12 | - Creating a billing alarm to monitor your estimated AWS charges: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/monitor_estimated_charges_with_cloudwatch.html 13 | - Google Voice for creating a virtual phone number for use with your account: https://voice.google.com/ 14 | - What to do if you don’t receive a confirmation email within 24 hours: https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/ 15 | - Best practices for securing your account - root user: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_root-user.html 16 | - Best practices for securing your account - Multi-Factor Authentication (MFA): https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_mfa_enable_virtual.html 17 | -------------------------------------------------------------------------------- /Chapter07/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 7 - Transforming Data to Optimize for Analytics 2 | 3 | In this chapter, we reviewed a number of common transformations that can be applied 4 | to raw datasets, covering both generic transformations used to optimize data for analytics, 5 | and business transforms used to enrich and denormalize datasets. 6 | 7 | ## Hands-on Activity 8 | In the ***hands-on activity*** section of this chapter, we joined various datasets that we had previously ingested in order to denormalize the underlying datasets. We then joined data we had ingested from a database with data that had been streamed into the data lake. 
9 | 10 | #### Creating a new IAM role for the Glue job 11 | - AWS Management Console - IAM Policies: https://console.aws.amazon.com/iamv2/home?#/policies 12 | 13 | - AWS IAM policy for `DataEngGlueCWS3CuratedZoneWrite`. Change *INITIALS* in the policy below to match the names of the relevant buckets that you previously created. 14 | 15 | ``` 16 | { 17 | "Version": "2012-10-17", 18 | "Statement": [ 19 | { 20 | "Effect": "Allow", 21 | "Action": [ 22 | "s3:GetObject" 23 | ], 24 | "Resource": [ 25 | "arn:aws:s3:::dataeng-landing-zone-INITIALS/*", 26 | "arn:aws:s3:::dataeng-clean-zone-INITIALS/*" 27 | ] 28 | }, 29 | { 30 | "Effect": "Allow", 31 | "Action": [ 32 | "s3:*" 33 | ], 34 | "Resource": "arn:aws:s3:::dataeng-curated-zone-INITIALS/*" 35 | } 36 | ] 37 | } 38 | ``` 39 | 40 | 41 | -------------------------------------------------------------------------------- /Chapter17/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 17 - Wrapping Up the First Part of Your Learning Journey 2 | 3 | In this chapter we looked at some of the complexities of data engineering in the real world, and 4 | examined some examples of real-world data pipelines. We then looked at some emerging trends to get 5 | an idea of what the future holds for data engineering, such as the increased adoption of a data 6 | mesh approach, the reality of multi-cloud environments, the work that will need to be done to 7 | migrate to open-table formats, how Generative AI may impact the field, and more. 8 | 9 | ## Hands-on Activity 10 | In the hands-on activity we looked at how you can review your AWS spend in the billing console, and then how you could optionally close your AWS account. 
11 | 12 | #### Reviewing AWS Billing to identify the resources being charged for 13 | 14 | - AWS Management Console - Billing Console: https://console.aws.amazon.com/billing/home 15 | 16 | - AWS Documentation on how to create a billing alarm to monitor estimated charges: [Documentation Link](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/monitor_estimated_charges_with_cloudwatch.html) 17 | 18 | #### Closing your AWS account 19 | 20 | - AWS Documentation on considerations before closing your AWS account: [Documentation Link](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/close-account.html) 21 | 22 | - AWS Management Console - Login: https://console.aws.amazon.com 23 | [Make sure to log in with the email address you used when creating the account] 24 | 25 | - AWS Billing Console: https://console.aws.amazon.com/billing/home 26 | -------------------------------------------------------------------------------- /Chapter05/Data-Engineering-Whiteboard-Template.drawio: -------------------------------------------------------------------------------- 1 | 
7VrbbuI6FP0aHoviOBd4LJeZqdQjHYmRRpoXZBKTeBrsyHEKnK8fh9iFxIFmdICGoSDReNtx7LX2rdvpwfFq85WjNP6HhTjp2Va46cFJz7ZtMPDkn0KyLSXAcv1SEnESKtleMCP/YT1QSXMS4qwyUDCWCJJWhQGjFAeiIkOcs3V12JIl1aemKMKGYBagxJT+IKGI9eosa9/xDZMoVo92dMcK6cFKkMUoZOsDEZz24JgzJsqr1WaMkwI9jUt535cjvW8L45iKNjd4JJ+n9nQS/kz5dvZCk8XTtwe7nOUVJbnasFqs2GoEOMtpiItJrB4crWMi8CxFQdG7lqRLWSxWiWwBeWkuSq3zFXOBNwcitcivmK2w4Fs5RPU+eK5CTOnMA3SHfbcUrfccOEM1LK7ArxQOKd6jt/n30MgLhc4fIAUNpCZIICn5zhHNloxLwgmjBnxy16KKUSY4e8FjljAuJZRROXK0JElSE6GERFQ2A4kllvJRgSGRqvmoOlYkDIvHNJJSpW3JqFDGZTvn4smu8wShwZLtNbCkbzw7Sc776sxykRAq4df+QsOj0e/ZUH6/FE8dRRyFBO/7Gsgqhvs+AN4xZkOUxW88aAqf0QIn/7KM7HQGThZMCLZq4FiwtEkVDtTpgFpg67bab/FIlKXlRpdkU6xjlDJSzDJ9lZNlWidjlBY3rDZR4cj7aJ05/UUevGAxXxMRz9nil5wlu5TqABv0wfDwY2iS65qK5Pp9+0Kq5L6vSjiUoUI1GRcxixhFyXQvrRnhfswzK2jdIf8LC7FV/KFcsOPkeqfAz1jOA3xiP8ovCsQjLE6MA8qGi82d5JLjRHq812qcPDsNnkHDM6IhoZEU/iwMrPP+9jRvf2I0wNZhUJuNMzDD4jX9rf8h/tZ9hNbI/fv8bSRRnIcyq5gH8idh0ZkUx3adWqBu0BvX7cOhqTrepVRncCyfGqu9349hQ2nYHqxRZH+waQ8/U6mbSKUcr/OplP5f/q/JpYCuRLybTPmdSqb0ug+IGCcY0fvLpRy3c7kUaFGLuS0rgW2tBHTLSsxSj0GEjmYTwqUvLwMPLdAuQP8/YbFIby13INMBIzaCqe3ZU8MIZc9y9+ligGwMhhyXuvMUFOsZyWZ5VcuGMcUcJWcyeNceVs29Kb9qyH+17Pxa9lmruo0EC1rdT7DMYtU450hICO8ustvD7kV2s4h145HdbxvZ3W5F9hb1qs/IfmORHfo1//zRkd1pkT9e/VAVOF4VJehZDYeqoMkxOpZ/KahaJEHXhmpQDx/Nx89HkPL00PNjZQZ5VTF9DAKcZd2P8Wc7eQYOqFHUcPIM4FVPns0Yr+vZjGb5CvM7Isg1baiBoOueVLWoQV7dKQ/roUu6mtaO5lJAmTXCjwdqADoIlFmvUxY/2+VAd2TvD8M2rwJd197NJOyJRjgr39GyvjOW3BNBfv2U6IIEyeb+rcZd38HLoXD6Gw== -------------------------------------------------------------------------------- /Chapter16/CFN-glue_job-streams_by_category.cfn: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: '2010-09-09' 2 | # CloudFormation template to deploy the streaming view by category 3 | # Glue job. 4 | # In the Parameters section we define parameters that can be passed to 5 | # CloudFormation at deployment time. 
If no parameters are passed in, then the 6 | # specified default is used. 7 | Parameters: 8 | # JobName: The name of the job to be created 9 | JobName: 10 | Type: String 11 | Default: streaming_views_by_category 12 | # The name of the IAM role that the job assumes. It must have access to data, 13 | # script, and temporary directory. We created this IAM role via the AWS 14 | # console in Chapter 7. 15 | IAMRoleName: 16 | Type: String 17 | Default: DataEngGlueCWS3CuratedZoneRole 18 | # The S3 path where the script for this job is located. Modify the default 19 | # below to reference the specific path for your S3 bucket 20 | ScriptLocation: 21 | Type: String 22 | Default: "s3://data-product-film-gse23/glueETL_code/Glue-streaming_views_by_category.py" 23 | # In the Resources section, we define the AWS resources we want to deploy 24 | # with this CloudFormation template. In our case, it is just a single Glue 25 | # job, but a single template can deploy multiple different AWS resources 26 | Resources: 27 | # Below we define our Glue job, and we substitute parameters in from the 28 | # above section. 29 | GlueJob: 30 | Type: AWS::Glue::Job 31 | Properties: 32 | Role: !Ref IAMRoleName 33 | Description: Glue job to calculate number of streams by category 34 | Command: 35 | Name: glueetl 36 | ScriptLocation: !Ref ScriptLocation 37 | WorkerType: G.1X 38 | NumberOfWorkers: 2 39 | GlueVersion: "3.0" 40 | Name: Streaming Views by Category 41 | -------------------------------------------------------------------------------- /Chapter12/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 12 - Visualizing Data with Amazon QuickSight 2 | 3 | In this chapter we discussed the power of visually representing data, and then explored core Amazon 4 | QuickSight concepts. 
We looked at how various data sources can be used with QuickSight, 5 | how data can optionally be imported into the SPICE storage engine, and how you can 6 | perform some data preparation tasks using QuickSight. 7 | 8 | We then did a deeper dive into the concepts of analyses (where new visuals are authored) 9 | and dashboards (published analyses that can be shared with data consumers). As part 10 | of this, we also examined some of the common types of visualizations available in 11 | QuickSight. 12 | 13 | We then looked at some of the advanced features available in QuickSight, including ML 14 | Insights (which uses machine learning to detect outliers in data and forecast future data 15 | trends), as well as embedded dashboards (which enable you to embed either the full 16 | QuickSight console or dashboards directly into your websites and applications). We also 17 | examined QuickSight Q (for Natural Language Queries), and the ability to generate paginated reports. 18 | 19 | ## Hands-on Activity 20 | 21 | #### Setting up a new QuickSight account and loading a dataset 22 | In this section, we create a visualization using data from [SimpleMaps.com](https://simplemaps.com). The basic map data is distributed under the [Creative Commons Attribution 4.0 International (CC BY 4.0) license](https://creativecommons.org/licenses/by/4.0/). 
23 | 24 | - Link to download the Simple Maps basic data: https://simplemaps.com/static/data/world-cities/basic/simplemaps_worldcities_basicv1.76.zip 25 | [Extract the ZIP file to access the underlying CSV file] 26 | 27 | - Link to the Amazon QuickSight service: https://quicksight.aws.amazon.com/ 28 | 29 | 30 | 31 | 32 | -------------------------------------------------------------------------------- /Chapter03/CSVtoParquetLambda.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | import awswrangler as wr 3 | from urllib.parse import unquote_plus 4 | 5 | def lambda_handler(event, context): 6 | # Get the source bucket and object name as passed to the Lambda function 7 | for record in event['Records']: 8 | bucket = record['s3']['bucket']['name'] 9 | key = unquote_plus(record['s3']['object']['key']) 10 | 11 | # We will set the DB and table name based on the last two elements of 12 | # the path prior to the file name. If key = 'dms/sakila/film/LOAD01.csv', 13 | # then the following lines will set db to sakila and table_name to 'film' 14 | key_list = key.split("/") 15 | print(f'key_list: {key_list}') 16 | db_name = key_list[len(key_list)-3] 17 | table_name = key_list[len(key_list)-2] 18 | 19 | print(f'Bucket: {bucket}') 20 | print(f'Key: {key}') 21 | print(f'DB Name: {db_name}') 22 | print(f'Table Name: {table_name}') 23 | 24 | input_path = f"s3://{bucket}/{key}" 25 | print(f'Input_Path: {input_path}') 26 | output_path = f"s3://dataeng-clean-zone-INITIALS/{db_name}/{table_name}" 27 | print(f'Output_Path: {output_path}') 28 | 29 | input_df = wr.s3.read_csv([input_path]) 30 | 31 | current_databases = wr.catalog.databases() 32 | wr.catalog.databases() 33 | if db_name not in current_databases.values: 34 | print(f'- Database {db_name} does not exist ... 
creating') 35 | wr.catalog.create_database(db_name) 36 | else: 37 | print(f'- Database {db_name} already exists') 38 | 39 | result = wr.s3.to_parquet( 40 | df=input_df, 41 | path=output_path, 42 | dataset=True, 43 | database=db_name, 44 | table=table_name, 45 | mode="append") 46 | 47 | print("RESULT: ") 48 | print(f'{result}') 49 | 50 | return result 51 | -------------------------------------------------------------------------------- /Chapter05/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 5 - Architecting Data Engineering Pipelines 2 | 3 | In this chapter, we reviewed an approach to developing data engineering pipelines by 4 | identifying a limited-scope project, and then whiteboarding a high-level architecture 5 | diagram. We looked at how we could have a workshop, in conjunction with relevant 6 | stakeholders in the organization, to discuss requirements and plan the initial architecture. 7 | 8 | ## Links from this chapter 9 | 10 | - Spotify blog providing an example of a data engineering pipeline: https://engineering.atspotify.com/2020/02/18/spotify-unwrapped-how-we-brought-you-a-decade-of-data/ 11 | 12 | ## Hands-on Activity 13 | In the ***hands-on activity*** section of this chapter, we read through some fictional notes of a meeting to discuss a new project that had specific data requirements. As we read through the notes, we sketched out a high-level whiteboard architecture showing data consumers, data ingestion sources, and transformations. 14 | 15 | - Link to diagrams.net - an online architecture design tool: https://www.diagrams.net/. 16 | 17 | **NOTE:** The files linked to below can be downloaded from here (in .drawio format) and then opened in diagrams.net and modified. To download the source files, click on the link, and then right-click the **Raw** button, and select **Save link as**. This will let you download the XML draw.io file which you can then open with diagrams.net. 
18 | 19 | - Generic Data Architecture Whiteboard Template (drawio format): [Data-Engineering-Whiteboard-Template.drawio](Data-Engineering-Whiteboard-Template.drawio) 20 | 21 | - Completed Data Architecture Whiteboard Diagram (drawio format): [Data-Engineering-Completed-Whiteboard.drawio](Data-Engineering-Completed-Whiteboard.drawio) 22 | 23 | - Completed Data Architecture Whiteboard Notes (drawio format): [Data-Engineering-Whiteboard-Completed-Notes.drawio](Data-Engineering-Whiteboard-Completed-Notes.drawio) 24 | 25 | -------------------------------------------------------------------------------- /Chapter06/mysql-ec2loader.cfn: -------------------------------------------------------------------------------- 1 | --- 2 | AWSTemplateFormatVersion: 2010-09-09 3 | Description: Chapter 6 - Data Engineering with AWS 4 | Parameters: 5 | DBPassword: 6 | Type: String 7 | NoEcho: true 8 | Description: The database admin account password 9 | MinLength: 8 10 | AllowedPattern: ^[a-zA-Z0-9]*$ 11 | ConstraintDescription: Password must contain only alphanumeric characters. 
12 | LatestAmiId: 13 | Type: 'AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>' 14 | Default: '/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-6.1-x86_64' 15 | Resources: 16 | MySQLInstance: 17 | Type: AWS::RDS::DBInstance 18 | Properties: 19 | AllocatedStorage: 20 20 | DBInstanceClass: db.t3.micro 21 | Engine: MySQL 22 | MasterUsername: admin 23 | MasterUserPassword: !Ref DBPassword 24 | EC2Instance: 25 | Type: AWS::EC2::Instance 26 | Properties: 27 | ImageId: !Ref LatestAmiId 28 | InstanceType: t3.micro 29 | UserData: 30 | Fn::Base64: 31 | Fn::Sub: 32 | - | 33 | #!/bin/bash 34 | 35 | yum install -y mariadb105 36 | 37 | curl https://downloads.mysql.com/docs/sakila-db.zip -o sakila.zip 38 | 39 | unzip sakila.zip 40 | 41 | cd sakila-db 42 | 43 | echo "mysql --host=${EndPointAddress} --user=admin --password=${DBPassword} -f < sakila-schema.sql" | tee -a /var/tmp/userdata.log 44 | 45 | mysql --host=${EndPointAddress} --user=admin --password=${DBPassword} -f < sakila-schema.sql | tee -a /var/tmp/userdata.log1 46 | 47 | mysql --host=${EndPointAddress} --user=admin --password=${DBPassword} -f < sakila-data.sql | tee -a /var/tmp/userdata.log2 48 | 49 | - EndPointAddress: !GetAtt MySQLInstance.Endpoint.Address 50 | Tags: 51 | - Key: Name 52 | Value: dataeng-book-ec2-3 53 | DependsOn: MySQLInstance 54 | -------------------------------------------------------------------------------- /Chapter04/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 4 - Data Governance, Security and Cataloging 2 | 3 | In this chapter, we did a deeper dive into best practices for handling data responsibly and securely, and for making sure that the value of data can be maximized for an organization. 4 | 5 | ## Hands-on Activity 6 | In the ***hands-on activity*** section of this chapter, we created a new data lake user and assigned them permissions using AWS Identity and Access Management (IAM). 
Once we verified their permissions, we then transitioned data authorization over to use Lake Formation for fine-grained access control (including the ability to control permissions at the column level). 7 | 8 | #### Creating a new user with IAM permissions 9 | - AWS Management Console - IAM Policies: https://console.aws.amazon.com/iamv2/home?#/policies 10 | 11 | - Resource section of policy that is updated to limit access to just the Glue `cleanzonedb` database and tables in that database 12 | ``` 13 | "Resource": [ 14 | "arn:aws:glue:*:*:catalog", 15 | "arn:aws:glue:*:*:database/cleanzonedb", 16 | "arn:aws:glue:*:*:database/cleanzonedb*", 17 | "arn:aws:glue:*:*:table/cleanzonedb/*" 18 | ] 19 | ``` 20 | 21 | - New section of policy that enables access to the underlying S3 storage for the `cleanzonedb` database. **Ensure that you modify INITIALS below to reflect the correct name for your CleanZoneDB bucket.** 22 | ``` 23 | { 24 | "Effect": "Allow", 25 | "Action": [ 26 | "s3:GetBucketLocation", 27 | "s3:GetObject", 28 | "s3:ListBucket", 29 | "s3:ListBucketMultipartUploads", 30 | "s3:ListMultipartUploadParts", 31 | "s3:AbortMultipartUpload", 32 | "s3:PutObject" 33 | ], 34 | "Resource": [ 35 | "arn:aws:s3:::dataeng-clean-zone-INITIALS/*" 36 | ] 37 | }, 38 | ``` 39 | 40 | - Athena query to validate that IAM permissions are correct for the datalake-user: 41 | `select * from cleanzonedb.csvtoparquet` 42 | 43 | #### Transitioning to managing fine-grained permissions with AWS Lake Formation 44 | 45 | - AWS Management Console - Lake Formation: https://console.aws.amazon.com/lakeformation/home 46 | -------------------------------------------------------------------------------- /Chapter16/Glue-streaming_views_by_category.py: -------------------------------------------------------------------------------- 1 | import sys 2 | from awsglue.transforms import * 3 | from awsglue.utils import getResolvedOptions 4 | from pyspark.context import SparkContext 5 | from awsglue.context 
import GlueContext 6 | from awsglue.job import Job 7 | from awsglue.dynamicframe import DynamicFrame 8 | 9 | args = getResolvedOptions(sys.argv, ["JOB_NAME"]) 10 | sc = SparkContext() 11 | glueContext = GlueContext(sc) 12 | spark = glueContext.spark_session 13 | job = Job(glueContext) 14 | job.init(args["JOB_NAME"], args) 15 | 16 | # Load the streaming_films table into a Glue DynamicFrame from the Glue catalog 17 | StreamingFilms = glueContext.create_dynamic_frame.from_catalog( 18 | database="curatedzonedb", 19 | table_name="streaming_films", 20 | transformation_ctx="StreamingFilms", 21 | ) 22 | 23 | # Convert the DynamicFrame to a Spark DataFrame 24 | spark_dataframe = StreamingFilms.toDF() 25 | 26 | # Create a SparkSQL table based on the streaming_films table 27 | spark_dataframe.createOrReplaceTempView("streaming_films") 28 | 29 | # Create a new DataFrame that records number of streams for each 30 | # category of film 31 | CategoryStreamsDF = glueContext.sql(""" 32 | SELECT category_name, 33 | count(category_name) streams 34 | FROM streaming_films 35 | GROUP BY category_name 36 | """) 37 | 38 | # Convert the DataFrame back to a Glue DynamicFrame 39 | CategoryStreamsDyf = DynamicFrame.fromDF(CategoryStreamsDF, glueContext, "CategoryStreamsDyf") 40 | 41 | # Prepare to write the dataframe to Amazon S3 42 | ############# NOTE ############# 43 | #### Change the path below to 44 | #### reference your bucket name 45 | ################################ 46 | s3output = glueContext.getSink( 47 | path="s3://dataeng-curated-zone-gse23/streaming/top_categories", 48 | connection_type="s3", 49 | updateBehavior="UPDATE_IN_DATABASE", 50 | partitionKeys=[], 51 | compression="snappy", 52 | enableUpdateCatalog=True, 53 | transformation_ctx="s3output", 54 | ) 55 | # Set the database and table name for where you want this table 56 | # to be registered in the Glue catalog 57 | s3output.setCatalogInfo( 58 | catalogDatabase="curatedzonedb", catalogTableName="category_streams" 59 | ) 60 | 
# Set the output format to Glue Parquet 61 | s3output.setFormat("glueparquet") 62 | # Write the output to S3 and update the Glue catalog 63 | s3output.writeFrame(CategoryStreamsDyf) 64 | job.commit() 65 | -------------------------------------------------------------------------------- /Chapter03/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 3 - The AWS Data Engineer's Toolkit 2 | In this chapter we reviewed a range of AWS services at a high level, including services for ingesting data from a variety of sources, services for transforming data, services for orchestrating pipelines, and services for consuming and working with data. 3 | 4 | ## Hands-on Activity 5 | In the ***hands-on activity*** section of this chapter, we configured an S3 bucket to automatically trigger a Lambda function whenever a new CSV file was written to the bucket. In the Lambda function, we used an open-source library to convert the CSV file into Parquet format, and wrote the file out to a new zone of our data lake. 
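The Lambda function in this chapter names the Glue database and table after the last two path elements before the file name in the S3 object key. The following is a minimal local sketch of that step only, with no AWS calls. The helper name `derive_names` is hypothetical and is not part of the repository code. The first key is the example given in the function's own comments.

```python
from urllib.parse import unquote_plus


def derive_names(raw_key):
    """Return (db_name, table_name) taken from the two path
    elements that precede the file name in an S3 object key."""
    # Decode the key the same way the Lambda function does
    key = unquote_plus(raw_key)
    key_list = key.split("/")
    # Third-from-last element is the database, second-from-last is the table
    return key_list[-3], key_list[-2]


# Example key taken from the comment in CSVtoParquetLambda.py
db_name, table_name = derive_names("dms/sakila/film/LOAD01.csv")
print(f"DB Name: {db_name}")
print(f"Table Name: {table_name}")
```

Note that a key with fewer than two folder levels (for example, a file uploaded to the root of the bucket) raises an `IndexError` in this sketch, so the upload path matters.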
6 | 7 | #### Creating a Lambda layer containing the AWS SDK for Pandas (awswrangler) library 8 | - AWS SDK for Pandas site: https://github.com/aws/aws-sdk-pandas 9 | - AWS Data Wrangler v2.19 ZIP file for Python 3.9: https://github.com/aws/aws-sdk-pandas/releases/download/2.19.0/awswrangler-layer-2.19.0-py3.9.zip 10 | - AWS Management Console - Lambda Layers: https://console.aws.amazon.com/lambda/home#/layers 11 | 12 | #### Creating an IAM policy and role for your Lambda function 13 | - AWS Management Console - IAM Policies: https://console.aws.amazon.com/iamv2/home?#/policies 14 | - Policy JSON for `DataEngLambdaS3CWGluePolicy`: [DataEngLambdaS3CWGluePolicy](DataEngLambdaS3CWGluePolicy.json) 15 | [Ensure you replace INITIALS in the policy statements to reflect the name of the S3 buckets you created] 16 | 17 | #### Creating a Lambda function 18 | - AWS Management Console - Lambda Functions: https://console.aws.amazon.com/lambda/home#/functions 19 | - `CSVtoParquetLambda` function code: [CSVtoParquetLambda.py](CSVtoParquetLambda.py) 20 | - **Note 1:** Make sure that on Line 26 you replace INITIALS with the unique identifier you used when creating your clean-zone bucket 21 | - **Note 2:** Make sure you don't miss the step about increasing the Lambda function timeout to 1 minute. If using a larger CSV file than the file provided here as a sample (test.csv) then consider also increasing the memory allocation. 
22 | 23 | #### Configuring our Lambda function to be triggered by an S3 upload 24 | - Sample CSV file: [test.csv](test.csv) 25 | 26 | #### Command to list the newly created Parquet files in the clean-zone bucket: 27 | ###### Ensure you replace INITIALS below to reflect the name of the bucket you previously created 28 | 29 | ``` 30 | aws s3 ls s3://dataeng-clean-zone-INITIALS/cleanzonedb/csvtoparquet/ 31 | ``` 32 | 33 | -------------------------------------------------------------------------------- /Chapter10/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 10 - Orchestrating the Data Pipeline 2 | 3 | In this chapter, we looked at a critical part of a data engineer's job: designing and 4 | orchestrating data pipelines. First, we examined some of the core concepts around data 5 | pipelines, such as the definition of a DAG (directed acyclic graph), 6 | and common pipeline triggers (scheduled and event-based pipelines). We also 7 | looked at how to handle failures and retries. 8 | 9 | We then looked at four different AWS services that can be used for creating and 10 | orchestrating data pipelines. This included AWS Data Pipeline (now in maintenance mode), 11 | AWS Glue Workflows, Amazon Managed Workflows for Apache Airflow (MWAA), and AWS Step Functions. 12 | We discussed some of the use cases for each of these services, as well as the pros and cons 13 | of each. 14 | 15 | ## Hands-on Activity 16 | In the hands-on section of this chapter, we built an event-driven pipeline. We used 17 | two AWS Lambda functions for processing files, and an Amazon SNS topic for sending out 18 | notifications about failure. Then, we put these pieces of our data pipeline together into 19 | a state machine orchestrated by AWS Step Functions.
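The first step of the pipeline described above can be exercised locally before wiring it into Step Functions. The sketch below reuses the extraction logic from `dataeng-check-file-ext.py` and adds a hypothetical `next_state` helper that mimics the state machine's choice between the CSV processing step and the failure path. The sample event is an assumption, trimmed to the two fields the function reads; real EventBridge events carry many more.

```python
import os
import urllib.parse


def check_file_ext(event):
    """Extract bucket, key, and file extension, as dataeng-check-file-ext.py does."""
    bucket = event["detail"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(event["detail"]["object"]["key"], encoding="utf-8")
    _, file_extension = os.path.splitext(key)
    return {"file_extension": file_extension, "bucket": bucket, "key": key}


def next_state(payload):
    """Hypothetical helper: pick the state that the Choice step would route to."""
    if payload["file_extension"] == ".csv":
        return "Process CSV"
    return "Pass - Invalid File Ext"


# Hypothetical, trimmed-down EventBridge event for an S3 object upload.
sample_event = {
    "detail": {
        "bucket": {"name": "dataeng-landing-zone-INITIALS"},
        "object": {"key": "uploads/test+file.csv"},
    }
}

payload = check_file_ext(sample_event)
print(payload)
print(next_state(payload))
```

Swapping the key for something like `uploads/report.pdf` routes to the failure path, which in the deployed state machine ends with an SNS notification.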
20 | 21 | #### Lambda function to determine the file extension 22 | 23 | - AWS Management Console - Lambda Functions: https://console.aws.amazon.com/lambda/home 24 | 25 | - Code for Lambda function to check file extension: [dataeng-check-file-ext.py](dataeng-check-file-ext.py) 26 | 27 | #### Lambda function to randomly generate failures 28 | 29 | - Code for Lambda function to generate random failures: [dataeng-random-failure-generator.py](dataeng-random-failure-generator.py) 30 | 31 | #### Creating an SNS topic and subscribing to an email address 32 | 33 | - AWS Management Console - SNS: https://us-east-2.console.aws.amazon.com/sns/v3/home 34 | 35 | #### Creating a new Step Functions state machine 36 | 37 | - AWS Management Console - Step Functions: https://console.aws.amazon.com/states/home 38 | 39 | - Example of Step Functions JSON for completed state machine: [ProcessFileStateMachine.json](ProcessFileStateMachine.json) 40 | [Note that the ARN references in this state machine are not valid, and would need to be updated to reflect your AWS account number] 41 | 42 | #### Configuring our S3 bucket to send events to EventBridge 43 | 44 | - AWS Management Console - Amazon S3: https://s3.console.aws.amazon.com/s3 45 | 46 | #### Create an EventBridge rule for triggering our Step Functions state machine 47 | 48 | - AWS Management Console - EventBridge: https://console.aws.amazon.com/events/home 49 | 50 | #### Testing out event-driven data orchestration pipeline 51 | 52 | - AWS Management Console - Amazon S3: https://s3.console.aws.amazon.com/s3 53 | 54 | -------------------------------------------------------------------------------- /Chapter15/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 15 - Implementing a Data Mesh Strategy 2 | 3 | In this chapter we explored the concept of a data mesh approach to organizing data responsibilities within an organization. 
We started off by 4 | examining the **four core principles** of a data mesh: 5 | 6 | - Domain-oriented, decentralized data ownership 7 | - Data as a product 8 | - Self-service data infrastructure as a platform 9 | - Federated computational governance 10 | 11 | We then looked at how a data mesh approach can solve a number of challenges that exist with traditional data lake approaches. This included 12 | understanding how a centralized data team can be a bottleneck, how traditionally product teams did not consider data analytics to be their 13 | "problem", and how there was a lack of organization-wide visibility into which datasets were available. 14 | 15 | We then reviewed the organizational and technical challenges of building a data mesh. This included a discussion of the difficulties of 16 | changing the way that an organization traditionally approached data analytics, and how a data mesh approach changed the way that a 17 | centralized data and analytics team worked. We then looked at the changes needed for line of business teams, and finally at how there 18 | were a number of technical challenges to building a data mesh. 19 | 20 | After that we examined the AWS services that can help to build a data mesh, including a look at the **Amazon DataZone** service. 21 | We also reviewed a sample architecture for building a data mesh using both AWS services and third-party services. 22 | 23 | ## Hands-On Activity 24 | In the hands-on section of this chapter we looked at how to set up Amazon DataZone, and how to publish data to the DataZone business catalog.
25 | 26 | #### Setting up AWS Identity Center 27 | 28 | - AWS Management Console - AWS IAM Identity Center: https://us-east-2.console.aws.amazon.com/singlesignon/home 29 | 30 | - Group to create: `DataZone Users` 31 | - User to create: `film-catalog-team-admin` 32 | - User to create: `marketing-team-admin` 33 | 34 | #### Enabling and configuring Amazon DataZone 35 | 36 | - AWS Management Console - Amazon DataZone Console: https://us-east-2.console.aws.amazon.com/datazone/home 37 | 38 | #### Adding business metadata 39 | 40 | - Business name for the film_category table: `Movie listings with category` 41 | - Description for the film_category table: `This table contains a complete listing of all films in our streaming movie catalog, including category / genre information for each film, as well as a list of special features.` 42 | - Description for the Category Name field: `This field contains the movie category / genre. Sample categories include Animation, Comedy, Sports, Children, Drama, Action, and more.` 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | -------------------------------------------------------------------------------- /Chapter10/ProcessFileStateMachine.json: -------------------------------------------------------------------------------- 1 | { 2 | "Comment": "A description of my state machine", 3 | "StartAt": "Check file extension", 4 | "States": { 5 | "Check file extension": { 6 | "Type": "Task", 7 | "Resource": "arn:aws:states:::lambda:invoke", 8 | "OutputPath": "$.Payload", 9 | "Parameters": { 10 | "Payload.$": "$", 11 | "FunctionName": "arn:aws:lambda:us-east-2:123456789:function:dataeng-check-file-ext:$LATEST" 12 | }, 13 | "Retry": [ 14 | { 15 | "ErrorEquals": [ 16 | "Lambda.ServiceException", 17 | "Lambda.AWSLambdaException", 18 | "Lambda.SdkClientException", 19 | "Lambda.TooManyRequestsException" 20 | ], 21 | "IntervalSeconds": 2, 22 | "MaxAttempts": 6, 23 | "BackoffRate": 2 24 | } 25 | ], 26 | "Next": "Choice" 27 | }, 28 | "Choice": { 29 
| "Type": "Choice", 30 | "Choices": [ 31 | { 32 | "Variable": "$.file_extension", 33 | "StringMatches": ".csv", 34 | "Next": "Process CSV" 35 | } 36 | ], 37 | "Default": "Pass - Invalid File Ext" 38 | }, 39 | "Process CSV": { 40 | "Type": "Task", 41 | "Resource": "arn:aws:states:::lambda:invoke", 42 | "OutputPath": "$.Payload", 43 | "Parameters": { 44 | "Payload.$": "$", 45 | "FunctionName": "arn:aws:lambda:us-east-2:123456789:function:dataeng-random-failure-generator:$LATEST" 46 | }, 47 | "Retry": [ 48 | { 49 | "ErrorEquals": [ 50 | "Lambda.ServiceException", 51 | "Lambda.AWSLambdaException", 52 | "Lambda.SdkClientException", 53 | "Lambda.TooManyRequestsException" 54 | ], 55 | "IntervalSeconds": 2, 56 | "MaxAttempts": 6, 57 | "BackoffRate": 2 58 | } 59 | ], 60 | "Catch": [ 61 | { 62 | "ErrorEquals": [ 63 | "States.ALL" 64 | ], 65 | "Next": "SNS Publish", 66 | "ResultPath": "$.Payload" 67 | } 68 | ], 69 | "Next": "Success" 70 | }, 71 | "Success": { 72 | "Type": "Succeed" 73 | }, 74 | "Pass - Invalid File Ext": { 75 | "Type": "Pass", 76 | "Result": { 77 | "Error": "InvalidFileFormat" 78 | }, 79 | "ResultPath": "$.Payload", 80 | "Next": "SNS Publish" 81 | }, 82 | "SNS Publish": { 83 | "Type": "Task", 84 | "Resource": "arn:aws:states:::sns:publish", 85 | "Parameters": { 86 | "Message.$": "$", 87 | "TopicArn": "arn:aws:sns:us-east-2:123456789:dataeng-failure-notification" 88 | }, 89 | "Next": "Fail" 90 | }, 91 | "Fail": { 92 | "Type": "Fail" 93 | } 94 | } 95 | } 96 | -------------------------------------------------------------------------------- /Chapter05/Data-Engineering-Whiteboard-Completed-Notes.drawio: -------------------------------------------------------------------------------- 1 |
[Compressed draw.io diagram data omitted] -------------------------------------------------------------------------------- /Chapter09/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 9 - A deeper-dive into Data Marts and Amazon Redshift 2 | 3 | In this chapter, we learned how a cloud data warehouse can be used to 4 | optimize performance for hot data. We reviewed some common "anti-patterns" 5 | for data warehouse usage before diving deep into Redshift architecture to learn more 6 | about how Redshift optimizes data storage across nodes. 7 | 8 | We then reviewed some of the important design decisions that need to be made when 9 | creating a Redshift cluster in order to optimize performance while balancing costs, and also 10 | reviewed some of the advanced Redshift features. 11 | 12 | ## Hands-on Activity 13 | In the hands-on section of this chapter we created a new Redshift Serverless cluster, 14 | configured Redshift Spectrum to query data from Amazon S3, and then loaded a 15 | subset of data from S3 into Redshift. 16 | 17 | ### Uploading our sample data to Amazon S3 18 | In this exercise, we use some fake data that was created using a tool called [Mockaroo](https://www.mockaroo.com/). 19 | 20 | Download our fake user data with the following link: [user_details.csv](user_details.csv). When you open the link, click on the three dots, and then click on `Download`.
21 | 22 | Open the Amazon S3 console with the following link: https://s3.console.aws.amazon.com/s3 23 | 24 | #### IAM Roles for Redshift 25 | 26 | - AWS Management Console - IAM Roles: https://console.aws.amazon.com/iamv2/home?#/roles 27 | 28 | #### Creating a Redshift cluster 29 | 30 | - AWS Management Console - Redshift: https://console.aws.amazon.com/redshiftv2/ 31 | 32 | #### Using Redshift Spectrum to directly query data in the data lake 33 | 34 | - The following query can be run in the Redshift Query Editor to create an external schema. Make sure to specify the ARN for the new role you created in place of the ***iam_role*** listed below. 35 | 36 | ``` 37 | create external schema spectrum_schema 38 | from data catalog 39 | database 'users' 40 | iam_role 'arn:aws:iam::1234567890:role/AmazonRedshiftSpectrumRole' 41 | create external database if not exists; 42 | ``` 43 | 44 | - The following query can be run to create a new external table. Make sure to replace ***INITIALS*** in the query below with the correct identifier for your Landing Zone bucket. 
45 | 46 | ``` 47 | CREATE EXTERNAL TABLE spectrum_schema.user_details( 48 | id INTEGER, 49 | first_name VARCHAR(40), 50 | last_name VARCHAR(40), 51 | email VARCHAR(60), 52 | gender VARCHAR(15), 53 | address_1 VARCHAR(80), 54 | address_2 VARCHAR(80), 55 | city VARCHAR(40), 56 | state VARCHAR(25), 57 | zip VARCHAR(5), 58 | phone VARCHAR(12) 59 | ) 60 | row format delimited 61 | fields terminated by ',' 62 | stored as textfile 63 | location 's3://dataeng-landing-zone-initials/users/' 64 | table properties ('skip.header.line.count'='1'); 65 | ``` 66 | 67 | **Query 1:** 68 | ``` 69 | select * from spectrum_schema.user_details limit 10; 70 | ``` 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | -------------------------------------------------------------------------------- /Chapter11/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 11 - Ad-hoc queries with Amazon Athena 2 | 3 | Amazon Athena is a serverless, fully managed service that lets you use SQL to directly query data in the data lake, as well as query various other databases. It requires no setup, and by default the cost is based on the amount of data that is scanned to complete the query, or based on the amount of provisioned capacity that you specify. 4 | 5 | In this chapter, we did a deep dive into Athena, examining how Athena can be used to query data directly in the data lake, looking at advanced Athena functionality (such as the ability to query data from other data sources with Query Federation), and how Athena provides workgroup functionality to help with governance and cost management. 
6 | 7 | ## Links 8 | [Considerations and Limitations for CTAS Queries](https://docs.aws.amazon.com/athena/latest/ug/considerations-ctas.html) 9 | 10 | [Airbnb blog post about the problem of small files](https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908) 11 | 12 | [Partitioning and bucketing in Athena](https://docs.aws.amazon.com/athena/latest/ug/ctas-partitioning-and-bucketing.html) 13 | 14 | [Partition Projection with Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html) 15 | 16 | [Using EXPLAIN and EXPLAIN ANALYZE in Athena](https://docs.aws.amazon.com/athena/latest/ug/athena-explain-statement.html) 17 | 18 | [Functions in Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/functions.html) 19 | 20 | [Performance Tuning in Athena](https://docs.aws.amazon.com/athena/latest/ug/performance-tuning.html) 21 | 22 | [Using Amazon Athena Federated Query](https://docs.aws.amazon.com/athena/latest/ug/connect-to-a-data-source.html) 23 | 24 | [Athena Query Federation SDK](https://docs.aws.amazon.com/athena/latest/ug/connect-data-source-federation-sdk.html) 25 | 26 | [Explore your data lake using Amazon Athena for Apache Spark](https://aws.amazon.com/blogs/big-data/explore-your-data-lake-using-amazon-athena-for-apache-spark/) 27 | 28 | [Using Athena ACID transactions](https://docs.aws.amazon.com/athena/latest/ug/acid-transactions.html) 29 | 30 | [Tag-Based IAM Access Control Policies](https://docs.aws.amazon.com/athena/latest/ug/tags-access-control.html) 31 | 32 | ## Hands-on Activity 33 | 34 | In the hands-on activity section of this chapter, we created and configured a new Athena Workgroup and learned more about how Workgroups can help separate groups of users.
35 | 36 | #### Creating an Amazon Athena workgroup and configuring Athena settings 37 | 38 | - AWS Management Console - Amazon Athena: https://console.aws.amazon.com/athena 39 | 40 | #### Switching Workgroups and running queries 41 | 42 | - AWS Documentation on IAM Policies for Accessing Workgroups: https://docs.aws.amazon.com/athena/latest/ug/workgroups-iam-policy.html 43 | 44 | - Query to determine most popular category of films (Step 3) 45 | ``` 46 | SELECT category_name, count(category_name) streams 47 | FROM streaming_films 48 | GROUP BY category_name 49 | ORDER BY streams DESC 50 | ``` 51 | 52 | - Query to determine which State streamed the most films (Step 7) 53 | ``` 54 | SELECT state, count(state) count 55 | FROM streaming_films 56 | GROUP BY state 57 | ORDER BY count desc 58 | ``` 59 | 60 | 61 | 62 | 63 | 64 | 65 | -------------------------------------------------------------------------------- /Chapter04/AthenaAccessCleanZoneDB: -------------------------------------------------------------------------------- 1 | { 2 | "Version": "2012-10-17", 3 | "Statement": [ 4 | { 5 | "Effect": "Allow", 6 | "Action": [ 7 | "athena:*" 8 | ], 9 | "Resource": [ 10 | "*" 11 | ] 12 | }, 13 | { 14 | "Effect": "Allow", 15 | "Action": [ 16 | "glue:CreateDatabase", 17 | "glue:DeleteDatabase", 18 | "glue:GetDatabase", 19 | "glue:GetDatabases", 20 | "glue:UpdateDatabase", 21 | "glue:CreateTable", 22 | "glue:DeleteTable", 23 | "glue:BatchDeleteTable", 24 | "glue:UpdateTable", 25 | "glue:GetTable", 26 | "glue:GetTables", 27 | "glue:BatchCreatePartition", 28 | "glue:CreatePartition", 29 | "glue:DeletePartition", 30 | "glue:BatchDeletePartition", 31 | "glue:UpdatePartition", 32 | "glue:GetPartition", 33 | "glue:GetPartitions", 34 | "glue:BatchGetPartition" 35 | ], 36 | "Resource": [ 37 | "arn:aws:glue:*:*:catalog", 38 | "arn:aws:glue:*:*:database/cleanzonedb", 39 | "arn:aws:glue:*:*:database/cleanzonedb*", 40 | "arn:aws:glue:*:*:table/cleanzonedb/*" 41 | ] 42 | }, 43 | { 44 | 
"Effect": "Allow", 45 | "Action": [ 46 | "s3:GetBucketLocation", 47 | "s3:GetObject", 48 | "s3:ListBucket", 49 | "s3:ListBucketMultipartUploads", 50 | "s3:ListMultipartUploadParts", 51 | "s3:AbortMultipartUpload", 52 | "s3:PutObject" 53 | ], 54 | "Resource": [ 55 | "arn:aws:s3:::dataeng-clean-zone-gse23/*" 56 | ] 57 | }, 58 | { 59 | "Effect": "Allow", 60 | "Action": [ 61 | "s3:GetBucketLocation", 62 | "s3:GetObject", 63 | "s3:ListBucket", 64 | "s3:ListBucketMultipartUploads", 65 | "s3:ListMultipartUploadParts", 66 | "s3:AbortMultipartUpload", 67 | "s3:CreateBucket", 68 | "s3:PutObject", 69 | "s3:PutBucketPublicAccessBlock" 70 | ], 71 | "Resource": [ 72 | "arn:aws:s3:::aws-athena-query-results-*" 73 | ] 74 | }, 75 | { 76 | "Effect": "Allow", 77 | "Action": [ 78 | "s3:GetObject", 79 | "s3:ListBucket" 80 | ], 81 | "Resource": [ 82 | "arn:aws:s3:::athena-examples*" 83 | ] 84 | }, 85 | { 86 | "Effect": "Allow", 87 | "Action": [ 88 | "s3:ListBucket", 89 | "s3:GetBucketLocation", 90 | "s3:ListAllMyBuckets" 91 | ], 92 | "Resource": [ 93 | "*" 94 | ] 95 | }, 96 | { 97 | "Effect": "Allow", 98 | "Action": [ 99 | "sns:ListTopics", 100 | "sns:GetTopicAttributes" 101 | ], 102 | "Resource": [ 103 | "*" 104 | ] 105 | }, 106 | { 107 | "Effect": "Allow", 108 | "Action": [ 109 | "cloudwatch:PutMetricAlarm", 110 | "cloudwatch:DescribeAlarms", 111 | "cloudwatch:DeleteAlarms", 112 | "cloudwatch:GetMetricData" 113 | ], 114 | "Resource": [ 115 | "*" 116 | ] 117 | }, 118 | { 119 | "Effect": "Allow", 120 | "Action": [ 121 | "lakeformation:GetDataAccess" 122 | ], 123 | "Resource": [ 124 | "*" 125 | ] 126 | }, 127 | { 128 | "Effect": "Allow", 129 | "Action": [ 130 | "pricing:GetProducts" 131 | ], 132 | "Resource": [ 133 | "*" 134 | ] 135 | } 136 | ] 137 | } 138 | -------------------------------------------------------------------------------- /Chapter13/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 13 - Enabling Artificial 
Intelligence and Machine Learning 2 | 3 | In this chapter, you learned more about the broad range of AWS ML and AI services, 4 | and had the opportunity to get hands-on with Amazon Comprehend, an AI service for 5 | extracting insights from written text. 6 | 7 | We discussed how ML and AI services can apply to a broad range of use cases, both 8 | specialized (such as detecting cancer early) and general (business forecasting or 9 | personalization). 10 | 11 | We examined different AWS services related to ML and AI. We looked at how different 12 | Amazon SageMaker capabilities can be used to prepare data for ML, build models, train 13 | and fine-tune models, and deploy and manage models. SageMaker makes building custom 14 | ML models much more accessible to developers without existing expertise in ML. 15 | 16 | We then looked at a range of AWS AI services that provide prebuilt and trained models for 17 | common use cases. We looked at services for transcribing text from audio files (Amazon 18 | Transcribe), for extracting text from forms and handwritten documents (Amazon 19 | Textract), for recognizing images (Amazon Rekognition), and for extracting insights from 20 | text (Amazon Comprehend). We also briefly discussed other business-focused AI services, 21 | such as Amazon Forecast and Amazon Personalize. 22 | 23 | Finally, we did a quick overview of what Generative AI, Foundation Models, and Large Language 24 | Models are, and examined how Generative AI solutions can be built on AWS. 25 | 26 | ## Useful links 27 | The links below cover some of the services that were introduced in this chapter.
28 | 29 | ### SageMaker Components 30 | - [Amazon SageMaker](https://aws.amazon.com/sagemaker/) 31 | - [Amazon SageMaker Ground Truth](https://aws.amazon.com/sagemaker/data-labeling/) 32 | - [Amazon SageMaker Data Wrangler](https://aws.amazon.com/sagemaker/data-wrangler) 33 | - [Amazon SageMaker Clarify](https://aws.amazon.com/sagemaker/clarify) 34 | - [Amazon SageMaker Notebooks](https://aws.amazon.com/sagemaker/notebooks/) 35 | - [Amazon SageMaker Autopilot](https://aws.amazon.com/sagemaker/autopilot) 36 | - [Amazon SageMaker JumpStart](https://aws.amazon.com/sagemaker/jumpstart) 37 | - [Amazon SageMaker Experiments](https://aws.amazon.com/sagemaker/experiments/) 38 | - [Amazon SageMaker Model Monitor](https://aws.amazon.com/sagemaker/model-monitor/) 39 | 40 | ### AWS Services for AI 41 | - [Amazon Transcribe](https://aws.amazon.com/transcribe/) 42 | - [Amazon Textract](https://aws.amazon.com/textract/) 43 | - [Amazon Comprehend](https://aws.amazon.com/comprehend/) 44 | - [Amazon Rekognition](https://aws.amazon.com/rekognition/) 45 | - [Amazon Forecast](https://aws.amazon.com/forecast/) 46 | - [Amazon Fraud Detector](https://aws.amazon.com/fraud-detector/) 47 | - [Amazon Personalize](https://aws.amazon.com/personalize/) 48 | 49 | ### AWS Services for Generative AI 50 | - [Blog post - Get started with generative AI on AWS using Amazon SageMaker JumpStart](https://aws.amazon.com/blogs/machine-learning/get-started-with-generative-ai-on-aws-using-amazon-sagemaker-jumpstart/) 51 | - [Amazon Bedrock](https://aws.amazon.com/bedrock/) 52 | - [Amazon Titan](https://aws.amazon.com/bedrock/titan/) 53 | 54 | ## Hands-on Activity 55 | In the hands-on activity section of this chapter we looked at how we can use the 56 | [Amazon Comprehend](https://aws.amazon.com/comprehend/) service to gain insight into the sentiment 57 | of reviews posted to a website.
We configured an SQS queue to receive the details 58 | of newly posted reviews, and had a Lambda function configured to pass the review text to Amazon 59 | Comprehend to gain insight into the review sentiment (positive, negative, or neutral). 60 | 61 | #### Setting up a new Amazon SQS message queue 62 | 63 | - AWS Management Console - SQS: https://console.aws.amazon.com/sqs/v2/. 64 | 65 | #### Creating a Lambda function for calling Amazon Comprehend 66 | 67 | - AWS Management Console - Lambda: https://console.aws.amazon.com/lambda/ 68 | 69 | - Lambda function code for calling Amazon Comprehend for sentiment analysis: [website-reviews-analysis-role.py](website-reviews-analysis-role.py) 70 | 71 | **[ Make sure to set region on Line 4 to the correct region that you are using for the hands-on activities in this book ]** 72 | 73 | #### Testing the solution with Amazon Comprehend 74 | 75 | - Example of a positive review 76 | 77 | ``` 78 | I recently stayed at the Kensington Hotel in downtown Cape Town and was very impressed. 79 | The hotel is beautiful, the service from the staff is amazing, and the sea views cannot be beaten. 80 | If you have the time, stop by Elizabeth's Kitchen, a coffee shop not far from the hotel, 81 | to get a coffee and try some of their delicious cakes and baked goods. 82 | ``` 83 | 84 | 85 | 86 | 87 | 88 | 89 | -------------------------------------------------------------------------------- /Chapter06/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 6 - Ingesting Batch and Streaming Data 2 | 3 | In this chapter, we discussed the 5 V's of data (volume, velocity, variety, validity, and value). 4 | We then reviewed a few different approaches for ingesting data from databases, and from streaming data sources.
5 | 6 | ## Hands-on Activity 7 | In the ***hands-on activity*** section of this chapter, we deployed an **AWS CloudFormation** template that provisioned an **Amazon RDS MySQL** database instance, as well as an **Amazon EC2** instance, which was used to load a demo database into the MySQL database instance. We configured **AWS Database Migration Service (DMS)** to ingest data from the MySQL database, and then we configured **Amazon Kinesis Data Firehose** to ingest streaming data that we generated using the **Amazon Kinesis Data Generator (KDG)**. 8 | 9 | ### Deploying MySQL and an EC2 data loader via AWS CloudFormation 10 | - Download the CloudFormation template from [here](./mysql-ec2loader.cfn) 11 | 12 | - AWS Management Console - CloudFormation: https://us-east-2.console.aws.amazon.com/cloudformation 13 | 14 | - CloudFormation Stack Name: `dataeng-aws-chapter6-mysql-ec2` 15 | 16 | #### Creating an IAM policy and role for DMS 17 | 18 | - AWS Management Console - IAM Policies: https://console.aws.amazon.com/iamv2/home?#/policies 19 | 20 | - AWS IAM policy for `DataEngDMSLandingS3BucketPolicy`. **Change INITIALS in the policy below to match the name of the landing zone bucket that you previously created**.
21 | ``` 22 | { 23 | "Version": "2012-10-17", 24 | "Statement": [ 25 | { 26 | "Effect": "Allow", 27 | "Action": [ 28 | "s3:*" 29 | ], 30 | "Resource": [ 31 | "arn:aws:s3:::dataeng-landing-zone-INITIALS", 32 | "arn:aws:s3:::dataeng-landing-zone-INITIALS/*" 33 | ] 34 | } 35 | ] 36 | } 37 | ``` 38 | - IAM Policy Name: `DataEngDMSLandingS3BucketPolicy` 39 | - IAM Role Name: `DataEngDMSLandingS3BucketRole` 40 | 41 | #### Configuring DMS settings and performing a full load from MySQL to S3 42 | - Replication instance name: `mysql-s3-replication` 43 | - Target endpoint identifier: `s3-landing-zone-sakilia-csv` 44 | - Migration task identifier: `dataeng-mysql-s3-sakila-task` 45 | - Schema Source Name: `%sakila%` 46 | 47 | #### Querying data with Amazon Athena 48 | - Athena S3 Query Bucket Name: `athena-query-results-INITIALS`. **Change INITIALS to the unique identifier you have been using for bucket names**. 49 | 50 | - Athena query to validate that data has been successfully ingested using DMS 51 | `select * from film limit 20;` 52 | 53 | - Further reading: [Using Amazon S3 as a target for AWS Database Migration Service](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html) 54 | 55 | - Further reading: [Using a MySQL-compatible database as a source for AWS DMS](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html) 56 | 57 | ### Ingesting Streaming Data 58 | 59 | #### Configuring Kinesis Data Firehose for streaming delivery to Amazon S3 60 | - AWS Management Console - Kinesis Firehose: https://us-east-2.console.aws.amazon.com/firehose/home 61 | - Kinesis Firehose Delivery Stream Name: `dataeng-firehose-streaming-s3` 62 | - *S3 Bucket Prefix* for Kinesis Data Firehose: `streaming/!{timestamp:yyyy/MM/}` 63 | - *S3 Bucket Error Output Prefix* for Kinesis Data Firehose: `!{firehose:error-output-type}/!{timestamp:yyyy/MM/}` 64 | 65 | #### Configuring Amazon Kinesis Data Generator (KDG) 66 | - Kinesis Data Generator Help Page:
https://awslabs.github.io/amazon-kinesis-data-generator/web/help.html 67 | - Documentation for mapping region names to region IDs: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.RegionsAndAvailabilityZones.html 68 | - Record template for Kinesis Data Generator 69 | ``` 70 | { 71 | "timestamp":"{{date.now}}", 72 | "eventType":"{{random.weightedArrayElement( 73 | { 74 | "weights": [0.3,0.1,0.6], 75 | "data": ["rent","buy","trailer"] 76 | } 77 | )}}", 78 | "film_id":{{random.number( 79 | { 80 | "min":1, 81 | "max":1000 82 | } 83 | )}}, 84 | "distributor":"{{random.arrayElement( 85 | ["amazon prime", "google play", "apple itunes","vudo", "fandango now", "microsoft", "youtube"] 86 | )}}", 87 | "platform":"{{random.arrayElement( 88 | ["ios", "android", "xbox", "playstation", "smarttv", "other"] 89 | )}}", 90 | "state":"{{address.state}}" 91 | } 92 | ``` 93 | 94 | #### Querying data with Amazon Athena 95 | - AWS Management Console - Glue: https://us-east-2.console.aws.amazon.com/glue/home 96 | - Crawler name: `dataeng-streaming-crawler` 97 | - Crawler S3 data path: `s3://dataeng-landing-zone-INITIALS/streaming/` (**Change INITIALS to the unique identifier used for your landing zone bucket**) 98 | - Athena query to validate that data has been successfully ingested using Kinesis Data Firehose: 99 | `select * from streaming limit 20;` 100 | 101 | 102 | 103 | -------------------------------------------------------------------------------- /Chapter16/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 16 - Building a Modern Data Platform on AWS 2 | 3 | In this chapter we reviewed high-level concepts around building a modern data platform on AWS. 4 | We covered some of the puzzle pieces that need to be in place in order to build a data platform that is flexible and agile, scalable, 5 | well governed, and secure, while also being easy to use and enabling data producers and consumers to self-serve.
6 | 7 | We examined the pros and cons of building or buying a data platform, and then looked at how a DataOps approach to the development 8 | of a platform helps enable automation and observability. We examined how tools such as CloudFormation can be used to automate and manage 9 | the process of deploying data infrastructure, and how we can use Source Control Systems to manage our code for data transformation (such 10 | as PySpark jobs that we run in AWS Glue). 11 | 12 | We then examined a number of tools that enable a DataOps approach, including CloudFormation, CodeCommit, CodeBuild and CodePipeline. 13 | 14 | ## Hands-on Activity 15 | In the hands-on activity for this chapter, we set up a CodeCommit repository, and then added Glue ETL code and a CloudFormation template 16 | to the repository using Git. We then built two pipelines using CodePipeline - one to write out Glue ETL code to S3 whenever the file is updated in 17 | CodeCommit, and a second one to deploy our CloudFormation template, whenever the template is updated. 18 | 19 | #### Setting up a Cloud9 IDE environment 20 | 21 | - AWS Management Console - Cloud9: https://us-east-2.console.aws.amazon.com/cloud9control/home 22 | 23 | - Commands to configure your Git user name and email address. Change the example below to reflect your name and email address.
24 | ``` 25 | git config --global user.name "Gareth Eagar" 26 | git config --global user.email gareth.eagar@example.com 27 | ``` 28 | 29 | - Configuring the AWS CLI credential helper to automate provisioning of credentials needed to authenticate to CodeCommit 30 | ``` 31 | git config --global credential.helper '!aws codecommit credential-helper $@' 32 | git config --global credential.UseHttpPath true 33 | ``` 34 | 35 | - Creating a directory for your Git repository 36 | ``` 37 | cd /home/ec2-user/environment 38 | mkdir git 39 | cd git 40 | ``` 41 | 42 | #### Setting up our AWS CodeCommit repository 43 | 44 | - AWS Management Console - CodeCommit: https://us-east-2.console.aws.amazon.com/codesuite/codecommit/repositories 45 | 46 | - Use the HTTPS link under Clone URL to copy the link to clone your repository. Then run `git clone `. For example: 47 | ``` 48 | git clone https://git-codecommit.us-east-2.amazonaws.com/v1/repos/data-product-film 49 | ``` 50 | 51 | - Create new directories in the repository, using the Cloud9 terminal window 52 | ``` 53 | cd data-product-film 54 | mkdir glueETL_code 55 | mkdir cfn_templates 56 | ``` 57 | 58 | - Create a new Amazon S3 bucket to store the resources for our film data product. 59 | ``` 60 | aws s3 mb s3://data-product-film-initials 61 | ``` 62 | 63 | #### Adding a Glue ETL script and CloudFormation template into our repository 64 | 65 | - Access the [Glue-streaming_views_by_category.py](/Chapter16/Glue-streaming_views_by_category.py) file in this repository, and paste it into a new file in your Cloud9 IDE environment. 66 | 67 | **Make sure to change the S3 bucket path on Line 47 of the code to match the name of your curated zone bucket** 68 | 69 | - Access the [CFN-glue_job-streams_by_category.cfn](/Chapter16/CFN-glue_job-streams_by_category.cfn) file in this repository, and paste it into a new file in your Cloud9 IDE environment. 
70 | 71 | **Make sure to change the S3 bucket path on Line 22 of the template to match the name of the bucket you created earlier in this exercise** 72 | 73 | - Commit the newly created files into your repository by running the following commands in the Cloud9 terminal 74 | ``` 75 | git add . 76 | git commit -m "Initial commit of CloudFormation template and Glue code for our streaming views by category data product" 77 | git push 78 | ``` 79 | 80 | - Access the [CodeCommit](https://us-east-1.console.aws.amazon.com/codesuite/codecommit/repositories) service in the AWS Management Console, and ensure that the two files have been written to your repository 81 | 82 | #### Automating deployment of our Glue code 83 | 84 | - AWS Management Console - CodePipeline: https://us-east-1.console.aws.amazon.com/codesuite/codepipeline/pipelines 85 | 86 | #### Automating deployment of our Glue job 87 | 88 | - AWS Management Console - IAM: https://us-east-1.console.aws.amazon.com/iamv2/home 89 | 90 | - Edit the `DataEngGlueCWS3CuratedZoneRole`, and replace the current **Trust relationship** JSON with the JSON in [DataEngGlueCWS3CuratedZoneRole-trust-policy.json](/Chapter16/DataEngGlueCWS3CuratedZoneRole-trust-policy.json) in this repository 91 | 92 | - In the permissions tab for the role, edit the DataEngGlueCWS3CuratedZoneWrite role to add in S3 permissions for the new bucket you created earlier. 
The new S3 permissions should look as follows (but reflect the names of your buckets): 93 | ``` 94 | { 95 | "Effect": "Allow", 96 | "Action": [ 97 | "s3:GetObject" 98 | ], 99 | "Resource": [ 100 | "arn:aws:s3:::dataeng-landing-zone-gse23/*", 101 | "arn:aws:s3:::dataeng-clean-zone-gse23/*", 102 | "arn:aws:s3:::data-product-film-gse23/*" 103 | ] 104 | } 105 | ``` 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | -------------------------------------------------------------------------------- /Chapter14/README.md: -------------------------------------------------------------------------------- 1 | # Chapter 14 - Building Transactional Data Lakes 2 | 3 | In this chapter we took a deeper dive into three common transactional table formats (also sometimes referred to as open table formats), namely Delta 4 | Lake, Apache Hudi, and Apache Iceberg. 5 | 6 | We looked into the limitations of the traditional Hive format, and discussed how these new transactional formats provide a number of benefits for 7 | building modern data lakes that are able to behave more like a traditional data warehouse. We discussed benefits such as being able to use 8 | ACID transactions, perform record-level updates, run time travel queries, handle schema evolution, and more. 9 | 10 | We then 'looked under the covers' at how these open table formats work by tracking metadata at the table level, and at some of the different 11 | approaches for performing updates - Copy-on-Write (COW) which provides better read performance, and Merge-on-Read (MOR) which provides better 12 | write performance. 13 | 14 | Each of the three popular table formats we discussed in this section has its own pros and cons, and while there was not space to go super 15 | deep into how each of them works, we did do a bit of a deeper dive into Delta Lake, Apache Hudi and Apache Iceberg.
Finally, before getting hands-on, 16 | we reviewed the support for each of the table formats across different AWS services (such as Glue, EMR, Redshift, and Athena). 17 | 18 | ## Hands-on Activity 19 | In the hands-on activity section of this chapter, we used Amazon Athena to create an Apache Iceberg formatted table, added data to the table, 20 | and then looked at how the underlying metadata was changed as we deleted data and performed table maintenance tasks. We also used the time travel 21 | feature of Iceberg to see how we could query data as it was at a previous point in time. 22 | 23 | #### Creating an Apache Iceberg table using Amazon Athena 24 | 25 | - AWS Management Console - Athena: https://console.aws.amazon.com/athena/home 26 | 27 | - Use the following statement to create a new Athena database 28 | ``` 29 | create database curatedzonedb_iceberg; 30 | ``` 31 | 32 | - Use the following statement to create a new table in Apache Iceberg format. **Make sure to change the S3 bucket location below to 33 | match the location of your curated-zone bucket**.
34 | ``` 35 | CREATE TABLE curatedzonedb_iceberg.streaming_films_ib( 36 | timestamp string, 37 | eventtype string, 38 | film_id_streaming int, 39 | distributor string, 40 | platform string, 41 | state string, 42 | ingest_year string, 43 | ingest_month string, 44 | category_id bigint, 45 | category_name string, 46 | film_id bigint, 47 | title string, 48 | description string, 49 | release_year bigint, 50 | language_id bigint, 51 | original_language_id double, 52 | length bigint, 53 | rating string, 54 | special_features string 55 | ) 56 | PARTITIONED BY (category_name) 57 | LOCATION 's3://dataeng-curated-zone-gse23/iceberg/streaming_films/' 58 | TBLPROPERTIES ('table_type' = 'ICEBERG', 'format' = 'parquet') 59 | ``` 60 | 61 | - Access the Amazon S3 console to review metadata files: https://s3.console.aws.amazon.com/s3 62 | 63 | #### Adding data to our Iceberg table and running queries 64 | 65 | - Run the following statement to insert data from the previously created streaming_films table (created in Chapter 7), into our new 66 | Apache Iceberg formatted table created in the previous step. 67 | ``` 68 | insert into curatedzonedb_iceberg.streaming_films_ib 69 | select * 70 | from curatedzonedb.streaming_films 71 | ``` 72 | 73 | - Open the AWS Glue console to view table properties: https://console.aws.amazon.com/glue 74 | 75 | - Run the following statement to query the newly created table 76 | ``` 77 | select * from curatedzonedb_iceberg.streaming_films_ib limit 50; 78 | ``` 79 | 80 | - Run the following three statements (one at a time) to view table metadata. 
81 | ``` 82 | select * from "curatedzonedb_iceberg"."streaming_films_ib$manifests" 83 | ``` 84 | 85 | ``` 86 | select * from "curatedzonedb_iceberg"."streaming_films_ib$files" 87 | ``` 88 | 89 | ``` 90 | select * from "curatedzonedb_iceberg"."streaming_films_ib$partitions" 91 | ``` 92 | 93 | #### Modifying data in our Iceberg table and running queries 94 | 95 | - Delete data from the table for all movies in the *Documentary* category, using the following statement 96 | ``` 97 | delete from curatedzonedb_iceberg.streaming_films_ib where category_name='Documentary' 98 | ``` 99 | 100 | - Use the following statement to list out the current partitions, and note how there are only 15 partitions now 101 | ``` 102 | select * from "curatedzonedb_iceberg"."streaming_films_ib$partitions" 103 | ``` 104 | 105 | - Use the following statement to list out details about the current manifest: 106 | ``` 107 | select * from "curatedzonedb_iceberg"."streaming_films_ib$manifests" 108 | ``` 109 | - Use the following statement to query the snapshot history: 110 | 111 | ``` 112 | select * from "curatedzonedb_iceberg"."streaming_films_ib$history" 113 | ``` 114 | 115 | - Use the following statement to do a *time travel* query, where you query the table as it was at a point in the past. **Ensure you change the 116 | TIMESTAMP in the statement below to be a UTC time between when your first and second snapshots were created**.
117 | ``` 118 | SELECT * FROM "curatedzonedb_iceberg"."streaming_films_ib" FOR TIMESTAMP AS OF TIMESTAMP '2023-07-23 19:00:00 UTC' where category_name = 'Documentary' 119 | ``` 120 | 121 | #### Iceberg table maintenance tasks 122 | 123 | - Use the following statement to display the list of files that make up the current snapshot: 124 | ``` 125 | select * from "curatedzonedb_iceberg"."streaming_films_ib$files" 126 | ``` 127 | 128 | - Execute the OPTIMIZE command by running the following statement: 129 | ``` 130 | OPTIMIZE curatedzonedb_iceberg.streaming_films_ib REWRITE DATA USING BIN_PACK 131 | ``` 132 | 133 | - Use the following statement to display the list of files that make up the current, optimized, snapshot: 134 | ``` 135 | select * from "curatedzonedb_iceberg"."streaming_films_ib$files" 136 | ``` 137 | 138 | - Use the following statement to list the table's snapshot history: 139 | ``` 140 | select * from "curatedzonedb_iceberg"."streaming_films_ib$history" 141 | ``` 142 | 143 | - Change the table property for `vacuum_max_snapshot_age_seconds` to be just 60 seconds 144 | 145 | ``` 146 | ALTER TABLE curatedzonedb_iceberg.streaming_films_ib SET TBLPROPERTIES ( 147 | 'vacuum_max_snapshot_age_seconds'='60' 148 | ) 149 | ``` 150 | 151 | - View the properties of our Iceberg table to ensure that the setting was applied: 152 | 153 | ``` 154 | SHOW TBLPROPERTIES curatedzonedb_iceberg.streaming_films_ib 155 | ``` 156 | 157 | - Run the following statement to VACUUM the Iceberg table 158 | 159 | ``` 160 | VACUUM curatedzonedb_iceberg.streaming_films_ib 161 | ``` 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | -------------------------------------------------------------------------------- /Chapter05/Data-Engineering-Completed-Whiteboard.drawio: -------------------------------------------------------------------------------- 1 |  -------------------------------------------------------------------------------- /README.md:
-------------------------------------------------------------------------------- 1 | # Data Engineering with AWS, Second Edition 2 | This is the code repository for [Data Engineering with AWS, Second Edition](https://www.packtpub.com/product/data-engineering-with-aws-second-edition/9781804614426), published by Packt. 3 | **Acquire the skills to design and build AWS-based data transformation pipelines like a pro** 4 | 5 | The author of this book is [Gareth Eagar](https://www.linkedin.com/in/garetheagar/) 6 | ## About the book 7 | 8 | This book, authored by a seasoned Senior Data Architect with 25 years of experience, aims to help you achieve proficiency in using the AWS ecosystem for data engineering. This revised edition provides updates in every chapter to cover the latest AWS services and features, takes a refreshed look at data governance, and includes a brand-new section on building modern data platforms that covers implementing a data mesh approach, open-table formats (such as Apache Iceberg), and using DataOps for automation and observability. 9 | 10 | You'll begin by reviewing the key concepts and essential AWS tools in a data engineer's toolkit and by getting acquainted with modern data management approaches. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how that transformed data is used by various data consumers. You’ll learn how to ensure strong data governance, and about populating data marts and data warehouses along with how a data lakehouse fits into the picture. After that, you'll be introduced to AWS tools for analyzing data, including those for ad-hoc SQL queries and creating visualizations. Then, you'll explore how the power of machine learning and artificial intelligence can be used to draw new insights from data. In the final chapters, you'll discover transactional data lakes, data meshes, and how to build a cutting-edge data platform on AWS.
11 | 12 | By the end of this AWS book, you'll be able to execute data engineering tasks and implement a data pipeline on AWS like a pro! 13 | 14 | ## Key Takeaways 15 | - Delve into robust AWS tools for ingesting, transforming, and consuming data, and for orchestrating pipelines 16 | - Stay up to date with a comprehensive revised chapter on Data Governance 17 | - Build modern data platforms with a new section covering transactional data lakes and data mesh 18 | 19 | 20 | 21 | ## New Edition vs. Previous Edition: 22 | - New and refreshed content. 23 | - Fully revised with the most up-to-date information on AWS. 24 | 25 | 26 | 27 | 28 | ## What's New 29 | - Updates to every single chapter to cover all the newest AWS services and features. 30 | - Three brand-new chapters, including content on building transactional data lakes, implementing a data mesh approach on AWS, and using a DataOps approach to data platform building. 31 | - Modernized coverage of AI and ML functionality within AWS. 32 | - A new and refreshed look at data governance strategies. 33 | 34 | 35 | 36 | 37 | 38 | ## Outline and Chapter Summary 39 | This new edition builds on the work of the original to bring you a fully refreshed, definitive guide to building data pipelines in AWS. You’ll move from the basics of learning what data engineering is and what goes into designing strong architectures, to exploring the tools and services available to help you and how to get the most out of them, before covering recent developments in the data engineering space, like transactional data lakes, AI, and ML. 40 | Throughout the book you’ll cement your newfound knowledge with a series of easy-to-follow practical exercises. 41 | 42 | 1. [An Introduction to Data Engineering](https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter01) 43 | 2.
[Data Management Architectures for Analytics](https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter02) 44 | 3. [The AWS Data Engineers Toolkit](https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter03) 45 | 4. [Data Cataloging, Security, and Governance](https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter04) 46 | 5. [Architecting Data Engineering Pipelines](https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter05) 47 | 6. [Ingesting Batch and Streaming Data](https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter06) 48 | 7. [Transforming Data to Optimize for Analytics](https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter07) 49 | 8. [Identifying and Enabling Data Consumers](https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter08) 50 | 9. [Loading Data into a Data Mart](https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter09) 51 | 10. [Orchestrating the Data Pipeline](https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter10) 52 | 11. [Ad Hoc Queries with Amazon Athena](https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter11) 53 | 12. [Visualizing Data with Amazon QuickSight](https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter12) 54 | 13. [Enabling Artificial Intelligence and Machine Learning](https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter13) 55 | 14. [Transactional Data Lakes](https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter14) 56 | 15. [Implementing a Data Mesh strategy](https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter15) 57 | 16.
[Building a modern data platform on AWS](https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter16) 58 | 17. [Wrapping Up the First Part of Your Learning Journey](https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter17) 59 | 60 | ### Chapter 01, An Introduction to Data Engineering 61 | This chapter goes into the burgeoning field of data engineering, shedding light on its rapid growth and the soaring demand for professionals in this domain. With data assuming an increasingly pivotal role in organizations of all scales, the chapter underscores the gratification derived by individuals who relish assembling intricate data pipelines that not only ingest raw data, but also undertake its transformation and optimization for diverse consumer needs. It stresses the significance of data as a corporate asset, highlighting its escalating value, while concurrently examining the hurdles faced by organizations coping with surges in data volume. The chapter elucidates how data engineers can leverage cloud-based services to surmount these challenges, setting the stage for the practical activities presented later in the book. It also provides a comprehensive guide for creating a new Amazon Web Services (AWS) account. 62 | 63 | 64 | **Key Insights**: 65 | - Data engineering is a rapidly growing career path, with high demand being driven by the increasing importance of data in organizations. 66 | - Data engineers play a critical role in building complex data pipelines that ingest, transform, and optimize raw data for various consumers. 67 | - The chapter emphasizes the value of data as a corporate asset and discusses the challenges organizations face with growing data volumes. 68 | - Cloud-based services are highlighted as a means for data engineers to address these challenges effectively. 
69 | - The chapter also introduces the foundational step of creating an Amazon Web Services (AWS) account, setting the stage for hands-on activities in later chapters. 70 | 71 | 72 | 73 | ### Chapter 02, Data Management Architectures for Analytics 74 | This chapter underlines the diversity of options available in terms of cloud services, open-source frameworks, file formats, and architectural approaches for analytical projects, depending on specific business requirements. It serves as a foundational resource for understanding modern analytical architectures, regardless of whether one chooses to implement them on AWS or other platforms. The chapter provides essential introductory concepts that lay the groundwork for subsequent chapters, covering key topics such as the evolution of data management for analytics, in-depth exploration of data warehouse concepts and architecture, an overview of data lake architecture, and the synergy between data warehouses and data lakes. It also includes a practical component, guiding readers on using the AWS Command Line Interface (CLI) to create Simple Storage Service (S3) buckets. 75 | 76 | Furthermore, the chapter emphasizes the importance of foundational architectural concepts in designing real-world analytics data management and processing solutions. It introduces three prevalent analytics data management architectures used in organizations today: data warehouses, data lakes, and data lakehouses. 77 | 78 | **Key Insights**: 79 | 80 | - Various cloud services, open-source frameworks, file formats, and architectural approaches can be employed in analytical projects, with choices dependent on specific business requirements. 81 | - Foundational concepts and technologies essential for modern analytical architectures are discussed, irrespective of the cloud platform used, laying the groundwork for subsequent chapters. 
82 | - The chapter covers the evolution of data management for analytics, data warehouses, data lakes, and data lakehouses. 83 | - Practical guidance is provided on using the AWS CLI to create S3 buckets. 84 | 85 | 86 | ### Chapter 03, The AWS Data Engineers Toolkit 87 | This chapter highlights how cloud computing has transformed the traditional approach to big data processing. In the past, organizations grappled with the complexity of building and maintaining their own data processing systems, often leading to significant expenditures and delays. With AWS, many of these challenges have been eliminated, enabling the quick deployment of fully configured software solutions, automatic updates, and scalability without the need for large upfront capital investment. The chapter introduces a range of AWS-managed services used for building big data solutions, and emphasizes the importance of selecting the most suitable service for specific requirements. 88 | 89 | The chapter provides a comprehensive overview of AWS services for data ingestion, data transformation, orchestrating big data pipelines, and data consumption. It also includes a hands-on component, guiding readers on how to trigger an AWS Lambda function when a new file arrives in an S3 bucket, offering a practical understanding of these services. The text culminates with an introduction to data governance, emphasizing its crucial role in data engineering projects. 90 | 91 | **Key Insights**: 92 | - AWS, launched in 2006, has been instrumental in shaping the cloud computing industry by continuously innovating and expanding its service offerings. 93 | - Traditional data processing systems in organizations were complex, expensive, and required significant maintenance and scaling efforts, whereas AWS has simplified these challenges by providing on-demand and scalable solutions. 94 | - AWS offers over 200 services, including analytics services, which data engineers can use to build complex data analytic pipelines. 
95 | - The chapter introduces key AWS-managed services for data ingestion, transformation, orchestration of big data pipelines, and data consumption, emphasizing the importance of selecting the right service for specific requirements. 96 | - A hands-on component guides readers in triggering an AWS Lambda function when a new file arrives in an S3 bucket. 97 | 98 | 99 | ### Chapter 04, Data Cataloging, Security, and Governance 100 | In this chapter, the text explores the best practices for responsible data handling, aiming to ensure that data's value is fully realized by organizations. It covers various aspects of data governance, including data security, access, and privacy, data quality, data profiling, and data lineage. It introduces the significance of data catalogs and AWS services that aid in data governance. A hands-on component guides readers in configuring Lake Formation permissions, offering practical insights into implementing data governance. The chapter underscores the necessity of a robust data governance program to ensure correct data usage and its optimal value to the organization. 101 | 102 | **Key Insights**: 103 | - Data governance and security are crucial components of data management, ensuring that data remains secure, compliant with local laws, and discoverable for organizational use. 104 | - Data breaches and mishandling can lead to reputational damage and government-imposed penalties, making it essential for organizations to prioritize responsible data handling. 105 | - Siloed data, poor data quality, and a lack of user trust can hinder organizations from maximizing the value of their data. 106 | - The hands-on component provides practical insights into configuring data governance in an AWS environment. 
107 | 108 | 109 | ### Chapter 05, Architecting Data Engineering Pipelines 110 | The chapter provides a structured approach to designing data pipelines, beginning with the importance of understanding data consumers and their specific requirements. It also covers the identification of data sources, the processes of data ingestion, and the essential steps of data transformation and optimization. Loading data into data marts is another critical component discussed, along with a practical hands-on session that guides readers in architecting a sample data pipeline. This chapter offers a comprehensive guide for translating data engineering concepts into real-world applications, enabling readers to bridge the gap between theory and practice. 111 | 112 | **Key Insights**: 113 | - Data pipelines are pivotal in data engineering, serving as the means to ingest data from multiple sources, optimize and transform it, and make it available to data consumers. 114 | - The chapter focuses on the practical application of data engineering principles, enabling readers to translate theoretical knowledge into real-world data pipeline architecture. 115 | - It emphasizes the importance of understanding the needs and requirements of data consumers as a fundamental step in the data pipeline design process. 116 | - The chapter concludes with a hands-on exercise that allows readers to architect a sample data pipeline, providing practical experience in implementing the concepts covered. 117 | 118 | 119 | ### Chapter 06, Ingesting Batch and Streaming Data 120 | The chapter covers several fundamental topics, including understanding data sources, ingesting data from database and streaming sources. It underscores the importance of using the right ingestion tools to ensure efficient data movement. The hands-on portion of the chapter allows readers to practice data ingestion from both database and streaming sources using AWS DMS, Amazon Kinesis, and AWS Glue. 
121 | 122 | **Key Insights**: 123 | - Emphasizes the critical role of data ingestion within the data pipeline architecture, providing a detailed exploration of this foundational process. 124 | - Data engineers often face the challenges of the five Vs of data: variety, volume, velocity, veracity, and value, and the chapter underscores the importance of addressing these challenges in data engineering. 125 | - The chapter covers various data sources and the multitude of tools available within AWS for effective data ingestion. 126 | - It provides guidance on making informed decisions when choosing the appropriate data ingestion tools for specific tasks. 127 | - Readers gain practical experience through hands-on activities, ingesting data from both database and streaming sources using AWS DMS, Amazon Kinesis, and AWS Glue, reinforcing the theoretical concepts with practical application. 128 | 129 | 130 | ### Chapter 07, Transforming Data to Optimize for Analytics 131 | This chapter looks into the vital task of data transformation, a key responsibility for data engineers in optimizing data for analytics and creating value for organizations. The chapter underscores the diversity of data transformations, distinguishing between common, generic transformations applicable to datasets, such as the conversion of raw files to Parquet format and partitioning, and those that incorporate business logic tailored to specific data content and business requirements. The chapter aims to equip data engineers with an understanding of the value of transformations, types of data transformation tools, common data preparation transformations, and business use case transformations. A hands-on section illustrates the practical application of these concepts, using AWS Glue Studio and Apache Spark for building transformations. 
132 | 133 | **Key Insights**: 134 | - Data transformation is a key task for data engineers, involving various types of transformations, including both generic and business-specific ones. 135 | - The chapter highlights the value of data transformations in optimizing data for analytics and creating value for organizations. 136 | - It provides an overview of data transformation tools, common data preparation transformations, and business use case transformations. 137 | - Hands-on activities with AWS Glue Studio and Apache Spark offer practical experience in building data transformations. 138 | - The chapter serves as a bridge between data pipeline architecture and data ingestion, emphasizing the significance of data optimization for analytics in the data engineering process. 139 | 140 | 141 | ### Chapter 08, Identifying and Enabling Data Consumers 142 | This chapter examines the intricacies of data consumers. Data consumers span a wide spectrum, from operational staff needing real-time stock levels to the CEO requiring data for strategic decision-making. Additionally, data consumers can be systems seeking data from other systems. The central objective of data engineers is to make datasets useful and accessible to these consumers, ultimately empowering the business to gain valuable insights from its data. To achieve this, data engineers must deliver the right data through appropriate tools to the right people or applications at the right time, facilitating informed decision-making. The chapter underscores the pivotal role of understanding business objectives, recognizing data consumers, and comprehending their requirements when designing a data engineering pipeline. This approach enables data engineers to select the right data ingestion tools, frequency, and transformation processes to meet the specific needs of data consumers. The chapter also considers data democratization, emphasizing its impact and significance. 
A hands-on section asks readers to take on the role of a data analyst tasked with creating a mailing list for the marketing department. 143 | 144 | **Key Insights**: 145 | - Data consumers, including individuals, applications, and systems within an organization, require access to data to fulfill a range of needs, from operational tasks to strategic decision-making. 146 | - Data engineers play a crucial role in making datasets useful and accessible to data consumers, facilitating valuable insights for the business. 147 | - Understanding the business objectives, identifying data consumers, and comprehending their requirements are essential steps in designing a data engineering pipeline. 148 | - Data democratization has a significant impact on data access and usage, influencing various data consumer groups, including business users, data analysts, and data scientists. 149 | - The hands-on section of the chapter provides practical experience in transforming data using AWS Glue DataBrew. 150 | 151 | 152 | ### Chapter 09, Loading Data into a Data Mart 153 | This chapter explores the significance of data warehouses and data marts in scenarios where data engineers need to move data from a data lake to an external data warehouse or data mart to serve specific data consumers. The chapter clarifies the distinctions between a data lake, which serves as a comprehensive source of truth across various business lines, and a data mart, which typically contains a subset of data tailored to specific user groups. 154 | It begins by examining common anti-patterns that should be avoided when using data warehouses. Subsequently, it goes into the intricate architecture of Redshift, elucidating how it optimizes data storage across nodes. The chapter emphasizes the importance of design decisions for creating a Redshift cluster optimized for performance, as well as the processes for data ingestion into and unloading from Redshift.
Additionally, advanced Redshift features, including data sharing, Dynamic Data Masking (DDM), and cluster resizing, are reviewed. The hands-on exercises provide practical experience by guiding readers in creating a new Redshift Serverless cluster, exploring sample data, and configuring Redshift Spectrum to query data from Amazon S3. 155 | 156 | **Key Insights**: 157 | - Data lakes provide a central source of data while data marts serve as subsets optimized for specific user groups and query types. 158 | - Data marts play a dual role, offering a subset of data for targeted queries and providing high-performance query engines for analytic use cases. 159 | - The chapter highlights common anti-patterns to avoid in data warehouse usage and explores the architecture of Redshift, shedding light on its data storage optimization. 160 | - Readers gain insights into the design considerations for high-performance data warehouses, data movement between data lakes and Redshift, and advanced Redshift features. 161 | - The hands-on exercises involve creating a Redshift Serverless cluster and utilizing Redshift Spectrum for querying data from Amazon S3. 162 | 163 | ### Chapter 10, Orchestrating the Data Pipeline 164 | This chapter explores the pivotal role of data pipeline orchestration tools in automating data engineering tasks, emphasizing their significance in a real production environment. The chapter begins by highlighting the various services and techniques discussed in previous chapters, including data ingestion via Amazon Kinesis Data Firehose and AWS Database Migration Service, as well as data transformation through AWS Lambda functions and AWS Glue jobs. It underlines the importance of updating a data catalog and loading subsets of data into data marts or data warehouses for specific use cases. However, for real-world applications, manual triggering of these tasks is not acceptable, necessitating automation.
Data pipeline orchestration engines play a crucial role in orchestrating various data engineering tasks, managing task sequences and dependencies, handling task success or failure, and executing tasks in parallel or sequentially. The chapter covers core concepts of pipeline orchestration, reviews different options for orchestrating data pipelines within AWS, and provides a hands-on activity demonstrating orchestration using AWS Step Functions. 165 | 166 | Readers gain insights into concepts like scheduled and event-based pipelines, failure handling, and retries. The chapter introduces four AWS services for creating and orchestrating data pipelines, namely AWS Data Pipeline, AWS Glue workflows, Amazon MWAA, and AWS Step Functions. The hands-on section guides readers in building an event-driven pipeline using AWS Lambda functions and an Amazon SNS topic for notifications about failures, orchestrated by AWS Step Functions. 167 | 168 | **Key Insights**: 169 | - Data pipeline orchestration is crucial for automating data engineering tasks, enabling scheduled and event-driven pipelines, handling task failures, and orchestrating parallel and sequential tasks. 170 | - Core concepts of pipeline orchestration, including scheduling and dependency management, are fundamental to designing effective data pipelines. 171 | - AWS offers multiple orchestration options, including AWS Data Pipeline, AWS Glue workflows, Amazon MWAA, and AWS Step Functions, each with its unique strengths. 172 | - The hands-on activity in this chapter involves orchestrating a data pipeline using AWS Step Functions, providing practical experience in implementing pipeline orchestration. 173 | - This chapter serves as a bridge between architectural design and practical implementation of data pipelines, setting the stage for exploring data consumption services and deeper dives into data exploration and analysis in the subsequent chapters. 
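
The failure handling and retry behavior summarized above can be illustrated outside of any orchestration service. The following is a minimal sketch only, not the book's implementation and not the Step Functions API: `run_with_retries`, `make_flaky_task`, and their parameters are hypothetical names, loosely modeled on the interval and backoff-rate settings that orchestrators commonly expose. The simulated failure mirrors the divide-by-zero error in `Chapter10/dataeng-random-failure-generator.py`.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    max_retries: int = 3,
    interval_seconds: float = 1.0,
    backoff_rate: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `task`, retrying on any exception with exponential backoff.

    Hypothetical helper: the first call is not counted as a retry, so the
    task runs at most `max_retries + 1` times.
    """
    delay = interval_seconds
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                # Retries exhausted: re-raise so a caller (for example a
                # notification step) can react to the failure.
                raise
            sleep(delay)
            delay *= backoff_rate  # wait longer before each new attempt
    raise RuntimeError("unreachable")  # keeps type checkers satisfied


def make_flaky_task(failures_before_success: int) -> Callable[[], float]:
    """Build a task that fails a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def task() -> float:
        if remaining[0] > 0:
            remaining[0] -= 1
            return 10 / 0  # simulated failure (ZeroDivisionError)
        return 10 / 2

    return task


# Record the waits instead of sleeping so the demo finishes instantly.
waits: List[float] = []
result = run_with_retries(make_flaky_task(2), sleep=waits.append)
print(result, waits)  # prints: 5.0 [1.0, 2.0]
```

Passing `sleep` as a parameter is a design choice for testability; in a real pipeline the orchestration service, rather than application code, would normally own the retry policy.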
174 | 175 | 176 | ### Chapter 11, Ad Hoc Queries with Amazon Athena 177 | This chapter investigates Amazon Athena, a serverless and fully managed service that allows users to query data directly in a data lake using SQL and Spark. The chapter commences by introducing Athena's features, emphasizing its setup-free nature, flexible payment options based on data scanned or provisioned capacity, and its capability to query data not only in data lakes but also from various other database sources using Query Federation. The chapter explores Athena's governance and cost management functionality through workgroups and offers recommended best practices for optimizing SQL queries for both cost-efficiency and performance. 178 | 179 | The chapter also covers advanced functionality, illustrating how Athena can serve as a SQL query engine for data in Amazon S3 data lakes and external data sources like databases, data warehouses, and CloudWatch logs. It concludes with a hands-on exercise where users create an Athena workgroup, configure settings, and execute SQL queries within that workgroup. 180 | 181 | **Key Insights**: 182 | - Amazon Athena is a serverless, fully managed service that enables SQL and Spark-based querying of data in data lakes and various database sources. 183 | - The chapter provides insights into optimizing Athena SQL queries for cost-efficiency and performance, covering tips and best practices. 184 | - Athena's advanced functionality includes Query Federation, allowing users to query data from external sources like databases and data warehouses. 185 | - Athena workgroups offer governance and cost management capabilities, helping manage settings for different teams or projects and controlling data scan limits in queries. 
186 | 187 | 188 | ### Chapter 12, Visualizing Data with Amazon QuickSight 189 | In this chapter, the focus is on Amazon QuickSight, a powerful BI tool that empowers users to create rich visualizations and reports while exploring data interactively. QuickSight offers features such as data filtering, drill-down capabilities, natural language querying, and the ability to summarize complex datasets visually. The chapter emphasizes the purpose of BI tools in helping users comprehend complex data through visual exploration. While it primarily covers Amazon QuickSight, the concepts discussed can apply to other popular BI applications like Tableau, Microsoft Power BI, and Qlik. 190 | 191 | The chapter also looks at advanced QuickSight features, such as ML Insights for data trend detection and outlier identification, QuickSight Q for natural language querying, embedded dashboards for integration with websites and applications, and paginated reports. It concludes with a hands-on section guiding users through the configuration of QuickSight in their AWS account and the creation of customized visualizations. 192 | 193 | **Key Insights**: 194 | - This chapter provides an in-depth understanding of Amazon QuickSight, highlighting its role as a serverless, fully managed BI tool for creating visualizations, analyzing data, and reporting. 195 | - The chapter emphasizes the core concepts of business intelligence, explaining the significance of visual data representation, data exploration, and the role of BI tools in enabling users to make data-driven decisions. 196 | - QuickSight's cost-effective subscription-based pricing model is discussed, making it a practical choice for organizations looking to avoid infrastructure and licensing costs. 197 | - The chapter offers hands-on experience, guiding users through the setup and customization of QuickSight in their AWS accounts, empowering them to create their own visualizations and reports. 
198 | 199 | 200 | ### Chapter 13, Enabling Artificial Intelligence and Machine Learning 201 | This chapter delves into the significance of AI and ML for organizations, emphasizing how advancements in these fields have made it possible to automate tasks, personalize recommendations, and derive insights from data. While AI focuses on replicating human intelligence for tasks like problem-solving, decision-making, and language understanding, ML, more specifically, develops algorithms and models that recognize patterns in data to make predictions. The chapter highlights the growing impact of AI and ML in various domains, such as healthcare, self-driving vehicles, and the capabilities of Large Language Models (LLMs) like ChatGPT. It introduces AWS's array of AI and ML services and explores their utility with different data types. 202 | 203 | 204 | **Key Insights**: 205 | - The chapter underscores the transformative role of AI and ML in enabling organizations to automate tasks, personalize recommendations, and draw insights from data. 206 | - It clarifies the distinction between AI and ML. 207 | - The chapter introduces a range of AWS AI and ML services, including Amazon SageMaker for custom model development and services like Amazon Transcribe, Textract, Rekognition, and Comprehend for specific tasks. 208 | - It explores various real-world applications of AI and ML, from early detection of diseases to self-driving vehicles and the capabilities of LLMs. 209 | - The hands-on exercise in the chapter demonstrates the use of Amazon Comprehend for extracting insights from written text, providing practical exposure to AWS AI and ML services. 210 | 211 | 212 | ### Chapter 14, Transactional Data Lakes 213 | In this chapter, the focus is on the evolution of data lakes, which have advanced significantly in recent years, transforming into transactional data lakes. 
These new technologies not only retain the benefits of traditional data lakes, such as cost-effectiveness and serverless data processing capabilities, but also offer improved data update mechanisms. 214 | 215 | The key topics covered in this chapter include defining the concept of a transactional data lake, an in-depth examination of table formats like Delta Lake, Apache Hudi, and Apache Iceberg, and how AWS integrates these table formats to create transactional data lakes. The hands-on section allows readers to work with Apache Iceberg tables in AWS, gaining practical experience in managing and querying tables in this modern data lake format. 216 | 217 | **Key Insights**: 218 | - Traditional data lakes built on Apache Hive had limitations in terms of handling data updates, query consistency, and schema changes. 219 | - New table formats like Delta Lake, Apache Hudi, and Apache Iceberg address these limitations, offering transactional capabilities and advanced features. 220 | - AWS integrates these table formats to support the creation of transactional data lakes, making them more akin to traditional data warehouses. 221 | - The transformation of data lakes into transactional data lakes is a significant shift, driven by innovations in table formats and metadata management, shaping the future of data lake management. 222 | 223 | ### Chapter 15, Implementing a Data Mesh Strategy 224 | This chapter introduces a shift from traditional data lake approaches to a newer concept known as a data mesh. Historically, organizations established central data engineering teams responsible for collecting, processing, and transforming raw data into a data lake. However, this centralized approach had its limitations. The data mesh approach presents an alternative by decentralizing data analytics and pushing the responsibility for creating analytic data products closer to the teams that generate or own the operational data.
The chapter explores the various aspects of a data mesh, including organizational changes, architectural approaches, and technology implementation. It also delves into the core concepts and components of Amazon DataZone, a data governance tool offering a business data catalog. In a hands-on exercise, readers set up DataZone, import a dataset from the AWS Glue Data Catalog, add business metadata, and explore how to search the catalog and subscribe to data products. 225 | 226 | 227 | **Key Insights**: 228 | - The data mesh approach offers a decentralized model, pushing responsibility for data analytics closer to operational data owners. 229 | - Data mesh implementation involves organizational changes, architectural approaches, and technology utilization. 230 | - Amazon DataZone, a data governance tool with a business data catalog, is introduced as a key element in data mesh initiatives. 231 | - The chapter provides a hands-on exercise for setting up DataZone, importing datasets, adding metadata, and exploring the catalog, preparing readers for modern data platform building on AWS. 232 | 233 | 234 | ### Chapter 16, Building a Modern Data Platform on AWS 235 | In this chapter, the focus is on the essential aspects of constructing a modern data platform on AWS. The chapter aims to provide a foundational understanding of building such platforms and emphasizes the significance of combining the various concepts presented throughout the book. It offers insights into goals for modern data platforms, deliberates the decision to build or buy such a platform, and introduces the concept of DataOps for automated development of data products. The hands-on section illustrates how AWS services like CloudFormation, CodeCommit, and CodePipeline can be employed to automate data platforms, streamlining infrastructure deployment and ETL code management.
236 | 237 | 238 | 239 | **Key Insights**: 240 | - Offers a high-level overview of constructing a modern data platform on AWS, serving as a foundational guide. 241 | - Explores key objectives for modern data platforms, including flexibility, scalability, governance, security, and self-service capabilities. 242 | - The chapter discusses the decision-making process of whether to build or purchase a data platform, weighing the pros and cons. 243 | - Introduces DataOps as an approach to automate the development of data products and manage infrastructure, making the platform more efficient and agile. 244 | - Demonstrates the practical use of AWS services like CloudFormation, CodeCommit, and CodePipeline to automate components and code deployment in a modern data platform. 245 | 246 | ### Chapter 17, Wrapping Up the First Part of Your Learning Journey 247 | In this concluding chapter, the reader is reminded of the wide spectrum of topics explored in the book, ranging from common architectural patterns to hands-on experience with AWS services essential for data engineering. The chapter reflects on the significance of data engineering in optimizing data value within organizations. It emphasizes the dynamic nature of AWS services, with continuous innovation to meet evolving customer needs, offering a fast-paced journey through the cloud for data engineering platforms. 248 | 249 | The chapter serves as an acknowledgment of the ever-expanding horizons of data engineering, indicating that the book is just the beginning of a comprehensive learning journey. It encourages the reader to delve deeper into the field. The conclusion paints an exciting picture of the journey ahead, implying that with the constant evolution of AWS services and the vast wealth of resources, the opportunities for growth and innovation are boundless for those venturing into data engineering on AWS. 
250 | 251 | **Key Insights**: 252 | - The book provides a comprehensive exploration of data engineering, covering architectural patterns, hands-on AWS service usage, data security, governance, and data catalog importance. 253 | - It introduces the concept of a data lake house and discusses data marts, data warehouses, and data consumers, along with tools like Amazon Athena and QuickSight for data querying and visualization. 254 | - The book touches on ML and AI and highlights AWS services in these domains. 255 | - Emerging trends such as the data mesh approach and open table formats like Apache Iceberg are discussed, focusing on decentralizing data engineering responsibilities. 256 | - The chapter concludes by emphasizing the rapid evolution of AWS services and the constant learning opportunities in data engineering, encouraging readers to embark on a continuous learning journey in this field. 257 | 258 | > If you feel this book is for you, get your [copy](https://www.amazon.in/Data-Engineering-AWS-cloud-based-transformation-ebook/dp/B0C61KXWQ5) today! 259 | 260 | 261 | 262 | ## Learn more on the Discord server 263 | You can join in on the discord server for all the latest updates and discussions in the community at [Discord](https://discord.gg/9s5mHNyECd) 264 | 265 | ## Download a free PDF 266 | 267 | _If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost. Simply click on the link to claim your free PDF._ 268 | [Free-Ebook](https://packt.link/free-ebook/9781804614426) 269 | 270 | We also provide a PDF file that has color images of the screenshots/diagrams used in this book at [GraphicBundle](https://packt.link/gbp/9781804614426) 271 | 272 | 273 | ## Get to know the Author 274 | _Gareth Eagar_ has over 25 years of experience in the IT industry, starting in South Africa, working in the United Kingdom for a while, and now based in the USA.
Having worked at AWS since 2017, Gareth has broad experience with a variety of AWS services, and deep expertise around building data platforms on AWS. While Gareth currently works as a Solutions Architect, he has also worked in AWS Professional Services, helping architect and implement data platforms for global customers. Gareth also frequently speaks on data-related topics. 275 | 276 | 277 | ## Other Related Books 278 | - [Data Engineering with dbt](https://www.packtpub.com/product/data-engineering-with-dbt/9781803246284) 279 | - [Data Engineering with Python](https://www.packtpub.com/product/data-engineering-with-python/9781839214189) 280 | --------------------------------------------------------------------------------