├── README.md ├── case_studies ├── AnyCompany_engineering.md ├── AnyCompany_racing_automotive.md ├── README.md └── petabytes_data_3_months.md ├── how-to ├── Athena-prevent-queries-from-queueing.md ├── Athena-understand-query-see-cost.md ├── Firehose-convert-input-data-format.md ├── Glue-DataBrew-convert-columns-into-key-value-pairs.md ├── Glue-resolve-data-reprocessing-with-job-bookmarks-enabled.md ├── Glue-speed-up-crawler.md ├── README.md ├── Redshift-cluster-balance-the-load.md ├── Redshift-ingest-data-from-Kinesis-Data-Streams.md ├── Redshift-monitor-events.md ├── S3-redact-PII-dynamically.md ├── add-aws-glue-crawler.md ├── calculate_data_transfer_time.md ├── detect-process-sensitive-data-glue.md ├── improve-Redshift-Spectrum-query-performance.md ├── streaming-ingestion-to-Redshift.md ├── turn-on-concurrency-scaling-Redshift.md └── upgrade-EBS-volumes-from-gp2-to-gp3.md └── images └── exam_score.png /README.md: -------------------------------------------------------------------------------- 1 | # AWS Data Engineer Associate certification 2 | 3 | 👋 Welcome to this short and simple repo! 4 | 5 | I scored 1000/1000 on the [AWS Data Engineer - Associate exam](https://aws.amazon.com/certification/certified-data-engineer-associate/). I hope my materials and cheat sheets help you! 6 | 7 | ![My exam score](/images/exam_score.png) 8 | 9 | ## What I did for the preparation: 10 | 11 | 1. Completed [FREE AWS Certified Data Engineer Associate Practice Exam – Sampler](https://portal.tutorialsdojo.com/courses/free-aws-certified-data-engineer-associate-practice-exam-sampler/) 12 | 2. Took [Planning Large Scale Data Migrations to AWS](https://explore.skillbuilder.aws/learn/course/15545/) (for personal interest) 13 | 3. Reviewed all questions from [Examtopics](https://www.examtopics.com/exams/amazon/aws-certified-data-engineer-associate-dea-c01/) 14 | 4. Checked some concepts hands-on in my [AWS Management console](https://aws.amazon.com/console/) 15 | 16 | **(Disclaimer)**: I already had experience with core Data Engineering services (Redshift, Athena, Glue, Kinesis, S3) + the Data Analytics Specialty certification ([retired](https://aws.amazon.com/blogs/training-and-certification/aws-certification-retirements-and-launches/)) + knowledge of [Architecting on AWS](https://aws.amazon.com/training/classroom/architecting-on-aws/) + [AWS Solutions Architect associate certification](https://aws.amazon.com/certification/certified-solutions-architect-associate/). 17 | 18 | 📌 I do recommend [labs](https://explore.skillbuilder.aws/learn/catalog?ctldoc-catalog-0=l-_en~field14-_4~se-%22data%20engineer%22), [workshops](https://workshops.aws/) and any relevant [hands-on practice](https://aws.amazon.com/getting-started/hands-on/?getting-started-all.sort-by=item.additionalFields.content-latest-publish-date&getting-started-all.sort-order=desc&awsf.getting-started-category=category%23analytics%7Ccategory%23storage%7Ccategory%23databases). 
19 | 20 | 📌 I do recommend going through fundamental courses, practice exams and example questions: 21 | 22 | ## Courses: 23 | 24 | - [Udemy course by Stéphane Maarek and Frank Kane](https://www.udemy.com/course/aws-data-engineer/) 25 | 26 | ## Relevant AWS Skill Builder courses: 27 | 28 | - [Fundamentals of Analytics on AWS – Part 1](https://explore.skillbuilder.aws/learn/course/internal/view/elearning/18437/fundamentals-of-analytics-on-aws-part-1) 29 | - [Fundamentals of Analytics on AWS – Part 2](https://explore.skillbuilder.aws/learn/course/internal/view/elearning/18440/fundamentals-of-analytics-on-aws-part-2) 30 | - [Data Engineering on AWS - Foundations](https://explore.skillbuilder.aws/learn/course/internal/view/elearning/19747/data-engineering-on-aws-foundations) 31 | 32 | ## Practice exams: 33 | 34 | - [Udemy practice exam](https://www.udemy.com/course/practice-exams-aws-certified-data-engineer-associate-r/) 35 | - [Tutorial Dojo practice exam](https://portal.tutorialsdojo.com/courses/aws-certified-data-engineer-associate-practice-exam-dea-c01/) 36 | 37 | ## Questions: 38 | 39 | - [Examtopics](https://www.examtopics.com/exams/amazon/aws-certified-data-engineer-associate-dea-c01/) 40 | 41 | ## Common AWS Data Engineer Associate exam tips: 🏆 42 | 43 | 1. Focus on **what’s being asked**: e.g., LEAST operational overhead, MOST cost-effective, FASTEST queries, LEAST latency. 44 | 2. Pay attention to **question details**: data sources, on-premises or on AWS, data size, access patterns, etc. 45 | 3. Note **specific requirements**: fault tolerance, low latency, no public Internet access, etc. 46 | 4. **Eliminate answers** that don't make any sense. 47 | 5. For the remaining options, choose the **most suitable option based on points 1-3**. 48 | 6. [EXTRA] Does the **question scenario** remind you of anything mentioned in the [how-to directory](https://github.com/dashapetr/aws-data-engineer-certification/tree/main/how-to)? 49 | 7. Unsure? **Flag it for review** and move on. 50 | 8. **Stay calm and enjoy the exam—you've got this!** 51 | -------------------------------------------------------------------------------- /case_studies/AnyCompany_engineering.md: -------------------------------------------------------------------------------- 1 | # Meet AnyCompany Engineering 2 | 3 | AnyCompany Engineering (ACE) is a software company that specializes in developing 3D design, engineering, and entertainment software. ACE maintains a large, on-premises storage array that hosts Oracle and SQL Server database backups. Due to compliance requirements, the retention policy for the backups is several years. ACE doesn't want to upgrade the aging storage infrastructure to handle its current 700 TB and growing data set. They prefer to move to a managed, scalable, and reliable solution, specifically Amazon S3. 4 | 5 | ## Migration 6 | 7 | ACE wants to migrate its 700 TB of database backups. ACE anticipates additional data transfers after the initial 700 TB migration, so it prefers a robust solution to handle additional transfer tasks. ACE qualifies as a large-scale migration effort and requires a plan to move its data to Amazon S3. 8 | 9 | ## Business goal and objectives 10 | 11 | #### What business goal or objective is driving the data migration plan? 12 | ACE IT evaluated multiple options to ease the maintenance and management of the current storage system. One option was to upgrade the existing infrastructure to a newer version, but that would still leave infrastructure to build and maintain.
Instead, we prefer Amazon S3 because of cost, high durability, and availability. Also, we plan to use the lifecycle management capabilities for long-term archival storage. 13 | 14 | #### Are you prioritizing migration speed or cost? 15 | Cost. We need to meet compliance requirements with our database backups, which means long-term storage. We want to realize the financial benefit of the Amazon S3 Glacier storage classes. 16 | 17 | #### Is this a one-time migration or a recurring data transfer? 18 | This project calls for an initial migration of 700 TB, then subsequent migration of database backups. 19 | 20 | #### What is your cutover window to start and complete the migration? 21 | We haven't determined an exact time yet. The planning is still in an early stage. 22 | 23 | #### What events or milestones are driving your migration schedule? 24 | When we started exploring options, we had a storage capacity issue, as the data set then was 2.4 PB. However, as we conducted our analysis of the data, we tweaked our retention policy. The change in the policy allowed us to shrink the data set to 700 TB. Even still, we intend to move forward with this effort to migrate away from our current storage infrastructure. 25 | 26 | ## On-premises discovery 27 | 28 | #### What are the details of the data to migrate? 29 | 30 | ACE has 700 TB and growing of Oracle RMAN and SQL Server database backups located in one data center. The data spans multiple racks and arrays. Data integrity is key, so we need to validate each file in this migration. 31 | 32 | #### How is the data accessed? 33 | Backups to our NAS systems occur regularly, as defined by the backup policy. We use the NFS file sharing protocol. 34 | 35 | #### What are your site requirements for each service? 36 | Not applicable. 37 | 38 | #### Do you have a Virtual Machine infrastructure available to deploy a virtual appliance for your data migration? 39 | Yes, we do run a virtual infrastructure in our data center and have additional capacity. 40 | 41 | #### Have you identified the personnel who will be in charge of the network and storage components for the migration? 42 | Yes, we have both storage and IT staff available for this migration project. We also have staff in the data center who can help with the migration. The storage and IT staff have some familiarity with AWS migration and storage services. They plan to run a pilot and manage the technical tasks of the migration plan. They are prepared to write and test scripts for this migration project. 43 | 44 | ## AWS destination 45 | 46 | #### What is the AWS destination for your data? 47 | Our destination is Amazon S3. 48 | 49 | #### Where is the AWS destination? (For example, what is the target Region?) 50 | Our data center is located in the western United States. We intend to use the us-west-2 (Oregon) Region, as that Region fits our purpose for this project. 51 | 52 | ## Additional details 53 | 54 | #### What type of internet connectivity and available bandwidth do you have? 55 | We have a 2 Gbps internet connection for the data center. We don't ever reach bandwidth saturation. However, this project is a priority for us, so we could allocate 70 percent of the connection's bandwidth. We do have production workloads that require internet connectivity, so we can't use the full 2 Gbps connection during business hours. 56 | 57 | #### When does the data need to be available after the data is transferred?
58 | There is no rush for when the data becomes available to use, because the data is primarily backup data. We are interested in available options and how much time it would take with each one. 59 | 60 | ## With these additional details, what AWS service is a good fit for the ACE migration and subsequent transfer of data? 61 | 62 | - [ ] AWS DataSync 63 | - [ ] Amazon S3 File Gateway 64 | - [ ] AWS Snowball Edge 65 | - [ ] AWS Transfer Family 66 | 67 | ## Solution 68 | 69 | - [x] AWS DataSync 70 | - [ ] Amazon S3 File Gateway 71 | - [ ] AWS Snowball Edge 72 | - [ ] AWS Transfer Family 73 | 74 | ACE migrated 700 TB of data to Amazon S3 using AWS DataSync. After several rounds of testing and configuring DataSync tasks to the project specifications, the total migration time was 2.5 months. The transfer task was fast, steady, and used 1.4 Gbps of bandwidth during the low peak usage hours at night. It yielded a data transfer rate of about 500 Gibibytes an hour. 75 | 76 | ACE could have used about 10 AWS Snowball Edge devices for this migration, but the Snowball based migration plan included many more steps than using AWS DataSync. For example, the Snowball based plan called for creating and managing scripts to transfer data to the Snowball Edge devices and handling each Snowball Edge device in the data center. The AWS DataSync plan was straightforward and required less time to complete than the AWS Snowball Edge option. 77 | 78 | Source: https://explore.skillbuilder.aws/learn/course/15545/play/76124/planning-large-scale-data-migrations-to-aws 79 | -------------------------------------------------------------------------------- /case_studies/AnyCompany_racing_automotive.md: -------------------------------------------------------------------------------- 1 | # Meet AnyCompany Racing Automotive 2 | 3 | AnyCompany Racing Automotive (ACRA) is a global racing organization that hosts races in multiple geographical locations. ACRA needs to replicate its 400 TB video archive to an offsite location. However, this video content must remain accessible locally and with low latency. ACRA doesn't want to invest in another data center for its offsite requirement. ACRA will need to replicate additional video content over time. 4 | 5 | ## Migration 6 | 7 | ACRA wants to replicate its 400 TB video archive to AWS. ACRA qualifies as a large-scale migration effort and requires a plan to move its data to Amazon S3. 8 | 9 | ## Business goal and objectives 10 | 11 | #### What business goal or objective is driving the data migration plan? 12 | Managing another data center for our offsite requirements is cost-prohibitive. Co-location options don't meet our goal for the same reason—cost. We have two goals at play. We need to replicate this data offsite for backup and recovery purposes. Additionally, our end users depend on low latency access to the data, so the content will need to be accessible in our data center. We have plans to create data lakes for our other data sets, but currently, our focus is on our video content. 13 | 14 | #### Are you prioritizing migration speed or cost? 15 | Cost. We need to only spend as much as we consume with our data. 16 | 17 | #### Is this a one-time migration or a recurring data transfer? 18 | This project calls for an initial replication of 400 TB, then subsequent replication as new content is created. 19 | 20 | #### What events or milestones are driving your migration schedule? 21 | Hardware contracts are ending soon. 
We'd like to ensure that we have enough time to thoroughly test our backup and disaster recovery plans before that time. 22 | 23 | ## On-premises discovery 24 | 25 | #### What are the details of the data to migrate? 26 | ACRA has 400 TB and growing of video footage located in one data center. There are approximately 80,000 files. The amount of data grows by about 4 TB each day. Data integrity is key, so we need to validate each file in this migration. Although the data is raw video footage, our end users will need to be able to quickly retrieve it whenever needed. 27 | 28 | #### How is the data accessed? 29 | End users access the data on a network-attached storage (NAS) device. The NAS shares out the content using the NFS sharing protocol. 30 | 31 | #### What are your site requirements for each service? 32 | We need to maintain local access to the data. 33 | 34 | #### What type of internet connectivity and available bandwidth do you have? 35 | We have a 1 Gbps internet connection for the data center. We sometimes reach bandwidth saturation. However, this project is a priority for us, so the amount of bandwidth is flexible. Additionally, we have an AWS Direct Connect connection to the eu-west-2 Region. 36 | 37 | #### Do you have a Virtual Machine infrastructure available to deploy a virtual appliance for your data migration? 38 | Yes, we do run a virtual infrastructure in our data center and have additional capacity. 39 | 40 | #### When does the data need to be available after the data is transferred? 41 | As quickly as possible. 42 | 43 | #### Have you identified the personnel who will be in charge of the network and storage components for the migration? 44 | Yes, we have both storage and IT staff available for this migration project. We also have staff in the data center who can help with the migration. The storage and IT staff have some familiarity with AWS migration and storage services. They plan to run a pilot and manage the technical tasks of the migration plan. 45 | 46 | ## AWS destination 47 | 48 | #### What is the AWS destination for your data? 49 | Our destination is Amazon S3. 50 | 51 | #### Where is the AWS destination? (For example, what is the target Region?) 52 | Our data centers are located in the United Kingdom. We intend to use the eu-west-2 (London) Region, as that Region fits our purpose for this project. 53 | 54 | ## Additional details 55 | 56 | #### What is your cutover window to start and complete the migration? 57 | We intend for the migration to happen over a 3-month period. After all the data is in AWS, we will retire our current backup infrastructure. We'd like to have the data migrated into AWS by the end of 3 months, then have another 3 months to run recoverability tests. 58 | 59 | ## With this additional detail, what AWS services are a good fit for the ACRA migration and local data access? (Choose TWO responses.) 60 | 61 | - [ ] AWS DataSync 62 | - [ ] Amazon S3 File Gateway 63 | - [ ] AWS Snowball Edge 64 | - [ ] AWS Transfer Family 65 | 66 | ## Solution 67 | 68 | - [x] AWS DataSync 69 | - [x] Amazon S3 File Gateway 70 | - [ ] AWS Snowball Edge 71 | - [ ] AWS Transfer Family 72 | 73 | ACRA replicates up to 4 TB each day to Amazon S3 using AWS DataSync. The racing organization uses AWS Storage Gateway to provide standard file access to data in Amazon S3 with a local cache for low latency access.
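To make the replication mechanics above more concrete, here is a minimal boto3 (Python) sketch of how a recurring NFS-to-S3 AWS DataSync task could be set up. This is an illustration under assumptions, not ACRA's actual configuration: the agent ARN, NAS hostname, bucket, IAM role, and schedule are hypothetical placeholders.

```python
import boto3

# Sketch only: the agent ARN, NAS hostname, bucket, role, and schedule are placeholders.
datasync = boto3.client("datasync", region_name="eu-west-2")

# Source: the on-premises NAS share exposed over NFS, read through a deployed DataSync agent.
source = datasync.create_location_nfs(
    ServerHostname="nas.example.internal",
    Subdirectory="/video-archive",
    OnPremConfig={"AgentArns": ["arn:aws:datasync:eu-west-2:111122223333:agent/agent-EXAMPLE"]},
)

# Destination: the S3 bucket in eu-west-2 that also backs the S3 File Gateway file share.
destination = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::acra-video-archive",
    Subdirectory="/raw-footage",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3Role"},
)

# A scheduled task that verifies transferred files, matching the data-integrity requirement.
task = datasync.create_task(
    SourceLocationArn=source["LocationArn"],
    DestinationLocationArn=destination["LocationArn"],
    Name="acra-daily-video-replication",
    Options={"VerifyMode": "ONLY_FILES_TRANSFERRED", "OverwriteMode": "ALWAYS"},
    Schedule={"ScheduleExpression": "cron(0 1 * * ? *)"},  # nightly run for the ~4 TB daily delta
)
print(task["TaskArn"])
```

The Amazon S3 File Gateway side is configured separately as a file share backed by the same bucket, which is what keeps low-latency local access for end users.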
74 | 75 | Source: https://explore.skillbuilder.aws/learn/course/15545/play/76124/planning-large-scale-data-migrations-to-aws 76 | -------------------------------------------------------------------------------- /case_studies/README.md: -------------------------------------------------------------------------------- 1 | # Large Scale Data Migration to AWS Case Studies 2 | 3 | This directory contains case studies on large-scale data migrations to AWS, providing insights, strategies, and best practices. 4 | 5 | ## Source 6 | 7 | The content is based on the AWS course: [Planning Large Scale Data Migrations to AWS](https://explore.skillbuilder.aws/learn/course/15545/play/76124/planning-large-scale-data-migrations-to-aws). 8 | 9 | -------------------------------------------------------------------------------- /case_studies/petabytes_data_3_months.md: -------------------------------------------------------------------------------- 1 | # Moving petabytes of data in 3 months 2 | 3 | AnyCompany Streaming Media (ASM) company needs to migrate over 3 PB of data to Amazon S3. ASM wants to decrease infrastructure costs for their large data set while providing a better streaming service to their customers. The project timetable is set for 3 months because existing data center and licensing contracts will expire by then. 4 | 5 | ## Business goal and objectives 6 | 7 | #### What business goal or objective is driving the data migration plan? 8 | Hosting petabytes of streaming content is expensive, and we at ASM want to reduce those costs. Reducing infrastructure and eliminating hardware and software licensing costs will help achieve this objective. Also, we want to provide a better quality streaming service to wherever our customers are at any time. Having our streaming catalogue in a data center limits our options. 9 | 10 | #### Are you prioritizing migration speed or cost? 11 | Speed. We need to move as quickly as we can. 12 | 13 | #### Is this a one-time migration or a recurring data transfer? 14 | This project calls for a one-time migration. 15 | 16 | #### What is your cutover window to start and complete the migration? 17 | As soon as the entire data set is available in AWS. 18 | 19 | #### What events or milestones are driving your migration schedule? 20 | Software licensing and data center contracts expire within four months. Extending current contracts is an expensive option we want to avoid. 21 | 22 | ## On-premises discovery 23 | 24 | #### What are the details of the data to migrate? 25 | 26 | ASM has over 3 PB of media archives in a data center. All of the media are video files that average around 21 GB for each file. The archives hold approximately 175,000 files. The files themselves in the archives don't change. We transcode the files and deliver the content to our customers. All the files are stored on volumes backed by storage area network (SAN) devices. The SAN devices serve out block volumes to our transcoding servers. We don't use any kind of local object storage. 27 | 28 | #### How is the data accessed? 29 | The data is accessed through block volumes that are presented by SAN devices. The transcoding software accesses the files as if they were local drives attached to the server. 30 | 31 | #### What are your site requirements for each service? 32 | Not applicable. 33 | 34 | #### What type of internet connectivity and available bandwidth do you have? 35 | We have 5 Gbps internet connection for the data center that we use to push our content to our partner's content delivery network. 
Our bandwidth usage averages about 60 percent each day. We initially thought about setting up AWS Direct Connect, but we aren't sure if that's an option. We worry that setting up AWS Direct Connect and performing the migration may go beyond our project timeline. 36 | 37 | #### Do you have a Virtual Machine infrastructure available to deploy a virtual appliance for your data migration? 38 | Yes, we do run a virtual infrastructure in our data center and have additional capacity. 39 | 40 | #### When does the data need to be available after the data is transferred? 41 | As quickly as possible. 42 | 43 | #### Have you identified the personnel who will be in charge of the network and storage components for the migration? 44 | Yes, we have both storage and transcoding staff available for this migration project. We also have staff in the data center who can help with the migration. 45 | 46 | ## AWS destination 47 | 48 | #### What is the AWS storage destination for your data? 49 | Our destination is Amazon S3. We would like to use the Intelligent-Tiering option, as we believe it will give us the flexibility to manage our media. 50 | 51 | #### Where is the destination? (For example, what is the target Region?) 52 | Our data center is located in southern Germany. We intend to use the eu-central-1 Region, as that Region fits our purpose for this project. 53 | 54 | 55 | 56 | ## Which AWS service would you recommend AnyCompany Streaming Media use for their large-scale data migration? 57 | 58 | - [ ] AWS DataSync 59 | - [ ] Amazon S3 File Gateway 60 | - [ ] AWS Snowball Edge 61 | - [ ] Both AWS DataSync and AWS Snowball Edge 62 | 63 | 64 | Correct answer: 65 | 66 | - [ ] AWS DataSync 67 | - [ ] Amazon S3 File Gateway 68 | - [x] AWS Snowball Edge 69 | - [ ] Both AWS DataSync and AWS Snowball Edge 70 | 71 | ## Solution: 72 | 73 | Multiple Snowball Edge devices are the best choice for ASM. The devices don't need internet access, and ASM stated they have available staffing in the data center to connect the devices. 74 | 75 | 76 | Based on information from the framework, several Snowball Edge Storage Optimized devices are the best approach to meet the three-month timeline. AWS DataSync could have been another option if bandwidth had been available. However, ASM stated that there are constraints on bandwidth availability. With AWS recommendations in place, ASM set up an infrastructure where four Snowball devices read from on-premises storage simultaneously. One engineer was present at the data center to manage the setup of the devices while other team members worked remotely to manage the logistics and file transfers. 77 | 78 | At any given time, 12 AWS Snowball Edge devices were in use, whether copying data on-site at the data center, uploading content to Amazon S3, or in transit. 79 | 80 | 81 | Source: https://explore.skillbuilder.aws/learn/course/15545/play/76124/planning-large-scale-data-migrations-to-aws 82 | -------------------------------------------------------------------------------- /how-to/Athena-prevent-queries-from-queueing.md: -------------------------------------------------------------------------------- 1 | # Introducing Athena Provisioned Capacity 2 | 3 | A data engineer notices that Amazon Athena queries are held in a queue before the queries run.
4 | 5 | **How can the data engineer prevent the queries from queueing?** 6 | 7 | **Response:** 8 | 9 | Configure provisioned capacity for an existing workgroup 10 | 11 | ### AWS News Blog quote: 12 | ``` 13 | With provisioned capacity, you provision a dedicated set of compute resources to run your queries. 14 | This always-on capacity can serve your business-critical queries with near-zero latency and no queuing. 15 | ``` 16 | 17 | Source: https://aws.amazon.com/blogs/aws/introducing-athena-provisioned-capacity/ 18 | -------------------------------------------------------------------------------- /how-to/Athena-understand-query-see-cost.md: -------------------------------------------------------------------------------- 1 | # Using EXPLAIN and EXPLAIN ANALYZE in Athena 2 | 3 | A data engineer wants to improve the performance of SQL queries in Amazon Athena that run against a sales data table. 4 | 5 | The data engineer wants to understand the execution plan of a specific SQL statement. The data engineer also wants to see the computational cost of each operation in a SQL query. 6 | 7 | **Which statement does the data engineer need to run to meet these requirements?** 8 | 9 | **Response:** 10 | 11 | EXPLAIN ANALYZE SELECT * FROM sales 12 | 13 | ### Documentation quote: 14 | ``` 15 | The EXPLAIN ANALYZE statement shows both the distributed execution plan of a 16 | specified SQL statement and the computational cost of each operation in a SQL query. 17 | 18 | You can output the results in text or JSON format. 19 | ``` 20 | 21 | Source: https://docs.aws.amazon.com/athena/latest/ug/athena-explain-statement.html 22 | -------------------------------------------------------------------------------- /how-to/Firehose-convert-input-data-format.md: -------------------------------------------------------------------------------- 1 | # Convert input data format in Amazon Data Firehose 2 | 3 | A company plans to use Amazon Kinesis Data Firehose to store data in Amazon S3. The source data consists of 2 MB .csv files. The company must convert the .csv files to JSON format. The company must store the files in Apache Parquet format. 4 | 5 | **Which solution will meet these requirements with the LEAST development effort?** 6 | 7 | **Response:** 8 | 9 | 1. Use Kinesis Data Firehose to invoke an AWS Lambda function that transforms the .csv files to JSON. 10 | 2. Use Kinesis Data Firehose to store the files in Parquet format. 11 | 12 | 13 | ### Documentation quote: 14 | ``` 15 | Amazon Data Firehose can convert the format of your input data from JSON to Apache Parquet or Apache ORC before storing the data in Amazon S3. 16 | Parquet and ORC are columnar data formats that save space and enable faster queries compared to row-oriented formats like JSON. 17 | 18 | If you want to convert an input format other than JSON, such as comma-separated values (CSV) or structured text, you can use AWS Lambda to transform it to JSON first. 19 | ``` 20 | 21 | Source: https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html 22 | -------------------------------------------------------------------------------- /how-to/Glue-DataBrew-convert-columns-into-key-value-pairs.md: -------------------------------------------------------------------------------- 1 | # Convert user-selected columns into key-value pairs 2 | 3 | 4 | A company receives .csv files that contain physical address data. The data is in columns that have the following names: Door_No, Street_Name, City, and Zip_Code. 
The company wants to create a single column to store these values in the following format: 5 | ``` 6 | { 7 | "Door_No": "24", 8 | "Street_Name": "AAA street", 9 | "City": "BBB", 10 | "Zip_Code": "111111" 11 | } 12 | ``` 13 | 14 | **Which solution will meet this requirement with the LEAST coding effort?** 15 | 16 | **Response:** 17 | 18 | 1. Use AWS Glue DataBrew to read the files. 19 | 2. Use the NEST_TO_MAP transformation to create the new column. 20 | 21 | Documentation quote: 22 | ``` 23 | NEST_TO_MAP 24 | 25 | Converts user-selected columns into key-value pairs, 26 | each with a key representing the column name and a value representing the row value. 27 | ``` 28 | 29 | Source: https://docs.aws.amazon.com/databrew/latest/dg/recipe-actions.NEST_TO_MAP.html 30 | -------------------------------------------------------------------------------- /how-to/Glue-resolve-data-reprocessing-with-job-bookmarks-enabled.md: -------------------------------------------------------------------------------- 1 | # Error: A job is reprocessing data when job bookmarks are enabled 2 | 3 | A data engineer needs to debug an AWS Glue job that reads from Amazon S3 and writes to Amazon Redshift. The data engineer enabled the bookmark feature for the AWS Glue job. 4 | The data engineer has set the maximum concurrency for the AWS Glue job to 1. 5 | 6 | The AWS Glue job is successfully writing the output to Amazon Redshift. However, the Amazon S3 files that were loaded during previous runs of the AWS Glue job are being reprocessed by subsequent runs. 7 | 8 | **What is the likely reason the AWS Glue job is reprocessing the files?** 9 | 10 | **Response:** 11 | 12 | The AWS Glue job does not have a required commit statement. 13 | 14 | ### Documentation quote: 15 | ``` 16 | Missing Job Object 17 | Ensure that your job run script ends with the following commit: 18 | 19 | job.commit() 20 | 21 | When you include this object, AWS Glue records the timestamp and path of the job run. 22 | If you run the job again with the same path, AWS Glue processes only the new files. 23 | 24 | If you don't include this object and job bookmarks are enabled, 25 | the job reprocesses the already processed files along with the new files 26 | and creates redundancy in the job's target data store. 27 | ``` 28 | 29 | Source: https://docs.aws.amazon.com/glue/latest/dg/glue-troubleshooting-errors.html#error-job-bookmarks-reprocess-data 30 | -------------------------------------------------------------------------------- /how-to/Glue-speed-up-crawler.md: -------------------------------------------------------------------------------- 1 | # Why is the AWS Glue crawler running for a long time? 2 | 3 | **Steps to approach:** 4 | 5 | ## 1. Do you need a crawler? 6 | 7 | Unless you need to create a table in the AWS Glue Data Catalog and use the table in an extract, transform, and load (ETL) job or a downstream service, such as Amazon Athena, you don't need to run a crawler. 8 | For ETL jobs, you can use [from_options](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.html#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader-from_options) to read the data directly from the data store and use the transformations on the DynamicFrame. When you do this, you don't need a crawler in your ETL pipeline. 9 | 10 | ## 2. 
Review if you add a lot of files or folders to your data store between crawler runs 11 | 12 | During the first crawler run, the crawler reads the first megabyte of each file to infer the schema. During subsequent crawler runs, the crawler lists files in the target, including files that were crawled during the first run, and reads the first megabyte of new files. 13 | The crawler doesn't read files that were read in the previous crawler run. This means that subsequent crawler runs are often faster. This is due to the [incremental crawl feature](https://docs.aws.amazon.com/glue/latest/dg/incremental-crawls.html), if activated. 14 | With this option, crawler only reads new data in subsequent crawl runs. However, when you add a lot of files or folders to your data store between crawler runs, the run time increases each time. 15 | 16 | ## 3. Are your files compressed? 17 | 18 | Compressed files take longer to crawl. That's because the crawler must download the file and decompress it before reading the first megabyte or listing the file. 19 | 20 | ## 4. Use an exclude pattern 21 | 22 | An exclude pattern tells the crawler to skip certain files or paths. 23 | Exclude patterns reduce the number of files that the crawler must list, making the crawler run faster. 24 | For example, use an exclude pattern to exclude meta files and files that have already been crawled. For more information, including examples of exclude patterns, see [Include and exclude patterns](https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html#crawler-data-stores-exclude). 25 | 26 | ## 5. Use the sample size feature 27 | 28 | The AWS Glue crawler supports the [sample size feature](https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html). With this feature, you can specify the number of files in each leaf folder to be crawled when crawling sample files in a dataset. 29 | When this feature is turned on, the crawler randomly selects some files in each leaf folder to crawl instead of crawling all the files in the dataset. If you have previous knowledge about your data formats and know that schemas in your folders do not change, then use the sampling crawler. Turning on this feature significantly reduces the crawler run time. 30 | 31 | ## 6. Run multiple crawlers 32 | 33 | Instead of running one crawler on the entire data store, consider running multiple crawlers. 34 | Running multiple crawlers for a short amount of time is better than running one crawler for a long time. For example, assume that you are partitioning your data by year, and that each partition contains a large amount of data. If you run a different crawler on each partition (each year), the crawlers complete faster. 35 | 36 | ## 7. Combine smaller files to create larger ones 37 | 38 | It takes more time to crawl a large number of small files than a small number of large files. That's because the crawler must list each file and must read the first megabyte of each new file. 39 | 40 | Source: https://repost.aws/knowledge-center/long-running-glue-crawler 41 | -------------------------------------------------------------------------------- /how-to/README.md: -------------------------------------------------------------------------------- 1 | # How To: Data Engineering on AWS 2 | 3 | This directory contains tips, tricks, and 'secrets' for performing various data engineering tasks on AWS. 
4 | 5 | ## Content 6 | 7 | Explore practical guides and best practices on: 8 | - Optimizing AWS data workflows 9 | - Efficient use of AWS services like Redshift, Athena, Glue, Kinesis, S3, etc. 10 | - Troubleshooting common issues 11 | - Enhancing performance and reducing costs 12 | -------------------------------------------------------------------------------- /how-to/Redshift-cluster-balance-the-load.md: -------------------------------------------------------------------------------- 1 | # Choose data distribution styles 2 | 3 | A company uses an Amazon Redshift provisioned cluster as its database. The Redshift cluster has five reserved ra3.4xlarge nodes and uses key distribution. 4 | A data engineer notices that one of the nodes frequently has a CPU load over 90%. SQL Queries that run on the node are queued. The other four nodes usually have a CPU load under 15% during daily operations. 5 | The data engineer wants to maintain the current number of compute nodes. The data engineer also wants to balance the load more evenly across all five compute nodes. 6 | 7 | **Which solution will meet these requirements?** 8 | 9 | **Response:** 10 | 11 | Change the distribution key to the table column that has the largest dimension. 12 | 13 | ### Documentation quote: 14 | ``` 15 | KEY distribution 16 | 17 | The rows are distributed according to the values in one column. 18 | The leader node places matching values on the same node slice. 19 | If you distribute a pair of tables on the joining keys, 20 | the leader node collocates the rows on the slices according to the values in the joining columns. 21 | This way, matching values from the common columns are physically stored together. 22 | ``` 23 | 24 | Source: https://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html 25 | -------------------------------------------------------------------------------- /how-to/Redshift-ingest-data-from-Kinesis-Data-Streams.md: -------------------------------------------------------------------------------- 1 | # Read data from Kinesis Data Streams using Amazon Redshift 2 | 3 | A technology company currently uses Amazon Kinesis Data Streams to collect log data in real time. The company wants to use Amazon Redshift for downstream real-time queries and to enrich the log data. 4 | 5 | **Which solution will ingest data into Amazon Redshift with the LEAST operational overhead?** 6 | 7 | **Response:** 8 | 9 | Use Amazon Redshift streaming ingestion from Kinesis Data Streams and to present data as a materialized view 10 | 11 | ### Documentation quote: 12 | ``` 13 | Amazon Redshift supports streaming ingestion from Amazon Kinesis Data Streams. 14 | 15 | The Amazon Redshift streaming ingestion feature provides low-latency, high-speed ingestion of streaming data 16 | from Amazon Kinesis Data Streams into an Amazon Redshift materialized view. 17 | ``` 18 | 19 | Source: https://docs.aws.amazon.com/streams/latest/dev/using-other-services-redshift.html 20 | -------------------------------------------------------------------------------- /how-to/Redshift-monitor-events.md: -------------------------------------------------------------------------------- 1 | # Monitoring events for the Amazon Redshift Data API in Amazon EventBridge 2 | 3 | A company loads transaction data for each day into Amazon Redshift tables at the end of each day. The company wants to have the ability to track which tables have been loaded and which tables still need to be loaded. 
4 | A data engineer wants to store the load statuses of Redshift tables in an Amazon DynamoDB table. The data engineer creates an AWS Lambda function to publish the details of the load statuses to DynamoDB. 5 | 6 | 7 | **How should the data engineer invoke the Lambda function to write load statuses to the DynamoDB table?** 8 | 9 | **Response:** 10 | 11 | 1. Use the Amazon Redshift Data API to publish an event to Amazon EventBridge. 12 | 2. Configure an EventBridge rule to invoke the Lambda function. 13 | 14 | 15 | Documentation quote: 16 | ``` 17 | You can monitor Data API events in EventBridge, 18 | which delivers a stream of real-time data from your own applications, 19 | software-as-a-service (SaaS) applications, and AWS services. 20 | 21 | EventBridge routes that data to targets such as AWS Lambda and Amazon SNS. 22 | ``` 23 | 24 | 25 | Source: https://docs.aws.amazon.com/redshift/latest/mgmt/data-api-monitoring-events.html 26 | -------------------------------------------------------------------------------- /how-to/S3-redact-PII-dynamically.md: -------------------------------------------------------------------------------- 1 | # Using Amazon S3 Object Lambda Access Points for personally identifiable information (PII) 2 | 3 | A company has multiple applications that use datasets that are stored in an Amazon S3 bucket. The company has an ecommerce application that generates a dataset that contains personally identifiable information (PII). The company has an internal analytics application that does not require access to the PII. 4 | To comply with regulations, the company must not share PII unnecessarily. A data engineer needs to implement a solution that will redact PII dynamically, based on the needs of each application that accesses the dataset. 5 | 6 | **Which solution will meet the requirements with the LEAST operational overhead?** 7 | 8 | **Response:** 9 | 10 | 1. Create an S3 Object Lambda endpoint. 11 | 2. Use the S3 Object Lambda endpoint to read data from the S3 bucket. 12 | 3. Implement redaction logic within an S3 Object Lambda function to dynamically redact PII based on the needs of each application that accesses the data. 13 | 14 | ### Documentation quote: 15 | ``` 16 | Use Amazon S3 Object Lambda Access Points for personally identifiable information (PII) 17 | to configure how documents are retrieved from your Amazon S3 bucket. 18 | 19 | You can control access to documents that contain PII and redact PII from documents. 20 | ``` 21 | 22 | Source: https://docs.aws.amazon.com/comprehend/latest/dg/using-access-points.html 23 | -------------------------------------------------------------------------------- /how-to/add-aws-glue-crawler.md: -------------------------------------------------------------------------------- 1 | # Adding an AWS Glue crawler 2 | 3 | A company stores daily records of the financial performance of investment portfolios in .csv format in an Amazon S3 bucket. A data engineer uses AWS Glue crawlers to crawl the S3 data. 4 | The data engineer must make the S3 data accessible daily in the AWS Glue Data Catalog. 5 | 6 | **Which solution will meet these requirements?** 7 | 8 | **Response:** 9 | 10 | 1. Create an IAM role that includes the AWSGlueServiceRole policy. 11 | 2. Associate the role with the crawler. 12 | 3. Specify the S3 bucket path of the source data as the crawler's data store. 13 | 4. Create a daily schedule to run the crawler. 14 | 5. Specify a database name for the output (see the boto3 sketch below).
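A minimal boto3 (Python) sketch of the crawler setup described above. The role ARN, S3 path, database name, and cron expression are hypothetical placeholders, not values from the source.

```python
import boto3

# Sketch only: the role ARN, S3 path, database name, and schedule are placeholders.
glue = boto3.client("glue")

glue.create_crawler(
    Name="daily-portfolio-csv-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",  # role with the AWSGlueServiceRole policy attached
    DatabaseName="portfolio_performance",                   # Data Catalog database that receives the tables
    Targets={"S3Targets": [{"Path": "s3://example-portfolio-financials/daily/"}]},
    Schedule="cron(15 0 * * ? *)",                          # daily, shortly after the new .csv files land
)

# Optionally start the first run immediately instead of waiting for the schedule.
glue.start_crawler(Name="daily-portfolio-csv-crawler")
```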
15 | 16 | 17 | Source: https://docs.aws.amazon.com/glue/latest/dg/tutorial-add-crawler.html 18 | -------------------------------------------------------------------------------- /how-to/calculate_data_transfer_time.md: -------------------------------------------------------------------------------- 1 | Suppose you want to take the least amount of time for your data migration with minimal disruption to your existing environment and workflow. AWS uses a data transfer formula to determine the length of time for an online transfer of any given amount of data. 2 | 3 | ``` 4 | (Terabytes * 8 bits per Byte)/(CIRCUIT gigabits per second * NETWORK_UTILIZATION 5 | percent * 3600 seconds per hour * AVAILABLE_HOURS) = Number of Days 6 | ``` 7 | 8 | If you have a 1 Gbps internet connection and 100 TB of data to migrate to AWS, the minimum time to complete an online transfer at about 80 percent network use is approximately 12 days. 9 | 10 | ``` 11 | (100,000,000,000,000 Bytes * 8 bits per byte) /(1,000,000,000 bps * 80 percent 12 | * 3600 seconds per hour * 24 hours per day) = 11.57 days 13 | ``` 14 | 15 | This calculation assumes a full 24 hours for each day for the data migration to AWS. The 80 percent of network usage is an approximation to determine a theoretical number of days. You can modify the formula to your exact parameters. With the results you calculate, you can compare the theoretical amount of days to your project's timeline to determine whether an online or offline data transfer is right for you. 16 | 17 | The following is a table that provides some guidance to determine if an online or offline data transfer will take the least amount of time. 18 | 19 | ### Large data migration – Time to transfer online 20 | 21 | | | 1 Gbps | 2 Gbps | 5 Gbps | 10 Gbps | 22 | |-------|----------|---------|----------|----------| 23 | |100 TB | 12 days | 6 days | 3 days | 30 hours | 24 | |500 TB | 58 days |29 days | 12 days | 6 days | 25 | |5 PB | 2 years | 1 year | 116 days | 58 days | 26 | |10 PB | 3 years |2 years | 232 days | 116 days | 27 | 28 | 29 | Comparatively, using a Snowball Edge device for an offline transfer takes about **25–30 days end-to-end for the majority of use cases**. For your large-scale data migration, you will want more than one device at a time at your data center. 30 | 31 | Source: https://explore.skillbuilder.aws/learn/course/15545/play/76124/planning-large-scale-data-migrations-to-aws 32 | -------------------------------------------------------------------------------- /how-to/detect-process-sensitive-data-glue.md: -------------------------------------------------------------------------------- 1 | # Detect and process sensitive data 2 | 3 | A data engineer must use AWS services to ingest a dataset into an Amazon S3 data lake. The data engineer profiles the dataset and discovers that the dataset contains personally identifiable information (PII). The data engineer must implement a solution to profile the dataset and obfuscate the PII. 4 | 5 | **Which solution will meet this requirement with the LEAST operational effort?** 6 | 7 | **Response:** 8 | 9 | 1. Use the Detect PII transform in AWS Glue Studio to identify the PII. 10 | 2. Obfuscate the PII. 11 | 3. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake. 
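To illustrate the orchestration step above, here is a minimal boto3 (Python) sketch that registers an AWS Step Functions state machine which runs a Glue job and waits for it to complete. It assumes a hypothetical Glue job (`detect-and-obfuscate-pii`) that applies the Detect PII transform and writes the obfuscated output to the S3 data lake; all names and ARNs are placeholders.

```python
import json
import boto3

# Sketch only: the Glue job name, role ARN, and state machine name are placeholders.
# The Glue job itself is assumed to apply the Detect PII transform and obfuscation
# before writing the cleaned data to the S3 data lake.
definition = {
    "Comment": "Orchestrate PII detection, obfuscation, and data lake ingestion",
    "StartAt": "RunPiiGlueJob",
    "States": {
        "RunPiiGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",  # waits for the Glue job to finish
            "Parameters": {"JobName": "detect-and-obfuscate-pii"},
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="pii-ingestion-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::111122223333:role/StepFunctionsGlueRole",
)
```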
12 | 13 | 14 | Source: https://docs.aws.amazon.com/glue/latest/dg/detect-PII.html 15 | -------------------------------------------------------------------------------- /how-to/improve-Redshift-Spectrum-query-performance.md: -------------------------------------------------------------------------------- 1 | # Improving Amazon Redshift Spectrum query performance 2 | 3 | A company is building an analytics solution. The solution uses Amazon S3 for data lake storage and Amazon Redshift for a data warehouse. The company wants to use Amazon Redshift Spectrum to query the data that is in Amazon S3. 4 | 5 | **Which actions will provide the FASTEST queries?** 6 | 7 | **Response:** 8 | 9 | 1. Use a columnar storage file format. 10 | 2. Partition the data based on the most common query predicates. 11 | 12 | ## Documentation quote: 13 | ``` 14 | Use Apache Parquet formatted data files. 15 | Parquet stores data in a columnar format, so Redshift Spectrum can eliminate unneeded columns from the scan. 16 | When data is in text-file format, Redshift Spectrum needs to scan the entire file. 17 | 18 | Use multiple files to optimize for parallel processing. 19 | Keep your file sizes larger than 64 MB. 20 | Avoid data size skew by keeping files about the same size. 21 | 22 | Use partitions to limit the data that is scanned. 23 | Partition your data based on your most common query predicates, then prune partitions by filtering on partition columns. 24 | ``` 25 | 26 | Source: https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-performance.html 27 | -------------------------------------------------------------------------------- /how-to/streaming-ingestion-to-Redshift.md: -------------------------------------------------------------------------------- 1 | # Streaming ingestion to a materialized view 2 | 3 | A healthcare company uses Amazon Kinesis Data Streams to stream real-time health data from wearable devices, hospital equipment, and patient records. 4 | A data engineer needs to find a solution to process the streaming data. The data engineer needs to store the data in an Amazon Redshift Serverless warehouse. The solution must support near real-time analytics of the streaming data and the previous day's data. 5 | 6 | **Which solution will meet these requirements with the LEAST operational overhead?** 7 | 8 | **Response:** 9 | 10 | Use the streaming ingestion feature of Amazon Redshift. 11 | 12 | ### Documentation quote: 13 | ``` 14 | Streaming ingestion provides low-latency, high-speed data ingestion from Amazon Kinesis Data Streams 15 | or Amazon Managed Streaming for Apache Kafka 16 | to an Amazon Redshift provisioned or Amazon Redshift Serverless database. 17 | 18 | The data lands in a Redshift materialized view that's configured for the purpose. 19 | This results in fast access to external data. 20 | 21 | Streaming ingestion lowers data-access time and reduces storage cost. 22 | 23 | You can configure it for your Amazon Redshift cluster or for your Amazon Redshift Serverless workgroup, 24 | using a small collection of SQL commands. 25 | After it's set up, each materialized-view refresh can ingest hundreds of megabytes of data per second. 
26 | ``` 27 | 28 | Source: https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-streaming-ingestion.html 29 | -------------------------------------------------------------------------------- /how-to/turn-on-concurrency-scaling-Redshift.md: -------------------------------------------------------------------------------- 1 | # Configuring concurrency scaling queues 2 | 3 | A company uses an Amazon Redshift cluster that runs on RA3 nodes. The company wants to scale read and write capacity to meet demand. A data engineer needs to identify a solution that will turn on concurrency scaling. 4 | 5 | **Which solution will meet this requirement?** 6 | 7 | -------------------------------- 8 | 9 | **Response:** 10 | 11 | Turn on concurrency scaling at the workload management (WLM) queue level in the Redshift cluster. 12 | 13 | -------------------------------- 14 | Documentation quote: 15 | ``` 16 | You route queries to concurrency scaling clusters 17 | by enabling concurrency scaling in a workload manager (WLM) queue. 18 | 19 | To turn on concurrency scaling for a queue, set the Concurrency Scaling mode value to auto. 20 | ``` 21 | ------------------------------- 22 | Source: https://docs.aws.amazon.com/redshift/latest/dg/concurrency-scaling-queues.html 23 | -------------------------------------------------------------------------------- /how-to/upgrade-EBS-volumes-from-gp2-to-gp3.md: -------------------------------------------------------------------------------- 1 | # Migrate your Amazon EBS volumes from gp2 to gp3 2 | 3 | A company is planning to upgrade its Amazon Elastic Block Store (Amazon EBS) General Purpose SSD storage from gp2 to gp3. The company wants to prevent any interruptions in its Amazon EC2 instances that will cause data loss during the migration to the upgraded storage. 4 | 5 | **Which solution will meet these requirements with the LEAST operational overhead?** 6 | 7 | **Response:** 8 | 9 | 1. Change the volume type of the existing gp2 volumes to gp3. 10 | 2. Enter new values for volume size, IOPS, and throughput. 11 | 12 | ### AWS Blog post quote: 13 | ``` 14 | 1. Open the Amazon EC2 console. 15 | 2. Choose Volumes, select the volume to modify, and then choose Actions, Modify Volume. 16 | 3. The Modify Volume window displays the volume ID and the volume’s current configuration, including type, size, IOPS, and throughput. 17 | Set new configuration values as follows: 18 | 19 | - To modify the type, choose gp3 for Volume Type. 20 | - To modify the size, enter a new value for Size. 21 | - To modify the IOPS, enter a new value for IOPS. 22 | - To modify the throughput, if the volume type is gp3, enter a new value for Throughput. 23 | - After you have finished changing the volume settings, choose Modify. When prompted for confirmation, choose Yes. 24 | ``` 25 | 26 | Source: https://aws.amazon.com/blogs/storage/migrate-your-amazon-ebs-volumes-from-gp2-to-gp3-and-save-up-to-20-on-costs/ 27 | -------------------------------------------------------------------------------- /images/exam_score.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dashapetr/aws-data-engineer-certification/0b8a16bb518d8e3f6d100c7a6cd850eb4e7ffdec/images/exam_score.png --------------------------------------------------------------------------------
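Rounding out the gp2-to-gp3 how-to above: the same in-place modification can be scripted instead of done in the console. A minimal boto3 (Python) sketch, assuming a hypothetical volume ID and baseline gp3 performance values:

```python
import boto3

# Sketch only: the volume ID and performance values are placeholders.
ec2 = boto3.client("ec2")

# Modifying the volume in place keeps it attached, so the EC2 instance stays up
# and no data is lost while the type changes from gp2 to gp3.
ec2.modify_volume(
    VolumeId="vol-0123456789abcdef0",
    VolumeType="gp3",
    Iops=3000,         # gp3 baseline; raise it if the gp2 volume depended on burst IOPS
    Throughput=125,    # MiB/s, gp3 baseline
)

# The modification continues in the background; its progress can be polled.
mods = ec2.describe_volumes_modifications(VolumeIds=["vol-0123456789abcdef0"])
print(mods["VolumesModifications"][0]["ModificationState"])
```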