├── 701-Using-Kinesis-Stream-for-Ingestion-of-Streaming-Data
│   └── README.md
├── 702-Streaming-Data-to-S3-Using-Kinesis-Firehose
│   ├── README.md
│   └── images
│       ├── solution.png
│       ├── step1.png
│       ├── step2.png
│       ├── step3.png
│       ├── step4.png
│       ├── validation1.png
│       └── validation2.png
├── 703-Automatically-Discover-Metadata-with-Clue-Crawlers
│   └── README.md
├── 704-Query-Files-on-S3-Using-Amazon-Athena
│   └── README.md
├── 705-Transforming-Data-with-Glue-DataBrew
│   └── README.md
├── LICENSE
└── README.md

/701-Using-Kinesis-Stream-for-Ingestion-of-Streaming-Data/README.md:
--------------------------------------------------------------------------------
# Using a Kinesis Stream for Ingestion of Streaming Data

## Clean up

### Delete the Kinesis stream:
`aws kinesis delete-stream --stream-name AWSCookbook701`

### Clean up the local variable you created:
`unset SHARD_ITERATOR`

--------------------------------------------------------------------------------
/702-Streaming-Data-to-S3-Using-Kinesis-Firehose/README.md:
--------------------------------------------------------------------------------
# Streaming Data to Amazon S3 Using Amazon Kinesis Data Firehose

## Problem
You need to deliver incoming streaming data to object storage.

## Solution
Create an S3 bucket, create a Kinesis stream, and configure Kinesis Data Firehose to deliver the stream data to the S3 bucket.

![Solution](./images/solution.png)

### Prerequisites
- Kinesis stream
- S3 bucket

### Preparation

### Set a unique suffix to use for the S3 bucket name:
```
RANDOM_STRING=$(aws secretsmanager get-random-password \
    --exclude-punctuation --exclude-uppercase \
    --password-length 6 --require-each-included-type \
    --output text \
    --query RandomPassword)
```

### Create an S3 bucket for data:
```
aws s3api create-bucket --bucket awscookbook702-$RANDOM_STRING
```

### Create a Kinesis stream:
```
aws kinesis create-stream --stream-name AWSCookbook702 --shard-count 1
```

### Confirm that your stream is in the ACTIVE state:
```
aws kinesis describe-stream-summary --stream-name AWSCookbook702
```

## Steps
1. Open the Kinesis Data Firehose console and click the “Create delivery stream” button; choose Amazon Kinesis Data Streams for the source and Amazon S3 for the destination, as shown in Figure 2.

![Figure 2](./images/step1.png)

2. For Source settings, choose the Kinesis stream that you created in the preparation steps, as shown in Figure 3.

![Figure 3](./images/step2.png)

3. Keep the defaults (Disabled) for the “Transform and convert records” options. For Destination settings, browse for and choose the S3 bucket that you created in the preparation steps, as shown in Figure 4, and keep the defaults for the other options (disabled partitioning and no prefixes).

![Figure 4](./images/step3.png)

4. Under Advanced settings, confirm that “Create or update IAM role” is selected, as shown in Figure 5. This creates an IAM role that Kinesis Data Firehose can use to read from the stream and write to the S3 bucket.

![Figure 5](./images/step4.png)
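The console steps above can also be performed with the AWS CLI. The following is a minimal sketch rather than the recipe's exact configuration: the role name `AWSCookbook702Role` and the delivery stream name `AWSCookbook702Firehose` are placeholders, and the sketch assumes you have already created an IAM role that allows Firehose to read from the stream and write to the bucket.

```
# Sketch only: ROLE_ARN points at a pre-created IAM role (hypothetical name)
# that grants Firehose read access to the stream and write access to the bucket.
ROLE_ARN=arn:aws:iam::$AWS_ACCOUNT_ID:role/AWSCookbook702Role

# Look up the ARN of the source stream created in the preparation steps.
STREAM_ARN=$(aws kinesis describe-stream-summary \
    --stream-name AWSCookbook702 \
    --query 'StreamDescriptionSummary.StreamARN' --output text)

# Create a delivery stream that reads from the Kinesis stream and writes to S3.
aws firehose create-delivery-stream \
    --delivery-stream-name AWSCookbook702Firehose \
    --delivery-stream-type KinesisStreamAsSource \
    --kinesis-stream-source-configuration \
        KinesisStreamARN=$STREAM_ARN,RoleARN=$ROLE_ARN \
    --extended-s3-destination-configuration \
        BucketARN=arn:aws:s3:::awscookbook702-$RANDOM_STRING,RoleARN=$ROLE_ARN
```

The recipe itself sticks to the console because the “Create or update IAM role” option in step 4 generates an equivalent role for you.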
## Validation checks

You can test delivery to the stream from within the Kinesis console. Click the Delivery streams link in the left navigation menu, choose the stream you created, expand the “Test with demo data” section, and click the “Start sending demo data” button. This sends sample data to your stream so that you can verify it is arriving in your S3 bucket.

![Figure 6](./images/validation1.png)
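If you would rather generate test traffic from the command line than use the console's demo-data generator, you can put a record shaped like the demo data onto the source stream yourself. This is a sketch only: the record contents and partition key are arbitrary, and the `--cli-binary-format` option assumes AWS CLI v2.

```
# Put a single demo-style record on the source stream (AWS CLI v2 syntax;
# --cli-binary-format tells the CLI the data is raw JSON rather than base64).
aws kinesis put-record \
    --stream-name AWSCookbook702 \
    --partition-key ticker \
    --cli-binary-format raw-in-base64-out \
    --data '{"CHANGE":1.23,"PRICE":50.00,"TICKER_SYMBOL":"TEST","SECTOR":"TECHNOLOGY"}'
```

After the Firehose buffering interval elapses, the record should appear in the bucket alongside the demo data.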
After a few minutes, you will see a folder structure and a file appear in your S3 bucket, similar to Figure 7.

![Figure 7](./images/validation2.png)

If you download and inspect the file, you will see output similar to the following:

```
{"CHANGE":3.95,"PRICE":79.75,"TICKER_SYMBOL":"SLW","SECTOR":"ENERGY"}
{"CHANGE":7.27,"PRICE":96.37,"TICKER_SYMBOL":"ALY","SECTOR":"ENERGY"}
{"CHANGE":-5,"PRICE":81.74,"TICKER_SYMBOL":"QXZ","SECTOR":"HEALTHCARE"}
{"CHANGE":-0.6,"PRICE":98.4,"TICKER_SYMBOL":"NFLX","SECTOR":"TECHNOLOGY"}
{"CHANGE":-0.46,"PRICE":18.92,"TICKER_SYMBOL":"PLM","SECTOR":"FINANCIAL"}
{"CHANGE":4.09,"PRICE":100.46,"TICKER_SYMBOL":"ALY","SECTOR":"ENERGY"}
{"CHANGE":2.06,"PRICE":32.34,"TICKER_SYMBOL":"PPL","SECTOR":"HEALTHCARE"}
{"CHANGE":-2.99,"PRICE":38.98,"TICKER_SYMBOL":"KFU","SECTOR":"ENERGY"}
```

## Clean up

You can delete the Kinesis Data Firehose delivery stream and the Kinesis data stream from within the Kinesis console. Click the Delivery streams link in the left navigation menu, choose the stream you created, and click the Delete delivery stream button. Click the Data streams link, choose your data stream, and click the Delete stream button.

Empty the S3 bucket that you created and delete the bucket using the S3 console. You can click the “Empty” button and then the Delete button to accomplish this.

### Delete the Kinesis stream:
`aws kinesis delete-stream --stream-name AWSCookbook702`

### Clean up the local variable you created:
`unset RANDOM_STRING`

## Discussion

As you begin to ingest data from various sources, your application may be consuming or reacting to streaming data in real time. In some cases, you may want to store the data from the stream to query or process it later. You can use Kinesis Data Firehose to deliver data to object storage (S3), Amazon Redshift, OpenSearch, and many third-party endpoints. You can also connect multiple delivery streams to a single producer stream if you need to deliver the data to multiple locations.

> Note: Kinesis Data Firehose scales automatically to handle the volume of data you need to deliver, meaning that you do not have to configure or provision additional resources if your data stream starts to receive large volumes of data. For more information on Kinesis Data Firehose features and capabilities, see the AWS documentation for [Kinesis Data Firehose](https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html).

If you need to transform data before it reaches the destination via Firehose, you can configure transformations. A transformation automatically invokes a Lambda function as your streaming data queues up (see [this Firehose article](https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html) for buffer size information). This is useful when you have to adjust the schema of a record before delivery, sanitize data for long-term storage on the fly (e.g., remove personally identifiable information), or join the data with other sources. The transformation Lambda function you invoke must follow the convention specified by the Kinesis Data Firehose API. To see some examples of Lambda functions, go to the [AWS Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo/) and search for “firehose.”

## Challenge

Configure a Firehose delivery stream with a transformation that removes a field from the streaming data before delivery.

--------------------------------------------------------------------------------
/702-Streaming-Data-to-S3-Using-Kinesis-Firehose/images/solution.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AWSCookbook/BigData/cef7fcbc3e5cac7159fa5132563fb9fdaf25bdc8/702-Streaming-Data-to-S3-Using-Kinesis-Firehose/images/solution.png

--------------------------------------------------------------------------------
/702-Streaming-Data-to-S3-Using-Kinesis-Firehose/images/step1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AWSCookbook/BigData/cef7fcbc3e5cac7159fa5132563fb9fdaf25bdc8/702-Streaming-Data-to-S3-Using-Kinesis-Firehose/images/step1.png

--------------------------------------------------------------------------------
/702-Streaming-Data-to-S3-Using-Kinesis-Firehose/images/step2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AWSCookbook/BigData/cef7fcbc3e5cac7159fa5132563fb9fdaf25bdc8/702-Streaming-Data-to-S3-Using-Kinesis-Firehose/images/step2.png

--------------------------------------------------------------------------------
/702-Streaming-Data-to-S3-Using-Kinesis-Firehose/images/step3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AWSCookbook/BigData/cef7fcbc3e5cac7159fa5132563fb9fdaf25bdc8/702-Streaming-Data-to-S3-Using-Kinesis-Firehose/images/step3.png

--------------------------------------------------------------------------------
/702-Streaming-Data-to-S3-Using-Kinesis-Firehose/images/step4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AWSCookbook/BigData/cef7fcbc3e5cac7159fa5132563fb9fdaf25bdc8/702-Streaming-Data-to-S3-Using-Kinesis-Firehose/images/step4.png

--------------------------------------------------------------------------------
/702-Streaming-Data-to-S3-Using-Kinesis-Firehose/images/validation1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AWSCookbook/BigData/cef7fcbc3e5cac7159fa5132563fb9fdaf25bdc8/702-Streaming-Data-to-S3-Using-Kinesis-Firehose/images/validation1.png

--------------------------------------------------------------------------------
/702-Streaming-Data-to-S3-Using-Kinesis-Firehose/images/validation2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AWSCookbook/BigData/cef7fcbc3e5cac7159fa5132563fb9fdaf25bdc8/702-Streaming-Data-to-S3-Using-Kinesis-Firehose/images/validation2.png

--------------------------------------------------------------------------------
/703-Automatically-Discover-Metadata-with-Clue-Crawlers/README.md:
--------------------------------------------------------------------------------
# Automatically Discover Metadata with AWS Glue Crawlers

## Preparation

### Set a unique suffix to use for the S3 bucket name:
```
RANDOM_STRING=$(aws secretsmanager get-random-password \
    --exclude-punctuation --exclude-uppercase \
    --password-length 6 --require-each-included-type \
    --output text \
    --query RandomPassword)
```

### Create an S3 bucket for data:
```
aws s3api create-bucket --bucket awscookbook703-$RANDOM_STRING
```

### Create S3 prefixes for results and data:
```
aws s3api put-object --bucket awscookbook703-$RANDOM_STRING --key results/
aws s3api put-object --bucket awscookbook703-$RANDOM_STRING --key data/
```

### Download a public dataset:
```
curl -vk -o ComicsResearcherFormat_202104_csv.zip https://www.bl.uk/bibliographic/downloads/ComicsResearcherFormat_202104_csv.zip
```

### Unzip the dataset:
```
unzip ComicsResearcherFormat_202104_csv.zip
```

### Copy the titles.csv file to your S3 bucket:
```
aws s3 cp titles.csv s3://awscookbook703-$RANDOM_STRING/data/
```
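This README covers only the preparation and clean-up; in the recipe, the crawler itself is created in the AWS Glue console. As a rough CLI sketch of that step, the following assumes a pre-created IAM role with Glue service permissions and read access to the bucket. The names `AWSCookbook703GlueRole`, `awscookbook703db`, and `awscookbook703crawler` are placeholders, not the book's exact values.

```
# Sketch: create a Glue database and a crawler that scans the data/ prefix.
# AWSCookbook703GlueRole is a hypothetical, pre-created IAM role.
aws glue create-database --database-input Name=awscookbook703db

aws glue create-crawler \
    --name awscookbook703crawler \
    --role AWSCookbook703GlueRole \
    --database-name awscookbook703db \
    --targets "{\"S3Targets\": [{\"Path\": \"s3://awscookbook703-$RANDOM_STRING/data/\"}]}"

# Run the crawler, then inspect the table it discovers.
aws glue start-crawler --name awscookbook703crawler
aws glue get-tables --database-name awscookbook703db
```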
## Clean up

### Delete the Crawler:
Select the crawler in the AWS Glue console, select Actions, and choose Delete.

### Empty and delete the S3 bucket:
Empty the S3 bucket you created in the preparation steps, then delete the bucket.

--------------------------------------------------------------------------------
/704-Query-Files-on-S3-Using-Amazon-Athena/README.md:
--------------------------------------------------------------------------------
# Query Files on S3 Using Amazon Athena

## Preparation

### Set a unique suffix to use for the S3 bucket name:
```
RANDOM_STRING=$(aws secretsmanager get-random-password \
    --exclude-punctuation --exclude-uppercase \
    --password-length 6 --require-each-included-type \
    --output text \
    --query RandomPassword)
```

### Create an S3 bucket for data:
```
aws s3api create-bucket --bucket awscookbook704-$RANDOM_STRING
```

### Create S3 prefixes for results and data:
```
aws s3api put-object --bucket awscookbook704-$RANDOM_STRING --key results/
aws s3api put-object --bucket awscookbook704-$RANDOM_STRING --key data/
```

### Download a public dataset:
```
curl -vk -o ComicsResearcherFormat_202104_csv.zip https://www.bl.uk/bibliographic/downloads/ComicsResearcherFormat_202104_csv.zip
```

### Unzip the dataset:
```
unzip ComicsResearcherFormat_202104_csv.zip
```

### Copy the titles.csv file to your S3 bucket:
```
aws s3 cp titles.csv s3://awscookbook704-$RANDOM_STRING/data/
```

## Clean up

Open the Athena Query Editor and run the following two SQL statements:

```
DROP TABLE `awscookbook704table`;

DROP DATABASE `awscookbook704db`;
```

Delete the contents of the S3 bucket you created and delete the bucket. The S3 console provides an easy way to empty the bucket with the “Empty” button.
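The same two statements can also be submitted from the CLI with `aws athena start-query-execution`. This is a sketch that assumes the default (primary) workgroup; it uses the results/ prefix created in the preparation steps as the query output location.

```
# Submit the DROP statements through the Athena API instead of the Query Editor.
aws athena start-query-execution \
    --query-string 'DROP TABLE `awscookbook704table`' \
    --query-execution-context Database=awscookbook704db \
    --result-configuration OutputLocation=s3://awscookbook704-$RANDOM_STRING/results/

aws athena start-query-execution \
    --query-string 'DROP DATABASE `awscookbook704db`' \
    --result-configuration OutputLocation=s3://awscookbook704-$RANDOM_STRING/results/
```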
--------------------------------------------------------------------------------
/705-Transforming-Data-with-Glue-DataBrew/README.md:
--------------------------------------------------------------------------------
# Transforming Data with AWS Glue DataBrew

## Clean up

From the Actions menu in the top right-hand corner, select “Delete project”.

Delete the dataset. See https://docs.aws.amazon.com/databrew/latest/dg/datasets.deleting.html

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2021 John Culkin

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Chapter 7 - Big Data

## Set and export your default region:

`export AWS_REGION=us-east-1`

## Set your AWS account ID:

`AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)`

## Validate AWS CLI setup and access:

`aws ec2 describe-instances`
--------------------------------------------------------------------------------