├── README.md
└── images
    ├── AWS Glue table has been added.png
    ├── AWS-Glue-table-has-been-added.png
    ├── Athena query editor.png
    ├── Athena run query.png
    ├── Athena-query-editor.png
    ├── Athena-run-query.png
    ├── IAM role policies.png
    ├── IAM-role-policies.png
    ├── README
    ├── disable.png
    ├── job running.png
    ├── job-running.png
    ├── s3-glue-data-lake.gif
    ├── table information.png
    ├── table-for-transformed-parquet-data.png
    └── table-information.png


/README.md:
--------------------------------------------------------------------------------
Building a Data Lake with AWS Glue and Amazon S3
===============================================================


## Scenario
The following procedures help you set up a data lake that can store and analyze massive volumes of heterogeneous data. A data lake allows organizations to store all their data, structured and unstructured, in one centralized repository. Because data can be stored as-is, there is no need to convert it to a predefined schema. This tutorial walks you through defining a database, configuring a [crawler](https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html) to explore data in an Amazon S3 bucket, creating a table, transforming the CSV file into Parquet, creating a table for the Parquet data, and querying the data with Amazon Athena.


## Architecture Diagram
AWS Glue is an essential component of an Amazon S3 data lake, providing the data catalog and transformation services for modern data analytics.

![s3-glue-data-lake.gif](images/s3-glue-data-lake.gif)


## Prerequisites

>Make sure you are in the US East (N. Virginia) Region, whose short name is us-east-1.


## Lab tutorial
### Create IAM role

1.1. On the **Services** menu, click **IAM**.

1.2. In the navigation pane, click **Roles**.

1.3. Click **Create role**.

1.4. For role type, choose **AWS Service**, find and choose **Glue**, and click **Next: Permissions**.

1.5. On the **Attach permissions policy** page, search for and select **AmazonS3FullAccess** and **AWSGlueServiceRole**, then click **Next: Review**.

1.6. On the **Review** page, enter the following details:

* **Role name: AWSGlueServiceRoleDefault**

1.7. Click **Create role**.

1.8. Click **Roles** to return to the role list, then click the role **AWSGlueServiceRoleDefault** you just created.

1.9. On the **Permissions** tab, click **Add inline policy** on the right side to create an inline policy.

1.10. On the **JSON** tab, paste in the following policy:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                    "logs:DescribeLogStreams"
                ],
                "Resource": [
                    "arn:aws:logs:*:*:*"
                ]
            }
        ]
    }

1.11. Click **Review policy**.

1.12. On the **Review policy** page, enter the policy name: **AWSCloudWatchLogs**.

1.13. Click **Create policy**.

1.14. Confirm that the role has the policies shown in the figure below.

![IAM role policies.png](images/IAM-role-policies.png)


### Add Crawler

2.1. On the **Services** menu, click **AWS Glue**.

2.2. In the console, click **Add database**. In **Database name**, type **nycitytaxi**, and click **Create**.

2.3. Click **Crawlers** in the navigation pane, then click **Add crawler**. Type **nytaxicrawler** for the crawler name and click **Next**.

2.4. On the **Add a data store** page, select **S3** as the data store.

2.5. Select **Specified path in my account**.

2.6. Enter the data source path **s3://aws-bigdata-blog/artifacts/glue-data-lake/data/** and click **Next**.

2.7. On the **Add another data store** page, select **No**, and click **Next**.

2.8. Select **Choose an existing IAM role**, choose the role **AWSGlueServiceRoleDefault** you just created in the drop-down list, and click **Next**.

2.9. For **Frequency**, choose **Run on demand**, and click **Next**.

2.10. For **Database**, choose **nycitytaxi**, and click **Next**.

2.11. Review the steps, and click **Finish**.

2.12. The crawler is ready to run. Click **Run it now**.

2.13. When the crawler has finished, one table has been added. Click **Tables** in the left navigation pane, and then click **data** to confirm.

![AWS Glue table has been added.png](images/AWS-Glue-table-has-been-added.png)

![table information.png](images/table-information.png)

### Create S3 Buckets for Storage

3.1. On the **Services** menu, click **S3**.

3.2. Click **Create bucket**.

3.3. In **Bucket name**, type **nytaxi-original-data-xxxx** (where xxxx is your name). For example, if your name is Annie, the bucket name will be **nytaxi-original-data-annie**. If the bucket name is already taken, add a number to the end of it.

3.4. In **Region**, select **US East (N. Virginia)**.

3.5. Click **Next**.

3.6. Click **Next**.

3.7. Disable the following options.

![disable.png](images/disable.png)

3.8. Click **Next**.

3.9. Click **Create bucket**.


### Transform the Data from CSV to Parquet Format

4.1. In the left navigation pane of the AWS Glue console, under **ETL**, click **Jobs**, and then click **Add job**.

4.2. On the **Job properties** page, enter the following details:

* **Name: nytaxi-csv-parquet**

* **IAM role**: choose **AWSGlueServiceRoleDefault**

4.3. For **This job runs**, select **A proposed script generated by AWS Glue**.

4.4. Click **Next**.

4.5. Choose **data** as the data source, and click **Next**.

4.6. Choose **Create tables in your data target**.

4.7. For **Data store**, select **Amazon S3**, and select **Parquet** as the format.

4.8. For **Target path**, type **s3://aws-glue-result-xxxx**, where **`xxxx`** is your name. The bucket must already exist; if it does not, create it the same way as in steps 3.1 to 3.9.

4.9. Verify the schema mapping, and click **Finish**.

4.10. View the job. This screen provides a complete view of the job and lets you edit the script. Click **Save**, then click **Run job**. The job may take around 10 minutes to finish.

![job running.png](images/job-running.png)


### Add the Parquet Table and Crawler

5.1. When the job has finished, go back to the **AWS Glue console** and click **Tables** in the left navigation pane.

5.2. Click **Add tables -> Add tables using a crawler**, type **nytaxiparquet** for the crawler name, and click **Next**.

5.3. Choose **S3** as the **Data store**.

5.4. For **Include path**, type **s3://aws-glue-result-xxxx** (where **`xxxx`** is your name), the path where the job stored the Parquet data.

5.5. Click **Next**.

5.6. On the **Add another data store** page, select **No**, and click **Next**.

5.7. Select **Choose an existing IAM role**, select the role **AWSGlueServiceRoleDefault** you created earlier in the drop-down list, and click **Next**.

5.8. For **Frequency**, choose **Run on demand**, and click **Next**.

5.9. For **Database**, choose **nycitytaxi**, and click **Next**.

5.10. Review the steps, and click **Finish**.

5.11. The crawler is ready to run. Click **Run it now**.

5.12. After the crawler has finished, there are two tables in the **nycitytaxi** database: a table for the raw CSV data and a table for the transformed Parquet data.
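
The crawler configured in steps 5.2 to 5.11 can also be created from a script instead of the console. The following is a minimal sketch, assuming the `boto3` SDK is installed and AWS credentials are configured; the helper names `crawler_request` and `create_and_run_crawler` are hypothetical and not part of this tutorial. The names and path are the ones used in the steps above, with `xxxx` left as a placeholder.

```python
"""Sketch: create and start the Parquet crawler from steps 5.2-5.11 with boto3."""


def crawler_request(name, role, database, s3_path):
    """Build the keyword arguments for the Glue create_crawler call.

    No "Schedule" key is set, so the crawler only runs when it is started,
    which corresponds to "Run on demand" in step 5.8.
    """
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_run_crawler(request, region="us-east-1"):
    """Create the crawler and start one run. Requires AWS credentials."""
    import boto3  # imported lazily so the request builder works without boto3

    glue = boto3.client("glue", region_name=region)
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Values taken from the tutorial; replace xxxx with your own suffix.
request = crawler_request(
    name="nytaxiparquet",
    role="AWSGlueServiceRoleDefault",
    database="nycitytaxi",
    s3_path="s3://aws-glue-result-xxxx",
)
```

Calling `create_and_run_crawler(request)` would create and start the crawler in your account. Keeping the request in a plain dictionary makes it easy to review the settings before anything is sent to AWS.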

![table for transformed parquet data.png](images/table-for-transformed-parquet-data.png)


### Query the Data with Amazon Athena

6.1. On the **Services** menu, click **Athena**.

6.2. On the **Query Editor** tab, choose the database **nycitytaxi**.

![Athena query editor.png](images/Athena-query-editor.png)

6.3. Choose the **aws_glue_result_xxxx** table.

6.4. Type the following standard SQL to query the data:

    Select * From "nycitytaxi"."data" limit 10;

6.5. Click **Run Query**.

![Athena run query.png](images/Athena-run-query.png)


### Appendix - Analyze the data with Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is capable of querying CSV data. However, the Parquet file format significantly reduces the time and cost of querying the data.

To use AWS Glue with Amazon Athena, you must upgrade your Athena data catalog to the AWS Glue Data Catalog. For more information about upgrading your Athena data catalog, see this [step-by-step guide](https://docs.aws.amazon.com/athena/latest/ug/glue-upgrade.html).


## Conclusion

Congratulations! You have now learned how to:

* Build a data lake using AWS Glue and Amazon S3.
* Crawl your data in Amazon S3 with AWS Glue.
* Analyze the data with Amazon Athena.
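
As a follow-up, the query from step 6.4 can be submitted without the console. This is a minimal sketch, assuming the `boto3` SDK and configured AWS credentials. The helper names and the `athena-results/` output prefix are hypothetical; Athena needs some S3 location for query results, and any bucket you own will do.

```python
"""Sketch: submit the step 6.4 query to Athena with boto3."""


def athena_query_request(sql, database, output_location):
    """Build the keyword arguments for the Athena start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request, region="us-east-1"):
    """Start the query and return its execution ID. Requires AWS credentials."""
    import boto3  # imported lazily so the request builder works without boto3

    athena = boto3.client("athena", region_name=region)
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Query and database from the tutorial; the output location is an assumption.
query = athena_query_request(
    sql='SELECT * FROM "nycitytaxi"."data" LIMIT 10;',
    database="nycitytaxi",
    output_location="s3://aws-glue-result-xxxx/athena-results/",
)
```

`run_query(query)` only starts the query; the results are written to the output location once Athena finishes, so a real script would poll the execution status before reading them.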

--------------------------------------------------------------------------------
/images/AWS Glue table has been added.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ecloudvalley/Building-a-Data-Lake-with-AWS-Glue-and-Amazon-S3/cce5542e57c77d275bdd081bfaddbbb71632740b/images/AWS Glue table has been added.png

--------------------------------------------------------------------------------
/images/AWS-Glue-table-has-been-added.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ecloudvalley/Building-a-Data-Lake-with-AWS-Glue-and-Amazon-S3/cce5542e57c77d275bdd081bfaddbbb71632740b/images/AWS-Glue-table-has-been-added.png

--------------------------------------------------------------------------------
/images/Athena query editor.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ecloudvalley/Building-a-Data-Lake-with-AWS-Glue-and-Amazon-S3/cce5542e57c77d275bdd081bfaddbbb71632740b/images/Athena query editor.png

--------------------------------------------------------------------------------
/images/Athena run query.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ecloudvalley/Building-a-Data-Lake-with-AWS-Glue-and-Amazon-S3/cce5542e57c77d275bdd081bfaddbbb71632740b/images/Athena run query.png

--------------------------------------------------------------------------------
/images/Athena-query-editor.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ecloudvalley/Building-a-Data-Lake-with-AWS-Glue-and-Amazon-S3/cce5542e57c77d275bdd081bfaddbbb71632740b/images/Athena-query-editor.png

--------------------------------------------------------------------------------
/images/Athena-run-query.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ecloudvalley/Building-a-Data-Lake-with-AWS-Glue-and-Amazon-S3/cce5542e57c77d275bdd081bfaddbbb71632740b/images/Athena-run-query.png

--------------------------------------------------------------------------------
/images/IAM role policies.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ecloudvalley/Building-a-Data-Lake-with-AWS-Glue-and-Amazon-S3/cce5542e57c77d275bdd081bfaddbbb71632740b/images/IAM role policies.png

--------------------------------------------------------------------------------
/images/IAM-role-policies.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ecloudvalley/Building-a-Data-Lake-with-AWS-Glue-and-Amazon-S3/cce5542e57c77d275bdd081bfaddbbb71632740b/images/IAM-role-policies.png

--------------------------------------------------------------------------------
/images/README:
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
/images/disable.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ecloudvalley/Building-a-Data-Lake-with-AWS-Glue-and-Amazon-S3/cce5542e57c77d275bdd081bfaddbbb71632740b/images/disable.png

--------------------------------------------------------------------------------
/images/job running.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ecloudvalley/Building-a-Data-Lake-with-AWS-Glue-and-Amazon-S3/cce5542e57c77d275bdd081bfaddbbb71632740b/images/job running.png

--------------------------------------------------------------------------------
/images/job-running.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ecloudvalley/Building-a-Data-Lake-with-AWS-Glue-and-Amazon-S3/cce5542e57c77d275bdd081bfaddbbb71632740b/images/job-running.png

--------------------------------------------------------------------------------
/images/s3-glue-data-lake.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ecloudvalley/Building-a-Data-Lake-with-AWS-Glue-and-Amazon-S3/cce5542e57c77d275bdd081bfaddbbb71632740b/images/s3-glue-data-lake.gif

--------------------------------------------------------------------------------
/images/table information.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ecloudvalley/Building-a-Data-Lake-with-AWS-Glue-and-Amazon-S3/cce5542e57c77d275bdd081bfaddbbb71632740b/images/table information.png

--------------------------------------------------------------------------------
/images/table-for-transformed-parquet-data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ecloudvalley/Building-a-Data-Lake-with-AWS-Glue-and-Amazon-S3/cce5542e57c77d275bdd081bfaddbbb71632740b/images/table-for-transformed-parquet-data.png

--------------------------------------------------------------------------------
/images/table-information.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ecloudvalley/Building-a-Data-Lake-with-AWS-Glue-and-Amazon-S3/cce5542e57c77d275bdd081bfaddbbb71632740b/images/table-information.png

--------------------------------------------------------------------------------