59 | * Serverless Analytics with Amazon Athena [[Packt]](https://www.packtpub.com/product/serverless-analytics-with-amazon-athena/9781800562349) [[Amazon]](https://www.amazon.in/Serverless-Analytics-Amazon-Athena-semi-structured/dp/1800562349/ref=sr_1_1?keywords=Serverless+Analytics+with+Amazon+Athena&qid=1638757768&sr=8-1)
60 |
61 | * Scalable Data Streaming with Amazon Kinesis [[Packt]](https://www.packtpub.com/product/scalable-data-streaming-with-amazon-kinesis/9781800565401) [[Amazon]](https://www.amazon.in/Scalable-Data-Streaming-Amazon-Kinesis/dp/1800565402/ref=sr_1_1?keywords=Scalable+Data+Streaming+with+Amazon+Kinesis&qid=1638757818&sr=8-1)
62 |
63 | ## Get to Know the Author
64 | **Gareth Eagar** has worked in the IT industry for over 25 years, starting in South Africa, then working in the United Kingdom, and now based in the United States. In 2017, he started working at Amazon Web Services (AWS) as a solution architect, working with enterprise customers in the NYC metro area. Gareth has become a recognized subject matter expert for building data lakes on AWS, and in 2019 he launched the Data Lake Day educational event at the AWS Lofts in NYC and San Francisco. He has also delivered a number of public talks and webinars on topics relating to big data, and in 2020 Gareth transitioned to the AWS Professional Services organization as a senior data architect, helping customers architect and build complex data pipelines.
65 |
66 | **Note from the author:**
67 |
68 | You can use the resources provided in this GitHub repo as you work through the hands-on activities included in each chapter of the book. This repo is laid out with resources matched to each chapter of the book, such as the JSON used to define IAM policies, sample files, and relevant links.
69 | ### Download a free PDF
70 |
71 | If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.
Simply click on the link to claim your free PDF.
72 | https://packt.link/free-ebook/9781800560413
--------------------------------------------------------------------------------
/Chapter09/README.md:
--------------------------------------------------------------------------------
1 | # Chapter 9 - Loading Data into a Data Mart
2 |
3 | In this chapter, we learned how a cloud data warehouse can be used to store hot data to
4 | optimize performance and manage costs. We reviewed some common "anti-patterns"
5 | for data warehouse usage before diving deep into the Redshift architecture to learn more
6 | about how Redshift optimizes data storage across nodes.
7 | We then reviewed some of the important design decisions that need to be made when
8 | creating an optimized schema in Redshift, before reviewing how data can be ingested into and unloaded from
9 | Redshift.
10 |
11 | ## Hands-on Activity
12 | In the hands-on section of this chapter we created a new Redshift cluster,
13 | configured Redshift Spectrum to query data from Amazon S3, and then loaded a
14 | subset of data from S3 into Redshift. We then ran some complex queries to calculate the
15 | distance between two points before creating a materialized view with the results of our
16 | complex query.
17 |
18 | #### Uploading our sample data to Amazon S3
19 | In this exercise, we use open-source data from **Inside Airbnb**, an organization that publishes data quantifying the impact of short-term rentals on housing and residential communities. To learn more about the organization, see http://insideairbnb.com/index.html.
20 |
21 | Use the links below to download data from [*Inside Airbnb*](http://insideairbnb.com/index.html), which licenses this data under the [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).
22 |
23 | Download the ***listings.csv*** file for Jersey City, New Jersey, and for New York City, New York, using the following links. *Make sure to download the CSV version of the file, and not the csv.gz version*. Rename each file when downloading so you can identify which city it is for (such as jc-listings.csv and ny-listings.csv).
24 |
25 | - Jersey City Listings: [Access here](http://insideairbnb.com/get-the-data.html#:~:text=Jersey%20City%2C%20New%20Jersey%2C%20United%20States)
26 | - New York Listings: [Access here](http://insideairbnb.com/get-the-data.html#:~:text=New%20York%20City%2C%20New%20York%2C%20United%20States)
27 |
28 | **Commands to upload files to Amazon S3**
29 |
30 | - Use the following commands to upload the data to your S3 Landing Zone bucket. Replace ***INITIALS*** with the correct identifier for your Landing Zone bucket.
31 | ```
32 | aws s3 cp jc-listings.csv s3://dataeng-landing-zone-INITIALS/listings/city=jersey_city/jc-listings.csv
33 | ```
34 |
35 | ```
36 | aws s3 cp ny-listings.csv s3://dataeng-landing-zone-INITIALS/listings/city=new_york_city/ny-listings.csv
37 | ```
38 |
39 | #### IAM Roles for Redshift
40 |
41 | - AWS Management Console - IAM Roles: https://console.aws.amazon.com/iamv2/home?#/roles
42 |
43 | #### Creating a Redshift cluster
44 |
45 | - AWS Management Console - Redshift: https://console.aws.amazon.com/redshiftv2/
46 |
47 | #### Creating external tables for querying data in S3
48 |
49 | - The following query can be run in the Redshift Query Editor to create an external schema. Make sure to specify the ARN for the new role you created in place of the ***iam_role*** listed below.
50 |
51 | ```
52 | create external schema spectrum_schema
53 | from data catalog
54 | database 'accommodation'
55 | iam_role 'arn:aws:iam::1234567890:role/AmazonRedshiftSpectrumRole'
56 | create external database if not exists;
57 | ```
58 |
59 | - The following query can be run to create a new external table. Make sure to replace ***INITIALS*** in the query below with the correct identifier for your Landing Zone bucket.
60 |
61 | ```
62 | CREATE EXTERNAL TABLE spectrum_schema.listings(
63 | listing_id INTEGER,
64 | name VARCHAR(100),
65 | host_id INT,
66 | host_name VARCHAR(100),
67 | neighbourhood_group VARCHAR(100),
68 | neighbourhood VARCHAR(100),
69 | latitude Decimal(8,6),
70 | longitudes Decimal(9,6),
71 | room_type VARCHAR(100),
72 | price SMALLINT,
73 | minimum_nights SMALLINT,
74 | number_of_reviews SMALLINT,
75 | last_review DATE,
76 | reviews_per_month NUMERIC(8,2),
77 | calculated_host_listings_count SMALLINT,
78 | availability_365 SMALLINT)
79 | partitioned by(city varchar(100))
80 | row format delimited
81 | fields terminated by ','
82 | stored as textfile
83 | location 's3://dataeng-landing-zone-INITIALS/listings/';
84 | ```
85 |
86 | - The following two queries create partitions for our Jersey City and New York City data. *Make sure to run each of these queries separately, and to replace **INITIALS** with the correct identifier for your Landing Zone bucket.*
87 |
88 | **Query 1:**
89 | ```
90 | alter table spectrum_schema.listings add
91 | partition(city='jersey_city')
92 | location 's3://dataeng-landing-zone-INITIALS/listings/city=jersey_city/'
93 | ```
94 | **Query 2:**
95 | ```
96 | alter table spectrum_schema.listings add
97 | partition(city='new_york_city')
98 | location 's3://dataeng-landing-zone-INITIALS/listings/city=new_york_city/'
99 | ```
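The partition locations above mirror the `city=<value>` prefixes used in the upload commands. As a minimal sketch of that naming convention, the following Python helper builds the S3 location and the matching `ALTER TABLE` statement for a given city. The helper is hypothetical and not part of the book's code: the function names and the `abc` bucket suffix are assumptions for illustration only.

```python
# Hypothetical helper: the function names and the example bucket suffix are
# assumptions for illustration, not part of the book's code.


def partition_location(initials: str, city: str) -> str:
    """Return the S3 prefix that holds the listings for one city."""
    return f"s3://dataeng-landing-zone-{initials}/listings/city={city}/"


def add_partition_sql(initials: str, city: str) -> str:
    """Build the ALTER TABLE statement that registers one city partition."""
    return (
        "alter table spectrum_schema.listings add\n"
        f"partition(city='{city}')\n"
        f"location '{partition_location(initials, city)}';"
    )


if __name__ == "__main__":
    # 'abc' stands in for your own Landing Zone bucket identifier.
    for city in ("jersey_city", "new_york_city"):
        print(add_partition_sql("abc", city))
        print()
```

The generated statements would still need to be run one at a time in the Redshift Query Editor, as described above.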
100 |
101 | - Validate that the data has been loaded and defined correctly by running queries using both Redshift and Amazon Athena.
102 |
103 | **Redshift Query:**
104 | ```
105 | select * from spectrum_schema.listings limit 100;
106 | ```
107 |
108 | **Amazon Athena Query:**
109 | ```
110 | select * from accommodation.listings limit 100;
111 | ```
112 |
113 | #### Creating a schema for a local Redshift table
114 |
115 | - Create a new local (not external) Redshift schema
116 |
117 | ```
118 | create schema if not exists accommodation_local;
119 | ```
120 |
121 | - Create a new local listings table
122 |
123 | ```
124 | CREATE TABLE dev.accommodation_local.listings(
125 | listing_id INTEGER,
126 | name VARCHAR(100),
127 | neighbourhood_group VARCHAR(100),
128 | neighbourhood VARCHAR(100),
129 | latitude Decimal(8,6),
130 | longitudes Decimal(9,6),
131 | room_type VARCHAR(100),
132 | price SMALLINT,
133 | minimum_nights SMALLINT,
134 | city VARCHAR(40))
135 | distkey(listing_id)
136 | sortkey(price);
137 | ```
138 |
139 | - Load data from our Redshift Spectrum (external) table into the new local table
140 |
141 | ```
142 | INSERT into accommodation_local.listings
143 | (SELECT listing_id,
144 | name,
145 | neighbourhood_group,
146 | neighbourhood,
147 | latitude,
148 | longitudes,
149 | room_type,
150 | price,
151 | minimum_nights
152 | FROM spectrum_schema.listings);
153 | ```
154 |
155 | #### Running complex SQL queries against our data
156 | In this section, we build a complex query in steps to better understand how the query works.
157 |
158 | **Query 1**
159 | ```
160 | WITH touristspots_raw(name,lon,lat) AS (
161 | (SELECT 'Freedom Tower', -74.013382,40.712742) UNION
162 | (SELECT 'Empire State Building', -73.985428, 40.748817)),
163 | touristspots (name,location) AS (SELECT name,
164 | ST_Point(lon, lat) FROM touristspots_raw)
165 | select name, location from touristspots
166 | ```
167 |
168 | **Query 2**
169 | ```
170 | WITH accommodation(listing_id, name, room_type, location) AS (SELECT listing_id, name, room_type, ST_Point(longitudes, latitude) from accommodation_local.listings)
171 | select listing_id, name, room_type, location from accommodation
172 | ```
173 |
174 | **Query 3**
175 | ```
176 | WITH touristspots_raw(name,lon,lat) AS (
177 | (SELECT 'Freedom Tower', -74.013382,40.712742) UNION
178 | (SELECT 'Empire State Building', -73.985428, 40.748817)
179 | ),
180 | touristspots(name,location) AS (
181 | SELECT name, ST_Point(lon, lat)
182 | FROM touristspots_raw),
183 | accommodation(listing_id, name, room_type, price,
184 | location) AS
185 | (
186 | SELECT listing_id, name, room_type, price,
187 | ST_Point(longitudes, latitude)
188 | FROM accommodation_local.listings)
189 | SELECT
190 | touristspots.name as tourist_spot,
191 | accommodation.listing_id as listing_id,
192 | accommodation.name as location_name,
193 | (ST_DistanceSphere(touristspots.location,
194 | accommodation.location) / 1000)::decimal(10,2) AS
195 | distance_in_km,
196 | accommodation.price AS price,
197 | accommodation.room_type as room_type
198 | FROM touristspots, accommodation
199 | WHERE tourist_spot like 'Empire%'
200 | ORDER BY distance_in_km
201 | LIMIT 100;
202 | ```
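Query 3 pairs every tourist spot with every listing and uses `ST_DistanceSphere` to measure the great-circle distance between the two points, dividing by 1000 to convert meters to kilometers. The Python sketch below illustrates the same idea with the haversine formula. It is meant for intuition only and is not Redshift's implementation: the sphere radius of 6,370,986 m is assumed to match the function's default, and the two listings are made-up sample rows rather than Inside Airbnb data.

```python
import math

# Assumed sphere radius in meters. It is believed to match the default used by
# ST_DistanceSphere, but treat it as an assumption.
EARTH_RADIUS_M = 6_370_986


def distance_sphere_m(lon1, lat1, lon2, lat2, radius=EARTH_RADIUS_M):
    """Great-circle distance in meters between two (lon, lat) points (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


# Same coordinates as the touristspots_raw CTE: (name, lon, lat).
TOURIST_SPOTS = [
    ("Freedom Tower", -74.013382, 40.712742),
    ("Empire State Building", -73.985428, 40.748817),
]

# Made-up listings for illustration only: (listing_id, name, lon, lat).
SAMPLE_LISTINGS = [
    (1, "Hypothetical Midtown studio", -73.9900, 40.7500),
    (2, "Hypothetical Jersey City loft", -74.0500, 40.7200),
]


def nearest_listings(spot_prefix, spots=TOURIST_SPOTS, listings=SAMPLE_LISTINGS):
    """Cross join spots and listings, filter by spot name, sort by distance in km."""
    rows = [
        (spot, listing_id, name,
         round(distance_sphere_m(s_lon, s_lat, l_lon, l_lat) / 1000, 2))
        for spot, s_lon, s_lat in spots
        for listing_id, name, l_lon, l_lat in listings
        if spot.startswith(spot_prefix)
    ]
    return sorted(rows, key=lambda row: row[3])


if __name__ == "__main__":
    for row in nearest_listings("Empire"):
        print(row)
```

As in the SQL, longitude is passed before latitude, matching the `ST_Point(longitudes, latitude)` argument order used in the queries.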
203 |
204 | **Query 4**
205 | ```
206 | CREATE MATERIALIZED VIEW listings_touristspot_distance_view AS
207 | WITH touristspots_raw(name, lon, lat) AS (
208 | (SELECT 'Freedom Tower', -74.013382,40.712742) UNION
209 | (SELECT 'Empire State Building', -73.985428, 40.748817)
210 | ),
211 | touristspots(name,location) AS (
212 | SELECT name, ST_Point(lon, lat)
213 | FROM touristspots_raw),
214 | accommodation(listing_id, name, room_type, price,location) AS
215 | (
216 | SELECT listing_id, name, room_type, price, ST_Point(longitudes, latitude)
217 | FROM accommodation_local.listings)
218 | SELECT
219 | touristspots.name as tourist_spot,
220 | accommodation.listing_id as listing_id,
221 | accommodation.name as location_name,
222 | (ST_DistanceSphere(touristspots.location,accommodation.location) / 1000)::decimal(10,2) AS distance_in_km,
223 | accommodation.price AS price,
224 | accommodation.room_type as room_type
225 | FROM touristspots, accommodation
226 | ```
227 |
228 | **Query 5**
229 | ```
230 | select * from listings_touristspot_distance_view
231 | where tourist_spot like 'Empire%'
232 | order by distance_in_km
233 | limit 100
234 | ```
235 |
236 |
237 |
238 |
--------------------------------------------------------------------------------
/Chapter05/Data-Engineering-Completed-Whiteboard.drawio:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------