├── .github └── PULL_REQUEST_TEMPLATE.md ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md └── images ├── analyze_compression.png ├── jan_file_head.png ├── query_editor.png ├── redshift_arch.png ├── redshift_launch.png ├── s3_ls.png └── table_pop_strat.png /.github/PULL_REQUEST_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | *Issue #, if available:* 2 | 3 | *Description of changes:* 4 | 5 | 6 | By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice. 7 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check [existing open](https://github.com/aws-samples/amazon-redshift-tiered-storage/issues), or [recently closed](https://github.com/aws-samples/amazon-redshift-tiered-storage/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aclosed%20), issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *master* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. 
Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.

GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).

## Finding contributions to work on
Looking at the existing issues is a great way to find something to contribute to. Since our projects use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any ['help wanted'](https://github.com/aws-samples/amazon-redshift-tiered-storage/labels/help%20wanted) issues is a great place to start.

## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact opensource-codeofconduct@amazon.com with any additional questions or comments.

## Security issue notifications
If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.

## Licensing

See the [LICENSE](https://github.com/aws-samples/amazon-redshift-tiered-storage/blob/master/LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.

We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes.
-------------------------------------------------------------------------------- /LICENSE: --------------------------------------------------------------------------------
Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
-------------------------------------------------------------------------------- /README.md: --------------------------------------------------------------------------------
# ANT321 - Tiered Data Sets in Amazon Redshift

Amazon Redshift offers a common query interface against data stored in fast, local storage as well as data from high-capacity, inexpensive storage (S3). This workshop will cover the basics of this tiered storage model and outline the design patterns you can leverage to get the most from large volumes of data.
You will build out your own Redshift cluster with multiple data sets to illustrate the trade-offs between the storage systems. By the time you leave, you'll know how to distribute your data and design your DDL to deliver the best data warehouse for your business.

```
Matt Scaer
Principal DW Specialist SA
AWS

Karthik Odapally
Senior Solutions Architect
AWS
```

## License Summary

This sample code is made available under a modified MIT license. See the LICENSE file.

## Agenda
* Introductions
* Account logins and cluster spin-up (via CloudFormation)
* Refresher on Amazon Redshift
* Workshop time
* Learning objectives
* True-up against the website

## Why this session
* Data typically grows at 10x every 5 years.
* The average lifetime of an analytics platform is 15 years.
* It's not just about price and performance, but also complexity.

## Why you really need this session

* To learn how simple it is to:
  * Query 2.87 billion rows (200 GB of data) in under 5 seconds.
  * Query historical data residing on S3.
  * Plan for the future.

## Account Login and Redshift Cluster Spin-up
* Get a temporary account (slip of paper).
* Log into AWS using the provided placeholder credentials, then switch to the **us-west-2** region.
* We are here to help, please don't hesitate to ask for assistance!
* Create an IAM role to query S3 data, giving the role read-only access to all Amazon S3 buckets. Make sure to choose the **AmazonS3ReadOnlyAccess** and **AWSGlueConsoleFullAccess** policies.

```
https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-create-role.html
```

* Use Redshift's "Quick Create" functionality (or "Classic") to create a cluster and associate the IAM role with it.
* Please use **2 compute nodes of DC2.Large**, with a cluster identifier and master user of your choice.
* **Do not pick an AZ.**
* Update the security group to allow access to Redshift. Please do not allow access from 0.0.0.0/0, or your cluster will be auto-deactivated; allow only your current IP.
* Double-check that you are in the **us-west-2** region.
**How-to screenshot:**

![GitHub Logo](/images/redshift_launch.png)

## Refresher on Amazon Redshift

![GitHub Logo](/images/redshift_arch.png)

* Massively parallel, shared-nothing columnar architecture
* Leader node:
  * SQL endpoint
  * Stores metadata
  * Coordinates parallel SQL processing
* Compute nodes:
  * Local, columnar storage
  * Execute queries in parallel
  * Load, unload, backup, restore
* Amazon Redshift Spectrum nodes:
  * Execute queries directly against Amazon Simple Storage Service (Amazon S3)

### Two Complementary Usage Patterns

Amazon Redshift combines two usage patterns under a single, seamless service:

* Redshift (using direct-attached storage objects):
  * Billed hourly for the number and type of compute nodes in the cluster.
  * An "all you can eat" model.
* Redshift Spectrum (table data resides on S3):
  * Billed at $5 per TB of data scanned.
  * Both performance and cost savings encourage reducing the amount of data scanned, through:
    * Compressing the data on S3.
    * Storing the data on S3 in a columnar format (e.g., Parquet or ORC).
    * Partitioning the data on S3.

**Amazon Redshift can be leveraged using either pattern alone, or both in combination.**

### Connecting to the Cluster

* Existing tools like SQL Workbench can be used.
* Amazon Redshift has a built-in Query Editor in the AWS Management Console.

![GitHub Logo](/images/query_editor.png)

You can type in the following query to get the ball rolling:

```sql
SELECT 'Hello Redshift Tiered-Storage workshop attendee!' AS welcome;
```

* PgWeb is a web-based database browser for PostgreSQL, written in Go; it runs on macOS, Linux, and Windows machines. Feel free to download this client from:

```
https://github.com/sosedoff/pgweb/releases
```

* PgWeb defaults to port 5432, whereas Redshift defaults to 5439. To point PgWeb at Redshift's default port 5439, invoke it with:

```
pgweb --url postgres://{username}:{password}@{cluster_endpoint}:5439/{database_name}?sslmode=require
```

* (Optional) Command Line Interface (CLI) for Amazon Redshift:

```
https://docs.aws.amazon.com/redshift/latest/mgmt/setting-up-rs-cli.html
```

## Workshop - Scenario #1: What happened in 2016?

* Assemble your toolset:
  * Choose a SQL editor (SQL Workbench, PgWeb, psql, query from the Console, etc.)

* Load the Green company data for January 2016 into Redshift direct-attached storage (DAS) with COPY.
* Collect supporting/refuting evidence for the impact of the January 2016 blizzard on taxi usage.
* The CSV data is organized by month on Amazon S3.
  Here's a quick screenshot via the CLI:

```bash
$ aws s3 ls s3://us-west-2.serverless-analytics/NYC-Pub/green/ | grep 2016
```

![GitHub Logo](/images/s3_ls.png)

* Here's sample data from the January file:

```bash
head -20 green_tripdata_2016-01.csv

VendorID,lpep_pickup_datetime,Lpep_dropoff_datetime,Store_and_fwd_flag,RateCodeID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count,Trip_distance,Fare_amount,Extra,MTA_tax,Tip_amount,Tolls_amount,Ehail_fee,improvement_surcharge,Total_amount,Payment_type,Trip_type
```

![GitHub Logo](/images/jan_file_head.png)

### Build your DDL
- Create a schema `workshop_das` and table `workshop_das.green_201601_csv` for tables that will reside on the Redshift compute nodes, AKA the Redshift direct-attached storage (DAS) tables.
**Hint:**

```sql
CREATE SCHEMA workshop_das;

CREATE TABLE workshop_das.green_201601_csv
(
  vendorid              VARCHAR(4),
  pickup_datetime       TIMESTAMP,
  dropoff_datetime      TIMESTAMP,
  store_and_fwd_flag    VARCHAR(1),
  ratecode              INT,
  pickup_longitude      FLOAT4,
  pickup_latitude       FLOAT4,
  dropoff_longitude     FLOAT4,
  dropoff_latitude      FLOAT4,
  passenger_count       INT,
  trip_distance         FLOAT4,
  fare_amount           FLOAT4,
  extra                 FLOAT4,
  mta_tax               FLOAT4,
  tip_amount            FLOAT4,
  tolls_amount          FLOAT4,
  ehail_fee             FLOAT4,
  improvement_surcharge FLOAT4,
  total_amount          FLOAT4,
  payment_type          VARCHAR(4),
  trip_type             VARCHAR(4)
)
DISTSTYLE EVEN SORTKEY (passenger_count,pickup_datetime);
```
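Once created, a quick way to confirm the distribution style and sort key is the SVV_TABLE_INFO system view (a sketch; note that SVV_TABLE_INFO omits empty tables, so run it after the COPY in the next step has loaded data):

```sql
-- Check diststyle and sort key for the new DAS table.
-- SVV_TABLE_INFO only lists tables that contain rows.
SELECT "table", diststyle, sortkey1, tbl_rows
FROM svv_table_info
WHERE "schema" = 'workshop_das';
```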

207 |
208 | 209 | ### Build your Copy Command 210 | 211 | Build your copy command to copy the data from Amazon S3. This dataset has the number of taxi rides in the month of January 2016. 212 | 213 | ```python 214 | 215 | s3://us-west-2.serverless-analytics/NYC-Pub/green/green_tripdata_2016-01.csv 216 | 217 | ``` 218 | 219 |
**Hint:**

```sql
COPY workshop_das.green_201601_csv
FROM 's3://us-west-2.serverless-analytics/NYC-Pub/green/green_tripdata_2016-01.csv'
CREDENTIALS 'aws_iam_role=arn:aws:iam::XXXXXXXXXXXX:role/mySpectrumRole'
DATEFORMAT 'auto'
IGNOREHEADER 1
DELIMITER ','
IGNOREBLANKLINES
;
```
**HINT HINT: Replace `XXXXXXXXXXXX` in the command above with your AWS account number, and use your own role name.**
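To verify the load (a sketch, assuming the COPY above ran): count the rows, and if the COPY failed, the STL_LOAD_ERRORS system table is the first place to look:

```sql
-- How many rides landed in the table?
SELECT COUNT(*) FROM workshop_das.green_201601_csv;

-- If the COPY failed, the most recent load errors explain why.
SELECT starttime, filename, line_number, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;
```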

236 |
### Pinpoint the Blizzard

This month includes a date with the lowest number of taxi rides, due to a blizzard. Can you find that date?
**Hint:**

```sql
SELECT TO_CHAR(pickup_datetime, 'YYYY-MM-DD'),
       COUNT(*)
FROM workshop_das.green_201601_csv
GROUP BY 1
ORDER BY 1;
```
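A variation that surfaces the quietest days directly (a sketch): sort by ride count instead of by date, so the blizzard date floats to the top:

```sql
-- The days with the fewest rides come first.
SELECT TRUNC(pickup_datetime) AS pickup_date,
       COUNT(*) AS rides
FROM workshop_das.green_201601_csv
GROUP BY 1
ORDER BY rides ASC
LIMIT 5;
```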

253 |
## Workshop - Scenario #2: Go Back in Time

* Query historical data residing on S3:
  * Create an external DB for Redshift Spectrum.
  * Create the external table on Spectrum.
  * Write a script or SQL statement to add partitions.
* Create and populate a small number of dimension tables on Redshift DAS.
* Introspect the historical data, perhaps rolling up the data in novel ways to see trends over time, or other dimensions.
* Enforce reasonable use of the cluster with Redshift Spectrum-specific Query Monitoring Rules (QMR).
  * Test the QMR setup by writing an excessive-use query.
* For the dimension table(s), feel free to leverage multi-row insert in Redshift:
  `https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-multi-row-inserts.html`

**Note the partitioning scheme is Year, Month, Type (where Type is a taxi company). Here's a quick screenshot:**

```bash
$ aws s3 ls s3://us-west-2.serverless-analytics/canonical/NY-Pub/
                           PRE year=2009/
                           PRE year=2010/
                           PRE year=2011/
                           PRE year=2012/
                           PRE year=2013/
                           PRE year=2014/
                           PRE year=2015/
                           PRE year=2016/
$ aws s3 ls s3://us-west-2.serverless-analytics/canonical/NY-Pub/year=2016/month=1/
                           PRE type=fhv/
                           PRE type=green/
                           PRE type=yellow/
$ aws s3 ls s3://us-west-2.serverless-analytics/canonical/NY-Pub/year=2016/month=1/type=green/
2017-05-18 19:43:22   18910771 part-r-00000-4c01b1ef-3419-40ba-908e-5b36b3556fa7.gz.parquet
```

### Create external schema (and DB) for Redshift Spectrum

* Create an external schema **ant321** from your database **spectrumdb**.
**Hint:**

```sql
CREATE EXTERNAL SCHEMA ant321
FROM DATA CATALOG DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/mySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```
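To confirm the external schema registered against the data catalog, you might check the SVV_EXTERNAL_SCHEMAS system view (a sketch):

```sql
-- The new schema should appear with its catalog database name.
SELECT schemaname, databasename
FROM svv_external_schemas;
```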

303 |
304 | 305 | ### Create your Spectrum table DDL (or use this) 306 | 307 | * Create your external table **ant321.NYTaxiRides** for 308 | `vendorid, pickup_datetime, dropoff_datetime, ratecode, passenger_count, trip_distance, fare_amount, total_amount, payment_type` 309 | stored in parquet format under location `s3://us-west-2.serverless-analytics/canonical/NY-Pub/` 310 |
**Hint:**

```sql
CREATE EXTERNAL TABLE ant321.NYTaxiRides (
  vendorid         VARCHAR(6),
  pickup_datetime  TIMESTAMP,
  dropoff_datetime TIMESTAMP,
  ratecode         INT,
  passenger_count  INT,
  trip_distance    FLOAT8,
  fare_amount      FLOAT8,
  total_amount     FLOAT8,
  payment_type     INT
)
PARTITIONED BY (YEAR INT, MONTH INT, "TYPE" CHAR(6))
STORED AS PARQUET
LOCATION 's3://us-west-2.serverless-analytics/canonical/NY-Pub/'
;
```
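A quick sanity check that the table registered as expected (a sketch): the SVV_EXTERNAL_TABLES system view shows its S3 location and format:

```sql
-- Confirm the external table and where it points.
SELECT tablename, location, input_format
FROM svv_external_tables
WHERE schemaname = 'ant321';
```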

333 |
### Add the Partitions

This query only generates the ALTER TABLE statements; run its output to add all of the partitions:

```sql
WITH generate_smallint_series AS (select row_number() over () as n from workshop_das.green_201601_csv limit 65536)
, part_years AS (select n AS year_num from generate_smallint_series where n between 2009 and 2016)
, part_months AS (select n AS month_num from generate_smallint_series where n between 1 and 12)
, taxi_companies AS (SELECT 'fhv' taxi_vendor UNION ALL SELECT 'green' UNION ALL SELECT 'yellow')

SELECT 'ALTER TABLE ant321.NYTaxiRides ADD PARTITION(year=' || year_num || ', month=' || month_num || ', type=\'' || taxi_vendor || '\') ' ||
       'LOCATION \'s3://us-west-2.serverless-analytics/canonical/NY-Pub/year=' || year_num || '/month=' || month_num || '/type=' || taxi_vendor || '/\';'
FROM part_years, part_months, taxi_companies ORDER BY 1;
```

### Update the number of rows table property

1. Determine the number of rows in the table.
2. Save a copy of the explain plan for #1 above.
3. Set the (approximate or specific) number of rows using `ALTER TABLE ... SET TABLE PROPERTIES ('numRows'='...')`.
4. Rerun the explain plan for #1, noting the difference. When might this be impactful?

### Add a Redshift Spectrum Query Monitoring Rule to ensure reasonable use

In Amazon Redshift workload management (WLM), query monitoring rules define metrics-based performance boundaries for WLM queues and specify what action to take when a query goes beyond those boundaries. Set up a Query Monitoring Rule to ensure reasonable use.

```
https://docs.aws.amazon.com/redshift/latest/dg/cm-c-wlm-query-monitoring-rules.html
```

The SVL_QUERY_METRICS_SUMMARY view shows the maximum values of metrics for completed queries. This view is derived from the STL_QUERY_METRICS system table. Use the values in this view as an aid to determine threshold values for defining query monitoring rules.

```
https://docs.aws.amazon.com/redshift/latest/dg/r_SVL_QUERY_METRICS_SUMMARY.html
```

Quick note on WLM: The WLM configuration properties are either dynamic or static. Dynamic properties can be applied to the database without a cluster reboot, but static properties require a cluster reboot for changes to take effect. Additional info here:

```
https://docs.aws.amazon.com/redshift/latest/mgmt/workload-mgmt-config.html
```

### Multi-row insert in Redshift

* Create a federal holidays table using the data in the attached spreadsheet. Populate your table using a Redshift **multi-row insert** statement (see the hint below).

```
https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-multi-row-inserts.html
```

* Consider the distribution for this **dimension table**.
* What does the output of ANALYZE COMPRESSION look like for this table? Why is that?
**Hint:**

```sql
CREATE TABLE federal_holidays ( holiday_date DATE, holiday_name VARCHAR(64) ) DISTSTYLE ALL;

INSERT INTO federal_holidays VALUES ('01/01/2009', 'New Years Day'), ('01/19/2009', 'Martin Luther King Jr. Day'), ('02/16/2009', 'Presidents\' Day'), ('05/25/2009', 'Memorial Day'), ('07/03/2009', 'Independence Day'), ('09/07/2009', 'Labor Day'), ('10/12/2009', 'Columbus Day'), ('11/11/2009', 'Veterans Day'), ('11/26/2009', 'Thanksgiving'), ('11/27/2009', 'Day after Thanksgiving'), ('12/25/2009', 'Christmas Day'), ('01/01/2010', 'New Years Day'), ('01/18/2010', 'Martin Luther King Jr. Day'), ('02/15/2010', 'Presidents\' Day'), ('05/09/2010', 'Mother\'s Day'), ('05/31/2010', 'Memorial Day'), ('07/05/2010', 'Independence Day'), ('09/06/2010', 'Labor Day'), ('10/11/2010', 'Columbus Day'), ('11/11/2010', 'Veterans Day'), ('11/25/2010', 'Thanksgiving'), ('11/26/2010', 'Day after Thanksgiving'), ('12/24/2010', 'Christmas Day observed'), ('12/31/2010', 'New Years Day observed'), ('01/17/2011', 'Martin Luther King Jr. Day'), ('02/21/2011', 'Presidents\' Day'), ('04/15/2011', 'Emancipation Day'), ('05/08/2011', 'Mother\'s Day'), ('05/30/2011', 'Memorial Day'), ('06/19/2011', 'Father\'s Day'), ('07/04/2011', 'Independence Day'), ('09/05/2011', 'Labor Day'), ('10/10/2011', 'Columbus Day'), ('11/11/2011', 'Veterans Day'), ('11/24/2011', 'Thanksgiving'), ('11/25/2011', 'Day after Thanksgiving'), ('12/26/2011', 'Christmas Holiday'), ('01/02/2012', 'New Years Day observed'), ('01/16/2012', 'Martin Luther King Jr. Day'), ('02/20/2012', 'Presidents\' Day'), ('04/16/2012', 'Emancipation Day'), ('05/13/2012', 'Mother\'s Day'), ('05/28/2012', 'Memorial Day'), ('06/17/2012', 'Father\'s Day'), ('07/04/2012', 'Independence Day'), ('09/03/2012', 'Labor Day'), ('10/08/2012', 'Columbus Day'), ('11/12/2012', 'Veterans Day'), ('11/22/2012', 'Thanksgiving'), ('11/23/2012', 'Day after Thanksgiving'), ('01/01/2013', 'New Years Day'), ('01/21/2013', 'Martin Luther King Jr. Day'), ('02/18/2013', 'Presidents\' Day'), ('04/16/2013', 'Emancipation Day'), ('05/12/2013', 'Mother\'s Day'), ('05/27/2013', 'Memorial Day'), ('06/16/2013', 'Father\'s Day'), ('07/04/2013', 'Independence Day'), ('09/02/2013', 'Labor Day'), ('10/14/2013', 'Columbus Day'), ('11/11/2013', 'Veterans Day'), ('11/28/2013', 'Thanksgiving'), ('11/29/2013', 'Day after Thanksgiving'), ('12/25/2013', 'Christmas Day'), ('01/01/2014', 'New Years Day'), ('01/20/2014', 'Martin Luther King Jr. Day'), ('02/17/2014', 'Presidents\' Day'), ('04/16/2014', 'Emancipation Day'), ('05/11/2014', 'Mother\'s Day'), ('05/26/2014', 'Memorial Day'), ('06/15/2014', 'Father\'s Day'), ('07/04/2014', 'Independence Day'), ('09/01/2014', 'Labor Day'), ('10/13/2014', 'Columbus Day'), ('11/11/2014', 'Veterans Day'), ('11/27/2014', 'Thanksgiving'), ('11/28/2014', 'Day after Thanksgiving'), ('12/25/2014', 'Christmas Day'), ('12/26/2014', 'Day after Christmas'), ('01/01/2015', 'New Years Day'), ('01/19/2015', 'Martin Luther King Jr. Day'), ('02/16/2015', 'Presidents\' Day'), ('04/16/2015', 'Emancipation Day'), ('05/10/2015', 'Mother\'s Day'), ('05/25/2015', 'Memorial Day'), ('06/21/2015', 'Father\'s Day'), ('07/03/2015', 'Independence Day (observed)'), ('09/07/2015', 'Labor Day'), ('10/12/2015', 'Columbus Day'), ('11/11/2015', 'Veterans Day'), ('11/26/2015', 'Thanksgiving'), ('11/27/2015', 'Day after Thanksgiving'), ('12/25/2015', 'Christmas Day'), ('01/01/2016', 'New Years Day'), ('01/18/2016', 'Martin Luther King Jr. Day'), ('02/15/2016', 'Presidents\' Day'), ('04/15/2016', 'Emancipation Day'), ('05/08/2016', 'Mother\'s Day'), ('05/30/2016', 'Memorial Day'), ('06/19/2016', 'Father\'s Day'), ('07/04/2016', 'Independence Day'), ('09/05/2016', 'Labor Day'), ('10/10/2016', 'Columbus Day'), ('11/11/2016', 'Veterans Day'), ('11/24/2016', 'Thanksgiving'), ('11/25/2016', 'Day after Thanksgiving'), ('12/26/2016', 'Christmas Day observed');
```
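To answer the compression question above, you can ask Redshift directly (a sketch; with this few rows, expect mostly RAW recommendations, since compression encodings only pay off at volume):

```sql
-- Recommended column encodings for the small dimension table.
ANALYZE COMPRESSION federal_holidays;
```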

403 |
404 | 405 | * Write a query to report the Holiday, number of passengers for the holidays in 2011, number of passengers for the holidays in 2016, and the percentage change over the 5 years for Yellow taxi. 406 | 407 |
**Hint:**

```sql
WITH helper AS (
  SELECT fed.holiday_date, fed.holiday_name, type, COUNT(*) AS num_fares, SUM(passenger_count) AS num_passengers
  FROM ant321_view_NYTaxiRides nyc, federal_holidays fed
  WHERE year IN (2011,2016) AND month = 11
    AND fed.holiday_date = TO_CHAR(pickup_datetime,'MM/DD/YYYY')::DATE
    AND type IN ('yellow')
  GROUP BY 1,2,3
)
SELECT a.holiday_name, a.type, a.num_passengers AS p2011, b.num_passengers AS p2016,
       (-1 * (1 - (b.num_passengers::FLOAT / a.num_passengers::FLOAT))) AS perc_diff
FROM helper a, helper b
WHERE a.holiday_name = b.holiday_name AND a.type = b.type
  AND a.holiday_date < '01/01/2015'::DATE AND b.holiday_date > '01/01/2015'::DATE;
```

### [Advanced Topic] Debug a Parquet/Redshift Spectrum datatype mismatch

1. Create a new Redshift Spectrum table, changing the datatype of column `trip_distance` from FLOAT8 to FLOAT4.
   * Add a single partition for testing.
2. Counts still work, but what about other operations (SELECT MIN(trip_distance) FROM, SELECT * FROM, CTAS)?
3. Instead of considering Apache Drill or another tool to help resolve the issue, consider the Redshift system view SVL_S3LOG.
**Hint:**

```
https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-troubleshooting.html#spectrum-troubleshooting-incompatible-data-format
```
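For example, after a failing Spectrum query you might pull the error detail straight out of SVL_S3LOG (a sketch):

```sql
-- Most recent Spectrum scan messages; the datatype mismatch shows up here.
SELECT query, message
FROM svl_s3log
ORDER BY query DESC
LIMIT 10;
```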

438 |
## Workshop - Scenario #3: Create a Single Version of Truth

### Create a view

Create a view that combines the January 2016 Green company DAS table with the historical data residing on S3, to make a single table exclusively for the Green data scientists. Use CTAS to create a table with data from January 2016 for the Green company. Compare the runtime to populate this with the COPY runtime earlier.
**Hint:**

```sql
CREATE TABLE workshop_das.taxi_201601 AS SELECT * FROM ant321.NYTaxiRides WHERE year = 2016 AND month = 1 AND type = 'green';
```
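One way to compare this CTAS against the earlier COPY (a sketch): pull recent statement runtimes from the STL_QUERY system table:

```sql
-- Recent statements with their elapsed time, newest first.
SELECT query, TRIM(querytxt) AS sql_text,
       DATEDIFF(seconds, starttime, endtime) AS elapsed_seconds
FROM stl_query
ORDER BY starttime DESC
LIMIT 20;
```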

458 |
Note: What about column compression/encoding? Remember that on a CTAS, Amazon Redshift automatically assigns compression encoding as follows:

* Columns that are defined as sort keys are assigned RAW compression.
* Columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types are assigned RAW compression.
* All other columns are assigned LZO compression.

```
https://docs.aws.amazon.com/redshift/latest/dg/r_CTAS_usage_notes.html
```
Here's the ANALYZE COMPRESSION output in case you want to use it:

![GitHub Logo](/images/analyze_compression.png)

### Complete populating the table

Add to the January 2016 table with an INSERT/SELECT statement for the other taxi companies.
**Hint:**

```sql
INSERT INTO workshop_das.taxi_201601 (SELECT * FROM ant321.NYTaxiRides WHERE year = 2016 AND month = 1 AND type != 'green');
```
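A quick check that all three companies are now present (a sketch; `type` comes along because the Spectrum SELECT * includes the partition columns):

```sql
-- Expect fhv, green, and yellow, each with a non-zero count.
SELECT type, COUNT(*)
FROM workshop_das.taxi_201601
GROUP BY type
ORDER BY type;
```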

489 |
### Create a new Spectrum table

Create a new Spectrum table **ant321.NYTaxiRides** (or simply drop the January 2016 partitions).
**Hint:**

```sql
ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=1, type='fhv');
ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=1, type='green');
ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=1, type='yellow');
```
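To confirm the partitions are gone, the SVV_EXTERNAL_PARTITIONS system view lists what Spectrum still sees (a sketch):

```sql
-- The January 2016 partitions should no longer be listed.
SELECT values, location
FROM svv_external_partitions
WHERE schemaname = 'ant321' AND tablename = 'nytaxirides';
```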

505 |
506 | 507 | 508 | ### Create a view with no Schema Binding 509 | Create a view **ant321_view_NYTaxiRides** from **workshop_das.taxi_201601** that allows seamless querying of the DAS and Spectrum data. 510 | 511 |
**Hint:**

```sql
CREATE VIEW ant321_view_NYTaxiRides AS
SELECT * FROM workshop_das.taxi_201601
UNION ALL
SELECT * FROM ant321.NYTaxiRides
WITH NO SCHEMA BINDING
;
```

### Is it surprising this is valid SQL?

- Note the use of the partition columns in the SELECT and WHERE clauses. Where were those columns in your Spectrum table definition?
- Note the filters being applied either at the partition or file levels in the Spectrum portion of the query (versus the Redshift DAS section).
- If you actually run the query (and not just generate the explain plan), does the runtime surprise you? Why or why not?

```sql
EXPLAIN SELECT year, month, type, COUNT(*) FROM ant321_view_NYTaxiRides WHERE year = 2016 AND month IN (1) AND passenger_count = 4 GROUP BY 1,2,3 ORDER BY 1,2,3;
```

540 | QUERY PLAN
541 | XN Merge  (cost=1000090025653.20..1000090025653.21 rows=2 width=48)
542 |   Merge Key: derived_col1, derived_col2, derived_col3
543 |   ->  XN Network  (cost=1000090025653.20..1000090025653.21 rows=2 width=48)
544 |         Send to leader
545 |         ->  XN Sort  (cost=1000090025653.20..1000090025653.21 rows=2 width=48)
546 |               Sort Key: derived_col1, derived_col2, derived_col3
547 |               ->  XN HashAggregate  (cost=90025653.19..90025653.19 rows=2 width=48)
548 |                     ->  XN Subquery Scan ant321_view_nytaxirides  (cost=25608.12..90025653.17 rows=2 width=48)
549 |                           ->  XN Append  (cost=25608.12..90025653.15 rows=2 width=38)
550 |                                 ->  XN Subquery Scan "*SELECT* 1"  (cost=25608.12..25608.13 rows=1 width=18)
551 |                                       ->  XN HashAggregate  (cost=25608.12..25608.12 rows=1 width=18)
552 |                                             ->  XN Seq Scan on t201601_pqt  (cost=0.00..25292.49 rows=31563 width=18)
553 |                                                   Filter: ((passenger_count = 4) AND ("month" = 1) AND ("year" = 2016))
554 |                                 ->  XN Subquery Scan "*SELECT* 2"  (cost=90000045.00..90000045.02 rows=1 width=38)
555 |                                       ->  XN HashAggregate  (cost=90000045.00..90000045.01 rows=1 width=38)
556 |                                             ->  XN Partition Loop  (cost=90000000.00..90000035.00 rows=1000 width=38)
557 |                                                   ->  XN Seq Scan PartitionInfo of ant321.nytaxirides  (cost=0.00..15.00 rows=1 width=30)
558 |                                                         Filter: (("month" = 1) AND ("year" = 2016))
559 |                                                   ->  XN S3 Query Scan nytaxirides  (cost=45000000.00..45000010.00 rows=1000 width=8)
560 |                                                         ->  S3 Aggregate  (cost=45000000.00..45000000.00 rows=1000 width=0)
561 |                                                               ->  S3 Seq Scan ant321.nytaxirides location:"s3://us-west-2.serverless-analytics/canonical/NY-Pub" format:PARQUET  (cost=0.00..37500000.00 rows=3000000000 width=0)
562 |                                                                    Filter: (passenger_count = 4)
563 | 
564 | 565 | - Now include Spectrum data by adding a month whose data is in Spectrum 566 | 567 | ``` 568 | EXPLAIN SELECT year, month, type, COUNT(*) FROM ant321_view_NYTaxiRides WHERE year = 2016 AND month IN (1,2) AND passenger_count = 4 GROUP BY 1,2,3 ORDER BY 1,2,3; 569 | 570 | ``` 571 | 572 |

573 | QUERY PLAN
574 | XN Merge  (cost=1000090029268.92..1000090029268.92 rows=2 width=48)
575 |   Merge Key: derived_col1, derived_col2, derived_col3
576 |   ->  XN Network  (cost=1000090029268.92..1000090029268.92 rows=2 width=48)
577 |         Send to leader
578 |         ->  XN Sort  (cost=1000090029268.92..1000090029268.92 rows=2 width=48)
579 |               Sort Key: derived_col1, derived_col2, derived_col3
580 |               ->  XN HashAggregate  (cost=90029268.90..90029268.90 rows=2 width=48)
581 |                     ->  XN Subquery Scan ant321_view_nytaxirides  (cost=29221.33..90029268.88 rows=2 width=48)
582 |                           ->  XN Append  (cost=29221.33..90029268.86 rows=2 width=38)
583 |                                 ->  XN Subquery Scan "*SELECT* 1"  (cost=29221.33..29221.34 rows=1 width=18)
584 |                                       ->  XN HashAggregate  (cost=29221.33..29221.33 rows=1 width=18)
585 |                                             ->  XN Seq Scan on t201601_pqt  (cost=0.00..28905.70 rows=31563 width=18)
586 |                                                   Filter: ((passenger_count = 4) AND ("year" = 2016) AND (("month" = 1) OR ("month" = 2))) 
587 |                                 ->  XN Subquery Scan "*SELECT* 2"  (cost=90000047.50..90000047.52 rows=1 width=38)
588 |                                       ->  XN HashAggregate  (cost=90000047.50..90000047.51 rows=1 width=38)
589 |                                             ->  XN Partition Loop  (cost=90000000.00..90000037.50 rows=1000 width=38)
590 |                                                   ->  XN Seq Scan PartitionInfo of ant321.nytaxirides  (cost=0.00..17.50 rows=1 width=30)
591 |                                                         Filter: (("year" = 2016) AND (("month" = 1) OR ("month" = 2)))
592 |                                                   ->  XN S3 Query Scan nytaxirides  (cost=45000000.00..45000010.00 rows=1000 width=8)
593 |                                                         ->  S3 Aggregate  (cost=45000000.00..45000000.00 rows=1000 width=0)
594 |                                                               ->  S3 Seq Scan ant321.nytaxirides location:"s3://us-west-2.serverless-analytics/canonical/NY-Pub" format:PARQUET  (cost=0.00..37500000.00 rows=3000000000 width=0)
595 |                                                                    Filter: (passenger_count = 4)
596 | 
597 | 598 |

599 | EXPLAIN SELECT passenger_count, COUNT(*) FROM ant321.NYTaxiRides WHERE year = 2016 AND month IN (1,2) GROUP BY 1 ORDER BY 1;
600 | 
601 | 602 |

603 | QUERY PLAN
604 | XN Merge  (cost=1000090005026.64..1000090005027.14 rows=200 width=12)
605 |   Merge Key: nytaxirides.derived_col1
606 |   ->  XN Network  (cost=1000090005026.64..1000090005027.14 rows=200 width=12)
607 |         Send to leader
608 |         ->  XN Sort  (cost=1000090005026.64..1000090005027.14 rows=200 width=12)
609 |               Sort Key: nytaxirides.derived_col1
610 |               ->  XN HashAggregate  (cost=90005018.50..90005019.00 rows=200 width=12)
611 |                     ->  XN Partition Loop  (cost=90000000.00..90004018.50 rows=200000 width=12)
612 |                           ->  XN Seq Scan PartitionInfo of ant321.nytaxirides  (cost=0.00..17.50 rows=1 width=0)
613 |                                Filter: (("year" = 2016) AND (("month" = 1) OR ("month" = 2)))
614 |                           ->  XN S3 Query Scan nytaxirides  (cost=45000000.00..45002000.50 rows=200000 width=12)
615 |                                  ->  S3 HashAggregate  (cost=45000000.00..45000000.50 rows=200000 width=4)
616 |                                       ->  S3 Seq Scan ant321.nytaxirides location:"s3://us-west-2.serverless-analytics/canonical/NY-Pub" format:PARQUET  (cost=0.00..30000000.00 rows=3000000000 width=4)
617 | 
618 | 619 |

620 | EXPLAIN SELECT type, COUNT(*) FROM ant321.NYTaxiRides WHERE year = 2016 AND month IN (1,2) GROUP BY 1 ORDER BY 1 ;
621 | 
622 | 623 |

624 | QUERY PLAN
625 | XN Merge  (cost=1000075000042.52..1000075000042.52 rows=1 width=30)
626 |   Merge Key: nytaxirides."type"
627 |   ->  XN Network  (cost=1000075000042.52..1000075000042.52 rows=1 width=30)
628 |         Send to leader
629 |         ->  XN Sort  (cost=1000075000042.52..1000075000042.52 rows=1 width=30)
630 |               Sort Key: nytaxirides."type"
631 |               ->  XN HashAggregate  (cost=75000042.50..75000042.51 rows=1 width=30)
632 |                     ->  XN Partition Loop  (cost=75000000.00..75000037.50 rows=1000 width=30)
633 |                           ->  XN Seq Scan PartitionInfo of ant321.nytaxirides  (cost=0.00..17.50 rows=1 width=22)
634 |                                Filter: (("year" = 2016) AND (("month" = 1) OR ("month" = 2)))
635 |                           ->  XN S3 Query Scan nytaxirides  (cost=37500000.00..37500010.00 rows=1000 width=8)
636 |                                 ->  S3 Aggregate  (cost=37500000.00..37500000.00 rows=1000 width=0)
637 |                                       ->  S3 Seq Scan ant321.nytaxirides location:"s3://us-west-2.serverless-analytics/canonical/NY-Pub" format:PARQUET  (cost=0.00..30000000.00 rows=3000000000 width=0)
638 | 
## Workshop - Scenario #4: Plan for the Future

* Allow for trailing 5 quarters of reporting by adding the Q4 2015 data to Redshift DAS:
  * Anticipating that we'll want to "age off" the oldest quarter on a 3-month basis, architect your DAS table to make this easy to maintain and query.
* Adjust your Redshift Spectrum table to exclude the Q4 2015 data.
* Develop and execute a plan to move the Q4 2015 data to S3.
  * What are the discrete steps to be performed?
  * What extra-Redshift functionality must be leveraged as of Monday, November 27, 2018?
* Simulating the extra-Redshift steps with the existing Parquet data, age off the Q4 2015 data from Redshift DAS and perform any needed steps to maintain a single version of the truth.

* There are several options to accomplish the goal:

![GitHub Logo](/images/table_pop_strat.png)

* Anticipating that we'll want to "age off" the oldest quarter on a 3-month basis, architect your DAS table to make this easy to maintain and query.

* How about something like this:

660 | CREATE OR REPLACE VIEW ant321_view_NYTaxiRides AS
661 | SELECT * FROM workshop_das.taxi_201504  -- Note how these are business quarters
662 | UNION ALL
663 | SELECT * FROM workshop_das.taxi_201601
664 | UNION ALL
665 | SELECT * FROM workshop_das.taxi_201602
666 | UNION ALL
667 | SELECT * FROM workshop_das.taxi_201603
668 | UNION ALL
669 | SELECT * FROM workshop_das.taxi_201604
670 | UNION ALL
671 | SELECT * FROM ant321.NYTaxiRides
672 | WITH NO SCHEMA BINDING;
673 | 
* Or something like this? Bulk DELETEs in Redshift are actually quite fast (with a one-time, single-digit-minute VACUUM), so this is also a valid configuration:

678 | CREATE OR REPLACE VIEW ant321_view_NYTaxiRides AS
679 | SELECT * FROM workshop_das.taxi_current
680 | UNION ALL
681 | SELECT * FROM ant321.NYTaxiRides
682 | WITH NO SCHEMA BINDING;
683 | 
* Don't forget a quick ANALYZE and VACUUM after completing either version.

* If needed, the Redshift DAS tables can also be populated from the Parquet data with COPY. Note: this will highlight a data design decision we made when creating the Parquet data.

**COPY with Parquet doesn't currently include a way to specify the partition columns as sources to populate the target Redshift DAS table. The current expectation is that since there's no overhead (performance-wise) and little cost in also storing the partition data as actual columns on S3, customers will store the partition column data as well.**

* We're going to show how to work with the scenario where this pattern wasn't followed. Use the single-table option for this example:

```sql
CREATE TABLE workshop_das.taxi_current DISTSTYLE EVEN SORTKEY(year, month, type) AS SELECT * FROM ant321.NYTaxiRides WHERE 1 = 0;
```

* And create a helper table that doesn't include the partition columns from the Redshift Spectrum table.

```sql
CREATE TABLE workshop_das.taxi_loader AS SELECT vendorid, pickup_datetime, dropoff_datetime, ratecode, passenger_count, trip_distance, fare_amount, total_amount, payment_type FROM workshop_das.taxi_current WHERE 1 = 0;
```

### Parquet copy continued

* The population could be scripted easily; there are also a few different patterns that could be followed (this isn't the only one):
  - Start the Green loop.
  - Q4 2015:
	
714 | COPY workshop_das.taxi_loader FROM 's3://us-west-2.serverless-analytics/canonical/NY-Pub/year=2015/month=10/type=green' IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/mySpectrumRole' FORMAT AS PARQUET;
715 | COPY workshop_das.taxi_loader FROM 's3://us-west-2.serverless-analytics/canonical/NY-Pub/year=2015/month=11/type=green' IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/mySpectrumRole' FORMAT AS PARQUET;
716 | COPY workshop_das.taxi_loader FROM 's3://us-west-2.serverless-analytics/canonical/NY-Pub/year=2015/month=12/type=green' IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/mySpectrumRole' FORMAT AS PARQUET;
717 | -- All 2016:
718 | COPY workshop_das.taxi_loader FROM 's3://us-west-2.serverless-analytics/canonical/NY-Pub/year=2016/month=1/type=green' IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/mySpectrumRole' FORMAT AS PARQUET;
719 | COPY workshop_das.taxi_loader FROM 's3://us-west-2.serverless-analytics/canonical/NY-Pub/year=2016/month=2/type=green' IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/mySpectrumRole' FORMAT AS PARQUET;
720 | COPY workshop_das.taxi_loader FROM 's3://us-west-2.serverless-analytics/canonical/NY-Pub/year=2016/month=3/type=green' IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/mySpectrumRole' FORMAT AS PARQUET;
721 | COPY workshop_das.taxi_loader FROM 's3://us-west-2.serverless-analytics/canonical/NY-Pub/year=2016/month=4/type=green' IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/mySpectrumRole' FORMAT AS PARQUET;
722 | COPY workshop_das.taxi_loader FROM 's3://us-west-2.serverless-analytics/canonical/NY-Pub/year=2016/month=5/type=green' IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/mySpectrumRole' FORMAT AS PARQUET;
723 | COPY workshop_das.taxi_loader FROM 's3://us-west-2.serverless-analytics/canonical/NY-Pub/year=2016/month=6/type=green' IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/mySpectrumRole' FORMAT AS PARQUET;
724 | COPY workshop_das.taxi_loader FROM 's3://us-west-2.serverless-analytics/canonical/NY-Pub/year=2016/month=7/type=green' IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/mySpectrumRole' FORMAT AS PARQUET;
725 | COPY workshop_das.taxi_loader FROM 's3://us-west-2.serverless-analytics/canonical/NY-Pub/year=2016/month=8/type=green' IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/mySpectrumRole' FORMAT AS PARQUET;
726 | COPY workshop_das.taxi_loader FROM 's3://us-west-2.serverless-analytics/canonical/NY-Pub/year=2016/month=9/type=green' IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/mySpectrumRole' FORMAT AS PARQUET;
727 | COPY workshop_das.taxi_loader FROM 's3://us-west-2.serverless-analytics/canonical/NY-Pub/year=2016/month=10/type=green' IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/mySpectrumRole' FORMAT AS PARQUET;
728 | COPY workshop_das.taxi_loader FROM 's3://us-west-2.serverless-analytics/canonical/NY-Pub/year=2016/month=11/type=green' IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/mySpectrumRole' FORMAT AS PARQUET;
729 | COPY workshop_das.taxi_loader FROM 's3://us-west-2.serverless-analytics/canonical/NY-Pub/year=2016/month=12/type=green' IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/mySpectrumRole' FORMAT AS PARQUET;
730 | 
731 | 732 |

733 | INSERT INTO workshop_das.taxi_current SELECT *, DATE_PART(year,pickup_datetime), DATE_PART(month,pickup_datetime), 'green' FROM workshop_das.taxi_loader;
734 | 
735 | TRUNCATE workshop_das.taxi_loader;
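-- (Sketch) The Yellow loop follows the same COPY/INSERT/TRUNCATE pattern; one month shown, repeat for the remaining months:
-- COPY workshop_das.taxi_loader FROM 's3://us-west-2.serverless-analytics/canonical/NY-Pub/year=2015/month=10/type=yellow' IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/mySpectrumRole' FORMAT AS PARQUET;
-- INSERT INTO workshop_das.taxi_current SELECT *, DATE_PART(year,pickup_datetime), DATE_PART(month,pickup_datetime), 'yellow' FROM workshop_das.taxi_loader;
-- TRUNCATE workshop_das.taxi_loader;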
736 | 
737 | 738 | - Similarly, start Yellow loop. 739 | 740 | ### Redshift Spectrum can, of course, also be used to populate the table(s). 741 | 742 |

743 | INSERT INTO  workshop_das.taxi_201601    SELECT * FROM ant321.NYTaxiRides WHERE year = 2016 AND month IN (2,3); /* Need to complete the first quarter of 2016.*/
744 | CREATE TABLE workshop_das.taxi_201602 AS SELECT * FROM ant321.NYTaxiRides WHERE year = 2016 AND month IN (4,5,6);
745 | CREATE TABLE workshop_das.taxi_201603 AS SELECT * FROM ant321.NYTaxiRides WHERE year = 2016 AND month IN (7,8,9);
746 | CREATE TABLE workshop_das.taxi_201604 AS SELECT * FROM ant321.NYTaxiRides WHERE year = 2016 AND month IN (10,11,12);
747 | 
748 | 749 | ### Adjust your Redshift Spectrum table to exclude the Q4 2015 data 750 | 751 |

752 | WITH generate_smallint_series AS (select row_number() over () as n from workshop_das.green_201601_csv limit 65536)
753 | , part_years AS (select n AS year_num from generate_smallint_series where n between 2015 and 2016)
754 | , part_months AS (select n AS month_num from generate_smallint_series where n between 1 and 12)
755 | , taxi_companies AS (SELECT 'fhv' taxi_vendor UNION ALL SELECT 'green' UNION ALL SELECT 'yellow')
756 | 
757 | SELECT 'ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=' || year_num || ', month=' || month_num || ', type=\'' || taxi_vendor || '\');'
758 | FROM part_years, part_months, taxi_companies WHERE year_num = 2016 or (year_num = 2015 and month_num IN (10,11,12)) ORDER BY year_num, month_num;
759 | 
760 | Or
761 | 
762 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2015, month=10, type='fhv');
763 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2015, month=10, type='yellow');
764 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2015, month=10, type='green');
765 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2015, month=11, type='yellow');
766 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2015, month=11, type='fhv');
767 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2015, month=11, type='green');
768 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2015, month=12, type='yellow');
769 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2015, month=12, type='fhv');
770 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2015, month=12, type='green');
771 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=1, type='yellow');
772 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=1, type='fhv');
773 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=1, type='green');
774 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=2, type='yellow');
775 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=2, type='fhv');
776 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=2, type='green');
777 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=3, type='yellow');
778 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=3, type='fhv');
779 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=3, type='green');
780 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=4, type='yellow');
781 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=4, type='fhv');
782 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=4, type='green');
783 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=5, type='yellow');
784 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=5, type='fhv');
785 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=5, type='green');
786 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=6, type='yellow');
787 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=6, type='fhv');
788 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=6, type='green');
789 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=7, type='yellow');
790 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=7, type='fhv');
791 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=7, type='green');
792 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=8, type='yellow');
793 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=8, type='fhv');
794 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=8, type='green');
795 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=9, type='yellow');
796 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=9, type='fhv');
797 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=9, type='green');
798 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=10, type='yellow');
799 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=10, type='fhv');
800 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=10, type='green');
801 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=11, type='yellow');
802 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=11, type='fhv');
803 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=11, type='green');
804 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=12, type='yellow');
805 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=12, type='fhv');
806 | ALTER TABLE ant321.NYTaxiRides DROP PARTITION(year=2016, month=12, type='green');
807 | 
* Now, regardless of method, there's a view covering the trailing 5 quarters in Redshift DAS, and all of time on Redshift Spectrum, completely transparent to users of the view. What would be the steps to "age off" the Q4 2015 data?

  * Put a copy of the data from the Redshift DAS table to S3. Listen closely this week for a possible announcement around this step! What would be the command(s)?
    * UNLOAD to Parquet.
  * Extend the Redshift Spectrum table to cover the Q4 2015 data with Redshift Spectrum.
    * ADD Partition.
  * Remove the data from the Redshift DAS table:
    * Either DELETE or DROP TABLE (depending on the implementation).

**You have already done all of the steps in previous scenarios of this workshop. You have the toolset in your mind to do this!**

## Important Reminder: Please go ahead and delete your Redshift cluster at the end of this workshop. For a quick how-to on that:

```
https://docs.aws.amazon.com/redshift/latest/mgmt/managing-clusters-console.html#delete-cluster
```
-------------------------------------------------------------------------------- /images/analyze_compression.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-redshift-tiered-storage/d1105d714776e544e52f6972827c6dc6890ba73a/images/analyze_compression.png -------------------------------------------------------------------------------- /images/jan_file_head.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-redshift-tiered-storage/d1105d714776e544e52f6972827c6dc6890ba73a/images/jan_file_head.png -------------------------------------------------------------------------------- /images/query_editor.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-redshift-tiered-storage/d1105d714776e544e52f6972827c6dc6890ba73a/images/query_editor.png -------------------------------------------------------------------------------- /images/redshift_arch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-redshift-tiered-storage/d1105d714776e544e52f6972827c6dc6890ba73a/images/redshift_arch.png -------------------------------------------------------------------------------- /images/redshift_launch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-redshift-tiered-storage/d1105d714776e544e52f6972827c6dc6890ba73a/images/redshift_launch.png -------------------------------------------------------------------------------- /images/s3_ls.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-redshift-tiered-storage/d1105d714776e544e52f6972827c6dc6890ba73a/images/s3_ls.png -------------------------------------------------------------------------------- /images/table_pop_strat.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-redshift-tiered-storage/d1105d714776e544e52f6972827c6dc6890ba73a/images/table_pop_strat.png --------------------------------------------------------------------------------