├── README.md
├── redshiftprimer.txt
├── 1. launchredshift.txt
├── 2. connecttoredshift.txt
├── 3. loaddata.txt
├── 4 .runQueries.txt
├── 5. jointables.txt
└── 6. Analyze Performance.txt

--------------------------------------------------------------------------------
/2. connecttoredshift.txt:
--------------------------------------------------------------------------------
In this task, you will use a web-based PostgreSQL client ("pgweb") to connect to Redshift.

Copy the pgweb IP address shown to the left of these instructions.
This is the IP address of a web server that is running the pgweb software.

Open a new tab in your web browser, paste the IP address, and press Enter.
You will be presented with the pgweb login screen.

Configure the following settings:
Host: the cluster endpoint (see below)
Username: master
Password: Redshift123
Database: lab
Port: 5439 (which is different from the default value)

To the right of the screen, copy the Endpoint to your clipboard.
The endpoint will look similar to: lab.czvdbh5dsk9y.us-west-2.redshift.amazonaws.com:5439/lab

Remove the :5439/lab ending so that the Host value ends with: .com

--------------------------------------------------------------------------------
/1. launchredshift.txt:
--------------------------------------------------------------------------------
Create a cluster with the following configuration:

Cluster identifier: lab
Node type: dc2.large
Nodes: 2

This lab uses the dc2.large node type, which has 160 GB of storage per node. You will use two nodes for this lab, but the type and number of nodes in a Redshift cluster can be changed at any time to provide extra storage and faster data processing.

Scroll down to the Database configurations section, then configure:
Master user name: master
Master user password: Redshift123

Expand Cluster permissions, then configure:
Available IAM roles: Redshift-Role
Click Associate IAM role

Next to Additional configurations, click the slider so that the default settings are not used.

Expand Network and security, then configure:
Virtual private cloud (VPC): Lab VPC
VPC security groups:
  Select Redshift Security Group
  Remove Default

Expand Database configurations, then configure:
Database name: lab
Database port: 5439

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Bigdata-on-AWS

In this project, I worked with an Amazon Redshift cluster to analyze USA domestic flight data.

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools. It is optimized for datasets ranging from a few hundred gigabytes to a petabyte or more, and it costs less than $1,000 per terabyte per year, about a tenth the cost of most traditional data warehousing solutions.

Amazon Redshift delivers fast query and I/O performance for virtually any size dataset by using columnar storage technology and by parallelizing and distributing queries across multiple nodes. It also automates most of the common administrative tasks associated with provisioning, configuring, monitoring, backing up, and securing a data warehouse.
## Project curriculum

- Launch an Amazon Redshift cluster
- Connect to Amazon Redshift by using SQL client software
- Load data from Amazon S3 into Amazon Redshift
- Query data from Amazon Redshift
- Monitor Amazon Redshift performance

--------------------------------------------------------------------------------
/5. jointables.txt:
--------------------------------------------------------------------------------
In this task, you will load more data and then run queries that join information between tables.

Run this query to create a new table for aircraft information:

CREATE TABLE aircraft (
  aircraft_code CHAR(3) SORTKEY,
  aircraft VARCHAR(100)
);

A new aircraft table will appear in the left-side table list.

Paste the following text into pgweb, but do not run it yet:

COPY aircraft
FROM 's3://us-west-2-aws-training/awsu-spl/spl-17/4.2.9.prod/data/lookup_aircraft.csv'
IAM_ROLE 'INSERT-YOUR-REDSHIFT-ROLE'
IGNOREHEADER 1
DELIMITER ','
REMOVEQUOTES
TRUNCATECOLUMNS
REGION 'us-west-2';

Replace INSERT-YOUR-REDSHIFT-ROLE with the RedshiftRole value shown to the left of these instructions, then run the query.

This will load 383 different types of aircraft flown by the carriers.

Run this query to view 10 random rows of aircraft data:

SELECT *
FROM aircraft
ORDER BY random()
LIMIT 10;

The table contains an aircraft code and an aircraft description. The two tables can be joined together to provide useful information.

Run this query to view the most-flown types of aircraft:

SELECT
  aircraft,
  SUM(departures) AS trips
FROM flights
JOIN aircraft USING (aircraft_code)
GROUP BY aircraft
ORDER BY trips DESC
LIMIT 10;

The results show the friendly names of the aircraft that are flown on the most trips. The JOIN links the flights table with the aircraft table.

--------------------------------------------------------------------------------
/4 .runQueries.txt:
--------------------------------------------------------------------------------
Run this query to count the rows that were loaded:

SELECT COUNT(*) FROM flights;

Run this query to view 10 random rows of flight data:

SELECT *
FROM flights
ORDER BY random()
LIMIT 10;

This query assigns a random number to all 96 million rows, sorts them by that random number, and then returns the first 10 results.

Now that you have the data loaded, the next step is to perform queries to find underlying patterns in the data and to help drive business decisions.

SELECT
  carrier,
  SUM(departures)
FROM flights
GROUP BY carrier
ORDER BY 2 DESC
LIMIT 10;

Run the above query.
Question: Who are the top 3 carriers by number of departures?

Change departures to passengers and run it again to view the top carriers by passengers carried.
Question: Who are the top 3 carriers by passengers carried?

Change departures to miles and run it again to view the top carriers by miles flown.
Question: Who are the top 3 carriers by miles flown?

Change departures to passengers * miles and run it again to view the top carriers by passenger-miles (see the sketch below).
Question: Who are the top 3 carriers by passenger-miles?

Change departures to freight_pounds and run it again to view the top carriers by freight transported.
Question: Who are the top 3 carriers of freight? (You should be able to guess this one!)
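For reference, the passenger-miles variation reads as follows (a sketch of the substitution described above; it uses only columns already defined in the flights table):

SELECT
  carrier,
  SUM(passengers * miles) AS passenger_miles
FROM flights
GROUP BY carrier
ORDER BY 2 DESC
LIMIT 10;
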
Each of these queries performs calculations against almost 100 million rows of data, yet each takes only a few seconds to run. Adding compute nodes would make the queries run even faster.

--------------------------------------------------------------------------------
/redshiftprimer.txt:
--------------------------------------------------------------------------------
Nodes & Clusters
An Amazon Redshift data warehouse is a collection of computing resources called nodes. This collection of nodes is called a cluster. When you provision a cluster, you specify the type and the number of nodes that will make up the cluster. The node type determines the storage size, memory, CPU, and price of each node in the cluster.

Scalability
If your storage and performance needs change after you initially provision your cluster, you can scale the cluster in or out by adding or removing nodes, scale it up or down by specifying a different node type, or do both. Resizing the cluster in either way involves minimal downtime. Resizing replaces the old cluster at the end of the resize operation; when you submit a resize request, the source cluster remains in read-only mode until the resize operation is complete.

Parallel Processing
Amazon Redshift distributes the workload to each node in a cluster and processes work in parallel, allowing processing speed to scale along with storage.

Columnar Storage
Columnar storage for database tables is an important factor in optimizing analytic query performance because it drastically reduces the overall disk I/O requirements and reduces the amount of data you need to load from disk.

Rather than storing data values together for a whole row, Amazon Redshift stores data by column. This means that operations on a column require less disk I/O.

Compression
Compression is a column-level operation that reduces the size of data when it is stored. Compression conserves storage space and reduces the size of data that is read from storage, which reduces the amount of disk I/O and therefore improves query performance.

Snapshots as Backups
Snapshots are point-in-time backups of a cluster. You can create snapshots automatically or manually. Amazon Redshift stores these snapshots internally in Amazon S3 using an encrypted Secure Sockets Layer (SSL) connection. If you need to restore a cluster, Amazon Redshift creates a new cluster and imports data from the snapshot that you specify.

Integrates With Existing Business Intelligence Tools
Amazon Redshift uses industry-standard SQL and is accessed using standard JDBC and ODBC drivers, so your existing business intelligence tools can easily integrate with Amazon Redshift.

The typical process for loading data into Amazon Redshift is:

1. Data is exported from a source system (for example, a company database).
2. The data is placed into an Amazon S3 bucket, preferably in a compressed format to save storage space.
3. The data is copied into Amazon Redshift tables via the COPY command.
4. The SQL client is used to query Amazon Redshift.
5. The results of the query are returned to the SQL client.

(A sketch of steps 3 and 4 appears at the end of this primer.)

In this lab, you will load USA domestic airline data for analysis. The data has been obtained from the United States Department of Transportation's Bureau of Transportation Statistics.

The transport data has already been placed into an Amazon S3 bucket in a compressed format. This lab will lead you through the steps of loading the data into an Amazon Redshift cluster and then running queries to analyze the data.
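To make the process concrete, here is a minimal sketch of steps 3 and 4. The table, bucket, and IAM role names are hypothetical; the lab itself uses the real names given in the later tasks.

-- Hypothetical target table for data that was staged in S3 as compressed CSV files.
CREATE TABLE sales (
  sale_date DATE,
  store_id  INTEGER,
  amount    DECIMAL(10,2)
);

-- Step 3: copy the staged files into the table (bucket and role ARN are placeholders).
COPY sales
FROM 's3://example-bucket/exports/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
GZIP
DELIMITER ','
REGION 'us-west-2';

-- Step 4: query the loaded data from the SQL client.
SELECT store_id, SUM(amount) AS total_sales
FROM sales
GROUP BY store_id
ORDER BY total_sales DESC;
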
--------------------------------------------------------------------------------
/3. loaddata.txt:
--------------------------------------------------------------------------------
In this task, you will create a table in Amazon Redshift. Tables are used to store a particular set of information.

Copy and paste the following text into pgweb (above the Run Query button), then run it:

CREATE TABLE flights (
  year smallint,
  month smallint,
  day smallint,
  carrier varchar(80) DISTKEY,
  origin char(3),
  dest char(3),
  aircraft_code char(3),
  miles int,
  departures int,
  minutes int,
  seats int,
  passengers int,
  freight_pounds int
);

A new flights table will appear on the left of the screen, under the Tables heading.

You can now load data into the table. The data has already been placed into an Amazon S3 bucket and can be loaded into Amazon Redshift by using the COPY command.

Paste the following command, replacing INSERT-YOUR-REDSHIFT-ROLE with the RedshiftRole value shown to the left of these instructions, then run it:

COPY flights
FROM 's3://us-west-2-aws-training/awsu-spl/spl-17/4.2.9.prod/data/flights-usa'
IAM_ROLE 'INSERT-YOUR-REDSHIFT-ROLE'
GZIP
DELIMITER ','
REMOVEQUOTES
REGION 'us-west-2';

The COPY command is used to load data into Amazon Redshift:

FROM: Indicates where the data is located.
IAM_ROLE: Provides the permissions to access the data being loaded.
GZIP: Indicates that the data has been compressed (zipped); Amazon Redshift automatically decompresses the data when it is loaded.
DELIMITER: Indicates that data items are separated by a comma.
REMOVEQUOTES: Tells Amazon Redshift to remove quotation marks that surround values in the data.
REGION: Indicates which AWS Region contains the S3 bucket.

The data being loaded consists of:

23 data files in CSV format (one for each year from 1990 to 2012)
Comprising 6 GB of data
Compressed with GZIP down to only 700 MB of storage

The data files are loaded in parallel from Amazon S3. This is the most efficient way to load data into Amazon Redshift because the load process is distributed across multiple slices on all available nodes.

Each slice of a compute node is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node. The leader node distributes data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation.

When you create a table, you can optionally specify one column as the distribution key. When the table is loaded with data, the rows are distributed to the node slices according to the distribution key. Choosing a good distribution key enables Amazon Redshift to use parallel processing to load data and execute queries efficiently.

The CREATE TABLE command you ran earlier designated the carrier (airline) field as the distribution key (DISTKEY). This means the data is split across all available slices and nodes, but all data related to a particular carrier always resides on the same slice. This improves processing speed when performing operations on the carrier field, such as GROUP BY and JOIN operations.
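
Finally, note that if the COPY command had reported an error, Amazon Redshift records the details in the STL_LOAD_ERRORS system table. A quick way to inspect recent load problems (a sketch; the columns shown are a subset of what the table provides) is:

-- Show the most recent load errors, if any.
SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;
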
--------------------------------------------------------------------------------
/6. Analyze Performance.txt:
--------------------------------------------------------------------------------
You can use the EXPLAIN command to view how Amazon Redshift processes queries.

Run this query using the Run Query button (not the Explain Query button):

SET enable_result_cache_for_session TO OFF;

EXPLAIN
SELECT
  aircraft,
  SUM(departures) AS trips
FROM flights
JOIN aircraft USING (aircraft_code)
GROUP BY aircraft
ORDER BY trips DESC
LIMIT 10;

It is the same as the previous query, but prefixed by the EXPLAIN command.

This command will return an Explain Plan similar to this:

XN Limit (cost=1000156830987.88..1000156830987.90 rows=10 width=29)
  -> XN Merge (cost=1000156830987.88..1000156830988.84 rows=383 width=29)
     Merge Key: sum(flights.departures)
     -> XN Network (cost=1000156830987.88..1000156830988.84 rows=383 width=29)
        Send to leader
        -> XN Sort (cost=1000156830987.88..1000156830988.84 rows=383 width=29)
           Sort Key: sum(flights.departures)
           -> XN HashAggregate (cost=156830970.49..156830971.44 rows=383 width=29)
              -> XN Hash Join DS_BCAST_INNER (cost=4.79..156346841.73 rows=96825752 width=29)
                 Hash Cond: ("outer".aircraft_code = "inner".aircraft_code)
                 -> XN Seq Scan on flights (cost=0.00..968257.52 rows=96825752 width=11)
                 -> XN Hash (cost=3.83..3.83 rows=383 width=32)
                    -> XN Seq Scan on aircraft (cost=0.00..3.83 rows=383 width=32)

The plan shows the logical steps that Amazon Redshift will perform when running the query. Reading the Explain Plan from the bottom up, it displays a breakdown of the logical operations needed to perform the query, as well as an indication of their relative processing cost and the amount of data that needs to be processed. By analyzing the plan, you can often identify opportunities to improve query performance.

In traditional databases, a sequential scan (Seq Scan) across many rows of data can be very inefficient and is normally improved by adding an index. However, Amazon Redshift does not use indexes, yet it is able to perform extremely fast queries across huge quantities of data; in this case, it scans over 96 million rows in a few seconds.


Data Compression & Column-based Storage
Data in Amazon Redshift is stored as columns. This is faster than storing data as rows, since most queries only require a few columns of data. It also allows Amazon Redshift to compress the data within each column.

When data was loaded with the COPY command earlier in this lab, Amazon Redshift performed a compression analysis to identify the optimal way to store each column. You can view the results of the analysis by using the ANALYZE COMPRESSION command.

Run this command to analyze the data stored in the flights table:

ANALYZE COMPRESSION flights;

Amazon Redshift will display recommended compression settings for the data.

Compression is a column-level operation that reduces the size of data when it is stored. Possible compression methods are:

Byte dictionary: References up to 256 possible values with a single byte. Ideal for fields with few, but frequently repeated, values, such as country names.
Delta: Compresses data by recording the difference between values that follow each other in the column.
LZO: Provides a very high compression ratio with good performance. Works well for columns that store very long character strings, especially free-form text such as product descriptions, user comments, or JSON strings.
Mostly: Compresses the majority of the values in the column to a smaller standard storage size.
Run-length: Replaces a value that is repeated consecutively with a token that consists of the value and a count of the number of consecutive occurrences (the length of the run). Best suited to a table in which data values are often repeated consecutively, for example, when the table is sorted by those values.
Text: Compresses VARCHAR columns in which the same words recur often.
Zstandard: Provides a high compression ratio with very good performance across diverse data sets. Works especially well with CHAR and VARCHAR columns that store a wide range of long and short strings, such as product descriptions, user comments, logs, and JSON strings.
Raw: Uncompressed.

When data is compressed, information can be retrieved from disk faster. Compression conserves storage space and reduces the amount of disk I/O, which in turn improves query performance.
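To see which encoding was actually applied to each column of the flights table, you can query the PG_TABLE_DEF system view (a sketch; it assumes the table's schema is on your current search_path):

-- List each column of the flights table with its data type and compression encoding.
SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'flights';
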
Creating tables from other tables
It is often necessary to manipulate data to make the information more meaningful. Amazon Redshift has the ability to create new tables based upon data from existing tables.

For example, if you want to closely analyze data on passengers who fly to Las Vegas, you can create a table containing only those flights that flew to Las Vegas.

You will now load a table that converts 3-letter airport codes (e.g. 'LAS') into easily readable city names ('Las Vegas').

Run this query to create a new table for airport information:

CREATE TABLE airports (
  airport_code CHAR(3) SORTKEY,
  airport varchar(100)
);

A new airports table will appear in the left-side list of tables.

Run this command, replacing INSERT-YOUR-REDSHIFT-ROLE with the RedshiftRole value shown to the left of these instructions:

COPY airports
FROM 's3://us-west-2-aws-training/awsu-spl/spl-17/4.2.9.prod/data/lookup_airports.csv'
IAM_ROLE 'INSERT-YOUR-REDSHIFT-ROLE'
IGNOREHEADER 1
DELIMITER ','
REMOVEQUOTES
TRUNCATECOLUMNS
REGION 'us-west-2';

This loads a list of 6,265 airports.

Next, combine the flights and airports information into a new table that only includes flights that went to Las Vegas.

Run this query to create a new table about Las Vegas flights:

CREATE TABLE vegas_flights
DISTKEY (origin)
SORTKEY (origin)
AS
SELECT
  flights.*,
  airport
FROM flights
JOIN airports ON origin = airport_code
WHERE dest = 'LAS';

A new vegas_flights table will appear in the left-side list of tables.

Queries can now be run against this new vegas_flights table.

Run this query to discover where the most popular flights to Las Vegas originate:

SELECT
  airport,
  to_char(SUM(passengers), '999,999,999') AS passengers
FROM vegas_flights
GROUP BY airport
ORDER BY SUM(passengers) DESC
LIMIT 10;

Question: Which airport sends the most passengers to Las Vegas?
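
As a further exercise, you could look at how Las Vegas traffic has changed over time. A sketch of such a query, using only columns already present in vegas_flights:

-- Total passengers flown to Las Vegas per year.
SELECT
  year,
  to_char(SUM(passengers), '999,999,999') AS passengers
FROM vegas_flights
GROUP BY year
ORDER BY year;
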
These queries also demonstrate the PostgreSQL to_char function, which formats output in a human-friendly way.

Creating new tables in this manner can improve performance, since queries only need to scan a subset of the data.

Examining Disk Space and Data Distribution
Data in Amazon Redshift is distributed across multiple nodes and hard disks.

Run this query to see how much disk capacity has been used:

SELECT
  owner AS node,
  diskno,
  used,
  capacity,
  used/capacity::numeric * 100 AS percent_used
FROM stv_partitions
WHERE host = node
ORDER BY 1, 2;

The output shows:

Node: The node within the cluster.
Diskno: The disk number. Nodes can have multiple drives, allowing data to be accessed in parallel.
Used: Megabytes of disk space used.
Capacity: Available disk space in megabytes. There is 160 GB of storage per node, but extra space is provided for database replication.
Percent_used: Percentage of disk space used. The flight data is occupying less than 0.5% of the available disk space.

Disk usage is also available on a per-table basis.

Run this query to see how much space is taken by each of the data tables:

SELECT
  name,
  count(*)
FROM stv_blocklist
JOIN (SELECT DISTINCT name, id AS tbl FROM stv_tbl_perm) USING (tbl)
GROUP BY name;

The amounts shown are in MB. The flights table consumes 1548 MB (1.5 GB). Each Amazon Redshift dc2.large node can hold 160 GB, which is one hundred times the amount of data currently in use.

Task 7: Explore the Amazon Redshift Console
All your interactions with Amazon Redshift so far have been via SQL. Amazon Redshift also has a management console that provides insight into the operation of the system.

Return to your web browser tab showing the Amazon Redshift console.

In the left navigation pane, click CLUSTERS.

Click lab.

Click the Query monitoring tab.

Amazon Redshift maintains information about every data load and query performed. (You can also retrieve this history with SQL; see the sketch after the console steps below.)

Click the refresh button next to Query monitoring.

In the Queries and loads window, click on one of the queries.

The information above will be updated to include details such as:

The duration of the query.
The user that ran the query.
The SQL used for the query.

Scroll down to the Query details section.

Click one of the query job numbers for COPY flights FROM… in the SQL column.

Information will be displayed showing:

The SQL used to run the query.
The Total runtime.
The Rows returned.
The Total data scanned.

Click the Query plan tab.
This shows a chart that displays system performance during the load. This information can be used to diagnose problems and to monitor system performance during regular load processes.

In the left navigation pane, click CLUSTERS.

Click lab.

Click Cluster performance to view information about the cluster.

Click Alarms. Here you will see a section for CloudWatch Alarms, which can send a notification based upon cluster metrics, such as available disk space or the health of the cluster.
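
The query history shown in the console can also be retrieved from SQL. For example, this sketch uses the SVL_QLOG system view to list recent queries with their elapsed time (in microseconds) and a snippet of the SQL text:

-- Most recent queries run against the cluster.
SELECT query, starttime, elapsed, substring
FROM svl_qlog
ORDER BY starttime DESC
LIMIT 10;
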
Snapshots
Snapshots are point-in-time backups of a cluster. You can create snapshots automatically or manually. Amazon Redshift stores these snapshots in Amazon S3. If you need to restore a cluster, Amazon Redshift creates a new cluster and imports data from the snapshot that you specify.

Amazon Redshift periodically takes automated snapshots and deletes them at the end of a retention period that you specify.

You can also take a manual snapshot whenever you wish. Manual snapshots are retained even after you delete your cluster. Manual snapshots accrue storage charges, so it is important to delete them when you no longer need them.

To reduce backup times and Amazon S3 storage requirements, Amazon Redshift uses incremental backups: each snapshot records only the cluster changes since the previous snapshot.

Amazon Redshift provides free storage for snapshots equal to the storage capacity of your cluster, until you delete the cluster. You can use this free storage for automated or manual snapshots. After the free backup storage limit is reached, you are charged for any additional storage at the normal rate.

Click the Maintenance tab.
You should see that an automatic snapshot was created.

Exporting Data
Data can also be exported to Amazon S3 with the UNLOAD command. The data can then be used in other systems, such as Amazon DynamoDB or your own applications, or it can be loaded into another Amazon Redshift cluster. The command accepts a SQL query and can export the data as fixed-width or delimited text files, which can also be compressed with GZIP and encrypted.

Delete the Cluster
This lab is now complete. The final step is to shut down the cluster.

In the Actions menu, click Delete.

De-select Create final snapshot.

Click Delete cluster.

Your cluster will now be deleted.

--------------------------------------------------------------------------------