├── LICENSE
├── README.md
└── hivetable.sql

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2020 Adit Modi

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# AWS-EMR

Analyzing Big Data with Amazon EMR

![image](https://user-images.githubusercontent.com/48589838/77052931-91a99180-69f3-11ea-83dc-e9c47f0215f7.png)

![image](https://user-images.githubusercontent.com/48589838/77053019-b0a82380-69f3-11ea-89f0-f7b8a6d2a33a.png)

![image](https://user-images.githubusercontent.com/48589838/77053034-b6056e00-69f3-11ea-82ea-5fa1bb5d14e5.png)

## Steps

### 1. Create an Amazon S3 Bucket
For more information about creating a bucket, see Create a Bucket in the Amazon Simple Storage Service Getting Started Guide. After you create the bucket, choose it from the list, choose Create folder, replace New folder with a name that meets the requirements, and choose Save.

### 2. Create an Amazon EC2 Key Pair
You must have an Amazon Elastic Compute Cloud (Amazon EC2) key pair to connect to the nodes in your cluster over a secure channel using the Secure Shell (SSH) protocol. If you already have a key pair that you want to use, you can skip this step. If you don't, follow the procedure for your operating system:

- [Creating Your Key Pair Using Amazon EC2 in the Amazon EC2 User Guide for Windows Instances](https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair)
- [Creating Your Key Pair Using Amazon EC2 in the Amazon EC2 User Guide for Linux Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair) (use this procedure for macOS as well)

### 3. Launch Your Sample Amazon EMR Cluster
Sign in to the AWS Management Console and open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

Choose Create cluster.

On the Create Cluster - Quick Options page, accept the default values except for the following fields:

- Enter a Cluster name that helps you identify the cluster, for example, My First EMR Cluster.
- Under Security and access, choose the EC2 key pair that you created in Create an Amazon EC2 Key Pair.

Choose Create cluster. (Steps 1-3 can also be scripted; see the sketches below.)
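If you prefer to script the console work, the following sketches use boto3, the AWS SDK for Python. This first one covers steps 1 and 2; the region, bucket name, and key pair name are placeholders you should replace with your own.

```python
import os

import boto3

REGION = "us-east-1"               # placeholder: the region you will run EMR in
BUCKET = "my-emr-tutorial-bucket"  # placeholder: bucket names are globally unique
KEY_NAME = "MyEMRKeyPair"          # placeholder key pair name

s3 = boto3.client("s3", region_name=REGION)
ec2 = boto3.client("ec2", region_name=REGION)

# Step 1: create the bucket. Outside us-east-1, S3 requires an explicit
# LocationConstraint, hence the conditional.
if REGION == "us-east-1":
    s3.create_bucket(Bucket=BUCKET)
else:
    s3.create_bucket(
        Bucket=BUCKET,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )

# Step 2: create the key pair. EC2 returns the private key material exactly
# once, at creation time, so save it immediately and restrict its permissions.
key = ec2.create_key_pair(KeyName=KEY_NAME)
pem_path = f"{KEY_NAME}.pem"
with open(pem_path, "w") as f:
    f.write(key["KeyMaterial"])
os.chmod(pem_path, 0o400)
```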
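The Quick Options launch in step 3 corresponds to the `run_job_flow` call. This is a sketch rather than the console's exact configuration: the release label, instance types, and instance count are reasonable assumptions you may need to adjust, and the default EMR roles referenced below must already exist (the console creates them the first time you use it).

```python
import boto3

REGION = "us-east-1"       # placeholder, as in the previous sketch
KEY_NAME = "MyEMRKeyPair"  # the key pair created in step 2

emr = boto3.client("emr", region_name=REGION)

response = emr.run_job_flow(
    Name="My First EMR Cluster",
    ReleaseLabel="emr-5.36.0",  # assumption: any recent release with Hive works
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,      # one master and two core nodes
        "Ec2KeyName": KEY_NAME,
        "KeepJobFlowAliveWhenNoSteps": True,  # stay up so steps can be added later
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default EMR roles; create them if absent
    ServiceRole="EMR_DefaultRole",
)
cluster_id = response["JobFlowId"]

# Optionally block until the cluster is up and ready to accept steps.
emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)
print("Cluster started:", cluster_id)
```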
### 4. Allow SSH Connections to the Cluster From Your Client
To remove the inbound rule that allows public access using SSH for the ElasticMapReduce-master security group, and replace it with a rule restricted to trusted clients:

The following procedure assumes that the ElasticMapReduce-master security group has not been edited previously. In addition, to edit security groups, you must be logged in to AWS as a root user or as an IAM principal that is allowed to manage security groups for the VPC that the cluster is in. For more information, see Changing Permissions for an IAM User and the example policy that allows managing EC2 security groups in the IAM User Guide.

Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

Choose Clusters, then choose the Name of the cluster.

Under Security and access, choose the Security groups for Master link.

Choose ElasticMapReduce-master from the list.

Choose Inbound, Edit.

Find the rule with the following settings and choose the x icon to delete it:

- Type: SSH
- Port: 22
- Source: Custom 0.0.0.0/0

Scroll to the bottom of the list of rules and choose Add Rule.

For Type, select SSH. This automatically enters TCP for Protocol and 22 for Port Range.

For Source, select My IP. This automatically adds the IP address of your client computer as the source address. Alternatively, you can add a range of Custom trusted client IP addresses and choose Add rule to create additional rules for other clients. In many network environments, IP addresses are allocated dynamically, so you may need to periodically edit security group rules to update the IP address of trusted clients.

Choose Save.

Optionally, choose ElasticMapReduce-slave from the list and repeat the steps above to allow SSH client access to core and task nodes from trusted clients. (A scripted equivalent of this step appears after step 6.)

### 5. Process Data by Running the Hive Script as a Step
Submit hivetable.sql to the cluster as a Hive step, supplying the sample-data and output S3 locations as the ${INPUT} and ${OUTPUT} variables the script expects. (See the sketch after step 6.)

### 6. Terminate the Cluster and Delete the Bucket
When you are finished, terminate the cluster, then empty and delete the S3 bucket so you don't continue to accrue charges. (See the final sketch below.)
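For step 4, a boto3 sketch that performs the same security-group edit, assuming the group still has its default name and its original world-open SSH rule. MY_IP is a placeholder for your client's public IP in CIDR form.

```python
import boto3

REGION = "us-east-1"       # placeholder, as in the earlier sketches
MY_IP = "203.0.113.10/32"  # placeholder: your client's public IP, CIDR form

ec2 = boto3.client("ec2", region_name=REGION)

# Look up the managed master security group by its default name.
groups = ec2.describe_security_groups(
    Filters=[{"Name": "group-name", "Values": ["ElasticMapReduce-master"]}]
)
sg_id = groups["SecurityGroups"][0]["GroupId"]


def ssh_rule(cidr):
    """Build the one-element IpPermissions list for SSH from the given range."""
    return [{"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
             "IpRanges": [{"CidrIp": cidr}]}]


# Remove the world-open SSH rule, then allow SSH from your IP only.
ec2.revoke_security_group_ingress(GroupId=sg_id, IpPermissions=ssh_rule("0.0.0.0/0"))
ec2.authorize_security_group_ingress(GroupId=sg_id, IpPermissions=ssh_rule(MY_IP))
```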
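For step 5, a sketch that submits the Hive script as a step through `command-runner.jar`. It assumes you uploaded hivetable.sql to your bucket. The -d flags define the Hive variables that fill in ${INPUT} and ${OUTPUT}, which is also how the console's Input/Output S3 location fields are passed through; the EMR sample-data bucket name follows the region, so adjust it to yours.

```python
import boto3

REGION = "us-east-1"               # placeholder, as above
BUCKET = "my-emr-tutorial-bucket"  # placeholder, as above
cluster_id = "j-XXXXXXXXXXXXX"     # placeholder: the ID returned by run_job_flow

emr = boto3.client("emr", region_name=REGION)

step = emr.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[{
        "Name": "Process CloudFront logs with Hive",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hive-script", "--run-hive-script", "--args",
                "-f", f"s3://{BUCKET}/hivetable.sql",  # the script from this repo
                "-d", f"INPUT=s3://{REGION}.elasticmapreduce.samples",
                "-d", f"OUTPUT=s3://{BUCKET}/output",
            ],
        },
    }],
)

# Block until the step finishes; the sample query takes a few minutes.
emr.get_waiter("step_complete").wait(
    ClusterId=cluster_id, StepId=step["StepIds"][0]
)
```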
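Finally, a sketch of the cleanup in step 6, with the same placeholder names. The bucket has to be emptied before S3 will let you delete it.

```python
import boto3

REGION = "us-east-1"               # placeholder, as above
BUCKET = "my-emr-tutorial-bucket"  # placeholder, as above
cluster_id = "j-XXXXXXXXXXXXX"     # placeholder cluster ID

emr = boto3.client("emr", region_name=REGION)

# Terminate the cluster; EMR clusters bill until they are terminated.
emr.terminate_job_flows(JobFlowIds=[cluster_id])
emr.get_waiter("cluster_terminated").wait(ClusterId=cluster_id)

# Empty the bucket, then delete it; S3 rejects deleting a non-empty bucket.
bucket = boto3.resource("s3", region_name=REGION).Bucket(BUCKET)
bucket.objects.all().delete()
bucket.delete()
```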
Tutorial: https://aws.amazon.com/getting-started/projects/analyze-big-data/
--------------------------------------------------------------------------------
/hivetable.sql:
--------------------------------------------------------------------------------
-- Summary: This sample shows you how to analyze CloudFront logs stored in S3 using Hive.

-- Create a table over the sample data in S3. Note: you can replace this S3 path with your own.
-- The RegexSerDe maps each capture group of input.regex, in order, onto a column.
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  DateObject DATE,
  Time STRING,
  Location STRING,
  Bytes INT,
  RequestIP STRING,
  Method STRING,
  Host STRING,
  Uri STRING,
  Status INT,
  Referrer STRING,
  OS STRING,
  Browser STRING,
  BrowserVersion STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
) LOCATION '${INPUT}/cloudfront/data';

-- Total requests per operating system for a given time frame
INSERT OVERWRITE DIRECTORY '${OUTPUT}/os_requests/'
SELECT os, COUNT(*) AS count
FROM cloudfront_logs
WHERE dateobject BETWEEN '2014-07-05' AND '2014-08-05'
GROUP BY os;
--------------------------------------------------------------------------------