├── LICENSE
├── README.md
└── sparkjob.java

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2020 Adit Modi

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Run-a-Spark-job-within-Amazon-EMR

![image](https://user-images.githubusercontent.com/48589838/77054629-0251ad80-69f6-11ea-8d55-29b0d8b98842.png) ![image](https://user-images.githubusercontent.com/48589838/77054833-2c0ad480-69f6-11ea-9855-0bdeec8535b7.png)

## AWS and Amazon EMR

AWS is one of the most widely used cloud services platforms: a lot of services are available, it is very well documented, and it is easy to use.

A cloud services platform allows users to access on-demand resources (compute power, memory, storage) and services (databases, monitoring, workflow, etc.) via the internet with pay-as-you-go pricing.

Among all the cool services offered by AWS, we will only use two of them:

- Simple Storage Service (S3), a massively scalable object storage service
- Elastic MapReduce (EMR), a managed cluster platform that simplifies running big data frameworks such as Apache Hadoop and Apache Spark

EMR can be viewed as Hadoop-as-a-Service: you start a cluster with the number of nodes you want, run any job you want, and only pay for the time the cluster is actually up.

The input and output files will be stored in S3.

## Create an Amazon S3 Bucket

1. Open the Amazon S3 console.
2. Choose Create bucket.
3. Type a name for your bucket (e.g. my-first-emr-bucket), choose its AWS Region, then click Next.
4. On the Set properties page, you can configure properties for the bucket. This tutorial does not need any specific properties, so click Next.
5. On the Set permissions page, you manage the permissions set on the bucket you are creating. We will use the default permissions; click Next.
6. On the Review page, verify the settings and choose Create bucket.
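
If you prefer to script this part instead of clicking through the console, the bucket can also be created (and the files of the next section uploaded) programmatically. The snippet below is only a sketch using the AWS SDK for Java v2, not part of the original tutorial: the JAR name, local file paths and region are placeholder assumptions, while the bucket and folder names reuse the examples above.

```java
import java.nio.file.Paths;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CreateBucketRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class CreateEmrBucket {
    public static void main(String[] args) {
        // Placeholder values: use your own bucket name and region.
        String bucket = "my-first-emr-bucket";
        Region region = Region.US_EAST_1;

        try (S3Client s3 = S3Client.builder().region(region).build()) {
            // Create the bucket that will hold the job JAR, the input file and the cluster logs.
            s3.createBucket(CreateBucketRequest.builder().bucket(bucket).build());

            // Upload the Spark application JAR and the input text file (local paths are placeholders).
            s3.putObject(
                    PutObjectRequest.builder().bucket(bucket).key("tutorialEMR/sparkjob.jar").build(),
                    RequestBody.fromFile(Paths.get("target/sparkjob.jar")));
            s3.putObject(
                    PutObjectRequest.builder().bucket(bucket).key("tutorialEMR/input.txt").build(),
                    RequestBody.fromFile(Paths.get("input.txt")));
        }
    }
}
```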

## Upload files to Amazon S3

1. In the Amazon S3 console, click on the bucket you just created.
2. Choose Create folder, enter the name of the folder (e.g. tutorialEMR) and the encryption setting (e.g. None), then save.
3. Click on the folder you just created, then choose Upload to upload the Spark application JAR and the input text file on which you want to run the word count.

## Create an Amazon EMR cluster & Submit the Spark Job

1. Open the Amazon EMR console.
2. In the top right corner, change the region in which you want to deploy the cluster.
3. Choose Create cluster.
4. In the General Configuration section, enter the cluster name, choose the S3 bucket you created (the logs will be stored in this bucket) and check Step execution.
5. In the Add steps section, select Spark application, click Configure and fill in the popup box.
6. In the Software Configuration section, use the default release (the latest one).
7. In the Hardware configuration section, choose the instance type and the number of instances.
8. In the Security and access section, use the default values.
9. Click Create cluster.
10. Click the refresh icon to see the status pass from Starting to Running to Terminating - All steps completed.

Now go back to the S3 console: you will see the output directory in which the result has been stored. You can click on it and download its contents.

## Special mention

Robin JEAN

https://medium.com/big-data-on-amazon-elastic-mapreduce/run-a-spark-job-within-amazon-emr-in-15-minutes-68b02af1ae16

--------------------------------------------------------------------------------
/sparkjob.java:
--------------------------------------------------------------------------------
package com.jeanr84.sparkjob;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

import java.util.Arrays;
import java.util.regex.Pattern;

public class SparkJob {

    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("Usage: JavaWordCount <input> <output>");
            System.exit(1);
        }

        SparkSession spark = SparkSession
                .builder()
                .appName("SparkJob")
                .getOrCreate();

        // Read the input text file (e.g. an S3 path) as an RDD of lines.
        JavaRDD<String> textFile = spark.read().textFile(args[0]).toJavaRDD();

        // Split each line on spaces, pair every word with a count of 1, then sum the counts per word.
        JavaPairRDD<String, Integer> counts = textFile
                .flatMap(s -> Arrays.asList(SPACE.split(s)).iterator())
                .mapToPair(s -> new Tuple2<>(s, 1))
                .reduceByKey((a, b) -> a + b);

        // Write the word counts to the output directory.
        counts.saveAsTextFile(args[1]);

        spark.stop();
    }
}
--------------------------------------------------------------------------------
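
As a closing note: the Spark step that the console's Add steps section configures can also be submitted from code. The sketch below uses the AWS SDK for Java v2 EMR client and is purely illustrative, not part of the original tutorial; the cluster ID, region and S3 locations are placeholder assumptions, and only the main class name comes from sparkjob.java above.

```java
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.emr.EmrClient;
import software.amazon.awssdk.services.emr.model.AddJobFlowStepsRequest;
import software.amazon.awssdk.services.emr.model.HadoopJarStepConfig;
import software.amazon.awssdk.services.emr.model.StepConfig;

public class SubmitSparkStep {
    public static void main(String[] args) {
        // Placeholder values: replace with your cluster ID and S3 locations.
        String clusterId = "j-XXXXXXXXXXXXX";
        String jar = "s3://my-first-emr-bucket/tutorialEMR/sparkjob.jar";
        String input = "s3://my-first-emr-bucket/tutorialEMR/input.txt";
        String output = "s3://my-first-emr-bucket/tutorialEMR/output";

        try (EmrClient emr = EmrClient.builder().region(Region.US_EAST_1).build()) {
            // A Spark step on EMR runs through command-runner.jar, which invokes spark-submit.
            HadoopJarStepConfig sparkSubmit = HadoopJarStepConfig.builder()
                    .jar("command-runner.jar")
                    .args("spark-submit",
                          "--class", "com.jeanr84.sparkjob.SparkJob",
                          jar, input, output)
                    .build();

            StepConfig step = StepConfig.builder()
                    .name("SparkJob word count")
                    .actionOnFailure("CONTINUE")
                    .hadoopJarStep(sparkSubmit)
                    .build();

            // Attach the step to an already-running cluster.
            emr.addJobFlowSteps(AddJobFlowStepsRequest.builder()
                    .jobFlowId(clusterId)
                    .steps(step)
                    .build());
        }
    }
}
```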