├── .gitignore ├── README.md ├── build.sbt ├── data ├── reduced-tweets.json └── wordcount.txt ├── img ├── dataframe.png └── streaming.png ├── sbt ├── sbt └── sbt-launch.jar └── src ├── main └── scala │ └── com │ └── duchessfr │ └── spark │ ├── core │ ├── Ex0Wordcount.scala │ ├── Ex1UserMining.scala │ ├── Ex2TweetMining.scala │ ├── Ex3HashTagMining.scala │ └── Ex4InvertedIndex.scala │ ├── dataframe │ └── DataFrameOnTweets.scala │ ├── streaming │ └── StreamingOnTweets.scala │ └── utils │ └── TweetUtils.scala └── test └── scala └── com └── duchessfr └── spark ├── core ├── Ex0WordcountSpec.scala ├── Ex1UserMiningSpec.scala ├── Ex2TweetMiningSpec.scala ├── Ex3HashTagMiningSpec.scala └── Ex4InvertedIndexSpec.scala ├── dataframe └── DataFrameOnTweetsSpec.scala └── streaming └── StreamingOnTweetsSpec.scala /.gitignore: -------------------------------------------------------------------------------- 1 | /target 2 | .idea 3 | *.iml 4 | /project 5 | .DS_Store 6 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Workshop spark-in-practice 2 | 3 | In this workshop the exercises focus on using the [Spark core](https://spark.apache.org/docs/1.4.0/programming-guide.html) and [Spark Streaming](https://spark.apache.org/docs/1.4.0/streaming-programming-guide.html) APIs, as well as the [DataFrame](https://spark.apache.org/docs/1.4.0/sql-programming-guide.html) API, for data processing. 4 | Exercises are available in both [Java](https://github.com/nivdul/spark-in-practice) and Scala on my github account (this repository is the Scala version). You just have to clone the project and go! 5 | If you need help, take a look at the solution branch. 6 | 7 | The original blog-post is right [here](https://nivdul.wordpress.com/2015/08/09/getting-started-with-spark-in-practice/). 8 | 9 | To help you implement each class, unit tests are included. 10 | 11 | Frameworks used: 12 | 13 | * Spark 1.4.0 14 | * Scala 2.10 15 | * sbt 16 | * ScalaTest 17 | 18 | All exercises run in local mode as standalone programs. 19 | 20 | To work on the hands-on, retrieve the code via the following command line: 21 | <pre>
$ git clone https://github.com/nivdul/spark-in-practice-scala.git
22 | 23 | Then you can import the project in IntelliJ or Eclipse (add the SBT and Scala plugins for Scala), or use Sublime Text, for example. 24 | 25 | If you want to use the interactive spark-shell (Scala/Python only), you need to download a [binary Spark distribution](https://spark.apache.org/downloads.html). 26 | 27 | <pre>
Go to the Spark directory
28 | $ cd /spark-1.4.0
29 | 
30 | First build the project
31 | $ build/mvn -DskipTests clean package
32 | 
33 | Launch the spark-shell
34 | $ ./bin/spark-shell
35 | scala>
36 | 
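Once the shell is up, you can quickly check that everything works with a one-liner. This is only a minimal sanity-check example: the shell already provides a SparkContext named `sc`, and the path below is a placeholder you should adjust to wherever you cloned the workshop.
<pre>
scala> val words = sc.textFile("/path/to/spark-in-practice-scala/data/wordcount.txt")<br>
scala> words.count()<br>
</pre>
This simply counts the lines of the sample file; if it returns a number, your local Spark installation is working.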
37 | 38 | ## Part 1: Spark core API 39 | To become more familiar with the Spark API, you will start by implementing the wordcount example (Ex0). 40 | After that, we use reduced tweets in JSON format as the data for the mining exercises (Ex1-Ex3). 41 | 42 | In these exercises you will have to: 43 | 44 | * Find all the tweets by user 45 | * Find how many tweets each user has 46 | * Find all the persons mentioned on tweets 47 | * Count how many times each person is mentioned 48 | * Find the 10 most mentioned persons 49 | * Find all the hashtags mentioned on a tweet 50 | * Count how many times each hashtag is mentioned 51 | * Find the 10 most popular Hashtags 52 | 53 | The last exercise (Ex4) is a bit more complicated: the goal is to build an inverted index, knowing that an inverted index is the data structure used to build search engines. 54 | Assuming #spark is a hashtag that appears in tweet1, tweet3 and tweet39, the inverted index will be a Map that contains a (key, value) pair such as (#spark, List(tweet1, tweet3, tweet39)). 55 | 56 | ## Part 2: streaming analytics with Spark Streaming 57 | Spark Streaming is a component of Spark for processing live data streams in a scalable, high-throughput and fault-tolerant way. 58 | 59 | ![Spark Streaming](img/streaming.png) 60 | 61 | Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. 62 | The abstraction that represents a continuous stream of data is the DStream (discretized stream). 63 | 64 | In the workshop, Spark Streaming is used to process a live stream of Tweets using twitter4j, a library for the Twitter API. 65 | To be able to read the firehose, you will need to create a Twitter application at http://apps.twitter.com, get your credentials, and add them in the StreamingOnTweets class. 66 | 67 | In this exercise you will have to: 68 | 69 | * Print the status of each tweet 70 | * Find the 10 most popular Hashtags 71 | 72 | ## Part 3: structured data with the DataFrame 73 | A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. 74 | DataFrames can be constructed from different sources such as structured data files, tables in Hive, external databases, or existing RDDs. 75 | 76 | ![DataFrame](img/dataframe.png) 77 | 78 | In the exercise you will have to: 79 | 80 | * Print the dataframe 81 | * Print the schema of the dataframe 82 | * Find people who are located in Paris 83 | * Find the user who tweets the most 84 | 85 | ## Conclusion 86 | If you find a better way/implementation, do not hesitate to send a pull request or open an issue.
87 | 88 | Here are some useful links around Spark and its ecosystem: 89 | 90 | * [Apache Spark website](https://spark.apache.org/docs/1.4.0/programming-guide.html) 91 | * [Spark Streaming documentation](https://spark.apache.org/docs/1.4.0/streaming-programming-guide.html) 92 | * [Spark SQL and DataFrame documentation](https://spark.apache.org/docs/1.4.0/sql-programming-guide.html) 93 | * [Databricks blog](https://databricks.com/blog ) 94 | * [Analyze data from an accelerometer using Spark, Cassandra and MLlib](http://www.duchess-france.org/analyze-accelerometer-data-with-apache-spark-and-mllib/) 95 | 96 | -------------------------------------------------------------------------------- /build.sbt: -------------------------------------------------------------------------------- 1 | name := "Spark-HandsOn" 2 | 3 | version := "1.0" 4 | 5 | scalaVersion := "2.10.4" 6 | 7 | libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" 8 | 9 | libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.5.2" 10 | 11 | libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.5.2" 12 | 13 | libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.5.2" 14 | 15 | libraryDependencies += "com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4" 16 | 17 | libraryDependencies += "com.google.code.gson" % "gson" % "2.3.1" 18 | 19 | libraryDependencies += "org.apache.spark" % "spark-streaming-twitter_2.10" % "1.5.2" 20 | 21 | libraryDependencies += "org.twitter4j" % "twitter4j-core" % "3.0.3" 22 | 23 | libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.2.4" % "test" 24 | 25 | resolvers += "Akka Repository" at "http://repo.akka.io/releases/" 26 | -------------------------------------------------------------------------------- /data/wordcount.txt: -------------------------------------------------------------------------------- 1 | word count from Wikipedia the free encyclopedia 2 | the word count is the number of words in a document or passage of text Word counting may be needed when a text 3 | is required to stay within certain numbers of words This may particularly be the case in academia legal 4 | proceedings journalism and advertising Word count is commonly used by translators to determine the price for 5 | the translation job Word counts may also be used to calculate measures of readability and to measure typing 6 | and reading speeds usually in words per minute When converting character counts to words a measure of five or 7 | six characters to a word is generally used Contents Details and variations of definition Software In fiction 8 | In non fiction See also References Sources External links Details and variations of definition 9 | This section does not cite any references or sources Please help improve this section by adding citations to 10 | reliable sources Unsourced material may be challenged and removed 11 | Variations in the operational definitions of how to count the words can occur namely what counts as a word and 12 | which words don't count toward the total However especially since the advent of widespread word processing there 13 | is a broad consensus on these operational definitions and hence the bottom line integer result 14 | The consensus is to accept the text segmentation rules generally found in most word processing software including how 15 | word boundaries are determined which depends on how word dividers are defined The first trait of that definition is that a space any of various whitespace 16 | characters such as a regular 
word space an em space or a tab character is a word divider Usually a hyphen or a slash is too 17 | Different word counting programs may give varying results depending on the text segmentation rule 18 | details and on whether words outside the main text such as footnotes endnotes or hidden text) are counted But the behavior 19 | of most major word processing applications is broadly similar However during the era when school assignments were done in 20 | handwriting or with typewriters the rules for these definitions often differed from todays consensus 21 | Most importantly many students were drilled on the rule that certain words don't count usually articles namely a an the but 22 | sometimes also others such as conjunctions for example and or but and some prepositions usually to of Hyphenated permanent 23 | compounds such as follow up noun or long term adjective were counted as one word To save the time and effort of counting 24 | word by word often a rule of thumb for the average number of words per line was used such as 10 words per line These rules 25 | have fallen by the wayside in the word processing era the word count feature of such software which follows the text 26 | segmentation rules mentioned earlier is now the standard arbiter because it is largely consistent across documents and 27 | applications and because it is fast effortless and costless already included with the application As for which sections of 28 | a document count toward the total such as footnotes endnotes abstracts reference lists and bibliographies tables figure 29 | captions hidden text the person in charge teacher client can define their choice and users students workers can simply 30 | select or exclude the elements accordingly and watch the word count automatically update Software Modern web browsers 31 | support word counting via extensions via a JavaScript bookmarklet or a script that is hosted in a website Most word 32 | processors can also count words Unix like systems include a program wc specifically for word counting 33 | As explained earlier different word counting programs may give varying results depending on the text segmentation rule 34 | details The exact number of words often is not a strict requirement thus the variation is acceptable 35 | In fiction Novelist Jane Smiley suggests that length is an important quality of the novel However novels can vary 36 | tremendously in length Smiley lists novels as typically being between and words while National Novel Writing Month 37 | requires its novels to be at least words There are no firm rules for example the boundary between a novella and a novel 38 | is arbitrary and a literary work may be difficult to categorise But while the length of a novel is to a large extent up 39 | to its writer lengths may also vary by subgenre many chapter books for children start at a length of about words and a 40 | typical mystery novel might be in the to word range while a thriller could be over words 41 | The Science Fiction and Fantasy Writers of America specifies word lengths for each category of its Nebula award categories 42 | Classification Word count Novel over words Novella to words Novelette to words Short story under words 43 | In non fiction The acceptable length of an academic dissertation varies greatly dependent predominantly on the subject 44 | Numerous American universities limit Ph.D. 
dissertations to at most words barring special permission for exceeding this limit -------------------------------------------------------------------------------- /img/dataframe.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nivdul/spark-in-practice-scala/ebbfdab13f24654f7dd4742ef882dafd5c82be54/img/dataframe.png -------------------------------------------------------------------------------- /img/streaming.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nivdul/spark-in-practice-scala/ebbfdab13f24654f7dd4742ef882dafd5c82be54/img/streaming.png -------------------------------------------------------------------------------- /sbt/sbt: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so 3 | export SPARK_JAVA_OPTS="-XX:+UseConcMarkSweepGC" 4 | java -Xmx1024m -jar $(dirname $0)/sbt-launch.jar "$@" 5 | -------------------------------------------------------------------------------- /sbt/sbt-launch.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nivdul/spark-in-practice-scala/ebbfdab13f24654f7dd4742ef882dafd5c82be54/sbt/sbt-launch.jar -------------------------------------------------------------------------------- /src/main/scala/com/duchessfr/spark/core/Ex0Wordcount.scala: -------------------------------------------------------------------------------- 1 | package com.duchessfr.spark.core 2 | 3 | import org.apache.spark.SparkContext 4 | import org.apache.spark.SparkConf 5 | import org.apache.spark.rdd.RDD 6 | 7 | /** 8 | * The scala Spark API documentation: http://spark.apache.org/docs/latest/api/scala/index.html 9 | * 10 | * Here the goal is to count how many times each word appears in a file and make some operations on the result. 11 | * We use the mapreduce pattern to do this: 12 | * 13 | * step 1, the mapper: 14 | * - we attribute 1 to each word, so we obtain couples (word, 1), where the word is the key. 15 | * 16 | * step 2, the reducer: 17 | * - for each key (= word), the values are added and we obtain the total count. 18 | * 19 | * Use the Ex0WordcountSpec to implement the code. 20 | */ 21 | object Ex0Wordcount { 22 | 23 | val pathToFile = "data/wordcount.txt" 24 | 25 | /** 26 | * Load the data from the text file and return an RDD of words 27 | */ 28 | def loadData(): RDD[String] = { 29 | // create spark configuration and spark context: the Spark context is the entry point in Spark. 30 | // It represents the connection to Spark and it is the place where you can configure the common properties 31 | // like the app name, the master URL, memory allocation... 32 | val conf = new SparkConf() 33 | .setAppName("Wordcount") 34 | .setMaster("local[*]") // here local mode. And * means you will use as many threads as you have cores. 35 | 36 | val sc = new SparkContext(conf) 37 | 38 | // load data and create an RDD where each element will be a word 39 | // Here the flatMap method is used to separate the words in each line using the space separator 40 | // In this way it returns an RDD where each "element" is a word 41 | sc.textFile(pathToFile) 42 | .flatMap(_.split(" ")) 43 | } 44 | 45 | /** 46 | * Now count how many times each word appears!
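 * A minimal sketch of the map/reduce steps described above (one possible approach, shown only as a hint --
 * try to write your own version first):
 * {{{
 *   loadData().map(word => (word, 1)).reduceByKey(_ + _)
 * }}}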
47 | */ 48 | def wordcount(): RDD[(String, Int)] = { 49 | val tweets = loadData 50 | 51 | // Step 1: the mapper step 52 | // The philosophy: we want to attribute the number 1 to each word, so we create couples (word, 1). 53 | // Hint: look at the map method 54 | // TODO write code here 55 | 56 | // Step 2: reducer step 57 | // The philosophy: now you have a couple (key, value) where the key is a word, you want to aggregate the value for each word. 58 | // So you will use a reducer function. 59 | // Hint: the Spark API provides some reduce methods 60 | // TODO write code here 61 | null 62 | 63 | } 64 | 65 | /** 66 | * Now keep the words which appear strictly more than 4 times! 67 | */ 68 | def filterOnWordcount(): RDD[(String, Int)] = { 69 | val tweets = wordcount 70 | 71 | // Hint: the Spark API provides a filter method 72 | // TODO write code here 73 | null 74 | } 75 | 76 | } 77 | -------------------------------------------------------------------------------- /src/main/scala/com/duchessfr/spark/core/Ex1UserMining.scala: -------------------------------------------------------------------------------- 1 | package com.duchessfr.spark.core 2 | 3 | 4 | import org.apache.spark.{SparkContext, SparkConf} 5 | 6 | import org.apache.spark.rdd._ 7 | import com.duchessfr.spark.utils.TweetUtils 8 | import com.duchessfr.spark.utils.TweetUtils._ 9 | 10 | /** 11 | * The scala API documentation: http://spark.apache.org/docs/latest/api/scala/index.html 12 | * 13 | * We still use the dataset with the 8198 reduced tweets. The data are reduced tweets like the example below: 14 | * 15 | * {"id":"572692378957430785", 16 | * "user":"Srkian_nishu :)", 17 | * "text":"@always_nidhi @YouTube no i dnt understand bt i loved of this mve is rocking", 18 | * "place":"Orissa", 19 | * "country":"India"} 20 | * 21 | * We want to make some computations on the users: 22 | * - find all the tweets by user 23 | * - find how many tweets each user has 24 | * 25 | * Use the Ex1UserMiningSpec to implement the code. 26 | */ 27 | object Ex1UserMining { 28 | 29 | val pathToFile = "data/reduced-tweets.json" 30 | 31 | /** 32 | * Load the data from the json file and return an RDD of Tweet 33 | */ 34 | def loadData(): RDD[Tweet] = { 35 | // Create the spark configuration and spark context 36 | val conf = new SparkConf() 37 | .setAppName("User mining") 38 | .setMaster("local[*]") 39 | 40 | val sc = new SparkContext(conf) 41 | 42 | // Load the data and parse it into a Tweet. 43 | // Look at the Tweet Object in the TweetUtils class.
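// Note: mapPartitions (rather than map) is used so that the single Gson parser created in
// TweetUtils.parseFromJson can be reused for a whole partition instead of creating one per line.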
44 | sc.textFile(pathToFile).mapPartitions(TweetUtils.parseFromJson(_)) 45 | } 46 | 47 | /** 48 | * For each user, return all their tweets 49 | */ 50 | def tweetsByUser(): RDD[(String, Iterable[Tweet])] = { 51 | val tweets = loadData 52 | // TODO write code here 53 | // Hint: the Spark API provides a groupBy method 54 | null 55 | } 56 | 57 | /** 58 | * Compute the number of tweets by user 59 | */ 60 | def tweetByUserNumber(): RDD[(String, Int)] = { 61 | val tweets = loadData 62 | 63 | // TODO write code here 64 | // Hint: think about what you did in the wordcount example 65 | null 66 | } 67 | 68 | 69 | /** 70 | * Top 10 twitterers 71 | */ 72 | def topTenTwitterers(): Array[(String, Int)] = { 73 | 74 | // Return the top 10 users who tweet the most 75 | // TODO write code here 76 | // Hint: the Spark API provides a sortBy method 77 | null 78 | } 79 | 80 | } 81 | 82 | -------------------------------------------------------------------------------- /src/main/scala/com/duchessfr/spark/core/Ex2TweetMining.scala: -------------------------------------------------------------------------------- 1 | 2 | package com.duchessfr.spark.core 3 | 4 | import org.apache.spark.{SparkConf, SparkContext} 5 | import org.apache.spark.rdd._ 6 | import com.duchessfr.spark.utils._ 7 | import com.duchessfr.spark.utils.TweetUtils.Tweet 8 | 9 | /** 10 | * The scala Spark API documentation: http://spark.apache.org/docs/latest/api/scala/index.html 11 | * 12 | * We still use the dataset with the 8198 reduced tweets. Here is an example of a tweet: 13 | * 14 | * {"id":"572692378957430785", 15 | * "user":"Srkian_nishu :)", 16 | * "text":"@always_nidhi @YouTube no i dnt understand bt i loved of this mve is rocking", 17 | * "place":"Orissa", 18 | * "country":"India"} 19 | * 20 | * We want to make some computations on the tweets: 21 | * - Find all the persons mentioned on tweets 22 | * - Count how many times each person is mentioned 23 | * - Find the 10 most mentioned persons by descending order 24 | * 25 | * Use the Ex2TweetMiningSpec to implement the code. 26 | */ 27 | object Ex2TweetMining { 28 | 29 | val pathToFile = "data/reduced-tweets.json" 30 | 31 | /** 32 | * Load the data from the json file and return an RDD of Tweet 33 | */ 34 | def loadData(): RDD[Tweet] = { 35 | // create spark configuration and spark context 36 | val conf = new SparkConf() 37 | .setAppName("Tweet mining") 38 | .setMaster("local[*]") 39 | 40 | val sc = new SparkContext(conf) 41 | 42 | // Load the data and parse it into a Tweet. 43 | // Look at the Tweet Object in the TweetUtils class.
44 | sc.textFile(pathToFile) 45 | .mapPartitions(TweetUtils.parseFromJson(_)) 46 | 47 | } 48 | 49 | /** 50 | * Find all the persons mentioned on tweets (case sensitive) 51 | */ 52 | def mentionOnTweet(): RDD[String] = { 53 | val tweets = loadData 54 | 55 | // Hint: think about separating the words in the text field and then finding the mentions 56 | // TODO write code here 57 | null 58 | } 59 | 60 | /** 61 | * Count how many times each person is mentioned 62 | */ 63 | def countMentions(): RDD[(String, Int)] = { 64 | val mentions = mentionOnTweet 65 | 66 | // Hint: think about what you did in the wordcount example 67 | // TODO write code here 68 | null 69 | } 70 | 71 | /** 72 | * Find the 10 most mentioned persons by descending order 73 | */ 74 | def top10mentions(): Array[(String, Int)] = { 75 | 76 | // Hint: take a look at the sorting and take methods 77 | // TODO write code here 78 | null 79 | } 80 | 81 | } 82 | -------------------------------------------------------------------------------- /src/main/scala/com/duchessfr/spark/core/Ex3HashTagMining.scala: -------------------------------------------------------------------------------- 1 | package com.duchessfr.spark.core 2 | 3 | import org.apache.spark.{SparkContext, SparkConf} 4 | import org.apache.spark.rdd._ 5 | import com.duchessfr.spark.utils.TweetUtils 6 | import com.duchessfr.spark.utils.TweetUtils._ 7 | 8 | /** 9 | * The scala Spark API documentation: http://spark.apache.org/docs/latest/api/scala/index.html 10 | * 11 | * We still use the dataset with the 8198 reduced tweets. Here is an example of a tweet: 12 | * 13 | * {"id":"572692378957430785", 14 | * "user":"Srkian_nishu :)", 15 | * "text":"@always_nidhi @YouTube no i dnt understand bt i loved of this mve is rocking", 16 | * "place":"Orissa", 17 | * "country":"India"} 18 | * 19 | * We want to make some computations on the hashtags. It is very similar to exercise 2: 20 | * - Find all the hashtags mentioned on a tweet 21 | * - Count how many times each hashtag is mentioned 22 | * - Find the 10 most popular Hashtags by descending order 23 | * 24 | * Use the Ex3HashTagMiningSpec to implement the code. 25 | */ 26 | object Ex3HashTagMining { 27 | 28 | val pathToFile = "data/reduced-tweets.json" 29 | 30 | /** 31 | * Load the data from the json file and return an RDD of Tweet 32 | */ 33 | def loadData(): RDD[Tweet] = { 34 | // create spark configuration and spark context 35 | val conf = new SparkConf() 36 | .setAppName("Hashtag mining") 37 | .setMaster("local[*]") 38 | 39 | val sc = new SparkContext(conf) 40 | 41 | // Load the data and parse it into a Tweet. 42 | // Look at the Tweet Object in the TweetUtils class.
43 | sc.textFile(pathToFile) 44 | .mapPartitions(TweetUtils.parseFromJson(_)) 45 | } 46 | 47 | /** 48 | * Find all the hashtags mentioned on tweets 49 | */ 50 | def hashtagMentionedOnTweet(): RDD[String] = { 51 | val tweets = loadData 52 | // You want to return an RDD with the hashtags 53 | // Hint: think about separating the words in the text field and then finding the hashtags 54 | // TODO write code here 55 | null 56 | } 57 | 58 | 59 | /** 60 | * Count how many times each hashtag is mentioned 61 | */ 62 | def countMentions(): RDD[(String, Int)] = { 63 | val tags = hashtagMentionedOnTweet 64 | // Hint: think about what you did in the wordcount example 65 | // TODO write code here 66 | null 67 | } 68 | 69 | /** 70 | * Find the 10 most popular Hashtags by descending order 71 | */ 72 | def top10HashTags(): Array[(String, Int)] = { 73 | val countTags = countMentions 74 | // Hint: take a look at the sorting and take methods 75 | // TODO write code here 76 | null 77 | } 78 | 79 | } 80 | -------------------------------------------------------------------------------- /src/main/scala/com/duchessfr/spark/core/Ex4InvertedIndex.scala: -------------------------------------------------------------------------------- 1 | package com.duchessfr.spark.core 2 | 3 | import com.duchessfr.spark.utils.TweetUtils.Tweet 4 | import org.apache.spark.{SparkContext, SparkConf} 5 | 6 | import com.duchessfr.spark.utils.TweetUtils 7 | 8 | import scala.collection.Map 9 | 10 | object Ex4InvertedIndex { 11 | 12 | /** 13 | * 14 | * Building a hashtag search engine 15 | * 16 | * The goal is to build an inverted index. An inverted index is the data structure used to build search engines. 17 | * 18 | * How does it work? 19 | * 20 | * Assuming #spark is a hashtag that appears in tweet1, tweet3, tweet39. 21 | * The inverted index that you must return should be a Map (or HashMap) that contains a (key, value) pair such as (#spark, List(tweet1, tweet3, tweet39)). 22 | * 23 | * Use the Ex4InvertedIndexSpec to implement the code. 24 | */ 25 | def invertedIndex(): Map[String, Iterable[Tweet]] = { 26 | // create spark configuration and spark context 27 | val conf = new SparkConf () 28 | .setAppName ("Inverted index") 29 | .setMaster ("local[*]") 30 | 31 | val sc = new SparkContext (conf) 32 | 33 | val tweets = sc.textFile ("data/reduced-tweets.json") 34 | .mapPartitions (TweetUtils.parseFromJson (_) ) 35 | 36 | // Let's try it out! 37 | // Hint: 38 | // For each tweet, extract all the hashtags and then create couples (hashtag, tweet) 39 | // Then group the tweets by hashtag 40 | // Finally return the inverted index as a map structure 41 | // TODO write code here 42 | null 43 | } 44 | 45 | } 46 | -------------------------------------------------------------------------------- /src/main/scala/com/duchessfr/spark/dataframe/DataFrameOnTweets.scala: -------------------------------------------------------------------------------- 1 | package com.duchessfr.spark.dataframe 2 | 3 | import org.apache.spark._ 4 | import org.apache.spark.sql._ 5 | 6 | /** 7 | * The Spark SQL and DataFrame documentation is available on: 8 | * https://spark.apache.org/docs/1.4.0/sql-programming-guide.html 9 | * 10 | * A DataFrame is a distributed collection of data organized into named columns. 11 | * The entry point for working with DataFrames is the SQLContext class (from Spark SQL). 12 | * With a SQLContext, you can create DataFrames from: 13 | * - an existing RDD 14 | * - a Hive table 15 | * - data sources...
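 *
 * As a rough sketch (assuming an existing SparkContext named sc), creating a DataFrame from a JSON file looks like:
 * {{{
 *   val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 *   val df = sqlContext.read.json("data/reduced-tweets.json")
 * }}}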
16 | * 17 | * In the exercise we will create a dataframe with the content of a JSON file. 18 | * 19 | * We want to: 20 | * - print the dataframe 21 | * - print the schema of the dataframe 22 | * - find people who are located in Paris 23 | * - find the user who tweets the most 24 | * 25 | * And just to recap, we use a dataset with 8198 tweets, where a tweet looks like this: 26 | * 27 | * {"id":"572692378957430785", 28 | * "user":"Srkian_nishu :)", 29 | * "text":"@always_nidhi @YouTube no i dnt understand bt i loved of this mve is rocking", 30 | * "place":"Orissa", 31 | * "country":"India"} 32 | * 33 | * Use the DataFrameOnTweetsSpec to implement the code. 34 | */ 35 | object DataFrameOnTweets { 36 | 37 | 38 | val pathToFile = "data/reduced-tweets.json" 39 | 40 | /** 41 | * Here is the method to create the contexts (Spark and SQL) and 42 | * then create the dataframe. 43 | * 44 | * Run the test to see how the dataframe looks! 45 | */ 46 | def loadData(): DataFrame = { 47 | // create spark configuration and spark context 48 | val conf = new SparkConf() 49 | .setAppName("Dataframe") 50 | .setMaster("local[*]") 51 | 52 | val sc = new SparkContext(conf) 53 | 54 | // Create a sql context: the SQLContext wraps the SparkContext, and is specific to Spark SQL. 55 | // It is the entry point in Spark SQL. 56 | // TODO write code here 57 | val sqlcontext = null 58 | 59 | // Load the data, knowing that the file is a JSON file 60 | // Hint: use the sqlContext and apply the read method before loading the json file 61 | // TODO write code here 62 | null 63 | } 64 | 65 | 66 | /** 67 | * See how the dataframe looks 68 | */ 69 | def showDataFrame() = { 70 | val dataframe = loadData() 71 | 72 | // Display the content of the DataFrame to stdout 73 | // TODO write code here 74 | } 75 | 76 | /** 77 | * Print the schema 78 | */ 79 | def printSchema() = { 80 | val dataframe = loadData() 81 | 82 | // Print the schema 83 | // TODO write code here 84 | } 85 | 86 | /** 87 | * Find people who are located in Paris 88 | */ 89 | def filterByLocation(): DataFrame = { 90 | val dataframe = loadData() 91 | 92 | // Select all the persons who are located in Paris 93 | // TODO write code here 94 | null 95 | } 96 | 97 | 98 | /** 99 | * Find the user who tweets the most 100 | */ 101 | def mostPopularTwitterer(): (Long, String) = { 102 | val dataframe = loadData() 103 | 104 | // First group the tweets by user 105 | // Then sort by descending order and take the first one 106 | // TODO write code here 107 | null 108 | } 109 | 110 | } 111 | -------------------------------------------------------------------------------- /src/main/scala/com/duchessfr/spark/streaming/StreamingOnTweets.scala: -------------------------------------------------------------------------------- 1 | package com.duchessfr.spark.streaming 2 | 3 | import org.apache.spark.streaming.{Seconds, StreamingContext} 4 | import org.apache.spark.streaming.twitter._ 5 | import org.apache.spark.SparkConf 6 | import org.apache.spark._ 7 | 8 | /** 9 | * First authenticate with the Twitter streaming API. 10 | * 11 | * Go to https://apps.twitter.com/ 12 | * Create your application and then get your own credentials (keys and access tokens tab) 13 | * 14 | * See https://databricks-training.s3.amazonaws.com/realtime-processing-with-spark-streaming.html 15 | * for help.
16 | * 17 | * If you have the following error "error 401 Unauthorized": 18 | * - it might be because of wrong credentials 19 | * OR 20 | * - a time zone issue (so be certain that the time zone on your computer is the right one) 21 | * 22 | * The Spark Streaming documentation is available on: 23 | * http://spark.apache.org/docs/latest/streaming-programming-guide.html 24 | * 25 | * Spark Streaming is an extension of the core Spark API that enables scalable, 26 | * high-throughput, fault-tolerant stream processing of live data streams. 27 | * Spark Streaming receives live input data streams and divides the data into batches, 28 | * which are then processed by the Spark engine to generate the final stream of results in batches. 29 | * Spark Streaming provides a high-level abstraction called discretized stream or DStream, 30 | * which represents a continuous stream of data. 31 | * 32 | * In this exercise we will: 33 | * - Print the status text of some of the tweets 34 | * - Find the 10 most popular Hashtags in the last minute 35 | * 36 | * You can see information about the streaming in the Spark UI console: http://localhost:4040/streaming/ 37 | */ 38 | object StreamingOnTweets extends App { 39 | 40 | def top10Hashtag() = { 41 | // TODO fill the keys and tokens 42 | val CONSUMER_KEY = "TODO" 43 | val CONSUMER_SECRET = "TODO" 44 | val ACCESS_TOKEN = "TODO" 45 | val ACCESS_TOKEN_SECRET = "TODO" 46 | 47 | System.setProperty("twitter4j.oauth.consumerKey", CONSUMER_KEY) 48 | System.setProperty("twitter4j.oauth.consumerSecret", CONSUMER_SECRET) 49 | System.setProperty("twitter4j.oauth.accessToken", ACCESS_TOKEN) 50 | System.setProperty("twitter4j.oauth.accessTokenSecret", ACCESS_TOKEN_SECRET) 51 | 52 | // Load the data using TwitterUtils: we obtain a DStream of tweets 53 | // 54 | // More about TwitterUtils: 55 | // https://spark.apache.org/docs/1.4.0/api/java/index.html?org/apache/spark/streaming/twitter/TwitterUtils.html 56 | 57 | // create spark configuration and spark context 58 | val conf = new SparkConf() 59 | .setAppName("streaming") 60 | .setMaster("local[*]") 61 | 62 | val sc = new SparkContext(conf) 63 | // create a StreamingContext by providing a Spark context and a window (2 seconds batch) 64 | val ssc = new StreamingContext(sc, Seconds(2)) 65 | 66 | println("Initializing Twitter stream...") 67 | 68 | // Here we start a stream of tweets 69 | // The object tweetsStream is a DStream of tweet statuses: 70 | // - the Status class contains all information of a tweet 71 | // See http://twitter4j.org/javadoc/twitter4j/Status.html 72 | // and fill the keys and tokens 73 | val tweetsStream = TwitterUtils.createStream(ssc, None, Array[String]()) 74 | 75 | //Your turn ... 76 | 77 | // Print the status text of some of the tweets 78 | // You must see tweets appear in the console 79 | val status = tweetsStream.map(_.getText) 80 | // Here print the status's text: see the Status class 81 | // Hint: use the print method 82 | // TODO write code here 83 | 84 | 85 | // Find the 10 most popular Hashtags in the last minute 86 | 87 | // For each tweet in the stream, extract all the hashtags 88 | // a stream is like a sequence of RDDs, so you can do all the operations you did in the first part of the hands-on 89 | // Hint: think about what you did in the HashTagMining part 90 | // TODO write code here 91 | val hashTags = null 92 | 93 | // Now here, find the 10 most popular hashtags in a 60-second window 94 | // Hint: look at the reduceByKeyAndWindow function in the spark doc.
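// A possible shape for this step, as a rough sketch only (it assumes hashTags is a DStream[String] of hashtags):
//   hashTags.map((_, 1)).reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60))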
95 | // Reduce last 60 seconds of data 96 | // Hint: look at the transform function to operate on the DStream 97 | // TODO write code here 98 | val top10 = null 99 | 100 | // and return the 10 most popular ones 101 | // Hint: loop on the RDD and take the 10 most popular 102 | // TODO write code here 103 | 104 | // we need to tell the context to start running the computation we have set up 105 | // it won't work if you don't add this! 106 | ssc.start 107 | ssc.awaitTermination 108 | } 109 | } 110 | -------------------------------------------------------------------------------- /src/main/scala/com/duchessfr/spark/utils/TweetUtils.scala: -------------------------------------------------------------------------------- 1 | package com.duchessfr.spark.utils 2 | 3 | import com.google.gson._ 4 | 5 | object TweetUtils { 6 | case class Tweet ( 7 | id : String, 8 | user : String, 9 | userName : String, 10 | text : String, 11 | place : String, 12 | country : String, 13 | lang : String 14 | ) 15 | 16 | 17 | def parseFromJson(lines:Iterator[String]):Iterator[Tweet] = { 18 | val gson = new Gson 19 | lines.map(line => gson.fromJson(line, classOf[Tweet])) 20 | } 21 | } 22 | -------------------------------------------------------------------------------- /src/test/scala/com/duchessfr/spark/core/Ex0WordcountSpec.scala: -------------------------------------------------------------------------------- 1 | package com.duchessfr.spark.core 2 | 3 | import org.scalatest._ 4 | 5 | /** 6 | * Here are the tests to help you to implement the Ex0Wordcount 7 | */ 8 | class Ex0WordcountSpec extends FunSuite with Matchers { 9 | 10 | // this test is already green, but see how we load the data in the loadData method 11 | test("number of data loaded") { 12 | val data = Ex0Wordcount.loadData 13 | data.count should be (809) 14 | } 15 | 16 | test("countWord should count the occurrences of each word"){ 17 | val wordCounts = Ex0Wordcount.wordcount 18 | wordCounts.count should be (381) 19 | wordCounts.collect should contain ("the", 38) 20 | wordCounts.collect should contain ("generally", 2) 21 | } 22 | 23 | test("filterOnWordcount should keep the words which appear more than 4 times"){ 24 | val wordCounts = Ex0Wordcount.filterOnWordcount 25 | wordCounts.count should be (26) 26 | wordCounts.collect should contain ("the", 38) 27 | wordCounts.collect shouldNot contain ("generally") 28 | } 29 | 30 | } 31 | -------------------------------------------------------------------------------- /src/test/scala/com/duchessfr/spark/core/Ex1UserMiningSpec.scala: -------------------------------------------------------------------------------- 1 | package com.duchessfr.spark.core 2 | 3 | import org.scalatest.{Matchers, FunSuite} 4 | 5 | /** 6 | * Here are the tests to help you to implement the Ex1UserMining 7 | */ 8 | class Ex1UserMiningSpec extends FunSuite with Matchers { 9 | 10 | test("should count the number of couple (user, tweets)") { 11 | val tweets = Ex1UserMining.tweetsByUser 12 | tweets.count should be (5967) 13 | } 14 | 15 | test("tweetByUserNumber should count the number of tweets by user"){ 16 | val tweetsByUser = Ex1UserMining.tweetByUserNumber 17 | tweetsByUser.count should be (5967) 18 | tweetsByUser.collect should contain ("Dell Feddi", 29) 19 | } 20 | 21 | test("should return the top ten twitterers"){ 22 | val top10 = Ex1UserMining.topTenTwitterers 23 | top10.size should be (10) 24 | top10 should contain ("williampriceking", 46) 25 | top10 should contain ("Phillthy McNasty",43) 26 | } 27 | 28 | } 29 |
-------------------------------------------------------------------------------- /src/test/scala/com/duchessfr/spark/core/Ex2TweetMiningSpec.scala: -------------------------------------------------------------------------------- 1 | package com.duchessfr.spark.core 2 | 3 | import org.scalatest.{Matchers, FunSuite} 4 | 5 | /** 6 | * Here are the tests to help you to implement the Ex2TweetMining 7 | */ 8 | class Ex2TweetMiningSpec extends FunSuite with Matchers { 9 | 10 | test("should count the persons mentioned on tweets") { 11 | val userMentions = Ex2TweetMining.mentionOnTweet 12 | userMentions.count should be (4462) 13 | } 14 | 15 | test("should count the number for each user mention"){ 16 | val mentionsCount = Ex2TweetMining.countMentions 17 | mentionsCount.count should be (3283) 18 | mentionsCount.collect should contain ("@JordinSparks", 2) 19 | } 20 | 21 | test("should define the top10"){ 22 | val top10 = Ex2TweetMining.top10mentions 23 | top10.size should be (10) 24 | top10 should contain ("@HIITMANonDECK", 100) 25 | top10 should contain ("@ShawnMendes", 189) 26 | top10 should contain ("@officialdjjuice", 59) 27 | } 28 | 29 | } 30 | -------------------------------------------------------------------------------- /src/test/scala/com/duchessfr/spark/core/Ex3HashTagMiningSpec.scala: -------------------------------------------------------------------------------- 1 | package com.duchessfr.spark.core 2 | 3 | import org.scalatest.{Matchers, FunSuite} 4 | 5 | /** 6 | * Here are the tests to help you to implement the Ex3HashTagMining class 7 | */ 8 | class Ex3HashTagMiningSpec extends FunSuite with Matchers { 9 | 10 | test("should count the hashtag mentioned on tweets") { 11 | val hashtagMentions = Ex3HashTagMining.hashtagMentionedOnTweet 12 | hashtagMentions.count should be (5262) 13 | } 14 | 15 | test("should count the number of mention by hashtag"){ 16 | val mentionsCount = Ex3HashTagMining.countMentions 17 | mentionsCount.count should be (2461) 18 | mentionsCount.collect should contain ("#youtube", 2) 19 | } 20 | 21 | test("should define the top10"){ 22 | val top10 = Ex3HashTagMining.top10HashTags 23 | top10.size should be (10) 24 | top10 should contain ("#DME", 253) 25 | } 26 | } 27 | -------------------------------------------------------------------------------- /src/test/scala/com/duchessfr/spark/core/Ex4InvertedIndexSpec.scala: -------------------------------------------------------------------------------- 1 | package com.duchessfr.spark.core 2 | 3 | import org.scalatest.{Matchers, FunSuite} 4 | 5 | /** 6 | * Here are the tests to help you to implement the Ex4InvertedIndex class 7 | */ 8 | class Ex4InvertedIndexSpec extends FunSuite with Matchers { 9 | 10 | test("should return an inverted index") { 11 | val invertedIndex = Ex4InvertedIndex.invertedIndex 12 | invertedIndex.size should be (2461) 13 | //invertedIndex should contain ("Paris" -> 144) 14 | invertedIndex should contain key ("#EDM") 15 | invertedIndex should contain key ("#Paris") 16 | } 17 | 18 | } 19 | -------------------------------------------------------------------------------- /src/test/scala/com/duchessfr/spark/dataframe/DataFrameOnTweetsSpec.scala: -------------------------------------------------------------------------------- 1 | package com.duchessfr.spark.dataframe 2 | 3 | import org.scalatest.{Matchers, FunSuite} 4 | 5 | /** 6 | * Here are the tests to help you to implement the DataFrameOnTweets class 7 | */ 8 | class DataFrameOnTweetsSpec extends FunSuite with Matchers { 9 | 10 | test("should load the data 
and init the context") { 11 | val data = DataFrameOnTweets.loadData 12 | data.count should be (8198) 13 | } 14 | 15 | test("should show the dataframe"){ 16 | DataFrameOnTweets.showDataFrame 17 | // you should see something like this in your console: 18 | //+--------------------+------------------+-----------------+--------------------+-------------------+ 19 | //| country| id| place| text| user| 20 | //+--------------------+------------------+-----------------+--------------------+-------------------+ 21 | //| India|572692378957430785| Orissa|@always_nidhi @Yo...| Srkian_nishu :)| 22 | //| United States|572575240615796737| Manhattan|@OnlyDancers Bell...| TagineDiningGlobal| 23 | //| United States|572575243883036672| Claremont|1/ "Without the a...| Daniel Beer| 24 | } 25 | 26 | test("should print the schema"){ 27 | DataFrameOnTweets.printSchema 28 | // you should see something like this in your console: 29 | // root 30 | // |-- country: string (nullable = true) 31 | // |-- id: string (nullable = true) 32 | // |-- place: string (nullable = true) 33 | // |-- text: string (nullable = true) 34 | // |-- user: string (nullable = true) 35 | } 36 | 37 | test("should group the tweets by location") { 38 | val data = DataFrameOnTweets.filterByLocation 39 | data.count should be (329) 40 | } 41 | 42 | test("should return the most popular twitterer") { 43 | val populars = DataFrameOnTweets.mostPopularTwitterer 44 | populars should be (258, "#QuissyUpSoon") 45 | } 46 | 47 | } 48 | -------------------------------------------------------------------------------- /src/test/scala/com/duchessfr/spark/streaming/StreamingOnTweetsSpec.scala: -------------------------------------------------------------------------------- 1 | package com.duchessfr.spark.streaming 2 | 3 | import org.scalatest.{Matchers, FunSuite} 4 | 5 | /** 6 | * Here are the tests to help you to implement the StreamingOnTweets class 7 | * These are not real unit tests because of the live stream context, but they can 8 | * still help you run the function 9 | */ 10 | class StreamingOnTweetsSpec extends FunSuite with Matchers { 11 | 12 | test("should return the 10 most popular hashtag") { 13 | StreamingOnTweets.top10Hashtag 14 | // You should see something like this: 15 | // Most popular hashtag : #tlot: 1, #followme: 1,... 16 | } 17 | } 18 | --------------------------------------------------------------------------------