├── .gitignore
├── README.md
├── build.sbt
├── data
│   ├── reduced-tweets.json
│   └── wordcount.txt
├── img
│   ├── dataframe.png
│   └── streaming.png
├── sbt
│   ├── sbt
│   └── sbt-launch.jar
└── src
    ├── main
    │   └── scala
    │       └── com
    │           └── duchessfr
    │               └── spark
    │                   ├── core
    │                   │   ├── Ex0Wordcount.scala
    │                   │   ├── Ex1UserMining.scala
    │                   │   ├── Ex2TweetMining.scala
    │                   │   ├── Ex3HashTagMining.scala
    │                   │   └── Ex4InvertedIndex.scala
    │                   ├── dataframe
    │                   │   └── DataFrameOnTweets.scala
    │                   ├── streaming
    │                   │   └── StreamingOnTweets.scala
    │                   └── utils
    │                       └── TweetUtils.scala
    └── test
        └── scala
            └── com
                └── duchessfr
                    └── spark
                        ├── core
                        │   ├── Ex0WordcountSpec.scala
                        │   ├── Ex1UserMiningSpec.scala
                        │   ├── Ex2TweetMiningSpec.scala
                        │   ├── Ex3HashTagMiningSpec.scala
                        │   └── Ex4InvertedIndexSpec.scala
                        ├── dataframe
                        │   └── DataFrameOnTweetsSpec.scala
                        └── streaming
                            └── StreamingOnTweetsSpec.scala
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | /target
2 | .idea
3 | *.iml
4 | /project
5 | .DS_Store
6 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Workshop spark-in-practice
2 | 
3 | In this workshop the exercises focus on using the [Spark core](https://spark.apache.org/docs/1.4.0/programming-guide.html) and [Spark Streaming](https://spark.apache.org/docs/1.4.0/streaming-programming-guide.html) APIs, as well as the [DataFrame](https://spark.apache.org/docs/1.4.0/sql-programming-guide.html) API, for data processing.
4 | Exercises are available in both [Java](https://github.com/nivdul/spark-in-practice) and Scala on my github account (this repository is the Scala version). You just have to clone the project and go!
5 | If you need help, take a look at the solution branch.
6 | 
7 | The original blog post is right [here](https://nivdul.wordpress.com/2015/08/09/getting-started-with-spark-in-practice/).
8 | 
9 | Unit tests are included to help you implement each class.
10 | 
11 | Frameworks used:
12 | 
13 | * Spark 1.4.0
14 | * scala 2.10
15 | * sbt
16 | * scalatest
17 | 
18 | All exercises run in local mode as standalone programs.
19 | 
20 | To work on the hands-on, retrieve the code via the following command line:
21 | $ git clone https://github.com/nivdul/spark-in-practice-scala.git
22 |
23 | Then you can import the project into IntelliJ or Eclipse (add the SBT and Scala plugins for Scala), or use Sublime Text, for example.
24 |
25 | If you want to use the interactive spark-shell (Scala or Python only), you need to download a [binary Spark distribution](https://spark.apache.org/downloads.html).
26 |
27 | Go to the Spark directory
28 | $ cd /spark-1.4.0
29 |
30 | First build the project
31 | $ build/mvn -DskipTests clean package
32 |
33 | Launch the spark-shell
34 | $ ./bin/spark-shell
35 | scala>
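
Once the prompt appears, the SparkContext is already available as `sc`, so a quick sanity check (a tiny computation unrelated to the workshop) should print something like:

scala> sc.parallelize(1 to 10).sum
res0: Double = 55.0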
36 |
37 |
38 | ## Part 1: Spark core API
39 | To become more familiar with the Spark API, you will start by implementing the wordcount example (Ex0).
40 | After that, we use reduced tweets in JSON format as the data for the tweet-mining exercises (Ex1-Ex3).
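
To see the general map/reduce shape that Ex0 asks for, here is the textbook wordcount pattern, which you can try directly in the spark-shell (`sc` is the shell's pre-created SparkContext); the exercise asks you to adapt this shape inside Ex0Wordcount:

```scala
// Split each line into words, map every word to a (word, 1) pair,
// then sum the 1s per word.
val counts = sc.textFile("data/wordcount.txt")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(5).foreach(println)
```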
41 |
42 | In these exercises you will have to:
43 |
44 | * Find all the tweets by user
45 | * Find how many tweets each user has
46 | * Find all the persons mentioned on tweets
47 | * Count how many times each person is mentioned
48 | * Find the 10 most mentioned persons
49 | * Find all the hashtags mentioned on a tweet
50 | * Count how many times each hashtag is mentioned
51 | * Find the 10 most popular Hashtags
52 |
53 | The last exercise (Ex4) is a bit more complicated: the goal is to build an inverted index, knowing that an inverted index is the data structure used to build search engines.
54 | Assuming #spark is a hashtag that appears in tweet1, tweet3 and tweet39, the inverted index will be a Map that contains a (key, value) pair such as (#spark, List(tweet1, tweet3, tweet39)).
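
In Scala terms the result has roughly this shape (the values below are just hypothetical placeholders for parsed Tweet objects):

```scala
// Purely illustrative: keys are hashtags, values are the tweets containing them.
val invertedIndex: Map[String, Iterable[String]] =
  Map("#spark" -> List("tweet1", "tweet3", "tweet39"))
```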
55 |
56 | ## Part 2: streaming analytics with Spark Streaming
57 | Spark Streaming is a component of Spark for processing live data streams in a scalable, high-throughput and fault-tolerant way.
58 |
59 | 
60 |
61 | Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
62 | The abstraction that represents a continuous stream of data is the DStream (discretized stream).
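
To get a feel for the DStream API before the exercise, here is a minimal, self-contained sketch (not the Twitter exercise itself; the object name is made up) that counts words received on a local socket in 2-second batches, assuming something like `nc -lk 9999` is feeding text into port 9999:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordcountSketch extends App {
  val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
  val ssc  = new StreamingContext(conf, Seconds(2))

  // DStream[String]: one RDD of text lines every 2 seconds
  val lines  = ssc.socketTextStream("localhost", 9999)
  val counts = lines.flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  counts.print() // prints the first elements of each batch

  ssc.start()
  ssc.awaitTermination()
}
```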
63 |
64 | In the workshop, Spark Streaming is used to process a live stream of Tweets using twitter4j, a library for the Twitter API.
65 | To be able to read the firehose, you will need to create a Twitter application at http://apps.twitter.com, get your credentials, and add them in the StreamingOnTweets class.
66 |
67 | In this exercise you will have to:
68 |
69 | * Print the status of each tweet
70 | * Find the 10 most popular hashtags
71 |
72 | ## Part 3: structured data with the DataFrame
73 | A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
74 | DataFrames can be constructed from different sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
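
As a minimal sketch of this (Spark 1.4 API; the object name is made up, and the reduced-tweets file comes from this repository), loading a JSON file into a DataFrame and querying it looks like this:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameSketch extends App {
  val conf = new SparkConf().setAppName("dataframe-sketch").setMaster("local[*]")
  val sc   = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  // Spark infers the schema (country, id, place, text, user) from the JSON file
  val df = sqlContext.read.json("data/reduced-tweets.json")
  df.printSchema()
  df.show(5)                                  // first rows of the dataframe
  df.filter(df("place") === "Paris").show()   // tweets whose place is Paris
}
```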
75 |
76 | 
77 |
78 | In the exercise you will have to:
79 |
80 | * Print the dataframe
81 | * Print the schema of the dataframe
82 | * Find people who are located in Paris
83 | * Find the user who tweets the most
84 |
85 | ## Conclusion
86 | If you find a better way or implementation, do not hesitate to send a pull request or open an issue.
87 |
88 | Here are some useful links around Spark and its ecosystem:
89 |
90 | * [Apache Spark website](https://spark.apache.org/docs/1.4.0/programming-guide.html)
91 | * [Spark Streaming documentation](https://spark.apache.org/docs/1.4.0/streaming-programming-guide.html)
92 | * [Spark SQL and DataFrame documentation](https://spark.apache.org/docs/1.4.0/sql-programming-guide.html)
93 | * [Databricks blog](https://databricks.com/blog)
94 | * [Analyze data from an accelerometer using Spark, Cassandra and MLlib](http://www.duchess-france.org/analyze-accelerometer-data-with-apache-spark-and-mllib/)
95 |
96 |
--------------------------------------------------------------------------------
/build.sbt:
--------------------------------------------------------------------------------
1 | name := "Spark-HandsOn"
2 |
3 | version := "1.0"
4 |
5 | scalaVersion := "2.10.4"
6 |
7 | libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2"
8 |
9 | libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.5.2"
10 |
11 | libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.5.2"
12 |
13 | libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.5.2"
14 |
15 | libraryDependencies += "com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4"
16 |
17 | libraryDependencies += "com.google.code.gson" % "gson" % "2.3.1"
18 |
19 | libraryDependencies += "org.apache.spark" % "spark-streaming-twitter_2.10" % "1.5.2"
20 |
21 | libraryDependencies += "org.twitter4j" % "twitter4j-core" % "3.0.3"
22 |
23 | libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.2.4" % "test"
24 |
25 | resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
26 |
--------------------------------------------------------------------------------
/data/wordcount.txt:
--------------------------------------------------------------------------------
1 | word count from Wikipedia the free encyclopedia
2 | the word count is the number of words in a document or passage of text Word counting may be needed when a text
3 | is required to stay within certain numbers of words This may particularly be the case in academia legal
4 | proceedings journalism and advertising Word count is commonly used by translators to determine the price for
5 | the translation job Word counts may also be used to calculate measures of readability and to measure typing
6 | and reading speeds usually in words per minute When converting character counts to words a measure of five or
7 | six characters to a word is generally used Contents Details and variations of definition Software In fiction
8 | In non fiction See also References Sources External links Details and variations of definition
9 | This section does not cite any references or sources Please help improve this section by adding citations to
10 | reliable sources Unsourced material may be challenged and removed
11 | Variations in the operational definitions of how to count the words can occur namely what counts as a word and
12 | which words don't count toward the total However especially since the advent of widespread word processing there
13 | is a broad consensus on these operational definitions and hence the bottom line integer result
14 | The consensus is to accept the text segmentation rules generally found in most word processing software including how
15 | word boundaries are determined which depends on how word dividers are defined The first trait of that definition is that a space any of various whitespace
16 | characters such as a regular word space an em space or a tab character is a word divider Usually a hyphen or a slash is too
17 | Different word counting programs may give varying results depending on the text segmentation rule
18 | details and on whether words outside the main text such as footnotes endnotes or hidden text) are counted But the behavior
19 | of most major word processing applications is broadly similar However during the era when school assignments were done in
20 | handwriting or with typewriters the rules for these definitions often differed from todays consensus
21 | Most importantly many students were drilled on the rule that certain words don't count usually articles namely a an the but
22 | sometimes also others such as conjunctions for example and or but and some prepositions usually to of Hyphenated permanent
23 | compounds such as follow up noun or long term adjective were counted as one word To save the time and effort of counting
24 | word by word often a rule of thumb for the average number of words per line was used such as 10 words per line These rules
25 | have fallen by the wayside in the word processing era the word count feature of such software which follows the text
26 | segmentation rules mentioned earlier is now the standard arbiter because it is largely consistent across documents and
27 | applications and because it is fast effortless and costless already included with the application As for which sections of
28 | a document count toward the total such as footnotes endnotes abstracts reference lists and bibliographies tables figure
29 | captions hidden text the person in charge teacher client can define their choice and users students workers can simply
30 | select or exclude the elements accordingly and watch the word count automatically update Software Modern web browsers
31 | support word counting via extensions via a JavaScript bookmarklet or a script that is hosted in a website Most word
32 | processors can also count words Unix like systems include a program wc specifically for word counting
33 | As explained earlier different word counting programs may give varying results depending on the text segmentation rule
34 | details The exact number of words often is not a strict requirement thus the variation is acceptable
35 | In fiction Novelist Jane Smiley suggests that length is an important quality of the novel However novels can vary
36 | tremendously in length Smiley lists novels as typically being between and words while National Novel Writing Month
37 | requires its novels to be at least words There are no firm rules for example the boundary between a novella and a novel
38 | is arbitrary and a literary work may be difficult to categorise But while the length of a novel is to a large extent up
39 | to its writer lengths may also vary by subgenre many chapter books for children start at a length of about words and a
40 | typical mystery novel might be in the to word range while a thriller could be over words
41 | The Science Fiction and Fantasy Writers of America specifies word lengths for each category of its Nebula award categories
42 | Classification Word count Novel over words Novella to words Novelette to words Short story under words
43 | In non fiction The acceptable length of an academic dissertation varies greatly dependent predominantly on the subject
44 | Numerous American universities limit Ph.D. dissertations to at most words barring special permission for exceeding this limit
--------------------------------------------------------------------------------
/img/dataframe.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nivdul/spark-in-practice-scala/ebbfdab13f24654f7dd4742ef882dafd5c82be54/img/dataframe.png
--------------------------------------------------------------------------------
/img/streaming.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nivdul/spark-in-practice-scala/ebbfdab13f24654f7dd4742ef882dafd5c82be54/img/streaming.png
--------------------------------------------------------------------------------
/sbt/sbt:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
3 | export SPARK_JAVA_OPTS="-XX:+UseConcMarkSweepGC"
4 | java -Xmx1024m -jar $(dirname $0)/sbt-launch.jar "$@"
5 |
--------------------------------------------------------------------------------
/sbt/sbt-launch.jar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nivdul/spark-in-practice-scala/ebbfdab13f24654f7dd4742ef882dafd5c82be54/sbt/sbt-launch.jar
--------------------------------------------------------------------------------
/src/main/scala/com/duchessfr/spark/core/Ex0Wordcount.scala:
--------------------------------------------------------------------------------
1 | package com.duchessfr.spark.core
2 |
3 | import org.apache.spark.SparkContext
4 | import org.apache.spark.SparkConf
5 | import org.apache.spark.rdd.RDD
6 |
7 | /**
8 | * The scala Spark API documentation: http://spark.apache.org/docs/latest/api/scala/index.html
9 | *
10 | * Here the goal is to count how many times each word appears in a file and make some operations on the result.
11 | * We use the mapreduce pattern to do this:
12 | *
13 | * step 1, the mapper:
14 | * - we attribute the value 1 to each word, obtaining pairs (word, 1), where the word is the key.
15 | *
16 | * step 2, the reducer:
17 | * - for each key (= word), the values are summed to obtain the total count.
18 | *
19 | * Use the Ex0WordcountSpec to implement the code.
20 | */
21 | object Ex0Wordcount {
22 |
23 | val pathToFile = "data/wordcount.txt"
24 |
25 | /**
26 | * Load the data from the text file and return an RDD of words
27 | */
28 | def loadData(): RDD[String] = {
29 | // create spark configuration and spark context: the Spark context is the entry point in Spark.
30 | // It represents the connection to Spark and it is the place where you can configure common properties
31 | // like the app name, the master URL, memory allocation...
32 | val conf = new SparkConf()
33 | .setAppName("Wordcount")
34 | .setMaster("local[*]") // here local mode. And * means you will use as much as you have cores.
35 |
36 | val sc = new SparkContext(conf)
37 |
38 | // load data and create an RDD where each element will be a word
39 | // Here the flatMap method is used to split each line into words using the space separator
40 | // In this way it returns an RDD where each "element" is a word
41 | sc.textFile(pathToFile)
42 | .flatMap(_.split(" "))
43 | }
44 |
45 | /**
46 | * Now count how many times each word appears!
47 | */
48 | def wordcount(): RDD[(String, Int)] = {
49 | val tweets = loadData
50 |
51 | // Step 1: the mapper step
52 | // The philosophy: we want to attribute the number 1 to each word: so we create couples (word, 1).
53 | // Hint: look at the map method (the Scala API has no mapToPair; a plain map to (word, 1) pairs is enough)
54 | // TODO write code here
55 |
56 | // Step 2: reducer step
57 | // The philosophy: now you have a couple (key, value) where the key is a word, you want to aggregate the value for each word.
58 | // So you will use a reducer function.
59 | // Hint: the Spark API provides some reduce methods
60 | // TODO write code here
61 | null
62 |
63 | }
64 |
65 | /**
66 | * Now keep the words which appear strictly more than 4 times!
67 | */
68 | def filterOnWordcount(): RDD[(String, Int)] = {
69 | val tweets = wordcount
70 |
71 | // Hint: the Spark API provides a filter method
72 | // TODO write code here
73 | null
74 | }
75 |
76 | }
77 |
--------------------------------------------------------------------------------
/src/main/scala/com/duchessfr/spark/core/Ex1UserMining.scala:
--------------------------------------------------------------------------------
1 | package com.duchessfr.spark.core
2 |
3 |
4 | import org.apache.spark.{SparkContext, SparkConf}
5 |
6 | import org.apache.spark.rdd._
7 | import com.duchessfr.spark.utils.TweetUtils
8 | import com.duchessfr.spark.utils.TweetUtils._
9 |
10 | /**
11 | * The scala API documentation: http://spark.apache.org/docs/latest/api/scala/index.html
12 | *
13 | * We still use the dataset with the 8198 reduced tweets. The data are reduced tweets, as in the example below:
14 | *
15 | * {"id":"572692378957430785",
16 | * "user":"Srkian_nishu :)",
17 | * "text":"@always_nidhi @YouTube no i dnt understand bt i loved of this mve is rocking",
18 | * "place":"Orissa",
19 | * "country":"India"}
20 | *
21 | * We want to make some computations on the users:
22 | * - find all the tweets by user
23 | * - find how many tweets each user has
24 | *
25 | * Use the Ex1UserMiningSpec to implement the code.
26 | */
27 | object Ex1UserMining {
28 |
29 | val pathToFile = "data/reduced-tweets.json"
30 |
31 | /**
32 | * Load the data from the json file and return an RDD of Tweet
33 | */
34 | def loadData(): RDD[Tweet] = {
35 | // Create the spark configuration and spark context
36 | val conf = new SparkConf()
37 | .setAppName("User mining")
38 | .setMaster("local[*]")
39 |
40 | val sc = new SparkContext(conf)
41 |
42 | // Load the data and parse it into a Tweet.
43 | // Look at the Tweet Object in the TweetUtils class.
44 | sc.textFile(pathToFile).mapPartitions(TweetUtils.parseFromJson(_))
45 | }
46 |
47 | /**
48 | * For each user return all his tweets
49 | */
50 | def tweetsByUser(): RDD[(String, Iterable[Tweet])] = {
51 | val tweets = loadData
52 | // TODO write code here
53 | // Hint: the Spark API provides a groupBy method
54 | null
55 | }
56 |
57 | /**
58 | * Compute the number of tweets by user
59 | */
60 | def tweetByUserNumber(): RDD[(String, Int)] = {
61 | val tweets = loadData
62 |
63 | // TODO write code here
64 | // Hint: think about what you did in the wordcount example
65 | null
66 | }
67 |
68 |
69 | /**
70 | * Top 10 twitterers
71 | */
72 | def topTenTwitterers(): Array[(String, Int)] = {
73 |
74 | // Return the top 10 users who tweet the most
75 | // TODO write code here
76 | // Hint: the Spark API provides a sortBy method
77 | null
78 | }
79 |
80 | }
81 |
82 |
--------------------------------------------------------------------------------
/src/main/scala/com/duchessfr/spark/core/Ex2TweetMining.scala:
--------------------------------------------------------------------------------
1 |
2 | package com.duchessfr.spark.core
3 |
4 | import org.apache.spark.{SparkConf, SparkContext}
5 | import org.apache.spark.rdd._
6 | import com.duchessfr.spark.utils._
7 | import com.duchessfr.spark.utils.TweetUtils.Tweet
8 |
9 | /**
10 | * The scala Spark API documentation: http://spark.apache.org/docs/latest/api/scala/index.html
11 | *
12 | * We still use the dataset with the 8198 reduced tweets. Here is an example of a tweet:
13 | *
14 | * {"id":"572692378957430785",
15 | * "user":"Srkian_nishu :)",
16 | * "text":"@always_nidhi @YouTube no i dnt understand bt i loved of this mve is rocking",
17 | * "place":"Orissa",
18 | * "country":"India"}
19 | *
20 | * We want to make some computations on the tweets:
21 | * - Find all the persons mentioned on tweets
22 | * - Count how many times each person is mentioned
23 | * - Find the 10 most mentioned persons in descending order
24 | *
25 | * Use the Ex2TweetMiningSpec to implement the code.
26 | */
27 | object Ex2TweetMining {
28 |
29 | val pathToFile = "data/reduced-tweets.json"
30 |
31 | /**
32 | * Load the data from the json file and return an RDD of Tweet
33 | */
34 | def loadData(): RDD[Tweet] = {
35 | // create spark configuration and spark context
36 | val conf = new SparkConf()
37 | .setAppName("Tweet mining")
38 | .setMaster("local[*]")
39 |
40 | val sc = new SparkContext(conf)
41 |
42 | // Load the data and parse it into a Tweet.
43 | // Look at the Tweet Object in the TweetUtils class.
44 | sc.textFile(pathToFile)
45 | .mapPartitions(TweetUtils.parseFromJson(_))
46 |
47 | }
48 |
49 | /**
50 | * Find all the persons mentioned on tweets (case sensitive)
51 | */
52 | def mentionOnTweet(): RDD[String] = {
53 | val tweets = loadData
54 |
55 | // Hint: think about splitting the text field into words and then finding the mentions
56 | // TODO write code here
57 | null
58 | }
59 |
60 | /**
61 | * Count how many times each person is mentioned
62 | */
63 | def countMentions(): RDD[(String, Int)] = {
64 | val mentions = mentionOnTweet
65 |
66 | // Hint: think about what you did in the wordcount example
67 | // TODO write code here
68 | null
69 | }
70 |
71 | /**
72 | * Find the 10 most mentioned persons in descending order
73 | */
74 | def top10mentions(): Array[(String, Int)] = {
75 |
76 | // Hint: take a look at the sorting and take methods
77 | // TODO write code here
78 | null
79 | }
80 |
81 | }
82 |
--------------------------------------------------------------------------------
/src/main/scala/com/duchessfr/spark/core/Ex3HashTagMining.scala:
--------------------------------------------------------------------------------
1 | package com.duchessfr.spark.core
2 |
3 | import org.apache.spark.{SparkContext, SparkConf}
4 | import org.apache.spark.rdd._
5 | import com.duchessfr.spark.utils.TweetUtils
6 | import com.duchessfr.spark.utils.TweetUtils._
7 |
8 | /**
9 | * The scala Spark API documentation: http://spark.apache.org/docs/latest/api/scala/index.html
10 | *
11 | * We still use the dataset with the 8198 reduced tweets. Here is an example of a tweet:
12 | *
13 | * {"id":"572692378957430785",
14 | * "user":"Srkian_nishu :)",
15 | * "text":"@always_nidhi @YouTube no i dnt understand bt i loved of this mve is rocking",
16 | * "place":"Orissa",
17 | * "country":"India"}
18 | *
19 | * We want to make some computations on the hashtags. It is very similar to exercise 2:
20 | * - Find all the hashtags mentioned on a tweet
21 | * - Count how many times each hashtag is mentioned
22 | * - Find the 10 most popular hashtags in descending order
23 | *
24 | * Use the Ex3HashTagMiningSpec to implement the code.
25 | */
26 | object Ex3HashTagMining {
27 |
28 | val pathToFile = "data/reduced-tweets.json"
29 |
30 | /**
31 | * Load the data from the json file and return an RDD of Tweet
32 | */
33 | def loadData(): RDD[Tweet] = {
34 | // create spark configuration and spark context
35 | val conf = new SparkConf()
36 | .setAppName("Hashtag mining")
37 | .setMaster("local[*]")
38 |
39 | val sc = new SparkContext(conf)
40 |
41 | // Load the data and parse it into a Tweet.
42 | // Look at the Tweet Object in the TweetUtils class.
43 | sc.textFile(pathToFile)
44 | .mapPartitions(TweetUtils.parseFromJson(_))
45 | }
46 |
47 | /**
48 | * Find all the hashtags mentioned on tweets
49 | */
50 | def hashtagMentionedOnTweet(): RDD[String] = {
51 | val tweets = loadData
52 | // You want to return an RDD with the hashtags
53 | // Hint: think about splitting the text field into words and then finding the hashtags
54 | // TODO write code here
55 | null
56 | }
57 |
58 |
59 | /**
60 | * Count how many times each hashtag is mentioned
61 | */
62 | def countMentions(): RDD[(String, Int)] = {
63 | val tags = hashtagMentionedOnTweet
64 | // Hint: think about what you did in the wordcount example
65 | // TODO write code here
66 | null
67 | }
68 |
69 | /**
70 | * Find the 10 most popular hashtags in descending order
71 | */
72 | def top10HashTags(): Array[(String, Int)] = {
73 | val countTags = countMentions
74 | // Hint: take a look at the sorting and take methods
75 | // TODO write code here
76 | null
77 | }
78 |
79 | }
80 |
--------------------------------------------------------------------------------
/src/main/scala/com/duchessfr/spark/core/Ex4InvertedIndex.scala:
--------------------------------------------------------------------------------
1 | package com.duchessfr.spark.core
2 |
3 | import com.duchessfr.spark.utils.TweetUtils.Tweet
4 | import org.apache.spark.{SparkContext, SparkConf}
5 |
6 | import com.duchessfr.spark.utils.TweetUtils
7 |
8 | import scala.collection.Map
9 |
10 | object Ex4InvertedIndex {
11 |
12 | /**
13 | *
14 | * Building a hashtag search engine
15 | *
16 | * The goal is to build an inverted index. An inverted index is the data structure used to build search engines.
17 | *
18 | * How does it work?
19 | *
20 | * Assuming #spark is a hashtag that appears in tweet1, tweet3 and tweet39,
21 | * the inverted index that you must return should be a Map (or HashMap) that contains a (key, value) pair such as (#spark, List(tweet1, tweet3, tweet39)).
22 | *
23 | * Use the Ex4InvertedIndexSpec to implement the code.
24 | */
25 | def invertedIndex(): Map[String, Iterable[Tweet]] = {
26 | // create spark configuration and spark context
27 | val conf = new SparkConf ()
28 | .setAppName ("Inverted index")
29 | .setMaster ("local[*]")
30 |
31 | val sc = new SparkContext (conf)
32 |
33 | val tweets = sc.textFile ("data/reduced-tweets.json")
34 | .mapPartitions (TweetUtils.parseFromJson (_) )
35 |
36 | // Let's try it out!
37 | // Hint:
38 | // For each tweet, extract all the hashtags and then create pairs (hashtag, tweet)
39 | // Then group the tweets by hashtag
40 | // Finally return the inverted index as a map structure
41 | // TODO write code here
42 | null
43 | }
44 |
45 | }
46 |
--------------------------------------------------------------------------------
/src/main/scala/com/duchessfr/spark/dataframe/DataFrameOnTweets.scala:
--------------------------------------------------------------------------------
1 | package com.duchessfr.spark.dataframe
2 |
3 | import org.apache.spark._
4 | import org.apache.spark.sql._
5 |
6 | /**
7 | * The Spark SQL and DataFrame documentation is available on:
8 | * https://spark.apache.org/docs/1.4.0/sql-programming-guide.html
9 | *
10 | * A DataFrame is a distributed collection of data organized into named columns.
11 | * The entry point for using DataFrames is the SQLContext class (from Spark SQL).
12 | * With a SQLContext, you can create DataFrames from:
13 | * - an existing RDD
14 | * - a Hive table
15 | * - data sources...
16 | *
17 | * In the exercise we will create a dataframe with the content of a JSON file.
18 | *
19 | * We want to:
20 | * - print the dataframe
21 | * - print the schema of the dataframe
22 | * - find people who are located in Paris
23 | * - find the user who tweets the most
24 | *
25 | * And just to recap, we use a dataset with 8198 tweets, where a tweet looks like this:
26 | *
27 | * {"id":"572692378957430785",
28 | * "user":"Srkian_nishu :)",
29 | * "text":"@always_nidhi @YouTube no i dnt understand bt i loved of this mve is rocking",
30 | * "place":"Orissa",
31 | * "country":"India"}
32 | *
33 | * Use the DataFrameOnTweetsSpec to implement the code.
34 | */
35 | object DataFrameOnTweets {
36 |
37 |
38 | val pathToFile = "data/reduced-tweets.json"
39 |
40 | /**
41 | * Here the method to create the contexts (Spark and SQL) and
42 | * then create the dataframe.
43 | *
44 | * Run the test to see what the dataframe looks like!
45 | */
46 | def loadData(): DataFrame = {
47 | // create spark configuration and spark context
48 | val conf = new SparkConf()
49 | .setAppName("Dataframe")
50 | .setMaster("local[*]")
51 |
52 | val sc = new SparkContext(conf)
53 |
54 | // Create a sql context: the SQLContext wraps the SparkContext, and is specific to Spark SQL.
55 | // It is the entry point in Spark SQL.
56 | // TODO write code here
57 | val sqlcontext = null
58 |
59 | // Load the data, knowing that the file is a JSON file
60 | // Hint: use the sqlContext and apply the read method before loading the json file
61 | // TODO write code here
62 | null
63 | }
64 |
65 |
66 | /**
67 | * See what the dataframe looks like
68 | */
69 | def showDataFrame() = {
70 | val dataframe = loadData()
71 |
72 | // Displays the content of the DataFrame to stdout
73 | // TODO write code here
74 | }
75 |
76 | /**
77 | * Print the schema
78 | */
79 | def printSchema() = {
80 | val dataframe = loadData()
81 |
82 | // Print the schema
83 | // TODO write code here
84 | }
85 |
86 | /**
87 | * Find people who are located in Paris
88 | */
89 | def filterByLocation(): DataFrame = {
90 | val dataframe = loadData()
91 |
92 | // Select all the persons who are located in Paris
93 | // TODO write code here
94 | null
95 | }
96 |
97 |
98 | /**
99 | * Find the user who tweets the most
100 | */
101 | def mostPopularTwitterer(): (Long, String) = {
102 | val dataframe = loadData()
103 |
104 | // First group the tweets by user
105 | // Then sort by descending order and take the first one
106 | // TODO write code here
107 | null
108 | }
109 |
110 | }
111 |
--------------------------------------------------------------------------------
/src/main/scala/com/duchessfr/spark/streaming/StreamingOnTweets.scala:
--------------------------------------------------------------------------------
1 | package com.duchessfr.spark.streaming
2 |
3 | import org.apache.spark.streaming.{Seconds, StreamingContext}
4 | import org.apache.spark.streaming.twitter._
5 | import org.apache.spark.SparkConf
6 | import org.apache.spark._
7 |
8 | /**
9 | * First authenticate with the Twitter streaming API.
10 | *
11 | * Go to https://apps.twitter.com/
12 | * Create your application and then get your own credentials (keys and access tokens tab)
13 | *
14 | * See https://databricks-training.s3.amazonaws.com/realtime-processing-with-spark-streaming.html
15 | * for help.
16 | *
17 | * If you have the following error "error 401 Unauthorized":
18 | * - it might be because of wrong credentials
19 | * OR
20 | * - a time zone issue (so make sure that the time zone on your computer is the correct one)
21 | *
22 | * The Spark Streaming documentation is available on:
23 | * http://spark.apache.org/docs/latest/streaming-programming-guide.html
24 | *
25 | * Spark Streaming is an extension of the core Spark API that enables scalable,
26 | * high-throughput, fault-tolerant stream processing of live data streams.
27 | * Spark Streaming receives live input data streams and divides the data into batches,
28 | * which are then processed by the Spark engine to generate the final stream of results in batches.
29 | * Spark Streaming provides a high-level abstraction called discretized stream or DStream,
30 | * which represents a continuous stream of data.
31 | *
32 | * In this exercise we will:
33 | * - Print the status text of some of the tweets
34 | * - Find the 10 most popular Hashtag in the last minute
35 | *
36 | * You can see information about the streaming in the Spark UI console: http://localhost:4040/streaming/
37 | */
38 | object StreamingOnTweets extends App {
39 |
40 | def top10Hashtag() = {
41 | // TODO fill the keys and tokens
42 | val CONSUMER_KEY = "TODO"
43 | val CONSUMER_SECRET = "TODO"
44 | val ACCESS_TOKEN = "TODO"
45 | val ACCESS_TOKEN_SECRET = "TODO"
46 |
47 | System.setProperty("twitter4j.oauth.consumerKey", CONSUMER_KEY)
48 | System.setProperty("twitter4j.oauth.consumerSecret", CONSUMER_SECRET)
49 | System.setProperty("twitter4j.oauth.accessToken", ACCESS_TOKEN)
50 | System.setProperty("twitter4j.oauth.accessTokenSecret", ACCESS_TOKEN_SECRET)
51 |
52 | // Load the data using TwitterUtils: we obtain a DStream of tweets
53 | //
54 | // More about TwitterUtils:
55 | // https://spark.apache.org/docs/1.4.0/api/java/index.html?org/apache/spark/streaming/twitter/TwitterUtils.html
56 |
57 | // create spark configuration and spark context
58 | val conf = new SparkConf()
59 | .setAppName("streaming")
60 | .setMaster("local[*]")
61 |
62 | val sc = new SparkContext(conf)
63 | // create a StreamingContext by providing the Spark context and a batch interval of 2 seconds
64 | val ssc = new StreamingContext(sc, Seconds(2))
65 |
66 | println("Initializing Twitter stream...")
67 |
68 | // Here we start a stream of tweets
69 | // The object tweetsStream is a DStream of tweet statuses:
70 | // - the Status class contains all information of a tweet
71 | // See http://twitter4j.org/javadoc/twitter4j/Status.html
72 | // and fill the keys and tokens
73 | val tweetsStream = TwitterUtils.createStream(ssc, None, Array[String]())
74 |
75 | //Your turn ...
76 |
77 | // Print the status text of some of the tweets
78 | // You must see tweets appear in the console
79 | val status = tweetsStream.map(_.getText)
80 | // Here print the status's text: see the Status class
81 | // Hint: use the print method
82 | // TODO write code here
83 |
84 |
85 | // Find the 10 most popular Hashtag in the last minute
86 |
87 | // For each tweet in the stream, extract all the hashtags
88 | // A stream is like a sequence of RDDs, so you can apply all the operations you used in the first part of the hands-on
89 | // Hint: think about what you did in the Hashtagmining part
90 | // TODO write code here
91 | val hashTags = null
92 |
93 | // Now here, find the 10 most popular hashtags in a 60 seconds window
94 | // Hint: look at the reduceByKeyAndWindow function in the spark doc.
95 | // Reduce last 60 seconds of data
96 | // Hint: look at the transform function to operate on the DStream
97 | // TODO write code here
98 | val top10 = null
99 |
100 | // and return the 10 most popular
101 | // Hint: loop on the RDD and take the 10 most popular
102 | // TODO write code here
103 |
104 | // we need to tell the context to start running the computation we have setup
105 | // it won't work if you don't add this!
106 | ssc.start
107 | ssc.awaitTermination
108 | }
109 | }
110 |
--------------------------------------------------------------------------------
/src/main/scala/com/duchessfr/spark/utils/TweetUtils.scala:
--------------------------------------------------------------------------------
1 | package com.duchessfr.spark.utils
2 |
3 | import com.google.gson._
4 |
5 | object TweetUtils {
6 | case class Tweet (
7 | id : String,
8 | user : String,
9 | userName : String,
10 | text : String,
11 | place : String,
12 | country : String,
13 | lang : String
14 | )
15 |
16 |
17 | def parseFromJson(lines:Iterator[String]):Iterator[Tweet] = {
18 | val gson = new Gson
19 | lines.map(line => gson.fromJson(line, classOf[Tweet]))
20 | }
21 | }
22 |
--------------------------------------------------------------------------------
/src/test/scala/com/duchessfr/spark/core/Ex0WordcountSpec.scala:
--------------------------------------------------------------------------------
1 | package com.duchessfr.spark.core
2 |
3 | import org.scalatest._
4 |
5 | /**
6 | * Here are the tests to help you to implement the Ex0Wordcount
7 | */
8 | class Ex0WordcountSpec extends FunSuite with Matchers {
9 |
10 | // this test is already green, but see how we load the data in the loadData method
11 | test("number of data loaded") {
12 | val data = Ex0Wordcount.loadData
13 | data.count should be (809)
14 | }
15 |
16 | test("countWord should count the occurrences of each word"){
17 | val wordCounts = Ex0Wordcount.wordcount
18 | wordCounts.count should be (381)
19 | wordCounts.collect should contain ("the", 38)
20 | wordCounts.collect should contain ("generally", 2)
21 | }
22 |
23 | test("filterOnWordcount should keep the words which appear more than 4 times"){
24 | val wordCounts = Ex0Wordcount.filterOnWordcount
25 | wordCounts.count should be (26)
26 | wordCounts.collect should contain ("the", 38)
27 | wordCounts.collect shouldNot contain ("generally")
28 | }
29 |
30 | }
31 |
--------------------------------------------------------------------------------
/src/test/scala/com/duchessfr/spark/core/Ex1UserMiningSpec.scala:
--------------------------------------------------------------------------------
1 | package com.duchessfr.spark.core
2 |
3 | import org.scalatest.{Matchers, FunSuite}
4 |
5 | /**
6 | * Here are the tests to help you to implement the Ex1UserMining
7 | */
8 | class Ex1UserMiningSpec extends FunSuite with Matchers {
9 |
10 | test("should count the number of couple (user, tweets)") {
11 | val tweets = Ex1UserMining.tweetsByUser
12 | tweets.count should be (5967)
13 | }
14 |
15 | test("tweetByUserNumber should count the number of tweets by user"){
16 | val tweetsByUser = Ex1UserMining.tweetByUserNumber
17 | tweetsByUser.count should be (5967)
18 | tweetsByUser.collect should contain ("Dell Feddi", 29)
19 | }
20 |
21 | test("should return the top ten twitterers"){
22 | val top10 = Ex1UserMining.topTenTwitterers
23 | top10.size should be (10)
24 | top10 should contain ("williampriceking", 46)
25 | top10 should contain ("Phillthy McNasty",43)
26 | }
27 |
28 | }
29 |
--------------------------------------------------------------------------------
/src/test/scala/com/duchessfr/spark/core/Ex2TweetMiningSpec.scala:
--------------------------------------------------------------------------------
1 | package com.duchessfr.spark.core
2 |
3 | import org.scalatest.{Matchers, FunSuite}
4 |
5 | /**
6 | * Here are the tests to help you to implement the Ex2TweetMining
7 | */
8 | class Ex2TweetMiningSpec extends FunSuite with Matchers {
9 |
10 | test("should count the persons mentioned on tweets") {
11 | val userMentions = Ex2TweetMining.mentionOnTweet
12 | userMentions.count should be (4462)
13 | }
14 |
15 | test("should count the number for each user mention"){
16 | val mentionsCount = Ex2TweetMining.countMentions
17 | mentionsCount.count should be (3283)
18 | mentionsCount.collect should contain ("@JordinSparks", 2)
19 | }
20 |
21 | test("should define the top10"){
22 | val top10 = Ex2TweetMining.top10mentions
23 | top10.size should be (10)
24 | top10 should contain ("@HIITMANonDECK", 100)
25 | top10 should contain ("@ShawnMendes", 189)
26 | top10 should contain ("@officialdjjuice", 59)
27 | }
28 |
29 | }
30 |
--------------------------------------------------------------------------------
/src/test/scala/com/duchessfr/spark/core/Ex3HashTagMiningSpec.scala:
--------------------------------------------------------------------------------
1 | package com.duchessfr.spark.core
2 |
3 | import org.scalatest.{Matchers, FunSuite}
4 |
5 | /**
6 | * Here are the tests to help you to implement the Ex3HashTagMining class
7 | */
8 | class Ex3HashTagMiningSpec extends FunSuite with Matchers {
9 |
10 | test("should count the hashtag mentioned on tweets") {
11 | val hashtagMentions = Ex3HashTagMining.hashtagMentionedOnTweet
12 | hashtagMentions.count should be (5262)
13 | }
14 |
15 | test("should count the number of mention by hashtag"){
16 | val mentionsCount = Ex3HashTagMining.countMentions
17 | mentionsCount.count should be (2461)
18 | mentionsCount.collect should contain ("#youtube", 2)
19 | }
20 |
21 | test("should define the top10"){
22 | val top10 = Ex3HashTagMining.top10HashTags
23 | top10.size should be (10)
24 | top10 should contain ("#DME", 253)
25 | }
26 | }
27 |
--------------------------------------------------------------------------------
/src/test/scala/com/duchessfr/spark/core/Ex4InvertedIndexSpec.scala:
--------------------------------------------------------------------------------
1 | package com.duchessfr.spark.core
2 |
3 | import org.scalatest.{Matchers, FunSuite}
4 |
5 | /**
6 | * Here are the tests to help you to implement the Ex4InvertedIndex class
7 | */
8 | class Ex4InvertedIndexSpec extends FunSuite with Matchers {
9 |
10 | test("should return an inverted index") {
11 | val invertedIndex = Ex4InvertedIndex.invertedIndex
12 | invertedIndex.size should be (2461)
13 | //invertedIndex should contain ("Paris" -> 144)
14 | invertedIndex should contain key ("#EDM")
15 | invertedIndex should contain key ("#Paris")
16 | }
17 |
18 | }
19 |
--------------------------------------------------------------------------------
/src/test/scala/com/duchessfr/spark/dataframe/DataFrameOnTweetsSpec.scala:
--------------------------------------------------------------------------------
1 | package com.duchessfr.spark.dataframe
2 |
3 | import org.scalatest.{Matchers, FunSuite}
4 |
5 | /**
6 | * Here are the tests to help you to implement the DataFrameOnTweets class
7 | */
8 | class DataFrameOnTweetsSpec extends FunSuite with Matchers {
9 |
10 | test("should load the data and init the context") {
11 | val data = DataFrameOnTweets.loadData
12 | data.count should be (8198)
13 | }
14 |
15 | test("should show the dataframe"){
16 | DataFrameOnTweets.showDataFrame
17 | // you must see something like that in your console:
18 | //+--------------------+------------------+-----------------+--------------------+-------------------+
19 | //| country| id| place| text| user|
20 | //+--------------------+------------------+-----------------+--------------------+-------------------+
21 | //| India|572692378957430785| Orissa|@always_nidhi @Yo...| Srkian_nishu :)|
22 | //| United States|572575240615796737| Manhattan|@OnlyDancers Bell...| TagineDiningGlobal|
23 | //| United States|572575243883036672| Claremont|1/ "Without the a...| Daniel Beer|
24 | }
25 |
26 | test("should print the schema"){
27 | DataFrameOnTweets.printSchema
28 | // you must see something like that in your console:
29 | // root
30 | // |-- country: string (nullable = true)
31 | // |-- id: string (nullable = true)
32 | // |-- place: string (nullable = true)
33 | // |-- text:string(nullable = true)
34 | // |-- user: string (nullable = true)
35 | }
36 |
37 | test("should group the tweets by location") {
38 | val data = DataFrameOnTweets.filterByLocation
39 | data.count should be (329)
40 | }
41 |
42 | test("should return the most popular twitterer") {
43 | val populars = DataFrameOnTweets.mostPopularTwitterer
44 | populars should be (258, "#QuissyUpSoon")
45 | }
46 |
47 | }
48 |
--------------------------------------------------------------------------------
/src/test/scala/com/duchessfr/spark/streaming/StreamingOnTweetsSpec.scala:
--------------------------------------------------------------------------------
1 | package com.duchessfr.spark.streaming
2 |
3 | import org.scalatest.{Matchers, FunSuite}
4 |
5 | /**
6 | * Here are the tests to help you to implement the StreamingOnTweets class
7 | * These are not real unit tests because of the live stream context, but they can
8 | * still help you by running the functions
9 | */
10 | class StreamingOnTweetsSpec extends FunSuite with Matchers {
11 |
12 | test("should return the 10 most popular hashtag") {
13 | StreamingOnTweets.top10Hashtag
14 | // You should see something like that:
15 | // Most popular hashtag : #tlot: 1, #followme: 1,...
16 | }
17 | }
18 |
--------------------------------------------------------------------------------