├── .gitignore
├── README-2.md
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
 1 | *.class
 2 | *.log
 3 | 
 4 | # sbt specific
 5 | .cache
 6 | .history
 7 | .lib/
 8 | dist/*
 9 | target/
10 | lib_managed/
11 | src_managed/
12 | project/boot/
13 | project/plugins/project/
14 | 
15 | # Scala-IDE specific
16 | .scala_dependencies
17 | .worksheet
18 | 


--------------------------------------------------------------------------------
/README-2.md:
--------------------------------------------------------------------------------
1 | # Stress Testing your Spark Code with Terabytes of Streaming Music 
2 | 
3 | Developing an intuition for how your code scales is one of the trickiest challenges of learning distributed technologies like Spark, Hadoop, or NoSQL databases. Code that works perfectly fine with a modestly sized data set may utterly fail as you reach hundreds of GBs or TBs. This is the cruelty of building a data product: you may not realize that your platform won't scale until you've convinced tens of thousands of users to try it out.  Ironically, this is the worst time to stress test your system, and an excellent recipe for late nights and firefighting.   
4 | 
5 | To avoid this pitfall, it's critical to test any code that will run on distributed systems with large data sets.  More so, this data set should roughly mimic your use case and distribution of users.  Your testing framework should account for your edge cases and data skew - everything from a surge of Uber requests around 2AM, to the tweet from Justin Bieber that gets 80 million retweets.  
6 | 
7 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Generating Terabytes of Streaming Music Data with Eventsim 
  2 | 
  3 | One of the difficult parts of learning distributed systems like Spark, Hadoop, and NoSQL databases is finding a data set to work with that's both large and intriguing.  It should be big enough to test the limits of your system (e.g. it shouldn't all fit in your laptop's main memory), yet rich enough for challenging queries that go beyond cleaning, ETL, and word counts. There are a [number of data sets available](https://aws.amazon.com/public-data-sets/), but finding one that matches the type of application you want to build can be frustrating. 
  4 | 
  5 | If you want to test out a low-latency application, there are even fewer options for streaming data, and you don't want to spend time worrying about API rate limits or web scrapers. One solution is to use a script to simulate streaming from a static source of sample data, but then you'll have to spend some time figuring out how to replace the timestamps with relevant times, and also building in logic to account for cycles if you run out of static data.
  6 | 
  7 | If you're already going to write a streaming data generator, it may be easier to generate the data from scratch in the first place.  However, you would need to write some logic to give you realistic data based off your use case, which could be non-trivial for even a simple scenario.  For example, generating data for something like a streaming music site (e.g. Spotify or Pandora) would require tracking the state of users, pages, login sessions, artists, songs, and ads, not to mention the relationships between artists and songs.
  8 | 
  9 | Fortunately, [Interana](https://www.interana.com/introducing-eventism-the-demo-event-data-generator/) released [Eventsim](https://github.com/Interana/eventsim), a nifty tool for simulating exactly these types of user events. To get started with Eventsim, I'll walk through how you can quickly and easily simulate millions of users for a streaming music site like Spotify.
 10 | 
 11 | ## Quick Cluster Set Up
 12 | 
 13 | To work through a quick example of Eventsim, I'm using our cloud deployment tool, [Pegasus](https://github.com/InsightDataScience/pegasus), to spin up a 4-node Spark/Hadoop cluster on AWS in less than 5 minutes.  Pegasus simply uses the AWS CLI and Bash scripts, and you can easily [set it up with a few steps](http://insightdataengineering.com/blog/pegasus/).  Specifically, I used the following instance configuration for my master
 14 | 
 15 | **examples/eventsim/master.yml**
 16 | 
 17 | 	purchase_type: on_demand
 18 | 	subnet_id: subnet-d43cb8b0
 19 | 	num_instances: 1
 20 | 	key_name: david-drummond
 21 | 	security_group_ids: sg-faf51d9c
 22 | 	instance_type: m4.large
 23 | 	tag_name: davids-eventsim-cluster
 24 | 	vol_size: 50
 25 | 	role: master
 26 | 	
 27 | which spins up an on-demand m4.large with 50 GBs of standard EBS storage, and similarly used
 28 | 
 29 | **examples/eventsim/workers.yml**
 30 | 
 31 | 	purchase_type: spot
 32 | 	subnet_id: subnet-d43cb8b0
 33 | 	price: 0.13
 34 | 	num_instances: 3
 35 | 	key_name: david-drummond
 36 | 	security_group_ids: sg-faf51d9c
 37 | 	instance_type: m4.large
 38 | 	tag_name: davids-eventsim-cluster
 39 | 	vol_size: 2000
 40 | 	vol_type: gp2
 41 | 	role: worker
 42 | 
 43 | to spin up 3 spot instances with 2 TBs of general purpose storage (gp2) for each worker.  Installing and starting Hadoop and Spark is as simple as running the following script from the top-level pegasus directory:
 44 | 
 45 | **examples/eventsim/spark_hadoop.sh**
 46 | 
 47 | 	PEG_ROOT=$(dirname ${BASH_SOURCE})/../..
 48 | 	
 49 | 	CLUSTER_NAME=davids-eventsim-cluster
 50 | 	
 51 | 	peg up ${PEG_ROOT}/examples/eventsim/master.yml &
 52 | 	peg up ${PEG_ROOT}/examples/eventsim/workers.yml &
 53 | 	
 54 | 	wait
 55 | 	
 56 | 	peg fetch ${CLUSTER_NAME}
 57 | 	
 58 | 	peg install ${CLUSTER_NAME} ssh
 59 | 	peg install ${CLUSTER_NAME} aws
 60 | 	peg install ${CLUSTER_NAME} hadoop
 61 | 	peg install ${CLUSTER_NAME} spark
 62 | 	
 63 | 	wait
 64 | 	
 65 | 	peg service ${CLUSTER_NAME} hadoop start
 66 | 	peg service ${CLUSTER_NAME} spark start
 67 | 
 68 | ## Installing Eventsim
 69 | 
 70 | Eventsim is incredibly easy to install, either locally or remotely.  To avoid any dependency issues, I ssh'ed into one of my datanodes (i.e. workers) with
 71 | 
 72 | 	peg ssh davids-eventsim-cluster 2
 73 | 	
 74 | then cloned the eventsim repo, and ran SBT from within the repo
 75 | 
 76 | 	git clone https://github.com/Interana/eventsim.git
 77 | 	cd eventsim
 78 | 	sbt assembly
 79 | 	
 80 | Note that Eventsim requires Java 8 and Scala, but Pegasus uses an AMI with these dependendencies pre-packaged by default.  Once SBT has done its job, give the eventsim binary execution permission
 81 | 
 82 | 	chmod +x bin/eventsim
 83 | 
 84 | and you're ready to go!
 85 | 
 86 | ## Getting a Sense of the Data
 87 | 
 88 | Let's start by generating a small sample of data for a single user, starting from 7 days ago:
 89 | 
 90 | 	bin/eventsim --nusers 1 --from 7 -c configs/Accordion-config.json  data/one-user.json
 91 | 	
 92 | Here's a sample of the output in `data/one_user.json`:
 93 | 
 94 | 	{"ts":1466705356324,"userId":"2","sessionId":5,"page":"NextSong","auth":"Logged In","method":"PUT","status":200,"level":"free","itemInSession":5,"location":"Riverside-San Bernardino-Ontario, CA","userAgent":"\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36\"","lastName":"Keith","firstName":"Fanny","registration":1466495311324,"gender":"F","artist":"Todd Barry","song":"Sugar Ray (LP Version)","length":126.82404}
 95 | 	{"ts":1466705454324,"userId":"2","sessionId":5,"page":"Roll Advert","auth":"Logged In","method":"GET","status":200,"level":"free","itemInSession":6,"location":"Riverside-San Bernardino-Ontario, CA","userAgent":"\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36\"","lastName":"Keith","firstName":"Fanny","registration":1466495311324,"gender":"F"}
 96 | 	{"ts":1466705482324,"userId":"2","sessionId":5,"page":"NextSong","auth":"Logged In","method":"PUT","status":200,"level":"free","itemInSession":7,"location":"Riverside-San Bernardino-Ontario, CA","userAgent":"\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36\"","lastName":"Keith","firstName":"Fanny","registration":1466495311324,"gender":"F","artist":"Justin Bieber","song":"Somebody To Love","length":220.89098}
 97 | 	{"ts":1466705702324,"userId":"2","sessionId":5,"page":"NextSong","auth":"Logged In","method":"PUT","status":200,"level":"free","itemInSession":8,"location":"Riverside-San Bernardino-Ontario, CA","userAgent":"\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36\"","lastName":"Keith","firstName":"Fanny","registration":1466495311324,"gender":"F","artist":"Sheena Easton","song":"Strut (1993 Digital Remaster)","length":239.62077}
 98 | 	
 99 | This is the activity for a female user from Riverside, CA named Fanny Keith, and we can follow the data as she clicks from songs by Justin Bieber to Sheena Easton. There are sessions that accurately correspond to login and logout events, and it also reflects that she is using the free version with more advertising, but may eventually upgrade to the paid version. Eventsim also includes information on the machine, browser, and the HTTP requests.
100 | 
101 | The timestamps are measured in milliseconds since the Unix Epoch, and accurately reflect the time at which the Eventsim was ran. The duration between events is pseudo-random, except for the duration for `NextSong` events, which is separated by the length of the previous song. 
102 | 
103 | All of the song and artist information comes from the [Million Song Dataset](http://labrosa.ee.columbia.edu/millionsong/), while the names and locations are sampled from U.S. Census and Social Security data sets.
104 | 
105 | ### Custom Configurations
106 | 
107 | The transitions from one state to the next are controlled by the configuration file, which contains entries for each pair of source and destination page, and corresponding probabilities.  For example `Accordion-config.json` contains the following entries for the pages that a logged-in, paid user could visit after seeing an ad:
108 | 
109 | 	{"source":{"page":"Roll Advert","method":"GET","status":200,"auth":"Logged In","level":"paid"},"dest":{"page":"Downgrade","method":"GET","status":200,"auth":"Logged In","level":"paid"},"p":0.05},
110 |     {"source":{"page":"Roll Advert","method":"GET","status":200,"auth":"Logged In","level":"paid"},"dest":{"page":"Roll Advert","method":"GET","status":200,"auth":"Logged In","level":"paid"},"p":0.02},
111 |     {"source":{"page":"Roll Advert","method":"GET","status":200,"auth":"Logged In","level":"paid"},"dest":{"page":"NextSong","method":"PUT","status":200,"auth":"Logged In","level":"paid"},"p":0.8},
112 |     {"source":{"page":"Roll Advert","method":"GET","status":200,"auth":"Logged In","level":"paid"},"dest":{"page":"Cancel","method":"PUT","status":307,"auth":"Logged In","level":"paid"},"p":0.005},
113 |     {"source":{"page":"Roll Advert","method":"GET","status":200,"auth":"Logged In","level":"paid"},"dest":{"page":"Logout","method":"PUT","status":307,"auth":"Logged In","level":"paid"},"p":0.1},
114 |     {"source":{"page":"Roll Advert","method":"GET","status":200,"auth":"Logged In","level":"paid"},"dest":{"page":"Error","method":"GET","status":404,"auth":"Logged In","level":"paid"},"p":0.001},
115 |     
116 | meaning that these users have a
117 | 
118 | * 80% chance of playing a song
119 | * 10% chance of logging out
120 | * 5% chance of downgrading to a free membership
121 | * 2% chance of seeing yet another ad
122 | * 0.5% chance of completely cancelling their membership
123 | * 0.1% chance of having an error on the site
124 | * 2.4% chance that the session ends
125 | 
126 | It also has daily and weekly damping effects that simulate reduced activity on nights and weekends with the following options:
127 | 
128 | 	"damping" : 0.09375,
129 | 	"weekend-damping-offset" : 180,
130 | 	"weekend-damping-scale" : 360,
131 | 	"weekend-damping" : 0.50
132 | 
133 | which means that user activtiy dips sinusoidally, with the minimum 9.375% lower than normal, 180 minutes after midnight,  and the dip lasts roughly  
134 | 
135 | These configurations enable you to customize the business logic of your expected user activity to a granular level. Eventsim also comes with several convenient pre-built configurations like:
136 | 
137 | * `Cello-config.json`: Increased weekend-dampening, meaning
138 | that users play somewhat less music than normal on the weekends.
139 | * `Nagara-config.json`: Much higher
140 | advertising rates leading to more downgrades and cancellations.
141 | * `Whistle-config.json`: Happy users with a higher probability of
142 | ThumbsUp events, leading to a higher probability of Upgrade events.
143 | 
144 | You can learn more about how Eventsim runs the simulations by looking at the main [README](https://github.com/Interana/eventsim/blob/master/README.md), and you can view every pre-built configuration in the [config/README](https://github.com/Interana/eventsim/blob/master/configs/README.md).
145 | 
146 | ### More Users
147 | 
148 | ## Ramping up the Data
149 | hdfs dfs -put - hdfs://nn.example.com/hadoop/hadoopfile
150 | 
151 | ### Configuring the JVM Settings
152 | 
153 | discuss -XX options, GC, and heap size


--------------------------------------------------------------------------------