├── .gitignore ├── README-2.md └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | *.class 2 | *.log 3 | 4 | # sbt specific 5 | .cache 6 | .history 7 | .lib/ 8 | dist/* 9 | target/ 10 | lib_managed/ 11 | src_managed/ 12 | project/boot/ 13 | project/plugins/project/ 14 | 15 | # Scala-IDE specific 16 | .scala_dependencies 17 | .worksheet 18 | -------------------------------------------------------------------------------- /README-2.md: -------------------------------------------------------------------------------- 1 | # Stress Testing your Spark Code with Terabytes of Streaming Music 2 | 3 | Developing an intuition for how your code scales is one of the trickiest challenges of learning distributed technologies like Spark, Hadoop, or NoSQL databases. Code that works perfectly fine with a modestly sized data set may utterly fail as you reach hundreds of GBs or TBs. This is the cruelty of building a data product: you may not realize that your platform won't scale until you've convinced tens of thousands of users to try it out. Ironically, this is the worst time to stress test your system, and an excellent recipe for late nights and firefighting. 4 | 5 | To avoid this pitfall, it's critical to test any code that will run on distributed systems with large data sets. Moreover, the data set should roughly mimic your use case and distribution of users. Your testing framework should account for your edge cases and data skew: everything from a surge of Uber requests around 2 AM to a tweet from Justin Bieber that gets 80 million retweets. 
6 | 7 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Generating Terabytes of Streaming Music Data with Eventsim 2 | 3 | One of the difficult parts of learning distributed systems like Spark, Hadoop, and NoSQL databases is finding a data set to work with that's both large and intriguing. It should be big enough to test the limits of your system (e.g. it shouldn't all fit in your laptop's main memory), yet rich enough for challenging queries that go beyond cleaning, ETL, and word counts. There are a [number of data sets available](https://aws.amazon.com/public-data-sets/), but finding one that matches the type of application you want to build can be frustrating. 4 | 5 | If you want to test out a low-latency application, there are even fewer options for streaming data, and you don't want to spend time worrying about API rate limits or web scrapers. One solution is to use a script to simulate streaming from a static source of sample data, but then you'll have to spend time figuring out how to replace the timestamps with relevant times, as well as building in logic to cycle back through the data if you run out. 6 | 7 | If you're already going to write a streaming data generator, it may be easier to generate the data from scratch in the first place. However, you would need to write some logic to produce realistic data based on your use case, which could be non-trivial even for a simple scenario. For example, generating data for something like a streaming music site (e.g. Spotify or Pandora) would require tracking the state of users, pages, login sessions, artists, songs, and ads, not to mention the relationships between artists and songs. 
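To see why even this "simple" scenario gets complicated, here is a minimal sketch of the kind of per-user state machine such a generator needs. The pages and transition probabilities below are invented for illustration; this is not how any particular tool works internally.

```python
import random

# Invented transition table for illustration: from each page, the next page
# is drawn according to fixed probabilities (these numbers are made up).
TRANSITIONS = {
    "Home":        [("NextSong", 0.7), ("Logout", 0.3)],
    "NextSong":    [("NextSong", 0.8), ("Roll Advert", 0.1), ("Logout", 0.1)],
    "Roll Advert": [("NextSong", 0.9), ("Logout", 0.1)],
}

def simulate_session(rng, start="Home", max_events=20):
    """Walk the page-transition chain until the user logs out."""
    page, events = start, []
    for _ in range(max_events):
        events.append(page)
        if page == "Logout":
            break
        # Pick the next page with the weights defined above.
        pages, weights = zip(*TRANSITIONS[page])
        page = rng.choices(pages, weights=weights)[0]
    return events

session = simulate_session(random.Random(42))
print(session)
```

Even this toy version ignores timestamps, login state, free-versus-paid tiers, and song metadata; layering all of that in is exactly the work a purpose-built simulator saves you.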
8 | 9 | Fortunately, [Interana](https://www.interana.com/introducing-eventism-the-demo-event-data-generator/) released [Eventsim](https://github.com/Interana/eventsim), a nifty tool for simulating exactly these types of user events. To get you started with Eventsim, I'll walk through how you can quickly and easily simulate millions of users for a streaming music site like Spotify. 10 | 11 | ## Quick Cluster Setup 12 | 13 | To work through a quick example of Eventsim, I'm using our cloud deployment tool, [Pegasus](https://github.com/InsightDataScience/pegasus), to spin up a 4-node Spark/Hadoop cluster on AWS in less than 5 minutes. Pegasus simply uses the AWS CLI and Bash scripts, and you can easily [set it up with a few steps](http://insightdataengineering.com/blog/pegasus/). Specifically, I used the following instance configuration for my master: 14 | 15 | **examples/eventsim/master.yml** 16 | 17 | purchase_type: on_demand 18 | subnet_id: subnet-d43cb8b0 19 | num_instances: 1 20 | key_name: david-drummond 21 | security_group_ids: sg-faf51d9c 22 | instance_type: m4.large 23 | tag_name: davids-eventsim-cluster 24 | vol_size: 50 25 | role: master 26 | 27 | which spins up an on-demand m4.large with 50 GB of standard EBS storage. Similarly, I used: 28 | 29 | **examples/eventsim/workers.yml** 30 | 31 | purchase_type: spot 32 | subnet_id: subnet-d43cb8b0 33 | price: 0.13 34 | num_instances: 3 35 | key_name: david-drummond 36 | security_group_ids: sg-faf51d9c 37 | instance_type: m4.large 38 | tag_name: davids-eventsim-cluster 39 | vol_size: 2000 40 | vol_type: gp2 41 | role: worker 42 | 43 | to spin up 3 spot instances with 2 TB of general purpose (gp2) storage for each worker. Installing and starting Hadoop and Spark is as simple as running the following script from the top-level pegasus directory: 44 | 45 | **examples/eventsim/spark_hadoop.sh** 46 | 47 | PEG_ROOT=$(dirname ${BASH_SOURCE})/../.. 
48 | 49 | CLUSTER_NAME=davids-eventsim-cluster 50 | 51 | peg up ${PEG_ROOT}/examples/eventsim/master.yml & 52 | peg up ${PEG_ROOT}/examples/eventsim/workers.yml & 53 | 54 | wait 55 | 56 | peg fetch ${CLUSTER_NAME} 57 | 58 | peg install ${CLUSTER_NAME} ssh 59 | peg install ${CLUSTER_NAME} aws 60 | peg install ${CLUSTER_NAME} hadoop 61 | peg install ${CLUSTER_NAME} spark 62 | 63 | wait 64 | 65 | peg service ${CLUSTER_NAME} hadoop start 66 | peg service ${CLUSTER_NAME} spark start 67 | 68 | ## Installing Eventsim 69 | 70 | Eventsim is incredibly easy to install, either locally or remotely. To avoid any dependency issues, I ssh'ed into one of my datanodes (i.e. workers) with: 71 | 72 | peg ssh davids-eventsim-cluster 2 73 | 74 | then cloned the eventsim repo and ran SBT from within it: 75 | 76 | git clone https://github.com/Interana/eventsim.git 77 | cd eventsim 78 | sbt assembly 79 | 80 | Note that Eventsim requires Java 8 and Scala, but Pegasus uses an AMI with these dependencies pre-packaged by default. Once SBT has done its job, give the eventsim binary execute permission: 81 | 82 | chmod +x bin/eventsim 83 | 84 | and you're ready to go! 
85 | 86 | ## Getting a Sense of the Data 87 | 88 | Let's start by generating a small sample of data for a single user, starting from 7 days ago: 89 | 90 | bin/eventsim --nusers 1 --from 7 -c configs/Accordion-config.json data/one-user.json 91 | 92 | Here's a sample of the output in `data/one-user.json`: 93 | 94 | {"ts":1466705356324,"userId":"2","sessionId":5,"page":"NextSong","auth":"Logged In","method":"PUT","status":200,"level":"free","itemInSession":5,"location":"Riverside-San Bernardino-Ontario, CA","userAgent":"\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36\"","lastName":"Keith","firstName":"Fanny","registration":1466495311324,"gender":"F","artist":"Todd Barry","song":"Sugar Ray (LP Version)","length":126.82404} 95 | {"ts":1466705454324,"userId":"2","sessionId":5,"page":"Roll Advert","auth":"Logged In","method":"GET","status":200,"level":"free","itemInSession":6,"location":"Riverside-San Bernardino-Ontario, CA","userAgent":"\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36\"","lastName":"Keith","firstName":"Fanny","registration":1466495311324,"gender":"F"} 96 | {"ts":1466705482324,"userId":"2","sessionId":5,"page":"NextSong","auth":"Logged In","method":"PUT","status":200,"level":"free","itemInSession":7,"location":"Riverside-San Bernardino-Ontario, CA","userAgent":"\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36\"","lastName":"Keith","firstName":"Fanny","registration":1466495311324,"gender":"F","artist":"Justin Bieber","song":"Somebody To Love","length":220.89098} 97 | {"ts":1466705702324,"userId":"2","sessionId":5,"page":"NextSong","auth":"Logged In","method":"PUT","status":200,"level":"free","itemInSession":8,"location":"Riverside-San Bernardino-Ontario, CA","userAgent":"\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36\"","lastName":"Keith","firstName":"Fanny","registration":1466495311324,"gender":"F","artist":"Sheena Easton","song":"Strut (1993 Digital Remaster)","length":239.62077} 98 | 99 | This is the activity for a female user from Riverside, CA named Fanny Keith, and we can follow the data as she clicks through songs by Todd Barry, Justin Bieber, and Sheena Easton. Sessions accurately correspond to login and logout events, and the data reflects that she is using the free version with more advertising, though she may eventually upgrade to the paid version. Eventsim also includes information on the machine, browser, and HTTP requests. 100 | 101 | The timestamps are measured in milliseconds since the Unix Epoch, and accurately reflect the time at which Eventsim was run. The duration between events is pseudo-random, except for `NextSong` events, which are separated by the length of the song being played. 102 | 103 | All of the song and artist information comes from the [Million Song Dataset](http://labrosa.ee.columbia.edu/millionsong/), while the names and locations are sampled from U.S. Census and Social Security data sets. 104 | 105 | ### Custom Configurations 106 | 107 | The transitions from one state to the next are controlled by the configuration file, which contains entries for each pair of source and destination pages, along with the corresponding transition probabilities. 
For example, `Accordion-config.json` contains the following entries for the pages that a logged-in, paid user could visit after seeing an ad: 108 | 109 | {"source":{"page":"Roll Advert","method":"GET","status":200,"auth":"Logged In","level":"paid"},"dest":{"page":"Downgrade","method":"GET","status":200,"auth":"Logged In","level":"paid"},"p":0.05}, 110 | {"source":{"page":"Roll Advert","method":"GET","status":200,"auth":"Logged In","level":"paid"},"dest":{"page":"Roll Advert","method":"GET","status":200,"auth":"Logged In","level":"paid"},"p":0.02}, 111 | {"source":{"page":"Roll Advert","method":"GET","status":200,"auth":"Logged In","level":"paid"},"dest":{"page":"NextSong","method":"PUT","status":200,"auth":"Logged In","level":"paid"},"p":0.8}, 112 | {"source":{"page":"Roll Advert","method":"GET","status":200,"auth":"Logged In","level":"paid"},"dest":{"page":"Cancel","method":"PUT","status":307,"auth":"Logged In","level":"paid"},"p":0.005}, 113 | {"source":{"page":"Roll Advert","method":"GET","status":200,"auth":"Logged In","level":"paid"},"dest":{"page":"Logout","method":"PUT","status":307,"auth":"Logged In","level":"paid"},"p":0.1}, 114 | {"source":{"page":"Roll Advert","method":"GET","status":200,"auth":"Logged In","level":"paid"},"dest":{"page":"Error","method":"GET","status":404,"auth":"Logged In","level":"paid"},"p":0.001}, 115 | 116 | meaning that these users have a 117 | 118 | * 80% chance of playing a song 119 | * 10% chance of logging out 120 | * 5% chance of downgrading to a free membership 121 | * 2% chance of seeing yet another ad 122 | * 0.5% chance of completely cancelling their membership 123 | * 0.1% chance of having an error on the site 124 | * 2.4% chance that the session ends 125 | 126 | The configuration file also has daily and weekly damping effects that simulate reduced activity on nights and weekends with the following options: 127 | 128 | "damping" : 0.09375, 129 | "weekend-damping-offset" : 180, 130 | "weekend-damping-scale" : 360, 131 | "weekend-damping" : 0.50 132 | 133 | which means that daily user activity dips sinusoidally, with a minimum 9.375% lower than normal, while weekend activity is damped by an additional 50%, with the weekend dip offset 180 minutes from midnight and lasting roughly 360 minutes. 134 | 135 | These configurations enable you to customize the business logic of your expected user activity at a granular level. Eventsim also comes with several convenient pre-built configurations, such as: 136 | 137 | * `Cello-config.json`: Increased weekend-damping, meaning that users play somewhat less music than normal on the weekends. 138 | * `Nagara-config.json`: Much higher advertising rates, leading to more downgrades and cancellations. 139 | * `Whistle-config.json`: Happy users with a higher probability of ThumbsUp events, leading to a higher probability of Upgrade events. 140 | 141 | You can learn more about how Eventsim runs the simulations by looking at the main [README](https://github.com/Interana/eventsim/blob/master/README.md), and you can view every pre-built configuration in the [configs/README](https://github.com/Interana/eventsim/blob/master/configs/README.md). 142 | 143 | ### More Users 144 | 145 | To simulate a larger user base, simply increase the `--nusers` flag from the single-user example above. 146 | 147 | ## Ramping up the Data 148 | 149 | Once you're generating data at scale, you can load it into HDFS; `hdfs dfs -put` reads from stdin when you pass `-` as the source, so you can stream the generator's output straight into the cluster: 150 | 151 | hdfs dfs -put - hdfs://nn.example.com/hadoop/hadoopfile 152 | 153 | ### Configuring the JVM Settings 154 | 155 | For very large simulations, it's also worth tuning the JVM's -XX options, garbage collection settings, and heap size when running Eventsim. --------------------------------------------------------------------------------