├── 4-bigdata-riddles.md
├── LICENSE
├── ReadMe.md
├── __init__.py
├── analytics
│   ├── __init__.py
│   ├── concurrency_profile.py
│   ├── data
│   │   ├── ProfileFatso
│   │   │   ├── CpuAndMemoryFatso.json.gz
│   │   │   ├── JobFatso.log.gz
│   │   │   └── StacktraceFatso.json.gz
│   │   ├── ProfileStraggler
│   │   │   ├── CpuAndMemory.json.gz
│   │   │   ├── JobStraggler.log.gz
│   │   │   └── Stacktrace.json.gz
│   │   ├── __init__.py
│   │   ├── profile_fatso
│   │   │   ├── CombinedProfile.json.gz
│   │   │   ├── CpuAndMemory.json.gz
│   │   │   ├── Stacktrace.json.gz
│   │   │   ├── s_8_7510_cpumem.json.gz
│   │   │   ├── s_8_7511_cpumem.json.gz
│   │   │   ├── s_8_7512_cpumem.json.gz
│   │   │   └── s_8_stack.json.gz
│   │   ├── profile_slacker
│   │   │   ├── CombinedCpuAndMemory.json.gz
│   │   │   ├── CombinedStack.folded.gz
│   │   │   ├── CombinedStack.json.gz
│   │   │   ├── CombinedStack.svg.gz
│   │   │   ├── CpuAndMemory.json.gz
│   │   │   ├── JobSlacker.log.gz
│   │   │   ├── Stacktrace.json.gz
│   │   │   ├── s_8_931_cpumem.json.gz
│   │   │   ├── s_8_932_cpumem.json.gz
│   │   │   ├── s_8_933_cpumem.json.gz
│   │   │   └── s_8_stack.json.gz
│   │   └── profile_straggler
│   │       ├── CpuAndMemory.json.gz
│   │       ├── ProcessInfo.json.gz
│   │       ├── Stacktrace.json.gz
│   │       ├── Straggler_PySpark.log.gz
│   │       ├── s_1_1401_cpumem.json.gz
│   │       ├── s_1_1402_cpumem.json.gz
│   │       ├── s_1_1403_cpumem.json.gz
│   │       └── s_1_stack.json.gz
│   ├── extract_everything.py
│   ├── extract_heckler.py
│   ├── plot_application.py
│   ├── plot_fatso.py
│   ├── plot_slacker.py
│   └── plot_straggler.py
├── fold_stacks.py
├── helper.py
├── parsers.py
├── pyspark_profilers.py
└── spark_jobs
    ├── __init__.py
    ├── job_fatso.py
    ├── job_heckler.py
    ├── job_slacker.py
    └── job_straggler.py
/4-bigdata-riddles.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | ---
4 |
5 | Made in London
6 |
7 | # 4 BigData Riddles: The Straggler, The Slacker, The Fatso & The Heckler
8 |
9 |
10 | This article discusses four bottlenecks of BigData applications and introduces a number of tools, some of them new, for identifying and solving them.
11 |
12 | The riddles occur in the context of one particular framework, Apache Spark, and exist as both Scala and Python (PySpark) Spark applications.
13 |
14 | These applications have something in common: They require around 10 minutes _wall clock time_ when a "local version" of them is run on a commodity notebook. The easiest remedy -- using more processors/machines or more powerful machines -- would not significantly reduce the execution time.
15 |
16 | But there are also differences between the applications: Each one has a different kind of bottleneck that is responsible for its slowness, and each of these bottlenecks can be identified with a different tool. Some of these tools -- among them flame graphs, resource profiles and log files -- and the methods for combining them are not widely known or do not exist in this form yet; they will be presented in the second half of this article along with complete source code and additional explanations for all riddles.
17 |
18 | I will treat the riddles as a kind of black box first and describe them and their "solutions" with the help of custom scripts and programs; the code and the methods used for identifying each bottleneck are then introduced in more detail in the second part of this article.
36 | ## The Fatso
37 |
38 | The _Fatso_ is a memory-hungry application, a kind of bottleneck that might be inherent to BigData.
39 |
40 |
41 | ### Spark JVM Profile
42 |
43 | The fatso occurs very often in the BigData world. A typical symptom is noise, which here takes the form of constant garbage collection activity. The job discussed in this section qualifies as a fatso, and this can be shown with my project introduced below using just a few lines of code:
49 |
50 | ```python
51 | from parsers import ProfileParser, SparkLogParser  # parser classes from this project
52 | from helper import get_max_y                        # y-axis helper (module location assumed)
53 | from plotly.graph_objs import Figure
54 | from plotly.offline import plot
55 |
56 | profile_file = './data/ProfileFatso/CpuAndMemoryFatso.json.gz'  # Output from JVM profiler
57 | profile_parser = ProfileParser(profile_file, normalize=True)
58 | data_points = profile_parser.make_graph()
59 |
60 | logfile = './data/ProfileFatso/JobFatso.log.gz'  # Normal Spark log of the same run
61 | log_parser = SparkLogParser(logfile)
62 | stage_markers = log_parser.extract_stage_markers()
63 | data_points.append(stage_markers)
64 |
65 | max_y = get_max_y(data_points)
66 | layout = log_parser.extract_job_markers(max_y)
67 | fig = Figure(data=data_points, layout=layout)
68 | plot(fig, filename='fatso.html')
69 | ```
66 |
67 | This code (full version in [plot_fatso.py](https://github.com/g1thubhub/phil_stopwatch/blob/master/analytics/plot_fatso.py)) combines two different information sources (the output of a profiler and normal Spark logs) into a single visualization. This interactive graph can be analyzed in its full glory [here](http://127.0.0.1:4000/fatso.html). A smaller snapshot is displayed below:
68 |
69 |
70 | 
71 |
72 | Spark's execution model consists of different units of different "granularity levels" and some of these are displayed above: Boundaries of Spark jobs are represented as vertical dashed lines; start and end points of Spark stages are displayed as transparent blue dots on the x-axis, which also show the full stage IDs. This scheduling information does not add a lot of insight here since _Fatso_ consists of only one Spark job which in turn consists of just a single Spark stage (comprised of three tasks) but, as shown below, knowing such time points can be very helpful when analyzing more complex applications.
73 |
74 | For all graphs in this article, the x-axis shows the application run time as UNIX epoch time (milliseconds passed since 1 January 1970). The y-axis represents different normalized units for different metrics: For lines representing memory metrics such as _total heap memory used_ ("heapMemoryTotalUsed", turquoise line above), it represents gigabytes; for time measurements like _MarkSweep GC collection time_ ("MarkSweepCollTime", orange line above), data points on the y-axis represent milliseconds. More details can be found in the underlying data structure in the project code, which can be changed or augmented with new metrics from different profilers.
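
To give an idea of what such a mapping could look like, here is a minimal sketch in Python. The names and conversion factors are illustrative, not the project's actual data structure:

```python
# Illustrative normalization mapping: each metric name is associated with a
# conversion function so that all lines fit into the same normalized y-axis.
normalizers = {
    'heapMemoryTotalUsed': lambda value: value / (1024 ** 3),  # bytes -> gigabytes
    'heapMemoryCommitted': lambda value: value / (1024 ** 3),  # bytes -> gigabytes
    'MarkSweepCollTime':   lambda value: value,                # already milliseconds
    'ScavengeCollCount':   lambda value: value,                # plain event counter
}

def normalize(metric_name, value):
    # Unknown metrics are passed through unchanged
    return normalizers.get(metric_name, lambda v: v)(value)
```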
75 |
76 |
77 | One available metric, _ScavengeCollCount_, is absent from the snapshot above but present in the original [graph](http://127.0.0.1:4000/fatso.html). It counts minor garbage collection events and increases almost linearly up to 20000 during _Fatso_'s execution. In other words, the application ran for almost 10 minutes (from XX to YY) and more than 20000 minor Garbage Collection events and almost 70 major GC events ("MarkSweepCollCount", green line) occurred.
78 |
79 | When the application was launched, no configuration parameters were manually set so the default Spark settings applied. This means that the maximum memory available to the program was 1GB. Having a closer look at the two heap memory metrics _heapMemoryCommitted_ and _heapMemoryTotalUsed_ reveals that both lines approach this 1GB ceiling near the end of the application.
80 |
81 | The intermediate conclusion that can be drawn from the discussion so far is that the application is very memory hungry and a lot of GC activity is going on, but the exact reason for this is still unclear. A second tool can help now:
82 |
83 |
84 |
85 |
86 | ### Spark JVM FlameGraph
87 |
88 | ```terminal
89 | Phils-MacBook-Pro:analytics a$ python3 fold_stacks.py ./analytics/data/ProfileFatso/StacktraceFatso.json.gz > Fatso.folded
90 | Phils-MacBook-Pro:analytics a$ perl flamegraph.pl Fatso.folded > FatsoFlame.svg
91 | ```
92 | Opening _FatsoFlame.svg_ in a browser:
93 |
94 | {% include FatsoFlame.svg %}
95 | A rule of thumb for the interpretation of flame graphs is: The spikier the shape, the better. We see many plateaus above, with native Spark/Java functions like _sun.misc.Unsafe.park_ sitting on top (first plateau) or low-level functions from packages like _io.netty_ (a third-party library that Spark depends on for network communication/IO) occurring near the top. The only functions in the picture that are defined by me are located in the center plateau and can be found by searching for the package name _profile.sparkjob_. On top of them are Array and String functions. Indeed, these methods create many String objects in an inefficient way, so we have identified the two _Fatso_ methods that need to be optimized.
96 |
97 |
98 |
99 |
100 | ### Python/PySpark Profiles
101 | What about Spark applications written in Python? I created [several profilers](https://github.com/g1thubhub/phil_stopwatch/blob/master/pyspark_profilers.py) that try to provide some of the functionality of Uber's JVM profiler. Because of the architecture of PySpark, it might be beneficial to generate both Python and JVM profiles in order to get a good grasp of the overall resource usage. This can be accomplished for the Python edition of _Fatso_ by using the following launch command (abbreviated, the full version can be found in the project repository):
102 | ```terminal
103 | ~/spark-2.4.0-bin-hadoop2.7/bin/spark-submit \
104 | --conf spark.python.profile=true \
105 | --conf spark.driver.extraJavaOptions=-javaagent:/.../=sampleInterval=1000,metricInterval=100,reporter=...outputDir=... \
106 | ./spark_jobs/job_fatso.py cpumemstack /Users/phil/phil_stopwatch/analytics/data/profile_fatso > Fatso_PySpark.dat
107 | ```
108 | As above, the *--conf* parameter in the third line is responsible for attaching the JVM profiler. The --conf parameter in the second line as well as the two script arguments in the last line are new and required for PySpark profiling: The _cpumemstack_ argument will choose a PySpark profiler that captures both CPU/memory usage as well as stack traces. By providing a second script argument in the form of a directory path, it is ensured that the profile records are written into separate output files instead of just printing all of them to the standard output.
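
Under the hood, this relies on PySpark's pluggable profiler mechanism: a custom profiler class is passed to the `SparkContext` and `spark.python.profile` must be enabled. A minimal sketch of the wiring inside a job script, where the class name `CpuMemStackProfiler` is a hypothetical stand-in for the combined profiler defined in pyspark_profilers.py:

```python
from pyspark import SparkConf, SparkContext
from pyspark_profilers import CpuMemStackProfiler  # hypothetical name of the combined profiler

conf = SparkConf().set('spark.python.profile', 'true')          # required for custom PySpark profilers
sc = SparkContext(conf=conf, profiler_cls=CpuMemStackProfiler)  # plug the profiler into this context

# ... run the actual job here ...

sc.dump_profiles('/path/to/output_dir')  # write the collected profile records to files
```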
109 |
110 |
111 | Similar to its Scala cousin, the PySpark edition of _Fatso_ completes in around 10 minutes on my MacBook and creates several JSON files in the specified output directory. The JVM profile could be visualized independently of the Python profile but it might be more insightful to create a single combined graph from them. This can be accomplished easily and is shown in the second half of the [plot_fatso.py](https://github.com/g1thubhub/phil_stopwatch/blob/master/analytics/plot_fatso.py) script. The full combined graph is located [here](http://127.0.0.1:4000/fatso-pyspark.html).
112 |
113 | 
114 |
115 | The clever reader will already have a hunch about the high memory consumption and who is responsible for it: The garbage collection activity of the JVM, again represented by _MarkSweepCollCount_ and _ScavengeCollCount_, is much lower here compared to the "pure" Spark run described in the previous paragraphs (20000 events above versus less than 20 GC events now). The two inefficient _fatso_ functions are now implemented in Python and therefore not managed by the JVM, leading to far lower JVM memory usage and far fewer GC events. A PySpark flame graph should confirm our hunch:
116 |
117 |
118 | ```terminal
119 | Phils-MacBook-Pro:analytics a$ python3 fold_stacks.py ./analytics/data/profile_fatso/s_8_stack.json > FatsoPyspark.folded
120 | Phils-MacBook-Pro:analytics a$ perl flamegraph.pl FatsoPyspark.folded > FatsoPySparkFlame.svg
121 | ```
122 | Opening _FatsoPySparkFlame.svg_ in a browser displays the following:
123 | {% include FatsoPySparkFlame.svg %}
124 |
125 | And indeed, two _fatso_ methods sit on top of the stack for almost 90% of all measurements, burning most CPU cycles.
126 | It would be easy to create a combined JVM/Python flame graph by concatenating the respective stack trace files. This would be of limited use here though since the JVM flame graph will likely consist entirely of native Java/Spark functions over which a Python coder has no control. One scenario I can think of where this merging of JVM with PySpark stack traces might be especially useful is when Java code or libraries are registered and called from PySpark/Python code, which is getting easier and easier in newer versions of Spark. In the discussion of _Slacker_ later on, I will present a combined stack trace of Python and Scala code.
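
For reference, the mechanical part of such a merge is simple: flamegraph.pl accepts any number of folded stack lines, so the two folded outputs can just be appended to each other (file names below are illustrative):

```python
# Concatenate two folded stack files into one input file for flamegraph.pl
with open('Combined.folded', 'w') as out:
    for folded_file in ('JvmStacks.folded', 'PythonStacks.folded'):
        with open(folded_file) as source:
            out.write(source.read())
# afterwards: perl flamegraph.pl Combined.folded > CombinedFlame.svg
```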
127 |
128 |
129 |
130 | ## The Straggler
131 | The _Straggler_ is deceiving: It appears as if all computational resources are fully utilized most of the time, and only closer analysis can reveal that this might be the case for only a small subset of the system or for a limited period of time. The following graph, created with [this](https://github.com/g1thubhub/phil_stopwatch/blob/master/analytics/plot_straggler.py) script, combines two CPU metrics with information about task and stage boundaries extracted from the standard logging output of a typical straggler run; the full size graph can be found [here](http://127.0.0.1:4000/straggler.html).
132 |
133 | 
134 |
135 | The associated application consisted of one Spark job, which is represented as vertical dashed lines at the left and right. This single job was comprised of a single stage, shown as transparent blue dots on the x-axis that coincide with the job start and end points. But there were three tasks within that stage, so we can see three horizontal task lines. The naming schema of this execution hierarchy is not arbitrary:
136 | - The stage name in the graph is **0.0@0** because it refers to the stage with id **0.0** that belonged to the job with id **0**. The first part of a stage or task name is a floating point number; this reflects the apparent naming convention in Spark logs that new attempts of failed tasks or stages are baptized with an incremented fraction part.
137 | - The task names are **0.0@0.0@0**, **1.0@0.0@0**, and **2.0@0.0@0** because three tasks were launched that were all members of stage **0.0@0**, which in turn belonged to job **0**.
138 |
139 | The three tasks have the same start time, which almost coincides with the application's invocation, but very different end times: Tasks *1.0@0.0@0* and *2.0@0.0@0* finish within the first fifth of the application's lifetime whereas task *0.0@0.0@0* stays alive for almost the entire run since its start and end points are located at the left and right borders of the graph.
140 | The orange and light blue lines visualize two CPU metrics (_system cpu load_ and _process cpu load_) whose fluctuations correspond with the task activity: We can observe that the CPU load drops right after tasks *1.0@0.0@0* and *2.0@0.0@0* end. It stays at around 20% for the remaining 4/5 of the time, when only straggler task *0.0@0.0@0* is running.
141 |
142 | ### Concurrency Profiles
143 | When an application consists of more than just one stage with three tasks like _Straggler_, it might be more illuminating to calculate and represent the total number of tasks that were running at any point during the application's lifetime. The "concurrency profile" of a BigData workload might look more like the following:
144 | {% include conc-profile.html %}
145 |
146 | The source code that is the basis for this graph can be found in [this](https://github.com/g1thubhub/phil_stopwatch/blob/master/analytics/plot_application.py) script. The big difference to the _Straggler_ toy example before is that real-life applications produce many different log files (one for each container/Spark executor), and only one "master" log file contains the scheduling information and task boundaries that are needed for building concurrency profiles. The script uses an `AppParser` [class](https://github.com/g1thubhub/phil_stopwatch/blob/28c174f25a1cfab5003998656e4c0ae96fb95384/parsers.py#L687) that handles this automatically by creating a list of `LogParser` objects (one for each container) and then parsing them to determine the master log.
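
For illustration, the core of a concurrency profile can be computed from the extracted task intervals with a simple sweep over start and end events. The following is a minimal sketch, not the actual implementation in concurrency_profile.py:

```python
from typing import List, Tuple

def concurrency_profile(task_intervals: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Turn (start, end) timestamps of tasks into (timestamp, active task count) pairs."""
    events = []
    for start, end in task_intervals:
        events.append((start, 1))    # a task becomes active
        events.append((end, -1))     # a task finishes
    profile, active = [], 0
    for timestamp, delta in sorted(events):
        active += delta
        profile.append((timestamp, active))
    return profile

# Toy example: three tasks, one of them a straggler
print(concurrency_profile([(0, 100), (0, 20), (0, 22)]))
# [(0, 1), (0, 2), (0, 3), (20, 2), (22, 1), (100, 0)]
```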
147 |
148 | Just by looking at this concurrency profile, we can attempt a back-of-the-envelope calculation for increasing the efficiency of the application: If the multiple peaks of ~80 active tasks indicate that only around 80 physical CPU cores were really used, we can hypothesize that the application was "overallocated" by at least 20 CPU cores, or 4 to 7 Spark executors, or one to three nodes, as Spark executors are often configured to use 3 to 5 physical CPU cores. Reducing the machines reserved for this application should not increase its execution time but it will give more resources to other users in a shared cluster setting or save some $$ in a cloud environment.
149 |
150 |
151 | ### A Fratso
152 |
153 | What about the specs of the actual compute nodes used? The memory profile for a similar app, created via [this](https://github.com/g1thubhub/phil_stopwatch/blob/28c174f25a1cfab5003998656e4c0ae96fb95384/analytics/plot_application.py#L27) code segment, is chaotic yet illuminating since more than 50 Spark executors/containers were launched by the application and each one left its mark in the graph in the form of a memory metric line (original located [here](http://127.0.0.1:4000/bigjob-memory.html)):
154 | 
155 |
156 | The peak heap memory used is a little more than 10GB; one executor crosses this 10GB line twice (top right) while most other executors use 8-9 GB or less. Removing the memory usage from the picture and displaying scheduling information like task boundaries instead results in the following graph:
157 |
158 | 
159 |
160 | The application launches several small Spark jobs initially, as indicated by the multiple dashed lines near the left border. However, more than 90% of the total execution time is consumed by a single big job which has the ID _8_. A closer look at the blue dots on the x-axis that represent boundaries of Spark stages reveals that there are two longer stages within job _8_. During the first stage, there are four task waves without stragglers -- concurrent tasks that together look like solid blue rectangles when visualized this way. The second stage of job _8_ does have a straggler task, as there is one horizontal blue task line that is active for much longer than its "neighbour" tasks. Looking back at the memory graph of this application, it is likely that this straggler task is also responsible for the heap memory peak of >10GB that we discovered. We might have identified a "fratso" here (a straggling fatso) and this task/stage should definitely be analyzed in more detail when improving the associated application.
161 |
162 |
163 | The script that generated all three previous plots can be found [here](https://github.com/g1thubhub/phil_stopwatch/blob/master/analytics/plot_application.py).
164 |
165 |
166 |
167 | ## The Heckler: CoreNLP & spaCy
168 |
169 | Applying NLP or machine learning methods often involves the use of third-party libraries which in turn create quite memory-intensive objects. There are several different ways of constructing such heavy classifiers in Spark so that each task can access them, and the first version of the _Heckler_ code that is the topic of this section will do that in the worst possible way. I am not aware of a metric currently exposed by Spark that could directly show such inefficiencies; something similar to a measure of network transfer from master to executors would be required for one case below. The identification of this bottleneck must therefore happen indirectly, by applying some more sophisticated string matching and collapsing logic to Spark's standard logs:
170 |
171 | ```python
172 | from typing import List, Tuple
173 | from parsers import SparkLogParser  # log parser from this project
174 |
175 | log_file = './data/ProfileHeckler1/JobHeckler1.log.gz'
176 | log_parser = SparkLogParser(log_file)
177 | collapsed_ranked_log: List[Tuple[int, List[str]]] = log_parser.get_top_log_chunks()
178 | for line in collapsed_ranked_log[:5]:  # print the 5 most frequently occurring log chunks
179 |     print(line)
180 | ```
178 | Executing the [script](https://github.com/g1thubhub/phil_stopwatch/blob/master/analytics/extract_heckler.py) containing this code segment produces the following output:
179 | ```terminal
180 | Phils-MacBook-Pro:analytics a$ python3 extract_heckler.py
181 |
182 | ^^ Identified time format for log file: %Y-%m-%d %H:%M:%S
183 |
184 | (329, ['StanfordCoreNLP:88 - Adding annotator tokenize', 'StanfordCoreNLP:88 - Adding annotator ssplit',
185 | 'StanfordCoreNLP:88 - Adding annotator pos', 'StanfordCoreNLP:88 - Adding annotator lemma',
186 | 'StanfordCoreNLP:88 - Adding annotator parse'])
187 | (257, ['StanfordCoreNLP:88 - Adding annotator tokenize', 'StanfordCoreNLP:88 - Adding annotator ssplit',
188 | 'StanfordCoreNLP:88 - Adding annotator pos', 'StanfordCoreNLP:88 - Adding annotator lemma',
189 | 'StanfordCoreNLP:88 - Adding annotator parse'])
190 | (223, ['StanfordCoreNLP:88 - Adding annotator tokenize', 'StanfordCoreNLP:88 - Adding annotator ssplit',
191 | 'StanfordCoreNLP:88 - Adding annotator pos', 'StanfordCoreNLP:88 - Adding annotator lemma',
192 | 'StanfordCoreNLP:88 - Adding annotator parse'])
193 | (221, ['StanfordCoreNLP:88 - Adding annotator tokenize', 'StanfordCoreNLP:88 - Adding annotator ssplit',
194 | 'StanfordCoreNLP:88 - Adding annotator pos', 'StanfordCoreNLP:88 - Adding annotator lemma',
195 | 'StanfordCoreNLP:88 - Adding annotator parse'])
196 | (197, ['StanfordCoreNLP:88 - Adding annotator tokenize', 'StanfordCoreNLP:88 - Adding annotator ssplit',
197 | 'StanfordCoreNLP:88 - Adding annotator pos', 'StanfordCoreNLP:88 - Adding annotator lemma',
198 | 'StanfordCoreNLP:88 - Adding annotator parse'])
199 | ```
200 | These are the 5 most frequent log chunks found in the logging file; each one is a pair of an integer and a list of strings. The integer signifies the total number of times the list of log segments on the right occurred in the file; each individual member of the list occurred in a separate log line. Hence the return value of the method `get_top_log_chunks` that created the output above has the type annotation `List[Tuple[int, List[str]]]`: it extracts a ranked list of contiguous log segments.
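
The collapsing and ranking logic in parsers.py is more elaborate than this, but the underlying idea can be sketched in a few lines: normalize each log line by stripping its timestamp/log level prefix and then count how often contiguous chunks of normalized lines repeat (the fixed chunk size and the regex below are simplifications):

```python
import re
from collections import Counter
from typing import List, Tuple

def top_chunks(lines: List[str], chunk_size: int = 5, top_n: int = 5) -> List[Tuple[int, List[str]]]:
    # Strip everything up to and including the log level so that recurring messages align
    normalized = [re.sub(r'^.*?(INFO|WARN|ERROR)\s+', '', line).strip() for line in lines]
    counts = Counter(tuple(normalized[i:i + chunk_size])
                     for i in range(len(normalized) - chunk_size + 1))
    return [(count, list(chunk)) for chunk, count in counts.most_common(top_n)]
```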
201 |
202 | The top record can be interpreted the following way: The five strings
203 | ```terminal
204 | StanfordCoreNLP:88 - Adding annotator tokenize
205 | StanfordCoreNLP:88 - Adding annotator ssplit
206 | StanfordCoreNLP:88 - Adding annotator pos
207 | StanfordCoreNLP:88 - Adding annotator lemma
208 | StanfordCoreNLP:88 - Adding annotator parse
209 | ```
210 | occurred as infixes in this order 329 times in total in the log file. They were likely part of longer log lines since normalization and collapsing logic was applied by the extraction algorithm; an example occurrence of the first part of the chunk (`StanfordCoreNLP:88 - Adding annotator tokenize`) would be
211 | ```terminal
212 | 2019-02-16 08:44:30 INFO StanfordCoreNLP:88 - Adding annotator tokenize
213 | ```
214 |
215 | What does this tell us? The associated Spark app seems to have performed some NLP tagging since log4j messages from the Stanford [CoreNLP](https://stanfordnlp.github.io/CoreNLP/) project can be found as part of the Spark logs. Initializing a `StanfordCoreNLP` object ...
216 | ```scala
217 | import java.util.Properties
218 | import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
219 | import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
220 | import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation
221 |
222 | val props = new Properties()
223 | props.setProperty("annotators", "tokenize,ssplit,pos,lemma,parse")
224 |
225 | val pipeline = new StanfordCoreNLP(props)
226 | val annotation = new Annotation("This is an example sentence")
227 |
228 | pipeline.annotate(annotation)
229 | val parseTree = annotation.get(classOf[SentencesAnnotation]).get(0).get(classOf[TreeAnnotation])
230 | println(parseTree.toString) // prints the parse tree of the example sentence
231 | ```
227 | ... yields the following log4j output ...
228 | ```terminal
229 | 0 [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
230 | 9 [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
231 | 13 [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
232 | 847 [main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from [...] done [0.8 sec].
233 | 848 [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
234 | 849 [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
235 | 1257 [main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized [...] ... done [0.4 sec].
236 | ```
237 | ... which tells us that five annotators (*tokenize, ssplit, pos, lemma, parse*) are created and wrapped inside a single `StanfordCoreNLP` object. Concerning the use of CoreNLP with Spark:
238 | The number of cores/tasks used in _Heckler_ is three (as in all other riddles), which means that we should find at most three occurrences of these annotator messages in the corresponding Spark log file. But we already saw more than 1000 occurrences when only the top 5 log chunks were investigated above. Having a closer look at the _Heckler_ source [code](https://github.com/g1thubhub/philstopwatch/blob/164e6ab0ac55ccab356c286ba3912c334bea7b27/src/main/scala/profile/sparkjobs/JobHeckler.scala#L47) resolves this contradiction: The implementation is bad since one classifier object is recreated for every input sentence that will be syntactically annotated -- there are 60000 input sentences in total, so a `StanfordCoreNLP` object will be constructed a staggering 60000 times. Due to the distributed/concurrent nature of _Heckler_, we don't always see the annotator messages in the order `tokenize - ssplit - pos - lemma - parse` because log messages of task (1) might interweave with log messages of tasks (2) and (3) in the actual log file, which is also the reason for the slightly reordered log chunks in the top 5 list.
239 |
240 |
241 | Improving this inefficient implementation is not too difficult: Creating the classifier inside a `mapPartitions` instead of a `map` function as done [here](https://github.com/g1thubhub/philstopwatch/blob/164e6ab0ac55ccab356c286ba3912c334bea7b27/src/main/scala/profile/sparkjobs/JobHeckler.scala#L59) will only create three *StanfordCoreNLP* objects overall. However, this is not the minimum: I will now set the record for creating the smallest number of tagger objects with the minimum amount of network transfer. Since `StanfordCoreNLP` is not serializable per se, it needs to be wrapped inside a class that is, in order to prevent a _java.io.NotSerializableException_ when broadcasting it later:
242 | ```scala
243 | class DistributedStanfordCoreNLP extends Serializable {
244 |   val props = new Properties()
245 |   props.setProperty("annotators", "tokenize,ssplit,pos,lemma,parse")
246 |   lazy val pipeline = new StanfordCoreNLP(props)  // lazy: built on first use on the executor, not shipped over the network
247 | }
248 | [...]
249 | val pipelineWrapper = new DistributedStanfordCoreNLP()
250 | val pipelineBroadcast: Broadcast[DistributedStanfordCoreNLP] = session.sparkContext.broadcast(pipelineWrapper)
251 | [...]
252 | val parsedStrings3 = stringsDS.map(string => {
253 |   val annotation = new Annotation(string)
254 |   pipelineBroadcast.value.pipeline.annotate(annotation)
255 |   val parseTree = annotation.get(classOf[SentencesAnnotation]).get(0).get(classOf[TreeAnnotation])
256 |   parseTree.toString
257 | })
258 | ```
259 | The proof lies in the logs:
260 | ```terminal
261 | 19/02/23 18:48:45 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
262 | 19/02/23 18:48:45 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
263 | 19/02/23 18:48:45 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
264 | 19/02/23 18:48:46 INFO StanfordCoreNLP: Adding annotator tokenize
265 | 19/02/23 18:48:46 INFO StanfordCoreNLP: Adding annotator ssplit
266 | 19/02/23 18:48:46 INFO StanfordCoreNLP: Adding annotator pos
267 | 19/02/23 18:48:46 INFO MaxentTagger: Loading POS tagger from [...] ... done [0.6 sec].
268 | 19/02/23 18:48:46 INFO StanfordCoreNLP: Adding annotator lemma
269 | 19/02/23 18:48:46 INFO StanfordCoreNLP: Adding annotator parse
270 | 19/02/23 18:48:47 INFO ParserGrammar: Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.4 sec].
271 | 19/02/23 18:59:07 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1590 bytes result sent to driver
272 | 19/02/23 18:59:07 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 1590 bytes result sent to driver
273 | 19/02/23 18:59:07 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1590 bytes result sent to driver
274 | ```
275 | I'm not sure about the multi-threading capabilities of `StanfordCoreNLP` so it might turn out that the second "per partition" solution is superior performance-wise to the [third](https://github.com/g1thubhub/philstopwatch/blob/164e6ab0ac55ccab356c286ba3912c334bea7b27/src/main/scala/profile/sparkjobs/JobHeckler.scala#L73). In any case, we reduced the number of tagging objects created from 60000 to three or one, not bad.
276 |
277 | ### spaCy on PySpark
278 |
279 | The PySpark version of _Heckler_ will use [spaCy](https://spacy.io/) (written in Cython/Python) as the NLP library instead of *CoreNLP*. From the perspective of a JVM aficionado, packaging in Python itself is odd, and spaCy doesn't seem to be very chatty. Therefore I created an initialization [function](https://github.com/g1thubhub/phil_stopwatch/blob/a1a088facf08eafae806e3958d26cf948d1538f1/spark_jobs/job_heckler.py#L15) that prints more log messages and addresses potential issues when running spaCy in a distributed environment, as model files need to be present on every Spark executor.
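
A rough sketch of what such an initialization helper can look like (the actual function in job_heckler.py differs; the point is to make model loading visible in the executor logs):

```python
import spacy

def load_spacy(model_name: str = 'en_core_web_sm'):
    # Log the spaCy version so that version mismatches between executors are easy to spot
    print('^^ Using spaCy ' + spacy.__version__)
    # spacy.load raises an OSError if the model package is missing on this executor
    nlp = spacy.load(model_name)
    print('^^ Created model ' + model_name)
    return nlp
```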
280 |
281 | As expected, the "bad" [implementation](https://github.com/g1thubhub/phil_stopwatch/blob/a1a088facf08eafae806e3958d26cf948d1538f1/spark_jobs/job_heckler.py#L48) of _Heckler_ recreates one spaCy NLP model per input sentence as proven by this logging excerpt:
282 | ```terminal
283 | [Stage 0:> 3 / 3]
284 | ^^ Using spaCy 2.0.18
285 | ^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
286 | ^^ Using spaCy 2.0.18
287 | ^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
288 | ^^ Using spaCy 2.0.18
289 | ^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
290 | ^^ Created model en_core_web_sm
291 | ^^ Created model en_core_web_sm
292 | ^^ Created model en_core_web_sm
293 | ^^ Using spaCy 2.0.18
294 | ^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
295 | ^^ Using spaCy 2.0.18
296 | ^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
297 | ^^ Using spaCy 2.0.18
298 | ^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
299 | ^^ Created model en_core_web_sm
300 | ^^ Created model en_core_web_sm
301 | ^^ Created model en_core_web_sm
302 | [...]
303 | ```
304 |
305 | Inspired by the Scala edition of _Heckler_, the "per partition" PySpark [solution](https://github.com/g1thubhub/phil_stopwatch/blob/a1a088facf08eafae806e3958d26cf948d1538f1/spark_jobs/job_heckler.py#L36) only initializes three spaCy NLP objects during the application's lifetime; the complete log file of that run is short:
306 | ```terminal
307 | [Stage 0:> (0 + 3) / 3]
308 | ^^ Using spaCy 2.0.18
309 | ^^ Using spaCy 2.0.18
310 | ^^ Using spaCy 2.0.18
311 | ^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
312 | ^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
313 | ^^ Model found at /usr/local/lib/python3.6/site-packages/spacy/data/en_core_web_sm
314 | ^^ Created model en_core_web_sm
315 | ^^ Created model en_core_web_sm
316 | ^^ Created model en_core_web_sm
317 | 1500
318 | ```
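
The "per partition" pattern itself looks roughly like the following sketch, which assumes an RDD of sentences and a `load_spacy` helper like the hypothetical one shown earlier:

```python
def tag_partition(sentences):
    nlp = load_spacy()   # one spaCy model per partition/task instead of one per sentence
    for sentence in sentences:
        yield [token.tag_ for token in nlp(sentence)]

tagged = sentence_rdd.mapPartitions(tag_partition)
print(tagged.count())    # an action forces the evaluation
```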
319 |
320 |
321 | ### Finding failure messages
322 | The functionality introduced in the previous paragraphs can be modified a bit to make the investigation of failed applications much easier: The reason for a crash is often not immediately apparent and requires sifting through log files. Resource-intensive applications will create many log files (one per container/Spark executor) so search functionality along with deduplication and pattern matching logic should come in handy here: The function `extract_errors` from the [AppParser](https://github.com/g1thubhub/phil_stopwatch/blob/c55ec3e5e821eed3e47ec86a8b2ecf03c8090c59/parsers.py#L783) class tries to deduplicate potential exceptions and error messages and print them out in reverse chronological order. An exception or error message might occur several times during a run with slight variations (e.g., different timestamps or code line numbers) but the last occurrence is the most important one for debugging purposes since it might be the direct cause for the failure.
323 |
324 | ```python
325 | from typing import Deque, List, Tuple
326 | from parsers import AppParser  # application-level parser from this project
327 |
328 | app_path = './data/application_1549675138635_0005'
329 | app_parser = AppParser(app_path)
330 | app_errors: Deque[Tuple[str, List[str]]] = app_parser.extract_errors()
331 |
332 | for error in app_errors:
333 |     print(error)
334 | ```
332 |
333 |
334 | ```terminal
335 | ^^ Identified app path with log files
336 | ^^ Identified time format for log file: %y/%m/%d %H:%M:%S
337 | ^^ Warning: Not all tasks completed successfully: {(16.0, 9.0, 8), (16.1, 9.0, 8), (164.0, 9.0, 8), ...}
338 | ^^ Extracting task intervals
339 | ^^ Extracting stage intervals
340 | ^^ Extracting job intervals
341 |
342 | Error messages found, most recent ones first:
343 | ```
344 |
345 | ('/Users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz', ['18/02/01 21:49:35 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted.', 'org.apache.spark.SparkException: Job aborted.', 'at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:213)', 'at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)', 'at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)', 'at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)', 'at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)', 'at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)', 'at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)', 'at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)', 'at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)', 'at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)', 'at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)', 'at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)', 'at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)', 'at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)', 'at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)', 'at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:435)', 'at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)', 'at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)', 'at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)', 'at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)', 'at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)', 'at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)', 'at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)', 'at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)', 'at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)', 'at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)', 'at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)', 'at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)', 'at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)', 'at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)', 'at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)', 'at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:508)',
346 | [...]
347 | '... 48 more'])
348 |
349 |
350 | ('/Users/phil/data/application_1549_0005/container_1549_0001_02_000124/stderr.gz', ['18/02/01 21:49:34 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM'])
351 |
352 |
353 | ('/Users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz', ['18/02/01 21:49:34 WARN YarnAllocator: Container marked as failed: container_1549731000_0001_02_000124 on host: ip-172-18-39-28.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node'])
354 |
355 |
356 | ('/Users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz', ['18/02/01 21:49:34 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1549731000_0001_02_000124 on host: ip-172-18-39-28.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node'])
357 |
358 |
359 | ('/Users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz', ['18/02/01 21:49:34 ERROR TaskSetManager: Task 30 in stage 9.0 failed 4 times; aborting job'])
360 |
361 |
362 | ('/Users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz', ['18/02/01 21:49:34 ERROR YarnClusterScheduler: Lost executor 62 on ip-172-18-39-28.ec2.internal: Container marked as failed: container_1549731000_0001_02_000124 on host: ip-172-18-39-28.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node'])
363 |
364 |
365 | ('/Users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz', ['18/02/01 21:49:34 WARN TaskSetManager: Lost task 4.3 in stage 9.0 (TID 610, ip-172-18-39-28.ec2.internal, executor 62): ExecutorLostFailure (executor 62 exited caused by one of the running tasks) Reason: Container marked as failed: container_1549731000_0001_02_000124 on host: ip-172-18-39-28.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node'])
366 |
367 |
368 | ('/Users/phil/data/application_1549_0005/container_1549_0001_02_000002/stderr.gz', ['18/02/01 21:49:34 WARN ExecutorAllocationManager: No stages are running, but numRunningTasks != 0'])
369 |
370 | [...]
371 |
372 |
373 |
Each error record printed out in this fashion consists of two elements: The first one is the path to the source log file in which the second element, the actual error chunk, was found.
375 |
376 |
377 |
378 |
A stack trace -- a multiline error message -- is normalized and copied into a single list here, as in the first record above. Internally, repeated segments are collapsed and pattern matching over contiguous subsequences is applied, which comes in handy because, due to the concurrent nature of Spark, log lines from different tasks might interweave. Some of the reported messages are benign. The deduplication also reduces the flooding considerably -- in this example from 316 error lines down to just 16 -- and the records are sorted in reverse chronological order, which increases the possibility of finding the actual cause of the failure quickly.
398 |
399 |
400 |
401 | ## The Slacker
402 |
403 |
```python
from typing import Dict
from parsers import ProfileParser  # parser class from this project

combined_file = './data/profile_slacker/CombinedCpuAndMemory.json.gz'  # Output from JVM & PySpark profilers

jvm_parser = ProfileParser(combined_file)
jvm_parser.manually_set_profiler('JVMProfiler')

pyspark_parser = ProfileParser(combined_file)
pyspark_parser.manually_set_profiler('pyspark')

jvm_maxima: Dict[str, float] = jvm_parser.get_maxima()
pyspark_maxima: Dict[str, float] = pyspark_parser.get_maxima()

print('JVM max values:')
print(jvm_maxima)
print('\nPySpark max values:')
print(pyspark_maxima)
```
421 |
422 | The output is ...
423 | ```terminal
424 | JVM max values:
425 | {'ScavengeCollTime': 0.0013, 'MarkSweepCollTime': 0.00255, 'MarkSweepCollCount': 3.0, 'ScavengeCollCount': 10.0,
426 | 'systemCpuLoad': 0.6419753086419753, 'processCpuLoad': 0.6189945167759597, 'nonHeapMemoryTotalUsed': 89.07969665527344,
427 | 'nonHeapMemoryCommitted': 90.3125, 'heapMemoryTotalUsed': 336.95963287353516, 'heapMemoryCommitted': 452.0}
428 |
429 | PySpark max values:
430 | {'pmem_rss': 78.50390625, 'pmem_vms': 4448.35546875, 'cpu_percent': 0.4}
431 | ```
These are low values given the baseline overhead of running Spark and compared to the profiles for _Fatso_ above -- for example, only 13 GC events happened and the peak CPU load for the entire run was less than 65%. Visualizing all CPU data points shows that these maxima occurred at the beginning and end of the application, when there is always a lot of initialization and cleanup work regardless of the logic being executed (bigger version [here](http://127.0.0.1:4000/slacker-cpu.html)):
433 | 
434 |
So the system is almost idle for the majority of the time. The slacker in this pure form is a rare sight; when processing real-life workloads, slacking most likely occurs in certain stages that interact with an external system, for example stages that query a database for records that should be joined with Datasets/RDDs later on, or stages that materialize output records to a storage layer like HDFS and use too few write partitions. A combined flame graph of JVM and Python stack traces will reveal the slacking part (original [here](http://127.0.0.1:4000/CombinedStack.svg)):
436 | {% include CombinedStack.svg %}
437 |
In the first plateau, which is also the longest, two custom Python functions sit at the top. After inspecting their implementation [here](https://github.com/g1thubhub/phil_stopwatch/blob/edef3b88425717ede93c33683bd0c59f85ba40c6/spark_jobs/job_slacker.py#L15) and [there](https://github.com/g1thubhub/phil_stopwatch/blob/edef3b88425717ede93c33683bd0c59f85ba40c6/helper.py#L211), the low system utilization should not be surprising anymore: The second function from the top, `job_slacker.py:slacking`, is basically a simple loop that calls a function `helper.py:secondsSleep` from an external _helper_ package many times. This function has a sample presence of almost 20% (seen in the [original](http://127.0.0.1:4000/CombinedStack.svg)) and, since it sits atop the plateau, it is the function found on top of the stack most of the time. As its name suggests, it causes the program to sleep for one second per call. So _Slacker_ is essentially a 10 minute long system sleep.
In real-world BigData applications that have slacking phases, we can expect the top of some plateaus to be occupied by "write" functions like `FileFormatWriter$$anonfun$write$1.apply` or by functions related to DB queries.
440 |
441 |
442 |
443 | ## Code & Goodies
444 |
I tried to make things fairly compositional/composable -- for example, the output of a profiler can occur in one file or spread across several files, either alone or in combination with log messages (of different formats!) or with the output of another profiler; the parsing code will take care of that. Units are converted so that various metrics can be displayed conveniently in the same graph; you can check the conversion map in the code and also add your own metrics.
450 |
451 |
452 | ### Riddle source code
453 |
The Scala source code for the local riddles is located in the [philstopwatch](https://github.com/g1thubhub/philstopwatch) repository.

The Python/PySpark editions are located in the [phil_stopwatch](https://github.com/g1thubhub/phil_stopwatch) repository, the jobs themselves under `spark_jobs`.
459 |
460 | ### Uber's JVM profiler
461 |
With [Uber's JVM profiler](https://github.com/uber-common/jvm-profiler), existing JVM applications can be profiled without modifying any source code. I built it in the following way:
463 |
464 |
465 | ```terminal
466 | git clone https://github.com/uber-common/jvm-profiler.git
467 | cd jvm-profiler/
468 | mvn clean package
469 | [...]
470 | Replacing /Users/a/jvm-profiler/target/jvm-profiler-1.0.0.jar with /Users/a/jvm-profiler/target/jvm-profiler-1.0.0-shaded.jar
471 | ```
The resulting jvm-profiler-1.0.0.jar is a self-contained "fat jar" and can be used for profiling.

In a cloud environment, it is likely not possible to use the FileOutputReporter since the profiler has no code for interacting with HDFS or S3. In that case the profiler records can simply be printed to standard output; the parsers introduced above will take care of separating them from the normal log messages.
477 |
478 |
479 | ### PySpark
480 | ```shell
481 | export PYSPARK_PYTHON=python3
482 | pip3 install -e .
483 |
484 | ```
485 | ### JVM & PySpark Profilers
486 |
487 |
Under the hood, Uber's JVM profiler collects its metrics via the `java.lang.management` package (available since Java 1.5) and its MXBean interfaces.
493 |
Three different PySpark profilers are implemented in this [file](https://github.com/g1thubhub/phil_stopwatch/blob/master/pyspark_profilers.py): a CPU/memory profiler, a stacktrace profiler and a combination of these two.
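
These profilers plug into PySpark's pluggable `Profiler` API (`pyspark.profiler.Profiler`). The following is a stripped-down sketch of how a CPU/memory sampler can hook into that API; it is not the project's actual implementation and it assumes `psutil` for the process metrics:

```python
import os
import threading
import time

import psutil                                   # assumed dependency for process metrics
from pyspark.accumulators import AccumulatorParam
from pyspark.profiler import Profiler


class ListParam(AccumulatorParam):
    """Accumulator that merges lists of samples collected on the workers."""
    def zero(self, value):
        return []

    def addInPlace(self, value1, value2):
        value1.extend(value2)
        return value1


class CpuMemProfiler(Profiler):
    """Sketch of a CPU/memory profiler; the real classes also handle stack traces etc."""
    def __init__(self, ctx):
        Profiler.__init__(self, ctx)
        self._accumulator = ctx.accumulator([], ListParam())

    def profile(self, func):
        # Runs on the worker: sample the Python process while `func` executes
        process = psutil.Process(os.getpid())
        samples, done = [], threading.Event()

        def sample():
            while not done.is_set():
                memory = process.memory_info()
                samples.append({'epochMillis': int(time.time() * 1000),
                                'pmem_rss': memory.rss, 'pmem_vms': memory.vms,
                                'cpu_percent': process.cpu_percent()})
                time.sleep(0.1)

        sampler = threading.Thread(target=sample)
        sampler.start()
        try:
            func()
        finally:
            done.set()
            sampler.join()
        self._accumulator.add(samples)          # ship the samples back to the driver

    def stats(self):
        return self._accumulator.value

    def show(self, id):
        print('Profile of RDD<id=%d>: %d samples' % (id, len(self.stats())))

    def dump(self, id, path):
        import json
        with open(os.path.join(path, 's_%d_cpumem.json' % id), 'w') as output:
            json.dump(self.stats(), output)
```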
495 |
496 | If the JVM profiler is used with the options shown above, three different kinds of records are generated which, in case the _FileOutputReporter_ flag is used, are written to three JSON files, _ProcessInfo.json_, _CpuAndMemory.json_ and _Stacktrace.json_. Similarly, my PySpark profiler will create two different output records that, with the appropriate flag set, are written to at least two JSON files with the pattern _s_*_stack.json_ or _s_*_cpumem.json_.
In total, a profiled PySpark application therefore produces at least five JSON files: three created by the JVM profiler and at least two created by my PySpark profilers.
506 |
507 |
In a cloud/Hadoop environment, the profiler output becomes part of the standard output since the JVM profiler does not implement an HDFS file writer; I might start working on that soon.
509 |
510 |
511 |
512 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | GNU GENERAL PUBLIC LICENSE
2 | Version 3, 29 June 2007
3 |
4 | Copyright (C) 2007 Free Software Foundation, Inc.
5 | Everyone is permitted to copy and distribute verbatim copies
6 | of this license document, but changing it is not allowed.
7 |
8 | Preamble
9 |
10 | The GNU General Public License is a free, copyleft license for
11 | software and other kinds of works.
12 |
13 | The licenses for most software and other practical works are designed
14 | to take away your freedom to share and change the works. By contrast,
15 | the GNU General Public License is intended to guarantee your freedom to
16 | share and change all versions of a program--to make sure it remains free
17 | software for all its users. We, the Free Software Foundation, use the
18 | GNU General Public License for most of our software; it applies also to
19 | any other work released this way by its authors. You can apply it to
20 | your programs, too.
21 |
22 | When we speak of free software, we are referring to freedom, not
23 | price. Our General Public Licenses are designed to make sure that you
24 | have the freedom to distribute copies of free software (and charge for
25 | them if you wish), that you receive source code or can get it if you
26 | want it, that you can change the software or use pieces of it in new
27 | free programs, and that you know you can do these things.
28 |
29 | To protect your rights, we need to prevent others from denying you
30 | these rights or asking you to surrender the rights. Therefore, you have
31 | certain responsibilities if you distribute copies of the software, or if
32 | you modify it: responsibilities to respect the freedom of others.
33 |
34 | For example, if you distribute copies of such a program, whether
35 | gratis or for a fee, you must pass on to the recipients the same
36 | freedoms that you received. You must make sure that they, too, receive
37 | or can get the source code. And you must show them these terms so they
38 | know their rights.
39 |
40 | Developers that use the GNU GPL protect your rights with two steps:
41 | (1) assert copyright on the software, and (2) offer you this License
42 | giving you legal permission to copy, distribute and/or modify it.
43 |
44 | For the developers' and authors' protection, the GPL clearly explains
45 | that there is no warranty for this free software. For both users' and
46 | authors' sake, the GPL requires that modified versions be marked as
47 | changed, so that their problems will not be attributed erroneously to
48 | authors of previous versions.
49 |
50 | Some devices are designed to deny users access to install or run
51 | modified versions of the software inside them, although the manufacturer
52 | can do so. This is fundamentally incompatible with the aim of
53 | protecting users' freedom to change the software. The systematic
54 | pattern of such abuse occurs in the area of products for individuals to
55 | use, which is precisely where it is most unacceptable. Therefore, we
56 | have designed this version of the GPL to prohibit the practice for those
57 | products. If such problems arise substantially in other domains, we
58 | stand ready to extend this provision to those domains in future versions
59 | of the GPL, as needed to protect the freedom of users.
60 |
61 | Finally, every program is threatened constantly by software patents.
62 | States should not allow patents to restrict development and use of
63 | software on general-purpose computers, but in those that do, we wish to
64 | avoid the special danger that patents applied to a free program could
65 | make it effectively proprietary. To prevent this, the GPL assures that
66 | patents cannot be used to render the program non-free.
67 |
68 | The precise terms and conditions for copying, distribution and
69 | modification follow.
70 |
71 | TERMS AND CONDITIONS
72 |
73 | 0. Definitions.
74 |
75 | "This License" refers to version 3 of the GNU General Public License.
76 |
77 | "Copyright" also means copyright-like laws that apply to other kinds of
78 | works, such as semiconductor masks.
79 |
80 | "The Program" refers to any copyrightable work licensed under this
81 | License. Each licensee is addressed as "you". "Licensees" and
82 | "recipients" may be individuals or organizations.
83 |
84 | To "modify" a work means to copy from or adapt all or part of the work
85 | in a fashion requiring copyright permission, other than the making of an
86 | exact copy. The resulting work is called a "modified version" of the
87 | earlier work or a work "based on" the earlier work.
88 |
89 | A "covered work" means either the unmodified Program or a work based
90 | on the Program.
91 |
92 | To "propagate" a work means to do anything with it that, without
93 | permission, would make you directly or secondarily liable for
94 | infringement under applicable copyright law, except executing it on a
95 | computer or modifying a private copy. Propagation includes copying,
96 | distribution (with or without modification), making available to the
97 | public, and in some countries other activities as well.
98 |
99 | To "convey" a work means any kind of propagation that enables other
100 | parties to make or receive copies. Mere interaction with a user through
101 | a computer network, with no transfer of a copy, is not conveying.
102 |
103 | An interactive user interface displays "Appropriate Legal Notices"
104 | to the extent that it includes a convenient and prominently visible
105 | feature that (1) displays an appropriate copyright notice, and (2)
106 | tells the user that there is no warranty for the work (except to the
107 | extent that warranties are provided), that licensees may convey the
108 | work under this License, and how to view a copy of this License. If
109 | the interface presents a list of user commands or options, such as a
110 | menu, a prominent item in the list meets this criterion.
111 |
112 | 1. Source Code.
113 |
114 | The "source code" for a work means the preferred form of the work
115 | for making modifications to it. "Object code" means any non-source
116 | form of a work.
117 |
118 | A "Standard Interface" means an interface that either is an official
119 | standard defined by a recognized standards body, or, in the case of
120 | interfaces specified for a particular programming language, one that
121 | is widely used among developers working in that language.
122 |
123 | The "System Libraries" of an executable work include anything, other
124 | than the work as a whole, that (a) is included in the normal form of
125 | packaging a Major Component, but which is not part of that Major
126 | Component, and (b) serves only to enable use of the work with that
127 | Major Component, or to implement a Standard Interface for which an
128 | implementation is available to the public in source code form. A
129 | "Major Component", in this context, means a major essential component
130 | (kernel, window system, and so on) of the specific operating system
131 | (if any) on which the executable work runs, or a compiler used to
132 | produce the work, or an object code interpreter used to run it.
133 |
134 | The "Corresponding Source" for a work in object code form means all
135 | the source code needed to generate, install, and (for an executable
136 | work) run the object code and to modify the work, including scripts to
137 | control those activities. However, it does not include the work's
138 | System Libraries, or general-purpose tools or generally available free
139 | programs which are used unmodified in performing those activities but
140 | which are not part of the work. For example, Corresponding Source
141 | includes interface definition files associated with source files for
142 | the work, and the source code for shared libraries and dynamically
143 | linked subprograms that the work is specifically designed to require,
144 | such as by intimate data communication or control flow between those
145 | subprograms and other parts of the work.
146 |
147 | The Corresponding Source need not include anything that users
148 | can regenerate automatically from other parts of the Corresponding
149 | Source.
150 |
151 | The Corresponding Source for a work in source code form is that
152 | same work.
153 |
154 | 2. Basic Permissions.
155 |
156 | All rights granted under this License are granted for the term of
157 | copyright on the Program, and are irrevocable provided the stated
158 | conditions are met. This License explicitly affirms your unlimited
159 | permission to run the unmodified Program. The output from running a
160 | covered work is covered by this License only if the output, given its
161 | content, constitutes a covered work. This License acknowledges your
162 | rights of fair use or other equivalent, as provided by copyright law.
163 |
164 | You may make, run and propagate covered works that you do not
165 | convey, without conditions so long as your license otherwise remains
166 | in force. You may convey covered works to others for the sole purpose
167 | of having them make modifications exclusively for you, or provide you
168 | with facilities for running those works, provided that you comply with
169 | the terms of this License in conveying all material for which you do
170 | not control copyright. Those thus making or running the covered works
171 | for you must do so exclusively on your behalf, under your direction
172 | and control, on terms that prohibit them from making any copies of
173 | your copyrighted material outside their relationship with you.
174 |
175 | Conveying under any other circumstances is permitted solely under
176 | the conditions stated below. Sublicensing is not allowed; section 10
177 | makes it unnecessary.
178 |
179 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law.
180 |
181 | No covered work shall be deemed part of an effective technological
182 | measure under any applicable law fulfilling obligations under article
183 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or
184 | similar laws prohibiting or restricting circumvention of such
185 | measures.
186 |
187 | When you convey a covered work, you waive any legal power to forbid
188 | circumvention of technological measures to the extent such circumvention
189 | is effected by exercising rights under this License with respect to
190 | the covered work, and you disclaim any intention to limit operation or
191 | modification of the work as a means of enforcing, against the work's
192 | users, your or third parties' legal rights to forbid circumvention of
193 | technological measures.
194 |
195 | 4. Conveying Verbatim Copies.
196 |
197 | You may convey verbatim copies of the Program's source code as you
198 | receive it, in any medium, provided that you conspicuously and
199 | appropriately publish on each copy an appropriate copyright notice;
200 | keep intact all notices stating that this License and any
201 | non-permissive terms added in accord with section 7 apply to the code;
202 | keep intact all notices of the absence of any warranty; and give all
203 | recipients a copy of this License along with the Program.
204 |
205 | You may charge any price or no price for each copy that you convey,
206 | and you may offer support or warranty protection for a fee.
207 |
208 | 5. Conveying Modified Source Versions.
209 |
210 | You may convey a work based on the Program, or the modifications to
211 | produce it from the Program, in the form of source code under the
212 | terms of section 4, provided that you also meet all of these conditions:
213 |
214 | a) The work must carry prominent notices stating that you modified
215 | it, and giving a relevant date.
216 |
217 | b) The work must carry prominent notices stating that it is
218 | released under this License and any conditions added under section
219 | 7. This requirement modifies the requirement in section 4 to
220 | "keep intact all notices".
221 |
222 | c) You must license the entire work, as a whole, under this
223 | License to anyone who comes into possession of a copy. This
224 | License will therefore apply, along with any applicable section 7
225 | additional terms, to the whole of the work, and all its parts,
226 | regardless of how they are packaged. This License gives no
227 | permission to license the work in any other way, but it does not
228 | invalidate such permission if you have separately received it.
229 |
230 | d) If the work has interactive user interfaces, each must display
231 | Appropriate Legal Notices; however, if the Program has interactive
232 | interfaces that do not display Appropriate Legal Notices, your
233 | work need not make them do so.
234 |
235 | A compilation of a covered work with other separate and independent
236 | works, which are not by their nature extensions of the covered work,
237 | and which are not combined with it such as to form a larger program,
238 | in or on a volume of a storage or distribution medium, is called an
239 | "aggregate" if the compilation and its resulting copyright are not
240 | used to limit the access or legal rights of the compilation's users
241 | beyond what the individual works permit. Inclusion of a covered work
242 | in an aggregate does not cause this License to apply to the other
243 | parts of the aggregate.
244 |
245 | 6. Conveying Non-Source Forms.
246 |
247 | You may convey a covered work in object code form under the terms
248 | of sections 4 and 5, provided that you also convey the
249 | machine-readable Corresponding Source under the terms of this License,
250 | in one of these ways:
251 |
252 | a) Convey the object code in, or embodied in, a physical product
253 | (including a physical distribution medium), accompanied by the
254 | Corresponding Source fixed on a durable physical medium
255 | customarily used for software interchange.
256 |
257 | b) Convey the object code in, or embodied in, a physical product
258 | (including a physical distribution medium), accompanied by a
259 | written offer, valid for at least three years and valid for as
260 | long as you offer spare parts or customer support for that product
261 | model, to give anyone who possesses the object code either (1) a
262 | copy of the Corresponding Source for all the software in the
263 | product that is covered by this License, on a durable physical
264 | medium customarily used for software interchange, for a price no
265 | more than your reasonable cost of physically performing this
266 | conveying of source, or (2) access to copy the
267 | Corresponding Source from a network server at no charge.
268 |
269 | c) Convey individual copies of the object code with a copy of the
270 | written offer to provide the Corresponding Source. This
271 | alternative is allowed only occasionally and noncommercially, and
272 | only if you received the object code with such an offer, in accord
273 | with subsection 6b.
274 |
275 | d) Convey the object code by offering access from a designated
276 | place (gratis or for a charge), and offer equivalent access to the
277 | Corresponding Source in the same way through the same place at no
278 | further charge. You need not require recipients to copy the
279 | Corresponding Source along with the object code. If the place to
280 | copy the object code is a network server, the Corresponding Source
281 | may be on a different server (operated by you or a third party)
282 | that supports equivalent copying facilities, provided you maintain
283 | clear directions next to the object code saying where to find the
284 | Corresponding Source. Regardless of what server hosts the
285 | Corresponding Source, you remain obligated to ensure that it is
286 | available for as long as needed to satisfy these requirements.
287 |
288 | e) Convey the object code using peer-to-peer transmission, provided
289 | you inform other peers where the object code and Corresponding
290 | Source of the work are being offered to the general public at no
291 | charge under subsection 6d.
292 |
293 | A separable portion of the object code, whose source code is excluded
294 | from the Corresponding Source as a System Library, need not be
295 | included in conveying the object code work.
296 |
297 | A "User Product" is either (1) a "consumer product", which means any
298 | tangible personal property which is normally used for personal, family,
299 | or household purposes, or (2) anything designed or sold for incorporation
300 | into a dwelling. In determining whether a product is a consumer product,
301 | doubtful cases shall be resolved in favor of coverage. For a particular
302 | product received by a particular user, "normally used" refers to a
303 | typical or common use of that class of product, regardless of the status
304 | of the particular user or of the way in which the particular user
305 | actually uses, or expects or is expected to use, the product. A product
306 | is a consumer product regardless of whether the product has substantial
307 | commercial, industrial or non-consumer uses, unless such uses represent
308 | the only significant mode of use of the product.
309 |
310 | "Installation Information" for a User Product means any methods,
311 | procedures, authorization keys, or other information required to install
312 | and execute modified versions of a covered work in that User Product from
313 | a modified version of its Corresponding Source. The information must
314 | suffice to ensure that the continued functioning of the modified object
315 | code is in no case prevented or interfered with solely because
316 | modification has been made.
317 |
318 | If you convey an object code work under this section in, or with, or
319 | specifically for use in, a User Product, and the conveying occurs as
320 | part of a transaction in which the right of possession and use of the
321 | User Product is transferred to the recipient in perpetuity or for a
322 | fixed term (regardless of how the transaction is characterized), the
323 | Corresponding Source conveyed under this section must be accompanied
324 | by the Installation Information. But this requirement does not apply
325 | if neither you nor any third party retains the ability to install
326 | modified object code on the User Product (for example, the work has
327 | been installed in ROM).
328 |
329 | The requirement to provide Installation Information does not include a
330 | requirement to continue to provide support service, warranty, or updates
331 | for a work that has been modified or installed by the recipient, or for
332 | the User Product in which it has been modified or installed. Access to a
333 | network may be denied when the modification itself materially and
334 | adversely affects the operation of the network or violates the rules and
335 | protocols for communication across the network.
336 |
337 | Corresponding Source conveyed, and Installation Information provided,
338 | in accord with this section must be in a format that is publicly
339 | documented (and with an implementation available to the public in
340 | source code form), and must require no special password or key for
341 | unpacking, reading or copying.
342 |
343 | 7. Additional Terms.
344 |
345 | "Additional permissions" are terms that supplement the terms of this
346 | License by making exceptions from one or more of its conditions.
347 | Additional permissions that are applicable to the entire Program shall
348 | be treated as though they were included in this License, to the extent
349 | that they are valid under applicable law. If additional permissions
350 | apply only to part of the Program, that part may be used separately
351 | under those permissions, but the entire Program remains governed by
352 | this License without regard to the additional permissions.
353 |
354 | When you convey a copy of a covered work, you may at your option
355 | remove any additional permissions from that copy, or from any part of
356 | it. (Additional permissions may be written to require their own
357 | removal in certain cases when you modify the work.) You may place
358 | additional permissions on material, added by you to a covered work,
359 | for which you have or can give appropriate copyright permission.
360 |
361 | Notwithstanding any other provision of this License, for material you
362 | add to a covered work, you may (if authorized by the copyright holders of
363 | that material) supplement the terms of this License with terms:
364 |
365 | a) Disclaiming warranty or limiting liability differently from the
366 | terms of sections 15 and 16 of this License; or
367 |
368 | b) Requiring preservation of specified reasonable legal notices or
369 | author attributions in that material or in the Appropriate Legal
370 | Notices displayed by works containing it; or
371 |
372 | c) Prohibiting misrepresentation of the origin of that material, or
373 | requiring that modified versions of such material be marked in
374 | reasonable ways as different from the original version; or
375 |
376 | d) Limiting the use for publicity purposes of names of licensors or
377 | authors of the material; or
378 |
379 | e) Declining to grant rights under trademark law for use of some
380 | trade names, trademarks, or service marks; or
381 |
382 | f) Requiring indemnification of licensors and authors of that
383 | material by anyone who conveys the material (or modified versions of
384 | it) with contractual assumptions of liability to the recipient, for
385 | any liability that these contractual assumptions directly impose on
386 | those licensors and authors.
387 |
388 | All other non-permissive additional terms are considered "further
389 | restrictions" within the meaning of section 10. If the Program as you
390 | received it, or any part of it, contains a notice stating that it is
391 | governed by this License along with a term that is a further
392 | restriction, you may remove that term. If a license document contains
393 | a further restriction but permits relicensing or conveying under this
394 | License, you may add to a covered work material governed by the terms
395 | of that license document, provided that the further restriction does
396 | not survive such relicensing or conveying.
397 |
398 | If you add terms to a covered work in accord with this section, you
399 | must place, in the relevant source files, a statement of the
400 | additional terms that apply to those files, or a notice indicating
401 | where to find the applicable terms.
402 |
403 | Additional terms, permissive or non-permissive, may be stated in the
404 | form of a separately written license, or stated as exceptions;
405 | the above requirements apply either way.
406 |
407 | 8. Termination.
408 |
409 | You may not propagate or modify a covered work except as expressly
410 | provided under this License. Any attempt otherwise to propagate or
411 | modify it is void, and will automatically terminate your rights under
412 | this License (including any patent licenses granted under the third
413 | paragraph of section 11).
414 |
415 | However, if you cease all violation of this License, then your
416 | license from a particular copyright holder is reinstated (a)
417 | provisionally, unless and until the copyright holder explicitly and
418 | finally terminates your license, and (b) permanently, if the copyright
419 | holder fails to notify you of the violation by some reasonable means
420 | prior to 60 days after the cessation.
421 |
422 | Moreover, your license from a particular copyright holder is
423 | reinstated permanently if the copyright holder notifies you of the
424 | violation by some reasonable means, this is the first time you have
425 | received notice of violation of this License (for any work) from that
426 | copyright holder, and you cure the violation prior to 30 days after
427 | your receipt of the notice.
428 |
429 | Termination of your rights under this section does not terminate the
430 | licenses of parties who have received copies or rights from you under
431 | this License. If your rights have been terminated and not permanently
432 | reinstated, you do not qualify to receive new licenses for the same
433 | material under section 10.
434 |
435 | 9. Acceptance Not Required for Having Copies.
436 |
437 | You are not required to accept this License in order to receive or
438 | run a copy of the Program. Ancillary propagation of a covered work
439 | occurring solely as a consequence of using peer-to-peer transmission
440 | to receive a copy likewise does not require acceptance. However,
441 | nothing other than this License grants you permission to propagate or
442 | modify any covered work. These actions infringe copyright if you do
443 | not accept this License. Therefore, by modifying or propagating a
444 | covered work, you indicate your acceptance of this License to do so.
445 |
446 | 10. Automatic Licensing of Downstream Recipients.
447 |
448 | Each time you convey a covered work, the recipient automatically
449 | receives a license from the original licensors, to run, modify and
450 | propagate that work, subject to this License. You are not responsible
451 | for enforcing compliance by third parties with this License.
452 |
453 | An "entity transaction" is a transaction transferring control of an
454 | organization, or substantially all assets of one, or subdividing an
455 | organization, or merging organizations. If propagation of a covered
456 | work results from an entity transaction, each party to that
457 | transaction who receives a copy of the work also receives whatever
458 | licenses to the work the party's predecessor in interest had or could
459 | give under the previous paragraph, plus a right to possession of the
460 | Corresponding Source of the work from the predecessor in interest, if
461 | the predecessor has it or can get it with reasonable efforts.
462 |
463 | You may not impose any further restrictions on the exercise of the
464 | rights granted or affirmed under this License. For example, you may
465 | not impose a license fee, royalty, or other charge for exercise of
466 | rights granted under this License, and you may not initiate litigation
467 | (including a cross-claim or counterclaim in a lawsuit) alleging that
468 | any patent claim is infringed by making, using, selling, offering for
469 | sale, or importing the Program or any portion of it.
470 |
471 | 11. Patents.
472 |
473 | A "contributor" is a copyright holder who authorizes use under this
474 | License of the Program or a work on which the Program is based. The
475 | work thus licensed is called the contributor's "contributor version".
476 |
477 | A contributor's "essential patent claims" are all patent claims
478 | owned or controlled by the contributor, whether already acquired or
479 | hereafter acquired, that would be infringed by some manner, permitted
480 | by this License, of making, using, or selling its contributor version,
481 | but do not include claims that would be infringed only as a
482 | consequence of further modification of the contributor version. For
483 | purposes of this definition, "control" includes the right to grant
484 | patent sublicenses in a manner consistent with the requirements of
485 | this License.
486 |
487 | Each contributor grants you a non-exclusive, worldwide, royalty-free
488 | patent license under the contributor's essential patent claims, to
489 | make, use, sell, offer for sale, import and otherwise run, modify and
490 | propagate the contents of its contributor version.
491 |
492 | In the following three paragraphs, a "patent license" is any express
493 | agreement or commitment, however denominated, not to enforce a patent
494 | (such as an express permission to practice a patent or covenant not to
495 | sue for patent infringement). To "grant" such a patent license to a
496 | party means to make such an agreement or commitment not to enforce a
497 | patent against the party.
498 |
499 | If you convey a covered work, knowingly relying on a patent license,
500 | and the Corresponding Source of the work is not available for anyone
501 | to copy, free of charge and under the terms of this License, through a
502 | publicly available network server or other readily accessible means,
503 | then you must either (1) cause the Corresponding Source to be so
504 | available, or (2) arrange to deprive yourself of the benefit of the
505 | patent license for this particular work, or (3) arrange, in a manner
506 | consistent with the requirements of this License, to extend the patent
507 | license to downstream recipients. "Knowingly relying" means you have
508 | actual knowledge that, but for the patent license, your conveying the
509 | covered work in a country, or your recipient's use of the covered work
510 | in a country, would infringe one or more identifiable patents in that
511 | country that you have reason to believe are valid.
512 |
513 | If, pursuant to or in connection with a single transaction or
514 | arrangement, you convey, or propagate by procuring conveyance of, a
515 | covered work, and grant a patent license to some of the parties
516 | receiving the covered work authorizing them to use, propagate, modify
517 | or convey a specific copy of the covered work, then the patent license
518 | you grant is automatically extended to all recipients of the covered
519 | work and works based on it.
520 |
521 | A patent license is "discriminatory" if it does not include within
522 | the scope of its coverage, prohibits the exercise of, or is
523 | conditioned on the non-exercise of one or more of the rights that are
524 | specifically granted under this License. You may not convey a covered
525 | work if you are a party to an arrangement with a third party that is
526 | in the business of distributing software, under which you make payment
527 | to the third party based on the extent of your activity of conveying
528 | the work, and under which the third party grants, to any of the
529 | parties who would receive the covered work from you, a discriminatory
530 | patent license (a) in connection with copies of the covered work
531 | conveyed by you (or copies made from those copies), or (b) primarily
532 | for and in connection with specific products or compilations that
533 | contain the covered work, unless you entered into that arrangement,
534 | or that patent license was granted, prior to 28 March 2007.
535 |
536 | Nothing in this License shall be construed as excluding or limiting
537 | any implied license or other defenses to infringement that may
538 | otherwise be available to you under applicable patent law.
539 |
540 | 12. No Surrender of Others' Freedom.
541 |
542 | If conditions are imposed on you (whether by court order, agreement or
543 | otherwise) that contradict the conditions of this License, they do not
544 | excuse you from the conditions of this License. If you cannot convey a
545 | covered work so as to satisfy simultaneously your obligations under this
546 | License and any other pertinent obligations, then as a consequence you may
547 | not convey it at all. For example, if you agree to terms that obligate you
548 | to collect a royalty for further conveying from those to whom you convey
549 | the Program, the only way you could satisfy both those terms and this
550 | License would be to refrain entirely from conveying the Program.
551 |
552 | 13. Use with the GNU Affero General Public License.
553 |
554 | Notwithstanding any other provision of this License, you have
555 | permission to link or combine any covered work with a work licensed
556 | under version 3 of the GNU Affero General Public License into a single
557 | combined work, and to convey the resulting work. The terms of this
558 | License will continue to apply to the part which is the covered work,
559 | but the special requirements of the GNU Affero General Public License,
560 | section 13, concerning interaction through a network will apply to the
561 | combination as such.
562 |
563 | 14. Revised Versions of this License.
564 |
565 | The Free Software Foundation may publish revised and/or new versions of
566 | the GNU General Public License from time to time. Such new versions will
567 | be similar in spirit to the present version, but may differ in detail to
568 | address new problems or concerns.
569 |
570 | Each version is given a distinguishing version number. If the
571 | Program specifies that a certain numbered version of the GNU General
572 | Public License "or any later version" applies to it, you have the
573 | option of following the terms and conditions either of that numbered
574 | version or of any later version published by the Free Software
575 | Foundation. If the Program does not specify a version number of the
576 | GNU General Public License, you may choose any version ever published
577 | by the Free Software Foundation.
578 |
579 | If the Program specifies that a proxy can decide which future
580 | versions of the GNU General Public License can be used, that proxy's
581 | public statement of acceptance of a version permanently authorizes you
582 | to choose that version for the Program.
583 |
584 | Later license versions may give you additional or different
585 | permissions. However, no additional obligations are imposed on any
586 | author or copyright holder as a result of your choosing to follow a
587 | later version.
588 |
589 | 15. Disclaimer of Warranty.
590 |
591 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
592 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
593 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
594 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
595 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
596 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
597 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
598 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
599 |
600 | 16. Limitation of Liability.
601 |
602 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
603 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
604 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
605 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
606 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
607 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
608 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
609 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
610 | SUCH DAMAGES.
611 |
612 | 17. Interpretation of Sections 15 and 16.
613 |
614 | If the disclaimer of warranty and limitation of liability provided
615 | above cannot be given local legal effect according to their terms,
616 | reviewing courts shall apply local law that most closely approximates
617 | an absolute waiver of all civil liability in connection with the
618 | Program, unless a warranty or assumption of liability accompanies a
619 | copy of the Program in return for a fee.
620 |
621 | END OF TERMS AND CONDITIONS
622 |
623 | How to Apply These Terms to Your New Programs
624 |
625 | If you develop a new program, and you want it to be of the greatest
626 | possible use to the public, the best way to achieve this is to make it
627 | free software which everyone can redistribute and change under these terms.
628 |
629 | To do so, attach the following notices to the program. It is safest
630 | to attach them to the start of each source file to most effectively
631 | state the exclusion of warranty; and each file should have at least
632 | the "copyright" line and a pointer to where the full notice is found.
633 |
634 | <one line to give the program's name and a brief idea of what it does.>
635 | Copyright (C) <year>  <name of author>
636 |
637 | This program is free software: you can redistribute it and/or modify
638 | it under the terms of the GNU General Public License as published by
639 | the Free Software Foundation, either version 3 of the License, or
640 | (at your option) any later version.
641 |
642 | This program is distributed in the hope that it will be useful,
643 | but WITHOUT ANY WARRANTY; without even the implied warranty of
644 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
645 | GNU General Public License for more details.
646 |
647 | You should have received a copy of the GNU General Public License
648 | along with this program. If not, see <https://www.gnu.org/licenses/>.
649 |
650 | Also add information on how to contact you by electronic and paper mail.
651 |
652 | If the program does terminal interaction, make it output a short
653 | notice like this when it starts in an interactive mode:
654 |
655 | <program>  Copyright (C) <year>  <name of author>
656 | This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
657 | This is free software, and you are welcome to redistribute it
658 | under certain conditions; type `show c' for details.
659 |
660 | The hypothetical commands `show w' and `show c' should show the appropriate
661 | parts of the General Public License. Of course, your program's commands
662 | might be different; for a GUI interface, you would use an "about box".
663 |
664 | You should also get your employer (if you work as a programmer) or school,
665 | if any, to sign a "copyright disclaimer" for the program, if necessary.
666 | For more information on this, and how to apply and follow the GNU GPL, see
667 | <https://www.gnu.org/licenses/>.
668 |
669 | The GNU General Public License does not permit incorporating your program
670 | into proprietary programs. If your program is a subroutine library, you
671 | may consider it more useful to permit linking proprietary applications with
672 | the library. If this is what you want to do, use the GNU Lesser General
673 | Public License instead of this License. But first, please read
674 | <https://www.gnu.org/licenses/why-not-lgpl.html>.
675 |
--------------------------------------------------------------------------------
/ReadMe.md:
--------------------------------------------------------------------------------
1 | # Phil's Stopwatch for profiling Spark
2 |
3 | A tech blog showing some of the features of this library can be found [here](https://g1thubhub.github.io/4-bigdata-riddles)
4 |
5 | The Scala sources for the riddles can be found in this [companion project](https://github.com/g1thubhub/philstopwatch)
6 |
7 | The project can be locally installed as a module with the following command executed in its base directory:
8 | ```shell
9 | export PYSPARK_PYTHON=python3
10 | pip3 install -e .
11 | ```
12 | The input to all scripts in the [analytics](https://github.com/g1thubhub/phil_stopwatch/tree/master/analytics) folder consists of output records from a profiler or Spark logs. Two parser classes are defined in [parsers.py](https://github.com/g1thubhub/phil_stopwatch/blob/master/parsers.py) that are initialized with the path to an individual profile file ([ProfileParser](https://github.com/g1thubhub/phil_stopwatch/blob/7dc3431572874d99d18451ec7f93e16ad15ebd23/parsers.py#L59)) or to an individual Spark log file ([SparkLogParser](https://github.com/g1thubhub/phil_stopwatch/blob/7dc3431572874d99d18451ec7f93e16ad15ebd23/parsers.py#L239)). When creating an object of the [AppParser](https://github.com/g1thubhub/phil_stopwatch/blob/7dc3431572874d99d18451ec7f93e16ad15ebd23/parsers.py#L729) class, the constructor expects a path to an application directory under which many log files are located. Concrete examples are shown below.
13 |
14 | The design is compositional: Since records from different profilers could have been written to a single file along with logging messages, a _SparkLogParser_ object contains a _ProfileParser_ object in a member field that might never be needed or get initialized.
15 |
16 | Since an application folder typically contains many log files and only one of those, the master log file created by the driver (see below), contains all information about task/stage/job boundaries, an _AppParser_ object contains a list of _ProfileParser_ objects in a member field and most of its methods delegate to them.
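
For orientation, here is a minimal sketch of how the three parser types are constructed (the file paths refer to the sample data shipped with this repo and to the application folder used further below):
```python
from parsers import ProfileParser, SparkLogParser, AppParser

# Wraps a single profiler output file:
profile_parser = ProfileParser('./data/ProfileStraggler/CpuAndMemory.json.gz')

# Wraps a single Spark log file; its internal ProfileParser is only created when needed:
log_parser = SparkLogParser('./data/ProfileStraggler/JobStraggler.log.gz')

# Wraps a whole application directory that contains one log file per container:
app_parser = AppParser('./application_1_0001/')
```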
17 |
18 | The script [plot_application](https://github.com/g1thubhub/phil_stopwatch/blob/master/analytics/plot_application.py) gives a good overview: it extracts and plots metrics as well as scheduling info from all logs belonging to a Spark application. Further explanations and examples are included below.
19 |
20 | When visualizing metrics, normalization logic is applied by default so different metric types can be conveniently displayed in the same plot. This can be prevented by using a `normalize=False` parameter when constructing a _SparkLogParser_ or _ProfileParser_ object like so:
21 | ```python
22 | log_parser = SparkLogParser('./data/ProfileFatso/JobFatso.log.gz', normalize=False)
23 | ```
24 | The normalization logic is configured in [this dictionary](https://github.com/g1thubhub/phil_stopwatch/blob/7dc3431572874d99d18451ec7f93e16ad15ebd23/helper.py#L5) (the second element of each list is the normalization factor), which also defines all known metrics and can be extended for new profilers.
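
To give a rough idea of what such an entry could look like -- this is a purely hypothetical illustration, the real keys and values live in helper.py and may differ -- a new profiler's metrics would be registered with their normalization factor as the second list element:
```python
# Hypothetical shape only; not the actual contents of helper.py.
metric_config = {
    # metric name:        [<first element of the list>, <normalization factor>]
    'processCpuLoad':      ['...', 1],
    'heapMemoryTotalUsed': ['...', 1024 * 1024],
}
```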
25 |
26 | ## Uber's JVM profiler
27 | Uber's [JVM profiler](https://github.com/uber-common/jvm-profiler) has several advantages, so this project assumes that it will be used as the JVM profiler (though the [ProfileParser](https://github.com/g1thubhub/phil_stopwatch/blob/e83645f44e7fecf43331e0b1dcf9920a6deb027c/parsers.py#L59) code could easily be modified to use outputs from other profilers). The profiler JAR can be built with the following commands:
28 | ```terminal
29 | $ git clone https://github.com/uber-common/jvm-profiler.git
30 | $ cd jvm-profiler/
31 | $ mvn clean package
32 | [...]
33 | Replacing /users/phil/jvm-profiler/target/jvm-profiler-1.0.0.jar with /users/phil/jvm-profiler/target/jvm-profiler-1.0.0-shaded.jar
34 |
35 | $ ls
36 | -rw-r--r-- 1 a staff 7097056 9 Feb 10:07 jvm-profiler-1.0.0.jar
37 | drwxr-xr-x 3 a staff 96 9 Feb 10:07 maven-archiver
38 | drwxr-xr-x 3 a staff 96 9 Feb 10:07 maven-status
39 | -rw-r--r-- 1 a staff 92420 9 Feb 10:07 original-jvm-profiler-1.0.0.jar
40 | jvm-profiler-1.0.0.jar
41 |
42 | ```
43 | ... or you can use the JAR that I built from [here](https://github.com/g1thubhub/philstopwatch/blob/master/src/main/resources/jvm-profiler-1.0.0.jar)
44 |
45 |
46 | ### Profiling a single JVM / Spark in local mode
47 | The following command was used to generate the output uploaded [here](https://github.com/g1thubhub/phil_stopwatch/tree/master/analytics/data/ProfileStraggler) from the [JobStraggler](https://github.com/g1thubhub/philstopwatch/blob/master/src/main/scala/profile/sparkjobs/JobStraggler.scala) class:
48 | ```terminal
49 | ~/spark-2.4.0-bin-hadoop2.7/bin/spark-submit \
50 | --conf spark.driver.extraJavaOptions=-javaagent:/users/phil/jvm-profiler/target/jvm-profiler-1.0.0.jar=sampleInterval=1000,metricInterval=100,reporter=com.uber.profiling.reporters.FileOutputReporter,outputDir=./ProfileStraggler \
51 | --class profile.sparkjobs.JobStraggler \
52 | target/scala-2.11/philstopwatch-assembly-0.1.jar > JobStraggler.log
53 | ```
54 | If the JVM profiler is used in this fashion, three different kinds of records are generated by its _FileOutputReporter_ which are written to three separate JSON files: _ProcessInfo.json_, _CpuAndMemory.json_, and _Stacktrace.json_.
55 |
56 | ### Profiling executor JVMs
57 | When Spark is launched in distributed mode, most of the actual work is done by Spark executors that run on remote cluster nodes. In order to profile their VMs, the following `conf` setting would need to be added to the launch command above:
58 | ```terminal
59 | --conf spark.executor.extraJavaOptions=[...]
60 | ```
61 | This is redundant in "local" mode though since the driver and executor run in the same JVM -- the profile records are already created by including *--conf spark.driver.extraJavaOptions*. An actual example of the command used for profiling a legit distributed Spark job is included further below.
62 |
63 | ## Profiling PySpark
64 | The output of executing the PySpark edition of [Straggler](https://github.com/g1thubhub/phil_stopwatch/blob/master/spark_jobs/job_straggler.py) is included in [this](https://github.com/g1thubhub/phil_stopwatch/tree/master/analytics/data/profile_straggler) repo folder, here is the command that was used:
65 | ```terminal
66 | ~/spark-2.4.0-bin-hadoop2.7/bin/spark-submit \
67 | --conf spark.driver.extraJavaOptions=-javaagent:/users/phil/jvm-profiler/target/jvm-profiler-1.0.0.jar=sampleInterval=1000,metricInterval=100,reporter=com.uber.profiling.reporters.FileOutputReporter,outputDir=./profile_straggler \
68 | --conf spark.python.profile=true \
69 | ./spark_jobs/job_straggler.py cpumemstack /users/phil/phil_stopwatch/analytics/data/profile_straggler > Straggler_PySpark.log
70 | ```
71 | As in the run of the "Scala" edition above, the second line activates the JVM profiling; only the output directory name has changed (*profile_straggler* instead of *ProfileStraggler*). This line is optional here since a PySpark app is launched, but it can make sense to keep the JVM profiling active as a majority of the work is performed outside of Python. The third line and the two input arguments to the script in the last line (`cpumemstack` and `/users/phil/phil_stopwatch/analytics/data/profile_straggler`) are required for the actual PySpark profiler: The config parameter (*--conf spark.python.profile=true*) tells Spark that a custom profiler will be used, the first script argument *cpumemstack* specifies the profiler class (a profiler that tracks CPU, memory and the stack), and the second argument specifies the directory to which the profiler records will be saved. In that case, my PySpark profilers create one or two different types of output records that are stored in at least two JSON files matching the patterns _s_*_stack.json_ or _s_*_cpumem.json_.
72 |
73 | Three PySpark profiler classes are included in [pyspark_profilers.py](https://github.com/g1thubhub/phil_stopwatch/blob/master/pyspark_profilers.py): [StackProfiler](https://github.com/g1thubhub/phil_stopwatch/blob/e83645f44e7fecf43331e0b1dcf9920a6deb027c/pyspark_profilers.py#L119) can be used to catch stack traces in order to create flame graphs, [CpuMemProfiler](https://github.com/g1thubhub/phil_stopwatch/blob/e83645f44e7fecf43331e0b1dcf9920a6deb027c/pyspark_profilers.py#L27) captures CPU and memory usage, and [CpuMemStackProfiler](https://github.com/g1thubhub/phil_stopwatch/blob/e83645f44e7fecf43331e0b1dcf9920a6deb027c/pyspark_profilers.py#L206) is a combination of these two. In order to use them, the `profiler_cls` field needs to be set to the desired profiler class when constructing a *SparkContext*, as in this [example](https://github.com/g1thubhub/phil_stopwatch/blob/e83645f44e7fecf43331e0b1dcf9920a6deb027c/spark_jobs/job_straggler.py#L29), so there are three possible settings (a minimal sketch follows the list below):
74 | * profiler_cls=StackProfiler
75 | * profiler_cls=CpuMemProfiler
76 | * profiler_cls=CpuMemStackProfiler
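
A minimal sketch of such a setup, assuming `pyspark_profilers.py` and `helper.py` are importable on the driver (application name and dump path below are illustrative):
```python
from pyspark import SparkConf, SparkContext
from pyspark_profilers import CpuMemStackProfiler

# The custom-profiler switch plus the profiler class, mirroring the spark-submit flags above:
conf = SparkConf().setAppName('ProfiledJob').set('spark.python.profile', 'true')
sc = SparkContext(conf=conf, profiler_cls=CpuMemStackProfiler)

# ... run the actual Spark job here, then persist the collected profiles:
sc.dump_profiles('./profile_output')
```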
77 |
78 | I added a dictionary `profiler_map` in [helper.py](https://github.com/g1thubhub/phil_stopwatch/blob/master/helper.py) that links these class names with abbreviations that can be used as the actual script arguments when launching a PySpark app:
79 | * `spark-submit [...] your_script.py stack ` sets `profiler_cls=StackProfiler`
80 | * `spark-submit [...] your_script.py cpumem ` sets `profiler_cls=CpuMemProfiler`
81 | * `spark-submit [...] your_script.py cpumemstack ` sets `profiler_cls=CpuMemStackProfiler`
82 |
83 | The PySpark profiler code in [pyspark_profilers.py](https://github.com/g1thubhub/phil_stopwatch/blob/master/pyspark_profilers.py) needs a few auxiliary methods defined in helper.py. In case of a distributed application, Spark executors running on cluster nodes
84 | also need to access these two files. The easiest (but probably not most elegant) way of doing this is via SparkContext's `addFile` method; this solution is used [here](https://github.com/g1thubhub/phil_stopwatch/blob/e83645f44e7fecf43331e0b1dcf9920a6deb027c/spark_jobs/job_straggler.py#L31).
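
Continuing the `sc` from the sketch above, this could look roughly as follows (the paths are assumptions and need to point at your local checkout):
```python
# Make the two auxiliary modules available to all executors:
sc.addFile('/users/phil/phil_stopwatch/helper.py')
sc.addFile('/users/phil/phil_stopwatch/pyspark_profilers.py')
```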
85 |
86 | If such a distributed application is launched, it might not be possible to create output files in this fashion on storage systems like HDFS or S3, so profiler records might need to be written to the standard output instead. This can easily be accomplished by using the appropriate function in the application source code: The line ...
87 | ```python
88 | session.sparkContext.dump_profiles(dump_path)
89 | ```
90 | ... would change into ...
91 | ```python
92 | session.sparkContext.show_profiles()
93 | ```
94 | An example occurrence is [here](https://github.com/g1thubhub/phil_stopwatch/blob/e83645f44e7fecf43331e0b1dcf9920a6deb027c/spark_jobs/job_straggler.py#L46)
95 |
96 | The script [plot_slacker.py](https://github.com/g1thubhub/phil_stopwatch/blob/master/analytics/plot_slacker.py) demonstrates how to create a combined JVM/PySpark metrics plot and flame graph.
97 |
98 |
99 | ## Distributed Profiling
100 | The preceding paragraphs already described some of the changes required for distributed PySpark profiling. For distributed JVM profiling, the worker nodes need to access the `jvm-profiler-1.0.0.jar` file, so this JAR should be uploaded to the storage layer. In the case of S3, the copy command would be
101 | ```terminal
102 | aws s3 cp ./jvm-profiler-1.0.0.jar s3://your/bucket/jvm-profiler-1.0.0.jar
103 | ```
104 |
105 | In a cloud environment, it is likely that the `FileOutputReporter` used above (set via *reporter=com.uber.profiling.reporters.FileOutputReporter*) will not work since its source code does not seem to include functionality for interacting with storage layers like HDFS or S3. In these cases, the profiler output records can be written to standard out along with other messages. This happens by default when no explicit `reporter=` flag is set as in the following command:
106 | ```terminal
107 | spark-submit --deploy-mode cluster \
108 | --class your.Class \
109 | --conf spark.jars=s3://your/bucket/jvm-profiler-1.0.0.jar \
110 | --conf spark.driver.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar=sampleInterval=2000,metricInterval=1000 \
111 | --conf spark.executor.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar=sampleInterval=2000,metricInterval=1000 \
112 | s3://path/to/your/project.jar
113 | ```
114 | The actual source code of the Spark application is located in the class `Class` inside the package `your`, and all source code is packaged inside the JAR file `project.jar`. The third line specifies the location of the profiler JAR that all Spark executors and the driver need to be able to access in case they are profiled. They are indeed: the fourth and fifth lines activate *CpuAndMemory* profiling and *Stacktrace* sampling.
115 |
116 |
117 | ## Analyzing and Visualizing
118 |
119 | As already mentioned, the code in this repo operates on two types of input: the output records of a profiler and Spark log files. Since the design is compositional, the records can be mixed in one file or split across several files.
120 |
121 | To see which metrics were extracted, the method `.get_available_metrics()` returns a list of metric strings and is available on a _ProfileParser_ or _SparkLogParser_ object; the _SparkLogParser_ object would simply call the `get_available_metrics()` function of its enclosed _ProfileParser_. The same logic applies to `make_graph()`, which constructs a graph from all metric values, and to the more selective `get_metrics(['metric_name'])`, which builds a graph only for the metrics included in its list argument.
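
A short sketch using the sample profile shipped with this repo; the string passed to `get_metrics` is a placeholder and should be one of the names returned by `get_available_metrics()`:
```python
from parsers import ProfileParser

profile_parser = ProfileParser('./data/ProfileStraggler/CpuAndMemory.json.gz')
print(profile_parser.get_available_metrics())                 # list of extracted metric names

all_lines = profile_parser.make_graph()                       # graph lines for every known metric
selected_lines = profile_parser.get_metrics(['metric_name'])  # graph lines for selected metrics only
```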
122 |
123 | To handle mixed JVM and PySpark profiles in the same file or to selectively build graphs for records from a specific profile subset, check [this script](https://github.com/g1thubhub/phil_stopwatch/blob/master/analytics/plot_fatso.py).
124 |
125 | Calling `get_executor_logparsers` on an *AppParser* object returns a list of all encapsulated *ProfileParser* objects. Most class methods of [AppParser](https://github.com/g1thubhub/phil_stopwatch/blob/7dc3431572874d99d18451ec7f93e16ad15ebd23/parsers.py#L729) delegate to or accumulate values from these *ProfileParser* objects; a good demonstration is [plot_application.py](https://github.com/g1thubhub/phil_stopwatch/blob/master/analytics/plot_application.py), which contains many examples.
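
For example, a loop like the following could be used to inspect every encapsulated parser (a sketch; the application directory is the one downloaded in the section below):
```python
from parsers import AppParser

app_parser = AppParser('./application_1_0001/')       # one internal parser per container log
for parser in app_parser.get_executor_logparsers():
    print(parser.get_available_metrics())             # delegate to the individual parsers
```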
126 |
127 | ### Analyzing and visualizing a local job / 4 riddles
128 | The source code of the four riddles inside the [spark_jobs](https://github.com/g1thubhub/phil_stopwatch/tree/master/spark_jobs) folder was executed in "local mode". Several scripts that produce visualizations and reports of the output of these riddles are included in the [analytics](https://github.com/g1thubhub/phil_stopwatch/tree/master/analytics) folder.
129 |
130 | An example of a script that extracts "everything" -- metric graphs as well as Spark task/stage/job boundaries -- and visualizes it is pasted below:
131 |
132 | ```python
133 | from plotly.offline import plot
134 | from plotly.graph_objs import Figure, Scatter
135 | from typing import List
136 | from parsers import ProfileParser, SparkLogParser
137 | from helper import get_max_y
138 |
139 |
140 | # Create a ProfileParser object to extract metrics graph:
141 | profile_file = './data/ProfileStraggler/CpuAndMemory.json.gz' # Output from JVM profiler
142 | profile_parser = ProfileParser(profile_file, normalize=True) # normalize various metrics
143 | data_points: List[Scatter] = profile_parser.make_graph() # create graph lines of various metrics
144 |
145 | # Create a SparkLogParser object to extract task/stage/job boundaries:
146 | log_file = './data/ProfileStraggler/JobStraggler.log.gz' # standard Spark log
147 | log_parser = SparkLogParser(log_file)
148 |
149 | max_y: int = get_max_y(data_points)  # maximum y-value, used to scale the task lines extracted below
150 | task_data: List[Scatter] = log_parser.graph_tasks(max_y)  # create graph lines for all Spark tasks
151 | data_points.extend(task_data)
152 |
153 | stage_interval_markers: Scatter = log_parser.extract_stage_markers()  # extract stage boundaries, shown on the x-axis
154 | data_points.append(stage_interval_markers)
155 | layout = log_parser.extract_job_markers(max_y)  # extract job boundaries, shown as vertical dotted lines
156 |
157 | # Plot the actual graph and save it in 'everything.html'
158 | fig = Figure(data=data_points, layout=layout)
159 | plot(fig, filename='everything.html')
160 |
161 | ```
162 |
163 | ### Analyzing and visualizing a distributed application
164 | When launching a distributed application, Spark executors run on multiple nodes in a cluster and produce several log files, one per executor/container. In a cloud environment like AWS, these log files will be organized in the following structure:
165 | ```terminal
166 | s3://aws-logs/elasticmapreduce/clusterid-1/containers/application_1_0001/container_1_001/
167 | stderr.gz
168 | stdout.gz
169 | s3://aws-logs/elasticmapreduce/clusterid-1/containers/application_1_0001/container_1_002/
170 | stderr.gz
171 | stdout.gz
172 | [...]
173 | s3://aws-logs/elasticmapreduce/clusterid-1/containers/application_1_0001/container_1_N/
174 | stderr.gz
175 | stdout.gz
176 |
177 | [...]
178 |
179 | s3://aws-logs/elasticmapreduce/clusterid-M/containers/application_K_0001/container_K_L/
180 | stderr.gz
181 | stdout.gz
182 | ```
183 | An EMR cluster like `clusterid-1` might run several Spark applications consecutively, each one as its own step. Each application launched a number of containers: `application_1_0001`, for example, launched the executors `container_1_001`, `container_1_002`, ..., `container_1_N`. Each of these containers created a standard error and a standard out file on S3. In order to analyze a particular application like `application_1_0001` above, all of its associated log files like *.../application_1_0001/container_1_001/stderr.gz* and *.../application_1_0001/container_1_001/stdout.gz* are needed. The easiest way is to collect all files under the _application_ folder using a command like ...
184 | ```terminal
185 | aws s3 cp --recursive s3://aws-logs/elasticmapreduce/clusterid-1/containers/application_1_0001/ ./application_1_0001/
186 | ```
187 | ... and then to create an [AppParser](https://github.com/g1thubhub/phil_stopwatch/blob/d2e1697c380e7e5a3f16d064131f66da2f0d98ac/parsers.py#L729) object like
188 | ```python
189 | from parsers import AppParser
190 | app_path = './application_1_0001/' # path to the application directory downloaded from s3 above
191 | app_parser = AppParser(app_path)
192 | ```
193 | This object creates a number of [SparkLogParser](https://github.com/g1thubhub/phil_stopwatch/blob/d2e1697c380e7e5a3f16d064131f66da2f0d98ac/parsers.py#L239) objects internally (one for each container) and automatically identifies the "master" log file created by the Spark driver (likely located under `application_1_0001/container_1_001/`). Several useful functions can now be called on `app_parser`; example scripts are located in the [analytics](https://github.com/g1thubhub/phil_stopwatch/tree/master/analytics) folder and more detailed explanations can be found in the [readme](https://github.com/g1thubhub/phil_stopwatch) file.
194 |
195 | ### Finding Error messages and top log chunks
196 | The script [extract_heckler](https://github.com/g1thubhub/phil_stopwatch/blob/master/analytics/extract_heckler.py) shows how to extract top log chunks and the most recent error messages from an individual log file or from a collection of log files that form an application:
197 |
198 | In the case of "top log chunks", the function `SparkLogParser.get_top_log_chunks` applies a pattern matching and collapsing algorithm across multiple consecutive log lines and creates a ranked list of these top log chunks as output.
199 |
200 | The function `AppParser.extract_errors()` tries to deduplicate potential exceptions and error messages and prints them out in reverse chronological order. An exception or error message might occur several times during a run with slight variations (e.g., different timestamps or code line numbers), but the last occurrence is the most important one for debugging purposes since it might be the direct cause of the failure.
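
A combined sketch of both calls, assuming `get_top_log_chunks` returns the ranked list described above (paths refer to the sample data in this repo and to the application folder from the previous section):
```python
from parsers import SparkLogParser, AppParser

# Ranked list of collapsed "top log chunks" from a single log file:
log_parser = SparkLogParser('./data/ProfileFatso/JobFatso.log.gz')
top_chunks = log_parser.get_top_log_chunks()

# Deduplicated exceptions/errors of a whole application, most recent first:
app_parser = AppParser('./application_1_0001/')
app_parser.extract_errors()
```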
201 |
202 | ### Creating Flame Graphs
203 | The profilers described above might produce stacktraces -- *Stacktrace.json* files in the case of the JVM profiler and *s_N_stack.json* files in the case of the PySpark profilers. These outputs can be folded and transformed into flame graphs with the help of my [fold_stacks.py](https://github.com/g1thubhub/phil_stopwatch/blob/master/fold_stacks.py) script and this external script: [flamegraph.pl](https://github.com/brendangregg/FlameGraph/blob/master/flamegraph.pl)
204 |
205 | For JVM stack traces like *./analytics/data/ProfileFatso/StacktraceFatso.json.gz*, use
206 | ```terminal
207 | Phils-MacBook-Pro:analytics a$ python3 fold_stacks.py ./analytics/data/ProfileFatso/StacktraceFatso.json.gz > Fatso.folded
208 | Phils-MacBook-Pro:analytics a$ perl flamegraph.pl Fatso.folded > FatsoFlame.svg
209 | ```
210 | The final output file *FatsoFlame.svg* can be opened in a browser. The procedure is identical for PySpark stacktraces like *./analytics/data/profile_fatso/s_8_stack.json*:
211 | ```terminal
212 | Phils-MacBook-Pro:analytics a$ python3 fold_stacks.py ./analytics/data/profile_fatso/s_8_stack.json > FatsoPyspark.folded
213 | Phils-MacBook-Pro:analytics a$ perl flamegraph.pl FatsoPyspark.folded > FatsoPySparkFlame.svg
214 | ```
215 | The script [plot_slacker.py](https://github.com/g1thubhub/phil_stopwatch/blob/master/analytics/plot_slacker.py) mentions the steps needed to create a combined JVM/PySpark flame graph.
216 |
217 | Made at https://github.com/g1thubhub/phil_stopwatch by writingphil@gmail.com
218 |
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/__init__.py
--------------------------------------------------------------------------------
/analytics/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/__init__.py
--------------------------------------------------------------------------------
/analytics/concurrency_profile.py:
--------------------------------------------------------------------------------
1 | from plotly.offline import plot
2 | from parsers import AppParser, SparkLogParser
3 | import plotly.graph_objs as go
4 | from plotly.graph_objs import Figure
5 | from helper import get_max_y
6 |
7 | #################################################################################################
8 |
9 | # Concurrency profile for Straggler:
10 | logfile = './data/ProfileStraggler/JobStraggler.log.gz'
11 | log_parser = SparkLogParser(logfile)
13 | data = []
14 | interval_markers = log_parser.extract_stage_markers()  # stage boundaries, shown on the x-axis
15 | data.append(interval_markers)
16 |
17 | active_tasks = log_parser.get_active_tasks_plot()
18 | data.append(active_tasks)
19 | job_intervals = log_parser.job_intervals
20 | max_y = get_max_y(data)
21 | layout = log_parser.extract_job_markers(max_y)
22 | trace0 = go.Scatter()
23 | fig = Figure(data=data, layout=layout)
24 | plot(fig, filename='conc-straggler.html')
25 |
26 |
27 | #################################################################################################
28 |
29 | # Concurrency profile for whole application:
30 | log_path = '/data/application_1597675138635_0007/'
31 | app_parser = AppParser(log_path)
32 | data = []
33 | active_tasks = app_parser.get_active_tasks_plot()
34 | data.append(active_tasks)
35 | job_intervals = app_parser.get_job_intervals
36 | max_y = get_max_y(data)
37 | layout = app_parser.extract_job_markers(max_y)
38 | trace0 = go.Scatter()
39 | fig = Figure(data=data, layout=layout)
40 | plot(fig, filename='conc-app.html')
41 |
--------------------------------------------------------------------------------
/analytics/data/ProfileFatso/CpuAndMemoryFatso.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/ProfileFatso/CpuAndMemoryFatso.json.gz
--------------------------------------------------------------------------------
/analytics/data/ProfileFatso/JobFatso.log.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/ProfileFatso/JobFatso.log.gz
--------------------------------------------------------------------------------
/analytics/data/ProfileFatso/StacktraceFatso.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/ProfileFatso/StacktraceFatso.json.gz
--------------------------------------------------------------------------------
/analytics/data/ProfileStraggler/CpuAndMemory.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/ProfileStraggler/CpuAndMemory.json.gz
--------------------------------------------------------------------------------
/analytics/data/ProfileStraggler/JobStraggler.log.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/ProfileStraggler/JobStraggler.log.gz
--------------------------------------------------------------------------------
/analytics/data/ProfileStraggler/Stacktrace.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/ProfileStraggler/Stacktrace.json.gz
--------------------------------------------------------------------------------
/analytics/data/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/__init__.py
--------------------------------------------------------------------------------
/analytics/data/profile_fatso/CombinedProfile.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_fatso/CombinedProfile.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_fatso/CpuAndMemory.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_fatso/CpuAndMemory.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_fatso/Stacktrace.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_fatso/Stacktrace.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_fatso/s_8_7510_cpumem.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_fatso/s_8_7510_cpumem.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_fatso/s_8_7511_cpumem.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_fatso/s_8_7511_cpumem.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_fatso/s_8_7512_cpumem.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_fatso/s_8_7512_cpumem.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_fatso/s_8_stack.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_fatso/s_8_stack.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_slacker/CombinedCpuAndMemory.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_slacker/CombinedCpuAndMemory.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_slacker/CombinedStack.folded.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_slacker/CombinedStack.folded.gz
--------------------------------------------------------------------------------
/analytics/data/profile_slacker/CombinedStack.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_slacker/CombinedStack.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_slacker/CombinedStack.svg.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_slacker/CombinedStack.svg.gz
--------------------------------------------------------------------------------
/analytics/data/profile_slacker/CpuAndMemory.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_slacker/CpuAndMemory.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_slacker/JobSlacker.log.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_slacker/JobSlacker.log.gz
--------------------------------------------------------------------------------
/analytics/data/profile_slacker/Stacktrace.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_slacker/Stacktrace.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_slacker/s_8_931_cpumem.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_slacker/s_8_931_cpumem.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_slacker/s_8_932_cpumem.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_slacker/s_8_932_cpumem.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_slacker/s_8_933_cpumem.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_slacker/s_8_933_cpumem.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_slacker/s_8_stack.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_slacker/s_8_stack.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_straggler/CpuAndMemory.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_straggler/CpuAndMemory.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_straggler/ProcessInfo.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_straggler/ProcessInfo.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_straggler/Stacktrace.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_straggler/Stacktrace.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_straggler/Straggler_PySpark.log.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_straggler/Straggler_PySpark.log.gz
--------------------------------------------------------------------------------
/analytics/data/profile_straggler/s_1_1401_cpumem.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_straggler/s_1_1401_cpumem.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_straggler/s_1_1402_cpumem.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_straggler/s_1_1402_cpumem.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_straggler/s_1_1403_cpumem.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_straggler/s_1_1403_cpumem.json.gz
--------------------------------------------------------------------------------
/analytics/data/profile_straggler/s_1_stack.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/analytics/data/profile_straggler/s_1_stack.json.gz
--------------------------------------------------------------------------------
/analytics/extract_everything.py:
--------------------------------------------------------------------------------
1 | from plotly.offline import plot
2 | from plotly.graph_objs import Figure, Scatter
3 | from typing import List
4 | from parsers import ProfileParser, SparkLogParser
5 | from helper import get_max_y
6 |
7 |
8 | # Create a ProfileParser object to extract metric graph lines:
9 | profile_file = './data/ProfileStraggler/CpuAndMemory.json.gz' # Output from JVM profiler
10 | profile_parser = ProfileParser(profile_file, normalize=True) # normalize various metrics
11 | data_points: List[Scatter] = profile_parser.make_graph() # create graph lines of various metrics
12 |
13 | # Create a SparkLogParser object to extract task/stage/job boundaries:
14 | log_file = './data/ProfileStraggler/JobStraggler.log.gz' # standard Spark log
15 | log_parser = SparkLogParser(log_file)
16 |
17 | max: int = get_max_y(data_points)  # maximum y-value, used to scale the task lines extracted below
18 | task_data: List[Scatter] = log_parser.graph_tasks(max) # create graph lines of all Spark tasks
19 | data_points.extend(task_data)
20 |
21 | stage_interval_markers: Scatter = log_parser.extract_stage_markers()  # extract stage boundaries; they will be shown on the x-axis
22 | data_points.append(stage_interval_markers)
23 | layout = log_parser.extract_job_markers(max)  # extract job boundaries; they will be shown as vertical dotted lines
24 |
25 | # Plot the actual graph and save it in 'everything.html'
26 | fig = Figure(data=data_points, layout=layout)
27 | plot(fig, filename='everything.html')
28 |
--------------------------------------------------------------------------------
/analytics/extract_heckler.py:
--------------------------------------------------------------------------------
1 | from parsers import SparkLogParser, AppParser
2 | from typing import Tuple, List, Deque
3 |
4 | #################################################################################################
5 |
6 | # Collapsed top log chunks:
7 | log_file = './data/ProfileHeckler1/JobHeckler1.log.gz'
8 | log_parser = SparkLogParser(log_file)
9 | collapsed_ranked_log: List[Tuple[int, List[str]]] = log_parser.get_top_log_chunks()
10 | for line in collapsed_ranked_log[:5]: # print 5 most frequently occurring log chunks
11 | print(line)
12 |
13 | #################################################################################################
14 |
15 | # Extracting errors from an application:
16 | app_path = './data/application_15496751386_0005/'
17 | app_parser = AppParser(app_path)
18 | app_errors: Deque[Tuple[str, List[str]]] = app_parser.extract_errors()
19 |
20 | for error in app_errors:
21 | print(error)
22 |
--------------------------------------------------------------------------------
/analytics/plot_application.py:
--------------------------------------------------------------------------------
1 | from plotly.offline import plot
2 | from typing import List
3 | from parsers import AppParser, SparkLogParser
4 | from helper import get_max_y
5 | from plotly.graph_objs import Figure, Scatter
6 |
7 | # Path to the log files of an application, structure:
8 | application_path = './data/application_1551152464841_0001'
9 |
10 | #################################################################################################
11 |
12 | # Active tasks for the whole application, parses master log file internally
13 | app_parser = AppParser(application_path)
14 | data: List[Scatter] = list()
15 |
16 | active_tasks = app_parser.get_active_tasks_plot()
17 | job_intervals = app_parser.get_job_intervals()  # job id -> (start, end) intervals from the master log
18 |
19 | data.append(active_tasks)
20 | max_y = get_max_y(data)  # maximum y-value, used to scale the job markers below
21 | layout = app_parser.extract_job_markers(max_y)
22 | fig = Figure(data=data, layout=layout)
23 | plot(fig, filename='bigjob-concurrency.html')
24 |
25 |
26 | #################################################################################################
27 |
28 | # memory profile for all executors, parses all log files except for the "master log file" internally
29 | app_parser = AppParser(application_path)
30 | data_points: List[Scatter] = list()
31 |
32 | executor_logs: List[SparkLogParser] = app_parser.get_executor_logparsers()
33 | for parser in executor_logs:
34 | # print(parser.get_available_metrics()) # ['epochMillis', 'ScavengeCollTime', 'MarkSweepCollTime', 'MarkSweepCollCount', 'ScavengeCollCount', 'systemCpuLoad', 'processCpuLoad', 'nonHeapMemoryTotalUsed', 'nonHeapMemoryCommitted', 'heapMemoryTotalUsed', 'heapMemoryCommitted']
35 | relevant_metric: List[Scatter] = parser.get_metrics(['heapMemoryTotalUsed'])
36 | data_points.extend(relevant_metric)
37 |
38 | max_y = get_max_y(data_points)  # helper function, maximum y value needed for cosmetic reasons (scaling task lines)
39 | stage_interval_markers = app_parser.extract_stage_markers()
40 |
41 | data_points.append(stage_interval_markers)
42 | layout = app_parser.extract_job_markers(max_y)
43 | fig = Figure(data=data_points, layout=layout)
44 | plot(fig, filename='bigjob-memory.html')
45 |
46 |
47 | #################################################################################################
48 |
49 | # tasks, stages, job annotations for the whole application, parses "master log file" internally
50 | app_parser = AppParser(application_path)
51 | data_points: List[Scatter] = list()
52 |
53 | max_y = 1000 # heuristic since we're not plotting metrics here, for cosmetic purposes
54 |
55 | task_markers = app_parser.graph_tasks(max_y)
56 | stage_interval_markers = app_parser.extract_stage_markers()
57 | layout = app_parser.extract_job_markers(max_y)
58 |
59 | data_points.append(stage_interval_markers)
60 | data_points.extend(task_markers)
61 |
62 | fig = Figure(data=data_points, layout=layout)
63 | plot(fig, filename='bigjob-tasks.html')
--------------------------------------------------------------------------------
/analytics/plot_fatso.py:
--------------------------------------------------------------------------------
1 | from plotly.offline import plot
2 | from parsers import ProfileParser, SparkLogParser
3 | from plotly.graph_objs import Figure, Scatter
4 | from typing import List
5 |
6 | # The two files used below are created by running
7 | # ~/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --class profile.sparkjobs.JobFatso --conf spark.driver.extraJavaOptions=-javaagent:/Users/phil/jvm-profiler/target/jvm-profiler-1.0.0.jar=sampleInterval=100,metricInterval=100,reporter=com.uber.profiling.reporters.FileOutputReporter,outputDir=./ProfileFatso target/scala-2.11/philstopwatch-assembly-0.1.jar > JobFatso.log
8 |
9 | profile_file = './data/ProfileFatso/CpuAndMemoryFatso.json.gz' # Output from JVM profiler
10 | profile_parser = ProfileParser(profile_file, normalize=True)
11 | data_points: List[Scatter] = profile_parser.make_graph()
12 |
13 |
14 | logfile = './data/ProfileFatso/JobFatso.log.gz' # standard Spark logs
15 | log_parser = SparkLogParser(logfile)
16 | stage_interval_markers: Scatter = log_parser.extract_stage_markers()
17 | data_points.append(stage_interval_markers)
18 |
19 | layout = log_parser.extract_job_markers(700)
20 | fig = Figure(data=data_points, layout=layout)
21 | plot(fig, filename='fatso.html')
22 |
23 |
24 | #################################################################################################
25 |
26 | # Profiling PySpark & JVM:
27 | # Run with
28 | # ~/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --conf spark.python.profile=true --conf spark.driver.extraJavaOptions=-javaagent:/Users/phil/jvm-profiler/target/jvm-profiler-1.0.0.jar=sampleInterval=100,metricInterval=100,reporter=com.uber.profiling.reporters.FileOutputReporter,outputDir=/Users/phil/IdeaProjects/phil_stopwatch/analytics/data/profile_fatso ./spark_jobs/job_fatso.py cpumemstack /Users/phil/IdeaProjects/phil_stopwatch/analytics/data/profile_fatso > Fatso_PySpark.log
29 | #
30 | # Easiest way is to concatenate all JVM & PySpark profiles into a single file first via
31 | # cat data/profile_fatso/s_8_7510_cpumem.json <(echo) data/profile_fatso/s_8_7511_cpumem.json <(echo) data/profile_fatso/s_8_7512_cpumem.json <(echo) data/profile_fatso/CpuAndMemory.json > data/profile_fatso/CombinedProfile.json
32 | # ...and then applying some manual settings:
33 |
34 | data_points = list()
35 | combined_file = './data/profile_fatso/CombinedProfile.json'
36 |
37 | jvm_parser = ProfileParser(combined_file)
38 | jvm_parser.manually_set_profiler('JVMProfiler')
39 | data_points.extend(jvm_parser.make_graph())
40 |
41 | pyspark_parser = ProfileParser(combined_file)
42 | pyspark_parser.manually_set_profiler('pyspark')
43 | # Records from different profilers are in a single file so these IDs
44 | data_points.extend(pyspark_parser.make_graph(id='7510'))  # are used to collect all PySpark records. They were the
45 | data_points.extend(pyspark_parser.make_graph(id='7511'))  # process IDs when the code was profiled, present in every JSON record
46 | data_points.extend(pyspark_parser.make_graph(id='7512'))  # output by the PySpark profiler as the value of the 'pid' field
47 |
48 | fig = Figure(data=data_points)
49 | plot(fig, filename='fatso-pyspark.html')
--------------------------------------------------------------------------------
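The `cat ... <(echo) ...` command quoted in the comments of plot_fatso.py relies on bash process substitution only to guarantee a newline between the concatenated profile files. A shell-independent sketch of the same step is shown below; the file names are taken from that comment, and the relative paths are an assumption about the local layout.

```python
# Sketch: concatenate PySpark and JVM profile records into one combined file, one JSON record per line
inputs = ['data/profile_fatso/s_8_7510_cpumem.json',
          'data/profile_fatso/s_8_7511_cpumem.json',
          'data/profile_fatso/s_8_7512_cpumem.json',
          'data/profile_fatso/CpuAndMemory.json']

with open('data/profile_fatso/CombinedProfile.json', 'w') as combined:
    for path in inputs:
        with open(path) as part:
            combined.write(part.read().rstrip('\n') + '\n')  # ensure a newline between files
```
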
/analytics/plot_slacker.py:
--------------------------------------------------------------------------------
1 | from plotly.offline import plot
2 | from plotly.graph_objs import Figure
3 | from typing import Dict
4 | from parsers import ProfileParser
5 |
6 | # job_slacker.py executed with the following command:
7 | # ~/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --conf spark.python.profile=true --conf spark.driver.extraJavaOptions=-javaagent:/Users/a/jvm-profiler/target/jvm-profiler-1.0.0.jar=sampleInterval=1000,metricInterval=100,reporter=com.uber.profiling.reporters.FileOutputReporter,outputDir=/Users/a/IdeaProjects/phil_stopwatch/analytics/data/profile_slacker ./spark_jobs/job_slacker.py cpumemstack /Users/a/IdeaProjects/phil_stopwatch/analytics/data/profile_slacker > JobSlacker.log
8 |
9 | # cat data/profile_slacker/s_8_931_cpumem.json <(echo) data/profile_slacker/s_8_932_cpumem.json <(echo) data/profile_slacker/s_8_933_cpumem.json <(echo) data/profile_slacker/CpuAndMemory.json > CombinedCpuAndMemory.json
10 | combined_file = './data/profile_slacker/CombinedCpuAndMemory.json.gz'  # Output from JVM & PySpark profilers created by the commands above
11 |
12 | jvm_parser = ProfileParser(combined_file)
13 | jvm_parser.manually_set_profiler('JVMProfiler')
14 |
15 | pyspark_parser = ProfileParser(combined_file)
16 | pyspark_parser.manually_set_profiler('pyspark')
17 |
18 | jvm_maxima: Dict[str, float] = jvm_parser.get_maxima()
19 | pyspark_maxima: Dict[str, float] = pyspark_parser.get_maxima()
20 |
21 | print('JVM max values:')
22 | print(jvm_maxima)
23 | print('\nPySpark max values:')
24 | print(pyspark_maxima)
25 |
26 | # Plotting the graphs:
27 | # Seeing all available metrics:
28 | # print(jvm_parser.get_available_metrics())
29 | # ['epochMillis', 'ScavengeCollTime', 'MarkSweepCollTime', 'MarkSweepCollCount', 'ScavengeCollCount', 'systemCpuLoad', 'processCpuLoad', 'nonHeapMemoryTotalUsed', 'nonHeapMemoryCommitted', 'heapMemoryTotalUsed', 'heapMemoryCommitted']
30 | # print(pyspark_parser.get_available_metrics())
31 | # ['epochMillis', 'pmem_rss', 'pmem_vms', 'cpu_percent']
32 |
33 | data_points = list()
34 | data_points.extend(jvm_parser.get_metrics(['systemCpuLoad', 'processCpuLoad']))
35 | # Records from different profilers are in a single file so these IDs
36 | data_points.extend(pyspark_parser.get_metrics(['cpu_percent_931'], id='931'))  # are used to collect all PySpark records. They were the
37 | # process IDs when the code was profiled, present in every JSON record
38 | # output by the PySpark profiler as the value of the 'pid' field
39 | fig = Figure(data=data_points)
40 | plot(fig, filename='slacker-cpu.html')
41 |
--------------------------------------------------------------------------------
/analytics/plot_straggler.py:
--------------------------------------------------------------------------------
1 | from plotly.offline import plot
2 | from plotly.graph_objs import Figure, Scatter
3 | from parsers import ProfileParser, SparkLogParser
4 | from helper import get_max_y
5 | from typing import List
6 |
7 | profile_file = './data/ProfileStraggler/CpuAndMemory.json.gz' # Output from JVM profiler
8 | profile_parser = ProfileParser(profile_file, normalize=True)
9 | # data_points = profile_parser.ignore_metrics(['ScavengeCollCount'])
10 | data_points: List[Scatter] = profile_parser.get_metrics(['systemCpuLoad', 'processCpuLoad'])
11 |
12 | log_file = './data/ProfileStraggler/JobStraggler.log.gz' # standard Spark log
13 | log_parser = SparkLogParser(log_file)
14 |
15 | max = get_max_y(data_points)
16 | task_data = log_parser.graph_tasks(max)
17 |
18 | data_points.extend(task_data)
19 | stage_interval_markers = log_parser.extract_stage_markers()
20 | data_points.append(stage_interval_markers)
21 |
22 | layout = log_parser.extract_job_markers(max)
23 | fig = Figure(data=data_points, layout=layout)
24 | plot(fig, filename='straggler.html')
25 |
--------------------------------------------------------------------------------
/fold_stacks.py:
--------------------------------------------------------------------------------
1 | from parsers import StackParser
2 | from sys import argv
3 |
4 | if len(argv) < 2:
5 |     raise RuntimeError('a valid stacktrace file has to be provided as a script argument')
6 | elif len(argv) == 2:  # exactly one stacktrace file was provided
7 | StackParser.convert_file(argv[1])
8 | else:
9 | StackParser.convert_files(argv[1:])
10 |
11 | # Made at https://github.com/g1thubhub/phil_stopwatch by writingphil@gmail.com
--------------------------------------------------------------------------------
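fold_stacks.py emits folded stacks on standard output: one line per unique stack, frames joined by semicolons and followed by the sample count, which is the input format expected by common flamegraph tooling (the repository's profile_slacker data already contains a CombinedStack.folded.gz and a CombinedStack.svg.gz presumably rendered from it). The same conversion can be driven programmatically; the path below is an assumption pointing at one of the bundled stack traces, and the sketch assumes it is run from the repository root.

```python
# Sketch: fold one of the bundled stack traces; folded lines look like "frame1;frame2;frame3 42"
from parsers import StackParser

StackParser.convert_file('./analytics/data/profile_slacker/Stacktrace.json.gz')

# Command-line equivalent, redirecting the folded output into a file:
#   python fold_stacks.py ./analytics/data/profile_slacker/Stacktrace.json.gz > CombinedStack.folded
```
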
/helper.py:
--------------------------------------------------------------------------------
1 | import plotly.graph_objs as go
2 | import time
3 | from numbers import Number
4 |
5 | metric_definitions = """[
6 | ["JVMProfiler", {
7 | "epochMillis": ["ms", "", true],
8 | "ScavengeCollTime": ["ms", "ms_to_minutes", false],
9 | "MarkSweepCollTime": ["ms", "ms_to_minutes", false],
10 | "MarkSweepCollCount": ["int", "", false],
11 | "ScavengeCollCount": ["int", "", false],
12 | "systemCpuLoad": ["float", "", true],
13 | "processCpuLoad": ["float", "", true],
14 | "nonHeapMemoryTotalUsed": ["byte", "byte_to_mb", true],
15 | "nonHeapMemoryCommitted": ["byte", "byte_to_mb", true],
16 | "heapMemoryTotalUsed": ["byte", "byte_to_mb", true],
17 | "heapMemoryCommitted": ["byte", "byte_to_mb", true]
18 | }
19 | ],
20 | ["PySparkPhilProfiler", {
21 | "epochMillis": ["ms", "", true],
22 | "pmem_rss": ["byte", "byte_to_mb", true],
23 | "pmem_vms": ["byte", "byte_to_mb", true],
24 | "cpu_percent": ["float", "", true]
25 | }
26 | ]
27 | ]"""
28 |
29 | # For PySpark jobs:
30 |
31 | class Object:
32 | def __init__(self, string):
33 | self.string = string
34 |
35 | # For log & profile parsing
36 |
37 | time_patterns = {
38 | r'^(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)': '%Y-%m-%d %H:%M:%S', # 2019-01-05 19:38:41
39 | r'^(\d{4}/\d\d/\d\d \d\d:\d\d:\d\d)': '%Y/%m/%d %H:%M:%S', # 2019/01/05 19:38:41
40 | r'^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d)': '%y/%m/%d %H:%M:%S', # 19/01/05 19:38:41
41 | r'^(\d\d-\d\d-\d\d \d\d:\d\d:\d\d)': '%y-%m-%d %H:%M:%S', # 19-01-05 19:38:41
42 | }
43 |
44 |
45 |
46 | # 2019-01-05 10:21:57 INFO SparkContext:54 - Submitted application: Profile (1)
47 | r_app_start = r'.* submitted application: (.*)'
48 |
49 | # 2019-01-05 10:30:21 INFO TaskSetManager:54 - Starting task 0.0 in stage 1.0 (TID 3, localhost, executor driver, partition 0, PROCESS_LOCAL, 8249 bytes)
50 | # Ri
51 | r_task_start = r'.* (?:starting|running) task (\d+\.\d+) in stage (\d+\.\d+).*'
52 |
53 | # 2019-01-05 10:30:21 INFO Executor:54 - Finished task 2.0 in stage 1.0 (TID 5). 1547 bytes result sent to driver
54 | r_task_end = r'.* finished task (\d+\.\d+) in stage (\d+\.\d+).*'
55 |
56 | # 2019-01-05 10:22:02 INFO DAGScheduler:54 - Got job 0 (foreach at ProfileStragglerSM.scala:37) with 3 output partitions
57 | r_job_start = r'.* got job (\d+) '
58 | # 2019-01-05 10:30:19 INFO DAGScheduler:54 - Job 0 finished: foreach at ProfileStragglerSM.scala:37, took 497.646962 s
59 | r_job_end = r'.* job (\d+) finished'
60 |
61 |
62 | r_spark_log = r'.* (error|info|warn)(.*)'
63 |
64 | r_container_id = r'.*container_\d+_\d+_\d+_0*(\d+).*'
65 |
66 |
67 |
68 |
69 | # Unit conversion methods
70 | def identity(num):
71 | return num
72 |
73 | def ms_to_seconds(ms):
74 | return ms / 1000
75 |
76 | def ms_to_minutes(ms):
77 | return ms / 60000
78 |
79 | def ns_to_minute(ns):
80 | return ns / 60000000000
81 |
82 | def byte_to_mb(bytes):
83 | return bytes / (1024*1024)
84 |
85 | def to_epochms(spark_timestamp):
86 | return int(time.mktime(spark_timestamp.timetuple()) * 1000 + spark_timestamp.microsecond/1000)
87 |
88 |
89 | conversion_map = {
90 | "": identity,
91 | "ns_to_minute": ns_to_minute,
92 | "ms_to_minutes": ms_to_minutes,
93 | "byte_to_mb": byte_to_mb
94 | }
95 |
96 |
97 | def extract_nested_keys(structure, key_acc):
98 | if isinstance(structure, dict):
99 | for k, v in structure.items():
100 | if isinstance(v, list):
101 | extract_nested_keys(v, key_acc)
102 | elif isinstance(v, dict):
103 | extract_nested_keys(v, key_acc)
104 | key_acc.add(k)
105 | elif isinstance(structure, list):
106 | for i in structure:
107 | if isinstance(i, list):
108 | extract_nested_keys(i, key_acc)
109 | elif isinstance(i, dict):
110 | extract_nested_keys(i, key_acc)
111 | return key_acc
112 |
113 |
114 | def graph_tasks(starttimes, endtimes, maximum):
115 | data = []
116 | tasks_x = []
117 | tasks_y = []
118 | texts = []
119 | multiplier = 1
120 | distance = maximum / len(starttimes) # y-distance between stage lines
121 | for task_stage_id in starttimes:
122 | text = 'Task ' + '@'.join((str(task_stage_id[0]), str(task_stage_id[1])))
123 | starttime = starttimes[task_stage_id]
124 | tasks_x.append(starttime)
125 | tasks_y.append(task_stage_id[0] + task_stage_id[1] + distance*multiplier)
126 | texts.append(text + ' Start')
127 | scatter = go.Scatter(
128 | name=text,
129 | x=[starttime, endtimes[task_stage_id]],
130 | y=[task_stage_id[0] + task_stage_id[1] + distance*multiplier, task_stage_id[0] + task_stage_id[1] + distance*multiplier],
131 | mode='lines+markers',
132 | hoverinfo='none',
133 | line=dict(color='darkblue', width=5),
134 | opacity=0.5
135 | )
136 | data.append(scatter)
137 |
138 | endtime = endtimes[task_stage_id]
139 | tasks_x.append(endtime)
140 | tasks_y.append(task_stage_id[0] + task_stage_id[1] + distance*multiplier)
141 | texts.append(text + ' End')
142 |
143 | multiplier += 1
144 |
145 | return data, tasks_x, tasks_y, texts
146 |
147 | def cover(prefix, arr):
148 | arr1 = arr.copy()
149 | n = 0
150 | while arr1[:len(prefix)] == prefix:
151 | arr1 = arr1[len(prefix):]
152 | n += 1
153 | return n, arr1
154 |
155 | def cost(t):
156 | s, r, n = t
157 | c = len(s) + len(r) * n
158 | return c, -r, -s
159 |
160 | def findcovers(stack, maxfraglen):
161 | covers = []
162 | for fraglen in range(1, maxfraglen + 1):
163 | fragment = stack[:fraglen]
164 | n, _ = cover(fragment, stack)
165 | if n >= 2:
166 | covers.append((fraglen, n))
167 | return covers
168 |
169 |
170 | def collapse(stack):
171 | results = []
172 | lastsingle = False
173 |
174 | while len(stack) > 0:
175 | covers = findcovers(stack, maxfraglen=10)
176 |
177 | if len(covers) == 0:
178 | element = stack[:1]
179 | if lastsingle:
180 | lastfrag, _ = results[-1]
181 | results[-1] = (lastfrag + element, 1)
182 | else:
183 | results.append((element, 1))
184 | lastsingle = True
185 | stack = stack[1:]
186 | else:
187 | R, N = max(covers, key=lambda t: (t[0] * t[1], -t[0]))
188 | results.append((stack[:R], N))
189 | stack = stack[R*N:]
190 | lastsingle = False
191 |
192 | return results
193 |
194 |
195 | def get_max_y(data):
196 | max_y = -1
197 | for datapoint in data:
198 | numbers = filter(lambda x: isinstance(x, Number), datapoint.y)
199 | current_max_y = max(numbers)
200 | if current_max_y > max_y:
201 | max_y = current_max_y
202 | return max_y
203 |
204 |
205 | def fat_function_inner(i):
206 | new_list = list()
207 | for j in range(0, i):
208 | new_list.append(j)
209 | return new_list
210 |
211 | def secondsSleep(i):
212 | time.sleep(1)
213 | return i
214 |
215 | # Made at https://github.com/g1thubhub/phil_stopwatch by writingphil@gmail.com
--------------------------------------------------------------------------------
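The collapse() helper above scans a list for a prefix that repeats at least twice (via findcovers), replaces each such run with a (fragment, repeat-count) pair, and merges consecutive isolated elements into a single fragment with count 1. A toy example, not part of the repository, makes the behaviour concrete:

```python
# Toy illustration of helper.collapse
from helper import collapse

print(collapse(['a', 'b', 'a', 'b', 'a', 'b', 'c']))
# The fragment ['a', 'b'] repeats three times and 'c' occurs once:
# [(['a', 'b'], 3), (['c'], 1)]
```
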
/parsers.py:
--------------------------------------------------------------------------------
1 | import json
2 | import re
3 | from datetime import datetime
4 | import os
5 | from functools import reduce
6 | import operator
7 | import gzip
8 | import hashlib
9 | import glob
10 | from typing import Tuple, List, Deque, Dict
11 | from collections import deque, defaultdict
12 | import plotly.graph_objs as go
13 | from plotly.graph_objs import Scatter
14 | from helper import r_container_id, metric_definitions, time_patterns, r_spark_log, r_app_start, r_task_start, r_task_end, \
15 | r_job_start, r_job_end, extract_nested_keys, conversion_map, to_epochms, ms_to_seconds, get_max_y
16 |
17 |
18 | class StackParser:
19 | @staticmethod
20 | def open_stream(file):
21 | if file.endswith('gz'):
22 | return gzip.open(file, 'rt')
23 | else:
24 | return open(file, 'r')
25 |
26 | @staticmethod
27 | def convert_file(file, id='', merge=True):
28 | StackParser.convert_files([file], id, merge)
29 |
30 | @staticmethod
31 | def convert_files(file_list, id='', merge=True):
32 | stacks = defaultdict(int)
33 |
34 | for file in file_list:
35 | stream = StackParser.open_stream(file)
36 | for line in stream:
37 | line = line.strip()
38 | if not (line.startswith('{') and line.endswith('}') and '"stacktrace":' in line):
39 | continue
40 | stack_record = json.loads(line)
41 | if 'count' not in stack_record:
42 | continue
43 | count = stack_record['count']
44 | stacktrace = ';'.join(list(reversed(stack_record['stacktrace'])))
45 |
46 | if 'i_d' in stack_record and id != '' and stack_record['i_d'] != id: # skip for PySpark stack traces
47 | continue
48 |
49 | if merge:
50 | stacks[stacktrace] += count
51 | else:
52 | print(' '.join((stacktrace, str(count))))
53 |
54 | stream.close()
55 | for (stack, count) in stacks.items():
56 | print(' '.join((stack, str(count))))
57 |
58 |
59 | class ProfileParser:
60 | def __init__(self, filename, normalize=True):
61 | self.filename = filename
62 | self.normalize = normalize # normalize units
63 | self.resource_format = '' # currently JVMProfiler or PySparkPhilProfiler
64 | self.data_points = list() # list of dictionaries
65 | self.metric_conversions = list() # json file
66 | self.relevant_metrics = dict() # {'epochMillis': (, True),
67 | self.profile_match_keys = set()
68 |
69 | metric_map = json.loads(metric_definitions)
70 | for profiler in metric_map:
71 | profiler_name = profiler[0]
72 | metric_map = profiler[1]
73 | self.metric_conversions.append((profiler_name, dict(map(lambda kv: (kv[0], (conversion_map[kv[1][1]], kv[1][2])), metric_map.items()))))
74 |
75 | def parse_profiles(self, id=''):
76 | if self.resource_format == '': # First valid JSON profile record in file sets for whole file
77 | self.deduce_profiler()
78 | self.data_points.clear()
79 |
80 | stream = self.open_stream()
81 | for line in stream:
82 | currentline = line.strip().replace('ConsoleOutputReporter - CpuAndMemory: ', '') # In case JVM Profiler wrote to STDOUT
83 | # Profile record can be part of log file so checks here:
84 | if not (currentline.startswith('{') and currentline.endswith('}')):
85 | continue
86 | resource_usage = json.loads(currentline)
87 | record_keys = extract_nested_keys(resource_usage, set())
88 | if len(self.profile_match_keys - record_keys) != 0: # in case a file contains records from two different profilers
89 | continue
90 |
91 | if self.resource_format == 'PySparkPhilProfiler' and id != '':
92 | if str(resource_usage['pid']) != id:
93 | continue
94 | metrics = dict()
95 |
96 | for relevant_metric in self.relevant_metrics.items():
97 | if relevant_metric[1][1] is True:
98 | metric_name = relevant_metric[0]
99 | value = resource_usage[metric_name]
100 | if self.normalize: # Convert metrics so they can be visualized conveniently together
101 | value = self.relevant_metrics[metric_name][0](value)
102 | metrics[metric_name] = value
103 |
104 | # Custom parsing logic of 4 GC metrics for JVM profiler
105 | if self.resource_format == 'JVMProfiler':
106 | memory_pools = resource_usage['memoryPools']
107 | codecache = memory_pools[0]
108 | assert codecache['name'] == 'Code Cache'
109 | metaspace = memory_pools[1]
110 | assert metaspace['name'] == 'Metaspace'
111 | metaspace = memory_pools[2]
112 | assert metaspace['name'] == 'Compressed Class Space'
113 | metaspace = memory_pools[3]
114 | assert metaspace['name'] == 'PS Eden Space'
115 | # # ,"gc":[{"collectionTime":97,"name":"PS Scavenge","collectionCount":13},{"collectionTime":166,"name":"PS MarkSweep","collectionCount":3}]}
116 | gc = resource_usage['gc']
117 | scavenge = gc[0]
118 | marksweep = gc[1]
119 | assert scavenge['name'] == "PS Scavenge"
120 | assert marksweep['name'] == "PS MarkSweep"
121 |
122 | scavenge_count = scavenge['collectionCount']
123 | scavenge_time = scavenge['collectionTime']
124 | marksweep_count = marksweep['collectionCount']
125 | marksweep_time = marksweep['collectionTime']
126 |
127 | if self.normalize:
128 | scavenge_count = self.relevant_metrics['ScavengeCollCount'][0](scavenge_count)
129 | scavenge_time = self.relevant_metrics['ScavengeCollTime'][0](scavenge_time)
130 | marksweep_count = self.relevant_metrics['MarkSweepCollCount'][0](marksweep_count)
131 | marksweep_time = self.relevant_metrics['MarkSweepCollTime'][0](marksweep_time)
132 |
133 | metrics['ScavengeCollCount'] = scavenge_count
134 | metrics['ScavengeCollTime'] = scavenge_time
135 | metrics['MarkSweepCollCount'] = marksweep_count
136 | metrics['MarkSweepCollTime'] = marksweep_time
137 |
138 | self.data_points.append(metrics)
139 | self.data_points.sort(key=lambda entry: entry['epochMillis']) # sorting based on timestamp
140 | stream.close()
141 | print('## Parsed file, number of data points: ' + str(len(self.data_points)))
142 |
143 | def open_stream(self):
144 | if self.filename.endswith('gz'):
145 | return gzip.open(self.filename, 'rt')
146 | else:
147 | return open(self.filename, 'r')
148 |
149 | def get_available_metrics(self, id='') -> List[str]:
150 | if len(self.data_points) == 0:
151 | self.parse_profiles(id)
152 | return list(self.relevant_metrics.keys())
153 |
154 | def get_maxima(self) -> Dict[str, float]:
155 | maxima = {}
156 | all_metrics: List[Scatter] = self.ignore_metrics(list())
157 | for metric in all_metrics:
158 | max_value = get_max_y([metric])
159 | maxima[metric.name] = float(max_value)
160 | return maxima
161 |
162 | def deduce_profiler(self):
163 | # Determining format of profiler used
164 | stream = self.open_stream()
165 | for line in stream:
166 | currentline = line.strip() #
167 | currentline = currentline.replace('ConsoleOutputReporter - CpuAndMemory: ', '') # In case JVM Profiler wrote to STDOUT
168 | if currentline.startswith('{') and currentline.endswith('}'): # could be part of a log file
169 | profile_record = json.loads(currentline)
170 | record_keys = extract_nested_keys(profile_record, set())
171 | for profile in self.metric_conversions:
172 | profile_match_keys = set([item[0] for item in profile[1].items() if item[1][1] is True])
173 | delta = profile_match_keys - record_keys
174 | if len(delta) == 0:
175 | self.resource_format = profile[0]
176 | self.relevant_metrics = profile[1]
177 | self.profile_match_keys = profile_match_keys
178 | if self.resource_format != '':
179 | break
180 | if self.resource_format != '':
181 | print('## Identified Profile for ' + self.filename + ' as ' + self.resource_format)
182 | else:
183 | raise ValueError('Unknown profile format for file ' + self.filename)
184 | stream.close()
185 |
186 | def manually_set_profiler(self, profile):
187 | # Setting format of profiler used
188 | normalized_profile = profile.lower()
189 | for profile in self.metric_conversions:
190 | if normalized_profile in profile[0].lower():
191 | self.resource_format = profile[0]
192 | self.relevant_metrics = profile[1]
193 | profile_match_keys = set([item[0] for item in profile[1].items() if item[1][1] is True])
194 | self.profile_match_keys = profile_match_keys
195 | print('## Set Profile for ' + self.filename + ' to ' + self.resource_format)
196 |
197 | def make_graph(self, id='') -> List[Scatter]:
198 | if len(self.data_points) == 0:
199 | self.parse_profiles(id)
200 | if len(self.data_points) == 0:
201 | print('## No data points')
202 | return None
203 |
204 | display_keys = list(self.relevant_metrics.keys())
205 | display_keys.remove('epochMillis')
206 | data_points = list()
207 | for display_key in display_keys:
208 | display_key_name = display_key
209 | if id != '':
210 | display_key_name += '_' + id
211 | data_points.append(go.Scatter(x=list(map(lambda x: x['epochMillis'], self.data_points)), y=list(map(lambda x: x[display_key], self.data_points)),
212 | mode='lines+markers', name=display_key_name))
213 | return data_points
214 |
215 | def get_metrics(self, names=list(), id='') -> List[Scatter]:
216 | if len(names) == 0:
217 | return self.make_graph()
218 | else:
219 | all_metrics = self.make_graph(id)
220 | relevant_metrics = list(filter(lambda x: any([ele in x['name'] for ele in names]), all_metrics))
221 | return relevant_metrics
222 |
223 | def ignore_metrics(self, names=list()) -> List[Scatter]:
224 | if len(names) == 0:
225 | return self.make_graph()
226 | else:
227 | all_metrics = self.make_graph()
228 | relevant_metrics = list(filter(lambda x: any([ele not in x['name'] for ele in names]), all_metrics))
229 | return relevant_metrics
230 |
231 |
232 |
233 |
234 | @staticmethod
235 | def get_max_y(data_points):
236 | return get_max_y(data_points)
237 |
238 |
239 | class SparkLogParser:
240 | prefix_max_len = 21
241 | r_spark_log = r'.* (?:error|info|warning)(.*)'
242 | log_types = ['error', 'info', 'warn']
243 |
244 | def __init__(self, log_file, profile_file='', id='', normalize=True):
245 | if log_file != '':
246 | self.logfile = log_file
247 | self.time_pattern, self.re_time_pattern, self.re_app_start, self.re_job_start, self.re_job_end = None, None, None, None, None
248 | self.re_task_start, self.re_task_end, self.re_spark_log, self.re_problempattern = None, None, None, None
249 | self.application_name = '' # application name is always set even if not provided by user
250 | self.jobs = list() # [(job_id, job_start, job_end), ... [(0, 1546683722000, 1546684219000),...
251 | self.task_intervals = dict() # dict_items([((0, 0, 0), (1546683722000, 1546684219000)),
252 | self.stage_intervals = dict()
253 | self.job_intervals = dict()
254 | self.identify_timeformat()
255 |         if profile_file == '':
256 | self.profile_parser = ProfileParser(self.logfile, normalize)
257 | else:
258 | self.profile_parser = ProfileParser(profile_file, normalize)
259 | self.id = id
260 |
261 | def open_stream(self):
262 | if self.logfile.endswith('gz'):
263 | return gzip.open(self.logfile, 'rt')
264 | else:
265 | return open(self.logfile, 'r')
266 |
267 | def get_available_metrics(self) -> List[str]:
268 | return self.profile_parser.get_available_metrics()
269 |
270 | def identify_timeformat(self):
271 | stream = self.open_stream()
272 | for logline in stream:
273 | for time_pattern in time_patterns:
274 | match_attempt = re.match(time_pattern, logline)
275 | if match_attempt is not None:
276 | print('^^ Identified time format for log file: ' + time_patterns[time_pattern])
277 | self.time_pattern = time_pattern
278 | break
279 | if self.time_pattern is not None:
280 | break
281 | stream.close()
282 |
283 | if self.time_pattern is None:
284 |             print('^^ Warning: log file is empty or has an unknown time format, add a matching pattern to time_patterns in helper.py')
285 | else:
286 | self.re_time_pattern = re.compile(self.time_pattern, re.IGNORECASE)
287 | self.re_app_start = re.compile(self.time_pattern + r_app_start, re.IGNORECASE)
288 | self.re_job_start = re.compile(self.time_pattern + r_job_start, re.IGNORECASE)
289 | self.re_job_end = re.compile(self.time_pattern + r_job_end, re.IGNORECASE)
290 | self.re_task_start = re.compile(self.time_pattern + r_task_start, re.IGNORECASE)
291 | self.re_task_end = re.compile(self.time_pattern + r_task_end, re.IGNORECASE)
292 | self.re_spark_log = re.compile(self.time_pattern + r_spark_log, re.IGNORECASE)
293 | self.re_problempattern = re.compile(self.time_pattern + ' (error|warn)', re.IGNORECASE)
294 |
295 | def extract_time(self, line):
296 | match_obj = self.re_time_pattern.match(line)
297 | if match_obj:
298 | datetime_obj = datetime.strptime(match_obj.group(1), time_patterns[self.time_pattern])
299 | ms = to_epochms(datetime_obj)
300 | return ms
301 | else:
302 | return None
303 |
304 | def parse_profile(self): # delegates to embedded ProfileParser
305 | self.profile_parser.deduce_profiler()
306 |
307 | def get_available_metrics(self):
308 | return self.profile_parser.get_available_metrics(self.id)
309 |
310 | def __make_graph(self) -> List[Scatter]: # delegates to embedded ProfileParser
311 | return self.profile_parser.make_graph(self.id)
312 |
313 | def get_metrics(self, names=list()) -> List[Scatter]:
314 | return self.profile_parser.get_metrics(names)
315 |
316 | def ignore_metrics(self, names=list()) -> List[Scatter]:
317 | return self.profile_parser.ignore_metrics(names)
318 |
319 | def get_max_y(self, data_points):
320 | return self.profile_parser.get_max_y(data_points)
321 |
322 | @staticmethod
323 | def pick_longest_frequent(length_frequ):
324 | return length_frequ[0] * length_frequ[1], ((length_frequ[0] - length_frequ[1]) * (length_frequ[1] - length_frequ[0]))
325 |
326 | @staticmethod
327 | def find_reps(elements):
328 | reps = []
329 | for prefix_len in range(1, SparkLogParser.prefix_max_len):
330 | suffix = elements.copy()
331 | prefix = suffix[:prefix_len]
332 | repeat = 0 # prefix repeat
333 | while prefix == suffix[:prefix_len]:
334 | suffix = suffix[prefix_len:]
335 | repeat += 1
336 |
337 | if repeat >= 2:
338 | reps.append((prefix_len, repeat))
339 | return reps
340 |
341 | @staticmethod
342 | def collapse(log, rank=False) -> List[Tuple[int, List[str]]]:
343 | collapsed_log = []
344 | is_last_pref = False
345 |
346 | while len(log) > 0:
347 | candidates = SparkLogParser.find_reps(log)
348 | if len(candidates) >= 1:
349 | prefix_len, repeats = max(candidates, key=lambda ele: SparkLogParser.pick_longest_frequent(ele))
350 | collapsed_log.append((repeats, log[:prefix_len]))
351 | log = log[prefix_len * repeats:]
352 | is_last_pref = False
353 | else:
354 | curr_prefix = log[:1]
355 | log = log[1:]
356 | if is_last_pref:
357 | collapsed_log[-1] = (1, collapsed_log[-1][1] + curr_prefix) # end of list => append to previous prefix
358 | else:
359 | collapsed_log.append((1, curr_prefix)) # penultimate
360 | is_last_pref = True
361 | if rank:
362 | collapsed_log.sort(key=lambda segment: -segment[0])
363 | return collapsed_log
364 | else: # only take strings and flatten list of lists
365 | return reduce(operator.concat, (map(lambda entry: entry[1], collapsed_log)))
366 |
367 | @staticmethod
368 | def digest_string(string) -> str:
369 | shrinked = re.sub(r'[^a-z]', '', string.lower())
370 | # hashed = hashlib.md5()
371 | hashed = hashlib.sha1()
372 | hashed.update(shrinked.encode())
373 | return hashed.hexdigest()
374 |
375 | @staticmethod
376 | def digest_strings(string_list) -> str:
377 | return SparkLogParser.digest_string(''.join(string_list))
378 |
379 | @staticmethod
380 | def dedupe_errors(stack) -> Deque[Tuple[int, List[str]]]:
381 | digests = set()
382 | collapsed_errors = deque()
383 | while len(stack) > 0:
384 | last_lines = stack.pop()
385 | digested_line = SparkLogParser.digest_strings(last_lines)
386 | if digested_line not in digests:
387 | collapsed_errors.append(last_lines)
388 | digests.add(digested_line)
389 | return collapsed_errors
390 |
391 | @staticmethod
392 | def dedupe_source_errors(stack) -> Deque[Tuple[str, List[str]]]:
393 | digests = set()
394 | collapsed_errors = deque()
395 | while len(stack) > 0:
396 | lastele = stack.pop()
397 | last_file = lastele[0]
398 | last_lines = lastele[1]
399 | digested_line = SparkLogParser.digest_strings(last_lines)
400 | if digested_line not in digests:
401 | collapsed_errors.append((last_file, last_lines))
402 | digests.add(digested_line)
403 | return collapsed_errors
404 |
405 | def get_top_log_chunks(self, log_level='') -> List[Tuple[int, List[str]]]:
406 | log_types_to_process = self.log_types
407 | if log_level != '':
408 | log_types_to_process = [log_level.lower()]
409 |
410 | log_contents = list()
411 | stream = self.open_stream()
412 | for line in stream:
413 | match_obj = self.re_spark_log.match(line.strip())
414 | if match_obj:
415 | line_type = match_obj.group(2)
416 | if line_type.lower() in log_types_to_process:
417 | log_contents.append(match_obj.group(3).strip())
418 | stream.close()
419 | collapsed_ranked_log = SparkLogParser.collapse(log_contents, rank=True)
420 | return collapsed_ranked_log
421 |
422 | def extract_errors(self, deduplicate=True) -> Deque[Tuple[int, List[str]]]:
423 | stream = self.open_stream()
424 | in_multiline = False
425 | errors = list()
426 | multiline_message = list()
427 |
428 | for logline in stream:
429 | logline = logline.strip()
430 | match_obj = self.re_problempattern.match(logline)
431 | if in_multiline:
432 | normal_match_obj = self.re_spark_log.match(logline)
433 | if normal_match_obj: # new logline so close the previous multiline error one
434 | errors.append(multiline_message.copy())
435 | multiline_message.clear()
436 | in_multiline = False
437 | else: # continued multiline error
438 | multiline_message.append(logline)
439 | if match_obj:
440 | in_multiline = True
441 | multiline_message.append(logline)
442 | if in_multiline: # Error at end of file
443 | errors.append(multiline_message.copy())
444 | multiline_message.clear()
445 | stream.close()
446 | # Collapse log messages internally for repeated segments
447 | collapsed_errors = deque(map(lambda entry: SparkLogParser.collapse(entry), errors))
448 | if deduplicate:
449 | collapsed_errors = SparkLogParser.dedupe_errors(collapsed_errors)
450 | return collapsed_errors
451 |
452 | def extract_entity_id(self, match_obj, entity):
453 | timestamp = match_obj.group(1) # 2018-12-26 12:05
454 | datetime_obj = datetime.strptime(timestamp, time_patterns[self.time_pattern])
455 | ms = to_epochms(datetime_obj)
456 |
457 | if entity == 'task':
458 | task_id = match_obj.group(2)
459 | stage_id = match_obj.group(3)
460 | task_stage_id = (float(task_id), float(stage_id))
461 | return task_stage_id, ms
462 | elif entity == 'job':
463 | job_id = match_obj.group(2)
464 | return int(job_id), ms
465 | else:
466 | raise Exception('Unknown entity type: ' + entity)
467 |
468 | # extracts task start & endpoints
469 | def extract_task_intervals(self):
470 | stream = self.open_stream()
471 | start_times = dict()
472 | end_times = dict()
473 |
474 | for line in stream:
475 | line = line.strip()
476 |
477 | match_obj = self.re_app_start.match(line)
478 | if match_obj:
479 | name = match_obj.group(2)
480 | if self.application_name != '':
481 | raise Exception('Several Spark applications wrote to the same file: ' + self.logfile)
482 | else:
483 | self.application_name = name
484 | continue
485 |
486 | # Extracting jobs
487 | match_obj = self.re_job_start.match(line)
488 | if match_obj:
489 | job_id, ms = self.extract_entity_id(match_obj, 'job')
490 | if len(self.jobs) > 0:
491 | (previous_job_id, _, _) = self.jobs[-1]
492 | if previous_job_id == job_id:
493 |                         raise Exception('Conflicting info for start/end of job ' + str(job_id))
494 | self.jobs.append((job_id, ms, -1)) # set job end time below
495 | continue
496 | match_obj = self.re_job_end.match(line)
497 | if match_obj:
498 | job_id, ms = self.extract_entity_id(match_obj, 'job')
499 | (previous_job_id, job_start, dummy_end) = self.jobs.pop()
500 | if job_id != previous_job_id or dummy_end != -1:
501 |                     raise Exception('Conflicting info for start/end of job ' + str(job_id) + ' and ' + str(previous_job_id))
502 | self.jobs.append((job_id, job_start, ms))
503 | continue
504 |
505 | # Extracting task/stage/job ids with start/endtimes
506 | match_obj = self.re_task_start.match(line)
507 | if match_obj:
508 | task_stage_id, ms = self.extract_entity_id(match_obj, 'task')
509 |                 active_job_id = ''  # Executor logs don't contain a job ID in Spark 2.4
510 | if len(self.jobs) > 0:
511 | (active_job_id, _, dummy) = self.jobs[-1]
512 | if dummy != -1:
513 |                         raise Exception('Conflicting info for start/end of job ' + str(active_job_id) + ' and task ' + str(task_stage_id))
514 | task_stage_job_id = (task_stage_id[0], task_stage_id[1], active_job_id)
515 | start_times[task_stage_job_id] = ms
516 | continue
517 |
518 | match_obj = self.re_task_end.match(line)
519 | if match_obj:
520 | task_stage_id, ms = self.extract_entity_id(match_obj, 'task')
521 |                 active_job_id = ''  # Executor logs don't contain a job ID in Spark 2.4
522 | if len(self.jobs) > 0:
523 | (active_job_id, _, dummy) = self.jobs[-1]
524 | if dummy != -1:
525 |                         raise Exception('Conflicting info for start/end of job ' + str(active_job_id) + ' and task ' + str(task_stage_id))
526 | task_stage_job_id = (task_stage_id[0], task_stage_id[1], active_job_id)
527 | end_times[task_stage_job_id] = ms
528 | continue
529 |
530 | if start_times.keys() != end_times.keys():
531 | print("^^ Warning: Not all tasks completed successfully: " + str(start_times.keys() - end_times.keys()))
532 | print('^^ Extracting task intervals')
533 | for task in start_times.keys():
534 | if task in end_times:
535 | self.task_intervals[task] = (start_times[task], end_times[task])
536 |
537 | self.extract_stage_intervals()
538 | self.extract_job_intervals()
539 | stream.close()
540 | return self.task_intervals
541 |
542 | def extract_stage_intervals(self):
543 | print('^^ Extracting stage intervals')
544 | stage_intervals = dict()
545 | if len(self.task_intervals) == 0:
546 | self.extract_task_intervals()
547 | for ((_, s_id, job_id), (start, end)) in self.task_intervals.items():
548 | stage_id = (s_id, job_id)
549 | if stage_id in stage_intervals:
550 | (previous_start, previous_end) = stage_intervals[stage_id]
551 | if start < previous_start:
552 | previous_start = start
553 | if end > previous_end:
554 | previous_end = end
555 | stage_intervals[stage_id] = (previous_start, previous_end)
556 | else:
557 | stage_intervals[stage_id] = (start, end)
558 | self.stage_intervals = stage_intervals
559 | return self.stage_intervals
560 |
561 | def extract_job_intervals(self):
562 | print('^^ Extracting job intervals')
563 | job_intervals = dict()
564 | if len(self.stage_intervals) == 0:
565 | self.extract_stage_intervals()
566 | for ((_, job_id), (start, end)) in self.stage_intervals.items():
567 | if job_id in job_intervals:
568 | (previous_start, previous_end) = job_intervals[job_id]
569 | if start < previous_start:
570 | previous_start = start
571 | if end > previous_end:
572 | previous_end = end
573 | job_intervals[job_id] = (previous_start, previous_end)
574 | else:
575 | job_intervals[job_id] = (start, end)
576 | self.job_intervals = job_intervals
577 | return self.job_intervals
578 |
579 | def get_job_intervals(self):
580 | if len(self.job_intervals) == 0:
581 | self.extract_job_intervals()
582 | return self.job_intervals
583 |
584 | def extract_active_tasks(self) -> Tuple[List[int], List[int]]:
585 | if len(self.task_intervals) == 0:
586 | self.extract_task_intervals()
587 | # dict_items([((0, 0, 0), (1546683722000, 1546684219000)),
588 | application_start = min(list(map(lambda x: x[1][0], self.job_intervals.items())))
589 | application_end = max(list(map(lambda x: x[1][1], self.job_intervals.items())))
590 | application_duration = int(ms_to_seconds(application_end - application_start))
591 |         print('## Application started at {}, ended at {} and took {} seconds'.format(application_start, application_end, application_duration))
592 |
593 | job_time = []
594 | active_tasks = []
595 |
596 | for step in range(0, application_duration+1):
597 | step_time = 1000*step + application_start # to ms
598 | active_task = 0
599 | for ((_, _, _), (task_start, task_end)) in self.task_intervals.items():
600 | if task_start <= step_time <= task_end:
601 | active_task += 1
602 | job_time.append(step_time)
603 | active_tasks.append(active_task)
604 | return job_time, active_tasks
605 |
606 | def get_active_tasks_plot(self):
607 | job_time, active_tasks = self.extract_active_tasks()
608 | scatter = go.Scatter(
609 | name='Active Tasks',
610 | x=job_time,
611 | y=active_tasks,
612 | mode='lines+markers',
613 | hoverinfo='none',
614 | line=dict(color='darkblue', width=5)
615 | )
616 | return scatter
617 |
618 | def extract_stage_markers(self):
619 | stage_x = list()
620 | stage_y = list()
621 | texts = list()
622 | if len(self.stage_intervals) == 0:
623 | self.extract_stage_intervals()
624 |
625 | for ((stage_id, job_id), (start, end)) in self.stage_intervals.items():
626 | stage_name = '@'.join((str(stage_id), str(job_id)))
627 | stage_x.append(start)
628 | texts.append('Stage ' + stage_name + ' start')
629 | stage_x.append(end)
630 | texts.append('Stage ' + stage_name + ' end')
631 | stage_y.append(0)
632 | stage_y.append(0)
633 |
634 | markers = go.Scatter(
635 | name="Stage Labels",
636 | x=stage_x,
637 | y=stage_y,
638 | mode='markers+text',
639 | text=texts,
640 | textposition='bottom center',
641 | marker=dict(color='darkblue', size=18),
642 | opacity=.5
643 | )
644 |
645 | return markers
646 |
647 | def graph_tasks(self, maximum) -> List[Scatter]:
648 | data = []
649 | tasks_x = []
650 | tasks_y = []
651 | texts = []
652 | multiplier = 1
653 | distance = 0.0
654 | if len(self.task_intervals) == 0:
655 | self.extract_task_intervals()
656 | if maximum < 1.0:
657 | distance = 1.0 / len(self.task_intervals)
658 |
659 | else:
660 | distance = maximum / len(self.task_intervals) # y-distance between stage lines
661 |
662 | for ((task_id, stage_id, job_id), (task_start, task_end)) in self.task_intervals.items():
663 | task_name = '@'.join((str(task_id), str(stage_id), str(job_id)))
664 | text = 'Task ' + task_name
665 | tasks_x.append(task_start)
666 | tasks_y.append(task_id + stage_id + distance*multiplier)
667 | # tasks_y.append(distance*multiplier)
668 | texts.append(text + ' Start')
669 | # Create horizontal task lines
670 | scatter = go.Scatter(
671 | name=text,
672 | x=[task_start, task_end],
673 | y=[task_id + stage_id + distance*multiplier, task_id + stage_id + distance*multiplier],
674 | # y=[distance*multiplier, distance*multiplier],
675 | mode='lines+markers',
676 | hoverinfo='none',
677 | line=dict(color='darkblue', width=5),
678 | opacity=0.5
679 | )
680 | data.append(scatter)
681 |
682 | tasks_x.append(task_end)
683 | tasks_y.append(task_id + stage_id + distance*multiplier)
684 | # tasks_y.append(distance*multiplier)
685 | texts.append(text + ' End')
686 |
687 | multiplier += 1
688 |
689 | # Create markers for task start/end points
690 | trace_tasks = go.Scatter(
691 | name="Task labels",
692 | x=tasks_x,
693 | y=tasks_y,
694 | mode='markers+text',
695 | text=texts,
696 | hoverinfo='none',
697 | textposition='bottom center',
698 | marker=dict(color='darkblue', size=14),
699 | opacity=.5
700 | )
701 | data.append(trace_tasks)
702 | return data
703 |
704 | def extract_job_markers(self, max=10):
705 | vertical_lines = list()
706 | if len(self.job_intervals) == 0:
707 | self.extract_job_intervals()
708 | for (_, (start, end)) in self.job_intervals.items():
709 | vertical_lines.append({ # Line Vertical
710 | 'type': 'line',
711 | 'x0': start,
712 | 'y0': 0,
713 | 'x1': start,
714 | 'y1': max,
715 | 'line': { 'color': 'rgb(128, 0, 128)', 'width': 4, 'dash': 'dot', },
716 | })
717 | vertical_lines.append({ # Line Vertical
718 | 'type': 'line',
719 | 'x0': end,
720 | 'y0': 0,
721 | 'x1': end,
722 | 'y1': max,
723 | 'line': {'color': 'rgb(128, 0, 128)', 'width': 4, 'dash': 'dot', },
724 | })
725 |
726 | return {'shapes': vertical_lines}
727 |
728 |
729 | class AppParser:
730 | def __init__(self, logs_path, suffix='stderr'):
731 | self.master_logparser = None
732 | if logs_path.endswith(os.sep):
733 | logs_path = logs_path[:-1]
734 | self.logs_path = logs_path
735 | logfiles = glob.glob(logs_path + '*/**/' + suffix + '*', recursive=False)
736 | logfiles.sort(key=lambda path: path)
737 |
738 | # [['container_1547584802630_0001_01_000001', 'stderr.gz']
739 | suffixes = map(lambda logfile: logfile[len(logs_path) + 1:].split(os.sep), logfiles)
740 | app_dirs = list(filter(lambda suffix: len(suffix) == 2, suffixes))
741 | cluster_dirs = list(filter(lambda suffix: len(suffix) == 3, suffixes))
742 |
743 | dummy_id = 1 # artificial ID if path is weird
744 |
745 | if len(app_dirs) > 0 and len(app_dirs) > len(cluster_dirs):
746 | print('^^ Identified app path with log files')
747 | self.parsers = list()
748 | for app_dir in app_dirs:
749 | loc = os.sep.join(app_dir)
750 | loc = os.sep.join((logs_path, loc))
751 |
752 | parent = loc[:loc.rindex(os.sep)]
753 | stdout_glob = glob.glob(parent + '*/' + 'stdout' + '*', recursive=False) # ToDo: Better logic
754 | stdout_path = ''
755 | if len(stdout_glob) > 0:
756 | stdout_path = stdout_glob[0]
757 |
758 | re_container_id = re.compile(r_container_id, re.IGNORECASE)
759 | match_obj = re_container_id.match(loc)
760 | if match_obj:
761 | container_id = match_obj.group(1)
762 | self.parsers.append(SparkLogParser(loc, profile_file=stdout_path, id=container_id))
763 | else:
764 | self.parsers.append(SparkLogParser(loc, profile_file=stdout_path, id=dummy_id))
765 | dummy_id += 1
766 | elif len(cluster_dirs) > 0:
767 |             print('^^ Identified cluster job path with several apps')
768 | else:
769 | raise Exception('Path does not contain log files in known format')
770 | self.identify_master_log()
771 |
772 |     def get_maxima(self) -> Dict[str, float]:
773 |         maxima = {}
774 |         for parser in self.parsers:
775 |             # Each embedded ProfileParser returns a dict of metric name -> maximum observed value
776 |             parser_maxima: Dict[str, float] = parser.profile_parser.get_maxima()
777 |             for metric_name, max_value in parser_maxima.items():
778 |                 # Keep the largest value seen across all executors
779 |                 if metric_name not in maxima or max_value > maxima[metric_name]:
780 |                     maxima[metric_name] = max_value
781 |         return maxima
782 |
783 | def extract_errors(self) -> Deque[Tuple[str, List[str]]]:
784 | app_errors = deque()
785 | log_sources = deque()
786 | for parser in self.parsers:
787 | container_error = parser.extract_errors(True)
788 | app_errors.extend(container_error)
789 | for _ in range(0, len(container_error)):
790 | log_sources.append(parser.logfile)
791 |
792 | # sort based on timestamp
793 | timed_errors = list()
794 | for app_error in app_errors:
795 | head = app_error[0]
796 | ms = self.parsers[0].extract_time(head)
797 | timed_errors.append((ms, log_sources.popleft(), app_error))
798 |
799 | timed_errors.sort(key=lambda pair: pair[0])
800 | app_errors = SparkLogParser.dedupe_source_errors(deque(map(lambda triple: (triple[1], triple[2]), timed_errors)))
801 | return app_errors
802 |
803 | def identify_master_log(self):
804 | for parser in self.parsers:
805 | parser.extract_task_intervals()
806 |
807 | if parser.application_name != '':
808 | self.master_logparser = parser
809 | break
810 |
811 | def get_master_logfile(self):
812 | return self.master_logparser.logfile
813 |
814 | def get_master_logparser(self):
815 | return self.master_logparser
816 |
817 | def get_executor_logparsers(self):
818 | executors = list()
819 | for parser in self.parsers:
820 | if parser != self.master_logparser:
821 | executors.append(parser)
822 | return executors
823 |
824 | def graph_tasks(self, maximum):
825 | if self.master_logparser is None:
826 | raise Exception('No master log file found for ' + self.logs_path)
827 | return self.master_logparser.graph_tasks(maximum)
828 |
829 | def extract_stage_markers(self):
830 | if self.master_logparser is None:
831 | raise Exception('No master log file found for ' + self.logs_path)
832 | return self.master_logparser.extract_stage_markers()
833 |
839 |
840 | def get_job_intervals(self):
841 | if self.master_logparser is None:
842 | raise Exception('No master log file found for ' + self.logs_path)
843 |         if len(self.master_logparser.job_intervals) == 0:
844 | self.master_logparser.extract_job_intervals()
845 | return self.master_logparser.get_job_intervals()
846 |
847 | def get_active_tasks_plot(self): # Find the master log file and call its function
848 | if self.master_logparser is None:
849 | raise Exception('No master log file found for ' + self.logs_path)
850 | return self.master_logparser.get_active_tasks_plot()
851 |
852 | def extract_job_markers(self, max=10):
853 | if self.master_logparser is None:
854 | raise Exception('No master log file found for ' + self.logs_path)
855 | return self.master_logparser.extract_job_markers(max)
856 |
857 |
858 | if __name__ == '__main__':
859 |     log_path = '/Users/a/logs/application_1550152404841_0001'  # ToDo: Modify this
860 | app_parser = AppParser(log_path)
861 |
862 | # Made at https://github.com/g1thubhub/phil_stopwatch by writingphil@gmail.com
--------------------------------------------------------------------------------
/pyspark_profilers.py:
--------------------------------------------------------------------------------
1 | import os.path
2 | from os.path import join
3 | from sys import _current_frames
4 | from threading import Thread, Event
5 | from pyspark.profiler import BasicProfiler
6 | from pyspark import AccumulatorParam
7 | from collections import deque, defaultdict
8 | import psutil
9 | import time
10 | import json
11 | from typing import DefaultDict, Tuple, List, Set
12 |
13 | class CustomProfiler(BasicProfiler):
14 | def show(self, id):
15 | print("My custom profiles for RDD:%s" % id)
16 |
17 | # A custom profiler has to define or inherit the following methods:
18 | # profile - will produce a system profile of some sort.
19 | # stats - return the collected stats.
20 | # dump - dumps the profiles to a path
21 | # add - adds a profile to the existing accumulated profile
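# These profilers are activated per application by passing the class to the SparkContext
# constructor via its profiler_cls argument, as several job scripts under spark_jobs/ do,
# e.g. (app name and thread count are just placeholders):
#   SparkContext('local[3]', 'SomeApp', profiler_cls=CpuMemProfiler)
# SparkContext.show_profiles() / dump_profiles(path) then call the show() / dump() methods
# of the collected profiles.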
22 |
23 | #######################################################################################################################
24 |
25 | # Phil PySpark Profiler for CPU & Memory
26 |
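# Samples CPU and memory usage of the Python worker processes: CpuMemParser runs a background
# thread that polls psutil every profile_interval seconds, and CpuMemParam merges the samples
# collected for different partitions through a Spark accumulator.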
27 | class CpuMemProfiler(BasicProfiler):
28 | profile_interval = 0.1 # same default value as in Uber's profiler
29 |
30 | def __init__(self, ctx):
31 | """ Creates a new accumulator for combining the profiles of different partitions of a stage """
32 | self._accumulator = ctx.accumulator(list(), CpuMemParam())
33 | self.profile_interval = float(ctx.environment.get('profile_interval', self.profile_interval))
34 | self.pids = set()
35 |
36 | def profile(self, spark_action):
37 | """ Runs and profiles the method to_profile passed in. A profile object is returned. """
38 | parser = CpuMemParser(self.profile_interval)
39 | parser.start()
40 | spark_action() # trigger the Spark job
41 | parser.stop()
42 | self._accumulator.add(parser.profiles)
43 |
44 | def show(self, id):
45 | """ Print the profile stats to stdout, id is the RDD id """
46 | print(self.collapse())
47 |
48 | def dump(self, id, path):
49 | """ Dump the profile into path, id is the RDD id; See Profiler.dump() """
50 | if not os.path.exists(path):
51 | print('^^ Path ' + path + ' does not exist, trying to create it' )
52 | os.makedirs(path)
53 | self.get_pids()
54 | for pid in self.pids:
55 | with open(join(path, 's_{}_{}_cpumem.json'.format(id, pid)), 'w') as file:
56 | file.write(self.collapse_pid(pid))
57 |
58 | def get_pids(self) -> Set[int]:
59 | for profile_dict in self._accumulator.value:
60 | self.pids.add(profile_dict['pid'])
61 | return self.pids
62 |
63 | def collapse(self) -> str:
64 | """ Rearrange the result for further processing """
65 | return '\n'.join([json.dumps(profile_dict) for profile_dict in self._accumulator.value])
66 |
67 | def collapse_pid(self, pid) -> str:
68 | """ Rearrange the result for further processing split up according to pid"""
69 | pid_result = []
70 | for profile_dict in self._accumulator.value:
71 | if profile_dict['pid'] == pid:
72 | pid_result.append(json.dumps(profile_dict))
73 | return '\n'.join(pid_result)
74 |
75 |
76 | class CpuMemParser(object):
77 | def __init__(self, profile_interval):
78 | self.profile_interval = profile_interval
79 | self.thread = Thread(target=self.catch_mem_cpu)
80 | self.event = Event()
81 | self.profiles = []
82 |
83 | def catch_mem_cpu(self):
84 | while not self.event.is_set():
85 | self.event.wait(self.profile_interval)
86 | pid = os.getpid()
87 | current_process = psutil.Process(pid)
88 | current_time = int(round(time.time() * 1000))
89 |             # Alternative: current_process.memory_full_info(), which only adds USS on MacOS but is slower; note that the 'pfaults' field below is MacOS-specific
90 | mem_usage = current_process.memory_info()
91 | cpu_percent = current_process.cpu_percent(interval=self.profile_interval)
92 | profile = {'pid': pid, 'epochMillis': current_time, 'pmem_rss': mem_usage.rss, 'pmem_vms': mem_usage.vms, 'pmem_pfaults': mem_usage.pfaults, 'cpu_percent': cpu_percent}
93 | self.profiles.append(profile)
94 |
95 | def start(self):
96 | self.thread.start()
97 |
98 | def stop(self):
99 | self.event.set()
100 | self.thread.join()
101 |
102 |
103 | class CpuMemParam(AccumulatorParam):
104 |
105 | def zero(self, value) -> List:
106 | """ Provide a 'zero value' for the type, compatible in dimensions """
107 | return list()
108 |
109 | def addInPlace(self, profiles, new_profiles) -> List:
110 | """ Add 2 values of the accumulator's data type, returning a new value; for efficiency, can also update C{value1} in place and return it. """
111 | profiles.extend(new_profiles)
112 | return profiles
113 |
114 | #######################################################################################################################
115 |
116 | # Phil PySpark Profiler for catching stack traces
117 |
118 |
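# Samples the Python call stacks of the worker processes: StackParser polls
# sys._current_frames() every profile_interval seconds and counts identical stacks, producing
# output that can later be folded into flamegraph input (cf. fold_stacks.py).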
119 | class StackProfiler(BasicProfiler):
120 | profile_interval = 0.1 # same default value as in Uber's profiler
121 |
122 | def __init__(self, ctx):
123 | """ Creates a new accumulator for combining the profiles of different partitions of a stage """
124 | self._accumulator = ctx.accumulator(defaultdict(int), StackParam())
125 | self.profile_interval = float(ctx.environment.get('profile_interval', self.profile_interval))
126 |
127 | def profile(self, spark_action):
128 | """ Runs and profiles the method to_profile passed in. A profile object is returned. """
129 | parser = StackParser(self.profile_interval)
130 | parser.start()
131 | spark_action() # trigger the Spark job
132 | parser.stop()
133 | self._accumulator.add(parser.stack_frequ)
134 |
135 | def show(self, id):
136 | """ Print the profile stats to stdout, id is the RDD id """
137 | print(self.collapse(str(id) + '_id'))
138 |
139 | def dump(self, id, path):
140 | """ Dump the profile into path, id is the RDD id; See Profiler.dump() """
141 | if not os.path.exists(path):
142 | print('^^ Path ' + path + ' does not exist, trying to create it' )
143 | os.makedirs(path)
144 | with open(join(path, 's_{}_stack.json'.format(id)), 'w') as file:
145 | file.write(self.collapse(str(id) + '_id'))
146 |
147 | def collapse(self, id) -> str:
148 | """ Rearrange the result for further processing """
149 | stacks = []
150 | for stack, count in self._accumulator.value.items():
151 | stacks.append('{"stacktrace":[' + stack + '],"count":' + str(count) + ',"i_d":"' + str(id) + '"}\n')
152 | return ''.join(stacks)
153 |
154 |
155 | class StackParser(object):
156 | def __init__(self, profile_interval):
157 | self.profile_interval = profile_interval
158 | self.stack_frequ = defaultdict(int) # incrementing count values for stackframes in a Map
159 | self.thread = Thread(target=self.capture_stack)
160 | self.event = Event()
161 |
162 | @staticmethod
163 | def parse_stackframe(frame) -> str:
164 | current_stack = deque()
165 | while frame is not None:
166 | linenum = str(frame.f_lineno) # FrameType
167 | co_filename = frame.f_code.co_filename # CodeType
168 | co_name = frame.f_code.co_name # CodeType
169 | current_stack.append('\"' + ":".join((co_filename, co_name, linenum)) + '\"') # similar output to JVM profiler
170 | frame = frame.f_back
171 | return ','.join(current_stack)
172 |
173 | def capture_stack(self):
174 | while not self.event.is_set():
175 | self.event.wait(self.profile_interval)
176 | for thread_id, stackframe in _current_frames().items(): # thread id to T's current stack frame
177 | if thread_id != self.thread.ident: # Thread identifier
178 | stack = self.parse_stackframe(stackframe)
179 | self.stack_frequ[stack] += 1
180 |
181 | def start(self):
182 | self.thread.start()
183 |
184 | def stop(self):
185 | self.event.set()
186 | self.thread.join()
187 |
188 |
189 | class StackParam(AccumulatorParam):
190 |
191 | def zero(self, value) -> DefaultDict[str, int]:
192 | """ Provide a 'zero value' for the type, compatible in dimensions """
193 | return defaultdict(int)
194 |
195 | def addInPlace(self, dict1, dict2) -> DefaultDict[str, int]:
196 | """ Add 2 values of the accumulator's data type, returning a new value; for efficiency, can also update C{value1} in place and return it. """
197 | for frame, frequ in dict2.items():
198 | dict1[frame] += frequ
199 | return dict1
200 |
201 | #######################################################################################################################
202 |
203 | # Phil PySpark Profiler, combination of CpuMemProfiler & StackProfiler
204 |
205 |
206 | class CpuMemStackProfiler(BasicProfiler):
207 | profile_interval = 0.1 # same default value as in Uber's profiler
208 |
209 | def __init__(self, ctx):
210 | """ Creates a new accumulator for combining the profiles of different partitions of a stage """
211 | self._accumulator = ctx.accumulator(tuple((list(), defaultdict(int))), CpuMemStackParam())
212 | self.profile_interval = float(ctx.environment.get('profile_interval', self.profile_interval))
213 | self.pids = set()
214 |
215 | def profile(self, spark_action):
216 | """ Runs and profiles the method to_profile passed in. A profile object is returned. """
217 | parser = CpuMemStackParser(self.profile_interval)
218 | parser.start()
219 | spark_action() # trigger the Spark job
220 | parser.stop()
221 | self._accumulator.add(parser.profiles)
222 |
223 | def show(self, id):
224 | """ Print the profile stats to stdout, id is the RDD id """
225 | print(self.collapse(id))
226 |
227 |
228 | def collapse(self, id) -> str:
229 | """ Rearrange the result for further processing """
230 | results = []
231 | for profiledict in self._accumulator.value[0]:
232 | results.append(json.dumps(profiledict) + '\n')
233 | for stack, count in self._accumulator.value[1].items():
234 | results.append(''.join((''.join(stack), '\t', str(count), '\t', str(id) + '_id', '\n')))
235 | return ''.join(results)
236 |
237 |
238 | def dump(self, id, path):
239 | """ Dump the profile into path, id is the RDD id; See Profiler.dump() """
240 | if not os.path.exists(path):
241 | print('^^ Path ' + path + ' does not exist, trying to create it' )
242 | os.makedirs(path)
243 | self.get_pids()
244 | for pid in self.pids:
245 | with open(join(path, 's_{}_{}_cpumem.json'.format(id, pid)), 'w') as file:
246 | file.write(self.collapse_pid(pid))
247 | with open(join(path, 's_{}_stack.json'.format(id)), 'w') as file:
248 | file.write(self.collapse_stacks(str(id) + '_id'))
249 |
250 | def get_pids(self) -> Set[int]:
251 | for profile_dict in self._accumulator.value[0]:
252 | self.pids.add(profile_dict['pid'])
253 | return self.pids
254 |
255 | def collapse_pid(self, pid) -> str:
256 | pid_result = list()
257 | for profile_dict in self._accumulator.value[0]:
258 | if profile_dict['pid'] == pid:
259 | pid_result.append(json.dumps(profile_dict))
260 | return '\n'.join(pid_result)
261 |
262 | def collapse_stacks(self, id) -> str:
263 | results = list()
264 | for stack, count in self._accumulator.value[1].items():
265 | results.append('{"stacktrace":[' + stack + '],"count":' + str(count) + ',"i_d":"' + str(id) + '"}\n')
266 | return ''.join(results)
267 |
268 |
269 | class CpuMemStackParser(object):
270 | def __init__(self, profile_interval):
271 |         self.profile_interval = profile_interval
272 | self.thread = Thread(target=self.catch_mem_cpu_stack)
273 |
274 | self.event = Event()
275 | self.profiles = tuple((list(), defaultdict(int)))
276 |
277 | def catch_mem_cpu_stack(self):
278 | while not self.event.is_set():
279 |             self.event.wait(self.profile_interval)
280 | pid = os.getpid()
281 | current_process = psutil.Process(pid)
282 | current_time = int(round(time.time() * 1000))
283 |             # Alternative: current_process.memory_full_info(), see the note in CpuMemParser above
284 | mem_usage = current_process.memory_info()
285 |             cpu_percent = current_process.cpu_percent(interval=self.profile_interval)
286 | profile = {'pid': pid, 'epochMillis': current_time, 'pmem_rss': mem_usage.rss, 'pmem_vms': mem_usage.vms, 'pmem_pfaults': mem_usage.pfaults, 'cpu_percent': cpu_percent}
287 | self.profiles[0].append(profile)
288 | for thread_id, stackframe in _current_frames().items(): # thread id to T's current stack frame
289 | if thread_id != self.thread.ident: # Thread identifier
290 | stack = self.parse_stackframe(stackframe)
291 | self.profiles[1][stack] += 1
292 |
293 | @staticmethod
294 | def parse_stackframe(frame) -> str:
295 | current_stack = deque()
296 | while frame is not None:
297 | linenum = str(frame.f_lineno) # FrameType
298 | co_filename = frame.f_code.co_filename # CodeType
299 | co_name = frame.f_code.co_name # CodeType
300 | current_stack.append('\"' + ":".join((co_filename, co_name, linenum)) + '\"') # similar output to JVM profiler
301 | frame = frame.f_back
302 | return ','.join(current_stack)
303 |
304 | def start(self):
305 | self.thread.start()
306 |
307 | def stop(self):
308 | self.event.set()
309 | self.thread.join()
310 |
311 |
312 | class CpuMemStackParam(AccumulatorParam):
313 | def zero(self, value) -> Tuple[List, DefaultDict[str, int]]:
314 | """ Provide a 'zero value' for the type, compatible in dimensions """
315 | return tuple((list(), defaultdict(int)))
316 |
317 | def addInPlace(self, pair1, pair2) -> Tuple[List, DefaultDict[str, int]]:
318 | """ Add 2 values of the accumulator's data type, returning a new value; for efficiency, can also update C{value1} in place and return it. """
319 | if len(pair2[0]) > 0:
320 | pair1[0].extend(pair2[0])
321 | for frame, frequ in pair2[1].items():
322 | pair1[1][frame] += frequ
323 | return pair1
324 |
325 |
326 | #######################################################################################################################
327 |
328 | # Profiler map for command line args, default is CpuMemProfiler
329 | profiler_map = {'customprofiler': CustomProfiler, 'cpumemprofiler': CpuMemProfiler, 'cpumem': CpuMemProfiler,
330 |                 'stackprofiler': StackProfiler, 'stack': StackProfiler, 'cpumemstackprofiler': CpuMemStackProfiler,
331 |                 'both': CpuMemStackProfiler, 'cpumemstack': CpuMemStackProfiler, 'stackcpumem': CpuMemStackProfiler
332 | }
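# The job scripts under spark_jobs/ select a profiler through their first command-line
# argument, which is lowercased and looked up in this map; e.g. (output path is illustrative):
#   python spark_jobs/job_fatso.py cpumemstack ./ProfileFatso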
333 |
334 | # Made at https://github.com/g1thubhub/phil_stopwatch by writingphil@gmail.com
--------------------------------------------------------------------------------
/spark_jobs/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/g1thubhub/phil_stopwatch/98816c8252a7587b197893b54fa484cd803ff89b/spark_jobs/__init__.py
--------------------------------------------------------------------------------
/spark_jobs/job_fatso.py:
--------------------------------------------------------------------------------
1 | import os
2 | import datetime
3 | from sys import argv
4 | from pyspark.sql import SparkSession
5 | from pyspark import SparkContext
6 | from helper import fat_function_inner
7 | from pyspark_profilers import profiler_map
8 |
9 | # Avoids this problem: 'Exception: Python in worker has different version 2.7 than that in driver 3.6',
10 | os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python3.6' # ToDo: Modify this
11 | os.environ['PYSPARK_DRIVER_PYTHON'] = '/usr/local/bin/python3.6' # ToDo: Modify this
12 |
13 |
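# The CPU-heavy part of the job: every input record triggers 100,000 calls to
# fat_function_inner (imported from helper.py).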
14 | def fat_function_outer(string):
15 | result = ''
16 | for i in range(0, 100000):
17 | if i % 10000 == 0:
18 | i = 0
19 | result = fat_function_inner(i)
20 | return string + '@@' + str(result[-1])
21 |
22 |
23 | if __name__ == '__main__':
24 | profiler = argv[1].lower() # cpumem
25 | dump_path = argv[2] # ./ProfilePythonBusy
26 | print("^^ Using " + profiler + ' and writing to ' + dump_path)
27 |
28 | start = str(datetime.datetime.now())
29 | # Initialization:
30 | threads = 3 # program simulates a single executor with 3 cores (one local JVM with 3 threads)
31 | # conf = (SparkConf().set('spark.python.profile', 'true'))
32 | sparkContext = SparkContext('local[{}]'.format(threads), 'Profiling Busy', profiler_cls=profiler_map[profiler])
33 | session = SparkSession(sparkContext)
34 | session.sparkContext.addPyFile('./helper.py') # ToDo: Modify this
35 | session.sparkContext.addPyFile('./pyspark_profilers.py') # ToDo: Modify this
36 |
37 | records = session.createDataFrame([('a',), ('b',), ('c',)])
38 | result = records.rdd.map(lambda x: fat_function_outer(x[0]))
39 | print("@@@ " + str(result.collect()))
40 | end = str(datetime.datetime.now())
41 |
42 | session.sparkContext.dump_profiles(dump_path)
43 | # session.sparkContext.show_profiles() # Uncomment for printing profile records to standard out
44 |
45 | print("******************\n" + start + "\n******************")
46 | print("******************\n" + end + "\n******************")
47 |
--------------------------------------------------------------------------------
/spark_jobs/job_heckler.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import SparkSession
2 | from pyspark import SparkContext
3 | import datetime
4 | import spacy
5 | import pkgutil
6 | import socket
7 | import os
8 |
9 | # Avoids this problem: 'Exception: Python in worker has different version 2.7 than that in driver 3.6',
10 | os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python3.6'
11 | os.environ['PYSPARK_DRIVER_PYTHON'] = '/usr/local/bin/python3.6'
12 | os.environ["OBJC_DISABLE_INITIALIZE_FORK_SAFETY"] = "YES" # might be needed for spaCy on MacOS
13 |
14 |
15 | def check_spacy_setup(model_name): # Should simulate CoreNLP's logging & packaging
16 | # Check for spacy itself:
17 | host_name = socket.gethostbyaddr(socket.gethostname())[0]
18 | installed_modules = set()
19 | for module_info in pkgutil.iter_modules():
20 | installed_modules.add(module_info.name) # ModuleInfo(module_finder=FileFinder('/usr/local/lib/python3.6/site-packages'), name='spacy', ispkg=True)
21 | if 'spacy' not in installed_modules:
22 | print('^^ Warning: spacy might not have been installed on this host, ' + host_name)
23 | else:
24 | import spacy
25 | spacy_version = spacy.__version__
26 | print('^^ Using spaCy ' + spacy_version)
27 | data_path = spacy.util.get_data_path() # spaCy data directory, e.g. /usr/local/lib/python3.6/site-packages/spacy/data
28 | full_model_path = os.path.join(data_path.as_posix(), model_name)
29 | if os.path.exists(full_model_path):
30 | print('^^ Model found at ' + full_model_path)
31 | else:
32 | print('^^ Model not found at ' + full_model_path + ', trying to download now')
33 | spacy.cli.download(model_name)
34 |
35 |
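# Used with mapPartitions below: the spaCy model is loaded once per partition and reused for
# every record of that partition.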
36 | def fast_annotate_texts(iter, model_name):
37 | check_spacy_setup(model_name)
38 | nlp_model = spacy.load(model_name)
39 | print('^^ Created model ' + model_name)
40 | for element in iter:
41 | annotations = list()
42 | doc = nlp_model(element)
43 | for word in doc:
44 | annotations.append('//'.join((str(word), word.dep_)))
45 | yield ' '.join(annotations)
46 |
47 |
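# Used with map below: the spaCy model is reloaded for every single record, which makes this
# variant far slower than fast_annotate_texts above.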
48 | def slow_annotate_text(element, model_name):
49 | check_spacy_setup(model_name)
50 | nlp_model = spacy.load(model_name)
51 | print('^^ Created model ' + model_name)
52 |
53 | doc = nlp_model(element)
54 | annotations = list()
55 | for word in doc:
56 | annotations.append('//'.join((str(word), word.dep_)))
57 | return ' '.join(annotations)
58 |
59 |
60 | if __name__ == "__main__":
61 | standard_model = 'en_core_web_sm'
62 | texts = list()
63 | for _ in range(0, 500):
64 | texts.append("The small red car turned very quickly around the corner.")
65 | texts.append("The quick brown fox jumps over the lazy dog.")
66 | texts.append("This is supposed to be a nonsensial sentence but in the context of this app it does make sense after all.")
67 |
68 | start = str(datetime.datetime.now())
69 | # Initialization:
70 | threads = 3 # program simulates a single executor with 3 cores (one local JVM with 3 threads)
71 | sparkContext = SparkContext('local[{}]'.format(threads), 'Profiling Heckler')
72 | session = SparkSession(sparkContext)
73 |
74 | parsed_strings1 = session.sparkContext.parallelize(texts) \
75 | .map(lambda record: slow_annotate_text(record, model_name=standard_model))
76 |
77 | parsed_strings2 = session.sparkContext.parallelize(texts) \
78 |         .mapPartitions(lambda partition: fast_annotate_texts(partition, model_name=standard_model))
79 |
80 | # print(parsed_strings1.count())
81 | print(parsed_strings2.count())
82 |
--------------------------------------------------------------------------------
/spark_jobs/job_slacker.py:
--------------------------------------------------------------------------------
1 | import os
2 | import datetime
3 | from sys import argv
4 | from pyspark.sql import SparkSession
5 | from pyspark import SparkContext
6 | from helper import secondsSleep
7 | from pyspark_profilers import profiler_map
8 |
9 |
10 | # Avoids this problem: 'Exception: Python in worker has different version 2.7 than that in driver 3.6',
11 | os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python3.6'
12 | os.environ['PYSPARK_DRIVER_PYTHON'] = '/usr/local/bin/python3.6'
13 |
14 |
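# The "slacker" workload: each task spends most of its wall clock time inside secondsSleep
# (imported from helper.py), i.e. waiting instead of using the CPU.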
15 | def slacking(string):
16 | result = ''
17 | for i in range(1, 600):
18 | result = secondsSleep(i)
19 | return string + '@@' + str(result)
20 |
21 |
22 | if __name__ == "__main__":
23 | profiler = argv[1].lower() # cpumem
24 | dump_path = argv[2] # ./ProfilePythonBusy
25 | print("^^ Using " + profiler + ' and writing to ' + dump_path)
26 |
27 | start = str(datetime.datetime.now())
28 | # Initialization:
29 | threads = 3 # program simulates a single executor with 3 cores (one local JVM with 3 threads)
30 | # conf = (SparkConf().set('spark.python.profile', 'true'))
31 | sparkContext = SparkContext('local[{}]'.format(threads), 'Profiling Slacker', profiler_cls=profiler_map[profiler])
32 | session = SparkSession(sparkContext)
33 | session.sparkContext.addPyFile('./helper.py') # ToDo: Modify this
34 | session.sparkContext.addPyFile('./pyspark_profilers.py') # ToDo: Modify this
35 |
36 | records = session.createDataFrame([('a',), ('b',), ('c',)])
37 | result = records.rdd.map(lambda x: slacking(x[0]))
38 | print("@@@ " + str(result.collect()))
39 | end = str(datetime.datetime.now())
40 |
41 | session.sparkContext.dump_profiles(dump_path)
42 | # session.sparkContext.show_profiles() # Uncomment for printing profile records to standard out
43 |
44 | print("******************\n" + start + "\n******************")
45 | print("******************\n" + end + "\n******************")
46 |
--------------------------------------------------------------------------------
/spark_jobs/job_straggler.py:
--------------------------------------------------------------------------------
1 | import os
2 | import datetime
3 | from sys import argv
4 | from pyspark import SparkContext
5 | from pyspark.sql import SparkSession
6 | from pyspark_profilers import profiler_map
7 |
8 | # Avoids this problem: 'Exception: Python in worker has different version 2.7 than that in driver 3.6',
9 | os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python3.6'
10 | os.environ['PYSPARK_DRIVER_PYTHON'] = '/usr/local/bin/python3.6'
11 |
12 |
13 | def process(pair):
14 | currentlist = list()
15 | for i in range(pair[1]):
16 | currentlist.append(i)
17 |
18 | return currentlist
19 |
20 |
21 | if __name__ == "__main__":
22 | profiler = argv[1].lower() # cpumem
23 | dump_path = argv[2] # ./ProfilePythonBusy
24 | print("^^ Using " + profiler + ' and writing to ' + dump_path)
25 |
26 | start = str(datetime.datetime.now())
27 | # Initialization:
28 | threads = 3 # program simulates a single executor with 3 cores (one local JVM with 3 threads)
29 | sparkContext = SparkContext('local[{}]'.format(threads), 'Profiling Straggler', profiler_cls=profiler_map[profiler])
30 | session = SparkSession(sparkContext)
31 | session.sparkContext.addPyFile('./helper.py') # ToDo: Modify this
32 | session.sparkContext.addPyFile('./pyspark_profilers.py') # ToDo: Modify this
33 |
34 | letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u',
35 | 'v', 'w', 'x', 'y']
36 |
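# Skewed input: the key 'd' gets 60 times more work than any other key (1,800,000,000 vs
# 30,000,000 iterations), so one task runs much longer than the rest -- the straggler.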
37 | frequencies = session.sparkContext.parallelize(letters) \
38 | .map(lambda x: (x, 1800000000) if x == "d" else (x, 30000000))
39 |
40 |     summed = frequencies.map(process) \
41 | .map(sum)
42 |
43 | print(summed.count())
44 | end = str(datetime.datetime.now())
45 |
46 | session.sparkContext.dump_profiles(dump_path)
47 | # session.sparkContext.show_profiles() # Uncomment for printing profile records to standard out
48 |
49 | print("******************\n" + start + "\n******************")
50 | print("******************\n" + end + "\n******************")
51 |
--------------------------------------------------------------------------------