├── .gitignore ├── README.md ├── _deprecated.py ├── ao3.py ├── fanworks └── .gitignore ├── requirements.txt ├── results └── .gitignore ├── scripts └── .gitignore ├── search.py ├── vis.py └── workflow ├── format_helper.py ├── reformat.py ├── revis.py └── vis_helper.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | local_settings.py 55 | 56 | # Flask stuff: 57 | instance/ 58 | .webassets-cache 59 | 60 | # Scrapy stuff: 61 | .scrapy 62 | 63 | # Sphinx documentation 64 | docs/_build/ 65 | 66 | # PyBuilder 67 | target/ 68 | 69 | # IPython Notebook 70 | .ipynb_checkpoints 71 | 72 | # pyenv 73 | .python-version 74 | 75 | # celery beat schedule file 76 | celerybeat-schedule 77 | 78 | # dotenv 79 | .env 80 | 81 | # virtualenv 82 | venv/ 83 | ENV/ 84 | 85 | # Spyder project settings 86 | .spyderproject 87 | 88 | # Rope project settings 89 | .ropeproject 90 | 91 | # Fandom-specific settings 92 | match*.csv 93 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | The Archive of Our Own script ao3.py can be used to scrape and analyze 2 | fanworks and prepare the results for visualization in JavaScript. 3 | A markup version of the script of the orginal work is required for 4 | searching for n-gram matches in the fanworks. 5 | 6 | The basic workflow is below. This assumes you have a `scripts` folder, 7 | a `fanworks` folder, and a `results` folder, with a particular structure 8 | that can be inferred from the example commands below. (Sorry, very busy!) 9 | Take `sw-all` to be a stand-in for a folder of fan works, `sw-new-hope.txt` 10 | to be a stand-in for a correctly formatted script, and `sw-new-hope` (without 11 | the `.txt`) to be a stand-in for the results folder for the given movie. 12 | 13 | A todo for this repo is to create options for where to save error and 14 | log files, and search results. 15 | 16 | Another todo for this repo is to create more thorough documentation, 17 | especially of the script format, which is idiosyncratic but effective. 18 | 19 | * Scrape AO3 (Ooops! Currently broken!) 20 | 21 | python ao3.py scrape \ 22 | -t "Star Wars - All Media Types" \ 23 | -o fanworks/sw-all/html 24 | 25 | The scrape command will save log and error files; check to see that the 26 | scrape went OK, and then move the (generically named) error file to 27 | `fanworks/sw-all/sw-all-errors.txt`. 
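(With the current code, the scrape step's generically named output files are `log.txt` and `error-ids.txt` -- the defaults of ao3.py's Logger objects -- so `error-ids.txt` is the file to rename.) For reference, here is a rough sketch of the folder layout the example commands assume; it is inferred from the commands themselves, with `sw-all` and `sw-new-hope` standing in as described above:

    fanworks/sw-all/html/             scraped HTML files
    fanworks/sw-all/plaintext/        cleaned plain-text files
    fanworks/sw-all/sw-all-errors.txt
    scripts/sw-new-hope.txt           markup version of the original script
    results/sw-new-hope/20190604/     per-batch and aggregated search CSVs
    results/sw-new-hope/              formatted data and visualization output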
28 | 29 | * Clean the HTML 30 | 31 | python ao3.py clean \ 32 | fanworks/sw-all/html/ \ 33 | -o fanworks/sw-all/plaintext/ 34 | 35 | The clean command will save an error file; check to see that the cleaning 36 | process went OK, and then move the error file (this time in the root dir) 37 | from `clean-html-errors.txt` to `sw-all-clean-errors.txt`. 38 | 39 | * Perform the reuse search 40 | 41 | python ao3.py search \ 42 | fanworks/sw-all/ \ 43 | scripts/sw-new-hope.txt 44 | 45 | The search command will create several (and in some cases, many, even hundreds) 46 | of separate CSV files. Each one contains the results for 500 fan works. They 47 | will automatically be aggregated by the script at the end of the process, but 48 | they are also saved here to ensure that if the search is interrupted, the 49 | results are still usable. 50 | 51 | If the search completes without any errors, the final aggregated data will 52 | be in a file with a date timestamp in YYYYMMDD format. It will be something 53 | like `match-6gram-20190604.csv`. Create a new folder `results/sw-new-hope/20190604/`, 54 | and move all the CSV files into that folder. 55 | 56 | * Aggregate the results over the script (i.e. "format" the results) 57 | 58 | python ao3.py format \ 59 | results/sw-new-hope/20190604/match-6gram-20190604.csv \ 60 | scripts/sw-new-hope.txt \ 61 | -o results/sw-new-hope/fandom-data-new-hope.csv 62 | 63 | * Create a Bokeh visualization of the aggregated results 64 | 65 | python ao3.py vis \ 66 | results/sw-new-hope/fandom-data-new-hope.csv \ 67 | -o results/sw-new-hope/new_hope_reuse.html 68 | 69 | 70 | This is not a perfect workflow and needs to be tidied up in several ways. I 71 | will get around to that someday. 72 | 73 | ``` 74 | usage: ao3.py [-h] {scrape,clean,getmeta,search,matrix,format} ... 75 | 76 | process fanworks scraped from Archive of Our Own. 77 | 78 | positional arguments: 79 | {scrape,clean,getmeta,search,matrix,format} 80 | scrape, clean, getmeta, search, matrix, or format 81 | scrape find and scrape fanfiction works from Archive of Our 82 | Own 83 | clean takes a directory of html files and yields a new 84 | directory of text files 85 | getmeta takes a directory of html files and yields a csv file 86 | containing metadata 87 | search compare fanworks with the original script 88 | matrix deduplicates and builds matrix for best n-gram matches 89 | format takes a script and outputs a csv with sentiment 90 | information for each word formatted for javascript 91 | visualization 92 | 93 | optional arguments: 94 | -h, --help show this help message and exit 95 | ``` 96 | There are three scraping options for Archive of Our Own: 97 | (1) Use the '-s' option to provide a search term and see a list of possible tags. 98 | (2) Use the '-t' option to scrape fanworks from a tag. 99 | (3) Use the '-u' option to scrape fanworks from a URL. The URL should be to the /works page, 100 | e.g.
https://archiveofourown.org/tags/Rogue%20One:%20A%20Star%20Wars%20Story%20(2016)/works 101 | ``` 102 | usage: ao3.py scrape [-h] [-s SEARCH | -t TAG | -u URL] [-o OUT] 103 | [-p STARTPAGE] 104 | 105 | optional arguments: 106 | -h, --help show this help message and exit 107 | -s SEARCH, --search SEARCH 108 | search term to search for a tag to scrape 109 | -t TAG, --tag TAG the tag to be scraped 110 | -u URL, --url URL the full URL of first page to be scraped 111 | -o OUT, --out OUT target directory for scraped html files 112 | -p STARTPAGE, --startpage STARTPAGE 113 | page on which to begin downloading (to resume a 114 | previous job) 115 | ``` 116 | Clean and convert the scraped html files into plain text files. 117 | ``` 118 | usage: ao3.py clean [-h] [-o O] i 119 | 120 | positional arguments: 121 | i directory of input html files to clean 122 | 123 | optional arguments: 124 | -h, --help show this help message and exit 125 | -o O target directory for output txt files 126 | ``` 127 | Extract Archive of Our Own metadata from the scraped html files. 128 | ``` 129 | usage: ao3.py getmeta [-h] [-o O] i 130 | 131 | positional arguments: 132 | i directory of input html files to process 133 | 134 | optional arguments: 135 | -h, --help show this help message and exit 136 | -o O filename for metadata csv file 137 | ``` 138 | The search process compares fanworks with the original work script and is based on 6-gram matches. 139 | ``` 140 | usage: ao3.py search [-h] d s 141 | 142 | positional arguments: 143 | d directory of fanwork text files 144 | s filename for markup version of script 145 | 146 | optional arguments: 147 | -h, --help show this help message and exits 148 | ``` 149 | The n-gram search results can be used to create a matrix. 150 | ``` 151 | usage: ao3.py matrix [-h] [-n N] i m 152 | 153 | positional arguments: 154 | i input csv file 155 | m fandom/movie name for output file prefix 156 | 157 | optional arguments: 158 | -h, --help show this help message and exit 159 | -n N n-gram size, default is 6-grams 160 | ``` 161 | The n-gram search results can be prepared for JavaScript visualization. 162 | ``` 163 | usage: ao3.py format [-h] [-o O] s 164 | 165 | positional arguments: 166 | s filename for markup version of script 167 | 168 | optional arguments: 169 | -h, --help show this help message and exit 170 | -o O filename for csv output file of data formatted for visualization 171 | s``` 172 | 173 | -------------------------------------------------------------------------------- /_deprecated.py: -------------------------------------------------------------------------------- 1 | def cosine_distance(row_values, col_values): 2 | """Calculate the cosine distance between two vectors. Also 3 | accepts matrices and 2-d arrays, and calculates the 4 | distances over the cross product of rows and columns. 5 | """ 6 | verr_msg = '`cosine_distance` is not defined for {}-dimensional arrays.' 
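# row_values is treated as a stack of row vectors with shape (n_rows, dim),
# and col_values as a stack of column vectors with shape (dim, n_cols); the
# result is an (n_rows, n_cols) array of cosine distances. For example,
# cosine_distance(numpy.eye(3), numpy.eye(3)[:, :2]) has shape (3, 2).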
7 | if len(row_values.shape) == 1: 8 | row_values = row_values[None,:] 9 | elif len(row_values.shape) != 2: 10 | raise ValueError(verr_msg.format(len(row_values.shape))) 11 | 12 | if len(col_values.shape) == 1: 13 | col_values = col_values[:,None] 14 | elif len(col_values.shape) != 2: 15 | raise ValueError(verr_msg.format(len(col_values.shape))) 16 | 17 | row_norm = (row_values * row_values).sum(axis=1) ** 0.5 18 | row_norm = row_norm[:,None] 19 | 20 | col_norm = (col_values * col_values).sum(axis=0) ** 0.5 21 | col_norm = col_norm[None,:] 22 | 23 | result = row_values @ col_values 24 | result /= row_norm 25 | result /= col_norm 26 | return 1 - result 27 | 28 | def make_match_strata(records, record_structure, num_strata, max_threshold): 29 | combined_ix = record_structure['fields'].index('BEST_COMBINED_DISTANCE') 30 | low = [i / num_strata * max_threshold 31 | for i in range(0, num_strata)] 32 | high = [i / num_strata * max_threshold 33 | for i in range(1, num_strata + 1)] 34 | ranges = zip(low, high) 35 | 36 | return [[r for r in records[1:] 37 | if r[combined_ix] >= low and r[combined_ix] < high] 38 | for low, high in ranges] 39 | 40 | def label_match_strata(num_strata, max_threshold): 41 | high = [i / num_strata * max_threshold 42 | for i in range(1, num_strata + 1)] 43 | return ['Number of matches below threshold {:.2}'.format(h) 44 | for h in high] 45 | 46 | def chart_match_strata(records, 47 | num_strata=5, max_threshold=1, 48 | start=1, end=None, 49 | figsize=(15, 10), 50 | colormap='plasma', 51 | legend=True): 52 | match_strata = make_match_strata(records, new_record_structure, num_strata, max_threshold) 53 | 54 | cumulative_strata = [match_strata[0:i] for i in 55 | range(len(match_strata), 0, -1)] 56 | match_counters = [Counter(row[4] for matches in strata for row in matches) 57 | for strata in cumulative_strata] 58 | maxn = max(max(mc) for mc in match_counters if mc) 59 | match_cols = [[mc[n] for mc in match_counters] 60 | for n in range(maxn + 1)] 61 | 62 | col_names = label_match_strata(num_strata, max_threshold) 63 | col_names.reverse() 64 | df = pd.DataFrame(match_cols, 65 | index = range(maxn + 1), 66 | columns=col_names) 67 | df.index.name = 'Word index in original script' 68 | df = df.loc[start:end] 69 | df.plot(figsize=figsize, colormap=colormap, legend=legend) 70 | 71 | def most_frequent_matches(records, n_matches, threshold): 72 | ct = Counter(r[3] for r in records if r[-1] < threshold) 73 | ix_to_context = {r[3]: r[4] for r in records} 74 | matches = ct.most_common(n_matches) 75 | return [(i, c, ix_to_context[i]) 76 | for i, c in matches] 77 | return matches 78 | 79 | # ---------------- 80 | # matrix functions 81 | # ---------------- 82 | 83 | def add_matrix_subparser(subparsers): 84 | # Create n-gram matrices (deprecated) 85 | matrix_parser = subparsers.add_parser('matrix', help='deduplicates and builds matrix for best n-gram matches') 86 | matrix_parser.add_argument('i', action='store', help='input csv file') 87 | matrix_parser.add_argument('m', action = 'store', help='fandom/movie name for output file prefix') 88 | matrix_parser.add_argument('-n', action='store', default=6, help='n-gram size, default is 6-grams') 89 | matrix_parser.set_defaults(func=process) 90 | 91 | class StrictNgramDedupe(object): 92 | def __init__(self, data_path, ngram_size): 93 | self.ngram_size = ngram_size 94 | 95 | with open(data_path, encoding='UTF8') as ip: 96 | rows = list(csv.DictReader(ip)) 97 | self.data = rows 98 | self.work_matches = collections.defaultdict(list) 99 | 100 | for r in 
rows: 101 | self.work_matches[r['FAN_WORK_FILENAME']].append(r) 102 | 103 | # Use n-gram starting index as a unique identifier. 104 | self.starts_counter = collections.Counter( 105 | start 106 | for matches in self.work_matches.values() 107 | for start in self.to_ngram_starts(self.segment_full(matches)) 108 | ) 109 | 110 | filtered_matches = [self.top_ngram(span) 111 | for matches in self.work_matches.values() 112 | for span in self.segment_full(matches)] 113 | 114 | self.filtered_matches = [ng for ng in filtered_matches 115 | if self.no_better_match(ng)] 116 | 117 | def num_ngrams(self): 118 | return len(set(int(ng[0]['ORIGINAL_SCRIPT_WORD_INDEX']) 119 | for ng in self.filtered_matches)) 120 | 121 | def match_to_phrase(self, match): 122 | return ' '.join(m['ORIGINAL_SCRIPT_WORD'].lower() for m in match) 123 | 124 | def write_match_work_count_matrix(self, out_filename): 125 | ngrams = {} 126 | works = set() 127 | cells = collections.defaultdict(int) 128 | for m in self.filtered_matches: 129 | phrase = self.match_to_phrase(m) 130 | ix = int(m[0]['ORIGINAL_SCRIPT_WORD_INDEX']) 131 | filename = m[0]['FAN_WORK_FILENAME'] 132 | 133 | ngrams[phrase] = ix 134 | works.add(filename) 135 | cells[(filename, phrase)] += 1 136 | 137 | ngrams = sorted(ngrams, key=ngrams.get) 138 | works = sorted(works) 139 | rows = [[cells[(fn, ng)] for ng in ngrams] 140 | for fn in works] 141 | totals = [sum(r[col] for r in rows) for col in range(len(rows[0]))] 142 | 143 | header = ['FILENAME'] + ngrams 144 | totals = ['(total)'] + totals 145 | rows = [[fn] + r for fn, r in zip(works, rows)] 146 | rows = [header, totals] + rows 147 | 148 | with open(out_filename, 'w', encoding='utf-8') as op: 149 | csv.writer(op).writerows(rows) 150 | 151 | def write_match_sentiment(self, out_filename): 152 | phrases = {} 153 | for m in self.filtered_matches: 154 | phrase = self.match_to_phrase(m) 155 | ix = int(m[0]['ORIGINAL_SCRIPT_WORD_INDEX']) 156 | phrases[phrase] = ix 157 | sorted_phrases = sorted(phrases, key=phrases.get) 158 | 159 | phrase_indices = [phrases[p] for p in sorted_phrases] 160 | phrases = sorted_phrases 161 | 162 | if emolex: 163 | emo_count = [emolex.lex_count(p) for p in phrases] 164 | emo_sent_count = self.project_sentiment_keys(emo_count, 165 | ['NEGATIVE', 'POSITIVE']) 166 | emo_emo_count = self.project_sentiment_keys(emo_count, 167 | ['ANTICIPATION', 168 | 'ANGER', 169 | 'TRUST', 170 | 'SADNESS', 171 | 'DISGUST', 172 | 'SURPRISE', 173 | 'FEAR', 174 | 'JOY']) 175 | if bing: 176 | bing_count = [bing.lex_count(p) for p in phrases] 177 | bing_count = self.project_sentiment_keys(bing_count, 178 | ['NEGATIVE', 'POSITIVE']) 179 | 180 | if liwc: 181 | liwc_count = [liwc.lex_count(p) for p in phrases] 182 | liwc_sent_count = self.project_sentiment_keys(liwc_count, 183 | ['POSEMO', 'NEGEMO']) 184 | liwc_other_keys = set(k for ct in liwc_count for k in ct.keys()) 185 | liwc_other_keys -= set(['POSEMO', 'NEGEMO']) 186 | liwc_other_count = self.project_sentiment_keys(liwc_count, 187 | liwc_other_keys) 188 | 189 | counts = [] 190 | count_labels = [] 191 | 192 | if emolex: 193 | counts.append(emo_emo_count) 194 | counts.append(emo_sent_count) 195 | count_labels.append('NRC_EMOTION_') 196 | count_labels.append('NRC_SENTIMENT_') 197 | 198 | counts.append(bing_count) 199 | count_labels.append('BING_SENTIMENT_') 200 | 201 | if liwc: 202 | counts.append(liwc_sent_count) 203 | counts.append(liwc_other_count) 204 | count_labels.append('LIWC_SENTIMENT_') 205 | count_labels.append('LIWC_ALL_OTHER_') 206 | 207 | rows = 
self.compile_sentiment_groups(counts, count_labels) 208 | 209 | for r, p, i in zip(rows, phrases, phrase_indices): 210 | r['{}-GRAM'.format(self.ngram_size)] = p 211 | r['{}-GRAM_START_INDEX'.format(self.ngram_size)] = i 212 | 213 | fieldnames = sorted(set(k for r in rows for k in r.keys())) 214 | totals = collections.defaultdict(int) 215 | skipkeys = ['{}-GRAM_START_INDEX'.format(self.ngram_size), 216 | '{}-GRAM'.format(self.ngram_size)] 217 | totals[skipkeys[0]] = 0 218 | totals[skipkeys[1]] = '(total)' 219 | for r in rows: 220 | for k in r: 221 | if k not in skipkeys: 222 | totals[k] += r[k] 223 | rows = [totals] + rows 224 | 225 | with open(out_filename, 'w', encoding='utf-8') as op: 226 | wr = csv.DictWriter(op, fieldnames=fieldnames) 227 | wr.writeheader() 228 | wr.writerows(rows) 229 | 230 | def project_sentiment_keys(self, counts, keys): 231 | counts = [{k: ct.get(k, 0) for k in keys} 232 | for ct in counts] 233 | for ct in counts: 234 | if sum(ct.values()) == 0: 235 | ct['UNDETERMINED'] = 1 236 | else: 237 | ct['UNDETERMINED'] = 0 238 | 239 | return counts 240 | 241 | def compile_sentiment_groups(self, groups, prefixes): 242 | new_rows = [] 243 | for group_row in zip(*groups): 244 | new_row = {} 245 | for gr, pf in zip(group_row, prefixes): 246 | for k, v in gr.items(): 247 | new_row[pf + k] = v 248 | new_rows.append(new_row) 249 | return new_rows 250 | 251 | def get_spans(self, indices): 252 | starts = [0] 253 | ends = [] 254 | for i in range(1, len(indices)): 255 | if indices[i] != indices[i - 1] + 1: 256 | starts.append(i) 257 | ends.append(i) 258 | ends.append(len(indices)) 259 | return list(zip(starts, ends)) 260 | 261 | def segment_matches(self, matches, key): 262 | matches = sorted(matches, key=lambda m: int(m[key])) 263 | indices = [int(m[key]) for m in matches] 264 | return [[matches[i] for i in range(start, end)] 265 | for start, end in self.get_spans(indices)] 266 | 267 | def segment_fan_matches(self, matches): 268 | return self.segment_matches(matches, 'FAN_WORK_WORD_INDEX') 269 | 270 | def segment_orig_matches(self, matches): 271 | return self.segment_matches(matches, 'ORIGINAL_SCRIPT_WORD_INDEX') 272 | 273 | def segment_full(self, matches): 274 | return [orig_m 275 | for fan_m in self.segment_fan_matches(matches) 276 | for orig_m in self.segment_orig_matches(fan_m) 277 | if len(orig_m) >= self.ngram_size] 278 | 279 | def to_ngram_starts(self, match_spans): 280 | return [int(ms[i]['ORIGINAL_SCRIPT_WORD_INDEX']) 281 | for ms in match_spans 282 | for i in range(len(ms) - self.ngram_size + 1)] 283 | 284 | def start_count_key(self, span): 285 | def key(i): 286 | script_ix = int(span[i]['ORIGINAL_SCRIPT_WORD_INDEX']) 287 | return self.starts_counter.get(script_ix, 0) 288 | return key 289 | 290 | def no_better_match(self, ng): 291 | start = int(ng[0]['ORIGINAL_SCRIPT_WORD_INDEX']) 292 | best_start = max(range(start - self.ngram_size + 1, 293 | start + self.ngram_size), 294 | key=self.starts_counter.__getitem__) 295 | return start == best_start 296 | 297 | def top_ngram(self, span): 298 | start = max( 299 | range(len(span) - self.ngram_size + 1), 300 | key=self.start_count_key(span) 301 | ) 302 | return span[start: start + self.ngram_size] 303 | 304 | 305 | def process(inputs): 306 | ngram_size = inputs['n'] 307 | in_file = inputs['i'] 308 | out_prefix = inputs['m'] 309 | 310 | matrix_out = '{}-most-common-perfect-matches-no-overlap-{}-gram-match-matrix.csv'.format(out_prefix, ngram_size) 311 | sentiment_out = 
'{}-most-common-perfect-matches-no-overlap-{}-gram-sentiment.csv'.format(out_prefix, ngram_size) 312 | 313 | dd = StrictNgramDedupe(in_file, ngram_size=ngram_size) 314 | #print(dd.num_ngrams()) 315 | 316 | dd.write_match_work_count_matrix(matrix_out) 317 | -------------------------------------------------------------------------------- /ao3.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | 3 | import re 4 | import os 5 | import sys 6 | import json 7 | import csv 8 | import random 9 | import argparse 10 | import requests 11 | import collections 12 | from time import sleep 13 | 14 | import numpy 15 | import pandas as pd 16 | from bs4 import BeautifulSoup 17 | 18 | import search 19 | import vis 20 | 21 | try: 22 | import lextrie 23 | bing = lextrie.LexTrie.from_plugin('bing') 24 | 25 | try: 26 | emolex = lextrie.LexTrie.from_plugin('emolex_en') 27 | except Exception: 28 | emolex = None 29 | 30 | try: 31 | liwc = lextrie.LexTrie.from_plugin('liwc') 32 | except Exception: 33 | liwc = None 34 | except ImportError: 35 | bing = None 36 | emolex = None 37 | liwc = None 38 | 39 | 40 | # ----------------------------------------------------------------------------- 41 | # HTML TO TXT FUNCTIONS 42 | # --------------------- 43 | 44 | def get_fan_work(fan_html_name): 45 | with open(fan_html_name, encoding='utf8') as fan_in: 46 | fan_html = BeautifulSoup(fan_in.read(), "lxml") 47 | fan_txt = fan_html.find(id='workskin') 48 | if fan_txt is None: 49 | return '' 50 | 51 | fan_txt = ' '.join(fan_txt.strings) 52 | fan_txt = re.split(r'Work Text\b([\s:]*)', fan_txt, maxsplit=1)[-1] 53 | fan_txt = re.split(r'Chapter 1\b([\s:]*)', fan_txt, maxsplit=1)[-1] 54 | fan_txt = fan_txt.replace('Chapter Text', ' ') 55 | fan_txt = re.sub(r'\s+', ' ', fan_txt).strip() 56 | return fan_txt 57 | 58 | def convert_dir(args): 59 | html_dir = args.input 60 | out_dir = args.output 61 | 62 | try: 63 | os.makedirs(out_dir) 64 | except Exception: 65 | pass 66 | 67 | errors = [] 68 | for infile in os.listdir(html_dir): 69 | base, ext = os.path.splitext(infile) 70 | outfile = os.path.join(out_dir, base + '.txt') 71 | infile = os.path.join(html_dir, infile) 72 | 73 | if not os.path.exists(outfile): 74 | text = get_fan_work(infile) 75 | if text: 76 | with open(outfile, 'w', encoding='utf-8') as out: 77 | out.write(text) 78 | else: 79 | errors.append(infile) 80 | 81 | error_outfile = 'clean-html-errors.txt' 82 | with open(error_outfile, 'w', encoding='utf-8') as out: 83 | out.write('The following files were not converted:\n\n') 84 | for e in errors: 85 | out.write(e) 86 | out.write('\n') 87 | 88 | # ------------------ 89 | # METADATA FUNCTIONS 90 | # ------------------ 91 | 92 | def select_text(soup_node, selector): 93 | sel = soup_node.select(selector) 94 | return sel[0].get_text().strip() if sel else 'AOOO_UNSPECIFIED' 95 | # "AOOO_UNSPECIFIED" means value not in An Archive of Our Own metadata field 96 | 97 | meta_headers = ['FILENAME', 'TITLE', 'AUTHOR', 'SUMMARY', 'NOTES', 98 | 'PUBLICATION_DATE', 'LANGUAGE', 'TAGS'] 99 | def get_fan_meta(fan_html_name): 100 | with open(fan_html_name, encoding='utf8') as fan_in: 101 | fan_html = BeautifulSoup(fan_in.read(), 'lxml') 102 | 103 | title = select_text(fan_html, '.title.heading') 104 | author = select_text(fan_html, '.byline.heading') 105 | summary = select_text(fan_html, '.summary.module') 106 | notes = select_text(fan_html, '.notes.module') 107 | date = select_text(fan_html, 'dd.published') 108 | language = select_text(fan_html, 
'dd.language') 109 | tags = {k.get_text().strip().strip(':'): 110 | v.get_text(separator='; ').strip().strip('\n; ') 111 | for k, v in 112 | zip(fan_html.select('dt.tags'), fan_html.select('dd.tags'))} 113 | tags = json.dumps(tags) 114 | 115 | path, filename = os.path.split(fan_html_name) 116 | 117 | vals = [filename, title, author, summary, notes, 118 | date, language, tags] 119 | return dict(zip(meta_headers, vals)) 120 | 121 | def collect_meta(args): 122 | in_dir = args.input 123 | out_file = args.output 124 | 125 | errors = [] 126 | rows = [] 127 | for infile in os.listdir(in_dir): 128 | infile = os.path.join(in_dir, infile) 129 | rows.append(get_fan_meta(infile)) 130 | 131 | error_outfile = out_file + '-errors.txt' 132 | with open(error_outfile, 'w', encoding='utf-8') as out: 133 | out.write('Metadata could not be collected from the following files:\n\n') 134 | for e in errors: 135 | out.write(e) 136 | out.write('\n') 137 | 138 | csv_outfile = out_file + '.csv' 139 | with open(csv_outfile, 'w', encoding='utf-8') as out: 140 | wr = csv.DictWriter(out, fieldnames=meta_headers) 141 | wr.writeheader() 142 | for row in rows: 143 | wr.writerow(row) 144 | 145 | #---------------- 146 | #SCRAPE FUNCTIONS 147 | #---------------- 148 | class Logger: 149 | def __init__(self, logfile='log.txt'): 150 | self.logfile = logfile 151 | 152 | def log(self, msg, newline=True): 153 | with open(self.logfile, 'a') as f: 154 | f.write(msg) 155 | if newline: 156 | f.write('\n') 157 | 158 | _logger = Logger() 159 | log = _logger.log 160 | 161 | _error_id_log = Logger(logfile='error-ids.txt') 162 | log_error_id = _error_id_log.log 163 | 164 | def load_error_ids(): 165 | with open(_error_id_log.logfile, 'w+') as ip: 166 | ids = set(l.strip() for l in ip.readlines()) 167 | return ids 168 | 169 | class InlineDisplay: 170 | def __init__(self): 171 | self.currlen = 0 172 | 173 | def display(self, s): 174 | print(s, end=' ') 175 | sys.stdout.flush() 176 | self.currlen += len(s) + 1 177 | 178 | def reset(self): 179 | print('', end='\r') 180 | print(' ' * self.currlen, end='\r') 181 | sys.stdout.flush() 182 | self.currlen = 0 183 | 184 | _id = InlineDisplay() 185 | display = _id.display 186 | reset_display = _id.reset 187 | 188 | def request_loop(url, timeout=4.0, sleep_base=1.0): 189 | # We try 20 times. But we double the delay each time, 190 | # so that we don't get really annoying. Eventually the 191 | # delay will be more than an hour long, at which point 192 | # we'll try a few more times, and then give up. 193 | 194 | orig_url = url 195 | for i in range(20): 196 | if sleep_base > 7200: # Only delay up to an hour. 
197 | sleep_base /= 2 198 | url = '{}#{}'.format(orig_url, random.randrange(1000)) 199 | display('Sleeping for {} seconds;'.format(sleep_base)) 200 | sleep(sleep_base) 201 | try: 202 | response = requests.get(url, timeout=timeout) 203 | response.raise_for_status() 204 | return response.text 205 | except requests.exceptions.HTTPError: 206 | code = response.status_code 207 | if code >= 400 and code < 500: 208 | display('Unrecoverable error ({})'.format(code)) 209 | return '' 210 | else: 211 | sleep_base *= 2 212 | display('Recoverable error ({});'.format(code)) 213 | except requests.exceptions.ReadTimeout as exc: 214 | sleep_base *= 2 215 | display('Read timed out -- trying again;') 216 | except requests.exceptions.RequestException as exc: 217 | sleep_base *= 2 218 | display('Unexpected error ({}), trying again;\n'.format(exc)) 219 | else: 220 | return None 221 | 222 | def scrape(args): 223 | search_term = args.search 224 | tag = args.tag 225 | header = args.url 226 | out_dir = args.out 227 | end = args.startpage 228 | 229 | # tag scraping option 230 | if search_term: 231 | pp = 1 232 | safe_search = search_term.replace(' ', '+') 233 | # an alternative here is to scrape this page and use regex to filter the results: 234 | # http://archiveofourown.org/media/Movies/fandoms? 235 | # the canonical filter is used here because the "fandom" filter on the 236 | # beta tag search is broken as of November 2017 237 | search_ref = "http://archiveofourown.org/tags/search?utf8=%E2%9C%93&query%5Bname%5D=" + safe_search + "&query%5Btype%5D=&query%5Bcanonical%5D=true&page=" 238 | print('\nTags:') 239 | 240 | tags = ["initialize"] 241 | while (len(tags)) != 0: 242 | results_page = requests.get(search_ref + str(pp)) 243 | results_soup = BeautifulSoup(results_page.text, "lxml") 244 | tags = results_soup(attrs={'href': re.compile('^/tags/[^s]....[^?].*')}) 245 | 246 | for x in tags: 247 | print(x.string) 248 | 249 | pp += 1 250 | 251 | # fan work scraping options 252 | if header or tag: 253 | try: 254 | os.makedirs(out_dir) 255 | except Exception: 256 | pass 257 | 258 | os.chdir(out_dir) 259 | error_works = load_error_ids() 260 | 261 | results = ["initialize"] 262 | while (len(results)) != 0: 263 | log('\n\nPAGE ' + str(end)) 264 | print('Page {} '.format(end)) 265 | 266 | display('Loading table of contents;') 267 | 268 | if tag: 269 | mod_header = tag.replace(' ', '%20') 270 | header = "http://archiveofourown.org/tags/" + mod_header + "/works" 271 | 272 | request_url = header + "?page=" + str(end) 273 | toc_page = request_loop(request_url) 274 | if not toc_page: 275 | err_msg = 'Error loading TOC; aborting.' 
276 | log(err_msg) 277 | display(err_msg) 278 | reset_display() 279 | continue 280 | 281 | toc_page_soup = BeautifulSoup(toc_page, "lxml") 282 | results = toc_page_soup(attrs={'href': re.compile('^/works/[0-9]+[0-9]$')}) 283 | 284 | log('Number of Works on Page {}: {}'.format(end, len(results))) 285 | log('Page URL: {}'.format(request_url)) 286 | log('Progress: ') 287 | 288 | reset_display() 289 | 290 | for x in results: 291 | body = str(x).split('"') 292 | docID = str(body[1]).split('/')[2] 293 | filename = str(docID) + '.html' 294 | 295 | if os.path.exists(filename): 296 | display('Work {} already exists -- skpping;'.format(docID)) 297 | reset_display() 298 | msg = ('skipped existing document {} on ' 299 | 'page {} ({} bytes)') 300 | log(msg.format(docID, str(end), 301 | os.path.getsize(filename))) 302 | elif docID in error_works: 303 | display('Work {} is known to cause errors ' 304 | '-- skipping;'.format(docID)) 305 | reset_display() 306 | msg = ('skipped document {} on page {} ' 307 | 'known to cause errors') 308 | log(msg.format(docID, str(end))) 309 | 310 | else: 311 | display('Loading work {};'.format(docID)) 312 | work_request_url = "https://archiveofourown.org/" + body[1] + "?view_adult=true&view_full_work=true" 313 | work_page = request_loop(work_request_url) 314 | 315 | if work_page is None: 316 | error_works.add(docID) 317 | log_error_id(docID) 318 | continue 319 | 320 | with open(filename, 'w', encoding='utf-8') as html_out: 321 | bytes_written = html_out.write(str(work_page)) 322 | 323 | msg = 'reached document {} on page {}, saved {} bytes' 324 | log(msg.format(docID, str(end), bytes_written)) 325 | reset_display() 326 | 327 | reset_display() 328 | end = end + 1 329 | 330 | # ----------------------------------- 331 | # data visualization format functions 332 | # ----------------------------------- 333 | def project_sentiment_keys_shortform(counts, keys): 334 | counts = [{k: ct.get(k, 0) for k in keys} 335 | for ct in counts] 336 | for ct in counts: 337 | if sum(ct.values()) == 0: 338 | ct['UNDETERMINED'] = 1 339 | else: 340 | ct['UNDETERMINED'] = 0 341 | return counts 342 | 343 | def regex(name): 344 | return (re.sub('(?P (\w))*(\\(.*\\))', '\g', name)).strip() 345 | 346 | def format_data(args): 347 | original_script_markup = args.script 348 | match_table = args.matches 349 | output = args.output 350 | 351 | matches = pd.read_csv(match_table) 352 | 353 | name = 'Frequency of Reuse (Exact Matches)' 354 | positive_match = matches.BEST_COMBINED_DISTANCE <= 0 355 | matches_thresh = matches.assign(**{name: positive_match}) 356 | 357 | thresholds = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5] 358 | threshname = ['Frequency of Reuse (0-{})'.format(str(t)) for t in thresholds] 359 | for thresh, name in zip(thresholds, threshname): 360 | positive_match = matches.BEST_COMBINED_DISTANCE <= thresh 361 | matches_thresh = matches_thresh.assign(**{name: positive_match}) 362 | thresholds = [0] + thresholds 363 | threshname = ['Frequency of Reuse (Exact Matches)'] + threshname 364 | 365 | os_markup_raw = search.load_markup_script(original_script_markup) 366 | os_markup_header = os_markup_raw[0] 367 | os_markup_raw = os_markup_raw[1:] 368 | 369 | lt = emolex # LexTrie.from_plugin('emolex_en') 370 | emo_terms = ['ANGER', 371 | 'ANTICIPATION', 372 | 'DISGUST', 373 | 'FEAR', 374 | 'JOY', 375 | 'SADNESS', 376 | 'SURPRISE', 377 | 'TRUST', 378 | 'NEGATIVE', 379 | 'POSITIVE'] 380 | 381 | os_markup_header.extend(emo_terms) 382 | for r in os_markup_raw: 383 | emos = lt.get_lex_tags(r[0]) 
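# emos holds the lexicon tags matched for this word (lt is the emolex trie);
# the next line appends one 0/1 indicator column per term in emo_terms.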
384 | r.extend(int(t in emos) for t in emo_terms) 385 | 386 | os_markup = pd.DataFrame(os_markup_raw, columns=os_markup_header) 387 | os_markup.index.name = 'ORIGINAL_SCRIPT_WORD_INDEX' 388 | 389 | #os_markup.CHARACTER = os_markup.CHARACTER.apply(regex) 390 | top_eight = collections.Counter(os_markup.CHARACTER).most_common(8) 391 | top_eight_list = [] 392 | for (name_top,val) in top_eight: 393 | top_eight_list = top_eight_list + [name_top] 394 | top_eight_char = ["CHARACTER_" + name for name in top_eight_list] 395 | used_names = [] 396 | name_char = top_eight[0][0] 397 | positive_match = 1 * (os_markup.CHARACTER == name_char) 398 | matches_name = os_markup.assign(**{"CHARACTER_" + name_char.upper(): positive_match}) 399 | used_names = ["CHARACTER_" + name_char] + used_names 400 | 401 | for top_name in top_eight_list: 402 | if "CHARACTER_" + top_name not in used_names: 403 | positive_matches = 1 * (os_markup.CHARACTER == top_name) 404 | matches_name = matches_name.assign(**{"CHARACTER_" + top_name.upper(): positive_matches}) 405 | used_names = used_names + [top_name] 406 | 407 | match_word_counts = matches_thresh.groupby( 408 | 'ORIGINAL_SCRIPT_WORD_INDEX' 409 | ).aggregate({ 410 | name: numpy.sum for name in threshname 411 | }) 412 | 413 | match_word_counts = match_word_counts.reindex( 414 | matches_name.index, 415 | fill_value=0 416 | ) 417 | 418 | match_word_words = matches_thresh.groupby( 419 | 'ORIGINAL_SCRIPT_WORD_INDEX' 420 | ).aggregate({ 421 | 'ORIGINAL_SCRIPT_WORD': numpy.max, 422 | }) 423 | 424 | match_word_counts = match_word_counts.join(match_word_words) 425 | 426 | match_count = (match_word_counts.join(matches_name)) 427 | 428 | match_count.to_csv(output) 429 | 430 | def _format_data_sentiment_only(args): 431 | fin = args.s 432 | fout = args.o 433 | 434 | markup_script = search.load_markup_script(fin) 435 | markup_script = markup_script[1:] 436 | list_script = [[i] + r for i, r in enumerate(markup_script)] 437 | 438 | csv_script = pd.DataFrame(list_script) 439 | csv_script.columns = ['ORIGINAL_SCRIPT_INDEX', 440 | 'LOWERCASE', 441 | 'SPACY_ORTH_ID', 442 | 'SCENE', 443 | 'CHARACTER'] 444 | 445 | bing_count = [bing.lex_count(j[1]) for j in list_script] 446 | bing_sentiment_keys = ['NEGATIVE', 'POSITIVE'] 447 | bing_count = project_sentiment_keys_shortform(bing_count, bing_sentiment_keys) 448 | bing_DF = pd.DataFrame(bing_count) 449 | 450 | bing_DF['ORIGINAL_SCRIPT_INDEX'] = csv_script['ORIGINAL_SCRIPT_INDEX'] 451 | out = pd.merge(csv_script, bing_DF, on='ORIGINAL_SCRIPT_INDEX') 452 | 453 | if emolex: 454 | emo_count = [emolex.lex_count(j[1]) for j in list_script] 455 | emo_sentiment_keys = ['ANTICIPATION', 'ANGER', 'TRUST', 'SADNESS','DISGUST', 456 | 'SURPRISE', 'FEAR', 'JOY', 'NEGATIVE', 'POSITIVE'] 457 | emo_count = project_sentiment_keys_shortform(emo_count, emo_sentiment_keys) 458 | emo_DF = pd.DataFrame(emo_count) 459 | emo_DF['ORIGINAL_SCRIPT_INDEX'] = csv_script['ORIGINAL_SCRIPT_INDEX'] 460 | out = pd.merge(out, emo_DF, on='ORIGINAL_SCRIPT_INDEX') 461 | 462 | if liwc: 463 | liwc_count = [liwc.lex_count(j[1]) for j in list_script] 464 | 465 | liwc_sentiment_keys = ['POSEMO', 'NEGEMO'] 466 | liwc_sent_count = project_sentiment_keys_shortform(liwc_count, liwc_sentiment_keys) 467 | liwc_sent_DF = pd.DataFrame(liwc_sent_count) 468 | liwc_sent_DF['ORIGINAL_SCRIPT_INDEX'] = csv_script['ORIGINAL_SCRIPT_INDEX'] 469 | out = pd.merge(out, liwc_sent_DF, on='ORIGINAL_SCRIPT_INDEX') 470 | 471 | liwc_other_keys = set(k for ct in liwc_count for k in ct.keys()) 472 | liwc_other_keys -= 
set(['POSEMO', 'NEGEMO']) #already used these 473 | liwc_other_count = project_sentiment_keys_shortform(liwc_count, liwc_other_keys) 474 | liwc_other_DF = pd.DataFrame(liwc_other_count) 475 | liwc_other_DF['ORIGINAL_SCRIPT_INDEX'] = csv_script['ORIGINAL_SCRIPT_INDEX'] 476 | out = pd.merge(out, liwc_other_DF, on='ORIGINAL_SCRIPT_INDEX') 477 | 478 | out.to_csv(fout + '.csv', index=False) 479 | 480 | # ----------------------------------------------------------------------------- 481 | # SCRIPT 482 | # ------ 483 | 484 | if __name__ == '__main__': 485 | 486 | parser = argparse.ArgumentParser(description='process fanworks scraped from Archive of Our Own.') 487 | subparsers = parser.add_subparsers(help='scrape, clean, getmeta, search, format, or vis') 488 | 489 | #sub-parsers 490 | scrape_parser = subparsers.add_parser('scrape', help='find and scrape fanfiction works from Archive of Our Own') 491 | group = scrape_parser.add_mutually_exclusive_group() 492 | group.add_argument('-s', '--search', action='store', help="search term to search for a tag to scrape") 493 | group.add_argument('-t', '--tag', action='store', help="the tag to be scraped") 494 | group.add_argument('-u', '--url', action='store', help="the full URL of first page to be scraped") 495 | scrape_parser.add_argument('-o', '--out', action='store', default=os.path.join('.','scraped-html'), help="target directory for scraped html files") 496 | scrape_parser.add_argument('-p', '--startpage', action='store', default=1, type=int, help="page on which to begin downloading (to resume a previous job)") 497 | scrape_parser.set_defaults(func=scrape) 498 | 499 | clean_parser = subparsers.add_parser('clean', help='takes a directory of html files and yields a new directory of text files') 500 | clean_parser.add_argument('input', action='store', help='directory of input html files to clean') 501 | clean_parser.add_argument('-o', '--output', action='store', default='plain-text', help='target directory for output txt files') 502 | clean_parser.set_defaults(func=convert_dir) 503 | 504 | meta_parser = subparsers.add_parser('getmeta', help='takes a directory of html files and yields a csv file containing metadata') 505 | meta_parser.add_argument('input', action='store', help='directory of input html files to process') 506 | meta_parser.add_argument('-o', '--output', action='store', default='fan-meta', help='filename for metadata csv file') 507 | meta_parser.set_defaults(func=collect_meta) 508 | 509 | validate_parser = subparsers.add_parser('validate', help='validate script markup') 510 | validate_parser.add_argument('script', action='store', help='filename for markup version of script') 511 | validate_parser.set_defaults(func=search.validate_cmd) 512 | 513 | # Search for reuse 514 | search_parser = subparsers.add_parser('search', help='compare fanworks with the original script') 515 | search_parser.add_argument('fan_works', action='store', help='directory of fanwork text files') 516 | search_parser.add_argument('script', action='store', help='filename for markup version of script') 517 | search_parser.add_argument('-n', '--num-works', default=-1, type=int, help="number of works to search (for subsampling)") 518 | search_parser.add_argument('-s', '--skip-works', default=0, type=int, help="number of works to skip (for subsampling)") 519 | search_parser.set_defaults(func=search.analyze) 520 | 521 | # Aggregate word-level counts 522 | data_parser = subparsers.add_parser('format', help='takes a script and outputs a csv with senitment information for each word 
formatted for javascript visualization') 523 | data_parser.add_argument('matches', action='store', help='filename for search output') 524 | data_parser.add_argument('script', action='store', help='filename for markup version of script') 525 | data_parser.add_argument('-o', '--output', action='store', default='js-data.csv', help='filename for csv output file of data formatted for visualization') 526 | data_parser.set_defaults(func=format_data) 527 | 528 | # Generate visualizaiton 529 | vis_parser = subparsers.add_parser('vis', 530 | help='takes a formatted data csv ' 531 | 'and generates an embeddable ' 532 | 'visualization') 533 | vis_parser.add_argument('input', action='store', 534 | help='the data csv to use for generating the ' 535 | 'visualization') 536 | vis_parser.add_argument('-s', '--static', action='store_true', 537 | default=False, 538 | help="save a full html file") 539 | vis_parser.add_argument('-o', '--output', action='store', 540 | default='reuse.html', 541 | help="output filename") 542 | vis_parser.add_argument('-w', '--words-per-chunk', type=int, default=140, 543 | help='number of words per script segment') 544 | vis_parser.set_defaults(func=vis.save_plot) 545 | 546 | # handle args 547 | args = parser.parse_args() 548 | 549 | # call function 550 | if hasattr(args, 'func'): 551 | args.func(args) 552 | else: 553 | parser.print_help() 554 | -------------------------------------------------------------------------------- /fanworks/.gitignore: -------------------------------------------------------------------------------- 1 | * 2 | !.gitignore 3 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pandas 2 | nearpy 3 | spacy 4 | requests 5 | bs4 6 | lxml 7 | python-Levenshtein 8 | bokeh 9 | https://github.com/senderle/lextrie/archive/master.zip 10 | -------------------------------------------------------------------------------- /results/.gitignore: -------------------------------------------------------------------------------- 1 | * 2 | !.gitignore 3 | -------------------------------------------------------------------------------- /scripts/.gitignore: -------------------------------------------------------------------------------- 1 | * 2 | !.gitignore 3 | -------------------------------------------------------------------------------- /search.py: -------------------------------------------------------------------------------- 1 | import multiprocessing 2 | import datetime 3 | import csv 4 | import os 5 | import re 6 | import sys 7 | import random 8 | from operator import itemgetter 9 | from collections import defaultdict 10 | 11 | import numpy 12 | import nearpy 13 | import spacy 14 | from Levenshtein import distance as lev_distance 15 | 16 | _SPACY_MODEL = None 17 | 18 | # Approximate nearest neighbors search settings: 19 | 20 | new_record_structure = { 21 | 'fields': ['FAN_WORK_FILENAME', 22 | 'FAN_WORK_WORD_INDEX', 23 | 'FAN_WORK_WORD', 24 | 'FAN_WORK_ORTH_ID', 25 | 'ORIGINAL_SCRIPT_WORD_INDEX', 26 | 'ORIGINAL_SCRIPT_WORD', 27 | 'ORIGINAL_SCRIPT_ORTH_ID', 28 | 'ORIGINAL_SCRIPT_CHARACTER', 29 | 'ORIGINAL_SCRIPT_SCENE', 30 | 'BEST_MATCH_DISTANCE', 31 | 'BEST_LEVENSHTEIN_DISTANCE', 32 | 'BEST_COMBINED_DISTANCE', 33 | ], 34 | 'types': [str, int, str, int, int, str, 35 | int, str, int, float, int, float 36 | ] 37 | } 38 | 39 | 40 | def get_spacy_model(): 41 | global _SPACY_MODEL 42 | if _SPACY_MODEL is None: 43 | _SPACY_MODEL = 
spacy.load('en_core_web_md', 44 | disable=['parser', 'tagger', 'ner']) 45 | return _SPACY_MODEL 46 | 47 | def sp_parse_chunks(txt, size=100000): 48 | spacy_model = get_spacy_model() 49 | 50 | start = 0 51 | if len(txt) < 100000: 52 | yield spacy_model(txt) 53 | return 54 | 55 | while start < len(txt): 56 | end = start + 100000 57 | if end > len(txt): 58 | end = len(txt) 59 | else: 60 | while txt[end] != ' ': 61 | end -= 1 62 | yield spacy_model(txt[start: end]) 63 | start = end + 1 64 | 65 | def mk_vectors(sp_txt): 66 | # Given a text, parse it into `spacy`'s native format, 67 | # and produce a sequence of vectors, one per token. 68 | 69 | rows = len(sp_txt) 70 | cols = len(sp_txt[0].vector) if rows else 0 71 | 72 | vectors = numpy.empty((rows, cols), dtype=float) 73 | for i, word in enumerate(sp_txt): 74 | if word.has_vector: 75 | vectors[i] = word.vector 76 | else: 77 | # `spacy` doesn't have a pre-trained vector for this word, 78 | # so give it a unique random vector. 79 | w_str = str(word) 80 | vectors[i] = 0 81 | vectors[i][hash(w_str) % cols] = 1.0 82 | vectors[i][hash(w_str * 2) % cols] = 1.0 83 | vectors[i][hash(w_str * 3) % cols] = 1.0 84 | return vectors 85 | 86 | def build_lsh_engine(orig, window_size, number_of_hashes, hash_dimensions): 87 | # Build the ngram vectors using rolling windows. 88 | # Variables named `*_win_vectors` contain vectors for 89 | # the given input, such that each row is the vector 90 | # for a single window. Successive windows overlap 91 | # at all words except for the first and last. 92 | 93 | orig_vectors = mk_vectors(orig) 94 | orig_win_vectors = numpy.array([orig_vectors[i:i + window_size, :].ravel() 95 | for i in range(orig_vectors.shape[0] - window_size + 1)]) 96 | 97 | # Initialize the approximate nearest neighbor search algorithm. 98 | # This creates the search "engine" and populates its index with 99 | # the window-vectors from the original script. We can then pass 100 | # over the window-vectors from a fan work, taking each vector 101 | # and searching for good matches in the engine's index of script 102 | # text. 103 | 104 | # We could do the search in the opposite direction, storing 105 | # fan text in the engine's index, and passing over window- 106 | # vectors from the original script, searching for matches in 107 | # the index of fan text. Unfortuantely, the quality of the 108 | # matches found goes down when you add too many values to the 109 | # engine's index. 
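# Each window-vector stored below is keyed by a (window start index, window
# text) tuple, and engine.neighbours() later returns
# (vector, (start_index, window_text), distance) tuples, which
# AnnIndexSearch.search() unpacks to recover the matching script span.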
110 | vector_dim = orig_win_vectors.shape[1] 111 | 112 | hashes = [] 113 | for i in range(number_of_hashes): 114 | h = nearpy.hashes.RandomBinaryProjections('rbp{}'.format(i), 115 | hash_dimensions) 116 | hashes.append(h) 117 | 118 | engine = nearpy.Engine(vector_dim, 119 | lshashes=hashes, 120 | distance=nearpy.distances.CosineDistance()) 121 | 122 | for ix, row in enumerate(orig_win_vectors): 123 | engine.store_vector(row, (ix, str(orig[ix: ix + window_size]))) 124 | return engine 125 | 126 | def multi_search_wrapper(work): 127 | result = _ANN_INDEX.search(work) 128 | return result 129 | 130 | class AnnIndexSearch(object): 131 | def __init__(self, original_script_filename, window_size, 132 | number_of_hashes, hash_dimensions, distance_threshold): 133 | orig_csv = load_markup_script(original_script_filename) 134 | orig_csv = orig_csv[1:] # drop header 135 | orig_csv = [[i] + r for i, r in enumerate(orig_csv)] 136 | # [['ORIGINAL_SCRIPT_INDEX', 137 | # 'LOWERCASE', 138 | # 'SPACY_ORTH_ID', 139 | # 'SCENE', 140 | # 'CHARACTER']] 141 | 142 | (self.word_index, 143 | self.word_lowercase, 144 | self.orth_id, 145 | self.scene, 146 | self.character) = zip(*orig_csv) 147 | 148 | self.window_size = window_size 149 | self.distance_threshold = distance_threshold 150 | self.spacy_model = get_spacy_model() 151 | orig_doc = spacy.tokens.Doc(self.spacy_model.vocab, self.word_lowercase) 152 | self.engine = build_lsh_engine(orig_doc, window_size, 153 | number_of_hashes, hash_dimensions) 154 | self.reset_stats() 155 | 156 | def reset_stats(self): 157 | self._windows_processed = 0 158 | 159 | @property 160 | def windows_processed(self): 161 | return self._windows_processed 162 | 163 | def search(self, filename): 164 | with open(filename, encoding='utf8') as fan_file: 165 | fan = fan_file.read() 166 | fan = [t for ch in sp_parse_chunks(fan) for t in ch if not t.is_space] 167 | 168 | # Create the fan windows: 169 | fan_vectors = mk_vectors(fan) 170 | fan_win_vectors = numpy.array( 171 | [fan_vectors[i:i + self.window_size, :].ravel() 172 | for i in range(fan_vectors.shape[0] - self.window_size + 1)] 173 | ) 174 | 175 | duplicate_records = defaultdict(list) 176 | for fan_ix, row in enumerate(fan_win_vectors): 177 | self._windows_processed += 1 178 | results = self.engine.neighbours(row) 179 | 180 | # Extract data about the original script 181 | # embedded in the engine's results. 182 | results = [(match_ix, match_str, distance) 183 | for vec, (match_ix, match_str), distance in results 184 | if distance < self.distance_threshold] 185 | 186 | # Create a new record with original script 187 | # information and fan work information. 
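# Each matching window is expanded into one candidate record per word
# position it covers, keyed by (filename, fan word index); when several
# windows cover the same fan-work word, the deduplication step below keeps
# only the record with the smallest combined distance.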
188 | for match_ix, match_str, distance in results: 189 | fan_context = str(fan[fan_ix: fan_ix + self.window_size]) 190 | lev_d = lev_distance(match_str, fan_context) 191 | 192 | for window_ix in range(self.window_size): 193 | fan_word_ix = fan_ix + window_ix 194 | fan_word = fan[fan_word_ix].orth_ 195 | fan_orth_id = fan[fan_word_ix].orth 196 | 197 | orig_word_ix = match_ix + window_ix 198 | orig_word = self.word_lowercase[orig_word_ix] 199 | orig_orth_id = self.orth_id[orig_word_ix] 200 | char = self.character[orig_word_ix] 201 | scene = self.scene[orig_word_ix] 202 | 203 | duplicate_records[(filename, fan_word_ix)].append( 204 | # NOTE: This **must** match the definition 205 | # of `record_structure` above 206 | [filename, 207 | fan_word_ix, 208 | fan_word, 209 | fan_orth_id, 210 | orig_word_ix, 211 | orig_word, 212 | orig_orth_id, 213 | char, 214 | scene, 215 | distance, 216 | lev_d, 217 | distance * lev_d] 218 | ) 219 | 220 | # To deduplicate duplicate_records, we 221 | # pick the single best match, as measured by 222 | # the combined distance for the given n-gram 223 | # match that first identified the word. 224 | for k, dset in duplicate_records.items(): 225 | duplicate_records[k] = min(dset, key=itemgetter(11)) 226 | return sorted(duplicate_records.values()) 227 | 228 | def validate_markup_script(filename, 229 | interactive=False, 230 | _unbalanced_l=re.compile('<<[^>]*<<'), 231 | _unbalanced_r=re.compile('>>[^<]*>>'), 232 | _tags=re.compile('>>\s*([^<]*)\s*<<')): 233 | with open(filename, encoding='utf-8') as ip: 234 | script = ip.read() 235 | 236 | print('Checking script for markup errors.') 237 | print() 238 | 239 | errs = False 240 | unbal_l = _unbalanced_l.findall(script) 241 | if unbal_l: 242 | print('Unbalanced left tag delimiters:') 243 | for m in _unbalanced_l.finditer(script): 244 | line = script[:m.start() + 1].count('\n') + 1 245 | print(' On line {}'.format(line)) 246 | print(' {}'.format(m.group().strip())) 247 | errs = True 248 | print() 249 | 250 | unbal_r = _unbalanced_r.findall(script) 251 | if unbal_r: 252 | print('Unbalanced right tag delimiters:') 253 | for m in _unbalanced_r.finditer(script): 254 | line = script[:m.start() + 1].count('\n') + 1 255 | print(' On line {}'.format(line)) 256 | print(' {}'.format(m.group().strip())) 257 | errs = True 258 | print() 259 | 260 | tag_set = set(t.strip() for t in _tags.findall(script)) 261 | expected_tags = set(('LINE', 'DIRECTION', 'SCENE_NUMBER', 'SCENE_DESCRIPTION', 'CHARACTER_NAME')) 262 | if tag_set - expected_tags: 263 | print('Unexpected tag labels:') 264 | for m in _tags.finditer(script): 265 | if m.group(1).strip() not in expected_tags: 266 | line = script[:m.start(1) + 1].count('\n') + 1 267 | print(' On line {}'.format(line)) 268 | print(' {}'.format(m.group(1).strip())) 269 | errs = True 270 | print() 271 | 272 | if not errs: 273 | print('No markup errors found.') 274 | return True 275 | elif interactive and errs: 276 | print('Errors were found in the script markup. Do you want to continue? 
(Default is no.)') 277 | print() 278 | r = '' 279 | while r.lower() not in ('y', 'yes', 'n', 'no'): 280 | r = input('Enter y for yes or n for no: ') 281 | if not r.strip(): 282 | r = 'n' 283 | return r.lower() in ('y', 'yes') 284 | else: 285 | return False 286 | 287 | def validate_cmd(args): 288 | return validate_markup_script(args.script) 289 | 290 | def load_markup_script(filename, 291 | _line_rex=re.compile('LINE<<(?P[^>]*)>>'), 292 | _scene_rex=re.compile('SCENE_NUMBER<<(?P[^>]*)>>'), 293 | _char_rex=re.compile('CHARACTER_NAME<<(?P[^>]*)>>')): 294 | 295 | with open(filename, encoding='utf-8') as ip: 296 | spacy_model = get_spacy_model() 297 | 298 | current_scene = None 299 | current_scene_count = 0 300 | current_scene_error_fix = False 301 | current_char = None 302 | rows = [['LOWERCASE', 'SPACY_ORTH_ID', 'SCENE', 'CHARACTER']] 303 | for i, line in enumerate(ip): 304 | if _scene_rex.search(line): 305 | current_scene_count += 1 306 | scene_string = _scene_rex.search(line).group('scene') 307 | scene_string = ''.join(c for c in scene_string 308 | if c.isdigit()) 309 | try: 310 | scene_int = int(scene_string) 311 | current_scene = scene_int 312 | except ValueError: 313 | current_scene_error_fix = True 314 | print("Error in Scene markup: {}".format(line)) 315 | 316 | if current_scene_error_fix: 317 | current_scene = current_scene_count 318 | 319 | elif _char_rex.search(line): 320 | current_char = _char_rex.search(line).group('character') 321 | elif _line_rex.search(line): 322 | tokens = spacy_model(_line_rex.search(line).group('line')) 323 | tokens = [t for t in tokens if not t.is_space] 324 | for t in tokens: 325 | # original Spacy lexeme object can be recreated using 326 | # spacy.lexeme.Lexeme(get_spacy_model().vocab, t.orth) 327 | row = [t.lower_, t.lower, current_scene, current_char] 328 | rows.append(row) 329 | return rows 330 | 331 | def write_records(records, filename): 332 | with open(filename, 'w', encoding='utf-8') as out: 333 | wr = csv.writer(out) 334 | wr.writerows(records) 335 | 336 | def analyze(args, 337 | window_size=6, 338 | number_of_hashes=15, # Bigger -> slower (linear), more matches 339 | hash_dimensions=14, # Bigger -> faster (???), fewer matches 340 | distance_threshold=0.1, 341 | chunk_size=500 342 | ): 343 | fan_work_directory = args.fan_works 344 | original_script_markup = args.script 345 | subsample_start = 0 if args.skip_works < 0 else args.skip_works 346 | subsample_end = (None if args.num_works < 0 else 347 | args.num_works + subsample_start) 348 | 349 | fan_works = os.listdir(fan_work_directory) 350 | fan_works = [os.path.join(fan_work_directory, f) 351 | for f in fan_works] 352 | 353 | # This will always generate the same "random" sample. 354 | random.seed(4815162342) 355 | random.shuffle(fan_works) 356 | 357 | # Optionally skip ahead in the list or stop early. 
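# For example, running with `--skip-works 1000 --num-works 500` searches
# items 1000-1499 of the shuffled list, so separate runs can cover
# disjoint subsamples.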
358 | fan_works = fan_works[subsample_start:subsample_end] 359 | 360 | start = 0 361 | fan_clusters = [fan_works[i:i + chunk_size] 362 | for i in range(start, len(fan_works), chunk_size)] 363 | 364 | filename_base = 'match-{}gram{{}}'.format(window_size) 365 | batch_filename = filename_base.format('-batch-{}.csv') 366 | 367 | accumulated_records = [new_record_structure['fields']] 368 | ann_index = AnnIndexSearch(original_script_markup, 369 | window_size, 370 | number_of_hashes, 371 | hash_dimensions, 372 | distance_threshold) 373 | 374 | for i, fan_cluster in enumerate(fan_clusters, start=start): 375 | print('Processing cluster {} ({}-{})'.format(i, 376 | chunk_size * i, 377 | chunk_size * (i + 1))) 378 | 379 | global _ANN_INDEX 380 | _ANN_INDEX = ann_index 381 | with multiprocessing.Pool(processes=4, maxtasksperchild=10) as pool: 382 | record_sets = pool.map( 383 | multi_search_wrapper, 384 | fan_cluster, 385 | chunksize=chunk_size // (4 * pool._processes)) 386 | records = [r for r_set in record_sets for r in r_set] 387 | write_records(records, batch_filename.format(i)) 388 | accumulated_records.extend(records) 389 | 390 | i = 0 391 | today_str = '-{:%Y%m%d}.csv'.format(datetime.date.today()) 392 | name_check = filename_base.format(today_str) 393 | while os.path.exists(name_check): 394 | i += 1 395 | today_str = '-{:%Y%m%d}-{}.csv'.format(datetime.date.today(), i) 396 | name_check = filename_base.format(today_str) 397 | 398 | write_records(accumulated_records, 399 | name_check) 400 | -------------------------------------------------------------------------------- /vis.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import math 3 | import pandas as pd 4 | import numpy 5 | from scipy.stats import gmean 6 | from numpy import mean 7 | 8 | from bokeh.plotting import figure 9 | from bokeh.io import curdoc, output_file, save 10 | from bokeh.resources import CDN 11 | from bokeh.embed import file_html, components 12 | from bokeh.layouts import row, column 13 | from bokeh.models import HoverTool, CustomJS, ColumnDataSource, FactorRange, Panel, Tabs 14 | from bokeh.models.widgets import RadioButtonGroup, CheckboxButtonGroup, Select 15 | from bokeh.transform import factor_cmap 16 | from bokeh.palettes import Spectral6 17 | from bokeh.events import ButtonClick 18 | 19 | _FIELDS = ['Frequency of Reuse (Exact Matches)', 20 | 'Frequency of Reuse (0-0.1)', 21 | 'Frequency of Reuse (0-0.25)', 22 | 'None', 23 | 'ANGER', 24 | 'ANTICIPATION', 25 | 'DISGUST', 26 | 'FEAR', 27 | 'JOY', 28 | 'SADNESS', 29 | 'SURPRISE', 30 | 'TRUST', 31 | 'NEGATIVE', 32 | 'POSITIVE',] 33 | # 'None'] 34 | 35 | _AGG_FUNCS = [lambda x: gmean(x + 1) - 1] * 3 36 | _AGG_FUNCS += [mean] * 11 37 | 38 | # Possibly dead code now. TODO: Check and if so, remove. 
39 | def parse_args(): 40 | parser = argparse.ArgumentParser() 41 | parser.add_argument('-s', '--static', action='store_true', 42 | default=False, 43 | help="save a full html file") 44 | # parser.add_argument('-o', '--output', action='store', 45 | # default='reuse.html', 46 | # help="output filename") 47 | 48 | args = parser.parse_args() 49 | args.words_per_chunk = 140 50 | args.data_path = 'fandom-data.csv' 51 | title = 'Average Quantity of Text Reuse by {}-word Section' 52 | title = title.format(args.words_per_chunk) 53 | args.title = title 54 | args.out_filename = 'star-wars-reuse.html' 55 | return args 56 | 57 | def unnan(val): 58 | # Pandas annoyingly converts the string 'nan' into a floating 59 | # point nan value, even in an all-string column. 60 | if isinstance(val, float) and math.isnan(val): 61 | return 'nan' 62 | else: 63 | return val 64 | 65 | def word_formatter(names=None): 66 | if names is None: 67 | names = [] 68 | 69 | punctuation = [',', '.', '!', '?', '\'', '"', ':', '-', '--'] 70 | endpunctuation = ['.', '!', '?', '"', '...', '....', '--'] 71 | contractions = ['\'ve', '\'m', '\'ll', '\'re', '\'s', '\'t', 'n\'t', 'na'] 72 | capitals = ['i'] 73 | 74 | def span(content, highlight=None): 75 | if highlight is None: 76 | return '{}'.format(content) 77 | else: 78 | style = 'background-color: rgba(16, 96, 255, {:04.3f})'.format(highlight) 79 | return '{}'.format(style, content) 80 | 81 | def format_word(word, prev_word, character, new_char, new_scene, highlight=None): 82 | character = unnan(character).upper() 83 | word = unnan(word) 84 | 85 | parts = [] 86 | if new_scene: 87 | parts.append(span('-- next scene--
')) 88 | 89 | if new_char: 90 | parts.append('\n') 91 | parts.append(span(' ' + character.upper() + ': ')) 92 | 93 | if word in punctuation or word in contractions: 94 | # no space before punctuation 95 | parts.append(span(word, highlight)) 96 | elif not prev_word or prev_word in endpunctuation: 97 | # capitalize first word of sentence 98 | parts.append(span(' ' + word.capitalize(), highlight)) 99 | elif word in capitals: 100 | # format things like 'i' 101 | parts.append(span(' ' + word.upper(), highlight)) 102 | elif word.capitalize() in names: 103 | # format names 104 | parts.append(span(' ' + word.capitalize(), highlight)) 105 | else: 106 | # all other words 107 | parts.append(span(' ' + word, highlight)) 108 | return ''.join(parts) 109 | return format_word 110 | 111 | def chart_cols(fandom_data, words_per_chunk): 112 | words = fandom_data['LOWERCASE'].tolist() 113 | prevwords = [None] + words[:-1] 114 | chars = fandom_data['CHARACTER'].tolist() 115 | newchar = fandom_data['CHARACTER'][:-1].values != fandom_data['CHARACTER'][1:].values 116 | newchar = [True] + list(newchar) 117 | newscene = fandom_data['SCENE'].values 118 | newscene[numpy.isnan(newscene)] = 0 119 | newscene = fandom_data['SCENE'][:-1].values != fandom_data['SCENE'][1:].values 120 | newscene = [False] + list(newscene) 121 | 122 | 123 | highlights = fandom_data['Frequency of Reuse (Exact Matches)'].tolist() 124 | chunks = (fandom_data.index // words_per_chunk).tolist() 125 | chunkmax = {} 126 | global_max = max(highlights) 127 | for h, c in zip(highlights, chunks): 128 | if c not in chunkmax or chunkmax[c] < h: 129 | #chunkmax[c] = h 130 | chunkmax[c] = global_max 131 | # highlights = [math.log(1 + h, 1.25) / (1.6 * math.log(1 + chunkmax[c], 1.25)) if chunkmax[c] > 0 else 0 132 | # for h, c in zip(highlights, chunks)] 133 | highlights = [(1 + h) ** 0.33 / (1.6 * (1 + chunkmax[c]) ** 0.33) if chunkmax[c] > 0 else 0 134 | for h, c in zip(highlights, chunks)] 135 | 136 | wform = word_formatter() 137 | spans = list(map(wform, words, prevwords, chars, newchar, newscene, highlights)) 138 | 139 | fandom_data = fandom_data.assign( 140 | **{'None': fandom_data[_FIELDS[0]].values * 0} 141 | ) 142 | 143 | # fandom_data = fandom_data.assign( 144 | # **{'None': fandom_data[_FIELDS[0]].values * 0} 145 | # ) 146 | 147 | character_cols = [x for x in fandom_data.columns if x.startswith("CHARACTER_")] 148 | chart_cols = fandom_data[_FIELDS + character_cols] 149 | chart_cols = chart_cols.assign(chunk=chunks) 150 | chart_cols = chart_cols.assign(span=spans) 151 | 152 | return chart_cols 153 | 154 | def join_wrap(seq): 155 | lines = [] 156 | line = [] 157 | last_br = 0 158 | for span in seq: 159 | if '\n' in span or last_br > 7 and '> ' in span: 160 | # Convert newlines to div breaks. Also insert breaks 161 | # whenever we've seen 7 words and there is some 162 | # leading whitespace in the current span. 163 | lines.append(''.join(line)) 164 | line = [] 165 | last_br = 0 166 | else: 167 | last_br += 1 168 | 169 | line.append(span) 170 | 171 | tail = ''.join(line) 172 | if tail.strip(): 173 | lines.append(tail) 174 | 175 | return '\n'.join('
<div>{}</div>
'.format(l) for l in lines) 176 | 177 | def chart_pivot(chart_cols): 178 | character_cols = [x for x in chart_cols.columns if x.startswith("CHARACTER_")] 179 | fields = _FIELDS + character_cols + ['span'] 180 | aggfuncs = _AGG_FUNCS + [mean] * len(character_cols) + [join_wrap] 181 | table = pd.pivot_table( 182 | chart_cols, 183 | values=fields, 184 | index=chart_cols.chunk, 185 | aggfunc=dict(zip(fields, aggfuncs)) 186 | ) 187 | # apparently when you create a pandas pivot table, it will automatically 188 | # sort your columns alphabetically (which is dumb). This is their work 189 | # around, where you literally give the table the fields you already gave 190 | # them, so that they "reindex" it. 191 | return table.reindex(fields, axis=1) 192 | 193 | 194 | def build_bar_plot(data_path, words_per_chunk, title='Reuse'): 195 | #Read in from csv 196 | flat_data = pd.read_csv(data_path) 197 | flat_data = chart_cols(flat_data, words_per_chunk) 198 | flat_data = chart_pivot(flat_data) 199 | 200 | # Scale so that both maxima have the same height 201 | reuse_y = flat_data['Frequency of Reuse (Exact Matches)'] 202 | emo_y = flat_data['None'] 203 | reuse_max = reuse_y.values.max() 204 | emo_max = emo_y.values.max() 205 | 206 | #Make ratio work 207 | ratio_denom = min(reuse_max, emo_max) 208 | ratio_num = max(reuse_max, emo_max) 209 | ratio = ratio_num / ratio_denom if ratio_denom > 0 else 1 210 | 211 | to_scale = reuse_y if reuse_max < emo_max else emo_y 212 | to_scale *= ratio 213 | 214 | # Create data columns 215 | grouped_x = [(str(x), key) 216 | for x in flat_data.index 217 | for key in ('Reuse', 'Emotion')] 218 | y = [re for re_pair in zip(reuse_y, emo_y) for re in re_pair] 219 | span = zip(flat_data.span, flat_data.span) 220 | span = [s for s_pair in span for s in s_pair] 221 | 222 | flat_data_source = ColumnDataSource(flat_data) 223 | source = ColumnDataSource(dict(x=grouped_x, 224 | y=y, 225 | span=span)) 226 | 227 | plot = figure(x_range=FactorRange(*grouped_x), 228 | plot_width=800, plot_height=600, 229 | title=title, tools="hover") 230 | 231 | # Turn off ticks, major labels, and x grid lines, etc. 232 | # Axis settings: 233 | plot.xaxis.major_label_text_font_size = '0pt' 234 | plot.xaxis.major_tick_line_color = None 235 | plot.xaxis.minor_tick_line_color = None 236 | 237 | # CategoricalAxis settings: 238 | plot.xaxis.group_text_font_size = '0pt' 239 | plot.xaxis.separator_line_color = None 240 | 241 | # Grid settings: 242 | plot.xgrid.grid_line_color = None 243 | plot.ygrid.minor_grid_line_color = 'black' 244 | plot.ygrid.minor_grid_line_alpha = 0.03 245 | 246 | hover = plot.select(dict(type=HoverTool)) 247 | hover.tooltips = "
@span{safe}
" 248 | plot.vbar(x='x', 249 | width=1.0, 250 | bottom=0, 251 | source=source, 252 | top='y', 253 | line_color='white', 254 | fill_color=factor_cmap('x', palette=Spectral6, 255 | factors=['Reuse', 'Emotion'], 256 | start=1, end=2)) 257 | 258 | 259 | reuse_button_group = RadioButtonGroup( 260 | labels= [_FIELDS[0]] + ["Frequency of Reuse (Fuzzy Matches)"], button_type='primary', 261 | active=0 262 | ) 263 | 264 | emotion_button_group = RadioButtonGroup( 265 | labels=_FIELDS[3:], button_type = "success", 266 | active=0 267 | ) 268 | 269 | callback = CustomJS( 270 | args=dict( 271 | source=source, 272 | flat_data_source=flat_data_source, 273 | reuse_button_group=reuse_button_group, 274 | emotion_button_group=emotion_button_group 275 | ), 276 | code=""" 277 | var reuse = reuse_button_group.labels[reuse_button_group.active]; 278 | if (reuse == "Frequency of Reuse (Fuzzy Matches)") { 279 | reuse = "Frequency of Reuse (0-0.25)"; 280 | } 281 | var emo = emotion_button_group.labels[emotion_button_group.active]; 282 | var reuse_data = flat_data_source.data[reuse].slice(); // Copy 283 | var emo_data = flat_data_source.data[emo].slice(); // Copy 284 | var reuse_max = Math.max.apply(Math, reuse_data); 285 | var emo_max = Math.max.apply(Math, emo_data); 286 | 287 | var ratio = 0; 288 | var to_scale = null; 289 | if (emo_max > reuse_max) { 290 | to_scale = reuse_data; 291 | ratio = emo_max / reuse_max; 292 | } else { 293 | to_scale = emo_data; 294 | if (emo_max > 0) { 295 | ratio = reuse_max / emo_max; 296 | } else { 297 | ratio = 1; 298 | } 299 | } 300 | for (var i = 0; i < to_scale.length; i++) { 301 | to_scale[i] *= ratio; 302 | } 303 | 304 | var x = source.data['x']; 305 | var y = source.data['y']; 306 | for (var i = 0; i < x.length; i++) { 307 | if (i % 2 === 0) { 308 | // This is a reuse bar 309 | y[i] = reuse_data[i / 2]; 310 | } else { 311 | // This is an emotion bar 312 | y[i] = emo_data[(i - 1) / 2]; 313 | } 314 | } 315 | source.change.emit(); 316 | """ 317 | ) 318 | reuse_button_group.js_on_change('active', callback) 319 | emotion_button_group.js_on_change('active', callback) 320 | 321 | layout = column(reuse_button_group, emotion_button_group, plot) 322 | tab1 = Panel(child=layout, title='Bar') 323 | return tab1 324 | 325 | def build_line_plot(data_path, words_per_chunk, title='Reuse'): 326 | #Read in from csv 327 | flat_data = pd.read_csv(data_path) 328 | flat_data = chart_cols(flat_data, words_per_chunk) 329 | flat_data = chart_pivot(flat_data) 330 | 331 | # Scale so that both maxima have the same height 332 | reuse_y = flat_data['Frequency of Reuse (Exact Matches)'] 333 | emo_y = flat_data['None'] 334 | char_y = flat_data['None'] 335 | reuse_max = reuse_y.values.max() 336 | emo_max = emo_y.values.max() 337 | char_max = char_y.values.max() 338 | 339 | #Make ratio work 340 | ratio_denom = min(char_max, min(reuse_max, emo_max)) 341 | ratio_num = max(char_max, max(reuse_max, emo_max)) 342 | ratio = ratio_num / ratio_denom if ratio_denom > 0 else 1 343 | if reuse_max < emo_max and reuse_max < char_max: 344 | to_scale = reuse_y 345 | elif emo_max < char_max and emo_max < reuse_max: 346 | to_scale = emo_y 347 | else: 348 | to_scale = char_y 349 | to_scale *= ratio 350 | 351 | # Create data columns 352 | x = [str(i) for i in flat_data.index] 353 | reuse_y=reuse_y 354 | reuse_zero = len(reuse_y) * [0] 355 | span = flat_data.span 356 | flat_data_source = ColumnDataSource(flat_data) 357 | source = ColumnDataSource(dict(x=x, 358 | reuse_y=reuse_y, 359 | emo_y=emo_y, 360 | char_y=char_y, 361 | 
reuse_zero=reuse_zero, 362 | span=span)) 363 | 364 | plot = figure(x_range=FactorRange(*x), 365 | plot_width=800, plot_height=600, 366 | title=title, tools="hover") 367 | 368 | # Turn off ticks, major labels, and x grid lines, etc. 369 | # Axis settings: 370 | plot.xaxis.major_label_text_font_size = '0pt' 371 | plot.xaxis.major_tick_line_color = None 372 | plot.xaxis.minor_tick_line_color = None 373 | 374 | # CategoricalAxis settings: 375 | plot.xaxis.group_text_font_size = '0pt' 376 | plot.xaxis.separator_line_color = None 377 | 378 | # Grid settings: 379 | plot.xgrid.grid_line_color = None 380 | plot.ygrid.minor_grid_line_color = 'black' 381 | plot.ygrid.minor_grid_line_alpha = 0.03 382 | 383 | hover = plot.select(dict(type=HoverTool)) 384 | hover.tooltips = "
@span{safe}
" 385 | 386 | plot.varea(x='x', source = source, y1 = 'reuse_y', y2 = 'reuse_zero', fill_color = Spectral6[0], fill_alpha = 0.6) 387 | plot.line(x='x', source = source, y = 'reuse_y', line_color = Spectral6[0], line_alpha = 0.0) 388 | plot.line(x='x', line_width=2.0, source=source, y='emo_y', line_color = Spectral6[1]) 389 | plot.line(x='x', line_width=2.0, source=source, y='char_y', line_color = 'red') 390 | 391 | 392 | reuse_button_group = RadioButtonGroup( 393 | labels=[_FIELDS[0]] + ["Frequency of Reuse (Fuzzy Matches)"], 394 | button_type='primary', 395 | active=0 396 | ) 397 | 398 | emotion_button_group = RadioButtonGroup( 399 | labels=_FIELDS[3:], 400 | button_type='success', 401 | active=0 402 | ) 403 | 404 | char_button_group = RadioButtonGroup( 405 | labels= ['None'] + [x.replace("CHARACTER_", "") for x in flat_data.columns if x.startswith("CHARACTER_")], 406 | button_type='danger', 407 | active=0 408 | ) 409 | 410 | callback_code=""" 411 | var reuse = reuse_button_group.labels[reuse_button_group.active]; 412 | if (reuse == "Frequency of Reuse (Fuzzy Matches)") { 413 | reuse = "Frequency of Reuse (0-0.25)"; 414 | } 415 | var emo = emotion_button_group.labels[emotion_button_group.active]; 416 | var char = char_button_group.labels[char_button_group.active]; 417 | var reuse_data = flat_data_source.data[reuse].slice(); // Copy 418 | var emo_data = flat_data_source.data[emo].slice(); // Copy 419 | if (char == "None") { 420 | var char_data = flat_data_source.data["None"].slice(); 421 | } else { 422 | var char_data = flat_data_source.data["CHARACTER_" + char].slice(); 423 | } // Copy 424 | var reuse_max = Math.max.apply(Math, reuse_data); 425 | var emo_max = Math.max.apply(Math, emo_data); 426 | var char_max = Math.max.apply(Math, char_data); 427 | 428 | var ratio = 0; 429 | var to_scale = null; 430 | var to_scale_other = null; 431 | 432 | if (emo_max > reuse_max && emo_max > char_max) { 433 | to_scale = reuse_data; 434 | to_scale_also = char_data; 435 | ratio_one = emo_max / reuse_max; 436 | ratio_two = emo_max / char_max; 437 | } else if (char_max > emo_max && char_max > reuse_max) { 438 | to_scale = reuse_data; 439 | to_scale_also = emo_data; 440 | ratio_one = char_max / reuse_max; 441 | ratio_two = char_max / emo_max; 442 | } else { 443 | to_scale = emo_data; 444 | to_scale_also = char_data; 445 | ratio_one = reuse_max / emo_max; 446 | ratio_two = reuse_max / char_max; 447 | } 448 | 449 | for (var i = 0; i < to_scale.length; i++) { 450 | to_scale[i] *= ratio_one; 451 | to_scale_also[i] *= ratio_two; 452 | } 453 | 454 | var x = source.data['x']; 455 | var reuse_y = source.data['reuse_y']; 456 | var emo_y = source.data['emo_y']; 457 | var char_y = source.data['char_y'] 458 | for (var i = 0; i < x.length; i++) { 459 | reuse_y[i] = reuse_data[i]; 460 | emo_y[i] = emo_data[i]; 461 | char_y[i] = char_data[i]; 462 | } 463 | 464 | source.change.emit(); 465 | if (char_button_group.active == 0 || emotion_button_group.active == 0) { 466 | return; 467 | } 468 | 469 | if (other_button_group) { 470 | other_button_group.active = 0; 471 | } 472 | 473 | """ 474 | 475 | reuse_callback = CustomJS( 476 | args=dict( 477 | source=source, 478 | flat_data_source=flat_data_source, 479 | reuse_button_group=reuse_button_group, 480 | emotion_button_group=emotion_button_group, 481 | char_button_group=char_button_group, 482 | other_button_group=None 483 | ), code = callback_code) 484 | 485 | 486 | emo_callback = CustomJS( 487 | args=dict( 488 | source=source, 489 | flat_data_source=flat_data_source, 490 | 
reuse_button_group=reuse_button_group, 491 | emotion_button_group=emotion_button_group, 492 | char_button_group=char_button_group, 493 | other_button_group=char_button_group 494 | ), code = callback_code) 495 | 496 | char_callback = CustomJS( 497 | args=dict( 498 | source=source, 499 | flat_data_source=flat_data_source, 500 | reuse_button_group=reuse_button_group, 501 | emotion_button_group=emotion_button_group, 502 | char_button_group=char_button_group, 503 | other_button_group=emotion_button_group 504 | ), code = callback_code) 505 | 506 | 507 | reuse_button_group.js_on_change('active', reuse_callback) 508 | emotion_button_group.js_on_change('active', emo_callback) 509 | char_button_group.js_on_change('active', char_callback) 510 | 511 | 512 | layout = column(reuse_button_group, emotion_button_group, char_button_group, plot) 513 | tab1 = Panel(child=layout, title='Line') 514 | return tab1 515 | 516 | 517 | def build_line_plot_compare(data_path, words_per_chunk, title='Degree of Reuse'): 518 | #Read in from csv 519 | flat_data = pd.read_csv(data_path) 520 | 521 | flat_data = chart_cols(flat_data, words_per_chunk) 522 | 523 | flat_data = chart_pivot(flat_data) 524 | 525 | 526 | # Scale so that both maxima have the same height 527 | reuse_y = flat_data['Frequency of Reuse (Exact Matches)'] 528 | emo_y = flat_data['None'] 529 | first_char_y = flat_data['None'] 530 | second_char_y = flat_data['None'] 531 | # reuse_max = reuse_y.values.max() 532 | # emo_max = emo_y.values.max() 533 | # char_max = char_y.values.max() 534 | # mult_char_max = mult_char_y.values.max() 535 | # 536 | # #Make ratio work 537 | # ratio_denom = min(mult_char_max, min(reuse_max, emo_max)) 538 | # ratio_num = max(mult_char_max, max(reuse_max, emo_max)) 539 | # ratio = ratio_num / ratio_denom if ratio_denom > 0 else 1 540 | # if reuse_max < emo_max and reuse_max < mult_char_max: 541 | # to_scale = reuse_y 542 | # elif emo_max < mult_char_max and emo_max < reuse_max: 543 | # to_scale = emo_y 544 | # else: 545 | # to_scale = mult_char_y 546 | # to_scale *= ratio 547 | 548 | # Create data columns 549 | x = [str(i) for i in flat_data.index] 550 | reuse_y=reuse_y 551 | reuse_zero = len(reuse_y) * [0] 552 | span = flat_data.span 553 | flat_data_source = ColumnDataSource(flat_data) 554 | source = ColumnDataSource(dict(x=x, 555 | reuse_y=reuse_y, 556 | emo_y=emo_y, 557 | first_char_y=first_char_y, 558 | second_char_y=second_char_y, 559 | reuse_zero=reuse_zero, 560 | span=span)) 561 | 562 | plot = figure(x_range=FactorRange(*x), 563 | plot_width=800, plot_height=600, 564 | title=title, tools="hover") 565 | 566 | # Turn off ticks, major labels, and x grid lines, etc. 567 | # Axis settings: 568 | plot.xaxis.major_label_text_font_size = '0pt' 569 | plot.xaxis.major_tick_line_color = None 570 | plot.xaxis.minor_tick_line_color = None 571 | 572 | plot.yaxis.major_label_text_font_size = '0pt' 573 | plot.yaxis.major_tick_line_color = None 574 | plot.yaxis.minor_tick_line_color = None 575 | 576 | # CategoricalAxis settings: 577 | plot.xaxis.group_text_font_size = '0pt' 578 | plot.xaxis.separator_line_color = None 579 | 580 | # Grid settings: 581 | plot.xgrid.grid_line_color = None 582 | # plot.ygrid.minor_grid_line_color = 'black' 583 | # plot.ygrid.minor_grid_line_alpha = 0.03 584 | plot.xaxis.axis_label = 'Beginning of Script    ←            →   End of Script' 585 | plot.yaxis.axis_label = 'Low Reuse             Medium Reuse             High Reuse' 586 | 587 | hover = plot.select(dict(type=HoverTool)) 588 | hover.tooltips = "
@span{safe}
" 589 | 590 | plot.varea(x='x', source = source, y1 = 'reuse_y', y2 = 'reuse_zero', fill_color = Spectral6[0], fill_alpha = 0.6) 591 | plot.line(x='x', source = source, y = 'reuse_y', line_color = Spectral6[0], line_alpha = 0.0) 592 | plot.line(x='x', line_width=2.0, source=source, y='emo_y', line_color = 'red') 593 | plot.line(x='x', line_width=2.0, source=source, y='first_char_y', line_color = Spectral6[1]) 594 | plot.line(x='x', line_width=2.0, source=source, y='second_char_y', line_color = '#F0AD4E') 595 | 596 | 597 | reuse_button_group = RadioButtonGroup( 598 | labels=[_FIELDS[0]] + ["Frequency of Reuse (Fuzzy Matches)"], 599 | button_type='primary', 600 | active=0 601 | ) 602 | 603 | emotion_button_group = CheckboxButtonGroup( 604 | labels= _FIELDS[4:], 605 | button_type='danger', 606 | active=[], 607 | ) 608 | 609 | first_char_button_group = CheckboxButtonGroup( 610 | labels= [x.replace("CHARACTER_", "") for x in flat_data.columns if x.startswith("CHARACTER_")], 611 | button_type='success', 612 | active=[] 613 | ) 614 | 615 | second_char_button_group = CheckboxButtonGroup( 616 | labels= [x.replace("CHARACTER_", "") for x in flat_data.columns if x.startswith("CHARACTER_")], 617 | button_type='warning', 618 | active=[] 619 | ) 620 | 621 | 622 | callback_code=""" 623 | 624 | var reuse = reuse_button_group.labels[reuse_button_group.active]; 625 | if (reuse == "Frequency of Reuse (Fuzzy Matches)") { 626 | reuse = "Frequency of Reuse (0-0.25)"; 627 | } 628 | var second_char = []; 629 | for (i = 0; i < second_char_button_group.active.length; i++) { 630 | second_char.push(second_char_button_group.labels[second_char_button_group.active[i]]); 631 | } 632 | 633 | var first_char = []; 634 | for (i = 0; i < first_char_button_group.active.length; i++) { 635 | first_char.push(first_char_button_group.labels[first_char_button_group.active[i]]); 636 | } 637 | 638 | var emo_arr = []; 639 | for (i = 0; i < emotion_button_group.active.length; i++) { 640 | emo_arr.push(emotion_button_group.labels[emotion_button_group.active[i]]); 641 | } 642 | 643 | var reuse_data = flat_data_source.data[reuse].slice(); // Copy 644 | 645 | 646 | 647 | if (second_char_button_group.active.length > 1) { 648 | if (second_char_button_group.active[0] == window.secondprev) { 649 | second_char_button_group.active.splice(0,1); 650 | second_char.splice(0,1); 651 | window.secondprev = second_char_button_group.active[0]; 652 | } else { 653 | second_char_button_group.active.splice(1,1); 654 | second_char.splice(1,1); 655 | window.secondprev = second_char_button_group.active[0]; 656 | } 657 | } else { 658 | window.secondprev = second_char_button_group.active[0]; 659 | 660 | } 661 | 662 | if (first_char_button_group.active.length > 1) { 663 | if (first_char_button_group.active[0] == window.firstprev) { 664 | first_char_button_group.active.splice(0,1); 665 | first_char.splice(0,1); 666 | window.firstprev = first_char_button_group.active[0]; 667 | } else { 668 | first_char_button_group.active.splice(1,1); 669 | first_char.splice(1,1); 670 | window.firstprev = first_char_button_group.active[0]; 671 | } 672 | } else { 673 | window.firstprev = first_char_button_group.active[0]; 674 | 675 | } 676 | 677 | if (emotion_button_group.active.length > 1) { 678 | if (emotion_button_group.active[0] == window.thirdprev) { 679 | emotion_button_group.active.splice(0,1); 680 | emo_arr.splice(0,1); 681 | window.thirdprev = emotion_button_group.active[0]; 682 | } else { 683 | emotion_button_group.active.splice(1,1); 684 | emo_arr.splice(1,1); 685 | 
window.thirdprev = emotion_button_group.active[0]; 686 | } 687 | } else { 688 | window.thirdprev = emotion_button_group.active[0]; 689 | 690 | } 691 | 692 | 693 | //for (i = 0; i < mult_char.length; i++) { 694 | // if (mult_char[i] == "Clear") { 695 | // listOfLists.push(flat_data_source.data["None"].slice()); 696 | // } else { 697 | // listOfLists.push(flat_data_source.data["CHARACTER_" + mult_char[i]].slice()); 698 | // } 699 | //} 700 | // 701 | //for (i = 0; i < char.length; i++) { 702 | // if (char[i] == "Clear") { 703 | // charListOfLists.push(flat_data_source.data["None"].slice()); 704 | // } else { 705 | // charListOfLists.push(flat_data_source.data["CHARACTER_" + char[i]].slice()); 706 | // } 707 | //} 708 | // 709 | //var emoListOfLists = []; 710 | //for (i = 0; i < emo_arr.length; i++) { 711 | // if (emo_arr[i] == "Clear") { 712 | // emoListOfLists.push(flat_data_source.data["None"].slice()); 713 | // } else { 714 | // emoListOfLists.push(flat_data_source.data[emo_arr[i]].slice()); 715 | // } 716 | //} 717 | 718 | function fill(a, b) { 719 | for (i = 0; i < b.length; i++) { 720 | a.push(flat_data_source.data[b[i]].slice()); 721 | } 722 | return a; 723 | } 724 | 725 | function fillChar(a, b) { 726 | for (i = 0; i < b.length; i++) { 727 | a.push(flat_data_source.data["CHARACTER_" + b[i]].slice()); 728 | } 729 | return a; 730 | } 731 | 732 | 733 | 734 | 735 | function zip(a) { 736 | if (a.length == 0) { 737 | return []; 738 | } 739 | var output = []; 740 | var length = a[0].length; 741 | for (i = 0; i < length; i++) { 742 | var newRow = []; 743 | for (j = 0; j < a.length; j++) { 744 | newRow.push(a[j][i]); 745 | } 746 | output.push(newRow); 747 | } 748 | return output; 749 | } 750 | 751 | function gMean(a) { 752 | var starter = 1; 753 | for (i = 0; i < a.length; i++) { 754 | starter = starter * a[i]; 755 | } 756 | if (starter == 0) { 757 | return 0; 758 | } else { 759 | return Math.pow(starter, 1/a.length); 760 | } 761 | } 762 | 763 | var firstCharListOfLists = []; 764 | var secondCharListOfLists = []; 765 | var emoListOfLists = []; 766 | 767 | var first_char_data = zip(fillChar(firstCharListOfLists, first_char)).map(gMean); 768 | var emo_data = zip(fill(emoListOfLists, emo_arr)).map(gMean); 769 | var second_char_data = zip(fillChar(secondCharListOfLists, second_char)).map(gMean); 770 | 771 | var reuse_max = Math.max.apply(Math, reuse_data); 772 | var emo_max = Math.max.apply(Math, emo_data); 773 | var first_char_max = Math.max.apply(Math, first_char_data); 774 | var second_char_max = Math.max.apply(Math, second_char_data); 775 | 776 | for (var i = 0; i < second_char_data.length; i++) { 777 | second_char_data[i] /= second_char_max; 778 | } 779 | 780 | for (var i = 0; i < first_char_data.length; i++) { 781 | first_char_data[i] /= first_char_max; 782 | } 783 | 784 | for (var i = 0; i < emo_data.length; i++) { 785 | emo_data[i] /= emo_max; 786 | } 787 | 788 | for (var i = 0; i < reuse_data.length; i++) { 789 | reuse_data[i] /= reuse_max; 790 | } 791 | 792 | var x = source.data['x']; 793 | var reuse_y = source.data['reuse_y']; 794 | var emo_y = source.data['emo_y']; 795 | var first_char_y = source.data['first_char_y']; 796 | var second_char_y = source.data['second_char_y'] 797 | for (var i = 0; i < x.length; i++) { 798 | reuse_y[i] = reuse_data[i]; 799 | emo_y[i] = emo_data[i]; 800 | first_char_y[i] = first_char_data[i]; 801 | second_char_y[i] = second_char_data[i]; 802 | } 803 | 804 | source.change.emit(); 805 | 806 | """ 807 | 808 | reuse_callback = CustomJS( 809 | args=dict( 810 | 
source=source, 811 | flat_data_source=flat_data_source, 812 | reuse_button_group=reuse_button_group, 813 | emotion_button_group=emotion_button_group, 814 | first_char_button_group=first_char_button_group, 815 | second_char_button_group=second_char_button_group, 816 | other_button_group=None 817 | ), code = callback_code) 818 | 819 | 820 | emo_callback = CustomJS( 821 | args=dict( 822 | source=source, 823 | flat_data_source=flat_data_source, 824 | reuse_button_group=reuse_button_group, 825 | emotion_button_group=emotion_button_group, 826 | first_char_button_group=first_char_button_group, 827 | second_char_button_group=second_char_button_group, 828 | other_button_group=first_char_button_group 829 | ), code = callback_code) 830 | 831 | char_callback = CustomJS( 832 | args=dict( 833 | source=source, 834 | flat_data_source=flat_data_source, 835 | reuse_button_group=reuse_button_group, 836 | emotion_button_group=emotion_button_group, 837 | first_char_button_group=first_char_button_group, 838 | second_char_button_group=second_char_button_group, 839 | other_button_group=emotion_button_group 840 | ), code = callback_code) 841 | 842 | mult_char_callback = CustomJS( 843 | args=dict( 844 | source=source, 845 | flat_data_source=flat_data_source, 846 | reuse_button_group=reuse_button_group, 847 | emotion_button_group=emotion_button_group, 848 | first_char_button_group=first_char_button_group, 849 | second_char_button_group=second_char_button_group, 850 | other_button_group=emotion_button_group 851 | ), code = callback_code) 852 | 853 | 854 | reuse_button_group.js_on_change('active', reuse_callback) 855 | emotion_button_group.js_on_change('active', emo_callback) 856 | first_char_button_group.js_on_change('active', char_callback) 857 | second_char_button_group.js_on_change('active', mult_char_callback) 858 | 859 | 860 | layout = column(reuse_button_group, first_char_button_group, second_char_button_group, emotion_button_group, plot) 861 | tab1 = Panel(child=layout, title='Both') 862 | #return tab1 863 | return layout 864 | 865 | 866 | def build_line_plot_affect(data_path, words_per_chunk, title='Degree of Reuse'): 867 | #Read in from csv 868 | flat_data = pd.read_csv(data_path) 869 | 870 | flat_data = chart_cols(flat_data, words_per_chunk) 871 | 872 | flat_data = chart_pivot(flat_data) 873 | 874 | 875 | # Scale so that both maxima have the same height 876 | reuse_y = flat_data['Frequency of Reuse (Exact Matches)'] 877 | emo_y = flat_data['None'] 878 | char_y = flat_data['None'] 879 | mult_char_y = flat_data['None'] 880 | reuse_max = reuse_y.values.max() 881 | emo_max = emo_y.values.max() 882 | char_max = char_y.values.max() 883 | mult_char_max = mult_char_y.values.max() 884 | 885 | #Make ratio work 886 | ratio_denom = min(mult_char_max, min(reuse_max, emo_max)) 887 | ratio_num = max(mult_char_max, max(reuse_max, emo_max)) 888 | ratio = ratio_num / ratio_denom if ratio_denom > 0 else 1 889 | if reuse_max < emo_max and reuse_max < mult_char_max: 890 | to_scale = reuse_y 891 | elif emo_max < mult_char_max and emo_max < reuse_max: 892 | to_scale = emo_y 893 | else: 894 | to_scale = mult_char_y 895 | to_scale *= ratio 896 | 897 | # Create data columns 898 | x = [str(i) for i in flat_data.index] 899 | reuse_y=reuse_y 900 | reuse_zero = len(reuse_y) * [0] 901 | span = flat_data.span 902 | flat_data_source = ColumnDataSource(flat_data) 903 | source = ColumnDataSource(dict(x=x, 904 | reuse_y=reuse_y, 905 | emo_y=emo_y, 906 | char_y=char_y, 907 | mult_char_y=mult_char_y, 908 | reuse_zero=reuse_zero, 909 | 
span=span,)) 910 | 911 | plot = figure(x_range=FactorRange(*x), 912 | plot_width=800, plot_height=600, 913 | title=title, tools="hover") 914 | 915 | # Turn off ticks, major labels, and x grid lines, etc. 916 | # Axis settings: 917 | plot.xaxis.major_label_text_font_size = '0pt' 918 | plot.xaxis.major_tick_line_color = None 919 | plot.xaxis.minor_tick_line_color = None 920 | 921 | plot.yaxis.major_label_text_font_size = '0pt' 922 | plot.yaxis.major_tick_line_color = None 923 | plot.yaxis.minor_tick_line_color = None 924 | 925 | # CategoricalAxis settings: 926 | plot.xaxis.group_text_font_size = '0pt' 927 | plot.xaxis.separator_line_color = None 928 | 929 | # Grid settings: 930 | plot.xgrid.grid_line_color = None 931 | # plot.ygrid.minor_grid_line_color = 'black' 932 | # plot.ygrid.minor_grid_line_alpha = 0.03 933 | plot.xaxis.axis_label = 'Beginning of Script    ←            →   End of Script' 934 | plot.yaxis.axis_label = 'Low Reuse             Medium Reuse             High Reuse' 935 | 936 | hover = plot.select(dict(type=HoverTool)) 937 | hover.tooltips = "
@span{safe}
" 938 | 939 | plot.varea(x='x', source = source, y1 = 'reuse_y', y2 = 'reuse_zero', fill_color = Spectral6[0], fill_alpha = 0.6) 940 | plot.line(x='x', source = source, y = 'reuse_y', line_color = Spectral6[0], line_alpha = 0.0) 941 | plot.line(x='x', line_width=2.0, source=source, y='emo_y', line_color = Spectral6[1]) 942 | plot.line(x='x', line_width=2.0, source=source, y='char_y', line_color = 'red') 943 | plot.line(x='x', line_width=2.0, source=source, y='mult_char_y', line_color = 'red') 944 | 945 | 946 | reuse_button_group = RadioButtonGroup( 947 | labels=[_FIELDS[0]] + ["Frequency of Reuse (Fuzzy Matches)"], 948 | button_type='primary', 949 | active=0 950 | ) 951 | 952 | emotion_button_group = CheckboxButtonGroup( 953 | labels= ['Clear'] + _FIELDS[4:], 954 | button_type='success', 955 | active=[], 956 | ) 957 | 958 | char_button_group = RadioButtonGroup( 959 | labels= ['None'] + [x.replace("CHARACTER_", "") for x in flat_data.columns if x.startswith("CHARACTER_")], 960 | button_type='danger', 961 | active=0 962 | ) 963 | 964 | mult_char_button_group = CheckboxButtonGroup( 965 | labels= ['Clear'] + [x.replace("CHARACTER_", "") for x in flat_data.columns if x.startswith("CHARACTER_")], 966 | button_type='danger', 967 | active=[] 968 | ) 969 | 970 | 971 | callback_code=""" 972 | var reuse = reuse_button_group.labels[reuse_button_group.active]; 973 | if (reuse == "Frequency of Reuse (Fuzzy Matches)") { 974 | reuse = "Frequency of Reuse (0-0.25)"; 975 | } 976 | //var emo = emotion_button_group.labels[emotion_button_group.active]; 977 | var char = char_button_group.labels[char_button_group.active]; 978 | var mult_char = []; 979 | for (i = 0; i < mult_char_button_group.active.length; i++) { 980 | mult_char.push(mult_char_button_group.labels[mult_char_button_group.active[i]]); 981 | } 982 | 983 | if (mult_char.includes("Clear")) { 984 | mult_char = []; 985 | mult_char_button_group.active = [] 986 | } 987 | 988 | var emo_arr = []; 989 | for (i = 0; i < emotion_button_group.active.length; i++) { 990 | emo_arr.push(emotion_button_group.labels[emotion_button_group.active[i]]); 991 | } 992 | 993 | console.log(emo_arr); 994 | if (emo_arr.includes("Clear")) { 995 | emo_arr = []; 996 | emotion_button_group.active = [] 997 | } 998 | 999 | 1000 | 1001 | var reuse_data = flat_data_source.data[reuse].slice(); // Copy 1002 | //var emo_data = flat_data_source.data[emo].slice(); // Copy 1003 | if (char == "None") { 1004 | var char_data = flat_data_source.data["None"].slice(); 1005 | } else { 1006 | var char_data = flat_data_source.data["CHARACTER_" + char].slice(); 1007 | } // Copy 1008 | 1009 | var multiplied = []; 1010 | var listOfLists = []; 1011 | var newList = [[]]; 1012 | for (i = 0; i < mult_char.length; i++) { 1013 | if (mult_char[i] == "Clear") { 1014 | listOfLists.push(flat_data_source.data["None"].slice()); 1015 | } else { 1016 | listOfLists.push(flat_data_source.data["CHARACTER_" + mult_char[i]].slice()); 1017 | } 1018 | } 1019 | 1020 | var emoListOfLists = []; 1021 | for (i = 0; i < emo_arr.length; i++) { 1022 | if (emo_arr[i] == "Clear") { 1023 | emoListOfLists.push(flat_data_source.data["None"].slice()); 1024 | } else { 1025 | emoListOfLists.push(flat_data_source.data[emo_arr[i]].slice()); 1026 | } 1027 | } 1028 | 1029 | 1030 | 1031 | function zip(a) { 1032 | if (a.length == 0) { 1033 | return []; 1034 | } 1035 | var output = []; 1036 | var length = a[0].length; 1037 | for (i = 0; i < length; i++) { 1038 | var newRow = []; 1039 | for (j = 0; j < a.length; j++) { 1040 | newRow.push(a[j][i]); 
1041 | } 1042 | output.push(newRow); 1043 | } 1044 | return output; 1045 | } 1046 | 1047 | function gMean(a) { 1048 | var starter = 1; 1049 | for (i = 0; i < a.length; i++) { 1050 | starter = starter * a[i]; 1051 | } 1052 | if (starter == 0) { 1053 | return 0; 1054 | } else { 1055 | return Math.pow(starter, 1/a.length); 1056 | } 1057 | } 1058 | 1059 | var mult_char_data = zip(listOfLists).map(gMean); 1060 | var emo_data = zip(emoListOfLists).map(gMean) 1061 | 1062 | var reuse_max = Math.max.apply(Math, reuse_data); 1063 | var emo_max = Math.max.apply(Math, emo_data); 1064 | var char_max = Math.max.apply(Math, char_data); 1065 | var mult_char_max = Math.max.apply(Math, mult_char_data); 1066 | 1067 | var ratio = 0; 1068 | var to_scale = null; 1069 | var to_scale_other = null; 1070 | 1071 | //if (emo_max > reuse_max && emo_max > mult_char_max) { 1072 | // to_scale = reuse_data; 1073 | // to_scale_also = mult_char_data; 1074 | // ratio_one = emo_max / reuse_max; 1075 | // ratio_two = emo_max / mult_char_max; 1076 | // 1077 | //} else if (mult_char_max > emo_max && mult_char_max > reuse_max) { 1078 | // to_scale = reuse_data; 1079 | // to_scale_also = emo_data; 1080 | // ratio_one = mult_char_max / reuse_max; 1081 | // ratio_two = mult_char_max / emo_max; 1082 | //} else if (reuse_max > emo_max && reuse_max > mult_char_max){ 1083 | // to_scale = emo_data; 1084 | // to_scale_also = mult_char_data; 1085 | // ratio_one = reuse_max / emo_max; 1086 | // ratio_two = reuse_max / mult_char_max; 1087 | //} 1088 | // 1089 | for (var i = 0; i < mult_char_data.length; i++) { 1090 | mult_char_data[i] /= mult_char_max; 1091 | 1092 | } 1093 | 1094 | for (var i = 0; i < emo_data.length; i++) { 1095 | emo_data[i] /= emo_max; 1096 | } 1097 | 1098 | for (var i = 0; i < reuse_data.length; i++) { 1099 | reuse_data[i] /= reuse_max; 1100 | } 1101 | 1102 | var x = source.data['x']; 1103 | var reuse_y = source.data['reuse_y']; 1104 | var emo_y = source.data['emo_y']; 1105 | var char_y = source.data['char_y']; 1106 | var mult_char_y = source.data['mult_char_y'] 1107 | for (var i = 0; i < x.length; i++) { 1108 | reuse_y[i] = reuse_data[i]; 1109 | emo_y[i] = emo_data[i]; 1110 | char_y[i] = char_data[i]; 1111 | mult_char_y[i] = mult_char_data[i]; 1112 | } 1113 | 1114 | console.log(source.data['prev']); 1115 | 1116 | source.change.emit(); 1117 | 1118 | """ 1119 | 1120 | reuse_callback = CustomJS( 1121 | args=dict( 1122 | source=source, 1123 | flat_data_source=flat_data_source, 1124 | reuse_button_group=reuse_button_group, 1125 | emotion_button_group=emotion_button_group, 1126 | char_button_group=char_button_group, 1127 | mult_char_button_group=mult_char_button_group, 1128 | other_button_group=None 1129 | ), code = callback_code) 1130 | 1131 | 1132 | emo_callback = CustomJS( 1133 | args=dict( 1134 | source=source, 1135 | flat_data_source=flat_data_source, 1136 | reuse_button_group=reuse_button_group, 1137 | emotion_button_group=emotion_button_group, 1138 | char_button_group=char_button_group, 1139 | mult_char_button_group=mult_char_button_group, 1140 | other_button_group=char_button_group 1141 | ), code = callback_code) 1142 | 1143 | char_callback = CustomJS( 1144 | args=dict( 1145 | source=source, 1146 | flat_data_source=flat_data_source, 1147 | reuse_button_group=reuse_button_group, 1148 | emotion_button_group=emotion_button_group, 1149 | char_button_group=char_button_group, 1150 | mult_char_button_group=mult_char_button_group, 1151 | other_button_group=emotion_button_group 1152 | ), code = callback_code) 1153 | 1154 | 
mult_char_callback = CustomJS( 1155 | args=dict( 1156 | source=source, 1157 | flat_data_source=flat_data_source, 1158 | reuse_button_group=reuse_button_group, 1159 | emotion_button_group=emotion_button_group, 1160 | char_button_group=char_button_group, 1161 | mult_char_button_group=mult_char_button_group, 1162 | other_button_group=emotion_button_group 1163 | ), code = callback_code) 1164 | 1165 | 1166 | reuse_button_group.js_on_change('active', reuse_callback) 1167 | emotion_button_group.js_on_change('active', emo_callback) 1168 | char_button_group.js_on_change('active', char_callback) 1169 | mult_char_button_group.js_on_change('active', mult_char_callback) 1170 | 1171 | 1172 | layout = column(reuse_button_group, emotion_button_group, plot) 1173 | tab1 = Panel(child=layout, title='Affect') 1174 | return tab1 1175 | # return layout 1176 | 1177 | def build_line_plot_char(data_path, words_per_chunk, title='Degree of Reuse'): 1178 | #Read in from csv 1179 | flat_data = pd.read_csv(data_path) 1180 | 1181 | flat_data = chart_cols(flat_data, words_per_chunk) 1182 | 1183 | flat_data = chart_pivot(flat_data) 1184 | 1185 | 1186 | # Scale so that both maxima have the same height 1187 | reuse_y = flat_data['Frequency of Reuse (Exact Matches)'] 1188 | emo_y = flat_data['None'] 1189 | char_y = flat_data['None'] 1190 | mult_char_y = flat_data['None'] 1191 | reuse_max = reuse_y.values.max() 1192 | emo_max = emo_y.values.max() 1193 | char_max = char_y.values.max() 1194 | mult_char_max = mult_char_y.values.max() 1195 | 1196 | #Make ratio work 1197 | ratio_denom = min(mult_char_max, min(reuse_max, emo_max)) 1198 | ratio_num = max(mult_char_max, max(reuse_max, emo_max)) 1199 | ratio = ratio_num / ratio_denom if ratio_denom > 0 else 1 1200 | if reuse_max < emo_max and reuse_max < mult_char_max: 1201 | to_scale = reuse_y 1202 | elif emo_max < mult_char_max and emo_max < reuse_max: 1203 | to_scale = emo_y 1204 | else: 1205 | to_scale = mult_char_y 1206 | to_scale *= ratio 1207 | 1208 | # Create data columns 1209 | x = [str(i) for i in flat_data.index] 1210 | reuse_y=reuse_y 1211 | reuse_zero = len(reuse_y) * [0] 1212 | span = flat_data.span 1213 | flat_data_source = ColumnDataSource(flat_data) 1214 | source = ColumnDataSource(dict(x=x, 1215 | reuse_y=reuse_y, 1216 | emo_y=emo_y, 1217 | char_y=char_y, 1218 | mult_char_y=mult_char_y, 1219 | reuse_zero=reuse_zero, 1220 | span=span)) 1221 | 1222 | plot = figure(x_range=FactorRange(*x), 1223 | plot_width=800, plot_height=600, 1224 | title=title, tools="hover") 1225 | 1226 | # Turn off ticks, major labels, and x grid lines, etc. 1227 | # Axis settings: 1228 | plot.xaxis.major_label_text_font_size = '0pt' 1229 | plot.xaxis.major_tick_line_color = None 1230 | plot.xaxis.minor_tick_line_color = None 1231 | 1232 | plot.yaxis.major_label_text_font_size = '0pt' 1233 | plot.yaxis.major_tick_line_color = None 1234 | plot.yaxis.minor_tick_line_color = None 1235 | 1236 | # CategoricalAxis settings: 1237 | plot.xaxis.group_text_font_size = '0pt' 1238 | plot.xaxis.separator_line_color = None 1239 | 1240 | # Grid settings: 1241 | plot.xgrid.grid_line_color = None 1242 | # plot.ygrid.minor_grid_line_color = 'black' 1243 | # plot.ygrid.minor_grid_line_alpha = 0.03 1244 | plot.xaxis.axis_label = 'Beginning of Script    ←            →   End of Script' 1245 | plot.yaxis.axis_label = 'Low Reuse             Medium Reuse             High Reuse' 1246 | 1247 | hover = plot.select(dict(type=HoverTool)) 1248 | hover.tooltips = "
@span{safe}
" 1249 | 1250 | plot.varea(x='x', source = source, y1 = 'reuse_y', y2 = 'reuse_zero', fill_color = Spectral6[0], fill_alpha = 0.6) 1251 | plot.line(x='x', source = source, y = 'reuse_y', line_color = Spectral6[0], line_alpha = 0.0) 1252 | plot.line(x='x', line_width=2.0, source=source, y='emo_y', line_color = Spectral6[1]) 1253 | plot.line(x='x', line_width=2.0, source=source, y='char_y', line_color = 'red') 1254 | plot.line(x='x', line_width=2.0, source=source, y='mult_char_y', line_color = 'red') 1255 | 1256 | 1257 | reuse_button_group = RadioButtonGroup( 1258 | labels=[_FIELDS[0]] + ["Frequency of Reuse (Fuzzy Matches)"], 1259 | button_type='primary', 1260 | active=0 1261 | ) 1262 | 1263 | emotion_button_group = CheckboxButtonGroup( 1264 | labels= ['Clear'] + _FIELDS[4:], 1265 | button_type='success', 1266 | active=[], 1267 | ) 1268 | 1269 | char_button_group = RadioButtonGroup( 1270 | labels= ['None'] + [x.replace("CHARACTER_", "") for x in flat_data.columns if x.startswith("CHARACTER_")], 1271 | button_type='danger', 1272 | active=0 1273 | ) 1274 | 1275 | mult_char_button_group = CheckboxButtonGroup( 1276 | labels= ['Clear'] + [x.replace("CHARACTER_", "") for x in flat_data.columns if x.startswith("CHARACTER_")], 1277 | button_type='danger', 1278 | active=[] 1279 | ) 1280 | 1281 | 1282 | callback_code=""" 1283 | var reuse = reuse_button_group.labels[reuse_button_group.active]; 1284 | if (reuse == "Frequency of Reuse (Fuzzy Matches)") { 1285 | reuse = "Frequency of Reuse (0-0.25)"; 1286 | } 1287 | //var emo = emotion_button_group.labels[emotion_button_group.active]; 1288 | var char = char_button_group.labels[char_button_group.active]; 1289 | var mult_char = []; 1290 | for (i = 0; i < mult_char_button_group.active.length; i++) { 1291 | mult_char.push(mult_char_button_group.labels[mult_char_button_group.active[i]]); 1292 | } 1293 | if (mult_char.includes("Clear")) { 1294 | mult_char = []; 1295 | mult_char_button_group.active = [] 1296 | } 1297 | 1298 | var emo_arr = []; 1299 | for (i = 0; i < emotion_button_group.active.length; i++) { 1300 | emo_arr.push(emotion_button_group.labels[emotion_button_group.active[i]]); 1301 | } 1302 | if (emo_arr.includes("Clear")) { 1303 | emo_arr = []; 1304 | emotion_button_group.active = [] 1305 | } 1306 | 1307 | 1308 | 1309 | var reuse_data = flat_data_source.data[reuse].slice(); // Copy 1310 | //var emo_data = flat_data_source.data[emo].slice(); // Copy 1311 | if (char == "None") { 1312 | var char_data = flat_data_source.data["None"].slice(); 1313 | } else { 1314 | var char_data = flat_data_source.data["CHARACTER_" + char].slice(); 1315 | } // Copy 1316 | 1317 | var multiplied = []; 1318 | var listOfLists = []; 1319 | var newList = [[]]; 1320 | for (i = 0; i < mult_char.length; i++) { 1321 | if (mult_char[i] == "Clear") { 1322 | listOfLists.push(flat_data_source.data["None"].slice()); 1323 | } else { 1324 | listOfLists.push(flat_data_source.data["CHARACTER_" + mult_char[i]].slice()); 1325 | } 1326 | } 1327 | 1328 | var emoListOfLists = []; 1329 | for (i = 0; i < emo_arr.length; i++) { 1330 | if (emo_arr[i] == "Clear") { 1331 | emoListOfLists.push(flat_data_source.data["None"].slice()); 1332 | } else { 1333 | emoListOfLists.push(flat_data_source.data[emo_arr[i]].slice()); 1334 | } 1335 | } 1336 | 1337 | 1338 | 1339 | function zip(a) { 1340 | if (a.length == 0) { 1341 | return []; 1342 | } 1343 | var output = []; 1344 | var length = a[0].length; 1345 | for (i = 0; i < length; i++) { 1346 | var newRow = []; 1347 | for (j = 0; j < a.length; j++) { 1348 | 
newRow.push(a[j][i]); 1349 | } 1350 | output.push(newRow); 1351 | } 1352 | return output; 1353 | } 1354 | 1355 | function gMean(a) { 1356 | var starter = 1; 1357 | for (i = 0; i < a.length; i++) { 1358 | starter = starter * a[i]; 1359 | } 1360 | if (starter == 0) { 1361 | return 0; 1362 | } else { 1363 | return Math.pow(starter, 1/a.length); 1364 | } 1365 | } 1366 | 1367 | var mult_char_data = zip(listOfLists).map(gMean); 1368 | var emo_data = zip(emoListOfLists).map(gMean) 1369 | 1370 | var reuse_max = Math.max.apply(Math, reuse_data); 1371 | var emo_max = Math.max.apply(Math, emo_data); 1372 | var char_max = Math.max.apply(Math, char_data); 1373 | var mult_char_max = Math.max.apply(Math, mult_char_data); 1374 | 1375 | var ratio = 0; 1376 | var to_scale = null; 1377 | var to_scale_other = null; 1378 | 1379 | //if (emo_max > reuse_max && emo_max > mult_char_max) { 1380 | // to_scale = reuse_data; 1381 | // to_scale_also = mult_char_data; 1382 | // ratio_one = emo_max / reuse_max; 1383 | // ratio_two = emo_max / mult_char_max; 1384 | // 1385 | //} else if (mult_char_max > emo_max && mult_char_max > reuse_max) { 1386 | // to_scale = reuse_data; 1387 | // to_scale_also = emo_data; 1388 | // ratio_one = mult_char_max / reuse_max; 1389 | // ratio_two = mult_char_max / emo_max; 1390 | //} else if (reuse_max > emo_max && reuse_max > mult_char_max){ 1391 | // to_scale = emo_data; 1392 | // to_scale_also = mult_char_data; 1393 | // ratio_one = reuse_max / emo_max; 1394 | // ratio_two = reuse_max / mult_char_max; 1395 | //} 1396 | // 1397 | for (var i = 0; i < mult_char_data.length; i++) { 1398 | mult_char_data[i] /= mult_char_max; 1399 | 1400 | } 1401 | 1402 | for (var i = 0; i < emo_data.length; i++) { 1403 | emo_data[i] /= emo_max; 1404 | } 1405 | 1406 | for (var i = 0; i < reuse_data.length; i++) { 1407 | reuse_data[i] /= reuse_max; 1408 | } 1409 | 1410 | var x = source.data['x']; 1411 | var reuse_y = source.data['reuse_y']; 1412 | var emo_y = source.data['emo_y']; 1413 | var char_y = source.data['char_y']; 1414 | var mult_char_y = source.data['mult_char_y'] 1415 | for (var i = 0; i < x.length; i++) { 1416 | reuse_y[i] = reuse_data[i]; 1417 | emo_y[i] = emo_data[i]; 1418 | char_y[i] = char_data[i]; 1419 | mult_char_y[i] = mult_char_data[i]; 1420 | } 1421 | 1422 | source.change.emit(); 1423 | 1424 | """ 1425 | 1426 | reuse_callback = CustomJS( 1427 | args=dict( 1428 | source=source, 1429 | flat_data_source=flat_data_source, 1430 | reuse_button_group=reuse_button_group, 1431 | emotion_button_group=emotion_button_group, 1432 | char_button_group=char_button_group, 1433 | mult_char_button_group=mult_char_button_group, 1434 | other_button_group=None 1435 | ), code = callback_code) 1436 | 1437 | 1438 | emo_callback = CustomJS( 1439 | args=dict( 1440 | source=source, 1441 | flat_data_source=flat_data_source, 1442 | reuse_button_group=reuse_button_group, 1443 | emotion_button_group=emotion_button_group, 1444 | char_button_group=char_button_group, 1445 | mult_char_button_group=mult_char_button_group, 1446 | other_button_group=char_button_group 1447 | ), code = callback_code) 1448 | 1449 | char_callback = CustomJS( 1450 | args=dict( 1451 | source=source, 1452 | flat_data_source=flat_data_source, 1453 | reuse_button_group=reuse_button_group, 1454 | emotion_button_group=emotion_button_group, 1455 | char_button_group=char_button_group, 1456 | mult_char_button_group=mult_char_button_group, 1457 | other_button_group=emotion_button_group 1458 | ), code = callback_code) 1459 | 1460 | mult_char_callback = 
CustomJS( 1461 | args=dict( 1462 | source=source, 1463 | flat_data_source=flat_data_source, 1464 | reuse_button_group=reuse_button_group, 1465 | emotion_button_group=emotion_button_group, 1466 | char_button_group=char_button_group, 1467 | mult_char_button_group=mult_char_button_group, 1468 | other_button_group=emotion_button_group 1469 | ), code = callback_code) 1470 | 1471 | 1472 | reuse_button_group.js_on_change('active', reuse_callback) 1473 | emotion_button_group.js_on_change('active', emo_callback) 1474 | char_button_group.js_on_change('active', char_callback) 1475 | mult_char_button_group.js_on_change('active', mult_char_callback) 1476 | 1477 | 1478 | layout = column(reuse_button_group, mult_char_button_group, plot) 1479 | tab1 = Panel(child=layout, title='Character') 1480 | return tab1 1481 | #return layout 1482 | 1483 | def build_line_plot_dropdown(data_path, words_per_chunk, title='Reuse'): 1484 | #Read in from csv 1485 | flat_data = pd.read_csv(data_path) 1486 | flat_data = chart_cols(flat_data, words_per_chunk) 1487 | flat_data = chart_pivot(flat_data) 1488 | 1489 | # Scale so that both maxima have the same height 1490 | reuse_y = flat_data['Frequency of Reuse (Exact Matches)'] 1491 | emo_y = flat_data['None'] 1492 | char_y = flat_data['None'] 1493 | reuse_max = reuse_y.values.max() 1494 | emo_max = emo_y.values.max() 1495 | char_max = char_y.values.max() 1496 | 1497 | #Make ratio work 1498 | ratio_denom = min(char_max, min(reuse_max, emo_max)) 1499 | ratio_num = max(char_max, max(reuse_max, emo_max)) 1500 | ratio = ratio_num / ratio_denom if ratio_denom > 0 else 1 1501 | if reuse_max < emo_max and reuse_max < char_max: 1502 | to_scale = reuse_y 1503 | elif emo_max < char_max and emo_max < reuse_max: 1504 | to_scale = emo_y 1505 | else: 1506 | to_scale = char_y 1507 | to_scale *= ratio 1508 | 1509 | # Create data columns 1510 | x = [str(i) for i in flat_data.index] 1511 | reuse_y=reuse_y 1512 | reuse_zero = len(reuse_y) * [0] 1513 | span = flat_data.span 1514 | flat_data_source = ColumnDataSource(flat_data) 1515 | source = ColumnDataSource(dict(x=x, 1516 | reuse_y=reuse_y, 1517 | emo_y=emo_y, 1518 | char_y=char_y, 1519 | reuse_zero=reuse_zero, 1520 | span=span)) 1521 | 1522 | plot = figure(x_range=FactorRange(*x), 1523 | plot_width=800, plot_height=600, 1524 | title=title, tools="hover") 1525 | 1526 | # Turn off ticks, major labels, and x grid lines, etc. 1527 | # Axis settings: 1528 | plot.xaxis.major_label_text_font_size = '0pt' 1529 | plot.xaxis.major_tick_line_color = None 1530 | plot.xaxis.minor_tick_line_color = None 1531 | 1532 | # CategoricalAxis settings: 1533 | plot.xaxis.group_text_font_size = '0pt' 1534 | plot.xaxis.separator_line_color = None 1535 | 1536 | # Grid settings: 1537 | plot.xgrid.grid_line_color = None 1538 | plot.ygrid.minor_grid_line_color = 'black' 1539 | plot.ygrid.minor_grid_line_alpha = 0.03 1540 | 1541 | hover = plot.select(dict(type=HoverTool)) 1542 | hover.tooltips = "
@span{safe}
" 1543 | 1544 | plot.varea(x='x', source = source, y1 = 'reuse_y', y2 = 'reuse_zero', fill_color = Spectral6[0], fill_alpha = 0.6) 1545 | plot.line(x='x', source = source, y = 'reuse_y', line_color = Spectral6[0], line_alpha = 0.0) 1546 | plot.line(x='x', line_width=2.0, source=source, y='emo_y', line_color = Spectral6[1]) 1547 | plot.line(x='x', line_width=2.0, source=source, y='char_y', line_color = 'red') 1548 | 1549 | 1550 | reuse_button_group = RadioButtonGroup( 1551 | labels= [_FIELDS[0]] + ["Frequency of Reuse (Fuzzy Matches)"], button_type='primary', 1552 | active=0 1553 | ) 1554 | 1555 | emotion_dropdown_button_group = Select( 1556 | title="Emotion", value="None", options=_FIELDS[3:]) 1557 | 1558 | char_dropdown_button_group = Select( 1559 | title="Emotion2", value="None", options=_FIELDS[3:]) 1560 | 1561 | callback_code=""" 1562 | var reuse = reuse_button_group.labels[reuse_button_group.active]; 1563 | if (reuse == "Frequency of Reuse (Fuzzy Matches)") { 1564 | reuse = "Frequency of Reuse (0-0.25)"; 1565 | } 1566 | var emo = emotion_dropdown_button_group.value; 1567 | var char = char_dropdown_button_group.value; 1568 | var reuse_data = flat_data_source.data[reuse].slice(); // Copy 1569 | var emo_data = flat_data_source.data[emo].slice(); // Copy 1570 | var char_data = flat_data_source.data[char].slice(); // Copy 1571 | var reuse_max = Math.max.apply(Math, reuse_data); 1572 | var emo_max = Math.max.apply(Math, emo_data); 1573 | var char_max = Math.max.apply(Math, char_data); 1574 | 1575 | var ratio = 0; 1576 | var to_scale = null; 1577 | var to_scale_other = null; 1578 | 1579 | if (emo_max > reuse_max && emo_max > char_max) { 1580 | to_scale = reuse_data; 1581 | to_scale_also = char_data; 1582 | ratio_one = emo_max / reuse_max; 1583 | ratio_two = emo_max / char_max; 1584 | } else if (char_max > emo_max && char_max > reuse_max) { 1585 | to_scale = reuse_data; 1586 | to_scale_also = emo_data; 1587 | ratio_one = char_max / reuse_max; 1588 | ratio_two = char_max / emo_max; 1589 | } else { 1590 | to_scale = emo_data; 1591 | to_scale_also = char_data; 1592 | ratio_one = reuse_max / emo_max; 1593 | ratio_two = reuse_max / char_max; 1594 | } 1595 | 1596 | for (var i = 0; i < to_scale.length; i++) { 1597 | to_scale[i] *= ratio_one; 1598 | to_scale_also[i] *= ratio_two; 1599 | } 1600 | 1601 | var x = source.data['x']; 1602 | var reuse_y = source.data['reuse_y']; 1603 | var emo_y = source.data['emo_y']; 1604 | var char_y = source.data['char_y'] 1605 | for (var i = 0; i < x.length; i++) { 1606 | reuse_y[i] = reuse_data[i]; 1607 | emo_y[i] = emo_data[i]; 1608 | char_y[i] = char_data[i]; 1609 | } 1610 | 1611 | source.change.emit(); 1612 | if (char_dropdown_button_group.value == "None" || emotion_dropdown_button_group.value == "None") { 1613 | return; 1614 | } 1615 | 1616 | if (other_button_group) { 1617 | other_button_group.value = "None"; 1618 | } 1619 | 1620 | """ 1621 | 1622 | reuse_callback = CustomJS( 1623 | args=dict( 1624 | source=source, 1625 | flat_data_source=flat_data_source, 1626 | reuse_button_group=reuse_button_group, 1627 | emotion_dropdown_button_group=emotion_dropdown_button_group, 1628 | char_dropdown_button_group=char_dropdown_button_group, 1629 | other_button_group=None 1630 | ), code = callback_code) 1631 | 1632 | 1633 | emo_callback = CustomJS( 1634 | args=dict( 1635 | source=source, 1636 | flat_data_source=flat_data_source, 1637 | reuse_button_group=reuse_button_group, 1638 | emotion_dropdown_button_group=emotion_dropdown_button_group, 1639 | 
char_dropdown_button_group=char_dropdown_button_group, 1640 | other_button_group=char_dropdown_button_group 1641 | ), code = callback_code) 1642 | 1643 | char_callback = CustomJS( 1644 | args=dict( 1645 | source=source, 1646 | flat_data_source=flat_data_source, 1647 | reuse_button_group=reuse_button_group, 1648 | emotion_dropdown_button_group=emotion_dropdown_button_group, 1649 | char_dropdown_button_group=char_dropdown_button_group, 1650 | other_button_group=emotion_dropdown_button_group 1651 | ), code = callback_code) 1652 | 1653 | 1654 | reuse_button_group.js_on_change('active', reuse_callback) 1655 | emotion_dropdown_button_group.js_on_change('value', emo_callback) 1656 | char_dropdown_button_group.js_on_change('value', char_callback) 1657 | 1658 | 1659 | layout = column(reuse_button_group, emotion_dropdown_button_group, char_dropdown_button_group, plot) 1660 | tab1 = Panel(child=layout, title='Line Dropdown') 1661 | return tab1 1662 | 1663 | def build_plot(args): 1664 | # return Tabs(tabs=[build_line_plot_char(args.input, args.words_per_chunk), 1665 | # build_line_plot_affect(args.input, args.words_per_chunk), 1666 | # build_line_plot_compare(args.input, args.words_per_chunk)]) 1667 | return build_line_plot_compare(args.input, args.words_per_chunk) 1668 | 1669 | 1670 | 1671 | def save_static(args): 1672 | plot = build_plot(args) 1673 | file_html(plot, CDN, args.title) 1674 | output_file(args.output, 1675 | title=args.title, mode="cdn") 1676 | save(plot) 1677 | 1678 | def save_embed(args): 1679 | plot = build_plot(args) 1680 | with open(args.output, 'w', encoding='utf-8') as op: 1681 | for c in components(plot): 1682 | op.write(c) 1683 | op.write('\n') 1684 | 1685 | def save_plot(args): 1686 | title = 'Average Quantity of Text Reuse by {}-word Section' 1687 | title = title.format(args.words_per_chunk) 1688 | args.title = title 1689 | 1690 | if args.static: 1691 | save_static(args) 1692 | else: 1693 | save_embed(args) 1694 | -------------------------------------------------------------------------------- /workflow/format_helper.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | import re 4 | 5 | franchise = sys.argv[1] 6 | movie = sys.argv[2] 7 | 8 | if len(sys.argv) <= 3: 9 | movie_folder = 'results/{0:}-{1:}'.format(franchise, movie) 10 | data_folders = os.listdir(movie_folder) 11 | dates = sorted([f for f in data_folders if re.search(r'[0-9]{8}', f)]) 12 | date = dates[-1] 13 | else: 14 | date = sys.argv[3] 15 | 16 | cmd = ('python ao3.py format ' 17 | '-o results/{0:}-{1:}/fandom-data-{1:}.csv ' 18 | 'results/{0:}-{1:}/{2:}/match-6gram-{2:}.csv ' 19 | 'scripts/{0:}-{1:}.txt') 20 | cmd = cmd.format(franchise, 21 | movie, 22 | date) 23 | print(cmd) 24 | 25 | -------------------------------------------------------------------------------- /workflow/reformat.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | for x in os.listdir('results/') : 4 | if not x.startswith('.'): 5 | os.system('`python format_helper.py' + ' ' + x.replace('-', ' ', 1) + '`') 6 | -------------------------------------------------------------------------------- /workflow/revis.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | for x in os.listdir('results/') : 4 | if not x.startswith('.'): 5 | os.system('`python vis_helper.py' + ' ' + x.replace('-', ' ', 1) + '`') 6 | 
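# Note on the two loops above (reformat.py and revis.py): format_helper.py and
# vis_helper.py only *print* the full `python ao3.py format ...` /
# `python ao3.py vis ...` command for a given <franchise> <movie> pair, and the
# backticks passed to os.system make the shell execute that printed command via
# command substitution. This assumes the helper scripts, ao3.py, and the
# results/ folder are all reachable from the current working directory, and
# that every entry in results/ is named <franchise>-<movie>. A rough equivalent
# without the backtick trick (a sketch, not tested against this repo) might
# look like:
#
#     import subprocess
#     for x in os.listdir('results/'):
#         if not x.startswith('.'):
#             franchise, movie = x.split('-', 1)
#             cmd = subprocess.run(
#                 ['python', 'vis_helper.py', franchise, movie],
#                 capture_output=True, text=True, check=True
#             ).stdout.strip()
#             subprocess.run(cmd, shell=True, check=True)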
-------------------------------------------------------------------------------- /workflow/vis_helper.py: -------------------------------------------------------------------------------- 1 | import sys 2 | cmd = ('python ao3.py vis ' 3 | '-o results/{0:}-{1:}/{2:}_reuse.html ' 4 | 'results/{0:}-{1:}/fandom-data-{1:}.csv') 5 | cmd = cmd.format(sys.argv[1], 6 | sys.argv[2], 7 | sys.argv[2].replace('-', '_')) 8 | print(cmd) 9 | 10 | --------------------------------------------------------------------------------
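Taken together, the workflow helpers batch-rerun the format and vis steps:
format_helper.py and vis_helper.py each print a single `ao3.py format` /
`ao3.py vis` command for one `<franchise>-<movie>` results folder
(format_helper.py picks the most recent YYYYMMDD folder when no date argument
is given), and reformat.py / revis.py loop over `results/` and run each printed
command through the shell. For example, assuming a `results/sw-new-hope` folder
containing a `20190604` search run and running from the repository root,

    python workflow/format_helper.py sw new-hope

should print

    python ao3.py format -o results/sw-new-hope/fandom-data-new-hope.csv results/sw-new-hope/20190604/match-6gram-20190604.csv scripts/sw-new-hope.txt

which reformat.py would then execute for every folder found in `results/`.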