├── .gitignore ├── README.md ├── _deprecated.py ├── ao3.py ├── fanworks └── .gitignore ├── requirements.txt ├── results └── .gitignore ├── scripts └── .gitignore ├── search.py ├── vis.py └── workflow ├── format_helper.py ├── reformat.py ├── revis.py └── vis_helper.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | local_settings.py 55 | 56 | # Flask stuff: 57 | instance/ 58 | .webassets-cache 59 | 60 | # Scrapy stuff: 61 | .scrapy 62 | 63 | # Sphinx documentation 64 | docs/_build/ 65 | 66 | # PyBuilder 67 | target/ 68 | 69 | # IPython Notebook 70 | .ipynb_checkpoints 71 | 72 | # pyenv 73 | .python-version 74 | 75 | # celery beat schedule file 76 | celerybeat-schedule 77 | 78 | # dotenv 79 | .env 80 | 81 | # virtualenv 82 | venv/ 83 | ENV/ 84 | 85 | # Spyder project settings 86 | .spyderproject 87 | 88 | # Rope project settings 89 | .ropeproject 90 | 91 | # Fandom-specific settings 92 | match*.csv 93 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | The Archive of Our Own script ao3.py can be used to scrape and analyze 2 | fanworks and prepare the results for visualization in JavaScript. 3 | A markup version of the script of the orginal work is required for 4 | searching for n-gram matches in the fanworks. 5 | 6 | The basic workflow is below. This assumes you have a `scripts` folder, 7 | a `fanworks` folder, and a `results` folder, with a particular structure 8 | that can be inferred from the example commands below. (Sorry, very busy!) 9 | Take `sw-all` to be a stand-in for a folder of fan works, `sw-new-hope.txt` 10 | to be a stand-in for a correctly formatted script, and `sw-new-hope` (without 11 | the `.txt`) to be a stand-in for the results folder for the given movie. 12 | 13 | A todo for this repo is to create options for where to save error and 14 | log files, and search results. 15 | 16 | Another todo for this repo is to create more thorough documentation, 17 | especially of the script format, which is idiosyncratic but effective. 18 | 19 | * Scrape AO3 (Ooops! Currently broken!) 20 | 21 | python ao3.py scrape \ 22 | -t "Star Wars - All Media Types" \ 23 | -o fanworks/sw-all/html 24 | 25 | The scrape command will save log and error files; check to see that the 26 | scrape went OK, and then move the (generically named) error file to 27 | `fanworks/sw-all/sw-all-errors.txt`. 
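(With the current code, the scrape step's generically named output files are `log.txt` and `error-ids.txt` -- the defaults of ao3.py's Logger objects -- so `error-ids.txt` is the file to rename.) For reference, here is a rough sketch of the folder layout the example commands assume; it is inferred from the commands themselves, with `sw-all` and `sw-new-hope` standing in as described above:

    fanworks/sw-all/html/             scraped HTML files
    fanworks/sw-all/plaintext/        cleaned plain-text files
    fanworks/sw-all/sw-all-errors.txt
    scripts/sw-new-hope.txt           markup version of the original script
    results/sw-new-hope/20190604/     per-batch and aggregated search CSVs
    results/sw-new-hope/              formatted data and visualization output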
28 | 29 | * Clean the HTML 30 | 31 | python ao3.py clean \ 32 | fanworks/sw-all/html/ \ 33 | -o fanworks/sw-all/plaintext/ 34 | 35 | The clean command will save an error file; check to see that the cleaning 36 | process went OK, and then move the error file (this time in the root dir) 37 | from `clean-html-errors.txt` to `sw-all-clean-errors.txt`. 38 | 39 | * Perform the reuse search 40 | 41 | python ao3.py search \ 42 | fanworks/sw-all/ \ 43 | scripts/sw-new-hope.txt 44 | 45 | The search command will create several (and in some cases, many, even hundreds) 46 | of separate CSV files. Each one contains the results for 500 fan works. They 47 | will automatically be aggregated by the script at the end of the process, but 48 | they are also saved here to ensure that if the search is interrupted, the 49 | results are still usable. 50 | 51 | If the search completes without any errors, the final aggregated data will 52 | be in a file with a date timestamp in YYYYMMDD format. It will be something 53 | like `match-6gram-20190604.csv`. Create a new folder `results/sw-new-hope/20190604/`, 54 | and move all the CSV files into that folder. 55 | 56 | * Aggregate the results over the script (i.e. "format" the results) 57 | 58 | python ao3.py format \ 59 | results/sw-new-hope/20190604/match-6gram-20190604.csv \ 60 | scripts/sw-new-hope.txt \ 61 | -o results/sw-new-hope/fandom-data-new-hope.csv 62 | 63 | * Create a Bokeh visualization of the aggregated results 64 | 65 | python ao3.py vis \ 66 | results/sw-new-hope/fandom-data-new-hope.csv \ 67 | -o results/sw-new-hope/new_hope_reuse.html 68 | 69 | 70 | This is not a perfect workflow and needs to be tidied up in several ways. I 71 | will get around to that someday. 72 | 73 | ``` 74 | usage: ao3.py [-h] {scrape,clean,getmeta,search,matrix,format} ... 75 | 76 | process fanworks scraped from Archive of Our Own. 77 | 78 | positional arguments: 79 | {scrape,clean,getmeta,search,matrix,format} 80 | scrape, clean, getmeta, search, matrix, or format 81 | scrape find and scrape fanfiction works from Archive of Our 82 | Own 83 | clean takes a directory of html files and yields a new 84 | directory of text files 85 | getmeta takes a directory of html files and yields a csv file 86 | containing metadata 87 | search compare fanworks with the original script 88 | matrix deduplicates and builds matrix for best n-gram matches 89 | format takes a script and outputs a csv with sentiment 90 | information for each word formatted for javascript 91 | visualization 92 | 93 | optional arguments: 94 | -h, --help show this help message and exit 95 | ``` 96 | There are three scraping options for Archive of Our Own: 97 | (1) Use the '-s' option to provide a search term and see a list of possible tags. 98 | (2) Use the '-t' option to scrape fanworks from a tag. 99 | (3) Use the '-u' option to scrape fanworks from a URL. The URL should be to the /works page, 100 | e.g.
https://archiveofourown.org/tags/Rogue%20One:%20A%20Star%20Wars%20Story%20(2016)/works 101 | ``` 102 | usage: ao3.py scrape [-h] [-s SEARCH | -t TAG | -u URL] [-o OUT] 103 | [-p STARTPAGE] 104 | 105 | optional arguments: 106 | -h, --help show this help message and exit 107 | -s SEARCH, --search SEARCH 108 | search term to search for a tag to scrape 109 | -t TAG, --tag TAG the tag to be scraped 110 | -u URL, --url URL the full URL of first page to be scraped 111 | -o OUT, --out OUT target directory for scraped html files 112 | -p STARTPAGE, --startpage STARTPAGE 113 | page on which to begin downloading (to resume a 114 | previous job) 115 | ``` 116 | Clean and convert the scraped html files into plain text files. 117 | ``` 118 | usage: ao3.py clean [-h] [-o O] i 119 | 120 | positional arguments: 121 | i directory of input html files to clean 122 | 123 | optional arguments: 124 | -h, --help show this help message and exit 125 | -o O target directory for output txt files 126 | ``` 127 | Extract Archive of Our Own metadata from the scraped html files. 128 | ``` 129 | usage: ao3.py getmeta [-h] [-o O] i 130 | 131 | positional arguments: 132 | i directory of input html files to process 133 | 134 | optional arguments: 135 | -h, --help show this help message and exit 136 | -o O filename for metadata csv file 137 | ``` 138 | The search process compares fanworks with the original work script and is based on 6-gram matches. 139 | ``` 140 | usage: ao3.py search [-h] d s 141 | 142 | positional arguments: 143 | d directory of fanwork text files 144 | s filename for markup version of script 145 | 146 | optional arguments: 147 | -h, --help show this help message and exits 148 | ``` 149 | The n-gram search results can be used to create a matrix. 150 | ``` 151 | usage: ao3.py matrix [-h] [-n N] i m 152 | 153 | positional arguments: 154 | i input csv file 155 | m fandom/movie name for output file prefix 156 | 157 | optional arguments: 158 | -h, --help show this help message and exit 159 | -n N n-gram size, default is 6-grams 160 | ``` 161 | The n-gram search results can be prepared for JavaScript visualization. 162 | ``` 163 | usage: ao3.py format [-h] [-o O] s 164 | 165 | positional arguments: 166 | s filename for markup version of script 167 | 168 | optional arguments: 169 | -h, --help show this help message and exit 170 | -o O filename for csv output file of data formatted for visualization 171 | s``` 172 | 173 | -------------------------------------------------------------------------------- /_deprecated.py: -------------------------------------------------------------------------------- 1 | def cosine_distance(row_values, col_values): 2 | """Calculate the cosine distance between two vectors. Also 3 | accepts matrices and 2-d arrays, and calculates the 4 | distances over the cross product of rows and columns. 5 | """ 6 | verr_msg = '`cosine_distance` is not defined for {}-dimensional arrays.' 
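# row_values is treated as a stack of row vectors with shape (n_rows, dim),
# and col_values as a stack of column vectors with shape (dim, n_cols); the
# result is an (n_rows, n_cols) array of cosine distances. For example,
# cosine_distance(numpy.eye(3), numpy.eye(3)[:, :2]) has shape (3, 2).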
7 | if len(row_values.shape) == 1: 8 | row_values = row_values[None,:] 9 | elif len(row_values.shape) != 2: 10 | raise ValueError(verr_msg.format(len(row_values.shape))) 11 | 12 | if len(col_values.shape) == 1: 13 | col_values = col_values[:,None] 14 | elif len(col_values.shape) != 2: 15 | raise ValueError(verr_msg.format(len(col_values.shape))) 16 | 17 | row_norm = (row_values * row_values).sum(axis=1) ** 0.5 18 | row_norm = row_norm[:,None] 19 | 20 | col_norm = (col_values * col_values).sum(axis=0) ** 0.5 21 | col_norm = col_norm[None,:] 22 | 23 | result = row_values @ col_values 24 | result /= row_norm 25 | result /= col_norm 26 | return 1 - result 27 | 28 | def make_match_strata(records, record_structure, num_strata, max_threshold): 29 | combined_ix = record_structure['fields'].index('BEST_COMBINED_DISTANCE') 30 | low = [i / num_strata * max_threshold 31 | for i in range(0, num_strata)] 32 | high = [i / num_strata * max_threshold 33 | for i in range(1, num_strata + 1)] 34 | ranges = zip(low, high) 35 | 36 | return [[r for r in records[1:] 37 | if r[combined_ix] >= low and r[combined_ix] < high] 38 | for low, high in ranges] 39 | 40 | def label_match_strata(num_strata, max_threshold): 41 | high = [i / num_strata * max_threshold 42 | for i in range(1, num_strata + 1)] 43 | return ['Number of matches below threshold {:.2}'.format(h) 44 | for h in high] 45 | 46 | def chart_match_strata(records, 47 | num_strata=5, max_threshold=1, 48 | start=1, end=None, 49 | figsize=(15, 10), 50 | colormap='plasma', 51 | legend=True): 52 | match_strata = make_match_strata(records, new_record_structure, num_strata, max_threshold) 53 | 54 | cumulative_strata = [match_strata[0:i] for i in 55 | range(len(match_strata), 0, -1)] 56 | match_counters = [Counter(row[4] for matches in strata for row in matches) 57 | for strata in cumulative_strata] 58 | maxn = max(max(mc) for mc in match_counters if mc) 59 | match_cols = [[mc[n] for mc in match_counters] 60 | for n in range(maxn + 1)] 61 | 62 | col_names = label_match_strata(num_strata, max_threshold) 63 | col_names.reverse() 64 | df = pd.DataFrame(match_cols, 65 | index = range(maxn + 1), 66 | columns=col_names) 67 | df.index.name = 'Word index in original script' 68 | df = df.loc[start:end] 69 | df.plot(figsize=figsize, colormap=colormap, legend=legend) 70 | 71 | def most_frequent_matches(records, n_matches, threshold): 72 | ct = Counter(r[3] for r in records if r[-1] < threshold) 73 | ix_to_context = {r[3]: r[4] for r in records} 74 | matches = ct.most_common(n_matches) 75 | return [(i, c, ix_to_context[i]) 76 | for i, c in matches] 77 | return matches 78 | 79 | # ---------------- 80 | # matrix functions 81 | # ---------------- 82 | 83 | def add_matrix_subparser(subparsers): 84 | # Create n-gram matrices (deprecated) 85 | matrix_parser = subparsers.add_parser('matrix', help='deduplicates and builds matrix for best n-gram matches') 86 | matrix_parser.add_argument('i', action='store', help='input csv file') 87 | matrix_parser.add_argument('m', action = 'store', help='fandom/movie name for output file prefix') 88 | matrix_parser.add_argument('-n', action='store', default=6, help='n-gram size, default is 6-grams') 89 | matrix_parser.set_defaults(func=process) 90 | 91 | class StrictNgramDedupe(object): 92 | def __init__(self, data_path, ngram_size): 93 | self.ngram_size = ngram_size 94 | 95 | with open(data_path, encoding='UTF8') as ip: 96 | rows = list(csv.DictReader(ip)) 97 | self.data = rows 98 | self.work_matches = collections.defaultdict(list) 99 | 100 | for r in 
rows: 101 | self.work_matches[r['FAN_WORK_FILENAME']].append(r) 102 | 103 | # Use n-gram starting index as a unique identifier. 104 | self.starts_counter = collections.Counter( 105 | start 106 | for matches in self.work_matches.values() 107 | for start in self.to_ngram_starts(self.segment_full(matches)) 108 | ) 109 | 110 | filtered_matches = [self.top_ngram(span) 111 | for matches in self.work_matches.values() 112 | for span in self.segment_full(matches)] 113 | 114 | self.filtered_matches = [ng for ng in filtered_matches 115 | if self.no_better_match(ng)] 116 | 117 | def num_ngrams(self): 118 | return len(set(int(ng[0]['ORIGINAL_SCRIPT_WORD_INDEX']) 119 | for ng in self.filtered_matches)) 120 | 121 | def match_to_phrase(self, match): 122 | return ' '.join(m['ORIGINAL_SCRIPT_WORD'].lower() for m in match) 123 | 124 | def write_match_work_count_matrix(self, out_filename): 125 | ngrams = {} 126 | works = set() 127 | cells = collections.defaultdict(int) 128 | for m in self.filtered_matches: 129 | phrase = self.match_to_phrase(m) 130 | ix = int(m[0]['ORIGINAL_SCRIPT_WORD_INDEX']) 131 | filename = m[0]['FAN_WORK_FILENAME'] 132 | 133 | ngrams[phrase] = ix 134 | works.add(filename) 135 | cells[(filename, phrase)] += 1 136 | 137 | ngrams = sorted(ngrams, key=ngrams.get) 138 | works = sorted(works) 139 | rows = [[cells[(fn, ng)] for ng in ngrams] 140 | for fn in works] 141 | totals = [sum(r[col] for r in rows) for col in range(len(rows[0]))] 142 | 143 | header = ['FILENAME'] + ngrams 144 | totals = ['(total)'] + totals 145 | rows = [[fn] + r for fn, r in zip(works, rows)] 146 | rows = [header, totals] + rows 147 | 148 | with open(out_filename, 'w', encoding='utf-8') as op: 149 | csv.writer(op).writerows(rows) 150 | 151 | def write_match_sentiment(self, out_filename): 152 | phrases = {} 153 | for m in self.filtered_matches: 154 | phrase = self.match_to_phrase(m) 155 | ix = int(m[0]['ORIGINAL_SCRIPT_WORD_INDEX']) 156 | phrases[phrase] = ix 157 | sorted_phrases = sorted(phrases, key=phrases.get) 158 | 159 | phrase_indices = [phrases[p] for p in sorted_phrases] 160 | phrases = sorted_phrases 161 | 162 | if emolex: 163 | emo_count = [emolex.lex_count(p) for p in phrases] 164 | emo_sent_count = self.project_sentiment_keys(emo_count, 165 | ['NEGATIVE', 'POSITIVE']) 166 | emo_emo_count = self.project_sentiment_keys(emo_count, 167 | ['ANTICIPATION', 168 | 'ANGER', 169 | 'TRUST', 170 | 'SADNESS', 171 | 'DISGUST', 172 | 'SURPRISE', 173 | 'FEAR', 174 | 'JOY']) 175 | if bing: 176 | bing_count = [bing.lex_count(p) for p in phrases] 177 | bing_count = self.project_sentiment_keys(bing_count, 178 | ['NEGATIVE', 'POSITIVE']) 179 | 180 | if liwc: 181 | liwc_count = [liwc.lex_count(p) for p in phrases] 182 | liwc_sent_count = self.project_sentiment_keys(liwc_count, 183 | ['POSEMO', 'NEGEMO']) 184 | liwc_other_keys = set(k for ct in liwc_count for k in ct.keys()) 185 | liwc_other_keys -= set(['POSEMO', 'NEGEMO']) 186 | liwc_other_count = self.project_sentiment_keys(liwc_count, 187 | liwc_other_keys) 188 | 189 | counts = [] 190 | count_labels = [] 191 | 192 | if emolex: 193 | counts.append(emo_emo_count) 194 | counts.append(emo_sent_count) 195 | count_labels.append('NRC_EMOTION_') 196 | count_labels.append('NRC_SENTIMENT_') 197 | 198 | counts.append(bing_count) 199 | count_labels.append('BING_SENTIMENT_') 200 | 201 | if liwc: 202 | counts.append(liwc_sent_count) 203 | counts.append(liwc_other_count) 204 | count_labels.append('LIWC_SENTIMENT_') 205 | count_labels.append('LIWC_ALL_OTHER_') 206 | 207 | rows = 
self.compile_sentiment_groups(counts, count_labels) 208 | 209 | for r, p, i in zip(rows, phrases, phrase_indices): 210 | r['{}-GRAM'.format(self.ngram_size)] = p 211 | r['{}-GRAM_START_INDEX'.format(self.ngram_size)] = i 212 | 213 | fieldnames = sorted(set(k for r in rows for k in r.keys())) 214 | totals = collections.defaultdict(int) 215 | skipkeys = ['{}-GRAM_START_INDEX'.format(self.ngram_size), 216 | '{}-GRAM'.format(self.ngram_size)] 217 | totals[skipkeys[0]] = 0 218 | totals[skipkeys[1]] = '(total)' 219 | for r in rows: 220 | for k in r: 221 | if k not in skipkeys: 222 | totals[k] += r[k] 223 | rows = [totals] + rows 224 | 225 | with open(out_filename, 'w', encoding='utf-8') as op: 226 | wr = csv.DictWriter(op, fieldnames=fieldnames) 227 | wr.writeheader() 228 | wr.writerows(rows) 229 | 230 | def project_sentiment_keys(self, counts, keys): 231 | counts = [{k: ct.get(k, 0) for k in keys} 232 | for ct in counts] 233 | for ct in counts: 234 | if sum(ct.values()) == 0: 235 | ct['UNDETERMINED'] = 1 236 | else: 237 | ct['UNDETERMINED'] = 0 238 | 239 | return counts 240 | 241 | def compile_sentiment_groups(self, groups, prefixes): 242 | new_rows = [] 243 | for group_row in zip(*groups): 244 | new_row = {} 245 | for gr, pf in zip(group_row, prefixes): 246 | for k, v in gr.items(): 247 | new_row[pf + k] = v 248 | new_rows.append(new_row) 249 | return new_rows 250 | 251 | def get_spans(self, indices): 252 | starts = [0] 253 | ends = [] 254 | for i in range(1, len(indices)): 255 | if indices[i] != indices[i - 1] + 1: 256 | starts.append(i) 257 | ends.append(i) 258 | ends.append(len(indices)) 259 | return list(zip(starts, ends)) 260 | 261 | def segment_matches(self, matches, key): 262 | matches = sorted(matches, key=lambda m: int(m[key])) 263 | indices = [int(m[key]) for m in matches] 264 | return [[matches[i] for i in range(start, end)] 265 | for start, end in self.get_spans(indices)] 266 | 267 | def segment_fan_matches(self, matches): 268 | return self.segment_matches(matches, 'FAN_WORK_WORD_INDEX') 269 | 270 | def segment_orig_matches(self, matches): 271 | return self.segment_matches(matches, 'ORIGINAL_SCRIPT_WORD_INDEX') 272 | 273 | def segment_full(self, matches): 274 | return [orig_m 275 | for fan_m in self.segment_fan_matches(matches) 276 | for orig_m in self.segment_orig_matches(fan_m) 277 | if len(orig_m) >= self.ngram_size] 278 | 279 | def to_ngram_starts(self, match_spans): 280 | return [int(ms[i]['ORIGINAL_SCRIPT_WORD_INDEX']) 281 | for ms in match_spans 282 | for i in range(len(ms) - self.ngram_size + 1)] 283 | 284 | def start_count_key(self, span): 285 | def key(i): 286 | script_ix = int(span[i]['ORIGINAL_SCRIPT_WORD_INDEX']) 287 | return self.starts_counter.get(script_ix, 0) 288 | return key 289 | 290 | def no_better_match(self, ng): 291 | start = int(ng[0]['ORIGINAL_SCRIPT_WORD_INDEX']) 292 | best_start = max(range(start - self.ngram_size + 1, 293 | start + self.ngram_size), 294 | key=self.starts_counter.__getitem__) 295 | return start == best_start 296 | 297 | def top_ngram(self, span): 298 | start = max( 299 | range(len(span) - self.ngram_size + 1), 300 | key=self.start_count_key(span) 301 | ) 302 | return span[start: start + self.ngram_size] 303 | 304 | 305 | def process(inputs): 306 | ngram_size = inputs['n'] 307 | in_file = inputs['i'] 308 | out_prefix = inputs['m'] 309 | 310 | matrix_out = '{}-most-common-perfect-matches-no-overlap-{}-gram-match-matrix.csv'.format(out_prefix, ngram_size) 311 | sentiment_out = 
'{}-most-common-perfect-matches-no-overlap-{}-gram-sentiment.csv'.format(out_prefix, ngram_size) 312 | 313 | dd = StrictNgramDedupe(in_file, ngram_size=ngram_size) 314 | #print(dd.num_ngrams()) 315 | 316 | dd.write_match_work_count_matrix(matrix_out) 317 | -------------------------------------------------------------------------------- /ao3.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | 3 | import re 4 | import os 5 | import sys 6 | import json 7 | import csv 8 | import random 9 | import argparse 10 | import requests 11 | import collections 12 | from time import sleep 13 | 14 | import numpy 15 | import pandas as pd 16 | from bs4 import BeautifulSoup 17 | 18 | import search 19 | import vis 20 | 21 | try: 22 | import lextrie 23 | bing = lextrie.LexTrie.from_plugin('bing') 24 | 25 | try: 26 | emolex = lextrie.LexTrie.from_plugin('emolex_en') 27 | except Exception: 28 | emolex = None 29 | 30 | try: 31 | liwc = lextrie.LexTrie.from_plugin('liwc') 32 | except Exception: 33 | liwc = None 34 | except ImportError: 35 | bing = None 36 | emolex = None 37 | liwc = None 38 | 39 | 40 | # ----------------------------------------------------------------------------- 41 | # HTML TO TXT FUNCTIONS 42 | # --------------------- 43 | 44 | def get_fan_work(fan_html_name): 45 | with open(fan_html_name, encoding='utf8') as fan_in: 46 | fan_html = BeautifulSoup(fan_in.read(), "lxml") 47 | fan_txt = fan_html.find(id='workskin') 48 | if fan_txt is None: 49 | return '' 50 | 51 | fan_txt = ' '.join(fan_txt.strings) 52 | fan_txt = re.split(r'Work Text\b([\s:]*)', fan_txt, maxsplit=1)[-1] 53 | fan_txt = re.split(r'Chapter 1\b([\s:]*)', fan_txt, maxsplit=1)[-1] 54 | fan_txt = fan_txt.replace('Chapter Text', ' ') 55 | fan_txt = re.sub(r'\s+', ' ', fan_txt).strip() 56 | return fan_txt 57 | 58 | def convert_dir(args): 59 | html_dir = args.input 60 | out_dir = args.output 61 | 62 | try: 63 | os.makedirs(out_dir) 64 | except Exception: 65 | pass 66 | 67 | errors = [] 68 | for infile in os.listdir(html_dir): 69 | base, ext = os.path.splitext(infile) 70 | outfile = os.path.join(out_dir, base + '.txt') 71 | infile = os.path.join(html_dir, infile) 72 | 73 | if not os.path.exists(outfile): 74 | text = get_fan_work(infile) 75 | if text: 76 | with open(outfile, 'w', encoding='utf-8') as out: 77 | out.write(text) 78 | else: 79 | errors.append(infile) 80 | 81 | error_outfile = 'clean-html-errors.txt' 82 | with open(error_outfile, 'w', encoding='utf-8') as out: 83 | out.write('The following files were not converted:\n\n') 84 | for e in errors: 85 | out.write(e) 86 | out.write('\n') 87 | 88 | # ------------------ 89 | # METADATA FUNCTIONS 90 | # ------------------ 91 | 92 | def select_text(soup_node, selector): 93 | sel = soup_node.select(selector) 94 | return sel[0].get_text().strip() if sel else 'AOOO_UNSPECIFIED' 95 | # "AOOO_UNSPECIFIED" means value not in An Archive of Our Own metadata field 96 | 97 | meta_headers = ['FILENAME', 'TITLE', 'AUTHOR', 'SUMMARY', 'NOTES', 98 | 'PUBLICATION_DATE', 'LANGUAGE', 'TAGS'] 99 | def get_fan_meta(fan_html_name): 100 | with open(fan_html_name, encoding='utf8') as fan_in: 101 | fan_html = BeautifulSoup(fan_in.read(), 'lxml') 102 | 103 | title = select_text(fan_html, '.title.heading') 104 | author = select_text(fan_html, '.byline.heading') 105 | summary = select_text(fan_html, '.summary.module') 106 | notes = select_text(fan_html, '.notes.module') 107 | date = select_text(fan_html, 'dd.published') 108 | language = select_text(fan_html, 
'dd.language') 109 | tags = {k.get_text().strip().strip(':'): 110 | v.get_text(separator='; ').strip().strip('\n; ') 111 | for k, v in 112 | zip(fan_html.select('dt.tags'), fan_html.select('dd.tags'))} 113 | tags = json.dumps(tags) 114 | 115 | path, filename = os.path.split(fan_html_name) 116 | 117 | vals = [filename, title, author, summary, notes, 118 | date, language, tags] 119 | return dict(zip(meta_headers, vals)) 120 | 121 | def collect_meta(args): 122 | in_dir = args.input 123 | out_file = args.output 124 | 125 | errors = [] 126 | rows = [] 127 | for infile in os.listdir(in_dir): 128 | infile = os.path.join(in_dir, infile) 129 | rows.append(get_fan_meta(infile)) 130 | 131 | error_outfile = out_file + '-errors.txt' 132 | with open(error_outfile, 'w', encoding='utf-8') as out: 133 | out.write('Metadata could not be collected from the following files:\n\n') 134 | for e in errors: 135 | out.write(e) 136 | out.write('\n') 137 | 138 | csv_outfile = out_file + '.csv' 139 | with open(csv_outfile, 'w', encoding='utf-8') as out: 140 | wr = csv.DictWriter(out, fieldnames=meta_headers) 141 | wr.writeheader() 142 | for row in rows: 143 | wr.writerow(row) 144 | 145 | #---------------- 146 | #SCRAPE FUNCTIONS 147 | #---------------- 148 | class Logger: 149 | def __init__(self, logfile='log.txt'): 150 | self.logfile = logfile 151 | 152 | def log(self, msg, newline=True): 153 | with open(self.logfile, 'a') as f: 154 | f.write(msg) 155 | if newline: 156 | f.write('\n') 157 | 158 | _logger = Logger() 159 | log = _logger.log 160 | 161 | _error_id_log = Logger(logfile='error-ids.txt') 162 | log_error_id = _error_id_log.log 163 | 164 | def load_error_ids(): 165 | with open(_error_id_log.logfile, 'w+') as ip: 166 | ids = set(l.strip() for l in ip.readlines()) 167 | return ids 168 | 169 | class InlineDisplay: 170 | def __init__(self): 171 | self.currlen = 0 172 | 173 | def display(self, s): 174 | print(s, end=' ') 175 | sys.stdout.flush() 176 | self.currlen += len(s) + 1 177 | 178 | def reset(self): 179 | print('', end='\r') 180 | print(' ' * self.currlen, end='\r') 181 | sys.stdout.flush() 182 | self.currlen = 0 183 | 184 | _id = InlineDisplay() 185 | display = _id.display 186 | reset_display = _id.reset 187 | 188 | def request_loop(url, timeout=4.0, sleep_base=1.0): 189 | # We try 20 times. But we double the delay each time, 190 | # so that we don't get really annoying. Eventually the 191 | # delay will be more than an hour long, at which point 192 | # we'll try a few more times, and then give up. 193 | 194 | orig_url = url 195 | for i in range(20): 196 | if sleep_base > 7200: # Only delay up to an hour. 
197 | sleep_base /= 2 198 | url = '{}#{}'.format(orig_url, random.randrange(1000)) 199 | display('Sleeping for {} seconds;'.format(sleep_base)) 200 | sleep(sleep_base) 201 | try: 202 | response = requests.get(url, timeout=timeout) 203 | response.raise_for_status() 204 | return response.text 205 | except requests.exceptions.HTTPError: 206 | code = response.status_code 207 | if code >= 400 and code < 500: 208 | display('Unrecoverable error ({})'.format(code)) 209 | return '' 210 | else: 211 | sleep_base *= 2 212 | display('Recoverable error ({});'.format(code)) 213 | except requests.exceptions.ReadTimeout as exc: 214 | sleep_base *= 2 215 | display('Read timed out -- trying again;') 216 | except requests.exceptions.RequestException as exc: 217 | sleep_base *= 2 218 | display('Unexpected error ({}), trying again;\n'.format(exc)) 219 | else: 220 | return None 221 | 222 | def scrape(args): 223 | search_term = args.search 224 | tag = args.tag 225 | header = args.url 226 | out_dir = args.out 227 | end = args.startpage 228 | 229 | # tag scraping option 230 | if search_term: 231 | pp = 1 232 | safe_search = search_term.replace(' ', '+') 233 | # an alternative here is to scrape this page and use regex to filter the results: 234 | # http://archiveofourown.org/media/Movies/fandoms? 235 | # the canonical filter is used here because the "fandom" filter on the 236 | # beta tag search is broken as of November 2017 237 | search_ref = "http://archiveofourown.org/tags/search?utf8=%E2%9C%93&query%5Bname%5D=" + safe_search + "&query%5Btype%5D=&query%5Bcanonical%5D=true&page=" 238 | print('\nTags:') 239 | 240 | tags = ["initialize"] 241 | while (len(tags)) != 0: 242 | results_page = requests.get(search_ref + str(pp)) 243 | results_soup = BeautifulSoup(results_page.text, "lxml") 244 | tags = results_soup(attrs={'href': re.compile('^/tags/[^s]....[^?].*')}) 245 | 246 | for x in tags: 247 | print(x.string) 248 | 249 | pp += 1 250 | 251 | # fan work scraping options 252 | if header or tag: 253 | try: 254 | os.makedirs(out_dir) 255 | except Exception: 256 | pass 257 | 258 | os.chdir(out_dir) 259 | error_works = load_error_ids() 260 | 261 | results = ["initialize"] 262 | while (len(results)) != 0: 263 | log('\n\nPAGE ' + str(end)) 264 | print('Page {} '.format(end)) 265 | 266 | display('Loading table of contents;') 267 | 268 | if tag: 269 | mod_header = tag.replace(' ', '%20') 270 | header = "http://archiveofourown.org/tags/" + mod_header + "/works" 271 | 272 | request_url = header + "?page=" + str(end) 273 | toc_page = request_loop(request_url) 274 | if not toc_page: 275 | err_msg = 'Error loading TOC; aborting.' 
276 | log(err_msg) 277 | display(err_msg) 278 | reset_display() 279 | continue 280 | 281 | toc_page_soup = BeautifulSoup(toc_page, "lxml") 282 | results = toc_page_soup(attrs={'href': re.compile('^/works/[0-9]+[0-9]$')}) 283 | 284 | log('Number of Works on Page {}: {}'.format(end, len(results))) 285 | log('Page URL: {}'.format(request_url)) 286 | log('Progress: ') 287 | 288 | reset_display() 289 | 290 | for x in results: 291 | body = str(x).split('"') 292 | docID = str(body[1]).split('/')[2] 293 | filename = str(docID) + '.html' 294 | 295 | if os.path.exists(filename): 296 | display('Work {} already exists -- skpping;'.format(docID)) 297 | reset_display() 298 | msg = ('skipped existing document {} on ' 299 | 'page {} ({} bytes)') 300 | log(msg.format(docID, str(end), 301 | os.path.getsize(filename))) 302 | elif docID in error_works: 303 | display('Work {} is known to cause errors ' 304 | '-- skipping;'.format(docID)) 305 | reset_display() 306 | msg = ('skipped document {} on page {} ' 307 | 'known to cause errors') 308 | log(msg.format(docID, str(end))) 309 | 310 | else: 311 | display('Loading work {};'.format(docID)) 312 | work_request_url = "https://archiveofourown.org/" + body[1] + "?view_adult=true&view_full_work=true" 313 | work_page = request_loop(work_request_url) 314 | 315 | if work_page is None: 316 | error_works.add(docID) 317 | log_error_id(docID) 318 | continue 319 | 320 | with open(filename, 'w', encoding='utf-8') as html_out: 321 | bytes_written = html_out.write(str(work_page)) 322 | 323 | msg = 'reached document {} on page {}, saved {} bytes' 324 | log(msg.format(docID, str(end), bytes_written)) 325 | reset_display() 326 | 327 | reset_display() 328 | end = end + 1 329 | 330 | # ----------------------------------- 331 | # data visualization format functions 332 | # ----------------------------------- 333 | def project_sentiment_keys_shortform(counts, keys): 334 | counts = [{k: ct.get(k, 0) for k in keys} 335 | for ct in counts] 336 | for ct in counts: 337 | if sum(ct.values()) == 0: 338 | ct['UNDETERMINED'] = 1 339 | else: 340 | ct['UNDETERMINED'] = 0 341 | return counts 342 | 343 | def regex(name): 344 | return (re.sub('(?P (\w))*(\\(.*\\))', '\g', name)).strip() 345 | 346 | def format_data(args): 347 | original_script_markup = args.script 348 | match_table = args.matches 349 | output = args.output 350 | 351 | matches = pd.read_csv(match_table) 352 | 353 | name = 'Frequency of Reuse (Exact Matches)' 354 | positive_match = matches.BEST_COMBINED_DISTANCE <= 0 355 | matches_thresh = matches.assign(**{name: positive_match}) 356 | 357 | thresholds = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5] 358 | threshname = ['Frequency of Reuse (0-{})'.format(str(t)) for t in thresholds] 359 | for thresh, name in zip(thresholds, threshname): 360 | positive_match = matches.BEST_COMBINED_DISTANCE <= thresh 361 | matches_thresh = matches_thresh.assign(**{name: positive_match}) 362 | thresholds = [0] + thresholds 363 | threshname = ['Frequency of Reuse (Exact Matches)'] + threshname 364 | 365 | os_markup_raw = search.load_markup_script(original_script_markup) 366 | os_markup_header = os_markup_raw[0] 367 | os_markup_raw = os_markup_raw[1:] 368 | 369 | lt = emolex # LexTrie.from_plugin('emolex_en') 370 | emo_terms = ['ANGER', 371 | 'ANTICIPATION', 372 | 'DISGUST', 373 | 'FEAR', 374 | 'JOY', 375 | 'SADNESS', 376 | 'SURPRISE', 377 | 'TRUST', 378 | 'NEGATIVE', 379 | 'POSITIVE'] 380 | 381 | os_markup_header.extend(emo_terms) 382 | for r in os_markup_raw: 383 | emos = lt.get_lex_tags(r[0]) 
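# emos holds the lexicon tags matched for this word (lt is the emolex trie);
# the next line appends one 0/1 indicator column per term in emo_terms.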
384 | r.extend(int(t in emos) for t in emo_terms) 385 | 386 | os_markup = pd.DataFrame(os_markup_raw, columns=os_markup_header) 387 | os_markup.index.name = 'ORIGINAL_SCRIPT_WORD_INDEX' 388 | 389 | #os_markup.CHARACTER = os_markup.CHARACTER.apply(regex) 390 | top_eight = collections.Counter(os_markup.CHARACTER).most_common(8) 391 | top_eight_list = [] 392 | for (name_top,val) in top_eight: 393 | top_eight_list = top_eight_list + [name_top] 394 | top_eight_char = ["CHARACTER_" + name for name in top_eight_list] 395 | used_names = [] 396 | name_char = top_eight[0][0] 397 | positive_match = 1 * (os_markup.CHARACTER == name_char) 398 | matches_name = os_markup.assign(**{"CHARACTER_" + name_char.upper(): positive_match}) 399 | used_names = ["CHARACTER_" + name_char] + used_names 400 | 401 | for top_name in top_eight_list: 402 | if "CHARACTER_" + top_name not in used_names: 403 | positive_matches = 1 * (os_markup.CHARACTER == top_name) 404 | matches_name = matches_name.assign(**{"CHARACTER_" + top_name.upper(): positive_matches}) 405 | used_names = used_names + [top_name] 406 | 407 | match_word_counts = matches_thresh.groupby( 408 | 'ORIGINAL_SCRIPT_WORD_INDEX' 409 | ).aggregate({ 410 | name: numpy.sum for name in threshname 411 | }) 412 | 413 | match_word_counts = match_word_counts.reindex( 414 | matches_name.index, 415 | fill_value=0 416 | ) 417 | 418 | match_word_words = matches_thresh.groupby( 419 | 'ORIGINAL_SCRIPT_WORD_INDEX' 420 | ).aggregate({ 421 | 'ORIGINAL_SCRIPT_WORD': numpy.max, 422 | }) 423 | 424 | match_word_counts = match_word_counts.join(match_word_words) 425 | 426 | match_count = (match_word_counts.join(matches_name)) 427 | 428 | match_count.to_csv(output) 429 | 430 | def _format_data_sentiment_only(args): 431 | fin = args.s 432 | fout = args.o 433 | 434 | markup_script = search.load_markup_script(fin) 435 | markup_script = markup_script[1:] 436 | list_script = [[i] + r for i, r in enumerate(markup_script)] 437 | 438 | csv_script = pd.DataFrame(list_script) 439 | csv_script.columns = ['ORIGINAL_SCRIPT_INDEX', 440 | 'LOWERCASE', 441 | 'SPACY_ORTH_ID', 442 | 'SCENE', 443 | 'CHARACTER'] 444 | 445 | bing_count = [bing.lex_count(j[1]) for j in list_script] 446 | bing_sentiment_keys = ['NEGATIVE', 'POSITIVE'] 447 | bing_count = project_sentiment_keys_shortform(bing_count, bing_sentiment_keys) 448 | bing_DF = pd.DataFrame(bing_count) 449 | 450 | bing_DF['ORIGINAL_SCRIPT_INDEX'] = csv_script['ORIGINAL_SCRIPT_INDEX'] 451 | out = pd.merge(csv_script, bing_DF, on='ORIGINAL_SCRIPT_INDEX') 452 | 453 | if emolex: 454 | emo_count = [emolex.lex_count(j[1]) for j in list_script] 455 | emo_sentiment_keys = ['ANTICIPATION', 'ANGER', 'TRUST', 'SADNESS','DISGUST', 456 | 'SURPRISE', 'FEAR', 'JOY', 'NEGATIVE', 'POSITIVE'] 457 | emo_count = project_sentiment_keys_shortform(emo_count, emo_sentiment_keys) 458 | emo_DF = pd.DataFrame(emo_count) 459 | emo_DF['ORIGINAL_SCRIPT_INDEX'] = csv_script['ORIGINAL_SCRIPT_INDEX'] 460 | out = pd.merge(out, emo_DF, on='ORIGINAL_SCRIPT_INDEX') 461 | 462 | if liwc: 463 | liwc_count = [liwc.lex_count(j[1]) for j in list_script] 464 | 465 | liwc_sentiment_keys = ['POSEMO', 'NEGEMO'] 466 | liwc_sent_count = project_sentiment_keys_shortform(liwc_count, liwc_sentiment_keys) 467 | liwc_sent_DF = pd.DataFrame(liwc_sent_count) 468 | liwc_sent_DF['ORIGINAL_SCRIPT_INDEX'] = csv_script['ORIGINAL_SCRIPT_INDEX'] 469 | out = pd.merge(out, liwc_sent_DF, on='ORIGINAL_SCRIPT_INDEX') 470 | 471 | liwc_other_keys = set(k for ct in liwc_count for k in ct.keys()) 472 | liwc_other_keys -= 
set(['POSEMO', 'NEGEMO']) #already used these 473 | liwc_other_count = project_sentiment_keys_shortform(liwc_count, liwc_other_keys) 474 | liwc_other_DF = pd.DataFrame(liwc_other_count) 475 | liwc_other_DF['ORIGINAL_SCRIPT_INDEX'] = csv_script['ORIGINAL_SCRIPT_INDEX'] 476 | out = pd.merge(out, liwc_other_DF, on='ORIGINAL_SCRIPT_INDEX') 477 | 478 | out.to_csv(fout + '.csv', index=False) 479 | 480 | # ----------------------------------------------------------------------------- 481 | # SCRIPT 482 | # ------ 483 | 484 | if __name__ == '__main__': 485 | 486 | parser = argparse.ArgumentParser(description='process fanworks scraped from Archive of Our Own.') 487 | subparsers = parser.add_subparsers(help='scrape, clean, getmeta, search, format, or vis') 488 | 489 | #sub-parsers 490 | scrape_parser = subparsers.add_parser('scrape', help='find and scrape fanfiction works from Archive of Our Own') 491 | group = scrape_parser.add_mutually_exclusive_group() 492 | group.add_argument('-s', '--search', action='store', help="search term to search for a tag to scrape") 493 | group.add_argument('-t', '--tag', action='store', help="the tag to be scraped") 494 | group.add_argument('-u', '--url', action='store', help="the full URL of first page to be scraped") 495 | scrape_parser.add_argument('-o', '--out', action='store', default=os.path.join('.','scraped-html'), help="target directory for scraped html files") 496 | scrape_parser.add_argument('-p', '--startpage', action='store', default=1, type=int, help="page on which to begin downloading (to resume a previous job)") 497 | scrape_parser.set_defaults(func=scrape) 498 | 499 | clean_parser = subparsers.add_parser('clean', help='takes a directory of html files and yields a new directory of text files') 500 | clean_parser.add_argument('input', action='store', help='directory of input html files to clean') 501 | clean_parser.add_argument('-o', '--output', action='store', default='plain-text', help='target directory for output txt files') 502 | clean_parser.set_defaults(func=convert_dir) 503 | 504 | meta_parser = subparsers.add_parser('getmeta', help='takes a directory of html files and yields a csv file containing metadata') 505 | meta_parser.add_argument('input', action='store', help='directory of input html files to process') 506 | meta_parser.add_argument('-o', '--output', action='store', default='fan-meta', help='filename for metadata csv file') 507 | meta_parser.set_defaults(func=collect_meta) 508 | 509 | validate_parser = subparsers.add_parser('validate', help='validate script markup') 510 | validate_parser.add_argument('script', action='store', help='filename for markup version of script') 511 | validate_parser.set_defaults(func=search.validate_cmd) 512 | 513 | # Search for reuse 514 | search_parser = subparsers.add_parser('search', help='compare fanworks with the original script') 515 | search_parser.add_argument('fan_works', action='store', help='directory of fanwork text files') 516 | search_parser.add_argument('script', action='store', help='filename for markup version of script') 517 | search_parser.add_argument('-n', '--num-works', default=-1, type=int, help="number of works to search (for subsampling)") 518 | search_parser.add_argument('-s', '--skip-works', default=0, type=int, help="number of works to skip (for subsampling)") 519 | search_parser.set_defaults(func=search.analyze) 520 | 521 | # Aggregate word-level counts 522 | data_parser = subparsers.add_parser('format', help='takes a script and outputs a csv with senitment information for each word 
formatted for javascript visualization') 523 | data_parser.add_argument('matches', action='store', help='filename for search output') 524 | data_parser.add_argument('script', action='store', help='filename for markup version of script') 525 | data_parser.add_argument('-o', '--output', action='store', default='js-data.csv', help='filename for csv output file of data formatted for visualization') 526 | data_parser.set_defaults(func=format_data) 527 | 528 | # Generate visualizaiton 529 | vis_parser = subparsers.add_parser('vis', 530 | help='takes a formatted data csv ' 531 | 'and generates an embeddable ' 532 | 'visualization') 533 | vis_parser.add_argument('input', action='store', 534 | help='the data csv to use for generating the ' 535 | 'visualization') 536 | vis_parser.add_argument('-s', '--static', action='store_true', 537 | default=False, 538 | help="save a full html file") 539 | vis_parser.add_argument('-o', '--output', action='store', 540 | default='reuse.html', 541 | help="output filename") 542 | vis_parser.add_argument('-w', '--words-per-chunk', type=int, default=140, 543 | help='number of words per script segment') 544 | vis_parser.set_defaults(func=vis.save_plot) 545 | 546 | # handle args 547 | args = parser.parse_args() 548 | 549 | # call function 550 | if hasattr(args, 'func'): 551 | args.func(args) 552 | else: 553 | parser.print_help() 554 | -------------------------------------------------------------------------------- /fanworks/.gitignore: -------------------------------------------------------------------------------- 1 | * 2 | !.gitignore 3 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pandas 2 | nearpy 3 | spacy 4 | requests 5 | bs4 6 | lxml 7 | python-Levenshtein 8 | bokeh 9 | https://github.com/senderle/lextrie/archive/master.zip 10 | -------------------------------------------------------------------------------- /results/.gitignore: -------------------------------------------------------------------------------- 1 | * 2 | !.gitignore 3 | -------------------------------------------------------------------------------- /scripts/.gitignore: -------------------------------------------------------------------------------- 1 | * 2 | !.gitignore 3 | -------------------------------------------------------------------------------- /search.py: -------------------------------------------------------------------------------- 1 | import multiprocessing 2 | import datetime 3 | import csv 4 | import os 5 | import re 6 | import sys 7 | import random 8 | from operator import itemgetter 9 | from collections import defaultdict 10 | 11 | import numpy 12 | import nearpy 13 | import spacy 14 | from Levenshtein import distance as lev_distance 15 | 16 | _SPACY_MODEL = None 17 | 18 | # Approximate nearest neighbors search settings: 19 | 20 | new_record_structure = { 21 | 'fields': ['FAN_WORK_FILENAME', 22 | 'FAN_WORK_WORD_INDEX', 23 | 'FAN_WORK_WORD', 24 | 'FAN_WORK_ORTH_ID', 25 | 'ORIGINAL_SCRIPT_WORD_INDEX', 26 | 'ORIGINAL_SCRIPT_WORD', 27 | 'ORIGINAL_SCRIPT_ORTH_ID', 28 | 'ORIGINAL_SCRIPT_CHARACTER', 29 | 'ORIGINAL_SCRIPT_SCENE', 30 | 'BEST_MATCH_DISTANCE', 31 | 'BEST_LEVENSHTEIN_DISTANCE', 32 | 'BEST_COMBINED_DISTANCE', 33 | ], 34 | 'types': [str, int, str, int, int, str, 35 | int, str, int, float, int, float 36 | ] 37 | } 38 | 39 | 40 | def get_spacy_model(): 41 | global _SPACY_MODEL 42 | if _SPACY_MODEL is None: 43 | _SPACY_MODEL = 
spacy.load('en_core_web_md', 44 | disable=['parser', 'tagger', 'ner']) 45 | return _SPACY_MODEL 46 | 47 | def sp_parse_chunks(txt, size=100000): 48 | spacy_model = get_spacy_model() 49 | 50 | start = 0 51 | if len(txt) < 100000: 52 | yield spacy_model(txt) 53 | return 54 | 55 | while start < len(txt): 56 | end = start + 100000 57 | if end > len(txt): 58 | end = len(txt) 59 | else: 60 | while txt[end] != ' ': 61 | end -= 1 62 | yield spacy_model(txt[start: end]) 63 | start = end + 1 64 | 65 | def mk_vectors(sp_txt): 66 | # Given a text, parse it into `spacy`'s native format, 67 | # and produce a sequence of vectors, one per token. 68 | 69 | rows = len(sp_txt) 70 | cols = len(sp_txt[0].vector) if rows else 0 71 | 72 | vectors = numpy.empty((rows, cols), dtype=float) 73 | for i, word in enumerate(sp_txt): 74 | if word.has_vector: 75 | vectors[i] = word.vector 76 | else: 77 | # `spacy` doesn't have a pre-trained vector for this word, 78 | # so give it a unique random vector. 79 | w_str = str(word) 80 | vectors[i] = 0 81 | vectors[i][hash(w_str) % cols] = 1.0 82 | vectors[i][hash(w_str * 2) % cols] = 1.0 83 | vectors[i][hash(w_str * 3) % cols] = 1.0 84 | return vectors 85 | 86 | def build_lsh_engine(orig, window_size, number_of_hashes, hash_dimensions): 87 | # Build the ngram vectors using rolling windows. 88 | # Variables named `*_win_vectors` contain vectors for 89 | # the given input, such that each row is the vector 90 | # for a single window. Successive windows overlap 91 | # at all words except for the first and last. 92 | 93 | orig_vectors = mk_vectors(orig) 94 | orig_win_vectors = numpy.array([orig_vectors[i:i + window_size, :].ravel() 95 | for i in range(orig_vectors.shape[0] - window_size + 1)]) 96 | 97 | # Initialize the approximate nearest neighbor search algorithm. 98 | # This creates the search "engine" and populates its index with 99 | # the window-vectors from the original script. We can then pass 100 | # over the window-vectors from a fan work, taking each vector 101 | # and searching for good matches in the engine's index of script 102 | # text. 103 | 104 | # We could do the search in the opposite direction, storing 105 | # fan text in the engine's index, and passing over window- 106 | # vectors from the original script, searching for matches in 107 | # the index of fan text. Unfortuantely, the quality of the 108 | # matches found goes down when you add too many values to the 109 | # engine's index. 
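# Each window-vector stored below is keyed by a (window start index, window
# text) tuple, and engine.neighbours() later returns
# (vector, (start_index, window_text), distance) tuples, which
# AnnIndexSearch.search() unpacks to recover the matching script span.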
110 | vector_dim = orig_win_vectors.shape[1] 111 | 112 | hashes = [] 113 | for i in range(number_of_hashes): 114 | h = nearpy.hashes.RandomBinaryProjections('rbp{}'.format(i), 115 | hash_dimensions) 116 | hashes.append(h) 117 | 118 | engine = nearpy.Engine(vector_dim, 119 | lshashes=hashes, 120 | distance=nearpy.distances.CosineDistance()) 121 | 122 | for ix, row in enumerate(orig_win_vectors): 123 | engine.store_vector(row, (ix, str(orig[ix: ix + window_size]))) 124 | return engine 125 | 126 | def multi_search_wrapper(work): 127 | result = _ANN_INDEX.search(work) 128 | return result 129 | 130 | class AnnIndexSearch(object): 131 | def __init__(self, original_script_filename, window_size, 132 | number_of_hashes, hash_dimensions, distance_threshold): 133 | orig_csv = load_markup_script(original_script_filename) 134 | orig_csv = orig_csv[1:] # drop header 135 | orig_csv = [[i] + r for i, r in enumerate(orig_csv)] 136 | # [['ORIGINAL_SCRIPT_INDEX', 137 | # 'LOWERCASE', 138 | # 'SPACY_ORTH_ID', 139 | # 'SCENE', 140 | # 'CHARACTER']] 141 | 142 | (self.word_index, 143 | self.word_lowercase, 144 | self.orth_id, 145 | self.scene, 146 | self.character) = zip(*orig_csv) 147 | 148 | self.window_size = window_size 149 | self.distance_threshold = distance_threshold 150 | self.spacy_model = get_spacy_model() 151 | orig_doc = spacy.tokens.Doc(self.spacy_model.vocab, self.word_lowercase) 152 | self.engine = build_lsh_engine(orig_doc, window_size, 153 | number_of_hashes, hash_dimensions) 154 | self.reset_stats() 155 | 156 | def reset_stats(self): 157 | self._windows_processed = 0 158 | 159 | @property 160 | def windows_processed(self): 161 | return self._windows_processed 162 | 163 | def search(self, filename): 164 | with open(filename, encoding='utf8') as fan_file: 165 | fan = fan_file.read() 166 | fan = [t for ch in sp_parse_chunks(fan) for t in ch if not t.is_space] 167 | 168 | # Create the fan windows: 169 | fan_vectors = mk_vectors(fan) 170 | fan_win_vectors = numpy.array( 171 | [fan_vectors[i:i + self.window_size, :].ravel() 172 | for i in range(fan_vectors.shape[0] - self.window_size + 1)] 173 | ) 174 | 175 | duplicate_records = defaultdict(list) 176 | for fan_ix, row in enumerate(fan_win_vectors): 177 | self._windows_processed += 1 178 | results = self.engine.neighbours(row) 179 | 180 | # Extract data about the original script 181 | # embedded in the engine's results. 182 | results = [(match_ix, match_str, distance) 183 | for vec, (match_ix, match_str), distance in results 184 | if distance < self.distance_threshold] 185 | 186 | # Create a new record with original script 187 | # information and fan work information. 
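# Each matching window is expanded into one candidate record per word
# position it covers, keyed by (filename, fan word index); when several
# windows cover the same fan-work word, the deduplication step below keeps
# only the record with the smallest combined distance.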
188 | for match_ix, match_str, distance in results: 189 | fan_context = str(fan[fan_ix: fan_ix + self.window_size]) 190 | lev_d = lev_distance(match_str, fan_context) 191 | 192 | for window_ix in range(self.window_size): 193 | fan_word_ix = fan_ix + window_ix 194 | fan_word = fan[fan_word_ix].orth_ 195 | fan_orth_id = fan[fan_word_ix].orth 196 | 197 | orig_word_ix = match_ix + window_ix 198 | orig_word = self.word_lowercase[orig_word_ix] 199 | orig_orth_id = self.orth_id[orig_word_ix] 200 | char = self.character[orig_word_ix] 201 | scene = self.scene[orig_word_ix] 202 | 203 | duplicate_records[(filename, fan_word_ix)].append( 204 | # NOTE: This **must** match the definition 205 | # of `record_structure` above 206 | [filename, 207 | fan_word_ix, 208 | fan_word, 209 | fan_orth_id, 210 | orig_word_ix, 211 | orig_word, 212 | orig_orth_id, 213 | char, 214 | scene, 215 | distance, 216 | lev_d, 217 | distance * lev_d] 218 | ) 219 | 220 | # To deduplicate duplicate_records, we 221 | # pick the single best match, as measured by 222 | # the combined distance for the given n-gram 223 | # match that first identified the word. 224 | for k, dset in duplicate_records.items(): 225 | duplicate_records[k] = min(dset, key=itemgetter(11)) 226 | return sorted(duplicate_records.values()) 227 | 228 | def validate_markup_script(filename, 229 | interactive=False, 230 | _unbalanced_l=re.compile('<<[^>]*<<'), 231 | _unbalanced_r=re.compile('>>[^<]*>>'), 232 | _tags=re.compile('>>\s*([^<]*)\s*<<')): 233 | with open(filename, encoding='utf-8') as ip: 234 | script = ip.read() 235 | 236 | print('Checking script for markup errors.') 237 | print() 238 | 239 | errs = False 240 | unbal_l = _unbalanced_l.findall(script) 241 | if unbal_l: 242 | print('Unbalanced left tag delimiters:') 243 | for m in _unbalanced_l.finditer(script): 244 | line = script[:m.start() + 1].count('\n') + 1 245 | print(' On line {}'.format(line)) 246 | print(' {}'.format(m.group().strip())) 247 | errs = True 248 | print() 249 | 250 | unbal_r = _unbalanced_r.findall(script) 251 | if unbal_r: 252 | print('Unbalanced right tag delimiters:') 253 | for m in _unbalanced_r.finditer(script): 254 | line = script[:m.start() + 1].count('\n') + 1 255 | print(' On line {}'.format(line)) 256 | print(' {}'.format(m.group().strip())) 257 | errs = True 258 | print() 259 | 260 | tag_set = set(t.strip() for t in _tags.findall(script)) 261 | expected_tags = set(('LINE', 'DIRECTION', 'SCENE_NUMBER', 'SCENE_DESCRIPTION', 'CHARACTER_NAME')) 262 | if tag_set - expected_tags: 263 | print('Unexpected tag labels:') 264 | for m in _tags.finditer(script): 265 | if m.group(1).strip() not in expected_tags: 266 | line = script[:m.start(1) + 1].count('\n') + 1 267 | print(' On line {}'.format(line)) 268 | print(' {}'.format(m.group(1).strip())) 269 | errs = True 270 | print() 271 | 272 | if not errs: 273 | print('No markup errors found.') 274 | return True 275 | elif interactive and errs: 276 | print('Errors were found in the script markup. Do you want to continue? 
(Default is no.)') 277 | print() 278 | r = '' 279 | while r.lower() not in ('y', 'yes', 'n', 'no'): 280 | r = input('Enter y for yes or n for no: ') 281 | if not r.strip(): 282 | r = 'n' 283 | return r.lower() in ('y', 'yes') 284 | else: 285 | return False 286 | 287 | def validate_cmd(args): 288 | return validate_markup_script(args.script) 289 | 290 | def load_markup_script(filename, 291 | _line_rex=re.compile('LINE<<(?P[^>]*)>>'), 292 | _scene_rex=re.compile('SCENE_NUMBER<<(?P[^>]*)>>'), 293 | _char_rex=re.compile('CHARACTER_NAME<<(?P[^>]*)>>')): 294 | 295 | with open(filename, encoding='utf-8') as ip: 296 | spacy_model = get_spacy_model() 297 | 298 | current_scene = None 299 | current_scene_count = 0 300 | current_scene_error_fix = False 301 | current_char = None 302 | rows = [['LOWERCASE', 'SPACY_ORTH_ID', 'SCENE', 'CHARACTER']] 303 | for i, line in enumerate(ip): 304 | if _scene_rex.search(line): 305 | current_scene_count += 1 306 | scene_string = _scene_rex.search(line).group('scene') 307 | scene_string = ''.join(c for c in scene_string 308 | if c.isdigit()) 309 | try: 310 | scene_int = int(scene_string) 311 | current_scene = scene_int 312 | except ValueError: 313 | current_scene_error_fix = True 314 | print("Error in Scene markup: {}".format(line)) 315 | 316 | if current_scene_error_fix: 317 | current_scene = current_scene_count 318 | 319 | elif _char_rex.search(line): 320 | current_char = _char_rex.search(line).group('character') 321 | elif _line_rex.search(line): 322 | tokens = spacy_model(_line_rex.search(line).group('line')) 323 | tokens = [t for t in tokens if not t.is_space] 324 | for t in tokens: 325 | # original Spacy lexeme object can be recreated using 326 | # spacy.lexeme.Lexeme(get_spacy_model().vocab, t.orth) 327 | row = [t.lower_, t.lower, current_scene, current_char] 328 | rows.append(row) 329 | return rows 330 | 331 | def write_records(records, filename): 332 | with open(filename, 'w', encoding='utf-8') as out: 333 | wr = csv.writer(out) 334 | wr.writerows(records) 335 | 336 | def analyze(args, 337 | window_size=6, 338 | number_of_hashes=15, # Bigger -> slower (linear), more matches 339 | hash_dimensions=14, # Bigger -> faster (???), fewer matches 340 | distance_threshold=0.1, 341 | chunk_size=500 342 | ): 343 | fan_work_directory = args.fan_works 344 | original_script_markup = args.script 345 | subsample_start = 0 if args.skip_works < 0 else args.skip_works 346 | subsample_end = (None if args.num_works < 0 else 347 | args.num_works + subsample_start) 348 | 349 | fan_works = os.listdir(fan_work_directory) 350 | fan_works = [os.path.join(fan_work_directory, f) 351 | for f in fan_works] 352 | 353 | # This will always generate the same "random" sample. 354 | random.seed(4815162342) 355 | random.shuffle(fan_works) 356 | 357 | # Optionally skip ahead in the list or stop early. 
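# For example, running with `--skip-works 1000 --num-works 500` searches
# items 1000-1499 of the shuffled list, so separate runs can cover
# disjoint subsamples.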
358 | fan_works = fan_works[subsample_start:subsample_end] 359 | 360 | start = 0 361 | fan_clusters = [fan_works[i:i + chunk_size] 362 | for i in range(start, len(fan_works), chunk_size)] 363 | 364 | filename_base = 'match-{}gram{{}}'.format(window_size) 365 | batch_filename = filename_base.format('-batch-{}.csv') 366 | 367 | accumulated_records = [new_record_structure['fields']] 368 | ann_index = AnnIndexSearch(original_script_markup, 369 | window_size, 370 | number_of_hashes, 371 | hash_dimensions, 372 | distance_threshold) 373 | 374 | for i, fan_cluster in enumerate(fan_clusters, start=start): 375 | print('Processing cluster {} ({}-{})'.format(i, 376 | chunk_size * i, 377 | chunk_size * (i + 1))) 378 | 379 | global _ANN_INDEX 380 | _ANN_INDEX = ann_index 381 | with multiprocessing.Pool(processes=4, maxtasksperchild=10) as pool: 382 | record_sets = pool.map( 383 | multi_search_wrapper, 384 | fan_cluster, 385 | chunksize=chunk_size // (4 * pool._processes)) 386 | records = [r for r_set in record_sets for r in r_set] 387 | write_records(records, batch_filename.format(i)) 388 | accumulated_records.extend(records) 389 | 390 | i = 0 391 | today_str = '-{:%Y%m%d}.csv'.format(datetime.date.today()) 392 | name_check = filename_base.format(today_str) 393 | while os.path.exists(name_check): 394 | i += 1 395 | today_str = '-{:%Y%m%d}-{}.csv'.format(datetime.date.today(), i) 396 | name_check = filename_base.format(today_str) 397 | 398 | write_records(accumulated_records, 399 | name_check) 400 | -------------------------------------------------------------------------------- /vis.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import math 3 | import pandas as pd 4 | import numpy 5 | from scipy.stats import gmean 6 | from numpy import mean 7 | 8 | from bokeh.plotting import figure 9 | from bokeh.io import curdoc, output_file, save 10 | from bokeh.resources import CDN 11 | from bokeh.embed import file_html, components 12 | from bokeh.layouts import row, column 13 | from bokeh.models import HoverTool, CustomJS, ColumnDataSource, FactorRange, Panel, Tabs 14 | from bokeh.models.widgets import RadioButtonGroup, CheckboxButtonGroup, Select 15 | from bokeh.transform import factor_cmap 16 | from bokeh.palettes import Spectral6 17 | from bokeh.events import ButtonClick 18 | 19 | _FIELDS = ['Frequency of Reuse (Exact Matches)', 20 | 'Frequency of Reuse (0-0.1)', 21 | 'Frequency of Reuse (0-0.25)', 22 | 'None', 23 | 'ANGER', 24 | 'ANTICIPATION', 25 | 'DISGUST', 26 | 'FEAR', 27 | 'JOY', 28 | 'SADNESS', 29 | 'SURPRISE', 30 | 'TRUST', 31 | 'NEGATIVE', 32 | 'POSITIVE',] 33 | # 'None'] 34 | 35 | _AGG_FUNCS = [lambda x: gmean(x + 1) - 1] * 3 36 | _AGG_FUNCS += [mean] * 11 37 | 38 | # Possibly dead code now. TODO: Check and if so, remove. 
39 | def parse_args(): 40 | parser = argparse.ArgumentParser() 41 | parser.add_argument('-s', '--static', action='store_true', 42 | default=False, 43 | help="save a full html file") 44 | # parser.add_argument('-o', '--output', action='store', 45 | # default='reuse.html', 46 | # help="output filename") 47 | 48 | args = parser.parse_args() 49 | args.words_per_chunk = 140 50 | args.data_path = 'fandom-data.csv' 51 | title = 'Average Quantity of Text Reuse by {}-word Section' 52 | title = title.format(args.words_per_chunk) 53 | args.title = title 54 | args.out_filename = 'star-wars-reuse.html' 55 | return args 56 | 57 | def unnan(val): 58 | # Pandas annoyingly converts the string 'nan' into a floating 59 | # point nan value, even in an all-string column. 60 | if isinstance(val, float) and math.isnan(val): 61 | return 'nan' 62 | else: 63 | return val 64 | 65 | def word_formatter(names=None): 66 | if names is None: 67 | names = [] 68 | 69 | punctuation = [',', '.', '!', '?', '\'', '"', ':', '-', '--'] 70 | endpunctuation = ['.', '!', '?', '"', '...', '....', '--'] 71 | contractions = ['\'ve', '\'m', '\'ll', '\'re', '\'s', '\'t', 'n\'t', 'na'] 72 | capitals = ['i'] 73 | 74 | def span(content, highlight=None): 75 | if highlight is None: 76 | return '{}'.format(content) 77 | else: 78 | style = 'background-color: rgba(16, 96, 255, {:04.3f})'.format(highlight) 79 | return '{}'.format(style, content) 80 | 81 | def format_word(word, prev_word, character, new_char, new_scene, highlight=None): 82 | character = unnan(character).upper() 83 | word = unnan(word) 84 | 85 | parts = [] 86 | if new_scene: 87 | parts.append(span('-- next scene--
')) 88 | 89 | if new_char: 90 | parts.append('\n') 91 | parts.append(span(' ' + character.upper() + ': ')) 92 | 93 | if word in punctuation or word in contractions: 94 | # no space before punctuation 95 | parts.append(span(word, highlight)) 96 | elif not prev_word or prev_word in endpunctuation: 97 | # capitalize first word of sentence 98 | parts.append(span(' ' + word.capitalize(), highlight)) 99 | elif word in capitals: 100 | # format things like 'i' 101 | parts.append(span(' ' + word.upper(), highlight)) 102 | elif word.capitalize() in names: 103 | # format names 104 | parts.append(span(' ' + word.capitalize(), highlight)) 105 | else: 106 | # all other words 107 | parts.append(span(' ' + word, highlight)) 108 | return ''.join(parts) 109 | return format_word 110 | 111 | def chart_cols(fandom_data, words_per_chunk): 112 | words = fandom_data['LOWERCASE'].tolist() 113 | prevwords = [None] + words[:-1] 114 | chars = fandom_data['CHARACTER'].tolist() 115 | newchar = fandom_data['CHARACTER'][:-1].values != fandom_data['CHARACTER'][1:].values 116 | newchar = [True] + list(newchar) 117 | newscene = fandom_data['SCENE'].values 118 | newscene[numpy.isnan(newscene)] = 0 119 | newscene = fandom_data['SCENE'][:-1].values != fandom_data['SCENE'][1:].values 120 | newscene = [False] + list(newscene) 121 | 122 | 123 | highlights = fandom_data['Frequency of Reuse (Exact Matches)'].tolist() 124 | chunks = (fandom_data.index // words_per_chunk).tolist() 125 | chunkmax = {} 126 | global_max = max(highlights) 127 | for h, c in zip(highlights, chunks): 128 | if c not in chunkmax or chunkmax[c] < h: 129 | #chunkmax[c] = h 130 | chunkmax[c] = global_max 131 | # highlights = [math.log(1 + h, 1.25) / (1.6 * math.log(1 + chunkmax[c], 1.25)) if chunkmax[c] > 0 else 0 132 | # for h, c in zip(highlights, chunks)] 133 | highlights = [(1 + h) ** 0.33 / (1.6 * (1 + chunkmax[c]) ** 0.33) if chunkmax[c] > 0 else 0 134 | for h, c in zip(highlights, chunks)] 135 | 136 | wform = word_formatter() 137 | spans = list(map(wform, words, prevwords, chars, newchar, newscene, highlights)) 138 | 139 | fandom_data = fandom_data.assign( 140 | **{'None': fandom_data[_FIELDS[0]].values * 0} 141 | ) 142 | 143 | # fandom_data = fandom_data.assign( 144 | # **{'None': fandom_data[_FIELDS[0]].values * 0} 145 | # ) 146 | 147 | character_cols = [x for x in fandom_data.columns if x.startswith("CHARACTER_")] 148 | chart_cols = fandom_data[_FIELDS + character_cols] 149 | chart_cols = chart_cols.assign(chunk=chunks) 150 | chart_cols = chart_cols.assign(span=spans) 151 | 152 | return chart_cols 153 | 154 | def join_wrap(seq): 155 | lines = [] 156 | line = [] 157 | last_br = 0 158 | for span in seq: 159 | if '\n' in span or last_br > 7 and '> ' in span: 160 | # Convert newlines to div breaks. Also insert breaks 161 | # whenever we've seen 7 words and there is some 162 | # leading whitespace in the current span. 163 | lines.append(''.join(line)) 164 | line = [] 165 | last_br = 0 166 | else: 167 | last_br += 1 168 | 169 | line.append(span) 170 | 171 | tail = ''.join(line) 172 | if tail.strip(): 173 | lines.append(tail) 174 | 175 | return '\n'.join('
<div>{}</div>
'.format(l) for l in lines) 176 | 177 | def chart_pivot(chart_cols): 178 | character_cols = [x for x in chart_cols.columns if x.startswith("CHARACTER_")] 179 | fields = _FIELDS + character_cols + ['span'] 180 | aggfuncs = _AGG_FUNCS + [mean] * len(character_cols) + [join_wrap] 181 | table = pd.pivot_table( 182 | chart_cols, 183 | values=fields, 184 | index=chart_cols.chunk, 185 | aggfunc=dict(zip(fields, aggfuncs)) 186 | ) 187 | # apparently when you create a pandas pivot table, it will automatically 188 | # sort your columns alphabetically (which is dumb). This is their work 189 | # around, where you literally give the table the fields you already gave 190 | # them, so that they "reindex" it. 191 | return table.reindex(fields, axis=1) 192 | 193 | 194 | def build_bar_plot(data_path, words_per_chunk, title='Reuse'): 195 | #Read in from csv 196 | flat_data = pd.read_csv(data_path) 197 | flat_data = chart_cols(flat_data, words_per_chunk) 198 | flat_data = chart_pivot(flat_data) 199 | 200 | # Scale so that both maxima have the same height 201 | reuse_y = flat_data['Frequency of Reuse (Exact Matches)'] 202 | emo_y = flat_data['None'] 203 | reuse_max = reuse_y.values.max() 204 | emo_max = emo_y.values.max() 205 | 206 | #Make ratio work 207 | ratio_denom = min(reuse_max, emo_max) 208 | ratio_num = max(reuse_max, emo_max) 209 | ratio = ratio_num / ratio_denom if ratio_denom > 0 else 1 210 | 211 | to_scale = reuse_y if reuse_max < emo_max else emo_y 212 | to_scale *= ratio 213 | 214 | # Create data columns 215 | grouped_x = [(str(x), key) 216 | for x in flat_data.index 217 | for key in ('Reuse', 'Emotion')] 218 | y = [re for re_pair in zip(reuse_y, emo_y) for re in re_pair] 219 | span = zip(flat_data.span, flat_data.span) 220 | span = [s for s_pair in span for s in s_pair] 221 | 222 | flat_data_source = ColumnDataSource(flat_data) 223 | source = ColumnDataSource(dict(x=grouped_x, 224 | y=y, 225 | span=span)) 226 | 227 | plot = figure(x_range=FactorRange(*grouped_x), 228 | plot_width=800, plot_height=600, 229 | title=title, tools="hover") 230 | 231 | # Turn off ticks, major labels, and x grid lines, etc. 232 | # Axis settings: 233 | plot.xaxis.major_label_text_font_size = '0pt' 234 | plot.xaxis.major_tick_line_color = None 235 | plot.xaxis.minor_tick_line_color = None 236 | 237 | # CategoricalAxis settings: 238 | plot.xaxis.group_text_font_size = '0pt' 239 | plot.xaxis.separator_line_color = None 240 | 241 | # Grid settings: 242 | plot.xgrid.grid_line_color = None 243 | plot.ygrid.minor_grid_line_color = 'black' 244 | plot.ygrid.minor_grid_line_alpha = 0.03 245 | 246 | hover = plot.select(dict(type=HoverTool)) 247 | hover.tooltips = "
@span{safe}
" 248 | plot.vbar(x='x', 249 | width=1.0, 250 | bottom=0, 251 | source=source, 252 | top='y', 253 | line_color='white', 254 | fill_color=factor_cmap('x', palette=Spectral6, 255 | factors=['Reuse', 'Emotion'], 256 | start=1, end=2)) 257 | 258 | 259 | reuse_button_group = RadioButtonGroup( 260 | labels= [_FIELDS[0]] + ["Frequency of Reuse (Fuzzy Matches)"], button_type='primary', 261 | active=0 262 | ) 263 | 264 | emotion_button_group = RadioButtonGroup( 265 | labels=_FIELDS[3:], button_type = "success", 266 | active=0 267 | ) 268 | 269 | callback = CustomJS( 270 | args=dict( 271 | source=source, 272 | flat_data_source=flat_data_source, 273 | reuse_button_group=reuse_button_group, 274 | emotion_button_group=emotion_button_group 275 | ), 276 | code=""" 277 | var reuse = reuse_button_group.labels[reuse_button_group.active]; 278 | if (reuse == "Frequency of Reuse (Fuzzy Matches)") { 279 | reuse = "Frequency of Reuse (0-0.25)"; 280 | } 281 | var emo = emotion_button_group.labels[emotion_button_group.active]; 282 | var reuse_data = flat_data_source.data[reuse].slice(); // Copy 283 | var emo_data = flat_data_source.data[emo].slice(); // Copy 284 | var reuse_max = Math.max.apply(Math, reuse_data); 285 | var emo_max = Math.max.apply(Math, emo_data); 286 | 287 | var ratio = 0; 288 | var to_scale = null; 289 | if (emo_max > reuse_max) { 290 | to_scale = reuse_data; 291 | ratio = emo_max / reuse_max; 292 | } else { 293 | to_scale = emo_data; 294 | if (emo_max > 0) { 295 | ratio = reuse_max / emo_max; 296 | } else { 297 | ratio = 1; 298 | } 299 | } 300 | for (var i = 0; i < to_scale.length; i++) { 301 | to_scale[i] *= ratio; 302 | } 303 | 304 | var x = source.data['x']; 305 | var y = source.data['y']; 306 | for (var i = 0; i < x.length; i++) { 307 | if (i % 2 === 0) { 308 | // This is a reuse bar 309 | y[i] = reuse_data[i / 2]; 310 | } else { 311 | // This is an emotion bar 312 | y[i] = emo_data[(i - 1) / 2]; 313 | } 314 | } 315 | source.change.emit(); 316 | """ 317 | ) 318 | reuse_button_group.js_on_change('active', callback) 319 | emotion_button_group.js_on_change('active', callback) 320 | 321 | layout = column(reuse_button_group, emotion_button_group, plot) 322 | tab1 = Panel(child=layout, title='Bar') 323 | return tab1 324 | 325 | def build_line_plot(data_path, words_per_chunk, title='Reuse'): 326 | #Read in from csv 327 | flat_data = pd.read_csv(data_path) 328 | flat_data = chart_cols(flat_data, words_per_chunk) 329 | flat_data = chart_pivot(flat_data) 330 | 331 | # Scale so that both maxima have the same height 332 | reuse_y = flat_data['Frequency of Reuse (Exact Matches)'] 333 | emo_y = flat_data['None'] 334 | char_y = flat_data['None'] 335 | reuse_max = reuse_y.values.max() 336 | emo_max = emo_y.values.max() 337 | char_max = char_y.values.max() 338 | 339 | #Make ratio work 340 | ratio_denom = min(char_max, min(reuse_max, emo_max)) 341 | ratio_num = max(char_max, max(reuse_max, emo_max)) 342 | ratio = ratio_num / ratio_denom if ratio_denom > 0 else 1 343 | if reuse_max < emo_max and reuse_max < char_max: 344 | to_scale = reuse_y 345 | elif emo_max < char_max and emo_max < reuse_max: 346 | to_scale = emo_y 347 | else: 348 | to_scale = char_y 349 | to_scale *= ratio 350 | 351 | # Create data columns 352 | x = [str(i) for i in flat_data.index] 353 | reuse_y=reuse_y 354 | reuse_zero = len(reuse_y) * [0] 355 | span = flat_data.span 356 | flat_data_source = ColumnDataSource(flat_data) 357 | source = ColumnDataSource(dict(x=x, 358 | reuse_y=reuse_y, 359 | emo_y=emo_y, 360 | char_y=char_y, 361 | 
reuse_zero=reuse_zero, 362 | span=span)) 363 | 364 | plot = figure(x_range=FactorRange(*x), 365 | plot_width=800, plot_height=600, 366 | title=title, tools="hover") 367 | 368 | # Turn off ticks, major labels, and x grid lines, etc. 369 | # Axis settings: 370 | plot.xaxis.major_label_text_font_size = '0pt' 371 | plot.xaxis.major_tick_line_color = None 372 | plot.xaxis.minor_tick_line_color = None 373 | 374 | # CategoricalAxis settings: 375 | plot.xaxis.group_text_font_size = '0pt' 376 | plot.xaxis.separator_line_color = None 377 | 378 | # Grid settings: 379 | plot.xgrid.grid_line_color = None 380 | plot.ygrid.minor_grid_line_color = 'black' 381 | plot.ygrid.minor_grid_line_alpha = 0.03 382 | 383 | hover = plot.select(dict(type=HoverTool)) 384 | hover.tooltips = "
@span{safe}
" 385 | 386 | plot.varea(x='x', source = source, y1 = 'reuse_y', y2 = 'reuse_zero', fill_color = Spectral6[0], fill_alpha = 0.6) 387 | plot.line(x='x', source = source, y = 'reuse_y', line_color = Spectral6[0], line_alpha = 0.0) 388 | plot.line(x='x', line_width=2.0, source=source, y='emo_y', line_color = Spectral6[1]) 389 | plot.line(x='x', line_width=2.0, source=source, y='char_y', line_color = 'red') 390 | 391 | 392 | reuse_button_group = RadioButtonGroup( 393 | labels=[_FIELDS[0]] + ["Frequency of Reuse (Fuzzy Matches)"], 394 | button_type='primary', 395 | active=0 396 | ) 397 | 398 | emotion_button_group = RadioButtonGroup( 399 | labels=_FIELDS[3:], 400 | button_type='success', 401 | active=0 402 | ) 403 | 404 | char_button_group = RadioButtonGroup( 405 | labels= ['None'] + [x.replace("CHARACTER_", "") for x in flat_data.columns if x.startswith("CHARACTER_")], 406 | button_type='danger', 407 | active=0 408 | ) 409 | 410 | callback_code=""" 411 | var reuse = reuse_button_group.labels[reuse_button_group.active]; 412 | if (reuse == "Frequency of Reuse (Fuzzy Matches)") { 413 | reuse = "Frequency of Reuse (0-0.25)"; 414 | } 415 | var emo = emotion_button_group.labels[emotion_button_group.active]; 416 | var char = char_button_group.labels[char_button_group.active]; 417 | var reuse_data = flat_data_source.data[reuse].slice(); // Copy 418 | var emo_data = flat_data_source.data[emo].slice(); // Copy 419 | if (char == "None") { 420 | var char_data = flat_data_source.data["None"].slice(); 421 | } else { 422 | var char_data = flat_data_source.data["CHARACTER_" + char].slice(); 423 | } // Copy 424 | var reuse_max = Math.max.apply(Math, reuse_data); 425 | var emo_max = Math.max.apply(Math, emo_data); 426 | var char_max = Math.max.apply(Math, char_data); 427 | 428 | var ratio = 0; 429 | var to_scale = null; 430 | var to_scale_other = null; 431 | 432 | if (emo_max > reuse_max && emo_max > char_max) { 433 | to_scale = reuse_data; 434 | to_scale_also = char_data; 435 | ratio_one = emo_max / reuse_max; 436 | ratio_two = emo_max / char_max; 437 | } else if (char_max > emo_max && char_max > reuse_max) { 438 | to_scale = reuse_data; 439 | to_scale_also = emo_data; 440 | ratio_one = char_max / reuse_max; 441 | ratio_two = char_max / emo_max; 442 | } else { 443 | to_scale = emo_data; 444 | to_scale_also = char_data; 445 | ratio_one = reuse_max / emo_max; 446 | ratio_two = reuse_max / char_max; 447 | } 448 | 449 | for (var i = 0; i < to_scale.length; i++) { 450 | to_scale[i] *= ratio_one; 451 | to_scale_also[i] *= ratio_two; 452 | } 453 | 454 | var x = source.data['x']; 455 | var reuse_y = source.data['reuse_y']; 456 | var emo_y = source.data['emo_y']; 457 | var char_y = source.data['char_y'] 458 | for (var i = 0; i < x.length; i++) { 459 | reuse_y[i] = reuse_data[i]; 460 | emo_y[i] = emo_data[i]; 461 | char_y[i] = char_data[i]; 462 | } 463 | 464 | source.change.emit(); 465 | if (char_button_group.active == 0 || emotion_button_group.active == 0) { 466 | return; 467 | } 468 | 469 | if (other_button_group) { 470 | other_button_group.active = 0; 471 | } 472 | 473 | """ 474 | 475 | reuse_callback = CustomJS( 476 | args=dict( 477 | source=source, 478 | flat_data_source=flat_data_source, 479 | reuse_button_group=reuse_button_group, 480 | emotion_button_group=emotion_button_group, 481 | char_button_group=char_button_group, 482 | other_button_group=None 483 | ), code = callback_code) 484 | 485 | 486 | emo_callback = CustomJS( 487 | args=dict( 488 | source=source, 489 | flat_data_source=flat_data_source, 490 | 
reuse_button_group=reuse_button_group, 491 | emotion_button_group=emotion_button_group, 492 | char_button_group=char_button_group, 493 | other_button_group=char_button_group 494 | ), code = callback_code) 495 | 496 | char_callback = CustomJS( 497 | args=dict( 498 | source=source, 499 | flat_data_source=flat_data_source, 500 | reuse_button_group=reuse_button_group, 501 | emotion_button_group=emotion_button_group, 502 | char_button_group=char_button_group, 503 | other_button_group=emotion_button_group 504 | ), code = callback_code) 505 | 506 | 507 | reuse_button_group.js_on_change('active', reuse_callback) 508 | emotion_button_group.js_on_change('active', emo_callback) 509 | char_button_group.js_on_change('active', char_callback) 510 | 511 | 512 | layout = column(reuse_button_group, emotion_button_group, char_button_group, plot) 513 | tab1 = Panel(child=layout, title='Line') 514 | return tab1 515 | 516 | 517 | def build_line_plot_compare(data_path, words_per_chunk, title='Degree of Reuse'): 518 | #Read in from csv 519 | flat_data = pd.read_csv(data_path) 520 | 521 | flat_data = chart_cols(flat_data, words_per_chunk) 522 | 523 | flat_data = chart_pivot(flat_data) 524 | 525 | 526 | # Scale so that both maxima have the same height 527 | reuse_y = flat_data['Frequency of Reuse (Exact Matches)'] 528 | emo_y = flat_data['None'] 529 | first_char_y = flat_data['None'] 530 | second_char_y = flat_data['None'] 531 | # reuse_max = reuse_y.values.max() 532 | # emo_max = emo_y.values.max() 533 | # char_max = char_y.values.max() 534 | # mult_char_max = mult_char_y.values.max() 535 | # 536 | # #Make ratio work 537 | # ratio_denom = min(mult_char_max, min(reuse_max, emo_max)) 538 | # ratio_num = max(mult_char_max, max(reuse_max, emo_max)) 539 | # ratio = ratio_num / ratio_denom if ratio_denom > 0 else 1 540 | # if reuse_max < emo_max and reuse_max < mult_char_max: 541 | # to_scale = reuse_y 542 | # elif emo_max < mult_char_max and emo_max < reuse_max: 543 | # to_scale = emo_y 544 | # else: 545 | # to_scale = mult_char_y 546 | # to_scale *= ratio 547 | 548 | # Create data columns 549 | x = [str(i) for i in flat_data.index] 550 | reuse_y=reuse_y 551 | reuse_zero = len(reuse_y) * [0] 552 | span = flat_data.span 553 | flat_data_source = ColumnDataSource(flat_data) 554 | source = ColumnDataSource(dict(x=x, 555 | reuse_y=reuse_y, 556 | emo_y=emo_y, 557 | first_char_y=first_char_y, 558 | second_char_y=second_char_y, 559 | reuse_zero=reuse_zero, 560 | span=span)) 561 | 562 | plot = figure(x_range=FactorRange(*x), 563 | plot_width=800, plot_height=600, 564 | title=title, tools="hover") 565 | 566 | # Turn off ticks, major labels, and x grid lines, etc. 567 | # Axis settings: 568 | plot.xaxis.major_label_text_font_size = '0pt' 569 | plot.xaxis.major_tick_line_color = None 570 | plot.xaxis.minor_tick_line_color = None 571 | 572 | plot.yaxis.major_label_text_font_size = '0pt' 573 | plot.yaxis.major_tick_line_color = None 574 | plot.yaxis.minor_tick_line_color = None 575 | 576 | # CategoricalAxis settings: 577 | plot.xaxis.group_text_font_size = '0pt' 578 | plot.xaxis.separator_line_color = None 579 | 580 | # Grid settings: 581 | plot.xgrid.grid_line_color = None 582 | # plot.ygrid.minor_grid_line_color = 'black' 583 | # plot.ygrid.minor_grid_line_alpha = 0.03 584 | plot.xaxis.axis_label = 'Beginning of Script    ←            →   End of Script' 585 | plot.yaxis.axis_label = 'Low Reuse             Medium Reuse             High Reuse' 586 | 587 | hover = plot.select(dict(type=HoverTool)) 588 | hover.tooltips = "
@span{safe}
" 589 | 590 | plot.varea(x='x', source = source, y1 = 'reuse_y', y2 = 'reuse_zero', fill_color = Spectral6[0], fill_alpha = 0.6) 591 | plot.line(x='x', source = source, y = 'reuse_y', line_color = Spectral6[0], line_alpha = 0.0) 592 | plot.line(x='x', line_width=2.0, source=source, y='emo_y', line_color = 'red') 593 | plot.line(x='x', line_width=2.0, source=source, y='first_char_y', line_color = Spectral6[1]) 594 | plot.line(x='x', line_width=2.0, source=source, y='second_char_y', line_color = '#F0AD4E') 595 | 596 | 597 | reuse_button_group = RadioButtonGroup( 598 | labels=[_FIELDS[0]] + ["Frequency of Reuse (Fuzzy Matches)"], 599 | button_type='primary', 600 | active=0 601 | ) 602 | 603 | emotion_button_group = CheckboxButtonGroup( 604 | labels= _FIELDS[4:], 605 | button_type='danger', 606 | active=[], 607 | ) 608 | 609 | first_char_button_group = CheckboxButtonGroup( 610 | labels= [x.replace("CHARACTER_", "") for x in flat_data.columns if x.startswith("CHARACTER_")], 611 | button_type='success', 612 | active=[] 613 | ) 614 | 615 | second_char_button_group = CheckboxButtonGroup( 616 | labels= [x.replace("CHARACTER_", "") for x in flat_data.columns if x.startswith("CHARACTER_")], 617 | button_type='warning', 618 | active=[] 619 | ) 620 | 621 | 622 | callback_code=""" 623 | 624 | var reuse = reuse_button_group.labels[reuse_button_group.active]; 625 | if (reuse == "Frequency of Reuse (Fuzzy Matches)") { 626 | reuse = "Frequency of Reuse (0-0.25)"; 627 | } 628 | var second_char = []; 629 | for (i = 0; i < second_char_button_group.active.length; i++) { 630 | second_char.push(second_char_button_group.labels[second_char_button_group.active[i]]); 631 | } 632 | 633 | var first_char = []; 634 | for (i = 0; i < first_char_button_group.active.length; i++) { 635 | first_char.push(first_char_button_group.labels[first_char_button_group.active[i]]); 636 | } 637 | 638 | var emo_arr = []; 639 | for (i = 0; i < emotion_button_group.active.length; i++) { 640 | emo_arr.push(emotion_button_group.labels[emotion_button_group.active[i]]); 641 | } 642 | 643 | var reuse_data = flat_data_source.data[reuse].slice(); // Copy 644 | 645 | 646 | 647 | if (second_char_button_group.active.length > 1) { 648 | if (second_char_button_group.active[0] == window.secondprev) { 649 | second_char_button_group.active.splice(0,1); 650 | second_char.splice(0,1); 651 | window.secondprev = second_char_button_group.active[0]; 652 | } else { 653 | second_char_button_group.active.splice(1,1); 654 | second_char.splice(1,1); 655 | window.secondprev = second_char_button_group.active[0]; 656 | } 657 | } else { 658 | window.secondprev = second_char_button_group.active[0]; 659 | 660 | } 661 | 662 | if (first_char_button_group.active.length > 1) { 663 | if (first_char_button_group.active[0] == window.firstprev) { 664 | first_char_button_group.active.splice(0,1); 665 | first_char.splice(0,1); 666 | window.firstprev = first_char_button_group.active[0]; 667 | } else { 668 | first_char_button_group.active.splice(1,1); 669 | first_char.splice(1,1); 670 | window.firstprev = first_char_button_group.active[0]; 671 | } 672 | } else { 673 | window.firstprev = first_char_button_group.active[0]; 674 | 675 | } 676 | 677 | if (emotion_button_group.active.length > 1) { 678 | if (emotion_button_group.active[0] == window.thirdprev) { 679 | emotion_button_group.active.splice(0,1); 680 | emo_arr.splice(0,1); 681 | window.thirdprev = emotion_button_group.active[0]; 682 | } else { 683 | emotion_button_group.active.splice(1,1); 684 | emo_arr.splice(1,1); 685 | 
window.thirdprev = emotion_button_group.active[0]; 686 | } 687 | } else { 688 | window.thirdprev = emotion_button_group.active[0]; 689 | 690 | } 691 | 692 | 693 | //for (i = 0; i < mult_char.length; i++) { 694 | // if (mult_char[i] == "Clear") { 695 | // listOfLists.push(flat_data_source.data["None"].slice()); 696 | // } else { 697 | // listOfLists.push(flat_data_source.data["CHARACTER_" + mult_char[i]].slice()); 698 | // } 699 | //} 700 | // 701 | //for (i = 0; i < char.length; i++) { 702 | // if (char[i] == "Clear") { 703 | // charListOfLists.push(flat_data_source.data["None"].slice()); 704 | // } else { 705 | // charListOfLists.push(flat_data_source.data["CHARACTER_" + char[i]].slice()); 706 | // } 707 | //} 708 | // 709 | //var emoListOfLists = []; 710 | //for (i = 0; i < emo_arr.length; i++) { 711 | // if (emo_arr[i] == "Clear") { 712 | // emoListOfLists.push(flat_data_source.data["None"].slice()); 713 | // } else { 714 | // emoListOfLists.push(flat_data_source.data[emo_arr[i]].slice()); 715 | // } 716 | //} 717 | 718 | function fill(a, b) { 719 | for (i = 0; i < b.length; i++) { 720 | a.push(flat_data_source.data[b[i]].slice()); 721 | } 722 | return a; 723 | } 724 | 725 | function fillChar(a, b) { 726 | for (i = 0; i < b.length; i++) { 727 | a.push(flat_data_source.data["CHARACTER_" + b[i]].slice()); 728 | } 729 | return a; 730 | } 731 | 732 | 733 | 734 | 735 | function zip(a) { 736 | if (a.length == 0) { 737 | return []; 738 | } 739 | var output = []; 740 | var length = a[0].length; 741 | for (i = 0; i < length; i++) { 742 | var newRow = []; 743 | for (j = 0; j < a.length; j++) { 744 | newRow.push(a[j][i]); 745 | } 746 | output.push(newRow); 747 | } 748 | return output; 749 | } 750 | 751 | function gMean(a) { 752 | var starter = 1; 753 | for (i = 0; i < a.length; i++) { 754 | starter = starter * a[i]; 755 | } 756 | if (starter == 0) { 757 | return 0; 758 | } else { 759 | return Math.pow(starter, 1/a.length); 760 | } 761 | } 762 | 763 | var firstCharListOfLists = []; 764 | var secondCharListOfLists = []; 765 | var emoListOfLists = []; 766 | 767 | var first_char_data = zip(fillChar(firstCharListOfLists, first_char)).map(gMean); 768 | var emo_data = zip(fill(emoListOfLists, emo_arr)).map(gMean); 769 | var second_char_data = zip(fillChar(secondCharListOfLists, second_char)).map(gMean); 770 | 771 | var reuse_max = Math.max.apply(Math, reuse_data); 772 | var emo_max = Math.max.apply(Math, emo_data); 773 | var first_char_max = Math.max.apply(Math, first_char_data); 774 | var second_char_max = Math.max.apply(Math, second_char_data); 775 | 776 | for (var i = 0; i < second_char_data.length; i++) { 777 | second_char_data[i] /= second_char_max; 778 | } 779 | 780 | for (var i = 0; i < first_char_data.length; i++) { 781 | first_char_data[i] /= first_char_max; 782 | } 783 | 784 | for (var i = 0; i < emo_data.length; i++) { 785 | emo_data[i] /= emo_max; 786 | } 787 | 788 | for (var i = 0; i < reuse_data.length; i++) { 789 | reuse_data[i] /= reuse_max; 790 | } 791 | 792 | var x = source.data['x']; 793 | var reuse_y = source.data['reuse_y']; 794 | var emo_y = source.data['emo_y']; 795 | var first_char_y = source.data['first_char_y']; 796 | var second_char_y = source.data['second_char_y'] 797 | for (var i = 0; i < x.length; i++) { 798 | reuse_y[i] = reuse_data[i]; 799 | emo_y[i] = emo_data[i]; 800 | first_char_y[i] = first_char_data[i]; 801 | second_char_y[i] = second_char_data[i]; 802 | } 803 | 804 | source.change.emit(); 805 | 806 | """ 807 | 808 | reuse_callback = CustomJS( 809 | args=dict( 810 | 
source=source, 811 | flat_data_source=flat_data_source, 812 | reuse_button_group=reuse_button_group, 813 | emotion_button_group=emotion_button_group, 814 | first_char_button_group=first_char_button_group, 815 | second_char_button_group=second_char_button_group, 816 | other_button_group=None 817 | ), code = callback_code) 818 | 819 | 820 | emo_callback = CustomJS( 821 | args=dict( 822 | source=source, 823 | flat_data_source=flat_data_source, 824 | reuse_button_group=reuse_button_group, 825 | emotion_button_group=emotion_button_group, 826 | first_char_button_group=first_char_button_group, 827 | second_char_button_group=second_char_button_group, 828 | other_button_group=first_char_button_group 829 | ), code = callback_code) 830 | 831 | char_callback = CustomJS( 832 | args=dict( 833 | source=source, 834 | flat_data_source=flat_data_source, 835 | reuse_button_group=reuse_button_group, 836 | emotion_button_group=emotion_button_group, 837 | first_char_button_group=first_char_button_group, 838 | second_char_button_group=second_char_button_group, 839 | other_button_group=emotion_button_group 840 | ), code = callback_code) 841 | 842 | mult_char_callback = CustomJS( 843 | args=dict( 844 | source=source, 845 | flat_data_source=flat_data_source, 846 | reuse_button_group=reuse_button_group, 847 | emotion_button_group=emotion_button_group, 848 | first_char_button_group=first_char_button_group, 849 | second_char_button_group=second_char_button_group, 850 | other_button_group=emotion_button_group 851 | ), code = callback_code) 852 | 853 | 854 | reuse_button_group.js_on_change('active', reuse_callback) 855 | emotion_button_group.js_on_change('active', emo_callback) 856 | first_char_button_group.js_on_change('active', char_callback) 857 | second_char_button_group.js_on_change('active', mult_char_callback) 858 | 859 | 860 | layout = column(reuse_button_group, first_char_button_group, second_char_button_group, emotion_button_group, plot) 861 | tab1 = Panel(child=layout, title='Both') 862 | #return tab1 863 | return layout 864 | 865 | 866 | def build_line_plot_affect(data_path, words_per_chunk, title='Degree of Reuse'): 867 | #Read in from csv 868 | flat_data = pd.read_csv(data_path) 869 | 870 | flat_data = chart_cols(flat_data, words_per_chunk) 871 | 872 | flat_data = chart_pivot(flat_data) 873 | 874 | 875 | # Scale so that both maxima have the same height 876 | reuse_y = flat_data['Frequency of Reuse (Exact Matches)'] 877 | emo_y = flat_data['None'] 878 | char_y = flat_data['None'] 879 | mult_char_y = flat_data['None'] 880 | reuse_max = reuse_y.values.max() 881 | emo_max = emo_y.values.max() 882 | char_max = char_y.values.max() 883 | mult_char_max = mult_char_y.values.max() 884 | 885 | #Make ratio work 886 | ratio_denom = min(mult_char_max, min(reuse_max, emo_max)) 887 | ratio_num = max(mult_char_max, max(reuse_max, emo_max)) 888 | ratio = ratio_num / ratio_denom if ratio_denom > 0 else 1 889 | if reuse_max < emo_max and reuse_max < mult_char_max: 890 | to_scale = reuse_y 891 | elif emo_max < mult_char_max and emo_max < reuse_max: 892 | to_scale = emo_y 893 | else: 894 | to_scale = mult_char_y 895 | to_scale *= ratio 896 | 897 | # Create data columns 898 | x = [str(i) for i in flat_data.index] 899 | reuse_y=reuse_y 900 | reuse_zero = len(reuse_y) * [0] 901 | span = flat_data.span 902 | flat_data_source = ColumnDataSource(flat_data) 903 | source = ColumnDataSource(dict(x=x, 904 | reuse_y=reuse_y, 905 | emo_y=emo_y, 906 | char_y=char_y, 907 | mult_char_y=mult_char_y, 908 | reuse_zero=reuse_zero, 909 | 
span=span,)) 910 | 911 | plot = figure(x_range=FactorRange(*x), 912 | plot_width=800, plot_height=600, 913 | title=title, tools="hover") 914 | 915 | # Turn off ticks, major labels, and x grid lines, etc. 916 | # Axis settings: 917 | plot.xaxis.major_label_text_font_size = '0pt' 918 | plot.xaxis.major_tick_line_color = None 919 | plot.xaxis.minor_tick_line_color = None 920 | 921 | plot.yaxis.major_label_text_font_size = '0pt' 922 | plot.yaxis.major_tick_line_color = None 923 | plot.yaxis.minor_tick_line_color = None 924 | 925 | # CategoricalAxis settings: 926 | plot.xaxis.group_text_font_size = '0pt' 927 | plot.xaxis.separator_line_color = None 928 | 929 | # Grid settings: 930 | plot.xgrid.grid_line_color = None 931 | # plot.ygrid.minor_grid_line_color = 'black' 932 | # plot.ygrid.minor_grid_line_alpha = 0.03 933 | plot.xaxis.axis_label = 'Beginning of Script    ←            →   End of Script' 934 | plot.yaxis.axis_label = 'Low Reuse             Medium Reuse             High Reuse' 935 | 936 | hover = plot.select(dict(type=HoverTool)) 937 | hover.tooltips = "
@span{safe}
" 938 | 939 | plot.varea(x='x', source = source, y1 = 'reuse_y', y2 = 'reuse_zero', fill_color = Spectral6[0], fill_alpha = 0.6) 940 | plot.line(x='x', source = source, y = 'reuse_y', line_color = Spectral6[0], line_alpha = 0.0) 941 | plot.line(x='x', line_width=2.0, source=source, y='emo_y', line_color = Spectral6[1]) 942 | plot.line(x='x', line_width=2.0, source=source, y='char_y', line_color = 'red') 943 | plot.line(x='x', line_width=2.0, source=source, y='mult_char_y', line_color = 'red') 944 | 945 | 946 | reuse_button_group = RadioButtonGroup( 947 | labels=[_FIELDS[0]] + ["Frequency of Reuse (Fuzzy Matches)"], 948 | button_type='primary', 949 | active=0 950 | ) 951 | 952 | emotion_button_group = CheckboxButtonGroup( 953 | labels= ['Clear'] + _FIELDS[4:], 954 | button_type='success', 955 | active=[], 956 | ) 957 | 958 | char_button_group = RadioButtonGroup( 959 | labels= ['None'] + [x.replace("CHARACTER_", "") for x in flat_data.columns if x.startswith("CHARACTER_")], 960 | button_type='danger', 961 | active=0 962 | ) 963 | 964 | mult_char_button_group = CheckboxButtonGroup( 965 | labels= ['Clear'] + [x.replace("CHARACTER_", "") for x in flat_data.columns if x.startswith("CHARACTER_")], 966 | button_type='danger', 967 | active=[] 968 | ) 969 | 970 | 971 | callback_code=""" 972 | var reuse = reuse_button_group.labels[reuse_button_group.active]; 973 | if (reuse == "Frequency of Reuse (Fuzzy Matches)") { 974 | reuse = "Frequency of Reuse (0-0.25)"; 975 | } 976 | //var emo = emotion_button_group.labels[emotion_button_group.active]; 977 | var char = char_button_group.labels[char_button_group.active]; 978 | var mult_char = []; 979 | for (i = 0; i < mult_char_button_group.active.length; i++) { 980 | mult_char.push(mult_char_button_group.labels[mult_char_button_group.active[i]]); 981 | } 982 | 983 | if (mult_char.includes("Clear")) { 984 | mult_char = []; 985 | mult_char_button_group.active = [] 986 | } 987 | 988 | var emo_arr = []; 989 | for (i = 0; i < emotion_button_group.active.length; i++) { 990 | emo_arr.push(emotion_button_group.labels[emotion_button_group.active[i]]); 991 | } 992 | 993 | console.log(emo_arr); 994 | if (emo_arr.includes("Clear")) { 995 | emo_arr = []; 996 | emotion_button_group.active = [] 997 | } 998 | 999 | 1000 | 1001 | var reuse_data = flat_data_source.data[reuse].slice(); // Copy 1002 | //var emo_data = flat_data_source.data[emo].slice(); // Copy 1003 | if (char == "None") { 1004 | var char_data = flat_data_source.data["None"].slice(); 1005 | } else { 1006 | var char_data = flat_data_source.data["CHARACTER_" + char].slice(); 1007 | } // Copy 1008 | 1009 | var multiplied = []; 1010 | var listOfLists = []; 1011 | var newList = [[]]; 1012 | for (i = 0; i < mult_char.length; i++) { 1013 | if (mult_char[i] == "Clear") { 1014 | listOfLists.push(flat_data_source.data["None"].slice()); 1015 | } else { 1016 | listOfLists.push(flat_data_source.data["CHARACTER_" + mult_char[i]].slice()); 1017 | } 1018 | } 1019 | 1020 | var emoListOfLists = []; 1021 | for (i = 0; i < emo_arr.length; i++) { 1022 | if (emo_arr[i] == "Clear") { 1023 | emoListOfLists.push(flat_data_source.data["None"].slice()); 1024 | } else { 1025 | emoListOfLists.push(flat_data_source.data[emo_arr[i]].slice()); 1026 | } 1027 | } 1028 | 1029 | 1030 | 1031 | function zip(a) { 1032 | if (a.length == 0) { 1033 | return []; 1034 | } 1035 | var output = []; 1036 | var length = a[0].length; 1037 | for (i = 0; i < length; i++) { 1038 | var newRow = []; 1039 | for (j = 0; j < a.length; j++) { 1040 | newRow.push(a[j][i]); 
1041 | } 1042 | output.push(newRow); 1043 | } 1044 | return output; 1045 | } 1046 | 1047 | function gMean(a) { 1048 | var starter = 1; 1049 | for (i = 0; i < a.length; i++) { 1050 | starter = starter * a[i]; 1051 | } 1052 | if (starter == 0) { 1053 | return 0; 1054 | } else { 1055 | return Math.pow(starter, 1/a.length); 1056 | } 1057 | } 1058 | 1059 | var mult_char_data = zip(listOfLists).map(gMean); 1060 | var emo_data = zip(emoListOfLists).map(gMean) 1061 | 1062 | var reuse_max = Math.max.apply(Math, reuse_data); 1063 | var emo_max = Math.max.apply(Math, emo_data); 1064 | var char_max = Math.max.apply(Math, char_data); 1065 | var mult_char_max = Math.max.apply(Math, mult_char_data); 1066 | 1067 | var ratio = 0; 1068 | var to_scale = null; 1069 | var to_scale_other = null; 1070 | 1071 | //if (emo_max > reuse_max && emo_max > mult_char_max) { 1072 | // to_scale = reuse_data; 1073 | // to_scale_also = mult_char_data; 1074 | // ratio_one = emo_max / reuse_max; 1075 | // ratio_two = emo_max / mult_char_max; 1076 | // 1077 | //} else if (mult_char_max > emo_max && mult_char_max > reuse_max) { 1078 | // to_scale = reuse_data; 1079 | // to_scale_also = emo_data; 1080 | // ratio_one = mult_char_max / reuse_max; 1081 | // ratio_two = mult_char_max / emo_max; 1082 | //} else if (reuse_max > emo_max && reuse_max > mult_char_max){ 1083 | // to_scale = emo_data; 1084 | // to_scale_also = mult_char_data; 1085 | // ratio_one = reuse_max / emo_max; 1086 | // ratio_two = reuse_max / mult_char_max; 1087 | //} 1088 | // 1089 | for (var i = 0; i < mult_char_data.length; i++) { 1090 | mult_char_data[i] /= mult_char_max; 1091 | 1092 | } 1093 | 1094 | for (var i = 0; i < emo_data.length; i++) { 1095 | emo_data[i] /= emo_max; 1096 | } 1097 | 1098 | for (var i = 0; i < reuse_data.length; i++) { 1099 | reuse_data[i] /= reuse_max; 1100 | } 1101 | 1102 | var x = source.data['x']; 1103 | var reuse_y = source.data['reuse_y']; 1104 | var emo_y = source.data['emo_y']; 1105 | var char_y = source.data['char_y']; 1106 | var mult_char_y = source.data['mult_char_y'] 1107 | for (var i = 0; i < x.length; i++) { 1108 | reuse_y[i] = reuse_data[i]; 1109 | emo_y[i] = emo_data[i]; 1110 | char_y[i] = char_data[i]; 1111 | mult_char_y[i] = mult_char_data[i]; 1112 | } 1113 | 1114 | console.log(source.data['prev']); 1115 | 1116 | source.change.emit(); 1117 | 1118 | """ 1119 | 1120 | reuse_callback = CustomJS( 1121 | args=dict( 1122 | source=source, 1123 | flat_data_source=flat_data_source, 1124 | reuse_button_group=reuse_button_group, 1125 | emotion_button_group=emotion_button_group, 1126 | char_button_group=char_button_group, 1127 | mult_char_button_group=mult_char_button_group, 1128 | other_button_group=None 1129 | ), code = callback_code) 1130 | 1131 | 1132 | emo_callback = CustomJS( 1133 | args=dict( 1134 | source=source, 1135 | flat_data_source=flat_data_source, 1136 | reuse_button_group=reuse_button_group, 1137 | emotion_button_group=emotion_button_group, 1138 | char_button_group=char_button_group, 1139 | mult_char_button_group=mult_char_button_group, 1140 | other_button_group=char_button_group 1141 | ), code = callback_code) 1142 | 1143 | char_callback = CustomJS( 1144 | args=dict( 1145 | source=source, 1146 | flat_data_source=flat_data_source, 1147 | reuse_button_group=reuse_button_group, 1148 | emotion_button_group=emotion_button_group, 1149 | char_button_group=char_button_group, 1150 | mult_char_button_group=mult_char_button_group, 1151 | other_button_group=emotion_button_group 1152 | ), code = callback_code) 1153 | 1154 | 
mult_char_callback = CustomJS( 1155 | args=dict( 1156 | source=source, 1157 | flat_data_source=flat_data_source, 1158 | reuse_button_group=reuse_button_group, 1159 | emotion_button_group=emotion_button_group, 1160 | char_button_group=char_button_group, 1161 | mult_char_button_group=mult_char_button_group, 1162 | other_button_group=emotion_button_group 1163 | ), code = callback_code) 1164 | 1165 | 1166 | reuse_button_group.js_on_change('active', reuse_callback) 1167 | emotion_button_group.js_on_change('active', emo_callback) 1168 | char_button_group.js_on_change('active', char_callback) 1169 | mult_char_button_group.js_on_change('active', mult_char_callback) 1170 | 1171 | 1172 | layout = column(reuse_button_group, emotion_button_group, plot) 1173 | tab1 = Panel(child=layout, title='Affect') 1174 | return tab1 1175 | # return layout 1176 | 1177 | def build_line_plot_char(data_path, words_per_chunk, title='Degree of Reuse'): 1178 | #Read in from csv 1179 | flat_data = pd.read_csv(data_path) 1180 | 1181 | flat_data = chart_cols(flat_data, words_per_chunk) 1182 | 1183 | flat_data = chart_pivot(flat_data) 1184 | 1185 | 1186 | # Scale so that both maxima have the same height 1187 | reuse_y = flat_data['Frequency of Reuse (Exact Matches)'] 1188 | emo_y = flat_data['None'] 1189 | char_y = flat_data['None'] 1190 | mult_char_y = flat_data['None'] 1191 | reuse_max = reuse_y.values.max() 1192 | emo_max = emo_y.values.max() 1193 | char_max = char_y.values.max() 1194 | mult_char_max = mult_char_y.values.max() 1195 | 1196 | #Make ratio work 1197 | ratio_denom = min(mult_char_max, min(reuse_max, emo_max)) 1198 | ratio_num = max(mult_char_max, max(reuse_max, emo_max)) 1199 | ratio = ratio_num / ratio_denom if ratio_denom > 0 else 1 1200 | if reuse_max < emo_max and reuse_max < mult_char_max: 1201 | to_scale = reuse_y 1202 | elif emo_max < mult_char_max and emo_max < reuse_max: 1203 | to_scale = emo_y 1204 | else: 1205 | to_scale = mult_char_y 1206 | to_scale *= ratio 1207 | 1208 | # Create data columns 1209 | x = [str(i) for i in flat_data.index] 1210 | reuse_y=reuse_y 1211 | reuse_zero = len(reuse_y) * [0] 1212 | span = flat_data.span 1213 | flat_data_source = ColumnDataSource(flat_data) 1214 | source = ColumnDataSource(dict(x=x, 1215 | reuse_y=reuse_y, 1216 | emo_y=emo_y, 1217 | char_y=char_y, 1218 | mult_char_y=mult_char_y, 1219 | reuse_zero=reuse_zero, 1220 | span=span)) 1221 | 1222 | plot = figure(x_range=FactorRange(*x), 1223 | plot_width=800, plot_height=600, 1224 | title=title, tools="hover") 1225 | 1226 | # Turn off ticks, major labels, and x grid lines, etc. 1227 | # Axis settings: 1228 | plot.xaxis.major_label_text_font_size = '0pt' 1229 | plot.xaxis.major_tick_line_color = None 1230 | plot.xaxis.minor_tick_line_color = None 1231 | 1232 | plot.yaxis.major_label_text_font_size = '0pt' 1233 | plot.yaxis.major_tick_line_color = None 1234 | plot.yaxis.minor_tick_line_color = None 1235 | 1236 | # CategoricalAxis settings: 1237 | plot.xaxis.group_text_font_size = '0pt' 1238 | plot.xaxis.separator_line_color = None 1239 | 1240 | # Grid settings: 1241 | plot.xgrid.grid_line_color = None 1242 | # plot.ygrid.minor_grid_line_color = 'black' 1243 | # plot.ygrid.minor_grid_line_alpha = 0.03 1244 | plot.xaxis.axis_label = 'Beginning of Script    ←            →   End of Script' 1245 | plot.yaxis.axis_label = 'Low Reuse             Medium Reuse             High Reuse' 1246 | 1247 | hover = plot.select(dict(type=HoverTool)) 1248 | hover.tooltips = "
@span{safe}
" 1249 | 1250 | plot.varea(x='x', source = source, y1 = 'reuse_y', y2 = 'reuse_zero', fill_color = Spectral6[0], fill_alpha = 0.6) 1251 | plot.line(x='x', source = source, y = 'reuse_y', line_color = Spectral6[0], line_alpha = 0.0) 1252 | plot.line(x='x', line_width=2.0, source=source, y='emo_y', line_color = Spectral6[1]) 1253 | plot.line(x='x', line_width=2.0, source=source, y='char_y', line_color = 'red') 1254 | plot.line(x='x', line_width=2.0, source=source, y='mult_char_y', line_color = 'red') 1255 | 1256 | 1257 | reuse_button_group = RadioButtonGroup( 1258 | labels=[_FIELDS[0]] + ["Frequency of Reuse (Fuzzy Matches)"], 1259 | button_type='primary', 1260 | active=0 1261 | ) 1262 | 1263 | emotion_button_group = CheckboxButtonGroup( 1264 | labels= ['Clear'] + _FIELDS[4:], 1265 | button_type='success', 1266 | active=[], 1267 | ) 1268 | 1269 | char_button_group = RadioButtonGroup( 1270 | labels= ['None'] + [x.replace("CHARACTER_", "") for x in flat_data.columns if x.startswith("CHARACTER_")], 1271 | button_type='danger', 1272 | active=0 1273 | ) 1274 | 1275 | mult_char_button_group = CheckboxButtonGroup( 1276 | labels= ['Clear'] + [x.replace("CHARACTER_", "") for x in flat_data.columns if x.startswith("CHARACTER_")], 1277 | button_type='danger', 1278 | active=[] 1279 | ) 1280 | 1281 | 1282 | callback_code=""" 1283 | var reuse = reuse_button_group.labels[reuse_button_group.active]; 1284 | if (reuse == "Frequency of Reuse (Fuzzy Matches)") { 1285 | reuse = "Frequency of Reuse (0-0.25)"; 1286 | } 1287 | //var emo = emotion_button_group.labels[emotion_button_group.active]; 1288 | var char = char_button_group.labels[char_button_group.active]; 1289 | var mult_char = []; 1290 | for (i = 0; i < mult_char_button_group.active.length; i++) { 1291 | mult_char.push(mult_char_button_group.labels[mult_char_button_group.active[i]]); 1292 | } 1293 | if (mult_char.includes("Clear")) { 1294 | mult_char = []; 1295 | mult_char_button_group.active = [] 1296 | } 1297 | 1298 | var emo_arr = []; 1299 | for (i = 0; i < emotion_button_group.active.length; i++) { 1300 | emo_arr.push(emotion_button_group.labels[emotion_button_group.active[i]]); 1301 | } 1302 | if (emo_arr.includes("Clear")) { 1303 | emo_arr = []; 1304 | emotion_button_group.active = [] 1305 | } 1306 | 1307 | 1308 | 1309 | var reuse_data = flat_data_source.data[reuse].slice(); // Copy 1310 | //var emo_data = flat_data_source.data[emo].slice(); // Copy 1311 | if (char == "None") { 1312 | var char_data = flat_data_source.data["None"].slice(); 1313 | } else { 1314 | var char_data = flat_data_source.data["CHARACTER_" + char].slice(); 1315 | } // Copy 1316 | 1317 | var multiplied = []; 1318 | var listOfLists = []; 1319 | var newList = [[]]; 1320 | for (i = 0; i < mult_char.length; i++) { 1321 | if (mult_char[i] == "Clear") { 1322 | listOfLists.push(flat_data_source.data["None"].slice()); 1323 | } else { 1324 | listOfLists.push(flat_data_source.data["CHARACTER_" + mult_char[i]].slice()); 1325 | } 1326 | } 1327 | 1328 | var emoListOfLists = []; 1329 | for (i = 0; i < emo_arr.length; i++) { 1330 | if (emo_arr[i] == "Clear") { 1331 | emoListOfLists.push(flat_data_source.data["None"].slice()); 1332 | } else { 1333 | emoListOfLists.push(flat_data_source.data[emo_arr[i]].slice()); 1334 | } 1335 | } 1336 | 1337 | 1338 | 1339 | function zip(a) { 1340 | if (a.length == 0) { 1341 | return []; 1342 | } 1343 | var output = []; 1344 | var length = a[0].length; 1345 | for (i = 0; i < length; i++) { 1346 | var newRow = []; 1347 | for (j = 0; j < a.length; j++) { 1348 | 
newRow.push(a[j][i]); 1349 | } 1350 | output.push(newRow); 1351 | } 1352 | return output; 1353 | } 1354 | 1355 | function gMean(a) { 1356 | var starter = 1; 1357 | for (i = 0; i < a.length; i++) { 1358 | starter = starter * a[i]; 1359 | } 1360 | if (starter == 0) { 1361 | return 0; 1362 | } else { 1363 | return Math.pow(starter, 1/a.length); 1364 | } 1365 | } 1366 | 1367 | var mult_char_data = zip(listOfLists).map(gMean); 1368 | var emo_data = zip(emoListOfLists).map(gMean) 1369 | 1370 | var reuse_max = Math.max.apply(Math, reuse_data); 1371 | var emo_max = Math.max.apply(Math, emo_data); 1372 | var char_max = Math.max.apply(Math, char_data); 1373 | var mult_char_max = Math.max.apply(Math, mult_char_data); 1374 | 1375 | var ratio = 0; 1376 | var to_scale = null; 1377 | var to_scale_other = null; 1378 | 1379 | //if (emo_max > reuse_max && emo_max > mult_char_max) { 1380 | // to_scale = reuse_data; 1381 | // to_scale_also = mult_char_data; 1382 | // ratio_one = emo_max / reuse_max; 1383 | // ratio_two = emo_max / mult_char_max; 1384 | // 1385 | //} else if (mult_char_max > emo_max && mult_char_max > reuse_max) { 1386 | // to_scale = reuse_data; 1387 | // to_scale_also = emo_data; 1388 | // ratio_one = mult_char_max / reuse_max; 1389 | // ratio_two = mult_char_max / emo_max; 1390 | //} else if (reuse_max > emo_max && reuse_max > mult_char_max){ 1391 | // to_scale = emo_data; 1392 | // to_scale_also = mult_char_data; 1393 | // ratio_one = reuse_max / emo_max; 1394 | // ratio_two = reuse_max / mult_char_max; 1395 | //} 1396 | // 1397 | for (var i = 0; i < mult_char_data.length; i++) { 1398 | mult_char_data[i] /= mult_char_max; 1399 | 1400 | } 1401 | 1402 | for (var i = 0; i < emo_data.length; i++) { 1403 | emo_data[i] /= emo_max; 1404 | } 1405 | 1406 | for (var i = 0; i < reuse_data.length; i++) { 1407 | reuse_data[i] /= reuse_max; 1408 | } 1409 | 1410 | var x = source.data['x']; 1411 | var reuse_y = source.data['reuse_y']; 1412 | var emo_y = source.data['emo_y']; 1413 | var char_y = source.data['char_y']; 1414 | var mult_char_y = source.data['mult_char_y'] 1415 | for (var i = 0; i < x.length; i++) { 1416 | reuse_y[i] = reuse_data[i]; 1417 | emo_y[i] = emo_data[i]; 1418 | char_y[i] = char_data[i]; 1419 | mult_char_y[i] = mult_char_data[i]; 1420 | } 1421 | 1422 | source.change.emit(); 1423 | 1424 | """ 1425 | 1426 | reuse_callback = CustomJS( 1427 | args=dict( 1428 | source=source, 1429 | flat_data_source=flat_data_source, 1430 | reuse_button_group=reuse_button_group, 1431 | emotion_button_group=emotion_button_group, 1432 | char_button_group=char_button_group, 1433 | mult_char_button_group=mult_char_button_group, 1434 | other_button_group=None 1435 | ), code = callback_code) 1436 | 1437 | 1438 | emo_callback = CustomJS( 1439 | args=dict( 1440 | source=source, 1441 | flat_data_source=flat_data_source, 1442 | reuse_button_group=reuse_button_group, 1443 | emotion_button_group=emotion_button_group, 1444 | char_button_group=char_button_group, 1445 | mult_char_button_group=mult_char_button_group, 1446 | other_button_group=char_button_group 1447 | ), code = callback_code) 1448 | 1449 | char_callback = CustomJS( 1450 | args=dict( 1451 | source=source, 1452 | flat_data_source=flat_data_source, 1453 | reuse_button_group=reuse_button_group, 1454 | emotion_button_group=emotion_button_group, 1455 | char_button_group=char_button_group, 1456 | mult_char_button_group=mult_char_button_group, 1457 | other_button_group=emotion_button_group 1458 | ), code = callback_code) 1459 | 1460 | mult_char_callback = 
CustomJS( 1461 | args=dict( 1462 | source=source, 1463 | flat_data_source=flat_data_source, 1464 | reuse_button_group=reuse_button_group, 1465 | emotion_button_group=emotion_button_group, 1466 | char_button_group=char_button_group, 1467 | mult_char_button_group=mult_char_button_group, 1468 | other_button_group=emotion_button_group 1469 | ), code = callback_code) 1470 | 1471 | 1472 | reuse_button_group.js_on_change('active', reuse_callback) 1473 | emotion_button_group.js_on_change('active', emo_callback) 1474 | char_button_group.js_on_change('active', char_callback) 1475 | mult_char_button_group.js_on_change('active', mult_char_callback) 1476 | 1477 | 1478 | layout = column(reuse_button_group, mult_char_button_group, plot) 1479 | tab1 = Panel(child=layout, title='Character') 1480 | return tab1 1481 | #return layout 1482 | 1483 | def build_line_plot_dropdown(data_path, words_per_chunk, title='Reuse'): 1484 | #Read in from csv 1485 | flat_data = pd.read_csv(data_path) 1486 | flat_data = chart_cols(flat_data, words_per_chunk) 1487 | flat_data = chart_pivot(flat_data) 1488 | 1489 | # Scale so that both maxima have the same height 1490 | reuse_y = flat_data['Frequency of Reuse (Exact Matches)'] 1491 | emo_y = flat_data['None'] 1492 | char_y = flat_data['None'] 1493 | reuse_max = reuse_y.values.max() 1494 | emo_max = emo_y.values.max() 1495 | char_max = char_y.values.max() 1496 | 1497 | #Make ratio work 1498 | ratio_denom = min(char_max, min(reuse_max, emo_max)) 1499 | ratio_num = max(char_max, max(reuse_max, emo_max)) 1500 | ratio = ratio_num / ratio_denom if ratio_denom > 0 else 1 1501 | if reuse_max < emo_max and reuse_max < char_max: 1502 | to_scale = reuse_y 1503 | elif emo_max < char_max and emo_max < reuse_max: 1504 | to_scale = emo_y 1505 | else: 1506 | to_scale = char_y 1507 | to_scale *= ratio 1508 | 1509 | # Create data columns 1510 | x = [str(i) for i in flat_data.index] 1511 | reuse_y=reuse_y 1512 | reuse_zero = len(reuse_y) * [0] 1513 | span = flat_data.span 1514 | flat_data_source = ColumnDataSource(flat_data) 1515 | source = ColumnDataSource(dict(x=x, 1516 | reuse_y=reuse_y, 1517 | emo_y=emo_y, 1518 | char_y=char_y, 1519 | reuse_zero=reuse_zero, 1520 | span=span)) 1521 | 1522 | plot = figure(x_range=FactorRange(*x), 1523 | plot_width=800, plot_height=600, 1524 | title=title, tools="hover") 1525 | 1526 | # Turn off ticks, major labels, and x grid lines, etc. 1527 | # Axis settings: 1528 | plot.xaxis.major_label_text_font_size = '0pt' 1529 | plot.xaxis.major_tick_line_color = None 1530 | plot.xaxis.minor_tick_line_color = None 1531 | 1532 | # CategoricalAxis settings: 1533 | plot.xaxis.group_text_font_size = '0pt' 1534 | plot.xaxis.separator_line_color = None 1535 | 1536 | # Grid settings: 1537 | plot.xgrid.grid_line_color = None 1538 | plot.ygrid.minor_grid_line_color = 'black' 1539 | plot.ygrid.minor_grid_line_alpha = 0.03 1540 | 1541 | hover = plot.select(dict(type=HoverTool)) 1542 | hover.tooltips = "
@span{safe}
" 1543 | 1544 | plot.varea(x='x', source = source, y1 = 'reuse_y', y2 = 'reuse_zero', fill_color = Spectral6[0], fill_alpha = 0.6) 1545 | plot.line(x='x', source = source, y = 'reuse_y', line_color = Spectral6[0], line_alpha = 0.0) 1546 | plot.line(x='x', line_width=2.0, source=source, y='emo_y', line_color = Spectral6[1]) 1547 | plot.line(x='x', line_width=2.0, source=source, y='char_y', line_color = 'red') 1548 | 1549 | 1550 | reuse_button_group = RadioButtonGroup( 1551 | labels= [_FIELDS[0]] + ["Frequency of Reuse (Fuzzy Matches)"], button_type='primary', 1552 | active=0 1553 | ) 1554 | 1555 | emotion_dropdown_button_group = Select( 1556 | title="Emotion", value="None", options=_FIELDS[3:]) 1557 | 1558 | char_dropdown_button_group = Select( 1559 | title="Emotion2", value="None", options=_FIELDS[3:]) 1560 | 1561 | callback_code=""" 1562 | var reuse = reuse_button_group.labels[reuse_button_group.active]; 1563 | if (reuse == "Frequency of Reuse (Fuzzy Matches)") { 1564 | reuse = "Frequency of Reuse (0-0.25)"; 1565 | } 1566 | var emo = emotion_dropdown_button_group.value; 1567 | var char = char_dropdown_button_group.value; 1568 | var reuse_data = flat_data_source.data[reuse].slice(); // Copy 1569 | var emo_data = flat_data_source.data[emo].slice(); // Copy 1570 | var char_data = flat_data_source.data[char].slice(); // Copy 1571 | var reuse_max = Math.max.apply(Math, reuse_data); 1572 | var emo_max = Math.max.apply(Math, emo_data); 1573 | var char_max = Math.max.apply(Math, char_data); 1574 | 1575 | var ratio = 0; 1576 | var to_scale = null; 1577 | var to_scale_other = null; 1578 | 1579 | if (emo_max > reuse_max && emo_max > char_max) { 1580 | to_scale = reuse_data; 1581 | to_scale_also = char_data; 1582 | ratio_one = emo_max / reuse_max; 1583 | ratio_two = emo_max / char_max; 1584 | } else if (char_max > emo_max && char_max > reuse_max) { 1585 | to_scale = reuse_data; 1586 | to_scale_also = emo_data; 1587 | ratio_one = char_max / reuse_max; 1588 | ratio_two = char_max / emo_max; 1589 | } else { 1590 | to_scale = emo_data; 1591 | to_scale_also = char_data; 1592 | ratio_one = reuse_max / emo_max; 1593 | ratio_two = reuse_max / char_max; 1594 | } 1595 | 1596 | for (var i = 0; i < to_scale.length; i++) { 1597 | to_scale[i] *= ratio_one; 1598 | to_scale_also[i] *= ratio_two; 1599 | } 1600 | 1601 | var x = source.data['x']; 1602 | var reuse_y = source.data['reuse_y']; 1603 | var emo_y = source.data['emo_y']; 1604 | var char_y = source.data['char_y'] 1605 | for (var i = 0; i < x.length; i++) { 1606 | reuse_y[i] = reuse_data[i]; 1607 | emo_y[i] = emo_data[i]; 1608 | char_y[i] = char_data[i]; 1609 | } 1610 | 1611 | source.change.emit(); 1612 | if (char_dropdown_button_group.value == "None" || emotion_dropdown_button_group.value == "None") { 1613 | return; 1614 | } 1615 | 1616 | if (other_button_group) { 1617 | other_button_group.value = "None"; 1618 | } 1619 | 1620 | """ 1621 | 1622 | reuse_callback = CustomJS( 1623 | args=dict( 1624 | source=source, 1625 | flat_data_source=flat_data_source, 1626 | reuse_button_group=reuse_button_group, 1627 | emotion_dropdown_button_group=emotion_dropdown_button_group, 1628 | char_dropdown_button_group=char_dropdown_button_group, 1629 | other_button_group=None 1630 | ), code = callback_code) 1631 | 1632 | 1633 | emo_callback = CustomJS( 1634 | args=dict( 1635 | source=source, 1636 | flat_data_source=flat_data_source, 1637 | reuse_button_group=reuse_button_group, 1638 | emotion_dropdown_button_group=emotion_dropdown_button_group, 1639 | 
char_dropdown_button_group=char_dropdown_button_group, 1640 | other_button_group=char_dropdown_button_group 1641 | ), code = callback_code) 1642 | 1643 | char_callback = CustomJS( 1644 | args=dict( 1645 | source=source, 1646 | flat_data_source=flat_data_source, 1647 | reuse_button_group=reuse_button_group, 1648 | emotion_dropdown_button_group=emotion_dropdown_button_group, 1649 | char_dropdown_button_group=char_dropdown_button_group, 1650 | other_button_group=emotion_dropdown_button_group 1651 | ), code = callback_code) 1652 | 1653 | 1654 | reuse_button_group.js_on_change('active', reuse_callback) 1655 | emotion_dropdown_button_group.js_on_change('value', emo_callback) 1656 | char_dropdown_button_group.js_on_change('value', char_callback) 1657 | 1658 | 1659 | layout = column(reuse_button_group, emotion_dropdown_button_group, char_dropdown_button_group, plot) 1660 | tab1 = Panel(child=layout, title='Line Dropdown') 1661 | return tab1 1662 | 1663 | def build_plot(args): 1664 | # return Tabs(tabs=[build_line_plot_char(args.input, args.words_per_chunk), 1665 | # build_line_plot_affect(args.input, args.words_per_chunk), 1666 | # build_line_plot_compare(args.input, args.words_per_chunk)]) 1667 | return build_line_plot_compare(args.input, args.words_per_chunk) 1668 | 1669 | 1670 | 1671 | def save_static(args): 1672 | plot = build_plot(args) 1673 | file_html(plot, CDN, args.title) 1674 | output_file(args.output, 1675 | title=args.title, mode="cdn") 1676 | save(plot) 1677 | 1678 | def save_embed(args): 1679 | plot = build_plot(args) 1680 | with open(args.output, 'w', encoding='utf-8') as op: 1681 | for c in components(plot): 1682 | op.write(c) 1683 | op.write('\n') 1684 | 1685 | def save_plot(args): 1686 | title = 'Average Quantity of Text Reuse by {}-word Section' 1687 | title = title.format(args.words_per_chunk) 1688 | args.title = title 1689 | 1690 | if args.static: 1691 | save_static(args) 1692 | else: 1693 | save_embed(args) 1694 | -------------------------------------------------------------------------------- /workflow/format_helper.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | import re 4 | 5 | franchise = sys.argv[1] 6 | movie = sys.argv[2] 7 | 8 | if len(sys.argv) <= 3: 9 | movie_folder = 'results/{0:}-{1:}'.format(franchise, movie) 10 | data_folders = os.listdir(movie_folder) 11 | dates = sorted([f for f in data_folders if re.search(r'[0-9]{8}', f)]) 12 | date = dates[-1] 13 | else: 14 | date = sys.argv[3] 15 | 16 | cmd = ('python ao3.py format ' 17 | '-o results/{0:}-{1:}/fandom-data-{1:}.csv ' 18 | 'results/{0:}-{1:}/{2:}/match-6gram-{2:}.csv ' 19 | 'scripts/{0:}-{1:}.txt') 20 | cmd = cmd.format(franchise, 21 | movie, 22 | date) 23 | print(cmd) 24 | 25 | -------------------------------------------------------------------------------- /workflow/reformat.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | for x in os.listdir('results/') : 4 | if not x.startswith('.'): 5 | os.system('`python format_helper.py' + ' ' + x.replace('-', ' ', 1) + '`') 6 | -------------------------------------------------------------------------------- /workflow/revis.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | for x in os.listdir('results/') : 4 | if not x.startswith('.'): 5 | os.system('`python vis_helper.py' + ' ' + x.replace('-', ' ', 1) + '`') 6 | 
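# Note on the two loops above (reformat.py and revis.py): format_helper.py and
# vis_helper.py only *print* the full `python ao3.py format ...` /
# `python ao3.py vis ...` command for a given <franchise> <movie> pair, and the
# backticks passed to os.system make the shell execute that printed command via
# command substitution. This assumes the helper scripts, ao3.py, and the
# results/ folder are all reachable from the current working directory, and
# that every entry in results/ is named <franchise>-<movie>. A rough equivalent
# without the backtick trick (a sketch, not tested against this repo) might
# look like:
#
#     import subprocess
#     for x in os.listdir('results/'):
#         if not x.startswith('.'):
#             franchise, movie = x.split('-', 1)
#             cmd = subprocess.run(
#                 ['python', 'vis_helper.py', franchise, movie],
#                 capture_output=True, text=True, check=True
#             ).stdout.strip()
#             subprocess.run(cmd, shell=True, check=True)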
-------------------------------------------------------------------------------- /workflow/vis_helper.py: -------------------------------------------------------------------------------- 1 | import sys 2 | cmd = ('python ao3.py vis ' 3 | '-o results/{0:}-{1:}/{2:}_reuse.html ' 4 | 'results/{0:}-{1:}/fandom-data-{1:}.csv') 5 | cmd = cmd.format(sys.argv[1], 6 | sys.argv[2], 7 | sys.argv[2].replace('-', '_')) 8 | print(cmd) 9 | 10 | --------------------------------------------------------------------------------
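Taken together, the workflow helpers batch-rerun the format and vis steps:
format_helper.py and vis_helper.py each print a single `ao3.py format` /
`ao3.py vis` command for one `<franchise>-<movie>` results folder
(format_helper.py picks the most recent YYYYMMDD folder when no date argument
is given), and reformat.py / revis.py loop over `results/` and run each printed
command through the shell. For example, assuming a `results/sw-new-hope` folder
containing a `20190604` search run and running from the repository root,

    python workflow/format_helper.py sw new-hope

should print

    python ao3.py format -o results/sw-new-hope/fandom-data-new-hope.csv results/sw-new-hope/20190604/match-6gram-20190604.csv scripts/sw-new-hope.txt

which reformat.py would then execute for every folder found in `results/`.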