 ├── .gitignore ├── README.md ├── src-human-correlations │ ├── chatGPT_evaluate_intruders.py │ ├── chatGPT_evaluate_topic_ratings.py │ ├── generate_intruder_words_dataset.py │ └── human_correlations_bootstrap.py ├── src-misc │ ├── 01-get_data.sh │ ├── 02-process_data.ipynb │ ├── 02-process_data.py │ ├── 03-figures_nmpi_llm.py │ ├── 04-launch_figures_nmpi_llm.sh │ ├── 05-find_rating_errors.ipynb │ ├── 06-figures_ari_llm.py │ ├── 07-pairwise_scores.ipynb │ └── fig_utils.py ├── src-number-of-topics │ ├── LLM_scores_and_ARI.py │ ├── chatGPT_document_label_assignment.py │ └── chatGPT_ratings_assignment.py └── topic-modeling-output ├── dvae-topics-best-c_npmi_10_full.json ├── etm-topics-best-c_npmi_10_full.json └── mallet-topics-best-c_npmi_10_full.json /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | repo-old 3 | *.pyc 4 | computed/ -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Revisiting Automated Topic Model Evaluation with Large Language Models 2 | 3 | This repo contains code and data for our [EMNLP 2023 paper](https://aclanthology.org/2023.emnlp-main.581/) about assessing topic model output with Large Language Models. 4 | ``` 5 | @inproceedings{stammbach-etal-2023-revisiting, 6 | title = "Revisiting Automated Topic Model Evaluation with Large Language Models", 7 | author = "Stammbach, Dominik and 8 | Zouhar, Vil{\'e}m and 9 | Hoyle, Alexander and 10 | Sachan, Mrinmaya and 11 | Ash, Elliott", 12 | booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing", 13 | month = dec, 14 | year = "2023", 15 | address = "Singapore", 16 | publisher = "Association for Computational Linguistics", 17 | url = "https://aclanthology.org/2023.emnlp-main.581", 18 | doi = "10.18653/v1/2023.emnlp-main.581", 19 | pages = "9348--9357" 20 | } 21 | ``` 22 | 23 | ## Prerequisites 24 | 25 | ```shell 26 | pip install --upgrade openai pandas 27 | ``` 28 | 29 | ## Large Language Models and Topics with Human Annotations 30 | 31 | Download the topic words and human annotations from the paper [Is Automated Topic Model Evaluation Broken?](https://arxiv.org/abs/2107.02173), available in their [GitHub repository](https://github.com/ahoho/topics/blob/dev/data/human/all_data/all_data.json). 32 | 33 | 34 | ### Intruder Detection Test 35 | 36 | Following Hoyle et al. (2021), we randomly sample intruder words that do not appear among the top 50 words of each topic. 37 | 38 | ```shell 39 | python src-human-correlations/generate_intruder_words_dataset.py 40 | ``` 41 | 42 | We can then call an LLM to automatically annotate the intruder words for each topic. 43 | 44 | ```shell 45 | python src-human-correlations/chatGPT_evaluate_intruders.py --API_KEY a_valid_openAI_api_key 46 | ``` 47 | 48 | For the ratings task, simply call the script that rates topic word sets (there is no need to generate a dataset first): 49 | 50 | ```shell 51 | python src-human-correlations/chatGPT_evaluate_topic_ratings.py --API_KEY a_valid_openAI_api_key 52 | ``` 53 | (In case the OpenAI API breaks, all output is saved to a JSON file; we restart the script and skip datapoints that have already been annotated.) 54 | 55 | 56 | ### Evaluating LLM and Human Correlations 57 | 58 | We evaluate using a bootstrap approach in which we sample human annotations and LLM annotations for each datapoint. We then average these sampled annotations and compute Spearman's rho for each bootstrapped sample. We report the mean Spearman's rho over all 1000 bootstrapped samples. 
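 As an illustration, a minimal sketch of this bootstrap (assuming per-topic lists of human scores and LLM scores; the released implementation is `src-human-correlations/human_correlations_bootstrap.py`): ```python # Minimal sketch of the bootstrap correlation described above (illustration only). # Assumes `human_scores` and `llm_scores` map a topic id to a list of per-annotation scores; # the released implementation lives in src-human-correlations/human_correlations_bootstrap.py. import random import numpy as np from scipy.stats import spearmanr def bootstrap_spearman(human_scores, llm_scores, n_boot=1000, seed=42): random.seed(seed) rhos = [] topic_ids = sorted(human_scores) for _ in range(n_boot): human_means, llm_means = [], [] for topic_id in topic_ids: h = human_scores[topic_id] g = llm_scores[topic_id] # resample annotations with replacement, then average per topic human_means.append(np.mean(random.choices(h, k=len(h)))) llm_means.append(np.mean(random.choices(g, k=len(g)))) rhos.append(spearmanr(llm_means, human_means).statistic) # report the mean Spearman's rho over all bootstrapped samples return np.mean(rhos) ``` 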
 59 | 60 | ```shell 61 | python src-human-correlations/human_correlations_bootstrap.py --filename coherence-outputs-section-2/ratings_outfile_with_dataset_description.jsonl --task ratings 62 | ``` 63 | 64 | 65 | ## Evaluating Topic Models with Different Numbers of Topics 66 | 67 | Download the fitted topic models and metadata for the two datasets (bills and wikitext) [here](https://www.dropbox.com/s/huxdloe5l6w2tu5/topic_model_k_selection.zip?dl=0) and unzip the archive. 68 | 69 | ### Rating Topic Word Sets 70 | 71 | To run LLM ratings of topic word sets on a dataset (wikitext or bills) with broad or specific ground-truth example topics, simply run: 72 | 73 | ```shell 74 | python src-number-of-topics/chatGPT_ratings_assignment.py --API_KEY a_valid_openAI_api_key --dataset wikitext --label_categories broad 75 | ``` 76 | 77 | ### Purity of Document Collections 78 | 79 | We also assign a document label to the top documents belonging to each topic, following [Doogan and Buntine, 2021](https://aclanthology.org/2021.naacl-main.300/). We then average purity per document collection; the number of topics with the highest average purity is the one preferred by this procedure. 80 | 81 | To run LLM label assignments on a dataset (wikitext or bills) with broad or specific ground-truth example topics, simply run: 82 | 83 | ```shell 84 | python src-number-of-topics/chatGPT_document_label_assignment.py --API_KEY a_valid_openAI_api_key --dataset wikitext --label_categories broad 85 | ``` 86 | 87 | ### Plot resulting scores 88 | 89 | ```shell 90 | python src-number-of-topics/LLM_scores_and_ARI.py --label_categories broad --method label_assignment --dataset wikitext --filename number-of-topics-section-4/document_label_assignment_wikitext_broad.jsonl 91 | ``` 92 | 93 | ## Questions 94 | 95 | Please contact [Dominik Stammbach](mailto:dominik.stammbach@gess.ethz.ch) regarding any questions. 96 | 97 | 98 | [![Paper video presentation](https://img.youtube.com/vi/qIDjtgWTgjs/0.jpg)](https://www.youtube.com/watch?v=qIDjtgWTgjs) 99 | -------------------------------------------------------------------------------- /src-human-correlations/chatGPT_evaluate_intruders.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import json 3 | import random 4 | import openai 5 | from tqdm import tqdm 6 | from ast import literal_eval 7 | from collections import defaultdict 8 | import time 9 | import argparse 10 | import os 11 | 12 | def get_prompts(include_dataset_description="include"): 13 | if include_dataset_description == "include": 14 | system_prompt_NYT = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Select which word is the least related to all other words. If multiple words do not fit, choose the word that is most out of place. 15 | The topic modeling is based on The New York Times corpus. The corpus consists of articles from 1987 to 2007. Sections from a typical paper include International, National, New York Regional, Business, Technology, and Sports news; features on topics such as Dining, Movies, Travel, and Fashion; there are also obituaries and opinion pieces. 16 | Reply with a single word.""" 17 | 18 | system_prompt_wikitext = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. 
Select which word is the least related to all other words. If multiple words do not fit, choose the word that is most out of place. 19 | The topic modeling is based on the Wikipedia corpus. Wikipedia is an online encyclopedia covering a huge range of topics. Articles can include biographies ("George Washington"), scientific phenomena ("Solar Eclipse"), art pieces ("La Danse"), music ("Amazing Grace"), transportation ("U.S. Route 131"), sports ("1952 winter olympics"), historical events or periods ("Tang Dynasty"), media and pop culture ("The Simpsons Movie"), places ("Yosemite National Park"), plants and animals ("koala"), and warfare ("USS Nevada (BB-36)"), among others. 20 | Reply with a single word.""" 21 | outfile_name = "coherence-outputs-section-2/intrusion_outfile_with_dataset_description.jsonl" 22 | else: 23 | system_prompt_NYT = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Select which word is the least related to all other words. If multiple words do not fit, choose the word that is most out of place. 24 | Reply with a single word.""" 25 | 26 | system_prompt_wikitext = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Select which word is the least related to all other words. If multiple words do not fit, choose the word that is most out of place. 27 | Reply with a single word.""" 28 | outfile_name = "coherence-outputs-section-2/intrusion_outfile_without_dataset_description.jsonl" 29 | return system_prompt_NYT, system_prompt_wikitext, outfile_name 30 | 31 | if __name__ == "__main__": 32 | parser = argparse.ArgumentParser() 33 | parser.add_argument("--API_KEY", default="openai API key", type=str, required=True, help="valid openAI API key") 34 | parser.add_argument("--dataset_description", default="include", type=str, help="whether to include a dataset description or not, default = include.") 35 | args = parser.parse_args() 36 | 37 | 38 | openai.api_key = args.API_KEY 39 | random.seed(42) 40 | df = pd.read_json("intruder_outfile.jsonl", lines=True) 41 | 42 | system_prompt_NYT, system_prompt_wikitext, outfile_name = get_prompts(include_dataset_description=args.dataset_description) 43 | os.makedirs("coherence-outputs-section-2", exist_ok=True) 44 | 45 | 46 | columns = df.columns.tolist() 47 | 48 | with open(outfile_name, "w") as outfile: 49 | for i, row in tqdm(df.iterrows(), total=len(df)): 50 | if row.dataset_name == "wikitext": 51 | system_prompt = system_prompt_wikitext 52 | else: 53 | system_prompt = system_prompt_NYT 54 | 55 | words = row.topic_terms 56 | # shuffle words 57 | random.shuffle(words) 58 | # we only prompt 5 words 59 | words = words[:5] 60 | 61 | # we add intruder term 62 | intruder_term = row.intruder_term 63 | 64 | # we shuffle again 65 | words.append(intruder_term) 66 | random.shuffle(words) 67 | 68 | # we have a user prompt 69 | user_prompt = ", ".join(['"' + w + '"' for w in words]) 70 | 71 | out = {col: row[col] for col in columns} 72 | response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}], temperature=0.7, max_tokens=15)["choices"][0]["message"]["content"].strip() 73 | out["response"] = response 74 | out["user_promt"] = user_prompt 75 | json.dump(out, outfile) 76 | outfile.write("\n") 77 | time.sleep(0.5) 78 | 79 | 80 | -------------------------------------------------------------------------------- 
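 Note: the annotation script above streams one JSON object per line as it runs. The resume behaviour mentioned in the README (restart and skip already-annotated datapoints) is not implemented in the released script; a hypothetical helper, with the key fields taken from the rows the script writes, could look like this sketch: ```python # Hypothetical resume helper (not part of the released scripts; shown for illustration). # It collects identifiers of rows already written to a JSONL output file so that a restarted # run can open the file in append mode ("a") and skip those datapoints. import json import os def load_annotated_keys(outfile_name="coherence-outputs-section-2/intrusion_outfile_with_dataset_description.jsonl"): done = set() if not os.path.exists(outfile_name): return done with open(outfile_name) as f: for line in f: row = json.loads(line) # these keys are written by the annotation loop for every datapoint done.add((row["dataset_name"], row["model_name"], row["topic_id"], row.get("intruder_term"))) return done # Usage sketch: build the same tuple for the current row inside the annotation loop, # `continue` if it is already in the set, and open the output file with mode "a" instead of "w". ``` 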
/src-human-correlations/chatGPT_evaluate_topic_ratings.py: -------------------------------------------------------------------------------- 1 | import json 2 | import random 3 | import openai 4 | from tqdm import tqdm 5 | import pandas as pd 6 | import time 7 | import argparse 8 | import os 9 | 10 | 11 | def get_prompts(include_dataset_description=True): 12 | if include_dataset_description: 13 | system_prompt_NYT = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related). 14 | The topic modeling is based on The New York Times corpus. The corpus consists of articles from 1987 to 2007. Sections from a typical paper include International, National, New York Regional, Business, Technology, and Sports news; features on topics such as Dining, Movies, Travel, and Fashion; there are also obituaries and opinion pieces. 15 | Reply with a single number, indicating the overall appropriateness of the topic.""" 16 | 17 | system_prompt_wikitext = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related). 18 | The topic modeling is based on the Wikipedia corpus. Wikipedia is an online encyclopedia covering a huge range of topics. Articles can include biographies ("George Washington"), scientific phenomena ("Solar Eclipse"), art pieces ("La Danse"), music ("Amazing Grace"), transportation ("U.S. Route 131"), sports ("1952 winter olympics"), historical events or periods ("Tang Dynasty"), media and pop culture ("The Simpsons Movie"), places ("Yosemite National Park"), plants and animals ("koala"), and warfare ("USS Nevada (BB-36)"), among others. 19 | Reply with a single number, indicating the overall appropriateness of the topic.""" 20 | outfile_name = "coherence-outputs-section-2/ratings_outfile_with_dataset_description.jsonl" 21 | else: 22 | system_prompt_NYT = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related). 23 | Reply with a single number, indicating the overall appropriateness of the topic.""" 24 | 25 | system_prompt_wikitext = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related). 
26 | Reply with a single number, indicating the overall appropriateness of the topic.""" 27 | outfile_name = "coherence-outputs-section-2/ratings_outfile_without_dataset_description.jsonl" 28 | return system_prompt_NYT, system_prompt_wikitext, outfile_name 29 | 30 | 31 | if __name__ == "__main__": 32 | 33 | parser = argparse.ArgumentParser() 34 | parser.add_argument("--API_KEY", default="openai API key", type=str, required=True, help="valid openAI API key") 35 | parser.add_argument("--dataset_description", default="include", type=str, help="whether to include a dataset description or not, default = include.") 36 | args = parser.parse_args() 37 | 38 | openai.api_key = args.API_KEY 39 | random.seed(42) 40 | 41 | with open("all_data.json") as f: 42 | data = json.load(f) 43 | random.seed(42) 44 | 45 | system_prompt_NYT, system_prompt_wikitext, outfile_name = get_prompts(include_dataset_description=args.dataset_description) 46 | os.makedirs("coherence-outputs-section-2", exist_ok=True) 47 | 48 | with open(outfile_name, "w") as outfile: 49 | for dataset_name, dataset in data.items(): 50 | for model_name, dataset_model in dataset.items(): 51 | print (dataset_name, model_name) 52 | topics = dataset_model["topics"] 53 | human_evaluations = dataset_model["metrics"]["ratings_scores_avg"] 54 | i = 0 55 | for topic, human_eval in tqdm(zip(topics, human_evaluations), total=50): 56 | topic = topic[:10] 57 | for run in range(3): 58 | random.shuffle(topic) 59 | user_prompt = ", ".join(topic) 60 | if dataset_name == "wikitext": 61 | system_prompt = system_prompt_wikitext 62 | else: 63 | system_prompt = system_prompt_NYT 64 | response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}], temperature=1.0, logit_bias={16:100, 17:100, 18:100}, max_tokens=1)["choices"][0]["message"]["content"].strip() 65 | out = {"dataset_name": dataset_name, "model_name": model_name, "topic_id": i, "user_prompt": user_prompt, "response": response, "human_eval":human_eval, "run": run} 66 | json.dump(out, outfile) 67 | outfile.write("\n") 68 | time.sleep(0.5) 69 | i += 1 70 | 71 | -------------------------------------------------------------------------------- /src-human-correlations/generate_intruder_words_dataset.py: -------------------------------------------------------------------------------- 1 | import json 2 | import random 3 | 4 | 5 | if __name__ == "__main__": 6 | with open("all_data.json") as f: 7 | data = json.load(f) 8 | 9 | random.seed(42) 10 | with open("intruder_outfile.jsonl", "w") as outfile: 11 | for dataset_name, dataset in data.items(): 12 | for model_name, dataset_model in dataset.items(): 13 | print (dataset_name, model_name) 14 | if model_name == "mallet": 15 | fn = "topic-modeling-output/mallet-topics-best-c_npmi_10_full.json" 16 | elif model_name == "dvae": 17 | fn = "topic-modeling-output/dvae-topics-best-c_npmi_10_full.json" 18 | else: 19 | fn = "topic-modeling-output/etm-topics-best-c_npmi_10_full.json" 20 | with open(fn) as f: 21 | topics_data = json.load(f) 22 | 23 | raw_topics = topics_data[dataset_name]["topics"] 24 | 25 | words = set() 26 | for topic in raw_topics: 27 | words.update(topic) 28 | words = list(set(words)) 29 | for i, (topic, metric, double_check) in enumerate(zip(raw_topics, dataset_model["metrics"]["intrusion_scores_avg"], dataset_model["topics"])): 30 | topic_set = set(topic) 31 | candidate_words = [i for i in words if i not in topic_set] 32 | random.shuffle(candidate_words) 33 | 
sampled_intruders = candidate_words[:10] 34 | for intruder in sampled_intruders: 35 | out = {} 36 | out["topic_id"] = i 37 | out["intruder_term"] = intruder 38 | out["topic_terms"] = topic[:10] 39 | out["intrusion_scores_avg"] = metric 40 | out["dataset_name"] = dataset_name 41 | out["model_name"] = model_name 42 | json.dump(out, outfile) 43 | outfile.write("\n") 44 | 45 | -------------------------------------------------------------------------------- /src-human-correlations/human_correlations_bootstrap.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import re 3 | from scipy.stats import spearmanr 4 | import numpy as np 5 | from sklearn.metrics import accuracy_score 6 | import random, json 7 | from ast import literal_eval 8 | from tqdm import tqdm 9 | import os 10 | import argparse 11 | 12 | def load_dataframe(fn, task=""): 13 | if fn.endswith(".csv"): 14 | df = pd.read_csv(fn) 15 | elif fn.endswith("jsonl"): 16 | df = pd.read_json(fn, lines=True) 17 | print (df) 18 | print (df.iloc[0]) 19 | if task == "intrusion": 20 | if "response_correct" not in df.columns: 21 | df["response_correct"] = postprocess_chatGPT_intrusion_test(df) 22 | else: 23 | 24 | if "gpt_ratings" not in df.columns: 25 | df["gpt_ratings"] = df.response.astype(int) 26 | return df 27 | 28 | def postprocess_chatGPT_intrusion_test(df): 29 | response = df.response.tolist() 30 | response = [i.lower() for i in response] 31 | response = [re.findall(r"\b[a-z_\d]+\b", i) for i in response] 32 | response_correct = [] 33 | #df.topic_terms = df.topic_terms.apply(lambda x: literal_eval(x)) 34 | #for i,j,k in zip(response, df.intruder_term, df.topic_terms): 35 | for i,j in zip(response, df.intruder_term): 36 | if not i: 37 | response_correct.append(0) 38 | elif i[0] == j: 39 | response_correct.append(1) 40 | else: 41 | response_correct.append(0) 42 | return response_correct 43 | 44 | def postprocess_chatGPT_intrusion_new(df): 45 | for response, intruder_term, topic_words in zip(response, df.intruder_term, df.topic_terms): 46 | if not response: 47 | response_correct.append(0) 48 | elif response.lower() == intruder_term.lower(): 49 | response_correct.append(1) 50 | else: 51 | response_correct.append(0) 52 | return response_correct 53 | 54 | def get_confidence_intervals(statistic): 55 | statistic = sorted(statistic) 56 | lower = statistic[4] # it's the fourth lowest value; if we had 100 samples, it would be the 2.5nd lowest value, this * 1.5 gets us 3.75 57 | upper = statistic[-4] # it's the fourth highest value 58 | print ("lower", lower, "upper", upper) 59 | 60 | def get_filenames(with_dataset_description = True): 61 | if with_dataset_description: 62 | intrusion_fn = "coherence-outputs-section-2/intrusion_outfile_with_dataset_description.jsonl" 63 | rating_fn = "coherence-outputs-section-2/ratings_outfile_with_dataset_description.csv" 64 | else: 65 | intrusion_fn = "coherence-outputs-section-2/intrusion_outfile_without_dataset_description.csv" 66 | rating_fn = "coherence-outputs-section-2/ratings_outfile_without_dataset_description.jsonl" 67 | return intrusion_fn, rating_fn 68 | 69 | 70 | def compute_human_ceiling_intrusion(data, only_confident = False): 71 | ratings_human, ratings_chatGPT = [], [] 72 | spearman_wiki = [] 73 | spearman_NYT = [] 74 | spearman_concat = [] 75 | for _ in tqdm(range(1000), total=1000): 76 | ratings_human, ratings_chatGPT = [], [] 77 | for dataset in ["wikitext", "nytimes"]: 78 | for model in ["mallet", "dvae", "etm"]: 79 | for topic_id in 
range(50): 80 | intrusion_scores_raw = data[dataset][model]["metrics"]["intrusion_scores_raw"][topic_id] 81 | if only_confident: 82 | intrusion_scores_raw = [i for i,j in zip(intrusion_scores_raw, data[dataset][model]["metrics"]["intrusion_confidences_raw"][topic_id]) if j == 1] 83 | if not intrusion_scores_raw: 84 | intrusion_scores_raw = [0] 85 | 86 | if len(intrusion_scores_raw) == 1: 87 | intrusion_scores_1 = intrusion_scores_raw 88 | intrusion_scores_2 = intrusion_scores_raw 89 | else: 90 | length = len(intrusion_scores_raw) // 2 91 | intrusion_scores_1 = intrusion_scores_raw[:length] 92 | intrusion_scores_2 = intrusion_scores_raw[length:] 93 | intrusion_scores_1 = random.choices(intrusion_scores_1, k=len(intrusion_scores_1)) 94 | intrusion_scores_2 = random.choices(intrusion_scores_2, k=len(intrusion_scores_2)) 95 | ratings_human.append(np.mean(intrusion_scores_1)) 96 | ratings_chatGPT.append(np.mean(intrusion_scores_2)) 97 | spearman_wiki.append(spearmanr(ratings_chatGPT[:150], ratings_human[:150]).statistic) 98 | spearman_NYT.append(spearmanr(ratings_chatGPT[150:], ratings_human[150:]).statistic) 99 | spearman_concat.append(spearmanr(ratings_chatGPT, ratings_human).statistic) 100 | print ("wiki", np.mean(spearman_wiki)) 101 | #get_confidence_intervals(spearman_wiki) 102 | print ("NYT", np.mean(spearman_NYT)) 103 | #get_confidence_intervals(spearman_NYT) 104 | print ("concat", np.mean(spearman_concat)) 105 | #get_confidence_intervals(spearman_concat) 106 | 107 | def compute_human_ceiling_rating(data, only_confident = False): 108 | ratings_human, ratings_chatGPT = [], [] 109 | spearman_wiki = [] 110 | spearman_NYT = [] 111 | spearman_concat = [] 112 | for _ in tqdm(range(1000), total=1000): 113 | ratings_human, ratings_chatGPT = [], [] 114 | for dataset in ["wikitext", "nytimes"]: 115 | for model in ["mallet", "dvae", "etm"]: 116 | for topic_id in range(50): 117 | intrusion_scores_raw = data[dataset][model]["metrics"]["ratings_scores_raw"][topic_id] 118 | if only_confident: 119 | intrusion_scores_raw = [i for i,j in zip(intrusion_scores_raw, data[dataset][model]["metrics"]["ratings_confidences_raw"][topic_id]) if j == 1] 120 | 121 | if not intrusion_scores_raw: 122 | intrusion_scores_raw = [1] 123 | if len(intrusion_scores_raw) == 1: 124 | intrusion_scores_1 = intrusion_scores_raw 125 | intrusion_scores_2 = intrusion_scores_raw 126 | else: 127 | length = len(intrusion_scores_raw) // 2 128 | intrusion_scores_1 = intrusion_scores_raw[:length] 129 | intrusion_scores_2 = intrusion_scores_raw[length:] 130 | intrusion_scores_1 = random.choices(intrusion_scores_1, k=len(intrusion_scores_1)) 131 | intrusion_scores_2 = random.choices(intrusion_scores_2, k=len(intrusion_scores_2)) 132 | ratings_human.append(np.mean(intrusion_scores_1)) 133 | ratings_chatGPT.append(np.mean(intrusion_scores_2)) 134 | spearman_wiki.append(spearmanr(ratings_chatGPT[:150], ratings_human[:150]).statistic) 135 | spearman_NYT.append(spearmanr(ratings_chatGPT[150:], ratings_human[150:]).statistic) 136 | spearman_concat.append(spearmanr(ratings_chatGPT, ratings_human).statistic) 137 | print ("wiki", np.mean(spearman_wiki)) 138 | #get_confidence_intervals(spearman_wiki) 139 | print ("NYT", np.mean(spearman_NYT)) 140 | #get_confidence_intervals(spearman_NYT) 141 | print ("concat", np.mean(spearman_concat)) 142 | #get_confidence_intervals(spearman_concat) 143 | 144 | 145 | def compute_spearmanr_bootstrap_intrusion(df_intruder_scores, data, only_confident=False): 146 | ratings_human, ratings_chatGPT = [], [] 147 | 
spearman_wiki = [] 148 | spearman_NYT = [] 149 | spearman_concat = [] 150 | 151 | for _ in tqdm(range(1000), total=1000): 152 | ratings_human, ratings_chatGPT = [], [] 153 | for dataset in ["wikitext", "nytimes"]: 154 | for model in ["mallet", "dvae", "etm"]: 155 | for topic_id in range(50): 156 | intrusion_scores_raw = data[dataset][model]["metrics"]["intrusion_scores_raw"][topic_id] 157 | if only_confident: 158 | intrusion_scores_raw = [i for i,j in zip(intrusion_scores_raw, data[dataset][model]["metrics"]["intrusion_confidences_raw"][topic_id]) if j == 1] 159 | # sample bootstrap fold 160 | 161 | intrusion_scores_raw = random.choices(intrusion_scores_raw, k=len(intrusion_scores_raw)) 162 | df_topic = df_intruder_scores[(df_intruder_scores.dataset_name == dataset) & (df_intruder_scores.model_name == model) & (df_intruder_scores.topic_id == topic_id)] 163 | gpt_ratings = random.choices(df_topic.response_correct.tolist(), k= len(df_topic.response_correct)) 164 | 165 | # save results 166 | ratings_human.append(np.mean(intrusion_scores_raw)) 167 | ratings_chatGPT.append(np.mean(gpt_ratings)) 168 | # compute spearman_R and save results 169 | spearman_wiki.append(spearmanr(ratings_chatGPT[:150], ratings_human[:150]).statistic) 170 | spearman_NYT.append(spearmanr(ratings_chatGPT[150:], ratings_human[150:]).statistic) 171 | spearman_concat.append(spearmanr(ratings_chatGPT, ratings_human).statistic) 172 | print ("wiki", np.mean(spearman_wiki)) 173 | get_confidence_intervals(spearman_wiki) 174 | print ("NYT", np.mean(spearman_NYT)) 175 | get_confidence_intervals(spearman_NYT) 176 | print ("concat", np.mean(spearman_concat)) 177 | get_confidence_intervals(spearman_concat) 178 | 179 | def compute_spearmanr_bootstrap_rating(df_rating_scores, data, only_confident=False): 180 | ratings_human, ratings_chatGPT = [], [] 181 | spearman_wiki = [] 182 | spearman_NYT = [] 183 | spearman_concat = [] 184 | 185 | for _ in tqdm(range(150), total=150): 186 | ratings_human, ratings_chatGPT = [], [] 187 | for dataset in ["wikitext", "nytimes"]: 188 | for model in ["mallet", "dvae", "etm"]: 189 | for topic_id in range(50): 190 | rating_scores_raw = data[dataset][model]["metrics"]["ratings_scores_raw"][topic_id] 191 | if only_confident: 192 | rating_scores_raw = [i for i,j in zip(rating_scores_raw, data[dataset][model]["metrics"]["ratings_confidences_raw"][topic_id]) if j == 1] 193 | # sample bootstrap fold 194 | rating_scores_raw = random.choices(rating_scores_raw, k=len(rating_scores_raw)) 195 | df_topic = df_rating_scores[(df_rating_scores.dataset_name == dataset) & (df_rating_scores.model_name == model) & (df_rating_scores.topic_id == topic_id)] 196 | gpt_ratings = random.choices(df_topic.gpt_ratings.tolist(), k= len(df_topic.gpt_ratings)) 197 | 198 | # save results 199 | ratings_human.append(np.mean(rating_scores_raw)) 200 | ratings_chatGPT.append(np.mean(gpt_ratings)) 201 | # compute spearman_R and save results 202 | spearman_wiki.append(spearmanr(ratings_chatGPT[:150], ratings_human[:150]).statistic) 203 | spearman_NYT.append(spearmanr(ratings_chatGPT[150:], ratings_human[150:]).statistic) 204 | spearman_concat.append(spearmanr(ratings_chatGPT, ratings_human).statistic) 205 | print ("wiki", np.mean(spearman_wiki)) 206 | get_confidence_intervals(spearman_wiki) 207 | print ("NYT", np.mean(spearman_NYT)) 208 | get_confidence_intervals(spearman_NYT) 209 | print ("concat", np.mean(spearman_concat)) 210 | get_confidence_intervals(spearman_concat) 211 | 212 | 213 | 214 | 215 | if __name__ == "__main__": 216 | parser = 
argparse.ArgumentParser() 217 | parser.add_argument("--task", default="ratings", type=str) 218 | parser.add_argument("--filename", default="coherence-outputs-section-2/ratings_outfile_with_dataset_description.jsonl", type=str, help="whether to include a dataset description or not, default = include.") 219 | parser.add_argument("--only_confident", default="true", type=str) 220 | args = parser.parse_args() 221 | 222 | if args.only_confident == "true": 223 | only_confident = True 224 | else: 225 | only_confident = False 226 | 227 | random.seed(42) 228 | path = "coherence-outputs-section-2" 229 | 230 | experiments = ["human_ceiling", "dataset_description", "dataset_description_only_confident", "no_dataset_description"] 231 | 232 | with open("all_data.json") as f: 233 | data = json.load(f) 234 | 235 | if args.task == "human_ceiling": 236 | compute_human_ceiling_intrusion(data, only_confident=only_confident) 237 | compute_human_ceiling_rating(data, only_confident=only_confident) 238 | elif args.task == "ratings": 239 | df_rating = load_dataframe(args.filename) 240 | compute_spearmanr_bootstrap_rating(df_rating, data, only_confident=only_confident) 241 | 242 | elif args.task == "intrusion": 243 | df_rating = load_dataframe(args.filename) 244 | compute_spearmanr_bootstrap_rating(df_rating, data, only_confident=only_confident) 245 | -------------------------------------------------------------------------------- /src-misc/01-get_data.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/bash 2 | 3 | mkdir -p data 4 | 5 | wget https://raw.githubusercontent.com/ahoho/topics/dev/data/human/all_data/all_data.json -O data/intrusion.json -------------------------------------------------------------------------------- /src-misc/02-process_data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import json\n", 10 | "data = json.load(open(\"../data/intrusion.json\", \"r\"))" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "data[\"wikitext\"][\"mallet\"][\"metrics\"].keys()\n", 20 | "\n", 21 | "[(k, len(v), None if type(v[0]) != list else len(v[0])) for k, v in data[\"nytimes\"][\"mallet\"][\"metrics\"].items()]" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 2, 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/plain": [ 32 | "['species',\n", 33 | " 'birds',\n", 34 | " 'males',\n", 35 | " 'females',\n", 36 | " 'bird',\n", 37 | " 'white',\n", 38 | " 'found',\n", 39 | " 'female',\n", 40 | " 'male',\n", 41 | " 'range',\n", 42 | " 'large',\n", 43 | " 'breeding',\n", 44 | " 'long',\n", 45 | " 'black',\n", 46 | " 'small',\n", 47 | " 'shark',\n", 48 | " 'population',\n", 49 | " 'common',\n", 50 | " 'prey',\n", 51 | " 'eggs']" 52 | ] 53 | }, 54 | "execution_count": 2, 55 | "metadata": {}, 56 | "output_type": "execute_result" 57 | } 58 | ], 59 | "source": [ 60 | "data[\"wikitext\"][\"mallet\"][\"topics\"][1]" 61 | ] 62 | } 63 | ], 64 | "metadata": { 65 | "kernelspec": { 66 | "display_name": "Python 3.10.7 64-bit", 67 | "language": "python", 68 | "name": "python3" 69 | }, 70 | "language_info": { 71 | "codemirror_mode": { 72 | "name": "ipython", 73 | "version": 3 74 | }, 75 | "file_extension": ".py", 76 | "mimetype": "text/x-python", 77 | "name": "python", 78 | 
"nbconvert_exporter": "python", 79 | "pygments_lexer": "ipython3", 80 | "version": "3.10.7" 81 | }, 82 | "orig_nbformat": 4, 83 | "vscode": { 84 | "interpreter": { 85 | "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1" 86 | } 87 | } 88 | }, 89 | "nbformat": 4, 90 | "nbformat_minor": 2 91 | } 92 | -------------------------------------------------------------------------------- /src-misc/02-process_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import json 4 | 5 | data = json.load(open("data/intrusion.json", "r")) 6 | 7 | for topic_scores, topic_words in zip(data["wikitext"]["etm"]["metrics"]["intrusion_scores_raw"], data["wikitext"]["etm"]["topics"]): 8 | print(len(topic_scores), len(topic_words)) 9 | group_a = [w for s, w in zip(topic_scores, topic_words) if s == 0] 10 | group_b = [w for s, w in zip(topic_scores, topic_words) if s == 1] 11 | print(group_a, group_b) -------------------------------------------------------------------------------- /src-misc/03-figures_nmpi_llm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import matplotlib.pyplot as plt 4 | import pandas as pd 5 | import os 6 | from tqdm import tqdm 7 | import numpy as np 8 | from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score, completeness_score, homogeneity_score 9 | from scipy.stats import spearmanr 10 | from collections import defaultdict 11 | import fig_utils 12 | from matplotlib.ticker import FormatStrFormatter 13 | import argparse 14 | 15 | args = argparse.ArgumentParser() 16 | args.add_argument("--experiment", default="wikitext_specific") 17 | args = args.parse_args() 18 | 19 | os.makedirs("computed/figures", exist_ok=True) 20 | 21 | N_CLUSTER = range(20, 420, 20) 22 | 23 | 24 | def moving_average(a, window_size=3): 25 | if window_size == 0: 26 | return a 27 | out = [] 28 | for i in range(len(a)): 29 | start = max(0, i - window_size) 30 | out.append(np.mean(a[start:i + 1])) 31 | return out 32 | 33 | 34 | def compute_adjusted_NMI_bills(): 35 | df_metadata = pd.read_json( 36 | "data/bills/processed/labeled/vocab_15k/train.metadata.jsonl", 37 | lines=True 38 | ) 39 | print(df_metadata) 40 | paths = [ 41 | "runs/outputs/k_selection/" + "bills" + 42 | "-labeled/vocab_15k/k-" + str(i) 43 | for i in N_CLUSTER 44 | ] 45 | topic = df_metadata.topic.tolist() 46 | cluster_metrics = defaultdict(list) 47 | 48 | for path, num_topics in tqdm(zip(paths, N_CLUSTER), total=20): 49 | path = os.path.join(path, "2972") 50 | beta = np.load(os.path.join(path, "beta.npy")) 51 | theta = np.load(os.path.join(path, "train.theta.npy")) 52 | argmax_theta = theta.argmax(axis=-1) 53 | cluster_metrics["ami"].append( 54 | adjusted_mutual_info_score(topic, argmax_theta) 55 | ) 56 | cluster_metrics["ari"].append( 57 | adjusted_rand_score(topic, argmax_theta) 58 | ) 59 | cluster_metrics["completeness"].append( 60 | completeness_score(topic, argmax_theta) 61 | ) 62 | cluster_metrics["homogeneity"].append( 63 | homogeneity_score(topic, argmax_theta) 64 | ) 65 | return cluster_metrics 66 | 67 | 68 | def compute_adjusted_NMI_wikitext_old(broad_categories=True): 69 | df_metadata = pd.read_json( 70 | "data/wikitext/processed/labeled/vocab_15k/train.metadata.jsonl", 71 | lines=True 72 | ) 73 | print(df_metadata) 74 | paths = [ 75 | "runs/outputs/k_selection/" + "wikitext" + 76 | "-labeled/vocab_15k/k-" + str(i) 77 | for i in N_CLUSTER 78 | ] 79 | 80 | if 
broad_categories: 81 | topic = df_metadata.category.tolist() 82 | else: 83 | topic = df_metadata.subcategory.tolist() 84 | 85 | cluster_metrics = defaultdict(list) 86 | 87 | for path, num_topics in tqdm(zip(paths, N_CLUSTER), total=20): 88 | path = os.path.join(path, "2972") 89 | beta = np.load(os.path.join(path, "beta.npy")) 90 | theta = np.load(os.path.join(path, "train.theta.npy")) 91 | argmax_theta = theta.argmax(axis=-1) 92 | cluster_metrics["ami"].append( 93 | adjusted_mutual_info_score(topic, argmax_theta) 94 | ) 95 | cluster_metrics["ari"].append(adjusted_rand_score(topic, argmax_theta)) 96 | cluster_metrics["completeness"].append( 97 | completeness_score(topic, argmax_theta) 98 | ) 99 | cluster_metrics["homogeneity"].append( 100 | homogeneity_score(topic, argmax_theta) 101 | ) 102 | return cluster_metrics 103 | 104 | def compute_adjusted_NMI_wikitext(broad_categories=True): 105 | df_metadata = pd.read_json("data/wikitext/processed/labeled/vocab_15k/train.metadata.jsonl", lines=True) 106 | print (df_metadata) 107 | paths = ["runs/outputs/k_selection/" + "wikitext" + "-labeled/vocab_15k/k-" + str(i) for i in range(20, 420, 20)] 108 | 109 | # re-do columns: 110 | value_counts = df_metadata.category.value_counts().rename_axis("topic").reset_index(name="counts") 111 | value_counts = {i:j for i,j in zip(value_counts.topic, value_counts.counts) if j > 25} 112 | df_metadata["subtopic"] = ["other" if i in value_counts else i for i in df_metadata.category] 113 | 114 | value_counts = df_metadata.subcategory.value_counts().rename_axis("topic").reset_index(name="counts") 115 | value_counts = {i:j for i,j in zip(value_counts.topic, value_counts.counts) if j > 25} 116 | df_metadata["subtopic"] = ["other" if i in value_counts else i for i in df_metadata.subcategory] 117 | 118 | if broad_categories: 119 | topic = df_metadata.category.tolist() 120 | else: 121 | topic = df_metadata.subcategory.tolist() 122 | 123 | cluster_metrics = defaultdict(list) 124 | 125 | for path,num_topics in tqdm(zip(paths, range(20,420,20)), total=20): 126 | path = os.path.join(path, "2972") 127 | beta = np.load(os.path.join(path, "beta.npy")) 128 | theta = np.load(os.path.join(path, "train.theta.npy")) 129 | argmax_theta = theta.argmax(axis=-1) 130 | cluster_metrics["ami"].append(adjusted_mutual_info_score(topic, argmax_theta)) 131 | cluster_metrics["ari"].append(adjusted_rand_score(topic, argmax_theta)) 132 | cluster_metrics["completeness"].append(completeness_score(topic, argmax_theta)) 133 | cluster_metrics["homogeneity"].append(homogeneity_score(topic, argmax_theta)) 134 | return cluster_metrics 135 | 136 | 137 | 138 | 139 | # experiment = "bills_broad" 140 | # experiment = "wikitext_broad" 141 | # experiment = "wikitext_specific" 142 | 143 | if args.experiment == "bills_broad": 144 | dataset = "bills" 145 | cluster_metrics = compute_adjusted_NMI_bills() 146 | paths = [ 147 | "runs/outputs/k_selection/bills-labeled/vocab_15k/k-" + str(i) 148 | for i in N_CLUSTER 149 | ] 150 | df = pd.read_csv("LLM-scores/LLM_outputs_bills_broad.csv") 151 | plt_label = "Adjusted MI" 152 | outfile_name = "n_clusters_bills_dataset.pdf" 153 | plt_title = "BillSum, broad, $\\rho = RHO_CORR$" 154 | left_ylab = True 155 | right_ylab = False 156 | degree = 1 157 | elif args.experiment == "wikitext_broad": 158 | dataset = "wikitext" 159 | cluster_metrics = compute_adjusted_NMI_wikitext() 160 | paths = [ 161 | "runs/outputs/k_selection/wikitext-labeled/vocab_15k/k-" + str(i) 162 | for i in N_CLUSTER 163 | ] 164 | df = 
pd.read_csv("LLM-scores/LLM_outputs_wikitext_broad.csv") 165 | plt_label = "Adjusted MI" 166 | outfile_name = "n_clusters_wikitext_broad.pdf" 167 | plt_title = "Wikitext, broad, $\\rho = RHO_CORR$" 168 | left_ylab = False 169 | right_ylab = False 170 | degree = 1 171 | elif args.experiment == "wikitext_specific": 172 | dataset = "wikitext" 173 | cluster_metrics = compute_adjusted_NMI_wikitext(broad_categories=False) 174 | paths = [ 175 | "runs/outputs/k_selection/wikitext-labeled/vocab_15k/k-" + str(i) 176 | for i in N_CLUSTER 177 | ] 178 | df = pd.read_csv("LLM-scores/LLM_outputs_wikitext_specific.csv") 179 | plt_label = "Adjusted MI" 180 | outfile_name = "n_clusters_wikitext_specific.pdf" 181 | plt_title = "Wikitext, specific, $\\rho = RHO_CORR$" 182 | left_ylab = False 183 | right_ylab = True 184 | degree = 3 185 | 186 | df["gpt_ratings"] = df.chatGPT_eval.astype(int) 187 | average_goodness = [] 188 | # get average gpt_ratings for each k 189 | for path in paths: 190 | path = os.path.join(path, "2972") 191 | df_at_k = df[df.path == path] 192 | average_goodness.append(df_at_k.gpt_ratings.mean()) 193 | 194 | # smooth via moving_average to remove weird outliers 195 | average_goodness = moving_average(average_goodness) 196 | 197 | # if we want to plot spearmanR for different clustering metrics 198 | compute_spearmanR_different_cluster_metrics = False 199 | if compute_spearmanR_different_cluster_metrics: 200 | for key in ["ami", "ari", "completeness", "homogeneity"]: 201 | value = cluster_metrics[key] 202 | statistic = spearmanr(average_goodness, value) 203 | print( 204 | "topics " + key, statistic.statistic.round(3), 205 | statistic.pvalue.round(3) 206 | ) 207 | 208 | # re-shape to compute z-scores (otherwise, the scales are off because LLM scores and clustering metrics are on different scales and the graphs do not look too great 209 | average_goodness = np.array(average_goodness).reshape(-1, 1) 210 | AMI = np.array(cluster_metrics["ami"]).reshape(-1, 1) 211 | 212 | # compute z-scores 213 | # average_goodness = StandardScaler().fit_transform(average_goodness).squeeze() 214 | # AMI = StandardScaler().fit_transform(AMI).squeeze() 215 | average_goodness = np.array(average_goodness).squeeze() 216 | AMI = np.array(AMI).squeeze() 217 | 218 | # plot figures 219 | fig = plt.figure(figsize=(3.5, 2)) 220 | ax1 = plt.gca() 221 | ax2 = ax1.twinx() 222 | SCATTER_STYLE = {"edgecolor": "black", "s": 30} 223 | l1 = ax1.scatter( 224 | N_CLUSTER, 225 | average_goodness, 226 | label="LLM score", 227 | color=fig_utils.COLORS[0], 228 | **SCATTER_STYLE 229 | ) 230 | l2 = ax2.scatter( 231 | N_CLUSTER, 232 | AMI, 233 | label=plt_label, 234 | color=fig_utils.COLORS[1], 235 | **SCATTER_STYLE 236 | ) 237 | 238 | # print to one digit 239 | ax2.yaxis.set_major_formatter(FormatStrFormatter('%.1f')) 240 | 241 | xticks_fine = np.linspace(min(N_CLUSTER), max(N_CLUSTER), 500) 242 | 243 | poly_ag = np.poly1d(np.polyfit(N_CLUSTER, average_goodness, degree)) 244 | ax1.plot(xticks_fine, poly_ag(xticks_fine), '-', color=fig_utils.COLORS[0], zorder=-100) 245 | poly_ami = np.poly1d(np.polyfit(N_CLUSTER, AMI, degree)) 246 | ax2.plot(xticks_fine, poly_ami(xticks_fine), '-', color=fig_utils.COLORS[1], zorder=-100) 247 | 248 | plt.legend( 249 | [l1, l2], [p_.get_label() for p_ in [l1, l2]], 250 | loc="upper right", 251 | handletextpad=-0.2, 252 | labelspacing=0.15, 253 | borderpad=0.15, 254 | borderaxespad=0.1, 255 | ) 256 | if left_ylab: 257 | ax1.set_ylabel("Adjusted MI") 258 | if right_ylab: 259 | ax2.set_ylabel("Averaged LLM 
score") 260 | ax1.set_xlabel("Number of topics") 261 | plt.xticks(N_CLUSTER[::3], N_CLUSTER[::3]) 262 | 263 | 264 | statistic = spearmanr(average_goodness, cluster_metrics["ami"]) 265 | plt.title(plt_title.replace("RHO_CORR", f"{statistic[0]:.2f}")) 266 | 267 | plt.tight_layout(pad=0.2) 268 | plt.savefig("computed/figures/" + outfile_name) 269 | plt.show() 270 | -------------------------------------------------------------------------------- /src-misc/04-launch_figures_nmpi_llm.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/bash 2 | 3 | for EXPERIMENT in "bills_broad" "wikitext_broad" "wikitext_specific"; do 4 | # prevent from showing 5 | DISPLAY="" python3 src_vilem/03-figures_nmpi_llm.py --experiment ${EXPERIMENT} 6 | done -------------------------------------------------------------------------------- /src-misc/05-find_rating_errors.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 34, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stderr", 10 | "output_type": "stream", 11 | "text": [ 12 | "/tmp/ipykernel_32934/435855610.py:6: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.\n", 13 | " df_grouped = df_rating_scores.groupby([\"dataset_name\", \"model_name\", \"topic_id\"])[\"chatGPT_eval\", \"human_eval\"].mean().reset_index()\n" 14 | ] 15 | } 16 | ], 17 | "source": [ 18 | "#!/usr/bin/env python3\n", 19 | "\n", 20 | "import pandas as pd\n", 21 | "\n", 22 | "df_rating_scores = pd.read_json(\"../computed/ratings_outfile_with_logit_bias_old_prompt_5_runs.jsonl\", lines=True)\n", 23 | "df_grouped = df_rating_scores.groupby([\"dataset_name\", \"model_name\", \"topic_id\"])[\"chatGPT_eval\", \"human_eval\"].mean().reset_index()\n", 24 | "df_grouped[\"diff_abs\"] = (df_grouped[\"chatGPT_eval\"]-df_grouped[\"human_eval\"]).abs()\n", 25 | "df_grouped[\"diff\"] = (df_grouped[\"chatGPT_eval\"]-df_grouped[\"human_eval\"])" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 36, 31 | "metadata": {}, 32 | "outputs": [ 33 | { 34 | "data": { 35 | "text/html": [ 36 | "
\n", 37 | "\n", 50 | "\n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | "
dataset_namemodel_nametopic_idchatGPT_evalhuman_evaldiff_absdiff
299wikitextmallet492.62.60.00.0
157wikitextdvae72.82.80.00.0
16nytimesdvae162.02.00.00.0
110nytimesmallet102.02.00.00.0
101nytimesmallet13.03.00.00.0
\n", 116 | "
" 117 | ], 118 | "text/plain": [ 119 | " dataset_name model_name topic_id chatGPT_eval human_eval diff_abs \\\n", 120 | "299 wikitext mallet 49 2.6 2.6 0.0 \n", 121 | "157 wikitext dvae 7 2.8 2.8 0.0 \n", 122 | "16 nytimes dvae 16 2.0 2.0 0.0 \n", 123 | "110 nytimes mallet 10 2.0 2.0 0.0 \n", 124 | "101 nytimes mallet 1 3.0 3.0 0.0 \n", 125 | "\n", 126 | " diff \n", 127 | "299 0.0 \n", 128 | "157 0.0 \n", 129 | "16 0.0 \n", 130 | "110 0.0 \n", 131 | "101 0.0 " 132 | ] 133 | }, 134 | "execution_count": 36, 135 | "metadata": {}, 136 | "output_type": "execute_result" 137 | } 138 | ], 139 | "source": [ 140 | "df_grouped.sort_values(by=\"diff_abs\")[:5]\n" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 39, 146 | "metadata": {}, 147 | "outputs": [ 148 | { 149 | "data": { 150 | "text/html": [ 151 | "
\n", 152 | "\n", 165 | "\n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | "
dataset_namemodel_nametopic_idchatGPT_evalhuman_evaldiff_absdiff
58nytimesetm81.42.8666671.466667-1.466667
41nytimesdvae411.02.5333331.533333-1.533333
28nytimesdvae281.22.7333331.533333-1.533333
34nytimesdvae341.02.6666671.666667-1.666667
216wikitextetm161.02.9333331.933333-1.933333
\n", 231 | "
" 232 | ], 233 | "text/plain": [ 234 | " dataset_name model_name topic_id chatGPT_eval human_eval diff_abs \\\n", 235 | "58 nytimes etm 8 1.4 2.866667 1.466667 \n", 236 | "41 nytimes dvae 41 1.0 2.533333 1.533333 \n", 237 | "28 nytimes dvae 28 1.2 2.733333 1.533333 \n", 238 | "34 nytimes dvae 34 1.0 2.666667 1.666667 \n", 239 | "216 wikitext etm 16 1.0 2.933333 1.933333 \n", 240 | "\n", 241 | " diff \n", 242 | "58 -1.466667 \n", 243 | "41 -1.533333 \n", 244 | "28 -1.533333 \n", 245 | "34 -1.666667 \n", 246 | "216 -1.933333 " 247 | ] 248 | }, 249 | "execution_count": 39, 250 | "metadata": {}, 251 | "output_type": "execute_result" 252 | } 253 | ], 254 | "source": [ 255 | "df_grouped.sort_values(by=\"diff_abs\")[-5:]" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 38, 261 | "metadata": {}, 262 | "outputs": [ 263 | { 264 | "data": { 265 | "text/html": [ 266 | "
\n", 267 | "\n", 280 | "\n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | "
dataset_namemodel_nametopic_idchatGPT_evalhuman_evaldiff_absdiff
200wikitextetm02.01.6000000.4000000.400000
210wikitextetm101.81.3333330.4666670.466667
248wikitextetm483.02.5333330.4666670.466667
207wikitextetm71.81.2666670.5333330.533333
242wikitextetm421.81.1333330.6666670.666667
\n", 346 | "
" 347 | ], 348 | "text/plain": [ 349 | " dataset_name model_name topic_id chatGPT_eval human_eval diff_abs \\\n", 350 | "200 wikitext etm 0 2.0 1.600000 0.400000 \n", 351 | "210 wikitext etm 10 1.8 1.333333 0.466667 \n", 352 | "248 wikitext etm 48 3.0 2.533333 0.466667 \n", 353 | "207 wikitext etm 7 1.8 1.266667 0.533333 \n", 354 | "242 wikitext etm 42 1.8 1.133333 0.666667 \n", 355 | "\n", 356 | " diff \n", 357 | "200 0.400000 \n", 358 | "210 0.466667 \n", 359 | "248 0.466667 \n", 360 | "207 0.533333 \n", 361 | "242 0.666667 " 362 | ] 363 | }, 364 | "execution_count": 38, 365 | "metadata": {}, 366 | "output_type": "execute_result" 367 | } 368 | ], 369 | "source": [ 370 | "df_grouped.sort_values(by=\"diff\")[-5:]" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": 40, 376 | "metadata": {}, 377 | "outputs": [ 378 | { 379 | "data": { 380 | "text/plain": [ 381 | "-0.6393333333333333" 382 | ] 383 | }, 384 | "execution_count": 40, 385 | "metadata": {}, 386 | "output_type": "execute_result" 387 | } 388 | ], 389 | "source": [ 390 | "df_grouped[\"diff\"].mean()" 391 | ] 392 | } 393 | ], 394 | "metadata": { 395 | "kernelspec": { 396 | "display_name": "Python 3.10.7 64-bit", 397 | "language": "python", 398 | "name": "python3" 399 | }, 400 | "language_info": { 401 | "codemirror_mode": { 402 | "name": "ipython", 403 | "version": 3 404 | }, 405 | "file_extension": ".py", 406 | "mimetype": "text/x-python", 407 | "name": "python", 408 | "nbconvert_exporter": "python", 409 | "pygments_lexer": "ipython3", 410 | "version": "3.10.7" 411 | }, 412 | "orig_nbformat": 4, 413 | "vscode": { 414 | "interpreter": { 415 | "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1" 416 | } 417 | } 418 | }, 419 | "nbformat": 4, 420 | "nbformat_minor": 2 421 | } 422 | -------------------------------------------------------------------------------- /src-misc/06-figures_ari_llm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import matplotlib.pyplot as plt 4 | import os 5 | from tqdm import tqdm 6 | import numpy as np 7 | from collections import defaultdict 8 | from scipy.stats import spearmanr 9 | import fig_utils 10 | from matplotlib.ticker import FormatStrFormatter 11 | import csv 12 | import argparse 13 | 14 | args = argparse.ArgumentParser() 15 | args.add_argument("--experiment", default="wikitext_specific") 16 | args = args.parse_args() 17 | 18 | os.makedirs("computed/figures", exist_ok=True) 19 | 20 | N_CLUSTER = range(20, 420, 20) 21 | 22 | data = list(csv.DictReader( 23 | open(f"repo-old/LLM-scores-3/{args.experiment}_dataframe_all_results.csv", "r"))) 24 | 25 | # plot figures 26 | fig = plt.figure(figsize=(3.5, 2)) 27 | ax1 = plt.gca() 28 | ax2 = ax1.twinx() 29 | ax3 = ax1.twinx() 30 | if args.experiment == "bills_broad": 31 | dataset = "bills" 32 | outfile_name = "n_clusters_bills_broad.pdf" 33 | plt_title = "BillSum, broad, $\\rho_D = RHO_CORR1$ $\\rho_W = RHO_CORR2$ " 34 | left_ylab = True 35 | show_legend = True 36 | degree = 4 37 | 38 | ax3.tick_params(axis='y', colors="#3d518c") 39 | ax3.yaxis.set_label_position('left') 40 | ax3.yaxis.set_ticks_position('left') 41 | ax3.set_ylabel("Document LLM", color="#3d518c") 42 | ax1.set_yticks([]) 43 | ax2.set_axis_off() 44 | elif args.experiment == "wikitext_broad": 45 | dataset = "wikitext" 46 | outfile_name = "n_clusters_wikitext_broad.pdf" 47 | plt_title = "Wikitext, broad, $\\rho_D = RHO_CORR1$ $\\rho_W = RHO_CORR2$" 48 | left_ylab = False 49 | 
show_legend = False 50 | degree = 4 51 | ax1.set_yticks([]) 52 | ax2.set_axis_off() 53 | ax3.set_yticks([]) 54 | ax1.set_ylabel("|", color="white") 55 | ax3.set_ylabel("|", color="white") 56 | elif args.experiment == "wikitext_specific": 57 | dataset = "wikitext" 58 | outfile_name = "n_clusters_wikitext_specific.pdf" 59 | plt_title = " Wikitext, specific, $\\rho_D = RHO_CORR1$ $\\rho_W = RHO_CORR2$" 60 | left_ylab = False 61 | show_legend = False 62 | degree = 4 63 | # ax1.set_yticks([]) 64 | # ax3.tick_params(axis='y', colors="#3d518c") 65 | # ax2.set_axis_off() 66 | # ax3.set_axis_off() 67 | 68 | ax1.tick_params(axis='y', colors="#933d35") 69 | ax1.yaxis.set_label_position('right') 70 | ax1.yaxis.set_ticks_position('right') 71 | ax1.set_ylabel("ARI", color="#933d35") 72 | ax3.set_axis_off() 73 | ax2.set_axis_off() 74 | 75 | data_llm_word = [float(x["LLM Scores Wordset"]) for x in data] 76 | data_llm_doc = [float(x["LLM Scores Documentset"]) for x in data] 77 | data_ari = [float(x["ARI"]) for x in data] 78 | 79 | SCATTER_STYLE = {"edgecolor": "black", "s": 30} 80 | l1 = ax1.scatter( 81 | N_CLUSTER, 82 | data_ari, 83 | label="ARI", 84 | color=fig_utils.COLORS[1], 85 | **SCATTER_STYLE 86 | ) 87 | l2 = ax2.scatter( 88 | N_CLUSTER, 89 | data_llm_word, 90 | label="Word LLM", 91 | color=fig_utils.COLORS[0], 92 | **SCATTER_STYLE 93 | ) 94 | l3 = ax3.scatter( 95 | N_CLUSTER, 96 | data_llm_doc, 97 | label="Doc LLM", 98 | color=fig_utils.COLORS[4], 99 | **SCATTER_STYLE 100 | ) 101 | 102 | # ax1.axes.get_yaxis().set_visible(False) 103 | # print to one digit 104 | # ax1.yaxis.set_major_formatter(FormatStrFormatter('%.0f')) 105 | 106 | xticks_fine = np.linspace(min(N_CLUSTER), max(N_CLUSTER), 500) 107 | 108 | poly_ami = np.poly1d(np.polyfit(N_CLUSTER, data_ari, degree)) 109 | ax1.plot( 110 | xticks_fine, poly_ami(xticks_fine), '-', color=fig_utils.COLORS[1], zorder=-100 111 | ) 112 | poly_llm_word = np.poly1d(np.polyfit(N_CLUSTER, data_llm_word, degree)) 113 | ax2.plot( 114 | xticks_fine, poly_llm_word(xticks_fine), '-', color=fig_utils.COLORS[0], zorder=-100 115 | ) 116 | poly_llm_doc = np.poly1d(np.polyfit(N_CLUSTER, data_llm_doc, degree)) 117 | ax3.plot( 118 | xticks_fine, poly_llm_doc(xticks_fine), '-', color=fig_utils.COLORS[4], zorder=-100 119 | ) 120 | 121 | if show_legend: 122 | lhandles = [l2, l3, l1] 123 | plt.legend( 124 | lhandles, [p_.get_label() for p_ in lhandles], 125 | loc="upper right", 126 | handletextpad=-0.2, 127 | labelspacing=0.1, 128 | borderpad=0.2, 129 | borderaxespad=0.2, 130 | handlelength=1.5, 131 | columnspacing=0.8, 132 | ncols=2, 133 | edgecolor="black", 134 | facecolor="#dddddd" 135 | ) 136 | # if left_ylab: 137 | # ax1.set_ylabel("Metric Scores") 138 | # else: 139 | # ax1.set_ylabel(" ") 140 | 141 | ax1.set_xlabel("Number of topics") 142 | plt.xticks(N_CLUSTER[::3], N_CLUSTER[::3]) 143 | 144 | 145 | statistic_doc = spearmanr(data_llm_doc, data_ari) 146 | statistic_word = spearmanr(data_llm_word, data_ari) 147 | # statistic = spearmanr(data_llm_doc, data_llm_word) 148 | plt.title( 149 | plt_title 150 | .replace("RHO_CORR1", f"{statistic_doc[0]:.2f}") 151 | .replace("RHO_CORR2", f"{statistic_word[0]:.2f}") 152 | ) 153 | 154 | plt.tight_layout(pad=0.1) 155 | plt.savefig("computed/figures/" + outfile_name) 156 | plt.show() 157 | -------------------------------------------------------------------------------- /src-misc/07-pairwise_scores.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": 
"code", 5 | "execution_count": 5, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "Intruder is black (5.20)\n", 13 | "Coherence is 6.27\n" 14 | ] 15 | } 16 | ], 17 | "source": [ 18 | "import random\n", 19 | "import numpy as np\n", 20 | "random.seed(0)\n", 21 | "\n", 22 | "dataset = [\n", 23 | " [\"red\", \"blue\", \"flag\", \"black\", \"sky\", \"sun\"]\n", 24 | "]\n", 25 | "\n", 26 | "\n", 27 | "def get_pairwise_similarity(word1, word2):\n", 28 | " prompt = f'On a scale from 1 to 10, how similar are \"{word1}\" and \"{word2}\"? Your answer should be only the number and nothing else.'\n", 29 | " # TODO: GPT prompt\n", 30 | " return random.randint(1, 10)\n", 31 | "\n", 32 | "\n", 33 | "for words_line in dataset:\n", 34 | " # create |W|*|W| list\n", 35 | " results_line = [[None] * len(words_line) for _ in words_line]\n", 36 | " for word1_i, word1 in enumerate(words_line):\n", 37 | " for word2_i, word2 in enumerate(words_line):\n", 38 | " # save up half the prompts\n", 39 | " if word2_i <= word1_i:\n", 40 | " continue\n", 41 | " similarity = get_pairwise_similarity(word1, word2)\n", 42 | " results_line[word1_i][word2_i] = similarity\n", 43 | " results_line[word2_i][word1_i] = similarity\n", 44 | "\n", 45 | " # remove None (on diagonal)\n", 46 | " results_line = [\n", 47 | " [x for x in similarities if x]\n", 48 | " for similarities in results_line\n", 49 | " ]\n", 50 | "\n", 51 | " per_word_avg = [\n", 52 | " np.average(similarities)\n", 53 | " for similarities in results_line\n", 54 | " ]\n", 55 | " word_intruder_i = np.argmin(per_word_avg)\n", 56 | " coherence = np.average(per_word_avg)\n", 57 | "\n", 58 | " print(f\"Intruder is {words_line[word_intruder_i]} ({min(per_word_avg):.2f})\")\n", 59 | " print(f\"Coherence is {coherence:.2f}\")\n" 60 | ] 61 | } 62 | ], 63 | "metadata": { 64 | "kernelspec": { 65 | "display_name": "Python 3.10.7 64-bit", 66 | "language": "python", 67 | "name": "python3" 68 | }, 69 | "language_info": { 70 | "codemirror_mode": { 71 | "name": "ipython", 72 | "version": 3 73 | }, 74 | "file_extension": ".py", 75 | "mimetype": "text/x-python", 76 | "name": "python", 77 | "nbconvert_exporter": "python", 78 | "pygments_lexer": "ipython3", 79 | "version": "3.10.7" 80 | }, 81 | "orig_nbformat": 4, 82 | "vscode": { 83 | "interpreter": { 84 | "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1" 85 | } 86 | } 87 | }, 88 | "nbformat": 4, 89 | "nbformat_minor": 2 90 | } 91 | -------------------------------------------------------------------------------- /src-misc/fig_utils.py: -------------------------------------------------------------------------------- 1 | import matplotlib.style 2 | import matplotlib as mpl 3 | from cycler import cycler 4 | 5 | FONT_MONOSPACE = {'fontname':'monospace'} 6 | MARKERS = "o^s*DP1" 7 | COLORS = [ 8 | "#b7423c", 9 | "#71a548", 10 | "salmon", 11 | "darkseagreen", 12 | "cornflowerblue", 13 | "orange", 14 | "seagreen", 15 | "dimgray", 16 | "purple", 17 | ] 18 | 19 | mpl.rcParams['axes.prop_cycle'] = cycler(color=COLORS) 20 | mpl.rcParams['lines.linewidth'] = 2 21 | mpl.rcParams['lines.markersize'] = 7 22 | mpl.rcParams['axes.linewidth'] = 1.5 23 | mpl.rcParams['font.family'] = "serif" 24 | 25 | METRIC_PRETTY_NAME = { 26 | "bleu": "BLEU", 27 | "ter": "TER", 28 | "meteor": "METEOR", 29 | "chrf": "ChrF", 30 | "comet": "COMET", 31 | "bleurt": "BLEURT" 32 | } 33 | 34 | COLORS_EXTRA = ["#9c2963", "#fb9e07"] 
-------------------------------------------------------------------------------- /src-number-of-topics/LLM_scores_and_ARI.py: --------------------------------------------------------------------------------
1 | import matplotlib.pyplot as plt
2 | import pandas as pd
3 | import re, os, json, random
4 | from tqdm import tqdm
5 | import numpy as np
6 | from sklearn.metrics import normalized_mutual_info_score, adjusted_mutual_info_score, adjusted_rand_score, completeness_score, homogeneity_score
7 | from sklearn.preprocessing import StandardScaler
8 | from scipy.stats import spearmanr
9 | from collections import defaultdict
10 | import argparse
11 | from collections import Counter
12 | 
13 | def moving_average(a, window_size=3):
14 |     if window_size == 0:
15 |         return a
16 |     out = []
17 |     for i in range(len(a)):
18 |         start = max(0, i - window_size)
19 |         out.append(np.mean(a[start:i + 1]))
20 |     return out
21 | 
22 | def postprocess_labels(df):
23 |     import spacy
24 |     nlp = spacy.load("en_core_web_sm")
25 |     if "response" not in df.columns:
26 |         df["response"] = df.chatGPT_eval.tolist()
27 |     out = []
28 | 
29 |     for i, text in enumerate(df.response):
30 |         text = text.lower()
31 |         text = re.findall("[a-z ]+", text)[0] # keep only the first run of letters and spaces (drops digits, punctuation, and anything after them)
32 |         text = " ".join([w.lemma_ for w in nlp(text)])
33 |         out.append(text)
34 |     df["response"] = out
35 |     return df
36 | 
37 | def compute_ARI(args):
38 |     df_metadata = pd.read_json(os.path.join("data", args.dataset, "processed/labeled/vocab_15k/train.metadata.jsonl"), lines=True)
39 |     paths = ["runs/outputs/k_selection/" + args.dataset + "-labeled/vocab_15k/k-" + str(i) for i in range(20, 420, 20)]
40 | 
41 |     if args.dataset == "bills":
42 |         topic = df_metadata.topic.tolist()
43 |     else:
44 |         if args.label_categories == "broad":
45 |             topic = df_metadata.category.tolist()
46 |         else:
47 |             topic = df_metadata.subcategory.tolist()
48 | 
49 |     cluster_metrics = defaultdict(list)
50 |     for path,num_topics in tqdm(zip(paths, range(20,420,20)), total=20):
51 |         path = os.path.join(path, "2972")
52 |         beta = np.load(os.path.join(path, "beta.npy"))
53 |         theta = np.load(os.path.join(path, "train.theta.npy"))
54 |         argmax_theta = theta.argmax(axis=-1)
55 |         cluster_metrics["ami"].append(adjusted_mutual_info_score(topic, argmax_theta))
56 |         cluster_metrics["ari"].append(adjusted_rand_score(topic, argmax_theta))
57 |         cluster_metrics["completeness"].append(completeness_score(topic, argmax_theta))
58 |         cluster_metrics["homogeneity"].append(homogeneity_score(topic, argmax_theta))
59 |     return cluster_metrics
60 | 
61 | if __name__ == "__main__":
62 |     parser = argparse.ArgumentParser()
63 |     parser.add_argument("--dataset", default="wikitext", type=str, help="dataset (wikitext or bills)")
64 |     parser.add_argument("--label_categories", default="broad", type=str, help="granularity of ground-truth labels (part of the prompt): broad or specific.")
65 |     parser.add_argument("--filename", default="number-of-topics-section-4/document_label_assignment_wikitext_broad.jsonl", type=str, help="filename with LLM responses")
66 |     parser.add_argument("--method", default="label_assignment", type=str, help="if we use topic word set ratings or document label assignment (label_assignment | topic_ratings)")
67 | 
68 |     args = parser.parse_args()
69 | 
70 |     cluster_metrics = compute_ARI(args)
71 |     paths = ["../runs/outputs/k_selection/" + args.dataset + "-labeled/vocab_15k/k-" + str(i) for i in range(20, 420, 20)]
72 |     df = pd.read_json(args.filename, lines=True)
73 |     outfile_name = "n_clusters_" + args.dataset + "_" + args.label_categories + ".png"
74 |     plt_label = "LLM Scores and ARI"
75 | 
76 | 
77 |     average_goodness = []
78 |     if args.method == "topic_ratings":
79 |         df["gpt_ratings"] = df.response.astype(int)
80 |         # get average gpt_ratings for each k
81 |         for path in paths:
82 |             path = os.path.join(path, "2972")
83 |             df_at_k = df[df.path == path]
84 |             average_goodness.append(df_at_k.gpt_ratings.mean())
85 |     elif args.method == "label_assignment":
86 |         df = postprocess_labels(df)
87 |         # get average purity for each k
88 |         for path in paths:
89 |             path = os.path.join(path, "2972")
90 |             df_at_k = df[df.path == path]
91 |             purities = []
92 |             for topic in df_at_k.topic.unique():
93 |                 df_topic = df_at_k[df_at_k.topic == topic]
94 |                 labels = df_topic.response.tolist()
95 |                 most_common, num_most_common = Counter(labels).most_common(1)[0]
96 |                 purity = num_most_common / len(labels) # fraction of top documents sharing the majority label
97 |                 purities.append(purity)
98 |             average_goodness.append(np.mean(purities))
99 | 
100 |     average_goodness = moving_average(average_goodness) # smooth with a moving average to damp outliers
101 |     ARI = cluster_metrics["ari"]
102 | 
103 |     fig = plt.figure()
104 |     ax = fig.add_subplot(111)
105 | 
106 |     ax1 = plt.subplot()
107 |     l1, = ax1.plot(average_goodness, color='tab:red')
108 |     ax2 = ax1.twinx()
109 |     l2, = ax2.plot(ARI, color='tab:blue')
110 | 
111 |     spearman_rho = spearmanr(average_goodness, ARI).statistic
112 |     print (spearman_rho)
113 | 
114 |     plt.legend([l1, l2], ["LLM Score", "ARI"])
115 |     ax.set_xlabel("Number of Topics")
116 | 
117 |     n_clusters = list(range(20, 420, 20))
118 |     plt.xticks(range(len(n_clusters)), n_clusters, rotation=45)
119 | 
120 |     if len(n_clusters) > 12:
121 |         every_nth = len(n_clusters) // 8
122 |         for n, label in enumerate(ax.xaxis.get_ticklabels()):
123 |             if n % every_nth != 0:
124 |                 label.set_visible(False)
125 |     fig_title = "LLM Scores and ARI, " + args.dataset + ", " + args.label_categories + ", $\\rho = " + str(round(spearman_rho, 2)) + "$"
126 |     plt.title(fig_title)
127 |     plt.savefig(outfile_name)
128 | 
-------------------------------------------------------------------------------- /src-number-of-topics/chatGPT_document_label_assignment.py: --------------------------------------------------------------------------------
1 | import os
2 | import json
3 | import numpy as np
4 | import sys
5 | import random
6 | import openai
7 | from tqdm import tqdm
8 | import pandas as pd
9 | import time
10 | import re
11 | import argparse
12 | 
13 | def get_system_prompt(args):
14 |     if args.dataset == "bills" and args.label_categories == "broad":
15 |         system_prompt = """You are a helpful research assistant with lots of knowledge about topic models. You are given a document assigned to a topic by a topic model. Annotate the document with a broad label, for example "health", "public lands", "domestic commerce", "government operations" and "defense".
16 | 
17 | Reply with a single word or phrase, indicating the label of the document."""
18 | 
19 |     elif args.dataset == "wikitext" and args.label_categories == "broad":
20 |         system_prompt = """You are a helpful research assistant with lots of knowledge about topic models. You are given a document assigned to a topic by a topic model. Annotate the document with a broad label, for example "television", "songs", "transport", "warships and naval units", and "biology and medicine".
21 | 
22 | Reply with a single word or phrase, indicating the label of the document."""
23 | 
24 |     elif args.dataset == "wikitext" and args.label_categories == "specific":
25 |         system_prompt = """You are a helpful research assistant with lots of knowledge about topic models. You are given a document assigned to a topic by a topic model. Annotate the document with a specific label, for example "tropical cyclones: atlantic", "actors, directors, models, performers, and celebrities", "road infrastructure: midwestern united states", "armies and military units", and "warships of germany".
26 | 
27 | Reply with a single word or phrase, indicating the label of the document."""
28 |     else:
29 |         print ("experiment not implemented")
30 |         sys.exit(0)
31 |     return system_prompt
32 | 
33 | 
34 | 
35 | # TODO: a logit bias could be added here to constrain the label responses (cf. chatGPT_ratings_assignment.py)
36 | if __name__ == "__main__":
37 | 
38 |     parser = argparse.ArgumentParser()
39 |     parser.add_argument("--API_KEY", default="openai API key", type=str, required=True, help="valid openAI API key")
40 |     parser.add_argument("--dataset", default="wikitext", type=str, help="dataset (wikitext or bills)")
41 |     parser.add_argument("--label_categories", default="broad", type=str, help="granularity of ground-truth labels (part of the prompt): broad or specific.")
42 |     args = parser.parse_args()
43 | 
44 |     random.seed(42)
45 |     system_prompt = get_system_prompt(args)
46 |     openai.api_key = args.API_KEY
47 | 
48 | 
49 |     if args.dataset == "bills":
50 |         column = "summary"
51 |     else:
52 |         column = "text"
53 | 
54 |     with open(os.path.join("data", args.dataset, "processed/labeled/vocab_15k/vocab.json")) as f:
55 |         vocab = json.load(f)
56 | 
57 |     vocab = {j:i for i,j in vocab.items()}
58 | 
59 |     # load metadata
60 |     df_metadata = pd.read_json(os.path.join("data", args.dataset, "processed/labeled/vocab_15k/train.metadata.jsonl"), lines=True)
61 | 
62 |     paths = ["runs/outputs/k_selection/" + args.dataset + "-labeled/vocab_15k/k-" + str(i) for i in range(20, 420, 20)]
63 | 
64 |     output_path = "number-of-topics-section-4"
65 |     os.makedirs(output_path, exist_ok=True)
66 | 
67 |     with open(os.path.join(output_path, "document_label_assignment_" + args.dataset + "_" + args.label_categories + ".jsonl"), "w") as outfile:
68 |         for path in tqdm(paths):
69 |             path = os.path.join(path, "2972")
70 |             beta = np.load(os.path.join(path, "beta.npy"))
71 |             theta = np.load(os.path.join(path, "train.theta.npy")).T # transpose
72 | 
73 |             print (beta.shape) # e.g. (20, 15'000): each row is a probability distribution over the vocabulary
74 |             print (theta.shape) # e.g. (20, 32'661): after the transpose, each row holds one topic's weight for every training document
75 | num_topics = beta.shape[0] 76 | 77 | # sample some topics 78 | sampled_topics = random.sample(list(range(num_topics)), 5) 79 | 80 | # for each topic 81 | for topic in sampled_topics: 82 | # sample top documents 83 | num_topics = 0 84 | user_prompt = "" 85 | arg_indices = np.argsort(theta[topic])[::-1][:10] 86 | 87 | for k, index in enumerate(arg_indices): 88 | # get text of this document 89 | text = df_metadata[column].iloc[index] 90 | text = " ".join(text.split()[:50]) # only take first 50 words 91 | user_prompt = text 92 | print (system_prompt) 93 | print (user_prompt) 94 | response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}], temperature=0.0, max_tokens=20)["choices"][0]["message"]["content"].strip() 95 | print ("topic", topic, "response --", response) 96 | out = {"path": path, "user_prompt": user_prompt, "response": response, "topic": topic, "k":k} 97 | json.dump(out, outfile) 98 | outfile.write("\n") 99 | time.sleep(0.1) 100 | -------------------------------------------------------------------------------- /src-number-of-topics/chatGPT_ratings_assignment.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import numpy as np 4 | import sys 5 | import random 6 | import openai 7 | from tqdm import tqdm 8 | import pandas as pd 9 | import argparse 10 | import time 11 | 12 | 13 | def get_system_prompt(args): 14 | if args.dataset == "bills" and args.label_categories == "broad": 15 | system_prompt = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related). 16 | The topic modeling is based on a legislative Bill summary dataset. We are interested in coherent broad topics. Typical topics in the dataset include "Health", "Public Lands", "Domestic Commerce", "Government Operations", or "Defense". 17 | Reply with a single number, indicating the overall appropriateness of the topic.""" 18 | 19 | elif args.dataset == "wikitext" and args.label_categories == "broad": 20 | system_prompt = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related). 21 | The topic modeling is based on the Wikipedia corpus. Wikipedia is an online encyclopedia covering a huge range of topics. Typical topics in the dataset include "television", "songs", "transport", "warships and naval units", and "biology and medicine". 22 | Reply with a single number, indicating the overall appropriateness of the topic.""" 23 | 24 | elif args.dataset == "wikitext" and args.label_categories == "specific": 25 | system_prompt = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related). 26 | The topic modeling is based on the Wikipedia corpus. Wikipedia is an online encyclopedia covering a huge range of topics. 
Typical topics in the dataset include "tropical cyclones: atlantic", "actors, directors, models, performers, and celebrities", "road infrastructure: midwestern united states", "armies and military units", and "warships of germany". 27 | Reply with a single number, indicating the overall appropriateness of the topic.""" 28 | else: 29 | print ("experiment not implemented") 30 | sys.exit(0) 31 | return system_prompt 32 | 33 | # add logit bias 34 | if __name__ == "__main__": 35 | parser = argparse.ArgumentParser() 36 | parser.add_argument("--API_KEY", default="openai API key", type=str, required=True, help="valid openAI API key") 37 | parser.add_argument("--dataset", default="wikitext", type=str, help="dataset (wikitext or bills)") 38 | parser.add_argument("--label_categories", default="broad", type=str, help="granularity of ground-truth labels (part of the prompt): broad or specific.") 39 | args = parser.parse_args() 40 | 41 | random.seed(42) 42 | system_prompt = get_system_prompt(args) 43 | openai.api_key = args.API_KEY 44 | 45 | with open(os.path.join("data", args.dataset, "processed/labeled/vocab_15k/vocab.json")) as f: 46 | vocab = json.load(f) 47 | vocab = {j:i for i,j in vocab.items()} 48 | 49 | paths = ["runs/outputs/k_selection/" + args.dataset + "-labeled/vocab_15k/k-" + str(i) for i in range(20, 420, 20)] 50 | 51 | 52 | output_path = "number-of-topics-section-4" 53 | os.makedirs(output_path, exist_ok=True) 54 | 55 | with open(os.path.join(output_path, "coherence_ratings_" + args.dataset + "_" + args.label_categories + ".jsonl"), "w") as outfile: 56 | for path in tqdm(paths): 57 | path = os.path.join(path, "2972") 58 | beta = np.load(os.path.join(path, "beta.npy")) 59 | theta = np.load(os.path.join(path, "train.theta.npy")) 60 | 61 | print (beta.shape) # 20, 15'000, each row is probability distribution over vocab 62 | print (theta.shape) 63 | num_topics = beta.shape[0] 64 | top_words = [] 65 | for row in beta: 66 | indices = row.argsort()[::-1][:10] 67 | top_topic_words = [vocab[i] for i in indices] 68 | top_words.append(top_topic_words) 69 | 70 | # sample 10 topics 71 | 72 | sampled_topics = random.sample(list(range(num_topics)), k=10) 73 | for i in sampled_topics: 74 | topic = top_words[i] 75 | random.shuffle(topic) 76 | user_prompt = ", ".join(topic) 77 | 78 | response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}], temperature=0, max_tokens=1, logit_bias={16:100, 17:100, 18:100})["choices"][0]["message"]["content"].strip() 79 | out = {"path": path, "topic": i, "user_prompt": user_prompt, "response": response} 80 | json.dump(out, outfile) 81 | outfile.write("\n") 82 | print (response) 83 | time.sleep(0.1) 84 | -------------------------------------------------------------------------------- /topic-modeling-output/etm-topics-best-c_npmi_10_full.json: -------------------------------------------------------------------------------- 1 | { 2 | "nytimes": { 3 | "c_npmi_10_full": 0.11378887397558148, 4 | "c_npmi_10_full_sd": 0.09088256843754142, 5 | "tu": 0.904, 6 | "to": 0.0770408163265306, 7 | "overlaps": 0, 8 | "anneal_lr": 0, 9 | "data_path": "/workspace/topic-preprocessing/data/nytimes/processed/full-mindf_power_law-maxdf_0.9/etm", 10 | "epochs": 1000, 11 | "lr": 0.02, 12 | "seed": 11235, 13 | "wdecay": 1.2e-06, 14 | "input_dir": "nytimes", 15 | "topics": [ 16 | [ 17 | "campaign", 18 | "bush", 19 | "clinton", 20 | "vote", 21 | "state", 22 | "congress", 23 | "republican", 24 | 
"administration", 25 | "election", 26 | "governor", 27 | "political", 28 | "bill", 29 | "democrats", 30 | "senator", 31 | "senate", 32 | "democratic", 33 | "party", 34 | "legislation", 35 | "republicans", 36 | "white_house", 37 | "candidate", 38 | "presidential", 39 | "lawmakers", 40 | "voters", 41 | "congressional", 42 | "debate", 43 | "representative", 44 | "washington", 45 | "candidates", 46 | "issue", 47 | "abortion", 48 | "democrat", 49 | "votes", 50 | "government", 51 | "parties", 52 | "legislature", 53 | "elections", 54 | "term", 55 | "leader", 56 | "opponents", 57 | "speaker", 58 | "primary", 59 | "elected", 60 | "opposition", 61 | "leaders", 62 | "support", 63 | "policy", 64 | "conservative", 65 | "reagan", 66 | "speech" 67 | ], 68 | [ 69 | "percent", 70 | "market", 71 | "rate", 72 | "prices", 73 | "rates", 74 | "price", 75 | "economy", 76 | "growth", 77 | "dollar", 78 | "lower", 79 | "average", 80 | "interest", 81 | "month", 82 | "rose", 83 | "stocks", 84 | "bonds", 85 | "higher", 86 | "markets", 87 | "inflation", 88 | "stock", 89 | "fell", 90 | "yesterday", 91 | "sales", 92 | "quarter", 93 | "rise", 94 | "bond", 95 | "economic", 96 | "billion", 97 | "decline", 98 | "value", 99 | "trading", 100 | "consumer", 101 | "points", 102 | "demand", 103 | "investors", 104 | "treasury", 105 | "index", 106 | "expected", 107 | "traders", 108 | "drop", 109 | "profits", 110 | "issues", 111 | "percentage", 112 | "discount", 113 | "unemployment", 114 | "week", 115 | "economists", 116 | "currency", 117 | "analysts", 118 | "fed" 119 | ], 120 | [ 121 | "years", 122 | "life", 123 | "family", 124 | "home", 125 | "old", 126 | "died", 127 | "death", 128 | "wife", 129 | "lives", 130 | "man", 131 | "born", 132 | "son", 133 | "lived", 134 | "living", 135 | "marriage", 136 | "couple", 137 | "church", 138 | "brother", 139 | "heart", 140 | "brothers", 141 | "age", 142 | "gay", 143 | "population", 144 | "career", 145 | "sons", 146 | "grew", 147 | "daughter", 148 | "moved", 149 | "serving", 150 | "native", 151 | "house", 152 | "resident", 153 | "men", 154 | "decades", 155 | "divorce", 156 | "battle", 157 | "married", 158 | "section", 159 | "retirement", 160 | "relatives", 161 | "catholic", 162 | "birthday", 163 | "chapter", 164 | "doctor", 165 | "daughters", 166 | "couples", 167 | "visit", 168 | "apartment", 169 | "year", 170 | "roman" 171 | ], 172 | [ 173 | "said", 174 | "added", 175 | "asked", 176 | "year", 177 | "saying", 178 | "months", 179 | "spokesman", 180 | "told", 181 | "week", 182 | "month", 183 | "expected", 184 | "interview", 185 | "going", 186 | "years", 187 | "according", 188 | "called", 189 | "decision", 190 | "statement", 191 | "days", 192 | "continued", 193 | "plans", 194 | "discuss", 195 | "taking", 196 | "trying", 197 | "involved", 198 | "decided", 199 | "weeks", 200 | "earlier", 201 | "believed", 202 | "suggested", 203 | "adding", 204 | "noted", 205 | "comment", 206 | "appeared", 207 | "declined", 208 | "referring", 209 | "similar", 210 | "described", 211 | "wanted", 212 | "sent", 213 | "acknowledged", 214 | "took", 215 | "spoke", 216 | "began", 217 | "taken", 218 | "concerned", 219 | "refused", 220 | "reported", 221 | "discussed", 222 | "calls" 223 | ], 224 | [ 225 | "glass", 226 | "red", 227 | "blue", 228 | "wear", 229 | "light", 230 | "wearing", 231 | "hair", 232 | "white", 233 | "fashion", 234 | "clothes", 235 | "yellow", 236 | "wood", 237 | "plastic", 238 | "gray", 239 | "shoes", 240 | "color", 241 | "colors", 242 | "pink", 243 | "metal", 244 | "dress", 245 | "wore", 246 | "paint", 247 
| "leather", 248 | "colored", 249 | "coat", 250 | "hanging", 251 | "length", 252 | "green", 253 | "tall", 254 | "look", 255 | "looking", 256 | "glasses", 257 | "sunglasses", 258 | "skin", 259 | "shoe", 260 | "classy", 261 | "buttons", 262 | "clothing", 263 | "shirt", 264 | "hang", 265 | "wooden", 266 | "stick", 267 | "liked", 268 | "toy", 269 | "window", 270 | "inside", 271 | "feet", 272 | "looks", 273 | "dressed", 274 | "underneath" 275 | ], 276 | [ 277 | "music", 278 | "band", 279 | "songs", 280 | "concert", 281 | "rock", 282 | "dance", 283 | "jazz", 284 | "album", 285 | "musical", 286 | "musicians", 287 | "piano", 288 | "performance", 289 | "singer", 290 | "song", 291 | "theater", 292 | "pop", 293 | "performances", 294 | "stage", 295 | "performed", 296 | "night", 297 | "opera", 298 | "orchestra", 299 | "composer", 300 | "piece", 301 | "play", 302 | "played", 303 | "singers", 304 | "production", 305 | "evening", 306 | "singing", 307 | "ballet", 308 | "tonight", 309 | "solo", 310 | "plays", 311 | "debut", 312 | "classical", 313 | "presented", 314 | "repertory", 315 | "program", 316 | "chamber", 317 | "performing", 318 | "quartet", 319 | "recordings", 320 | "pianist", 321 | "duo", 322 | "tunes", 323 | "works", 324 | "playing", 325 | "blues", 326 | "festival" 327 | ], 328 | [ 329 | "year", 330 | "years", 331 | "ago", 332 | "later", 333 | "old", 334 | "late", 335 | "major", 336 | "record", 337 | "lost", 338 | "leading", 339 | "early", 340 | "won", 341 | "helped", 342 | "worked", 343 | "known", 344 | "annual", 345 | "national", 346 | "held", 347 | "figure", 348 | "director", 349 | "club", 350 | "highest", 351 | "cause", 352 | "host", 353 | "earlier", 354 | "event", 355 | "ended", 356 | "million", 357 | "month", 358 | "award", 359 | "earned", 360 | "joined", 361 | "rare", 362 | "degree", 363 | "expert", 364 | "raised", 365 | "paid", 366 | "attended", 367 | "minor", 368 | "previous", 369 | "chosen", 370 | "bridge", 371 | "founded", 372 | "position", 373 | "produced", 374 | "spent", 375 | "moved", 376 | "principal", 377 | "received", 378 | "active" 379 | ], 380 | [ 381 | "street", 382 | "avenue", 383 | "west", 384 | "information", 385 | "sunday", 386 | "tickets", 387 | "tomorrow", 388 | "east", 389 | "free", 390 | "park", 391 | "manhattan", 392 | "saturday", 393 | "hours", 394 | "admission", 395 | "broadway", 396 | "center", 397 | "include", 398 | "new", 399 | "fifth", 400 | "open", 401 | "friday", 402 | "madison", 403 | "theater", 404 | "noon", 405 | "garden", 406 | "square", 407 | "tour", 408 | "includes", 409 | "sundays", 410 | "children", 411 | "new_york", 412 | "place", 413 | "saturdays", 414 | "library", 415 | "april", 416 | "road", 417 | "subway", 418 | "shows", 419 | "festival", 420 | "opens", 421 | "available", 422 | "benefit", 423 | "times", 424 | "tuesday", 425 | "fridays", 426 | "reservations", 427 | "events", 428 | "section", 429 | "near", 430 | "cards" 431 | ], 432 | [ 433 | "wine", 434 | "food", 435 | "fresh", 436 | "taste", 437 | "add", 438 | "cooking", 439 | "sugar", 440 | "heat", 441 | "minutes", 442 | "chicken", 443 | "cook", 444 | "red", 445 | "juice", 446 | "menu", 447 | "cooked", 448 | "olive", 449 | "cheese", 450 | "salt", 451 | "meat", 452 | "fat", 453 | "cup", 454 | "kosher", 455 | "meal", 456 | "dishes", 457 | "vegetables", 458 | "garlic", 459 | "dish", 460 | "oil", 461 | "rice", 462 | "seafood", 463 | "restaurant", 464 | "soup", 465 | "sauce", 466 | "oven", 467 | "milk", 468 | "ground", 469 | "cups", 470 | "warm", 471 | "beef", 472 | "bread", 473 | "eat", 474 | 
"desserts", 475 | "steak", 476 | "pepper", 477 | "medium", 478 | "small", 479 | "butter", 480 | "potato", 481 | "coffee", 482 | "white" 483 | ], 484 | [ 485 | "water", 486 | "oil", 487 | "plants", 488 | "plant", 489 | "energy", 490 | "gas", 491 | "natural", 492 | "trees", 493 | "space", 494 | "clean", 495 | "earth", 496 | "waste", 497 | "fish", 498 | "animals", 499 | "environmental", 500 | "animal", 501 | "soil", 502 | "fuel", 503 | "scientists", 504 | "birds", 505 | "snow", 506 | "rain", 507 | "wildlife", 508 | "storm", 509 | "ice", 510 | "debris", 511 | "electricity", 512 | "steel", 513 | "bird", 514 | "farm", 515 | "electric", 516 | "forest", 517 | "tree", 518 | "pollution", 519 | "surface", 520 | "grass", 521 | "ground", 522 | "farmers", 523 | "gasoline", 524 | "weather", 525 | "electrical", 526 | "heavy", 527 | "cubic", 528 | "species", 529 | "organic", 530 | "arctic", 531 | "power", 532 | "solar", 533 | "habitats", 534 | "light" 535 | ], 536 | [ 537 | "police", 538 | "said", 539 | "officers", 540 | "killed", 541 | "crime", 542 | "people", 543 | "prison", 544 | "fire", 545 | "arrested", 546 | "officer", 547 | "men", 548 | "man", 549 | "killing", 550 | "shot", 551 | "death", 552 | "victims", 553 | "gun", 554 | "murder", 555 | "wounded", 556 | "accused", 557 | "authorities", 558 | "attack", 559 | "found", 560 | "drug", 561 | "charged", 562 | "shooting", 563 | "dead", 564 | "crimes", 565 | "assault", 566 | "violent", 567 | "investigators", 568 | "reported", 569 | "security", 570 | "arrest", 571 | "suspect", 572 | "taken", 573 | "spokesman", 574 | "woman", 575 | "violence", 576 | "guns", 577 | "car", 578 | "year", 579 | "narcotics", 580 | "told", 581 | "nyt", 582 | "incident", 583 | "jail", 584 | "driver", 585 | "rape", 586 | "identified" 587 | ], 588 | [ 589 | "fact", 590 | "time", 591 | "times", 592 | "later", 593 | "believe", 594 | "having", 595 | "probably", 596 | "real", 597 | "true", 598 | "words", 599 | "word", 600 | "face", 601 | "bit", 602 | "seen", 603 | "mean", 604 | "answer", 605 | "simply", 606 | "far", 607 | "clearly", 608 | "hope", 609 | "questions", 610 | "hands", 611 | "possible", 612 | "past", 613 | "instead", 614 | "certainly", 615 | "easily", 616 | "subject", 617 | "days", 618 | "attempt", 619 | "person", 620 | "means", 621 | "surprise", 622 | "tell", 623 | "yes", 624 | "present", 625 | "doubt", 626 | "confidence", 627 | "apparently", 628 | "suggests", 629 | "usual", 630 | "speak", 631 | "appear", 632 | "eye", 633 | "prove", 634 | "place", 635 | "short", 636 | "reality", 637 | "actually", 638 | "appears" 639 | ], 640 | [ 641 | "new_york", 642 | "yesterday", 643 | "director", 644 | "manhattan", 645 | "brooklyn", 646 | "received", 647 | "new_jersey", 648 | "named", 649 | "queens", 650 | "owner", 651 | "assistant", 652 | "new_york_city", 653 | "connecticut", 654 | "department", 655 | "boston", 656 | "retired", 657 | "announced", 658 | "bronx", 659 | "los_angeles", 660 | "washington", 661 | "master", 662 | "executive", 663 | "professor", 664 | "founder", 665 | "manager", 666 | "firm", 667 | "associate", 668 | "partner", 669 | "vice", 670 | "consultant", 671 | "greenwich", 672 | "princeton", 673 | "san_francisco", 674 | "research", 675 | "hartford", 676 | "newark", 677 | "ohio", 678 | "managing", 679 | "harvard", 680 | "known", 681 | "senior", 682 | "division", 683 | "joined", 684 | "captain", 685 | "philadelphia", 686 | "coordinator", 687 | "yale", 688 | "amp", 689 | "columbia", 690 | "staten_island" 691 | ], 692 | [ 693 | "computer", 694 | "internet", 695 | "technology", 
696 | "web", 697 | "system", 698 | "software", 699 | "computers", 700 | "systems", 701 | "video", 702 | "data", 703 | "car", 704 | "equipment", 705 | "users", 706 | "digital", 707 | "cars", 708 | "sites", 709 | "phone", 710 | "customers", 711 | "online", 712 | "network", 713 | "electronic", 714 | "personal", 715 | "vehicle", 716 | "site", 717 | "models", 718 | "mail", 719 | "information", 720 | "auto", 721 | "chip", 722 | "speed", 723 | "customer", 724 | "products", 725 | "device", 726 | "use", 727 | "available", 728 | "devices", 729 | "apple", 730 | "price", 731 | "mobile", 732 | "engine", 733 | "vehicles", 734 | "manufacturing", 735 | "machines", 736 | "consumers", 737 | "user", 738 | "allow", 739 | "microsoft", 740 | "type", 741 | "networks", 742 | "ford" 743 | ], 744 | [ 745 | "game", 746 | "team", 747 | "season", 748 | "games", 749 | "play", 750 | "players", 751 | "coach", 752 | "player", 753 | "teams", 754 | "league", 755 | "ball", 756 | "football", 757 | "played", 758 | "baseball", 759 | "basketball", 760 | "mets", 761 | "giants", 762 | "yards", 763 | "yankees", 764 | "jets", 765 | "playing", 766 | "seasons", 767 | "nets", 768 | "pitch", 769 | "stadium", 770 | "hockey", 771 | "fans", 772 | "rangers", 773 | "knicks", 774 | "coaching", 775 | "yard", 776 | "field", 777 | "yankee", 778 | "pitcher", 779 | "bowl", 780 | "quarterback", 781 | "playoffs", 782 | "rookie", 783 | "redskins", 784 | "inning", 785 | "teammates", 786 | "offense", 787 | "preseason", 788 | "national_football_league", 789 | "defensive", 790 | "leagues", 791 | "innings", 792 | "pitching", 793 | "touchdown", 794 | "minutes" 795 | ], 796 | [ 797 | "week", 798 | "article", 799 | "page", 800 | "march", 801 | "tuesday", 802 | "june", 803 | "july", 804 | "friday", 805 | "thursday", 806 | "wednesday", 807 | "day", 808 | "april", 809 | "monday", 810 | "telephone", 811 | "production", 812 | "reported", 813 | "misstated", 814 | "correction", 815 | "weeks", 816 | "new", 817 | "scheduled", 818 | "sunday", 819 | "date", 820 | "expected", 821 | "month", 822 | "numbers", 823 | "referred", 824 | "november", 825 | "saturday", 826 | "daily", 827 | "fall", 828 | "months", 829 | "report", 830 | "number", 831 | "news", 832 | "appeared", 833 | "error", 834 | "following", 835 | "closed", 836 | "announced", 837 | "october", 838 | "september", 839 | "incorrectly", 840 | "picture", 841 | "year", 842 | "column", 843 | "copies", 844 | "brief", 845 | "december", 846 | "august" 847 | ], 848 | [ 849 | "board", 850 | "members", 851 | "mayor", 852 | "groups", 853 | "agency", 854 | "meeting", 855 | "officials", 856 | "plan", 857 | "public", 858 | "city", 859 | "union", 860 | "committee", 861 | "agreement", 862 | "agreed", 863 | "organization", 864 | "rules", 865 | "labor", 866 | "official", 867 | "commission", 868 | "approved", 869 | "member", 870 | "leaders", 871 | "issues", 872 | "group", 873 | "proposed", 874 | "proposal", 875 | "owners", 876 | "association", 877 | "negotiations", 878 | "strike", 879 | "council", 880 | "deal", 881 | "announced", 882 | "rejected", 883 | "decision", 884 | "approval", 885 | "director", 886 | "major", 887 | "talks", 888 | "chairman", 889 | "giuliani", 890 | "authority", 891 | "workers", 892 | "dispute", 893 | "process", 894 | "plans", 895 | "commissioner", 896 | "organizations", 897 | "joint", 898 | "effort" 899 | ], 900 | [ 901 | "today", 902 | "group", 903 | "including", 904 | "called", 905 | "led", 906 | "known", 907 | "began", 908 | "built", 909 | "early", 910 | "planned", 911 | "held", 912 | "created", 913 | 
"included", 914 | "considered", 915 | "brought", 916 | "completed", 917 | "based", 918 | "offered", 919 | "taken", 920 | "given", 921 | "followed", 922 | "intended", 923 | "sent", 924 | "organized", 925 | "recently", 926 | "replaced", 927 | "established", 928 | "working", 929 | "designed", 930 | "include", 931 | "join", 932 | "holding", 933 | "produced", 934 | "abandoned", 935 | "giving", 936 | "developed", 937 | "formed", 938 | "calling", 939 | "focused", 940 | "joined", 941 | "met", 942 | "introduced", 943 | "turned", 944 | "opened", 945 | "gave", 946 | "provided", 947 | "studied", 948 | "destroyed", 949 | "presented", 950 | "remained" 951 | ], 952 | [ 953 | "year", 954 | "contract", 955 | "left", 956 | "signed", 957 | "manager", 958 | "second", 959 | "chicago", 960 | "free", 961 | "right", 962 | "lost", 963 | "pass", 964 | "agent", 965 | "forward", 966 | "practice", 967 | "atlanta", 968 | "season", 969 | "home", 970 | "day", 971 | "draft", 972 | "brown", 973 | "defense", 974 | "johnson", 975 | "houston", 976 | "smith", 977 | "florida", 978 | "seattle", 979 | "camp", 980 | "miami", 981 | "list", 982 | "led", 983 | "seven", 984 | "sunday", 985 | "dallas", 986 | "today", 987 | "yesterday", 988 | "jackson", 989 | "super", 990 | "terms", 991 | "texas", 992 | "guard", 993 | "detroit", 994 | "tonight", 995 | "running", 996 | "record", 997 | "training", 998 | "cleveland", 999 | "starting", 1000 | "agreed", 1001 | "washington", 1002 | "toronto" 1003 | ], 1004 | [ 1005 | "like", 1006 | "little", 1007 | "good", 1008 | "best", 1009 | "line", 1010 | "hard", 1011 | "makes", 1012 | "small", 1013 | "find", 1014 | "long", 1015 | "look", 1016 | "especially", 1017 | "comes", 1018 | "come", 1019 | "called", 1020 | "use", 1021 | "sign", 1022 | "easy", 1023 | "better", 1024 | "usually", 1025 | "clear", 1026 | "form", 1027 | "offers", 1028 | "turn", 1029 | "recently", 1030 | "range", 1031 | "particular", 1032 | "looks", 1033 | "real", 1034 | "gives", 1035 | "instead", 1036 | "want", 1037 | "gets", 1038 | "turns", 1039 | "particularly", 1040 | "goes", 1041 | "turning", 1042 | "takes", 1043 | "need", 1044 | "available", 1045 | "putting", 1046 | "individual", 1047 | "means", 1048 | "taking", 1049 | "large", 1050 | "unlike", 1051 | "rarely", 1052 | "quality", 1053 | "spread", 1054 | "hand" 1055 | ], 1056 | [ 1057 | "city", 1058 | "government", 1059 | "people", 1060 | "local", 1061 | "state", 1062 | "program", 1063 | "work", 1064 | "public", 1065 | "service", 1066 | "workers", 1067 | "site", 1068 | "development", 1069 | "private", 1070 | "new", 1071 | "jobs", 1072 | "nation", 1073 | "care", 1074 | "services", 1075 | "help", 1076 | "county", 1077 | "project", 1078 | "system", 1079 | "programs", 1080 | "officials", 1081 | "residents", 1082 | "areas", 1083 | "cities", 1084 | "health", 1085 | "projects", 1086 | "provide", 1087 | "land", 1088 | "plans", 1089 | "community", 1090 | "poor", 1091 | "housing", 1092 | "new_york_city", 1093 | "plan", 1094 | "build", 1095 | "country", 1096 | "welfare", 1097 | "working", 1098 | "need", 1099 | "employees", 1100 | "thousands", 1101 | "homes", 1102 | "region", 1103 | "emergency", 1104 | "families", 1105 | "area", 1106 | "agencies" 1107 | ], 1108 | [ 1109 | "political", 1110 | "rights", 1111 | "religious", 1112 | "human", 1113 | "anti", 1114 | "democracy", 1115 | "freedom", 1116 | "civil", 1117 | "struggle", 1118 | "protest", 1119 | "war", 1120 | "politics", 1121 | "power", 1122 | "racial", 1123 | "social", 1124 | "apartheid", 1125 | "communist", 1126 | "views", 1127 | "moral", 
1128 | "influence", 1129 | "revolution", 1130 | "religion", 1131 | "liberal", 1132 | "vietnam", 1133 | "independence", 1134 | "conflicts", 1135 | "faith", 1136 | "ethnic", 1137 | "fear", 1138 | "movement", 1139 | "nazi", 1140 | "matters", 1141 | "speech", 1142 | "prominent", 1143 | "outrage", 1144 | "protests", 1145 | "radical", 1146 | "solidarity", 1147 | "conscience", 1148 | "spiritual", 1149 | "discussion", 1150 | "liberties", 1151 | "philosophy", 1152 | "protesters", 1153 | "opposition", 1154 | "liberation", 1155 | "demonstrations", 1156 | "ideological", 1157 | "community", 1158 | "outraged" 1159 | ], 1160 | [ 1161 | "health", 1162 | "medical", 1163 | "drug", 1164 | "research", 1165 | "doctors", 1166 | "patients", 1167 | "drugs", 1168 | "study", 1169 | "aids", 1170 | "blood", 1171 | "patient", 1172 | "researchers", 1173 | "tests", 1174 | "cancer", 1175 | "treatment", 1176 | "hospital", 1177 | "studies", 1178 | "medicine", 1179 | "disease", 1180 | "human", 1181 | "care", 1182 | "brain", 1183 | "risk", 1184 | "testing", 1185 | "tested", 1186 | "cell", 1187 | "smoking", 1188 | "physicians", 1189 | "scientists", 1190 | "nutrition", 1191 | "treatments", 1192 | "heart", 1193 | "treat", 1194 | "therapy", 1195 | "effective", 1196 | "clinic", 1197 | "cause", 1198 | "body", 1199 | "genetic", 1200 | "use", 1201 | "virus", 1202 | "condition", 1203 | "substance", 1204 | "test", 1205 | "gene", 1206 | "animal", 1207 | "effects", 1208 | "samples", 1209 | "medication", 1210 | "laboratory" 1211 | ], 1212 | [ 1213 | "time", 1214 | "half", 1215 | "center", 1216 | "open", 1217 | "away", 1218 | "place", 1219 | "high", 1220 | "day", 1221 | "run", 1222 | "head", 1223 | "days", 1224 | "hours", 1225 | "end", 1226 | "right", 1227 | "home", 1228 | "field", 1229 | "left", 1230 | "minutes", 1231 | "feet", 1232 | "summer", 1233 | "hour", 1234 | "close", 1235 | "eyes", 1236 | "single", 1237 | "hand", 1238 | "inch", 1239 | "green", 1240 | "set", 1241 | "drive", 1242 | "let", 1243 | "walk", 1244 | "cut", 1245 | "spring", 1246 | "wall", 1247 | "wide", 1248 | "foot", 1249 | "wait", 1250 | "nearly", 1251 | "watch", 1252 | "leaves", 1253 | "ground", 1254 | "double", 1255 | "stop", 1256 | "box", 1257 | "opening", 1258 | "seat", 1259 | "base", 1260 | "seven", 1261 | "caught", 1262 | "opened" 1263 | ], 1264 | [ 1265 | "military", 1266 | "war", 1267 | "iraq", 1268 | "israel", 1269 | "peace", 1270 | "government", 1271 | "israeli", 1272 | "forces", 1273 | "officials", 1274 | "united_nations", 1275 | "american", 1276 | "security", 1277 | "soviet", 1278 | "iran", 1279 | "official", 1280 | "iraqi", 1281 | "arab", 1282 | "russia", 1283 | "weapons", 1284 | "united_states", 1285 | "nuclear", 1286 | "troops", 1287 | "intelligence", 1288 | "attacks", 1289 | "lebanon", 1290 | "attack", 1291 | "palestinian", 1292 | "army", 1293 | "afghanistan", 1294 | "pakistan", 1295 | "administration", 1296 | "minister", 1297 | "arms", 1298 | "fighting", 1299 | "leaders", 1300 | "prime", 1301 | "islamic", 1302 | "russian", 1303 | "talks", 1304 | "middle_east", 1305 | "militants", 1306 | "international", 1307 | "muslim", 1308 | "israelis", 1309 | "armed", 1310 | "soldiers", 1311 | "country", 1312 | "treaty", 1313 | "iranian", 1314 | "soviet_union" 1315 | ], 1316 | [ 1317 | "great", 1318 | "member", 1319 | "members", 1320 | "served", 1321 | "friends", 1322 | "friend", 1323 | "entire", 1324 | "jewish", 1325 | "king", 1326 | "service", 1327 | "jews", 1328 | "extend", 1329 | "special", 1330 | "deeply", 1331 | "leader", 1332 | "deep", 1333 | "mass", 1334 | 
"community", 1335 | "thomas", 1336 | "longtime", 1337 | "spirit", 1338 | "george", 1339 | "hill", 1340 | "david", 1341 | "wish", 1342 | "honor", 1343 | "chairman", 1344 | "kennedy", 1345 | "gift", 1346 | "choice", 1347 | "jordan", 1348 | "divided", 1349 | "loved", 1350 | "john", 1351 | "commitment", 1352 | "missed", 1353 | "christian", 1354 | "express", 1355 | "foundation", 1356 | "james", 1357 | "temple", 1358 | "shared", 1359 | "wonderful", 1360 | "dedicated", 1361 | "values", 1362 | "jerusalem", 1363 | "prince", 1364 | "mark", 1365 | "martin", 1366 | "south" 1367 | ], 1368 | [ 1369 | "won", 1370 | "victory", 1371 | "win", 1372 | "lead", 1373 | "run", 1374 | "second", 1375 | "points", 1376 | "night", 1377 | "fourth", 1378 | "round", 1379 | "champion", 1380 | "title", 1381 | "winning", 1382 | "shot", 1383 | "tournament", 1384 | "final", 1385 | "beat", 1386 | "hit", 1387 | "time", 1388 | "championship", 1389 | "winner", 1390 | "cup", 1391 | "horse", 1392 | "game", 1393 | "finished", 1394 | "series", 1395 | "record", 1396 | "race", 1397 | "goal", 1398 | "seconds", 1399 | "match", 1400 | "point", 1401 | "minutes", 1402 | "saturday", 1403 | "period", 1404 | "tonight", 1405 | "boxing", 1406 | "times", 1407 | "races", 1408 | "track", 1409 | "gave", 1410 | "minute", 1411 | "fifth", 1412 | "ninth", 1413 | "career", 1414 | "line", 1415 | "sixth", 1416 | "start", 1417 | "shots", 1418 | "best" 1419 | ], 1420 | [ 1421 | "night", 1422 | "people", 1423 | "day", 1424 | "house", 1425 | "town", 1426 | "white", 1427 | "store", 1428 | "room", 1429 | "morning", 1430 | "hotel", 1431 | "stores", 1432 | "message", 1433 | "party", 1434 | "names", 1435 | "christmas", 1436 | "days", 1437 | "outside", 1438 | "streets", 1439 | "door", 1440 | "near", 1441 | "bar", 1442 | "time", 1443 | "way", 1444 | "evening", 1445 | "afternoon", 1446 | "live", 1447 | "main", 1448 | "country", 1449 | "home", 1450 | "shop", 1451 | "restaurant", 1452 | "rooms", 1453 | "dinner", 1454 | "visit", 1455 | "table", 1456 | "standing", 1457 | "crowd", 1458 | "business", 1459 | "shopping", 1460 | "summer", 1461 | "restaurants", 1462 | "club", 1463 | "block", 1464 | "dozen", 1465 | "second", 1466 | "holiday", 1467 | "hour", 1468 | "hot", 1469 | "dead", 1470 | "weekend" 1471 | ], 1472 | [ 1473 | "art", 1474 | "works", 1475 | "century", 1476 | "museum", 1477 | "artist", 1478 | "exhibition", 1479 | "artists", 1480 | "painting", 1481 | "gallery", 1482 | "paintings", 1483 | "photographs", 1484 | "collection", 1485 | "hall", 1486 | "contemporary", 1487 | "design", 1488 | "images", 1489 | "pictures", 1490 | "view", 1491 | "sculpture", 1492 | "modern", 1493 | "architecture", 1494 | "designed", 1495 | "style", 1496 | "19th", 1497 | "objects", 1498 | "exhibit", 1499 | "drawings", 1500 | "museums", 1501 | "arts", 1502 | "work", 1503 | "museum_of_modern_art", 1504 | "artworks", 1505 | "stone", 1506 | "painter", 1507 | "designers", 1508 | "landscape", 1509 | "visitors", 1510 | "designs", 1511 | "beautiful", 1512 | "18th", 1513 | "pieces", 1514 | "sculptor", 1515 | "architectural", 1516 | "collections", 1517 | "galleries", 1518 | "portrait", 1519 | "artifacts", 1520 | "20th", 1521 | "historic", 1522 | "prints" 1523 | ], 1524 | [ 1525 | "new", 1526 | "long", 1527 | "change", 1528 | "way", 1529 | "work", 1530 | "big", 1531 | "small", 1532 | "lines", 1533 | "different", 1534 | "high", 1535 | "time", 1536 | "approach", 1537 | "view", 1538 | "far", 1539 | "decade", 1540 | "past", 1541 | "large", 1542 | "current", 1543 | "larger", 1544 | "term", 1545 | "longer", 
1546 | "decades", 1547 | "moving", 1548 | "style", 1549 | "changing", 1550 | "rest", 1551 | "major", 1552 | "rich", 1553 | "early", 1554 | "point", 1555 | "end", 1556 | "generation", 1557 | "little", 1558 | "future", 1559 | "months", 1560 | "lack", 1561 | "better", 1562 | "great", 1563 | "slow", 1564 | "period", 1565 | "middle", 1566 | "era", 1567 | "rise", 1568 | "ways", 1569 | "changes", 1570 | "vast", 1571 | "modern", 1572 | "huge", 1573 | "worst", 1574 | "soon" 1575 | ], 1576 | [ 1577 | "end", 1578 | "set", 1579 | "time", 1580 | "step", 1581 | "return", 1582 | "let", 1583 | "beginning", 1584 | "forced", 1585 | "future", 1586 | "come", 1587 | "second", 1588 | "bring", 1589 | "difficult", 1590 | "allow", 1591 | "competition", 1592 | "continue", 1593 | "begin", 1594 | "coming", 1595 | "protect", 1596 | "trying", 1597 | "reach", 1598 | "major", 1599 | "start", 1600 | "sides", 1601 | "parts", 1602 | "half", 1603 | "taking", 1604 | "setting", 1605 | "break", 1606 | "point", 1607 | "failed", 1608 | "able", 1609 | "necessary", 1610 | "came", 1611 | "remain", 1612 | "meet", 1613 | "stop", 1614 | "progress", 1615 | "complete", 1616 | "send", 1617 | "final", 1618 | "tough", 1619 | "balance", 1620 | "aside", 1621 | "try", 1622 | "begins", 1623 | "gap", 1624 | "took", 1625 | "need", 1626 | "meant" 1627 | ], 1628 | [ 1629 | "black", 1630 | "school", 1631 | "students", 1632 | "television", 1633 | "schools", 1634 | "college", 1635 | "public", 1636 | "education", 1637 | "high", 1638 | "program", 1639 | "white", 1640 | "class", 1641 | "student", 1642 | "programs", 1643 | "cable", 1644 | "teachers", 1645 | "radio", 1646 | "university", 1647 | "network", 1648 | "district", 1649 | "teacher", 1650 | "sports", 1651 | "districts", 1652 | "hispanic", 1653 | "broadcast", 1654 | "entertainment", 1655 | "shows", 1656 | "stations", 1657 | "blacks", 1658 | "classes", 1659 | "colleges", 1660 | "campus", 1661 | "viewers", 1662 | "nbc", 1663 | "educational", 1664 | "faculty", 1665 | "teaching", 1666 | "fox", 1667 | "middle", 1668 | "private", 1669 | "cbs", 1670 | "special", 1671 | "grade", 1672 | "minority", 1673 | "principal", 1674 | "academic", 1675 | "disney", 1676 | "learning", 1677 | "graduate", 1678 | "studio" 1679 | ], 1680 | [ 1681 | "court", 1682 | "judge", 1683 | "law", 1684 | "case", 1685 | "federal", 1686 | "lawyer", 1687 | "trial", 1688 | "lawyers", 1689 | "charges", 1690 | "justice", 1691 | "legal", 1692 | "attorney", 1693 | "jury", 1694 | "investigation", 1695 | "supreme_court", 1696 | "ruling", 1697 | "prosecutors", 1698 | "criminal", 1699 | "decision", 1700 | "hearing", 1701 | "filed", 1702 | "denied", 1703 | "laws", 1704 | "ruled", 1705 | "justice_department", 1706 | "courts", 1707 | "ordered", 1708 | "judges", 1709 | "lawsuit", 1710 | "charged", 1711 | "suit", 1712 | "prosecutor", 1713 | "inquiry", 1714 | "convicted", 1715 | "defense", 1716 | "testimony", 1717 | "office", 1718 | "amendment", 1719 | "prosecution", 1720 | "district", 1721 | "proceedings", 1722 | "prison", 1723 | "charge", 1724 | "conviction", 1725 | "guilty", 1726 | "defendants", 1727 | "argued", 1728 | "hearings", 1729 | "civil", 1730 | "rights" 1731 | ], 1732 | [ 1733 | "million", 1734 | "money", 1735 | "budget", 1736 | "tax", 1737 | "year", 1738 | "billion", 1739 | "pay", 1740 | "spending", 1741 | "cost", 1742 | "costs", 1743 | "cut", 1744 | "bank", 1745 | "cuts", 1746 | "income", 1747 | "plan", 1748 | "fund", 1749 | "raise", 1750 | "financial", 1751 | "taxes", 1752 | "insurance", 1753 | "banks", 1754 | "funds", 1755 | "paid", 1756 
| "deficit", 1757 | "dollars", 1758 | "aid", 1759 | "federal", 1760 | "financing", 1761 | "cash", 1762 | "increases", 1763 | "finance", 1764 | "raising", 1765 | "total", 1766 | "credit", 1767 | "new", 1768 | "reduce", 1769 | "proposed", 1770 | "savings", 1771 | "medicare", 1772 | "package", 1773 | "loans", 1774 | "paying", 1775 | "fiscal", 1776 | "capital", 1777 | "limits", 1778 | "house", 1779 | "payments", 1780 | "month", 1781 | "week", 1782 | "fees" 1783 | ], 1784 | [ 1785 | "net", 1786 | "share", 1787 | "inc", 1788 | "earns", 1789 | "company", 1790 | "reports", 1791 | "loss", 1792 | "lead", 1793 | "sales", 1794 | "qtr", 1795 | "quarter", 1796 | "shares", 1797 | "revenue", 1798 | "year", 1799 | "outst", 1800 | "million", 1801 | "included", 1802 | "rev", 1803 | "march", 1804 | "june", 1805 | "nyse", 1806 | "cents", 1807 | "otc", 1808 | "6mo", 1809 | "income", 1810 | "9mo", 1811 | "gain", 1812 | "results", 1813 | "dec", 1814 | "months", 1815 | "operations", 1816 | "extraordinary", 1817 | "charge", 1818 | "latest", 1819 | "sept", 1820 | "earnings", 1821 | "tax", 1822 | "ago", 1823 | "corp", 1824 | "discontinued", 1825 | "amex", 1826 | "accounting", 1827 | "credit", 1828 | "respectively", 1829 | "sale", 1830 | "amp", 1831 | "period", 1832 | "losses", 1833 | "restructuring", 1834 | "april" 1835 | ], 1836 | [ 1837 | "editor", 1838 | "book", 1839 | "wrote", 1840 | "published", 1841 | "magazine", 1842 | "books", 1843 | "author", 1844 | "written", 1845 | "paper", 1846 | "professor", 1847 | "read", 1848 | "writing", 1849 | "original", 1850 | "english", 1851 | "writer", 1852 | "pages", 1853 | "amp", 1854 | "language", 1855 | "version", 1856 | "life", 1857 | "letters", 1858 | "reading", 1859 | "review", 1860 | "ideas", 1861 | "found", 1862 | "science", 1863 | "write", 1864 | "letter", 1865 | "illustrated", 1866 | "known", 1867 | "called", 1868 | "novel", 1869 | "model", 1870 | "notes", 1871 | "writes", 1872 | "note", 1873 | "new_york", 1874 | "editorial", 1875 | "interest", 1876 | "account", 1877 | "publisher", 1878 | "readers", 1879 | "new_york_times", 1880 | "cast", 1881 | "director", 1882 | "sir", 1883 | "idea", 1884 | "writers", 1885 | "prize", 1886 | "critic" 1887 | ], 1888 | [ 1889 | "children", 1890 | "women", 1891 | "young", 1892 | "woman", 1893 | "child", 1894 | "mother", 1895 | "old", 1896 | "love", 1897 | "parents", 1898 | "age", 1899 | "father", 1900 | "miss", 1901 | "boy", 1902 | "family", 1903 | "girl", 1904 | "baby", 1905 | "sex", 1906 | "husband", 1907 | "older", 1908 | "professional", 1909 | "ages", 1910 | "friends", 1911 | "kids", 1912 | "boys", 1913 | "younger", 1914 | "dog", 1915 | "fellow", 1916 | "sexual", 1917 | "girls", 1918 | "families", 1919 | "fun", 1920 | "live", 1921 | "social", 1922 | "adults", 1923 | "proud", 1924 | "female", 1925 | "teen", 1926 | "later", 1927 | "know", 1928 | "dogs", 1929 | "away", 1930 | "emotional", 1931 | "adult", 1932 | "home", 1933 | "parent", 1934 | "birth", 1935 | "childhood", 1936 | "mom", 1937 | "abuse", 1938 | "dad" 1939 | ], 1940 | [ 1941 | "american", 1942 | "united_states", 1943 | "world", 1944 | "country", 1945 | "international", 1946 | "foreign", 1947 | "americans", 1948 | "countries", 1949 | "china", 1950 | "french", 1951 | "japan", 1952 | "europe", 1953 | "trade", 1954 | "economic", 1955 | "london", 1956 | "british", 1957 | "european", 1958 | "war", 1959 | "japanese", 1960 | "washington", 1961 | "france", 1962 | "german", 1963 | "germany", 1964 | "nation", 1965 | "national", 1966 | "britain", 1967 | "chinese", 1968 | "paris", 1969 | 
"america", 1970 | "italy", 1971 | "mexico", 1972 | "canada", 1973 | "global", 1974 | "england", 1975 | "italian", 1976 | "western", 1977 | "immigrants", 1978 | "domestic", 1979 | "south", 1980 | "african", 1981 | "nations", 1982 | "brazil", 1983 | "spain", 1984 | "region", 1985 | "central", 1986 | "asian", 1987 | "india", 1988 | "spanish", 1989 | "cultural", 1990 | "hong_kong" 1991 | ], 1992 | [ 1993 | "president", 1994 | "general", 1995 | "chief", 1996 | "secretary", 1997 | "office", 1998 | "news", 1999 | "support", 2000 | "vice", 2001 | "chairman", 2002 | "staff", 2003 | "executive", 2004 | "senior", 2005 | "minister", 2006 | "prime", 2007 | "press", 2008 | "conference", 2009 | "deputy", 2010 | "post", 2011 | "reporters", 2012 | "newspaper", 2013 | "appointed", 2014 | "media", 2015 | "independent", 2016 | "colleagues", 2017 | "reporter", 2018 | "relations", 2019 | "resigned", 2020 | "adviser", 2021 | "criticism", 2022 | "interview", 2023 | "newspapers", 2024 | "influence", 2025 | "leadership", 2026 | "political", 2027 | "appointment", 2028 | "resignation", 2029 | "affairs", 2030 | "ambassador", 2031 | "dean", 2032 | "force", 2033 | "articles", 2034 | "counsel", 2035 | "publicly", 2036 | "cabinet", 2037 | "corruption", 2038 | "boss", 2039 | "weeks", 2040 | "successor", 2041 | "dismissed", 2042 | "succeed" 2043 | ], 2044 | [ 2045 | "building", 2046 | "room", 2047 | "old", 2048 | "year", 2049 | "market", 2050 | "weeks", 2051 | "taxes", 2052 | "bedroom", 2053 | "listed", 2054 | "house", 2055 | "space", 2056 | "square", 2057 | "broker", 2058 | "area", 2059 | "floors", 2060 | "estate", 2061 | "kitchen", 2062 | "lot", 2063 | "buildings", 2064 | "million", 2065 | "floor", 2066 | "bath", 2067 | "foot", 2068 | "apartment", 2069 | "houses", 2070 | "property", 2071 | "street", 2072 | "office", 2073 | "number", 2074 | "real", 2075 | "feet", 2076 | "basement", 2077 | "acre", 2078 | "car", 2079 | "rent", 2080 | "neighborhood", 2081 | "units", 2082 | "garage", 2083 | "city", 2084 | "maintenance", 2085 | "dining", 2086 | "apartments", 2087 | "fireplace", 2088 | "walls", 2089 | "project", 2090 | "properties", 2091 | "roof", 2092 | "windows", 2093 | "brick", 2094 | "construction" 2095 | ], 2096 | [ 2097 | "world", 2098 | "work", 2099 | "man", 2100 | "way", 2101 | "history", 2102 | "sense", 2103 | "men", 2104 | "life", 2105 | "time", 2106 | "play", 2107 | "people", 2108 | "series", 2109 | "best", 2110 | "mind", 2111 | "america", 2112 | "different", 2113 | "self", 2114 | "moment", 2115 | "kind", 2116 | "voice", 2117 | "role", 2118 | "course", 2119 | "audience", 2120 | "things", 2121 | "feeling", 2122 | "society", 2123 | "stars", 2124 | "good", 2125 | "light", 2126 | "experience", 2127 | "sound", 2128 | "young", 2129 | "culture", 2130 | "love", 2131 | "feel", 2132 | "god", 2133 | "rest", 2134 | "shows", 2135 | "live", 2136 | "male", 2137 | "body", 2138 | "finally", 2139 | "middle", 2140 | "nature", 2141 | "image", 2142 | "tradition", 2143 | "times", 2144 | "behavior", 2145 | "respect", 2146 | "act" 2147 | ], 2148 | [ 2149 | "film", 2150 | "movie", 2151 | "story", 2152 | "films", 2153 | "directed", 2154 | "movies", 2155 | "star", 2156 | "character", 2157 | "characters", 2158 | "drama", 2159 | "comedy", 2160 | "actor", 2161 | "tale", 2162 | "starring", 2163 | "cinema", 2164 | "actors", 2165 | "plays", 2166 | "hollywood", 2167 | "documentary", 2168 | "actress", 2169 | "romance", 2170 | "stories", 2171 | "scenes", 2172 | "novel", 2173 | "audiences", 2174 | "plot", 2175 | "adaptation", 2176 | "love", 2177 | 
"screen", 2178 | "novels", 2179 | "comic", 2180 | "feature", 2181 | "playwright", 2182 | "loves", 2183 | "stars", 2184 | "fantasy", 2185 | "funniest", 2186 | "world_", 2187 | "famous", 2188 | "memoir", 2189 | "onscreen", 2190 | "filmmaker", 2191 | "book", 2192 | "beautiful", 2193 | "screenplay", 2194 | "thriller", 2195 | "tales", 2196 | "miramax", 2197 | "writer", 2198 | "favorite" 2199 | ], 2200 | [ 2201 | "company", 2202 | "business", 2203 | "companies", 2204 | "million", 2205 | "amp", 2206 | "industry", 2207 | "executive", 2208 | "chief", 2209 | "executives", 2210 | "sell", 2211 | "largest", 2212 | "stock", 2213 | "investment", 2214 | "based", 2215 | "chairman", 2216 | "deal", 2217 | "advertising", 2218 | "sales", 2219 | "billion", 2220 | "firm", 2221 | "offer", 2222 | "unit", 2223 | "marketing", 2224 | "management", 2225 | "shares", 2226 | "corporate", 2227 | "operating", 2228 | "market", 2229 | "products", 2230 | "financial", 2231 | "sold", 2232 | "analyst", 2233 | "operations", 2234 | "buy", 2235 | "yesterday", 2236 | "analysts", 2237 | "businesses", 2238 | "investors", 2239 | "buying", 2240 | "selling", 2241 | "bid", 2242 | "venture", 2243 | "services", 2244 | "owned", 2245 | "subsidiary", 2246 | "division", 2247 | "merger", 2248 | "maker", 2249 | "customers", 2250 | "product" 2251 | ], 2252 | [ 2253 | "like", 2254 | "making", 2255 | "important", 2256 | "based", 2257 | "strong", 2258 | "including", 2259 | "recent", 2260 | "high", 2261 | "remains", 2262 | "short", 2263 | "popular", 2264 | "example", 2265 | "largely", 2266 | "includes", 2267 | "certain", 2268 | "success", 2269 | "events", 2270 | "order", 2271 | "consider", 2272 | "create", 2273 | "creating", 2274 | "unusual", 2275 | "standards", 2276 | "figures", 2277 | "latest", 2278 | "role", 2279 | "standard", 2280 | "include", 2281 | "addition", 2282 | "similar", 2283 | "key", 2284 | "expensive", 2285 | "present", 2286 | "successful", 2287 | "increasingly", 2288 | "fact", 2289 | "hopes", 2290 | "produce", 2291 | "american", 2292 | "somewhat", 2293 | "makes", 2294 | "interesting", 2295 | "basic", 2296 | "holds", 2297 | "indian", 2298 | "current", 2299 | "expect", 2300 | "traditional", 2301 | "despite", 2302 | "continues" 2303 | ], 2304 | [ 2305 | "think", 2306 | "people", 2307 | "want", 2308 | "going", 2309 | "says", 2310 | "know", 2311 | "good", 2312 | "way", 2313 | "right", 2314 | "time", 2315 | "job", 2316 | "lot", 2317 | "things", 2318 | "better", 2319 | "thing", 2320 | "getting", 2321 | "got", 2322 | "big", 2323 | "talk", 2324 | "help", 2325 | "bad", 2326 | "trying", 2327 | "come", 2328 | "need", 2329 | "problem", 2330 | "feel", 2331 | "having", 2332 | "try", 2333 | "wants", 2334 | "ask", 2335 | "work", 2336 | "idea", 2337 | "chance", 2338 | "kind", 2339 | "happen", 2340 | "question", 2341 | "look", 2342 | "reason", 2343 | "opportunity", 2344 | "sure", 2345 | "guy", 2346 | "point", 2347 | "wrong", 2348 | "deal", 2349 | "leave", 2350 | "guys", 2351 | "thinking", 2352 | "pick", 2353 | "money", 2354 | "tell" 2355 | ], 2356 | [ 2357 | "power", 2358 | "number", 2359 | "control", 2360 | "according", 2361 | "increase", 2362 | "large", 2363 | "likely", 2364 | "result", 2365 | "system", 2366 | "use", 2367 | "pressure", 2368 | "growing", 2369 | "recent", 2370 | "experts", 2371 | "safety", 2372 | "far", 2373 | "nearly", 2374 | "highly", 2375 | "force", 2376 | "hundreds", 2377 | "ability", 2378 | "people", 2379 | "particularly", 2380 | "change", 2381 | "possible", 2382 | "problems", 2383 | "greater", 2384 | "challenge", 2385 | "despite", 
2386 | "given", 2387 | "effort", 2388 | "limited", 2389 | "improve", 2390 | "higher", 2391 | "significant", 2392 | "increased", 2393 | "largest", 2394 | "increasing", 2395 | "status", 2396 | "important", 2397 | "relatively", 2398 | "critical", 2399 | "risk", 2400 | "thousands", 2401 | "material", 2402 | "level", 2403 | "found", 2404 | "changes", 2405 | "efforts", 2406 | "begun" 2407 | ], 2408 | [ 2409 | "came", 2410 | "took", 2411 | "went", 2412 | "told", 2413 | "asked", 2414 | "got", 2415 | "started", 2416 | "thought", 2417 | "felt", 2418 | "wanted", 2419 | "left", 2420 | "little", 2421 | "began", 2422 | "knew", 2423 | "long", 2424 | "turned", 2425 | "tried", 2426 | "found", 2427 | "know", 2428 | "saw", 2429 | "called", 2430 | "come", 2431 | "worked", 2432 | "heard", 2433 | "kept", 2434 | "gave", 2435 | "met", 2436 | "going", 2437 | "learned", 2438 | "looked", 2439 | "hard", 2440 | "spent", 2441 | "happened", 2442 | "decided", 2443 | "hit", 2444 | "seen", 2445 | "brought", 2446 | "moved", 2447 | "soon", 2448 | "working", 2449 | "personal", 2450 | "right", 2451 | "stopped", 2452 | "sat", 2453 | "old", 2454 | "ago", 2455 | "returned", 2456 | "sitting", 2457 | "gone", 2458 | "recalled" 2459 | ], 2460 | [ 2461 | "miles", 2462 | "air", 2463 | "airport", 2464 | "north", 2465 | "traffic", 2466 | "travel", 2467 | "road", 2468 | "trip", 2469 | "flight", 2470 | "mile", 2471 | "plane", 2472 | "bus", 2473 | "boat", 2474 | "train", 2475 | "village", 2476 | "fly", 2477 | "nearby", 2478 | "passengers", 2479 | "island", 2480 | "near", 2481 | "beach", 2482 | "passenger", 2483 | "fare", 2484 | "south", 2485 | "station", 2486 | "travelers", 2487 | "land", 2488 | "highway", 2489 | "buses", 2490 | "flights", 2491 | "parking", 2492 | "airline", 2493 | "coast", 2494 | "roads", 2495 | "trips", 2496 | "crew", 2497 | "eastern", 2498 | "ski", 2499 | "jet", 2500 | "routes", 2501 | "crash", 2502 | "northern", 2503 | "ticket", 2504 | "lake", 2505 | "beaches", 2506 | "trains", 2507 | "sea", 2508 | "navy", 2509 | "resort", 2510 | "shore" 2511 | ], 2512 | [ 2513 | "state", 2514 | "officials", 2515 | "report", 2516 | "california", 2517 | "evidence", 2518 | "policy", 2519 | "states", 2520 | "use", 2521 | "cases", 2522 | "found", 2523 | "problem", 2524 | "action", 2525 | "issue", 2526 | "failed", 2527 | "illegal", 2528 | "required", 2529 | "require", 2530 | "involved", 2531 | "review", 2532 | "effort", 2533 | "question", 2534 | "records", 2535 | "agents", 2536 | "case", 2537 | "problems", 2538 | "matter", 2539 | "prevent", 2540 | "study", 2541 | "enforcement", 2542 | "seeking", 2543 | "process", 2544 | "calls", 2545 | "penalty", 2546 | "appeal", 2547 | "interest", 2548 | "ban", 2549 | "complaints", 2550 | "survey", 2551 | "questions", 2552 | "opposed", 2553 | "documents", 2554 | "federal", 2555 | "search", 2556 | "test", 2557 | "critics", 2558 | "determine", 2559 | "response", 2560 | "avoid", 2561 | "delay", 2562 | "measure" 2563 | ], 2564 | [ 2565 | "father", 2566 | "late", 2567 | "wife", 2568 | "mother", 2569 | "daughter", 2570 | "husband", 2571 | "beloved", 2572 | "son", 2573 | "loving", 2574 | "amp", 2575 | "devoted", 2576 | "graduated", 2577 | "sister", 2578 | "family", 2579 | "married", 2580 | "survived", 2581 | "memorial", 2582 | "funeral", 2583 | "brother", 2584 | "law", 2585 | "passing", 2586 | "services", 2587 | "grandchildren", 2588 | "grandmother", 2589 | "service", 2590 | "grandfather", 2591 | "memory", 2592 | "ceremony", 2593 | "contributions", 2594 | "january", 2595 | "september", 2596 | "august", 2597 | 
"cherished", 2598 | "dear", 2599 | "february", 2600 | "degree", 2601 | "flowers", 2602 | "lieu", 2603 | "condolences", 2604 | "bridegroom", 2605 | "bride", 2606 | "michael", 2607 | "new_york", 2608 | "sympathy", 2609 | "died", 2610 | "wednesday", 2611 | "donations", 2612 | "monday", 2613 | "friday", 2614 | "performed" 2615 | ] 2616 | ], 2617 | "c_npmi_10_full_all": [ 2618 | 0.14624615320916104, 2619 | 0.1842349463761919, 2620 | 0.09397097956875038, 2621 | 0.02224374883885288, 2622 | 0.14196491443014664, 2623 | 0.2370141384022756, 2624 | 0.03293073853312596, 2625 | 0.1962455412333827, 2626 | 0.1915341149418706, 2627 | 0.11818618548977458, 2628 | 0.118088046901936, 2629 | 0.014622927252097677, 2630 | 0.060262066918601497, 2631 | 0.19586991688606797, 2632 | 0.21550434513063996, 2633 | 0.1304279440162075, 2634 | 0.067141489159369, 2635 | 0.005676347127949025, 2636 | 0.0399443341151241, 2637 | 0.027394469703840317, 2638 | 0.035940728231458, 2639 | 0.13207218718391886, 2640 | 0.1771557992145982, 2641 | 0.0028646926979995825, 2642 | 0.1583017349711412, 2643 | 0.035186989711695024, 2644 | 0.126649067889828, 2645 | 0.03554250799327234, 2646 | 0.24728769165095757, 2647 | 0.007093446130125744, 2648 | 0.015344450189039873, 2649 | 0.10515918827441718, 2650 | 0.1882183122717962, 2651 | 0.14759851831626747, 2652 | 0.5031460615129256, 2653 | 0.1563020964426604, 2654 | 0.1088203777913382, 2655 | 0.08715242376128987, 2656 | 0.10371770006655577, 2657 | 0.1560688706902293, 2658 | 0.01920638899674475, 2659 | 0.15620943194110215, 2660 | 0.10602635865005294, 2661 | -0.009920778876769423, 2662 | 0.08536125795429042, 2663 | 0.032156055229606545, 2664 | 0.07240444289452569, 2665 | 0.1347891185211831, 2666 | 0.04646465360805376, 2667 | 0.2776205766334038 2668 | ], 2669 | "path": "outputs/full-mindf_power_law-maxdf_0.9/nytimes/k-50/etm/lr_0.02-reg_1.2e-06-epochs_1000-anneal_lr_0/11235" 2670 | }, 2671 | "wikitext": { 2672 | "c_npmi_10_full": 0.11328744378992832, 2673 | "c_npmi_10_full_sd": 0.06794966942260233, 2674 | "tu": 0.94, 2675 | "to": 0.030612244897959183, 2676 | "overlaps": 0, 2677 | "anneal_lr": 0, 2678 | "data_path": "/workspace/topic-preprocessing/data/wikitext/processed/full-mindf_power_law-maxdf_0.9/etm", 2679 | "epochs": 1000, 2680 | "lr": 0.001, 2681 | "seed": 42, 2682 | "wdecay": 1.2e-05, 2683 | "input_dir": "wikitext", 2684 | "topics": [ 2685 | [ 2686 | "new", 2687 | "use", 2688 | "development", 2689 | "world", 2690 | "design", 2691 | "created", 2692 | "system", 2693 | "developed", 2694 | "power", 2695 | "based", 2696 | "designed", 2697 | "produced", 2698 | "original", 2699 | "production", 2700 | "available", 2701 | "different", 2702 | "create", 2703 | "version", 2704 | "additional", 2705 | "single", 2706 | "effects", 2707 | "model", 2708 | "main", 2709 | "introduced", 2710 | "including", 2711 | "included", 2712 | "quality", 2713 | "special", 2714 | "test", 2715 | "changes", 2716 | "space", 2717 | "include", 2718 | "standard", 2719 | "project", 2720 | "added", 2721 | "energy", 2722 | "elements", 2723 | "concept", 2724 | "intended", 2725 | "value", 2726 | "uses", 2727 | "instead", 2728 | "technology", 2729 | "provided", 2730 | "produce", 2731 | "type", 2732 | "creating", 2733 | "introduction", 2734 | "work", 2735 | "complete" 2736 | ], 2737 | [ 2738 | "american", 2739 | "united_states", 2740 | "new_york", 2741 | "washington", 2742 | "california", 2743 | "texas", 2744 | "americans", 2745 | "virginia", 2746 | "chicago", 2747 | "boston", 2748 | "canadian", 2749 | "smith", 2750 | "canada", 2751 | 
"florida", 2752 | "new_york_city", 2753 | "michigan", 2754 | "north_carolina", 2755 | "ohio", 2756 | "johnson", 2757 | "los_angeles", 2758 | "kentucky", 2759 | "grant", 2760 | "philadelphia", 2761 | "america", 2762 | "davis", 2763 | "maryland", 2764 | "massachusetts", 2765 | "illinois", 2766 | "indiana", 2767 | "pennsylvania", 2768 | "new_jersey", 2769 | "new_orleans", 2770 | "south_carolina", 2771 | "mexican", 2772 | "toronto", 2773 | "african", 2774 | "houston", 2775 | "tennessee", 2776 | "minnesota", 2777 | "san_francisco", 2778 | "taylor", 2779 | "lee", 2780 | "cleveland", 2781 | "wisconsin", 2782 | "georgia", 2783 | "atlanta", 2784 | "missouri", 2785 | "adams", 2786 | "colorado", 2787 | "wilson" 2788 | ], 2789 | [ 2790 | "work", 2791 | "published", 2792 | "book", 2793 | "years", 2794 | "described", 2795 | "time", 2796 | "group", 2797 | "style", 2798 | "early", 2799 | "people", 2800 | "new", 2801 | "works", 2802 | "included", 2803 | "year", 2804 | "english", 2805 | "including", 2806 | "noted", 2807 | "history", 2808 | "period", 2809 | "popular", 2810 | "wrote", 2811 | "late", 2812 | "american", 2813 | "written", 2814 | "old", 2815 | "movement", 2816 | "notes", 2817 | "different", 2818 | "edition", 2819 | "list", 2820 | "social", 2821 | "original", 2822 | "culture", 2823 | "traditional", 2824 | "based", 2825 | "books", 2826 | "influenced", 2827 | "collection", 2828 | "material", 2829 | "1970s", 2830 | "recent", 2831 | "composed", 2832 | "influence", 2833 | "1980s", 2834 | "issue", 2835 | "young", 2836 | "1960s", 2837 | "middle", 2838 | "parts", 2839 | "language" 2840 | ], 2841 | [ 2842 | "australia", 2843 | "world", 2844 | "country", 2845 | "international", 2846 | "united_states", 2847 | "australian", 2848 | "india", 2849 | "countries", 2850 | "united_kingdom", 2851 | "japan", 2852 | "indian", 2853 | "europe", 2854 | "european", 2855 | "china", 2856 | "new_zealand", 2857 | "canada", 2858 | "spain", 2859 | "foreign", 2860 | "average", 2861 | "worldwide", 2862 | "highest", 2863 | "total", 2864 | "africa", 2865 | "singapore", 2866 | "american", 2867 | "sweden", 2868 | "domestic", 2869 | "sydney", 2870 | "france", 2871 | "south_africa", 2872 | "brazil", 2873 | "dutch", 2874 | "argentina", 2875 | "vietnam", 2876 | "norway", 2877 | "britain", 2878 | "peak", 2879 | "asian", 2880 | "philippines", 2881 | "melbourne", 2882 | "malaysia", 2883 | "russia", 2884 | "belgium", 2885 | "netherlands", 2886 | "nations", 2887 | "germany", 2888 | "colonial", 2889 | "indonesian", 2890 | "ireland", 2891 | "pakistan" 2892 | ], 2893 | [ 2894 | "war", 2895 | "april", 2896 | "march", 2897 | "september", 2898 | "june", 2899 | "december", 2900 | "august", 2901 | "october", 2902 | "january", 2903 | "july", 2904 | "began", 2905 | "november", 2906 | "later", 2907 | "following", 2908 | "french", 2909 | "british", 2910 | "february", 2911 | "year", 2912 | "new", 2913 | "including", 2914 | "early", 2915 | "german", 2916 | "members", 2917 | "world", 2918 | "years", 2919 | "training", 2920 | "number", 2921 | "late", 2922 | "joined", 2923 | "days", 2924 | "group", 2925 | "included", 2926 | "months", 2927 | "weeks", 2928 | "continued", 2929 | "replaced", 2930 | "served", 2931 | "france", 2932 | "support", 2933 | "returned", 2934 | "spanish", 2935 | "major", 2936 | "received", 2937 | "end", 2938 | "led", 2939 | "anti", 2940 | "month", 2941 | "service", 2942 | "remained", 2943 | "staff" 2944 | ], 2945 | [ 2946 | "race", 2947 | "horses", 2948 | "horse", 2949 | "oxford", 2950 | "cambridge", 2951 | "canal", 2952 | "dog", 2953 | 
"estate", 2954 | "boat", 2955 | "colony", 2956 | "trade", 2957 | "manchester", 2958 | "breed", 2959 | "races", 2960 | "racing", 2961 | "parish", 2962 | "riders", 2963 | "slaves", 2964 | "slave", 2965 | "bristol", 2966 | "goods", 2967 | "crew", 2968 | "cotton", 2969 | "dogs", 2970 | "trading", 2971 | "riding", 2972 | "pounds", 2973 | "mill", 2974 | "breeding", 2975 | "rowing", 2976 | "blues", 2977 | "navigation", 2978 | "lancashire", 2979 | "breeds", 2980 | "cattle", 2981 | "lengths", 2982 | "farm", 2983 | "liverpool", 2984 | "hunting", 2985 | "rider", 2986 | "boats", 2987 | "bath", 2988 | "draft", 2989 | "mills", 2990 | "oldham", 2991 | "mare", 2992 | "labour", 2993 | "merchant", 2994 | "bridgwater", 2995 | "wool" 2996 | ], 2997 | [ 2998 | "formula", 2999 | "chemical", 3000 | "nuclear", 3001 | "applications", 3002 | "element", 3003 | "hydrogen", 3004 | "atomic", 3005 | "uranium", 3006 | "oxygen", 3007 | "gas", 3008 | "acid", 3009 | "carbon", 3010 | "chemistry", 3011 | "stable", 3012 | "electrical", 3013 | "radioactive", 3014 | "atom", 3015 | "properties", 3016 | "plutonium", 3017 | "method", 3018 | "forms", 3019 | "solutions", 3020 | "calibration", 3021 | "energy", 3022 | "particles", 3023 | "thermal", 3024 | "metal", 3025 | "nitrogen", 3026 | "liquid", 3027 | "vapor", 3028 | "chain", 3029 | "physics", 3030 | "compounds", 3031 | "type", 3032 | "atoms", 3033 | "solution", 3034 | "electron", 3035 | "linear", 3036 | "ions", 3037 | "helium", 3038 | "equation", 3039 | "beta", 3040 | "particle", 3041 | "organic", 3042 | "components", 3043 | "lithium", 3044 | "synthesis", 3045 | "structure", 3046 | "metals", 3047 | "protons" 3048 | ], 3049 | [ 3050 | "stated", 3051 | "considered", 3052 | "went", 3053 | "chief", 3054 | "said", 3055 | "took", 3056 | "worked", 3057 | "called", 3058 | "decided", 3059 | "continued", 3060 | "met", 3061 | "moved", 3062 | "claimed", 3063 | "held", 3064 | "asked", 3065 | "gave", 3066 | "agreed", 3067 | "according", 3068 | "head", 3069 | "believed", 3070 | "received", 3071 | "told", 3072 | "returned", 3073 | "appointed", 3074 | "passed", 3075 | "refused", 3076 | "came", 3077 | "leader", 3078 | "director", 3079 | "ran", 3080 | "accepted", 3081 | "placed", 3082 | "turned", 3083 | "known", 3084 | "brought", 3085 | "issued", 3086 | "meeting", 3087 | "presented", 3088 | "remained", 3089 | "working", 3090 | "assistant", 3091 | "offered", 3092 | "reported", 3093 | "concluded", 3094 | "declared", 3095 | "criticized", 3096 | "announced", 3097 | "referred", 3098 | "spent", 3099 | "noted" 3100 | ], 3101 | [ 3102 | "known", 3103 | "large", 3104 | "century", 3105 | "found", 3106 | "small", 3107 | "number", 3108 | "called", 3109 | "high", 3110 | "including", 3111 | "long", 3112 | "include", 3113 | "form", 3114 | "common", 3115 | "similar", 3116 | "like", 3117 | "largest", 3118 | "lower", 3119 | "considered", 3120 | "low", 3121 | "usually", 3122 | "larger", 3123 | "size", 3124 | "modern", 3125 | "range", 3126 | "based", 3127 | "associated", 3128 | "estimated", 3129 | "higher", 3130 | "important", 3131 | "significant", 3132 | "according", 3133 | "smaller", 3134 | "increased", 3135 | "central", 3136 | "generally", 3137 | "likely", 3138 | "upper", 3139 | "level", 3140 | "rate", 3141 | "named", 3142 | "study", 3143 | "relatively", 3144 | "19th", 3145 | "greater", 3146 | "history", 3147 | "growth", 3148 | "reported", 3149 | "names", 3150 | "suggested", 3151 | "today" 3152 | ], 3153 | [ 3154 | "great", 3155 | "early", 3156 | "major", 3157 | "success", 3158 | "successful", 3159 | "hand", 3160 | 
"best", 3161 | "general", 3162 | "strong", 3163 | "earlier", 3164 | "particularly", 3165 | "key", 3166 | "important", 3167 | "difficult", 3168 | "highly", 3169 | "head", 3170 | "despite", 3171 | "largely", 3172 | "better", 3173 | "similar", 3174 | "minor", 3175 | "numerous", 3176 | "entire", 3177 | "poor", 3178 | "popular", 3179 | "especially", 3180 | "initial", 3181 | "powerful", 3182 | "complete", 3183 | "generally", 3184 | "influence", 3185 | "famous", 3186 | "greatest", 3187 | "simply", 3188 | "heavily", 3189 | "significant", 3190 | "popularity", 3191 | "notable", 3192 | "increasingly", 3193 | "immediately", 3194 | "finally", 3195 | "extremely", 3196 | "hard", 3197 | "attention", 3198 | "entirely", 3199 | "completely", 3200 | "double", 3201 | "considerable", 3202 | "easily", 3203 | "prominent" 3204 | ], 3205 | [ 3206 | "relationship", 3207 | "tells", 3208 | "begins", 3209 | "leaves", 3210 | "goes", 3211 | "tries", 3212 | "finds", 3213 | "jack", 3214 | "takes", 3215 | "meets", 3216 | "gets", 3217 | "wants", 3218 | "charlie", 3219 | "tell", 3220 | "storyline", 3221 | "young", 3222 | "restaurant", 3223 | "learns", 3224 | "asks", 3225 | "tony", 3226 | "paul", 3227 | "makes", 3228 | "breaks", 3229 | "kills", 3230 | "comes", 3231 | "arrives", 3232 | "chris", 3233 | "blood", 3234 | "dies", 3235 | "meat", 3236 | "starts", 3237 | "falls", 3238 | "believes", 3239 | "confesses", 3240 | "soap", 3241 | "killer", 3242 | "continues", 3243 | "donna", 3244 | "girlfriend", 3245 | "eve", 3246 | "returns", 3247 | "friendship", 3248 | "brings", 3249 | "adam", 3250 | "ben", 3251 | "feels", 3252 | "gives", 3253 | "helps", 3254 | "ends", 3255 | "discovers" 3256 | ], 3257 | [ 3258 | "music", 3259 | "musical", 3260 | "stage", 3261 | "piano", 3262 | "performed", 3263 | "played", 3264 | "playing", 3265 | "play", 3266 | "performance", 3267 | "opera", 3268 | "harrison", 3269 | "concert", 3270 | "theatre", 3271 | "composer", 3272 | "recordings", 3273 | "piece", 3274 | "concerts", 3275 | "performances", 3276 | "solo", 3277 | "orchestra", 3278 | "performing", 3279 | "instruments", 3280 | "session", 3281 | "sang", 3282 | "musicians", 3283 | "instrumental", 3284 | "sessions", 3285 | "version", 3286 | "singing", 3287 | "recording", 3288 | "blues", 3289 | "bach", 3290 | "mccartney", 3291 | "choir", 3292 | "beatles", 3293 | "melody", 3294 | "folk", 3295 | "organ", 3296 | "backing", 3297 | "versions", 3298 | "compositions", 3299 | "touring", 3300 | "sung", 3301 | "jazz", 3302 | "symphony", 3303 | "dylan", 3304 | "audience", 3305 | "sullivan", 3306 | "orchestral", 3307 | "guitar" 3308 | ], 3309 | [ 3310 | "song", 3311 | "album", 3312 | "band", 3313 | "music", 3314 | "songs", 3315 | "number", 3316 | "single", 3317 | "released", 3318 | "chart", 3319 | "video", 3320 | "track", 3321 | "recording", 3322 | "rock", 3323 | "release", 3324 | "studio", 3325 | "tour", 3326 | "performed", 3327 | "albums", 3328 | "vocals", 3329 | "recorded", 3330 | "sound", 3331 | "lyrics", 3332 | "singles", 3333 | "singer", 3334 | "tracks", 3335 | "dance", 3336 | "record", 3337 | "copies", 3338 | "billboard", 3339 | "madonna", 3340 | "charts", 3341 | "live", 3342 | "pop", 3343 | "hot", 3344 | "debuted", 3345 | "performance", 3346 | "produced", 3347 | "vocal", 3348 | "week", 3349 | "guitar", 3350 | "best", 3351 | "background", 3352 | "peaked", 3353 | "artist", 3354 | "carey", 3355 | "digital", 3356 | "featured", 3357 | "platinum", 3358 | "reached", 3359 | "ballad" 3360 | ], 3361 | [ 3362 | "film", 3363 | "series", 3364 | "character", 3365 | "season", 
3366 | "production", 3367 | "role", 3368 | "characters", 3369 | "best", 3370 | "scene", 3371 | "released", 3372 | "cast", 3373 | "films", 3374 | "director", 3375 | "release", 3376 | "directed", 3377 | "scenes", 3378 | "star", 3379 | "played", 3380 | "script", 3381 | "version", 3382 | "original", 3383 | "filming", 3384 | "story", 3385 | "received", 3386 | "actor", 3387 | "producer", 3388 | "office", 3389 | "man", 3390 | "box", 3391 | "plot", 3392 | "featured", 3393 | "set", 3394 | "performance", 3395 | "actors", 3396 | "stars", 3397 | "movie", 3398 | "appeared", 3399 | "fans", 3400 | "originally", 3401 | "filmed", 3402 | "appearance", 3403 | "later", 3404 | "crew", 3405 | "title", 3406 | "week", 3407 | "voice", 3408 | "final", 3409 | "roles", 3410 | "reviews", 3411 | "based" 3412 | ], 3413 | [ 3414 | "episode", 3415 | "episodes", 3416 | "television", 3417 | "viewers", 3418 | "aired", 3419 | "broadcast", 3420 | "fox", 3421 | "guest", 3422 | "watched", 3423 | "simpsons", 3424 | "ratings", 3425 | "rating", 3426 | "homer", 3427 | "michael", 3428 | "mulder", 3429 | "marge", 3430 | "nbc", 3431 | "scully", 3432 | "network", 3433 | "bart", 3434 | "writers", 3435 | "nielsen", 3436 | "lisa", 3437 | "creator", 3438 | "finale", 3439 | "peter", 3440 | "files", 3441 | "writer", 3442 | "glee", 3443 | "airing", 3444 | "plot", 3445 | "shows", 3446 | "watching", 3447 | "jim", 3448 | "rated", 3449 | "andy", 3450 | "x-files", 3451 | "viewing", 3452 | "dwight", 3453 | "demographic", 3454 | "households", 3455 | "comedy", 3456 | "tells", 3457 | "carter", 3458 | "brian", 3459 | "references", 3460 | "reality", 3461 | "viewed", 3462 | "gets", 3463 | "recurring" 3464 | ], 3465 | [ 3466 | "white", 3467 | "black", 3468 | "red", 3469 | "brown", 3470 | "blue", 3471 | "dark", 3472 | "long", 3473 | "metal", 3474 | "yellow", 3475 | "green", 3476 | "small", 3477 | "short", 3478 | "like", 3479 | "color", 3480 | "length", 3481 | "slightly", 3482 | "covered", 3483 | "shaped", 3484 | "deep", 3485 | "features", 3486 | "cap", 3487 | "light", 3488 | "wood", 3489 | "shape", 3490 | "appear", 3491 | "near", 3492 | "consists", 3493 | "bands", 3494 | "north_america", 3495 | "grey", 3496 | "thick", 3497 | "europe", 3498 | "surface", 3499 | "leaves", 3500 | "wide", 3501 | "single", 3502 | "contain", 3503 | "contains", 3504 | "fruit", 3505 | "edge", 3506 | "dry", 3507 | "orange", 3508 | "smooth", 3509 | "gray", 3510 | "thin", 3511 | "outer", 3512 | "colour", 3513 | "inner", 3514 | "narrow", 3515 | "sides" 3516 | ], 3517 | [ 3518 | "aircraft", 3519 | "air", 3520 | "flight", 3521 | "car", 3522 | "service", 3523 | "engine", 3524 | "flying", 3525 | "wing", 3526 | "train", 3527 | "cars", 3528 | "passenger", 3529 | "fighter", 3530 | "fuel", 3531 | "vehicles", 3532 | "bomb", 3533 | "operations", 3534 | "bomber", 3535 | "track", 3536 | "pilots", 3537 | "radar", 3538 | "trains", 3539 | "flights", 3540 | "operation", 3541 | "passengers", 3542 | "speed", 3543 | "units", 3544 | "carrier", 3545 | "operate", 3546 | "station", 3547 | "ride", 3548 | "carried", 3549 | "operating", 3550 | "unit", 3551 | "operational", 3552 | "capacity", 3553 | "launch", 3554 | "pilot", 3555 | "engines", 3556 | "services", 3557 | "range", 3558 | "aviation", 3559 | "weight", 3560 | "lift", 3561 | "planes", 3562 | "airport", 3563 | "capability", 3564 | "locomotives", 3565 | "dive", 3566 | "flown", 3567 | "operated" 3568 | ], 3569 | [ 3570 | "won", 3571 | "team", 3572 | "year", 3573 | "record", 3574 | "second", 3575 | "finished", 3576 | "final", 3577 | "points", 3578 | "round", 
3579 | "world", 3580 | "stage", 3581 | "winning", 3582 | "lead", 3583 | "ranked", 3584 | "coach", 3585 | "best", 3586 | "win", 3587 | "national", 3588 | "teams", 3589 | "named", 3590 | "led", 3591 | "event", 3592 | "place", 3593 | "career", 3594 | "seconds", 3595 | "history", 3596 | "tournament", 3597 | "competition", 3598 | "captain", 3599 | "injury", 3600 | "fourth", 3601 | "medal", 3602 | "overall", 3603 | "tour", 3604 | "summer", 3605 | "gold", 3606 | "olympics", 3607 | "sixth", 3608 | "earned", 3609 | "finish", 3610 | "sports", 3611 | "selected", 3612 | "winner", 3613 | "losing", 3614 | "championships", 3615 | "consecutive", 3616 | "debut", 3617 | "award", 3618 | "professional", 3619 | "events" 3620 | ], 3621 | [ 3622 | "match", 3623 | "title", 3624 | "defeated", 3625 | "championship", 3626 | "event", 3627 | "ring", 3628 | "team", 3629 | "won", 3630 | "win", 3631 | "raw", 3632 | "time", 3633 | "champion", 3634 | "wrestling", 3635 | "triple", 3636 | "defeating", 3637 | "tag", 3638 | "wwe", 3639 | "face", 3640 | "angle", 3641 | "titles", 3642 | "matches", 3643 | "table", 3644 | "division", 3645 | "edge", 3646 | "tna", 3647 | "lost", 3648 | "rivalry", 3649 | "impact", 3650 | "defeat", 3651 | "following", 3652 | "retain", 3653 | "faced", 3654 | "kane", 3655 | "feud", 3656 | "brand", 3657 | "bout", 3658 | "eliminated", 3659 | "promotion", 3660 | "defended", 3661 | "opponent", 3662 | "wrestler", 3663 | "wwf", 3664 | "hardy", 3665 | "world", 3666 | "went", 3667 | "month", 3668 | "smackdown", 3669 | "wrestlers", 3670 | "attacked", 3671 | "heavyweight" 3672 | ], 3673 | [ 3674 | "family", 3675 | "death", 3676 | "life", 3677 | "years", 3678 | "father", 3679 | "later", 3680 | "son", 3681 | "man", 3682 | "children", 3683 | "house", 3684 | "women", 3685 | "died", 3686 | "mother", 3687 | "year", 3688 | "wife", 3689 | "born", 3690 | "old", 3691 | "brother", 3692 | "married", 3693 | "daughter", 3694 | "child", 3695 | "woman", 3696 | "age", 3697 | "men", 3698 | "young", 3699 | "marriage", 3700 | "home", 3701 | "people", 3702 | "months", 3703 | "parents", 3704 | "friend", 3705 | "great", 3706 | "friends", 3707 | "husband", 3708 | "lived", 3709 | "sister", 3710 | "named", 3711 | "younger", 3712 | "birth", 3713 | "couple", 3714 | "hospital", 3715 | "relationship", 3716 | "brothers", 3717 | "care", 3718 | "member", 3719 | "sex", 3720 | "grand", 3721 | "living", 3722 | "leaving", 3723 | "help" 3724 | ], 3725 | [ 3726 | "million", 3727 | "company", 3728 | "sold", 3729 | "business", 3730 | "percent", 3731 | "commercial", 3732 | "announced", 3733 | "market", 3734 | "industry", 3735 | "cost", 3736 | "project", 3737 | "management", 3738 | "billion", 3739 | "sales", 3740 | "companies", 3741 | "selling", 3742 | "rights", 3743 | "price", 3744 | "costs", 3745 | "board", 3746 | "sell", 3747 | "employees", 3748 | "program", 3749 | "working", 3750 | "channel", 3751 | "plans", 3752 | "firm", 3753 | "purchase", 3754 | "purchased", 3755 | "store", 3756 | "engineer", 3757 | "sale", 3758 | "workers", 3759 | "equipment", 3760 | "financial", 3761 | "computer", 3762 | "owned", 3763 | "markets", 3764 | "advertising", 3765 | "radio", 3766 | "launched", 3767 | "potential", 3768 | "products", 3769 | "owners", 3770 | "code", 3771 | "demand", 3772 | "machine", 3773 | "income", 3774 | "mobile", 3775 | "filed" 3776 | ], 3777 | [ 3778 | "city", 3779 | "area", 3780 | "town", 3781 | "west", 3782 | "south", 3783 | "north", 3784 | "local", 3785 | "east", 3786 | "station", 3787 | "line", 3788 | "park", 3789 | "bridge", 3790 | "miles", 3791 
| "located", 3792 | "construction", 3793 | "new", 3794 | "opened", 3795 | "northern", 3796 | "street", 3797 | "built", 3798 | "areas", 3799 | "southern", 3800 | "major", 3801 | "road", 3802 | "centre", 3803 | "river", 3804 | "railway", 3805 | "main", 3806 | "western", 3807 | "state", 3808 | "center", 3809 | "village", 3810 | "near", 3811 | "land", 3812 | "hill", 3813 | "services", 3814 | "years", 3815 | "border", 3816 | "eastern", 3817 | "district", 3818 | "county", 3819 | "year", 3820 | "day", 3821 | "ground", 3822 | "site", 3823 | "nearby", 3824 | "system", 3825 | "old", 3826 | "airport", 3827 | "stations" 3828 | ], 3829 | [ 3830 | "killed", 3831 | "sent", 3832 | "attack", 3833 | "left", 3834 | "near", 3835 | "support", 3836 | "arrived", 3837 | "led", 3838 | "forced", 3839 | "soon", 3840 | "return", 3841 | "control", 3842 | "reached", 3843 | "able", 3844 | "caused", 3845 | "destroyed", 3846 | "plan", 3847 | "died", 3848 | "main", 3849 | "ground", 3850 | "established", 3851 | "remained", 3852 | "attacks", 3853 | "damage", 3854 | "attacked", 3855 | "supported", 3856 | "victory", 3857 | "attempted", 3858 | "suffered", 3859 | "returned", 3860 | "lost", 3861 | "attempt", 3862 | "landing", 3863 | "operation", 3864 | "brought", 3865 | "expedition", 3866 | "defeat", 3867 | "failed", 3868 | "immediately", 3869 | "captured", 3870 | "engaged", 3871 | "damaged", 3872 | "prepared", 3873 | "invasion", 3874 | "defeated", 3875 | "unable", 3876 | "prevent", 3877 | "arrival", 3878 | "shortly", 3879 | "capture" 3880 | ], 3881 | [ 3882 | "king", 3883 | "england", 3884 | "london", 3885 | "english", 3886 | "lord", 3887 | "royal", 3888 | "scotland", 3889 | "queen", 3890 | "sir", 3891 | "edward", 3892 | "henry", 3893 | "duke", 3894 | "john", 3895 | "wales", 3896 | "prince", 3897 | "scottish", 3898 | "fort", 3899 | "irish", 3900 | "reign", 3901 | "george", 3902 | "historian", 3903 | "earl", 3904 | "william", 3905 | "crown", 3906 | "james", 3907 | "kent", 3908 | "norman", 3909 | "richard", 3910 | "catholic", 3911 | "britain", 3912 | "york", 3913 | "stephen", 3914 | "abbey", 3915 | "charles", 3916 | "pope", 3917 | "robert", 3918 | "mary", 3919 | "france", 3920 | "bishop", 3921 | "throne", 3922 | "thomas", 3923 | "granted", 3924 | "appointed", 3925 | "ireland", 3926 | "catherine", 3927 | "canterbury", 3928 | "monarch", 3929 | "anne", 3930 | "protestant", 3931 | "castle" 3932 | ], 3933 | [ 3934 | "treatment", 3935 | "blood", 3936 | "brain", 3937 | "cell", 3938 | "disease", 3939 | "risk", 3940 | "patients", 3941 | "cells", 3942 | "symptoms", 3943 | "dna", 3944 | "protein", 3945 | "cancer", 3946 | "genes", 3947 | "virus", 3948 | "cases", 3949 | "causes", 3950 | "function", 3951 | "diseases", 3952 | "cause", 3953 | "body", 3954 | "related", 3955 | "gene", 3956 | "infection", 3957 | "drugs", 3958 | "activity", 3959 | "levels", 3960 | "people", 3961 | "effects", 3962 | "hiv", 3963 | "acute", 3964 | "diagnosis", 3965 | "disorders", 3966 | "genetic", 3967 | "tissue", 3968 | "liver", 3969 | "proteins", 3970 | "muscle", 3971 | "normal", 3972 | "specific", 3973 | "occurs", 3974 | "skin", 3975 | "form", 3976 | "occur", 3977 | "treatments", 3978 | "hemoglobin", 3979 | "induced", 3980 | "clinical", 3981 | "individuals", 3982 | "viruses", 3983 | "mechanisms" 3984 | ], 3985 | [ 3986 | "route", 3987 | "road", 3988 | "highway", 3989 | "state", 3990 | "north", 3991 | "traffic", 3992 | "east", 3993 | "street", 3994 | "intersection", 3995 | "continues", 3996 | "south", 3997 | "interchange", 3998 | "avenue", 3999 | "section", 4000 | 
"mile", 4001 | "passes", 4002 | "freeway", 4003 | "lane", 4004 | "portion", 4005 | "runs", 4006 | "crosses", 4007 | "completed", 4008 | "heads", 4009 | "turns", 4010 | "downtown", 4011 | "designated", 4012 | "terminus", 4013 | "crossing", 4014 | "northeast", 4015 | "extended", 4016 | "interstate", 4017 | "west", 4018 | "access", 4019 | "highways", 4020 | "roadway", 4021 | "passing", 4022 | "exit", 4023 | "junction", 4024 | "alignment", 4025 | "northwest", 4026 | "begins", 4027 | "corridor", 4028 | "enters", 4029 | "lanes", 4030 | "past", 4031 | "parallel", 4032 | "end", 4033 | "designation", 4034 | "bypass", 4035 | "description" 4036 | ], 4037 | [ 4038 | "ship", 4039 | "ships", 4040 | "guns", 4041 | "fleet", 4042 | "class", 4043 | "inch", 4044 | "tons", 4045 | "torpedo", 4046 | "naval", 4047 | "gun", 4048 | "long", 4049 | "knots", 4050 | "crew", 4051 | "turrets", 4052 | "fired", 4053 | "maximum", 4054 | "admiral", 4055 | "carried", 4056 | "armor", 4057 | "steam", 4058 | "vessels", 4059 | "cruisers", 4060 | "vessel", 4061 | "port", 4062 | "inches", 4063 | "battery", 4064 | "convoy", 4065 | "battleships", 4066 | "navy", 4067 | "service", 4068 | "cruiser", 4069 | "meters", 4070 | "armored", 4071 | "deck", 4072 | "aboard", 4073 | "built", 4074 | "battleship", 4075 | "hull", 4076 | "boats", 4077 | "armour", 4078 | "submarine", 4079 | "feet", 4080 | "flagship", 4081 | "laid", 4082 | "ton", 4083 | "turret", 4084 | "cruise", 4085 | "speed", 4086 | "ordered", 4087 | "boilers" 4088 | ], 4089 | [ 4090 | "art", 4091 | "gold", 4092 | "act", 4093 | "silver", 4094 | "image", 4095 | "flag", 4096 | "painting", 4097 | "pieces", 4098 | "artist", 4099 | "artists", 4100 | "work", 4101 | "coins", 4102 | "figures", 4103 | "paintings", 4104 | "images", 4105 | "painted", 4106 | "arms", 4107 | "exhibition", 4108 | "piece", 4109 | "figure", 4110 | "eye", 4111 | "dress", 4112 | "depicted", 4113 | "fashion", 4114 | "ceremony", 4115 | "portrait", 4116 | "worn", 4117 | "design", 4118 | "presented", 4119 | "body", 4120 | "wearing", 4121 | "sculpture", 4122 | "wear", 4123 | "display", 4124 | "coin", 4125 | "hair", 4126 | "coat", 4127 | "eagle", 4128 | "mint", 4129 | "statue", 4130 | "showing", 4131 | "bronze", 4132 | "depicts", 4133 | "wore", 4134 | "colours", 4135 | "displayed", 4136 | "dollar", 4137 | "print", 4138 | "photographs", 4139 | "designs" 4140 | ], 4141 | [ 4142 | "building", 4143 | "built", 4144 | "century", 4145 | "buildings", 4146 | "site", 4147 | "stone", 4148 | "hall", 4149 | "castle", 4150 | "tower", 4151 | "wall", 4152 | "walls", 4153 | "square", 4154 | "church", 4155 | "completed", 4156 | "house", 4157 | "floor", 4158 | "constructed", 4159 | "architecture", 4160 | "houses", 4161 | "listed", 4162 | "chapel", 4163 | "monument", 4164 | "construction", 4165 | "entrance", 4166 | "rooms", 4167 | "glass", 4168 | "museum", 4169 | "structure", 4170 | "iron", 4171 | "17th", 4172 | "window", 4173 | "remains", 4174 | "medieval", 4175 | "wooden", 4176 | "late", 4177 | "brick", 4178 | "restoration", 4179 | "architectural", 4180 | "19th", 4181 | "memorial", 4182 | "roof", 4183 | "stones", 4184 | "hotel", 4185 | "15th", 4186 | "14th", 4187 | "garden", 4188 | "palace", 4189 | "demolished", 4190 | "dates", 4191 | "steel" 4192 | ], 4193 | [ 4194 | "public", 4195 | "money", 4196 | "signed", 4197 | "manager", 4198 | "future", 4199 | "self", 4200 | "local", 4201 | "contract", 4202 | "deal", 4203 | "letter", 4204 | "pay", 4205 | "paid", 4206 | "career", 4207 | "given", 4208 | "interest", 4209 | "health", 4210 | "help", 4211 
| "press", 4212 | "decision", 4213 | "executive", 4214 | "private", 4215 | "media", 4216 | "letters", 4217 | "trade", 4218 | "real", 4219 | "attention", 4220 | "personal", 4221 | "forward", 4222 | "president", 4223 | "intended", 4224 | "owner", 4225 | "sign", 4226 | "financial", 4227 | "expressed", 4228 | "exchange", 4229 | "represented", 4230 | "chairman", 4231 | "independent", 4232 | "raise", 4233 | "acquired", 4234 | "agent", 4235 | "fellow", 4236 | "prime", 4237 | "agreement", 4238 | "fund", 4239 | "funds", 4240 | "appeal", 4241 | "founded", 4242 | "promotion", 4243 | "experience" 4244 | ], 4245 | [ 4246 | "day", 4247 | "found", 4248 | "night", 4249 | "live", 4250 | "seen", 4251 | "set", 4252 | "return", 4253 | "fire", 4254 | "having", 4255 | "way", 4256 | "come", 4257 | "find", 4258 | "days", 4259 | "dead", 4260 | "party", 4261 | "leave", 4262 | "body", 4263 | "cover", 4264 | "hours", 4265 | "fact", 4266 | "taken", 4267 | "hit", 4268 | "fight", 4269 | "shot", 4270 | "escape", 4271 | "special", 4272 | "search", 4273 | "secret", 4274 | "away", 4275 | "true", 4276 | "stay", 4277 | "believe", 4278 | "follow", 4279 | "christmas", 4280 | "previous", 4281 | "claims", 4282 | "present", 4283 | "kill", 4284 | "heard", 4285 | "idea", 4286 | "reason", 4287 | "journey", 4288 | "past", 4289 | "mark", 4290 | "face", 4291 | "news", 4292 | "thought", 4293 | "morning", 4294 | "die", 4295 | "question" 4296 | ], 4297 | [ 4298 | "military", 4299 | "soviet", 4300 | "army", 4301 | "emperor", 4302 | "hungarian", 4303 | "roman", 4304 | "muslim", 4305 | "empire", 4306 | "war", 4307 | "forces", 4308 | "polish", 4309 | "allies", 4310 | "greece", 4311 | "chinese", 4312 | "arab", 4313 | "armies", 4314 | "rule", 4315 | "treaty", 4316 | "wars", 4317 | "imperial", 4318 | "albania", 4319 | "croatia", 4320 | "croatian", 4321 | "poland", 4322 | "territory", 4323 | "siege", 4324 | "diplomatic", 4325 | "ottoman", 4326 | "conflict", 4327 | "coup", 4328 | "serbian", 4329 | "vietnamese", 4330 | "egyptian", 4331 | "byzantine", 4332 | "serbia", 4333 | "capital", 4334 | "justinian", 4335 | "uprising", 4336 | "communist", 4337 | "jews", 4338 | "turkey", 4339 | "occupation", 4340 | "damascus", 4341 | "turkish", 4342 | "constantinople", 4343 | "serbs", 4344 | "independence", 4345 | "peace", 4346 | "revolt", 4347 | "hostilities" 4348 | ], 4349 | [ 4350 | "time", 4351 | "second", 4352 | "later", 4353 | "end", 4354 | "following", 4355 | "place", 4356 | "half", 4357 | "day", 4358 | "took", 4359 | "right", 4360 | "line", 4361 | "left", 4362 | "having", 4363 | "despite", 4364 | "final", 4365 | "seven", 4366 | "point", 4367 | "away", 4368 | "high", 4369 | "times", 4370 | "set", 4371 | "came", 4372 | "eventually", 4373 | "little", 4374 | "lost", 4375 | "far", 4376 | "short", 4377 | "followed", 4378 | "previous", 4379 | "instead", 4380 | "position", 4381 | "ended", 4382 | "began", 4383 | "long", 4384 | "fourth", 4385 | "result", 4386 | "held", 4387 | "lines", 4388 | "taking", 4389 | "total", 4390 | "started", 4391 | "order", 4392 | "minutes", 4393 | "allowed", 4394 | "good", 4395 | "previously", 4396 | "saw", 4397 | "beginning", 4398 | "prior", 4399 | "heavy" 4400 | ], 4401 | [ 4402 | "home", 4403 | "new", 4404 | "making", 4405 | "open", 4406 | "play", 4407 | "run", 4408 | "room", 4409 | "opening", 4410 | "free", 4411 | "inside", 4412 | "stand", 4413 | "opened", 4414 | "cold", 4415 | "allow", 4416 | "standing", 4417 | "filled", 4418 | "door", 4419 | "shot", 4420 | "super", 4421 | "going", 4422 | "originally", 4423 | "turned", 4424 | 
"featured", 4425 | "turn", 4426 | "drawn", 4427 | "supporting", 4428 | "wild", 4429 | "card", 4430 | "der", 4431 | "turning", 4432 | "walk", 4433 | "playing", 4434 | "mixed", 4435 | "kept", 4436 | "moving", 4437 | "sets", 4438 | "crowd", 4439 | "close", 4440 | "closed", 4441 | "space", 4442 | "closing", 4443 | "enter", 4444 | "swedish", 4445 | "keeping", 4446 | "featuring", 4447 | "friendly", 4448 | "walking", 4449 | "bring", 4450 | "allowing", 4451 | "opposite" 4452 | ], 4453 | [ 4454 | "said", 4455 | "like", 4456 | "wrote", 4457 | "love", 4458 | "writing", 4459 | "felt", 4460 | "called", 4461 | "good", 4462 | "written", 4463 | "received", 4464 | "saying", 4465 | "critics", 4466 | "review", 4467 | "described", 4468 | "praised", 4469 | "don", 4470 | "wanted", 4471 | "according", 4472 | "way", 4473 | "know", 4474 | "critical", 4475 | "positive", 4476 | "want", 4477 | "feel", 4478 | "think", 4479 | "things", 4480 | "real", 4481 | "girl", 4482 | "life", 4483 | "thought", 4484 | "interview", 4485 | "reception", 4486 | "reviews", 4487 | "little", 4488 | "going", 4489 | "let", 4490 | "kind", 4491 | "inspired", 4492 | "heart", 4493 | "gave", 4494 | "thing", 4495 | "got", 4496 | "explained", 4497 | "mind", 4498 | "look", 4499 | "titled", 4500 | "writer", 4501 | "commented", 4502 | "lot", 4503 | "compared" 4504 | ], 4505 | [ 4506 | "battle", 4507 | "force", 4508 | "forces", 4509 | "japanese", 4510 | "command", 4511 | "troops", 4512 | "army", 4513 | "men", 4514 | "commander", 4515 | "general", 4516 | "attack", 4517 | "british", 4518 | "division", 4519 | "soldiers", 4520 | "units", 4521 | "lieutenant", 4522 | "fighting", 4523 | "north", 4524 | "positions", 4525 | "south", 4526 | "regiment", 4527 | "military", 4528 | "squadron", 4529 | "battalion", 4530 | "casualties", 4531 | "ordered", 4532 | "enemy", 4533 | "artillery", 4534 | "infantry", 4535 | "combat", 4536 | "unit", 4537 | "captain", 4538 | "german", 4539 | "brigade", 4540 | "officers", 4541 | "wounded", 4542 | "officer", 4543 | "1st", 4544 | "fire", 4545 | "captured", 4546 | "colonel", 4547 | "commanders", 4548 | "commanded", 4549 | "allied", 4550 | "2nd", 4551 | "position", 4552 | "garrison", 4553 | "advance", 4554 | "divisions", 4555 | "battalions" 4556 | ], 4557 | [ 4558 | "water", 4559 | "population", 4560 | "feet", 4561 | "land", 4562 | "island", 4563 | "food", 4564 | "region", 4565 | "sea", 4566 | "river", 4567 | "area", 4568 | "lake", 4569 | "metres", 4570 | "oil", 4571 | "mountain", 4572 | "climate", 4573 | "average", 4574 | "areas", 4575 | "tree", 4576 | "ice", 4577 | "forest", 4578 | "trees", 4579 | "acres", 4580 | "waters", 4581 | "near", 4582 | "people", 4583 | "islands", 4584 | "native", 4585 | "miles", 4586 | "rivers", 4587 | "plant", 4588 | "km2", 4589 | "air", 4590 | "kilometres", 4591 | "fishing", 4592 | "creek", 4593 | "plants", 4594 | "residents", 4595 | "coal", 4596 | "temperatures", 4597 | "agriculture", 4598 | "mountains", 4599 | "fish", 4600 | "snow", 4601 | "mouth", 4602 | "valley", 4603 | "approximately", 4604 | "cities", 4605 | "agricultural", 4606 | "salt", 4607 | "flow" 4608 | ], 4609 | [ 4610 | "story", 4611 | "novel", 4612 | "stories", 4613 | "fiction", 4614 | "doctor", 4615 | "book", 4616 | "magazine", 4617 | "bond", 4618 | "alien", 4619 | "comic", 4620 | "science", 4621 | "novels", 4622 | "universe", 4623 | "fantasy", 4624 | "batman", 4625 | "comics", 4626 | "villain", 4627 | "fictional", 4628 | "evil", 4629 | "magic", 4630 | "halo", 4631 | "trek", 4632 | "adventures", 4633 | "tales", 4634 | "vampire", 4635 | 
"narrative", 4636 | "magical", 4637 | "dragon", 4638 | "trilogy", 4639 | "adventure", 4640 | "tale", 4641 | "reader", 4642 | "horror", 4643 | "marvel", 4644 | "sam", 4645 | "enterprise", 4646 | "harry", 4647 | "tintin", 4648 | "harry_potter", 4649 | "genre", 4650 | "ghost", 4651 | "readers", 4652 | "fleming", 4653 | "publisher", 4654 | "fantastic", 4655 | "potter", 4656 | "editor", 4657 | "aliens", 4658 | "spider", 4659 | "ian_fleming" 4660 | ], 4661 | [ 4662 | "school", 4663 | "students", 4664 | "member", 4665 | "college", 4666 | "education", 4667 | "schools", 4668 | "university", 4669 | "members", 4670 | "class", 4671 | "student", 4672 | "community", 4673 | "founded", 4674 | "president", 4675 | "program", 4676 | "research", 4677 | "established", 4678 | "organization", 4679 | "campus", 4680 | "programs", 4681 | "served", 4682 | "academic", 4683 | "national", 4684 | "medical", 4685 | "science", 4686 | "high", 4687 | "study", 4688 | "organizations", 4689 | "prize", 4690 | "degree", 4691 | "professor", 4692 | "attended", 4693 | "honor", 4694 | "arts", 4695 | "activities", 4696 | "primary", 4697 | "board", 4698 | "department", 4699 | "engineering", 4700 | "library", 4701 | "faculty", 4702 | "graduate", 4703 | "festival", 4704 | "awarded", 4705 | "institution", 4706 | "courses", 4707 | "mayor", 4708 | "grade", 4709 | "graduated", 4710 | "teaching", 4711 | "enrolled" 4712 | ], 4713 | [ 4714 | "society", 4715 | "read", 4716 | "literature", 4717 | "author", 4718 | "literary", 4719 | "argued", 4720 | "ideas", 4721 | "poem", 4722 | "poetry", 4723 | "moral", 4724 | "argues", 4725 | "sexual", 4726 | "view", 4727 | "poet", 4728 | "writings", 4729 | "social", 4730 | "religion", 4731 | "isbn", 4732 | "historical", 4733 | "writers", 4734 | "views", 4735 | "political", 4736 | "philosophy", 4737 | "writer", 4738 | "matter", 4739 | "freedom", 4740 | "writes", 4741 | "biography", 4742 | "scholar", 4743 | "theory", 4744 | "essay", 4745 | "believed", 4746 | "speech", 4747 | "argument", 4748 | "context", 4749 | "considers", 4750 | "understanding", 4751 | "publication", 4752 | "published", 4753 | "essays", 4754 | "arguments", 4755 | "understand", 4756 | "idea", 4757 | "principles", 4758 | "historians", 4759 | "opinions", 4760 | "article", 4761 | "according", 4762 | "philosophical", 4763 | "myth" 4764 | ], 4765 | [ 4766 | "government", 4767 | "state", 4768 | "president", 4769 | "law", 4770 | "political", 4771 | "governor", 4772 | "election", 4773 | "minister", 4774 | "national", 4775 | "act", 4776 | "congress", 4777 | "party", 4778 | "federal", 4779 | "campaign", 4780 | "general", 4781 | "civil", 4782 | "constitution", 4783 | "rights", 4784 | "secretary", 4785 | "states", 4786 | "policy", 4787 | "office", 4788 | "elected", 4789 | "vote", 4790 | "bill", 4791 | "parliament", 4792 | "democratic", 4793 | "support", 4794 | "politics", 4795 | "term", 4796 | "majority", 4797 | "candidate", 4798 | "representative", 4799 | "administration", 4800 | "labor", 4801 | "justice", 4802 | "citizens", 4803 | "legislation", 4804 | "opposition", 4805 | "authority", 4806 | "convention", 4807 | "senator", 4808 | "economic", 4809 | "court", 4810 | "tax", 4811 | "parties", 4812 | "held", 4813 | "elections", 4814 | "constitutional", 4815 | "cabinet" 4816 | ], 4817 | [ 4818 | "season", 4819 | "game", 4820 | "team", 4821 | "league", 4822 | "club", 4823 | "games", 4824 | "played", 4825 | "scored", 4826 | "goal", 4827 | "play", 4828 | "runs", 4829 | "players", 4830 | "cup", 4831 | "goals", 4832 | "ball", 4833 | "innings", 4834 | "football", 
4835 | "player", 4836 | "yards", 4837 | "test", 4838 | "scoring", 4839 | "win", 4840 | "victory", 4841 | "match", 4842 | "bowl", 4843 | "playing", 4844 | "score", 4845 | "career", 4846 | "championship", 4847 | "matches", 4848 | "arsenal", 4849 | "run", 4850 | "yard", 4851 | "lap", 4852 | "points", 4853 | "lead", 4854 | "seasons", 4855 | "stadium", 4856 | "baseball", 4857 | "field", 4858 | "miller", 4859 | "appearances", 4860 | "series", 4861 | "quarterback", 4862 | "pitch", 4863 | "draw", 4864 | "bowling", 4865 | "assists", 4866 | "batting", 4867 | "wicket" 4868 | ], 4869 | [ 4870 | "use", 4871 | "order", 4872 | "human", 4873 | "possible", 4874 | "required", 4875 | "evidence", 4876 | "case", 4877 | "role", 4878 | "process", 4879 | "effect", 4880 | "change", 4881 | "nature", 4882 | "result", 4883 | "involved", 4884 | "non", 4885 | "given", 4886 | "events", 4887 | "groups", 4888 | "lack", 4889 | "particular", 4890 | "information", 4891 | "provide", 4892 | "life", 4893 | "certain", 4894 | "example", 4895 | "status", 4896 | "force", 4897 | "research", 4898 | "need", 4899 | "natural", 4900 | "period", 4901 | "view", 4902 | "increase", 4903 | "problems", 4904 | "necessary", 4905 | "action", 4906 | "individual", 4907 | "structure", 4908 | "responsible", 4909 | "cause", 4910 | "data", 4911 | "form", 4912 | "control", 4913 | "able", 4914 | "terms", 4915 | "protection", 4916 | "related", 4917 | "complex", 4918 | "response", 4919 | "basis" 4920 | ], 4921 | [ 4922 | "earth", 4923 | "sun", 4924 | "surface", 4925 | "star", 4926 | "planet", 4927 | "mass", 4928 | "light", 4929 | "moon", 4930 | "years", 4931 | "stars", 4932 | "solar", 4933 | "atmosphere", 4934 | "giant", 4935 | "system", 4936 | "orbit", 4937 | "space", 4938 | "magnitude", 4939 | "observed", 4940 | "distance", 4941 | "mars", 4942 | "discovery", 4943 | "cloud", 4944 | "observations", 4945 | "jupiter", 4946 | "dwarf", 4947 | "objects", 4948 | "object", 4949 | "rotation", 4950 | "motion", 4951 | "planets", 4952 | "dust", 4953 | "saturn", 4954 | "gravity", 4955 | "masses", 4956 | "massive", 4957 | "ago", 4958 | "approximately", 4959 | "telescope", 4960 | "activity", 4961 | "disk", 4962 | "roughly", 4963 | "spacecraft", 4964 | "mercury", 4965 | "sky", 4966 | "discovered", 4967 | "measurements", 4968 | "times", 4969 | "deep", 4970 | "zone", 4971 | "type" 4972 | ], 4973 | [ 4974 | "game", 4975 | "player", 4976 | "games", 4977 | "released", 4978 | "players", 4979 | "gameplay", 4980 | "release", 4981 | "video", 4982 | "character", 4983 | "series", 4984 | "version", 4985 | "japan", 4986 | "versions", 4987 | "playstation", 4988 | "features", 4989 | "titles", 4990 | "nintendo", 4991 | "received", 4992 | "console", 4993 | "mode", 4994 | "enemies", 4995 | "development", 4996 | "ign", 4997 | "reviewers", 4998 | "screen", 4999 | "soundtrack", 5000 | "action", 5001 | "wii", 5002 | "multiplayer", 5003 | "xbox", 5004 | "gaming", 5005 | "releases", 5006 | "playable", 5007 | "original", 5008 | "content", 5009 | "reception", 5010 | "japanese", 5011 | "level", 5012 | "person", 5013 | "north_america", 5014 | "protagonist", 5015 | "characters", 5016 | "units", 5017 | "graphics", 5018 | "team", 5019 | "levels", 5020 | "copies", 5021 | "feature", 5022 | "battles", 5023 | "developed" 5024 | ], 5025 | [ 5026 | "species", 5027 | "birds", 5028 | "genus", 5029 | "specimens", 5030 | "animal", 5031 | "bird", 5032 | "tail", 5033 | "habitat", 5034 | "taxonomy", 5035 | "females", 5036 | "males", 5037 | "diet", 5038 | "found", 5039 | "body", 5040 | "teeth", 5041 | 
"populations", 5042 | "fish", 5043 | "male", 5044 | "skull", 5045 | "nests", 5046 | "breeding", 5047 | "female", 5048 | "prey", 5049 | "range", 5050 | "common", 5051 | "shark", 5052 | "subspecies", 5053 | "predators", 5054 | "specimen", 5055 | "eggs", 5056 | "animals", 5057 | "ecology", 5058 | "bodies", 5059 | "mature", 5060 | "insects", 5061 | "sexes", 5062 | "fossils", 5063 | "feeding", 5064 | "sharks", 5065 | "extinction", 5066 | "nest", 5067 | "jaws", 5068 | "interspecific", 5069 | "mating", 5070 | "occurs", 5071 | "whale", 5072 | "humans", 5073 | "juveniles", 5074 | "abundant", 5075 | "fruit" 5076 | ], 5077 | [ 5078 | "television", 5079 | "award", 5080 | "awards", 5081 | "dvd", 5082 | "nominated", 5083 | "critics", 5084 | "video", 5085 | "animated", 5086 | "actress", 5087 | "festival", 5088 | "magazine", 5089 | "drama", 5090 | "bbc", 5091 | "media", 5092 | "hollywood", 5093 | "new_york_times", 5094 | "movie", 5095 | "entertainment", 5096 | "writer", 5097 | "comedy", 5098 | "nomination", 5099 | "nominations", 5100 | "disney", 5101 | "female", 5102 | "documentary", 5103 | "ray", 5104 | "theatrical", 5105 | "critic", 5106 | "outstanding", 5107 | "animation", 5108 | "male", 5109 | "reviews", 5110 | "digital", 5111 | "anime", 5112 | "audio", 5113 | "acclaim", 5114 | "actor", 5115 | "audience", 5116 | "youtube", 5117 | "credits", 5118 | "film", 5119 | "poll", 5120 | "voice", 5121 | "creative", 5122 | "costumes", 5123 | "weekly", 5124 | "emmy", 5125 | "thriller", 5126 | "washington_post", 5127 | "reporter" 5128 | ], 5129 | [ 5130 | "god", 5131 | "church", 5132 | "christian", 5133 | "religious", 5134 | "temple", 5135 | "worship", 5136 | "language", 5137 | "languages", 5138 | "traditions", 5139 | "tradition", 5140 | "religion", 5141 | "jesus", 5142 | "christianity", 5143 | "kingdom", 5144 | "hindu", 5145 | "spoken", 5146 | "legend", 5147 | "christ", 5148 | "holy", 5149 | "text", 5150 | "words", 5151 | "word", 5152 | "kings", 5153 | "inscriptions", 5154 | "spiritual", 5155 | "christians", 5156 | "texts", 5157 | "maya", 5158 | "faith", 5159 | "congregation", 5160 | "gods", 5161 | "divine", 5162 | "goddess", 5163 | "bible", 5164 | "shiva", 5165 | "tomb", 5166 | "buddhist", 5167 | "churches", 5168 | "deity", 5169 | "sacred", 5170 | "king", 5171 | "shrines", 5172 | "practices", 5173 | "shrine", 5174 | "monastery", 5175 | "dialect", 5176 | "ritual", 5177 | "baptism", 5178 | "hebrew", 5179 | "secular" 5180 | ], 5181 | [ 5182 | "storm", 5183 | "tropical", 5184 | "hurricane", 5185 | "winds", 5186 | "depression", 5187 | "mph", 5188 | "wind", 5189 | "cyclone", 5190 | "september", 5191 | "august", 5192 | "damage", 5193 | "rainfall", 5194 | "day", 5195 | "reported", 5196 | "intensity", 5197 | "utc", 5198 | "pressure", 5199 | "near", 5200 | "system", 5201 | "occurred", 5202 | "hours", 5203 | "caused", 5204 | "damaged", 5205 | "category", 5206 | "flooding", 5207 | "convection", 5208 | "sustained", 5209 | "october", 5210 | "northeast", 5211 | "attained", 5212 | "landfall", 5213 | "heavy", 5214 | "moving", 5215 | "estimated", 5216 | "homes", 5217 | "southeast", 5218 | "south", 5219 | "reached", 5220 | "remnants", 5221 | "low", 5222 | "moved", 5223 | "disturbance", 5224 | "north", 5225 | "circulation", 5226 | "severe", 5227 | "developed", 5228 | "upgraded", 5229 | "weakened", 5230 | "northwest", 5231 | "early" 5232 | ], 5233 | [ 5234 | "police", 5235 | "court", 5236 | "people", 5237 | "trial", 5238 | "report", 5239 | "prison", 5240 | "incident", 5241 | "arrested", 5242 | "violence", 5243 | "murder", 5244 | 
"judge", 5245 | "criminal", 5246 | "crime", 5247 | "investigation", 5248 | "jewish", 5249 | "men", 5250 | "charges", 5251 | "committed", 5252 | "person", 5253 | "news", 5254 | "accused", 5255 | "gay", 5256 | "security", 5257 | "victims", 5258 | "jury", 5259 | "crimes", 5260 | "arrest", 5261 | "defense", 5262 | "reports", 5263 | "media", 5264 | "newspaper", 5265 | "statement", 5266 | "officers", 5267 | "alleged", 5268 | "convicted", 5269 | "newspapers", 5270 | "violent", 5271 | "guilty", 5272 | "actions", 5273 | "drug", 5274 | "suicide", 5275 | "illegal", 5276 | "conspiracy", 5277 | "witnesses", 5278 | "protest", 5279 | "prisoners", 5280 | "lawyers", 5281 | "authorities", 5282 | "charge", 5283 | "case" 5284 | ] 5285 | ], 5286 | "c_npmi_10_full_all": [ 5287 | 0.03828346256626948, 5288 | 0.1038583234675748, 5289 | 0.015366944419668899, 5290 | 0.0900476418994117, 5291 | 0.1020442098673313, 5292 | 0.03190760464777211, 5293 | 0.16677015392348543, 5294 | -0.008954957865953978, 5295 | 0.012054122204282256, 5296 | -0.004889701871966398, 5297 | 0.15132187821689855, 5298 | 0.13286033788304982, 5299 | 0.21361278838454933, 5300 | 0.06768430328713142, 5301 | 0.25295664311845517, 5302 | 0.19214197022094048, 5303 | 0.1246862697135218, 5304 | 0.13727695971556078, 5305 | 0.15977332312391387, 5306 | 0.07607815548878252, 5307 | 0.08921236964042889, 5308 | 0.14721562942824956, 5309 | 0.04320981988759863, 5310 | 0.1348618452599969, 5311 | 0.19853561056687896, 5312 | 0.21675332548983514, 5313 | 0.20602537817919672, 5314 | 0.05120785577156596, 5315 | 0.15718236651705925, 5316 | 0.03070653982222502, 5317 | 0.021220918893205067, 5318 | 0.11234754553902984, 5319 | 0.037713983698919415, 5320 | 0.020353224603349568, 5321 | 0.04964237205892103, 5322 | 0.16596275544055908, 5323 | 0.0976401097020574, 5324 | 0.13018839146257705, 5325 | 0.134779218822221, 5326 | 0.1173630971965991, 5327 | 0.09886537661229274, 5328 | 0.22095588022862706, 5329 | 0.03753952947452625, 5330 | 0.12707757576728707, 5331 | 0.12641353705174047, 5332 | 0.18913925649606136, 5333 | 0.1493844703305789, 5334 | 0.12044245581211913, 5335 | 0.2556792335636147, 5336 | 0.12187208376841602 5337 | ], 5338 | "path": "outputs/full-mindf_power_law-maxdf_0.9/wikitext/k-50/etm/lr_0.001-reg_1.2e-05-epochs_1000-anneal_lr_0/42" 5339 | } 5340 | } --------------------------------------------------------------------------------