├── .gitignore
├── README.md
├── src-human-correlations
│   ├── chatGPT_evaluate_intruders.py
│   ├── chatGPT_evaluate_topic_ratings.py
│   ├── generate_intruder_words_dataset.py
│   └── human_correlations_bootstrap.py
├── src-misc
│   ├── 01-get_data.sh
│   ├── 02-process_data.ipynb
│   ├── 02-process_data.py
│   ├── 03-figures_nmpi_llm.py
│   ├── 04-launch_figures_nmpi_llm.sh
│   ├── 05-find_rating_errors.ipynb
│   ├── 06-figures_ari_llm.py
│   ├── 07-pairwise_scores.ipynb
│   └── fig_utils.py
├── src-number-of-topics
│   ├── LLM_scores_and_ARI.py
│   ├── chatGPT_document_label_assignment.py
│   └── chatGPT_ratings_assignment.py
└── topic-modeling-output
    ├── dvae-topics-best-c_npmi_10_full.json
    ├── etm-topics-best-c_npmi_10_full.json
    └── mallet-topics-best-c_npmi_10_full.json
/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__
2 | repo-old
3 | *.pyc
4 | computed/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Revisiting Automated Topic Model Evaluation with Large Language Models
2 |
3 | This repo contains code and data for our [EMNLP 2023 paper](https://aclanthology.org/2023.emnlp-main.581/) about assessing topic model output with Large Language Models.
4 | ```
5 | @inproceedings{stammbach-etal-2023-revisiting,
6 | title = "Revisiting Automated Topic Model Evaluation with Large Language Models",
7 | author = "Stammbach, Dominik and
8 | Zouhar, Vil{\'e}m and
9 | Hoyle, Alexander and
10 | Sachan, Mrinmaya and
11 | Ash, Elliott",
12 | booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
13 | month = dec,
14 | year = "2023",
15 | address = "Singapore",
16 | publisher = "Association for Computational Linguistics",
17 | url = "https://aclanthology.org/2023.emnlp-main.581",
18 | doi = "10.18653/v1/2023.emnlp-main.581",
19 | pages = "9348--9357"
20 | }
21 | ```
22 |
23 | ## Prerequisites
24 |
25 | ```shell
26 | pip install "openai<1.0" pandas scipy scikit-learn tqdm matplotlib
27 | ```
28 |
29 | ## Large Language Models and Topics with Human Annotations
30 |
31 | Download the topic words and human annotations used in the paper [Is Automated Topic Model Evaluation Broken?](https://arxiv.org/abs/2107.02173) from their [GitHub repository](https://github.com/ahoho/topics/blob/dev/data/human/all_data/all_data.json).
32 |
33 |
34 | ### Intruder Detection Test
35 |
36 | Following Hoyle et al. (2021), for each topic we randomly sample intruder words that are not among its top 50 topic words.
37 |
38 | ```shell
39 | python src-human-correlations/generate_intruder_words_dataset.py
40 | ```
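
The sampling in this script boils down to the following sketch (illustrative; as in `generate_intruder_words_dataset.py`, the candidate pool is the union of all topics' top words for the same dataset and model):

```python
import json
import random

random.seed(42)

# top-word lists per topic, as shipped in topic-modeling-output/
with open("topic-modeling-output/mallet-topics-best-c_npmi_10_full.json") as f:
    topics = json.load(f)["wikitext"]["topics"]

# candidate intruders come from the pooled top words of all topics
vocabulary = {word for topic in topics for word in topic}

for topic in topics:
    candidates = [w for w in vocabulary if w not in set(topic)]
    intruders = random.sample(candidates, k=10)  # 10 intruder candidates per topic
```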
41 |
42 | We can then call an LLM to automatically identify the intruder word in each topic word set.
43 |
44 | ```shell
45 | python src-human-correlations/chatGPT_evaluate_intruders.py --API_KEY a_valid_openAI_api_key
46 | ```
47 |
48 | For the ratings task, simply run the script that rates topic word sets (there is no need to generate a dataset first):
49 |
50 | ```shell
51 | python src-human-correlations/chatGPT_evaluate_topic_ratings.py --API_KEY a_valid_openAI_api_key
52 | ```
53 | (In case the OpenAI API fails midway, all output produced so far is saved to a JSONL file; we would then restart the script while skipping the already annotated datapoints.)
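
The released scripts open their output file in write mode; a minimal resume pattern along these lines (illustrative, reusing the field names the ratings script writes) would be:

```python
import json
import os

outfile_name = "coherence-outputs-section-2/ratings_outfile_with_dataset_description.jsonl"

# remember which datapoints a previous (interrupted) run already annotated
done = set()
if os.path.exists(outfile_name):
    with open(outfile_name) as f:
        for line in f:
            record = json.loads(line)
            done.add((record["dataset_name"], record["model_name"], record["topic_id"], record["run"]))

# then open the file in append mode and skip any datapoint whose key is already in `done`
# before issuing a new API call for it
```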
54 |
55 |
56 | ### Evaluating LLM and Human Correlations
57 |
58 | We evaluate using a bootstrap approach: for each datapoint we resample the human annotations and the LLM annotations with replacement, average the resampled annotations, and compute a Spearman's rho for each bootstrapped sample. We report the mean Spearman's rho over all 1000 bootstrapped samples.
59 |
60 | ```shell
61 | python src-human-correlations/human_correlations_bootstrap.py --filename coherence-outputs-section-2/ratings_outfile_with_dataset_description.jsonl --task ratings
62 | ```
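
Schematically, the bootstrap works as follows (toy per-topic score lists; the real script reads the raw human scores from `all_data.json` and the LLM scores from the annotation output):

```python
import random
import numpy as np
from scipy.stats import spearmanr

random.seed(42)

# toy example: raw human and LLM ratings per topic
human_scores = [[3, 2, 3], [1, 2, 1], [2, 2, 3], [1, 1, 2], [3, 3, 3], [2, 1, 2]]
llm_scores   = [[3, 3, 2], [1, 1, 2], [3, 2, 2], [2, 1, 1], [3, 2, 3], [1, 2, 2]]

rhos = []
for _ in range(1000):
    human_avg, llm_avg = [], []
    for h, g in zip(human_scores, llm_scores):
        # resample each topic's annotations with replacement, then average
        human_avg.append(np.mean(random.choices(h, k=len(h))))
        llm_avg.append(np.mean(random.choices(g, k=len(g))))
    rhos.append(spearmanr(human_avg, llm_avg).statistic)

print("mean Spearman's rho:", np.mean(rhos))
```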
63 |
64 |
65 | ## Evaluating Topic Models with Different Numbers of Topics
66 |
67 | Download the fitted topic models and metadata for the two datasets (bills and wikitext) [here](https://www.dropbox.com/s/huxdloe5l6w2tu5/topic_model_k_selection.zip?dl=0) and unzip the archive.
68 |
69 | ### Rating Topic Word Sets
70 |
71 | To run LLM ratings of topic word sets on a dataset (bills or wikitext) with broad or specific ground-truth example topics, simply run:
72 |
73 | ```shell
74 | python src-number-of-topics/chatGPT_ratings_assignment.py --API_KEY a_valid_openAI_api_key --dataset wikitext --label_categories broad
75 | ```
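
The resulting per-topic ratings are later averaged per number of topics *k* and lightly smoothed before being compared with clustering metrics (see `src-misc/03-figures_nmpi_llm.py`); schematically, assuming the column names and path layout that script expects:

```python
import pandas as pd

N_CLUSTER = range(20, 420, 20)

df = pd.read_csv("LLM-scores/LLM_outputs_wikitext_broad.csv")
df["gpt_ratings"] = df.chatGPT_eval.astype(int)

# mean LLM rating for each number of topics k
average_goodness = []
for k in N_CLUSTER:
    path = f"runs/outputs/k_selection/wikitext-labeled/vocab_15k/k-{k}/2972"
    average_goodness.append(df[df.path == path].gpt_ratings.mean())

# trailing moving average to dampen outliers (the plotting script uses a similar helper)
smoothed = pd.Series(average_goodness).rolling(window=4, min_periods=1).mean()
```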
76 |
77 | ### Purity of Document Collections
78 |
79 | We also assign a document label to the top documents belonging to each topic, following [Doogan and Buntine, 2021](https://aclanthology.org/2021.naacl-main.300/). We then average purity per document collection, and the number of topics with the highest average purity is the one preferred by this procedure.
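
A minimal sketch of the purity computation (the labels here are made up; in the actual pipeline they come from `chatGPT_document_label_assignment.py`):

```python
from collections import Counter
import numpy as np

def purity(labels):
    """Share of the most frequent label among the top documents of one topic."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

# illustrative: assigned labels for the top documents of three topics
labels_per_topic = [
    ["sports", "sports", "sports", "politics"],
    ["music", "music", "film", "music"],
    ["politics", "politics", "politics", "politics"],
]

# one average-purity value per topic-model run (i.e., per number of topics k)
avg_purity = np.mean([purity(labels) for labels in labels_per_topic])
print(avg_purity)  # (0.75 + 0.75 + 1.0) / 3 ≈ 0.83
```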
80 |
81 | To run LLM label assignments on a dataset (bills or wikitext) with broad or specific ground-truth example topics, simply run:
82 |
83 | ```shell
84 | python src-number-of-topics/chatGPT_document_label_assignment.py --API_KEY a_valid_openAI_api_key --dataset wikitext --label_categories broad
85 | ```
86 |
87 | ### Plot resulting scores
88 |
89 | ```shell
90 | python src-number-of-topics/LLM_scores_and_ARI.py --method label_assignment --dataset bills --label_categories broad --filename number-of-topics-section-4/document_label_assignment_wikitext_broad.jsonl
91 | ```
92 |
93 | ## Questions
94 |
95 | Please contact [Dominik Stammbach](mailto:dominik.stammbach@gess.ethz.ch) regarding any questions.
96 |
97 |
98 | [](https://www.youtube.com/watch?v=qIDjtgWTgjs)
99 |
--------------------------------------------------------------------------------
/src-human-correlations/chatGPT_evaluate_intruders.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import json
3 | import random
4 | import openai
5 | from tqdm import tqdm
6 | from ast import literal_eval
7 | from collections import defaultdict
8 | import time
9 | import argparse
10 | import os
11 |
12 | def get_prompts(include_dataset_description="include"):
13 | if include_dataset_description == "include":
14 | system_prompt_NYT = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Select which word is the least related to all other words. If multiple words do not fit, choose the word that is most out of place.
15 | The topic modeling is based on The New York Times corpus. The corpus consists of articles from 1987 to 2007. Sections from a typical paper include International, National, New York Regional, Business, Technology, and Sports news; features on topics such as Dining, Movies, Travel, and Fashion; there are also obituaries and opinion pieces.
16 | Reply with a single word."""
17 |
18 | system_prompt_wikitext = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Select which word is the least related to all other words. If multiple words do not fit, choose the word that is most out of place.
19 | The topic modeling is based on the Wikipedia corpus. Wikipedia is an online encyclopedia covering a huge range of topics. Articles can include biographies ("George Washington"), scientific phenomena ("Solar Eclipse"), art pieces ("La Danse"), music ("Amazing Grace"), transportation ("U.S. Route 131"), sports ("1952 winter olympics"), historical events or periods ("Tang Dynasty"), media and pop culture ("The Simpsons Movie"), places ("Yosemite National Park"), plants and animals ("koala"), and warfare ("USS Nevada (BB-36)"), among others.
20 | Reply with a single word."""
21 | outfile_name = "coherence-outputs-section-2/intrusion_outfile_with_dataset_description.jsonl"
22 | else:
23 | system_prompt_NYT = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Select which word is the least related to all other words. If multiple words do not fit, choose the word that is most out of place.
24 | Reply with a single word."""
25 |
26 | system_prompt_wikitext = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Select which word is the least related to all other words. If multiple words do not fit, choose the word that is most out of place.
27 | Reply with a single word."""
28 | outfile_name = "coherence-outputs-section-2/intrusion_outfile_without_dataset_description.jsonl"
29 | return system_prompt_NYT, system_prompt_wikitext, outfile_name
30 |
31 | if __name__ == "__main__":
32 | parser = argparse.ArgumentParser()
33 | parser.add_argument("--API_KEY", default="openai API key", type=str, required=True, help="valid openAI API key")
34 | parser.add_argument("--dataset_description", default="include", type=str, help="whether to include a dataset description or not, default = include.")
35 | args = parser.parse_args()
36 |
37 |
38 | openai.api_key = args.API_KEY
39 | random.seed(42)
40 | df = pd.read_json("intruder_outfile.jsonl", lines=True)
41 |
42 | system_prompt_NYT, system_prompt_wikitext, outfile_name = get_prompts(include_dataset_description=args.dataset_description)
43 | os.makedirs("coherence-outputs-section-2", exist_ok=True)
44 |
45 |
46 | columns = df.columns.tolist()
47 |
48 | with open(outfile_name, "w") as outfile:
49 | for i, row in tqdm(df.iterrows(), total=len(df)):
50 | if row.dataset_name == "wikitext":
51 | system_prompt = system_prompt_wikitext
52 | else:
53 | system_prompt = system_prompt_NYT
54 |
55 | words = row.topic_terms
56 | # shuffle words
57 | random.shuffle(words)
58 | # we only prompt 5 words
59 | words = words[:5]
60 |
61 | # we add intruder term
62 | intruder_term = row.intruder_term
63 |
64 | # we shuffle again
65 | words.append(intruder_term)
66 | random.shuffle(words)
67 |
68 | # we have a user prompt
69 | user_prompt = ", ".join(['"' + w + '"' for w in words])
70 |
71 | out = {col: row[col] for col in columns}
72 | response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}], temperature=0.7, max_tokens=15)["choices"][0]["message"]["content"].strip()
73 | out["response"] = response
74 |             out["user_prompt"] = user_prompt
75 | json.dump(out, outfile)
76 | outfile.write("\n")
77 | time.sleep(0.5)
78 |
79 |
80 |
--------------------------------------------------------------------------------
/src-human-correlations/chatGPT_evaluate_topic_ratings.py:
--------------------------------------------------------------------------------
1 | import json
2 | import random
3 | import openai
4 | from tqdm import tqdm
5 | import pandas as pd
6 | import time
7 | import argparse
8 | import os
9 |
10 |
11 | def get_prompts(include_dataset_description="include"):
12 |     if include_dataset_description == "include":
13 | system_prompt_NYT = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related).
14 | The topic modeling is based on The New York Times corpus. The corpus consists of articles from 1987 to 2007. Sections from a typical paper include International, National, New York Regional, Business, Technology, and Sports news; features on topics such as Dining, Movies, Travel, and Fashion; there are also obituaries and opinion pieces.
15 | Reply with a single number, indicating the overall appropriateness of the topic."""
16 |
17 | system_prompt_wikitext = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related).
18 | The topic modeling is based on the Wikipedia corpus. Wikipedia is an online encyclopedia covering a huge range of topics. Articles can include biographies ("George Washington"), scientific phenomena ("Solar Eclipse"), art pieces ("La Danse"), music ("Amazing Grace"), transportation ("U.S. Route 131"), sports ("1952 winter olympics"), historical events or periods ("Tang Dynasty"), media and pop culture ("The Simpsons Movie"), places ("Yosemite National Park"), plants and animals ("koala"), and warfare ("USS Nevada (BB-36)"), among others.
19 | Reply with a single number, indicating the overall appropriateness of the topic."""
20 | outfile_name = "coherence-outputs-section-2/ratings_outfile_with_dataset_description.jsonl"
21 | else:
22 | system_prompt_NYT = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related).
23 | Reply with a single number, indicating the overall appropriateness of the topic."""
24 |
25 | system_prompt_wikitext = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related).
26 | Reply with a single number, indicating the overall appropriateness of the topic."""
27 | outfile_name = "coherence-outputs-section-2/ratings_outfile_without_dataset_description.jsonl"
28 | return system_prompt_NYT, system_prompt_wikitext, outfile_name
29 |
30 |
31 | if __name__ == "__main__":
32 |
33 | parser = argparse.ArgumentParser()
34 | parser.add_argument("--API_KEY", default="openai API key", type=str, required=True, help="valid openAI API key")
35 | parser.add_argument("--dataset_description", default="include", type=str, help="whether to include a dataset description or not, default = include.")
36 | args = parser.parse_args()
37 |
38 | openai.api_key = args.API_KEY
39 | random.seed(42)
40 |
41 | with open("all_data.json") as f:
42 | data = json.load(f)
43 | random.seed(42)
44 |
45 | system_prompt_NYT, system_prompt_wikitext, outfile_name = get_prompts(include_dataset_description=args.dataset_description)
46 | os.makedirs("coherence-outputs-section-2", exist_ok=True)
47 |
48 | with open(outfile_name, "w") as outfile:
49 | for dataset_name, dataset in data.items():
50 | for model_name, dataset_model in dataset.items():
51 | print (dataset_name, model_name)
52 | topics = dataset_model["topics"]
53 | human_evaluations = dataset_model["metrics"]["ratings_scores_avg"]
54 | i = 0
55 | for topic, human_eval in tqdm(zip(topics, human_evaluations), total=50):
56 | topic = topic[:10]
57 | for run in range(3):
58 | random.shuffle(topic)
59 | user_prompt = ", ".join(topic)
60 | if dataset_name == "wikitext":
61 | system_prompt = system_prompt_wikitext
62 | else:
63 | system_prompt = system_prompt_NYT
64 | response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}], temperature=1.0, logit_bias={16:100, 17:100, 18:100}, max_tokens=1)["choices"][0]["message"]["content"].strip()
65 | out = {"dataset_name": dataset_name, "model_name": model_name, "topic_id": i, "user_prompt": user_prompt, "response": response, "human_eval":human_eval, "run": run}
66 | json.dump(out, outfile)
67 | outfile.write("\n")
68 | time.sleep(0.5)
69 | i += 1
70 |
71 |
--------------------------------------------------------------------------------
/src-human-correlations/generate_intruder_words_dataset.py:
--------------------------------------------------------------------------------
1 | import json
2 | import random
3 |
4 |
5 | if __name__ == "__main__":
6 | with open("all_data.json") as f:
7 | data = json.load(f)
8 |
9 | random.seed(42)
10 | with open("intruder_outfile.jsonl", "w") as outfile:
11 | for dataset_name, dataset in data.items():
12 | for model_name, dataset_model in dataset.items():
13 | print (dataset_name, model_name)
14 | if model_name == "mallet":
15 | fn = "topic-modeling-output/mallet-topics-best-c_npmi_10_full.json"
16 | elif model_name == "dvae":
17 | fn = "topic-modeling-output/dvae-topics-best-c_npmi_10_full.json"
18 | else:
19 | fn = "topic-modeling-output/etm-topics-best-c_npmi_10_full.json"
20 | with open(fn) as f:
21 | topics_data = json.load(f)
22 |
23 | raw_topics = topics_data[dataset_name]["topics"]
24 |
25 | words = set()
26 | for topic in raw_topics:
27 | words.update(topic)
28 | words = list(set(words))
29 | for i, (topic, metric, double_check) in enumerate(zip(raw_topics, dataset_model["metrics"]["intrusion_scores_avg"], dataset_model["topics"])):
30 | topic_set = set(topic)
31 | candidate_words = [i for i in words if i not in topic_set]
32 | random.shuffle(candidate_words)
33 | sampled_intruders = candidate_words[:10]
34 | for intruder in sampled_intruders:
35 | out = {}
36 | out["topic_id"] = i
37 | out["intruder_term"] = intruder
38 | out["topic_terms"] = topic[:10]
39 | out["intrusion_scores_avg"] = metric
40 | out["dataset_name"] = dataset_name
41 | out["model_name"] = model_name
42 | json.dump(out, outfile)
43 | outfile.write("\n")
44 |
45 |
--------------------------------------------------------------------------------
/src-human-correlations/human_correlations_bootstrap.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import re
3 | from scipy.stats import spearmanr
4 | import numpy as np
5 | from sklearn.metrics import accuracy_score
6 | import random, json
7 | from ast import literal_eval
8 | from tqdm import tqdm
9 | import os
10 | import argparse
11 |
12 | def load_dataframe(fn, task=""):
13 | if fn.endswith(".csv"):
14 | df = pd.read_csv(fn)
15 | elif fn.endswith("jsonl"):
16 | df = pd.read_json(fn, lines=True)
17 | print (df)
18 | print (df.iloc[0])
19 | if task == "intrusion":
20 | if "response_correct" not in df.columns:
21 | df["response_correct"] = postprocess_chatGPT_intrusion_test(df)
22 | else:
23 |
24 | if "gpt_ratings" not in df.columns:
25 | df["gpt_ratings"] = df.response.astype(int)
26 | return df
27 |
28 | def postprocess_chatGPT_intrusion_test(df):
29 | response = df.response.tolist()
30 | response = [i.lower() for i in response]
31 | response = [re.findall(r"\b[a-z_\d]+\b", i) for i in response]
32 | response_correct = []
33 | #df.topic_terms = df.topic_terms.apply(lambda x: literal_eval(x))
34 | #for i,j,k in zip(response, df.intruder_term, df.topic_terms):
35 | for i,j in zip(response, df.intruder_term):
36 | if not i:
37 | response_correct.append(0)
38 | elif i[0] == j:
39 | response_correct.append(1)
40 | else:
41 | response_correct.append(0)
42 | return response_correct
43 |
44 | def postprocess_chatGPT_intrusion_new(df):
45 |     # 1 if the (non-empty) response matches the intruder term, else 0
46 |     response_correct = []
47 |     for response, intruder_term in zip(df.response, df.intruder_term):
48 |         if response and response.strip().lower() == intruder_term.lower():
49 |             response_correct.append(1)
50 |         else:
51 |             response_correct.append(0)
52 |     return response_correct
53 |
54 | def get_confidence_intervals(statistic):
55 | statistic = sorted(statistic)
56 |     lower = statistic[4]   # ~2.5th percentile: with 150 bootstrap samples, 2.5% of 150 = 3.75, i.e. roughly the 4th lowest value
57 |     upper = statistic[-4]  # ~97.5th percentile: the 4th highest value
58 | print ("lower", lower, "upper", upper)
59 |
60 | def get_filenames(with_dataset_description = True):
61 | if with_dataset_description:
62 | intrusion_fn = "coherence-outputs-section-2/intrusion_outfile_with_dataset_description.jsonl"
63 |         rating_fn = "coherence-outputs-section-2/ratings_outfile_with_dataset_description.jsonl"
64 | else:
65 |         intrusion_fn = "coherence-outputs-section-2/intrusion_outfile_without_dataset_description.jsonl"
66 | rating_fn = "coherence-outputs-section-2/ratings_outfile_without_dataset_description.jsonl"
67 | return intrusion_fn, rating_fn
68 |
69 |
70 | def compute_human_ceiling_intrusion(data, only_confident = False):
71 | ratings_human, ratings_chatGPT = [], []
72 | spearman_wiki = []
73 | spearman_NYT = []
74 | spearman_concat = []
75 | for _ in tqdm(range(1000), total=1000):
76 | ratings_human, ratings_chatGPT = [], []
77 | for dataset in ["wikitext", "nytimes"]:
78 | for model in ["mallet", "dvae", "etm"]:
79 | for topic_id in range(50):
80 | intrusion_scores_raw = data[dataset][model]["metrics"]["intrusion_scores_raw"][topic_id]
81 | if only_confident:
82 | intrusion_scores_raw = [i for i,j in zip(intrusion_scores_raw, data[dataset][model]["metrics"]["intrusion_confidences_raw"][topic_id]) if j == 1]
83 | if not intrusion_scores_raw:
84 | intrusion_scores_raw = [0]
85 |
86 | if len(intrusion_scores_raw) == 1:
87 | intrusion_scores_1 = intrusion_scores_raw
88 | intrusion_scores_2 = intrusion_scores_raw
89 | else:
90 | length = len(intrusion_scores_raw) // 2
91 | intrusion_scores_1 = intrusion_scores_raw[:length]
92 | intrusion_scores_2 = intrusion_scores_raw[length:]
93 | intrusion_scores_1 = random.choices(intrusion_scores_1, k=len(intrusion_scores_1))
94 | intrusion_scores_2 = random.choices(intrusion_scores_2, k=len(intrusion_scores_2))
95 | ratings_human.append(np.mean(intrusion_scores_1))
96 | ratings_chatGPT.append(np.mean(intrusion_scores_2))
97 | spearman_wiki.append(spearmanr(ratings_chatGPT[:150], ratings_human[:150]).statistic)
98 | spearman_NYT.append(spearmanr(ratings_chatGPT[150:], ratings_human[150:]).statistic)
99 | spearman_concat.append(spearmanr(ratings_chatGPT, ratings_human).statistic)
100 | print ("wiki", np.mean(spearman_wiki))
101 | #get_confidence_intervals(spearman_wiki)
102 | print ("NYT", np.mean(spearman_NYT))
103 | #get_confidence_intervals(spearman_NYT)
104 | print ("concat", np.mean(spearman_concat))
105 | #get_confidence_intervals(spearman_concat)
106 |
107 | def compute_human_ceiling_rating(data, only_confident = False):
108 | ratings_human, ratings_chatGPT = [], []
109 | spearman_wiki = []
110 | spearman_NYT = []
111 | spearman_concat = []
112 | for _ in tqdm(range(1000), total=1000):
113 | ratings_human, ratings_chatGPT = [], []
114 | for dataset in ["wikitext", "nytimes"]:
115 | for model in ["mallet", "dvae", "etm"]:
116 | for topic_id in range(50):
117 | intrusion_scores_raw = data[dataset][model]["metrics"]["ratings_scores_raw"][topic_id]
118 | if only_confident:
119 | intrusion_scores_raw = [i for i,j in zip(intrusion_scores_raw, data[dataset][model]["metrics"]["ratings_confidences_raw"][topic_id]) if j == 1]
120 |
121 | if not intrusion_scores_raw:
122 | intrusion_scores_raw = [1]
123 | if len(intrusion_scores_raw) == 1:
124 | intrusion_scores_1 = intrusion_scores_raw
125 | intrusion_scores_2 = intrusion_scores_raw
126 | else:
127 | length = len(intrusion_scores_raw) // 2
128 | intrusion_scores_1 = intrusion_scores_raw[:length]
129 | intrusion_scores_2 = intrusion_scores_raw[length:]
130 | intrusion_scores_1 = random.choices(intrusion_scores_1, k=len(intrusion_scores_1))
131 | intrusion_scores_2 = random.choices(intrusion_scores_2, k=len(intrusion_scores_2))
132 | ratings_human.append(np.mean(intrusion_scores_1))
133 | ratings_chatGPT.append(np.mean(intrusion_scores_2))
134 | spearman_wiki.append(spearmanr(ratings_chatGPT[:150], ratings_human[:150]).statistic)
135 | spearman_NYT.append(spearmanr(ratings_chatGPT[150:], ratings_human[150:]).statistic)
136 | spearman_concat.append(spearmanr(ratings_chatGPT, ratings_human).statistic)
137 | print ("wiki", np.mean(spearman_wiki))
138 | #get_confidence_intervals(spearman_wiki)
139 | print ("NYT", np.mean(spearman_NYT))
140 | #get_confidence_intervals(spearman_NYT)
141 | print ("concat", np.mean(spearman_concat))
142 | #get_confidence_intervals(spearman_concat)
143 |
144 |
145 | def compute_spearmanr_bootstrap_intrusion(df_intruder_scores, data, only_confident=False):
146 | ratings_human, ratings_chatGPT = [], []
147 | spearman_wiki = []
148 | spearman_NYT = []
149 | spearman_concat = []
150 |
151 | for _ in tqdm(range(1000), total=1000):
152 | ratings_human, ratings_chatGPT = [], []
153 | for dataset in ["wikitext", "nytimes"]:
154 | for model in ["mallet", "dvae", "etm"]:
155 | for topic_id in range(50):
156 | intrusion_scores_raw = data[dataset][model]["metrics"]["intrusion_scores_raw"][topic_id]
157 | if only_confident:
158 | intrusion_scores_raw = [i for i,j in zip(intrusion_scores_raw, data[dataset][model]["metrics"]["intrusion_confidences_raw"][topic_id]) if j == 1]
159 | # sample bootstrap fold
160 |
161 | intrusion_scores_raw = random.choices(intrusion_scores_raw, k=len(intrusion_scores_raw))
162 | df_topic = df_intruder_scores[(df_intruder_scores.dataset_name == dataset) & (df_intruder_scores.model_name == model) & (df_intruder_scores.topic_id == topic_id)]
163 | gpt_ratings = random.choices(df_topic.response_correct.tolist(), k= len(df_topic.response_correct))
164 |
165 | # save results
166 | ratings_human.append(np.mean(intrusion_scores_raw))
167 | ratings_chatGPT.append(np.mean(gpt_ratings))
168 | # compute spearman_R and save results
169 | spearman_wiki.append(spearmanr(ratings_chatGPT[:150], ratings_human[:150]).statistic)
170 | spearman_NYT.append(spearmanr(ratings_chatGPT[150:], ratings_human[150:]).statistic)
171 | spearman_concat.append(spearmanr(ratings_chatGPT, ratings_human).statistic)
172 | print ("wiki", np.mean(spearman_wiki))
173 | get_confidence_intervals(spearman_wiki)
174 | print ("NYT", np.mean(spearman_NYT))
175 | get_confidence_intervals(spearman_NYT)
176 | print ("concat", np.mean(spearman_concat))
177 | get_confidence_intervals(spearman_concat)
178 |
179 | def compute_spearmanr_bootstrap_rating(df_rating_scores, data, only_confident=False):
180 | ratings_human, ratings_chatGPT = [], []
181 | spearman_wiki = []
182 | spearman_NYT = []
183 | spearman_concat = []
184 |
185 | for _ in tqdm(range(150), total=150):
186 | ratings_human, ratings_chatGPT = [], []
187 | for dataset in ["wikitext", "nytimes"]:
188 | for model in ["mallet", "dvae", "etm"]:
189 | for topic_id in range(50):
190 | rating_scores_raw = data[dataset][model]["metrics"]["ratings_scores_raw"][topic_id]
191 | if only_confident:
192 | rating_scores_raw = [i for i,j in zip(rating_scores_raw, data[dataset][model]["metrics"]["ratings_confidences_raw"][topic_id]) if j == 1]
193 | # sample bootstrap fold
194 | rating_scores_raw = random.choices(rating_scores_raw, k=len(rating_scores_raw))
195 | df_topic = df_rating_scores[(df_rating_scores.dataset_name == dataset) & (df_rating_scores.model_name == model) & (df_rating_scores.topic_id == topic_id)]
196 | gpt_ratings = random.choices(df_topic.gpt_ratings.tolist(), k= len(df_topic.gpt_ratings))
197 |
198 | # save results
199 | ratings_human.append(np.mean(rating_scores_raw))
200 | ratings_chatGPT.append(np.mean(gpt_ratings))
201 | # compute spearman_R and save results
202 | spearman_wiki.append(spearmanr(ratings_chatGPT[:150], ratings_human[:150]).statistic)
203 | spearman_NYT.append(spearmanr(ratings_chatGPT[150:], ratings_human[150:]).statistic)
204 | spearman_concat.append(spearmanr(ratings_chatGPT, ratings_human).statistic)
205 | print ("wiki", np.mean(spearman_wiki))
206 | get_confidence_intervals(spearman_wiki)
207 | print ("NYT", np.mean(spearman_NYT))
208 | get_confidence_intervals(spearman_NYT)
209 | print ("concat", np.mean(spearman_concat))
210 | get_confidence_intervals(spearman_concat)
211 |
212 |
213 |
214 |
215 | if __name__ == "__main__":
216 | parser = argparse.ArgumentParser()
217 | parser.add_argument("--task", default="ratings", type=str)
218 |     parser.add_argument("--filename", default="coherence-outputs-section-2/ratings_outfile_with_dataset_description.jsonl", type=str, help="path to the LLM annotation output file (.jsonl or .csv).")
219 | parser.add_argument("--only_confident", default="true", type=str)
220 | args = parser.parse_args()
221 |
222 | if args.only_confident == "true":
223 | only_confident = True
224 | else:
225 | only_confident = False
226 |
227 | random.seed(42)
228 | path = "coherence-outputs-section-2"
229 |
230 | experiments = ["human_ceiling", "dataset_description", "dataset_description_only_confident", "no_dataset_description"]
231 |
232 | with open("all_data.json") as f:
233 | data = json.load(f)
234 |
235 | if args.task == "human_ceiling":
236 | compute_human_ceiling_intrusion(data, only_confident=only_confident)
237 | compute_human_ceiling_rating(data, only_confident=only_confident)
238 | elif args.task == "ratings":
239 | df_rating = load_dataframe(args.filename)
240 | compute_spearmanr_bootstrap_rating(df_rating, data, only_confident=only_confident)
241 |
242 |     elif args.task == "intrusion":
243 |         df_intrusion = load_dataframe(args.filename, task="intrusion")
244 |         compute_spearmanr_bootstrap_intrusion(df_intrusion, data, only_confident=only_confident)
245 |
--------------------------------------------------------------------------------
/src-misc/01-get_data.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/bash
2 |
3 | mkdir -p data
4 |
5 | wget https://raw.githubusercontent.com/ahoho/topics/dev/data/human/all_data/all_data.json -O data/intrusion.json
--------------------------------------------------------------------------------
/src-misc/02-process_data.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import json\n",
10 | "data = json.load(open(\"../data/intrusion.json\", \"r\"))"
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": null,
16 | "metadata": {},
17 | "outputs": [],
18 | "source": [
19 | "data[\"wikitext\"][\"mallet\"][\"metrics\"].keys()\n",
20 | "\n",
21 | "[(k, len(v), None if type(v[0]) != list else len(v[0])) for k, v in data[\"nytimes\"][\"mallet\"][\"metrics\"].items()]"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 2,
27 | "metadata": {},
28 | "outputs": [
29 | {
30 | "data": {
31 | "text/plain": [
32 | "['species',\n",
33 | " 'birds',\n",
34 | " 'males',\n",
35 | " 'females',\n",
36 | " 'bird',\n",
37 | " 'white',\n",
38 | " 'found',\n",
39 | " 'female',\n",
40 | " 'male',\n",
41 | " 'range',\n",
42 | " 'large',\n",
43 | " 'breeding',\n",
44 | " 'long',\n",
45 | " 'black',\n",
46 | " 'small',\n",
47 | " 'shark',\n",
48 | " 'population',\n",
49 | " 'common',\n",
50 | " 'prey',\n",
51 | " 'eggs']"
52 | ]
53 | },
54 | "execution_count": 2,
55 | "metadata": {},
56 | "output_type": "execute_result"
57 | }
58 | ],
59 | "source": [
60 | "data[\"wikitext\"][\"mallet\"][\"topics\"][1]"
61 | ]
62 | }
63 | ],
64 | "metadata": {
65 | "kernelspec": {
66 | "display_name": "Python 3.10.7 64-bit",
67 | "language": "python",
68 | "name": "python3"
69 | },
70 | "language_info": {
71 | "codemirror_mode": {
72 | "name": "ipython",
73 | "version": 3
74 | },
75 | "file_extension": ".py",
76 | "mimetype": "text/x-python",
77 | "name": "python",
78 | "nbconvert_exporter": "python",
79 | "pygments_lexer": "ipython3",
80 | "version": "3.10.7"
81 | },
82 | "orig_nbformat": 4,
83 | "vscode": {
84 | "interpreter": {
85 | "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
86 | }
87 | }
88 | },
89 | "nbformat": 4,
90 | "nbformat_minor": 2
91 | }
92 |
--------------------------------------------------------------------------------
/src-misc/02-process_data.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | import json
4 |
5 | data = json.load(open("data/intrusion.json", "r"))
6 |
7 | for topic_scores, topic_words in zip(data["wikitext"]["etm"]["metrics"]["intrusion_scores_raw"], data["wikitext"]["etm"]["topics"]):
8 | print(len(topic_scores), len(topic_words))
9 | group_a = [w for s, w in zip(topic_scores, topic_words) if s == 0]
10 | group_b = [w for s, w in zip(topic_scores, topic_words) if s == 1]
11 | print(group_a, group_b)
--------------------------------------------------------------------------------
/src-misc/03-figures_nmpi_llm.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | import matplotlib.pyplot as plt
4 | import pandas as pd
5 | import os
6 | from tqdm import tqdm
7 | import numpy as np
8 | from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score, completeness_score, homogeneity_score
9 | from scipy.stats import spearmanr
10 | from collections import defaultdict
11 | import fig_utils
12 | from matplotlib.ticker import FormatStrFormatter
13 | import argparse
14 |
15 | args = argparse.ArgumentParser()
16 | args.add_argument("--experiment", default="wikitext_specific")
17 | args = args.parse_args()
18 |
19 | os.makedirs("computed/figures", exist_ok=True)
20 |
21 | N_CLUSTER = range(20, 420, 20)
22 |
23 |
24 | def moving_average(a, window_size=3):
25 | if window_size == 0:
26 | return a
27 | out = []
28 | for i in range(len(a)):
29 | start = max(0, i - window_size)
30 | out.append(np.mean(a[start:i + 1]))
31 | return out
32 |
33 |
34 | def compute_adjusted_NMI_bills():
35 | df_metadata = pd.read_json(
36 | "data/bills/processed/labeled/vocab_15k/train.metadata.jsonl",
37 | lines=True
38 | )
39 | print(df_metadata)
40 | paths = [
41 | "runs/outputs/k_selection/" + "bills" +
42 | "-labeled/vocab_15k/k-" + str(i)
43 | for i in N_CLUSTER
44 | ]
45 | topic = df_metadata.topic.tolist()
46 | cluster_metrics = defaultdict(list)
47 |
48 | for path, num_topics in tqdm(zip(paths, N_CLUSTER), total=20):
49 | path = os.path.join(path, "2972")
50 | beta = np.load(os.path.join(path, "beta.npy"))
51 | theta = np.load(os.path.join(path, "train.theta.npy"))
52 | argmax_theta = theta.argmax(axis=-1)
53 | cluster_metrics["ami"].append(
54 | adjusted_mutual_info_score(topic, argmax_theta)
55 | )
56 | cluster_metrics["ari"].append(
57 | adjusted_rand_score(topic, argmax_theta)
58 | )
59 | cluster_metrics["completeness"].append(
60 | completeness_score(topic, argmax_theta)
61 | )
62 | cluster_metrics["homogeneity"].append(
63 | homogeneity_score(topic, argmax_theta)
64 | )
65 | return cluster_metrics
66 |
67 |
68 | def compute_adjusted_NMI_wikitext_old(broad_categories=True):
69 | df_metadata = pd.read_json(
70 | "data/wikitext/processed/labeled/vocab_15k/train.metadata.jsonl",
71 | lines=True
72 | )
73 | print(df_metadata)
74 | paths = [
75 | "runs/outputs/k_selection/" + "wikitext" +
76 | "-labeled/vocab_15k/k-" + str(i)
77 | for i in N_CLUSTER
78 | ]
79 |
80 | if broad_categories:
81 | topic = df_metadata.category.tolist()
82 | else:
83 | topic = df_metadata.subcategory.tolist()
84 |
85 | cluster_metrics = defaultdict(list)
86 |
87 | for path, num_topics in tqdm(zip(paths, N_CLUSTER), total=20):
88 | path = os.path.join(path, "2972")
89 | beta = np.load(os.path.join(path, "beta.npy"))
90 | theta = np.load(os.path.join(path, "train.theta.npy"))
91 | argmax_theta = theta.argmax(axis=-1)
92 | cluster_metrics["ami"].append(
93 | adjusted_mutual_info_score(topic, argmax_theta)
94 | )
95 | cluster_metrics["ari"].append(adjusted_rand_score(topic, argmax_theta))
96 | cluster_metrics["completeness"].append(
97 | completeness_score(topic, argmax_theta)
98 | )
99 | cluster_metrics["homogeneity"].append(
100 | homogeneity_score(topic, argmax_theta)
101 | )
102 | return cluster_metrics
103 |
104 | def compute_adjusted_NMI_wikitext(broad_categories=True):
105 | df_metadata = pd.read_json("data/wikitext/processed/labeled/vocab_15k/train.metadata.jsonl", lines=True)
106 | print (df_metadata)
107 | paths = ["runs/outputs/k_selection/" + "wikitext" + "-labeled/vocab_15k/k-" + str(i) for i in range(20, 420, 20)]
108 |
109 | # re-do columns:
110 | value_counts = df_metadata.category.value_counts().rename_axis("topic").reset_index(name="counts")
111 | value_counts = {i:j for i,j in zip(value_counts.topic, value_counts.counts) if j > 25}
112 |     df_metadata["subtopic"] = ["other" if i not in value_counts else i for i in df_metadata.category]
113 |
114 | value_counts = df_metadata.subcategory.value_counts().rename_axis("topic").reset_index(name="counts")
115 | value_counts = {i:j for i,j in zip(value_counts.topic, value_counts.counts) if j > 25}
116 |     df_metadata["subtopic"] = ["other" if i not in value_counts else i for i in df_metadata.subcategory]
117 |
118 | if broad_categories:
119 | topic = df_metadata.category.tolist()
120 | else:
121 | topic = df_metadata.subcategory.tolist()
122 |
123 | cluster_metrics = defaultdict(list)
124 |
125 | for path,num_topics in tqdm(zip(paths, range(20,420,20)), total=20):
126 | path = os.path.join(path, "2972")
127 | beta = np.load(os.path.join(path, "beta.npy"))
128 | theta = np.load(os.path.join(path, "train.theta.npy"))
129 | argmax_theta = theta.argmax(axis=-1)
130 | cluster_metrics["ami"].append(adjusted_mutual_info_score(topic, argmax_theta))
131 | cluster_metrics["ari"].append(adjusted_rand_score(topic, argmax_theta))
132 | cluster_metrics["completeness"].append(completeness_score(topic, argmax_theta))
133 | cluster_metrics["homogeneity"].append(homogeneity_score(topic, argmax_theta))
134 | return cluster_metrics
135 |
136 |
137 |
138 |
139 | # experiment = "bills_broad"
140 | # experiment = "wikitext_broad"
141 | # experiment = "wikitext_specific"
142 |
143 | if args.experiment == "bills_broad":
144 | dataset = "bills"
145 | cluster_metrics = compute_adjusted_NMI_bills()
146 | paths = [
147 | "runs/outputs/k_selection/bills-labeled/vocab_15k/k-" + str(i)
148 | for i in N_CLUSTER
149 | ]
150 | df = pd.read_csv("LLM-scores/LLM_outputs_bills_broad.csv")
151 | plt_label = "Adjusted MI"
152 | outfile_name = "n_clusters_bills_dataset.pdf"
153 | plt_title = "BillSum, broad, $\\rho = RHO_CORR$"
154 | left_ylab = True
155 | right_ylab = False
156 | degree = 1
157 | elif args.experiment == "wikitext_broad":
158 | dataset = "wikitext"
159 | cluster_metrics = compute_adjusted_NMI_wikitext()
160 | paths = [
161 | "runs/outputs/k_selection/wikitext-labeled/vocab_15k/k-" + str(i)
162 | for i in N_CLUSTER
163 | ]
164 | df = pd.read_csv("LLM-scores/LLM_outputs_wikitext_broad.csv")
165 | plt_label = "Adjusted MI"
166 | outfile_name = "n_clusters_wikitext_broad.pdf"
167 | plt_title = "Wikitext, broad, $\\rho = RHO_CORR$"
168 | left_ylab = False
169 | right_ylab = False
170 | degree = 1
171 | elif args.experiment == "wikitext_specific":
172 | dataset = "wikitext"
173 | cluster_metrics = compute_adjusted_NMI_wikitext(broad_categories=False)
174 | paths = [
175 | "runs/outputs/k_selection/wikitext-labeled/vocab_15k/k-" + str(i)
176 | for i in N_CLUSTER
177 | ]
178 | df = pd.read_csv("LLM-scores/LLM_outputs_wikitext_specific.csv")
179 | plt_label = "Adjusted MI"
180 | outfile_name = "n_clusters_wikitext_specific.pdf"
181 | plt_title = "Wikitext, specific, $\\rho = RHO_CORR$"
182 | left_ylab = False
183 | right_ylab = True
184 | degree = 3
185 |
186 | df["gpt_ratings"] = df.chatGPT_eval.astype(int)
187 | average_goodness = []
188 | # get average gpt_ratings for each k
189 | for path in paths:
190 | path = os.path.join(path, "2972")
191 | df_at_k = df[df.path == path]
192 | average_goodness.append(df_at_k.gpt_ratings.mean())
193 |
194 | # smooth via moving_average to remove weird outliers
195 | average_goodness = moving_average(average_goodness)
196 |
197 | # if we want to plot spearmanR for different clustering metrics
198 | compute_spearmanR_different_cluster_metrics = False
199 | if compute_spearmanR_different_cluster_metrics:
200 | for key in ["ami", "ari", "completeness", "homogeneity"]:
201 | value = cluster_metrics[key]
202 | statistic = spearmanr(average_goodness, value)
203 | print(
204 | "topics " + key, statistic.statistic.round(3),
205 | statistic.pvalue.round(3)
206 | )
207 |
208 | # re-shape in case we want to compute z-scores (otherwise the scales are off, because LLM scores and clustering metrics live on different ranges and the plots look odd)
209 | average_goodness = np.array(average_goodness).reshape(-1, 1)
210 | AMI = np.array(cluster_metrics["ami"]).reshape(-1, 1)
211 |
212 | # compute z-scores
213 | # average_goodness = StandardScaler().fit_transform(average_goodness).squeeze()
214 | # AMI = StandardScaler().fit_transform(AMI).squeeze()
215 | average_goodness = np.array(average_goodness).squeeze()
216 | AMI = np.array(AMI).squeeze()
217 |
218 | # plot figures
219 | fig = plt.figure(figsize=(3.5, 2))
220 | ax1 = plt.gca()
221 | ax2 = ax1.twinx()
222 | SCATTER_STYLE = {"edgecolor": "black", "s": 30}
223 | l1 = ax1.scatter(
224 | N_CLUSTER,
225 | average_goodness,
226 | label="LLM score",
227 | color=fig_utils.COLORS[0],
228 | **SCATTER_STYLE
229 | )
230 | l2 = ax2.scatter(
231 | N_CLUSTER,
232 | AMI,
233 | label=plt_label,
234 | color=fig_utils.COLORS[1],
235 | **SCATTER_STYLE
236 | )
237 |
238 | # print to one digit
239 | ax2.yaxis.set_major_formatter(FormatStrFormatter('%.1f'))
240 |
241 | xticks_fine = np.linspace(min(N_CLUSTER), max(N_CLUSTER), 500)
242 |
243 | poly_ag = np.poly1d(np.polyfit(N_CLUSTER, average_goodness, degree))
244 | ax1.plot(xticks_fine, poly_ag(xticks_fine), '-', color=fig_utils.COLORS[0], zorder=-100)
245 | poly_ami = np.poly1d(np.polyfit(N_CLUSTER, AMI, degree))
246 | ax2.plot(xticks_fine, poly_ami(xticks_fine), '-', color=fig_utils.COLORS[1], zorder=-100)
247 |
248 | plt.legend(
249 | [l1, l2], [p_.get_label() for p_ in [l1, l2]],
250 | loc="upper right",
251 | handletextpad=-0.2,
252 | labelspacing=0.15,
253 | borderpad=0.15,
254 | borderaxespad=0.1,
255 | )
256 | if left_ylab:
257 | ax1.set_ylabel("Adjusted MI")
258 | if right_ylab:
259 | ax2.set_ylabel("Averaged LLM score")
260 | ax1.set_xlabel("Number of topics")
261 | plt.xticks(N_CLUSTER[::3], N_CLUSTER[::3])
262 |
263 |
264 | statistic = spearmanr(average_goodness, cluster_metrics["ami"])
265 | plt.title(plt_title.replace("RHO_CORR", f"{statistic[0]:.2f}"))
266 |
267 | plt.tight_layout(pad=0.2)
268 | plt.savefig("computed/figures/" + outfile_name)
269 | plt.show()
270 |
--------------------------------------------------------------------------------
/src-misc/04-launch_figures_nmpi_llm.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/bash
2 |
3 | for EXPERIMENT in "bills_broad" "wikitext_broad" "wikitext_specific"; do
4 | # prevent from showing
5 | DISPLAY="" python3 src-misc/03-figures_nmpi_llm.py --experiment ${EXPERIMENT}
6 | done
--------------------------------------------------------------------------------
/src-misc/05-find_rating_errors.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 34,
6 | "metadata": {},
7 | "outputs": [
8 | {
9 | "name": "stderr",
10 | "output_type": "stream",
11 | "text": [
12 | "/tmp/ipykernel_32934/435855610.py:6: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.\n",
13 | " df_grouped = df_rating_scores.groupby([\"dataset_name\", \"model_name\", \"topic_id\"])[\"chatGPT_eval\", \"human_eval\"].mean().reset_index()\n"
14 | ]
15 | }
16 | ],
17 | "source": [
18 | "#!/usr/bin/env python3\n",
19 | "\n",
20 | "import pandas as pd\n",
21 | "\n",
22 | "df_rating_scores = pd.read_json(\"../computed/ratings_outfile_with_logit_bias_old_prompt_5_runs.jsonl\", lines=True)\n",
23 | "df_grouped = df_rating_scores.groupby([\"dataset_name\", \"model_name\", \"topic_id\"])[\"chatGPT_eval\", \"human_eval\"].mean().reset_index()\n",
24 | "df_grouped[\"diff_abs\"] = (df_grouped[\"chatGPT_eval\"]-df_grouped[\"human_eval\"]).abs()\n",
25 | "df_grouped[\"diff\"] = (df_grouped[\"chatGPT_eval\"]-df_grouped[\"human_eval\"])"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 36,
31 | "metadata": {},
32 | "outputs": [
33 | {
34 | "data": {
118 | "text/plain": [
119 | " dataset_name model_name topic_id chatGPT_eval human_eval diff_abs \\\n",
120 | "299 wikitext mallet 49 2.6 2.6 0.0 \n",
121 | "157 wikitext dvae 7 2.8 2.8 0.0 \n",
122 | "16 nytimes dvae 16 2.0 2.0 0.0 \n",
123 | "110 nytimes mallet 10 2.0 2.0 0.0 \n",
124 | "101 nytimes mallet 1 3.0 3.0 0.0 \n",
125 | "\n",
126 | " diff \n",
127 | "299 0.0 \n",
128 | "157 0.0 \n",
129 | "16 0.0 \n",
130 | "110 0.0 \n",
131 | "101 0.0 "
132 | ]
133 | },
134 | "execution_count": 36,
135 | "metadata": {},
136 | "output_type": "execute_result"
137 | }
138 | ],
139 | "source": [
140 | "df_grouped.sort_values(by=\"diff_abs\")[:5]\n"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": 39,
146 | "metadata": {},
147 | "outputs": [
148 | {
149 | "data": {
233 | "text/plain": [
234 | " dataset_name model_name topic_id chatGPT_eval human_eval diff_abs \\\n",
235 | "58 nytimes etm 8 1.4 2.866667 1.466667 \n",
236 | "41 nytimes dvae 41 1.0 2.533333 1.533333 \n",
237 | "28 nytimes dvae 28 1.2 2.733333 1.533333 \n",
238 | "34 nytimes dvae 34 1.0 2.666667 1.666667 \n",
239 | "216 wikitext etm 16 1.0 2.933333 1.933333 \n",
240 | "\n",
241 | " diff \n",
242 | "58 -1.466667 \n",
243 | "41 -1.533333 \n",
244 | "28 -1.533333 \n",
245 | "34 -1.666667 \n",
246 | "216 -1.933333 "
247 | ]
248 | },
249 | "execution_count": 39,
250 | "metadata": {},
251 | "output_type": "execute_result"
252 | }
253 | ],
254 | "source": [
255 | "df_grouped.sort_values(by=\"diff_abs\")[-5:]"
256 | ]
257 | },
258 | {
259 | "cell_type": "code",
260 | "execution_count": 38,
261 | "metadata": {},
262 | "outputs": [
263 | {
264 | "data": {
348 | "text/plain": [
349 | " dataset_name model_name topic_id chatGPT_eval human_eval diff_abs \\\n",
350 | "200 wikitext etm 0 2.0 1.600000 0.400000 \n",
351 | "210 wikitext etm 10 1.8 1.333333 0.466667 \n",
352 | "248 wikitext etm 48 3.0 2.533333 0.466667 \n",
353 | "207 wikitext etm 7 1.8 1.266667 0.533333 \n",
354 | "242 wikitext etm 42 1.8 1.133333 0.666667 \n",
355 | "\n",
356 | " diff \n",
357 | "200 0.400000 \n",
358 | "210 0.466667 \n",
359 | "248 0.466667 \n",
360 | "207 0.533333 \n",
361 | "242 0.666667 "
362 | ]
363 | },
364 | "execution_count": 38,
365 | "metadata": {},
366 | "output_type": "execute_result"
367 | }
368 | ],
369 | "source": [
370 | "df_grouped.sort_values(by=\"diff\")[-5:]"
371 | ]
372 | },
373 | {
374 | "cell_type": "code",
375 | "execution_count": 40,
376 | "metadata": {},
377 | "outputs": [
378 | {
379 | "data": {
380 | "text/plain": [
381 | "-0.6393333333333333"
382 | ]
383 | },
384 | "execution_count": 40,
385 | "metadata": {},
386 | "output_type": "execute_result"
387 | }
388 | ],
389 | "source": [
390 | "df_grouped[\"diff\"].mean()"
391 | ]
392 | }
393 | ],
394 | "metadata": {
395 | "kernelspec": {
396 | "display_name": "Python 3.10.7 64-bit",
397 | "language": "python",
398 | "name": "python3"
399 | },
400 | "language_info": {
401 | "codemirror_mode": {
402 | "name": "ipython",
403 | "version": 3
404 | },
405 | "file_extension": ".py",
406 | "mimetype": "text/x-python",
407 | "name": "python",
408 | "nbconvert_exporter": "python",
409 | "pygments_lexer": "ipython3",
410 | "version": "3.10.7"
411 | },
412 | "orig_nbformat": 4,
413 | "vscode": {
414 | "interpreter": {
415 | "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
416 | }
417 | }
418 | },
419 | "nbformat": 4,
420 | "nbformat_minor": 2
421 | }
422 |
--------------------------------------------------------------------------------
/src-misc/06-figures_ari_llm.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | import matplotlib.pyplot as plt
4 | import os
5 | from tqdm import tqdm
6 | import numpy as np
7 | from collections import defaultdict
8 | from scipy.stats import spearmanr
9 | import fig_utils
10 | from matplotlib.ticker import FormatStrFormatter
11 | import csv
12 | import argparse
13 |
14 | args = argparse.ArgumentParser()
15 | args.add_argument("--experiment", default="wikitext_specific")
16 | args = args.parse_args()
17 |
18 | os.makedirs("computed/figures", exist_ok=True)
19 |
20 | N_CLUSTER = range(20, 420, 20)
21 |
22 | data = list(csv.DictReader(
23 | open(f"repo-old/LLM-scores-3/{args.experiment}_dataframe_all_results.csv", "r")))
24 |
25 | # plot figures
26 | fig = plt.figure(figsize=(3.5, 2))
27 | ax1 = plt.gca()
28 | ax2 = ax1.twinx()
29 | ax3 = ax1.twinx()
30 | if args.experiment == "bills_broad":
31 | dataset = "bills"
32 | outfile_name = "n_clusters_bills_broad.pdf"
33 | plt_title = "BillSum, broad, $\\rho_D = RHO_CORR1$ $\\rho_W = RHO_CORR2$ "
34 | left_ylab = True
35 | show_legend = True
36 | degree = 4
37 |
38 | ax3.tick_params(axis='y', colors="#3d518c")
39 | ax3.yaxis.set_label_position('left')
40 | ax3.yaxis.set_ticks_position('left')
41 | ax3.set_ylabel("Document LLM", color="#3d518c")
42 | ax1.set_yticks([])
43 | ax2.set_axis_off()
44 | elif args.experiment == "wikitext_broad":
45 | dataset = "wikitext"
46 | outfile_name = "n_clusters_wikitext_broad.pdf"
47 | plt_title = "Wikitext, broad, $\\rho_D = RHO_CORR1$ $\\rho_W = RHO_CORR2$"
48 | left_ylab = False
49 | show_legend = False
50 | degree = 4
51 | ax1.set_yticks([])
52 | ax2.set_axis_off()
53 | ax3.set_yticks([])
54 | ax1.set_ylabel("|", color="white")
55 | ax3.set_ylabel("|", color="white")
56 | elif args.experiment == "wikitext_specific":
57 | dataset = "wikitext"
58 | outfile_name = "n_clusters_wikitext_specific.pdf"
59 | plt_title = " Wikitext, specific, $\\rho_D = RHO_CORR1$ $\\rho_W = RHO_CORR2$"
60 | left_ylab = False
61 | show_legend = False
62 | degree = 4
63 | # ax1.set_yticks([])
64 | # ax3.tick_params(axis='y', colors="#3d518c")
65 | # ax2.set_axis_off()
66 | # ax3.set_axis_off()
67 |
68 | ax1.tick_params(axis='y', colors="#933d35")
69 | ax1.yaxis.set_label_position('right')
70 | ax1.yaxis.set_ticks_position('right')
71 | ax1.set_ylabel("ARI", color="#933d35")
72 | ax3.set_axis_off()
73 | ax2.set_axis_off()
74 |
75 | data_llm_word = [float(x["LLM Scores Wordset"]) for x in data]
76 | data_llm_doc = [float(x["LLM Scores Documentset"]) for x in data]
77 | data_ari = [float(x["ARI"]) for x in data]
78 |
79 | SCATTER_STYLE = {"edgecolor": "black", "s": 30}
80 | l1 = ax1.scatter(
81 | N_CLUSTER,
82 | data_ari,
83 | label="ARI",
84 | color=fig_utils.COLORS[1],
85 | **SCATTER_STYLE
86 | )
87 | l2 = ax2.scatter(
88 | N_CLUSTER,
89 | data_llm_word,
90 | label="Word LLM",
91 | color=fig_utils.COLORS[0],
92 | **SCATTER_STYLE
93 | )
94 | l3 = ax3.scatter(
95 | N_CLUSTER,
96 | data_llm_doc,
97 | label="Doc LLM",
98 | color=fig_utils.COLORS[4],
99 | **SCATTER_STYLE
100 | )
101 |
102 | # ax1.axes.get_yaxis().set_visible(False)
103 | # print to one digit
104 | # ax1.yaxis.set_major_formatter(FormatStrFormatter('%.0f'))
105 |
106 | xticks_fine = np.linspace(min(N_CLUSTER), max(N_CLUSTER), 500)
107 |
108 | poly_ami = np.poly1d(np.polyfit(N_CLUSTER, data_ari, degree))
109 | ax1.plot(
110 | xticks_fine, poly_ami(xticks_fine), '-', color=fig_utils.COLORS[1], zorder=-100
111 | )
112 | poly_llm_word = np.poly1d(np.polyfit(N_CLUSTER, data_llm_word, degree))
113 | ax2.plot(
114 | xticks_fine, poly_llm_word(xticks_fine), '-', color=fig_utils.COLORS[0], zorder=-100
115 | )
116 | poly_llm_doc = np.poly1d(np.polyfit(N_CLUSTER, data_llm_doc, degree))
117 | ax3.plot(
118 | xticks_fine, poly_llm_doc(xticks_fine), '-', color=fig_utils.COLORS[4], zorder=-100
119 | )
120 |
121 | if show_legend:
122 | lhandles = [l2, l3, l1]
123 | plt.legend(
124 | lhandles, [p_.get_label() for p_ in lhandles],
125 | loc="upper right",
126 | handletextpad=-0.2,
127 | labelspacing=0.1,
128 | borderpad=0.2,
129 | borderaxespad=0.2,
130 | handlelength=1.5,
131 | columnspacing=0.8,
132 | ncols=2,
133 | edgecolor="black",
134 | facecolor="#dddddd"
135 | )
136 | # if left_ylab:
137 | # ax1.set_ylabel("Metric Scores")
138 | # else:
139 | # ax1.set_ylabel(" ")
140 |
141 | ax1.set_xlabel("Number of topics")
142 | plt.xticks(N_CLUSTER[::3], N_CLUSTER[::3])
143 |
144 |
145 | statistic_doc = spearmanr(data_llm_doc, data_ari)
146 | statistic_word = spearmanr(data_llm_word, data_ari)
147 | # statistic = spearmanr(data_llm_doc, data_llm_word)
148 | plt.title(
149 | plt_title
150 | .replace("RHO_CORR1", f"{statistic_doc[0]:.2f}")
151 | .replace("RHO_CORR2", f"{statistic_word[0]:.2f}")
152 | )
153 |
154 | plt.tight_layout(pad=0.1)
155 | plt.savefig("computed/figures/" + outfile_name)
156 | plt.show()
157 |
--------------------------------------------------------------------------------
/src-misc/07-pairwise_scores.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 5,
6 | "metadata": {},
7 | "outputs": [
8 | {
9 | "name": "stdout",
10 | "output_type": "stream",
11 | "text": [
12 | "Intruder is black (5.20)\n",
13 | "Coherence is 6.27\n"
14 | ]
15 | }
16 | ],
17 | "source": [
18 | "import random\n",
19 | "import numpy as np\n",
20 | "random.seed(0)\n",
21 | "\n",
22 | "dataset = [\n",
23 | " [\"red\", \"blue\", \"flag\", \"black\", \"sky\", \"sun\"]\n",
24 | "]\n",
25 | "\n",
26 | "\n",
27 | "def get_pairwise_similarity(word1, word2):\n",
28 | " prompt = f'On a scale from 1 to 10, how similar are \"{word1}\" and \"{word2}\"? Your answer should be only the number and nothing else.'\n",
29 | " # TODO: GPT prompt\n",
30 | " return random.randint(1, 10)\n",
31 | "\n",
32 | "\n",
33 | "for words_line in dataset:\n",
34 | " # create |W|*|W| list\n",
35 | " results_line = [[None] * len(words_line) for _ in words_line]\n",
36 | " for word1_i, word1 in enumerate(words_line):\n",
37 | " for word2_i, word2 in enumerate(words_line):\n",
38 | " # save up half the prompts\n",
39 | " if word2_i <= word1_i:\n",
40 | " continue\n",
41 | " similarity = get_pairwise_similarity(word1, word2)\n",
42 | " results_line[word1_i][word2_i] = similarity\n",
43 | " results_line[word2_i][word1_i] = similarity\n",
44 | "\n",
45 | " # remove None (on diagonal)\n",
46 | " results_line = [\n",
47 | " [x for x in similarities if x]\n",
48 | " for similarities in results_line\n",
49 | " ]\n",
50 | "\n",
51 | " per_word_avg = [\n",
52 | " np.average(similarities)\n",
53 | " for similarities in results_line\n",
54 | " ]\n",
55 | " word_intruder_i = np.argmin(per_word_avg)\n",
56 | " coherence = np.average(per_word_avg)\n",
57 | "\n",
58 | " print(f\"Intruder is {words_line[word_intruder_i]} ({min(per_word_avg):.2f})\")\n",
59 | " print(f\"Coherence is {coherence:.2f}\")\n"
60 | ]
61 | }
62 | ],
63 | "metadata": {
64 | "kernelspec": {
65 | "display_name": "Python 3.10.7 64-bit",
66 | "language": "python",
67 | "name": "python3"
68 | },
69 | "language_info": {
70 | "codemirror_mode": {
71 | "name": "ipython",
72 | "version": 3
73 | },
74 | "file_extension": ".py",
75 | "mimetype": "text/x-python",
76 | "name": "python",
77 | "nbconvert_exporter": "python",
78 | "pygments_lexer": "ipython3",
79 | "version": "3.10.7"
80 | },
81 | "orig_nbformat": 4,
82 | "vscode": {
83 | "interpreter": {
84 | "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
85 | }
86 | }
87 | },
88 | "nbformat": 4,
89 | "nbformat_minor": 2
90 | }
91 |
--------------------------------------------------------------------------------
/src-misc/fig_utils.py:
--------------------------------------------------------------------------------
1 | import matplotlib.style
2 | import matplotlib as mpl
3 | from cycler import cycler
4 |
5 | FONT_MONOSPACE = {'fontname':'monospace'}
6 | MARKERS = "o^s*DP1"
7 | COLORS = [
8 | "#b7423c",
9 | "#71a548",
10 | "salmon",
11 | "darkseagreen",
12 | "cornflowerblue",
13 | "orange",
14 | "seagreen",
15 | "dimgray",
16 | "purple",
17 | ]
18 |
19 | mpl.rcParams['axes.prop_cycle'] = cycler(color=COLORS)
20 | mpl.rcParams['lines.linewidth'] = 2
21 | mpl.rcParams['lines.markersize'] = 7
22 | mpl.rcParams['axes.linewidth'] = 1.5
23 | mpl.rcParams['font.family'] = "serif"
24 |
25 | METRIC_PRETTY_NAME = {
26 | "bleu": "BLEU",
27 | "ter": "TER",
28 | "meteor": "METEOR",
29 | "chrf": "ChrF",
30 | "comet": "COMET",
31 | "bleurt": "BLEURT"
32 | }
33 |
34 | COLORS_EXTRA = ["#9c2963", "#fb9e07"]
--------------------------------------------------------------------------------
/src-number-of-topics/LLM_scores_and_ARI.py:
--------------------------------------------------------------------------------
1 | import matplotlib.pyplot as plt
2 | import pandas as pd
3 | import re, os, json, random
4 | from tqdm import tqdm
5 | import numpy as np
6 | from sklearn.metrics import normalized_mutual_info_score, adjusted_mutual_info_score, adjusted_rand_score, completeness_score, homogeneity_score
7 | from sklearn.preprocessing import StandardScaler
8 | from scipy.stats import spearmanr
9 | from collections import defaultdict
10 | import argparse
11 | from collections import Counter
12 |
13 | def moving_average(a, window_size=3):
14 | if window_size == 0:
15 | return a
16 | out = []
17 | for i in range(len(a)):
18 | start = max(0, i - window_size)
19 | out.append(np.mean(a[start:i + 1]))
20 | return out
21 |
22 | def postprocess_labels(df):
23 | import spacy
24 | nlp = spacy.load("en_core_web_sm")
25 | if "response" not in df.columns:
26 | df["response"] = df.chatGPT_eval.tolist()
27 | out = []
28 |
29 | for i, text in enumerate(df.response):
30 | text = text.lower()
31 |         text = re.findall("[a-z ]+", text)[0] # keep only the first run of lowercase letters and spaces, dropping digits, punctuation and anything after them
32 | text = " ".join([w.lemma_ for w in nlp(text)])
33 | out.append(text)
34 | df["response"] = out
35 | return df
36 |
37 | def compute_ARI(args):
38 | df_metadata = pd.read_json(os.path.join("data", args.dataset, "processed/labeled/vocab_15k/train.metadata.jsonl"), lines=True)
39 | paths = ["runs/outputs/k_selection/" + args.dataset + "-labeled/vocab_15k/k-" + str(i) for i in range(20, 420, 20)]
40 |
41 | if args.dataset == "bills":
42 | topic = df_metadata.topic.tolist()
43 | else:
44 | if args.label_categories == "broad":
45 | topic = df_metadata.category.tolist()
46 | else:
47 | topic = df_metadata.subcategory.tolist()
48 |
49 | cluster_metrics = defaultdict(list)
50 | for path,num_topics in tqdm(zip(paths, range(20,420,20)), total=20):
51 | path = os.path.join(path, "2972")
52 | beta = np.load(os.path.join(path, "beta.npy"))
53 | theta = np.load(os.path.join(path, "train.theta.npy"))
54 | argmax_theta = theta.argmax(axis=-1)
55 | cluster_metrics["ami"].append(adjusted_mutual_info_score(topic, argmax_theta))
56 | cluster_metrics["ari"].append(adjusted_rand_score(topic, argmax_theta))
57 | cluster_metrics["completeness"].append(completeness_score(topic, argmax_theta))
58 | cluster_metrics["homogeneity"].append(homogeneity_score(topic, argmax_theta))
59 | return cluster_metrics
60 |
61 | if __name__ == "__main__":
62 | parser = argparse.ArgumentParser()
63 | parser.add_argument("--dataset", default="wikitext", type=str, help="dataset (wikitext or bills)")
64 | parser.add_argument("--label_categories", default="broad", type=str, help="granularity of ground-truth labels (part of the prompt): broad or specific.")
65 |     parser.add_argument("--filename", default="number-of-topics-section-4/document_label_assignment_wikitext_broad.jsonl", type=str, help="filename with LLM responses")
66 |     parser.add_argument("--method", default="label_assignment", type=str, help="whether to use document label assignment or topic word set ratings (label_assignment | topic_ratings)")
67 |
68 | args = parser.parse_args()
69 |
70 | cluster_metrics = compute_ARI(args)
71 |     paths = ["runs/outputs/k_selection/" + args.dataset + "-labeled/vocab_15k/k-" + str(i) for i in range(20, 420, 20)]  # same prefix as in compute_ARI, so the paths match those stored in the LLM-response jsonl
72 | df = pd.read_json(args.filename, lines=True)
73 | outfile_name = "n_clusters_" + args.dataset + "_" + args.label_categories + ".png"
74 | plt_label = "LLM Scores and ARI"
75 |
76 |
77 | average_goodness = []
78 | if args.method == "topic_ratings":
79 | df["gpt_ratings"] = df.response.astype(int)
80 | # get average gpt_ratings for each k
81 | for path in paths:
82 | path = os.path.join(path, "2972")
83 | df_at_k = df[df.path == path]
84 | average_goodness.append(df_at_k.gpt_ratings.mean())
85 | elif args.method == "label_assignment":
86 | df = postprocess_labels(df)
87 | # get average purity for each k
88 | for path in paths:
89 | path = os.path.join(path, "2972")
90 | df_at_k = df[df.path == path]
91 | purities = []
92 | for topic in df_at_k.topic.unique():
93 | df_topic = df_at_k[df_at_k.topic == topic]
94 | labels = df_topic.response.tolist()
95 | most_common,num_most_common = Counter(labels).most_common(1)[0]
96 | purity = num_most_common / len(labels)
97 | purities.append(purity)
98 | average_goodness.append(np.mean(purities))
99 |
100 | average_goodness = moving_average(average_goodness) # smooth via moving_average to remove weird outliers
101 | ARI = cluster_metrics["ari"]
102 |
103 | fig = plt.figure()
104 | ax = fig.add_subplot(111)
105 |
106 | ax1 = plt.subplot()
107 | l1, = ax1.plot(average_goodness, color='tab:red')
108 | ax2 = ax1.twinx()
109 | l2, = ax2.plot(ARI, color='tab:blue')
110 |
111 | spearman_rho = spearmanr(average_goodness, ARI).statistic
112 | print (spearman_rho)
113 |
114 | plt.legend([l1, l2], ["LLM Score", "ARI"])
115 | ax.set_xlabel("Number of Topics")
116 |
117 | n_clusters = list(range(20, 420, 20))
118 | plt.xticks(range(len(n_clusters)), n_clusters, rotation=45)
119 |
120 | if len(n_clusters) > 12:
121 | every_nth = len(n_clusters) // 8
122 | for n, label in enumerate(ax.xaxis.get_ticklabels()):
123 | if n % every_nth != 0:
124 | label.set_visible(False)
125 |     fig_title = "LLM Scores and ARI, " + args.dataset + ", " + args.label_categories + ", $\\rho = " + f"{spearman_rho:.2f}" + "$"
126 | plt.title(fig_title)
127 | plt.savefig(outfile_name)
128 |
--------------------------------------------------------------------------------
/src-number-of-topics/chatGPT_document_label_assignment.py:
--------------------------------------------------------------------------------
1 | import os
2 | import json
3 | import numpy as np
4 | import sys
5 | import random
6 | import openai
7 | from tqdm import tqdm
8 | import pandas as pd
9 | import time
10 | import re
11 | import argparse
12 |
13 | def get_system_prompt(args):
14 | if args.dataset == "bills" and args.label_categories == "broad":
15 | system_prompt = """You are a helpful research assistant with lots of knowledge about topic models. You are given a document assigned to a topic by a topic model. Annotate the document with a broad label, for example "health", "public lands", "domestic commerce", "government operations" and "defense".
16 |
17 | Reply with a single word or phrase, indicating the label of the document."""
18 |
19 | elif args.dataset == "wikitext" and args.label_categories == "broad":
20 |         system_prompt = """You are a helpful research assistant with lots of knowledge about topic models. You are given a document assigned to a topic by a topic model. Annotate the document with a broad label, for example "television", "songs", "transport", "warships and naval units", and "biology and medicine".
21 |
22 | Reply with a single word or phrase, indicating the label of the document."""
23 |
24 | elif args.dataset == "wikitext" and args.label_categories == "specific":
25 | system_prompt = """You are a helpful research assistant with lots of knowledge about topic models. You are given a document assigned to a topic by a topic model. Annotate the document with a specific label, for example "tropical cyclones: atlantic", "actors, directors, models, performers, and celebrities", "road infrastructure: midwestern united states", "armies and military units", and "warships of germany".
26 |
27 | Reply with a single word or phrase, indicating the label of the document."""
28 | else:
29 | print ("experiment not implemented")
30 | sys.exit(0)
31 | return system_prompt
32 |
33 |
34 |
35 | # add logit bias
36 | if __name__ == "__main__":
37 |
38 | parser = argparse.ArgumentParser()
39 | parser.add_argument("--API_KEY", default="openai API key", type=str, required=True, help="valid openAI API key")
40 | parser.add_argument("--dataset", default="wikitext", type=str, help="dataset (wikitext or bills)")
41 | parser.add_argument("--label_categories", default="broad", type=str, help="granularity of ground-truth labels (part of the prompt): broad or specific.")
42 | args = parser.parse_args()
43 |
44 | random.seed(42)
45 | system_prompt = get_system_prompt(args)
46 | openai.api_key = args.API_KEY
47 |
48 | dataset = "bills"
49 | if args.dataset == "bills":
50 | column = "summary"
51 | else:
52 | column ="text"
53 |
54 | with open(os.path.join("data", args.dataset, "processed/labeled/vocab_15k/vocab.json")) as f:
55 | vocab = json.load(f)
56 |
57 | vocab = {j:i for i,j in vocab.items()}
58 |
59 | # load metadata
60 | df_metadata = pd.read_json(os.path.join("data", args.dataset, "processed/labeled/vocab_15k/train.metadata.jsonl"), lines=True)
61 |
62 | paths = ["runs/outputs/k_selection/" + args.dataset + "-labeled/vocab_15k/k-" + str(i) for i in range(20, 420, 20)]
63 |
64 | output_path = "number-of-topics-section-4"
65 | os.makedirs(output_path, exist_ok=True)
66 |
67 | with open(os.path.join(output_path, "document_label_assignment_" + args.dataset + "_" + args.label_categories + ".jsonl"), "w") as outfile:
68 | for path in tqdm(paths):
69 | path = os.path.join(path, "2972")
70 | beta = np.load(os.path.join(path, "beta.npy"))
71 | theta = np.load(os.path.join(path, "train.theta.npy")).T # transpose
72 |
73 |             print (beta.shape) # (num_topics, 15'000): each row is a probability distribution over the vocabulary
74 |             print (theta.shape) # (num_topics, num_docs) after the transpose: row t holds topic t's weight for every training document
75 | num_topics = beta.shape[0]
76 |
77 | # sample some topics
78 | sampled_topics = random.sample(list(range(num_topics)), 5)
79 |
80 | # for each topic
81 | for topic in sampled_topics:
82 | # sample top documents
83 | num_topics = 0
84 | user_prompt = ""
85 | arg_indices = np.argsort(theta[topic])[::-1][:10]
86 |
87 | for k, index in enumerate(arg_indices):
88 | # get text of this document
89 | text = df_metadata[column].iloc[index]
90 | text = " ".join(text.split()[:50]) # only take first 50 words
91 | user_prompt = text
92 | print (system_prompt)
93 | print (user_prompt)
94 | response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}], temperature=0.0, max_tokens=20)["choices"][0]["message"]["content"].strip()
95 | print ("topic", topic, "response --", response)
96 | out = {"path": path, "user_prompt": user_prompt, "response": response, "topic": topic, "k":k}
97 | json.dump(out, outfile)
98 | outfile.write("\n")
99 | time.sleep(0.1)
100 |
--------------------------------------------------------------------------------
/src-number-of-topics/chatGPT_ratings_assignment.py:
--------------------------------------------------------------------------------
1 | import os
2 | import json
3 | import numpy as np
4 | import sys
5 | import random
6 | import openai
7 | from tqdm import tqdm
8 | import pandas as pd
9 | import argparse
10 | import time
11 |
12 |
13 | def get_system_prompt(args):
14 | if args.dataset == "bills" and args.label_categories == "broad":
15 | system_prompt = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related).
16 | The topic modeling is based on a legislative Bill summary dataset. We are interested in coherent broad topics. Typical topics in the dataset include "Health", "Public Lands", "Domestic Commerce", "Government Operations", or "Defense".
17 | Reply with a single number, indicating the overall appropriateness of the topic."""
18 |
19 | elif args.dataset == "wikitext" and args.label_categories == "broad":
20 | system_prompt = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related).
21 | The topic modeling is based on the Wikipedia corpus. Wikipedia is an online encyclopedia covering a huge range of topics. Typical topics in the dataset include "television", "songs", "transport", "warships and naval units", and "biology and medicine".
22 | Reply with a single number, indicating the overall appropriateness of the topic."""
23 |
24 | elif args.dataset == "wikitext" and args.label_categories == "specific":
25 | system_prompt = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related).
26 | The topic modeling is based on the Wikipedia corpus. Wikipedia is an online encyclopedia covering a huge range of topics. Typical topics in the dataset include "tropical cyclones: atlantic", "actors, directors, models, performers, and celebrities", "road infrastructure: midwestern united states", "armies and military units", and "warships of germany".
27 | Reply with a single number, indicating the overall appropriateness of the topic."""
28 | else:
29 | print ("experiment not implemented")
30 | sys.exit(0)
31 | return system_prompt
32 |
33 | # add logit bias
34 | if __name__ == "__main__":
35 | parser = argparse.ArgumentParser()
36 | parser.add_argument("--API_KEY", default="openai API key", type=str, required=True, help="valid openAI API key")
37 | parser.add_argument("--dataset", default="wikitext", type=str, help="dataset (wikitext or bills)")
38 | parser.add_argument("--label_categories", default="broad", type=str, help="granularity of ground-truth labels (part of the prompt): broad or specific.")
39 | args = parser.parse_args()
40 |
41 | random.seed(42)
42 | system_prompt = get_system_prompt(args)
43 | openai.api_key = args.API_KEY
44 |
45 | with open(os.path.join("data", args.dataset, "processed/labeled/vocab_15k/vocab.json")) as f:
46 | vocab = json.load(f)
47 | vocab = {j:i for i,j in vocab.items()}
48 |
49 | paths = ["runs/outputs/k_selection/" + args.dataset + "-labeled/vocab_15k/k-" + str(i) for i in range(20, 420, 20)]
50 |
51 |
52 | output_path = "number-of-topics-section-4"
53 | os.makedirs(output_path, exist_ok=True)
54 |
55 | with open(os.path.join(output_path, "coherence_ratings_" + args.dataset + "_" + args.label_categories + ".jsonl"), "w") as outfile:
56 | for path in tqdm(paths):
57 | path = os.path.join(path, "2972")
58 | beta = np.load(os.path.join(path, "beta.npy"))
59 | theta = np.load(os.path.join(path, "train.theta.npy"))
60 |
61 |             print (beta.shape) # (num_topics, 15'000): each row is a probability distribution over the vocabulary
62 | print (theta.shape)
63 | num_topics = beta.shape[0]
64 | top_words = []
65 | for row in beta:
66 | indices = row.argsort()[::-1][:10]
67 | top_topic_words = [vocab[i] for i in indices]
68 | top_words.append(top_topic_words)
69 |
70 | # sample 10 topics
71 |
72 | sampled_topics = random.sample(list(range(num_topics)), k=10)
73 | for i in sampled_topics:
74 | topic = top_words[i]
75 | random.shuffle(topic)
76 | user_prompt = ", ".join(topic)
77 |
78 |                 response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}], temperature=0, max_tokens=1, logit_bias={16:100, 17:100, 18:100})["choices"][0]["message"]["content"].strip()  # logit_bias strongly biases the single output token towards "1", "2" and "3", the allowed ratings
79 | out = {"path": path, "topic": i, "user_prompt": user_prompt, "response": response}
80 | json.dump(out, outfile)
81 | outfile.write("\n")
82 | print (response)
83 | time.sleep(0.1)
84 |
--------------------------------------------------------------------------------
/topic-modeling-output/etm-topics-best-c_npmi_10_full.json:
--------------------------------------------------------------------------------
1 | {
2 | "nytimes": {
3 | "c_npmi_10_full": 0.11378887397558148,
4 | "c_npmi_10_full_sd": 0.09088256843754142,
5 | "tu": 0.904,
6 | "to": 0.0770408163265306,
7 | "overlaps": 0,
8 | "anneal_lr": 0,
9 | "data_path": "/workspace/topic-preprocessing/data/nytimes/processed/full-mindf_power_law-maxdf_0.9/etm",
10 | "epochs": 1000,
11 | "lr": 0.02,
12 | "seed": 11235,
13 | "wdecay": 1.2e-06,
14 | "input_dir": "nytimes",
15 | "topics": [
16 | [
17 | "campaign",
18 | "bush",
19 | "clinton",
20 | "vote",
21 | "state",
22 | "congress",
23 | "republican",
24 | "administration",
25 | "election",
26 | "governor",
27 | "political",
28 | "bill",
29 | "democrats",
30 | "senator",
31 | "senate",
32 | "democratic",
33 | "party",
34 | "legislation",
35 | "republicans",
36 | "white_house",
37 | "candidate",
38 | "presidential",
39 | "lawmakers",
40 | "voters",
41 | "congressional",
42 | "debate",
43 | "representative",
44 | "washington",
45 | "candidates",
46 | "issue",
47 | "abortion",
48 | "democrat",
49 | "votes",
50 | "government",
51 | "parties",
52 | "legislature",
53 | "elections",
54 | "term",
55 | "leader",
56 | "opponents",
57 | "speaker",
58 | "primary",
59 | "elected",
60 | "opposition",
61 | "leaders",
62 | "support",
63 | "policy",
64 | "conservative",
65 | "reagan",
66 | "speech"
67 | ],
68 | [
69 | "percent",
70 | "market",
71 | "rate",
72 | "prices",
73 | "rates",
74 | "price",
75 | "economy",
76 | "growth",
77 | "dollar",
78 | "lower",
79 | "average",
80 | "interest",
81 | "month",
82 | "rose",
83 | "stocks",
84 | "bonds",
85 | "higher",
86 | "markets",
87 | "inflation",
88 | "stock",
89 | "fell",
90 | "yesterday",
91 | "sales",
92 | "quarter",
93 | "rise",
94 | "bond",
95 | "economic",
96 | "billion",
97 | "decline",
98 | "value",
99 | "trading",
100 | "consumer",
101 | "points",
102 | "demand",
103 | "investors",
104 | "treasury",
105 | "index",
106 | "expected",
107 | "traders",
108 | "drop",
109 | "profits",
110 | "issues",
111 | "percentage",
112 | "discount",
113 | "unemployment",
114 | "week",
115 | "economists",
116 | "currency",
117 | "analysts",
118 | "fed"
119 | ],
120 | [
121 | "years",
122 | "life",
123 | "family",
124 | "home",
125 | "old",
126 | "died",
127 | "death",
128 | "wife",
129 | "lives",
130 | "man",
131 | "born",
132 | "son",
133 | "lived",
134 | "living",
135 | "marriage",
136 | "couple",
137 | "church",
138 | "brother",
139 | "heart",
140 | "brothers",
141 | "age",
142 | "gay",
143 | "population",
144 | "career",
145 | "sons",
146 | "grew",
147 | "daughter",
148 | "moved",
149 | "serving",
150 | "native",
151 | "house",
152 | "resident",
153 | "men",
154 | "decades",
155 | "divorce",
156 | "battle",
157 | "married",
158 | "section",
159 | "retirement",
160 | "relatives",
161 | "catholic",
162 | "birthday",
163 | "chapter",
164 | "doctor",
165 | "daughters",
166 | "couples",
167 | "visit",
168 | "apartment",
169 | "year",
170 | "roman"
171 | ],
172 | [
173 | "said",
174 | "added",
175 | "asked",
176 | "year",
177 | "saying",
178 | "months",
179 | "spokesman",
180 | "told",
181 | "week",
182 | "month",
183 | "expected",
184 | "interview",
185 | "going",
186 | "years",
187 | "according",
188 | "called",
189 | "decision",
190 | "statement",
191 | "days",
192 | "continued",
193 | "plans",
194 | "discuss",
195 | "taking",
196 | "trying",
197 | "involved",
198 | "decided",
199 | "weeks",
200 | "earlier",
201 | "believed",
202 | "suggested",
203 | "adding",
204 | "noted",
205 | "comment",
206 | "appeared",
207 | "declined",
208 | "referring",
209 | "similar",
210 | "described",
211 | "wanted",
212 | "sent",
213 | "acknowledged",
214 | "took",
215 | "spoke",
216 | "began",
217 | "taken",
218 | "concerned",
219 | "refused",
220 | "reported",
221 | "discussed",
222 | "calls"
223 | ],
224 | [
225 | "glass",
226 | "red",
227 | "blue",
228 | "wear",
229 | "light",
230 | "wearing",
231 | "hair",
232 | "white",
233 | "fashion",
234 | "clothes",
235 | "yellow",
236 | "wood",
237 | "plastic",
238 | "gray",
239 | "shoes",
240 | "color",
241 | "colors",
242 | "pink",
243 | "metal",
244 | "dress",
245 | "wore",
246 | "paint",
247 | "leather",
248 | "colored",
249 | "coat",
250 | "hanging",
251 | "length",
252 | "green",
253 | "tall",
254 | "look",
255 | "looking",
256 | "glasses",
257 | "sunglasses",
258 | "skin",
259 | "shoe",
260 | "classy",
261 | "buttons",
262 | "clothing",
263 | "shirt",
264 | "hang",
265 | "wooden",
266 | "stick",
267 | "liked",
268 | "toy",
269 | "window",
270 | "inside",
271 | "feet",
272 | "looks",
273 | "dressed",
274 | "underneath"
275 | ],
276 | [
277 | "music",
278 | "band",
279 | "songs",
280 | "concert",
281 | "rock",
282 | "dance",
283 | "jazz",
284 | "album",
285 | "musical",
286 | "musicians",
287 | "piano",
288 | "performance",
289 | "singer",
290 | "song",
291 | "theater",
292 | "pop",
293 | "performances",
294 | "stage",
295 | "performed",
296 | "night",
297 | "opera",
298 | "orchestra",
299 | "composer",
300 | "piece",
301 | "play",
302 | "played",
303 | "singers",
304 | "production",
305 | "evening",
306 | "singing",
307 | "ballet",
308 | "tonight",
309 | "solo",
310 | "plays",
311 | "debut",
312 | "classical",
313 | "presented",
314 | "repertory",
315 | "program",
316 | "chamber",
317 | "performing",
318 | "quartet",
319 | "recordings",
320 | "pianist",
321 | "duo",
322 | "tunes",
323 | "works",
324 | "playing",
325 | "blues",
326 | "festival"
327 | ],
328 | [
329 | "year",
330 | "years",
331 | "ago",
332 | "later",
333 | "old",
334 | "late",
335 | "major",
336 | "record",
337 | "lost",
338 | "leading",
339 | "early",
340 | "won",
341 | "helped",
342 | "worked",
343 | "known",
344 | "annual",
345 | "national",
346 | "held",
347 | "figure",
348 | "director",
349 | "club",
350 | "highest",
351 | "cause",
352 | "host",
353 | "earlier",
354 | "event",
355 | "ended",
356 | "million",
357 | "month",
358 | "award",
359 | "earned",
360 | "joined",
361 | "rare",
362 | "degree",
363 | "expert",
364 | "raised",
365 | "paid",
366 | "attended",
367 | "minor",
368 | "previous",
369 | "chosen",
370 | "bridge",
371 | "founded",
372 | "position",
373 | "produced",
374 | "spent",
375 | "moved",
376 | "principal",
377 | "received",
378 | "active"
379 | ],
380 | [
381 | "street",
382 | "avenue",
383 | "west",
384 | "information",
385 | "sunday",
386 | "tickets",
387 | "tomorrow",
388 | "east",
389 | "free",
390 | "park",
391 | "manhattan",
392 | "saturday",
393 | "hours",
394 | "admission",
395 | "broadway",
396 | "center",
397 | "include",
398 | "new",
399 | "fifth",
400 | "open",
401 | "friday",
402 | "madison",
403 | "theater",
404 | "noon",
405 | "garden",
406 | "square",
407 | "tour",
408 | "includes",
409 | "sundays",
410 | "children",
411 | "new_york",
412 | "place",
413 | "saturdays",
414 | "library",
415 | "april",
416 | "road",
417 | "subway",
418 | "shows",
419 | "festival",
420 | "opens",
421 | "available",
422 | "benefit",
423 | "times",
424 | "tuesday",
425 | "fridays",
426 | "reservations",
427 | "events",
428 | "section",
429 | "near",
430 | "cards"
431 | ],
432 | [
433 | "wine",
434 | "food",
435 | "fresh",
436 | "taste",
437 | "add",
438 | "cooking",
439 | "sugar",
440 | "heat",
441 | "minutes",
442 | "chicken",
443 | "cook",
444 | "red",
445 | "juice",
446 | "menu",
447 | "cooked",
448 | "olive",
449 | "cheese",
450 | "salt",
451 | "meat",
452 | "fat",
453 | "cup",
454 | "kosher",
455 | "meal",
456 | "dishes",
457 | "vegetables",
458 | "garlic",
459 | "dish",
460 | "oil",
461 | "rice",
462 | "seafood",
463 | "restaurant",
464 | "soup",
465 | "sauce",
466 | "oven",
467 | "milk",
468 | "ground",
469 | "cups",
470 | "warm",
471 | "beef",
472 | "bread",
473 | "eat",
474 | "desserts",
475 | "steak",
476 | "pepper",
477 | "medium",
478 | "small",
479 | "butter",
480 | "potato",
481 | "coffee",
482 | "white"
483 | ],
484 | [
485 | "water",
486 | "oil",
487 | "plants",
488 | "plant",
489 | "energy",
490 | "gas",
491 | "natural",
492 | "trees",
493 | "space",
494 | "clean",
495 | "earth",
496 | "waste",
497 | "fish",
498 | "animals",
499 | "environmental",
500 | "animal",
501 | "soil",
502 | "fuel",
503 | "scientists",
504 | "birds",
505 | "snow",
506 | "rain",
507 | "wildlife",
508 | "storm",
509 | "ice",
510 | "debris",
511 | "electricity",
512 | "steel",
513 | "bird",
514 | "farm",
515 | "electric",
516 | "forest",
517 | "tree",
518 | "pollution",
519 | "surface",
520 | "grass",
521 | "ground",
522 | "farmers",
523 | "gasoline",
524 | "weather",
525 | "electrical",
526 | "heavy",
527 | "cubic",
528 | "species",
529 | "organic",
530 | "arctic",
531 | "power",
532 | "solar",
533 | "habitats",
534 | "light"
535 | ],
536 | [
537 | "police",
538 | "said",
539 | "officers",
540 | "killed",
541 | "crime",
542 | "people",
543 | "prison",
544 | "fire",
545 | "arrested",
546 | "officer",
547 | "men",
548 | "man",
549 | "killing",
550 | "shot",
551 | "death",
552 | "victims",
553 | "gun",
554 | "murder",
555 | "wounded",
556 | "accused",
557 | "authorities",
558 | "attack",
559 | "found",
560 | "drug",
561 | "charged",
562 | "shooting",
563 | "dead",
564 | "crimes",
565 | "assault",
566 | "violent",
567 | "investigators",
568 | "reported",
569 | "security",
570 | "arrest",
571 | "suspect",
572 | "taken",
573 | "spokesman",
574 | "woman",
575 | "violence",
576 | "guns",
577 | "car",
578 | "year",
579 | "narcotics",
580 | "told",
581 | "nyt",
582 | "incident",
583 | "jail",
584 | "driver",
585 | "rape",
586 | "identified"
587 | ],
588 | [
589 | "fact",
590 | "time",
591 | "times",
592 | "later",
593 | "believe",
594 | "having",
595 | "probably",
596 | "real",
597 | "true",
598 | "words",
599 | "word",
600 | "face",
601 | "bit",
602 | "seen",
603 | "mean",
604 | "answer",
605 | "simply",
606 | "far",
607 | "clearly",
608 | "hope",
609 | "questions",
610 | "hands",
611 | "possible",
612 | "past",
613 | "instead",
614 | "certainly",
615 | "easily",
616 | "subject",
617 | "days",
618 | "attempt",
619 | "person",
620 | "means",
621 | "surprise",
622 | "tell",
623 | "yes",
624 | "present",
625 | "doubt",
626 | "confidence",
627 | "apparently",
628 | "suggests",
629 | "usual",
630 | "speak",
631 | "appear",
632 | "eye",
633 | "prove",
634 | "place",
635 | "short",
636 | "reality",
637 | "actually",
638 | "appears"
639 | ],
640 | [
641 | "new_york",
642 | "yesterday",
643 | "director",
644 | "manhattan",
645 | "brooklyn",
646 | "received",
647 | "new_jersey",
648 | "named",
649 | "queens",
650 | "owner",
651 | "assistant",
652 | "new_york_city",
653 | "connecticut",
654 | "department",
655 | "boston",
656 | "retired",
657 | "announced",
658 | "bronx",
659 | "los_angeles",
660 | "washington",
661 | "master",
662 | "executive",
663 | "professor",
664 | "founder",
665 | "manager",
666 | "firm",
667 | "associate",
668 | "partner",
669 | "vice",
670 | "consultant",
671 | "greenwich",
672 | "princeton",
673 | "san_francisco",
674 | "research",
675 | "hartford",
676 | "newark",
677 | "ohio",
678 | "managing",
679 | "harvard",
680 | "known",
681 | "senior",
682 | "division",
683 | "joined",
684 | "captain",
685 | "philadelphia",
686 | "coordinator",
687 | "yale",
688 | "amp",
689 | "columbia",
690 | "staten_island"
691 | ],
692 | [
693 | "computer",
694 | "internet",
695 | "technology",
696 | "web",
697 | "system",
698 | "software",
699 | "computers",
700 | "systems",
701 | "video",
702 | "data",
703 | "car",
704 | "equipment",
705 | "users",
706 | "digital",
707 | "cars",
708 | "sites",
709 | "phone",
710 | "customers",
711 | "online",
712 | "network",
713 | "electronic",
714 | "personal",
715 | "vehicle",
716 | "site",
717 | "models",
718 | "mail",
719 | "information",
720 | "auto",
721 | "chip",
722 | "speed",
723 | "customer",
724 | "products",
725 | "device",
726 | "use",
727 | "available",
728 | "devices",
729 | "apple",
730 | "price",
731 | "mobile",
732 | "engine",
733 | "vehicles",
734 | "manufacturing",
735 | "machines",
736 | "consumers",
737 | "user",
738 | "allow",
739 | "microsoft",
740 | "type",
741 | "networks",
742 | "ford"
743 | ],
744 | [
745 | "game",
746 | "team",
747 | "season",
748 | "games",
749 | "play",
750 | "players",
751 | "coach",
752 | "player",
753 | "teams",
754 | "league",
755 | "ball",
756 | "football",
757 | "played",
758 | "baseball",
759 | "basketball",
760 | "mets",
761 | "giants",
762 | "yards",
763 | "yankees",
764 | "jets",
765 | "playing",
766 | "seasons",
767 | "nets",
768 | "pitch",
769 | "stadium",
770 | "hockey",
771 | "fans",
772 | "rangers",
773 | "knicks",
774 | "coaching",
775 | "yard",
776 | "field",
777 | "yankee",
778 | "pitcher",
779 | "bowl",
780 | "quarterback",
781 | "playoffs",
782 | "rookie",
783 | "redskins",
784 | "inning",
785 | "teammates",
786 | "offense",
787 | "preseason",
788 | "national_football_league",
789 | "defensive",
790 | "leagues",
791 | "innings",
792 | "pitching",
793 | "touchdown",
794 | "minutes"
795 | ],
796 | [
797 | "week",
798 | "article",
799 | "page",
800 | "march",
801 | "tuesday",
802 | "june",
803 | "july",
804 | "friday",
805 | "thursday",
806 | "wednesday",
807 | "day",
808 | "april",
809 | "monday",
810 | "telephone",
811 | "production",
812 | "reported",
813 | "misstated",
814 | "correction",
815 | "weeks",
816 | "new",
817 | "scheduled",
818 | "sunday",
819 | "date",
820 | "expected",
821 | "month",
822 | "numbers",
823 | "referred",
824 | "november",
825 | "saturday",
826 | "daily",
827 | "fall",
828 | "months",
829 | "report",
830 | "number",
831 | "news",
832 | "appeared",
833 | "error",
834 | "following",
835 | "closed",
836 | "announced",
837 | "october",
838 | "september",
839 | "incorrectly",
840 | "picture",
841 | "year",
842 | "column",
843 | "copies",
844 | "brief",
845 | "december",
846 | "august"
847 | ],
848 | [
849 | "board",
850 | "members",
851 | "mayor",
852 | "groups",
853 | "agency",
854 | "meeting",
855 | "officials",
856 | "plan",
857 | "public",
858 | "city",
859 | "union",
860 | "committee",
861 | "agreement",
862 | "agreed",
863 | "organization",
864 | "rules",
865 | "labor",
866 | "official",
867 | "commission",
868 | "approved",
869 | "member",
870 | "leaders",
871 | "issues",
872 | "group",
873 | "proposed",
874 | "proposal",
875 | "owners",
876 | "association",
877 | "negotiations",
878 | "strike",
879 | "council",
880 | "deal",
881 | "announced",
882 | "rejected",
883 | "decision",
884 | "approval",
885 | "director",
886 | "major",
887 | "talks",
888 | "chairman",
889 | "giuliani",
890 | "authority",
891 | "workers",
892 | "dispute",
893 | "process",
894 | "plans",
895 | "commissioner",
896 | "organizations",
897 | "joint",
898 | "effort"
899 | ],
900 | [
901 | "today",
902 | "group",
903 | "including",
904 | "called",
905 | "led",
906 | "known",
907 | "began",
908 | "built",
909 | "early",
910 | "planned",
911 | "held",
912 | "created",
913 | "included",
914 | "considered",
915 | "brought",
916 | "completed",
917 | "based",
918 | "offered",
919 | "taken",
920 | "given",
921 | "followed",
922 | "intended",
923 | "sent",
924 | "organized",
925 | "recently",
926 | "replaced",
927 | "established",
928 | "working",
929 | "designed",
930 | "include",
931 | "join",
932 | "holding",
933 | "produced",
934 | "abandoned",
935 | "giving",
936 | "developed",
937 | "formed",
938 | "calling",
939 | "focused",
940 | "joined",
941 | "met",
942 | "introduced",
943 | "turned",
944 | "opened",
945 | "gave",
946 | "provided",
947 | "studied",
948 | "destroyed",
949 | "presented",
950 | "remained"
951 | ],
952 | [
953 | "year",
954 | "contract",
955 | "left",
956 | "signed",
957 | "manager",
958 | "second",
959 | "chicago",
960 | "free",
961 | "right",
962 | "lost",
963 | "pass",
964 | "agent",
965 | "forward",
966 | "practice",
967 | "atlanta",
968 | "season",
969 | "home",
970 | "day",
971 | "draft",
972 | "brown",
973 | "defense",
974 | "johnson",
975 | "houston",
976 | "smith",
977 | "florida",
978 | "seattle",
979 | "camp",
980 | "miami",
981 | "list",
982 | "led",
983 | "seven",
984 | "sunday",
985 | "dallas",
986 | "today",
987 | "yesterday",
988 | "jackson",
989 | "super",
990 | "terms",
991 | "texas",
992 | "guard",
993 | "detroit",
994 | "tonight",
995 | "running",
996 | "record",
997 | "training",
998 | "cleveland",
999 | "starting",
1000 | "agreed",
1001 | "washington",
1002 | "toronto"
1003 | ],
1004 | [
1005 | "like",
1006 | "little",
1007 | "good",
1008 | "best",
1009 | "line",
1010 | "hard",
1011 | "makes",
1012 | "small",
1013 | "find",
1014 | "long",
1015 | "look",
1016 | "especially",
1017 | "comes",
1018 | "come",
1019 | "called",
1020 | "use",
1021 | "sign",
1022 | "easy",
1023 | "better",
1024 | "usually",
1025 | "clear",
1026 | "form",
1027 | "offers",
1028 | "turn",
1029 | "recently",
1030 | "range",
1031 | "particular",
1032 | "looks",
1033 | "real",
1034 | "gives",
1035 | "instead",
1036 | "want",
1037 | "gets",
1038 | "turns",
1039 | "particularly",
1040 | "goes",
1041 | "turning",
1042 | "takes",
1043 | "need",
1044 | "available",
1045 | "putting",
1046 | "individual",
1047 | "means",
1048 | "taking",
1049 | "large",
1050 | "unlike",
1051 | "rarely",
1052 | "quality",
1053 | "spread",
1054 | "hand"
1055 | ],
1056 | [
1057 | "city",
1058 | "government",
1059 | "people",
1060 | "local",
1061 | "state",
1062 | "program",
1063 | "work",
1064 | "public",
1065 | "service",
1066 | "workers",
1067 | "site",
1068 | "development",
1069 | "private",
1070 | "new",
1071 | "jobs",
1072 | "nation",
1073 | "care",
1074 | "services",
1075 | "help",
1076 | "county",
1077 | "project",
1078 | "system",
1079 | "programs",
1080 | "officials",
1081 | "residents",
1082 | "areas",
1083 | "cities",
1084 | "health",
1085 | "projects",
1086 | "provide",
1087 | "land",
1088 | "plans",
1089 | "community",
1090 | "poor",
1091 | "housing",
1092 | "new_york_city",
1093 | "plan",
1094 | "build",
1095 | "country",
1096 | "welfare",
1097 | "working",
1098 | "need",
1099 | "employees",
1100 | "thousands",
1101 | "homes",
1102 | "region",
1103 | "emergency",
1104 | "families",
1105 | "area",
1106 | "agencies"
1107 | ],
1108 | [
1109 | "political",
1110 | "rights",
1111 | "religious",
1112 | "human",
1113 | "anti",
1114 | "democracy",
1115 | "freedom",
1116 | "civil",
1117 | "struggle",
1118 | "protest",
1119 | "war",
1120 | "politics",
1121 | "power",
1122 | "racial",
1123 | "social",
1124 | "apartheid",
1125 | "communist",
1126 | "views",
1127 | "moral",
1128 | "influence",
1129 | "revolution",
1130 | "religion",
1131 | "liberal",
1132 | "vietnam",
1133 | "independence",
1134 | "conflicts",
1135 | "faith",
1136 | "ethnic",
1137 | "fear",
1138 | "movement",
1139 | "nazi",
1140 | "matters",
1141 | "speech",
1142 | "prominent",
1143 | "outrage",
1144 | "protests",
1145 | "radical",
1146 | "solidarity",
1147 | "conscience",
1148 | "spiritual",
1149 | "discussion",
1150 | "liberties",
1151 | "philosophy",
1152 | "protesters",
1153 | "opposition",
1154 | "liberation",
1155 | "demonstrations",
1156 | "ideological",
1157 | "community",
1158 | "outraged"
1159 | ],
1160 | [
1161 | "health",
1162 | "medical",
1163 | "drug",
1164 | "research",
1165 | "doctors",
1166 | "patients",
1167 | "drugs",
1168 | "study",
1169 | "aids",
1170 | "blood",
1171 | "patient",
1172 | "researchers",
1173 | "tests",
1174 | "cancer",
1175 | "treatment",
1176 | "hospital",
1177 | "studies",
1178 | "medicine",
1179 | "disease",
1180 | "human",
1181 | "care",
1182 | "brain",
1183 | "risk",
1184 | "testing",
1185 | "tested",
1186 | "cell",
1187 | "smoking",
1188 | "physicians",
1189 | "scientists",
1190 | "nutrition",
1191 | "treatments",
1192 | "heart",
1193 | "treat",
1194 | "therapy",
1195 | "effective",
1196 | "clinic",
1197 | "cause",
1198 | "body",
1199 | "genetic",
1200 | "use",
1201 | "virus",
1202 | "condition",
1203 | "substance",
1204 | "test",
1205 | "gene",
1206 | "animal",
1207 | "effects",
1208 | "samples",
1209 | "medication",
1210 | "laboratory"
1211 | ],
1212 | [
1213 | "time",
1214 | "half",
1215 | "center",
1216 | "open",
1217 | "away",
1218 | "place",
1219 | "high",
1220 | "day",
1221 | "run",
1222 | "head",
1223 | "days",
1224 | "hours",
1225 | "end",
1226 | "right",
1227 | "home",
1228 | "field",
1229 | "left",
1230 | "minutes",
1231 | "feet",
1232 | "summer",
1233 | "hour",
1234 | "close",
1235 | "eyes",
1236 | "single",
1237 | "hand",
1238 | "inch",
1239 | "green",
1240 | "set",
1241 | "drive",
1242 | "let",
1243 | "walk",
1244 | "cut",
1245 | "spring",
1246 | "wall",
1247 | "wide",
1248 | "foot",
1249 | "wait",
1250 | "nearly",
1251 | "watch",
1252 | "leaves",
1253 | "ground",
1254 | "double",
1255 | "stop",
1256 | "box",
1257 | "opening",
1258 | "seat",
1259 | "base",
1260 | "seven",
1261 | "caught",
1262 | "opened"
1263 | ],
1264 | [
1265 | "military",
1266 | "war",
1267 | "iraq",
1268 | "israel",
1269 | "peace",
1270 | "government",
1271 | "israeli",
1272 | "forces",
1273 | "officials",
1274 | "united_nations",
1275 | "american",
1276 | "security",
1277 | "soviet",
1278 | "iran",
1279 | "official",
1280 | "iraqi",
1281 | "arab",
1282 | "russia",
1283 | "weapons",
1284 | "united_states",
1285 | "nuclear",
1286 | "troops",
1287 | "intelligence",
1288 | "attacks",
1289 | "lebanon",
1290 | "attack",
1291 | "palestinian",
1292 | "army",
1293 | "afghanistan",
1294 | "pakistan",
1295 | "administration",
1296 | "minister",
1297 | "arms",
1298 | "fighting",
1299 | "leaders",
1300 | "prime",
1301 | "islamic",
1302 | "russian",
1303 | "talks",
1304 | "middle_east",
1305 | "militants",
1306 | "international",
1307 | "muslim",
1308 | "israelis",
1309 | "armed",
1310 | "soldiers",
1311 | "country",
1312 | "treaty",
1313 | "iranian",
1314 | "soviet_union"
1315 | ],
1316 | [
1317 | "great",
1318 | "member",
1319 | "members",
1320 | "served",
1321 | "friends",
1322 | "friend",
1323 | "entire",
1324 | "jewish",
1325 | "king",
1326 | "service",
1327 | "jews",
1328 | "extend",
1329 | "special",
1330 | "deeply",
1331 | "leader",
1332 | "deep",
1333 | "mass",
1334 | "community",
1335 | "thomas",
1336 | "longtime",
1337 | "spirit",
1338 | "george",
1339 | "hill",
1340 | "david",
1341 | "wish",
1342 | "honor",
1343 | "chairman",
1344 | "kennedy",
1345 | "gift",
1346 | "choice",
1347 | "jordan",
1348 | "divided",
1349 | "loved",
1350 | "john",
1351 | "commitment",
1352 | "missed",
1353 | "christian",
1354 | "express",
1355 | "foundation",
1356 | "james",
1357 | "temple",
1358 | "shared",
1359 | "wonderful",
1360 | "dedicated",
1361 | "values",
1362 | "jerusalem",
1363 | "prince",
1364 | "mark",
1365 | "martin",
1366 | "south"
1367 | ],
1368 | [
1369 | "won",
1370 | "victory",
1371 | "win",
1372 | "lead",
1373 | "run",
1374 | "second",
1375 | "points",
1376 | "night",
1377 | "fourth",
1378 | "round",
1379 | "champion",
1380 | "title",
1381 | "winning",
1382 | "shot",
1383 | "tournament",
1384 | "final",
1385 | "beat",
1386 | "hit",
1387 | "time",
1388 | "championship",
1389 | "winner",
1390 | "cup",
1391 | "horse",
1392 | "game",
1393 | "finished",
1394 | "series",
1395 | "record",
1396 | "race",
1397 | "goal",
1398 | "seconds",
1399 | "match",
1400 | "point",
1401 | "minutes",
1402 | "saturday",
1403 | "period",
1404 | "tonight",
1405 | "boxing",
1406 | "times",
1407 | "races",
1408 | "track",
1409 | "gave",
1410 | "minute",
1411 | "fifth",
1412 | "ninth",
1413 | "career",
1414 | "line",
1415 | "sixth",
1416 | "start",
1417 | "shots",
1418 | "best"
1419 | ],
1420 | [
1421 | "night",
1422 | "people",
1423 | "day",
1424 | "house",
1425 | "town",
1426 | "white",
1427 | "store",
1428 | "room",
1429 | "morning",
1430 | "hotel",
1431 | "stores",
1432 | "message",
1433 | "party",
1434 | "names",
1435 | "christmas",
1436 | "days",
1437 | "outside",
1438 | "streets",
1439 | "door",
1440 | "near",
1441 | "bar",
1442 | "time",
1443 | "way",
1444 | "evening",
1445 | "afternoon",
1446 | "live",
1447 | "main",
1448 | "country",
1449 | "home",
1450 | "shop",
1451 | "restaurant",
1452 | "rooms",
1453 | "dinner",
1454 | "visit",
1455 | "table",
1456 | "standing",
1457 | "crowd",
1458 | "business",
1459 | "shopping",
1460 | "summer",
1461 | "restaurants",
1462 | "club",
1463 | "block",
1464 | "dozen",
1465 | "second",
1466 | "holiday",
1467 | "hour",
1468 | "hot",
1469 | "dead",
1470 | "weekend"
1471 | ],
1472 | [
1473 | "art",
1474 | "works",
1475 | "century",
1476 | "museum",
1477 | "artist",
1478 | "exhibition",
1479 | "artists",
1480 | "painting",
1481 | "gallery",
1482 | "paintings",
1483 | "photographs",
1484 | "collection",
1485 | "hall",
1486 | "contemporary",
1487 | "design",
1488 | "images",
1489 | "pictures",
1490 | "view",
1491 | "sculpture",
1492 | "modern",
1493 | "architecture",
1494 | "designed",
1495 | "style",
1496 | "19th",
1497 | "objects",
1498 | "exhibit",
1499 | "drawings",
1500 | "museums",
1501 | "arts",
1502 | "work",
1503 | "museum_of_modern_art",
1504 | "artworks",
1505 | "stone",
1506 | "painter",
1507 | "designers",
1508 | "landscape",
1509 | "visitors",
1510 | "designs",
1511 | "beautiful",
1512 | "18th",
1513 | "pieces",
1514 | "sculptor",
1515 | "architectural",
1516 | "collections",
1517 | "galleries",
1518 | "portrait",
1519 | "artifacts",
1520 | "20th",
1521 | "historic",
1522 | "prints"
1523 | ],
1524 | [
1525 | "new",
1526 | "long",
1527 | "change",
1528 | "way",
1529 | "work",
1530 | "big",
1531 | "small",
1532 | "lines",
1533 | "different",
1534 | "high",
1535 | "time",
1536 | "approach",
1537 | "view",
1538 | "far",
1539 | "decade",
1540 | "past",
1541 | "large",
1542 | "current",
1543 | "larger",
1544 | "term",
1545 | "longer",
1546 | "decades",
1547 | "moving",
1548 | "style",
1549 | "changing",
1550 | "rest",
1551 | "major",
1552 | "rich",
1553 | "early",
1554 | "point",
1555 | "end",
1556 | "generation",
1557 | "little",
1558 | "future",
1559 | "months",
1560 | "lack",
1561 | "better",
1562 | "great",
1563 | "slow",
1564 | "period",
1565 | "middle",
1566 | "era",
1567 | "rise",
1568 | "ways",
1569 | "changes",
1570 | "vast",
1571 | "modern",
1572 | "huge",
1573 | "worst",
1574 | "soon"
1575 | ],
1576 | [
1577 | "end",
1578 | "set",
1579 | "time",
1580 | "step",
1581 | "return",
1582 | "let",
1583 | "beginning",
1584 | "forced",
1585 | "future",
1586 | "come",
1587 | "second",
1588 | "bring",
1589 | "difficult",
1590 | "allow",
1591 | "competition",
1592 | "continue",
1593 | "begin",
1594 | "coming",
1595 | "protect",
1596 | "trying",
1597 | "reach",
1598 | "major",
1599 | "start",
1600 | "sides",
1601 | "parts",
1602 | "half",
1603 | "taking",
1604 | "setting",
1605 | "break",
1606 | "point",
1607 | "failed",
1608 | "able",
1609 | "necessary",
1610 | "came",
1611 | "remain",
1612 | "meet",
1613 | "stop",
1614 | "progress",
1615 | "complete",
1616 | "send",
1617 | "final",
1618 | "tough",
1619 | "balance",
1620 | "aside",
1621 | "try",
1622 | "begins",
1623 | "gap",
1624 | "took",
1625 | "need",
1626 | "meant"
1627 | ],
1628 | [
1629 | "black",
1630 | "school",
1631 | "students",
1632 | "television",
1633 | "schools",
1634 | "college",
1635 | "public",
1636 | "education",
1637 | "high",
1638 | "program",
1639 | "white",
1640 | "class",
1641 | "student",
1642 | "programs",
1643 | "cable",
1644 | "teachers",
1645 | "radio",
1646 | "university",
1647 | "network",
1648 | "district",
1649 | "teacher",
1650 | "sports",
1651 | "districts",
1652 | "hispanic",
1653 | "broadcast",
1654 | "entertainment",
1655 | "shows",
1656 | "stations",
1657 | "blacks",
1658 | "classes",
1659 | "colleges",
1660 | "campus",
1661 | "viewers",
1662 | "nbc",
1663 | "educational",
1664 | "faculty",
1665 | "teaching",
1666 | "fox",
1667 | "middle",
1668 | "private",
1669 | "cbs",
1670 | "special",
1671 | "grade",
1672 | "minority",
1673 | "principal",
1674 | "academic",
1675 | "disney",
1676 | "learning",
1677 | "graduate",
1678 | "studio"
1679 | ],
1680 | [
1681 | "court",
1682 | "judge",
1683 | "law",
1684 | "case",
1685 | "federal",
1686 | "lawyer",
1687 | "trial",
1688 | "lawyers",
1689 | "charges",
1690 | "justice",
1691 | "legal",
1692 | "attorney",
1693 | "jury",
1694 | "investigation",
1695 | "supreme_court",
1696 | "ruling",
1697 | "prosecutors",
1698 | "criminal",
1699 | "decision",
1700 | "hearing",
1701 | "filed",
1702 | "denied",
1703 | "laws",
1704 | "ruled",
1705 | "justice_department",
1706 | "courts",
1707 | "ordered",
1708 | "judges",
1709 | "lawsuit",
1710 | "charged",
1711 | "suit",
1712 | "prosecutor",
1713 | "inquiry",
1714 | "convicted",
1715 | "defense",
1716 | "testimony",
1717 | "office",
1718 | "amendment",
1719 | "prosecution",
1720 | "district",
1721 | "proceedings",
1722 | "prison",
1723 | "charge",
1724 | "conviction",
1725 | "guilty",
1726 | "defendants",
1727 | "argued",
1728 | "hearings",
1729 | "civil",
1730 | "rights"
1731 | ],
1732 | [
1733 | "million",
1734 | "money",
1735 | "budget",
1736 | "tax",
1737 | "year",
1738 | "billion",
1739 | "pay",
1740 | "spending",
1741 | "cost",
1742 | "costs",
1743 | "cut",
1744 | "bank",
1745 | "cuts",
1746 | "income",
1747 | "plan",
1748 | "fund",
1749 | "raise",
1750 | "financial",
1751 | "taxes",
1752 | "insurance",
1753 | "banks",
1754 | "funds",
1755 | "paid",
1756 | "deficit",
1757 | "dollars",
1758 | "aid",
1759 | "federal",
1760 | "financing",
1761 | "cash",
1762 | "increases",
1763 | "finance",
1764 | "raising",
1765 | "total",
1766 | "credit",
1767 | "new",
1768 | "reduce",
1769 | "proposed",
1770 | "savings",
1771 | "medicare",
1772 | "package",
1773 | "loans",
1774 | "paying",
1775 | "fiscal",
1776 | "capital",
1777 | "limits",
1778 | "house",
1779 | "payments",
1780 | "month",
1781 | "week",
1782 | "fees"
1783 | ],
1784 | [
1785 | "net",
1786 | "share",
1787 | "inc",
1788 | "earns",
1789 | "company",
1790 | "reports",
1791 | "loss",
1792 | "lead",
1793 | "sales",
1794 | "qtr",
1795 | "quarter",
1796 | "shares",
1797 | "revenue",
1798 | "year",
1799 | "outst",
1800 | "million",
1801 | "included",
1802 | "rev",
1803 | "march",
1804 | "june",
1805 | "nyse",
1806 | "cents",
1807 | "otc",
1808 | "6mo",
1809 | "income",
1810 | "9mo",
1811 | "gain",
1812 | "results",
1813 | "dec",
1814 | "months",
1815 | "operations",
1816 | "extraordinary",
1817 | "charge",
1818 | "latest",
1819 | "sept",
1820 | "earnings",
1821 | "tax",
1822 | "ago",
1823 | "corp",
1824 | "discontinued",
1825 | "amex",
1826 | "accounting",
1827 | "credit",
1828 | "respectively",
1829 | "sale",
1830 | "amp",
1831 | "period",
1832 | "losses",
1833 | "restructuring",
1834 | "april"
1835 | ],
1836 | [
1837 | "editor",
1838 | "book",
1839 | "wrote",
1840 | "published",
1841 | "magazine",
1842 | "books",
1843 | "author",
1844 | "written",
1845 | "paper",
1846 | "professor",
1847 | "read",
1848 | "writing",
1849 | "original",
1850 | "english",
1851 | "writer",
1852 | "pages",
1853 | "amp",
1854 | "language",
1855 | "version",
1856 | "life",
1857 | "letters",
1858 | "reading",
1859 | "review",
1860 | "ideas",
1861 | "found",
1862 | "science",
1863 | "write",
1864 | "letter",
1865 | "illustrated",
1866 | "known",
1867 | "called",
1868 | "novel",
1869 | "model",
1870 | "notes",
1871 | "writes",
1872 | "note",
1873 | "new_york",
1874 | "editorial",
1875 | "interest",
1876 | "account",
1877 | "publisher",
1878 | "readers",
1879 | "new_york_times",
1880 | "cast",
1881 | "director",
1882 | "sir",
1883 | "idea",
1884 | "writers",
1885 | "prize",
1886 | "critic"
1887 | ],
1888 | [
1889 | "children",
1890 | "women",
1891 | "young",
1892 | "woman",
1893 | "child",
1894 | "mother",
1895 | "old",
1896 | "love",
1897 | "parents",
1898 | "age",
1899 | "father",
1900 | "miss",
1901 | "boy",
1902 | "family",
1903 | "girl",
1904 | "baby",
1905 | "sex",
1906 | "husband",
1907 | "older",
1908 | "professional",
1909 | "ages",
1910 | "friends",
1911 | "kids",
1912 | "boys",
1913 | "younger",
1914 | "dog",
1915 | "fellow",
1916 | "sexual",
1917 | "girls",
1918 | "families",
1919 | "fun",
1920 | "live",
1921 | "social",
1922 | "adults",
1923 | "proud",
1924 | "female",
1925 | "teen",
1926 | "later",
1927 | "know",
1928 | "dogs",
1929 | "away",
1930 | "emotional",
1931 | "adult",
1932 | "home",
1933 | "parent",
1934 | "birth",
1935 | "childhood",
1936 | "mom",
1937 | "abuse",
1938 | "dad"
1939 | ],
1940 | [
1941 | "american",
1942 | "united_states",
1943 | "world",
1944 | "country",
1945 | "international",
1946 | "foreign",
1947 | "americans",
1948 | "countries",
1949 | "china",
1950 | "french",
1951 | "japan",
1952 | "europe",
1953 | "trade",
1954 | "economic",
1955 | "london",
1956 | "british",
1957 | "european",
1958 | "war",
1959 | "japanese",
1960 | "washington",
1961 | "france",
1962 | "german",
1963 | "germany",
1964 | "nation",
1965 | "national",
1966 | "britain",
1967 | "chinese",
1968 | "paris",
1969 | "america",
1970 | "italy",
1971 | "mexico",
1972 | "canada",
1973 | "global",
1974 | "england",
1975 | "italian",
1976 | "western",
1977 | "immigrants",
1978 | "domestic",
1979 | "south",
1980 | "african",
1981 | "nations",
1982 | "brazil",
1983 | "spain",
1984 | "region",
1985 | "central",
1986 | "asian",
1987 | "india",
1988 | "spanish",
1989 | "cultural",
1990 | "hong_kong"
1991 | ],
1992 | [
1993 | "president",
1994 | "general",
1995 | "chief",
1996 | "secretary",
1997 | "office",
1998 | "news",
1999 | "support",
2000 | "vice",
2001 | "chairman",
2002 | "staff",
2003 | "executive",
2004 | "senior",
2005 | "minister",
2006 | "prime",
2007 | "press",
2008 | "conference",
2009 | "deputy",
2010 | "post",
2011 | "reporters",
2012 | "newspaper",
2013 | "appointed",
2014 | "media",
2015 | "independent",
2016 | "colleagues",
2017 | "reporter",
2018 | "relations",
2019 | "resigned",
2020 | "adviser",
2021 | "criticism",
2022 | "interview",
2023 | "newspapers",
2024 | "influence",
2025 | "leadership",
2026 | "political",
2027 | "appointment",
2028 | "resignation",
2029 | "affairs",
2030 | "ambassador",
2031 | "dean",
2032 | "force",
2033 | "articles",
2034 | "counsel",
2035 | "publicly",
2036 | "cabinet",
2037 | "corruption",
2038 | "boss",
2039 | "weeks",
2040 | "successor",
2041 | "dismissed",
2042 | "succeed"
2043 | ],
2044 | [
2045 | "building",
2046 | "room",
2047 | "old",
2048 | "year",
2049 | "market",
2050 | "weeks",
2051 | "taxes",
2052 | "bedroom",
2053 | "listed",
2054 | "house",
2055 | "space",
2056 | "square",
2057 | "broker",
2058 | "area",
2059 | "floors",
2060 | "estate",
2061 | "kitchen",
2062 | "lot",
2063 | "buildings",
2064 | "million",
2065 | "floor",
2066 | "bath",
2067 | "foot",
2068 | "apartment",
2069 | "houses",
2070 | "property",
2071 | "street",
2072 | "office",
2073 | "number",
2074 | "real",
2075 | "feet",
2076 | "basement",
2077 | "acre",
2078 | "car",
2079 | "rent",
2080 | "neighborhood",
2081 | "units",
2082 | "garage",
2083 | "city",
2084 | "maintenance",
2085 | "dining",
2086 | "apartments",
2087 | "fireplace",
2088 | "walls",
2089 | "project",
2090 | "properties",
2091 | "roof",
2092 | "windows",
2093 | "brick",
2094 | "construction"
2095 | ],
2096 | [
2097 | "world",
2098 | "work",
2099 | "man",
2100 | "way",
2101 | "history",
2102 | "sense",
2103 | "men",
2104 | "life",
2105 | "time",
2106 | "play",
2107 | "people",
2108 | "series",
2109 | "best",
2110 | "mind",
2111 | "america",
2112 | "different",
2113 | "self",
2114 | "moment",
2115 | "kind",
2116 | "voice",
2117 | "role",
2118 | "course",
2119 | "audience",
2120 | "things",
2121 | "feeling",
2122 | "society",
2123 | "stars",
2124 | "good",
2125 | "light",
2126 | "experience",
2127 | "sound",
2128 | "young",
2129 | "culture",
2130 | "love",
2131 | "feel",
2132 | "god",
2133 | "rest",
2134 | "shows",
2135 | "live",
2136 | "male",
2137 | "body",
2138 | "finally",
2139 | "middle",
2140 | "nature",
2141 | "image",
2142 | "tradition",
2143 | "times",
2144 | "behavior",
2145 | "respect",
2146 | "act"
2147 | ],
2148 | [
2149 | "film",
2150 | "movie",
2151 | "story",
2152 | "films",
2153 | "directed",
2154 | "movies",
2155 | "star",
2156 | "character",
2157 | "characters",
2158 | "drama",
2159 | "comedy",
2160 | "actor",
2161 | "tale",
2162 | "starring",
2163 | "cinema",
2164 | "actors",
2165 | "plays",
2166 | "hollywood",
2167 | "documentary",
2168 | "actress",
2169 | "romance",
2170 | "stories",
2171 | "scenes",
2172 | "novel",
2173 | "audiences",
2174 | "plot",
2175 | "adaptation",
2176 | "love",
2177 | "screen",
2178 | "novels",
2179 | "comic",
2180 | "feature",
2181 | "playwright",
2182 | "loves",
2183 | "stars",
2184 | "fantasy",
2185 | "funniest",
2186 | "world_",
2187 | "famous",
2188 | "memoir",
2189 | "onscreen",
2190 | "filmmaker",
2191 | "book",
2192 | "beautiful",
2193 | "screenplay",
2194 | "thriller",
2195 | "tales",
2196 | "miramax",
2197 | "writer",
2198 | "favorite"
2199 | ],
2200 | [
2201 | "company",
2202 | "business",
2203 | "companies",
2204 | "million",
2205 | "amp",
2206 | "industry",
2207 | "executive",
2208 | "chief",
2209 | "executives",
2210 | "sell",
2211 | "largest",
2212 | "stock",
2213 | "investment",
2214 | "based",
2215 | "chairman",
2216 | "deal",
2217 | "advertising",
2218 | "sales",
2219 | "billion",
2220 | "firm",
2221 | "offer",
2222 | "unit",
2223 | "marketing",
2224 | "management",
2225 | "shares",
2226 | "corporate",
2227 | "operating",
2228 | "market",
2229 | "products",
2230 | "financial",
2231 | "sold",
2232 | "analyst",
2233 | "operations",
2234 | "buy",
2235 | "yesterday",
2236 | "analysts",
2237 | "businesses",
2238 | "investors",
2239 | "buying",
2240 | "selling",
2241 | "bid",
2242 | "venture",
2243 | "services",
2244 | "owned",
2245 | "subsidiary",
2246 | "division",
2247 | "merger",
2248 | "maker",
2249 | "customers",
2250 | "product"
2251 | ],
2252 | [
2253 | "like",
2254 | "making",
2255 | "important",
2256 | "based",
2257 | "strong",
2258 | "including",
2259 | "recent",
2260 | "high",
2261 | "remains",
2262 | "short",
2263 | "popular",
2264 | "example",
2265 | "largely",
2266 | "includes",
2267 | "certain",
2268 | "success",
2269 | "events",
2270 | "order",
2271 | "consider",
2272 | "create",
2273 | "creating",
2274 | "unusual",
2275 | "standards",
2276 | "figures",
2277 | "latest",
2278 | "role",
2279 | "standard",
2280 | "include",
2281 | "addition",
2282 | "similar",
2283 | "key",
2284 | "expensive",
2285 | "present",
2286 | "successful",
2287 | "increasingly",
2288 | "fact",
2289 | "hopes",
2290 | "produce",
2291 | "american",
2292 | "somewhat",
2293 | "makes",
2294 | "interesting",
2295 | "basic",
2296 | "holds",
2297 | "indian",
2298 | "current",
2299 | "expect",
2300 | "traditional",
2301 | "despite",
2302 | "continues"
2303 | ],
2304 | [
2305 | "think",
2306 | "people",
2307 | "want",
2308 | "going",
2309 | "says",
2310 | "know",
2311 | "good",
2312 | "way",
2313 | "right",
2314 | "time",
2315 | "job",
2316 | "lot",
2317 | "things",
2318 | "better",
2319 | "thing",
2320 | "getting",
2321 | "got",
2322 | "big",
2323 | "talk",
2324 | "help",
2325 | "bad",
2326 | "trying",
2327 | "come",
2328 | "need",
2329 | "problem",
2330 | "feel",
2331 | "having",
2332 | "try",
2333 | "wants",
2334 | "ask",
2335 | "work",
2336 | "idea",
2337 | "chance",
2338 | "kind",
2339 | "happen",
2340 | "question",
2341 | "look",
2342 | "reason",
2343 | "opportunity",
2344 | "sure",
2345 | "guy",
2346 | "point",
2347 | "wrong",
2348 | "deal",
2349 | "leave",
2350 | "guys",
2351 | "thinking",
2352 | "pick",
2353 | "money",
2354 | "tell"
2355 | ],
2356 | [
2357 | "power",
2358 | "number",
2359 | "control",
2360 | "according",
2361 | "increase",
2362 | "large",
2363 | "likely",
2364 | "result",
2365 | "system",
2366 | "use",
2367 | "pressure",
2368 | "growing",
2369 | "recent",
2370 | "experts",
2371 | "safety",
2372 | "far",
2373 | "nearly",
2374 | "highly",
2375 | "force",
2376 | "hundreds",
2377 | "ability",
2378 | "people",
2379 | "particularly",
2380 | "change",
2381 | "possible",
2382 | "problems",
2383 | "greater",
2384 | "challenge",
2385 | "despite",
2386 | "given",
2387 | "effort",
2388 | "limited",
2389 | "improve",
2390 | "higher",
2391 | "significant",
2392 | "increased",
2393 | "largest",
2394 | "increasing",
2395 | "status",
2396 | "important",
2397 | "relatively",
2398 | "critical",
2399 | "risk",
2400 | "thousands",
2401 | "material",
2402 | "level",
2403 | "found",
2404 | "changes",
2405 | "efforts",
2406 | "begun"
2407 | ],
2408 | [
2409 | "came",
2410 | "took",
2411 | "went",
2412 | "told",
2413 | "asked",
2414 | "got",
2415 | "started",
2416 | "thought",
2417 | "felt",
2418 | "wanted",
2419 | "left",
2420 | "little",
2421 | "began",
2422 | "knew",
2423 | "long",
2424 | "turned",
2425 | "tried",
2426 | "found",
2427 | "know",
2428 | "saw",
2429 | "called",
2430 | "come",
2431 | "worked",
2432 | "heard",
2433 | "kept",
2434 | "gave",
2435 | "met",
2436 | "going",
2437 | "learned",
2438 | "looked",
2439 | "hard",
2440 | "spent",
2441 | "happened",
2442 | "decided",
2443 | "hit",
2444 | "seen",
2445 | "brought",
2446 | "moved",
2447 | "soon",
2448 | "working",
2449 | "personal",
2450 | "right",
2451 | "stopped",
2452 | "sat",
2453 | "old",
2454 | "ago",
2455 | "returned",
2456 | "sitting",
2457 | "gone",
2458 | "recalled"
2459 | ],
2460 | [
2461 | "miles",
2462 | "air",
2463 | "airport",
2464 | "north",
2465 | "traffic",
2466 | "travel",
2467 | "road",
2468 | "trip",
2469 | "flight",
2470 | "mile",
2471 | "plane",
2472 | "bus",
2473 | "boat",
2474 | "train",
2475 | "village",
2476 | "fly",
2477 | "nearby",
2478 | "passengers",
2479 | "island",
2480 | "near",
2481 | "beach",
2482 | "passenger",
2483 | "fare",
2484 | "south",
2485 | "station",
2486 | "travelers",
2487 | "land",
2488 | "highway",
2489 | "buses",
2490 | "flights",
2491 | "parking",
2492 | "airline",
2493 | "coast",
2494 | "roads",
2495 | "trips",
2496 | "crew",
2497 | "eastern",
2498 | "ski",
2499 | "jet",
2500 | "routes",
2501 | "crash",
2502 | "northern",
2503 | "ticket",
2504 | "lake",
2505 | "beaches",
2506 | "trains",
2507 | "sea",
2508 | "navy",
2509 | "resort",
2510 | "shore"
2511 | ],
2512 | [
2513 | "state",
2514 | "officials",
2515 | "report",
2516 | "california",
2517 | "evidence",
2518 | "policy",
2519 | "states",
2520 | "use",
2521 | "cases",
2522 | "found",
2523 | "problem",
2524 | "action",
2525 | "issue",
2526 | "failed",
2527 | "illegal",
2528 | "required",
2529 | "require",
2530 | "involved",
2531 | "review",
2532 | "effort",
2533 | "question",
2534 | "records",
2535 | "agents",
2536 | "case",
2537 | "problems",
2538 | "matter",
2539 | "prevent",
2540 | "study",
2541 | "enforcement",
2542 | "seeking",
2543 | "process",
2544 | "calls",
2545 | "penalty",
2546 | "appeal",
2547 | "interest",
2548 | "ban",
2549 | "complaints",
2550 | "survey",
2551 | "questions",
2552 | "opposed",
2553 | "documents",
2554 | "federal",
2555 | "search",
2556 | "test",
2557 | "critics",
2558 | "determine",
2559 | "response",
2560 | "avoid",
2561 | "delay",
2562 | "measure"
2563 | ],
2564 | [
2565 | "father",
2566 | "late",
2567 | "wife",
2568 | "mother",
2569 | "daughter",
2570 | "husband",
2571 | "beloved",
2572 | "son",
2573 | "loving",
2574 | "amp",
2575 | "devoted",
2576 | "graduated",
2577 | "sister",
2578 | "family",
2579 | "married",
2580 | "survived",
2581 | "memorial",
2582 | "funeral",
2583 | "brother",
2584 | "law",
2585 | "passing",
2586 | "services",
2587 | "grandchildren",
2588 | "grandmother",
2589 | "service",
2590 | "grandfather",
2591 | "memory",
2592 | "ceremony",
2593 | "contributions",
2594 | "january",
2595 | "september",
2596 | "august",
2597 | "cherished",
2598 | "dear",
2599 | "february",
2600 | "degree",
2601 | "flowers",
2602 | "lieu",
2603 | "condolences",
2604 | "bridegroom",
2605 | "bride",
2606 | "michael",
2607 | "new_york",
2608 | "sympathy",
2609 | "died",
2610 | "wednesday",
2611 | "donations",
2612 | "monday",
2613 | "friday",
2614 | "performed"
2615 | ]
2616 | ],
2617 | "c_npmi_10_full_all": [
2618 | 0.14624615320916104,
2619 | 0.1842349463761919,
2620 | 0.09397097956875038,
2621 | 0.02224374883885288,
2622 | 0.14196491443014664,
2623 | 0.2370141384022756,
2624 | 0.03293073853312596,
2625 | 0.1962455412333827,
2626 | 0.1915341149418706,
2627 | 0.11818618548977458,
2628 | 0.118088046901936,
2629 | 0.014622927252097677,
2630 | 0.060262066918601497,
2631 | 0.19586991688606797,
2632 | 0.21550434513063996,
2633 | 0.1304279440162075,
2634 | 0.067141489159369,
2635 | 0.005676347127949025,
2636 | 0.0399443341151241,
2637 | 0.027394469703840317,
2638 | 0.035940728231458,
2639 | 0.13207218718391886,
2640 | 0.1771557992145982,
2641 | 0.0028646926979995825,
2642 | 0.1583017349711412,
2643 | 0.035186989711695024,
2644 | 0.126649067889828,
2645 | 0.03554250799327234,
2646 | 0.24728769165095757,
2647 | 0.007093446130125744,
2648 | 0.015344450189039873,
2649 | 0.10515918827441718,
2650 | 0.1882183122717962,
2651 | 0.14759851831626747,
2652 | 0.5031460615129256,
2653 | 0.1563020964426604,
2654 | 0.1088203777913382,
2655 | 0.08715242376128987,
2656 | 0.10371770006655577,
2657 | 0.1560688706902293,
2658 | 0.01920638899674475,
2659 | 0.15620943194110215,
2660 | 0.10602635865005294,
2661 | -0.009920778876769423,
2662 | 0.08536125795429042,
2663 | 0.032156055229606545,
2664 | 0.07240444289452569,
2665 | 0.1347891185211831,
2666 | 0.04646465360805376,
2667 | 0.2776205766334038
2668 | ],
2669 | "path": "outputs/full-mindf_power_law-maxdf_0.9/nytimes/k-50/etm/lr_0.02-reg_1.2e-06-epochs_1000-anneal_lr_0/11235"
2670 | },
2671 | "wikitext": {
2672 | "c_npmi_10_full": 0.11328744378992832,
2673 | "c_npmi_10_full_sd": 0.06794966942260233,
2674 | "tu": 0.94,
2675 | "to": 0.030612244897959183,
2676 | "overlaps": 0,
2677 | "anneal_lr": 0,
2678 | "data_path": "/workspace/topic-preprocessing/data/wikitext/processed/full-mindf_power_law-maxdf_0.9/etm",
2679 | "epochs": 1000,
2680 | "lr": 0.001,
2681 | "seed": 42,
2682 | "wdecay": 1.2e-05,
2683 | "input_dir": "wikitext",
2684 | "topics": [
2685 | [
2686 | "new",
2687 | "use",
2688 | "development",
2689 | "world",
2690 | "design",
2691 | "created",
2692 | "system",
2693 | "developed",
2694 | "power",
2695 | "based",
2696 | "designed",
2697 | "produced",
2698 | "original",
2699 | "production",
2700 | "available",
2701 | "different",
2702 | "create",
2703 | "version",
2704 | "additional",
2705 | "single",
2706 | "effects",
2707 | "model",
2708 | "main",
2709 | "introduced",
2710 | "including",
2711 | "included",
2712 | "quality",
2713 | "special",
2714 | "test",
2715 | "changes",
2716 | "space",
2717 | "include",
2718 | "standard",
2719 | "project",
2720 | "added",
2721 | "energy",
2722 | "elements",
2723 | "concept",
2724 | "intended",
2725 | "value",
2726 | "uses",
2727 | "instead",
2728 | "technology",
2729 | "provided",
2730 | "produce",
2731 | "type",
2732 | "creating",
2733 | "introduction",
2734 | "work",
2735 | "complete"
2736 | ],
2737 | [
2738 | "american",
2739 | "united_states",
2740 | "new_york",
2741 | "washington",
2742 | "california",
2743 | "texas",
2744 | "americans",
2745 | "virginia",
2746 | "chicago",
2747 | "boston",
2748 | "canadian",
2749 | "smith",
2750 | "canada",
2751 | "florida",
2752 | "new_york_city",
2753 | "michigan",
2754 | "north_carolina",
2755 | "ohio",
2756 | "johnson",
2757 | "los_angeles",
2758 | "kentucky",
2759 | "grant",
2760 | "philadelphia",
2761 | "america",
2762 | "davis",
2763 | "maryland",
2764 | "massachusetts",
2765 | "illinois",
2766 | "indiana",
2767 | "pennsylvania",
2768 | "new_jersey",
2769 | "new_orleans",
2770 | "south_carolina",
2771 | "mexican",
2772 | "toronto",
2773 | "african",
2774 | "houston",
2775 | "tennessee",
2776 | "minnesota",
2777 | "san_francisco",
2778 | "taylor",
2779 | "lee",
2780 | "cleveland",
2781 | "wisconsin",
2782 | "georgia",
2783 | "atlanta",
2784 | "missouri",
2785 | "adams",
2786 | "colorado",
2787 | "wilson"
2788 | ],
2789 | [
2790 | "work",
2791 | "published",
2792 | "book",
2793 | "years",
2794 | "described",
2795 | "time",
2796 | "group",
2797 | "style",
2798 | "early",
2799 | "people",
2800 | "new",
2801 | "works",
2802 | "included",
2803 | "year",
2804 | "english",
2805 | "including",
2806 | "noted",
2807 | "history",
2808 | "period",
2809 | "popular",
2810 | "wrote",
2811 | "late",
2812 | "american",
2813 | "written",
2814 | "old",
2815 | "movement",
2816 | "notes",
2817 | "different",
2818 | "edition",
2819 | "list",
2820 | "social",
2821 | "original",
2822 | "culture",
2823 | "traditional",
2824 | "based",
2825 | "books",
2826 | "influenced",
2827 | "collection",
2828 | "material",
2829 | "1970s",
2830 | "recent",
2831 | "composed",
2832 | "influence",
2833 | "1980s",
2834 | "issue",
2835 | "young",
2836 | "1960s",
2837 | "middle",
2838 | "parts",
2839 | "language"
2840 | ],
2841 | [
2842 | "australia",
2843 | "world",
2844 | "country",
2845 | "international",
2846 | "united_states",
2847 | "australian",
2848 | "india",
2849 | "countries",
2850 | "united_kingdom",
2851 | "japan",
2852 | "indian",
2853 | "europe",
2854 | "european",
2855 | "china",
2856 | "new_zealand",
2857 | "canada",
2858 | "spain",
2859 | "foreign",
2860 | "average",
2861 | "worldwide",
2862 | "highest",
2863 | "total",
2864 | "africa",
2865 | "singapore",
2866 | "american",
2867 | "sweden",
2868 | "domestic",
2869 | "sydney",
2870 | "france",
2871 | "south_africa",
2872 | "brazil",
2873 | "dutch",
2874 | "argentina",
2875 | "vietnam",
2876 | "norway",
2877 | "britain",
2878 | "peak",
2879 | "asian",
2880 | "philippines",
2881 | "melbourne",
2882 | "malaysia",
2883 | "russia",
2884 | "belgium",
2885 | "netherlands",
2886 | "nations",
2887 | "germany",
2888 | "colonial",
2889 | "indonesian",
2890 | "ireland",
2891 | "pakistan"
2892 | ],
2893 | [
2894 | "war",
2895 | "april",
2896 | "march",
2897 | "september",
2898 | "june",
2899 | "december",
2900 | "august",
2901 | "october",
2902 | "january",
2903 | "july",
2904 | "began",
2905 | "november",
2906 | "later",
2907 | "following",
2908 | "french",
2909 | "british",
2910 | "february",
2911 | "year",
2912 | "new",
2913 | "including",
2914 | "early",
2915 | "german",
2916 | "members",
2917 | "world",
2918 | "years",
2919 | "training",
2920 | "number",
2921 | "late",
2922 | "joined",
2923 | "days",
2924 | "group",
2925 | "included",
2926 | "months",
2927 | "weeks",
2928 | "continued",
2929 | "replaced",
2930 | "served",
2931 | "france",
2932 | "support",
2933 | "returned",
2934 | "spanish",
2935 | "major",
2936 | "received",
2937 | "end",
2938 | "led",
2939 | "anti",
2940 | "month",
2941 | "service",
2942 | "remained",
2943 | "staff"
2944 | ],
2945 | [
2946 | "race",
2947 | "horses",
2948 | "horse",
2949 | "oxford",
2950 | "cambridge",
2951 | "canal",
2952 | "dog",
2953 | "estate",
2954 | "boat",
2955 | "colony",
2956 | "trade",
2957 | "manchester",
2958 | "breed",
2959 | "races",
2960 | "racing",
2961 | "parish",
2962 | "riders",
2963 | "slaves",
2964 | "slave",
2965 | "bristol",
2966 | "goods",
2967 | "crew",
2968 | "cotton",
2969 | "dogs",
2970 | "trading",
2971 | "riding",
2972 | "pounds",
2973 | "mill",
2974 | "breeding",
2975 | "rowing",
2976 | "blues",
2977 | "navigation",
2978 | "lancashire",
2979 | "breeds",
2980 | "cattle",
2981 | "lengths",
2982 | "farm",
2983 | "liverpool",
2984 | "hunting",
2985 | "rider",
2986 | "boats",
2987 | "bath",
2988 | "draft",
2989 | "mills",
2990 | "oldham",
2991 | "mare",
2992 | "labour",
2993 | "merchant",
2994 | "bridgwater",
2995 | "wool"
2996 | ],
2997 | [
2998 | "formula",
2999 | "chemical",
3000 | "nuclear",
3001 | "applications",
3002 | "element",
3003 | "hydrogen",
3004 | "atomic",
3005 | "uranium",
3006 | "oxygen",
3007 | "gas",
3008 | "acid",
3009 | "carbon",
3010 | "chemistry",
3011 | "stable",
3012 | "electrical",
3013 | "radioactive",
3014 | "atom",
3015 | "properties",
3016 | "plutonium",
3017 | "method",
3018 | "forms",
3019 | "solutions",
3020 | "calibration",
3021 | "energy",
3022 | "particles",
3023 | "thermal",
3024 | "metal",
3025 | "nitrogen",
3026 | "liquid",
3027 | "vapor",
3028 | "chain",
3029 | "physics",
3030 | "compounds",
3031 | "type",
3032 | "atoms",
3033 | "solution",
3034 | "electron",
3035 | "linear",
3036 | "ions",
3037 | "helium",
3038 | "equation",
3039 | "beta",
3040 | "particle",
3041 | "organic",
3042 | "components",
3043 | "lithium",
3044 | "synthesis",
3045 | "structure",
3046 | "metals",
3047 | "protons"
3048 | ],
3049 | [
3050 | "stated",
3051 | "considered",
3052 | "went",
3053 | "chief",
3054 | "said",
3055 | "took",
3056 | "worked",
3057 | "called",
3058 | "decided",
3059 | "continued",
3060 | "met",
3061 | "moved",
3062 | "claimed",
3063 | "held",
3064 | "asked",
3065 | "gave",
3066 | "agreed",
3067 | "according",
3068 | "head",
3069 | "believed",
3070 | "received",
3071 | "told",
3072 | "returned",
3073 | "appointed",
3074 | "passed",
3075 | "refused",
3076 | "came",
3077 | "leader",
3078 | "director",
3079 | "ran",
3080 | "accepted",
3081 | "placed",
3082 | "turned",
3083 | "known",
3084 | "brought",
3085 | "issued",
3086 | "meeting",
3087 | "presented",
3088 | "remained",
3089 | "working",
3090 | "assistant",
3091 | "offered",
3092 | "reported",
3093 | "concluded",
3094 | "declared",
3095 | "criticized",
3096 | "announced",
3097 | "referred",
3098 | "spent",
3099 | "noted"
3100 | ],
3101 | [
3102 | "known",
3103 | "large",
3104 | "century",
3105 | "found",
3106 | "small",
3107 | "number",
3108 | "called",
3109 | "high",
3110 | "including",
3111 | "long",
3112 | "include",
3113 | "form",
3114 | "common",
3115 | "similar",
3116 | "like",
3117 | "largest",
3118 | "lower",
3119 | "considered",
3120 | "low",
3121 | "usually",
3122 | "larger",
3123 | "size",
3124 | "modern",
3125 | "range",
3126 | "based",
3127 | "associated",
3128 | "estimated",
3129 | "higher",
3130 | "important",
3131 | "significant",
3132 | "according",
3133 | "smaller",
3134 | "increased",
3135 | "central",
3136 | "generally",
3137 | "likely",
3138 | "upper",
3139 | "level",
3140 | "rate",
3141 | "named",
3142 | "study",
3143 | "relatively",
3144 | "19th",
3145 | "greater",
3146 | "history",
3147 | "growth",
3148 | "reported",
3149 | "names",
3150 | "suggested",
3151 | "today"
3152 | ],
3153 | [
3154 | "great",
3155 | "early",
3156 | "major",
3157 | "success",
3158 | "successful",
3159 | "hand",
3160 | "best",
3161 | "general",
3162 | "strong",
3163 | "earlier",
3164 | "particularly",
3165 | "key",
3166 | "important",
3167 | "difficult",
3168 | "highly",
3169 | "head",
3170 | "despite",
3171 | "largely",
3172 | "better",
3173 | "similar",
3174 | "minor",
3175 | "numerous",
3176 | "entire",
3177 | "poor",
3178 | "popular",
3179 | "especially",
3180 | "initial",
3181 | "powerful",
3182 | "complete",
3183 | "generally",
3184 | "influence",
3185 | "famous",
3186 | "greatest",
3187 | "simply",
3188 | "heavily",
3189 | "significant",
3190 | "popularity",
3191 | "notable",
3192 | "increasingly",
3193 | "immediately",
3194 | "finally",
3195 | "extremely",
3196 | "hard",
3197 | "attention",
3198 | "entirely",
3199 | "completely",
3200 | "double",
3201 | "considerable",
3202 | "easily",
3203 | "prominent"
3204 | ],
3205 | [
3206 | "relationship",
3207 | "tells",
3208 | "begins",
3209 | "leaves",
3210 | "goes",
3211 | "tries",
3212 | "finds",
3213 | "jack",
3214 | "takes",
3215 | "meets",
3216 | "gets",
3217 | "wants",
3218 | "charlie",
3219 | "tell",
3220 | "storyline",
3221 | "young",
3222 | "restaurant",
3223 | "learns",
3224 | "asks",
3225 | "tony",
3226 | "paul",
3227 | "makes",
3228 | "breaks",
3229 | "kills",
3230 | "comes",
3231 | "arrives",
3232 | "chris",
3233 | "blood",
3234 | "dies",
3235 | "meat",
3236 | "starts",
3237 | "falls",
3238 | "believes",
3239 | "confesses",
3240 | "soap",
3241 | "killer",
3242 | "continues",
3243 | "donna",
3244 | "girlfriend",
3245 | "eve",
3246 | "returns",
3247 | "friendship",
3248 | "brings",
3249 | "adam",
3250 | "ben",
3251 | "feels",
3252 | "gives",
3253 | "helps",
3254 | "ends",
3255 | "discovers"
3256 | ],
3257 | [
3258 | "music",
3259 | "musical",
3260 | "stage",
3261 | "piano",
3262 | "performed",
3263 | "played",
3264 | "playing",
3265 | "play",
3266 | "performance",
3267 | "opera",
3268 | "harrison",
3269 | "concert",
3270 | "theatre",
3271 | "composer",
3272 | "recordings",
3273 | "piece",
3274 | "concerts",
3275 | "performances",
3276 | "solo",
3277 | "orchestra",
3278 | "performing",
3279 | "instruments",
3280 | "session",
3281 | "sang",
3282 | "musicians",
3283 | "instrumental",
3284 | "sessions",
3285 | "version",
3286 | "singing",
3287 | "recording",
3288 | "blues",
3289 | "bach",
3290 | "mccartney",
3291 | "choir",
3292 | "beatles",
3293 | "melody",
3294 | "folk",
3295 | "organ",
3296 | "backing",
3297 | "versions",
3298 | "compositions",
3299 | "touring",
3300 | "sung",
3301 | "jazz",
3302 | "symphony",
3303 | "dylan",
3304 | "audience",
3305 | "sullivan",
3306 | "orchestral",
3307 | "guitar"
3308 | ],
3309 | [
3310 | "song",
3311 | "album",
3312 | "band",
3313 | "music",
3314 | "songs",
3315 | "number",
3316 | "single",
3317 | "released",
3318 | "chart",
3319 | "video",
3320 | "track",
3321 | "recording",
3322 | "rock",
3323 | "release",
3324 | "studio",
3325 | "tour",
3326 | "performed",
3327 | "albums",
3328 | "vocals",
3329 | "recorded",
3330 | "sound",
3331 | "lyrics",
3332 | "singles",
3333 | "singer",
3334 | "tracks",
3335 | "dance",
3336 | "record",
3337 | "copies",
3338 | "billboard",
3339 | "madonna",
3340 | "charts",
3341 | "live",
3342 | "pop",
3343 | "hot",
3344 | "debuted",
3345 | "performance",
3346 | "produced",
3347 | "vocal",
3348 | "week",
3349 | "guitar",
3350 | "best",
3351 | "background",
3352 | "peaked",
3353 | "artist",
3354 | "carey",
3355 | "digital",
3356 | "featured",
3357 | "platinum",
3358 | "reached",
3359 | "ballad"
3360 | ],
3361 | [
3362 | "film",
3363 | "series",
3364 | "character",
3365 | "season",
3366 | "production",
3367 | "role",
3368 | "characters",
3369 | "best",
3370 | "scene",
3371 | "released",
3372 | "cast",
3373 | "films",
3374 | "director",
3375 | "release",
3376 | "directed",
3377 | "scenes",
3378 | "star",
3379 | "played",
3380 | "script",
3381 | "version",
3382 | "original",
3383 | "filming",
3384 | "story",
3385 | "received",
3386 | "actor",
3387 | "producer",
3388 | "office",
3389 | "man",
3390 | "box",
3391 | "plot",
3392 | "featured",
3393 | "set",
3394 | "performance",
3395 | "actors",
3396 | "stars",
3397 | "movie",
3398 | "appeared",
3399 | "fans",
3400 | "originally",
3401 | "filmed",
3402 | "appearance",
3403 | "later",
3404 | "crew",
3405 | "title",
3406 | "week",
3407 | "voice",
3408 | "final",
3409 | "roles",
3410 | "reviews",
3411 | "based"
3412 | ],
3413 | [
3414 | "episode",
3415 | "episodes",
3416 | "television",
3417 | "viewers",
3418 | "aired",
3419 | "broadcast",
3420 | "fox",
3421 | "guest",
3422 | "watched",
3423 | "simpsons",
3424 | "ratings",
3425 | "rating",
3426 | "homer",
3427 | "michael",
3428 | "mulder",
3429 | "marge",
3430 | "nbc",
3431 | "scully",
3432 | "network",
3433 | "bart",
3434 | "writers",
3435 | "nielsen",
3436 | "lisa",
3437 | "creator",
3438 | "finale",
3439 | "peter",
3440 | "files",
3441 | "writer",
3442 | "glee",
3443 | "airing",
3444 | "plot",
3445 | "shows",
3446 | "watching",
3447 | "jim",
3448 | "rated",
3449 | "andy",
3450 | "x-files",
3451 | "viewing",
3452 | "dwight",
3453 | "demographic",
3454 | "households",
3455 | "comedy",
3456 | "tells",
3457 | "carter",
3458 | "brian",
3459 | "references",
3460 | "reality",
3461 | "viewed",
3462 | "gets",
3463 | "recurring"
3464 | ],
3465 | [
3466 | "white",
3467 | "black",
3468 | "red",
3469 | "brown",
3470 | "blue",
3471 | "dark",
3472 | "long",
3473 | "metal",
3474 | "yellow",
3475 | "green",
3476 | "small",
3477 | "short",
3478 | "like",
3479 | "color",
3480 | "length",
3481 | "slightly",
3482 | "covered",
3483 | "shaped",
3484 | "deep",
3485 | "features",
3486 | "cap",
3487 | "light",
3488 | "wood",
3489 | "shape",
3490 | "appear",
3491 | "near",
3492 | "consists",
3493 | "bands",
3494 | "north_america",
3495 | "grey",
3496 | "thick",
3497 | "europe",
3498 | "surface",
3499 | "leaves",
3500 | "wide",
3501 | "single",
3502 | "contain",
3503 | "contains",
3504 | "fruit",
3505 | "edge",
3506 | "dry",
3507 | "orange",
3508 | "smooth",
3509 | "gray",
3510 | "thin",
3511 | "outer",
3512 | "colour",
3513 | "inner",
3514 | "narrow",
3515 | "sides"
3516 | ],
3517 | [
3518 | "aircraft",
3519 | "air",
3520 | "flight",
3521 | "car",
3522 | "service",
3523 | "engine",
3524 | "flying",
3525 | "wing",
3526 | "train",
3527 | "cars",
3528 | "passenger",
3529 | "fighter",
3530 | "fuel",
3531 | "vehicles",
3532 | "bomb",
3533 | "operations",
3534 | "bomber",
3535 | "track",
3536 | "pilots",
3537 | "radar",
3538 | "trains",
3539 | "flights",
3540 | "operation",
3541 | "passengers",
3542 | "speed",
3543 | "units",
3544 | "carrier",
3545 | "operate",
3546 | "station",
3547 | "ride",
3548 | "carried",
3549 | "operating",
3550 | "unit",
3551 | "operational",
3552 | "capacity",
3553 | "launch",
3554 | "pilot",
3555 | "engines",
3556 | "services",
3557 | "range",
3558 | "aviation",
3559 | "weight",
3560 | "lift",
3561 | "planes",
3562 | "airport",
3563 | "capability",
3564 | "locomotives",
3565 | "dive",
3566 | "flown",
3567 | "operated"
3568 | ],
3569 | [
3570 | "won",
3571 | "team",
3572 | "year",
3573 | "record",
3574 | "second",
3575 | "finished",
3576 | "final",
3577 | "points",
3578 | "round",
3579 | "world",
3580 | "stage",
3581 | "winning",
3582 | "lead",
3583 | "ranked",
3584 | "coach",
3585 | "best",
3586 | "win",
3587 | "national",
3588 | "teams",
3589 | "named",
3590 | "led",
3591 | "event",
3592 | "place",
3593 | "career",
3594 | "seconds",
3595 | "history",
3596 | "tournament",
3597 | "competition",
3598 | "captain",
3599 | "injury",
3600 | "fourth",
3601 | "medal",
3602 | "overall",
3603 | "tour",
3604 | "summer",
3605 | "gold",
3606 | "olympics",
3607 | "sixth",
3608 | "earned",
3609 | "finish",
3610 | "sports",
3611 | "selected",
3612 | "winner",
3613 | "losing",
3614 | "championships",
3615 | "consecutive",
3616 | "debut",
3617 | "award",
3618 | "professional",
3619 | "events"
3620 | ],
3621 | [
3622 | "match",
3623 | "title",
3624 | "defeated",
3625 | "championship",
3626 | "event",
3627 | "ring",
3628 | "team",
3629 | "won",
3630 | "win",
3631 | "raw",
3632 | "time",
3633 | "champion",
3634 | "wrestling",
3635 | "triple",
3636 | "defeating",
3637 | "tag",
3638 | "wwe",
3639 | "face",
3640 | "angle",
3641 | "titles",
3642 | "matches",
3643 | "table",
3644 | "division",
3645 | "edge",
3646 | "tna",
3647 | "lost",
3648 | "rivalry",
3649 | "impact",
3650 | "defeat",
3651 | "following",
3652 | "retain",
3653 | "faced",
3654 | "kane",
3655 | "feud",
3656 | "brand",
3657 | "bout",
3658 | "eliminated",
3659 | "promotion",
3660 | "defended",
3661 | "opponent",
3662 | "wrestler",
3663 | "wwf",
3664 | "hardy",
3665 | "world",
3666 | "went",
3667 | "month",
3668 | "smackdown",
3669 | "wrestlers",
3670 | "attacked",
3671 | "heavyweight"
3672 | ],
3673 | [
3674 | "family",
3675 | "death",
3676 | "life",
3677 | "years",
3678 | "father",
3679 | "later",
3680 | "son",
3681 | "man",
3682 | "children",
3683 | "house",
3684 | "women",
3685 | "died",
3686 | "mother",
3687 | "year",
3688 | "wife",
3689 | "born",
3690 | "old",
3691 | "brother",
3692 | "married",
3693 | "daughter",
3694 | "child",
3695 | "woman",
3696 | "age",
3697 | "men",
3698 | "young",
3699 | "marriage",
3700 | "home",
3701 | "people",
3702 | "months",
3703 | "parents",
3704 | "friend",
3705 | "great",
3706 | "friends",
3707 | "husband",
3708 | "lived",
3709 | "sister",
3710 | "named",
3711 | "younger",
3712 | "birth",
3713 | "couple",
3714 | "hospital",
3715 | "relationship",
3716 | "brothers",
3717 | "care",
3718 | "member",
3719 | "sex",
3720 | "grand",
3721 | "living",
3722 | "leaving",
3723 | "help"
3724 | ],
3725 | [
3726 | "million",
3727 | "company",
3728 | "sold",
3729 | "business",
3730 | "percent",
3731 | "commercial",
3732 | "announced",
3733 | "market",
3734 | "industry",
3735 | "cost",
3736 | "project",
3737 | "management",
3738 | "billion",
3739 | "sales",
3740 | "companies",
3741 | "selling",
3742 | "rights",
3743 | "price",
3744 | "costs",
3745 | "board",
3746 | "sell",
3747 | "employees",
3748 | "program",
3749 | "working",
3750 | "channel",
3751 | "plans",
3752 | "firm",
3753 | "purchase",
3754 | "purchased",
3755 | "store",
3756 | "engineer",
3757 | "sale",
3758 | "workers",
3759 | "equipment",
3760 | "financial",
3761 | "computer",
3762 | "owned",
3763 | "markets",
3764 | "advertising",
3765 | "radio",
3766 | "launched",
3767 | "potential",
3768 | "products",
3769 | "owners",
3770 | "code",
3771 | "demand",
3772 | "machine",
3773 | "income",
3774 | "mobile",
3775 | "filed"
3776 | ],
3777 | [
3778 | "city",
3779 | "area",
3780 | "town",
3781 | "west",
3782 | "south",
3783 | "north",
3784 | "local",
3785 | "east",
3786 | "station",
3787 | "line",
3788 | "park",
3789 | "bridge",
3790 | "miles",
3791 | "located",
3792 | "construction",
3793 | "new",
3794 | "opened",
3795 | "northern",
3796 | "street",
3797 | "built",
3798 | "areas",
3799 | "southern",
3800 | "major",
3801 | "road",
3802 | "centre",
3803 | "river",
3804 | "railway",
3805 | "main",
3806 | "western",
3807 | "state",
3808 | "center",
3809 | "village",
3810 | "near",
3811 | "land",
3812 | "hill",
3813 | "services",
3814 | "years",
3815 | "border",
3816 | "eastern",
3817 | "district",
3818 | "county",
3819 | "year",
3820 | "day",
3821 | "ground",
3822 | "site",
3823 | "nearby",
3824 | "system",
3825 | "old",
3826 | "airport",
3827 | "stations"
3828 | ],
3829 | [
3830 | "killed",
3831 | "sent",
3832 | "attack",
3833 | "left",
3834 | "near",
3835 | "support",
3836 | "arrived",
3837 | "led",
3838 | "forced",
3839 | "soon",
3840 | "return",
3841 | "control",
3842 | "reached",
3843 | "able",
3844 | "caused",
3845 | "destroyed",
3846 | "plan",
3847 | "died",
3848 | "main",
3849 | "ground",
3850 | "established",
3851 | "remained",
3852 | "attacks",
3853 | "damage",
3854 | "attacked",
3855 | "supported",
3856 | "victory",
3857 | "attempted",
3858 | "suffered",
3859 | "returned",
3860 | "lost",
3861 | "attempt",
3862 | "landing",
3863 | "operation",
3864 | "brought",
3865 | "expedition",
3866 | "defeat",
3867 | "failed",
3868 | "immediately",
3869 | "captured",
3870 | "engaged",
3871 | "damaged",
3872 | "prepared",
3873 | "invasion",
3874 | "defeated",
3875 | "unable",
3876 | "prevent",
3877 | "arrival",
3878 | "shortly",
3879 | "capture"
3880 | ],
3881 | [
3882 | "king",
3883 | "england",
3884 | "london",
3885 | "english",
3886 | "lord",
3887 | "royal",
3888 | "scotland",
3889 | "queen",
3890 | "sir",
3891 | "edward",
3892 | "henry",
3893 | "duke",
3894 | "john",
3895 | "wales",
3896 | "prince",
3897 | "scottish",
3898 | "fort",
3899 | "irish",
3900 | "reign",
3901 | "george",
3902 | "historian",
3903 | "earl",
3904 | "william",
3905 | "crown",
3906 | "james",
3907 | "kent",
3908 | "norman",
3909 | "richard",
3910 | "catholic",
3911 | "britain",
3912 | "york",
3913 | "stephen",
3914 | "abbey",
3915 | "charles",
3916 | "pope",
3917 | "robert",
3918 | "mary",
3919 | "france",
3920 | "bishop",
3921 | "throne",
3922 | "thomas",
3923 | "granted",
3924 | "appointed",
3925 | "ireland",
3926 | "catherine",
3927 | "canterbury",
3928 | "monarch",
3929 | "anne",
3930 | "protestant",
3931 | "castle"
3932 | ],
3933 | [
3934 | "treatment",
3935 | "blood",
3936 | "brain",
3937 | "cell",
3938 | "disease",
3939 | "risk",
3940 | "patients",
3941 | "cells",
3942 | "symptoms",
3943 | "dna",
3944 | "protein",
3945 | "cancer",
3946 | "genes",
3947 | "virus",
3948 | "cases",
3949 | "causes",
3950 | "function",
3951 | "diseases",
3952 | "cause",
3953 | "body",
3954 | "related",
3955 | "gene",
3956 | "infection",
3957 | "drugs",
3958 | "activity",
3959 | "levels",
3960 | "people",
3961 | "effects",
3962 | "hiv",
3963 | "acute",
3964 | "diagnosis",
3965 | "disorders",
3966 | "genetic",
3967 | "tissue",
3968 | "liver",
3969 | "proteins",
3970 | "muscle",
3971 | "normal",
3972 | "specific",
3973 | "occurs",
3974 | "skin",
3975 | "form",
3976 | "occur",
3977 | "treatments",
3978 | "hemoglobin",
3979 | "induced",
3980 | "clinical",
3981 | "individuals",
3982 | "viruses",
3983 | "mechanisms"
3984 | ],
3985 | [
3986 | "route",
3987 | "road",
3988 | "highway",
3989 | "state",
3990 | "north",
3991 | "traffic",
3992 | "east",
3993 | "street",
3994 | "intersection",
3995 | "continues",
3996 | "south",
3997 | "interchange",
3998 | "avenue",
3999 | "section",
4000 | "mile",
4001 | "passes",
4002 | "freeway",
4003 | "lane",
4004 | "portion",
4005 | "runs",
4006 | "crosses",
4007 | "completed",
4008 | "heads",
4009 | "turns",
4010 | "downtown",
4011 | "designated",
4012 | "terminus",
4013 | "crossing",
4014 | "northeast",
4015 | "extended",
4016 | "interstate",
4017 | "west",
4018 | "access",
4019 | "highways",
4020 | "roadway",
4021 | "passing",
4022 | "exit",
4023 | "junction",
4024 | "alignment",
4025 | "northwest",
4026 | "begins",
4027 | "corridor",
4028 | "enters",
4029 | "lanes",
4030 | "past",
4031 | "parallel",
4032 | "end",
4033 | "designation",
4034 | "bypass",
4035 | "description"
4036 | ],
4037 | [
4038 | "ship",
4039 | "ships",
4040 | "guns",
4041 | "fleet",
4042 | "class",
4043 | "inch",
4044 | "tons",
4045 | "torpedo",
4046 | "naval",
4047 | "gun",
4048 | "long",
4049 | "knots",
4050 | "crew",
4051 | "turrets",
4052 | "fired",
4053 | "maximum",
4054 | "admiral",
4055 | "carried",
4056 | "armor",
4057 | "steam",
4058 | "vessels",
4059 | "cruisers",
4060 | "vessel",
4061 | "port",
4062 | "inches",
4063 | "battery",
4064 | "convoy",
4065 | "battleships",
4066 | "navy",
4067 | "service",
4068 | "cruiser",
4069 | "meters",
4070 | "armored",
4071 | "deck",
4072 | "aboard",
4073 | "built",
4074 | "battleship",
4075 | "hull",
4076 | "boats",
4077 | "armour",
4078 | "submarine",
4079 | "feet",
4080 | "flagship",
4081 | "laid",
4082 | "ton",
4083 | "turret",
4084 | "cruise",
4085 | "speed",
4086 | "ordered",
4087 | "boilers"
4088 | ],
4089 | [
4090 | "art",
4091 | "gold",
4092 | "act",
4093 | "silver",
4094 | "image",
4095 | "flag",
4096 | "painting",
4097 | "pieces",
4098 | "artist",
4099 | "artists",
4100 | "work",
4101 | "coins",
4102 | "figures",
4103 | "paintings",
4104 | "images",
4105 | "painted",
4106 | "arms",
4107 | "exhibition",
4108 | "piece",
4109 | "figure",
4110 | "eye",
4111 | "dress",
4112 | "depicted",
4113 | "fashion",
4114 | "ceremony",
4115 | "portrait",
4116 | "worn",
4117 | "design",
4118 | "presented",
4119 | "body",
4120 | "wearing",
4121 | "sculpture",
4122 | "wear",
4123 | "display",
4124 | "coin",
4125 | "hair",
4126 | "coat",
4127 | "eagle",
4128 | "mint",
4129 | "statue",
4130 | "showing",
4131 | "bronze",
4132 | "depicts",
4133 | "wore",
4134 | "colours",
4135 | "displayed",
4136 | "dollar",
4137 | "print",
4138 | "photographs",
4139 | "designs"
4140 | ],
4141 | [
4142 | "building",
4143 | "built",
4144 | "century",
4145 | "buildings",
4146 | "site",
4147 | "stone",
4148 | "hall",
4149 | "castle",
4150 | "tower",
4151 | "wall",
4152 | "walls",
4153 | "square",
4154 | "church",
4155 | "completed",
4156 | "house",
4157 | "floor",
4158 | "constructed",
4159 | "architecture",
4160 | "houses",
4161 | "listed",
4162 | "chapel",
4163 | "monument",
4164 | "construction",
4165 | "entrance",
4166 | "rooms",
4167 | "glass",
4168 | "museum",
4169 | "structure",
4170 | "iron",
4171 | "17th",
4172 | "window",
4173 | "remains",
4174 | "medieval",
4175 | "wooden",
4176 | "late",
4177 | "brick",
4178 | "restoration",
4179 | "architectural",
4180 | "19th",
4181 | "memorial",
4182 | "roof",
4183 | "stones",
4184 | "hotel",
4185 | "15th",
4186 | "14th",
4187 | "garden",
4188 | "palace",
4189 | "demolished",
4190 | "dates",
4191 | "steel"
4192 | ],
4193 | [
4194 | "public",
4195 | "money",
4196 | "signed",
4197 | "manager",
4198 | "future",
4199 | "self",
4200 | "local",
4201 | "contract",
4202 | "deal",
4203 | "letter",
4204 | "pay",
4205 | "paid",
4206 | "career",
4207 | "given",
4208 | "interest",
4209 | "health",
4210 | "help",
4211 | "press",
4212 | "decision",
4213 | "executive",
4214 | "private",
4215 | "media",
4216 | "letters",
4217 | "trade",
4218 | "real",
4219 | "attention",
4220 | "personal",
4221 | "forward",
4222 | "president",
4223 | "intended",
4224 | "owner",
4225 | "sign",
4226 | "financial",
4227 | "expressed",
4228 | "exchange",
4229 | "represented",
4230 | "chairman",
4231 | "independent",
4232 | "raise",
4233 | "acquired",
4234 | "agent",
4235 | "fellow",
4236 | "prime",
4237 | "agreement",
4238 | "fund",
4239 | "funds",
4240 | "appeal",
4241 | "founded",
4242 | "promotion",
4243 | "experience"
4244 | ],
4245 | [
4246 | "day",
4247 | "found",
4248 | "night",
4249 | "live",
4250 | "seen",
4251 | "set",
4252 | "return",
4253 | "fire",
4254 | "having",
4255 | "way",
4256 | "come",
4257 | "find",
4258 | "days",
4259 | "dead",
4260 | "party",
4261 | "leave",
4262 | "body",
4263 | "cover",
4264 | "hours",
4265 | "fact",
4266 | "taken",
4267 | "hit",
4268 | "fight",
4269 | "shot",
4270 | "escape",
4271 | "special",
4272 | "search",
4273 | "secret",
4274 | "away",
4275 | "true",
4276 | "stay",
4277 | "believe",
4278 | "follow",
4279 | "christmas",
4280 | "previous",
4281 | "claims",
4282 | "present",
4283 | "kill",
4284 | "heard",
4285 | "idea",
4286 | "reason",
4287 | "journey",
4288 | "past",
4289 | "mark",
4290 | "face",
4291 | "news",
4292 | "thought",
4293 | "morning",
4294 | "die",
4295 | "question"
4296 | ],
4297 | [
4298 | "military",
4299 | "soviet",
4300 | "army",
4301 | "emperor",
4302 | "hungarian",
4303 | "roman",
4304 | "muslim",
4305 | "empire",
4306 | "war",
4307 | "forces",
4308 | "polish",
4309 | "allies",
4310 | "greece",
4311 | "chinese",
4312 | "arab",
4313 | "armies",
4314 | "rule",
4315 | "treaty",
4316 | "wars",
4317 | "imperial",
4318 | "albania",
4319 | "croatia",
4320 | "croatian",
4321 | "poland",
4322 | "territory",
4323 | "siege",
4324 | "diplomatic",
4325 | "ottoman",
4326 | "conflict",
4327 | "coup",
4328 | "serbian",
4329 | "vietnamese",
4330 | "egyptian",
4331 | "byzantine",
4332 | "serbia",
4333 | "capital",
4334 | "justinian",
4335 | "uprising",
4336 | "communist",
4337 | "jews",
4338 | "turkey",
4339 | "occupation",
4340 | "damascus",
4341 | "turkish",
4342 | "constantinople",
4343 | "serbs",
4344 | "independence",
4345 | "peace",
4346 | "revolt",
4347 | "hostilities"
4348 | ],
4349 | [
4350 | "time",
4351 | "second",
4352 | "later",
4353 | "end",
4354 | "following",
4355 | "place",
4356 | "half",
4357 | "day",
4358 | "took",
4359 | "right",
4360 | "line",
4361 | "left",
4362 | "having",
4363 | "despite",
4364 | "final",
4365 | "seven",
4366 | "point",
4367 | "away",
4368 | "high",
4369 | "times",
4370 | "set",
4371 | "came",
4372 | "eventually",
4373 | "little",
4374 | "lost",
4375 | "far",
4376 | "short",
4377 | "followed",
4378 | "previous",
4379 | "instead",
4380 | "position",
4381 | "ended",
4382 | "began",
4383 | "long",
4384 | "fourth",
4385 | "result",
4386 | "held",
4387 | "lines",
4388 | "taking",
4389 | "total",
4390 | "started",
4391 | "order",
4392 | "minutes",
4393 | "allowed",
4394 | "good",
4395 | "previously",
4396 | "saw",
4397 | "beginning",
4398 | "prior",
4399 | "heavy"
4400 | ],
4401 | [
4402 | "home",
4403 | "new",
4404 | "making",
4405 | "open",
4406 | "play",
4407 | "run",
4408 | "room",
4409 | "opening",
4410 | "free",
4411 | "inside",
4412 | "stand",
4413 | "opened",
4414 | "cold",
4415 | "allow",
4416 | "standing",
4417 | "filled",
4418 | "door",
4419 | "shot",
4420 | "super",
4421 | "going",
4422 | "originally",
4423 | "turned",
4424 | "featured",
4425 | "turn",
4426 | "drawn",
4427 | "supporting",
4428 | "wild",
4429 | "card",
4430 | "der",
4431 | "turning",
4432 | "walk",
4433 | "playing",
4434 | "mixed",
4435 | "kept",
4436 | "moving",
4437 | "sets",
4438 | "crowd",
4439 | "close",
4440 | "closed",
4441 | "space",
4442 | "closing",
4443 | "enter",
4444 | "swedish",
4445 | "keeping",
4446 | "featuring",
4447 | "friendly",
4448 | "walking",
4449 | "bring",
4450 | "allowing",
4451 | "opposite"
4452 | ],
4453 | [
4454 | "said",
4455 | "like",
4456 | "wrote",
4457 | "love",
4458 | "writing",
4459 | "felt",
4460 | "called",
4461 | "good",
4462 | "written",
4463 | "received",
4464 | "saying",
4465 | "critics",
4466 | "review",
4467 | "described",
4468 | "praised",
4469 | "don",
4470 | "wanted",
4471 | "according",
4472 | "way",
4473 | "know",
4474 | "critical",
4475 | "positive",
4476 | "want",
4477 | "feel",
4478 | "think",
4479 | "things",
4480 | "real",
4481 | "girl",
4482 | "life",
4483 | "thought",
4484 | "interview",
4485 | "reception",
4486 | "reviews",
4487 | "little",
4488 | "going",
4489 | "let",
4490 | "kind",
4491 | "inspired",
4492 | "heart",
4493 | "gave",
4494 | "thing",
4495 | "got",
4496 | "explained",
4497 | "mind",
4498 | "look",
4499 | "titled",
4500 | "writer",
4501 | "commented",
4502 | "lot",
4503 | "compared"
4504 | ],
4505 | [
4506 | "battle",
4507 | "force",
4508 | "forces",
4509 | "japanese",
4510 | "command",
4511 | "troops",
4512 | "army",
4513 | "men",
4514 | "commander",
4515 | "general",
4516 | "attack",
4517 | "british",
4518 | "division",
4519 | "soldiers",
4520 | "units",
4521 | "lieutenant",
4522 | "fighting",
4523 | "north",
4524 | "positions",
4525 | "south",
4526 | "regiment",
4527 | "military",
4528 | "squadron",
4529 | "battalion",
4530 | "casualties",
4531 | "ordered",
4532 | "enemy",
4533 | "artillery",
4534 | "infantry",
4535 | "combat",
4536 | "unit",
4537 | "captain",
4538 | "german",
4539 | "brigade",
4540 | "officers",
4541 | "wounded",
4542 | "officer",
4543 | "1st",
4544 | "fire",
4545 | "captured",
4546 | "colonel",
4547 | "commanders",
4548 | "commanded",
4549 | "allied",
4550 | "2nd",
4551 | "position",
4552 | "garrison",
4553 | "advance",
4554 | "divisions",
4555 | "battalions"
4556 | ],
4557 | [
4558 | "water",
4559 | "population",
4560 | "feet",
4561 | "land",
4562 | "island",
4563 | "food",
4564 | "region",
4565 | "sea",
4566 | "river",
4567 | "area",
4568 | "lake",
4569 | "metres",
4570 | "oil",
4571 | "mountain",
4572 | "climate",
4573 | "average",
4574 | "areas",
4575 | "tree",
4576 | "ice",
4577 | "forest",
4578 | "trees",
4579 | "acres",
4580 | "waters",
4581 | "near",
4582 | "people",
4583 | "islands",
4584 | "native",
4585 | "miles",
4586 | "rivers",
4587 | "plant",
4588 | "km2",
4589 | "air",
4590 | "kilometres",
4591 | "fishing",
4592 | "creek",
4593 | "plants",
4594 | "residents",
4595 | "coal",
4596 | "temperatures",
4597 | "agriculture",
4598 | "mountains",
4599 | "fish",
4600 | "snow",
4601 | "mouth",
4602 | "valley",
4603 | "approximately",
4604 | "cities",
4605 | "agricultural",
4606 | "salt",
4607 | "flow"
4608 | ],
4609 | [
4610 | "story",
4611 | "novel",
4612 | "stories",
4613 | "fiction",
4614 | "doctor",
4615 | "book",
4616 | "magazine",
4617 | "bond",
4618 | "alien",
4619 | "comic",
4620 | "science",
4621 | "novels",
4622 | "universe",
4623 | "fantasy",
4624 | "batman",
4625 | "comics",
4626 | "villain",
4627 | "fictional",
4628 | "evil",
4629 | "magic",
4630 | "halo",
4631 | "trek",
4632 | "adventures",
4633 | "tales",
4634 | "vampire",
4635 | "narrative",
4636 | "magical",
4637 | "dragon",
4638 | "trilogy",
4639 | "adventure",
4640 | "tale",
4641 | "reader",
4642 | "horror",
4643 | "marvel",
4644 | "sam",
4645 | "enterprise",
4646 | "harry",
4647 | "tintin",
4648 | "harry_potter",
4649 | "genre",
4650 | "ghost",
4651 | "readers",
4652 | "fleming",
4653 | "publisher",
4654 | "fantastic",
4655 | "potter",
4656 | "editor",
4657 | "aliens",
4658 | "spider",
4659 | "ian_fleming"
4660 | ],
4661 | [
4662 | "school",
4663 | "students",
4664 | "member",
4665 | "college",
4666 | "education",
4667 | "schools",
4668 | "university",
4669 | "members",
4670 | "class",
4671 | "student",
4672 | "community",
4673 | "founded",
4674 | "president",
4675 | "program",
4676 | "research",
4677 | "established",
4678 | "organization",
4679 | "campus",
4680 | "programs",
4681 | "served",
4682 | "academic",
4683 | "national",
4684 | "medical",
4685 | "science",
4686 | "high",
4687 | "study",
4688 | "organizations",
4689 | "prize",
4690 | "degree",
4691 | "professor",
4692 | "attended",
4693 | "honor",
4694 | "arts",
4695 | "activities",
4696 | "primary",
4697 | "board",
4698 | "department",
4699 | "engineering",
4700 | "library",
4701 | "faculty",
4702 | "graduate",
4703 | "festival",
4704 | "awarded",
4705 | "institution",
4706 | "courses",
4707 | "mayor",
4708 | "grade",
4709 | "graduated",
4710 | "teaching",
4711 | "enrolled"
4712 | ],
4713 | [
4714 | "society",
4715 | "read",
4716 | "literature",
4717 | "author",
4718 | "literary",
4719 | "argued",
4720 | "ideas",
4721 | "poem",
4722 | "poetry",
4723 | "moral",
4724 | "argues",
4725 | "sexual",
4726 | "view",
4727 | "poet",
4728 | "writings",
4729 | "social",
4730 | "religion",
4731 | "isbn",
4732 | "historical",
4733 | "writers",
4734 | "views",
4735 | "political",
4736 | "philosophy",
4737 | "writer",
4738 | "matter",
4739 | "freedom",
4740 | "writes",
4741 | "biography",
4742 | "scholar",
4743 | "theory",
4744 | "essay",
4745 | "believed",
4746 | "speech",
4747 | "argument",
4748 | "context",
4749 | "considers",
4750 | "understanding",
4751 | "publication",
4752 | "published",
4753 | "essays",
4754 | "arguments",
4755 | "understand",
4756 | "idea",
4757 | "principles",
4758 | "historians",
4759 | "opinions",
4760 | "article",
4761 | "according",
4762 | "philosophical",
4763 | "myth"
4764 | ],
4765 | [
4766 | "government",
4767 | "state",
4768 | "president",
4769 | "law",
4770 | "political",
4771 | "governor",
4772 | "election",
4773 | "minister",
4774 | "national",
4775 | "act",
4776 | "congress",
4777 | "party",
4778 | "federal",
4779 | "campaign",
4780 | "general",
4781 | "civil",
4782 | "constitution",
4783 | "rights",
4784 | "secretary",
4785 | "states",
4786 | "policy",
4787 | "office",
4788 | "elected",
4789 | "vote",
4790 | "bill",
4791 | "parliament",
4792 | "democratic",
4793 | "support",
4794 | "politics",
4795 | "term",
4796 | "majority",
4797 | "candidate",
4798 | "representative",
4799 | "administration",
4800 | "labor",
4801 | "justice",
4802 | "citizens",
4803 | "legislation",
4804 | "opposition",
4805 | "authority",
4806 | "convention",
4807 | "senator",
4808 | "economic",
4809 | "court",
4810 | "tax",
4811 | "parties",
4812 | "held",
4813 | "elections",
4814 | "constitutional",
4815 | "cabinet"
4816 | ],
4817 | [
4818 | "season",
4819 | "game",
4820 | "team",
4821 | "league",
4822 | "club",
4823 | "games",
4824 | "played",
4825 | "scored",
4826 | "goal",
4827 | "play",
4828 | "runs",
4829 | "players",
4830 | "cup",
4831 | "goals",
4832 | "ball",
4833 | "innings",
4834 | "football",
4835 | "player",
4836 | "yards",
4837 | "test",
4838 | "scoring",
4839 | "win",
4840 | "victory",
4841 | "match",
4842 | "bowl",
4843 | "playing",
4844 | "score",
4845 | "career",
4846 | "championship",
4847 | "matches",
4848 | "arsenal",
4849 | "run",
4850 | "yard",
4851 | "lap",
4852 | "points",
4853 | "lead",
4854 | "seasons",
4855 | "stadium",
4856 | "baseball",
4857 | "field",
4858 | "miller",
4859 | "appearances",
4860 | "series",
4861 | "quarterback",
4862 | "pitch",
4863 | "draw",
4864 | "bowling",
4865 | "assists",
4866 | "batting",
4867 | "wicket"
4868 | ],
4869 | [
4870 | "use",
4871 | "order",
4872 | "human",
4873 | "possible",
4874 | "required",
4875 | "evidence",
4876 | "case",
4877 | "role",
4878 | "process",
4879 | "effect",
4880 | "change",
4881 | "nature",
4882 | "result",
4883 | "involved",
4884 | "non",
4885 | "given",
4886 | "events",
4887 | "groups",
4888 | "lack",
4889 | "particular",
4890 | "information",
4891 | "provide",
4892 | "life",
4893 | "certain",
4894 | "example",
4895 | "status",
4896 | "force",
4897 | "research",
4898 | "need",
4899 | "natural",
4900 | "period",
4901 | "view",
4902 | "increase",
4903 | "problems",
4904 | "necessary",
4905 | "action",
4906 | "individual",
4907 | "structure",
4908 | "responsible",
4909 | "cause",
4910 | "data",
4911 | "form",
4912 | "control",
4913 | "able",
4914 | "terms",
4915 | "protection",
4916 | "related",
4917 | "complex",
4918 | "response",
4919 | "basis"
4920 | ],
4921 | [
4922 | "earth",
4923 | "sun",
4924 | "surface",
4925 | "star",
4926 | "planet",
4927 | "mass",
4928 | "light",
4929 | "moon",
4930 | "years",
4931 | "stars",
4932 | "solar",
4933 | "atmosphere",
4934 | "giant",
4935 | "system",
4936 | "orbit",
4937 | "space",
4938 | "magnitude",
4939 | "observed",
4940 | "distance",
4941 | "mars",
4942 | "discovery",
4943 | "cloud",
4944 | "observations",
4945 | "jupiter",
4946 | "dwarf",
4947 | "objects",
4948 | "object",
4949 | "rotation",
4950 | "motion",
4951 | "planets",
4952 | "dust",
4953 | "saturn",
4954 | "gravity",
4955 | "masses",
4956 | "massive",
4957 | "ago",
4958 | "approximately",
4959 | "telescope",
4960 | "activity",
4961 | "disk",
4962 | "roughly",
4963 | "spacecraft",
4964 | "mercury",
4965 | "sky",
4966 | "discovered",
4967 | "measurements",
4968 | "times",
4969 | "deep",
4970 | "zone",
4971 | "type"
4972 | ],
4973 | [
4974 | "game",
4975 | "player",
4976 | "games",
4977 | "released",
4978 | "players",
4979 | "gameplay",
4980 | "release",
4981 | "video",
4982 | "character",
4983 | "series",
4984 | "version",
4985 | "japan",
4986 | "versions",
4987 | "playstation",
4988 | "features",
4989 | "titles",
4990 | "nintendo",
4991 | "received",
4992 | "console",
4993 | "mode",
4994 | "enemies",
4995 | "development",
4996 | "ign",
4997 | "reviewers",
4998 | "screen",
4999 | "soundtrack",
5000 | "action",
5001 | "wii",
5002 | "multiplayer",
5003 | "xbox",
5004 | "gaming",
5005 | "releases",
5006 | "playable",
5007 | "original",
5008 | "content",
5009 | "reception",
5010 | "japanese",
5011 | "level",
5012 | "person",
5013 | "north_america",
5014 | "protagonist",
5015 | "characters",
5016 | "units",
5017 | "graphics",
5018 | "team",
5019 | "levels",
5020 | "copies",
5021 | "feature",
5022 | "battles",
5023 | "developed"
5024 | ],
5025 | [
5026 | "species",
5027 | "birds",
5028 | "genus",
5029 | "specimens",
5030 | "animal",
5031 | "bird",
5032 | "tail",
5033 | "habitat",
5034 | "taxonomy",
5035 | "females",
5036 | "males",
5037 | "diet",
5038 | "found",
5039 | "body",
5040 | "teeth",
5041 | "populations",
5042 | "fish",
5043 | "male",
5044 | "skull",
5045 | "nests",
5046 | "breeding",
5047 | "female",
5048 | "prey",
5049 | "range",
5050 | "common",
5051 | "shark",
5052 | "subspecies",
5053 | "predators",
5054 | "specimen",
5055 | "eggs",
5056 | "animals",
5057 | "ecology",
5058 | "bodies",
5059 | "mature",
5060 | "insects",
5061 | "sexes",
5062 | "fossils",
5063 | "feeding",
5064 | "sharks",
5065 | "extinction",
5066 | "nest",
5067 | "jaws",
5068 | "interspecific",
5069 | "mating",
5070 | "occurs",
5071 | "whale",
5072 | "humans",
5073 | "juveniles",
5074 | "abundant",
5075 | "fruit"
5076 | ],
5077 | [
5078 | "television",
5079 | "award",
5080 | "awards",
5081 | "dvd",
5082 | "nominated",
5083 | "critics",
5084 | "video",
5085 | "animated",
5086 | "actress",
5087 | "festival",
5088 | "magazine",
5089 | "drama",
5090 | "bbc",
5091 | "media",
5092 | "hollywood",
5093 | "new_york_times",
5094 | "movie",
5095 | "entertainment",
5096 | "writer",
5097 | "comedy",
5098 | "nomination",
5099 | "nominations",
5100 | "disney",
5101 | "female",
5102 | "documentary",
5103 | "ray",
5104 | "theatrical",
5105 | "critic",
5106 | "outstanding",
5107 | "animation",
5108 | "male",
5109 | "reviews",
5110 | "digital",
5111 | "anime",
5112 | "audio",
5113 | "acclaim",
5114 | "actor",
5115 | "audience",
5116 | "youtube",
5117 | "credits",
5118 | "film",
5119 | "poll",
5120 | "voice",
5121 | "creative",
5122 | "costumes",
5123 | "weekly",
5124 | "emmy",
5125 | "thriller",
5126 | "washington_post",
5127 | "reporter"
5128 | ],
5129 | [
5130 | "god",
5131 | "church",
5132 | "christian",
5133 | "religious",
5134 | "temple",
5135 | "worship",
5136 | "language",
5137 | "languages",
5138 | "traditions",
5139 | "tradition",
5140 | "religion",
5141 | "jesus",
5142 | "christianity",
5143 | "kingdom",
5144 | "hindu",
5145 | "spoken",
5146 | "legend",
5147 | "christ",
5148 | "holy",
5149 | "text",
5150 | "words",
5151 | "word",
5152 | "kings",
5153 | "inscriptions",
5154 | "spiritual",
5155 | "christians",
5156 | "texts",
5157 | "maya",
5158 | "faith",
5159 | "congregation",
5160 | "gods",
5161 | "divine",
5162 | "goddess",
5163 | "bible",
5164 | "shiva",
5165 | "tomb",
5166 | "buddhist",
5167 | "churches",
5168 | "deity",
5169 | "sacred",
5170 | "king",
5171 | "shrines",
5172 | "practices",
5173 | "shrine",
5174 | "monastery",
5175 | "dialect",
5176 | "ritual",
5177 | "baptism",
5178 | "hebrew",
5179 | "secular"
5180 | ],
5181 | [
5182 | "storm",
5183 | "tropical",
5184 | "hurricane",
5185 | "winds",
5186 | "depression",
5187 | "mph",
5188 | "wind",
5189 | "cyclone",
5190 | "september",
5191 | "august",
5192 | "damage",
5193 | "rainfall",
5194 | "day",
5195 | "reported",
5196 | "intensity",
5197 | "utc",
5198 | "pressure",
5199 | "near",
5200 | "system",
5201 | "occurred",
5202 | "hours",
5203 | "caused",
5204 | "damaged",
5205 | "category",
5206 | "flooding",
5207 | "convection",
5208 | "sustained",
5209 | "october",
5210 | "northeast",
5211 | "attained",
5212 | "landfall",
5213 | "heavy",
5214 | "moving",
5215 | "estimated",
5216 | "homes",
5217 | "southeast",
5218 | "south",
5219 | "reached",
5220 | "remnants",
5221 | "low",
5222 | "moved",
5223 | "disturbance",
5224 | "north",
5225 | "circulation",
5226 | "severe",
5227 | "developed",
5228 | "upgraded",
5229 | "weakened",
5230 | "northwest",
5231 | "early"
5232 | ],
5233 | [
5234 | "police",
5235 | "court",
5236 | "people",
5237 | "trial",
5238 | "report",
5239 | "prison",
5240 | "incident",
5241 | "arrested",
5242 | "violence",
5243 | "murder",
5244 | "judge",
5245 | "criminal",
5246 | "crime",
5247 | "investigation",
5248 | "jewish",
5249 | "men",
5250 | "charges",
5251 | "committed",
5252 | "person",
5253 | "news",
5254 | "accused",
5255 | "gay",
5256 | "security",
5257 | "victims",
5258 | "jury",
5259 | "crimes",
5260 | "arrest",
5261 | "defense",
5262 | "reports",
5263 | "media",
5264 | "newspaper",
5265 | "statement",
5266 | "officers",
5267 | "alleged",
5268 | "convicted",
5269 | "newspapers",
5270 | "violent",
5271 | "guilty",
5272 | "actions",
5273 | "drug",
5274 | "suicide",
5275 | "illegal",
5276 | "conspiracy",
5277 | "witnesses",
5278 | "protest",
5279 | "prisoners",
5280 | "lawyers",
5281 | "authorities",
5282 | "charge",
5283 | "case"
5284 | ]
5285 | ],
5286 | "c_npmi_10_full_all": [
5287 | 0.03828346256626948,
5288 | 0.1038583234675748,
5289 | 0.015366944419668899,
5290 | 0.0900476418994117,
5291 | 0.1020442098673313,
5292 | 0.03190760464777211,
5293 | 0.16677015392348543,
5294 | -0.008954957865953978,
5295 | 0.012054122204282256,
5296 | -0.004889701871966398,
5297 | 0.15132187821689855,
5298 | 0.13286033788304982,
5299 | 0.21361278838454933,
5300 | 0.06768430328713142,
5301 | 0.25295664311845517,
5302 | 0.19214197022094048,
5303 | 0.1246862697135218,
5304 | 0.13727695971556078,
5305 | 0.15977332312391387,
5306 | 0.07607815548878252,
5307 | 0.08921236964042889,
5308 | 0.14721562942824956,
5309 | 0.04320981988759863,
5310 | 0.1348618452599969,
5311 | 0.19853561056687896,
5312 | 0.21675332548983514,
5313 | 0.20602537817919672,
5314 | 0.05120785577156596,
5315 | 0.15718236651705925,
5316 | 0.03070653982222502,
5317 | 0.021220918893205067,
5318 | 0.11234754553902984,
5319 | 0.037713983698919415,
5320 | 0.020353224603349568,
5321 | 0.04964237205892103,
5322 | 0.16596275544055908,
5323 | 0.0976401097020574,
5324 | 0.13018839146257705,
5325 | 0.134779218822221,
5326 | 0.1173630971965991,
5327 | 0.09886537661229274,
5328 | 0.22095588022862706,
5329 | 0.03753952947452625,
5330 | 0.12707757576728707,
5331 | 0.12641353705174047,
5332 | 0.18913925649606136,
5333 | 0.1493844703305789,
5334 | 0.12044245581211913,
5335 | 0.2556792335636147,
5336 | 0.12187208376841602
5337 | ],
5338 | "path": "outputs/full-mindf_power_law-maxdf_0.9/wikitext/k-50/etm/lr_0.001-reg_1.2e-05-epochs_1000-anneal_lr_0/42"
5339 | }
5340 | }
--------------------------------------------------------------------------------