 ├── .gitignore ├── README.md ├── src-human-correlations │ ├── chatGPT_evaluate_intruders.py │ ├── chatGPT_evaluate_topic_ratings.py │ ├── generate_intruder_words_dataset.py │ └── human_correlations_bootstrap.py ├── src-misc │ ├── 01-get_data.sh │ ├── 02-process_data.ipynb │ ├── 02-process_data.py │ ├── 03-figures_nmpi_llm.py │ ├── 04-launch_figures_nmpi_llm.sh │ ├── 05-find_rating_errors.ipynb │ ├── 06-figures_ari_llm.py │ ├── 07-pairwise_scores.ipynb │ └── fig_utils.py ├── src-number-of-topics │ ├── LLM_scores_and_ARI.py │ ├── chatGPT_document_label_assignment.py │ └── chatGPT_ratings_assignment.py └── topic-modeling-output ├── dvae-topics-best-c_npmi_10_full.json ├── etm-topics-best-c_npmi_10_full.json └── mallet-topics-best-c_npmi_10_full.json /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | repo-old 3 | *.pyc 4 | computed/ -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Revisiting Automated Topic Model Evaluation with Large Language Models 2 | 3 | This repo contains code and data for our [EMNLP 2023 paper](https://aclanthology.org/2023.emnlp-main.581/) about assessing topic model output with Large Language Models. 4 | ``` 5 | @inproceedings{stammbach-etal-2023-revisiting, 6 | title = "Revisiting Automated Topic Model Evaluation with Large Language Models", 7 | author = "Stammbach, Dominik and 8 | Zouhar, Vil{\'e}m and 9 | Hoyle, Alexander and 10 | Sachan, Mrinmaya and 11 | Ash, Elliott", 12 | booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing", 13 | month = dec, 14 | year = "2023", 15 | address = "Singapore", 16 | publisher = "Association for Computational Linguistics", 17 | url = "https://aclanthology.org/2023.emnlp-main.581", 18 | doi = "10.18653/v1/2023.emnlp-main.581", 19 | pages = "9348--9357" 20 | } 21 | ``` 22 | 23 | ## Prerequisites 24 | 25 | ```shell 26 | pip install --upgrade openai pandas 27 | ``` 28 | 29 | ## Large Language Models and Topics with Human Annotations 30 | 31 | Download the topic words and human annotations from the paper [Is Automated Topic Model Evaluation Broken?](https://arxiv.org/abs/2107.02173), available in their [GitHub repository](https://github.com/ahoho/topics/blob/dev/data/human/all_data/all_data.json). 32 | 33 | 34 | ### Intruder Detection Test 35 | 36 | Following Hoyle et al. (2021), we randomly sample intruder words that do not appear among the top 50 words of each topic. 37 | 38 | ```shell 39 | python src-human-correlations/generate_intruder_words_dataset.py 40 | ``` 41 | 42 | We can then call an LLM to automatically annotate the intruder words for each topic. 43 | 44 | ```shell 45 | python src-human-correlations/chatGPT_evaluate_intruders.py --API_KEY a_valid_openAI_api_key 46 | ``` 47 | 48 | For the ratings task, simply call the script that rates topic word sets (there is no need to generate a dataset first): 49 | 50 | ```shell 51 | python src-human-correlations/chatGPT_evaluate_topic_ratings.py --API_KEY a_valid_openAI_api_key 52 | ``` 53 | (In case the OpenAI API breaks, all output is saved to a JSON file; we restart the script and skip datapoints that have already been annotated.) 54 | 55 | 56 | ### Evaluating LLM and Human Correlations 57 | 58 | We evaluate using a bootstrap approach in which we sample human annotations and LLM annotations for each datapoint. We then average these sampled annotations and compute Spearman's rho for each bootstrapped sample. We report the mean Spearman's rho over all 1000 bootstrapped samples. 
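 As an illustration, a minimal sketch of this bootstrap (assuming per-topic lists of human scores and LLM scores; the released implementation is `src-human-correlations/human_correlations_bootstrap.py`): ```python # Minimal sketch of the bootstrap correlation described above (illustration only). # Assumes `human_scores` and `llm_scores` map a topic id to a list of per-annotation scores; # the released implementation lives in src-human-correlations/human_correlations_bootstrap.py. import random import numpy as np from scipy.stats import spearmanr def bootstrap_spearman(human_scores, llm_scores, n_boot=1000, seed=42): random.seed(seed) rhos = [] topic_ids = sorted(human_scores) for _ in range(n_boot): human_means, llm_means = [], [] for topic_id in topic_ids: h = human_scores[topic_id] g = llm_scores[topic_id] # resample annotations with replacement, then average per topic human_means.append(np.mean(random.choices(h, k=len(h)))) llm_means.append(np.mean(random.choices(g, k=len(g)))) rhos.append(spearmanr(llm_means, human_means).statistic) # report the mean Spearman's rho over all bootstrapped samples return np.mean(rhos) ``` 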
 59 | 60 | ```shell 61 | python src-human-correlations/human_correlations_bootstrap.py --filename coherence-outputs-section-2/ratings_outfile_with_dataset_description.jsonl --task ratings 62 | ``` 63 | 64 | 65 | ## Evaluating Topic Models with Different Numbers of Topics 66 | 67 | Download the fitted topic models and metadata for the two datasets (bills and wikitext) [here](https://www.dropbox.com/s/huxdloe5l6w2tu5/topic_model_k_selection.zip?dl=0) and unzip the archive. 68 | 69 | ### Rating Topic Word Sets 70 | 71 | To run LLM ratings of topic word sets on a dataset (wikitext or bills) with broad or specific ground-truth example topics, simply run: 72 | 73 | ```shell 74 | python src-number-of-topics/chatGPT_ratings_assignment.py --API_KEY a_valid_openAI_api_key --dataset wikitext --label_categories broad 75 | ``` 76 | 77 | ### Purity of Document Collections 78 | 79 | We also assign a document label to the top documents belonging to each topic, following [Doogan and Buntine, 2021](https://aclanthology.org/2021.naacl-main.300/). We then average purity per document collection; the number of topics with the highest average purity is the one preferred by this procedure. 80 | 81 | To run LLM label assignments on a dataset (wikitext or bills) with broad or specific ground-truth example topics, simply run: 82 | 83 | ```shell 84 | python src-number-of-topics/chatGPT_document_label_assignment.py --API_KEY a_valid_openAI_api_key --dataset wikitext --label_categories broad 85 | ``` 86 | 87 | ### Plot resulting scores 88 | 89 | ```shell 90 | python src-number-of-topics/LLM_scores_and_ARI.py --label_categories broad --method label_assignment --dataset wikitext --filename number-of-topics-section-4/document_label_assignment_wikitext_broad.jsonl 91 | ``` 92 | 93 | ## Questions 94 | 95 | Please contact [Dominik Stammbach](mailto:dominik.stammbach@gess.ethz.ch) regarding any questions. 96 | 97 | 98 | [![Paper video presentation](https://img.youtube.com/vi/qIDjtgWTgjs/0.jpg)](https://www.youtube.com/watch?v=qIDjtgWTgjs) 99 | -------------------------------------------------------------------------------- /src-human-correlations/chatGPT_evaluate_intruders.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import json 3 | import random 4 | import openai 5 | from tqdm import tqdm 6 | from ast import literal_eval 7 | from collections import defaultdict 8 | import time 9 | import argparse 10 | import os 11 | 12 | def get_prompts(include_dataset_description="include"): 13 | if include_dataset_description == "include": 14 | system_prompt_NYT = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Select which word is the least related to all other words. If multiple words do not fit, choose the word that is most out of place. 15 | The topic modeling is based on The New York Times corpus. The corpus consists of articles from 1987 to 2007. Sections from a typical paper include International, National, New York Regional, Business, Technology, and Sports news; features on topics such as Dining, Movies, Travel, and Fashion; there are also obituaries and opinion pieces. 16 | Reply with a single word.""" 17 | 18 | system_prompt_wikitext = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. 
Select which word is the least related to all other words. If multiple words do not fit, choose the word that is most out of place. 19 | The topic modeling is based on the Wikipedia corpus. Wikipedia is an online encyclopedia covering a huge range of topics. Articles can include biographies ("George Washington"), scientific phenomena ("Solar Eclipse"), art pieces ("La Danse"), music ("Amazing Grace"), transportation ("U.S. Route 131"), sports ("1952 winter olympics"), historical events or periods ("Tang Dynasty"), media and pop culture ("The Simpsons Movie"), places ("Yosemite National Park"), plants and animals ("koala"), and warfare ("USS Nevada (BB-36)"), among others. 20 | Reply with a single word.""" 21 | outfile_name = "coherence-outputs-section-2/intrusion_outfile_with_dataset_description.jsonl" 22 | else: 23 | system_prompt_NYT = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Select which word is the least related to all other words. If multiple words do not fit, choose the word that is most out of place. 24 | Reply with a single word.""" 25 | 26 | system_prompt_wikitext = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Select which word is the least related to all other words. If multiple words do not fit, choose the word that is most out of place. 27 | Reply with a single word.""" 28 | outfile_name = "coherence-outputs-section-2/intrusion_outfile_without_dataset_description.jsonl" 29 | return system_prompt_NYT, system_prompt_wikitext, outfile_name 30 | 31 | if __name__ == "__main__": 32 | parser = argparse.ArgumentParser() 33 | parser.add_argument("--API_KEY", default="openai API key", type=str, required=True, help="valid openAI API key") 34 | parser.add_argument("--dataset_description", default="include", type=str, help="whether to include a dataset description or not, default = include.") 35 | args = parser.parse_args() 36 | 37 | 38 | openai.api_key = args.API_KEY 39 | random.seed(42) 40 | df = pd.read_json("intruder_outfile.jsonl", lines=True) 41 | 42 | system_prompt_NYT, system_prompt_wikitext, outfile_name = get_prompts(include_dataset_description=args.dataset_description) 43 | os.makedirs("coherence-outputs-section-2", exist_ok=True) 44 | 45 | 46 | columns = df.columns.tolist() 47 | 48 | with open(outfile_name, "w") as outfile: 49 | for i, row in tqdm(df.iterrows(), total=len(df)): 50 | if row.dataset_name == "wikitext": 51 | system_prompt = system_prompt_wikitext 52 | else: 53 | system_prompt = system_prompt_NYT 54 | 55 | words = row.topic_terms 56 | # shuffle words 57 | random.shuffle(words) 58 | # we only prompt 5 words 59 | words = words[:5] 60 | 61 | # we add intruder term 62 | intruder_term = row.intruder_term 63 | 64 | # we shuffle again 65 | words.append(intruder_term) 66 | random.shuffle(words) 67 | 68 | # we have a user prompt 69 | user_prompt = ", ".join(['"' + w + '"' for w in words]) 70 | 71 | out = {col: row[col] for col in columns} 72 | response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}], temperature=0.7, max_tokens=15)["choices"][0]["message"]["content"].strip() 73 | out["response"] = response 74 | out["user_promt"] = user_prompt 75 | json.dump(out, outfile) 76 | outfile.write("\n") 77 | time.sleep(0.5) 78 | 79 | 80 | -------------------------------------------------------------------------------- 
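 Note: the annotation script above streams one JSON object per line as it runs. The resume behaviour mentioned in the README (restart and skip already-annotated datapoints) is not implemented in the released script; a hypothetical helper, with the key fields taken from the rows the script writes, could look like this sketch: ```python # Hypothetical resume helper (not part of the released scripts; shown for illustration). # It collects identifiers of rows already written to a JSONL output file so that a restarted # run can open the file in append mode ("a") and skip those datapoints. import json import os def load_annotated_keys(outfile_name="coherence-outputs-section-2/intrusion_outfile_with_dataset_description.jsonl"): done = set() if not os.path.exists(outfile_name): return done with open(outfile_name) as f: for line in f: row = json.loads(line) # these keys are written by the annotation loop for every datapoint done.add((row["dataset_name"], row["model_name"], row["topic_id"], row.get("intruder_term"))) return done # Usage sketch: build the same tuple for the current row inside the annotation loop, # `continue` if it is already in the set, and open the output file with mode "a" instead of "w". ``` 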
/src-human-correlations/chatGPT_evaluate_topic_ratings.py: -------------------------------------------------------------------------------- 1 | import json 2 | import random 3 | import openai 4 | from tqdm import tqdm 5 | import pandas as pd 6 | import time 7 | import argparse 8 | import os 9 | 10 | 11 | def get_prompts(include_dataset_description=True): 12 | if include_dataset_description: 13 | system_prompt_NYT = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related). 14 | The topic modeling is based on The New York Times corpus. The corpus consists of articles from 1987 to 2007. Sections from a typical paper include International, National, New York Regional, Business, Technology, and Sports news; features on topics such as Dining, Movies, Travel, and Fashion; there are also obituaries and opinion pieces. 15 | Reply with a single number, indicating the overall appropriateness of the topic.""" 16 | 17 | system_prompt_wikitext = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related). 18 | The topic modeling is based on the Wikipedia corpus. Wikipedia is an online encyclopedia covering a huge range of topics. Articles can include biographies ("George Washington"), scientific phenomena ("Solar Eclipse"), art pieces ("La Danse"), music ("Amazing Grace"), transportation ("U.S. Route 131"), sports ("1952 winter olympics"), historical events or periods ("Tang Dynasty"), media and pop culture ("The Simpsons Movie"), places ("Yosemite National Park"), plants and animals ("koala"), and warfare ("USS Nevada (BB-36)"), among others. 19 | Reply with a single number, indicating the overall appropriateness of the topic.""" 20 | outfile_name = "coherence-outputs-section-2/ratings_outfile_with_dataset_description.jsonl" 21 | else: 22 | system_prompt_NYT = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related). 23 | Reply with a single number, indicating the overall appropriateness of the topic.""" 24 | 25 | system_prompt_wikitext = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related). 
26 | Reply with a single number, indicating the overall appropriateness of the topic.""" 27 | outfile_name = "coherence-outputs-section-2/ratings_outfile_without_dataset_description.jsonl" 28 | return system_prompt_NYT, system_prompt_wikitext, outfile_name 29 | 30 | 31 | if __name__ == "__main__": 32 | 33 | parser = argparse.ArgumentParser() 34 | parser.add_argument("--API_KEY", default="openai API key", type=str, required=True, help="valid openAI API key") 35 | parser.add_argument("--dataset_description", default="include", type=str, help="whether to include a dataset description or not, default = include.") 36 | args = parser.parse_args() 37 | 38 | openai.api_key = args.API_KEY 39 | random.seed(42) 40 | 41 | with open("all_data.json") as f: 42 | data = json.load(f) 43 | random.seed(42) 44 | 45 | system_prompt_NYT, system_prompt_wikitext, outfile_name = get_prompts(include_dataset_description=args.dataset_description) 46 | os.makedirs("coherence-outputs-section-2", exist_ok=True) 47 | 48 | with open(outfile_name, "w") as outfile: 49 | for dataset_name, dataset in data.items(): 50 | for model_name, dataset_model in dataset.items(): 51 | print (dataset_name, model_name) 52 | topics = dataset_model["topics"] 53 | human_evaluations = dataset_model["metrics"]["ratings_scores_avg"] 54 | i = 0 55 | for topic, human_eval in tqdm(zip(topics, human_evaluations), total=50): 56 | topic = topic[:10] 57 | for run in range(3): 58 | random.shuffle(topic) 59 | user_prompt = ", ".join(topic) 60 | if dataset_name == "wikitext": 61 | system_prompt = system_prompt_wikitext 62 | else: 63 | system_prompt = system_prompt_NYT 64 | response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}], temperature=1.0, logit_bias={16:100, 17:100, 18:100}, max_tokens=1)["choices"][0]["message"]["content"].strip() 65 | out = {"dataset_name": dataset_name, "model_name": model_name, "topic_id": i, "user_prompt": user_prompt, "response": response, "human_eval":human_eval, "run": run} 66 | json.dump(out, outfile) 67 | outfile.write("\n") 68 | time.sleep(0.5) 69 | i += 1 70 | 71 | -------------------------------------------------------------------------------- /src-human-correlations/generate_intruder_words_dataset.py: -------------------------------------------------------------------------------- 1 | import json 2 | import random 3 | 4 | 5 | if __name__ == "__main__": 6 | with open("all_data.json") as f: 7 | data = json.load(f) 8 | 9 | random.seed(42) 10 | with open("intruder_outfile.jsonl", "w") as outfile: 11 | for dataset_name, dataset in data.items(): 12 | for model_name, dataset_model in dataset.items(): 13 | print (dataset_name, model_name) 14 | if model_name == "mallet": 15 | fn = "topic-modeling-output/mallet-topics-best-c_npmi_10_full.json" 16 | elif model_name == "dvae": 17 | fn = "topic-modeling-output/dvae-topics-best-c_npmi_10_full.json" 18 | else: 19 | fn = "topic-modeling-output/etm-topics-best-c_npmi_10_full.json" 20 | with open(fn) as f: 21 | topics_data = json.load(f) 22 | 23 | raw_topics = topics_data[dataset_name]["topics"] 24 | 25 | words = set() 26 | for topic in raw_topics: 27 | words.update(topic) 28 | words = list(set(words)) 29 | for i, (topic, metric, double_check) in enumerate(zip(raw_topics, dataset_model["metrics"]["intrusion_scores_avg"], dataset_model["topics"])): 30 | topic_set = set(topic) 31 | candidate_words = [i for i in words if i not in topic_set] 32 | random.shuffle(candidate_words) 33 | 
sampled_intruders = candidate_words[:10] 34 | for intruder in sampled_intruders: 35 | out = {} 36 | out["topic_id"] = i 37 | out["intruder_term"] = intruder 38 | out["topic_terms"] = topic[:10] 39 | out["intrusion_scores_avg"] = metric 40 | out["dataset_name"] = dataset_name 41 | out["model_name"] = model_name 42 | json.dump(out, outfile) 43 | outfile.write("\n") 44 | 45 | -------------------------------------------------------------------------------- /src-human-correlations/human_correlations_bootstrap.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import re 3 | from scipy.stats import spearmanr 4 | import numpy as np 5 | from sklearn.metrics import accuracy_score 6 | import random, json 7 | from ast import literal_eval 8 | from tqdm import tqdm 9 | import os 10 | import argparse 11 | 12 | def load_dataframe(fn, task=""): 13 | if fn.endswith(".csv"): 14 | df = pd.read_csv(fn) 15 | elif fn.endswith("jsonl"): 16 | df = pd.read_json(fn, lines=True) 17 | print (df) 18 | print (df.iloc[0]) 19 | if task == "intrusion": 20 | if "response_correct" not in df.columns: 21 | df["response_correct"] = postprocess_chatGPT_intrusion_test(df) 22 | else: 23 | 24 | if "gpt_ratings" not in df.columns: 25 | df["gpt_ratings"] = df.response.astype(int) 26 | return df 27 | 28 | def postprocess_chatGPT_intrusion_test(df): 29 | response = df.response.tolist() 30 | response = [i.lower() for i in response] 31 | response = [re.findall(r"\b[a-z_\d]+\b", i) for i in response] 32 | response_correct = [] 33 | #df.topic_terms = df.topic_terms.apply(lambda x: literal_eval(x)) 34 | #for i,j,k in zip(response, df.intruder_term, df.topic_terms): 35 | for i,j in zip(response, df.intruder_term): 36 | if not i: 37 | response_correct.append(0) 38 | elif i[0] == j: 39 | response_correct.append(1) 40 | else: 41 | response_correct.append(0) 42 | return response_correct 43 | 44 | def postprocess_chatGPT_intrusion_new(df): 45 | for response, intruder_term, topic_words in zip(response, df.intruder_term, df.topic_terms): 46 | if not response: 47 | response_correct.append(0) 48 | elif response.lower() == intruder_term.lower(): 49 | response_correct.append(1) 50 | else: 51 | response_correct.append(0) 52 | return response_correct 53 | 54 | def get_confidence_intervals(statistic): 55 | statistic = sorted(statistic) 56 | lower = statistic[4] # it's the fourth lowest value; if we had 100 samples, it would be the 2.5nd lowest value, this * 1.5 gets us 3.75 57 | upper = statistic[-4] # it's the fourth highest value 58 | print ("lower", lower, "upper", upper) 59 | 60 | def get_filenames(with_dataset_description = True): 61 | if with_dataset_description: 62 | intrusion_fn = "coherence-outputs-section-2/intrusion_outfile_with_dataset_description.jsonl" 63 | rating_fn = "coherence-outputs-section-2/ratings_outfile_with_dataset_description.csv" 64 | else: 65 | intrusion_fn = "coherence-outputs-section-2/intrusion_outfile_without_dataset_description.csv" 66 | rating_fn = "coherence-outputs-section-2/ratings_outfile_without_dataset_description.jsonl" 67 | return intrusion_fn, rating_fn 68 | 69 | 70 | def compute_human_ceiling_intrusion(data, only_confident = False): 71 | ratings_human, ratings_chatGPT = [], [] 72 | spearman_wiki = [] 73 | spearman_NYT = [] 74 | spearman_concat = [] 75 | for _ in tqdm(range(1000), total=1000): 76 | ratings_human, ratings_chatGPT = [], [] 77 | for dataset in ["wikitext", "nytimes"]: 78 | for model in ["mallet", "dvae", "etm"]: 79 | for topic_id in 
range(50): 80 | intrusion_scores_raw = data[dataset][model]["metrics"]["intrusion_scores_raw"][topic_id] 81 | if only_confident: 82 | intrusion_scores_raw = [i for i,j in zip(intrusion_scores_raw, data[dataset][model]["metrics"]["intrusion_confidences_raw"][topic_id]) if j == 1] 83 | if not intrusion_scores_raw: 84 | intrusion_scores_raw = [0] 85 | 86 | if len(intrusion_scores_raw) == 1: 87 | intrusion_scores_1 = intrusion_scores_raw 88 | intrusion_scores_2 = intrusion_scores_raw 89 | else: 90 | length = len(intrusion_scores_raw) // 2 91 | intrusion_scores_1 = intrusion_scores_raw[:length] 92 | intrusion_scores_2 = intrusion_scores_raw[length:] 93 | intrusion_scores_1 = random.choices(intrusion_scores_1, k=len(intrusion_scores_1)) 94 | intrusion_scores_2 = random.choices(intrusion_scores_2, k=len(intrusion_scores_2)) 95 | ratings_human.append(np.mean(intrusion_scores_1)) 96 | ratings_chatGPT.append(np.mean(intrusion_scores_2)) 97 | spearman_wiki.append(spearmanr(ratings_chatGPT[:150], ratings_human[:150]).statistic) 98 | spearman_NYT.append(spearmanr(ratings_chatGPT[150:], ratings_human[150:]).statistic) 99 | spearman_concat.append(spearmanr(ratings_chatGPT, ratings_human).statistic) 100 | print ("wiki", np.mean(spearman_wiki)) 101 | #get_confidence_intervals(spearman_wiki) 102 | print ("NYT", np.mean(spearman_NYT)) 103 | #get_confidence_intervals(spearman_NYT) 104 | print ("concat", np.mean(spearman_concat)) 105 | #get_confidence_intervals(spearman_concat) 106 | 107 | def compute_human_ceiling_rating(data, only_confident = False): 108 | ratings_human, ratings_chatGPT = [], [] 109 | spearman_wiki = [] 110 | spearman_NYT = [] 111 | spearman_concat = [] 112 | for _ in tqdm(range(1000), total=1000): 113 | ratings_human, ratings_chatGPT = [], [] 114 | for dataset in ["wikitext", "nytimes"]: 115 | for model in ["mallet", "dvae", "etm"]: 116 | for topic_id in range(50): 117 | intrusion_scores_raw = data[dataset][model]["metrics"]["ratings_scores_raw"][topic_id] 118 | if only_confident: 119 | intrusion_scores_raw = [i for i,j in zip(intrusion_scores_raw, data[dataset][model]["metrics"]["ratings_confidences_raw"][topic_id]) if j == 1] 120 | 121 | if not intrusion_scores_raw: 122 | intrusion_scores_raw = [1] 123 | if len(intrusion_scores_raw) == 1: 124 | intrusion_scores_1 = intrusion_scores_raw 125 | intrusion_scores_2 = intrusion_scores_raw 126 | else: 127 | length = len(intrusion_scores_raw) // 2 128 | intrusion_scores_1 = intrusion_scores_raw[:length] 129 | intrusion_scores_2 = intrusion_scores_raw[length:] 130 | intrusion_scores_1 = random.choices(intrusion_scores_1, k=len(intrusion_scores_1)) 131 | intrusion_scores_2 = random.choices(intrusion_scores_2, k=len(intrusion_scores_2)) 132 | ratings_human.append(np.mean(intrusion_scores_1)) 133 | ratings_chatGPT.append(np.mean(intrusion_scores_2)) 134 | spearman_wiki.append(spearmanr(ratings_chatGPT[:150], ratings_human[:150]).statistic) 135 | spearman_NYT.append(spearmanr(ratings_chatGPT[150:], ratings_human[150:]).statistic) 136 | spearman_concat.append(spearmanr(ratings_chatGPT, ratings_human).statistic) 137 | print ("wiki", np.mean(spearman_wiki)) 138 | #get_confidence_intervals(spearman_wiki) 139 | print ("NYT", np.mean(spearman_NYT)) 140 | #get_confidence_intervals(spearman_NYT) 141 | print ("concat", np.mean(spearman_concat)) 142 | #get_confidence_intervals(spearman_concat) 143 | 144 | 145 | def compute_spearmanr_bootstrap_intrusion(df_intruder_scores, data, only_confident=False): 146 | ratings_human, ratings_chatGPT = [], [] 147 | 
spearman_wiki = [] 148 | spearman_NYT = [] 149 | spearman_concat = [] 150 | 151 | for _ in tqdm(range(1000), total=1000): 152 | ratings_human, ratings_chatGPT = [], [] 153 | for dataset in ["wikitext", "nytimes"]: 154 | for model in ["mallet", "dvae", "etm"]: 155 | for topic_id in range(50): 156 | intrusion_scores_raw = data[dataset][model]["metrics"]["intrusion_scores_raw"][topic_id] 157 | if only_confident: 158 | intrusion_scores_raw = [i for i,j in zip(intrusion_scores_raw, data[dataset][model]["metrics"]["intrusion_confidences_raw"][topic_id]) if j == 1] 159 | # sample bootstrap fold 160 | 161 | intrusion_scores_raw = random.choices(intrusion_scores_raw, k=len(intrusion_scores_raw)) 162 | df_topic = df_intruder_scores[(df_intruder_scores.dataset_name == dataset) & (df_intruder_scores.model_name == model) & (df_intruder_scores.topic_id == topic_id)] 163 | gpt_ratings = random.choices(df_topic.response_correct.tolist(), k= len(df_topic.response_correct)) 164 | 165 | # save results 166 | ratings_human.append(np.mean(intrusion_scores_raw)) 167 | ratings_chatGPT.append(np.mean(gpt_ratings)) 168 | # compute spearman_R and save results 169 | spearman_wiki.append(spearmanr(ratings_chatGPT[:150], ratings_human[:150]).statistic) 170 | spearman_NYT.append(spearmanr(ratings_chatGPT[150:], ratings_human[150:]).statistic) 171 | spearman_concat.append(spearmanr(ratings_chatGPT, ratings_human).statistic) 172 | print ("wiki", np.mean(spearman_wiki)) 173 | get_confidence_intervals(spearman_wiki) 174 | print ("NYT", np.mean(spearman_NYT)) 175 | get_confidence_intervals(spearman_NYT) 176 | print ("concat", np.mean(spearman_concat)) 177 | get_confidence_intervals(spearman_concat) 178 | 179 | def compute_spearmanr_bootstrap_rating(df_rating_scores, data, only_confident=False): 180 | ratings_human, ratings_chatGPT = [], [] 181 | spearman_wiki = [] 182 | spearman_NYT = [] 183 | spearman_concat = [] 184 | 185 | for _ in tqdm(range(150), total=150): 186 | ratings_human, ratings_chatGPT = [], [] 187 | for dataset in ["wikitext", "nytimes"]: 188 | for model in ["mallet", "dvae", "etm"]: 189 | for topic_id in range(50): 190 | rating_scores_raw = data[dataset][model]["metrics"]["ratings_scores_raw"][topic_id] 191 | if only_confident: 192 | rating_scores_raw = [i for i,j in zip(rating_scores_raw, data[dataset][model]["metrics"]["ratings_confidences_raw"][topic_id]) if j == 1] 193 | # sample bootstrap fold 194 | rating_scores_raw = random.choices(rating_scores_raw, k=len(rating_scores_raw)) 195 | df_topic = df_rating_scores[(df_rating_scores.dataset_name == dataset) & (df_rating_scores.model_name == model) & (df_rating_scores.topic_id == topic_id)] 196 | gpt_ratings = random.choices(df_topic.gpt_ratings.tolist(), k= len(df_topic.gpt_ratings)) 197 | 198 | # save results 199 | ratings_human.append(np.mean(rating_scores_raw)) 200 | ratings_chatGPT.append(np.mean(gpt_ratings)) 201 | # compute spearman_R and save results 202 | spearman_wiki.append(spearmanr(ratings_chatGPT[:150], ratings_human[:150]).statistic) 203 | spearman_NYT.append(spearmanr(ratings_chatGPT[150:], ratings_human[150:]).statistic) 204 | spearman_concat.append(spearmanr(ratings_chatGPT, ratings_human).statistic) 205 | print ("wiki", np.mean(spearman_wiki)) 206 | get_confidence_intervals(spearman_wiki) 207 | print ("NYT", np.mean(spearman_NYT)) 208 | get_confidence_intervals(spearman_NYT) 209 | print ("concat", np.mean(spearman_concat)) 210 | get_confidence_intervals(spearman_concat) 211 | 212 | 213 | 214 | 215 | if __name__ == "__main__": 216 | parser = 
argparse.ArgumentParser() 217 | parser.add_argument("--task", default="ratings", type=str) 218 | parser.add_argument("--filename", default="coherence-outputs-section-2/ratings_outfile_with_dataset_description.jsonl", type=str, help="whether to include a dataset description or not, default = include.") 219 | parser.add_argument("--only_confident", default="true", type=str) 220 | args = parser.parse_args() 221 | 222 | if args.only_confident == "true": 223 | only_confident = True 224 | else: 225 | only_confident = False 226 | 227 | random.seed(42) 228 | path = "coherence-outputs-section-2" 229 | 230 | experiments = ["human_ceiling", "dataset_description", "dataset_description_only_confident", "no_dataset_description"] 231 | 232 | with open("all_data.json") as f: 233 | data = json.load(f) 234 | 235 | if args.task == "human_ceiling": 236 | compute_human_ceiling_intrusion(data, only_confident=only_confident) 237 | compute_human_ceiling_rating(data, only_confident=only_confident) 238 | elif args.task == "ratings": 239 | df_rating = load_dataframe(args.filename) 240 | compute_spearmanr_bootstrap_rating(df_rating, data, only_confident=only_confident) 241 | 242 | elif args.task == "intrusion": 243 | df_rating = load_dataframe(args.filename) 244 | compute_spearmanr_bootstrap_rating(df_rating, data, only_confident=only_confident) 245 | -------------------------------------------------------------------------------- /src-misc/01-get_data.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/bash 2 | 3 | mkdir -p data 4 | 5 | wget https://raw.githubusercontent.com/ahoho/topics/dev/data/human/all_data/all_data.json -O data/intrusion.json -------------------------------------------------------------------------------- /src-misc/02-process_data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import json\n", 10 | "data = json.load(open(\"../data/intrusion.json\", \"r\"))" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "data[\"wikitext\"][\"mallet\"][\"metrics\"].keys()\n", 20 | "\n", 21 | "[(k, len(v), None if type(v[0]) != list else len(v[0])) for k, v in data[\"nytimes\"][\"mallet\"][\"metrics\"].items()]" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 2, 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/plain": [ 32 | "['species',\n", 33 | " 'birds',\n", 34 | " 'males',\n", 35 | " 'females',\n", 36 | " 'bird',\n", 37 | " 'white',\n", 38 | " 'found',\n", 39 | " 'female',\n", 40 | " 'male',\n", 41 | " 'range',\n", 42 | " 'large',\n", 43 | " 'breeding',\n", 44 | " 'long',\n", 45 | " 'black',\n", 46 | " 'small',\n", 47 | " 'shark',\n", 48 | " 'population',\n", 49 | " 'common',\n", 50 | " 'prey',\n", 51 | " 'eggs']" 52 | ] 53 | }, 54 | "execution_count": 2, 55 | "metadata": {}, 56 | "output_type": "execute_result" 57 | } 58 | ], 59 | "source": [ 60 | "data[\"wikitext\"][\"mallet\"][\"topics\"][1]" 61 | ] 62 | } 63 | ], 64 | "metadata": { 65 | "kernelspec": { 66 | "display_name": "Python 3.10.7 64-bit", 67 | "language": "python", 68 | "name": "python3" 69 | }, 70 | "language_info": { 71 | "codemirror_mode": { 72 | "name": "ipython", 73 | "version": 3 74 | }, 75 | "file_extension": ".py", 76 | "mimetype": "text/x-python", 77 | "name": "python", 78 | 
"nbconvert_exporter": "python", 79 | "pygments_lexer": "ipython3", 80 | "version": "3.10.7" 81 | }, 82 | "orig_nbformat": 4, 83 | "vscode": { 84 | "interpreter": { 85 | "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1" 86 | } 87 | } 88 | }, 89 | "nbformat": 4, 90 | "nbformat_minor": 2 91 | } 92 | -------------------------------------------------------------------------------- /src-misc/02-process_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import json 4 | 5 | data = json.load(open("data/intrusion.json", "r")) 6 | 7 | for topic_scores, topic_words in zip(data["wikitext"]["etm"]["metrics"]["intrusion_scores_raw"], data["wikitext"]["etm"]["topics"]): 8 | print(len(topic_scores), len(topic_words)) 9 | group_a = [w for s, w in zip(topic_scores, topic_words) if s == 0] 10 | group_b = [w for s, w in zip(topic_scores, topic_words) if s == 1] 11 | print(group_a, group_b) -------------------------------------------------------------------------------- /src-misc/03-figures_nmpi_llm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import matplotlib.pyplot as plt 4 | import pandas as pd 5 | import os 6 | from tqdm import tqdm 7 | import numpy as np 8 | from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score, completeness_score, homogeneity_score 9 | from scipy.stats import spearmanr 10 | from collections import defaultdict 11 | import fig_utils 12 | from matplotlib.ticker import FormatStrFormatter 13 | import argparse 14 | 15 | args = argparse.ArgumentParser() 16 | args.add_argument("--experiment", default="wikitext_specific") 17 | args = args.parse_args() 18 | 19 | os.makedirs("computed/figures", exist_ok=True) 20 | 21 | N_CLUSTER = range(20, 420, 20) 22 | 23 | 24 | def moving_average(a, window_size=3): 25 | if window_size == 0: 26 | return a 27 | out = [] 28 | for i in range(len(a)): 29 | start = max(0, i - window_size) 30 | out.append(np.mean(a[start:i + 1])) 31 | return out 32 | 33 | 34 | def compute_adjusted_NMI_bills(): 35 | df_metadata = pd.read_json( 36 | "data/bills/processed/labeled/vocab_15k/train.metadata.jsonl", 37 | lines=True 38 | ) 39 | print(df_metadata) 40 | paths = [ 41 | "runs/outputs/k_selection/" + "bills" + 42 | "-labeled/vocab_15k/k-" + str(i) 43 | for i in N_CLUSTER 44 | ] 45 | topic = df_metadata.topic.tolist() 46 | cluster_metrics = defaultdict(list) 47 | 48 | for path, num_topics in tqdm(zip(paths, N_CLUSTER), total=20): 49 | path = os.path.join(path, "2972") 50 | beta = np.load(os.path.join(path, "beta.npy")) 51 | theta = np.load(os.path.join(path, "train.theta.npy")) 52 | argmax_theta = theta.argmax(axis=-1) 53 | cluster_metrics["ami"].append( 54 | adjusted_mutual_info_score(topic, argmax_theta) 55 | ) 56 | cluster_metrics["ari"].append( 57 | adjusted_rand_score(topic, argmax_theta) 58 | ) 59 | cluster_metrics["completeness"].append( 60 | completeness_score(topic, argmax_theta) 61 | ) 62 | cluster_metrics["homogeneity"].append( 63 | homogeneity_score(topic, argmax_theta) 64 | ) 65 | return cluster_metrics 66 | 67 | 68 | def compute_adjusted_NMI_wikitext_old(broad_categories=True): 69 | df_metadata = pd.read_json( 70 | "data/wikitext/processed/labeled/vocab_15k/train.metadata.jsonl", 71 | lines=True 72 | ) 73 | print(df_metadata) 74 | paths = [ 75 | "runs/outputs/k_selection/" + "wikitext" + 76 | "-labeled/vocab_15k/k-" + str(i) 77 | for i in N_CLUSTER 78 | ] 79 | 80 | if 
broad_categories: 81 | topic = df_metadata.category.tolist() 82 | else: 83 | topic = df_metadata.subcategory.tolist() 84 | 85 | cluster_metrics = defaultdict(list) 86 | 87 | for path, num_topics in tqdm(zip(paths, N_CLUSTER), total=20): 88 | path = os.path.join(path, "2972") 89 | beta = np.load(os.path.join(path, "beta.npy")) 90 | theta = np.load(os.path.join(path, "train.theta.npy")) 91 | argmax_theta = theta.argmax(axis=-1) 92 | cluster_metrics["ami"].append( 93 | adjusted_mutual_info_score(topic, argmax_theta) 94 | ) 95 | cluster_metrics["ari"].append(adjusted_rand_score(topic, argmax_theta)) 96 | cluster_metrics["completeness"].append( 97 | completeness_score(topic, argmax_theta) 98 | ) 99 | cluster_metrics["homogeneity"].append( 100 | homogeneity_score(topic, argmax_theta) 101 | ) 102 | return cluster_metrics 103 | 104 | def compute_adjusted_NMI_wikitext(broad_categories=True): 105 | df_metadata = pd.read_json("data/wikitext/processed/labeled/vocab_15k/train.metadata.jsonl", lines=True) 106 | print (df_metadata) 107 | paths = ["runs/outputs/k_selection/" + "wikitext" + "-labeled/vocab_15k/k-" + str(i) for i in range(20, 420, 20)] 108 | 109 | # re-do columns: 110 | value_counts = df_metadata.category.value_counts().rename_axis("topic").reset_index(name="counts") 111 | value_counts = {i:j for i,j in zip(value_counts.topic, value_counts.counts) if j > 25} 112 | df_metadata["subtopic"] = ["other" if i in value_counts else i for i in df_metadata.category] 113 | 114 | value_counts = df_metadata.subcategory.value_counts().rename_axis("topic").reset_index(name="counts") 115 | value_counts = {i:j for i,j in zip(value_counts.topic, value_counts.counts) if j > 25} 116 | df_metadata["subtopic"] = ["other" if i in value_counts else i for i in df_metadata.subcategory] 117 | 118 | if broad_categories: 119 | topic = df_metadata.category.tolist() 120 | else: 121 | topic = df_metadata.subcategory.tolist() 122 | 123 | cluster_metrics = defaultdict(list) 124 | 125 | for path,num_topics in tqdm(zip(paths, range(20,420,20)), total=20): 126 | path = os.path.join(path, "2972") 127 | beta = np.load(os.path.join(path, "beta.npy")) 128 | theta = np.load(os.path.join(path, "train.theta.npy")) 129 | argmax_theta = theta.argmax(axis=-1) 130 | cluster_metrics["ami"].append(adjusted_mutual_info_score(topic, argmax_theta)) 131 | cluster_metrics["ari"].append(adjusted_rand_score(topic, argmax_theta)) 132 | cluster_metrics["completeness"].append(completeness_score(topic, argmax_theta)) 133 | cluster_metrics["homogeneity"].append(homogeneity_score(topic, argmax_theta)) 134 | return cluster_metrics 135 | 136 | 137 | 138 | 139 | # experiment = "bills_broad" 140 | # experiment = "wikitext_broad" 141 | # experiment = "wikitext_specific" 142 | 143 | if args.experiment == "bills_broad": 144 | dataset = "bills" 145 | cluster_metrics = compute_adjusted_NMI_bills() 146 | paths = [ 147 | "runs/outputs/k_selection/bills-labeled/vocab_15k/k-" + str(i) 148 | for i in N_CLUSTER 149 | ] 150 | df = pd.read_csv("LLM-scores/LLM_outputs_bills_broad.csv") 151 | plt_label = "Adjusted MI" 152 | outfile_name = "n_clusters_bills_dataset.pdf" 153 | plt_title = "BillSum, broad, $\\rho = RHO_CORR$" 154 | left_ylab = True 155 | right_ylab = False 156 | degree = 1 157 | elif args.experiment == "wikitext_broad": 158 | dataset = "wikitext" 159 | cluster_metrics = compute_adjusted_NMI_wikitext() 160 | paths = [ 161 | "runs/outputs/k_selection/wikitext-labeled/vocab_15k/k-" + str(i) 162 | for i in N_CLUSTER 163 | ] 164 | df = 
pd.read_csv("LLM-scores/LLM_outputs_wikitext_broad.csv") 165 | plt_label = "Adjusted MI" 166 | outfile_name = "n_clusters_wikitext_broad.pdf" 167 | plt_title = "Wikitext, broad, $\\rho = RHO_CORR$" 168 | left_ylab = False 169 | right_ylab = False 170 | degree = 1 171 | elif args.experiment == "wikitext_specific": 172 | dataset = "wikitext" 173 | cluster_metrics = compute_adjusted_NMI_wikitext(broad_categories=False) 174 | paths = [ 175 | "runs/outputs/k_selection/wikitext-labeled/vocab_15k/k-" + str(i) 176 | for i in N_CLUSTER 177 | ] 178 | df = pd.read_csv("LLM-scores/LLM_outputs_wikitext_specific.csv") 179 | plt_label = "Adjusted MI" 180 | outfile_name = "n_clusters_wikitext_specific.pdf" 181 | plt_title = "Wikitext, specific, $\\rho = RHO_CORR$" 182 | left_ylab = False 183 | right_ylab = True 184 | degree = 3 185 | 186 | df["gpt_ratings"] = df.chatGPT_eval.astype(int) 187 | average_goodness = [] 188 | # get average gpt_ratings for each k 189 | for path in paths: 190 | path = os.path.join(path, "2972") 191 | df_at_k = df[df.path == path] 192 | average_goodness.append(df_at_k.gpt_ratings.mean()) 193 | 194 | # smooth via moving_average to remove weird outliers 195 | average_goodness = moving_average(average_goodness) 196 | 197 | # if we want to plot spearmanR for different clustering metrics 198 | compute_spearmanR_different_cluster_metrics = False 199 | if compute_spearmanR_different_cluster_metrics: 200 | for key in ["ami", "ari", "completeness", "homogeneity"]: 201 | value = cluster_metrics[key] 202 | statistic = spearmanr(average_goodness, value) 203 | print( 204 | "topics " + key, statistic.statistic.round(3), 205 | statistic.pvalue.round(3) 206 | ) 207 | 208 | # re-shape to compute z-scores (otherwise, the scales are off because LLM scores and clustering metrics are on different scales and the graphs do not look too great 209 | average_goodness = np.array(average_goodness).reshape(-1, 1) 210 | AMI = np.array(cluster_metrics["ami"]).reshape(-1, 1) 211 | 212 | # compute z-scores 213 | # average_goodness = StandardScaler().fit_transform(average_goodness).squeeze() 214 | # AMI = StandardScaler().fit_transform(AMI).squeeze() 215 | average_goodness = np.array(average_goodness).squeeze() 216 | AMI = np.array(AMI).squeeze() 217 | 218 | # plot figures 219 | fig = plt.figure(figsize=(3.5, 2)) 220 | ax1 = plt.gca() 221 | ax2 = ax1.twinx() 222 | SCATTER_STYLE = {"edgecolor": "black", "s": 30} 223 | l1 = ax1.scatter( 224 | N_CLUSTER, 225 | average_goodness, 226 | label="LLM score", 227 | color=fig_utils.COLORS[0], 228 | **SCATTER_STYLE 229 | ) 230 | l2 = ax2.scatter( 231 | N_CLUSTER, 232 | AMI, 233 | label=plt_label, 234 | color=fig_utils.COLORS[1], 235 | **SCATTER_STYLE 236 | ) 237 | 238 | # print to one digit 239 | ax2.yaxis.set_major_formatter(FormatStrFormatter('%.1f')) 240 | 241 | xticks_fine = np.linspace(min(N_CLUSTER), max(N_CLUSTER), 500) 242 | 243 | poly_ag = np.poly1d(np.polyfit(N_CLUSTER, average_goodness, degree)) 244 | ax1.plot(xticks_fine, poly_ag(xticks_fine), '-', color=fig_utils.COLORS[0], zorder=-100) 245 | poly_ami = np.poly1d(np.polyfit(N_CLUSTER, AMI, degree)) 246 | ax2.plot(xticks_fine, poly_ami(xticks_fine), '-', color=fig_utils.COLORS[1], zorder=-100) 247 | 248 | plt.legend( 249 | [l1, l2], [p_.get_label() for p_ in [l1, l2]], 250 | loc="upper right", 251 | handletextpad=-0.2, 252 | labelspacing=0.15, 253 | borderpad=0.15, 254 | borderaxespad=0.1, 255 | ) 256 | if left_ylab: 257 | ax1.set_ylabel("Adjusted MI") 258 | if right_ylab: 259 | ax2.set_ylabel("Averaged LLM 
score") 260 | ax1.set_xlabel("Number of topics") 261 | plt.xticks(N_CLUSTER[::3], N_CLUSTER[::3]) 262 | 263 | 264 | statistic = spearmanr(average_goodness, cluster_metrics["ami"]) 265 | plt.title(plt_title.replace("RHO_CORR", f"{statistic[0]:.2f}")) 266 | 267 | plt.tight_layout(pad=0.2) 268 | plt.savefig("computed/figures/" + outfile_name) 269 | plt.show() 270 | -------------------------------------------------------------------------------- /src-misc/04-launch_figures_nmpi_llm.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/bash 2 | 3 | for EXPERIMENT in "bills_broad" "wikitext_broad" "wikitext_specific"; do 4 | # prevent from showing 5 | DISPLAY="" python3 src_vilem/03-figures_nmpi_llm.py --experiment ${EXPERIMENT} 6 | done -------------------------------------------------------------------------------- /src-misc/05-find_rating_errors.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 34, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stderr", 10 | "output_type": "stream", 11 | "text": [ 12 | "/tmp/ipykernel_32934/435855610.py:6: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.\n", 13 | " df_grouped = df_rating_scores.groupby([\"dataset_name\", \"model_name\", \"topic_id\"])[\"chatGPT_eval\", \"human_eval\"].mean().reset_index()\n" 14 | ] 15 | } 16 | ], 17 | "source": [ 18 | "#!/usr/bin/env python3\n", 19 | "\n", 20 | "import pandas as pd\n", 21 | "\n", 22 | "df_rating_scores = pd.read_json(\"../computed/ratings_outfile_with_logit_bias_old_prompt_5_runs.jsonl\", lines=True)\n", 23 | "df_grouped = df_rating_scores.groupby([\"dataset_name\", \"model_name\", \"topic_id\"])[\"chatGPT_eval\", \"human_eval\"].mean().reset_index()\n", 24 | "df_grouped[\"diff_abs\"] = (df_grouped[\"chatGPT_eval\"]-df_grouped[\"human_eval\"]).abs()\n", 25 | "df_grouped[\"diff\"] = (df_grouped[\"chatGPT_eval\"]-df_grouped[\"human_eval\"])" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 36, 31 | "metadata": {}, 32 | "outputs": [ 33 | { 34 | "data": { 35 | "text/html": [ 36 | "
\n", 37 | "\n", 50 | "\n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | "
dataset_namemodel_nametopic_idchatGPT_evalhuman_evaldiff_absdiff
299wikitextmallet492.62.60.00.0
157wikitextdvae72.82.80.00.0
16nytimesdvae162.02.00.00.0
110nytimesmallet102.02.00.00.0
101nytimesmallet13.03.00.00.0
\n", 116 | "
" 117 | ], 118 | "text/plain": [ 119 | " dataset_name model_name topic_id chatGPT_eval human_eval diff_abs \\\n", 120 | "299 wikitext mallet 49 2.6 2.6 0.0 \n", 121 | "157 wikitext dvae 7 2.8 2.8 0.0 \n", 122 | "16 nytimes dvae 16 2.0 2.0 0.0 \n", 123 | "110 nytimes mallet 10 2.0 2.0 0.0 \n", 124 | "101 nytimes mallet 1 3.0 3.0 0.0 \n", 125 | "\n", 126 | " diff \n", 127 | "299 0.0 \n", 128 | "157 0.0 \n", 129 | "16 0.0 \n", 130 | "110 0.0 \n", 131 | "101 0.0 " 132 | ] 133 | }, 134 | "execution_count": 36, 135 | "metadata": {}, 136 | "output_type": "execute_result" 137 | } 138 | ], 139 | "source": [ 140 | "df_grouped.sort_values(by=\"diff_abs\")[:5]\n" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 39, 146 | "metadata": {}, 147 | "outputs": [ 148 | { 149 | "data": { 150 | "text/html": [ 151 | "
\n", 152 | "\n", 165 | "\n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | "
dataset_namemodel_nametopic_idchatGPT_evalhuman_evaldiff_absdiff
58nytimesetm81.42.8666671.466667-1.466667
41nytimesdvae411.02.5333331.533333-1.533333
28nytimesdvae281.22.7333331.533333-1.533333
34nytimesdvae341.02.6666671.666667-1.666667
216wikitextetm161.02.9333331.933333-1.933333
\n", 231 | "
" 232 | ], 233 | "text/plain": [ 234 | " dataset_name model_name topic_id chatGPT_eval human_eval diff_abs \\\n", 235 | "58 nytimes etm 8 1.4 2.866667 1.466667 \n", 236 | "41 nytimes dvae 41 1.0 2.533333 1.533333 \n", 237 | "28 nytimes dvae 28 1.2 2.733333 1.533333 \n", 238 | "34 nytimes dvae 34 1.0 2.666667 1.666667 \n", 239 | "216 wikitext etm 16 1.0 2.933333 1.933333 \n", 240 | "\n", 241 | " diff \n", 242 | "58 -1.466667 \n", 243 | "41 -1.533333 \n", 244 | "28 -1.533333 \n", 245 | "34 -1.666667 \n", 246 | "216 -1.933333 " 247 | ] 248 | }, 249 | "execution_count": 39, 250 | "metadata": {}, 251 | "output_type": "execute_result" 252 | } 253 | ], 254 | "source": [ 255 | "df_grouped.sort_values(by=\"diff_abs\")[-5:]" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 38, 261 | "metadata": {}, 262 | "outputs": [ 263 | { 264 | "data": { 265 | "text/html": [ 266 | "
\n", 267 | "\n", 280 | "\n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | "
dataset_namemodel_nametopic_idchatGPT_evalhuman_evaldiff_absdiff
200wikitextetm02.01.6000000.4000000.400000
210wikitextetm101.81.3333330.4666670.466667
248wikitextetm483.02.5333330.4666670.466667
207wikitextetm71.81.2666670.5333330.533333
242wikitextetm421.81.1333330.6666670.666667
\n", 346 | "
" 347 | ], 348 | "text/plain": [ 349 | " dataset_name model_name topic_id chatGPT_eval human_eval diff_abs \\\n", 350 | "200 wikitext etm 0 2.0 1.600000 0.400000 \n", 351 | "210 wikitext etm 10 1.8 1.333333 0.466667 \n", 352 | "248 wikitext etm 48 3.0 2.533333 0.466667 \n", 353 | "207 wikitext etm 7 1.8 1.266667 0.533333 \n", 354 | "242 wikitext etm 42 1.8 1.133333 0.666667 \n", 355 | "\n", 356 | " diff \n", 357 | "200 0.400000 \n", 358 | "210 0.466667 \n", 359 | "248 0.466667 \n", 360 | "207 0.533333 \n", 361 | "242 0.666667 " 362 | ] 363 | }, 364 | "execution_count": 38, 365 | "metadata": {}, 366 | "output_type": "execute_result" 367 | } 368 | ], 369 | "source": [ 370 | "df_grouped.sort_values(by=\"diff\")[-5:]" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": 40, 376 | "metadata": {}, 377 | "outputs": [ 378 | { 379 | "data": { 380 | "text/plain": [ 381 | "-0.6393333333333333" 382 | ] 383 | }, 384 | "execution_count": 40, 385 | "metadata": {}, 386 | "output_type": "execute_result" 387 | } 388 | ], 389 | "source": [ 390 | "df_grouped[\"diff\"].mean()" 391 | ] 392 | } 393 | ], 394 | "metadata": { 395 | "kernelspec": { 396 | "display_name": "Python 3.10.7 64-bit", 397 | "language": "python", 398 | "name": "python3" 399 | }, 400 | "language_info": { 401 | "codemirror_mode": { 402 | "name": "ipython", 403 | "version": 3 404 | }, 405 | "file_extension": ".py", 406 | "mimetype": "text/x-python", 407 | "name": "python", 408 | "nbconvert_exporter": "python", 409 | "pygments_lexer": "ipython3", 410 | "version": "3.10.7" 411 | }, 412 | "orig_nbformat": 4, 413 | "vscode": { 414 | "interpreter": { 415 | "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1" 416 | } 417 | } 418 | }, 419 | "nbformat": 4, 420 | "nbformat_minor": 2 421 | } 422 | -------------------------------------------------------------------------------- /src-misc/06-figures_ari_llm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import matplotlib.pyplot as plt 4 | import os 5 | from tqdm import tqdm 6 | import numpy as np 7 | from collections import defaultdict 8 | from scipy.stats import spearmanr 9 | import fig_utils 10 | from matplotlib.ticker import FormatStrFormatter 11 | import csv 12 | import argparse 13 | 14 | args = argparse.ArgumentParser() 15 | args.add_argument("--experiment", default="wikitext_specific") 16 | args = args.parse_args() 17 | 18 | os.makedirs("computed/figures", exist_ok=True) 19 | 20 | N_CLUSTER = range(20, 420, 20) 21 | 22 | data = list(csv.DictReader( 23 | open(f"repo-old/LLM-scores-3/{args.experiment}_dataframe_all_results.csv", "r"))) 24 | 25 | # plot figures 26 | fig = plt.figure(figsize=(3.5, 2)) 27 | ax1 = plt.gca() 28 | ax2 = ax1.twinx() 29 | ax3 = ax1.twinx() 30 | if args.experiment == "bills_broad": 31 | dataset = "bills" 32 | outfile_name = "n_clusters_bills_broad.pdf" 33 | plt_title = "BillSum, broad, $\\rho_D = RHO_CORR1$ $\\rho_W = RHO_CORR2$ " 34 | left_ylab = True 35 | show_legend = True 36 | degree = 4 37 | 38 | ax3.tick_params(axis='y', colors="#3d518c") 39 | ax3.yaxis.set_label_position('left') 40 | ax3.yaxis.set_ticks_position('left') 41 | ax3.set_ylabel("Document LLM", color="#3d518c") 42 | ax1.set_yticks([]) 43 | ax2.set_axis_off() 44 | elif args.experiment == "wikitext_broad": 45 | dataset = "wikitext" 46 | outfile_name = "n_clusters_wikitext_broad.pdf" 47 | plt_title = "Wikitext, broad, $\\rho_D = RHO_CORR1$ $\\rho_W = RHO_CORR2$" 48 | left_ylab = False 49 | 
show_legend = False 50 | degree = 4 51 | ax1.set_yticks([]) 52 | ax2.set_axis_off() 53 | ax3.set_yticks([]) 54 | ax1.set_ylabel("|", color="white") 55 | ax3.set_ylabel("|", color="white") 56 | elif args.experiment == "wikitext_specific": 57 | dataset = "wikitext" 58 | outfile_name = "n_clusters_wikitext_specific.pdf" 59 | plt_title = " Wikitext, specific, $\\rho_D = RHO_CORR1$ $\\rho_W = RHO_CORR2$" 60 | left_ylab = False 61 | show_legend = False 62 | degree = 4 63 | # ax1.set_yticks([]) 64 | # ax3.tick_params(axis='y', colors="#3d518c") 65 | # ax2.set_axis_off() 66 | # ax3.set_axis_off() 67 | 68 | ax1.tick_params(axis='y', colors="#933d35") 69 | ax1.yaxis.set_label_position('right') 70 | ax1.yaxis.set_ticks_position('right') 71 | ax1.set_ylabel("ARI", color="#933d35") 72 | ax3.set_axis_off() 73 | ax2.set_axis_off() 74 | 75 | data_llm_word = [float(x["LLM Scores Wordset"]) for x in data] 76 | data_llm_doc = [float(x["LLM Scores Documentset"]) for x in data] 77 | data_ari = [float(x["ARI"]) for x in data] 78 | 79 | SCATTER_STYLE = {"edgecolor": "black", "s": 30} 80 | l1 = ax1.scatter( 81 | N_CLUSTER, 82 | data_ari, 83 | label="ARI", 84 | color=fig_utils.COLORS[1], 85 | **SCATTER_STYLE 86 | ) 87 | l2 = ax2.scatter( 88 | N_CLUSTER, 89 | data_llm_word, 90 | label="Word LLM", 91 | color=fig_utils.COLORS[0], 92 | **SCATTER_STYLE 93 | ) 94 | l3 = ax3.scatter( 95 | N_CLUSTER, 96 | data_llm_doc, 97 | label="Doc LLM", 98 | color=fig_utils.COLORS[4], 99 | **SCATTER_STYLE 100 | ) 101 | 102 | # ax1.axes.get_yaxis().set_visible(False) 103 | # print to one digit 104 | # ax1.yaxis.set_major_formatter(FormatStrFormatter('%.0f')) 105 | 106 | xticks_fine = np.linspace(min(N_CLUSTER), max(N_CLUSTER), 500) 107 | 108 | poly_ami = np.poly1d(np.polyfit(N_CLUSTER, data_ari, degree)) 109 | ax1.plot( 110 | xticks_fine, poly_ami(xticks_fine), '-', color=fig_utils.COLORS[1], zorder=-100 111 | ) 112 | poly_llm_word = np.poly1d(np.polyfit(N_CLUSTER, data_llm_word, degree)) 113 | ax2.plot( 114 | xticks_fine, poly_llm_word(xticks_fine), '-', color=fig_utils.COLORS[0], zorder=-100 115 | ) 116 | poly_llm_doc = np.poly1d(np.polyfit(N_CLUSTER, data_llm_doc, degree)) 117 | ax3.plot( 118 | xticks_fine, poly_llm_doc(xticks_fine), '-', color=fig_utils.COLORS[4], zorder=-100 119 | ) 120 | 121 | if show_legend: 122 | lhandles = [l2, l3, l1] 123 | plt.legend( 124 | lhandles, [p_.get_label() for p_ in lhandles], 125 | loc="upper right", 126 | handletextpad=-0.2, 127 | labelspacing=0.1, 128 | borderpad=0.2, 129 | borderaxespad=0.2, 130 | handlelength=1.5, 131 | columnspacing=0.8, 132 | ncols=2, 133 | edgecolor="black", 134 | facecolor="#dddddd" 135 | ) 136 | # if left_ylab: 137 | # ax1.set_ylabel("Metric Scores") 138 | # else: 139 | # ax1.set_ylabel(" ") 140 | 141 | ax1.set_xlabel("Number of topics") 142 | plt.xticks(N_CLUSTER[::3], N_CLUSTER[::3]) 143 | 144 | 145 | statistic_doc = spearmanr(data_llm_doc, data_ari) 146 | statistic_word = spearmanr(data_llm_word, data_ari) 147 | # statistic = spearmanr(data_llm_doc, data_llm_word) 148 | plt.title( 149 | plt_title 150 | .replace("RHO_CORR1", f"{statistic_doc[0]:.2f}") 151 | .replace("RHO_CORR2", f"{statistic_word[0]:.2f}") 152 | ) 153 | 154 | plt.tight_layout(pad=0.1) 155 | plt.savefig("computed/figures/" + outfile_name) 156 | plt.show() 157 | -------------------------------------------------------------------------------- /src-misc/07-pairwise_scores.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": 
"code", 5 | "execution_count": 5, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "Intruder is black (5.20)\n", 13 | "Coherence is 6.27\n" 14 | ] 15 | } 16 | ], 17 | "source": [ 18 | "import random\n", 19 | "import numpy as np\n", 20 | "random.seed(0)\n", 21 | "\n", 22 | "dataset = [\n", 23 | " [\"red\", \"blue\", \"flag\", \"black\", \"sky\", \"sun\"]\n", 24 | "]\n", 25 | "\n", 26 | "\n", 27 | "def get_pairwise_similarity(word1, word2):\n", 28 | " prompt = f'On a scale from 1 to 10, how similar are \"{word1}\" and \"{word2}\"? Your answer should be only the number and nothing else.'\n", 29 | " # TODO: GPT prompt\n", 30 | " return random.randint(1, 10)\n", 31 | "\n", 32 | "\n", 33 | "for words_line in dataset:\n", 34 | " # create |W|*|W| list\n", 35 | " results_line = [[None] * len(words_line) for _ in words_line]\n", 36 | " for word1_i, word1 in enumerate(words_line):\n", 37 | " for word2_i, word2 in enumerate(words_line):\n", 38 | " # save up half the prompts\n", 39 | " if word2_i <= word1_i:\n", 40 | " continue\n", 41 | " similarity = get_pairwise_similarity(word1, word2)\n", 42 | " results_line[word1_i][word2_i] = similarity\n", 43 | " results_line[word2_i][word1_i] = similarity\n", 44 | "\n", 45 | " # remove None (on diagonal)\n", 46 | " results_line = [\n", 47 | " [x for x in similarities if x]\n", 48 | " for similarities in results_line\n", 49 | " ]\n", 50 | "\n", 51 | " per_word_avg = [\n", 52 | " np.average(similarities)\n", 53 | " for similarities in results_line\n", 54 | " ]\n", 55 | " word_intruder_i = np.argmin(per_word_avg)\n", 56 | " coherence = np.average(per_word_avg)\n", 57 | "\n", 58 | " print(f\"Intruder is {words_line[word_intruder_i]} ({min(per_word_avg):.2f})\")\n", 59 | " print(f\"Coherence is {coherence:.2f}\")\n" 60 | ] 61 | } 62 | ], 63 | "metadata": { 64 | "kernelspec": { 65 | "display_name": "Python 3.10.7 64-bit", 66 | "language": "python", 67 | "name": "python3" 68 | }, 69 | "language_info": { 70 | "codemirror_mode": { 71 | "name": "ipython", 72 | "version": 3 73 | }, 74 | "file_extension": ".py", 75 | "mimetype": "text/x-python", 76 | "name": "python", 77 | "nbconvert_exporter": "python", 78 | "pygments_lexer": "ipython3", 79 | "version": "3.10.7" 80 | }, 81 | "orig_nbformat": 4, 82 | "vscode": { 83 | "interpreter": { 84 | "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1" 85 | } 86 | } 87 | }, 88 | "nbformat": 4, 89 | "nbformat_minor": 2 90 | } 91 | -------------------------------------------------------------------------------- /src-misc/fig_utils.py: -------------------------------------------------------------------------------- 1 | import matplotlib.style 2 | import matplotlib as mpl 3 | from cycler import cycler 4 | 5 | FONT_MONOSPACE = {'fontname':'monospace'} 6 | MARKERS = "o^s*DP1" 7 | COLORS = [ 8 | "#b7423c", 9 | "#71a548", 10 | "salmon", 11 | "darkseagreen", 12 | "cornflowerblue", 13 | "orange", 14 | "seagreen", 15 | "dimgray", 16 | "purple", 17 | ] 18 | 19 | mpl.rcParams['axes.prop_cycle'] = cycler(color=COLORS) 20 | mpl.rcParams['lines.linewidth'] = 2 21 | mpl.rcParams['lines.markersize'] = 7 22 | mpl.rcParams['axes.linewidth'] = 1.5 23 | mpl.rcParams['font.family'] = "serif" 24 | 25 | METRIC_PRETTY_NAME = { 26 | "bleu": "BLEU", 27 | "ter": "TER", 28 | "meteor": "METEOR", 29 | "chrf": "ChrF", 30 | "comet": "COMET", 31 | "bleurt": "BLEURT" 32 | } 33 | 34 | COLORS_EXTRA = ["#9c2963", "#fb9e07"] 
-------------------------------------------------------------------------------- /src-number-of-topics/LLM_scores_and_ARI.py: --------------------------------------------------------------------------------
1 | import matplotlib.pyplot as plt
2 | import pandas as pd
3 | import re, os, json, random
4 | from tqdm import tqdm
5 | import numpy as np
6 | from sklearn.metrics import normalized_mutual_info_score, adjusted_mutual_info_score, adjusted_rand_score, completeness_score, homogeneity_score
7 | from sklearn.preprocessing import StandardScaler
8 | from scipy.stats import spearmanr
9 | from collections import defaultdict
10 | import argparse
11 | from collections import Counter
12 | 
13 | def moving_average(a, window_size=3):
14 |     if window_size == 0:
15 |         return a
16 |     out = []
17 |     for i in range(len(a)):
18 |         start = max(0, i - window_size)
19 |         out.append(np.mean(a[start:i + 1]))
20 |     return out
21 | 
22 | def postprocess_labels(df):
23 |     import spacy
24 |     nlp = spacy.load("en_core_web_sm")
25 |     if "response" not in df.columns:
26 |         df["response"] = df.chatGPT_eval.tolist()
27 |     out = []
28 | 
29 |     for i, text in enumerate(df.response):
30 |         text = text.lower()
31 |         text = re.findall("[a-z ]+", text)[0] # keep only the first run of letters and spaces (drops digits, punctuation, and anything after them)
32 |         text = " ".join([w.lemma_ for w in nlp(text)])
33 |         out.append(text)
34 |     df["response"] = out
35 |     return df
36 | 
37 | def compute_ARI(args):
38 |     df_metadata = pd.read_json(os.path.join("data", args.dataset, "processed/labeled/vocab_15k/train.metadata.jsonl"), lines=True)
39 |     paths = ["runs/outputs/k_selection/" + args.dataset + "-labeled/vocab_15k/k-" + str(i) for i in range(20, 420, 20)]
40 | 
41 |     if args.dataset == "bills":
42 |         topic = df_metadata.topic.tolist()
43 |     else:
44 |         if args.label_categories == "broad":
45 |             topic = df_metadata.category.tolist()
46 |         else:
47 |             topic = df_metadata.subcategory.tolist()
48 | 
49 |     cluster_metrics = defaultdict(list)
50 |     for path,num_topics in tqdm(zip(paths, range(20,420,20)), total=20):
51 |         path = os.path.join(path, "2972")
52 |         beta = np.load(os.path.join(path, "beta.npy"))
53 |         theta = np.load(os.path.join(path, "train.theta.npy"))
54 |         argmax_theta = theta.argmax(axis=-1)
55 |         cluster_metrics["ami"].append(adjusted_mutual_info_score(topic, argmax_theta))
56 |         cluster_metrics["ari"].append(adjusted_rand_score(topic, argmax_theta))
57 |         cluster_metrics["completeness"].append(completeness_score(topic, argmax_theta))
58 |         cluster_metrics["homogeneity"].append(homogeneity_score(topic, argmax_theta))
59 |     return cluster_metrics
60 | 
61 | if __name__ == "__main__":
62 |     parser = argparse.ArgumentParser()
63 |     parser.add_argument("--dataset", default="wikitext", type=str, help="dataset (wikitext or bills)")
64 |     parser.add_argument("--label_categories", default="broad", type=str, help="granularity of ground-truth labels (part of the prompt): broad or specific.")
65 |     parser.add_argument("--filename", default="number-of-topics-section-4/document_label_assignment_wikitext_broad.jsonl", type=str, help="filename with LLM responses")
66 |     parser.add_argument("--method", default="label_assignment", type=str, help="if we use topic word set ratings or document label assignment (label_assignment | topic_ratings)")
67 | 
68 |     args = parser.parse_args()
69 | 
70 |     cluster_metrics = compute_ARI(args)
71 |     paths = ["../runs/outputs/k_selection/" + args.dataset + "-labeled/vocab_15k/k-" + str(i) for i in range(20, 420, 20)]
72 |     df = pd.read_json(args.filename, lines=True)
73 |     outfile_name = "n_clusters_" + args.dataset + "_" + args.label_categories + ".png"
74 |     plt_label = "LLM Scores and ARI"
75 | 
76 | 
77 |     average_goodness = []
78 |     if args.method == "topic_ratings":
79 |         df["gpt_ratings"] = df.response.astype(int)
80 |         # get average gpt_ratings for each k
81 |         for path in paths:
82 |             path = os.path.join(path, "2972")
83 |             df_at_k = df[df.path == path]
84 |             average_goodness.append(df_at_k.gpt_ratings.mean())
85 |     elif args.method == "label_assignment":
86 |         df = postprocess_labels(df)
87 |         # get average purity for each k
88 |         for path in paths:
89 |             path = os.path.join(path, "2972")
90 |             df_at_k = df[df.path == path]
91 |             purities = []
92 |             for topic in df_at_k.topic.unique():
93 |                 df_topic = df_at_k[df_at_k.topic == topic]
94 |                 labels = df_topic.response.tolist()
95 |                 most_common, num_most_common = Counter(labels).most_common(1)[0]
96 |                 purity = num_most_common / len(labels) # fraction of top documents sharing the majority label
97 |                 purities.append(purity)
98 |             average_goodness.append(np.mean(purities))
99 | 
100 |     average_goodness = moving_average(average_goodness) # smooth with a moving average to damp outliers
101 |     ARI = cluster_metrics["ari"]
102 | 
103 |     fig = plt.figure()
104 |     ax = fig.add_subplot(111)
105 | 
106 |     ax1 = plt.subplot()
107 |     l1, = ax1.plot(average_goodness, color='tab:red')
108 |     ax2 = ax1.twinx()
109 |     l2, = ax2.plot(ARI, color='tab:blue')
110 | 
111 |     spearman_rho = spearmanr(average_goodness, ARI).statistic
112 |     print (spearman_rho)
113 | 
114 |     plt.legend([l1, l2], ["LLM Score", "ARI"])
115 |     ax.set_xlabel("Number of Topics")
116 | 
117 |     n_clusters = list(range(20, 420, 20))
118 |     plt.xticks(range(len(n_clusters)), n_clusters, rotation=45)
119 | 
120 |     if len(n_clusters) > 12:
121 |         every_nth = len(n_clusters) // 8
122 |         for n, label in enumerate(ax.xaxis.get_ticklabels()):
123 |             if n % every_nth != 0:
124 |                 label.set_visible(False)
125 |     fig_title = "LLM Scores and ARI, " + args.dataset + ", " + args.label_categories + ", $\\rho = " + str(round(spearman_rho, 2)) + "$"
126 |     plt.title(fig_title)
127 |     plt.savefig(outfile_name)
128 | 
-------------------------------------------------------------------------------- /src-number-of-topics/chatGPT_document_label_assignment.py: --------------------------------------------------------------------------------
1 | import os
2 | import json
3 | import numpy as np
4 | import sys
5 | import random
6 | import openai
7 | from tqdm import tqdm
8 | import pandas as pd
9 | import time
10 | import re
11 | import argparse
12 | 
13 | def get_system_prompt(args):
14 |     if args.dataset == "bills" and args.label_categories == "broad":
15 |         system_prompt = """You are a helpful research assistant with lots of knowledge about topic models. You are given a document assigned to a topic by a topic model. Annotate the document with a broad label, for example "health", "public lands", "domestic commerce", "government operations" and "defense".
16 | 
17 | Reply with a single word or phrase, indicating the label of the document."""
18 | 
19 |     elif args.dataset == "wikitext" and args.label_categories == "broad":
20 |         system_prompt = """You are a helpful research assistant with lots of knowledge about topic models. You are given a document assigned to a topic by a topic model. Annotate the document with a broad label, for example "television", "songs", "transport", "warships and naval units", and "biology and medicine".
21 | 
22 | Reply with a single word or phrase, indicating the label of the document."""
23 | 
24 |     elif args.dataset == "wikitext" and args.label_categories == "specific":
25 |         system_prompt = """You are a helpful research assistant with lots of knowledge about topic models. You are given a document assigned to a topic by a topic model. Annotate the document with a specific label, for example "tropical cyclones: atlantic", "actors, directors, models, performers, and celebrities", "road infrastructure: midwestern united states", "armies and military units", and "warships of germany".
26 | 
27 | Reply with a single word or phrase, indicating the label of the document."""
28 |     else:
29 |         print ("experiment not implemented")
30 |         sys.exit(0)
31 |     return system_prompt
32 | 
33 | 
34 | 
35 | # TODO: a logit bias could be added here to constrain the label responses (cf. chatGPT_ratings_assignment.py)
36 | if __name__ == "__main__":
37 | 
38 |     parser = argparse.ArgumentParser()
39 |     parser.add_argument("--API_KEY", default="openai API key", type=str, required=True, help="valid openAI API key")
40 |     parser.add_argument("--dataset", default="wikitext", type=str, help="dataset (wikitext or bills)")
41 |     parser.add_argument("--label_categories", default="broad", type=str, help="granularity of ground-truth labels (part of the prompt): broad or specific.")
42 |     args = parser.parse_args()
43 | 
44 |     random.seed(42)
45 |     system_prompt = get_system_prompt(args)
46 |     openai.api_key = args.API_KEY
47 | 
48 | 
49 |     if args.dataset == "bills":
50 |         column = "summary"
51 |     else:
52 |         column = "text"
53 | 
54 |     with open(os.path.join("data", args.dataset, "processed/labeled/vocab_15k/vocab.json")) as f:
55 |         vocab = json.load(f)
56 | 
57 |     vocab = {j:i for i,j in vocab.items()}
58 | 
59 |     # load metadata
60 |     df_metadata = pd.read_json(os.path.join("data", args.dataset, "processed/labeled/vocab_15k/train.metadata.jsonl"), lines=True)
61 | 
62 |     paths = ["runs/outputs/k_selection/" + args.dataset + "-labeled/vocab_15k/k-" + str(i) for i in range(20, 420, 20)]
63 | 
64 |     output_path = "number-of-topics-section-4"
65 |     os.makedirs(output_path, exist_ok=True)
66 | 
67 |     with open(os.path.join(output_path, "document_label_assignment_" + args.dataset + "_" + args.label_categories + ".jsonl"), "w") as outfile:
68 |         for path in tqdm(paths):
69 |             path = os.path.join(path, "2972")
70 |             beta = np.load(os.path.join(path, "beta.npy"))
71 |             theta = np.load(os.path.join(path, "train.theta.npy")).T # transpose
72 | 
73 |             print (beta.shape) # e.g. (20, 15'000): each row is a probability distribution over the vocabulary
74 |             print (theta.shape) # e.g. (20, 32'661): after the transpose, each row holds one topic's weight for every training document
75 | num_topics = beta.shape[0] 76 | 77 | # sample some topics 78 | sampled_topics = random.sample(list(range(num_topics)), 5) 79 | 80 | # for each topic 81 | for topic in sampled_topics: 82 | # sample top documents 83 | num_topics = 0 84 | user_prompt = "" 85 | arg_indices = np.argsort(theta[topic])[::-1][:10] 86 | 87 | for k, index in enumerate(arg_indices): 88 | # get text of this document 89 | text = df_metadata[column].iloc[index] 90 | text = " ".join(text.split()[:50]) # only take first 50 words 91 | user_prompt = text 92 | print (system_prompt) 93 | print (user_prompt) 94 | response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}], temperature=0.0, max_tokens=20)["choices"][0]["message"]["content"].strip() 95 | print ("topic", topic, "response --", response) 96 | out = {"path": path, "user_prompt": user_prompt, "response": response, "topic": topic, "k":k} 97 | json.dump(out, outfile) 98 | outfile.write("\n") 99 | time.sleep(0.1) 100 | -------------------------------------------------------------------------------- /src-number-of-topics/chatGPT_ratings_assignment.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import numpy as np 4 | import sys 5 | import random 6 | import openai 7 | from tqdm import tqdm 8 | import pandas as pd 9 | import argparse 10 | import time 11 | 12 | 13 | def get_system_prompt(args): 14 | if args.dataset == "bills" and args.label_categories == "broad": 15 | system_prompt = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related). 16 | The topic modeling is based on a legislative Bill summary dataset. We are interested in coherent broad topics. Typical topics in the dataset include "Health", "Public Lands", "Domestic Commerce", "Government Operations", or "Defense". 17 | Reply with a single number, indicating the overall appropriateness of the topic.""" 18 | 19 | elif args.dataset == "wikitext" and args.label_categories == "broad": 20 | system_prompt = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related). 21 | The topic modeling is based on the Wikipedia corpus. Wikipedia is an online encyclopedia covering a huge range of topics. Typical topics in the dataset include "television", "songs", "transport", "warships and naval units", and "biology and medicine". 22 | Reply with a single number, indicating the overall appropriateness of the topic.""" 23 | 24 | elif args.dataset == "wikitext" and args.label_categories == "specific": 25 | system_prompt = """You are a helpful assistant evaluating the top words of a topic model output for a given topic. Please rate how related the following words are to each other on a scale from 1 to 3 ("1" = not very related, "2" = moderately related, "3" = very related). 26 | The topic modeling is based on the Wikipedia corpus. Wikipedia is an online encyclopedia covering a huge range of topics. 
Typical topics in the dataset include "tropical cyclones: atlantic", "actors, directors, models, performers, and celebrities", "road infrastructure: midwestern united states", "armies and military units", and "warships of germany". 27 | Reply with a single number, indicating the overall appropriateness of the topic.""" 28 | else: 29 | print ("experiment not implemented") 30 | sys.exit(0) 31 | return system_prompt 32 | 33 | # add logit bias 34 | if __name__ == "__main__": 35 | parser = argparse.ArgumentParser() 36 | parser.add_argument("--API_KEY", default="openai API key", type=str, required=True, help="valid openAI API key") 37 | parser.add_argument("--dataset", default="wikitext", type=str, help="dataset (wikitext or bills)") 38 | parser.add_argument("--label_categories", default="broad", type=str, help="granularity of ground-truth labels (part of the prompt): broad or specific.") 39 | args = parser.parse_args() 40 | 41 | random.seed(42) 42 | system_prompt = get_system_prompt(args) 43 | openai.api_key = args.API_KEY 44 | 45 | with open(os.path.join("data", args.dataset, "processed/labeled/vocab_15k/vocab.json")) as f: 46 | vocab = json.load(f) 47 | vocab = {j:i for i,j in vocab.items()} 48 | 49 | paths = ["runs/outputs/k_selection/" + args.dataset + "-labeled/vocab_15k/k-" + str(i) for i in range(20, 420, 20)] 50 | 51 | 52 | output_path = "number-of-topics-section-4" 53 | os.makedirs(output_path, exist_ok=True) 54 | 55 | with open(os.path.join(output_path, "coherence_ratings_" + args.dataset + "_" + args.label_categories + ".jsonl"), "w") as outfile: 56 | for path in tqdm(paths): 57 | path = os.path.join(path, "2972") 58 | beta = np.load(os.path.join(path, "beta.npy")) 59 | theta = np.load(os.path.join(path, "train.theta.npy")) 60 | 61 | print (beta.shape) # 20, 15'000, each row is probability distribution over vocab 62 | print (theta.shape) 63 | num_topics = beta.shape[0] 64 | top_words = [] 65 | for row in beta: 66 | indices = row.argsort()[::-1][:10] 67 | top_topic_words = [vocab[i] for i in indices] 68 | top_words.append(top_topic_words) 69 | 70 | # sample 10 topics 71 | 72 | sampled_topics = random.sample(list(range(num_topics)), k=10) 73 | for i in sampled_topics: 74 | topic = top_words[i] 75 | random.shuffle(topic) 76 | user_prompt = ", ".join(topic) 77 | 78 | response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}], temperature=0, max_tokens=1, logit_bias={16:100, 17:100, 18:100})["choices"][0]["message"]["content"].strip() 79 | out = {"path": path, "topic": i, "user_prompt": user_prompt, "response": response} 80 | json.dump(out, outfile) 81 | outfile.write("\n") 82 | print (response) 83 | time.sleep(0.1) 84 | -------------------------------------------------------------------------------- /topic-modeling-output/etm-topics-best-c_npmi_10_full.json: -------------------------------------------------------------------------------- 1 | { 2 | "nytimes": { 3 | "c_npmi_10_full": 0.11378887397558148, 4 | "c_npmi_10_full_sd": 0.09088256843754142, 5 | "tu": 0.904, 6 | "to": 0.0770408163265306, 7 | "overlaps": 0, 8 | "anneal_lr": 0, 9 | "data_path": "/workspace/topic-preprocessing/data/nytimes/processed/full-mindf_power_law-maxdf_0.9/etm", 10 | "epochs": 1000, 11 | "lr": 0.02, 12 | "seed": 11235, 13 | "wdecay": 1.2e-06, 14 | "input_dir": "nytimes", 15 | "topics": [ 16 | [ 17 | "campaign", 18 | "bush", 19 | "clinton", 20 | "vote", 21 | "state", 22 | "congress", 23 | "republican", 24 | 
"administration", 25 | "election", 26 | "governor", 27 | "political", 28 | "bill", 29 | "democrats", 30 | "senator", 31 | "senate", 32 | "democratic", 33 | "party", 34 | "legislation", 35 | "republicans", 36 | "white_house", 37 | "candidate", 38 | "presidential", 39 | "lawmakers", 40 | "voters", 41 | "congressional", 42 | "debate", 43 | "representative", 44 | "washington", 45 | "candidates", 46 | "issue", 47 | "abortion", 48 | "democrat", 49 | "votes", 50 | "government", 51 | "parties", 52 | "legislature", 53 | "elections", 54 | "term", 55 | "leader", 56 | "opponents", 57 | "speaker", 58 | "primary", 59 | "elected", 60 | "opposition", 61 | "leaders", 62 | "support", 63 | "policy", 64 | "conservative", 65 | "reagan", 66 | "speech" 67 | ], 68 | [ 69 | "percent", 70 | "market", 71 | "rate", 72 | "prices", 73 | "rates", 74 | "price", 75 | "economy", 76 | "growth", 77 | "dollar", 78 | "lower", 79 | "average", 80 | "interest", 81 | "month", 82 | "rose", 83 | "stocks", 84 | "bonds", 85 | "higher", 86 | "markets", 87 | "inflation", 88 | "stock", 89 | "fell", 90 | "yesterday", 91 | "sales", 92 | "quarter", 93 | "rise", 94 | "bond", 95 | "economic", 96 | "billion", 97 | "decline", 98 | "value", 99 | "trading", 100 | "consumer", 101 | "points", 102 | "demand", 103 | "investors", 104 | "treasury", 105 | "index", 106 | "expected", 107 | "traders", 108 | "drop", 109 | "profits", 110 | "issues", 111 | "percentage", 112 | "discount", 113 | "unemployment", 114 | "week", 115 | "economists", 116 | "currency", 117 | "analysts", 118 | "fed" 119 | ], 120 | [ 121 | "years", 122 | "life", 123 | "family", 124 | "home", 125 | "old", 126 | "died", 127 | "death", 128 | "wife", 129 | "lives", 130 | "man", 131 | "born", 132 | "son", 133 | "lived", 134 | "living", 135 | "marriage", 136 | "couple", 137 | "church", 138 | "brother", 139 | "heart", 140 | "brothers", 141 | "age", 142 | "gay", 143 | "population", 144 | "career", 145 | "sons", 146 | "grew", 147 | "daughter", 148 | "moved", 149 | "serving", 150 | "native", 151 | "house", 152 | "resident", 153 | "men", 154 | "decades", 155 | "divorce", 156 | "battle", 157 | "married", 158 | "section", 159 | "retirement", 160 | "relatives", 161 | "catholic", 162 | "birthday", 163 | "chapter", 164 | "doctor", 165 | "daughters", 166 | "couples", 167 | "visit", 168 | "apartment", 169 | "year", 170 | "roman" 171 | ], 172 | [ 173 | "said", 174 | "added", 175 | "asked", 176 | "year", 177 | "saying", 178 | "months", 179 | "spokesman", 180 | "told", 181 | "week", 182 | "month", 183 | "expected", 184 | "interview", 185 | "going", 186 | "years", 187 | "according", 188 | "called", 189 | "decision", 190 | "statement", 191 | "days", 192 | "continued", 193 | "plans", 194 | "discuss", 195 | "taking", 196 | "trying", 197 | "involved", 198 | "decided", 199 | "weeks", 200 | "earlier", 201 | "believed", 202 | "suggested", 203 | "adding", 204 | "noted", 205 | "comment", 206 | "appeared", 207 | "declined", 208 | "referring", 209 | "similar", 210 | "described", 211 | "wanted", 212 | "sent", 213 | "acknowledged", 214 | "took", 215 | "spoke", 216 | "began", 217 | "taken", 218 | "concerned", 219 | "refused", 220 | "reported", 221 | "discussed", 222 | "calls" 223 | ], 224 | [ 225 | "glass", 226 | "red", 227 | "blue", 228 | "wear", 229 | "light", 230 | "wearing", 231 | "hair", 232 | "white", 233 | "fashion", 234 | "clothes", 235 | "yellow", 236 | "wood", 237 | "plastic", 238 | "gray", 239 | "shoes", 240 | "color", 241 | "colors", 242 | "pink", 243 | "metal", 244 | "dress", 245 | "wore", 246 | "paint", 247 
| "leather", 248 | "colored", 249 | "coat", 250 | "hanging", 251 | "length", 252 | "green", 253 | "tall", 254 | "look", 255 | "looking", 256 | "glasses", 257 | "sunglasses", 258 | "skin", 259 | "shoe", 260 | "classy", 261 | "buttons", 262 | "clothing", 263 | "shirt", 264 | "hang", 265 | "wooden", 266 | "stick", 267 | "liked", 268 | "toy", 269 | "window", 270 | "inside", 271 | "feet", 272 | "looks", 273 | "dressed", 274 | "underneath" 275 | ], 276 | [ 277 | "music", 278 | "band", 279 | "songs", 280 | "concert", 281 | "rock", 282 | "dance", 283 | "jazz", 284 | "album", 285 | "musical", 286 | "musicians", 287 | "piano", 288 | "performance", 289 | "singer", 290 | "song", 291 | "theater", 292 | "pop", 293 | "performances", 294 | "stage", 295 | "performed", 296 | "night", 297 | "opera", 298 | "orchestra", 299 | "composer", 300 | "piece", 301 | "play", 302 | "played", 303 | "singers", 304 | "production", 305 | "evening", 306 | "singing", 307 | "ballet", 308 | "tonight", 309 | "solo", 310 | "plays", 311 | "debut", 312 | "classical", 313 | "presented", 314 | "repertory", 315 | "program", 316 | "chamber", 317 | "performing", 318 | "quartet", 319 | "recordings", 320 | "pianist", 321 | "duo", 322 | "tunes", 323 | "works", 324 | "playing", 325 | "blues", 326 | "festival" 327 | ], 328 | [ 329 | "year", 330 | "years", 331 | "ago", 332 | "later", 333 | "old", 334 | "late", 335 | "major", 336 | "record", 337 | "lost", 338 | "leading", 339 | "early", 340 | "won", 341 | "helped", 342 | "worked", 343 | "known", 344 | "annual", 345 | "national", 346 | "held", 347 | "figure", 348 | "director", 349 | "club", 350 | "highest", 351 | "cause", 352 | "host", 353 | "earlier", 354 | "event", 355 | "ended", 356 | "million", 357 | "month", 358 | "award", 359 | "earned", 360 | "joined", 361 | "rare", 362 | "degree", 363 | "expert", 364 | "raised", 365 | "paid", 366 | "attended", 367 | "minor", 368 | "previous", 369 | "chosen", 370 | "bridge", 371 | "founded", 372 | "position", 373 | "produced", 374 | "spent", 375 | "moved", 376 | "principal", 377 | "received", 378 | "active" 379 | ], 380 | [ 381 | "street", 382 | "avenue", 383 | "west", 384 | "information", 385 | "sunday", 386 | "tickets", 387 | "tomorrow", 388 | "east", 389 | "free", 390 | "park", 391 | "manhattan", 392 | "saturday", 393 | "hours", 394 | "admission", 395 | "broadway", 396 | "center", 397 | "include", 398 | "new", 399 | "fifth", 400 | "open", 401 | "friday", 402 | "madison", 403 | "theater", 404 | "noon", 405 | "garden", 406 | "square", 407 | "tour", 408 | "includes", 409 | "sundays", 410 | "children", 411 | "new_york", 412 | "place", 413 | "saturdays", 414 | "library", 415 | "april", 416 | "road", 417 | "subway", 418 | "shows", 419 | "festival", 420 | "opens", 421 | "available", 422 | "benefit", 423 | "times", 424 | "tuesday", 425 | "fridays", 426 | "reservations", 427 | "events", 428 | "section", 429 | "near", 430 | "cards" 431 | ], 432 | [ 433 | "wine", 434 | "food", 435 | "fresh", 436 | "taste", 437 | "add", 438 | "cooking", 439 | "sugar", 440 | "heat", 441 | "minutes", 442 | "chicken", 443 | "cook", 444 | "red", 445 | "juice", 446 | "menu", 447 | "cooked", 448 | "olive", 449 | "cheese", 450 | "salt", 451 | "meat", 452 | "fat", 453 | "cup", 454 | "kosher", 455 | "meal", 456 | "dishes", 457 | "vegetables", 458 | "garlic", 459 | "dish", 460 | "oil", 461 | "rice", 462 | "seafood", 463 | "restaurant", 464 | "soup", 465 | "sauce", 466 | "oven", 467 | "milk", 468 | "ground", 469 | "cups", 470 | "warm", 471 | "beef", 472 | "bread", 473 | "eat", 474 | 
"desserts", 475 | "steak", 476 | "pepper", 477 | "medium", 478 | "small", 479 | "butter", 480 | "potato", 481 | "coffee", 482 | "white" 483 | ], 484 | [ 485 | "water", 486 | "oil", 487 | "plants", 488 | "plant", 489 | "energy", 490 | "gas", 491 | "natural", 492 | "trees", 493 | "space", 494 | "clean", 495 | "earth", 496 | "waste", 497 | "fish", 498 | "animals", 499 | "environmental", 500 | "animal", 501 | "soil", 502 | "fuel", 503 | "scientists", 504 | "birds", 505 | "snow", 506 | "rain", 507 | "wildlife", 508 | "storm", 509 | "ice", 510 | "debris", 511 | "electricity", 512 | "steel", 513 | "bird", 514 | "farm", 515 | "electric", 516 | "forest", 517 | "tree", 518 | "pollution", 519 | "surface", 520 | "grass", 521 | "ground", 522 | "farmers", 523 | "gasoline", 524 | "weather", 525 | "electrical", 526 | "heavy", 527 | "cubic", 528 | "species", 529 | "organic", 530 | "arctic", 531 | "power", 532 | "solar", 533 | "habitats", 534 | "light" 535 | ], 536 | [ 537 | "police", 538 | "said", 539 | "officers", 540 | "killed", 541 | "crime", 542 | "people", 543 | "prison", 544 | "fire", 545 | "arrested", 546 | "officer", 547 | "men", 548 | "man", 549 | "killing", 550 | "shot", 551 | "death", 552 | "victims", 553 | "gun", 554 | "murder", 555 | "wounded", 556 | "accused", 557 | "authorities", 558 | "attack", 559 | "found", 560 | "drug", 561 | "charged", 562 | "shooting", 563 | "dead", 564 | "crimes", 565 | "assault", 566 | "violent", 567 | "investigators", 568 | "reported", 569 | "security", 570 | "arrest", 571 | "suspect", 572 | "taken", 573 | "spokesman", 574 | "woman", 575 | "violence", 576 | "guns", 577 | "car", 578 | "year", 579 | "narcotics", 580 | "told", 581 | "nyt", 582 | "incident", 583 | "jail", 584 | "driver", 585 | "rape", 586 | "identified" 587 | ], 588 | [ 589 | "fact", 590 | "time", 591 | "times", 592 | "later", 593 | "believe", 594 | "having", 595 | "probably", 596 | "real", 597 | "true", 598 | "words", 599 | "word", 600 | "face", 601 | "bit", 602 | "seen", 603 | "mean", 604 | "answer", 605 | "simply", 606 | "far", 607 | "clearly", 608 | "hope", 609 | "questions", 610 | "hands", 611 | "possible", 612 | "past", 613 | "instead", 614 | "certainly", 615 | "easily", 616 | "subject", 617 | "days", 618 | "attempt", 619 | "person", 620 | "means", 621 | "surprise", 622 | "tell", 623 | "yes", 624 | "present", 625 | "doubt", 626 | "confidence", 627 | "apparently", 628 | "suggests", 629 | "usual", 630 | "speak", 631 | "appear", 632 | "eye", 633 | "prove", 634 | "place", 635 | "short", 636 | "reality", 637 | "actually", 638 | "appears" 639 | ], 640 | [ 641 | "new_york", 642 | "yesterday", 643 | "director", 644 | "manhattan", 645 | "brooklyn", 646 | "received", 647 | "new_jersey", 648 | "named", 649 | "queens", 650 | "owner", 651 | "assistant", 652 | "new_york_city", 653 | "connecticut", 654 | "department", 655 | "boston", 656 | "retired", 657 | "announced", 658 | "bronx", 659 | "los_angeles", 660 | "washington", 661 | "master", 662 | "executive", 663 | "professor", 664 | "founder", 665 | "manager", 666 | "firm", 667 | "associate", 668 | "partner", 669 | "vice", 670 | "consultant", 671 | "greenwich", 672 | "princeton", 673 | "san_francisco", 674 | "research", 675 | "hartford", 676 | "newark", 677 | "ohio", 678 | "managing", 679 | "harvard", 680 | "known", 681 | "senior", 682 | "division", 683 | "joined", 684 | "captain", 685 | "philadelphia", 686 | "coordinator", 687 | "yale", 688 | "amp", 689 | "columbia", 690 | "staten_island" 691 | ], 692 | [ 693 | "computer", 694 | "internet", 695 | "technology", 
696 | "web", 697 | "system", 698 | "software", 699 | "computers", 700 | "systems", 701 | "video", 702 | "data", 703 | "car", 704 | "equipment", 705 | "users", 706 | "digital", 707 | "cars", 708 | "sites", 709 | "phone", 710 | "customers", 711 | "online", 712 | "network", 713 | "electronic", 714 | "personal", 715 | "vehicle", 716 | "site", 717 | "models", 718 | "mail", 719 | "information", 720 | "auto", 721 | "chip", 722 | "speed", 723 | "customer", 724 | "products", 725 | "device", 726 | "use", 727 | "available", 728 | "devices", 729 | "apple", 730 | "price", 731 | "mobile", 732 | "engine", 733 | "vehicles", 734 | "manufacturing", 735 | "machines", 736 | "consumers", 737 | "user", 738 | "allow", 739 | "microsoft", 740 | "type", 741 | "networks", 742 | "ford" 743 | ], 744 | [ 745 | "game", 746 | "team", 747 | "season", 748 | "games", 749 | "play", 750 | "players", 751 | "coach", 752 | "player", 753 | "teams", 754 | "league", 755 | "ball", 756 | "football", 757 | "played", 758 | "baseball", 759 | "basketball", 760 | "mets", 761 | "giants", 762 | "yards", 763 | "yankees", 764 | "jets", 765 | "playing", 766 | "seasons", 767 | "nets", 768 | "pitch", 769 | "stadium", 770 | "hockey", 771 | "fans", 772 | "rangers", 773 | "knicks", 774 | "coaching", 775 | "yard", 776 | "field", 777 | "yankee", 778 | "pitcher", 779 | "bowl", 780 | "quarterback", 781 | "playoffs", 782 | "rookie", 783 | "redskins", 784 | "inning", 785 | "teammates", 786 | "offense", 787 | "preseason", 788 | "national_football_league", 789 | "defensive", 790 | "leagues", 791 | "innings", 792 | "pitching", 793 | "touchdown", 794 | "minutes" 795 | ], 796 | [ 797 | "week", 798 | "article", 799 | "page", 800 | "march", 801 | "tuesday", 802 | "june", 803 | "july", 804 | "friday", 805 | "thursday", 806 | "wednesday", 807 | "day", 808 | "april", 809 | "monday", 810 | "telephone", 811 | "production", 812 | "reported", 813 | "misstated", 814 | "correction", 815 | "weeks", 816 | "new", 817 | "scheduled", 818 | "sunday", 819 | "date", 820 | "expected", 821 | "month", 822 | "numbers", 823 | "referred", 824 | "november", 825 | "saturday", 826 | "daily", 827 | "fall", 828 | "months", 829 | "report", 830 | "number", 831 | "news", 832 | "appeared", 833 | "error", 834 | "following", 835 | "closed", 836 | "announced", 837 | "october", 838 | "september", 839 | "incorrectly", 840 | "picture", 841 | "year", 842 | "column", 843 | "copies", 844 | "brief", 845 | "december", 846 | "august" 847 | ], 848 | [ 849 | "board", 850 | "members", 851 | "mayor", 852 | "groups", 853 | "agency", 854 | "meeting", 855 | "officials", 856 | "plan", 857 | "public", 858 | "city", 859 | "union", 860 | "committee", 861 | "agreement", 862 | "agreed", 863 | "organization", 864 | "rules", 865 | "labor", 866 | "official", 867 | "commission", 868 | "approved", 869 | "member", 870 | "leaders", 871 | "issues", 872 | "group", 873 | "proposed", 874 | "proposal", 875 | "owners", 876 | "association", 877 | "negotiations", 878 | "strike", 879 | "council", 880 | "deal", 881 | "announced", 882 | "rejected", 883 | "decision", 884 | "approval", 885 | "director", 886 | "major", 887 | "talks", 888 | "chairman", 889 | "giuliani", 890 | "authority", 891 | "workers", 892 | "dispute", 893 | "process", 894 | "plans", 895 | "commissioner", 896 | "organizations", 897 | "joint", 898 | "effort" 899 | ], 900 | [ 901 | "today", 902 | "group", 903 | "including", 904 | "called", 905 | "led", 906 | "known", 907 | "began", 908 | "built", 909 | "early", 910 | "planned", 911 | "held", 912 | "created", 913 | 
"included", 914 | "considered", 915 | "brought", 916 | "completed", 917 | "based", 918 | "offered", 919 | "taken", 920 | "given", 921 | "followed", 922 | "intended", 923 | "sent", 924 | "organized", 925 | "recently", 926 | "replaced", 927 | "established", 928 | "working", 929 | "designed", 930 | "include", 931 | "join", 932 | "holding", 933 | "produced", 934 | "abandoned", 935 | "giving", 936 | "developed", 937 | "formed", 938 | "calling", 939 | "focused", 940 | "joined", 941 | "met", 942 | "introduced", 943 | "turned", 944 | "opened", 945 | "gave", 946 | "provided", 947 | "studied", 948 | "destroyed", 949 | "presented", 950 | "remained" 951 | ], 952 | [ 953 | "year", 954 | "contract", 955 | "left", 956 | "signed", 957 | "manager", 958 | "second", 959 | "chicago", 960 | "free", 961 | "right", 962 | "lost", 963 | "pass", 964 | "agent", 965 | "forward", 966 | "practice", 967 | "atlanta", 968 | "season", 969 | "home", 970 | "day", 971 | "draft", 972 | "brown", 973 | "defense", 974 | "johnson", 975 | "houston", 976 | "smith", 977 | "florida", 978 | "seattle", 979 | "camp", 980 | "miami", 981 | "list", 982 | "led", 983 | "seven", 984 | "sunday", 985 | "dallas", 986 | "today", 987 | "yesterday", 988 | "jackson", 989 | "super", 990 | "terms", 991 | "texas", 992 | "guard", 993 | "detroit", 994 | "tonight", 995 | "running", 996 | "record", 997 | "training", 998 | "cleveland", 999 | "starting", 1000 | "agreed", 1001 | "washington", 1002 | "toronto" 1003 | ], 1004 | [ 1005 | "like", 1006 | "little", 1007 | "good", 1008 | "best", 1009 | "line", 1010 | "hard", 1011 | "makes", 1012 | "small", 1013 | "find", 1014 | "long", 1015 | "look", 1016 | "especially", 1017 | "comes", 1018 | "come", 1019 | "called", 1020 | "use", 1021 | "sign", 1022 | "easy", 1023 | "better", 1024 | "usually", 1025 | "clear", 1026 | "form", 1027 | "offers", 1028 | "turn", 1029 | "recently", 1030 | "range", 1031 | "particular", 1032 | "looks", 1033 | "real", 1034 | "gives", 1035 | "instead", 1036 | "want", 1037 | "gets", 1038 | "turns", 1039 | "particularly", 1040 | "goes", 1041 | "turning", 1042 | "takes", 1043 | "need", 1044 | "available", 1045 | "putting", 1046 | "individual", 1047 | "means", 1048 | "taking", 1049 | "large", 1050 | "unlike", 1051 | "rarely", 1052 | "quality", 1053 | "spread", 1054 | "hand" 1055 | ], 1056 | [ 1057 | "city", 1058 | "government", 1059 | "people", 1060 | "local", 1061 | "state", 1062 | "program", 1063 | "work", 1064 | "public", 1065 | "service", 1066 | "workers", 1067 | "site", 1068 | "development", 1069 | "private", 1070 | "new", 1071 | "jobs", 1072 | "nation", 1073 | "care", 1074 | "services", 1075 | "help", 1076 | "county", 1077 | "project", 1078 | "system", 1079 | "programs", 1080 | "officials", 1081 | "residents", 1082 | "areas", 1083 | "cities", 1084 | "health", 1085 | "projects", 1086 | "provide", 1087 | "land", 1088 | "plans", 1089 | "community", 1090 | "poor", 1091 | "housing", 1092 | "new_york_city", 1093 | "plan", 1094 | "build", 1095 | "country", 1096 | "welfare", 1097 | "working", 1098 | "need", 1099 | "employees", 1100 | "thousands", 1101 | "homes", 1102 | "region", 1103 | "emergency", 1104 | "families", 1105 | "area", 1106 | "agencies" 1107 | ], 1108 | [ 1109 | "political", 1110 | "rights", 1111 | "religious", 1112 | "human", 1113 | "anti", 1114 | "democracy", 1115 | "freedom", 1116 | "civil", 1117 | "struggle", 1118 | "protest", 1119 | "war", 1120 | "politics", 1121 | "power", 1122 | "racial", 1123 | "social", 1124 | "apartheid", 1125 | "communist", 1126 | "views", 1127 | "moral", 
1128 | "influence", 1129 | "revolution", 1130 | "religion", 1131 | "liberal", 1132 | "vietnam", 1133 | "independence", 1134 | "conflicts", 1135 | "faith", 1136 | "ethnic", 1137 | "fear", 1138 | "movement", 1139 | "nazi", 1140 | "matters", 1141 | "speech", 1142 | "prominent", 1143 | "outrage", 1144 | "protests", 1145 | "radical", 1146 | "solidarity", 1147 | "conscience", 1148 | "spiritual", 1149 | "discussion", 1150 | "liberties", 1151 | "philosophy", 1152 | "protesters", 1153 | "opposition", 1154 | "liberation", 1155 | "demonstrations", 1156 | "ideological", 1157 | "community", 1158 | "outraged" 1159 | ], 1160 | [ 1161 | "health", 1162 | "medical", 1163 | "drug", 1164 | "research", 1165 | "doctors", 1166 | "patients", 1167 | "drugs", 1168 | "study", 1169 | "aids", 1170 | "blood", 1171 | "patient", 1172 | "researchers", 1173 | "tests", 1174 | "cancer", 1175 | "treatment", 1176 | "hospital", 1177 | "studies", 1178 | "medicine", 1179 | "disease", 1180 | "human", 1181 | "care", 1182 | "brain", 1183 | "risk", 1184 | "testing", 1185 | "tested", 1186 | "cell", 1187 | "smoking", 1188 | "physicians", 1189 | "scientists", 1190 | "nutrition", 1191 | "treatments", 1192 | "heart", 1193 | "treat", 1194 | "therapy", 1195 | "effective", 1196 | "clinic", 1197 | "cause", 1198 | "body", 1199 | "genetic", 1200 | "use", 1201 | "virus", 1202 | "condition", 1203 | "substance", 1204 | "test", 1205 | "gene", 1206 | "animal", 1207 | "effects", 1208 | "samples", 1209 | "medication", 1210 | "laboratory" 1211 | ], 1212 | [ 1213 | "time", 1214 | "half", 1215 | "center", 1216 | "open", 1217 | "away", 1218 | "place", 1219 | "high", 1220 | "day", 1221 | "run", 1222 | "head", 1223 | "days", 1224 | "hours", 1225 | "end", 1226 | "right", 1227 | "home", 1228 | "field", 1229 | "left", 1230 | "minutes", 1231 | "feet", 1232 | "summer", 1233 | "hour", 1234 | "close", 1235 | "eyes", 1236 | "single", 1237 | "hand", 1238 | "inch", 1239 | "green", 1240 | "set", 1241 | "drive", 1242 | "let", 1243 | "walk", 1244 | "cut", 1245 | "spring", 1246 | "wall", 1247 | "wide", 1248 | "foot", 1249 | "wait", 1250 | "nearly", 1251 | "watch", 1252 | "leaves", 1253 | "ground", 1254 | "double", 1255 | "stop", 1256 | "box", 1257 | "opening", 1258 | "seat", 1259 | "base", 1260 | "seven", 1261 | "caught", 1262 | "opened" 1263 | ], 1264 | [ 1265 | "military", 1266 | "war", 1267 | "iraq", 1268 | "israel", 1269 | "peace", 1270 | "government", 1271 | "israeli", 1272 | "forces", 1273 | "officials", 1274 | "united_nations", 1275 | "american", 1276 | "security", 1277 | "soviet", 1278 | "iran", 1279 | "official", 1280 | "iraqi", 1281 | "arab", 1282 | "russia", 1283 | "weapons", 1284 | "united_states", 1285 | "nuclear", 1286 | "troops", 1287 | "intelligence", 1288 | "attacks", 1289 | "lebanon", 1290 | "attack", 1291 | "palestinian", 1292 | "army", 1293 | "afghanistan", 1294 | "pakistan", 1295 | "administration", 1296 | "minister", 1297 | "arms", 1298 | "fighting", 1299 | "leaders", 1300 | "prime", 1301 | "islamic", 1302 | "russian", 1303 | "talks", 1304 | "middle_east", 1305 | "militants", 1306 | "international", 1307 | "muslim", 1308 | "israelis", 1309 | "armed", 1310 | "soldiers", 1311 | "country", 1312 | "treaty", 1313 | "iranian", 1314 | "soviet_union" 1315 | ], 1316 | [ 1317 | "great", 1318 | "member", 1319 | "members", 1320 | "served", 1321 | "friends", 1322 | "friend", 1323 | "entire", 1324 | "jewish", 1325 | "king", 1326 | "service", 1327 | "jews", 1328 | "extend", 1329 | "special", 1330 | "deeply", 1331 | "leader", 1332 | "deep", 1333 | "mass", 1334 | 
"community", 1335 | "thomas", 1336 | "longtime", 1337 | "spirit", 1338 | "george", 1339 | "hill", 1340 | "david", 1341 | "wish", 1342 | "honor", 1343 | "chairman", 1344 | "kennedy", 1345 | "gift", 1346 | "choice", 1347 | "jordan", 1348 | "divided", 1349 | "loved", 1350 | "john", 1351 | "commitment", 1352 | "missed", 1353 | "christian", 1354 | "express", 1355 | "foundation", 1356 | "james", 1357 | "temple", 1358 | "shared", 1359 | "wonderful", 1360 | "dedicated", 1361 | "values", 1362 | "jerusalem", 1363 | "prince", 1364 | "mark", 1365 | "martin", 1366 | "south" 1367 | ], 1368 | [ 1369 | "won", 1370 | "victory", 1371 | "win", 1372 | "lead", 1373 | "run", 1374 | "second", 1375 | "points", 1376 | "night", 1377 | "fourth", 1378 | "round", 1379 | "champion", 1380 | "title", 1381 | "winning", 1382 | "shot", 1383 | "tournament", 1384 | "final", 1385 | "beat", 1386 | "hit", 1387 | "time", 1388 | "championship", 1389 | "winner", 1390 | "cup", 1391 | "horse", 1392 | "game", 1393 | "finished", 1394 | "series", 1395 | "record", 1396 | "race", 1397 | "goal", 1398 | "seconds", 1399 | "match", 1400 | "point", 1401 | "minutes", 1402 | "saturday", 1403 | "period", 1404 | "tonight", 1405 | "boxing", 1406 | "times", 1407 | "races", 1408 | "track", 1409 | "gave", 1410 | "minute", 1411 | "fifth", 1412 | "ninth", 1413 | "career", 1414 | "line", 1415 | "sixth", 1416 | "start", 1417 | "shots", 1418 | "best" 1419 | ], 1420 | [ 1421 | "night", 1422 | "people", 1423 | "day", 1424 | "house", 1425 | "town", 1426 | "white", 1427 | "store", 1428 | "room", 1429 | "morning", 1430 | "hotel", 1431 | "stores", 1432 | "message", 1433 | "party", 1434 | "names", 1435 | "christmas", 1436 | "days", 1437 | "outside", 1438 | "streets", 1439 | "door", 1440 | "near", 1441 | "bar", 1442 | "time", 1443 | "way", 1444 | "evening", 1445 | "afternoon", 1446 | "live", 1447 | "main", 1448 | "country", 1449 | "home", 1450 | "shop", 1451 | "restaurant", 1452 | "rooms", 1453 | "dinner", 1454 | "visit", 1455 | "table", 1456 | "standing", 1457 | "crowd", 1458 | "business", 1459 | "shopping", 1460 | "summer", 1461 | "restaurants", 1462 | "club", 1463 | "block", 1464 | "dozen", 1465 | "second", 1466 | "holiday", 1467 | "hour", 1468 | "hot", 1469 | "dead", 1470 | "weekend" 1471 | ], 1472 | [ 1473 | "art", 1474 | "works", 1475 | "century", 1476 | "museum", 1477 | "artist", 1478 | "exhibition", 1479 | "artists", 1480 | "painting", 1481 | "gallery", 1482 | "paintings", 1483 | "photographs", 1484 | "collection", 1485 | "hall", 1486 | "contemporary", 1487 | "design", 1488 | "images", 1489 | "pictures", 1490 | "view", 1491 | "sculpture", 1492 | "modern", 1493 | "architecture", 1494 | "designed", 1495 | "style", 1496 | "19th", 1497 | "objects", 1498 | "exhibit", 1499 | "drawings", 1500 | "museums", 1501 | "arts", 1502 | "work", 1503 | "museum_of_modern_art", 1504 | "artworks", 1505 | "stone", 1506 | "painter", 1507 | "designers", 1508 | "landscape", 1509 | "visitors", 1510 | "designs", 1511 | "beautiful", 1512 | "18th", 1513 | "pieces", 1514 | "sculptor", 1515 | "architectural", 1516 | "collections", 1517 | "galleries", 1518 | "portrait", 1519 | "artifacts", 1520 | "20th", 1521 | "historic", 1522 | "prints" 1523 | ], 1524 | [ 1525 | "new", 1526 | "long", 1527 | "change", 1528 | "way", 1529 | "work", 1530 | "big", 1531 | "small", 1532 | "lines", 1533 | "different", 1534 | "high", 1535 | "time", 1536 | "approach", 1537 | "view", 1538 | "far", 1539 | "decade", 1540 | "past", 1541 | "large", 1542 | "current", 1543 | "larger", 1544 | "term", 1545 | "longer", 
1546 | "decades", 1547 | "moving", 1548 | "style", 1549 | "changing", 1550 | "rest", 1551 | "major", 1552 | "rich", 1553 | "early", 1554 | "point", 1555 | "end", 1556 | "generation", 1557 | "little", 1558 | "future", 1559 | "months", 1560 | "lack", 1561 | "better", 1562 | "great", 1563 | "slow", 1564 | "period", 1565 | "middle", 1566 | "era", 1567 | "rise", 1568 | "ways", 1569 | "changes", 1570 | "vast", 1571 | "modern", 1572 | "huge", 1573 | "worst", 1574 | "soon" 1575 | ], 1576 | [ 1577 | "end", 1578 | "set", 1579 | "time", 1580 | "step", 1581 | "return", 1582 | "let", 1583 | "beginning", 1584 | "forced", 1585 | "future", 1586 | "come", 1587 | "second", 1588 | "bring", 1589 | "difficult", 1590 | "allow", 1591 | "competition", 1592 | "continue", 1593 | "begin", 1594 | "coming", 1595 | "protect", 1596 | "trying", 1597 | "reach", 1598 | "major", 1599 | "start", 1600 | "sides", 1601 | "parts", 1602 | "half", 1603 | "taking", 1604 | "setting", 1605 | "break", 1606 | "point", 1607 | "failed", 1608 | "able", 1609 | "necessary", 1610 | "came", 1611 | "remain", 1612 | "meet", 1613 | "stop", 1614 | "progress", 1615 | "complete", 1616 | "send", 1617 | "final", 1618 | "tough", 1619 | "balance", 1620 | "aside", 1621 | "try", 1622 | "begins", 1623 | "gap", 1624 | "took", 1625 | "need", 1626 | "meant" 1627 | ], 1628 | [ 1629 | "black", 1630 | "school", 1631 | "students", 1632 | "television", 1633 | "schools", 1634 | "college", 1635 | "public", 1636 | "education", 1637 | "high", 1638 | "program", 1639 | "white", 1640 | "class", 1641 | "student", 1642 | "programs", 1643 | "cable", 1644 | "teachers", 1645 | "radio", 1646 | "university", 1647 | "network", 1648 | "district", 1649 | "teacher", 1650 | "sports", 1651 | "districts", 1652 | "hispanic", 1653 | "broadcast", 1654 | "entertainment", 1655 | "shows", 1656 | "stations", 1657 | "blacks", 1658 | "classes", 1659 | "colleges", 1660 | "campus", 1661 | "viewers", 1662 | "nbc", 1663 | "educational", 1664 | "faculty", 1665 | "teaching", 1666 | "fox", 1667 | "middle", 1668 | "private", 1669 | "cbs", 1670 | "special", 1671 | "grade", 1672 | "minority", 1673 | "principal", 1674 | "academic", 1675 | "disney", 1676 | "learning", 1677 | "graduate", 1678 | "studio" 1679 | ], 1680 | [ 1681 | "court", 1682 | "judge", 1683 | "law", 1684 | "case", 1685 | "federal", 1686 | "lawyer", 1687 | "trial", 1688 | "lawyers", 1689 | "charges", 1690 | "justice", 1691 | "legal", 1692 | "attorney", 1693 | "jury", 1694 | "investigation", 1695 | "supreme_court", 1696 | "ruling", 1697 | "prosecutors", 1698 | "criminal", 1699 | "decision", 1700 | "hearing", 1701 | "filed", 1702 | "denied", 1703 | "laws", 1704 | "ruled", 1705 | "justice_department", 1706 | "courts", 1707 | "ordered", 1708 | "judges", 1709 | "lawsuit", 1710 | "charged", 1711 | "suit", 1712 | "prosecutor", 1713 | "inquiry", 1714 | "convicted", 1715 | "defense", 1716 | "testimony", 1717 | "office", 1718 | "amendment", 1719 | "prosecution", 1720 | "district", 1721 | "proceedings", 1722 | "prison", 1723 | "charge", 1724 | "conviction", 1725 | "guilty", 1726 | "defendants", 1727 | "argued", 1728 | "hearings", 1729 | "civil", 1730 | "rights" 1731 | ], 1732 | [ 1733 | "million", 1734 | "money", 1735 | "budget", 1736 | "tax", 1737 | "year", 1738 | "billion", 1739 | "pay", 1740 | "spending", 1741 | "cost", 1742 | "costs", 1743 | "cut", 1744 | "bank", 1745 | "cuts", 1746 | "income", 1747 | "plan", 1748 | "fund", 1749 | "raise", 1750 | "financial", 1751 | "taxes", 1752 | "insurance", 1753 | "banks", 1754 | "funds", 1755 | "paid", 1756 
| "deficit", 1757 | "dollars", 1758 | "aid", 1759 | "federal", 1760 | "financing", 1761 | "cash", 1762 | "increases", 1763 | "finance", 1764 | "raising", 1765 | "total", 1766 | "credit", 1767 | "new", 1768 | "reduce", 1769 | "proposed", 1770 | "savings", 1771 | "medicare", 1772 | "package", 1773 | "loans", 1774 | "paying", 1775 | "fiscal", 1776 | "capital", 1777 | "limits", 1778 | "house", 1779 | "payments", 1780 | "month", 1781 | "week", 1782 | "fees" 1783 | ], 1784 | [ 1785 | "net", 1786 | "share", 1787 | "inc", 1788 | "earns", 1789 | "company", 1790 | "reports", 1791 | "loss", 1792 | "lead", 1793 | "sales", 1794 | "qtr", 1795 | "quarter", 1796 | "shares", 1797 | "revenue", 1798 | "year", 1799 | "outst", 1800 | "million", 1801 | "included", 1802 | "rev", 1803 | "march", 1804 | "june", 1805 | "nyse", 1806 | "cents", 1807 | "otc", 1808 | "6mo", 1809 | "income", 1810 | "9mo", 1811 | "gain", 1812 | "results", 1813 | "dec", 1814 | "months", 1815 | "operations", 1816 | "extraordinary", 1817 | "charge", 1818 | "latest", 1819 | "sept", 1820 | "earnings", 1821 | "tax", 1822 | "ago", 1823 | "corp", 1824 | "discontinued", 1825 | "amex", 1826 | "accounting", 1827 | "credit", 1828 | "respectively", 1829 | "sale", 1830 | "amp", 1831 | "period", 1832 | "losses", 1833 | "restructuring", 1834 | "april" 1835 | ], 1836 | [ 1837 | "editor", 1838 | "book", 1839 | "wrote", 1840 | "published", 1841 | "magazine", 1842 | "books", 1843 | "author", 1844 | "written", 1845 | "paper", 1846 | "professor", 1847 | "read", 1848 | "writing", 1849 | "original", 1850 | "english", 1851 | "writer", 1852 | "pages", 1853 | "amp", 1854 | "language", 1855 | "version", 1856 | "life", 1857 | "letters", 1858 | "reading", 1859 | "review", 1860 | "ideas", 1861 | "found", 1862 | "science", 1863 | "write", 1864 | "letter", 1865 | "illustrated", 1866 | "known", 1867 | "called", 1868 | "novel", 1869 | "model", 1870 | "notes", 1871 | "writes", 1872 | "note", 1873 | "new_york", 1874 | "editorial", 1875 | "interest", 1876 | "account", 1877 | "publisher", 1878 | "readers", 1879 | "new_york_times", 1880 | "cast", 1881 | "director", 1882 | "sir", 1883 | "idea", 1884 | "writers", 1885 | "prize", 1886 | "critic" 1887 | ], 1888 | [ 1889 | "children", 1890 | "women", 1891 | "young", 1892 | "woman", 1893 | "child", 1894 | "mother", 1895 | "old", 1896 | "love", 1897 | "parents", 1898 | "age", 1899 | "father", 1900 | "miss", 1901 | "boy", 1902 | "family", 1903 | "girl", 1904 | "baby", 1905 | "sex", 1906 | "husband", 1907 | "older", 1908 | "professional", 1909 | "ages", 1910 | "friends", 1911 | "kids", 1912 | "boys", 1913 | "younger", 1914 | "dog", 1915 | "fellow", 1916 | "sexual", 1917 | "girls", 1918 | "families", 1919 | "fun", 1920 | "live", 1921 | "social", 1922 | "adults", 1923 | "proud", 1924 | "female", 1925 | "teen", 1926 | "later", 1927 | "know", 1928 | "dogs", 1929 | "away", 1930 | "emotional", 1931 | "adult", 1932 | "home", 1933 | "parent", 1934 | "birth", 1935 | "childhood", 1936 | "mom", 1937 | "abuse", 1938 | "dad" 1939 | ], 1940 | [ 1941 | "american", 1942 | "united_states", 1943 | "world", 1944 | "country", 1945 | "international", 1946 | "foreign", 1947 | "americans", 1948 | "countries", 1949 | "china", 1950 | "french", 1951 | "japan", 1952 | "europe", 1953 | "trade", 1954 | "economic", 1955 | "london", 1956 | "british", 1957 | "european", 1958 | "war", 1959 | "japanese", 1960 | "washington", 1961 | "france", 1962 | "german", 1963 | "germany", 1964 | "nation", 1965 | "national", 1966 | "britain", 1967 | "chinese", 1968 | "paris", 1969 | 
"america", 1970 | "italy", 1971 | "mexico", 1972 | "canada", 1973 | "global", 1974 | "england", 1975 | "italian", 1976 | "western", 1977 | "immigrants", 1978 | "domestic", 1979 | "south", 1980 | "african", 1981 | "nations", 1982 | "brazil", 1983 | "spain", 1984 | "region", 1985 | "central", 1986 | "asian", 1987 | "india", 1988 | "spanish", 1989 | "cultural", 1990 | "hong_kong" 1991 | ], 1992 | [ 1993 | "president", 1994 | "general", 1995 | "chief", 1996 | "secretary", 1997 | "office", 1998 | "news", 1999 | "support", 2000 | "vice", 2001 | "chairman", 2002 | "staff", 2003 | "executive", 2004 | "senior", 2005 | "minister", 2006 | "prime", 2007 | "press", 2008 | "conference", 2009 | "deputy", 2010 | "post", 2011 | "reporters", 2012 | "newspaper", 2013 | "appointed", 2014 | "media", 2015 | "independent", 2016 | "colleagues", 2017 | "reporter", 2018 | "relations", 2019 | "resigned", 2020 | "adviser", 2021 | "criticism", 2022 | "interview", 2023 | "newspapers", 2024 | "influence", 2025 | "leadership", 2026 | "political", 2027 | "appointment", 2028 | "resignation", 2029 | "affairs", 2030 | "ambassador", 2031 | "dean", 2032 | "force", 2033 | "articles", 2034 | "counsel", 2035 | "publicly", 2036 | "cabinet", 2037 | "corruption", 2038 | "boss", 2039 | "weeks", 2040 | "successor", 2041 | "dismissed", 2042 | "succeed" 2043 | ], 2044 | [ 2045 | "building", 2046 | "room", 2047 | "old", 2048 | "year", 2049 | "market", 2050 | "weeks", 2051 | "taxes", 2052 | "bedroom", 2053 | "listed", 2054 | "house", 2055 | "space", 2056 | "square", 2057 | "broker", 2058 | "area", 2059 | "floors", 2060 | "estate", 2061 | "kitchen", 2062 | "lot", 2063 | "buildings", 2064 | "million", 2065 | "floor", 2066 | "bath", 2067 | "foot", 2068 | "apartment", 2069 | "houses", 2070 | "property", 2071 | "street", 2072 | "office", 2073 | "number", 2074 | "real", 2075 | "feet", 2076 | "basement", 2077 | "acre", 2078 | "car", 2079 | "rent", 2080 | "neighborhood", 2081 | "units", 2082 | "garage", 2083 | "city", 2084 | "maintenance", 2085 | "dining", 2086 | "apartments", 2087 | "fireplace", 2088 | "walls", 2089 | "project", 2090 | "properties", 2091 | "roof", 2092 | "windows", 2093 | "brick", 2094 | "construction" 2095 | ], 2096 | [ 2097 | "world", 2098 | "work", 2099 | "man", 2100 | "way", 2101 | "history", 2102 | "sense", 2103 | "men", 2104 | "life", 2105 | "time", 2106 | "play", 2107 | "people", 2108 | "series", 2109 | "best", 2110 | "mind", 2111 | "america", 2112 | "different", 2113 | "self", 2114 | "moment", 2115 | "kind", 2116 | "voice", 2117 | "role", 2118 | "course", 2119 | "audience", 2120 | "things", 2121 | "feeling", 2122 | "society", 2123 | "stars", 2124 | "good", 2125 | "light", 2126 | "experience", 2127 | "sound", 2128 | "young", 2129 | "culture", 2130 | "love", 2131 | "feel", 2132 | "god", 2133 | "rest", 2134 | "shows", 2135 | "live", 2136 | "male", 2137 | "body", 2138 | "finally", 2139 | "middle", 2140 | "nature", 2141 | "image", 2142 | "tradition", 2143 | "times", 2144 | "behavior", 2145 | "respect", 2146 | "act" 2147 | ], 2148 | [ 2149 | "film", 2150 | "movie", 2151 | "story", 2152 | "films", 2153 | "directed", 2154 | "movies", 2155 | "star", 2156 | "character", 2157 | "characters", 2158 | "drama", 2159 | "comedy", 2160 | "actor", 2161 | "tale", 2162 | "starring", 2163 | "cinema", 2164 | "actors", 2165 | "plays", 2166 | "hollywood", 2167 | "documentary", 2168 | "actress", 2169 | "romance", 2170 | "stories", 2171 | "scenes", 2172 | "novel", 2173 | "audiences", 2174 | "plot", 2175 | "adaptation", 2176 | "love", 2177 | 
"screen", 2178 | "novels", 2179 | "comic", 2180 | "feature", 2181 | "playwright", 2182 | "loves", 2183 | "stars", 2184 | "fantasy", 2185 | "funniest", 2186 | "world_", 2187 | "famous", 2188 | "memoir", 2189 | "onscreen", 2190 | "filmmaker", 2191 | "book", 2192 | "beautiful", 2193 | "screenplay", 2194 | "thriller", 2195 | "tales", 2196 | "miramax", 2197 | "writer", 2198 | "favorite" 2199 | ], 2200 | [ 2201 | "company", 2202 | "business", 2203 | "companies", 2204 | "million", 2205 | "amp", 2206 | "industry", 2207 | "executive", 2208 | "chief", 2209 | "executives", 2210 | "sell", 2211 | "largest", 2212 | "stock", 2213 | "investment", 2214 | "based", 2215 | "chairman", 2216 | "deal", 2217 | "advertising", 2218 | "sales", 2219 | "billion", 2220 | "firm", 2221 | "offer", 2222 | "unit", 2223 | "marketing", 2224 | "management", 2225 | "shares", 2226 | "corporate", 2227 | "operating", 2228 | "market", 2229 | "products", 2230 | "financial", 2231 | "sold", 2232 | "analyst", 2233 | "operations", 2234 | "buy", 2235 | "yesterday", 2236 | "analysts", 2237 | "businesses", 2238 | "investors", 2239 | "buying", 2240 | "selling", 2241 | "bid", 2242 | "venture", 2243 | "services", 2244 | "owned", 2245 | "subsidiary", 2246 | "division", 2247 | "merger", 2248 | "maker", 2249 | "customers", 2250 | "product" 2251 | ], 2252 | [ 2253 | "like", 2254 | "making", 2255 | "important", 2256 | "based", 2257 | "strong", 2258 | "including", 2259 | "recent", 2260 | "high", 2261 | "remains", 2262 | "short", 2263 | "popular", 2264 | "example", 2265 | "largely", 2266 | "includes", 2267 | "certain", 2268 | "success", 2269 | "events", 2270 | "order", 2271 | "consider", 2272 | "create", 2273 | "creating", 2274 | "unusual", 2275 | "standards", 2276 | "figures", 2277 | "latest", 2278 | "role", 2279 | "standard", 2280 | "include", 2281 | "addition", 2282 | "similar", 2283 | "key", 2284 | "expensive", 2285 | "present", 2286 | "successful", 2287 | "increasingly", 2288 | "fact", 2289 | "hopes", 2290 | "produce", 2291 | "american", 2292 | "somewhat", 2293 | "makes", 2294 | "interesting", 2295 | "basic", 2296 | "holds", 2297 | "indian", 2298 | "current", 2299 | "expect", 2300 | "traditional", 2301 | "despite", 2302 | "continues" 2303 | ], 2304 | [ 2305 | "think", 2306 | "people", 2307 | "want", 2308 | "going", 2309 | "says", 2310 | "know", 2311 | "good", 2312 | "way", 2313 | "right", 2314 | "time", 2315 | "job", 2316 | "lot", 2317 | "things", 2318 | "better", 2319 | "thing", 2320 | "getting", 2321 | "got", 2322 | "big", 2323 | "talk", 2324 | "help", 2325 | "bad", 2326 | "trying", 2327 | "come", 2328 | "need", 2329 | "problem", 2330 | "feel", 2331 | "having", 2332 | "try", 2333 | "wants", 2334 | "ask", 2335 | "work", 2336 | "idea", 2337 | "chance", 2338 | "kind", 2339 | "happen", 2340 | "question", 2341 | "look", 2342 | "reason", 2343 | "opportunity", 2344 | "sure", 2345 | "guy", 2346 | "point", 2347 | "wrong", 2348 | "deal", 2349 | "leave", 2350 | "guys", 2351 | "thinking", 2352 | "pick", 2353 | "money", 2354 | "tell" 2355 | ], 2356 | [ 2357 | "power", 2358 | "number", 2359 | "control", 2360 | "according", 2361 | "increase", 2362 | "large", 2363 | "likely", 2364 | "result", 2365 | "system", 2366 | "use", 2367 | "pressure", 2368 | "growing", 2369 | "recent", 2370 | "experts", 2371 | "safety", 2372 | "far", 2373 | "nearly", 2374 | "highly", 2375 | "force", 2376 | "hundreds", 2377 | "ability", 2378 | "people", 2379 | "particularly", 2380 | "change", 2381 | "possible", 2382 | "problems", 2383 | "greater", 2384 | "challenge", 2385 | "despite", 
2386 | "given", 2387 | "effort", 2388 | "limited", 2389 | "improve", 2390 | "higher", 2391 | "significant", 2392 | "increased", 2393 | "largest", 2394 | "increasing", 2395 | "status", 2396 | "important", 2397 | "relatively", 2398 | "critical", 2399 | "risk", 2400 | "thousands", 2401 | "material", 2402 | "level", 2403 | "found", 2404 | "changes", 2405 | "efforts", 2406 | "begun" 2407 | ], 2408 | [ 2409 | "came", 2410 | "took", 2411 | "went", 2412 | "told", 2413 | "asked", 2414 | "got", 2415 | "started", 2416 | "thought", 2417 | "felt", 2418 | "wanted", 2419 | "left", 2420 | "little", 2421 | "began", 2422 | "knew", 2423 | "long", 2424 | "turned", 2425 | "tried", 2426 | "found", 2427 | "know", 2428 | "saw", 2429 | "called", 2430 | "come", 2431 | "worked", 2432 | "heard", 2433 | "kept", 2434 | "gave", 2435 | "met", 2436 | "going", 2437 | "learned", 2438 | "looked", 2439 | "hard", 2440 | "spent", 2441 | "happened", 2442 | "decided", 2443 | "hit", 2444 | "seen", 2445 | "brought", 2446 | "moved", 2447 | "soon", 2448 | "working", 2449 | "personal", 2450 | "right", 2451 | "stopped", 2452 | "sat", 2453 | "old", 2454 | "ago", 2455 | "returned", 2456 | "sitting", 2457 | "gone", 2458 | "recalled" 2459 | ], 2460 | [ 2461 | "miles", 2462 | "air", 2463 | "airport", 2464 | "north", 2465 | "traffic", 2466 | "travel", 2467 | "road", 2468 | "trip", 2469 | "flight", 2470 | "mile", 2471 | "plane", 2472 | "bus", 2473 | "boat", 2474 | "train", 2475 | "village", 2476 | "fly", 2477 | "nearby", 2478 | "passengers", 2479 | "island", 2480 | "near", 2481 | "beach", 2482 | "passenger", 2483 | "fare", 2484 | "south", 2485 | "station", 2486 | "travelers", 2487 | "land", 2488 | "highway", 2489 | "buses", 2490 | "flights", 2491 | "parking", 2492 | "airline", 2493 | "coast", 2494 | "roads", 2495 | "trips", 2496 | "crew", 2497 | "eastern", 2498 | "ski", 2499 | "jet", 2500 | "routes", 2501 | "crash", 2502 | "northern", 2503 | "ticket", 2504 | "lake", 2505 | "beaches", 2506 | "trains", 2507 | "sea", 2508 | "navy", 2509 | "resort", 2510 | "shore" 2511 | ], 2512 | [ 2513 | "state", 2514 | "officials", 2515 | "report", 2516 | "california", 2517 | "evidence", 2518 | "policy", 2519 | "states", 2520 | "use", 2521 | "cases", 2522 | "found", 2523 | "problem", 2524 | "action", 2525 | "issue", 2526 | "failed", 2527 | "illegal", 2528 | "required", 2529 | "require", 2530 | "involved", 2531 | "review", 2532 | "effort", 2533 | "question", 2534 | "records", 2535 | "agents", 2536 | "case", 2537 | "problems", 2538 | "matter", 2539 | "prevent", 2540 | "study", 2541 | "enforcement", 2542 | "seeking", 2543 | "process", 2544 | "calls", 2545 | "penalty", 2546 | "appeal", 2547 | "interest", 2548 | "ban", 2549 | "complaints", 2550 | "survey", 2551 | "questions", 2552 | "opposed", 2553 | "documents", 2554 | "federal", 2555 | "search", 2556 | "test", 2557 | "critics", 2558 | "determine", 2559 | "response", 2560 | "avoid", 2561 | "delay", 2562 | "measure" 2563 | ], 2564 | [ 2565 | "father", 2566 | "late", 2567 | "wife", 2568 | "mother", 2569 | "daughter", 2570 | "husband", 2571 | "beloved", 2572 | "son", 2573 | "loving", 2574 | "amp", 2575 | "devoted", 2576 | "graduated", 2577 | "sister", 2578 | "family", 2579 | "married", 2580 | "survived", 2581 | "memorial", 2582 | "funeral", 2583 | "brother", 2584 | "law", 2585 | "passing", 2586 | "services", 2587 | "grandchildren", 2588 | "grandmother", 2589 | "service", 2590 | "grandfather", 2591 | "memory", 2592 | "ceremony", 2593 | "contributions", 2594 | "january", 2595 | "september", 2596 | "august", 2597 | 
"cherished", 2598 | "dear", 2599 | "february", 2600 | "degree", 2601 | "flowers", 2602 | "lieu", 2603 | "condolences", 2604 | "bridegroom", 2605 | "bride", 2606 | "michael", 2607 | "new_york", 2608 | "sympathy", 2609 | "died", 2610 | "wednesday", 2611 | "donations", 2612 | "monday", 2613 | "friday", 2614 | "performed" 2615 | ] 2616 | ], 2617 | "c_npmi_10_full_all": [ 2618 | 0.14624615320916104, 2619 | 0.1842349463761919, 2620 | 0.09397097956875038, 2621 | 0.02224374883885288, 2622 | 0.14196491443014664, 2623 | 0.2370141384022756, 2624 | 0.03293073853312596, 2625 | 0.1962455412333827, 2626 | 0.1915341149418706, 2627 | 0.11818618548977458, 2628 | 0.118088046901936, 2629 | 0.014622927252097677, 2630 | 0.060262066918601497, 2631 | 0.19586991688606797, 2632 | 0.21550434513063996, 2633 | 0.1304279440162075, 2634 | 0.067141489159369, 2635 | 0.005676347127949025, 2636 | 0.0399443341151241, 2637 | 0.027394469703840317, 2638 | 0.035940728231458, 2639 | 0.13207218718391886, 2640 | 0.1771557992145982, 2641 | 0.0028646926979995825, 2642 | 0.1583017349711412, 2643 | 0.035186989711695024, 2644 | 0.126649067889828, 2645 | 0.03554250799327234, 2646 | 0.24728769165095757, 2647 | 0.007093446130125744, 2648 | 0.015344450189039873, 2649 | 0.10515918827441718, 2650 | 0.1882183122717962, 2651 | 0.14759851831626747, 2652 | 0.5031460615129256, 2653 | 0.1563020964426604, 2654 | 0.1088203777913382, 2655 | 0.08715242376128987, 2656 | 0.10371770006655577, 2657 | 0.1560688706902293, 2658 | 0.01920638899674475, 2659 | 0.15620943194110215, 2660 | 0.10602635865005294, 2661 | -0.009920778876769423, 2662 | 0.08536125795429042, 2663 | 0.032156055229606545, 2664 | 0.07240444289452569, 2665 | 0.1347891185211831, 2666 | 0.04646465360805376, 2667 | 0.2776205766334038 2668 | ], 2669 | "path": "outputs/full-mindf_power_law-maxdf_0.9/nytimes/k-50/etm/lr_0.02-reg_1.2e-06-epochs_1000-anneal_lr_0/11235" 2670 | }, 2671 | "wikitext": { 2672 | "c_npmi_10_full": 0.11328744378992832, 2673 | "c_npmi_10_full_sd": 0.06794966942260233, 2674 | "tu": 0.94, 2675 | "to": 0.030612244897959183, 2676 | "overlaps": 0, 2677 | "anneal_lr": 0, 2678 | "data_path": "/workspace/topic-preprocessing/data/wikitext/processed/full-mindf_power_law-maxdf_0.9/etm", 2679 | "epochs": 1000, 2680 | "lr": 0.001, 2681 | "seed": 42, 2682 | "wdecay": 1.2e-05, 2683 | "input_dir": "wikitext", 2684 | "topics": [ 2685 | [ 2686 | "new", 2687 | "use", 2688 | "development", 2689 | "world", 2690 | "design", 2691 | "created", 2692 | "system", 2693 | "developed", 2694 | "power", 2695 | "based", 2696 | "designed", 2697 | "produced", 2698 | "original", 2699 | "production", 2700 | "available", 2701 | "different", 2702 | "create", 2703 | "version", 2704 | "additional", 2705 | "single", 2706 | "effects", 2707 | "model", 2708 | "main", 2709 | "introduced", 2710 | "including", 2711 | "included", 2712 | "quality", 2713 | "special", 2714 | "test", 2715 | "changes", 2716 | "space", 2717 | "include", 2718 | "standard", 2719 | "project", 2720 | "added", 2721 | "energy", 2722 | "elements", 2723 | "concept", 2724 | "intended", 2725 | "value", 2726 | "uses", 2727 | "instead", 2728 | "technology", 2729 | "provided", 2730 | "produce", 2731 | "type", 2732 | "creating", 2733 | "introduction", 2734 | "work", 2735 | "complete" 2736 | ], 2737 | [ 2738 | "american", 2739 | "united_states", 2740 | "new_york", 2741 | "washington", 2742 | "california", 2743 | "texas", 2744 | "americans", 2745 | "virginia", 2746 | "chicago", 2747 | "boston", 2748 | "canadian", 2749 | "smith", 2750 | "canada", 2751 | 
"florida", 2752 | "new_york_city", 2753 | "michigan", 2754 | "north_carolina", 2755 | "ohio", 2756 | "johnson", 2757 | "los_angeles", 2758 | "kentucky", 2759 | "grant", 2760 | "philadelphia", 2761 | "america", 2762 | "davis", 2763 | "maryland", 2764 | "massachusetts", 2765 | "illinois", 2766 | "indiana", 2767 | "pennsylvania", 2768 | "new_jersey", 2769 | "new_orleans", 2770 | "south_carolina", 2771 | "mexican", 2772 | "toronto", 2773 | "african", 2774 | "houston", 2775 | "tennessee", 2776 | "minnesota", 2777 | "san_francisco", 2778 | "taylor", 2779 | "lee", 2780 | "cleveland", 2781 | "wisconsin", 2782 | "georgia", 2783 | "atlanta", 2784 | "missouri", 2785 | "adams", 2786 | "colorado", 2787 | "wilson" 2788 | ], 2789 | [ 2790 | "work", 2791 | "published", 2792 | "book", 2793 | "years", 2794 | "described", 2795 | "time", 2796 | "group", 2797 | "style", 2798 | "early", 2799 | "people", 2800 | "new", 2801 | "works", 2802 | "included", 2803 | "year", 2804 | "english", 2805 | "including", 2806 | "noted", 2807 | "history", 2808 | "period", 2809 | "popular", 2810 | "wrote", 2811 | "late", 2812 | "american", 2813 | "written", 2814 | "old", 2815 | "movement", 2816 | "notes", 2817 | "different", 2818 | "edition", 2819 | "list", 2820 | "social", 2821 | "original", 2822 | "culture", 2823 | "traditional", 2824 | "based", 2825 | "books", 2826 | "influenced", 2827 | "collection", 2828 | "material", 2829 | "1970s", 2830 | "recent", 2831 | "composed", 2832 | "influence", 2833 | "1980s", 2834 | "issue", 2835 | "young", 2836 | "1960s", 2837 | "middle", 2838 | "parts", 2839 | "language" 2840 | ], 2841 | [ 2842 | "australia", 2843 | "world", 2844 | "country", 2845 | "international", 2846 | "united_states", 2847 | "australian", 2848 | "india", 2849 | "countries", 2850 | "united_kingdom", 2851 | "japan", 2852 | "indian", 2853 | "europe", 2854 | "european", 2855 | "china", 2856 | "new_zealand", 2857 | "canada", 2858 | "spain", 2859 | "foreign", 2860 | "average", 2861 | "worldwide", 2862 | "highest", 2863 | "total", 2864 | "africa", 2865 | "singapore", 2866 | "american", 2867 | "sweden", 2868 | "domestic", 2869 | "sydney", 2870 | "france", 2871 | "south_africa", 2872 | "brazil", 2873 | "dutch", 2874 | "argentina", 2875 | "vietnam", 2876 | "norway", 2877 | "britain", 2878 | "peak", 2879 | "asian", 2880 | "philippines", 2881 | "melbourne", 2882 | "malaysia", 2883 | "russia", 2884 | "belgium", 2885 | "netherlands", 2886 | "nations", 2887 | "germany", 2888 | "colonial", 2889 | "indonesian", 2890 | "ireland", 2891 | "pakistan" 2892 | ], 2893 | [ 2894 | "war", 2895 | "april", 2896 | "march", 2897 | "september", 2898 | "june", 2899 | "december", 2900 | "august", 2901 | "october", 2902 | "january", 2903 | "july", 2904 | "began", 2905 | "november", 2906 | "later", 2907 | "following", 2908 | "french", 2909 | "british", 2910 | "february", 2911 | "year", 2912 | "new", 2913 | "including", 2914 | "early", 2915 | "german", 2916 | "members", 2917 | "world", 2918 | "years", 2919 | "training", 2920 | "number", 2921 | "late", 2922 | "joined", 2923 | "days", 2924 | "group", 2925 | "included", 2926 | "months", 2927 | "weeks", 2928 | "continued", 2929 | "replaced", 2930 | "served", 2931 | "france", 2932 | "support", 2933 | "returned", 2934 | "spanish", 2935 | "major", 2936 | "received", 2937 | "end", 2938 | "led", 2939 | "anti", 2940 | "month", 2941 | "service", 2942 | "remained", 2943 | "staff" 2944 | ], 2945 | [ 2946 | "race", 2947 | "horses", 2948 | "horse", 2949 | "oxford", 2950 | "cambridge", 2951 | "canal", 2952 | "dog", 2953 | 
"estate", 2954 | "boat", 2955 | "colony", 2956 | "trade", 2957 | "manchester", 2958 | "breed", 2959 | "races", 2960 | "racing", 2961 | "parish", 2962 | "riders", 2963 | "slaves", 2964 | "slave", 2965 | "bristol", 2966 | "goods", 2967 | "crew", 2968 | "cotton", 2969 | "dogs", 2970 | "trading", 2971 | "riding", 2972 | "pounds", 2973 | "mill", 2974 | "breeding", 2975 | "rowing", 2976 | "blues", 2977 | "navigation", 2978 | "lancashire", 2979 | "breeds", 2980 | "cattle", 2981 | "lengths", 2982 | "farm", 2983 | "liverpool", 2984 | "hunting", 2985 | "rider", 2986 | "boats", 2987 | "bath", 2988 | "draft", 2989 | "mills", 2990 | "oldham", 2991 | "mare", 2992 | "labour", 2993 | "merchant", 2994 | "bridgwater", 2995 | "wool" 2996 | ], 2997 | [ 2998 | "formula", 2999 | "chemical", 3000 | "nuclear", 3001 | "applications", 3002 | "element", 3003 | "hydrogen", 3004 | "atomic", 3005 | "uranium", 3006 | "oxygen", 3007 | "gas", 3008 | "acid", 3009 | "carbon", 3010 | "chemistry", 3011 | "stable", 3012 | "electrical", 3013 | "radioactive", 3014 | "atom", 3015 | "properties", 3016 | "plutonium", 3017 | "method", 3018 | "forms", 3019 | "solutions", 3020 | "calibration", 3021 | "energy", 3022 | "particles", 3023 | "thermal", 3024 | "metal", 3025 | "nitrogen", 3026 | "liquid", 3027 | "vapor", 3028 | "chain", 3029 | "physics", 3030 | "compounds", 3031 | "type", 3032 | "atoms", 3033 | "solution", 3034 | "electron", 3035 | "linear", 3036 | "ions", 3037 | "helium", 3038 | "equation", 3039 | "beta", 3040 | "particle", 3041 | "organic", 3042 | "components", 3043 | "lithium", 3044 | "synthesis", 3045 | "structure", 3046 | "metals", 3047 | "protons" 3048 | ], 3049 | [ 3050 | "stated", 3051 | "considered", 3052 | "went", 3053 | "chief", 3054 | "said", 3055 | "took", 3056 | "worked", 3057 | "called", 3058 | "decided", 3059 | "continued", 3060 | "met", 3061 | "moved", 3062 | "claimed", 3063 | "held", 3064 | "asked", 3065 | "gave", 3066 | "agreed", 3067 | "according", 3068 | "head", 3069 | "believed", 3070 | "received", 3071 | "told", 3072 | "returned", 3073 | "appointed", 3074 | "passed", 3075 | "refused", 3076 | "came", 3077 | "leader", 3078 | "director", 3079 | "ran", 3080 | "accepted", 3081 | "placed", 3082 | "turned", 3083 | "known", 3084 | "brought", 3085 | "issued", 3086 | "meeting", 3087 | "presented", 3088 | "remained", 3089 | "working", 3090 | "assistant", 3091 | "offered", 3092 | "reported", 3093 | "concluded", 3094 | "declared", 3095 | "criticized", 3096 | "announced", 3097 | "referred", 3098 | "spent", 3099 | "noted" 3100 | ], 3101 | [ 3102 | "known", 3103 | "large", 3104 | "century", 3105 | "found", 3106 | "small", 3107 | "number", 3108 | "called", 3109 | "high", 3110 | "including", 3111 | "long", 3112 | "include", 3113 | "form", 3114 | "common", 3115 | "similar", 3116 | "like", 3117 | "largest", 3118 | "lower", 3119 | "considered", 3120 | "low", 3121 | "usually", 3122 | "larger", 3123 | "size", 3124 | "modern", 3125 | "range", 3126 | "based", 3127 | "associated", 3128 | "estimated", 3129 | "higher", 3130 | "important", 3131 | "significant", 3132 | "according", 3133 | "smaller", 3134 | "increased", 3135 | "central", 3136 | "generally", 3137 | "likely", 3138 | "upper", 3139 | "level", 3140 | "rate", 3141 | "named", 3142 | "study", 3143 | "relatively", 3144 | "19th", 3145 | "greater", 3146 | "history", 3147 | "growth", 3148 | "reported", 3149 | "names", 3150 | "suggested", 3151 | "today" 3152 | ], 3153 | [ 3154 | "great", 3155 | "early", 3156 | "major", 3157 | "success", 3158 | "successful", 3159 | "hand", 3160 | 
"best", 3161 | "general", 3162 | "strong", 3163 | "earlier", 3164 | "particularly", 3165 | "key", 3166 | "important", 3167 | "difficult", 3168 | "highly", 3169 | "head", 3170 | "despite", 3171 | "largely", 3172 | "better", 3173 | "similar", 3174 | "minor", 3175 | "numerous", 3176 | "entire", 3177 | "poor", 3178 | "popular", 3179 | "especially", 3180 | "initial", 3181 | "powerful", 3182 | "complete", 3183 | "generally", 3184 | "influence", 3185 | "famous", 3186 | "greatest", 3187 | "simply", 3188 | "heavily", 3189 | "significant", 3190 | "popularity", 3191 | "notable", 3192 | "increasingly", 3193 | "immediately", 3194 | "finally", 3195 | "extremely", 3196 | "hard", 3197 | "attention", 3198 | "entirely", 3199 | "completely", 3200 | "double", 3201 | "considerable", 3202 | "easily", 3203 | "prominent" 3204 | ], 3205 | [ 3206 | "relationship", 3207 | "tells", 3208 | "begins", 3209 | "leaves", 3210 | "goes", 3211 | "tries", 3212 | "finds", 3213 | "jack", 3214 | "takes", 3215 | "meets", 3216 | "gets", 3217 | "wants", 3218 | "charlie", 3219 | "tell", 3220 | "storyline", 3221 | "young", 3222 | "restaurant", 3223 | "learns", 3224 | "asks", 3225 | "tony", 3226 | "paul", 3227 | "makes", 3228 | "breaks", 3229 | "kills", 3230 | "comes", 3231 | "arrives", 3232 | "chris", 3233 | "blood", 3234 | "dies", 3235 | "meat", 3236 | "starts", 3237 | "falls", 3238 | "believes", 3239 | "confesses", 3240 | "soap", 3241 | "killer", 3242 | "continues", 3243 | "donna", 3244 | "girlfriend", 3245 | "eve", 3246 | "returns", 3247 | "friendship", 3248 | "brings", 3249 | "adam", 3250 | "ben", 3251 | "feels", 3252 | "gives", 3253 | "helps", 3254 | "ends", 3255 | "discovers" 3256 | ], 3257 | [ 3258 | "music", 3259 | "musical", 3260 | "stage", 3261 | "piano", 3262 | "performed", 3263 | "played", 3264 | "playing", 3265 | "play", 3266 | "performance", 3267 | "opera", 3268 | "harrison", 3269 | "concert", 3270 | "theatre", 3271 | "composer", 3272 | "recordings", 3273 | "piece", 3274 | "concerts", 3275 | "performances", 3276 | "solo", 3277 | "orchestra", 3278 | "performing", 3279 | "instruments", 3280 | "session", 3281 | "sang", 3282 | "musicians", 3283 | "instrumental", 3284 | "sessions", 3285 | "version", 3286 | "singing", 3287 | "recording", 3288 | "blues", 3289 | "bach", 3290 | "mccartney", 3291 | "choir", 3292 | "beatles", 3293 | "melody", 3294 | "folk", 3295 | "organ", 3296 | "backing", 3297 | "versions", 3298 | "compositions", 3299 | "touring", 3300 | "sung", 3301 | "jazz", 3302 | "symphony", 3303 | "dylan", 3304 | "audience", 3305 | "sullivan", 3306 | "orchestral", 3307 | "guitar" 3308 | ], 3309 | [ 3310 | "song", 3311 | "album", 3312 | "band", 3313 | "music", 3314 | "songs", 3315 | "number", 3316 | "single", 3317 | "released", 3318 | "chart", 3319 | "video", 3320 | "track", 3321 | "recording", 3322 | "rock", 3323 | "release", 3324 | "studio", 3325 | "tour", 3326 | "performed", 3327 | "albums", 3328 | "vocals", 3329 | "recorded", 3330 | "sound", 3331 | "lyrics", 3332 | "singles", 3333 | "singer", 3334 | "tracks", 3335 | "dance", 3336 | "record", 3337 | "copies", 3338 | "billboard", 3339 | "madonna", 3340 | "charts", 3341 | "live", 3342 | "pop", 3343 | "hot", 3344 | "debuted", 3345 | "performance", 3346 | "produced", 3347 | "vocal", 3348 | "week", 3349 | "guitar", 3350 | "best", 3351 | "background", 3352 | "peaked", 3353 | "artist", 3354 | "carey", 3355 | "digital", 3356 | "featured", 3357 | "platinum", 3358 | "reached", 3359 | "ballad" 3360 | ], 3361 | [ 3362 | "film", 3363 | "series", 3364 | "character", 3365 | "season", 
3366 | "production", 3367 | "role", 3368 | "characters", 3369 | "best", 3370 | "scene", 3371 | "released", 3372 | "cast", 3373 | "films", 3374 | "director", 3375 | "release", 3376 | "directed", 3377 | "scenes", 3378 | "star", 3379 | "played", 3380 | "script", 3381 | "version", 3382 | "original", 3383 | "filming", 3384 | "story", 3385 | "received", 3386 | "actor", 3387 | "producer", 3388 | "office", 3389 | "man", 3390 | "box", 3391 | "plot", 3392 | "featured", 3393 | "set", 3394 | "performance", 3395 | "actors", 3396 | "stars", 3397 | "movie", 3398 | "appeared", 3399 | "fans", 3400 | "originally", 3401 | "filmed", 3402 | "appearance", 3403 | "later", 3404 | "crew", 3405 | "title", 3406 | "week", 3407 | "voice", 3408 | "final", 3409 | "roles", 3410 | "reviews", 3411 | "based" 3412 | ], 3413 | [ 3414 | "episode", 3415 | "episodes", 3416 | "television", 3417 | "viewers", 3418 | "aired", 3419 | "broadcast", 3420 | "fox", 3421 | "guest", 3422 | "watched", 3423 | "simpsons", 3424 | "ratings", 3425 | "rating", 3426 | "homer", 3427 | "michael", 3428 | "mulder", 3429 | "marge", 3430 | "nbc", 3431 | "scully", 3432 | "network", 3433 | "bart", 3434 | "writers", 3435 | "nielsen", 3436 | "lisa", 3437 | "creator", 3438 | "finale", 3439 | "peter", 3440 | "files", 3441 | "writer", 3442 | "glee", 3443 | "airing", 3444 | "plot", 3445 | "shows", 3446 | "watching", 3447 | "jim", 3448 | "rated", 3449 | "andy", 3450 | "x-files", 3451 | "viewing", 3452 | "dwight", 3453 | "demographic", 3454 | "households", 3455 | "comedy", 3456 | "tells", 3457 | "carter", 3458 | "brian", 3459 | "references", 3460 | "reality", 3461 | "viewed", 3462 | "gets", 3463 | "recurring" 3464 | ], 3465 | [ 3466 | "white", 3467 | "black", 3468 | "red", 3469 | "brown", 3470 | "blue", 3471 | "dark", 3472 | "long", 3473 | "metal", 3474 | "yellow", 3475 | "green", 3476 | "small", 3477 | "short", 3478 | "like", 3479 | "color", 3480 | "length", 3481 | "slightly", 3482 | "covered", 3483 | "shaped", 3484 | "deep", 3485 | "features", 3486 | "cap", 3487 | "light", 3488 | "wood", 3489 | "shape", 3490 | "appear", 3491 | "near", 3492 | "consists", 3493 | "bands", 3494 | "north_america", 3495 | "grey", 3496 | "thick", 3497 | "europe", 3498 | "surface", 3499 | "leaves", 3500 | "wide", 3501 | "single", 3502 | "contain", 3503 | "contains", 3504 | "fruit", 3505 | "edge", 3506 | "dry", 3507 | "orange", 3508 | "smooth", 3509 | "gray", 3510 | "thin", 3511 | "outer", 3512 | "colour", 3513 | "inner", 3514 | "narrow", 3515 | "sides" 3516 | ], 3517 | [ 3518 | "aircraft", 3519 | "air", 3520 | "flight", 3521 | "car", 3522 | "service", 3523 | "engine", 3524 | "flying", 3525 | "wing", 3526 | "train", 3527 | "cars", 3528 | "passenger", 3529 | "fighter", 3530 | "fuel", 3531 | "vehicles", 3532 | "bomb", 3533 | "operations", 3534 | "bomber", 3535 | "track", 3536 | "pilots", 3537 | "radar", 3538 | "trains", 3539 | "flights", 3540 | "operation", 3541 | "passengers", 3542 | "speed", 3543 | "units", 3544 | "carrier", 3545 | "operate", 3546 | "station", 3547 | "ride", 3548 | "carried", 3549 | "operating", 3550 | "unit", 3551 | "operational", 3552 | "capacity", 3553 | "launch", 3554 | "pilot", 3555 | "engines", 3556 | "services", 3557 | "range", 3558 | "aviation", 3559 | "weight", 3560 | "lift", 3561 | "planes", 3562 | "airport", 3563 | "capability", 3564 | "locomotives", 3565 | "dive", 3566 | "flown", 3567 | "operated" 3568 | ], 3569 | [ 3570 | "won", 3571 | "team", 3572 | "year", 3573 | "record", 3574 | "second", 3575 | "finished", 3576 | "final", 3577 | "points", 3578 | "round", 
3579 | "world", 3580 | "stage", 3581 | "winning", 3582 | "lead", 3583 | "ranked", 3584 | "coach", 3585 | "best", 3586 | "win", 3587 | "national", 3588 | "teams", 3589 | "named", 3590 | "led", 3591 | "event", 3592 | "place", 3593 | "career", 3594 | "seconds", 3595 | "history", 3596 | "tournament", 3597 | "competition", 3598 | "captain", 3599 | "injury", 3600 | "fourth", 3601 | "medal", 3602 | "overall", 3603 | "tour", 3604 | "summer", 3605 | "gold", 3606 | "olympics", 3607 | "sixth", 3608 | "earned", 3609 | "finish", 3610 | "sports", 3611 | "selected", 3612 | "winner", 3613 | "losing", 3614 | "championships", 3615 | "consecutive", 3616 | "debut", 3617 | "award", 3618 | "professional", 3619 | "events" 3620 | ], 3621 | [ 3622 | "match", 3623 | "title", 3624 | "defeated", 3625 | "championship", 3626 | "event", 3627 | "ring", 3628 | "team", 3629 | "won", 3630 | "win", 3631 | "raw", 3632 | "time", 3633 | "champion", 3634 | "wrestling", 3635 | "triple", 3636 | "defeating", 3637 | "tag", 3638 | "wwe", 3639 | "face", 3640 | "angle", 3641 | "titles", 3642 | "matches", 3643 | "table", 3644 | "division", 3645 | "edge", 3646 | "tna", 3647 | "lost", 3648 | "rivalry", 3649 | "impact", 3650 | "defeat", 3651 | "following", 3652 | "retain", 3653 | "faced", 3654 | "kane", 3655 | "feud", 3656 | "brand", 3657 | "bout", 3658 | "eliminated", 3659 | "promotion", 3660 | "defended", 3661 | "opponent", 3662 | "wrestler", 3663 | "wwf", 3664 | "hardy", 3665 | "world", 3666 | "went", 3667 | "month", 3668 | "smackdown", 3669 | "wrestlers", 3670 | "attacked", 3671 | "heavyweight" 3672 | ], 3673 | [ 3674 | "family", 3675 | "death", 3676 | "life", 3677 | "years", 3678 | "father", 3679 | "later", 3680 | "son", 3681 | "man", 3682 | "children", 3683 | "house", 3684 | "women", 3685 | "died", 3686 | "mother", 3687 | "year", 3688 | "wife", 3689 | "born", 3690 | "old", 3691 | "brother", 3692 | "married", 3693 | "daughter", 3694 | "child", 3695 | "woman", 3696 | "age", 3697 | "men", 3698 | "young", 3699 | "marriage", 3700 | "home", 3701 | "people", 3702 | "months", 3703 | "parents", 3704 | "friend", 3705 | "great", 3706 | "friends", 3707 | "husband", 3708 | "lived", 3709 | "sister", 3710 | "named", 3711 | "younger", 3712 | "birth", 3713 | "couple", 3714 | "hospital", 3715 | "relationship", 3716 | "brothers", 3717 | "care", 3718 | "member", 3719 | "sex", 3720 | "grand", 3721 | "living", 3722 | "leaving", 3723 | "help" 3724 | ], 3725 | [ 3726 | "million", 3727 | "company", 3728 | "sold", 3729 | "business", 3730 | "percent", 3731 | "commercial", 3732 | "announced", 3733 | "market", 3734 | "industry", 3735 | "cost", 3736 | "project", 3737 | "management", 3738 | "billion", 3739 | "sales", 3740 | "companies", 3741 | "selling", 3742 | "rights", 3743 | "price", 3744 | "costs", 3745 | "board", 3746 | "sell", 3747 | "employees", 3748 | "program", 3749 | "working", 3750 | "channel", 3751 | "plans", 3752 | "firm", 3753 | "purchase", 3754 | "purchased", 3755 | "store", 3756 | "engineer", 3757 | "sale", 3758 | "workers", 3759 | "equipment", 3760 | "financial", 3761 | "computer", 3762 | "owned", 3763 | "markets", 3764 | "advertising", 3765 | "radio", 3766 | "launched", 3767 | "potential", 3768 | "products", 3769 | "owners", 3770 | "code", 3771 | "demand", 3772 | "machine", 3773 | "income", 3774 | "mobile", 3775 | "filed" 3776 | ], 3777 | [ 3778 | "city", 3779 | "area", 3780 | "town", 3781 | "west", 3782 | "south", 3783 | "north", 3784 | "local", 3785 | "east", 3786 | "station", 3787 | "line", 3788 | "park", 3789 | "bridge", 3790 | "miles", 3791 
| "located", 3792 | "construction", 3793 | "new", 3794 | "opened", 3795 | "northern", 3796 | "street", 3797 | "built", 3798 | "areas", 3799 | "southern", 3800 | "major", 3801 | "road", 3802 | "centre", 3803 | "river", 3804 | "railway", 3805 | "main", 3806 | "western", 3807 | "state", 3808 | "center", 3809 | "village", 3810 | "near", 3811 | "land", 3812 | "hill", 3813 | "services", 3814 | "years", 3815 | "border", 3816 | "eastern", 3817 | "district", 3818 | "county", 3819 | "year", 3820 | "day", 3821 | "ground", 3822 | "site", 3823 | "nearby", 3824 | "system", 3825 | "old", 3826 | "airport", 3827 | "stations" 3828 | ], 3829 | [ 3830 | "killed", 3831 | "sent", 3832 | "attack", 3833 | "left", 3834 | "near", 3835 | "support", 3836 | "arrived", 3837 | "led", 3838 | "forced", 3839 | "soon", 3840 | "return", 3841 | "control", 3842 | "reached", 3843 | "able", 3844 | "caused", 3845 | "destroyed", 3846 | "plan", 3847 | "died", 3848 | "main", 3849 | "ground", 3850 | "established", 3851 | "remained", 3852 | "attacks", 3853 | "damage", 3854 | "attacked", 3855 | "supported", 3856 | "victory", 3857 | "attempted", 3858 | "suffered", 3859 | "returned", 3860 | "lost", 3861 | "attempt", 3862 | "landing", 3863 | "operation", 3864 | "brought", 3865 | "expedition", 3866 | "defeat", 3867 | "failed", 3868 | "immediately", 3869 | "captured", 3870 | "engaged", 3871 | "damaged", 3872 | "prepared", 3873 | "invasion", 3874 | "defeated", 3875 | "unable", 3876 | "prevent", 3877 | "arrival", 3878 | "shortly", 3879 | "capture" 3880 | ], 3881 | [ 3882 | "king", 3883 | "england", 3884 | "london", 3885 | "english", 3886 | "lord", 3887 | "royal", 3888 | "scotland", 3889 | "queen", 3890 | "sir", 3891 | "edward", 3892 | "henry", 3893 | "duke", 3894 | "john", 3895 | "wales", 3896 | "prince", 3897 | "scottish", 3898 | "fort", 3899 | "irish", 3900 | "reign", 3901 | "george", 3902 | "historian", 3903 | "earl", 3904 | "william", 3905 | "crown", 3906 | "james", 3907 | "kent", 3908 | "norman", 3909 | "richard", 3910 | "catholic", 3911 | "britain", 3912 | "york", 3913 | "stephen", 3914 | "abbey", 3915 | "charles", 3916 | "pope", 3917 | "robert", 3918 | "mary", 3919 | "france", 3920 | "bishop", 3921 | "throne", 3922 | "thomas", 3923 | "granted", 3924 | "appointed", 3925 | "ireland", 3926 | "catherine", 3927 | "canterbury", 3928 | "monarch", 3929 | "anne", 3930 | "protestant", 3931 | "castle" 3932 | ], 3933 | [ 3934 | "treatment", 3935 | "blood", 3936 | "brain", 3937 | "cell", 3938 | "disease", 3939 | "risk", 3940 | "patients", 3941 | "cells", 3942 | "symptoms", 3943 | "dna", 3944 | "protein", 3945 | "cancer", 3946 | "genes", 3947 | "virus", 3948 | "cases", 3949 | "causes", 3950 | "function", 3951 | "diseases", 3952 | "cause", 3953 | "body", 3954 | "related", 3955 | "gene", 3956 | "infection", 3957 | "drugs", 3958 | "activity", 3959 | "levels", 3960 | "people", 3961 | "effects", 3962 | "hiv", 3963 | "acute", 3964 | "diagnosis", 3965 | "disorders", 3966 | "genetic", 3967 | "tissue", 3968 | "liver", 3969 | "proteins", 3970 | "muscle", 3971 | "normal", 3972 | "specific", 3973 | "occurs", 3974 | "skin", 3975 | "form", 3976 | "occur", 3977 | "treatments", 3978 | "hemoglobin", 3979 | "induced", 3980 | "clinical", 3981 | "individuals", 3982 | "viruses", 3983 | "mechanisms" 3984 | ], 3985 | [ 3986 | "route", 3987 | "road", 3988 | "highway", 3989 | "state", 3990 | "north", 3991 | "traffic", 3992 | "east", 3993 | "street", 3994 | "intersection", 3995 | "continues", 3996 | "south", 3997 | "interchange", 3998 | "avenue", 3999 | "section", 4000 | 
"mile", 4001 | "passes", 4002 | "freeway", 4003 | "lane", 4004 | "portion", 4005 | "runs", 4006 | "crosses", 4007 | "completed", 4008 | "heads", 4009 | "turns", 4010 | "downtown", 4011 | "designated", 4012 | "terminus", 4013 | "crossing", 4014 | "northeast", 4015 | "extended", 4016 | "interstate", 4017 | "west", 4018 | "access", 4019 | "highways", 4020 | "roadway", 4021 | "passing", 4022 | "exit", 4023 | "junction", 4024 | "alignment", 4025 | "northwest", 4026 | "begins", 4027 | "corridor", 4028 | "enters", 4029 | "lanes", 4030 | "past", 4031 | "parallel", 4032 | "end", 4033 | "designation", 4034 | "bypass", 4035 | "description" 4036 | ], 4037 | [ 4038 | "ship", 4039 | "ships", 4040 | "guns", 4041 | "fleet", 4042 | "class", 4043 | "inch", 4044 | "tons", 4045 | "torpedo", 4046 | "naval", 4047 | "gun", 4048 | "long", 4049 | "knots", 4050 | "crew", 4051 | "turrets", 4052 | "fired", 4053 | "maximum", 4054 | "admiral", 4055 | "carried", 4056 | "armor", 4057 | "steam", 4058 | "vessels", 4059 | "cruisers", 4060 | "vessel", 4061 | "port", 4062 | "inches", 4063 | "battery", 4064 | "convoy", 4065 | "battleships", 4066 | "navy", 4067 | "service", 4068 | "cruiser", 4069 | "meters", 4070 | "armored", 4071 | "deck", 4072 | "aboard", 4073 | "built", 4074 | "battleship", 4075 | "hull", 4076 | "boats", 4077 | "armour", 4078 | "submarine", 4079 | "feet", 4080 | "flagship", 4081 | "laid", 4082 | "ton", 4083 | "turret", 4084 | "cruise", 4085 | "speed", 4086 | "ordered", 4087 | "boilers" 4088 | ], 4089 | [ 4090 | "art", 4091 | "gold", 4092 | "act", 4093 | "silver", 4094 | "image", 4095 | "flag", 4096 | "painting", 4097 | "pieces", 4098 | "artist", 4099 | "artists", 4100 | "work", 4101 | "coins", 4102 | "figures", 4103 | "paintings", 4104 | "images", 4105 | "painted", 4106 | "arms", 4107 | "exhibition", 4108 | "piece", 4109 | "figure", 4110 | "eye", 4111 | "dress", 4112 | "depicted", 4113 | "fashion", 4114 | "ceremony", 4115 | "portrait", 4116 | "worn", 4117 | "design", 4118 | "presented", 4119 | "body", 4120 | "wearing", 4121 | "sculpture", 4122 | "wear", 4123 | "display", 4124 | "coin", 4125 | "hair", 4126 | "coat", 4127 | "eagle", 4128 | "mint", 4129 | "statue", 4130 | "showing", 4131 | "bronze", 4132 | "depicts", 4133 | "wore", 4134 | "colours", 4135 | "displayed", 4136 | "dollar", 4137 | "print", 4138 | "photographs", 4139 | "designs" 4140 | ], 4141 | [ 4142 | "building", 4143 | "built", 4144 | "century", 4145 | "buildings", 4146 | "site", 4147 | "stone", 4148 | "hall", 4149 | "castle", 4150 | "tower", 4151 | "wall", 4152 | "walls", 4153 | "square", 4154 | "church", 4155 | "completed", 4156 | "house", 4157 | "floor", 4158 | "constructed", 4159 | "architecture", 4160 | "houses", 4161 | "listed", 4162 | "chapel", 4163 | "monument", 4164 | "construction", 4165 | "entrance", 4166 | "rooms", 4167 | "glass", 4168 | "museum", 4169 | "structure", 4170 | "iron", 4171 | "17th", 4172 | "window", 4173 | "remains", 4174 | "medieval", 4175 | "wooden", 4176 | "late", 4177 | "brick", 4178 | "restoration", 4179 | "architectural", 4180 | "19th", 4181 | "memorial", 4182 | "roof", 4183 | "stones", 4184 | "hotel", 4185 | "15th", 4186 | "14th", 4187 | "garden", 4188 | "palace", 4189 | "demolished", 4190 | "dates", 4191 | "steel" 4192 | ], 4193 | [ 4194 | "public", 4195 | "money", 4196 | "signed", 4197 | "manager", 4198 | "future", 4199 | "self", 4200 | "local", 4201 | "contract", 4202 | "deal", 4203 | "letter", 4204 | "pay", 4205 | "paid", 4206 | "career", 4207 | "given", 4208 | "interest", 4209 | "health", 4210 | "help", 4211 
| "press", 4212 | "decision", 4213 | "executive", 4214 | "private", 4215 | "media", 4216 | "letters", 4217 | "trade", 4218 | "real", 4219 | "attention", 4220 | "personal", 4221 | "forward", 4222 | "president", 4223 | "intended", 4224 | "owner", 4225 | "sign", 4226 | "financial", 4227 | "expressed", 4228 | "exchange", 4229 | "represented", 4230 | "chairman", 4231 | "independent", 4232 | "raise", 4233 | "acquired", 4234 | "agent", 4235 | "fellow", 4236 | "prime", 4237 | "agreement", 4238 | "fund", 4239 | "funds", 4240 | "appeal", 4241 | "founded", 4242 | "promotion", 4243 | "experience" 4244 | ], 4245 | [ 4246 | "day", 4247 | "found", 4248 | "night", 4249 | "live", 4250 | "seen", 4251 | "set", 4252 | "return", 4253 | "fire", 4254 | "having", 4255 | "way", 4256 | "come", 4257 | "find", 4258 | "days", 4259 | "dead", 4260 | "party", 4261 | "leave", 4262 | "body", 4263 | "cover", 4264 | "hours", 4265 | "fact", 4266 | "taken", 4267 | "hit", 4268 | "fight", 4269 | "shot", 4270 | "escape", 4271 | "special", 4272 | "search", 4273 | "secret", 4274 | "away", 4275 | "true", 4276 | "stay", 4277 | "believe", 4278 | "follow", 4279 | "christmas", 4280 | "previous", 4281 | "claims", 4282 | "present", 4283 | "kill", 4284 | "heard", 4285 | "idea", 4286 | "reason", 4287 | "journey", 4288 | "past", 4289 | "mark", 4290 | "face", 4291 | "news", 4292 | "thought", 4293 | "morning", 4294 | "die", 4295 | "question" 4296 | ], 4297 | [ 4298 | "military", 4299 | "soviet", 4300 | "army", 4301 | "emperor", 4302 | "hungarian", 4303 | "roman", 4304 | "muslim", 4305 | "empire", 4306 | "war", 4307 | "forces", 4308 | "polish", 4309 | "allies", 4310 | "greece", 4311 | "chinese", 4312 | "arab", 4313 | "armies", 4314 | "rule", 4315 | "treaty", 4316 | "wars", 4317 | "imperial", 4318 | "albania", 4319 | "croatia", 4320 | "croatian", 4321 | "poland", 4322 | "territory", 4323 | "siege", 4324 | "diplomatic", 4325 | "ottoman", 4326 | "conflict", 4327 | "coup", 4328 | "serbian", 4329 | "vietnamese", 4330 | "egyptian", 4331 | "byzantine", 4332 | "serbia", 4333 | "capital", 4334 | "justinian", 4335 | "uprising", 4336 | "communist", 4337 | "jews", 4338 | "turkey", 4339 | "occupation", 4340 | "damascus", 4341 | "turkish", 4342 | "constantinople", 4343 | "serbs", 4344 | "independence", 4345 | "peace", 4346 | "revolt", 4347 | "hostilities" 4348 | ], 4349 | [ 4350 | "time", 4351 | "second", 4352 | "later", 4353 | "end", 4354 | "following", 4355 | "place", 4356 | "half", 4357 | "day", 4358 | "took", 4359 | "right", 4360 | "line", 4361 | "left", 4362 | "having", 4363 | "despite", 4364 | "final", 4365 | "seven", 4366 | "point", 4367 | "away", 4368 | "high", 4369 | "times", 4370 | "set", 4371 | "came", 4372 | "eventually", 4373 | "little", 4374 | "lost", 4375 | "far", 4376 | "short", 4377 | "followed", 4378 | "previous", 4379 | "instead", 4380 | "position", 4381 | "ended", 4382 | "began", 4383 | "long", 4384 | "fourth", 4385 | "result", 4386 | "held", 4387 | "lines", 4388 | "taking", 4389 | "total", 4390 | "started", 4391 | "order", 4392 | "minutes", 4393 | "allowed", 4394 | "good", 4395 | "previously", 4396 | "saw", 4397 | "beginning", 4398 | "prior", 4399 | "heavy" 4400 | ], 4401 | [ 4402 | "home", 4403 | "new", 4404 | "making", 4405 | "open", 4406 | "play", 4407 | "run", 4408 | "room", 4409 | "opening", 4410 | "free", 4411 | "inside", 4412 | "stand", 4413 | "opened", 4414 | "cold", 4415 | "allow", 4416 | "standing", 4417 | "filled", 4418 | "door", 4419 | "shot", 4420 | "super", 4421 | "going", 4422 | "originally", 4423 | "turned", 4424 | 
"featured", 4425 | "turn", 4426 | "drawn", 4427 | "supporting", 4428 | "wild", 4429 | "card", 4430 | "der", 4431 | "turning", 4432 | "walk", 4433 | "playing", 4434 | "mixed", 4435 | "kept", 4436 | "moving", 4437 | "sets", 4438 | "crowd", 4439 | "close", 4440 | "closed", 4441 | "space", 4442 | "closing", 4443 | "enter", 4444 | "swedish", 4445 | "keeping", 4446 | "featuring", 4447 | "friendly", 4448 | "walking", 4449 | "bring", 4450 | "allowing", 4451 | "opposite" 4452 | ], 4453 | [ 4454 | "said", 4455 | "like", 4456 | "wrote", 4457 | "love", 4458 | "writing", 4459 | "felt", 4460 | "called", 4461 | "good", 4462 | "written", 4463 | "received", 4464 | "saying", 4465 | "critics", 4466 | "review", 4467 | "described", 4468 | "praised", 4469 | "don", 4470 | "wanted", 4471 | "according", 4472 | "way", 4473 | "know", 4474 | "critical", 4475 | "positive", 4476 | "want", 4477 | "feel", 4478 | "think", 4479 | "things", 4480 | "real", 4481 | "girl", 4482 | "life", 4483 | "thought", 4484 | "interview", 4485 | "reception", 4486 | "reviews", 4487 | "little", 4488 | "going", 4489 | "let", 4490 | "kind", 4491 | "inspired", 4492 | "heart", 4493 | "gave", 4494 | "thing", 4495 | "got", 4496 | "explained", 4497 | "mind", 4498 | "look", 4499 | "titled", 4500 | "writer", 4501 | "commented", 4502 | "lot", 4503 | "compared" 4504 | ], 4505 | [ 4506 | "battle", 4507 | "force", 4508 | "forces", 4509 | "japanese", 4510 | "command", 4511 | "troops", 4512 | "army", 4513 | "men", 4514 | "commander", 4515 | "general", 4516 | "attack", 4517 | "british", 4518 | "division", 4519 | "soldiers", 4520 | "units", 4521 | "lieutenant", 4522 | "fighting", 4523 | "north", 4524 | "positions", 4525 | "south", 4526 | "regiment", 4527 | "military", 4528 | "squadron", 4529 | "battalion", 4530 | "casualties", 4531 | "ordered", 4532 | "enemy", 4533 | "artillery", 4534 | "infantry", 4535 | "combat", 4536 | "unit", 4537 | "captain", 4538 | "german", 4539 | "brigade", 4540 | "officers", 4541 | "wounded", 4542 | "officer", 4543 | "1st", 4544 | "fire", 4545 | "captured", 4546 | "colonel", 4547 | "commanders", 4548 | "commanded", 4549 | "allied", 4550 | "2nd", 4551 | "position", 4552 | "garrison", 4553 | "advance", 4554 | "divisions", 4555 | "battalions" 4556 | ], 4557 | [ 4558 | "water", 4559 | "population", 4560 | "feet", 4561 | "land", 4562 | "island", 4563 | "food", 4564 | "region", 4565 | "sea", 4566 | "river", 4567 | "area", 4568 | "lake", 4569 | "metres", 4570 | "oil", 4571 | "mountain", 4572 | "climate", 4573 | "average", 4574 | "areas", 4575 | "tree", 4576 | "ice", 4577 | "forest", 4578 | "trees", 4579 | "acres", 4580 | "waters", 4581 | "near", 4582 | "people", 4583 | "islands", 4584 | "native", 4585 | "miles", 4586 | "rivers", 4587 | "plant", 4588 | "km2", 4589 | "air", 4590 | "kilometres", 4591 | "fishing", 4592 | "creek", 4593 | "plants", 4594 | "residents", 4595 | "coal", 4596 | "temperatures", 4597 | "agriculture", 4598 | "mountains", 4599 | "fish", 4600 | "snow", 4601 | "mouth", 4602 | "valley", 4603 | "approximately", 4604 | "cities", 4605 | "agricultural", 4606 | "salt", 4607 | "flow" 4608 | ], 4609 | [ 4610 | "story", 4611 | "novel", 4612 | "stories", 4613 | "fiction", 4614 | "doctor", 4615 | "book", 4616 | "magazine", 4617 | "bond", 4618 | "alien", 4619 | "comic", 4620 | "science", 4621 | "novels", 4622 | "universe", 4623 | "fantasy", 4624 | "batman", 4625 | "comics", 4626 | "villain", 4627 | "fictional", 4628 | "evil", 4629 | "magic", 4630 | "halo", 4631 | "trek", 4632 | "adventures", 4633 | "tales", 4634 | "vampire", 4635 | 
"narrative", 4636 | "magical", 4637 | "dragon", 4638 | "trilogy", 4639 | "adventure", 4640 | "tale", 4641 | "reader", 4642 | "horror", 4643 | "marvel", 4644 | "sam", 4645 | "enterprise", 4646 | "harry", 4647 | "tintin", 4648 | "harry_potter", 4649 | "genre", 4650 | "ghost", 4651 | "readers", 4652 | "fleming", 4653 | "publisher", 4654 | "fantastic", 4655 | "potter", 4656 | "editor", 4657 | "aliens", 4658 | "spider", 4659 | "ian_fleming" 4660 | ], 4661 | [ 4662 | "school", 4663 | "students", 4664 | "member", 4665 | "college", 4666 | "education", 4667 | "schools", 4668 | "university", 4669 | "members", 4670 | "class", 4671 | "student", 4672 | "community", 4673 | "founded", 4674 | "president", 4675 | "program", 4676 | "research", 4677 | "established", 4678 | "organization", 4679 | "campus", 4680 | "programs", 4681 | "served", 4682 | "academic", 4683 | "national", 4684 | "medical", 4685 | "science", 4686 | "high", 4687 | "study", 4688 | "organizations", 4689 | "prize", 4690 | "degree", 4691 | "professor", 4692 | "attended", 4693 | "honor", 4694 | "arts", 4695 | "activities", 4696 | "primary", 4697 | "board", 4698 | "department", 4699 | "engineering", 4700 | "library", 4701 | "faculty", 4702 | "graduate", 4703 | "festival", 4704 | "awarded", 4705 | "institution", 4706 | "courses", 4707 | "mayor", 4708 | "grade", 4709 | "graduated", 4710 | "teaching", 4711 | "enrolled" 4712 | ], 4713 | [ 4714 | "society", 4715 | "read", 4716 | "literature", 4717 | "author", 4718 | "literary", 4719 | "argued", 4720 | "ideas", 4721 | "poem", 4722 | "poetry", 4723 | "moral", 4724 | "argues", 4725 | "sexual", 4726 | "view", 4727 | "poet", 4728 | "writings", 4729 | "social", 4730 | "religion", 4731 | "isbn", 4732 | "historical", 4733 | "writers", 4734 | "views", 4735 | "political", 4736 | "philosophy", 4737 | "writer", 4738 | "matter", 4739 | "freedom", 4740 | "writes", 4741 | "biography", 4742 | "scholar", 4743 | "theory", 4744 | "essay", 4745 | "believed", 4746 | "speech", 4747 | "argument", 4748 | "context", 4749 | "considers", 4750 | "understanding", 4751 | "publication", 4752 | "published", 4753 | "essays", 4754 | "arguments", 4755 | "understand", 4756 | "idea", 4757 | "principles", 4758 | "historians", 4759 | "opinions", 4760 | "article", 4761 | "according", 4762 | "philosophical", 4763 | "myth" 4764 | ], 4765 | [ 4766 | "government", 4767 | "state", 4768 | "president", 4769 | "law", 4770 | "political", 4771 | "governor", 4772 | "election", 4773 | "minister", 4774 | "national", 4775 | "act", 4776 | "congress", 4777 | "party", 4778 | "federal", 4779 | "campaign", 4780 | "general", 4781 | "civil", 4782 | "constitution", 4783 | "rights", 4784 | "secretary", 4785 | "states", 4786 | "policy", 4787 | "office", 4788 | "elected", 4789 | "vote", 4790 | "bill", 4791 | "parliament", 4792 | "democratic", 4793 | "support", 4794 | "politics", 4795 | "term", 4796 | "majority", 4797 | "candidate", 4798 | "representative", 4799 | "administration", 4800 | "labor", 4801 | "justice", 4802 | "citizens", 4803 | "legislation", 4804 | "opposition", 4805 | "authority", 4806 | "convention", 4807 | "senator", 4808 | "economic", 4809 | "court", 4810 | "tax", 4811 | "parties", 4812 | "held", 4813 | "elections", 4814 | "constitutional", 4815 | "cabinet" 4816 | ], 4817 | [ 4818 | "season", 4819 | "game", 4820 | "team", 4821 | "league", 4822 | "club", 4823 | "games", 4824 | "played", 4825 | "scored", 4826 | "goal", 4827 | "play", 4828 | "runs", 4829 | "players", 4830 | "cup", 4831 | "goals", 4832 | "ball", 4833 | "innings", 4834 | "football", 
4835 | "player", 4836 | "yards", 4837 | "test", 4838 | "scoring", 4839 | "win", 4840 | "victory", 4841 | "match", 4842 | "bowl", 4843 | "playing", 4844 | "score", 4845 | "career", 4846 | "championship", 4847 | "matches", 4848 | "arsenal", 4849 | "run", 4850 | "yard", 4851 | "lap", 4852 | "points", 4853 | "lead", 4854 | "seasons", 4855 | "stadium", 4856 | "baseball", 4857 | "field", 4858 | "miller", 4859 | "appearances", 4860 | "series", 4861 | "quarterback", 4862 | "pitch", 4863 | "draw", 4864 | "bowling", 4865 | "assists", 4866 | "batting", 4867 | "wicket" 4868 | ], 4869 | [ 4870 | "use", 4871 | "order", 4872 | "human", 4873 | "possible", 4874 | "required", 4875 | "evidence", 4876 | "case", 4877 | "role", 4878 | "process", 4879 | "effect", 4880 | "change", 4881 | "nature", 4882 | "result", 4883 | "involved", 4884 | "non", 4885 | "given", 4886 | "events", 4887 | "groups", 4888 | "lack", 4889 | "particular", 4890 | "information", 4891 | "provide", 4892 | "life", 4893 | "certain", 4894 | "example", 4895 | "status", 4896 | "force", 4897 | "research", 4898 | "need", 4899 | "natural", 4900 | "period", 4901 | "view", 4902 | "increase", 4903 | "problems", 4904 | "necessary", 4905 | "action", 4906 | "individual", 4907 | "structure", 4908 | "responsible", 4909 | "cause", 4910 | "data", 4911 | "form", 4912 | "control", 4913 | "able", 4914 | "terms", 4915 | "protection", 4916 | "related", 4917 | "complex", 4918 | "response", 4919 | "basis" 4920 | ], 4921 | [ 4922 | "earth", 4923 | "sun", 4924 | "surface", 4925 | "star", 4926 | "planet", 4927 | "mass", 4928 | "light", 4929 | "moon", 4930 | "years", 4931 | "stars", 4932 | "solar", 4933 | "atmosphere", 4934 | "giant", 4935 | "system", 4936 | "orbit", 4937 | "space", 4938 | "magnitude", 4939 | "observed", 4940 | "distance", 4941 | "mars", 4942 | "discovery", 4943 | "cloud", 4944 | "observations", 4945 | "jupiter", 4946 | "dwarf", 4947 | "objects", 4948 | "object", 4949 | "rotation", 4950 | "motion", 4951 | "planets", 4952 | "dust", 4953 | "saturn", 4954 | "gravity", 4955 | "masses", 4956 | "massive", 4957 | "ago", 4958 | "approximately", 4959 | "telescope", 4960 | "activity", 4961 | "disk", 4962 | "roughly", 4963 | "spacecraft", 4964 | "mercury", 4965 | "sky", 4966 | "discovered", 4967 | "measurements", 4968 | "times", 4969 | "deep", 4970 | "zone", 4971 | "type" 4972 | ], 4973 | [ 4974 | "game", 4975 | "player", 4976 | "games", 4977 | "released", 4978 | "players", 4979 | "gameplay", 4980 | "release", 4981 | "video", 4982 | "character", 4983 | "series", 4984 | "version", 4985 | "japan", 4986 | "versions", 4987 | "playstation", 4988 | "features", 4989 | "titles", 4990 | "nintendo", 4991 | "received", 4992 | "console", 4993 | "mode", 4994 | "enemies", 4995 | "development", 4996 | "ign", 4997 | "reviewers", 4998 | "screen", 4999 | "soundtrack", 5000 | "action", 5001 | "wii", 5002 | "multiplayer", 5003 | "xbox", 5004 | "gaming", 5005 | "releases", 5006 | "playable", 5007 | "original", 5008 | "content", 5009 | "reception", 5010 | "japanese", 5011 | "level", 5012 | "person", 5013 | "north_america", 5014 | "protagonist", 5015 | "characters", 5016 | "units", 5017 | "graphics", 5018 | "team", 5019 | "levels", 5020 | "copies", 5021 | "feature", 5022 | "battles", 5023 | "developed" 5024 | ], 5025 | [ 5026 | "species", 5027 | "birds", 5028 | "genus", 5029 | "specimens", 5030 | "animal", 5031 | "bird", 5032 | "tail", 5033 | "habitat", 5034 | "taxonomy", 5035 | "females", 5036 | "males", 5037 | "diet", 5038 | "found", 5039 | "body", 5040 | "teeth", 5041 | 
"populations", 5042 | "fish", 5043 | "male", 5044 | "skull", 5045 | "nests", 5046 | "breeding", 5047 | "female", 5048 | "prey", 5049 | "range", 5050 | "common", 5051 | "shark", 5052 | "subspecies", 5053 | "predators", 5054 | "specimen", 5055 | "eggs", 5056 | "animals", 5057 | "ecology", 5058 | "bodies", 5059 | "mature", 5060 | "insects", 5061 | "sexes", 5062 | "fossils", 5063 | "feeding", 5064 | "sharks", 5065 | "extinction", 5066 | "nest", 5067 | "jaws", 5068 | "interspecific", 5069 | "mating", 5070 | "occurs", 5071 | "whale", 5072 | "humans", 5073 | "juveniles", 5074 | "abundant", 5075 | "fruit" 5076 | ], 5077 | [ 5078 | "television", 5079 | "award", 5080 | "awards", 5081 | "dvd", 5082 | "nominated", 5083 | "critics", 5084 | "video", 5085 | "animated", 5086 | "actress", 5087 | "festival", 5088 | "magazine", 5089 | "drama", 5090 | "bbc", 5091 | "media", 5092 | "hollywood", 5093 | "new_york_times", 5094 | "movie", 5095 | "entertainment", 5096 | "writer", 5097 | "comedy", 5098 | "nomination", 5099 | "nominations", 5100 | "disney", 5101 | "female", 5102 | "documentary", 5103 | "ray", 5104 | "theatrical", 5105 | "critic", 5106 | "outstanding", 5107 | "animation", 5108 | "male", 5109 | "reviews", 5110 | "digital", 5111 | "anime", 5112 | "audio", 5113 | "acclaim", 5114 | "actor", 5115 | "audience", 5116 | "youtube", 5117 | "credits", 5118 | "film", 5119 | "poll", 5120 | "voice", 5121 | "creative", 5122 | "costumes", 5123 | "weekly", 5124 | "emmy", 5125 | "thriller", 5126 | "washington_post", 5127 | "reporter" 5128 | ], 5129 | [ 5130 | "god", 5131 | "church", 5132 | "christian", 5133 | "religious", 5134 | "temple", 5135 | "worship", 5136 | "language", 5137 | "languages", 5138 | "traditions", 5139 | "tradition", 5140 | "religion", 5141 | "jesus", 5142 | "christianity", 5143 | "kingdom", 5144 | "hindu", 5145 | "spoken", 5146 | "legend", 5147 | "christ", 5148 | "holy", 5149 | "text", 5150 | "words", 5151 | "word", 5152 | "kings", 5153 | "inscriptions", 5154 | "spiritual", 5155 | "christians", 5156 | "texts", 5157 | "maya", 5158 | "faith", 5159 | "congregation", 5160 | "gods", 5161 | "divine", 5162 | "goddess", 5163 | "bible", 5164 | "shiva", 5165 | "tomb", 5166 | "buddhist", 5167 | "churches", 5168 | "deity", 5169 | "sacred", 5170 | "king", 5171 | "shrines", 5172 | "practices", 5173 | "shrine", 5174 | "monastery", 5175 | "dialect", 5176 | "ritual", 5177 | "baptism", 5178 | "hebrew", 5179 | "secular" 5180 | ], 5181 | [ 5182 | "storm", 5183 | "tropical", 5184 | "hurricane", 5185 | "winds", 5186 | "depression", 5187 | "mph", 5188 | "wind", 5189 | "cyclone", 5190 | "september", 5191 | "august", 5192 | "damage", 5193 | "rainfall", 5194 | "day", 5195 | "reported", 5196 | "intensity", 5197 | "utc", 5198 | "pressure", 5199 | "near", 5200 | "system", 5201 | "occurred", 5202 | "hours", 5203 | "caused", 5204 | "damaged", 5205 | "category", 5206 | "flooding", 5207 | "convection", 5208 | "sustained", 5209 | "october", 5210 | "northeast", 5211 | "attained", 5212 | "landfall", 5213 | "heavy", 5214 | "moving", 5215 | "estimated", 5216 | "homes", 5217 | "southeast", 5218 | "south", 5219 | "reached", 5220 | "remnants", 5221 | "low", 5222 | "moved", 5223 | "disturbance", 5224 | "north", 5225 | "circulation", 5226 | "severe", 5227 | "developed", 5228 | "upgraded", 5229 | "weakened", 5230 | "northwest", 5231 | "early" 5232 | ], 5233 | [ 5234 | "police", 5235 | "court", 5236 | "people", 5237 | "trial", 5238 | "report", 5239 | "prison", 5240 | "incident", 5241 | "arrested", 5242 | "violence", 5243 | "murder", 5244 | 
"judge", 5245 | "criminal", 5246 | "crime", 5247 | "investigation", 5248 | "jewish", 5249 | "men", 5250 | "charges", 5251 | "committed", 5252 | "person", 5253 | "news", 5254 | "accused", 5255 | "gay", 5256 | "security", 5257 | "victims", 5258 | "jury", 5259 | "crimes", 5260 | "arrest", 5261 | "defense", 5262 | "reports", 5263 | "media", 5264 | "newspaper", 5265 | "statement", 5266 | "officers", 5267 | "alleged", 5268 | "convicted", 5269 | "newspapers", 5270 | "violent", 5271 | "guilty", 5272 | "actions", 5273 | "drug", 5274 | "suicide", 5275 | "illegal", 5276 | "conspiracy", 5277 | "witnesses", 5278 | "protest", 5279 | "prisoners", 5280 | "lawyers", 5281 | "authorities", 5282 | "charge", 5283 | "case" 5284 | ] 5285 | ], 5286 | "c_npmi_10_full_all": [ 5287 | 0.03828346256626948, 5288 | 0.1038583234675748, 5289 | 0.015366944419668899, 5290 | 0.0900476418994117, 5291 | 0.1020442098673313, 5292 | 0.03190760464777211, 5293 | 0.16677015392348543, 5294 | -0.008954957865953978, 5295 | 0.012054122204282256, 5296 | -0.004889701871966398, 5297 | 0.15132187821689855, 5298 | 0.13286033788304982, 5299 | 0.21361278838454933, 5300 | 0.06768430328713142, 5301 | 0.25295664311845517, 5302 | 0.19214197022094048, 5303 | 0.1246862697135218, 5304 | 0.13727695971556078, 5305 | 0.15977332312391387, 5306 | 0.07607815548878252, 5307 | 0.08921236964042889, 5308 | 0.14721562942824956, 5309 | 0.04320981988759863, 5310 | 0.1348618452599969, 5311 | 0.19853561056687896, 5312 | 0.21675332548983514, 5313 | 0.20602537817919672, 5314 | 0.05120785577156596, 5315 | 0.15718236651705925, 5316 | 0.03070653982222502, 5317 | 0.021220918893205067, 5318 | 0.11234754553902984, 5319 | 0.037713983698919415, 5320 | 0.020353224603349568, 5321 | 0.04964237205892103, 5322 | 0.16596275544055908, 5323 | 0.0976401097020574, 5324 | 0.13018839146257705, 5325 | 0.134779218822221, 5326 | 0.1173630971965991, 5327 | 0.09886537661229274, 5328 | 0.22095588022862706, 5329 | 0.03753952947452625, 5330 | 0.12707757576728707, 5331 | 0.12641353705174047, 5332 | 0.18913925649606136, 5333 | 0.1493844703305789, 5334 | 0.12044245581211913, 5335 | 0.2556792335636147, 5336 | 0.12187208376841602 5337 | ], 5338 | "path": "outputs/full-mindf_power_law-maxdf_0.9/wikitext/k-50/etm/lr_0.001-reg_1.2e-05-epochs_1000-anneal_lr_0/42" 5339 | } 5340 | } --------------------------------------------------------------------------------