├── .gitignore ├── README.md ├── code ├── __pycache__ │ └── eda.cpython-36.pyc ├── augment.py └── eda.py ├── data ├── lol.txt └── sst2_train_500.txt ├── eda_figure.png ├── experiments ├── __pycache__ │ ├── a_config.cpython-36.pyc │ ├── a_config.cpython-37.pyc │ ├── b_config.cpython-36.pyc │ ├── c_config.cpython-36.pyc │ ├── config.cpython-36.pyc │ ├── e_config.cpython-36.pyc │ ├── methods.cpython-36.pyc │ ├── methods.cpython-37.pyc │ └── nlp_aug.cpython-36.pyc ├── a_1_data_process.py ├── a_2_train_eval.py ├── a_config.py ├── b_1_data_process.py ├── b_2_train_eval.py ├── b_config.py ├── c_1_data_process.py ├── c_2_train_eval.py ├── c_config.py ├── d_0_preprocess.py ├── d_1_train_models.py ├── d_2_tsne.py ├── d_neg_1_balance_trec.py ├── e_1_data_process.py ├── e_2_cnn_aug.py ├── e_2_cnn_baselines.py ├── e_2_rnn_aug.py ├── e_2_rnn_baselines.py ├── e_config.py ├── methods.py └── nlp_aug.py └── preprocess ├── __pycache__ └── utils.cpython-36.pyc ├── bg_clean.py ├── copy_sized_datasets.py ├── cr_clean.py ├── create_dataset_increments.py ├── get_stats.py ├── procon_clean.py ├── shuffle_lines.py ├── sst1_clean.py ├── subj_clean.py ├── trej_clean.py └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | word2vec* 3 | size_data* 4 | size_data_f1* 5 | size_data_f3* 6 | size_data_t1* 7 | increment_datasets_f2* 8 | z_archives* 9 | special_f4* 10 | outputs_f1* 11 | outputs_f2* 12 | outputs_f3* 13 | outputs_f4* 14 | baseline_cnn* 15 | baseline_rnn* 16 | 17 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks 2 | [![Conference](http://img.shields.io/badge/EMNLP-2019-4b44ce.svg)](https://arxiv.org/abs/1901.11196) 3 | 4 | For a survey of data augmentation in NLP, see this [repository](https://github.com/styfeng/DataAug4NLP/blob/main/README.md)/this [paper](http://arxiv.org/abs/2105.03075). 5 | 6 | This is the code for the EMNLP-IJCNLP paper [EDA: Easy Data Augmentation techniques for boosting performance on text classification tasks.](https://arxiv.org/abs/1901.11196) 7 | 8 | A blog post that explains EDA is [[here]](https://medium.com/@jason.20/these-are-the-easiest-data-augmentation-techniques-in-natural-language-processing-you-can-think-of-88e393fd610). 9 | 10 | Update: find an external implementation of EDA in Chinese [[here]](https://github.com/zhanlaoban/EDA_NLP_for_Chinese). 11 | 12 | By [Jason Wei](https://jasonwei20.github.io/research/) and Kai Zou. 13 | 14 | Note: **Do not** email me with questions, as I will not reply. Instead, open an issue. 15 | 16 | We present **EDA**: **e**asy **d**ata **a**ugmentation techniques for boosting performance on text classification tasks. These are a generalized set of data augmentation techniques that are easy to implement and have shown improvements on five NLP classification tasks, with substantial improvements on datasets of size `N < 500`. While other techniques require you to train a language model on an external dataset just to get a small boost, we found that simple text editing operations using EDA result in good performance gains. Given a sentence in the training set, we perform the following operations: 17 | 18 | - **Synonym Replacement (SR):** Randomly choose *n* words from the sentence that are not stop words. 
Replace each of these words with one of its synonyms chosen at random. 19 | - **Random Insertion (RI):** Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this *n* times. 20 | - **Random Swap (RS):** Randomly choose two words in the sentence and swap their positions. Do this *n* times. 21 | - **Random Deletion (RD):** For each word in the sentence, randomly remove it with probability *p*. 22 | 23 |
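For a concrete picture of what these operations produce, here is a minimal sketch that calls the `eda()` helper from `code/eda.py` on a single sentence. The sentence is an arbitrary example, and the sketch assumes `code/` is importable and WordNet has been downloaded as described under Usage below; outputs vary from run to run because every operation is random.

```python
# Minimal sketch: run all four EDA operations on one sentence via code/eda.py.
# Assumes WordNet is already downloaded: import nltk; nltk.download('wordnet')
from eda import eda  # code/eda.py must be importable (e.g., run from inside code/)

sentence = "the quick brown fox jumps over the lazy dog"  # arbitrary example sentence
# each alpha controls roughly what fraction of words the operation touches;
# num_aug is the number of augmented variants to return
augmented = eda(sentence, alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=4)
for s in augmented:
    print(s)  # four augmented variants, then the (cleaned) original sentence
```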

<img src="eda_figure.png" alt="drawing">

24 | Average performance on 5 datasets with and without EDA, with respect to percent of training data used. 25 | 26 | # Usage 27 | 28 | You can run EDA on any text classification dataset in less than 5 minutes. Just two steps: 29 | 30 | ### Install NLTK (if you don't have it already): 31 | 32 | Pip install it. 33 | 34 | ```bash 35 | pip install -U nltk 36 | ``` 37 | 38 | Download WordNet. 39 | ```bash 40 | python 41 | >>> import nltk; nltk.download('wordnet') 42 | ``` 43 | 44 | ### Run EDA 45 | 46 | You can easily write your own implementation, but this one takes input files in the format `label\tsentence` (note the `\t`). So for instance, your input file should look like this (example from the Stanford Sentiment Treebank): 47 | 48 | ``` 49 | 1 neil burger here succeeded in making the mystery of four decades back the springboard for a more immediate mystery in the present 50 | 0 it is a visual rorschach test and i must have failed 51 | 0 the only way to tolerate this insipid brutally clueless film might be with a large dose of painkillers 52 | ... 53 | ``` 54 | 55 | Now place this input file into the `data` folder. Run 56 | 57 | ```bash 58 | python code/augment.py --input=<input filename> 59 | ``` 60 | 61 | The default output filename will append `eda_` to the front of the input filename, but you can specify your own with `--output`. You can also specify the number of generated augmented sentences per original sentence using `--num_aug` (default is 9). Furthermore, you can specify different alpha parameters, each of which is approximately the percent of words in the sentence that will be changed by the corresponding operation (default is `0.1`, i.e. `10%`). So, for example, if your input file is `sst2_train.txt` and you want to output to `sst2_augmented.txt` with `16` augmented sentences per original sentence, replace 5% of words with synonyms (`alpha_sr=0.05`), delete 10% of words (`alpha_rd=0.1`, or leave it as the default), and apply neither random insertion (`alpha_ri=0.0`) nor random swap (`alpha_rs=0.0`), you would run: 62 | 63 | ```bash 64 | python code/augment.py --input=sst2_train.txt --output=sst2_augmented.txt --num_aug=16 --alpha_sr=0.05 --alpha_rd=0.1 --alpha_ri=0.0 --alpha_rs=0.0 65 | ``` 66 | 67 | Note that for any alpha greater than zero, at least one augmentation operation is applied per augmented sentence, no matter how small the alpha. So if you set `alpha_sr=0.001` and your sentence has only four words, one synonym replacement will still be performed. Of course, if a particular alpha is zero, the corresponding operation is skipped. Best of luck! 68 | 69 | # Citation 70 | If you use EDA in your paper, please cite us: 71 | ``` 72 | @inproceedings{wei-zou-2019-eda, 73 | title = "{EDA}: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks", 74 | author = "Wei, Jason and 75 | Zou, Kai", 76 | booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)", 77 | month = nov, 78 | year = "2019", 79 | address = "Hong Kong, China", 80 | publisher = "Association for Computational Linguistics", 81 | url = "https://www.aclweb.org/anthology/D19-1670", 82 | pages = "6383--6389", 83 | } 84 | ``` 85 | 86 | # Experiments 87 | 88 | The code for all experiments used in the paper is not documented, but it is available [here](https://github.com/jasonwei20/eda_nlp/tree/master/experiments). See [this issue](https://github.com/jasonwei20/eda_nlp/issues/10) for limited guidance.
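As a quick end-to-end reference, the sketch below writes a toy input file in the `label\tsentence` format described under Run EDA above and shows the matching `augment.py` call. The path `data/toy_train.txt` and the two example rows are placeholders for illustration, not files in this repository.

```python
# Minimal sketch: write a toy input file in the label\tsentence format expected by augment.py.
# The path data/toy_train.txt and the two rows below are placeholders for illustration only.
rows = [
    ("1", "a gripping and well acted thriller"),
    ("0", "a dull and lifeless sequel"),
]
with open("data/toy_train.txt", "w") as f:
    for label, sentence in rows:
        f.write(label + "\t" + sentence + "\n")

# Then, from the repository root:
#   python code/augment.py --input=data/toy_train.txt --num_aug=9
# By default the augmented file is written next to the input as data/eda_toy_train.txt.
```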
89 | 90 | 91 | 92 | -------------------------------------------------------------------------------- /code/augment.py: -------------------------------------------------------------------------------- 1 | # Easy data augmentation techniques for text classification 2 | # Jason Wei and Kai Zou 3 | 4 | from eda import * 5 | 6 | #arguments to be parsed from command line 7 | import argparse 8 | ap = argparse.ArgumentParser() 9 | ap.add_argument("--input", required=True, type=str, help="input file of unaugmented data") 10 | ap.add_argument("--output", required=False, type=str, help="output file of augmented data") 11 | ap.add_argument("--num_aug", required=False, type=int, help="number of augmented sentences per original sentence") 12 | ap.add_argument("--alpha_sr", required=False, type=float, help="percent of words in each sentence to be replaced by synonyms") 13 | ap.add_argument("--alpha_ri", required=False, type=float, help="percent of words in each sentence to be inserted") 14 | ap.add_argument("--alpha_rs", required=False, type=float, help="percent of words in each sentence to be swapped") 15 | ap.add_argument("--alpha_rd", required=False, type=float, help="percent of words in each sentence to be deleted") 16 | args = ap.parse_args() 17 | 18 | #the output file 19 | output = None 20 | if args.output: 21 | output = args.output 22 | else: 23 | from os.path import dirname, basename, join 24 | output = join(dirname(args.input), 'eda_' + basename(args.input)) 25 | 26 | #number of augmented sentences to generate per original sentence 27 | num_aug = 9 #default 28 | if args.num_aug: 29 | num_aug = args.num_aug 30 | 31 | #how much to replace each word by synonyms 32 | alpha_sr = 0.1 #default 33 | if args.alpha_sr is not None: 34 | alpha_sr = args.alpha_sr 35 | 36 | #how much to insert new words that are synonyms 37 | alpha_ri = 0.1 #default 38 | if args.alpha_ri is not None: 39 | alpha_ri = args.alpha_ri 40 | 41 | #how much to swap words 42 | alpha_rs = 0.1 #default 43 | if args.alpha_rs is not None: 44 | alpha_rs = args.alpha_rs 45 | 46 | #how much to delete words 47 | alpha_rd = 0.1 #default 48 | if args.alpha_rd is not None: 49 | alpha_rd = args.alpha_rd 50 | 51 | if alpha_sr == alpha_ri == alpha_rs == alpha_rd == 0: 52 | ap.error('At least one alpha should be greater than zero') 53 | 54 | #generate more data with standard augmentation 55 | def gen_eda(train_orig, output_file, alpha_sr, alpha_ri, alpha_rs, alpha_rd, num_aug=9): 56 | 57 | writer = open(output_file, 'w') 58 | lines = open(train_orig, 'r').readlines() 59 | 60 | for i, line in enumerate(lines): 61 | parts = line[:-1].split('\t') 62 | label = parts[0] 63 | sentence = parts[1] 64 | aug_sentences = eda(sentence, alpha_sr=alpha_sr, alpha_ri=alpha_ri, alpha_rs=alpha_rs, p_rd=alpha_rd, num_aug=num_aug) 65 | for aug_sentence in aug_sentences: 66 | writer.write(label + "\t" + aug_sentence + '\n') 67 | 68 | writer.close() 69 | print("generated augmented sentences with eda for " + train_orig + " to " + output_file + " with num_aug=" + str(num_aug)) 70 | 71 | #main function 72 | if __name__ == "__main__": 73 | 74 | #generate augmented sentences and output into a new file 75 | gen_eda(args.input, output, alpha_sr=alpha_sr,
alpha_ri=alpha_ri, alpha_rs=alpha_rs, alpha_rd=alpha_rd, num_aug=num_aug) -------------------------------------------------------------------------------- /code/eda.py: -------------------------------------------------------------------------------- 1 | # Easy data augmentation techniques for text classification 2 | # Jason Wei and Kai Zou 3 | 4 | import random 5 | from random import shuffle 6 | random.seed(1) 7 | 8 | #stop words list 9 | stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 10 | 'ours', 'ourselves', 'you', 'your', 'yours', 11 | 'yourself', 'yourselves', 'he', 'him', 'his', 12 | 'himself', 'she', 'her', 'hers', 'herself', 13 | 'it', 'its', 'itself', 'they', 'them', 'their', 14 | 'theirs', 'themselves', 'what', 'which', 'who', 15 | 'whom', 'this', 'that', 'these', 'those', 'am', 16 | 'is', 'are', 'was', 'were', 'be', 'been', 'being', 17 | 'have', 'has', 'had', 'having', 'do', 'does', 'did', 18 | 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 19 | 'because', 'as', 'until', 'while', 'of', 'at', 20 | 'by', 'for', 'with', 'about', 'against', 'between', 21 | 'into', 'through', 'during', 'before', 'after', 22 | 'above', 'below', 'to', 'from', 'up', 'down', 'in', 23 | 'out', 'on', 'off', 'over', 'under', 'again', 24 | 'further', 'then', 'once', 'here', 'there', 'when', 25 | 'where', 'why', 'how', 'all', 'any', 'both', 'each', 26 | 'few', 'more', 'most', 'other', 'some', 'such', 'no', 27 | 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 28 | 'very', 's', 't', 'can', 'will', 'just', 'don', 29 | 'should', 'now', ''] 30 | 31 | #cleaning up text 32 | import re 33 | def get_only_chars(line): 34 | 35 | clean_line = "" 36 | 37 | line = line.replace("’", "") 38 | line = line.replace("'", "") 39 | line = line.replace("-", " ") #replace hyphens with spaces 40 | line = line.replace("\t", " ") 41 | line = line.replace("\n", " ") 42 | line = line.lower() 43 | 44 | for char in line: 45 | if char in 'qwertyuiopasdfghjklzxcvbnm ': 46 | clean_line += char 47 | else: 48 | clean_line += ' ' 49 | 50 | clean_line = re.sub(' +',' ',clean_line) #delete extra spaces 51 | if clean_line[0] == ' ': 52 | clean_line = clean_line[1:] 53 | return clean_line 54 | 55 | ######################################################################## 56 | # Synonym replacement 57 | # Replace n words in the sentence with synonyms from wordnet 58 | ######################################################################## 59 | 60 | #for the first time you use wordnet 61 | #import nltk 62 | #nltk.download('wordnet') 63 | from nltk.corpus import wordnet 64 | 65 | def synonym_replacement(words, n): 66 | new_words = words.copy() 67 | random_word_list = list(set([word for word in words if word not in stop_words])) 68 | random.shuffle(random_word_list) 69 | num_replaced = 0 70 | for random_word in random_word_list: 71 | synonyms = get_synonyms(random_word) 72 | if len(synonyms) >= 1: 73 | synonym = random.choice(list(synonyms)) 74 | new_words = [synonym if word == random_word else word for word in new_words] 75 | #print("replaced", random_word, "with", synonym) 76 | num_replaced += 1 77 | if num_replaced >= n: #only replace up to n words 78 | break 79 | 80 | #this is stupid but we need it, trust me 81 | sentence = ' '.join(new_words) 82 | new_words = sentence.split(' ') 83 | 84 | return new_words 85 | 86 | def get_synonyms(word): 87 | synonyms = set() 88 | for syn in wordnet.synsets(word): 89 | for l in syn.lemmas(): 90 | synonym = l.name().replace("_", " ").replace("-", " ").lower() 91 | synonym = "".join([char for 
char in synonym if char in ' qwertyuiopasdfghjklzxcvbnm']) 92 | synonyms.add(synonym) 93 | if word in synonyms: 94 | synonyms.remove(word) 95 | return list(synonyms) 96 | 97 | ######################################################################## 98 | # Random deletion 99 | # Randomly delete words from the sentence with probability p 100 | ######################################################################## 101 | 102 | def random_deletion(words, p): 103 | 104 | #obviously, if there's only one word, don't delete it 105 | if len(words) == 1: 106 | return words 107 | 108 | #randomly delete words with probability p 109 | new_words = [] 110 | for word in words: 111 | r = random.uniform(0, 1) 112 | if r > p: 113 | new_words.append(word) 114 | 115 | #if you end up deleting all words, just return a random word 116 | if len(new_words) == 0: 117 | rand_int = random.randint(0, len(words)-1) 118 | return [words[rand_int]] 119 | 120 | return new_words 121 | 122 | ######################################################################## 123 | # Random swap 124 | # Randomly swap two words in the sentence n times 125 | ######################################################################## 126 | 127 | def random_swap(words, n): 128 | new_words = words.copy() 129 | for _ in range(n): 130 | new_words = swap_word(new_words) 131 | return new_words 132 | 133 | def swap_word(new_words): 134 | random_idx_1 = random.randint(0, len(new_words)-1) 135 | random_idx_2 = random_idx_1 136 | counter = 0 137 | while random_idx_2 == random_idx_1: 138 | random_idx_2 = random.randint(0, len(new_words)-1) 139 | counter += 1 140 | if counter > 3: 141 | return new_words 142 | new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1] 143 | return new_words 144 | 145 | ######################################################################## 146 | # Random insertion 147 | # Randomly insert n words into the sentence 148 | ######################################################################## 149 | 150 | def random_insertion(words, n): 151 | new_words = words.copy() 152 | for _ in range(n): 153 | add_word(new_words) 154 | return new_words 155 | 156 | def add_word(new_words): 157 | synonyms = [] 158 | counter = 0 159 | while len(synonyms) < 1: 160 | random_word = new_words[random.randint(0, len(new_words)-1)] 161 | synonyms = get_synonyms(random_word) 162 | counter += 1 163 | if counter >= 10: 164 | return 165 | random_synonym = synonyms[0] 166 | random_idx = random.randint(0, len(new_words)-1) 167 | new_words.insert(random_idx, random_synonym) 168 | 169 | ######################################################################## 170 | # main data augmentation function 171 | ######################################################################## 172 | 173 | def eda(sentence, alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=9): 174 | 175 | sentence = get_only_chars(sentence) 176 | words = sentence.split(' ') 177 | words = [word for word in words if word is not ''] 178 | num_words = len(words) 179 | 180 | augmented_sentences = [] 181 | num_new_per_technique = int(num_aug/4)+1 182 | 183 | #sr 184 | if (alpha_sr > 0): 185 | n_sr = max(1, int(alpha_sr*num_words)) 186 | for _ in range(num_new_per_technique): 187 | a_words = synonym_replacement(words, n_sr) 188 | augmented_sentences.append(' '.join(a_words)) 189 | 190 | #ri 191 | if (alpha_ri > 0): 192 | n_ri = max(1, int(alpha_ri*num_words)) 193 | for _ in range(num_new_per_technique): 194 | a_words = random_insertion(words, n_ri) 
195 | augmented_sentences.append(' '.join(a_words)) 196 | 197 | #rs 198 | if (alpha_rs > 0): 199 | n_rs = max(1, int(alpha_rs*num_words)) 200 | for _ in range(num_new_per_technique): 201 | a_words = random_swap(words, n_rs) 202 | augmented_sentences.append(' '.join(a_words)) 203 | 204 | #rd 205 | if (p_rd > 0): 206 | for _ in range(num_new_per_technique): 207 | a_words = random_deletion(words, p_rd) 208 | augmented_sentences.append(' '.join(a_words)) 209 | 210 | augmented_sentences = [get_only_chars(sentence) for sentence in augmented_sentences] 211 | shuffle(augmented_sentences) 212 | 213 | #trim so that we have the desired number of augmented sentences 214 | if num_aug >= 1: 215 | augmented_sentences = augmented_sentences[:num_aug] 216 | else: 217 | keep_prob = num_aug / len(augmented_sentences) 218 | augmented_sentences = [s for s in augmented_sentences if random.uniform(0, 1) < keep_prob] 219 | 220 | #append the original sentence 221 | augmented_sentences.append(sentence) 222 | 223 | return augmented_sentences -------------------------------------------------------------------------------- /data/sst2_train_500.txt: -------------------------------------------------------------------------------- 1 | 1 neil burger here succeeded in making the mystery of four decades back the springboard for a more immediate mystery in the present 2 | 0 it is a visual rorschach test and i must have failed 3 | 0 the only way to tolerate this insipid brutally clueless film might be with a large dose of painkillers 4 | 0 scores no points for originality wit or intelligence 5 | 0 it would take a complete moron to foul up a screen adaptation of oscar wilde is classic satire 6 | 1 pure cinematic intoxication a wildly inventive mixture of comedy and melodrama tastelessness and swooning elegance 7 | 0 it is not the first time that director sara sugarman stoops to having characters drop their pants for laughs and not the last time she fails to provoke them 8 | 1 just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing 9 | 1 matthew lillard is born to play shaggy 10 | 0 it is drab 11 | 1 the film has several strong performances 12 | 1 munch is screenplay is tenderly observant of his characters 13 | 1 isabelle huppert excels as the enigmatic mika and anna mouglalis is a stunning new young talent in one of chabrol is most intense psychological mysteries 14 | 1 a cruelly funny twist on teen comedy packed with inventive cinematic tricks and an ironically killer soundtrack 15 | 0 predictably soulless techno tripe 16 | 1 tsai has a well deserved reputation as one of the cinema world is great visual stylists and in this film every shot enhances the excellent performances 17 | 0 some like it hot on the hardwood proves once again that a man in drag is not in and of himself funny 18 | 0 for a movie about the power of poetry and passion there is precious little of either 19 | 1 diggs and lathan are among the chief reasons brown sugar is such a sweet and sexy film 20 | 1 uneven but a lot of fun 21 | 0 when the film ended i felt tired and drained and wanted to lie on my own deathbed for a while 22 | 0 contains the humor characterization poignancy and intelligence of a bad sitcom 23 | 0 a pretentious and ultimately empty examination of a sick and evil woman 24 | 1 i can easily imagine benigni is pinocchio becoming a christmas perennial 25 | 1 generates an enormous feeling of empathy for its characters 26 | 1 reggio is continual visual barrage is absorbing as 
well as thought provoking 27 | 0 spreads itself too thin leaving these actors as well as the members of the commune short of profound characterizations 28 | 0 once again director chris columbus takes a hat in hand approach to rowling that stifles creativity and allows the film to drag on for nearly three hours 29 | 0 has all the hallmarks of a movie designed strictly for children is home video a market so insatiable it absorbs all manner of lame entertainment as long as year olds find it diverting 30 | 1 a whale of a good time for both children and parents seeking christian themed fun 31 | 0 there are plot holes big enough for shamu the killer whale to swim through 32 | 1 director andrew niccol demonstrates a wry understanding of the quirks of fame 33 | 1 a must for fans of british cinema if only because so many titans of the industry are along for the ride 34 | 0 though clearly well intentioned this cross cultural soap opera is painfully formulaic and stilted 35 | 0 that is because relatively nothing happens 36 | 0 novak contemplates a heartland so overwhelmed by its lack of purpose that it seeks excitement in manufactured high drama 37 | 0 utter mush conceited pap 38 | 0 cassavetes thinks he is making dog day afternoon with a cause but all he is done is to reduce everything he touches to a shrill didactic cartoon 39 | 0 disjointed parody 40 | 1 in fact it is quite fun in places 41 | 0 it can not be enjoyed even on the level that one enjoys a bad slasher flick primarily because it is dull 42 | 0 it is always disappointing when a documentary fails to live up to or offer any new insight into its chosen topic 43 | 0 in the era of the sopranos it feels painfully redundant and inauthentic 44 | 1 a hypnotic cyber hymn and a cruel story of youth culture 45 | 0 much of the cast is stiff or just plain bad 46 | 1 i ve yet to find an actual vietnam war combat movie actually produced by either the north or south vietnamese but at least now we ve got something pretty damn close 47 | 0 a bizarre piece of work with premise and dialogue at the level of kids television and plot threads as morose as teen pregnancy rape and suspected murder 48 | 0 a documentary to make the stones weep as shameful as it is scary 49 | 0 there is got to be a more graceful way of portraying the devastation of this disease 50 | 1 and your reward will be a thoughtful emotional movie experience 51 | 1 brash intelligent and erotically perplexing haneke is portrait of an upper class austrian society and the suppression of its tucked away demons is uniquely felt with a sardonic jolt 52 | 0 it forces you to watch people doing unpleasant things to each other and themselves and it maintains a cool distance from its material that is deliberately unsettling 53 | 0 it is the element of condescension as the filmmakers look down on their working class subjects from their lofty perch that finally makes sex with strangers which opens today in the new york metropolitan area so distasteful 54 | 1 nothing short of a masterpiece and a challenging one 55 | 0 i just saw this movie well it is probably not accurate to call it a movie 56 | 0 a muted freak out 57 | 0 there is an underlying old world sexism to monday morning that undercuts its charm 58 | 0 taken as a whole the tuxedo does nt add up to a whole lot 59 | 1 though the film is static its writer director is heart is in the right place his plea for democracy and civic action laudable 60 | 0 with mcconaughey in an entirely irony free zone and bale reduced mainly to batting his sensitive eyelids 
there is not enough intelligence wit or innovation on the screen to attract and sustain an older crowd 61 | 1 skin of man gets a few cheap shocks from its kids in peril theatrics but it also taps into the primal fears of young people trying to cope with the mysterious and brutal nature of adults 62 | 1 what it lacks in substance it makes up for in heart 63 | 1 like all great films about a life you never knew existed it offers much to absorb and even more to think about after the final frame 64 | 0 an average b movie with no aspirations to be anything more 65 | 1 standard guns versus martial arts cliche with little new added 66 | 1 a beautifully tooled action thriller about love and terrorism in korea 67 | 0 starts out with tremendous promise introducing an intriguing and alluring premise only to fall prey to a boatload of screenwriting cliches that sink it faster than a leaky freighter 68 | 0 hard core slasher aficionados will find things to like but overall the halloween series has lost its edge 69 | 1 a college story that works even without vulgarity sex scenes and cussing 70 | 0 insufferably naive 71 | 1 it is a strength of a documentary to disregard available bias especially as temptingly easy as it would have been with this premise 72 | 1 those with a modicum of patience will find in these characters foibles a timeless and unique perspective 73 | 0 its and pieces of the hot chick are so hilarious and schneider is performance is so fine it is a real shame that so much of the movie again as in the animal is a slapdash mess 74 | 1 there are some movies that hit you from the first scene and you know it is going to be a trip 75 | 1 it proves quite compelling as an intense brooding character study 76 | 0 a loud ugly irritating movie without any of its satirical salvos hitting a discernible target 77 | 0 you can taste it but there is no fizz 78 | 1 the quality of the art combined with the humor and intelligence of the script allow the filmmakers to present the biblical message of forgiveness without it ever becoming preachy or syrupy 79 | 0 the x potion gives the quickly named blossom bubbles and buttercup supernatural powers that include extraordinary strength and laser beam eyes which unfortunately do nt enable them to discern flimsy screenplays 80 | 1 rifkin is references are impeccable throughout 81 | 0 too bad none of it is funny 82 | 1 a gleefully grungy hilariously wicked black comedy 83 | 0 you leave the same way you came a few tasty morsels under your belt but no new friends 84 | 1 though it runs minutes safe conduct is anything but languorous 85 | 1 it is both degrading and strangely liberating to see people working so hard at leading lives of sexy intrigue only to be revealed by the dispassionate gantz brothers as ordinary pasty lumpen 86 | 0 they crush each other under cars throw each other out windows electrocute and dismember their victims in full consciousness 87 | 0 in any case i would recommend big bad love only to winger fans who have missed her since is forget paris 88 | 1 as surreal as a dream and as detailed as a photograph as visually dexterous as it is at times imaginatively overwhelming 89 | 1 even if you do nt know the band or the album is songs by heart you will enjoy seeing how both evolve and you will also learn a good deal about the state of the music business in the st century 90 | 1 with an unflappable air of decadent urbanity everett remains a perfect wildean actor and a relaxed firth displays impeccable comic skill 91 | 1 unpretentious charming quirky 
original 92 | 0 a processed comedy chop suey 93 | 0 a sequel that is much too big for its britches 94 | 0 a complete waste of time 95 | 0 a well intentioned effort that is still too burdened by the actor is offbeat sensibilities for the earnest emotional core to emerge with any degree of accessibility 96 | 1 assayas ambitious sometimes beautiful adaptation of jacques chardonne is novel 97 | 0 despite the fact that this film was nt as bad as i thought it was going to be it is still not a good movie 98 | 0 guys say mean things and shoot a lot of bullets 99 | 0 a manipulative feminist empowerment tale thinly posing as a serious drama about spousal abuse 100 | 0 this movie is so bad that it is almost worth seeing because it is so bad 101 | 0 with a romantic comedy plotline straight from the ages this cinderella story does nt have a single surprise up its sleeve 102 | 1 and more than that it is an observant unfussily poetic meditation about identity and alienation 103 | 0 but it could have been worse 104 | 0 most viewers will wish there had been more of the queen and less of the damned 105 | 0 a science fiction pastiche so lacking in originality that if you stripped away its inspirations there would be precious little left 106 | 1 the pleasures that it does afford may be enough to keep many moviegoers occupied amidst some of the more serious minded concerns of other year end movies 107 | 0 as plain and pedestrian as catsup 108 | 0 every conceivable mistake a director could make in filming opera has been perpetrated here 109 | 1 more concerned with overall feelings broader ideas and open ended questions than concrete story and definitive answers soderbergh is solaris is a gorgeous and deceptively minimalist cinematic tone poem 110 | 0 no cute factor here not that i mind ugly the problem is he has no character loveable or otherwise 111 | 1 filmmaker stacy peralta has a flashy editing style that does nt always jell with sean penn is monotone narration but he respects the material without sentimentalizing it 112 | 1 you do nt need to be a hip hop fan to appreciate scratch and that is the mark of a documentary that works 113 | 0 i was trying to decide what annoyed me most about god is great i m not and then i realized that i just did nt care 114 | 1 metaphors abound but it is easy to take this film at face value and enjoy its slightly humorous and tender story 115 | 1 a comedy that swings and jostles to the rhythms of life 116 | 0 if you re looking to rekindle the magic of the first film you ll need a stronger stomach than us 117 | 1 a modestly made but profoundly moving documentary 118 | 0 pc stability notwithstanding the film suffers from a simplistic narrative and a pat fairy tale conclusion 119 | 0 has about th the fun of its spry predecessor but it is a rushed slapdash sequel for the sake of a sequel with less than half the plot and ingenuity 120 | 1 remarkable for its intelligence and intensity 121 | 1 i do nt know if frailty will turn bill paxton into an a list director but he can rest contentedly with the knowledge that he is made at least one damn fine horror movie 122 | 1 one of the year is most weirdly engaging and unpredictable character pieces 123 | 0 shot like a postcard and overacted with all the boozy self indulgence that brings out the worst in otherwise talented actors 124 | 0 barney is ideas about creation and identity do nt really seem all that profound at least by way of what can be gleaned from this three hour endurance test built around an hour is worth of actual material 125 
| 1 delia greta and paula rank as three of the most multilayered and sympathetic female characters of the year 126 | 0 enough is not a bad movie just mediocre 127 | 0 but the cinematography is cloudy the picture making becalmed 128 | 0 a terrible adaptation of a play that only ever walked the delicate tightrope between farcical and loathsome 129 | 0 slap me i saw this movie 130 | 1 for those of an indulgent slightly sunbaked and summery mind sex and lucia may well prove diverting enough 131 | 0 reign of fire never comes close to recovering from its demented premise but it does sustain an enjoyable level of ridiculousness 132 | 0 the movie ends with outtakes in which most of the characters forget their lines and just utter uhhh which is better than most of the writing in the movie 133 | 1 try as you might to resist if you ve got a place in your heart for smokey robinson this movie will worm its way there 134 | 0 low rent from frame one 135 | 1 we know the plot is a little crazy but it held my interest from start to finish 136 | 0 dull a road trip movie that is surprisingly short of both adventure and song 137 | 0 the colorful masseur wastes its time on mood rather than riding with the inherent absurdity of ganesh is rise up the social ladder 138 | 1 it sends you away a believer again and quite cheered at just that 139 | 1 we ve seen it all before in one form or another but director hoffman with great help from kevin kline makes us care about this latest reincarnation of the world is greatest teacher 140 | 1 watching this gentle mesmerizing portrait of a man coming to terms with time you barely realize your mind is being blown 141 | 1 at its most basic this cartoon adventure is that wind in the hair exhilarating 142 | 1 the charms of the lead performances allow us to forget most of the film is problems 143 | 0 hampered no paralyzed by a self indulgent script that aims for poetry and ends up sounding like satire 144 | 1 a sloppy amusing comedy that proceeds from a stunningly unoriginal premise 145 | 1 the film aims to be funny uplifting and moving sometimes all at once 146 | 1 the story is inspiring ironic and revelatory of just how ridiculous and money oriented the record industry really is 147 | 0 like a fish that is lived too long austin powers in goldmember has some unnecessary parts and is kinda wrong in places 148 | 1 what distinguishes time of favor from countless other thrillers is its underlying concern with the consequences of words and with the complicated emotions fueling terrorist acts 149 | 1 an entertaining if ultimately minor thriller 150 | 1 just when you think that every possible angle has been exhausted by documentarians another new film emerges with yet another remarkable yet shockingly little known perspective 151 | 0 it could have been something special but two things drag it down to mediocrity director clare peploe is misunderstanding of marivaux is rhythms and mira sorvino is limitations as a classical actress 152 | 1 a summer entertainment adults can see without feeling embarrassed but it could have been more 153 | 0 fails in making this character understandable in getting under her skin in exploring motivation well before the end the film grows as dull as its characters about whose fate it is hard to care 154 | 0 turns a potentially interesting idea into an excruciating film school experience that plays better only for the film is publicists or for people who take as many drugs as the film is characters 155 | 1 it is a wonderful sobering heart felt drama 156 | 0 the 
movie does nt think much of its characters its protagonist or of us 157 | 0 it is too self important and plodding to be funny and too clipped and abbreviated to be an epic 158 | 1 this surreal gilliam esque film is also a troubling interpretation of ecclesiastes 159 | 0 the most offensive thing about the movie is that hollywood expects people to pay to see it 160 | 1 in the end the film is less the cheap thriller you d expect than it is a fairly revealing study of its two main characters damaged goods people whose orbits will inevitably and dangerously collide 161 | 0 the entire movie is about a boring sad man being boring and sad 162 | 0 just not campy enough 163 | 0 despite an impressive roster of stars and direction from kathryn bigelow the weight of water is oppressively heavy 164 | 0 when your subject is illusion versus reality should nt the reality seem at least passably real 165 | 1 khouri manages with terrific flair to keep the extremes of screwball farce and blood curdling family intensity on one continuum 166 | 1 it is a masterpiece 167 | 1 romantic comedy and dogme filmmaking may seem odd bedfellows but they turn out to be delightfully compatible here 168 | 1 about schmidt belongs to nicholson 169 | 1 macdowell gives give a solid anguished performance that eclipses nearly everything else she is ever done 170 | 0 wewannour money back actually 171 | 1 it is a clear eyed portrait of an intensely lived time filled with nervous energy moral ambiguity and great uncertainties 172 | 0 not exactly the bees knees 173 | 1 michael gerbosi is script is economically packed with telling scenes 174 | 0 director douglas mcgrath takes on nickleby with all the halfhearted zeal of an th grade boy delving into required reading 175 | 0 that the true story by which all the queen is men is allegedly inspired was a lot funnier and more deftly enacted than what is been cobbled together onscreen 176 | 0 the end result is like cold porridge with only the odd enjoyably chewy lump 177 | 1 it haunts you you ca nt forget it you admire its conception and are able to resolve some of the confusions you had while watching it 178 | 0 forget the psychology study of romantic obsession and just watch the procession of costumes in castles and this wo nt seem like such a bore 179 | 0 the kind of film that leaves you scratching your head in amazement over the fact that so many talented people could participate in such an ill advised and poorly executed idea 180 | 0 off the hook is overlong and not well acted but credit writer producer director adam watstein with finishing it at all 181 | 0 at the very least if you do nt know anything about derrida when you walk into the theater you wo nt know much more when you leave 182 | 1 sly sophisticated and surprising 183 | 1 a new film from bill plympton the animation master is always welcome 184 | 0 it does nt flinch from its unsettling prognosis namely that the legacy of war is a kind of perpetual pain 185 | 0 more tiring than anything 186 | 1 his work with actors is particularly impressive 187 | 0 sunk by way too much indulgence of scene chewing teeth gnashing actorliness 188 | 1 it is an unstinting look at a collaboration between damaged people that may or may not qual 189 | 0 collapses under its own meager weight 190 | 0 quitting however manages just to be depressing as the lead actor phones in his autobiographical performance 191 | 0 the drama was so uninspiring that even a story immersed in love lust and sin could nt keep my attention 192 | 0 due to some script weaknesses 
and the casting of the director is brother the film trails off into inconsequentiality 193 | 0 suspend your disbelief here and now or you ll be shaking your head all the way to the credits 194 | 0 i did nt smile 195 | 1 for all its problems the lady and the duke surprisingly manages never to grow boring which proves that rohmer still has a sense of his audience 196 | 0 it is a drag how nettelbeck sees working women or at least this working woman for whom she shows little understanding 197 | 0 the script is a disaster with cloying messages and irksome characters 198 | 0 in its best moments resembles a bad high school production of grease without benefit of song 199 | 1 it is a fine focused piece of work that reopens an interesting controversy and never succumbs to sensationalism 200 | 1 a triumph of pure craft and passionate heart 201 | 1 not everything in this ambitious comic escapade works but coppola along with his sister sofia is a real filmmaker 202 | 1 the emotions are raw and will strike a nerve with anyone who is ever had family trauma 203 | 1 it deserves to be seen by anyone with even a passing interest in the events shaping the world beyond their own horizons 204 | 0 two bit potboiler 205 | 0 the movie directed by mick jackson leaves no cliche unturned from the predictable plot to the characters straight out of central casting 206 | 1 fun and nimble 207 | 0 big mistake 208 | 1 the film boasts dry humor and jarring shocks plus moments of breathtaking mystery 209 | 1 you may feel compelled to watch the film twice or pick up a book on the subject 210 | 1 west coast rap wars this modern mob music drama never fails to fascinate 211 | 1 children christian or otherwise deserve to hear the full story of jonah is despair in all its agonizing catch glory even if they spend years trying to comprehend it 212 | 0 if they broke out into elaborate choreography singing and finger snapping it might have held my attention but as it stands i kept looking for the last exit from brooklyn 213 | 1 i could nt recommend this film more 214 | 1 translating complex characters from novels to the big screen is an impossible task but they are true to the essence of what it is to be ya ya 215 | 0 their parents would do well to cram earplugs in their ears and put pillowcases over their heads for minutes 216 | 1 rewarding 217 | 1 upsetting and thought provoking the film has an odd purity that does nt bring you into the characters so much as it has you study them 218 | 0 starts as a tart little lemon drop of a movie and ends up as a bitter pill 219 | 0 a little less extreme than in the past with longer exposition sequences between them and with fewer gags to break the tedium 220 | 1 a funny triumphant and moving documentary 221 | 1 an entertaining mix of period drama and flat out farce that should please history fans 222 | 0 during the tuxedo is minutes of screen time there is nt one true chan moment 223 | 1 there is just something about watching a squad of psychopathic underdogs whale the tar out of unsuspecting lawmen that reaches across time and distance 224 | 1 a series of tales told with the intricate preciseness of the best short story writing 225 | 1 a bright inventive thoroughly winning flight of revisionist fancy 226 | 0 almost peerlessly unsettling 227 | 1 a dashing and absorbing outing with one of france is most inventive directors 228 | 1 a true delight 229 | 0 complete lack of originality cleverness or even visible effort 230 | 1 a few nonbelievers may rethink their attitudes when they see the joy the 
characters take in this creed but skeptics are nt likely to enter the theater 231 | 1 like the rugrats movies the wild thornberrys movie does nt offer much more than the series but its emphasis on caring for animals and respecting other cultures is particularly welcome 232 | 0 borstal boy represents the worst kind of filmmaking the kind that pretends to be passionate and truthful but is really frustratingly timid and soggy 233 | 1 you feel good you feel sad you feel pissed off but in the end you feel alive which is what they did 234 | 0 director tom shadyac and star kevin costner glumly mishandle the story is promising premise of a physician who needs to heal himself 235 | 1 as relationships shift director robert j siegel allows the characters to inhabit their world without cleaving to a narrative arc 236 | 0 deadeningly dull mired in convoluted melodrama nonsensical jargon and stiff upper lip laboriousness 237 | 1 jacquot has filmed the opera exactly as the libretto directs ideally capturing the opera is drama and lyricism 238 | 1 it can be safely recommended as a video dvd babysitter 239 | 0 it is played in the most straight faced fashion with little humor to lighten things up 240 | 1 though it goes further than both anyone who has seen the hunger or cat people will find little new here but a tasty performance from vincent gallo lifts this tale of cannibal lust above the ordinary 241 | 1 the rich performances by friel and especially williams an american actress who becomes fully english round out the square edges 242 | 0 amazingly lame 243 | 1 more good than great but freeman and judd make it work 244 | 0 a battle between bug eye theatre and dead eye matinee 245 | 0 i m sorry to say that this should seal the deal arnold is not nor will he be back 246 | 1 though jackson does nt always succeed in integrating the characters in the foreground into the extraordinarily rich landscape it must be said that he is an imaginative filmmaker who can see the forest for the trees 247 | 0 van wilder has a built in audience but only among those who are drying out from spring break and are still unconcerned about what they ingest 248 | 1 what sets ms birot is film apart from others in the genre is a greater attention to the parents and particularly the fateful fathers in the emotional evolution of the two bewitched adolescents 249 | 0 a sentimental mess that never rings true 250 | 1 but the talented cast alone will keep you watching as will the fight scenes 251 | 1 allen is underestimated charm delivers more goodies than lumps of coal 252 | 1 an elegant work food of love is as consistently engaging as it is revealing 253 | 1 zoom 254 | 1 a huge box office hit in korea shiri is a must for genre fans 255 | 1 it is a technically superb film shining with all the usual spielberg flair expertly utilizing the talents of his top notch creative team 256 | 1 what begins as a conventional thriller evolves into a gorgeously atmospheric meditation on life changing chance encounters 257 | 1 a film with a great premise but only a great premise 258 | 1 on that score the film certainly does nt disappoint 259 | 1 the acting costumes music cinematography and sound are all astounding given the production is austere locales 260 | 1 vincent gallo is right at home in this french shocker playing his usual bad boy weirdo role 261 | 1 very well written and directed with brutal honesty and respect for its audience 262 | 0 one senses in world traveler and in his earlier film that freundlich bears a grievous but obscure complaint 
against fathers and circles it obsessively without making contact 263 | 1 neither parker nor donovan is a typical romantic lead but they bring a fresh quirky charm to the formula 264 | 1 a giggle a minute 265 | 0 in the end the film feels homogenized and a bit contrived as if we re looking back at a tattered and ugly past with rose tinted glasses 266 | 1 an unusually dry eyed even analytical approach to material that is generally played for maximum moisture 267 | 1 it made me want to get made up and go see this movie with my sisters 268 | 0 neither revelatory nor truly edgy merely crassly flamboyant and comedically labored 269 | 1 boasts a handful of virtuosic set pieces and offers a fair amount of trashy kinky fun 270 | 0 i do nt mind having my heartstrings pulled but do nt treat me like a fool 271 | 1 this is a sincerely crafted picture that deserves to emerge from the traffic jam of holiday movies 272 | 0 an unintentionally surreal kid is picture in which actors in bad bear suits enact a sort of inter species parody of a vh behind the music episode 273 | 1 gay or straight kissing jessica stein is one of the greatest date movies in years 274 | 0 it looks good but it is essentially empty 275 | 1 and there is no way you wo nt be talking about the film once you exit the theater 276 | 0 much like robin williams death to smoochy has already reached its expiration date 277 | 1 if you love the music and i do its hard to imagine having more fun watching a documentary 278 | 0 a collage of clich s and a dim echo of allusions to other films 279 | 1 norton is magnetic as graham 280 | 1 k the widowmaker is a great yarn 281 | 1 it might be easier to watch on video at home but that should nt stop die hard french film connoisseurs from going out and enjoying the big screen experience 282 | 0 manages to be both repulsively sadistic and mundane 283 | 0 an obvious copy of one of the best films ever made how could it not be 284 | 1 surprisingly the film is a hilarious adventure and i shamelessly enjoyed it 285 | 1 the cat is meow marks a return to form for director peter bogdanovich 286 | 0 it is an odd show pregnant with moods stillborn except as a harsh conceptual exercise 287 | 0 but if the essence of magic is its make believe promise of life that soars above the material realm this is the opposite of a truly magical movie 288 | 0 the film is all over the place really 289 | 0 without any redeeming value whatsoever 290 | 1 it is a familiar story but one that is presented with great sympathy and intelligence 291 | 0 manages to show life in all of its banality when the intention is quite the opposite 292 | 1 read my lips is to be viewed and treasured for its extraordinary intelligence and originality as well as its lyrical variations on the game of love 293 | 0 this director is cut which adds minutes takes a great film and turns it into a mundane soap opera 294 | 1 the ensemble cast turns in a collectively stellar performance and the writing is tight and truthful full of funny situations and honest observations 295 | 1 what saves this deeply affecting film from being merely a collection of wrenching cases is corcuera is attention to detail 296 | 1 take nothing seriously and enjoy the ride 297 | 1 from the opening strains of the average white band is pick up the pieces you can feel the love 298 | 0 while the ensemble player who gained notice in guy ritchie is lock stock and two smoking barrels and snatch has the bod he is unlikely to become a household name on the basis of his first starring vehicle 299 | 0 
i could nt help but feel the wasted potential of this slapstick comedy 300 | 1 it is an offbeat treat that pokes fun at the democratic exercise while also examining its significance for those who take part 301 | 0 nothing too deep or substantial 302 | 0 this picture is mostly a lump of run of the mill profanity sprinkled with a few remarks so geared toward engendering audience sympathy that you might think he was running for office or trying to win over a probation officer 303 | 0 a boring parade of talking heads and technical gibberish that will do little to advance the linux cause 304 | 0 the problem with the mayhem in formula is not that it is offensive but that it is boring 305 | 0 as pedestrian as they come 306 | 0 parents beware this is downright movie penance 307 | 0 really does feel like a short stretched out to feature length 308 | 0 no one but a convict guilty of some truly heinous crime should have to sit through the master of disguise 309 | 1 may take its sweet time to get wherever it is going but if you have the patience for it you wo nt feel like it is wasted yours 310 | 0 would ve been nice if the screenwriters had trusted audiences to understand a complex story and left off the film is predictable denouement 311 | 1 i am not generally a huge fan of cartoons derived from tv shows but hey arnold 312 | 1 brings to a spectacular completion one of the most complex generous and subversive artworks of the last decade 313 | 1 reveals how important our special talents can be when put in service of of others 314 | 1 the gags that fly at such a furiously funny pace that the only rip off that we were aware of was the one we felt when the movie ended so damned soon 315 | 1 more mature than fatal attraction more complete than indecent proposal and more relevant than weeks unfaithful is at once intimate and universal cinema 316 | 1 it is fairly solid not to mention well edited so that it certainly does nt feel like a film that strays past the two and a half mark 317 | 1 while somewhat less than it might have been the film is a good one and you ve got to hand it to director george clooney for biting off such a big job the first time out 318 | 1 routine harmless diversion and little else 319 | 1 cremaster is at once a tough pill to swallow and a minor miracle of self expression 320 | 1 this is human comedy at its most amusing interesting and confirming 321 | 1 a story we have nt seen on the big screen before and it is a story that we as americans and human beings should know 322 | 0 just about everyone involved here seems to be coasting 323 | 1 a tour de force of modern cinema 324 | 1 uplifting funny and wise 325 | 0 it is just merely very bad 326 | 1 it will guarantee to have you leaving the theater with a smile on your face 327 | 0 simplistic silly and tedious 328 | 1 passions obsessions and loneliest dark spots are pushed to their most virtuous limits lending the narrative an unusually surreal tone 329 | 1 thanks to confident filmmaking and a pair of fascinating performances the way to that destination is a really special walk in the woods 330 | 0 it is provocative stuff but the speculative effort is hampered by taylor is cartoonish performance and the film is ill considered notion that hitler is destiny was shaped by the most random of chances 331 | 0 the animation and game phenomenon that peaked about three years ago is actually dying a slow death if the poor quality of pokemon ever is any indication 332 | 1 the script is smart not cloying 333 | 0 a muddy psychological thriller rife 
with miscalculations 334 | 1 the wild thornberrys movie is a jolly surprise 335 | 1 land people and narrative flow together in a stark portrait of motherhood deferred and desire explored 336 | 0 unless there are zoning ordinances to protect your community from the dullest science fiction impostor is opening today at a theater near you 337 | 1 world traveler might not go anywhere new or arrive anyplace special but it is certainly an honest attempt to get at something 338 | 1 at once subtle and visceral the film never succumbs to the trap of the maudlin or tearful offering instead with its unflinching gaze a measure of faith in the future 339 | 1 years of russian history and culture compressed into an evanescent seamless and sumptuous stream of consciousness 340 | 0 the film is maudlin focus on the young woman is infirmity and her naive dreams play like the worst kind of hollywood heart string plucking 341 | 0 director uwe boll and the actors provide scant reason to care in this crude s throwback 342 | 1 intensely romantic thought provoking and even an engaging mystery 343 | 0 the characters are paper thin and the plot is so cliched and contrived that it makes your least favorite james bond movie seem as cleverly plotted as the usual suspects 344 | 1 de niro is a veritable source of sincere passion that this hollywood contrivance orbits around 345 | 1 jonathan parker is bartleby should have been the be all end all of the modern office anomie films 346 | 1 it is a piece of handiwork that shows its indie tatters and self conscious seams in places but has some quietly moving moments and an intelligent subtlety 347 | 0 it is a barely tolerable slog over well trod ground 348 | 0 it is tough to tell which is in more abundant supply in this woefully hackneyed movie directed by scott kalvert about street gangs and turf wars in brooklyn stale cliches gratuitous violence or empty machismo 349 | 0 the script by vincent r nebrida tries to cram too many ingredients into one small pot 350 | 1 strangely comes off as a kingdom more mild than wild 351 | 0 thoroughly awful 352 | 1 a moving story of determination and the human spirit 353 | 1 a naturally funny film home movie makes you crave chris smith is next movie 354 | 0 the only question is to determine how well the schmaltz is manufactured to assess the quality of the manipulative engineering 355 | 0 the premise of abandon holds promise but its delivery is a complete mess 356 | 0 plays like one of those conversations that comic book guy on the simpsons has 357 | 0 in the book on tape market the film of the kid stays in the picture would be an abridged edition 358 | 1 and educational 359 | 1 blisteringly rude scarily funny sorrowfully sympathetic to the damage it surveys the film has in kieran culkin a pitch perfect holden 360 | 0 to get at the root psychology of this film would require many sessions on the couch of dr freud 361 | 0 the young stars are too cute the story and ensuing complications are too manipulative the message is too blatant the resolutions are too convenient 362 | 1 davis candid archly funny and deeply authentic take on intimate relationships comes to fruition in her sophomore effort 363 | 1 not as good as the full monty but a really strong second effort 364 | 0 even bigger and more ambitious than the first installment spy kids looks as if it were made by a highly gifted year old instead of a grown man 365 | 0 includes too much obvious padding 366 | 0 the story alone could force you to scratch a hole in your head 367 | 0 we never truly 
come to care about the main characters and whether or not they ll wind up together and michele is spiritual quest is neither amusing nor dramatic enough to sustain interest 368 | 1 this is nt exactly profound cinema but it is good natured and sometimes quite funny 369 | 0 impostor ca nt think of a thing to do with these characters except have them run through dark tunnels fight off various anonymous attackers and evade elaborate surveillance technologies 370 | 0 and that leaves a hole in the center of the salton sea 371 | 1 chamber of secrets will find millions of eager fans 372 | 0 seagal ran out of movies years ago and this is just the proof 373 | 0 a movie like the guys is why film criticism can be considered work 374 | 1 as it turns out you can go home again 375 | 1 her performance moves between heartbreak and rebellion as she continually tries to accommodate to fit in and gain the unconditional love she seeks 376 | 0 a low rate annie featuring some kid who ca nt act only echoes of jordan and weirdo actor crispin glover screwing things up old school 377 | 1 rock solid family fun out of the gates extremely imaginative through out but wanes in the middle 378 | 0 if you go pack your knitting needles 379 | 0 a technical triumph and an extraordinary bore 380 | 0 if you re not fans of the adventues of steve and terri you should avoid this like the dreaded king brown snake 381 | 0 the comedy death to smoochy is a rancorous curiosity a movie without an apparent audience 382 | 1 the entire cast is extraordinarily good 383 | 1 hugh grant who has a good line in charm has never been more charming than in about a boy 384 | 1 delivers the sexy razzle dazzle that everyone especially movie musical fans has been hoping for 385 | 1 a gripping movie played with performances that are all understated and touching 386 | 0 hoffman waits too long to turn his movie in an unexpected direction and even then his tone retains a genteel prep school quality that feels dusty and leatherbound 387 | 1 an ambitious what if 388 | 1 uses high comedy to evoke surprising poignance 389 | 0 contains a few big laughs but many more that graze the funny bone or miss it altogether in part because the consciously dumbed down approach wears thin 390 | 1 the journey to the secret is eventual discovery is a separate adventure and thrill enough 391 | 1 it is one heck of a character study not of hearst or davies but of the unique relationship between them 392 | 1 a live wire film that never loses its ability to shock and amaze 393 | 1 a real audience pleaser that will strike a chord with anyone who is ever waited in a doctor is office emergency room hospital bed or insurance company office 394 | 0 there is no good answer to that one 395 | 0 the film contains no good jokes no good scenes barely a moment when carvey is saturday night live honed mimicry rises above the level of embarrassment 396 | 0 as inept as big screen remakes of the avengers and the wild wild west 397 | 0 it is difficult to feel anything much while watching this movie beyond mild disturbance or detached pleasure at the acting 398 | 0 almost as offensive as freddy got fingered 399 | 1 this is a shrewd and effective film from a director who understands how to create and sustain a mood 400 | 1 the bai brothers have taken an small slice of history and opened it up for all of us to understand and they ve told a nice little story in the process 401 | 1 knows how to make our imagination wonder 402 | 1 fear permeates the whole of stortelling todd solondz oftentimes funny yet 
ultimately cowardly autocritique 403 | 1 a cutesy romantic tale with a twist 404 | 0 violent vulgar and forgettably entertaining 405 | 1 though its story is only surface deep the visuals and enveloping sounds of blue crush make this surprisingly decent flick worth a summertime look see 406 | 1 sad to say it accurately reflects the rage and alienation that fuels the self destructiveness of many young people 407 | 0 an allegory concerning the chronically mixed signals african american professionals get about overachieving could be intriguing but the supernatural trappings only obscure the message 408 | 1 the film has the high buffed gloss and high octane jolts you expect of de palma but what makes it transporting is that it is also one of the smartest most pleasurable expressions of pure movie love to come from an american director in years 409 | 1 wonderful fencing scenes and an exciting plot make this an eminently engrossing film 410 | 1 if mostly martha is mostly unsurprising it is still a sweet even delectable diversion 411 | 1 one of the most slyly exquisite anti adult movies ever made 412 | 1 even when there are lulls the emotions seem authentic and the picture is so lovely toward the end you almost do nt notice the minute running time 413 | 0 comes across as a fairly weak retooling 414 | 0 time out is as serious as a pink slip 415 | 0 a depressingly retrograde post feminist romantic comedy that takes an astonishingly condescending attitude toward women 416 | 0 you might want to take a reality check before you pay the full ticket price to see simone and consider a dvd rental instead 417 | 1 young hanks and fisk who vaguely resemble their celebrity parents bring fresh good looks and an ease in front of the camera to the work 418 | 0 if you re looking for a story do nt bother 419 | 1 the film is hard to dismiss moody thoughtful and lit by flashes of mordant humor 420 | 1 a deeply felt and vividly detailed story about newcomers in a strange new world 421 | 0 an ugly pointless stupid movie 422 | 0 to honestly address the flaws inherent in how medical aid is made available to american workers a more balanced or fair portrayal of both sides will be needed 423 | 1 the very definition of the small movie but it is a good stepping stone for director sprecher 424 | 1 a solid cast assured direction and complete lack of modern day irony 425 | 0 burns never really harnesses to full effect the energetic cast 426 | 1 the difference between cho and most comics is that her confidence in her material is merited 427 | 1 like its bizarre heroine it irrigates our souls 428 | 1 in between the icy stunts the actors spout hilarious dialogue about following your dream and just letting the mountain tell you what to do 429 | 0 in an effort i suspect not to offend by appearing either too serious or too lighthearted it offends by just being wishy washy 430 | 0 it all comes down to whether you can tolerate leon barlow 431 | 0 starts promisingly but disintegrates into a dreary humorless soap opera 432 | 1 there is enough cool fun here to warm the hearts of animation enthusiasts of all ages 433 | 1 the vitality of the actors keeps the intensity of the film high even as the strafings blend together 434 | 1 a true blue delight 435 | 0 despite auteuil is performance it is a rather listless amble down the middle of the road where the thematic ironies are too obvious and the sexual politics too smug 436 | 1 well acted well directed and for all its moodiness not too pretentious 437 | 0 adrift bentley and hudson stare and 
sniffle respectively as ledger attempts in vain to prove that movie star intensity can overcome bad hair design 438 | 0 it is so downbeat and nearly humorless that it becomes a chore to sit through despite some first rate performances by its lead 439 | 0 you leave feeling like you ve endured a long workout without your pulse ever racing 440 | 1 a poignant artfully crafted meditation on mortality 441 | 0 there is a scientific law to be discerned here that producers would be well to heed mediocre movies start to drag as soon as the action speeds up when the explosions start they fall to pieces 442 | 1 a dreadful day in irish history is given passionate if somewhat flawed treatment 443 | 1 the pleasure of read my lips is like seeing a series of perfect black pearls clicking together to form a string 444 | 1 overcomes its visual hideousness with a sharp script and strong performances 445 | 0 just too silly and sophomoric to ensnare its target audience 446 | 1 if it is not entirely memorable the movie is certainly easy to watch 447 | 1 if you can push on through the slow spots you ll be rewarded with some fine acting 448 | 0 it is too bad that this likable movie is nt more accomplished 449 | 1 tadpole may be one of the most appealing movies ever made about an otherwise appalling and downright creepy subject a teenage boy in love with his stepmother 450 | 0 it just goes to show an intelligent person is nt necessarily an admirable storyteller 451 | 1 if ayurveda can help us return to a sane regimen of eating sleeping and stress reducing contemplation it is clearly a good thing 452 | 1 the story is a rather simplistic one grief drives her love drives him and a second chance to find love in the most unlikely place it struck a chord in me 453 | 0 plays like a glossy melodrama that occasionally verges on camp 454 | 0 as aimless as an old pickup skidding completely out of control on a long patch of black ice the movie makes two hours feel like four 455 | 1 a searing epic treatment of a nationwide blight that seems to be horrifyingly ever on the rise 456 | 0 what soured me on the santa clause was that santa bumps up against st century reality so hard it is icky 457 | 1 as quiet patient and tenacious as mr lopez himself who approaches his difficult endless work with remarkable serenity and discipline 458 | 0 shallow noisy and pretentious 459 | 1 a light yet engrossing piece 460 | 0 my only wish is that celebi could take me back to a time before i saw this movie and i could just skip it 461 | 0 it is one pussy ass world when even killer thrillers revolve around group therapy sessions 462 | 1 infidelity drama is nicely shot well edited and features a standout performance by diane lane 463 | 0 rarely has sex on screen been so aggressively anti erotic 464 | 1 there is enough originality in life to distance it from the pack of paint by number romantic comedies that so often end up on cinema screens 465 | 0 a movie that quite simply should nt have been made 466 | 1 further proof that the epicenter of cool beautiful thought provoking foreign cinema is smack dab in the middle of dubya is axis of evil 467 | 1 writer director david jacobson and his star jeremy renner have made a remarkable film that explores the monster is psychology not in order to excuse him but rather to demonstrate that his pathology evolved from human impulses that grew hideously twisted 468 | 1 will amuse and provoke adventurous adults in specialty venues 469 | 0 just a kiss is a just a waste 470 | 1 a muddle splashed with bloody beauty as 
vivid as any scorsese has ever given us 471 | 0 a zippy minutes of mediocre special effects hoary dialogue fluxing accents and worst of all silly looking morlocks 472 | 1 as a girl meets girl romantic comedy kissing jessica steinis quirky charming and often hilarious 473 | 1 the overall fabric is hypnotic and mr mattei fosters moments of spontaneous intimacy 474 | 0 men in black ii achieves ultimate insignificance it is the sci fi comedy spectacle as whiffle ball epic 475 | 0 at best cletis tout might inspire a trip to the video store in search of a better movie experience 476 | 0 nothing but an episode of smackdown 477 | 1 the stunt work is top notch the dialogue and drama often food spittingly funny 478 | 1 family fare 479 | 1 using a stock plot about a boy injects just enough freshness into the proceedings to provide an enjoyable minutes in a movie theater 480 | 0 in other words it is badder than bad 481 | 0 the movie is almost completely lacking in suspense surprise and consistent emotional conviction 482 | 1 another love story in is remarkable procession of sweeping pictures that have reinvigorated the romance genre 483 | 0 there is only one way to kill michael myers for good stop buying tickets to these movies 484 | 1 washington overcomes the script is flaws and envelops the audience in his character is anguish anger and frustration 485 | 0 so we got ten little indians meets friday the th by way of clean and sober filmed on the set of carpenter is the thing and loaded with actors you re most likely to find on the next inevitable incarnation of the love boat 486 | 0 confirms the nagging suspicion that ethan hawke would be even worse behind the camera than he is in front of it 487 | 0 one of the more glaring signs of this movie is servitude to its superstar is the way it skirts around any scenes that might have required genuine acting from ms spears 488 | 0 for all its shoot outs fistfights and car chases this movie is a phlegmatic bore so tedious it makes the silly spy vs spy film the sum of all fears starring ben affleck seem downright hitchcockian 489 | 0 the only fun part of the movie is playing the obvious game 490 | 0 plays like the old disease of the week small screen melodramas 491 | 0 the cumulative effect of the movie is repulsive and depressing 492 | 1 while we no longer possess the lack of attention span that we did at seventeen we had no trouble sitting for blade ii 493 | 1 a surprisingly sweet and gentle comedy 494 | 1 an elegant film with often surprising twists and an intermingling of naivet and sophistication 495 | 0 for each chuckle there are at least complete misses many coming from the amazingly lifelike tara reid whose acting skills are comparable to a cardboard cutout 496 | 1 polished well structured film 497 | 1 a movie that will surely be profane politically charged music to the ears of cho is fans 498 | 1 most consumers of lo mein and general tso is chicken barely give a thought to the folks who prepare and deliver it so hopefully this film will attach a human face to all those little steaming cartons 499 | 0 movies like high crimes flog the dead horse of surprise as if it were an obligation 500 | 0 a timid soggy near miss 501 | -------------------------------------------------------------------------------- /eda_figure.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/eda_figure.png 
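The labeled lines above use a plain label-then-sentence layout that the scripts later in this dump consume: a numeric class id (0/1 here), a tab, then a lowercased, pre-tokenized sentence. As a minimal, illustrative sketch (not a file from this repository; the path below is a placeholder), this is one way to read such a file and check its class balance, mirroring the `split('\t')` convention used by `get_x_y` and the `gen_*_aug` helpers in `experiments/methods.py`:

from collections import Counter

def read_labeled_sentences(path):
    # Each line is "<class id>\t<lowercased, tokenized sentence>", the same
    # layout that get_x_y() in experiments/methods.py splits on.
    pairs = []
    for line in open(path, 'r'):
        label, sentence = line.rstrip('\n').split('\t', 1)
        pairs.append((int(label), sentence))
    return pairs

if __name__ == "__main__":
    # 'data/train.txt' is a placeholder path, not a file from this repository.
    pairs = read_labeled_sentences('data/train.txt')
    print(Counter(label for label, _ in pairs))  # number of sentences per class id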
-------------------------------------------------------------------------------- /experiments/__pycache__/a_config.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/a_config.cpython-36.pyc -------------------------------------------------------------------------------- /experiments/__pycache__/a_config.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/a_config.cpython-37.pyc -------------------------------------------------------------------------------- /experiments/__pycache__/b_config.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/b_config.cpython-36.pyc -------------------------------------------------------------------------------- /experiments/__pycache__/c_config.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/c_config.cpython-36.pyc -------------------------------------------------------------------------------- /experiments/__pycache__/config.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/config.cpython-36.pyc -------------------------------------------------------------------------------- /experiments/__pycache__/e_config.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/e_config.cpython-36.pyc -------------------------------------------------------------------------------- /experiments/__pycache__/methods.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/methods.cpython-36.pyc -------------------------------------------------------------------------------- /experiments/__pycache__/methods.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/methods.cpython-37.pyc -------------------------------------------------------------------------------- /experiments/__pycache__/nlp_aug.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/nlp_aug.cpython-36.pyc -------------------------------------------------------------------------------- /experiments/a_1_data_process.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from a_config import * 3 | 4 | if __name__ == "__main__": 5 | 6 | #for each method 7 | for a_method in a_methods: 8 | 9 | #for each data size 10 | for 
size_folder in size_folders: 11 | 12 | n_aug_list = n_aug_list_dict[size_folder] 13 | dataset_folders = [size_folder + '/' + s for s in datasets] 14 | 15 | #for each dataset 16 | for i, dataset_folder in enumerate(dataset_folders): 17 | 18 | train_orig = dataset_folder + '/train_orig.txt' 19 | n_aug = n_aug_list[i] 20 | 21 | #for each alpha value 22 | for alpha in alphas: 23 | 24 | output_file = dataset_folder + '/train_' + a_method + '_' + str(alpha) + '.txt' 25 | 26 | #generate the augmented data 27 | if a_method == 'sr': 28 | gen_sr_aug(train_orig, output_file, alpha, n_aug) 29 | if a_method == 'ri': 30 | gen_ri_aug(train_orig, output_file, alpha, n_aug) 31 | if a_method == 'rd': 32 | gen_rd_aug(train_orig, output_file, alpha, n_aug) 33 | if a_method == 'rs': 34 | gen_rs_aug(train_orig, output_file, alpha, n_aug) 35 | 36 | #generate the vocab dictionary 37 | word2vec_pickle = dataset_folder + '/word2vec.p' 38 | gen_vocab_dicts(dataset_folder, word2vec_pickle, huge_word2vec) 39 | 40 | -------------------------------------------------------------------------------- /experiments/a_2_train_eval.py: -------------------------------------------------------------------------------- 1 | from a_config import * 2 | from methods import * 3 | from numpy.random import seed 4 | seed(5) 5 | 6 | ############################### 7 | #### run model and get acc #### 8 | ############################### 9 | 10 | def run_cnn(train_file, test_file, num_classes, percent_dataset): 11 | 12 | #initialize model 13 | model = build_cnn(input_size, word2vec_len, num_classes) 14 | 15 | #load data 16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset) 17 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1) 18 | 19 | #implement early stopping 20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)] 21 | 22 | #train model 23 | model.fit( train_x, 24 | train_y, 25 | epochs=100000, 26 | callbacks=callbacks, 27 | validation_split=0.1, 28 | batch_size=1024, 29 | shuffle=True, 30 | verbose=0) 31 | #model.save('checkpoints/lol') 32 | #model = load_model('checkpoints/lol') 33 | 34 | #evaluate model 35 | y_pred = model.predict(test_x) 36 | test_y_cat = one_hot_to_categorical(test_y) 37 | y_pred_cat = one_hot_to_categorical(y_pred) 38 | acc = accuracy_score(test_y_cat, y_pred_cat) 39 | 40 | #clean memory??? 
41 | train_x, train_y, test_x, test_y, model = None, None, None, None, None 42 | gc.collect() 43 | 44 | #return the accuracy 45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc) 46 | return acc 47 | 48 | ############################### 49 | ############ main ############# 50 | ############################### 51 | 52 | if __name__ == "__main__": 53 | 54 | #for each method 55 | for a_method in a_methods: 56 | 57 | writer = open('outputs_f1/' + a_method + '_' + get_now_str() + '.txt', 'w') 58 | 59 | #for each size dataset 60 | for size_folder in size_folders: 61 | 62 | writer.write(size_folder + '\n') 63 | 64 | #get all six datasets 65 | dataset_folders = [size_folder + '/' + s for s in datasets] 66 | 67 | #for storing the performances 68 | performances = {alpha:[] for alpha in alphas} 69 | 70 | #for each dataset 71 | for i in range(len(dataset_folders)): 72 | 73 | #initialize all the variables 74 | dataset_folder = dataset_folders[i] 75 | dataset = datasets[i] 76 | num_classes = num_classes_list[i] 77 | input_size = input_size_list[i] 78 | word2vec_pickle = dataset_folder + '/word2vec.p' 79 | word2vec = load_pickle(word2vec_pickle) 80 | 81 | #test each alpha value 82 | for alpha in alphas: 83 | 84 | train_path = dataset_folder + '/train_' + a_method + '_' + str(alpha) + '.txt' 85 | test_path = 'size_data_f1/test/' + dataset + '/test.txt' 86 | acc = run_cnn(train_path, test_path, num_classes, percent_dataset=1) 87 | performances[alpha].append(acc) 88 | 89 | writer.write(str(performances) + '\n') 90 | for alpha in performances: 91 | line = str(alpha) + ' : ' + str(sum(performances[alpha])/len(performances[alpha])) 92 | writer.write(line + '\n') 93 | print(line) 94 | print(performances) 95 | 96 | writer.close() 97 | -------------------------------------------------------------------------------- /experiments/a_config.py: -------------------------------------------------------------------------------- 1 | #user inputs 2 | 3 | #size folders 4 | sizes = ['1_tiny', '2_small', '3_standard', '4_full'] 5 | size_folders = ['size_data_f1/' + size for size in sizes] 6 | 7 | #augmentation methods 8 | a_methods = ['sr', 'ri', 'rd', 'rs'] 9 | 10 | #dataset folder 11 | datasets = ['cr', 'sst2', 'subj', 'trec', 'pc'] 12 | 13 | #number of output classes 14 | num_classes_list = [2, 2, 2, 6, 2] 15 | 16 | #number of augmentations 17 | n_aug_list_dict = {'size_data_f1/1_tiny': [16, 16, 16, 16, 16], 18 | 'size_data_f1/2_small': [16, 16, 16, 16, 16], 19 | 'size_data_f1/3_standard': [8, 8, 8, 8, 4], 20 | 'size_data_f1/4_full': [8, 8, 8, 8, 4]} 21 | 22 | #alpha values we care about 23 | alphas = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5] 24 | 25 | #number of words for input 26 | input_size_list = [50, 50, 40, 25, 25] 27 | 28 | #word2vec dictionary 29 | huge_word2vec = 'word2vec/glove.840B.300d.txt' 30 | word2vec_len = 300 # don't want to load the huge pickle every time, so just save the words that are actually used into a smaller dictionary 31 | -------------------------------------------------------------------------------- /experiments/b_1_data_process.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from b_config import * 3 | 4 | if __name__ == "__main__": 5 | 6 | #generate the augmented data sets 7 | for dataset_folder in dataset_folders: 8 | 9 | #pre-existing file locations 10 | train_orig = dataset_folder + '/train_orig.txt' 11 | 12 | #file to be created 13 | 
train_aug_st = dataset_folder + '/train_aug_st.txt' 14 | 15 | #standard augmentation 16 | gen_standard_aug(train_orig, train_aug_st) 17 | 18 | #generate the vocab dictionary 19 | word2vec_pickle = dataset_folder + '/word2vec.p' # don't want to load the huge pickle every time, so just save the words that are actually used into a smaller dictionary 20 | gen_vocab_dicts(dataset_folder, word2vec_pickle, huge_word2vec) 21 | -------------------------------------------------------------------------------- /experiments/b_2_train_eval.py: -------------------------------------------------------------------------------- 1 | from b_config import * 2 | from methods import * 3 | from numpy.random import seed 4 | seed(0) 5 | 6 | ############################### 7 | #### run model and get acc #### 8 | ############################### 9 | 10 | def run_model(train_file, test_file, num_classes, percent_dataset): 11 | 12 | #initialize model 13 | model = build_model(input_size, word2vec_len, num_classes) 14 | 15 | #load data 16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset) 17 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1) 18 | 19 | #implement early stopping 20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)] 21 | 22 | #train model 23 | model.fit( train_x, 24 | train_y, 25 | epochs=100000, 26 | callbacks=callbacks, 27 | validation_split=0.1, 28 | batch_size=1024, 29 | shuffle=True, 30 | verbose=0) 31 | #model.save('checkpoints/lol') 32 | #model = load_model('checkpoints/lol') 33 | 34 | #evaluate model 35 | y_pred = model.predict(test_x) 36 | test_y_cat = one_hot_to_categorical(test_y) 37 | y_pred_cat = one_hot_to_categorical(y_pred) 38 | acc = accuracy_score(test_y_cat, y_pred_cat) 39 | 40 | #clean memory??? 
41 | train_x, train_y = None, None 42 | gc.collect() 43 | 44 | #return the accuracy 45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc) 46 | return acc 47 | 48 | if __name__ == "__main__": 49 | 50 | #get the accuracy at each increment 51 | orig_accs = {dataset:{} for dataset in datasets} 52 | aug_accs = {dataset:{} for dataset in datasets} 53 | 54 | writer = open('outputs_f2/' + get_now_str() + '.csv', 'w') 55 | 56 | #for each dataset 57 | for i, dataset_folder in enumerate(dataset_folders): 58 | 59 | dataset = datasets[i] 60 | num_classes = num_classes_list[i] 61 | input_size = input_size_list[i] 62 | train_orig = dataset_folder + '/train_orig.txt' 63 | train_aug_st = dataset_folder + '/train_aug_st.txt' 64 | test_path = dataset_folder + '/test.txt' 65 | word2vec_pickle = dataset_folder + '/word2vec.p' 66 | word2vec = load_pickle(word2vec_pickle) 67 | 68 | for increment in increments: 69 | 70 | #calculate augmented accuracy 71 | aug_acc = run_model(train_aug_st, test_path, num_classes, increment) 72 | aug_accs[dataset][increment] = aug_acc 73 | 74 | #calculate original accuracy 75 | orig_acc = run_model(train_orig, test_path, num_classes, increment) 76 | orig_accs[dataset][increment] = orig_acc 77 | 78 | print(dataset, increment, orig_acc, aug_acc) 79 | writer.write(dataset + ',' + str(increment) + ',' + str(orig_acc) + ',' + str(aug_acc) + '\n') 80 | 81 | gc.collect() 82 | 83 | print(orig_accs, aug_accs) 84 | -------------------------------------------------------------------------------- /experiments/b_config.py: -------------------------------------------------------------------------------- 1 | #user inputs 2 | 3 | #dataset folder 4 | datasets = ['pc']#['cr', 'sst2', 'subj', 'trec', 'pc'] 5 | dataset_folders = ['increment_datasets_f2/' + dataset for dataset in datasets] 6 | 7 | #number of output classes 8 | num_classes_list = [2]#[2, 2, 2, 6, 2] 9 | 10 | #dataset increments 11 | increments = [0.7, 0.8, 0.9, 1]#[0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1] 12 | 13 | #number of words for input 14 | input_size_list = [25]#[50, 50, 40, 25, 25] 15 | 16 | #word2vec dictionary 17 | huge_word2vec = 'word2vec/glove.840B.300d.txt' 18 | word2vec_len = 300 -------------------------------------------------------------------------------- /experiments/c_1_data_process.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from c_config import * 3 | 4 | if __name__ == "__main__": 5 | 6 | #generate the augmented data sets 7 | 8 | for size_folder in size_folders: 9 | 10 | dataset_folders = [size_folder + '/' + s for s in datasets] 11 | 12 | #for each dataset 13 | for dataset_folder in dataset_folders: 14 | train_orig = dataset_folder + '/train_orig.txt' 15 | 16 | #for each n_aug value 17 | for num_aug in num_aug_list: 18 | 19 | output_file = dataset_folder + '/train_' + str(num_aug) + '.txt' 20 | 21 | #generate the augmented data 22 | if num_aug > 4 and '4_full/pc' in train_orig: 23 | gen_standard_aug(train_orig, output_file, num_aug=4) 24 | else: 25 | gen_standard_aug(train_orig, output_file, num_aug=num_aug) 26 | 27 | #generate the vocab dictionary 28 | word2vec_pickle = dataset_folder + '/word2vec.p' 29 | gen_vocab_dicts(dataset_folder, word2vec_pickle, huge_word2vec) 30 | 31 | -------------------------------------------------------------------------------- /experiments/c_2_train_eval.py: 
-------------------------------------------------------------------------------- 1 | from c_config import * 2 | from methods import * 3 | from numpy.random import seed 4 | seed(5) 5 | 6 | ############################### 7 | #### run model and get acc #### 8 | ############################### 9 | 10 | def run_cnn(train_file, test_file, num_classes, percent_dataset): 11 | 12 | #initialize model 13 | model = build_cnn(input_size, word2vec_len, num_classes) 14 | 15 | #load data 16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset) 17 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1) 18 | 19 | #implement early stopping 20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)] 21 | 22 | #train model 23 | model.fit( train_x, 24 | train_y, 25 | epochs=100000, 26 | callbacks=callbacks, 27 | validation_split=0.1, 28 | batch_size=1024, 29 | shuffle=True, 30 | verbose=0) 31 | #model.save('checkpoints/lol') 32 | #model = load_model('checkpoints/lol') 33 | 34 | #evaluate model 35 | y_pred = model.predict(test_x) 36 | test_y_cat = one_hot_to_categorical(test_y) 37 | y_pred_cat = one_hot_to_categorical(y_pred) 38 | acc = accuracy_score(test_y_cat, y_pred_cat) 39 | 40 | #clean memory??? 41 | train_x, train_y = None, None 42 | gc.collect() 43 | 44 | #return the accuracy 45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc) 46 | return acc 47 | 48 | ############################### 49 | ############ main ############# 50 | ############################### 51 | 52 | if __name__ == "__main__": 53 | 54 | for see in range(5): 55 | 56 | seed(see) 57 | print('seed:', see) 58 | 59 | writer = open('outputs_f3/' + get_now_str() + '.txt', 'w') 60 | 61 | #for each size dataset 62 | for size_folder in size_folders: 63 | 64 | writer.write(size_folder + '\n') 65 | 66 | #get all six datasets 67 | dataset_folders = [size_folder + '/' + s for s in datasets] 68 | 69 | #for storing the performances 70 | performances = {num_aug:[] for num_aug in num_aug_list} 71 | 72 | #for each dataset 73 | for i in range(len(dataset_folders)): 74 | 75 | #initialize all the variables 76 | dataset_folder = dataset_folders[i] 77 | dataset = datasets[i] 78 | num_classes = num_classes_list[i] 79 | input_size = input_size_list[i] 80 | word2vec_pickle = dataset_folder + '/word2vec.p' 81 | word2vec = load_pickle(word2vec_pickle) 82 | 83 | #test each num_aug value 84 | for num_aug in num_aug_list: 85 | 86 | train_path = dataset_folder + '/train_' + str(num_aug) + '.txt' 87 | test_path = 'size_data_f3/test/' + dataset + '/test.txt' 88 | acc = run_cnn(train_path, test_path, num_classes, percent_dataset=1) 89 | performances[num_aug].append(acc) 90 | writer.write(train_path + ',' + str(acc)) 91 | 92 | writer.write(str(performances) + '\n') 93 | print() 94 | for num_aug in performances: 95 | line = str(num_aug) + ' : ' + str(sum(performances[num_aug])/len(performances[num_aug])) 96 | writer.write(line + '\n') 97 | print(line) 98 | print(performances) 99 | 100 | writer.close() 101 | -------------------------------------------------------------------------------- /experiments/c_config.py: -------------------------------------------------------------------------------- 1 | #user inputs 2 | 3 | #size folders 4 | sizes = ['3_standard']#, '4_full']#['1_tiny', '2_small', '3_standard', '4_full'] 5 | size_folders = ['size_data_f3/' + size for size in sizes] 6 | 7 | 
#dataset folder 8 | datasets = ['cr', 'sst2', 'subj', 'trec', 'pc'] 9 | 10 | #number of output classes 11 | num_classes_list = [2, 2, 2, 6, 2] 12 | 13 | #alpha values we care about 14 | num_aug_list = [0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32] 15 | 16 | #number of words for input 17 | input_size_list = [50, 50, 50, 25, 25] 18 | 19 | #word2vec dictionary 20 | huge_word2vec = 'word2vec/glove.840B.300d.txt' 21 | word2vec_len = 300 # don't want to load the huge pickle every time, so just save the words that are actually used into a smaller dictionary 22 | -------------------------------------------------------------------------------- /experiments/d_0_preprocess.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | 3 | def generate_short(input_file, output_file, alpha): 4 | lines = open(input_file, 'r').readlines() 5 | increment = int(len(lines)/alpha) 6 | lines = lines[::increment] 7 | writer = open(output_file, 'w') 8 | for line in lines: 9 | writer.write(line) 10 | 11 | if __name__ == "__main__": 12 | 13 | #global params 14 | huge_word2vec = 'word2vec/glove.840B.300d.txt' 15 | datasets = ['pc']#, 'trec'] 16 | 17 | for dataset in datasets: 18 | 19 | dataset_folder = 'special_f4/' + dataset 20 | test_short = 'special_f4/' + dataset + '/test_short.txt' 21 | test_aug_short = dataset_folder + '/test_short_aug.txt' 22 | word2vec_pickle = dataset_folder + '/word2vec.p' 23 | 24 | #augment the data 25 | gen_tsne_aug(test_short, test_aug_short) 26 | 27 | #generate the vocab dictionaries 28 | gen_vocab_dicts(dataset_folder, word2vec_pickle, huge_word2vec) 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | -------------------------------------------------------------------------------- /experiments/d_1_train_models.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from numpy.random import seed 3 | seed(0) 4 | 5 | ############################### 6 | #### run model and get acc #### 7 | ############################### 8 | 9 | def run_model(train_file, test_file, num_classes, model_output_path): 10 | 11 | #initialize model 12 | model = build_model(input_size, word2vec_len, num_classes) 13 | 14 | #load data 15 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, 1) 16 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1) 17 | 18 | #implement early stopping 19 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)] 20 | 21 | #train model 22 | model.fit( train_x, 23 | train_y, 24 | epochs=100000, 25 | callbacks=callbacks, 26 | validation_split=0.1, 27 | batch_size=1024, 28 | shuffle=True, 29 | verbose=0) 30 | 31 | #save the model 32 | model.save(model_output_path) 33 | #model = load_model('checkpoints/lol') 34 | 35 | #evaluate model 36 | y_pred = model.predict(test_x) 37 | test_y_cat = one_hot_to_categorical(test_y) 38 | y_pred_cat = one_hot_to_categorical(y_pred) 39 | acc = accuracy_score(test_y_cat, y_pred_cat) 40 | 41 | #clean memory??? 
42 | train_x, train_y = None, None 43 | 44 | #return the accuracy 45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc) 46 | return acc 47 | 48 | if __name__ == "__main__": 49 | 50 | #parameters 51 | dataset_folders = ['increment_datasets_f2/trec', 'increment_datasets_f2/pc'] 52 | output_paths = ['outputs_f4/trec_aug.h5', 'outputs_f4/pc_aug.h5'] 53 | num_classes_list = [6, 2] 54 | input_size_list = [25, 25] 55 | 56 | #word2vec dictionary 57 | word2vec_len = 300 58 | 59 | for i, dataset_folder in enumerate(dataset_folders): 60 | 61 | num_classes = num_classes_list[i] 62 | input_size = input_size_list[i] 63 | output_path = output_paths[i] 64 | train_orig = dataset_folder + '/train_aug_st.txt' 65 | test_path = dataset_folder + '/test.txt' 66 | word2vec_pickle = dataset_folder + '/word2vec.p' 67 | word2vec = load_pickle(word2vec_pickle) 68 | 69 | #train model and save 70 | acc = run_model(train_orig, test_path, num_classes, output_path) 71 | print(dataset_folder, acc) -------------------------------------------------------------------------------- /experiments/d_2_tsne.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from numpy.random import seed 3 | from keras import backend as K 4 | from sklearn.manifold import TSNE 5 | import matplotlib.pyplot as plt 6 | seed(0) 7 | 8 | ################################ 9 | #### get dense layer output #### 10 | ################################ 11 | 12 | #getting the x and y inputs in numpy array form from the text file 13 | def train_x(train_txt, word2vec_len, input_size, word2vec): 14 | 15 | #read in lines 16 | train_lines = open(train_txt, 'r').readlines() 17 | num_lines = len(train_lines) 18 | 19 | x_matrix = np.zeros((num_lines, input_size, word2vec_len)) 20 | 21 | #insert values 22 | for i, line in enumerate(train_lines): 23 | 24 | parts = line[:-1].split('\t') 25 | label = int(parts[0]) 26 | sentence = parts[1] 27 | 28 | #insert x 29 | words = sentence.split(' ') 30 | words = words[:x_matrix.shape[1]] #cut off if too long 31 | for j, word in enumerate(words): 32 | if word in word2vec: 33 | x_matrix[i, j, :] = word2vec[word] 34 | 35 | return x_matrix 36 | 37 | def get_dense_output(model_checkpoint, file, num_classes): 38 | 39 | x = train_x(file, word2vec_len, input_size, word2vec) 40 | 41 | model = load_model(model_checkpoint) 42 | 43 | get_3rd_layer_output = K.function([model.layers[0].input], [model.layers[4].output]) 44 | layer_output = get_3rd_layer_output([x])[0] 45 | 46 | return layer_output 47 | 48 | def get_tsne_labels(file): 49 | labels = [] 50 | alphas = [] 51 | lines = open(file, 'r').readlines() 52 | for i, line in enumerate(lines): 53 | parts = line[:-1].split('\t') 54 | _class = int(parts[0]) 55 | alpha = i % 10 56 | labels.append(_class) 57 | alphas.append(alpha) 58 | return labels, alphas 59 | 60 | def get_plot_vectors(layer_output): 61 | 62 | tsne = TSNE(n_components=2).fit_transform(layer_output) 63 | return tsne 64 | 65 | def plot_tsne(tsne, labels, output_path): 66 | 67 | label_to_legend_label = { 'outputs_f4/pc_tsne.png':{ 0:'Con (augmented)', 68 | 100:'Con (original)', 69 | 1: 'Pro (augmented)', 70 | 101:'Pro (original)'}, 71 | 'outputs_f4/trec_tsne.png':{0:'Description (augmented)', 72 | 100:'Description (original)', 73 | 1:'Entity (augmented)', 74 | 101:'Entity (original)', 75 | 2:'Abbreviation (augmented)', 76 | 102:'Abbreviation (original)', 77 | 3:'Human 
(augmented)', 78 | 103:'Human (original)', 79 | 4:'Location (augmented)', 80 | 104:'Location (original)', 81 | 5:'Number (augmented)', 82 | 105:'Number (original)'}} 83 | 84 | plot_to_legend_size = {'outputs_f4/pc_tsne.png':11, 'outputs_f4/trec_tsne.png':6} 85 | 86 | labels = labels.tolist() 87 | big_groups = [label for label in labels if label < 100] 88 | big_groups = list(sorted(set(big_groups))) 89 | 90 | colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k', '#ff1493', '#FF4500'] 91 | fig, ax = plt.subplots() 92 | 93 | for big_group in big_groups: 94 | 95 | for group in [big_group, big_group+100]: 96 | 97 | x, y = [], [] 98 | 99 | for j, label in enumerate(labels): 100 | if label == group: 101 | x.append(tsne[j][0]) 102 | y.append(tsne[j][1]) 103 | 104 | #params 105 | color = colors[int(group % 100)] 106 | marker = 'x' if group < 100 else 'o' 107 | size = 1 if group < 100 else 27 108 | legend_label = label_to_legend_label[output_path][group] 109 | 110 | ax.scatter(x, y, color=color, marker=marker, s=size, label=legend_label) 111 | plt.axis('off') 112 | 113 | legend_size = plot_to_legend_size[output_path] 114 | plt.legend(prop={'size': legend_size}) 115 | plt.savefig(output_path, dpi=1000) 116 | plt.clf() 117 | 118 | if __name__ == "__main__": 119 | 120 | #global variables 121 | word2vec_len = 300 122 | input_size = 25 123 | 124 | datasets = ['pc'] #['pc', 'trec'] 125 | num_classes_list =[2] #[2, 6] 126 | 127 | for i, dataset in enumerate(datasets): 128 | 129 | #load parameters 130 | model_checkpoint = 'outputs_f4/' + dataset + '.h5' 131 | file = 'special_f4/' + dataset + '/test_short_aug.txt' 132 | num_classes = num_classes_list[i] 133 | word2vec_pickle = 'special_f4/' + dataset + '/word2vec.p' 134 | word2vec = load_pickle(word2vec_pickle) 135 | 136 | #do tsne 137 | layer_output = get_dense_output(model_checkpoint, file, num_classes) 138 | print(layer_output.shape) 139 | t = get_plot_vectors(layer_output) 140 | 141 | labels, alphas = get_tsne_labels(file) 142 | 143 | print(labels, alphas) 144 | 145 | writer = open("outputs_f4/new_tsne.txt", 'w') 146 | 147 | label_to_mark = {0:'x', 1:'o'} 148 | 149 | for i, label in enumerate(labels): 150 | alpha = alphas[i] 151 | line = str(t[i, 0]) + ' ' + str(t[i, 1]) + ' ' + str(label_to_mark[label]) + ' ' + str(alpha/10) 152 | writer.write(line + '\n') 153 | 154 | 155 | -------------------------------------------------------------------------------- /experiments/d_neg_1_balance_trec.py: -------------------------------------------------------------------------------- 1 | lines = open('special_f4/trec/test_orig.txt', 'r').readlines() 2 | 3 | label_to_lines = {x:[] for x in range(0, 6)} 4 | 5 | for line in lines: 6 | label = int(line[0]) 7 | label_to_lines[label].append(line) 8 | 9 | for label in label_to_lines: 10 | print(label, len(label_to_lines[label])) -------------------------------------------------------------------------------- /experiments/e_1_data_process.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from e_config import * 3 | 4 | if __name__ == "__main__": 5 | 6 | for size_folder in size_folders: 7 | 8 | dataset_folders = [size_folder + '/' + s for s in datasets] 9 | n_aug_list = n_aug_list_dict[size_folder] 10 | 11 | #for each dataset 12 | for i, dataset_folder in enumerate(dataset_folders): 13 | 14 | n_aug = n_aug_list[i] 15 | 16 | #pre-existing file locations 17 | train_orig = dataset_folder + '/train_orig.txt' 18 | 19 | #file to be created 20 | train_aug_st = dataset_folder + 
'/train_aug_st.txt' 21 | 22 | #standard augmentation 23 | gen_standard_aug(train_orig, train_aug_st, n_aug) 24 | 25 | #generate the vocab dictionary 26 | word2vec_pickle = dataset_folder + '/word2vec.p' 27 | gen_vocab_dicts(dataset_folder, word2vec_pickle, huge_word2vec) 28 | 29 | -------------------------------------------------------------------------------- /experiments/e_2_cnn_aug.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from numpy.random import seed 3 | seed(0) 4 | from e_config import * 5 | 6 | ############################### 7 | #### run model and get acc #### 8 | ############################### 9 | 10 | def run_cnn(train_file, test_file, num_classes, input_size, percent_dataset, word2vec): 11 | 12 | #initialize model 13 | model = build_cnn(input_size, word2vec_len, num_classes) 14 | 15 | #load data 16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset) 17 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1) 18 | 19 | #implement early stopping 20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)] 21 | 22 | #train model 23 | model.fit( train_x, 24 | train_y, 25 | epochs=100000, 26 | callbacks=callbacks, 27 | validation_split=0.1, 28 | batch_size=1024, 29 | shuffle=True, 30 | verbose=0) 31 | #model.save('checkpoints/lol') 32 | #model = load_model('checkpoints/lol') 33 | 34 | #evaluate model 35 | y_pred = model.predict(test_x) 36 | test_y_cat = one_hot_to_categorical(test_y) 37 | y_pred_cat = one_hot_to_categorical(y_pred) 38 | acc = accuracy_score(test_y_cat, y_pred_cat) 39 | 40 | #clean memory??? 41 | train_x, train_y, model = None, None, None 42 | gc.collect() 43 | 44 | #return the accuracy 45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc) 46 | return acc 47 | 48 | ############################### 49 | ### get baseline accuracies ### 50 | ############################### 51 | 52 | def compute_baselines(writer): 53 | 54 | #baseline computation 55 | for size_folder in size_folders: 56 | 57 | #get all six datasets 58 | dataset_folders = [size_folder + '/' + s for s in datasets] 59 | performances = [] 60 | 61 | #for each dataset 62 | for i in range(len(dataset_folders)): 63 | 64 | #initialize all the variables 65 | dataset_folder = dataset_folders[i] 66 | dataset = datasets[i] 67 | num_classes = num_classes_list[i] 68 | input_size = input_size_list[i] 69 | word2vec_pickle = dataset_folder + '/word2vec.p' 70 | word2vec = load_pickle(word2vec_pickle) 71 | 72 | train_path = dataset_folder + '/train_aug_st.txt' 73 | test_path = 'size_data_t1/test/' + dataset + '/test.txt' 74 | acc = run_cnn(train_path, test_path, num_classes, input_size, 1, word2vec) 75 | performances.append(str(acc)) 76 | 77 | line = ','.join(performances) 78 | print(line) 79 | writer.write(line+'\n') 80 | 81 | ############################### 82 | ############ main ############# 83 | ############################### 84 | 85 | if __name__ == "__main__": 86 | 87 | writer = open('baseline_cnn/' + get_now_str() + '.csv', 'w') 88 | 89 | for i in range(0, 10): 90 | 91 | seed(i) 92 | print(i) 93 | compute_baselines(writer) 94 | -------------------------------------------------------------------------------- /experiments/e_2_cnn_baselines.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from 
numpy.random import seed 3 | seed(0) 4 | from e_config import * 5 | 6 | ############################### 7 | #### run model and get acc #### 8 | ############################### 9 | 10 | def run_model(train_file, test_file, num_classes, input_size, percent_dataset, word2vec): 11 | 12 | #initialize model 13 | model = build_model(input_size, word2vec_len, num_classes) 14 | 15 | #load data 16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset) 17 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1) 18 | 19 | #implement early stopping 20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)] 21 | 22 | #train model 23 | model.fit( train_x, 24 | train_y, 25 | epochs=100000, 26 | callbacks=callbacks, 27 | validation_split=0.1, 28 | batch_size=1024, 29 | shuffle=True, 30 | verbose=0) 31 | #model.save('checkpoints/lol') 32 | #model = load_model('checkpoints/lol') 33 | 34 | #evaluate model 35 | y_pred = model.predict(test_x) 36 | test_y_cat = one_hot_to_categorical(test_y) 37 | y_pred_cat = one_hot_to_categorical(y_pred) 38 | acc = accuracy_score(test_y_cat, y_pred_cat) 39 | 40 | #clean memory??? 41 | train_x, train_y = None, None 42 | gc.collect() 43 | 44 | #return the accuracy 45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc) 46 | return acc 47 | 48 | ############################### 49 | ### get baseline accuracies ### 50 | ############################### 51 | 52 | def compute_baselines(writer): 53 | 54 | #baseline computation 55 | for size_folder in size_folders: 56 | 57 | #get all six datasets 58 | dataset_folders = [size_folder + '/' + s for s in datasets] 59 | performances = [] 60 | 61 | #for each dataset 62 | for i in range(len(dataset_folders)): 63 | 64 | #initialize all the variables 65 | dataset_folder = dataset_folders[i] 66 | dataset = datasets[i] 67 | num_classes = num_classes_list[i] 68 | input_size = input_size_list[i] 69 | word2vec_pickle = dataset_folder + '/word2vec.p' 70 | word2vec = load_pickle(word2vec_pickle) 71 | 72 | train_path = dataset_folder + '/train_orig.txt' 73 | test_path = 'size_data_t1/test/' + dataset + '/test.txt' 74 | acc = run_model(train_path, test_path, num_classes, input_size, 1, word2vec) 75 | performances.append(str(acc)) 76 | 77 | line = ','.join(performances) 78 | print(line) 79 | writer.write(line+'\n') 80 | 81 | ############################### 82 | ############ main ############# 83 | ############################### 84 | 85 | if __name__ == "__main__": 86 | 87 | writer = open('baseline_rnn/' + get_now_str() + '.csv', 'w') 88 | 89 | for i in range(10, 24): 90 | 91 | seed(i) 92 | print(i) 93 | compute_baselines(writer) 94 | -------------------------------------------------------------------------------- /experiments/e_2_rnn_aug.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from numpy.random import seed 3 | seed(0) 4 | from e_config import * 5 | 6 | ############################### 7 | #### run model and get acc #### 8 | ############################### 9 | 10 | def run_model(train_file, test_file, num_classes, input_size, percent_dataset, word2vec): 11 | 12 | #initialize model 13 | model = build_model(input_size, word2vec_len, num_classes) 14 | 15 | #load data 16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset) 17 | test_x, test_y = 
get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1) 18 | 19 | #implement early stopping 20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)] 21 | 22 | #train model 23 | model.fit( train_x, 24 | train_y, 25 | epochs=100000, 26 | callbacks=callbacks, 27 | validation_split=0.1, 28 | batch_size=1024, 29 | shuffle=True, 30 | verbose=0) 31 | #model.save('checkpoints/lol') 32 | #model = load_model('checkpoints/lol') 33 | 34 | #evaluate model 35 | y_pred = model.predict(test_x) 36 | test_y_cat = one_hot_to_categorical(test_y) 37 | y_pred_cat = one_hot_to_categorical(y_pred) 38 | acc = accuracy_score(test_y_cat, y_pred_cat) 39 | 40 | #clean memory??? 41 | train_x, train_y, model = None, None, None 42 | gc.collect() 43 | 44 | #return the accuracy 45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc) 46 | return acc 47 | 48 | ############################### 49 | ### get baseline accuracies ### 50 | ############################### 51 | 52 | def compute_baselines(writer): 53 | 54 | #baseline computation 55 | for size_folder in size_folders: 56 | 57 | #get all six datasets 58 | dataset_folders = [size_folder + '/' + s for s in datasets] 59 | performances = [] 60 | 61 | #for each dataset 62 | for i in range(len(dataset_folders)): 63 | 64 | #initialize all the variables 65 | dataset_folder = dataset_folders[i] 66 | dataset = datasets[i] 67 | num_classes = num_classes_list[i] 68 | input_size = input_size_list[i] 69 | word2vec_pickle = dataset_folder + '/word2vec.p' 70 | word2vec = load_pickle(word2vec_pickle) 71 | 72 | train_path = dataset_folder + '/train_aug_st.txt' 73 | test_path = 'size_data_t1/test/' + dataset + '/test.txt' 74 | acc = run_model(train_path, test_path, num_classes, input_size, 1, word2vec) 75 | performances.append(str(acc)) 76 | 77 | line = ','.join(performances) 78 | print(line) 79 | writer.write(line+'\n') 80 | 81 | ############################### 82 | ############ main ############# 83 | ############################### 84 | 85 | if __name__ == "__main__": 86 | 87 | writer = open('baseline_rnn/' + get_now_str() + '.csv', 'w') 88 | 89 | for i in range(0, 10): 90 | 91 | seed(i) 92 | print(i) 93 | compute_baselines(writer) 94 | -------------------------------------------------------------------------------- /experiments/e_2_rnn_baselines.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from numpy.random import seed 3 | seed(0) 4 | from e_config import * 5 | 6 | ############################### 7 | #### run model and get acc #### 8 | ############################### 9 | 10 | def run_model(train_file, test_file, num_classes, input_size, percent_dataset, word2vec): 11 | 12 | #initialize model 13 | model = build_model(input_size, word2vec_len, num_classes) 14 | 15 | #load data 16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset) 17 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1) 18 | 19 | #implement early stopping 20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)] 21 | 22 | #train model 23 | model.fit( train_x, 24 | train_y, 25 | epochs=100000, 26 | callbacks=callbacks, 27 | validation_split=0.1, 28 | batch_size=1024, 29 | shuffle=True, 30 | verbose=0) 31 | #model.save('checkpoints/lol') 32 | #model = load_model('checkpoints/lol') 33 | 34 | #evaluate model 35 | y_pred = 
model.predict(test_x) 36 | test_y_cat = one_hot_to_categorical(test_y) 37 | y_pred_cat = one_hot_to_categorical(y_pred) 38 | acc = accuracy_score(test_y_cat, y_pred_cat) 39 | 40 | #clean memory??? 41 | train_x, train_y = None, None 42 | gc.collect() 43 | 44 | #return the accuracy 45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc) 46 | return acc 47 | 48 | ############################### 49 | ### get baseline accuracies ### 50 | ############################### 51 | 52 | def compute_baselines(writer): 53 | 54 | #baseline computation 55 | for size_folder in size_folders: 56 | 57 | #get all six datasets 58 | dataset_folders = [size_folder + '/' + s for s in datasets] 59 | performances = [] 60 | 61 | #for each dataset 62 | for i in range(len(dataset_folders)): 63 | 64 | #initialize all the variables 65 | dataset_folder = dataset_folders[i] 66 | dataset = datasets[i] 67 | num_classes = num_classes_list[i] 68 | input_size = input_size_list[i] 69 | word2vec_pickle = dataset_folder + '/word2vec.p' 70 | word2vec = load_pickle(word2vec_pickle) 71 | 72 | train_path = dataset_folder + '/train_orig.txt' 73 | test_path = 'size_data_t1/test/' + dataset + '/test.txt' 74 | acc = run_model(train_path, test_path, num_classes, input_size, 1, word2vec) 75 | performances.append(str(acc)) 76 | 77 | line = ','.join(performances) 78 | print(line) 79 | writer.write(line+'\n') 80 | 81 | ############################### 82 | ############ main ############# 83 | ############################### 84 | 85 | if __name__ == "__main__": 86 | 87 | writer = open('baseline_rnn/' + get_now_str() + '.csv', 'w') 88 | 89 | for i in range(10, 24): 90 | 91 | seed(i) 92 | print(i) 93 | compute_baselines(writer) 94 | -------------------------------------------------------------------------------- /experiments/e_config.py: -------------------------------------------------------------------------------- 1 | #user inputs 2 | 3 | #load hyperparameters 4 | sizes = ['4_full']#['1_tiny', '2_small', '3_standard', '4_full'] 5 | size_folders = ['size_data_t1/' + size for size in sizes] 6 | 7 | #datasets 8 | datasets = ['cr', 'sst2', 'subj', 'trec', 'pc'] 9 | 10 | #number of output classes 11 | num_classes_list = [2, 2, 2, 6, 2] 12 | 13 | #number of augmentations per original sentence 14 | n_aug_list_dict = {'size_data_t1/1_tiny': [32, 32, 32, 32, 32], 15 | 'size_data_t1/2_small': [32, 32, 32, 32, 32], 16 | 'size_data_t1/3_standard': [16, 16, 16, 16, 4], 17 | 'size_data_t1/4_full': [16, 16, 16, 16, 4]} 18 | 19 | #number of words for input 20 | input_size_list = [50, 50, 40, 25, 25] 21 | 22 | #word2vec dictionary 23 | huge_word2vec = 'word2vec/glove.840B.300d.txt' 24 | word2vec_len = 300 -------------------------------------------------------------------------------- /experiments/methods.py: -------------------------------------------------------------------------------- 1 | from keras.layers.core import Dense, Activation, Dropout 2 | from keras.layers.recurrent import LSTM 3 | from keras.layers import Bidirectional 4 | import keras.layers as layers 5 | from keras.models import Sequential 6 | from keras.models import load_model 7 | from keras.callbacks import EarlyStopping 8 | 9 | from sklearn.utils import shuffle 10 | from sklearn.metrics import accuracy_score 11 | 12 | import math 13 | import time 14 | import numpy as np 15 | import random 16 | from random import randint 17 | random.seed(3) 18 | import datetime, re, operator 19 | from random 
import shuffle 20 | from time import gmtime, strftime 21 | import gc 22 | 23 | import os 24 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' #get rid of warnings 25 | from os import listdir 26 | from os.path import isfile, join, isdir 27 | import pickle 28 | 29 | #import data augmentation methods 30 | from nlp_aug import * 31 | 32 | ################################################### 33 | ######### loading folders and txt files ########### 34 | ################################################### 35 | 36 | #loading a pickle file 37 | def load_pickle(file): 38 | return pickle.load(open(file, 'rb')) 39 | 40 | #create an output folder if it does not already exist 41 | def confirm_output_folder(output_folder): 42 | if not os.path.exists(output_folder): 43 | os.makedirs(output_folder) 44 | 45 | #get full image paths 46 | def get_txt_paths(folder): 47 | txt_paths = [join(folder, f) for f in listdir(folder) if isfile(join(folder, f)) and '.txt' in f] 48 | if join(folder, '.DS_Store') in txt_paths: 49 | txt_paths.remove(join(folder, '.DS_Store')) 50 | txt_paths = sorted(txt_paths) 51 | return txt_paths 52 | 53 | #get subfolders 54 | def get_subfolder_paths(folder): 55 | subfolder_paths = [join(folder, f) for f in listdir(folder) if (isdir(join(folder, f)) and '.DS_Store' not in f)] 56 | if join(folder, '.DS_Store') in subfolder_paths: 57 | subfolder_paths.remove(join(folder, '.DS_Store')) 58 | subfolder_paths = sorted(subfolder_paths) 59 | return subfolder_paths 60 | 61 | #get all image paths 62 | def get_all_txt_paths(master_folder): 63 | 64 | all_paths = [] 65 | subfolders = get_subfolder_paths(master_folder) 66 | if len(subfolders) > 1: 67 | for subfolder in subfolders: 68 | all_paths += get_txt_paths(subfolder) 69 | else: 70 | all_paths = get_txt_paths(master_folder) 71 | return all_paths 72 | 73 | ################################################### 74 | ################ data processing ################## 75 | ################################################### 76 | 77 | #get the pickle file for the word2vec so you don't have to load the entire huge file each time 78 | def gen_vocab_dicts(folder, output_pickle_path, huge_word2vec): 79 | 80 | vocab = set() 81 | text_embeddings = open(huge_word2vec, 'r').readlines() 82 | word2vec = {} 83 | 84 | #get all the vocab 85 | all_txt_paths = get_all_txt_paths(folder) 86 | print(all_txt_paths) 87 | 88 | #loop through each text file 89 | for txt_path in all_txt_paths: 90 | 91 | # get all the words 92 | try: 93 | all_lines = open(txt_path, "r").readlines() 94 | for line in all_lines: 95 | words = line[:-1].split(' ') 96 | for word in words: 97 | vocab.add(word) 98 | except: 99 | print(txt_path, "has an error") 100 | 101 | print(len(vocab), "unique words found") 102 | 103 | # load the word embeddings, and only add the word to the dictionary if we need it 104 | for line in text_embeddings: 105 | items = line.split(' ') 106 | word = items[0] 107 | if word in vocab: 108 | vec = items[1:] 109 | word2vec[word] = np.asarray(vec, dtype = 'float32') 110 | print(len(word2vec), "matches between unique words and word2vec dictionary") 111 | 112 | pickle.dump(word2vec, open(output_pickle_path, 'wb')) 113 | print("dictionaries outputted to", output_pickle_path) 114 | 115 | #getting the x and y inputs in numpy array form from the text file 116 | def get_x_y(train_txt, num_classes, word2vec_len, input_size, word2vec, percent_dataset): 117 | 118 | #read in lines 119 | train_lines = open(train_txt, 'r').readlines() 120 | shuffle(train_lines) 121 | train_lines = 
train_lines[:int(percent_dataset*len(train_lines))] 122 | num_lines = len(train_lines) 123 | 124 | #initialize x and y matrix 125 | x_matrix = None 126 | y_matrix = None 127 | 128 | try: 129 | x_matrix = np.zeros((num_lines, input_size, word2vec_len)) 130 | except: 131 | print("Error!", num_lines, input_size, word2vec_len) 132 | y_matrix = np.zeros((num_lines, num_classes)) 133 | 134 | #insert values 135 | for i, line in enumerate(train_lines): 136 | 137 | parts = line[:-1].split('\t') 138 | label = int(parts[0]) 139 | sentence = parts[1] 140 | 141 | #insert x 142 | words = sentence.split(' ') 143 | words = words[:x_matrix.shape[1]] #cut off if too long 144 | for j, word in enumerate(words): 145 | if word in word2vec: 146 | x_matrix[i, j, :] = word2vec[word] 147 | 148 | #insert y 149 | y_matrix[i][label] = 1.0 150 | 151 | return x_matrix, y_matrix 152 | 153 | ################################################### 154 | ############### data augmentation ################# 155 | ################################################### 156 | 157 | def gen_tsne_aug(train_orig, output_file): 158 | 159 | writer = open(output_file, 'w') 160 | lines = open(train_orig, 'r').readlines() 161 | for i, line in enumerate(lines): 162 | parts = line[:-1].split('\t') 163 | label = parts[0] 164 | sentence = parts[1] 165 | writer.write(line) 166 | for alpha in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]: 167 | aug_sentence = eda_4(sentence, alpha_sr=alpha, alpha_ri=alpha, alpha_rs=alpha, p_rd=alpha, num_aug=2)[0] 168 | writer.write(label + "\t" + aug_sentence + '\n') 169 | writer.close() 170 | print("finished eda for tsne for", train_orig, "to", output_file) 171 | 172 | 173 | 174 | 175 | #generate more data with standard augmentation 176 | def gen_standard_aug(train_orig, output_file, num_aug=9): 177 | writer = open(output_file, 'w') 178 | lines = open(train_orig, 'r').readlines() 179 | for i, line in enumerate(lines): 180 | parts = line[:-1].split('\t') 181 | label = parts[0] 182 | sentence = parts[1] 183 | aug_sentences = eda_4(sentence, num_aug=num_aug) 184 | for aug_sentence in aug_sentences: 185 | writer.write(label + "\t" + aug_sentence + '\n') 186 | writer.close() 187 | print("finished eda for", train_orig, "to", output_file) 188 | 189 | #generate more data with only synonym replacement (SR) 190 | def gen_sr_aug(train_orig, output_file, alpha_sr, n_aug): 191 | writer = open(output_file, 'w') 192 | lines = open(train_orig, 'r').readlines() 193 | for i, line in enumerate(lines): 194 | parts = line[:-1].split('\t') 195 | label = parts[0] 196 | sentence = parts[1] 197 | aug_sentences = SR(sentence, alpha_sr=alpha_sr, n_aug=n_aug) 198 | for aug_sentence in aug_sentences: 199 | writer.write(label + "\t" + aug_sentence + '\n') 200 | writer.close() 201 | print("finished SR for", train_orig, "to", output_file, "with alpha", alpha_sr) 202 | 203 | #generate more data with only random insertion (RI) 204 | def gen_ri_aug(train_orig, output_file, alpha_ri, n_aug): 205 | writer = open(output_file, 'w') 206 | lines = open(train_orig, 'r').readlines() 207 | for i, line in enumerate(lines): 208 | parts = line[:-1].split('\t') 209 | label = parts[0] 210 | sentence = parts[1] 211 | aug_sentences = RI(sentence, alpha_ri=alpha_ri, n_aug=n_aug) 212 | for aug_sentence in aug_sentences: 213 | writer.write(label + "\t" + aug_sentence + '\n') 214 | writer.close() 215 | print("finished RI for", train_orig, "to", output_file, "with alpha", alpha_ri) 216 | 217 | #generate more data with only random swap (RS) 218 | def 
gen_rs_aug(train_orig, output_file, alpha_rs, n_aug): 219 | writer = open(output_file, 'w') 220 | lines = open(train_orig, 'r').readlines() 221 | for i, line in enumerate(lines): 222 | parts = line[:-1].split('\t') 223 | label = parts[0] 224 | sentence = parts[1] 225 | aug_sentences = RS(sentence, alpha_rs=alpha_rs, n_aug=n_aug) 226 | for aug_sentence in aug_sentences: 227 | writer.write(label + "\t" + aug_sentence + '\n') 228 | writer.close() 229 | print("finished RS for", train_orig, "to", output_file, "with alpha", alpha_rs) 230 | 231 | #generate more data with only random deletion (RD) 232 | def gen_rd_aug(train_orig, output_file, alpha_rd, n_aug): 233 | writer = open(output_file, 'w') 234 | lines = open(train_orig, 'r').readlines() 235 | for i, line in enumerate(lines): 236 | parts = line[:-1].split('\t') 237 | label = parts[0] 238 | sentence = parts[1] 239 | aug_sentences = RD(sentence, alpha_rd=alpha_rd, n_aug=n_aug) 240 | for aug_sentence in aug_sentences: 241 | writer.write(label + "\t" + aug_sentence + '\n') 242 | writer.close() 243 | print("finished RD for", train_orig, "to", output_file, "with alpha", alpha_rd) 244 | 245 | ################################################### 246 | ##################### model ####################### 247 | ################################################### 248 | 249 | #building the model in keras 250 | def build_model(sentence_length, word2vec_len, num_classes): 251 | model = None 252 | model = Sequential() 253 | model.add(Bidirectional(LSTM(64, return_sequences=True), input_shape=(sentence_length, word2vec_len))) 254 | model.add(Dropout(0.5)) 255 | model.add(Bidirectional(LSTM(32, return_sequences=False))) 256 | model.add(Dropout(0.5)) 257 | model.add(Dense(20, activation='relu')) 258 | model.add(Dense(num_classes, kernel_initializer='normal', activation='softmax')) 259 | model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) 260 | #print(model.summary()) 261 | return model 262 | 263 | #building the cnn in keras 264 | def build_cnn(sentence_length, word2vec_len, num_classes): 265 | model = None 266 | model = Sequential() 267 | model.add(layers.Conv1D(128, 5, activation='relu', input_shape=(sentence_length, word2vec_len))) 268 | model.add(layers.GlobalMaxPooling1D()) 269 | model.add(Dense(20, activation='relu')) 270 | model.add(Dense(num_classes, kernel_initializer='normal', activation='softmax')) 271 | model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) 272 | return model 273 | 274 | #one hot to categorical 275 | def one_hot_to_categorical(y): 276 | assert len(y.shape) == 2 277 | return np.argmax(y, axis=1) 278 | 279 | def get_now_str(): 280 | return str(strftime("%Y-%m-%d_%H:%M:%S", gmtime())) 281 | 282 | -------------------------------------------------------------------------------- /experiments/nlp_aug.py: -------------------------------------------------------------------------------- 1 | # Easy data augmentation techniques for text classification 2 | # Jason Wei, Chengyu Huang, Yifang Wei, Fei Xing, Kai Zou 3 | 4 | import random 5 | from random import shuffle 6 | random.seed(1) 7 | 8 | #stop words list 9 | stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 10 | 'ours', 'ourselves', 'you', 'your', 'yours', 11 | 'yourself', 'yourselves', 'he', 'him', 'his', 12 | 'himself', 'she', 'her', 'hers', 'herself', 13 | 'it', 'its', 'itself', 'they', 'them', 'their', 14 | 'theirs', 'themselves', 'what', 'which', 'who', 15 | 'whom', 'this', 'that', 'these', 'those', 'am', 16 | 'is', 
'are', 'was', 'were', 'be', 'been', 'being', 17 | 'have', 'has', 'had', 'having', 'do', 'does', 'did', 18 | 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 19 | 'because', 'as', 'until', 'while', 'of', 'at', 20 | 'by', 'for', 'with', 'about', 'against', 'between', 21 | 'into', 'through', 'during', 'before', 'after', 22 | 'above', 'below', 'to', 'from', 'up', 'down', 'in', 23 | 'out', 'on', 'off', 'over', 'under', 'again', 24 | 'further', 'then', 'once', 'here', 'there', 'when', 25 | 'where', 'why', 'how', 'all', 'any', 'both', 'each', 26 | 'few', 'more', 'most', 'other', 'some', 'such', 'no', 27 | 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 28 | 'very', 's', 't', 'can', 'will', 'just', 'don', 29 | 'should', 'now', ''] 30 | 31 | #cleaning up text 32 | import re 33 | def get_only_chars(line): 34 | 35 | clean_line = "" 36 | 37 | line = line.replace("’", "") 38 | line = line.replace("'", "") 39 | line = line.replace("-", " ") #replace hyphens with spaces 40 | line = line.replace("\t", " ") 41 | line = line.replace("\n", " ") 42 | line = line.lower() 43 | 44 | for char in line: 45 | if char in 'qwertyuiopasdfghjklzxcvbnm ': 46 | clean_line += char 47 | else: 48 | clean_line += ' ' 49 | 50 | clean_line = re.sub(' +',' ',clean_line) #delete extra spaces 51 | if clean_line[0] == ' ': 52 | clean_line = clean_line[1:] 53 | return clean_line 54 | 55 | ######################################################################## 56 | # Synonym replacement 57 | # Replace n words in the sentence with synonyms from wordnet 58 | ######################################################################## 59 | 60 | #for the first time you use wordnet 61 | #import nltk 62 | #nltk.download('wordnet') 63 | from nltk.corpus import wordnet 64 | 65 | def synonym_replacement(words, n): 66 | new_words = words.copy() 67 | random_word_list = list(set([word for word in words if word not in stop_words])) 68 | random.shuffle(random_word_list) 69 | num_replaced = 0 70 | for random_word in random_word_list: 71 | synonyms = get_synonyms(random_word) 72 | if len(synonyms) >= 1: 73 | synonym = random.choice(list(synonyms)) 74 | new_words = [synonym if word == random_word else word for word in new_words] 75 | #print("replaced", random_word, "with", synonym) 76 | num_replaced += 1 77 | if num_replaced >= n: #only replace up to n words 78 | break 79 | 80 | #this is stupid but we need it, trust me 81 | sentence = ' '.join(new_words) 82 | new_words = sentence.split(' ') 83 | 84 | return new_words 85 | 86 | def get_synonyms(word): 87 | synonyms = set() 88 | for syn in wordnet.synsets(word): 89 | for l in syn.lemmas(): 90 | synonym = l.name().replace("_", " ").replace("-", " ").lower() 91 | synonym = "".join([char for char in synonym if char in ' qwertyuiopasdfghjklzxcvbnm']) 92 | synonyms.add(synonym) 93 | if word in synonyms: 94 | synonyms.remove(word) 95 | return list(synonyms) 96 | 97 | ######################################################################## 98 | # Random deletion 99 | # Randomly delete words from the sentence with probability p 100 | ######################################################################## 101 | 102 | def random_deletion(words, p): 103 | 104 | #obviously, if there's only one word, don't delete it 105 | if len(words) == 1: 106 | return words 107 | 108 | #randomly delete words with probability p 109 | new_words = [] 110 | for word in words: 111 | r = random.uniform(0, 1) 112 | if r > p: 113 | new_words.append(word) 114 | 115 | #if you end up deleting all words, just return a random 
word 116 | if len(new_words) == 0: 117 | rand_int = random.randint(0, len(words)-1) 118 | return [words[rand_int]] 119 | 120 | return new_words 121 | 122 | ######################################################################## 123 | # Random swap 124 | # Randomly swap two words in the sentence n times 125 | ######################################################################## 126 | 127 | def random_swap(words, n): 128 | new_words = words.copy() 129 | for _ in range(n): 130 | new_words = swap_word(new_words) 131 | return new_words 132 | 133 | def swap_word(new_words): 134 | random_idx_1 = random.randint(0, len(new_words)-1) 135 | random_idx_2 = random_idx_1 136 | counter = 0 137 | while random_idx_2 == random_idx_1: 138 | random_idx_2 = random.randint(0, len(new_words)-1) 139 | counter += 1 140 | if counter > 3: 141 | return new_words 142 | new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1] 143 | return new_words 144 | 145 | ######################################################################## 146 | # Random addition 147 | # Randomly add n words into the sentence 148 | ######################################################################## 149 | 150 | def random_addition(words, n): 151 | new_words = words.copy() 152 | for _ in range(n): 153 | add_word(new_words) 154 | return new_words 155 | 156 | def add_word(new_words): 157 | synonyms = [] 158 | counter = 0 159 | while len(synonyms) < 1: 160 | random_word = new_words[random.randint(0, len(new_words)-1)] 161 | synonyms = get_synonyms(random_word) 162 | counter += 1 163 | if counter >= 10: 164 | return 165 | random_synonym = synonyms[0] 166 | random_idx = random.randint(0, len(new_words)-1) 167 | new_words.insert(random_idx, random_synonym) 168 | 169 | ######################################################################## 170 | # main data augmentation function 171 | ######################################################################## 172 | 173 | def eda_4(sentence, alpha_sr=0.3, alpha_ri=0.2, alpha_rs=0.1, p_rd=0.15, num_aug=9): 174 | 175 | sentence = get_only_chars(sentence) 176 | words = sentence.split(' ') 177 | words = [word for word in words if word is not ''] 178 | num_words = len(words) 179 | 180 | augmented_sentences = [] 181 | num_new_per_technique = int(num_aug/4)+1 182 | n_sr = max(1, int(alpha_sr*num_words)) 183 | n_ri = max(1, int(alpha_ri*num_words)) 184 | n_rs = max(1, int(alpha_rs*num_words)) 185 | 186 | #sr 187 | for _ in range(num_new_per_technique): 188 | a_words = synonym_replacement(words, n_sr) 189 | augmented_sentences.append(' '.join(a_words)) 190 | 191 | #ri 192 | for _ in range(num_new_per_technique): 193 | a_words = random_addition(words, n_ri) 194 | augmented_sentences.append(' '.join(a_words)) 195 | 196 | #rs 197 | for _ in range(num_new_per_technique): 198 | a_words = random_swap(words, n_rs) 199 | augmented_sentences.append(' '.join(a_words)) 200 | 201 | #rd 202 | for _ in range(num_new_per_technique): 203 | a_words = random_deletion(words, p_rd) 204 | augmented_sentences.append(' '.join(a_words)) 205 | 206 | augmented_sentences = [get_only_chars(sentence) for sentence in augmented_sentences] 207 | shuffle(augmented_sentences) 208 | 209 | #trim so that we have the desired number of augmented sentences 210 | if num_aug >= 1: 211 | augmented_sentences = augmented_sentences[:num_aug] 212 | else: 213 | keep_prob = num_aug / len(augmented_sentences) 214 | augmented_sentences = [s for s in augmented_sentences if random.uniform(0, 1) < keep_prob] 215 | 
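	#note: when num_aug < 1 it is treated as a keep fraction rather than a count, so each augmented sentence above is retained independently with probability num_aug / len(augmented_sentences)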
216 | #append the original sentence 217 | augmented_sentences.append(sentence) 218 | 219 | return augmented_sentences 220 | 221 | def SR(sentence, alpha_sr, n_aug=9): 222 | 223 | sentence = get_only_chars(sentence) 224 | words = sentence.split(' ') 225 | num_words = len(words) 226 | 227 | augmented_sentences = [] 228 | n_sr = max(1, int(alpha_sr*num_words)) 229 | 230 | for _ in range(n_aug): 231 | a_words = synonym_replacement(words, n_sr) 232 | augmented_sentences.append(' '.join(a_words)) 233 | 234 | augmented_sentences = [get_only_chars(sentence) for sentence in augmented_sentences] 235 | shuffle(augmented_sentences) 236 | 237 | augmented_sentences.append(sentence) 238 | 239 | return augmented_sentences 240 | 241 | def RI(sentence, alpha_ri, n_aug=9): 242 | 243 | sentence = get_only_chars(sentence) 244 | words = sentence.split(' ') 245 | num_words = len(words) 246 | 247 | augmented_sentences = [] 248 | n_ri = max(1, int(alpha_ri*num_words)) 249 | 250 | for _ in range(n_aug): 251 | a_words = random_addition(words, n_ri) 252 | augmented_sentences.append(' '.join(a_words)) 253 | 254 | augmented_sentences = [get_only_chars(sentence) for sentence in augmented_sentences] 255 | shuffle(augmented_sentences) 256 | 257 | augmented_sentences.append(sentence) 258 | 259 | return augmented_sentences 260 | 261 | def RS(sentence, alpha_rs, n_aug=9): 262 | 263 | sentence = get_only_chars(sentence) 264 | words = sentence.split(' ') 265 | num_words = len(words) 266 | 267 | augmented_sentences = [] 268 | n_rs = max(1, int(alpha_rs*num_words)) 269 | 270 | for _ in range(n_aug): 271 | a_words = random_swap(words, n_rs) 272 | augmented_sentences.append(' '.join(a_words)) 273 | 274 | augmented_sentences = [get_only_chars(sentence) for sentence in augmented_sentences] 275 | shuffle(augmented_sentences) 276 | 277 | augmented_sentences.append(sentence) 278 | 279 | return augmented_sentences 280 | 281 | def RD(sentence, alpha_rd, n_aug=9): 282 | 283 | sentence = get_only_chars(sentence) 284 | words = sentence.split(' ') 285 | words = [word for word in words if word is not ''] 286 | num_words = len(words) 287 | 288 | augmented_sentences = [] 289 | 290 | for _ in range(n_aug): 291 | a_words = random_deletion(words, alpha_rd) 292 | augmented_sentences.append(' '.join(a_words)) 293 | 294 | augmented_sentences = [get_only_chars(sentence) for sentence in augmented_sentences] 295 | shuffle(augmented_sentences) 296 | 297 | augmented_sentences.append(sentence) 298 | 299 | return augmented_sentences 300 | 301 | 302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | 318 | 319 | 320 | ######################################################################## 321 | # Testing 322 | ######################################################################## 323 | 324 | if __name__ == '__main__': 325 | 326 | line = 'Hi. My name is Jason. I’m a third-year computer science major at Dartmouth College, interested in deep learning and computer vision. My advisor is Saeed Hassanpour. I’m currently working on deep learning for lung cancer classification.' 
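	#minimal smoke test (illustrative sketch, using eda_4 defined above): print a few augmentations of the sample sentence
	for aug_sentence in eda_4(line, num_aug=4):
		print(aug_sentence)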
327 | 328 | 329 | 330 | ######################################################################## 331 | # Sliding window 332 | # Slide a window of size w over the sentence with stride s 333 | # Returns a list of lists of words 334 | ######################################################################## 335 | 336 | # def sliding_window_sentences(words, w, s): 337 | # windows = [] 338 | # for i in range(0, len(words)-w+1, s): 339 | # window = words[i:i+w] 340 | # windows.append(window) 341 | # return windows 342 | 343 | 344 | 345 | 346 | -------------------------------------------------------------------------------- /preprocess/__pycache__/utils.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/preprocess/__pycache__/utils.cpython-36.pyc -------------------------------------------------------------------------------- /preprocess/bg_clean.py: -------------------------------------------------------------------------------- 1 | 2 | from utils import * 3 | 4 | def clean_csv(input_file, output_file): 5 | 6 | input_r = open(input_file, 'r').read() 7 | 8 | lines = input_r.split(',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,') 9 | print(len(lines)) 10 | for line in lines[:10]: 11 | print(line[-3:]) 12 | 13 | if __name__ == "__main__": 14 | 15 | input_file = 'raw/blog-gender-dataset.csv' 16 
| output_file = 'datasets/bg/train.csv' 17 | 18 | clean_csv(input_file, output_file) 19 | 20 | 21 | -------------------------------------------------------------------------------- /preprocess/copy_sized_datasets.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | sizes = ['1_tiny', '2_small', '3_standard', '4_full'] 4 | datasets = ['sst2', 'cr', 'subj', 'trec', 'pc'] 5 | 6 | for size in sizes: 7 | for dataset in datasets: 8 | folder = 'size_data_t1/' + size + '/' + dataset 9 | if not os.path.exists(folder): 10 | os.makedirs(folder) 11 | 12 | origin = 'sized_datasets_f1/' + size + '/' + dataset + '/train_orig.txt' 13 | destination = 'size_data_t1/' + size + '/' + dataset + '/train_orig.txt' 14 | os.system('cp ' + origin + ' ' + destination) -------------------------------------------------------------------------------- /preprocess/cr_clean.py: -------------------------------------------------------------------------------- 1 | #0 = neg, 1 = pos 2 | from utils import * 3 | 4 | def retrieve_reviews(line): 5 | 6 | reviews = set() 7 | chars = list(line) 8 | for i, char in enumerate(chars): 9 | if char == '[': 10 | if chars[i+1] == '-': 11 | reviews.add(0) 12 | elif chars[i+1] == '+': 13 | reviews.add(1) 14 | 15 | reviews = list(reviews) 16 | if len(reviews) == 2: 17 | return -2 18 | elif len(reviews) == 1: 19 | return reviews[0] 20 | else: 21 | return -1 22 | 23 | def clean_files(input_files, output_file): 24 | 25 | writer = open(output_file, 'w') 26 | 27 | for input_file in input_files: 28 | print(input_file) 29 | input_lines = open(input_file, 'r').readlines() 30 | counter = 0 31 | bad_counter = 0 32 | for line in input_lines: 33 | review = retrieve_reviews(line) 34 | if review in {0, 1}: 35 | good_line = get_only_chars(re.sub("([\(\[]).*?([\)\]])", "\g<1>\g<2>", line)) 36 | output_line = str(review) + '\t' + good_line 37 | writer.write(output_line + '\n') 38 | counter += 1 39 | elif review == -2: 40 | bad_counter +=1 41 | print(input_file, counter, bad_counter) 42 | 43 | writer.close() 44 | 45 | if __name__ == '__main__': 46 | 47 | input_files = ['all.txt']#['canon_power.txt', 'canon_s1.txt', 'diaper.txt', 'hitachi.txt', 'ipod.txt', 'micromp3.txt', 'nokia6600.txt', 'norton.txt', 'router.txt'] 48 | input_files = ['raw/cr/data_new/' + f for f in input_files] 49 | output_file = 'datasets/cr/apex_clean.txt' 50 | 51 | clean_files(input_files, output_file) 52 | -------------------------------------------------------------------------------- /preprocess/create_dataset_increments.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | datasets = ['cr', 'pc', 'sst1', 'sst2', 'subj', 'trec'] 4 | 5 | for dataset in datasets: 6 | line = 'cat increment_datasets_f2/' + dataset + '/test.txt > sized_datasets_f1/test/' + dataset + '/test.txt' 7 | os.system(line) -------------------------------------------------------------------------------- /preprocess/get_stats.py: -------------------------------------------------------------------------------- 1 | import statistics 2 | 3 | datasets = ['sst2', 'cr', 'subj', 'trec', 'pc'] 4 | 5 | filenames = ['increment_datasets_f2/' + x + '/train_orig.txt' for x in datasets] 6 | 7 | def get_vocab_size(filename): 8 | lines = open(filename, 'r').readlines() 9 | 10 | vocab = set() 11 | for line in lines: 12 | words = line[:-1].split(' ') 13 | for word in words: 14 | if word not in vocab: 15 | vocab.add(word) 16 | 17 | return len(vocab) 18 | 19 | def 
get_mean_and_std(filename): 20 | lines = open(filename, 'r').readlines() 21 | 22 | line_lengths = [] 23 | for line in lines: 24 | length = len(line[:-1].split(' ')) - 1 25 | line_lengths.append(length) 26 | 27 | print(filename, statistics.mean(line_lengths), statistics.stdev(line_lengths), max(line_lengths)) 28 | 29 | 30 | for filename in filenames: 31 | #print(get_vocab_size(filename)) 32 | get_mean_and_std(filename) 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | -------------------------------------------------------------------------------- /preprocess/procon_clean.py: -------------------------------------------------------------------------------- 1 | 2 | from utils import * 3 | 4 | def get_good_stuff(line): 5 | idx = line.find('s>') 6 | good = line[idx+2:-8] 7 | 8 | return get_only_chars(good) 9 | 10 | def clean_file(con_file, pro_file, output_train, output_test): 11 | 12 | train_writer = open(output_train, 'w') 13 | test_writer = open(output_test, 'w') 14 | con_lines = open(con_file, 'r').readlines() 15 | for line in con_lines[:int(len(con_lines)*0.9)]: 16 | content = get_good_stuff(line) 17 | if len(content) >= 8: 18 | train_writer.write('0\t' + content + '\n') 19 | for line in con_lines[int(len(con_lines)*0.9):]: 20 | content = get_good_stuff(line) 21 | if len(content) >= 8: 22 | test_writer.write('0\t' + content + '\n') 23 | 24 | pro_lines = open(pro_file, 'r').readlines() 25 | for line in pro_lines[:int(len(con_lines)*0.9)]: 26 | content = get_good_stuff(line) 27 | if len(content) >= 8: 28 | train_writer.write('1\t' + content + '\n') 29 | for line in pro_lines[int(len(con_lines)*0.9):]: 30 | content = get_good_stuff(line) 31 | if len(content) >= 8: 32 | test_writer.write('1\t' + content + '\n') 33 | 34 | 35 | if __name__ == '__main__': 36 | 37 | con_file = 'raw/pros-cons/integratedCons.txt' 38 | pro_file = 'raw/pros-cons/integratedPros.txt' 39 | output_train = 'datasets/procon/train.txt' 40 | output_test = 'datasets/procon/test.txt' 41 | clean_file(con_file, pro_file, output_train, output_test) -------------------------------------------------------------------------------- /preprocess/shuffle_lines.py: -------------------------------------------------------------------------------- 1 | import random 2 | 3 | def shuffle_lines(text_file): 4 | lines = open(text_file).readlines() 5 | random.shuffle(lines) 6 | open(text_file, 'w').writelines(lines) 7 | 8 | shuffle_lines('special_f4/pc/test_short_aug_shuffle.txt') -------------------------------------------------------------------------------- /preprocess/sst1_clean.py: -------------------------------------------------------------------------------- 1 | from utils import * 2 | 3 | def get_label(decimal): 4 | if decimal >= 0 and decimal <= 0.2: 5 | return 0 6 | elif decimal > 0.2 and decimal <= 0.4: 7 | return 1 8 | elif decimal > 0.4 and decimal <= 0.6: 9 | return 2 10 | elif decimal > 0.6 and decimal <= 0.8: 11 | return 3 12 | elif decimal > 0.8 and decimal <= 1: 13 | return 4 14 | else: 15 | return -1 16 | 17 | def get_label_binary(decimal): 18 | if decimal >= 0 and decimal <= 0.4: 19 | return 0 20 | elif decimal > 0.6 and decimal <= 1: 21 | return 1 22 | else: 23 | return -1 24 | 25 | def get_split(split_num): 26 | if split_num == 1 or split_num == 3: 27 | return 'train' 28 | elif split_num == 2: 29 | return 'test' 30 | 31 | if __name__ == "__main__": 32 | 33 | data_path = 'raw/sst_1/stanfordSentimentTreebank/datasetSentences.txt' 34 | labels_path = 'raw/sst_1/stanfordSentimentTreebank/sentiment_labels.txt' 35 | split_path = 
'raw/sst_1/stanfordSentimentTreebank/datasetSplit.txt' 36 | dictionary_path = 'raw/sst_1/stanfordSentimentTreebank/dictionary.txt' 37 | 38 | sentence_lines = open(data_path, 'r').readlines() 39 | labels_lines = open(labels_path, 'r').readlines() 40 | split_lines = open(split_path, 'r').readlines() 41 | dictionary_lines = open(dictionary_path, 'r').readlines() 42 | 43 | print(len(sentence_lines)) 44 | print(len(split_lines)) 45 | print(len(labels_lines)) 46 | print(len(dictionary_lines)) 47 | 48 | #create dictionary for id to label 49 | id_to_label = {} 50 | for line in labels_lines[1:]: 51 | parts = line[:-1].split("|") 52 | _id = parts[0] 53 | score = float(parts[1]) 54 | label = get_label_binary(score) 55 | 56 | id_to_label[_id] = label 57 | 58 | print(len(id_to_label), "id to labels read in") 59 | 60 | #create dictionary for phrase to label 61 | phrase_to_label = {} 62 | for line in dictionary_lines: 63 | parts = line[:-1].split("|") 64 | phrase = parts[0] 65 | _id = parts[1] 66 | label = id_to_label[_id] 67 | 68 | phrase_to_label[phrase] = label 69 | 70 | print(len(phrase_to_label), "phrase to id read in") 71 | 72 | #create id to split 73 | id_to_split = {} 74 | for line in split_lines[1:]: 75 | parts = line[:-1].split(",") 76 | _id = parts[0] 77 | split_num = float(parts[1]) 78 | split = get_split(split_num) 79 | id_to_split[_id] = split 80 | 81 | print(len(id_to_split), "id to split read in") 82 | 83 | train_writer = open('datasets/sst2/train_orig.txt', 'w') 84 | test_writer = open('datasets/sst2/test.txt', 'w') 85 | 86 | #create sentence to split and label 87 | for sentence_line in sentence_lines[1:]: 88 | parts = sentence_line[:-1].split('\t') 89 | _id = parts[0] 90 | sentence = get_only_chars(parts[1]) 91 | split = id_to_split[_id] 92 | 93 | if parts[1] in phrase_to_label: 94 | label = phrase_to_label[parts[1]] 95 | if label in {0, 1}: 96 | #print(label, sentence, split) 97 | if split == 'train': 98 | train_writer.write(str(label) + '\t' + sentence + '\n') 99 | elif split == 'test': 100 | test_writer.write(str(label) + '\t' + sentence + '\n') 101 | 102 | #print(parts, split) 103 | 104 | #label = [] 105 | 106 | 107 | 108 | 109 | -------------------------------------------------------------------------------- /preprocess/subj_clean.py: -------------------------------------------------------------------------------- 1 | from utils import * 2 | 3 | if __name__ == "__main__": 4 | subj_path = "subj/rotten_imdb/subj.txt" 5 | obj_path = "subj/rotten_imdb/plot.tok.gt9.5000" 6 | 7 | subj_lines = open(subj_path, 'r').readlines() 8 | obj_lines = open(obj_path, 'r').readlines() 9 | print(len(subj_lines), len(obj_lines)) 10 | 11 | test_split = int(0.9*len(subj_lines)) 12 | 13 | train_lines = [] 14 | test_lines = [] 15 | 16 | #training set 17 | for s_line in subj_lines[:test_split]: 18 | clean_line = '1\t' + get_only_chars(s_line[:-1]) 19 | train_lines.append(clean_line) 20 | 21 | for o_line in obj_lines[:test_split]: 22 | clean_line = '0\t' + get_only_chars(o_line[:-1]) 23 | train_lines.append(clean_line) 24 | 25 | #testing set 26 | for s_line in subj_lines[test_split:]: 27 | clean_line = '1\t' + get_only_chars(s_line[:-1]) 28 | test_lines.append(clean_line) 29 | 30 | for o_line in obj_lines[test_split:]: 31 | clean_line = '0\t' + get_only_chars(o_line[:-1]) 32 | test_lines.append(clean_line) 33 | 34 | print(len(test_lines), len(train_lines)) 35 | 36 | #print training set 37 | writer = open('datasets/subj/train_orig.txt', 'w') 38 | for line in train_lines: 39 | writer.write(line + '\n') 40 | 
writer.close() 41 | 42 | #print testing set 43 | writer = open('datasets/subj/test.txt', 'w') 44 | for line in test_lines: 45 | writer.write(line + '\n') 46 | writer.close() -------------------------------------------------------------------------------- /preprocess/trej_clean.py: -------------------------------------------------------------------------------- 1 | 2 | from utils import * 3 | 4 | class_name_to_num = {'DESC': 0, 'ENTY':1, 'ABBR':2, 'HUM': 3, 'LOC': 4, 'NUM': 5} 5 | 6 | def clean(input_file, output_file): 7 | lines = open(input_file, 'r').readlines() 8 | writer = open(output_file, 'w') 9 | for line in lines: 10 | parts = line[:-1].split(' ') 11 | tag = parts[0].split(':')[0] 12 | class_num = class_name_to_num[tag] 13 | sentence = get_only_chars(' '.join(parts[1:])) 14 | print(tag, class_num, sentence) 15 | output_line = str(class_num) + '\t' + sentence 16 | writer.write(output_line + '\n') 17 | writer.close() 18 | 19 | 20 | if __name__ == "__main__": 21 | 22 | clean('raw/trec/train_copy.txt', 'datasets/trec/train_orig.txt') 23 | clean('raw/trec/test_copy.txt', 'datasets/trec/test.txt') 24 | -------------------------------------------------------------------------------- /preprocess/utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | 4 | 5 | 6 | #cleaning up text 7 | def get_only_chars(line): 8 | 9 | clean_line = "" 10 | 11 | line = line.lower() 12 | line = line.replace(" 's", " is") 13 | line = line.replace("-", " ") #replace hyphens with spaces 14 | line = line.replace("\t", " ") 15 | line = line.replace("\n", " ") 16 | line = line.replace("'", "") 17 | 18 | for char in line: 19 | if char in 'qwertyuiopasdfghjklzxcvbnm ': 20 | clean_line += char 21 | else: 22 | clean_line += ' ' 23 | 24 | clean_line = re.sub(' +',' ',clean_line) #delete extra spaces 25 | print(clean_line) 26 | if clean_line[0] == ' ': 27 | clean_line = clean_line[1:] 28 | return clean_line --------------------------------------------------------------------------------