├── .gitignore ├── README.md ├── code ├── __pycache__ │ └── eda.cpython-36.pyc ├── augment.py └── eda.py ├── data ├── lol.txt └── sst2_train_500.txt ├── eda_figure.png ├── experiments ├── __pycache__ │ ├── a_config.cpython-36.pyc │ ├── a_config.cpython-37.pyc │ ├── b_config.cpython-36.pyc │ ├── c_config.cpython-36.pyc │ ├── config.cpython-36.pyc │ ├── e_config.cpython-36.pyc │ ├── methods.cpython-36.pyc │ ├── methods.cpython-37.pyc │ └── nlp_aug.cpython-36.pyc ├── a_1_data_process.py ├── a_2_train_eval.py ├── a_config.py ├── b_1_data_process.py ├── b_2_train_eval.py ├── b_config.py ├── c_1_data_process.py ├── c_2_train_eval.py ├── c_config.py ├── d_0_preprocess.py ├── d_1_train_models.py ├── d_2_tsne.py ├── d_neg_1_balance_trec.py ├── e_1_data_process.py ├── e_2_cnn_aug.py ├── e_2_cnn_baselines.py ├── e_2_rnn_aug.py ├── e_2_rnn_baselines.py ├── e_config.py ├── methods.py └── nlp_aug.py └── preprocess ├── __pycache__ └── utils.cpython-36.pyc ├── bg_clean.py ├── copy_sized_datasets.py ├── cr_clean.py ├── create_dataset_increments.py ├── get_stats.py ├── procon_clean.py ├── shuffle_lines.py ├── sst1_clean.py ├── subj_clean.py ├── trej_clean.py └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | word2vec* 3 | size_data* 4 | size_data_f1* 5 | size_data_f3* 6 | size_data_t1* 7 | increment_datasets_f2* 8 | z_archives* 9 | special_f4* 10 | outputs_f1* 11 | outputs_f2* 12 | outputs_f3* 13 | outputs_f4* 14 | baseline_cnn* 15 | baseline_rnn* 16 | 17 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks 2 | [![Conference](http://img.shields.io/badge/EMNLP-2019-4b44ce.svg)](https://arxiv.org/abs/1901.11196) 3 | 4 | For a survey of data augmentation in NLP, see this [repository](https://github.com/styfeng/DataAug4NLP/blob/main/README.md)/this [paper](http://arxiv.org/abs/2105.03075). 5 | 6 | This is the code for the EMNLP-IJCNLP paper [EDA: Easy Data Augmentation techniques for boosting performance on text classification tasks.](https://arxiv.org/abs/1901.11196) 7 | 8 | A blog post that explains EDA is [[here]](https://medium.com/@jason.20/these-are-the-easiest-data-augmentation-techniques-in-natural-language-processing-you-can-think-of-88e393fd610). 9 | 10 | Update: find an external implementation of EDA in Chinese [[here]](https://github.com/zhanlaoban/EDA_NLP_for_Chinese). 11 | 12 | By [Jason Wei](https://jasonwei20.github.io/research/) and Kai Zou. 13 | 14 | Note: **Do not** email me with questions, as I will not reply. Instead, open an issue. 15 | 16 | We present **EDA**: **e**asy **d**ata **a**ugmentation techniques for boosting performance on text classification tasks. These are a generalized set of data augmentation techniques that are easy to implement and have shown improvements on five NLP classification tasks, with substantial improvements on datasets of size `N < 500`. While other techniques require you to train a language model on an external dataset just to get a small boost, we found that simple text editing operations using EDA result in good performance gains. Given a sentence in the training set, we perform the following operations: 17 | 18 | - **Synonym Replacement (SR):** Randomly choose *n* words from the sentence that are not stop words. 
Replace each of these words with one of its synonyms chosen at random. 19 | - **Random Insertion (RI):** Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this *n* times. 20 | - **Random Swap (RS):** Randomly choose two words in the sentence and swap their positions. Do this *n* times. 21 | - **Random Deletion (RD):** For each word in the sentence, randomly remove it with probability *p*. 22 | 23 |
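For a concrete picture of what these operations produce, here is a minimal sketch that calls the `eda()` helper from `code/eda.py` on a single sentence. The sentence is an arbitrary example, and the sketch assumes `code/` is importable and WordNet has been downloaded as described under Usage below; outputs vary from run to run because every operation is random.

```python
# Minimal sketch: run all four EDA operations on one sentence via code/eda.py.
# Assumes WordNet is already downloaded: import nltk; nltk.download('wordnet')
from eda import eda  # code/eda.py must be importable (e.g., run from inside code/)

sentence = "the quick brown fox jumps over the lazy dog"  # arbitrary example sentence
# each alpha controls roughly what fraction of words the operation touches;
# num_aug is the number of augmented variants to return
augmented = eda(sentence, alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=4)
for s in augmented:
    print(s)  # four augmented variants, then the (cleaned) original sentence
```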

<img src="eda_figure.png" alt="drawing">

24 | Average performance on 5 datasets with and without EDA, with respect to percent of training data used. 25 | 26 | # Usage 27 | 28 | You can run EDA on any text classification dataset in less than 5 minutes. Just two steps: 29 | 30 | ### Install NLTK (if you don't have it already): 31 | 32 | Pip install it. 33 | 34 | ```bash 35 | pip install -U nltk 36 | ``` 37 | 38 | Download WordNet. 39 | ```bash 40 | python 41 | >>> import nltk; nltk.download('wordnet') 42 | ``` 43 | 44 | ### Run EDA 45 | 46 | You can easily write your own implementation, but this one takes input files in the format `label\tsentence` (note the `\t`). So for instance, your input file should look like this (example from the Stanford Sentiment Treebank): 47 | 48 | ``` 49 | 1 neil burger here succeeded in making the mystery of four decades back the springboard for a more immediate mystery in the present 50 | 0 it is a visual rorschach test and i must have failed 51 | 0 the only way to tolerate this insipid brutally clueless film might be with a large dose of painkillers 52 | ... 53 | ``` 54 | 55 | Now place this input file into the `data` folder. Run 56 | 57 | ```bash 58 | python code/augment.py --input=<input filename> 59 | ``` 60 | 61 | The default output filename will append `eda_` to the front of the input filename, but you can specify your own with `--output`. You can also specify the number of generated augmented sentences per original sentence using `--num_aug` (default is 9). Furthermore, you can specify different alpha parameters, each of which is approximately the percent of words in the sentence that will be changed by the corresponding operation (default is `0.1`, i.e. `10%`). So, for example, if your input file is `sst2_train.txt` and you want to output to `sst2_augmented.txt` with `16` augmented sentences per original sentence, replace 5% of words with synonyms (`alpha_sr=0.05`), delete 10% of words (`alpha_rd=0.1`, or leave it as the default), and apply neither random insertion (`alpha_ri=0.0`) nor random swap (`alpha_rs=0.0`), you would run: 62 | 63 | ```bash 64 | python code/augment.py --input=sst2_train.txt --output=sst2_augmented.txt --num_aug=16 --alpha_sr=0.05 --alpha_rd=0.1 --alpha_ri=0.0 --alpha_rs=0.0 65 | ``` 66 | 67 | Note that for any alpha greater than zero, at least one augmentation operation is applied per augmented sentence, no matter how small the alpha. So if you set `alpha_sr=0.001` and your sentence has only four words, one synonym replacement will still be performed. Of course, if a particular alpha is zero, the corresponding operation is skipped. Best of luck! 68 | 69 | # Citation 70 | If you use EDA in your paper, please cite us: 71 | ``` 72 | @inproceedings{wei-zou-2019-eda, 73 | title = "{EDA}: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks", 74 | author = "Wei, Jason and 75 | Zou, Kai", 76 | booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)", 77 | month = nov, 78 | year = "2019", 79 | address = "Hong Kong, China", 80 | publisher = "Association for Computational Linguistics", 81 | url = "https://www.aclweb.org/anthology/D19-1670", 82 | pages = "6383--6389", 83 | } 84 | ``` 85 | 86 | # Experiments 87 | 88 | The code for all experiments used in the paper is not documented, but it is available [here](https://github.com/jasonwei20/eda_nlp/tree/master/experiments). See [this issue](https://github.com/jasonwei20/eda_nlp/issues/10) for limited guidance.
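As a quick end-to-end reference, the sketch below writes a toy input file in the `label\tsentence` format described under Run EDA above and shows the matching `augment.py` call. The path `data/toy_train.txt` and the two example rows are placeholders for illustration, not files in this repository.

```python
# Minimal sketch: write a toy input file in the label\tsentence format expected by augment.py.
# The path data/toy_train.txt and the two rows below are placeholders for illustration only.
rows = [
    ("1", "a gripping and well acted thriller"),
    ("0", "a dull and lifeless sequel"),
]
with open("data/toy_train.txt", "w") as f:
    for label, sentence in rows:
        f.write(label + "\t" + sentence + "\n")

# Then, from the repository root:
#   python code/augment.py --input=data/toy_train.txt --num_aug=9
# By default the augmented file is written next to the input as data/eda_toy_train.txt.
```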
89 | 90 | 91 | 92 | -------------------------------------------------------------------------------- /code/augment.py: -------------------------------------------------------------------------------- 1 | # Easy data augmentation techniques for text classification 2 | # Jason Wei and Kai Zou 3 | 4 | from eda import * 5 | 6 | #arguments to be parsed from command line 7 | import argparse 8 | ap = argparse.ArgumentParser() 9 | ap.add_argument("--input", required=True, type=str, help="input file of unaugmented data") 10 | ap.add_argument("--output", required=False, type=str, help="output file of augmented data") 11 | ap.add_argument("--num_aug", required=False, type=int, help="number of augmented sentences per original sentence") 12 | ap.add_argument("--alpha_sr", required=False, type=float, help="percent of words in each sentence to be replaced by synonyms") 13 | ap.add_argument("--alpha_ri", required=False, type=float, help="percent of words in each sentence to be inserted") 14 | ap.add_argument("--alpha_rs", required=False, type=float, help="percent of words in each sentence to be swapped") 15 | ap.add_argument("--alpha_rd", required=False, type=float, help="percent of words in each sentence to be deleted") 16 | args = ap.parse_args() 17 | 18 | #the output file 19 | output = None 20 | if args.output: 21 | output = args.output 22 | else: 23 | from os.path import dirname, basename, join 24 | output = join(dirname(args.input), 'eda_' + basename(args.input)) 25 | 26 | #number of augmented sentences to generate per original sentence 27 | num_aug = 9 #default 28 | if args.num_aug: 29 | num_aug = args.num_aug 30 | 31 | #how much to replace each word by synonyms 32 | alpha_sr = 0.1 #default 33 | if args.alpha_sr is not None: 34 | alpha_sr = args.alpha_sr 35 | 36 | #how much to insert new words that are synonyms 37 | alpha_ri = 0.1 #default 38 | if args.alpha_ri is not None: 39 | alpha_ri = args.alpha_ri 40 | 41 | #how much to swap words 42 | alpha_rs = 0.1 #default 43 | if args.alpha_rs is not None: 44 | alpha_rs = args.alpha_rs 45 | 46 | #how much to delete words 47 | alpha_rd = 0.1 #default 48 | if args.alpha_rd is not None: 49 | alpha_rd = args.alpha_rd 50 | 51 | if alpha_sr == alpha_ri == alpha_rs == alpha_rd == 0: 52 | ap.error('At least one alpha should be greater than zero') 53 | 54 | #generate more data with standard augmentation 55 | def gen_eda(train_orig, output_file, alpha_sr, alpha_ri, alpha_rs, alpha_rd, num_aug=9): 56 | 57 | writer = open(output_file, 'w') 58 | lines = open(train_orig, 'r').readlines() 59 | 60 | for i, line in enumerate(lines): 61 | parts = line[:-1].split('\t') 62 | label = parts[0] 63 | sentence = parts[1] 64 | aug_sentences = eda(sentence, alpha_sr=alpha_sr, alpha_ri=alpha_ri, alpha_rs=alpha_rs, p_rd=alpha_rd, num_aug=num_aug) 65 | for aug_sentence in aug_sentences: 66 | writer.write(label + "\t" + aug_sentence + '\n') 67 | 68 | writer.close() 69 | print("generated augmented sentences with eda for " + train_orig + " to " + output_file + " with num_aug=" + str(num_aug)) 70 | 71 | #main function 72 | if __name__ == "__main__": 73 | 74 | #generate augmented sentences and output into a new file 75 | gen_eda(args.input, output, alpha_sr=alpha_sr,
alpha_ri=alpha_ri, alpha_rs=alpha_rs, alpha_rd=alpha_rd, num_aug=num_aug) -------------------------------------------------------------------------------- /code/eda.py: -------------------------------------------------------------------------------- 1 | # Easy data augmentation techniques for text classification 2 | # Jason Wei and Kai Zou 3 | 4 | import random 5 | from random import shuffle 6 | random.seed(1) 7 | 8 | #stop words list 9 | stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 10 | 'ours', 'ourselves', 'you', 'your', 'yours', 11 | 'yourself', 'yourselves', 'he', 'him', 'his', 12 | 'himself', 'she', 'her', 'hers', 'herself', 13 | 'it', 'its', 'itself', 'they', 'them', 'their', 14 | 'theirs', 'themselves', 'what', 'which', 'who', 15 | 'whom', 'this', 'that', 'these', 'those', 'am', 16 | 'is', 'are', 'was', 'were', 'be', 'been', 'being', 17 | 'have', 'has', 'had', 'having', 'do', 'does', 'did', 18 | 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 19 | 'because', 'as', 'until', 'while', 'of', 'at', 20 | 'by', 'for', 'with', 'about', 'against', 'between', 21 | 'into', 'through', 'during', 'before', 'after', 22 | 'above', 'below', 'to', 'from', 'up', 'down', 'in', 23 | 'out', 'on', 'off', 'over', 'under', 'again', 24 | 'further', 'then', 'once', 'here', 'there', 'when', 25 | 'where', 'why', 'how', 'all', 'any', 'both', 'each', 26 | 'few', 'more', 'most', 'other', 'some', 'such', 'no', 27 | 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 28 | 'very', 's', 't', 'can', 'will', 'just', 'don', 29 | 'should', 'now', ''] 30 | 31 | #cleaning up text 32 | import re 33 | def get_only_chars(line): 34 | 35 | clean_line = "" 36 | 37 | line = line.replace("’", "") 38 | line = line.replace("'", "") 39 | line = line.replace("-", " ") #replace hyphens with spaces 40 | line = line.replace("\t", " ") 41 | line = line.replace("\n", " ") 42 | line = line.lower() 43 | 44 | for char in line: 45 | if char in 'qwertyuiopasdfghjklzxcvbnm ': 46 | clean_line += char 47 | else: 48 | clean_line += ' ' 49 | 50 | clean_line = re.sub(' +',' ',clean_line) #delete extra spaces 51 | if clean_line[0] == ' ': 52 | clean_line = clean_line[1:] 53 | return clean_line 54 | 55 | ######################################################################## 56 | # Synonym replacement 57 | # Replace n words in the sentence with synonyms from wordnet 58 | ######################################################################## 59 | 60 | #for the first time you use wordnet 61 | #import nltk 62 | #nltk.download('wordnet') 63 | from nltk.corpus import wordnet 64 | 65 | def synonym_replacement(words, n): 66 | new_words = words.copy() 67 | random_word_list = list(set([word for word in words if word not in stop_words])) 68 | random.shuffle(random_word_list) 69 | num_replaced = 0 70 | for random_word in random_word_list: 71 | synonyms = get_synonyms(random_word) 72 | if len(synonyms) >= 1: 73 | synonym = random.choice(list(synonyms)) 74 | new_words = [synonym if word == random_word else word for word in new_words] 75 | #print("replaced", random_word, "with", synonym) 76 | num_replaced += 1 77 | if num_replaced >= n: #only replace up to n words 78 | break 79 | 80 | #this is stupid but we need it, trust me 81 | sentence = ' '.join(new_words) 82 | new_words = sentence.split(' ') 83 | 84 | return new_words 85 | 86 | def get_synonyms(word): 87 | synonyms = set() 88 | for syn in wordnet.synsets(word): 89 | for l in syn.lemmas(): 90 | synonym = l.name().replace("_", " ").replace("-", " ").lower() 91 | synonym = "".join([char for 
char in synonym if char in ' qwertyuiopasdfghjklzxcvbnm']) 92 | synonyms.add(synonym) 93 | if word in synonyms: 94 | synonyms.remove(word) 95 | return list(synonyms) 96 | 97 | ######################################################################## 98 | # Random deletion 99 | # Randomly delete words from the sentence with probability p 100 | ######################################################################## 101 | 102 | def random_deletion(words, p): 103 | 104 | #obviously, if there's only one word, don't delete it 105 | if len(words) == 1: 106 | return words 107 | 108 | #randomly delete words with probability p 109 | new_words = [] 110 | for word in words: 111 | r = random.uniform(0, 1) 112 | if r > p: 113 | new_words.append(word) 114 | 115 | #if you end up deleting all words, just return a random word 116 | if len(new_words) == 0: 117 | rand_int = random.randint(0, len(words)-1) 118 | return [words[rand_int]] 119 | 120 | return new_words 121 | 122 | ######################################################################## 123 | # Random swap 124 | # Randomly swap two words in the sentence n times 125 | ######################################################################## 126 | 127 | def random_swap(words, n): 128 | new_words = words.copy() 129 | for _ in range(n): 130 | new_words = swap_word(new_words) 131 | return new_words 132 | 133 | def swap_word(new_words): 134 | random_idx_1 = random.randint(0, len(new_words)-1) 135 | random_idx_2 = random_idx_1 136 | counter = 0 137 | while random_idx_2 == random_idx_1: 138 | random_idx_2 = random.randint(0, len(new_words)-1) 139 | counter += 1 140 | if counter > 3: 141 | return new_words 142 | new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1] 143 | return new_words 144 | 145 | ######################################################################## 146 | # Random insertion 147 | # Randomly insert n words into the sentence 148 | ######################################################################## 149 | 150 | def random_insertion(words, n): 151 | new_words = words.copy() 152 | for _ in range(n): 153 | add_word(new_words) 154 | return new_words 155 | 156 | def add_word(new_words): 157 | synonyms = [] 158 | counter = 0 159 | while len(synonyms) < 1: 160 | random_word = new_words[random.randint(0, len(new_words)-1)] 161 | synonyms = get_synonyms(random_word) 162 | counter += 1 163 | if counter >= 10: 164 | return 165 | random_synonym = synonyms[0] 166 | random_idx = random.randint(0, len(new_words)-1) 167 | new_words.insert(random_idx, random_synonym) 168 | 169 | ######################################################################## 170 | # main data augmentation function 171 | ######################################################################## 172 | 173 | def eda(sentence, alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=9): 174 | 175 | sentence = get_only_chars(sentence) 176 | words = sentence.split(' ') 177 | words = [word for word in words if word is not ''] 178 | num_words = len(words) 179 | 180 | augmented_sentences = [] 181 | num_new_per_technique = int(num_aug/4)+1 182 | 183 | #sr 184 | if (alpha_sr > 0): 185 | n_sr = max(1, int(alpha_sr*num_words)) 186 | for _ in range(num_new_per_technique): 187 | a_words = synonym_replacement(words, n_sr) 188 | augmented_sentences.append(' '.join(a_words)) 189 | 190 | #ri 191 | if (alpha_ri > 0): 192 | n_ri = max(1, int(alpha_ri*num_words)) 193 | for _ in range(num_new_per_technique): 194 | a_words = random_insertion(words, n_ri) 
195 | augmented_sentences.append(' '.join(a_words)) 196 | 197 | #rs 198 | if (alpha_rs > 0): 199 | n_rs = max(1, int(alpha_rs*num_words)) 200 | for _ in range(num_new_per_technique): 201 | a_words = random_swap(words, n_rs) 202 | augmented_sentences.append(' '.join(a_words)) 203 | 204 | #rd 205 | if (p_rd > 0): 206 | for _ in range(num_new_per_technique): 207 | a_words = random_deletion(words, p_rd) 208 | augmented_sentences.append(' '.join(a_words)) 209 | 210 | augmented_sentences = [get_only_chars(sentence) for sentence in augmented_sentences] 211 | shuffle(augmented_sentences) 212 | 213 | #trim so that we have the desired number of augmented sentences 214 | if num_aug >= 1: 215 | augmented_sentences = augmented_sentences[:num_aug] 216 | else: 217 | keep_prob = num_aug / len(augmented_sentences) 218 | augmented_sentences = [s for s in augmented_sentences if random.uniform(0, 1) < keep_prob] 219 | 220 | #append the original sentence 221 | augmented_sentences.append(sentence) 222 | 223 | return augmented_sentences -------------------------------------------------------------------------------- /data/sst2_train_500.txt: -------------------------------------------------------------------------------- 1 | 1 neil burger here succeeded in making the mystery of four decades back the springboard for a more immediate mystery in the present 2 | 0 it is a visual rorschach test and i must have failed 3 | 0 the only way to tolerate this insipid brutally clueless film might be with a large dose of painkillers 4 | 0 scores no points for originality wit or intelligence 5 | 0 it would take a complete moron to foul up a screen adaptation of oscar wilde is classic satire 6 | 1 pure cinematic intoxication a wildly inventive mixture of comedy and melodrama tastelessness and swooning elegance 7 | 0 it is not the first time that director sara sugarman stoops to having characters drop their pants for laughs and not the last time she fails to provoke them 8 | 1 just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing 9 | 1 matthew lillard is born to play shaggy 10 | 0 it is drab 11 | 1 the film has several strong performances 12 | 1 munch is screenplay is tenderly observant of his characters 13 | 1 isabelle huppert excels as the enigmatic mika and anna mouglalis is a stunning new young talent in one of chabrol is most intense psychological mysteries 14 | 1 a cruelly funny twist on teen comedy packed with inventive cinematic tricks and an ironically killer soundtrack 15 | 0 predictably soulless techno tripe 16 | 1 tsai has a well deserved reputation as one of the cinema world is great visual stylists and in this film every shot enhances the excellent performances 17 | 0 some like it hot on the hardwood proves once again that a man in drag is not in and of himself funny 18 | 0 for a movie about the power of poetry and passion there is precious little of either 19 | 1 diggs and lathan are among the chief reasons brown sugar is such a sweet and sexy film 20 | 1 uneven but a lot of fun 21 | 0 when the film ended i felt tired and drained and wanted to lie on my own deathbed for a while 22 | 0 contains the humor characterization poignancy and intelligence of a bad sitcom 23 | 0 a pretentious and ultimately empty examination of a sick and evil woman 24 | 1 i can easily imagine benigni is pinocchio becoming a christmas perennial 25 | 1 generates an enormous feeling of empathy for its characters 26 | 1 reggio is continual visual barrage is absorbing as 
well as thought provoking 27 | 0 spreads itself too thin leaving these actors as well as the members of the commune short of profound characterizations 28 | 0 once again director chris columbus takes a hat in hand approach to rowling that stifles creativity and allows the film to drag on for nearly three hours 29 | 0 has all the hallmarks of a movie designed strictly for children is home video a market so insatiable it absorbs all manner of lame entertainment as long as year olds find it diverting 30 | 1 a whale of a good time for both children and parents seeking christian themed fun 31 | 0 there are plot holes big enough for shamu the killer whale to swim through 32 | 1 director andrew niccol demonstrates a wry understanding of the quirks of fame 33 | 1 a must for fans of british cinema if only because so many titans of the industry are along for the ride 34 | 0 though clearly well intentioned this cross cultural soap opera is painfully formulaic and stilted 35 | 0 that is because relatively nothing happens 36 | 0 novak contemplates a heartland so overwhelmed by its lack of purpose that it seeks excitement in manufactured high drama 37 | 0 utter mush conceited pap 38 | 0 cassavetes thinks he is making dog day afternoon with a cause but all he is done is to reduce everything he touches to a shrill didactic cartoon 39 | 0 disjointed parody 40 | 1 in fact it is quite fun in places 41 | 0 it can not be enjoyed even on the level that one enjoys a bad slasher flick primarily because it is dull 42 | 0 it is always disappointing when a documentary fails to live up to or offer any new insight into its chosen topic 43 | 0 in the era of the sopranos it feels painfully redundant and inauthentic 44 | 1 a hypnotic cyber hymn and a cruel story of youth culture 45 | 0 much of the cast is stiff or just plain bad 46 | 1 i ve yet to find an actual vietnam war combat movie actually produced by either the north or south vietnamese but at least now we ve got something pretty damn close 47 | 0 a bizarre piece of work with premise and dialogue at the level of kids television and plot threads as morose as teen pregnancy rape and suspected murder 48 | 0 a documentary to make the stones weep as shameful as it is scary 49 | 0 there is got to be a more graceful way of portraying the devastation of this disease 50 | 1 and your reward will be a thoughtful emotional movie experience 51 | 1 brash intelligent and erotically perplexing haneke is portrait of an upper class austrian society and the suppression of its tucked away demons is uniquely felt with a sardonic jolt 52 | 0 it forces you to watch people doing unpleasant things to each other and themselves and it maintains a cool distance from its material that is deliberately unsettling 53 | 0 it is the element of condescension as the filmmakers look down on their working class subjects from their lofty perch that finally makes sex with strangers which opens today in the new york metropolitan area so distasteful 54 | 1 nothing short of a masterpiece and a challenging one 55 | 0 i just saw this movie well it is probably not accurate to call it a movie 56 | 0 a muted freak out 57 | 0 there is an underlying old world sexism to monday morning that undercuts its charm 58 | 0 taken as a whole the tuxedo does nt add up to a whole lot 59 | 1 though the film is static its writer director is heart is in the right place his plea for democracy and civic action laudable 60 | 0 with mcconaughey in an entirely irony free zone and bale reduced mainly to batting his sensitive eyelids 
there is not enough intelligence wit or innovation on the screen to attract and sustain an older crowd 61 | 1 skin of man gets a few cheap shocks from its kids in peril theatrics but it also taps into the primal fears of young people trying to cope with the mysterious and brutal nature of adults 62 | 1 what it lacks in substance it makes up for in heart 63 | 1 like all great films about a life you never knew existed it offers much to absorb and even more to think about after the final frame 64 | 0 an average b movie with no aspirations to be anything more 65 | 1 standard guns versus martial arts cliche with little new added 66 | 1 a beautifully tooled action thriller about love and terrorism in korea 67 | 0 starts out with tremendous promise introducing an intriguing and alluring premise only to fall prey to a boatload of screenwriting cliches that sink it faster than a leaky freighter 68 | 0 hard core slasher aficionados will find things to like but overall the halloween series has lost its edge 69 | 1 a college story that works even without vulgarity sex scenes and cussing 70 | 0 insufferably naive 71 | 1 it is a strength of a documentary to disregard available bias especially as temptingly easy as it would have been with this premise 72 | 1 those with a modicum of patience will find in these characters foibles a timeless and unique perspective 73 | 0 its and pieces of the hot chick are so hilarious and schneider is performance is so fine it is a real shame that so much of the movie again as in the animal is a slapdash mess 74 | 1 there are some movies that hit you from the first scene and you know it is going to be a trip 75 | 1 it proves quite compelling as an intense brooding character study 76 | 0 a loud ugly irritating movie without any of its satirical salvos hitting a discernible target 77 | 0 you can taste it but there is no fizz 78 | 1 the quality of the art combined with the humor and intelligence of the script allow the filmmakers to present the biblical message of forgiveness without it ever becoming preachy or syrupy 79 | 0 the x potion gives the quickly named blossom bubbles and buttercup supernatural powers that include extraordinary strength and laser beam eyes which unfortunately do nt enable them to discern flimsy screenplays 80 | 1 rifkin is references are impeccable throughout 81 | 0 too bad none of it is funny 82 | 1 a gleefully grungy hilariously wicked black comedy 83 | 0 you leave the same way you came a few tasty morsels under your belt but no new friends 84 | 1 though it runs minutes safe conduct is anything but languorous 85 | 1 it is both degrading and strangely liberating to see people working so hard at leading lives of sexy intrigue only to be revealed by the dispassionate gantz brothers as ordinary pasty lumpen 86 | 0 they crush each other under cars throw each other out windows electrocute and dismember their victims in full consciousness 87 | 0 in any case i would recommend big bad love only to winger fans who have missed her since is forget paris 88 | 1 as surreal as a dream and as detailed as a photograph as visually dexterous as it is at times imaginatively overwhelming 89 | 1 even if you do nt know the band or the album is songs by heart you will enjoy seeing how both evolve and you will also learn a good deal about the state of the music business in the st century 90 | 1 with an unflappable air of decadent urbanity everett remains a perfect wildean actor and a relaxed firth displays impeccable comic skill 91 | 1 unpretentious charming quirky 
original 92 | 0 a processed comedy chop suey 93 | 0 a sequel that is much too big for its britches 94 | 0 a complete waste of time 95 | 0 a well intentioned effort that is still too burdened by the actor is offbeat sensibilities for the earnest emotional core to emerge with any degree of accessibility 96 | 1 assayas ambitious sometimes beautiful adaptation of jacques chardonne is novel 97 | 0 despite the fact that this film was nt as bad as i thought it was going to be it is still not a good movie 98 | 0 guys say mean things and shoot a lot of bullets 99 | 0 a manipulative feminist empowerment tale thinly posing as a serious drama about spousal abuse 100 | 0 this movie is so bad that it is almost worth seeing because it is so bad 101 | 0 with a romantic comedy plotline straight from the ages this cinderella story does nt have a single surprise up its sleeve 102 | 1 and more than that it is an observant unfussily poetic meditation about identity and alienation 103 | 0 but it could have been worse 104 | 0 most viewers will wish there had been more of the queen and less of the damned 105 | 0 a science fiction pastiche so lacking in originality that if you stripped away its inspirations there would be precious little left 106 | 1 the pleasures that it does afford may be enough to keep many moviegoers occupied amidst some of the more serious minded concerns of other year end movies 107 | 0 as plain and pedestrian as catsup 108 | 0 every conceivable mistake a director could make in filming opera has been perpetrated here 109 | 1 more concerned with overall feelings broader ideas and open ended questions than concrete story and definitive answers soderbergh is solaris is a gorgeous and deceptively minimalist cinematic tone poem 110 | 0 no cute factor here not that i mind ugly the problem is he has no character loveable or otherwise 111 | 1 filmmaker stacy peralta has a flashy editing style that does nt always jell with sean penn is monotone narration but he respects the material without sentimentalizing it 112 | 1 you do nt need to be a hip hop fan to appreciate scratch and that is the mark of a documentary that works 113 | 0 i was trying to decide what annoyed me most about god is great i m not and then i realized that i just did nt care 114 | 1 metaphors abound but it is easy to take this film at face value and enjoy its slightly humorous and tender story 115 | 1 a comedy that swings and jostles to the rhythms of life 116 | 0 if you re looking to rekindle the magic of the first film you ll need a stronger stomach than us 117 | 1 a modestly made but profoundly moving documentary 118 | 0 pc stability notwithstanding the film suffers from a simplistic narrative and a pat fairy tale conclusion 119 | 0 has about th the fun of its spry predecessor but it is a rushed slapdash sequel for the sake of a sequel with less than half the plot and ingenuity 120 | 1 remarkable for its intelligence and intensity 121 | 1 i do nt know if frailty will turn bill paxton into an a list director but he can rest contentedly with the knowledge that he is made at least one damn fine horror movie 122 | 1 one of the year is most weirdly engaging and unpredictable character pieces 123 | 0 shot like a postcard and overacted with all the boozy self indulgence that brings out the worst in otherwise talented actors 124 | 0 barney is ideas about creation and identity do nt really seem all that profound at least by way of what can be gleaned from this three hour endurance test built around an hour is worth of actual material 125 
| 1 delia greta and paula rank as three of the most multilayered and sympathetic female characters of the year 126 | 0 enough is not a bad movie just mediocre 127 | 0 but the cinematography is cloudy the picture making becalmed 128 | 0 a terrible adaptation of a play that only ever walked the delicate tightrope between farcical and loathsome 129 | 0 slap me i saw this movie 130 | 1 for those of an indulgent slightly sunbaked and summery mind sex and lucia may well prove diverting enough 131 | 0 reign of fire never comes close to recovering from its demented premise but it does sustain an enjoyable level of ridiculousness 132 | 0 the movie ends with outtakes in which most of the characters forget their lines and just utter uhhh which is better than most of the writing in the movie 133 | 1 try as you might to resist if you ve got a place in your heart for smokey robinson this movie will worm its way there 134 | 0 low rent from frame one 135 | 1 we know the plot is a little crazy but it held my interest from start to finish 136 | 0 dull a road trip movie that is surprisingly short of both adventure and song 137 | 0 the colorful masseur wastes its time on mood rather than riding with the inherent absurdity of ganesh is rise up the social ladder 138 | 1 it sends you away a believer again and quite cheered at just that 139 | 1 we ve seen it all before in one form or another but director hoffman with great help from kevin kline makes us care about this latest reincarnation of the world is greatest teacher 140 | 1 watching this gentle mesmerizing portrait of a man coming to terms with time you barely realize your mind is being blown 141 | 1 at its most basic this cartoon adventure is that wind in the hair exhilarating 142 | 1 the charms of the lead performances allow us to forget most of the film is problems 143 | 0 hampered no paralyzed by a self indulgent script that aims for poetry and ends up sounding like satire 144 | 1 a sloppy amusing comedy that proceeds from a stunningly unoriginal premise 145 | 1 the film aims to be funny uplifting and moving sometimes all at once 146 | 1 the story is inspiring ironic and revelatory of just how ridiculous and money oriented the record industry really is 147 | 0 like a fish that is lived too long austin powers in goldmember has some unnecessary parts and is kinda wrong in places 148 | 1 what distinguishes time of favor from countless other thrillers is its underlying concern with the consequences of words and with the complicated emotions fueling terrorist acts 149 | 1 an entertaining if ultimately minor thriller 150 | 1 just when you think that every possible angle has been exhausted by documentarians another new film emerges with yet another remarkable yet shockingly little known perspective 151 | 0 it could have been something special but two things drag it down to mediocrity director clare peploe is misunderstanding of marivaux is rhythms and mira sorvino is limitations as a classical actress 152 | 1 a summer entertainment adults can see without feeling embarrassed but it could have been more 153 | 0 fails in making this character understandable in getting under her skin in exploring motivation well before the end the film grows as dull as its characters about whose fate it is hard to care 154 | 0 turns a potentially interesting idea into an excruciating film school experience that plays better only for the film is publicists or for people who take as many drugs as the film is characters 155 | 1 it is a wonderful sobering heart felt drama 156 | 0 the 
movie does nt think much of its characters its protagonist or of us 157 | 0 it is too self important and plodding to be funny and too clipped and abbreviated to be an epic 158 | 1 this surreal gilliam esque film is also a troubling interpretation of ecclesiastes 159 | 0 the most offensive thing about the movie is that hollywood expects people to pay to see it 160 | 1 in the end the film is less the cheap thriller you d expect than it is a fairly revealing study of its two main characters damaged goods people whose orbits will inevitably and dangerously collide 161 | 0 the entire movie is about a boring sad man being boring and sad 162 | 0 just not campy enough 163 | 0 despite an impressive roster of stars and direction from kathryn bigelow the weight of water is oppressively heavy 164 | 0 when your subject is illusion versus reality should nt the reality seem at least passably real 165 | 1 khouri manages with terrific flair to keep the extremes of screwball farce and blood curdling family intensity on one continuum 166 | 1 it is a masterpiece 167 | 1 romantic comedy and dogme filmmaking may seem odd bedfellows but they turn out to be delightfully compatible here 168 | 1 about schmidt belongs to nicholson 169 | 1 macdowell gives give a solid anguished performance that eclipses nearly everything else she is ever done 170 | 0 wewannour money back actually 171 | 1 it is a clear eyed portrait of an intensely lived time filled with nervous energy moral ambiguity and great uncertainties 172 | 0 not exactly the bees knees 173 | 1 michael gerbosi is script is economically packed with telling scenes 174 | 0 director douglas mcgrath takes on nickleby with all the halfhearted zeal of an th grade boy delving into required reading 175 | 0 that the true story by which all the queen is men is allegedly inspired was a lot funnier and more deftly enacted than what is been cobbled together onscreen 176 | 0 the end result is like cold porridge with only the odd enjoyably chewy lump 177 | 1 it haunts you you ca nt forget it you admire its conception and are able to resolve some of the confusions you had while watching it 178 | 0 forget the psychology study of romantic obsession and just watch the procession of costumes in castles and this wo nt seem like such a bore 179 | 0 the kind of film that leaves you scratching your head in amazement over the fact that so many talented people could participate in such an ill advised and poorly executed idea 180 | 0 off the hook is overlong and not well acted but credit writer producer director adam watstein with finishing it at all 181 | 0 at the very least if you do nt know anything about derrida when you walk into the theater you wo nt know much more when you leave 182 | 1 sly sophisticated and surprising 183 | 1 a new film from bill plympton the animation master is always welcome 184 | 0 it does nt flinch from its unsettling prognosis namely that the legacy of war is a kind of perpetual pain 185 | 0 more tiring than anything 186 | 1 his work with actors is particularly impressive 187 | 0 sunk by way too much indulgence of scene chewing teeth gnashing actorliness 188 | 1 it is an unstinting look at a collaboration between damaged people that may or may not qual 189 | 0 collapses under its own meager weight 190 | 0 quitting however manages just to be depressing as the lead actor phones in his autobiographical performance 191 | 0 the drama was so uninspiring that even a story immersed in love lust and sin could nt keep my attention 192 | 0 due to some script weaknesses 
and the casting of the director is brother the film trails off into inconsequentiality 193 | 0 suspend your disbelief here and now or you ll be shaking your head all the way to the credits 194 | 0 i did nt smile 195 | 1 for all its problems the lady and the duke surprisingly manages never to grow boring which proves that rohmer still has a sense of his audience 196 | 0 it is a drag how nettelbeck sees working women or at least this working woman for whom she shows little understanding 197 | 0 the script is a disaster with cloying messages and irksome characters 198 | 0 in its best moments resembles a bad high school production of grease without benefit of song 199 | 1 it is a fine focused piece of work that reopens an interesting controversy and never succumbs to sensationalism 200 | 1 a triumph of pure craft and passionate heart 201 | 1 not everything in this ambitious comic escapade works but coppola along with his sister sofia is a real filmmaker 202 | 1 the emotions are raw and will strike a nerve with anyone who is ever had family trauma 203 | 1 it deserves to be seen by anyone with even a passing interest in the events shaping the world beyond their own horizons 204 | 0 two bit potboiler 205 | 0 the movie directed by mick jackson leaves no cliche unturned from the predictable plot to the characters straight out of central casting 206 | 1 fun and nimble 207 | 0 big mistake 208 | 1 the film boasts dry humor and jarring shocks plus moments of breathtaking mystery 209 | 1 you may feel compelled to watch the film twice or pick up a book on the subject 210 | 1 west coast rap wars this modern mob music drama never fails to fascinate 211 | 1 children christian or otherwise deserve to hear the full story of jonah is despair in all its agonizing catch glory even if they spend years trying to comprehend it 212 | 0 if they broke out into elaborate choreography singing and finger snapping it might have held my attention but as it stands i kept looking for the last exit from brooklyn 213 | 1 i could nt recommend this film more 214 | 1 translating complex characters from novels to the big screen is an impossible task but they are true to the essence of what it is to be ya ya 215 | 0 their parents would do well to cram earplugs in their ears and put pillowcases over their heads for minutes 216 | 1 rewarding 217 | 1 upsetting and thought provoking the film has an odd purity that does nt bring you into the characters so much as it has you study them 218 | 0 starts as a tart little lemon drop of a movie and ends up as a bitter pill 219 | 0 a little less extreme than in the past with longer exposition sequences between them and with fewer gags to break the tedium 220 | 1 a funny triumphant and moving documentary 221 | 1 an entertaining mix of period drama and flat out farce that should please history fans 222 | 0 during the tuxedo is minutes of screen time there is nt one true chan moment 223 | 1 there is just something about watching a squad of psychopathic underdogs whale the tar out of unsuspecting lawmen that reaches across time and distance 224 | 1 a series of tales told with the intricate preciseness of the best short story writing 225 | 1 a bright inventive thoroughly winning flight of revisionist fancy 226 | 0 almost peerlessly unsettling 227 | 1 a dashing and absorbing outing with one of france is most inventive directors 228 | 1 a true delight 229 | 0 complete lack of originality cleverness or even visible effort 230 | 1 a few nonbelievers may rethink their attitudes when they see the joy the 
characters take in this creed but skeptics are nt likely to enter the theater 231 | 1 like the rugrats movies the wild thornberrys movie does nt offer much more than the series but its emphasis on caring for animals and respecting other cultures is particularly welcome 232 | 0 borstal boy represents the worst kind of filmmaking the kind that pretends to be passionate and truthful but is really frustratingly timid and soggy 233 | 1 you feel good you feel sad you feel pissed off but in the end you feel alive which is what they did 234 | 0 director tom shadyac and star kevin costner glumly mishandle the story is promising premise of a physician who needs to heal himself 235 | 1 as relationships shift director robert j siegel allows the characters to inhabit their world without cleaving to a narrative arc 236 | 0 deadeningly dull mired in convoluted melodrama nonsensical jargon and stiff upper lip laboriousness 237 | 1 jacquot has filmed the opera exactly as the libretto directs ideally capturing the opera is drama and lyricism 238 | 1 it can be safely recommended as a video dvd babysitter 239 | 0 it is played in the most straight faced fashion with little humor to lighten things up 240 | 1 though it goes further than both anyone who has seen the hunger or cat people will find little new here but a tasty performance from vincent gallo lifts this tale of cannibal lust above the ordinary 241 | 1 the rich performances by friel and especially williams an american actress who becomes fully english round out the square edges 242 | 0 amazingly lame 243 | 1 more good than great but freeman and judd make it work 244 | 0 a battle between bug eye theatre and dead eye matinee 245 | 0 i m sorry to say that this should seal the deal arnold is not nor will he be back 246 | 1 though jackson does nt always succeed in integrating the characters in the foreground into the extraordinarily rich landscape it must be said that he is an imaginative filmmaker who can see the forest for the trees 247 | 0 van wilder has a built in audience but only among those who are drying out from spring break and are still unconcerned about what they ingest 248 | 1 what sets ms birot is film apart from others in the genre is a greater attention to the parents and particularly the fateful fathers in the emotional evolution of the two bewitched adolescents 249 | 0 a sentimental mess that never rings true 250 | 1 but the talented cast alone will keep you watching as will the fight scenes 251 | 1 allen is underestimated charm delivers more goodies than lumps of coal 252 | 1 an elegant work food of love is as consistently engaging as it is revealing 253 | 1 zoom 254 | 1 a huge box office hit in korea shiri is a must for genre fans 255 | 1 it is a technically superb film shining with all the usual spielberg flair expertly utilizing the talents of his top notch creative team 256 | 1 what begins as a conventional thriller evolves into a gorgeously atmospheric meditation on life changing chance encounters 257 | 1 a film with a great premise but only a great premise 258 | 1 on that score the film certainly does nt disappoint 259 | 1 the acting costumes music cinematography and sound are all astounding given the production is austere locales 260 | 1 vincent gallo is right at home in this french shocker playing his usual bad boy weirdo role 261 | 1 very well written and directed with brutal honesty and respect for its audience 262 | 0 one senses in world traveler and in his earlier film that freundlich bears a grievous but obscure complaint 
against fathers and circles it obsessively without making contact 263 | 1 neither parker nor donovan is a typical romantic lead but they bring a fresh quirky charm to the formula 264 | 1 a giggle a minute 265 | 0 in the end the film feels homogenized and a bit contrived as if we re looking back at a tattered and ugly past with rose tinted glasses 266 | 1 an unusually dry eyed even analytical approach to material that is generally played for maximum moisture 267 | 1 it made me want to get made up and go see this movie with my sisters 268 | 0 neither revelatory nor truly edgy merely crassly flamboyant and comedically labored 269 | 1 boasts a handful of virtuosic set pieces and offers a fair amount of trashy kinky fun 270 | 0 i do nt mind having my heartstrings pulled but do nt treat me like a fool 271 | 1 this is a sincerely crafted picture that deserves to emerge from the traffic jam of holiday movies 272 | 0 an unintentionally surreal kid is picture in which actors in bad bear suits enact a sort of inter species parody of a vh behind the music episode 273 | 1 gay or straight kissing jessica stein is one of the greatest date movies in years 274 | 0 it looks good but it is essentially empty 275 | 1 and there is no way you wo nt be talking about the film once you exit the theater 276 | 0 much like robin williams death to smoochy has already reached its expiration date 277 | 1 if you love the music and i do its hard to imagine having more fun watching a documentary 278 | 0 a collage of clich s and a dim echo of allusions to other films 279 | 1 norton is magnetic as graham 280 | 1 k the widowmaker is a great yarn 281 | 1 it might be easier to watch on video at home but that should nt stop die hard french film connoisseurs from going out and enjoying the big screen experience 282 | 0 manages to be both repulsively sadistic and mundane 283 | 0 an obvious copy of one of the best films ever made how could it not be 284 | 1 surprisingly the film is a hilarious adventure and i shamelessly enjoyed it 285 | 1 the cat is meow marks a return to form for director peter bogdanovich 286 | 0 it is an odd show pregnant with moods stillborn except as a harsh conceptual exercise 287 | 0 but if the essence of magic is its make believe promise of life that soars above the material realm this is the opposite of a truly magical movie 288 | 0 the film is all over the place really 289 | 0 without any redeeming value whatsoever 290 | 1 it is a familiar story but one that is presented with great sympathy and intelligence 291 | 0 manages to show life in all of its banality when the intention is quite the opposite 292 | 1 read my lips is to be viewed and treasured for its extraordinary intelligence and originality as well as its lyrical variations on the game of love 293 | 0 this director is cut which adds minutes takes a great film and turns it into a mundane soap opera 294 | 1 the ensemble cast turns in a collectively stellar performance and the writing is tight and truthful full of funny situations and honest observations 295 | 1 what saves this deeply affecting film from being merely a collection of wrenching cases is corcuera is attention to detail 296 | 1 take nothing seriously and enjoy the ride 297 | 1 from the opening strains of the average white band is pick up the pieces you can feel the love 298 | 0 while the ensemble player who gained notice in guy ritchie is lock stock and two smoking barrels and snatch has the bod he is unlikely to become a household name on the basis of his first starring vehicle 299 | 0 
i could nt help but feel the wasted potential of this slapstick comedy 300 | 1 it is an offbeat treat that pokes fun at the democratic exercise while also examining its significance for those who take part 301 | 0 nothing too deep or substantial 302 | 0 this picture is mostly a lump of run of the mill profanity sprinkled with a few remarks so geared toward engendering audience sympathy that you might think he was running for office or trying to win over a probation officer 303 | 0 a boring parade of talking heads and technical gibberish that will do little to advance the linux cause 304 | 0 the problem with the mayhem in formula is not that it is offensive but that it is boring 305 | 0 as pedestrian as they come 306 | 0 parents beware this is downright movie penance 307 | 0 really does feel like a short stretched out to feature length 308 | 0 no one but a convict guilty of some truly heinous crime should have to sit through the master of disguise 309 | 1 may take its sweet time to get wherever it is going but if you have the patience for it you wo nt feel like it is wasted yours 310 | 0 would ve been nice if the screenwriters had trusted audiences to understand a complex story and left off the film is predictable denouement 311 | 1 i am not generally a huge fan of cartoons derived from tv shows but hey arnold 312 | 1 brings to a spectacular completion one of the most complex generous and subversive artworks of the last decade 313 | 1 reveals how important our special talents can be when put in service of of others 314 | 1 the gags that fly at such a furiously funny pace that the only rip off that we were aware of was the one we felt when the movie ended so damned soon 315 | 1 more mature than fatal attraction more complete than indecent proposal and more relevant than weeks unfaithful is at once intimate and universal cinema 316 | 1 it is fairly solid not to mention well edited so that it certainly does nt feel like a film that strays past the two and a half mark 317 | 1 while somewhat less than it might have been the film is a good one and you ve got to hand it to director george clooney for biting off such a big job the first time out 318 | 1 routine harmless diversion and little else 319 | 1 cremaster is at once a tough pill to swallow and a minor miracle of self expression 320 | 1 this is human comedy at its most amusing interesting and confirming 321 | 1 a story we have nt seen on the big screen before and it is a story that we as americans and human beings should know 322 | 0 just about everyone involved here seems to be coasting 323 | 1 a tour de force of modern cinema 324 | 1 uplifting funny and wise 325 | 0 it is just merely very bad 326 | 1 it will guarantee to have you leaving the theater with a smile on your face 327 | 0 simplistic silly and tedious 328 | 1 passions obsessions and loneliest dark spots are pushed to their most virtuous limits lending the narrative an unusually surreal tone 329 | 1 thanks to confident filmmaking and a pair of fascinating performances the way to that destination is a really special walk in the woods 330 | 0 it is provocative stuff but the speculative effort is hampered by taylor is cartoonish performance and the film is ill considered notion that hitler is destiny was shaped by the most random of chances 331 | 0 the animation and game phenomenon that peaked about three years ago is actually dying a slow death if the poor quality of pokemon ever is any indication 332 | 1 the script is smart not cloying 333 | 0 a muddy psychological thriller rife 
with miscalculations 334 | 1 the wild thornberrys movie is a jolly surprise 335 | 1 land people and narrative flow together in a stark portrait of motherhood deferred and desire explored 336 | 0 unless there are zoning ordinances to protect your community from the dullest science fiction impostor is opening today at a theater near you 337 | 1 world traveler might not go anywhere new or arrive anyplace special but it is certainly an honest attempt to get at something 338 | 1 at once subtle and visceral the film never succumbs to the trap of the maudlin or tearful offering instead with its unflinching gaze a measure of faith in the future 339 | 1 years of russian history and culture compressed into an evanescent seamless and sumptuous stream of consciousness 340 | 0 the film is maudlin focus on the young woman is infirmity and her naive dreams play like the worst kind of hollywood heart string plucking 341 | 0 director uwe boll and the actors provide scant reason to care in this crude s throwback 342 | 1 intensely romantic thought provoking and even an engaging mystery 343 | 0 the characters are paper thin and the plot is so cliched and contrived that it makes your least favorite james bond movie seem as cleverly plotted as the usual suspects 344 | 1 de niro is a veritable source of sincere passion that this hollywood contrivance orbits around 345 | 1 jonathan parker is bartleby should have been the be all end all of the modern office anomie films 346 | 1 it is a piece of handiwork that shows its indie tatters and self conscious seams in places but has some quietly moving moments and an intelligent subtlety 347 | 0 it is a barely tolerable slog over well trod ground 348 | 0 it is tough to tell which is in more abundant supply in this woefully hackneyed movie directed by scott kalvert about street gangs and turf wars in brooklyn stale cliches gratuitous violence or empty machismo 349 | 0 the script by vincent r nebrida tries to cram too many ingredients into one small pot 350 | 1 strangely comes off as a kingdom more mild than wild 351 | 0 thoroughly awful 352 | 1 a moving story of determination and the human spirit 353 | 1 a naturally funny film home movie makes you crave chris smith is next movie 354 | 0 the only question is to determine how well the schmaltz is manufactured to assess the quality of the manipulative engineering 355 | 0 the premise of abandon holds promise but its delivery is a complete mess 356 | 0 plays like one of those conversations that comic book guy on the simpsons has 357 | 0 in the book on tape market the film of the kid stays in the picture would be an abridged edition 358 | 1 and educational 359 | 1 blisteringly rude scarily funny sorrowfully sympathetic to the damage it surveys the film has in kieran culkin a pitch perfect holden 360 | 0 to get at the root psychology of this film would require many sessions on the couch of dr freud 361 | 0 the young stars are too cute the story and ensuing complications are too manipulative the message is too blatant the resolutions are too convenient 362 | 1 davis candid archly funny and deeply authentic take on intimate relationships comes to fruition in her sophomore effort 363 | 1 not as good as the full monty but a really strong second effort 364 | 0 even bigger and more ambitious than the first installment spy kids looks as if it were made by a highly gifted year old instead of a grown man 365 | 0 includes too much obvious padding 366 | 0 the story alone could force you to scratch a hole in your head 367 | 0 we never truly 
come to care about the main characters and whether or not they ll wind up together and michele is spiritual quest is neither amusing nor dramatic enough to sustain interest 368 | 1 this is nt exactly profound cinema but it is good natured and sometimes quite funny 369 | 0 impostor ca nt think of a thing to do with these characters except have them run through dark tunnels fight off various anonymous attackers and evade elaborate surveillance technologies 370 | 0 and that leaves a hole in the center of the salton sea 371 | 1 chamber of secrets will find millions of eager fans 372 | 0 seagal ran out of movies years ago and this is just the proof 373 | 0 a movie like the guys is why film criticism can be considered work 374 | 1 as it turns out you can go home again 375 | 1 her performance moves between heartbreak and rebellion as she continually tries to accommodate to fit in and gain the unconditional love she seeks 376 | 0 a low rate annie featuring some kid who ca nt act only echoes of jordan and weirdo actor crispin glover screwing things up old school 377 | 1 rock solid family fun out of the gates extremely imaginative through out but wanes in the middle 378 | 0 if you go pack your knitting needles 379 | 0 a technical triumph and an extraordinary bore 380 | 0 if you re not fans of the adventues of steve and terri you should avoid this like the dreaded king brown snake 381 | 0 the comedy death to smoochy is a rancorous curiosity a movie without an apparent audience 382 | 1 the entire cast is extraordinarily good 383 | 1 hugh grant who has a good line in charm has never been more charming than in about a boy 384 | 1 delivers the sexy razzle dazzle that everyone especially movie musical fans has been hoping for 385 | 1 a gripping movie played with performances that are all understated and touching 386 | 0 hoffman waits too long to turn his movie in an unexpected direction and even then his tone retains a genteel prep school quality that feels dusty and leatherbound 387 | 1 an ambitious what if 388 | 1 uses high comedy to evoke surprising poignance 389 | 0 contains a few big laughs but many more that graze the funny bone or miss it altogether in part because the consciously dumbed down approach wears thin 390 | 1 the journey to the secret is eventual discovery is a separate adventure and thrill enough 391 | 1 it is one heck of a character study not of hearst or davies but of the unique relationship between them 392 | 1 a live wire film that never loses its ability to shock and amaze 393 | 1 a real audience pleaser that will strike a chord with anyone who is ever waited in a doctor is office emergency room hospital bed or insurance company office 394 | 0 there is no good answer to that one 395 | 0 the film contains no good jokes no good scenes barely a moment when carvey is saturday night live honed mimicry rises above the level of embarrassment 396 | 0 as inept as big screen remakes of the avengers and the wild wild west 397 | 0 it is difficult to feel anything much while watching this movie beyond mild disturbance or detached pleasure at the acting 398 | 0 almost as offensive as freddy got fingered 399 | 1 this is a shrewd and effective film from a director who understands how to create and sustain a mood 400 | 1 the bai brothers have taken an small slice of history and opened it up for all of us to understand and they ve told a nice little story in the process 401 | 1 knows how to make our imagination wonder 402 | 1 fear permeates the whole of stortelling todd solondz oftentimes funny yet 
ultimately cowardly autocritique 403 | 1 a cutesy romantic tale with a twist 404 | 0 violent vulgar and forgettably entertaining 405 | 1 though its story is only surface deep the visuals and enveloping sounds of blue crush make this surprisingly decent flick worth a summertime look see 406 | 1 sad to say it accurately reflects the rage and alienation that fuels the self destructiveness of many young people 407 | 0 an allegory concerning the chronically mixed signals african american professionals get about overachieving could be intriguing but the supernatural trappings only obscure the message 408 | 1 the film has the high buffed gloss and high octane jolts you expect of de palma but what makes it transporting is that it is also one of the smartest most pleasurable expressions of pure movie love to come from an american director in years 409 | 1 wonderful fencing scenes and an exciting plot make this an eminently engrossing film 410 | 1 if mostly martha is mostly unsurprising it is still a sweet even delectable diversion 411 | 1 one of the most slyly exquisite anti adult movies ever made 412 | 1 even when there are lulls the emotions seem authentic and the picture is so lovely toward the end you almost do nt notice the minute running time 413 | 0 comes across as a fairly weak retooling 414 | 0 time out is as serious as a pink slip 415 | 0 a depressingly retrograde post feminist romantic comedy that takes an astonishingly condescending attitude toward women 416 | 0 you might want to take a reality check before you pay the full ticket price to see simone and consider a dvd rental instead 417 | 1 young hanks and fisk who vaguely resemble their celebrity parents bring fresh good looks and an ease in front of the camera to the work 418 | 0 if you re looking for a story do nt bother 419 | 1 the film is hard to dismiss moody thoughtful and lit by flashes of mordant humor 420 | 1 a deeply felt and vividly detailed story about newcomers in a strange new world 421 | 0 an ugly pointless stupid movie 422 | 0 to honestly address the flaws inherent in how medical aid is made available to american workers a more balanced or fair portrayal of both sides will be needed 423 | 1 the very definition of the small movie but it is a good stepping stone for director sprecher 424 | 1 a solid cast assured direction and complete lack of modern day irony 425 | 0 burns never really harnesses to full effect the energetic cast 426 | 1 the difference between cho and most comics is that her confidence in her material is merited 427 | 1 like its bizarre heroine it irrigates our souls 428 | 1 in between the icy stunts the actors spout hilarious dialogue about following your dream and just letting the mountain tell you what to do 429 | 0 in an effort i suspect not to offend by appearing either too serious or too lighthearted it offends by just being wishy washy 430 | 0 it all comes down to whether you can tolerate leon barlow 431 | 0 starts promisingly but disintegrates into a dreary humorless soap opera 432 | 1 there is enough cool fun here to warm the hearts of animation enthusiasts of all ages 433 | 1 the vitality of the actors keeps the intensity of the film high even as the strafings blend together 434 | 1 a true blue delight 435 | 0 despite auteuil is performance it is a rather listless amble down the middle of the road where the thematic ironies are too obvious and the sexual politics too smug 436 | 1 well acted well directed and for all its moodiness not too pretentious 437 | 0 adrift bentley and hudson stare and 
sniffle respectively as ledger attempts in vain to prove that movie star intensity can overcome bad hair design 438 | 0 it is so downbeat and nearly humorless that it becomes a chore to sit through despite some first rate performances by its lead 439 | 0 you leave feeling like you ve endured a long workout without your pulse ever racing 440 | 1 a poignant artfully crafted meditation on mortality 441 | 0 there is a scientific law to be discerned here that producers would be well to heed mediocre movies start to drag as soon as the action speeds up when the explosions start they fall to pieces 442 | 1 a dreadful day in irish history is given passionate if somewhat flawed treatment 443 | 1 the pleasure of read my lips is like seeing a series of perfect black pearls clicking together to form a string 444 | 1 overcomes its visual hideousness with a sharp script and strong performances 445 | 0 just too silly and sophomoric to ensnare its target audience 446 | 1 if it is not entirely memorable the movie is certainly easy to watch 447 | 1 if you can push on through the slow spots you ll be rewarded with some fine acting 448 | 0 it is too bad that this likable movie is nt more accomplished 449 | 1 tadpole may be one of the most appealing movies ever made about an otherwise appalling and downright creepy subject a teenage boy in love with his stepmother 450 | 0 it just goes to show an intelligent person is nt necessarily an admirable storyteller 451 | 1 if ayurveda can help us return to a sane regimen of eating sleeping and stress reducing contemplation it is clearly a good thing 452 | 1 the story is a rather simplistic one grief drives her love drives him and a second chance to find love in the most unlikely place it struck a chord in me 453 | 0 plays like a glossy melodrama that occasionally verges on camp 454 | 0 as aimless as an old pickup skidding completely out of control on a long patch of black ice the movie makes two hours feel like four 455 | 1 a searing epic treatment of a nationwide blight that seems to be horrifyingly ever on the rise 456 | 0 what soured me on the santa clause was that santa bumps up against st century reality so hard it is icky 457 | 1 as quiet patient and tenacious as mr lopez himself who approaches his difficult endless work with remarkable serenity and discipline 458 | 0 shallow noisy and pretentious 459 | 1 a light yet engrossing piece 460 | 0 my only wish is that celebi could take me back to a time before i saw this movie and i could just skip it 461 | 0 it is one pussy ass world when even killer thrillers revolve around group therapy sessions 462 | 1 infidelity drama is nicely shot well edited and features a standout performance by diane lane 463 | 0 rarely has sex on screen been so aggressively anti erotic 464 | 1 there is enough originality in life to distance it from the pack of paint by number romantic comedies that so often end up on cinema screens 465 | 0 a movie that quite simply should nt have been made 466 | 1 further proof that the epicenter of cool beautiful thought provoking foreign cinema is smack dab in the middle of dubya is axis of evil 467 | 1 writer director david jacobson and his star jeremy renner have made a remarkable film that explores the monster is psychology not in order to excuse him but rather to demonstrate that his pathology evolved from human impulses that grew hideously twisted 468 | 1 will amuse and provoke adventurous adults in specialty venues 469 | 0 just a kiss is a just a waste 470 | 1 a muddle splashed with bloody beauty as 
vivid as any scorsese has ever given us 471 | 0 a zippy minutes of mediocre special effects hoary dialogue fluxing accents and worst of all silly looking morlocks 472 | 1 as a girl meets girl romantic comedy kissing jessica steinis quirky charming and often hilarious 473 | 1 the overall fabric is hypnotic and mr mattei fosters moments of spontaneous intimacy 474 | 0 men in black ii achieves ultimate insignificance it is the sci fi comedy spectacle as whiffle ball epic 475 | 0 at best cletis tout might inspire a trip to the video store in search of a better movie experience 476 | 0 nothing but an episode of smackdown 477 | 1 the stunt work is top notch the dialogue and drama often food spittingly funny 478 | 1 family fare 479 | 1 using a stock plot about a boy injects just enough freshness into the proceedings to provide an enjoyable minutes in a movie theater 480 | 0 in other words it is badder than bad 481 | 0 the movie is almost completely lacking in suspense surprise and consistent emotional conviction 482 | 1 another love story in is remarkable procession of sweeping pictures that have reinvigorated the romance genre 483 | 0 there is only one way to kill michael myers for good stop buying tickets to these movies 484 | 1 washington overcomes the script is flaws and envelops the audience in his character is anguish anger and frustration 485 | 0 so we got ten little indians meets friday the th by way of clean and sober filmed on the set of carpenter is the thing and loaded with actors you re most likely to find on the next inevitable incarnation of the love boat 486 | 0 confirms the nagging suspicion that ethan hawke would be even worse behind the camera than he is in front of it 487 | 0 one of the more glaring signs of this movie is servitude to its superstar is the way it skirts around any scenes that might have required genuine acting from ms spears 488 | 0 for all its shoot outs fistfights and car chases this movie is a phlegmatic bore so tedious it makes the silly spy vs spy film the sum of all fears starring ben affleck seem downright hitchcockian 489 | 0 the only fun part of the movie is playing the obvious game 490 | 0 plays like the old disease of the week small screen melodramas 491 | 0 the cumulative effect of the movie is repulsive and depressing 492 | 1 while we no longer possess the lack of attention span that we did at seventeen we had no trouble sitting for blade ii 493 | 1 a surprisingly sweet and gentle comedy 494 | 1 an elegant film with often surprising twists and an intermingling of naivet and sophistication 495 | 0 for each chuckle there are at least complete misses many coming from the amazingly lifelike tara reid whose acting skills are comparable to a cardboard cutout 496 | 1 polished well structured film 497 | 1 a movie that will surely be profane politically charged music to the ears of cho is fans 498 | 1 most consumers of lo mein and general tso is chicken barely give a thought to the folks who prepare and deliver it so hopefully this film will attach a human face to all those little steaming cartons 499 | 0 movies like high crimes flog the dead horse of surprise as if it were an obligation 500 | 0 a timid soggy near miss 501 | -------------------------------------------------------------------------------- /eda_figure.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/eda_figure.png 
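The labeled lines above use a plain label-then-sentence layout that the scripts later in this dump consume: a numeric class id (0/1 here), a tab, then a lowercased, pre-tokenized sentence. As a minimal, illustrative sketch (not a file from this repository; the path below is a placeholder), this is one way to read such a file and check its class balance, mirroring the `split('\t')` convention used by `get_x_y` and the `gen_*_aug` helpers in `experiments/methods.py`:

from collections import Counter

def read_labeled_sentences(path):
    # Each line is "<class id>\t<lowercased, tokenized sentence>", the same
    # layout that get_x_y() in experiments/methods.py splits on.
    pairs = []
    for line in open(path, 'r'):
        label, sentence = line.rstrip('\n').split('\t', 1)
        pairs.append((int(label), sentence))
    return pairs

if __name__ == "__main__":
    # 'data/train.txt' is a placeholder path, not a file from this repository.
    pairs = read_labeled_sentences('data/train.txt')
    print(Counter(label for label, _ in pairs))  # number of sentences per class id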
-------------------------------------------------------------------------------- /experiments/__pycache__/a_config.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/a_config.cpython-36.pyc -------------------------------------------------------------------------------- /experiments/__pycache__/a_config.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/a_config.cpython-37.pyc -------------------------------------------------------------------------------- /experiments/__pycache__/b_config.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/b_config.cpython-36.pyc -------------------------------------------------------------------------------- /experiments/__pycache__/c_config.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/c_config.cpython-36.pyc -------------------------------------------------------------------------------- /experiments/__pycache__/config.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/config.cpython-36.pyc -------------------------------------------------------------------------------- /experiments/__pycache__/e_config.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/e_config.cpython-36.pyc -------------------------------------------------------------------------------- /experiments/__pycache__/methods.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/methods.cpython-36.pyc -------------------------------------------------------------------------------- /experiments/__pycache__/methods.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/methods.cpython-37.pyc -------------------------------------------------------------------------------- /experiments/__pycache__/nlp_aug.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/nlp_aug.cpython-36.pyc -------------------------------------------------------------------------------- /experiments/a_1_data_process.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from a_config import * 3 | 4 | if __name__ == "__main__": 5 | 6 | #for each method 7 | for a_method in a_methods: 8 | 9 | #for each data size 10 | for 
size_folder in size_folders: 11 | 12 | n_aug_list = n_aug_list_dict[size_folder] 13 | dataset_folders = [size_folder + '/' + s for s in datasets] 14 | 15 | #for each dataset 16 | for i, dataset_folder in enumerate(dataset_folders): 17 | 18 | train_orig = dataset_folder + '/train_orig.txt' 19 | n_aug = n_aug_list[i] 20 | 21 | #for each alpha value 22 | for alpha in alphas: 23 | 24 | output_file = dataset_folder + '/train_' + a_method + '_' + str(alpha) + '.txt' 25 | 26 | #generate the augmented data 27 | if a_method == 'sr': 28 | gen_sr_aug(train_orig, output_file, alpha, n_aug) 29 | if a_method == 'ri': 30 | gen_ri_aug(train_orig, output_file, alpha, n_aug) 31 | if a_method == 'rd': 32 | gen_rd_aug(train_orig, output_file, alpha, n_aug) 33 | if a_method == 'rs': 34 | gen_rs_aug(train_orig, output_file, alpha, n_aug) 35 | 36 | #generate the vocab dictionary 37 | word2vec_pickle = dataset_folder + '/word2vec.p' 38 | gen_vocab_dicts(dataset_folder, word2vec_pickle, huge_word2vec) 39 | 40 | -------------------------------------------------------------------------------- /experiments/a_2_train_eval.py: -------------------------------------------------------------------------------- 1 | from a_config import * 2 | from methods import * 3 | from numpy.random import seed 4 | seed(5) 5 | 6 | ############################### 7 | #### run model and get acc #### 8 | ############################### 9 | 10 | def run_cnn(train_file, test_file, num_classes, percent_dataset): 11 | 12 | #initialize model 13 | model = build_cnn(input_size, word2vec_len, num_classes) 14 | 15 | #load data 16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset) 17 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1) 18 | 19 | #implement early stopping 20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)] 21 | 22 | #train model 23 | model.fit( train_x, 24 | train_y, 25 | epochs=100000, 26 | callbacks=callbacks, 27 | validation_split=0.1, 28 | batch_size=1024, 29 | shuffle=True, 30 | verbose=0) 31 | #model.save('checkpoints/lol') 32 | #model = load_model('checkpoints/lol') 33 | 34 | #evaluate model 35 | y_pred = model.predict(test_x) 36 | test_y_cat = one_hot_to_categorical(test_y) 37 | y_pred_cat = one_hot_to_categorical(y_pred) 38 | acc = accuracy_score(test_y_cat, y_pred_cat) 39 | 40 | #clean memory??? 
41 | train_x, train_y, test_x, test_y, model = None, None, None, None, None 42 | gc.collect() 43 | 44 | #return the accuracy 45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc) 46 | return acc 47 | 48 | ############################### 49 | ############ main ############# 50 | ############################### 51 | 52 | if __name__ == "__main__": 53 | 54 | #for each method 55 | for a_method in a_methods: 56 | 57 | writer = open('outputs_f1/' + a_method + '_' + get_now_str() + '.txt', 'w') 58 | 59 | #for each size dataset 60 | for size_folder in size_folders: 61 | 62 | writer.write(size_folder + '\n') 63 | 64 | #get all six datasets 65 | dataset_folders = [size_folder + '/' + s for s in datasets] 66 | 67 | #for storing the performances 68 | performances = {alpha:[] for alpha in alphas} 69 | 70 | #for each dataset 71 | for i in range(len(dataset_folders)): 72 | 73 | #initialize all the variables 74 | dataset_folder = dataset_folders[i] 75 | dataset = datasets[i] 76 | num_classes = num_classes_list[i] 77 | input_size = input_size_list[i] 78 | word2vec_pickle = dataset_folder + '/word2vec.p' 79 | word2vec = load_pickle(word2vec_pickle) 80 | 81 | #test each alpha value 82 | for alpha in alphas: 83 | 84 | train_path = dataset_folder + '/train_' + a_method + '_' + str(alpha) + '.txt' 85 | test_path = 'size_data_f1/test/' + dataset + '/test.txt' 86 | acc = run_cnn(train_path, test_path, num_classes, percent_dataset=1) 87 | performances[alpha].append(acc) 88 | 89 | writer.write(str(performances) + '\n') 90 | for alpha in performances: 91 | line = str(alpha) + ' : ' + str(sum(performances[alpha])/len(performances[alpha])) 92 | writer.write(line + '\n') 93 | print(line) 94 | print(performances) 95 | 96 | writer.close() 97 | -------------------------------------------------------------------------------- /experiments/a_config.py: -------------------------------------------------------------------------------- 1 | #user inputs 2 | 3 | #size folders 4 | sizes = ['1_tiny', '2_small', '3_standard', '4_full'] 5 | size_folders = ['size_data_f1/' + size for size in sizes] 6 | 7 | #augmentation methods 8 | a_methods = ['sr', 'ri', 'rd', 'rs'] 9 | 10 | #dataset folder 11 | datasets = ['cr', 'sst2', 'subj', 'trec', 'pc'] 12 | 13 | #number of output classes 14 | num_classes_list = [2, 2, 2, 6, 2] 15 | 16 | #number of augmentations 17 | n_aug_list_dict = {'size_data_f1/1_tiny': [16, 16, 16, 16, 16], 18 | 'size_data_f1/2_small': [16, 16, 16, 16, 16], 19 | 'size_data_f1/3_standard': [8, 8, 8, 8, 4], 20 | 'size_data_f1/4_full': [8, 8, 8, 8, 4]} 21 | 22 | #alpha values we care about 23 | alphas = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5] 24 | 25 | #number of words for input 26 | input_size_list = [50, 50, 40, 25, 25] 27 | 28 | #word2vec dictionary 29 | huge_word2vec = 'word2vec/glove.840B.300d.txt' 30 | word2vec_len = 300 # don't want to load the huge pickle every time, so just save the words that are actually used into a smaller dictionary 31 | -------------------------------------------------------------------------------- /experiments/b_1_data_process.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from b_config import * 3 | 4 | if __name__ == "__main__": 5 | 6 | #generate the augmented data sets 7 | for dataset_folder in dataset_folders: 8 | 9 | #pre-existing file locations 10 | train_orig = dataset_folder + '/train_orig.txt' 11 | 12 | #file to be created 13 | 
train_aug_st = dataset_folder + '/train_aug_st.txt' 14 | 15 | #standard augmentation 16 | gen_standard_aug(train_orig, train_aug_st) 17 | 18 | #generate the vocab dictionary 19 | word2vec_pickle = dataset_folder + '/word2vec.p' # don't want to load the huge pickle every time, so just save the words that are actually used into a smaller dictionary 20 | gen_vocab_dicts(dataset_folder, word2vec_pickle, huge_word2vec) 21 | -------------------------------------------------------------------------------- /experiments/b_2_train_eval.py: -------------------------------------------------------------------------------- 1 | from b_config import * 2 | from methods import * 3 | from numpy.random import seed 4 | seed(0) 5 | 6 | ############################### 7 | #### run model and get acc #### 8 | ############################### 9 | 10 | def run_model(train_file, test_file, num_classes, percent_dataset): 11 | 12 | #initialize model 13 | model = build_model(input_size, word2vec_len, num_classes) 14 | 15 | #load data 16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset) 17 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1) 18 | 19 | #implement early stopping 20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)] 21 | 22 | #train model 23 | model.fit( train_x, 24 | train_y, 25 | epochs=100000, 26 | callbacks=callbacks, 27 | validation_split=0.1, 28 | batch_size=1024, 29 | shuffle=True, 30 | verbose=0) 31 | #model.save('checkpoints/lol') 32 | #model = load_model('checkpoints/lol') 33 | 34 | #evaluate model 35 | y_pred = model.predict(test_x) 36 | test_y_cat = one_hot_to_categorical(test_y) 37 | y_pred_cat = one_hot_to_categorical(y_pred) 38 | acc = accuracy_score(test_y_cat, y_pred_cat) 39 | 40 | #clean memory??? 
41 | train_x, train_y = None, None 42 | gc.collect() 43 | 44 | #return the accuracy 45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc) 46 | return acc 47 | 48 | if __name__ == "__main__": 49 | 50 | #get the accuracy at each increment 51 | orig_accs = {dataset:{} for dataset in datasets} 52 | aug_accs = {dataset:{} for dataset in datasets} 53 | 54 | writer = open('outputs_f2/' + get_now_str() + '.csv', 'w') 55 | 56 | #for each dataset 57 | for i, dataset_folder in enumerate(dataset_folders): 58 | 59 | dataset = datasets[i] 60 | num_classes = num_classes_list[i] 61 | input_size = input_size_list[i] 62 | train_orig = dataset_folder + '/train_orig.txt' 63 | train_aug_st = dataset_folder + '/train_aug_st.txt' 64 | test_path = dataset_folder + '/test.txt' 65 | word2vec_pickle = dataset_folder + '/word2vec.p' 66 | word2vec = load_pickle(word2vec_pickle) 67 | 68 | for increment in increments: 69 | 70 | #calculate augmented accuracy 71 | aug_acc = run_model(train_aug_st, test_path, num_classes, increment) 72 | aug_accs[dataset][increment] = aug_acc 73 | 74 | #calculate original accuracy 75 | orig_acc = run_model(train_orig, test_path, num_classes, increment) 76 | orig_accs[dataset][increment] = orig_acc 77 | 78 | print(dataset, increment, orig_acc, aug_acc) 79 | writer.write(dataset + ',' + str(increment) + ',' + str(orig_acc) + ',' + str(aug_acc) + '\n') 80 | 81 | gc.collect() 82 | 83 | print(orig_accs, aug_accs) 84 | -------------------------------------------------------------------------------- /experiments/b_config.py: -------------------------------------------------------------------------------- 1 | #user inputs 2 | 3 | #dataset folder 4 | datasets = ['pc']#['cr', 'sst2', 'subj', 'trec', 'pc'] 5 | dataset_folders = ['increment_datasets_f2/' + dataset for dataset in datasets] 6 | 7 | #number of output classes 8 | num_classes_list = [2]#[2, 2, 2, 6, 2] 9 | 10 | #dataset increments 11 | increments = [0.7, 0.8, 0.9, 1]#[0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1] 12 | 13 | #number of words for input 14 | input_size_list = [25]#[50, 50, 40, 25, 25] 15 | 16 | #word2vec dictionary 17 | huge_word2vec = 'word2vec/glove.840B.300d.txt' 18 | word2vec_len = 300 -------------------------------------------------------------------------------- /experiments/c_1_data_process.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from c_config import * 3 | 4 | if __name__ == "__main__": 5 | 6 | #generate the augmented data sets 7 | 8 | for size_folder in size_folders: 9 | 10 | dataset_folders = [size_folder + '/' + s for s in datasets] 11 | 12 | #for each dataset 13 | for dataset_folder in dataset_folders: 14 | train_orig = dataset_folder + '/train_orig.txt' 15 | 16 | #for each n_aug value 17 | for num_aug in num_aug_list: 18 | 19 | output_file = dataset_folder + '/train_' + str(num_aug) + '.txt' 20 | 21 | #generate the augmented data 22 | if num_aug > 4 and '4_full/pc' in train_orig: 23 | gen_standard_aug(train_orig, output_file, num_aug=4) 24 | else: 25 | gen_standard_aug(train_orig, output_file, num_aug=num_aug) 26 | 27 | #generate the vocab dictionary 28 | word2vec_pickle = dataset_folder + '/word2vec.p' 29 | gen_vocab_dicts(dataset_folder, word2vec_pickle, huge_word2vec) 30 | 31 | -------------------------------------------------------------------------------- /experiments/c_2_train_eval.py: 
-------------------------------------------------------------------------------- 1 | from c_config import * 2 | from methods import * 3 | from numpy.random import seed 4 | seed(5) 5 | 6 | ############################### 7 | #### run model and get acc #### 8 | ############################### 9 | 10 | def run_cnn(train_file, test_file, num_classes, percent_dataset): 11 | 12 | #initialize model 13 | model = build_cnn(input_size, word2vec_len, num_classes) 14 | 15 | #load data 16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset) 17 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1) 18 | 19 | #implement early stopping 20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)] 21 | 22 | #train model 23 | model.fit( train_x, 24 | train_y, 25 | epochs=100000, 26 | callbacks=callbacks, 27 | validation_split=0.1, 28 | batch_size=1024, 29 | shuffle=True, 30 | verbose=0) 31 | #model.save('checkpoints/lol') 32 | #model = load_model('checkpoints/lol') 33 | 34 | #evaluate model 35 | y_pred = model.predict(test_x) 36 | test_y_cat = one_hot_to_categorical(test_y) 37 | y_pred_cat = one_hot_to_categorical(y_pred) 38 | acc = accuracy_score(test_y_cat, y_pred_cat) 39 | 40 | #clean memory??? 41 | train_x, train_y = None, None 42 | gc.collect() 43 | 44 | #return the accuracy 45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc) 46 | return acc 47 | 48 | ############################### 49 | ############ main ############# 50 | ############################### 51 | 52 | if __name__ == "__main__": 53 | 54 | for see in range(5): 55 | 56 | seed(see) 57 | print('seed:', see) 58 | 59 | writer = open('outputs_f3/' + get_now_str() + '.txt', 'w') 60 | 61 | #for each size dataset 62 | for size_folder in size_folders: 63 | 64 | writer.write(size_folder + '\n') 65 | 66 | #get all six datasets 67 | dataset_folders = [size_folder + '/' + s for s in datasets] 68 | 69 | #for storing the performances 70 | performances = {num_aug:[] for num_aug in num_aug_list} 71 | 72 | #for each dataset 73 | for i in range(len(dataset_folders)): 74 | 75 | #initialize all the variables 76 | dataset_folder = dataset_folders[i] 77 | dataset = datasets[i] 78 | num_classes = num_classes_list[i] 79 | input_size = input_size_list[i] 80 | word2vec_pickle = dataset_folder + '/word2vec.p' 81 | word2vec = load_pickle(word2vec_pickle) 82 | 83 | #test each num_aug value 84 | for num_aug in num_aug_list: 85 | 86 | train_path = dataset_folder + '/train_' + str(num_aug) + '.txt' 87 | test_path = 'size_data_f3/test/' + dataset + '/test.txt' 88 | acc = run_cnn(train_path, test_path, num_classes, percent_dataset=1) 89 | performances[num_aug].append(acc) 90 | writer.write(train_path + ',' + str(acc)) 91 | 92 | writer.write(str(performances) + '\n') 93 | print() 94 | for num_aug in performances: 95 | line = str(num_aug) + ' : ' + str(sum(performances[num_aug])/len(performances[num_aug])) 96 | writer.write(line + '\n') 97 | print(line) 98 | print(performances) 99 | 100 | writer.close() 101 | -------------------------------------------------------------------------------- /experiments/c_config.py: -------------------------------------------------------------------------------- 1 | #user inputs 2 | 3 | #size folders 4 | sizes = ['3_standard']#, '4_full']#['1_tiny', '2_small', '3_standard', '4_full'] 5 | size_folders = ['size_data_f3/' + size for size in sizes] 6 | 7 | 
#dataset folder 8 | datasets = ['cr', 'sst2', 'subj', 'trec', 'pc'] 9 | 10 | #number of output classes 11 | num_classes_list = [2, 2, 2, 6, 2] 12 | 13 | #alpha values we care about 14 | num_aug_list = [0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32] 15 | 16 | #number of words for input 17 | input_size_list = [50, 50, 50, 25, 25] 18 | 19 | #word2vec dictionary 20 | huge_word2vec = 'word2vec/glove.840B.300d.txt' 21 | word2vec_len = 300 # don't want to load the huge pickle every time, so just save the words that are actually used into a smaller dictionary 22 | -------------------------------------------------------------------------------- /experiments/d_0_preprocess.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | 3 | def generate_short(input_file, output_file, alpha): 4 | lines = open(input_file, 'r').readlines() 5 | increment = int(len(lines)/alpha) 6 | lines = lines[::increment] 7 | writer = open(output_file, 'w') 8 | for line in lines: 9 | writer.write(line) 10 | 11 | if __name__ == "__main__": 12 | 13 | #global params 14 | huge_word2vec = 'word2vec/glove.840B.300d.txt' 15 | datasets = ['pc']#, 'trec'] 16 | 17 | for dataset in datasets: 18 | 19 | dataset_folder = 'special_f4/' + dataset 20 | test_short = 'special_f4/' + dataset + '/test_short.txt' 21 | test_aug_short = dataset_folder + '/test_short_aug.txt' 22 | word2vec_pickle = dataset_folder + '/word2vec.p' 23 | 24 | #augment the data 25 | gen_tsne_aug(test_short, test_aug_short) 26 | 27 | #generate the vocab dictionaries 28 | gen_vocab_dicts(dataset_folder, word2vec_pickle, huge_word2vec) 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | -------------------------------------------------------------------------------- /experiments/d_1_train_models.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from numpy.random import seed 3 | seed(0) 4 | 5 | ############################### 6 | #### run model and get acc #### 7 | ############################### 8 | 9 | def run_model(train_file, test_file, num_classes, model_output_path): 10 | 11 | #initialize model 12 | model = build_model(input_size, word2vec_len, num_classes) 13 | 14 | #load data 15 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, 1) 16 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1) 17 | 18 | #implement early stopping 19 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)] 20 | 21 | #train model 22 | model.fit( train_x, 23 | train_y, 24 | epochs=100000, 25 | callbacks=callbacks, 26 | validation_split=0.1, 27 | batch_size=1024, 28 | shuffle=True, 29 | verbose=0) 30 | 31 | #save the model 32 | model.save(model_output_path) 33 | #model = load_model('checkpoints/lol') 34 | 35 | #evaluate model 36 | y_pred = model.predict(test_x) 37 | test_y_cat = one_hot_to_categorical(test_y) 38 | y_pred_cat = one_hot_to_categorical(y_pred) 39 | acc = accuracy_score(test_y_cat, y_pred_cat) 40 | 41 | #clean memory??? 
42 | train_x, train_y = None, None 43 | 44 | #return the accuracy 45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc) 46 | return acc 47 | 48 | if __name__ == "__main__": 49 | 50 | #parameters 51 | dataset_folders = ['increment_datasets_f2/trec', 'increment_datasets_f2/pc'] 52 | output_paths = ['outputs_f4/trec_aug.h5', 'outputs_f4/pc_aug.h5'] 53 | num_classes_list = [6, 2] 54 | input_size_list = [25, 25] 55 | 56 | #word2vec dictionary 57 | word2vec_len = 300 58 | 59 | for i, dataset_folder in enumerate(dataset_folders): 60 | 61 | num_classes = num_classes_list[i] 62 | input_size = input_size_list[i] 63 | output_path = output_paths[i] 64 | train_orig = dataset_folder + '/train_aug_st.txt' 65 | test_path = dataset_folder + '/test.txt' 66 | word2vec_pickle = dataset_folder + '/word2vec.p' 67 | word2vec = load_pickle(word2vec_pickle) 68 | 69 | #train model and save 70 | acc = run_model(train_orig, test_path, num_classes, output_path) 71 | print(dataset_folder, acc) -------------------------------------------------------------------------------- /experiments/d_2_tsne.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from numpy.random import seed 3 | from keras import backend as K 4 | from sklearn.manifold import TSNE 5 | import matplotlib.pyplot as plt 6 | seed(0) 7 | 8 | ################################ 9 | #### get dense layer output #### 10 | ################################ 11 | 12 | #getting the x and y inputs in numpy array form from the text file 13 | def train_x(train_txt, word2vec_len, input_size, word2vec): 14 | 15 | #read in lines 16 | train_lines = open(train_txt, 'r').readlines() 17 | num_lines = len(train_lines) 18 | 19 | x_matrix = np.zeros((num_lines, input_size, word2vec_len)) 20 | 21 | #insert values 22 | for i, line in enumerate(train_lines): 23 | 24 | parts = line[:-1].split('\t') 25 | label = int(parts[0]) 26 | sentence = parts[1] 27 | 28 | #insert x 29 | words = sentence.split(' ') 30 | words = words[:x_matrix.shape[1]] #cut off if too long 31 | for j, word in enumerate(words): 32 | if word in word2vec: 33 | x_matrix[i, j, :] = word2vec[word] 34 | 35 | return x_matrix 36 | 37 | def get_dense_output(model_checkpoint, file, num_classes): 38 | 39 | x = train_x(file, word2vec_len, input_size, word2vec) 40 | 41 | model = load_model(model_checkpoint) 42 | 43 | get_3rd_layer_output = K.function([model.layers[0].input], [model.layers[4].output]) 44 | layer_output = get_3rd_layer_output([x])[0] 45 | 46 | return layer_output 47 | 48 | def get_tsne_labels(file): 49 | labels = [] 50 | alphas = [] 51 | lines = open(file, 'r').readlines() 52 | for i, line in enumerate(lines): 53 | parts = line[:-1].split('\t') 54 | _class = int(parts[0]) 55 | alpha = i % 10 56 | labels.append(_class) 57 | alphas.append(alpha) 58 | return labels, alphas 59 | 60 | def get_plot_vectors(layer_output): 61 | 62 | tsne = TSNE(n_components=2).fit_transform(layer_output) 63 | return tsne 64 | 65 | def plot_tsne(tsne, labels, output_path): 66 | 67 | label_to_legend_label = { 'outputs_f4/pc_tsne.png':{ 0:'Con (augmented)', 68 | 100:'Con (original)', 69 | 1: 'Pro (augmented)', 70 | 101:'Pro (original)'}, 71 | 'outputs_f4/trec_tsne.png':{0:'Description (augmented)', 72 | 100:'Description (original)', 73 | 1:'Entity (augmented)', 74 | 101:'Entity (original)', 75 | 2:'Abbreviation (augmented)', 76 | 102:'Abbreviation (original)', 77 | 3:'Human 
(augmented)', 78 | 103:'Human (original)', 79 | 4:'Location (augmented)', 80 | 104:'Location (original)', 81 | 5:'Number (augmented)', 82 | 105:'Number (original)'}} 83 | 84 | plot_to_legend_size = {'outputs_f4/pc_tsne.png':11, 'outputs_f4/trec_tsne.png':6} 85 | 86 | labels = labels.tolist() 87 | big_groups = [label for label in labels if label < 100] 88 | big_groups = list(sorted(set(big_groups))) 89 | 90 | colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k', '#ff1493', '#FF4500'] 91 | fig, ax = plt.subplots() 92 | 93 | for big_group in big_groups: 94 | 95 | for group in [big_group, big_group+100]: 96 | 97 | x, y = [], [] 98 | 99 | for j, label in enumerate(labels): 100 | if label == group: 101 | x.append(tsne[j][0]) 102 | y.append(tsne[j][1]) 103 | 104 | #params 105 | color = colors[int(group % 100)] 106 | marker = 'x' if group < 100 else 'o' 107 | size = 1 if group < 100 else 27 108 | legend_label = label_to_legend_label[output_path][group] 109 | 110 | ax.scatter(x, y, color=color, marker=marker, s=size, label=legend_label) 111 | plt.axis('off') 112 | 113 | legend_size = plot_to_legend_size[output_path] 114 | plt.legend(prop={'size': legend_size}) 115 | plt.savefig(output_path, dpi=1000) 116 | plt.clf() 117 | 118 | if __name__ == "__main__": 119 | 120 | #global variables 121 | word2vec_len = 300 122 | input_size = 25 123 | 124 | datasets = ['pc'] #['pc', 'trec'] 125 | num_classes_list =[2] #[2, 6] 126 | 127 | for i, dataset in enumerate(datasets): 128 | 129 | #load parameters 130 | model_checkpoint = 'outputs_f4/' + dataset + '.h5' 131 | file = 'special_f4/' + dataset + '/test_short_aug.txt' 132 | num_classes = num_classes_list[i] 133 | word2vec_pickle = 'special_f4/' + dataset + '/word2vec.p' 134 | word2vec = load_pickle(word2vec_pickle) 135 | 136 | #do tsne 137 | layer_output = get_dense_output(model_checkpoint, file, num_classes) 138 | print(layer_output.shape) 139 | t = get_plot_vectors(layer_output) 140 | 141 | labels, alphas = get_tsne_labels(file) 142 | 143 | print(labels, alphas) 144 | 145 | writer = open("outputs_f4/new_tsne.txt", 'w') 146 | 147 | label_to_mark = {0:'x', 1:'o'} 148 | 149 | for i, label in enumerate(labels): 150 | alpha = alphas[i] 151 | line = str(t[i, 0]) + ' ' + str(t[i, 1]) + ' ' + str(label_to_mark[label]) + ' ' + str(alpha/10) 152 | writer.write(line + '\n') 153 | 154 | 155 | -------------------------------------------------------------------------------- /experiments/d_neg_1_balance_trec.py: -------------------------------------------------------------------------------- 1 | lines = open('special_f4/trec/test_orig.txt', 'r').readlines() 2 | 3 | label_to_lines = {x:[] for x in range(0, 6)} 4 | 5 | for line in lines: 6 | label = int(line[0]) 7 | label_to_lines[label].append(line) 8 | 9 | for label in label_to_lines: 10 | print(label, len(label_to_lines[label])) -------------------------------------------------------------------------------- /experiments/e_1_data_process.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from e_config import * 3 | 4 | if __name__ == "__main__": 5 | 6 | for size_folder in size_folders: 7 | 8 | dataset_folders = [size_folder + '/' + s for s in datasets] 9 | n_aug_list = n_aug_list_dict[size_folder] 10 | 11 | #for each dataset 12 | for i, dataset_folder in enumerate(dataset_folders): 13 | 14 | n_aug = n_aug_list[i] 15 | 16 | #pre-existing file locations 17 | train_orig = dataset_folder + '/train_orig.txt' 18 | 19 | #file to be created 20 | train_aug_st = dataset_folder + 
'/train_aug_st.txt' 21 | 22 | #standard augmentation 23 | gen_standard_aug(train_orig, train_aug_st, n_aug) 24 | 25 | #generate the vocab dictionary 26 | word2vec_pickle = dataset_folder + '/word2vec.p' 27 | gen_vocab_dicts(dataset_folder, word2vec_pickle, huge_word2vec) 28 | 29 | -------------------------------------------------------------------------------- /experiments/e_2_cnn_aug.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from numpy.random import seed 3 | seed(0) 4 | from e_config import * 5 | 6 | ############################### 7 | #### run model and get acc #### 8 | ############################### 9 | 10 | def run_cnn(train_file, test_file, num_classes, input_size, percent_dataset, word2vec): 11 | 12 | #initialize model 13 | model = build_cnn(input_size, word2vec_len, num_classes) 14 | 15 | #load data 16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset) 17 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1) 18 | 19 | #implement early stopping 20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)] 21 | 22 | #train model 23 | model.fit( train_x, 24 | train_y, 25 | epochs=100000, 26 | callbacks=callbacks, 27 | validation_split=0.1, 28 | batch_size=1024, 29 | shuffle=True, 30 | verbose=0) 31 | #model.save('checkpoints/lol') 32 | #model = load_model('checkpoints/lol') 33 | 34 | #evaluate model 35 | y_pred = model.predict(test_x) 36 | test_y_cat = one_hot_to_categorical(test_y) 37 | y_pred_cat = one_hot_to_categorical(y_pred) 38 | acc = accuracy_score(test_y_cat, y_pred_cat) 39 | 40 | #clean memory??? 41 | train_x, train_y, model = None, None, None 42 | gc.collect() 43 | 44 | #return the accuracy 45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc) 46 | return acc 47 | 48 | ############################### 49 | ### get baseline accuracies ### 50 | ############################### 51 | 52 | def compute_baselines(writer): 53 | 54 | #baseline computation 55 | for size_folder in size_folders: 56 | 57 | #get all six datasets 58 | dataset_folders = [size_folder + '/' + s for s in datasets] 59 | performances = [] 60 | 61 | #for each dataset 62 | for i in range(len(dataset_folders)): 63 | 64 | #initialize all the variables 65 | dataset_folder = dataset_folders[i] 66 | dataset = datasets[i] 67 | num_classes = num_classes_list[i] 68 | input_size = input_size_list[i] 69 | word2vec_pickle = dataset_folder + '/word2vec.p' 70 | word2vec = load_pickle(word2vec_pickle) 71 | 72 | train_path = dataset_folder + '/train_aug_st.txt' 73 | test_path = 'size_data_t1/test/' + dataset + '/test.txt' 74 | acc = run_cnn(train_path, test_path, num_classes, input_size, 1, word2vec) 75 | performances.append(str(acc)) 76 | 77 | line = ','.join(performances) 78 | print(line) 79 | writer.write(line+'\n') 80 | 81 | ############################### 82 | ############ main ############# 83 | ############################### 84 | 85 | if __name__ == "__main__": 86 | 87 | writer = open('baseline_cnn/' + get_now_str() + '.csv', 'w') 88 | 89 | for i in range(0, 10): 90 | 91 | seed(i) 92 | print(i) 93 | compute_baselines(writer) 94 | -------------------------------------------------------------------------------- /experiments/e_2_cnn_baselines.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from 
numpy.random import seed 3 | seed(0) 4 | from e_config import * 5 | 6 | ############################### 7 | #### run model and get acc #### 8 | ############################### 9 | 10 | def run_model(train_file, test_file, num_classes, input_size, percent_dataset, word2vec): 11 | 12 | #initialize model 13 | model = build_model(input_size, word2vec_len, num_classes) 14 | 15 | #load data 16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset) 17 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1) 18 | 19 | #implement early stopping 20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)] 21 | 22 | #train model 23 | model.fit( train_x, 24 | train_y, 25 | epochs=100000, 26 | callbacks=callbacks, 27 | validation_split=0.1, 28 | batch_size=1024, 29 | shuffle=True, 30 | verbose=0) 31 | #model.save('checkpoints/lol') 32 | #model = load_model('checkpoints/lol') 33 | 34 | #evaluate model 35 | y_pred = model.predict(test_x) 36 | test_y_cat = one_hot_to_categorical(test_y) 37 | y_pred_cat = one_hot_to_categorical(y_pred) 38 | acc = accuracy_score(test_y_cat, y_pred_cat) 39 | 40 | #clean memory??? 41 | train_x, train_y = None, None 42 | gc.collect() 43 | 44 | #return the accuracy 45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc) 46 | return acc 47 | 48 | ############################### 49 | ### get baseline accuracies ### 50 | ############################### 51 | 52 | def compute_baselines(writer): 53 | 54 | #baseline computation 55 | for size_folder in size_folders: 56 | 57 | #get all six datasets 58 | dataset_folders = [size_folder + '/' + s for s in datasets] 59 | performances = [] 60 | 61 | #for each dataset 62 | for i in range(len(dataset_folders)): 63 | 64 | #initialize all the variables 65 | dataset_folder = dataset_folders[i] 66 | dataset = datasets[i] 67 | num_classes = num_classes_list[i] 68 | input_size = input_size_list[i] 69 | word2vec_pickle = dataset_folder + '/word2vec.p' 70 | word2vec = load_pickle(word2vec_pickle) 71 | 72 | train_path = dataset_folder + '/train_orig.txt' 73 | test_path = 'size_data_t1/test/' + dataset + '/test.txt' 74 | acc = run_model(train_path, test_path, num_classes, input_size, 1, word2vec) 75 | performances.append(str(acc)) 76 | 77 | line = ','.join(performances) 78 | print(line) 79 | writer.write(line+'\n') 80 | 81 | ############################### 82 | ############ main ############# 83 | ############################### 84 | 85 | if __name__ == "__main__": 86 | 87 | writer = open('baseline_rnn/' + get_now_str() + '.csv', 'w') 88 | 89 | for i in range(10, 24): 90 | 91 | seed(i) 92 | print(i) 93 | compute_baselines(writer) 94 | -------------------------------------------------------------------------------- /experiments/e_2_rnn_aug.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from numpy.random import seed 3 | seed(0) 4 | from e_config import * 5 | 6 | ############################### 7 | #### run model and get acc #### 8 | ############################### 9 | 10 | def run_model(train_file, test_file, num_classes, input_size, percent_dataset, word2vec): 11 | 12 | #initialize model 13 | model = build_model(input_size, word2vec_len, num_classes) 14 | 15 | #load data 16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset) 17 | test_x, test_y = 
get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1) 18 | 19 | #implement early stopping 20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)] 21 | 22 | #train model 23 | model.fit( train_x, 24 | train_y, 25 | epochs=100000, 26 | callbacks=callbacks, 27 | validation_split=0.1, 28 | batch_size=1024, 29 | shuffle=True, 30 | verbose=0) 31 | #model.save('checkpoints/lol') 32 | #model = load_model('checkpoints/lol') 33 | 34 | #evaluate model 35 | y_pred = model.predict(test_x) 36 | test_y_cat = one_hot_to_categorical(test_y) 37 | y_pred_cat = one_hot_to_categorical(y_pred) 38 | acc = accuracy_score(test_y_cat, y_pred_cat) 39 | 40 | #clean memory??? 41 | train_x, train_y, model = None, None, None 42 | gc.collect() 43 | 44 | #return the accuracy 45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc) 46 | return acc 47 | 48 | ############################### 49 | ### get baseline accuracies ### 50 | ############################### 51 | 52 | def compute_baselines(writer): 53 | 54 | #baseline computation 55 | for size_folder in size_folders: 56 | 57 | #get all six datasets 58 | dataset_folders = [size_folder + '/' + s for s in datasets] 59 | performances = [] 60 | 61 | #for each dataset 62 | for i in range(len(dataset_folders)): 63 | 64 | #initialize all the variables 65 | dataset_folder = dataset_folders[i] 66 | dataset = datasets[i] 67 | num_classes = num_classes_list[i] 68 | input_size = input_size_list[i] 69 | word2vec_pickle = dataset_folder + '/word2vec.p' 70 | word2vec = load_pickle(word2vec_pickle) 71 | 72 | train_path = dataset_folder + '/train_aug_st.txt' 73 | test_path = 'size_data_t1/test/' + dataset + '/test.txt' 74 | acc = run_model(train_path, test_path, num_classes, input_size, 1, word2vec) 75 | performances.append(str(acc)) 76 | 77 | line = ','.join(performances) 78 | print(line) 79 | writer.write(line+'\n') 80 | 81 | ############################### 82 | ############ main ############# 83 | ############################### 84 | 85 | if __name__ == "__main__": 86 | 87 | writer = open('baseline_rnn/' + get_now_str() + '.csv', 'w') 88 | 89 | for i in range(0, 10): 90 | 91 | seed(i) 92 | print(i) 93 | compute_baselines(writer) 94 | -------------------------------------------------------------------------------- /experiments/e_2_rnn_baselines.py: -------------------------------------------------------------------------------- 1 | from methods import * 2 | from numpy.random import seed 3 | seed(0) 4 | from e_config import * 5 | 6 | ############################### 7 | #### run model and get acc #### 8 | ############################### 9 | 10 | def run_model(train_file, test_file, num_classes, input_size, percent_dataset, word2vec): 11 | 12 | #initialize model 13 | model = build_model(input_size, word2vec_len, num_classes) 14 | 15 | #load data 16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset) 17 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1) 18 | 19 | #implement early stopping 20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)] 21 | 22 | #train model 23 | model.fit( train_x, 24 | train_y, 25 | epochs=100000, 26 | callbacks=callbacks, 27 | validation_split=0.1, 28 | batch_size=1024, 29 | shuffle=True, 30 | verbose=0) 31 | #model.save('checkpoints/lol') 32 | #model = load_model('checkpoints/lol') 33 | 34 | #evaluate model 35 | y_pred = 
model.predict(test_x) 36 | test_y_cat = one_hot_to_categorical(test_y) 37 | y_pred_cat = one_hot_to_categorical(y_pred) 38 | acc = accuracy_score(test_y_cat, y_pred_cat) 39 | 40 | #clean memory??? 41 | train_x, train_y = None, None 42 | gc.collect() 43 | 44 | #return the accuracy 45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc) 46 | return acc 47 | 48 | ############################### 49 | ### get baseline accuracies ### 50 | ############################### 51 | 52 | def compute_baselines(writer): 53 | 54 | #baseline computation 55 | for size_folder in size_folders: 56 | 57 | #get all six datasets 58 | dataset_folders = [size_folder + '/' + s for s in datasets] 59 | performances = [] 60 | 61 | #for each dataset 62 | for i in range(len(dataset_folders)): 63 | 64 | #initialize all the variables 65 | dataset_folder = dataset_folders[i] 66 | dataset = datasets[i] 67 | num_classes = num_classes_list[i] 68 | input_size = input_size_list[i] 69 | word2vec_pickle = dataset_folder + '/word2vec.p' 70 | word2vec = load_pickle(word2vec_pickle) 71 | 72 | train_path = dataset_folder + '/train_orig.txt' 73 | test_path = 'size_data_t1/test/' + dataset + '/test.txt' 74 | acc = run_model(train_path, test_path, num_classes, input_size, 1, word2vec) 75 | performances.append(str(acc)) 76 | 77 | line = ','.join(performances) 78 | print(line) 79 | writer.write(line+'\n') 80 | 81 | ############################### 82 | ############ main ############# 83 | ############################### 84 | 85 | if __name__ == "__main__": 86 | 87 | writer = open('baseline_rnn/' + get_now_str() + '.csv', 'w') 88 | 89 | for i in range(10, 24): 90 | 91 | seed(i) 92 | print(i) 93 | compute_baselines(writer) 94 | -------------------------------------------------------------------------------- /experiments/e_config.py: -------------------------------------------------------------------------------- 1 | #user inputs 2 | 3 | #load hyperparameters 4 | sizes = ['4_full']#['1_tiny', '2_small', '3_standard', '4_full'] 5 | size_folders = ['size_data_t1/' + size for size in sizes] 6 | 7 | #datasets 8 | datasets = ['cr', 'sst2', 'subj', 'trec', 'pc'] 9 | 10 | #number of output classes 11 | num_classes_list = [2, 2, 2, 6, 2] 12 | 13 | #number of augmentations per original sentence 14 | n_aug_list_dict = {'size_data_t1/1_tiny': [32, 32, 32, 32, 32], 15 | 'size_data_t1/2_small': [32, 32, 32, 32, 32], 16 | 'size_data_t1/3_standard': [16, 16, 16, 16, 4], 17 | 'size_data_t1/4_full': [16, 16, 16, 16, 4]} 18 | 19 | #number of words for input 20 | input_size_list = [50, 50, 40, 25, 25] 21 | 22 | #word2vec dictionary 23 | huge_word2vec = 'word2vec/glove.840B.300d.txt' 24 | word2vec_len = 300 -------------------------------------------------------------------------------- /experiments/methods.py: -------------------------------------------------------------------------------- 1 | from keras.layers.core import Dense, Activation, Dropout 2 | from keras.layers.recurrent import LSTM 3 | from keras.layers import Bidirectional 4 | import keras.layers as layers 5 | from keras.models import Sequential 6 | from keras.models import load_model 7 | from keras.callbacks import EarlyStopping 8 | 9 | from sklearn.utils import shuffle 10 | from sklearn.metrics import accuracy_score 11 | 12 | import math 13 | import time 14 | import numpy as np 15 | import random 16 | from random import randint 17 | random.seed(3) 18 | import datetime, re, operator 19 | from random 
import shuffle 20 | from time import gmtime, strftime 21 | import gc 22 | 23 | import os 24 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' #get rid of warnings 25 | from os import listdir 26 | from os.path import isfile, join, isdir 27 | import pickle 28 | 29 | #import data augmentation methods 30 | from nlp_aug import * 31 | 32 | ################################################### 33 | ######### loading folders and txt files ########### 34 | ################################################### 35 | 36 | #loading a pickle file 37 | def load_pickle(file): 38 | return pickle.load(open(file, 'rb')) 39 | 40 | #create an output folder if it does not already exist 41 | def confirm_output_folder(output_folder): 42 | if not os.path.exists(output_folder): 43 | os.makedirs(output_folder) 44 | 45 | #get full image paths 46 | def get_txt_paths(folder): 47 | txt_paths = [join(folder, f) for f in listdir(folder) if isfile(join(folder, f)) and '.txt' in f] 48 | if join(folder, '.DS_Store') in txt_paths: 49 | txt_paths.remove(join(folder, '.DS_Store')) 50 | txt_paths = sorted(txt_paths) 51 | return txt_paths 52 | 53 | #get subfolders 54 | def get_subfolder_paths(folder): 55 | subfolder_paths = [join(folder, f) for f in listdir(folder) if (isdir(join(folder, f)) and '.DS_Store' not in f)] 56 | if join(folder, '.DS_Store') in subfolder_paths: 57 | subfolder_paths.remove(join(folder, '.DS_Store')) 58 | subfolder_paths = sorted(subfolder_paths) 59 | return subfolder_paths 60 | 61 | #get all image paths 62 | def get_all_txt_paths(master_folder): 63 | 64 | all_paths = [] 65 | subfolders = get_subfolder_paths(master_folder) 66 | if len(subfolders) > 1: 67 | for subfolder in subfolders: 68 | all_paths += get_txt_paths(subfolder) 69 | else: 70 | all_paths = get_txt_paths(master_folder) 71 | return all_paths 72 | 73 | ################################################### 74 | ################ data processing ################## 75 | ################################################### 76 | 77 | #get the pickle file for the word2vec so you don't have to load the entire huge file each time 78 | def gen_vocab_dicts(folder, output_pickle_path, huge_word2vec): 79 | 80 | vocab = set() 81 | text_embeddings = open(huge_word2vec, 'r').readlines() 82 | word2vec = {} 83 | 84 | #get all the vocab 85 | all_txt_paths = get_all_txt_paths(folder) 86 | print(all_txt_paths) 87 | 88 | #loop through each text file 89 | for txt_path in all_txt_paths: 90 | 91 | # get all the words 92 | try: 93 | all_lines = open(txt_path, "r").readlines() 94 | for line in all_lines: 95 | words = line[:-1].split(' ') 96 | for word in words: 97 | vocab.add(word) 98 | except: 99 | print(txt_path, "has an error") 100 | 101 | print(len(vocab), "unique words found") 102 | 103 | # load the word embeddings, and only add the word to the dictionary if we need it 104 | for line in text_embeddings: 105 | items = line.split(' ') 106 | word = items[0] 107 | if word in vocab: 108 | vec = items[1:] 109 | word2vec[word] = np.asarray(vec, dtype = 'float32') 110 | print(len(word2vec), "matches between unique words and word2vec dictionary") 111 | 112 | pickle.dump(word2vec, open(output_pickle_path, 'wb')) 113 | print("dictionaries outputted to", output_pickle_path) 114 | 115 | #getting the x and y inputs in numpy array form from the text file 116 | def get_x_y(train_txt, num_classes, word2vec_len, input_size, word2vec, percent_dataset): 117 | 118 | #read in lines 119 | train_lines = open(train_txt, 'r').readlines() 120 | shuffle(train_lines) 121 | train_lines = 
train_lines[:int(percent_dataset*len(train_lines))] 122 | num_lines = len(train_lines) 123 | 124 | #initialize x and y matrix 125 | x_matrix = None 126 | y_matrix = None 127 | 128 | try: 129 | x_matrix = np.zeros((num_lines, input_size, word2vec_len)) 130 | except: 131 | print("Error!", num_lines, input_size, word2vec_len) 132 | y_matrix = np.zeros((num_lines, num_classes)) 133 | 134 | #insert values 135 | for i, line in enumerate(train_lines): 136 | 137 | parts = line[:-1].split('\t') 138 | label = int(parts[0]) 139 | sentence = parts[1] 140 | 141 | #insert x 142 | words = sentence.split(' ') 143 | words = words[:x_matrix.shape[1]] #cut off if too long 144 | for j, word in enumerate(words): 145 | if word in word2vec: 146 | x_matrix[i, j, :] = word2vec[word] 147 | 148 | #insert y 149 | y_matrix[i][label] = 1.0 150 | 151 | return x_matrix, y_matrix 152 | 153 | ################################################### 154 | ############### data augmentation ################# 155 | ################################################### 156 | 157 | def gen_tsne_aug(train_orig, output_file): 158 | 159 | writer = open(output_file, 'w') 160 | lines = open(train_orig, 'r').readlines() 161 | for i, line in enumerate(lines): 162 | parts = line[:-1].split('\t') 163 | label = parts[0] 164 | sentence = parts[1] 165 | writer.write(line) 166 | for alpha in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]: 167 | aug_sentence = eda_4(sentence, alpha_sr=alpha, alpha_ri=alpha, alpha_rs=alpha, p_rd=alpha, num_aug=2)[0] 168 | writer.write(label + "\t" + aug_sentence + '\n') 169 | writer.close() 170 | print("finished eda for tsne for", train_orig, "to", output_file) 171 | 172 | 173 | 174 | 175 | #generate more data with standard augmentation 176 | def gen_standard_aug(train_orig, output_file, num_aug=9): 177 | writer = open(output_file, 'w') 178 | lines = open(train_orig, 'r').readlines() 179 | for i, line in enumerate(lines): 180 | parts = line[:-1].split('\t') 181 | label = parts[0] 182 | sentence = parts[1] 183 | aug_sentences = eda_4(sentence, num_aug=num_aug) 184 | for aug_sentence in aug_sentences: 185 | writer.write(label + "\t" + aug_sentence + '\n') 186 | writer.close() 187 | print("finished eda for", train_orig, "to", output_file) 188 | 189 | #generate more data with only synonym replacement (SR) 190 | def gen_sr_aug(train_orig, output_file, alpha_sr, n_aug): 191 | writer = open(output_file, 'w') 192 | lines = open(train_orig, 'r').readlines() 193 | for i, line in enumerate(lines): 194 | parts = line[:-1].split('\t') 195 | label = parts[0] 196 | sentence = parts[1] 197 | aug_sentences = SR(sentence, alpha_sr=alpha_sr, n_aug=n_aug) 198 | for aug_sentence in aug_sentences: 199 | writer.write(label + "\t" + aug_sentence + '\n') 200 | writer.close() 201 | print("finished SR for", train_orig, "to", output_file, "with alpha", alpha_sr) 202 | 203 | #generate more data with only random insertion (RI) 204 | def gen_ri_aug(train_orig, output_file, alpha_ri, n_aug): 205 | writer = open(output_file, 'w') 206 | lines = open(train_orig, 'r').readlines() 207 | for i, line in enumerate(lines): 208 | parts = line[:-1].split('\t') 209 | label = parts[0] 210 | sentence = parts[1] 211 | aug_sentences = RI(sentence, alpha_ri=alpha_ri, n_aug=n_aug) 212 | for aug_sentence in aug_sentences: 213 | writer.write(label + "\t" + aug_sentence + '\n') 214 | writer.close() 215 | print("finished RI for", train_orig, "to", output_file, "with alpha", alpha_ri) 216 | 217 | #generate more data with only random swap (RS) 218 | def 
gen_rs_aug(train_orig, output_file, alpha_rs, n_aug): 219 | writer = open(output_file, 'w') 220 | lines = open(train_orig, 'r').readlines() 221 | for i, line in enumerate(lines): 222 | parts = line[:-1].split('\t') 223 | label = parts[0] 224 | sentence = parts[1] 225 | aug_sentences = RS(sentence, alpha_rs=alpha_rs, n_aug=n_aug) 226 | for aug_sentence in aug_sentences: 227 | writer.write(label + "\t" + aug_sentence + '\n') 228 | writer.close() 229 | print("finished RS for", train_orig, "to", output_file, "with alpha", alpha_rs) 230 | 231 | #generate more data with only random deletion (RD) 232 | def gen_rd_aug(train_orig, output_file, alpha_rd, n_aug): 233 | writer = open(output_file, 'w') 234 | lines = open(train_orig, 'r').readlines() 235 | for i, line in enumerate(lines): 236 | parts = line[:-1].split('\t') 237 | label = parts[0] 238 | sentence = parts[1] 239 | aug_sentences = RD(sentence, alpha_rd=alpha_rd, n_aug=n_aug) 240 | for aug_sentence in aug_sentences: 241 | writer.write(label + "\t" + aug_sentence + '\n') 242 | writer.close() 243 | print("finished RD for", train_orig, "to", output_file, "with alpha", alpha_rd) 244 | 245 | ################################################### 246 | ##################### model ####################### 247 | ################################################### 248 | 249 | #building the model in keras 250 | def build_model(sentence_length, word2vec_len, num_classes): 251 | model = None 252 | model = Sequential() 253 | model.add(Bidirectional(LSTM(64, return_sequences=True), input_shape=(sentence_length, word2vec_len))) 254 | model.add(Dropout(0.5)) 255 | model.add(Bidirectional(LSTM(32, return_sequences=False))) 256 | model.add(Dropout(0.5)) 257 | model.add(Dense(20, activation='relu')) 258 | model.add(Dense(num_classes, kernel_initializer='normal', activation='softmax')) 259 | model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) 260 | #print(model.summary()) 261 | return model 262 | 263 | #building the cnn in keras 264 | def build_cnn(sentence_length, word2vec_len, num_classes): 265 | model = None 266 | model = Sequential() 267 | model.add(layers.Conv1D(128, 5, activation='relu', input_shape=(sentence_length, word2vec_len))) 268 | model.add(layers.GlobalMaxPooling1D()) 269 | model.add(Dense(20, activation='relu')) 270 | model.add(Dense(num_classes, kernel_initializer='normal', activation='softmax')) 271 | model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) 272 | return model 273 | 274 | #one hot to categorical 275 | def one_hot_to_categorical(y): 276 | assert len(y.shape) == 2 277 | return np.argmax(y, axis=1) 278 | 279 | def get_now_str(): 280 | return str(strftime("%Y-%m-%d_%H:%M:%S", gmtime())) 281 | 282 | -------------------------------------------------------------------------------- /experiments/nlp_aug.py: -------------------------------------------------------------------------------- 1 | # Easy data augmentation techniques for text classification 2 | # Jason Wei, Chengyu Huang, Yifang Wei, Fei Xing, Kai Zou 3 | 4 | import random 5 | from random import shuffle 6 | random.seed(1) 7 | 8 | #stop words list 9 | stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 10 | 'ours', 'ourselves', 'you', 'your', 'yours', 11 | 'yourself', 'yourselves', 'he', 'him', 'his', 12 | 'himself', 'she', 'her', 'hers', 'herself', 13 | 'it', 'its', 'itself', 'they', 'them', 'their', 14 | 'theirs', 'themselves', 'what', 'which', 'who', 15 | 'whom', 'this', 'that', 'these', 'those', 'am', 16 | 'is', 
'are', 'was', 'were', 'be', 'been', 'being', 17 | 'have', 'has', 'had', 'having', 'do', 'does', 'did', 18 | 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 19 | 'because', 'as', 'until', 'while', 'of', 'at', 20 | 'by', 'for', 'with', 'about', 'against', 'between', 21 | 'into', 'through', 'during', 'before', 'after', 22 | 'above', 'below', 'to', 'from', 'up', 'down', 'in', 23 | 'out', 'on', 'off', 'over', 'under', 'again', 24 | 'further', 'then', 'once', 'here', 'there', 'when', 25 | 'where', 'why', 'how', 'all', 'any', 'both', 'each', 26 | 'few', 'more', 'most', 'other', 'some', 'such', 'no', 27 | 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 28 | 'very', 's', 't', 'can', 'will', 'just', 'don', 29 | 'should', 'now', ''] 30 | 31 | #cleaning up text 32 | import re 33 | def get_only_chars(line): 34 | 35 | clean_line = "" 36 | 37 | line = line.replace("’", "") 38 | line = line.replace("'", "") 39 | line = line.replace("-", " ") #replace hyphens with spaces 40 | line = line.replace("\t", " ") 41 | line = line.replace("\n", " ") 42 | line = line.lower() 43 | 44 | for char in line: 45 | if char in 'qwertyuiopasdfghjklzxcvbnm ': 46 | clean_line += char 47 | else: 48 | clean_line += ' ' 49 | 50 | clean_line = re.sub(' +',' ',clean_line) #delete extra spaces 51 | if clean_line[0] == ' ': 52 | clean_line = clean_line[1:] 53 | return clean_line 54 | 55 | ######################################################################## 56 | # Synonym replacement 57 | # Replace n words in the sentence with synonyms from wordnet 58 | ######################################################################## 59 | 60 | #for the first time you use wordnet 61 | #import nltk 62 | #nltk.download('wordnet') 63 | from nltk.corpus import wordnet 64 | 65 | def synonym_replacement(words, n): 66 | new_words = words.copy() 67 | random_word_list = list(set([word for word in words if word not in stop_words])) 68 | random.shuffle(random_word_list) 69 | num_replaced = 0 70 | for random_word in random_word_list: 71 | synonyms = get_synonyms(random_word) 72 | if len(synonyms) >= 1: 73 | synonym = random.choice(list(synonyms)) 74 | new_words = [synonym if word == random_word else word for word in new_words] 75 | #print("replaced", random_word, "with", synonym) 76 | num_replaced += 1 77 | if num_replaced >= n: #only replace up to n words 78 | break 79 | 80 | #this is stupid but we need it, trust me 81 | sentence = ' '.join(new_words) 82 | new_words = sentence.split(' ') 83 | 84 | return new_words 85 | 86 | def get_synonyms(word): 87 | synonyms = set() 88 | for syn in wordnet.synsets(word): 89 | for l in syn.lemmas(): 90 | synonym = l.name().replace("_", " ").replace("-", " ").lower() 91 | synonym = "".join([char for char in synonym if char in ' qwertyuiopasdfghjklzxcvbnm']) 92 | synonyms.add(synonym) 93 | if word in synonyms: 94 | synonyms.remove(word) 95 | return list(synonyms) 96 | 97 | ######################################################################## 98 | # Random deletion 99 | # Randomly delete words from the sentence with probability p 100 | ######################################################################## 101 | 102 | def random_deletion(words, p): 103 | 104 | #obviously, if there's only one word, don't delete it 105 | if len(words) == 1: 106 | return words 107 | 108 | #randomly delete words with probability p 109 | new_words = [] 110 | for word in words: 111 | r = random.uniform(0, 1) 112 | if r > p: 113 | new_words.append(word) 114 | 115 | #if you end up deleting all words, just return a random 
word 116 | if len(new_words) == 0: 117 | rand_int = random.randint(0, len(words)-1) 118 | return [words[rand_int]] 119 | 120 | return new_words 121 | 122 | ######################################################################## 123 | # Random swap 124 | # Randomly swap two words in the sentence n times 125 | ######################################################################## 126 | 127 | def random_swap(words, n): 128 | new_words = words.copy() 129 | for _ in range(n): 130 | new_words = swap_word(new_words) 131 | return new_words 132 | 133 | def swap_word(new_words): 134 | random_idx_1 = random.randint(0, len(new_words)-1) 135 | random_idx_2 = random_idx_1 136 | counter = 0 137 | while random_idx_2 == random_idx_1: 138 | random_idx_2 = random.randint(0, len(new_words)-1) 139 | counter += 1 140 | if counter > 3: 141 | return new_words 142 | new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1] 143 | return new_words 144 | 145 | ######################################################################## 146 | # Random addition 147 | # Randomly add n words into the sentence 148 | ######################################################################## 149 | 150 | def random_addition(words, n): 151 | new_words = words.copy() 152 | for _ in range(n): 153 | add_word(new_words) 154 | return new_words 155 | 156 | def add_word(new_words): 157 | synonyms = [] 158 | counter = 0 159 | while len(synonyms) < 1: 160 | random_word = new_words[random.randint(0, len(new_words)-1)] 161 | synonyms = get_synonyms(random_word) 162 | counter += 1 163 | if counter >= 10: 164 | return 165 | random_synonym = synonyms[0] 166 | random_idx = random.randint(0, len(new_words)-1) 167 | new_words.insert(random_idx, random_synonym) 168 | 169 | ######################################################################## 170 | # main data augmentation function 171 | ######################################################################## 172 | 173 | def eda_4(sentence, alpha_sr=0.3, alpha_ri=0.2, alpha_rs=0.1, p_rd=0.15, num_aug=9): 174 | 175 | sentence = get_only_chars(sentence) 176 | words = sentence.split(' ') 177 | words = [word for word in words if word is not ''] 178 | num_words = len(words) 179 | 180 | augmented_sentences = [] 181 | num_new_per_technique = int(num_aug/4)+1 182 | n_sr = max(1, int(alpha_sr*num_words)) 183 | n_ri = max(1, int(alpha_ri*num_words)) 184 | n_rs = max(1, int(alpha_rs*num_words)) 185 | 186 | #sr 187 | for _ in range(num_new_per_technique): 188 | a_words = synonym_replacement(words, n_sr) 189 | augmented_sentences.append(' '.join(a_words)) 190 | 191 | #ri 192 | for _ in range(num_new_per_technique): 193 | a_words = random_addition(words, n_ri) 194 | augmented_sentences.append(' '.join(a_words)) 195 | 196 | #rs 197 | for _ in range(num_new_per_technique): 198 | a_words = random_swap(words, n_rs) 199 | augmented_sentences.append(' '.join(a_words)) 200 | 201 | #rd 202 | for _ in range(num_new_per_technique): 203 | a_words = random_deletion(words, p_rd) 204 | augmented_sentences.append(' '.join(a_words)) 205 | 206 | augmented_sentences = [get_only_chars(sentence) for sentence in augmented_sentences] 207 | shuffle(augmented_sentences) 208 | 209 | #trim so that we have the desired number of augmented sentences 210 | if num_aug >= 1: 211 | augmented_sentences = augmented_sentences[:num_aug] 212 | else: 213 | keep_prob = num_aug / len(augmented_sentences) 214 | augmented_sentences = [s for s in augmented_sentences if random.uniform(0, 1) < keep_prob] 215 | 
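	#note: when num_aug < 1 it is treated as a keep fraction rather than a count, so each augmented sentence above is retained independently with probability num_aug / len(augmented_sentences)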
216 | #append the original sentence 217 | augmented_sentences.append(sentence) 218 | 219 | return augmented_sentences 220 | 221 | def SR(sentence, alpha_sr, n_aug=9): 222 | 223 | sentence = get_only_chars(sentence) 224 | words = sentence.split(' ') 225 | num_words = len(words) 226 | 227 | augmented_sentences = [] 228 | n_sr = max(1, int(alpha_sr*num_words)) 229 | 230 | for _ in range(n_aug): 231 | a_words = synonym_replacement(words, n_sr) 232 | augmented_sentences.append(' '.join(a_words)) 233 | 234 | augmented_sentences = [get_only_chars(sentence) for sentence in augmented_sentences] 235 | shuffle(augmented_sentences) 236 | 237 | augmented_sentences.append(sentence) 238 | 239 | return augmented_sentences 240 | 241 | def RI(sentence, alpha_ri, n_aug=9): 242 | 243 | sentence = get_only_chars(sentence) 244 | words = sentence.split(' ') 245 | num_words = len(words) 246 | 247 | augmented_sentences = [] 248 | n_ri = max(1, int(alpha_ri*num_words)) 249 | 250 | for _ in range(n_aug): 251 | a_words = random_addition(words, n_ri) 252 | augmented_sentences.append(' '.join(a_words)) 253 | 254 | augmented_sentences = [get_only_chars(sentence) for sentence in augmented_sentences] 255 | shuffle(augmented_sentences) 256 | 257 | augmented_sentences.append(sentence) 258 | 259 | return augmented_sentences 260 | 261 | def RS(sentence, alpha_rs, n_aug=9): 262 | 263 | sentence = get_only_chars(sentence) 264 | words = sentence.split(' ') 265 | num_words = len(words) 266 | 267 | augmented_sentences = [] 268 | n_rs = max(1, int(alpha_rs*num_words)) 269 | 270 | for _ in range(n_aug): 271 | a_words = random_swap(words, n_rs) 272 | augmented_sentences.append(' '.join(a_words)) 273 | 274 | augmented_sentences = [get_only_chars(sentence) for sentence in augmented_sentences] 275 | shuffle(augmented_sentences) 276 | 277 | augmented_sentences.append(sentence) 278 | 279 | return augmented_sentences 280 | 281 | def RD(sentence, alpha_rd, n_aug=9): 282 | 283 | sentence = get_only_chars(sentence) 284 | words = sentence.split(' ') 285 | words = [word for word in words if word is not ''] 286 | num_words = len(words) 287 | 288 | augmented_sentences = [] 289 | 290 | for _ in range(n_aug): 291 | a_words = random_deletion(words, alpha_rd) 292 | augmented_sentences.append(' '.join(a_words)) 293 | 294 | augmented_sentences = [get_only_chars(sentence) for sentence in augmented_sentences] 295 | shuffle(augmented_sentences) 296 | 297 | augmented_sentences.append(sentence) 298 | 299 | return augmented_sentences 300 | 301 | 302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | 318 | 319 | 320 | ######################################################################## 321 | # Testing 322 | ######################################################################## 323 | 324 | if __name__ == '__main__': 325 | 326 | line = 'Hi. My name is Jason. I’m a third-year computer science major at Dartmouth College, interested in deep learning and computer vision. My advisor is Saeed Hassanpour. I’m currently working on deep learning for lung cancer classification.' 
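	#minimal smoke test (illustrative sketch, using eda_4 defined above): print a few augmentations of the sample sentence
	for aug_sentence in eda_4(line, num_aug=4):
		print(aug_sentence)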
327 | 328 | 329 | 330 | ######################################################################## 331 | # Sliding window 332 | # Slide a window of size w over the sentence with stride s 333 | # Returns a list of lists of words 334 | ######################################################################## 335 | 336 | # def sliding_window_sentences(words, w, s): 337 | # windows = [] 338 | # for i in range(0, len(words)-w+1, s): 339 | # window = words[i:i+w] 340 | # windows.append(window) 341 | # return windows 342 | 343 | 344 | 345 | 346 | -------------------------------------------------------------------------------- /preprocess/__pycache__/utils.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/preprocess/__pycache__/utils.cpython-36.pyc -------------------------------------------------------------------------------- /preprocess/bg_clean.py: -------------------------------------------------------------------------------- 1 | 2 | from utils import * 3 | 4 | def clean_csv(input_file, output_file): 5 | 6 | input_r = open(input_file, 'r').read() 7 | 8 | lines = input_r.split(',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,') 9 | print(len(lines)) 10 | for line in lines[:10]: 11 | print(line[-3:]) 12 | 13 | if __name__ == "__main__": 14 | 15 | input_file = 'raw/blog-gender-dataset.csv' 16 
| output_file = 'datasets/bg/train.csv' 17 | 18 | clean_csv(input_file, output_file) 19 | 20 | 21 | -------------------------------------------------------------------------------- /preprocess/copy_sized_datasets.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | sizes = ['1_tiny', '2_small', '3_standard', '4_full'] 4 | datasets = ['sst2', 'cr', 'subj', 'trec', 'pc'] 5 | 6 | for size in sizes: 7 | for dataset in datasets: 8 | folder = 'size_data_t1/' + size + '/' + dataset 9 | if not os.path.exists(folder): 10 | os.makedirs(folder) 11 | 12 | origin = 'sized_datasets_f1/' + size + '/' + dataset + '/train_orig.txt' 13 | destination = 'size_data_t1/' + size + '/' + dataset + '/train_orig.txt' 14 | os.system('cp ' + origin + ' ' + destination) -------------------------------------------------------------------------------- /preprocess/cr_clean.py: -------------------------------------------------------------------------------- 1 | #0 = neg, 1 = pos 2 | from utils import * 3 | 4 | def retrieve_reviews(line): 5 | 6 | reviews = set() 7 | chars = list(line) 8 | for i, char in enumerate(chars): 9 | if char == '[': 10 | if chars[i+1] == '-': 11 | reviews.add(0) 12 | elif chars[i+1] == '+': 13 | reviews.add(1) 14 | 15 | reviews = list(reviews) 16 | if len(reviews) == 2: 17 | return -2 18 | elif len(reviews) == 1: 19 | return reviews[0] 20 | else: 21 | return -1 22 | 23 | def clean_files(input_files, output_file): 24 | 25 | writer = open(output_file, 'w') 26 | 27 | for input_file in input_files: 28 | print(input_file) 29 | input_lines = open(input_file, 'r').readlines() 30 | counter = 0 31 | bad_counter = 0 32 | for line in input_lines: 33 | review = retrieve_reviews(line) 34 | if review in {0, 1}: 35 | good_line = get_only_chars(re.sub("([\(\[]).*?([\)\]])", "\g<1>\g<2>", line)) 36 | output_line = str(review) + '\t' + good_line 37 | writer.write(output_line + '\n') 38 | counter += 1 39 | elif review == -2: 40 | bad_counter +=1 41 | print(input_file, counter, bad_counter) 42 | 43 | writer.close() 44 | 45 | if __name__ == '__main__': 46 | 47 | input_files = ['all.txt']#['canon_power.txt', 'canon_s1.txt', 'diaper.txt', 'hitachi.txt', 'ipod.txt', 'micromp3.txt', 'nokia6600.txt', 'norton.txt', 'router.txt'] 48 | input_files = ['raw/cr/data_new/' + f for f in input_files] 49 | output_file = 'datasets/cr/apex_clean.txt' 50 | 51 | clean_files(input_files, output_file) 52 | -------------------------------------------------------------------------------- /preprocess/create_dataset_increments.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | datasets = ['cr', 'pc', 'sst1', 'sst2', 'subj', 'trec'] 4 | 5 | for dataset in datasets: 6 | line = 'cat increment_datasets_f2/' + dataset + '/test.txt > sized_datasets_f1/test/' + dataset + '/test.txt' 7 | os.system(line) -------------------------------------------------------------------------------- /preprocess/get_stats.py: -------------------------------------------------------------------------------- 1 | import statistics 2 | 3 | datasets = ['sst2', 'cr', 'subj', 'trec', 'pc'] 4 | 5 | filenames = ['increment_datasets_f2/' + x + '/train_orig.txt' for x in datasets] 6 | 7 | def get_vocab_size(filename): 8 | lines = open(filename, 'r').readlines() 9 | 10 | vocab = set() 11 | for line in lines: 12 | words = line[:-1].split(' ') 13 | for word in words: 14 | if word not in vocab: 15 | vocab.add(word) 16 | 17 | return len(vocab) 18 | 19 | def 
get_mean_and_std(filename): 20 | lines = open(filename, 'r').readlines() 21 | 22 | line_lengths = [] 23 | for line in lines: 24 | length = len(line[:-1].split(' ')) - 1 25 | line_lengths.append(length) 26 | 27 | print(filename, statistics.mean(line_lengths), statistics.stdev(line_lengths), max(line_lengths)) 28 | 29 | 30 | for filename in filenames: 31 | #print(get_vocab_size(filename)) 32 | get_mean_and_std(filename) 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | -------------------------------------------------------------------------------- /preprocess/procon_clean.py: -------------------------------------------------------------------------------- 1 | 2 | from utils import * 3 | 4 | def get_good_stuff(line): 5 | idx = line.find('s>') 6 | good = line[idx+2:-8] 7 | 8 | return get_only_chars(good) 9 | 10 | def clean_file(con_file, pro_file, output_train, output_test): 11 | 12 | train_writer = open(output_train, 'w') 13 | test_writer = open(output_test, 'w') 14 | con_lines = open(con_file, 'r').readlines() 15 | for line in con_lines[:int(len(con_lines)*0.9)]: 16 | content = get_good_stuff(line) 17 | if len(content) >= 8: 18 | train_writer.write('0\t' + content + '\n') 19 | for line in con_lines[int(len(con_lines)*0.9):]: 20 | content = get_good_stuff(line) 21 | if len(content) >= 8: 22 | test_writer.write('0\t' + content + '\n') 23 | 24 | pro_lines = open(pro_file, 'r').readlines() 25 | for line in pro_lines[:int(len(con_lines)*0.9)]: 26 | content = get_good_stuff(line) 27 | if len(content) >= 8: 28 | train_writer.write('1\t' + content + '\n') 29 | for line in pro_lines[int(len(con_lines)*0.9):]: 30 | content = get_good_stuff(line) 31 | if len(content) >= 8: 32 | test_writer.write('1\t' + content + '\n') 33 | 34 | 35 | if __name__ == '__main__': 36 | 37 | con_file = 'raw/pros-cons/integratedCons.txt' 38 | pro_file = 'raw/pros-cons/integratedPros.txt' 39 | output_train = 'datasets/procon/train.txt' 40 | output_test = 'datasets/procon/test.txt' 41 | clean_file(con_file, pro_file, output_train, output_test) -------------------------------------------------------------------------------- /preprocess/shuffle_lines.py: -------------------------------------------------------------------------------- 1 | import random 2 | 3 | def shuffle_lines(text_file): 4 | lines = open(text_file).readlines() 5 | random.shuffle(lines) 6 | open(text_file, 'w').writelines(lines) 7 | 8 | shuffle_lines('special_f4/pc/test_short_aug_shuffle.txt') -------------------------------------------------------------------------------- /preprocess/sst1_clean.py: -------------------------------------------------------------------------------- 1 | from utils import * 2 | 3 | def get_label(decimal): 4 | if decimal >= 0 and decimal <= 0.2: 5 | return 0 6 | elif decimal > 0.2 and decimal <= 0.4: 7 | return 1 8 | elif decimal > 0.4 and decimal <= 0.6: 9 | return 2 10 | elif decimal > 0.6 and decimal <= 0.8: 11 | return 3 12 | elif decimal > 0.8 and decimal <= 1: 13 | return 4 14 | else: 15 | return -1 16 | 17 | def get_label_binary(decimal): 18 | if decimal >= 0 and decimal <= 0.4: 19 | return 0 20 | elif decimal > 0.6 and decimal <= 1: 21 | return 1 22 | else: 23 | return -1 24 | 25 | def get_split(split_num): 26 | if split_num == 1 or split_num == 3: 27 | return 'train' 28 | elif split_num == 2: 29 | return 'test' 30 | 31 | if __name__ == "__main__": 32 | 33 | data_path = 'raw/sst_1/stanfordSentimentTreebank/datasetSentences.txt' 34 | labels_path = 'raw/sst_1/stanfordSentimentTreebank/sentiment_labels.txt' 35 | split_path = 
'raw/sst_1/stanfordSentimentTreebank/datasetSplit.txt' 36 | dictionary_path = 'raw/sst_1/stanfordSentimentTreebank/dictionary.txt' 37 | 38 | sentence_lines = open(data_path, 'r').readlines() 39 | labels_lines = open(labels_path, 'r').readlines() 40 | split_lines = open(split_path, 'r').readlines() 41 | dictionary_lines = open(dictionary_path, 'r').readlines() 42 | 43 | print(len(sentence_lines)) 44 | print(len(split_lines)) 45 | print(len(labels_lines)) 46 | print(len(dictionary_lines)) 47 | 48 | #create dictionary for id to label 49 | id_to_label = {} 50 | for line in labels_lines[1:]: 51 | parts = line[:-1].split("|") 52 | _id = parts[0] 53 | score = float(parts[1]) 54 | label = get_label_binary(score) 55 | 56 | id_to_label[_id] = label 57 | 58 | print(len(id_to_label), "id to labels read in") 59 | 60 | #create dictionary for phrase to label 61 | phrase_to_label = {} 62 | for line in dictionary_lines: 63 | parts = line[:-1].split("|") 64 | phrase = parts[0] 65 | _id = parts[1] 66 | label = id_to_label[_id] 67 | 68 | phrase_to_label[phrase] = label 69 | 70 | print(len(phrase_to_label), "phrase to id read in") 71 | 72 | #create id to split 73 | id_to_split = {} 74 | for line in split_lines[1:]: 75 | parts = line[:-1].split(",") 76 | _id = parts[0] 77 | split_num = float(parts[1]) 78 | split = get_split(split_num) 79 | id_to_split[_id] = split 80 | 81 | print(len(id_to_split), "id to split read in") 82 | 83 | train_writer = open('datasets/sst2/train_orig.txt', 'w') 84 | test_writer = open('datasets/sst2/test.txt', 'w') 85 | 86 | #create sentence to split and label 87 | for sentence_line in sentence_lines[1:]: 88 | parts = sentence_line[:-1].split('\t') 89 | _id = parts[0] 90 | sentence = get_only_chars(parts[1]) 91 | split = id_to_split[_id] 92 | 93 | if parts[1] in phrase_to_label: 94 | label = phrase_to_label[parts[1]] 95 | if label in {0, 1}: 96 | #print(label, sentence, split) 97 | if split == 'train': 98 | train_writer.write(str(label) + '\t' + sentence + '\n') 99 | elif split == 'test': 100 | test_writer.write(str(label) + '\t' + sentence + '\n') 101 | 102 | #print(parts, split) 103 | 104 | #label = [] 105 | 106 | 107 | 108 | 109 | -------------------------------------------------------------------------------- /preprocess/subj_clean.py: -------------------------------------------------------------------------------- 1 | from utils import * 2 | 3 | if __name__ == "__main__": 4 | subj_path = "subj/rotten_imdb/subj.txt" 5 | obj_path = "subj/rotten_imdb/plot.tok.gt9.5000" 6 | 7 | subj_lines = open(subj_path, 'r').readlines() 8 | obj_lines = open(obj_path, 'r').readlines() 9 | print(len(subj_lines), len(obj_lines)) 10 | 11 | test_split = int(0.9*len(subj_lines)) 12 | 13 | train_lines = [] 14 | test_lines = [] 15 | 16 | #training set 17 | for s_line in subj_lines[:test_split]: 18 | clean_line = '1\t' + get_only_chars(s_line[:-1]) 19 | train_lines.append(clean_line) 20 | 21 | for o_line in obj_lines[:test_split]: 22 | clean_line = '0\t' + get_only_chars(o_line[:-1]) 23 | train_lines.append(clean_line) 24 | 25 | #testing set 26 | for s_line in subj_lines[test_split:]: 27 | clean_line = '1\t' + get_only_chars(s_line[:-1]) 28 | test_lines.append(clean_line) 29 | 30 | for o_line in obj_lines[test_split:]: 31 | clean_line = '0\t' + get_only_chars(o_line[:-1]) 32 | test_lines.append(clean_line) 33 | 34 | print(len(test_lines), len(train_lines)) 35 | 36 | #print training set 37 | writer = open('datasets/subj/train_orig.txt', 'w') 38 | for line in train_lines: 39 | writer.write(line + '\n') 40 | 
writer.close() 41 | 42 | #print testing set 43 | writer = open('datasets/subj/test.txt', 'w') 44 | for line in test_lines: 45 | writer.write(line + '\n') 46 | writer.close() -------------------------------------------------------------------------------- /preprocess/trej_clean.py: -------------------------------------------------------------------------------- 1 | 2 | from utils import * 3 | 4 | class_name_to_num = {'DESC': 0, 'ENTY':1, 'ABBR':2, 'HUM': 3, 'LOC': 4, 'NUM': 5} 5 | 6 | def clean(input_file, output_file): 7 | lines = open(input_file, 'r').readlines() 8 | writer = open(output_file, 'w') 9 | for line in lines: 10 | parts = line[:-1].split(' ') 11 | tag = parts[0].split(':')[0] 12 | class_num = class_name_to_num[tag] 13 | sentence = get_only_chars(' '.join(parts[1:])) 14 | print(tag, class_num, sentence) 15 | output_line = str(class_num) + '\t' + sentence 16 | writer.write(output_line + '\n') 17 | writer.close() 18 | 19 | 20 | if __name__ == "__main__": 21 | 22 | clean('raw/trec/train_copy.txt', 'datasets/trec/train_orig.txt') 23 | clean('raw/trec/test_copy.txt', 'datasets/trec/test.txt') 24 | -------------------------------------------------------------------------------- /preprocess/utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | 4 | 5 | 6 | #cleaning up text 7 | def get_only_chars(line): 8 | 9 | clean_line = "" 10 | 11 | line = line.lower() 12 | line = line.replace(" 's", " is") 13 | line = line.replace("-", " ") #replace hyphens with spaces 14 | line = line.replace("\t", " ") 15 | line = line.replace("\n", " ") 16 | line = line.replace("'", "") 17 | 18 | for char in line: 19 | if char in 'qwertyuiopasdfghjklzxcvbnm ': 20 | clean_line += char 21 | else: 22 | clean_line += ' ' 23 | 24 | clean_line = re.sub(' +',' ',clean_line) #delete extra spaces 25 | print(clean_line) 26 | if clean_line[0] == ' ': 27 | clean_line = clean_line[1:] 28 | return clean_line --------------------------------------------------------------------------------