├── .gitignore
├── README.md
├── code
│   ├── __pycache__
│   │   └── eda.cpython-36.pyc
│   ├── augment.py
│   └── eda.py
├── data
│   ├── lol.txt
│   └── sst2_train_500.txt
├── eda_figure.png
├── experiments
│   ├── __pycache__
│   │   ├── a_config.cpython-36.pyc
│   │   ├── a_config.cpython-37.pyc
│   │   ├── b_config.cpython-36.pyc
│   │   ├── c_config.cpython-36.pyc
│   │   ├── config.cpython-36.pyc
│   │   ├── e_config.cpython-36.pyc
│   │   ├── methods.cpython-36.pyc
│   │   ├── methods.cpython-37.pyc
│   │   └── nlp_aug.cpython-36.pyc
│   ├── a_1_data_process.py
│   ├── a_2_train_eval.py
│   ├── a_config.py
│   ├── b_1_data_process.py
│   ├── b_2_train_eval.py
│   ├── b_config.py
│   ├── c_1_data_process.py
│   ├── c_2_train_eval.py
│   ├── c_config.py
│   ├── d_0_preprocess.py
│   ├── d_1_train_models.py
│   ├── d_2_tsne.py
│   ├── d_neg_1_balance_trec.py
│   ├── e_1_data_process.py
│   ├── e_2_cnn_aug.py
│   ├── e_2_cnn_baselines.py
│   ├── e_2_rnn_aug.py
│   ├── e_2_rnn_baselines.py
│   ├── e_config.py
│   ├── methods.py
│   └── nlp_aug.py
└── preprocess
    ├── __pycache__
    │   └── utils.cpython-36.pyc
    ├── bg_clean.py
    ├── copy_sized_datasets.py
    ├── cr_clean.py
    ├── create_dataset_increments.py
    ├── get_stats.py
    ├── procon_clean.py
    ├── shuffle_lines.py
    ├── sst1_clean.py
    ├── subj_clean.py
    ├── trej_clean.py
    └── utils.py
/.gitignore:
--------------------------------------------------------------------------------
1 |
2 | word2vec*
3 | size_data*
4 | size_data_f1*
5 | size_data_f3*
6 | size_data_t1*
7 | increment_datasets_f2*
8 | z_archives*
9 | special_f4*
10 | outputs_f1*
11 | outputs_f2*
12 | outputs_f3*
13 | outputs_f4*
14 | baseline_cnn*
15 | baseline_rnn*
16 |
17 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
2 | [[Paper]](https://arxiv.org/abs/1901.11196)
3 |
4 | For a survey of data augmentation in NLP, see this [repository](https://github.com/styfeng/DataAug4NLP/blob/main/README.md) or this [paper](http://arxiv.org/abs/2105.03075).
5 |
6 | This is the code for the EMNLP-IJCNLP paper [EDA: Easy Data Augmentation techniques for boosting performance on text classification tasks.](https://arxiv.org/abs/1901.11196)
7 |
8 | A blog post that explains EDA is [[here]](https://medium.com/@jason.20/these-are-the-easiest-data-augmentation-techniques-in-natural-language-processing-you-can-think-of-88e393fd610).
9 |
10 | Update: an external implementation of EDA for Chinese is available [[here]](https://github.com/zhanlaoban/EDA_NLP_for_Chinese).
11 |
12 | By [Jason Wei](https://jasonwei20.github.io/research/) and Kai Zou.
13 |
14 | Note: **Do not** email me with questions, as I will not reply. Instead, open an issue.
15 |
16 | We present **EDA**: **e**asy **d**ata **a**ugmentation techniques for boosting performance on text classification tasks. These are a generalized set of data augmentation techniques that are easy to implement and have shown improvements on five NLP classification tasks, with substantial improvements on datasets of size `N < 500`. While other techniques require you to train a language model on an external dataset just to get a small boost, we found that simple text editing operations using EDA result in good performance gains. Given a sentence in the training set, we perform the following operations:
17 |
18 | - **Synonym Replacement (SR):** Randomly choose *n* words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
19 | - **Random Insertion (RI):** Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this *n* times.
20 | - **Random Swap (RS):** Randomly choose two words in the sentence and swap their positions. Do this *n* times.
21 | - **Random Deletion (RD):** For each word in the sentence, randomly remove it with probability *p*.
22 |
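If you prefer to call EDA from Python rather than through the command line below, here is a minimal sketch (an illustration, not part of the official interface): it assumes you run it from the `code` directory so that `eda.py` is importable, and that WordNet has already been downloaded (see Usage below).

```python
# Minimal sketch: call the eda() function from code/eda.py directly.
# Assumes this runs from the code/ directory and WordNet is downloaded.
from eda import eda

sentence = "the only way to tolerate this insipid brutally clueless film"
augmented = eda(sentence, alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=4)
for s in augmented:
    print(s)  # four augmented variants, followed by the original sentence
```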
23 |
24 | ![Average performance on 5 datasets with and without EDA, with respect to percent of training data used.](eda_figure.png)
25 |
26 | # Usage
27 |
28 | You can run EDA on any text classification dataset in less than 5 minutes. Just two steps:
29 |
30 | ### Install NLTK (if you don't have it already):
31 |
32 | Pip install it.
33 |
34 | ```bash
35 | pip install -U nltk
36 | ```
37 |
38 | Download WordNet.
39 | ```bash
40 | python
41 | >>> import nltk; nltk.download('wordnet')
42 | ```
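If you prefer not to open an interactive session, the same download can be done from a short script. Note that on newer NLTK releases the `omw-1.4` package may also be required for WordNet lookups (an assumption worth checking against your NLTK version):

```python
import nltk

nltk.download('wordnet')   # used by eda.py for synonym lookup
nltk.download('omw-1.4')   # may be required by newer NLTK releases
```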
43 |
44 | ### Run EDA
45 |
46 | You can easily write your own implementation, but this one takes input files in the format `label\tsentence` (note the `\t`). For instance, your input file should look like this (example from the Stanford Sentiment Treebank):
47 |
48 | ```
49 | 1 neil burger here succeeded in making the mystery of four decades back the springboard for a more immediate mystery in the present
50 | 0 it is a visual rorschach test and i must have failed
51 | 0 the only way to tolerate this insipid brutally clueless film might be with a large dose of painkillers
52 | ...
53 | ```
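For reference, this is how `code/augment.py` recovers the label and sentence from each line (it simply splits on the tab character):

```python
# From code/augment.py: each line is "label<tab>sentence"
parts = line[:-1].split('\t')  # drop the trailing newline, split on tab
label = parts[0]
sentence = parts[1]
```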
54 |
55 | Now place this input file into the `data` folder. Run
56 |
57 | ```bash
58 | python code/augment.py --input=<input filename>
59 | ```
60 |
61 | The default output filename prepends `eda_` to the input filename, but you can specify your own with `--output`. You can also set the number of augmented sentences generated per original sentence with `--num_aug` (default is 9); each input line then produces up to `num_aug` augmented sentences plus the original in the output file. Finally, you can set the alpha parameters, which roughly correspond to the percent of words in each sentence that will be changed by the corresponding operation (default is `0.1`, i.e. 10%). For example, suppose your input file is `sst2_train.txt`, you want the output in `sst2_augmented.txt` with 16 augmented sentences per original sentence, and you want to replace 5% of words with synonyms (`alpha_sr=0.05`), delete 10% of words (`alpha_rd=0.1`, the default), and skip random insertion (`alpha_ri=0.0`) and random swap (`alpha_rs=0.0`). Then you would run:
62 |
63 | ```bash
64 | python code/augment.py --input=sst2_train.txt --output=sst2_augmented.txt --num_aug=16 --alpha_sr=0.05 --alpha_rd=0.1 --alpha_ri=0.0 --alpha_rs=0.0
65 | ```
66 |
67 | Note that for any operation whose alpha is greater than zero, at least one edit of that type is applied per augmented sentence, no matter how small the alpha. So even with `alpha_sr=0.001` and a four-word sentence, one synonym replacement will still be performed. If a particular alpha is zero, that operation is skipped entirely.
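These lines from the `eda()` function in `code/eda.py` show how each alpha translates into a number of edits; the `max(1, ...)` is what guarantees at least one edit whenever the corresponding alpha is greater than zero:

```python
# Number of edits per operation for a sentence of num_words words
n_sr = max(1, int(alpha_sr * num_words))  # words replaced by synonyms
n_ri = max(1, int(alpha_ri * num_words))  # synonyms inserted
n_rs = max(1, int(alpha_rs * num_words))  # swaps performed
# random deletion uses alpha_rd directly as the per-word deletion probability
```

Best of luck!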
68 |
69 | # Citation
70 | If you use EDA in your paper, please cite us:
71 | ```
72 | @inproceedings{wei-zou-2019-eda,
73 | title = "{EDA}: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks",
74 | author = "Wei, Jason and
75 | Zou, Kai",
76 | booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
77 | month = nov,
78 | year = "2019",
79 | address = "Hong Kong, China",
80 | publisher = "Association for Computational Linguistics",
81 | url = "https://www.aclweb.org/anthology/D19-1670",
82 | pages = "6383--6389",
83 | }
84 | ```
85 |
86 | # Experiments
87 |
88 | The code for all experiments in the paper is [here](https://github.com/jasonwei20/eda_nlp/tree/master/experiments), though it is not documented. See [this issue](https://github.com/jasonwei20/eda_nlp/issues/10) for limited guidance.
89 |
90 |
91 |
92 |
--------------------------------------------------------------------------------
/code/__pycache__/eda.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/code/__pycache__/eda.cpython-36.pyc
--------------------------------------------------------------------------------
/code/augment.py:
--------------------------------------------------------------------------------
1 | # Easy data augmentation techniques for text classification
2 | # Jason Wei and Kai Zou
3 |
4 | from eda import *
5 |
6 | #arguments to be parsed from command line
7 | import argparse
8 | ap = argparse.ArgumentParser()
9 | ap.add_argument("--input", required=True, type=str, help="input file of unaugmented data")
10 | ap.add_argument("--output", required=False, type=str, help="output file of unaugmented data")
11 | ap.add_argument("--num_aug", required=False, type=int, help="number of augmented sentences per original sentence")
12 | ap.add_argument("--alpha_sr", required=False, type=float, help="percent of words in each sentence to be replaced by synonyms")
13 | ap.add_argument("--alpha_ri", required=False, type=float, help="percent of words in each sentence to be inserted")
14 | ap.add_argument("--alpha_rs", required=False, type=float, help="percent of words in each sentence to be swapped")
15 | ap.add_argument("--alpha_rd", required=False, type=float, help="percent of words in each sentence to be deleted")
16 | args = ap.parse_args()
17 |
18 | #the output file
19 | output = None
20 | if args.output:
21 | output = args.output
22 | else:
23 | from os.path import dirname, basename, join
24 | output = join(dirname(args.input), 'eda_' + basename(args.input))
25 |
26 | #number of augmented sentences to generate per original sentence
27 | num_aug = 9 #default
28 | if args.num_aug:
29 | num_aug = args.num_aug
30 |
31 | #how much to replace each word by synonyms
32 | alpha_sr = 0.1 #default
33 | if args.alpha_sr is not None:
34 | alpha_sr = args.alpha_sr
35 |
36 | #how much to insert new words that are synonyms
37 | alpha_ri = 0.1 #default
38 | if args.alpha_ri is not None:
39 | alpha_ri = args.alpha_ri
40 |
41 | #how much to swap words
42 | alpha_rs = 0.1 #default
43 | if args.alpha_rs is not None:
44 | alpha_rs = args.alpha_rs
45 |
46 | #how much to delete words
47 | alpha_rd = 0.1 #default
48 | if args.alpha_rd is not None:
49 | alpha_rd = args.alpha_rd
50 |
51 | if alpha_sr == alpha_ri == alpha_rs == alpha_rd == 0:
52 | ap.error('At least one alpha should be greater than zero')
53 |
54 | #generate more data with standard augmentation
55 | def gen_eda(train_orig, output_file, alpha_sr, alpha_ri, alpha_rs, alpha_rd, num_aug=9):
56 |
57 | writer = open(output_file, 'w')
58 | lines = open(train_orig, 'r').readlines()
59 |
60 | for i, line in enumerate(lines):
61 | parts = line[:-1].split('\t')
62 | label = parts[0]
63 | sentence = parts[1]
64 | aug_sentences = eda(sentence, alpha_sr=alpha_sr, alpha_ri=alpha_ri, alpha_rs=alpha_rs, p_rd=alpha_rd, num_aug=num_aug)
65 | for aug_sentence in aug_sentences:
66 | writer.write(label + "\t" + aug_sentence + '\n')
67 |
68 | writer.close()
69 | print("generated augmented sentences with eda for " + train_orig + " to " + output_file + " with num_aug=" + str(num_aug))
70 |
71 | #main function
72 | if __name__ == "__main__":
73 |
74 | #generate augmented sentences and output into a new file
75 | gen_eda(args.input, output, alpha_sr=alpha_sr, alpha_ri=alpha_ri, alpha_rs=alpha_rs, alpha_rd=alpha_rd, num_aug=num_aug)
--------------------------------------------------------------------------------
/code/eda.py:
--------------------------------------------------------------------------------
1 | # Easy data augmentation techniques for text classification
2 | # Jason Wei and Kai Zou
3 |
4 | import random
5 | from random import shuffle
6 | random.seed(1)
7 |
8 | #stop words list
9 | stop_words = ['i', 'me', 'my', 'myself', 'we', 'our',
10 | 'ours', 'ourselves', 'you', 'your', 'yours',
11 | 'yourself', 'yourselves', 'he', 'him', 'his',
12 | 'himself', 'she', 'her', 'hers', 'herself',
13 | 'it', 'its', 'itself', 'they', 'them', 'their',
14 | 'theirs', 'themselves', 'what', 'which', 'who',
15 | 'whom', 'this', 'that', 'these', 'those', 'am',
16 | 'is', 'are', 'was', 'were', 'be', 'been', 'being',
17 | 'have', 'has', 'had', 'having', 'do', 'does', 'did',
18 | 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or',
19 | 'because', 'as', 'until', 'while', 'of', 'at',
20 | 'by', 'for', 'with', 'about', 'against', 'between',
21 | 'into', 'through', 'during', 'before', 'after',
22 | 'above', 'below', 'to', 'from', 'up', 'down', 'in',
23 | 'out', 'on', 'off', 'over', 'under', 'again',
24 | 'further', 'then', 'once', 'here', 'there', 'when',
25 | 'where', 'why', 'how', 'all', 'any', 'both', 'each',
26 | 'few', 'more', 'most', 'other', 'some', 'such', 'no',
27 | 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too',
28 | 'very', 's', 't', 'can', 'will', 'just', 'don',
29 | 'should', 'now', '']
30 |
31 | #cleaning up text
32 | import re
33 | def get_only_chars(line):
34 |
35 | clean_line = ""
36 |
37 | line = line.replace("’", "")
38 | line = line.replace("'", "")
39 | line = line.replace("-", " ") #replace hyphens with spaces
40 | line = line.replace("\t", " ")
41 | line = line.replace("\n", " ")
42 | line = line.lower()
43 |
44 | for char in line:
45 | if char in 'qwertyuiopasdfghjklzxcvbnm ':
46 | clean_line += char
47 | else:
48 | clean_line += ' '
49 |
50 | clean_line = re.sub(' +',' ',clean_line) #delete extra spaces
51 |     if clean_line and clean_line[0] == ' ': #guard against empty strings
52 | clean_line = clean_line[1:]
53 | return clean_line
54 |
55 | ########################################################################
56 | # Synonym replacement
57 | # Replace n words in the sentence with synonyms from wordnet
58 | ########################################################################
59 |
60 | #for the first time you use wordnet
61 | #import nltk
62 | #nltk.download('wordnet')
63 | from nltk.corpus import wordnet
64 |
65 | def synonym_replacement(words, n):
66 | new_words = words.copy()
67 | random_word_list = list(set([word for word in words if word not in stop_words]))
68 | random.shuffle(random_word_list)
69 | num_replaced = 0
70 | for random_word in random_word_list:
71 | synonyms = get_synonyms(random_word)
72 | if len(synonyms) >= 1:
73 | synonym = random.choice(list(synonyms))
74 | new_words = [synonym if word == random_word else word for word in new_words]
75 | #print("replaced", random_word, "with", synonym)
76 | num_replaced += 1
77 | if num_replaced >= n: #only replace up to n words
78 | break
79 |
80 | #this is stupid but we need it, trust me
81 | sentence = ' '.join(new_words)
82 | new_words = sentence.split(' ')
83 |
84 | return new_words
85 |
86 | def get_synonyms(word):
87 | synonyms = set()
88 | for syn in wordnet.synsets(word):
89 | for l in syn.lemmas():
90 | synonym = l.name().replace("_", " ").replace("-", " ").lower()
91 | synonym = "".join([char for char in synonym if char in ' qwertyuiopasdfghjklzxcvbnm'])
92 | synonyms.add(synonym)
93 | if word in synonyms:
94 | synonyms.remove(word)
95 | return list(synonyms)
96 |
97 | ########################################################################
98 | # Random deletion
99 | # Randomly delete words from the sentence with probability p
100 | ########################################################################
101 |
102 | def random_deletion(words, p):
103 |
104 | #obviously, if there's only one word, don't delete it
105 | if len(words) == 1:
106 | return words
107 |
108 | #randomly delete words with probability p
109 | new_words = []
110 | for word in words:
111 | r = random.uniform(0, 1)
112 | if r > p:
113 | new_words.append(word)
114 |
115 | #if you end up deleting all words, just return a random word
116 | if len(new_words) == 0:
117 | rand_int = random.randint(0, len(words)-1)
118 | return [words[rand_int]]
119 |
120 | return new_words
121 |
122 | ########################################################################
123 | # Random swap
124 | # Randomly swap two words in the sentence n times
125 | ########################################################################
126 |
127 | def random_swap(words, n):
128 | new_words = words.copy()
129 | for _ in range(n):
130 | new_words = swap_word(new_words)
131 | return new_words
132 |
133 | def swap_word(new_words):
134 | random_idx_1 = random.randint(0, len(new_words)-1)
135 | random_idx_2 = random_idx_1
136 | counter = 0
137 | while random_idx_2 == random_idx_1:
138 | random_idx_2 = random.randint(0, len(new_words)-1)
139 | counter += 1
140 | if counter > 3:
141 | return new_words
142 | new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1]
143 | return new_words
144 |
145 | ########################################################################
146 | # Random insertion
147 | # Randomly insert n words into the sentence
148 | ########################################################################
149 |
150 | def random_insertion(words, n):
151 | new_words = words.copy()
152 | for _ in range(n):
153 | add_word(new_words)
154 | return new_words
155 |
156 | def add_word(new_words):
157 | synonyms = []
158 | counter = 0
159 | while len(synonyms) < 1:
160 | random_word = new_words[random.randint(0, len(new_words)-1)]
161 | synonyms = get_synonyms(random_word)
162 | counter += 1
163 | if counter >= 10:
164 | return
165 | random_synonym = synonyms[0]
166 | random_idx = random.randint(0, len(new_words)-1)
167 | new_words.insert(random_idx, random_synonym)
168 |
169 | ########################################################################
170 | # main data augmentation function
171 | ########################################################################
172 |
173 | def eda(sentence, alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=9):
174 |
175 | sentence = get_only_chars(sentence)
176 | words = sentence.split(' ')
177 |     words = [word for word in words if word != '']
178 | num_words = len(words)
179 |
180 | augmented_sentences = []
181 | num_new_per_technique = int(num_aug/4)+1
182 |
183 | #sr
184 | if (alpha_sr > 0):
185 | n_sr = max(1, int(alpha_sr*num_words))
186 | for _ in range(num_new_per_technique):
187 | a_words = synonym_replacement(words, n_sr)
188 | augmented_sentences.append(' '.join(a_words))
189 |
190 | #ri
191 | if (alpha_ri > 0):
192 | n_ri = max(1, int(alpha_ri*num_words))
193 | for _ in range(num_new_per_technique):
194 | a_words = random_insertion(words, n_ri)
195 | augmented_sentences.append(' '.join(a_words))
196 |
197 | #rs
198 | if (alpha_rs > 0):
199 | n_rs = max(1, int(alpha_rs*num_words))
200 | for _ in range(num_new_per_technique):
201 | a_words = random_swap(words, n_rs)
202 | augmented_sentences.append(' '.join(a_words))
203 |
204 | #rd
205 | if (p_rd > 0):
206 | for _ in range(num_new_per_technique):
207 | a_words = random_deletion(words, p_rd)
208 | augmented_sentences.append(' '.join(a_words))
209 |
210 | augmented_sentences = [get_only_chars(sentence) for sentence in augmented_sentences]
211 | shuffle(augmented_sentences)
212 |
213 | #trim so that we have the desired number of augmented sentences
214 | if num_aug >= 1:
215 | augmented_sentences = augmented_sentences[:num_aug]
216 | else:
217 | keep_prob = num_aug / len(augmented_sentences)
218 | augmented_sentences = [s for s in augmented_sentences if random.uniform(0, 1) < keep_prob]
219 |
220 | #append the original sentence
221 | augmented_sentences.append(sentence)
222 |
223 | return augmented_sentences
--------------------------------------------------------------------------------
/data/sst2_train_500.txt:
--------------------------------------------------------------------------------
1 | 1 neil burger here succeeded in making the mystery of four decades back the springboard for a more immediate mystery in the present
2 | 0 it is a visual rorschach test and i must have failed
3 | 0 the only way to tolerate this insipid brutally clueless film might be with a large dose of painkillers
4 | 0 scores no points for originality wit or intelligence
5 | 0 it would take a complete moron to foul up a screen adaptation of oscar wilde is classic satire
6 | 1 pure cinematic intoxication a wildly inventive mixture of comedy and melodrama tastelessness and swooning elegance
7 | 0 it is not the first time that director sara sugarman stoops to having characters drop their pants for laughs and not the last time she fails to provoke them
8 | 1 just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing
9 | 1 matthew lillard is born to play shaggy
10 | 0 it is drab
11 | 1 the film has several strong performances
12 | 1 munch is screenplay is tenderly observant of his characters
13 | 1 isabelle huppert excels as the enigmatic mika and anna mouglalis is a stunning new young talent in one of chabrol is most intense psychological mysteries
14 | 1 a cruelly funny twist on teen comedy packed with inventive cinematic tricks and an ironically killer soundtrack
15 | 0 predictably soulless techno tripe
16 | 1 tsai has a well deserved reputation as one of the cinema world is great visual stylists and in this film every shot enhances the excellent performances
17 | 0 some like it hot on the hardwood proves once again that a man in drag is not in and of himself funny
18 | 0 for a movie about the power of poetry and passion there is precious little of either
19 | 1 diggs and lathan are among the chief reasons brown sugar is such a sweet and sexy film
20 | 1 uneven but a lot of fun
21 | 0 when the film ended i felt tired and drained and wanted to lie on my own deathbed for a while
22 | 0 contains the humor characterization poignancy and intelligence of a bad sitcom
23 | 0 a pretentious and ultimately empty examination of a sick and evil woman
24 | 1 i can easily imagine benigni is pinocchio becoming a christmas perennial
25 | 1 generates an enormous feeling of empathy for its characters
26 | 1 reggio is continual visual barrage is absorbing as well as thought provoking
27 | 0 spreads itself too thin leaving these actors as well as the members of the commune short of profound characterizations
28 | 0 once again director chris columbus takes a hat in hand approach to rowling that stifles creativity and allows the film to drag on for nearly three hours
29 | 0 has all the hallmarks of a movie designed strictly for children is home video a market so insatiable it absorbs all manner of lame entertainment as long as year olds find it diverting
30 | 1 a whale of a good time for both children and parents seeking christian themed fun
31 | 0 there are plot holes big enough for shamu the killer whale to swim through
32 | 1 director andrew niccol demonstrates a wry understanding of the quirks of fame
33 | 1 a must for fans of british cinema if only because so many titans of the industry are along for the ride
34 | 0 though clearly well intentioned this cross cultural soap opera is painfully formulaic and stilted
35 | 0 that is because relatively nothing happens
36 | 0 novak contemplates a heartland so overwhelmed by its lack of purpose that it seeks excitement in manufactured high drama
37 | 0 utter mush conceited pap
38 | 0 cassavetes thinks he is making dog day afternoon with a cause but all he is done is to reduce everything he touches to a shrill didactic cartoon
39 | 0 disjointed parody
40 | 1 in fact it is quite fun in places
41 | 0 it can not be enjoyed even on the level that one enjoys a bad slasher flick primarily because it is dull
42 | 0 it is always disappointing when a documentary fails to live up to or offer any new insight into its chosen topic
43 | 0 in the era of the sopranos it feels painfully redundant and inauthentic
44 | 1 a hypnotic cyber hymn and a cruel story of youth culture
45 | 0 much of the cast is stiff or just plain bad
46 | 1 i ve yet to find an actual vietnam war combat movie actually produced by either the north or south vietnamese but at least now we ve got something pretty damn close
47 | 0 a bizarre piece of work with premise and dialogue at the level of kids television and plot threads as morose as teen pregnancy rape and suspected murder
48 | 0 a documentary to make the stones weep as shameful as it is scary
49 | 0 there is got to be a more graceful way of portraying the devastation of this disease
50 | 1 and your reward will be a thoughtful emotional movie experience
51 | 1 brash intelligent and erotically perplexing haneke is portrait of an upper class austrian society and the suppression of its tucked away demons is uniquely felt with a sardonic jolt
52 | 0 it forces you to watch people doing unpleasant things to each other and themselves and it maintains a cool distance from its material that is deliberately unsettling
53 | 0 it is the element of condescension as the filmmakers look down on their working class subjects from their lofty perch that finally makes sex with strangers which opens today in the new york metropolitan area so distasteful
54 | 1 nothing short of a masterpiece and a challenging one
55 | 0 i just saw this movie well it is probably not accurate to call it a movie
56 | 0 a muted freak out
57 | 0 there is an underlying old world sexism to monday morning that undercuts its charm
58 | 0 taken as a whole the tuxedo does nt add up to a whole lot
59 | 1 though the film is static its writer director is heart is in the right place his plea for democracy and civic action laudable
60 | 0 with mcconaughey in an entirely irony free zone and bale reduced mainly to batting his sensitive eyelids there is not enough intelligence wit or innovation on the screen to attract and sustain an older crowd
61 | 1 skin of man gets a few cheap shocks from its kids in peril theatrics but it also taps into the primal fears of young people trying to cope with the mysterious and brutal nature of adults
62 | 1 what it lacks in substance it makes up for in heart
63 | 1 like all great films about a life you never knew existed it offers much to absorb and even more to think about after the final frame
64 | 0 an average b movie with no aspirations to be anything more
65 | 1 standard guns versus martial arts cliche with little new added
66 | 1 a beautifully tooled action thriller about love and terrorism in korea
67 | 0 starts out with tremendous promise introducing an intriguing and alluring premise only to fall prey to a boatload of screenwriting cliches that sink it faster than a leaky freighter
68 | 0 hard core slasher aficionados will find things to like but overall the halloween series has lost its edge
69 | 1 a college story that works even without vulgarity sex scenes and cussing
70 | 0 insufferably naive
71 | 1 it is a strength of a documentary to disregard available bias especially as temptingly easy as it would have been with this premise
72 | 1 those with a modicum of patience will find in these characters foibles a timeless and unique perspective
73 | 0 its and pieces of the hot chick are so hilarious and schneider is performance is so fine it is a real shame that so much of the movie again as in the animal is a slapdash mess
74 | 1 there are some movies that hit you from the first scene and you know it is going to be a trip
75 | 1 it proves quite compelling as an intense brooding character study
76 | 0 a loud ugly irritating movie without any of its satirical salvos hitting a discernible target
77 | 0 you can taste it but there is no fizz
78 | 1 the quality of the art combined with the humor and intelligence of the script allow the filmmakers to present the biblical message of forgiveness without it ever becoming preachy or syrupy
79 | 0 the x potion gives the quickly named blossom bubbles and buttercup supernatural powers that include extraordinary strength and laser beam eyes which unfortunately do nt enable them to discern flimsy screenplays
80 | 1 rifkin is references are impeccable throughout
81 | 0 too bad none of it is funny
82 | 1 a gleefully grungy hilariously wicked black comedy
83 | 0 you leave the same way you came a few tasty morsels under your belt but no new friends
84 | 1 though it runs minutes safe conduct is anything but languorous
85 | 1 it is both degrading and strangely liberating to see people working so hard at leading lives of sexy intrigue only to be revealed by the dispassionate gantz brothers as ordinary pasty lumpen
86 | 0 they crush each other under cars throw each other out windows electrocute and dismember their victims in full consciousness
87 | 0 in any case i would recommend big bad love only to winger fans who have missed her since is forget paris
88 | 1 as surreal as a dream and as detailed as a photograph as visually dexterous as it is at times imaginatively overwhelming
89 | 1 even if you do nt know the band or the album is songs by heart you will enjoy seeing how both evolve and you will also learn a good deal about the state of the music business in the st century
90 | 1 with an unflappable air of decadent urbanity everett remains a perfect wildean actor and a relaxed firth displays impeccable comic skill
91 | 1 unpretentious charming quirky original
92 | 0 a processed comedy chop suey
93 | 0 a sequel that is much too big for its britches
94 | 0 a complete waste of time
95 | 0 a well intentioned effort that is still too burdened by the actor is offbeat sensibilities for the earnest emotional core to emerge with any degree of accessibility
96 | 1 assayas ambitious sometimes beautiful adaptation of jacques chardonne is novel
97 | 0 despite the fact that this film was nt as bad as i thought it was going to be it is still not a good movie
98 | 0 guys say mean things and shoot a lot of bullets
99 | 0 a manipulative feminist empowerment tale thinly posing as a serious drama about spousal abuse
100 | 0 this movie is so bad that it is almost worth seeing because it is so bad
101 | 0 with a romantic comedy plotline straight from the ages this cinderella story does nt have a single surprise up its sleeve
102 | 1 and more than that it is an observant unfussily poetic meditation about identity and alienation
103 | 0 but it could have been worse
104 | 0 most viewers will wish there had been more of the queen and less of the damned
105 | 0 a science fiction pastiche so lacking in originality that if you stripped away its inspirations there would be precious little left
106 | 1 the pleasures that it does afford may be enough to keep many moviegoers occupied amidst some of the more serious minded concerns of other year end movies
107 | 0 as plain and pedestrian as catsup
108 | 0 every conceivable mistake a director could make in filming opera has been perpetrated here
109 | 1 more concerned with overall feelings broader ideas and open ended questions than concrete story and definitive answers soderbergh is solaris is a gorgeous and deceptively minimalist cinematic tone poem
110 | 0 no cute factor here not that i mind ugly the problem is he has no character loveable or otherwise
111 | 1 filmmaker stacy peralta has a flashy editing style that does nt always jell with sean penn is monotone narration but he respects the material without sentimentalizing it
112 | 1 you do nt need to be a hip hop fan to appreciate scratch and that is the mark of a documentary that works
113 | 0 i was trying to decide what annoyed me most about god is great i m not and then i realized that i just did nt care
114 | 1 metaphors abound but it is easy to take this film at face value and enjoy its slightly humorous and tender story
115 | 1 a comedy that swings and jostles to the rhythms of life
116 | 0 if you re looking to rekindle the magic of the first film you ll need a stronger stomach than us
117 | 1 a modestly made but profoundly moving documentary
118 | 0 pc stability notwithstanding the film suffers from a simplistic narrative and a pat fairy tale conclusion
119 | 0 has about th the fun of its spry predecessor but it is a rushed slapdash sequel for the sake of a sequel with less than half the plot and ingenuity
120 | 1 remarkable for its intelligence and intensity
121 | 1 i do nt know if frailty will turn bill paxton into an a list director but he can rest contentedly with the knowledge that he is made at least one damn fine horror movie
122 | 1 one of the year is most weirdly engaging and unpredictable character pieces
123 | 0 shot like a postcard and overacted with all the boozy self indulgence that brings out the worst in otherwise talented actors
124 | 0 barney is ideas about creation and identity do nt really seem all that profound at least by way of what can be gleaned from this three hour endurance test built around an hour is worth of actual material
125 | 1 delia greta and paula rank as three of the most multilayered and sympathetic female characters of the year
126 | 0 enough is not a bad movie just mediocre
127 | 0 but the cinematography is cloudy the picture making becalmed
128 | 0 a terrible adaptation of a play that only ever walked the delicate tightrope between farcical and loathsome
129 | 0 slap me i saw this movie
130 | 1 for those of an indulgent slightly sunbaked and summery mind sex and lucia may well prove diverting enough
131 | 0 reign of fire never comes close to recovering from its demented premise but it does sustain an enjoyable level of ridiculousness
132 | 0 the movie ends with outtakes in which most of the characters forget their lines and just utter uhhh which is better than most of the writing in the movie
133 | 1 try as you might to resist if you ve got a place in your heart for smokey robinson this movie will worm its way there
134 | 0 low rent from frame one
135 | 1 we know the plot is a little crazy but it held my interest from start to finish
136 | 0 dull a road trip movie that is surprisingly short of both adventure and song
137 | 0 the colorful masseur wastes its time on mood rather than riding with the inherent absurdity of ganesh is rise up the social ladder
138 | 1 it sends you away a believer again and quite cheered at just that
139 | 1 we ve seen it all before in one form or another but director hoffman with great help from kevin kline makes us care about this latest reincarnation of the world is greatest teacher
140 | 1 watching this gentle mesmerizing portrait of a man coming to terms with time you barely realize your mind is being blown
141 | 1 at its most basic this cartoon adventure is that wind in the hair exhilarating
142 | 1 the charms of the lead performances allow us to forget most of the film is problems
143 | 0 hampered no paralyzed by a self indulgent script that aims for poetry and ends up sounding like satire
144 | 1 a sloppy amusing comedy that proceeds from a stunningly unoriginal premise
145 | 1 the film aims to be funny uplifting and moving sometimes all at once
146 | 1 the story is inspiring ironic and revelatory of just how ridiculous and money oriented the record industry really is
147 | 0 like a fish that is lived too long austin powers in goldmember has some unnecessary parts and is kinda wrong in places
148 | 1 what distinguishes time of favor from countless other thrillers is its underlying concern with the consequences of words and with the complicated emotions fueling terrorist acts
149 | 1 an entertaining if ultimately minor thriller
150 | 1 just when you think that every possible angle has been exhausted by documentarians another new film emerges with yet another remarkable yet shockingly little known perspective
151 | 0 it could have been something special but two things drag it down to mediocrity director clare peploe is misunderstanding of marivaux is rhythms and mira sorvino is limitations as a classical actress
152 | 1 a summer entertainment adults can see without feeling embarrassed but it could have been more
153 | 0 fails in making this character understandable in getting under her skin in exploring motivation well before the end the film grows as dull as its characters about whose fate it is hard to care
154 | 0 turns a potentially interesting idea into an excruciating film school experience that plays better only for the film is publicists or for people who take as many drugs as the film is characters
155 | 1 it is a wonderful sobering heart felt drama
156 | 0 the movie does nt think much of its characters its protagonist or of us
157 | 0 it is too self important and plodding to be funny and too clipped and abbreviated to be an epic
158 | 1 this surreal gilliam esque film is also a troubling interpretation of ecclesiastes
159 | 0 the most offensive thing about the movie is that hollywood expects people to pay to see it
160 | 1 in the end the film is less the cheap thriller you d expect than it is a fairly revealing study of its two main characters damaged goods people whose orbits will inevitably and dangerously collide
161 | 0 the entire movie is about a boring sad man being boring and sad
162 | 0 just not campy enough
163 | 0 despite an impressive roster of stars and direction from kathryn bigelow the weight of water is oppressively heavy
164 | 0 when your subject is illusion versus reality should nt the reality seem at least passably real
165 | 1 khouri manages with terrific flair to keep the extremes of screwball farce and blood curdling family intensity on one continuum
166 | 1 it is a masterpiece
167 | 1 romantic comedy and dogme filmmaking may seem odd bedfellows but they turn out to be delightfully compatible here
168 | 1 about schmidt belongs to nicholson
169 | 1 macdowell gives give a solid anguished performance that eclipses nearly everything else she is ever done
170 | 0 wewannour money back actually
171 | 1 it is a clear eyed portrait of an intensely lived time filled with nervous energy moral ambiguity and great uncertainties
172 | 0 not exactly the bees knees
173 | 1 michael gerbosi is script is economically packed with telling scenes
174 | 0 director douglas mcgrath takes on nickleby with all the halfhearted zeal of an th grade boy delving into required reading
175 | 0 that the true story by which all the queen is men is allegedly inspired was a lot funnier and more deftly enacted than what is been cobbled together onscreen
176 | 0 the end result is like cold porridge with only the odd enjoyably chewy lump
177 | 1 it haunts you you ca nt forget it you admire its conception and are able to resolve some of the confusions you had while watching it
178 | 0 forget the psychology study of romantic obsession and just watch the procession of costumes in castles and this wo nt seem like such a bore
179 | 0 the kind of film that leaves you scratching your head in amazement over the fact that so many talented people could participate in such an ill advised and poorly executed idea
180 | 0 off the hook is overlong and not well acted but credit writer producer director adam watstein with finishing it at all
181 | 0 at the very least if you do nt know anything about derrida when you walk into the theater you wo nt know much more when you leave
182 | 1 sly sophisticated and surprising
183 | 1 a new film from bill plympton the animation master is always welcome
184 | 0 it does nt flinch from its unsettling prognosis namely that the legacy of war is a kind of perpetual pain
185 | 0 more tiring than anything
186 | 1 his work with actors is particularly impressive
187 | 0 sunk by way too much indulgence of scene chewing teeth gnashing actorliness
188 | 1 it is an unstinting look at a collaboration between damaged people that may or may not qual
189 | 0 collapses under its own meager weight
190 | 0 quitting however manages just to be depressing as the lead actor phones in his autobiographical performance
191 | 0 the drama was so uninspiring that even a story immersed in love lust and sin could nt keep my attention
192 | 0 due to some script weaknesses and the casting of the director is brother the film trails off into inconsequentiality
193 | 0 suspend your disbelief here and now or you ll be shaking your head all the way to the credits
194 | 0 i did nt smile
195 | 1 for all its problems the lady and the duke surprisingly manages never to grow boring which proves that rohmer still has a sense of his audience
196 | 0 it is a drag how nettelbeck sees working women or at least this working woman for whom she shows little understanding
197 | 0 the script is a disaster with cloying messages and irksome characters
198 | 0 in its best moments resembles a bad high school production of grease without benefit of song
199 | 1 it is a fine focused piece of work that reopens an interesting controversy and never succumbs to sensationalism
200 | 1 a triumph of pure craft and passionate heart
201 | 1 not everything in this ambitious comic escapade works but coppola along with his sister sofia is a real filmmaker
202 | 1 the emotions are raw and will strike a nerve with anyone who is ever had family trauma
203 | 1 it deserves to be seen by anyone with even a passing interest in the events shaping the world beyond their own horizons
204 | 0 two bit potboiler
205 | 0 the movie directed by mick jackson leaves no cliche unturned from the predictable plot to the characters straight out of central casting
206 | 1 fun and nimble
207 | 0 big mistake
208 | 1 the film boasts dry humor and jarring shocks plus moments of breathtaking mystery
209 | 1 you may feel compelled to watch the film twice or pick up a book on the subject
210 | 1 west coast rap wars this modern mob music drama never fails to fascinate
211 | 1 children christian or otherwise deserve to hear the full story of jonah is despair in all its agonizing catch glory even if they spend years trying to comprehend it
212 | 0 if they broke out into elaborate choreography singing and finger snapping it might have held my attention but as it stands i kept looking for the last exit from brooklyn
213 | 1 i could nt recommend this film more
214 | 1 translating complex characters from novels to the big screen is an impossible task but they are true to the essence of what it is to be ya ya
215 | 0 their parents would do well to cram earplugs in their ears and put pillowcases over their heads for minutes
216 | 1 rewarding
217 | 1 upsetting and thought provoking the film has an odd purity that does nt bring you into the characters so much as it has you study them
218 | 0 starts as a tart little lemon drop of a movie and ends up as a bitter pill
219 | 0 a little less extreme than in the past with longer exposition sequences between them and with fewer gags to break the tedium
220 | 1 a funny triumphant and moving documentary
221 | 1 an entertaining mix of period drama and flat out farce that should please history fans
222 | 0 during the tuxedo is minutes of screen time there is nt one true chan moment
223 | 1 there is just something about watching a squad of psychopathic underdogs whale the tar out of unsuspecting lawmen that reaches across time and distance
224 | 1 a series of tales told with the intricate preciseness of the best short story writing
225 | 1 a bright inventive thoroughly winning flight of revisionist fancy
226 | 0 almost peerlessly unsettling
227 | 1 a dashing and absorbing outing with one of france is most inventive directors
228 | 1 a true delight
229 | 0 complete lack of originality cleverness or even visible effort
230 | 1 a few nonbelievers may rethink their attitudes when they see the joy the characters take in this creed but skeptics are nt likely to enter the theater
231 | 1 like the rugrats movies the wild thornberrys movie does nt offer much more than the series but its emphasis on caring for animals and respecting other cultures is particularly welcome
232 | 0 borstal boy represents the worst kind of filmmaking the kind that pretends to be passionate and truthful but is really frustratingly timid and soggy
233 | 1 you feel good you feel sad you feel pissed off but in the end you feel alive which is what they did
234 | 0 director tom shadyac and star kevin costner glumly mishandle the story is promising premise of a physician who needs to heal himself
235 | 1 as relationships shift director robert j siegel allows the characters to inhabit their world without cleaving to a narrative arc
236 | 0 deadeningly dull mired in convoluted melodrama nonsensical jargon and stiff upper lip laboriousness
237 | 1 jacquot has filmed the opera exactly as the libretto directs ideally capturing the opera is drama and lyricism
238 | 1 it can be safely recommended as a video dvd babysitter
239 | 0 it is played in the most straight faced fashion with little humor to lighten things up
240 | 1 though it goes further than both anyone who has seen the hunger or cat people will find little new here but a tasty performance from vincent gallo lifts this tale of cannibal lust above the ordinary
241 | 1 the rich performances by friel and especially williams an american actress who becomes fully english round out the square edges
242 | 0 amazingly lame
243 | 1 more good than great but freeman and judd make it work
244 | 0 a battle between bug eye theatre and dead eye matinee
245 | 0 i m sorry to say that this should seal the deal arnold is not nor will he be back
246 | 1 though jackson does nt always succeed in integrating the characters in the foreground into the extraordinarily rich landscape it must be said that he is an imaginative filmmaker who can see the forest for the trees
247 | 0 van wilder has a built in audience but only among those who are drying out from spring break and are still unconcerned about what they ingest
248 | 1 what sets ms birot is film apart from others in the genre is a greater attention to the parents and particularly the fateful fathers in the emotional evolution of the two bewitched adolescents
249 | 0 a sentimental mess that never rings true
250 | 1 but the talented cast alone will keep you watching as will the fight scenes
251 | 1 allen is underestimated charm delivers more goodies than lumps of coal
252 | 1 an elegant work food of love is as consistently engaging as it is revealing
253 | 1 zoom
254 | 1 a huge box office hit in korea shiri is a must for genre fans
255 | 1 it is a technically superb film shining with all the usual spielberg flair expertly utilizing the talents of his top notch creative team
256 | 1 what begins as a conventional thriller evolves into a gorgeously atmospheric meditation on life changing chance encounters
257 | 1 a film with a great premise but only a great premise
258 | 1 on that score the film certainly does nt disappoint
259 | 1 the acting costumes music cinematography and sound are all astounding given the production is austere locales
260 | 1 vincent gallo is right at home in this french shocker playing his usual bad boy weirdo role
261 | 1 very well written and directed with brutal honesty and respect for its audience
262 | 0 one senses in world traveler and in his earlier film that freundlich bears a grievous but obscure complaint against fathers and circles it obsessively without making contact
263 | 1 neither parker nor donovan is a typical romantic lead but they bring a fresh quirky charm to the formula
264 | 1 a giggle a minute
265 | 0 in the end the film feels homogenized and a bit contrived as if we re looking back at a tattered and ugly past with rose tinted glasses
266 | 1 an unusually dry eyed even analytical approach to material that is generally played for maximum moisture
267 | 1 it made me want to get made up and go see this movie with my sisters
268 | 0 neither revelatory nor truly edgy merely crassly flamboyant and comedically labored
269 | 1 boasts a handful of virtuosic set pieces and offers a fair amount of trashy kinky fun
270 | 0 i do nt mind having my heartstrings pulled but do nt treat me like a fool
271 | 1 this is a sincerely crafted picture that deserves to emerge from the traffic jam of holiday movies
272 | 0 an unintentionally surreal kid is picture in which actors in bad bear suits enact a sort of inter species parody of a vh behind the music episode
273 | 1 gay or straight kissing jessica stein is one of the greatest date movies in years
274 | 0 it looks good but it is essentially empty
275 | 1 and there is no way you wo nt be talking about the film once you exit the theater
276 | 0 much like robin williams death to smoochy has already reached its expiration date
277 | 1 if you love the music and i do its hard to imagine having more fun watching a documentary
278 | 0 a collage of clich s and a dim echo of allusions to other films
279 | 1 norton is magnetic as graham
280 | 1 k the widowmaker is a great yarn
281 | 1 it might be easier to watch on video at home but that should nt stop die hard french film connoisseurs from going out and enjoying the big screen experience
282 | 0 manages to be both repulsively sadistic and mundane
283 | 0 an obvious copy of one of the best films ever made how could it not be
284 | 1 surprisingly the film is a hilarious adventure and i shamelessly enjoyed it
285 | 1 the cat is meow marks a return to form for director peter bogdanovich
286 | 0 it is an odd show pregnant with moods stillborn except as a harsh conceptual exercise
287 | 0 but if the essence of magic is its make believe promise of life that soars above the material realm this is the opposite of a truly magical movie
288 | 0 the film is all over the place really
289 | 0 without any redeeming value whatsoever
290 | 1 it is a familiar story but one that is presented with great sympathy and intelligence
291 | 0 manages to show life in all of its banality when the intention is quite the opposite
292 | 1 read my lips is to be viewed and treasured for its extraordinary intelligence and originality as well as its lyrical variations on the game of love
293 | 0 this director is cut which adds minutes takes a great film and turns it into a mundane soap opera
294 | 1 the ensemble cast turns in a collectively stellar performance and the writing is tight and truthful full of funny situations and honest observations
295 | 1 what saves this deeply affecting film from being merely a collection of wrenching cases is corcuera is attention to detail
296 | 1 take nothing seriously and enjoy the ride
297 | 1 from the opening strains of the average white band is pick up the pieces you can feel the love
298 | 0 while the ensemble player who gained notice in guy ritchie is lock stock and two smoking barrels and snatch has the bod he is unlikely to become a household name on the basis of his first starring vehicle
299 | 0 i could nt help but feel the wasted potential of this slapstick comedy
300 | 1 it is an offbeat treat that pokes fun at the democratic exercise while also examining its significance for those who take part
301 | 0 nothing too deep or substantial
302 | 0 this picture is mostly a lump of run of the mill profanity sprinkled with a few remarks so geared toward engendering audience sympathy that you might think he was running for office or trying to win over a probation officer
303 | 0 a boring parade of talking heads and technical gibberish that will do little to advance the linux cause
304 | 0 the problem with the mayhem in formula is not that it is offensive but that it is boring
305 | 0 as pedestrian as they come
306 | 0 parents beware this is downright movie penance
307 | 0 really does feel like a short stretched out to feature length
308 | 0 no one but a convict guilty of some truly heinous crime should have to sit through the master of disguise
309 | 1 may take its sweet time to get wherever it is going but if you have the patience for it you wo nt feel like it is wasted yours
310 | 0 would ve been nice if the screenwriters had trusted audiences to understand a complex story and left off the film is predictable denouement
311 | 1 i am not generally a huge fan of cartoons derived from tv shows but hey arnold
312 | 1 brings to a spectacular completion one of the most complex generous and subversive artworks of the last decade
313 | 1 reveals how important our special talents can be when put in service of of others
314 | 1 the gags that fly at such a furiously funny pace that the only rip off that we were aware of was the one we felt when the movie ended so damned soon
315 | 1 more mature than fatal attraction more complete than indecent proposal and more relevant than weeks unfaithful is at once intimate and universal cinema
316 | 1 it is fairly solid not to mention well edited so that it certainly does nt feel like a film that strays past the two and a half mark
317 | 1 while somewhat less than it might have been the film is a good one and you ve got to hand it to director george clooney for biting off such a big job the first time out
318 | 1 routine harmless diversion and little else
319 | 1 cremaster is at once a tough pill to swallow and a minor miracle of self expression
320 | 1 this is human comedy at its most amusing interesting and confirming
321 | 1 a story we have nt seen on the big screen before and it is a story that we as americans and human beings should know
322 | 0 just about everyone involved here seems to be coasting
323 | 1 a tour de force of modern cinema
324 | 1 uplifting funny and wise
325 | 0 it is just merely very bad
326 | 1 it will guarantee to have you leaving the theater with a smile on your face
327 | 0 simplistic silly and tedious
328 | 1 passions obsessions and loneliest dark spots are pushed to their most virtuous limits lending the narrative an unusually surreal tone
329 | 1 thanks to confident filmmaking and a pair of fascinating performances the way to that destination is a really special walk in the woods
330 | 0 it is provocative stuff but the speculative effort is hampered by taylor is cartoonish performance and the film is ill considered notion that hitler is destiny was shaped by the most random of chances
331 | 0 the animation and game phenomenon that peaked about three years ago is actually dying a slow death if the poor quality of pokemon ever is any indication
332 | 1 the script is smart not cloying
333 | 0 a muddy psychological thriller rife with miscalculations
334 | 1 the wild thornberrys movie is a jolly surprise
335 | 1 land people and narrative flow together in a stark portrait of motherhood deferred and desire explored
336 | 0 unless there are zoning ordinances to protect your community from the dullest science fiction impostor is opening today at a theater near you
337 | 1 world traveler might not go anywhere new or arrive anyplace special but it is certainly an honest attempt to get at something
338 | 1 at once subtle and visceral the film never succumbs to the trap of the maudlin or tearful offering instead with its unflinching gaze a measure of faith in the future
339 | 1 years of russian history and culture compressed into an evanescent seamless and sumptuous stream of consciousness
340 | 0 the film is maudlin focus on the young woman is infirmity and her naive dreams play like the worst kind of hollywood heart string plucking
341 | 0 director uwe boll and the actors provide scant reason to care in this crude s throwback
342 | 1 intensely romantic thought provoking and even an engaging mystery
343 | 0 the characters are paper thin and the plot is so cliched and contrived that it makes your least favorite james bond movie seem as cleverly plotted as the usual suspects
344 | 1 de niro is a veritable source of sincere passion that this hollywood contrivance orbits around
345 | 1 jonathan parker is bartleby should have been the be all end all of the modern office anomie films
346 | 1 it is a piece of handiwork that shows its indie tatters and self conscious seams in places but has some quietly moving moments and an intelligent subtlety
347 | 0 it is a barely tolerable slog over well trod ground
348 | 0 it is tough to tell which is in more abundant supply in this woefully hackneyed movie directed by scott kalvert about street gangs and turf wars in brooklyn stale cliches gratuitous violence or empty machismo
349 | 0 the script by vincent r nebrida tries to cram too many ingredients into one small pot
350 | 1 strangely comes off as a kingdom more mild than wild
351 | 0 thoroughly awful
352 | 1 a moving story of determination and the human spirit
353 | 1 a naturally funny film home movie makes you crave chris smith is next movie
354 | 0 the only question is to determine how well the schmaltz is manufactured to assess the quality of the manipulative engineering
355 | 0 the premise of abandon holds promise but its delivery is a complete mess
356 | 0 plays like one of those conversations that comic book guy on the simpsons has
357 | 0 in the book on tape market the film of the kid stays in the picture would be an abridged edition
358 | 1 and educational
359 | 1 blisteringly rude scarily funny sorrowfully sympathetic to the damage it surveys the film has in kieran culkin a pitch perfect holden
360 | 0 to get at the root psychology of this film would require many sessions on the couch of dr freud
361 | 0 the young stars are too cute the story and ensuing complications are too manipulative the message is too blatant the resolutions are too convenient
362 | 1 davis candid archly funny and deeply authentic take on intimate relationships comes to fruition in her sophomore effort
363 | 1 not as good as the full monty but a really strong second effort
364 | 0 even bigger and more ambitious than the first installment spy kids looks as if it were made by a highly gifted year old instead of a grown man
365 | 0 includes too much obvious padding
366 | 0 the story alone could force you to scratch a hole in your head
367 | 0 we never truly come to care about the main characters and whether or not they ll wind up together and michele is spiritual quest is neither amusing nor dramatic enough to sustain interest
368 | 1 this is nt exactly profound cinema but it is good natured and sometimes quite funny
369 | 0 impostor ca nt think of a thing to do with these characters except have them run through dark tunnels fight off various anonymous attackers and evade elaborate surveillance technologies
370 | 0 and that leaves a hole in the center of the salton sea
371 | 1 chamber of secrets will find millions of eager fans
372 | 0 seagal ran out of movies years ago and this is just the proof
373 | 0 a movie like the guys is why film criticism can be considered work
374 | 1 as it turns out you can go home again
375 | 1 her performance moves between heartbreak and rebellion as she continually tries to accommodate to fit in and gain the unconditional love she seeks
376 | 0 a low rate annie featuring some kid who ca nt act only echoes of jordan and weirdo actor crispin glover screwing things up old school
377 | 1 rock solid family fun out of the gates extremely imaginative through out but wanes in the middle
378 | 0 if you go pack your knitting needles
379 | 0 a technical triumph and an extraordinary bore
380 | 0 if you re not fans of the adventues of steve and terri you should avoid this like the dreaded king brown snake
381 | 0 the comedy death to smoochy is a rancorous curiosity a movie without an apparent audience
382 | 1 the entire cast is extraordinarily good
383 | 1 hugh grant who has a good line in charm has never been more charming than in about a boy
384 | 1 delivers the sexy razzle dazzle that everyone especially movie musical fans has been hoping for
385 | 1 a gripping movie played with performances that are all understated and touching
386 | 0 hoffman waits too long to turn his movie in an unexpected direction and even then his tone retains a genteel prep school quality that feels dusty and leatherbound
387 | 1 an ambitious what if
388 | 1 uses high comedy to evoke surprising poignance
389 | 0 contains a few big laughs but many more that graze the funny bone or miss it altogether in part because the consciously dumbed down approach wears thin
390 | 1 the journey to the secret is eventual discovery is a separate adventure and thrill enough
391 | 1 it is one heck of a character study not of hearst or davies but of the unique relationship between them
392 | 1 a live wire film that never loses its ability to shock and amaze
393 | 1 a real audience pleaser that will strike a chord with anyone who is ever waited in a doctor is office emergency room hospital bed or insurance company office
394 | 0 there is no good answer to that one
395 | 0 the film contains no good jokes no good scenes barely a moment when carvey is saturday night live honed mimicry rises above the level of embarrassment
396 | 0 as inept as big screen remakes of the avengers and the wild wild west
397 | 0 it is difficult to feel anything much while watching this movie beyond mild disturbance or detached pleasure at the acting
398 | 0 almost as offensive as freddy got fingered
399 | 1 this is a shrewd and effective film from a director who understands how to create and sustain a mood
400 | 1 the bai brothers have taken an small slice of history and opened it up for all of us to understand and they ve told a nice little story in the process
401 | 1 knows how to make our imagination wonder
402 | 1 fear permeates the whole of stortelling todd solondz oftentimes funny yet ultimately cowardly autocritique
403 | 1 a cutesy romantic tale with a twist
404 | 0 violent vulgar and forgettably entertaining
405 | 1 though its story is only surface deep the visuals and enveloping sounds of blue crush make this surprisingly decent flick worth a summertime look see
406 | 1 sad to say it accurately reflects the rage and alienation that fuels the self destructiveness of many young people
407 | 0 an allegory concerning the chronically mixed signals african american professionals get about overachieving could be intriguing but the supernatural trappings only obscure the message
408 | 1 the film has the high buffed gloss and high octane jolts you expect of de palma but what makes it transporting is that it is also one of the smartest most pleasurable expressions of pure movie love to come from an american director in years
409 | 1 wonderful fencing scenes and an exciting plot make this an eminently engrossing film
410 | 1 if mostly martha is mostly unsurprising it is still a sweet even delectable diversion
411 | 1 one of the most slyly exquisite anti adult movies ever made
412 | 1 even when there are lulls the emotions seem authentic and the picture is so lovely toward the end you almost do nt notice the minute running time
413 | 0 comes across as a fairly weak retooling
414 | 0 time out is as serious as a pink slip
415 | 0 a depressingly retrograde post feminist romantic comedy that takes an astonishingly condescending attitude toward women
416 | 0 you might want to take a reality check before you pay the full ticket price to see simone and consider a dvd rental instead
417 | 1 young hanks and fisk who vaguely resemble their celebrity parents bring fresh good looks and an ease in front of the camera to the work
418 | 0 if you re looking for a story do nt bother
419 | 1 the film is hard to dismiss moody thoughtful and lit by flashes of mordant humor
420 | 1 a deeply felt and vividly detailed story about newcomers in a strange new world
421 | 0 an ugly pointless stupid movie
422 | 0 to honestly address the flaws inherent in how medical aid is made available to american workers a more balanced or fair portrayal of both sides will be needed
423 | 1 the very definition of the small movie but it is a good stepping stone for director sprecher
424 | 1 a solid cast assured direction and complete lack of modern day irony
425 | 0 burns never really harnesses to full effect the energetic cast
426 | 1 the difference between cho and most comics is that her confidence in her material is merited
427 | 1 like its bizarre heroine it irrigates our souls
428 | 1 in between the icy stunts the actors spout hilarious dialogue about following your dream and just letting the mountain tell you what to do
429 | 0 in an effort i suspect not to offend by appearing either too serious or too lighthearted it offends by just being wishy washy
430 | 0 it all comes down to whether you can tolerate leon barlow
431 | 0 starts promisingly but disintegrates into a dreary humorless soap opera
432 | 1 there is enough cool fun here to warm the hearts of animation enthusiasts of all ages
433 | 1 the vitality of the actors keeps the intensity of the film high even as the strafings blend together
434 | 1 a true blue delight
435 | 0 despite auteuil is performance it is a rather listless amble down the middle of the road where the thematic ironies are too obvious and the sexual politics too smug
436 | 1 well acted well directed and for all its moodiness not too pretentious
437 | 0 adrift bentley and hudson stare and sniffle respectively as ledger attempts in vain to prove that movie star intensity can overcome bad hair design
438 | 0 it is so downbeat and nearly humorless that it becomes a chore to sit through despite some first rate performances by its lead
439 | 0 you leave feeling like you ve endured a long workout without your pulse ever racing
440 | 1 a poignant artfully crafted meditation on mortality
441 | 0 there is a scientific law to be discerned here that producers would be well to heed mediocre movies start to drag as soon as the action speeds up when the explosions start they fall to pieces
442 | 1 a dreadful day in irish history is given passionate if somewhat flawed treatment
443 | 1 the pleasure of read my lips is like seeing a series of perfect black pearls clicking together to form a string
444 | 1 overcomes its visual hideousness with a sharp script and strong performances
445 | 0 just too silly and sophomoric to ensnare its target audience
446 | 1 if it is not entirely memorable the movie is certainly easy to watch
447 | 1 if you can push on through the slow spots you ll be rewarded with some fine acting
448 | 0 it is too bad that this likable movie is nt more accomplished
449 | 1 tadpole may be one of the most appealing movies ever made about an otherwise appalling and downright creepy subject a teenage boy in love with his stepmother
450 | 0 it just goes to show an intelligent person is nt necessarily an admirable storyteller
451 | 1 if ayurveda can help us return to a sane regimen of eating sleeping and stress reducing contemplation it is clearly a good thing
452 | 1 the story is a rather simplistic one grief drives her love drives him and a second chance to find love in the most unlikely place it struck a chord in me
453 | 0 plays like a glossy melodrama that occasionally verges on camp
454 | 0 as aimless as an old pickup skidding completely out of control on a long patch of black ice the movie makes two hours feel like four
455 | 1 a searing epic treatment of a nationwide blight that seems to be horrifyingly ever on the rise
456 | 0 what soured me on the santa clause was that santa bumps up against st century reality so hard it is icky
457 | 1 as quiet patient and tenacious as mr lopez himself who approaches his difficult endless work with remarkable serenity and discipline
458 | 0 shallow noisy and pretentious
459 | 1 a light yet engrossing piece
460 | 0 my only wish is that celebi could take me back to a time before i saw this movie and i could just skip it
461 | 0 it is one pussy ass world when even killer thrillers revolve around group therapy sessions
462 | 1 infidelity drama is nicely shot well edited and features a standout performance by diane lane
463 | 0 rarely has sex on screen been so aggressively anti erotic
464 | 1 there is enough originality in life to distance it from the pack of paint by number romantic comedies that so often end up on cinema screens
465 | 0 a movie that quite simply should nt have been made
466 | 1 further proof that the epicenter of cool beautiful thought provoking foreign cinema is smack dab in the middle of dubya is axis of evil
467 | 1 writer director david jacobson and his star jeremy renner have made a remarkable film that explores the monster is psychology not in order to excuse him but rather to demonstrate that his pathology evolved from human impulses that grew hideously twisted
468 | 1 will amuse and provoke adventurous adults in specialty venues
469 | 0 just a kiss is a just a waste
470 | 1 a muddle splashed with bloody beauty as vivid as any scorsese has ever given us
471 | 0 a zippy minutes of mediocre special effects hoary dialogue fluxing accents and worst of all silly looking morlocks
472 | 1 as a girl meets girl romantic comedy kissing jessica steinis quirky charming and often hilarious
473 | 1 the overall fabric is hypnotic and mr mattei fosters moments of spontaneous intimacy
474 | 0 men in black ii achieves ultimate insignificance it is the sci fi comedy spectacle as whiffle ball epic
475 | 0 at best cletis tout might inspire a trip to the video store in search of a better movie experience
476 | 0 nothing but an episode of smackdown
477 | 1 the stunt work is top notch the dialogue and drama often food spittingly funny
478 | 1 family fare
479 | 1 using a stock plot about a boy injects just enough freshness into the proceedings to provide an enjoyable minutes in a movie theater
480 | 0 in other words it is badder than bad
481 | 0 the movie is almost completely lacking in suspense surprise and consistent emotional conviction
482 | 1 another love story in is remarkable procession of sweeping pictures that have reinvigorated the romance genre
483 | 0 there is only one way to kill michael myers for good stop buying tickets to these movies
484 | 1 washington overcomes the script is flaws and envelops the audience in his character is anguish anger and frustration
485 | 0 so we got ten little indians meets friday the th by way of clean and sober filmed on the set of carpenter is the thing and loaded with actors you re most likely to find on the next inevitable incarnation of the love boat
486 | 0 confirms the nagging suspicion that ethan hawke would be even worse behind the camera than he is in front of it
487 | 0 one of the more glaring signs of this movie is servitude to its superstar is the way it skirts around any scenes that might have required genuine acting from ms spears
488 | 0 for all its shoot outs fistfights and car chases this movie is a phlegmatic bore so tedious it makes the silly spy vs spy film the sum of all fears starring ben affleck seem downright hitchcockian
489 | 0 the only fun part of the movie is playing the obvious game
490 | 0 plays like the old disease of the week small screen melodramas
491 | 0 the cumulative effect of the movie is repulsive and depressing
492 | 1 while we no longer possess the lack of attention span that we did at seventeen we had no trouble sitting for blade ii
493 | 1 a surprisingly sweet and gentle comedy
494 | 1 an elegant film with often surprising twists and an intermingling of naivet and sophistication
495 | 0 for each chuckle there are at least complete misses many coming from the amazingly lifelike tara reid whose acting skills are comparable to a cardboard cutout
496 | 1 polished well structured film
497 | 1 a movie that will surely be profane politically charged music to the ears of cho is fans
498 | 1 most consumers of lo mein and general tso is chicken barely give a thought to the folks who prepare and deliver it so hopefully this film will attach a human face to all those little steaming cartons
499 | 0 movies like high crimes flog the dead horse of surprise as if it were an obligation
500 | 0 a timid soggy near miss
501 |
--------------------------------------------------------------------------------
/eda_figure.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/eda_figure.png
--------------------------------------------------------------------------------
/experiments/__pycache__/a_config.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/a_config.cpython-36.pyc
--------------------------------------------------------------------------------
/experiments/__pycache__/a_config.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/a_config.cpython-37.pyc
--------------------------------------------------------------------------------
/experiments/__pycache__/b_config.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/b_config.cpython-36.pyc
--------------------------------------------------------------------------------
/experiments/__pycache__/c_config.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/c_config.cpython-36.pyc
--------------------------------------------------------------------------------
/experiments/__pycache__/config.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/config.cpython-36.pyc
--------------------------------------------------------------------------------
/experiments/__pycache__/e_config.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/e_config.cpython-36.pyc
--------------------------------------------------------------------------------
/experiments/__pycache__/methods.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/methods.cpython-36.pyc
--------------------------------------------------------------------------------
/experiments/__pycache__/methods.cpython-37.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/methods.cpython-37.pyc
--------------------------------------------------------------------------------
/experiments/__pycache__/nlp_aug.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/experiments/__pycache__/nlp_aug.cpython-36.pyc
--------------------------------------------------------------------------------
/experiments/a_1_data_process.py:
--------------------------------------------------------------------------------
1 | from methods import *
2 | from a_config import *
3 |
4 | if __name__ == "__main__":
5 |
6 | #for each method
7 | for a_method in a_methods:
8 |
9 | #for each data size
10 | for size_folder in size_folders:
11 |
12 | n_aug_list = n_aug_list_dict[size_folder]
13 | dataset_folders = [size_folder + '/' + s for s in datasets]
14 |
15 | #for each dataset
16 | for i, dataset_folder in enumerate(dataset_folders):
17 |
18 | train_orig = dataset_folder + '/train_orig.txt'
19 | n_aug = n_aug_list[i]
20 |
21 | #for each alpha value
22 | for alpha in alphas:
23 |
24 | output_file = dataset_folder + '/train_' + a_method + '_' + str(alpha) + '.txt'
25 |
26 | #generate the augmented data
27 | if a_method == 'sr':
28 | gen_sr_aug(train_orig, output_file, alpha, n_aug)
29 | if a_method == 'ri':
30 | gen_ri_aug(train_orig, output_file, alpha, n_aug)
31 | if a_method == 'rd':
32 | gen_rd_aug(train_orig, output_file, alpha, n_aug)
33 | if a_method == 'rs':
34 | gen_rs_aug(train_orig, output_file, alpha, n_aug)
35 |
36 | #generate the vocab dictionary
37 | word2vec_pickle = dataset_folder + '/word2vec.p'
38 | gen_vocab_dicts(dataset_folder, word2vec_pickle, huge_word2vec)
39 |
40 |
--------------------------------------------------------------------------------
/experiments/a_2_train_eval.py:
--------------------------------------------------------------------------------
1 | from a_config import *
2 | from methods import *
3 | from numpy.random import seed
4 | seed(5)
5 |
6 | ###############################
7 | #### run model and get acc ####
8 | ###############################
9 |
10 | def run_cnn(train_file, test_file, num_classes, percent_dataset):
11 |
12 | #initialize model
13 | model = build_cnn(input_size, word2vec_len, num_classes)
14 |
15 | #load data
16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset)
17 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1)
18 |
19 | #implement early stopping
20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)]
21 |
22 | #train model
23 | model.fit( train_x,
24 | train_y,
25 | epochs=100000,
26 | callbacks=callbacks,
27 | validation_split=0.1,
28 | batch_size=1024,
29 | shuffle=True,
30 | verbose=0)
31 | #model.save('checkpoints/lol')
32 | #model = load_model('checkpoints/lol')
33 |
34 | #evaluate model
35 | y_pred = model.predict(test_x)
36 | test_y_cat = one_hot_to_categorical(test_y)
37 | y_pred_cat = one_hot_to_categorical(y_pred)
38 | acc = accuracy_score(test_y_cat, y_pred_cat)
39 |
40 | #drop references so the garbage collector can free memory
41 | train_x, train_y, test_x, test_y, model = None, None, None, None, None
42 | gc.collect()
43 |
44 | #return the accuracy
45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc)
46 | return acc
47 |
48 | ###############################
49 | ############ main #############
50 | ###############################
51 |
52 | if __name__ == "__main__":
53 |
54 | #for each method
55 | for a_method in a_methods:
56 |
57 | writer = open('outputs_f1/' + a_method + '_' + get_now_str() + '.txt', 'w')
58 |
59 | #for each dataset size
60 | for size_folder in size_folders:
61 |
62 | writer.write(size_folder + '\n')
63 |
64 | #get all five datasets
65 | dataset_folders = [size_folder + '/' + s for s in datasets]
66 |
67 | #for storing the performances
68 | performances = {alpha:[] for alpha in alphas}
69 |
70 | #for each dataset
71 | for i in range(len(dataset_folders)):
72 |
73 | #initialize all the variables
74 | dataset_folder = dataset_folders[i]
75 | dataset = datasets[i]
76 | num_classes = num_classes_list[i]
77 | input_size = input_size_list[i]
78 | word2vec_pickle = dataset_folder + '/word2vec.p'
79 | word2vec = load_pickle(word2vec_pickle)
80 |
81 | #test each alpha value
82 | for alpha in alphas:
83 |
84 | train_path = dataset_folder + '/train_' + a_method + '_' + str(alpha) + '.txt'
85 | test_path = 'size_data_f1/test/' + dataset + '/test.txt'
86 | acc = run_cnn(train_path, test_path, num_classes, percent_dataset=1)
87 | performances[alpha].append(acc)
88 |
89 | writer.write(str(performances) + '\n')
90 | for alpha in performances:
91 | line = str(alpha) + ' : ' + str(sum(performances[alpha])/len(performances[alpha]))
92 | writer.write(line + '\n')
93 | print(line)
94 | print(performances)
95 |
96 | writer.close()
97 |
--------------------------------------------------------------------------------
/experiments/a_config.py:
--------------------------------------------------------------------------------
1 | #user inputs
2 |
3 | #size folders
4 | sizes = ['1_tiny', '2_small', '3_standard', '4_full']
5 | size_folders = ['size_data_f1/' + size for size in sizes]
6 |
7 | #augmentation methods
8 | a_methods = ['sr', 'ri', 'rd', 'rs']
9 |
10 | #dataset folder
11 | datasets = ['cr', 'sst2', 'subj', 'trec', 'pc']
12 |
13 | #number of output classes
14 | num_classes_list = [2, 2, 2, 6, 2]
15 |
16 | #number of augmentations
17 | n_aug_list_dict = {'size_data_f1/1_tiny': [16, 16, 16, 16, 16],
18 | 'size_data_f1/2_small': [16, 16, 16, 16, 16],
19 | 'size_data_f1/3_standard': [8, 8, 8, 8, 4],
20 | 'size_data_f1/4_full': [8, 8, 8, 8, 4]}
21 |
22 | #alpha values we care about
23 | alphas = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
24 |
25 | #number of words for input
26 | input_size_list = [50, 50, 40, 25, 25]
27 |
28 | #word2vec dictionary
29 | huge_word2vec = 'word2vec/glove.840B.300d.txt'
30 | word2vec_len = 300 # don't want to load the huge pickle every time, so just save the words that are actually used into a smaller dictionary
31 |
--------------------------------------------------------------------------------
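A quick sanity check of what `a_1_data_process.py` produces from this config: for every dataset it writes one `train_<method>_<alpha>.txt` file per (method, alpha) pair. The sketch below is illustrative bookkeeping only, not code from the repo, and it assumes each augmented file keeps the original sentence plus `n_aug` augmented copies; the actual behavior is defined by `gen_sr_aug` and friends.

```python
# Illustrative bookkeeping only (not part of the repo). Assumes each augmented
# training file keeps the original sentence plus n_aug augmented copies of it.
a_methods = ['sr', 'ri', 'rd', 'rs']
alphas = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5]

def expected_outputs(n_orig_sentences, n_aug):
    files = [f'train_{m}_{a}.txt' for m in a_methods for a in alphas]
    lines_per_file = n_orig_sentences * (1 + n_aug)
    return len(files), lines_per_file

# e.g. 500 original sentences with n_aug = 16 -> (24 files, 8500 lines each)
print(expected_outputs(500, 16))
```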
/experiments/b_1_data_process.py:
--------------------------------------------------------------------------------
1 | from methods import *
2 | from b_config import *
3 |
4 | if __name__ == "__main__":
5 |
6 | #generate the augmented data sets
7 | for dataset_folder in dataset_folders:
8 |
9 | #pre-existing file locations
10 | train_orig = dataset_folder + '/train_orig.txt'
11 |
12 | #file to be created
13 | train_aug_st = dataset_folder + '/train_aug_st.txt'
14 |
15 | #standard augmentation
16 | gen_standard_aug(train_orig, train_aug_st)
17 |
18 | #generate the vocab dictionary
19 | word2vec_pickle = dataset_folder + '/word2vec.p' # don't want to load the huge pickle every time, so just save the words that are actually used into a smaller dictionary
20 | gen_vocab_dicts(dataset_folder, word2vec_pickle, huge_word2vec)
21 |
--------------------------------------------------------------------------------
/experiments/b_2_train_eval.py:
--------------------------------------------------------------------------------
1 | from b_config import *
2 | from methods import *
3 | from numpy.random import seed
4 | seed(0)
5 |
6 | ###############################
7 | #### run model and get acc ####
8 | ###############################
9 |
10 | def run_model(train_file, test_file, num_classes, percent_dataset):
11 |
12 | #initialize model
13 | model = build_model(input_size, word2vec_len, num_classes)
14 |
15 | #load data
16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset)
17 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1)
18 |
19 | #implement early stopping
20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)]
21 |
22 | #train model
23 | model.fit( train_x,
24 | train_y,
25 | epochs=100000,
26 | callbacks=callbacks,
27 | validation_split=0.1,
28 | batch_size=1024,
29 | shuffle=True,
30 | verbose=0)
31 | #model.save('checkpoints/lol')
32 | #model = load_model('checkpoints/lol')
33 |
34 | #evaluate model
35 | y_pred = model.predict(test_x)
36 | test_y_cat = one_hot_to_categorical(test_y)
37 | y_pred_cat = one_hot_to_categorical(y_pred)
38 | acc = accuracy_score(test_y_cat, y_pred_cat)
39 |
40 | #drop references so the garbage collector can free memory
41 | train_x, train_y = None, None
42 | gc.collect()
43 |
44 | #return the accuracy
45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc)
46 | return acc
47 |
48 | if __name__ == "__main__":
49 |
50 | #get the accuracy at each increment
51 | orig_accs = {dataset:{} for dataset in datasets}
52 | aug_accs = {dataset:{} for dataset in datasets}
53 |
54 | writer = open('outputs_f2/' + get_now_str() + '.csv', 'w')
55 |
56 | #for each dataset
57 | for i, dataset_folder in enumerate(dataset_folders):
58 |
59 | dataset = datasets[i]
60 | num_classes = num_classes_list[i]
61 | input_size = input_size_list[i]
62 | train_orig = dataset_folder + '/train_orig.txt'
63 | train_aug_st = dataset_folder + '/train_aug_st.txt'
64 | test_path = dataset_folder + '/test.txt'
65 | word2vec_pickle = dataset_folder + '/word2vec.p'
66 | word2vec = load_pickle(word2vec_pickle)
67 |
68 | for increment in increments:
69 |
70 | #calculate augmented accuracy
71 | aug_acc = run_model(train_aug_st, test_path, num_classes, increment)
72 | aug_accs[dataset][increment] = aug_acc
73 |
74 | #calculate original accuracy
75 | orig_acc = run_model(train_orig, test_path, num_classes, increment)
76 | orig_accs[dataset][increment] = orig_acc
77 |
78 | print(dataset, increment, orig_acc, aug_acc)
79 | writer.write(dataset + ',' + str(increment) + ',' + str(orig_acc) + ',' + str(aug_acc) + '\n')
80 |
81 | gc.collect()
82 |
83 | print(orig_accs, aug_accs)
84 |
--------------------------------------------------------------------------------
/experiments/b_config.py:
--------------------------------------------------------------------------------
1 | #user inputs
2 |
3 | #dataset folder
4 | datasets = ['pc']#['cr', 'sst2', 'subj', 'trec', 'pc']
5 | dataset_folders = ['increment_datasets_f2/' + dataset for dataset in datasets]
6 |
7 | #number of output classes
8 | num_classes_list = [2]#[2, 2, 2, 6, 2]
9 |
10 | #dataset increments
11 | increments = [0.7, 0.8, 0.9, 1]#[0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
12 |
13 | #number of words for input
14 | input_size_list = [25]#[50, 50, 40, 25, 25]
15 |
16 | #word2vec dictionary
17 | huge_word2vec = 'word2vec/glove.840B.300d.txt'
18 | word2vec_len = 300
--------------------------------------------------------------------------------
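For intuition on the `increments` sweep: `b_2_train_eval.py` passes each increment to `run_model` as `percent_dataset`, which is assumed here to mean training on only that fraction of the training file (the exact subsetting lives in `get_x_y`). The toy sketch below only mimics that assumed behavior; it is not the repo's implementation.

```python
# Toy illustration of the assumed percent_dataset behavior (not the repo's code).
increments = [0.7, 0.8, 0.9, 1]  # from b_config.py

def take_fraction(train_lines, fraction):
    n = max(1, int(len(train_lines) * fraction))
    return train_lines[:n]

train_lines = [f'sentence {i}' for i in range(1000)]
for inc in increments:
    print(inc, len(take_fraction(train_lines, inc)))  # 700, 800, 900, 1000 lines
```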
/experiments/c_1_data_process.py:
--------------------------------------------------------------------------------
1 | from methods import *
2 | from c_config import *
3 |
4 | if __name__ == "__main__":
5 |
6 | #generate the augmented data sets
7 |
8 | for size_folder in size_folders:
9 |
10 | dataset_folders = [size_folder + '/' + s for s in datasets]
11 |
12 | #for each dataset
13 | for dataset_folder in dataset_folders:
14 | train_orig = dataset_folder + '/train_orig.txt'
15 |
16 | #for each n_aug value
17 | for num_aug in num_aug_list:
18 |
19 | output_file = dataset_folder + '/train_' + str(num_aug) + '.txt'
20 |
21 | #generate the augmented data
22 | if num_aug > 4 and '4_full/pc' in train_orig:
23 | gen_standard_aug(train_orig, output_file, num_aug=4)
24 | else:
25 | gen_standard_aug(train_orig, output_file, num_aug=num_aug)
26 |
27 | #generate the vocab dictionary
28 | word2vec_pickle = dataset_folder + '/word2vec.p'
29 | gen_vocab_dicts(dataset_folder, word2vec_pickle, huge_word2vec)
30 |
31 |
--------------------------------------------------------------------------------
/experiments/c_2_train_eval.py:
--------------------------------------------------------------------------------
1 | from c_config import *
2 | from methods import *
3 | from numpy.random import seed
4 | seed(5)
5 |
6 | ###############################
7 | #### run model and get acc ####
8 | ###############################
9 |
10 | def run_cnn(train_file, test_file, num_classes, percent_dataset):
11 |
12 | #initialize model
13 | model = build_cnn(input_size, word2vec_len, num_classes)
14 |
15 | #load data
16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset)
17 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1)
18 |
19 | #implement early stopping
20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)]
21 |
22 | #train model
23 | model.fit( train_x,
24 | train_y,
25 | epochs=100000,
26 | callbacks=callbacks,
27 | validation_split=0.1,
28 | batch_size=1024,
29 | shuffle=True,
30 | verbose=0)
31 | #model.save('checkpoints/lol')
32 | #model = load_model('checkpoints/lol')
33 |
34 | #evaluate model
35 | y_pred = model.predict(test_x)
36 | test_y_cat = one_hot_to_categorical(test_y)
37 | y_pred_cat = one_hot_to_categorical(y_pred)
38 | acc = accuracy_score(test_y_cat, y_pred_cat)
39 |
40 | #drop references so the garbage collector can free memory
41 | train_x, train_y = None, None
42 | gc.collect()
43 |
44 | #return the accuracy
45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc)
46 | return acc
47 |
48 | ###############################
49 | ############ main #############
50 | ###############################
51 |
52 | if __name__ == "__main__":
53 |
54 | for see in range(5):
55 |
56 | seed(see)
57 | print('seed:', see)
58 |
59 | writer = open('outputs_f3/' + get_now_str() + '.txt', 'w')
60 |
62 | #for each dataset size
62 | for size_folder in size_folders:
63 |
64 | writer.write(size_folder + '\n')
65 |
66 | #get all five datasets
67 | dataset_folders = [size_folder + '/' + s for s in datasets]
68 |
69 | #for storing the performances
70 | performances = {num_aug:[] for num_aug in num_aug_list}
71 |
72 | #for each dataset
73 | for i in range(len(dataset_folders)):
74 |
75 | #initialize all the variables
76 | dataset_folder = dataset_folders[i]
77 | dataset = datasets[i]
78 | num_classes = num_classes_list[i]
79 | input_size = input_size_list[i]
80 | word2vec_pickle = dataset_folder + '/word2vec.p'
81 | word2vec = load_pickle(word2vec_pickle)
82 |
83 | #test each num_aug value
84 | for num_aug in num_aug_list:
85 |
86 | train_path = dataset_folder + '/train_' + str(num_aug) + '.txt'
87 | test_path = 'size_data_f3/test/' + dataset + '/test.txt'
88 | acc = run_cnn(train_path, test_path, num_classes, percent_dataset=1)
89 | performances[num_aug].append(acc)
90 | writer.write(train_path + ',' + str(acc) + '\n')
91 |
92 | writer.write(str(performances) + '\n')
93 | print()
94 | for num_aug in performances:
95 | line = str(num_aug) + ' : ' + str(sum(performances[num_aug])/len(performances[num_aug]))
96 | writer.write(line + '\n')
97 | print(line)
98 | print(performances)
99 |
100 | writer.close()
101 |
--------------------------------------------------------------------------------
/experiments/c_config.py:
--------------------------------------------------------------------------------
1 | #user inputs
2 |
3 | #size folders
4 | sizes = ['3_standard']#, '4_full']#['1_tiny', '2_small', '3_standard', '4_full']
5 | size_folders = ['size_data_f3/' + size for size in sizes]
6 |
7 | #dataset folder
8 | datasets = ['cr', 'sst2', 'subj', 'trec', 'pc']
9 |
10 | #number of output classes
11 | num_classes_list = [2, 2, 2, 6, 2]
12 |
13 | #num_aug values we care about
14 | num_aug_list = [0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32]
15 |
16 | #number of words for input
17 | input_size_list = [50, 50, 50, 25, 25]
18 |
19 | #word2vec dictionary
20 | huge_word2vec = 'word2vec/glove.840B.300d.txt'
21 | word2vec_len = 300 # don't want to load the huge pickle every time, so just save the words that are actually used into a smaller dictionary
22 |
--------------------------------------------------------------------------------
/experiments/d_0_preprocess.py:
--------------------------------------------------------------------------------
1 | from methods import *
2 |
3 | def generate_short(input_file, output_file, alpha):
4 | lines = open(input_file, 'r').readlines()
5 | increment = int(len(lines)/alpha)
6 | lines = lines[::increment]
7 | writer = open(output_file, 'w')
8 | for line in lines:
9 | writer.write(line)
10 |
11 | if __name__ == "__main__":
12 |
13 | #global params
14 | huge_word2vec = 'word2vec/glove.840B.300d.txt'
15 | datasets = ['pc']#, 'trec']
16 |
17 | for dataset in datasets:
18 |
19 | dataset_folder = 'special_f4/' + dataset
20 | test_short = 'special_f4/' + dataset + '/test_short.txt'
21 | test_aug_short = dataset_folder + '/test_short_aug.txt'
22 | word2vec_pickle = dataset_folder + '/word2vec.p'
23 |
24 | #augment the data
25 | gen_tsne_aug(test_short, test_aug_short)
26 |
27 | #generate the vocab dictionaries
28 | gen_vocab_dicts(dataset_folder, word2vec_pickle, huge_word2vec)
29 |
30 |
31 |
32 |
33 |
34 |
35 |
36 |
37 |
38 |
39 |
40 |
--------------------------------------------------------------------------------
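Note on `generate_short`: slicing with a step of `int(len(lines)/alpha)` keeps roughly `alpha` evenly spaced lines (sometimes one or two more, since the division is truncated). A toy check, using a list in place of a file:

```python
# Toy check of generate_short's slicing (illustrative only).
lines = [f'line {i}\n' for i in range(95)]
alpha = 10
increment = int(len(lines) / alpha)  # 9
kept = lines[::increment]            # lines 0, 9, 18, ..., 90
print(len(kept))                     # 11 -- roughly alpha lines
```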
/experiments/d_1_train_models.py:
--------------------------------------------------------------------------------
1 | from methods import *
2 | from numpy.random import seed
3 | seed(0)
4 |
5 | ###############################
6 | #### run model and get acc ####
7 | ###############################
8 |
9 | def run_model(train_file, test_file, num_classes, model_output_path):
10 |
11 | #initialize model
12 | model = build_model(input_size, word2vec_len, num_classes)
13 |
14 | #load data
15 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, 1)
16 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1)
17 |
18 | #implement early stopping
19 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)]
20 |
21 | #train model
22 | model.fit( train_x,
23 | train_y,
24 | epochs=100000,
25 | callbacks=callbacks,
26 | validation_split=0.1,
27 | batch_size=1024,
28 | shuffle=True,
29 | verbose=0)
30 |
31 | #save the model
32 | model.save(model_output_path)
33 | #model = load_model('checkpoints/lol')
34 |
35 | #evaluate model
36 | y_pred = model.predict(test_x)
37 | test_y_cat = one_hot_to_categorical(test_y)
38 | y_pred_cat = one_hot_to_categorical(y_pred)
39 | acc = accuracy_score(test_y_cat, y_pred_cat)
40 |
41 | #drop references so the garbage collector can free memory
42 | train_x, train_y = None, None
43 |
44 | #return the accuracy
45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc)
46 | return acc
47 |
48 | if __name__ == "__main__":
49 |
50 | #parameters
51 | dataset_folders = ['increment_datasets_f2/trec', 'increment_datasets_f2/pc']
52 | output_paths = ['outputs_f4/trec_aug.h5', 'outputs_f4/pc_aug.h5']
53 | num_classes_list = [6, 2]
54 | input_size_list = [25, 25]
55 |
56 | #word2vec dictionary
57 | word2vec_len = 300
58 |
59 | for i, dataset_folder in enumerate(dataset_folders):
60 |
61 | num_classes = num_classes_list[i]
62 | input_size = input_size_list[i]
63 | output_path = output_paths[i]
64 | train_orig = dataset_folder + '/train_aug_st.txt'
65 | test_path = dataset_folder + '/test.txt'
66 | word2vec_pickle = dataset_folder + '/word2vec.p'
67 | word2vec = load_pickle(word2vec_pickle)
68 |
69 | #train model and save
70 | acc = run_model(train_orig, test_path, num_classes, output_path)
71 | print(dataset_folder, acc)
--------------------------------------------------------------------------------
/experiments/d_2_tsne.py:
--------------------------------------------------------------------------------
1 | from methods import *
2 | from numpy.random import seed
3 | from keras import backend as K
4 | from sklearn.manifold import TSNE
5 | import matplotlib.pyplot as plt
6 | seed(0)
7 |
8 | ################################
9 | #### get dense layer output ####
10 | ################################
11 |
12 | #getting the x input in numpy array form from the text file
13 | def train_x(train_txt, word2vec_len, input_size, word2vec):
14 |
15 | #read in lines
16 | train_lines = open(train_txt, 'r').readlines()
17 | num_lines = len(train_lines)
18 |
19 | x_matrix = np.zeros((num_lines, input_size, word2vec_len))
20 |
21 | #insert values
22 | for i, line in enumerate(train_lines):
23 |
24 | parts = line[:-1].split('\t')
25 | label = int(parts[0])
26 | sentence = parts[1]
27 |
28 | #insert x
29 | words = sentence.split(' ')
30 | words = words[:x_matrix.shape[1]] #cut off if too long
31 | for j, word in enumerate(words):
32 | if word in word2vec:
33 | x_matrix[i, j, :] = word2vec[word]
34 |
35 | return x_matrix
36 |
37 | def get_dense_output(model_checkpoint, file, num_classes):
38 |
39 | x = train_x(file, word2vec_len, input_size, word2vec)
40 |
41 | model = load_model(model_checkpoint)
42 |
43 | get_dense_layer_output = K.function([model.layers[0].input], [model.layers[4].output])
44 | layer_output = get_dense_layer_output([x])[0]
45 |
46 | return layer_output
47 |
48 | def get_tsne_labels(file):
49 | labels = []
50 | alphas = []
51 | lines = open(file, 'r').readlines()
52 | for i, line in enumerate(lines):
53 | parts = line[:-1].split('\t')
54 | _class = int(parts[0])
55 | alpha = i % 10
56 | labels.append(_class)
57 | alphas.append(alpha)
58 | return labels, alphas
59 |
60 | def get_plot_vectors(layer_output):
61 |
62 | tsne = TSNE(n_components=2).fit_transform(layer_output)
63 | return tsne
64 |
65 | def plot_tsne(tsne, labels, output_path):
66 |
67 | label_to_legend_label = { 'outputs_f4/pc_tsne.png':{ 0:'Con (augmented)',
68 | 100:'Con (original)',
69 | 1: 'Pro (augmented)',
70 | 101:'Pro (original)'},
71 | 'outputs_f4/trec_tsne.png':{0:'Description (augmented)',
72 | 100:'Description (original)',
73 | 1:'Entity (augmented)',
74 | 101:'Entity (original)',
75 | 2:'Abbreviation (augmented)',
76 | 102:'Abbreviation (original)',
77 | 3:'Human (augmented)',
78 | 103:'Human (original)',
79 | 4:'Location (augmented)',
80 | 104:'Location (original)',
81 | 5:'Number (augmented)',
82 | 105:'Number (original)'}}
83 |
84 | plot_to_legend_size = {'outputs_f4/pc_tsne.png':11, 'outputs_f4/trec_tsne.png':6}
85 |
86 | labels = labels.tolist()
87 | big_groups = [label for label in labels if label < 100]
88 | big_groups = list(sorted(set(big_groups)))
89 |
90 | colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k', '#ff1493', '#FF4500']
91 | fig, ax = plt.subplots()
92 |
93 | for big_group in big_groups:
94 |
95 | for group in [big_group, big_group+100]:
96 |
97 | x, y = [], []
98 |
99 | for j, label in enumerate(labels):
100 | if label == group:
101 | x.append(tsne[j][0])
102 | y.append(tsne[j][1])
103 |
104 | #params
105 | color = colors[int(group % 100)]
106 | marker = 'x' if group < 100 else 'o'
107 | size = 1 if group < 100 else 27
108 | legend_label = label_to_legend_label[output_path][group]
109 |
110 | ax.scatter(x, y, color=color, marker=marker, s=size, label=legend_label)
111 | plt.axis('off')
112 |
113 | legend_size = plot_to_legend_size[output_path]
114 | plt.legend(prop={'size': legend_size})
115 | plt.savefig(output_path, dpi=1000)
116 | plt.clf()
117 |
118 | if __name__ == "__main__":
119 |
120 | #global variables
121 | word2vec_len = 300
122 | input_size = 25
123 |
124 | datasets = ['pc'] #['pc', 'trec']
125 | num_classes_list =[2] #[2, 6]
126 |
127 | for i, dataset in enumerate(datasets):
128 |
129 | #load parameters
130 | model_checkpoint = 'outputs_f4/' + dataset + '.h5'
131 | file = 'special_f4/' + dataset + '/test_short_aug.txt'
132 | num_classes = num_classes_list[i]
133 | word2vec_pickle = 'special_f4/' + dataset + '/word2vec.p'
134 | word2vec = load_pickle(word2vec_pickle)
135 |
136 | #do tsne
137 | layer_output = get_dense_output(model_checkpoint, file, num_classes)
138 | print(layer_output.shape)
139 | t = get_plot_vectors(layer_output)
140 |
141 | labels, alphas = get_tsne_labels(file)
142 |
143 | print(labels, alphas)
144 |
145 | writer = open("outputs_f4/new_tsne.txt", 'w')
146 |
147 | label_to_mark = {0:'x', 1:'o'}
148 |
149 | for i, label in enumerate(labels):
150 | alpha = alphas[i]
151 | line = str(t[i, 0]) + ' ' + str(t[i, 1]) + ' ' + str(label_to_mark[label]) + ' ' + str(alpha/10)
152 | writer.write(line + '\n')
153 |
154 |
155 |
--------------------------------------------------------------------------------
/experiments/d_neg_1_balance_trec.py:
--------------------------------------------------------------------------------
1 | lines = open('special_f4/trec/test_orig.txt', 'r').readlines()
2 |
3 | label_to_lines = {x:[] for x in range(0, 6)}
4 |
5 | for line in lines:
6 | label = int(line[0])
7 | label_to_lines[label].append(line)
8 |
9 | for label in label_to_lines:
10 | print(label, len(label_to_lines[label]))
--------------------------------------------------------------------------------
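As dumped, this script only prints the per-class line counts for TREC; it does not write a balanced file. If a balancing step were added, one simple (hypothetical, not part of the repo) approach would be to downsample every class to the size of the smallest one:

```python
# Hypothetical downsampling step (not part of the repo).
import random
random.seed(0)

def downsample_to_smallest_class(label_to_lines):
    n_min = min(len(v) for v in label_to_lines.values())
    balanced = []
    for class_lines in label_to_lines.values():
        balanced += random.sample(class_lines, n_min)
    random.shuffle(balanced)
    return balanced

# toy example with two imbalanced classes
example = {0: ['0\ta\n'] * 30, 1: ['1\tb\n'] * 12}
print(len(downsample_to_smallest_class(example)))  # 24
```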
/experiments/e_1_data_process.py:
--------------------------------------------------------------------------------
1 | from methods import *
2 | from e_config import *
3 |
4 | if __name__ == "__main__":
5 |
6 | for size_folder in size_folders:
7 |
8 | dataset_folders = [size_folder + '/' + s for s in datasets]
9 | n_aug_list = n_aug_list_dict[size_folder]
10 |
11 | #for each dataset
12 | for i, dataset_folder in enumerate(dataset_folders):
13 |
14 | n_aug = n_aug_list[i]
15 |
16 | #pre-existing file locations
17 | train_orig = dataset_folder + '/train_orig.txt'
18 |
19 | #file to be created
20 | train_aug_st = dataset_folder + '/train_aug_st.txt'
21 |
22 | #standard augmentation
23 | gen_standard_aug(train_orig, train_aug_st, n_aug)
24 |
25 | #generate the vocab dictionary
26 | word2vec_pickle = dataset_folder + '/word2vec.p'
27 | gen_vocab_dicts(dataset_folder, word2vec_pickle, huge_word2vec)
28 |
29 |
--------------------------------------------------------------------------------
/experiments/e_2_cnn_aug.py:
--------------------------------------------------------------------------------
1 | from methods import *
2 | from numpy.random import seed
3 | seed(0)
4 | from e_config import *
5 |
6 | ###############################
7 | #### run model and get acc ####
8 | ###############################
9 |
10 | def run_cnn(train_file, test_file, num_classes, input_size, percent_dataset, word2vec):
11 |
12 | #initialize model
13 | model = build_cnn(input_size, word2vec_len, num_classes)
14 |
15 | #load data
16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset)
17 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1)
18 |
19 | #implement early stopping
20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)]
21 |
22 | #train model
23 | model.fit( train_x,
24 | train_y,
25 | epochs=100000,
26 | callbacks=callbacks,
27 | validation_split=0.1,
28 | batch_size=1024,
29 | shuffle=True,
30 | verbose=0)
31 | #model.save('checkpoints/lol')
32 | #model = load_model('checkpoints/lol')
33 |
34 | #evaluate model
35 | y_pred = model.predict(test_x)
36 | test_y_cat = one_hot_to_categorical(test_y)
37 | y_pred_cat = one_hot_to_categorical(y_pred)
38 | acc = accuracy_score(test_y_cat, y_pred_cat)
39 |
40 | #drop references so the garbage collector can free memory
41 | train_x, train_y, model = None, None, None
42 | gc.collect()
43 |
44 | #return the accuracy
45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc)
46 | return acc
47 |
48 | ###############################
49 | ### get baseline accuracies ###
50 | ###############################
51 |
52 | def compute_baselines(writer):
53 |
54 | #baseline computation
55 | for size_folder in size_folders:
56 |
57 | #get all five datasets
58 | dataset_folders = [size_folder + '/' + s for s in datasets]
59 | performances = []
60 |
61 | #for each dataset
62 | for i in range(len(dataset_folders)):
63 |
64 | #initialize all the variables
65 | dataset_folder = dataset_folders[i]
66 | dataset = datasets[i]
67 | num_classes = num_classes_list[i]
68 | input_size = input_size_list[i]
69 | word2vec_pickle = dataset_folder + '/word2vec.p'
70 | word2vec = load_pickle(word2vec_pickle)
71 |
72 | train_path = dataset_folder + '/train_aug_st.txt'
73 | test_path = 'size_data_t1/test/' + dataset + '/test.txt'
74 | acc = run_cnn(train_path, test_path, num_classes, input_size, 1, word2vec)
75 | performances.append(str(acc))
76 |
77 | line = ','.join(performances)
78 | print(line)
79 | writer.write(line+'\n')
80 |
81 | ###############################
82 | ############ main #############
83 | ###############################
84 |
85 | if __name__ == "__main__":
86 |
87 | writer = open('baseline_cnn/' + get_now_str() + '.csv', 'w')
88 |
89 | for i in range(0, 10):
90 |
91 | seed(i)
92 | print(i)
93 | compute_baselines(writer)
94 |
--------------------------------------------------------------------------------
/experiments/e_2_cnn_baselines.py:
--------------------------------------------------------------------------------
1 | from methods import *
2 | from numpy.random import seed
3 | seed(0)
4 | from e_config import *
5 |
6 | ###############################
7 | #### run model and get acc ####
8 | ###############################
9 |
10 | def run_model(train_file, test_file, num_classes, input_size, percent_dataset, word2vec):
11 |
12 | #initialize model
13 | model = build_model(input_size, word2vec_len, num_classes)
14 |
15 | #load data
16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset)
17 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1)
18 |
19 | #implement early stopping
20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)]
21 |
22 | #train model
23 | model.fit( train_x,
24 | train_y,
25 | epochs=100000,
26 | callbacks=callbacks,
27 | validation_split=0.1,
28 | batch_size=1024,
29 | shuffle=True,
30 | verbose=0)
31 | #model.save('checkpoints/lol')
32 | #model = load_model('checkpoints/lol')
33 |
34 | #evaluate model
35 | y_pred = model.predict(test_x)
36 | test_y_cat = one_hot_to_categorical(test_y)
37 | y_pred_cat = one_hot_to_categorical(y_pred)
38 | acc = accuracy_score(test_y_cat, y_pred_cat)
39 |
40 | #drop references so the garbage collector can free memory
41 | train_x, train_y = None, None
42 | gc.collect()
43 |
44 | #return the accuracy
45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc)
46 | return acc
47 |
48 | ###############################
49 | ### get baseline accuracies ###
50 | ###############################
51 |
52 | def compute_baselines(writer):
53 |
54 | #baseline computation
55 | for size_folder in size_folders:
56 |
57 | #get all five datasets
58 | dataset_folders = [size_folder + '/' + s for s in datasets]
59 | performances = []
60 |
61 | #for each dataset
62 | for i in range(len(dataset_folders)):
63 |
64 | #initialize all the variables
65 | dataset_folder = dataset_folders[i]
66 | dataset = datasets[i]
67 | num_classes = num_classes_list[i]
68 | input_size = input_size_list[i]
69 | word2vec_pickle = dataset_folder + '/word2vec.p'
70 | word2vec = load_pickle(word2vec_pickle)
71 |
72 | train_path = dataset_folder + '/train_orig.txt'
73 | test_path = 'size_data_t1/test/' + dataset + '/test.txt'
74 | acc = run_model(train_path, test_path, num_classes, input_size, 1, word2vec)
75 | performances.append(str(acc))
76 |
77 | line = ','.join(performances)
78 | print(line)
79 | writer.write(line+'\n')
80 |
81 | ###############################
82 | ############ main #############
83 | ###############################
84 |
85 | if __name__ == "__main__":
86 |
87 | writer = open('baseline_rnn/' + get_now_str() + '.csv', 'w')
88 |
89 | for i in range(10, 24):
90 |
91 | seed(i)
92 | print(i)
93 | compute_baselines(writer)
94 |
--------------------------------------------------------------------------------
/experiments/e_2_rnn_aug.py:
--------------------------------------------------------------------------------
1 | from methods import *
2 | from numpy.random import seed
3 | seed(0)
4 | from e_config import *
5 |
6 | ###############################
7 | #### run model and get acc ####
8 | ###############################
9 |
10 | def run_model(train_file, test_file, num_classes, input_size, percent_dataset, word2vec):
11 |
12 | #initialize model
13 | model = build_model(input_size, word2vec_len, num_classes)
14 |
15 | #load data
16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset)
17 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1)
18 |
19 | #implement early stopping
20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)]
21 |
22 | #train model
23 | model.fit( train_x,
24 | train_y,
25 | epochs=100000,
26 | callbacks=callbacks,
27 | validation_split=0.1,
28 | batch_size=1024,
29 | shuffle=True,
30 | verbose=0)
31 | #model.save('checkpoints/lol')
32 | #model = load_model('checkpoints/lol')
33 |
34 | #evaluate model
35 | y_pred = model.predict(test_x)
36 | test_y_cat = one_hot_to_categorical(test_y)
37 | y_pred_cat = one_hot_to_categorical(y_pred)
38 | acc = accuracy_score(test_y_cat, y_pred_cat)
39 |
40 | #drop references so the garbage collector can free memory
41 | train_x, train_y, model = None, None, None
42 | gc.collect()
43 |
44 | #return the accuracy
45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc)
46 | return acc
47 |
48 | ###############################
49 | ### get baseline accuracies ###
50 | ###############################
51 |
52 | def compute_baselines(writer):
53 |
54 | #baseline computation
55 | for size_folder in size_folders:
56 |
57 | #get all five datasets
58 | dataset_folders = [size_folder + '/' + s for s in datasets]
59 | performances = []
60 |
61 | #for each dataset
62 | for i in range(len(dataset_folders)):
63 |
64 | #initialize all the variables
65 | dataset_folder = dataset_folders[i]
66 | dataset = datasets[i]
67 | num_classes = num_classes_list[i]
68 | input_size = input_size_list[i]
69 | word2vec_pickle = dataset_folder + '/word2vec.p'
70 | word2vec = load_pickle(word2vec_pickle)
71 |
72 | train_path = dataset_folder + '/train_aug_st.txt'
73 | test_path = 'size_data_t1/test/' + dataset + '/test.txt'
74 | acc = run_model(train_path, test_path, num_classes, input_size, 1, word2vec)
75 | performances.append(str(acc))
76 |
77 | line = ','.join(performances)
78 | print(line)
79 | writer.write(line+'\n')
80 |
81 | ###############################
82 | ############ main #############
83 | ###############################
84 |
85 | if __name__ == "__main__":
86 |
87 | writer = open('baseline_rnn/' + get_now_str() + '.csv', 'w')
88 |
89 | for i in range(0, 10):
90 |
91 | seed(i)
92 | print(i)
93 | compute_baselines(writer)
94 |
--------------------------------------------------------------------------------
/experiments/e_2_rnn_baselines.py:
--------------------------------------------------------------------------------
1 | from methods import *
2 | from numpy.random import seed
3 | seed(0)
4 | from e_config import *
5 |
6 | ###############################
7 | #### run model and get acc ####
8 | ###############################
9 |
10 | def run_model(train_file, test_file, num_classes, input_size, percent_dataset, word2vec):
11 |
12 | #initialize model
13 | model = build_model(input_size, word2vec_len, num_classes)
14 |
15 | #load data
16 | train_x, train_y = get_x_y(train_file, num_classes, word2vec_len, input_size, word2vec, percent_dataset)
17 | test_x, test_y = get_x_y(test_file, num_classes, word2vec_len, input_size, word2vec, 1)
18 |
19 | #implement early stopping
20 | callbacks = [EarlyStopping(monitor='val_loss', patience=3)]
21 |
22 | #train model
23 | model.fit( train_x,
24 | train_y,
25 | epochs=100000,
26 | callbacks=callbacks,
27 | validation_split=0.1,
28 | batch_size=1024,
29 | shuffle=True,
30 | verbose=0)
31 | #model.save('checkpoints/lol')
32 | #model = load_model('checkpoints/lol')
33 |
34 | #evaluate model
35 | y_pred = model.predict(test_x)
36 | test_y_cat = one_hot_to_categorical(test_y)
37 | y_pred_cat = one_hot_to_categorical(y_pred)
38 | acc = accuracy_score(test_y_cat, y_pred_cat)
39 |
40 | #drop references so the garbage collector can free memory
41 | train_x, train_y = None, None
42 | gc.collect()
43 |
44 | #return the accuracy
45 | #print("data with shape:", train_x.shape, train_y.shape, 'train=', train_file, 'test=', test_file, 'with fraction', percent_dataset, 'had acc', acc)
46 | return acc
47 |
48 | ###############################
49 | ### get baseline accuracies ###
50 | ###############################
51 |
52 | def compute_baselines(writer):
53 |
54 | #baseline computation
55 | for size_folder in size_folders:
56 |
57 | #get all five datasets
58 | dataset_folders = [size_folder + '/' + s for s in datasets]
59 | performances = []
60 |
61 | #for each dataset
62 | for i in range(len(dataset_folders)):
63 |
64 | #initialize all the variables
65 | dataset_folder = dataset_folders[i]
66 | dataset = datasets[i]
67 | num_classes = num_classes_list[i]
68 | input_size = input_size_list[i]
69 | word2vec_pickle = dataset_folder + '/word2vec.p'
70 | word2vec = load_pickle(word2vec_pickle)
71 |
72 | train_path = dataset_folder + '/train_orig.txt'
73 | test_path = 'size_data_t1/test/' + dataset + '/test.txt'
74 | acc = run_model(train_path, test_path, num_classes, input_size, 1, word2vec)
75 | performances.append(str(acc))
76 |
77 | line = ','.join(performances)
78 | print(line)
79 | writer.write(line+'\n')
80 |
81 | ###############################
82 | ############ main #############
83 | ###############################
84 |
85 | if __name__ == "__main__":
86 |
87 | writer = open('baseline_rnn/' + get_now_str() + '.csv', 'w')
88 |
89 | for i in range(10, 24):
90 |
91 | seed(i)
92 | print(i)
93 | compute_baselines(writer)
94 |
--------------------------------------------------------------------------------
/experiments/e_config.py:
--------------------------------------------------------------------------------
1 | #user inputs
2 |
3 | #size folders
4 | sizes = ['4_full']#['1_tiny', '2_small', '3_standard', '4_full']
5 | size_folders = ['size_data_t1/' + size for size in sizes]
6 |
7 | #datasets
8 | datasets = ['cr', 'sst2', 'subj', 'trec', 'pc']
9 |
10 | #number of output classes
11 | num_classes_list = [2, 2, 2, 6, 2]
12 |
13 | #number of augmentations per original sentence
14 | n_aug_list_dict = {'size_data_t1/1_tiny': [32, 32, 32, 32, 32],
15 | 'size_data_t1/2_small': [32, 32, 32, 32, 32],
16 | 'size_data_t1/3_standard': [16, 16, 16, 16, 4],
17 | 'size_data_t1/4_full': [16, 16, 16, 16, 4]}
18 |
19 | #number of words for input
20 | input_size_list = [50, 50, 40, 25, 25]
21 |
22 | #word2vec dictionary
23 | huge_word2vec = 'word2vec/glove.840B.300d.txt'
24 | word2vec_len = 300
--------------------------------------------------------------------------------
/experiments/methods.py:
--------------------------------------------------------------------------------
1 | from keras.layers.core import Dense, Activation, Dropout
2 | from keras.layers.recurrent import LSTM
3 | from keras.layers import Bidirectional
4 | import keras.layers as layers
5 | from keras.models import Sequential
6 | from keras.models import load_model
7 | from keras.callbacks import EarlyStopping
8 |
9 | from sklearn.utils import shuffle
10 | from sklearn.metrics import accuracy_score
11 |
12 | import math
13 | import time
14 | import numpy as np
15 | import random
16 | from random import randint
17 | random.seed(3)
18 | import datetime, re, operator
19 | from random import shuffle
20 | from time import gmtime, strftime
21 | import gc
22 |
23 | import os
24 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' #get rid of warnings
25 | from os import listdir
26 | from os.path import isfile, join, isdir
27 | import pickle
28 |
29 | #import data augmentation methods
30 | from nlp_aug import *
31 |
32 | ###################################################
33 | ######### loading folders and txt files ###########
34 | ###################################################
35 |
36 | #loading a pickle file
37 | def load_pickle(file):
38 | return pickle.load(open(file, 'rb'))
39 |
40 | #create an output folder if it does not already exist
41 | def confirm_output_folder(output_folder):
42 | if not os.path.exists(output_folder):
43 | os.makedirs(output_folder)
44 |
45 | #get full paths of the txt files in a folder
46 | def get_txt_paths(folder):
47 | txt_paths = [join(folder, f) for f in listdir(folder) if isfile(join(folder, f)) and '.txt' in f]
48 | if join(folder, '.DS_Store') in txt_paths:
49 | txt_paths.remove(join(folder, '.DS_Store'))
50 | txt_paths = sorted(txt_paths)
51 | return txt_paths
52 |
53 | #get subfolders
54 | def get_subfolder_paths(folder):
55 | subfolder_paths = [join(folder, f) for f in listdir(folder) if (isdir(join(folder, f)) and '.DS_Store' not in f)]
56 | if join(folder, '.DS_Store') in subfolder_paths:
57 | subfolder_paths.remove(join(folder, '.DS_Store'))
58 | subfolder_paths = sorted(subfolder_paths)
59 | return subfolder_paths
60 |
61 | #get all txt paths under a master folder
62 | def get_all_txt_paths(master_folder):
63 |
64 | all_paths = []
65 | subfolders = get_subfolder_paths(master_folder)
66 | if len(subfolders) > 1:
67 | for subfolder in subfolders:
68 | all_paths += get_txt_paths(subfolder)
69 | else:
70 | all_paths = get_txt_paths(master_folder)
71 | return all_paths
72 |
73 | ###################################################
74 | ################ data processing ##################
75 | ###################################################
76 |
77 | #build a word2vec pickle restricted to the dataset vocabulary, so the huge embedding file only needs to be read once
78 | def gen_vocab_dicts(folder, output_pickle_path, huge_word2vec):
79 |
80 | vocab = set()
81 | text_embeddings = open(huge_word2vec, 'r').readlines()
82 | word2vec = {}
83 |
84 | #get all the vocab
85 | all_txt_paths = get_all_txt_paths(folder)
86 | print(all_txt_paths)
87 |
88 | #loop through each text file
89 | for txt_path in all_txt_paths:
90 |
91 | # get all the words
92 | try:
93 | all_lines = open(txt_path, "r").readlines()
94 | for line in all_lines:
95 | words = line[:-1].split(' ')
96 | for word in words:
97 | vocab.add(word)
98 | except:
99 | print(txt_path, "has an error")
100 |
101 | print(len(vocab), "unique words found")
102 |
103 | # load the word embeddings, and only add the word to the dictionary if we need it
104 | for line in text_embeddings:
105 | items = line.split(' ')
106 | word = items[0]
107 | if word in vocab:
108 | vec = items[1:]
109 | word2vec[word] = np.asarray(vec, dtype = 'float32')
110 | print(len(word2vec), "matches between unique words and word2vec dictionary")
111 |
112 | pickle.dump(word2vec, open(output_pickle_path, 'wb'))
113 | print("dictionaries outputted to", output_pickle_path)
114 |
115 | #getting the x and y inputs in numpy array form from the text file
116 | def get_x_y(train_txt, num_classes, word2vec_len, input_size, word2vec, percent_dataset):
117 |
118 | #read in lines
119 | train_lines = open(train_txt, 'r').readlines()
120 | shuffle(train_lines)
121 | train_lines = train_lines[:int(percent_dataset*len(train_lines))]
122 | num_lines = len(train_lines)
123 |
124 | #initialize x and y matrix
125 | x_matrix = None
126 | y_matrix = None
127 |
128 | try:
129 | x_matrix = np.zeros((num_lines, input_size, word2vec_len))
130 | except:
131 | print("Error!", num_lines, input_size, word2vec_len)
132 | y_matrix = np.zeros((num_lines, num_classes))
133 |
134 | #insert values
135 | for i, line in enumerate(train_lines):
136 |
137 | parts = line[:-1].split('\t')
138 | label = int(parts[0])
139 | sentence = parts[1]
140 |
141 | #insert x
142 | words = sentence.split(' ')
143 | words = words[:x_matrix.shape[1]] #cut off if too long
144 | for j, word in enumerate(words):
145 | if word in word2vec:
146 | x_matrix[i, j, :] = word2vec[word]
147 |
148 | #insert y
149 | y_matrix[i][label] = 1.0
150 |
151 | return x_matrix, y_matrix
152 |
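#illustrative usage sketch (assumptions, not a call from the original code): get_x_y
#expects each line of the training file to be "<label>\t<sentence>", plus a word2vec
#pickle such as the one produced by gen_vocab_dicts above. A hypothetical call:
#
#   word2vec = load_pickle('word2vec.p')
#   train_x, train_y = get_x_y('train_orig.txt', 2, 300, 50, word2vec, 1.0)
#   #train_x has shape (num_lines, 50, 300); train_y has shape (num_lines, 2)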
153 | ###################################################
154 | ############### data augmentation #################
155 | ###################################################
156 |
157 | def gen_tsne_aug(train_orig, output_file):
158 |
159 | writer = open(output_file, 'w')
160 | lines = open(train_orig, 'r').readlines()
161 | for i, line in enumerate(lines):
162 | parts = line[:-1].split('\t')
163 | label = parts[0]
164 | sentence = parts[1]
165 | writer.write(line)
166 | for alpha in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
167 | aug_sentence = eda_4(sentence, alpha_sr=alpha, alpha_ri=alpha, alpha_rs=alpha, p_rd=alpha, num_aug=2)[0]
168 | writer.write(label + "\t" + aug_sentence + '\n')
169 | writer.close()
170 | print("finished eda for tsne for", train_orig, "to", output_file)
171 |
172 |
173 |
174 |
175 | #generate more data with standard augmentation
176 | def gen_standard_aug(train_orig, output_file, num_aug=9):
177 | writer = open(output_file, 'w')
178 | lines = open(train_orig, 'r').readlines()
179 | for i, line in enumerate(lines):
180 | parts = line[:-1].split('\t')
181 | label = parts[0]
182 | sentence = parts[1]
183 | aug_sentences = eda_4(sentence, num_aug=num_aug)
184 | for aug_sentence in aug_sentences:
185 | writer.write(label + "\t" + aug_sentence + '\n')
186 | writer.close()
187 | print("finished eda for", train_orig, "to", output_file)
188 |
189 | #generate more data with only synonym replacement (SR)
190 | def gen_sr_aug(train_orig, output_file, alpha_sr, n_aug):
191 | writer = open(output_file, 'w')
192 | lines = open(train_orig, 'r').readlines()
193 | for i, line in enumerate(lines):
194 | parts = line[:-1].split('\t')
195 | label = parts[0]
196 | sentence = parts[1]
197 | aug_sentences = SR(sentence, alpha_sr=alpha_sr, n_aug=n_aug)
198 | for aug_sentence in aug_sentences:
199 | writer.write(label + "\t" + aug_sentence + '\n')
200 | writer.close()
201 | print("finished SR for", train_orig, "to", output_file, "with alpha", alpha_sr)
202 |
203 | #generate more data with only random insertion (RI)
204 | def gen_ri_aug(train_orig, output_file, alpha_ri, n_aug):
205 | writer = open(output_file, 'w')
206 | lines = open(train_orig, 'r').readlines()
207 | for i, line in enumerate(lines):
208 | parts = line[:-1].split('\t')
209 | label = parts[0]
210 | sentence = parts[1]
211 | aug_sentences = RI(sentence, alpha_ri=alpha_ri, n_aug=n_aug)
212 | for aug_sentence in aug_sentences:
213 | writer.write(label + "\t" + aug_sentence + '\n')
214 | writer.close()
215 | print("finished RI for", train_orig, "to", output_file, "with alpha", alpha_ri)
216 |
217 | #generate more data with only random swap (RS)
218 | def gen_rs_aug(train_orig, output_file, alpha_rs, n_aug):
219 | writer = open(output_file, 'w')
220 | lines = open(train_orig, 'r').readlines()
221 | for i, line in enumerate(lines):
222 | parts = line[:-1].split('\t')
223 | label = parts[0]
224 | sentence = parts[1]
225 | aug_sentences = RS(sentence, alpha_rs=alpha_rs, n_aug=n_aug)
226 | for aug_sentence in aug_sentences:
227 | writer.write(label + "\t" + aug_sentence + '\n')
228 | writer.close()
229 | print("finished RS for", train_orig, "to", output_file, "with alpha", alpha_rs)
230 |
231 | #generate more data with only random deletion (RD)
232 | def gen_rd_aug(train_orig, output_file, alpha_rd, n_aug):
233 | writer = open(output_file, 'w')
234 | lines = open(train_orig, 'r').readlines()
235 | for i, line in enumerate(lines):
236 | parts = line[:-1].split('\t')
237 | label = parts[0]
238 | sentence = parts[1]
239 | aug_sentences = RD(sentence, alpha_rd=alpha_rd, n_aug=n_aug)
240 | for aug_sentence in aug_sentences:
241 | writer.write(label + "\t" + aug_sentence + '\n')
242 | writer.close()
243 | print("finished RD for", train_orig, "to", output_file, "with alpha", alpha_rd)
244 |
245 | ###################################################
246 | ##################### model #######################
247 | ###################################################
248 |
249 | #building the model in keras
250 | def build_model(sentence_length, word2vec_len, num_classes):
251 | model = None
252 | model = Sequential()
253 | model.add(Bidirectional(LSTM(64, return_sequences=True), input_shape=(sentence_length, word2vec_len)))
254 | model.add(Dropout(0.5))
255 | model.add(Bidirectional(LSTM(32, return_sequences=False)))
256 | model.add(Dropout(0.5))
257 | model.add(Dense(20, activation='relu'))
258 | model.add(Dense(num_classes, kernel_initializer='normal', activation='softmax'))
259 | model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
260 | #print(model.summary())
261 | return model
262 |
263 | #building the cnn in keras
264 | def build_cnn(sentence_length, word2vec_len, num_classes):
265 | model = None
266 | model = Sequential()
267 | model.add(layers.Conv1D(128, 5, activation='relu', input_shape=(sentence_length, word2vec_len)))
268 | model.add(layers.GlobalMaxPooling1D())
269 | model.add(Dense(20, activation='relu'))
270 | model.add(Dense(num_classes, kernel_initializer='normal', activation='softmax'))
271 | model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
272 | return model
273 |
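#illustrative usage sketch (values taken from e_config, not a call in the original code):
#both builders take (sentence_length, word2vec_len, num_classes) and expect inputs of
#shape (sentence_length, word2vec_len) with one-hot labels of width num_classes, e.g.
#
#   rnn = build_model(50, 300, 2)   #bidirectional-LSTM classifier for a 2-class dataset
#   cnn = build_cnn(50, 300, 2)     #1D-CNN classifier with the same input shape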
274 | #one hot to categorical
275 | def one_hot_to_categorical(y):
276 | assert len(y.shape) == 2
277 | return np.argmax(y, axis=1)
278 |
279 | def get_now_str():
280 | return str(strftime("%Y-%m-%d_%H:%M:%S", gmtime()))
281 |
282 |
--------------------------------------------------------------------------------
/experiments/nlp_aug.py:
--------------------------------------------------------------------------------
1 | # Easy data augmentation techniques for text classification
2 | # Jason Wei, Chengyu Huang, Yifang Wei, Fei Xing, Kai Zou
3 |
4 | import random
5 | from random import shuffle
6 | random.seed(1)
7 |
8 | #stop words list
9 | stop_words = ['i', 'me', 'my', 'myself', 'we', 'our',
10 | 'ours', 'ourselves', 'you', 'your', 'yours',
11 | 'yourself', 'yourselves', 'he', 'him', 'his',
12 | 'himself', 'she', 'her', 'hers', 'herself',
13 | 'it', 'its', 'itself', 'they', 'them', 'their',
14 | 'theirs', 'themselves', 'what', 'which', 'who',
15 | 'whom', 'this', 'that', 'these', 'those', 'am',
16 | 'is', 'are', 'was', 'were', 'be', 'been', 'being',
17 | 'have', 'has', 'had', 'having', 'do', 'does', 'did',
18 | 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or',
19 | 'because', 'as', 'until', 'while', 'of', 'at',
20 | 'by', 'for', 'with', 'about', 'against', 'between',
21 | 'into', 'through', 'during', 'before', 'after',
22 | 'above', 'below', 'to', 'from', 'up', 'down', 'in',
23 | 'out', 'on', 'off', 'over', 'under', 'again',
24 | 'further', 'then', 'once', 'here', 'there', 'when',
25 | 'where', 'why', 'how', 'all', 'any', 'both', 'each',
26 | 'few', 'more', 'most', 'other', 'some', 'such', 'no',
27 | 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too',
28 | 'very', 's', 't', 'can', 'will', 'just', 'don',
29 | 'should', 'now', '']
30 |
31 | #cleaning up text
32 | import re
33 | def get_only_chars(line):
34 |
35 | clean_line = ""
36 |
37 | line = line.replace("’", "")
38 | line = line.replace("'", "")
39 | line = line.replace("-", " ") #replace hyphens with spaces
40 | line = line.replace("\t", " ")
41 | line = line.replace("\n", " ")
42 | line = line.lower()
43 |
44 | for char in line:
45 | if char in 'qwertyuiopasdfghjklzxcvbnm ':
46 | clean_line += char
47 | else:
48 | clean_line += ' '
49 |
50 | clean_line = re.sub(' +',' ',clean_line) #delete extra spaces
51 | if clean_line and clean_line[0] == ' ':
52 | clean_line = clean_line[1:]
53 | return clean_line
54 |
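#example (illustrative): get_only_chars("Hello, World!") returns "hello world "
#(punctuation becomes spaces, repeated spaces are collapsed, and text is lowercased;
#the possible trailing space is why eda_4 later filters out empty tokens)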
55 | ########################################################################
56 | # Synonym replacement
57 | # Replace n words in the sentence with synonyms from wordnet
58 | ########################################################################
59 |
60 | #for the first time you use wordnet
61 | #import nltk
62 | #nltk.download('wordnet')
63 | from nltk.corpus import wordnet
64 |
65 | def synonym_replacement(words, n):
66 | new_words = words.copy()
67 | random_word_list = list(set([word for word in words if word not in stop_words]))
68 | random.shuffle(random_word_list)
69 | num_replaced = 0
70 | for random_word in random_word_list:
71 | synonyms = get_synonyms(random_word)
72 | if len(synonyms) >= 1:
73 | synonym = random.choice(list(synonyms))
74 | new_words = [synonym if word == random_word else word for word in new_words]
75 | #print("replaced", random_word, "with", synonym)
76 | num_replaced += 1
77 | if num_replaced >= n: #only replace up to n words
78 | break
79 |
80 | #join and re-split so that multi-word synonyms are tokenized into separate words
81 | sentence = ' '.join(new_words)
82 | new_words = sentence.split(' ')
83 |
84 | return new_words
85 |
86 | def get_synonyms(word):
87 | synonyms = set()
88 | for syn in wordnet.synsets(word):
89 | for l in syn.lemmas():
90 | synonym = l.name().replace("_", " ").replace("-", " ").lower()
91 | synonym = "".join([char for char in synonym if char in ' qwertyuiopasdfghjklzxcvbnm'])
92 | synonyms.add(synonym)
93 | if word in synonyms:
94 | synonyms.remove(word)
95 | return list(synonyms)
96 |
97 | ########################################################################
98 | # Random deletion
99 | # Randomly delete words from the sentence with probability p
100 | ########################################################################
101 |
102 | def random_deletion(words, p):
103 |
104 | #obviously, if there's only one word, don't delete it
105 | if len(words) == 1:
106 | return words
107 |
108 | #randomly delete words with probability p
109 | new_words = []
110 | for word in words:
111 | r = random.uniform(0, 1)
112 | if r > p:
113 | new_words.append(word)
114 |
115 | #if you end up deleting all words, just return a random word
116 | if len(new_words) == 0:
117 | rand_int = random.randint(0, len(words)-1)
118 | return [words[rand_int]]
119 |
120 | return new_words
121 |
122 | ########################################################################
123 | # Random swap
124 | # Randomly swap two words in the sentence n times
125 | ########################################################################
126 |
127 | def random_swap(words, n):
128 | new_words = words.copy()
129 | for _ in range(n):
130 | new_words = swap_word(new_words)
131 | return new_words
132 |
133 | def swap_word(new_words):
134 | random_idx_1 = random.randint(0, len(new_words)-1)
135 | random_idx_2 = random_idx_1
136 | counter = 0
137 | while random_idx_2 == random_idx_1:
138 | random_idx_2 = random.randint(0, len(new_words)-1)
139 | counter += 1
140 | if counter > 3:
141 | return new_words
142 | new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1]
143 | return new_words
144 |
145 | ########################################################################
146 | # Random addition
147 | # Randomly insert a synonym of a random word into the sentence, n times
148 | ########################################################################
149 |
150 | def random_addition(words, n):
151 | new_words = words.copy()
152 | for _ in range(n):
153 | add_word(new_words)
154 | return new_words
155 |
156 | def add_word(new_words):
157 | synonyms = []
158 | counter = 0
159 | while len(synonyms) < 1:
160 | random_word = new_words[random.randint(0, len(new_words)-1)]
161 | synonyms = get_synonyms(random_word)
162 | counter += 1
163 | if counter >= 10:
164 | return
165 | random_synonym = synonyms[0]
166 | random_idx = random.randint(0, len(new_words)-1)
167 | new_words.insert(random_idx, random_synonym)
168 |
169 | ########################################################################
170 | # main data augmentation function
171 | ########################################################################
172 |
173 | def eda_4(sentence, alpha_sr=0.3, alpha_ri=0.2, alpha_rs=0.1, p_rd=0.15, num_aug=9):
174 |
175 | sentence = get_only_chars(sentence)
176 | words = sentence.split(' ')
177 | words = [word for word in words if word != '']
178 | num_words = len(words)
179 |
180 | augmented_sentences = []
181 | num_new_per_technique = int(num_aug/4)+1
182 | n_sr = max(1, int(alpha_sr*num_words))
183 | n_ri = max(1, int(alpha_ri*num_words))
184 | n_rs = max(1, int(alpha_rs*num_words))
185 |
186 | #sr
187 | for _ in range(num_new_per_technique):
188 | a_words = synonym_replacement(words, n_sr)
189 | augmented_sentences.append(' '.join(a_words))
190 |
191 | #ri
192 | for _ in range(num_new_per_technique):
193 | a_words = random_addition(words, n_ri)
194 | augmented_sentences.append(' '.join(a_words))
195 |
196 | #rs
197 | for _ in range(num_new_per_technique):
198 | a_words = random_swap(words, n_rs)
199 | augmented_sentences.append(' '.join(a_words))
200 |
201 | #rd
202 | for _ in range(num_new_per_technique):
203 | a_words = random_deletion(words, p_rd)
204 | augmented_sentences.append(' '.join(a_words))
205 |
206 | augmented_sentences = [get_only_chars(sentence) for sentence in augmented_sentences]
207 | shuffle(augmented_sentences)
208 |
209 | #trim so that we have the desired number of augmented sentences
210 | if num_aug >= 1:
211 | augmented_sentences = augmented_sentences[:num_aug]
212 | else:
213 | keep_prob = num_aug / len(augmented_sentences)
214 | augmented_sentences = [s for s in augmented_sentences if random.uniform(0, 1) < keep_prob]
215 |
216 | #append the original sentence
217 | augmented_sentences.append(sentence)
218 |
219 | return augmented_sentences
220 |
221 | def SR(sentence, alpha_sr, n_aug=9):
222 |
223 | sentence = get_only_chars(sentence)
224 | words = sentence.split(' ')
225 | num_words = len(words)
226 |
227 | augmented_sentences = []
228 | n_sr = max(1, int(alpha_sr*num_words))
229 |
230 | for _ in range(n_aug):
231 | a_words = synonym_replacement(words, n_sr)
232 | augmented_sentences.append(' '.join(a_words))
233 |
234 | augmented_sentences = [get_only_chars(sentence) for sentence in augmented_sentences]
235 | shuffle(augmented_sentences)
236 |
237 | augmented_sentences.append(sentence)
238 |
239 | return augmented_sentences
240 |
241 | def RI(sentence, alpha_ri, n_aug=9):
242 |
243 | sentence = get_only_chars(sentence)
244 | words = sentence.split(' ')
245 | num_words = len(words)
246 |
247 | augmented_sentences = []
248 | n_ri = max(1, int(alpha_ri*num_words))
249 |
250 | for _ in range(n_aug):
251 | a_words = random_addition(words, n_ri)
252 | augmented_sentences.append(' '.join(a_words))
253 |
254 | augmented_sentences = [get_only_chars(sentence) for sentence in augmented_sentences]
255 | shuffle(augmented_sentences)
256 |
257 | augmented_sentences.append(sentence)
258 |
259 | return augmented_sentences
260 |
261 | def RS(sentence, alpha_rs, n_aug=9):
262 |
263 | sentence = get_only_chars(sentence)
264 | words = sentence.split(' ')
265 | num_words = len(words)
266 |
267 | augmented_sentences = []
268 | n_rs = max(1, int(alpha_rs*num_words))
269 |
270 | for _ in range(n_aug):
271 | a_words = random_swap(words, n_rs)
272 | augmented_sentences.append(' '.join(a_words))
273 |
274 | augmented_sentences = [get_only_chars(sentence) for sentence in augmented_sentences]
275 | shuffle(augmented_sentences)
276 |
277 | augmented_sentences.append(sentence)
278 |
279 | return augmented_sentences
280 |
281 | def RD(sentence, alpha_rd, n_aug=9):
282 |
283 | sentence = get_only_chars(sentence)
284 | words = sentence.split(' ')
285 | words = [word for word in words if word != '']
286 | num_words = len(words)
287 |
288 | augmented_sentences = []
289 |
290 | for _ in range(n_aug):
291 | a_words = random_deletion(words, alpha_rd)
292 | augmented_sentences.append(' '.join(a_words))
293 |
294 | augmented_sentences = [get_only_chars(sentence) for sentence in augmented_sentences]
295 | shuffle(augmented_sentences)
296 |
297 | augmented_sentences.append(sentence)
298 |
299 | return augmented_sentences
300 |
301 |
302 |
303 |
304 |
305 |
306 |
307 |
308 |
309 |
310 |
311 |
312 |
313 |
314 |
315 |
316 |
317 |
318 |
319 |
320 | ########################################################################
321 | # Testing
322 | ########################################################################
323 |
324 | if __name__ == '__main__':
325 |
326 | line = 'Hi. My name is Jason. I’m a third-year computer science major at Dartmouth College, interested in deep learning and computer vision. My advisor is Saeed Hassanpour. I’m currently working on deep learning for lung cancer classification.'
327 | #quick sanity check: print a few augmented versions of the example sentence
328 | for aug_sentence in eda_4(line, num_aug=4):
329 | print(aug_sentence)
330 | ########################################################################
331 | # Sliding window
332 | # Slide a window of size w over the sentence with stride s
333 | # Returns a list of lists of words
334 | ########################################################################
335 |
336 | # def sliding_window_sentences(words, w, s):
337 | # windows = []
338 | # for i in range(0, len(words)-w+1, s):
339 | # window = words[i:i+w]
340 | # windows.append(window)
341 | # return windows
342 |
343 |
344 |
345 |
346 |
--------------------------------------------------------------------------------
/preprocess/__pycache__/utils.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jasonwei20/eda_nlp/04ab29c5b18d2d72f9fa5b304322aaf4793acea0/preprocess/__pycache__/utils.cpython-36.pyc
--------------------------------------------------------------------------------
/preprocess/bg_clean.py:
--------------------------------------------------------------------------------
1 |
2 | from utils import *
3 |
4 | def clean_csv(input_file, output_file):
5 |
6 | input_r = open(input_file, 'r').read()
7 |
8 | lines = input_r.split(',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,')
9 | print(len(lines))
10 | for line in lines[:10]:
11 | print(line[-3:])
12 |
13 | if __name__ == "__main__":
14 |
15 | input_file = 'raw/blog-gender-dataset.csv'
16 | output_file = 'datasets/bg/train.csv'
17 |
18 | clean_csv(input_file, output_file)
19 |
20 |
21 |
--------------------------------------------------------------------------------
/preprocess/copy_sized_datasets.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | sizes = ['1_tiny', '2_small', '3_standard', '4_full']
4 | datasets = ['sst2', 'cr', 'subj', 'trec', 'pc']
5 |
6 | for size in sizes:
7 | for dataset in datasets:
8 | folder = 'size_data_t1/' + size + '/' + dataset
9 | if not os.path.exists(folder):
10 | os.makedirs(folder)
11 |
12 | origin = 'sized_datasets_f1/' + size + '/' + dataset + '/train_orig.txt'
13 | destination = 'size_data_t1/' + size + '/' + dataset + '/train_orig.txt'
14 | os.system('cp ' + origin + ' ' + destination)
--------------------------------------------------------------------------------
/preprocess/cr_clean.py:
--------------------------------------------------------------------------------
1 | #0 = neg, 1 = pos
2 | from utils import *
3 |
4 | def retrieve_reviews(line):
5 |
6 | reviews = set()
7 | chars = list(line)
8 | for i, char in enumerate(chars):
9 | if char == '[':
10 | if chars[i+1] == '-':
11 | reviews.add(0)
12 | elif chars[i+1] == '+':
13 | reviews.add(1)
14 |
15 | reviews = list(reviews)
16 | if len(reviews) == 2:
17 | return -2
18 | elif len(reviews) == 1:
19 | return reviews[0]
20 | else:
21 | return -1
22 |
23 | def clean_files(input_files, output_file):
24 |
25 | writer = open(output_file, 'w')
26 |
27 | for input_file in input_files:
28 | print(input_file)
29 | input_lines = open(input_file, 'r').readlines()
30 | counter = 0
31 | bad_counter = 0
32 | for line in input_lines:
33 | review = retrieve_reviews(line)
34 | if review in {0, 1}:
35 | good_line = get_only_chars(re.sub(r"([\(\[]).*?([\)\]])", r"\g<1>\g<2>", line))
36 | output_line = str(review) + '\t' + good_line
37 | writer.write(output_line + '\n')
38 | counter += 1
39 | elif review == -2:
40 | bad_counter +=1
41 | print(input_file, counter, bad_counter)
42 |
43 | writer.close()
44 |
45 | if __name__ == '__main__':
46 |
47 | input_files = ['all.txt']#['canon_power.txt', 'canon_s1.txt', 'diaper.txt', 'hitachi.txt', 'ipod.txt', 'micromp3.txt', 'nokia6600.txt', 'norton.txt', 'router.txt']
48 | input_files = ['raw/cr/data_new/' + f for f in input_files]
49 | output_file = 'datasets/cr/apex_clean.txt'
50 |
51 | clean_files(input_files, output_file)
52 |
--------------------------------------------------------------------------------
/preprocess/create_dataset_increments.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | datasets = ['cr', 'pc', 'sst1', 'sst2', 'subj', 'trec']
4 |
5 | for dataset in datasets:
6 | line = 'cat increment_datasets_f2/' + dataset + '/test.txt > sized_datasets_f1/test/' + dataset + '/test.txt'
7 | os.system(line)
--------------------------------------------------------------------------------
/preprocess/get_stats.py:
--------------------------------------------------------------------------------
1 | import statistics
2 |
3 | datasets = ['sst2', 'cr', 'subj', 'trec', 'pc']
4 |
5 | filenames = ['increment_datasets_f2/' + x + '/train_orig.txt' for x in datasets]
6 |
7 | def get_vocab_size(filename):
8 | lines = open(filename, 'r').readlines()
9 |
10 | vocab = set()
11 | for line in lines:
12 | words = line[:-1].split(' ')
13 | for word in words:
14 | if word not in vocab:
15 | vocab.add(word)
16 |
17 | return len(vocab)
18 |
19 | def get_mean_and_std(filename):
20 | lines = open(filename, 'r').readlines()
21 |
22 | line_lengths = []
23 | for line in lines:
24 | length = len(line[:-1].split(' ')) - 1
25 | line_lengths.append(length)
26 |
27 | print(filename, statistics.mean(line_lengths), statistics.stdev(line_lengths), max(line_lengths))
28 |
29 |
30 | for filename in filenames:
31 | #print(get_vocab_size(filename))
32 | get_mean_and_std(filename)
33 |
34 |
35 |
36 |
37 |
38 |
39 |
40 |
--------------------------------------------------------------------------------
/preprocess/procon_clean.py:
--------------------------------------------------------------------------------
1 |
2 | from utils import *
3 |
4 | def get_good_stuff(line):
5 | idx = line.find('s>')
6 | good = line[idx+2:-8]
7 |
8 | return get_only_chars(good)
9 |
10 | def clean_file(con_file, pro_file, output_train, output_test):
11 |
12 | train_writer = open(output_train, 'w')
13 | test_writer = open(output_test, 'w')
14 | con_lines = open(con_file, 'r').readlines()
15 | for line in con_lines[:int(len(con_lines)*0.9)]:
16 | content = get_good_stuff(line)
17 | if len(content) >= 8:
18 | train_writer.write('0\t' + content + '\n')
19 | for line in con_lines[int(len(con_lines)*0.9):]:
20 | content = get_good_stuff(line)
21 | if len(content) >= 8:
22 | test_writer.write('0\t' + content + '\n')
23 |
24 | pro_lines = open(pro_file, 'r').readlines()
25 | for line in pro_lines[:int(len(pro_lines)*0.9)]:
26 | content = get_good_stuff(line)
27 | if len(content) >= 8:
28 | train_writer.write('1\t' + content + '\n')
29 | for line in pro_lines[int(len(pro_lines)*0.9):]:
30 | content = get_good_stuff(line)
31 | if len(content) >= 8:
32 | test_writer.write('1\t' + content + '\n')
33 |
34 |
35 | if __name__ == '__main__':
36 |
37 | con_file = 'raw/pros-cons/integratedCons.txt'
38 | pro_file = 'raw/pros-cons/integratedPros.txt'
39 | output_train = 'datasets/procon/train.txt'
40 | output_test = 'datasets/procon/test.txt'
41 | clean_file(con_file, pro_file, output_train, output_test)
--------------------------------------------------------------------------------
/preprocess/shuffle_lines.py:
--------------------------------------------------------------------------------
1 | import random
2 |
3 | def shuffle_lines(text_file):
4 | lines = open(text_file).readlines()
5 | random.shuffle(lines)
6 | open(text_file, 'w').writelines(lines)
7 |
8 | shuffle_lines('special_f4/pc/test_short_aug_shuffle.txt')
--------------------------------------------------------------------------------
/preprocess/sst1_clean.py:
--------------------------------------------------------------------------------
1 | from utils import *
2 |
3 | def get_label(decimal):
4 | if decimal >= 0 and decimal <= 0.2:
5 | return 0
6 | elif decimal > 0.2 and decimal <= 0.4:
7 | return 1
8 | elif decimal > 0.4 and decimal <= 0.6:
9 | return 2
10 | elif decimal > 0.6 and decimal <= 0.8:
11 | return 3
12 | elif decimal > 0.8 and decimal <= 1:
13 | return 4
14 | else:
15 | return -1
16 |
17 | def get_label_binary(decimal):
18 | if decimal >= 0 and decimal <= 0.4:
19 | return 0
20 | elif decimal > 0.6 and decimal <= 1:
21 | return 1
22 | else:
23 | return -1
24 |
25 | def get_split(split_num):
26 | if split_num == 1 or split_num == 3:
27 | return 'train'
28 | elif split_num == 2:
29 | return 'test'
30 |
31 | if __name__ == "__main__":
32 |
33 | data_path = 'raw/sst_1/stanfordSentimentTreebank/datasetSentences.txt'
34 | labels_path = 'raw/sst_1/stanfordSentimentTreebank/sentiment_labels.txt'
35 | split_path = 'raw/sst_1/stanfordSentimentTreebank/datasetSplit.txt'
36 | dictionary_path = 'raw/sst_1/stanfordSentimentTreebank/dictionary.txt'
37 |
38 | sentence_lines = open(data_path, 'r').readlines()
39 | labels_lines = open(labels_path, 'r').readlines()
40 | split_lines = open(split_path, 'r').readlines()
41 | dictionary_lines = open(dictionary_path, 'r').readlines()
42 |
43 | print(len(sentence_lines))
44 | print(len(split_lines))
45 | print(len(labels_lines))
46 | print(len(dictionary_lines))
47 |
48 | #create dictionary for id to label
49 | id_to_label = {}
50 | for line in labels_lines[1:]:
51 | parts = line[:-1].split("|")
52 | _id = parts[0]
53 | score = float(parts[1])
54 | label = get_label_binary(score)
55 |
56 | id_to_label[_id] = label
57 |
58 | print(len(id_to_label), "id to labels read in")
59 |
60 | #create dictionary for phrase to label
61 | phrase_to_label = {}
62 | for line in dictionary_lines:
63 | parts = line[:-1].split("|")
64 | phrase = parts[0]
65 | _id = parts[1]
66 | label = id_to_label[_id]
67 |
68 | phrase_to_label[phrase] = label
69 |
70 | print(len(phrase_to_label), "phrase to id read in")
71 |
72 | #create id to split
73 | id_to_split = {}
74 | for line in split_lines[1:]:
75 | parts = line[:-1].split(",")
76 | _id = parts[0]
77 | split_num = float(parts[1])
78 | split = get_split(split_num)
79 | id_to_split[_id] = split
80 |
81 | print(len(id_to_split), "id to split read in")
82 |
83 | train_writer = open('datasets/sst2/train_orig.txt', 'w')
84 | test_writer = open('datasets/sst2/test.txt', 'w')
85 |
86 | #create sentence to split and label
87 | for sentence_line in sentence_lines[1:]:
88 | parts = sentence_line[:-1].split('\t')
89 | _id = parts[0]
90 | sentence = get_only_chars(parts[1])
91 | split = id_to_split[_id]
92 |
93 | if parts[1] in phrase_to_label:
94 | label = phrase_to_label[parts[1]]
95 | if label in {0, 1}:
96 | #print(label, sentence, split)
97 | if split == 'train':
98 | train_writer.write(str(label) + '\t' + sentence + '\n')
99 | elif split == 'test':
100 | test_writer.write(str(label) + '\t' + sentence + '\n')
101 |
102 | #print(parts, split)
103 |
104 | #label = []
105 |
106 |
107 |
108 |
109 |
--------------------------------------------------------------------------------
/preprocess/subj_clean.py:
--------------------------------------------------------------------------------
1 | from utils import *
2 |
3 | if __name__ == "__main__":
4 | subj_path = "subj/rotten_imdb/subj.txt"
5 | obj_path = "subj/rotten_imdb/plot.tok.gt9.5000"
6 |
7 | subj_lines = open(subj_path, 'r').readlines()
8 | obj_lines = open(obj_path, 'r').readlines()
9 | print(len(subj_lines), len(obj_lines))
10 |
11 | test_split = int(0.9*len(subj_lines))
12 |
13 | train_lines = []
14 | test_lines = []
15 |
16 | #training set
17 | for s_line in subj_lines[:test_split]:
18 | clean_line = '1\t' + get_only_chars(s_line[:-1])
19 | train_lines.append(clean_line)
20 |
21 | for o_line in obj_lines[:test_split]:
22 | clean_line = '0\t' + get_only_chars(o_line[:-1])
23 | train_lines.append(clean_line)
24 |
25 | #testing set
26 | for s_line in subj_lines[test_split:]:
27 | clean_line = '1\t' + get_only_chars(s_line[:-1])
28 | test_lines.append(clean_line)
29 |
30 | for o_line in obj_lines[test_split:]:
31 | clean_line = '0\t' + get_only_chars(o_line[:-1])
32 | test_lines.append(clean_line)
33 |
34 | print(len(test_lines), len(train_lines))
35 |
36 | #print training set
37 | writer = open('datasets/subj/train_orig.txt', 'w')
38 | for line in train_lines:
39 | writer.write(line + '\n')
40 | writer.close()
41 |
42 | #print testing set
43 | writer = open('datasets/subj/test.txt', 'w')
44 | for line in test_lines:
45 | writer.write(line + '\n')
46 | writer.close()
--------------------------------------------------------------------------------
/preprocess/trej_clean.py:
--------------------------------------------------------------------------------
1 |
2 | from utils import *
3 |
4 | class_name_to_num = {'DESC': 0, 'ENTY':1, 'ABBR':2, 'HUM': 3, 'LOC': 4, 'NUM': 5}
5 |
6 | def clean(input_file, output_file):
7 | lines = open(input_file, 'r').readlines()
8 | writer = open(output_file, 'w')
9 | for line in lines:
10 | parts = line[:-1].split(' ')
11 | tag = parts[0].split(':')[0]
12 | class_num = class_name_to_num[tag]
13 | sentence = get_only_chars(' '.join(parts[1:]))
14 | print(tag, class_num, sentence)
15 | output_line = str(class_num) + '\t' + sentence
16 | writer.write(output_line + '\n')
17 | writer.close()
18 |
19 |
20 | if __name__ == "__main__":
21 |
22 | clean('raw/trec/train_copy.txt', 'datasets/trec/train_orig.txt')
23 | clean('raw/trec/test_copy.txt', 'datasets/trec/test.txt')
24 |
--------------------------------------------------------------------------------
/preprocess/utils.py:
--------------------------------------------------------------------------------
1 | import re
2 |
3 |
4 |
5 |
6 | #cleaning up text
7 | def get_only_chars(line):
8 |
9 | clean_line = ""
10 |
11 | line = line.lower()
12 | line = line.replace(" 's", " is")
13 | line = line.replace("-", " ") #replace hyphens with spaces
14 | line = line.replace("\t", " ")
15 | line = line.replace("\n", " ")
16 | line = line.replace("'", "")
17 |
18 | for char in line:
19 | if char in 'qwertyuiopasdfghjklzxcvbnm ':
20 | clean_line += char
21 | else:
22 | clean_line += ' '
23 |
24 | clean_line = re.sub(' +',' ',clean_line) #delete extra spaces
25 | #print(clean_line)
26 | if clean_line and clean_line[0] == ' ':
27 | clean_line = clean_line[1:]
28 | return clean_line
--------------------------------------------------------------------------------