├── CNN_Text_Classification.ipynb
├── LICENSE
├── README.md
├── data
├── labels.txt
└── reviews.txt
├── notebook_ims
├── complete_embedding_CNN.png
├── embedding_lookup_table.png
├── reviews_ex.png
└── two_vectors.png
├── requirements.txt
└── word2vec_model
└── readme_download_word2vecmodel.txt
/CNN_Text_Classification.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Convolutional Neural Networks\n",
8 | "---\n",
9 | "In this notebook, I'll train a **CNN** to classify the sentiment of movie reviews in a corpus of text. The approach will be as follows:\n",
10 | "* Pre-process movie reviews and their corresponding sentiment labels (positive = 1, negative = 0).\n",
11 | "* Load in a **pre-trained** Word2Vec model, and use it to tokenize the reviews.\n",
12 | "* Create training/validation/test sets of data.\n",
13 | "* Define a `SentimentCNN` model that has a pre-trained embedding layer, convolutional layers, and a final, fully-connected, classification layer.\n",
14 | "* Train and evaluate the model.\n",
15 | "\n",
16 | "An example of a positive and negative review are shown below.\n",
17 | "\n",
18 | "
\n",
19 | "\n",
20 | "The task of text classification has typically been done with an RNN, which accepts a sequence of words as input and has a hidden state that is dependent on that sequence and acts as a kind of memory. You can see an example that classifies this same review dataset using an RNN in [this Github repository](https://github.com/udacity/deep-learning-v2-pytorch/blob/master/sentiment-rnn/Sentiment_RNN_Solution.ipynb). \n",
21 | "\n",
22 | "\n",
23 | "## Resources\n",
24 | "\n",
25 | "This example shows how you can utilize convolutional layers to find patterns in sequences of word embeddings and create an effective text classifier using a CNN-based approach.\n",
26 | "\n",
27 | "**1. Original paper**\n",
28 | "* The code follows the structure outlined in the paper, [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882) by Yoon Kim (2014). \n",
29 | "\n",
30 | "**2. Pre-trained Word2Vec model**\n",
31 | "\n",
32 | "* The key to this approach is convolving over word embeddings, for which I will use a pre-trained [Word2Vec](https://en.wikipedia.org/wiki/Word2vec) model. \n",
33 | "* I am specifically using a \"slim\"-version of a model that was trained on part of a Google News dataset (about 100 billion words). The [original model](https://code.google.com/archive/p/word2vec/) contains 300-dimensional vectors for 3 million words and phrases.\n",
34 | "* The \"slim\" model is cut to 300k English words, as described in [this Github repository](https://github.com/eyaler/word2vec-slim).\n",
35 | "\n",
36 | "You should be able to modify this code slightly to make it compatible with a Word2Vec model of your choosing.\n",
37 | "\n",
38 | "**3. Movie reviews data **\n",
39 | "\n",
40 | "The dataset holds 25000 movie reviews, which were obtained from the movie review site, IMDb.\n"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "---\n",
48 | "## Load in and Visualize the Data"
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": 1,
54 | "metadata": {},
55 | "outputs": [],
56 | "source": [
57 | "import numpy as np\n",
58 | "\n",
59 | "# read data from text files\n",
60 | "with open('data/reviews.txt', 'r') as f:\n",
61 | " reviews = f.read()\n",
62 | "with open('data/labels.txt', 'r') as f:\n",
63 | " labels = f.read()"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": 2,
69 | "metadata": {},
70 | "outputs": [
71 | {
72 | "name": "stdout",
73 | "output_type": "stream",
74 | "text": [
75 | "bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life such as teachers . my years in the teaching profession lead me to believe that bromwell high s satire is much closer to reality than is teachers . the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn t \n",
76 | "story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turn\n",
77 | "\n",
78 | "positive\n",
79 | "negative\n",
80 | "po\n"
81 | ]
82 | }
83 | ],
84 | "source": [
85 | "# print some example review/sentiment text\n",
86 | "print(reviews[:1000])\n",
87 | "print()\n",
88 | "print(labels[:20])"
89 | ]
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "---\n",
96 | "## Data Pre-processing\n",
97 | "\n",
98 | "The first step, when building a neural network, is getting the data into the proper form to feed into the network. Since I'm planning to use a word-embedding layer, I know that I'll need to encode each word in a reviews as an integer, and encode each sentiment label as 1 (positive) or 0 (negative). \n",
99 | "\n",
100 | "I'll first want to clean up the reviews by removing punctuation and converting them to lowercase. You can see an example of the reviews data, above. Here are the processing steps, I'll want to take:\n",
101 | ">* Get rid of any extraneous punctuation.\n",
102 | "* You might notice that the reviews are delimited with newline characters `\\n`. To deal with those, I'm going to split the text into each review using `\\n` as the delimiter. \n",
103 | "* Then I can combined all the reviews back together into one big string to get all of my text data.\n",
104 | "\n",
105 | "First, let's remove all punctuation. Then get all the text without the newlines and split it into individual words."
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": 3,
111 | "metadata": {},
112 | "outputs": [],
113 | "source": [
114 | "from string import punctuation\n",
115 | "\n",
116 | "# get rid of punctuation\n",
117 | "reviews = reviews.lower() # lowercase, standardize\n",
118 | "all_text = ''.join([c for c in reviews if c not in punctuation])\n",
119 | "\n",
120 | "# split by new lines and spaces\n",
121 | "reviews_split = all_text.split('\\n')\n",
122 | "\n",
123 | "all_text = ' '.join(reviews_split)\n",
124 | "\n",
125 | "# create a list of all words\n",
126 | "all_words = all_text.split()\n"
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "### Encoding the Labels\n",
134 | "\n",
135 | "The review labels are \"positive\" or \"negative\". To use these labels in a neural network, I need to convert them to numerical values, 1 (positive) and 0 (negative)."
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": 4,
141 | "metadata": {},
142 | "outputs": [],
143 | "source": [
144 | "# 1=positive, 0=negative label conversion\n",
145 | "labels_split = labels.split('\\n')\n",
146 | "encoded_labels = np.array([1 if label == 'positive' else 0 for label in labels_split])"
147 | ]
148 | },
149 | {
150 | "cell_type": "markdown",
151 | "metadata": {},
152 | "source": [
153 | "### Removing Outliers\n",
154 | "\n",
155 | "As an additional pre-processing step, I want to make sure that the reviews are in good shape for standard processing. That is, I'll want to shape the reviews into a specific, consistent length for ease of processing and comparison. I'll approach this task in two main steps:\n",
156 | "\n",
157 | "1. Getting rid of extremely long or short reviews; the outliers\n",
158 | "2. Padding/truncating the remaining data so that we have reviews of the same length.\n",
159 | "\n",
160 | "Before I pad the review text, below, I am checking for reviews of extremely short or long lengths; outliers that may mess with training."
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": 5,
166 | "metadata": {},
167 | "outputs": [
168 | {
169 | "name": "stdout",
170 | "output_type": "stream",
171 | "text": [
172 | "Zero-length reviews: 1\n",
173 | "Maximum review length: 2514\n"
174 | ]
175 | }
176 | ],
177 | "source": [
178 | "from collections import Counter\n",
179 | "\n",
180 | "# Build a dictionary that maps indices to review lengths\n",
181 | "counts = Counter(all_words)\n",
182 | "\n",
183 | "# outlier review stats\n",
184 | "# counting words in each review\n",
185 | "review_lens = Counter([len(x.split()) for x in reviews_split])\n",
186 | "print(\"Zero-length reviews: {}\".format(review_lens[0]))\n",
187 | "print(\"Maximum review length: {}\".format(max(review_lens)))"
188 | ]
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
194 | "Okay, a couple issues here. I seem to have one review with zero length. And, the maximum review length is really long. I'm going to remove any super short reviews and truncate super long reviews. This removes outliers and should allow our model to train more efficiently."
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": 6,
200 | "metadata": {},
201 | "outputs": [
202 | {
203 | "name": "stdout",
204 | "output_type": "stream",
205 | "text": [
206 | "Number of reviews before removing outliers: 25001\n",
207 | "Number of reviews after removing outliers: 25000\n"
208 | ]
209 | }
210 | ],
211 | "source": [
212 | "print('Number of reviews before removing outliers: ', len(reviews_split))\n",
213 | "\n",
214 | "## remove any reviews/labels with zero length from the reviews_ints list.\n",
215 | "\n",
216 | "# get indices of any reviews with length 0\n",
217 | "non_zero_idx = [ii for ii, review in enumerate(reviews_split) if len(review.split()) != 0]\n",
218 | "\n",
219 | "# remove 0-length reviews and their labels\n",
220 | "reviews_split = [reviews_split[ii] for ii in non_zero_idx]\n",
221 | "encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])\n",
222 | "\n",
223 | "print('Number of reviews after removing outliers: ', len(reviews_split))"
224 | ]
225 | },
226 | {
227 | "cell_type": "markdown",
228 | "metadata": {},
229 | "source": [
230 | "---\n",
231 | "## Using a Pre-Trained Embedding Layer\n",
232 | "\n",
233 | "Next, I'll want to tokenize my reviews; turning the list of words that make up a given review into a list of tokenized integers that represent those words. Typically, this is done by creating a dictionary that maps each unique word in a vocabulary to a specific integer value.\n",
234 | "\n",
235 | "In this example, I'll actually want to use a mapping that already exists, in a pre-trained embedding layer. Below, I am loading in a pre-trained embedding model, and I'll explore its traits.\n",
236 | "\n",
237 | "> This code assumes I have a downloaded model `GoogleNews-vectors-negative300-SLIM.bin.gz` in the same directory as this notebook, in a folder, `word2vec_model`."
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": 7,
243 | "metadata": {},
244 | "outputs": [],
245 | "source": [
246 | "# # Load a pretrained word2vec model (only need to run code, once)\n",
247 | "# ! gzip -d word2vec_model/GoogleNews-vectors-negative300-SLIM.bin.gz"
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": 8,
253 | "metadata": {},
254 | "outputs": [],
255 | "source": [
256 | "# import Word2Vec loading capabilities\n",
257 | "from gensim.models import KeyedVectors\n",
258 | "\n",
259 | "# Creating the model\n",
260 | "embed_lookup = KeyedVectors.load_word2vec_format('word2vec_model/GoogleNews-vectors-negative300-SLIM.bin', \n",
261 | " binary=True)\n"
262 | ]
263 | },
264 | {
265 | "cell_type": "markdown",
266 | "metadata": {},
267 | "source": [
268 | "### Embedding Layer\n",
269 | "\n",
270 | "You can think of an embedding layer as a lookup table, where the rows are indexed by word token and the columns hold the embedding values. For example, row 958 is the embedding vector for the word that maps to the integer value 958.\n",
271 | "\n",
272 | "
\n",
273 | "\n",
274 | "In the below cells, I am storing the words in the pre-trained vocabulary, and printing out the size of the vocabulary and word embeddings. \n",
275 | "> The embedding dimension from the pret-rained model is 300."
276 | ]
277 | },
278 | {
279 | "cell_type": "code",
280 | "execution_count": 9,
281 | "metadata": {},
282 | "outputs": [],
283 | "source": [
284 | "# store pretrained vocab\n",
285 | "pretrained_words = []\n",
286 | "for word in embed_lookup.vocab:\n",
287 | " pretrained_words.append(word)"
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "execution_count": 10,
293 | "metadata": {},
294 | "outputs": [
295 | {
296 | "name": "stdout",
297 | "output_type": "stream",
298 | "text": [
299 | "Size of Vocab: 299567\n",
300 | "\n",
301 | "Word in vocab: for\n",
302 | "\n",
303 | "Length of embedding: 300\n",
304 | "\n"
305 | ]
306 | }
307 | ],
308 | "source": [
309 | "row_idx = 1\n",
310 | "\n",
311 | "# get word/embedding in that row\n",
312 | "word = pretrained_words[row_idx] # get words by index\n",
313 | "embedding = embed_lookup[word] # embeddings by word\n",
314 | "\n",
315 | "# vocab and embedding info\n",
316 | "print(\"Size of Vocab: {}\\n\".format(len(pretrained_words)))\n",
317 | "print('Word in vocab: {}\\n'.format(word))\n",
318 | "print('Length of embedding: {}\\n'.format(len(embedding)))\n",
319 | "#print('Associated embedding: \\n', embedding)"
320 | ]
321 | },
322 | {
323 | "cell_type": "code",
324 | "execution_count": 11,
325 | "metadata": {},
326 | "outputs": [
327 | {
328 | "name": "stdout",
329 | "output_type": "stream",
330 | "text": [
331 | "in\n",
332 | "for\n",
333 | "that\n",
334 | "is\n",
335 | "on\n"
336 | ]
337 | }
338 | ],
339 | "source": [
340 | "# print a few common words\n",
341 | "for i in range(5):\n",
342 | " print(pretrained_words[i])"
343 | ]
344 | },
345 | {
346 | "cell_type": "markdown",
347 | "metadata": {},
348 | "source": [
349 | "### Cosine Similarity\n",
350 | "\n",
351 | "The pre-trained embedding model has learned to represent semantic relationships between words in vector space. Specifically, words that appear in similar contexts should point in roughly the same direction. To measure whether two vectors are colinear, we can use [**cosine similarity**](https://en.wikipedia.org/wiki/Cosine_similarity), which computes the dot product of two vectors. This dot product is largest when the angle between two vectors is 0 (cos(0) = 1) and cosine is at a maximum, so cosine similarity is larger for aligned vectors.\n",
352 | "\n",
353 | "
\n",
354 | "\n",
355 | "### Embedded Bias\n",
356 | "\n",
357 | "Word2Vec, in addition to learning useful similarities and semantic relationships between words, also learns to represent problematic relationships between words. For example, a paper on [Debiasing Word Embeddings](https://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-debiasing-word-embeddings.pdf) by Bolukbasi et al. (2016), found that the vector-relationship between \"man\" and \"woman\" was similar to the relationship between \"physician\" and \"registered nurse\" or \"shopkeeper\" and \"housewife\" in the trained, Google News Word2Vec model, **which I am using in this notebook**.\n",
358 | "\n",
359 | ">*\"In this paper, we quantitatively demonstrate that word-embeddings contain biases in their geometry that reflect gender stereotypes present in broader society. Due to their wide-spread usage as basic\n",
360 | "features, word embeddings not only reflect such stereotypes but can also amplify them. This poses a\n",
361 | "significant risk and challenge for machine learning and its applications.\"*\n",
362 | "\n",
363 | "As such, it is important to note that this example is using a Word2Vec model that has been shown to encapsulate gender stereotypes.\n",
364 | "\n",
365 | "You can explore similarities and relationships between word embeddings using code. The code below finds words with the highest cosine similarity when compared to the word `find_similar_to`. "
366 | ]
367 | },
368 | {
369 | "cell_type": "code",
370 | "execution_count": 12,
371 | "metadata": {},
372 | "outputs": [
373 | {
374 | "name": "stdout",
375 | "output_type": "stream",
376 | "text": [
377 | "Similar words to fabulous: \n",
378 | "\n",
379 | "Word: wonderful, Similarity: 0.761\n",
380 | "Word: fantastic, Similarity: 0.761\n",
381 | "Word: marvelous, Similarity: 0.730\n",
382 | "Word: gorgeous, Similarity: 0.714\n",
383 | "Word: lovely, Similarity: 0.713\n",
384 | "Word: terrific, Similarity: 0.694\n",
385 | "Word: amazing, Similarity: 0.693\n",
386 | "Word: beautiful, Similarity: 0.670\n",
387 | "Word: magnificent, Similarity: 0.667\n",
388 | "Word: splendid, Similarity: 0.645\n"
389 | ]
390 | }
391 | ],
392 | "source": [
393 | "# Pick a word \n",
394 | "find_similar_to = 'fabulous'\n",
395 | "\n",
396 | "print('Similar words to '+find_similar_to+': \\n')\n",
397 | "\n",
398 | "# Find similar words, using cosine similarity\n",
399 | "# by default shows top 10 similar words\n",
400 | "for similar_word in embed_lookup.similar_by_word(find_similar_to):\n",
401 | " print(\"Word: {0}, Similarity: {1:.3f}\".format(\n",
402 | " similar_word[0], similar_word[1]\n",
403 | " ))\n"
404 | ]
405 | },
406 | {
407 | "cell_type": "markdown",
408 | "metadata": {},
409 | "source": [
410 | "## Tokenize reviews\n",
411 | "\n",
412 | "The pre-trained embedding layer already has tokens associated with each word in the dictionary. I want to use that same mapping to tokenize all the reviews in the movie review corpus. I will encode any unknown words (words that appear in the reviews but not in the pre-trained vocabulary) as the whitespace token, 0; this should be fine for the purpose of sentiment classification."
413 | ]
414 | },
415 | {
416 | "cell_type": "code",
417 | "execution_count": 13,
418 | "metadata": {},
419 | "outputs": [],
420 | "source": [
421 | "# convert reviews to tokens\n",
422 | "def tokenize_all_reviews(embed_lookup, reviews_split):\n",
423 | " # split each review into a list of words\n",
424 | " reviews_words = [review.split() for review in reviews_split]\n",
425 | "\n",
426 | " tokenized_reviews = []\n",
427 | " for review in reviews_words:\n",
428 | " ints = []\n",
429 | " for word in review:\n",
430 | " try:\n",
431 | " idx = embed_lookup.vocab[word].index\n",
432 | " except: \n",
433 | " idx = 0\n",
434 | " ints.append(idx)\n",
435 | " tokenized_reviews.append(ints)\n",
436 | " \n",
437 | " return tokenized_reviews\n"
438 | ]
439 | },
440 | {
441 | "cell_type": "code",
442 | "execution_count": 14,
443 | "metadata": {},
444 | "outputs": [],
445 | "source": [
446 | "tokenized_reviews = tokenize_all_reviews(embed_lookup, reviews_split)"
447 | ]
448 | },
449 | {
450 | "cell_type": "code",
451 | "execution_count": 15,
452 | "metadata": {},
453 | "outputs": [
454 | {
455 | "name": "stdout",
456 | "output_type": "stream",
457 | "text": [
458 | "[0, 137, 3, 0, 11620, 3799, 13, 1215, 10, 9, 194, 54, 12, 73, 61, 685, 41, 183, 243, 129, 12, 1663, 119, 72, 0, 9, 2989, 7334, 242, 159, 0, 453, 2, 0, 137, 1239, 19951, 3, 141, 1980, 0, 1898, 55, 3, 1663, 9, 11124, 0, 3857, 6663, 9, 20401, 295, 28, 45, 148, 157, 102, 27, 15452, 1663, 30714, 9, 65172, 0, 9, 844, 737, 47, 6585, 159, 0, 9, 668, 4365, 1003, 0, 27, 295, 56, 4365, 622, 9, 3832, 0, 43, 0, 897, 3187, 907, 0, 5396, 113, 9, 183, 4365, 1009, 3165, 10, 137, 0, 3288, 296, 10314, 4365, 6638, 213, 0, 8810, 40, 0, 116, 1663, 897, 2059, 0, 0, 137, 4365, 830, 2, 124, 2216, 0, 119, 782, 144, 2, 0, 137, 3, 330, 23046, 78, 0, 16915, 2, 13, 85275, 7451]\n"
459 | ]
460 | }
461 | ],
462 | "source": [
463 | "# testing code and printing a tokenized review\n",
464 | "print(tokenized_reviews[0])"
465 | ]
466 | },
467 | {
468 | "cell_type": "markdown",
469 | "metadata": {},
470 | "source": [
471 | "---\n",
472 | "## Padding sequences\n",
473 | "\n",
474 | "To deal with both short and very long reviews, I'll pad or truncate all the reviews to a specific length. For reviews shorter than some `seq_length`, I'll left-pad with 0s. For reviews longer than `seq_length`, I'll truncate them to the first `seq_length` words. A good `seq_length`, in this case, is about 200.\n",
475 | "\n",
476 | "> The function `pad_features` returns an array that contains padded, tokenized reviews, of a standard size, that we'll pass to the network. \n",
477 | "\n",
478 | "\n",
479 | "As a small example, if the `seq_length=10` and an input, tokenized review is: \n",
480 | "```\n",
481 | "[117, 18, 128]\n",
482 | "```\n",
483 | "The resultant, padded sequence should be: \n",
484 | "\n",
485 | "```\n",
486 | "[0, 0, 0, 0, 0, 0, 0, 117, 18, 128]\n",
487 | "```\n",
488 | "\n",
489 | "**Your final `features` array should be a 2D array, with as many rows as there are reviews, and as many columns as the specified `seq_length`.**\n",
490 | "\n",
491 | "This isn't trivial and there are a bunch of ways to do this. But, if you're going to be building your own deep learning networks, you're going to have to get used to preparing your data."
492 | ]
493 | },
494 | {
495 | "cell_type": "code",
496 | "execution_count": 16,
497 | "metadata": {},
498 | "outputs": [],
499 | "source": [
500 | "def pad_features(tokenized_reviews, seq_length):\n",
501 | " ''' Return features of tokenized_reviews, where each review is padded with 0's \n",
502 | " or truncated to the input seq_length.\n",
503 | " '''\n",
504 | " \n",
505 | " # getting the correct rows x cols shape\n",
506 | " features = np.zeros((len(tokenized_reviews), seq_length), dtype=int)\n",
507 | "\n",
508 | " # for each review, I grab that review and \n",
509 | " for i, row in enumerate(tokenized_reviews):\n",
510 | " features[i, -len(row):] = np.array(row)[:seq_length]\n",
511 | " \n",
512 | " return features"
513 | ]
514 | },
515 | {
516 | "cell_type": "code",
517 | "execution_count": 17,
518 | "metadata": {},
519 | "outputs": [
520 | {
521 | "name": "stdout",
522 | "output_type": "stream",
523 | "text": [
524 | "[[ 0 0 0 0 0 0 0 0]\n",
525 | " [ 0 0 0 0 0 0 0 0]\n",
526 | " [ 16483 26 0 12 106210 0 1698 22]\n",
527 | " [ 1935 1326 12 0 1403 60 3921 2019]\n",
528 | " [ 0 0 0 0 0 0 0 0]\n",
529 | " [ 0 0 0 0 0 0 0 0]\n",
530 | " [ 0 0 0 0 0 0 0 0]\n",
531 | " [ 0 0 0 0 0 0 0 0]\n",
532 | " [ 0 0 0 0 0 0 0 0]\n",
533 | " [ 56 4365 8 270 119 756 247 159]\n",
534 | " [ 0 0 0 0 0 0 0 0]\n",
535 | " [ 0 0 0 0 0 0 0 0]\n",
536 | " [ 0 0 0 0 0 0 0 0]\n",
537 | " [ 9 104 1428 16 0 60 65033 9622]\n",
538 | " [ 0 25 13619 11902 7445 10397 179 4]\n",
539 | " [ 0 0 0 0 0 0 0 0]\n",
540 | " [ 9 208 18994 66850 121241 212263 0 87397]\n",
541 | " [ 0 0 0 0 0 0 0 0]\n",
542 | " [ 38 165 66850 121241 13241 25231 88 3]\n",
543 | " [ 9 661 3 675 67 3 81 61]]\n"
544 | ]
545 | }
546 | ],
547 | "source": [
548 | "# Test your implementation!\n",
549 | "\n",
550 | "seq_length = 200\n",
551 | "\n",
552 | "features = pad_features(tokenized_reviews, seq_length=seq_length)\n",
553 | "\n",
554 | "## test statements - do not change - ##\n",
555 | "assert len(features)==len(tokenized_reviews), \"Features should have as many rows as reviews.\"\n",
556 | "assert len(features[0])==seq_length, \"Each feature row should contain seq_length values.\"\n",
557 | "\n",
558 | "# print first 8 values of the first 20 batches \n",
559 | "print(features[:20,:8])"
560 | ]
561 | },
562 | {
563 | "cell_type": "markdown",
564 | "metadata": {},
565 | "source": [
566 | "---\n",
567 | "## Training, Validation, and Test Data\n",
568 | "\n",
569 | "With the data in nice shape, I'll split it into training, validation, and test sets.\n",
570 | "\n",
571 | "In the below code, I am creating features (x) and labels (y). \n",
572 | "* The split fraction, `split_frac` defines the fraction of data to **keep** in the training set. Usually this is set to 0.8 or 0.9. \n",
573 | "* Whatever data is left is split in half to create the validation and test data."
574 | ]
575 | },
576 | {
577 | "cell_type": "code",
578 | "execution_count": 18,
579 | "metadata": {},
580 | "outputs": [
581 | {
582 | "name": "stdout",
583 | "output_type": "stream",
584 | "text": [
585 | "\t\t\tFeature Shapes:\n",
586 | "Train set: \t\t(20000, 200) \n",
587 | "Validation set: \t(2500, 200) \n",
588 | "Test set: \t\t(2500, 200)\n"
589 | ]
590 | }
591 | ],
592 | "source": [
593 | "split_frac = 0.8\n",
594 | "\n",
595 | "## split data into training, validation, and test data (features and labels, x and y)\n",
596 | "\n",
597 | "split_idx = int(len(features)*split_frac)\n",
598 | "train_x, remaining_x = features[:split_idx], features[split_idx:]\n",
599 | "train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]\n",
600 | "\n",
601 | "test_idx = int(len(remaining_x)*0.5)\n",
602 | "val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]\n",
603 | "val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]\n",
604 | "\n",
605 | "## print out the shapes of your resultant feature data\n",
606 | "print(\"\\t\\t\\tFeature Shapes:\")\n",
607 | "print(\"Train set: \\t\\t{}\".format(train_x.shape), \n",
608 | " \"\\nValidation set: \\t{}\".format(val_x.shape),\n",
609 | " \"\\nTest set: \\t\\t{}\".format(test_x.shape))"
610 | ]
611 | },
612 | {
613 | "cell_type": "markdown",
614 | "metadata": {},
615 | "source": [
616 | "**Check your work**\n",
617 | "\n",
618 | "With train, validation, and test fractions equal to 0.8, 0.1, 0.1, respectively, the final, feature data shapes should look like:\n",
619 | "```\n",
620 | " Feature Shapes:\n",
621 | "Train set: \t\t (20000, 200) \n",
622 | "Validation set: \t(2500, 200) \n",
623 | "Test set: \t\t (2500, 200)\n",
624 | "```"
625 | ]
626 | },
627 | {
628 | "cell_type": "markdown",
629 | "metadata": {},
630 | "source": [
631 | "## DataLoaders and Batching\n",
632 | "\n",
633 | "After creating training, test, and validation data, I can create DataLoaders for this data by following two steps:\n",
634 | "1. Create a known format for accessing our data, using [TensorDataset](https://pytorch.org/docs/stable/data.html#) which takes in an input set of data and a target set of data with the same first dimension, and creates a dataset.\n",
635 | "2. Create DataLoaders and batch our training, validation, and test Tensor datasets.\n",
636 | "\n",
637 | "```\n",
638 | "train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))\n",
639 | "train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)\n",
640 | "```\n",
641 | "\n",
642 | "This is an alternative to creating a generator function for batching our data into full batches."
643 | ]
644 | },
645 | {
646 | "cell_type": "code",
647 | "execution_count": 19,
648 | "metadata": {},
649 | "outputs": [],
650 | "source": [
651 | "import torch\n",
652 | "from torch.utils.data import TensorDataset, DataLoader\n",
653 | "\n",
654 | "# create Tensor datasets\n",
655 | "train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))\n",
656 | "valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))\n",
657 | "test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))\n",
658 | "\n",
659 | "# dataloaders\n",
660 | "batch_size = 50\n",
661 | "\n",
662 | "# shuffling and batching data\n",
663 | "train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)\n",
664 | "valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)\n",
665 | "test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)"
666 | ]
667 | },
668 | {
669 | "cell_type": "markdown",
670 | "metadata": {},
671 | "source": [
672 | "---\n",
673 | "# Sentiment Network with PyTorch\n",
674 | "\n",
675 | "The complete model is made of a few layers:\n",
676 | "\n",
677 | "**1. An [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding)**\n",
678 | "* This converts our word tokens (integers) into embedded vectors of a specific size.\n",
679 | "* In this case, the vectors/weights of this layer will come from a **pretrained** lookup table. \n",
680 | "\n",
681 | "**2. A few [convolutional layers](https://pytorch.org/docs/stable/nn.html#conv1d)**\n",
682 | "* These are defined by an input size, number of filters/feature maps to output, and a kernel size.\n",
683 | "* The output of these layers will go through a ReLu activation function and pooling layer in the `forward` function.\n",
684 | "\n",
685 | "**3. A fully-connected, output layer**\n",
686 | "* This maps the convolutional layer outputs to a desired output_size (1 sentiment class).\n",
687 | "\n",
688 | "**4. A sigmoid activation layer**\n",
689 | "* This turns the output logit into a value 0-1; a class score.\n",
690 | "\n",
691 | "There is also a dropout layer, which will prevent overfitting, placed between the convolutional outputs and the final, fully-connected layer.\n",
692 | "\n",
693 | "
\n",
694 | "\n",
695 | "*Image from the original paper, [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/pdf/1408.5882.pdf).*\n",
696 | "\n",
697 | "### The Embedding Layer\n",
698 | "\n",
699 | "The embedding layer comes from our pre-trained `embed_lookup` model. By default, the weights of this layer are set to the vectors from the pre-trained model and frozen, so it will just be used as a lookup table. You could train your own embedding layer here, but it will speed up the training process to use a pre-trained model.\n",
700 | "\n",
701 | "### The Convolutional Layer(s)\n",
702 | "\n",
703 | "I am creating three convolutional layers, which will have kernel_sizes of (3, 300), (4, 300), and (5, 300); to look at 3-, 4-, and 5- sequences of word embeddings at a time. Each of these three layers will produce 100 filtered outputs. This is following the layer conventions in the paper, [CNNs for Sentence Classification](https://arxiv.org/abs/1408.5882).\n",
704 | "\n",
705 | "> The kernels only move in one dimension: down a sequence of word embeddings. In other words, these kernels move along a sequence of words, in time!\n",
706 | "\n",
707 | "### Maxpooling Layers\n",
708 | "\n",
709 | "In the `forward` function, I am applying a ReLu activation to the outputs of all convolutional layers and a maxpooling layer over the input sequence dimension. The maxpooling layer will get us an indication of whether some high-level text feature was found. \n",
710 | "\n",
711 | "> After moving through 3 convolutional layers with 100 filtered outputs each, these layers will output 300 values that can be sent to a final, fully-connected, classification layer."
712 | ]
713 | },
714 | {
715 | "cell_type": "code",
716 | "execution_count": 20,
717 | "metadata": {},
718 | "outputs": [
719 | {
720 | "name": "stdout",
721 | "output_type": "stream",
722 | "text": [
723 | "No GPU available, training on CPU.\n"
724 | ]
725 | }
726 | ],
727 | "source": [
728 | "# First checking if GPU is available\n",
729 | "train_on_gpu=torch.cuda.is_available()\n",
730 | "\n",
731 | "if(train_on_gpu):\n",
732 | " print('Training on GPU.')\n",
733 | "else:\n",
734 | " print('No GPU available, training on CPU.')"
735 | ]
736 | },
737 | {
738 | "cell_type": "code",
739 | "execution_count": 21,
740 | "metadata": {},
741 | "outputs": [],
742 | "source": [
743 | "import torch.nn as nn\n",
744 | "import torch.nn.functional as F\n",
745 | "\n",
746 | "class SentimentCNN(nn.Module):\n",
747 | " \"\"\"\n",
748 | " The embedding layer + CNN model that will be used to perform sentiment analysis.\n",
749 | " \"\"\"\n",
750 | "\n",
751 | " def __init__(self, embed_model, vocab_size, output_size, embedding_dim,\n",
752 | " num_filters=100, kernel_sizes=[3, 4, 5], freeze_embeddings=True, drop_prob=0.5):\n",
753 | " \"\"\"\n",
754 | " Initialize the model by setting up the layers.\n",
755 | " \"\"\"\n",
756 | " super(SentimentCNN, self).__init__()\n",
757 | "\n",
758 | " # set class vars\n",
759 | " self.num_filters = num_filters\n",
760 | " self.embedding_dim = embedding_dim\n",
761 | " \n",
762 | " # 1. embedding layer\n",
763 | " self.embedding = nn.Embedding(vocab_size, embedding_dim)\n",
764 | " # set weights to pre-trained\n",
765 | " self.embedding.weight = nn.Parameter(torch.from_numpy(embed_model.vectors)) # all vectors\n",
766 | " # (optional) freeze embedding weights\n",
767 | " if freeze_embeddings:\n",
768 | " self.embedding.requires_grad = False\n",
769 | " \n",
770 | " # 2. convolutional layers\n",
771 | " self.convs_1d = nn.ModuleList([\n",
772 | " nn.Conv2d(1, num_filters, (k, embedding_dim), padding=(k-2,0)) \n",
773 | " for k in kernel_sizes])\n",
774 | " \n",
775 | " # 3. final, fully-connected layer for classification\n",
776 | " self.fc = nn.Linear(len(kernel_sizes) * num_filters, output_size) \n",
777 | " \n",
778 | " # 4. dropout and sigmoid layers\n",
779 | " self.dropout = nn.Dropout(drop_prob)\n",
780 | " self.sig = nn.Sigmoid()\n",
781 | " \n",
782 | " \n",
783 | " def conv_and_pool(self, x, conv):\n",
784 | " \"\"\"\n",
785 | " Convolutional + max pooling layer\n",
786 | " \"\"\"\n",
787 | " # squeeze last dim to get size: (batch_size, num_filters, conv_seq_length)\n",
788 | " # conv_seq_length will be ~ 200\n",
789 | " x = F.relu(conv(x)).squeeze(3)\n",
790 | " \n",
791 | " # 1D pool over conv_seq_length\n",
792 | " # squeeze to get size: (batch_size, num_filters)\n",
793 | " x_max = F.max_pool1d(x, x.size(2)).squeeze(2)\n",
794 | " return x_max\n",
795 | "\n",
796 | " def forward(self, x):\n",
797 | " \"\"\"\n",
798 | " Defines how a batch of inputs, x, passes through the model layers.\n",
799 | " Returns a single, sigmoid-activated class score as output.\n",
800 | " \"\"\"\n",
801 | " # embedded vectors\n",
802 | " embeds = self.embedding(x) # (batch_size, seq_length, embedding_dim)\n",
803 | " # embeds.unsqueeze(1) creates a channel dimension that conv layers expect\n",
804 | " embeds = embeds.unsqueeze(1)\n",
805 | " \n",
806 | " # get output of each conv-pool layer\n",
807 | " conv_results = [self.conv_and_pool(embeds, conv) for conv in self.convs_1d]\n",
808 | " \n",
809 | " # concatenate results and add dropout\n",
810 | " x = torch.cat(conv_results, 1)\n",
811 | " x = self.dropout(x)\n",
812 | " \n",
813 | " # final logit\n",
814 | " logit = self.fc(x) \n",
815 | " \n",
816 | " # sigmoid-activated --> a class score\n",
817 | " return self.sig(logit)\n",
818 | " "
819 | ]
820 | },
821 | {
822 | "cell_type": "markdown",
823 | "metadata": {},
824 | "source": [
825 | "## Instantiate the network\n",
826 | "\n",
827 | "Here, I'll instantiate the network. First up, defining the hyperparameters.\n",
828 | "\n",
829 | "* `vocab_size`: Size of our vocabulary or the range of values for our input, word tokens.\n",
830 | "* `output_size`: Size of our desired output; the number of class scores we want to output (pos/neg).\n",
831 | "* `embedding_dim`: Number of columns in the embedding lookup table; size of our embeddings.\n",
832 | "* `num_filters`: Number of filters that each convolutional layer produces as output.\n",
833 | "* `filter_sizes`: A list of kernel sizes; one convolutional layer will be created for each kernel size.\n",
834 | "\n",
835 | "Any parameters I did not list, are left as the default value."
836 | ]
837 | },
838 | {
839 | "cell_type": "code",
840 | "execution_count": 22,
841 | "metadata": {},
842 | "outputs": [
843 | {
844 | "name": "stdout",
845 | "output_type": "stream",
846 | "text": [
847 | "SentimentCNN(\n",
848 | " (embedding): Embedding(299567, 300)\n",
849 | " (convs_1d): ModuleList(\n",
850 | " (0): Conv2d(1, 100, kernel_size=(3, 300), stride=(1, 1), padding=(1, 0))\n",
851 | " (1): Conv2d(1, 100, kernel_size=(4, 300), stride=(1, 1), padding=(2, 0))\n",
852 | " (2): Conv2d(1, 100, kernel_size=(5, 300), stride=(1, 1), padding=(3, 0))\n",
853 | " )\n",
854 | " (fc): Linear(in_features=300, out_features=1, bias=True)\n",
855 | " (dropout): Dropout(p=0.5)\n",
856 | " (sig): Sigmoid()\n",
857 | ")\n"
858 | ]
859 | }
860 | ],
861 | "source": [
862 | "# Instantiate the model w/ hyperparams\n",
863 | "\n",
864 | "vocab_size = len(pretrained_words)\n",
865 | "output_size = 1 # binary class (1 or 0)\n",
866 | "embedding_dim = len(embed_lookup[pretrained_words[0]]) # 300-dim vectors\n",
867 | "num_filters = 100\n",
868 | "kernel_sizes = [3, 4, 5]\n",
869 | "\n",
870 | "net = SentimentCNN(embed_lookup, vocab_size, output_size, embedding_dim,\n",
871 | " num_filters, kernel_sizes)\n",
872 | "\n",
873 | "print(net)"
874 | ]
875 | },
876 | {
877 | "cell_type": "markdown",
878 | "metadata": {},
879 | "source": [
880 | "---\n",
881 | "## Training\n",
882 | "\n",
883 | "Below is some training code, which iterates over all of the training data, records some loss statistics and performs backpropagation + optimization steps to update the weights of this network.\n",
884 | "\n",
885 | ">I'll also be using a binary cross entropy loss, which is designed to work with a single Sigmoid output. [BCELoss](https://pytorch.org/docs/stable/nn.html#bceloss), or **Binary Cross Entropy Loss**, applies cross entropy loss to a single value between 0 and 1.\n",
886 | "\n",
887 | "I also have some training hyperparameters:\n",
888 | "\n",
889 | "* `lr`: Learning rate for the optimizer.\n",
890 | "* `epochs`: Number of times to iterate through the training dataset."
891 | ]
892 | },
893 | {
894 | "cell_type": "code",
895 | "execution_count": 23,
896 | "metadata": {},
897 | "outputs": [],
898 | "source": [
899 | "# loss and optimization functions\n",
900 | "lr=0.001\n",
901 | "\n",
902 | "criterion = nn.BCELoss()\n",
903 | "optimizer = torch.optim.Adam(net.parameters(), lr=lr)\n"
904 | ]
905 | },
906 | {
907 | "cell_type": "code",
908 | "execution_count": 24,
909 | "metadata": {},
910 | "outputs": [],
911 | "source": [
912 | "# training loop\n",
913 | "def train(net, train_loader, epochs, print_every=100):\n",
914 | "\n",
915 | " # move model to GPU, if available\n",
916 | " if(train_on_gpu):\n",
917 | " net.cuda()\n",
918 | "\n",
919 | " counter = 0 # for printing\n",
920 | " \n",
921 | " # train for some number of epochs\n",
922 | " net.train()\n",
923 | " for e in range(epochs):\n",
924 | "\n",
925 | " # batch loop\n",
926 | " for inputs, labels in train_loader:\n",
927 | " counter += 1\n",
928 | "\n",
929 | " if(train_on_gpu):\n",
930 | " inputs, labels = inputs.cuda(), labels.cuda()\n",
931 | "\n",
932 | " # zero accumulated gradients\n",
933 | " net.zero_grad()\n",
934 | "\n",
935 | " # get the output from the model\n",
936 | " output = net(inputs)\n",
937 | "\n",
938 | " # calculate the loss and perform backprop\n",
939 | " loss = criterion(output.squeeze(), labels.float())\n",
940 | " loss.backward()\n",
941 | " optimizer.step()\n",
942 | "\n",
943 | " # loss stats\n",
944 | " if counter % print_every == 0:\n",
945 | " # Get validation loss\n",
946 | " val_losses = []\n",
947 | " net.eval()\n",
948 | " for inputs, labels in valid_loader:\n",
949 | "\n",
950 | " if(train_on_gpu):\n",
951 | " inputs, labels = inputs.cuda(), labels.cuda()\n",
952 | "\n",
953 | " output = net(inputs)\n",
954 | " val_loss = criterion(output.squeeze(), labels.float())\n",
955 | "\n",
956 | " val_losses.append(val_loss.item())\n",
957 | "\n",
958 | " net.train()\n",
959 | " print(\"Epoch: {}/{}...\".format(e+1, epochs),\n",
960 | " \"Step: {}...\".format(counter),\n",
961 | " \"Loss: {:.6f}...\".format(loss.item()),\n",
962 | " \"Val Loss: {:.6f}\".format(np.mean(val_losses)))"
963 | ]
964 | },
965 | {
966 | "cell_type": "code",
967 | "execution_count": 25,
968 | "metadata": {},
969 | "outputs": [
970 | {
971 | "name": "stdout",
972 | "output_type": "stream",
973 | "text": [
974 | "Epoch: 1/2... Step: 100... Loss: 0.451722... Val Loss: 0.446736\n",
975 | "Epoch: 1/2... Step: 200... Loss: 0.435447... Val Loss: 0.365078\n",
976 | "Epoch: 1/2... Step: 300... Loss: 0.333672... Val Loss: 0.344555\n",
977 | "Epoch: 1/2... Step: 400... Loss: 0.319042... Val Loss: 0.328191\n",
978 | "Epoch: 2/2... Step: 500... Loss: 0.287158... Val Loss: 0.343141\n",
979 | "Epoch: 2/2... Step: 600... Loss: 0.300172... Val Loss: 0.364031\n",
980 | "Epoch: 2/2... Step: 700... Loss: 0.183973... Val Loss: 0.353891\n",
981 | "Epoch: 2/2... Step: 800... Loss: 0.162030... Val Loss: 0.354852\n"
982 | ]
983 | }
984 | ],
985 | "source": [
986 | "# training params\n",
987 | "\n",
988 | "epochs = 2 # this is approx where I noticed the validation loss stop decreasing\n",
989 | "print_every = 100\n",
990 | "\n",
991 | "train(net, train_loader, epochs, print_every=print_every)"
992 | ]
993 | },
994 | {
995 | "cell_type": "markdown",
996 | "metadata": {},
997 | "source": [
998 | "---\n",
999 | "## Testing\n",
1000 | "\n",
1001 | "There are a few ways to test this network.\n",
1002 | "\n",
1003 | "* **Test data performance:** First, I'll see how our trained model performs on all of the defined test_data, above; I'll calculate the average loss and accuracy over the test data.\n",
1004 | "\n",
1005 | "* **Inference on user-generated data:** Second, I'll see if I can input just one example review at a time (without a label), and see what the trained model predicts. Looking at new, user input data like this, and predicting an output label, is called **inference**."
1006 | ]
1007 | },
1008 | {
1009 | "cell_type": "code",
1010 | "execution_count": 26,
1011 | "metadata": {},
1012 | "outputs": [
1013 | {
1014 | "name": "stdout",
1015 | "output_type": "stream",
1016 | "text": [
1017 | "Test loss: 0.376\n",
1018 | "Test accuracy: 0.840\n"
1019 | ]
1020 | }
1021 | ],
1022 | "source": [
1023 | "# Get test data loss and accuracy\n",
1024 | "\n",
1025 | "test_losses = [] # track loss\n",
1026 | "num_correct = 0\n",
1027 | "\n",
1028 | "\n",
1029 | "net.eval()\n",
1030 | "# iterate over test data\n",
1031 | "for inputs, labels in test_loader:\n",
1032 | "\n",
1033 | " if(train_on_gpu):\n",
1034 | " inputs, labels = inputs.cuda(), labels.cuda()\n",
1035 | " \n",
1036 | " # get predicted outputs\n",
1037 | " output = net(inputs)\n",
1038 | " \n",
1039 | " # calculate loss\n",
1040 | " test_loss = criterion(output.squeeze(), labels.float())\n",
1041 | " test_losses.append(test_loss.item())\n",
1042 | " \n",
1043 | " # convert output probabilities to predicted class (0 or 1)\n",
1044 | " pred = torch.round(output.squeeze()) # rounds to the nearest integer\n",
1045 | " \n",
1046 | " # compare predictions to true label\n",
1047 | " correct_tensor = pred.eq(labels.float().view_as(pred))\n",
1048 | " correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())\n",
1049 | " num_correct += np.sum(correct)\n",
1050 | "\n",
1051 | "\n",
1052 | "# -- stats! -- ##\n",
1053 | "# avg test loss\n",
1054 | "print(\"Test loss: {:.3f}\".format(np.mean(test_losses)))\n",
1055 | "\n",
1056 | "# accuracy over all test data\n",
1057 | "test_acc = num_correct/len(test_loader.dataset)\n",
1058 | "print(\"Test accuracy: {:.3f}\".format(test_acc))"
1059 | ]
1060 | },
1061 | {
1062 | "cell_type": "markdown",
1063 | "metadata": {},
1064 | "source": [
1065 | "### Inference on a test review\n",
1066 | "\n",
1067 | "You can change this test_review to any text that you want. Read it and think: is it pos or neg? Then see if your model predicts correctly!\n",
1068 | "\n",
1069 | "> The below `predict` code, takes in a trained `embed_lookup` table, a trained net, a plain text_review, and a sequence length, and prints out a custom statement for a positive or negative review!\n"
1070 | ]
1071 | },
1072 | {
1073 | "cell_type": "code",
1074 | "execution_count": 27,
1075 | "metadata": {},
1076 | "outputs": [],
1077 | "source": [
1078 | "from string import punctuation\n",
1079 | "\n",
1080 | "# helper function to process and tokenize a single review\n",
1081 | "def tokenize_review(embed_lookup, test_review):\n",
1082 | " test_review = test_review.lower() # lowercase\n",
1083 | " # get rid of punctuation\n",
1084 | " test_text = ''.join([c for c in test_review if c not in punctuation])\n",
1085 | "\n",
1086 | " # splitting by spaces\n",
1087 | " test_words = test_text.split()\n",
1088 | "\n",
1089 | " # tokens\n",
1090 | " tokenized_review = []\n",
1091 | " for word in test_words:\n",
1092 | " try:\n",
1093 | " idx = embed_lookup.vocab[word].index\n",
1094 | " except: \n",
1095 | " idx = 0\n",
1096 | " tokenized_review.append(idx)\n",
1097 | "\n",
1098 | " return tokenized_review\n"
1099 | ]
1100 | },
1101 | {
1102 | "cell_type": "code",
1103 | "execution_count": 28,
1104 | "metadata": {},
1105 | "outputs": [],
1106 | "source": [
1107 | "def predict(embed_lookup, net, test_review, sequence_length=200):\n",
1108 | " \"\"\"\n",
1109 | " Predict whether a given test_review has negative or positive sentiment.\n",
1110 | " \"\"\"\n",
1111 | " \n",
1112 | " net.eval()\n",
1113 | " \n",
1114 | " # tokenize review\n",
1115 | " test_ints = tokenize_review(embed_lookup, test_review)\n",
1116 | " \n",
1117 | " # pad tokenized sequence\n",
1118 | " seq_length=sequence_length\n",
1119 | " features = pad_features([test_ints], seq_length)\n",
1120 | " \n",
1121 | " # convert to tensor to pass into your model\n",
1122 | " feature_tensor = torch.from_numpy(features)\n",
1123 | " \n",
1124 | " batch_size = feature_tensor.size(0)\n",
1125 | " \n",
1126 | " if(train_on_gpu):\n",
1127 | " feature_tensor = feature_tensor.cuda()\n",
1128 | " \n",
1129 | " # get the output from the model\n",
1130 | " output = net(feature_tensor)\n",
1131 | " \n",
1132 | " # convert output probabilities to predicted class (0 or 1)\n",
1133 | " pred = torch.round(output.squeeze()) \n",
1134 | " # printing output value, before rounding\n",
1135 | " print('Prediction value, pre-rounding: {:.6f}'.format(output.item()))\n",
1136 | " \n",
1137 | " # print custom response\n",
1138 | " if(pred.item()==1):\n",
1139 | " print(\"Positive review detected!\")\n",
1140 | " else:\n",
1141 | " print(\"Negative review detected.\")\n",
1142 | " "
1143 | ]
1144 | },
1145 | {
1146 | "cell_type": "markdown",
1147 | "metadata": {},
1148 | "source": [
1149 | "### Test on pos/neg reviews\n",
1150 | "\n",
1151 | "Below, I test my code on both positive and negative reviews."
1152 | ]
1153 | },
1154 | {
1155 | "cell_type": "code",
1156 | "execution_count": 29,
1157 | "metadata": {},
1158 | "outputs": [],
1159 | "source": [
1160 | "# set hyperparams\n",
1161 | "seq_length=200 # good to use the length that was trained on\n"
1162 | ]
1163 | },
1164 | {
1165 | "cell_type": "code",
1166 | "execution_count": 30,
1167 | "metadata": {},
1168 | "outputs": [
1169 | {
1170 | "name": "stdout",
1171 | "output_type": "stream",
1172 | "text": [
1173 | "Prediction value, pre-rounding: 0.000775\n",
1174 | "Negative review detected.\n"
1175 | ]
1176 | }
1177 | ],
1178 | "source": [
1179 | "# negative test review\n",
1180 | "test_review_neg = 'The worst movie I have seen; acting was terrible and I want my money back. This movie had bad acting and the dialogue was slow.'\n",
1181 | "\n",
1182 | "# test negative review\n",
1183 | "predict(embed_lookup, net, test_review_neg, seq_length)"
1184 | ]
1185 | },
1186 | {
1187 | "cell_type": "code",
1188 | "execution_count": 31,
1189 | "metadata": {},
1190 | "outputs": [
1191 | {
1192 | "name": "stdout",
1193 | "output_type": "stream",
1194 | "text": [
1195 | "Prediction value, pre-rounding: 0.992333\n",
1196 | "Positive review detected!\n"
1197 | ]
1198 | }
1199 | ],
1200 | "source": [
1201 | "# positive test review\n",
1202 | "test_review_pos = 'This movie had the best acting and the dialogue was so good. I loved it.'\n",
1203 | "\n",
1204 | "predict(embed_lookup, net, test_review_pos, seq_length)"
1205 | ]
1206 | },
1207 | {
1208 | "cell_type": "markdown",
1209 | "metadata": {},
1210 | "source": [
1211 | "## Try out test reviews of your own!\n",
1212 | "\n",
1213 | "Now that you have a trained model and a predict function, you can pass in _any_ kind of text and this model will predict whether the text has a positive or negative sentiment.\n",
1214 | "\n",
1215 | "---\n",
1216 | "## Further reading\n",
1217 | "\n",
1218 | "More than text classification, CNNs are used to analyze sequential data in a number of ways! Here are a couple of papers and applications that I find really interesting:\n",
1219 | "* CNN for semantic representations and **search query retrieval**, [paper (Microsoft)](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/www2014_cdssm_p07.pdf).\n",
1220 | "* CNN for **genetic mutation detection**, [paper (Nature)](https://www.nature.com/articles/s41467-019-09027-x).\n",
1221 | "* CNN for classifying [whale sounds](https://ai.googleblog.com/2018/10/acoustic-detection-of-humpback-whales.html) via spectogram and for [**audio classification**, generally (Google AI)](https://ai.google/research/pubs/pub45611)."
1222 | ]
1223 | }
1224 | ],
1225 | "metadata": {
1226 | "kernelspec": {
1227 | "display_name": "Python 3",
1228 | "language": "python",
1229 | "name": "python3"
1230 | },
1231 | "language_info": {
1232 | "codemirror_mode": {
1233 | "name": "ipython",
1234 | "version": 3
1235 | },
1236 | "file_extension": ".py",
1237 | "mimetype": "text/x-python",
1238 | "name": "python",
1239 | "nbconvert_exporter": "python",
1240 | "pygments_lexer": "ipython3",
1241 | "version": "3.6.8"
1242 | }
1243 | },
1244 | "nbformat": 4,
1245 | "nbformat_minor": 2
1246 | }
1247 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 Cezanne Camacho
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # CNN for Text Classification
2 |
3 | A PyTorch CNN for classifying the sentiment of movie reviews, based on the paper [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882) by Yoon Kim (2014).
4 |
5 | The task of text classification has typically been done with an RNN, which accepts a sequence of words as input and has a hidden state that is dependent on that sequence and acts as a kind of memory. This example shows how you can utilize convolutional layers to find patterns in sequences of word embeddings and create an effective text classifier using a CNN-based approach!
6 |
7 | 
8 |
9 | *Image from the original paper, [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882).*
10 |
11 | If you'd like to work with this code locally, you may follow the instructions (as needed) below! These installation instructions assume you have installed miniconda, but if you have not, you can download the latest version [here](https://conda.io/en/latest/miniconda.html).
12 |
13 | ---
14 |
15 | ## Create and Activate the Environment
16 |
17 | For Windows users, the following commands need to be executed from the **Anaconda prompt** as opposed to a regular Windows terminal window. For Mac, a normal terminal window will work.
18 |
19 | #### Git and version control
20 | These instructions also assume you have `git` installed for working with Github from a terminal window, but if you do not, you can download that first with the command:
21 | ```
22 | conda install git
23 | ```
24 |
25 | **Now, we're ready to create a local environment!**
26 |
27 | 1. Clone the repository, and navigate to the downloaded folder. This may take a minute or two to clone due to the included image data.
28 | ```
29 | git clone https://github.com/cezannec/CNN_Text_Classification.git
30 | cd CNN_Text_Classification
31 | ```
32 |
33 | 2. Create (and activate) a new environment, named `classification-env` with Python 3. If prompted to proceed with the install `(Proceed [y]/n)` type y.
34 |
35 | - __Linux__ or __Mac__:
36 | ```
37 | conda create -n classification-env python=3
38 | source activate classification-env
39 | ```
40 | - __Windows__:
41 | ```
42 | conda create --name classification-env python=3
43 | activate classification-env
44 | ```
45 |
46 | At this point your command line should look something like: `(classification-env) :CNN_Text_Classification $`. The `(classification-env)` indicates that your environment has been activated, and you can proceed with further package installations.
47 |
48 | 3. Install PyTorch and torchvision; this should install the latest version of PyTorch.
49 |
50 | - __Linux__ or __Mac__:
51 | ```
52 | conda install pytorch torchvision -c pytorch
53 | ```
54 | - __Windows__:
55 | ```
56 | conda install pytorch -c pytorch
57 | pip install torchvision
58 | ```
59 |
60 | 4. Install a few required pip packages, which are specified in the requirements text file (including gensim).
61 | ```
62 | pip install -r requirements.txt
63 | ```
64 |
65 | 5. That's it!
66 |
67 | Now all of the `classification-env` libraries are available to you. Assuming your `classification-env` environment is still activated, you can navigate to the main repo and start looking at the notebooks:
68 |
69 | ```
70 | cd
71 | cd CNN_Text_Classification
72 | jupyter notebook
73 | ```
74 |
75 | To exit the environment when you have completed your work session, simply close the terminal window.
76 |
--------------------------------------------------------------------------------
/notebook_ims/complete_embedding_CNN.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cezannec/CNN_Text_Classification/20570fb02eeef2d9aa1725c3af4ec55525067992/notebook_ims/complete_embedding_CNN.png
--------------------------------------------------------------------------------
/notebook_ims/embedding_lookup_table.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cezannec/CNN_Text_Classification/20570fb02eeef2d9aa1725c3af4ec55525067992/notebook_ims/embedding_lookup_table.png
--------------------------------------------------------------------------------
/notebook_ims/reviews_ex.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cezannec/CNN_Text_Classification/20570fb02eeef2d9aa1725c3af4ec55525067992/notebook_ims/reviews_ex.png
--------------------------------------------------------------------------------
/notebook_ims/two_vectors.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cezannec/CNN_Text_Classification/20570fb02eeef2d9aa1725c3af4ec55525067992/notebook_ims/two_vectors.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | jupyter
2 | matplotlib
3 | pandas
4 | numpy
5 | gensim
6 | pillow
7 | tqdm
8 | h5py
9 | ipykernel
10 | bokeh
11 | pickleshare
12 |
--------------------------------------------------------------------------------
/word2vec_model/readme_download_word2vecmodel.txt:
--------------------------------------------------------------------------------
1 | You can download the "slim" Google News embeddings from this Github repository
2 | (which uses `git lfs` to store the large model file):
3 | https://github.com/eyaler/word2vec-slim
4 |
5 | You can download the complete embeddings directly from Google's archive, here:
6 | https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
7 |
8 | You are also free to use a different pre-trained model
9 | or train an embedding layer from scratch!
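
As a hedged example (assuming gensim is installed, per requirements.txt), a
different pre-trained model can be loaded with gensim's downloader API:

    import gensim.downloader as api
    embed_lookup = api.load("glove-wiki-gigaword-100")  # 100-dim GloVe vectors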
10 |
--------------------------------------------------------------------------------