├── Evaluation.ipynb
├── README.md
├── cnn_model.py
├── images
│   ├── yelp_comparison.png
│   └── yelp_confusion.png
└── utils.py
/README.md:
--------------------------------------------------------------------------------
1 | # Text classification with Convolutional Neural Networks (CNN)
2 | This project demonstrates how to classify text documents / sentences with CNNs. You can find a great introduction to a similar approach in the blog posts by [Denny Britz](http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/) and [Keras](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html). My approach is quite similar to Denny's and to the original paper by Yoon Kim [1]. You can find Yoon Kim's implementation on [GitHub](https://github.com/yoonkim/CNN_sentence) as well.
3 |
4 | ## Changes
5 |
6 | ### *** UPDATE *** - September 10th, 2021
7 | In this update I fixed some typos and improved the Jupyter notebook. You can run the notebook without any manual setup; the required data is downloaded automatically.
8 |
9 | - Add Yelp Polarity Dataset (Tensorflow-Dataset)
10 | - Add utils.py for moving code out of notebook
11 | - Add blank char in `ALPHABET` variable
12 |
13 | ### *** UPDATE *** - December 15th, 2019
14 | I’ve updated the code to TensorFlow 2.
15 | In addition, I made some changes to the Jupyter notebook:
16 | - Remove Yelp dataset
17 | - Add TensorFlow Dataset for IMDB
18 |
19 | ### *** UPDATE *** - May 17th, 2019
20 |
21 | Model:
22 | - Combine word-level with character-based input. The char input is optional and can be used for further research.
23 | - Change padding of conv-layer from same to valid.
24 | - Add average pooling after conv-layer and combine it with existing max pooling.
25 |
26 | Notebook:
27 | - Add CHAR support
28 | - Commented out preprocessing
29 | - Add scikit-learn example at the end for comparison between deep learning and machine learning.
30 |
31 | Using characters in addition to words did not improve the results, but it can be a good starting point for further research.
32 | I kept the model as simple as possible and reused the existing methods for the character input. As described in the paper co-authored by Yann LeCun [3], stacking several conv-layers on top of each other could improve performance.
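
A minimal sketch of how text can be mapped to character indices for the optional char input, following the same idea as `char_vectorizer` in `utils.py`. The alphabet string below is a hypothetical one chosen for this illustration (the notebook defines its own `ALPHABET`); index 0 is assumed to stand for padding and unknown characters.

```python
# Hypothetical alphabet for illustration only; the notebook defines its own ALPHABET.
ALPHABET = "tes nry"

# Index 0 is reserved for padding and for characters outside the alphabet.
char2idx = {c: i + 1 for i, c in enumerate(ALPHABET)}


def vectorize_chars(docs, char_max_length, char2idx):
    """Map each document to a fixed-length list of character indices."""
    vectors = []
    for doc in docs:
        row = [0] * char_max_length
        # Longer documents are truncated, shorter ones stay zero-padded.
        for i, c in enumerate(doc[:char_max_length]):
            row[i] = char2idx.get(c, 0)
        vectors.append(row)
    return vectors


print(vectorize_chars(["test entry"], 15, char2idx))
# [[1, 2, 3, 1, 4, 2, 5, 1, 6, 7, 0, 0, 0, 0, 0]]
```

Because every document ends up with the same length, the result can be fed to an embedding layer whose `input_dim` is the alphabet size plus one.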
33 |
34 | ### *** UPDATE *** - December 3rd, 2018
35 | - Implemented the model as a class (cnn_model.CNN)
36 | - Replaced max pooling by global max pooling
37 | - Replaced conv1d by separable convolutions
38 | - Added dense + dropout after each global max pooling
39 | - Removed flatten + dropout after concatenation
40 | - Removed L2 regularizer on convolution layers
41 | - Support multiclass classification
42 |
43 | I also made some changes to the [evaluation notebook](https://github.com/cmasch/cnn-text-classification/blob/master/Evaluation.ipynb). It seems that cleaning the text by removing stopwords, numerical values and punctuation removes important features too. Therefore I don't use these preprocessing steps anymore. As optimizer I switched from Adadelta to Adam because it converges to an optimum even faster.
44 |
45 | These are just small changes, but they bring a significant improvement, as you can see below.
46 |
47 | #### Comparing old vs new
48 | For the Yelp dataset I increased the training samples from 200000 to 600000 and the test samples from 50000 to 200000.
49 |
50 | | Dataset | Old (loss / acc) | New (loss / acc) |
51 | | :---: | :---: | :---: |
52 | | **Polarity** | 0.4688 / 0.7974 | 0.4058 / 0.8135 |
53 | | **IMDB** | 0.2994 / 0.8896 | 0.2509 / 0.9007 |
54 | | **Yelp** | 0.1793 / 0.9393 | 0.0997 / 0.9631 |
55 | | **Yelp - Multi** | 0.9356 / 0.6051 | 0.8076 / 0.6487 |
56 |
57 | #### Next steps:
58 | - Combine word-level model with a character-based input. Working on characters has the advantage that misspellings and emoticons may be naturally learnt.
59 | - Add an attention layer on top of the recurrent / convolution layer. I already tested this without any improvement, but I'm still working on it.
60 |
61 | ## Evaluation
62 | For evaluation I used different datasets that are freely available. They differ in the number of documents and in content length. What they all have in common is that there are two classes to predict (positive / negative). I would like to show how a CNN performs on ~10000 up to ~800000 documents while modifying only a few parameters.
63 |
64 | I used the following sets for evaluation:
65 | - [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/)
66 | The polarity dataset v1.0 has 10662 sentences. It's quite similar to traditional sentiment analysis of tweets because of the content length. I just split the data into train / validation (90% / 10%).
67 | - [IMDB movie review](http://ai.stanford.edu/~amaas/data/sentiment/)
68 | IMDB movie review has 25000 train and 25000 test documents. I split the training set into train / validation (80% / 20%) and used the test set for a final test.
69 | - [Yelp dataset 2017](https://www.yelp.com/dataset)
70 | This dataset contains a JSON file of nearly 5 million entries. For performance reasons I randomly sampled 600000 train and 200000 test documents from it. I labeled ratings with 1-2 stars as negative and 4-5 stars as positive. Ratings with 3 stars are not considered because of their neutrality. In addition, this subset contains only texts with more than 5 words. The texts are written in English, German, Spanish and many other languages. During training I used an 80% / 20% split (train / validation). If you are interested you can also check a small demo of the [embeddings](https://github.com/cmasch/word-embeddings-from-scratch) created from the training data.
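
The labeling rule for the Yelp subset can be sketched roughly as follows. This is only an illustration under assumptions: the function name `polarity_label` is hypothetical, and "more than 5 words" is read as at least 6 whitespace-separated tokens; the actual preprocessing may differ.

```python
def polarity_label(stars, text, min_words=6):
    """Return 0 (negative), 1 (positive) or None if the review is skipped."""
    # Assumption: "more than 5 words" means at least 6 whitespace-separated tokens.
    if len(text.split()) < min_words:
        return None
    if stars <= 2:
        return 0
    if stars >= 4:
        return 1
    # 3-star reviews are treated as neutral and dropped.
    return None


# Toy reviews invented for this example.
reviews = [
    (1, "the food was cold and the staff was rude"),
    (3, "it was okay but nothing special at all"),
    (5, "great place"),
    (5, "one of the best dinners we ever had"),
]
print([polarity_label(stars, text) for stars, text in reviews])
# [0, None, None, 1]
```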
71 |
72 | ## Model
73 | The implemented [model](https://github.com/cmasch/cnn-text-classification/blob/master/cnn_model.py) has multiple convolutional layers in parallel to obtain several features of one text. Because each convolution layer uses a different kernel size, the window size varies and the text is read with an n-gram approach. The default is 3 convolution layers with kernel sizes of 3, 4 and 5.
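
To make the n-gram reading concrete, here is a small pure-Python sketch (not part of the model code) that lists the token windows a kernel slides over, assuming stride 1 and no padding (`padding='valid'`). The helper name `ngram_windows` and the sample sentence are made up for this illustration.

```python
def ngram_windows(tokens, kernel_size):
    """List the token windows a kernel of this size covers (stride 1, no padding)."""
    return [tokens[i:i + kernel_size] for i in range(len(tokens) - kernel_size + 1)]


tokens = "the food was great but the service was slow".split()
for kernel_size in (3, 4, 5):
    windows = ngram_windows(tokens, kernel_size)
    # The number of windows is len(tokens) - kernel_size + 1.
    print(kernel_size, len(windows), windows[0])
# 3 7 ['the', 'food', 'was']
# 4 6 ['the', 'food', 'was', 'great']
# 5 5 ['the', 'food', 'was', 'great', 'but']
```

Each parallel channel therefore sees the same text at a different granularity, and global pooling reduces every channel to a fixed-size vector regardless of how many windows there are.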
74 |
75 | I also used pre-trained [GloVe](https://nlp.stanford.edu/projects/glove/) embeddings with 300-dimensional vectors trained on 6B tokens to show that unsupervised learning of words can have a positive effect on neural nets.
76 |
77 | ## Results
78 | For all runs I used filter sizes of [3,4,5], Adam as optimizer, a batch size of 100 and 10 epochs. Each configuration was run 5 times with a random state, and the reported loss / accuracy is the mean over these runs.
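
Averaging over several runs can be sketched like this. The numbers are made up purely for illustration and are not results from this project.

```python
from statistics import mean

# Made-up (loss, accuracy) pairs for five runs; NOT results from this project.
runs = [(0.40, 0.80), (0.42, 0.82), (0.38, 0.81), (0.41, 0.79), (0.39, 0.83)]

mean_loss = mean(loss for loss, _ in runs)
mean_acc = mean(acc for _, acc in runs)
print(f"{mean_loss:.4f} / {mean_acc:.4f}")
# 0.4000 / 0.8100
```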
79 |
80 | ### Sentence polarity dataset v1.0
81 | | Feature Maps | Embedding | Max Words / Sequence | Hidden Units | Dropout | Training (loss / acc) | Validation (loss / acc) |
82 | | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
83 | | [100,100,100] | GloVe 300 | 15000 / 35 | 64 | 0.4 | 0.3134 / 0.8642 | 0.4058 / 0.8135 |
84 | | [100,100,100] | 300 | 15000 / 35 | 64 | 0.4 | 0.4741 / 0.7753 | 0.4563 / 0.7807 |
85 |
86 | ### IMDB
87 | | Feature Maps | Embedding | Max Words / Sequence | Hidden Units | Dropout | Training (loss / acc) | Validation (loss / acc) | Test (loss / acc) |
88 | | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
89 | | [200,200,200] | GloVe 300 | 15000 / 500 | 200 | 0.4 | 0.1735 / 0.9332 | 0.2417 / 0.9064 | 0.2509 / 0.9007 |
90 | | [200,200,200] | 300 | 15000 / 500 | 200 | 0.4 | 0.2425 / 0.9037 | 0.2554 / 0.8964 | 0.2632 / 0.8920 |
91 |
92 | ### Yelp Polarity Dataset (2015)
93 | | Feature Maps | Embedding | Max Words / Sequence | Hidden Units | Dropout | Training (loss / acc) | Validation (loss / acc) | Test (loss / acc) |
94 | | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
95 | | [200,200,200] | GloVe 300 | 15000 / 200 | 250 | 0.5 | 0.1066 / 0.9602 | 0.1146 / 0.9567 | 0.1130 / 0.9574 |
96 | | [200,200,200] | 300 | 15000 / 200 | 250 | 0.5 | 0.1029 / 0.9617 | 0.1243 / 0.9533 | 0.1219 / 0.9547 |
97 | | ML-Model | - | - | - | - | - | - / 0.9398 | - / 0.9398 |
98 |
99 | ### Yelp 2017
100 | | Feature Maps | Embedding | Max Words / Sequence | Hidden Units | Dropout | Training (loss / acc) | Validation (loss / acc) | Test (loss / acc) |
101 | | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
102 | | [200,200,200] | GloVe 300 | 15000 / 200 | 250 | 0.5 | 0.0793 / 0.9707 | 0.0958 / 0.9644 | 0.0997 / 0.9631 |
103 | | [200,200,200] | 300 | 15000 / 200 | 250 | 0.5 | 0.0820 / 0.9701 | 0.1012 / 0.9623 | 0.1045 / 0.9615 |
104 |
105 | ### Yelp 2017 - Multiclass classification
106 | All previous evaluations are typical binary classification tasks. The Yelp dataset comes with reviews which can be classified into five classes (one to five stars). For the evaluations above I merged one and two star reviews into the negative class. Reviews with four and five stars were labeled as positive. Neutral reviews with three stars were not considered. In this evaluation I trained the model on all five classes.
107 | The baseline we have to beat is 20% accuracy because all classes are balanced to the same number of samples. In a first evaluation I reached 64% accuracy. This sounds a little low, but keep in mind that binary classification has a baseline of 50% accuracy, which is more than twice as high as the 20% baseline here! Furthermore there is a lot of subjectivity in the reviews. Take a look at the confusion matrix:
108 |
109 | ![Confusion matrix](images/yelp_confusion.png)
110 |
111 | If you look carefully you can see that it’s hard to distinguish a class from its directly neighboring classes. If you wrote a negative review, when does it deserve exactly two stars and not one or three?! Sometimes it’s clear, but sometimes it’s not!
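
The following sketch shows how a confusion matrix separates exact hits from "off by one star" errors. The labels are toy data invented for this example (0 = one star, 4 = five stars), and the helper `confusion_matrix` is a hand-written illustration, not the function used in the notebook.

```python
def confusion_matrix(y_true, y_pred, nb_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * nb_classes for _ in range(nb_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Toy labels invented for illustration; NOT predictions of the model.
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [0, 1, 1, 2, 2, 1, 3, 4, 4, 4]

nb_classes = 5
cm = confusion_matrix(y_true, y_pred, nb_classes)
total = len(y_true)

# Accuracy of random guessing when all classes have the same size.
baseline = 1 / nb_classes
# Share of predictions on the diagonal.
exact = sum(cm[i][i] for i in range(nb_classes)) / total
# Share of predictions that miss by at most one star.
within_one = sum(
    cm[i][j] for i in range(nb_classes) for j in range(nb_classes) if abs(i - j) <= 1
) / total

print(baseline, exact, within_one)
# 0.2 0.6 1.0
```

In this toy case every error lands in a neighboring class, which is the pattern described above: the exact accuracy looks modest while almost no prediction is far off.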
112 |
113 | | Feature Maps | Embedding | Max Words / Sequence | Hidden Units | Dropout | Training (loss / acc) | Validation (loss / acc) | Test (loss / acc) |
114 | | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
115 | | [200,200,200] | GloVe 300 | 15000 / 200 | 250 | 0.5 | 0.7676 / 0.6658 | 0.7983 / 0.6531 | 0.8076 / 0.6487 |
116 | | [200,200,200] | 300 | 15000 / 200 | 250 | 0.5 | 0.7932 / 0.6556 | 0.8103 / 0.6470 | 0.8169 / 0.6443 |
117 |
118 | ## Conclusion and improvements
119 | In conclusion, CNNs are a great approach for text classification. However, a lot of data is needed for training a good model. It would be interesting to compare these results with a typical machine learning approach. I expect that ML achieves similar results on all datasets except Yelp. If you evaluate your own architecture (neural network), I recommend using IMDB or Yelp because of their amount of data.
120 |
121 | Using pre-trained embeddings like GloVe improved accuracy by about 1-2%. In addition, pre-trained embeddings have a regularization effect on training. That makes sense because GloVe is trained on data that is somewhat different from Yelp and the other datasets. This means that the weights of the pre-trained embedding are updated during training. You can see the regularization effect in the following image:
122 |
123 | ![Comparison of training with and without pre-trained embeddings](images/yelp_comparison.png)
124 |
125 | If you are interested in CNNs and text classification, try out the Yelp dataset! Not only does it give the best accuracy, it also comes with a lot of metadata. Maybe I will use this dataset to get insights for my next trip :)
126 |
127 | I'm sure that you can get better results by tuning some parameters:
128 | - Increase / decrease feature maps
129 | - Add / remove filter sizes
130 | - Use other embeddings (e.g. Google word2vec)
131 | - Increase / decrease maximum words in vocabulary and sequence
132 | - Modify the method `clean_doc`
133 |
134 | If you have any questions or hints for improvement contact me through an issue. Thanks!
135 |
136 | ## Requirements
137 | * Python 3.x
138 | * TensorFlow 2.x
139 | * TensorFlow-Datasets
140 | * scikit-learn
141 |
142 | ## Usage
143 | Feel free to use the [model](https://github.com/cmasch/cnn-text-classification/blob/master/cnn_model.py) and your own dataset. As an example you can use this [evaluation notebook](https://github.com/cmasch/cnn-text-classification/blob/master/Evaluation.ipynb).
144 |
145 | ## References
146 | [1] [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882)
147 | [2] [Neural Document Embeddings for Intensive Care Patient Mortality Prediction](https://arxiv.org/abs/1612.00467)
148 | [3] [Character-level Convolutional Networks for Text Classification](https://arxiv.org/abs/1509.01626)
149 |
150 | ## Author
151 | Christopher Masch
152 |
--------------------------------------------------------------------------------
/cnn_model.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | CNN model for text classification implemented in TensorFlow 2.
4 | This implementation is based on the original paper by Yoon Kim [1] for classification using words.
5 | In addition, I added character-level input [2].
6 |
7 | # References
8 | - [1] [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882)
9 | - [2] [Character-level Convolutional Networks for Text Classification](https://arxiv.org/abs/1509.01626)
10 |
11 | @author: Christopher Masch
12 | """
13 |
14 | import tensorflow as tf
15 | from tensorflow.keras import layers
16 |
17 |
18 | class CNN:
19 |     __version__ = '0.2.0'
20 |
21 |     def __init__(self, embedding_layer=None, num_words=None, embedding_dim=None,
22 |                  max_seq_length=100, kernel_sizes=[3, 4, 5], feature_maps=[100, 100, 100],
23 |                  use_char=False, char_embedding_dim=50, char_max_length=200, alphabet_size=None, char_kernel_sizes=[3, 10, 20],
24 |                  char_feature_maps=[100, 100, 100], hidden_units=100, dropout_rate=None, nb_classes=None):
25 |         """
26 |         Arguments:
27 |             embedding_layer    : If not defined with pre-trained embeddings it will be created from scratch (default: None)
28 |             num_words          : Maximum number of words in the vocabulary (default: None)
29 |             embedding_dim      : Dimension of word representation (default: None)
30 |             max_seq_length     : Max length of word sequence (default: 100)
31 |             kernel_sizes       : An array of kernel sizes per channel (default: [3,4,5])
32 |             feature_maps       : Defines the feature maps per channel (default: [100,100,100])
33 |             use_char           : If True, char-based model will be added to word-based model
34 |             char_embedding_dim : Dimension of char representation (default: 50)
35 |             char_max_length    : Max length of char sequence (default: 200)
36 |             alphabet_size      : Number of different chars used for creating embeddings (default: None)
37 |             hidden_units       : Hidden units per convolution channel (default: 100)
38 |             dropout_rate       : If defined, dropout will be added after the word embedding layer and after the dense layer of each channel (default: None)
39 |             nb_classes         : Number of classes which can be predicted
40 |         """
41 |
42 |         # WORD-level
43 |         self.embedding_layer = embedding_layer
44 |         self.num_words = num_words
45 |         self.max_seq_length = max_seq_length
46 |         self.embedding_dim = embedding_dim
47 |         self.kernel_sizes = kernel_sizes
48 |         self.feature_maps = feature_maps
49 |
50 |         # CHAR-level
51 |         self.use_char = use_char
52 |         self.char_embedding_dim = char_embedding_dim
53 |         self.char_max_length = char_max_length
54 |         self.alphabet_size = alphabet_size
55 |         self.char_kernel_sizes = char_kernel_sizes
56 |         self.char_feature_maps = char_feature_maps
57 |
58 |         # General
59 |         self.hidden_units = hidden_units
60 |         self.dropout_rate = dropout_rate
61 |         self.nb_classes = nb_classes
62 |
63 |     def build_model(self):
64 |         """
65 |         Build the model
66 |
67 |         Returns:
68 |             Model : Keras model instance
69 |         """
70 |
71 |         # Checks
72 |         if len(self.kernel_sizes) != len(self.feature_maps):
73 |             raise Exception('Please define `kernel_sizes` and `feature_maps` with the same length.')
74 |         if not self.embedding_layer and (not self.num_words or not self.embedding_dim):
75 |             raise Exception('Please define `num_words` and `embedding_dim` if you are not using a pre-trained embedding.')
76 |         if self.use_char and (not self.char_max_length or not self.alphabet_size):
77 |             raise Exception('Please define `char_max_length` and `alphabet_size` if you are using char.')
78 |
79 |         # Building word-embeddings from scratch
80 |         if self.embedding_layer is None:
81 |             self.embedding_layer = layers.Embedding(
82 |                 input_dim=self.num_words,
83 |                 output_dim=self.embedding_dim,
84 |                 input_length=self.max_seq_length,
85 |                 weights=None,
86 |                 trainable=True,
87 |                 name="word_embedding"
88 |             )
89 |
90 |         # WORD-level
91 |         word_input = layers.Input(shape=(self.max_seq_length,), dtype='int32', name='word_input')
92 |         x = self.embedding_layer(word_input)
93 |
94 |         if self.dropout_rate:
95 |             x = layers.Dropout(self.dropout_rate)(x)
96 |
97 |         x = self.building_block(x, self.kernel_sizes, self.feature_maps)
98 |         x = layers.Activation('relu')(x)
99 |         prediction = layers.Dense(self.nb_classes, activation='softmax')(x)
100 |
101 |
102 |         # CHAR-level
103 |         if self.use_char:
104 |             char_input = layers.Input(shape=(self.char_max_length,), dtype='int32', name='char_input')
105 |             x_char = layers.Embedding(
106 |                 input_dim=self.alphabet_size + 1,
107 |                 output_dim=self.char_embedding_dim,
108 |                 input_length=self.char_max_length,
109 |                 name='char_embedding'
110 |             )(char_input)
111 |
112 |             x_char = self.building_block(x_char, self.char_kernel_sizes, self.char_feature_maps)
113 |             x_char = layers.Activation('relu')(x_char)
114 |             x_char = layers.Dense(self.nb_classes, activation='softmax')(x_char)
115 |
116 |             prediction = layers.Average()([prediction, x_char])
117 |             return tf.keras.Model(inputs=[word_input, char_input], outputs=prediction, name='CNN_Word_Char')
118 |
119 |         return tf.keras.Model(inputs=word_input, outputs=prediction, name='CNN_Word')
120 |
121 |     def building_block(self, input_layer, kernel_sizes, feature_maps):
122 |         """
123 |         Creates several CNN channels in parallel and concatenates them
124 |
125 |         Arguments:
126 |             input_layer : Layer which will be the input for all convolutional blocks
127 |             kernel_sizes: Array of kernel sizes (working as n-gram filter)
128 |             feature_maps: Array of feature maps
129 |
130 |         Returns:
131 |             x           : Building block with one or several channels
132 |         """
133 |         channels = []
134 |         for ix in range(len(kernel_sizes)):
135 |             x = self.create_channel(input_layer, kernel_sizes[ix], feature_maps[ix])
136 |             channels.append(x)
137 |
138 |         # Check how many channels, one channel doesn't need a concatenation
139 |         if len(channels) > 1:
140 |             x = layers.concatenate(channels)
141 |
142 |         return x
143 |
144 |     def create_channel(self, x, kernel_size, feature_map):
145 |         """
146 |         Creates a layer, working channel wise
147 |
148 |         Arguments:
149 |             x           : Input for convolutional channel
150 |             kernel_size : Kernel size for creating the SeparableConv1D
151 |             feature_map : Feature map
152 |
153 |         Returns:
154 |             x           : Channel including (SeparableConv1D + {GlobalMaxPooling & GlobalAveragePooling} + Dense [+ Dropout])
155 |         """
156 |         x = layers.SeparableConv1D(
157 |             feature_map,
158 |             kernel_size=kernel_size,
159 |             activation='relu',
160 |             strides=1,
161 |             padding='valid',
162 |             depth_multiplier=4
163 |         )(x)
164 |
165 |         x1 = layers.GlobalMaxPooling1D()(x)
166 |         x2 = layers.GlobalAveragePooling1D()(x)
167 |         x = layers.concatenate([x1, x2])
168 |
169 |         x = layers.Dense(self.hidden_units)(x)
170 |         if self.dropout_rate:
171 |             x = layers.Dropout(self.dropout_rate)(x)
172 |         return x
173 |
--------------------------------------------------------------------------------
/images/yelp_comparison.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cmasch/cnn-text-classification/b543f6961e81bf861553432d36c4424a6d0af904/images/yelp_comparison.png
--------------------------------------------------------------------------------
/images/yelp_confusion.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cmasch/cnn-text-classification/b543f6961e81bf861553432d36c4424a6d0af904/images/yelp_confusion.png
--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Utils
4 |
5 | @author: Christopher Masch
6 | """
7 |
8 | import urllib3
9 | import os
10 | import re
11 | import string
12 | import numpy as np
13 | import tensorflow as tf
14 | import matplotlib.pyplot as plt
15 | from io import BytesIO
16 | from zipfile import ZipFile
17 | from nltk.corpus import stopwords
18 |
19 |
20 | def clean_doc(doc):
21 |     """
22 |     Clean a document. Only the steps marked as active are applied; the others are kept as commented-out code:
23 |         - Lowercase (active)
24 |         - Removing whitespaces (active)
25 |         - Removing numbers (disabled)
26 |         - Removing stopwords (disabled)
27 |         - Removing punctuations (disabled)
28 |         - Removing short words (disabled)
29 |
30 |     Arguments:
31 |         doc : Text
32 |
33 |     Returns:
34 |         str : Cleaned text
35 |     """
36 |
37 |     #stop_words = set(stopwords.words('english'))
38 |
39 |     # Lowercase
40 |     doc = doc.lower()
41 |     # Remove numbers
42 |     #doc = re.sub(r"[0-9]+", "", doc)
43 |     # Split in tokens
44 |     tokens = doc.split()
45 |     # Remove Stopwords
46 |     #tokens = [w for w in tokens if not w in stop_words]
47 |     # Remove punctuation
48 |     #tokens = [w.translate(str.maketrans('', '', string.punctuation)) for w in tokens]
49 |     # Tokens with fewer than two characters will be ignored
50 |     #tokens = [word for word in tokens if len(word) > 1]
51 |     return ' '.join(tokens)
52 |
53 |
54 | def read_files(path):
55 |     """
56 |     Read in files of a given path.
57 |     This can be a directory including many files or just one file.
58 |
59 |     Arguments:
60 |         path : Filepath to file(s)
61 |
62 |     Returns:
63 |         documents : Return a list of cleaned documents
64 |     """
65 |
66 |     documents = list()
67 |     # Read in all files in directory
68 |     if os.path.isdir(path):
69 |         for filename in os.listdir(path):
70 |             with open(os.path.join(path, filename)) as f:
71 |                 doc = f.read()
72 |                 doc = clean_doc(doc)
73 |                 documents.append(doc)
74 |
75 |     # Read in all lines in one file
76 |     if os.path.isfile(path):
77 |         with open(path, encoding='iso-8859-1') as f:
78 |             doc = f.readlines()
79 |             for line in doc:
80 |                 documents.append(clean_doc(line))
81 |
82 |     return documents
83 |
84 |
85 | def char_vectorizer(X, char_max_length, char2idx_dict):
86 |     """
87 |     Vectorize an array of word sequences to char vector.
88 |     Example (length 15): [test entry] --> [[1,2,3,1,4,2,5,1,6,7,0,0,0,0,0]]
89 |
90 |     Arguments:
91 |         X               : Array of word sequences
92 |         char_max_length : Maximum length of vector
93 |         char2idx_dict   : Dictionary of indices for converting char to integer
94 |
95 |     Returns:
96 |         str2idx : Array of vectorized char sequences
97 |     """
98 |
99 |     str2idx = np.zeros((len(X), char_max_length), dtype='int64')
100 |     for idx, doc in enumerate(X):
101 |         max_length = min(len(doc), char_max_length)
102 |         for i in range(0, max_length):
103 |             c = doc[i]
104 |             if c in char2idx_dict:
105 |                 str2idx[idx, i] = char2idx_dict[c]
106 |     return str2idx
107 |
108 |
109 | def create_glove_embeddings(embedding_dim, max_num_words, max_seq_length, tokenizer):
110 |     """
111 |     Load and create GloVe embeddings.
112 |
113 |     Arguments:
114 |         embedding_dim : Dimension of embeddings (e.g. 50,100,200,300)
115 |         max_num_words : Maximum count of words in vocabulary
116 |         max_seq_length: Maximum length of vector
117 |         tokenizer     : Tokenizer for converting words to integer
118 |
119 |     Returns:
120 |         tf.keras.layers.Embedding : GloVe embeddings initialized in Keras Embedding-Layer
121 |     """
122 |
123 |     print("Pretrained GloVe embedding is loading...")
124 |
125 |     if not os.path.exists("data"):
126 |         os.makedirs("data")
127 |
128 |     if not os.path.exists("data/glove"):
129 |         print("No previous embeddings found. Required files will be downloaded...")
130 |         os.makedirs("data/glove")
131 |         http = urllib3.PoolManager()
132 |         response = http.request(
133 |             url="http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip",
134 |             method="GET",
135 |             retries=False
136 |         )
137 |
138 |         with ZipFile(BytesIO(response.data)) as myzip:
139 |             for f in myzip.infolist():
140 |                 with open(f"data/glove/{f.filename}", "wb") as outfile:
141 |                     outfile.write(myzip.open(f.filename).read())
142 |
143 |         print("Download of GloVe embeddings finished.")
144 |
145 |     embeddings_index = {}
146 |     with open(f"data/glove/glove.6B.{embedding_dim}d.txt", encoding="utf-8") as glove_embedding:
147 |         for line in glove_embedding:
148 |             values = line.split()
149 |             word = values[0]
150 |             coefs = np.asarray(values[1:], dtype="float32")
151 |             embeddings_index[word] = coefs
152 |
153 |     print(f"Found {len(embeddings_index)} word vectors in GloVe embedding\n")
154 |
155 |     embedding_matrix = np.zeros((max_num_words, embedding_dim))
156 |
157 |     for word, i in tokenizer.word_index.items():
158 |         if i >= max_num_words:
159 |             continue
160 |
161 |         embedding_vector = embeddings_index.get(word)
162 |         if embedding_vector is not None:
163 |             embedding_matrix[i] = embedding_vector
164 |
165 |     return tf.keras.layers.Embedding(
166 |         input_dim=max_num_words,
167 |         output_dim=embedding_dim,
168 |         input_length=max_seq_length,
169 |         weights=[embedding_matrix],
170 |         trainable=True,
171 |         name="word_embedding"
172 |     )
173 |
174 |
175 | def plot_acc_loss(title, histories, key_acc, key_loss):
176 |     """
177 |     Generate a plot for visualizing accuracy and loss
178 |
179 |     Arguments:
180 |         title     : Title of visualization
181 |         histories : Array of Keras metrics per run and epoch
182 |         key_acc   : Key of accuracy (accuracy, val_accuracy)
183 |         key_loss  : Key of loss (loss, val_loss)
184 |     """
185 |
186 |     fig, (ax1, ax2) = plt.subplots(1, 2)
187 |
188 |     # Accuracy
189 |     ax1.set_title(f"Model accuracy ({title})")
190 |     names = []
191 |     for i, model in enumerate(histories):
192 |         ax1.plot(model[key_acc])
193 |         names.append(f"Model {i+1}")
194 |     ax1.set_xlabel("epoch")
195 |     ax1.set_ylabel("accuracy")
196 |     ax1.legend(names, loc="lower right")
197 |
198 |     # Loss
199 |     ax2.set_title(f"Model loss ({title})")
200 |     for model in histories:
201 |         ax2.plot(model[key_loss])
202 |     ax2.set_xlabel('epoch')
203 |     ax2.set_ylabel('loss')
204 |     ax2.legend(names, loc='upper right')
205 |     fig.set_size_inches(20, 5)
206 |     plt.show()
207 |
208 |
209 | def visualize_features(ml_classifier, nb_neg_features=15, nb_pos_features=15):
210 |     """
211 |     Visualize the trained coefficients of the logistic regression with respect to the vectorizer.
212 |
213 |     Arguments:
214 |         ml_classifier   : ML-Pipeline including vectorizer as well as trained model
215 |         nb_neg_features : Number of negative features to visualize
216 |         nb_pos_features : Number of positive features to visualize
217 |     """
218 |
219 |     feature_names = ml_classifier.get_params()['vectorizer'].get_feature_names_out()  # requires scikit-learn >= 1.0
220 |     coef = ml_classifier.get_params()['classifier'].coef_.ravel()
221 |
222 |     print('Extracted features: {}'.format(len(feature_names)))
223 |
224 |     pos_coef = np.argsort(coef)[-nb_pos_features:]
225 |     neg_coef = np.argsort(coef)[:nb_neg_features]
226 |     interesting_coefs = np.hstack([neg_coef, pos_coef])
227 |
228 |     # Plot
229 |     plt.figure(figsize=(20, 5))
230 |     colors = ['red' if c < 0 else 'green' for c in coef[interesting_coefs]]
231 |     plt.bar(np.arange(nb_neg_features + nb_pos_features), coef[interesting_coefs], color=colors)
232 |     feature_names = np.array(feature_names)
233 |     plt.xticks(
234 |         np.arange(nb_neg_features + nb_pos_features),
235 |         feature_names[interesting_coefs],
236 |         size=15,
237 |         rotation=75,
238 |         ha='center'
239 |     )
240 |     plt.show()
241 |
242 |
--------------------------------------------------------------------------------