├── Evaluation.ipynb
├── README.md
├── cnn_model.py
├── images
│   ├── yelp_comparison.png
│   └── yelp_confusion.png
└── utils.py

/README.md:
--------------------------------------------------------------------------------
1 | # Text classification with Convolutional Neural Networks (CNN)
2 | This project demonstrates how to classify text documents / sentences with CNNs. You can find a great introduction to a similar approach in the blog posts of [Denny Britz](http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/) and [Keras](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html). My approach is quite similar to the one of Denny Britz and the original paper of Yoon Kim [1]. You can find the implementation of Yoon Kim on [GitHub](https://github.com/yoonkim/CNN_sentence) as well.
3 | 
4 | ## Changes
5 | 
6 | ### *** UPDATE *** - September 10th, 2021
7 | In this update I fixed some typos and improved the Jupyter notebook. You can execute the notebook without any manual preparation; the required data will be downloaded automatically.
8 | 
9 | - Add Yelp Polarity Dataset (TensorFlow Dataset)
10 | - Add utils.py for moving code out of the notebook
11 | - Add blank char in `ALPHABET` variable
12 | 
13 | ### *** UPDATE *** - December 15th, 2019
14 | I’ve updated the code to TensorFlow 2.
15 | In addition I made some changes in the Jupyter notebook:
16 | - Remove Yelp dataset
17 | - Add TensorFlow Dataset for IMDB
18 | 
19 | ### *** UPDATE *** - May 17th, 2019
20 | 
21 | Model:
22 | - Combine word-level with character-based input. The char input is optional and can be used for further research.
23 | - Change padding of the conv-layers from same to valid.
24 | - Add average pooling after the conv-layers and combine it with the existing max pooling.
25 | 
26 | Notebook:
27 | - Add CHAR support
28 | - Commented out preprocessing
29 | - Add scikit-learn example at the end for a comparison between deep learning and classical machine learning.
30 | 
31 | Using characters in addition to words did not improve the results, but it can be a good starting point for further research.
32 | I kept the model as simple as possible and reused the existing methods for the character input. As written in the paper of Zhang, Zhao and LeCun [3], stacking several conv-layers on top of each other could improve performance.
33 | 
34 | ### *** UPDATE *** - December 3rd, 2018
35 | - Implemented the model as a class (cnn_model.CNN)
36 | - Replaced max pooling by global max pooling
37 | - Replaced conv1d by separable convolutions
38 | - Added dense + dropout after each global max pooling
39 | - Removed flatten + dropout after concatenation
40 | - Removed L2 regularizer on convolution layers
41 | - Support multiclass classification
42 | 
43 | Besides, I made some changes in the [evaluation notebook](https://github.com/cmasch/cnn-text-classification/blob/master/Evaluation.ipynb). It seems that cleaning the text by removing stopwords, numerical values and punctuation removes important features as well, therefore I don't use these preprocessing steps anymore. As optimizer I switched from Adadelta to Adam because it converges to an optimum faster.
44 | 
45 | These are just small changes, but with a significant improvement as you can see below.
46 | 
47 | #### Comparing old vs new
48 | For the Yelp dataset I increased the number of training samples from 200000 to 600000 and the number of test samples from 50000 to 200000.
49 | 
50 | | Dataset | Old
(loss / acc) | New
(loss / acc) |
51 | | :---: | :---: | :---: |
52 | | **Polarity** | 0.4688 / 0.7974 | 0.4058 / 0.8135 |
53 | | **IMDB** | 0.2994 / 0.8896 | 0.2509 / 0.9007 |
54 | | **Yelp** | 0.1793 / 0.9393 | 0.0997 / 0.9631 |
55 | | **Yelp - Multi** | 0.9356 / 0.6051 | 0.8076 / 0.6487 |
56 | 
57 | #### Next steps:
58 | - Combine the word-level model with a character-based input. Working on characters has the advantage that misspellings and emoticons may be learnt naturally.
59 | - Add an attention layer on top of the recurrent / convolution layers. I already tested this without improvement, but I am still working on it.
60 | 
61 | ## Evaluation
62 | For evaluation I used different datasets that are freely available. They differ in the number of documents and in content length. What they all have in common is that there are two classes to predict (positive / negative). I would like to show how a CNN performs on ~10000 up to ~800000 documents while modifying only a few parameters.
63 | 
64 | I used the following sets for evaluation:
65 | - [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/)
66 | The polarity dataset v1.0 has 10662 sentences. It's quite similar to traditional sentiment analysis of tweets because of the content length. I just split the data into train / validation (90% / 10%).
67 | - [IMDB movie review](http://ai.stanford.edu/~amaas/data/sentiment/)
68 | The IMDB movie review dataset has 25000 train and 25000 test documents. I split the training set into train / validation (80% / 20%) and used the test set for a final test.
69 | - [Yelp dataset 2017](https://www.yelp.com/dataset)
70 | This dataset contains a JSON of nearly 5 million entries. For performance reasons I randomly split this JSON into 600000 train and 200000 test documents. I selected ratings with 1-2 stars as negative and 4-5 stars as positive. Ratings with 3 stars are not considered because of their neutrality. In addition, the selected subset contains only texts with more than 5 words. The languages of the texts include English, German, Spanish and many more. During training I used an 80% / 20% split (train / validation). If you are interested you can also check a small demo of the [embeddings](https://github.com/cmasch/word-embeddings-from-scratch) created from the training data.
71 | 
72 | ## Model
73 | The implemented [model](https://github.com/cmasch/cnn-text-classification/blob/master/cnn_model.py) has multiple convolutional layers in parallel to obtain several features of one text. Because each convolution layer uses a different kernel size, the window size varies and the text is read with an n-gram approach. The default is 3 convolution layers with kernel sizes of 3, 4 and 5, as sketched below.
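A minimal sketch of how the model class can be used for the word-level setup (the values mirror the result tables below and are only illustrative; labels are assumed to be one-hot encoded, and the full training setup lives in the evaluation notebook):

```python
from cnn_model import CNN
import tensorflow as tf

# Word-level model with the default 3 parallel channels (kernel sizes 3, 4, 5)
model = CNN(
    num_words=15000,               # vocabulary size
    embedding_dim=300,             # word embeddings learned from scratch
    max_seq_length=200,            # length of the padded word sequences
    kernel_sizes=[3, 4, 5],        # n-gram windows read in parallel
    feature_maps=[200, 200, 200],  # feature maps per channel
    hidden_units=250,
    dropout_rate=0.5,
    nb_classes=2
).build_model()

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss='categorical_crossentropy',  # assumes one-hot encoded labels
    metrics=['accuracy']
)
model.summary()
```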
74 | 
75 | I also used pre-trained [GloVe](https://nlp.stanford.edu/projects/glove/) embeddings with 300-dimensional vectors trained on 6B tokens to show that unsupervised learning of words can have a positive effect on neural nets.
76 | 
77 | ## Results
78 | For all runs I used filter sizes of [3,4,5], Adam as optimizer, a batch size of 100 and 10 epochs. Each configuration was run 5 times with a different random state to get a final mean of loss / accuracy.
79 | 
80 | ### Sentence polarity dataset v1.0
81 | | Feature Maps | Embedding | Max Words / Sequence | Hidden Units | Dropout | Training
(loss / acc) | Validation
(loss / acc) | 82 | | :---: | :---: | :---: | :---: | :---: | :---: | :---: | 83 | | [100,100,100] | GloVe 300 | 15000 / 35 | 64 | 0.4 | 0.3134 / 0.8642 | 0.4058 / 0.8135 | 84 | | [100,100,100] | 300 | 15000 / 35 | 64 | 0.4 | 0.4741 / 0.7753 | 0.4563 / 0.7807 | 85 | 86 | ### IMDB 87 | | Feature Maps | Embedding | Max Words / Sequence | Hidden Units | Dropout | Training
(loss / acc) | Validation
(loss / acc) | Test
(loss / acc) | 88 | | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | 89 | | [200,200,200] | GloVe 300 | 15000 / 500 | 200 | 0.4 | 0.1735 / 0.9332 | 0.2417 / 0.9064 | 0.2509 / 0.9007 | 90 | | [200,200,200] | 300 | 15000 / 500 | 200 | 0.4 | 0.2425 / 0.9037 | 0.2554 / 0.8964 | 0.2632 / 0.8920 | 91 | 92 | ### Yelp Polarity Dataset (2015) 93 | | Feature Maps | Embedding | Max Words / Sequence | Hidden Units | Dropout | Training
(loss / acc) | Validation
(loss / acc) | Test
(loss / acc) | 94 | | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | 95 | | [200,200,200] | GloVe 300 | 15000 / 200 | 250 | 0.5 | 0.1066 / 0.9602 | 0.1146 / 0.9567 | 0.1130 / 0.9574 | 96 | | [200,200,200] | 300 | 15000 / 200 | 250 | 0.5 | 0.1029 / 0.9617 | 0.1243 / 0.9533 | 0.1219 / 0.9547 | 97 | | ML-Model | - | - | - | - | - | - / 0.9398 | - / 0.9398 | 98 | 99 | ### Yelp 2017 100 | | Feature Maps | Embedding | Max Words / Sequence | Hidden Units | Dropout | Training
(loss / acc) | Validation
(loss / acc) | Test
(loss / acc) |
101 | | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
102 | | [200,200,200] | GloVe 300 | 15000 / 200 | 250 | 0.5 | 0.0793 / 0.9707 | 0.0958 / 0.9644 | 0.0997 / 0.9631 |
103 | | [200,200,200] | 300 | 15000 / 200 | 250 | 0.5 | 0.0820 / 0.9701 | 0.1012 / 0.9623 | 0.1045 / 0.9615 |
104 | 
105 | ### Yelp 2017 - Multiclass classification
106 | All previous evaluations are typical binary classification tasks. The Yelp dataset comes with reviews which can be classified into five classes (one to five stars). For the evaluations above I merged one- and two-star reviews into the negative class, and four- and five-star reviews were labeled as positive. Neutral reviews with three stars were not considered. In this evaluation I trained the model on all five classes.
107 | The baseline we have to reach is 20% accuracy because all classes are balanced to the same number of samples. In a first evaluation I reached 64% accuracy. This sounds a little low, but keep in mind that binary classification starts from a baseline of 50% accuracy, more than twice the multiclass baseline. Furthermore there is a lot of subjectivity in the reviews. Take a look at the confusion matrix:
108 | 
109 | ![Confusion matrix of the Yelp multiclass model](./images/yelp_confusion.png)
110 | 
111 | If you look carefully you can see that it is hard to distinguish a class from its direct neighbors. If you write a negative review, when does it deserve two stars rather than one or three? Sometimes it is clear, sometimes not.
112 | 
113 | | Feature Maps | Embedding | Max Words / Sequence | Hidden Units | Dropout | Training
(loss / acc) | Validation
(loss / acc) | Test
(loss / acc) |
114 | | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
115 | | [200,200,200] | GloVe 300 | 15000 / 200 | 250 | 0.5 | 0.7676 / 0.6658 | 0.7983 / 0.6531 | 0.8076 / 0.6487 |
116 | | [200,200,200] | 300 | 15000 / 200 | 250 | 0.5 | 0.7932 / 0.6556 | 0.8103 / 0.6470 | 0.8169 / 0.6443 |
117 | 
118 | ## Conclusion and improvements
119 | In summary, CNNs are a great approach for text classification; however, a lot of data is needed to train a good model. It would be interesting to compare these results with a typical machine learning approach; I expect classical ML to reach similar results on all datasets except Yelp (see the baseline sketch below). If you evaluate your own architecture (neural network), I recommend using IMDB or Yelp because of their amount of data.
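As a rough idea of what such a classical machine learning baseline can look like, here is a minimal sketch (the actual comparison lives in the evaluation notebook; `train_texts`, `train_labels`, `test_texts` and `test_labels` are placeholders, and the step names match what `utils.visualize_features` expects):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Classical baseline: TF-IDF features + logistic regression
ml_classifier = Pipeline([
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 2), max_features=50000)),
    ('classifier', LogisticRegression(max_iter=1000)),
])

ml_classifier.fit(train_texts, train_labels)         # raw strings and integer labels
print(ml_classifier.score(test_texts, test_labels))  # accuracy on the held-out test set
```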
120 | 
121 | Using pre-trained embeddings like GloVe improved accuracy by about 1-2%. In addition, pre-trained embeddings have a regularization effect on training. That makes sense because GloVe is trained on data that is somewhat different from Yelp and the other datasets, and during training the weights of the pre-trained embedding are updated. You can see the regularization effect in the following image:
122 | 
123 | ![Yelp training comparison with and without pre-trained GloVe embeddings](./images/yelp_comparison.png)
124 | 
125 | If you are interested in CNNs and text classification, try out the Yelp dataset! Not only does it give the best accuracy results, it also has a lot of metadata. Maybe I will use this dataset to get insights for my next travel :)
126 | 
127 | I'm sure that you can get better results by tuning some parameters:
128 | - Increase / decrease feature maps
129 | - Add / remove filter sizes
130 | - Use other embeddings (e.g. Google word2vec)
131 | - Increase / decrease maximum words in vocabulary and sequence
132 | - Modify the method `clean_doc`
133 | 
134 | If you have any questions or hints for improvement, contact me through an issue. Thanks!
135 | 
136 | ## Requirements
137 | * Python 3.x
138 | * TensorFlow 2.x
139 | * TensorFlow-Datasets
140 | * Scikit-learn
141 | 
142 | ## Usage
143 | Feel free to use the [model](https://github.com/cmasch/cnn-text-classification/blob/master/cnn_model.py) with your own dataset. As an example you can use this [evaluation notebook](https://github.com/cmasch/cnn-text-classification/blob/master/Evaluation.ipynb).
144 | 
145 | ## References
146 | [1] [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882)
147 | [2] [Neural Document Embeddings for Intensive Care Patient Mortality Prediction](https://arxiv.org/abs/1612.00467)
148 | [3] [Character-level Convolutional Networks for Text Classification](https://arxiv.org/abs/1509.01626)
149 | 
150 | ## Author
151 | Christopher Masch
152 | 
--------------------------------------------------------------------------------
/cnn_model.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | CNN model for text classification implemented in TensorFlow 2.
4 | This implementation is based on the original paper of Yoon Kim [1] for classification using words.
5 | In addition, character-level input [2] has been added.
6 | 
7 | # References
8 | - [1] [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882)
9 | - [2] [Character-level Convolutional Networks for Text Classification](https://arxiv.org/abs/1509.01626)
10 | 
11 | @author: Christopher Masch
12 | """
13 | 
14 | import tensorflow as tf
15 | from tensorflow.keras import layers
16 | 
17 | 
18 | class CNN:
19 |     __version__ = '0.2.0'
20 | 
21 |     def __init__(self, embedding_layer=None, num_words=None, embedding_dim=None,
22 |                  max_seq_length=100, kernel_sizes=[3, 4, 5], feature_maps=[100, 100, 100],
23 |                  use_char=False, char_embedding_dim=50, char_max_length=200, alphabet_size=None, char_kernel_sizes=[3, 10, 20],
24 |                  char_feature_maps=[100, 100, 100], hidden_units=100, dropout_rate=None, nb_classes=None):
25 |         """
26 |         Arguments:
27 |             embedding_layer    : If not defined with pre-trained embeddings it will be created from scratch (default: None)
28 |             num_words          : Maximal amount of words in the vocabulary (default: None)
29 |             embedding_dim      : Dimension of word representation (default: None)
30 |             max_seq_length     : Max length of word sequence (default: 100)
31 |             kernel_sizes       : An array of kernel sizes per channel (default: [3,4,5])
32 |             feature_maps       : Defines the feature maps per channel (default: [100,100,100])
33 |             use_char           : If True, char-based model will be added to word-based model
34 |             char_embedding_dim : Dimension of char representation (default: 50)
35 |             char_max_length    : Max length of char sequence (default: 200)
36 |             alphabet_size      : Amount of different chars used for creating embeddings (default: None)
37 |             hidden_units       : Hidden units per convolution channel (default: 100)
38 |             dropout_rate       : If defined, dropout will be added after embedding layer & concatenation (default: None)
39 |             nb_classes         : Number of classes which can be predicted
40 |         """
41 | 
42 |         # WORD-level
43 |         self.embedding_layer = embedding_layer
44 |         self.num_words = num_words
45 |         self.max_seq_length = max_seq_length
46 |         self.embedding_dim = embedding_dim
47 |         self.kernel_sizes = kernel_sizes
48 |         self.feature_maps = feature_maps
49 | 
50 |         # CHAR-level
51 |         self.use_char = use_char
52 |         self.char_embedding_dim = char_embedding_dim
53 |         self.char_max_length = char_max_length
54 |         self.alphabet_size = alphabet_size
55 |         self.char_kernel_sizes = char_kernel_sizes
56 |         self.char_feature_maps = char_feature_maps
57 | 
58 |         # General
59 |         self.hidden_units = hidden_units
60 |         self.dropout_rate = dropout_rate
61 |         self.nb_classes = nb_classes
62 | 
63 |     def build_model(self):
64 |         """
65 |         Build the model
66 | 
67 |         Returns:
68 |             Model : Keras model instance
69 |         """
70 | 
71 |         # Checks
72 |         if len(self.kernel_sizes) != len(self.feature_maps):
73 |             raise Exception('Please define `kernel_sizes` and `feature_maps` with the same length.')
74 |         if not self.embedding_layer and (not self.num_words or not self.embedding_dim):
75 |             raise Exception('Please define `num_words` and `embedding_dim` if you are not using a pre-trained embedding.')
76 |         if self.use_char and (not self.char_max_length or not self.alphabet_size):
77 |             raise Exception('Please define `char_max_length` and `alphabet_size` if you are using char input.')
78 | 
79 |         # Building word-embeddings from scratch
80 |         if self.embedding_layer is None:
81 |             self.embedding_layer = layers.Embedding(
82 |                 input_dim    = self.num_words,
83 |                 output_dim   = self.embedding_dim,
84 |                 input_length = self.max_seq_length,
85 |                 weights      = None,
86 |                 trainable    = True,
87 |                 name         = "word_embedding"
88 |             )
89 | 
90 |         # WORD-level
91 |         word_input = layers.Input(shape=(self.max_seq_length,), dtype='int32', name='word_input')
92 |         x = self.embedding_layer(word_input)
93 | 
94 |         if self.dropout_rate:
95 |             x = layers.Dropout(self.dropout_rate)(x)
96 | 
97 |         x = self.building_block(x, self.kernel_sizes, self.feature_maps)
98 |         x = layers.Activation('relu')(x)
99 |         prediction = layers.Dense(self.nb_classes, activation='softmax')(x)
100 | 
101 | 
102 |         # CHAR-level
103 |         if self.use_char:
104 |             char_input = layers.Input(shape=(self.char_max_length,), dtype='int32', name='char_input')
105 |             x_char = layers.Embedding(
106 |                 input_dim    = self.alphabet_size + 1,
107 |                 output_dim   = self.char_embedding_dim,
108 |                 input_length = self.char_max_length,
109 |                 name         = 'char_embedding'
110 |             )(char_input)
111 | 
112 |             x_char = self.building_block(x_char, self.char_kernel_sizes, self.char_feature_maps)
113 |             x_char = layers.Activation('relu')(x_char)
114 |             x_char = layers.Dense(self.nb_classes, activation='softmax')(x_char)
115 | 
116 |             prediction = layers.Average()([prediction, x_char])
117 |             return tf.keras.Model(inputs=[word_input, char_input], outputs=prediction, name='CNN_Word_Char')
118 | 
119 |         return tf.keras.Model(inputs=word_input, outputs=prediction, name='CNN_Word')
120 | 
121 |     def building_block(self, input_layer, kernel_sizes, feature_maps):
122 |         """
123 |         Creates several CNN channels in parallel and concatenates them
124 | 
125 |         Arguments:
126 |             input_layer : Layer which will be the input for all convolutional blocks
127 |             kernel_sizes: Array of kernel sizes (working as n-gram filter)
128 |             feature_maps: Array of feature maps
129 | 
130 |         Returns:
131 |             x : Building block with one or several channels
132 |         """
133 |         channels = []
134 |         for ix in range(len(kernel_sizes)):
135 |             x = self.create_channel(input_layer, kernel_sizes[ix], feature_maps[ix])
136 |             channels.append(x)
137 | 
138 |         # Check how many channels; a single channel doesn't need a concatenation
139 |         if (len(channels) > 1):
140 |             x = layers.concatenate(channels)
141 | 
142 |         return x
143 | 
144 |     def create_channel(self, x, kernel_size, feature_map):
145 |         """
146 |         Creates one convolutional channel
147 | 
148 |         Arguments:
149 |             x           : Input for the convolutional channel
150 |             kernel_size : Kernel size for creating Conv1D
151 |             feature_map : Feature map
152 | 
153 |         Returns:
154 |             x : Channel including (Conv1D + {GlobalMaxPooling & GlobalAveragePooling} + Dense [+ Dropout])
155 |         """
156 |         x = layers.SeparableConv1D(
157 |             feature_map,
158 |             kernel_size      = kernel_size,
159 |             activation       = 'relu',
160 |             strides          = 1,
161 |             padding          = 'valid',
162 |             depth_multiplier = 4
163 |         )(x)
164 | 
165 |         x1 = layers.GlobalMaxPooling1D()(x)
166 |         x2 = layers.GlobalAveragePooling1D()(x)
167 |         x = layers.concatenate([x1, x2])
168 | 
169 |         x = layers.Dense(self.hidden_units)(x)
170 |         if self.dropout_rate:
171 |             x = layers.Dropout(self.dropout_rate)(x)
172 |         return x
173 | 
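# ----------------------------------------------------------------------------
# Usage sketch (illustrative, not part of the original module): builds the
# combined word + char model with the default kernel sizes. The concrete
# values below are assumptions for demonstration only; the real settings
# live in the evaluation notebook.
if __name__ == '__main__':
    cnn = CNN(
        num_words=15000,      # vocabulary size for the word input
        embedding_dim=300,    # word embedding dimension (learned from scratch here)
        max_seq_length=200,   # padded word sequence length
        use_char=True,        # also build the optional character-based branch
        char_max_length=200,  # padded char sequence length
        alphabet_size=69,     # assumed alphabet size (see the ALPHABET variable in the notebook)
        dropout_rate=0.5,
        nb_classes=2
    )
    model = cnn.build_model()  # returns a tf.keras.Model named 'CNN_Word_Char'
    model.summary()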
--------------------------------------------------------------------------------
/images/yelp_comparison.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cmasch/cnn-text-classification/b543f6961e81bf861553432d36c4424a6d0af904/images/yelp_comparison.png
--------------------------------------------------------------------------------
/images/yelp_confusion.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cmasch/cnn-text-classification/b543f6961e81bf861553432d36c4424a6d0af904/images/yelp_confusion.png
--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Utils
4 | 
5 | @author: Christopher Masch
6 | """
7 | 
8 | import urllib3
9 | import os
10 | import re
11 | import string
12 | import numpy as np
13 | import tensorflow as tf
14 | import matplotlib.pyplot as plt
15 | from io import BytesIO
16 | from zipfile import ZipFile
17 | from nltk.corpus import stopwords
18 | 
19 | 
20 | def clean_doc(doc):
21 |     """
22 |     Cleaning a document by several methods (most steps are commented out by default):
23 |     - Lowercase
24 |     - Removing whitespaces
25 |     - Removing numbers
26 |     - Removing stopwords
27 |     - Removing punctuations
28 |     - Removing short words
29 | 
30 |     Arguments:
31 |         doc : Text
32 | 
33 |     Returns:
34 |         str : Cleaned text
35 |     """
36 | 
37 |     #stop_words = set(stopwords.words('english'))
38 | 
39 |     # Lowercase
40 |     doc = doc.lower()
41 |     # Remove numbers
42 |     #doc = re.sub(r"[0-9]+", "", doc)
43 |     # Split in tokens
44 |     tokens = doc.split()
45 |     # Remove stopwords
46 |     #tokens = [w for w in tokens if not w in stop_words]
47 |     # Remove punctuation
48 |     #tokens = [w.translate(str.maketrans('', '', string.punctuation)) for w in tokens]
49 |     # Tokens with less than two characters will be ignored
50 |     #tokens = [word for word in tokens if len(word) > 1]
51 |     return ' '.join(tokens)
52 | 
53 | 
54 | def read_files(path):
55 |     """
56 |     Read in files of a given path.
57 |     This can be a directory including many files or just one file.
58 | 
59 |     Arguments:
60 |         path : Filepath to file(s)
61 | 
62 |     Returns:
63 |         documents : Return a list of cleaned documents
64 |     """
65 | 
66 |     documents = list()
67 |     # Read in all files in directory
68 |     if os.path.isdir(path):
69 |         for filename in os.listdir(path):
70 |             with open(f"{path}/{filename}") as f:
71 |                 doc = f.read()
72 |                 doc = clean_doc(doc)
73 |                 documents.append(doc)
74 | 
75 |     # Read in all lines in one file
76 |     if os.path.isfile(path):
77 |         with open(path, encoding='iso-8859-1') as f:
78 |             doc = f.readlines()
79 |             for line in doc:
80 |                 documents.append(clean_doc(line))
81 | 
82 |     return documents
83 | 
84 | 
85 | def char_vectorizer(X, char_max_length, char2idx_dict):
86 |     """
87 |     Vectorize an array of word sequences to char vectors.
88 |     Example (length 15): [test entry] --> [[1,2,3,1,4,2,5,1,6,7,0,0,0,0,0]]
89 | 
90 |     Arguments:
91 |         X               : Array of word sequences
92 |         char_max_length : Maximum length of the char vector
93 |         char2idx_dict   : Dictionary of indices for converting char to integer
94 | 
95 |     Returns:
96 |         str2idx : Array of vectorized char sequences
97 |     """
98 | 
99 |     str2idx = np.zeros((len(X), char_max_length), dtype='int64')
100 |     for idx, doc in enumerate(X):
101 |         max_length = min(len(doc), char_max_length)
102 |         for i in range(0, max_length):
103 |             c = doc[i]
104 |             if c in char2idx_dict:
105 |                 str2idx[idx, i] = char2idx_dict[c]
106 |     return str2idx
107 | 
108 | 
109 | def create_glove_embeddings(embedding_dim, max_num_words, max_seq_length, tokenizer):
110 |     """
111 |     Load and create GloVe embeddings.
112 | 
113 |     Arguments:
114 |         embedding_dim : Dimension of embeddings (e.g. 50,100,200,300)
115 |         max_num_words : Maximum count of words in vocabulary
116 |         max_seq_length: Maximum length of word sequence
117 |         tokenizer     : Tokenizer for converting words to integer
118 | 
119 |     Returns:
120 |         tf.keras.layers.Embedding : GloVe embeddings initialized in a Keras Embedding-Layer
121 |     """
122 | 
123 |     print("Loading pre-trained GloVe embedding...")
124 | 
125 |     if not os.path.exists("data"):
126 |         os.makedirs("data")
127 | 
128 |     if not os.path.exists("data/glove"):
129 |         print("No previous embeddings found. Downloading required files...")
130 |         os.makedirs("data/glove")
131 |         http = urllib3.PoolManager()
132 |         response = http.request(
133 |             url     = "http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip",
134 |             method  = "GET",
135 |             retries = False
136 |         )
137 | 
138 |         with ZipFile(BytesIO(response.data)) as myzip:
139 |             for f in myzip.infolist():
140 |                 with open(f"data/glove/{f.filename}", "wb") as outfile:
141 |                     outfile.write(myzip.open(f.filename).read())
142 | 
143 |         print("Download of GloVe embeddings finished.")
144 | 
145 |     embeddings_index = {}
146 |     with open(f"data/glove/glove.6B.{embedding_dim}d.txt") as glove_embedding:
147 |         for line in glove_embedding.readlines():
148 |             values = line.split()
149 |             word = values[0]
150 |             coefs = np.asarray(values[1:], dtype="float32")
151 |             embeddings_index[word] = coefs
152 | 
153 |     print(f"Found {len(embeddings_index)} word vectors in GloVe embedding\n")
154 | 
155 |     embedding_matrix = np.zeros((max_num_words, embedding_dim))
156 | 
157 |     for word, i in tokenizer.word_index.items():
158 |         if i >= max_num_words:
159 |             continue
160 | 
161 |         embedding_vector = embeddings_index.get(word)
162 |         if embedding_vector is not None:
163 |             embedding_matrix[i] = embedding_vector
164 | 
165 |     return tf.keras.layers.Embedding(
166 |         input_dim    = max_num_words,
167 |         output_dim   = embedding_dim,
168 |         input_length = max_seq_length,
169 |         weights      = [embedding_matrix],
170 |         trainable    = True,
171 |         name         = "word_embedding"
172 |     )
173 | 
174 | 
175 | def plot_acc_loss(title, histories, key_acc, key_loss):
176 |     """
177 |     Generate a plot for visualizing accuracy and loss
178 | 
179 |     Arguments:
180 |         title     : Title of visualization
181 |         histories : Array of Keras metrics per run and epoch
182 |         key_acc   : Key of accuracy (accuracy, val_accuracy)
183 |         key_loss  : Key of loss (loss, val_loss)
184 |     """
185 | 
186 |     fig, (ax1, ax2) = plt.subplots(1, 2)
187 | 
188 |     # Accuracy
189 |     ax1.set_title(f"Model accuracy ({title})")
190 |     names = []
191 |     for i, model in enumerate(histories):
192 |         ax1.plot(model[key_acc])
193 |         ax1.set_xlabel("epoch")
194 |         names.append(f"Model {i+1}")
195 |         ax1.set_ylabel("accuracy")
196 |     ax1.legend(names, loc="lower right")
197 | 
198 |     # Loss
199 |     ax2.set_title(f"Model loss ({title})")
200 |     for model in histories:
201 |         ax2.plot(model[key_loss])
202 |     ax2.set_xlabel('epoch')
203 |     ax2.set_ylabel('loss')
204 |     ax2.legend(names, loc='upper right')
205 |     fig.set_size_inches(20, 5)
206 |     plt.show()
207 | 
208 | 
209 | def visualize_features(ml_classifier, nb_neg_features=15, nb_pos_features=15):
210 |     """
211 |     Visualize the trained coefficients of the logistic regression with respect to the vectorizer features.
212 | 
213 |     Arguments:
214 |         ml_classifier   : ML pipeline including the vectorizer as well as the trained model
215 |         nb_neg_features : Number of most negative features to visualize
216 |         nb_pos_features : Number of most positive features to visualize
217 |     """
218 | 
219 |     feature_names = ml_classifier.get_params()['vectorizer'].get_feature_names()
220 |     coef = ml_classifier.get_params()['classifier'].coef_.ravel()
221 | 
222 |     print('Extracted features: {}'.format(len(feature_names)))
223 | 
224 |     pos_coef = np.argsort(coef)[-nb_pos_features:]
225 |     neg_coef = np.argsort(coef)[:nb_neg_features]
226 |     interesting_coefs = np.hstack([neg_coef, pos_coef])
227 | 
228 |     # Plot
229 |     plt.figure(figsize=(20, 5))
230 |     colors = ['red' if c < 0 else 'green' for c in coef[interesting_coefs]]
231 |     plt.bar(np.arange(nb_neg_features + nb_pos_features), coef[interesting_coefs], color=colors)
232 |     feature_names = np.array(feature_names)
233 |     plt.xticks(
234 |         np.arange(nb_neg_features+nb_pos_features),
235 |         feature_names[interesting_coefs],
236 |         size     = 15,
237 |         rotation = 75,
238 |         ha       = 'center'
239 |     );
240 |     plt.show()
241 | 
242 | 
--------------------------------------------------------------------------------