├── Readme.md ├── [Basic] [Document Similarity] [Unsupervised] - TFIDF - BoW - Bag of N-Grams - Kmeans - LDA.ipynb ├── [Introduction] - Big tutorial - Text Classification.ipynb ├── [Supervised] [DL method] GRU_HAN.ipynb ├── [Unsupervised] LDA.ipynb └── pictures ├── LDA2VEC.png ├── characters_attention.gif ├── explainability.gif ├── generative_LDA.gif ├── pyldavis.png ├── tsne_lda.png ├── word_correlations.png └── word_frequency.png /Readme.md: -------------------------------------------------------------------------------- 1 | Multi-class text classification and LDA-based topic Recommender System 2 | ======================================================================== 3 | 4 | Here is **my winning strategy** for carrying out a multi-class text 5 | classification task. 6 | 7 | **Data Source** : 8 | https://catalog.data.gov/dataset/consumer-complaint-database 9 | 10 | 1 - Text Mining 11 | =============== 12 | 13 | - **Word Frequency Plot**: Compare frequencies across different texts 14 | and quantify how similar and different these sets of word 15 | frequencies are using a correlation test. How correlated are the 16 | word frequencies between text1 and text2, and between text1 and 17 | text3? 18 | 19 | ![](https://github.com/adsieg/Multi_Text_Classification/blob/master/pictures/word_frequency.png) 20 | 21 | - **Most discriminant and important words per category** 22 | 23 | - **Relationships between words & Pairwise correlations**: examine 24 | which words tend to follow others immediately, or which tend to 25 | co-occur within the same documents. 26 | 27 | Which word is associated with another word? Note that this is a 28 | visualization of a Markov chain, a common model in text processing. In a 29 | Markov chain, each choice of word depends only on the previous word. In 30 | this case, a random generator following this model might spit out 31 | “collect”, then “agency”, then “report/credit/score”, by following each 32 | word to the most common words that follow it. To make the visualization 33 | interpretable, we chose to show only the most common word-to-word 34 | connections, but one could imagine an enormous graph representing all 35 | connections that occur in the text. 36 | 37 | - **Distribution of words**: The goal is to show that all texts have similar 38 | distributions, with many words that occur rarely and 39 | fewer words that occur frequently. This is where Zipf’s Law 40 | (extended with the harmonic mean) comes in - Zipf’s Law is a statistical 41 | distribution in certain data sets, such as words in a linguistic 42 | corpus, in which the frequencies of certain words are inversely 43 | proportional to their ranks. 44 | 45 | ![](https://github.com/adsieg/Multi_Text_Classification/blob/master/pictures/word_correlations.png) 46 | 47 | - **Spelling variants of a given word** 48 | 49 | - **Chi-Square to see which words are associated with each category**: 50 | find the terms that are the most correlated with each of the 51 | categories 52 | 53 | - **Part of Speech Tags** and **Frequency distribution of POS tags**: Noun 54 | Count, Verb Count, Adjective Count, Adverb Count and Pronoun Count 55 | 56 | - **Metrics of words**: *Word Count of the documents* – i.e. the
total 57 | number of words in the documents, *Character Count of the documents* 58 | – total number of characters in the documents, *Average Word Density 59 | of the documents* – average length of the words used in the 60 | documents, *Punctuation Count in the Complete Essay* – total number 61 | of punctuation marks in the documents, *Upper Case Count in the 62 | Complete Essay* – total number of upper case words in the 63 | documents, *Title Word Count in the Complete Essay* – total number 64 | of proper case (title) words in the documents 65 | 66 | 2 - Word Embedding 67 | ================== 68 | 69 | ### A - Frequency Based Embedding 70 | 71 | - Count Vector 72 | - TF-IDF 73 | - Co-Occurrence Matrix with a fixed context window (SVD) 74 | - TF-ICF 75 | - Function Aware Components 76 | 77 | ### B - Prediction Based Embedding 78 | 79 | - CBOW (word2vec) 80 | - Skip-Grams (word2vec) 81 | - GloVe 82 | - FastText (works at the character n-gram level) 83 | - Topic Model as features // LDA features 84 | 85 | #### LDA 86 | 87 | Visualization provides a global view of the topics (and how they differ 88 | from each other), while at the same time allowing for a deep inspection 89 | of the terms most highly associated with each individual topic. It also relies on a novel 90 | method for choosing which terms to present to a user to aid in the task 91 | of topic interpretation, in which the relevance of a term to a 92 | topic is defined. 93 | 94 | ![](https://github.com/adsieg/Multi_Text_Classification/blob/master/pictures/generative_LDA.gif) 95 | 96 | ![](https://github.com/adsieg/Multi_Text_Classification/blob/master/pictures/pyldavis.png) 97 | 98 | ![](https://github.com/adsieg/Multi_Text_Classification/blob/master/pictures/tsne_lda.png) 99 | 100 | ### C - Poincaré Embedding \[Embeddings and Hyperbolic Geometry\] 101 | 102 | The main innovation here is that these embeddings are learnt in 103 | **hyperbolic space**, as opposed to the commonly used **Euclidean 104 | space**. The reason behind this is that hyperbolic space is more 105 | suitable for capturing any hierarchical information inherently present 106 | in the graph. Embedding nodes into a Euclidean space while preserving 107 | the distance between the nodes usually requires a very high number of 108 | dimensions. 109 | 110 | https://arxiv.org/pdf/1705.08039.pdf 111 | https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Poincare%20Tutorial.ipynb 112 | 113 | **Learning representations** of symbolic data such as text, graphs and 114 | multi-relational data has become a central paradigm in machine learning 115 | and artificial intelligence. For instance, word embeddings such as 116 | **WORD2VEC**, **GLOVE** and **FASTTEXT** are widely used for tasks 117 | ranging from machine translation to sentiment analysis. 118 | 119 | Typically, the **objective of embedding methods** is to organize 120 | symbolic objects (e.g., words, entities, concepts) in a way such that 121 | **their similarity in the embedding space reflects their semantic or 122 | functional similarity**. For this purpose, the similarity of objects is 123 | usually measured either by their **distance** or by their **inner 124 | product** in the embedding space. For instance, Mikolov et al. embed words in 125 | *R^d* such that their **inner product** is maximized when 126 | words co-occur within similar contexts in text corpora. This is 127 | motivated by the **distributional hypothesis**, i.e., that the meaning 128 | of words can be derived from the contexts in which they appear. 
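
To make the feature ideas of sections 1 and 2 concrete, here is a minimal sketch using scikit-learn. The three-document corpus, the labels and the number of topics are invented for illustration only (they are not taken from the consumer-complaint data); for the real task you would load the complaint narratives from the data.gov CSV instead.

```python
# Sketch: TF-IDF vectors, LDA topic features and a chi-square term/category test.
# The toy corpus and labels below are made up for illustration.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_selection import chi2

corpus = [
    "I am disputing a charge on my credit report",
    "The collection agency keeps calling about an old debt",
    "My mortgage payment was applied to the wrong account",
]
labels = ["credit_reporting", "debt_collection", "mortgage"]

# Frequency-based embedding (section 2.A): sparse TF-IDF document vectors
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(corpus)            # shape: (n_docs, n_terms)

# Topic-model features (section 2.B): raw counts -> LDA topic proportions
counts = CountVectorizer(stop_words="english")
X_counts = counts.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
X_topics = lda.fit_transform(X_counts)           # shape: (n_docs, n_topics)

# Chi-square association between terms and categories (section 1)
chi2_scores, _ = chi2(X_counts, labels)
vocab = counts.get_feature_names_out()           # get_feature_names() on older scikit-learn
top_terms = vocab[np.argsort(chi2_scores)[-5:]]

print(X_tfidf.shape, X_topics.shape)
print("Most category-correlated terms:", list(top_terms))
```

Either the sparse TF-IDF matrix or the dense topic proportions (or a concatenation of the two) can then be fed to any of the classifiers listed in section 3 below.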
129 | 130 | 3 - Algorithms 131 | ============== 132 | 133 | ### A - Traditional Methods 134 | 135 | - CountVectorizer + Logistic Regression 136 | - CountVectorizer + NB 137 | - CountVectorizer + LightGBM 138 | - HashingTF + IDF + Logistic Regression 139 | - TFIDF + NB 140 | - TFIDF + LightGBM 141 | - TF-IDF + SVM 142 | - Hashing Vectorizer + Logistic Regression 143 | - Hashing Vectorizer + NB 144 | - Hashing Vectorizer + LightGBM 145 | - Bagging / Boosting 146 | - Word2Vec + Logistic Regression 147 | - Word2Vec + LightGBM 148 | - Word2Vec + XGBoost 149 | - LSA + SVM 150 | 151 | ### B - Deep Learning Methods 152 | 153 | - GRU + Attention Mechanism 154 | - CNN + RNN + Attention Mechanism 155 | - CNN + LSTM/GRU + Attention Mechanism 156 | 157 | 4 - Explainability 158 | ================== 159 | 160 | **Goal**: explain the predictions of arbitrary classifiers, including text 161 | classifiers (useful when it is hard to get an exact mapping between model 162 | coefficients and text features, e.g. if there is dimensionality reduction 163 | involved) 164 | 165 | - Lime 166 | - Skater 167 | - Shap 168 | 169 | ![](https://github.com/adsieg/Multi_Text_Classification/blob/master/pictures/explainability.gif) 170 | 171 | 5 - My app for multi-class text classification with an attention mechanism 172 | ======================================================================= 173 | 174 | ![](https://github.com/adsieg/Multi_Text_Classification/blob/master/pictures/characters_attention.gif) 175 | 176 | 6 - Resources / Bibliography 177 | ============================= 178 | 179 | - **All models** : 180 | https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/ 181 | 182 | - **CNN Text Classification**: 183 | https://github.com/cmasch/cnn-text-classification/blob/master/Evaluation.ipynb 184 | 185 | - **CNN Multichannel Text Classification + Hierarchical attention + 186 | …**: 187 | https://github.com/gaurav104/TextClassification/blob/master/CNN%20Multichannel%20Text%20Classification.ipynb 188 | 189 | - **Notes for Deep Learning** 190 | https://arxiv.org/pdf/1808.09772.pdf 191 | 192 | - **Doc classification with NLP** 193 | https://github.com/mdh266/DocumentClassificationNLP/blob/master/NLP.ipynb 194 | 195 | - **Paragraph Topic Classification** 196 | http://cs229.stanford.edu/proj2016/report/NhoNg-ParagraphTopicClassification-report.pdf 197 | 198 | - **1D convolutional neural networks for NLP** 199 | https://github.com/Tixierae/deep_learning_NLP/blob/master/cnn_imdb.ipynb 200 | 201 | - **Hierarchical Attention for text classification** 202 | https://github.com/Tixierae/deep_learning_NLP/blob/master/HAN/HAN_final.ipynb 203 | 204 | - **Multi-class classification with scikit-learn** (Random forest, SVM, 205 | logistic regression) 206 | https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f 207 | https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Consumer_complaints.ipynb 208 | 209 | - **Text feature extraction TFIDF mathematics** 210 | https://dzone.com/articles/machine-learning-text-feature-0 211 | 212 | - **Classification Yelp Reviews (AWS)** 213 | http://www.developintelligence.com/blog/2017/06/practical-neural-networks-keras-classifying-yelp-reviews/ 214 | 215 | - **Convolutional Neural Networks for Text Classification (wow)** 216 | http://www.davidsbatista.net/blog/2018/03/31/SentenceClassificationConvNets/ 217 | https://github.com/davidsbatista/ConvNets-for-sentence-classification 218 | 219 | - **3 ways to interpret your NLP model 
\[Lime, ELI5, Skater\]** 220 | https://github.com/makcedward/nlp/blob/master/sample/nlp-model_interpretation.ipynb 221 | https://towardsdatascience.com/3-ways-to-interpretate-your-nlp-model-to-management-and-customer-5428bc07ce15 222 | https://medium.freecodecamp.org/how-to-improve-your-machine-learning-models-by-explaining-predictions-with-lime-7493e1d78375 223 | 224 | - **Deep Learning for text made easy with AllenNLP** 225 | https://medium.com/swlh/deep-learning-for-text-made-easy-with-allennlp-62bc79d41f31 226 | 227 | - **Ensemble Classifiers** 228 | https://www.learndatasci.com/tutorials/predicting-reddit-news-sentiment-naive-bayes-text-classifiers/ 229 | 230 | - **Classification Algorithms** \[tfidf, count features, logistic 231 | regression, naive bayes, svm, xgboost, grid search, word vectors, 232 | LSTM, GRU, Ensembling\] : 233 | https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle/notebook 234 | 235 | - **Deep learning architecture** \[TextCNN, Bidirectional 236 | RNN (LSTM/GRU), Attention Models\] : 237 | https://mlwhiz.com/blog/2019/03/09/deeplearning_architectures_text_classification/ 238 | and 239 | https://www.kaggle.com/mlwhiz/attention-pytorch-and-keras 240 | 241 | - **CNN + Word2Vec and LSTM + Word2Vec** : 242 | https://www.kaggle.com/kakiac/deep-learning-4-text-classification-cnn-bi-lstm 243 | 244 | - **Comparison of models** \[Bag of Words - CountVectorizer Features, 245 | TFIDF Features, Hashing Features, Word2Vec Features\] : 246 | https://mlwhiz.com/blog/2019/02/08/deeplearning_nlp_conventional_methods/ 247 | 248 | - **Embed, encode, attend, predict** : 249 | https://explosion.ai/blog/deep-learning-formula-nlp 250 | 251 | - **Nice visualization for understanding CNNs** : 252 | http://www.thushv.com/natural_language_processing/make-cnns-for-nlp-great-again-classifying-sentences-with-cnns-in-tensorflow/ 253 | 254 | - **Yelp comments classification \[LSTM, LSTM + CNN\]** : 255 | https://github.com/msahamed/yelp_comments_classification_nlp/blob/master/word_embeddings.ipynb 256 | 257 | - **RNN text classification** : 258 | https://karpathy.github.io/2015/05/21/rnn-effectiveness/ 259 | 260 | - **CNN for Sentence Classification** & **DCNN for Modelling 261 | Sentences** & **VDNN for Text Classification** & **Multi Channel 262 | Variable size CNN** & **Multi Group Norm Constraint CNN** & **RACNN 263 | Neural Networks for Text Classification**: 264 | https://bicepjai.github.io/machine-learning/2017/11/10/text-class-part1.html 265 | 266 | - **Transformers** : 267 | https://towardsdatascience.com/transformers-141e32e69591 268 | 269 | - **Seq2Seq** : 270 | https://guillaumegenthial.github.io/sequence-to-sequence.html 271 | 272 | 273 | - **The Illustrated BERT, ELMo, and co. 
(How NLP Cracked Transfer 274 | Learning)** : 275 | https://jalammar.github.io/ 276 | 277 | - **LSTM & GRU explanation** : 278 | https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21 279 | 280 | - **Text classification using attention mechanism in Keras** : 281 | http://androidkt.com/text-classification-using-attention-mechanism-in-keras/ 282 | 283 | - **Bernoulli Naive Bayes & Multinomial Naive Bayes & Random Forests & 284 | Linear SVM & SVM with non-linear kernel** 285 | https://github.com/irfanelahi-ds/document-classification-python/blob/master/document_classification_python_sklearn_nltk.ipynb 286 | and 287 | https://richliao.github.io/ 288 | 289 | - **DL text classification** : 290 | https://gitlab.com/the_insighters/data-university/nuggets/document-classification-with-deep-learning 291 | 292 | - **1-D Convolutions over text** : 293 | http://www.davidsbatista.net/blog/2018/03/31/SentenceClassificationConvNets/ 294 | and 295 | https://github.com/davidsbatista/ConvNets-for-sentence-classification/blob/master/Convolutional-Neural-Networks-for-Sentence-Classification.ipynb 296 | 297 | - **\[Bonus\] Sentiment Analysis in PySpark** : 298 | https://github.com/tthustla/setiment_analysis_pyspark/blob/master/Sentiment%20Analysis%20with%20PySpark.ipynb 299 | 300 | - **RNN Text Generation** : 301 | https://github.com/priya-dwivedi/Deep-Learning/blob/master/RNN_text_generation/RNN_project.ipynb 302 | 303 | - **Finding similar documents with Word2Vec and Soft Cosine Measure**: 304 | https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb 305 | 306 | - **\[!! ESSENTIAL !!\] Text Classification with Hierarchical 307 | Attention Networks**: 308 | https://humboldt-wi.github.io/blog/research/information_systems_1819/group5_han/ 309 | 310 | - **\[ESSENTIAL for any NLP Project\]**: 311 | https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks 312 | 313 | - **Doc2Vec + Logistic Regression** : 314 | https://github.com/susanli2016/NLP-with-Python/blob/master/Doc2Vec%20Consumer%20Complaint_3.ipynb 315 | 316 | - **Doc2Vec -> just embedding**: 317 | https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb 318 | 319 | - **New way of embedding -> Poincaré Embeddings**: 320 | https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Poincare%20Tutorial.ipynb 321 | 322 | - **Doc2Vec + Text similarity**: 323 | https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb 324 | 325 | - **Graph Link prediction + Part-of-Speech tagging tutorial with 326 | Keras**: 327 | https://github.com/Cdiscount/IT-Blog/tree/master/scripts/link-prediction 328 | & 329 | https://techblog.cdiscount.com/link-prediction-in-large-scale-networks/ 330 | 331 | 7 - Other Topics - Text Similarity \[Word Mover's Distance\] 332 | ========================================================= 333 | 334 | - **Finding similar documents with Word2Vec and WMD** : 335 | https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html 336 | 337 | - **Introduction to Wasserstein metric (earth mover’s distance)**: 338 | https://yoo2080.wordpress.com/2015/04/09/introduction-to-wasserstein-metric-earth-movers-distance/ 339 | 340 | - **Earthmover Distance**: 341 | https://jeremykun.com/2018/03/05/earthmover-distance/ 342 | Problem: Compute the distance between points with uncertain locations 343 | (given by samples, or differing observations, or clusters). 
For 344 | example, if I have the following three “points” in the plane, as 345 | indicated by their colors, which is closer, blue to green, or blue 346 | to red? 347 | 348 | - **Word Mover’s distance calculation between word pairs of two 349 | documents**: 350 | https://stats.stackexchange.com/questions/303050/word-movers-distance-calculation-between-word-pairs-of-two-documents 351 | 352 | - **Word Mover’s Distance (WMD) for Python**: 353 | https://github.com/stephenhky/PyWMD/blob/master/WordMoverDistanceDemo.ipynb 354 | 355 | - \[LECTURES\] : **Computational Optimal Transport** : 356 | https://optimaltransport.github.io/pdf/ComputationalOT.pdf 357 | 358 | - **Computing the Earth Mover’s Distance under Transformations** : 359 | http://robotics.stanford.edu/~scohen/research/emdg/emdg.html 360 | 361 | - **\[LECTURES\] Slides WMD**: 362 | http://robotics.stanford.edu/~rubner/slides/sld014.htm 363 | 364 | Others \[Quora Datset\] : 365 | ------------------------- 366 | 367 | - **BOW + Xgboost Model** + **Word level TF-IDF + XgBoost** + **N-gram 368 | Level TF-IDF + Xgboost** + **Character Level TF-IDF + XGboost**: 369 | https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Xgboost_bow_tfidf.ipynb 370 | 371 | 8 - Other Topics - Topic Modeling [LDA](#lda) 372 | ============================================= 373 | 374 | https://github.com/FelixChop/MediumArticles/blob/master/LDA-BBC.ipynb 375 | 376 | https://github.com/priya-dwivedi/Deep-Learning/blob/master/topic_modeling/LDA_Newsgroup.ipynb 377 | 378 | - **TF-IDF + K-means & Latent Dirichlet Allocation (with Bokeh)**: 379 | https://ahmedbesbes.com/how-to-mine-newsfeed-data-and-extract-interactive-insights-in-python.html 380 | 381 | - **\[!! ESSENTIAL !!\] Building a LDA-based Book Recommender 382 | System**: 383 | https://humboldt-wi.github.io/blog/research/information_systems_1819/is_lda_final/ 384 | 385 | 9 - Variational Autoencoder 386 | =========================== 387 | 388 | - **Text generation with a Variational Autoencoder** : 389 | https://github.com/NicGian/text_VAE 390 | 391 | - **Variational\_text\_inference** : 392 | https://github.com/s4sarath/Deep-Learning-Projects/tree/master/variational_text_inference 393 | and 394 | https://s4sarath.github.io/2016/11/23/variational_autoenocder_for_Natural_Language_Processing 395 | -------------------------------------------------------------------------------- /[Basic] [Document Similarity] [Unsupervised] - TFIDF - BoW - Bag of N-Grams - Kmeans - LDA.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Import necessary dependencies and settings" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import pandas as pd\n", 17 | "import numpy as np\n", 18 | "import re\n", 19 | "import nltk\n", 20 | "import matplotlib.pyplot as plt\n", 21 | "\n", 22 | "pd.options.display.max_colwidth = 200\n", 23 | "%matplotlib inline" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "# Sample corpus of text documents" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 3, 36 | "metadata": {}, 37 | "outputs": [ 38 | { 39 | "data": { 40 | "text/html": [ 41 | "
\n", 42 | "\n", 55 | "\n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | "
DocumentCategory
0The sky is blue and beautiful.weather
1Love this blue and beautiful sky!weather
2The quick brown fox jumps over the lazy dog.animals
3A king's breakfast has sausages, ham, bacon, eggs, toast and beansfood
4I love green eggs, ham, sausages and bacon!food
5The brown fox is quick and the blue dog is lazy!animals
6The sky is very blue and the sky is very beautiful todayweather
7The dog is lazy but the brown fox is quick!animals
\n", 106 | "
" 107 | ], 108 | "text/plain": [ 109 | " Document Category\n", 110 | "0 The sky is blue and beautiful. weather\n", 111 | "1 Love this blue and beautiful sky! weather\n", 112 | "2 The quick brown fox jumps over the lazy dog. animals\n", 113 | "3 A king's breakfast has sausages, ham, bacon, eggs, toast and beans food\n", 114 | "4 I love green eggs, ham, sausages and bacon! food\n", 115 | "5 The brown fox is quick and the blue dog is lazy! animals\n", 116 | "6 The sky is very blue and the sky is very beautiful today weather\n", 117 | "7 The dog is lazy but the brown fox is quick! animals" 118 | ] 119 | }, 120 | "execution_count": 3, 121 | "metadata": {}, 122 | "output_type": "execute_result" 123 | } 124 | ], 125 | "source": [ 126 | "corpus = ['The sky is blue and beautiful.',\n", 127 | " 'Love this blue and beautiful sky!',\n", 128 | " 'The quick brown fox jumps over the lazy dog.',\n", 129 | " \"A king's breakfast has sausages, ham, bacon, eggs, toast and beans\",\n", 130 | " 'I love green eggs, ham, sausages and bacon!',\n", 131 | " 'The brown fox is quick and the blue dog is lazy!',\n", 132 | " 'The sky is very blue and the sky is very beautiful today',\n", 133 | " 'The dog is lazy but the brown fox is quick!' \n", 134 | "]\n", 135 | "labels = ['weather', 'weather', 'animals', 'food', 'food', 'animals', 'weather', 'animals']\n", 136 | "\n", 137 | "corpus = np.array(corpus)\n", 138 | "corpus_df = pd.DataFrame({'Document': corpus, \n", 139 | " 'Category': labels})\n", 140 | "corpus_df = corpus_df[['Document', 'Category']]\n", 141 | "corpus_df" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "# Simple text pre-processing" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": 4, 154 | "metadata": {}, 155 | "outputs": [], 156 | "source": [ 157 | "wpt = nltk.WordPunctTokenizer()\n", 158 | "stop_words = nltk.corpus.stopwords.words('english')\n", 159 | "\n", 160 | "def normalize_document(doc):\n", 161 | " # lower case and remove special characters\\whitespaces\n", 162 | " doc = re.sub(r'[^a-zA-Z\\s]', '', doc, re.I|re.A)\n", 163 | " doc = doc.lower()\n", 164 | " doc = doc.strip()\n", 165 | " # tokenize document\n", 166 | " tokens = wpt.tokenize(doc)\n", 167 | " # filter stopwords out of document\n", 168 | " filtered_tokens = [token for token in tokens if token not in stop_words]\n", 169 | " # re-create document from filtered tokens\n", 170 | " doc = ' '.join(filtered_tokens)\n", 171 | " return doc\n", 172 | "\n", 173 | "normalize_corpus = np.vectorize(normalize_document)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 5, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "data": { 183 | "text/plain": [ 184 | "array(['sky blue beautiful', 'love blue beautiful sky',\n", 185 | " 'quick brown fox jumps lazy dog',\n", 186 | " 'kings breakfast sausages ham bacon eggs toast beans',\n", 187 | " 'love green eggs ham sausages bacon',\n", 188 | " 'brown fox quick blue dog lazy', 'sky blue sky beautiful today',\n", 189 | " 'dog lazy brown fox quick'], dtype='\n", 251 | "\n", 264 | "\n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | 
" \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | "
baconbeansbeautifulbluebreakfastbrowndogeggsfoxgreenhamjumpskingslazylovequicksausagesskytoasttoday
000110000000000000100
100110000000000100100
200000110100101010000
311001001001010001010
410000001011000101000
500010110100001010000
600110000000000000201
700000110100001010000
\n", 477 | "" 478 | ], 479 | "text/plain": [ 480 | " bacon beans beautiful blue breakfast brown dog eggs fox green \\\n", 481 | "0 0 0 1 1 0 0 0 0 0 0 \n", 482 | "1 0 0 1 1 0 0 0 0 0 0 \n", 483 | "2 0 0 0 0 0 1 1 0 1 0 \n", 484 | "3 1 1 0 0 1 0 0 1 0 0 \n", 485 | "4 1 0 0 0 0 0 0 1 0 1 \n", 486 | "5 0 0 0 1 0 1 1 0 1 0 \n", 487 | "6 0 0 1 1 0 0 0 0 0 0 \n", 488 | "7 0 0 0 0 0 1 1 0 1 0 \n", 489 | "\n", 490 | " ham jumps kings lazy love quick sausages sky toast today \n", 491 | "0 0 0 0 0 0 0 0 1 0 0 \n", 492 | "1 0 0 0 0 1 0 0 1 0 0 \n", 493 | "2 0 1 0 1 0 1 0 0 0 0 \n", 494 | "3 1 0 1 0 0 0 1 0 1 0 \n", 495 | "4 1 0 0 0 1 0 1 0 0 0 \n", 496 | "5 0 0 0 1 0 1 0 0 0 0 \n", 497 | "6 0 0 0 0 0 0 0 2 0 1 \n", 498 | "7 0 0 0 1 0 1 0 0 0 0 " 499 | ] 500 | }, 501 | "execution_count": 7, 502 | "metadata": {}, 503 | "output_type": "execute_result" 504 | } 505 | ], 506 | "source": [ 507 | "# get all unique words in the corpus\n", 508 | "vocab = cv.get_feature_names()\n", 509 | "# show document feature vectors\n", 510 | "pd.DataFrame(cv_matrix, columns=vocab)" 511 | ] 512 | }, 513 | { 514 | "cell_type": "markdown", 515 | "metadata": {}, 516 | "source": [ 517 | "# Bag of N-Grams Model" 518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": 8, 523 | "metadata": {}, 524 | "outputs": [ 525 | { 526 | "data": { 527 | "text/html": [ 528 | "
\n", 529 | "\n", 542 | "\n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | "
bacon eggsbeautiful skybeautiful todayblue beautifulblue dogblue skybreakfast sausagesbrown foxdog lazyeggs ham...lazy doglove bluelove greenquick bluequick brownsausages baconsausages hamsky beautifulsky bluetoast beans
00001000000...0000000010
10101000000...0100000000
20000000100...1000100000
31000001000...0000001001
40000000001...0010010000
50000100110...0001000000
60010010000...0000000110
70000000110...0000000000
\n", 764 | "

8 rows × 29 columns

\n", 765 | "
" 766 | ], 767 | "text/plain": [ 768 | " bacon eggs beautiful sky beautiful today blue beautiful blue dog \\\n", 769 | "0 0 0 0 1 0 \n", 770 | "1 0 1 0 1 0 \n", 771 | "2 0 0 0 0 0 \n", 772 | "3 1 0 0 0 0 \n", 773 | "4 0 0 0 0 0 \n", 774 | "5 0 0 0 0 1 \n", 775 | "6 0 0 1 0 0 \n", 776 | "7 0 0 0 0 0 \n", 777 | "\n", 778 | " blue sky breakfast sausages brown fox dog lazy eggs ham ... \\\n", 779 | "0 0 0 0 0 0 ... \n", 780 | "1 0 0 0 0 0 ... \n", 781 | "2 0 0 1 0 0 ... \n", 782 | "3 0 1 0 0 0 ... \n", 783 | "4 0 0 0 0 1 ... \n", 784 | "5 0 0 1 1 0 ... \n", 785 | "6 1 0 0 0 0 ... \n", 786 | "7 0 0 1 1 0 ... \n", 787 | "\n", 788 | " lazy dog love blue love green quick blue quick brown sausages bacon \\\n", 789 | "0 0 0 0 0 0 0 \n", 790 | "1 0 1 0 0 0 0 \n", 791 | "2 1 0 0 0 1 0 \n", 792 | "3 0 0 0 0 0 0 \n", 793 | "4 0 0 1 0 0 1 \n", 794 | "5 0 0 0 1 0 0 \n", 795 | "6 0 0 0 0 0 0 \n", 796 | "7 0 0 0 0 0 0 \n", 797 | "\n", 798 | " sausages ham sky beautiful sky blue toast beans \n", 799 | "0 0 0 1 0 \n", 800 | "1 0 0 0 0 \n", 801 | "2 0 0 0 0 \n", 802 | "3 1 0 0 1 \n", 803 | "4 0 0 0 0 \n", 804 | "5 0 0 0 0 \n", 805 | "6 0 1 1 0 \n", 806 | "7 0 0 0 0 \n", 807 | "\n", 808 | "[8 rows x 29 columns]" 809 | ] 810 | }, 811 | "execution_count": 8, 812 | "metadata": {}, 813 | "output_type": "execute_result" 814 | } 815 | ], 816 | "source": [ 817 | "# you can set the n-gram range to 1,2 to get unigrams as well as bigrams\n", 818 | "bv = CountVectorizer(ngram_range=(2,2))\n", 819 | "bv_matrix = bv.fit_transform(norm_corpus)\n", 820 | "\n", 821 | "bv_matrix = bv_matrix.toarray()\n", 822 | "vocab = bv.get_feature_names()\n", 823 | "pd.DataFrame(bv_matrix, columns=vocab)" 824 | ] 825 | }, 826 | { 827 | "cell_type": "markdown", 828 | "metadata": {}, 829 | "source": [ 830 | "# TF-IDF Model" 831 | ] 832 | }, 833 | { 834 | "cell_type": "code", 835 | "execution_count": 12, 836 | "metadata": {}, 837 | "outputs": [ 838 | { 839 | "name": "stderr", 840 | "output_type": "stream", 841 | "text": [ 842 | "C:\\Program Files\\Anaconda3\\lib\\site-packages\\sklearn\\feature_extraction\\text.py:1059: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", 843 | " if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.float):\n" 844 | ] 845 | }, 846 | { 847 | "data": { 848 | "text/html": [ 849 | "
\n", 850 | "\n", 863 | "\n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | " \n", 1015 | " \n", 1016 | " \n", 1017 | " \n", 1018 | " \n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | " \n", 1025 | " \n", 1026 | " \n", 1027 | " \n", 1028 | " \n", 1029 | " \n", 1030 | " \n", 1031 | " \n", 1032 | " \n", 1033 | " \n", 1034 | " \n", 1035 | " \n", 1036 | " \n", 1037 | " \n", 1038 | " \n", 1039 | " \n", 1040 | " \n", 1041 | " \n", 1042 | " \n", 1043 | " \n", 1044 | " \n", 1045 | " \n", 1046 | " \n", 1047 | " \n", 1048 | " \n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " \n", 1064 | " \n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | " \n", 1072 | " \n", 1073 | " \n", 1074 | " \n", 1075 | "
baconbeansbeautifulbluebreakfastbrowndogeggsfoxgreenhamjumpskingslazylovequicksausagesskytoasttoday
00.000.000.600.530.000.000.000.000.000.000.000.000.000.000.000.000.000.600.000.0
10.000.000.490.430.000.000.000.000.000.000.000.000.000.000.570.000.000.490.000.0
20.000.000.000.000.000.380.380.000.380.000.000.530.000.380.000.380.000.000.000.0
30.320.380.000.000.380.000.000.320.000.000.320.000.380.000.000.000.320.000.380.0
40.390.000.000.000.000.000.000.390.000.470.390.000.000.000.390.000.390.000.000.0
50.000.000.000.370.000.420.420.000.420.000.000.000.000.420.000.420.000.000.000.0
60.000.000.360.320.000.000.000.000.000.000.000.000.000.000.000.000.000.720.000.5
70.000.000.000.000.000.450.450.000.450.000.000.000.000.450.000.450.000.000.000.0
\n", 1076 | "
" 1077 | ], 1078 | "text/plain": [ 1079 | " bacon beans beautiful blue breakfast brown dog eggs fox green \\\n", 1080 | "0 0.00 0.00 0.60 0.53 0.00 0.00 0.00 0.00 0.00 0.00 \n", 1081 | "1 0.00 0.00 0.49 0.43 0.00 0.00 0.00 0.00 0.00 0.00 \n", 1082 | "2 0.00 0.00 0.00 0.00 0.00 0.38 0.38 0.00 0.38 0.00 \n", 1083 | "3 0.32 0.38 0.00 0.00 0.38 0.00 0.00 0.32 0.00 0.00 \n", 1084 | "4 0.39 0.00 0.00 0.00 0.00 0.00 0.00 0.39 0.00 0.47 \n", 1085 | "5 0.00 0.00 0.00 0.37 0.00 0.42 0.42 0.00 0.42 0.00 \n", 1086 | "6 0.00 0.00 0.36 0.32 0.00 0.00 0.00 0.00 0.00 0.00 \n", 1087 | "7 0.00 0.00 0.00 0.00 0.00 0.45 0.45 0.00 0.45 0.00 \n", 1088 | "\n", 1089 | " ham jumps kings lazy love quick sausages sky toast today \n", 1090 | "0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.60 0.00 0.0 \n", 1091 | "1 0.00 0.00 0.00 0.00 0.57 0.00 0.00 0.49 0.00 0.0 \n", 1092 | "2 0.00 0.53 0.00 0.38 0.00 0.38 0.00 0.00 0.00 0.0 \n", 1093 | "3 0.32 0.00 0.38 0.00 0.00 0.00 0.32 0.00 0.38 0.0 \n", 1094 | "4 0.39 0.00 0.00 0.00 0.39 0.00 0.39 0.00 0.00 0.0 \n", 1095 | "5 0.00 0.00 0.00 0.42 0.00 0.42 0.00 0.00 0.00 0.0 \n", 1096 | "6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.72 0.00 0.5 \n", 1097 | "7 0.00 0.00 0.00 0.45 0.00 0.45 0.00 0.00 0.00 0.0 " 1098 | ] 1099 | }, 1100 | "execution_count": 12, 1101 | "metadata": {}, 1102 | "output_type": "execute_result" 1103 | } 1104 | ], 1105 | "source": [ 1106 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 1107 | "\n", 1108 | "tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)\n", 1109 | "tv_matrix = tv.fit_transform(norm_corpus)\n", 1110 | "tv_matrix = tv_matrix.toarray()\n", 1111 | "\n", 1112 | "vocab = tv.get_feature_names()\n", 1113 | "pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)" 1114 | ] 1115 | }, 1116 | { 1117 | "cell_type": "markdown", 1118 | "metadata": {}, 1119 | "source": [ 1120 | "# Document Similarity" 1121 | ] 1122 | }, 1123 | { 1124 | "cell_type": "code", 1125 | "execution_count": 13, 1126 | "metadata": {}, 1127 | "outputs": [ 1128 | { 1129 | "data": { 1130 | "text/html": [ 1131 | "
\n", 1132 | "\n", 1145 | "\n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " \n", 1160 | " \n", 1161 | " \n", 1162 | " \n", 1163 | " \n", 1164 | " \n", 1165 | " \n", 1166 | " \n", 1167 | " \n", 1168 | " \n", 1169 | " \n", 1170 | " \n", 1171 | " \n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " \n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | " \n", 1199 | " \n", 1200 | " \n", 1201 | " \n", 1202 | " \n", 1203 | " \n", 1204 | " \n", 1205 | " \n", 1206 | " \n", 1207 | " \n", 1208 | " \n", 1209 | " \n", 1210 | " \n", 1211 | " \n", 1212 | " \n", 1213 | " \n", 1214 | " \n", 1215 | " \n", 1216 | " \n", 1217 | " \n", 1218 | " \n", 1219 | " \n", 1220 | " \n", 1221 | " \n", 1222 | " \n", 1223 | " \n", 1224 | " \n", 1225 | " \n", 1226 | " \n", 1227 | " \n", 1228 | " \n", 1229 | " \n", 1230 | " \n", 1231 | " \n", 1232 | " \n", 1233 | " \n", 1234 | " \n", 1235 | " \n", 1236 | " \n", 1237 | " \n", 1238 | " \n", 1239 | " \n", 1240 | " \n", 1241 | " \n", 1242 | " \n", 1243 | " \n", 1244 | " \n", 1245 | " \n", 1246 | " \n", 1247 | " \n", 1248 | " \n", 1249 | "
01234567
01.0000000.8205990.0000000.0000000.0000000.1923530.8172460.000000
10.8205991.0000000.0000000.0000000.2254890.1578450.6706310.000000
20.0000000.0000001.0000000.0000000.0000000.7918210.0000000.850516
30.0000000.0000000.0000001.0000000.5068660.0000000.0000000.000000
40.0000000.2254890.0000000.5068661.0000000.0000000.0000000.000000
50.1923530.1578450.7918210.0000000.0000001.0000000.1154880.930989
60.8172460.6706310.0000000.0000000.0000000.1154881.0000000.000000
70.0000000.0000000.8505160.0000000.0000000.9309890.0000001.000000
\n", 1250 | "
" 1251 | ], 1252 | "text/plain": [ 1253 | " 0 1 2 3 4 5 6 \\\n", 1254 | "0 1.000000 0.820599 0.000000 0.000000 0.000000 0.192353 0.817246 \n", 1255 | "1 0.820599 1.000000 0.000000 0.000000 0.225489 0.157845 0.670631 \n", 1256 | "2 0.000000 0.000000 1.000000 0.000000 0.000000 0.791821 0.000000 \n", 1257 | "3 0.000000 0.000000 0.000000 1.000000 0.506866 0.000000 0.000000 \n", 1258 | "4 0.000000 0.225489 0.000000 0.506866 1.000000 0.000000 0.000000 \n", 1259 | "5 0.192353 0.157845 0.791821 0.000000 0.000000 1.000000 0.115488 \n", 1260 | "6 0.817246 0.670631 0.000000 0.000000 0.000000 0.115488 1.000000 \n", 1261 | "7 0.000000 0.000000 0.850516 0.000000 0.000000 0.930989 0.000000 \n", 1262 | "\n", 1263 | " 7 \n", 1264 | "0 0.000000 \n", 1265 | "1 0.000000 \n", 1266 | "2 0.850516 \n", 1267 | "3 0.000000 \n", 1268 | "4 0.000000 \n", 1269 | "5 0.930989 \n", 1270 | "6 0.000000 \n", 1271 | "7 1.000000 " 1272 | ] 1273 | }, 1274 | "execution_count": 13, 1275 | "metadata": {}, 1276 | "output_type": "execute_result" 1277 | } 1278 | ], 1279 | "source": [ 1280 | "from sklearn.metrics.pairwise import cosine_similarity\n", 1281 | "\n", 1282 | "similarity_matrix = cosine_similarity(tv_matrix)\n", 1283 | "similarity_df = pd.DataFrame(similarity_matrix)\n", 1284 | "similarity_df" 1285 | ] 1286 | }, 1287 | { 1288 | "cell_type": "markdown", 1289 | "metadata": {}, 1290 | "source": [ 1291 | "## Clustering documents using similarity features" 1292 | ] 1293 | }, 1294 | { 1295 | "cell_type": "code", 1296 | "execution_count": 14, 1297 | "metadata": {}, 1298 | "outputs": [ 1299 | { 1300 | "data": { 1301 | "text/html": [ 1302 | "
\n", 1303 | "\n", 1316 | "\n", 1317 | " \n", 1318 | " \n", 1319 | " \n", 1320 | " \n", 1321 | " \n", 1322 | " \n", 1323 | " \n", 1324 | " \n", 1325 | " \n", 1326 | " \n", 1327 | " \n", 1328 | " \n", 1329 | " \n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | " \n", 1336 | " \n", 1337 | " \n", 1338 | " \n", 1339 | " \n", 1340 | " \n", 1341 | " \n", 1342 | " \n", 1343 | " \n", 1344 | " \n", 1345 | " \n", 1346 | " \n", 1347 | " \n", 1348 | " \n", 1349 | " \n", 1350 | " \n", 1351 | " \n", 1352 | " \n", 1353 | " \n", 1354 | " \n", 1355 | " \n", 1356 | " \n", 1357 | " \n", 1358 | " \n", 1359 | " \n", 1360 | " \n", 1361 | " \n", 1362 | " \n", 1363 | " \n", 1364 | " \n", 1365 | " \n", 1366 | " \n", 1367 | " \n", 1368 | " \n", 1369 | " \n", 1370 | " \n", 1371 | " \n", 1372 | " \n", 1373 | " \n", 1374 | " \n", 1375 | " \n", 1376 | " \n", 1377 | "
Document\\Cluster 1Document\\Cluster 2DistanceCluster Size
0270.2530982
1060.3085392
2580.3869523
3190.4898453
4340.7329452
511122.695655
610133.451088
\n", 1378 | "
" 1379 | ], 1380 | "text/plain": [ 1381 | " Document\\Cluster 1 Document\\Cluster 2 Distance Cluster Size\n", 1382 | "0 2 7 0.253098 2\n", 1383 | "1 0 6 0.308539 2\n", 1384 | "2 5 8 0.386952 3\n", 1385 | "3 1 9 0.489845 3\n", 1386 | "4 3 4 0.732945 2\n", 1387 | "5 11 12 2.69565 5\n", 1388 | "6 10 13 3.45108 8" 1389 | ] 1390 | }, 1391 | "execution_count": 14, 1392 | "metadata": {}, 1393 | "output_type": "execute_result" 1394 | } 1395 | ], 1396 | "source": [ 1397 | "from scipy.cluster.hierarchy import dendrogram, linkage\n", 1398 | "\n", 1399 | "Z = linkage(similarity_matrix, 'ward')\n", 1400 | "pd.DataFrame(Z, columns=['Document\\Cluster 1', 'Document\\Cluster 2', \n", 1401 | " 'Distance', 'Cluster Size'], dtype='object')" 1402 | ] 1403 | }, 1404 | { 1405 | "cell_type": "code", 1406 | "execution_count": 15, 1407 | "metadata": {}, 1408 | "outputs": [ 1409 | { 1410 | "data": { 1411 | "text/plain": [ 1412 | "" 1413 | ] 1414 | }, 1415 | "execution_count": 15, 1416 | "metadata": {}, 1417 | "output_type": "execute_result" 1418 | }, 1419 | { 1420 | "data": { 1421 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAfUAAADjCAYAAACcsI0jAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHzJJREFUeJzt3X+cVnWd9/HXWyDBMKkgQBTJwkqzRp3VtNzm3rSEteTe\nagVK061GTdssqzVv18xuu6t7+ykqTeWi5dhaaouGd7abo1Jqgo4gKkimgaACJTCAKPi5/zjf0YvL\na2augevMNXPm/Xw8rsdc55zvOedznTkzn+t7vt9zvooIzMzMbODbrd4BmJmZWW04qZuZmRWEk7qZ\nmVlBOKmbmZkVhJO6mZlZQTipm5mZFYSTuvV7kpZIauoHcUySFJKGdrH8PEk/ynMfVax/oaSf7koM\ntSKpQ9L+9Y6jFtLv5I31jsOsJ07qVleSHpN0TNm8UyTN75yOiIMioq3Pg+uliPhaRHwi7/1Imilp\nQUqaqyXdLOldNdz+Ln2x6BQRIyPi0VrF1Sl9cXle0sb0WiZplqTxtd6X2UDjpG6FtTNJSdKQPGKp\nFUmfA74LfA0YC0wELgU+UM+4Su3ql4Eq/UdE7Am8BvifwDhgYT0Sey3PGWX8f9l2mk8e6/dKa/OS\ndpN0rqQ/Slon6VpJr0nLOmuYH5f0Z+C3af7PJT0pab2k2yUdVLLtOZIulzRP0ibgf0gaIelbkh5P\n68yXNKIkpI9I+rOktZL+V8m2drj0Leldkn4v6RlJKySdkub/vaT7JG1I8y+s8jjsBVwEnBkR10fE\npoh4PiJuiogvVijfJGllN8fy8FTj3yDpKUnfTsVuTz+fSVcDjkzl/0nSQ5L+KunXkvYr2W5IOlPS\nI8AjJfPeWHKcL5X0q1S7vlvSG0rWf6+kpel4XybpNkk9XvVIn38JcCKwBjinZJvHS2pPx//3kt5W\ndhw+L2lR2ud/SBpesvwL6SrIKkn/VHYMK50ze0m6StKadN6c35mcJQ1J59NaSX+SdJZKroRIapN0\nsaTfAZuB/SWdmo71RkmPSjqt/Pcq6YuSnk5xTpM0VdlVi79IOq+nY2fF5KRuA82ngWnAu4G9gb+S\n1VRLvRt4C/C+NH0zMBl4HXAvcHVZ+ZnAxcCewHzg34DDgKPIaoJfBF4oKf8u4E3Ae4ALJL2lPMiU\n8G4GLgHGAA1Ae1q8CTgZGAX8PXCGpGlVfPYjgeHADVWUrcb3gO9FxKuANwDXpvl/m36OSpfQ75R0\nAnAe8A9kn+cO4Jqy7U0DjgAO7GJ/04GvAK8GlpMdcySNBn4BfAl4LbCU7NhXLSK2A/8JHJ22eQhw\nBXBa2uYPgLmSdi9Z7R+B44DXA28DTknrHgd8HjiW7LzZoXkoKT9nLgH2AvYnO/9OBk5NZT8JTCE7\nBw4lO07lTgKa0/YeB54GjgdelbbzHUmHlpQfR3YuTAAuAH4IfJTsvD0a+FdJr+/qeFmBRYRfftXt\nBTwGdADPlLw2A/PLyhyT3j8EvKdk2XjgeWAoMAkIYP9u9jcqldkrTc8BripZvhuwBXh7hXU7t79P\nybw/ANPT+wuBn6b3XwJuqPIYfBf4Ttk+hlYo9xHgyR62VRpDE7CywvHuPJa3kyXZ0V18zqEl824G\nPl52nDYD+6XpAP6ubDsBvLHkOP+oZNlU4OH0/mTgzpJlAlYAn+jpM5bNPx14JL2/HPhq2fKlwLtL\njsNHS5Z9E5id3l8BfL1k2QEVPkvpOTMEeA44sGTeaUBbev9b4LSSZceUHl+gDbioh9/rL4HPlPxe\ntwBD0vSeaXtHlJRfCEzL6+/Wr/77ck3d+oNpETGq8wV8qpuy+wE3pEuqz5Al+e1k7cudVnS+SZc+\nv67scv0Gsn/mAKMrlU/zhwN/7CaGJ0vebwZGViizb1fbkHSEpFvTpdr1ZMlodKWyZdYBo1W7NuuP\nkyWshyXdI+n4bsruB3yv5Lj/hSz5Tigps6Limi/p6rjtXbpuRASwQ7NBlSakuDrjPacz3hTzvmlf\nvYqHrOZcrvycGVZW7nFeOjbl26t0nHaYJ2mKpLvSpfRnyL4ElZ4j6yK7OgFZggd4qmT5Fiqfl1Zw\nTuo20KwAppR+CYiI4RHxREmZ0qEHZwInkNWO9iKrhUKWkCqVXws8S3Y5elfj7GobrcBcYN+I2AuY\nXRZPV+4EtlL58m0lm4A9OieUdega0zkdEY9ExAyyZolvAL+Q9Ep2PB6dVpDVNkuP+4iI+H1JmZ0d\n8nE1sE9JnCqdrkZqv34/WbNAZ7wXl8W7R0SUNxl0Fc++JdMTK5QpP2eeJ/siUbpO5zm5w+cr2/bL\ntpeaCK4jawYam77ozqO6c8QGOSd1G2hmAxd3dtKSNCa193ZlT7JEuI4swX2tu41HxAtkl1+/LWnv\nVNM/sqwtthpXA8dI+kdJQyW9VlJDSUx/iYhnJR1O9sWjRxGxn
qz99NLUMWoPScNSre6bFVZZBgxX\n1jFvGHA+8OLnkPRRSWPSZ34mzX6BrMPZC2Ttw51mA19S6mSYOoZ9uMpj0ZNfAQenzzQUOJOszbhH\n6di+hax9fxzQ2dnvh8Dp6aqIJL0yHYc9q9jstcApkg6UtAfw5e4KpxrztWTn5Z7p3Pwc0Nlp8lrg\nM5ImSBoF/EsP+38F2e9pDbBN0hTgvVXEbeakbgPO98hqubdI2gjcRdY5qytXkV0KfQJ4MJXvyeeB\nxcA9ZJdzv0Ev/1Yi4s9kl0zPSdtoB96eFn8KuCjFfwEvdVCrZrvfIksY55P9018BnEXW5lpedn3a\n14/IPv8mdrysfRywRFIH2XGdHhFbImIzWSew36VL1++IiBvIjsPPUjPGA2Sdv3ZZRKwFPkzWrr2O\nrKPdArIvY105McW9nux8WAccFhGr0jYXkHVQm0XWmXI5qSNcFfHcTNbP4bdpvd9WsdqnyY7vo2Qd\n51rJvhxC9gXjFmARcB9ZrXsbWbNRpf1vBP6Z7Lz4K9mXvrnVxG6mrPnKzKx/SJfSVwIfiYhb6x1P\nraWa9+yI2K/Hwma95Jq6mdWdpPdJGpWaOc4jaz+u5qpKv6fsuQdTU1PBBLLL+bW6LdFsB07qZtYf\nHEl2t8Basg5v0yJiS/erDBgiu3Xwr2SX3x8ia3YxqzlffjczMysI19TNzMwKwkndzMysIPpiNKWa\nGj16dEyaNKneYZiZmfWZhQsXro2IMT2VG3BJfdKkSSxYsKDeYZiZmfUZSZUeV/wyuV1+lzRc0h8k\n3S9piaSvVCjTpGzYw/b0co9QMzOznZRnTX0r2ahNHekRlfMl3RwR5fee3hER3Q0kYWZmZlXILamn\nkZY60uSw9PL9c2ZmZjnJtfd7GgyjHXga+E1E3F2h2FGSFkm6uXOwiArbaZa0QNKCNWvW5BmymZnZ\ngJVrR7k0elFDGpnoBklvjYgHSorcC0xMl+inkg1KMbnCdlqAFoDGxkbX9su0tEBra72jMLNamTkT\nmpvrHYUNRH1yn3pEPAPcSjYqVOn8DRHRkd7PA4ZJGt0XMRVJayu0t9c7CjOrhfZ2f0m3nZdbTV3S\nGOD5iHhG0gjgWLKhG0vLjAOeiohI40rvRjaEovVSQwO0tdU7CjPbVU1N9Y7ABrI8L7+PB66UNIQs\nWV8bETdJOh0gImYDHwLOkLQN2EI2nrMvr5uZme2EPHu/LwIOqTB/dsn7WcCsvGIwMzMbTPzsdzMz\ns4JwUjczMysIJ3UzM7OCcFI3MzMrCCd1MzOzgnBSNzMzKwgndTMzs4JwUjczMysIJ3UzM7OCcFI3\nMzMrCCd1MzOzgnBSNzMzKwgndTMzs4LILalLGi7pD5Lul7RE0lcqlJGk70taLmmRpEPzisfMzKzo\n8hxPfSvwdxHRIWkYMF/SzRFxV0mZKcDk9DoCuDz9NDMzs17KraYemY40OSy9oqzYCcBVqexdwChJ\n4/OKyczMrMhybVOXNERSO/A08JuIuLusyARgRcn0yjTPzMzMeinXpB4R2yOiAdgHOFzSW3dmO5Ka\nJS2QtGDNmjW1DdLMzKwg+qT3e0Q8A9wKHFe26Alg35LpfdK88vVbIqIxIhrHjBmTX6BmZmYDWJ69\n38dIGpXejwCOBR4uKzYXODn1gn8HsD4iVucVk5mZWZHl2ft9PHClpCFkXx6ujYibJJ0OEBGzgXnA\nVGA5sBk4Ncd4zMzMCi23pB4Ri4BDKsyfXfI+gDPzisHMzGww8RPlzMzMCsJJ3czMrCCc1M3MzArC\nSd3MzKwgnNTNzMwKwkndzMysIJzUzczMCiLPh8+Y2SDX0gKtrfWOYmBpb89+NjXVNYwBZ+ZMaG6u\ndxT155q6meWmtfWlJGXVaWjIXla99nZ/eezkmrqZ5aqhAdra6h2FFZmvarzENXUzM7OCcFI3MzMr\nCCd1MzOzgnBSNzMzK4jckrqkfSXdKulBSUskfaZCmSZJ6yW1p9cFecVjZmZWdHn2ft8GnBMR90ra\nE1go6TcR8WBZuTsi4vgc4zAzMxsUcqupR8TqiLg3vd8IPARMyGt/ZmZmg12ftKlLmgQcAtxdYfFR\nkhZJulnSQV2s3yxpgaQFa9asyTFSMzOzgSv3pC5pJHAdcHZEbChbfC8wMSLeBlwC/LLSNiKiJSIa\nI6JxzJgx+QZsZmY2QOWa1CUNI0voV0fE9eXLI2JDRHSk9/OAYZJG5xmTmZlZUeXZ+13Aj4GHIuLb\nXZQZl8oh6fAUz7q8YjIzMyuyPHu/vxM4CVgsqXNIh/OAiQARMRv4EHCGpG3AFmB6RESOMZmZmRVW\nbkk9IuYD6qHMLGBWXjGYmZkNJn6inJmZWUE4qZuZmRWEk7qZmVlBOKmbmZkVhJO6mZlZQTipm5mZ\nFUTVSV3SfpKOSe9HpJHXzMzMrJ+oKqlL+iTwC+AHadY+dPGcdjMzM6uPamvqZ5I9IW4DQEQ8Arwu\nr6DMzMys96pN6lsj4rnOCUlDAT/O1czMrB+pNqnfJuk8YISkY4GfAzfmF5aZmZn1VrVJ/VxgDbAY\nOA2YB5yfV1BmZmbWe9UO6DICuCIifgggaUiatzmvwMzMzKx3qq2p/zdZEu80Aviv7laQtK+kWyU9\nKGmJpM9UKCNJ35e0XNIiSYdWH7qZmZmVqramPjwiOjonIqJD0h49rLMNOCci7k33tC+U9JuIeLCk\nzBRgcnodAVyefpqZmVkvVVtT31Rai5Z0GLCluxUiYnVE3JvebwQeAiaUFTsBuCoydwGjJI2vOnoz\nMzN7UbU19bOBn0taBQgYB5xY7U4kTQIOAe4uWzQBWFEyvTLNW13tts3MzCxTVVKPiHskvRl4U5q1\nNCKer2ZdSSOB64CzI2LDzgQpqRloBpg4ceLObMLMzKzwqq2pA/wNMCmtc6gkIuKq7laQNIwsoV8d\nEddXKPIEsG/J9D5p3g4iogVoAWhsbPRDb8zMzCqoKqlL+gnwBqAd2J5mB9BlUpck4MfAQxHx7S6K\nzQXOkvQzsg5y6yPCl97NzMx2QrU19UbgwIjoTS35ncBJwGJJ7WneecBEgIiYTfYQm6nAcrJ73k/t\nxfbNzMysRLVJ/QGyznFV16IjYj5Zp7ruygTZYDFmZma2i6pN6qOBByX9AdjaOTMiPpBLVGZmZtZr\n1Sb1C/MMwszMzHZdtbe03ZZ3IGZmZrZrqnqinKR3SLpHUoek5yRtl7RT95ybmZlZPqp9TOwsYAbw\nCNlgLp8ALs0rKDMzM+u9apM6EbEcGBIR2yPi34Hj8gvLzMzMeqvajnKbJb0CaJf0TbJb26r+QmBm\nZmb5qzYxn5TKngVsInu06z/kFZSZmZn1XrVJfVpEPBsRGyLiKxHxOeD4PAMzMzOz3qk2qX+swrxT\nahiHmZmZ7aJu29QlzQBmAq+XNLdk0auAv+QZmJmZmfVOTx3lfk/WKW408K2S+RuBRXkFZWZmZr3X\nbVKPiMeBxyUdA2yJ
iBckHQC8GVjcFwGamZlZdaptU78dGC5pAnALWW/4OXkFZWZmZr1XbVJXRGwm\nu43tsoj4MHBQtytIV0h6WtIDXSxvkrReUnt6XdC70M3MzKxU1Uld0pHAR4BfpXlDelhnDj0/de6O\niGhIr4uqjMXMzMwqqDapnw18CbghIpZI2h+4tbsVIuJ23EPezMysz/Rm6NXbSqYfBf65Bvs/StIi\n4Ang8xGxpFIhSc1AM8DEiRNrsFszM7Pi6ek+9e9GxNmSbgSifHlEfGAX9n0vMDEiOiRNBX4JTK5U\nMCJagBaAxsbGl8VhZmZmPdfUf5J+/lutdxwRG0rez5N0maTREbG21vsyMzMbDHq6T31h+nmbpDHp\n/Zpa7FjSOOCpiAhJh5O176+rxbbNzMwGox7b1CVdSDY6227ZpLYBl/TUW13SNUATMFrSSuDLwDCA\niJgNfAg4I21vCzA9Inxp3czMbCf11Kb+OeCdwN9ExJ/SvP2ByyV9NiK+09W6ETGju21HxCxgVu9D\nNjMzs0p6uqXtJGBGZ0KHF3u+fxQ4Oc/AzMzMrHd6SurDKnVcS+3qw/IJyczMzHZGT0n9uZ1cZmZm\nZn2sp45yb5e0ocJ8AcNziMfMzMx2Uk+3tPX0fHczMzPrJ6p99ruZmZn1c07qZmZmBeGkbmZmVhBO\n6mZmZgXhpG5mZlYQTupmZmYF4aRuZmZWEE7qZmZmBZFbUpd0haSnJT3QxXJJ+r6k5ZIWSTo0r1jM\nzMwGgzxr6nOA47pZPgWYnF7NwOU5xmJmZlZ4uSX1iLgd+Es3RU4ArorMXcAoSePzisfMzKzo6tmm\nPgFYUTK9Ms0zMzOzndDTKG39gqRmskv0jB07lgsvvJAPfvCDtLW1sW7dOpqbm2lpaeHggw9m5MiR\n3HnnncyYMYObbrqJrVu3MnPmTObMmcNhhx0GwMKFCznllFNobW1l99135/jjj+eaa67hyCOPpKOj\ng8WLF7+4zde+9rU0NTVx3XXX0dTUxKpVq1i2bNmLy8ePH09jYyM33ngj733ve1m2bBmPPfbYi8sn\nTZrEAQccwC233ML73/9+FixYwOrVq19cfsABB7D33nvT1ta2059p7Vro6FjIY48V5zMV8fc0GD/T\nmjUdbNq0mFWrivOZivh7Guif6fHHW9ltt91ZurQ4n6n891R1voyIXc25XW9cmgTcFBFvrbDsB0Bb\nRFyTppcCTRGxurttNjY2xoIFC3KIduBqasp+trXVMwqzl/O5aX1hMJxnkhZGRGNP5ep5+X0ucHLq\nBf8OYH1PCd3MzMy6ltvld0nXAE3AaEkrgS8DwwAiYjYwD5gKLAc2A6fmFYuZmdlgkFtSj4gZPSwP\n4My89m9mZjbY+IlyZmZmBeGkbmZmVhBO6mZmZgXhpG5mZlYQTupmZmYF4aRuZmZWEAPiMbFmZjZw\ntKxaRetTT/XZ/to73ghA033L+2yfM8eOpXnvvftsf9VyUjczs5pqfeop2js6aBg5sk/21/DDvkvm\nAO0dHQBO6mZmNjg0jBxJ2yGH1DuMXDTdd1+9Q+iS29TNzMwKwkndzMysIJzUzczMCsJJ3czMrCCc\n1M3MzAoi16Qu6ThJSyUtl3RuheVNktZLak+vC/KMx8zMrMhyu6VN0hDgUuBYYCVwj6S5EfFgWdE7\nIuL4vOIwMzMbLPKsqR8OLI+IRyPiOeBnwAk57s/MzGxQy/PhMxOAFSXTK4EjKpQ7StIi4Ang8xGx\npLyApGagGWDixIk5hGo2ALW0QGtrvaPoXvt3s59NZ9c3jmrMnAnNzfWOwmyX1PuJcvcCEyOiQ9JU\n4JfA5PJCEdECtAA0NjZG34Zo1k+1tkJ7OzQ01DuSLrU1DIBkDtlxBCd1G/DyTOpPAPuWTO+T5r0o\nIjaUvJ8n6TJJoyNibY5xmRVHQwO0tdU7ioGvqaneEZjVRJ5t6vcAkyW9XtIrgOnA3NICksZJUnp/\neIpnXY4xmZmZFVZuNfWI2CbpLODXwBDgiohYIun0tHw28CHgDEnbgC3A9IjoN5fXWxa20Lq4n7dZ\nAu1PZu2WTXP6/6XOmQfPpPkwX+I0M8tDrm3qETEPmFc2b3bJ+1nArDxj2BWti1tpf7KdhnH9t80S\noOHc/p/MAdqfzNotndTNzPJR745y/V7DuAbaTmmrdxiF0DSnqd4hmJkVmh8Ta2ZmVhCuqZtZ/9PX\n9+B33tLWl73gfV+85cA1dTPrfzrvwe8rDQ19e79/e3v/f3CQDUiuqQ9ifd27v7OjXF+2rbu3/QBW\n5HvwfV+85cQ19UGss3d/X2kY19CndxK0P9k+IG5JNDOrFdfUB7ki9+53b3szG2xcUzczMysIJ3Uz\nM7OCcFI3MzMrCCd1MzOzgnBSNzMzKwgndTMzs4LINalLOk7SUknLJZ1bYbkkfT8tXyTp0DzjMTMz\nK7LckrqkIcClwBTgQGCGpAPLik0BJqdXM3B5XvGYmZkVXZ419cOB5RHxaEQ8B/wMOKGszAnAVZG5\nCxglaXyOMZmZmRVWnkl9ArCiZHplmtfbMmZmZlaFAfGYWEnNZJfnATokLe3T/Z+qvtxdn/PnG+BU\n4M9X5M8Ghf98xf50ff759qumUJ5J/Qlg35LpfdK83pYhIlqAlloHaGZmViR5Xn6/B5gs6fWSXgFM\nB+aWlZkLnJx6wb8DWB8Rq3OMyczMrLByq6lHxDZJZwG/BoYAV0TEEkmnp+WzgXnAVGA5sBk4Na94\nzMzMik4RUe8YzMzMrAb8RDkzM7OCcFI3MzMrCCd1MzOzgnBS74KkNknPSupIrz69Nz5PknaX9GNJ\nj0vaKKld0pR6x1UrJb+zztd2SZfUO65akXSWpAWStkqaU+94ak3SayTdIGlTOkdn1jumWpM0XdJD\n6TP+UdLR9Y6pViT9VNKTkjZIWibpE/WOqdYkTU754af1jqXcgHj4TB2dFRE/qncQORhK9iS/dwN/\nJrsD4VpJB0fEY/UMrBYiYmTne0kjgSeBn9cvoppbBfxv4H3AiDrHkodLgeeAsUAD8CtJ90fEkvqG\nVRuSjgW+AZwI/AEo2qOxvw40R8RmSW8G2iTdFxEL6x1YDV1Kdtt2v+Oa+iAUEZsi4sKIeCwiXoiI\nm4A/AYfVO7YcfBB4Grij3oHUSkRcHxG/BNbVO5Zak/RKst/Zv0ZER0TMB/4TOKm+kdXUV4CLIuKu\n9Pf3RES87KFbA1VEPBARmzsn0+sNdQyppiRNB54B/rvesVTipN69/yNpraTfSWqqdzB5kTQWOAAo\nRE2ozMdIgwbVOxCrygHAtohYVjLvfuCgOsVTU2n0ykZgTBpyeqWkWZIKdcVF0mWSNgMPA6vJnkky\n4El6FXAR8Ll6x9IVJ/Wu/QuwP9kAMy3AjZIK822zk6RhwNXAlRHxcL3jqSVJ+5E1MVxZ71isaiOB\nDWXzNgB71iGWPIwFhgEfAo4ma144BDi/nkHVWkR8iux3djRwPbC1vhHVzFeBH
0fEynoH0hUn9S5E\nxN0RsTEitkbElcDvyNqeC0PSbsBPyNovz6pzOHk4CZgfEX+qdyBWtQ7gVWXz9gI21iGWPGxJPy+J\niNURsRb4NgX73wIQEdtT88k+wBn1jmdXSWoAjgG+U+9YuuOOctULCjTokCQBPyarOUyNiOfrHFIe\nTibrtGMDxzJgqKTJEfFImvd2CtI0FBF/lbSS7P/Ji7PrFU8fGUox2tSbgEnAn7N/n4wEhkg6MCIO\nrWNcO3BNvQJJoyS9T9JwSUMlfQT4W+D/1Tu2GroceAvw/ojY0lPhgUbSUWRNJ0Xq9Q5AOieHk42p\nMKTzPK13XLUQEZvILtdeJOmVkt4FfIDsilJR/DvwaUmvk/Rq4LPATXWOqSbSZ5ouaaSkIZLeB8yg\nn3Yq66UWsi8nDek1G/gV2V0o/UYh/hHkYBjZLUNvBraTdfaYVtZ5Z8BKbc2nkbVzPamXxnQ+LSKu\nrltgtfUx4PqIKMpl21LnA18umf4oWY/qC+sSTe19CriC7K6FdcAZRbmdLfkqMJrsqsSzwLXAxXWN\nqHaC7FL7bLJK4+PA2RFRPkLngJN69Hf26kdSB/BsRKypX1Qv5wFdzMzMCsKX383MzArCSd3MzKwg\nnNTNzMwKwkndzMysIJzUzczMCsJJ3czMrCCc1M0KII0Z3y5piaT7JZ2THgPc3TqT+mKsckk/knRg\nD2Wm9VTGzHrmpG5WDFsioiEiDgKOBaaw4wNqKpkE5J7UI+ITEfFgD8WmAU7qZrvISd2sYCLiaaAZ\nOEuZSZLukHRveh2Vin4dODrV8D/bTbkXpTIPS7pa0kOSfiFpj7TsPZLuk7RY0hWSdk/z2yQ1pvcd\nki5OVxPukjQ27ecDwP9NsRThOeFmdeGkblZAEfEo2bPhX0f2uNVj06ATJwLfT8XOBe5INfzvdFOu\n3JuAyyLiLWTDon4qPYt+DnBiRBxM9gjqSiNzvRK4KyLeDtwOfDIifg/MBb6QYvnjLn58s0HLSd2s\n+IYBP5S0mGyAm64uc1dbbkVE/C69/ynwLrJE/6eS8RGuJBsEqdxzvDR4yUKyJgAzqxEP6GJWQJL2\nJxuM6GmytvWnyIYw3Y1sEJFKPltlufIBI3ozgMTz8dKAE9vx/yCzmnJN3axgJI0hGyVrVkqgewGr\nI+IF4CSyy/IAG4E9S1btqly5iZKOTO9nAvOBpcAkSW9M808CbutF2OWxmNlOcFI3K4YRnbe0Af8F\n3EI2HCvAZcDHJN1PNpzwpjR/EbA9dVr7bDflyi0FzpT0EPBq4PKIeBY4Ffh5unz/AtkXi2r9DPhC\n6mjnjnJmO8lDr5pZ1SRNAm6KiLfWORQzq8A1dTMzs4JwTd3MzKwgXFM3MzMrCCd1MzOzgnBSNzMz\nKwgndTMzs4JwUjczMysIJ3UzM7OC+P8o4OY3karuJgAAAABJRU5ErkJggg==\n", 1422 | "text/plain": [ 1423 | "" 1424 | ] 1425 | }, 1426 | "metadata": {}, 1427 | "output_type": "display_data" 1428 | } 1429 | ], 1430 | "source": [ 1431 | "plt.figure(figsize=(8, 3))\n", 1432 | "plt.title('Hierarchical Clustering Dendrogram')\n", 1433 | "plt.xlabel('Data point')\n", 1434 | "plt.ylabel('Distance')\n", 1435 | "dendrogram(Z)\n", 1436 | "plt.axhline(y=1.0, c='k', ls='--', lw=0.5)" 1437 | ] 1438 | }, 1439 | { 1440 | "cell_type": "code", 1441 | "execution_count": 16, 1442 | "metadata": {}, 1443 | "outputs": [ 1444 | { 1445 | "data": { 1446 | "text/html": [ 1447 | "
\n", 1448 | "\n", 1461 | "\n", 1462 | " \n", 1463 | " \n", 1464 | " \n", 1465 | " \n", 1466 | " \n", 1467 | " \n", 1468 | " \n", 1469 | " \n", 1470 | " \n", 1471 | " \n", 1472 | " \n", 1473 | " \n", 1474 | " \n", 1475 | " \n", 1476 | " \n", 1477 | " \n", 1478 | " \n", 1479 | " \n", 1480 | " \n", 1481 | " \n", 1482 | " \n", 1483 | " \n", 1484 | " \n", 1485 | " \n", 1486 | " \n", 1487 | " \n", 1488 | " \n", 1489 | " \n", 1490 | " \n", 1491 | " \n", 1492 | " \n", 1493 | " \n", 1494 | " \n", 1495 | " \n", 1496 | " \n", 1497 | " \n", 1498 | " \n", 1499 | " \n", 1500 | " \n", 1501 | " \n", 1502 | " \n", 1503 | " \n", 1504 | " \n", 1505 | " \n", 1506 | " \n", 1507 | " \n", 1508 | " \n", 1509 | " \n", 1510 | " \n", 1511 | " \n", 1512 | " \n", 1513 | " \n", 1514 | " \n", 1515 | " \n", 1516 | " \n", 1517 | " \n", 1518 | " \n", 1519 | " \n", 1520 | "
DocumentCategoryClusterLabel
0The sky is blue and beautiful.weather2
1Love this blue and beautiful sky!weather2
2The quick brown fox jumps over the lazy dog.animals1
3A king's breakfast has sausages, ham, bacon, eggs, toast and beansfood3
4I love green eggs, ham, sausages and bacon!food3
5The brown fox is quick and the blue dog is lazy!animals1
6The sky is very blue and the sky is very beautiful todayweather2
7The dog is lazy but the brown fox is quick!animals1
\n", 1521 | "
" 1522 | ], 1523 | "text/plain": [ 1524 | " Document \\\n", 1525 | "0 The sky is blue and beautiful. \n", 1526 | "1 Love this blue and beautiful sky! \n", 1527 | "2 The quick brown fox jumps over the lazy dog. \n", 1528 | "3 A king's breakfast has sausages, ham, bacon, eggs, toast and beans \n", 1529 | "4 I love green eggs, ham, sausages and bacon! \n", 1530 | "5 The brown fox is quick and the blue dog is lazy! \n", 1531 | "6 The sky is very blue and the sky is very beautiful today \n", 1532 | "7 The dog is lazy but the brown fox is quick! \n", 1533 | "\n", 1534 | " Category ClusterLabel \n", 1535 | "0 weather 2 \n", 1536 | "1 weather 2 \n", 1537 | "2 animals 1 \n", 1538 | "3 food 3 \n", 1539 | "4 food 3 \n", 1540 | "5 animals 1 \n", 1541 | "6 weather 2 \n", 1542 | "7 animals 1 " 1543 | ] 1544 | }, 1545 | "execution_count": 16, 1546 | "metadata": {}, 1547 | "output_type": "execute_result" 1548 | } 1549 | ], 1550 | "source": [ 1551 | "from scipy.cluster.hierarchy import fcluster\n", 1552 | "max_dist = 1.0\n", 1553 | "\n", 1554 | "cluster_labels = fcluster(Z, max_dist, criterion='distance')\n", 1555 | "cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])\n", 1556 | "pd.concat([corpus_df, cluster_labels], axis=1)" 1557 | ] 1558 | }, 1559 | { 1560 | "cell_type": "markdown", 1561 | "metadata": {}, 1562 | "source": [ 1563 | "# Topic Models" 1564 | ] 1565 | }, 1566 | { 1567 | "cell_type": "code", 1568 | "execution_count": 17, 1569 | "metadata": {}, 1570 | "outputs": [ 1571 | { 1572 | "name": "stderr", 1573 | "output_type": "stream", 1574 | "text": [ 1575 | "C:\\Program Files\\Anaconda3\\lib\\site-packages\\sklearn\\decomposition\\online_lda.py:508: DeprecationWarning: The default value for 'learning_method' will be changed from 'online' to 'batch' in the release 0.20. This warning was introduced in 0.18.\n", 1576 | " DeprecationWarning)\n" 1577 | ] 1578 | }, 1579 | { 1580 | "data": { 1581 | "text/html": [ 1582 | "
\n", 1583 | "\n", 1596 | "\n", 1597 | " \n", 1598 | " \n", 1599 | " \n", 1600 | " \n", 1601 | " \n", 1602 | " \n", 1603 | " \n", 1604 | " \n", 1605 | " \n", 1606 | " \n", 1607 | " \n", 1608 | " \n", 1609 | " \n", 1610 | " \n", 1611 | " \n", 1612 | " \n", 1613 | " \n", 1614 | " \n", 1615 | " \n", 1616 | " \n", 1617 | " \n", 1618 | " \n", 1619 | " \n", 1620 | " \n", 1621 | " \n", 1622 | " \n", 1623 | " \n", 1624 | " \n", 1625 | " \n", 1626 | " \n", 1627 | " \n", 1628 | " \n", 1629 | " \n", 1630 | " \n", 1631 | " \n", 1632 | " \n", 1633 | " \n", 1634 | " \n", 1635 | " \n", 1636 | " \n", 1637 | " \n", 1638 | " \n", 1639 | " \n", 1640 | " \n", 1641 | " \n", 1642 | " \n", 1643 | " \n", 1644 | " \n", 1645 | " \n", 1646 | " \n", 1647 | " \n", 1648 | " \n", 1649 | " \n", 1650 | " \n", 1651 | " \n", 1652 | " \n", 1653 | " \n", 1654 | " \n", 1655 | "
T1T2T3
00.8321910.0834800.084329
10.8635540.0691000.067346
20.0477940.0477760.904430
30.0372430.9255590.037198
40.0491210.9030760.047802
50.0549010.0477780.897321
60.8882870.0556970.056016
70.0557040.0556890.888607
\n", 1656 | "
" 1657 | ], 1658 | "text/plain": [ 1659 | " T1 T2 T3\n", 1660 | "0 0.832191 0.083480 0.084329\n", 1661 | "1 0.863554 0.069100 0.067346\n", 1662 | "2 0.047794 0.047776 0.904430\n", 1663 | "3 0.037243 0.925559 0.037198\n", 1664 | "4 0.049121 0.903076 0.047802\n", 1665 | "5 0.054901 0.047778 0.897321\n", 1666 | "6 0.888287 0.055697 0.056016\n", 1667 | "7 0.055704 0.055689 0.888607" 1668 | ] 1669 | }, 1670 | "execution_count": 17, 1671 | "metadata": {}, 1672 | "output_type": "execute_result" 1673 | } 1674 | ], 1675 | "source": [ 1676 | "from sklearn.decomposition import LatentDirichletAllocation\n", 1677 | "\n", 1678 | "lda = LatentDirichletAllocation(n_topics=3, max_iter=10000, random_state=0)\n", 1679 | "dt_matrix = lda.fit_transform(cv_matrix)\n", 1680 | "features = pd.DataFrame(dt_matrix, columns=['T1', 'T2', 'T3'])\n", 1681 | "features" 1682 | ] 1683 | }, 1684 | { 1685 | "cell_type": "markdown", 1686 | "metadata": {}, 1687 | "source": [ 1688 | "## Show topics and their weights" 1689 | ] 1690 | }, 1691 | { 1692 | "cell_type": "code", 1693 | "execution_count": 18, 1694 | "metadata": {}, 1695 | "outputs": [ 1696 | { 1697 | "name": "stdout", 1698 | "output_type": "stream", 1699 | "text": [ 1700 | "[('sky', 4.3324395825632624), ('blue', 3.373753174831771), ('beautiful', 3.3323652405224857), ('today', 1.3325579841038182), ('love', 1.3304224288080069)]\n", 1701 | "\n", 1702 | "[('bacon', 2.332695948479998), ('eggs', 2.332695948479998), ('ham', 2.332695948479998), ('sausages', 2.332695948479998), ('love', 1.335454457601996), ('beans', 1.332773525378464), ('breakfast', 1.332773525378464), ('kings', 1.332773525378464), ('toast', 1.332773525378464), ('green', 1.3325433207547732)]\n", 1703 | "\n", 1704 | "[('brown', 3.3323474595768783), ('dog', 3.3323474595768783), ('fox', 3.3323474595768783), ('lazy', 3.3323474595768783), ('quick', 3.3323474595768783), ('jumps', 1.3324193736202712), ('blue', 1.2919635624485213)]\n", 1705 | "\n" 1706 | ] 1707 | } 1708 | ], 1709 | "source": [ 1710 | "tt_matrix = lda.components_\n", 1711 | "for topic_weights in tt_matrix:\n", 1712 | " topic = [(token, weight) for token, weight in zip(vocab, topic_weights)]\n", 1713 | " topic = sorted(topic, key=lambda x: -x[1])\n", 1714 | " topic = [item for item in topic if item[1] > 0.6]\n", 1715 | " print(topic)\n", 1716 | " print()\n" 1717 | ] 1718 | }, 1719 | { 1720 | "cell_type": "markdown", 1721 | "metadata": {}, 1722 | "source": [ 1723 | "## Clustering documents using topic model features" 1724 | ] 1725 | }, 1726 | { 1727 | "cell_type": "code", 1728 | "execution_count": 19, 1729 | "metadata": {}, 1730 | "outputs": [ 1731 | { 1732 | "data": { 1733 | "text/html": [ 1734 | "
\n", 1735 | "\n", 1748 | "\n", 1749 | " \n", 1750 | " \n", 1751 | " \n", 1752 | " \n", 1753 | " \n", 1754 | " \n", 1755 | " \n", 1756 | " \n", 1757 | " \n", 1758 | " \n", 1759 | " \n", 1760 | " \n", 1761 | " \n", 1762 | " \n", 1763 | " \n", 1764 | " \n", 1765 | " \n", 1766 | " \n", 1767 | " \n", 1768 | " \n", 1769 | " \n", 1770 | " \n", 1771 | " \n", 1772 | " \n", 1773 | " \n", 1774 | " \n", 1775 | " \n", 1776 | " \n", 1777 | " \n", 1778 | " \n", 1779 | " \n", 1780 | " \n", 1781 | " \n", 1782 | " \n", 1783 | " \n", 1784 | " \n", 1785 | " \n", 1786 | " \n", 1787 | " \n", 1788 | " \n", 1789 | " \n", 1790 | " \n", 1791 | " \n", 1792 | " \n", 1793 | " \n", 1794 | " \n", 1795 | " \n", 1796 | " \n", 1797 | " \n", 1798 | " \n", 1799 | " \n", 1800 | " \n", 1801 | " \n", 1802 | " \n", 1803 | " \n", 1804 | " \n", 1805 | " \n", 1806 | " \n", 1807 | "
DocumentCategoryClusterLabel
0The sky is blue and beautiful.weather2
1Love this blue and beautiful sky!weather2
2The quick brown fox jumps over the lazy dog.animals1
3A king's breakfast has sausages, ham, bacon, eggs, toast and beansfood0
4I love green eggs, ham, sausages and bacon!food0
5The brown fox is quick and the blue dog is lazy!animals1
6The sky is very blue and the sky is very beautiful todayweather2
7The dog is lazy but the brown fox is quick!animals1
\n", 1808 | "
" 1809 | ], 1810 | "text/plain": [ 1811 | " Document \\\n", 1812 | "0 The sky is blue and beautiful. \n", 1813 | "1 Love this blue and beautiful sky! \n", 1814 | "2 The quick brown fox jumps over the lazy dog. \n", 1815 | "3 A king's breakfast has sausages, ham, bacon, eggs, toast and beans \n", 1816 | "4 I love green eggs, ham, sausages and bacon! \n", 1817 | "5 The brown fox is quick and the blue dog is lazy! \n", 1818 | "6 The sky is very blue and the sky is very beautiful today \n", 1819 | "7 The dog is lazy but the brown fox is quick! \n", 1820 | "\n", 1821 | " Category ClusterLabel \n", 1822 | "0 weather 2 \n", 1823 | "1 weather 2 \n", 1824 | "2 animals 1 \n", 1825 | "3 food 0 \n", 1826 | "4 food 0 \n", 1827 | "5 animals 1 \n", 1828 | "6 weather 2 \n", 1829 | "7 animals 1 " 1830 | ] 1831 | }, 1832 | "execution_count": 19, 1833 | "metadata": {}, 1834 | "output_type": "execute_result" 1835 | } 1836 | ], 1837 | "source": [ 1838 | "from sklearn.cluster import KMeans\n", 1839 | "\n", 1840 | "km = KMeans(n_clusters=3, random_state=0)\n", 1841 | "km.fit_transform(features)\n", 1842 | "cluster_labels = km.labels_\n", 1843 | "cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])\n", 1844 | "pd.concat([corpus_df, cluster_labels], axis=1)" 1845 | ] 1846 | } 1847 | ], 1848 | "metadata": { 1849 | "kernelspec": { 1850 | "display_name": "Python 3", 1851 | "language": "python", 1852 | "name": "python3" 1853 | }, 1854 | "language_info": { 1855 | "codemirror_mode": { 1856 | "name": "ipython", 1857 | "version": 3 1858 | }, 1859 | "file_extension": ".py", 1860 | "mimetype": "text/x-python", 1861 | "name": "python", 1862 | "nbconvert_exporter": "python", 1863 | "pygments_lexer": "ipython3", 1864 | "version": "3.7.0" 1865 | } 1866 | }, 1867 | "nbformat": 4, 1868 | "nbformat_minor": 2 1869 | } 1870 | -------------------------------------------------------------------------------- /[Introduction] - Big tutorial - Text Classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Text Classification" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "All models : https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/\n", 15 | "\n", 16 | "CNN Text Classification\n", 17 | "https://github.com/cmasch/cnn-text-classification/blob/master/Evaluation.ipynb\n", 18 | "\n", 19 | "CNN Multichannel Text Classification + Hierarchical attention + ...\n", 20 | "https://github.com/gaurav104/TextClassification/blob/master/CNN%20Multichannel%20Text%20Classification.ipynb\n", 21 | "\n", 22 | "Notes for Deep Learning\n", 23 | "https://arxiv.org/pdf/1808.09772.pdf\n", 24 | "\n", 25 | "Doc classification with NLP\n", 26 | "https://github.com/mdh266/DocumentClassificationNLP/blob/master/NLP.ipynb\n", 27 | "\n", 28 | "Paragraph Topic Classification\n", 29 | "http://cs229.stanford.edu/proj2016/report/NhoNg-ParagraphTopicClassification-report.pdf\n", 30 | "\n", 31 | "1D convolutional neural networks for NLP\n", 32 | "https://github.com/Tixierae/deep_learning_NLP/blob/master/cnn_imdb.ipynb\n", 33 | "\n", 34 | "Hierarchical Attention for text classification\n", 35 | "https://github.com/Tixierae/deep_learning_NLP/blob/master/HAN/HAN_final.ipynb\n", 36 | "\n", 37 | "Multi-class classification scikit learn (Random forest, SVM, logistic regression)\n", 38 | 
"https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f\n", 39 | "https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Consumer_complaints.ipynb\n", 40 | "\n", 41 | "Text feature extraction TFIDF mathematics\n", 42 | "https://dzone.com/articles/machine-learning-text-feature-0\n", 43 | "\n", 44 | "Classification Yelp Reviews (AWS)\n", 45 | "http://www.developintelligence.com/blog/2017/06/practical-neural-networks-keras-classifying-yelp-reviews/\n", 46 | "\n", 47 | "Convolutional Neural Networks for Text Classification (waouuuuu)\n", 48 | "http://www.davidsbatista.net/blog/2018/03/31/SentenceClassificationConvNets/\n", 49 | "https://github.com/davidsbatista/ConvNets-for-sentence-classification\n", 50 | "\n", 51 | "\n", 52 | "**3 ways to interpretate your NLP model** [Lime, ELI5, Skater]\n", 53 | "https://github.com/makcedward/nlp/blob/master/sample/nlp-model_interpretation.ipynb\n", 54 | "https://towardsdatascience.com/3-ways-to-interpretate-your-nlp-model-to-management-and-customer-5428bc07ce15\n", 55 | "https://medium.freecodecamp.org/how-to-improve-your-machine-learning-models-by-explaining-predictions-with-lime-7493e1d78375\n", 56 | "\n", 57 | "Deep Learning for text made easy with AllenNLP\n", 58 | "https://medium.com/swlh/deep-learning-for-text-made-easy-with-allennlp-62bc79d41f31\n", 59 | "\n", 60 | "Ensemble Classifiers\n", 61 | "https://www.learndatasci.com/tutorials/predicting-reddit-news-sentiment-naive-bayes-text-classifiers/" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 1, 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "name": "stderr", 71 | "output_type": "stream", 72 | "text": [ 73 | "C:\\Users\\adsieg\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\sklearn\\ensemble\\weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.\n", 74 | " from numpy.core.umath_tests import inner1d\n" 75 | ] 76 | } 77 | ], 78 | "source": [ 79 | "from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm\n", 80 | "from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\n", 81 | "from sklearn import decomposition, ensemble\n", 82 | "\n", 83 | "import pandas, xgboost, numpy, textblob, string\n", 84 | "#from keras.preprocessing import text, sequence\n", 85 | "#from keras import layers, models, optimizers" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "# DATA LOADING " 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "### A. 
FIRST dataset: Consumer Reviews [Amazon] [2 labels / binary classification]" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "# load the dataset\n", 109 | "data = open('corpus.txt', encoding=\"utf8\").read()\n", 110 | "labels, texts = [], []\n", 111 | "for i, line in enumerate(data.split(\"\\n\")):\n", 112 | " content = line.split()\n", 113 | " labels.append(content[0])\n", 114 | " texts.append(\" \".join(content[1:]))\n", 115 | "\n", 116 | "# create a dataframe using texts and lables\n", 117 | "trainDF = pandas.DataFrame()\n", 118 | "trainDF['text'] = texts\n", 119 | "trainDF['label'] = labels" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "trainDF.head()" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "trainDF.shape" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "# split the dataset into training and validation datasets \n", 147 | "train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])\n", 148 | "\n", 149 | "# label encode the target variable \n", 150 | "encoder = preprocessing.LabelEncoder()\n", 151 | "train_y = encoder.fit_transform(train_y)\n", 152 | "valid_y = encoder.fit_transform(valid_y)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "train_x.shape" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "### B. Second dataset: Consumer Complaints [Banking industry] [multi-classification]" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 1, 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [ 177 | "import pandas as pd\n", 178 | "df = pd.read_csv('C:/Users/adsieg/Desktop/link_news/ML/Consumer_Complaints.csv')\n", 179 | "df.head()\n", 180 | "\n", 181 | "df = df[pd.notnull(df['Consumer complaint narrative'])]\n", 182 | "\n", 183 | "col = ['Product', 'Consumer complaint narrative']\n", 184 | "df = df[col]\n", 185 | "df.columns = ['Product', 'Consumer_complaint_narrative']\n", 186 | "\n", 187 | "df['category_id'] = df['Product'].factorize()[0]\n", 188 | "from io import StringIO\n", 189 | "category_id_df = df[['Product', 'category_id']].drop_duplicates().sort_values('category_id')\n", 190 | "category_to_id = dict(category_id_df.values)\n", 191 | "id_to_category = dict(category_id_df[['category_id', 'Product']].values)" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [ 200 | "id_to_category" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 2, 206 | "metadata": {}, 207 | "outputs": [ 208 | { 209 | "data": { 210 | "text/html": [ 211 | "
\n", 212 | "\n", 225 | "\n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | "
ProductConsumer_complaint_narrativecategory_id
1Credit reportingI have outdated information on my credit repor...0
2Consumer LoanI purchased a new car on XXXX XXXX. The car de...1
7Credit reportingAn account on my credit report has a mistaken ...0
12Debt collectionThis company refuses to provide me verificatio...2
16Debt collectionThis complaint is in regards to Square Two Fin...2
\n", 267 | "
" 268 | ], 269 | "text/plain": [ 270 | " Product Consumer_complaint_narrative \\\n", 271 | "1 Credit reporting I have outdated information on my credit repor... \n", 272 | "2 Consumer Loan I purchased a new car on XXXX XXXX. The car de... \n", 273 | "7 Credit reporting An account on my credit report has a mistaken ... \n", 274 | "12 Debt collection This company refuses to provide me verificatio... \n", 275 | "16 Debt collection This complaint is in regards to Square Two Fin... \n", 276 | "\n", 277 | " category_id \n", 278 | "1 0 \n", 279 | "2 1 \n", 280 | "7 0 \n", 281 | "12 2 \n", 282 | "16 2 " 283 | ] 284 | }, 285 | "execution_count": 2, 286 | "metadata": {}, 287 | "output_type": "execute_result" 288 | } 289 | ], 290 | "source": [ 291 | "# We take only 10,000 customer complaints to speed up algorithms\n", 292 | "df = df[:10000]\n", 293 | "\n", 294 | "df.head(5)" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 7, 300 | "metadata": {}, 301 | "outputs": [ 302 | { 303 | "data": { 304 | "text/plain": [ 305 | "'An account on my credit report has a mistaken date. I mailed in a debt validation letter to allow XXXX to correct the information. I received a letter in the mail, stating that Experian received my correspondence and found it to be \" suspicious \\'\\' and that \" I did n\\'t write it \\'\\'. Experian \\'s letter is worded to imply that I am incapable of writing my own letter. I was deeply offended by this implication. \\nI called Experian to figure out why my letter was so suspicious. I spoke to a representative who was incredibly unhelpful, She did not effectively answer any questions I asked of her, and she kept ignoring what I was saying regarding the offensive letter and my dispute process. I feel the representative did what she wanted to do, and I am not satisfied. It is STILL not clear to me why I received this letter. I typed this letter, I signed this letter, and I paid to mail this letter, yet Experian willfully disregarded my lawful request. \\nI am disgusted with this entire situation, and I would like for my dispute to be handled appropriately, and I would like for an Experian representative to contact me and give me a real explanation for this letter.'" 306 | ] 307 | }, 308 | "execution_count": 7, 309 | "metadata": {}, 310 | "output_type": "execute_result" 311 | } 312 | ], 313 | "source": [ 314 | "df['Consumer_complaint_narrative'].iloc[2]" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "metadata": {}, 320 | "source": [ 321 | "#### Imbalanced dataset" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": {}, 327 | "source": [ 328 | "Conventional algorithms are often **biased towards the majority class**, not taking the data distribution into consideration. In the worst case, **minority classes are treated as outliers and ignored**. 
\n", 329 | "\n", 330 | "For some cases, such as fraud detection or cancer prediction, we would need to carefully configure our model or artificially balance the dataset, for example by **undersampling or oversampling each class.**" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": null, 336 | "metadata": {}, 337 | "outputs": [], 338 | "source": [ 339 | "import matplotlib.pyplot as plt\n", 340 | "%matplotlib inline\n", 341 | "fig = plt.figure(figsize=(8,6))\n", 342 | "df.groupby('Product').Consumer_complaint_narrative.count().plot.bar(ylim=0)\n", 343 | "plt.show()" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "# DATA CLEANING" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "### A. --- A quick and easy function to clean my text" 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": null, 363 | "metadata": {}, 364 | "outputs": [], 365 | "source": [ 366 | "import re\n", 367 | "from nltk.corpus import stopwords\n", 368 | "import pandas as pd\n", 369 | "from nltk.stem import PorterStemmer\n", 370 | "from nltk.tokenize import sent_tokenize, word_tokenize\n", 371 | "\n", 372 | "def preprocess(raw_text):\n", 373 | "\n", 374 | " # keep only words\n", 375 | " letters_only_text = re.sub(\"[^a-zA-Z]\", \" \", raw_text)\n", 376 | "\n", 377 | " # convert to lower case and split \n", 378 | " words = letters_only_text.lower().split()\n", 379 | "\n", 380 | " # remove stopwords\n", 381 | " stopword_set = set(stopwords.words(\"english\"))\n", 382 | " meaningful_words = [w for w in words if w not in stopword_set]\n", 383 | " \n", 384 | " #stemmed words\n", 385 | " ps = PorterStemmer()\n", 386 | " stemmed_words = [ps.stem(word) for word in meaningful_words]\n", 387 | " \n", 388 | " #join the cleaned words in a list\n", 389 | " cleaned_word_list = \" \".join(stemmed_words)\n", 390 | "\n", 391 | " return cleaned_word_list" 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": null, 397 | "metadata": {}, 398 | "outputs": [], 399 | "source": [ 400 | "df['Consumer_complaint_narrative'] = df['Consumer_complaint_narrative'].apply(lambda line : preprocess(line))" 401 | ] 402 | }, 403 | { 404 | "cell_type": "markdown", 405 | "metadata": {}, 406 | "source": [ 407 | "### B. 
How to decline all ways of a given " 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": null, 413 | "metadata": {}, 414 | "outputs": [], 415 | "source": [ 416 | "from nltk.corpus import stopwords\n", 417 | "from nltk.stem import PorterStemmer\n", 418 | "stemmer = PorterStemmer()\n", 419 | "\n", 420 | "def split_dataset_into_words(dataset):\n", 421 | " datawords = dataset.apply(lambda x: x.split())\n", 422 | " return list(datawords)\n", 423 | "\n", 424 | "# my_list = all_incidents \n", 425 | "# dictionnary\n", 426 | "def buffer_stemmisation_keywords(my_list):\n", 427 | " my_list = [item for sublist in my_list for item in sublist]\n", 428 | " aux = pd.DataFrame(my_list, columns =['word'] )\n", 429 | " aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x))\n", 430 | " aux = aux.groupby('word_stemmed').transform(lambda x: ', '.join(x))\n", 431 | " aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x.split(',')[0]))\n", 432 | " aux.index = aux['word_stemmed']\n", 433 | " del aux['word_stemmed']\n", 434 | " my_dict = aux.to_dict('dict')['word']\n", 435 | " return my_dict" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": null, 441 | "metadata": {}, 442 | "outputs": [], 443 | "source": [ 444 | "dictionnary_all_words_unstemmed = buffer_stemmisation_keywords(split_dataset_into_words(df['Consumer_complaint_narrative']))\n", 445 | "\n", 446 | "# Dictionnary de-duplicated\n", 447 | "for key, value in dictionnary_all_words_unstemmed.items():\n", 448 | " new_value = value.replace(\",\", \"\")\n", 449 | " new_value = list(set(value.split()))\n", 450 | " new_value = list(set(map(lambda each:each.strip(\",\"), new_value)))\n", 451 | " dictionnary_all_words_unstemmed[key]=new_value" 452 | ] 453 | }, 454 | { 455 | "cell_type": "code", 456 | "execution_count": null, 457 | "metadata": {}, 458 | "outputs": [], 459 | "source": [ 460 | "dictionnary_all_words_unstemmed" 461 | ] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": {}, 466 | "source": [ 467 | "# FEATURE ENGINEERING" 468 | ] 469 | }, 470 | { 471 | "cell_type": "markdown", 472 | "metadata": {}, 473 | "source": [ 474 | "2.1 Count Vectors as features\n", 475 | "\n", 476 | "2.2 TF-IDF Vectors as features\n", 477 | "- --- Word level\n", 478 | "- --- N-Gram level\n", 479 | "- --- Character level\n", 480 | "\n", 481 | "2.3 Word Embeddings as features\n", 482 | "\n", 483 | "2.4 Text / NLP based features\n", 484 | "\n", 485 | "2.5 Topic Models as features" 486 | ] 487 | }, 488 | { 489 | "cell_type": "markdown", 490 | "metadata": {}, 491 | "source": [ 492 | "The different types of **word embeddings** can be broadly classified into two categories-\n", 493 | "\n", 494 | "- **Frequency based Embedding**\n", 495 | " - Count Vector\n", 496 | " - TF-IDF Vector\n", 497 | " - Co-Occurrence Matrix with a fixed context window (with SVD)\n", 498 | "- **Prediction based Embedding**\n", 499 | " - CBOW (Continuous Bag of words)\n", 500 | " - Skip – Gram model" 501 | ] 502 | }, 503 | { 504 | "cell_type": "markdown", 505 | "metadata": {}, 506 | "source": [ 507 | "### 2.1 Count Vectors as features" 508 | ] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "metadata": {}, 513 | "source": [ 514 | "Count Vector is a matrix notation of the dataset in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell represents the frequency count of a particular term in a particular document." 
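As a toy illustration of that definition (the two sentences and variable names below are made up for this example, not part of the tutorial's data), a two-document corpus yields a document-term matrix like this:

```python
# Minimal sketch of a count vector: rows = documents, columns = terms, cells = raw counts.
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the dog chased the cat", "the cat slept"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_docs)   # sparse document-term matrix

print(sorted(vectorizer.vocabulary_))         # ['cat', 'chased', 'dog', 'slept', 'the']
print(counts.toarray())
# [[1 1 1 0 2]    <- "the dog chased the cat"
#  [1 0 0 1 1]]   <- "the cat slept"
```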
515 | ] 516 | }, 517 | { 518 | "cell_type": "code", 519 | "execution_count": null, 520 | "metadata": {}, 521 | "outputs": [], 522 | "source": [ 523 | "# split the dataset into training and validation datasets \n", 524 | "train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['Consumer_complaint_narrative'], df['Product'])\n", 525 | "\n", 526 | "# label encode the target variable \n", 527 | "encoder = preprocessing.LabelEncoder()\n", 528 | "train_y = encoder.fit_transform(train_y)\n", 529 | "valid_y = encoder.fit_transform(valid_y)" 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": null, 535 | "metadata": {}, 536 | "outputs": [], 537 | "source": [ 538 | "# create a count vectorizer object \n", 539 | "count_vect = CountVectorizer(analyzer='word', token_pattern=r'\\w{1,}')\n", 540 | "count_vect.fit(df['Consumer_complaint_narrative'])\n", 541 | "\n", 542 | "# transform the training and validation data using count vectorizer object\n", 543 | "xtrain_count = count_vect.transform(train_x)\n", 544 | "xvalid_count = count_vect.transform(valid_x)" 545 | ] 546 | }, 547 | { 548 | "cell_type": "markdown", 549 | "metadata": {}, 550 | "source": [ 551 | "### 2.2 TF-IDF Vectors as features" 552 | ] 553 | }, 554 | { 555 | "cell_type": "markdown", 556 | "metadata": {}, 557 | "source": [ 558 | "a. **Word Level TF-IDF**: Matrix representing tf-idf scores of every term in different documents\n", 559 | "\n", 560 | "b. **N-gram Level TF-IDF**: N-grams are the combination of N terms together. This Matrix representing tf-idf scores of N-grams\n", 561 | "\n", 562 | "c. **Character Level TF-IDF**: Matrix representing tf-idf scores of character level n-grams in the corpus" 563 | ] 564 | }, 565 | { 566 | "cell_type": "markdown", 567 | "metadata": {}, 568 | "source": [ 569 | "Most often **term-frequency** alone is **not** a good measure of the **importance of a word/term to a document's topic**. Very common words like \"the\", \"a\", \"to\" are almost always the terms with the **highest frequency in the text**. Thus, having a high raw count of the number of times a term appears in a document does not necessarily mean that the corresponding word is more important. Furtermore, longer documents could have high frequency of terms that do not correlate with the document topic, but instead occur with high numbers solely due to the length of the document.\n", 570 | "\n", 571 | "To circumvent the limination of term-frequency, we often normalize it by the **inverse document frequency (idf)**. This results in the **term frequency-inverse document frequency (tf-idf)** matrix. The **inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents in the corpus**. We can give a formal defintion of the inverse-document-frequency by letting $\\mathcal{D}$ be the corpus or the set of all documents and $N$ is the number of documents in the corpus and $N_{t,D}$ be the number of documents that contain the term $t$ then, \n", 572 | "\n", 573 | "$$idf(t,\\mathcal{D}) \\, = \\, \\log\\left(\\frac{N_{\\mathcal{D}}}{1 + N_{t,\\mathcal{D}}}\\right) \\, = \\, - \\log\\left(\\frac{1 + N_{t,\\mathcal{D}}}{N_{\\mathcal{D}}}\\right) $$\n", 574 | "\n", 575 | "The reason for the presence of the $1$ is for smoothing. Without it, if the term/word did not appear in any training documents, then its inverse-document-frequency would be $idf(t,\\mathcal{D}) = \\infty$. 
However, with the presense of the $1$ it will now have $idf(t,\\mathcal{D}) = 0$.\n", 576 | "\n", 577 | "\n", 578 | "Now we can formally defined the term frequnecy-inverse document frequency as a normalized version of term-frequency,\n", 579 | "\n", 580 | "\n", 581 | "$$\\text{tf-idf}(t,d) \\, = \\, tf(t,d) \\cdot idf(t,\\mathcal{D}) $$\n", 582 | "\n", 583 | "Like the term-frequency, the term frequency-inverse document frequency is a sparse matrix, where again, each row is a document in our training corpus ($\\mathcal{D}$) and each column corresponds to a term/word in the bag-of-words list.\n", 584 | "_________________________________________\n", 585 | "**EXAMPLE:**\n", 586 | "\n", 587 | "**from** sklearn.feature_extraction.text **import** TfidfVectorizer\n", 588 | "\n", 589 | "tfidf = TfidfVectorizer(**sublinear_tf**=True, **min_df**=5, **norm**='l2', **encoding**='latin-1', **ngram_range**=(1, 2), **stop_words**='english')\n", 590 | "\n", 591 | "features = tfidf.fit_transform(df['text']).toarray()\n", 592 | "labels = df.category_id\n", 593 | "_________________________________________\n", 594 | "\n", 595 | "- **sublinear_df** is set to True to use a logarithmic form for frequency.\n", 596 | "- **min_df** is the minimum numbers of documents a word must be present in to be kept.\n", 597 | "- **norm** is set to l2, to ensure all our feature vectors have a euclidian norm of 1.\n", 598 | "- **ngram_range** is set to (1, 2) to indicate that we want to consider both unigrams and bigrams.\n", 599 | "- **stop_words** is set to \"english\" to remove all common pronouns (\"a\", \"the\", ...) to reduce the number of noisy features." 600 | ] 601 | }, 602 | { 603 | "cell_type": "code", 604 | "execution_count": null, 605 | "metadata": {}, 606 | "outputs": [], 607 | "source": [ 608 | "# word level tf-idf\n", 609 | "tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\\w{1,}', max_features=5000)\n", 610 | "tfidf_vect.fit(trainDF['text'])\n", 611 | "xtrain_tfidf = tfidf_vect.transform(train_x)\n", 612 | "xvalid_tfidf = tfidf_vect.transform(valid_x)\n", 613 | "\n", 614 | "# ngram level tf-idf \n", 615 | "tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\\w{1,}', ngram_range=(2,3), max_features=5000)\n", 616 | "tfidf_vect_ngram.fit(trainDF['text'])\n", 617 | "xtrain_tfidf_ngram = tfidf_vect_ngram.transform(train_x)\n", 618 | "xvalid_tfidf_ngram = tfidf_vect_ngram.transform(valid_x)\n", 619 | "\n", 620 | "# characters level tf-idf\n", 621 | "tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', token_pattern=r'\\w{1,}', ngram_range=(2,3), max_features=5000)\n", 622 | "tfidf_vect_ngram_chars.fit(trainDF['text'])\n", 623 | "xtrain_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(train_x) \n", 624 | "xvalid_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(valid_x) " 625 | ] 626 | }, 627 | { 628 | "cell_type": "markdown", 629 | "metadata": {}, 630 | "source": [ 631 | "The term-frequency is a **sparse matrix** where **each row is a document in our training corpus** ($\\mathcal{D}$) and each **column corresponds to a term/word in the bag-of-words list**" 632 | ] 633 | }, 634 | { 635 | "cell_type": "code", 636 | "execution_count": null, 637 | "metadata": {}, 638 | "outputs": [], 639 | "source": [ 640 | "print('- Size of the matrix is', xvalid_tfidf.shape, 'as we passed 5,000 words and we have 1725 customer comments')" 641 | ] 642 | }, 643 | { 644 | "cell_type": "code", 645 | "execution_count": null, 646 | "metadata": {}, 647 | "outputs": [], 648 | "source": [ 649 | 
"print('Here is my bag of words:', tfidf_vect.get_feature_names())" 650 | ] 651 | }, 652 | { 653 | "cell_type": "code", 654 | "execution_count": null, 655 | "metadata": {}, 656 | "outputs": [], 657 | "source": [ 658 | "print('Size of my bag of words:', len(tfidf_vect.get_feature_names()))" 659 | ] 660 | }, 661 | { 662 | "cell_type": "markdown", 663 | "metadata": {}, 664 | "source": [ 665 | "#### Other implementation" 666 | ] 667 | }, 668 | { 669 | "cell_type": "code", 670 | "execution_count": null, 671 | "metadata": {}, 672 | "outputs": [], 673 | "source": [ 674 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 675 | "\n", 676 | "tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')\n", 677 | "\n", 678 | "features = tfidf.fit_transform(df.Consumer_complaint_narrative).toarray()\n", 679 | "labels = df.category_id\n", 680 | "\n", 681 | "print('- Size of the matrix is', features.shape)\n", 682 | "print('- Each of', features.shape[0], 'consumer complaint narratives is represented by', features.shape[1], 'features, representing the tf-idf score for different unigrams and bigrams.')\n", 683 | "print('-', features.shape[0], 'is the # of document / complaint and', features.shape[1], 'is my bag of words containing unigram and bigram')" 684 | ] 685 | }, 686 | { 687 | "cell_type": "markdown", 688 | "metadata": {}, 689 | "source": [ 690 | "The term-frequency is a **sparse matrix** where **each row is a document in our training corpus** ($\\mathcal{D}$) and each **column corresponds to a term/word in the bag-of-words list**\n", 691 | "\n", 692 | "- **sklearn.feature_selection.chi2** to find the terms that are the most correlated with each of the products" 693 | ] 694 | }, 695 | { 696 | "cell_type": "code", 697 | "execution_count": null, 698 | "metadata": {}, 699 | "outputs": [], 700 | "source": [ 701 | "from sklearn.feature_selection import chi2\n", 702 | "import numpy as np\n", 703 | "\n", 704 | "N = 2\n", 705 | "for Product, category_id in sorted(category_to_id.items()):\n", 706 | " features_chi2 = chi2(features, labels == category_id)\n", 707 | " indices = np.argsort(features_chi2[0])\n", 708 | " feature_names = np.array(tfidf.get_feature_names())[indices]\n", 709 | " unigrams = [v for v in feature_names if len(v.split(' ')) == 1]\n", 710 | " bigrams = [v for v in feature_names if len(v.split(' ')) == 2]\n", 711 | " print(\"# '{}':\".format(Product))\n", 712 | " print(\" . Most correlated unigrams:\\n . {}\".format('\\n . '.join(unigrams[-N:])))\n", 713 | " print(\" . Most correlated bigrams:\\n . {}\".format('\\n . '.join(bigrams[-N:])))" 714 | ] 715 | }, 716 | { 717 | "cell_type": "markdown", 718 | "metadata": {}, 719 | "source": [ 720 | "### 2.3 Word Embeddings" 721 | ] 722 | }, 723 | { 724 | "cell_type": "markdown", 725 | "metadata": {}, 726 | "source": [ 727 | "A word embedding is a **form of representing words and documents** using a **dense vector representation**. The position of a word within the vector space is learned from text and is based on the **words that surround the word** when it is used. Word embeddings can be trained using the input corpus **itself** or can **be generated using pre-trained word embeddings** such as **Glove**, **FastText**, and **Word2Vec**. Any one of them can be downloaded and **used as transfer learning**. 
\n", 728 | "\n", 729 | "Four essential steps:\n", 730 | "- Loading the pretrained word embeddings\n", 731 | "- Creating a tokenizer object\n", 732 | "- Transforming text documents to sequence of tokens and pad them\n", 733 | "- Create a mapping of token and their respective embeddings" 734 | ] 735 | }, 736 | { 737 | "cell_type": "markdown", 738 | "metadata": {}, 739 | "source": [ 740 | "https://www.slideshare.net/PyData/sujit-pal-applying-the-fourstep-embed-encode-attend-predict-framework-to-predict-document-similarity" 741 | ] 742 | }, 743 | { 744 | "cell_type": "code", 745 | "execution_count": null, 746 | "metadata": {}, 747 | "outputs": [], 748 | "source": [ 749 | "# load the pre-trained word-embedding vectors \n", 750 | "embeddings_index = {}\n", 751 | "for i, line in enumerate(open('data/wiki-news-300d-1M.vec')):\n", 752 | " values = line.split()\n", 753 | " embeddings_index[values[0]] = numpy.asarray(values[1:], dtype='float32')\n", 754 | "\n", 755 | "# create a tokenizer \n", 756 | "token = text.Tokenizer()\n", 757 | "token.fit_on_texts(trainDF['text'])\n", 758 | "word_index = token.word_index\n", 759 | "\n", 760 | "# convert text to sequence of tokens and pad them to ensure equal length vectors \n", 761 | "train_seq_x = sequence.pad_sequences(token.texts_to_sequences(train_x), maxlen=70)\n", 762 | "valid_seq_x = sequence.pad_sequences(token.texts_to_sequences(valid_x), maxlen=70)\n", 763 | "\n", 764 | "# create token-embedding mapping\n", 765 | "embedding_matrix = numpy.zeros((len(word_index) + 1, 300))\n", 766 | "for word, i in word_index.items():\n", 767 | " embedding_vector = embeddings_index.get(word)\n", 768 | " if embedding_vector is not None:\n", 769 | " embedding_matrix[i] = embedding_vector" 770 | ] 771 | }, 772 | { 773 | "cell_type": "markdown", 774 | "metadata": {}, 775 | "source": [ 776 | "### 2.4 Text / NLP based features" 777 | ] 778 | }, 779 | { 780 | "cell_type": "markdown", 781 | "metadata": {}, 782 | "source": [ 783 | "- 1. **Word Count of the documents** – total number of words in the documents\n", 784 | "- 2. **Character Count of the documents** – total number of characters in the documents\n", 785 | "- 3. **Average Word Density of the documents** – average length of the words used in the documents\n", 786 | "- 4. **Puncutation Count in the Complete Essay** – total number of punctuation marks in the documents\n", 787 | "- 5. **Upper Case Count in the Complete Essay** – total number of upper count words in the documents\n", 788 | "- 6. **Title Word Count in the Complete Essay** – total number of proper case (title) words in the documents\n", 789 | "- 7. 
**Frequency distribution of Part of Speech Tags:**\n", 790 | " - Noun Count\n", 791 | " - Verb Count\n", 792 | " - Adjective Count\n", 793 | " - Adverb Count\n", 794 | " - Pronoun Count" 795 | ] 796 | }, 797 | { 798 | "cell_type": "code", 799 | "execution_count": null, 800 | "metadata": {}, 801 | "outputs": [], 802 | "source": [ 803 | "trainDF['char_count'] = trainDF['text'].apply(len)\n", 804 | "trainDF['word_count'] = trainDF['text'].apply(lambda x: len(x.split()))\n", 805 | "trainDF['word_density'] = trainDF['char_count'] / (trainDF['word_count']+1)\n", 806 | "trainDF['punctuation_count'] = trainDF['text'].apply(lambda x: len(\"\".join(_ for _ in x if _ in string.punctuation))) \n", 807 | "trainDF['title_word_count'] = trainDF['text'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))\n", 808 | "trainDF['upper_case_word_count'] = trainDF['text'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))" 809 | ] 810 | }, 811 | { 812 | "cell_type": "code", 813 | "execution_count": null, 814 | "metadata": {}, 815 | "outputs": [], 816 | "source": [ 817 | "pos_family = {\n", 818 | " 'noun' : ['NN','NNS','NNP','NNPS'],\n", 819 | " 'pron' : ['PRP','PRP$','WP','WP$'],\n", 820 | " 'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],\n", 821 | " 'adj' : ['JJ','JJR','JJS'],\n", 822 | " 'adv' : ['RB','RBR','RBS','WRB']\n", 823 | "}\n", 824 | "\n", 825 | "# function to check and get the part of speech tag count of a words in a given sentence\n", 826 | "def check_pos_tag(x, flag):\n", 827 | " cnt = 0\n", 828 | " try:\n", 829 | " wiki = textblob.TextBlob(x)\n", 830 | " for tup in wiki.tags:\n", 831 | " ppo = list(tup)[1]\n", 832 | " if ppo in pos_family[flag]:\n", 833 | " cnt += 1\n", 834 | " except:\n", 835 | " pass\n", 836 | " return cnt\n", 837 | "\n", 838 | "trainDF['noun_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'noun'))\n", 839 | "trainDF['verb_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'verb'))\n", 840 | "trainDF['adj_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'adj'))\n", 841 | "trainDF['adv_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'adv'))\n", 842 | "trainDF['pron_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'pron'))" 843 | ] 844 | }, 845 | { 846 | "cell_type": "code", 847 | "execution_count": null, 848 | "metadata": {}, 849 | "outputs": [], 850 | "source": [ 851 | "import nltk\n", 852 | "nltk.download('averaged_perceptron_tagger')" 853 | ] 854 | }, 855 | { 856 | "cell_type": "markdown", 857 | "metadata": {}, 858 | "source": [ 859 | "### 2.5 Topic Models as features (LDA)" 860 | ] 861 | }, 862 | { 863 | "cell_type": "code", 864 | "execution_count": null, 865 | "metadata": {}, 866 | "outputs": [], 867 | "source": [ 868 | "# train a LDA Model\n", 869 | "lda_model = decomposition.LatentDirichletAllocation(n_components=20, learning_method='online', max_iter=20)\n", 870 | "X_topics = lda_model.fit_transform(xtrain_count)\n", 871 | "topic_word = lda_model.components_ \n", 872 | "vocab = count_vect.get_feature_names()\n", 873 | "\n", 874 | "# view the topic models\n", 875 | "n_top_words = 10\n", 876 | "topic_summaries = []\n", 877 | "for i, topic_dist in enumerate(topic_word):\n", 878 | " topic_words = numpy.array(vocab)[numpy.argsort(topic_dist)][:-(n_top_words+1):-1]\n", 879 | " topic_summaries.append(' '.join(topic_words))" 880 | ] 881 | }, 882 | { 883 | "cell_type": "markdown", 884 | "metadata": {}, 885 | "source": [ 886 | "# MODEL BUILDING" 887 | ] 888 | }, 889 | { 890 | "cell_type": "markdown", 891 
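The evaluation cells in this section call a `train_model(...)` helper. In case it is not defined elsewhere in the notebook, a minimal sketch consistent with how it is called below (fit the classifier on the training features, predict on the validation features, return the accuracy against `valid_y`) could be:

```python
# Hypothetical helper matching calls like train_model(clf, xtrain_count, train_y, xvalid_count).
# It assumes the train/validation variables created earlier (train_y, valid_y) are in scope.
from sklearn import metrics

def train_model(classifier, feature_vector_train, label, feature_vector_valid):
    classifier.fit(feature_vector_train, label)             # fit on the training features
    predictions = classifier.predict(feature_vector_valid)  # predict the validation set
    return metrics.accuracy_score(valid_y, predictions)     # accuracy on held-out labels
```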
| "metadata": {}, 892 | "source": [ 893 | "- Naive Bayes Classifier\n", 894 | "- Linear Classifier\n", 895 | "- Support Vector Machine\n", 896 | "- Bagging Models\n", 897 | "- Boosting Models\n", 898 | "- Shallow Neural Networks\n", 899 | "- Deep Neural Networks\n", 900 | "- Convolutional Neural Network (CNN)\n", 901 | "- Long Short Term Modelr (LSTM)\n", 902 | "- Gated Recurrent Unit (GRU)\n", 903 | "- Bidirectional RNN\n", 904 | "- Recurrent Convolutional Neural Network (RCNN)\n", 905 | "- Other Variants of Deep Neural Networks" 906 | ] 907 | }, 908 | { 909 | "cell_type": "markdown", 910 | "metadata": {}, 911 | "source": [ 912 | "### 3.1 Naive Bayes" 913 | ] 914 | }, 915 | { 916 | "cell_type": "markdown", 917 | "metadata": {}, 918 | "source": [ 919 | "-------------------\n", 920 | "\n", 921 | "One of the most basic models for text classification is the Naive Bayes model. The Naive Bayes classification model predicts the document topic, $y = \\{C_{1},C_{2},\\ldots, C_{20}\\}$ where $C_{k}$ is the class or topic based on the document feactures $\\textbf{x} \\in \\mathbb{N}^{p}$, and $p$ is the number of terms in our bag-of-words list. The feature vector,\n", 922 | "\n", 923 | "$$\\textbf{x} \\, = \\, \\left[ x_{1}, x_{2}, \\ldots , x_{p} \\right] $$\n", 924 | "\n", 925 | "contains counts $x_{i}$ for the $\\text{tf-idf}$ value of the i-th term in our bag-of-words list. Using Bayes Theorem we can develop a model to predict the topic class ($C_{k}$) of a document from its feature vector $\\textbf{x}$,\n", 926 | "\n", 927 | "$$P\\left(C_{k} \\, \\vert \\, x_{1}, \\ldots , x_{p} \\right) \\; = \\; \\frac{P\\left(x_{1}, \\ldots, x_{p} \\, \\vert \\, C_{k} \\right)P(C_{k})}{P\\left(x_{1}, \\ldots, x_{p} \\right)}$$\n", 928 | "\n", 929 | "The Naive Bayes model makes the \"Naive\" assumption the probability of each term's $\\text{tf-idf}$ is **conditionally independent** of every other term. This reduces our **conditional probability function** to the product,\n", 930 | "\n", 931 | "$$ P\\left(x_{1}, \\ldots, x_{p} \\, \\vert \\, C_{k} \\right) \\; = \\; \\Pi_{i=1}^{p} P\\left(x_{i} \\, \\vert \\, C_{k} \\right)$$\n", 932 | "\n", 933 | "Subsequently Bayes' theorem for our classification problem becomes,\n", 934 | "\n", 935 | "$$P\\left(C_{k} \\, \\vert \\, x_{1}, \\ldots , x_{p} \\right) \\; = \\; \\frac{ P(C_{k}) \\, \\Pi_{i=1}^{p} P\\left(x_{i} \\, \\vert \\, C_{k} \\right)}{P\\left(x_{1}, \\ldots, x_{p} \\right)}$$\n", 936 | "\n", 937 | "\n", 938 | "Since the denominator is independent of the class ($C_{k}$) we can use a Maxmimum A Posteriori method to estimate the document topic , \n", 939 | "\n", 940 | "$$ \\hat{y} \\, = \\, \\text{arg max}_{k}\\; P(C_{k}) \\, \\Pi_{i=1}^{p} P\\left(x_{i} \\, \\vert \\, C_{k} \\right)$$ \n", 941 | "\n", 942 | "\n", 943 | "The **prior**, $P(C_{k}),$ is often taken to be the relative frequency of the class in the training corpus, while the form of the conditional distribution $P\\left(x_{i} \\, \\vert \\, C_{k} \\right)$ is a choice of the modeler and determines the type of Naive Bayes classifier. \n", 944 | "\n", 945 | "\n", 946 | "We will use a multinomial Naive Bayes model which works well when our features are discrete variables such as those in our $\\text{tf-idf}$ matrix. 
In the multinomial Naive Bayes model the conditional probability takes the form,\n", 947 | "\n", 948 | "\n", 949 | "$$ P\\left(x_{1}, \\ldots, x_{p} \\, \\vert \\, C_{k} \\right) \\, = \\, \\frac{\\left(\\sum_{i=1}^{p} x_{i}\\right)!}{\\Pi_{i=1}^{p} x_{i}!} \\Pi_{i=1}^{p} p_{k,i}^{x_{i}}$$\n", 950 | "\n", 951 | "\n", 952 | "where $p_{k,i}$ is the probability that the $k$-th class will have the $i$-th bag-of-words term in its feature vector. This leads to our **posterior distribution** having the functional form,\n", 953 | "\n", 954 | "$$P\\left(C_{k} \\, \\vert \\, x_{1}, \\ldots , x_{p} \\right) \\; = \\; \\frac{ P(C_{k})}{P\\left(x_{1}, \\ldots, x_{p} \\right)} \\, \\frac{\\left(\\sum_{i=1}^{p} x_{i}\\right)!}{\\Pi_{i=1}^{p} x_{i}!} \\Pi_{i=1}^{p} p_{k,i}^{x_{i}}$$\n", 955 | "\n", 956 | "\n", 957 | "\n", 958 | "We can instantiate a multinomial Naive Bayes classifier using the Scikit-learn library and fit it to our $\\text{tf-idf}$ matrix using the commands," 959 | ] 960 | }, 961 | { 962 | "cell_type": "code", 963 | "execution_count": null, 964 | "metadata": {}, 965 | "outputs": [], 966 | "source": [ 967 | "# Naive Bayes on Count Vectors\n", 968 | "accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_count, train_y, xvalid_count)\n", 969 | "print(\"NB, Count Vectors: \", accuracy)\n", 970 | "\n", 971 | "# Naive Bayes on Word Level TF IDF Vectors\n", 972 | "\n", 973 | "accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf)\n", 974 | "print(\"NB, WordLevel TF-IDF: \", accuracy)\n", 975 | "\n", 976 | "# Naive Bayes on Ngram Level TF IDF Vectors\n", 977 | "accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)\n", 978 | "print(\"NB, N-Gram Vectors: \", accuracy)\n", 979 | "\n", 980 | "# Naive Bayes on Character Level TF IDF Vectors\n", 981 | "accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)\n", 982 | "print(\"NB, CharLevel Vectors: \", accuracy)" 983 | ] 984 | }, 985 | { 986 | "cell_type": "markdown", 987 | "metadata": {}, 988 | "source": [ 989 | "#### Other implementation" 990 | ] 991 | }, 992 | { 993 | "cell_type": "code", 994 | "execution_count": null, 995 | "metadata": {}, 996 | "outputs": [], 997 | "source": [ 998 | "# Look at my dataframe\n", 999 | "trainDF.head()" 1000 | ] 1001 | }, 1002 | { 1003 | "cell_type": "code", 1004 | "execution_count": null, 1005 | "metadata": {}, 1006 | "outputs": [], 1007 | "source": [ 1008 | "from sklearn.model_selection import train_test_split\n", 1009 | "from sklearn.feature_extraction.text import CountVectorizer\n", 1010 | "from sklearn.feature_extraction.text import TfidfTransformer\n", 1011 | "from sklearn.naive_bayes import MultinomialNB\n", 1012 | "\n", 1013 | "X_train, X_test, y_train, y_test = train_test_split(trainDF['text'], trainDF['label'], random_state = 0)\n", 1014 | "count_vect = CountVectorizer()\n", 1015 | "X_train_counts = count_vect.fit_transform(X_train)\n", 1016 | "tfidf_transformer = TfidfTransformer()\n", 1017 | "X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)\n", 1018 | "\n", 1019 | "mod = MultinomialNB()\n", 1020 | "clf = mod.fit(X_train_tfidf, y_train)" 1021 | ] 1022 | }, 1023 | { 1024 | "cell_type": "code", 1025 | "execution_count": null, 1026 | "metadata": {}, 1027 | "outputs": [], 1028 | "source": [ 1029 | "from sklearn.metrics import accuracy_score\n", 1030 | "X_test_tf = count_vect.transform(X_test)\n", 1031 | "X_test_tfidf = 
tfidf_transformer.transform(X_test_tf)\n", 1032 | "\n", 1033 | "predicted = mod.predict(X_test_tfidf)\n", 1034 | "print(\"Accuracy:\", accuracy_score(y_test, predicted))" 1035 | ] 1036 | }, 1037 | { 1038 | "cell_type": "markdown", 1039 | "metadata": {}, 1040 | "source": [ 1041 | "-- A -- **Accuracy** - Accuracy is the most intuitive performance measure and it is simply **a ratio of correctly predicted observation to the total observations.** One may think that, **if we have high accuracy then our model is best**. Yes, accuracy is a great measure but only when you have symmetric datasets where values of false positive and false negatives are almost same. Therefore, you have to look at other parameters to evaluate the performance of your model. **For our model, we have got 0.803 which means our model is approx. 80% accurate.**\n", 1042 | "\n", 1043 | "**Accuracy = TP+TN/TP+FP+FN+TN**\n", 1044 | "\n", 1045 | "-- B -- **Precision** - Precision is the ratio of **correctly predicted positive observations to the total predicted positive observations.** The question that this metric answer is of **all passengers that labeled as survived, how many actually survived?** High precision relates to the low false positive rate. We have got 0.788 precision which is pretty good.\n", 1046 | "\n", 1047 | "**Precision = TP/TP+FP**\n", 1048 | "\n", 1049 | "-- C -- **Recall (Sensitivity)** - Recall is the **ratio of correctly predicted positive observations to the all observations in actual class - yes**. The question recall answers is: **Of all the passengers that truly survived, how many did we label?** We have got recall of 0.631 which is good for this model as it’s above 0.5.\n", 1050 | "\n", 1051 | "**Recall = TP/TP+FN**\n", 1052 | "\n", 1053 | "-- D -- **F1 score** - F1 Score is the **weighted average of Precision and Recall.** Therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have similar cost. If the cost of false positives and false negatives are very different, it’s better to look at both Precision and Recall. In our case, F1 score is 0.701.\n", 1054 | "\n", 1055 | "**F1 Score = 2*(Recall * Precision) / (Recall + Precision)**" 1056 | ] 1057 | }, 1058 | { 1059 | "cell_type": "code", 1060 | "execution_count": null, 1061 | "metadata": {}, 1062 | "outputs": [], 1063 | "source": [ 1064 | "from sklearn.metrics import classification_report\n", 1065 | "print(classification_report(y_test, predicted))" 1066 | ] 1067 | }, 1068 | { 1069 | "cell_type": "code", 1070 | "execution_count": null, 1071 | "metadata": {}, 1072 | "outputs": [], 1073 | "source": [ 1074 | "### Prediction\n", 1075 | "print(clf.predict(count_vect.transform([\"\"\" spent 3 days on the phone with countless \"agents\" - most of that time trying to get each one to understand what I wanted - simply wanted to change my email address in my account. Based on the terrible telephone service I assume they are all located in South America where it is known to be poor quality phone service. Spent 15 - 25 minutes just getting them to understand what the problem was. Ended up having to create a new account losing my entire order history. 
This has to be the worse phone customer service out there!\"\"\"])))" 1076 | ] 1077 | }, 1078 | { 1079 | "cell_type": "markdown", 1080 | "metadata": {}, 1081 | "source": [ 1082 | "### 3.2 Linear Classifier / Logistic Regression" 1083 | ] 1084 | }, 1085 | { 1086 | "cell_type": "markdown", 1087 | "metadata": {}, 1088 | "source": [ 1089 | "https://stlong0521.github.io/20160228%20-%20Logistic%20Regression.html" 1090 | ] 1091 | }, 1092 | { 1093 | "cell_type": "markdown", 1094 | "metadata": {}, 1095 | "source": [ 1096 | "
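Before the derivation, here is a minimal usage sketch of scikit-learn's implementation; it assumes the word-level tf-idf features (`xtrain_tfidf`, `xvalid_tfidf`) and encoded labels built in the feature-engineering section, and the hyperparameters are illustrative only:

```python
# Sketch: multi-class logistic regression on the word-level tf-idf features built above.
from sklearn import linear_model, metrics

logreg = linear_model.LogisticRegression(C=1.0, max_iter=1000)
logreg.fit(xtrain_tfidf, train_y)
predictions = logreg.predict(xvalid_tfidf)
print("LR, WordLevel TF-IDF accuracy:", metrics.accuracy_score(valid_y, predictions))
```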

The generative classification model, such as Naive Bayes, tries to learn the probabilities and then predicts by using Bayes' rule to calculate the posterior, \(p(y|\textbf{x})\). However, discriminative classifiers model the posterior directly. As one of the most popular discriminative classifiers, logistic regression directly models the linear decision boundary.

\n", 1097 | "

Binary Logistic Regression Classifier1

\n", 1098 | "

Let us start with the binary case. For an M-dimensional feature vector \\(\\textbf{x}=[x_1,x_2,...,x_M]^T\\), the posterior probability of class \\(y\\in\\{\\pm{1}\\}\\) given \\(\\textbf{x}\\) is assumed to satisfy\n", 1099 | "

\n", 1100 | "
\\begin{equation}\n", 1101 | "\\ln{\\frac{p(y=1|\\textbf{x})}{p(y=-1|\\textbf{x})}}=\\textbf{w}^T\\textbf{x},\n", 1102 | "\\end{equation}
\n", 1103 | "

\n", 1104 | "where \\(\\textbf{w}=[w_1,w_2,...,w_M]^T\\) is the weighting vector to be learned. Given the constraint that \\(p(y=1|\\textbf{x})+p(y=-1|\\textbf{x})=1\\), it follows that\n", 1105 | "

\n", 1106 | "
\\begin{equation} \\label{Eqn:Prob_Binary}\n", 1107 | "p(y|\\textbf{x})=\\frac{1}{1+\\exp(-y\\textbf{w}^T\\textbf{x})}=\\sigma(y\\textbf{w}^T\\textbf{x}),\n", 1108 | "\\end{equation}
\n", 1109 | "

\n", 1110 | "in which we can observe the logistic sigmoid function \\(\\sigma(a)=\\frac{1}{1+\\exp(-a)}\\).

\n", 1111 | "

Based on the assumptions above, the weighting vector, \\(\\textbf{w}\\), can be learned by maximum likelihood estimation (MLE). More specifically, given training data set \\(\\mathcal{D}=\\{(\\textbf{x}_1,y_1),(\\textbf{x}_2,y_2),...,(\\textbf{x}_N,y_N)\\}\\),\n", 1112 | "

\n", 1113 | "
\\begin{align}\n", 1114 | "\\begin{aligned}\n", 1115 | "\\textbf{w}^*&=\\max_{\\textbf{w}}{\\mathcal{L}(\\textbf{w})}\\\\\n", 1116 | "&=\\max_{\\textbf{w}}{\\sum_{i=1}^N\\ln{{p(y_i|\\textbf{x}_i)}}}\\\\\n", 1117 | "&=\\max_{\\textbf{w}}{\\sum_{i=1}^N{\\ln{\\frac{1}{1+\\exp(-y_i\\textbf{w}^T\\textbf{x}_i)}}}}\\\\\n", 1118 | "&=\\min_{\\textbf{w}}{\\sum_{i=1}^N{\\ln{(1+\\exp(-y_i\\textbf{w}^T\\textbf{x}_i))}}}.\n", 1119 | "\\end{aligned}\n", 1120 | "\\end{align}
\n", 1121 | "

\n", 1122 | "We have a convex objective function here, and we can calculate the optimal solution by applying gradient descent. The gradient can be drawn as\n", 1123 | "

\n", 1124 | "
\\begin{align}\n", 1125 | "\\begin{aligned}\n", 1126 | "\\nabla{\\mathcal{L}(\\textbf{w})}&=\\sum_{i=1}^N{\\frac{-y_i\\textbf{x}_i\\exp(-y_i\\textbf{w}^T\\textbf{x}_i)}{1+\\exp(-y_i\\textbf{w}^T\\textbf{x}_i)}}\\\\\n", 1127 | "&=-\\sum_{i=1}^N{y_i\\textbf{x}_i(1-p(y_i|\\textbf{x}_i))}.\n", 1128 | "\\end{aligned}\n", 1129 | "\\end{align}
\n", 1130 | "

\n", 1131 | "Then, we can learn the optimal \\(\\textbf{w}\\) by starting with an initial \\(\\textbf{w}_0\\) and iterating as follows:\n", 1132 | "

\n", 1133 | "
\\begin{equation} \\label{Eqn:Iteration_Binary}\n", 1134 | "\\textbf{w}_{t+1}=\\textbf{w}_{t}-\\eta_t\\nabla{\\mathcal{L}(\\textbf{w})},\n", 1135 | "\\end{equation}
\n", 1136 | "

\n", 1137 | "where \\(\\eta_t\\) is the learning step size. It can be invariant to time, but time-varying step sizes could potential reduce the convergence time, e.g., setting \\(\\eta_t\\propto{1/\\sqrt{t}}\\) such that the step size decreases with an increasing time \\(t\\).

\n", 1138 | "

Multiclass Logistic Regression Classifier

\n", 1139 | "

When generalized to the multiclass case, the logistic regression model needs to be adapted accordingly. Now we have \(K\) possible classes, that is, \(y\in\{1,2,...,K\}\). It is assumed that the posterior probability of class \(y=k\) given \(\textbf{x}\) follows\n", 1140 | "

\n", 1141 | "
\\begin{equation}\n", 1142 | "\\ln{p(y=k|\\textbf{x})}\\propto\\textbf{w}_k^T\\textbf{x},\n", 1143 | "\\end{equation}
\n", 1144 | "

\n", 1145 | "where \\(\\textbf{w}_k\\) is a column weighting vector corresponding to class \\(k\\). Considering all classes \\(k=1,2,...,K\\), we would have a weighting matrix that includes all \\(K\\) weighting vectors. That is, \\(\\textbf{W}=[\\textbf{w}_1,\\textbf{w}_2,...,\\textbf{w}_K]\\).\n", 1146 | "Under the constraint\n", 1147 | "

\n", 1148 | "
\\begin{equation}\n", 1149 | "\\sum_{k=1}^K{p(y=k|\\textbf{x})}=1,\n", 1150 | "\\end{equation}
\n", 1151 | "

\n", 1152 | "it then follows that\n", 1153 | "

\n", 1154 | "
\\begin{equation} \\label{Eqn:Prob_Multiple}\n", 1155 | "p(y=k|\\textbf{x})=\\frac{\\exp(\\textbf{w}_k^T\\textbf{x})}{\\sum_{j=1}^K{\\exp(\\textbf{w}_j^T\\textbf{x})}}.\n", 1156 | "\\end{equation}
\n", 1157 | "

The weighting matrix, \(\textbf{W}\), can be similarly learned by maximum likelihood estimation (MLE). More specifically, given the training data set \(\mathcal{D}=\{(\textbf{x}_1,y_1),(\textbf{x}_2,y_2),...,(\textbf{x}_N,y_N)\}\),\n", 1157 | "

\n", 1159 | "
\begin{align}\n", 1160 | "\begin{aligned}\n", 1161 | "\textbf{W}^*&=\max_{\textbf{W}}{\mathcal{L}(\textbf{W})}\\\n", 1162 | "&=\max_{\textbf{W}}{\sum_{i=1}^N\ln{{p(y_i|\textbf{x}_i)}}}\\\n", 1163 | "&=\max_{\textbf{W}}{\sum_{i=1}^N{\ln{\frac{\exp(\textbf{w}_{y_i}^T\textbf{x}_i)}{\sum_{j=1}^K{\exp(\textbf{w}_j^T\textbf{x}_i)}}}}}.\n", 1164 | "\end{aligned}\n", 1165 | "\end{align}
\n", 1166 | "

\n", 1167 | "The gradient of the objective function with respect to each \\(\\textbf{w}_k\\) can be calculated as\n", 1168 | "

\n", 1169 | "
\begin{align}\n", 1170 | "\begin{aligned}\n", 1171 | "\frac{\partial{\mathcal{L}(\textbf{W})}}{\partial{\textbf{w}_k}}&=\sum_{i=1}^N{\textbf{x}_i\left(I(y_i=k)-\frac{\exp(\textbf{w}_k^T\textbf{x}_i)}{\sum_{j=1}^K{\exp(\textbf{w}_j^T\textbf{x}_i)}}\right)}\\\n", 1172 | "&=\sum_{i=1}^N{\textbf{x}_i(I(y_i=k)-p(y_i=k|\textbf{x}_i))},\n", 1173 | "\end{aligned}\n", 1174 | "\end{align}
\n", 1175 | "

\n", 1176 | "where \\(I(\\cdot)\\) is a binary indicator function. Applying gradient descent, the optimal solution can be obtained by iterating as follows:\n", 1177 | "

\n", 1178 | "
\\begin{equation}\\label{Eqn:Iteration_Multiple}\n", 1179 | "\\textbf{w}_{k,t+1}=\\textbf{w}_{k,t}+\\eta_{t}\\frac{\\partial{\\mathcal{L}(\\textbf{W})}}{\\partial{\\textbf{w}_k}}.\n", 1180 | "\\end{equation}
\n", 1181 | "

\n", 1182 | "Note that we have \"\\(+\\)\" instead of \"\\(-\\)\", because the maximum likelihood estimation in the binary case is eventually converted to a minimization problem, while here we keep performing maximization.

\n", 1183 | "

How to Perform Predictions?

\n", 1184 | "

Once the optimal weights are learned from the logistic regression model, for any new feature vector \(\textbf{x}\) we can easily calculate the probability that it belongs to each class label \(k\), using the binary-case or the multiclass-case posterior derived above. With the probabilities for each class label available, we can then perform either of the following (a minimal sketch follows the list):

\n", 1185 | "
    \n", 1186 | "
  • a hard decision by identifying the class label with the highest probability, or
\n", 1187 | "
  • a soft decision by showing the top \\(k\\) most probable class labels with their corresponding probabilities.
\n", 1188 | "
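A minimal sketch of both decision rules, applied to a made-up matrix of class probabilities (in the notebook these would come from a fitted estimator's `predict_proba`):

```python
import numpy as np

# made-up posterior probabilities for 2 documents over K = 4 classes
proba = np.array([[0.10, 0.65, 0.20, 0.05],
                  [0.30, 0.05, 0.35, 0.30]])

# hard decision: the single most probable class label per document
print("hard decisions:", proba.argmax(axis=1))

# soft decision: the top-k most probable labels with their probabilities
k = 2
top_k = np.argsort(proba, axis=1)[:, ::-1][:, :k]
for i, labels in enumerate(top_k):
    print("document", i, [(int(c), float(proba[i, c])) for c in labels])
```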
" 1189 | ] 1190 | }, 1191 | { 1192 | "cell_type": "code", 1193 | "execution_count": null, 1194 | "metadata": {}, 1195 | "outputs": [], 1196 | "source": [ 1197 | "# Linear Classifier on Count Vectors\n", 1198 | "accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xvalid_count)\n", 1199 | "print(\"LR, Count Vectors: \", accuracy)\n", 1200 | "\n", 1201 | "# Linear Classifier on Word Level TF IDF Vectors\n", 1202 | "accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xvalid_tfidf)\n", 1203 | "print(\"LR, WordLevel TF-IDF: \", accuracy)\n", 1204 | "\n", 1205 | "# Linear Classifier on Ngram Level TF IDF Vectors\n", 1206 | "accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)\n", 1207 | "print(\"LR, N-Gram Vectors: \", accuracy)\n", 1208 | "\n", 1209 | "# Linear Classifier on Character Level TF IDF Vectors\n", 1210 | "accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)\n", 1211 | "print(\"LR, CharLevel Vectors: \", accuracy)" 1212 | ] 1213 | }, 1214 | { 1215 | "cell_type": "markdown", 1216 | "metadata": {}, 1217 | "source": [ 1218 | "### 3.3 Implementing a SVM Model" 1219 | ] 1220 | }, 1221 | { 1222 | "cell_type": "code", 1223 | "execution_count": null, 1224 | "metadata": {}, 1225 | "outputs": [], 1226 | "source": [ 1227 | "# https://svivek.com/teaching/machine-learning/fall2018/slides/svm/svm-sgd.pdf\n", 1228 | "# https://medium.com/deep-math-machine-learning-ai/chapter-3-support-vector-machine-with-math-47d6193c82be" 1229 | ] 1230 | }, 1231 | { 1232 | "cell_type": "code", 1233 | "execution_count": null, 1234 | "metadata": {}, 1235 | "outputs": [], 1236 | "source": [ 1237 | "\n", 1238 | "\n", 1239 | "# SVM on Ngram Level TF IDF Vectors\n", 1240 | "accuracy = train_model(svm.SVC(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)\n", 1241 | "print(\"SVM, N-Gram Vectors: \", accuracy)" 1242 | ] 1243 | }, 1244 | { 1245 | "cell_type": "code", 1246 | "execution_count": null, 1247 | "metadata": {}, 1248 | "outputs": [], 1249 | "source": [ 1250 | "from sklearn.pipeline import Pipeline" 1251 | ] 1252 | }, 1253 | { 1254 | "cell_type": "code", 1255 | "execution_count": null, 1256 | "metadata": {}, 1257 | "outputs": [], 1258 | "source": [ 1259 | "from sklearn.linear_model import SGDClassifier\n", 1260 | "text_clf_svm = Pipeline([('vect', CountVectorizer()),\n", 1261 | " ('tfidf', TfidfTransformer()),\n", 1262 | " ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, n_iter=5, random_state=42))])" 1263 | ] 1264 | }, 1265 | { 1266 | "cell_type": "code", 1267 | "execution_count": null, 1268 | "metadata": {}, 1269 | "outputs": [], 1270 | "source": [ 1271 | ">>> _ = text_clf_svm.fit(twenty_train.data, twenty_train.target)\n", 1272 | ">>> predicted_svm = text_clf_svm.predict(twenty_test.data)\n", 1273 | ">>> np.mean(predicted_svm == twenty_test.target)" 1274 | ] 1275 | }, 1276 | { 1277 | "cell_type": "code", 1278 | "execution_count": null, 1279 | "metadata": {}, 1280 | "outputs": [], 1281 | "source": [] 1282 | }, 1283 | { 1284 | "cell_type": "markdown", 1285 | "metadata": {}, 1286 | "source": [ 1287 | "### 3.4 Bagging Model" 1288 | ] 1289 | }, 1290 | { 1291 | "cell_type": "code", 1292 | "execution_count": null, 1293 | "metadata": {}, 1294 | "outputs": [], 1295 | "source": [ 1296 | "# RF on Count Vectors\n", 1297 | "accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_count, train_y, xvalid_count)\n", 1298 | "print 
\"RF, Count Vectors: \", accuracy\n", 1299 | "\n", 1300 | "# RF on Word Level TF IDF Vectors\n", 1301 | "accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_tfidf, train_y, xvalid_tfidf)\n", 1302 | "print \"RF, WordLevel TF-IDF: \", accuracy" 1303 | ] 1304 | }, 1305 | { 1306 | "cell_type": "markdown", 1307 | "metadata": {}, 1308 | "source": [ 1309 | "### 3.5 Boosting Model" 1310 | ] 1311 | }, 1312 | { 1313 | "cell_type": "code", 1314 | "execution_count": null, 1315 | "metadata": {}, 1316 | "outputs": [], 1317 | "source": [ 1318 | "# Extereme Gradient Boosting on Count Vectors\n", 1319 | "accuracy = train_model(xgboost.XGBClassifier(), xtrain_count.tocsc(), train_y, xvalid_count.tocsc())\n", 1320 | "print(\"Xgb, Count Vectors: \", accuracy)\n", 1321 | "\n", 1322 | "# Extereme Gradient Boosting on Word Level TF IDF Vectors\n", 1323 | "accuracy = train_model(xgboost.XGBClassifier(), xtrain_tfidf.tocsc(), train_y, xvalid_tfidf.tocsc())\n", 1324 | "print(\"Xgb, WordLevel TF-IDF: \", accuracy)\n", 1325 | "\n", 1326 | "# Extereme Gradient Boosting on Character Level TF IDF Vectors\n", 1327 | "accuracy = train_model(xgboost.XGBClassifier(), xtrain_tfidf_ngram_chars.tocsc(), train_y, xvalid_tfidf_ngram_chars.tocsc())\n", 1328 | "print(\"Xgb, CharLevel Vectors: \", accuracy)" 1329 | ] 1330 | }, 1331 | { 1332 | "cell_type": "markdown", 1333 | "metadata": {}, 1334 | "source": [ 1335 | "### 3.6 Shallow Neural Networks" 1336 | ] 1337 | }, 1338 | { 1339 | "cell_type": "code", 1340 | "execution_count": null, 1341 | "metadata": {}, 1342 | "outputs": [], 1343 | "source": [ 1344 | "def create_model_architecture(input_size):\n", 1345 | " # create input layer \n", 1346 | " input_layer = layers.Input((input_size, ), sparse=True)\n", 1347 | " \n", 1348 | " # create hidden layer\n", 1349 | " hidden_layer = layers.Dense(100, activation=\"relu\")(input_layer)\n", 1350 | " \n", 1351 | " # create output layer\n", 1352 | " output_layer = layers.Dense(1, activation=\"sigmoid\")(hidden_layer)\n", 1353 | "\n", 1354 | " classifier = models.Model(inputs = input_layer, outputs = output_layer)\n", 1355 | " classifier.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')\n", 1356 | " return classifier \n", 1357 | "\n", 1358 | "classifier = create_model_architecture(xtrain_tfidf_ngram.shape[1])\n", 1359 | "accuracy = train_model(classifier, xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram, is_neural_net=True)\n", 1360 | "print(\"NN, Ngram Level TF IDF Vectors\", accuracy)" 1361 | ] 1362 | }, 1363 | { 1364 | "cell_type": "markdown", 1365 | "metadata": {}, 1366 | "source": [ 1367 | "### 3.7.1 Convolutional Neural Network [Deep Neural Networks]" 1368 | ] 1369 | }, 1370 | { 1371 | "cell_type": "code", 1372 | "execution_count": null, 1373 | "metadata": {}, 1374 | "outputs": [], 1375 | "source": [ 1376 | "from IPython.display import Image\n", 1377 | "from IPython.core.display import HTML \n", 1378 | "Image(url= \"cnn.png\")" 1379 | ] 1380 | }, 1381 | { 1382 | "cell_type": "code", 1383 | "execution_count": null, 1384 | "metadata": {}, 1385 | "outputs": [], 1386 | "source": [ 1387 | "def create_cnn():\n", 1388 | " # Add an Input Layer\n", 1389 | " input_layer = layers.Input((70, ))\n", 1390 | "\n", 1391 | " # Add the word embedding Layer\n", 1392 | " embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)\n", 1393 | " embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)\n", 1394 | "\n", 1395 | " # Add the convolutional Layer\n", 1396 | " 
conv_layer = layers.Convolution1D(100, 3, activation=\"relu\")(embedding_layer)\n", 1397 | "\n", 1398 | " # Add the pooling Layer\n", 1399 | " pooling_layer = layers.GlobalMaxPool1D()(conv_layer)\n", 1400 | "\n", 1401 | " # Add the output Layers\n", 1402 | " output_layer1 = layers.Dense(50, activation=\"relu\")(pooling_layer)\n", 1403 | " output_layer1 = layers.Dropout(0.25)(output_layer1)\n", 1404 | " output_layer2 = layers.Dense(1, activation=\"sigmoid\")(output_layer1)\n", 1405 | "\n", 1406 | " # Compile the model\n", 1407 | " model = models.Model(inputs=input_layer, outputs=output_layer2)\n", 1408 | " model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')\n", 1409 | " \n", 1410 | " return model\n", 1411 | "\n", 1412 | "classifier = create_cnn()\n", 1413 | "accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)\n", 1414 | "print(\"CNN, Word Embeddings\", accuracy)" 1415 | ] 1416 | }, 1417 | { 1418 | "cell_type": "markdown", 1419 | "metadata": {}, 1420 | "source": [ 1421 | "### 3.7.2 Recurrent Neural Network – LSTM [Deep Neural Networks]" 1422 | ] 1423 | }, 1424 | { 1425 | "cell_type": "code", 1426 | "execution_count": null, 1427 | "metadata": {}, 1428 | "outputs": [], 1429 | "source": [ 1430 | "def create_rnn_lstm():\n", 1431 | " # Add an Input Layer\n", 1432 | " input_layer = layers.Input((70, ))\n", 1433 | "\n", 1434 | " # Add the word embedding Layer\n", 1435 | " embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)\n", 1436 | " embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)\n", 1437 | "\n", 1438 | " # Add the LSTM Layer\n", 1439 | " lstm_layer = layers.LSTM(100)(embedding_layer)\n", 1440 | "\n", 1441 | " # Add the output Layers\n", 1442 | " output_layer1 = layers.Dense(50, activation=\"relu\")(lstm_layer)\n", 1443 | " output_layer1 = layers.Dropout(0.25)(output_layer1)\n", 1444 | " output_layer2 = layers.Dense(1, activation=\"sigmoid\")(output_layer1)\n", 1445 | "\n", 1446 | " # Compile the model\n", 1447 | " model = models.Model(inputs=input_layer, outputs=output_layer2)\n", 1448 | " model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')\n", 1449 | " \n", 1450 | " return model\n", 1451 | "\n", 1452 | "classifier = create_rnn_lstm()\n", 1453 | "accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)\n", 1454 | "print(\"RNN-LSTM, Word Embeddings\", accuracy)" 1455 | ] 1456 | }, 1457 | { 1458 | "cell_type": "markdown", 1459 | "metadata": {}, 1460 | "source": [ 1461 | "### 3.7.3 Recurrent Neural Network – GRU [Deep Neural Networks]" 1462 | ] 1463 | }, 1464 | { 1465 | "cell_type": "code", 1466 | "execution_count": null, 1467 | "metadata": {}, 1468 | "outputs": [], 1469 | "source": [ 1470 | "def create_rnn_gru():\n", 1471 | " # Add an Input Layer\n", 1472 | " input_layer = layers.Input((70, ))\n", 1473 | "\n", 1474 | " # Add the word embedding Layer\n", 1475 | " embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)\n", 1476 | " embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)\n", 1477 | "\n", 1478 | " # Add the GRU Layer\n", 1479 | " lstm_layer = layers.GRU(100)(embedding_layer)\n", 1480 | "\n", 1481 | " # Add the output Layers\n", 1482 | " output_layer1 = layers.Dense(50, activation=\"relu\")(lstm_layer)\n", 1483 | " output_layer1 = layers.Dropout(0.25)(output_layer1)\n", 1484 | " output_layer2 = layers.Dense(1, 
activation=\"sigmoid\")(output_layer1)\n", 1485 | "\n", 1486 | " # Compile the model\n", 1487 | " model = models.Model(inputs=input_layer, outputs=output_layer2)\n", 1488 | " model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')\n", 1489 | " \n", 1490 | " return model\n", 1491 | "\n", 1492 | "classifier = create_rnn_gru()\n", 1493 | "accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)\n", 1494 | "print(\"RNN-GRU, Word Embeddings\", accuracy)" 1495 | ] 1496 | }, 1497 | { 1498 | "cell_type": "markdown", 1499 | "metadata": {}, 1500 | "source": [ 1501 | "### 3.7.4 Bidirectional RNN [Deep Neural Networks]" 1502 | ] 1503 | }, 1504 | { 1505 | "cell_type": "code", 1506 | "execution_count": null, 1507 | "metadata": {}, 1508 | "outputs": [], 1509 | "source": [ 1510 | "def create_bidirectional_rnn():\n", 1511 | " # Add an Input Layer\n", 1512 | " input_layer = layers.Input((70, ))\n", 1513 | "\n", 1514 | " # Add the word embedding Layer\n", 1515 | " embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)\n", 1516 | " embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)\n", 1517 | "\n", 1518 | " # Add the LSTM Layer\n", 1519 | " lstm_layer = layers.Bidirectional(layers.GRU(100))(embedding_layer)\n", 1520 | "\n", 1521 | " # Add the output Layers\n", 1522 | " output_layer1 = layers.Dense(50, activation=\"relu\")(lstm_layer)\n", 1523 | " output_layer1 = layers.Dropout(0.25)(output_layer1)\n", 1524 | " output_layer2 = layers.Dense(1, activation=\"sigmoid\")(output_layer1)\n", 1525 | "\n", 1526 | " # Compile the model\n", 1527 | " model = models.Model(inputs=input_layer, outputs=output_layer2)\n", 1528 | " model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')\n", 1529 | " \n", 1530 | " return model\n", 1531 | "\n", 1532 | "classifier = create_bidirectional_rnn()\n", 1533 | "accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)\n", 1534 | "print(\"RNN-Bidirectional, Word Embeddings\", accuracy)" 1535 | ] 1536 | }, 1537 | { 1538 | "cell_type": "markdown", 1539 | "metadata": {}, 1540 | "source": [ 1541 | "### 3.7.5 Recurrent Convolutional Neural Network" 1542 | ] 1543 | }, 1544 | { 1545 | "cell_type": "markdown", 1546 | "metadata": {}, 1547 | "source": [ 1548 | "- Hierarichial Attention Networks\n", 1549 | "- Sequence to Sequence Models with Attention\n", 1550 | "- Bidirectional Recurrent Convolutional Neural Networks\n", 1551 | "- CNNs and RNNs with more number of layers" 1552 | ] 1553 | }, 1554 | { 1555 | "cell_type": "code", 1556 | "execution_count": null, 1557 | "metadata": {}, 1558 | "outputs": [], 1559 | "source": [ 1560 | "def create_rcnn():\n", 1561 | " # Add an Input Layer\n", 1562 | " input_layer = layers.Input((70, ))\n", 1563 | "\n", 1564 | " # Add the word embedding Layer\n", 1565 | " embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)\n", 1566 | " embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)\n", 1567 | " \n", 1568 | " # Add the recurrent layer\n", 1569 | " rnn_layer = layers.Bidirectional(layers.GRU(50, return_sequences=True))(embedding_layer)\n", 1570 | " \n", 1571 | " # Add the convolutional Layer\n", 1572 | " conv_layer = layers.Convolution1D(100, 3, activation=\"relu\")(embedding_layer)\n", 1573 | "\n", 1574 | " # Add the pooling Layer\n", 1575 | " pooling_layer = layers.GlobalMaxPool1D()(conv_layer)\n", 1576 | "\n", 1577 | " # 
Add the output Layers\n", 1578 | " output_layer1 = layers.Dense(50, activation=\"relu\")(pooling_layer)\n", 1579 | " output_layer1 = layers.Dropout(0.25)(output_layer1)\n", 1580 | " output_layer2 = layers.Dense(1, activation=\"sigmoid\")(output_layer1)\n", 1581 | "\n", 1582 | " # Compile the model\n", 1583 | " model = models.Model(inputs=input_layer, outputs=output_layer2)\n", 1584 | " model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')\n", 1585 | " \n", 1586 | " return model\n", 1587 | "\n", 1588 | "classifier = create_rcnn()\n", 1589 | "accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)\n", 1590 | "print(\"CNN, Word Embeddings\", accuracy)" 1591 | ] 1592 | }, 1593 | { 1594 | "cell_type": "code", 1595 | "execution_count": null, 1596 | "metadata": {}, 1597 | "outputs": [], 1598 | "source": [] 1599 | }, 1600 | { 1601 | "cell_type": "code", 1602 | "execution_count": null, 1603 | "metadata": {}, 1604 | "outputs": [], 1605 | "source": [] 1606 | }, 1607 | { 1608 | "cell_type": "markdown", 1609 | "metadata": {}, 1610 | "source": [ 1611 | "# EXPLAIN MODELS" 1612 | ] 1613 | }, 1614 | { 1615 | "cell_type": "markdown", 1616 | "metadata": {}, 1617 | "source": [ 1618 | "# TextExplainer: debugging black-box text classifiers" 1619 | ] 1620 | }, 1621 | { 1622 | "cell_type": "markdown", 1623 | "metadata": {}, 1624 | "source": [ 1625 | "https://eli5.readthedocs.io/en/latest/tutorials/black-box-text-classifiers.html\n", 1626 | "\n", 1627 | "https://eli5.readthedocs.io/en/latest/tutorials/black-box-text-classifiers.html#example-problem-lsa-svm-for-20-newsgroups-dataset\n", 1628 | "\n", 1629 | "**Goal:** explain predictions of arbitrary classifiers, including text classifiers (when it is hard to get exact mapping between model coefficients and text features, e.g. 
if there is dimension reduction involved)" 1630 | ] 1631 | }, 1632 | { 1633 | "cell_type": "markdown", 1634 | "metadata": {}, 1635 | "source": [ 1636 | "### Example problem: LSA+SVM for 20 Newsgroups dataset" 1637 | ] 1638 | }, 1639 | { 1640 | "cell_type": "code", 1641 | "execution_count": null, 1642 | "metadata": {}, 1643 | "outputs": [], 1644 | "source": [ 1645 | "from sklearn.datasets import fetch_20newsgroups\n", 1646 | "\n", 1647 | "categories = ['alt.atheism', 'soc.religion.christian',\n", 1648 | " 'comp.graphics', 'sci.med']\n", 1649 | "twenty_train = fetch_20newsgroups(\n", 1650 | " subset='train',\n", 1651 | " categories=categories,\n", 1652 | " shuffle=True,\n", 1653 | " random_state=42,\n", 1654 | " remove=('headers', 'footers'),\n", 1655 | ")\n", 1656 | "twenty_test = fetch_20newsgroups(\n", 1657 | " subset='test',\n", 1658 | " categories=categories,\n", 1659 | " shuffle=True,\n", 1660 | " random_state=42,\n", 1661 | " remove=('headers', 'footers'),\n", 1662 | ")" 1663 | ] 1664 | }, 1665 | { 1666 | "cell_type": "code", 1667 | "execution_count": null, 1668 | "metadata": {}, 1669 | "outputs": [], 1670 | "source": [ 1671 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 1672 | "from sklearn.svm import SVC\n", 1673 | "from sklearn.decomposition import TruncatedSVD\n", 1674 | "from sklearn.pipeline import Pipeline, make_pipeline\n", 1675 | "\n", 1676 | "vec = TfidfVectorizer(min_df=3, stop_words='english',\n", 1677 | " ngram_range=(1, 2))\n", 1678 | "\n", 1679 | "# The dimension of the input documents is reduced to 100, and then a kernel SVM is used to classify the documents.\n", 1680 | "svd = TruncatedSVD(n_components=100, n_iter=7, random_state=42)\n", 1681 | "lsa = make_pipeline(vec, svd)\n", 1682 | "\n", 1683 | "clf = SVC(C=150, gamma=2e-2, probability=True)\n", 1684 | "pipe = make_pipeline(lsa, clf)\n", 1685 | "pipe.fit(twenty_train.data, twenty_train.target)\n", 1686 | "pipe.score(twenty_test.data, twenty_test.target)" 1687 | ] 1688 | }, 1689 | { 1690 | "cell_type": "code", 1691 | "execution_count": null, 1692 | "metadata": {}, 1693 | "outputs": [], 1694 | "source": [ 1695 | "def print_prediction(doc):\n", 1696 | " y_pred = pipe.predict_proba([doc])[0]\n", 1697 | " for target, prob in zip(twenty_train.target_names, y_pred):\n", 1698 | " print(\"{:.3f} {}\".format(prob, target))\n", 1699 | "\n", 1700 | "doc = twenty_test.data[0]\n", 1701 | "\n", 1702 | "print(twenty_test.data[0])\n", 1703 | "print('------------------------------------ What is the prediction?-------------------------------------------------------')\n", 1704 | "print_prediction(doc)" 1705 | ] 1706 | }, 1707 | { 1708 | "cell_type": "markdown", 1709 | "metadata": {}, 1710 | "source": [ 1711 | "### TextExplainer" 1712 | ] 1713 | }, 1714 | { 1715 | "cell_type": "markdown", 1716 | "metadata": {}, 1717 | "source": [ 1718 | "1. Create a TextExplainer instance, \n", 1719 | "2. ... then pass the document to explain and a black-box classifier (a function which returns probabilities) to the fit() method, \n", 1720 | "3. ... 
then check the explanation:" 1721 | ] 1722 | }, 1723 | { 1724 | "cell_type": "code", 1725 | "execution_count": null, 1726 | "metadata": {}, 1727 | "outputs": [], 1728 | "source": [ 1729 | "import eli5\n", 1730 | "from eli5.lime import TextExplainer\n", 1731 | "\n", 1732 | "doc = twenty_test.data[0]\n", 1733 | "\n", 1734 | "te = TextExplainer(random_state=42)\n", 1735 | "te.fit(doc, pipe.predict_proba)\n", 1736 | "te.show_prediction(target_names=twenty_train.target_names)" 1737 | ] 1738 | }, 1739 | { 1740 | "cell_type": "markdown", 1741 | "metadata": {}, 1742 | "source": [ 1743 | "### Why it works?" 1744 | ] 1745 | }, 1746 | { 1747 | "cell_type": "markdown", 1748 | "metadata": {}, 1749 | "source": [ 1750 | "Explanation makes sense - we expect reasonable classifier to **take highlighted words in account**. But how can we be sure this is **how the pipeline works**, not just a nice-looking lie? \n", 1751 | "\n", 1752 | "A simple **sanity check** is to **remove or change the highlighted words**, to confirm that **they change the outcome**" 1753 | ] 1754 | }, 1755 | { 1756 | "cell_type": "code", 1757 | "execution_count": null, 1758 | "metadata": {}, 1759 | "outputs": [], 1760 | "source": [ 1761 | "import re\n", 1762 | "doc2 = re.sub(r'(recall|kidney|stones|medication|pain|tech)', '', doc, flags=re.I)\n", 1763 | "print_prediction(doc2)" 1764 | ] 1765 | }, 1766 | { 1767 | "cell_type": "markdown", 1768 | "metadata": {}, 1769 | "source": [ 1770 | "**Predicted probabilities changed a lot indeed.**\n", 1771 | "\n", 1772 | "And in fact, TextExplainer did something similar to get the explanation. TextExplainer generated a lot of texts similar to the document (by removing some of the words), and then trained a white-box classifier which predicts the output of the black-box classifier (not the true labels!). 
The explanation we saw is for this white-box classifier.\n", 1773 | "\n", 1774 | "This approach follows the LIME algorithm; for text data the algorithm is actually pretty straightforward:\n", 1775 | "\n", 1776 | "- generate distorted versions of the text;\n", 1777 | "- predict probabilities for these distorted texts using the black-box classifier;\n", 1778 | "- train another classifier (one of those eli5 supports) which tries to predict output of a black-box classifier on these texts.\n", 1779 | "\n", 1780 | "The algorithm works because even though it could be hard or impossible to approximate a black-box classifier globally (for every possible text), approximating it in a small neighbourhood near a given text often works well, even with simple white-box classifiers.\n", 1781 | "\n", 1782 | "Generated samples (distorted texts) are available in samples_ attribute:" 1783 | ] 1784 | }, 1785 | { 1786 | "cell_type": "code", 1787 | "execution_count": null, 1788 | "metadata": {}, 1789 | "outputs": [], 1790 | "source": [ 1791 | "print(te.samples_[0])" 1792 | ] 1793 | }, 1794 | { 1795 | "cell_type": "code", 1796 | "execution_count": null, 1797 | "metadata": {}, 1798 | "outputs": [], 1799 | "source": [ 1800 | "# By default TextExplainer generates 5000 distorted texts (use n_samples argument to change the amount):\n", 1801 | "len(te.samples_)" 1802 | ] 1803 | }, 1804 | { 1805 | "cell_type": "markdown", 1806 | "metadata": {}, 1807 | "source": [ 1808 | "### Customizing TextExplainer: classifier" 1809 | ] 1810 | }, 1811 | { 1812 | "cell_type": "code", 1813 | "execution_count": null, 1814 | "metadata": {}, 1815 | "outputs": [], 1816 | "source": [ 1817 | "from sklearn.tree import DecisionTreeClassifier\n", 1818 | "dtree=DecisionTreeClassifier()\n", 1819 | "dtree.fit(te5.show_weights())" 1820 | ] 1821 | }, 1822 | { 1823 | "cell_type": "code", 1824 | "execution_count": null, 1825 | "metadata": {}, 1826 | "outputs": [], 1827 | "source": [ 1828 | "explain_prediction_tree_classifier" 1829 | ] 1830 | }, 1831 | { 1832 | "cell_type": "code", 1833 | "execution_count": null, 1834 | "metadata": {}, 1835 | "outputs": [], 1836 | "source": [ 1837 | "from sklearn.tree import DecisionTreeClassifier\n", 1838 | "\n", 1839 | "te5 = TextExplainer(clf=DecisionTreeClassifier(max_depth=2), random_state=0)\n", 1840 | "te5.fit(doc, pipe.predict_proba)\n", 1841 | "print(te5.metrics_)\n", 1842 | "te5.show_weights()" 1843 | ] 1844 | }, 1845 | { 1846 | "cell_type": "markdown", 1847 | "metadata": {}, 1848 | "source": [ 1849 | "So according to this tree if **“kidney” is not in the document** and **“pain” is not in the document** then the **probability of a document** belonging to **sci.med** drops to **0.65**. If at least one of these words remain sci.med probability stays** 0.9+.**" 1850 | ] 1851 | }, 1852 | { 1853 | "cell_type": "markdown", 1854 | "metadata": {}, 1855 | "source": [ 1856 | "# 3 ways to interpretate NLP model" 1857 | ] 1858 | }, 1859 | { 1860 | "cell_type": "markdown", 1861 | "metadata": {}, 1862 | "source": [ 1863 | "https://github.com/makcedward/nlp/blob/master/sample/nlp-model_interpretation.ipynb\n", 1864 | "\n", 1865 | "**Goal**: want to know why we predict it wrongly\n", 1866 | "\n", 1867 | "**1 . Interpretability**\n", 1868 | "- **Intrinsic**: We do not need to train another model to explain the target. For example, it is using decision tree or linear model\n", 1869 | "- **Post hoc**: The model belongs to black-box model which we need to use another model to interpret it. \n", 1870 | "\n", 1871 | "**2. 
Approach**\n", 1872 | "- **Model-specific**: Some tools are limited to specific model such as liner model and neural network model.\n", 1873 | "- **Model-agnostic**: On the other hand, some tools able to explain any model by building write-box model. \n", 1874 | "\n", 1875 | "** 3. Level**\n", 1876 | "- **Global**: Explain the overall model such as feature weight. This one give you a in general model behavior\n", 1877 | "- **Local**: Explain the specific prediction result." 1878 | ] 1879 | }, 1880 | { 1881 | "cell_type": "code", 1882 | "execution_count": null, 1883 | "metadata": {}, 1884 | "outputs": [], 1885 | "source": [ 1886 | "import random\n", 1887 | "import pandas as pd\n", 1888 | "import IPython\n", 1889 | "import xgboost\n", 1890 | "\n", 1891 | "import eli5\n", 1892 | "from eli5.lime import TextExplainer\n", 1893 | "from lime.lime_text import LimeTextExplainer\n", 1894 | "print('ELI5 Version:', eli5.__version__)\n", 1895 | "print('XGBoost Version:', xgboost.__version__)" 1896 | ] 1897 | }, 1898 | { 1899 | "cell_type": "code", 1900 | "execution_count": null, 1901 | "metadata": {}, 1902 | "outputs": [], 1903 | "source": [ 1904 | "from sklearn.datasets import fetch_20newsgroups\n", 1905 | "train_raw_df = fetch_20newsgroups(subset='train')\n", 1906 | "test_raw_df = fetch_20newsgroups(subset='test')" 1907 | ] 1908 | }, 1909 | { 1910 | "cell_type": "code", 1911 | "execution_count": null, 1912 | "metadata": {}, 1913 | "outputs": [], 1914 | "source": [ 1915 | "x_train = train_raw_df.data\n", 1916 | "y_train = train_raw_df.target\n", 1917 | "\n", 1918 | "x_test = test_raw_df.data\n", 1919 | "y_test = test_raw_df.target" 1920 | ] 1921 | }, 1922 | { 1923 | "cell_type": "code", 1924 | "execution_count": null, 1925 | "metadata": {}, 1926 | "outputs": [], 1927 | "source": [ 1928 | "x_train" 1929 | ] 1930 | }, 1931 | { 1932 | "cell_type": "code", 1933 | "execution_count": null, 1934 | "metadata": {}, 1935 | "outputs": [], 1936 | "source": [ 1937 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 1938 | "from sklearn.linear_model import LogisticRegression\n", 1939 | "from sklearn.ensemble import RandomForestClassifier\n", 1940 | "from sklearn.pipeline import make_pipeline\n", 1941 | "from xgboost import XGBClassifier" 1942 | ] 1943 | }, 1944 | { 1945 | "cell_type": "code", 1946 | "execution_count": null, 1947 | "metadata": {}, 1948 | "outputs": [], 1949 | "source": [ 1950 | "names = ['Logistic Regression', 'Random Forest', 'XGBoost Classifier']" 1951 | ] 1952 | }, 1953 | { 1954 | "cell_type": "code", 1955 | "execution_count": null, 1956 | "metadata": {}, 1957 | "outputs": [], 1958 | "source": [ 1959 | "def build_model(names, x, y):\n", 1960 | " pipelines = []\n", 1961 | " vec = TfidfVectorizer()\n", 1962 | " vec.fit(x)\n", 1963 | "\n", 1964 | " for name in names:\n", 1965 | " print('train %s' % name)\n", 1966 | " \n", 1967 | " if name == 'Logistic Regression':\n", 1968 | " estimator = LogisticRegression(solver='newton-cg', n_jobs=-1)\n", 1969 | " pipeline = make_pipeline(vec, estimator)\n", 1970 | " elif name == 'Random Forest':\n", 1971 | " estimator = RandomForestClassifier(n_jobs=-1)\n", 1972 | " pipeline = make_pipeline(vec, estimator)\n", 1973 | " elif name == 'XGBoost Classifier':\n", 1974 | " estimator = XGBClassifier()\n", 1975 | " pipeline = make_pipeline(vec, estimator)\n", 1976 | " \n", 1977 | " pipeline.fit(x, y)\n", 1978 | " pipelines.append({\n", 1979 | " 'name': name,\n", 1980 | " 'pipeline': pipeline\n", 1981 | " })\n", 1982 | " \n", 1983 | " return pipelines, vec" 
1984 | ] 1985 | }, 1986 | { 1987 | "cell_type": "code", 1988 | "execution_count": null, 1989 | "metadata": {}, 1990 | "outputs": [], 1991 | "source": [ 1992 | "pipelines, vec = build_model(names, x_train, y_train)" 1993 | ] 1994 | }, 1995 | { 1996 | "cell_type": "markdown", 1997 | "metadata": {}, 1998 | "source": [ 1999 | "### 1. ELI5" 2000 | ] 2001 | }, 2002 | { 2003 | "cell_type": "markdown", 2004 | "metadata": {}, 2005 | "source": [ 2006 | "#### A. - ELI5 - Global Interpretation" 2007 | ] 2008 | }, 2009 | { 2010 | "cell_type": "code", 2011 | "execution_count": null, 2012 | "metadata": {}, 2013 | "outputs": [], 2014 | "source": [ 2015 | "for pipeline in pipelines:\n", 2016 | " print('Estimator: %s' % (pipeline['name']))\n", 2017 | " labels = pipeline['pipeline'].classes_.tolist()\n", 2018 | " \n", 2019 | " if pipeline['name'] in ['Logistic Regression', 'Random Forest']:\n", 2020 | " estimator = pipeline['pipeline']\n", 2021 | " elif pipeline['name'] == 'XGBoost Classifier':\n", 2022 | " estimator = pipeline['pipeline'].steps[1][1].get_booster()\n", 2023 | "# Not support Keras\n", 2024 | "# elif pipeline['name'] == 'keras':\n", 2025 | "# estimator = pipeline['pipeline']\n", 2026 | " else:\n", 2027 | " continue\n", 2028 | " \n", 2029 | " IPython.display.display(\n", 2030 | " eli5.show_weights(estimator=estimator, top=10, target_names=labels, vec=vec))" 2031 | ] 2032 | }, 2033 | { 2034 | "cell_type": "markdown", 2035 | "metadata": {}, 2036 | "source": [ 2037 | "#### B. - ELI5 - Local Interpretation" 2038 | ] 2039 | }, 2040 | { 2041 | "cell_type": "code", 2042 | "execution_count": null, 2043 | "metadata": {}, 2044 | "outputs": [], 2045 | "source": [ 2046 | "number_of_sample = 1\n", 2047 | "sample_ids = [random.randint(0, len(x_test) -1 ) for p in range(0, number_of_sample)]\n", 2048 | "\n", 2049 | "for idx in sample_ids:\n", 2050 | " print('Index: %d' % (idx))\n", 2051 | "# print('Index: %d, Feature: %s' % (idx, x_test[idx]))\n", 2052 | " for pipeline in pipelines:\n", 2053 | " print('-' * 50)\n", 2054 | " print('Estimator: %s' % (pipeline['name']))\n", 2055 | " \n", 2056 | " print('True Label: %s, Predicted Label: %s' % (y_test[idx], pipeline['pipeline'].predict([x_test[idx]])[0]))\n", 2057 | " labels = pipeline['pipeline'].classes_.tolist()\n", 2058 | " \n", 2059 | " if pipeline['name'] in ['Logistic Regression', 'Random Forest']:\n", 2060 | " estimator = pipeline['pipeline'].steps[1][1]\n", 2061 | " elif pipeline['name'] == 'XGBoost Classifier':\n", 2062 | " estimator = pipeline['pipeline'].steps[1][1].get_booster()\n", 2063 | " # Not support Keras\n", 2064 | "# elif pipeline['name'] == 'Keras':\n", 2065 | "# estimator = pipeline['pipeline'].model\n", 2066 | " else:\n", 2067 | " continue\n", 2068 | "\n", 2069 | " IPython.display.display(\n", 2070 | " eli5.show_prediction(estimator, x_test[idx], top=10, vec=vec, target_names=labels))" 2071 | ] 2072 | }, 2073 | { 2074 | "cell_type": "markdown", 2075 | "metadata": {}, 2076 | "source": [ 2077 | "### 2. 
LIME [2 independent examples]" 2078 | ] 2079 | }, 2080 | { 2081 | "cell_type": "markdown", 2082 | "metadata": {}, 2083 | "source": [ 2084 | "## 1st example" 2085 | ] 2086 | }, 2087 | { 2088 | "cell_type": "markdown", 2089 | "metadata": {}, 2090 | "source": [ 2091 | "https://www.kaggle.com/emanceau/interpreting-machine-learning-lime-explainer/notebook" 2092 | ] 2093 | }, 2094 | { 2095 | "cell_type": "markdown", 2096 | "metadata": {}, 2097 | "source": [ 2098 | "Dataset contains text from works of fiction written by spooky authors of the public domain:\n", 2099 | "- Edgar Allan Poe (EAP)\n", 2100 | "- HP Lovecraft (HPL)\n", 2101 | "- Mary Wollstonecraft Shelley (MWS)\n", 2102 | "\n", 2103 | "The objective is to **accurately identify the author of the sentences in the test set**\n", 2104 | "\n", 2105 | "**Lime explainer mission** is to help human to **understand decisions made by machine learning**. Basically, lime explainer create **a local linear model** around the prediction and try to **explain factor influence**." 2106 | ] 2107 | }, 2108 | { 2109 | "cell_type": "code", 2110 | "execution_count": 2, 2111 | "metadata": {}, 2112 | "outputs": [], 2113 | "source": [ 2114 | "import numpy as np\n", 2115 | "import pandas as pd\n", 2116 | "\n", 2117 | "import matplotlib.pyplot as plt\n", 2118 | "\n", 2119 | "from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\n", 2120 | "from sklearn.model_selection import train_test_split\n", 2121 | "from sklearn.metrics import confusion_matrix\n", 2122 | "from sklearn import ensemble, metrics, model_selection, naive_bayes\n", 2123 | "from sklearn.pipeline import make_pipeline\n", 2124 | "\n", 2125 | "from lime import lime_text\n", 2126 | "from lime.lime_text import LimeTextExplainer\n", 2127 | "import itertools \n", 2128 | "%matplotlib inline\n", 2129 | "import warnings\n", 2130 | "warnings.simplefilter('ignore')" 2131 | ] 2132 | }, 2133 | { 2134 | "cell_type": "code", 2135 | "execution_count": 3, 2136 | "metadata": {}, 2137 | "outputs": [], 2138 | "source": [ 2139 | "train_df = pd.read_csv(\"train.csv\")\n", 2140 | "test_df = pd.read_csv(\"test.csv\")" 2141 | ] 2142 | }, 2143 | { 2144 | "cell_type": "code", 2145 | "execution_count": 4, 2146 | "metadata": {}, 2147 | "outputs": [ 2148 | { 2149 | "data": { 2150 | "text/html": [ 2151 | "
\n", 2152 | "\n", 2165 | "\n", 2166 | " \n", 2167 | " \n", 2168 | " \n", 2169 | " \n", 2170 | " \n", 2171 | " \n", 2172 | " \n", 2173 | " \n", 2174 | " \n", 2175 | " \n", 2176 | " \n", 2177 | " \n", 2178 | " \n", 2179 | " \n", 2180 | " \n", 2181 | " \n", 2182 | " \n", 2183 | " \n", 2184 | " \n", 2185 | " \n", 2186 | " \n", 2187 | " \n", 2188 | " \n", 2189 | " \n", 2190 | " \n", 2191 | " \n", 2192 | " \n", 2193 | " \n", 2194 | " \n", 2195 | " \n", 2196 | " \n", 2197 | " \n", 2198 | " \n", 2199 | " \n", 2200 | " \n", 2201 | " \n", 2202 | " \n", 2203 | " \n", 2204 | " \n", 2205 | " \n", 2206 | "
idtextauthor
0id26305This process, however, afforded me no means of...EAP
1id17569It never once occurred to me that the fumbling...HPL
2id11008In his left hand was a gold snuff box, from wh...EAP
3id27763How lovely is spring As we looked from Windsor...MWS
4id12958Finding nothing else, not even gold, the Super...HPL
\n", 2207 | "
" 2208 | ], 2209 | "text/plain": [ 2210 | " id text author\n", 2211 | "0 id26305 This process, however, afforded me no means of... EAP\n", 2212 | "1 id17569 It never once occurred to me that the fumbling... HPL\n", 2213 | "2 id11008 In his left hand was a gold snuff box, from wh... EAP\n", 2214 | "3 id27763 How lovely is spring As we looked from Windsor... MWS\n", 2215 | "4 id12958 Finding nothing else, not even gold, the Super... HPL" 2216 | ] 2217 | }, 2218 | "execution_count": 4, 2219 | "metadata": {}, 2220 | "output_type": "execute_result" 2221 | } 2222 | ], 2223 | "source": [ 2224 | "train_df.head()" 2225 | ] 2226 | }, 2227 | { 2228 | "cell_type": "code", 2229 | "execution_count": 5, 2230 | "metadata": {}, 2231 | "outputs": [ 2232 | { 2233 | "data": { 2234 | "text/html": [ 2235 | "
\n", 2236 | "\n", 2249 | "\n", 2250 | " \n", 2251 | " \n", 2252 | " \n", 2253 | " \n", 2254 | " \n", 2255 | " \n", 2256 | " \n", 2257 | " \n", 2258 | " \n", 2259 | " \n", 2260 | " \n", 2261 | " \n", 2262 | " \n", 2263 | " \n", 2264 | " \n", 2265 | " \n", 2266 | " \n", 2267 | " \n", 2268 | " \n", 2269 | " \n", 2270 | " \n", 2271 | " \n", 2272 | " \n", 2273 | " \n", 2274 | " \n", 2275 | " \n", 2276 | " \n", 2277 | " \n", 2278 | " \n", 2279 | " \n", 2280 | " \n", 2281 | " \n", 2282 | " \n", 2283 | " \n", 2284 | "
idtext
0id02310Still, as I urged our leaving Ireland with suc...
1id24541If a fire wanted fanning, it could readily be ...
2id00134And when they had broken down the frail door t...
3id27757While I was thinking how I should possibly man...
4id04081I am not sure to what limit his knowledge may ...
\n", 2285 | "
" 2286 | ], 2287 | "text/plain": [ 2288 | " id text\n", 2289 | "0 id02310 Still, as I urged our leaving Ireland with suc...\n", 2290 | "1 id24541 If a fire wanted fanning, it could readily be ...\n", 2291 | "2 id00134 And when they had broken down the frail door t...\n", 2292 | "3 id27757 While I was thinking how I should possibly man...\n", 2293 | "4 id04081 I am not sure to what limit his knowledge may ..." 2294 | ] 2295 | }, 2296 | "execution_count": 5, 2297 | "metadata": {}, 2298 | "output_type": "execute_result" 2299 | } 2300 | ], 2301 | "source": [ 2302 | "test_df.head()" 2303 | ] 2304 | }, 2305 | { 2306 | "cell_type": "markdown", 2307 | "metadata": {}, 2308 | "source": [ 2309 | "#### Explainer with basic model" 2310 | ] 2311 | }, 2312 | { 2313 | "cell_type": "code", 2314 | "execution_count": 6, 2315 | "metadata": {}, 2316 | "outputs": [ 2317 | { 2318 | "data": { 2319 | "text/plain": [ 2320 | "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)" 2321 | ] 2322 | }, 2323 | "execution_count": 6, 2324 | "metadata": {}, 2325 | "output_type": "execute_result" 2326 | } 2327 | ], 2328 | "source": [ 2329 | "class_names = ['EAP', 'HPL', 'MWS']\n", 2330 | "cols_to_drop = ['id', 'text']\n", 2331 | "train_X = train_df.drop(cols_to_drop+['author'], axis=1)\n", 2332 | "\n", 2333 | "## Prepare the data for modeling ###\n", 2334 | "author_mapping_dict = {'EAP':0, 'HPL':1, 'MWS':2}\n", 2335 | "train_y = train_df['author'].map(author_mapping_dict)\n", 2336 | "train_id = train_df['id'].values\n", 2337 | "\n", 2338 | "tfidf_vec = TfidfVectorizer(ngram_range=(1,5), analyzer='char')\n", 2339 | "full_tfidf = tfidf_vec.fit_transform(train_df['text'].values.tolist() + test_df['text'].values.tolist())\n", 2340 | "train_tfidf = tfidf_vec.transform(train_df['text'].values.tolist())\n", 2341 | "\n", 2342 | "X_train, X_test, y_train, y_test = train_test_split(train_tfidf, train_y, test_size=0.33, random_state=14)\n", 2343 | "model_tf = naive_bayes.MultinomialNB()\n", 2344 | "model_tf.fit(X_train, y_train)" 2345 | ] 2346 | }, 2347 | { 2348 | "cell_type": "code", 2349 | "execution_count": null, 2350 | "metadata": {}, 2351 | "outputs": [], 2352 | "source": [ 2353 | "print(X_train)" 2354 | ] 2355 | }, 2356 | { 2357 | "cell_type": "code", 2358 | "execution_count": null, 2359 | "metadata": {}, 2360 | "outputs": [], 2361 | "source": [ 2362 | "def plot_confusion_matrix(cm, classes,\n", 2363 | " normalize=False,\n", 2364 | " title='Confusion matrix',\n", 2365 | " cmap=plt.cm.Blues):\n", 2366 | " \"\"\"\n", 2367 | " This function prints and plots the confusion matrix.\n", 2368 | " Normalization can be applied by setting `normalize=True`.\n", 2369 | " \"\"\"\n", 2370 | " if normalize:\n", 2371 | " cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n", 2372 | " print(\"Normalized confusion matrix\")\n", 2373 | " else:\n", 2374 | " print('Confusion matrix, without normalization')\n", 2375 | "\n", 2376 | " print(cm)\n", 2377 | "\n", 2378 | " plt.imshow(cm, interpolation='nearest', cmap=cmap)\n", 2379 | " plt.title(title)\n", 2380 | " plt.colorbar()\n", 2381 | " tick_marks = np.arange(len(classes))\n", 2382 | " plt.xticks(tick_marks, classes, rotation=45)\n", 2383 | " plt.yticks(tick_marks, classes)\n", 2384 | "\n", 2385 | " fmt = '.2f' if normalize else 'd'\n", 2386 | " thresh = cm.max() / 2.\n", 2387 | " for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n", 2388 | " plt.text(j, i, format(cm[i, j], fmt),\n", 2389 | " horizontalalignment=\"center\",\n", 2390 | " color=\"white\" if cm[i, j] > 
thresh else \"black\")\n", 2391 | "\n", 2392 | " plt.tight_layout()\n", 2393 | " plt.ylabel('True label')\n", 2394 | " plt.xlabel('Predicted label')" 2395 | ] 2396 | }, 2397 | { 2398 | "cell_type": "code", 2399 | "execution_count": null, 2400 | "metadata": {}, 2401 | "outputs": [], 2402 | "source": [ 2403 | "y_pred = model_tf.predict(X_test)\n", 2404 | "\n", 2405 | "# Compute confusion matrix\n", 2406 | "cnf_matrix = confusion_matrix(y_test, y_pred)\n", 2407 | "np.set_printoptions(precision=2)\n", 2408 | "\n", 2409 | "# Plot non-normalized confusion matrix\n", 2410 | "plt.figure()\n", 2411 | "plot_confusion_matrix(cnf_matrix, classes=class_names,\n", 2412 | " title='Confusion matrix, without normalization')\n", 2413 | "plt.show()" 2414 | ] 2415 | }, 2416 | { 2417 | "cell_type": "code", 2418 | "execution_count": null, 2419 | "metadata": {}, 2420 | "outputs": [], 2421 | "source": [ 2422 | "import re\n", 2423 | "c_tf = make_pipeline(tfidf_vec, model_tf)\n", 2424 | "\n", 2425 | "split_expression = lambda s: re.split(r'\\W+', s)\n", 2426 | "explainer = LimeTextExplainer(class_names=class_names, split_expression=split_expression)" 2427 | ] 2428 | }, 2429 | { 2430 | "cell_type": "code", 2431 | "execution_count": null, 2432 | "metadata": {}, 2433 | "outputs": [], 2434 | "source": [ 2435 | "comp = y_test.to_frame()\n", 2436 | "comp['idx'] = comp.index.values\n", 2437 | "comp['pred'] = y_pred\n", 2438 | "comp.rename(columns={'author': 'real'}, inplace=True)" 2439 | ] 2440 | }, 2441 | { 2442 | "cell_type": "markdown", 2443 | "metadata": {}, 2444 | "source": [ 2445 | "### Explaining errors" 2446 | ] 2447 | }, 2448 | { 2449 | "cell_type": "markdown", 2450 | "metadata": {}, 2451 | "source": [ 2452 | "#### A --- True POE but classified in HPL" 2453 | ] 2454 | }, 2455 | { 2456 | "cell_type": "code", 2457 | "execution_count": null, 2458 | "metadata": {}, 2459 | "outputs": [], 2460 | "source": [ 2461 | "wrong_poe_hpl = comp[(comp.real ==0) & (comp.pred ==1)]\n", 2462 | "wrong_poe_hpl.shape\n", 2463 | "print(wrong_poe_hpl.idx)\n", 2464 | "idx = wrong_poe_hpl.idx.iloc[1]\n", 2465 | "\n", 2466 | "print('We see that we got', len(wrong_poe_hpl.idx), 'as shown by the confusion matrix above')" 2467 | ] 2468 | }, 2469 | { 2470 | "cell_type": "code", 2471 | "execution_count": null, 2472 | "metadata": {}, 2473 | "outputs": [], 2474 | "source": [ 2475 | "c_tf.predict_proba" 2476 | ] 2477 | }, 2478 | { 2479 | "cell_type": "code", 2480 | "execution_count": null, 2481 | "metadata": {}, 2482 | "outputs": [], 2483 | "source": [ 2484 | "tokenizer = lambda doc: re.compile(r\"(?u)\\b\\w\\w+\\b\").findall(doc)\n", 2485 | "explainer = LimeTextExplainer(class_names=class_names, split_expression=tokenizer)\n", 2486 | "exp = explainer.explain_instance(train_df['text'][idx], c_tf.predict_proba, num_features=6)" 2487 | ] 2488 | }, 2489 | { 2490 | "cell_type": "markdown", 2491 | "metadata": {}, 2492 | "source": [ 2493 | "This error is created by the use of ancient greek words. Possible to improve the model ?" 2494 | ] 2495 | }, 2496 | { 2497 | "cell_type": "code", 2498 | "execution_count": null, 2499 | "metadata": {}, 2500 | "outputs": [], 2501 | "source": [ 2502 | "idx = wrong_poe_hpl.idx.iloc[3]\n", 2503 | "exp = explainer_tf.explain_instance(train_df['text'][idx], c_tf.predict_proba, num_features=4, top_labels=2)\n", 2504 | "exp.show_in_notebook(text=train_df['text'][idx], labels=(0,1))" 2505 | ] 2506 | }, 2507 | { 2508 | "cell_type": "markdown", 2509 | "metadata": {}, 2510 | "source": [ 2511 | "OK, very difficult case. 
Only three words > Not enough to properly classify. No improvement possible." 2512 | ] 2513 | }, 2514 | { 2515 | "cell_type": "markdown", 2516 | "metadata": {}, 2517 | "source": [ 2518 | "#### B. --- True POE but classified in MWS" 2519 | ] 2520 | }, 2521 | { 2522 | "cell_type": "code", 2523 | "execution_count": null, 2524 | "metadata": {}, 2525 | "outputs": [], 2526 | "source": [ 2527 | "wrong_poe_mws = comp[(comp.real ==0) & (comp.pred ==2)]\n", 2528 | "print(wrong_poe_mws.shape)\n", 2529 | "idx = wrong_poe_mws.idx.iloc[12]" 2530 | ] 2531 | }, 2532 | { 2533 | "cell_type": "code", 2534 | "execution_count": null, 2535 | "metadata": {}, 2536 | "outputs": [], 2537 | "source": [ 2538 | "exp = explainer_tf.explain_instance(train_df['text'][idx], c_tf.predict_proba, num_features=4, top_labels=3)\n", 2539 | "exp.show_in_notebook(text=train_df['text'][idx], labels=(0,1))" 2540 | ] 2541 | }, 2542 | { 2543 | "cell_type": "markdown", 2544 | "metadata": {}, 2545 | "source": [ 2546 | "OK, this text contains anaphora, possible to improve the model with anaphora feature." 2547 | ] 2548 | }, 2549 | { 2550 | "cell_type": "code", 2551 | "execution_count": null, 2552 | "metadata": {}, 2553 | "outputs": [], 2554 | "source": [ 2555 | "idx = wrong_poe_mws.idx.iloc[18]\n", 2556 | "exp = explainer_tf.explain_instance(train_df['text'][idx], c_tf.predict_proba, num_features=4, top_labels=3)\n", 2557 | "exp.show_in_notebook(text=train_df['text'][idx], labels=(0,1,2))" 2558 | ] 2559 | }, 2560 | { 2561 | "cell_type": "markdown", 2562 | "metadata": {}, 2563 | "source": [ 2564 | "OK, probabilities (EAP and MWS) are very close. Possible to improve the model." 2565 | ] 2566 | }, 2567 | { 2568 | "cell_type": "markdown", 2569 | "metadata": {}, 2570 | "source": [ 2571 | "#### C. --- True MWS but classified in HPL" 2572 | ] 2573 | }, 2574 | { 2575 | "cell_type": "code", 2576 | "execution_count": null, 2577 | "metadata": {}, 2578 | "outputs": [], 2579 | "source": [ 2580 | "wrong_mws_hpl = comp[(comp.real ==2) & (comp.pred ==1)]\n", 2581 | "print(wrong_mws_hpl.shape)\n", 2582 | "idx = wrong_mws_hpl.idx.iloc[8]" 2583 | ] 2584 | }, 2585 | { 2586 | "cell_type": "code", 2587 | "execution_count": null, 2588 | "metadata": {}, 2589 | "outputs": [], 2590 | "source": [ 2591 | "exp = explainer_tf.explain_instance(train_df['text'][idx], c_tf.predict_proba, num_features=4, top_labels=3)\n", 2592 | "exp.show_in_notebook(text=train_df['text'][idx], labels=(0,1,2))" 2593 | ] 2594 | }, 2595 | { 2596 | "cell_type": "markdown", 2597 | "metadata": {}, 2598 | "source": [ 2599 | "OK, probabilities (HPL and MWS) are very close. Possible to improve the model." 2600 | ] 2601 | }, 2602 | { 2603 | "cell_type": "code", 2604 | "execution_count": null, 2605 | "metadata": {}, 2606 | "outputs": [], 2607 | "source": [ 2608 | "idx = wrong_mws_hpl.idx.iloc[5]\n", 2609 | "exp = explainer_tf.explain_instance(train_df['text'][idx], c_tf.predict_proba, num_features=4, top_labels=3)\n", 2610 | "exp.show_in_notebook(text=train_df['text'][idx], labels=(0,1,2))" 2611 | ] 2612 | }, 2613 | { 2614 | "cell_type": "markdown", 2615 | "metadata": {}, 2616 | "source": [ 2617 | "OK, probabilities (EAP, HPL, MWS ) are all very close. 
Possible to improve the model (using repetition pattern ?)" 2618 | ] 2619 | }, 2620 | { 2621 | "cell_type": "markdown", 2622 | "metadata": {}, 2623 | "source": [ 2624 | "## 2nd example" 2625 | ] 2626 | }, 2627 | { 2628 | "cell_type": "markdown", 2629 | "metadata": {}, 2630 | "source": [ 2631 | "https://marcotcr.github.io/lime/tutorials/Lime%20-%20basic%20usage%2C%20two%20class%20case.html" 2632 | ] 2633 | }, 2634 | { 2635 | "cell_type": "markdown", 2636 | "metadata": {}, 2637 | "source": [ 2638 | "### 1st step : Fetching data, training a classifier" 2639 | ] 2640 | }, 2641 | { 2642 | "cell_type": "markdown", 2643 | "metadata": {}, 2644 | "source": [ 2645 | "For simplicity, we'll use a **2-class subset**: atheism and christianity" 2646 | ] 2647 | }, 2648 | { 2649 | "cell_type": "code", 2650 | "execution_count": null, 2651 | "metadata": {}, 2652 | "outputs": [], 2653 | "source": [ 2654 | "import lime\n", 2655 | "import sklearn\n", 2656 | "import numpy as np\n", 2657 | "import sklearn\n", 2658 | "import sklearn.ensemble\n", 2659 | "import sklearn.metrics\n", 2660 | "from __future__ import print_function" 2661 | ] 2662 | }, 2663 | { 2664 | "cell_type": "code", 2665 | "execution_count": null, 2666 | "metadata": {}, 2667 | "outputs": [], 2668 | "source": [ 2669 | "from sklearn.datasets import fetch_20newsgroups\n", 2670 | "categories = ['alt.atheism', 'soc.religion.christian']\n", 2671 | "newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)\n", 2672 | "newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)\n", 2673 | "class_names = ['atheism', 'christian']" 2674 | ] 2675 | }, 2676 | { 2677 | "cell_type": "markdown", 2678 | "metadata": {}, 2679 | "source": [ 2680 | "Let's use the **tfidf vectorizer**, commonly used for text." 2681 | ] 2682 | }, 2683 | { 2684 | "cell_type": "code", 2685 | "execution_count": null, 2686 | "metadata": {}, 2687 | "outputs": [], 2688 | "source": [ 2689 | "vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(lowercase=False)\n", 2690 | "train_vectors = vectorizer.fit_transform(newsgroups_train.data)\n", 2691 | "test_vectors = vectorizer.transform(newsgroups_test.data)" 2692 | ] 2693 | }, 2694 | { 2695 | "cell_type": "markdown", 2696 | "metadata": {}, 2697 | "source": [ 2698 | "Now, let's say we want to use **random forests for classification**. It's usually hard to understand what random forests are doing, especially with many trees." 
2699 | ] 2700 | }, 2701 | { 2702 | "cell_type": "code", 2703 | "execution_count": null, 2704 | "metadata": {}, 2705 | "outputs": [], 2706 | "source": [ 2707 | "rf = sklearn.ensemble.RandomForestClassifier(n_estimators=500)\n", 2708 | "rf.fit(train_vectors, newsgroups_train.target)" 2709 | ] 2710 | }, 2711 | { 2712 | "cell_type": "code", 2713 | "execution_count": null, 2714 | "metadata": {}, 2715 | "outputs": [], 2716 | "source": [ 2717 | "pred = rf.predict(test_vectors)\n", 2718 | "sklearn.metrics.f1_score(newsgroups_test.target, pred, average='binary')" 2719 | ] 2720 | }, 2721 | { 2722 | "cell_type": "markdown", 2723 | "metadata": {}, 2724 | "source": [ 2725 | "We see that this classifier achieves a very high F score" 2726 | ] 2727 | }, 2728 | { 2729 | "cell_type": "markdown", 2730 | "metadata": {}, 2731 | "source": [ 2732 | "### 2nd step : Explaining predictions using lime" 2733 | ] 2734 | }, 2735 | { 2736 | "cell_type": "markdown", 2737 | "metadata": {}, 2738 | "source": [ 2739 | "Lime explainers assume that **classifiers act on raw text**, but **sklearn classifiers** act on **vectorized representation of texts**. For this purpose, we use sklearn's pipeline, and implements predict_proba on raw_text lists." 2740 | ] 2741 | }, 2742 | { 2743 | "cell_type": "code", 2744 | "execution_count": null, 2745 | "metadata": {}, 2746 | "outputs": [], 2747 | "source": [ 2748 | "from lime import lime_text\n", 2749 | "from sklearn.pipeline import make_pipeline\n", 2750 | "c = make_pipeline(vectorizer, rf)" 2751 | ] 2752 | }, 2753 | { 2754 | "cell_type": "code", 2755 | "execution_count": null, 2756 | "metadata": {}, 2757 | "outputs": [], 2758 | "source": [ 2759 | "print(c.predict_proba([newsgroups_test.data[0]]))" 2760 | ] 2761 | }, 2762 | { 2763 | "cell_type": "markdown", 2764 | "metadata": {}, 2765 | "source": [ 2766 | "Now we create an explainer object. We pass the class_names a an argument for prettier display." 2767 | ] 2768 | }, 2769 | { 2770 | "cell_type": "code", 2771 | "execution_count": null, 2772 | "metadata": {}, 2773 | "outputs": [], 2774 | "source": [ 2775 | "from lime.lime_text import LimeTextExplainer\n", 2776 | "import re\n", 2777 | "split_expression = lambda s: re.split(r'\\W+', s)\n", 2778 | "explainer = LimeTextExplainer(class_names=class_names, split_expression=split_expression)" 2779 | ] 2780 | }, 2781 | { 2782 | "cell_type": "markdown", 2783 | "metadata": {}, 2784 | "source": [ 2785 | "We then generate an explanation with at most 6 features for an arbitrary document in the test set." 2786 | ] 2787 | }, 2788 | { 2789 | "cell_type": "code", 2790 | "execution_count": null, 2791 | "metadata": {}, 2792 | "outputs": [], 2793 | "source": [ 2794 | "idx = 83\n", 2795 | "exp = explainer.explain_instance(newsgroups_test.data[idx], c.predict_proba, num_features=6)\n", 2796 | "print('Document id: %d' % idx)\n", 2797 | "print('Probability(christian) =', c.predict_proba([newsgroups_test.data[idx]])[0,1])\n", 2798 | "print('True class: %s' % class_names[newsgroups_test.target[idx]])" 2799 | ] 2800 | }, 2801 | { 2802 | "cell_type": "markdown", 2803 | "metadata": {}, 2804 | "source": [ 2805 | "The classifier got this example right (it predicted atheism)." 
2806 | ] 2807 | }, 2808 | { 2809 | "cell_type": "code", 2810 | "execution_count": null, 2811 | "metadata": {}, 2812 | "outputs": [], 2813 | "source": [ 2814 | "# The explanation is presented below as a list of weighted features\n", 2815 | "\n", 2816 | "exp.as_list()" 2817 | ] 2818 | }, 2819 | { 2820 | "cell_type": "markdown", 2821 | "metadata": {}, 2822 | "source": [ 2823 | "These weighted features are a linear model, which approximates the **behaviour of the random forest classifier in the vicinity of the test example**. Roughly, if we remove 'Posting' and 'Host' from the document , the prediction should move towards the opposite class (Christianity) by about 0.27 (the sum of the weights for both features). Let's see if this is the case." 2824 | ] 2825 | }, 2826 | { 2827 | "cell_type": "code", 2828 | "execution_count": null, 2829 | "metadata": {}, 2830 | "outputs": [], 2831 | "source": [ 2832 | "print('Original prediction:', rf.predict_proba(test_vectors[idx])[0,1])\n", 2833 | "tmp = test_vectors[idx].copy()\n", 2834 | "tmp[0,vectorizer.vocabulary_['Posting']] = 0\n", 2835 | "tmp[0,vectorizer.vocabulary_['Host']] = 0\n", 2836 | "print('Prediction removing some features:', rf.predict_proba(tmp)[0,1])\n", 2837 | "print('Difference:', rf.predict_proba(tmp)[0,1] - rf.predict_proba(test_vectors[idx])[0,1])" 2838 | ] 2839 | }, 2840 | { 2841 | "cell_type": "markdown", 2842 | "metadata": {}, 2843 | "source": [ 2844 | "Pretty close!\n", 2845 | "**The words that explain the model around this document seem very arbitrary** - not much to do with either Christianity or Atheism.\n", 2846 | "In fact, these are words that appear in the email headers (you will see this clearly soon), which **make distinguishing between the classes much easier.**" 2847 | ] 2848 | }, 2849 | { 2850 | "cell_type": "markdown", 2851 | "metadata": {}, 2852 | "source": [ 2853 | "### 3rd Step: Visualizing explanations" 2854 | ] 2855 | }, 2856 | { 2857 | "cell_type": "code", 2858 | "execution_count": null, 2859 | "metadata": {}, 2860 | "outputs": [], 2861 | "source": [ 2862 | "%matplotlib inline\n", 2863 | "fig = exp.as_pyplot_figure()" 2864 | ] 2865 | }, 2866 | { 2867 | "cell_type": "code", 2868 | "execution_count": null, 2869 | "metadata": {}, 2870 | "outputs": [], 2871 | "source": [ 2872 | "exp.show_in_notebook(text=False)\n", 2873 | "# exp.save_to_file('/tmp/oi.html')" 2874 | ] 2875 | }, 2876 | { 2877 | "cell_type": "code", 2878 | "execution_count": null, 2879 | "metadata": {}, 2880 | "outputs": [], 2881 | "source": [ 2882 | "# how the words that affect the classifier the most are all in the email header.\n", 2883 | "exp.show_in_notebook(text=True)" 2884 | ] 2885 | }, 2886 | { 2887 | "cell_type": "markdown", 2888 | "metadata": {}, 2889 | "source": [ 2890 | "# Clustering documents using similarity features" 2891 | ] 2892 | }, 2893 | { 2894 | "cell_type": "markdown", 2895 | "metadata": {}, 2896 | "source": [ 2897 | "https://github.com/dipanjanS/practical-machine-learning-with-python/blob/master/bonus%20content/feature%20engineering%20text%20data/Feature%20Engineering%20Text%20Data%20-%20Traditional%20Strategies.ipynb" 2898 | ] 2899 | }, 2900 | { 2901 | "cell_type": "code", 2902 | "execution_count": null, 2903 | "metadata": {}, 2904 | "outputs": [], 2905 | "source": [] 2906 | }, 2907 | { 2908 | "cell_type": "code", 2909 | "execution_count": null, 2910 | "metadata": {}, 2911 | "outputs": [], 2912 | "source": [] 2913 | }, 2914 | { 2915 | "cell_type": "code", 2916 | "execution_count": null, 2917 | "metadata": {}, 2918 | "outputs": 
[], 2919 | "source": [] 2920 | } 2921 | ], 2922 | "metadata": { 2923 | "kernelspec": { 2924 | "display_name": "Python 3", 2925 | "language": "python", 2926 | "name": "python3" 2927 | }, 2928 | "language_info": { 2929 | "codemirror_mode": { 2930 | "name": "ipython", 2931 | "version": 3 2932 | }, 2933 | "file_extension": ".py", 2934 | "mimetype": "text/x-python", 2935 | "name": "python", 2936 | "nbconvert_exporter": "python", 2937 | "pygments_lexer": "ipython3", 2938 | "version": "3.7.0" 2939 | } 2940 | }, 2941 | "nbformat": 4, 2942 | "nbformat_minor": 2 2943 | } 2944 | -------------------------------------------------------------------------------- /pictures/LDA2VEC.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adsieg/Multi_Text_Classification/5ee4cd9b10bcfee651dd3deb6f7392012dbb1896/pictures/LDA2VEC.png -------------------------------------------------------------------------------- /pictures/characters_attention.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adsieg/Multi_Text_Classification/5ee4cd9b10bcfee651dd3deb6f7392012dbb1896/pictures/characters_attention.gif -------------------------------------------------------------------------------- /pictures/explainability.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adsieg/Multi_Text_Classification/5ee4cd9b10bcfee651dd3deb6f7392012dbb1896/pictures/explainability.gif -------------------------------------------------------------------------------- /pictures/generative_LDA.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adsieg/Multi_Text_Classification/5ee4cd9b10bcfee651dd3deb6f7392012dbb1896/pictures/generative_LDA.gif -------------------------------------------------------------------------------- /pictures/pyldavis.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adsieg/Multi_Text_Classification/5ee4cd9b10bcfee651dd3deb6f7392012dbb1896/pictures/pyldavis.png -------------------------------------------------------------------------------- /pictures/tsne_lda.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adsieg/Multi_Text_Classification/5ee4cd9b10bcfee651dd3deb6f7392012dbb1896/pictures/tsne_lda.png -------------------------------------------------------------------------------- /pictures/word_correlations.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adsieg/Multi_Text_Classification/5ee4cd9b10bcfee651dd3deb6f7392012dbb1896/pictures/word_correlations.png -------------------------------------------------------------------------------- /pictures/word_frequency.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/adsieg/Multi_Text_Classification/5ee4cd9b10bcfee651dd3deb6f7392012dbb1896/pictures/word_frequency.png --------------------------------------------------------------------------------