├── LICENSE ├── .gitignore ├── README.md ├── voting_process.ipynb ├── scikitlearnNB.ipynb └── nltkNB.ipynb /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright © 2016 Tasdik Rahman 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 4 | 5 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 6 | 7 | THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 8 | 9 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # ignoring the pickle files 2 | *.pickle 3 | 4 | # Project related files 5 | testing_file.py 6 | testing_file.ipynb 7 | 8 | # Byte-compiled / optimized / DLL files 9 | __pycache__/ 10 | *.py[cod] 11 | *$py.class 12 | 13 | # C extensions 14 | *.so 15 | 16 | # Distribution / packaging 17 | .Python 18 | env/ 19 | build/ 20 | develop-eggs/ 21 | dist/ 22 | downloads/ 23 | eggs/ 24 | .eggs/ 25 | lib/ 26 | lib64/ 27 | parts/ 28 | sdist/ 29 | var/ 30 | *.egg-info/ 31 | .installed.cfg 32 | *.egg 33 | 34 | # PyInstaller 35 | # Usually these files are written by a python script from a template 36 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 37 | *.manifest 38 | *.spec 39 | 40 | # Installer logs 41 | pip-log.txt 42 | pip-delete-this-directory.txt 43 | 44 | # Unit test / coverage reports 45 | htmlcov/ 46 | .tox/ 47 | .coverage 48 | .coverage.* 49 | .cache 50 | nosetests.xml 51 | coverage.xml 52 | *,cover 53 | .hypothesis/ 54 | 55 | # Translations 56 | *.mo 57 | *.pot 58 | 59 | # Django stuff: 60 | *.log 61 | local_settings.py 62 | 63 | # Flask instance folder 64 | instance/ 65 | 66 | # Sphinx documentation 67 | docs/_build/ 68 | 69 | # PyBuilder 70 | target/ 71 | 72 | # IPython Notebook 73 | .ipynb_checkpoints 74 | 75 | # pyenv 76 | .python-version 77 | 78 | # dotenv 79 | .env 80 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Movie Review Analysis 2 | 3 | An analysis of the `movie_reviews` data set included in the `nltk` corpus: training several classifiers to label a review as positive or negative. 
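In a nutshell, the notebooks build a bag-of-words Naive Bayes sentiment classifier over that corpus. A condensed sketch of the pipeline (the notebooks do this step by step; `most_common` is used here for the feature words, which is what the notebooks' `list(all_words.keys())[:3000]` is aiming at):

```python
import random
import nltk
from nltk.corpus import movie_reviews

# (words, label) pairs for all 2000 reviews, shuffled
documents = [(list(movie_reviews.words(fid)), cat)
             for cat in movie_reviews.categories()
             for fid in movie_reviews.fileids(cat)]
random.shuffle(documents)

# presence/absence features over the most frequent words
freq = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in freq.most_common(3000)]

def find_features(document):
    words = set(document)
    return {w: (w in words) for w in word_features}

featuresets = [(find_features(words), cat) for (words, cat) in documents]
classifier = nltk.NaiveBayesClassifier.train(featuresets[:1900])
print(nltk.classify.accuracy(classifier, featuresets[1900:]))
```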
4 | 5 | *** 6 | 7 | ## Index: 8 | 9 | - [What is in this repo](https://github.com/prodicus/movieReviewsAnalysis#what-is-in-this-repo) 10 | - [Accuracy achieved](https://github.com/prodicus/movieReviewsAnalysis#accuracy-achieved) 11 | - [Requirements](https://github.com/prodicus/movieReviewsAnalysis#requirements) 12 | - [Downloading the dataset](https://github.com/prodicus/movieReviewsAnalysis#downloading-the-dataset) 13 | - [Running it](https://github.com/prodicus/movieReviewsAnalysis#running-it) 14 | - [So](https://github.com/prodicus/movieReviewsAnalysis#so) 15 | - [Legal stuff](https://github.com/prodicus/movieReviewsAnalysis#legal-stuff) 16 | 17 | *** 18 | 19 | ## What is in this repo 20 | [[Back to top]](https://github.com/prodicus/movieReviewsAnalysis#movie-review-analysis) 21 | 22 | - [x] An implementation of `nltk.NaiveBayesClassifier` trained on the **2000 movie reviews** in the corpus. Implemented in `nltkNB.ipynb` 23 | - [x] Using `sklearn` 24 | - [x] **Naive Bayes**: 25 | - [x] `MultinomialNB`: 26 | - [x] `BernoulliNB`: 27 | - [x] **Linear Model** 28 | - [x] `LogisticRegression`: 29 | - [x] `SGDClassifier`: 30 | - [x] **SVM** 31 | - [x] `SVC`: 32 | - [x] `LinearSVC`: 33 | - [x] `NuSVC`: 34 | 35 | Implemented in `scikitlearnNB.ipynb` 36 | 37 | - [x] Implemented a voting system that combines all the learning methods by majority vote. Implemented in `voting_process.ipynb` 38 | 39 | *** 40 | 41 | ## Accuracy achieved 42 | [[Back to top]](https://github.com/prodicus/movieReviewsAnalysis#movie-review-analysis) 43 | 44 | | **Classifiers** | **Accuracy achieved** | 45 | |---------------------------------|-----------------------| 46 | | `nltk.NaiveBayesClassifier` | _73.0%_ | 47 | | **ScikitLearn Implementations** | | 48 | | `BernoulliNB` | _72.0%_ | 49 | | `MultinomialNB` | _76.0%_ | 50 | | `LogisticRegression` | _74.0%_ | 51 | | `SGDClassifier` | _69.0%_ | 52 | | `SVC` | _48.0%_ | 53 | | `LinearSVC` | _74.0%_ | 54 | | `NuSVC` | _74.0%_ | 55 | 56 | *** 57 | 58 | ## Requirements 59 | [[Back to top]](https://github.com/prodicus/movieReviewsAnalysis#movie-review-analysis) 60 | 61 | The simplest (and suggested) way is to install the required packages and dependencies using either [anaconda](https://www.continuum.io/downloads) or [miniconda](http://conda.pydata.org/miniconda.html) 62 | 63 | After that you can run 64 | 65 | ```sh 66 | $ conda update conda 67 | $ conda install scikit-learn nltk 68 | ``` 69 | 70 | *** 71 | 72 | ## Downloading the dataset 73 | [[Back to top]](https://github.com/prodicus/movieReviewsAnalysis#movie-review-analysis) 74 | 75 | The dataset used in this project is bundled along with the `nltk` package. 76 | 77 | Run your Python interpreter 78 | 79 | ```python 80 | >>> import nltk 81 | >>> nltk.download('stopwords') 82 | >>> nltk.download('movie_reviews') 83 | ``` 84 | 85 | **NOTE**: You can find system-specific installation instructions on the official [`nltk` website](http://www.nltk.org/data.html) 86 | 87 | Check that everything is in place by running your interpreter again and importing these 88 | 89 | ```python 90 | >>> import nltk 91 | >>> from nltk.corpus import stopwords, movie_reviews 92 | >>> import sklearn 93 | >>> 94 | ``` 95 | 96 | If these imports work for you, then you are good to go! 97 | 
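As a final check, the corpus should contain 2000 reviews, split evenly between the two categories (these counts are a property of the stock `movie_reviews` corpus):

```python
>>> len(movie_reviews.fileids())
2000
>>> len(movie_reviews.fileids('pos')), len(movie_reviews.fileids('neg'))
(1000, 1000)
```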
98 | *** 99 | 100 | ## Running it 101 | [[Back to top]](https://github.com/prodicus/movieReviewsAnalysis#movie-review-analysis) 102 | 103 | 1. Clone the repo 104 | 105 | ```sh 106 | $ git clone https://github.com/prodicus/movieReviewsAnalysis 107 | $ cd movieReviewsAnalysis 108 | ## start the notebook server 109 | $ ipython notebook 110 | ``` 111 | 112 | 2. Order of running 113 | 1. `nltkNB.ipynb` 114 | 2. `scikitlearnNB.ipynb` 115 | 3. `voting_process.ipynb` 116 | 117 | 3. Hack away! 118 | 119 | *** 120 | 121 | ## So 122 | [[Back to top]](https://github.com/prodicus/movieReviewsAnalysis#movie-review-analysis) 123 | 124 | **"So what? Well, this is pretty basic!"** 125 | 126 | Yes, it is, but hey, we all start somewhere, right? 127 | 128 | **Psst**. I am working on a spam filtering system. You know, the one where you paste in an email and it tells you whether 129 | it is spam or not. 130 | 131 | You can follow me on twitter [@tasdikrahman](https://twitter.com/tasdikrahman) to keep tabs on it. 132 | 133 | *** 134 | 135 | ## Legal stuff 136 | [[Back to top]](https://github.com/prodicus/movieReviewsAnalysis#movie-review-analysis) 137 | 138 | Hacked together by [Tasdik Rahman](http://tasdikrahman.me) under the [MIT License](http://prodicus.mit-license.org) 139 | 140 | You can find a copy of the License at http://prodicus.mit-license.org/ 141 | 142 | -------------------------------------------------------------------------------- /voting_process.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Continuing where I left off in the `scikitlearnNB.ipynb` notebook. \n", 8 | "\n", 9 | "We will create a voting system to improve our prediction accuracy" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": { 16 | "collapsed": true 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "import nltk\n", 21 | "import random\n", 22 | "from nltk.corpus import movie_reviews\n", 23 | "import pickle\n", 24 | "\n", 25 | "from nltk.classify import ClassifierI\n", 26 | "from statistics import mode" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 7, 32 | "metadata": { 33 | "collapsed": true 34 | }, 35 | "outputs": [], 36 | "source": [ 37 | "## defining the VoteClassifier class\n", 38 | "class VoteClassifier(ClassifierI):\n", 39 | " def __init__(self, *classifiers):\n", 40 | " self._classifiers = classifiers\n", 41 | "\n", 42 | " def classify(self, features):\n", " # each wrapped classifier gets one vote; return the majority label\n", 43 | " votes = []\n", 44 | " for c in self._classifiers:\n", 45 | " v = c.classify(features)\n", 46 | " votes.append(v)\n", 47 | " return mode(votes)\n", 48 | "\n", 49 | " def confidence(self, features):\n", " # fraction of classifiers that agree with the majority label\n", 50 | " votes = []\n", 51 | " for c in self._classifiers:\n", 52 | " v = c.classify(features)\n", 53 | " votes.append(v)\n", 54 | "\n", 55 | " choice_votes = votes.count(mode(votes))\n", 56 | " conf = choice_votes / len(votes)\n", 57 | " return conf" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 3, 63 | "metadata": { 64 | "collapsed": true 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "# pickle_obj = open(\"documents.pickle\", \"wb\")\n", 69 | "documents = [(list(movie_reviews.words(fileid)), category)\n", 70 | " for category in movie_reviews.categories()\n", 71 | " for fileid in movie_reviews.fileids(category)]\n", 72 | "# pickle.dump(documents, pickle_obj)\n", 73 | "# pickle_obj.close()" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 4, 79 | "metadata": { 80 | "collapsed": true 81 | }, 82 | "outputs": [], 83 | "source": [ 84 | "# pickle_obj = 
open(\"documents.pickle\", \"rb\")\n", 85 | "# documents = pickle.load(pickle_obj)\n", 86 | "# pickle_obj.close()\n", 87 | "\n", 88 | "random.shuffle(documents)\n", 89 | "\n", 90 | "all_words = []\n", 91 | "\n", 92 | "for w in movie_reviews.words():\n", 93 | " all_words.append(w.lower())\n", 94 | "\n", 95 | "all_words = nltk.FreqDist(all_words)\n", 96 | "\n", 97 | "word_features = list(all_words.keys())[:3000]\n", 98 | "\n", 99 | "def find_features(document):\n", 100 | " words = set(document)\n", 101 | " features = {}\n", 102 | " for w in word_features:\n", 103 | " features[w] = (w in words)\n", 104 | "\n", 105 | " return features\n", 106 | "\n", 107 | "#print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))\n", 108 | "\n", 109 | "featuresets = [(find_features(rev), category) for (rev, category) in documents]\n", 110 | " \n", 111 | "training_set = featuresets[:1900]\n", 112 | "testing_set = featuresets[1900:]" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "### Loading all the classifiers from their respective pickle files" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 5, 125 | "metadata": { 126 | "collapsed": true 127 | }, 128 | "outputs": [], 129 | "source": [ 130 | "original_nb = open(\"naive_bayes.pickle\", \"rb\")\n", 131 | "naive_bayes_classifier = pickle.load(original_nb)\n", 132 | "original_nb.close()\n", 133 | "\n", 134 | "pickle_file = open(\"MNB_pickle.pickle\", \"rb\")\n", 135 | "MNB_classifier = pickle.load(pickle_file)\n", 136 | "pickle_file.close()\n", 137 | "\n", 138 | "pickle_file = open(\"BNB_pickle.pickle\", \"rb\")\n", 139 | "BernoulliNB_classifier = pickle.load(pickle_file)\n", 140 | "pickle_file.close()\n", 141 | "\n", 142 | "pickle_file = open(\"LogisticRegression.pickle\", \"rb\")\n", 143 | "LogisticRegression_classifier = pickle.load(pickle_file)\n", 144 | "pickle_file.close()\n", 145 | "\n", 146 | "pickle_file = open(\"SGDClassifier.pickle\", \"rb\")\n", 147 | "SGDClassifier_classifier = pickle.load(pickle_file)\n", 148 | "pickle_file.close()\n", 149 | "\n", 150 | "\n", 151 | "pickle_file = open(\"LinearSVC.pickle\", \"rb\")\n", 152 | "LinearSVC_classifier = pickle.load(pickle_file)\n", 153 | "pickle_file.close()\n", 154 | "\n", 155 | "pickle_file = open(\"NuSVC_classifier.pickle\", \"rb\")\n", 156 | "NuSVC_classifier = pickle.load(pickle_file)\n", 157 | "pickle_file.close()" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 6, 163 | "metadata": { 164 | "collapsed": false 165 | }, 166 | "outputs": [ 167 | { 168 | "name": "stdout", 169 | "output_type": "stream", 170 | "text": [ 171 | "naive bayes: 69.0\n", 172 | "MNB_classifier: 61.0\n", 173 | "BernoulliNB_classifier: 48.0\n", 174 | "LogisticRegression_classifier: 56.00000000000001\n", 175 | "SGDClassifier_classifier: 55.00000000000001\n", 176 | "LinearSVC_classifier: 63.0\n", 177 | "NuSVC_classifier: 59.0\n" 178 | ] 179 | } 180 | ], 181 | "source": [ 182 | "print(\"naive bayes: \", (nltk.classify.accuracy(naive_bayes_classifier, testing_set))*100)\n", 183 | "print(\"MNB_classifier: \", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)\n", 184 | "print(\"BernoulliNB_classifier: \", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)\n", 185 | "print(\"LogisticRegression_classifier: \", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)\n", 186 | "print(\"SGDClassifier_classifier: \", (nltk.classify.accuracy(SGDClassifier_classifier, 
191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "## Passing the classifiers to the voting class" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 8, 201 | "metadata": { 202 | "collapsed": false 203 | }, 204 | "outputs": [ 205 | { 206 | "name": "stdout", 207 | "output_type": "stream", 208 | "text": [ 209 | "Voted classifier accuracy : 56.99999999999999\n" 210 | ] 211 | } 212 | ], 213 | "source": [ 214 | "voted_classifier = VoteClassifier(\n", 215 | " naive_bayes_classifier,\n", 216 | " MNB_classifier,\n", 217 | " BernoulliNB_classifier,\n", 218 | " LogisticRegression_classifier,\n", 219 | " SGDClassifier_classifier,\n", 220 | " LinearSVC_classifier,\n", 221 | " NuSVC_classifier\n", 222 | ")\n", 223 | "print(\"Voted classifier accuracy : \", (nltk.classify.accuracy(voted_classifier, testing_set))*100)" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 9, 229 | "metadata": { 230 | "collapsed": false 231 | }, 232 | "outputs": [ 233 | { 234 | "name": "stdout", 235 | "output_type": "stream", 236 | "text": [ 237 | "Classification: neg Confidence %: 100.0\n", 238 | "Classification: neg Confidence %: 71.42857142857143\n", 239 | "Classification: neg Confidence %: 100.0\n", 240 | "Classification: neg Confidence %: 100.0\n", 241 | "Classification: pos Confidence %: 57.14285714285714\n", 242 | "Classification: neg Confidence %: 71.42857142857143\n" 243 | ] 244 | } 245 | ], 246 | "source": [ 247 | "print(\"Classification:\", voted_classifier.classify(testing_set[0][0]), \"Confidence %:\",voted_classifier.confidence(testing_set[0][0])*100)\n", 248 | "print(\"Classification:\", voted_classifier.classify(testing_set[1][0]), \"Confidence %:\",voted_classifier.confidence(testing_set[1][0])*100)\n", 249 | "print(\"Classification:\", voted_classifier.classify(testing_set[2][0]), \"Confidence %:\",voted_classifier.confidence(testing_set[2][0])*100)\n", 250 | "print(\"Classification:\", voted_classifier.classify(testing_set[3][0]), \"Confidence %:\",voted_classifier.confidence(testing_set[3][0])*100)\n", 251 | "print(\"Classification:\", voted_classifier.classify(testing_set[4][0]), \"Confidence %:\",voted_classifier.confidence(testing_set[4][0])*100)\n", 252 | "print(\"Classification:\", voted_classifier.classify(testing_set[5][0]), \"Confidence %:\",voted_classifier.confidence(testing_set[5][0])*100)" 253 | ] 254 | } 255 | ], 256 | "metadata": { 257 | "kernelspec": { 258 | "display_name": "Python 3", 259 | "language": "python", 260 | "name": "python3" 261 | }, 262 | "language_info": { 263 | "codemirror_mode": { 264 | "name": "ipython", 265 | "version": 3 266 | }, 267 | "file_extension": ".py", 268 | "mimetype": "text/x-python", 269 | "name": "python", 270 | "nbconvert_exporter": "python", 271 | "pygments_lexer": "ipython3", 272 | "version": "3.5.1" 273 | } 274 | }, 275 | "nbformat": 4, 276 | "nbformat_minor": 0 277 | } 278 | -------------------------------------------------------------------------------- /scikitlearnNB.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import nltk\n", 
12 | "import random\n", 13 | "from nltk.corpus import movie_reviews\n", 14 | "from nltk.corpus import stopwords\n", 15 | "import pickle\n", 16 | "\n", 17 | "from nltk.classify.scikitlearn import SklearnClassifier\n", 18 | "from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB\n", 19 | "\n", 20 | "from sklearn.linear_model import LogisticRegression, SGDClassifier\n", 21 | "from sklearn.svm import SVC, LinearSVC, NuSVC" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 2, 27 | "metadata": { 28 | "collapsed": false 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "stop_words = stopwords.words(\"english\")\n", 33 | "documents = [(list(movie_reviews.words(fileid)), category)\n", 34 | " for category in movie_reviews.categories()\n", 35 | " for fileid in movie_reviews.fileids(category)\n", 36 | " ]\n", 37 | "random.shuffle(documents)" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 3, 43 | "metadata": { 44 | "collapsed": true 45 | }, 46 | "outputs": [], 47 | "source": [ 48 | "all_words = []\n", 49 | "for w in movie_reviews.words():\n", 50 | " all_words.append(w.lower())" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "## Making a frequency distribution of the words" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 4, 63 | "metadata": { 64 | "collapsed": false 65 | }, 66 | "outputs": [ 67 | { 68 | "data": { 69 | "text/plain": [ 70 | "[(',', 77717),\n", 71 | " ('the', 76529),\n", 72 | " ('.', 65876),\n", 73 | " ('a', 38106),\n", 74 | " ('and', 35576),\n", 75 | " ('of', 34123),\n", 76 | " ('to', 31937),\n", 77 | " (\"'\", 30585),\n", 78 | " ('is', 25195),\n", 79 | " ('in', 21822),\n", 80 | " ('s', 18513),\n", 81 | " ('\"', 17612),\n", 82 | " ('it', 16107),\n", 83 | " ('that', 15924),\n", 84 | " ('-', 15595),\n", 85 | " (')', 11781),\n", 86 | " ('(', 11664),\n", 87 | " ('as', 11378),\n", 88 | " ('with', 10792),\n", 89 | " ('for', 9961)]" 90 | ] 91 | }, 92 | "execution_count": 4, 93 | "metadata": {}, 94 | "output_type": "execute_result" 95 | } 96 | ], 97 | "source": [ 98 | "all_words = nltk.FreqDist(all_words)\n", 99 | "all_words.most_common(20)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 5, 105 | "metadata": { 106 | "collapsed": true 107 | }, 108 | "outputs": [], 109 | "source": [ 110 | "feature_words = list(all_words.keys())[:5000]\n", 111 | "def find_features(document):\n", 112 | " words = set(document)\n", 113 | " feature = {}\n", 114 | " for w in feature_words:\n", 115 | " feature[w] = (w in words)\n", 116 | " return feature" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 6, 122 | "metadata": { 123 | "collapsed": false 124 | }, 125 | "outputs": [], 126 | "source": [ 127 | "feature_sets = [(find_features(rev), category) for (rev, category) in documents]" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "### Training the classifier" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 7, 140 | "metadata": { 141 | "collapsed": false 142 | }, 143 | "outputs": [], 144 | "source": [ 145 | "training_set = feature_sets[:1900]\n", 146 | "testing_set = feature_sets[1900:]" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 8, 152 | "metadata": { 153 | "collapsed": false 154 | }, 155 | "outputs": [ 156 | { 157 | "name": "stdout", 158 | "output_type": "stream", 159 | "text": [ 160 | "Multinomial classifier 
accuracy : 72.0\n" 161 | ] 162 | } 163 | ], 164 | "source": [ 165 | "## TO-DO: build our own naive Bayes algorithm\n", 166 | "# classifier = nltk.NaiveBayesClassifier.train(training_set)\n", 167 | "\n", 168 | "MNB_classifier = SklearnClassifier(MultinomialNB())\n", 169 | "MNB_classifier.train(training_set)\n", 170 | "\n", 171 | "## saving it to a pickle file\n", 172 | "MNB_pickle = open(\"MNB_pickle.pickle\", \"wb\")\n", 173 | "pickle.dump(MNB_classifier, MNB_pickle)\n", 174 | "MNB_pickle.close()\n", 175 | "print(\"Multinomial classifier accuracy : \", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 9, 181 | "metadata": { 182 | "collapsed": false 183 | }, 184 | "outputs": [ 185 | { 186 | "name": "stdout", 187 | "output_type": "stream", 188 | "text": [ 189 | "Bernoulli classifier accuracy : 72.0\n" 190 | ] 191 | } 192 | ], 193 | "source": [ 194 | "## BernoulliNB \n", 195 | "\n", 196 | "BNB_classifier = SklearnClassifier(BernoulliNB())\n", 197 | "BNB_classifier.train(training_set)\n", 198 | "\n", 199 | "BNB_pickle = open(\"BNB_pickle.pickle\", \"wb\")\n", 200 | "pickle.dump(BNB_classifier, BNB_pickle)\n", 201 | "BNB_pickle.close()\n", 202 | "\n", 203 | "print(\"Bernoulli classifier accuracy : \", (nltk.classify.accuracy(BNB_classifier, testing_set))*100)" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 10, 209 | "metadata": { 210 | "collapsed": false 211 | }, 212 | "outputs": [ 213 | { 214 | "name": "stdout", 215 | "output_type": "stream", 216 | "text": [ 217 | "LogisticRegression_classifier accuracy percent: 74.0\n", 218 | "SGDClassifier_classifier accuracy percent: 69.0\n", 219 | "SVC_classifier accuracy percent: 48.0\n", 220 | "LinearSVC_classifier accuracy percent: 74.0\n", 221 | "NuSVC_classifier accuracy percent: 74.0\n" 222 | ] 223 | } 224 | ], 225 | "source": [ 226 | "LogisticRegression_classifier = SklearnClassifier(LogisticRegression())\n", 227 | "LogisticRegression_classifier.train(training_set)\n", 228 | "\n", 229 | "LogisticRegression_pickle = open(\"LogisticRegression.pickle\", \"wb\")\n", 230 | "pickle.dump(LogisticRegression_classifier, LogisticRegression_pickle)\n", 231 | "LogisticRegression_pickle.close()\n", 232 | "\n", 233 | "print(\"LogisticRegression_classifier accuracy percent:\", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)\n", 234 | "\n", 235 | "SGDClassifier_classifier = SklearnClassifier(SGDClassifier())\n", 236 | "SGDClassifier_classifier.train(training_set)\n", 237 | "\n", 238 | "SGDClassifier_pickle = open(\"SGDClassifier.pickle\", \"wb\")\n", 239 | "pickle.dump(SGDClassifier_classifier, SGDClassifier_pickle)\n", 240 | "SGDClassifier_pickle.close()\n", 241 | "\n", 242 | "print(\"SGDClassifier_classifier accuracy percent:\", (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)\n", 243 | "\n", 244 | "SVC_classifier = SklearnClassifier(SVC())\n", 245 | "SVC_classifier.train(training_set)\n", 246 | "\n", 247 | "SVC_classifier_pickle = open(\"SVC_classifier.pickle\", \"wb\")\n", 248 | "pickle.dump(SVC_classifier, SVC_classifier_pickle)\n", 249 | "SVC_classifier_pickle.close()\n", 250 | "\n", 251 | "print(\"SVC_classifier accuracy percent:\", (nltk.classify.accuracy(SVC_classifier, testing_set))*100)\n", 252 | "\n", 253 | "LinearSVC_classifier = SklearnClassifier(LinearSVC())\n", 254 | "LinearSVC_classifier.train(training_set)\n", 255 | "\n", 256 | "LinearSVC_pickle = open(\"LinearSVC.pickle\", \"wb\")\n", 257 | 
"pickle.dump(LinearSVC_classifier, LinearSVC_pickle)\n", 258 | "LinearSVC_pickle.close()\n", 259 | "\n", 260 | "print(\"LinearSVC_classifier accuracy percent:\", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)\n", 261 | "\n", 262 | "NuSVC_classifier = SklearnClassifier(NuSVC())\n", 263 | "NuSVC_classifier.train(training_set)\n", 264 | "\n", 265 | "NuSVC_pickle = open(\"LinearSVC.pickle\", \"wb\")\n", 266 | "pickle.dump(NuSVC_classifier, NuSVC_pickle)\n", 267 | "NuSVC_pickle.close()\n", 268 | "\n", 269 | "print(\"NuSVC_classifier accuracy percent:\", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 12, 275 | "metadata": { 276 | "collapsed": false 277 | }, 278 | "outputs": [ 279 | { 280 | "name": "stdout", 281 | "output_type": "stream", 282 | "text": [ 283 | "Naive bayes classifier accuracy percent: 76.0\n" 284 | ] 285 | } 286 | ], 287 | "source": [ 288 | "### using the old naive_bayes classifier\n", 289 | "naive_bayes_pickle = open(\"naivebayes.pickle\", \"rb\")\n", 290 | "naive_bayes_classifier = pickle.load(naive_bayes_pickle)\n", 291 | "naive_bayes_pickle.close()\n", 292 | "\n", 293 | "print(\"Naive bayes classifier accuracy percent:\", (nltk.classify.accuracy(naive_bayes_classifier, testing_set))*100)" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "## Putting it all together to make a voting system for increasing accuracy\n", 301 | "\n", 302 | "### Check out voting_system.ipynb" 303 | ] 304 | } 305 | ], 306 | "metadata": { 307 | "kernelspec": { 308 | "display_name": "Python 3", 309 | "language": "python", 310 | "name": "python3" 311 | }, 312 | "language_info": { 313 | "codemirror_mode": { 314 | "name": "ipython", 315 | "version": 3 316 | }, 317 | "file_extension": ".py", 318 | "mimetype": "text/x-python", 319 | "name": "python", 320 | "nbconvert_exporter": "python", 321 | "pygments_lexer": "ipython3", 322 | "version": "3.5.1" 323 | } 324 | }, 325 | "nbformat": 4, 326 | "nbformat_minor": 0 327 | } 328 | -------------------------------------------------------------------------------- /nltkNB.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import nltk\n", 12 | "import random\n", 13 | "from nltk.corpus import movie_reviews\n", 14 | "import pprint\n", 15 | "from nltk.corpus import stopwords\n", 16 | "stop_words = stopwords.words(\"english\")\n", 17 | "import pickle" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "**There are a thousand movie reviews for both**\n", 25 | "\n", 26 | "- positive and\n", 27 | "- negetive\n", 28 | "\n", 29 | "**reviews**" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 2, 35 | "metadata": { 36 | "collapsed": false 37 | }, 38 | "outputs": [ 39 | { 40 | "data": { 41 | "text/plain": [ 42 | "['neg', 'pos']" 43 | ] 44 | }, 45 | "execution_count": 2, 46 | "metadata": {}, 47 | "output_type": "execute_result" 48 | } 49 | ], 50 | "source": [ 51 | "movie_reviews.categories()" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "### Now I need to store it as \n", 59 | "\n", 60 | "```python\n", 61 | "documents = [\n", 62 | " ('pos', ['good', 'awesome', ....]), \n", 63 | " ('neg', ['ridiculous', 
'horrible', ...], 'neg')\n", 64 | "]\n", 65 | "```" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 3, 71 | "metadata": { 72 | "collapsed": false 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "documents = [(list(movie_reviews.words(fileid)), category)\n", 77 | " for category in movie_reviews.categories()\n", 78 | " for fileid in movie_reviews.fileids(category)\n", 79 | " ]\n", 80 | "random.shuffle(documents)" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "## Another way to do it: an explicit loop instead of the one-liner" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": { 93 | "collapsed": false 94 | }, 95 | "source": [ 96 | "```python\n", 97 | "document_dict = {\n", 98 | " 'pos': [],\n", 99 | " 'neg': []\n", 100 | "}\n", 101 | "for category in movie_reviews.categories():\n", 102 | " for fileid in movie_reviews.fileids(category):\n", 103 | " # this will store the list of words read from the particular file in fileid\n", 104 | " raw_list = movie_reviews.words(fileid)\n", 105 | " # cleaning the list using stopwords\n", 106 | " word_list = [word for word in raw_list if word not in stop_words]\n", 107 | " if category == 'pos':\n", 108 | " document_dict['pos'].extend(word_list)\n", 109 | " elif category == 'neg':\n", 110 | " document_dict['neg'].extend(word_list)\n", 111 | "```" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "**Getting the list of all words to store the most frequently occurring ones**" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 4, 124 | "metadata": { 125 | "collapsed": true 126 | }, 127 | "outputs": [], 128 | "source": [ 129 | "all_words = []\n", 130 | "for w in movie_reviews.words():\n", 131 | " all_words.append(w.lower())" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "### Making a frequency distribution of the words" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 5, 144 | "metadata": { 145 | "collapsed": false 146 | }, 147 | "outputs": [ 148 | { 149 | "data": { 150 | "text/plain": [ 151 | "[(',', 77717),\n", 152 | " ('the', 76529),\n", 153 | " ('.', 65876),\n", 154 | " ('a', 38106),\n", 155 | " ('and', 35576),\n", 156 | " ('of', 34123),\n", 157 | " ('to', 31937),\n", 158 | " (\"'\", 30585),\n", 159 | " ('is', 25195),\n", 160 | " ('in', 21822),\n", 161 | " ('s', 18513),\n", 162 | " ('\"', 17612),\n", 163 | " ('it', 16107),\n", 164 | " ('that', 15924),\n", 165 | " ('-', 15595),\n", 166 | " (')', 11781),\n", 167 | " ('(', 11664),\n", 168 | " ('as', 11378),\n", 169 | " ('with', 10792),\n", 170 | " ('for', 9961)]" 171 | ] 172 | }, 173 | "execution_count": 5, 174 | "metadata": {}, 175 | "output_type": "execute_result" 176 | } 177 | ], 178 | "source": [ 179 | "all_words = nltk.FreqDist(all_words)\n", 180 | "all_words.most_common(20)" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 6, 186 | "metadata": { 187 | "collapsed": false 188 | }, 189 | "outputs": [ 190 | { 191 | "data": { 192 | "text/plain": [ 193 | "134" 194 | ] 195 | }, 196 | "execution_count": 6, 197 | "metadata": {}, 198 | "output_type": "execute_result" 199 | } 200 | ], 201 | "source": [ 202 | "all_words[\"hate\"] ## counting the occurrences of a single word" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "### We will train on only the top 5000 words in the list" 210 | 
] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 7, 215 | "metadata": { 216 | "collapsed": false 217 | }, 218 | "outputs": [], 219 | "source": [ 220 | "feature_words = list(all_words.keys())[:5000]\n", "# NOTE: keys() is in arbitrary order; all_words.most_common(5000) would pick the actual top 5000 words" 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "### Finding these feature words in a document; a small helper function makes this easy" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": 8, 233 | "metadata": { 234 | "collapsed": true 235 | }, 236 | "outputs": [], 237 | "source": [ 238 | "def find_features(document):\n", 239 | " words = set(document)\n", 240 | " feature = {}\n", 241 | " for w in feature_words:\n", 242 | " feature[w] = (w in words)\n", 243 | " return feature" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "**Before, each document was just a list of words paired with its category. The line below replaces the word list with a feature set (for each frequent word, a boolean saying whether it appears in the document), still paired with the category.**" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": 9, 256 | "metadata": { 257 | "collapsed": false 258 | }, 259 | "outputs": [], 260 | "source": [ 261 | "feature_sets = [(find_features(rev), category) for (rev, category) in documents]" 262 | ] 263 | }, 264 | { 265 | "cell_type": "markdown", 266 | "metadata": {}, 267 | "source": [ 268 | "### Training the classifier" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": 11, 274 | "metadata": { 275 | "collapsed": false 276 | }, 277 | "outputs": [], 278 | "source": [ 279 | "training_set = feature_sets[:1900]\n", 280 | "testing_set = feature_sets[1900:]" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "We won't be telling the machine the category, i.e. whether the document is a positive one or a negative one. We ask it to tell that to us. 
Then we compare its answer to the known category and calculate how accurate it is.\n", 288 | "\n", 289 | "## Naive Bayes algorithm\n", 290 | "\n", 291 | "It states that" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "\\begin{equation*}\n", 299 | "posterior = \\frac{prior \\times likelihood}{evidence}\n", 300 | "\\end{equation*}\n", 301 | "\n", 302 | "Here the posterior is the probability of a category (pos or neg) given the document's features; the prior is how often that category occurs on its own, the likelihood is how probable the features are under that category, and the evidence is how probable the features are overall." 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": 12, 308 | "metadata": { 309 | "collapsed": true 310 | }, 311 | "outputs": [], 312 | "source": [ 313 | "## TO-DO: build our own naive Bayes algorithm\n", 314 | "classifier = nltk.NaiveBayesClassifier.train(training_set)" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 13, 320 | "metadata": { 321 | "collapsed": false 322 | }, 323 | "outputs": [ 324 | { 325 | "name": "stdout", 326 | "output_type": "stream", 327 | "text": [ 328 | "Naive bayes classifier accuracy percentage : 73.0\n" 329 | ] 330 | } 331 | ], 332 | "source": [ 333 | "## Testing its accuracy\n", 334 | "print(\"Naive bayes classifier accuracy percentage : \", (nltk.classify.accuracy(classifier, testing_set))*100)" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": 14, 340 | "metadata": { 341 | "collapsed": false 342 | }, 343 | "outputs": [ 344 | { 345 | "name": "stdout", 346 | "output_type": "stream", 347 | "text": [ 348 | "Most Informative Features\n", 349 | " avoids = True pos : neg = 11.7 : 1.0\n", 350 | " seamless = True pos : neg = 9.7 : 1.0\n", 351 | " conveys = True pos : neg = 9.0 : 1.0\n", 352 | " feeble = True neg : pos = 8.3 : 1.0\n", 353 | " incoherent = True neg : pos = 8.3 : 1.0\n", 354 | " supreme = True pos : neg = 7.0 : 1.0\n", 355 | " construction = True pos : neg = 7.0 : 1.0\n", 356 | " effortlessly = True pos : neg = 7.0 : 1.0\n", 357 | " observes = True pos : neg = 7.0 : 1.0\n", 358 | " compensate = True neg : pos = 7.0 : 1.0\n", 359 | " idiotic = True neg : pos = 7.0 : 1.0\n", 360 | " kudos = True pos : neg = 6.6 : 1.0\n", 361 | " moss = True pos : neg = 6.3 : 1.0\n", 362 | " obstacle = True pos : neg = 6.3 : 1.0\n", 363 | " dewey = True pos : neg = 6.3 : 1.0\n", 364 | " suvari = True neg : pos = 6.3 : 1.0\n", 365 | " regard = True pos : neg = 6.2 : 1.0\n", 366 | " embarrassment = True neg : pos = 6.2 : 1.0\n", 367 | " symbol = True pos : neg = 5.8 : 1.0\n", 368 | " strengths = True pos : neg = 5.8 : 1.0\n" 369 | ] 370 | } 371 | ], 372 | "source": [ 373 | "classifier.show_most_informative_features(20)" 374 | ] 375 | },
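{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check (a sketch added for illustration, reusing `classifier` and `find_features` from above; the label you get depends on the shuffled split), we can classify a hand-written review:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# run a made-up review through the same feature extractor and the trained classifier\n", "toy_review = \"an effortlessly seamless film that conveys real emotion\".split()\n", "print(classifier.classify(find_features(toy_review)))" ] },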
{ 377 | "cell_type": "markdown", 378 | "metadata": {}, 379 | "source": [ 380 | "## Now now, don't get too carried away here.\n", 381 | "\n", 382 | "The results are not that accurate right now. We will be working on that later on.\n", 383 | "\n", 384 | "To read the feature list above, let's take **suvari**: \n", 385 | "\n", 386 | "> **neg : pos = 6.3 : 1.0**\n", 387 | "\n", 388 | "means that it appears **6.3** times more often in **neg** reviews than in **pos** reviews\n", 389 | "\n", 390 | "## Saving the trained algorithm using **Pickle**\n", 391 | "\n", 392 | "We will save the trained Python object so that we can quickly load it again.\n", 393 | "\n", 394 | "_Importing pickle at the top_" 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": 15, 400 | "metadata": { 401 | "collapsed": true 402 | }, 403 | "outputs": [], 404 | "source": [ 405 | "save_classifier = open(\"naivebayes.pickle\", \"wb\") ## 'wb' means write in binary mode\n", 406 | "pickle.dump(classifier, save_classifier)\n", 407 | "save_classifier.close()" 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "metadata": {}, 413 | "source": [ 414 | "**We will now use this classifier in the next file to classify documents**" 415 | ] 416 | } 417 | ], 418 | "metadata": { 419 | "kernelspec": { 420 | "display_name": "Python 3", 421 | "language": "python", 422 | "name": "python3" 423 | }, 424 | "language_info": { 425 | "codemirror_mode": { 426 | "name": "ipython", 427 | "version": 3 428 | }, 429 | "file_extension": ".py", 430 | "mimetype": "text/x-python", 431 | "name": "python", 432 | "nbconvert_exporter": "python", 433 | "pygments_lexer": "ipython3", 434 | "version": "3.5.1" 435 | } 436 | }, 437 | "nbformat": 4, 438 | "nbformat_minor": 0 439 | } 440 | --------------------------------------------------------------------------------