├── LICENSE ├── .gitignore ├── README.md ├── voting_process.ipynb ├── scikitlearnNB.ipynb └── nltkNB.ipynb /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright © 2016 Tasdik Rahman 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 4 | 5 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 6 | 7 | THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 8 | 9 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # ignoring the pickle files 2 | *.pickle 3 | 4 | # Project related files 5 | testing_file.py 6 | testing_file.ipynb 7 | 8 | # Byte-compiled / optimized / DLL files 9 | __pycache__/ 10 | *.py[cod] 11 | *$py.class 12 | 13 | # C extensions 14 | *.so 15 | 16 | # Distribution / packaging 17 | .Python 18 | env/ 19 | build/ 20 | develop-eggs/ 21 | dist/ 22 | downloads/ 23 | eggs/ 24 | .eggs/ 25 | lib/ 26 | lib64/ 27 | parts/ 28 | sdist/ 29 | var/ 30 | *.egg-info/ 31 | .installed.cfg 32 | *.egg 33 | 34 | # PyInstaller 35 | # Usually these files are written by a python script from a template 36 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 37 | *.manifest 38 | *.spec 39 | 40 | # Installer logs 41 | pip-log.txt 42 | pip-delete-this-directory.txt 43 | 44 | # Unit test / coverage reports 45 | htmlcov/ 46 | .tox/ 47 | .coverage 48 | .coverage.* 49 | .cache 50 | nosetests.xml 51 | coverage.xml 52 | *,cover 53 | .hypothesis/ 54 | 55 | # Translations 56 | *.mo 57 | *.pot 58 | 59 | # Django stuff: 60 | *.log 61 | local_settings.py 62 | 63 | # Flask instance folder 64 | instance/ 65 | 66 | # Sphinx documentation 67 | docs/_build/ 68 | 69 | # PyBuilder 70 | target/ 71 | 72 | # IPython Notebook 73 | .ipynb_checkpoints 74 | 75 | # pyenv 76 | .python-version 77 | 78 | # dotenv 79 | .env 80 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Movie Review Analysis 2 | 3 | An analysis of the `movie_reviews` data set included in the `nltk` corpus: training several classifiers to label a review as positive or negative. 
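In a nutshell, the notebooks build a bag-of-words Naive Bayes sentiment classifier over that corpus. A condensed sketch of the pipeline (the notebooks do this step by step; `most_common` is used here for the feature words, which is what the notebooks' `list(all_words.keys())[:3000]` is aiming at):

```python
import random
import nltk
from nltk.corpus import movie_reviews

# (words, label) pairs for all 2000 reviews, shuffled
documents = [(list(movie_reviews.words(fid)), cat)
             for cat in movie_reviews.categories()
             for fid in movie_reviews.fileids(cat)]
random.shuffle(documents)

# presence/absence features over the most frequent words
freq = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in freq.most_common(3000)]

def find_features(document):
    words = set(document)
    return {w: (w in words) for w in word_features}

featuresets = [(find_features(words), cat) for (words, cat) in documents]
classifier = nltk.NaiveBayesClassifier.train(featuresets[:1900])
print(nltk.classify.accuracy(classifier, featuresets[1900:]))
```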
4 | 5 | *** 6 | 7 | ## Index: 8 | 9 | - [What is in this repo](https://github.com/prodicus/movieReviewsAnalysis#what-is-in-this-repo) 10 | - [Accuracy achieved](https://github.com/prodicus/movieReviewsAnalysis#accuracy-achieved) 11 | - [Requirements](https://github.com/prodicus/movieReviewsAnalysis#requirements) 12 | - [Downloading the dataset](https://github.com/prodicus/movieReviewsAnalysis#downloading-the-dataset) 13 | - [Running it](https://github.com/prodicus/movieReviewsAnalysis#running-it) 14 | - [So](https://github.com/prodicus/movieReviewsAnalysis#so) 15 | - [Legal stuff](https://github.com/prodicus/movieReviewsAnalysis#legal-stuff) 16 | 17 | *** 18 | 19 | ## What is in this repo 20 | [[Back to top]](https://github.com/prodicus/movieReviewsAnalysis#movie-review-analysis) 21 | 22 | - [x] An implementation of `nltk.NaiveBayesClassifier` trained on the **2000 movie reviews** in the corpus. Implemented in `nltkNB.ipynb` 23 | - [x] Using `sklearn` 24 | - [x] **Naive Bayes**: 25 | - [x] `MultinomialNB`: 26 | - [x] `BernoulliNB`: 27 | - [x] **Linear Model** 28 | - [x] `LogisticRegression`: 29 | - [x] `SGDClassifier`: 30 | - [x] **SVM** 31 | - [x] `SVC`: 32 | - [x] `LinearSVC`: 33 | - [x] `NuSVC`: 34 | 35 | Implemented in `scikitlearnNB.ipynb` 36 | 37 | - [x] Implemented a voting system that combines all the learning methods by majority vote. Implemented in `voting_process.ipynb` 38 | 39 | *** 40 | 41 | ## Accuracy achieved 42 | [[Back to top]](https://github.com/prodicus/movieReviewsAnalysis#movie-review-analysis) 43 | 44 | | **Classifiers** | **Accuracy achieved** | 45 | |---------------------------------|-----------------------| 46 | | `nltk.NaiveBayesClassifier` | _73.0%_ | 47 | | **ScikitLearn Implementations** | | 48 | | `BernoulliNB` | _72.0%_ | 49 | | `MultinomialNB` | _76.0%_ | 50 | | `LogisticRegression` | _74.0%_ | 51 | | `SGDClassifier` | _69.0%_ | 52 | | `SVC` | _48.0%_ | 53 | | `LinearSVC` | _74.0%_ | 54 | | `NuSVC` | _74.0%_ | 55 | 56 | *** 57 | 58 | ## Requirements 59 | [[Back to top]](https://github.com/prodicus/movieReviewsAnalysis#movie-review-analysis) 60 | 61 | The simplest (and suggested) way is to install the required packages and dependencies using either [anaconda](https://www.continuum.io/downloads) or [miniconda](http://conda.pydata.org/miniconda.html) 62 | 63 | After that you can run 64 | 65 | ```sh 66 | $ conda update conda 67 | $ conda install scikit-learn nltk 68 | ``` 69 | 70 | *** 71 | 72 | ## Downloading the dataset 73 | [[Back to top]](https://github.com/prodicus/movieReviewsAnalysis#movie-review-analysis) 74 | 75 | The dataset used in this project is bundled along with the `nltk` package. 76 | 77 | Run your Python interpreter 78 | 79 | ```python 80 | >>> import nltk 81 | >>> nltk.download('stopwords') 82 | >>> nltk.download('movie_reviews') 83 | ``` 84 | 85 | **NOTE**: You can find system-specific installation instructions on the official [`nltk` website](http://www.nltk.org/data.html) 86 | 87 | Check that everything is in place by running your interpreter again and importing these 88 | 89 | ```python 90 | >>> import nltk 91 | >>> from nltk.corpus import stopwords, movie_reviews 92 | >>> import sklearn 93 | >>> 94 | ``` 95 | 96 | If these imports work for you, then you are good to go! 97 | 
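As a final check, the corpus should contain 2000 reviews, split evenly between the two categories (these counts are a property of the stock `movie_reviews` corpus):

```python
>>> len(movie_reviews.fileids())
2000
>>> len(movie_reviews.fileids('pos')), len(movie_reviews.fileids('neg'))
(1000, 1000)
```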
98 | *** 99 | 100 | ## Running it 101 | [[Back to top]](https://github.com/prodicus/movieReviewsAnalysis#movie-review-analysis) 102 | 103 | 1. Clone the repo 104 | 105 | ```sh 106 | $ git clone https://github.com/prodicus/movieReviewsAnalysis 107 | $ cd movieReviewsAnalysis 108 | ## start the notebook server 109 | $ ipython notebook 110 | ``` 111 | 112 | 2. Order of running 113 | 1. `nltkNB.ipynb` 114 | 2. `scikitlearnNB.ipynb` 115 | 3. `voting_process.ipynb` 116 | 117 | 3. Hack away! 118 | 119 | *** 120 | 121 | ## So 122 | [[Back to top]](https://github.com/prodicus/movieReviewsAnalysis#movie-review-analysis) 123 | 124 | **"So what? Well, this is pretty basic!"** 125 | 126 | Yes, it is, but hey, we all start somewhere, right? 127 | 128 | **Psst**. I am working on a spam filtering system. You know, the one where you paste in an email and it tells you whether 129 | it is spam or not. 130 | 131 | You can follow me on twitter [@tasdikrahman](https://twitter.com/tasdikrahman) to keep tabs on it. 132 | 133 | *** 134 | 135 | ## Legal stuff 136 | [[Back to top]](https://github.com/prodicus/movieReviewsAnalysis#movie-review-analysis) 137 | 138 | Hacked together by [Tasdik Rahman](http://tasdikrahman.me) under the [MIT License](http://prodicus.mit-license.org) 139 | 140 | You can find a copy of the License at http://prodicus.mit-license.org/ 141 | 142 | -------------------------------------------------------------------------------- /voting_process.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Continuing where I left off in the `scikitlearnNB.ipynb` notebook. \n", 8 | "\n", 9 | "We will create a voting system to improve our prediction accuracy" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": { 16 | "collapsed": true 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "import nltk\n", 21 | "import random\n", 22 | "from nltk.corpus import movie_reviews\n", 23 | "import pickle\n", 24 | "\n", 25 | "from nltk.classify import ClassifierI\n", 26 | "from statistics import mode" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 7, 32 | "metadata": { 33 | "collapsed": true 34 | }, 35 | "outputs": [], 36 | "source": [ 37 | "## defining the VoteClassifier class\n", 38 | "class VoteClassifier(ClassifierI):\n", 39 | " def __init__(self, *classifiers):\n", 40 | " self._classifiers = classifiers\n", 41 | "\n", 42 | " def classify(self, features):\n", " # each wrapped classifier gets one vote; return the majority label\n", 43 | " votes = []\n", 44 | " for c in self._classifiers:\n", 45 | " v = c.classify(features)\n", 46 | " votes.append(v)\n", 47 | " return mode(votes)\n", 48 | "\n", 49 | " def confidence(self, features):\n", " # fraction of classifiers that agree with the majority label\n", 50 | " votes = []\n", 51 | " for c in self._classifiers:\n", 52 | " v = c.classify(features)\n", 53 | " votes.append(v)\n", 54 | "\n", 55 | " choice_votes = votes.count(mode(votes))\n", 56 | " conf = choice_votes / len(votes)\n", 57 | " return conf" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 3, 63 | "metadata": { 64 | "collapsed": true 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "# pickle_obj = open(\"documents.pickle\", \"wb\")\n", 69 | "documents = [(list(movie_reviews.words(fileid)), category)\n", 70 | " for category in movie_reviews.categories()\n", 71 | " for fileid in movie_reviews.fileids(category)]\n", 72 | "# pickle.dump(documents, pickle_obj)\n", 73 | "# pickle_obj.close()" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 4, 79 | "metadata": { 80 | "collapsed": true 81 | }, 82 | "outputs": [], 83 | "source": [ 84 | "# pickle_obj = 
open(\"documents.pickle\", \"rb\")\n", 85 | "# documents = pickle.load(pickle_obj)\n", 86 | "# pickle_obj.close()\n", 87 | "\n", 88 | "random.shuffle(documents)\n", 89 | "\n", 90 | "all_words = []\n", 91 | "\n", 92 | "for w in movie_reviews.words():\n", 93 | " all_words.append(w.lower())\n", 94 | "\n", 95 | "all_words = nltk.FreqDist(all_words)\n", 96 | "\n", 97 | "word_features = list(all_words.keys())[:3000]\n", 98 | "\n", 99 | "def find_features(document):\n", 100 | " words = set(document)\n", 101 | " features = {}\n", 102 | " for w in word_features:\n", 103 | " features[w] = (w in words)\n", 104 | "\n", 105 | " return features\n", 106 | "\n", 107 | "#print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))\n", 108 | "\n", 109 | "featuresets = [(find_features(rev), category) for (rev, category) in documents]\n", 110 | " \n", 111 | "training_set = featuresets[:1900]\n", 112 | "testing_set = featuresets[1900:]" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "### Loading all the classifiers from their respective pickle files" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 5, 125 | "metadata": { 126 | "collapsed": true 127 | }, 128 | "outputs": [], 129 | "source": [ 130 | "original_nb = open(\"naive_bayes.pickle\", \"rb\")\n", 131 | "naive_bayes_classifier = pickle.load(original_nb)\n", 132 | "original_nb.close()\n", 133 | "\n", 134 | "pickle_file = open(\"MNB_pickle.pickle\", \"rb\")\n", 135 | "MNB_classifier = pickle.load(pickle_file)\n", 136 | "pickle_file.close()\n", 137 | "\n", 138 | "pickle_file = open(\"BNB_pickle.pickle\", \"rb\")\n", 139 | "BernoulliNB_classifier = pickle.load(pickle_file)\n", 140 | "pickle_file.close()\n", 141 | "\n", 142 | "pickle_file = open(\"LogisticRegression.pickle\", \"rb\")\n", 143 | "LogisticRegression_classifier = pickle.load(pickle_file)\n", 144 | "pickle_file.close()\n", 145 | "\n", 146 | "pickle_file = open(\"SGDClassifier.pickle\", \"rb\")\n", 147 | "SGDClassifier_classifier = pickle.load(pickle_file)\n", 148 | "pickle_file.close()\n", 149 | "\n", 150 | "\n", 151 | "pickle_file = open(\"LinearSVC.pickle\", \"rb\")\n", 152 | "LinearSVC_classifier = pickle.load(pickle_file)\n", 153 | "pickle_file.close()\n", 154 | "\n", 155 | "pickle_file = open(\"NuSVC_classifier.pickle\", \"rb\")\n", 156 | "NuSVC_classifier = pickle.load(pickle_file)\n", 157 | "pickle_file.close()" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 6, 163 | "metadata": { 164 | "collapsed": false 165 | }, 166 | "outputs": [ 167 | { 168 | "name": "stdout", 169 | "output_type": "stream", 170 | "text": [ 171 | "naive bayes: 69.0\n", 172 | "MNB_classifier: 61.0\n", 173 | "BernoulliNB_classifier: 48.0\n", 174 | "LogisticRegression_classifier: 56.00000000000001\n", 175 | "SGDClassifier_classifier: 55.00000000000001\n", 176 | "LinearSVC_classifier: 63.0\n", 177 | "NuSVC_classifier: 59.0\n" 178 | ] 179 | } 180 | ], 181 | "source": [ 182 | "print(\"naive bayes: \", (nltk.classify.accuracy(naive_bayes_classifier, testing_set))*100)\n", 183 | "print(\"MNB_classifier: \", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)\n", 184 | "print(\"BernoulliNB_classifier: \", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)\n", 185 | "print(\"LogisticRegression_classifier: \", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)\n", 186 | "print(\"SGDClassifier_classifier: \", (nltk.classify.accuracy(SGDClassifier_classifier, 
191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "## Passing the classifiers to the voting class" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 8, 201 | "metadata": { 202 | "collapsed": false 203 | }, 204 | "outputs": [ 205 | { 206 | "name": "stdout", 207 | "output_type": "stream", 208 | "text": [ 209 | "Voted classifier accuracy : 56.99999999999999\n" 210 | ] 211 | } 212 | ], 213 | "source": [ 214 | "voted_classifier = VoteClassifier(\n", 215 | " naive_bayes_classifier,\n", 216 | " MNB_classifier,\n", 217 | " BernoulliNB_classifier,\n", 218 | " LogisticRegression_classifier,\n", 219 | " SGDClassifier_classifier,\n", 220 | " LinearSVC_classifier,\n", 221 | " NuSVC_classifier\n", 222 | ")\n", 223 | "print(\"Voted classifier accuracy : \", (nltk.classify.accuracy(voted_classifier, testing_set))*100)" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 9, 229 | "metadata": { 230 | "collapsed": false 231 | }, 232 | "outputs": [ 233 | { 234 | "name": "stdout", 235 | "output_type": "stream", 236 | "text": [ 237 | "Classification: neg Confidence %: 100.0\n", 238 | "Classification: neg Confidence %: 71.42857142857143\n", 239 | "Classification: neg Confidence %: 100.0\n", 240 | "Classification: neg Confidence %: 100.0\n", 241 | "Classification: pos Confidence %: 57.14285714285714\n", 242 | "Classification: neg Confidence %: 71.42857142857143\n" 243 | ] 244 | } 245 | ], 246 | "source": [ 247 | "print(\"Classification:\", voted_classifier.classify(testing_set[0][0]), \"Confidence %:\",voted_classifier.confidence(testing_set[0][0])*100)\n", 248 | "print(\"Classification:\", voted_classifier.classify(testing_set[1][0]), \"Confidence %:\",voted_classifier.confidence(testing_set[1][0])*100)\n", 249 | "print(\"Classification:\", voted_classifier.classify(testing_set[2][0]), \"Confidence %:\",voted_classifier.confidence(testing_set[2][0])*100)\n", 250 | "print(\"Classification:\", voted_classifier.classify(testing_set[3][0]), \"Confidence %:\",voted_classifier.confidence(testing_set[3][0])*100)\n", 251 | "print(\"Classification:\", voted_classifier.classify(testing_set[4][0]), \"Confidence %:\",voted_classifier.confidence(testing_set[4][0])*100)\n", 252 | "print(\"Classification:\", voted_classifier.classify(testing_set[5][0]), \"Confidence %:\",voted_classifier.confidence(testing_set[5][0])*100)" 253 | ] 254 | } 255 | ], 256 | "metadata": { 257 | "kernelspec": { 258 | "display_name": "Python 3", 259 | "language": "python", 260 | "name": "python3" 261 | }, 262 | "language_info": { 263 | "codemirror_mode": { 264 | "name": "ipython", 265 | "version": 3 266 | }, 267 | "file_extension": ".py", 268 | "mimetype": "text/x-python", 269 | "name": "python", 270 | "nbconvert_exporter": "python", 271 | "pygments_lexer": "ipython3", 272 | "version": "3.5.1" 273 | } 274 | }, 275 | "nbformat": 4, 276 | "nbformat_minor": 0 277 | } 278 | -------------------------------------------------------------------------------- /scikitlearnNB.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import nltk\n", 
12 | "import random\n", 13 | "from nltk.corpus import movie_reviews\n", 14 | "from nltk.corpus import stopwords\n", 15 | "import pickle\n", 16 | "\n", 17 | "from nltk.classify.scikitlearn import SklearnClassifier\n", 18 | "from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB\n", 19 | "\n", 20 | "from sklearn.linear_model import LogisticRegression, SGDClassifier\n", 21 | "from sklearn.svm import SVC, LinearSVC, NuSVC" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 2, 27 | "metadata": { 28 | "collapsed": false 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "stop_words = stopwords.words(\"english\")\n", 33 | "documents = [(list(movie_reviews.words(fileid)), category)\n", 34 | " for category in movie_reviews.categories()\n", 35 | " for fileid in movie_reviews.fileids(category)\n", 36 | " ]\n", 37 | "random.shuffle(documents)" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 3, 43 | "metadata": { 44 | "collapsed": true 45 | }, 46 | "outputs": [], 47 | "source": [ 48 | "all_words = []\n", 49 | "for w in movie_reviews.words():\n", 50 | " all_words.append(w.lower())" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "## Making a frequency distribution of the words" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 4, 63 | "metadata": { 64 | "collapsed": false 65 | }, 66 | "outputs": [ 67 | { 68 | "data": { 69 | "text/plain": [ 70 | "[(',', 77717),\n", 71 | " ('the', 76529),\n", 72 | " ('.', 65876),\n", 73 | " ('a', 38106),\n", 74 | " ('and', 35576),\n", 75 | " ('of', 34123),\n", 76 | " ('to', 31937),\n", 77 | " (\"'\", 30585),\n", 78 | " ('is', 25195),\n", 79 | " ('in', 21822),\n", 80 | " ('s', 18513),\n", 81 | " ('\"', 17612),\n", 82 | " ('it', 16107),\n", 83 | " ('that', 15924),\n", 84 | " ('-', 15595),\n", 85 | " (')', 11781),\n", 86 | " ('(', 11664),\n", 87 | " ('as', 11378),\n", 88 | " ('with', 10792),\n", 89 | " ('for', 9961)]" 90 | ] 91 | }, 92 | "execution_count": 4, 93 | "metadata": {}, 94 | "output_type": "execute_result" 95 | } 96 | ], 97 | "source": [ 98 | "all_words = nltk.FreqDist(all_words)\n", 99 | "all_words.most_common(20)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 5, 105 | "metadata": { 106 | "collapsed": true 107 | }, 108 | "outputs": [], 109 | "source": [ 110 | "feature_words = list(all_words.keys())[:5000]\n", 111 | "def find_features(document):\n", 112 | " words = set(document)\n", 113 | " feature = {}\n", 114 | " for w in feature_words:\n", 115 | " feature[w] = (w in words)\n", 116 | " return feature" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 6, 122 | "metadata": { 123 | "collapsed": false 124 | }, 125 | "outputs": [], 126 | "source": [ 127 | "feature_sets = [(find_features(rev), category) for (rev, category) in documents]" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "### Training the classifier" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 7, 140 | "metadata": { 141 | "collapsed": false 142 | }, 143 | "outputs": [], 144 | "source": [ 145 | "training_set = feature_sets[:1900]\n", 146 | "testing_set = feature_sets[1900:]" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 8, 152 | "metadata": { 153 | "collapsed": false 154 | }, 155 | "outputs": [ 156 | { 157 | "name": "stdout", 158 | "output_type": "stream", 159 | "text": [ 160 | "Multinomial classifier 
accuracy : 72.0\n" 161 | ] 162 | } 163 | ], 164 | "source": [ 165 | "## TO-DO: build our own naive Bayes algorithm\n", 166 | "# classifier = nltk.NaiveBayesClassifier.train(training_set)\n", 167 | "\n", 168 | "MNB_classifier = SklearnClassifier(MultinomialNB())\n", 169 | "MNB_classifier.train(training_set)\n", 170 | "\n", 171 | "## saving it to a pickle file\n", 172 | "MNB_pickle = open(\"MNB_pickle.pickle\", \"wb\")\n", 173 | "pickle.dump(MNB_classifier, MNB_pickle)\n", 174 | "MNB_pickle.close()\n", 175 | "print(\"Multinomial classifier accuracy : \", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 9, 181 | "metadata": { 182 | "collapsed": false 183 | }, 184 | "outputs": [ 185 | { 186 | "name": "stdout", 187 | "output_type": "stream", 188 | "text": [ 189 | "Bernoulli classifier accuracy : 72.0\n" 190 | ] 191 | } 192 | ], 193 | "source": [ 194 | "## BernoulliNB \n", 195 | "\n", 196 | "BNB_classifier = SklearnClassifier(BernoulliNB())\n", 197 | "BNB_classifier.train(training_set)\n", 198 | "\n", 199 | "BNB_pickle = open(\"BNB_pickle.pickle\", \"wb\")\n", 200 | "pickle.dump(BNB_classifier, BNB_pickle)\n", 201 | "BNB_pickle.close()\n", 202 | "\n", 203 | "print(\"Bernoulli classifier accuracy : \", (nltk.classify.accuracy(BNB_classifier, testing_set))*100)" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 10, 209 | "metadata": { 210 | "collapsed": false 211 | }, 212 | "outputs": [ 213 | { 214 | "name": "stdout", 215 | "output_type": "stream", 216 | "text": [ 217 | "LogisticRegression_classifier accuracy percent: 74.0\n", 218 | "SGDClassifier_classifier accuracy percent: 69.0\n", 219 | "SVC_classifier accuracy percent: 48.0\n", 220 | "LinearSVC_classifier accuracy percent: 74.0\n", 221 | "NuSVC_classifier accuracy percent: 74.0\n" 222 | ] 223 | } 224 | ], 225 | "source": [ 226 | "LogisticRegression_classifier = SklearnClassifier(LogisticRegression())\n", 227 | "LogisticRegression_classifier.train(training_set)\n", 228 | "\n", 229 | "LogisticRegression_pickle = open(\"LogisticRegression.pickle\", \"wb\")\n", 230 | "pickle.dump(LogisticRegression_classifier, LogisticRegression_pickle)\n", 231 | "LogisticRegression_pickle.close()\n", 232 | "\n", 233 | "print(\"LogisticRegression_classifier accuracy percent:\", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)\n", 234 | "\n", 235 | "SGDClassifier_classifier = SklearnClassifier(SGDClassifier())\n", 236 | "SGDClassifier_classifier.train(training_set)\n", 237 | "\n", 238 | "SGDClassifier_pickle = open(\"SGDClassifier.pickle\", \"wb\")\n", 239 | "pickle.dump(SGDClassifier_classifier, SGDClassifier_pickle)\n", 240 | "SGDClassifier_pickle.close()\n", 241 | "\n", 242 | "print(\"SGDClassifier_classifier accuracy percent:\", (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)\n", 243 | "\n", 244 | "SVC_classifier = SklearnClassifier(SVC())\n", 245 | "SVC_classifier.train(training_set)\n", 246 | "\n", 247 | "SVC_classifier_pickle = open(\"SVC_classifier.pickle\", \"wb\")\n", 248 | "pickle.dump(SVC_classifier, SVC_classifier_pickle)\n", 249 | "SVC_classifier_pickle.close()\n", 250 | "\n", 251 | "print(\"SVC_classifier accuracy percent:\", (nltk.classify.accuracy(SVC_classifier, testing_set))*100)\n", 252 | "\n", 253 | "LinearSVC_classifier = SklearnClassifier(LinearSVC())\n", 254 | "LinearSVC_classifier.train(training_set)\n", 255 | "\n", 256 | "LinearSVC_pickle = open(\"LinearSVC.pickle\", \"wb\")\n", 257 | 
"pickle.dump(LinearSVC_classifier, LinearSVC_pickle)\n", 258 | "LinearSVC_pickle.close()\n", 259 | "\n", 260 | "print(\"LinearSVC_classifier accuracy percent:\", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)\n", 261 | "\n", 262 | "NuSVC_classifier = SklearnClassifier(NuSVC())\n", 263 | "NuSVC_classifier.train(training_set)\n", 264 | "\n", 265 | "NuSVC_pickle = open(\"LinearSVC.pickle\", \"wb\")\n", 266 | "pickle.dump(NuSVC_classifier, NuSVC_pickle)\n", 267 | "NuSVC_pickle.close()\n", 268 | "\n", 269 | "print(\"NuSVC_classifier accuracy percent:\", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 12, 275 | "metadata": { 276 | "collapsed": false 277 | }, 278 | "outputs": [ 279 | { 280 | "name": "stdout", 281 | "output_type": "stream", 282 | "text": [ 283 | "Naive bayes classifier accuracy percent: 76.0\n" 284 | ] 285 | } 286 | ], 287 | "source": [ 288 | "### using the old naive_bayes classifier\n", 289 | "naive_bayes_pickle = open(\"naivebayes.pickle\", \"rb\")\n", 290 | "naive_bayes_classifier = pickle.load(naive_bayes_pickle)\n", 291 | "naive_bayes_pickle.close()\n", 292 | "\n", 293 | "print(\"Naive bayes classifier accuracy percent:\", (nltk.classify.accuracy(naive_bayes_classifier, testing_set))*100)" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "## Putting it all together to make a voting system for increasing accuracy\n", 301 | "\n", 302 | "### Check out voting_system.ipynb" 303 | ] 304 | } 305 | ], 306 | "metadata": { 307 | "kernelspec": { 308 | "display_name": "Python 3", 309 | "language": "python", 310 | "name": "python3" 311 | }, 312 | "language_info": { 313 | "codemirror_mode": { 314 | "name": "ipython", 315 | "version": 3 316 | }, 317 | "file_extension": ".py", 318 | "mimetype": "text/x-python", 319 | "name": "python", 320 | "nbconvert_exporter": "python", 321 | "pygments_lexer": "ipython3", 322 | "version": "3.5.1" 323 | } 324 | }, 325 | "nbformat": 4, 326 | "nbformat_minor": 0 327 | } 328 | -------------------------------------------------------------------------------- /nltkNB.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import nltk\n", 12 | "import random\n", 13 | "from nltk.corpus import movie_reviews\n", 14 | "import pprint\n", 15 | "from nltk.corpus import stopwords\n", 16 | "stop_words = stopwords.words(\"english\")\n", 17 | "import pickle" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "**There are a thousand movie reviews for both**\n", 25 | "\n", 26 | "- positive and\n", 27 | "- negetive\n", 28 | "\n", 29 | "**reviews**" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 2, 35 | "metadata": { 36 | "collapsed": false 37 | }, 38 | "outputs": [ 39 | { 40 | "data": { 41 | "text/plain": [ 42 | "['neg', 'pos']" 43 | ] 44 | }, 45 | "execution_count": 2, 46 | "metadata": {}, 47 | "output_type": "execute_result" 48 | } 49 | ], 50 | "source": [ 51 | "movie_reviews.categories()" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "### Now I need to store it as \n", 59 | "\n", 60 | "```python\n", 61 | "documents = [\n", 62 | " ('pos', ['good', 'awesome', ....]), \n", 63 | " ('neg', ['ridiculous', 
'horrible', ...], 'neg')\n", 64 | "]\n", 65 | "```" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 3, 71 | "metadata": { 72 | "collapsed": false 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "documents = [(list(movie_reviews.words(fileid)), category)\n", 77 | " for category in movie_reviews.categories()\n", 78 | " for fileid in movie_reviews.fileids(category)\n", 79 | " ]\n", 80 | "random.shuffle(documents)" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "## Another way to do it: an explicit loop instead of the one-liner" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": { 93 | "collapsed": false 94 | }, 95 | "source": [ 96 | "```python\n", 97 | "document_dict = {\n", 98 | " 'pos': [],\n", 99 | " 'neg': []\n", 100 | "}\n", 101 | "for category in movie_reviews.categories():\n", 102 | " for fileid in movie_reviews.fileids(category):\n", 103 | " # this will store the list of words read from the particular file in fileid\n", 104 | " raw_list = movie_reviews.words(fileid)\n", 105 | " # cleaning the list using stopwords\n", 106 | " word_list = [word for word in raw_list if word not in stop_words]\n", 107 | " if category == 'pos':\n", 108 | " document_dict['pos'].extend(word_list)\n", 109 | " elif category == 'neg':\n", 110 | " document_dict['neg'].extend(word_list)\n", 111 | "```" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "**Getting the list of all words to store the most frequently occurring ones**" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 4, 124 | "metadata": { 125 | "collapsed": true 126 | }, 127 | "outputs": [], 128 | "source": [ 129 | "all_words = []\n", 130 | "for w in movie_reviews.words():\n", 131 | " all_words.append(w.lower())" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "### Making a frequency distribution of the words" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 5, 144 | "metadata": { 145 | "collapsed": false 146 | }, 147 | "outputs": [ 148 | { 149 | "data": { 150 | "text/plain": [ 151 | "[(',', 77717),\n", 152 | " ('the', 76529),\n", 153 | " ('.', 65876),\n", 154 | " ('a', 38106),\n", 155 | " ('and', 35576),\n", 156 | " ('of', 34123),\n", 157 | " ('to', 31937),\n", 158 | " (\"'\", 30585),\n", 159 | " ('is', 25195),\n", 160 | " ('in', 21822),\n", 161 | " ('s', 18513),\n", 162 | " ('\"', 17612),\n", 163 | " ('it', 16107),\n", 164 | " ('that', 15924),\n", 165 | " ('-', 15595),\n", 166 | " (')', 11781),\n", 167 | " ('(', 11664),\n", 168 | " ('as', 11378),\n", 169 | " ('with', 10792),\n", 170 | " ('for', 9961)]" 171 | ] 172 | }, 173 | "execution_count": 5, 174 | "metadata": {}, 175 | "output_type": "execute_result" 176 | } 177 | ], 178 | "source": [ 179 | "all_words = nltk.FreqDist(all_words)\n", 180 | "all_words.most_common(20)" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 6, 186 | "metadata": { 187 | "collapsed": false 188 | }, 189 | "outputs": [ 190 | { 191 | "data": { 192 | "text/plain": [ 193 | "134" 194 | ] 195 | }, 196 | "execution_count": 6, 197 | "metadata": {}, 198 | "output_type": "execute_result" 199 | } 200 | ], 201 | "source": [ 202 | "all_words[\"hate\"] ## counting the occurrences of a single word" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "### We will train on only the top 5000 words in the list" 210 | 
] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 7, 215 | "metadata": { 216 | "collapsed": false 217 | }, 218 | "outputs": [], 219 | "source": [ 220 | "feature_words = list(all_words.keys())[:5000]\n", "# NOTE: keys() is in arbitrary order; all_words.most_common(5000) would pick the actual top 5000 words" 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "### Finding these feature words in a document; a small helper function makes this easy" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": 8, 233 | "metadata": { 234 | "collapsed": true 235 | }, 236 | "outputs": [], 237 | "source": [ 238 | "def find_features(document):\n", 239 | " words = set(document)\n", 240 | " feature = {}\n", 241 | " for w in feature_words:\n", 242 | " feature[w] = (w in words)\n", 243 | " return feature" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "**Before, each document was just a list of words paired with its category. The line below replaces the word list with a feature set (for each frequent word, a boolean saying whether it appears in the document), still paired with the category.**" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": 9, 256 | "metadata": { 257 | "collapsed": false 258 | }, 259 | "outputs": [], 260 | "source": [ 261 | "feature_sets = [(find_features(rev), category) for (rev, category) in documents]" 262 | ] 263 | }, 264 | { 265 | "cell_type": "markdown", 266 | "metadata": {}, 267 | "source": [ 268 | "### Training the classifier" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": 11, 274 | "metadata": { 275 | "collapsed": false 276 | }, 277 | "outputs": [], 278 | "source": [ 279 | "training_set = feature_sets[:1900]\n", 280 | "testing_set = feature_sets[1900:]" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "We won't be telling the machine the category, i.e. whether the document is a positive one or a negative one. We ask it to tell that to us. 
Then we compare its answer to the known category and calculate how accurate it is.\n", 288 | "\n", 289 | "## Naive Bayes algorithm\n", 290 | "\n", 291 | "It states that" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "\\begin{equation*}\n", 299 | "posterior = \\frac{prior \\times likelihood}{evidence}\n", 300 | "\\end{equation*}\n", 301 | "\n", 302 | "Here the posterior is the probability of a category (pos or neg) given the document's features; the prior is how often that category occurs on its own, the likelihood is how probable the features are under that category, and the evidence is how probable the features are overall." 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": 12, 308 | "metadata": { 309 | "collapsed": true 310 | }, 311 | "outputs": [], 312 | "source": [ 313 | "## TO-DO: build our own naive Bayes algorithm\n", 314 | "classifier = nltk.NaiveBayesClassifier.train(training_set)" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 13, 320 | "metadata": { 321 | "collapsed": false 322 | }, 323 | "outputs": [ 324 | { 325 | "name": "stdout", 326 | "output_type": "stream", 327 | "text": [ 328 | "Naive bayes classifier accuracy percentage : 73.0\n" 329 | ] 330 | } 331 | ], 332 | "source": [ 333 | "## Testing its accuracy\n", 334 | "print(\"Naive bayes classifier accuracy percentage : \", (nltk.classify.accuracy(classifier, testing_set))*100)" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": 14, 340 | "metadata": { 341 | "collapsed": false 342 | }, 343 | "outputs": [ 344 | { 345 | "name": "stdout", 346 | "output_type": "stream", 347 | "text": [ 348 | "Most Informative Features\n", 349 | " avoids = True pos : neg = 11.7 : 1.0\n", 350 | " seamless = True pos : neg = 9.7 : 1.0\n", 351 | " conveys = True pos : neg = 9.0 : 1.0\n", 352 | " feeble = True neg : pos = 8.3 : 1.0\n", 353 | " incoherent = True neg : pos = 8.3 : 1.0\n", 354 | " supreme = True pos : neg = 7.0 : 1.0\n", 355 | " construction = True pos : neg = 7.0 : 1.0\n", 356 | " effortlessly = True pos : neg = 7.0 : 1.0\n", 357 | " observes = True pos : neg = 7.0 : 1.0\n", 358 | " compensate = True neg : pos = 7.0 : 1.0\n", 359 | " idiotic = True neg : pos = 7.0 : 1.0\n", 360 | " kudos = True pos : neg = 6.6 : 1.0\n", 361 | " moss = True pos : neg = 6.3 : 1.0\n", 362 | " obstacle = True pos : neg = 6.3 : 1.0\n", 363 | " dewey = True pos : neg = 6.3 : 1.0\n", 364 | " suvari = True neg : pos = 6.3 : 1.0\n", 365 | " regard = True pos : neg = 6.2 : 1.0\n", 366 | " embarrassment = True neg : pos = 6.2 : 1.0\n", 367 | " symbol = True pos : neg = 5.8 : 1.0\n", 368 | " strengths = True pos : neg = 5.8 : 1.0\n" 369 | ] 370 | } 371 | ], 372 | "source": [ 373 | "classifier.show_most_informative_features(20)" 374 | ] 375 | },
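{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check (a sketch added for illustration, reusing `classifier` and `find_features` from above; the label you get depends on the shuffled split), we can classify a hand-written review:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# run a made-up review through the same feature extractor and the trained classifier\n", "toy_review = \"an effortlessly seamless film that conveys real emotion\".split()\n", "print(classifier.classify(find_features(toy_review)))" ] },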
{ 377 | "cell_type": "markdown", 378 | "metadata": {}, 379 | "source": [ 380 | "## Now now, don't get too carried away here.\n", 381 | "\n", 382 | "The results are not that accurate right now. We will be working on that later on.\n", 383 | "\n", 384 | "To read the feature list above, let's take **suvari**: \n", 385 | "\n", 386 | "> **neg : pos = 6.3 : 1.0**\n", 387 | "\n", 388 | "means that it appears **6.3** times more often in **neg** reviews than in **pos** reviews\n", 389 | "\n", 390 | "## Saving the trained algorithm using **Pickle**\n", 391 | "\n", 392 | "We will save the trained Python object so that we can quickly load it again.\n", 393 | "\n", 394 | "_Importing pickle at the top_" 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": 15, 400 | "metadata": { 401 | "collapsed": true 402 | }, 403 | "outputs": [], 404 | "source": [ 405 | "save_classifier = open(\"naivebayes.pickle\", \"wb\") ## 'wb' means write in binary mode\n", 406 | "pickle.dump(classifier, save_classifier)\n", 407 | "save_classifier.close()" 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "metadata": {}, 413 | "source": [ 414 | "**We will now use this classifier in the next file to classify documents**" 415 | ] 416 | } 417 | ], 418 | "metadata": { 419 | "kernelspec": { 420 | "display_name": "Python 3", 421 | "language": "python", 422 | "name": "python3" 423 | }, 424 | "language_info": { 425 | "codemirror_mode": { 426 | "name": "ipython", 427 | "version": 3 428 | }, 429 | "file_extension": ".py", 430 | "mimetype": "text/x-python", 431 | "name": "python", 432 | "nbconvert_exporter": "python", 433 | "pygments_lexer": "ipython3", 434 | "version": "3.5.1" 435 | } 436 | }, 437 | "nbformat": 4, 438 | "nbformat_minor": 0 439 | } 440 | --------------------------------------------------------------------------------