├── .gitignore ├── README.md ├── creating_dataset ├── .ipynb_checkpoints │ └── creating_dataset-checkpoint.ipynb └── creating_dataset.ipynb └── creating_sentiment_scoring_model ├── .ipynb_checkpoints ├── Sentiment_Scoring_BaselineModel-checkpoint.ipynb └── creating_sentiment_analyzer-checkpoint.ipynb ├── Sentiment_Scoring_BaselineModel.ipynb └── draft.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | data/ 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Sentiment Analysis and Natural Language Processing for Marketing 2 | 3 | I'm keeping my documents/source codes related to [this Manning live project](https://www.manning.com/liveproject/sentiment-analysis-and-natural-language-processing-for-marketing). 4 | 5 | #### From the project summary: 6 | 7 | In this liveProject, you can gain an overall impression of the job of a Natural Language Processing (NLP) Specialist working on the Growth Hacking Team of a freshly launched startup that is introducing a new video game to the market. One of the key targets of a growth hacking team is to enhance the massive growth of early startups in a short time. To do so, it introduces strategies with the help of which one can acquire as many customers as possible with the lowest cost as possible. As part of your team’s growth hacking strategy, your boss wants to map the field of the video game market. She aims to find out how customers evaluate your competitors’ products, namely what they like and dislike in a video game. Knowing what makes a video game attractive to a gamer helps the marketing team articulate the message of your product more effectively. 8 | 9 | To be able to find out what makes a video game worth buying according to gamers, you need to get deeper insights into the linguistic features of their utterances. 
As an NLP Specialist, your task will be to analyze customers’ reviews about video games. In order to carry out this task, you will employ different NLP methods. These methods will enable you to acquire a deeper understanding of customer feedback and opinion. 10 | 11 | Your task as an NLP Specialist on the Growth Hacking Team is the following: 12 | 13 | * Download the dataset of Amazon reviews. 14 | * Create your own dataset from the Amazon reviews. 15 | * Decide whether people like or dislike the video game they bought. Label each review with a sentiment score between -1 and 1. 16 | * Check the performance of your sentiment analyzer by comparing the sentiment scores with the review ratings. 17 | * Evaluate the performance of your sentiment analyzer and find out if you managed to correctly label the reviews. 18 | * Try out other methods of sentiment analysis. Explore how people evaluate the video game they purchased by classifying the reviews as positive, negative, and neutral. 19 | 20 | Summarize your results to the Head of the Growth Hacking Team. Based on your findings, list those things that are liked and those ones that are disliked about video games. 21 | 22 | ## Techniques Employed 23 | 24 | In order to get a deeper understanding of people’s opinion about video games, you will employ various NLP techniques. Here is a short list about what you will do and what techniques you will use. 25 | 26 | * Sampling from imbalanced datasets using the imbalanced-learn package 27 | * Enquiring about the sentiment value of the reviews with the dictionary-based sentiment analysis tools, which are part of NLTK, a natural language processing toolkit, used in Python. 28 | * Finding out if your algorithm did a good job. Data evaluation with scikit-learn in Python. 29 | * Analyzing the reviews with a state-of-the-art deep learning technique, namely with the DistilBERT model. To build this model, you will need to run Pytorch, transformers, and the simpletransformers packages. 
30 | * Evaluating your model and creating descriptive statistics in Python with the scikit-learn library before reporting your results to your boss. 31 | * Visualizing your findings about preferable and non-preferable words related to video games using Altair. 32 | 33 | ## Project Outline 34 | 35 | The project is made up of five steps, which build on each other: 36 | 37 | * Creating your dataset. 38 | * Creating a dictionary-based sentiment analyzer. 39 | * Evaluating your dictionary-based sentiment analyzer. 40 | * Creating neural network-based sentiment analyzers. 41 | * Reporting your results. 42 | 43 | Both the steps and the techniques in this project model a real-life scenario. If you are employed as an NLP Specialist, you are likely to work on tasks like these. 44 | 45 | ## Dataset 46 | 47 | The Amazon review dataset can be downloaded from [here](https://nijianmo.github.io/amazon/index.html). Download the zipped JSON file for the Video Games 5-core category, which is listed under "Small subsets for experimentation". Once the download is complete, extract the file. 48 | 49 | -------------------------------------------------------------------------------- /creating_dataset/.ipynb_checkpoints/creating_dataset-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Objective\n", 8 | "* __Create your own dataset that contains a random sample of reviews__\n", 9 | "\n", 10 | "## Workflow\n", 11 | "\n", 12 | "1. Read the video game review data. Take a look at the text of the reviews and the ratings, which you will work with in this milestone. Note that your data is not pure JSON, but newline-delimited JSON. To be able to read it, install and import ndjson.\n", 13 | "2. Create a plot of the ratings of the product. Study the distribution of the five categories.\n", 14 | "3. 
Take a random sample of the reviews by selecting 1500 reviews with rating 1, 500 reviews each with ratings 2, 3, and 4, and 1500 reviews with rating 5. This way you get a smaller __balanced__ corpus, on which you will work during Milestones 2-4.\n", 15 | "4. Take a random sample of the reviews by selecting 100,000 reviews. This way you get a bigger representative corpus, on which you will work in Milestones 4 and 5.\n", 16 | " * If you want results identical to the sample solution, use 42 as the random state.\n", 17 | "5. Export your corpora to two separate .csv files. Both of your tables should contain a column for the reviews and a column for the ratings. From now on, we call the JSON key reviewText “reviews” and the JSON key overall “ratings.” Name your corpora small_corpus and big_corpus.\n" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 24, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "import ndjson\n", 27 | "import pandas as pd\n", 28 | "import numpy as np\n", 29 | "import seaborn as sns" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 2, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "# reading reviews from json file\n", 39 | "with open('../data/Video_Games_5.json') as f:\n", 40 | " data = ndjson.load(f)" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 3, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "reviews_df = pd.DataFrame(data)" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 4, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/html": [ 60 | "
" 171 | ], 172 | "text/plain": [ 173 | " overall verified reviewTime reviewerID asin \\\n", 174 | "0 5.0 True 10 17, 2015 A1HP7NVNPFMA4N 0700026657 \n", 175 | "1 4.0 False 07 27, 2015 A1JGAP0185YJI6 0700026657 \n", 176 | "2 3.0 True 02 23, 2015 A1YJWEXHQBWK2B 0700026657 \n", 177 | "3 2.0 True 02 20, 2015 A2204E1TH211HT 0700026657 \n", 178 | "4 5.0 True 12 25, 2014 A2RF5B5H74JLPE 0700026657 \n", 179 | "\n", 180 | " reviewerName reviewText \\\n", 181 | "0 Ambrosia075 This game is a bit hard to get the hang of, bu... \n", 182 | "1 travis I played it a while but it was alright. The st... \n", 183 | "2 Vincent G. Mezera ok game. \n", 184 | "3 Grandma KR found the game a bit too complicated, not what... \n", 185 | "4 jon great game, I love it and have played it since... \n", 186 | "\n", 187 | " summary unixReviewTime vote style \\\n", 188 | "0 but when you do it's great. 1445040000 NaN NaN \n", 189 | "1 But in spite of that it was fun, I liked it 1437955200 NaN NaN \n", 190 | "2 Three Stars 1424649600 NaN NaN \n", 191 | "3 Two Stars 1424390400 NaN NaN \n", 192 | "4 love this game 1419465600 NaN NaN \n", 193 | "\n", 194 | " image \n", 195 | "0 NaN \n", 196 | "1 NaN \n", 197 | "2 NaN \n", 198 | "3 NaN \n", 199 | "4 NaN " 200 | ] 201 | }, 202 | "execution_count": 4, 203 | "metadata": {}, 204 | "output_type": "execute_result" 205 | } 206 | ], 207 | "source": [ 208 | "reviews_df.head()" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "### Data Dictionary\n", 216 | " * **reviewerID** - ID of the reviewer, e.g. A2SUAM1J3GNN3B\n", 217 | " * **asin** - ID of the product, e.g. 
0000013714\n", 218 | " * **reviewerName** - name of the reviewer\n", 219 | " * **vote** - helpful votes of the review\n", 220 | " * **style** - a dictionary of the product metadata, e.g., \"Format\" is \"Hardcover\"\n", 221 | " * **reviewText** - text of the review\n", 222 | " * **overall** - rating of the product\n", 223 | " * **summary** - summary of the review\n", 224 | " * **unixReviewTime** - time of the review (Unix time)\n", 225 | " * **reviewTime** - time of the review (raw)\n", 226 | " * **image** - images that users post after they have received the product\n" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 5, 232 | "metadata": {}, 233 | "outputs": [ 234 | { 235 | "data": { 236 | "text/plain": [ 237 | "(497577, 12)" 238 | ] 239 | }, 240 | "execution_count": 5, 241 | "metadata": {}, 242 | "output_type": "execute_result" 243 | } 244 | ], 245 | "source": [ 246 | "reviews_df.shape" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": 6, 252 | "metadata": {}, 253 | "outputs": [ 254 | { 255 | "name": "stdout", 256 | "output_type": "stream", 257 | "text": [ 258 | "\n", 259 | "RangeIndex: 497577 entries, 0 to 497576\n", 260 | "Data columns (total 12 columns):\n", 261 | " # Column Non-Null Count Dtype \n", 262 | "--- ------ -------------- ----- \n", 263 | " 0 overall 497577 non-null float64\n", 264 | " 1 verified 497577 non-null bool \n", 265 | " 2 reviewTime 497577 non-null object \n", 266 | " 3 reviewerID 497577 non-null object \n", 267 | " 4 asin 497577 non-null object \n", 268 | " 5 reviewerName 497501 non-null object \n", 269 | " 6 reviewText 497419 non-null object \n", 270 | " 7 summary 497468 non-null object \n", 271 | " 8 unixReviewTime 497577 non-null int64 \n", 272 | " 9 vote 107793 non-null object \n", 273 | " 10 style 289237 non-null object \n", 274 | " 11 image 3634 non-null object \n", 275 | "dtypes: bool(1), float64(1), int64(1), object(9)\n", 276 | "memory usage: 42.2+ MB\n" 277 | ] 278 | } 279 
| ], 280 | "source": [ 281 | "reviews_df.info()" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": 7, 287 | "metadata": {}, 288 | "outputs": [ 289 | { 290 | "data": { 291 | "text/plain": [ 292 | "" 293 | ] 294 | }, 295 | "execution_count": 7, 296 | "metadata": {}, 297 | "output_type": "execute_result" 298 | }, 299 | { 300 | "data": { 301 | "image/png": "[Figure: count plot of reviews_df by overall rating; base64 PNG data omitted]", 302 | "text/plain": [ 303 | "
" 304 | ] 305 | }, 306 | "metadata": { 307 | "needs_background": "light" 308 | }, 309 | "output_type": "display_data" 310 | } 311 | ], 312 | "source": [ 313 | "sns.countplot(data = reviews_df, x='overall')" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": 8, 319 | "metadata": {}, 320 | "outputs": [ 321 | { 322 | "data": { 323 | "text/plain": [ 324 | "17408" 325 | ] 326 | }, 327 | "execution_count": 8, 328 | "metadata": {}, 329 | "output_type": "execute_result" 330 | } 331 | ], 332 | "source": [ 333 | "len(reviews_df['asin'].value_counts(dropna=False))" 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [ 340 | "### Undersampling of Reviews\n", 341 | "Take a random sample of the reviews by selecting 1500 reviews with rating 1, 500 reviews each with ratings 2, 3, and 4, and 1500 reviews with rating 5. This way you get a smaller balanced corpus, on which you will work during Milestones 2-4.\n" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": 12, 347 | "metadata": {}, 348 | "outputs": [], 349 | "source": [ 350 | "one_1500 = reviews_df[reviews_df['overall']==1.0].sample(n=1500)\n", 351 | "two_500 = reviews_df[reviews_df['overall']==2.0].sample(n=500)\n", 352 | "three_500 = reviews_df[reviews_df['overall']==3.0].sample(n=500)\n", 353 | "four_500 = reviews_df[reviews_df['overall']==4.0].sample(n=500)\n", 354 | "five_1500 = reviews_df[reviews_df['overall']==5.0].sample(n=1500)" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": 18, 360 | "metadata": {}, 361 | "outputs": [], 362 | "source": [ 363 | "undersampled_reviews = pd.concat([one_1500, two_500, three_500, four_500, five_1500], axis=0)" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": 19, 369 | "metadata": {}, 370 | "outputs": [ 371 | { 372 | "data": { 373 | "text/plain": [ 374 | "5.0 1500\n", 375 | "1.0 1500\n", 376 | "4.0 500\n", 377 | "3.0 500\n", 378 | "2.0 
500\n", 
379 | "Name: overall, dtype: int64" 380 | ] 381 | }, 382 | "execution_count": 19, 383 | "metadata": {}, 384 | "output_type": "execute_result" 385 | } 386 | ], 387 | "source": [ 388 | "undersampled_reviews['overall'].value_counts(dropna=False)" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": 20, 394 | "metadata": {}, 395 | "outputs": [ 396 | { 397 | "data": { 398 | "text/plain": [ 399 | "" 400 | ] 401 | }, 402 | "execution_count": 20, 403 | "metadata": {}, 404 | "output_type": "execute_result" 405 | }, 406 | { 407 | "data": { 408 | "image/png": "[Figure: count plot of undersampled_reviews by overall rating; base64 PNG data omitted]", 409 | "text/plain": [ 410 | "
" 411 | ] 412 | }, 413 | "metadata": { 414 | "needs_background": "light" 415 | }, 416 | "output_type": "display_data" 417 | } 418 | ], 419 | "source": [ 420 | "sns.countplot(data=undersampled_reviews, x='overall')" 421 | ] 422 | }, 423 | { 424 | "cell_type": "markdown", 425 | "metadata": {}, 426 | "source": [ 427 | "### Random Sampling of 100K Reviews" 428 | ] 429 | }, 430 | { 431 | "cell_type": "code", 432 | "execution_count": 17, 433 | "metadata": {}, 434 | "outputs": [], 435 | "source": [ 436 | "sample_100K_revs = reviews_df.sample(n=100000, random_state=42)" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "metadata": {}, 442 | "source": [ 443 | "### Writing Corpora" 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": 22, 449 | "metadata": {}, 450 | "outputs": [], 451 | "source": [ 452 | "undersampled_reviews.to_csv(\"../data/small_corpus.csv\")" 453 | ] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "execution_count": 23, 458 | "metadata": {}, 459 | "outputs": [], 460 | "source": [ 461 | "sample_100K_revs.to_csv(\"../data/big_corpus.csv\")" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": null, 467 | "metadata": {}, 468 | "outputs": [], 469 | "source": [] 470 | } 471 | ], 472 | "metadata": { 473 | "kernelspec": { 474 | "display_name": "Python 3", 475 | "language": "python", 476 | "name": "python3" 477 | }, 478 | "language_info": { 479 | "codemirror_mode": { 480 | "name": "ipython", 481 | "version": 3 482 | }, 483 | "file_extension": ".py", 484 | "mimetype": "text/x-python", 485 | "name": "python", 486 | "nbconvert_exporter": "python", 487 | "pygments_lexer": "ipython3", 488 | "version": "3.6.12" 489 | } 490 | }, 491 | "nbformat": 4, 492 | "nbformat_minor": 4 493 | } 494 | -------------------------------------------------------------------------------- /creating_dataset/creating_dataset.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 
 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Objective\n", 8 | "* __Create your own dataset that contains a random sample of reviews__\n", 9 | "\n", 10 | "## Workflow\n", 11 | "\n", 12 | "1. Read the video game review data. Take a look at the text of the reviews and the ratings, which you will work with in this milestone. Note that your data is not pure JSON, but newline-delimited JSON. To be able to read it, install and import ndjson.\n", 13 | "2. Create a plot of the ratings of the product. Study the distribution of the five categories.\n", 14 | "3. Take a random sample of the reviews by selecting 1500 reviews with rating 1, 500 reviews each with ratings 2, 3, and 4, and 1500 reviews with rating 5. This way you get a smaller __balanced__ corpus, on which you will work during Milestones 2-4.\n", 15 | "4. Take a random sample of the reviews by selecting 100,000 reviews. This way you get a bigger representative corpus, on which you will work in Milestones 4 and 5.\n", 16 | " * If you want to get results identical to the sample solution, use 42 as a random state.\n", 17 | "5. Export your corpora to two separate .csv files. Both of your tables should contain a column for the reviews and a column for the ratings. 
From now on we call the review text of the JSON key “reviews” and the overall key “ratings.” Name your corpora small_corpus and big_corpus.\n" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 1, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "import ndjson\n", 27 | "import pandas as pd\n", 28 | "import numpy as np\n", 29 | "import seaborn as sns" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 2, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "# reading reviews from json file\n", 39 | "with open('../data/Video_Games_5.json') as f:\n", 40 | " data = ndjson.load(f)" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 3, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "reviews_df = pd.DataFrame(data)" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 4, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/html": [ 60 | "
" 171 | ], 172 | "text/plain": [ 173 | " overall verified reviewTime reviewerID asin \\\n", 174 | "0 5.0 True 10 17, 2015 A1HP7NVNPFMA4N 0700026657 \n", 175 | "1 4.0 False 07 27, 2015 A1JGAP0185YJI6 0700026657 \n", 176 | "2 3.0 True 02 23, 2015 A1YJWEXHQBWK2B 0700026657 \n", 177 | "3 2.0 True 02 20, 2015 A2204E1TH211HT 0700026657 \n", 178 | "4 5.0 True 12 25, 2014 A2RF5B5H74JLPE 0700026657 \n", 179 | "\n", 180 | " reviewerName reviewText \\\n", 181 | "0 Ambrosia075 This game is a bit hard to get the hang of, bu... \n", 182 | "1 travis I played it a while but it was alright. The st... \n", 183 | "2 Vincent G. Mezera ok game. \n", 184 | "3 Grandma KR found the game a bit too complicated, not what... \n", 185 | "4 jon great game, I love it and have played it since... \n", 186 | "\n", 187 | " summary unixReviewTime vote style \\\n", 188 | "0 but when you do it's great. 1445040000 NaN NaN \n", 189 | "1 But in spite of that it was fun, I liked it 1437955200 NaN NaN \n", 190 | "2 Three Stars 1424649600 NaN NaN \n", 191 | "3 Two Stars 1424390400 NaN NaN \n", 192 | "4 love this game 1419465600 NaN NaN \n", 193 | "\n", 194 | " image \n", 195 | "0 NaN \n", 196 | "1 NaN \n", 197 | "2 NaN \n", 198 | "3 NaN \n", 199 | "4 NaN " 200 | ] 201 | }, 202 | "execution_count": 4, 203 | "metadata": {}, 204 | "output_type": "execute_result" 205 | } 206 | ], 207 | "source": [ 208 | "reviews_df.head()" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "### Data Dictionary\n", 216 | " * __reviewerID__ - ID of the reviewer, e.g. A2SUAM1J3GNN3B\n", 217 | " * __asin__ - ID of the product, e.g. 
0000013714\n", 218 | " * **reviewerName** - name of the reviewer\n", 219 | " * **vote** - helpful votes of the review\n", 220 | " * **style** - a dictionary of the product metadata, e.g., \"Format\" is \"Hardcover\"\n", 221 | " * **reviewText** - text of the review\n", 222 | " * **overall** - rating of the product\n", 223 | " * **summary** - summary of the review\n", 224 | " * **unixReviewTime** - time of the review (Unix time)\n", 225 | " * **reviewTime** - time of the review (raw)\n", 226 | " * **image** - images that users post after they have received the product\n" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 5, 232 | "metadata": {}, 233 | "outputs": [ 234 | { 235 | "data": { 236 | "text/plain": [ 237 | "(497577, 12)" 238 | ] 239 | }, 240 | "execution_count": 5, 241 | "metadata": {}, 242 | "output_type": "execute_result" 243 | } 244 | ], 245 | "source": [ 246 | "reviews_df.shape" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": 6, 252 | "metadata": {}, 253 | "outputs": [ 254 | { 255 | "name": "stdout", 256 | "output_type": "stream", 257 | "text": [ 258 | "<class 'pandas.core.frame.DataFrame'>\n", 259 | "RangeIndex: 497577 entries, 0 to 497576\n", 260 | "Data columns (total 12 columns):\n", 261 | " # Column Non-Null Count Dtype \n", 262 | "--- ------ -------------- ----- \n", 263 | " 0 overall 497577 non-null float64\n", 264 | " 1 verified 497577 non-null bool \n", 265 | " 2 reviewTime 497577 non-null object \n", 266 | " 3 reviewerID 497577 non-null object \n", 267 | " 4 asin 497577 non-null object \n", 268 | " 5 reviewerName 497501 non-null object \n", 269 | " 6 reviewText 497419 non-null object \n", 270 | " 7 summary 497468 non-null object \n", 271 | " 8 unixReviewTime 497577 non-null int64 \n", 272 | " 9 vote 107793 non-null object \n", 273 | " 10 style 289237 non-null object \n", 274 | " 11 image 3634 non-null object \n", 275 | "dtypes: bool(1), float64(1), int64(1), object(9)\n", 276 | "memory usage: 42.2+ MB\n" 277 | ] 278 | } 279 
| ], 280 | "source": [ 281 | "reviews_df.info()" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": 7, 287 | "metadata": {}, 288 | "outputs": [ 289 | { 290 | "data": { 291 | "text/plain": [ 292 | "" 293 | ] 294 | }, 295 | "execution_count": 7, 296 | "metadata": {}, 297 | "output_type": "execute_result" 298 | }, 299 | { 300 | "data": { 301 | "image/png": "[base64-encoded PNG omitted: seaborn count plot of the 'overall' rating in reviews_df]\n", 302 | "text/plain": [ 303 | "
" 304 | ] 305 | }, 306 | "metadata": { 307 | "needs_background": "light" 308 | }, 309 | "output_type": "display_data" 310 | } 311 | ], 312 | "source": [ 313 | "sns.countplot(data=reviews_df, x='overall')" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": 8, 319 | "metadata": {}, 320 | "outputs": [ 321 | { 322 | "data": { 323 | "text/plain": [ 324 | "17408" 325 | ] 326 | }, 327 | "execution_count": 8, 328 | "metadata": {}, 329 | "output_type": "execute_result" 330 | } 331 | ], 332 | "source": [ 333 | "len(reviews_df['asin'].value_counts(dropna=False))" 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [ 340 | "### Undersampling of Reviews\n", 341 | "Take a random sample of the reviews by selecting 1500 reviews with rating 1, 500 reviews each with ratings 2, 3, and 4, and 1500 reviews with rating 5. This yields a smaller balanced corpus to work on during Milestones 2-4.\n" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": 9, 347 | "metadata": {}, 348 | "outputs": [], 349 | "source": [ 350 | "one_1500 = reviews_df[reviews_df['overall']==1.0].sample(n=1500)\n", 351 | "two_500 = reviews_df[reviews_df['overall']==2.0].sample(n=500)\n", 352 | "three_500 = reviews_df[reviews_df['overall']==3.0].sample(n=500)\n", 353 | "four_500 = reviews_df[reviews_df['overall']==4.0].sample(n=500)\n", 354 | "five_1500 = reviews_df[reviews_df['overall']==5.0].sample(n=1500)" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": 10, 360 | "metadata": {}, 361 | "outputs": [], 362 | "source": [ 363 | "undersampled_reviews = pd.concat([one_1500, two_500, three_500, four_500, five_1500], axis=0)" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": 11, 369 | "metadata": {}, 370 | "outputs": [ 371 | { 372 | "data": { 373 | "text/plain": [ 374 | "5.0 1500\n", 375 | "1.0 1500\n", 376 | "4.0 500\n", 377 | "3.0 500\n", 378 | "2.0 500\n", 379 
| "Name: overall, dtype: int64" 380 | ] 381 | }, 382 | "execution_count": 11, 383 | "metadata": {}, 384 | "output_type": "execute_result" 385 | } 386 | ], 387 | "source": [ 388 | "undersampled_reviews['overall'].value_counts(dropna=False)" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": 12, 394 | "metadata": {}, 395 | "outputs": [ 396 | { 397 | "data": { 398 | "text/plain": [ 399 | "" 400 | ] 401 | }, 402 | "execution_count": 12, 403 | "metadata": {}, 404 | "output_type": "execute_result" 405 | }, 406 | { 407 | "data": { 408 | "image/png": "[base64-encoded PNG omitted: seaborn count plot of the 'overall' rating in undersampled_reviews]\n", 409 | "text/plain": [ 410 | "
" 411 | ] 412 | }, 413 | "metadata": { 414 | "needs_background": "light" 415 | }, 416 | "output_type": "display_data" 417 | } 418 | ], 419 | "source": [ 420 | "sns.countplot(data=undersampled_reviews, x='overall')" 421 | ] 422 | }, 423 | { 424 | "cell_type": "markdown", 425 | "metadata": {}, 426 | "source": [ 427 | "### Random Sampling of 100K Reviews" 428 | ] 429 | }, 430 | { 431 | "cell_type": "code", 432 | "execution_count": 13, 433 | "metadata": {}, 434 | "outputs": [], 435 | "source": [ 436 | "sample_100K_revs = reviews_df.sample(n=100000, random_state=42)" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "metadata": {}, 442 | "source": [ 443 | "### Writing Corpora" 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": 16, 449 | "metadata": {}, 450 | "outputs": [], 451 | "source": [ 452 | "undersampled_reviews.to_csv(\"../data/small_corpus.csv\", index=False)" 453 | ] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "execution_count": 17, 458 | "metadata": {}, 459 | "outputs": [], 460 | "source": [ 461 | "sample_100K_revs.to_csv(\"../data/big_corpus.csv\", index=False)" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": null, 467 | "metadata": {}, 468 | "outputs": [], 469 | "source": [] 470 | } 471 | ], 472 | "metadata": { 473 | "kernelspec": { 474 | "display_name": "Python 3", 475 | "language": "python", 476 | "name": "python3" 477 | }, 478 | "language_info": { 479 | "codemirror_mode": { 480 | "name": "ipython", 481 | "version": 3 482 | }, 483 | "file_extension": ".py", 484 | "mimetype": "text/x-python", 485 | "name": "python", 486 | "nbconvert_exporter": "python", 487 | "pygments_lexer": "ipython3", 488 | "version": "3.6.12" 489 | } 490 | }, 491 | "nbformat": 4, 492 | "nbformat_minor": 4 493 | } 494 | -------------------------------------------------------------------------------- /creating_sentiment_scoring_model/draft.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Creating a Dictionary-based Sentiment Analyzer" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import pandas as pd\n", 17 | "import nltk\n", 18 | "from IPython.display import display\n", 19 | "pd.set_option('display.max_columns', None)" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "### Step 1: Loading in the small_corpus .csv file created in the \"creating_dataset\" milestone." 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 2, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "reviews = pd.read_csv(\"../data/small_corpus.csv\")" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 3, 41 | "metadata": {}, 42 | "outputs": [ 43 | { 44 | "data": { 45 | "text/html": [ 46 | "
" 157 | ], 158 | "text/plain": [ 159 | " overall verified reviewTime reviewerID asin reviewerName \\\n", 160 | "0 1.0 True 11 30, 2015 A3AC92K59QLYR8 B00503E8S2 ben \n", 161 | "1 1.0 False 05 19, 2012 A334LHR8DWARY8 B00178630A Xenocide \n", 162 | "2 1.0 True 10 19, 2014 A28982ODE7ZGVP B001AWIP7M Eric Frykberg \n", 163 | "3 1.0 True 09 6, 2015 A19E85RLQCAMI1 B00NASF4MS Joe \n", 164 | "4 1.0 False 05 28, 2008 AEMQKS13WC4D2 B00140P9BA Craig \n", 165 | "\n", 166 | " reviewText \\\n", 167 | "0 Game freezes over and over its unplayable \n", 168 | "1 I have no problem with needing to be online to... \n", 169 | "2 NOT GOOD \n", 170 | "3 Really not worth the money to buy this game on... \n", 171 | "4 They need to eliminate the Securom. I purchase... \n", 172 | "\n", 173 | " summary unixReviewTime vote \\\n", 174 | "0 it just doesn't work 1448841600 NaN \n", 175 | "1 The only real way to show Blizzard our feeling... 1337385600 23 \n", 176 | "2 One Star 1413676800 NaN \n", 177 | "3 Really not worth the money to buy this game on... 1441497600 2 \n", 178 | "4 Securom can ruin a great game 1211932800 55 \n", 179 | "\n", 180 | " style image \n", 181 | "0 {'Format:': ' Video Game'} NaN \n", 182 | "1 {'Format:': ' Computer Game'} NaN \n", 183 | "2 {'Format:': ' Video Game'} NaN \n", 184 | "3 {'Format:': ' Video Game'} NaN \n", 185 | "4 {'Format:': ' DVD-ROM'} NaN " 186 | ] 187 | }, 188 | "execution_count": 3, 189 | "metadata": {}, 190 | "output_type": "execute_result" 191 | } 192 | ], 193 | "source": [ 194 | "reviews.head()" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "### Step 2: Tokenizing the sentences and words of the reviews\n", 202 | "Here, we're going to test different word tokenizers on the reviews. We'll then decide which tokenizer is better suited to the task." 
203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "### Treebank Word Tokenizer" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 4, 215 | "metadata": {}, 216 | "outputs": [], 217 | "source": [ 218 | "from nltk.tokenize import TreebankWordTokenizer\n", 219 | "from string import punctuation\n", 220 | "import string" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 5, 226 | "metadata": {}, 227 | "outputs": [], 228 | "source": [ 229 | "tb_tokenizer = TreebankWordTokenizer()" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 6, 235 | "metadata": {}, 236 | "outputs": [], 237 | "source": [ 238 | "reviews[\"rev_text_lower\"] = reviews['reviewText'].apply(lambda rev: str(rev)\\\n", 239 | " .translate(str.maketrans('', '', punctuation))\\\n", 240 | " .replace(\"
\", \" \")\\\n", 241 | " .lower())" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 7, 247 | "metadata": {}, 248 | "outputs": [ 249 | { 250 | "data": { 251 | "text/html": [ 252 | "
" 288 | ], 289 | "text/plain": [ 290 | " reviewText \\\n", 291 | "2451 Front case of the game was kind of damaged.. B... \n", 292 | "2980 First off, this is a great gaming mouse for th... \n", 293 | "\n", 294 | " rev_text_lower \n", 295 | "2451 front case of the game was kind of damaged but... \n", 296 | "2980 first off this is a great gaming mouse for the... " 297 | ] 298 | }, 299 | "execution_count": 7, 300 | "metadata": {}, 301 | "output_type": "execute_result" 302 | } 303 | ], 304 | "source": [ 305 | "reviews[['reviewText','rev_text_lower']].sample(2)" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": 8, 311 | "metadata": {}, 312 | "outputs": [], 313 | "source": [ 314 | "reviews[\"tb_tokens\"] = reviews['rev_text_lower'].apply(lambda rev: tb_tokenizer.tokenize(str(rev)))" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 9, 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [ 323 | "pd.set_option('display.max_colwidth', None)" 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": 10, 329 | "metadata": {}, 330 | "outputs": [ 331 | { 332 | "data": { 333 | "text/html": [ 334 | "
" 375 | ], 376 | "text/plain": [ 377 | " reviewText \\\n", 378 | "2355 First of all I would like to say this game is not nearly as difficult as many people claim. It is very unforgiving of mistakes, but most of the \"trash\" mobs are harder than the bosses.. Many areas/fights you just will die unless you read a guide or a player left a hint beforehand, because there is no other way to know how you need to prepare. The fights are fun, and normally being hit more than once without healing up means you die, but really once you figure out (or look up) what you need to do for the fight it is extremely easy. When people talk about how brutal it is all they really mean is having to fight your way back to the souls you lost or losing them forever if you die again.. not challenging, just unforgiving.\\n\\nThe tendency system is extremely annoying. To gain pure white character tendency you need to kill black phantoms. To kill black phantoms (other than players) you need pure black world tendency in that area. For that you need to kill yourself several times in body form (and resurrecting with a stone to kill your body again.) The most frustrating part of all this is if you log in the game while online it resets your world tendency, and can reset it randomly while your playing online.. Pure black is a bit easier, just need to kill some friendly npcs. Its like watching paint dry.. why people love doing this is beyond me.\\n\\nThere is also an area where you literally trudge around in water picking up loot with very little fighting at like 50% reduced walk speed and chug poison dispels since the water randomly poisons you for several hours.. seriously? Granted you can skip the loot and get through it all in under an hour.. but why have an area like that at all? \n", 379 | "3269 awesome! \n", 380 | "4006 I would recommend buying this It was just as advertised. 
Also it came on the time and day that I was told it would come \n", 381 | "\n", 382 | " tb_tokens \n", 383 | "2355 [first, of, all, i, would, like, to, say, this, game, is, not, nearly, as, difficult, as, many, people, claim, it, is, very, unforgiving, of, mistakes, but, most, of, the, trash, mobs, are, harder, than, the, bosses, many, areasfights, you, just, will, die, unless, you, read, a, guide, or, a, player, left, a, hint, beforehand, because, there, is, no, other, way, to, know, how, you, need, to, prepare, the, fights, are, fun, and, normally, being, hit, more, than, once, without, healing, up, means, you, die, but, really, once, you, figure, out, or, look, up, what, you, need, to, do, for, the, ...] \n", 384 | "3269 [awesome] \n", 385 | "4006 [i, would, recommend, buying, this, it, was, just, as, advertised, also, it, came, on, the, time, and, day, that, i, was, told, it, would, come] " 386 | ] 387 | }, 388 | "execution_count": 10, 389 | "metadata": {}, 390 | "output_type": "execute_result" 391 | } 392 | ], 393 | "source": [ 394 | "reviews[['reviewText','tb_tokens']].sample(3)" 395 | ] 396 | }, 397 | { 398 | "cell_type": "markdown", 399 | "metadata": {}, 400 | "source": [ 401 | "### Casual Tokenizer" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": 11, 407 | "metadata": {}, 408 | "outputs": [], 409 | "source": [ 410 | "from nltk.tokenize.casual import casual_tokenize" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 12, 416 | "metadata": {}, 417 | "outputs": [], 418 | "source": [ 419 | "reviews['casual_tokens'] = reviews['rev_text_lower'].apply(lambda rev: casual_tokenize(str(rev)))" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": 13, 425 | "metadata": {}, 426 | "outputs": [ 427 | { 428 | "data": { 429 | "text/html": [ 430 | "
" 475 | ], 476 | "text/plain": [ 477 | " reviewText \\\n", 478 | "3430 Beatiful! Create a new look. Excellent. Nice quality. Recommended. \n", 479 | "1813 It was only three chapters in when the game froze during a cut scene and began a high pitched scream for about 30 seconds and then went on like normal. Every cut scene the game decided I didn't want to be holding the weapon I was holding and replaced it with a pistol. Why on earth would I drop a shotgun full of ammo for a pistol with half a clip? Who knows? This isn't the worst game I've ever played and the cut scenes can be enjoyable when they don't look like crap. I would only suggest this to people who enjoy tedious shoot outs, hard boiled detective stories, and don't mind a flawed game. If you can forgive all that's wrong with this game... it might be fun. I for one can't. \n", 480 | "1340 I have loved every Blizzard game but this one looks to be their first BIG flop. I guess it had to happen sooner or later. They must have gotten to big and are getting dragged down by to many cooks and kitchen(Bad management that used to be great talent but now camp and only bog down the real creatives). It is obvious that dragging out release dates is no longer for a high quality products and is just to cobble together a mesh-mash of yesteryear's leftovers. Granted all they do is re-release the same three games :-P but so far they have been doing great at improving them and adding to the muiltiplayer experience and normally have a good story line to go with it. On a side note I can believe it takes them more than two years to try to make the second part/ expansion of StarCraft II. Diablo 3 is an EXACT copy of Diablo 2. This may sound like a good thing but computers have change a little in the last 10 years and it is just not up to snuff. The story line is also very weak. My advice if you want the feeling of playing Diablo again remember what made it great. 
It was the first of the MMORPG's it was not quite there but it got people thinking about what could be done and playing together online. This could have been a great blend bringing the classic dungeon game to somewhat massive multiplayer but with only one to a max of 3 other people in your game it is a compleate failure and a relic of the past nothing new to see here move along and find a real MMORPG or a rpg or strategy game that can support at least 8 to 12 players in the same game. \n", 481 | "\n", 482 | " casual_tokens \\\n", 483 | "3430 [beatiful, create, a, new, look, excellent, nice, quality, recommended] \n", 484 | "1813 [it, was, only, three, chapters, in, when, the, game, froze, during, a, cut, scene, and, began, a, high, pitched, scream, for, about, 30, seconds, and, then, went, on, like, normal, every, cut, scene, the, game, decided, i, didnt, want, to, be, holding, the, weapon, i, was, holding, and, replaced, it, with, a, pistol, why, on, earth, would, i, drop, a, shotgun, full, of, ammo, for, a, pistol, with, half, a, clip, who, knows, this, isnt, the, worst, game, ive, ever, played, and, the, cut, scenes, can, be, enjoyable, when, they, dont, look, like, crap, i, would, only, suggest, this, to, ...] \n", 485 | "1340 [i, have, loved, every, blizzard, game, but, this, one, looks, to, be, their, first, big, flop, i, guess, it, had, to, happen, sooner, or, later, they, must, have, gotten, to, big, and, are, getting, dragged, down, by, to, many, cooks, and, kitchenbad, management, that, used, to, be, great, talent, but, now, camp, and, only, bog, down, the, real, creatives, it, is, obvious, that, dragging, out, release, dates, is, no, longer, for, a, high, quality, products, and, is, just, to, cobble, together, a, meshmash, of, yesteryears, leftovers, granted, all, they, do, is, rerelease, the, same, three, games, p, but, so, far, ...] 
\n", 486 | "\n", 487 | " tb_tokens \n", 488 | "3430 [beatiful, create, a, new, look, excellent, nice, quality, recommended] \n", 489 | "1813 [it, was, only, three, chapters, in, when, the, game, froze, during, a, cut, scene, and, began, a, high, pitched, scream, for, about, 30, seconds, and, then, went, on, like, normal, every, cut, scene, the, game, decided, i, didnt, want, to, be, holding, the, weapon, i, was, holding, and, replaced, it, with, a, pistol, why, on, earth, would, i, drop, a, shotgun, full, of, ammo, for, a, pistol, with, half, a, clip, who, knows, this, isnt, the, worst, game, ive, ever, played, and, the, cut, scenes, can, be, enjoyable, when, they, dont, look, like, crap, i, would, only, suggest, this, to, ...] \n", 490 | "1340 [i, have, loved, every, blizzard, game, but, this, one, looks, to, be, their, first, big, flop, i, guess, it, had, to, happen, sooner, or, later, they, must, have, gotten, to, big, and, are, getting, dragged, down, by, to, many, cooks, and, kitchenbad, management, that, used, to, be, great, talent, but, now, camp, and, only, bog, down, the, real, creatives, it, is, obvious, that, dragging, out, release, dates, is, no, longer, for, a, high, quality, products, and, is, just, to, cobble, together, a, meshmash, of, yesteryears, leftovers, granted, all, they, do, is, rerelease, the, same, three, games, p, but, so, far, ...] " 491 | ] 492 | }, 493 | "execution_count": 13, 494 | "metadata": {}, 495 | "output_type": "execute_result" 496 | } 497 | ], 498 | "source": [ 499 | "reviews[['reviewText','casual_tokens','tb_tokens']].sample(3)" 500 | ] 501 | }, 502 | { 503 | "cell_type": "markdown", 504 | "metadata": {}, 505 | "source": [ 506 | "### Removing StopWords\n", 507 | "This part has been commented out, because removing stop words hurts sentiment analysis: negations such as no and not are on the stop-word list."
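To illustrate why stop-word removal can distort sentiment, here is a minimal sketch using a tiny hand-written stop list. `TOY_STOP_WORDS` and `remove_stop_words` are made-up names for illustration, not NLTK's list or API, although NLTK's English list also contains `no` and `not`. Dropping the negation flips the apparent meaning of the review.

```python
# Toy stop list for illustration only; a stand-in for a real stop-word list.
TOY_STOP_WORDS = {"this", "is", "a", "the", "not", "no"}


def remove_stop_words(tokens, stop_words, keep=()):
    """Drop tokens found in stop_words, except those explicitly kept."""
    blocked = set(stop_words) - set(keep)
    return [t for t in tokens if t not in blocked]


tokens = "this is not a good game".split()

# Plain removal loses the negation: the review now reads as positive.
print(remove_stop_words(tokens, TOY_STOP_WORDS))
# Keeping the negations preserves the original polarity.
print(remove_stop_words(tokens, TOY_STOP_WORDS, keep=("not", "no")))
```

The `keep` argument plays the same role as the commented-out `stop_words.remove("no")` and `stop_words.remove("not")` cells in this section.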
508 | ] 509 | }, 510 | { 511 | "cell_type": "code", 512 | "execution_count": 14, 513 | "metadata": {}, 514 | "outputs": [], 515 | "source": [ 516 | "#nltk.download('stopwords')" 517 | ] 518 | }, 519 | { 520 | "cell_type": "code", 521 | "execution_count": 15, 522 | "metadata": {}, 523 | "outputs": [], 524 | "source": [ 525 | "#stop_words = nltk.corpus.stopwords.words('english')" 526 | ] 527 | }, 528 | { 529 | "cell_type": "code", 530 | "execution_count": 16, 531 | "metadata": {}, 532 | "outputs": [], 533 | "source": [ 534 | "#stop_words.remove(\"no\")" 535 | ] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "execution_count": 17, 540 | "metadata": {}, 541 | "outputs": [], 542 | "source": [ 543 | "#stop_words.remove(\"not\")" 544 | ] 545 | }, 546 | { 547 | "cell_type": "code", 548 | "execution_count": 18, 549 | "metadata": {}, 550 | "outputs": [], 551 | "source": [ 552 | "#print(stop_words)" 553 | ] 554 | }, 555 | { 556 | "cell_type": "code", 557 | "execution_count": 19, 558 | "metadata": {}, 559 | "outputs": [], 560 | "source": [ 561 | "#\"not\" in stop_words" 562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "execution_count": 20, 567 | "metadata": {}, 568 | "outputs": [], 569 | "source": [ 570 | "#len(stop_words)" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": 21, 576 | "metadata": {}, 577 | "outputs": [], 578 | "source": [ 579 | "#from string import punctuation\n", 580 | "#print(punctuation)" 581 | ] 582 | }, 583 | { 584 | "cell_type": "code", 585 | "execution_count": 22, 586 | "metadata": {}, 587 | "outputs": [], 588 | "source": [ 589 | "#reviews['tokens_nosw'] = reviews['tb_tokens'].\\\n", 590 | "# apply(lambda words: [w for w in words if w not in stop_words and w not in punctuation and w != \"\"])" 591 | ] 592 | }, 593 | { 594 | "cell_type": "code", 595 | "execution_count": 23, 596 | "metadata": {}, 597 | "outputs": [], 598 | "source": [ 599 | "#reviews[['tb_tokens','tokens_nosw']].sample(3)" 600 | ] 601 | }, 602 | { 603 
| "cell_type": "markdown", 604 | "metadata": {}, 605 | "source": [ 606 | "### Stemming" 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": 24, 612 | "metadata": {}, 613 | "outputs": [], 614 | "source": [ 615 | "from nltk.stem.porter import PorterStemmer" 616 | ] 617 | }, 618 | { 619 | "cell_type": "code", 620 | "execution_count": 25, 621 | "metadata": {}, 622 | "outputs": [], 623 | "source": [ 624 | "stemmer = PorterStemmer()" 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": 26, 630 | "metadata": {}, 631 | "outputs": [], 632 | "source": [ 633 | "reviews['tokens_stemmed'] = reviews['tb_tokens'].apply(lambda words: [stemmer.stem(w) for w in words])" 634 | ] 635 | }, 636 | { 637 | "cell_type": "code", 638 | "execution_count": 27, 639 | "metadata": {}, 640 | "outputs": [ 641 | { 642 | "data": { 643 | "text/html": [ 644 | "
" 685 | ], 686 | "text/plain": [ 687 | " tb_tokens \\\n", 688 | "1766 [a, very, buggy, game, with, a, lack, luster, story, mediocre, at, best] \n", 689 | "1163 [this, is, the, ps2, ps3, scenario, repeating, itself, all, over, again, the, same, way, to, the, ps3, will, never, measure, up, to, the, ps2, the, vita, is, milestones, away, from, its, predecessor, the, most, powerful, systems, have, never, been, on, top, through, out, gaming, history, history, repeats, itself, once, again, what, did, the, ps2, have, that, the, ps3, didnt, godlike, 3rd, party, support, but, not, just, that, it, had, the, most, important, aspect, of, all, balance, power, means, nothing, to, a, game, console, games, mean, everything, attracting, people, with, power, instead, of, gameplay, mechanics, and, games, is, the, easy, way, ...] \n", 690 | "4189 [good] \n", 691 | "\n", 692 | " tokens_stemmed \n", 693 | "1766 [a, veri, buggi, game, with, a, lack, luster, stori, mediocr, at, best] \n", 694 | "1163 [thi, is, the, ps2, ps3, scenario, repeat, itself, all, over, again, the, same, way, to, the, ps3, will, never, measur, up, to, the, ps2, the, vita, is, mileston, away, from, it, predecessor, the, most, power, system, have, never, been, on, top, through, out, game, histori, histori, repeat, itself, onc, again, what, did, the, ps2, have, that, the, ps3, didnt, godlik, 3rd, parti, support, but, not, just, that, it, had, the, most, import, aspect, of, all, balanc, power, mean, noth, to, a, game, consol, game, mean, everyth, attract, peopl, with, power, instead, of, gameplay, mechan, and, game, is, the, easi, way, ...] 
\n", 695 | "4189 [good] " 696 | ] 697 | }, 698 | "execution_count": 27, 699 | "metadata": {}, 700 | "output_type": "execute_result" 701 | } 702 | ], 703 | "source": [ 704 | "reviews[['tb_tokens','tokens_stemmed']].sample(3)" 705 | ] 706 | }, 707 | { 708 | "cell_type": "markdown", 709 | "metadata": {}, 710 | "source": [ 711 | "### Lemmatisation" 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "execution_count": 28, 717 | "metadata": {}, 718 | "outputs": [], 719 | "source": [ 720 | "from nltk.stem import WordNetLemmatizer\n", 721 | "from nltk.corpus import wordnet as wn\n", 722 | "from nltk.corpus import sentiwordnet as swn\n", 723 | "from nltk import sent_tokenize, word_tokenize, pos_tag" 724 | ] 725 | }, 726 | { 727 | "cell_type": "code", 728 | "execution_count": 29, 729 | "metadata": {}, 730 | "outputs": [], 731 | "source": [ 732 | "def penn_to_wn(tag):\n", 733 | " \"\"\"\n", 734 | " Convert Penn Treebank tags to simple WordNet tags\n", 735 | " \"\"\"\n", 736 | " if tag.startswith('J'):\n", 737 | " return wn.ADJ\n", 738 | " elif tag.startswith('N'):\n", 739 | " return wn.NOUN\n", 740 | " elif tag.startswith('R'):\n", 741 | " return wn.ADV\n", 742 | " elif tag.startswith('V'):\n", 743 | " return wn.VERB\n", 744 | " return None" 745 | ] 746 | }, 747 | { 748 | "cell_type": "code", 749 | "execution_count": 30, 750 | "metadata": {}, 751 | "outputs": [], 752 | "source": [ 753 | "lemmatizer = WordNetLemmatizer()\n", 754 | "def get_lemas(tokens):\n", 755 | " lemmas = []\n", 756 | " for token in tokens:\n", 757 | " pos = penn_to_wn(pos_tag([token])[0][1])\n", 758 | " if pos:\n", 759 | " lemma = lemmatizer.lemmatize(token, pos)\n", 760 | " if lemma:\n", 761 | " lemmas.append(lemma)\n", 762 | " return lemmas" 763 | ] 764 | }, 765 | { 766 | "cell_type": "code", 767 | "execution_count": 31, 768 | "metadata": {}, 769 | "outputs": [], 770 | "source": [ 771 | "reviews['lemmas'] = reviews['tb_tokens'].apply(lambda tokens: get_lemas(tokens))" 772 | ] 773 | },
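Note that `get_lemas` silently drops every token whose tag does not map to an adjective, noun, adverb, or verb, which is why the `lemmas` column comes out shorter than the token lists. The sketch below isolates that filtering logic so it can run without NLTK data. Everything in it is an assumption made for illustration: the tagged tokens are hard-coded, the single-letter constants are the values of `wn.ADJ`, `wn.NOUN`, `wn.ADV`, and `wn.VERB`, `toy_lemmatize` is a placeholder standing in for `WordNetLemmatizer.lemmatize`, and the `keep_untagged` option is a hypothetical extension that the notebook does not have.

```python
# Values of wn.ADJ, wn.NOUN, wn.ADV, wn.VERB in NLTK's WordNet interface.
ADJ, NOUN, ADV, VERB = "a", "n", "r", "v"


def penn_to_wn(tag):
    """Map a Penn Treebank tag to a WordNet POS constant, or None."""
    for prefix, pos in (("J", ADJ), ("N", NOUN), ("R", ADV), ("V", VERB)):
        if tag.startswith(prefix):
            return pos
    return None


def lemmas_from_tagged(tagged, lemmatize, keep_untagged=False):
    """Lemmatize (token, tag) pairs; optionally keep tokens with no WordNet POS."""
    out = []
    for token, tag in tagged:
        pos = penn_to_wn(tag)
        if pos:
            out.append(lemmatize(token, pos))
        elif keep_untagged:
            out.append(token)
    return out


# Hand-assigned tags for illustration; a real run would use nltk.pos_tag.
tagged = [("all", "DT"), ("kinds", "NNS"), ("of", "IN"),
          ("side", "NN"), ("missions", "NNS")]


def toy_lemmatize(token, pos):
    """Placeholder lemmatizer: strips a plural "s" from nouns only."""
    return token[:-1] if pos == NOUN and token.endswith("s") else token


print(lemmas_from_tagged(tagged, toy_lemmatize))
print(lemmas_from_tagged(tagged, toy_lemmatize, keep_untagged=True))
```

Whether dropped function words matter depends on the downstream scorer. A lexicon lookup that only scores content words is unaffected, whereas anything that relies on word order or negation scope would see a different sequence.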
774 | { 775 | "cell_type": "code", 776 | "execution_count": 39, 777 | "metadata": {}, 778 | "outputs": [ 779 | { 780 | "data": { 781 | "text/html": [ 782 | "
" 821 | ], 822 | "text/plain": [ 823 | " reviewText \\\n", 824 | "4135 Freaking awesome game!!! I did all kinds of side missions that normally I would skip. \n", 825 | "1878 After my 60+ hours of game play, I get more and more frustrated with the game.\\n\\n+ side: nice graphic, and curve roads are nice.\\n\\n- side:\\nalways online sucks, yes they fixes the logon issue, so I can always logon without waiting. The problems I am having now is that I cannot load some of my cities at all. So many hours of work is wasted\\n\\nTourism is broken. All the casinos and attractions suddenly lost all the customers for no explainable reasons. I have airport, rail, and ferry running, but all of a sudden people just decide not to visit.\\n\\nInteractive between cities has lots of issues. Simcity try to be MMO, but they cannot do simple transactions right. Imagine that each city is a character in a MMO, and it takes 10+ minutes for each characters to exchange items/gold? Do you want to play MMO like that?\\n\\nPublic transports just randomly pick and drop sims, and path finding has lots of issues. Some of the stores say no shoppers and some sims ask where is shopping, even they are right across the street. A $20 game like Cities in Motion does a lot better than Simcity in terms of path finding and transport system.\\n\\nRendering issues. Some of the buildings keep switching from invisible to visible back and forth. Sometimes when you have a quest, you get zoom in, but what you get is the back of some tall buildings because they are getting int the way. May be they should have an options to change theses behaviors?\\n\\nThere are many fundamental design issues in this game, and I do not know if EA can or willing to actually fix them. 
\n", 826 | "\n", 827 | " tokens_stemmed \\\n", 828 | "4135 [freak, awesom, game, i, did, all, kind, of, side, mission, that, normal, i, would, skip] \n", 829 | "1878 [after, my, 60, hour, of, game, play, i, get, more, and, more, frustrat, with, the, game, side, nice, graphic, and, curv, road, are, nice, side, alway, onlin, suck, ye, they, fix, the, logon, issu, so, i, can, alway, logon, without, wait, the, problem, i, am, have, now, is, that, i, can, not, load, some, of, my, citi, at, all, so, mani, hour, of, work, is, wast, tourism, is, broken, all, the, casino, and, attract, suddenli, lost, all, the, custom, for, no, explain, reason, i, have, airport, rail, and, ferri, run, but, all, of, a, sudden, peopl, just, decid, not, to, ...] \n", 830 | "\n", 831 | " lemmas \n", 832 | "4135 [freak, awesome, game, i, do, kind, side, mission, normally, i, skip] \n", 833 | "1878 [hour, game, play, i, get, more, more, frustrate, game, side, nice, graphic, curve, road, be, nice, side, always, online, suck, yes, fix, logon, issue, so, i, always, logon, wait, problem, i, be, have, now, be, i, not, load, city, so, many, hour, work, be, waste, tourism, be, broken, casino, attraction, suddenly, lose, customer, explainable, reason, i, have, airport, rail, ferry, run, sudden, people, just, decide, not, visit, interactive, city, have, lot, issue, simcity, try, be, mmo, not, do, simple, transaction, right, imagine, city, be, character, mmo, take, minute, character, exchange, itemsgold, do, want, play, mmo, public, transport, just, randomly, pick, ...] 
" 834 | ] 835 | }, 836 | "execution_count": 39, 837 | "metadata": {}, 838 | "output_type": "execute_result" 839 | } 840 | ], 841 | "source": [ 842 | "reviews[['reviewText','tokens_stemmed','lemmas']].sample(2)" 843 | ] 844 | }, 845 | { 846 | "cell_type": "markdown", 847 | "metadata": {}, 848 | "source": [ 849 | "### Sentiment Predictor Baseline Model" 850 | ] 851 | }, 852 | { 853 | "cell_type": "code", 854 | "execution_count": 54, 855 | "metadata": {}, 856 | "outputs": [], 857 | "source": [ 858 | "def get_sentiment_score(tokens):\n", 859 | "    score = 0\n", 860 | "    tags = pos_tag(tokens)\n", 861 | "    for word, tag in tags:\n", 862 | "        wn_tag = penn_to_wn(tag)\n", 863 | "        if not wn_tag:\n", 864 | "            continue\n", 865 | "        synsets = wn.synsets(word, pos=wn_tag)\n", 866 | "        if not synsets:\n", 867 | "            continue\n", 868 | " \n", 869 | "        # Use the first synset WordNet returns, i.e. the most common sense of the word.\n", 870 | "        synset = synsets[0]\n", 871 | "        swn_synset = swn.senti_synset(synset.name())\n", 872 | " \n", 873 | "        score += (swn_synset.pos_score() - swn_synset.neg_score())\n", 874 | " \n", 875 | "    return score  # unnormalised sum over tokens; can fall outside [-1, 1]\n", 876 | " " 877 | ] 878 | }, 879 | { 880 | "cell_type": "code", 881 | "execution_count": 61, 882 | "metadata": {}, 883 | "outputs": [ 884 | { 885 | "data": { 886 | "text/plain": [ 887 | "0.625" 888 | ] 889 | }, 890 | "execution_count": 61, 891 | "metadata": {}, 892 | "output_type": "execute_result" 893 | } 894 | ], 895 | "source": [ 896 | "# Sanity check: positive score of the first adjective sense of \"perfect\"\n", 897 | "swn.senti_synset(wn.synsets(\"perfect\", wn.ADJ)[0].name()).pos_score()" 898 | ] 899 | }, 900 | { 901 | "cell_type": "code", 902 | "execution_count": 56, 903 | "metadata": {}, 904 | "outputs": [], 905 | "source": [ 906 | "reviews['sentiment_score'] = reviews['lemmas'].apply(get_sentiment_score)" 907 | ] 908 | }, 909 | { 910 | "cell_type": "code", 911 | "execution_count": 57, 912 | "metadata": {}, 913 | "outputs": [ 914 | { 915 | "data": { 916 | "text/html": [ 917 | "
\n", 918 | "\n", 931 | "\n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | "
\n", 973 | "
" 974 | ], 975 | "text/plain": [ 976 | " reviewText \\\n", 977 | "3875 It works well just as he said with just a little dirt on the contacts. Check seller ratings to improve the possibility of your old school game working. Ask all your questions \n", 978 | "803 I tried to play the game but it would freeze up so I didn't get to play. \n", 979 | "1419 I cannot see how anyone would like this game. You must have the patience of a python waiting a month to catch prey. My wife bought this game with my PS3 as a gift. Two days later I took it to Gamestop...I was surprised to get 40 buck for it. Well, when it went on sale on Amazon I figured I did not give it a chance so I bought it for PC, well waste of money. It take about an hour before the game starts, and once it does still no action. I cannot even play it anymore. \n", 980 | "2861 True to the brand, lots of good details. Fun skills, upgrades and lots of great missions. I bought for my son but my husband took it over! I love the original songs and score in the game. It can be a little tricky to capture ghosts at first but once you get the hang of it you will find it a lot of fun. I feel the game is a little hard for younger kids so I would not reccomend to kids under 7. I ain't afraid of no ghosts! \n", 981 | "3156 Very well made game for being a Wii title. 
I play on my wii u \n", 982 | "\n", 983 | " lemmas \\\n", 984 | "3875 [work, well, just, say, just, little, dirt, contact, check, seller, rating, improve, possibility, old, school, game, work, ask, question] \n", 985 | "803 [i, try, play, game, freeze, up, so, i, didnt, get, play] \n", 986 | "1419 [i, not, see, anyone, game, have, patience, python, wait, month, catch, prey, wife, bought, game, ps3, gift, day, later, i, take, gamestopi, be, surprised, get, buck, well, go, sale, amazon, i, figure, i, do, not, give, chance, so, i, bought, pc, well, waste, money, take, hour, game, start, once, do, still, action, i, not, even, play, anymore] \n", 987 | "2861 [true, brand, lot, good, detail, fun, skill, upgrade, lot, great, mission, i, bought, son, husband, take, i, love, original, song, score, game, be, little, tricky, capture, ghost, first, once, get, hang, find, lot, fun, i, feel, game, be, little, hard, young, kid, so, i, not, reccomend, kid, i, aint, afraid, ghost] \n", 988 | "3156 [very, well, make, game, be, wii, title, i, play, wii, u] \n", 989 | "\n", 990 | " sentiment_score \n", 991 | "3875 0.875 \n", 992 | "803 -0.125 \n", 993 | "1419 -1.125 \n", 994 | "2861 -0.125 \n", 995 | "3156 0.250 " 996 | ] 997 | }, 998 | "execution_count": 57, 999 | "metadata": {}, 1000 | "output_type": "execute_result" 1001 | } 1002 | ], 1003 | "source": [ 1004 | "reviews[['reviewText','lemmas','sentiment_score']].sample(5)" 1005 | ] 1006 | }, 1007 | { 1008 | "cell_type": "code", 1009 | "execution_count": 58, 1010 | "metadata": {}, 1011 | "outputs": [ 1012 | { 1013 | "data": { 1014 | "text/html": [ 1015 | "
\n", 1016 | "\n", 1029 | "\n", 1030 | " \n", 1031 | " \n", 1032 | " \n", 1033 | " \n", 1034 | " \n", 1035 | " \n", 1036 | " \n", 1037 | " \n", 1038 | " \n", 1039 | " \n", 1040 | " \n", 1041 | " \n", 1042 | " \n", 1043 | " \n", 1044 | " \n", 1045 | " \n", 1046 | " \n", 1047 | " \n", 1048 | " \n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " \n", 1064 | " \n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | "
\n", 1071 | "
" 1072 | ], 1073 | "text/plain": [ 1074 | " reviewText \\\n", 1075 | "2434 Super Ghouls 'N Ghosts is impossible. Nah it's not. It's possible. It's possible to beat this game. I know because I've watched people do it before. But your patience level needs to be at its absolute highest in order to get *anywhere* in this game. With a game so tough it's no wonder most people back in the day only played it a couple times, realized how hard it was and put it away for years until they eventually sold it. I know I would've done the same thing had I not had prior knowledge of this games incredible difficulty.\\n\\nNot getting through level 1 was a bad enough experience for me, but I already knew what to expect going into the game so I didn't mind the loss. I didn't take losing personally. No, not when you're playing a game with a reputation for being extremely hard. You learn to KNOW better! Some people just didn't know what they were getting into when they played this for the first time (and probably now with a whole new generation of gamers, you could pull a really dirty trick on them and say \"Hey! Check out THIS game! I know you'll love it!\" and see what their reaction is like, lol). Probably a face-ripping reaction... that being your face.\\n\\nWhat makes the game so tough is a combination of 3 things- 1 is the play control. Growing up with the NES generation I never minded awkward moving and jumping if the surrounding danger was kept to a minimum. Unfortunately the surrounding danger in this game is *constant* with many precious deadly surprises around every corner, and enemies are located in places *just* off-screen and appearing magically when you move over one notch. Just enough for you to take an unavoidable hit in the cheapest way ever. 
Throw in water stages and platform logs you have to ride across, and how about a level that moves up and down while riding a platform all the while avoiding surrounding dangerous enemies and their attacks, and that's the kind of nightmare-ish encounters you can expect in Super Ghouls 'N Ghosts. The game *wants* you, BEGS you, to take a walk into dangerous territory just so you can take a hit.\\n\\nThe 2nd reason for this games high challenge and for me clearly the biggest reason, is the fact that you can only get once or twice and then you die. If at least 5 different enemies and attack are occurring all around you giving you little time to react, the generous and actually *right* gameplay method is to allow the player to take at least more than 1 or 2 hits. Even Castlevania with its amazingly high challenge is courteous enough to allow the player to take at least 3-4 hits. Imagine if you couldn't even wear protective gear while roaming through the levels of Super Ghouls 'N Ghosts? Then it simply becomes \"1 hit you're dead\". At least Contra allows the player to reappear in the same spot after a death. No, in this game you either have to start back at the beginning of the stage or the nearest checkpoint. The gameplay is already sluggish enough- it's not worth the aggravation believe me.\\n\\nAnd the 3rd reason is one I already mentioned- enemies being cheap. Too many enemies on screen? Yes. Enemies with unpredictable movements with no clear pattern to retaliate? Oh yeah (those DARN bird creatures have no pattern and you have to deal with more than 1 on screen occasionally!) Oddly the boss fights are relatively easy compared to the regular enemies that often swarm the stages. Too much commotion taking place to properly focus on enemy elimination is an unforgiveable issue. I guess the good news is that at least the levels aren't boring.\\n\\nTo Super Ghouls 'N Ghosts credit a massively challenging game does sometimes have its moments of fun... 
when you're partying with friends and falling down drunk. Or when a cat is taking its claws and scratching your face to shreds. Or when you're so cold your feet feel like swollen icicles. Or when the Dallas Cowboys are winning. Under these conditions the game is probably fun, but not when you have a nice library of more sensible and enjoyable games sitting just a few feet from where you stand. Oh and... before I forget! If you're somehow able to make it to level 7, you HAVE to use the right weapon to eliminate the boss here otherwise the game will actually kick you back to level 1 preventing you from entering level 8. That's silly, annoying, ridiculous, pathetic and horrible all wrapped in a nice package for grandma.\\n\\nOtherwise this game has really memorable music, fantastic Super NES graphics/superbly detailed backgrounds, and only a little bit of replay value obviously due to the high challenge level but also the fact that the game is short at just over 40 minutes (if you're an amazing enough gamer to blast through these tough stages that is- for the rest of us Super Ghouls 'N Ghosts takes hours to finish). I recommend skipping this one. Contra III: The Alien Wars is the other ridiculously hard Super NES game but at least it's fairer. \n", 1076 | "1689 I have been a long Playstation person. But now their network is all kinds of screwed up. They have screwed up the account and have to now send it to a \"specialist\" in which I have to wait at least 2 days before I can even use my system anymore since everything requires the playstation ID. So not happy. Gave 2 stars instead of 1 because the Customer Service guy was pretty nice, and he did his best, it is the Sony company that screwed this all up. So good job for the support guy, but boo for Sony. \n", 1077 | "4317 one of my favorites \n", 1078 | "1751 I received the Ghostbusters movie instead. 
I was looking forward to playing this game \n", 1079 | "4034 So much fun, It improves the 1st one so much, love the settings menu, the blood and physics are awesome.\\nIf you get this game I would recommend the superpad 64 the one that looks like a modern controller. just because this game dose use the d-pad, crap get that controller anyway.\\nlove this game.\\nOne of the few games i know that you can shoot some ones neck and watch the blood squirt out, or blow limbs off....\\nlove it \n", 1080 | "\n", 1081 | " lemmas \\\n", 1082 | "2434 [super, ghoul, n, ghost, be, impossible, nah, not, possible, possible, beat, game, i, know, ive, watch, people, do, patience, level, need, be, absolute, high, order, get, anywhere, game, game, so, tough, wonder, most, people, back, day, only, played, couple, time, realize, hard, be, put, away, year, eventually, sell, i, know, i, wouldve, do, same, thing, have, i, not, have, prior, knowledge, game, incredible, difficulty, not, get, level, be, bad, enough, experience, i, already, knew, expect, go, game, so, i, didnt, mind, loss, i, didnt, take, lose, personally, not, youre, play, game, reputation, be, extremely, hard, learn, know, well, people, just, ...] 
\n", 1083 | "1689 [i, have, be, long, playstation, person, now, network, be, kind, screw, up, have, screw, up, account, have, now, send, specialist, i, have, wait, least, day, i, even, use, system, anymore, everything, require, playstation, id, so, not, happy, give, star, instead, customer, service, guy, be, pretty, nice, do, best, be, sony, company, screw, up, so, good, job, support, guy, boo, sony] \n", 1084 | "4317 [favorite] \n", 1085 | "1751 [i, receive, ghostbusters, movie, instead, i, be, look, forward, play, game] \n", 1086 | "4034 [so, much, fun, improves, so, much, love, setting, menu, blood, physic, be, awesome, get, game, i, recommend, superpad, look, modern, controller, just, game, dose, use, dpad, crap, get, controller, anyway, love, game, few, game, i, know, shoot, one, neck, watch, blood, squirt, blow, limb, love] \n", 1087 | "\n", 1088 | " sentiment_score \n", 1089 | "2434 1.250 \n", 1090 | "1689 5.250 \n", 1091 | "4317 0.250 \n", 1092 | "1751 0.125 \n", 1093 | "4034 1.375 " 1094 | ] 1095 | }, 1096 | "execution_count": 58, 1097 | "metadata": {}, 1098 | "output_type": "execute_result" 1099 | } 1100 | ], 1101 | "source": [ 1102 | "reviews[['reviewText','lemmas','sentiment_score']].sample(5)" 1103 | ] 1104 | }, 1105 | { 1106 | "cell_type": "code", 1107 | "execution_count": null, 1108 | "metadata": {}, 1109 | "outputs": [], 1110 | "source": [] 1111 | }, 1112 | { 1113 | "cell_type": "code", 1114 | "execution_count": null, 1115 | "metadata": {}, 1116 | "outputs": [], 1117 | "source": [] 1118 | }, 1119 | { 1120 | "cell_type": "code", 1121 | "execution_count": null, 1122 | "metadata": {}, 1123 | "outputs": [], 1124 | "source": [] 1125 | }, 1126 | { 1127 | "cell_type": "code", 1128 | "execution_count": null, 1129 | "metadata": {}, 1130 | "outputs": [], 1131 | "source": [] 1132 | }, 1133 | { 1134 | "cell_type": "code", 1135 | "execution_count": null, 1136 | "metadata": {}, 1137 | "outputs": [], 1138 | "source": [] 1139 | } 1140 | ], 1141 | "metadata": { 1142 
| "kernelspec": { 1143 | "display_name": "Python 3", 1144 | "language": "python", 1145 | "name": "python3" 1146 | }, 1147 | "language_info": { 1148 | "codemirror_mode": { 1149 | "name": "ipython", 1150 | "version": 3 1151 | }, 1152 | "file_extension": ".py", 1153 | "mimetype": "text/x-python", 1154 | "name": "python", 1155 | "nbconvert_exporter": "python", 1156 | "pygments_lexer": "ipython3", 1157 | "version": "3.6.12" 1158 | } 1159 | }, 1160 | "nbformat": 4, 1161 | "nbformat_minor": 4 1162 | } 1163 | --------------------------------------------------------------------------------