├── README.md ├── mp2-object-oriented.ipynb ├── mp1-program-flow.ipynb ├── mp3-pw-data-structures.ipynb ├── mp5-ml-machine-learning.ipynb └── mp6-nlp-natural-language-processing.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # WQU-data-science-challenges 2 | 3 | To complete the **2-unit**, 16-week Applied Data Science Module of **WorldQuant University**, students are required to pass 6 mini-projects in total. I successfully completed all of them and maintained a cumulative average score of 90% or above. The mini-projects are as follows: 4 | 5 | ### Applied Data Science Unit I - Scientific Computing and Python 6 | 7 | **In mini project 1**, students used Python to compute Mersenne numbers, using the Lucas-Lehmer test to identify Mersenne numbers that are prime. They had to use Python data structures and core programming principles such as for loops to implement their solution. Further, they had to implement the Sieve of Eratosthenes as a faster way to check whether numbers are prime, learning about the importance of algorithm time complexity. 8 | 9 | **In mini project 2**, students used object-oriented programming to create a class that represents a geometric point. They defined methods for common operations on points, such as adding two points together and finding the distance between two points. Finally, they wrote a K-means clustering algorithm that uses the previously defined Point class. 10 | 11 | **In mini project 3**, students used basic Python data structures, functions, and control flow to answer questions about medical data on prescription drugs from the British NHS. They had to use fundamental data wrangling techniques such as joining data sets together, splitting data into groups, and aggregating data into summary statistics. 12 | 13 | **In mini project 4**, students used the Python package pandas to perform data analysis on a prescription drug data set from the British NHS. They answered questions such as identifying which medical practices prescribe opioids at an unusually high rate and which practices prescribe rare drugs substantially more often than the rest of the medical practices. They used statistical concepts such as the z-score to help identify the aforementioned practices. 14 | 15 | 16 | ### Applied Data Science Unit II - Machine Learning & Statistical Analysis 17 | 18 | **In mini project 5**, students worked with nursing home inspection data from the United States, predicting which providers may be fined and for how much. They used the scikit-learn Python package to construct progressively more complicated machine learning models. They had to impute missing values, apply feature engineering, and encode categorical data. 19 | 20 | **In mini project 6**, students used natural language processing to train various machine learning models to predict an Amazon review rating based on the text of the review. Further, they used one of the trained models to gain insight into the reviews, identifying words that are highly polar. These highly polar words show which words most strongly influence the model’s prediction. 
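As a small taste of the kind of code these mini-projects involve, here is a minimal sketch of the Lucas-Lehmer test from mini project 1 (illustrative only; the function name and structure are my own and not the graded solution):

```python
def is_mersenne_prime(p):
    """Return True if 2**p - 1 is prime, using the Lucas-Lehmer test (p must be an odd prime)."""
    m = 2 ** p - 1          # the Mersenne number M_p
    s = 4                   # the Lucas-Lehmer sequence starts at 4
    for _ in range(p - 2):  # iterate s -> (s**2 - 2) mod M_p a total of p - 2 times
        s = (s * s - 2) % m
    return s == 0           # M_p is prime exactly when the sequence ends at 0

print([p for p in (3, 5, 7, 11, 13) if is_mersenne_prime(p)])  # [3, 5, 7, 13]
```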
21 | -------------------------------------------------------------------------------- /mp2-object-oriented.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "metadata": { 7 | "init_cell": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "%logstop\n", 12 | "%logstart -rtq ~/.logs/vc.py append\n", 13 | "%matplotlib inline\n", 14 | "import matplotlib\n", 15 | "import seaborn as sns\n", 16 | "sns.set()\n", 17 | "matplotlib.rcParams['figure.dpi'] = 144" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 3, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "from static_grader import grader" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "# Object-oriented exercises\n", 34 | "## Introduction\n", 35 | "\n", 36 | "The objective of these exercises is to develop your familiarity with Python's `class` syntax and object-oriented programming. By deepening our understanding of Python objects, we will be better prepared to work with complex data structures and machine learning models. We will develop a `Point` class capable of handling some simple linear algebra operations in 2D." 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## Exercise 1: `point_repr`\n", 44 | "\n", 45 | "The first step in defining most classes is to define their `__init__` and `__repr__` methods so that we can construct and represent distinct objects of that class. Our `Point` class should accept two arguments, `x` and `y`, and be represented by a string `'Point(x, y)'` with appropriate values for `x` and `y`.\n", 46 | "\n", 47 | "When you've written a `Point` class capable of this, execute the cell with `grader.score` for this question (do not edit that cell; you only need to modify the `Point` class)." 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 4, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "class Point(object):\n", 57 | "\n", 58 | " def __init__(self, x, y):\n", 59 | " self.x = x\n", 60 | " self.y = y\n", 61 | " \n", 62 | " def __repr__(self):\n", 63 | " return f\"Point({self.x}, {self.y})\"" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 5, 69 | "metadata": {}, 70 | "outputs": [ 71 | { 72 | "name": "stdout", 73 | "output_type": "stream", 74 | "text": [ 75 | "==================\n", 76 | "Your score: 1.0\n", 77 | "==================\n" 78 | ] 79 | } 80 | ], 81 | "source": [ 82 | "grader.score.vc__point_repr(lambda points: [str(Point(*point)) for point in points])" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "## Exercise 2: add_subtract\n", 90 | "\n", 91 | "The most basic vector operations we want our `Point` object to handle are addition and subtraction. For two points $(x_1, y_1) + (x_2, y_2) = (x_1 + x_2, y_1 + y_2)$ and similarly for subtraction. Implement a method within `Point` that allows two `Point` objects to be added together using the `+` operator, and likewise for subtraction. 
Once this is done, execute the `grader.score` cell for this question (do not edit that cell; you only need to modify the `Point` class.)\n", 92 | "\n", 93 | "(Remember that `__add__` and `__sub__` methods will allow us to use the `+` and `-` operators.)" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 6, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "class Point(object):\n", 103 | "\n", 104 | " def __init__(self, x, y):\n", 105 | " self.x = x\n", 106 | " self.y = y\n", 107 | " \n", 108 | " def __repr__(self):\n", 109 | " return f\"Point({self.x}, {self.y})\"\n", 110 | " \n", 111 | " def __add__(self, another):\n", 112 | " sum_x = self.x + another.x\n", 113 | " sum_y = self.y + another.y\n", 114 | " return Point(sum_x, sum_y)\n", 115 | " \n", 116 | " def __sub__(self, another):\n", 117 | " sub_x = self.x - another.x\n", 118 | " sub_y = self.y - another.y \n", 119 | " return Point(sub_x, sub_y)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 7, 125 | "metadata": {}, 126 | "outputs": [ 127 | { 128 | "name": "stdout", 129 | "output_type": "stream", 130 | "text": [ 131 | "==================\n", 132 | "Your score: 1.0\n", 133 | "==================\n" 134 | ] 135 | } 136 | ], 137 | "source": [ 138 | "from functools import reduce\n", 139 | "def add_sub_results(points):\n", 140 | " points = [Point(*point) for point in points]\n", 141 | " return [str(reduce(lambda x, y: x + y, points)), \n", 142 | " str(reduce(lambda x, y: x - y, points))]\n", 143 | "\n", 144 | "grader.score.vc__add_subtract(add_sub_results)" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "## Exercise 3: multiplication\n", 152 | "\n", 153 | "Within linear algebra there's many different kinds of multiplication: scalar multiplication, inner product, cross product, and matrix product. We're going to implement scalar multiplication and the inner product.\n", 154 | "\n", 155 | "We can define scalar multiplication given a point $P$ and a scalar $a$ as \n", 156 | "$$aP=a(x,y)=(ax,ay)$$\n", 157 | "\n", 158 | "and we can define the inner product for points $P,Q$ as\n", 159 | "$$P\\cdot Q=(x_1,y_1)\\cdot (x_2, y_2) = x_1x_2 + y_1y_2$$\n", 160 | "\n", 161 | "To test that you've implemented this correctly, compute $2(x, y) \\cdot (x, y)$ for a `Point` object. Once this is done, execute the `grader.score` cell for this question (do not edit that cell; you only need to modify the `Point` class.)\n", 162 | "\n", 163 | "(Remember that `__mul__` method will allow us to use the `*` operator. 
Also don't forget that the ordering of operands matters when implementing these operators.)" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 8, 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [ 172 | "class Point(object):\n", 173 | "\n", 174 | " def __init__(self, x, y):\n", 175 | " self.x = x\n", 176 | " self.y = y\n", 177 | " \n", 178 | " def __repr__(self):\n", 179 | " return f\"Point({self.x}, {self.y})\"\n", 180 | " \n", 181 | " def __add__(self, another):\n", 182 | " sum_x = self.x + another.x\n", 183 | " sum_y = self.y + another.y\n", 184 | " return Point(sum_x, sum_y)\n", 185 | " \n", 186 | " def __sub__(self, another):\n", 187 | " sub_x = self.x - another.x\n", 188 | " sub_y = self.y - another.y \n", 189 | " return Point(sub_x, sub_y)\n", 190 | " \n", 191 | " def __mul__(self, another):\n", 192 | " if isinstance(another, Point):\n", 193 | " return self.x*another.x + self.y*another.y\n", 194 | " if isinstance(another, (int, float, complex)):\n", 195 | " return Point(another*self.x, another*self.y)" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 9, 201 | "metadata": {}, 202 | "outputs": [ 203 | { 204 | "name": "stdout", 205 | "output_type": "stream", 206 | "text": [ 207 | "==================\n", 208 | "Your score: 1.0\n", 209 | "==================\n" 210 | ] 211 | } 212 | ], 213 | "source": [ 214 | "def mult_result(points):\n", 215 | " points = [Point(*point) for point in points]\n", 216 | " return [point*point*2 for point in points]\n", 217 | "\n", 218 | "grader.score.vc__multiplication(mult_result)" 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": {}, 224 | "source": [ 225 | "## Exercise 4: Distance\n", 226 | "\n", 227 | "Another quantity we might want to compute is the distance between two points. This is generally given for points $P_1=(x_1,y_1)$ and $P_2=(x_2,y_2)$ as \n", 228 | "$$D = |P_2 - P_1| = \\sqrt{(x_1-x_2)^2 + (y_1-y_2)^2}.$$\n", 229 | "\n", 230 | "Implement a method called `distance` which finds the distance from a point to another point. \n", 231 | "\n", 232 | "Once this is done, execute the `grader.score` cell for this question (do not edit that cell; you only need to modify the `Point` class.)\n", 233 | "\n", 234 | "### Hint\n", 235 | "* *You can use the `sqrt` function from the math package*." 
236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 10, 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "from math import sqrt\n", 245 | "\n", 246 | "class Point(object):\n", 247 | "\n", 248 | " def __init__(self, x, y):\n", 249 | " self.x = x\n", 250 | " self.y = y\n", 251 | " \n", 252 | " def __repr__(self):\n", 253 | " return f\"Point({self.x}, {self.y})\"\n", 254 | " \n", 255 | " def __add__(self, another):\n", 256 | " sum_x = self.x + another.x\n", 257 | " sum_y = self.y + another.y\n", 258 | " return Point(sum_x, sum_y)\n", 259 | " \n", 260 | " def __sub__(self, another):\n", 261 | " sub_x = self.x - another.x\n", 262 | " sub_y = self.y - another.y \n", 263 | " return Point(sub_x, sub_y)\n", 264 | " \n", 265 | " def __mul__(self, another):\n", 266 | " if isinstance(another, Point):\n", 267 | " return self.x*another.x + self.y*another.y\n", 268 | " if isinstance(another, (int, float, complex)):\n", 269 | " return Point(another*self.x, another*self.y)\n", 270 | " \n", 271 | " def distance(self, another):\n", 272 | " return sqrt((self.x-another.x)**2 + (self.y-another.y)**2)" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 15, 278 | "metadata": {}, 279 | "outputs": [ 280 | { 281 | "name": "stdout", 282 | "output_type": "stream", 283 | "text": [ 284 | "==================\n", 285 | "Your score: 1.0\n", 286 | "==================\n" 287 | ] 288 | } 289 | ], 290 | "source": [ 291 | "def dist_result(points):\n", 292 | " points = [Point(*point) for point in points]\n", 293 | " return [points[0].distance(point) for point in points]\n", 294 | "\n", 295 | "grader.score.vc__distance(dist_result)" 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "metadata": {}, 301 | "source": [ 302 | "## Exercise 5: Algorithm\n", 303 | "\n", 304 | "Now we will use these points to solve a real world problem! We can use our Point objects to represent measurements of two different quantities (e.g. a company's stock price and volume). One thing we might want to do with a data set is to separate the points into groups of similar points. Here we will implement an iterative algorithm to do this which will be a specific case of the very general $k$-means clustering algorithm. The algorithm will require us to keep track of two clusters, each of which have a list of points and a center (which is another point, not necessarily one of the points we are clustering). After making an initial guess at the center of the two clusters, $C_1$ and $C_2$, the steps proceed as follows\n", 305 | "\n", 306 | "1. Assign each point to $C_1$ or $C_2$ based on whether the point is closer to the center of $C_1$ or $C_2$.\n", 307 | "2. Recalculate the center of $C_1$ and $C_2$ based on the contained points. \n", 308 | "\n", 309 | "See [reference](https://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm) for more information.\n", 310 | "\n", 311 | "This algorithm will terminate in general when the assignments no longer change. For this question, we would like you to initialize one cluster at `(1, 0)` and the other at `(-1, 0)`. \n", 312 | "\n", 313 | "The returned values should be the two centers of the clusters ordered by greatest `x` value. Please return these as a list of numeric tuples $[(x_1, y_1), (x_2, y_2)]$\n", 314 | "\n", 315 | "In order to accomplish this we will create a class called cluster which has two methods besides `__init__` which you will need to write. 
The first method `update` will update the center of the Cluster given the points contained in the attribute `points`. Remember, you after updating the center of the cluster, you will want to reassign the points and thus remove previous assignments. The other method `add_point` will add a point to the `points` attribute.\n", 316 | "\n", 317 | "Once this is done, execute the `grader.score` cell for this question (do not edit that cell; you only need to modify the `Cluster` class and `compute_result` function.)" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 35, 323 | "metadata": {}, 324 | "outputs": [], 325 | "source": [ 326 | "class Cluster(object):\n", 327 | " def __init__(self, x, y):\n", 328 | " self.center = Point(x, y)\n", 329 | " self.points = []\n", 330 | " \n", 331 | " def update(self):\n", 332 | " if len(self.points) != 0: \n", 333 | " x_center = sum([point.x for point in self.points])\n", 334 | " y_center = sum([point.y for point in self.points])\n", 335 | " self.center = Point(x_center/len(self.points), y_center/len(self.points))\n", 336 | " \n", 337 | " def add_point(self, point):\n", 338 | " self.points.append(point)" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": 36, 344 | "metadata": {}, 345 | "outputs": [], 346 | "source": [ 347 | "def compute_result(points):\n", 348 | " points = [Point(*point) for point in points]\n", 349 | " a = Cluster(1,0)\n", 350 | " b = Cluster(-1,0)\n", 351 | " a_old = []\n", 352 | " for _ in range(10000): # max iterations\n", 353 | " for point in points:\n", 354 | " if point.distance(a.center) < point.distance(b.center):\n", 355 | " a.add_point(point)\n", 356 | " else:\n", 357 | " b.add_point(point)\n", 358 | " if a_old == a.points:\n", 359 | " break\n", 360 | " a_old = a.points\n", 361 | " a.update()\n", 362 | " b.update()\n", 363 | " a.points = []\n", 364 | " b.points = []\n", 365 | " return sorted([(a.center.x, a.center.y), (b.center.x, b.center.y)], key= lambda x: x[0], reverse = True)" 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": 37, 371 | "metadata": { 372 | "scrolled": true 373 | }, 374 | "outputs": [ 375 | { 376 | "name": "stdout", 377 | "output_type": "stream", 378 | "text": [ 379 | "==================\n", 380 | "Your score: 1.0\n", 381 | "==================\n" 382 | ] 383 | } 384 | ], 385 | "source": [ 386 | "grader.score.vc__k_means(compute_result)" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "metadata": {}, 392 | "source": [ 393 | "*Copyright © 2020 The Data Incubator. 
All rights reserved.*" 394 | ] 395 | } 396 | ], 397 | "metadata": { 398 | "kernelspec": { 399 | "display_name": "Python 3", 400 | "language": "python", 401 | "name": "python3" 402 | }, 403 | "language_info": { 404 | "codemirror_mode": { 405 | "name": "ipython", 406 | "version": 3 407 | }, 408 | "file_extension": ".py", 409 | "mimetype": "text/x-python", 410 | "name": "python", 411 | "nbconvert_exporter": "python", 412 | "pygments_lexer": "ipython3", 413 | "version": "3.7.3" 414 | }, 415 | "nbclean": true 416 | }, 417 | "nbformat": 4, 418 | "nbformat_minor": 1 419 | } 420 | -------------------------------------------------------------------------------- /mp1-program-flow.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 101, 6 | "metadata": { 7 | "init_cell": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "%logstop\n", 12 | "%logstart -rtq ~/.logs/ip.py append\n", 13 | "%matplotlib inline\n", 14 | "import matplotlib\n", 15 | "import seaborn as sns\n", 16 | "sns.set()\n", 17 | "matplotlib.rcParams['figure.dpi'] = 144" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 4, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "from static_grader import grader" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "# Program Flow exercises\n", 34 | "\n", 35 | "The objective of these exercises is to develop your ability to use iteration and conditional logic to build reusable functions. We will be extending our `get_primes` example from the [Program Flow notebook](../PY_ProgramFlow.ipynb) for testing whether much larger numbers are prime. Large primes are useful for encryption. It is too slow to test every possible factor of a large number to determine if it is prime, so we will take a different approach." 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "## Exercise 1: `mersenne_numbers`\n", 43 | "\n", 44 | "A Mersenne number is any number that can be written as $2^p - 1$ for some $p$. For example, 3 is a Mersenne number ($2^2 - 1$) as is 31 ($2^5 - 1$). We will see later on that it is easy to test if Mersenne numbers are prime.\n", 45 | "\n", 46 | "Write a function that accepts an exponent $p$ and returns the corresponding Mersenne number." 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 5, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "def mersenne_number(p):\n", 56 | " return 2**p - 1" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "Mersenne numbers can only be prime if their exponent, $p$, is prime. Make a list of the Mersenne numbers for all primes $p$ between 3 and 65 (there should be 17 of them).\n", 64 | "\n", 65 | "Hint: It may be useful to modify the `is_prime` and `get_primes` functions from [the Program Flow notebook](PY_ProgramFlow.ipynb) for use in this problem." 
66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 8, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "def is_prime(number):\n", 75 | " if number <= 1:\n", 76 | " return False\n", 77 | " else:\n", 78 | " for factor in range(2,number):\n", 79 | " if number % factor == 0:\n", 80 | " return False\n", 81 | " return True\n", 82 | "\n", 83 | "def get_primes(n_start, n_end):\n", 84 | " return [number for number in range(n_start, n_end+1) if is_prime(number)]" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "The next cell shows a dummy solution, a list of 17 sevens. Alter the next cell to make use of the functions you've defined above to create the appropriate list of Mersenne numbers." 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 9, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "mersennes = [mersenne_number(p) for p in get_primes(3,65)]" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 11, 106 | "metadata": {}, 107 | "outputs": [ 108 | { 109 | "name": "stdout", 110 | "output_type": "stream", 111 | "text": [ 112 | "==================\n", 113 | "Your score: 1.0\n", 114 | "==================\n" 115 | ] 116 | } 117 | ], 118 | "source": [ 119 | "grader.score.ip__mersenne_numbers(mersennes)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "## Exercise 2: `lucas_lehmer`\n", 127 | "\n", 128 | "We can test if a Mersenne number is prime using the [Lucas-Lehmer test](https://en.wikipedia.org/wiki/Lucas%E2%80%93Lehmer_primality_test). First let's write a function that generates the sequence used in the test. Given a Mersenne number with exponent $p$, the sequence can be defined as\n", 129 | "\n", 130 | "$$ n_0 = 4 $$\n", 131 | "$$ n_i = (n_{i-1}^2 - 2) mod (2^p - 1) $$\n", 132 | "\n", 133 | "Write a function that accepts the exponent $p$ of a Mersenne number and returns the Lucas-Lehmer sequence up to $i = p - 2$ (inclusive). Remember that the [modulo operation](https://en.wikipedia.org/wiki/Modulo_operation) is implemented in Python as `%`." 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 22, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "def lucas_lehmer(p):\n", 143 | " n = [4]\n", 144 | " if p > 2:\n", 145 | " for i in range(1, (p-2)+1):\n", 146 | " n.append(0)\n", 147 | " n[i] = (n[i-1]**2 - 2) % (2**p - 1)\n", 148 | " return n" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "Use your function to calculate the Lucas-Lehmer series for $p = 17$ and pass the result to the grader." 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 23, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "name": "stdout", 165 | "output_type": "stream", 166 | "text": [ 167 | "==================\n", 168 | "Your score: 1.0\n", 169 | "==================\n" 170 | ] 171 | } 172 | ], 173 | "source": [ 174 | "ll_result = lucas_lehmer(17)\n", 175 | "\n", 176 | "grader.score.ip__lucas_lehmer(ll_result)" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "## Exercise 3: `mersenne_primes`\n", 184 | "\n", 185 | "For a given Mersenne number with exponent $p$, the number is prime if the Lucas-Lehmer series is 0 at position $p-2$. Write a function that tests if a Mersenne number with exponent $p$ is prime. 
Test if the Mersenne numbers with prime $p$ between 3 and 65 (i.e. 3, 5, 7, ..., 61) are prime. Your final answer should be a list of tuples consisting of `(Mersenne exponent, 0)` (or `1`) for each Mersenne number you test, where `0` and `1` are replacements for `False` and `True` respectively.\n", 186 | "\n", 187 | "_HINT: The `zip` function is useful for combining two lists into a list of tuples_" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": 24, 193 | "metadata": {}, 194 | "outputs": [], 195 | "source": [ 196 | "def ll_prime(p):\n", 197 | " return lucas_lehmer(p)[-1] == 0" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 31, 203 | "metadata": {}, 204 | "outputs": [ 205 | { 206 | "name": "stdout", 207 | "output_type": "stream", 208 | "text": [ 209 | "==================\n", 210 | "Your score: 1.0\n", 211 | "==================\n" 212 | ] 213 | } 214 | ], 215 | "source": [ 216 | "mersenne_primes = list(zip([p for p in get_primes(3,65)], [int(ll_prime(p)) for p in get_primes(3,65)]))\n", 217 | "\n", 218 | "grader.score.ip__mersenne_primes(mersenne_primes)" 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": {}, 224 | "source": [ 225 | "## Exercise 4: Optimize `is_prime`\n", 226 | "\n", 227 | "You might have noticed that the primality check `is_prime` we developed before is somewhat slow for large numbers. This is because we are doing a ton of extra work checking every possible factor of the tested number. We will use two optimizations to make a `is_prime_fast` function.\n", 228 | "\n", 229 | "The first optimization takes advantage of the fact that two is the only even prime. Thus we can check if a number is even and as long as its greater than 2, we know that it is not prime.\n", 230 | "\n", 231 | "Our second optimization takes advantage of the fact that when checking factors, we only need to check odd factors up to the square root of a number. Consider a number $n$ decomposed into factors $n=ab$. There are two cases, either $n$ is prime and without loss of generality, $a=n, b=1$ or $n$ is not prime and $a,b \\neq n,1$. In this case, if $a > \\sqrt{n}$, then $b<\\sqrt{n}$. So we only need to check all possible values of $b$ and we get the values of $a$ for free! This means that even the simple method of checking factors will increase in complexity as a square root compared to the size of the number instead of linearly.\n", 232 | "\n", 233 | "Lets write the function to do this and check the speed! `is_prime_fast` will take a number and return whether or not it is prime.\n", 234 | "\n", 235 | "You will see the functions followed by a cell with an `assert` statement. These cells should run and produce no output, if they produce an error, then your function needs to be modified. Do not modify the assert statements, they are exactly as they should be!" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 34, 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "import math\n", 245 | "\n", 246 | "def is_prime_fast(number):\n", 247 | " if number <= 1 or ( number > 2 and number % 2 == 0 ):\n", 248 | " return False\n", 249 | " else:\n", 250 | " for factor in range(2, int(math.sqrt(number))+1):\n", 251 | " if number % factor == 0:\n", 252 | " return False\n", 253 | " return True" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "Run the following cell to make sure it finds the same primes as the original function." 
261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": 35, 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "for n in range(10000):\n", 270 | " assert is_prime(n) == is_prime_fast(n)" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "metadata": {}, 276 | "source": [ 277 | "Now lets check the timing, here we will use the `%%timeit` magic which will time the execution of a particular cell." 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 36, 283 | "metadata": {}, 284 | "outputs": [ 285 | { 286 | "name": "stdout", 287 | "output_type": "stream", 288 | "text": [ 289 | "4.91 s ± 32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" 290 | ] 291 | } 292 | ], 293 | "source": [ 294 | "%%timeit\n", 295 | "is_prime(67867967)" 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": 37, 301 | "metadata": {}, 302 | "outputs": [ 303 | { 304 | "name": "stdout", 305 | "output_type": "stream", 306 | "text": [ 307 | "600 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" 308 | ] 309 | } 310 | ], 311 | "source": [ 312 | "%%timeit\n", 313 | "is_prime_fast(67867967)" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "Now return a function which will find all prime numbers up to and including $n$. Submit this function to the grader." 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": 38, 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [ 329 | "def get_primes_fast(n):\n", 330 | " return [number for number in range(2,n) if is_prime_fast(number)]" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": 39, 336 | "metadata": {}, 337 | "outputs": [ 338 | { 339 | "name": "stdout", 340 | "output_type": "stream", 341 | "text": [ 342 | "==================\n", 343 | "Your score: 1.0\n", 344 | "==================\n" 345 | ] 346 | } 347 | ], 348 | "source": [ 349 | "grader.score.ip__is_prime_fast(get_primes_fast)" 350 | ] 351 | }, 352 | { 353 | "cell_type": "markdown", 354 | "metadata": {}, 355 | "source": [ 356 | "## Exercise 5: sieve\n", 357 | "\n", 358 | "In this problem we will develop an even faster method which is known as the Sieve of Eratosthenes (although it will be more expensive in terms of memory). The Sieve of Eratosthenes is an example of dynamic programming, where the general idea is to not redo computations we have already done (read more about it [here](https://en.wikipedia.org/wiki/Dynamic_programming)). We will break this sieve down into several small functions. \n", 359 | "\n", 360 | "Our submission will be a list of all prime numbers less than 2000.\n", 361 | "\n", 362 | "The method works as follows (see [here](https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes) for more details)\n", 363 | "\n", 364 | "1. Generate a list of all numbers between 0 and N; mark the numbers 0 and 1 to be not prime\n", 365 | "2. Starting with $p=2$ (the first prime) mark all numbers of the form $np$ where $n>1$ and $np <= N$ to be not prime (they can't be prime since they are multiples of 2!)\n", 366 | "3. Find the smallest number greater than $p$ which is not marked and set that equal to $p$, then go back to step 2. Stop if there is no unmarked number greater than $p$ and less than $N+1$\n", 367 | "\n", 368 | "We will break this up into a few functions, our general strategy will be to use a Python `list` as our container although we could use other data structures. 
The index of this list will represent numbers.\n", 369 | "\n", 370 | "We have implemented a `sieve` function which will find all the prime numbers up to $n$. You will need to implement the functions which it calls. They are as follows\n", 371 | "\n", 372 | "* `list_true` Make a list of true values of length $n+1$ where the first two values are false (this corresponds with step 1 of the algorithm above)\n", 373 | "* `mark_false` takes a list of booleans and a number $p$. Mark all elements $2p,3p,...n$ false (this corresponds with step 2 of the algorithm above)\n", 374 | "* `find_next` Find the smallest `True` element in a list which is greater than some $p$ (has index greater than $p$ (this corresponds with step 3 of the algorithm above)\n", 375 | "* `prime_from_list` Return indices of True values\n", 376 | "\n", 377 | "Remember that python lists are zero indexed. We have provided assertions below to help you assess whether your functions are functioning properly." 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": 89, 383 | "metadata": {}, 384 | "outputs": [], 385 | "source": [ 386 | "def list_true(n):\n", 387 | " return [False, False] + [True for i in range(2, n+1)]" 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": 47, 393 | "metadata": {}, 394 | "outputs": [], 395 | "source": [ 396 | "assert len(list_true(20)) == 21\n", 397 | "assert list_true(20)[0] is False\n", 398 | "assert list_true(20)[1] is False" 399 | ] 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "metadata": {}, 404 | "source": [ 405 | "Now we want to write a function which takes a list of elements and a number $p$ and marks elements false which are in the range $2p,3p ... N$." 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": 50, 411 | "metadata": {}, 412 | "outputs": [], 413 | "source": [ 414 | "def mark_false(bool_list, p):\n", 415 | " i = 2\n", 416 | " while i*p < len(bool_list) :\n", 417 | " bool_list[i*p] = False\n", 418 | " i+=1\n", 419 | " return bool_list" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": 52, 425 | "metadata": {}, 426 | "outputs": [], 427 | "source": [ 428 | "assert mark_false(list_true(6), 2) == [False, False, True, True, False, True, False]" 429 | ] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": {}, 434 | "source": [ 435 | "Now lets write a `find_next` function which returns the smallest element in a list which is not false and is greater than $p$." 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": 62, 441 | "metadata": {}, 442 | "outputs": [], 443 | "source": [ 444 | "def find_next(bool_list, p):\n", 445 | " for i in range(0,len(bool_list)):\n", 446 | " if bool_list[i] == True and i>p :\n", 447 | " return i\n", 448 | " return None" 449 | ] 450 | }, 451 | { 452 | "cell_type": "code", 453 | "execution_count": 63, 454 | "metadata": {}, 455 | "outputs": [], 456 | "source": [ 457 | "assert find_next([True, True, True, True], 2) == 3\n", 458 | "assert find_next([True, True, True, False], 2) is None" 459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "metadata": {}, 464 | "source": [ 465 | "Now given a list of `True` and `False`, return the index of the true values." 
466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": 90, 471 | "metadata": {}, 472 | "outputs": [], 473 | "source": [ 474 | "def prime_from_list(bool_list):\n", 475 | " return [i for i, value in enumerate(bool_list) if value == True]" 476 | ] 477 | }, 478 | { 479 | "cell_type": "code", 480 | "execution_count": 91, 481 | "metadata": {}, 482 | "outputs": [], 483 | "source": [ 484 | "assert prime_from_list([False, False, True, True, False]) == [2, 3]" 485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": 92, 490 | "metadata": {}, 491 | "outputs": [], 492 | "source": [ 493 | "def sieve(n):\n", 494 | " bool_list = list_true(n)\n", 495 | " p = 2\n", 496 | " while p is not None:\n", 497 | " bool_list = mark_false(bool_list, p)\n", 498 | " p = find_next(bool_list, p)\n", 499 | " return prime_from_list(bool_list)" 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": 93, 505 | "metadata": {}, 506 | "outputs": [], 507 | "source": [ 508 | "assert sieve(1000) == get_primes(0, 1000)" 509 | ] 510 | }, 511 | { 512 | "cell_type": "code", 513 | "execution_count": 99, 514 | "metadata": {}, 515 | "outputs": [ 516 | { 517 | "name": "stdout", 518 | "output_type": "stream", 519 | "text": [ 520 | "370 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" 521 | ] 522 | } 523 | ], 524 | "source": [ 525 | "%%timeit \n", 526 | "sieve(10000)" 527 | ] 528 | }, 529 | { 530 | "cell_type": "code", 531 | "execution_count": 100, 532 | "metadata": {}, 533 | "outputs": [ 534 | { 535 | "name": "stdout", 536 | "output_type": "stream", 537 | "text": [ 538 | "415 ms ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" 539 | ] 540 | } 541 | ], 542 | "source": [ 543 | "%%timeit \n", 544 | "get_primes(0, 10000)" 545 | ] 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": 96, 550 | "metadata": {}, 551 | "outputs": [ 552 | { 553 | "name": "stdout", 554 | "output_type": "stream", 555 | "text": [ 556 | "==================\n", 557 | "Your score: 1.0\n", 558 | "==================\n" 559 | ] 560 | } 561 | ], 562 | "source": [ 563 | "grader.score.ip__eratosthenes(sieve)" 564 | ] 565 | }, 566 | { 567 | "cell_type": "markdown", 568 | "metadata": {}, 569 | "source": [ 570 | "*Copyright © 2020 The Data Incubator. 
All rights reserved.*" 571 | ] 572 | } 573 | ], 574 | "metadata": { 575 | "kernelspec": { 576 | "display_name": "Python 3", 577 | "language": "python", 578 | "name": "python3" 579 | }, 580 | "language_info": { 581 | "codemirror_mode": { 582 | "name": "ipython", 583 | "version": 3 584 | }, 585 | "file_extension": ".py", 586 | "mimetype": "text/x-python", 587 | "name": "python", 588 | "nbconvert_exporter": "python", 589 | "pygments_lexer": "ipython3", 590 | "version": "3.7.3" 591 | }, 592 | "nbclean": true 593 | }, 594 | "nbformat": 4, 595 | "nbformat_minor": 1 596 | } 597 | -------------------------------------------------------------------------------- /mp3-pw-data-structures.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "init_cell": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "%logstop\n", 12 | "%logstart -rtq ~/.logs/pw.py append\n", 13 | "%matplotlib inline\n", 14 | "import matplotlib\n", 15 | "import seaborn as sns\n", 16 | "sns.set()\n", 17 | "matplotlib.rcParams['figure.dpi'] = 144" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 3, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "from static_grader import grader" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "# PW Miniproject\n", 34 | "## Introduction\n", 35 | "\n", 36 | "The objective of this miniproject is to exercise your ability to use basic Python data structures, define functions, and control program flow. We will be using these concepts to perform some fundamental data wrangling tasks such as joining data sets together, splitting data into groups, and aggregating data into summary statistics.\n", 37 | "**Please do not use `pandas` or `numpy` to answer these questions.**\n", 38 | "\n", 39 | "We will be working with medical data from the British NHS on prescription drugs. Since this is real data, it contains many ambiguities that we will need to confront in our analysis. This is commonplace in data science, and is one of the lessons you will learn in this miniproject." 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "## Downloading the data\n", 47 | "\n", 48 | "We first need to download the data we'll be using from Amazon S3:" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 4, 54 | "metadata": {}, 55 | "outputs": [ 56 | { 57 | "name": "stderr", 58 | "output_type": "stream", 59 | "text": [ 60 | "mkdir: cannot create directory ‘pw-data’: File exists\n", 61 | "File ‘./pw-data/201701scripts_sample.json.gz’ already there; not retrieving.\n", 62 | "\n", 63 | "File ‘./pw-data/practices.json.gz’ already there; not retrieving.\n", 64 | "\n" 65 | ] 66 | } 67 | ], 68 | "source": [ 69 | "%%bash\n", 70 | "mkdir pw-data\n", 71 | "wget http://dataincubator-wqu.s3.amazonaws.com/pwdata/201701scripts_sample.json.gz -nc -P ./pw-data\n", 72 | "wget http://dataincubator-wqu.s3.amazonaws.com/pwdata/practices.json.gz -nc -P ./pw-data" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "## Loading the data\n", 80 | "\n", 81 | "The first step of the project is to read in the data. We will discuss reading and writing various kinds of files later in the course, but the code below should get you started." 
82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 4, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "import gzip\n", 91 | "import simplejson as json" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 5, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "with gzip.open('./pw-data/201701scripts_sample.json.gz', 'rb') as f:\n", 101 | " scripts = json.load(f)\n", 102 | "\n", 103 | "with gzip.open('./pw-data/practices.json.gz', 'rb') as f:\n", 104 | " practices = json.load(f)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "This data set comes from Britain's National Health Service. The `scripts` variable is a list of prescriptions issued by NHS doctors. Each prescription is represented by a dictionary with various data fields: `'practice'`, `'bnf_code'`, `'bnf_name'`, `'quantity'`, `'items'`, `'nic'`, and `'act_cost'`. " 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 71, 117 | "metadata": {}, 118 | "outputs": [ 119 | { 120 | "data": { 121 | "text/plain": [ 122 | "[{'bnf_code': '0101010G0AAABAB',\n", 123 | " 'items': 2,\n", 124 | " 'practice': 'N81013',\n", 125 | " 'bnf_name': 'Co-Magaldrox_Susp 195mg/220mg/5ml S/F',\n", 126 | " 'nic': 5.98,\n", 127 | " 'act_cost': 5.56,\n", 128 | " 'quantity': 1000,\n", 129 | " 'post_code': 'SK11 6JL'},\n", 130 | " {'bnf_code': '0101021B0AAAHAH',\n", 131 | " 'items': 1,\n", 132 | " 'practice': 'N81013',\n", 133 | " 'bnf_name': 'Alginate_Raft-Forming Oral Susp S/F',\n", 134 | " 'nic': 1.95,\n", 135 | " 'act_cost': 1.82,\n", 136 | " 'quantity': 500,\n", 137 | " 'post_code': 'SK11 6JL'}]" 138 | ] 139 | }, 140 | "execution_count": 71, 141 | "metadata": {}, 142 | "output_type": "execute_result" 143 | } 144 | ], 145 | "source": [ 146 | "scripts[:2]" 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "A [glossary of terms](http://webarchive.nationalarchives.gov.uk/20180328130852tf_/http://content.digital.nhs.uk/media/10686/Download-glossary-of-terms-for-GP-prescribing---presentation-level/pdf/PLP_Presentation_Level_Glossary_April_2015.pdf/) and [FAQ](http://webarchive.nationalarchives.gov.uk/20180328130852tf_/http://content.digital.nhs.uk/media/10048/FAQs-Practice-Level-Prescribingpdf/pdf/PLP_FAQs_April_2015.pdf/) is available from the NHS regarding the data. Below we supply a data dictionary briefly describing what these fields mean.\n", 154 | "\n", 155 | "| Data field |Description|\n", 156 | "|:----------:|-----------|\n", 157 | "|`'practice'`|Code designating the medical practice issuing the prescription|\n", 158 | "|`'bnf_code'`|British National Formulary drug code|\n", 159 | "|`'bnf_name'`|British National Formulary drug name|\n", 160 | "|`'quantity'`|Number of capsules/quantity of liquid/grams of powder prescribed|\n", 161 | "| `'items'` |Number of refills (e.g. if `'quantity'` is 30 capsules, 3 `'items'` means 3 bottles of 30 capsules)|\n", 162 | "| `'nic'` |Net ingredient cost|\n", 163 | "|`'act_cost'`|Total cost including containers, fees, and discounts|" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "The `practices` variable is a list of member medical practices of the NHS. Each practice is represented by a dictionary containing identifying information for the medical practice. Most of the data fields are self-explanatory. 
Notice the values in the `'code'` field of `practices` match the values in the `'practice'` field of `scripts`." 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 27, 176 | "metadata": {}, 177 | "outputs": [ 178 | { 179 | "data": { 180 | "text/plain": [ 181 | "[{'code': 'A81001',\n", 182 | " 'name': 'THE DENSHAM SURGERY',\n", 183 | " 'addr_1': 'THE HEALTH CENTRE',\n", 184 | " 'addr_2': 'LAWSON STREET',\n", 185 | " 'borough': 'STOCKTON ON TEES',\n", 186 | " 'village': 'CLEVELAND',\n", 187 | " 'post_code': 'TS18 1HU'},\n", 188 | " {'code': 'A81002',\n", 189 | " 'name': 'QUEENS PARK MEDICAL CENTRE',\n", 190 | " 'addr_1': 'QUEENS PARK MEDICAL CTR',\n", 191 | " 'addr_2': 'FARRER STREET',\n", 192 | " 'borough': 'STOCKTON ON TEES',\n", 193 | " 'village': 'CLEVELAND',\n", 194 | " 'post_code': 'TS18 2AW'}]" 195 | ] 196 | }, 197 | "execution_count": 27, 198 | "metadata": {}, 199 | "output_type": "execute_result" 200 | } 201 | ], 202 | "source": [ 203 | "practices[:2]" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "In the following questions we will ask you to explore this data set. You may need to combine pieces of the data set together in order to answer some questions. Not every element of the data set will be used in answering the questions." 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "metadata": {}, 216 | "source": [ 217 | "## Question 1: summary_statistics\n", 218 | "\n", 219 | "Our beneficiary data (`scripts`) contains quantitative data on the number of items dispensed (`'items'`), the total quantity of item dispensed (`'quantity'`), the net cost of the ingredients (`'nic'`), and the actual cost to the patient (`'act_cost'`). Whenever working with a new data set, it can be useful to calculate summary statistics to develop a feeling for the volume and character of the data. This makes it easier to spot trends and significant features during further stages of analysis.\n", 220 | "\n", 221 | "Calculate the sum, mean, standard deviation, and quartile statistics for each of these quantities. Format your results for each quantity as a list: `[sum, mean, standard deviation, 1st quartile, median, 3rd quartile]`. We'll create a `tuple` with these lists for each quantity as a final result." 
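A quick, optional way to sanity-check the hand-rolled statistics below is the standard-library `statistics` module (a sketch only, assuming `scripts` is already loaded; the graded answer is written from scratch, and quartile conventions can differ between implementations):

```python
# Optional cross-check, not part of the graded answer.
import statistics

items = [script['items'] for script in scripts]
print(sum(items))                # total
print(statistics.mean(items))    # mean
print(statistics.stdev(items))   # sample standard deviation (n - 1 denominator, as used below)
print(statistics.median(items))  # median
```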
222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": 6, 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "import math\n", 231 | "\n", 232 | "def describe(key):\n", 233 | " \n", 234 | " n = len(scripts)\n", 235 | " \n", 236 | " #mean\n", 237 | " total = sum([script[key] for script in scripts])\n", 238 | " avg = total/n\n", 239 | " \n", 240 | " #standard deviation\n", 241 | " s = math.sqrt(sum([(script[key] - avg)**2 for script in scripts])/(n-1))\n", 242 | " \n", 243 | " #median\n", 244 | " sorted_set = sorted([script[key] for script in scripts])\n", 245 | " med = (sorted_set[n//2]+sorted_set[n//2 + 1])/2 if n%2 == 0 else sorted_set[math.ceil(n/2)]\n", 246 | " \n", 247 | " #first and third quartiles\n", 248 | " lower_half = sorted_set[: math.floor(n/2) + 1]\n", 249 | " upper_half = sorted_set[math.ceil(n/2) + 1 :]\n", 250 | " m = len(lower_half)\n", 251 | " q25 = (lower_half[m//2]+lower_half[m//2 + 1])/2\n", 252 | " q75 = (upper_half[m//2]+upper_half[m//2 + 1])/2\n", 253 | "\n", 254 | " return [total, avg, s, q25, med, q75]" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": 7, 260 | "metadata": {}, 261 | "outputs": [], 262 | "source": [ 263 | "summary = [('items', describe('items')),\n", 264 | " ('quantity', describe('quantity')),\n", 265 | " ('nic', describe('nic')),\n", 266 | " ('act_cost', describe('act_cost'))]" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 14, 272 | "metadata": {}, 273 | "outputs": [ 274 | { 275 | "name": "stdout", 276 | "output_type": "stream", 277 | "text": [ 278 | "==================\n", 279 | "Your score: 1.0\n", 280 | "==================\n" 281 | ] 282 | } 283 | ], 284 | "source": [ 285 | "grader.score.pw__summary_statistics(summary)" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [ 292 | "## Question 2: most_common_item\n", 293 | "\n", 294 | "Often we are not interested only in how the data is distributed in our entire data set, but within particular groups -- for example, how many items of each drug (i.e. `'bnf_name'`) were prescribed? Calculate the total items prescribed for each `'bnf_name'`. What is the most commonly prescribed `'bnf_name'` in our data?\n", 295 | "\n", 296 | "To calculate this, we first need to split our data set into groups corresponding with the different values of `'bnf_name'`. Then we can sum the number of items dispensed within in each group. Finally we can find the largest sum.\n", 297 | "\n", 298 | "We'll use `'bnf_name'` to construct our groups. You should have *5619* unique values for `'bnf_name'`." 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": 37, 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [ 307 | "bnf_names = list(set([script['bnf_name'] for script in scripts]))\n", 308 | "assert(len(bnf_names) == 5619)" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": {}, 314 | "source": [ 315 | "We want to construct \"groups\" identified by `'bnf_name'`, where each group is a collection of prescriptions (i.e. dictionaries from `scripts`). We'll construct a dictionary called `groups`, using `bnf_names` as the keys. We'll represent a group with a `list`, since we can easily append new members to the group. To split our `scripts` into groups by `'bnf_name'`, we should iterate over `scripts`, appending prescription dictionaries to each group as we encounter them." 
316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": 38, 321 | "metadata": {}, 322 | "outputs": [], 323 | "source": [ 324 | "groups = {name: [] for name in bnf_names}\n", 325 | "for script in scripts:\n", 326 | " groups[script['bnf_name']].append(script)" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "Now that we've constructed our groups we should sum up `'items'` in each group and find the `'bnf_name'` with the largest sum. The result, `max_item`, should have the form `[(bnf_name, item total)]`, e.g. `[('Foobar', 2000)]`." 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": 39, 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "bnf_items = [(name, sum([group['items'] for group in groups[name]])) for name in groups.keys()]" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 40, 348 | "metadata": {}, 349 | "outputs": [], 350 | "source": [ 351 | "max_item = [max(bnf_items, key= lambda x : x[1])]" 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": 41, 357 | "metadata": {}, 358 | "outputs": [ 359 | { 360 | "data": { 361 | "text/plain": [ 362 | "[('Omeprazole_Cap E/C 20mg', 113826)]" 363 | ] 364 | }, 365 | "execution_count": 41, 366 | "metadata": {}, 367 | "output_type": "execute_result" 368 | } 369 | ], 370 | "source": [ 371 | "max_item" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "metadata": {}, 377 | "source": [ 378 | "**TIP:** If you are getting an error from the grader below, please make sure your answer conforms to the correct format of `[(bnf_name, item total)]`." 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": 25, 384 | "metadata": {}, 385 | "outputs": [ 386 | { 387 | "name": "stdout", 388 | "output_type": "stream", 389 | "text": [ 390 | "==================\n", 391 | "Your score: 1.0\n", 392 | "==================\n" 393 | ] 394 | } 395 | ], 396 | "source": [ 397 | "grader.score.pw__most_common_item(max_item)" 398 | ] 399 | }, 400 | { 401 | "cell_type": "markdown", 402 | "metadata": {}, 403 | "source": [ 404 | "**Challenge:** Write a function that constructs groups as we did above. The function should accept a list of dictionaries (e.g. `scripts` or `practices`) and a tuple of fields to `groupby` (e.g. `('bnf_name')` or `('bnf_name', 'post_code')`) and returns a dictionary of groups. The following questions will require you to aggregate data in groups, so this could be a useful function for the rest of the miniproject." 
405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": 103, 410 | "metadata": {}, 411 | "outputs": [], 412 | "source": [ 413 | "#def group_by_field(data, fields):\n", 414 | " \n", 415 | "# field_values = [ list(set([entry[field] for entry in data])) for field in fields ]\n", 416 | " \n", 417 | "# from itertools import product\n", 418 | "# field_combinations = list(product(field_values[0], field_values[1])) if len(field_values)==2 else field_values[0]\n", 419 | " \n", 420 | "# groups = {key: [] for key in field_combinations}\n", 421 | " \n", 422 | "# for dic in data:\n", 423 | "# for key in groups.keys():\n", 424 | "# if set(key).issubset(dic.values()):\n", 425 | "# groups[key].append(dic)\n", 426 | " \n", 427 | "# return groups" 428 | ] 429 | }, 430 | { 431 | "cell_type": "code", 432 | "execution_count": 6, 433 | "metadata": {}, 434 | "outputs": [], 435 | "source": [ 436 | "def group_by_field(data, fields):\n", 437 | " \n", 438 | " groups = {}\n", 439 | " \n", 440 | " for dic in data:\n", 441 | " key = dic[fields[0]] if len(fields) == 1 else (dic[fields[0]], dic[fields[1]])\n", 442 | " if key in groups:\n", 443 | " groups[key].append(dic)\n", 444 | " else:\n", 445 | " groups[key] = [dic]\n", 446 | " \n", 447 | " return groups" 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": 48, 453 | "metadata": {}, 454 | "outputs": [], 455 | "source": [ 456 | "groups = group_by_field(scripts, ('bnf_name',))\n", 457 | "bnf_items = [(key, sum([group['items'] for group in groups[key]])) for key in groups.keys()]\n", 458 | "test_max_item = [max(bnf_items, key= lambda x : x[1])]\n", 459 | "\n", 460 | "assert test_max_item == max_item" 461 | ] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": {}, 466 | "source": [ 467 | "## Question 3: postal_totals\n", 468 | "\n", 469 | "Our data set is broken up among different files. This is typical for tabular data to reduce redundancy. Each table typically contains data about a particular type of event, processes, or physical object. Data on prescriptions and medical practices are in separate files in our case. If we want to find the total items prescribed in each postal code, we will have to _join_ our prescription data (`scripts`) to our clinic data (`practices`).\n", 470 | "\n", 471 | "Find the total items prescribed in each postal code, representing the results as a list of tuples `(post code, total items prescribed)`. Sort your results ascending alphabetically by post code and take only results from the first 100 post codes. Only include post codes if there is at least one prescription from a practice in that post code.\n", 472 | "\n", 473 | "**NOTE:** Some practices have multiple postal codes associated with them. Use the alphabetically first postal code." 474 | ] 475 | }, 476 | { 477 | "cell_type": "markdown", 478 | "metadata": {}, 479 | "source": [ 480 | "We can join `scripts` and `practices` based on the fact that `'practice'` in `scripts` matches `'code'` in `practices`. However, we must first deal with the repeated values of `'code'` in `practices`. We want the alphabetically first postal codes." 
481 | ] 482 | }, 483 | { 484 | "cell_type": "code", 485 | "execution_count": 44, 486 | "metadata": {}, 487 | "outputs": [], 488 | "source": [ 489 | "practice_postal = {}\n", 490 | "for practice in practices:\n", 491 | " if practice['code'] in practice_postal :\n", 492 | " if practice['post_code'] < practice_postal[practice['code']]:\n", 493 | " practice_postal[practice['code']] = practice['post_code']\n", 494 | " else:\n", 495 | " practice_postal[practice['code']] = practice['post_code']" 496 | ] 497 | }, 498 | { 499 | "cell_type": "markdown", 500 | "metadata": {}, 501 | "source": [ 502 | "**Challenge:** This is an aggregation of the practice data grouped by practice codes. Write an alternative implementation of the above cell using the `group_by_field` function you defined previously." 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": 8, 508 | "metadata": {}, 509 | "outputs": [], 510 | "source": [ 511 | "groupsByCode = group_by_field(practices, ('code',))\n", 512 | "#post_codes = [sorted([practice['post_code'] for practice in groupsByCode[code]])[0] for code in groupsByCode.keys()]\n", 513 | "#practice_postal = {key:value for (key,value) in zip(groupsByCode.keys(), post_codes)}\n", 514 | "practice_postal = {key:value for key in groupsByCode.keys() \\\n", 515 | " for value in sorted([group['post_code'] for group in groupsByCode[key]])[:1]}" 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": 9, 521 | "metadata": {}, 522 | "outputs": [], 523 | "source": [ 524 | "assert practice_postal['K82019'] == 'HP21 8TR'" 525 | ] 526 | }, 527 | { 528 | "cell_type": "markdown", 529 | "metadata": {}, 530 | "source": [ 531 | "Now we can join `practice_postal` to `scripts`." 532 | ] 533 | }, 534 | { 535 | "cell_type": "code", 536 | "execution_count": 10, 537 | "metadata": {}, 538 | "outputs": [], 539 | "source": [ 540 | "joined = scripts[:]\n", 541 | "for script in joined:\n", 542 | " script['post_code'] = practice_postal[script['practice']]" 543 | ] 544 | }, 545 | { 546 | "cell_type": "markdown", 547 | "metadata": {}, 548 | "source": [ 549 | "Finally we'll group the prescription dictionaries in `joined` by `'post_code'` and sum up the items prescribed in each group, as we did in the previous question." 
550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": 11, 555 | "metadata": {}, 556 | "outputs": [], 557 | "source": [ 558 | "groups = group_by_field(joined, ('post_code',))\n", 559 | "items_by_post = [(name, sum([group['items'] for group in groups[name]])) for name in groups.keys()]" 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": 22, 565 | "metadata": {}, 566 | "outputs": [ 567 | { 568 | "data": { 569 | "text/plain": [ 570 | "[('SK11 6JL', 110071),\n", 571 | " ('CW5 5NX', 38797),\n", 572 | " ('CW1 3AW', 64104),\n", 573 | " ('CW7 1AT', 43164),\n", 574 | " ('CH65 6TG', 25090)]" 575 | ] 576 | }, 577 | "execution_count": 22, 578 | "metadata": {}, 579 | "output_type": "execute_result" 580 | } 581 | ], 582 | "source": [ 583 | "items_by_post[:5]" 584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "execution_count": 68, 589 | "metadata": {}, 590 | "outputs": [ 591 | { 592 | "name": "stdout", 593 | "output_type": "stream", 594 | "text": [ 595 | "==================\n", 596 | "Your score: 1.0\n", 597 | "==================\n" 598 | ] 599 | } 600 | ], 601 | "source": [ 602 | "postal_totals = sorted(items_by_post)[:100]\n", 603 | "\n", 604 | "grader.score.pw__postal_totals(postal_totals)" 605 | ] 606 | }, 607 | { 608 | "cell_type": "markdown", 609 | "metadata": {}, 610 | "source": [ 611 | "## Question 4: items_by_region\n", 612 | "\n", 613 | "Now we'll combine the techniques we've developed to answer a more complex question. Find the most commonly dispensed item in each postal code, representing the results as a list of tuples (`post_code`, `bnf_name`, amount dispensed as proportion of total). Sort your results ascending alphabetically by post code and take only results from the first 100 post codes.\n", 614 | "\n", 615 | "**NOTE:** We'll continue to use the `joined` variable we created before, where we've chosen the alphabetically first postal code for each practice. Additionally, some postal codes will have multiple `'bnf_name'` with the same number of items prescribed for the maximum. In this case, we'll take the alphabetically first `'bnf_name'`." 616 | ] 617 | }, 618 | { 619 | "cell_type": "markdown", 620 | "metadata": {}, 621 | "source": [ 622 | "Now we need to calculate the total items of each `'bnf_name'` prescribed in each `'post_code'`. Use the techniques we developed in the previous questions to calculate these totals. You should have 141196 `('post_code', 'bnf_name')` groups." 
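Grouping on a pair of fields works exactly like grouping on a single field, because a tuple is a perfectly good dictionary key. For the sum-of-items aggregation specifically, a `collections.Counter` keyed by the pair does the grouping and the summing in one pass; a sketch with invented rows:

```python
from collections import Counter

# Invented rows; only the field names mirror the real `joined` list.
toy_joined = [
    {'post_code': 'SK11 6JL', 'bnf_name': 'Drug X', 'items': 2},
    {'post_code': 'SK11 6JL', 'bnf_name': 'Drug X', 'items': 3},
    {'post_code': 'SK11 6JL', 'bnf_name': 'Drug Y', 'items': 1},
    {'post_code': 'CW5 5NX', 'bnf_name': 'Drug X', 'items': 4},
]

totals = Counter()
for row in toy_joined:
    totals[(row['post_code'], row['bnf_name'])] += row['items']

# totals[('SK11 6JL', 'Drug X')] == 5
# totals[('SK11 6JL', 'Drug Y')] == 1
# totals[('CW5 5NX', 'Drug X')] == 4
```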
623 | ] 624 | }, 625 | { 626 | "cell_type": "code", 627 | "execution_count": 12, 628 | "metadata": {}, 629 | "outputs": [], 630 | "source": [ 631 | "groups = group_by_field(joined, ('post_code', 'bnf_name'))\n", 632 | "total_items_by_bnf_post = [(name, sum([group['items'] for group in groups[name]])) for name in groups.keys()]\n", 633 | "assert len(total_items_by_bnf_post) == 141196" 634 | ] 635 | }, 636 | { 637 | "cell_type": "code", 638 | "execution_count": 13, 639 | "metadata": {}, 640 | "outputs": [ 641 | { 642 | "data": { 643 | "text/plain": [ 644 | "[(('SK11 6JL', 'Co-Magaldrox_Susp 195mg/220mg/5ml S/F'), 5),\n", 645 | " (('SK11 6JL', 'Alginate_Raft-Forming Oral Susp S/F'), 3),\n", 646 | " (('SK11 6JL', 'Sod Algin/Pot Bicarb_Susp S/F'), 94),\n", 647 | " (('SK11 6JL', 'Sod Alginate/Pot Bicarb_Tab Chble 500mg'), 9),\n", 648 | " (('SK11 6JL', 'Gaviscon Infant_Sach 2g (Dual Pack) S/F'), 41),\n", 649 | " (('SK11 6JL', 'Gaviscon Advance_Liq (Aniseed) (Reckitt)'), 98),\n", 650 | " (('SK11 6JL', 'Gaviscon Advance_Tab Chble Mint(Reckitt)'), 16),\n", 651 | " (('SK11 6JL', 'Gaviscon Advance_Liq (Peppermint) S/F'), 65),\n", 652 | " (('SK11 6JL', 'Peptac_Liq (Peppermint) S/F'), 14),\n", 653 | " (('SK11 6JL', 'Alverine Cit_Cap 60mg'), 10)]" 654 | ] 655 | }, 656 | "execution_count": 13, 657 | "metadata": {}, 658 | "output_type": "execute_result" 659 | } 660 | ], 661 | "source": [ 662 | "total_items_by_bnf_post[:10]" 663 | ] 664 | }, 665 | { 666 | "cell_type": "markdown", 667 | "metadata": {}, 668 | "source": [ 669 | "Let's use `total_items` to find the maximum item total for each postal code. To do this, we will want to regroup `total_items_by_bnf_post` by `'post_code'` only, not by `('post_code', 'bnf_name')`. First let's turn `total_items` into a list of dictionaries (similar to `scripts` or `practices`) and then group it by `'post_code'`. You should have 118 groups in the resulting `total_items_by_post` after grouping `total_items` by `'post_code'`." 670 | ] 671 | }, 672 | { 673 | "cell_type": "code", 674 | "execution_count": 14, 675 | "metadata": {}, 676 | "outputs": [], 677 | "source": [ 678 | "total_items = [{'post_code':t[0][0], 'bnf_name':t[0][1], 'items': t[1]} for t in total_items_by_bnf_post ]\n", 679 | "total_items_by_post = group_by_field(total_items, ('post_code',))\n", 680 | "assert len(total_items_by_post) == 118" 681 | ] 682 | }, 683 | { 684 | "cell_type": "markdown", 685 | "metadata": {}, 686 | "source": [ 687 | "Now we will aggregate the groups in `total_items_by_post` to create `max_item_by_post`. Some `'bnf_name'` have the same item total within a given postal code. Therefore, if more than one `'bnf_name'` has the maximum item total in a given postal code, we'll take the alphabetically first `'bnf_name'`. We can do this by [sorting](https://docs.python.org/2.7/howto/sorting.html) each group according to the item total and `'bnf_name'`." 688 | ] 689 | }, 690 | { 691 | "cell_type": "code", 692 | "execution_count": 19, 693 | "metadata": {}, 694 | "outputs": [], 695 | "source": [ 696 | "max_item_by_post = {k:value for k in total_items_by_post.keys() \\\n", 697 | " for value in sorted(total_items_by_post[k], key=lambda x:(x['items'], x['bnf_name']))}" 698 | ] 699 | }, 700 | { 701 | "cell_type": "markdown", 702 | "metadata": {}, 703 | "source": [ 704 | "In order to express the item totals as a proportion of the total amount of items prescribed across all `'bnf_name'` in a postal code, we'll need to use the total items prescribed that we previously calculated as `items_by_post`. 
Calculate the proportions for the most common `'bnf_names'` for each postal code. Format your answer as a list of tuples: `[(post_code, bnf_name, total)]`" 705 | ] 706 | }, 707 | { 708 | "cell_type": "code", 709 | "execution_count": 34, 710 | "metadata": {}, 711 | "outputs": [], 712 | "source": [ 713 | "#sorted_max_item_by_post = sorted(list(max_item_by_post.items()))\n", 714 | "\n", 715 | "items_by_region = sorted([(max_item['post_code'], max_item['bnf_name'], max_item['items']/total_items[1]) \\\n", 716 | " for max_item,total_items in zip(list(max_item_by_post.values()), items_by_post)])[:100]" 717 | ] 718 | }, 719 | { 720 | "cell_type": "code", 721 | "execution_count": 35, 722 | "metadata": {}, 723 | "outputs": [ 724 | { 725 | "name": "stdout", 726 | "output_type": "stream", 727 | "text": [ 728 | "==================\n", 729 | "Your score: 1.0\n", 730 | "==================\n" 731 | ] 732 | } 733 | ], 734 | "source": [ 735 | "grader.score.pw__items_by_region(items_by_region)" 736 | ] 737 | }, 738 | { 739 | "cell_type": "markdown", 740 | "metadata": {}, 741 | "source": [ 742 | "*Copyright © 2020 The Data Incubator. All rights reserved.*" 743 | ] 744 | } 745 | ], 746 | "metadata": { 747 | "kernelspec": { 748 | "display_name": "Python 3", 749 | "language": "python", 750 | "name": "python3" 751 | }, 752 | "language_info": { 753 | "codemirror_mode": { 754 | "name": "ipython", 755 | "version": 3 756 | }, 757 | "file_extension": ".py", 758 | "mimetype": "text/x-python", 759 | "name": "python", 760 | "nbconvert_exporter": "python", 761 | "pygments_lexer": "ipython3", 762 | "version": "3.7.3" 763 | }, 764 | "nbclean": true 765 | }, 766 | "nbformat": 4, 767 | "nbformat_minor": 1 768 | } 769 | -------------------------------------------------------------------------------- /mp5-ml-machine-learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 120, 6 | "metadata": { 7 | "init_cell": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "%logstop\n", 12 | "%logstart -rtq ~/.logs/ml.py append\n", 13 | "import seaborn as sns\n", 14 | "sns.set()" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 17, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "from static_grader import grader" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "# ML Miniproject\n", 31 | "## Introduction\n", 32 | "\n", 33 | "The objective of this miniproject is to exercise your ability to create effective machine learning models for making predictions. We will be working with nursing home inspection data from the United States, predicting which providers may be fined and for how much.\n", 34 | "\n", 35 | "## Scoring\n", 36 | "\n", 37 | "In this miniproject you will often submit your model's `predict` or `predict_proba` method to the grader. The grader will assess the performance of your model using a scoring metric, comparing it against the score of a reference model. We will use the [average precision score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html). If your model performs better than the reference solution, then you can score higher than 1.0.\n", 38 | "\n", 39 | "**Note:** If you use an estimator that relies on random draws (like a `RandomForestClassifier`) you should set the `random_state=` to an integer so that your results are reproducible. 
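For reference, the metric itself is available in scikit-learn as `sklearn.metrics.average_precision_score`; it compares true binary labels against predicted probabilities. The arrays below are made up solely to illustrate the call:

```python
from sklearn.metrics import average_precision_score

# Invented labels and scores, purely to show the signature.
y_true = [0, 1, 1, 0, 1]
y_score = [0.5, 0.8, 0.4, 0.35, 0.9]

print(average_precision_score(y_true, y_score))  # ~0.92 for these toy values
```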
\n", 40 | "\n", 41 | "## Downloading the data\n", 42 | "\n", 43 | "We can download the data set from Amazon S3:" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 4, 49 | "metadata": {}, 50 | "outputs": [ 51 | { 52 | "name": "stderr", 53 | "output_type": "stream", 54 | "text": [ 55 | "mkdir: cannot create directory ‘data’: File exists\n", 56 | "--2021-03-20 21:06:08-- http://dataincubator-wqu.s3.amazonaws.com/mldata/providers-train.csv\n", 57 | "Resolving dataincubator-wqu.s3.amazonaws.com (dataincubator-wqu.s3.amazonaws.com)... 54.231.40.131\n", 58 | "Connecting to dataincubator-wqu.s3.amazonaws.com (dataincubator-wqu.s3.amazonaws.com)|54.231.40.131|:80... connected.\n", 59 | "HTTP request sent, awaiting response... 200 OK\n", 60 | "Length: 3398120 (3.2M) [text/csv]\n", 61 | "Saving to: ‘./ml-data/providers-train.csv’\n", 62 | "\n", 63 | " 0K .......... .......... .......... .......... .......... 1% 3.48M 1s\n", 64 | " 50K .......... .......... .......... .......... .......... 3% 7.25M 1s\n", 65 | " 100K .......... .......... .......... .......... .......... 4% 1.09M 1s\n", 66 | " 150K .......... .......... .......... .......... .......... 6% 58.6M 1s\n", 67 | " 200K .......... .......... .......... .......... .......... 7% 8.95M 1s\n", 68 | " 250K .......... .......... .......... .......... .......... 9% 37.3M 1s\n", 69 | " 300K .......... .......... .......... .......... .......... 10% 71.8M 1s\n", 70 | " 350K .......... .......... .......... .......... .......... 12% 77.5M 1s\n", 71 | " 400K .......... .......... .......... .......... .......... 13% 58.9M 0s\n", 72 | " 450K .......... .......... .......... .......... .......... 15% 15.6M 0s\n", 73 | " 500K .......... .......... .......... .......... .......... 16% 150M 0s\n", 74 | " 550K .......... .......... .......... .......... .......... 18% 134M 0s\n", 75 | " 600K .......... .......... .......... .......... .......... 19% 160M 0s\n", 76 | " 650K .......... .......... .......... .......... .......... 21% 141M 0s\n", 77 | " 700K .......... .......... .......... .......... .......... 22% 164M 0s\n", 78 | " 750K .......... .......... .......... .......... .......... 24% 439M 0s\n", 79 | " 800K .......... .......... .......... .......... .......... 25% 462M 0s\n", 80 | " 850K .......... .......... .......... .......... .......... 27% 385M 0s\n", 81 | " 900K .......... .......... .......... .......... .......... 28% 304M 0s\n", 82 | " 950K .......... .......... .......... .......... .......... 30% 9.13M 0s\n", 83 | " 1000K .......... .......... .......... .......... .......... 31% 27.9M 0s\n", 84 | " 1050K .......... .......... .......... .......... .......... 33% 46.2M 0s\n", 85 | " 1100K .......... .......... .......... .......... .......... 34% 75.2M 0s\n", 86 | " 1150K .......... .......... .......... .......... .......... 36% 78.8M 0s\n", 87 | " 1200K .......... .......... .......... .......... .......... 37% 50.0M 0s\n", 88 | " 1250K .......... .......... .......... .......... .......... 39% 28.7M 0s\n", 89 | " 1300K .......... .......... .......... .......... .......... 40% 63.1M 0s\n", 90 | " 1350K .......... .......... .......... .......... .......... 42% 81.8M 0s\n", 91 | " 1400K .......... .......... .......... .......... .......... 43% 81.6M 0s\n", 92 | " 1450K .......... .......... .......... .......... .......... 45% 76.5M 0s\n", 93 | " 1500K .......... .......... .......... .......... .......... 46% 426M 0s\n", 94 | " 1550K .......... .......... .......... .......... .......... 
48% 433M 0s\n", 95 | " 1600K .......... .......... .......... .......... .......... 49% 374M 0s\n", 96 | " 1650K .......... .......... .......... .......... .......... 51% 371M 0s\n", 97 | " 1700K .......... .......... .......... .......... .......... 52% 428M 0s\n", 98 | " 1750K .......... .......... .......... .......... .......... 54% 436M 0s\n", 99 | " 1800K .......... .......... .......... .......... .......... 55% 436M 0s\n", 100 | " 1850K .......... .......... .......... .......... .......... 57% 377M 0s\n", 101 | " 1900K .......... .......... .......... .......... .......... 58% 422M 0s\n", 102 | " 1950K .......... .......... .......... .......... .......... 60% 433M 0s\n", 103 | " 2000K .......... .......... .......... .......... .......... 61% 429M 0s\n", 104 | " 2050K .......... .......... .......... .......... .......... 63% 340M 0s\n", 105 | " 2100K .......... .......... .......... .......... .......... 64% 778K 0s\n", 106 | " 2150K .......... .......... .......... .......... .......... 66% 50.5M 0s\n", 107 | " 2200K .......... .......... .......... .......... .......... 67% 31.4M 0s\n", 108 | " 2250K .......... .......... .......... .......... .......... 69% 22.1M 0s\n", 109 | " 2300K .......... .......... .......... .......... .......... 70% 29.2M 0s\n", 110 | " 2350K .......... .......... .......... .......... .......... 72% 54.5M 0s\n", 111 | " 2400K .......... .......... .......... .......... .......... 73% 80.0M 0s\n", 112 | " 2450K .......... .......... .......... .......... .......... 75% 158M 0s\n", 113 | " 2500K .......... .......... .......... .......... .......... 76% 396M 0s\n", 114 | " 2550K .......... .......... .......... .......... .......... 78% 346M 0s\n", 115 | " 2600K .......... .......... .......... .......... .......... 79% 413M 0s\n", 116 | " 2650K .......... .......... .......... .......... .......... 81% 414M 0s\n", 117 | " 2700K .......... .......... .......... .......... .......... 82% 413M 0s\n", 118 | " 2750K .......... .......... .......... .......... .......... 84% 329M 0s\n", 119 | " 2800K .......... .......... .......... .......... .......... 85% 398M 0s\n", 120 | " 2850K .......... .......... .......... .......... .......... 87% 426M 0s\n", 121 | " 2900K .......... .......... .......... .......... .......... 88% 419M 0s\n", 122 | " 2950K .......... .......... .......... .......... .......... 90% 363M 0s\n", 123 | " 3000K .......... .......... .......... .......... .......... 91% 431M 0s\n", 124 | " 3050K .......... .......... .......... .......... .......... 93% 401M 0s\n", 125 | " 3100K .......... .......... .......... .......... .......... 94% 366M 0s\n", 126 | " 3150K .......... .......... .......... .......... .......... 96% 362M 0s\n", 127 | " 3200K .......... .......... .......... .......... .......... 97% 424M 0s\n", 128 | " 3250K .......... .......... .......... .......... .......... 99% 414M 0s\n", 129 | " 3300K .......... ........ 100% 431M=0.2s\n", 130 | "\n", 131 | "2021-03-20 21:06:08 (18.9 MB/s) - ‘./ml-data/providers-train.csv’ saved [3398120/3398120]\n", 132 | "\n", 133 | "--2021-03-20 21:06:08-- http://dataincubator-wqu.s3.amazonaws.com/mldata/providers-metadata.csv\n", 134 | "Resolving dataincubator-wqu.s3.amazonaws.com (dataincubator-wqu.s3.amazonaws.com)... 54.231.40.131\n", 135 | "Connecting to dataincubator-wqu.s3.amazonaws.com (dataincubator-wqu.s3.amazonaws.com)|54.231.40.131|:80... connected.\n", 136 | "HTTP request sent, awaiting response... 
200 OK\n", 137 | "Length: 9401 (9.2K) [text/csv]\n", 138 | "Saving to: ‘./ml-data/providers-metadata.csv’\n", 139 | "\n", 140 | " 0K ......... 100% 38.5M=0s\n", 141 | "\n", 142 | "2021-03-20 21:06:08 (38.5 MB/s) - ‘./ml-data/providers-metadata.csv’ saved [9401/9401]\n", 143 | "\n" 144 | ] 145 | } 146 | ], 147 | "source": [ 148 | "%%bash\n", 149 | "mkdir data\n", 150 | "wget http://dataincubator-wqu.s3.amazonaws.com/mldata/providers-train.csv -nc -P ./ml-data\n", 151 | "wget http://dataincubator-wqu.s3.amazonaws.com/mldata/providers-metadata.csv -nc -P ./ml-data" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "We'll load the data into a Pandas DataFrame. Several columns will become target labels in future questions. Let's pop those columns out from the data, and drop related columns that are neither targets nor reasonable features (i.e. we don't wouldn't know how many times a facility denied payment before knowing whether it was fined).\n", 159 | "\n", 160 | "The data has many columns. We have also provided a data dictionary." 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 18, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "import numpy as np\n", 170 | "import pandas as pd" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 19, 176 | "metadata": {}, 177 | "outputs": [ 178 | { 179 | "data": { 180 | "text/html": [ 181 | "
\n", 182 | "\n", 195 | "\n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | "
VariableLabelDescriptionFormat
0PROVNUMFederal Provider NumberFederal Provider Number6 alphanumeric characters
1PROVNAMEProvider NameProvider Nametext
2ADDRESSProvider AddressProvider Addresstext
3CITYProvider CityProvider Citytext
4STATEProvider StateProvider State2-character postal abbreviation
\n", 243 | "
" 244 | ], 245 | "text/plain": [ 246 | " Variable Label Description \\\n", 247 | "0 PROVNUM Federal Provider Number Federal Provider Number \n", 248 | "1 PROVNAME Provider Name Provider Name \n", 249 | "2 ADDRESS Provider Address Provider Address \n", 250 | "3 CITY Provider City Provider City \n", 251 | "4 STATE Provider State Provider State \n", 252 | "\n", 253 | " Format \n", 254 | "0 6 alphanumeric characters \n", 255 | "1 text \n", 256 | "2 text \n", 257 | "3 text \n", 258 | "4 2-character postal abbreviation " 259 | ] 260 | }, 261 | "execution_count": 19, 262 | "metadata": {}, 263 | "output_type": "execute_result" 264 | } 265 | ], 266 | "source": [ 267 | "metadata = pd.read_csv('./ml-data/providers-metadata.csv')\n", 268 | "metadata.head()" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": 20, 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [ 277 | "data = pd.read_csv('./ml-data/providers-train.csv', encoding='latin1')\n", 278 | "\n", 279 | "fine_counts = data.pop('FINE_CNT')\n", 280 | "fine_totals = data.pop('FINE_TOT')\n", 281 | "cycle_2_score = data.pop('CYCLE_2_TOTAL_SCORE')" 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": {}, 287 | "source": [ 288 | "## Question 1: state_model\n", 289 | "\n", 290 | "A federal agency, Centers for Medicare and Medicaid Services (CMS), imposes regulations on nursing homes. However, nursing homes are inspected by state agencies for compliance with regulations, and fines for violations can vary widely between states.\n", 291 | "\n", 292 | "Let's develop a very simple initial model to predict the amount of fines a nursing home might expect to pay based on its location. Fill in the class definition of the custom estimator, `StateMeanEstimator`, below.\n", 293 | "\n", 294 | "**Note:** When the grader checks your answer, it passes a list of dictionaries to the `predict` method of your estimator, not a DataFrame. This means that your model must work with both data types. You can handle this by adding a test (and optional conversion) in the `predict` method of your custom class, similar to the `ColumnSelectTransformer` given below in Question 2. " 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 56, 300 | "metadata": {}, 301 | "outputs": [], 302 | "source": [ 303 | "from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin\n", 304 | "\n", 305 | "class GroupMeanEstimator(BaseEstimator, RegressorMixin):\n", 306 | " def __init__(self, grouper):\n", 307 | " self.grouper = grouper\n", 308 | " self.group_averages = {}\n", 309 | "\n", 310 | " def fit(self, X, y):\n", 311 | " if not isinstance(X,pd.DataFrame):\n", 312 | " X = pd.DataFrame(X)\n", 313 | " # Use self.group_averages to store the average penalty by group\n", 314 | " self.group_averages = y.groupby(X[self.grouper]).mean().to_dict()\n", 315 | " self.group_averages['global_average'] = y.mean()\n", 316 | " return self\n", 317 | "\n", 318 | " def predict(self, X):\n", 319 | " if not isinstance(X,pd.DataFrame):\n", 320 | " X = pd.DataFrame(X)\n", 321 | " # Return a list of predicted penalties based on group of samples in X\n", 322 | " return [self.group_averages.get(group, self.group_averages['global_average']) for group in X[self.grouper]]" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "After filling in class definition, we can create an instance of the estimator and fit it to the data." 
330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": 57, 335 | "metadata": {}, 336 | "outputs": [ 337 | { 338 | "data": { 339 | "text/plain": [ 340 | "Pipeline(memory=None, steps=[('sme', GroupMeanEstimator(grouper='STATE'))],\n", 341 | " verbose=False)" 342 | ] 343 | }, 344 | "execution_count": 57, 345 | "metadata": {}, 346 | "output_type": "execute_result" 347 | } 348 | ], 349 | "source": [ 350 | "from sklearn.pipeline import Pipeline\n", 351 | "\n", 352 | "state_model = Pipeline([\n", 353 | " ('sme', GroupMeanEstimator(grouper='STATE'))\n", 354 | " ])\n", 355 | "state_model.fit(data, fine_totals)" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "Next we should test that our predict method works." 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": 58, 368 | "metadata": {}, 369 | "outputs": [ 370 | { 371 | "data": { 372 | "text/plain": [ 373 | "[8214.822977725675,\n", 374 | " 5611.819277108434,\n", 375 | " 11812.111111111111,\n", 376 | " 5626.954545454545,\n", 377 | " 8054.977611940299]" 378 | ] 379 | }, 380 | "execution_count": 58, 381 | "metadata": {}, 382 | "output_type": "execute_result" 383 | } 384 | ], 385 | "source": [ 386 | "state_model.predict(data.sample(5))" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "metadata": {}, 392 | "source": [ 393 | "However, what if we have data from a nursing home in a state (or territory) of the US which is not in the training data?" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": 59, 399 | "metadata": {}, 400 | "outputs": [ 401 | { 402 | "data": { 403 | "text/plain": [ 404 | "[14969.857687877915]" 405 | ] 406 | }, 407 | "execution_count": 59, 408 | "metadata": {}, 409 | "output_type": "execute_result" 410 | } 411 | ], 412 | "source": [ 413 | "state_model.predict(pd.DataFrame([{'STATE': 'AS'}]))" 414 | ] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "metadata": {}, 419 | "source": [ 420 | "Make sure your model can handle this possibility before submitting your model's predict method to the grader." 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": 60, 426 | "metadata": {}, 427 | "outputs": [ 428 | { 429 | "name": "stdout", 430 | "output_type": "stream", 431 | "text": [ 432 | "==================\n", 433 | "Your score: 0.9999999999999999\n", 434 | "==================\n" 435 | ] 436 | } 437 | ], 438 | "source": [ 439 | "grader.score.ml__state_model(state_model.predict)" 440 | ] 441 | }, 442 | { 443 | "cell_type": "markdown", 444 | "metadata": {}, 445 | "source": [ 446 | "## Question 2: simple_features_model\n", 447 | "\n", 448 | "Nursing homes vary greatly in their business characteristics. Some are owned by the government or non-profits while others are run for profit. Some house a few dozen residents while others house hundreds. Some are located within hospitals and may work with more vulnerable populations. We will try to predict which facilities are fined based on their business characteristics.\n", 449 | "\n", 450 | "We'll begin with columns in our DataFrame containing numeric and boolean features. Some of the rows contain null values; estimators cannot handle null values so these must be imputed or dropped. 
We will create a `Pipeline` containing transformers that process these features, followed by an estimator.\n", 451 | "\n", 452 | "**Note:** As mentioned above in Question 1, when the grader checks your answer, it passes a list of dictionaries to the `predict` or `predict_proba` method of your estimator, not a DataFrame. This means that your model must work with both data types. For this reason, we've provided a custom `ColumnSelectTransformer` for you to use instead of `scikit-learn`'s own `ColumnTransformer`." 453 | ] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "execution_count": 64, 458 | "metadata": {}, 459 | "outputs": [], 460 | "source": [ 461 | "from sklearn.impute import SimpleImputer\n", 462 | "\n", 463 | "simple_cols = ['BEDCERT', 'RESTOT', 'INHOSP', 'CCRC_FACIL', 'SFF', 'CHOW_LAST_12MOS', 'SPRINKLER_STATUS', 'EXP_TOTAL', 'ADJ_TOTAL']\n", 464 | "\n", 465 | "class ColumnSelectTransformer(BaseEstimator, TransformerMixin):\n", 466 | "    def __init__(self, columns):\n", 467 | "        self.columns = columns\n", 468 | "\n", 469 | "    def fit(self, X, y=None):\n", 470 | "        return self\n", 471 | "\n", 472 | "    def transform(self, X):\n", 473 | "        if not isinstance(X, pd.DataFrame):\n", 474 | "            X = pd.DataFrame(X)\n", 475 | "        return X[self.columns]\n", 476 | "    \n", 477 | "simple_features = Pipeline([\n", 478 | "    ('cst', ColumnSelectTransformer(simple_cols)),\n", 479 | "    ('imputer', SimpleImputer())\n", 480 | "])" 481 | ] 482 | }, 483 | { 484 | "cell_type": "markdown", 485 | "metadata": {}, 486 | "source": [ 487 | "**Note:** The assertion below assumes the output of `simple_features.fit_transform` is an `ndarray`, not a `DataFrame`." 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": 65, 493 | "metadata": {}, 494 | "outputs": [], 495 | "source": [ 496 | "assert data['RESTOT'].isnull().sum() > 0\n", 497 | "assert not np.isnan(simple_features.fit_transform(data)).any()" 498 | ] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": {}, 503 | "source": [ 504 | "Now combine the `simple_features` pipeline with an estimator in a new pipeline. Fit `simple_features_model` to the data and submit `simple_features_model.predict_proba` to the grader. You may wish to use cross-validation to tune the hyperparameters of your model."
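A lighter-weight check than a full hyperparameter search is a single cross-validated score for one candidate configuration. The sketch below assumes `simple_features`, `data`, and `fine_counts` from the cells above are in scope; the hyperparameter values are arbitrary starting points, not tuned results:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

candidate = Pipeline([
    ('simple', simple_features),
    ('estimator', RandomForestClassifier(n_estimators=100, min_samples_leaf=50,
                                         random_state=42))
])

# 5-fold cross-validated average precision for the fined / not-fined target
scores = cross_val_score(candidate, data, fine_counts > 0,
                         scoring='average_precision', cv=5)
print(scores.mean(), scores.std())
```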
505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": 81, 510 | "metadata": {}, 511 | "outputs": [ 512 | { 513 | "name": "stdout", 514 | "output_type": "stream", 515 | "text": [ 516 | "Fitting 5 folds for each of 50 candidates, totalling 250 fits\n" 517 | ] 518 | }, 519 | { 520 | "name": "stderr", 521 | "output_type": "stream", 522 | "text": [ 523 | "[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.\n", 524 | "[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 33.3s\n", 525 | "[Parallel(n_jobs=4)]: Done 192 tasks | elapsed: 2.1min\n", 526 | "[Parallel(n_jobs=4)]: Done 250 out of 250 | elapsed: 2.7min finished\n" 527 | ] 528 | }, 529 | { 530 | "data": { 531 | "text/plain": [ 532 | "{'estimator__n_estimators': 70, 'estimator__max_depth': 4}" 533 | ] 534 | }, 535 | "execution_count": 81, 536 | "metadata": {}, 537 | "output_type": "execute_result" 538 | } 539 | ], 540 | "source": [ 541 | "from sklearn.ensemble import RandomForestClassifier\n", 542 | "from sklearn.model_selection import GridSearchCV, RandomizedSearchCV\n", 543 | "\n", 544 | "param_grid = {'estimator__n_estimators': range(20, 100, 10),\n", 545 | " 'estimator__max_depth': range(3, 15, 1)}\n", 546 | "\n", 547 | "simple_features_model = Pipeline([\n", 548 | " ('simple', simple_features),\n", 549 | " ('estimator', RandomForestClassifier(n_estimators=60, max_depth = 5))\n", 550 | "])\n", 551 | "\n", 552 | "gs = RandomizedSearchCV(simple_features_model, param_distributions = param_grid, cv = 5, n_iter = 50, n_jobs = 4, verbose = 1)\n", 553 | "\n", 554 | "gs.fit(data, fine_counts > 0)\n", 555 | "\n", 556 | "gs.best_params_" 557 | ] 558 | }, 559 | { 560 | "cell_type": "code", 561 | "execution_count": 131, 562 | "metadata": {}, 563 | "outputs": [ 564 | { 565 | "data": { 566 | "text/plain": [ 567 | "Pipeline(memory=None,\n", 568 | " steps=[('simple',\n", 569 | " Pipeline(memory=None,\n", 570 | " steps=[('cst',\n", 571 | " ColumnSelectTransformer(columns=['BEDCERT',\n", 572 | " 'RESTOT',\n", 573 | " 'INHOSP',\n", 574 | " 'CCRC_FACIL',\n", 575 | " 'SFF',\n", 576 | " 'CHOW_LAST_12MOS',\n", 577 | " 'SPRINKLER_STATUS',\n", 578 | " 'EXP_TOTAL',\n", 579 | " 'ADJ_TOTAL'])),\n", 580 | " ('imputer',\n", 581 | " SimpleImputer(add_indicator=False, copy=True,\n", 582 | " fill_value=None,\n", 583 | " missing_values=nan,\n", 584 | " strategy='mean', verbose=0))],\n", 585 | " verbose=False)...\n", 586 | " RandomForestClassifier(bootstrap=True, class_weight=None,\n", 587 | " criterion='gini', max_depth=None,\n", 588 | " max_features='auto',\n", 589 | " max_leaf_nodes=None,\n", 590 | " min_impurity_decrease=0.0,\n", 591 | " min_impurity_split=None,\n", 592 | " min_samples_leaf=100,\n", 593 | " min_samples_split=2,\n", 594 | " min_weight_fraction_leaf=0.0,\n", 595 | " n_estimators=200, n_jobs=None,\n", 596 | " oob_score=False, random_state=None,\n", 597 | " verbose=0, warm_start=False))],\n", 598 | " verbose=False)" 599 | ] 600 | }, 601 | "execution_count": 131, 602 | "metadata": {}, 603 | "output_type": "execute_result" 604 | } 605 | ], 606 | "source": [ 607 | "simple_features_model = Pipeline([\n", 608 | " ('simple', simple_features),\n", 609 | " ('estimator', RandomForestClassifier(n_estimators=200, min_samples_leaf = 100))\n", 610 | "])\n", 611 | "\n", 612 | "simple_features_model.fit(data, fine_counts > 0)" 613 | ] 614 | }, 615 | { 616 | "cell_type": "code", 617 | "execution_count": 132, 618 | "metadata": {}, 619 | "outputs": [ 620 | { 621 | "name": "stdout", 622 | "output_type": "stream", 623 | "text": [ 624 
| "==================\n", 625 | "Your score: 1.008880739222664\n", 626 | "==================\n" 627 | ] 628 | } 629 | ], 630 | "source": [ 631 | "def positive_probability(model):\n", 632 | " def predict_proba(X):\n", 633 | " return model.predict_proba(X)[:, 1]\n", 634 | " return predict_proba\n", 635 | "\n", 636 | "grader.score.ml__simple_features(positive_probability(simple_features_model))" 637 | ] 638 | }, 639 | { 640 | "cell_type": "markdown", 641 | "metadata": {}, 642 | "source": [ 643 | "## Question 3: categorical_features" 644 | ] 645 | }, 646 | { 647 | "cell_type": "markdown", 648 | "metadata": {}, 649 | "source": [ 650 | "The `'OWNERSHIP'` and `'CERTIFICATION'` columns contain categorical data. We will have to encode the categorical data into numerical features before we pass them to an estimator. Construct one or more pipelines for this purpose. Transformers such as [LabelEncoder](https://scikit-learn.org/0.19/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder) and [OneHotEncoder](https://scikit-learn.org/0.19/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) may be useful, but you may also want to define your own transformers.\n", 651 | "\n", 652 | "If you used more than one `Pipeline`, combine them with a `FeatureUnion`. As in Question 2, we will combine this with an estimator, fit it, and submit the `predict_proba` method to the grader." 653 | ] 654 | }, 655 | { 656 | "cell_type": "code", 657 | "execution_count": 92, 658 | "metadata": {}, 659 | "outputs": [], 660 | "source": [ 661 | "from sklearn.pipeline import FeatureUnion\n", 662 | "from sklearn.preprocessing import OneHotEncoder\n", 663 | "\n", 664 | "owner_onehot = Pipeline([\n", 665 | " ('cst', ColumnSelectTransformer(['OWNERSHIP'])),\n", 666 | " ('ohe', OneHotEncoder (categories = 'auto', sparse = False))\n", 667 | "])\n", 668 | "\n", 669 | "cert_onehot = Pipeline([\n", 670 | " ('cst', ColumnSelectTransformer(['CERTIFICATION'])),\n", 671 | " ('ohe', OneHotEncoder (categories = 'auto', sparse = False))\n", 672 | "])\n", 673 | "\n", 674 | "categorical_features = FeatureUnion([\n", 675 | " ('owner', owner_onehot),\n", 676 | " ('cert', cert_onehot)\n", 677 | "])" 678 | ] 679 | }, 680 | { 681 | "cell_type": "code", 682 | "execution_count": 93, 683 | "metadata": {}, 684 | "outputs": [], 685 | "source": [ 686 | "assert categorical_features.fit_transform(data).shape[0] == data.shape[0]\n", 687 | "assert categorical_features.fit_transform(data).dtype == np.float64\n", 688 | "assert not np.isnan(categorical_features.fit_transform(data)).any()" 689 | ] 690 | }, 691 | { 692 | "cell_type": "markdown", 693 | "metadata": {}, 694 | "source": [ 695 | "As in the previous question, create a model using the `categorical_features`, fit it to the data, and submit its `predict_proba` method to the grader." 
696 | ] 697 | }, 698 | { 699 | "cell_type": "code", 700 | "execution_count": 128, 701 | "metadata": {}, 702 | "outputs": [], 703 | "source": [ 704 | "categorical_features_model = Pipeline([\n", 705 | " ('categorical', categorical_features),\n", 706 | " ('estimator', RandomForestClassifier(n_estimators=200, min_samples_leaf = 100))\n", 707 | "])" 708 | ] 709 | }, 710 | { 711 | "cell_type": "code", 712 | "execution_count": 129, 713 | "metadata": {}, 714 | "outputs": [ 715 | { 716 | "data": { 717 | "text/plain": [ 718 | "Pipeline(memory=None,\n", 719 | " steps=[('categorical',\n", 720 | " FeatureUnion(n_jobs=None,\n", 721 | " transformer_list=[('owner',\n", 722 | " Pipeline(memory=None,\n", 723 | " steps=[('cst',\n", 724 | " ColumnSelectTransformer(columns=['OWNERSHIP'])),\n", 725 | " ('ohe',\n", 726 | " OneHotEncoder(categorical_features=None,\n", 727 | " categories='auto',\n", 728 | " drop=None,\n", 729 | " dtype=,\n", 730 | " handle_unknown='error',\n", 731 | " n_values=None,\n", 732 | " sparse=False))],\n", 733 | " verbose=False))...\n", 734 | " RandomForestClassifier(bootstrap=True, class_weight=None,\n", 735 | " criterion='gini', max_depth=None,\n", 736 | " max_features='auto',\n", 737 | " max_leaf_nodes=None,\n", 738 | " min_impurity_decrease=0.0,\n", 739 | " min_impurity_split=None,\n", 740 | " min_samples_leaf=100,\n", 741 | " min_samples_split=2,\n", 742 | " min_weight_fraction_leaf=0.0,\n", 743 | " n_estimators=200, n_jobs=None,\n", 744 | " oob_score=False, random_state=None,\n", 745 | " verbose=0, warm_start=False))],\n", 746 | " verbose=False)" 747 | ] 748 | }, 749 | "execution_count": 129, 750 | "metadata": {}, 751 | "output_type": "execute_result" 752 | } 753 | ], 754 | "source": [ 755 | "categorical_features_model.fit(data, fine_counts > 0)" 756 | ] 757 | }, 758 | { 759 | "cell_type": "code", 760 | "execution_count": 130, 761 | "metadata": {}, 762 | "outputs": [ 763 | { 764 | "name": "stdout", 765 | "output_type": "stream", 766 | "text": [ 767 | "==================\n", 768 | "Your score: 1.0035144150583772\n", 769 | "==================\n" 770 | ] 771 | } 772 | ], 773 | "source": [ 774 | "grader.score.ml__categorical_features(positive_probability(categorical_features_model))" 775 | ] 776 | }, 777 | { 778 | "cell_type": "markdown", 779 | "metadata": {}, 780 | "source": [ 781 | "## Question 4: business_model" 782 | ] 783 | }, 784 | { 785 | "cell_type": "markdown", 786 | "metadata": {}, 787 | "source": [ 788 | "Finally, we'll combine `simple_features` and `categorical_features` in a `FeatureUnion`, followed by an estimator in a `Pipeline`. You may want to optimize the hyperparameters of your estimator using cross-validation or try engineering new features (e.g. see [PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)). When you've assembled and trained your model, pass the `predict_proba` method to the grader." 
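To make the `PolynomialFeatures` suggestion concrete, here is what it generates for a single made-up row with two columns: degree 2 adds a bias column, the squares, and the pairwise product, while `interaction_only=True` keeps just the cross-term.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

toy = np.array([[2.0, 3.0]])

print(PolynomialFeatures(2).fit_transform(toy))
# [[1. 2. 3. 4. 6. 9.]]  -> columns: 1, a, b, a^2, a*b, b^2

print(PolynomialFeatures(2, interaction_only=True).fit_transform(toy))
# [[1. 2. 3. 6.]]        -> columns: 1, a, b, a*b
```

With the one-hot columns included, higher degrees grow the feature count quickly, which is worth keeping in mind when choosing the degree in the model below.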
789 | ] 790 | }, 791 | { 792 | "cell_type": "code", 793 | "execution_count": 104, 794 | "metadata": {}, 795 | "outputs": [], 796 | "source": [ 797 | "from sklearn.preprocessing import PolynomialFeatures\n", 798 | "from sklearn.linear_model import LogisticRegression" 799 | ] 800 | }, 801 | { 802 | "cell_type": "code", 803 | "execution_count": 105, 804 | "metadata": {}, 805 | "outputs": [], 806 | "source": [ 807 | "business_features = FeatureUnion([\n", 808 | " ('simple', simple_features),\n", 809 | " ('categorical', categorical_features)\n", 810 | "])" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": 125, 816 | "metadata": {}, 817 | "outputs": [], 818 | "source": [ 819 | "business_model = Pipeline([\n", 820 | " ('features', business_features),\n", 821 | " ('ploy', PolynomialFeatures(3)),\n", 822 | " ('estimator', RandomForestClassifier(n_estimators=200, min_samples_leaf = 100))\n", 823 | " \n", 824 | "])" 825 | ] 826 | }, 827 | { 828 | "cell_type": "code", 829 | "execution_count": 126, 830 | "metadata": {}, 831 | "outputs": [ 832 | { 833 | "data": { 834 | "text/plain": [ 835 | "Pipeline(memory=None,\n", 836 | " steps=[('features',\n", 837 | " FeatureUnion(n_jobs=None,\n", 838 | " transformer_list=[('simple',\n", 839 | " Pipeline(memory=None,\n", 840 | " steps=[('cst',\n", 841 | " ColumnSelectTransformer(columns=['BEDCERT',\n", 842 | " 'RESTOT',\n", 843 | " 'INHOSP',\n", 844 | " 'CCRC_FACIL',\n", 845 | " 'SFF',\n", 846 | " 'CHOW_LAST_12MOS',\n", 847 | " 'SPRINKLER_STATUS',\n", 848 | " 'EXP_TOTAL',\n", 849 | " 'ADJ_TOTAL'])),\n", 850 | " ('imputer',\n", 851 | " SimpleImputer(add_indicator=False,\n", 852 | " copy=True,\n", 853 | " fill_value=None,\n", 854 | " missing...\n", 855 | " RandomForestClassifier(bootstrap=True, class_weight=None,\n", 856 | " criterion='gini', max_depth=None,\n", 857 | " max_features='auto',\n", 858 | " max_leaf_nodes=None,\n", 859 | " min_impurity_decrease=0.0,\n", 860 | " min_impurity_split=None,\n", 861 | " min_samples_leaf=100,\n", 862 | " min_samples_split=2,\n", 863 | " min_weight_fraction_leaf=0.0,\n", 864 | " n_estimators=200, n_jobs=None,\n", 865 | " oob_score=False, random_state=None,\n", 866 | " verbose=0, warm_start=False))],\n", 867 | " verbose=False)" 868 | ] 869 | }, 870 | "execution_count": 126, 871 | "metadata": {}, 872 | "output_type": "execute_result" 873 | } 874 | ], 875 | "source": [ 876 | "business_model.fit(data, fine_counts > 0)" 877 | ] 878 | }, 879 | { 880 | "cell_type": "code", 881 | "execution_count": 127, 882 | "metadata": {}, 883 | "outputs": [ 884 | { 885 | "name": "stdout", 886 | "output_type": "stream", 887 | "text": [ 888 | "==================\n", 889 | "Your score: 0.9955474877555909\n", 890 | "==================\n" 891 | ] 892 | } 893 | ], 894 | "source": [ 895 | "grader.score.ml__business_model(positive_probability(business_model))" 896 | ] 897 | }, 898 | { 899 | "cell_type": "markdown", 900 | "metadata": {}, 901 | "source": [ 902 | "## Question 5: survey_results" 903 | ] 904 | }, 905 | { 906 | "cell_type": "markdown", 907 | "metadata": {}, 908 | "source": [ 909 | "Surveys reveal safety and health deficiencies at nursing homes that may indicate risk for incidents (and penalties). CMS routinely makes surveys of nursing homes. 
Build a model that combines the `business_features` of each facility with its cycle 1 survey results, as well as the time between the cycle 1 and cycle 2 survey to predict the cycle 2 total score.\n", 910 | "\n", 911 | "First, let's create a transformer to calculate the difference in time between the cycle 1 and cycle 2 surveys." 912 | ] 913 | }, 914 | { 915 | "cell_type": "code", 916 | "execution_count": 117, 917 | "metadata": {}, 918 | "outputs": [], 919 | "source": [ 920 | "class TimedeltaTransformer(BaseEstimator, TransformerMixin):\n", 921 | " def __init__(self, t1_col, t2_col):\n", 922 | " self.t1_col = t1_col\n", 923 | " self.t2_col = t2_col\n", 924 | "\n", 925 | " def fit(self, X, y=None):\n", 926 | " return self\n", 927 | "\n", 928 | " def transform(self, X):\n", 929 | " if not isinstance(X,pd.DataFrame):\n", 930 | " X = pd.DataFrame(X)\n", 931 | " return (pd.to_datetime( X[self.t2_col])\n", 932 | " -pd.to_datetime( X[self.t1_col])).values.reshape(-1,1).astype(float)" 933 | ] 934 | }, 935 | { 936 | "cell_type": "code", 937 | "execution_count": 118, 938 | "metadata": {}, 939 | "outputs": [], 940 | "source": [ 941 | "cycle_1_date = 'CYCLE_1_SURVEY_DATE'\n", 942 | "cycle_2_date = 'CYCLE_2_SURVEY_DATE'\n", 943 | "time_feature = TimedeltaTransformer(cycle_1_date, cycle_2_date)" 944 | ] 945 | }, 946 | { 947 | "cell_type": "markdown", 948 | "metadata": {}, 949 | "source": [ 950 | "In the cell below we'll collect the cycle 1 survey features." 951 | ] 952 | }, 953 | { 954 | "cell_type": "code", 955 | "execution_count": 119, 956 | "metadata": {}, 957 | "outputs": [], 958 | "source": [ 959 | "cycle_1_cols = ['CYCLE_1_DEFS', 'CYCLE_1_NFROMDEFS', 'CYCLE_1_NFROMCOMP',\n", 960 | " 'CYCLE_1_DEFS_SCORE', 'CYCLE_1_NUMREVIS',\n", 961 | " 'CYCLE_1_REVISIT_SCORE', 'CYCLE_1_TOTAL_SCORE']\n", 962 | "cycle_1_features = ColumnSelectTransformer(cycle_1_cols)" 963 | ] 964 | }, 965 | { 966 | "cell_type": "code", 967 | "execution_count": 121, 968 | "metadata": {}, 969 | "outputs": [], 970 | "source": [ 971 | "from sklearn.ensemble import RandomForestRegressor\n", 972 | "from sklearn.decomposition import TruncatedSVD\n", 973 | "\n", 974 | "survey_model = Pipeline([\n", 975 | " ('features', FeatureUnion([\n", 976 | " ('business', business_features),\n", 977 | " ('survey', cycle_1_features),\n", 978 | " ('time', time_feature)\n", 979 | " ])),\n", 980 | " (\"poly\", PolynomialFeatures(2)),\n", 981 | " (\"svd\", TruncatedSVD(20) ),\n", 982 | " (\"classifier\", RandomForestRegressor(n_estimators=200, min_samples_leaf = 100))\n", 983 | "])" 984 | ] 985 | }, 986 | { 987 | "cell_type": "code", 988 | "execution_count": 122, 989 | "metadata": {}, 990 | "outputs": [ 991 | { 992 | "data": { 993 | "text/plain": [ 994 | "Pipeline(memory=None,\n", 995 | " steps=[('features',\n", 996 | " FeatureUnion(n_jobs=None,\n", 997 | " transformer_list=[('business',\n", 998 | " FeatureUnion(n_jobs=None,\n", 999 | " transformer_list=[('simple',\n", 1000 | " Pipeline(memory=None,\n", 1001 | " steps=[('cst',\n", 1002 | " ColumnSelectTransformer(columns=['BEDCERT',\n", 1003 | " 'RESTOT',\n", 1004 | " 'INHOSP',\n", 1005 | " 'CCRC_FACIL',\n", 1006 | " 'SFF',\n", 1007 | " 'CHOW_LAST_12MOS',\n", 1008 | " 'SPRINKLER_STATUS',\n", 1009 | " 'EXP_TOTAL',\n", 1010 | " 'ADJ_TOTAL'])),\n", 1011 | " ('imputer',\n", 1012 | " SimpleImpute...\n", 1013 | " random_state=None, tol=0.0)),\n", 1014 | " ('classifier',\n", 1015 | " RandomForestRegressor(bootstrap=True, criterion='mse',\n", 1016 | " max_depth=None, max_features='auto',\n", 1017 | " 
max_leaf_nodes=None,\n", 1018 | " min_impurity_decrease=0.0,\n", 1019 | " min_impurity_split=None,\n", 1020 | " min_samples_leaf=100,\n", 1021 | " min_samples_split=2,\n", 1022 | " min_weight_fraction_leaf=0.0,\n", 1023 | " n_estimators=200, n_jobs=None,\n", 1024 | " oob_score=False, random_state=None,\n", 1025 | " verbose=0, warm_start=False))],\n", 1026 | " verbose=False)" 1027 | ] 1028 | }, 1029 | "execution_count": 122, 1030 | "metadata": {}, 1031 | "output_type": "execute_result" 1032 | } 1033 | ], 1034 | "source": [ 1035 | "survey_model.fit(data, cycle_2_score.astype(int))" 1036 | ] 1037 | }, 1038 | { 1039 | "cell_type": "code", 1040 | "execution_count": 123, 1041 | "metadata": {}, 1042 | "outputs": [ 1043 | { 1044 | "name": "stdout", 1045 | "output_type": "stream", 1046 | "text": [ 1047 | "==================\n", 1048 | "Your score: 1.1764735621658964\n", 1049 | "==================\n" 1050 | ] 1051 | } 1052 | ], 1053 | "source": [ 1054 | "grader.score.ml__survey_model(survey_model.predict)" 1055 | ] 1056 | }, 1057 | { 1058 | "cell_type": "markdown", 1059 | "metadata": {}, 1060 | "source": [ 1061 | "*Copyright © 2021 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.*" 1062 | ] 1063 | } 1064 | ], 1065 | "metadata": { 1066 | "kernelspec": { 1067 | "display_name": "Python 3", 1068 | "language": "python", 1069 | "name": "python3" 1070 | }, 1071 | "language_info": { 1072 | "codemirror_mode": { 1073 | "name": "ipython", 1074 | "version": 3 1075 | }, 1076 | "file_extension": ".py", 1077 | "mimetype": "text/x-python", 1078 | "name": "python", 1079 | "nbconvert_exporter": "python", 1080 | "pygments_lexer": "ipython3", 1081 | "version": "3.7.3" 1082 | }, 1083 | "nbclean": true 1084 | }, 1085 | "nbformat": 4, 1086 | "nbformat_minor": 1 1087 | } 1088 | -------------------------------------------------------------------------------- /mp6-nlp-natural-language-processing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 34, 6 | "metadata": { 7 | "init_cell": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "%logstop\n", 12 | "%logstart -rtq ~/.logs/nlp.py append\n", 13 | "import seaborn as sns\n", 14 | "sns.set()" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 7, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "from static_grader import grader" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 8, 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "from IPython.display import clear_output\n", 33 | "\n", 34 | "import warnings\n", 35 | "warnings.filterwarnings('ignore')" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "# NLP Miniproject" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "## Introduction\n", 50 | "\n", 51 | "The objective of this miniproject is to gain experience with natural language processing and how to use text data to train a machine learning model to make predictions. For the miniproject, we will be working with product review text from Amazon. The reviews are for only products in the \"Electronics\" category. 
The objective is to train a model to predict the rating, ranging from 1 to 5 stars.\n", 52 | "\n", 53 | "## Scoring\n", 54 | "\n", 55 | "For most of the questions, you will be asked to submit the `predict` method of your trained model to the grader. The grader will use the passed `predict` method to evaluate how your model performs on a test set with respect to a reference model. The grader uses the [R2-score](https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score) for model evaluation. If your model performs better than the reference solution, then you can score higher than 1.0. For the last question, you will submit the results of an analysis and your passed answer will be compared directly to the reference solution.\n", 56 | "\n", 57 | "## Downloading and loading the data\n", 58 | "\n", 59 | "The data set is available on Amazon S3 and comes as a compressed file where each line is a JSON object. To load the data set, we will need to use the `gzip` library to open the file and decode each JSON into a Python dictionary. In the end, we have a list of dictionaries, where each dictionary represents an observation." 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 10, 65 | "metadata": {}, 66 | "outputs": [ 67 | { 68 | "name": "stderr", 69 | "output_type": "stream", 70 | "text": [ 71 | "--2021-03-20 19:59:29-- http://dataincubator-wqu.s3.amazonaws.com/mldata/amazon_electronics_reviews_training.json.gz\n", 72 | "Resolving dataincubator-wqu.s3.amazonaws.com (dataincubator-wqu.s3.amazonaws.com)... 52.216.227.24\n", 73 | "Connecting to dataincubator-wqu.s3.amazonaws.com (dataincubator-wqu.s3.amazonaws.com)|52.216.227.24|:80... connected.\n", 74 | "HTTP request sent, awaiting response... 200 OK\n", 75 | "Length: 18357730 (18M) [application/x-gzip]\n", 76 | "Saving to: ‘./data/amazon_electronics_reviews_training.json.gz’\n", 77 | "\n", 78 | " 0K .......... .......... .......... .......... .......... 0% 3.41M 5s\n", 79 | " 50K .......... .......... .......... .......... .......... 0% 6.86M 4s\n", 80 | " 100K .......... .......... .......... .......... .......... 0% 6.71M 3s\n", 81 | " 150K .......... .......... .......... .......... .......... 1% 67.0M 3s\n", 82 | " 200K .......... .......... .......... .......... .......... 1% 7.63M 3s\n", 83 | " 250K .......... .......... .......... .......... .......... 1% 53.6M 2s\n", 84 | " 300K .......... .......... .......... .......... .......... 1% 70.7M 2s\n", 85 | " 350K .......... .......... .......... .......... .......... 2% 76.5M 2s\n", 86 | " 400K .......... .......... .......... .......... .......... 2% 82.7M 2s\n", 87 | " 450K .......... .......... .......... .......... .......... 2% 11.4M 1s\n", 88 | " 500K .......... .......... .......... .......... .......... 3% 86.9M 1s\n", 89 | " 550K .......... .......... .......... .......... .......... 3% 331M 1s\n", 90 | " 600K .......... .......... .......... .......... .......... 3% 333M 1s\n", 91 | " 650K .......... .......... .......... .......... .......... 3% 16.4M 1s\n", 92 | " 700K .......... .......... .......... .......... .......... 4% 55.5M 1s\n", 93 | " 750K .......... .......... .......... .......... .......... 4% 66.4M 1s\n", 94 | " 800K .......... .......... .......... .......... .......... 4% 130M 1s\n", 95 | " 850K .......... .......... .......... .......... .......... 5% 418M 1s\n", 96 | " 900K .......... .......... .......... .......... .......... 5% 48.3M 1s\n", 97 | " 950K .......... .......... .......... .......... .......... 
5% 50.0M 1s\n", 98 | " [... wget progress output omitted ...]\n", 427 | " 17450K .......... .......... .......... .......... .......... 
97% 425M 0s\n", 428 | " 17500K .......... .......... .......... .......... .......... 97% 374M 0s\n", 429 | " 17550K .......... .......... .......... .......... .......... 98% 422M 0s\n", 430 | " 17600K .......... .......... .......... .......... .......... 98% 435M 0s\n", 431 | " 17650K .......... .......... .......... .......... .......... 98% 422M 0s\n", 432 | " 17700K .......... .......... .......... .......... .......... 99% 381M 0s\n", 433 | " 17750K .......... .......... .......... .......... .......... 99% 441M 0s\n", 434 | " 17800K .......... .......... .......... .......... .......... 99% 414M 0s\n", 435 | " 17850K .......... .......... .......... .......... .......... 99% 435M 0s\n", 436 | " 17900K .......... .......... ....... 100% 337M=0.2s\n", 437 | "\n", 438 | "2021-03-20 19:59:29 (84.4 MB/s) - ‘./data/amazon_electronics_reviews_training.json.gz’ saved [18357730/18357730]\n", 439 | "\n", 440 | "bash: line 5: syntax error: unexpected end of file\n" 441 | ] 442 | } 443 | ], 444 | "source": [ 445 | "%%bash\n", 446 | "mkdir data\n", 447 | "wget http://dataincubator-wqu.s3.amazonaws.com/mldata/amazon_electronics_reviews_training.json.gz -nc -P ./data" 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": 11, 453 | "metadata": {}, 454 | "outputs": [], 455 | "source": [ 456 | "import gzip\n", 457 | "import ujson as json\n", 458 | "from pandas.io.json import json_normalize\n", 459 | "\n", 460 | "with gzip.open(\"data/amazon_electronics_reviews_training.json.gz\", \"r\") as f: \n", 461 | " data = [json.loads(line) for line in f]" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": 12, 467 | "metadata": {}, 468 | "outputs": [ 469 | { 470 | "data": { 471 | "text/html": [ 472 | "
\n", 473 | "\n", 486 | "\n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | "
reviewerIDasinreviewerNamehelpfulreviewTextoverallsummaryunixReviewTimereviewTime
0A238V1XTSK9NFEB00004VX3TAndrew Lynn[2, 2]I bought this mouse to use with my laptop beca...5.0Excellent mouse for laptop users100794240012 10, 2001
\n", 516 | "
" 517 | ], 518 | "text/plain": [ 519 | " reviewerID asin reviewerName helpful \\\n", 520 | "0 A238V1XTSK9NFE B00004VX3T Andrew Lynn [2, 2] \n", 521 | "\n", 522 | " reviewText overall \\\n", 523 | "0 I bought this mouse to use with my laptop beca... 5.0 \n", 524 | "\n", 525 | " summary unixReviewTime reviewTime \n", 526 | "0 Excellent mouse for laptop users 1007942400 12 10, 2001 " 527 | ] 528 | }, 529 | "execution_count": 12, 530 | "metadata": {}, 531 | "output_type": "execute_result" 532 | } 533 | ], 534 | "source": [ 535 | "json_normalize(data[0])" 536 | ] 537 | }, 538 | { 539 | "cell_type": "markdown", 540 | "metadata": {}, 541 | "source": [ 542 | "The ratings are stored under the key `\"overall\"`. You should create an array of the ratings for each review, preferably using a list comprehension." 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": 13, 548 | "metadata": {}, 549 | "outputs": [], 550 | "source": [ 551 | "ratings = [review['overall'] for review in data]" 552 | ] 553 | }, 554 | { 555 | "cell_type": "code", 556 | "execution_count": 14, 557 | "metadata": {}, 558 | "outputs": [ 559 | { 560 | "data": { 561 | "text/plain": [ 562 | "[5.0, 1.0, 4.0, 5.0, 3.0]" 563 | ] 564 | }, 565 | "execution_count": 14, 566 | "metadata": {}, 567 | "output_type": "execute_result" 568 | } 569 | ], 570 | "source": [ 571 | "ratings[:5]" 572 | ] 573 | }, 574 | { 575 | "cell_type": "markdown", 576 | "metadata": {}, 577 | "source": [ 578 | "**Note:** the test set used by the grader is in the same format as `data`, a list of dictionaries. Your trained model needs to accept data in the same format. Thus, you should use `Pipeline` when constructing your model so that all necessary transformations are encapsulated in a single estimator object." 579 | ] 580 | }, 581 | { 582 | "cell_type": "markdown", 583 | "metadata": {}, 584 | "source": [ 585 | "## Question 1: Bag of words model\n", 586 | "\n", 587 | "Construct a machine learning model trained on word counts using the bag of words algorithm. Remember, the bag of words is implemented with `CountVectorizer`. Some things you should consider:\n", 588 | "\n", 589 | "* The reference solution uses a linear model and you should as well; use either `Ridge` or `SGDRegressor`.\n", 590 | "* The text review is stored in the key `\"reviewText\"`. You will need to construct a custom transformer to extract the value of this key. It will be the first step in your pipeline.\n", 591 | "* Consider what hyperparameters you will need to tune for your model.\n", 592 | "* Subsampling the training data will shorten training times, which is helpful when searching for the best hyperparameters. Note that your final model will perform best if it is trained on the full data set.\n", 593 | "* Removing stop words may improve performance (a rough tuning sketch follows below)."
594 | ] 595 | }, 596 | { 597 | "cell_type": "code", 598 | "execution_count": 15, 599 | "metadata": {}, 600 | "outputs": [], 601 | "source": [ 602 | "from sklearn.base import BaseEstimator, TransformerMixin\n", 603 | "\n", 604 | "class KeySelector(BaseEstimator, TransformerMixin):\n", 605 | " def __init__(self, key):\n", 606 | " self.key = key\n", 607 | " \n", 608 | " def fit(self, X, y=None):\n", 609 | " return self \n", 610 | " \n", 611 | " def transform(self, X):\n", 612 | " return [row[self.key] for row in X]  # pull the value stored under key out of every review dict" 613 | ] 614 | }, 615 | { 616 | "cell_type": "code", 617 | "execution_count": 16, 618 | "metadata": {}, 619 | "outputs": [], 620 | "source": [ 621 | "from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer, TfidfTransformer\n", 622 | "from sklearn.pipeline import Pipeline\n", 623 | "from sklearn.linear_model import Ridge, SGDRegressor\n", 624 | "from spacy.lang.en import STOP_WORDS" 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": 73, 630 | "metadata": {}, 631 | "outputs": [ 632 | { 633 | "data": { 634 | "text/plain": [ 635 | "Pipeline(memory=None,\n", 636 | " steps=[('selector', KeySelector(key='reviewText')),\n", 637 | " ('vectorizer',\n", 638 | " CountVectorizer(analyzer='word', binary=False,\n", 639 | " decode_error='strict',\n", 640 | " dtype=, encoding='utf-8',\n", 641 | " input='content', lowercase=True, max_df=1.0,\n", 642 | " max_features=None, min_df=1,\n", 643 | " ngram_range=(1, 1), preprocessor=None,\n", 644 | " stop_words=None, strip_accents=None,\n", 645 | " token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", 646 | " tokenizer=None, vocabulary=None)),\n", 647 | " ('regressor',\n", 648 | " Ridge(alpha=1.0, copy_X=True, fit_intercept=True,\n", 649 | " max_iter=None, normalize=False, random_state=None,\n", 650 | " solver='auto', tol=0.001))],\n", 651 | " verbose=False)" 652 | ] 653 | }, 654 | "execution_count": 73, 655 | "metadata": {}, 656 | "output_type": "execute_result" 657 | } 658 | ], 659 | "source": [ 660 | "bag_of_words_model = Pipeline([\n", 661 | " ('selector', KeySelector('reviewText')),\n", 662 | " ('vectorizer', CountVectorizer()),\n", 663 | " ('regressor', Ridge())\n", 664 | "])\n", 665 | "\n", 666 | "bag_of_words_model.fit(data, ratings)" 667 | ] 668 | }, 669 | { 670 | "cell_type": "code", 671 | "execution_count": 74, 672 | "metadata": {}, 673 | "outputs": [ 674 | { 675 | "name": "stdout", 676 | "output_type": "stream", 677 | "text": [ 678 | "==================\n", 679 | "Your score: 1.0503909927619204\n", 680 | "==================\n" 681 | ] 682 | } 683 | ], 684 | "source": [ 685 | "grader.score.nlp__bag_of_words_model(bag_of_words_model.predict)" 686 | ] 687 | }, 688 | { 689 | "cell_type": "markdown", 690 | "metadata": {}, 691 | "source": [ 692 | "## Question 2: Normalized model\n", 693 | "\n", 694 | "Using raw counts will not be as effective as using normalized counts. There are several ways to normalize raw counts: the `HashingVectorizer` class has the keyword `norm`, and the `TfidfTransformer` and `TfidfVectorizer` classes perform tf-idf weighting on the counts. Apply normalization to your model to improve performance; a brief tf-idf sketch follows below."
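As one way to apply the normalization described above, the sketch below swaps the raw counts for tf-idf weighted counts; it is a rough variant rather than the reference solution and assumes `KeySelector`, `data`, and `ratings` from the cells above are available. The solution that follows relies instead on `HashingVectorizer`, whose default `norm='l2'` already normalizes each review vector; chaining `CountVectorizer` with `TfidfTransformer` would be a third, equivalent route.

```python
# Hypothetical tf-idf variant of the bag of words model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

tfidf_model = Pipeline([
    ("selector", KeySelector("reviewText")),  # same custom transformer as above
    ("vectorizer", TfidfVectorizer()),        # counts reweighted by idf and L2-normalized
    ("regressor", Ridge()),
])
tfidf_model.fit(data, ratings)
```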
695 | ] 696 | }, 697 | { 698 | "cell_type": "code", 699 | "execution_count": 17, 700 | "metadata": {}, 701 | "outputs": [ 702 | { 703 | "data": { 704 | "text/plain": [ 705 | "Pipeline(memory=None,\n", 706 | " steps=[('selector', KeySelector(key='reviewText')),\n", 707 | " ('vectorizer',\n", 708 | " HashingVectorizer(alternate_sign=True, analyzer='word',\n", 709 | " binary=False, decode_error='strict',\n", 710 | " dtype=,\n", 711 | " encoding='utf-8', input='content',\n", 712 | " lowercase=True, n_features=1048576,\n", 713 | " ngram_range=(1, 1), norm='l2',\n", 714 | " preprocessor=None, stop_words=None,\n", 715 | " strip_accents=None,\n", 716 | " token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", 717 | " tokenizer=None)),\n", 718 | " ('regressor',\n", 719 | " Ridge(alpha=1.0, copy_X=True, fit_intercept=True,\n", 720 | " max_iter=None, normalize=False, random_state=None,\n", 721 | " solver='auto', tol=0.001))],\n", 722 | " verbose=False)" 723 | ] 724 | }, 725 | "execution_count": 17, 726 | "metadata": {}, 727 | "output_type": "execute_result" 728 | } 729 | ], 730 | "source": [ 731 | "normalized_model = Pipeline([\n", 732 | " ('selector', KeySelector('reviewText')),\n", 733 | " ('vectorizer', HashingVectorizer()),\n", 734 | " ('regressor', Ridge())\n", 735 | "])\n", 736 | "\n", 737 | "normalized_model.fit(data, ratings)" 738 | ] 739 | }, 740 | { 741 | "cell_type": "code", 742 | "execution_count": 18, 743 | "metadata": {}, 744 | "outputs": [ 745 | { 746 | "name": "stdout", 747 | "output_type": "stream", 748 | "text": [ 749 | "==================\n", 750 | "Your score: 1.0336821875031081\n", 751 | "==================\n" 752 | ] 753 | } 754 | ], 755 | "source": [ 756 | "grader.score.nlp__normalized_model(normalized_model.predict)" 757 | ] 758 | }, 759 | { 760 | "cell_type": "markdown", 761 | "metadata": {}, 762 | "source": [ 763 | "## Question 3: Bigrams model\n", 764 | "\n", 765 | "The model performance may increase when including additional features generated by counting bigrams. Include bigrams to your model. When using more features, the risk of overfitting increases. Make sure you try to minimize overfitting as much as possible." 
766 | ] 767 | }, 768 | { 769 | "cell_type": "code", 770 | "execution_count": 19, 771 | "metadata": {}, 772 | "outputs": [ 773 | { 774 | "data": { 775 | "text/plain": [ 776 | "Pipeline(memory=None,\n", 777 | " steps=[('selector', KeySelector(key='reviewText')),\n", 778 | " ('vectorizer',\n", 779 | " HashingVectorizer(alternate_sign=True, analyzer='word',\n", 780 | " binary=False, decode_error='strict',\n", 781 | " dtype=,\n", 782 | " encoding='utf-8', input='content',\n", 783 | " lowercase=True, n_features=1048576,\n", 784 | " ngram_range=(1, 2), norm='l2',\n", 785 | " preprocessor=None, stop_words=None,\n", 786 | " strip_accents=None,\n", 787 | " token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", 788 | " tokenizer=None)),\n", 789 | " ('regressor',\n", 790 | " Ridge(alpha=1.0, copy_X=True, fit_intercept=True,\n", 791 | " max_iter=None, normalize=False, random_state=None,\n", 792 | " solver='auto', tol=0.001))],\n", 793 | " verbose=False)" 794 | ] 795 | }, 796 | "execution_count": 19, 797 | "metadata": {}, 798 | "output_type": "execute_result" 799 | } 800 | ], 801 | "source": [ 802 | "bigrams_model = Pipeline([\n", 803 | " ('selector', KeySelector('reviewText')),\n", 804 | " ('vectorizer', HashingVectorizer(lowercase=True, ngram_range=(1,2))),\n", 805 | " ('regressor', Ridge())\n", 806 | "])\n", 807 | "\n", 808 | "bigrams_model.fit(data, ratings)" 809 | ] 810 | }, 811 | { 812 | "cell_type": "code", 813 | "execution_count": 21, 814 | "metadata": {}, 815 | "outputs": [ 816 | { 817 | "name": "stdout", 818 | "output_type": "stream", 819 | "text": [ 820 | "==================\n", 821 | "Your score: 1.1522966762187192\n", 822 | "==================\n" 823 | ] 824 | } 825 | ], 826 | "source": [ 827 | "grader.score.nlp__bigrams_model(bigrams_model.predict)" 828 | ] 829 | }, 830 | { 831 | "cell_type": "markdown", 832 | "metadata": {}, 833 | "source": [ 834 | "## Question 4: Polarity analysis\n", 835 | "\n", 836 | "Let's derive some insight from our analysis. We want to determine the most polarizing words in the corpus of reviews. In other words, we want to identify words that strongly signal a review is either positive or negative. For example, we expect a word like \"terrible\" to appear mostly in negative rather than positive reviews. The naive Bayes model calculates probabilities such as $P(\\text{terrible } | \\text{ negative})$, the probability the word \"terrible\" appears in the text, given that the review is negative. Using these probabilities, we can derive a **polarity score** for each counted word,\n", 837 | "\n", 838 | "$$\n", 839 | "\\text{polarity} = \\log\\left(\\frac{P(\\text{word } | \\text{ positive})}{P(\\text{word } | \\text{ negative})}\\right).\n", 840 | "$$ \n", 841 | "\n", 842 | "The polarity analysis is an example where a simpler model offers more interpretability than a more complicated model. For this question, you are asked to determine the twenty-five words with the largest positive polarity **and** the twenty-five words with the largest negative polarity, for a total of fifty words. For this analysis, you should:\n", 843 | "\n", 844 | "1. Use the naive Bayes model, `MultinomialNB`.\n", 845 | "1. Use tf-idf weighting.\n", 846 | "1. Remove stop words.\n", 847 | "\n", 848 | "A trained naive Bayes model stores the log of the probabilities in the attribute `feature_log_prob_`. It is a NumPy array of shape (number of classes, number of features). You will need the mapping from feature index to word; a short sketch of recovering that mapping follows below. 
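A small, hypothetical sketch of recovering the feature-index-to-word mapping and turning `feature_log_prob_` into polarity scores. It assumes a fitted `Pipeline` named `pipe` whose steps are named `vectorizer` (a `TfidfVectorizer`) and `classifier` (a `MultinomialNB`), as in the solution further down; it is not the reference solution.

```python
# Hypothetical sketch: pair each column of feature_log_prob_ with its word.
import numpy as np

vocab = pipe["vectorizer"].get_feature_names()    # feature index -> word
log_prob = pipe["classifier"].feature_log_prob_   # shape: (n_classes, n_features)

# classes_ is sorted, so row 0 corresponds to 1-star reviews and row 1 to 5-star reviews
polarity = log_prob[1, :] - log_prob[0, :]        # log P(word | positive) - log P(word | negative)

most_positive = [vocab[i] for i in np.argsort(polarity)[-25:]]
most_negative = [vocab[i] for i in np.argsort(polarity)[:25]]
```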
For this problem, you will use a different data set; it has been processed to only include reviews with one and five stars. You can download it below." 849 | ] 850 | }, 851 | { 852 | "cell_type": "code", 853 | "execution_count": 22, 854 | "metadata": {}, 855 | "outputs": [ 856 | { 857 | "name": "stderr", 858 | "output_type": "stream", 859 | "text": [ 860 | "--2021-03-20 20:43:16-- http://dataincubator-wqu.s3.amazonaws.com/mldata/amazon_one_and_five_star_reviews.json.gz\n", 861 | "Resolving dataincubator-wqu.s3.amazonaws.com (dataincubator-wqu.s3.amazonaws.com)... 52.216.17.224\n", 862 | "Connecting to dataincubator-wqu.s3.amazonaws.com (dataincubator-wqu.s3.amazonaws.com)|52.216.17.224|:80... connected.\n", 863 | "HTTP request sent, awaiting response... 200 OK\n", 864 | "Length: 2970853 (2.8M) [application/x-gzip]\n", 865 | "Saving to: ‘./data/amazon_one_and_five_star_reviews.json.gz’\n", 866 | "\n", 867 | " [... wget progress output omitted ...]\n", 925 | " 2900K . 100% 2334G=0.06s\n", 926 | "\n", 927 | "2021-03-20 20:43:16 (49.5 MB/s) - ‘./data/amazon_one_and_five_star_reviews.json.gz’ saved [2970853/2970853]\n", 928 | "\n" 929 | ] 930 | } 931 | ], 932 | "source": [ 933 | "%%bash\n", 934 | "wget http://dataincubator-wqu.s3.amazonaws.com/mldata/amazon_one_and_five_star_reviews.json.gz -nc -P ./data" 935 | ] 936 | }, 937 | { 938 | "cell_type": "markdown", 939 | "metadata": {}, 940 | "source": [ 941 | "In order to avoid memory issues, let's delete the older data." 
942 | ] 943 | }, 944 | { 945 | "cell_type": "code", 946 | "execution_count": 23, 947 | "metadata": {}, 948 | "outputs": [], 949 | "source": [ 950 | "del data, ratings" 951 | ] 952 | }, 953 | { 954 | "cell_type": "code", 955 | "execution_count": 24, 956 | "metadata": {}, 957 | "outputs": [], 958 | "source": [ 959 | "import numpy as np\n", 960 | "from sklearn.naive_bayes import MultinomialNB\n", 961 | "\n", 962 | "with gzip.open(\"data/amazon_one_and_five_star_reviews.json.gz\", \"r\") as f:\n", 963 | " data_polarity = [json.loads(line) for line in f]\n", 964 | "\n", 965 | "ratings = [review['overall'] for review in data_polarity]" 966 | ] 967 | }, 968 | { 969 | "cell_type": "code", 970 | "execution_count": 25, 971 | "metadata": {}, 972 | "outputs": [ 973 | { 974 | "data": { 975 | "text/plain": [ 976 | "Pipeline(memory=None,\n", 977 | " steps=[('selector', KeySelector(key='reviewText')),\n", 978 | " ('vectorizer',\n", 979 | " TfidfVectorizer(analyzer='word', binary=False,\n", 980 | " decode_error='strict',\n", 981 | " dtype=,\n", 982 | " encoding='utf-8', input='content',\n", 983 | " lowercase=True, max_df=1.0, max_features=None,\n", 984 | " min_df=1, ngram_range=(1, 1), norm='l2',\n", 985 | " preprocessor=None, smooth_idf=True,\n", 986 | " stop_words={'a', 'abou...\n", 987 | " 'also', 'although', 'always', 'am',\n", 988 | " 'among', 'amongst', 'amount', 'an',\n", 989 | " 'and', 'another', 'any', 'anyhow',\n", 990 | " 'anyone', 'anything', 'anyway',\n", 991 | " 'anywhere', 'are', ...},\n", 992 | " strip_accents=None, sublinear_tf=False,\n", 993 | " token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", 994 | " tokenizer=None, use_idf=True,\n", 995 | " vocabulary=None)),\n", 996 | " ('classifier',\n", 997 | " MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],\n", 998 | " verbose=False)" 999 | ] 1000 | }, 1001 | "execution_count": 25, 1002 | "metadata": {}, 1003 | "output_type": "execute_result" 1004 | } 1005 | ], 1006 | "source": [ 1007 | "pipe = Pipeline([\n", 1008 | " (\"selector\", KeySelector(\"reviewText\")),\n", 1009 | " (\"vectorizer\", TfidfVectorizer(stop_words=STOP_WORDS)),\n", 1010 | " (\"classifier\", MultinomialNB())\n", 1011 | "])\n", 1012 | "\n", 1013 | "pipe.fit(data_polarity, ratings)" 1014 | ] 1015 | }, 1016 | { 1017 | "cell_type": "code", 1018 | "execution_count": 47, 1019 | "metadata": {}, 1020 | "outputs": [], 1021 | "source": [ 1022 | "# retrieve the features' log probabilities\n", 1023 | "log_prob = pipe['classifier'].feature_log_prob_\n", 1024 | "\n", 1025 | "# compute the polarity; row 0 is the 1-star class and row 1 the 5-star class, so this is the negative of the score defined above (both extremes are kept below, so the same fifty words result)\n", 1026 | "polarity = log_prob[0,:] - log_prob[1,:]\n", 1027 | "\n", 1028 | "# get the feature names (index -> word mapping)\n", 1029 | "terms = pipe['vectorizer'].get_feature_names()\n", 1030 | "\n", 1031 | "# pair each term with its polarity\n", 1032 | "terms_polarity = list(zip(polarity, terms))\n", 1033 | "\n", 1034 | "# sort the (polarity, term) pairs by polarity\n", 1035 | "sorted_terms_polarity = sorted(terms_polarity)\n", 1036 | "\n", 1037 | "# keep the N most polarized terms (25 from each end of the sorted list)\n", 1038 | "N = 50\n", 1039 | "highest_polarized_terms = sorted_terms_polarity[:N//2] + sorted_terms_polarity[-N//2:]\n", 1040 | "\n", 1041 | "\n", 1042 | "# extract the terms with the highest polarization\n", 1043 | "top_50 = [term for polarity, term in highest_polarized_terms]" 1044 | ] 1045 | }, 1046 | { 1047 | "cell_type": "code", 1048 | "execution_count": 48, 1049 | "metadata": {}, 1050 | "outputs": [ 1051 | { 1052 | "name": "stdout", 1053 | "output_type": "stream", 1054 | "text": [ 1055 | "==================\n", 1056 | 
"Your score: 1.0\n", 1057 | "==================\n" 1058 | ] 1059 | } 1060 | ], 1061 | "source": [ 1062 | "grader.score.nlp__most_polar(top_50)" 1063 | ] 1064 | }, 1065 | { 1066 | "cell_type": "markdown", 1067 | "metadata": {}, 1068 | "source": [ 1069 | "## Question 5: Topic modeling [optional]\n", 1070 | "\n", 1071 | "Topic modeling is the task of determining the key topics or themes in a corpus. In machine learning terms, topic modeling is an unsupervised technique. One way to uncover the main topics in a corpus is to use [non-negative matrix factorization](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html). For this question, use non-negative matrix factorization to determine the top ten words for each of the first twenty topics. You should submit your answer as a list of lists. What topics exist in the reviews? (A rough sketch of one possible approach is given below.)" 1072 | ] 1073 | }, 1074 | { 1075 | "cell_type": "code", 1076 | "execution_count": null, 1077 | "metadata": {}, 1078 | "outputs": [], 1079 | "source": [ 1080 | "from sklearn.decomposition import NMF\n", 1081 | " " 1082 | ] 1083 | }, 1084 | { 1085 | "cell_type": "markdown", 1086 | "metadata": {}, 1087 | "source": [ 1088 | "*Copyright © 2021 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.*" 1089 | ] 1090 | } 1091 | ], 1092 | "metadata": { 1093 | "kernelspec": { 1094 | "display_name": "Python 3", 1095 | "language": "python", 1096 | "name": "python3" 1097 | }, 1098 | "language_info": { 1099 | "codemirror_mode": { 1100 | "name": "ipython", 1101 | "version": 3 1102 | }, 1103 | "file_extension": ".py", 1104 | "mimetype": "text/x-python", 1105 | "name": "python", 1106 | "nbconvert_exporter": "python", 1107 | "pygments_lexer": "ipython3", 1108 | "version": "3.7.3" 1109 | }, 1110 | "nbclean": true 1111 | }, 1112 | "nbformat": 4, 1113 | "nbformat_minor": 1 1114 | } 1115 | --------------------------------------------------------------------------------