├── .gitattributes ├── 9781484241301.jpg ├── Contributing.md ├── LICENSE.txt ├── README.md ├── chapter_2_Data_Processing ├── Data_processing_using_PySpark.ipynb └── sample_data.csv ├── chapter_4_Linear_Regression ├── Linear_Regression.ipynb └── Linear_regression_dataset.csv ├── chapter_5_Logistic_Regression ├── Log_Reg_dataset.csv └── Logistic_Regression_Pyspark.ipynb ├── chapter_6_Random_Forests ├── Random_Forests.ipynb └── affairs.csv ├── chapter_7_Clustering ├── Clustering_PySpark.ipynb └── iris_dataset.csv ├── chapter_8_Recommender_System ├── Recommender_System_PySpark.ipynb └── movie_ratings_df.csv ├── chapter_9_NLP ├── Movie_reviews.csv ├── NLP_PySpark.ipynb └── Sequence_Embeddings_PySpark.ipynb └── errata.md /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /9781484241301.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Apress/machine-learning-with-pyspark/e35273491ed9bf25a6849d38033ee2a64c962227/9781484241301.jpg -------------------------------------------------------------------------------- /Contributing.md: -------------------------------------------------------------------------------- 1 | # Contributing to Apress Source Code 2 | 3 | Copyright for Apress source code belongs to the author(s). However, under fair use you are encouraged to fork and contribute minor corrections and updates for the benefit of the author(s) and other readers. 4 | 5 | ## How to Contribute 6 | 7 | 1. Make sure you have a GitHub account. 8 | 2. Fork the repository for the relevant book. 9 | 3. Create a new branch on which to make your change, e.g. 10 | `git checkout -b my_code_contribution` 11 | 4. Commit your change. Include a commit message describing the correction. 
Please note that if your commit message is not clear, the correction will not be accepted. 12 | 5. Submit a pull request. 13 | 14 | Thank you for your contribution! -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Freeware License, some rights reserved 2 | 3 | Copyright (c) 2019 Pramod Singh 4 | 5 | Permission is hereby granted, free of charge, to anyone obtaining a copy 6 | of this software and associated documentation files (the "Software"), 7 | to work with the Software within the limits of freeware distribution and fair use. 8 | This includes the rights to use, copy, and modify the Software for personal use. 9 | Users are also allowed and encouraged to submit corrections and modifications 10 | to the Software for the benefit of other users. 11 | 12 | It is not allowed to reuse, modify, or redistribute the Software for 13 | commercial use in any way, or for a user’s educational materials such as books 14 | or blog articles without prior permission from the copyright holder. 15 | 16 | The above copyright notice and this permission notice need to be included 17 | in all copies or substantial portions of the software. 18 | 19 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 20 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 21 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 22 | AUTHORS OR COPYRIGHT HOLDERS OR APRESS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 23 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 24 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 25 | SOFTWARE. 
26 | 27 | 28 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Apress Source Code 2 | 3 | This repository accompanies [*Machine Learning with PySpark*](https://www.apress.com/9781484241301) by Pramod Singh (Apress, 2019). 4 | 5 | [comment]: #cover 6 | ![Cover image](9781484241301.jpg) 7 | 8 | Download the files as a zip using the green button, or clone the repository to your machine using Git. 9 | 10 | ## Releases 11 | 12 | Release v1.0 corresponds to the code in the published book, without corrections or updates. 13 | 14 | ## Contributions 15 | 16 | See the file Contributing.md for more information on how you can contribute to this repository. -------------------------------------------------------------------------------- /chapter_2_Data_Processing/Data_processing_using_PySpark.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Data Processing using PySpark" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "#import SparkSession\n", 17 | "from pyspark.sql import SparkSession" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "#create SparkSession object\n", 27 | "spark=SparkSession.builder.appName('data_processing').getOrCreate()" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 3, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "# Load csv Dataset \n", 37 | "df=spark.read.csv('sample_data.csv',inferSchema=True,header=True)" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 4, 43 | "metadata": {}, 44 | "outputs": [ 45 | { 46 | "data": { 47 | "text/plain": [ 48 | 
"['ratings', 'age', 'experience', 'family', 'mobile']" 49 | ] 50 | }, 51 | "execution_count": 4, 52 | "metadata": {}, 53 | "output_type": "execute_result" 54 | } 55 | ], 56 | "source": [ 57 | "#columns of dataframe\n", 58 | "df.columns" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 5, 64 | "metadata": {}, 65 | "outputs": [ 66 | { 67 | "data": { 68 | "text/plain": [ 69 | "5" 70 | ] 71 | }, 72 | "execution_count": 5, 73 | "metadata": {}, 74 | "output_type": "execute_result" 75 | } 76 | ], 77 | "source": [ 78 | "#check number of columns\n", 79 | "len(df.columns)" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 6, 85 | "metadata": {}, 86 | "outputs": [ 87 | { 88 | "data": { 89 | "text/plain": [ 90 | "33" 91 | ] 92 | }, 93 | "execution_count": 6, 94 | "metadata": {}, 95 | "output_type": "execute_result" 96 | } 97 | ], 98 | "source": [ 99 | "#number of records in dataframe\n", 100 | "df.count()" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 7, 106 | "metadata": {}, 107 | "outputs": [ 108 | { 109 | "name": "stdout", 110 | "output_type": "stream", 111 | "text": [ 112 | "(33, 5)\n" 113 | ] 114 | } 115 | ], 116 | "source": [ 117 | "#shape of dataset\n", 118 | "print((df.count(),len(df.columns)))" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 8, 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "name": "stdout", 128 | "output_type": "stream", 129 | "text": [ 130 | "root\n", 131 | " |-- ratings: integer (nullable = true)\n", 132 | " |-- age: integer (nullable = true)\n", 133 | " |-- experience: double (nullable = true)\n", 134 | " |-- family: integer (nullable = true)\n", 135 | " |-- mobile: string (nullable = true)\n", 136 | "\n" 137 | ] 138 | } 139 | ], 140 | "source": [ 141 | "#printSchema\n", 142 | "df.printSchema()" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": 11, 148 | "metadata": {}, 149 | "outputs": [ 150 | { 151 | "name": 
"stdout", 152 | "output_type": "stream", 153 | "text": [ 154 | "+-------+---+----------+------+-------+\n", 155 | "|ratings|age|experience|family| mobile|\n", 156 | "+-------+---+----------+------+-------+\n", 157 | "| 3| 32| 9.0| 3| Vivo|\n", 158 | "| 3| 27| 13.0| 3| Apple|\n", 159 | "| 4| 22| 2.5| 0|Samsung|\n", 160 | "| 4| 37| 16.5| 4| Apple|\n", 161 | "| 5| 27| 9.0| 1| MI|\n", 162 | "+-------+---+----------+------+-------+\n", 163 | "only showing top 5 rows\n", 164 | "\n" 165 | ] 166 | } 167 | ], 168 | "source": [ 169 | "#first few rows of dataframe\n", 170 | "df.show(5)" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 12, 176 | "metadata": {}, 177 | "outputs": [ 178 | { 179 | "name": "stdout", 180 | "output_type": "stream", 181 | "text": [ 182 | "+---+-------+\n", 183 | "|age| mobile|\n", 184 | "+---+-------+\n", 185 | "| 32| Vivo|\n", 186 | "| 27| Apple|\n", 187 | "| 22|Samsung|\n", 188 | "| 37| Apple|\n", 189 | "| 27| MI|\n", 190 | "+---+-------+\n", 191 | "only showing top 5 rows\n", 192 | "\n" 193 | ] 194 | } 195 | ], 196 | "source": [ 197 | "#select only 2 columns\n", 198 | "df.select('age','mobile').show(5)" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": 13, 204 | "metadata": {}, 205 | "outputs": [ 206 | { 207 | "name": "stdout", 208 | "output_type": "stream", 209 | "text": [ 210 | "+-------+------------------+------------------+------------------+------------------+------+\n", 211 | "|summary| ratings| age| experience| family|mobile|\n", 212 | "+-------+------------------+------------------+------------------+------------------+------+\n", 213 | "| count| 33| 33| 33| 33| 33|\n", 214 | "| mean|3.5757575757575757|30.484848484848484|10.303030303030303|1.8181818181818181| null|\n", 215 | "| stddev|1.1188806636071336| 6.18527087180309| 6.770731351213326|1.8448330794164254| null|\n", 216 | "| min| 1| 22| 2.5| 0| Apple|\n", 217 | "| max| 5| 42| 23.0| 5| Vivo|\n", 218 | 
"+-------+------------------+------------------+------------------+------------------+------+\n", 219 | "\n" 220 | ] 221 | } 222 | ], 223 | "source": [ 224 | "#info about dataframe\n", 225 | "df.describe().show()" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": 43, 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [ 234 | "from pyspark.sql.types import StringType,DoubleType,IntegerType" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": 16, 240 | "metadata": {}, 241 | "outputs": [ 242 | { 243 | "name": "stdout", 244 | "output_type": "stream", 245 | "text": [ 246 | "+-------+---+----------+------+-------+----------------+\n", 247 | "|ratings|age|experience|family|mobile |age_after_10_yrs|\n", 248 | "+-------+---+----------+------+-------+----------------+\n", 249 | "|3 |32 |9.0 |3 |Vivo |42 |\n", 250 | "|3 |27 |13.0 |3 |Apple |37 |\n", 251 | "|4 |22 |2.5 |0 |Samsung|32 |\n", 252 | "|4 |37 |16.5 |4 |Apple |47 |\n", 253 | "|5 |27 |9.0 |1 |MI |37 |\n", 254 | "|4 |27 |9.0 |0 |Oppo |37 |\n", 255 | "|5 |37 |23.0 |5 |Vivo |47 |\n", 256 | "|5 |37 |23.0 |5 |Samsung|47 |\n", 257 | "|3 |22 |2.5 |0 |Apple |32 |\n", 258 | "|3 |27 |6.0 |0 |MI |37 |\n", 259 | "+-------+---+----------+------+-------+----------------+\n", 260 | "only showing top 10 rows\n", 261 | "\n" 262 | ] 263 | } 264 | ], 265 | "source": [ 266 | "#with column\n", 267 | "df.withColumn(\"age_after_10_yrs\",(df[\"age\"]+10)).show(10,False)" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 15, 273 | "metadata": {}, 274 | "outputs": [ 275 | { 276 | "name": "stdout", 277 | "output_type": "stream", 278 | "text": [ 279 | "+-------+---+----------+------+-------+----------+\n", 280 | "|ratings|age|experience|family|mobile |age_double|\n", 281 | "+-------+---+----------+------+-------+----------+\n", 282 | "|3 |32 |9.0 |3 |Vivo |32.0 |\n", 283 | "|3 |27 |13.0 |3 |Apple |27.0 |\n", 284 | "|4 |22 |2.5 |0 |Samsung|22.0 |\n", 285 | "|4 
|37 |16.5 |4 |Apple |37.0 |\n", 286 | "|5 |27 |9.0 |1 |MI |27.0 |\n", 287 | "|4 |27 |9.0 |0 |Oppo |27.0 |\n", 288 | "|5 |37 |23.0 |5 |Vivo |37.0 |\n", 289 | "|5 |37 |23.0 |5 |Samsung|37.0 |\n", 290 | "|3 |22 |2.5 |0 |Apple |22.0 |\n", 291 | "|3 |27 |6.0 |0 |MI |27.0 |\n", 292 | "+-------+---+----------+------+-------+----------+\n", 293 | "only showing top 10 rows\n", 294 | "\n" 295 | ] 296 | } 297 | ], 298 | "source": [ 299 | "df.withColumn('age_double',df['age'].cast(DoubleType())).show(10,False)" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": 17, 305 | "metadata": {}, 306 | "outputs": [ 307 | { 308 | "name": "stdout", 309 | "output_type": "stream", 310 | "text": [ 311 | "+-------+---+----------+------+-------+----------------+\n", 312 | "|ratings|age|experience|family|mobile |age_after_10_yrs|\n", 313 | "+-------+---+----------+------+-------+----------------+\n", 314 | "|3 |32 |9.0 |3 |Vivo |42 |\n", 315 | "|3 |27 |13.0 |3 |Apple |37 |\n", 316 | "|4 |22 |2.5 |0 |Samsung|32 |\n", 317 | "|4 |37 |16.5 |4 |Apple |47 |\n", 318 | "|5 |27 |9.0 |1 |MI |37 |\n", 319 | "|4 |27 |9.0 |0 |Oppo |37 |\n", 320 | "|5 |37 |23.0 |5 |Vivo |47 |\n", 321 | "|5 |37 |23.0 |5 |Samsung|47 |\n", 322 | "|3 |22 |2.5 |0 |Apple |32 |\n", 323 | "|3 |27 |6.0 |0 |MI |37 |\n", 324 | "+-------+---+----------+------+-------+----------------+\n", 325 | "only showing top 10 rows\n", 326 | "\n" 327 | ] 328 | } 329 | ], 330 | "source": [ 331 | "#with column\n", 332 | "df.withColumn(\"age_after_10_yrs\",(df[\"age\"]+10)).show(10,False)" 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": 18, 338 | "metadata": {}, 339 | "outputs": [ 340 | { 341 | "name": "stdout", 342 | "output_type": "stream", 343 | "text": [ 344 | "+-------+---+----------+------+------+\n", 345 | "|ratings|age|experience|family|mobile|\n", 346 | "+-------+---+----------+------+------+\n", 347 | "| 3| 32| 9.0| 3| Vivo|\n", 348 | "| 5| 37| 23.0| 5| Vivo|\n", 349 | "| 4| 37| 6.0| 0| 
Vivo|\n", 350 | "| 5| 37| 13.0| 1| Vivo|\n", 351 | "| 4| 37| 6.0| 0| Vivo|\n", 352 | "+-------+---+----------+------+------+\n", 353 | "\n" 354 | ] 355 | } 356 | ], 357 | "source": [ 358 | "#filter records where mobile is Vivo\n", 359 | "df.filter(df['mobile']=='Vivo').show()" 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": 20, 365 | "metadata": {}, 366 | "outputs": [ 367 | { 368 | "name": "stdout", 369 | "output_type": "stream", 370 | "text": [ 371 | "+---+-------+------+\n", 372 | "|age|ratings|mobile|\n", 373 | "+---+-------+------+\n", 374 | "| 32| 3| Vivo|\n", 375 | "| 37| 5| Vivo|\n", 376 | "| 37| 4| Vivo|\n", 377 | "| 37| 5| Vivo|\n", 378 | "| 37| 4| Vivo|\n", 379 | "+---+-------+------+\n", 380 | "\n" 381 | ] 382 | } 383 | ], 384 | "source": [ 385 | "#filter records, then select a subset of columns\n", 386 | "df.filter(df['mobile']=='Vivo').select('age','ratings','mobile').show()" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": 21, 392 | "metadata": {}, 393 | "outputs": [ 394 | { 395 | "name": "stdout", 396 | "output_type": "stream", 397 | "text": [ 398 | "+-------+---+----------+------+------+\n", 399 | "|ratings|age|experience|family|mobile|\n", 400 | "+-------+---+----------+------+------+\n", 401 | "| 5| 37| 23.0| 5| Vivo|\n", 402 | "| 5| 37| 13.0| 1| Vivo|\n", 403 | "+-------+---+----------+------+------+\n", 404 | "\n" 405 | ] 406 | } 407 | ], 408 | "source": [ 409 | "#filter on multiple conditions by chaining filters\n", 410 | "df.filter(df['mobile']=='Vivo').filter(df['experience'] >10).show()" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 22, 416 | "metadata": {}, 417 | "outputs": [ 418 | { 419 | "name": "stdout", 420 | "output_type": "stream", 421 | "text": [ 422 | "+-------+---+----------+------+------+\n", 423 | "|ratings|age|experience|family|mobile|\n", 424 | "+-------+---+----------+------+------+\n", 425 | "| 5| 37| 23.0| 5| Vivo|\n", 426 | "| 5| 37| 13.0| 1| Vivo|\n", 427 | 
"+-------+---+----------+------+------+\n", 428 | "\n" 429 | ] 430 | } 431 | ], 432 | "source": [ 433 | "#filter the multiple conditions\n", 434 | "df.filter((df['mobile']=='Vivo')&(df['experience'] >10)).show()" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": 23, 440 | "metadata": {}, 441 | "outputs": [ 442 | { 443 | "name": "stdout", 444 | "output_type": "stream", 445 | "text": [ 446 | "+-------+\n", 447 | "| mobile|\n", 448 | "+-------+\n", 449 | "| MI|\n", 450 | "| Oppo|\n", 451 | "|Samsung|\n", 452 | "| Vivo|\n", 453 | "| Apple|\n", 454 | "+-------+\n", 455 | "\n" 456 | ] 457 | } 458 | ], 459 | "source": [ 460 | "#Distinct Values in a column\n", 461 | "df.select('mobile').distinct().show()" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": 25, 467 | "metadata": {}, 468 | "outputs": [ 469 | { 470 | "data": { 471 | "text/plain": [ 472 | "5" 473 | ] 474 | }, 475 | "execution_count": 25, 476 | "metadata": {}, 477 | "output_type": "execute_result" 478 | } 479 | ], 480 | "source": [ 481 | "#distinct value count\n", 482 | "df.select('mobile').distinct().count()" 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": 26, 488 | "metadata": {}, 489 | "outputs": [ 490 | { 491 | "name": "stdout", 492 | "output_type": "stream", 493 | "text": [ 494 | "+-------+-----+\n", 495 | "|mobile |count|\n", 496 | "+-------+-----+\n", 497 | "|MI |8 |\n", 498 | "|Oppo |7 |\n", 499 | "|Samsung|6 |\n", 500 | "|Vivo |5 |\n", 501 | "|Apple |7 |\n", 502 | "+-------+-----+\n", 503 | "\n" 504 | ] 505 | } 506 | ], 507 | "source": [ 508 | "df.groupBy('mobile').count().show(5,False)" 509 | ] 510 | }, 511 | { 512 | "cell_type": "code", 513 | "execution_count": 27, 514 | "metadata": {}, 515 | "outputs": [ 516 | { 517 | "name": "stdout", 518 | "output_type": "stream", 519 | "text": [ 520 | "+-------+-----+\n", 521 | "|mobile |count|\n", 522 | "+-------+-----+\n", 523 | "|MI |8 |\n", 524 | "|Oppo |7 |\n", 525 | "|Apple |7 
|\n", 526 | "|Samsung|6 |\n", 527 | "|Vivo |5 |\n", 528 | "+-------+-----+\n", 529 | "\n" 530 | ] 531 | } 532 | ], 533 | "source": [ 534 | "# Value counts\n", 535 | "df.groupBy('mobile').count().orderBy('count',ascending=False).show(5,False)" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": 28, 541 | "metadata": {}, 542 | "outputs": [ 543 | { 544 | "name": "stdout", 545 | "output_type": "stream", 546 | "text": [ 547 | "+-------+------------------+------------------+------------------+------------------+\n", 548 | "|mobile |avg(ratings) |avg(age) |avg(experience) |avg(family) |\n", 549 | "+-------+------------------+------------------+------------------+------------------+\n", 550 | "|MI |3.5 |30.125 |10.1875 |1.375 |\n", 551 | "|Oppo |2.857142857142857 |28.428571428571427|10.357142857142858|1.4285714285714286|\n", 552 | "|Samsung|4.166666666666667 |28.666666666666668|8.666666666666666 |1.8333333333333333|\n", 553 | "|Vivo |4.2 |36.0 |11.4 |1.8 |\n", 554 | "|Apple |3.4285714285714284|30.571428571428573|11.0 |2.7142857142857144|\n", 555 | "+-------+------------------+------------------+------------------+------------------+\n", 556 | "\n" 557 | ] 558 | } 559 | ], 560 | "source": [ 561 | "# Mean of each numeric column per brand\n", 562 | "df.groupBy('mobile').mean().show(5,False)" 563 | ] 564 | }, 565 | { 566 | "cell_type": "code", 567 | "execution_count": 29, 568 | "metadata": {}, 569 | "outputs": [ 570 | { 571 | "name": "stdout", 572 | "output_type": "stream", 573 | "text": [ 574 | "+-------+------------+--------+---------------+-----------+\n", 575 | "|mobile |sum(ratings)|sum(age)|sum(experience)|sum(family)|\n", 576 | "+-------+------------+--------+---------------+-----------+\n", 577 | "|MI |28 |241 |81.5 |11 |\n", 578 | "|Oppo |20 |199 |72.5 |10 |\n", 579 | "|Samsung|25 |172 |52.0 |11 |\n", 580 | "|Vivo |21 |180 |57.0 |9 |\n", 581 | "|Apple |24 |214 |77.0 |19 |\n", 582 | "+-------+------------+--------+---------------+-----------+\n", 583 | "\n" 584 | ] 585 
| } 586 | ], 587 | "source": [ 588 | "df.groupBy('mobile').sum().show(5,False)" 589 | ] 590 | }, 591 | { 592 | "cell_type": "code", 593 | "execution_count": 30, 594 | "metadata": {}, 595 | "outputs": [ 596 | { 597 | "name": "stdout", 598 | "output_type": "stream", 599 | "text": [ 600 | "+-------+------------+--------+---------------+-----------+\n", 601 | "|mobile |max(ratings)|max(age)|max(experience)|max(family)|\n", 602 | "+-------+------------+--------+---------------+-----------+\n", 603 | "|MI |5 |42 |23.0 |5 |\n", 604 | "|Oppo |4 |42 |23.0 |2 |\n", 605 | "|Samsung|5 |37 |23.0 |5 |\n", 606 | "|Vivo |5 |37 |23.0 |5 |\n", 607 | "|Apple |4 |37 |16.5 |5 |\n", 608 | "+-------+------------+--------+---------------+-----------+\n", 609 | "\n" 610 | ] 611 | } 612 | ], 613 | "source": [ 614 | "# Max of each numeric column per brand\n", 615 | "df.groupBy('mobile').max().show(5,False)" 616 | ] 617 | }, 618 | { 619 | "cell_type": "code", 620 | "execution_count": 31, 621 | "metadata": {}, 622 | "outputs": [ 623 | { 624 | "name": "stdout", 625 | "output_type": "stream", 626 | "text": [ 627 | "+-------+------------+--------+---------------+-----------+\n", 628 | "|mobile |min(ratings)|min(age)|min(experience)|min(family)|\n", 629 | "+-------+------------+--------+---------------+-----------+\n", 630 | "|MI |1 |27 |2.5 |0 |\n", 631 | "|Oppo |2 |22 |6.0 |0 |\n", 632 | "|Samsung|2 |22 |2.5 |0 |\n", 633 | "|Vivo |3 |32 |6.0 |0 |\n", 634 | "|Apple |3 |22 |2.5 |0 |\n", 635 | "+-------+------------+--------+---------------+-----------+\n", 636 | "\n" 637 | ] 638 | } 639 | ], 640 | "source": [ 641 | "# Min of each numeric column per brand\n", 642 | "df.groupBy('mobile').min().show(5,False)" 643 | ] 644 | }, 645 | { 646 | "cell_type": "code", 647 | "execution_count": 32, 648 | "metadata": {}, 649 | "outputs": [ 650 | { 651 | "name": "stdout", 652 | "output_type": "stream", 653 | "text": [ 654 | "+-------+---------------+\n", 655 | "|mobile |sum(experience)|\n", 656 | "+-------+---------------+\n", 657 | "|MI |81.5 |\n", 658 | 
"|Oppo |72.5 |\n", 659 | "|Samsung|52.0 |\n", 660 | "|Vivo |57.0 |\n", 661 | "|Apple |77.0 |\n", 662 | "+-------+---------------+\n", 663 | "\n" 664 | ] 665 | } 666 | ], 667 | "source": [ 668 | "#Aggregation\n", 669 | "df.groupBy('mobile').agg({'experience':'sum'}).show(5,False)" 670 | ] 671 | }, 672 | { 673 | "cell_type": "code", 674 | "execution_count": 33, 675 | "metadata": {}, 676 | "outputs": [], 677 | "source": [ 678 | "# UDF\n", 679 | "from pyspark.sql.functions import udf\n" 680 | ] 681 | }, 682 | { 683 | "cell_type": "code", 684 | "execution_count": 34, 685 | "metadata": {}, 686 | "outputs": [], 687 | "source": [ 688 | "#normal function \n", 689 | "def price_range(brand):\n", 690 | " if brand in ['Samsung','Apple']:\n", 691 | " return 'High Price'\n", 692 | " elif brand =='MI':\n", 693 | " return 'Mid Price'\n", 694 | " else:\n", 695 | " return 'Low Price'" 696 | ] 697 | }, 698 | { 699 | "cell_type": "code", 700 | "execution_count": 35, 701 | "metadata": {}, 702 | "outputs": [ 703 | { 704 | "name": "stdout", 705 | "output_type": "stream", 706 | "text": [ 707 | "+-------+---+----------+------+-------+-----------+\n", 708 | "|ratings|age|experience|family|mobile |price_range|\n", 709 | "+-------+---+----------+------+-------+-----------+\n", 710 | "|3 |32 |9.0 |3 |Vivo |Low Price |\n", 711 | "|3 |27 |13.0 |3 |Apple |High Price |\n", 712 | "|4 |22 |2.5 |0 |Samsung|High Price |\n", 713 | "|4 |37 |16.5 |4 |Apple |High Price |\n", 714 | "|5 |27 |9.0 |1 |MI |Mid Price |\n", 715 | "|4 |27 |9.0 |0 |Oppo |Low Price |\n", 716 | "|5 |37 |23.0 |5 |Vivo |Low Price |\n", 717 | "|5 |37 |23.0 |5 |Samsung|High Price |\n", 718 | "|3 |22 |2.5 |0 |Apple |High Price |\n", 719 | "|3 |27 |6.0 |0 |MI |Mid Price |\n", 720 | "+-------+---+----------+------+-------+-----------+\n", 721 | "only showing top 10 rows\n", 722 | "\n" 723 | ] 724 | } 725 | ], 726 | "source": [ 727 | "#create udf using python function\n", 728 | "brand_udf=udf(price_range,StringType())\n", 729 | "#apply udf 
on dataframe\n", 730 | "df.withColumn('price_range',brand_udf(df['mobile'])).show(10,False)" 731 | ] 732 | }, 733 | { 734 | "cell_type": "code", 735 | "execution_count": 36, 736 | "metadata": {}, 737 | "outputs": [ 738 | { 739 | "name": "stdout", 740 | "output_type": "stream", 741 | "text": [ 742 | "+-------+---+----------+------+-------+---------+\n", 743 | "|ratings|age|experience|family|mobile |age_group|\n", 744 | "+-------+---+----------+------+-------+---------+\n", 745 | "|3 |32 |9.0 |3 |Vivo |senior |\n", 746 | "|3 |27 |13.0 |3 |Apple |young |\n", 747 | "|4 |22 |2.5 |0 |Samsung|young |\n", 748 | "|4 |37 |16.5 |4 |Apple |senior |\n", 749 | "|5 |27 |9.0 |1 |MI |young |\n", 750 | "|4 |27 |9.0 |0 |Oppo |young |\n", 751 | "|5 |37 |23.0 |5 |Vivo |senior |\n", 752 | "|5 |37 |23.0 |5 |Samsung|senior |\n", 753 | "|3 |22 |2.5 |0 |Apple |young |\n", 754 | "|3 |27 |6.0 |0 |MI |young |\n", 755 | "+-------+---+----------+------+-------+---------+\n", 756 | "only showing top 10 rows\n", 757 | "\n" 758 | ] 759 | } 760 | ], 761 | "source": [ 762 | "#using lambda function\n", 763 | "age_udf = udf(lambda age: \"young\" if age <= 30 else \"senior\", StringType())\n", 764 | "#apply udf on dataframe\n", 765 | "df.withColumn(\"age_group\", age_udf(df.age)).show(10,False)" 766 | ] 767 | }, 768 | { 769 | "cell_type": "code", 770 | "execution_count": 52, 771 | "metadata": {}, 772 | "outputs": [], 773 | "source": [ 774 | "#pandas udf\n", 775 | "from pyspark.sql.functions import pandas_udf, PandasUDFType" 776 | ] 777 | }, 778 | { 779 | "cell_type": "code", 780 | "execution_count": 53, 781 | "metadata": {}, 782 | "outputs": [], 783 | "source": [ 784 | "#create python function\n", 785 | "def remaining_yrs(age):\n", 786 | " yrs_left=100-age\n", 787 | "\n", 788 | " return yrs_left" 789 | ] 790 | }, 791 | { 792 | "cell_type": "code", 793 | "execution_count": 54, 794 | "metadata": {}, 795 | "outputs": [ 796 | { 797 | "name": "stdout", 798 | "output_type": "stream", 799 | "text": [ 800 | 
"+-------+---+----------+------+-------+--------+\n", 801 | "|ratings|age|experience|family|mobile |yrs_left|\n", 802 | "+-------+---+----------+------+-------+--------+\n", 803 | "|3 |32 |9.0 |3 |Vivo |68 |\n", 804 | "|3 |27 |13.0 |3 |Apple |73 |\n", 805 | "|4 |22 |2.5 |0 |Samsung|78 |\n", 806 | "|4 |37 |16.5 |4 |Apple |63 |\n", 807 | "|5 |27 |9.0 |1 |MI |73 |\n", 808 | "|4 |27 |9.0 |0 |Oppo |73 |\n", 809 | "|5 |37 |23.0 |5 |Vivo |63 |\n", 810 | "|5 |37 |23.0 |5 |Samsung|63 |\n", 811 | "|3 |22 |2.5 |0 |Apple |78 |\n", 812 | "|3 |27 |6.0 |0 |MI |73 |\n", 813 | "+-------+---+----------+------+-------+--------+\n", 814 | "only showing top 10 rows\n", 815 | "\n" 816 | ] 817 | } 818 | ], 819 | "source": [ 820 | "#create udf using python function\n", 821 | "length_udf = pandas_udf(remaining_yrs, IntegerType())\n", 822 | "#apply pandas udf on dataframe\n", 823 | "df.withColumn(\"yrs_left\", length_udf(df['age'])).show(10,False)" 824 | ] 825 | }, 826 | { 827 | "cell_type": "code", 828 | "execution_count": 57, 829 | "metadata": {}, 830 | "outputs": [], 831 | "source": [ 832 | "#udf using two columns \n", 833 | "def prod(rating,exp):\n", 834 | " x=rating*exp\n", 835 | " return x" 836 | ] 837 | }, 838 | { 839 | "cell_type": "code", 840 | "execution_count": 58, 841 | "metadata": {}, 842 | "outputs": [ 843 | { 844 | "name": "stdout", 845 | "output_type": "stream", 846 | "text": [ 847 | "+-------+---+----------+------+-------+-------+\n", 848 | "|ratings|age|experience|family|mobile |product|\n", 849 | "+-------+---+----------+------+-------+-------+\n", 850 | "|3 |32 |9.0 |3 |Vivo |27.0 |\n", 851 | "|3 |27 |13.0 |3 |Apple |39.0 |\n", 852 | "|4 |22 |2.5 |0 |Samsung|10.0 |\n", 853 | "|4 |37 |16.5 |4 |Apple |66.0 |\n", 854 | "|5 |27 |9.0 |1 |MI |45.0 |\n", 855 | "|4 |27 |9.0 |0 |Oppo |36.0 |\n", 856 | "|5 |37 |23.0 |5 |Vivo |115.0 |\n", 857 | "|5 |37 |23.0 |5 |Samsung|115.0 |\n", 858 | "|3 |22 |2.5 |0 |Apple |7.5 |\n", 859 | "|3 |27 |6.0 |0 |MI |18.0 |\n", 860 | 
"+-------+---+----------+------+-------+-------+\n", 861 | "only showing top 10 rows\n", 862 | "\n" 863 | ] 864 | } 865 | ], 866 | "source": [ 867 | "#create udf using python function\n", 868 | "prod_udf = pandas_udf(prod, DoubleType())\n", 869 | "#apply pandas udf on multiple columns of dataframe\n", 870 | "df.withColumn(\"product\", prod_udf(df['ratings'],df['experience'])).show(10,False)" 871 | ] 872 | }, 873 | { 874 | "cell_type": "code", 875 | "execution_count": 45, 876 | "metadata": {}, 877 | "outputs": [ 878 | { 879 | "data": { 880 | "text/plain": [ 881 | "33" 882 | ] 883 | }, 884 | "execution_count": 45, 885 | "metadata": {}, 886 | "output_type": "execute_result" 887 | } 888 | ], 889 | "source": [ 890 | "#duplicate values\n", 891 | "df.count()" 892 | ] 893 | }, 894 | { 895 | "cell_type": "code", 896 | "execution_count": 46, 897 | "metadata": {}, 898 | "outputs": [], 899 | "source": [ 900 | "#drop duplicate values\n", 901 | "df=df.dropDuplicates()" 902 | ] 903 | }, 904 | { 905 | "cell_type": "code", 906 | "execution_count": 47, 907 | "metadata": {}, 908 | "outputs": [ 909 | { 910 | "data": { 911 | "text/plain": [ 912 | "26" 913 | ] 914 | }, 915 | "execution_count": 47, 916 | "metadata": {}, 917 | "output_type": "execute_result" 918 | } 919 | ], 920 | "source": [ 921 | "#validate new count\n", 922 | "df.count()" 923 | ] 924 | }, 925 | { 926 | "cell_type": "code", 927 | "execution_count": 59, 928 | "metadata": {}, 929 | "outputs": [], 930 | "source": [ 931 | "#drop column of dataframe\n", 932 | "df_new=df.drop('mobile')" 933 | ] 934 | }, 935 | { 936 | "cell_type": "code", 937 | "execution_count": 60, 938 | "metadata": {}, 939 | "outputs": [ 940 | { 941 | "name": "stdout", 942 | "output_type": "stream", 943 | "text": [ 944 | "+-------+---+----------+------+\n", 945 | "|ratings|age|experience|family|\n", 946 | "+-------+---+----------+------+\n", 947 | "| 3| 32| 9.0| 3|\n", 948 | "| 3| 27| 13.0| 3|\n", 949 | "| 4| 22| 2.5| 0|\n", 950 | "| 4| 37| 16.5| 4|\n", 951 
| "| 5| 27| 9.0| 1|\n", 952 | "| 4| 27| 9.0| 0|\n", 953 | "| 5| 37| 23.0| 5|\n", 954 | "| 5| 37| 23.0| 5|\n", 955 | "| 3| 22| 2.5| 0|\n", 956 | "| 3| 27| 6.0| 0|\n", 957 | "+-------+---+----------+------+\n", 958 | "only showing top 10 rows\n", 959 | "\n" 960 | ] 961 | } 962 | ], 963 | "source": [ 964 | "df_new.show(10)" 965 | ] 966 | }, 967 | { 968 | "cell_type": "code", 969 | "execution_count": null, 970 | "metadata": {}, 971 | "outputs": [], 972 | "source": [ 973 | "# saving file (csv)" 974 | ] 975 | }, 976 | { 977 | "cell_type": "code", 978 | "execution_count": null, 979 | "metadata": {}, 980 | "outputs": [], 981 | "source": [ 982 | "#current working directory\n", 983 | "pwd" 984 | ] 985 | }, 986 | { 987 | "cell_type": "code", 988 | "execution_count": null, 989 | "metadata": {}, 990 | "outputs": [], 991 | "source": [ 992 | "#target directory \n", 993 | "write_uri='/home/jovyan/work/df_csv'" 994 | ] 995 | }, 996 | { 997 | "cell_type": "code", 998 | "execution_count": null, 999 | "metadata": {}, 1000 | "outputs": [], 1001 | "source": [ 1002 | "#save the dataframe as single csv \n", 1003 | "df.coalesce(1).write.format(\"csv\").option(\"header\",\"true\").save(write_uri)" 1004 | ] 1005 | }, 1006 | { 1007 | "cell_type": "code", 1008 | "execution_count": null, 1009 | "metadata": {}, 1010 | "outputs": [], 1011 | "source": [ 1012 | "# parquet" 1013 | ] 1014 | }, 1015 | { 1016 | "cell_type": "code", 1017 | "execution_count": null, 1018 | "metadata": {}, 1019 | "outputs": [], 1020 | "source": [ 1021 | "#target location\n", 1022 | "parquet_uri='/home/jovyan/work/df_parquet'" 1023 | ] 1024 | }, 1025 | { 1026 | "cell_type": "code", 1027 | "execution_count": null, 1028 | "metadata": {}, 1029 | "outputs": [], 1030 | "source": [ 1031 | "#save the data into parquet format \n", 1032 | "df.write.format('parquet').save(parquet_uri)" 1033 | ] 1034 | }, 1035 | { 1036 | "cell_type": "code", 1037 | "execution_count": null, 1038 | "metadata": {}, 1039 | "outputs": [], 1040 | "source": 
[] 1041 | } 1042 | ], 1043 | "metadata": { 1044 | "kernelspec": { 1045 | "display_name": "Python 3", 1046 | "language": "python", 1047 | "name": "python3" 1048 | }, 1049 | "language_info": { 1050 | "codemirror_mode": { 1051 | "name": "ipython", 1052 | "version": 3 1053 | }, 1054 | "file_extension": ".py", 1055 | "mimetype": "text/x-python", 1056 | "name": "python", 1057 | "nbconvert_exporter": "python", 1058 | "pygments_lexer": "ipython3", 1059 | "version": "3.6.3" 1060 | } 1061 | }, 1062 | "nbformat": 4, 1063 | "nbformat_minor": 2 1064 | } 1065 | -------------------------------------------------------------------------------- /chapter_2_Data_Processing/sample_data.csv: -------------------------------------------------------------------------------- 1 | ratings,age,experience,family,mobile 2 | 3,32,9,3,Vivo 3 | 3,27,13,3,Apple 4 | 4,22,2.5,0,Samsung 5 | 4,37,16.5,4,Apple 6 | 5,27,9,1,MI 7 | 4,27,9,0,Oppo 8 | 5,37,23,5,Vivo 9 | 5,37,23,5,Samsung 10 | 3,22,2.5,0,Apple 11 | 3,27,6,0,MI 12 | 2,27,6,2,Oppo 13 | 5,27,6,2,Samsung 14 | 3,37,16.5,5,Apple 15 | 5,27,6,0,MI 16 | 4,22,6,1,Oppo 17 | 4,37,9,2,Samsung 18 | 4,27,6,1,Apple 19 | 1,37,23,5,MI 20 | 2,42,23,2,Oppo 21 | 4,37,6,0,Vivo 22 | 5,22,2.5,0,Samsung 23 | 3,37,16.5,5,Apple 24 | 3,42,23,5,MI 25 | 2,27,9,2,Samsung 26 | 4,27,6,1,Apple 27 | 5,27,2.5,0,MI 28 | 2,27,6,2,Oppo 29 | 5,37,13,1,Vivo 30 | 2,32,16.5,2,Oppo 31 | 3,27,6,0,MI 32 | 3,27,6,0,MI 33 | 4,22,6,1,Oppo 34 | 4,37,6,0,Vivo -------------------------------------------------------------------------------- /chapter_4_Linear_Regression/Linear_Regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Linear Regression using Pyspark" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "#create sparksession object\n", 17 | "from pyspark.sql 
import SparkSession\n", 18 | "spark=SparkSession.builder.appName('lin_reg').getOrCreate()" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "#import LinearRegression from Spark's MLlib\n", 28 | "from pyspark.ml.regression import LinearRegression" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 3, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "#load the dataset\n", 38 | "df=spark.read.csv('Linear_regression_dataset.csv',inferSchema=True,header=True)" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 4, 44 | "metadata": {}, 45 | "outputs": [ 46 | { 47 | "name": "stdout", 48 | "output_type": "stream", 49 | "text": [ 50 | "(1232, 6)\n" 51 | ] 52 | } 53 | ], 54 | "source": [ 55 | "#check the size of the data (rows, columns)\n", 56 | "print((df.count(), len(df.columns)))" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 5, 62 | "metadata": {}, 63 | "outputs": [ 64 | { 65 | "name": "stdout", 66 | "output_type": "stream", 67 | "text": [ 68 | "root\n", 69 | " |-- var_1: integer (nullable = true)\n", 70 | " |-- var_2: integer (nullable = true)\n", 71 | " |-- var_3: integer (nullable = true)\n", 72 | " |-- var_4: double (nullable = true)\n", 73 | " |-- var_5: double (nullable = true)\n", 74 | " |-- output: double (nullable = true)\n", 75 | "\n" 76 | ] 77 | } 78 | ], 79 | "source": [ 80 | "#inspect the schema of the data\n", 81 | "df.printSchema()" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 7, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "name": "stdout", 91 | "output_type": "stream", 92 | "text": [ 93 | "+-------+-----------------+-----------------+------------------+--------------------+--------------------+-------------------+\n", 94 | "|summary|var_1 |var_2 |var_3 |var_4 |var_5 |output |\n", 95 | 
"+-------+-----------------+-----------------+------------------+--------------------+--------------------+-------------------+\n", 96 | "|count |1232 |1232 |1232 |1232 |1232 |1232 |\n", 97 | "|mean |715.0819805194806|715.0819805194806|80.90422077922078 |0.3263311688311693 |0.25927272727272715 |0.39734172077922014|\n", 98 | "|stddev |91.5342940441652 |93.07993263118064|11.458139049993724|0.015012772334166148|0.012907228928000298|0.03326689862173776|\n", 99 | "|min |463 |472 |40 |0.277 |0.214 |0.301 |\n", 100 | "|max |1009 |1103 |116 |0.373 |0.294 |0.491 |\n", 101 | "+-------+-----------------+-----------------+------------------+--------------------+--------------------+-------------------+\n", 102 | "\n" 103 | ] 104 | } 105 | ], 106 | "source": [ 107 | "#view summary statistics of the data\n", 108 | "df.describe().show(5,False)" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 8, 114 | "metadata": {}, 115 | "outputs": [ 116 | { 117 | "data": { 118 | "text/plain": [ 119 | "[Row(var_1=734, var_2=688, var_3=81, var_4=0.328, var_5=0.259, output=0.418),\n", 120 | " Row(var_1=700, var_2=600, var_3=94, var_4=0.32, var_5=0.247, output=0.389),\n", 121 | " Row(var_1=712, var_2=705, var_3=93, var_4=0.311, var_5=0.247, output=0.417)]" 122 | ] 123 | }, 124 | "execution_count": 8, 125 | "metadata": {}, 126 | "output_type": "execute_result" 127 | } 128 | ], 129 | "source": [ 130 | "#peek at the first three rows of the dataset\n", 131 | "df.head(3)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 9, 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "#import corr function from pyspark functions\n", 141 | "from pyspark.sql.functions import corr" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 10, 147 | "metadata": {}, 148 | "outputs": [ 149 | { 150 | "name": "stdout", 151 | "output_type": "stream", 152 | "text": [ 153 | "+-------------------+\n", 154 | "|corr(var_1, output)|\n", 155 | 
"+-------------------+\n", 156 | "| 0.9187399607627283|\n", 157 | "+-------------------+\n", 158 | "\n" 159 | ] 160 | } 161 | ], 162 | "source": [ 163 | "#check the correlation between var_1 and output\n", 164 | "df.select(corr('var_1','output')).show()" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 11, 170 | "metadata": {}, 171 | "outputs": [], 172 | "source": [ 173 | "#import VectorAssembler to combine the input columns into a single feature vector\n", 174 | "from pyspark.ml.linalg import Vector\n", 175 | "from pyspark.ml.feature import VectorAssembler" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 12, 181 | "metadata": {}, 182 | "outputs": [ 183 | { 184 | "data": { 185 | "text/plain": [ 186 | "['var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'output']" 187 | ] 188 | }, 189 | "execution_count": 12, 190 | "metadata": {}, 191 | "output_type": "execute_result" 192 | } 193 | ], 194 | "source": [ 195 | "#list the columns available for the input vector\n", 196 | "df.columns" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 13, 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "#create the vector assembler \n", 206 | "vec_assembler=VectorAssembler(inputCols=['var_1', 'var_2', 'var_3', 'var_4', 'var_5'],outputCol='features')" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": 14, 212 | "metadata": {}, 213 | "outputs": [], 214 | "source": [ 215 | "#transform the dataframe to add the features column\n", 216 | "features_df=vec_assembler.transform(df)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 15, 222 | "metadata": {}, 223 | "outputs": [ 224 | { 225 | "name": "stdout", 226 | "output_type": "stream", 227 | "text": [ 228 | "root\n", 229 | " |-- var_1: integer (nullable = true)\n", 230 | " |-- var_2: integer (nullable = true)\n", 231 | " |-- var_3: integer (nullable = true)\n", 232 | " |-- var_4: double (nullable = true)\n", 233 | " |-- var_5: double (nullable = true)\n", 234 | " |-- output: double (nullable 
= true)\n", 235 | " |-- features: vector (nullable = true)\n", 236 | "\n" 237 | ] 238 | } 239 | ], 240 | "source": [ 241 | "#validate the presence of dense vectors \n", 242 | "features_df.printSchema()" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": 16, 248 | "metadata": { 249 | "scrolled": true 250 | }, 251 | "outputs": [ 252 | { 253 | "name": "stdout", 254 | "output_type": "stream", 255 | "text": [ 256 | "+------------------------------+\n", 257 | "|features |\n", 258 | "+------------------------------+\n", 259 | "|[734.0,688.0,81.0,0.328,0.259]|\n", 260 | "|[700.0,600.0,94.0,0.32,0.247] |\n", 261 | "|[712.0,705.0,93.0,0.311,0.247]|\n", 262 | "|[734.0,806.0,69.0,0.315,0.26] |\n", 263 | "|[613.0,759.0,61.0,0.302,0.24] |\n", 264 | "+------------------------------+\n", 265 | "only showing top 5 rows\n", 266 | "\n" 267 | ] 268 | } 269 | ], 270 | "source": [ 271 | "#view the details of dense vector\n", 272 | "features_df.select('features').show(5,False)" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 17, 278 | "metadata": {}, 279 | "outputs": [], 280 | "source": [ 281 | "#create data containing input features and output column\n", 282 | "model_df=features_df.select('features','output')" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": 18, 288 | "metadata": {}, 289 | "outputs": [ 290 | { 291 | "name": "stdout", 292 | "output_type": "stream", 293 | "text": [ 294 | "+------------------------------+------+\n", 295 | "|features |output|\n", 296 | "+------------------------------+------+\n", 297 | "|[734.0,688.0,81.0,0.328,0.259]|0.418 |\n", 298 | "|[700.0,600.0,94.0,0.32,0.247] |0.389 |\n", 299 | "|[712.0,705.0,93.0,0.311,0.247]|0.417 |\n", 300 | "|[734.0,806.0,69.0,0.315,0.26] |0.415 |\n", 301 | "|[613.0,759.0,61.0,0.302,0.24] |0.378 |\n", 302 | "+------------------------------+------+\n", 303 | "only showing top 5 rows\n", 304 | "\n" 305 | ] 306 | } 307 | ], 308 | "source": [ 309 | 
"model_df.show(5,False)" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": 18, 315 | "metadata": {}, 316 | "outputs": [ 317 | { 318 | "name": "stdout", 319 | "output_type": "stream", 320 | "text": [ 321 | "(1232, 2)\n" 322 | ] 323 | } 324 | ], 325 | "source": [ 326 | "#size of model df\n", 327 | "print((model_df.count(), len(model_df.columns)))" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "### Split Data - Train & Test sets\n" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": 19, 340 | "metadata": {}, 341 | "outputs": [], 342 | "source": [ 343 | "#split the data into 70/30 ratio for train test purpose\n", 344 | "train_df,test_df=model_df.randomSplit([0.7,0.3])" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": 20, 350 | "metadata": {}, 351 | "outputs": [ 352 | { 353 | "name": "stdout", 354 | "output_type": "stream", 355 | "text": [ 356 | "(843, 2)\n" 357 | ] 358 | } 359 | ], 360 | "source": [ 361 | "print((train_df.count(), len(train_df.columns)))" 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": 21, 367 | "metadata": {}, 368 | "outputs": [ 369 | { 370 | "name": "stdout", 371 | "output_type": "stream", 372 | "text": [ 373 | "(389, 2)\n" 374 | ] 375 | } 376 | ], 377 | "source": [ 378 | "print((test_df.count(), len(test_df.columns)))" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": 22, 384 | "metadata": {}, 385 | "outputs": [ 386 | { 387 | "name": "stdout", 388 | "output_type": "stream", 389 | "text": [ 390 | "+-------+-------------------+\n", 391 | "|summary| output|\n", 392 | "+-------+-------------------+\n", 393 | "| count| 843|\n", 394 | "| mean| 0.3981637010676153|\n", 395 | "| stddev|0.03294758324054597|\n", 396 | "| min| 0.311|\n", 397 | "| max| 0.491|\n", 398 | "+-------+-------------------+\n", 399 | "\n" 400 | ] 401 | } 402 | ], 403 | "source": [ 404 | 
"train_df.describe().show()" 405 | ] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "metadata": {}, 410 | "source": [ 411 | "## Build Linear Regression Model " 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": 23, 417 | "metadata": {}, 418 | "outputs": [], 419 | "source": [ 420 | "#Build Linear Regression model \n", 421 | "lin_Reg=LinearRegression(labelCol='output')" 422 | ] 423 | }, 424 | { 425 | "cell_type": "code", 426 | "execution_count": 24, 427 | "metadata": {}, 428 | "outputs": [], 429 | "source": [ 430 | "#fit the linear regression model on training data set \n", 431 | "lr_model=lin_Reg.fit(train_df)" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": 25, 437 | "metadata": {}, 438 | "outputs": [ 439 | { 440 | "data": { 441 | "text/plain": [ 442 | "0.19182762797333336" 443 | ] 444 | }, 445 | "execution_count": 25, 446 | "metadata": {}, 447 | "output_type": "execute_result" 448 | } 449 | ], 450 | "source": [ 451 | "lr_model.intercept" 452 | ] 453 | }, 454 | { 455 | "cell_type": "code", 456 | "execution_count": 26, 457 | "metadata": {}, 458 | "outputs": [ 459 | { 460 | "name": "stdout", 461 | "output_type": "stream", 462 | "text": [ 463 | "[0.000340839517716,4.79714814651e-05,0.000188236319406,-0.683084289192,0.520925094105]\n" 464 | ] 465 | } 466 | ], 467 | "source": [ 468 | "print(lr_model.coefficients)" 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": 27, 474 | "metadata": {}, 475 | "outputs": [], 476 | "source": [ 477 | "training_predictions=lr_model.evaluate(train_df)" 478 | ] 479 | }, 480 | { 481 | "cell_type": "code", 482 | "execution_count": 28, 483 | "metadata": {}, 484 | "outputs": [ 485 | { 486 | "data": { 487 | "text/plain": [ 488 | "0.0001479324335891862" 489 | ] 490 | }, 491 | "execution_count": 28, 492 | "metadata": {}, 493 | "output_type": "execute_result" 494 | } 495 | ], 496 | "source": [ 497 | "training_predictions.meanSquaredError" 498 | ] 499 | }, 500 | 
{ 501 | "cell_type": "code", 502 | "execution_count": 29, 503 | "metadata": {}, 504 | "outputs": [ 505 | { 506 | "data": { 507 | "text/plain": [ 508 | "0.863563127042712" 509 | ] 510 | }, 511 | "execution_count": 29, 512 | "metadata": {}, 513 | "output_type": "execute_result" 514 | } 515 | ], 516 | "source": [ 517 | "training_predictions.r2" 518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": 30, 523 | "metadata": {}, 524 | "outputs": [], 525 | "source": [ 526 | "#evaluate the fitted model on the test data\n", 527 | "test_results=lr_model.evaluate(test_df)" 528 | ] 529 | }, 530 | { 531 | "cell_type": "code", 532 | "execution_count": 31, 533 | "metadata": {}, 534 | "outputs": [ 535 | { 536 | "name": "stdout", 537 | "output_type": "stream", 538 | "text": [ 539 | "+--------------------+\n", 540 | "| residuals|\n", 541 | "+--------------------+\n", 542 | "|-0.01339317627364...|\n", 543 | "|0.005852942292043195|\n", 544 | "|-0.01270570081380...|\n", 545 | "|-0.00719060666608...|\n", 546 | "|-0.00117340799738...|\n", 547 | "|0.003662428381860...|\n", 548 | "|-0.01232056280684...|\n", 549 | "|-0.00201734580289...|\n", 550 | "|0.011105011921430763|\n", 551 | "|-0.00108724755114...|\n", 552 | "+--------------------+\n", 553 | "only showing top 10 rows\n", 554 | "\n" 555 | ] 556 | } 557 | ], 558 | "source": [ 559 | "#view the residuals (label - prediction) on the test data\n", 560 | "test_results.residuals.show(10)" 561 | ] 562 | }, 563 | { 564 | "cell_type": "code", 565 | "execution_count": 32, 566 | "metadata": {}, 567 | "outputs": [ 568 | { 569 | "data": { 570 | "text/plain": [ 571 | "0.8796832901372513" 572 | ] 573 | }, 574 | "execution_count": 32, 575 | "metadata": {}, 576 | "output_type": "execute_result" 577 | } 578 | ], 579 | "source": [ 580 | "#coefficient of determination (R^2) on the test data\n", 581 | "test_results.r2" 582 | ] 583 | }, 584 | { 585 | "cell_type": "code", 586 | "execution_count": 33, 587 | "metadata": {}, 588 | "outputs": [ 589 | { 590 | "data": { 591 
| "text/plain": [ 592 | "0.011751649289716636" 593 | ] 594 | }, 595 | "execution_count": 33, 596 | "metadata": {}, 597 | "output_type": "execute_result" 598 | } 599 | ], 600 | "source": [ 601 | "test_results.rootMeanSquaredError" 602 | ] 603 | }, 604 | { 605 | "cell_type": "code", 606 | "execution_count": 34, 607 | "metadata": {}, 608 | "outputs": [ 609 | { 610 | "data": { 611 | "text/plain": [ 612 | "0.0001381012610284975" 613 | ] 614 | }, 615 | "execution_count": 34, 616 | "metadata": {}, 617 | "output_type": "execute_result" 618 | } 619 | ], 620 | "source": [ 621 | "test_results.meanSquaredError" 622 | ] 623 | }, 624 | { 625 | "cell_type": "code", 626 | "execution_count": null, 627 | "metadata": {}, 628 | "outputs": [], 629 | "source": [] 630 | } 631 | ], 632 | "metadata": { 633 | "kernelspec": { 634 | "display_name": "Python 3", 635 | "language": "python", 636 | "name": "python3" 637 | }, 638 | "language_info": { 639 | "codemirror_mode": { 640 | "name": "ipython", 641 | "version": 3 642 | }, 643 | "file_extension": ".py", 644 | "mimetype": "text/x-python", 645 | "name": "python", 646 | "nbconvert_exporter": "python", 647 | "pygments_lexer": "ipython3", 648 | "version": "3.6.3" 649 | } 650 | }, 651 | "nbformat": 4, 652 | "nbformat_minor": 2 653 | } 654 | -------------------------------------------------------------------------------- /chapter_4_Linear_Regression/Linear_regression_dataset.csv: -------------------------------------------------------------------------------- 1 | var_1,var_2,var_3,var_4,var_5,output 2 | 734,688,81,0.328,0.259,0.418 3 | 700,600,94,0.32,0.247,0.389 4 | 712,705,93,0.311,0.247,0.417 5 | 734,806,69,0.315,0.26,0.415 6 | 613,759,61,0.302,0.24,0.378 7 | 748,676,85,0.318,0.255,0.422 8 | 669,588,97,0.315,0.251,0.411 9 | 667,845,68,0.324,0.251,0.381 10 | 758,890,64,0.33,0.274,0.436 11 | 726,670,88,0.335,0.268,0.422 12 | 583,794,55,0.302,0.236,0.371 13 | 676,746,72,0.317,0.265,0.4 14 | 767,699,89,0.332,0.274,0.433 15 | 
637,597,86,0.317,0.252,0.374 16 | 609,724,69,0.308,0.244,0.382 17 | 776,733,83,0.325,0.259,0.437 18 | 701,832,66,0.325,0.26,0.39 19 | 650,709,74,0.316,0.249,0.386 20 | 804,668,95,0.337,0.265,0.453 21 | 713,614,94,0.31,0.238,0.404 22 | 684,680,81,0.317,0.255,0.4 23 | 651,674,79,0.304,0.243,0.395 24 | 651,710,76,0.319,0.247,0.38 25 | 619,651,75,0.296,0.234,0.369 26 | 718,649,94,0.327,0.269,0.397 27 | 765,648,88,0.338,0.271,0.421 28 | 697,577,90,0.317,0.24,0.394 29 | 808,707,93,0.334,0.273,0.446 30 | 716,784,73,0.309,0.245,0.407 31 | 731,594,98,0.322,0.261,0.428 32 | 731,662,94,0.322,0.25,0.413 33 | 641,605,89,0.308,0.243,0.387 34 | 708,860,69,0.316,0.257,0.413 35 | 875,737,90,0.349,0.28,0.461 36 | 654,756,71,0.314,0.256,0.401 37 | 654,706,79,0.319,0.252,0.388 38 | 735,720,79,0.326,0.256,0.408 39 | 704,760,80,0.317,0.25,0.396 40 | 735,774,73,0.329,0.258,0.41 41 | 787,711,95,0.34,0.277,0.434 42 | 625,702,72,0.318,0.247,0.388 43 | 615,796,56,0.311,0.258,0.374 44 | 730,762,71,0.329,0.275,0.415 45 | 667,633,86,0.313,0.253,0.402 46 | 644,612,82,0.322,0.257,0.375 47 | 721,638,96,0.325,0.261,0.425 48 | 619,804,63,0.306,0.247,0.36 49 | 718,742,77,0.335,0.264,0.391 50 | 867,657,97,0.343,0.263,0.444 51 | 645,679,74,0.311,0.244,0.369 52 | 713,529,102,0.323,0.253,0.395 53 | 610,712,72,0.309,0.244,0.368 54 | 593,611,71,0.305,0.237,0.349 55 | 556,675,67,0.292,0.233,0.348 56 | 570,578,86,0.303,0.242,0.368 57 | 762,692,90,0.341,0.273,0.425 58 | 707,614,91,0.322,0.244,0.402 59 | 855,677,96,0.34,0.283,0.46 60 | 743,761,81,0.317,0.249,0.413 61 | 624,643,80,0.309,0.242,0.383 62 | 713,836,65,0.325,0.25,0.416 63 | 738,629,91,0.339,0.258,0.401 64 | 613,785,66,0.316,0.259,0.386 65 | 818,744,89,0.339,0.268,0.451 66 | 685,767,75,0.32,0.257,0.401 67 | 752,704,88,0.332,0.268,0.42 68 | 790,685,91,0.338,0.272,0.436 69 | 646,752,69,0.322,0.248,0.378 70 | 770,717,83,0.336,0.263,0.425 71 | 751,743,81,0.335,0.268,0.415 72 | 719,717,80,0.321,0.254,0.403 73 | 611,729,76,0.303,0.247,0.362 74 | 
676,845,67,0.331,0.274,0.399 75 | 681,702,80,0.311,0.248,0.39 76 | 667,692,80,0.322,0.252,0.379 77 | 750,804,77,0.335,0.262,0.424 78 | 781,671,94,0.341,0.273,0.422 79 | 656,652,79,0.314,0.249,0.383 80 | 859,693,95,0.35,0.267,0.436 81 | 663,626,81,0.324,0.256,0.378 82 | 772,640,97,0.332,0.26,0.413 83 | 587,866,57,0.304,0.242,0.373 84 | 665,581,90,0.317,0.246,0.371 85 | 513,698,61,0.298,0.236,0.339 86 | 697,583,92,0.321,0.257,0.408 87 | 736,641,86,0.332,0.263,0.402 88 | 802,649,96,0.333,0.247,0.403 89 | 787,687,90,0.338,0.276,0.419 90 | 755,728,85,0.312,0.248,0.454 91 | 655,742,69,0.318,0.25,0.39 92 | 720,782,70,0.324,0.253,0.418 93 | 735,641,86,0.339,0.263,0.405 94 | 741,876,64,0.332,0.268,0.415 95 | 872,736,95,0.352,0.27,0.454 96 | 707,672,83,0.332,0.255,0.407 97 | 724,732,79,0.329,0.258,0.411 98 | 673,723,78,0.318,0.247,0.394 99 | 773,865,65,0.339,0.264,0.417 100 | 804,715,92,0.343,0.261,0.441 101 | 743,745,86,0.331,0.26,0.416 102 | 772,766,87,0.34,0.268,0.416 103 | 643,770,74,0.319,0.26,0.4 104 | 686,842,65,0.318,0.259,0.405 105 | 883,761,97,0.35,0.285,0.441 106 | 780,611,95,0.346,0.27,0.412 107 | 785,818,80,0.341,0.263,0.426 108 | 817,765,87,0.345,0.274,0.429 109 | 671,757,70,0.335,0.27,0.394 110 | 915,753,103,0.362,0.283,0.478 111 | 759,761,75,0.328,0.262,0.397 112 | 820,709,93,0.334,0.258,0.447 113 | 636,768,62,0.318,0.252,0.387 114 | 638,769,75,0.321,0.242,0.381 115 | 640,692,85,0.314,0.258,0.402 116 | 657,611,88,0.309,0.257,0.389 117 | 730,640,91,0.332,0.263,0.415 118 | 803,754,84,0.343,0.263,0.439 119 | 784,740,87,0.32,0.26,0.445 120 | 798,771,75,0.333,0.266,0.44 121 | 710,874,59,0.337,0.258,0.406 122 | 720,706,82,0.327,0.251,0.415 123 | 753,778,72,0.345,0.27,0.408 124 | 782,869,68,0.333,0.267,0.429 125 | 845,694,95,0.358,0.28,0.447 126 | 855,671,97,0.354,0.278,0.443 127 | 811,729,89,0.332,0.263,0.448 128 | 704,800,74,0.321,0.247,0.408 129 | 805,761,81,0.339,0.262,0.424 130 | 747,822,74,0.336,0.263,0.415 131 | 821,857,74,0.34,0.271,0.444 132 | 
770,767,84,0.326,0.254,0.433 133 | 712,743,86,0.323,0.263,0.415 134 | 691,781,75,0.32,0.269,0.397 135 | 765,697,100,0.33,0.268,0.413 136 | 700,648,84,0.333,0.264,0.399 137 | 750,689,90,0.325,0.253,0.431 138 | 829,745,88,0.34,0.279,0.408 139 | 799,715,89,0.34,0.266,0.42 140 | 789,727,89,0.342,0.271,0.427 141 | 646,690,75,0.318,0.242,0.369 142 | 799,680,92,0.332,0.255,0.438 143 | 735,884,67,0.32,0.258,0.403 144 | 637,764,63,0.317,0.25,0.39 145 | 671,811,61,0.318,0.265,0.389 146 | 640,759,72,0.321,0.262,0.382 147 | 779,725,86,0.35,0.281,0.433 148 | 774,671,97,0.34,0.26,0.422 149 | 901,967,79,0.354,0.283,0.462 150 | 714,610,86,0.331,0.264,0.399 151 | 641,825,59,0.323,0.251,0.373 152 | 712,732,90,0.321,0.25,0.413 153 | 810,733,84,0.339,0.275,0.435 154 | 756,868,69,0.333,0.272,0.412 155 | 867,657,96,0.362,0.279,0.444 156 | 752,690,85,0.333,0.271,0.422 157 | 693,839,72,0.318,0.246,0.404 158 | 783,853,72,0.335,0.267,0.436 159 | 811,704,96,0.343,0.268,0.428 160 | 860,758,90,0.354,0.28,0.437 161 | 887,797,88,0.345,0.287,0.458 162 | 790,891,71,0.336,0.267,0.448 163 | 723,813,73,0.33,0.26,0.412 164 | 706,778,69,0.322,0.261,0.388 165 | 822,731,94,0.345,0.284,0.417 166 | 735,727,82,0.337,0.275,0.406 167 | 801,776,83,0.329,0.262,0.456 168 | 718,725,79,0.33,0.264,0.391 169 | 804,750,88,0.342,0.275,0.432 170 | 968,777,94,0.366,0.29,0.463 171 | 741,758,76,0.338,0.256,0.407 172 | 892,821,89,0.354,0.274,0.458 173 | 724,846,68,0.325,0.263,0.411 174 | 741,666,89,0.322,0.251,0.411 175 | 794,813,88,0.337,0.287,0.425 176 | 683,720,71,0.322,0.254,0.387 177 | 725,829,78,0.337,0.274,0.405 178 | 782,944,66,0.336,0.268,0.433 179 | 816,844,75,0.328,0.263,0.426 180 | 753,699,83,0.327,0.259,0.419 181 | 673,783,73,0.325,0.256,0.39 182 | 773,788,76,0.331,0.267,0.424 183 | 849,805,79,0.337,0.27,0.455 184 | 768,899,70,0.339,0.277,0.424 185 | 820,825,86,0.351,0.269,0.435 186 | 716,834,66,0.319,0.268,0.422 187 | 868,794,90,0.342,0.28,0.464 188 | 749,801,80,0.336,0.257,0.432 189 | 
870,782,78,0.349,0.28,0.457 190 | 813,812,76,0.341,0.27,0.433 191 | 822,675,95,0.329,0.274,0.449 192 | 758,772,78,0.331,0.264,0.435 193 | 735,719,82,0.332,0.255,0.409 194 | 757,971,62,0.332,0.271,0.411 195 | 766,732,89,0.334,0.274,0.425 196 | 820,751,88,0.348,0.276,0.432 197 | 730,833,75,0.327,0.258,0.42 198 | 801,683,96,0.347,0.287,0.425 199 | 834,731,97,0.334,0.264,0.445 200 | 930,767,97,0.363,0.285,0.461 201 | 771,727,93,0.34,0.26,0.412 202 | 865,812,85,0.347,0.267,0.447 203 | 691,797,67,0.327,0.263,0.397 204 | 731,679,88,0.332,0.263,0.416 205 | 756,792,78,0.325,0.272,0.424 206 | 746,790,76,0.324,0.259,0.422 207 | 781,762,83,0.337,0.269,0.431 208 | 689,856,61,0.314,0.255,0.42 209 | 835,784,80,0.338,0.278,0.446 210 | 809,754,87,0.348,0.284,0.463 211 | 746,872,71,0.338,0.262,0.418 212 | 696,856,77,0.332,0.256,0.421 213 | 769,674,90,0.333,0.265,0.435 214 | 729,800,74,0.327,0.269,0.434 215 | 910,805,95,0.357,0.281,0.454 216 | 703,714,79,0.324,0.27,0.44 217 | 741,645,99,0.322,0.262,0.425 218 | 820,889,73,0.339,0.261,0.446 219 | 790,642,93,0.334,0.271,0.453 220 | 740,862,67,0.333,0.267,0.411 221 | 723,787,71,0.321,0.272,0.428 222 | 717,732,83,0.339,0.272,0.409 223 | 693,609,89,0.322,0.256,0.408 224 | 701,935,56,0.32,0.263,0.396 225 | 761,643,95,0.325,0.27,0.409 226 | 685,755,71,0.326,0.253,0.395 227 | 726,697,81,0.331,0.259,0.423 228 | 688,662,83,0.323,0.259,0.391 229 | 722,648,83,0.322,0.258,0.416 230 | 886,789,95,0.355,0.276,0.45 231 | 772,658,88,0.33,0.262,0.407 232 | 807,726,88,0.348,0.27,0.423 233 | 680,769,67,0.322,0.259,0.4 234 | 684,726,82,0.333,0.257,0.391 235 | 699,751,69,0.317,0.256,0.391 236 | 649,745,75,0.319,0.261,0.396 237 | 805,634,100,0.339,0.27,0.423 238 | 750,936,67,0.329,0.274,0.425 239 | 865,858,79,0.329,0.267,0.468 240 | 775,705,80,0.331,0.265,0.407 241 | 639,673,81,0.322,0.252,0.386 242 | 836,734,92,0.341,0.282,0.429 243 | 615,899,51,0.31,0.253,0.393 244 | 803,668,96,0.343,0.27,0.434 245 | 842,830,78,0.345,0.281,0.432 246 | 
949,768,98,0.36,0.282,0.472 247 | 789,665,89,0.328,0.268,0.458 248 | 865,831,83,0.333,0.268,0.457 249 | 750,907,76,0.331,0.25,0.418 250 | 858,857,80,0.351,0.276,0.444 251 | 833,923,68,0.345,0.275,0.455 252 | 827,844,72,0.337,0.272,0.449 253 | 718,700,83,0.329,0.264,0.406 254 | 803,698,92,0.342,0.267,0.436 255 | 720,905,58,0.322,0.259,0.397 256 | 761,684,93,0.332,0.262,0.423 257 | 634,757,67,0.321,0.248,0.387 258 | 780,715,92,0.332,0.266,0.431 259 | 635,769,67,0.313,0.249,0.392 260 | 684,731,71,0.317,0.249,0.409 261 | 897,808,101,0.353,0.268,0.458 262 | 793,742,91,0.343,0.27,0.433 263 | 840,781,86,0.345,0.267,0.443 264 | 680,744,72,0.321,0.26,0.401 265 | 768,705,87,0.342,0.273,0.414 266 | 698,823,63,0.331,0.27,0.396 267 | 850,770,91,0.357,0.27,0.438 268 | 855,659,105,0.344,0.278,0.46 269 | 714,842,70,0.32,0.258,0.405 270 | 860,794,89,0.329,0.266,0.457 271 | 719,823,67,0.328,0.26,0.403 272 | 736,743,77,0.33,0.268,0.413 273 | 717,685,84,0.33,0.263,0.417 274 | 907,740,101,0.349,0.284,0.475 275 | 743,820,71,0.323,0.268,0.405 276 | 961,809,95,0.36,0.289,0.491 277 | 724,683,88,0.323,0.259,0.416 278 | 791,715,86,0.331,0.263,0.446 279 | 694,886,69,0.318,0.245,0.395 280 | 699,778,68,0.316,0.254,0.401 281 | 853,892,74,0.344,0.267,0.445 282 | 591,928,43,0.3,0.24,0.375 283 | 751,692,91,0.333,0.266,0.421 284 | 805,677,87,0.336,0.263,0.431 285 | 836,867,83,0.336,0.274,0.427 286 | 574,556,85,0.303,0.243,0.368 287 | 714,873,68,0.329,0.256,0.419 288 | 801,758,90,0.341,0.277,0.431 289 | 711,716,83,0.326,0.258,0.401 290 | 642,754,66,0.314,0.247,0.374 291 | 877,716,101,0.356,0.271,0.453 292 | 768,643,96,0.327,0.254,0.417 293 | 791,697,86,0.343,0.261,0.419 294 | 753,801,75,0.338,0.267,0.42 295 | 678,831,64,0.333,0.261,0.388 296 | 795,637,93,0.344,0.271,0.41 297 | 755,638,100,0.338,0.264,0.425 298 | 876,796,85,0.35,0.279,0.454 299 | 715,852,63,0.32,0.265,0.404 300 | 826,969,71,0.33,0.266,0.454 301 | 894,826,86,0.349,0.279,0.455 302 | 851,644,99,0.341,0.282,0.433 303 | 
819,674,98,0.346,0.267,0.423 304 | 708,565,101,0.331,0.26,0.409 305 | 667,773,67,0.309,0.246,0.403 306 | 859,665,93,0.345,0.277,0.444 307 | 706,759,67,0.321,0.246,0.413 308 | 856,798,81,0.338,0.268,0.449 309 | 709,774,78,0.33,0.253,0.408 310 | 739,837,74,0.321,0.249,0.412 311 | 778,898,73,0.337,0.274,0.423 312 | 575,864,55,0.3,0.248,0.379 313 | 699,763,79,0.337,0.261,0.403 314 | 749,695,84,0.338,0.262,0.417 315 | 737,891,62,0.323,0.256,0.398 316 | 713,643,92,0.32,0.264,0.409 317 | 627,821,56,0.32,0.253,0.39 318 | 768,712,94,0.332,0.272,0.437 319 | 735,718,83,0.334,0.261,0.418 320 | 690,703,75,0.322,0.256,0.395 321 | 897,697,103,0.354,0.275,0.455 322 | 800,654,103,0.339,0.261,0.432 323 | 710,724,80,0.339,0.259,0.422 324 | 641,730,72,0.319,0.244,0.381 325 | 662,815,66,0.321,0.253,0.381 326 | 814,699,93,0.35,0.275,0.419 327 | 783,616,95,0.344,0.267,0.442 328 | 787,648,97,0.338,0.268,0.425 329 | 673,918,55,0.314,0.253,0.39 330 | 843,882,72,0.338,0.269,0.455 331 | 813,828,78,0.327,0.261,0.43 332 | 691,730,75,0.327,0.261,0.405 333 | 818,677,92,0.341,0.267,0.442 334 | 729,643,88,0.324,0.26,0.412 335 | 687,829,63,0.319,0.248,0.38 336 | 772,745,82,0.334,0.266,0.439 337 | 777,701,88,0.336,0.261,0.43 338 | 798,795,83,0.334,0.268,0.451 339 | 735,850,66,0.324,0.262,0.419 340 | 897,821,91,0.35,0.278,0.458 341 | 923,906,73,0.354,0.292,0.483 342 | 724,876,66,0.32,0.26,0.409 343 | 742,744,76,0.326,0.264,0.423 344 | 847,769,93,0.347,0.271,0.451 345 | 729,858,65,0.318,0.266,0.409 346 | 758,744,86,0.323,0.255,0.425 347 | 740,806,68,0.319,0.251,0.426 348 | 771,766,85,0.337,0.272,0.433 349 | 670,812,68,0.319,0.253,0.396 350 | 642,713,82,0.323,0.249,0.387 351 | 804,713,95,0.334,0.267,0.435 352 | 884,645,102,0.345,0.264,0.439 353 | 746,719,86,0.329,0.26,0.414 354 | 657,858,62,0.313,0.247,0.393 355 | 789,812,79,0.336,0.252,0.399 356 | 927,627,116,0.36,0.288,0.445 357 | 799,748,90,0.342,0.266,0.46 358 | 814,684,93,0.339,0.27,0.441 359 | 672,887,62,0.32,0.258,0.388 360 | 
890,968,73,0.344,0.275,0.471 361 | 767,753,80,0.325,0.263,0.43 362 | 864,869,82,0.352,0.28,0.472 363 | 792,754,85,0.333,0.265,0.429 364 | 810,714,95,0.346,0.271,0.429 365 | 794,913,74,0.341,0.272,0.435 366 | 792,745,85,0.341,0.267,0.423 367 | 764,904,65,0.335,0.256,0.411 368 | 978,839,95,0.356,0.286,0.47 369 | 825,765,85,0.343,0.274,0.447 370 | 950,816,90,0.367,0.288,0.47 371 | 968,897,82,0.362,0.294,0.455 372 | 823,827,79,0.343,0.275,0.438 373 | 731,797,79,0.331,0.262,0.409 374 | 938,944,72,0.361,0.278,0.477 375 | 879,930,77,0.348,0.288,0.425 376 | 798,729,86,0.341,0.257,0.431 377 | 740,826,73,0.325,0.246,0.403 378 | 748,880,69,0.337,0.27,0.407 379 | 738,902,67,0.326,0.266,0.432 380 | 807,738,94,0.346,0.263,0.43 381 | 871,814,87,0.354,0.277,0.45 382 | 947,813,91,0.36,0.27,0.458 383 | 708,830,65,0.329,0.251,0.4 384 | 793,888,69,0.339,0.267,0.424 385 | 752,815,76,0.33,0.254,0.402 386 | 907,780,91,0.361,0.269,0.442 387 | 925,747,97,0.362,0.278,0.472 388 | 887,771,95,0.356,0.27,0.455 389 | 733,842,69,0.329,0.257,0.399 390 | 848,974,71,0.352,0.283,0.446 391 | 861,908,83,0.341,0.275,0.469 392 | 711,826,70,0.322,0.256,0.395 393 | 908,676,100,0.347,0.277,0.459 394 | 840,661,103,0.341,0.266,0.436 395 | 851,815,78,0.353,0.279,0.447 396 | 836,718,94,0.35,0.278,0.448 397 | 747,920,67,0.329,0.257,0.42 398 | 777,870,75,0.337,0.277,0.429 399 | 865,711,96,0.341,0.272,0.451 400 | 1009,860,97,0.373,0.289,0.467 401 | 906,1028,72,0.348,0.288,0.472 402 | 747,882,69,0.326,0.261,0.443 403 | 691,852,64,0.325,0.263,0.395 404 | 823,675,97,0.355,0.267,0.42 405 | 856,921,64,0.348,0.282,0.433 406 | 793,787,77,0.339,0.266,0.42 407 | 815,886,74,0.353,0.273,0.426 408 | 686,845,63,0.328,0.264,0.384 409 | 718,853,68,0.323,0.265,0.427 410 | 853,711,97,0.363,0.279,0.434 411 | 900,731,98,0.366,0.282,0.453 412 | 893,846,87,0.355,0.259,0.446 413 | 841,846,77,0.351,0.275,0.431 414 | 775,782,78,0.334,0.259,0.419 415 | 710,781,74,0.332,0.252,0.393 416 | 859,905,79,0.343,0.269,0.455 417 | 
872,831,86,0.356,0.271,0.434 418 | 809,838,75,0.338,0.262,0.426 419 | 772,913,69,0.343,0.274,0.411 420 | 945,859,95,0.361,0.293,0.479 421 | 883,862,84,0.352,0.28,0.457 422 | 787,783,85,0.335,0.272,0.415 423 | 665,812,65,0.314,0.246,0.393 424 | 826,581,106,0.342,0.272,0.453 425 | 817,785,79,0.347,0.273,0.447 426 | 876,729,92,0.348,0.28,0.463 427 | 831,792,90,0.337,0.264,0.433 428 | 861,931,80,0.339,0.271,0.444 429 | 750,760,77,0.337,0.262,0.402 430 | 850,779,89,0.347,0.272,0.448 431 | 826,855,77,0.347,0.291,0.461 432 | 722,863,65,0.323,0.264,0.415 433 | 667,923,54,0.317,0.248,0.373 434 | 874,620,102,0.356,0.28,0.436 435 | 714,899,72,0.324,0.263,0.399 436 | 669,678,83,0.31,0.252,0.387 437 | 707,812,74,0.33,0.26,0.396 438 | 734,818,70,0.328,0.266,0.389 439 | 644,783,65,0.31,0.249,0.394 440 | 706,645,88,0.33,0.259,0.394 441 | 965,656,114,0.364,0.288,0.46 442 | 804,866,74,0.338,0.257,0.397 443 | 713,808,75,0.326,0.264,0.395 444 | 650,718,69,0.311,0.254,0.374 445 | 749,635,98,0.33,0.253,0.409 446 | 859,855,76,0.345,0.276,0.468 447 | 845,739,89,0.353,0.274,0.421 448 | 810,782,83,0.341,0.258,0.441 449 | 620,751,63,0.321,0.261,0.385 450 | 940,871,88,0.357,0.289,0.462 451 | 816,768,88,0.34,0.266,0.448 452 | 829,794,84,0.346,0.272,0.416 453 | 791,581,101,0.343,0.27,0.426 454 | 812,681,98,0.341,0.268,0.429 455 | 851,857,78,0.352,0.291,0.463 456 | 687,759,68,0.321,0.263,0.396 457 | 779,833,80,0.341,0.273,0.417 458 | 651,764,76,0.321,0.253,0.389 459 | 868,815,86,0.358,0.286,0.467 460 | 923,908,83,0.357,0.288,0.478 461 | 784,790,79,0.332,0.258,0.415 462 | 740,669,92,0.346,0.259,0.395 463 | 777,660,84,0.344,0.259,0.403 464 | 747,820,67,0.333,0.264,0.407 465 | 742,645,88,0.33,0.268,0.418 466 | 681,742,78,0.325,0.26,0.398 467 | 772,861,68,0.333,0.27,0.409 468 | 691,740,78,0.316,0.258,0.425 469 | 777,709,88,0.332,0.262,0.405 470 | 891,688,96,0.362,0.287,0.436 471 | 764,946,65,0.339,0.26,0.423 472 | 668,840,68,0.322,0.255,0.385 473 | 725,760,79,0.329,0.262,0.404 474 | 
795,891,76,0.342,0.271,0.407 475 | 925,833,90,0.355,0.28,0.485 476 | 784,793,90,0.337,0.258,0.414 477 | 689,708,73,0.324,0.255,0.396 478 | 807,823,77,0.334,0.274,0.438 479 | 654,694,76,0.31,0.244,0.389 480 | 773,648,96,0.333,0.27,0.432 481 | 949,903,88,0.35,0.274,0.472 482 | 928,921,85,0.359,0.283,0.457 483 | 762,943,70,0.339,0.276,0.431 484 | 772,771,76,0.32,0.251,0.401 485 | 898,794,85,0.36,0.281,0.447 486 | 778,773,81,0.331,0.256,0.422 487 | 952,769,99,0.369,0.293,0.475 488 | 961,964,83,0.355,0.287,0.472 489 | 783,1103,53,0.323,0.256,0.42 490 | 688,703,80,0.329,0.257,0.393 491 | 753,792,82,0.336,0.262,0.397 492 | 746,786,75,0.332,0.267,0.398 493 | 703,652,90,0.316,0.252,0.384 494 | 894,899,80,0.353,0.279,0.441 495 | 877,900,78,0.357,0.288,0.425 496 | 741,668,88,0.327,0.262,0.406 497 | 746,779,71,0.324,0.27,0.412 498 | 871,787,92,0.36,0.288,0.436 499 | 861,900,78,0.344,0.265,0.452 500 | 650,790,67,0.325,0.256,0.387 501 | 776,833,73,0.329,0.266,0.407 502 | 771,682,91,0.338,0.265,0.402 503 | 993,895,85,0.366,0.287,0.484 504 | 752,862,68,0.331,0.253,0.388 505 | 759,706,88,0.33,0.267,0.407 506 | 928,799,90,0.358,0.284,0.469 507 | 766,809,74,0.331,0.259,0.42 508 | 767,559,104,0.331,0.262,0.408 509 | 786,745,85,0.346,0.267,0.413 510 | 686,698,80,0.33,0.264,0.395 511 | 684,770,71,0.331,0.26,0.38 512 | 738,739,84,0.325,0.27,0.414 513 | 776,664,94,0.338,0.265,0.411 514 | 722,785,73,0.324,0.264,0.396 515 | 790,813,76,0.335,0.275,0.409 516 | 758,967,67,0.323,0.273,0.422 517 | 899,837,85,0.362,0.275,0.434 518 | 581,724,64,0.314,0.248,0.346 519 | 716,630,85,0.33,0.267,0.409 520 | 675,694,84,0.32,0.263,0.397 521 | 675,662,81,0.321,0.261,0.383 522 | 733,792,69,0.328,0.258,0.378 523 | 693,830,71,0.327,0.264,0.385 524 | 732,682,94,0.326,0.257,0.386 525 | 672,744,59,0.305,0.248,0.39 526 | 821,761,88,0.353,0.279,0.435 527 | 715,846,68,0.33,0.254,0.394 528 | 877,740,97,0.351,0.274,0.426 529 | 707,806,75,0.335,0.267,0.393 530 | 679,772,61,0.312,0.252,0.389 531 | 
734,731,82,0.339,0.26,0.406 532 | 808,636,103,0.34,0.276,0.427 533 | 758,744,87,0.341,0.272,0.395 534 | 835,751,86,0.329,0.267,0.431 535 | 847,742,95,0.35,0.279,0.436 536 | 682,569,98,0.316,0.254,0.388 537 | 705,656,89,0.34,0.259,0.398 538 | 599,669,73,0.321,0.246,0.347 539 | 579,671,72,0.301,0.243,0.338 540 | 593,624,78,0.307,0.254,0.364 541 | 738,690,86,0.336,0.261,0.383 542 | 660,609,90,0.328,0.26,0.382 543 | 674,746,76,0.323,0.266,0.383 544 | 791,794,75,0.337,0.256,0.407 545 | 608,668,81,0.313,0.246,0.359 546 | 610,667,72,0.315,0.256,0.364 547 | 548,636,63,0.313,0.248,0.339 548 | 740,604,92,0.33,0.268,0.375 549 | 747,653,90,0.341,0.277,0.391 550 | 648,581,87,0.313,0.252,0.37 551 | 599,653,72,0.31,0.235,0.342 552 | 733,746,76,0.328,0.261,0.406 553 | 745,672,96,0.346,0.258,0.386 554 | 686,717,70,0.32,0.253,0.377 555 | 693,595,96,0.324,0.255,0.381 556 | 617,636,82,0.313,0.255,0.386 557 | 679,799,64,0.323,0.263,0.402 558 | 574,647,72,0.302,0.244,0.355 559 | 631,604,83,0.323,0.262,0.375 560 | 682,753,77,0.321,0.25,0.393 561 | 780,682,96,0.333,0.263,0.414 562 | 749,644,94,0.328,0.258,0.393 563 | 686,796,67,0.319,0.254,0.401 564 | 731,712,84,0.34,0.269,0.401 565 | 653,649,81,0.314,0.255,0.374 566 | 695,734,77,0.309,0.253,0.39 567 | 758,681,87,0.336,0.262,0.391 568 | 689,691,74,0.32,0.258,0.403 569 | 576,759,57,0.313,0.254,0.35 570 | 817,794,84,0.333,0.247,0.416 571 | 605,717,65,0.309,0.244,0.347 572 | 727,722,82,0.328,0.264,0.394 573 | 665,565,93,0.326,0.253,0.359 574 | 799,744,83,0.336,0.271,0.396 575 | 776,652,95,0.344,0.28,0.42 576 | 579,655,71,0.308,0.246,0.357 577 | 640,646,77,0.317,0.244,0.365 578 | 674,777,71,0.316,0.256,0.387 579 | 760,776,84,0.331,0.248,0.389 580 | 629,680,78,0.303,0.241,0.358 581 | 768,632,98,0.338,0.263,0.398 582 | 636,646,84,0.31,0.244,0.362 583 | 702,674,83,0.328,0.255,0.383 584 | 649,697,75,0.309,0.246,0.381 585 | 651,648,84,0.322,0.255,0.357 586 | 829,814,85,0.341,0.27,0.424 587 | 684,622,91,0.322,0.257,0.4 588 | 
682,821,65,0.311,0.25,0.396 589 | 669,698,76,0.33,0.245,0.37 590 | 699,664,88,0.344,0.272,0.395 591 | 690,706,80,0.329,0.26,0.391 592 | 690,774,77,0.314,0.263,0.392 593 | 682,633,94,0.32,0.258,0.379 594 | 693,597,91,0.325,0.265,0.399 595 | 732,737,77,0.324,0.267,0.391 596 | 750,754,79,0.337,0.259,0.409 597 | 573,656,75,0.313,0.242,0.345 598 | 707,709,75,0.328,0.267,0.395 599 | 728,685,86,0.328,0.262,0.382 600 | 732,760,74,0.32,0.256,0.384 601 | 666,729,74,0.324,0.265,0.385 602 | 662,598,85,0.322,0.25,0.37 603 | 775,613,91,0.323,0.256,0.408 604 | 603,749,67,0.3,0.241,0.366 605 | 733,570,103,0.336,0.254,0.391 606 | 646,729,77,0.327,0.255,0.363 607 | 733,619,95,0.33,0.259,0.405 608 | 673,673,75,0.32,0.257,0.38 609 | 640,680,77,0.333,0.259,0.373 610 | 719,710,85,0.323,0.262,0.396 611 | 599,698,70,0.32,0.256,0.358 612 | 676,696,83,0.331,0.259,0.376 613 | 767,661,86,0.328,0.265,0.419 614 | 584,680,63,0.298,0.234,0.35 615 | 708,686,87,0.326,0.252,0.379 616 | 774,735,83,0.351,0.277,0.403 617 | 669,578,91,0.311,0.256,0.386 618 | 702,623,93,0.319,0.261,0.387 619 | 693,750,69,0.328,0.271,0.383 620 | 632,691,75,0.309,0.247,0.37 621 | 604,654,73,0.31,0.245,0.365 622 | 617,816,59,0.318,0.242,0.351 623 | 647,669,86,0.306,0.239,0.345 624 | 690,635,92,0.329,0.261,0.373 625 | 554,536,77,0.306,0.24,0.339 626 | 707,679,81,0.318,0.259,0.382 627 | 740,738,80,0.334,0.276,0.402 628 | 632,630,81,0.319,0.247,0.361 629 | 683,595,87,0.311,0.246,0.385 630 | 698,792,74,0.331,0.269,0.391 631 | 712,576,99,0.331,0.261,0.381 632 | 629,735,67,0.314,0.243,0.364 633 | 637,680,74,0.311,0.241,0.359 634 | 642,626,89,0.319,0.251,0.369 635 | 694,728,73,0.32,0.257,0.384 636 | 699,600,92,0.316,0.25,0.39 637 | 632,608,86,0.321,0.258,0.363 638 | 695,714,83,0.326,0.263,0.394 639 | 731,651,89,0.323,0.26,0.398 640 | 555,741,54,0.298,0.242,0.348 641 | 550,789,54,0.305,0.238,0.359 642 | 813,689,89,0.357,0.283,0.42 643 | 714,771,75,0.321,0.261,0.385 644 | 660,694,77,0.31,0.261,0.383 645 | 631,757,71,0.303,0.244,0.37 
646 | 641,596,87,0.309,0.246,0.368 647 | 666,731,78,0.314,0.261,0.387 648 | 703,658,88,0.324,0.25,0.378 649 | 617,631,82,0.306,0.244,0.351 650 | 704,648,84,0.321,0.259,0.391 651 | 628,544,94,0.305,0.248,0.352 652 | 682,616,87,0.314,0.257,0.375 653 | 759,672,91,0.34,0.274,0.421 654 | 628,592,81,0.309,0.251,0.373 655 | 703,532,100,0.325,0.256,0.396 656 | 772,748,85,0.333,0.263,0.395 657 | 800,620,104,0.336,0.263,0.399 658 | 597,734,65,0.306,0.239,0.355 659 | 651,616,85,0.317,0.247,0.369 660 | 594,583,83,0.31,0.247,0.351 661 | 664,744,68,0.317,0.257,0.398 662 | 670,626,83,0.318,0.248,0.368 663 | 578,633,76,0.309,0.249,0.337 664 | 637,735,70,0.32,0.252,0.368 665 | 763,680,87,0.332,0.268,0.419 666 | 747,829,69,0.339,0.258,0.403 667 | 729,880,67,0.322,0.258,0.418 668 | 842,825,78,0.352,0.278,0.43 669 | 770,803,75,0.326,0.252,0.401 670 | 720,801,76,0.326,0.264,0.432 671 | 748,746,77,0.319,0.258,0.415 672 | 783,752,84,0.33,0.266,0.427 673 | 742,957,61,0.324,0.263,0.422 674 | 896,735,98,0.349,0.272,0.451 675 | 648,678,76,0.318,0.253,0.373 676 | 715,691,83,0.328,0.262,0.412 677 | 635,675,73,0.309,0.252,0.371 678 | 862,817,91,0.346,0.276,0.428 679 | 786,806,85,0.328,0.261,0.43 680 | 741,720,91,0.328,0.265,0.401 681 | 823,698,92,0.339,0.268,0.434 682 | 788,758,89,0.336,0.262,0.418 683 | 806,789,81,0.333,0.26,0.428 684 | 702,749,80,0.327,0.254,0.41 685 | 723,744,80,0.33,0.264,0.403 686 | 668,763,65,0.332,0.26,0.378 687 | 760,801,78,0.335,0.272,0.428 688 | 783,669,90,0.324,0.26,0.43 689 | 798,693,95,0.34,0.263,0.378 690 | 823,849,75,0.333,0.266,0.43 691 | 845,655,96,0.336,0.269,0.446 692 | 615,719,72,0.319,0.25,0.381 693 | 708,760,73,0.327,0.258,0.395 694 | 794,696,95,0.346,0.271,0.415 695 | 786,684,92,0.338,0.255,0.404 696 | 680,781,70,0.318,0.256,0.398 697 | 644,699,72,0.31,0.247,0.363 698 | 732,717,86,0.325,0.254,0.387 699 | 831,841,84,0.337,0.284,0.43 700 | 798,714,87,0.338,0.263,0.424 701 | 654,569,96,0.322,0.255,0.381 702 | 654,673,76,0.313,0.252,0.39 703 | 
638,679,73,0.313,0.251,0.37 704 | 667,734,77,0.321,0.255,0.385 705 | 741,839,71,0.325,0.261,0.428 706 | 637,688,78,0.322,0.254,0.379 707 | 783,578,108,0.339,0.263,0.401 708 | 797,738,90,0.347,0.271,0.43 709 | 731,760,76,0.322,0.252,0.39 710 | 739,713,86,0.327,0.253,0.4 711 | 663,700,64,0.321,0.25,0.374 712 | 656,723,74,0.321,0.261,0.388 713 | 718,835,67,0.326,0.253,0.399 714 | 698,618,83,0.322,0.253,0.375 715 | 601,611,79,0.309,0.236,0.327 716 | 771,743,87,0.331,0.267,0.428 717 | 809,733,86,0.329,0.269,0.427 718 | 632,781,66,0.315,0.246,0.363 719 | 818,764,83,0.336,0.263,0.43 720 | 800,720,81,0.347,0.282,0.429 721 | 732,703,90,0.333,0.251,0.386 722 | 686,729,77,0.324,0.254,0.39 723 | 736,720,85,0.315,0.253,0.392 724 | 677,666,89,0.327,0.255,0.376 725 | 729,861,60,0.324,0.265,0.385 726 | 729,688,84,0.318,0.253,0.424 727 | 706,691,83,0.319,0.261,0.388 728 | 687,639,91,0.313,0.252,0.401 729 | 682,579,95,0.328,0.261,0.382 730 | 690,802,71,0.319,0.263,0.379 731 | 705,782,77,0.326,0.264,0.407 732 | 633,636,84,0.31,0.247,0.375 733 | 695,568,98,0.323,0.257,0.385 734 | 839,660,97,0.344,0.267,0.425 735 | 757,787,77,0.325,0.264,0.401 736 | 667,673,75,0.312,0.245,0.383 737 | 568,708,57,0.311,0.247,0.347 738 | 650,622,83,0.32,0.255,0.368 739 | 719,818,74,0.326,0.255,0.412 740 | 556,674,62,0.299,0.233,0.348 741 | 747,572,101,0.335,0.264,0.379 742 | 617,785,62,0.322,0.253,0.381 743 | 759,588,99,0.331,0.269,0.425 744 | 632,655,80,0.317,0.247,0.361 745 | 681,667,85,0.328,0.252,0.391 746 | 810,764,86,0.341,0.283,0.441 747 | 696,697,81,0.319,0.249,0.381 748 | 762,658,96,0.331,0.26,0.397 749 | 679,736,74,0.314,0.247,0.395 750 | 627,747,70,0.313,0.244,0.356 751 | 761,766,75,0.335,0.265,0.384 752 | 829,643,104,0.342,0.271,0.432 753 | 693,630,80,0.323,0.264,0.371 754 | 673,686,84,0.317,0.268,0.399 755 | 580,600,79,0.306,0.244,0.348 756 | 641,734,67,0.317,0.262,0.37 757 | 673,675,81,0.318,0.265,0.385 758 | 593,585,78,0.312,0.251,0.362 759 | 652,676,90,0.32,0.257,0.369 760 | 
758,679,87,0.339,0.276,0.404 761 | 738,796,77,0.327,0.259,0.404 762 | 720,690,81,0.333,0.266,0.407 763 | 615,567,75,0.31,0.255,0.363 764 | 686,634,92,0.317,0.259,0.371 765 | 682,774,74,0.324,0.258,0.384 766 | 682,807,66,0.328,0.265,0.375 767 | 652,645,84,0.317,0.252,0.351 768 | 656,714,69,0.313,0.261,0.377 769 | 750,696,89,0.331,0.273,0.421 770 | 746,640,88,0.341,0.272,0.4 771 | 799,652,98,0.34,0.269,0.421 772 | 724,775,78,0.335,0.27,0.409 773 | 722,779,70,0.322,0.26,0.393 774 | 701,719,71,0.319,0.261,0.401 775 | 800,650,99,0.329,0.262,0.413 776 | 623,710,74,0.314,0.239,0.356 777 | 704,785,70,0.338,0.265,0.369 778 | 789,679,92,0.335,0.274,0.427 779 | 643,646,85,0.32,0.257,0.375 780 | 696,767,79,0.32,0.271,0.397 781 | 654,609,91,0.318,0.25,0.379 782 | 764,708,87,0.333,0.277,0.418 783 | 709,822,70,0.319,0.261,0.401 784 | 677,646,82,0.326,0.264,0.386 785 | 575,680,68,0.3,0.241,0.344 786 | 770,703,91,0.337,0.273,0.416 787 | 708,782,74,0.326,0.262,0.381 788 | 696,635,90,0.329,0.249,0.373 789 | 659,648,84,0.325,0.264,0.383 790 | 653,653,81,0.311,0.25,0.351 791 | 558,740,60,0.301,0.24,0.36 792 | 687,697,79,0.325,0.247,0.375 793 | 679,710,79,0.335,0.27,0.384 794 | 639,609,77,0.31,0.255,0.366 795 | 795,726,89,0.338,0.277,0.436 796 | 739,702,89,0.325,0.256,0.383 797 | 774,687,94,0.341,0.266,0.419 798 | 753,713,89,0.34,0.274,0.407 799 | 814,670,93,0.347,0.274,0.433 800 | 676,709,73,0.317,0.26,0.375 801 | 786,710,87,0.337,0.273,0.413 802 | 545,661,61,0.31,0.251,0.35 803 | 683,748,78,0.341,0.262,0.373 804 | 729,685,83,0.324,0.266,0.418 805 | 569,620,77,0.302,0.247,0.349 806 | 784,717,90,0.337,0.285,0.428 807 | 691,612,88,0.327,0.264,0.388 808 | 891,717,95,0.335,0.279,0.455 809 | 657,819,60,0.316,0.257,0.396 810 | 697,616,86,0.325,0.262,0.396 811 | 609,723,65,0.305,0.247,0.35 812 | 709,716,79,0.328,0.256,0.398 813 | 691,819,68,0.309,0.236,0.367 814 | 664,654,89,0.323,0.26,0.376 815 | 724,696,84,0.327,0.273,0.408 816 | 675,658,81,0.311,0.257,0.359 817 | 
651,712,76,0.311,0.254,0.381 818 | 673,687,87,0.327,0.253,0.376 819 | 685,609,92,0.334,0.264,0.364 820 | 590,749,64,0.308,0.249,0.359 821 | 651,701,78,0.314,0.262,0.383 822 | 630,660,81,0.307,0.25,0.38 823 | 805,640,100,0.342,0.273,0.413 824 | 757,767,83,0.34,0.283,0.436 825 | 698,797,65,0.332,0.265,0.378 826 | 614,728,64,0.309,0.251,0.365 827 | 587,722,70,0.311,0.259,0.37 828 | 707,670,89,0.327,0.262,0.386 829 | 738,807,79,0.35,0.277,0.381 830 | 830,757,84,0.348,0.273,0.409 831 | 637,589,93,0.326,0.261,0.367 832 | 809,694,97,0.345,0.286,0.413 833 | 663,591,92,0.323,0.263,0.388 834 | 811,682,86,0.329,0.275,0.448 835 | 670,724,77,0.319,0.265,0.381 836 | 694,629,90,0.324,0.257,0.388 837 | 611,702,67,0.319,0.257,0.345 838 | 820,662,103,0.343,0.267,0.425 839 | 686,642,83,0.322,0.259,0.385 840 | 728,639,91,0.327,0.27,0.4 841 | 666,646,83,0.322,0.266,0.388 842 | 591,654,73,0.324,0.255,0.342 843 | 610,793,59,0.308,0.248,0.356 844 | 573,634,75,0.308,0.244,0.342 845 | 738,710,74,0.328,0.275,0.4 846 | 756,752,76,0.339,0.284,0.405 847 | 624,762,67,0.309,0.251,0.383 848 | 669,763,66,0.318,0.256,0.377 849 | 757,582,102,0.336,0.261,0.419 850 | 841,711,91,0.344,0.283,0.456 851 | 866,768,88,0.351,0.282,0.429 852 | 706,707,80,0.329,0.269,0.403 853 | 730,748,73,0.333,0.275,0.41 854 | 731,644,90,0.338,0.264,0.396 855 | 760,805,81,0.34,0.258,0.384 856 | 770,738,84,0.339,0.269,0.415 857 | 583,582,89,0.315,0.256,0.344 858 | 851,816,85,0.343,0.282,0.422 859 | 739,717,79,0.331,0.263,0.412 860 | 807,722,95,0.345,0.28,0.448 861 | 764,725,82,0.341,0.278,0.402 862 | 701,581,95,0.319,0.264,0.408 863 | 593,706,63,0.313,0.25,0.35 864 | 734,672,89,0.328,0.266,0.406 865 | 573,860,54,0.302,0.239,0.346 866 | 683,718,84,0.34,0.266,0.396 867 | 775,643,98,0.33,0.272,0.416 868 | 603,681,68,0.311,0.242,0.348 869 | 711,820,67,0.331,0.269,0.404 870 | 672,751,71,0.319,0.246,0.365 871 | 731,693,86,0.331,0.278,0.401 872 | 750,698,83,0.334,0.278,0.409 873 | 613,862,53,0.311,0.251,0.363 874 | 
600,750,69,0.315,0.244,0.363 875 | 659,633,90,0.326,0.258,0.396 876 | 796,657,99,0.336,0.267,0.424 877 | 691,666,87,0.33,0.259,0.37 878 | 664,724,79,0.331,0.264,0.361 879 | 634,731,71,0.317,0.264,0.379 880 | 710,688,92,0.334,0.256,0.393 881 | 639,694,69,0.323,0.261,0.379 882 | 714,653,86,0.339,0.271,0.392 883 | 605,634,74,0.313,0.258,0.355 884 | 743,634,92,0.329,0.268,0.399 885 | 727,573,95,0.338,0.264,0.402 886 | 804,650,93,0.339,0.276,0.432 887 | 666,678,73,0.339,0.267,0.375 888 | 633,611,76,0.306,0.254,0.379 889 | 607,690,66,0.314,0.245,0.352 890 | 735,582,100,0.329,0.267,0.388 891 | 532,690,69,0.303,0.245,0.351 892 | 708,586,90,0.328,0.258,0.388 893 | 684,637,88,0.32,0.257,0.385 894 | 591,598,84,0.321,0.252,0.348 895 | 614,834,56,0.314,0.248,0.359 896 | 613,594,89,0.318,0.248,0.374 897 | 600,657,69,0.303,0.249,0.358 898 | 692,632,87,0.332,0.253,0.381 899 | 590,775,59,0.308,0.25,0.359 900 | 678,895,61,0.32,0.254,0.376 901 | 719,653,97,0.329,0.261,0.393 902 | 859,712,97,0.345,0.281,0.465 903 | 675,695,74,0.324,0.255,0.386 904 | 692,739,81,0.33,0.266,0.387 905 | 844,771,90,0.344,0.278,0.444 906 | 802,725,88,0.345,0.274,0.436 907 | 676,739,71,0.334,0.269,0.38 908 | 714,751,74,0.318,0.264,0.41 909 | 680,650,81,0.32,0.254,0.385 910 | 822,651,102,0.34,0.277,0.436 911 | 769,582,98,0.336,0.266,0.418 912 | 639,765,67,0.314,0.258,0.389 913 | 867,776,84,0.348,0.282,0.417 914 | 665,736,75,0.318,0.26,0.402 915 | 587,663,64,0.313,0.244,0.346 916 | 831,651,100,0.344,0.281,0.444 917 | 605,749,63,0.308,0.24,0.352 918 | 847,668,101,0.346,0.279,0.448 919 | 734,665,96,0.331,0.274,0.413 920 | 692,834,69,0.324,0.249,0.375 921 | 624,855,64,0.312,0.256,0.381 922 | 673,711,75,0.323,0.253,0.383 923 | 737,688,83,0.33,0.27,0.388 924 | 767,657,94,0.342,0.27,0.405 925 | 605,822,54,0.316,0.252,0.365 926 | 620,700,70,0.32,0.245,0.334 927 | 619,598,88,0.31,0.243,0.358 928 | 716,660,83,0.324,0.263,0.402 929 | 550,631,76,0.306,0.235,0.318 930 | 611,728,75,0.313,0.251,0.356 931 | 
586,745,64,0.314,0.255,0.349 932 | 857,633,102,0.357,0.28,0.424 933 | 615,615,81,0.321,0.263,0.359 934 | 609,709,74,0.315,0.257,0.365 935 | 625,657,80,0.322,0.256,0.347 936 | 713,611,90,0.327,0.269,0.371 937 | 608,543,92,0.313,0.251,0.349 938 | 570,655,66,0.311,0.246,0.34 939 | 743,704,85,0.341,0.274,0.375 940 | 531,734,55,0.291,0.235,0.34 941 | 615,538,86,0.319,0.246,0.352 942 | 730,575,97,0.328,0.269,0.389 943 | 686,598,87,0.323,0.246,0.361 944 | 770,557,101,0.338,0.272,0.395 945 | 708,630,92,0.321,0.267,0.391 946 | 570,662,73,0.31,0.247,0.337 947 | 595,686,74,0.312,0.246,0.345 948 | 629,671,72,0.323,0.26,0.359 949 | 616,652,76,0.321,0.25,0.341 950 | 583,739,67,0.313,0.244,0.346 951 | 682,553,90,0.326,0.252,0.373 952 | 796,709,95,0.344,0.275,0.417 953 | 628,723,72,0.322,0.246,0.328 954 | 712,827,75,0.338,0.259,0.368 955 | 655,703,75,0.331,0.255,0.358 956 | 840,586,108,0.353,0.271,0.401 957 | 688,703,79,0.327,0.261,0.392 958 | 570,786,57,0.301,0.249,0.366 959 | 664,711,64,0.32,0.254,0.359 960 | 710,649,91,0.333,0.261,0.394 961 | 648,534,88,0.325,0.248,0.365 962 | 675,792,68,0.32,0.25,0.389 963 | 724,736,76,0.341,0.271,0.386 964 | 601,690,75,0.317,0.244,0.348 965 | 646,625,82,0.319,0.256,0.361 966 | 681,588,83,0.325,0.264,0.382 967 | 758,606,98,0.333,0.254,0.391 968 | 735,694,86,0.342,0.269,0.402 969 | 712,565,92,0.323,0.263,0.402 970 | 552,683,71,0.31,0.244,0.335 971 | 659,671,80,0.333,0.259,0.365 972 | 662,689,82,0.327,0.273,0.375 973 | 714,733,79,0.33,0.256,0.371 974 | 661,563,88,0.319,0.249,0.363 975 | 659,612,91,0.322,0.256,0.37 976 | 696,661,84,0.333,0.264,0.377 977 | 618,657,68,0.321,0.254,0.356 978 | 669,826,66,0.327,0.251,0.365 979 | 684,721,80,0.33,0.268,0.389 980 | 776,631,98,0.343,0.26,0.394 981 | 662,694,77,0.311,0.255,0.37 982 | 620,768,72,0.303,0.247,0.366 983 | 653,632,81,0.322,0.263,0.378 984 | 667,662,77,0.327,0.259,0.364 985 | 798,561,102,0.342,0.272,0.401 986 | 647,660,76,0.309,0.244,0.369 987 | 673,669,82,0.336,0.272,0.378 988 | 
662,657,79,0.335,0.254,0.35 989 | 572,646,71,0.311,0.235,0.329 990 | 671,623,89,0.324,0.263,0.368 991 | 689,551,90,0.321,0.247,0.373 992 | 676,701,80,0.32,0.261,0.373 993 | 751,657,88,0.335,0.274,0.391 994 | 541,830,60,0.302,0.229,0.33 995 | 634,723,72,0.322,0.252,0.358 996 | 677,643,86,0.331,0.265,0.365 997 | 690,698,83,0.336,0.272,0.377 998 | 799,774,76,0.339,0.266,0.427 999 | 754,561,97,0.345,0.266,0.389 1000 | 738,647,89,0.338,0.267,0.401 1001 | 629,657,79,0.318,0.253,0.348 1002 | 614,655,77,0.32,0.247,0.357 1003 | 652,705,77,0.324,0.256,0.372 1004 | 741,621,99,0.332,0.254,0.383 1005 | 680,826,71,0.315,0.256,0.387 1006 | 642,674,85,0.32,0.254,0.39 1007 | 681,672,82,0.312,0.251,0.376 1008 | 755,752,88,0.339,0.261,0.381 1009 | 675,565,95,0.323,0.263,0.371 1010 | 708,731,74,0.325,0.253,0.388 1011 | 738,692,81,0.342,0.27,0.393 1012 | 668,702,79,0.34,0.251,0.364 1013 | 608,588,82,0.315,0.246,0.338 1014 | 641,610,80,0.322,0.261,0.378 1015 | 758,615,94,0.333,0.26,0.389 1016 | 642,717,71,0.31,0.249,0.371 1017 | 704,693,80,0.315,0.261,0.405 1018 | 548,770,60,0.296,0.244,0.351 1019 | 739,702,88,0.335,0.262,0.407 1020 | 643,603,81,0.325,0.259,0.357 1021 | 619,844,57,0.318,0.255,0.361 1022 | 643,699,82,0.312,0.257,0.385 1023 | 742,530,101,0.347,0.261,0.398 1024 | 691,667,85,0.322,0.252,0.397 1025 | 511,576,76,0.29,0.231,0.329 1026 | 637,648,83,0.325,0.258,0.378 1027 | 617,597,79,0.325,0.25,0.373 1028 | 586,581,79,0.3,0.241,0.366 1029 | 543,747,60,0.3,0.238,0.342 1030 | 701,645,91,0.325,0.254,0.405 1031 | 585,567,79,0.302,0.24,0.34 1032 | 603,566,85,0.313,0.25,0.353 1033 | 663,587,89,0.325,0.266,0.37 1034 | 534,609,69,0.304,0.229,0.329 1035 | 654,670,74,0.323,0.26,0.372 1036 | 622,729,71,0.322,0.246,0.343 1037 | 588,550,83,0.319,0.249,0.351 1038 | 648,641,81,0.328,0.254,0.36 1039 | 691,564,101,0.321,0.252,0.384 1040 | 558,688,67,0.298,0.233,0.35 1041 | 788,599,97,0.33,0.274,0.416 1042 | 486,610,61,0.293,0.233,0.332 1043 | 706,644,90,0.329,0.247,0.378 1044 | 
739,699,90,0.338,0.275,0.385 1045 | 537,660,63,0.307,0.23,0.326 1046 | 736,772,76,0.334,0.27,0.404 1047 | 792,574,108,0.344,0.257,0.401 1048 | 786,722,87,0.335,0.262,0.428 1049 | 631,630,86,0.309,0.251,0.363 1050 | 806,679,84,0.333,0.259,0.415 1051 | 633,822,56,0.315,0.253,0.362 1052 | 775,681,102,0.336,0.27,0.436 1053 | 649,675,76,0.314,0.249,0.394 1054 | 666,731,79,0.322,0.238,0.374 1055 | 744,763,79,0.332,0.259,0.391 1056 | 611,705,65,0.309,0.244,0.348 1057 | 749,684,87,0.334,0.27,0.382 1058 | 613,751,65,0.319,0.242,0.358 1059 | 744,605,98,0.327,0.262,0.403 1060 | 687,807,73,0.323,0.237,0.365 1061 | 695,630,83,0.333,0.249,0.37 1062 | 680,612,93,0.324,0.251,0.365 1063 | 678,593,89,0.325,0.249,0.392 1064 | 594,730,73,0.305,0.238,0.356 1065 | 729,664,89,0.325,0.27,0.406 1066 | 681,788,63,0.312,0.246,0.391 1067 | 831,826,86,0.351,0.262,0.409 1068 | 744,747,76,0.331,0.263,0.379 1069 | 626,689,70,0.321,0.238,0.358 1070 | 691,631,93,0.321,0.258,0.38 1071 | 779,517,109,0.343,0.265,0.414 1072 | 743,736,87,0.333,0.251,0.415 1073 | 528,652,71,0.3,0.23,0.319 1074 | 720,611,92,0.323,0.253,0.384 1075 | 625,723,68,0.32,0.247,0.357 1076 | 798,768,89,0.335,0.277,0.422 1077 | 573,717,62,0.307,0.237,0.345 1078 | 701,601,90,0.316,0.242,0.387 1079 | 676,668,81,0.33,0.24,0.352 1080 | 586,688,69,0.309,0.24,0.338 1081 | 645,561,85,0.315,0.254,0.359 1082 | 790,618,97,0.34,0.268,0.408 1083 | 582,791,52,0.31,0.24,0.359 1084 | 632,541,100,0.311,0.242,0.351 1085 | 562,587,80,0.308,0.235,0.344 1086 | 740,678,88,0.329,0.249,0.376 1087 | 645,745,63,0.312,0.241,0.372 1088 | 725,652,88,0.334,0.277,0.398 1089 | 468,746,52,0.285,0.225,0.329 1090 | 639,799,64,0.316,0.234,0.346 1091 | 713,636,90,0.334,0.242,0.361 1092 | 595,540,87,0.316,0.253,0.359 1093 | 694,644,86,0.33,0.251,0.378 1094 | 514,549,81,0.307,0.252,0.339 1095 | 579,497,91,0.304,0.225,0.352 1096 | 614,611,86,0.313,0.236,0.352 1097 | 498,615,67,0.291,0.227,0.318 1098 | 612,611,84,0.298,0.242,0.366 1099 | 463,527,67,0.284,0.228,0.311 1100 
| 690,673,83,0.32,0.273,0.389 1101 | 516,504,86,0.293,0.234,0.327 1102 | 671,492,103,0.307,0.235,0.385 1103 | 510,588,72,0.298,0.231,0.317 1104 | 470,509,76,0.289,0.23,0.319 1105 | 562,546,79,0.299,0.237,0.35 1106 | 473,499,73,0.281,0.228,0.315 1107 | 536,531,83,0.292,0.214,0.318 1108 | 569,544,82,0.304,0.24,0.343 1109 | 543,615,76,0.294,0.233,0.333 1110 | 583,532,80,0.306,0.252,0.343 1111 | 599,529,88,0.307,0.239,0.341 1112 | 583,472,97,0.298,0.249,0.346 1113 | 524,665,65,0.287,0.224,0.336 1114 | 631,640,77,0.307,0.24,0.372 1115 | 654,592,76,0.31,0.24,0.372 1116 | 722,614,92,0.321,0.255,0.395 1117 | 567,587,84,0.301,0.238,0.349 1118 | 702,624,87,0.316,0.251,0.378 1119 | 531,491,89,0.291,0.225,0.32 1120 | 604,563,87,0.297,0.248,0.372 1121 | 559,613,75,0.293,0.235,0.359 1122 | 683,587,91,0.325,0.243,0.376 1123 | 626,742,69,0.317,0.249,0.364 1124 | 533,660,62,0.296,0.233,0.33 1125 | 519,595,73,0.301,0.236,0.332 1126 | 671,590,91,0.309,0.24,0.369 1127 | 498,672,61,0.288,0.238,0.325 1128 | 522,621,72,0.296,0.225,0.317 1129 | 612,581,82,0.313,0.242,0.357 1130 | 679,693,81,0.324,0.277,0.38 1131 | 652,551,91,0.313,0.245,0.372 1132 | 695,557,101,0.32,0.263,0.379 1133 | 550,637,76,0.288,0.223,0.326 1134 | 782,683,85,0.326,0.263,0.424 1135 | 755,601,97,0.324,0.258,0.409 1136 | 655,731,72,0.31,0.24,0.376 1137 | 604,643,80,0.303,0.232,0.354 1138 | 644,809,59,0.313,0.254,0.38 1139 | 574,517,83,0.297,0.231,0.331 1140 | 692,702,76,0.309,0.26,0.395 1141 | 574,586,81,0.297,0.237,0.36 1142 | 719,698,88,0.321,0.251,0.406 1143 | 612,695,72,0.318,0.255,0.365 1144 | 564,648,74,0.294,0.236,0.337 1145 | 606,490,95,0.314,0.256,0.362 1146 | 663,581,89,0.316,0.249,0.382 1147 | 587,761,66,0.301,0.239,0.342 1148 | 611,612,70,0.299,0.235,0.374 1149 | 696,640,87,0.322,0.258,0.378 1150 | 759,641,92,0.329,0.279,0.428 1151 | 675,626,93,0.303,0.248,0.392 1152 | 571,577,83,0.298,0.251,0.368 1153 | 557,659,71,0.295,0.234,0.355 1154 | 641,578,94,0.307,0.238,0.363 1155 | 669,791,62,0.327,0.251,0.4 1156 
| 527,569,75,0.297,0.239,0.341 1157 | 635,723,72,0.307,0.238,0.358 1158 | 647,555,95,0.315,0.246,0.364 1159 | 825,704,89,0.339,0.273,0.439 1160 | 663,613,87,0.315,0.25,0.379 1161 | 680,602,89,0.312,0.238,0.374 1162 | 569,711,65,0.305,0.237,0.34 1163 | 585,755,59,0.309,0.24,0.358 1164 | 608,521,97,0.312,0.245,0.335 1165 | 774,600,102,0.324,0.254,0.399 1166 | 708,633,86,0.31,0.256,0.416 1167 | 495,752,50,0.277,0.221,0.327 1168 | 611,604,77,0.299,0.235,0.364 1169 | 654,667,85,0.313,0.25,0.384 1170 | 675,580,90,0.317,0.265,0.382 1171 | 682,593,95,0.313,0.252,0.385 1172 | 707,674,80,0.314,0.254,0.371 1173 | 591,721,70,0.304,0.228,0.35 1174 | 679,567,97,0.316,0.248,0.387 1175 | 688,793,72,0.322,0.258,0.416 1176 | 649,724,76,0.314,0.251,0.39 1177 | 642,501,98,0.32,0.247,0.353 1178 | 660,566,92,0.308,0.249,0.372 1179 | 689,693,79,0.312,0.247,0.38 1180 | 699,678,85,0.319,0.253,0.395 1181 | 495,628,66,0.285,0.229,0.315 1182 | 621,836,57,0.311,0.239,0.379 1183 | 544,551,82,0.304,0.242,0.344 1184 | 614,572,80,0.305,0.25,0.34 1185 | 737,678,79,0.322,0.252,0.427 1186 | 803,744,88,0.333,0.272,0.418 1187 | 569,776,53,0.296,0.246,0.348 1188 | 730,577,99,0.317,0.253,0.387 1189 | 693,632,92,0.315,0.258,0.391 1190 | 663,636,80,0.315,0.264,0.389 1191 | 656,587,90,0.31,0.246,0.382 1192 | 715,652,93,0.324,0.272,0.392 1193 | 578,733,62,0.299,0.231,0.348 1194 | 644,621,86,0.31,0.249,0.38 1195 | 666,704,76,0.312,0.252,0.4 1196 | 570,578,82,0.297,0.238,0.363 1197 | 683,544,94,0.323,0.25,0.365 1198 | 648,594,86,0.31,0.246,0.371 1199 | 635,702,79,0.301,0.239,0.381 1200 | 700,703,79,0.327,0.252,0.382 1201 | 464,640,66,0.283,0.22,0.301 1202 | 615,704,73,0.313,0.247,0.353 1203 | 597,660,70,0.309,0.25,0.354 1204 | 640,550,99,0.309,0.251,0.357 1205 | 767,602,91,0.325,0.255,0.43 1206 | 677,603,84,0.312,0.244,0.37 1207 | 501,774,51,0.285,0.219,0.315 1208 | 714,547,104,0.309,0.252,0.403 1209 | 642,578,87,0.306,0.252,0.381 1210 | 567,595,74,0.309,0.25,0.359 1211 | 725,641,88,0.316,0.258,0.414 1212 | 
747,628,93,0.326,0.271,0.403 1213 | 578,812,56,0.293,0.227,0.351 1214 | 652,680,77,0.314,0.248,0.387 1215 | 707,756,76,0.324,0.258,0.403 1216 | 632,827,59,0.317,0.253,0.377 1217 | 707,658,85,0.334,0.257,0.372 1218 | 802,685,98,0.332,0.27,0.417 1219 | 682,745,80,0.312,0.245,0.388 1220 | 758,692,85,0.33,0.248,0.411 1221 | 592,717,64,0.31,0.246,0.351 1222 | 745,837,72,0.332,0.263,0.386 1223 | 718,706,86,0.325,0.25,0.38 1224 | 842,697,102,0.337,0.268,0.4 1225 | 798,713,91,0.338,0.26,0.412 1226 | 730,665,86,0.326,0.252,0.403 1227 | 617,948,40,0.318,0.24,0.361 1228 | 817,680,96,0.337,0.267,0.426 1229 | 705,759,81,0.33,0.26,0.39 1230 | 706,626,93,0.321,0.268,0.394 1231 | 878,690,103,0.341,0.278,0.441 1232 | 774,664,84,0.335,0.271,0.394 1233 | 599,716,60,0.308,0.25,0.373 -------------------------------------------------------------------------------- /chapter_5_Logistic_Regression/Logistic_Regression_Pyspark.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 71, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "#import SparkSession\n", 10 | "from pyspark.sql import SparkSession\n", 11 | "spark=SparkSession.builder.appName('log_reg').getOrCreate()" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 72, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "#read the dataset\n", 21 | "df=spark.read.csv('Log_Reg_dataset.csv',inferSchema=True,header=True)" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 73, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "from pyspark.sql.functions import *\n" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 74, 36 | "metadata": {}, 37 | "outputs": [ 38 | { 39 | "name": "stdout", 40 | "output_type": "stream", 41 | "text": [ 42 | "(20000, 6)\n" 43 | ] 44 | } 45 | ], 46 | "source": [ 47 | "#check the shape of the data \n", 48 | 
"print((df.count(),len(df.columns)))" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 75, 54 | "metadata": {}, 55 | "outputs": [ 56 | { 57 | "name": "stdout", 58 | "output_type": "stream", 59 | "text": [ 60 | "root\n", 61 | " |-- Country: string (nullable = true)\n", 62 | " |-- Age: integer (nullable = true)\n", 63 | " |-- Repeat_Visitor: integer (nullable = true)\n", 64 | " |-- Search_Engine: string (nullable = true)\n", 65 | " |-- Web_pages_viewed: integer (nullable = true)\n", 66 | " |-- Status: integer (nullable = true)\n", 67 | "\n" 68 | ] 69 | } 70 | ], 71 | "source": [ 72 | "#printSchema\n", 73 | "df.printSchema()" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 76, 79 | "metadata": {}, 80 | "outputs": [ 81 | { 82 | "data": { 83 | "text/plain": [ 84 | "['Country',\n", 85 | " 'Age',\n", 86 | " 'Repeat_Visitor',\n", 87 | " 'Search_Engine',\n", 88 | " 'Web_pages_viewed',\n", 89 | " 'Status']" 90 | ] 91 | }, 92 | "execution_count": 76, 93 | "metadata": {}, 94 | "output_type": "execute_result" 95 | } 96 | ], 97 | "source": [ 98 | "#number of columns in dataset\n", 99 | "df.columns" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 77, 105 | "metadata": {}, 106 | "outputs": [ 107 | { 108 | "name": "stdout", 109 | "output_type": "stream", 110 | "text": [ 111 | "+---------+---+--------------+-------------+----------------+------+\n", 112 | "| Country|Age|Repeat_Visitor|Search_Engine|Web_pages_viewed|Status|\n", 113 | "+---------+---+--------------+-------------+----------------+------+\n", 114 | "| India| 41| 1| Yahoo| 21| 1|\n", 115 | "| Brazil| 28| 1| Yahoo| 5| 0|\n", 116 | "| Brazil| 40| 0| Google| 3| 0|\n", 117 | "|Indonesia| 31| 1| Bing| 15| 1|\n", 118 | "| Malaysia| 32| 0| Google| 15| 1|\n", 119 | "+---------+---+--------------+-------------+----------------+------+\n", 120 | "only showing top 5 rows\n", 121 | "\n" 122 | ] 123 | } 124 | ], 125 | "source": [ 126 | "#view the dataset\n", 
127 | "df.show(5)" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 78, 133 | "metadata": {}, 134 | "outputs": [ 135 | { 136 | "name": "stdout", 137 | "output_type": "stream", 138 | "text": [ 139 | "+-------+--------+-----------------+-----------------+-------------+-----------------+------------------+\n", 140 | "|summary| Country| Age| Repeat_Visitor|Search_Engine| Web_pages_viewed| Status|\n", 141 | "+-------+--------+-----------------+-----------------+-------------+-----------------+------------------+\n", 142 | "| count| 20000| 20000| 20000| 20000| 20000| 20000|\n", 143 | "| mean| null| 28.53955| 0.5029| null| 9.5533| 0.5|\n", 144 | "| stddev| null|7.888912950773227|0.500004090187782| null|6.073903499824976|0.5000125004687693|\n", 145 | "| min| Brazil| 17| 0| Bing| 1| 0|\n", 146 | "| max|Malaysia| 111| 1| Yahoo| 29| 1|\n", 147 | "+-------+--------+-----------------+-----------------+-------------+-----------------+------------------+\n", 148 | "\n" 149 | ] 150 | } 151 | ], 152 | "source": [ 153 | "#Exploratory Data Analysis\n", 154 | "df.describe().show()\n" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 79, 160 | "metadata": {}, 161 | "outputs": [ 162 | { 163 | "name": "stdout", 164 | "output_type": "stream", 165 | "text": [ 166 | "+---------+-----+\n", 167 | "| Country|count|\n", 168 | "+---------+-----+\n", 169 | "| Malaysia| 1218|\n", 170 | "| India| 4018|\n", 171 | "|Indonesia|12178|\n", 172 | "| Brazil| 2586|\n", 173 | "+---------+-----+\n", 174 | "\n" 175 | ] 176 | } 177 | ], 178 | "source": [ 179 | "df.groupBy('Country').count().show()" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 80, 185 | "metadata": {}, 186 | "outputs": [ 187 | { 188 | "name": "stdout", 189 | "output_type": "stream", 190 | "text": [ 191 | "+-------------+-----+\n", 192 | "|Search_Engine|count|\n", 193 | "+-------------+-----+\n", 194 | "| Yahoo| 9859|\n", 195 | "| Bing| 4360|\n", 196 | "| 
Google| 5781|\n", 197 | "+-------------+-----+\n", 198 | "\n" 199 | ] 200 | } 201 | ], 202 | "source": [ 203 | "df.groupBy('Search_Engine').count().show()" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 81, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "name": "stdout", 213 | "output_type": "stream", 214 | "text": [ 215 | "+------+-----+\n", 216 | "|Status|count|\n", 217 | "+------+-----+\n", 218 | "| 1|10000|\n", 219 | "| 0|10000|\n", 220 | "+------+-----+\n", 221 | "\n" 222 | ] 223 | } 224 | ], 225 | "source": [ 226 | "df.groupBy('Status').count().show()" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 82, 232 | "metadata": { 233 | "scrolled": true 234 | }, 235 | "outputs": [ 236 | { 237 | "name": "stdout", 238 | "output_type": "stream", 239 | "text": [ 240 | "+---------+------------------+-------------------+---------------------+--------------------+\n", 241 | "| Country| avg(Age)|avg(Repeat_Visitor)|avg(Web_pages_viewed)| avg(Status)|\n", 242 | "+---------+------------------+-------------------+---------------------+--------------------+\n", 243 | "| Malaysia|27.792282430213465| 0.5730706075533661| 11.192118226600986| 0.6568144499178982|\n", 244 | "| India|27.976854156296664| 0.5433051269288203| 10.727227476356397| 0.6212045793927327|\n", 245 | "|Indonesia| 28.43159796354081| 0.5207751683363442| 9.985711939563148| 0.5422893742814913|\n", 246 | "| Brazil|30.274168600154677| 0.322892498066512| 4.921113689095128|0.038669760247486466|\n", 247 | "+---------+------------------+-------------------+---------------------+--------------------+\n", 248 | "\n" 249 | ] 250 | } 251 | ], 252 | "source": [ 253 | "df.groupBy('Country').mean().show()" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": 83, 259 | "metadata": {}, 260 | "outputs": [ 261 | { 262 | "name": "stdout", 263 | "output_type": "stream", 264 | "text": [ 265 | 
"+-------------+------------------+-------------------+---------------------+------------------+\n", 266 | "|Search_Engine| avg(Age)|avg(Repeat_Visitor)|avg(Web_pages_viewed)| avg(Status)|\n", 267 | "+-------------+------------------+-------------------+---------------------+------------------+\n", 268 | "| Yahoo|28.569226087838523| 0.5094837204584644| 9.599655137437875|0.5071508266558474|\n", 269 | "| Bing| 28.68394495412844| 0.4720183486238532| 9.114908256880733|0.4559633027522936|\n", 270 | "| Google|28.380038055699707| 0.5149628092025601| 9.804878048780488|0.5210171250648676|\n", 271 | "+-------------+------------------+-------------------+---------------------+------------------+\n", 272 | "\n" 273 | ] 274 | } 275 | ], 276 | "source": [ 277 | "df.groupBy('Search_Engine').mean().show()" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 84, 283 | "metadata": {}, 284 | "outputs": [ 285 | { 286 | "name": "stdout", 287 | "output_type": "stream", 288 | "text": [ 289 | "+------+--------+-------------------+---------------------+-----------+\n", 290 | "|Status|avg(Age)|avg(Repeat_Visitor)|avg(Web_pages_viewed)|avg(Status)|\n", 291 | "+------+--------+-------------------+---------------------+-----------+\n", 292 | "| 1| 26.5435| 0.7019| 14.5617| 1.0|\n", 293 | "| 0| 30.5356| 0.3039| 4.5449| 0.0|\n", 294 | "+------+--------+-------------------+---------------------+-----------+\n", 295 | "\n" 296 | ] 297 | } 298 | ], 299 | "source": [ 300 | "df.groupBy('Status').mean().show()" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": 85, 306 | "metadata": {}, 307 | "outputs": [], 308 | "source": [ 309 | "#converting categorical data to numerical form" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": 86, 315 | "metadata": {}, 316 | "outputs": [], 317 | "source": [ 318 | "#import required libraries\n", 319 | "\n", 320 | "from pyspark.ml.feature import StringIndexer\n" 321 | ] 322 | }, 323 | { 324 | 
"cell_type": "code", 325 | "execution_count": 87, 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [ 329 | "#Indexing " 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": 88, 335 | "metadata": {}, 336 | "outputs": [], 337 | "source": [ 338 | "search_engine_indexer = StringIndexer(inputCol=\"Search_Engine\", outputCol=\"Search_Engine_Num\").fit(df)\n", 339 | "df = search_engine_indexer.transform(df)" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": 89, 345 | "metadata": {}, 346 | "outputs": [ 347 | { 348 | "name": "stdout", 349 | "output_type": "stream", 350 | "text": [ 351 | "+-------+---+--------------+-------------+----------------+------+-----------------+\n", 352 | "|Country|Age|Repeat_Visitor|Search_Engine|Web_pages_viewed|Status|Search_Engine_Num|\n", 353 | "+-------+---+--------------+-------------+----------------+------+-----------------+\n", 354 | "|India |41 |1 |Yahoo |21 |1 |0.0 |\n", 355 | "|Brazil |28 |1 |Yahoo |5 |0 |0.0 |\n", 356 | "|Brazil |40 |0 |Google |3 |0 |1.0 |\n", 357 | "+-------+---+--------------+-------------+----------------+------+-----------------+\n", 358 | "only showing top 3 rows\n", 359 | "\n" 360 | ] 361 | } 362 | ], 363 | "source": [ 364 | "df.show(3,False)" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": 90, 370 | "metadata": {}, 371 | "outputs": [], 372 | "source": [ 373 | "from pyspark.ml.feature import OneHotEncoder" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": 91, 379 | "metadata": {}, 380 | "outputs": [], 381 | "source": [ 382 | "#one hot encoding\n", 383 | "search_engine_encoder = OneHotEncoder(inputCol=\"Search_Engine_Num\", outputCol=\"Search_Engine_Vector\")\n", 384 | "df = search_engine_encoder.transform(df)" 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": 92, 390 | "metadata": {}, 391 | "outputs": [ 392 | { 393 | "name": "stdout", 394 | "output_type": "stream", 395 | 
"text": [ 396 | "+-------+---+--------------+-------------+----------------+------+-----------------+--------------------+\n", 397 | "|Country|Age|Repeat_Visitor|Search_Engine|Web_pages_viewed|Status|Search_Engine_Num|Search_Engine_Vector|\n", 398 | "+-------+---+--------------+-------------+----------------+------+-----------------+--------------------+\n", 399 | "|India |41 |1 |Yahoo |21 |1 |0.0 |(2,[0],[1.0]) |\n", 400 | "|Brazil |28 |1 |Yahoo |5 |0 |0.0 |(2,[0],[1.0]) |\n", 401 | "|Brazil |40 |0 |Google |3 |0 |1.0 |(2,[1],[1.0]) |\n", 402 | "+-------+---+--------------+-------------+----------------+------+-----------------+--------------------+\n", 403 | "only showing top 3 rows\n", 404 | "\n" 405 | ] 406 | } 407 | ], 408 | "source": [ 409 | "df.show(3,False)" 410 | ] 411 | }, 412 | { 413 | "cell_type": "code", 414 | "execution_count": 59, 415 | "metadata": {}, 416 | "outputs": [ 417 | { 418 | "name": "stdout", 419 | "output_type": "stream", 420 | "text": [ 421 | "+-------------+-----+\n", 422 | "|Search_Engine|count|\n", 423 | "+-------------+-----+\n", 424 | "|Yahoo |9859 |\n", 425 | "|Google |5781 |\n", 426 | "|Bing |4360 |\n", 427 | "+-------------+-----+\n", 428 | "\n" 429 | ] 430 | } 431 | ], 432 | "source": [ 433 | "df.groupBy('Search_Engine').count().orderBy('count',ascending=False).show(5,False)" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": 60, 439 | "metadata": {}, 440 | "outputs": [ 441 | { 442 | "name": "stdout", 443 | "output_type": "stream", 444 | "text": [ 445 | "+-----------------+-----+\n", 446 | "|Search_Engine_Num|count|\n", 447 | "+-----------------+-----+\n", 448 | "|0.0 |9859 |\n", 449 | "|1.0 |5781 |\n", 450 | "|2.0 |4360 |\n", 451 | "+-----------------+-----+\n", 452 | "\n" 453 | ] 454 | } 455 | ], 456 | "source": [ 457 | "df.groupBy('Search_Engine_Num').count().orderBy('count',ascending=False).show(5,False)" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": 93, 463 | 
"metadata": {}, 464 | "outputs": [ 465 | { 466 | "name": "stdout", 467 | "output_type": "stream", 468 | "text": [ 469 | "+--------------------+-----+\n", 470 | "|Search_Engine_Vector|count|\n", 471 | "+--------------------+-----+\n", 472 | "|(2,[0],[1.0]) |9859 |\n", 473 | "|(2,[1],[1.0]) |5781 |\n", 474 | "|(2,[],[]) |4360 |\n", 475 | "+--------------------+-----+\n", 476 | "\n" 477 | ] 478 | } 479 | ], 480 | "source": [ 481 | "df.groupBy('Search_Engine_Vector').count().orderBy('count',ascending=False).show(5,False)" 482 | ] 483 | }, 484 | { 485 | "cell_type": "code", 486 | "execution_count": 63, 487 | "metadata": {}, 488 | "outputs": [], 489 | "source": [ 490 | "country_indexer = StringIndexer(inputCol=\"Country\", outputCol=\"Country_Num\").fit(df)\n", 491 | "df = country_indexer.transform(df)" 492 | ] 493 | }, 494 | { 495 | "cell_type": "code", 496 | "execution_count": 65, 497 | "metadata": {}, 498 | "outputs": [ 499 | { 500 | "name": "stdout", 501 | "output_type": "stream", 502 | "text": [ 503 | "+-------+-----------+\n", 504 | "|Country|Country_Num|\n", 505 | "+-------+-----------+\n", 506 | "|India |1.0 |\n", 507 | "|Brazil |2.0 |\n", 508 | "|Brazil |2.0 |\n", 509 | "+-------+-----------+\n", 510 | "only showing top 3 rows\n", 511 | "\n" 512 | ] 513 | } 514 | ], 515 | "source": [ 516 | "df.select(['Country','Country_Num']).show(3,False)" 517 | ] 518 | }, 519 | { 520 | "cell_type": "code", 521 | "execution_count": 67, 522 | "metadata": {}, 523 | "outputs": [], 524 | "source": [ 525 | "#one hot encoding\n", 526 | "country_encoder = OneHotEncoder(inputCol=\"Country_Num\", outputCol=\"Country_Vector\")\n", 527 | "df = country_encoder.transform(df)" 528 | ] 529 | }, 530 | { 531 | "cell_type": "code", 532 | "execution_count": 69, 533 | "metadata": {}, 534 | "outputs": [ 535 | { 536 | "name": "stdout", 537 | "output_type": "stream", 538 | "text": [ 539 | "+-------+-----------+--------------+\n", 540 | "|Country|country_Num|Country_Vector|\n", 541 | 
"+-------+-----------+--------------+\n", 542 | "|India |1.0 |(3,[1],[1.0]) |\n", 543 | "|Brazil |2.0 |(3,[2],[1.0]) |\n", 544 | "|Brazil |2.0 |(3,[2],[1.0]) |\n", 545 | "+-------+-----------+--------------+\n", 546 | "only showing top 3 rows\n", 547 | "\n" 548 | ] 549 | } 550 | ], 551 | "source": [ 552 | "df.select(['Country','country_Num','Country_Vector']).show(3,False)" 553 | ] 554 | }, 555 | { 556 | "cell_type": "code", 557 | "execution_count": 34, 558 | "metadata": {}, 559 | "outputs": [ 560 | { 561 | "name": "stdout", 562 | "output_type": "stream", 563 | "text": [ 564 | "+---------+-----+\n", 565 | "|Country |count|\n", 566 | "+---------+-----+\n", 567 | "|Indonesia|12178|\n", 568 | "|India |4018 |\n", 569 | "|Brazil |2586 |\n", 570 | "|Malaysia |1218 |\n", 571 | "+---------+-----+\n", 572 | "\n" 573 | ] 574 | } 575 | ], 576 | "source": [ 577 | "df.groupBy('Country').count().orderBy('count',ascending=False).show(5,False)" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": 66, 583 | "metadata": {}, 584 | "outputs": [ 585 | { 586 | "name": "stdout", 587 | "output_type": "stream", 588 | "text": [ 589 | "+-----------+-----+\n", 590 | "|Country_Num|count|\n", 591 | "+-----------+-----+\n", 592 | "|0.0 |12178|\n", 593 | "|1.0 |4018 |\n", 594 | "|2.0 |2586 |\n", 595 | "|3.0 |1218 |\n", 596 | "+-----------+-----+\n", 597 | "\n" 598 | ] 599 | } 600 | ], 601 | "source": [ 602 | "df.groupBy('Country_Num').count().orderBy('count',ascending=False).show(5,False)" 603 | ] 604 | }, 605 | { 606 | "cell_type": "code", 607 | "execution_count": 70, 608 | "metadata": {}, 609 | "outputs": [ 610 | { 611 | "name": "stdout", 612 | "output_type": "stream", 613 | "text": [ 614 | "+--------------+-----+\n", 615 | "|Country_Vector|count|\n", 616 | "+--------------+-----+\n", 617 | "|(3,[0],[1.0]) |12178|\n", 618 | "|(3,[1],[1.0]) |4018 |\n", 619 | "|(3,[2],[1.0]) |2586 |\n", 620 | "|(3,[],[]) |1218 |\n", 621 | "+--------------+-----+\n", 622 | "\n" 623 | ] 624 
| } 625 | ], 626 | "source": [ 627 | "df.groupBy('Country_Vector').count().orderBy('count',ascending=False).show(5,False)" 628 | ] 629 | }, 630 | { 631 | "cell_type": "code", 632 | "execution_count": 37, 633 | "metadata": {}, 634 | "outputs": [], 635 | "source": [ 636 | "from pyspark.ml.feature import VectorAssembler" 637 | ] 638 | }, 639 | { 640 | "cell_type": "code", 641 | "execution_count": 146, 642 | "metadata": {}, 643 | "outputs": [], 644 | "source": [ 645 | "df_assembler = VectorAssembler(inputCols=['platform_vector','country_vector','Age', 'Repeat_Visitor','Web_pages_viewed'], outputCol=\"features\")\n", 646 | "df = df_assembler.transform(df)" 647 | ] 648 | }, 649 | { 650 | "cell_type": "code", 651 | "execution_count": 147, 652 | "metadata": {}, 653 | "outputs": [ 654 | { 655 | "name": "stdout", 656 | "output_type": "stream", 657 | "text": [ 658 | "root\n", 659 | " |-- Country: string (nullable = true)\n", 660 | " |-- Age: integer (nullable = true)\n", 661 | " |-- Repeat_Visitor: integer (nullable = true)\n", 662 | " |-- Platform: string (nullable = true)\n", 663 | " |-- Web_pages_viewed: integer (nullable = true)\n", 664 | " |-- Status: integer (nullable = true)\n", 665 | " |-- platform_num: double (nullable = false)\n", 666 | " |-- platform_vector: vector (nullable = true)\n", 667 | " |-- country_num: double (nullable = false)\n", 668 | " |-- country_vector: vector (nullable = true)\n", 669 | " |-- features: vector (nullable = true)\n", 670 | "\n" 671 | ] 672 | } 673 | ], 674 | "source": [ 675 | "df.printSchema()" 676 | ] 677 | }, 678 | { 679 | "cell_type": "code", 680 | "execution_count": 148, 681 | "metadata": {}, 682 | "outputs": [ 683 | { 684 | "name": "stdout", 685 | "output_type": "stream", 686 | "text": [ 687 | "+-----------------------------------+------+\n", 688 | "|features |Status|\n", 689 | "+-----------------------------------+------+\n", 690 | "|[1.0,0.0,0.0,1.0,0.0,41.0,1.0,21.0]|1 |\n", 691 | "|[1.0,0.0,0.0,0.0,1.0,28.0,1.0,5.0] |0 |\n", 
692 | "|(8,[1,4,5,7],[1.0,1.0,40.0,3.0]) |0 |\n", 693 | "|(8,[2,5,6,7],[1.0,31.0,1.0,15.0]) |1 |\n", 694 | "|(8,[1,5,7],[1.0,32.0,15.0]) |1 |\n", 695 | "|(8,[1,4,5,7],[1.0,1.0,32.0,3.0]) |0 |\n", 696 | "|(8,[1,4,5,7],[1.0,1.0,32.0,6.0]) |0 |\n", 697 | "|(8,[1,2,5,7],[1.0,1.0,27.0,9.0]) |0 |\n", 698 | "|(8,[0,2,5,7],[1.0,1.0,32.0,2.0]) |0 |\n", 699 | "|(8,[2,5,6,7],[1.0,31.0,1.0,16.0]) |1 |\n", 700 | "+-----------------------------------+------+\n", 701 | "only showing top 10 rows\n", 702 | "\n" 703 | ] 704 | } 705 | ], 706 | "source": [ 707 | "df.select(['features','Status']).show(10,False)" 708 | ] 709 | }, 710 | { 711 | "cell_type": "code", 712 | "execution_count": 149, 713 | "metadata": {}, 714 | "outputs": [], 715 | "source": [ 716 | "#select data for building model\n", 717 | "model_df=df.select(['features','Status'])" 718 | ] 719 | }, 720 | { 721 | "cell_type": "code", 722 | "execution_count": 150, 723 | "metadata": {}, 724 | "outputs": [], 725 | "source": [ 726 | "from pyspark.ml.classification import LogisticRegression" 727 | ] 728 | }, 729 | { 730 | "cell_type": "code", 731 | "execution_count": 151, 732 | "metadata": {}, 733 | "outputs": [], 734 | "source": [ 735 | "#split the data \n", 736 | "training_df,test_df=model_df.randomSplit([0.75,0.25])" 737 | ] 738 | }, 739 | { 740 | "cell_type": "code", 741 | "execution_count": 152, 742 | "metadata": {}, 743 | "outputs": [ 744 | { 745 | "data": { 746 | "text/plain": [ 747 | "14907" 748 | ] 749 | }, 750 | "execution_count": 152, 751 | "metadata": {}, 752 | "output_type": "execute_result" 753 | } 754 | ], 755 | "source": [ 756 | "training_df.count()" 757 | ] 758 | }, 759 | { 760 | "cell_type": "code", 761 | "execution_count": 160, 762 | "metadata": {}, 763 | "outputs": [ 764 | { 765 | "name": "stdout", 766 | "output_type": "stream", 767 | "text": [ 768 | "+------+-----+\n", 769 | "|Status|count|\n", 770 | "+------+-----+\n", 771 | "| 1| 7417|\n", 772 | "| 0| 7490|\n", 773 | "+------+-----+\n", 774 | "\n" 775 | ] 
776 | } 777 | ], 778 | "source": [ 779 | "training_df.groupBy('Status').count().show()" 780 | ] 781 | }, 782 | { 783 | "cell_type": "code", 784 | "execution_count": 153, 785 | "metadata": {}, 786 | "outputs": [ 787 | { 788 | "data": { 789 | "text/plain": [ 790 | "5093" 791 | ] 792 | }, 793 | "execution_count": 153, 794 | "metadata": {}, 795 | "output_type": "execute_result" 796 | } 797 | ], 798 | "source": [ 799 | "test_df.count()" 800 | ] 801 | }, 802 | { 803 | "cell_type": "code", 804 | "execution_count": 161, 805 | "metadata": {}, 806 | "outputs": [ 807 | { 808 | "name": "stdout", 809 | "output_type": "stream", 810 | "text": [ 811 | "+------+-----+\n", 812 | "|Status|count|\n", 813 | "+------+-----+\n", 814 | "| 1| 2583|\n", 815 | "| 0| 2510|\n", 816 | "+------+-----+\n", 817 | "\n" 818 | ] 819 | } 820 | ], 821 | "source": [ 822 | "test_df.groupBy('Status').count().show()" 823 | ] 824 | }, 825 | { 826 | "cell_type": "code", 827 | "execution_count": 154, 828 | "metadata": {}, 829 | "outputs": [], 830 | "source": [ 831 | "log_reg=LogisticRegression(labelCol='Status').fit(training_df)" 832 | ] 833 | }, 834 | { 835 | "cell_type": "code", 836 | "execution_count": null, 837 | "metadata": {}, 838 | "outputs": [], 839 | "source": [ 840 | "#Training Results" 841 | ] 842 | }, 843 | { 844 | "cell_type": "code", 845 | "execution_count": 155, 846 | "metadata": {}, 847 | "outputs": [], 848 | "source": [ 849 | "train_results=log_reg.evaluate(training_df).predictions" 850 | ] 851 | }, 852 | { 853 | "cell_type": "code", 854 | "execution_count": 168, 855 | "metadata": {}, 856 | "outputs": [ 857 | { 858 | "name": "stdout", 859 | "output_type": "stream", 860 | "text": [ 861 | "+------+----------+----------------------------------------+\n", 862 | "|Status|prediction|probability |\n", 863 | "+------+----------+----------------------------------------+\n", 864 | "|1 |1.0 |[0.2978572628475072,0.7021427371524929] |\n", 865 | "|1 |1.0 |[0.2978572628475072,0.7021427371524929] |\n", 866 | 
"|1 |1.0 |[0.16704676975730415,0.8329532302426959]|\n", 867 | "|1 |1.0 |[0.16704676975730415,0.8329532302426959]|\n", 868 | "|1 |1.0 |[0.16704676975730415,0.8329532302426959]|\n", 869 | "|1 |1.0 |[0.08659913656062515,0.9134008634393749]|\n", 870 | "|1 |1.0 |[0.08659913656062515,0.9134008634393749]|\n", 871 | "|1 |1.0 |[0.08659913656062515,0.9134008634393749]|\n", 872 | "|1 |1.0 |[0.08659913656062515,0.9134008634393749]|\n", 873 | "|1 |1.0 |[0.08659913656062515,0.9134008634393749]|\n", 874 | "+------+----------+----------------------------------------+\n", 875 | "only showing top 10 rows\n", 876 | "\n" 877 | ] 878 | } 879 | ], 880 | "source": [ 881 | "train_results.filter(train_results['Status']==1).filter(train_results['prediction']==1).select(['Status','prediction','probability']).show(10,False)" 882 | ] 883 | }, 884 | { 885 | "cell_type": "markdown", 886 | "metadata": {}, 887 | "source": [ 888 | "Probability at index 0 is for class 0, and probability at index 1 is for class 1" 889 | ] 890 | }, 891 | { 892 | "cell_type": "code", 893 | "execution_count": 177, 894 | "metadata": {}, 895 | "outputs": [], 896 | "source": [ 897 | "correct_preds=train_results.filter(train_results['Status']==1).filter(train_results['prediction']==1).count()\n" 898 | ] 899 | }, 900 | { 901 | "cell_type": "code", 902 | "execution_count": 174, 903 | "metadata": {}, 904 | "outputs": [ 905 | { 906 | "data": { 907 | "text/plain": [ 908 | "7417" 909 | ] 910 | }, 911 | "execution_count": 174, 912 | "metadata": {}, 913 | "output_type": "execute_result" 914 | } 915 | ], 916 | "source": [ 917 | "training_df.filter(training_df['Status']==1).count()" 918 | ] 919 | }, 920 | { 921 | "cell_type": "code", 922 | "execution_count": 178, 923 | "metadata": {}, 924 | "outputs": [ 925 | { 926 | "data": { 927 | "text/plain": [ 928 | "0.9366320614803829" 929 | ] 930 | }, 931 | "execution_count": 178, 932 | "metadata": {}, 933 | "output_type": "execute_result" 934 | } 935 | ], 936 | "source": [ 937 | "#accuracy on 
training dataset (positive class only, i.e. recall: correctly predicted 1s / actual 1s)\n", 938 | "float(correct_preds)/(training_df.filter(training_df['Status']==1).count())" 939 | ] 940 | }, 941 | { 942 | "cell_type": "code", 943 | "execution_count": null, 944 | "metadata": {}, 945 | "outputs": [], 946 | "source": [ 947 | "#Test Set results" 948 | ] 949 | }, 950 | { 951 | "cell_type": "code", 952 | "execution_count": 170, 953 | "metadata": {}, 954 | "outputs": [], 955 | "source": [ 956 | "results=log_reg.evaluate(test_df).predictions" 957 | ] 958 | }, 959 | { 960 | "cell_type": "code", 961 | "execution_count": 93, 962 | "metadata": {}, 963 | "outputs": [ 964 | { 965 | "name": "stdout", 966 | "output_type": "stream", 967 | "text": [ 968 | "+------+----------+\n", 969 | "|Status|prediction|\n", 970 | "+------+----------+\n", 971 | "|0 |0.0 |\n", 972 | "|0 |0.0 |\n", 973 | "|0 |0.0 |\n", 974 | "|0 |0.0 |\n", 975 | "|1 |0.0 |\n", 976 | "|0 |0.0 |\n", 977 | "|1 |1.0 |\n", 978 | "|0 |1.0 |\n", 979 | "|1 |1.0 |\n", 980 | "|1 |1.0 |\n", 981 | "+------+----------+\n", 982 | "only showing top 10 rows\n", 983 | "\n" 984 | ] 985 | } 986 | ], 987 | "source": [ 988 | "results.select(['Status','prediction']).show(10,False)" 989 | ] 990 | }, 991 | { 992 | "cell_type": "code", 993 | "execution_count": 91, 994 | "metadata": {}, 995 | "outputs": [ 996 | { 997 | "name": "stdout", 998 | "output_type": "stream", 999 | "text": [ 1000 | "root\n", 1001 | " |-- features: vector (nullable = true)\n", 1002 | " |-- Status: integer (nullable = true)\n", 1003 | " |-- rawPrediction: vector (nullable = true)\n", 1004 | " |-- probability: vector (nullable = true)\n", 1005 | " |-- prediction: double (nullable = false)\n", 1006 | "\n" 1007 | ] 1008 | } 1009 | ], 1010 | "source": [ 1011 | "results.printSchema()" 1012 | ] 1013 | }, 1014 | { 1015 | "cell_type": "code", 1016 | "execution_count": 92, 1017 | "metadata": {}, 1018 | "outputs": [], 1019 | "source": [ 1020 | "from pyspark.ml.evaluation import BinaryClassificationEvaluator" 1021 | ] 1022 | }, 1023 | { 1024 | 
"cell_type": "code", 1025 | "execution_count": 94, 1026 | "metadata": {}, 1027 | "outputs": [], 1028 | "source": [ 1029 | "#confusion matrix\n", 1030 | "true_postives = results[(results.Status == 1) & (results.prediction == 1)].count()\n", 1031 | "true_negatives = results[(results.Status == 0) & (results.prediction == 0)].count()\n", 1032 | "false_positives = results[(results.Status == 0) & (results.prediction == 1)].count()\n", 1033 | "false_negatives = results[(results.Status == 1) & (results.prediction == 0)].count()" 1034 | ] 1035 | }, 1036 | { 1037 | "cell_type": "code", 1038 | "execution_count": 98, 1039 | "metadata": {}, 1040 | "outputs": [ 1041 | { 1042 | "name": "stdout", 1043 | "output_type": "stream", 1044 | "text": [ 1045 | "2356\n", 1046 | "2363\n", 1047 | "158\n", 1048 | "157\n", 1049 | "5034\n", 1050 | "5034\n" 1051 | ] 1052 | } 1053 | ], 1054 | "source": [ 1055 | "print (true_postives)\n", 1056 | "print (true_negatives)\n", 1057 | "print (false_positives)\n", 1058 | "print (false_negatives)\n", 1059 | "print(true_postives+true_negatives+false_positives+false_negatives)\n", 1060 | "print (results.count())" 1061 | ] 1062 | }, 1063 | { 1064 | "cell_type": "code", 1065 | "execution_count": 99, 1066 | "metadata": {}, 1067 | "outputs": [ 1068 | { 1069 | "name": "stdout", 1070 | "output_type": "stream", 1071 | "text": [ 1072 | "0.937524870672503\n" 1073 | ] 1074 | } 1075 | ], 1076 | "source": [ 1077 | "recall = float(true_postives)/(true_postives + false_negatives)\n", 1078 | "print(recall)" 1079 | ] 1080 | }, 1081 | { 1082 | "cell_type": "code", 1083 | "execution_count": 100, 1084 | "metadata": {}, 1085 | "outputs": [ 1086 | { 1087 | "name": "stdout", 1088 | "output_type": "stream", 1089 | "text": [ 1090 | "0.9371519490851233\n" 1091 | ] 1092 | } 1093 | ], 1094 | "source": [ 1095 | "precision = float(true_postives) / (true_postives + false_positives)\n", 1096 | "print(precision)" 1097 | ] 1098 | }, 1099 | { 1100 | "cell_type": "code", 1101 | 
"execution_count": 103, 1102 | "metadata": {}, 1103 | "outputs": [ 1104 | { 1105 | "name": "stdout", 1106 | "output_type": "stream", 1107 | "text": [ 1108 | "0.9374255065554231\n" 1109 | ] 1110 | } 1111 | ], 1112 | "source": [ 1113 | "accuracy=float((true_postives+true_negatives) /(results.count()))\n", 1114 | "print(accuracy)" 1115 | ] 1116 | }, 1117 | { 1118 | "cell_type": "code", 1119 | "execution_count": null, 1120 | "metadata": {}, 1121 | "outputs": [], 1122 | "source": [] 1123 | } 1124 | ], 1125 | "metadata": { 1126 | "kernelspec": { 1127 | "display_name": "Python 3", 1128 | "language": "python", 1129 | "name": "python3" 1130 | }, 1131 | "language_info": { 1132 | "codemirror_mode": { 1133 | "name": "ipython", 1134 | "version": 3 1135 | }, 1136 | "file_extension": ".py", 1137 | "mimetype": "text/x-python", 1138 | "name": "python", 1139 | "nbconvert_exporter": "python", 1140 | "pygments_lexer": "ipython3", 1141 | "version": "3.6.3" 1142 | } 1143 | }, 1144 | "nbformat": 4, 1145 | "nbformat_minor": 2 1146 | } 1147 | -------------------------------------------------------------------------------- /chapter_6_Random_Forests/Random_Forests.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "#import SparkSession\n", 10 | "from pyspark.sql import SparkSession\n", 11 | "spark=SparkSession.builder.appName('random_forest').getOrCreate()" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 2, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "#read the dataset\n", 21 | "df=spark.read.csv('affairs.csv',inferSchema=True,header=True)" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 3, 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "name": "stdout", 31 | "output_type": "stream", 32 | "text": [ 33 | "(6366, 6)\n" 34 | ] 35 | } 36 | ], 37 | 
"source": [ 38 | "#check the shape of the data \n", 39 | "print((df.count(),len(df.columns)))" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 4, 45 | "metadata": {}, 46 | "outputs": [ 47 | { 48 | "name": "stdout", 49 | "output_type": "stream", 50 | "text": [ 51 | "root\n", 52 | " |-- rate_marriage: integer (nullable = true)\n", 53 | " |-- age: double (nullable = true)\n", 54 | " |-- yrs_married: double (nullable = true)\n", 55 | " |-- children: double (nullable = true)\n", 56 | " |-- religious: integer (nullable = true)\n", 57 | " |-- affairs: integer (nullable = true)\n", 58 | "\n" 59 | ] 60 | } 61 | ], 62 | "source": [ 63 | "#printSchema\n", 64 | "df.printSchema()" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": 5, 70 | "metadata": {}, 71 | "outputs": [ 72 | { 73 | "name": "stdout", 74 | "output_type": "stream", 75 | "text": [ 76 | "+-------------+----+-----------+--------+---------+-------+\n", 77 | "|rate_marriage| age|yrs_married|children|religious|affairs|\n", 78 | "+-------------+----+-----------+--------+---------+-------+\n", 79 | "| 5|32.0| 6.0| 1.0| 3| 0|\n", 80 | "| 4|22.0| 2.5| 0.0| 2| 0|\n", 81 | "| 3|32.0| 9.0| 3.0| 3| 1|\n", 82 | "| 3|27.0| 13.0| 3.0| 1| 1|\n", 83 | "| 4|22.0| 2.5| 0.0| 1| 1|\n", 84 | "+-------------+----+-----------+--------+---------+-------+\n", 85 | "only showing top 5 rows\n", 86 | "\n" 87 | ] 88 | } 89 | ], 90 | "source": [ 91 | "#view the dataset\n", 92 | "df.show(5)" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 7, 98 | "metadata": { 99 | "scrolled": true 100 | }, 101 | "outputs": [ 102 | { 103 | "name": "stdout", 104 | "output_type": "stream", 105 | "text": [ 106 | "+-------+------------------+------------------+-----------------+------------------+------------------+\n", 107 | "|summary| rate_marriage| age| yrs_married| children| religious|\n", 108 | 
"+-------+------------------+------------------+-----------------+------------------+------------------+\n", 109 | "| count| 6366| 6366| 6366| 6366| 6366|\n", 110 | "| mean| 4.109644989004084|29.082862079798932| 9.00942507068803|1.3968740182218033|2.4261702796104303|\n", 111 | "| stddev|0.9614295945655025| 6.847881883668817|7.280119972766412| 1.433470828560344|0.8783688402641785|\n", 112 | "| min| 1| 17.5| 0.5| 0.0| 1|\n", 113 | "| max| 5| 42.0| 23.0| 5.5| 4|\n", 114 | "+-------+------------------+------------------+-----------------+------------------+------------------+\n", 115 | "\n" 116 | ] 117 | } 118 | ], 119 | "source": [ 120 | "#Exploratory Data Analysis\n", 121 | "df.describe().select('summary','rate_marriage','age','yrs_married','children','religious').show()" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 8, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | "name": "stdout", 131 | "output_type": "stream", 132 | "text": [ 133 | "+-------+-----+\n", 134 | "|affairs|count|\n", 135 | "+-------+-----+\n", 136 | "| 1| 2053|\n", 137 | "| 0| 4313|\n", 138 | "+-------+-----+\n", 139 | "\n" 140 | ] 141 | } 142 | ], 143 | "source": [ 144 | "df.groupBy('affairs').count().show()" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 9, 150 | "metadata": {}, 151 | "outputs": [ 152 | { 153 | "name": "stdout", 154 | "output_type": "stream", 155 | "text": [ 156 | "+-------------+-----+\n", 157 | "|rate_marriage|count|\n", 158 | "+-------------+-----+\n", 159 | "| 1| 99|\n", 160 | "| 3| 993|\n", 161 | "| 5| 2684|\n", 162 | "| 4| 2242|\n", 163 | "| 2| 348|\n", 164 | "+-------------+-----+\n", 165 | "\n" 166 | ] 167 | } 168 | ], 169 | "source": [ 170 | "df.groupBy('rate_marriage').count().show()" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 10, 176 | "metadata": {}, 177 | "outputs": [ 178 | { 179 | "name": "stdout", 180 | "output_type": "stream", 181 | "text": [ 182 | 
"+-------------+-------+-----+\n", 183 | "|rate_marriage|affairs|count|\n", 184 | "+-------------+-------+-----+\n", 185 | "| 1| 0| 25|\n", 186 | "| 1| 1| 74|\n", 187 | "| 2| 0| 127|\n", 188 | "| 2| 1| 221|\n", 189 | "| 3| 0| 446|\n", 190 | "| 3| 1| 547|\n", 191 | "| 4| 0| 1518|\n", 192 | "| 4| 1| 724|\n", 193 | "| 5| 0| 2197|\n", 194 | "| 5| 1| 487|\n", 195 | "+-------------+-------+-----+\n", 196 | "\n" 197 | ] 198 | } 199 | ], 200 | "source": [ 201 | "df.groupBy('rate_marriage','affairs').count().orderBy('rate_marriage','affairs','count',ascending=True).show()" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 11, 207 | "metadata": {}, 208 | "outputs": [ 209 | { 210 | "name": "stdout", 211 | "output_type": "stream", 212 | "text": [ 213 | "+---------+-------+-----+\n", 214 | "|religious|affairs|count|\n", 215 | "+---------+-------+-----+\n", 216 | "| 1| 0| 613|\n", 217 | "| 1| 1| 408|\n", 218 | "| 2| 0| 1448|\n", 219 | "| 2| 1| 819|\n", 220 | "| 3| 0| 1715|\n", 221 | "| 3| 1| 707|\n", 222 | "| 4| 0| 537|\n", 223 | "| 4| 1| 119|\n", 224 | "+---------+-------+-----+\n", 225 | "\n" 226 | ] 227 | } 228 | ], 229 | "source": [ 230 | "df.groupBy('religious','affairs').count().orderBy('religious','affairs','count',ascending=True).show()" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": 12, 236 | "metadata": {}, 237 | "outputs": [ 238 | { 239 | "name": "stdout", 240 | "output_type": "stream", 241 | "text": [ 242 | "+--------+-------+-----+\n", 243 | "|children|affairs|count|\n", 244 | "+--------+-------+-----+\n", 245 | "| 0.0| 0| 1912|\n", 246 | "| 0.0| 1| 502|\n", 247 | "| 1.0| 0| 747|\n", 248 | "| 1.0| 1| 412|\n", 249 | "| 2.0| 0| 873|\n", 250 | "| 2.0| 1| 608|\n", 251 | "| 3.0| 0| 460|\n", 252 | "| 3.0| 1| 321|\n", 253 | "| 4.0| 0| 197|\n", 254 | "| 4.0| 1| 131|\n", 255 | "| 5.5| 0| 124|\n", 256 | "| 5.5| 1| 79|\n", 257 | "+--------+-------+-----+\n", 258 | "\n" 259 | ] 260 | } 261 | ], 262 | "source": [ 263 | 
"df.groupBy('children','affairs').count().orderBy('children','affairs','count',ascending=True).show()" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 11, 269 | "metadata": {}, 270 | "outputs": [ 271 | { 272 | "name": "stdout", 273 | "output_type": "stream", 274 | "text": [ 275 | "+-------+------------------+------------------+------------------+------------------+------------------+------------+\n", 276 | "|affairs|avg(rate_marriage)| avg(age)| avg(yrs_married)| avg(children)| avg(religious)|avg(affairs)|\n", 277 | "+-------+------------------+------------------+------------------+------------------+------------------+------------+\n", 278 | "| 1|3.6473453482708234|30.537018996590355|11.152459814905017|1.7289332683877252| 2.261568436434486| 1.0|\n", 279 | "| 0| 4.329700904242986| 28.39067934152562| 7.989334569904939|1.2388128912589844|2.5045212149316023| 0.0|\n", 280 | "+-------+------------------+------------------+------------------+------------------+------------------+------------+\n", 281 | "\n" 282 | ] 283 | } 284 | ], 285 | "source": [ 286 | "df.groupBy('affairs').mean().show()" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": 13, 292 | "metadata": {}, 293 | "outputs": [], 294 | "source": [ 295 | "from pyspark.ml.feature import VectorAssembler" 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": 14, 301 | "metadata": {}, 302 | "outputs": [], 303 | "source": [ 304 | "df_assembler = VectorAssembler(inputCols=['rate_marriage', 'age', 'yrs_married', 'children', 'religious'], outputCol=\"features\")\n", 305 | "df = df_assembler.transform(df)" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": 15, 311 | "metadata": {}, 312 | "outputs": [ 313 | { 314 | "name": "stdout", 315 | "output_type": "stream", 316 | "text": [ 317 | "root\n", 318 | " |-- rate_marriage: integer (nullable = true)\n", 319 | " |-- age: double (nullable = true)\n", 320 | " |-- yrs_married: 
double (nullable = true)\n", 321 | " |-- children: double (nullable = true)\n", 322 | " |-- religious: integer (nullable = true)\n", 323 | " |-- affairs: integer (nullable = true)\n", 324 | " |-- features: vector (nullable = true)\n", 325 | "\n" 326 | ] 327 | } 328 | ], 329 | "source": [ 330 | "df.printSchema()" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": 16, 336 | "metadata": {}, 337 | "outputs": [ 338 | { 339 | "name": "stdout", 340 | "output_type": "stream", 341 | "text": [ 342 | "+-----------------------+-------+\n", 343 | "|features |affairs|\n", 344 | "+-----------------------+-------+\n", 345 | "|[5.0,32.0,6.0,1.0,3.0] |0 |\n", 346 | "|[4.0,22.0,2.5,0.0,2.0] |0 |\n", 347 | "|[3.0,32.0,9.0,3.0,3.0] |1 |\n", 348 | "|[3.0,27.0,13.0,3.0,1.0]|1 |\n", 349 | "|[4.0,22.0,2.5,0.0,1.0] |1 |\n", 350 | "|[4.0,37.0,16.5,4.0,3.0]|1 |\n", 351 | "|[5.0,27.0,9.0,1.0,1.0] |1 |\n", 352 | "|[4.0,27.0,9.0,0.0,2.0] |1 |\n", 353 | "|[5.0,37.0,23.0,5.5,2.0]|1 |\n", 354 | "|[5.0,37.0,23.0,5.5,2.0]|1 |\n", 355 | "+-----------------------+-------+\n", 356 | "only showing top 10 rows\n", 357 | "\n" 358 | ] 359 | } 360 | ], 361 | "source": [ 362 | "df.select(['features','affairs']).show(10,False)" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": 16, 368 | "metadata": {}, 369 | "outputs": [], 370 | "source": [ 371 | "#select data for building model\n", 372 | "model_df=df.select(['features','affairs'])" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": 17, 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "train_df,test_df=model_df.randomSplit([0.75,0.25])" 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 18, 387 | "metadata": {}, 388 | "outputs": [ 389 | { 390 | "data": { 391 | "text/plain": [ 392 | "4800" 393 | ] 394 | }, 395 | "execution_count": 18, 396 | "metadata": {}, 397 | "output_type": "execute_result" 398 | } 399 | ], 400 | "source": [ 401 | 
"train_df.count()" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": 19, 407 | "metadata": {}, 408 | "outputs": [ 409 | { 410 | "name": "stdout", 411 | "output_type": "stream", 412 | "text": [ 413 | "+-------+-----+\n", 414 | "|affairs|count|\n", 415 | "+-------+-----+\n", 416 | "| 1| 1574|\n", 417 | "| 0| 3226|\n", 418 | "+-------+-----+\n", 419 | "\n" 420 | ] 421 | } 422 | ], 423 | "source": [ 424 | "train_df.groupBy('affairs').count().show()" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": 20, 430 | "metadata": {}, 431 | "outputs": [ 432 | { 433 | "name": "stdout", 434 | "output_type": "stream", 435 | "text": [ 436 | "+-------+-----+\n", 437 | "|affairs|count|\n", 438 | "+-------+-----+\n", 439 | "| 1| 479|\n", 440 | "| 0| 1087|\n", 441 | "+-------+-----+\n", 442 | "\n" 443 | ] 444 | } 445 | ], 446 | "source": [ 447 | "test_df.groupBy('affairs').count().show()" 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": 21, 453 | "metadata": {}, 454 | "outputs": [], 455 | "source": [ 456 | "from pyspark.ml.classification import RandomForestClassifier" 457 | ] 458 | }, 459 | { 460 | "cell_type": "code", 461 | "execution_count": 22, 462 | "metadata": {}, 463 | "outputs": [], 464 | "source": [ 465 | "rf_classifier=RandomForestClassifier(labelCol='affairs',numTrees=50).fit(train_df)" 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": 23, 471 | "metadata": {}, 472 | "outputs": [], 473 | "source": [ 474 | "rf_predictions=rf_classifier.transform(test_df)" 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": 24, 480 | "metadata": {}, 481 | "outputs": [ 482 | { 483 | "name": "stdout", 484 | "output_type": "stream", 485 | "text": [ 486 | "+--------------------+-------+--------------------+--------------------+----------+\n", 487 | "| features|affairs| rawPrediction| probability|prediction|\n", 488 | 
"+--------------------+-------+--------------------+--------------------+----------+\n", 489 | "|[1.0,22.0,2.5,1.0...| 1|[18.7524598757082...|[0.37504919751416...| 1.0|\n", 490 | "|[1.0,27.0,2.5,0.0...| 1|[21.0021478623554...|[0.42004295724710...| 1.0|\n", 491 | "|[1.0,27.0,6.0,0.0...| 0|[20.1166611727778...|[0.40233322345555...| 1.0|\n", 492 | "|[1.0,27.0,6.0,2.0...| 1|[19.0905206218080...|[0.38181041243616...| 1.0|\n", 493 | "|[1.0,27.0,6.0,3.0...| 0|[16.3348579592130...|[0.32669715918426...| 1.0|\n", 494 | "|[1.0,27.0,9.0,4.0...| 0|[13.3128973003485...|[0.26625794600697...| 1.0|\n", 495 | "|[1.0,32.0,13.0,0....| 1|[16.7812008990910...|[0.33562401798182...| 1.0|\n", 496 | "|[1.0,32.0,13.0,2....| 1|[12.6966189294366...|[0.25393237858873...| 1.0|\n", 497 | "|[1.0,32.0,13.0,2....| 1|[12.6766379097881...|[0.25353275819576...| 1.0|\n", 498 | "|[1.0,32.0,16.5,2....| 1|[16.5966383870776...|[0.33193276774155...| 1.0|\n", 499 | "|[1.0,32.0,16.5,2....| 1|[16.5966383870776...|[0.33193276774155...| 1.0|\n", 500 | "|[1.0,32.0,16.5,3....| 1|[18.0962335004510...|[0.36192467000902...| 1.0|\n", 501 | "|[1.0,37.0,16.5,1....| 1|[17.1671412215484...|[0.34334282443096...| 1.0|\n", 502 | "|[1.0,37.0,16.5,2....| 1|[16.9033235593205...|[0.33806647118641...| 1.0|\n", 503 | "|[1.0,37.0,23.0,4....| 1|[12.9328949963349...|[0.25865789992669...| 1.0|\n", 504 | "|[1.0,37.0,23.0,5....| 1|[19.2882486885513...|[0.38576497377102...| 1.0|\n", 505 | "|[1.0,42.0,16.5,5....| 1|[14.7895576971207...|[0.29579115394241...| 1.0|\n", 506 | "|[1.0,42.0,23.0,2....| 1|[13.1159769754090...|[0.26231953950818...| 1.0|\n", 507 | "|[1.0,42.0,23.0,2....| 1|[16.8901830088537...|[0.33780366017707...| 1.0|\n", 508 | "|[1.0,42.0,23.0,2....| 1|[16.8901830088537...|[0.33780366017707...| 1.0|\n", 509 | "+--------------------+-------+--------------------+--------------------+----------+\n", 510 | "only showing top 20 rows\n", 511 | "\n" 512 | ] 513 | } 514 | ], 515 | "source": [ 516 | "rf_predictions.show()" 517 | ] 518 | 
}, 519 | { 520 | "cell_type": "code", 521 | "execution_count": 25, 522 | "metadata": {}, 523 | "outputs": [ 524 | { 525 | "name": "stdout", 526 | "output_type": "stream", 527 | "text": [ 528 | "+----------+-----+\n", 529 | "|prediction|count|\n", 530 | "+----------+-----+\n", 531 | "| 0.0| 1261|\n", 532 | "| 1.0| 305|\n", 533 | "+----------+-----+\n", 534 | "\n" 535 | ] 536 | } 537 | ], 538 | "source": [ 539 | "rf_predictions.groupBy('prediction').count().show()" 540 | ] 541 | }, 542 | { 543 | "cell_type": "code", 544 | "execution_count": 26, 545 | "metadata": {}, 546 | "outputs": [ 547 | { 548 | "name": "stdout", 549 | "output_type": "stream", 550 | "text": [ 551 | "+----------------------------------------+-------+----------+\n", 552 | "|probability |affairs|prediction|\n", 553 | "+----------------------------------------+-------+----------+\n", 554 | "|[0.37504919751416455,0.6249508024858356]|1 |1.0 |\n", 555 | "|[0.42004295724710805,0.579957042752892] |1 |1.0 |\n", 556 | "|[0.40233322345555694,0.597666776544443] |0 |1.0 |\n", 557 | "|[0.3818104124361619,0.6181895875638381] |1 |1.0 |\n", 558 | "|[0.32669715918426007,0.6733028408157399]|0 |1.0 |\n", 559 | "|[0.26625794600697006,0.7337420539930299]|0 |1.0 |\n", 560 | "|[0.3356240179818214,0.6643759820181787] |1 |1.0 |\n", 561 | "|[0.25393237858873335,0.7460676214112667]|1 |1.0 |\n", 562 | "|[0.2535327581957624,0.7464672418042376] |1 |1.0 |\n", 563 | "|[0.3319327677415531,0.6680672322584469] |1 |1.0 |\n", 564 | "+----------------------------------------+-------+----------+\n", 565 | "only showing top 10 rows\n", 566 | "\n" 567 | ] 568 | } 569 | ], 570 | "source": [ 571 | "rf_predictions.select(['probability','affairs','prediction']).show(10,False)" 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": 28, 577 | "metadata": {}, 578 | "outputs": [], 579 | "source": [ 580 | "from pyspark.ml.evaluation import BinaryClassificationEvaluator" 581 | ] 582 | }, 583 | { 584 | "cell_type": "code", 585 | 
"execution_count": 29, 586 | "metadata": {}, 587 | "outputs": [], 588 | "source": [ 589 | "from pyspark.ml.evaluation import MulticlassClassificationEvaluator" 590 | ] 591 | }, 592 | { 593 | "cell_type": "code", 594 | "execution_count": 30, 595 | "metadata": {}, 596 | "outputs": [], 597 | "source": [ 598 | "rf_accuracy=MulticlassClassificationEvaluator(labelCol='affairs',metricName='accuracy').evaluate(rf_predictions)" 599 | ] 600 | }, 601 | { 602 | "cell_type": "code", 603 | "execution_count": 31, 604 | "metadata": {}, 605 | "outputs": [ 606 | { 607 | "name": "stdout", 608 | "output_type": "stream", 609 | "text": [ 610 | "The accuracy of RF on test data is 73%\n" 611 | ] 612 | } 613 | ], 614 | "source": [ 615 | "print('The accuracy of RF on test data is {0:.0%}'.format(rf_accuracy))" 616 | ] 617 | }, 618 | { 619 | "cell_type": "code", 620 | "execution_count": 32, 621 | "metadata": {}, 622 | "outputs": [ 623 | { 624 | "name": "stdout", 625 | "output_type": "stream", 626 | "text": [ 627 | "0.7279693486590039\n" 628 | ] 629 | } 630 | ], 631 | "source": [ 632 | "print(rf_accuracy)" 633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "execution_count": 33, 638 | "metadata": {}, 639 | "outputs": [], 640 | "source": [ 641 | "rf_precision=MulticlassClassificationEvaluator(labelCol='affairs',metricName='weightedPrecision').evaluate(rf_predictions)" 642 | ] 643 | }, 644 | { 645 | "cell_type": "code", 646 | "execution_count": 34, 647 | "metadata": {}, 648 | "outputs": [ 649 | { 650 | "name": "stdout", 651 | "output_type": "stream", 652 | "text": [ 653 | "The precision rate on test data is 71%\n" 654 | ] 655 | } 656 | ], 657 | "source": [ 658 | "print('The precision rate on test data is {0:.0%}'.format(rf_precision))" 659 | ] 660 | }, 661 | { 662 | "cell_type": "code", 663 | "execution_count": 35, 664 | "metadata": {}, 665 | "outputs": [ 666 | { 667 | "data": { 668 | "text/plain": [ 669 | "0.7085017563673452" 670 | ] 671 | }, 672 | "execution_count": 35, 673 | 
"metadata": {}, 674 | "output_type": "execute_result" 675 | } 676 | ], 677 | "source": [ 678 | "rf_precision" 679 | ] 680 | }, 681 | { 682 | "cell_type": "code", 683 | "execution_count": 37, 684 | "metadata": {}, 685 | "outputs": [], 686 | "source": [ 687 | "rf_auc=BinaryClassificationEvaluator(labelCol='affairs').evaluate(rf_predictions)" 688 | ] 689 | }, 690 | { 691 | "cell_type": "code", 692 | "execution_count": 38, 693 | "metadata": {}, 694 | "outputs": [ 695 | { 696 | "name": "stdout", 697 | "output_type": "stream", 698 | "text": [ 699 | "0.7421702296835062\n" 700 | ] 701 | } 702 | ], 703 | "source": [ 704 | "print(rf_auc)" 705 | ] 706 | }, 707 | { 708 | "cell_type": "code", 709 | "execution_count": 39, 710 | "metadata": {}, 711 | "outputs": [], 712 | "source": [ 713 | "# Feature importance" 714 | ] 715 | }, 716 | { 717 | "cell_type": "code", 718 | "execution_count": 40, 719 | "metadata": {}, 720 | "outputs": [ 721 | { 722 | "data": { 723 | "text/plain": [ 724 | "SparseVector(5, {0: 0.5536, 1: 0.0364, 2: 0.2358, 3: 0.0803, 4: 0.0939})" 725 | ] 726 | }, 727 | "execution_count": 40, 728 | "metadata": {}, 729 | "output_type": "execute_result" 730 | } 731 | ], 732 | "source": [ 733 | "rf_classifier.featureImportances" 734 | ] 735 | }, 736 | { 737 | "cell_type": "code", 738 | "execution_count": 41, 739 | "metadata": {}, 740 | "outputs": [ 741 | { 742 | "data": { 743 | "text/plain": [ 744 | "{'numeric': [{'idx': 0, 'name': 'rate_marriage'},\n", 745 | " {'idx': 1, 'name': 'age'},\n", 746 | " {'idx': 2, 'name': 'yrs_married'},\n", 747 | " {'idx': 3, 'name': 'children'},\n", 748 | " {'idx': 4, 'name': 'religious'}]}" 749 | ] 750 | }, 751 | "execution_count": 41, 752 | "metadata": {}, 753 | "output_type": "execute_result" 754 | } 755 | ], 756 | "source": [ 757 | "df.schema[\"features\"].metadata[\"ml_attr\"][\"attrs\"]" 758 | ] 759 | }, 760 | { 761 | "cell_type": "code", 762 | "execution_count": 49, 763 | "metadata": {}, 764 | "outputs": [], 765 | "source": [ 766 | "# 
Save the model " 767 | ] 768 | }, 769 | { 770 | "cell_type": "code", 771 | "execution_count": 50, 772 | "metadata": {}, 773 | "outputs": [ 774 | { 775 | "data": { 776 | "text/plain": [ 777 | "'/home/jovyan/work'" 778 | ] 779 | }, 780 | "execution_count": 50, 781 | "metadata": {}, 782 | "output_type": "execute_result" 783 | } 784 | ], 785 | "source": [ 786 | "pwd" 787 | ] 788 | }, 789 | { 790 | "cell_type": "code", 791 | "execution_count": 52, 792 | "metadata": {}, 793 | "outputs": [], 794 | "source": [ 795 | "rf_classifier.save(\"/home/jovyan/work/RF_model\")" 796 | ] 797 | }, 798 | { 799 | "cell_type": "code", 800 | "execution_count": 54, 801 | "metadata": {}, 802 | "outputs": [], 803 | "source": [ 804 | "from pyspark.ml.classification import RandomForestClassificationModel" 805 | ] 806 | }, 807 | { 808 | "cell_type": "code", 809 | "execution_count": 55, 810 | "metadata": {}, 811 | "outputs": [], 812 | "source": [ 813 | "rf=RandomForestClassificationModel.load(\"/home/jovyan/work/RF_model\")" 814 | ] 815 | }, 816 | { 817 | "cell_type": "code", 818 | "execution_count": 58, 819 | "metadata": {}, 820 | "outputs": [], 821 | "source": [ 822 | "model_preditions=rf.transform(test_df)" 823 | ] 824 | }, 825 | { 826 | "cell_type": "code", 827 | "execution_count": 60, 828 | "metadata": {}, 829 | "outputs": [ 830 | { 831 | "name": "stdout", 832 | "output_type": "stream", 833 | "text": [ 834 | "+--------------------+-------+--------------------+--------------------+----------+\n", 835 | "| features|affairs| rawPrediction| probability|prediction|\n", 836 | "+--------------------+-------+--------------------+--------------------+----------+\n", 837 | "|[1.0,22.0,2.5,1.0...| 1|[188.676639360932...|[0.37735327872186...| 1.0|\n", 838 | "|[1.0,27.0,2.5,0.0...| 1|[195.425833792250...|[0.39085166758450...| 1.0|\n", 839 | "|[1.0,27.0,6.0,0.0...| 0|[193.138478579040...|[0.38627695715808...| 1.0|\n", 840 | "|[1.0,27.0,6.0,2.0...| 1|[185.424877645536...|[0.37084975529107...| 1.0|\n", 841 
| "|[1.0,27.0,6.0,3.0...| 0|[164.685852316351...|[0.32937170463270...| 1.0|\n", 842 | "|[1.0,27.0,9.0,4.0...| 0|[142.006095001922...|[0.28401219000384...| 1.0|\n", 843 | "|[1.0,32.0,13.0,0....| 1|[176.885399312490...|[0.35377079862498...| 1.0|\n", 844 | "|[1.0,32.0,13.0,2....| 1|[128.585405941664...|[0.25717081188332...| 1.0|\n", 845 | "|[1.0,32.0,13.0,2....| 1|[126.963464206019...|[0.25392692841203...| 1.0|\n", 846 | "|[1.0,32.0,16.5,2....| 1|[153.429839787005...|[0.30685967957401...| 1.0|\n", 847 | "|[1.0,32.0,16.5,2....| 1|[153.429839787005...|[0.30685967957401...| 1.0|\n", 848 | "|[1.0,32.0,16.5,3....| 1|[167.493344562280...|[0.33498668912456...| 1.0|\n", 849 | "|[1.0,37.0,16.5,1....| 1|[150.090425556233...|[0.30018085111246...| 1.0|\n", 850 | "|[1.0,37.0,16.5,2....| 1|[154.326506925977...|[0.30865301385195...| 1.0|\n", 851 | "|[1.0,37.0,23.0,4....| 1|[123.559481897677...|[0.24711896379535...| 1.0|\n", 852 | "|[1.0,37.0,23.0,5....| 1|[192.317134975430...|[0.38463426995086...| 1.0|\n", 853 | "|[1.0,42.0,16.5,5....| 1|[151.350101320899...|[0.30270020264179...| 1.0|\n", 854 | "|[1.0,42.0,23.0,2....| 1|[126.937590310819...|[0.25387518062163...| 1.0|\n", 855 | "|[1.0,42.0,23.0,2....| 1|[156.647804216314...|[0.31329560843262...| 1.0|\n", 856 | "|[1.0,42.0,23.0,2....| 1|[156.647804216314...|[0.31329560843262...| 1.0|\n", 857 | "+--------------------+-------+--------------------+--------------------+----------+\n", 858 | "only showing top 20 rows\n", 859 | "\n" 860 | ] 861 | } 862 | ], 863 | "source": [ 864 | "model_preditions.show()" 865 | ] 866 | }, 867 | { 868 | "cell_type": "code", 869 | "execution_count": null, 870 | "metadata": {}, 871 | "outputs": [], 872 | "source": [] 873 | } 874 | ], 875 | "metadata": { 876 | "kernelspec": { 877 | "display_name": "Python 3", 878 | "language": "python", 879 | "name": "python3" 880 | }, 881 | "language_info": { 882 | "codemirror_mode": { 883 | "name": "ipython", 884 | "version": 3 885 | }, 886 | "file_extension": ".py", 887 | 
"mimetype": "text/x-python", 888 | "name": "python", 889 | "nbconvert_exporter": "python", 890 | "pygments_lexer": "ipython3", 891 | "version": "3.6.3" 892 | } 893 | }, 894 | "nbformat": 4, 895 | "nbformat_minor": 2 896 | } 897 | -------------------------------------------------------------------------------- /chapter_7_Clustering/iris_dataset.csv: -------------------------------------------------------------------------------- 1 | sepal_length,sepal_width,petal_length,petal_width,species 2 | 5.1,3.5,1.4,0.2,setosa 3 | 4.9,3,1.4,0.2,setosa 4 | 4.7,3.2,1.3,0.2,setosa 5 | 4.6,3.1,1.5,0.2,setosa 6 | 5,3.6,1.4,0.2,setosa 7 | 5.4,3.9,1.7,0.4,setosa 8 | 4.6,3.4,1.4,0.3,setosa 9 | 5,3.4,1.5,0.2,setosa 10 | 4.4,2.9,1.4,0.2,setosa 11 | 4.9,3.1,1.5,0.1,setosa 12 | 5.4,3.7,1.5,0.2,setosa 13 | 4.8,3.4,1.6,0.2,setosa 14 | 4.8,3,1.4,0.1,setosa 15 | 4.3,3,1.1,0.1,setosa 16 | 5.8,4,1.2,0.2,setosa 17 | 5.7,4.4,1.5,0.4,setosa 18 | 5.4,3.9,1.3,0.4,setosa 19 | 5.1,3.5,1.4,0.3,setosa 20 | 5.7,3.8,1.7,0.3,setosa 21 | 5.1,3.8,1.5,0.3,setosa 22 | 5.4,3.4,1.7,0.2,setosa 23 | 5.1,3.7,1.5,0.4,setosa 24 | 4.6,3.6,1,0.2,setosa 25 | 5.1,3.3,1.7,0.5,setosa 26 | 4.8,3.4,1.9,0.2,setosa 27 | 5,3,1.6,0.2,setosa 28 | 5,3.4,1.6,0.4,setosa 29 | 5.2,3.5,1.5,0.2,setosa 30 | 5.2,3.4,1.4,0.2,setosa 31 | 4.7,3.2,1.6,0.2,setosa 32 | 4.8,3.1,1.6,0.2,setosa 33 | 5.4,3.4,1.5,0.4,setosa 34 | 5.2,4.1,1.5,0.1,setosa 35 | 5.5,4.2,1.4,0.2,setosa 36 | 4.9,3.1,1.5,0.1,setosa 37 | 5,3.2,1.2,0.2,setosa 38 | 5.5,3.5,1.3,0.2,setosa 39 | 4.9,3.1,1.5,0.1,setosa 40 | 4.4,3,1.3,0.2,setosa 41 | 5.1,3.4,1.5,0.2,setosa 42 | 5,3.5,1.3,0.3,setosa 43 | 4.5,2.3,1.3,0.3,setosa 44 | 4.4,3.2,1.3,0.2,setosa 45 | 5,3.5,1.6,0.6,setosa 46 | 5.1,3.8,1.9,0.4,setosa 47 | 4.8,3,1.4,0.3,setosa 48 | 5.1,3.8,1.6,0.2,setosa 49 | 4.6,3.2,1.4,0.2,setosa 50 | 5.3,3.7,1.5,0.2,setosa 51 | 5,3.3,1.4,0.2,setosa 52 | 7,3.2,4.7,1.4,versicolor 53 | 6.4,3.2,4.5,1.5,versicolor 54 | 6.9,3.1,4.9,1.5,versicolor 55 | 5.5,2.3,4,1.3,versicolor 56 | 
6.5,2.8,4.6,1.5,versicolor 57 | 5.7,2.8,4.5,1.3,versicolor 58 | 6.3,3.3,4.7,1.6,versicolor 59 | 4.9,2.4,3.3,1,versicolor 60 | 6.6,2.9,4.6,1.3,versicolor 61 | 5.2,2.7,3.9,1.4,versicolor 62 | 5,2,3.5,1,versicolor 63 | 5.9,3,4.2,1.5,versicolor 64 | 6,2.2,4,1,versicolor 65 | 6.1,2.9,4.7,1.4,versicolor 66 | 5.6,2.9,3.6,1.3,versicolor 67 | 6.7,3.1,4.4,1.4,versicolor 68 | 5.6,3,4.5,1.5,versicolor 69 | 5.8,2.7,4.1,1,versicolor 70 | 6.2,2.2,4.5,1.5,versicolor 71 | 5.6,2.5,3.9,1.1,versicolor 72 | 5.9,3.2,4.8,1.8,versicolor 73 | 6.1,2.8,4,1.3,versicolor 74 | 6.3,2.5,4.9,1.5,versicolor 75 | 6.1,2.8,4.7,1.2,versicolor 76 | 6.4,2.9,4.3,1.3,versicolor 77 | 6.6,3,4.4,1.4,versicolor 78 | 6.8,2.8,4.8,1.4,versicolor 79 | 6.7,3,5,1.7,versicolor 80 | 6,2.9,4.5,1.5,versicolor 81 | 5.7,2.6,3.5,1,versicolor 82 | 5.5,2.4,3.8,1.1,versicolor 83 | 5.5,2.4,3.7,1,versicolor 84 | 5.8,2.7,3.9,1.2,versicolor 85 | 6,2.7,5.1,1.6,versicolor 86 | 5.4,3,4.5,1.5,versicolor 87 | 6,3.4,4.5,1.6,versicolor 88 | 6.7,3.1,4.7,1.5,versicolor 89 | 6.3,2.3,4.4,1.3,versicolor 90 | 5.6,3,4.1,1.3,versicolor 91 | 5.5,2.5,4,1.3,versicolor 92 | 5.5,2.6,4.4,1.2,versicolor 93 | 6.1,3,4.6,1.4,versicolor 94 | 5.8,2.6,4,1.2,versicolor 95 | 5,2.3,3.3,1,versicolor 96 | 5.6,2.7,4.2,1.3,versicolor 97 | 5.7,3,4.2,1.2,versicolor 98 | 5.7,2.9,4.2,1.3,versicolor 99 | 6.2,2.9,4.3,1.3,versicolor 100 | 5.1,2.5,3,1.1,versicolor 101 | 5.7,2.8,4.1,1.3,versicolor 102 | 6.3,3.3,6,2.5,virginica 103 | 5.8,2.7,5.1,1.9,virginica 104 | 7.1,3,5.9,2.1,virginica 105 | 6.3,2.9,5.6,1.8,virginica 106 | 6.5,3,5.8,2.2,virginica 107 | 7.6,3,6.6,2.1,virginica 108 | 4.9,2.5,4.5,1.7,virginica 109 | 7.3,2.9,6.3,1.8,virginica 110 | 6.7,2.5,5.8,1.8,virginica 111 | 7.2,3.6,6.1,2.5,virginica 112 | 6.5,3.2,5.1,2,virginica 113 | 6.4,2.7,5.3,1.9,virginica 114 | 6.8,3,5.5,2.1,virginica 115 | 5.7,2.5,5,2,virginica 116 | 5.8,2.8,5.1,2.4,virginica 117 | 6.4,3.2,5.3,2.3,virginica 118 | 6.5,3,5.5,1.8,virginica 119 | 7.7,3.8,6.7,2.2,virginica 120 | 
7.7,2.6,6.9,2.3,virginica 121 | 6,2.2,5,1.5,virginica 122 | 6.9,3.2,5.7,2.3,virginica 123 | 5.6,2.8,4.9,2,virginica 124 | 7.7,2.8,6.7,2,virginica 125 | 6.3,2.7,4.9,1.8,virginica 126 | 6.7,3.3,5.7,2.1,virginica 127 | 7.2,3.2,6,1.8,virginica 128 | 6.2,2.8,4.8,1.8,virginica 129 | 6.1,3,4.9,1.8,virginica 130 | 6.4,2.8,5.6,2.1,virginica 131 | 7.2,3,5.8,1.6,virginica 132 | 7.4,2.8,6.1,1.9,virginica 133 | 7.9,3.8,6.4,2,virginica 134 | 6.4,2.8,5.6,2.2,virginica 135 | 6.3,2.8,5.1,1.5,virginica 136 | 6.1,2.6,5.6,1.4,virginica 137 | 7.7,3,6.1,2.3,virginica 138 | 6.3,3.4,5.6,2.4,virginica 139 | 6.4,3.1,5.5,1.8,virginica 140 | 6,3,4.8,1.8,virginica 141 | 6.9,3.1,5.4,2.1,virginica 142 | 6.7,3.1,5.6,2.4,virginica 143 | 6.9,3.1,5.1,2.3,virginica 144 | 5.8,2.7,5.1,1.9,virginica 145 | 6.8,3.2,5.9,2.3,virginica 146 | 6.7,3.3,5.7,2.5,virginica 147 | 6.7,3,5.2,2.3,virginica 148 | 6.3,2.5,5,1.9,virginica 149 | 6.5,3,5.2,2,virginica 150 | 6.2,3.4,5.4,2.3,virginica 151 | 5.9,3,5.1,1.8,virginica -------------------------------------------------------------------------------- /chapter_8_Recommender_System/Recommender_System_PySpark.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 340, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "#import and create SparkSession object\n", 10 | "from pyspark.sql import SparkSession \n", 11 | "spark=SparkSession.builder.appName('rc').getOrCreate()" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 341, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "#import the required functions and libraries\n", 21 | "from pyspark.sql.functions import *" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 342, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "#load the dataset and create Spark dataframe\n", 31 | 
"df=spark.read.csv('movie_ratings_df.csv',inferSchema=True,header=True)" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 343, 37 | "metadata": {}, 38 | "outputs": [ 39 | { 40 | "name": "stdout", 41 | "output_type": "stream", 42 | "text": [ 43 | "(100000, 3)\n" 44 | ] 45 | } 46 | ], 47 | "source": [ 48 | "#validate the shape of the data \n", 49 | "print((df.count(),len(df.columns)))" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 344, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "name": "stdout", 59 | "output_type": "stream", 60 | "text": [ 61 | "root\n", 62 | " |-- userId: integer (nullable = true)\n", 63 | " |-- title: string (nullable = true)\n", 64 | " |-- rating: integer (nullable = true)\n", 65 | "\n" 66 | ] 67 | } 68 | ], 69 | "source": [ 70 | "#check columns in dataframe\n", 71 | "df.printSchema()" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 346, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "name": "stdout", 81 | "output_type": "stream", 82 | "text": [ 83 | "+------+-----------------------------+------+\n", 84 | "|userId|title |rating|\n", 85 | "+------+-----------------------------+------+\n", 86 | "|840 |Amistad (1997) |4 |\n", 87 | "|711 |Grand Day Out, A (1992) |5 |\n", 88 | "|311 |Casper (1995) |2 |\n", 89 | "|717 |Ulee's Gold (1997) |4 |\n", 90 | "|389 |Miracle on 34th Street (1994)|5 |\n", 91 | "|416 |Cool Runnings (1993) |3 |\n", 92 | "|449 |Withnail and I (1987) |5 |\n", 93 | "|796 |To Kill a Mockingbird (1962) |4 |\n", 94 | "|658 |Toy Story (1995) |4 |\n", 95 | "|345 |Shining, The (1980) |4 |\n", 96 | "+------+-----------------------------+------+\n", 97 | "only showing top 10 rows\n", 98 | "\n" 99 | ] 100 | } 101 | ], 102 | "source": [ 103 | "#validate few rows of dataframe in random order\n", 104 | "df.orderBy(rand()).show(10,False)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 347, 110 | "metadata": {}, 111 | "outputs": [ 112 
| { 113 | "name": "stdout", 114 | "output_type": "stream", 115 | "text": [ 116 | "+------+-----+\n", 117 | "|userId|count|\n", 118 | "+------+-----+\n", 119 | "|405 |737 |\n", 120 | "|655 |685 |\n", 121 | "|13 |636 |\n", 122 | "|450 |540 |\n", 123 | "|276 |518 |\n", 124 | "|416 |493 |\n", 125 | "|537 |490 |\n", 126 | "|303 |484 |\n", 127 | "|234 |480 |\n", 128 | "|393 |448 |\n", 129 | "+------+-----+\n", 130 | "only showing top 10 rows\n", 131 | "\n" 132 | ] 133 | } 134 | ], 135 | "source": [ 136 | "#check number of ratings by each user (most active users first)\n", 137 | "df.groupBy('userId').count().orderBy('count',ascending=False).show(10,False)" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 348, 143 | "metadata": {}, 144 | "outputs": [ 145 | { 146 | "name": "stdout", 147 | "output_type": "stream", 148 | "text": [ 149 | "+------+-----+\n", 150 | "|userId|count|\n", 151 | "+------+-----+\n", 152 | "|732 |20 |\n", 153 | "|636 |20 |\n", 154 | "|631 |20 |\n", 155 | "|926 |20 |\n", 156 | "|93 |20 |\n", 157 | "|596 |20 |\n", 158 | "|572 |20 |\n", 159 | "|34 |20 |\n", 160 | "|685 |20 |\n", 161 | "|300 |20 |\n", 162 | "+------+-----+\n", 163 | "only showing top 10 rows\n", 164 | "\n" 165 | ] 166 | } 167 | ], 168 | "source": [ 169 | "#check number of ratings by each user (least active users first)\n", 170 | "df.groupBy('userId').count().orderBy('count',ascending=True).show(10,False)" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 349, 176 | "metadata": {}, 177 | "outputs": [ 178 | { 179 | "name": "stdout", 180 | "output_type": "stream", 181 | "text": [ 182 | "+-----------------------------+-----+\n", 183 | "|title |count|\n", 184 | "+-----------------------------+-----+\n", 185 | "|Star Wars (1977) |583 |\n", 186 | "|Contact (1997) |509 |\n", 187 | "|Fargo (1996) |508 |\n", 188 | "|Return of the Jedi (1983) |507 |\n", 189 | "|Liar Liar (1997) |485 |\n", 190 | "|English Patient, The (1996) |481 |\n", 191 | 
"|Scream (1996) |478 |\n", 192 | "|Toy Story (1995) |452 
|\n", 193 | "|Air Force One (1997) |431 |\n", 194 | "|Independence Day (ID4) (1996)|429 |\n", 195 | "+-----------------------------+-----+\n", 196 | "only showing top 10 rows\n", 197 | "\n" 198 | ] 199 | } 200 | ], 201 | "source": [ 202 | "#number of times each movie has been rated \n", 203 | "df.groupBy('title').count().orderBy('count',ascending=False).show(10,False)" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 350, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "name": "stdout", 213 | "output_type": "stream", 214 | "text": [ 215 | "+-----------------------------------------+-----+\n", 216 | "|title |count|\n", 217 | "+-----------------------------------------+-----+\n", 218 | "|Lashou shentan (1992) |1 |\n", 219 | "|Fear, The (1995) |1 |\n", 220 | "|Aiqing wansui (1994) |1 |\n", 221 | "|Mad Dog Time (1996) |1 |\n", 222 | "|Leopard Son, The (1996) |1 |\n", 223 | "|Next Step, The (1995) |1 |\n", 224 | "|Target (1995) |1 |\n", 225 | "|Vie est belle, La (Life is Rosey) (1987) |1 |\n", 226 | "|Modern Affair, A (1995) |1 |\n", 227 | "|JLG/JLG - autoportrait de décembre (1994)|1 |\n", 228 | "+-----------------------------------------+-----+\n", 229 | "only showing top 10 rows\n", 230 | "\n" 231 | ] 232 | } 233 | ], 234 | "source": [ 235 | "df.groupBy('title').count().orderBy('count',ascending=True).show(10,False)" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 291, 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "#import String indexer to convert string values to numeric values\n", 245 | "from pyspark.ml.feature import StringIndexer,IndexToString" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": 292, 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [ 254 | "#creating string indexer to convert the movie title column values into numerical values\n", 255 | "stringIndexer = StringIndexer(inputCol=\"title\", outputCol=\"title_new\")" 256 | ] 257 | }, 258 | { 
259 | "cell_type": "code", 260 | "execution_count": 293, 261 | "metadata": {}, 262 | "outputs": [], 263 | "source": [ 264 | "#applying stringindexer object on dataframe movie title column\n", 265 | "model = stringIndexer.fit(df)" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": 294, 271 | "metadata": {}, 272 | "outputs": [], 273 | "source": [ 274 | "#creating new dataframe with transformed values\n", 275 | "indexed = model.transform(df)" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": 295, 281 | "metadata": {}, 282 | "outputs": [ 283 | { 284 | "name": "stdout", 285 | "output_type": "stream", 286 | "text": [ 287 | "+------+--------------------+------+---------+\n", 288 | "|userId| title|rating|title_new|\n", 289 | "+------+--------------------+------+---------+\n", 290 | "| 932| Cape Fear (1991)| 3| 161.0|\n", 291 | "| 721| Piano, The (1993)| 3| 173.0|\n", 292 | "| 642|Low Down Dirty Sh...| 2| 1115.0|\n", 293 | "| 798|That Darn Cat! (1...| 4| 686.0|\n", 294 | "| 535|African Queen, Th...| 4| 199.0|\n", 295 | "| 765|Stealing Beauty (...| 5| 521.0|\n", 296 | "| 927|Poison Ivy II (1995)| 3| 1041.0|\n", 297 | "| 544| G.I. 
Jane (1997)| 3| 152.0|\n", 298 | "| 788|Godfather: Part I...| 4| 108.0|\n", 299 | "| 706|Birdcage, The (1996)| 4| 43.0|\n", 300 | "+------+--------------------+------+---------+\n", 301 | "only showing top 10 rows\n", 302 | "\n" 303 | ] 304 | } 305 | ], 306 | "source": [ 307 | "#validate the numerical title values\n", 308 | "indexed.show(10)" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": 296, 314 | "metadata": {}, 315 | "outputs": [ 316 | { 317 | "name": "stdout", 318 | "output_type": "stream", 319 | "text": [ 320 | "+---------+-----+\n", 321 | "|title_new|count|\n", 322 | "+---------+-----+\n", 323 | "|0.0 |583 |\n", 324 | "|1.0 |509 |\n", 325 | "|2.0 |508 |\n", 326 | "|3.0 |507 |\n", 327 | "|4.0 |485 |\n", 328 | "|5.0 |481 |\n", 329 | "|6.0 |478 |\n", 330 | "|7.0 |452 |\n", 331 | "|8.0 |431 |\n", 332 | "|9.0 |429 |\n", 333 | "+---------+-----+\n", 334 | "only showing top 10 rows\n", 335 | "\n" 336 | ] 337 | } 338 | ], 339 | "source": [ 340 | "#number of times each numerical movie title has been rated \n", 341 | "indexed.groupBy('title_new').count().orderBy('count',ascending=False).show(10,False)" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": 297, 347 | "metadata": {}, 348 | "outputs": [], 349 | "source": [ 350 | "#split the data into training and test datasets\n", 351 | "train,test=indexed.randomSplit([0.75,0.25])" 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": 298, 357 | "metadata": {}, 358 | "outputs": [ 359 | { 360 | "data": { 361 | "text/plain": [ 362 | "75104" 363 | ] 364 | }, 365 | "execution_count": 298, 366 | "metadata": {}, 367 | "output_type": "execute_result" 368 | } 369 | ], 370 | "source": [ 371 | "#count number of records in train set\n", 372 | "train.count()" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": 299, 378 | "metadata": {}, 379 | "outputs": [ 380 | { 381 | "data": { 382 | "text/plain": [ 383 | "24876" 384 | ] 385 | }, 
386 | "execution_count": 299, 387 | "metadata": {}, 388 | "output_type": "execute_result" 389 | } 390 | ], 391 | "source": [ 392 | "#count number of records in test set\n", 393 | "test.count()" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": 300, 399 | "metadata": {}, 400 | "outputs": [], 401 | "source": [ 402 | "#import ALS recommender class from pyspark ml library\n", 403 | "from pyspark.ml.recommendation import ALS" 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": 301, 409 | "metadata": {}, 410 | "outputs": [], 411 | "source": [ 412 | "#create the ALS recommender object (it is fitted on the train dataset in the next cell)\n", 413 | "rec=ALS(maxIter=10,regParam=0.01,userCol='userId',itemCol='title_new',ratingCol='rating',nonnegative=True,coldStartStrategy=\"drop\")" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": 302, 419 | "metadata": {}, 420 | "outputs": [], 421 | "source": [ 422 | "#fit the model on train set\n", 423 | "rec_model=rec.fit(train)" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": 303, 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "#making predictions on test set \n", 433 | "predicted_ratings=rec_model.transform(test)" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": 337, 439 | "metadata": {}, 440 | "outputs": [ 441 | { 442 | "name": "stdout", 443 | "output_type": "stream", 444 | "text": [ 445 | "root\n", 446 | " |-- userId: integer (nullable = true)\n", 447 | " |-- title: string (nullable = true)\n", 448 | " |-- rating: integer (nullable = true)\n", 449 | " |-- title_new: double (nullable = false)\n", 450 | " |-- prediction: float (nullable = false)\n", 451 | "\n" 452 | ] 453 | } 454 | ], 455 | "source": [ 456 | "#columns in predicted ratings dataframe\n", 457 | "predicted_ratings.printSchema()" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": 304, 463 | "metadata": {}, 
464 | "outputs": [ 
465 | { 466 | "name": "stdout", 467 | "output_type": "stream", 468 | "text": [ 469 | "+------+--------------------+------+---------+----------+\n", 470 | "|userId| title|rating|title_new|prediction|\n", 471 | "+------+--------------------+------+---------+----------+\n", 472 | "| 92|Tie Me Up! Tie Me...| 4| 766.0| 3.1512196|\n", 473 | "| 222| Batman (1989)| 3| 116.0| 3.503284|\n", 474 | "| 178|Beauty and the Be...| 4| 114.0| 4.1487904|\n", 475 | "| 303|Jerry Maguire (1996)| 5| 15.0| 4.348913|\n", 476 | "| 134| Flubber (1997)| 2| 579.0| 2.5635276|\n", 477 | "| 295| Henry V (1989)| 4| 268.0| 4.2598643|\n", 478 | "| 889|Adventures of Pri...| 2| 305.0| 2.9040515|\n", 479 | "| 374| Men in Black (1997)| 3| 31.0| 3.602631|\n", 480 | "| 559|Killing Fields, T...| 4| 276.0| 4.55797|\n", 481 | "| 290|Star Trek: The Mo...| 1| 286.0| 3.2992659|\n", 482 | "+------+--------------------+------+---------+----------+\n", 483 | "only showing top 10 rows\n", 484 | "\n" 485 | ] 486 | } 487 | ], 488 | "source": [ 489 | "#predicted vs actual ratings for test set \n", 490 | "predicted_ratings.orderBy(rand()).show(10)" 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": 305, 496 | "metadata": {}, 497 | "outputs": [], 498 | "source": [ 499 | "#importing Regression Evaluator to measure RMSE\n", 500 | "from pyspark.ml.evaluation import RegressionEvaluator" 501 | ] 502 | }, 503 | { 504 | "cell_type": "code", 505 | "execution_count": 306, 506 | "metadata": {}, 507 | "outputs": [], 508 | "source": [ 509 | "#create RegressionEvaluator object for measuring RMSE\n", 510 | "evaluator=RegressionEvaluator(metricName='rmse',predictionCol='prediction',labelCol='rating')" 511 | ] 512 | }, 513 | { 514 | "cell_type": "code", 515 | "execution_count": 307, 516 | "metadata": {}, 517 | "outputs": [], 518 | "source": [ 519 | "#apply the evaluator on the predicted_ratings dataframe to calculate RMSE\n", 520 | "rmse=evaluator.evaluate(predicted_ratings)" 521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 
525 | "execution_count": 308, 526 | "metadata": {}, 527 | "outputs": [ 528 | { 529 | "name": "stdout", 530 | "output_type": "stream", 531 | "text": [ 532 | "1.0293574739493354\n" 533 | ] 534 | } 535 | ], 536 | "source": [ 537 | "#print RMSE error\n", 538 | "print(rmse)" 539 | ] 540 | }, 541 | { 542 | "cell_type": "code", 543 | "execution_count": 309, 544 | "metadata": {}, 545 | "outputs": [], 546 | "source": [ 547 | "#Recommend top movies which user might like " 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": 310, 553 | "metadata": {}, 554 | "outputs": [], 555 | "source": [ 556 | "#create dataset of all distinct movies \n", 557 | "unique_movies=indexed.select('title_new').distinct()" 558 | ] 559 | }, 560 | { 561 | "cell_type": "code", 562 | "execution_count": 311, 563 | "metadata": {}, 564 | "outputs": [ 565 | { 566 | "data": { 567 | "text/plain": [ 568 | "1664" 569 | ] 570 | }, 571 | "execution_count": 311, 572 | "metadata": {}, 573 | "output_type": "execute_result" 574 | } 575 | ], 576 | "source": [ 577 | "#number of unique movies\n", 578 | "unique_movies.count()" 579 | ] 580 | }, 581 | { 582 | "cell_type": "code", 583 | "execution_count": 312, 584 | "metadata": {}, 585 | "outputs": [], 586 | "source": [ 587 | "#assigning alias name 'a' to unique movies df\n", 588 | "a = unique_movies.alias('a')" 589 | ] 590 | }, 591 | { 592 | "cell_type": "code", 593 | "execution_count": 336, 594 | "metadata": {}, 595 | "outputs": [], 596 | "source": [ 597 | "user_id=85" 598 | ] 599 | }, 600 | { 601 | "cell_type": "code", 602 | "execution_count": 321, 603 | "metadata": {}, 604 | "outputs": [], 605 | "source": [ 606 | "#creating another dataframe which contains already watched movie by active user \n", 607 | "watched_movies=indexed.filter(indexed['userId'] == user_id).select('title_new').distinct()" 608 | ] 609 | }, 610 | { 611 | "cell_type": "code", 612 | "execution_count": 322, 613 | "metadata": {}, 614 | "outputs": [ 615 | { 616 | "data": { 617 | 
"text/plain": [ 618 | "287" 619 | ] 620 | }, 621 | "execution_count": 322, 622 | "metadata": {}, 623 | "output_type": "execute_result" 624 | } 625 | ], 626 | "source": [ 627 | "#number of movies already rated \n", 628 | "watched_movies.count()" 629 | ] 630 | }, 631 | { 632 | "cell_type": "code", 633 | "execution_count": 323, 634 | "metadata": {}, 635 | "outputs": [], 636 | "source": [ 637 | "#assigning alias name 'b' to watched movies df\n", 638 | "b=watched_movies.alias('b')" 639 | ] 640 | }, 641 | { 642 | "cell_type": "code", 643 | "execution_count": 324, 644 | "metadata": {}, 645 | "outputs": [], 646 | "source": [ 647 | "#joining both tables on left join \n", 648 | "total_movies = a.join(b, a.title_new == b.title_new,how='left')\n" 649 | ] 650 | }, 651 | { 652 | "cell_type": "code", 653 | "execution_count": 325, 654 | "metadata": {}, 655 | "outputs": [ 656 | { 657 | "name": "stdout", 658 | "output_type": "stream", 659 | "text": [ 660 | "+---------+---------+\n", 661 | "|title_new|title_new|\n", 662 | "+---------+---------+\n", 663 | "|299.0 |null |\n", 664 | "|558.0 |null |\n", 665 | "|305.0 |305.0 |\n", 666 | "|596.0 |null |\n", 667 | "|1051.0 |null |\n", 668 | "|934.0 |null |\n", 669 | "|496.0 |496.0 |\n", 670 | "|769.0 |null |\n", 671 | "|692.0 |null |\n", 672 | "|720.0 |null |\n", 673 | "+---------+---------+\n", 674 | "only showing top 10 rows\n", 675 | "\n" 676 | ] 677 | } 678 | ], 679 | "source": [ 680 | "total_movies.show(10,False)" 681 | ] 682 | }, 683 | { 684 | "cell_type": "code", 685 | "execution_count": 326, 686 | "metadata": {}, 687 | "outputs": [], 688 | "source": [ 689 | "#selecting movies which active user is yet to rate or watch\n", 690 | "remaining_movies=total_movies.where(col(\"b.title_new\").isNull()).select(a.title_new).distinct()" 691 | ] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "execution_count": 327, 696 | "metadata": {}, 697 | "outputs": [ 698 | { 699 | "data": { 700 | "text/plain": [ 701 | "1377" 702 | ] 703 | }, 704 | 
"execution_count": 327, 705 | "metadata": {}, 706 | "output_type": "execute_result" 707 | } 708 | ], 709 | "source": [ 710 | "#number of movies user is yet to rate \n", 711 | "remaining_movies.count()" 712 | ] 713 | }, 714 | { 715 | "cell_type": "code", 716 | "execution_count": 328, 717 | "metadata": {}, 718 | "outputs": [], 719 | "source": [ 720 | "#adding a userId column holding the active user's id to remaining movies df \n", 721 | "remaining_movies=remaining_movies.withColumn(\"userId\",lit(int(user_id)))\n" 722 | ] 723 | }, 724 | { 725 | "cell_type": "code", 726 | "execution_count": 329, 727 | "metadata": {}, 728 | "outputs": [ 729 | { 730 | "name": "stdout", 731 | "output_type": "stream", 732 | "text": [ 733 | "+---------+------+\n", 734 | "|title_new|userId|\n", 735 | "+---------+------+\n", 736 | "|299.0 |85 |\n", 737 | "|558.0 |85 |\n", 738 | "|596.0 |85 |\n", 739 | "|1051.0 |85 |\n", 740 | "|934.0 |85 |\n", 741 | "|769.0 |85 |\n", 742 | "|692.0 |85 |\n", 743 | "|720.0 |85 |\n", 744 | "|576.0 |85 |\n", 745 | "|810.0 |85 |\n", 746 | "+---------+------+\n", 747 | "only showing top 10 rows\n", 748 | "\n" 749 | ] 750 | } 751 | ], 752 | "source": [ 753 | "remaining_movies.show(10,False)" 754 | ] 755 | }, 756 | { 757 | "cell_type": "code", 758 | "execution_count": 333, 759 | "metadata": {}, 760 | "outputs": [], 761 | "source": [ 762 | "#making recommendations using ALS recommender model and sorting by predicted rating (highest first)\n", 763 | "recommendations=rec_model.transform(remaining_movies).orderBy('prediction',ascending=False)" 764 | ] 765 | }, 766 | { 767 | "cell_type": "code", 768 | "execution_count": 332, 769 | "metadata": {}, 770 | "outputs": [ 771 | { 772 | "name": "stdout", 773 | "output_type": "stream", 774 | "text": [ 775 | "+---------+------+----------+\n", 776 | "|title_new|userId|prediction|\n", 777 | "+---------+------+----------+\n", 778 | "|1433.0 |85 |4.9689837 |\n", 779 | "|1322.0 |85 |4.6927013 |\n", 780 | "|1271.0 |85 |4.605163 |\n", 781 | "|1470.0 |85 
|4.5409293 |\n", 782 | "|705.0 |85 |4.532007 |\n", 783 | "+---------+------+----------+\n", 784 | "\n" 785 | ] 786 | } 787 | ], 788 | "source": [ 789 | "recommendations.show(5,False)" 790 | ] 791 | }, 792 | { 793 | "cell_type": "code", 794 | "execution_count": 334, 795 | "metadata": {}, 796 | "outputs": [], 797 | "source": [ 798 | "#converting title_new values back to movie titles\n", 799 | "movie_title = IndexToString(inputCol=\"title_new\", outputCol=\"title\",labels=model.labels)\n", 800 | "\n", 801 | "final_recommendations=movie_title.transform(recommendations)\n" 802 | ] 803 | }, 804 | { 805 | "cell_type": "code", 806 | "execution_count": 335, 807 | "metadata": {}, 808 | "outputs": [ 809 | { 810 | "name": "stdout", 811 | "output_type": "stream", 812 | "text": [ 813 | "+---------+------+----------+----------------------------+\n", 814 | "|title_new|userId|prediction|title |\n", 815 | "+---------+------+----------+----------------------------+\n", 816 | "|1433.0 |85 |4.9689837 |Boys, Les (1997) |\n", 817 | "|1322.0 |85 |4.6927013 |Faust (1994) |\n", 818 | "|1271.0 |85 |4.605163 |Whole Wide World, The (1996)|\n", 819 | "|1470.0 |85 |4.5409293 |Some Mother's Son (1996) |\n", 820 | "|705.0 |85 |4.532007 |Laura (1944) |\n", 821 | "|303.0 |85 |4.5236835 |Close Shave, A (1995) |\n", 822 | "|1121.0 |85 |4.4936523 |Crooklyn (1994) |\n", 823 | "|1195.0 |85 |4.4636283 |Pather Panchali (1955) |\n", 824 | "|285.0 |85 |4.456875 |Wrong Trousers, The (1993) |\n", 825 | "|638.0 |85 |4.4495435 |Shall We Dance? 
(1996) |\n", 826 | "+---------+------+----------+----------------------------+\n", 827 | "only showing top 10 rows\n", 828 | "\n" 829 | ] 830 | } 831 | ], 832 | "source": [ 833 | "final_recommendations.show(10,False)" 834 | ] 835 | }, 836 | { 837 | "cell_type": "code", 838 | "execution_count": 338, 839 | "metadata": {}, 840 | "outputs": [], 841 | "source": [ 842 | "#create function to recommend top 'n' movies to any particular user\n", 843 | "def top_movies(user_id,n):\n", 844 | " \"\"\"\n", 845 | " This function returns the top 'n' movies that user has not seen yet but might like \n", 846 | " \n", 847 | " \"\"\"\n", 848 | " #assigning alias name 'a' to unique movies df\n", 849 | " a = unique_movies.alias('a')\n", 850 | " \n", 851 | " #creating another dataframe which contains already watched movie by active user \n", 852 | " watched_movies=indexed.filter(indexed['userId'] == user_id).select('title_new')\n", 853 | " \n", 854 | " #assigning alias name 'b' to watched movies df\n", 855 | " b=watched_movies.alias('b')\n", 856 | " \n", 857 | " #joining both tables on left join \n", 858 | " total_movies = a.join(b, a.title_new == b.title_new,how='left')\n", 859 | " \n", 860 | " #selecting movies which active user is yet to rate or watch\n", 861 | " remaining_movies=total_movies.where(col(\"b.title_new\").isNull()).select(a.title_new).distinct()\n", 862 | " \n", 863 | " \n", 864 | " #adding a userId column holding the active user's id to remaining movies df \n", 865 | " remaining_movies=remaining_movies.withColumn(\"userId\",lit(int(user_id)))\n", 866 | " \n", 867 | " \n", 868 | " #making recommendations using ALS recommender model and selecting only top 'n' movies\n", 869 | " recommendations=rec_model.transform(remaining_movies).orderBy('prediction',ascending=False).limit(n)\n", 870 | " \n", 871 | " \n", 872 | " #adding columns of movie titles in recommendations\n", 873 | " movie_title = IndexToString(inputCol=\"title_new\", outputCol=\"title\",labels=model.labels)\n", 874 | " 
final_recommendations=movie_title.transform(recommendations)\n", 875 | " \n", 876 | " #return the recommendations to active user\n", 877 | " return final_recommendations.show(n,False)" 878 | ] 879 | }, 880 | { 881 | "cell_type": "code", 882 | "execution_count": 339, 883 | "metadata": {}, 884 | "outputs": [ 885 | { 886 | "name": "stdout", 887 | "output_type": "stream", 888 | "text": [ 889 | "+---------+------+----------+----------------------------+\n", 890 | "|title_new|userId|prediction|title |\n", 891 | "+---------+------+----------+----------------------------+\n", 892 | "|1433.0 |85 |4.9689837 |Boys, Les (1997) |\n", 893 | "|1322.0 |85 |4.6927013 |Faust (1994) |\n", 894 | "|1271.0 |85 |4.605163 |Whole Wide World, The (1996)|\n", 895 | "|1470.0 |85 |4.5409293 |Some Mother's Son (1996) |\n", 896 | "|705.0 |85 |4.532007 |Laura (1944) |\n", 897 | "|303.0 |85 |4.5236835 |Close Shave, A (1995) |\n", 898 | "|1121.0 |85 |4.4936523 |Crooklyn (1994) |\n", 899 | "|1195.0 |85 |4.4636283 |Pather Panchali (1955) |\n", 900 | "|285.0 |85 |4.456875 |Wrong Trousers, The (1993) |\n", 901 | "|638.0 |85 |4.4495435 |Shall We Dance? 
(1996) |\n", 902 | "+---------+------+----------+----------------------------+\n", 903 | "\n" 904 | ] 905 | } 906 | ], 907 | "source": [ 908 | "top_movies(85,10)" 909 | ] 910 | }, 911 | { 912 | "cell_type": "code", 913 | "execution_count": null, 914 | "metadata": {}, 915 | "outputs": [], 916 | "source": [] 917 | } 918 | ], 919 | "metadata": { 920 | "kernelspec": { 921 | "display_name": "Python 3", 922 | "language": "python", 923 | "name": "python3" 924 | }, 925 | "language_info": { 926 | "codemirror_mode": { 927 | "name": "ipython", 928 | "version": 3 929 | }, 930 | "file_extension": ".py", 931 | "mimetype": "text/x-python", 932 | "name": "python", 933 | "nbconvert_exporter": "python", 934 | "pygments_lexer": "ipython3", 935 | "version": "3.6.3" 936 | } 937 | }, 938 | "nbformat": 4, 939 | "nbformat_minor": 2 940 | } 941 | -------------------------------------------------------------------------------- /chapter_8_Recommender_System/movie_ratings_df.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Apress/machine-learning-with-pyspark/e35273491ed9bf25a6849d38033ee2a64c962227/chapter_8_Recommender_System/movie_ratings_df.csv -------------------------------------------------------------------------------- /chapter_9_NLP/Movie_reviews.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Apress/machine-learning-with-pyspark/e35273491ed9bf25a6849d38033ee2a64c962227/chapter_9_NLP/Movie_reviews.csv -------------------------------------------------------------------------------- /errata.md: -------------------------------------------------------------------------------- 1 | # Errata for *Book Title* 2 | 3 | On **page xx** [Summary of error]: 4 | 5 | Details of error here. Highlight key pieces in **bold**. 6 | 7 | *** 8 | 9 | On **page xx** [Summary of error]: 10 | 11 | Details of error here. Highlight key pieces in **bold**. 
12 | 13 | *** --------------------------------------------------------------------------------