├── README.md
├── data_raw
│   ├── README.txt
│   ├── artist_alias_small.txt
│   ├── artist_data_small.txt
│   └── user_artist_data_small.txt
└── recommender_ALS_Spark_Python.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | For this project, you are to create a recommender system that will recommend new musical artists to a user based on their listening history. Suggesting different songs or musical artists to a user is important to many music streaming services, such as Pandora and Spotify. In addition, this type of recommender system could also be used as a means of suggesting TV shows or movies to a user (e.g., Netflix).
2 |
3 | To create this system, you will be using Spark and the collaborative filtering technique. The instructions for completing this project will be laid out entirely in this file. You will have to implement any missing code as well as answer any questions.
4 |
5 | Datasets:
6 |
7 | You will be using some publicly available song data from audioscrobbler, which can be found here. However, we modified the original data files so that the code will run in a reasonable time on a single machine. The reduced data files have been suffixed with _small.txt and contain only the information relevant to the top 50 most prolific users (highest artist play counts).
8 |
9 | The original data file user_artist_data.txt contained about 141,000 unique users and 1.6 million unique artists. About 24.2 million users’ plays of artists are recorded, along with their counts.
10 |
11 | Note that when plays are scrobbled, the client application submits the name of the artist being played. This name could be misspelled or nonstandard, and this may only be detected later. For example, "The Smiths", "Smiths, The", and "the smiths" may appear as distinct artist IDs in the data set, even though they clearly refer to the same artist.
So, the data set includes artist_alias.txt, which maps artist IDs that are known misspellings or variants to the canonical ID of that artist. 12 | 13 | The artist_data.txt file then provides a map from the canonical artist ID to the name of the artist. 14 | -------------------------------------------------------------------------------- /data_raw/README.txt: -------------------------------------------------------------------------------- 1 | Music Listening Dataset 2 | Audioscrobbler.com 3 | 6 May 2005 4 | -------------------------------- 5 | 6 | This data set contains profiles for around 150,000 real people 7 | The dataset lists the artists each person listens to, and a counter 8 | indicating how many times each user played each artist 9 | 10 | The dataset is continually growing; at the time of writing (6 May 2005) 11 | Audioscrobbler is receiving around 2 million song submissions per day 12 | 13 | We may produce additional/extended data dumps if anyone is interested 14 | in experimenting with the data. 15 | 16 | Please let us know if you do anything useful with this data, we're always 17 | up for new ways to visualize it or analyse/cluster it etc :) 18 | 19 | License 20 | ------- 21 | 22 | This data is made available under the following Creative Commons license: 23 | http://creativecommons.org/licenses/by-nc-sa/1.0/ 24 | 25 | 26 | Files 27 | ----- 28 | 29 | user_artist_data.txt 30 | 3 columns: userid artistid playcount 31 | 32 | artist_data.txt 33 | 2 columns: artistid artist_name 34 | 35 | artist_alias.txt 36 | 2 columns: badid, goodid 37 | known incorrectly spelt artists and the correct artist id. 38 | you can correct errors in user_artist_data as you read it in using this file 39 | (we're not yet finished merging this data) 40 | 41 | 42 | Execution 43 | ------------ 44 | Run the following line in the terminal to open the jupyter notebook with pyspark. 
Make sure to open the terminal and navigate into the project directory OR right click in the project directory in the Files application and click 'Open in Terminal'. 45 | 46 | export PYSPARK_DRIVER_PYTHON=ipython3 47 | export PYSPARK_DRIVER_PYTHON_OPTS="notebook" 48 | $SPARK_HOME/bin/pyspark 49 | -------------------------------------------------------------------------------- /data_raw/artist_alias_small.txt: -------------------------------------------------------------------------------- 1 | 1027859 1252408 2 | 1017615 668 3 | 6745885 1268522 4 | 1018110 1018110 5 | 1014609 1014609 6 | 6713071 2976 7 | 1014175 1014175 8 | 1008798 1008798 9 | 1013851 1013851 10 | 6696814 1030672 11 | 1036747 1239516 12 | 1278781 1021980 13 | 2035175 1007565 14 | 1327067 1308328 15 | 2006482 1140837 16 | 1314530 1237371 17 | 1160800 1345290 18 | 1255401 1055061 19 | 1307351 1055061 20 | 1234249 1005225 21 | 6622310 1094137 22 | 1261919 6977528 23 | 2103190 1002909 24 | 9929875 1009048 25 | 2118737 1011363 26 | 9929864 1000699 27 | 6666813 1305683 28 | 1172822 1127113 29 | 2026635 1001597 30 | 6726078 1018408 31 | 1039896 1277013 32 | 1239168 1266817 33 | 6819291 1277876 34 | 2030690 2060894 35 | 6786886 166 36 | 1051692 1307569 37 | 1239193 1012079 38 | 1291581 78 39 | 6642817 1010969 40 | 1293171 1007614 41 | 1070350 1034635 42 | 6603691 1279932 43 | 1027851 1063053 44 | 2060513 2029258 45 | 1277348 668 46 | 1253023 1033862 47 | 1002892 1002451 48 | 2060435 1256876 49 | 6612396 1301739 50 | 1280154 1021970 51 | 6617155 1039381 52 | 1006102 1034635 53 | 6697417 2013670 54 | 1059007 2653 55 | 2101386 2013670 56 | 1098456 1254644 57 | 6633276 1013675 58 | 162 1332522 59 | 1246265 1010669 60 | 6708991 1009773 61 | 1000110 1034635 62 | 1002566 1034635 63 | 1001864 1001864 64 | 9929533 1000088 65 | 1289246 1023527 66 | 1261152 1007206 67 | 2113342 1134530 68 | 1016805 3195 69 | 1325227 1246524 70 | 1245064 1264 71 | 1015753 1261449 72 | 2164287 10076841 73 | 1044186 10076841 74 | 1006661 
1172842 75 | 6639087 974 76 | 1028218 1349406 77 | 9928967 15 78 | 1269139 1003505 79 | 2150015 1018408 80 | 6611952 1269012 81 | 2134206 1062330 82 | 6893915 1017065 83 | 10345702 1017065 84 | 6880926 1017065 85 | 6873763 1259700 86 | 1231677 1294194 87 | 1333467 1156425 88 | 1169681 1651 89 | 1106289 2093800 90 | 6634844 1018408 91 | 2111668 2085 92 | 1038666 1295935 93 | 10112808 3437 94 | 9928973 1000113 95 | 10203303 6618355 96 | 1279723 1007263 97 | 1022552 1249851 98 | 2279441 6834637 99 | 1214254 1262045 100 | 1011272 1246839 101 | 10021668 1250233 102 | 6648707 1088328 103 | 1002139 1018807 104 | 1040536 2073100 105 | 1050544 1002332 106 | 6852428 1035970 107 | 1318457 1002152 108 | 1010410 1013654 109 | 1273591 1598 110 | 2144935 1066433 111 | 1000935 6747938 112 | 6603035 1314538 113 | 2073427 1006475 114 | 1305679 1034635 115 | 6723001 2039323 116 | 6612338 1257158 117 | 15 15 118 | 6843840 1326 119 | 1140506 1271441 120 | 1097968 1004831 121 | 9929045 153 122 | 1265244 3122 123 | 1010155 1252957 124 | 1246508 1013471 125 | 6666470 1349406 126 | 10328673 1956 127 | 3630 6951848 128 | 9919424 234 129 | 10013648 733 130 | 1185593 1028908 131 | 1030955 5452 132 | 1101433 1755 133 | 6979261 1008583 134 | 1199139 166 135 | 9929269 1238242 136 | 1323083 1029530 137 | 6652651 1009454 138 | 2684 1002480 139 | 1266264 1286358 140 | 1299041 1034635 141 | 10107676 118 142 | 6843503 1257158 143 | 10140618 28 144 | 1210088 1104179 145 | 6640761 1254011 146 | 1010284 1018807 147 | 1260442 2060894 148 | 1027472 2036 149 | 2085035 1000236 150 | 1156068 1001859 151 | 1211487 1014738 152 | 1123801 1034202 153 | 9929763 1003778 154 | 1327730 6705745 155 | 1016673 1298111 156 | 9910959 1034635 157 | 6644630 1065358 158 | 1146111 1000123 159 | 9929062 1010646 160 | 6666050 1294194 161 | 1104667 1012935 162 | 6747304 4538 163 | 1262727 2176737 164 | 9931068 1003361 165 | 1024502 1009571 166 | 6777696 1047693 167 | 1047140 1277286 168 | 1329111 1023485 169 | 1010145 1249657 
170 | 1321574 71 171 | 1004857 1034635 172 | 2112240 1246983 173 | 1304801 1307569 174 | 10328618 2814 175 | 10227482 1000200 176 | 1341684 71 177 | 2036732 71 178 | 2034497 71 179 | 1338466 71 180 | 1351048 71 181 | 1339315 71 182 | 1009443 1020059 183 | 6927588 1107395 184 | 6755702 1014604 185 | 1037848 1007201 186 | 1321035 1007201 187 | 1051861 1056268 188 | 2066585 1178346 189 | 1003979 1247540 190 | 6606624 1034635 191 | 1210850 2101375 192 | 2154067 1279924 193 | 1292006 1279924 194 | 1100499 1003448 195 | 1159075 1002152 196 | 1016988 1009571 197 | 1300745 5841 198 | 6868142 6866886 199 | 1018155 1015852 200 | 6638483 1195889 201 | 1011730 1239504 202 | 1009499 6730533 203 | 1014145 1009646 204 | 1212985 1301739 205 | 9929600 1004129 206 | 1280087 1295531 207 | 6704224 9964755 208 | 1071257 1236897 209 | 1060739 1263049 210 | 6645431 1013510 211 | 1126370 2114258 212 | 10328567 1000689 213 | 9997128 4303 214 | 1214221 1021115 215 | 6752624 684 216 | 6843863 1326 217 | 10163001 4775 218 | 1244701 1249401 219 | 1330987 1056296 220 | 1038051 6684730 221 | 1007834 1237371 222 | 1293474 1006885 223 | 2099786 2048617 224 | 1302130 1291109 225 | 6738758 2106357 226 | 9929441 1307 227 | 1013011 1276641 228 | 6623536 9916985 229 | 6606825 1014175 230 | 2017616 1007864 231 | 1291230 1236346 232 | 1286507 1137423 233 | 6935408 4349 234 | 6689505 1001655 235 | 1023449 1310185 236 | 2009180 6751847 237 | 1109974 1007063 238 | 10079136 1002328 239 | 1099602 2966 240 | 1015298 1247152 241 | 9931148 1006896 242 | 6666533 1253307 243 | 6667192 1086117 244 | 1080914 1274829 245 | 1003801 1241757 246 | 1049704 1261464 247 | 10092575 1000028 248 | 1334929 1246709 249 | 1291110 1030060 250 | 1055562 1276662 251 | 1090594 1009633 252 | 1252764 1003014 253 | 2058402 1024619 254 | 1029677 9983203 255 | 6671271 1033631 256 | 1327919 9983203 257 | 6827946 9983203 258 | 1270553 1327696 259 | 1000945 1018807 260 | 6786145 300 261 | 6614668 7006467 262 | 10331634 1000048 263 | 9912102 
1034635 264 | 1065198 2061677 265 | 1351750 1233610 266 | 1307528 2036704 267 | 1012315 1238836 268 | 1314904 6977528 269 | 1053693 1170206 270 | 1287055 1020615 271 | 7023179 5696 272 | 6963887 1013095 273 | 1252485 1010725 274 | 1079065 1236703 275 | 1027126 1255783 276 | 1274317 1234387 277 | 1012803 2161899 278 | 6666213 3554 279 | 1298276 6875510 280 | 1234344 1012125 281 | 10055114 2051723 282 | 10377598 1010055 283 | 1033104 1027610 284 | 2179213 1111915 285 | 6730134 1271216 286 | 1301746 1056258 287 | 1017322 1277866 288 | 1045804 1247516 289 | 1152469 1009402 290 | 2140188 10334513 291 | 1291960 1266817 292 | 2059804 1008487 293 | 6708740 1089337 294 | 1101793 1044253 295 | 1047491 1003342 296 | 1049384 1008336 297 | 1059884 1288727 298 | 6873850 1300642 299 | 2067429 1034635 300 | 2069589 1234503 301 | 10237528 1235384 302 | 1027009 2004228 303 | 6751850 2070071 304 | 6607841 1015122 305 | 6606625 1034635 306 | 1052722 2797 307 | 6688903 6706174 308 | 6892355 6785079 309 | 6618608 6785079 310 | 1019819 1034635 311 | 9929669 1004347 312 | 6606757 1003888 313 | 2140558 2114264 314 | 6730231 2161931 315 | 1075482 1264703 316 | 2064333 1076507 317 | 1022108 1035334 318 | 6759209 1241695 319 | 1008416 242 320 | 10263339 1008093 321 | 1276810 420 322 | 6622876 2161595 323 | 6670816 2051861 324 | 1254235 1254644 325 | 1305341 1010658 326 | 1039314 1203762 327 | 9919711 9956508 328 | 2082135 1259297 329 | 6635073 1259297 330 | 1300796 2036 331 | 6619918 2140107 332 | 1258892 1244746 333 | 6662497 1327588 334 | 6882695 1013167 335 | 1245000 1028445 336 | 5702 1066440 337 | 1007480 1012243 338 | 1244982 1028445 339 | 1122437 1254299 340 | 2075188 10150610 341 | 1073470 1327647 342 | 10270142 10096874 343 | 9954151 779 344 | 1275001 1028445 345 | 1244994 1028445 346 | 1122824 2148043 347 | 1084265 1065358 348 | 2140580 1002672 349 | 2070673 1169482 350 | 1010636 1239101 351 | 1031417 1265996 352 | 1033536 1008824 353 | 1006162 1048788 354 | 1179492 1246136 355 | 
2113453 1003052 356 | 2052613 1088572 357 | 1279110 2035089 358 | 10314684 6716462 359 | 1042508 1008824 360 | 10176206 1014716 361 | 6657341 10361613 362 | 1283015 1001230 363 | 1289264 1013714 364 | 1033391 1239278 365 | 2082602 1043147 366 | 1052982 1011231 367 | 2036251 2043827 368 | 2020235 1246136 369 | 1271924 1018769 370 | 1039828 1018807 371 | 1039174 1075543 372 | 9918754 4497 373 | 1235697 1233982 374 | 1244970 1028445 375 | 6843877 1326 376 | 1115653 1249252 377 | 1127341 1252111 378 | 1012852 6696725 379 | 1130370 2235 380 | 1245134 1007263 381 | 1036143 1063053 382 | 6790420 2129177 383 | 1263808 2070227 384 | 6620802 4377 385 | 1181082 1009583 386 | 1203592 1156425 387 | 1024571 1029592 388 | 1029799 1156425 389 | 1016343 1014340 390 | 754 754 391 | 1016631 1006736 392 | 1138014 1034635 393 | 1002778 393 394 | 1253088 1238269 395 | 6933178 809 396 | 1011293 1489 397 | 6965760 1027610 398 | 1010760 1239516 399 | 1006322 1006322 400 | 1006347 1006347 401 | 1058622 1251812 402 | 6742353 2104058 403 | 6718488 1059264 404 | 1281865 1011316 405 | 1006140 1246817 406 | 1015584 1007658 407 | 9919044 2007 408 | 9937566 809 409 | 9937520 1147975 410 | 1275359 1287322 411 | 2061602 6748393 412 | 6642370 1049114 413 | 1010872 2439 414 | 2126687 1023928 415 | 9930422 1002619 416 | 1000129 5810 417 | 1092585 1002559 418 | 1303182 6837566 419 | 10586534 1008953 420 | 1017671 1015311 421 | 1025954 2040456 422 | 1271505 1003694 423 | 1322886 2105178 424 | 6895260 1308328 425 | 1278418 1255555 426 | 1340193 2797 427 | 2155515 1003105 428 | 1022439 4609 429 | 6632852 1001835 430 | 2100392 1012457 431 | 1108824 1007415 432 | 9929214 1233948 433 | 6703817 1006812 434 | 1007631 1235753 435 | 1039492 1269417 436 | 6926874 1014485 437 | 1002498 3066 438 | 1097119 1016561 439 | 1008455 1020 440 | 1286284 5630 441 | 1266371 1236703 442 | 6779861 1250104 443 | 2025676 1001141 444 | 1023217 2076786 445 | 1011160 1238478 446 | 1340980 1000602 447 | 1006268 1002392 448 | 1197558 
1001943 449 | 1340959 1000602 450 | 6677859 670 451 | 2066701 1063426 452 | 1327070 1341919 453 | 10050528 831 454 | 1015209 1167955 455 | 1192222 1025647 456 | 1020004 1025647 457 | 6814996 1002912 458 | 1086572 1233677 459 | 1138325 2023771 460 | 1139033 1011219 461 | 1050349 1043653 462 | 4421 1252779 463 | 6703227 1210979 464 | 1070459 2064199 465 | 709 2003588 466 | 6843892 1012077 467 | 1012388 1254487 468 | 6625714 1060179 469 | 6901494 1006485 470 | 6806131 1002061 471 | 1284393 887 472 | 1023197 1269447 473 | 6604494 6834637 474 | 2131963 1020 475 | 10504380 1070177 476 | 2279509 1322366 477 | 1002670 1034635 478 | 1143149 1093353 479 | 6934479 1256115 480 | 9947015 1292207 481 | 2147002 10077841 482 | 1278109 1036654 483 | 9929574 1008086 484 | 1061384 1252912 485 | 6935936 3016 486 | 2159997 1003313 487 | 1044729 1046148 488 | 1067417 1003694 489 | 1270359 1019163 490 | 1102968 1233982 491 | 1030009 1791 492 | 6766073 1006837 493 | 1026989 1033119 494 | 1267504 1007885 495 | 1134651 659 496 | 1006146 1031646 497 | 1157380 1018266 498 | 1047433 1054273 499 | 6895364 2153903 500 | 6812143 1017092 501 | 1104448 1008824 502 | 1246003 4185 503 | 1255551 1266817 504 | 1326264 1235663 505 | 2164170 1247118 506 | 6923988 2140107 507 | 6606823 1267774 508 | 10198357 1025721 509 | 10014536 1009583 510 | 1038011 718 511 | 9952803 1156794 512 | 1248351 1007614 513 | 1293301 1001307 514 | 1033265 1136358 515 | 9967716 1015426 516 | 1106946 1006039 517 | 1233934 1234700 518 | 6704508 1006919 519 | 6804942 6676351 520 | 2025457 951 521 | 1112427 1080556 522 | 1089486 1090808 523 | 1139806 1039249 524 | 1006788 1003221 525 | 6677618 1028295 526 | 2166168 2025147 527 | 1059441 1039249 528 | 2164352 6922812 529 | 6607809 1034635 530 | 2008447 1034635 531 | 1028259 1237408 532 | 1257324 1240486 533 | 1129917 1039249 534 | 2162103 3195 535 | 6827283 3195 536 | 6835354 1039381 537 | 2031207 2031502 538 | 1022183 938 539 | 1067365 1003694 540 | 1344818 1331984 541 | 2008155 
1007801 542 | 1011370 1246174 543 | 1089639 1003694 544 | 1023269 1253188 545 | 1033184 1006320 546 | 6801236 1013362 547 | 1130780 1319532 548 | 6793156 1319489 549 | 1240531 1078983 550 | 1005489 2003588 551 | 1116245 1003556 552 | 6621334 1012031 553 | 1059744 1254973 554 | 1079120 1000655 555 | 6972279 1000985 556 | 6985668 6992655 557 | 1078613 2017 558 | 1234308 59 559 | 9929753 1007347 560 | 1009440 1003272 561 | 10383606 1319489 562 | 10713370 1084235 563 | 10713436 6611448 564 | 1022947 6951566 565 | 1197153 1018406 566 | 1145243 2115937 567 | 1008020 1277876 568 | 1006577 1261516 569 | 1007949 1045811 570 | 1241896 1045811 571 | 6818610 1233623 572 | 6703268 1260159 573 | 6843530 1260159 574 | 9974891 1260159 575 | 1033242 1000655 576 | 2159316 1034635 577 | 1289361 1262825 578 | 6679381 1027349 579 | 6827288 1238056 580 | 1312285 1039017 581 | 1012932 1234727 582 | 1071225 1027595 583 | 1070722 930 584 | 1110471 1007075 585 | 10052696 1116214 586 | 1063100 1057539 587 | 1208053 1060179 588 | -------------------------------------------------------------------------------- /recommender_ALS_Spark_Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Music Recommender System using ALS Algorithm with Apache Spark and Python\n", 8 | "+ **Estimated Execution Time (whole script): 2 minutes**\n", 9 | "+ **Estimated Time (to complete the project): 8 hours**\n", 10 | "\n", 11 | "## Description\n", 12 | "\n", 13 | "For this project, you are to create a recommender system that will recommend new musical artists to a user based on their listening history. Suggesting different songs or musical artists to a user is important to many music streaming services, such as Pandora and Spotify. In addition, this type of recommender system could also be used as a means of suggesting TV shows or movies to a user (e.g., Netflix). 
\n",
14 | "\n",
15 | "To create this system, you will be using Spark and the collaborative filtering technique. The instructions for completing this project will be laid out entirely in this file. You will have to implement any missing code as well as answer any questions.\n",
16 | "\n",
17 | "**Submission Instructions:** \n",
18 | "* Add all of your updates to this Jupyter Notebook file and do NOT clear any of the output you get from running your code.\n",
19 | "* Upload this file and the generated HTML onto Moodle as a single zip folder named with your user name.\n",
20 | "\n",
21 | "## Datasets\n",
22 | "\n",
23 | "You will be using some publicly available song data from audioscrobbler, which can be found [here](http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html). However, we modified the original data files so that the code will run in a reasonable time on a single machine. The reduced data files have been suffixed with `_small.txt` and contain only the information relevant to the top 50 most prolific users (highest artist play counts).\n",
24 | "\n",
25 | "The original data file `user_artist_data.txt` contained about 141,000 unique users and 1.6 million unique artists. About 24.2 million users’ plays of artists are recorded, along with their counts.\n",
26 | "\n",
27 | "Note that when plays are scrobbled, the client application submits the name of the artist being played. This name could be misspelled or nonstandard, and this may only be detected later. For example, \"The Smiths\", \"Smiths, The\", and \"the smiths\" may appear as distinct artist IDs in the data set, even though they clearly refer to the same artist. So, the data set includes `artist_alias.txt`, which maps artist IDs that are known misspellings or variants to the canonical ID of that artist.\n",
28 | "\n",
29 | "The `artist_data.txt` file then provides a map from the canonical artist ID to the name of the artist."
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 21,
35 | "metadata": {
36 | "collapsed": true
37 | },
38 | "outputs": [],
39 | "source": [
40 | "# Import libraries\n",
41 | "import findspark\n",
42 | "findspark.init()\n",
43 | "\n",
44 | "from pyspark.mllib.recommendation import *\n",
45 | "import random\n",
46 | "from operator import *\n",
47 | "from collections import defaultdict"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 22,
53 | "metadata": {
54 | "collapsed": true
55 | },
56 | "outputs": [],
57 | "source": [
58 | "# Initialize Spark Context: stop any context left over from a previous run, then start a fresh local one\n",
59 | "# YOUR CODE GOES HERE\n",
60 | "from pyspark import SparkContext, SparkConf\n",
61 | "spark = SparkContext.getOrCreate()\n",
62 | "spark.stop()\n",
63 | "spark = SparkContext('local', 'Recommender')"
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "## Loading data\n",
71 | "\n",
72 | "Load the three datasets into RDDs and name them `artistData`, `artistAlias`, and `userArtistData`. View the README, or the files themselves, to see how this data is formatted. Some of the files have tab delimiters while others have space delimiters. Make sure that your `userArtistData` RDD contains only the canonical artist IDs."
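As a plain-Python sketch (no Spark) of the canonicalization step described above — looking up each artist ID in the alias map and replacing it when an entry exists — using a hypothetical two-entry alias map and two hypothetical play triples:

```python
# Hypothetical alias map (badid -> goodid), in the format of artist_alias_small.txt
alias = {1027859: 1252408, 1017615: 668}

# Hypothetical (userid, artistid, playcount) triples, as in user_artist_data_small.txt
plays = [(1059637, 1027859, 5), (1059637, 1000049, 1)]

# Replace each artist ID with its canonical ID when an alias entry exists
canonical = [(user, alias.get(artist, artist), count) for user, artist, count in plays]
print(canonical)  # [(1059637, 1252408, 5), (1059637, 1000049, 1)]
```

The same `dict.get(key, default)` pattern carries over to the RDD version: build the alias dictionary on the driver, then apply it inside a `map` over the play triples.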
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": 23,
78 | "metadata": {
79 | "collapsed": true
80 | },
81 | "outputs": [],
82 | "source": [
83 | "# Import test files from location into RDD variables\n",
84 | "# YOUR CODE GOES HERE\n",
87 | "artistData = spark.textFile('./data_raw/artist_data_small.txt').map(lambda s:(int(s.split(\"\\t\")[0]),s.split(\"\\t\")[1]))\n",
88 | "artistAlias = spark.textFile('./data_raw/artist_alias_small.txt')\n",
89 | "userArtistData = spark.textFile('./data_raw/user_artist_data_small.txt')"
90 | ]
91 | },
105 | {
106 | "cell_type": "markdown",
107 | "metadata": {},
108 | "source": [
109 | "## Data Exploration\n",
110 | "\n",
111 | "In the blank below, write some code that will find the users' total play counts. Find the three users with the highest number of total play counts (sum of all counters) and print the user ID, the total play count, and the mean play count (average number of times a user played an artist). 
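The aggregation asked for here — total plays and mean plays per user — can be sketched in plain Python with hypothetical triples (the real solution should use RDD transformations such as `reduceByKey`):

```python
from collections import defaultdict

# Hypothetical (userid, artistid, playcount) triples
plays = [(1, 10, 4), (1, 11, 6), (2, 10, 3)]

totals = defaultdict(int)   # total play count per user
artists = defaultdict(int)  # number of distinct (user, artist) records per user

for user, _, count in plays:
    totals[user] += count
    artists[user] += 1

# Mean play count uses integer truncation, matching the sample output below
stats = {u: (totals[u], totals[u] // artists[u]) for u in totals}
print(stats)  # {1: (10, 5), 2: (3, 3)}
```

In Spark, `totals` corresponds to a `reduceByKey` over `(user, count)` pairs and `artists` to a `reduceByKey` over `(user, 1)` pairs, joined per user.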
Your output should look as follows:\n",
112 | "```\n",
113 | "User 1059637 has a total play count of 674412 and a mean play count of 1878.\n",
114 | "User 2064012 has a total play count of 548427 and a mean play count of 9455.\n",
115 | "User 2069337 has a total play count of 393515 and a mean play count of 1519.\n",
116 | "```\n",
117 | "\n",
"\n"
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": 25,
123 | "metadata": {},
124 | "outputs": [
125 | {
126 | "name": "stdout",
127 | "output_type": "stream",
128 | "text": [
129 | "User 1059637 has a total play count of 674412 and a mean play count of 1878.\n",
130 | "User 2064012 has a total play count of 548427 and a mean play count of 9455.\n",
131 | "User 2069337 has a total play count of 393515 and a mean play count of 1519.\n"
132 | ]
133 | }
134 | ],
135 | "source": [
136 | "# Split each space-delimited line into separate fields and store them as ints\n",
137 | "# YOUR CODE GOES HERE\n",
138 | "\n",
139 | "userArtistData = userArtistData.map(lambda s:(int(s.split(\" \")[0]),int(s.split(\" \")[1]),int(s.split(\" \")[2])))\n",
140 | "\n",
141 | "# Create a dictionary of the 'artistAlias' dataset\n",
142 | "# YOUR CODE GOES HERE\n",
143 | "\n",
144 | "artistAliasDictionary = {}\n",
145 | "dataValue = artistAlias.map(lambda s:(int(s.split(\"\\t\")[0]),int(s.split(\"\\t\")[1])))\n",
146 | "for temp in dataValue.collect():\n",
147 | "    artistAliasDictionary[temp[0]] = temp[1]\n",
148 | "\n",
149 | "# If the artist ID has an alias, replace it with the canonical ID from artistAlias, else retain the original\n",
150 | "# YOUR CODE GOES HERE\n",
151 | "\n",
152 | "userArtistData = userArtistData.map(lambda x: (x[0], artistAliasDictionary[x[1]] if x[1] in artistAliasDictionary else x[1], x[2]))\n",
153 | "\n",
154 | "# Create an RDD consisting of the 'userid' and 'playcount' fields of the original tuple\n",
155 | "# YOUR CODE GOES HERE\n",
156 | "\n",
159 | "userSum = 
userArtistData.map(lambda x:(x[0],x[2]))\n",
160 | "playCount1 = userSum.map(lambda x: (x[0],x[1])).reduceByKey(lambda a,b: a+b)\n",
161 | "playCount2 = userSum.map(lambda x: (x[0],1)).reduceByKey(lambda a,b: a+b)\n",
162 | "playSumAndCount = playCount1.leftOuterJoin(playCount2)\n",
163 | "\n",
164 | "\n",
165 | "# Combine each user's total play count with their mean play count across artists\n",
166 | "# YOUR CODE GOES HERE\n",
167 | "\n",
168 | "playSumAndCount = playSumAndCount.map(lambda x: (x[0],x[1][0],int(x[1][0]/x[1][1])))\n",
169 | "\n",
170 | "# Compute and display users with the highest playcount along with their mean playcount across artists\n",
171 | "# YOUR CODE GOES HERE\n",
172 | "\n",
173 | "TopThree = playSumAndCount.top(3,key=lambda x: x[1])\n",
174 | "for i in TopThree:\n",
175 | "    print('User '+str(i[0])+' has a total play count of '+str(i[1])+' and a mean play count of '+str(i[2])+'.')\n"
176 | ]
177 | },
178 | {
179 | "cell_type": "markdown",
180 | "metadata": {
181 | "collapsed": true
182 | },
183 | "source": [
184 | "#### Splitting Data for Testing\n",
185 | "\n",
186 | "Use the [randomSplit](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.randomSplit) function to divide the data (`userArtistData`) into:\n",
187 | "* A training set, `trainData`, that will be used to train the model. This set should constitute 40% of the data.\n",
188 | "* A validation set, `validationData`, used to perform parameter tuning. This set should constitute 40% of the data.\n",
189 | "* A test set, `testData`, used for a final evaluation of the model. This set should constitute 20% of the data.\n",
190 | "\n",
191 | "Use a random seed value of 13. 
Since these datasets will be repeatedly used you will probably want to persist them in memory using the [cache](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.cache) function.\n", 192 | "\n", 193 | "In addition, print out the first 3 elements of each set as well as their sizes; if you created these sets correctly, your output should look like the following:\n", 194 | "```\n", 195 | "[(1059637, 1000049, 1), (1059637, 1000056, 1), (1059637, 1000114, 2)]\n", 196 | "[(1059637, 1000010, 238), (1059637, 1000062, 11), (1059637, 1000123, 2)]\n", 197 | "[(1059637, 1000094, 1), (1059637, 1000112, 423), (1059637, 1000113, 5)]\n", 198 | "19761\n", 199 | "19862\n", 200 | "9858\n", 201 | "```" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 26, 207 | "metadata": {}, 208 | "outputs": [ 209 | { 210 | "name": "stdout", 211 | "output_type": "stream", 212 | "text": [ 213 | "[(1059637, 1000049, 1), (1059637, 1000056, 1), (1059637, 1000114, 2)]\n", 214 | "[(1059637, 1000010, 238), (1059637, 1000062, 11), (1059637, 1000123, 2)]\n", 215 | "[(1059637, 1000094, 1), (1059637, 1000112, 423), (1059637, 1000113, 5)]\n", 216 | "19761\n", 217 | "19862\n", 218 | "9858\n" 219 | ] 220 | } 221 | ], 222 | "source": [ 223 | "# Split the 'userArtistData' dataset into training, validation and test datasets. 
Store in cache for frequent access\n", 224 | "# YOUR CODE GOES HERE\n", 225 | "\n", 226 | "trainData, validationData, testData = userArtistData.randomSplit((0.4,0.4,0.2),seed=13)\n", 227 | "trainData.cache()\n", 228 | "validationData.cache()\n", 229 | "testData.cache()\n", 230 | "\n", 231 | "# Display the first 3 records of each dataset followed by the total count of records for each datasets\n", 232 | "# YOUR CODE GOES HERE\n", 233 | "\n", 234 | "\n", 235 | "print(trainData.take(3))\n", 236 | "print(validationData.take(3))\n", 237 | "print(testData.take(3))\n", 238 | "print(trainData.count())\n", 239 | "print(validationData.count())\n", 240 | "print(testData.count())" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "## The Recommender Model\n", 248 | "\n", 249 | "For this project, we will train the model with implicit feedback. You can read more information about this from the collaborative filtering page: [http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html](http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html). The [function you will be using](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.recommendation.ALS.trainImplicit) has a few tunable parameters that will affect how the model is built. Therefore, to get the best model, we will do a small parameter sweep and choose the model that performs the best on the validation set\n", 250 | "\n", 251 | "Therefore, we must first devise a way to evaluate models. Once we have a method for evaluation, we can run a parameter sweep, evaluate each combination of parameters on the validation data, and choose the optimal set of parameters. The parameters then can be used to make predictions on the test data.\n", 252 | "\n", 253 | "### Model Evaluation\n", 254 | "\n", 255 | "Although there may be several ways to evaluate a model, we will use a simple method here. 
Suppose we have a model and some dataset of *true* artist plays for a set of users. This model can be used to predict the top X artist recommendations for a user, and these recommendations can be compared to the artists that the user actually listened to (here, X will be the number of artists in the dataset of *true* artist plays). Then, the fraction of overlap between the top X predictions of the model and the X artists that the user actually listened to can be calculated. This process can be repeated for all users and an average value returned.\n",
256 | "\n",
257 | "For example, suppose a model predicted [1,2,4,8] as the top X=4 artists for a user. Suppose that user actually listened to the artists [1,3,7,8]. Then, for this user, the model would have a score of 2/4=0.5. To get the overall score, this would be performed for all users, with the average returned.\n",
258 | "\n",
259 | "**NOTE: when using the model to predict the top-X artists for a user, do not include the artists listed with that user in the training data.**\n",
260 | "\n",
261 | "Name your function `modelEval` and have it take a model (the output of ALS.trainImplicit) and a dataset as input. For parameter tuning, the dataset parameter should be set to the validation data (`validationData`). After parameter tuning, the model can be evaluated on the test data (`testData`)."
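The scoring rule just described can be sketched in a few lines of plain Python. The helper name `overlap_score` is hypothetical (it is not part of the required `modelEval` signature); it simply computes the per-user fraction of overlap, using the worked example from the text:

```python
def overlap_score(predicted, actual):
    """Fraction of the top-X predicted artists that the user actually played."""
    return len(set(predicted) & set(actual)) / len(actual)

# Worked example from the text: predictions [1,2,4,8] vs. true plays [1,3,7,8]
# Artists 1 and 8 overlap, so the score is 2/4 = 0.5
print(overlap_score([1, 2, 4, 8], [1, 3, 7, 8]))  # 0.5
```

Inside `modelEval`, this per-user score would be computed after filtering out the user's training-set artists, then averaged over all users in the dataset.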
262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 28, 267 | "metadata": { 268 | "collapsed": true 269 | }, 270 | "outputs": [], 271 | "source": [ 272 | "def modelEval(model, dataset):\n", 273 | "\n", 274 | "    # All unique artists in the 'userArtistData' dataset\n", 275 | "    AllArtists = userArtistData.map(lambda x: x[1]).distinct()\n", 276 | "\n", 277 | "    # All unique users in the current (validation/testing) dataset\n", 278 | "    AllUsers = dataset.map(lambda x: x[0]).distinct().collect()\n", 279 | "\n", 280 | "    # Dictionary mapping each user to the artists they played in the current (validation/testing) dataset\n", 281 | "    ValidationAndTestingDictionary = dataset.map(lambda x: (x[0], x[1])).groupByKey().mapValues(list).collectAsMap()\n", 282 | "\n", 283 | "    # Dictionary mapping each user to the artists they played in the training dataset\n", 284 | "    TrainingDictionary = trainData.map(lambda x: (x[0], x[1])).groupByKey().mapValues(list).collectAsMap()\n", 285 | "\n", 286 | "    # For each user, calculate the prediction score, i.e. the overlap between predicted and actual artists\n", 287 | "    PredictionScore = 0.0\n", 288 | "    for user in AllUsers:\n", 289 | "        # Score every artist for this user, excluding artists paired with that user in the training data\n", 290 | "        ArtistPrediction = AllArtists.map(lambda x: (user, x))\n", 291 | "        ModelPrediction = model.predictAll(ArtistPrediction)\n", 292 | "        candidates = ModelPrediction.filter(lambda x: x[1] not in TrainingDictionary.get(x[0], []))\n", 293 | "        trueArtists = ValidationAndTestingDictionary[user]\n", 294 | "        topPredictions = candidates.top(len(trueArtists), key=lambda x: x[2])\n", 295 | "        predictedArtists = [p[1] for p in topPredictions]\n", 296 | "        PredictionScore += len(set(predictedArtists).intersection(trueArtists)) / len(trueArtists)\n", 297 | "\n", 298 | "    # Print the average score of the model over all users for the specified rank\n", 299 | "    print(\"The model score for rank \" + str(model.rank) + \" is ~\" + str(PredictionScore / len(ValidationAndTestingDictionary)))" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": {}, 330 | "source": [ 331 | "### Model Construction\n", 332 | "\n", 333 | "Now we can build the best model possible using the validation data and the `modelEval` function. Although there are a few parameters we could optimize, for the sake of time we will just try a few different values for the [rank parameter](http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#collaborative-filtering) (leave everything else at its default value, **except make `seed`=345**). Loop through the values [2, 10, 20] and figure out which one produces the highest score based on your model evaluation function.\n", 334 | "\n", 335 | "Note: this procedure may take several minutes to run.\n", 336 | "\n", 337 | "For each rank value, print out the output of the `modelEval` function for that model. 
Your output should look as follows:\n", 338 | "```\n", 339 | "The model score for rank 2 is ~0.090431\n", 340 | "The model score for rank 10 is ~0.095294\n", 341 | "The model score for rank 20 is ~0.090248\n", 342 | "```\n", 343 | "The step below takes about 2 minutes to run. Uncomment it if you wish to run it and calculate the model score. " 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": 29, 349 | "metadata": { 350 | "scrolled": false 351 | }, 352 | "outputs": [ 353 | { 354 | "name": "stdout", 355 | "output_type": "stream", 356 | "text": [ 357 | "The model score for rank 2 is ~0.08082178719723072\n", 358 | "The model score for rank 10 is ~0.09052071953413846\n", 359 | "The model score for rank 20 is ~0.08225274139572855\n" 360 | ] 361 | } 362 | ], 363 | "source": [ 364 | "rankList = [2, 10, 20]\n", 365 | "for rank in rankList:\n", 366 | "    model = ALS.trainImplicit(trainData, rank, seed=345)\n", 367 | "    modelEval(model, validationData)" 368 | ] 369 | }, 370 | { 371 | "cell_type": "markdown", 372 | "metadata": {}, 373 | "source": [ 374 | "Now, using the `bestModel`, we will check the results over the test data. Your result should be ~`0.0507`. \n", 375 | "The step below takes about 1 minute to run. Uncomment the last line if you wish to run it and calculate the model score. 
" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": 30, 381 | "metadata": {}, 382 | "outputs": [ 383 | { 384 | "name": "stdout", 385 | "output_type": "stream", 386 | "text": [ 387 | "The model score for rank 10 is ~0.060728260020964896\n" 388 | ] 389 | } 390 | ], 391 | "source": [ 392 | "bestModel = ALS.trainImplicit(trainData, rank=10, seed=345)\n", 393 | "modelEval(bestModel, testData)" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "## Trying Some Artist Recommendations\n", 401 | "Using the best model above, predict the top 5 artists for user `1059637` using the [recommendProducts](http://spark.apache.org/docs/1.5.2/api/python/pyspark.mllib.html#pyspark.mllib.recommendation.MatrixFactorizationModel.recommendProducts) function. Map the results (integer IDs) into the real artist name using `artistAlias`. Print the results. The output should look as follows:\n", 402 | "```\n", 403 | "Artist 0: My Chemical Romance\n", 404 | "Artist 1: Something Corporate\n", 405 | "Artist 2: Evanescence\n", 406 | "Artist 3: Alanis Morissette\n", 407 | "Artist 4: Counting Crows\n", 408 | "```" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": 31, 414 | "metadata": {}, 415 | "outputs": [ 416 | { 417 | "name": "stdout", 418 | "output_type": "stream", 419 | "text": [ 420 | "Artist 0: My Chemical Romance\n", 421 | "Artist 1: Something Corporate\n", 422 | "Artist 2: Evanescence\n", 423 | "Artist 3: Alanis Morissette\n", 424 | "Artist 4: Counting Crows\n" 425 | ] 426 | } 427 | ], 428 | "source": [ 429 | "# Find the top 5 artists for a particular user and list their names\n", 430 | "# YOUR CODE GOES HERE\n", 431 | "\n", 432 | "TopFive = bestModel.recommendProducts(1059637,5)\n", 433 | "for item in range(0,5):\n", 434 | " print(\"Artist \"+str(item)+\": \"+artistData.filter(lambda x:x[0] == TopFive[item][1]).collect()[0][1])" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | 
"execution_count": null, 440 | "metadata": { 441 | "collapsed": true 442 | }, 443 | "outputs": [], 444 | "source": [] 445 | } 446 | ], 447 | "metadata": { 448 | "kernelspec": { 449 | "display_name": "Python 3", 450 | "language": "python", 451 | "name": "python3" 452 | }, 453 | "language_info": { 454 | "codemirror_mode": { 455 | "name": "ipython", 456 | "version": 3 457 | }, 458 | "file_extension": ".py", 459 | "mimetype": "text/x-python", 460 | "name": "python", 461 | "nbconvert_exporter": "python", 462 | "pygments_lexer": "ipython3", 463 | "version": "3.5.2" 464 | } 465 | }, 466 | "nbformat": 4, 467 | "nbformat_minor": 2 468 | } 469 | --------------------------------------------------------------------------------