├── README.md
├── data_raw
│   ├── README.txt
│   ├── artist_alias_small.txt
│   ├── artist_data_small.txt
│   └── user_artist_data_small.txt
└── recommender_ALS_Spark_Python.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | For this project, you are to create a recommender system that will recommend new musical artists to a user based on their listening history. Suggesting different songs or musical artists to a user is important to many music streaming services, such as Pandora and Spotify. In addition, this type of recommender system could also be used as a means of suggesting TV shows or movies to a user (e.g., Netflix).
2 |
3 | To create this system, you will be using Spark and the collaborative filtering technique. The instructions for completing this project will be laid out entirely in this file. You will have to implement any missing code as well as answer any questions.
4 |
5 | Datasets:
6 |
7 | You will be using some publicly available song data from audioscrobbler, which can be found here. However, we modified the original data files so that the code will run in a reasonable time on a single machine. The reduced data files have been suffixed with _small.txt and contain only the information relevant to the top 50 most prolific users (highest artist play counts).
8 |
9 | The original data file user_artist_data.txt contained about 141,000 unique users and 1.6 million unique artists. About 24.2 million users’ plays of artists are recorded, along with their counts.
10 |
11 | Note that when plays are scrobbled, the client application submits the name of the artist being played. This name could be misspelled or nonstandard, and this may only be detected later. For example, "The Smiths", "Smiths, The", and "the smiths" may appear as distinct artist IDs in the data set, even though they clearly refer to the same artist.
So, the data set includes artist_alias.txt, which maps artist IDs that are known misspellings or variants to the canonical ID of that artist. 12 | 13 | The artist_data.txt file then provides a map from the canonical artist ID to the name of the artist. 14 | -------------------------------------------------------------------------------- /data_raw/README.txt: -------------------------------------------------------------------------------- 1 | Music Listening Dataset 2 | Audioscrobbler.com 3 | 6 May 2005 4 | -------------------------------- 5 | 6 | This data set contains profiles for around 150,000 real people 7 | The dataset lists the artists each person listens to, and a counter 8 | indicating how many times each user played each artist 9 | 10 | The dataset is continually growing; at the time of writing (6 May 2005) 11 | Audioscrobbler is receiving around 2 million song submissions per day 12 | 13 | We may produce additional/extended data dumps if anyone is interested 14 | in experimenting with the data. 15 | 16 | Please let us know if you do anything useful with this data, we're always 17 | up for new ways to visualize it or analyse/cluster it etc :) 18 | 19 | License 20 | ------- 21 | 22 | This data is made available under the following Creative Commons license: 23 | http://creativecommons.org/licenses/by-nc-sa/1.0/ 24 | 25 | 26 | Files 27 | ----- 28 | 29 | user_artist_data.txt 30 | 3 columns: userid artistid playcount 31 | 32 | artist_data.txt 33 | 2 columns: artistid artist_name 34 | 35 | artist_alias.txt 36 | 2 columns: badid, goodid 37 | known incorrectly spelt artists and the correct artist id. 38 | you can correct errors in user_artist_data as you read it in using this file 39 | (we're not yet finished merging this data) 40 | 41 | 42 | Execution 43 | ------------ 44 | Run the following line in the terminal to open the jupyter notebook with pyspark. 
Make sure to open the terminal and navigate into the project directory OR right click in the project directory in the Files application and click 'Open in Terminal'. 45 | 46 | export PYSPARK_DRIVER_PYTHON=ipython3 47 | export PYSPARK_DRIVER_PYTHON_OPTS="notebook" 48 | $SPARK_HOME/bin/pyspark 49 | -------------------------------------------------------------------------------- /data_raw/artist_alias_small.txt: -------------------------------------------------------------------------------- 1 | 1027859 1252408 2 | 1017615 668 3 | 6745885 1268522 4 | 1018110 1018110 5 | 1014609 1014609 6 | 6713071 2976 7 | 1014175 1014175 8 | 1008798 1008798 9 | 1013851 1013851 10 | 6696814 1030672 11 | 1036747 1239516 12 | 1278781 1021980 13 | 2035175 1007565 14 | 1327067 1308328 15 | 2006482 1140837 16 | 1314530 1237371 17 | 1160800 1345290 18 | 1255401 1055061 19 | 1307351 1055061 20 | 1234249 1005225 21 | 6622310 1094137 22 | 1261919 6977528 23 | 2103190 1002909 24 | 9929875 1009048 25 | 2118737 1011363 26 | 9929864 1000699 27 | 6666813 1305683 28 | 1172822 1127113 29 | 2026635 1001597 30 | 6726078 1018408 31 | 1039896 1277013 32 | 1239168 1266817 33 | 6819291 1277876 34 | 2030690 2060894 35 | 6786886 166 36 | 1051692 1307569 37 | 1239193 1012079 38 | 1291581 78 39 | 6642817 1010969 40 | 1293171 1007614 41 | 1070350 1034635 42 | 6603691 1279932 43 | 1027851 1063053 44 | 2060513 2029258 45 | 1277348 668 46 | 1253023 1033862 47 | 1002892 1002451 48 | 2060435 1256876 49 | 6612396 1301739 50 | 1280154 1021970 51 | 6617155 1039381 52 | 1006102 1034635 53 | 6697417 2013670 54 | 1059007 2653 55 | 2101386 2013670 56 | 1098456 1254644 57 | 6633276 1013675 58 | 162 1332522 59 | 1246265 1010669 60 | 6708991 1009773 61 | 1000110 1034635 62 | 1002566 1034635 63 | 1001864 1001864 64 | 9929533 1000088 65 | 1289246 1023527 66 | 1261152 1007206 67 | 2113342 1134530 68 | 1016805 3195 69 | 1325227 1246524 70 | 1245064 1264 71 | 1015753 1261449 72 | 2164287 10076841 73 | 1044186 10076841 74 | 1006661 
1172842 75 | 6639087 974 76 | 1028218 1349406 77 | 9928967 15 78 | 1269139 1003505 79 | 2150015 1018408 80 | 6611952 1269012 81 | 2134206 1062330 82 | 6893915 1017065 83 | 10345702 1017065 84 | 6880926 1017065 85 | 6873763 1259700 86 | 1231677 1294194 87 | 1333467 1156425 88 | 1169681 1651 89 | 1106289 2093800 90 | 6634844 1018408 91 | 2111668 2085 92 | 1038666 1295935 93 | 10112808 3437 94 | 9928973 1000113 95 | 10203303 6618355 96 | 1279723 1007263 97 | 1022552 1249851 98 | 2279441 6834637 99 | 1214254 1262045 100 | 1011272 1246839 101 | 10021668 1250233 102 | 6648707 1088328 103 | 1002139 1018807 104 | 1040536 2073100 105 | 1050544 1002332 106 | 6852428 1035970 107 | 1318457 1002152 108 | 1010410 1013654 109 | 1273591 1598 110 | 2144935 1066433 111 | 1000935 6747938 112 | 6603035 1314538 113 | 2073427 1006475 114 | 1305679 1034635 115 | 6723001 2039323 116 | 6612338 1257158 117 | 15 15 118 | 6843840 1326 119 | 1140506 1271441 120 | 1097968 1004831 121 | 9929045 153 122 | 1265244 3122 123 | 1010155 1252957 124 | 1246508 1013471 125 | 6666470 1349406 126 | 10328673 1956 127 | 3630 6951848 128 | 9919424 234 129 | 10013648 733 130 | 1185593 1028908 131 | 1030955 5452 132 | 1101433 1755 133 | 6979261 1008583 134 | 1199139 166 135 | 9929269 1238242 136 | 1323083 1029530 137 | 6652651 1009454 138 | 2684 1002480 139 | 1266264 1286358 140 | 1299041 1034635 141 | 10107676 118 142 | 6843503 1257158 143 | 10140618 28 144 | 1210088 1104179 145 | 6640761 1254011 146 | 1010284 1018807 147 | 1260442 2060894 148 | 1027472 2036 149 | 2085035 1000236 150 | 1156068 1001859 151 | 1211487 1014738 152 | 1123801 1034202 153 | 9929763 1003778 154 | 1327730 6705745 155 | 1016673 1298111 156 | 9910959 1034635 157 | 6644630 1065358 158 | 1146111 1000123 159 | 9929062 1010646 160 | 6666050 1294194 161 | 1104667 1012935 162 | 6747304 4538 163 | 1262727 2176737 164 | 9931068 1003361 165 | 1024502 1009571 166 | 6777696 1047693 167 | 1047140 1277286 168 | 1329111 1023485 169 | 1010145 1249657 
170 | 1321574 71 171 | 1004857 1034635 172 | 2112240 1246983 173 | 1304801 1307569 174 | 10328618 2814 175 | 10227482 1000200 176 | 1341684 71 177 | 2036732 71 178 | 2034497 71 179 | 1338466 71 180 | 1351048 71 181 | 1339315 71 182 | 1009443 1020059 183 | 6927588 1107395 184 | 6755702 1014604 185 | 1037848 1007201 186 | 1321035 1007201 187 | 1051861 1056268 188 | 2066585 1178346 189 | 1003979 1247540 190 | 6606624 1034635 191 | 1210850 2101375 192 | 2154067 1279924 193 | 1292006 1279924 194 | 1100499 1003448 195 | 1159075 1002152 196 | 1016988 1009571 197 | 1300745 5841 198 | 6868142 6866886 199 | 1018155 1015852 200 | 6638483 1195889 201 | 1011730 1239504 202 | 1009499 6730533 203 | 1014145 1009646 204 | 1212985 1301739 205 | 9929600 1004129 206 | 1280087 1295531 207 | 6704224 9964755 208 | 1071257 1236897 209 | 1060739 1263049 210 | 6645431 1013510 211 | 1126370 2114258 212 | 10328567 1000689 213 | 9997128 4303 214 | 1214221 1021115 215 | 6752624 684 216 | 6843863 1326 217 | 10163001 4775 218 | 1244701 1249401 219 | 1330987 1056296 220 | 1038051 6684730 221 | 1007834 1237371 222 | 1293474 1006885 223 | 2099786 2048617 224 | 1302130 1291109 225 | 6738758 2106357 226 | 9929441 1307 227 | 1013011 1276641 228 | 6623536 9916985 229 | 6606825 1014175 230 | 2017616 1007864 231 | 1291230 1236346 232 | 1286507 1137423 233 | 6935408 4349 234 | 6689505 1001655 235 | 1023449 1310185 236 | 2009180 6751847 237 | 1109974 1007063 238 | 10079136 1002328 239 | 1099602 2966 240 | 1015298 1247152 241 | 9931148 1006896 242 | 6666533 1253307 243 | 6667192 1086117 244 | 1080914 1274829 245 | 1003801 1241757 246 | 1049704 1261464 247 | 10092575 1000028 248 | 1334929 1246709 249 | 1291110 1030060 250 | 1055562 1276662 251 | 1090594 1009633 252 | 1252764 1003014 253 | 2058402 1024619 254 | 1029677 9983203 255 | 6671271 1033631 256 | 1327919 9983203 257 | 6827946 9983203 258 | 1270553 1327696 259 | 1000945 1018807 260 | 6786145 300 261 | 6614668 7006467 262 | 10331634 1000048 263 | 9912102 
1034635 264 | 1065198 2061677 265 | 1351750 1233610 266 | 1307528 2036704 267 | 1012315 1238836 268 | 1314904 6977528 269 | 1053693 1170206 270 | 1287055 1020615 271 | 7023179 5696 272 | 6963887 1013095 273 | 1252485 1010725 274 | 1079065 1236703 275 | 1027126 1255783 276 | 1274317 1234387 277 | 1012803 2161899 278 | 6666213 3554 279 | 1298276 6875510 280 | 1234344 1012125 281 | 10055114 2051723 282 | 10377598 1010055 283 | 1033104 1027610 284 | 2179213 1111915 285 | 6730134 1271216 286 | 1301746 1056258 287 | 1017322 1277866 288 | 1045804 1247516 289 | 1152469 1009402 290 | 2140188 10334513 291 | 1291960 1266817 292 | 2059804 1008487 293 | 6708740 1089337 294 | 1101793 1044253 295 | 1047491 1003342 296 | 1049384 1008336 297 | 1059884 1288727 298 | 6873850 1300642 299 | 2067429 1034635 300 | 2069589 1234503 301 | 10237528 1235384 302 | 1027009 2004228 303 | 6751850 2070071 304 | 6607841 1015122 305 | 6606625 1034635 306 | 1052722 2797 307 | 6688903 6706174 308 | 6892355 6785079 309 | 6618608 6785079 310 | 1019819 1034635 311 | 9929669 1004347 312 | 6606757 1003888 313 | 2140558 2114264 314 | 6730231 2161931 315 | 1075482 1264703 316 | 2064333 1076507 317 | 1022108 1035334 318 | 6759209 1241695 319 | 1008416 242 320 | 10263339 1008093 321 | 1276810 420 322 | 6622876 2161595 323 | 6670816 2051861 324 | 1254235 1254644 325 | 1305341 1010658 326 | 1039314 1203762 327 | 9919711 9956508 328 | 2082135 1259297 329 | 6635073 1259297 330 | 1300796 2036 331 | 6619918 2140107 332 | 1258892 1244746 333 | 6662497 1327588 334 | 6882695 1013167 335 | 1245000 1028445 336 | 5702 1066440 337 | 1007480 1012243 338 | 1244982 1028445 339 | 1122437 1254299 340 | 2075188 10150610 341 | 1073470 1327647 342 | 10270142 10096874 343 | 9954151 779 344 | 1275001 1028445 345 | 1244994 1028445 346 | 1122824 2148043 347 | 1084265 1065358 348 | 2140580 1002672 349 | 2070673 1169482 350 | 1010636 1239101 351 | 1031417 1265996 352 | 1033536 1008824 353 | 1006162 1048788 354 | 1179492 1246136 355 | 
2113453 1003052 356 | 2052613 1088572 357 | 1279110 2035089 358 | 10314684 6716462 359 | 1042508 1008824 360 | 10176206 1014716 361 | 6657341 10361613 362 | 1283015 1001230 363 | 1289264 1013714 364 | 1033391 1239278 365 | 2082602 1043147 366 | 1052982 1011231 367 | 2036251 2043827 368 | 2020235 1246136 369 | 1271924 1018769 370 | 1039828 1018807 371 | 1039174 1075543 372 | 9918754 4497 373 | 1235697 1233982 374 | 1244970 1028445 375 | 6843877 1326 376 | 1115653 1249252 377 | 1127341 1252111 378 | 1012852 6696725 379 | 1130370 2235 380 | 1245134 1007263 381 | 1036143 1063053 382 | 6790420 2129177 383 | 1263808 2070227 384 | 6620802 4377 385 | 1181082 1009583 386 | 1203592 1156425 387 | 1024571 1029592 388 | 1029799 1156425 389 | 1016343 1014340 390 | 754 754 391 | 1016631 1006736 392 | 1138014 1034635 393 | 1002778 393 394 | 1253088 1238269 395 | 6933178 809 396 | 1011293 1489 397 | 6965760 1027610 398 | 1010760 1239516 399 | 1006322 1006322 400 | 1006347 1006347 401 | 1058622 1251812 402 | 6742353 2104058 403 | 6718488 1059264 404 | 1281865 1011316 405 | 1006140 1246817 406 | 1015584 1007658 407 | 9919044 2007 408 | 9937566 809 409 | 9937520 1147975 410 | 1275359 1287322 411 | 2061602 6748393 412 | 6642370 1049114 413 | 1010872 2439 414 | 2126687 1023928 415 | 9930422 1002619 416 | 1000129 5810 417 | 1092585 1002559 418 | 1303182 6837566 419 | 10586534 1008953 420 | 1017671 1015311 421 | 1025954 2040456 422 | 1271505 1003694 423 | 1322886 2105178 424 | 6895260 1308328 425 | 1278418 1255555 426 | 1340193 2797 427 | 2155515 1003105 428 | 1022439 4609 429 | 6632852 1001835 430 | 2100392 1012457 431 | 1108824 1007415 432 | 9929214 1233948 433 | 6703817 1006812 434 | 1007631 1235753 435 | 1039492 1269417 436 | 6926874 1014485 437 | 1002498 3066 438 | 1097119 1016561 439 | 1008455 1020 440 | 1286284 5630 441 | 1266371 1236703 442 | 6779861 1250104 443 | 2025676 1001141 444 | 1023217 2076786 445 | 1011160 1238478 446 | 1340980 1000602 447 | 1006268 1002392 448 | 1197558 
1001943 449 | 1340959 1000602 450 | 6677859 670 451 | 2066701 1063426 452 | 1327070 1341919 453 | 10050528 831 454 | 1015209 1167955 455 | 1192222 1025647 456 | 1020004 1025647 457 | 6814996 1002912 458 | 1086572 1233677 459 | 1138325 2023771 460 | 1139033 1011219 461 | 1050349 1043653 462 | 4421 1252779 463 | 6703227 1210979 464 | 1070459 2064199 465 | 709 2003588 466 | 6843892 1012077 467 | 1012388 1254487 468 | 6625714 1060179 469 | 6901494 1006485 470 | 6806131 1002061 471 | 1284393 887 472 | 1023197 1269447 473 | 6604494 6834637 474 | 2131963 1020 475 | 10504380 1070177 476 | 2279509 1322366 477 | 1002670 1034635 478 | 1143149 1093353 479 | 6934479 1256115 480 | 9947015 1292207 481 | 2147002 10077841 482 | 1278109 1036654 483 | 9929574 1008086 484 | 1061384 1252912 485 | 6935936 3016 486 | 2159997 1003313 487 | 1044729 1046148 488 | 1067417 1003694 489 | 1270359 1019163 490 | 1102968 1233982 491 | 1030009 1791 492 | 6766073 1006837 493 | 1026989 1033119 494 | 1267504 1007885 495 | 1134651 659 496 | 1006146 1031646 497 | 1157380 1018266 498 | 1047433 1054273 499 | 6895364 2153903 500 | 6812143 1017092 501 | 1104448 1008824 502 | 1246003 4185 503 | 1255551 1266817 504 | 1326264 1235663 505 | 2164170 1247118 506 | 6923988 2140107 507 | 6606823 1267774 508 | 10198357 1025721 509 | 10014536 1009583 510 | 1038011 718 511 | 9952803 1156794 512 | 1248351 1007614 513 | 1293301 1001307 514 | 1033265 1136358 515 | 9967716 1015426 516 | 1106946 1006039 517 | 1233934 1234700 518 | 6704508 1006919 519 | 6804942 6676351 520 | 2025457 951 521 | 1112427 1080556 522 | 1089486 1090808 523 | 1139806 1039249 524 | 1006788 1003221 525 | 6677618 1028295 526 | 2166168 2025147 527 | 1059441 1039249 528 | 2164352 6922812 529 | 6607809 1034635 530 | 2008447 1034635 531 | 1028259 1237408 532 | 1257324 1240486 533 | 1129917 1039249 534 | 2162103 3195 535 | 6827283 3195 536 | 6835354 1039381 537 | 2031207 2031502 538 | 1022183 938 539 | 1067365 1003694 540 | 1344818 1331984 541 | 2008155 
1007801 542 | 1011370 1246174 543 | 1089639 1003694 544 | 1023269 1253188 545 | 1033184 1006320 546 | 6801236 1013362 547 | 1130780 1319532 548 | 6793156 1319489 549 | 1240531 1078983 550 | 1005489 2003588 551 | 1116245 1003556 552 | 6621334 1012031 553 | 1059744 1254973 554 | 1079120 1000655 555 | 6972279 1000985 556 | 6985668 6992655 557 | 1078613 2017 558 | 1234308 59 559 | 9929753 1007347 560 | 1009440 1003272 561 | 10383606 1319489 562 | 10713370 1084235 563 | 10713436 6611448 564 | 1022947 6951566 565 | 1197153 1018406 566 | 1145243 2115937 567 | 1008020 1277876 568 | 1006577 1261516 569 | 1007949 1045811 570 | 1241896 1045811 571 | 6818610 1233623 572 | 6703268 1260159 573 | 6843530 1260159 574 | 9974891 1260159 575 | 1033242 1000655 576 | 2159316 1034635 577 | 1289361 1262825 578 | 6679381 1027349 579 | 6827288 1238056 580 | 1312285 1039017 581 | 1012932 1234727 582 | 1071225 1027595 583 | 1070722 930 584 | 1110471 1007075 585 | 10052696 1116214 586 | 1063100 1057539 587 | 1208053 1060179 588 | -------------------------------------------------------------------------------- /recommender_ALS_Spark_Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Music Recommender System using ALS Algorithm with Apache Spark and Python\n", 8 | "+ **Estimated Execution Time (whole script): 2 minutes**\n", 9 | "+ **Estimated Time (to complete the project): 8 hours**\n", 10 | "\n", 11 | "## Description\n", 12 | "\n", 13 | "For this project, you are to create a recommender system that will recommend new musical artists to a user based on their listening history. Suggesting different songs or musical artists to a user is important to many music streaming services, such as Pandora and Spotify. In addition, this type of recommender system could also be used as a means of suggesting TV shows or movies to a user (e.g., Netflix). 
\n",
14 | "\n",
15 | "To create this system, you will be using Spark and the collaborative filtering technique. The instructions for completing this project will be laid out entirely in this file. You will have to implement any missing code as well as answer any questions.\n",
16 | "\n",
17 | "**Submission Instructions:** \n",
18 | "* Add all of your updates to this Jupyter Notebook file and do NOT clear any of the output you get from running your code.\n",
19 | "* Upload this file and the generated HTML onto Moodle as a single zip folder named with your user name.\n",
20 | "\n",
21 | "## Datasets\n",
22 | "\n",
23 | "You will be using some publicly available song data from audioscrobbler, which can be found [here](http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html). However, we modified the original data files so that the code will run in a reasonable time on a single machine. The reduced data files have been suffixed with `_small.txt` and contain only the information relevant to the top 50 most prolific users (highest artist play counts).\n",
24 | "\n",
25 | "The original data file `user_artist_data.txt` contained about 141,000 unique users and 1.6 million unique artists. About 24.2 million users’ plays of artists are recorded, along with their counts.\n",
26 | "\n",
27 | "Note that when plays are scrobbled, the client application submits the name of the artist being played. This name could be misspelled or nonstandard, and this may only be detected later. For example, \"The Smiths\", \"Smiths, The\", and \"the smiths\" may appear as distinct artist IDs in the data set, even though they clearly refer to the same artist. So, the data set includes `artist_alias.txt`, which maps artist IDs that are known misspellings or variants to the canonical ID of that artist.\n",
28 | "\n",
29 | "The `artist_data.txt` file then provides a map from the canonical artist ID to the name of the artist."
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 21,
35 | "metadata": {
36 | "collapsed": true
37 | },
38 | "outputs": [],
39 | "source": [
40 | "# Import libraries\n",
41 | "import findspark\n",
42 | "findspark.init()\n",
43 | "\n",
44 | "from pyspark.mllib.recommendation import *\n",
45 | "import random\n",
46 | "from operator import *\n",
47 | "from collections import defaultdict"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 22,
53 | "metadata": {
54 | "collapsed": true
55 | },
56 | "outputs": [],
57 | "source": [
58 | "# Initialize Spark Context: stop any context left over from a previous run, then start a fresh local one\n",
59 | "# YOUR CODE GOES HERE\n",
60 | "from pyspark import SparkContext, SparkConf\n",
61 | "spark = SparkContext.getOrCreate()\n",
62 | "spark.stop()\n",
63 | "spark = SparkContext('local', 'Recommender')"
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "## Loading data\n",
71 | "\n",
72 | "Load the three datasets into RDDs and name them `artistData`, `artistAlias`, and `userArtistData`. View the README, or the files themselves, to see how this data is formatted. Some of the files have tab delimiters while others have space delimiters. Make sure that your `userArtistData` RDD contains only the canonical artist IDs."
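As a plain-Python sketch (no Spark) of the canonicalization step described above — looking up each artist ID in the alias map and replacing it when an entry exists — using a hypothetical two-entry alias map and two hypothetical play triples:

```python
# Hypothetical alias map (badid -> goodid), in the format of artist_alias_small.txt
alias = {1027859: 1252408, 1017615: 668}

# Hypothetical (userid, artistid, playcount) triples, as in user_artist_data_small.txt
plays = [(1059637, 1027859, 5), (1059637, 1000049, 1)]

# Replace each artist ID with its canonical ID when an alias entry exists
canonical = [(user, alias.get(artist, artist), count) for user, artist, count in plays]
print(canonical)  # [(1059637, 1252408, 5), (1059637, 1000049, 1)]
```

The same `dict.get(key, default)` pattern carries over to the RDD version: build the alias dictionary on the driver, then apply it inside a `map` over the play triples.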
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": 23,
78 | "metadata": {
79 | "collapsed": true
80 | },
81 | "outputs": [],
82 | "source": [
83 | "# Import test files from location into RDD variables\n",
84 | "# YOUR CODE GOES HERE\n",
87 | "artistData = spark.textFile('./data_raw/artist_data_small.txt').map(lambda s:(int(s.split(\"\\t\")[0]),s.split(\"\\t\")[1]))\n",
88 | "artistAlias = spark.textFile('./data_raw/artist_alias_small.txt')\n",
89 | "userArtistData = spark.textFile('./data_raw/user_artist_data_small.txt')"
90 | ]
91 | },
105 | {
106 | "cell_type": "markdown",
107 | "metadata": {},
108 | "source": [
109 | "## Data Exploration\n",
110 | "\n",
111 | "In the blank below, write some code that will find the users' total play counts. Find the three users with the highest number of total play counts (sum of all counters) and print the user ID, the total play count, and the mean play count (average number of times a user played an artist). 
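The aggregation asked for here — total plays and mean plays per user — can be sketched in plain Python with hypothetical triples (the real solution should use RDD transformations such as `reduceByKey`):

```python
from collections import defaultdict

# Hypothetical (userid, artistid, playcount) triples
plays = [(1, 10, 4), (1, 11, 6), (2, 10, 3)]

totals = defaultdict(int)   # total play count per user
artists = defaultdict(int)  # number of distinct (user, artist) records per user

for user, _, count in plays:
    totals[user] += count
    artists[user] += 1

# Mean play count uses integer truncation, matching the sample output below
stats = {u: (totals[u], totals[u] // artists[u]) for u in totals}
print(stats)  # {1: (10, 5), 2: (3, 3)}
```

In Spark, `totals` corresponds to a `reduceByKey` over `(user, count)` pairs and `artists` to a `reduceByKey` over `(user, 1)` pairs, joined per user.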
Your output should look as follows:\n",
112 | "```\n",
113 | "User 1059637 has a total play count of 674412 and a mean play count of 1878.\n",
114 | "User 2064012 has a total play count of 548427 and a mean play count of 9455.\n",
115 | "User 2069337 has a total play count of 393515 and a mean play count of 1519.\n",
116 | "```\n",
117 | "\n",
"\n"
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": 25,
123 | "metadata": {},
124 | "outputs": [
125 | {
126 | "name": "stdout",
127 | "output_type": "stream",
128 | "text": [
129 | "User 1059637 has a total play count of 674412 and a mean play count of 1878.\n",
130 | "User 2064012 has a total play count of 548427 and a mean play count of 9455.\n",
131 | "User 2069337 has a total play count of 393515 and a mean play count of 1519.\n"
132 | ]
133 | }
134 | ],
135 | "source": [
136 | "# Split each space-delimited line into separate fields and store them as ints\n",
137 | "# YOUR CODE GOES HERE\n",
138 | "\n",
139 | "userArtistData = userArtistData.map(lambda s:(int(s.split(\" \")[0]),int(s.split(\" \")[1]),int(s.split(\" \")[2])))\n",
140 | "\n",
141 | "# Create a dictionary of the 'artistAlias' dataset\n",
142 | "# YOUR CODE GOES HERE\n",
143 | "\n",
144 | "artistAliasDictionary = {}\n",
145 | "dataValue = artistAlias.map(lambda s:(int(s.split(\"\\t\")[0]),int(s.split(\"\\t\")[1])))\n",
146 | "for temp in dataValue.collect():\n",
147 | "    artistAliasDictionary[temp[0]] = temp[1]\n",
148 | "\n",
149 | "# If the artist ID has an alias, replace it with the canonical ID from artistAlias, else retain the original\n",
150 | "# YOUR CODE GOES HERE\n",
151 | "\n",
152 | "userArtistData = userArtistData.map(lambda x: (x[0], artistAliasDictionary[x[1]] if x[1] in artistAliasDictionary else x[1], x[2]))\n",
153 | "\n",
154 | "# Create an RDD consisting of the 'userid' and 'playcount' fields of the original tuple\n",
155 | "# YOUR CODE GOES HERE\n",
156 | "\n",
159 | "userSum = 
userArtistData.map(lambda x:(x[0],x[2]))\n",
160 | "playCount1 = userSum.map(lambda x: (x[0],x[1])).reduceByKey(lambda a,b: a+b)\n",
161 | "playCount2 = userSum.map(lambda x: (x[0],1)).reduceByKey(lambda a,b: a+b)\n",
162 | "playSumAndCount = playCount1.leftOuterJoin(playCount2)\n",
163 | "\n",
164 | "\n",
165 | "# Combine each user's total play count with their mean play count across artists\n",
166 | "# YOUR CODE GOES HERE\n",
167 | "\n",
168 | "playSumAndCount = playSumAndCount.map(lambda x: (x[0],x[1][0],int(x[1][0]/x[1][1])))\n",
169 | "\n",
170 | "# Compute and display users with the highest playcount along with their mean playcount across artists\n",
171 | "# YOUR CODE GOES HERE\n",
172 | "\n",
173 | "TopThree = playSumAndCount.top(3,key=lambda x: x[1])\n",
174 | "for i in TopThree:\n",
175 | "    print('User '+str(i[0])+' has a total play count of '+str(i[1])+' and a mean play count of '+str(i[2])+'.')\n"
176 | ]
177 | },
178 | {
179 | "cell_type": "markdown",
180 | "metadata": {
181 | "collapsed": true
182 | },
183 | "source": [
184 | "#### Splitting Data for Testing\n",
185 | "\n",
186 | "Use the [randomSplit](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.randomSplit) function to divide the data (`userArtistData`) into:\n",
187 | "* A training set, `trainData`, that will be used to train the model. This set should constitute 40% of the data.\n",
188 | "* A validation set, `validationData`, used to perform parameter tuning. This set should constitute 40% of the data.\n",
189 | "* A test set, `testData`, used for a final evaluation of the model. This set should constitute 20% of the data.\n",
190 | "\n",
191 | "Use a random seed value of 13. 
Since these datasets will be repeatedly used you will probably want to persist them in memory using the [cache](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.cache) function.\n", 192 | "\n", 193 | "In addition, print out the first 3 elements of each set as well as their sizes; if you created these sets correctly, your output should look like the following:\n", 194 | "```\n", 195 | "[(1059637, 1000049, 1), (1059637, 1000056, 1), (1059637, 1000114, 2)]\n", 196 | "[(1059637, 1000010, 238), (1059637, 1000062, 11), (1059637, 1000123, 2)]\n", 197 | "[(1059637, 1000094, 1), (1059637, 1000112, 423), (1059637, 1000113, 5)]\n", 198 | "19761\n", 199 | "19862\n", 200 | "9858\n", 201 | "```" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 26, 207 | "metadata": {}, 208 | "outputs": [ 209 | { 210 | "name": "stdout", 211 | "output_type": "stream", 212 | "text": [ 213 | "[(1059637, 1000049, 1), (1059637, 1000056, 1), (1059637, 1000114, 2)]\n", 214 | "[(1059637, 1000010, 238), (1059637, 1000062, 11), (1059637, 1000123, 2)]\n", 215 | "[(1059637, 1000094, 1), (1059637, 1000112, 423), (1059637, 1000113, 5)]\n", 216 | "19761\n", 217 | "19862\n", 218 | "9858\n" 219 | ] 220 | } 221 | ], 222 | "source": [ 223 | "# Split the 'userArtistData' dataset into training, validation and test datasets. 
Store in cache for frequent access\n", 224 | "# YOUR CODE GOES HERE\n", 225 | "\n", 226 | "trainData, validationData, testData = userArtistData.randomSplit((0.4,0.4,0.2),seed=13)\n", 227 | "trainData.cache()\n", 228 | "validationData.cache()\n", 229 | "testData.cache()\n", 230 | "\n", 231 | "# Display the first 3 records of each dataset followed by the total count of records for each datasets\n", 232 | "# YOUR CODE GOES HERE\n", 233 | "\n", 234 | "\n", 235 | "print(trainData.take(3))\n", 236 | "print(validationData.take(3))\n", 237 | "print(testData.take(3))\n", 238 | "print(trainData.count())\n", 239 | "print(validationData.count())\n", 240 | "print(testData.count())" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "## The Recommender Model\n", 248 | "\n", 249 | "For this project, we will train the model with implicit feedback. You can read more information about this from the collaborative filtering page: [http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html](http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html). The [function you will be using](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.recommendation.ALS.trainImplicit) has a few tunable parameters that will affect how the model is built. Therefore, to get the best model, we will do a small parameter sweep and choose the model that performs the best on the validation set\n", 250 | "\n", 251 | "Therefore, we must first devise a way to evaluate models. Once we have a method for evaluation, we can run a parameter sweep, evaluate each combination of parameters on the validation data, and choose the optimal set of parameters. The parameters then can be used to make predictions on the test data.\n", 252 | "\n", 253 | "### Model Evaluation\n", 254 | "\n", 255 | "Although there may be several ways to evaluate a model, we will use a simple method here. 
Suppose we have a model and some dataset of *true* artist plays for a set of users. This model can be used to predict the top X artist recommendations for a user, and these recommendations can be compared to the artists that the user actually listened to (here, X will be the number of artists in the dataset of *true* artist plays). Then, the fraction of overlap between the top X predictions of the model and the X artists that the user actually listened to can be calculated. This process can be repeated for all users and an average value returned.\n",
256 | "\n",
257 | "For example, suppose a model predicted [1,2,4,8] as the top X=4 artists for a user. Suppose that user actually listened to the artists [1,3,7,8]. Then, for this user, the model would have a score of 2/4=0.5. To get the overall score, this would be performed for all users, with the average returned.\n",
258 | "\n",
259 | "**NOTE: when using the model to predict the top-X artists for a user, do not include the artists listed with that user in the training data.**\n",
260 | "\n",
261 | "Name your function `modelEval` and have it take a model (the output of ALS.trainImplicit) and a dataset as input. For parameter tuning, the dataset parameter should be set to the validation data (`validationData`). After parameter tuning, the model can be evaluated on the test data (`testData`)."
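The scoring rule just described can be sketched in a few lines of plain Python. The helper name `overlap_score` is hypothetical (it is not part of the required `modelEval` signature); it simply computes the per-user fraction of overlap, using the worked example from the text:

```python
def overlap_score(predicted, actual):
    """Fraction of the top-X predicted artists that the user actually played."""
    return len(set(predicted) & set(actual)) / len(actual)

# Worked example from the text: predictions [1,2,4,8] vs. true plays [1,3,7,8]
# Artists 1 and 8 overlap, so the score is 2/4 = 0.5
print(overlap_score([1, 2, 4, 8], [1, 3, 7, 8]))  # 0.5
```

Inside `modelEval`, this per-user score would be computed after filtering out the user's training-set artists, then averaged over all users in the dataset.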
262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 28, 267 | "metadata": { 268 | "collapsed": true 269 | }, 270 | "outputs": [], 271 | "source": [ 272 | "def modelEval(model, dataset):\n", 273 | "\n", 274 | "    # All unique artists in the 'userArtistData' dataset\n", 275 | "    AllArtists = userArtistData.map(lambda x: x[1]).distinct()\n", 276 | "\n", 277 | "    # All unique users in the current (validation/testing) dataset\n", 278 | "    AllUsers = dataset.map(lambda x: x[0]).distinct().collect()\n", 279 | "\n", 280 | "    # Dictionary mapping each user to the artists they played in the current (validation/testing) dataset\n", 281 | "    ValidationAndTestingDictionary = dataset.map(lambda x: (x[0], x[1])).groupByKey().mapValues(list).collectAsMap()\n", 282 | "\n", 283 | "    # Dictionary mapping each user to the artists they played in the training dataset\n", 284 | "    TrainingDictionary = trainData.map(lambda x: (x[0], x[1])).groupByKey().mapValues(list).collectAsMap()\n", 285 | "\n", 286 | "    # For each user, calculate the prediction score, i.e. the overlap between predicted and actual artists\n", 287 | "    PredictionScore = 0.0\n", 288 | "    for user in AllUsers:\n", 289 | "        # Score every artist for this user, excluding artists paired with that user in the training data\n", 290 | "        ArtistPrediction = AllArtists.map(lambda x: (user, x))\n", 291 | "        ModelPrediction = model.predictAll(ArtistPrediction)\n", 292 | "        candidates = ModelPrediction.filter(lambda x: x[1] not in TrainingDictionary.get(x[0], []))\n", 293 | "        trueArtists = ValidationAndTestingDictionary[user]\n", 294 | "        topPredictions = candidates.top(len(trueArtists), key=lambda x: x[2])\n", 295 | "        predictedArtists = [p[1] for p in topPredictions]\n", 296 | "        PredictionScore += len(set(predictedArtists).intersection(trueArtists)) / len(trueArtists)\n", 297 | "\n", 298 | "    # Print the average score of the model over all users for the specified rank\n", 299 | "    print(\"The model score for rank \" + str(model.rank) + \" is ~\" + str(PredictionScore / len(ValidationAndTestingDictionary)))" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": {}, 330 | "source": [ 331 | "### Model Construction\n", 332 | "\n", 333 | "Now we can build the best model possible using the validation data and the `modelEval` function. Although there are a few parameters we could optimize, for the sake of time we will just try a few different values for the [rank parameter](http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#collaborative-filtering) (leave everything else at its default value, **except make `seed`=345**). Loop through the values [2, 10, 20] and figure out which one produces the highest score based on your model evaluation function.\n", 334 | "\n", 335 | "Note: this procedure may take several minutes to run.\n", 336 | "\n", 337 | "For each rank value, print out the output of the `modelEval` function for that model. 
Your output should look as follows:\n", 338 | "```\n", 339 | "The model score for rank 2 is ~0.090431\n", 340 | "The model score for rank 10 is ~0.095294\n", 341 | "The model score for rank 20 is ~0.090248\n", 342 | "```\n", 343 | "The step below takes about 2 minutes to run. Uncomment it if you wish to run it and calculate the model score. " 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": 29, 349 | "metadata": { 350 | "scrolled": false 351 | }, 352 | "outputs": [ 353 | { 354 | "name": "stdout", 355 | "output_type": "stream", 356 | "text": [ 357 | "The model score for rank 2 is ~0.08082178719723072\n", 358 | "The model score for rank 10 is ~0.09052071953413846\n", 359 | "The model score for rank 20 is ~0.08225274139572855\n" 360 | ] 361 | } 362 | ], 363 | "source": [ 364 | "rankList = [2, 10, 20]\n", 365 | "for rank in rankList:\n", 366 | "    model = ALS.trainImplicit(trainData, rank, seed=345)\n", 367 | "    modelEval(model, validationData)" 368 | ] 369 | }, 370 | { 371 | "cell_type": "markdown", 372 | "metadata": {}, 373 | "source": [ 374 | "Now, using the `bestModel`, we will check the results over the test data. Your result should be ~`0.0507`. \n", 375 | "The step below takes about 1 minute to run. Uncomment the last line if you wish to run it and calculate the model score. 
" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": 30, 381 | "metadata": {}, 382 | "outputs": [ 383 | { 384 | "name": "stdout", 385 | "output_type": "stream", 386 | "text": [ 387 | "The model score for rank 10 is ~0.060728260020964896\n" 388 | ] 389 | } 390 | ], 391 | "source": [ 392 | "bestModel = ALS.trainImplicit(trainData, rank=10, seed=345)\n", 393 | "modelEval(bestModel, testData)" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "## Trying Some Artist Recommendations\n", 401 | "Using the best model above, predict the top 5 artists for user `1059637` using the [recommendProducts](http://spark.apache.org/docs/1.5.2/api/python/pyspark.mllib.html#pyspark.mllib.recommendation.MatrixFactorizationModel.recommendProducts) function. Map the results (integer IDs) into the real artist name using `artistAlias`. Print the results. The output should look as follows:\n", 402 | "```\n", 403 | "Artist 0: My Chemical Romance\n", 404 | "Artist 1: Something Corporate\n", 405 | "Artist 2: Evanescence\n", 406 | "Artist 3: Alanis Morissette\n", 407 | "Artist 4: Counting Crows\n", 408 | "```" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": 31, 414 | "metadata": {}, 415 | "outputs": [ 416 | { 417 | "name": "stdout", 418 | "output_type": "stream", 419 | "text": [ 420 | "Artist 0: My Chemical Romance\n", 421 | "Artist 1: Something Corporate\n", 422 | "Artist 2: Evanescence\n", 423 | "Artist 3: Alanis Morissette\n", 424 | "Artist 4: Counting Crows\n" 425 | ] 426 | } 427 | ], 428 | "source": [ 429 | "# Find the top 5 artists for a particular user and list their names\n", 430 | "# YOUR CODE GOES HERE\n", 431 | "\n", 432 | "TopFive = bestModel.recommendProducts(1059637,5)\n", 433 | "for item in range(0,5):\n", 434 | " print(\"Artist \"+str(item)+\": \"+artistData.filter(lambda x:x[0] == TopFive[item][1]).collect()[0][1])" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | 
"execution_count": null, 440 | "metadata": { 441 | "collapsed": true 442 | }, 443 | "outputs": [], 444 | "source": [] 445 | } 446 | ], 447 | "metadata": { 448 | "kernelspec": { 449 | "display_name": "Python 3", 450 | "language": "python", 451 | "name": "python3" 452 | }, 453 | "language_info": { 454 | "codemirror_mode": { 455 | "name": "ipython", 456 | "version": 3 457 | }, 458 | "file_extension": ".py", 459 | "mimetype": "text/x-python", 460 | "name": "python", 461 | "nbconvert_exporter": "python", 462 | "pygments_lexer": "ipython3", 463 | "version": "3.5.2" 464 | } 465 | }, 466 | "nbformat": 4, 467 | "nbformat_minor": 2 468 | } 469 | --------------------------------------------------------------------------------