├── ConvertYelp.py
├── Eat, Rate, Love -- A Proposal for modifying Yelp's rating systems.pdf
├── Eat, Rate, Love -- Presentation Slides.pdf
├── README.md
├── SpringboardCapstone_YelpAnalysis.Rproj
├── indian_names.txt
└── yelp.R


/ConvertYelp.py:
--------------------------------------------------------------------------------
#!/usr/bin/python

import sys
import json
import csv
import io

"""ConvertYelp

Written by: Robert Chen
Date: 1/10/2016

Processes one of 3 Yelp JSON files. It reads in the JSON data, extracts the fields
needed from that file, and writes the result to a CSV file.

The "json" and "csv" libraries are used for handling the input and output. The encode
method is used to handle unicode characters that appear in "name" and "city" in the Yelp
"business" data file and in "name" in the Yelp "user" data file.

To use, type the following commands:

python ConvertYelp.py yelp_academic_dataset_review.json
python ConvertYelp.py yelp_academic_dataset_user.json
python ConvertYelp.py yelp_academic_dataset_business.json

A corresponding .csv file is generated for each run.

"""


def filter_and_convert_to_csv(filename):
    """
    Open the specified file. The file is either the "review", "user", or "business" file.
    Depending on the type of file it is, read in certain fields and then save the result
    in CSV format.
    """

    if "review" in filename:

        outputFile = open('yelp_academic_dataset_review.csv', 'w')
        fields = ['user_id', 'business_id', 'stars']

        # The "lineterminator='\n'" is needed to prevent an extra blank line between each line.
        outputWriter = csv.DictWriter(outputFile, fieldnames=fields, lineterminator='\n')
        print "Converting " + filename + " to " + "yelp_academic_dataset_review.csv ..."

        # Read line by line, write user_id, business_id, and stars to CSV file
        for line in open(filename, 'r'):
            r = json.loads(line)
            outputWriter.writerow({'user_id': r['user_id'], 'business_id': r['business_id'], 'stars': r['stars']})

    elif "user" in filename:

        outputFile = open('yelp_academic_dataset_user.csv', 'w')
        fields = ['user_id', 'name']

        # The "lineterminator='\n'" is needed to prevent an extra blank line between each line.
        outputWriter = csv.DictWriter(outputFile, fieldnames=fields, lineterminator='\n')
        print "Converting " + filename + " to " + "yelp_academic_dataset_user.csv ..."

        # Read line by line, write user_id and name to CSV file
        for line in open(filename, 'r'):
            r = json.loads(line)

            # To handle name values with unicode, call "encode" to remove the non-ASCII characters.
            # This prevents a problem from occurring in "writerow" (which cannot handle unicode well)
            n = r['name']
            n1 = n.encode('ascii', 'ignore')
            outputWriter.writerow({'user_id': r['user_id'], 'name': n1})

    elif "business" in filename:

        outputFile = open('yelp_academic_dataset_business.csv', 'w')
        fields = ['business_id', 'city', 'name', 'categories', 'review_count', 'stars']
        outputWriter = csv.DictWriter(outputFile, fieldnames=fields, lineterminator='\n')
        print "Converting " + filename + " to " + "yelp_academic_dataset_business.csv ..."

        # Read line by line, write relevant fields if the business is a restaurant
        for line in open(filename, 'r'):
            r = json.loads(line)
            categories = str(r['categories'])
            if "Restaurants" in categories:
                # To handle name values with unicode, call "encode" to remove the non-ASCII characters.
                # This prevents a problem from occurring in "writerow" (which cannot handle unicode well)
                n = r['name']
                n1 = n.encode('ascii', 'ignore')
                c = r['city']
                c1 = c.encode('ascii', 'ignore')

                # Now write the result to a CSV file
                outputWriter.writerow({'business_id': r['business_id'], 'city': c1, 'name': n1, 'categories': r['categories'],
                                       'review_count': r['review_count'], 'stars': r['stars']})

    else:

        print "Error! Unexpected filename used."
        sys.exit(1)

    # Close the output file so that all buffered rows are flushed to disk
    outputFile.close()



def main():
    # This command-line parsing code is provided.
    # Make a list of command line arguments, omitting the [0] element
    # which is the script itself.
    args = sys.argv[1:]

    if not args:
        print 'usage: python ConvertYelp.py <yelp_json_file>'
        sys.exit(1)

    filter_and_convert_to_csv(sys.argv[1])

if __name__ == '__main__':
    main()


--------------------------------------------------------------------------------
/Eat, Rate, Love -- A Proposal for modifying Yelp's rating systems.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rchen314/SpringboardCapstone_YelpAnalysis/053fe165772617e1b23862209d84f21f17b89d55/Eat, Rate, Love -- A Proposal for modifying Yelp's rating systems.pdf


--------------------------------------------------------------------------------
/Eat, Rate, Love -- Presentation Slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rchen314/SpringboardCapstone_YelpAnalysis/053fe165772617e1b23862209d84f21f17b89d55/Eat, Rate, Love -- Presentation Slides.pdf


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# SpringboardCapstone_YelpAnalysis
Code for Springboard Capstone Project -- Analyzing 2
proposed tweaks to the Yelp rating system

List of files in this directory:

Eat, Rate, Love -- A Proposal for modifying Yelp's rating systems.pdf    Doc summarizing analysis and methodology used

Eat, Rate, Love -- Presentation Slides.pdf    Slide deck summarizing findings

ConvertYelp.py    Python script used for preprocessing Yelp files

yelp.R    R commands used for analysis

README.md    This file

indian_names.txt    Names used for modifying Yelp rating


--------------------------------------------------------------------------------
/SpringboardCapstone_YelpAnalysis.Rproj:
--------------------------------------------------------------------------------
Version: 1.0

RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: Sweave
LaTeX: pdfLaTeX


--------------------------------------------------------------------------------
/indian_names.txt:
--------------------------------------------------------------------------------
Aayush
Abhi
Abhijeet
Abhijit
Abhilash
Abhinandan
Abhinav
Abhinay
Abhisek
Abhishek
Abilash
Achyuthan
Aditi
Aditya
Adnan
Ahana
Aisha
Aishaa
Ajay
Ajinkya
Akash
Akhil
Akhilesh
Akshada
Akshay
Akshaya
Alina
Alok
Amal
Amalee
Amalia
Aman
Amandeep
Amar
Amaris
Amarnath
Ambika
Ami
Amil
Amir
Amirah
Amisha
Amit
Amogha
Amrita
Anagha
anand
Anand Kumar
Ananth
Ananya
Aneesh
Ani
Aniket
Anil
Anindya
Anirudh
Anish
Anju
Ankan
Ankit
Ankita
Ankur
Ankush
Anoop
Anshul
Anu
Anubhuti
Anuj
Anum
Anup
Anupama
AnuPriya
Anurag
AnurAg
Anush
Anusha
Apara
Aparna
Apeksha
Aravind
Aravind Baalaaji
Archana
Arin
Arindam
Arjun
Arpita
Arshad
Arul
Arun
Arunkumar
Arvind
Asha
Ashank
Ashish
Ashmita
Ashok
Ashwin
Asish
Asmita
Asodha
Atreyee
Atul
Atulya
Avinash Kumar
Avishek
Avishekh
Ayesha
Baban
Babjee
Bala
Balaji
Bharadwaj
Bharat
Bharath
Bhavana
Bhavik
Bhavin
Bhavisha
Bhavya
Bhrata
Bhujang
Bhumi
Bhumika
Bibek
Bijal
Brij
Brijen
Brijesh
Brinda
Chaitanya
Chaitra
Chandan
Chandini
Chandra
Chandrasekar
Chaundra
Chetan
Chetu
Chhayakanta
Chiku
Chintan
Chinu
Chirag
Chitra
Chitta
Cintya
Dakshina
Debashri
Deedar
Deeksha
Deepak
Deepali
Deepanjali
Deepesh
Deepika
Deepthy
Deepti
Deiva
Dev
Devang
Devesh
Dhanasekar
Dheeraj
Dhinakaran
Dhruv
Digvijay
Dinesh
Diplekha
Disha
Divyansh
Gaggandeep
Ganesh
Gaurang
Gaurav
Gautam
Gayatri
Geetha
Ghena
Girish
Gita
Gopi
Gurpreet
Gurpreet Singh
Hardeep
Haresh
Hariharan
Harinath
Harini
Harish
Harjit
Harpreet
Harsh
Harsha
Harshit
Himanshu
Hitesh
Humza
Jagadeesh
Jahan
Jai Veer Singh
Jaina
Jaswant A.
Jateen
Jatin
jaya
jayashri
jayatheerth
jaydei
JAYESH
jayshree
jeetendar
jharna
jimeet
jinesh
Kalpesh
Kalyan
Kamesh
Kanishk
Kannan
Karan
Karthik
Karthiknathan
Kartik
Karuna
Kashyap
Kavita
Kavitha
Kedar
Keerthi
Keren
Ketan
Kewal
Keya
Khushbu
Kira
Kiran
Kirra
Kirti
Kishore
Krishan
Krishna
KrishNa
Krishnasri
Kriti
Kritika
Krupesh
Kumar
Kunal
Kyara
Laks
Lakshmi
Lanka Ksheera Sagar
Laxmi
Madhu
Madhur
Madhuri
Mahadeva
Mahanth
Mahathi
Mahawish
Mahesh
Maithreyi
makku
Mala
Manav
Mandi
Mandisha
Mangala
Manikandan
Manish
Manjeera
Manveer
Mayur
Meena
Meera
meha
Mehta
Mehul
Mihir
Milind
Misbah
Mohun
Monal
Monali
Monish
Mukesh
Mukund
Munir
Nabeel
Nachiket
Nadeem
Naga
Nagdeep
Nahja
Najwan
Nakul
Namita
Namrata
Nanda
Nandini
Narayan
Naren
Naresh
Naveen
Navneet
Navpreet
Navya
Nayan
Neel
Neelam
Neelesh
Neeraj
Neha
Nidhi
Nikhil
Nilanjan
Nilesh
Nimish
Niraj
Niranjana
Nirmal
Nisarg
Nishant
Nishitaa
Nithesh
Nitin
Nitya
Nivedhitha
Padmaja
Padmashree
Palak
Pancham
Pankaj
Parikshith
Parth
Parthiv
Pavan
Pavi
Pavitha
Pavithra
Pavneet
Peeyush
Pinak
Piyoosh
Piyush
Pooja
Poorna
Poornima
Prbhakar
Prabhjot
Prabin
Pracheta
Prachi
Pragna
Prajod
Pranab
Pranathi
Pranav
Pranay
Praneeth
Praphul
Prasad
Prasanna
Prasen
Prashant
Prashanth
Prateek
Pratiba
Pratik
Pratiti
Praveen
Prayag
Preet
Preeti
Prem
Premsankar
Prerana
Priya
Priyam
Priyank
Prudhvi
Puneet
Punita
Purnima
Rachit
Raghav
Raghava
Raghu
Raghuram
Rahul
Rai
Raj
Raja
Rajat
Rajeev
Rajendra Rjean K. C.
Rajesh
Rajiv
Rajni
Raju
Rakesh
Raki
Ram
Ramah
Raman
Ramandeep
Ramesh
Rasesh
Rashmi
Rav
Raveendran
Ravi
Ravi Krishna
Raviteja
Risha
Rishi
Rishik
Rishoo
Ritesh
Rohan
Rohit
Ronak
Roshan
Roshani
Roshni
Ruchi
Rupa
Rupu
Rushabh
Rushikesh
Sachin
Sadhu
Sagar
Sai
Sairam
Sakshi
Samara
Sambhav
Sameer
Samir
Sampad
Sandeep
Sandhya
Sangeetha
Sanjay
Sanjib
Sanjith
Sanjiv
Santosh
Sarang
Sarnnya
Saravana Prabhu
Saravanan
Saroj
Sarun
Sashidhar
Satender
Sathya
Satya
Satyajit
Satyaswaroop
Satyin
Saumav
Saumya
Saurabh
Savinay
Sayed
Sayjul
Seema
Seema and Amit
Senthilkumar
Shailesh
Shaji
Shakerul
Shakil
Shalini
Shalu
Shamanth
Shanda
Shankar
Shantanu
Sharath
Shashank
Sheema
Sheetal
Shikha
Shilpa
Shilpashree
Shilpi
Shiraz
Shiv
Shiva
Shivani
Shivanshu
Shreejay
Shrikant
Shrimant
Shriya
Shruti
Siddharth
Sidhartha
Sidhu
Sirish
Sneha
Snigdha
Soham
Someshwar
Somnath
Sonal
Sonel
Soroush
Soujanya
Sourav
Souvir
Sowmiya
Spandana
Sree
Sri
Sri Balaji
Sridhar
Srikanth
Srikumar
Srini
Srinivas
Sriram
Srishti
Srivatsava
Subha
Subhash
Sdatta
Sudha
Sudhakar
Sudhanwa
Sudheer
Sudip
Sujith
Sukh
Suma
Sumaira
Sumaithri
Suman
Sumit
Sundar
Sunil
Supreeth
Surajit
Suraya
Surranna
Surya
Sushant
Suvir
Swapnil
Swati
Swikrit
Syed
Tanweer
Tapan
Tarun
Tripti
Tuhin
Tushaar
Tushar
Tushara
Umesh
Urvi
Urvish
Vaivelan
Vaibhav
Vaishali
Vaishnav
Vamsee
Vamsi
Varinder
Varun
Varuni
Vasista
Veena
Vemana
Venkat
Venkata
Venkesh
Vibhor
Vibhu
Vignesh
Vijay
Vijay Kumar
Vijaykumarty
Vijith
Vikas
Vikash
Vikram
Vinay
Vinayak
Vineet
Vineeth
Vinit
Vinod
Vinoth
Vinuth
Vipin
Vipin Das
Vipul
Vishal
Vishnu
Vishwa
Vivek
Yogesh
Yuvaraj


--------------------------------------------------------------------------------
/yelp.R:
--------------------------------------------------------------------------------
###########################################
# R commands to process the Yelp database #
###########################################

#############################################
# Part 1: Setup and initial data wrangling #
#############################################

# Load library
library(dplyr)

#
# Read in csv files
reviews <- read.csv("yelp_academic_dataset_review.csv", header = FALSE)
users <- read.csv("yelp_academic_dataset_user.csv", header = FALSE)
businesses <- read.csv("yelp_academic_dataset_business.csv", header = FALSE)

# Add names to the fields
colnames(reviews)[1] = "user_id"
colnames(reviews)[2] = "business_id"
colnames(reviews)[3] = "stars"
colnames(users)[1] = "user_id"
colnames(users)[2] = "user_name"
colnames(businesses)[1] = "business_id"
colnames(businesses)[2] = "city"
colnames(businesses)[3] = "business_name"
colnames(businesses)[4] = "categories"
colnames(businesses)[5] = "review_count"
colnames(businesses)[6] = "avg_stars"

# Join the files
ru <- inner_join(reviews, users)
rub <- inner_join(ru, businesses)

######################################################
# Part 2a: Analysis of Method 1 -- Initial Analysis #
######################################################

# Add "is_indian" field for any review that has "Indian" in "categories"
rub$is_indian <- grepl("Indian", rub$categories) == TRUE

# Make a dataframe of just reviews of Indian restaurants
indian <- subset(rub, is_indian == TRUE)

# Generate a summary of # of reviews of that cuisine done by each reviewer
num_reviews_Indian <- indian %>% select(user_id, user_name, is_indian) %>%
  group_by(user_id) %>%
  summarise(tot_rev = sum(is_indian))

# Print the table, show the total # of entries, and find the avg # of reviews per user
table(num_reviews_Indian$tot_rev)
count(num_reviews_Indian)
mean(num_reviews_Indian$tot_rev)

#################################################################
# Part 2b: Analysis of Method 1 -- Extension to Other Cuisines #
#################################################################

rub$is_chinese <- grepl("Chinese",
                        rub$categories) == TRUE
chinese <- subset(rub, is_chinese == TRUE)
num_reviews_Chinese <- chinese %>% select(user_id, user_name, is_chinese) %>%
  group_by(user_id) %>%
  summarise(tot_rev = sum(is_chinese))
table(num_reviews_Chinese$tot_rev)
count(num_reviews_Chinese)
mean(num_reviews_Chinese$tot_rev)

rub$is_mexican <- grepl("Mexican", rub$categories) == TRUE
mexican <- subset(rub, is_mexican == TRUE)
num_reviews_Mexican <- mexican %>% select(user_id, user_name, is_mexican) %>%
  group_by(user_id) %>%
  summarise(tot_rev = sum(is_mexican))
table(num_reviews_Mexican$tot_rev)
count(num_reviews_Mexican)
mean(num_reviews_Mexican$tot_rev)

rub$is_italian <- grepl("Italian", rub$categories) == TRUE
italian <- subset(rub, is_italian == TRUE)
num_reviews_Italian <- italian %>% select(user_id, user_name, is_italian) %>%
  group_by(user_id) %>%
  summarise(tot_rev = sum(is_italian))
table(num_reviews_Italian$tot_rev)
count(num_reviews_Italian)
mean(num_reviews_Italian$tot_rev)

# For Japanese, look for "Japanese" or "Sushi"
rub$is_japanese <- (grepl("Japanese", rub$categories) == TRUE) |
                   (grepl("Sushi", rub$categories) == TRUE)
japanese <- subset(rub, is_japanese == TRUE)
num_reviews_Japanese <- japanese %>% select(user_id, user_name, is_japanese) %>%
  group_by(user_id) %>%
  summarise(tot_rev = sum(is_japanese))
table(num_reviews_Japanese$tot_rev)
count(num_reviews_Japanese)
mean(num_reviews_Japanese$tot_rev)

rub$is_greek <- grepl("Greek", rub$categories) == TRUE
greek <- subset(rub, is_greek == TRUE)
num_reviews_Greek <- greek %>% select(user_id, user_name, is_greek) %>%
  group_by(user_id) %>%
  summarise(tot_rev = sum(is_greek))
table(num_reviews_Greek$tot_rev)
count(num_reviews_Greek)
mean(num_reviews_Greek$tot_rev)

rub$is_french <- grepl("French",
                       rub$categories) == TRUE
french <- subset(rub, is_french == TRUE)
num_reviews_French <- french %>% select(user_id, user_name, is_french) %>%
  group_by(user_id) %>%
  summarise(tot_rev = sum(is_french))
table(num_reviews_French$tot_rev)
count(num_reviews_French)
mean(num_reviews_French$tot_rev)

rub$is_thai <- grepl("Thai", rub$categories) == TRUE
thai <- subset(rub, is_thai == TRUE)
num_reviews_Thai <- thai %>% select(user_id, user_name, is_thai) %>%
  group_by(user_id) %>%
  summarise(tot_rev = sum(is_thai))
table(num_reviews_Thai$tot_rev)
count(num_reviews_Thai)
mean(num_reviews_Thai$tot_rev)

# For Spanish, look for "Spanish" or "Tapas"
rub$is_spanish <- (grepl("Spanish", rub$categories) == TRUE) |
                  (grepl("Tapas", rub$categories) == TRUE)
spanish <- subset(rub, is_spanish == TRUE)
num_reviews_Spanish <- spanish %>% select(user_id, user_name, is_spanish) %>%
  group_by(user_id) %>%
  summarise(tot_rev = sum(is_spanish))
table(num_reviews_Spanish$tot_rev)
count(num_reviews_Spanish)
mean(num_reviews_Spanish$tot_rev)

rub$is_mediterranean <- grepl("Mediterranean", rub$categories) == TRUE
mediterranean <- subset(rub, is_mediterranean == TRUE)
num_reviews_Mediterranean <- mediterranean %>% select(user_id, user_name, is_mediterranean) %>%
  group_by(user_id) %>%
  summarise(tot_rev = sum(is_mediterranean))
table(num_reviews_Mediterranean$tot_rev)
count(num_reviews_Mediterranean)
mean(num_reviews_Mediterranean$tot_rev)

#####################################################################
# Part 2c: Analysis of Method 1 -- Apply new weight and see effect #
#####################################################################

# Combine num_reviews information with original data frame of indian restaurant reviews
cin <- inner_join(indian, num_reviews_Indian)

# Generate "weighted_stars" for later calculation
cin$weighted_stars <- cin$stars * cin$tot_rev

# Use "summarise" to generate a new rating for each restaurant
new_rating_Indian <- cin %>% select(city, business_name, avg_stars, stars,
                                    tot_rev, weighted_stars) %>%
  group_by(city, business_name, avg_stars) %>%
  summarise(cnt = n(),
            avg = sum(stars) / cnt,
            new = sum(weighted_stars) / sum(tot_rev),
            dif = new - avg)

# Print summary data of the effect this new rating has
summary(new_rating_Indian$dif)

# Limit to those with more than 5 ratings and redo summary
nri5 <- subset(new_rating_Indian, cnt > 5)
summary(nri5$dif)


################################################################
# Part 3: Analysis of Method 2 -- Generate "immigrant" rating #
################################################################

# Read Indian names into a list
# (scan() splits on whitespace by default, so a multi-word entry such as
# "Anand Kumar" is read as separate tokens)
inames <- scan("indian_names.txt", what = character())

# Add field "reviewer_indian_name" to indian reviews if user name is in the list
indian$reviewer_indian_name <- indian$user_name %in% inames

# Generate "istars" for internal calculation later
indian$istars <- indian$stars * indian$reviewer_indian_name

# Find out # of reviews written by reviewers with a uniquely Indian name
table(indian$reviewer_indian_name)
1274/(1274 + 11872)    # ~0.097

# Generate new "immigrant" rating
# ("ias" is NaN for a restaurant with no such reviews, because nin is 0)
avg_rating_Indian <- indian %>% select(business_id, business_name, city, stars,
                                       avg_stars, reviewer_indian_name,
                                       is_indian, istars) %>%
  group_by(city, business_name, avg_stars) %>%
  summarise(count = n(),
            nin = sum(reviewer_indian_name),
            pin = sum(reviewer_indian_name) / n(),
            avg = sum(stars) / count,
            ias = sum(istars) / nin,
            dif = ias - avg)

# Find out extent of effect of new rating
summary(avg_rating_Indian$dif)

# Limit to those restaurants with more than 5 "immigrant" reviews and look at effect again
ari5 <- subset(avg_rating_Indian, nin > 5)
summary(ari5$dif)


--------------------------------------------------------------------------------
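
The Method 1 arithmetic in Part 2c of yelp.R can be checked by hand on a toy example. The following is a minimal Python sketch, not part of the original repository. It assumes each review of a single restaurant is a (user_id, stars) pair and that a dictionary gives each user's number of reviews for the cuisine (the `tot_rev` column in yelp.R). The function name `weighted_rating` and the sample data are hypothetical.

```python
# Illustrative sketch only: mirrors the "avg", "new" and "dif" columns that
# Part 2c of yelp.R computes per restaurant. The function name and the
# sample data below are hypothetical and do not come from the Yelp dataset.

def weighted_rating(reviews, tot_rev):
    """Return (plain_avg, weighted_avg, difference) for one restaurant.

    reviews: list of (user_id, stars) pairs for the restaurant
    tot_rev: dict mapping user_id -> number of reviews of this cuisine
    """
    # Plain average: every review counts once
    plain_avg = sum(stars for _, stars in reviews) / float(len(reviews))

    # Weighted average: each review is weighted by the reviewer's tot_rev
    weighted_sum = sum(stars * tot_rev[user] for user, stars in reviews)
    total_weight = sum(tot_rev[user] for user, _ in reviews)
    weighted_avg = weighted_sum / float(total_weight)

    return plain_avg, weighted_avg, weighted_avg - plain_avg


# Three reviews of one restaurant. User "a" has reviewed 4 restaurants of
# this cuisine; users "b" and "c" have reviewed 1 each.
sample_reviews = [("a", 5), ("b", 3), ("c", 1)]
sample_tot_rev = {"a": 4, "b": 1, "c": 1}

avg, new, dif = weighted_rating(sample_reviews, sample_tot_rev)
print("avg=%.2f new=%.2f dif=%.2f" % (avg, new, dif))
```

In this toy case the plain average is 3.0, while the weighted rating is (5*4 + 3*1 + 1*1) / (4 + 1 + 1) = 4.0, so the more experienced reviewer pulls the rating up by one star.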