├── README.md ├── YelpDataCourseraPR1_HwanpyoKim.ipynb ├── YelpDataCourseraPR1_HwanpyoKim.txt └── YelpERDiagram.png /README.md: -------------------------------------------------------------------------------- 1 | # Coursera: SQL for Data Science - Peer-graded Assignment 2 | 3 | [Hwanpyo Kim's Answer](./YelpDataCourseraPR1_HwanpyoKim.ipynb) 4 | -------------------------------------------------------------------------------- /YelpDataCourseraPR1_HwanpyoKim.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# SQL for Data Science: Peer Review Assignment\n", 8 | "\n", 9 | "## Data Scientist Role Play: Profiling and Analyzing the Yelp Dataset Coursera Worksheet\n", 10 | "\n", 11 | "This notebook is based on my original [worksheet](./YelpDataCourseraPR1_HwanpyoKim.txt) for the peer review assignment using the [Yelp Open Dataset](https://www.yelp.com/dataset) in the Coursera class [SQL for Data Science](https://www.coursera.org/learn/sql-for-data-science/home/welcome).\n", 12 | "\n", 13 | "**Yelp Dataset ER diagram**\n", 14 | "" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "### Part 1: Yelp Dataset Profiling and Understanding" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "**1\\. 
Profile the data by finding the total number of records for each of the tables below:**\n", 29 | "\n", 30 | "\n", 31 | "* Attribute table = 10000\n", 32 | "```SQL\n", 33 | "SELECT count(*) FROM attribute\n", 34 | "```\n", 35 | "* Business table = 10000\n", 36 | "```SQL\n", 37 | "SELECT count(*) FROM business\n", 38 | "```\n", 39 | "* Category table = 10000\n", 40 | "```SQL\n", 41 | "SELECT count(*) FROM category\n", 42 | "```\n", 43 | "* Checkin table = 10000\n", 44 | "```SQL\n", 45 | "SELECT count(*) FROM checkin\n", 46 | "```\n", 47 | "* elite_years table = 10000\n", 48 | "```SQL\n", 49 | "SELECT count(*) FROM elite_years\n", 50 | "```\n", 51 | "* friend table = 10000\n", 52 | "```SQL\n", 53 | "SELECT count(*) FROM friend\n", 54 | "```\n", 55 | "* hours table = 10000\n", 56 | "```SQL\n", 57 | "SELECT count(*) FROM hours\n", 58 | "```\n", 59 | "* photo table = 10000\n", 60 | "```SQL\n", 61 | "SELECT count(*) FROM photo\n", 62 | "```\n", 63 | "* review table = 10000\n", 64 | "```SQL\n", 65 | "SELECT count(*) FROM review\n", 66 | "```\n", 67 | "* tip table = 10000\n", 68 | "```SQL\n", 69 | "SELECT count(*) FROM tip\n", 70 | "```\n", 71 | "* user table = 10000\n", 72 | "```SQL\n", 73 | "SELECT count(*) FROM user\n", 74 | "```\n" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "**2\\. Find the total distinct records by either the foreign key or primary key for each table. 
If two foreign keys are listed in the table, please specify which foreign key.**\n", 82 | "\n", 83 | "Note: Primary Keys are denoted in the ER-Diagram with a yellow key icon.\t\n", 84 | "\n", 85 | "* Business = 10000 (id)\n", 86 | "```SQL\n", 87 | "SELECT count(distinct id) FROM business\n", 88 | "```\n", 89 | "* Hours = 1562 (business_id)\n", 90 | "```SQL\n", 91 | "SELECT count(distinct business_id) FROM hours\n", 92 | "```\n", 93 | "* Category = 2643 (business_id)\n", 94 | "```SQL\n", 95 | "SELECT count(distinct business_id) FROM category\n", 96 | "```\n", 97 | "* Attribute = 1115 (business_id)\n", 98 | "```SQL\n", 99 | "SELECT count(distinct business_id) FROM attribute\n", 100 | "```\n", 101 | "* Review = 10000 (id), 8090 (business_id), 9581 (user_id)\n", 102 | "```SQL\n", 103 | "SELECT count(distinct id), count(distinct business_id), count(distinct user_id) FROM review\n", 104 | "```\n", 105 | "* Checkin = 493 (business_id)\n", 106 | "```SQL\n", 107 | "SELECT count(distinct business_id) FROM checkin\n", 108 | "```\n", 109 | "* Photo = 10000 (id), 6493 (business_id)\n", 110 | "```SQL\n", 111 | "SELECT count(distinct id), count(distinct business_id) FROM photo\n", 112 | "```\n", 113 | "* Tip = 3979 (business_id), 537 (user_id)\n", 114 | "```SQL\n", 115 | "SELECT count(distinct business_id), count(distinct user_id) FROM tip\n", 116 | "```\n", 117 | "* User = 10000 (id)\n", 118 | "```SQL\n", 119 | "SELECT count(distinct id) FROM user\n", 120 | "```\n", 121 | "* Friend = 11 (user_id)\n", 122 | "```SQL\n", 123 | "SELECT count(distinct user_id) FROM friend\n", 124 | "```\n", 125 | "* Elite_years = 2780 (user_id)\n", 126 | "```SQL\n", 127 | "SELECT count(distinct user_id) FROM elite_years\n", 128 | "```\n", 129 | "\n" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "**3\\. Are there any columns with null values in the Users table? 
Indicate \"yes,\" or \"no.\"**\n", 137 | "\n", 138 | "Answer: no\n", 139 | "\t\t\n", 140 | "SQL code used to arrive at answer:\n", 141 | " \n", 142 | "```SQL\n", 143 | "SELECT count(*) - count(id),\n", 144 | " count(*) - count(name),\n", 145 | " count(*) - count(review_count),\n", 146 | " count(*) - count(yelping_since),\n", 147 | " count(*) - count(useful),\n", 148 | " count(*) - count(cool),\n", 149 | " count(*) - count(fans),\n", 150 | " count(*) - count(average_stars),\n", 151 | " count(*) - count(compliment_hot),\n", 152 | " count(*) - count(compliment_more),\n", 153 | " count(*) - count(compliment_profile),\n", 154 | " count(*) - count(compliment_cute),\n", 155 | " count(*) - count(compliment_list),\n", 156 | " count(*) - count(compliment_note),\n", 157 | " count(*) - count(compliment_plain),\n", 158 | " count(*) - count(compliment_cool),\n", 159 | " count(*) - count(compliment_funny),\n", 160 | " count(*) - count(compliment_writer),\n", 161 | " count(*) - count(compliment_photos)\n", 162 | "FROM user\n", 163 | "```" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "**4\\. 
For each table and column listed below, display the smallest (minimum), largest (maximum), and average (mean) value for the following fields:**\n", 171 | "\n", 172 | "* Table: Review, Column: Stars\n", 173 | "\n", 174 | "min: 1\t\tmax: 5\t\tavg: 3.7082\n", 175 | "\n", 176 | "```SQL\n", 177 | "SELECT min(stars), max(stars), avg(stars) FROM review\n", 178 | "```\n", 179 | "\n", 180 | "\n", 181 | "* Table: Business, Column: Stars\n", 182 | "\n", 183 | "min: 1\t\tmax: 5\t\tavg: 3.6549\n", 184 | "\n", 185 | "```SQL\n", 186 | "SELECT min(stars), max(stars), avg(stars) FROM business\n", 187 | "```\n", 188 | "\n", 189 | "* Table: Tip, Column: Likes\n", 190 | "\n", 191 | "min: 0\t\tmax: 2\t\tavg: 0.0144\n", 192 | "\n", 193 | "```SQL\n", 194 | "SELECT min(likes), max(likes), avg(likes) FROM tip\n", 195 | "```\n", 196 | "\n", 197 | "* Table: Checkin, Column: Count\n", 198 | "\n", 199 | "min: 1\t\tmax: 53\t\tavg: 1.9414\n", 200 | "\n", 201 | "```SQL\n", 202 | "SELECT min(count), max(count), avg(count) FROM checkin\n", 203 | "```\n", 204 | "\n", 205 | "* Table: User, Column: Review_count\n", 206 | "\n", 207 | "min: 0\t\tmax: 2000\tavg: 24.2995\n", 208 | "\n", 209 | "```SQL\n", 210 | "SELECT min(review_count), max(review_count), avg(review_count) FROM user\n", 211 | "```" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": {}, 217 | "source": [ 218 | "**5\\. 
List the cities with the most reviews in descending order:**\n", 219 | "\n", 220 | "SQL code used to arrive at answer:\n", 221 | "\n", 222 | "```SQL\n", 223 | "SELECT sum(review_count) AS total_reviews, city\n", 224 | "FROM business\n", 225 | "GROUP BY city\n", 226 | "ORDER BY total_reviews DESC\n", 227 | "```\n", 228 | "\n", 229 | "Copy and Paste the Result Below:\n", 230 | "\n", 231 | "```\n", 232 | "+---------------+-----------------+\n", 233 | "| total_reviews | city |\n", 234 | "+---------------+-----------------+\n", 235 | "| 82854 | Las Vegas |\n", 236 | "| 34503 | Phoenix |\n", 237 | "| 24113 | Toronto |\n", 238 | "| 20614 | Scottsdale |\n", 239 | "| 12523 | Charlotte |\n", 240 | "| 10871 | Henderson |\n", 241 | "| 10504 | Tempe |\n", 242 | "| 9798 | Pittsburgh |\n", 243 | "| 9448 | Montréal |\n", 244 | "| 8112 | Chandler |\n", 245 | "| 6875 | Mesa |\n", 246 | "| 6380 | Gilbert |\n", 247 | "| 5593 | Cleveland |\n", 248 | "| 5265 | Madison |\n", 249 | "| 4406 | Glendale |\n", 250 | "| 3814 | Mississauga |\n", 251 | "| 2792 | Edinburgh |\n", 252 | "| 2624 | Peoria |\n", 253 | "| 2438 | North Las Vegas |\n", 254 | "| 2352 | Markham |\n", 255 | "| 2029 | Champaign |\n", 256 | "| 1849 | Stuttgart |\n", 257 | "| 1520 | Surprise |\n", 258 | "| 1465 | Lakewood |\n", 259 | "| 1155 | Goodyear |\n", 260 | "+---------------+-----------------+\n", 261 | "(Output limit exceeded, 25 of 362 total rows shown)\n", 262 | "```" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "**6\\. 
Find the distribution of star ratings to the business in the following cities:**\n", 270 | "\n", 271 | "* Avon\n", 272 | "\n", 273 | "SQL code used to arrive at answer:\n", 274 | "\n", 275 | "```SQL\n", 276 | "SELECT stars AS star_rating, count(stars) AS count\n", 277 | "FROM business\n", 278 | "WHERE city='Avon'\n", 279 | "GROUP BY stars\n", 280 | "```\n", 281 | "\n", 282 | "Copy and Paste the Resulting Table Below (2 columns – star rating and count):\n", 283 | "\n", 284 | "```\n", 285 | "+-------------+-------+\n", 286 | "| star_rating | count |\n", 287 | "+-------------+-------+\n", 288 | "| 1.5 | 1 |\n", 289 | "| 2.5 | 2 |\n", 290 | "| 3.5 | 3 |\n", 291 | "| 4.0 | 2 |\n", 292 | "| 4.5 | 1 |\n", 293 | "| 5.0 | 1 |\n", 294 | "+-------------+-------+\n", 295 | "```\n", 296 | "\n", 297 | "* Beachwood\n", 298 | "\n", 299 | "\n", 300 | "SQL code used to arrive at answer:\n", 301 | "\n", 302 | "```SQL\n", 303 | "SELECT stars AS star_rating, count(stars) AS count\n", 304 | "FROM business\n", 305 | "WHERE city='Beachwood'\n", 306 | "GROUP BY stars\n", 307 | "```\n", 308 | "\n", 309 | "Copy and Paste the Resulting Table Below (2 columns – star rating and count):\n", 310 | "\n", 311 | "```\n", 312 | "+-------------+-------+\n", 313 | "| star_rating | count |\n", 314 | "+-------------+-------+\n", 315 | "| 2.0 | 1 |\n", 316 | "| 2.5 | 1 |\n", 317 | "| 3.0 | 2 |\n", 318 | "| 3.5 | 2 |\n", 319 | "| 4.0 | 1 |\n", 320 | "| 4.5 | 2 |\n", 321 | "| 5.0 | 5 |\n", 322 | "+-------------+-------+\n", 323 | "```\n", 324 | "\n" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": {}, 330 | "source": [ 331 | "**7\\. 
Find the top 3 users based on their total number of reviews:**\n", 332 | "\n", 333 | "SQL code used to arrive at answer:\n", 334 | "\n", 335 | "```SQL\n", 336 | "SELECT name, sum(review_count) AS total_count\n", 337 | "FROM user\n", 338 | "GROUP BY id\n", 339 | "ORDER BY total_count DESC\n", 340 | "LIMIT 3\n", 341 | "```\n", 342 | "\t\t\n", 343 | "Copy and Paste the Result Below:\n", 344 | "```\n", 345 | "+--------+-------------+\n", 346 | "| name | total_count |\n", 347 | "+--------+-------------+\n", 348 | "| Gerald | 2000 |\n", 349 | "| Sara | 1629 |\n", 350 | "| Yuri | 1339 |\n", 351 | "+--------+-------------+\n", 352 | "```\n" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "**8\\. Does posting more reviews correlate with more fans?**\n", 360 | "\n", 361 | "Please explain your findings and interpretation of the results:\n", 362 | "\t\n", 363 | "Yes, the two appear to be correlated: users in higher fan brackets have written more reviews on average (about 15, 283, and 892 reviews for the 0 - 9, 10 - 99, and 100-1000 fan groups, respectively).\n", 364 | "\n", 365 | "SQL code: \n", 366 | "\n", 367 | "```SQL\n", 368 | "SELECT range AS fans_range, \n", 369 | " COUNT(*) AS num_user, \n", 370 | " AVG(review_count) AS avg_num_review, \n", 371 | " AVG(fans) AS avg_num_fans\n", 372 | "FROM (SELECT CASE WHEN fans BETWEEN 0 AND 9 THEN '0 - 9'\n", 373 | " WHEN fans BETWEEN 10 AND 99 THEN '10 - 99'\n", 374 | " ELSE '100-1000' END AS range,\n", 375 | " review_count, \n", 376 | " fans\n", 377 | " FROM user) AS subtable\n", 378 | "GROUP BY subtable.range\n", 379 | "```\n", 380 | "\n", 381 | "Result: \n", 382 | "```\n", 383 | "+------------+----------+----------------+----------------+\n", 384 | "| fans_range | num_user | avg_num_review | avg_num_fans |\n", 385 | "+------------+----------+----------------+----------------+\n", 386 | "| 0 - 9 | 9690 | 15.0085655315 | 0.447265221878 |\n", 387 | "| 10 - 99 | 294 | 283.326530612 | 25.5986394558 |\n", 388 | "| 100-1000 | 16 | 891.5 | 189.75 |\n", 389 | 
"+------------+----------+----------------+----------------+\n", 390 | "```" 391 | ] 392 | }, 393 | { 394 | "cell_type": "markdown", 395 | "metadata": {}, 396 | "source": [ 397 | "**9\\. Are there more reviews with the word \"love\" or with the word \"hate\" in them?**\n", 398 | "\n", 399 | "Answer: \"love\"\n", 400 | "\n", 401 | "SQL code used to arrive at answer:\n", 402 | "\n", 403 | "```SQL\n", 404 | "SELECT reaction, count(*) count\n", 405 | "FROM (\n", 406 | " SELECT CASE WHEN LOWER(text) LIKE '%love%' THEN 'love'\n", 407 | " WHEN LOWER(text) LIKE '%hate%' THEN 'hate' \n", 408 | " ELSE NULL END AS reaction\n", 409 | " FROM review)\n", 410 | "GROUP BY reaction\n", 411 | "ORDER BY count DESC\n", 412 | "```\n", 413 | "\n", 414 | "Result: \n", 415 | "```\n", 416 | "+----------+-------+\n", 417 | "| reaction | count |\n", 418 | "+----------+-------+\n", 419 | "| None | 8042 |\n", 420 | "| love | 1780 |\n", 421 | "| hate | 178 |\n", 422 | "+----------+-------+\n", 423 | "```" 424 | ] 425 | }, 426 | { 427 | "cell_type": "markdown", 428 | "metadata": {}, 429 | "source": [ 430 | "**10\\. Find the top 10 users with the most fans:**\n", 431 | "\n", 432 | "SQL code used to arrive at answer:\n", 433 | "```SQL\t\n", 434 | "SELECT name, fans\n", 435 | "FROM user\n", 436 | "ORDER BY fans DESC\n", 437 | "LIMIT 10\n", 438 | "```\n", 439 | "\t\n", 440 | "Copy and Paste the Result Below:\n", 441 | "```\n", 442 | "+-----------+------+\n", 443 | "| name | fans |\n", 444 | "+-----------+------+\n", 445 | "| Amy | 503 |\n", 446 | "| Mimi | 497 |\n", 447 | "| Harald | 311 |\n", 448 | "| Gerald | 253 |\n", 449 | "| Christine | 173 |\n", 450 | "| Lisa | 159 |\n", 451 | "| Cat | 133 |\n", 452 | "| William | 126 |\n", 453 | "| Fran | 124 |\n", 454 | "| Lissa | 120 |\n", 455 | "+-----------+------+\n", 456 | "```" 457 | ] 458 | }, 459 | { 460 | "cell_type": "markdown", 461 | "metadata": {}, 462 | "source": [ 463 | "**11\\. 
Is there a strong relationship (or correlation) between having a high number of fans and being listed as \"useful\" or \"funny?\" Out of the top 10 users with the highest number of fans, what percent are also listed as “useful” or “funny”?**\n", 464 | "\n", 465 | "Key:
\n", 466 | "0% - 25% - Low relationship
\n", 467 | "26% - 75% - Medium relationship
\n", 468 | "76% - 100% - Strong relationship\n", 469 | "\t\n", 470 | "SQL code used to arrive at answer:\n", 471 | "\n", 472 | "```SQL\n", 473 | "SELECT name, fans, (useful + funny)*1.0/(useful + funny + cool) AS Per_useful_funny, \n", 474 | "CASE WHEN (useful + funny)*1.0/(useful + funny + cool) >= 0.0 AND \n", 475 | " (useful + funny)*1.0/(useful + funny + cool) <=0.25 THEN 'Low'\n", 476 | " WHEN (useful + funny)*1.0/(useful + funny + cool) > 0.25 AND \n", 477 | " (useful + funny)*1.0/(useful + funny + cool) <=0.75 THEN 'Medium'\n", 478 | " ELSE 'Strong' END AS Key\n", 479 | "FROM user\n", 480 | "ORDER BY fans DESC\n", 481 | "LIMIT 10\t\n", 482 | "```\n", 483 | "\t\n", 484 | "Copy and Paste the Result Below:\n", 485 | "```\n", 486 | "+-----------+------+------------------+--------+\n", 487 | "| name | fans | Per_useful_funny | Key |\n", 488 | "+-----------+------+------------------+--------+\n", 489 | "| Amy | 503 | 0.677529011839 | Medium |\n", 490 | "| Mimi | 497 | 0.712996389892 | Medium |\n", 491 | "| Harald | 311 | 0.666268364881 | Medium |\n", 492 | "| Gerald | 253 | 0.569428505853 | Medium |\n", 493 | "| Christine | 173 | 0.726536295171 | Medium |\n", 494 | "| Lisa | 159 | 0.910447761194 | Strong |\n", 495 | "| Cat | 133 | 0.617081850534 | Medium |\n", 496 | "| William | 126 | 0.666476827792 | Medium |\n", 497 | "| Fran | 124 | 0.651356292676 | Medium |\n", 498 | "| Lissa | 120 | 0.638859556494 | Medium |\n", 499 | "+-----------+------+------------------+--------+\t\n", 500 | "```\t\n", 501 | "Please explain your findings and interpretation of the results:\n", 502 | "\t\n", 503 | "Nine of the top 10 users fall in the Medium band (26% - 75%) and one, Lisa, falls in the Strong band, so on average there is a medium relationship." 504 | ] 505 | }, 506 | { 507 | "cell_type": "markdown", 508 | "metadata": {}, 509 | "source": [ 510 | "### Part 2: Inferences and Analysis" 511 | ] 512 | }, 513 | { 514 | "cell_type": "markdown", 515 | "metadata": {}, 516 | "source": [ 517 | "**1\\. 
Pick one city and category of your choice and group the businesses in that city or category by their overall star rating. Compare the businesses with 2-3 stars to the businesses with 4-5 stars and answer the following questions. Include your code.**\n", 518 | "\n", 519 | "I chose the city \"Las Vegas\" and the \"Shopping\" category." 520 | ] 521 | }, 522 | { 523 | "cell_type": "markdown", 524 | "metadata": {}, 525 | "source": [ 526 | "**i\\. Do the two groups you chose to analyze have a different distribution of hours?**\n", 527 | "\n", 528 | "SQL code:\n", 529 | "```SQL\n", 530 | "SELECT CASE WHEN stars >= 4.0 THEN '4-5 stars'\n", 531 | " WHEN stars >= 2.0 THEN '2-3 stars'\n", 532 | " ELSE 'below 2' END AS 'STARS', \n", 533 | " COUNT(DISTINCT business.id) AS id_count, \n", 534 | " COUNT(hours) AS open_days_total, -- number of opening days \n", 535 | " COUNT(hours)*1.0 / COUNT(DISTINCT business.id) AS open_days_avg\n", 536 | "FROM ((business INNER JOIN hours ON business.id = hours.business_id)\n", 537 | " INNER JOIN category ON business.id = category.business_id)\n", 538 | "WHERE city = 'Las Vegas' AND category.category ='Shopping'\n", 539 | "GROUP BY STARS\n", 540 | "```\n", 541 | "\n", 542 | "Result: \n", 543 | "```\n", 544 | "+-----------+----------+-----------------+---------------+\n", 545 | "| STARS | id_count | open_days_total | open_days_avg |\n", 546 | "+-----------+----------+-----------------+---------------+\n", 547 | "| 2-3 stars | 2 | 13 | 6.5 |\n", 548 | "| 4-5 stars | 2 | 12 | 6.0 |\n", 549 | "+-----------+----------+-----------------+---------------+\n", 550 | "```\n", 551 | "\n", 552 | "There is no large difference in opening days between the two groups (an average of 6.5 vs. 6.0 open days per business)." 553 | ] 554 | }, 555 | { 556 | "cell_type": "markdown", 557 | "metadata": {}, 558 | "source": [ 559 | "**ii\\. 
Do the two groups you chose to analyze have a different number of reviews?**\n", 560 | "\n", 561 | "SQL code:\n", 562 | "```SQL\n", 563 | "SELECT CASE WHEN stars >= 4.0 THEN '4-5 stars'\n", 564 | " WHEN stars >= 2.0 THEN '2-3 stars'\n", 565 | " ELSE 'below 2' END AS 'STARS', \n", 566 | " COUNT(DISTINCT business.id) AS id_count, \n", 567 | " SUM(review_count) AS review_count_total,\n", 568 | " SUM(review_count)*1.0/COUNT(DISTINCT business.id) AS review_count_avg\n", 569 | "FROM business INNER JOIN category ON business.id = category.business_id\n", 570 | "WHERE city = 'Las Vegas' AND category.category ='Shopping'\n", 571 | "GROUP BY STARS \n", 572 | "```\n", 573 | "\n", 574 | "Result:\n", 575 | "```\n", 576 | "+-----------+----------+--------------------+------------------+\n", 577 | "| STARS | id_count | review_count_total | review_count_avg |\n", 578 | "+-----------+----------+--------------------+------------------+\n", 579 | "| 2-3 stars | 2 | 17 | 8.5 |\n", 580 | "| 4-5 stars | 2 | 36 | 18.0 |\n", 581 | "+-----------+----------+--------------------+------------------+\n", 582 | "```\n", 583 | "\n", 584 | "The total number of reviews differs between the two groups: 17 vs. 36. The 4-5 star group has roughly twice as many reviews as the 2-3 star group." 585 | ] 586 | }, 587 | { 588 | "cell_type": "markdown", 589 | "metadata": {}, 590 | "source": [ 591 | "**iii\\. Are you able to infer anything from the location data provided between these two groups? 
Explain.**\n", 592 | "\n", 593 | "SQL code used for analysis:\n", 594 | "\n", 595 | "```SQL\n", 596 | "SELECT CASE WHEN stars >= 4.0 THEN '4-5 stars'\n", 597 | " WHEN stars >= 2.0 THEN '2-3 stars'\n", 598 | " ELSE 'below 2' END AS 'STARS',\n", 599 | " business.neighborhood,\n", 600 | " business.address,\n", 601 | " business.postal_code\n", 602 | "FROM business INNER JOIN category ON business.id = category.business_id\n", 603 | "WHERE city = 'Las Vegas' AND category.category ='Shopping'\n", 604 | "ORDER BY STARS\n", 605 | "```\n", 606 | "\n", 607 | "Result:\n", 608 | "```\n", 609 | "+-----------+--------------+-----------------------------+-------------+\n", 610 | "| STARS | neighborhood | address | postal_code |\n", 611 | "+-----------+--------------+-----------------------------+-------------+\n", 612 | "| 2-3 stars | Southeast | 3421 E Tropicana Ave, Ste I | 89121 |\n", 613 | "| 2-3 stars | Eastside | 3808 E Tropicana Ave | 89121 |\n", 614 | "| 4-5 stars | | 1000 Scenic Loop Dr | 89161 |\n", 615 | "| 4-5 stars | | 3555 W Reno Ave, Ste F | 89118 |\n", 616 | "+-----------+--------------+-----------------------------+-------------+\n", 617 | "```\n", 618 | "Both shops with 2-3 stars are located on E Tropicana Ave (postal code 89121), while the two shops with 4-5 stars are in different postal codes (89161 and 89118), away from each other." 619 | ] 620 | }, 621 | { 622 | "cell_type": "markdown", 623 | "metadata": {}, 624 | "source": [ 625 | "**2\\. Group business based on the ones that are open and the ones that are closed. What differences can you find between the ones that are still open and the ones that are closed? List at least two differences and the SQL code you used to arrive at your answer.**\n", 626 | "\n", 627 | "**i\\. Difference 1:** Number of businesses. There are more open businesses than closed ones.\n", 628 | " \n", 629 | " \n", 630 | "**ii\\. Difference 2:** Number of reviews and average star rating. Both are higher for open businesses. 
\n", 631 | " \n", 632 | " \n", 633 | "SQL code used for analysis:\n", 634 | "\n", 635 | "```SQL\n", 636 | "SELECT is_open, \n", 637 | " count(distinct business.id) num_business, \n", 638 | " count(distinct review.id) num_review,\n", 639 | " avg(review.stars) avg_stars\n", 640 | "FROM business\n", 641 | "JOIN review ON business.id = review.business_id\n", 642 | "GROUP BY is_open\n", 643 | "```\n", 644 | "\n", 645 | "Result:\n", 646 | "```\n", 647 | "+---------+--------------+------------+---------------+\n", 648 | "| is_open | num_business | num_review | avg_stars |\n", 649 | "+---------+--------------+------------+---------------+\n", 650 | "| 0 | 61 | 71 | 3.64788732394 |\n", 651 | "| 1 | 446 | 565 | 3.7610619469 |\n", 652 | "+---------+--------------+------------+---------------+\n", 653 | "```" 654 | ] 655 | }, 656 | { 657 | "cell_type": "markdown", 658 | "metadata": {}, 659 | "source": [ 660 | "**3\\. For this last part of your analysis, you are going to choose the type of analysis you want to conduct on the Yelp dataset and are going to prepare the data for analysis.**\n", 661 | "\n", 662 | "**Ideas for analysis include: Parsing out keywords and business attributes for sentiment analysis, clustering businesses to find commonalities or anomalies between them, predicting the overall star rating for a business, predicting the number of fans a user will have, and so on. These are just a few examples to get you started, so feel free to be creative and come up with your own problem you want to solve. Provide answers, in-line, to all of the following:**\n", 663 | "\t\n", 664 | "**i. Indicate the type of analysis you chose to do:**\n", 665 | " \n", 666 | "What is the most successful category of business?\n", 667 | " \n", 668 | "**ii. Write 1-2 brief paragraphs on the type of data you will need for your analysis and why you chose that data:**\n", 669 | "\n", 670 | "For each category, I calculate the average star rating and the proportion of businesses that are still open. 
To keep the statistics meaningful, I only consider categories with more than 10 businesses. \n", 671 | "\n", 672 | "From the output, we can see that \"Local Services\", \"Health & Medical\", \"Home Services\", \"Shopping\", and \"Beauty & Spas\" are successful; they get better ratings and generally have higher open rates. However, \"Bars\", \"Nightlife\", and \"Restaurants\" have lower stars and close more frequently.\n", 673 | " \n", 674 | " \n", 675 | "**iii. Output of your finished dataset:**\n", 676 | "```\n", 677 | "+------------------------+--------------+-----------+------------+\n", 678 | "| category | num_business | avg_stars | avg_isopen |\n", 679 | "+------------------------+--------------+-----------+------------+\n", 680 | "| Local Services | 12 | 4.21 | 0.83 |\n", 681 | "| Health & Medical | 17 | 4.09 | 0.94 |\n", 682 | "| Home Services | 16 | 4.0 | 0.94 |\n", 683 | "| Shopping | 30 | 3.98 | 0.83 |\n", 684 | "| Beauty & Spas | 13 | 3.88 | 0.92 |\n", 685 | "| American (Traditional) | 11 | 3.82 | 0.73 |\n", 686 | "| Food | 23 | 3.78 | 0.87 |\n", 687 | "| Bars | 17 | 3.5 | 0.65 |\n", 688 | "| Nightlife | 20 | 3.48 | 0.6 |\n", 689 | "| Restaurants | 71 | 3.46 | 0.75 |\n", 690 | "+------------------------+--------------+-----------+------------+ \n", 691 | "``` \n", 692 | "**iv. 
Provide the SQL code you used to create your final dataset:**\n", 693 | "```SQL\n", 694 | "SELECT category.category, \n", 695 | " count(business.id) num_business, \n", 696 | " round(avg(business.stars),2) avg_stars, \n", 697 | " round(avg(business.is_open),2) avg_isopen\n", 698 | "FROM (business INNER JOIN category ON business.id = category.business_id)\n", 699 | "GROUP BY category.category\n", 700 | "HAVING num_business > 10\n", 701 | "ORDER BY avg_stars DESC, avg_isopen DESC\n", 702 | "```" 703 | ] 704 | }, 705 | { 706 | "cell_type": "code", 707 | "execution_count": null, 708 | "metadata": { 709 | "collapsed": true 710 | }, 711 | "outputs": [], 712 | "source": [] 713 | } 714 | ], 715 | "metadata": { 716 | "kernelspec": { 717 | "display_name": "Python 3", 718 | "language": "python", 719 | "name": "python3" 720 | }, 721 | "language_info": { 722 | "codemirror_mode": { 723 | "name": "ipython", 724 | "version": 3 725 | }, 726 | "file_extension": ".py", 727 | "mimetype": "text/x-python", 728 | "name": "python", 729 | "nbconvert_exporter": "python", 730 | "pygments_lexer": "ipython3", 731 | "version": "3.6.2" 732 | } 733 | }, 734 | "nbformat": 4, 735 | "nbformat_minor": 2 736 | } 737 | -------------------------------------------------------------------------------- /YelpDataCourseraPR1_HwanpyoKim.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/legendarykim/Coursera_SQL_for_Data_Science/dc70c639bd324bee88c4c7d9527652977613f9dc/YelpDataCourseraPR1_HwanpyoKim.txt -------------------------------------------------------------------------------- /YelpERDiagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/legendarykim/Coursera_SQL_for_Data_Science/dc70c639bd324bee88c4c7d9527652977613f9dc/YelpERDiagram.png --------------------------------------------------------------------------------