└── FinalProject_chitra66.ipynb
/FinalProject_chitra66.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# COGS 108 - Final Project "
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# Overview"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "Sharing meals is central to personal and professional relationships, and much of that happens at restaurants. When we go out with friends, family, and colleagues, it is crucial that the restaurants we choose are safe and healthy. One way restaurants are held to that standard is through random health inspections, whose procedures are set by each county; in these inspections, many things (such as temperature, hygiene, and meat/food processing) are checked to make sure that they follow CDC and FDA guidelines. Customer ratings and health inspections are two of the ways that restaurants are kept accountable; this is why I wanted to know if publicizing health inspection scores on a platform like Yelp would improve health in restaurants. In my analysis, the data do not suggest that publicizing ratings makes restaurants healthier in the future. "
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "# Name & GitHub\n",
29 | "\n",
30 | "- Name: Chitra Kulkarni\n",
31 | "- GitHub Username: chitra66"
32 | ]
33 | },
34 | {
35 | "cell_type": "markdown",
36 | "metadata": {},
37 | "source": [
38 | "# Research Question"
39 | ]
40 | },
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "Does publicizing the health inspection score on platforms, like Yelp, improve the restaurant's overall health (as seen in future inspections) in popular cities such as Raleigh, North Carolina?"
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "## Background and Prior Work"
53 | ]
54 | },
55 | {
56 | "cell_type": "markdown",
57 | "metadata": {},
58 | "source": [
59 | "The first step is to understand the restaurant health inspection process in Raleigh, North Carolina, which is located in Wake County. In Wake County, restaurants are inspected one to four times a year, depending on what they serve and how they process food [1]. In NC, food establishments are graded on a numerical score: a score of 90 or higher receives an A, a score from 80 up to 90 receives a B, and a score from 70 up to 80 receives a C; if an establishment scores below 70, its permit is immediately revoked [2]. NC uses 4 risk categories to determine the frequency of inspections: risk category 1 is for establishments with no potentially hazardous foods, risk category 2 is for establishments with no more than 2 potentially hazardous foods, risk category 3 is for establishments with no more than 3 potentially hazardous foods, and risk category 4 is for establishments with an unlimited number of potentially hazardous foods [3]. Risk factors, according to the definitions put forth by the CDC and FDA, are \"food preparation practices and employee behaviors most commonly reported to the CDC as contributing factors in foodborne illness outbreaks\" [3].\n",
60 | "\n",
61 | "In an attempt to give customers more information, Yelp started adding health inspection scores to its website and apps. This move was based on prior research on restaurants in LA, where restaurants were required to post grade cards in their windows; that rule was followed by a decrease in foodborne illnesses for over two years in the LA area [4]. Yelp started adding restaurant sanitation scores for Raleigh, NC, a busy city because of tourism, in March of 2015 [5].\n",
62 | "\n",
63 | "One example of using scores to assess the value of an item is a research project that asked whether there is a correlation between the price of a wine and its WineEnthusiast score. That project relates to my research question because it asks whether an item is better or worse based on its score; similarly, I am interested in seeing whether the Yelp ratings and health inspection scores posted on Yelp make a restaurant more or less health-conscious [6].\n",
64 | "\n",
65 | "References (include links):\n",
66 | "\n",
67 | "1)http://www.wakegov.com/food/healthinspections/facilities/Pages/restaurants.aspx\n",
68 | "\n",
69 | "2)https://public.cdpehs.com/NCENVPBL/ESTABLISHMENT/ShowRestaurantRules.aspx\n",
70 | "\n",
71 | "3)https://ehs.ncpublichealth.com/faf/docs/foodprot/NC-MarkingInstructionsFinal-110419.pdf\n",
72 | "\n",
73 | "4)https://www.washingtonpost.com/news/voraciously/wp/2018/07/24/yelp-adds-health-inspection-scores-for-restaurants-and-restaurateurs-are-not-happy/\n",
74 | "\n",
75 | "5)https://gcn.com/articles/2015/03/02/yelp-city-restaurant-inspections.aspx\n",
76 | "\n",
77 | "6)FinalProject_group085.ipynb"
78 | ]
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "metadata": {},
83 | "source": [
84 | "# Hypothesis\n"
85 | ]
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "metadata": {},
90 | "source": [
91 | "Since restaurants depend on customers' ratings and reviews for their business, I believe that publicizing health inspection scores on Yelp will improve overall restaurant health. When customers write reviews about problems that reflect on a restaurant’s health, other customers will be discouraged from going there; additionally, to improve their ratings, restaurants would be more willing to raise their health standards before the next inspection.\n",
92 | "\n"
93 | ]
94 | },
95 | {
96 | "cell_type": "markdown",
97 | "metadata": {},
98 | "source": [
99 | "# Dataset(s)"
100 | ]
101 | },
102 | {
103 | "cell_type": "markdown",
104 | "metadata": {},
105 | "source": [
106 | "I used three datasets for this analysis. The first dataset was inspections.csv, which captures all sanitation/health inspections from September 2012 to present. For my analysis, I specifically used the hsisid (state code identifying the restaurant), the facility type (to narrow my analysis to only restaurants), the city (to keep my analysis to just Raleigh, NC), and the zip code of each restaurant. This dataset had 18466 rows x 36 columns. The second dataset was violations.csv, which has 189802 rows x 18 columns and describes all violations (and reports which code was violated) at active restaurants in Wake County from September 2012 to present. In this dataset, I specifically used the hsisid (which is how I merged inspections.csv with violations.csv), violation type, and health inspection scores. There are three violation types described in this dataset: CDI, which means corrected during inspection; VR, which indicates that verification is required within 10 days; and R, which indicates a repeat offense [1]. The third dataset was yelp.csv (3688 rows x 31 columns), which has information about the restaurants' Yelp ratings. From this dataset, I used the zip code and rating information. I used the zip codes to merge this dataset with inspections.csv. \n",
107 | "\n",
108 | "1) https://ehs.ncpublichealth.com/faf/docs/foodprot/NC-MarkingInstructionsFinal-110419.pdf"
109 | ]
110 | },
111 | {
112 | "cell_type": "markdown",
113 | "metadata": {},
114 | "source": [
115 | "# Setup"
116 | ]
117 | },
118 | {
119 | "cell_type": "code",
120 | "execution_count": 10,
121 | "metadata": {},
122 | "outputs": [],
123 | "source": [
124 | "#imports\n",
125 | "import numpy as np\n",
126 | "import pandas as pd\n",
127 | "import matplotlib.pyplot as plt\n",
128 | " \n",
129 | "import seaborn as sns\n",
130 | "sns.set()\n",
131 | "sns.set_context('talk')"
132 | ]
133 | },
134 | {
135 | "cell_type": "code",
136 | "execution_count": 11,
137 | "metadata": {},
138 | "outputs": [],
139 | "source": [
140 | "#read data\n",
141 | "df_insp = pd.read_csv(\"/home/cnkulkar/Project/Data/inspections.csv\")\n",
142 | "df_viol = pd.read_csv(\"/home/cnkulkar/Project/Data/violations.csv\")\n",
143 | "df_yelp = pd.read_csv(\"/home/cnkulkar/Project/Data/yelp.csv\")"
144 | ]
145 | },
146 | {
147 | "cell_type": "markdown",
148 | "metadata": {},
149 | "source": [
150 | "# Data Cleaning"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": 12,
156 | "metadata": {},
157 | "outputs": [],
158 | "source": [
159 | "#clean df_insp\n",
160 | "df_insp = df_insp.drop(columns = ['date', 'phonenumber', 'top_match', 'second_match', 'days_from_open_date', 'x', 'y', 'geocodestatus', 'type', 'inspectedby', 'inspector_id', 'previous_inspection_by_same_inspector'])\n",
161 | "\n",
162 | "#clean df_viol\n",
163 | "df_viol = df_viol.drop(columns = ['inspectedby'])\n",
164 | "\n",
165 | "#clean df_yelp\n",
166 | "df_yelp = df_yelp.drop(columns = ['latitude', 'longitude', 'phone', 'hotdogs', 'sandwiches', 'pizza', 'tradamerican', 'burgers', 'mexican', 'grocery', 'breakfast_brunch', 'coffee', 'chinese', 'italian', 'newamerican', 'chicken_wings', 'delis', 'bars', 'salad', 'seafood', 'bbq', 'bakeries', 'sushi'])\n",
167 | "df_yelp = df_yelp.rename(columns={'zip_code': 'zip'})"
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": 13,
173 | "metadata": {},
174 | "outputs": [
175 | {
176 | "data": {
345 | "text/plain": [
346 | " hsisid name address1 address2 city \\\n",
347 | "0 4092013748 Cafe 3000 At Wake Med 3000 New Bern Ave NaN raleigh \n",
348 | "1 4092013748 Cafe 3000 At Wake Med 3000 New Bern Ave NaN raleigh \n",
349 | "2 4092013748 Cafe 3000 At Wake Med 3000 New Bern Ave NaN raleigh \n",
350 | "3 4092013748 Cafe 3000 At Wake Med 3000 New Bern Ave NaN raleigh \n",
351 | "4 4092013748 Cafe 3000 At Wake Med 3000 New Bern Ave NaN raleigh \n",
352 | "\n",
353 | " state postalcode restaurantopendate facilitytype zip ... \\\n",
354 | "0 NC 27610 2002-12-21T00:00:00Z Restaurant 27610 ... \n",
355 | "1 NC 27610 2002-12-21T00:00:00Z Restaurant 27610 ... \n",
356 | "2 NC 27610 2002-12-21T00:00:00Z Restaurant 27610 ... \n",
357 | "3 NC 27610 2002-12-21T00:00:00Z Restaurant 27610 ... \n",
358 | "4 NC 27610 2002-12-21T00:00:00Z Restaurant 27610 ... \n",
359 | "\n",
360 | " violationcode severity \\\n",
361 | "0 3-603.11 Priority Foundation \n",
362 | "1 3-502.11 Priority Foundation \n",
363 | "2 3-603.11 Priority Foundation \n",
364 | "3 3-603.11 Priority Foundation \n",
365 | "4 8-201.13 Core \n",
366 | "\n",
367 | " shortdesc \\\n",
368 | "0 Consumer advisory provided for raw or undercoo... \n",
369 | "1 Compliance with variance, specialized process,... \n",
370 | "2 Consumer advisory provided for raw or undercoo... \n",
371 | "3 Consumer advisory provided for raw or undercoo... \n",
372 | "4 Compliance with variance, specialized process,... \n",
373 | "\n",
374 | " comments pointvalue \\\n",
375 | "0 3-603.11; Priority Foundation; Establishment f... 0 \n",
376 | "1 Pf - 3-502.11 Variance Requirement - Establish... 0 \n",
377 | "2 Pf - 3-603.11 Consumption of Animal Foods that... 0 \n",
378 | "3 Pf - 3-603.11 Consumption of Animal Foods that... 0 \n",
379 | "4 Raw fish is used for sushi (sushi chef not pre... 0 \n",
380 | "\n",
381 | " observationtype violationtype count cdcriskfactor cdcdataitem \n",
382 | "0 Out CDI NaN NaN NaN \n",
383 | "1 Out CDI NaN NaN NaN \n",
384 | "2 Out R NaN NaN NaN \n",
385 | "3 Out VR NaN NaN NaN \n",
386 | "4 Out NaN NaN NaN NaN \n",
387 | "\n",
388 | "[5 rows x 40 columns]"
389 | ]
390 | },
391 | "execution_count": 13,
392 | "metadata": {},
393 | "output_type": "execute_result"
394 | }
395 | ],
396 | "source": [
397 | "#merge the inspections.csv with violations.csv, using the hsisid\n",
398 | "df = pd.merge(df_insp, df_viol, on = 'hsisid')\n",
399 | "df.head()"
400 | ]
401 | },
402 | {
403 | "cell_type": "code",
404 | "execution_count": 15,
405 | "metadata": {},
406 | "outputs": [
407 | {
408 | "data": {
577 | "text/plain": [
578 | " hsisid name_x address1_x address2 city \\\n",
579 | "0 4092013748 Cafe 3000 At Wake Med 3000 New Bern Ave NaN raleigh \n",
580 | "1 4092013748 Cafe 3000 At Wake Med 3000 New Bern Ave NaN raleigh \n",
581 | "2 4092013748 Cafe 3000 At Wake Med 3000 New Bern Ave NaN raleigh \n",
582 | "3 4092013748 Cafe 3000 At Wake Med 3000 New Bern Ave NaN raleigh \n",
583 | "4 4092013748 Cafe 3000 At Wake Med 3000 New Bern Ave NaN raleigh \n",
584 | "\n",
585 | " state postalcode restaurantopendate facilitytype zip ... \\\n",
586 | "0 NC 27610 2002-12-21T00:00:00Z Restaurant 27610 ... \n",
587 | "1 NC 27610 2002-12-21T00:00:00Z Restaurant 27610 ... \n",
588 | "2 NC 27610 2002-12-21T00:00:00Z Restaurant 27610 ... \n",
589 | "3 NC 27610 2002-12-21T00:00:00Z Restaurant 27610 ... \n",
590 | "4 NC 27610 2002-12-21T00:00:00Z Restaurant 27610 ... \n",
591 | "\n",
592 | " avg_neighbor_num_critical avg_neighbor_num_non_critical critical \\\n",
593 | "0 NaN NaN 1 \n",
594 | "1 NaN NaN 1 \n",
595 | "2 NaN NaN 1 \n",
596 | "3 NaN NaN 1 \n",
597 | "4 NaN NaN 1 \n",
598 | "\n",
599 | " id \\\n",
600 | "0 best-western-raleigh-inn-and-suites-raleigh \n",
601 | "1 walnut-creek-ampitheatre-raleigh \n",
602 | "2 kfc-raleigh-7 \n",
603 | "3 sheetz-raleigh \n",
604 | "4 walmart-raleigh-3 \n",
605 | "\n",
606 | " name_y is_closed rating \\\n",
607 | "0 holiday inn express & suites raleigh ne - medi... False 3.0 \n",
608 | "1 walnut creek ampitheatre False 3.0 \n",
609 | "2 kfc False 2.0 \n",
610 | "3 sheetz False 3.5 \n",
611 | "4 walmart False 2.0 \n",
612 | "\n",
613 | " review_count address1_y price \n",
614 | "0 7 3618 New Bern Ave $$ \n",
615 | "1 44 3801 Rock Quarry Rd $$ \n",
616 | "2 5 3408 Poole Road $ \n",
617 | "3 6 5200 New Bern Ave $ \n",
618 | "4 12 4431 New Bern Ave $ \n",
619 | "\n",
620 | "[5 rows x 31 columns]"
621 | ]
622 | },
623 | "execution_count": 15,
624 | "metadata": {},
625 | "output_type": "execute_result"
626 | }
627 | ],
628 | "source": [
629 | "#merge the inspections.csv and yelp.csv dataframes with the zipcodes\n",
630 | "df2 = pd.merge(df_insp, df_yelp, on = 'zip')\n",
631 | "\n",
632 | "#keep only rows for restaurants located in Raleigh\n",
633 | "df2 = df2[df2['facilitytype'] == 'Restaurant']\n",
634 | "df2 = df2[df2['city'] == 'raleigh']\n",
635 | "df2.head()"
636 | ]
637 | },
638 | {
639 | "cell_type": "code",
640 | "execution_count": 16,
641 | "metadata": {},
642 | "outputs": [
643 | {
644 | "data": {
645 | "text/plain": [
646 | ""
647 | ]
648 | },
649 | "execution_count": 16,
650 | "metadata": {},
651 | "output_type": "execute_result"
652 | },
653 | {
654 | "data": {
655 | "image/png": "\n",
656 | "text/plain": [
657 | ""
658 | ]
659 | },
660 | "metadata": {},
661 | "output_type": "display_data"
662 | }
663 | ],
664 | "source": [
665 | "#graph zip and score\n",
666 | "#both columns come from df2 so that x and y refer to the same rows\n",
667 | "sns.scatterplot(x = 'zip', y = 'score', data = df2)"
668 | ]
669 | },
670 | {
671 | "cell_type": "code",
672 | "execution_count": 17,
673 | "metadata": {},
674 | "outputs": [
675 | {
676 | "data": {
677 | "image/png": "\n",
678 | "text/plain": [
679 | ""
680 | ]
681 | },
682 | "metadata": {},
683 | "output_type": "display_data"
684 | }
685 | ],
686 | "source": [
687 | "#check the spread of zipcodes across rows\n",
688 | "#each row of df pairs an inspection with a violation, so zip codes with more records weigh more\n",
689 | "zip_plot = df['zip'].plot.hist(bins = 20)"
690 | ]
691 | },
692 | {
693 | "cell_type": "markdown",
694 | "metadata": {},
695 | "source": [
696 | "# Data Analysis & Results"
697 | ]
698 | },
699 | {
700 | "cell_type": "code",
701 | "execution_count": 19,
702 | "metadata": {},
703 | "outputs": [
704 | {
705 | "data": {
706 | "image/png": "\n",
707 | "text/plain": [
708 | ""
709 | ]
710 | },
711 | "metadata": {},
712 | "output_type": "display_data"
713 | }
714 | ],
715 | "source": [
716 | "#plot the frequency of the health inspection scores, using yelp and inspections data\n",
717 | "score_plot = df2['score'].plot.hist(bins = 20)"
718 | ]
719 | },
720 | {
721 | "cell_type": "code",
722 | "execution_count": 20,
723 | "metadata": {},
724 | "outputs": [
725 | {
726 | "data": {
727 | "image/png": "\n",
728 | "text/plain": [
729 | ""
730 | ]
731 | },
732 | "metadata": {},
733 | "output_type": "display_data"
734 | }
735 | ],
736 | "source": [
737 | "#plot the frequency of the yelp rating, using yelp and inspections data\n",
738 | "rating_plot = df2['rating'].plot.hist(bins = 20)"
739 | ]
740 | },
741 | {
742 | "cell_type": "code",
743 | "execution_count": 21,
744 | "metadata": {},
745 | "outputs": [],
746 | "source": [
747 | "#log-transform score and rating in place, so that differences reflect relative rather than absolute change\n",
748 | "df2['score'] = np.log10(df2['score'])\n",
749 | "df2['rating'] = np.log10(df2['rating'])"
750 | ]
751 | },
752 | {
753 | "cell_type": "code",
754 | "execution_count": 26,
755 | "metadata": {},
756 | "outputs": [
757 | {
758 | "data": {
759 | "text/plain": [
760 | ""
761 | ]
762 | },
763 | "execution_count": 26,
764 | "metadata": {},
765 | "output_type": "execute_result"
766 | },
767 | {
768 | "data": {
769 | "image/png": "\n",
770 | "text/plain": [
771 | ""
772 | ]
773 | },
774 | "metadata": {},
775 | "output_type": "display_data"
776 | }
777 | ],
778 | "source": [
779 | "#graph the log_transformation of health inspection scores\n",
780 | "df2['score'].plot(kind = 'hist', bins = 25)"
781 | ]
782 | },
783 | {
784 | "cell_type": "code",
785 | "execution_count": 25,
786 | "metadata": {},
787 | "outputs": [
788 | {
789 | "data": {
790 | "text/plain": [
791 | ""
792 | ]
793 | },
794 | "execution_count": 25,
795 | "metadata": {},
796 | "output_type": "execute_result"
797 | },
798 | {
799 | "data": {
800 | "image/png": "\n",
801 | "text/plain": [
802 | ""
803 | ]
804 | },
805 | "metadata": {},
806 | "output_type": "display_data"
807 | }
808 | ],
809 | "source": [
810 | "#graph the log_transformations of yelp rating\n",
811 | "df2['rating'].plot(kind = 'hist', bins = 25)"
812 | ]
813 | },
814 | {
815 | "cell_type": "code",
816 | "execution_count": 28,
817 | "metadata": {},
818 | "outputs": [
819 | {
820 | "data": {
821 | "text/plain": [
822 | ""
823 | ]
824 | },
825 | "execution_count": 28,
826 | "metadata": {},
827 | "output_type": "execute_result"
828 | },
829 | {
830 | "data": {
831 | "image/png": "\n",
832 | "text/plain": [
833 | ""
834 | ]
835 | },
836 | "metadata": {},
837 | "output_type": "display_data"
838 | }
839 | ],
840 | "source": [
841 | "#graph the log10-transformed rating (x-axis) against the log10-transformed score (y-axis)\n",
842 | "sns.scatterplot(x = 'rating', y = 'score', data = df2)"
843 | ]
844 | },
845 | {
846 | "cell_type": "markdown",
847 | "metadata": {},
848 | "source": [
849 | "# Ethics & Privacy"
850 | ]
851 | },
852 | {
853 | "cell_type": "markdown",
854 | "metadata": {},
855 | "source": [
856 | "Although health inspection reports are public record, I removed the names of health inspectors, their IDs, and anything else that would identify them, for privacy reasons. Since privacy concerns relate more to individual people than to establishments, I kept identifying information for the restaurants, such as names, addresses, zip codes, and hsisid. \n",
857 | "\n",
858 | "Ethics raises more considerations than privacy in this analysis. First, I conducted my analysis in only one city in the US, so there is a danger that the results could be used to describe restaurants all over the country, which would be wrong. Restricting the analysis to one area introduces a bias from the start. One way to reduce this bias is to use all restaurants in Raleigh rather than a hand-picked subset, which is what I did. In addition, it is crucial for the discussion section to explain that the results apply only to Raleigh's restaurants and their connection to Yelp reviews. Lastly, although the health inspection information is already public, connecting Yelp reviews to those inspections could be used to argue against the importance of health inspections, which is not the goal of the analysis. "
859 | ]
860 | },
861 | {
862 | "cell_type": "markdown",
863 | "metadata": {},
864 | "source": [
865 | "# Conclusion & Discussion"
866 | ]
867 | },
868 | {
869 | "cell_type": "markdown",
870 | "metadata": {},
871 | "source": [
872 | "Based on this analysis, the data for Raleigh, NC do not suggest that ratings on Yelp influence future health inspection scores; my hypothesis that the ratings would make restaurants healthier is not supported.\n",
873 | "\n",
874 | "One issue I encountered in my analysis was that there were more restaurants from one zip code than from the others, which is illustrated in the first few graphs: they show a skew when graphing the zip code against the scores. However, once I explored this data a bit more, I realized that this skew was not caused by the health inspection scores, but was a result of the dataset having more restaurants in the 27600 zip. This is when I decided that directly comparing the scores to the ratings would better answer my question.\n",
875 | "\n",
876 | "Another limitation of the data relates to the health inspection process and reports. First, the reports contain mostly non-numerical items, which cannot be used in a numerical analysis. Second, even though this data goes back to 2012, the fields considered in the analysis would not reflect changes in the inspection process over that period, if there were any."
877 | ]
878 | }
879 | ],
880 | "metadata": {
881 | "kernelspec": {
882 | "display_name": "Python 3",
883 | "language": "python",
884 | "name": "python3"
885 | },
886 | "language_info": {
887 | "codemirror_mode": {
888 | "name": "ipython",
889 | "version": 3
890 | },
891 | "file_extension": ".py",
892 | "mimetype": "text/x-python",
893 | "name": "python",
894 | "nbconvert_exporter": "python",
895 | "pygments_lexer": "ipython3",
896 | "version": "3.6.7"
897 | }
898 | },
899 | "nbformat": 4,
900 | "nbformat_minor": 2
901 | }
902 |
--------------------------------------------------------------------------------
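
A note on the zip-code merge in the notebook above: joining inspections.csv to yelp.csv on `zip` alone pairs every inspection row with every Yelp listing in the same zip code, which is why the `df2.head()` output matches a single cafe to a hotel, an amphitheatre, and a KFC. The following is only a minimal sketch of how that row fan-out can be seen, and of one possible alternative (averaging Yelp ratings per zip code before merging). The toy frames `insp` and `yelp` and all of their values are made up for illustration; they are not taken from the project data, and the per-zip average is an assumption about what might suit the analysis, not the author's method.

```python
import pandas as pd

# Toy stand-ins for df_insp and df_yelp (hypothetical values, for illustration only).
insp = pd.DataFrame({
    "hsisid": [1, 1, 2],
    "zip": [27610, 27610, 27601],
    "score": [96.0, 94.5, 98.0],
})
yelp = pd.DataFrame({
    "id": ["a", "b", "c"],
    "zip": [27610, 27610, 27601],
    "rating": [3.0, 2.0, 4.5],
})

# Merging directly on zip pairs each inspection with every Yelp listing in that zip.
naive = pd.merge(insp, yelp, on="zip")

# Collapsing Yelp to one mean rating per zip first keeps one output row per inspection.
# validate="many_to_one" makes pandas raise if the right-hand side has duplicate zips.
zip_rating = yelp.groupby("zip", as_index=False)["rating"].mean()
per_zip = pd.merge(insp, zip_rating, on="zip", validate="many_to_one")

print(len(insp), len(naive), len(per_zip))
```

With the per-zip version, each inspection score is compared against a neighborhood-level rating rather than against unrelated individual businesses; matching restaurants by name or address would be closer to the research question but needs fuzzy matching that is out of scope for this sketch.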