├── LEVEL 1 TASK 1 Data Exploration and Preprocessing └── DATAEXPLORATION AND PREPROCESSING .ipynb ├── LEVEL 1 TASK 2 Descriptive Analysis └── DESCRIPTIVE ANALYSIS.ipynb ├── LEVEL 1 TASK 3 Geospatial Analysis └── GEOSPATIAL ANALYSIS.ipynb ├── LEVEL 2 TASK 2 Price Range Analysis └── PRICE RANGE ANALYSIS.ipynb ├── LEVEL 2 TASK 1 Table Booking and Online Delivery └── TABLE BOOKING AND ONLINE DELEIVERY.ipynb ├── LEVEL 2 TASK 3Feature Engineering └── FEATURE ENGINEERING.ipynb ├── LEVEL 3 TASK 1 Predictive Modeling └── PREDICTIVE MODELING.ipynb ├── LEVEL 3 TASK 2 Customer Preference Analysis └── CUSTOMER PREFERANCE ANALYSIS.ipynb ├── LICENSE └── README.md /LEVEL 1 TASK 1 Data Exploration and Preprocessing/DATAEXPLORATION AND PREPROCESSING .ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "#### Data Exploration and Preprocessing " 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "First, we need to load the data using pandas." 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 33, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "import pandas as pd\n", 24 | "import matplotlib.pyplot as plt\n", 25 | "import warnings\n", 26 | "warnings.filterwarnings('ignore')\n", 27 | "import seaborn as sns\n", 28 | "file_path = r'C:\\Users\\Lenovo\\Documents\\GitHub\\TheUltimate pandas bootcamp\\Cognifyz-Data-Mastery-Program\\DATASETS\\Dataset .csv'\n", 29 | "DATASET = pd.read_csv(file_path)" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "Display the first few rows to get a sense of the data" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 9, 42 | "metadata": {}, 43 | "outputs": [ 44 | { 45 | "name": "stdout", 46 | "output_type": "stream", 47 | "text": [ 48 | "Initial preview of the dataset:\n", 49 | " 
Restaurant ID Restaurant Name Country Code City \\\n", 50 | "0 6317637 Le Petit Souffle 162 Makati City \n", 51 | "1 6304287 Izakaya Kikufuji 162 Makati City \n", 52 | "2 6300002 Heat - Edsa Shangri-La 162 Mandaluyong City \n", 53 | "3 6318506 Ooma 162 Mandaluyong City \n", 54 | "4 6314302 Sambo Kojin 162 Mandaluyong City \n", 55 | "\n", 56 | " Address \\\n", 57 | "0 Third Floor, Century City Mall, Kalayaan Avenu... \n", 58 | "1 Little Tokyo, 2277 Chino Roces Avenue, Legaspi... \n", 59 | "2 Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal... \n", 60 | "3 Third Floor, Mega Fashion Hall, SM Megamall, O... \n", 61 | "4 Third Floor, Mega Atrium, SM Megamall, Ortigas... \n", 62 | "\n", 63 | " Locality \\\n", 64 | "0 Century City Mall, Poblacion, Makati City \n", 65 | "1 Little Tokyo, Legaspi Village, Makati City \n", 66 | "2 Edsa Shangri-La, Ortigas, Mandaluyong City \n", 67 | "3 SM Megamall, Ortigas, Mandaluyong City \n", 68 | "4 SM Megamall, Ortigas, Mandaluyong City \n", 69 | "\n", 70 | " Locality Verbose Longitude Latitude \\\n", 71 | "0 Century City Mall, Poblacion, Makati City, Mak... 121.027535 14.565443 \n", 72 | "1 Little Tokyo, Legaspi Village, Makati City, Ma... 121.014101 14.553708 \n", 73 | "2 Edsa Shangri-La, Ortigas, Mandaluyong City, Ma... 121.056831 14.581404 \n", 74 | "3 SM Megamall, Ortigas, Mandaluyong City, Mandal... 121.056475 14.585318 \n", 75 | "4 SM Megamall, Ortigas, Mandaluyong City, Mandal... 121.057508 14.584450 \n", 76 | "\n", 77 | " Cuisines ... Currency Has Table booking \\\n", 78 | "0 French, Japanese, Desserts ... Botswana Pula(P) Yes \n", 79 | "1 Japanese ... Botswana Pula(P) Yes \n", 80 | "2 Seafood, Asian, Filipino, Indian ... Botswana Pula(P) Yes \n", 81 | "3 Japanese, Sushi ... Botswana Pula(P) No \n", 82 | "4 Japanese, Korean ... 
Botswana Pula(P) Yes \n", 83 | "\n", 84 | " Has Online delivery Is delivering now Switch to order menu Price range \\\n", 85 | "0 No No No 3 \n", 86 | "1 No No No 3 \n", 87 | "2 No No No 4 \n", 88 | "3 No No No 4 \n", 89 | "4 No No No 4 \n", 90 | "\n", 91 | " Aggregate rating Rating color Rating text Votes \n", 92 | "0 4.8 Dark Green Excellent 314 \n", 93 | "1 4.5 Dark Green Excellent 591 \n", 94 | "2 4.4 Green Very Good 270 \n", 95 | "3 4.9 Dark Green Excellent 365 \n", 96 | "4 4.8 Dark Green Excellent 229 \n", 97 | "\n", 98 | "[5 rows x 21 columns]\n" 99 | ] 100 | } 101 | ], 102 | "source": [ 103 | "print(\"Initial preview of the dataset:\")\n", 104 | "print(DATASET.head())" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "We use `pd.read_csv()` to read the data and `print(DATASET.head())` to check the first five rows, so we know what kind of data we're dealing with." 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "### **Understanding the Structure of the Data**\n", 119 | "\n", 120 | "Now, let’s see how many rows and columns we have, and what the data types are." 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 13, 126 | "metadata": {}, 127 | "outputs": [ 128 | { 129 | "name": "stdout", 130 | "output_type": "stream", 131 | "text": [ 132 | "Summary of the dataset:\n", 133 | "The dataset contains 9551 rows and 21 columns.\n" 134 | ] 135 | } 136 | ], 137 | "source": [ 138 | "print(\"Summary of the dataset:\")\n", 139 | "print(f'The dataset contains {DATASET.shape[0]} rows and {DATASET.shape[1]} columns.')" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 14, 145 | "metadata": {}, 146 | "outputs": [ 147 | { 148 | "name": "stdout", 149 | "output_type": "stream", 150 | "text": [ 151 | "Information about the dataset:\n", 152 | "\n", 153 | "RangeIndex: 9551 entries, 0 to 9550\n", 154 | "Data columns (total 21 
columns):\n", 155 | " # Column Non-Null Count Dtype \n", 156 | "--- ------ -------------- ----- \n", 157 | " 0 Restaurant ID 9551 non-null int64 \n", 158 | " 1 Restaurant Name 9551 non-null object \n", 159 | " 2 Country Code 9551 non-null int64 \n", 160 | " 3 City 9551 non-null object \n", 161 | " 4 Address 9551 non-null object \n", 162 | " 5 Locality 9551 non-null object \n", 163 | " 6 Locality Verbose 9551 non-null object \n", 164 | " 7 Longitude 9551 non-null float64\n", 165 | " 8 Latitude 9551 non-null float64\n", 166 | " 9 Cuisines 9542 non-null object \n", 167 | " 10 Average Cost for two 9551 non-null int64 \n", 168 | " 11 Currency 9551 non-null object \n", 169 | " 12 Has Table booking 9551 non-null object \n", 170 | " 13 Has Online delivery 9551 non-null object \n", 171 | " 14 Is delivering now 9551 non-null object \n", 172 | " 15 Switch to order menu 9551 non-null object \n", 173 | " 16 Price range 9551 non-null int64 \n", 174 | " 17 Aggregate rating 9551 non-null float64\n", 175 | " 18 Rating color 9551 non-null object \n", 176 | " 19 Rating text 9551 non-null object \n", 177 | " 20 Votes 9551 non-null int64 \n", 178 | "dtypes: float64(3), int64(5), object(13)\n", 179 | "memory usage: 1.5+ MB\n" 181 | ] 182 | } 183 | ], 184 | "source": [ 185 | "print(\"Information about the dataset:\")\n", 186 | "DATASET.info()" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "`DATASET.shape` tells us the size, and `DATASET.info()` prints a summary, including data types and non-null counts. Because `info()` prints directly and returns `None`, it is called without `print()`." 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "We now have a solid overview of the dataset, which contains 9551 rows and 21 columns. 
The columns include information about restaurants, such as `Restaurant ID, Restaurant Name, Country Code, City, Cuisines, Average Cost for two, and more.` Let’s continue with our `Level 1 Task 1: Data Exploration and Preprocessing` by addressing missing values and preparing the data." 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "### **Checking for Missing Values**\n", 208 | "Missing data can affect analysis and model performance, so let’s identify and handle them" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": 15, 214 | "metadata": {}, 215 | "outputs": [ 216 | { 217 | "name": "stdout", 218 | "output_type": "stream", 219 | "text": [ 220 | "Count of missing values in each column\n", 221 | "Restaurant ID 0\n", 222 | "Restaurant Name 0\n", 223 | "Country Code 0\n", 224 | "City 0\n", 225 | "Address 0\n", 226 | "Locality 0\n", 227 | "Locality Verbose 0\n", 228 | "Longitude 0\n", 229 | "Latitude 0\n", 230 | "Cuisines 9\n", 231 | "Average Cost for two 0\n", 232 | "Currency 0\n", 233 | "Has Table booking 0\n", 234 | "Has Online delivery 0\n", 235 | "Is delivering now 0\n", 236 | "Switch to order menu 0\n", 237 | "Price range 0\n", 238 | "Aggregate rating 0\n", 239 | "Rating color 0\n", 240 | "Rating text 0\n", 241 | "Votes 0\n", 242 | "dtype: int64\n" 243 | ] 244 | } 245 | ], 246 | "source": [ 247 | "print(\"Count of missing values in each column\")\n", 248 | "print(DATASET.isnull().sum())" 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "metadata": {}, 254 | "source": [ 255 | "**Handling Missing Values**:\n", 256 | "\n", 257 | "- **Numerical Columns**: Fill with the mean or median.\n", 258 | "- **Categorical Columns**: Fill with the mode." 
 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "metadata": {}, 264 | "source": [ 265 | "#### Listing only the columns with missing values" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": 18, 271 | "metadata": {}, 272 | "outputs": [ 273 | { 274 | "name": "stdout", 275 | "output_type": "stream", 276 | "text": [ 277 | "Count of missing values in each column\n", 278 | "Cuisines 9\n", 279 | "dtype: int64\n" 280 | ] 281 | } 282 | ], 283 | "source": [ 284 | "missing_values = DATASET.isnull().sum()\n", 285 | "print(\"Count of missing values in each column\")\n", 286 | "print(missing_values[missing_values > 0])" 287 | ] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "metadata": {}, 292 | "source": [ 293 | "This shows only the columns that have missing values, and how many values are missing in each." 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "**Cuisines**: Since this column has `9` missing values, we can fill them with a placeholder like \"Unknown\" or the mode (most common value):" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": 21, 306 | "metadata": {}, 307 | "outputs": [], 308 | "source": [ 309 | "DATASET['Cuisines'] = DATASET['Cuisines'].fillna('Unknown')" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "metadata": {}, 315 | "source": [ 316 | "### **Data Type Conversion**\n", 317 | "\n", 318 | "It's important to check that the data types of the columns are appropriate. For instance, `Country Code` can be an integer, but `Has Table booking` may be better represented as a boolean." 
319 | ] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "metadata": {}, 324 | "source": [ 325 | "#### Convert `Has Table booking` and `Has Online delivery` to boolean" 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": 26, 331 | "metadata": {}, 332 | "outputs": [], 333 | "source": [ 334 | "DATASET['Has Table booking'] = DATASET['Has Table booking'].apply(lambda x: True if x == 'Yes' else False)\n", 335 | "DATASET['Has Online delivery'] = DATASET['Has Online delivery'].apply(lambda x: True if x == 'Yes' else False)" 336 | ] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "metadata": {}, 341 | "source": [ 342 | "#### Verify data types" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 27, 348 | "metadata": {}, 349 | "outputs": [ 350 | { 351 | "name": "stdout", 352 | "output_type": "stream", 353 | "text": [ 354 | "Updated data types:\n", 355 | "Restaurant ID int64\n", 356 | "Restaurant Name object\n", 357 | "Country Code int64\n", 358 | "City object\n", 359 | "Address object\n", 360 | "Locality object\n", 361 | "Locality Verbose object\n", 362 | "Longitude float64\n", 363 | "Latitude float64\n", 364 | "Cuisines object\n", 365 | "Average Cost for two int64\n", 366 | "Currency object\n", 367 | "Has Table booking bool\n", 368 | "Has Online delivery bool\n", 369 | "Is delivering now object\n", 370 | "Switch to order menu object\n", 371 | "Price range int64\n", 372 | "Aggregate rating float64\n", 373 | "Rating color object\n", 374 | "Rating text object\n", 375 | "Votes int64\n", 376 | "dtype: object\n" 377 | ] 378 | } 379 | ], 380 | "source": [ 381 | "print(\"Updated data types:\")\n", 382 | "print(DATASET.dtypes)" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "### **Distribution of the Target Variable**\n", 390 | "\n", 391 | "Now, let's check the distribution of the target variable (`Aggregate rating`):" 392 | ] 393 | }, 394 | { 395 | "cell_type": 
"markdown", 396 | "metadata": {}, 397 | "source": [ 398 | "Plot the distribution of the `Aggregate rating`" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": 34, 404 | "metadata": {}, 405 | "outputs": [ 406 | { 407 | "data": { 408 | "image/png": "", 409 | "text/plain": [ 410 | "
" 411 | ] 412 | }, 413 | "metadata": {}, 414 | "output_type": "display_data" 415 | } 416 | ], 417 | "source": [ 418 | "plt.figure(figsize=(10, 6))\n", 419 | "sns.histplot(DATASET['Aggregate rating'], bins=20, kde=True, color='green')\n", 420 | "plt.title('Distribution of Aggregate Rating')\n", 421 | "plt.xlabel('Aggregate Rating')\n", 422 | "plt.ylabel('Frequency')\n", 423 | "plt.grid(True)\n", 424 | "plt.show()" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "This histogram with a kernel density estimate (KDE) gives a smooth curve showing the distribution of ratings." 432 | ] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "metadata": {}, 437 | "source": [ 438 | "### **Class Imbalance Check**\n", 439 | "\n", 440 | "Finally, we are going to check there is a class imbalance in the `Aggregate rating`:" 441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": 32, 446 | "metadata": {}, 447 | "outputs": [ 448 | { 449 | "name": "stdout", 450 | "output_type": "stream", 451 | "text": [ 452 | "Class distribution for `Aggregate rating`:\n", 453 | "Aggregate rating\n", 454 | "0.0 2148\n", 455 | "3.2 522\n", 456 | "3.1 519\n", 457 | "3.4 498\n", 458 | "3.3 483\n", 459 | "3.5 480\n", 460 | "3.0 468\n", 461 | "3.6 458\n", 462 | "3.7 427\n", 463 | "3.8 400\n", 464 | "2.9 381\n", 465 | "3.9 335\n", 466 | "2.8 315\n", 467 | "4.1 274\n", 468 | "4.0 266\n", 469 | "2.7 250\n", 470 | "4.2 221\n", 471 | "2.6 191\n", 472 | "4.3 174\n", 473 | "4.4 144\n", 474 | "2.5 110\n", 475 | "4.5 95\n", 476 | "2.4 87\n", 477 | "4.6 78\n", 478 | "4.9 61\n", 479 | "2.3 47\n", 480 | "4.7 42\n", 481 | "2.2 27\n", 482 | "4.8 25\n", 483 | "2.1 15\n", 484 | "2.0 7\n", 485 | "1.9 2\n", 486 | "1.8 1\n", 487 | "Name: count, dtype: int64\n" 488 | ] 489 | } 490 | ], 491 | "source": [ 492 | "print(\"Class distribution for `Aggregate rating`:\")\n", 493 | "print(DATASET['Aggregate rating'].value_counts())" 494 | ] 495 | } 496 | ], 497 
| "metadata": { 498 | "kernelspec": { 499 | "display_name": "Python 3", 500 | "language": "python", 501 | "name": "python3" 502 | }, 503 | "language_info": { 504 | "codemirror_mode": { 505 | "name": "ipython", 506 | "version": 3 507 | }, 508 | "file_extension": ".py", 509 | "mimetype": "text/x-python", 510 | "name": "python", 511 | "nbconvert_exporter": "python", 512 | "pygments_lexer": "ipython3", 513 | "version": "3.11.9" 514 | } 515 | }, 516 | "nbformat": 4, 517 | "nbformat_minor": 2 518 | } 519 | -------------------------------------------------------------------------------- /LEVEL 2 TASK 2 Price Range Analysis/PRICE RANGE ANALYSIS.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "` Price Range for the restaurants, focusing on the most common price range, average rating for each range, and identifying the color that represents the highest average rating.`" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### **Price Range Analysis**\n", 15 | "\n", 16 | "1. **Determine the Most Common Price Range**\n", 17 | "2. **Calculate the Average Rating for Each Price Range**\n", 18 | "3. 
**Identify the Color Representing the Highest Average Rating Among Different Price Ranges**" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "import pandas as pd\n", 28 | "import matplotlib.pyplot as plt\n", 29 | "import warnings\n", 30 | "warnings.filterwarnings('ignore')\n", 31 | "import seaborn as sns\n", 32 | "file_path = r'C:\\Users\\abhis\\Documents\\GitHub\\walmart sales forecasting\\Cognifyz-Data-Mastery-Program\\DATASETS\\Dataset .csv'\n", 33 | "DATASET = pd.read_csv(file_path)" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "> Create price range categories based on 'Average Cost for two'" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 2, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "DATASET['Price Range Category'] = pd.cut(DATASET['Average Cost for two'], bins=[0, 500, 1000, 1500, 5000], labels=['Low', 'Medium', 'High', 'Very High'])" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "> Count the frequency of each price range" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 3, 62 | "metadata": {}, 63 | "outputs": [ 64 | { 65 | "name": "stdout", 66 | "output_type": "stream", 67 | "text": [ 68 | "Most common price range:\n", 69 | "Price Range Category\n", 70 | "Low 6056\n", 71 | "Medium 2302\n", 72 | "High 591\n", 73 | "Very High 552\n", 74 | "Name: count, dtype: int64\n" 75 | ] 76 | } 77 | ], 78 | "source": [ 79 | "most_common_price_range = DATASET['Price Range Category'].value_counts()\n", 80 | "print(\"Most common price range:\")\n", 81 | "print(most_common_price_range)" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 4, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "data": { 91 | "image/png": "", 92 | "text/plain": [ 93 | "
" 94 | ] 95 | }, 96 | "metadata": {}, 97 | "output_type": "display_data" 98 | } 99 | ], 100 | "source": [ 101 | "plt.figure(figsize=(10, 5))\n", 102 | "sns.barplot(x=most_common_price_range.index, y=most_common_price_range.values, palette='Set2')\n", 103 | "plt.title('Most Common Price Range')\n", 104 | "plt.xlabel('Price Range Category')\n", 105 | "plt.ylabel('Number of Restaurants')\n", 106 | "plt.show()" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "- The `pd.cut()` function is used to create price range categories.\n", 114 | "- `value_counts()` calculates the frequency of each price range, showing which one is most common." 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | " Calculate average rating for each price range category" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 6, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | "name": "stdout", 131 | "output_type": "stream", 132 | "text": [ 133 | "Average rating for each price range:\n", 134 | "Price Range Category\n", 135 | "Low 2.32\n", 136 | "Medium 3.06\n", 137 | "High 3.64\n", 138 | "Very High 3.67\n", 139 | "Name: Aggregate rating, dtype: float64\n" 140 | ] 141 | } 142 | ], 143 | "source": [ 144 | "avg_rating_per_price_range = DATASET.groupby('Price Range Category')['Aggregate rating'].mean().round(2)\n", 145 | "\n", 146 | "print(\"Average rating for each price range:\")\n", 147 | "print(avg_rating_per_price_range)" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 7, 153 | "metadata": {}, 154 | "outputs": [ 155 | { 156 | "data": { 157 | "image/png": "", 158 | "text/plain": [ 159 | "
" 160 | ] 161 | }, 162 | "metadata": {}, 163 | "output_type": "display_data" 164 | } 165 | ], 166 | "source": [ 167 | "plt.figure(figsize=(10, 5))\n", 168 | "sns.barplot(x=avg_rating_per_price_range.index, y=avg_rating_per_price_range.values, palette='Blues')\n", 169 | "plt.title('Average Rating for Each Price Range')\n", 170 | "plt.xlabel('Price Range Category')\n", 171 | "plt.ylabel('Average Rating')\n", 172 | "plt.ylim(0, 5) # Setting limit for ratings (typically from 0 to 5)\n", 173 | "plt.show()" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": {}, 179 | "source": [ 180 | "- The `groupby()` method groups the data by `Price Range Category` and calculates the mean `Aggregate rating`.\n", 181 | "- The bar chart helps visualize the average ratings across price ranges." 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": {}, 187 | "source": [ 188 | "> Find the color associated with the highest average rating in each price range" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 8, 194 | "metadata": {}, 195 | "outputs": [], 196 | "source": [ 197 | "highest_avg_color_per_price_range = DATASET.groupby('Price Range Category')['Rating color'].agg(lambda x: x.value_counts().idxmax())" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 9, 203 | "metadata": {}, 204 | "outputs": [ 205 | { 206 | "name": "stdout", 207 | "output_type": "stream", 208 | "text": [ 209 | "Color representing the highest average rating for each price range:\n", 210 | "Price Range Category\n", 211 | "Low Orange\n", 212 | "Medium Orange\n", 213 | "High Yellow\n", 214 | "Very High Yellow\n", 215 | "Name: Rating color, dtype: object\n" 216 | ] 217 | } 218 | ], 219 | "source": [ 220 | "print(\"Color representing the highest average rating for each price range:\")\n", 221 | "print(highest_avg_color_per_price_range)" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": 10, 227 | 
"metadata": {}, 228 | "outputs": [ 229 | { 230 | "data": { 231 | "image/png": "", 232 | "text/plain": [ 233 | "
" 234 | ] 235 | }, 236 | "metadata": {}, 237 | "output_type": "display_data" 238 | } 239 | ], 240 | "source": [ 241 | "plt.figure(figsize=(10, 5))\n", 242 | "sns.barplot(x=highest_avg_color_per_price_range.index, y=avg_rating_per_price_range.values, palette=highest_avg_color_per_price_range.values)\n", 243 | "plt.title('Color Representing Highest Average Rating by Price Range')\n", 244 | "plt.xlabel('Price Range Category')\n", 245 | "plt.ylabel('Average Rating')\n", 246 | "plt.show()" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "- `groupby()` is used with `agg()` to find the most common `Rating color` for each `Price Range Category`.\n", 254 | "- The bar chart uses colors representing the highest average rating for better visualization." 255 | ] 256 | } 257 | ], 258 | "metadata": { 259 | "kernelspec": { 260 | "display_name": "Python 3", 261 | "language": "python", 262 | "name": "python3" 263 | }, 264 | "language_info": { 265 | "codemirror_mode": { 266 | "name": "ipython", 267 | "version": 3 268 | }, 269 | "file_extension": ".py", 270 | "mimetype": "text/x-python", 271 | "name": "python", 272 | "nbconvert_exporter": "python", 273 | "pygments_lexer": "ipython3", 274 | "version": "3.12.5" 275 | } 276 | }, 277 | "nbformat": 4, 278 | "nbformat_minor": 2 279 | } 280 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 M. 
Dinesh 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ![image](https://github.com/user-attachments/assets/c2b2da51-f1d1-4d22-b3d9-c9d47b3110e7) 2 | 3 | 4 | ``` 5 | 📦 Data Science Internship Project 6 | │ 7 | ├── 📄 LICENSE 8 | ├── 📄 README.md 9 | │ 10 | ├── 📁 LEVEL 1 - Data Exploration and Preprocessing 11 | │ ├── 📄 DATAEXPLORATION AND PREPROCESSING.ipynb 12 | │ ├── 📄 DESCRIPTIVE ANALYSIS.ipynb 13 | │ ├── 📄 GEOSPATIAL ANALYSIS.ipynb 14 | │ └── 🌐 restaurant_map.html 15 | │ 16 | ├── 📁 LEVEL 2 - Advanced Analysis 17 | │ ├── 📄 TABLE BOOKING AND ONLINE DELEIVERY.ipynb 18 | │ ├── 📄 PRICE RANGE ANALYSIS.ipynb 19 | │ └── 📄 FEATURE ENGINEERING.ipynb 20 | │ 21 | ├── 📁 LEVEL 3 - Modeling and Visualization 22 | │ ├── 📄 PREDICTIVE MODELING.ipynb 23 | │ ├── 📄 CUSTOMER PREFERANCE ANALYSIS.ipynb 24 | │ ├── 📄 Data Visualization.ipynb 25 | │ ├── 📊 average_rating_by_cuisine.png 26 | │ ├── 📊 boxplot_ratings_by_cuisine.png 27 | │ ├── 📊 correlation_heatmap.png 28 | │ ├── 📊 jointplot_votes_rating.png 29 | │ ├── 📊 pair_plot.png 30 | │ ├── 📊 pairplot_votes_rating.png 31 | │ ├── 📊 rating_distribution_boxplot.png 32 | │ ├── 📊 rating_distribution_histogram.png 33 | │ ├── 📊 swarmplot_ratings_cuisines.png 34 | │ ├── 📊 top_cuisines_avg_rating.png 35 | │ ├── 📊 violinplot_votes_by_rating.png 36 | │ ├── 📊 votes_vs_aggregate_rating.png 37 | │ └── 🌐 bubble_chart_votes_rating.html 38 | │ 39 | └── 📁 DATASETS 40 | └── (Dataset files) 41 | 42 | ``` 43 | 44 | 45 | ## 📚 Libraries Used 46 | 47 | - **Folium** 🗺️: For creating interactive maps to visualize restaurant locations. 48 | - **Pandas** 🐼: For data manipulation and processing. 49 | - **Matplotlib** 📊: For plotting static graphs to visualize restaurant distributions. 50 | - **Seaborn** 🎨: For enhanced visualizations and creating scatter plots. 
 51 | - **Scikit-learn** 🤖: For applying machine learning algorithms like KMeans clustering to group restaurant locations. 52 | 53 | 🚀 **Workflow Overview:** 54 | 55 | ### **1. Data Loading and Preprocessing** 🧹 56 | 57 | We begin by loading the data. The dataset lists cuisines along with their aggregate ratings and vote counts. We then fill missing values and prepare the data for further analysis. 58 | 59 | ```python 60 | import pandas as pd 61 | import numpy as np 62 | 63 | # Load the dataset 64 | cuisine_data = pd.read_csv("Cuisine_Rating_Votes.csv") 65 | 66 | # Forward-fill missing values (fillna(method='ffill') is deprecated in recent pandas) 67 | cuisine_data = cuisine_data.ffill() 68 | 69 | # Summary of the dataset 70 | cuisine_data.info() 71 | ``` 72 | 73 | - **Missing Values Handling**: The `ffill()` method forward-fills any missing values with the last valid value above them. 74 | - **Dataset Overview**: We get a basic overview of the dataset with `info()` to understand its structure. 75 | 76 | --- 77 | 78 | ### **2. Exploratory Data Analysis (EDA)** 🔍 79 | 80 | #### **Cuisines with Consistent Ratings** 💯 81 | 82 | Next, we identify the cuisines that have consistent ratings by calculating the standard deviation of the aggregate ratings. 83 | 84 | ```python 85 | # Calculate standard deviation of ratings for each cuisine 86 | rating_std = cuisine_data.groupby('Cuisines')['Aggregate rating'].std() 87 | 88 | # Cuisines with lowest standard deviation (consistent ratings) 89 | consistent_cuisines = rating_std.sort_values().head(10) 90 | ``` 91 | 92 | - **Consistent Ratings**: Cuisines like `Italian`, `Hawaiian`, and `American` are identified as having the most consistent ratings, with low standard deviation. 93 | 94 | #### **Top Cuisines by Average Rating** 🌟 95 | 96 | We then calculate the average rating for each cuisine to find out which ones have the highest average rating. 
97 | 98 | ```python 99 | # Calculate the average rating by cuisine 100 | avg_rating_by_cuisine = cuisine_data.groupby('Cuisines')['Aggregate rating'].mean() 101 | 102 | # Top 10 cuisines with highest average ratings 103 | top_cuisines = avg_rating_by_cuisine.sort_values(ascending=False).head(10) 104 | ``` 105 | 106 | - **Top Cuisines**: This code highlights the cuisines with the highest average ratings, such as `Italian`, `Hawaiian`, and `American`. 107 | 108 | #### **Cuisines Rated by the Most People** 👥 109 | 110 | We now identify which cuisines have received the most votes, as more votes usually indicate more popularity. 111 | 112 | ```python 113 | # Sum the votes for each cuisine 114 | ratings_count = cuisine_data.groupby('Cuisines')['Votes'].sum() 115 | 116 | # Top 10 cuisines rated by the most people 117 | top_cuisines_by_votes = ratings_count.sort_values(ascending=False).head(10) 118 | ``` 119 | 120 | - **Most Rated Cuisines**: The most rated cuisines are those with the highest total number of votes, such as `American` and `Italian`. 121 | 122 | --- 123 | 124 | ### **3. Data Visualization** 📊 125 | 126 | #### **Distribution of Aggregate Ratings** 📉 127 | 128 | We visualize the distribution of ratings using a histogram to see the overall spread. 129 | 130 | ```python 131 | import seaborn as sns 132 | import matplotlib.pyplot as plt 133 | 134 | # Histogram for Aggregate Ratings 135 | sns.histplot(cuisine_data['Aggregate rating'], kde=True) 136 | plt.title('Distribution of Aggregate Ratings') 137 | plt.xlabel('Rating') 138 | plt.ylabel('Frequency') 139 | plt.show() 140 | ``` 141 | 142 | - **Histogram**: The histogram shows the distribution of ratings across all cuisines, with a clear concentration of ratings between 4 and 5. 143 | 144 | #### **Votes vs. Aggregate Rating** 📈 145 | 146 | We use a scatter plot to visualize how the number of votes relates to the aggregate ratings. 
147 | 148 | ```python 149 | sns.scatterplot(x=cuisine_data['Votes'], y=cuisine_data['Aggregate rating']) 150 | plt.title('Votes vs. Aggregate Rating') 151 | plt.xlabel('Number of Votes') 152 | plt.ylabel('Aggregate Rating') 153 | plt.show() 154 | ``` 155 | 156 | - **Scatter Plot**: The plot shows that as the number of votes increases, the aggregate rating generally increases, with some outliers. 157 | 158 | #### **Cuisines with the Most Consistent Ratings** 📏 159 | 160 | We create a bar plot to display the cuisines with the most consistent ratings. 161 | 162 | ```python 163 | sns.barplot(x=consistent_cuisines.index, y=consistent_cuisines.values) 164 | plt.title('Cuisines with Most Consistent Ratings') 165 | plt.xlabel('Cuisine') 166 | plt.ylabel('Standard Deviation of Ratings') 167 | plt.xticks(rotation=90) 168 | plt.show() 169 | ``` 170 | 171 | - **Bar Plot**: The plot highlights the top cuisines with the lowest standard deviations in their ratings, indicating consistency. 172 | 173 | --- 174 | 175 | ### **4. Clustering Cuisines** 🤖 176 | 177 | We apply KMeans clustering to group cuisines based on their `Votes` and `Aggregate rating` values. This allows us to find patterns in how cuisines are rated and voted upon. 
 178 | 179 | ```python 180 | from sklearn.preprocessing import StandardScaler 181 | from sklearn.cluster import KMeans 182 | 183 | # Select relevant features for clustering 184 | X = cuisine_data[['Votes', 'Aggregate rating']] 185 | 186 | # Standardize the features (zero mean, unit variance) 187 | scaler = StandardScaler() 188 | X_scaled = scaler.fit_transform(X) 189 | 190 | # Apply KMeans clustering 191 | kmeans = KMeans(n_clusters=3, random_state=0) 192 | cuisine_data['Cluster'] = kmeans.fit_predict(X_scaled) 193 | 194 | # Visualizing the clusters 195 | sns.scatterplot(x=X_scaled[:, 0], y=X_scaled[:, 1], hue=cuisine_data['Cluster'], palette='viridis') 196 | plt.title('Clustering of Cuisines Based on Votes and Ratings') 197 | plt.xlabel('Normalized Votes') 198 | plt.ylabel('Normalized Aggregate Rating') 199 | plt.show() 200 | ``` 201 | 202 | - **Clustering**: We use KMeans clustering to categorize cuisines into three groups based on their vote count and rating. 203 | - **Visualization**: The scatter plot visualizes how different cuisines are clustered based on these features. 204 | 205 | --- 206 | 207 | ### **5. Insights and Summary** 💡 208 | 209 | From the analysis, we gain the following insights: 210 | 211 | - **Top-Rated Cuisines**: `Italian`, `American`, and `Hawaiian` are among the top-rated cuisines. 212 | - **Consistency**: Cuisines with low standard deviation in ratings, like `Italian`, `American`, and `Mexican`, are highly consistent in their ratings. 213 | - **Popularity**: Cuisines with the most votes are generally those that have more global recognition, such as `Italian` and `American`. 214 | - **Cluster Groupings**: Clustering based on `Votes` and `Aggregate rating` reveals that cuisines like `Italian` and `Mexican` form their own clusters based on higher ratings and votes. 215 | 216 | --- 217 | 218 | ## 🛠️ **Libraries Used** 219 | 220 | - **Pandas**: For data manipulation and analysis. 221 | - **Matplotlib & Seaborn**: For static visualizations. 
- **Scikit-learn**: For clustering techniques.
- **NumPy**: For numerical operations.
- **Plotly**: For creating interactive visualizations.

This project provides an in-depth analysis of cuisine ratings and votes and of how they correlate with each other. By using clustering techniques, we uncover hidden patterns and gain insights into which cuisines are consistently rated highly and which ones are most popular based on votes. The combination of data cleaning, EDA, and clustering makes this analysis a comprehensive exploration of the cuisine ratings dataset.

---

### 📊 **Votes vs Aggregate Rating Dashboard** 🚀

The following visualization shows the relationship between the **number of votes** and the **Aggregate Rating** for each cuisine. It helps us understand how higher ratings correlate with more votes, providing insights into the popularity and consistency of cuisines.
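
To put a number on the relationship described above, one option is a correlation coefficient. The sketch below is illustrative only: the `sample` DataFrame holds made-up values rather than rows from the dataset, and only the column names `Votes` and `Aggregate rating` are taken from the analysis. It computes Pearson's coefficient and a rank-based (Spearman-style) coefficient, which is often the more robust choice when vote counts are heavily skewed.

```python
import pandas as pd

# Made-up stand-in data (assumption); swap in the real cuisine_data DataFrame.
sample = pd.DataFrame({
    "Votes": [12, 45, 150, 320, 900, 2500],
    "Aggregate rating": [2.9, 3.1, 3.6, 3.8, 4.2, 4.5],
})

# Pearson measures the strength of the linear association.
pearson = sample["Votes"].corr(sample["Aggregate rating"])

# Spearman is Pearson applied to ranks, so it captures any monotonic trend
# and is less affected by a few cuisines with very large vote counts.
spearman = sample["Votes"].rank().corr(sample["Aggregate rating"].rank())

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

A large gap between the two values on the real data would suggest that the trend is monotonic but not linear, which a log-scaled votes axis may show more clearly.
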

#### **Visualization** 🖼️

The plot below shows the relationship between `Votes` and `Aggregate rating` for each cuisine:

![Votes vs Aggregate Rating](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/Capture.PNG?raw=true)

## 📊 **Data Visualization Gallery** 🚀

### Visualizations for a better understanding of the data:

| ![Average Rating by Cuisine](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/average_rating_by_cuisine.png?raw=true) | ![Boxplot Ratings by Cuisine](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/boxplot_ratings_by_cuisine.png?raw=true) |
| --- | --- |
| **Average Rating by Cuisine** | **Boxplot Ratings by Cuisine** |

| ![Correlation Heatmap](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/correlation_heatmap.png?raw=true) | ![Jointplot Votes vs Rating](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/jointplot_votes_rating.png?raw=true) |
| --- | --- |
| **Correlation Heatmap** | **Jointplot Votes vs Rating** |

| ![Pair Plot](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/pair_plot.png?raw=true) | ![Pairplot Votes vs Rating](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/pairplot_votes_rating.png?raw=true) |
| --- | --- |
| **Pair Plot** | **Pairplot Votes vs Rating** |

| ![Rating Distribution Boxplot](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/rating_distribution_boxplot.png?raw=true) | ![Rating Distribution Histogram](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/rating_distribution_histogram.png?raw=true) |
| --- | --- |
| **Rating Distribution Boxplot** | **Rating Distribution Histogram** |

| ![Swarmplot Ratings by Cuisines](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/swarmplot_ratings_cuisines.png?raw=true) | ![Top Cuisines Average Rating](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/top_cuisines_avg_rating.png?raw=true) |
| --- | --- |
| **Swarmplot Ratings by Cuisines** | **Top Cuisines Average Rating** |

| ![Violinplot Votes by Rating](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/violinplot_votes_by_rating.png?raw=true) | ![Votes vs Aggregate Rating](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/votes_vs_aggregate_rating.png?raw=true) |
| --- | --- |
| **Violinplot Votes by Rating** | **Votes vs Aggregate Rating** |

| ![Votes vs Rating Scatter](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/votes_vs_rating_scatter.png?raw=true) | |
| --- | --- |
| **Votes vs Rating Scatter** | |

--------------------------------------------------------------------------------