├── .gitignore ├── readme.md ├── eda └── eda.ipynb ├── ML └── ML.ipynb └── data_wrangling └── data_wrangling.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | # ignore all .DS_Store files 2 | .DS_Store 3 | # Ignore the datasets folder in your-project directory 4 | /Users/maximilianolopezsalgado/data_projects/your-project/datasets 5 | ======= 6 | # Ignore zip, db and csv files 7 | *.csv 8 | *.zip 9 | *.db 10 | -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | # Project Title 2 | 3 | ## Introduction 4 | 5 | In this project, we perform an end-to-end analysis of the "xxxxx" dataset. The project consists of three main phases: Exploratory Data Analysis (EDA), Data Wrangling, and Machine Learning. We aim to gain insights into the data, preprocess it for modeling, and train a machine learning model to make predictions. 6 | 7 | ## Phase 1: Exploratory Data Analysis (EDA) 8 | 9 | ### Importing Libraries 10 | 11 | - Import the necessary libraries: 12 | ```python 13 | - pandas 14 | - numpy 15 | - matplotlib.pyplot 16 | - seaborn 17 | ``` 18 | ### Loading Data 19 | 20 | - Read the "data.csv" file into a DataFrame named "data". 21 | - Display the head, information, and summary statistics of the "data" DataFrame. 22 | 23 | ### Data Exploration 24 | 25 | - Explore the data using descriptive statistics, data visualization, and correlation analysis. 26 | - Identify key features, patterns, and relationships between variables. 27 | - Generate visualizations, such as histograms, box plots, scatter plots, and heatmaps. 28 | 29 | ## Phase 2: Data Wrangling 30 | 31 | ### Data Cleaning 32 | 33 | - Handle missing values by either imputing or removing them. 34 | - Remove duplicates from the dataset, if any. 35 | 36 | ### Feature Engineering 37 | 38 | - Create new features based on existing ones to enhance predictive power. 39 | - Perform transformations, such as scaling or normalization, on numerical features. 40 | - Encode categorical variables using appropriate techniques, such as one-hot encoding or label encoding. 41 | 42 | ### Data Splitting 43 | 44 | - Split the data into training and testing sets using a suitable ratio (e.g., 80:20 or 70:30). 45 | 46 | ## Phase 3: Machine Learning 47 | 48 | ### Model Selection 49 | 50 | - Choose a machine learning model suitable for the problem at hand (e.g., regression, classification, or clustering). 51 | - Consider various models, such as linear regression, random forest, support vector machines, or neural networks. 52 | 53 | ### Model Training 54 | 55 | - Train the selected model using the training data. 56 | 57 | ### Model Evaluation 58 | 59 | - Evaluate the trained model using appropriate evaluation metrics, such as accuracy, precision, recall, F1 score, mean squared error (MSE), or root mean squared error (RMSE). 60 | - Adjust hyperparameters and compare multiple models if necessary. 61 | 62 | ### Model Prediction 63 | 64 | - Make predictions using the trained model on the testing data. 65 | - Analyze and interpret the predictions. 66 | 67 | ### Model Performance Analysis 68 | 69 | - Assess the performance of the model based on the evaluation metrics. 70 | - Analyze any limitations or shortcomings of the model. 71 | 72 | ### Conclusion 73 | 74 | - Summarize the findings and insights from the project. 75 | - Discuss the implications and potential applications of the results. 
76 | - Reflect on the limitations of the analysis and suggest future improvements or areas of exploration. 77 | 78 | ## References 79 | 80 | - List any references or resources used during the project. 81 | 82 | ## Colaborators 83 | - Your Name 84 | 85 | --- 86 | 87 | Note: The above template serves as a general guide for conducting an end-to-end analysis project. The specific steps and techniques may vary depending on the dataset and problem domain. Feel free to adapt and customize the template as per your requirements. 88 | -------------------------------------------------------------------------------- /eda/eda.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "attachments": {}, 5 | "cell_type": "markdown", 6 | "metadata": {}, 7 | "source": [ 8 | "- Author: Your Name\n", 9 | "- First Commit: yyyy-mm-dd #folowing ISO 8601 Format\n", 10 | "- Last Commit: yyyy-mm-dd #folowing ISO 8601 Format\n", 11 | "- Description: This notebook is used to perform EDA on the \"xxxxx\" dataset" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "# Import libraries\n", 21 | "import pandas as pd\n", 22 | "import numpy as np\n", 23 | "import matplotlib.pyplot as plt\n", 24 | "import seaborn as sns\n", 25 | "from scipy import stats\n", 26 | "import folium\n", 27 | "from folium import plugins" 28 | ] 29 | }, 30 | { 31 | "attachments": {}, 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "# Exploratory Data Analysis (EDA)" 36 | ] 37 | }, 38 | { 39 | "attachments": {}, 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## 1. Understanding the data" 44 | ] 45 | }, 46 | { 47 | "attachments": {}, 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "### 1.1 Gathering data" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "# Import csv cleaned files \n", 61 | "df = pd.read_csv('../datasets/cleaned_df.csv')" 62 | ] 63 | }, 64 | { 65 | "attachments": {}, 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "### 1.2 Assesing data" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "# Take a look of the data´s shape\n", 79 | "display(df.shape)" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "# Take a look of the data´s info\n", 89 | "display(df.info())" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [ 98 | "# Use describe method to get descriptive statistics\n", 99 | "display(df.describe())" 100 | ] 101 | }, 102 | { 103 | "attachments": {}, 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "## 2. 
Extracting and Plotting the data" 108 | ] 109 | }, 110 | { 111 | "attachments": {}, 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "From the dataframes we have, here are some potential information we can extract:\n", 116 | "\n", 117 | "**DataFrame 1: df**\n", 118 | "- Count of unique values.\n", 119 | "- Distribution of values.\n", 120 | "- Count of values.\n", 121 | "- Descriptive statistics of column´s values.\n", 122 | "- Time-based analysis of column´s values.\n", 123 | "- Geographic distribution of values on a map\n", 124 | "- Visualization of regions with a heatmap.\n", 125 | "\n", 126 | "**DataFrame 2: df2**\n", 127 | "- Count of unique values.\n", 128 | "- Distribution of values.\n", 129 | "- Count of values.\n", 130 | "- Descriptive statistics of column´s values.\n", 131 | "- Time-based analysis of column´s values.\n", 132 | "- Geographic distribution of values on a map\n", 133 | "- Visualization of regions with a heatmap.\n", 134 | "\n", 135 | "**DataFrame 3: df3**\n", 136 | "- Count of unique values.\n", 137 | "- Distribution of values.\n", 138 | "- Count of values.\n", 139 | "- Descriptive statistics of column´s values.\n", 140 | "- Time-based analysis of column´s values.\n", 141 | "- Geographic distribution of values on a map\n", 142 | "- Visualization of regions with a heatmap.\n", 143 | "\n", 144 | "**DataFrame 4: df4**\n", 145 | "- Count of unique values.\n", 146 | "- Distribution of values.\n", 147 | "- Count of values.\n", 148 | "- Descriptive statistics of column´s values.\n", 149 | "- Time-based analysis of column´s values.\n", 150 | "- Geographic distribution of values on a map\n", 151 | "- Visualization of regions with a heatmap." 152 | ] 153 | }, 154 | { 155 | "attachments": {}, 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "## DataFrame 1: Customer Data" 160 | ] 161 | }, 162 | { 163 | "attachments": {}, 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "### Number of unique values" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "unique_column1 = cleaned_df['column1'].nunique()\n", 177 | "print(\"Number of unique values:\", column1) #-----> Change \"values\" for the variable name you want to analize" 178 | ] 179 | }, 180 | { 181 | "attachments": {}, 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "### Distribution of values" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": null, 191 | "metadata": {}, 192 | "outputs": [], 193 | "source": [ 194 | "value_counts = cleaned_df['column1'].value_counts().reset_index()\n", 195 | "value_counts.columns = ['column1', 'Count']\n", 196 | "\n", 197 | "# Display the distribution of values\n", 198 | "print(\"\\nDistribution of values:\") #-----> Change \"values\" for the variable name you want to analize\n", 199 | "print(cleaned_df.head(10))" 200 | ] 201 | }, 202 | { 203 | "attachments": {}, 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "### Barplot of the distribution of values" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "plt.figure(figsize=(8, 8))\n", 217 | "plt.pie(cleaned_df['Count'][:10], labels=cleaned_df['column1'][:10], autopct='%1.1f%%')\n", 218 | "plt.title('Distribution of Values (Top 10)')\n", 219 | "plt.show()" 220 | ] 221 | }, 222 | { 223 | 
"attachments": {}, 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "### Descriptive statistics of Column values" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": null, 233 | "metadata": {}, 234 | "outputs": [], 235 | "source": [ 236 | "cleaned_df_column1 = cleaned_df['column1'].describe()\n", 237 | "\n", 238 | "print(\"Descriptive statistics of price:\")\n", 239 | "print(cleaned_df_column1)" 240 | ] 241 | }, 242 | { 243 | "attachments": {}, 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "### Distribution of values based on a column or several columns" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [ 256 | "cleaned_df[['column1', 'column2', 'column3', 'column4']].hist(figsize=(12, 6))\n", 257 | "plt.tight_layout()\n", 258 | "plt.show()" 259 | ] 260 | }, 261 | { 262 | "attachments": {}, 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "### Barplot of the distribution of values\n" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": null, 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [ 275 | "cleaned_df_counts = df['column1'].value_counts()\n", 276 | "\n", 277 | "plt.figure(figsize=(12, 6))\n", 278 | "cleaned_df_counts.plot(kind='bar')\n", 279 | "plt.xlabel('State')\n", 280 | "plt.ylabel('Number of Cities')\n", 281 | "plt.title('Distribution of Cities across States')\n", 282 | "plt.xticks(rotation=45)\n", 283 | "plt.show()" 284 | ] 285 | }, 286 | { 287 | "attachments": {}, 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "## Analyses specific to geolocation data" 292 | ] 293 | }, 294 | { 295 | "attachments": {}, 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "### Geographic distribution of values (i.e. customers, start/end points) on a map" 300 | ] 301 | }, 302 | { 303 | "attachments": {}, 304 | "cell_type": "markdown", 305 | "metadata": {}, 306 | "source": [ 307 | "To address the long runtime and the large number of locations, consider using a random sample of the data. 
This will allow you to work with a smaller subset of the data, making it faster to merge and plot on a map." 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": null, 313 | "metadata": {}, 314 | "outputs": [], 315 | "source": [ 316 | "# Remember to import the folium library\n", 317 | "# In case you haven't imported it yet:\n", 318 | "# import folium\n", 319 | "\n", 320 | "# Randomly sample the geolocation DataFrame\n", 321 | "sampled_geolocation = geolocation.sample(n=250)\n", 322 | "\n", 323 | "# Randomly sample the cleaned_customers DataFrame\n", 324 | "sampled_cleaned_customers = cleaned_df.sample(n=250)\n", 325 | "\n", 326 | "# Merge the geolocation DataFrame with the cleaned_customers DataFrame using the common column 'customer_city'.\n", 327 | "merged_df = pd.merge(sampled_geolocation, sampled_cleaned_customers, left_on='geolocation_city', right_on='customer_city', how='inner')\n", 328 | "\n", 329 | "# Create a map object centered at a specific latitude and longitude, for example:\n", 330 | "city_map = folium.Map(location=[-12.257569734193066, -53.113064202406306], zoom_start=4)\n", 331 | "\n", 332 | "# Iterate over the merged sample and add a marker for each location\n", 333 | "for index, row in merged_df.iterrows():\n", 334 | " lat = row['geolocation_lat']\n", 335 | " lon = row['geolocation_lng']\n", 336 | " city = row['geolocation_city']\n", 337 | " marker = folium.Marker(location=[lat, lon], popup=city)\n", 338 | " marker.add_to(city_map)\n", 339 | "\n", 340 | "# Save the map as an HTML file\n", 341 | "city_map.save('map.html')\n", 342 | "\n", 343 | "# Display the map\n", 344 | "city_map" 345 | ] 346 | }, 347 | { 348 | "attachments": {}, 349 | "cell_type": "markdown", 350 | "metadata": {}, 351 | "source": [ 352 | "### Heatmap - Regions with the highest concentration of values." 353 | ] 354 | }, 355 | { 356 | "attachments": {}, 357 | "cell_type": "markdown", 358 | "metadata": {}, 359 | "source": [ 360 | "As before, consider using a random sample of the data if the DataFrames are too large, to avoid long runtimes or memory errors."
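A minimal sketch of a more reproducible sampling step, assuming the same `geolocation` and `cleaned_df` DataFrames used above; `frac` scales the sample with the size of each DataFrame, and `random_state` keeps the subset stable across reruns:

```python
# Sample a fixed fraction rather than a fixed row count, so the subset
# scales with each DataFrame's size; random_state makes it reproducible.
sampled_geolocation = geolocation.sample(frac=0.01, random_state=42)
sampled_cleaned_customers = cleaned_df.sample(frac=0.01, random_state=42)

# Fall back to the full DataFrame when the sample would be too small to plot.
MIN_ROWS = 250
if len(sampled_geolocation) < MIN_ROWS:
    sampled_geolocation = geolocation
if len(sampled_cleaned_customers) < MIN_ROWS:
    sampled_cleaned_customers = cleaned_df
```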
361 | ] 362 | }, 363 | { 364 | "cell_type": "code", 365 | "execution_count": null, 366 | "metadata": {}, 367 | "outputs": [], 368 | "source": [ 369 | "# Group by 'customer_city' and count the occurrences\n", 370 | "heatmap_city_counts = sampled_cleaned_customers['customer_city'].value_counts().reset_index()\n", 371 | "heatmap_city_counts.columns = ['customer_city', 'count']\n", 372 | "\n", 373 | "# Convert 'customer_city' to lowercase for consistency\n", 374 | "heatmap_city_counts['customer_city'] = heatmap_city_counts['customer_city'].str.lower()\n", 375 | "\n", 376 | "# Merge with the 'sampled_geolocation' DataFrame\n", 377 | "heatmap_merged_df = pd.merge(sampled_geolocation, heatmap_city_counts, left_on='geolocation_city', right_on='customer_city', how='inner')\n", 378 | "\n", 379 | "# Create a map object centered at a specific latitude and longitude\n", 380 | "map_heatmap = folium.Map(location=[-12.257569734193066, -53.113064202406306], zoom_start=4)\n", 381 | "\n", 382 | "# Create a HeatMap layer using the aggregated latitude and longitude coordinates\n", 383 | "heatmap_data = heatmap_merged_df[['geolocation_lat', 'geolocation_lng', 'count']].values\n", 384 | "heatmap_layer = plugins.HeatMap(heatmap_data)\n", 385 | "\n", 386 | "# Add the HeatMap layer to the map\n", 387 | "heatmap_layer.add_to(map_heatmap)\n", 388 | "\n", 389 | "# Save the map as an HTML file\n", 390 | "map_heatmap.save('heatmap.html')\n", 391 | "\n", 392 | "# Display the map\n", 393 | "map_heatmap\n" 394 | ] 395 | } 396 | ], 397 | "metadata": { 398 | "kernelspec": { 399 | "display_name": "base", 400 | "language": "python", 401 | "name": "python3" 402 | }, 403 | "language_info": { 404 | "codemirror_mode": { 405 | "name": "ipython", 406 | "version": 3 407 | }, 408 | "file_extension": ".py", 409 | "mimetype": "text/x-python", 410 | "name": "python", 411 | "nbconvert_exporter": "python", 412 | "pygments_lexer": "ipython3", 413 | "version": "3.9.7" 414 | }, 415 | "orig_nbformat": 4 416 | }, 417 | "nbformat": 4, 418 | "nbformat_minor": 2 419 | } 420 | -------------------------------------------------------------------------------- /ML/ML.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "attachments": {}, 5 | "cell_type": "markdown", 6 | "metadata": {}, 7 | "source": [ 8 | "- Author: Your Name\n", 9 | "- First Commit: yyyy-mm-dd #folowing ISO 8601 Format\n", 10 | "- Last Commit: yyyy-mm-dd #folowing ISO 8601 Format\n", 11 | "- Description: This notebook is used to perform EDA on the \"xxxxx\" dataset" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "# Import libraries\n", 21 | "# import libraries\n", 22 | "import pandas as pd\n", 23 | "import numpy as np\n", 24 | "import matplotlib.pyplot as plt\n", 25 | "import seaborn as sns\n", 26 | "import os\n", 27 | "import warnings\n", 28 | "warnings.filterwarnings('ignore')\n", 29 | "from sklearn.model_selection import train_test_split\n", 30 | "from sklearn.linear_model import LinearRegression\n", 31 | "from sklearn.metrics import mean_squared_error\n", 32 | "from sklearn.preprocessing import MinMaxScaler\n", 33 | "from sklearn.preprocessing import StandardScaler\n", 34 | "from sklearn.preprocessing import LabelEncoder\n", 35 | "from sklearn.preprocessing import OneHotEncoder\n", 36 | "from sklearn.ensemble import RandomForestRegressor\n", 37 | "from sklearn.model_selection import GridSearchCV\n", 38 | "from 
sklearn.model_selection import RandomizedSearchCV\n", 39 | "from sklearn.model_selection import cross_val_score\n", 40 | "from sklearn.metrics import r2_score\n", 41 | "from sklearn.metrics import mean_squared_log_error\n", 42 | "from sklearn.metrics import mean_absolute_error\n", 43 | "from sklearn.metrics import mean_squared_error\n", 44 | "from sklearn.metrics import mean_squared_log_error\n", 45 | "from sklearn.metrics import median_absolute_error\n", 46 | "from sklearn.metrics import explained_variance_score\n", 47 | "from sklearn.metrics import max_error\n", 48 | "from sklearn.metrics import mean_poisson_deviance\n", 49 | "from sklearn.metrics import mean_gamma_deviance\n", 50 | "from sklearn.metrics import mean_tweedie_deviance\n", 51 | "from sklearn.metrics import mean_absolute_percentage_error" 52 | ] 53 | }, 54 | { 55 | "attachments": {}, 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "# Machine Learning" 60 | ] 61 | }, 62 | { 63 | "attachments": {}, 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "In this jupyter notebook, meant to perform a basic, but useful structure for a ML Project, we will be performing the next actions:\n", 68 | "\n", 69 | "**Importing Data**\n", 70 | "\n", 71 | "- Read the \"cleaned_df.csv\" file into a DataFrame named \"cleaned_df\".\n", 72 | "- Display the head, information, and summary statistics of the \"cleaned_df\" DataFrame.\n", 73 | "\n", 74 | "**Exploratory Data Analysis**\n", 75 | "\n", 76 | "- Visualize correlations between variables using a heatmap.\n", 77 | "\n", 78 | "**Data Preprocessing**\n", 79 | "\n", 80 | "- Drop a highly correlated column from the DataFrame.\n", 81 | "\n", 82 | "**Data Distribution**\n", 83 | "\n", 84 | "- Plot histograms to visualize the distribution of the data.\n", 85 | "\n", 86 | "**Linearity Analysis**\n", 87 | "\n", 88 | "- Create scatter plots to analyze linearity between variables.\n", 89 | "\n", 90 | "**Encoding Categorical Variables**\n", 91 | "\n", 92 | "- Encode categorical variables using one-hot encoding and create new columns for each category.\n", 93 | "\n", 94 | "**Splitting Data**\n", 95 | "\n", 96 | "- Split the data into train and test sets.\n", 97 | "\n", 98 | "**Scaling Data**\n", 99 | "\n", 100 | "- Scale the data using Min-Max scaling.\n", 101 | "\n", 102 | "**Model Training and Prediction**\n", 103 | "\n", 104 | "- Train a linear regression model on the scaled training data.\n", 105 | "- Make predictions on the scaled validation data.\n", 106 | "\n", 107 | "**Model Evaluation**\n", 108 | "\n", 109 | "- Calculate and display evaluation metrics:\n", 110 | " - R2 score\n", 111 | " - Mean Squared Error\n", 112 | " - Mean Absolute Error\n", 113 | " - Root Mean Squared Error\n", 114 | " - Explained Variance Score\n", 115 | "\n", 116 | "**Residual Analysis**\n", 117 | "\n", 118 | "- Plot the residuals between the true and predicted values." 
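The scaling and regression steps listed above can also be wrapped in a single scikit-learn `Pipeline`, so the scaler is always fitted on training data only. A minimal sketch, assuming a numeric feature matrix `X` and target `y` (hypothetical names standing in for the prepared data):

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

# X and y are assumed to be the prepared feature matrix and target.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# The pipeline scales and fits in one step, so no scaling statistics
# are computed from the validation data.
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", LinearRegression()),
])
pipe.fit(X_train, y_train)
print("Validation R2:", pipe.score(X_val, y_val))

# Cross-validation on the training set gives a more stable estimate
# than a single train/validation split.
print("CV R2:", cross_val_score(pipe, X_train, y_train, cv=5, scoring="r2").mean())
```

Keeping preprocessing inside the pipeline also makes hyperparameter searches (e.g. `GridSearchCV`) leak-free, since each fold refits the scaler.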
119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": null, 124 | "metadata": {}, 125 | "outputs": [], 126 | "source": [ 127 | "# Import data\n", 128 | "cleaned_df = pd.read_csv('../cleaned_df.csv')\n", 129 | "\n", 130 | "# Explore data\n", 131 | "display(cleaned_df.head())\n", 132 | "display(cleaned_df.info())\n", 133 | "display(cleaned_df.describe())" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "# Search for correlations between the variables in a heatmap\n", 143 | "plt.figure(figsize=(20, 10))\n", 144 | "sns.heatmap(cleaned_df.corr(), annot=True, cmap='RdYlGn')" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "# Drop the columns that are highly correlated with each other\n", 154 | "cleaned_df.drop('temp', axis=1, inplace=True)" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "# Take a look of the distribution of the data\n", 164 | "cleaned_df.hist(figsize=(20, 10))" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "metadata": {}, 171 | "outputs": [], 172 | "source": [ 173 | "# Search for linearity between the variables\n", 174 | "sns.pairplot(cleaned_df, x_vars=['column1', 'column2', 'column3', 'column4'], y_vars=['column1', 'column2', 'column3', 'column4'], diag_kind='kde')" 175 | ] 176 | }, 177 | { 178 | "attachments": {}, 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "### Encoding cathegorical variables" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": {}, 189 | "outputs": [], 190 | "source": [ 191 | "cleaned_df.head()" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [ 200 | "# Use get_dummies for encoding categorical variables with unique column names\n", 201 | "new_season = pd.get_dummies(cleaned_df['month'], prefix='month', drop_first=False)\n", 202 | "new_weather = pd.get_dummies(cleaned_df['weather'], prefix='weather', drop_first=False)\n", 203 | "new_year = pd.get_dummies(cleaned_df['year'], prefix='year', drop_first=False)\n", 204 | "\n", 205 | "# Drop the old columns\n", 206 | "cleaned_df.drop(['month', 'weather', 'year'], axis=1, inplace=True)\n", 207 | "\n", 208 | "# Concatenate the encoded columns with the original dataframe\n", 209 | "cleaned_df_encoded = pd.concat([cleaned_df, new_season, new_weather, new_year], axis=1)" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "metadata": {}, 216 | "outputs": [], 217 | "source": [ 218 | "# weekday is also a categorical variable. 
\n", 219 | "# We will encode them with LabelEncoder and OneHotEncoder\n", 220 | "# LabelEncoder\n", 221 | "le = LabelEncoder()\n", 222 | "cleaned_df_encoded['weekday'] = le.fit_transform(cleaned_df_encoded['weekday'])\n", 223 | "\n", 224 | "ohe_weekday = OneHotEncoder()\n", 225 | "weekday_encoded = ohe_weekday.fit_transform(cleaned_df_encoded['weekday'].values.reshape(-1, 1)).toarray()\n", 226 | "weekday_df = pd.DataFrame(weekday_encoded, columns=[\"weekday_\" + str(int(i)) for i in range(weekday_encoded.shape[1])])\n", 227 | "\n", 228 | "cleaned_df_encoded = pd.concat([cleaned_df_encoded, weekday_df], axis=1)\n", 229 | "cleaned_df_encoded.drop(['weekday'], axis=1, inplace=True)\n", 230 | "\n", 231 | "cleaned_df_encoded.head()" 232 | ] 233 | }, 234 | { 235 | "attachments": {}, 236 | "cell_type": "markdown", 237 | "metadata": {}, 238 | "source": [ 239 | "### Split the data into train and test sets" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": null, 245 | "metadata": {}, 246 | "outputs": [], 247 | "source": [ 248 | "# Split the data into train, validation, and test sets\n", 249 | "train_df = cleaned_df_encoded[cleaned_df_encoded['year_0'] == 0]\n", 250 | "test_df = cleaned_df_encoded[cleaned_df_encoded['year_0'] == 1]\n", 251 | "\n", 252 | "# Drop the year columns\n", 253 | "train_df.drop(['year_0', 'year_1'], axis=1, inplace=True)\n", 254 | "test_df.drop(['year_0', 'year_1'], axis=1, inplace=True)\n", 255 | "\n", 256 | "# Split the train data into train and validation sets\n", 257 | "X_train = train_df.drop('count', axis=1)\n", 258 | "y_train = train_df['count']\n", 259 | "\n", 260 | "X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)\n", 261 | "\n", 262 | "# Split the test data into test and validation sets\n", 263 | "X_test = test_df.drop('count', axis=1)\n", 264 | "y_test = test_df['count']\n", 265 | "\n", 266 | "X_test, X_val_test, y_test, y_val_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": null, 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [ 275 | "train_df.info()" 276 | ] 277 | }, 278 | { 279 | "attachments": {}, 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "### Scale the data" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": null, 289 | "metadata": {}, 290 | "outputs": [], 291 | "source": [ 292 | "# Bring all the values to a uniform range\n", 293 | "# Remember that the scaling is applied because the Gradient Descent method that we use to minimize our underlying cost function, converges much faster with scaling than without it.\n", 294 | "\n", 295 | "scaler = MinMaxScaler()\n", 296 | "X_train_scaled = scaler.fit_transform(X_train)\n", 297 | "X_val_scaled = scaler.transform(X_val)\n", 298 | "X_test_scaled = scaler.transform(X_test)" 299 | ] 300 | }, 301 | { 302 | "attachments": {}, 303 | "cell_type": "markdown", 304 | "metadata": {}, 305 | "source": [ 306 | "### Train and fit the model and make predictions" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": null, 312 | "metadata": {}, 313 | "outputs": [], 314 | "source": [ 315 | "# Create the model\n", 316 | "model = LinearRegression()\n", 317 | "\n", 318 | "# Fit the model\n", 319 | "model.fit(X_train_scaled, y_train)\n", 320 | "\n", 321 | "# Make predictions\n", 322 | "y_pred = model.predict(X_val_scaled)" 323 | ] 324 | }, 325 | { 326 | "cell_type": 
"code", 327 | "execution_count": null, 328 | "metadata": {}, 329 | "outputs": [], 330 | "source": [ 331 | "# Create a function that will calculate the metrics\n", 332 | "def calculate_metrics(y_true, y_pred):\n", 333 | " print('R2 score: {}'.format(r2_score(y_true, y_pred)))\n", 334 | " print('Mean Squared Error: {}'.format(mean_squared_error(y_true, y_pred)))\n", 335 | " print('Mean Absolute Error: {}'.format(mean_absolute_error(y_true, y_pred)))\n", 336 | " print('Root Mean Squared Error: {}'.format(np.sqrt(mean_squared_error(y_true, y_pred))))\n", 337 | " print('Explained Variance Score: {}'.format(explained_variance_score(y_true, y_pred)))\n", 338 | "\n", 339 | "# calculate the metrics\n", 340 | "calculate_metrics(y_val, y_pred)" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": null, 346 | "metadata": {}, 347 | "outputs": [], 348 | "source": [ 349 | "# Create a function that will plot the residuals\n", 350 | "def plot_residuals(y_true, y_pred):\n", 351 | " residuals = y_true - y_pred\n", 352 | " plt.figure(figsize=(20, 10))\n", 353 | " plt.scatter(y_true, residuals)\n", 354 | " plt.title('Residual plot')\n", 355 | " plt.xlabel('y_true')\n", 356 | " plt.ylabel('residuals')\n", 357 | " plt.show()\n", 358 | "\n", 359 | "# call the function to plot the residuals\n", 360 | "plot_residuals(y_val, y_pred)" 361 | ] 362 | }, 363 | { 364 | "attachments": {}, 365 | "cell_type": "markdown", 366 | "metadata": {}, 367 | "source": [ 368 | "Note: In addition to the metrics we have already calculated (R2 score, Mean Squared Error, Mean Absolute Error, Root Mean Squared Error, and Explained Variance Score), there are several other metrics that we can analyze to evaluate the performance of our machine learning model. Here are some commonly used metrics:\n", 369 | "\n", 370 | "- Accuracy: Accuracy measures the overall correctness of your model's predictions. It is the ratio of the number of correct predictions to the total number of predictions.\n", 371 | "\n", 372 | "- Precision: Precision is the proportion of true positive predictions out of all positive predictions. It measures the accuracy of positive predictions.\n", 373 | "\n", 374 | "- Recall: Recall is the proportion of true positive predictions out of all actual positive instances. It measures the ability of the model to correctly identify positive instances.\n", 375 | "\n", 376 | "- F1 Score: The F1 score is the harmonic mean of precision and recall. It provides a single metric that balances both precision and recall.\n", 377 | "\n", 378 | "- ROC AUC Score: ROC (Receiver Operating Characteristic) AUC (Area Under the Curve) score is a performance metric for binary classification models. It measures the model's ability to discriminate between positive and negative classes.\n", 379 | "\n", 380 | "- Confusion Matrix: A confusion matrix is a table that shows the counts of true positive, true negative, false positive, and false negative predictions. It provides a detailed breakdown of the model's performance across different classes.\n", 381 | "\n", 382 | "- Classification Report: A classification report provides a summary of precision, recall, F1 score, and support for each class in a multi-class classification problem.\n", 383 | "\n", 384 | "These metrics can provide additional insights into the performance of our machine learning model, especially in classification tasks. You can calculate these metrics using appropriate functions from libraries such as scikit-learn or other specialized evaluation libraries." 
385 | ] 386 | }, 387 | { 388 | "attachments": {}, 389 | "cell_type": "markdown", 390 | "metadata": {}, 391 | "source": [] 392 | } 393 | ], 394 | "metadata": { 395 | "kernelspec": { 396 | "display_name": "base", 397 | "language": "python", 398 | "name": "python3" 399 | }, 400 | "language_info": { 401 | "codemirror_mode": { 402 | "name": "ipython", 403 | "version": 3 404 | }, 405 | "file_extension": ".py", 406 | "mimetype": "text/x-python", 407 | "name": "python", 408 | "nbconvert_exporter": "python", 409 | "pygments_lexer": "ipython3", 410 | "version": "3.9.7" 411 | }, 412 | "orig_nbformat": 4 413 | }, 414 | "nbformat": 4, 415 | "nbformat_minor": 2 416 | } 417 | -------------------------------------------------------------------------------- /data_wrangling/data_wrangling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "attachments": {}, 5 | "cell_type": "markdown", 6 | "metadata": {}, 7 | "source": [ 8 | "- Author: Your Name\n", 9 | "- First Commit: yyyy-mm-dd #folowing ISO 8601 Format\n", 10 | "- Last Commit: yyyy-mm-dd #folowing ISO 8601 Format\n", 11 | "- Description: This notebook is used to perform EDA on the \"xxxxx\" dataset" 12 | ] 13 | }, 14 | { 15 | "attachments": {}, 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "## Key Steps in this Data Analysis:\n", 20 | "\n", 21 | "1. **Framing the Question:** \n", 22 | " - The first step towards any sort of data analysis is to ask the right question(s) from the given data. \n", 23 | " - Identifying the objective of the analysis makes it easier to decide on the type(s) of data needed to draw conclusions.\n", 24 | "\n", 25 | "2. **Data Wrangling:** \n", 26 | " - Data wrangling, sometimes referred to as data munging or data pre-processing, is the process of gathering, assessing, and cleaning \"raw\" data into a form suitable for analysis.\n", 27 | "\n", 28 | "3. **Exploratory Data Analysis (EDA):** \n", 29 | " - Once the data is collected, cleaned, and processed, it is ready for analysis. \n", 30 | " - During this phase, you can use data analysis tools and software to understand, interpret, and derive conclusions based on the requirements.\n", 31 | "\n", 32 | "4. **Drawing Conclusions:** \n", 33 | " - After completing the analysis phase, the next step is to interpret the analysis and draw conclusions. \n", 34 | " - Three key questions to ask at this stage:\n", 35 | " - Did the analysis answer my original question?\n", 36 | " - Were there any limitations in my analysis that could affect my conclusions?\n", 37 | " - Was the analysis sufficient to support decision-making?\n", 38 | "\n", 39 | "5. **Communicating Results:** \n", 40 | " - Once data has been explored and conclusions have been drawn, it's time to communicate the findings to the relevant audience. \n", 41 | " - Effective communication can be achieved through data storytelling, writing blogs, making presentations, or filing reports.\n", 42 | "\n", 43 | "**Note:** The five steps of data analysis are not always followed linearly. 
The process can be iterative, with steps revisited based on new insights or requirements that arise during the analysis.\n" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "metadata": {}, 50 | "outputs": [], 51 | "source": [ 52 | "# Import libraries\n", 53 | "import pandas as pd\n", 54 | "import numpy as np\n", 55 | "import matplotlib.pyplot as plt\n", 56 | "import seaborn as sns\n", 57 | "from scipy import stats\n", 58 | "import os\n", 59 | "import warnings" 60 | ] 61 | }, 62 | { 63 | "attachments": {}, 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "## 1. Data Wrangling" 68 | ] 69 | }, 70 | { 71 | "attachments": {}, 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "### 1.1 Gathering data" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "# import csv files \n", 85 | "df = pd.read_csv('../datasets/df.csv')" 86 | ] 87 | }, 88 | { 89 | "attachments": {}, 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "### 1.2 Assessing of Data" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "# Take a look of the data´s shape\n", 103 | "display(df.shape)" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [ 112 | "# Take a look of the data´s info\n", 113 | "display(df.info())" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "# Take a look of the data´s head\n", 123 | "display(df.head())" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": null, 129 | "metadata": {}, 130 | "outputs": [], 131 | "source": [ 132 | "# Search for NULL values\n", 133 | "display(customers.isnull().sum())" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "# Check the type of information that every column has\n", 143 | "display(df.dtypes)" 144 | ] 145 | }, 146 | { 147 | "attachments": {}, 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "### 1.3 Data Cleaning" 152 | ] 153 | }, 154 | { 155 | "attachments": {}, 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "### 1.3.1 Remove irrelevant data" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [ 168 | "# Remove irrelevant data in the 'x' column. 
(i.e: \"In this case, we will drop the colums that have more null values than valid values\")\n", 169 | "df = df.drop(columns=['column1', 'column2'])" 170 | ] 171 | }, 172 | { 173 | "attachments": {}, 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "### 1.3.2 Remove/replace null values\n", 178 | " " 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": null, 184 | "metadata": {}, 185 | "outputs": [], 186 | "source": [ 187 | "# Calculate the mean of the columns with missing values \n", 188 | "# Check datatype of the columns\n", 189 | "display(df['column1'].dtype) \n", 190 | "\n", 191 | "# Transform the columns into datetime type\n", 192 | "df['column1'] = pd.to_datetime(df['column1'])" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "# Calculate the mean of this columns\n", 202 | "mean_df = df['column1'].mean()\n", 203 | "display(mean_df)" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": null, 209 | "metadata": {}, 210 | "outputs": [], 211 | "source": [ 212 | "# Imput the corresponding values to the null values in this columns with the mean (or the best method for each case)\n", 213 | "# check if null values exist in order_data dataset\n", 214 | "df.isnull().sum()['column1'].fillna(mean_df, inplace=True)\n" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "metadata": {}, 221 | "outputs": [], 222 | "source": [ 223 | "# Check if null values exist in order_data dataset\n", 224 | "df.isnull().sum()" 225 | ] 226 | }, 227 | { 228 | "attachments": {}, 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "#### 1.3.2.2 Dataset's columns types" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": null, 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [ 241 | "# Calculate the mean of the columns with missing values \n", 242 | "# Check datatype of the columns (they need to be in this case datime type)\n", 243 | "display(df['column1'].dtype) " 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": {}, 250 | "outputs": [], 251 | "source": [ 252 | "# Calculate the mean of this columns\n", 253 | "mean_column1 = df['column1'].mean()" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": null, 259 | "metadata": {}, 260 | "outputs": [], 261 | "source": [ 262 | "# Imput the corresponding values to the null values in this columns with the mean\n", 263 | "df['column1'].fillna(mean_column1, inplace=True)" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "metadata": {}, 270 | "outputs": [], 271 | "source": [ 272 | "# Check again if null values exist in order_data dataset\n", 273 | "df.isnull().sum()" 274 | ] 275 | }, 276 | { 277 | "attachments": {}, 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "### 1.3.3 Drop the duplicates, if any" 282 | ] 283 | }, 284 | { 285 | "attachments": {}, 286 | "cell_type": "markdown", 287 | "metadata": {}, 288 | "source": [ 289 | "Use duplicate() function to find duplicated data in the datasets" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": null, 295 | "metadata": {}, 296 | "outputs": [], 297 | "source": [ 298 | "# Find duplicates based on all columns\n", 299 | "display(df[df.duplicated()].sum())" 300 | ] 301 | }, 302 | { 303 | "cell_type": 
"code", 304 | "execution_count": null, 305 | "metadata": {}, 306 | "outputs": [], 307 | "source": [ 308 | "# drop duplicates in df \n", 309 | "df = df.drop_duplicates()" 310 | ] 311 | }, 312 | { 313 | "attachments": {}, 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "### 1.3.4 Type conversion" 318 | ] 319 | }, 320 | { 321 | "attachments": {}, 322 | "cell_type": "markdown", 323 | "metadata": {}, 324 | "source": [ 325 | "Make sure numbers are stored as numerical data types. A date should be stored as a date object, or a Unix timestamp (number of seconds, and so on).\n", 326 | "In this case we already did this." 327 | ] 328 | }, 329 | { 330 | "attachments": {}, 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "### 1.3.5 Syntax Errors" 335 | ] 336 | }, 337 | { 338 | "attachments": {}, 339 | "cell_type": "markdown", 340 | "metadata": {}, 341 | "source": [ 342 | "### 1.3.6 Outliers" 343 | ] 344 | }, 345 | { 346 | "attachments": {}, 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "They are values that are significantly different from all other observations. Any data value that lies more than (1.5 * IQR) away from the Q1 and Q3 quartiles is considered an outlier.\n", 351 | "\n", 352 | "In general, an e-commerce dataset obtained from a well-functioning system is less likely to have outliers compared to datasets that involve manual data entry or measurement errors. E-commerce datasets typically capture transactional information, such as customer details, product information, and order-related data, which are less prone to outliers.\n", 353 | "\n", 354 | "However, it's still possible to have outliers in certain scenarios, such as:\n", 355 | "\n", 356 | "Data entry errors: Although automated systems minimize data entry errors, there can still be instances where incorrect or extreme values are recorded.\n", 357 | "\n", 358 | "Measurement errors: If the dataset includes measurements or quantitative data collected manually, there may be measurement errors leading to outliers.\n", 359 | "\n", 360 | "System glitches or anomalies: While rare, system glitches or anomalies can occasionally result in outliers in the data.\n", 361 | "\n", 362 | "Fraudulent activities: In some cases, fraudulent transactions or activities may introduce outliers into the dataset.\n", 363 | "\n", 364 | "Therefore, while it's reasonable to assume that the occurrence of outliers in an e-commerce dataset is relatively low, it's still advisable to examine the data and apply outlier detection techniques to ensure data quality and integrity.\n", 365 | "\n", 366 | "Remember that outlier detection is an iterative process, and there is no one-size-fits-all approach. It requires a combination of domain knowledge, data understanding, and experimentation to determine the most suitable method and threshold for your specific dataset and analysis objectives." 
367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": null, 372 | "metadata": {}, 373 | "outputs": [], 374 | "source": [ 375 | "# Create a function to find otliers\n", 376 | "dataframes = [df, df1, df2, df3,\n", 377 | " df4]\n", 378 | "\n", 379 | "for df in dataframes:\n", 380 | " # Identify numerical columns\n", 381 | " numerical_columns = df.select_dtypes(include=np.number).columns\n", 382 | "\n", 383 | " # Define percentiles for outlier detection (e.g., values outside [5th percentile, 95th percentile])\n", 384 | " lower_percentile = 5\n", 385 | " upper_percentile = 95\n", 386 | "\n", 387 | " for column in numerical_columns:\n", 388 | " # Calculate percentiles for the column\n", 389 | " lower_threshold = np.percentile(df[column], lower_percentile)\n", 390 | " upper_threshold = np.percentile(df[column], upper_percentile)\n", 391 | "\n", 392 | " # Find rows with outliers in the column\n", 393 | " outlier_rows = (df[column] < lower_threshold) | (df[column] > upper_threshold)\n", 394 | "\n", 395 | " # Print rows with outliers in the column\n", 396 | " print(df[outlier_rows])\n", 397 | " print('\\n')" 398 | ] 399 | }, 400 | { 401 | "attachments": {}, 402 | "cell_type": "markdown", 403 | "metadata": {}, 404 | "source": [ 405 | "After performing outlier analysis we indentified that because of the nature of the info, there are not relevant outliers, althought numerically there are some that exists. " 406 | ] 407 | }, 408 | { 409 | "attachments": {}, 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "### 1.3.7 In-record & cross-datasets errors" 414 | ] 415 | }, 416 | { 417 | "attachments": {}, 418 | "cell_type": "markdown", 419 | "metadata": {}, 420 | "source": [ 421 | "These errors result from having two or more values in the same row or across datasets that contradict with each other. For example, if we have a dataset about the cost of living in cities. The total column must be equivalent to the sum of rent, transport, and food." 422 | ] 423 | }, 424 | { 425 | "attachments": {}, 426 | "cell_type": "markdown", 427 | "metadata": {}, 428 | "source": [ 429 | "## Export the cleaned data" 430 | ] 431 | }, 432 | { 433 | "cell_type": "code", 434 | "execution_count": null, 435 | "metadata": {}, 436 | "outputs": [], 437 | "source": [ 438 | "# Create a function to export the cleaned data to perform EDA in the future\n", 439 | "# Specify the path to the dataset folder\n", 440 | "folder_path = \"../datasets/\"\n", 441 | "\n", 442 | "dataframes = [df, df1, df2, df3,\n", 443 | " df4]\n", 444 | "\n", 445 | "file_names = ['df.csv', 'df1.csv', 'df3.csv',\n", 446 | " 'df4.csv']\n", 447 | "\n", 448 | "for df, file_name in zip(dataframes, file_names):\n", 449 | " # Add the prefix \"cleaned_\" to the file name\n", 450 | " cleaned_file_name = \"cleaned_\" + file_name\n", 451 | " # Get the full path of the output file\n", 452 | " output_file_path = os.path.join(folder_path, cleaned_file_name)\n", 453 | " # Save the cleaned DataFrame to CSV\n", 454 | " df.to_csv(output_file_path, index=False)" 455 | ] 456 | } 457 | ], 458 | "metadata": { 459 | "kernelspec": { 460 | "display_name": "base", 461 | "language": "python", 462 | "name": "python3" 463 | }, 464 | "language_info": { 465 | "name": "python", 466 | "version": "3.9.7" 467 | }, 468 | "orig_nbformat": 4 469 | }, 470 | "nbformat": 4, 471 | "nbformat_minor": 2 472 | } 473 | --------------------------------------------------------------------------------
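As a complement to section 1.3.7 of the data-wrangling notebook, which describes in-record and cross-dataset contradictions but includes no code, here is a minimal sketch of an in-record consistency check based on the cost-of-living example from that section (the column names `rent`, `transport`, `food`, and `total` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical cost-of-living data; the last row is intentionally inconsistent.
cost_df = pd.DataFrame({
    "city": ["A", "B", "C"],
    "rent": [1200, 800, 950],
    "transport": [100, 90, 120],
    "food": [400, 350, 380],
    "total": [1700, 1240, 2000],
})

# Flag rows where the stated total does not match the sum of its parts;
# np.isclose avoids false positives from floating-point rounding.
expected_total = cost_df[["rent", "transport", "food"]].sum(axis=1)
inconsistent = cost_df[~np.isclose(cost_df["total"], expected_total)]
print(inconsistent)
```

The same pattern extends to cross-dataset checks, for example verifying that values aggregated from one table agree with the corresponding records in another before the cleaned files are exported.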