├── requirements.txt
├── report python Project.pdf
├── Objectives of Python code.docx
├── README.md
└── candySales.py


/requirements.txt:
--------------------------------------------------------------------------------
pandas
numpy
matplotlib
seaborn

--------------------------------------------------------------------------------
/report python Project.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Suhani810/Candy_Sales/HEAD/report python Project.pdf

--------------------------------------------------------------------------------
/Objectives of Python code.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Suhani810/Candy_Sales/HEAD/Objectives of Python code.docx

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# 🍬 Candy Sales Analysis Project

This project analyzes a candy sales dataset in Python, covering data preprocessing, statistical analysis, and visualization. It aims to uncover trends, patterns, and performance metrics across dimensions such as time, region, and product division.

---

## 📁 Dataset

- **Source**: [Maven Analytics Data Playground](https://mavenanalytics.io/data-playground)
- **File**: `Candy_Sales.csv`

---

## 🔧 Technologies Used

- **Python**
- **Data Visualization**: Matplotlib, Seaborn
- **NumPy**: numerical computation
- **Pandas**: data handling

---

## ✅ Project Objectives

### 🧹 Objective 0: Data Preprocessing and Cleaning

- Convert date columns (`Order Date`, `Ship Date`) to datetime objects.
- Handle missing values:
  - Replace missing `Postal Code` with `"Unknown"`.
  - Replace missing `Sales` with the **mean** of the column.
- Remove duplicates.
- Print dataset info and preview the data.

---

### 📊 Objective 1: Revenue Statistics

- **Total Revenue**: Sum of all sales.
- **Average Revenue per Order**: Mean of the `Sales` column.
- **Standard Deviation** of Sales.

---

### 📦 Objective 2: Orders and Performance by State

- Total number of orders.
- Total units sold.
- State/Province with the highest total sales.

---

### 📈 Objective 3: Sales Performance & Profitability Visualizations

- **Line Plot**: Sales trends over time.
- **Grouped Bar Chart**: Sales and Gross Profit by Country/Region and Division.
- **Histogram**: Distribution of Sales and Cost.
- **Correlation Heatmap**: Sales, Cost, and Gross Profit relationships.

---

### 🚨 Objective 4: Outlier Detection

- Detect outliers in:
  - `Sales`
  - `Gross Profit`
- Use Interquartile Range (IQR) method.
- Visualize with Box Plots.

---

### 📊 Objective 5: Graphical Analysis

- **Pie Chart**: Sales distribution by region.
- **Heatmap**: Correlation between numerical features.
- **Horizontal Bar Chart**: Sales by region.
- **Donut Chart**: Region-wise sales share.
- **Scatter Plot**: Sales vs. Gross Profit.

---

### 📆 Objective 6: Seasonal Sales Trends

- **Line Chart**: Monthly sales trend over time.
- **Area Chart**: Seasonal pattern in monthly candy sales.

---

## 📌 Insights & Benefits

This analysis helps to:
- Understand which regions and divisions are most profitable.
- Spot seasonal patterns for better planning.
- Detect outliers which may indicate errors or exceptional cases.
- Visualize relationships between key financial metrics.
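
For readers unfamiliar with the IQR rule named under Objective 4, here is a minimal sketch of how it flags outliers. The helper name `iqr_bounds`, the multiplier `k`, and the toy values are illustrative assumptions, not part of the project script; the real analysis applies the same rule to the `Sales` and `Gross Profit` columns.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical toy values standing in for the `Sales` column.
toy_sales = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])

lower, upper = iqr_bounds(toy_sales)

# Anything outside the fences is treated as an outlier.
outliers = toy_sales[(toy_sales < lower) | (toy_sales > upper)]
print(outliers.tolist())  # → [95.0]
```

With these toy values the fences work out to 9.0 and 15.0, so only the value 95.0 is flagged. A flagged row is a candidate for inspection, not necessarily an error.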

---

## 📎 Notes

Make sure the dataset `Candy_Sales.csv` is in your working directory and the required libraries are installed before running the script.

---

## 📬 Contact

For questions or feedback, feel free to reach out!

---

**Happy Analyzing! 🍭**

--------------------------------------------------------------------------------
/candySales.py:
--------------------------------------------------------------------------------
# ================================ Data Preprocessing & Cleaning ===================================

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Load dataset (expected in the working directory, as the README notes)
sales_data = pd.read_csv("Candy_Sales.csv")

# 1. Convert date columns to datetime
sales_data["Order Date"] = pd.to_datetime(sales_data["Order Date"])
sales_data["Ship Date"] = pd.to_datetime(sales_data["Ship Date"])

# 2. Check for missing values
print(sales_data.isnull().sum())

# 3. Handle Missing Values
sales_data.fillna({"Postal Code": "Unknown", "Sales": sales_data["Sales"].mean()}, inplace=True)

# 4. Remove Duplicates
sales_data.drop_duplicates(inplace=True)

# 5. Dataset Summary
sales_data.info()  # info() prints directly and returns None, so no print() wrapper
print(sales_data.head())

# ================================ Exploratory Data Analysis (EDA) ================================

# 1. Summary Statistics
print("Summary stats-")
print(sales_data.describe())

# 2.
# Correlation & Covariance
print("Correlation Matrix")
print(sales_data[["Sales", "Units", "Gross Profit", "Cost"]].corr())

print("Covariance Matrix")
print(sales_data[["Sales", "Units", "Gross Profit", "Cost"]].cov())


# ================================ Objective 1: Revenue Stats ======================================

# Reuse the cleaned sales_data from above; reloading the CSV here would discard
# the date conversion, missing-value handling, and duplicate removal.
sales = np.array(sales_data["Sales"])

# Calculate total, average revenue and std deviation
total_revenue = np.sum(sales)
average_revenue = np.mean(sales)
std_dev_sales = np.std(sales)  # population standard deviation (ddof=0)

print("Total Revenue:", total_revenue)
print("Average Revenue per Order:", average_revenue)
print("Standard Deviation of Sales:", std_dev_sales)

# ================================ Objective 2: Orders & State Performance ==========================

# Total orders (one row per order line), units, and top state
total_orders = sales_data.shape[0]
total_units = sales_data["Units"].sum()
state_highest_sale = sales_data.groupby("State/Province")["Sales"].sum().idxmax()

print("Total number of Orders:", total_orders)
print("Total Units sold:", total_units)
print("State with the highest Sales:", state_highest_sale)


# ================================ Objective 3: Sales & Profit Visuals ==============================

# 1. Sales Trends Over Time
plt.figure(figsize=(10, 6))
sns.lineplot(data=sales_data, x="Order Date", y="Sales")
plt.title("Sales Trends Over Time")
plt.xlabel("Order Date")
plt.ylabel("Sales")
plt.show()

# 2.
# Sales & Gross Profit Across Regions and Divisions
plt.figure(figsize=(10, 6))
region = sales_data.groupby(["Country/Region", "Division"])[["Sales", "Gross Profit"]].sum().reset_index()
melted_data = region.melt(id_vars=["Country/Region", "Division"], value_vars=["Sales", "Gross Profit"], var_name="Metric", value_name="Amount")
# One bar group per region/division pair and one bar per metric, so Sales and
# Gross Profit are not averaged together and the legend matches its "Metric" title.
melted_data["Group"] = melted_data["Country/Region"] + " / " + melted_data["Division"]
print(melted_data.head(10))
sns.barplot(data=melted_data, x="Group", y="Amount", hue="Metric", errorbar=None)
plt.title("Sales and Gross Profit by Regions and Divisions")
plt.xlabel("Country/Region / Division")
plt.ylabel("Total Amount")
plt.legend(title="Metric")
plt.grid(axis="y", linestyle="--")
plt.show()

# 3. Distribution of Sales and Costs
plt.figure(figsize=(8, 5))
sns.histplot(sales_data["Sales"], bins=30, color="b", alpha=0.5, label="Sales")
sns.histplot(sales_data["Cost"], bins=30, color="r", alpha=0.5, label="Cost")
plt.title("Distribution of Sales and Costs")
plt.xlabel("Amount")  # the x-axis holds the monetary value
plt.ylabel("Count")   # the y-axis holds the number of rows per bin
plt.legend()
plt.show()

# 4.
# Correlation Heatmap: Sales, Cost, Profit
data = sales_data[["Sales", "Cost", "Gross Profit"]]
plt.figure(figsize=(10, 6))
sns.heatmap(data.corr(), annot=True, fmt=".2f", linewidths=0.5, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()

# ================================ Objective 4: Outlier Detection ==================================

# Outlier Detection Function
def detect_outliers(data, col):
    """Return the rows of `data` whose `col` value lies outside the 1.5 * IQR fences."""
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[col] < lower_bound) | (data[col] > upper_bound)]
    return outliers

# Detect outliers
sales_outlier = detect_outliers(sales_data, "Sales")
gross_profit_outlier = detect_outliers(sales_data, "Gross Profit")

print("Outliers in Sales:")
print(sales_outlier)
print("Outliers in Gross Profit:")
print(gross_profit_outlier)

# Box Plot of Sales and Gross Profit
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
sns.boxplot(y=sales_data["Sales"], color="skyblue")  # explicit y= keeps the orientation stable across seaborn versions
plt.title("Box Plot of Sales")

plt.subplot(1, 2, 2)
sns.boxplot(y=sales_data["Gross Profit"], color="lightcoral")
plt.title("Box Plot of Gross Profit")

plt.tight_layout()
plt.show()

# ================================ Objective 5: Graphical Analysis =================================

# 1. Pie Chart: Sales Distribution by Region
geo_col = 'Region'
sales_col = 'Sales'
region_sales = sales_data.groupby(geo_col)[sales_col].sum()

plt.figure(figsize=(8, 8))
plt.pie(region_sales, labels=region_sales.index, autopct='%1.1f%%', startangle=140, colors=plt.cm.Set3.colors)
plt.title('Pie Chart: Sales Distribution by Region')
plt.axis('equal')
plt.tight_layout()
plt.show()

# 2.
# Correlation Heatmap
plt.figure(figsize=(8, 6))
correlation = sales_data.corr(numeric_only=True)
sns.heatmap(correlation, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Heatmap: Correlation Between Numerical Features")
plt.show()

# 3. Horizontal Bar Plot: Sales by Region
plt.figure(figsize=(8, 10))
region_sales.sort_values().plot(kind='barh', color='teal')
plt.title('Horizontal Bar Plot: Sales by Region')
plt.xlabel('Total Sales')
plt.ylabel('Region')
plt.tight_layout()
plt.show()

# 4. Donut Chart: Sales Distribution by Region
# Recomputed here so this chart can run on its own
geo_col = 'Region'
sales_col = 'Sales'
region_sales = sales_data.groupby(geo_col)[sales_col].sum()

plt.figure(figsize=(8, 8))
colors = plt.cm.Pastel1.colors
# A wedge width below 1 leaves the hole that turns the pie into a donut
plt.pie(region_sales, labels=region_sales.index, autopct='%1.1f%%', startangle=140, colors=colors, wedgeprops=dict(width=0.4))
plt.title("Donut Chart: Sales Distribution by Region")
plt.tight_layout()
plt.show()

# 5. Scatter Plot: Sales vs. Gross Profit
plt.figure(figsize=(10, 6))
plt.scatter(sales_data['Sales'], sales_data['Gross Profit'], alpha=0.6, color='pink', edgecolors='k')
plt.title('Scatter Plot: Sales vs. Gross Profit')
plt.xlabel('Sales ($)')
plt.ylabel('Gross Profit ($)')
plt.grid(True)
plt.tight_layout()
plt.show()


# ================================ Objective 6: Seasonal Sales Trends ==============================

# 1.
# Line Plot: Monthly Candy Sales Trend
# Ensure datetime dtype (a no-op if the column was already converted)
sales_data['Order Date'] = pd.to_datetime(sales_data['Order Date'])
sales_data['YearMonth'] = sales_data['Order Date'].dt.to_period('M')
monthly_sales = sales_data.groupby('YearMonth')['Sales'].sum().reset_index()
monthly_sales['YearMonth'] = monthly_sales['YearMonth'].astype(str)

plt.figure(figsize=(12, 6))
plt.plot(monthly_sales['YearMonth'], monthly_sales['Sales'], marker='o', linestyle='-', color='green')
plt.title('Monthly Candy Sales Trend')
plt.xlabel('Month')
plt.ylabel('Total Sales ($)')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()

# 2. Seasonal Area Chart
# 'Order Date' is already datetime after the conversion above
sales_data['Month'] = sales_data['Order Date'].dt.to_period('M').astype(str)
monthly_sales = sales_data.groupby('Month')['Sales'].sum().reset_index()

plt.figure(figsize=(12, 6))
plt.fill_between(monthly_sales['Month'], monthly_sales['Sales'], color='yellow', alpha=0.6)
plt.plot(monthly_sales['Month'], monthly_sales['Sales'], color='red', marker='o')
plt.xticks(rotation=45)
plt.title('Seasonal Candy Sales Trends')
plt.xlabel('Month')
plt.ylabel('Total Sales ($)')
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()


--------------------------------------------------------------------------------