├── clean-data.xlsx
├── sample_-_superstore.xlsx
├── README.md
└── sampleStore1.py

/clean-data.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/raykundan655/Retail-Analysis-/HEAD/clean-data.xlsx
--------------------------------------------------------------------------------
/sample_-_superstore.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/raykundan655/Retail-Analysis-/HEAD/sample_-_superstore.xlsx
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Retail-Analysis-

This project provides business insight into a fictional superstore's operations through deep exploratory data analysis. Visual and statistical techniques were applied to discover patterns in regional performance, customer behavior, and operational logistics.

# Superstore Retail Data Analysis

A complete exploratory data analysis (EDA) project conducted on the Sample Superstore dataset, originally sourced from https://public.tableau.com/app/learn/sample-data. This analysis simulates a real-world retail business case study, with the goal of extracting actionable insights through structured data preparation, visual exploration, and statistical evaluation.

## 📌 Objectives

This project is centered around five business-focused objectives:

1. Analyze monthly and yearly profit trends to understand seasonality.
2. Evaluate region-wise profitability to identify high- and low-performing markets.
3. Assess product return frequency by category to gauge product satisfaction.
4. Examine shipping duration patterns to evaluate delivery performance.
5. Analyze customer segments (Consumer, Corporate, Home Office) based on sales, quantity, and profit.
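Objectives 3–5 rest on joining the Orders sheet with the Returns sheet so every order carries a return indicator. A minimal sketch of that join, on toy stand-in frames (the names and values below are illustrative, not from the dataset):

```python
import pandas as pd

# Toy stand-ins for the Orders and Returns sheets (illustrative only)
orders = pd.DataFrame({
    "Order ID": ["A-1", "A-2", "A-3"],
    "Category": ["Technology", "Furniture", "Technology"],
})
returns = pd.DataFrame({"Order ID": ["A-3"], "Returned": ["Yes"]})

# Left join keeps every order; orders absent from Returns get NaN -> "No"
merged = orders.merge(returns, on="Order ID", how="left")
merged["Returned"] = merged["Returned"].fillna("No")
print(merged["Returned"].tolist())  # ['No', 'No', 'Yes']
```

The left join (rather than inner) is what preserves non-returned orders, which is why the missing `Returned` values can safely be read as "No".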
## Key Insights

- December consistently shows the highest profit across years, pointing to seasonal sales spikes.
- The West region leads in profitability, while the **South region** underperforms comparatively.
- Technology products have the highest return rate, suggesting a potential quality or customer-expectation gap.
- Shipping is mostly completed within **2–3 days**, indicating operational efficiency.
- The Consumer segment drives volume, while Corporate customers yield higher profit margins.

## EDA Process

- Data loaded from Excel (Orders & Returns sheets)
- Cleaning and type conversions performed using `pandas`
- Feature engineering:
  - `Returned_flag` (binary return indicator)
  - `Shipping Duration` (days between Order Date and Ship Date)
  - `Month` and `Year` (for temporal grouping)
- Outlier detection using the IQR method
- Aggregations via `groupby()` for trends and comparisons

## Visualizations

Visuals created using `matplotlib` and `seaborn`:

- **Line plots**: Profit trends over time
- **Bar charts**: Region-wise and segment-wise performance
- **Pie charts**: Return distribution by category
- **Histograms**: Shipping durations
- **Subplots**: Multi-metric comparisons (sales, profit, quantity)

---

## Tools & Skills Demonstrated

- Data Cleaning & Transformation
- Feature Engineering
- Business-Oriented EDA
- Data Visualization & Storytelling
- Analytical Thinking

---

## References

- Sample Dataset: [Tableau Public – Sample Data](https://public.tableau.com/app/learn/sample-data)
- [Pandas Documentation](https://pandas.pydata.org/)
- [Seaborn Documentation](https://seaborn.pydata.org/)
- [Matplotlib Documentation](https://matplotlib.org/)
- [Python Official Docs](https://docs.python.org/3/)
> ✨ *This project was built as part of my learning journey in data analytics, combining business intuition with analytical depth.*
--------------------------------------------------------------------------------
/sampleStore1.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# Load the Orders and Returns sheets from the workbook
file_path = "C:\\Users\\USER\\Downloads\\py\\sample_-_superstore.xlsx"
df = pd.read_excel(file_path, sheet_name="Orders")
return_df = pd.read_excel(file_path, sheet_name="Returns")

print(df.head())
print(df.info())

# Cleaning: check missing values and duplicate Order IDs
print(df.isnull().sum())
print(df["Order ID"].duplicated().sum())
df.drop_duplicates(subset="Order ID", keep="first", inplace=True)  # keep one row per Order ID
print(df["Order ID"].duplicated().sum())

df["Order Date"] = pd.to_datetime(df["Order Date"])
df["Ship Date"] = pd.to_datetime(df["Ship Date"])

print(df.columns)
numeric = ["Quantity", "Discount", "Profit", "Sales"]
df[numeric] = df[numeric].apply(pd.to_numeric)

# Restrict to orders with non-negative profit
df1 = df[df["Profit"] >= 0].copy()
print(df1)

# Merge the two sheets; a left join keeps every order, matched or not
df1 = pd.merge(df1, return_df, on="Order ID", how="left")
print(df1)
# Orders with no match in the Returns sheet were not returned
df1["Returned"] = df1["Returned"].fillna("No")
print(df1)

print(df1.columns)

# Outlier detection and removal on Profit using the IQR rule
Q1 = df1["Profit"].quantile(0.25)
Q3 = df1["Profit"].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df_no_outlier = df1[(df1["Profit"] >= lower_bound) & (df1["Profit"] <= upper_bound)].copy()
print(df_no_outlier)

print(df_no_outlier.describe())

# Objectives

# 1. Examine monthly and yearly profit patterns to identify peak periods and seasonal trends.
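# Aside: grouping by month *name* (as done just below) sums profit across all
# years, which is what we want for seasonality but hides year-over-year
# movement.  A combined year-month view is possible with dt.to_period("M");
# tiny illustration on toy data (the values are hypothetical, not from this
# dataset):
toy = pd.DataFrame({
    "Order Date": pd.to_datetime(["2016-12-05", "2017-01-03", "2017-12-20"]),
    "Profit": [100.0, 30.0, 150.0],
})
toy_monthly = toy.groupby(toy["Order Date"].dt.to_period("M"))["Profit"].sum()
print(toy_monthly)  # one row per (year, month): 2016-12, 2017-01, 2017-12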
df_no_outlier["Year"] = df_no_outlier["Order Date"].dt.year
df_no_outlier["Month"] = df_no_outlier["Order Date"].dt.month_name()

month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

print(df_no_outlier)
# Reindex so the months plot in calendar order rather than alphabetically
gb_month = df_no_outlier.groupby("Month")["Profit"].sum().reindex(month_order).reset_index()
gb_year = df_no_outlier.groupby("Year")["Profit"].sum().reset_index()

plt.figure(figsize=(10, 6))
sns.lineplot(x="Month", y="Profit", data=gb_month, color="m")
plt.xticks(rotation=45)
plt.title("Profit Trend Over Months")
plt.grid(True)
plt.show(block=False)

plt.figure(figsize=(6, 6))
sns.lineplot(x="Year", y="Profit", data=gb_year, color="red")
plt.title("Profit Trend Over Years")
plt.grid(True)
plt.show()


# 2. Average profit across regions to identify high- and low-performing areas.
region = df_no_outlier.groupby("Region")["Profit"].mean()
br = sns.barplot(x=region.index, y=region.values, palette="YlGnBu")
plt.title("Average Profit Across Regions")
# Label every bar; iterating over containers works whether seaborn puts all
# bars in one container or spreads them across several
for container in br.containers:
    br.bar_label(container, fmt="%.2f")
plt.ylabel("Average Profit")
plt.show()

# 3. Evaluate the performance of customer segments (Consumer, Corporate, Home Office) by sales, quantity, and profit.
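# Aside: instead of three separate subplots (as below), the three metrics can
# be reshaped to long form with DataFrame.melt and drawn in one faceted call,
# e.g. sns.catplot(..., col="variable", kind="bar").  Tiny illustration of the
# reshape on toy data (segment totals here are hypothetical):
toy_seg = pd.DataFrame({
    "Segment": ["Consumer", "Corporate"],
    "Sales": [500.0, 300.0],
    "Profit": [50.0, 40.0],
    "Quantity": [20, 10],
})
toy_long = toy_seg.melt(id_vars="Segment",
                        value_vars=["Sales", "Profit", "Quantity"],
                        var_name="Metric", value_name="Value")
print(toy_long)  # 6 rows: 2 segments x 3 metrics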
segment = df_no_outlier.groupby("Segment")[["Sales", "Profit", "Quantity"]].sum().reset_index()

plt.figure(figsize=(15, 7))
plt.subplot(1, 3, 1)
sns.barplot(x="Segment", y="Sales", data=segment, palette='dark:orange')
plt.grid(False)
plt.xticks(rotation=45)
plt.title("Segment by Sales")

plt.subplot(1, 3, 2)
sns.barplot(x="Segment", y="Quantity", data=segment, palette='dark:purple')
plt.grid(False)
plt.xticks(rotation=45)
plt.title("Segment by Quantity")

plt.subplot(1, 3, 3)
sns.barplot(x="Segment", y="Profit", data=segment, palette='dark:blue')
plt.grid(False)
plt.xticks(rotation=45)
plt.title("Segment by Profit")

plt.suptitle("Segment-wise Customer Behavior")
plt.show()


# 4. Return frequency by category to assess business impact.

# Map "Yes"/"No" to a 1/0 flag
df_no_outlier["returned_flag"] = (df_no_outlier["Returned"] == "Yes").astype(int)
print(df_no_outlier)

# Note: the pie shows raw return *counts* per category; dividing each count by
# that category's total orders would give a return rate instead
returnbase = df_no_outlier.groupby("Category")["returned_flag"].sum()
plt.pie(returnbase.values, labels=returnbase.index, autopct="%.2f%%",
        colors=["#A1C9F1", "#FFB5E8", "#B5EAD7"])
plt.legend()
plt.title("Distribution of Returned Orders by Category")
plt.show()

# 5. Analyze shipping time to understand delivery performance and customer experience.
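# Aside: dt.days (used just below) counts calendar days.  If delivery targets
# are quoted in business days, numpy's busday_count can be used instead; tiny
# illustration on hypothetical dates (Fri 2017-01-06 to Mon 2017-01-09):
ship_calendar = np.datetime64("2017-01-09") - np.datetime64("2017-01-06")
ship_business = np.busday_count("2017-01-06", "2017-01-09")  # Fri counted, Sat/Sun skipped
print(ship_calendar, ship_business)  # 3 calendar days vs 1 business day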
df_no_outlier["shipment_Duration"] = (df_no_outlier["Ship Date"] - df_no_outlier["Order Date"]).dt.days
sns.histplot(x="shipment_Duration", data=df_no_outlier, bins=5, kde=True, color="#2ECC71")
plt.grid(True)
plt.title("Distribution of Shipping Duration")
plt.show()


# Correlation heatmap across the numeric columns
corr = df_no_outlier.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()

# Export the cleaned, feature-engineered data
df_no_outlier.to_excel("clean-data.xlsx", index=False)
--------------------------------------------------------------------------------