├── clean-data.xlsx
├── sample_-_superstore.xlsx
├── README.md
└── sampleStore1.py

/clean-data.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/raykundan655/Retail-Analysis-/HEAD/clean-data.xlsx
--------------------------------------------------------------------------------
/sample_-_superstore.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/raykundan655/Retail-Analysis-/HEAD/sample_-_superstore.xlsx
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Retail-Analysis-

This project provides business insight into a fictional superstore's operations through deep exploratory data analysis. Visual and statistical techniques were applied to discover patterns in regional performance, customer behavior, and operational logistics.

# Superstore Retail Data Analysis

A complete exploratory data analysis (EDA) project conducted on the Sample Superstore dataset, originally sourced from https://public.tableau.com/app/learn/sample-data. This analysis simulates a real-world retail business case study, with the goal of extracting actionable insights through structured data preparation, visual exploration, and statistical evaluation.

## 📌 Objectives

This project is centered around five business-focused objectives:

1. Analyze monthly and yearly profit trends to understand seasonality.
2. Evaluate region-wise profitability to identify high- and low-performing markets.
3. Assess product return frequency by category to gauge product satisfaction.
4. Examine shipping duration patterns to evaluate delivery performance.
5. Analyze customer segments (Consumer, Corporate, Home Office) based on sales, quantity, and profit.
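Objectives 3–5 rest on joining the Orders sheet with the Returns sheet so every order carries a return indicator. A minimal sketch of that join, on toy stand-in frames (the names and values below are illustrative, not from the dataset):

```python
import pandas as pd

# Toy stand-ins for the Orders and Returns sheets (illustrative only)
orders = pd.DataFrame({
    "Order ID": ["A-1", "A-2", "A-3"],
    "Category": ["Technology", "Furniture", "Technology"],
})
returns = pd.DataFrame({"Order ID": ["A-3"], "Returned": ["Yes"]})

# Left join keeps every order; orders absent from Returns get NaN -> "No"
merged = orders.merge(returns, on="Order ID", how="left")
merged["Returned"] = merged["Returned"].fillna("No")
print(merged["Returned"].tolist())  # ['No', 'No', 'Yes']
```

The left join (rather than inner) is what preserves non-returned orders, which is why the missing `Returned` values can safely be read as "No".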
## Key Insights

- December consistently shows the highest profit across years, pointing to seasonal sales spikes.
- The West region leads in profitability, while the **South region** underperforms comparatively.
- Technology products have the highest return rate, suggesting a potential quality or customer-expectation gap.
- Shipping is mostly completed within **2–3 days**, indicating operational efficiency.
- The Consumer segment drives volume, while Corporate customers yield higher profit margins.

## EDA Process

- Data loaded from Excel (Orders & Returns sheets)
- Cleaning and type conversions performed using `pandas`
- Feature engineering:
  - `Returned_flag` (binary return indicator)
  - `Shipping Duration` (days between Order Date and Ship Date)
  - `Month` and `Year` (for temporal grouping)
- Outlier detection using the IQR method
- Aggregations via `groupby()` for trends and comparisons

## Visualizations

Visuals created using `matplotlib` and `seaborn`:

- **Line plots**: Profit trends over time
- **Bar charts**: Region-wise and segment-wise performance
- **Pie charts**: Return distribution by category
- **Histograms**: Shipping durations
- **Subplots**: Multi-metric comparisons (sales, profit, quantity)

---

## Tools & Skills Demonstrated

- Data Cleaning & Transformation
- Feature Engineering
- Business-Oriented EDA
- Data Visualization & Storytelling
- Analytical Thinking

---

## References

- Sample Dataset: [Tableau Public – Sample Data](https://public.tableau.com/app/learn/sample-data)
- [Pandas Documentation](https://pandas.pydata.org/)
- [Seaborn Documentation](https://seaborn.pydata.org/)
- [Matplotlib Documentation](https://matplotlib.org/)
- [Python Official Docs](https://docs.python.org/3/)
> ✨ *This project was built as part of my learning journey in data analytics, combining business intuition with analytical depth.*
--------------------------------------------------------------------------------
/sampleStore1.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# Load the Orders and Returns sheets from the workbook
file_path = "C:\\Users\\USER\\Downloads\\py\\sample_-_superstore.xlsx"
df = pd.read_excel(file_path, sheet_name="Orders")
return_df = pd.read_excel(file_path, sheet_name="Returns")

print(df.head())
print(df.info())

# Cleaning: check missing values and duplicate Order IDs
print(df.isnull().sum())
print(df["Order ID"].duplicated().sum())
df.drop_duplicates(subset="Order ID", keep="first", inplace=True)  # keep one row per Order ID
print(df["Order ID"].duplicated().sum())

df["Order Date"] = pd.to_datetime(df["Order Date"])
df["Ship Date"] = pd.to_datetime(df["Ship Date"])

print(df.columns)
numeric = ["Quantity", "Discount", "Profit", "Sales"]
df[numeric] = df[numeric].apply(pd.to_numeric)

# Restrict to orders with non-negative profit
df1 = df[df["Profit"] >= 0].copy()
print(df1)

# Merge the two sheets; a left join keeps every order, matched or not
df1 = pd.merge(df1, return_df, on="Order ID", how="left")
print(df1)
# Orders with no match in the Returns sheet were not returned
df1["Returned"] = df1["Returned"].fillna("No")
print(df1)

print(df1.columns)

# Outlier detection and removal on Profit using the IQR rule
Q1 = df1["Profit"].quantile(0.25)
Q3 = df1["Profit"].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df_no_outlier = df1[(df1["Profit"] >= lower_bound) & (df1["Profit"] <= upper_bound)].copy()
print(df_no_outlier)

print(df_no_outlier.describe())

# Objectives

# 1. Examine monthly and yearly profit patterns to identify peak periods and seasonal trends.
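# Aside: grouping by month *name* (as done just below) sums profit across all
# years, which is what we want for seasonality but hides year-over-year
# movement.  A combined year-month view is possible with dt.to_period("M");
# tiny illustration on toy data (the values are hypothetical, not from this
# dataset):
toy = pd.DataFrame({
    "Order Date": pd.to_datetime(["2016-12-05", "2017-01-03", "2017-12-20"]),
    "Profit": [100.0, 30.0, 150.0],
})
toy_monthly = toy.groupby(toy["Order Date"].dt.to_period("M"))["Profit"].sum()
print(toy_monthly)  # one row per (year, month): 2016-12, 2017-01, 2017-12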
df_no_outlier["Year"] = df_no_outlier["Order Date"].dt.year
df_no_outlier["Month"] = df_no_outlier["Order Date"].dt.month_name()

month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

print(df_no_outlier)
# Reindex so the months plot in calendar order rather than alphabetically
gb_month = df_no_outlier.groupby("Month")["Profit"].sum().reindex(month_order).reset_index()
gb_year = df_no_outlier.groupby("Year")["Profit"].sum().reset_index()

plt.figure(figsize=(10, 6))
sns.lineplot(x="Month", y="Profit", data=gb_month, color="m")
plt.xticks(rotation=45)
plt.title("Profit Trend Over Months")
plt.grid(True)
plt.show(block=False)

plt.figure(figsize=(6, 6))
sns.lineplot(x="Year", y="Profit", data=gb_year, color="red")
plt.title("Profit Trend Over Years")
plt.grid(True)
plt.show()


# 2. Average profit across regions to identify high- and low-performing areas.
region = df_no_outlier.groupby("Region")["Profit"].mean()
br = sns.barplot(x=region.index, y=region.values, palette="YlGnBu")
plt.title("Average Profit Across Regions")
# Label every bar; iterating over containers works whether seaborn puts all
# bars in one container or spreads them across several
for container in br.containers:
    br.bar_label(container, fmt="%.2f")
plt.ylabel("Average Profit")
plt.show()

# 3. Evaluate the performance of customer segments (Consumer, Corporate, Home Office) by sales, quantity, and profit.
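# Aside: instead of three separate subplots (as below), the three metrics can
# be reshaped to long form with DataFrame.melt and drawn in one faceted call,
# e.g. sns.catplot(..., col="variable", kind="bar").  Tiny illustration of the
# reshape on toy data (segment totals here are hypothetical):
toy_seg = pd.DataFrame({
    "Segment": ["Consumer", "Corporate"],
    "Sales": [500.0, 300.0],
    "Profit": [50.0, 40.0],
    "Quantity": [20, 10],
})
toy_long = toy_seg.melt(id_vars="Segment",
                        value_vars=["Sales", "Profit", "Quantity"],
                        var_name="Metric", value_name="Value")
print(toy_long)  # 6 rows: 2 segments x 3 metrics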
segment = df_no_outlier.groupby("Segment")[["Sales", "Profit", "Quantity"]].sum().reset_index()

plt.figure(figsize=(15, 7))
plt.subplot(1, 3, 1)
sns.barplot(x="Segment", y="Sales", data=segment, palette='dark:orange')
plt.grid(False)
plt.xticks(rotation=45)
plt.title("Segment by Sales")

plt.subplot(1, 3, 2)
sns.barplot(x="Segment", y="Quantity", data=segment, palette='dark:purple')
plt.grid(False)
plt.xticks(rotation=45)
plt.title("Segment by Quantity")

plt.subplot(1, 3, 3)
sns.barplot(x="Segment", y="Profit", data=segment, palette='dark:blue')
plt.grid(False)
plt.xticks(rotation=45)
plt.title("Segment by Profit")

plt.suptitle("Segment-wise Customer Behavior")
plt.show()


# 4. Return frequency by category to assess business impact.

# Map "Yes"/"No" to a 1/0 flag
df_no_outlier["returned_flag"] = (df_no_outlier["Returned"] == "Yes").astype(int)
print(df_no_outlier)

# Note: the pie shows raw return *counts* per category; dividing each count by
# that category's total orders would give a return rate instead
returnbase = df_no_outlier.groupby("Category")["returned_flag"].sum()
plt.pie(returnbase.values, labels=returnbase.index, autopct="%.2f%%",
        colors=["#A1C9F1", "#FFB5E8", "#B5EAD7"])
plt.legend()
plt.title("Distribution of Returned Orders by Category")
plt.show()

# 5. Analyze shipping time to understand delivery performance and customer experience.
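# Aside: dt.days (used just below) counts calendar days.  If delivery targets
# are quoted in business days, numpy's busday_count can be used instead; tiny
# illustration on hypothetical dates (Fri 2017-01-06 to Mon 2017-01-09):
ship_calendar = np.datetime64("2017-01-09") - np.datetime64("2017-01-06")
ship_business = np.busday_count("2017-01-06", "2017-01-09")  # Fri counted, Sat/Sun skipped
print(ship_calendar, ship_business)  # 3 calendar days vs 1 business day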
df_no_outlier["shipment_Duration"] = (df_no_outlier["Ship Date"] - df_no_outlier["Order Date"]).dt.days
sns.histplot(x="shipment_Duration", data=df_no_outlier, bins=5, kde=True, color="#2ECC71")
plt.grid(True)
plt.title("Distribution of Shipping Duration")
plt.show()


# Correlation heatmap across the numeric columns
corr = df_no_outlier.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()

# Export the cleaned, feature-engineered data
df_no_outlier.to_excel("clean-data.xlsx", index=False)
--------------------------------------------------------------------------------