├── output_data.xlsx
├── sample_-_superstore.xlsx
├── README.md
└── pyNew.py

/output_data.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Priya-611/Retail-Analysis/HEAD/output_data.xlsx
--------------------------------------------------------------------------------
/sample_-_superstore.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Priya-611/Retail-Analysis/HEAD/sample_-_superstore.xlsx
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Retail-Analysis
This repository presents an end-to-end data analysis project that simulates a real-world business intelligence workflow. Using historical sales and returns data, the analysis identifies key performance drivers across regions, categories, and customer segments, offering actionable insights for business decision-making.

# Data Cleaning and Preprocessing
Reading Data: Loaded the Orders and Returns sheets with pandas.read_excel().

Initial Checks: Printed head(), info(), isnull().sum(), and duplicate counts to assess data quality.

Duplicate Removal: Dropped duplicate Order IDs so each transaction appears only once.

Date Conversion: Converted the Order Date and Ship Date columns to datetime format.

Numeric Conversion: Ensured that Quantity, Discount, Profit, and Sales are treated as numeric columns.

Profit Filter: Removed records with negative profit to focus on profitable sales.

Merging Returns Data: Merged the returns information into the main dataset on Order ID and filled missing values with "No".

Outlier Removal: Applied the IQR method to filter out outliers in the Profit column.

# Objectives
# Objective 1: Trend Analysis (Monthly & Yearly Profit Trends)
Purpose: Understand seasonal fluctuations and year-over-year performance.

Method: Extracted Year and Month from Order Date, grouped by each, and plotted the totals with seaborn.lineplot.

Insights: Helps identify peak profit months and annual growth trends.

# Objective 2: Regional Performance (Average Profit per Order)
Purpose: Identify high- and low-performing regions.

Method: Grouped by Region, calculated the mean Profit, and visualized it as an annotated bar plot.

Insights: Highlights which regions are underperforming or excelling, aiding strategic focus.

# Objective 3: Returns Analysis by Category
Purpose: Assess which product categories face higher return rates.

Method: Created a Returned_flag column with .apply(), grouped by Category, and plotted a pie chart.

Insights: Understanding return behavior helps refine product quality and customer satisfaction strategies.

# Objective 4: Shipping Duration Analysis
Purpose: Evaluate delivery performance.

Method: Calculated the shipping duration in days and visualized it with histplot and a KDE curve.

Insights: Identifies delivery time trends and potential delays affecting customer experience.
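For reference, a minimal sketch of this step (assuming the Orders sheet is already loaded into a DataFrame `df` with parsed dates, as in pyNew.py):

```python
# Shipping duration in whole days, plus summary statistics for a quick read on delivery performance
df["Shipping Duration"] = (df["Ship Date"] - df["Order Date"]).dt.days
print(df["Shipping Duration"].describe())
```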
# Objective 5: Customer Segment Evaluation
Purpose: Analyze the contribution of each customer segment (Consumer, Corporate, Home Office) in terms of:

Sales

Quantity Ordered

Profit

Method: Grouped by Segment and plotted a bar chart for each metric.

Insights: Useful for targeted marketing and for understanding profitability per segment.

# Heatmap: Correlation Between Variables
Purpose: Visualize relationships between numerical features (Sales, Profit, Discount, etc.).

Method: Used sns.heatmap() with annot=True so the correlation values are shown on the plot.

Insights: Helps identify potential multicollinearity and the key variables influencing performance.

# Final Output
The cleaned and processed dataset is exported to an Excel file (output_data.xlsx) for further use.
--------------------------------------------------------------------------------
/pyNew.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

file_path = "C:\\Users\\HP\\OneDrive\\Documents\\Python programming(sem4)\\python files\\sample_-_superstore.xlsx"
df = pd.read_excel(file_path, sheet_name="Orders")
return_df = pd.read_excel(file_path, sheet_name="Returns")

# Initial quality checks
print(df.head())
print(df.info())
print(df.isnull().sum())
print(df["Order ID"].duplicated().sum())


# CLEANING AND PREPROCESSING

df.drop_duplicates(subset="Order ID", keep="first", inplace=True)
print(df["Order ID"].duplicated().sum())

df["Order Date"] = pd.to_datetime(df["Order Date"])
df["Ship Date"] = pd.to_datetime(df["Ship Date"])

print(df.columns)
numeric = ["Quantity", "Discount", "Profit", "Sales"]
df[numeric] = df[numeric].apply(pd.to_numeric)

# Keep only profitable orders
df1 = df[df["Profit"] >= 0].copy()
print(df1)

# Merge the Returns sheet into the Orders data
df1 = pd.merge(df1, return_df, on="Order ID", how="left")
print(df1)
df1["Returned"] = df1["Returned"].fillna("No")
print(df1)
print(df1.columns)

# Removing outliers in Profit with the IQR rule
Q1 = df1["Profit"].quantile(0.25)
Q3 = df1["Profit"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df_no_outliers = df1[(df1["Profit"] >= lower_bound) & (df1["Profit"] <= upper_bound)].copy()
print(df_no_outliers)

print(df_no_outliers.info())
print(df_no_outliers.describe())
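
# Optional sanity checks (a small sketch beyond the original steps): confirm the merge
# left only "Yes"/"No" in the Returned column and see how many rows the IQR filter removed.
print(df1["Returned"].value_counts())
print(f"Rows before outlier filter: {len(df1)}, after: {len(df_no_outliers)}")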

# EDA

# OBJECTIVES:
# 1. Examine monthly and yearly profit patterns to identify peak periods and seasonal trends.

df_no_outliers["Year"] = df_no_outliers["Order Date"].dt.year
df_no_outliers["Month"] = df_no_outliers["Order Date"].dt.month_name()
# print(df_no_outliers)

month_order = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]

# Total profit per calendar month, in calendar order
gb_m = df_no_outliers.groupby("Month")["Profit"].sum().reindex(month_order).reset_index()
plt.figure(figsize=(10, 6))
sns.lineplot(x="Month", y="Profit", data=gb_m, color='m')
plt.xticks(rotation=45)
plt.grid(True)
plt.title('Profit Trend by Month')
plt.show(block=False)

# Total profit per year
gb_y = df_no_outliers.groupby("Year")["Profit"].sum().reset_index()
plt.figure(figsize=(10, 6))
sns.lineplot(x="Year", y="Profit", data=gb_y, color='r')
plt.grid(True)
plt.title('Profit Trend by Year')
plt.show()

# 2. Average profit per order across regions to identify high- and low-performing areas.

region_p = df_no_outliers.groupby("Region")["Profit"].mean().reset_index()
br = sns.barplot(x="Region", y="Profit", data=region_p, hue="Region", palette='dark:pink', legend=False)
for container in br.containers:
    br.bar_label(container, fmt='%.2f')
plt.title('Average Profit per Order by Region')
plt.grid(False)
plt.show()

# 3. Return frequency by category to assess business impact.

def flag_returned(val):
    return 1 if val == "Yes" else 0

df_no_outliers["Returned_flag"] = df_no_outliers["Returned"].apply(flag_returned)
df_ret = df_no_outliers.groupby("Category")["Returned_flag"].sum()
plt.pie(df_ret.values, labels=df_ret.index, autopct="%.2f%%", colors=["#66c2a5", "#fc8d62", "#8da0cb"], explode=[0, 0, 0.08], pctdistance=0.5)
plt.title("Distribution of Returned Orders by Category")
plt.legend(loc="upper right", bbox_to_anchor=(1.25, 1))
plt.show()

# 4. Shipping time analysis to understand delivery performance and customer experience.

df_no_outliers["Shipping Duration"] = (df_no_outliers["Ship Date"] - df_no_outliers["Order Date"]).dt.days
sns.histplot(x="Shipping Duration", data=df_no_outliers, bins=8, kde=True, color='#BA55D3')

plt.title("Distribution of Shipping Duration")
plt.xlabel("Shipping Duration (days)")
plt.ylabel("Number of Orders")
plt.show()
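
# Complementary summary (a sketch; "Ship Mode" is a standard column in the Superstore
# Orders sheet and is assumed to be present here): average shipping duration per ship mode.
print(df_no_outliers.groupby("Ship Mode")["Shipping Duration"].mean().round(2))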

# 5. Evaluate customer segments (Consumer, Corporate, Home Office) by sales, quantity, and profit.

df_seg = df_no_outliers.groupby("Segment")[["Sales", "Profit", "Quantity"]].sum().reset_index()
plt.figure(figsize=(16, 6))

plt.subplot(1, 3, 1)
sns.barplot(x="Segment", y="Sales", data=df_seg, hue="Segment", palette='dark:yellow', legend=False)
plt.xticks(rotation=45)
plt.title("Segment by Sales")

plt.subplot(1, 3, 2)
sns.barplot(x="Segment", y="Quantity", data=df_seg, hue="Segment", palette='dark:orange', legend=False)
plt.xticks(rotation=45)
plt.title("Segment by Quantity")

plt.subplot(1, 3, 3)
sns.barplot(x="Segment", y="Profit", data=df_seg, hue="Segment", palette='dark:purple', legend=False)
plt.xticks(rotation=45)
plt.title("Segment by Profit")

plt.suptitle("Segment-Wise Customer Behaviour")
plt.tight_layout()
plt.show()


# Heat map of correlations between the numeric columns
sns.heatmap(df_no_outliers.corr(numeric_only=True), annot=True)
plt.title("Relationship Between Sales Variables")
plt.show()
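
# Final output described in the README: export the cleaned dataset to output_data.xlsx
# (a minimal sketch; file name taken from the README, written next to the script by default,
# and requiring an Excel writer such as openpyxl).
df_no_outliers.to_excel("output_data.xlsx", index=False)
--------------------------------------------------------------------------------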