├── output_data.xlsx
├── sample_-_superstore.xlsx
├── README.md
└── pyNew.py

/output_data.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Priya-611/Retail-Analysis/HEAD/output_data.xlsx
--------------------------------------------------------------------------------
/sample_-_superstore.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Priya-611/Retail-Analysis/HEAD/sample_-_superstore.xlsx
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Retail-Analysis
This repository presents an end-to-end data analysis project that simulates a real-world business intelligence workflow. Using historical sales and returns data, the analysis identifies key performance drivers across regions, categories, and customer segments, offering actionable insights for business decision-making.

# Data Cleaning and Preprocessing
Reading Data: Loaded the Orders and Returns sheets with pandas.read_excel().

Initial Checks: Printed head(), info(), isnull().sum(), and duplicate counts to assess data quality.

Duplicate Removal: Dropped duplicate Order IDs so each transaction appears only once.

Date Conversion: Converted the Order Date and Ship Date columns to datetime format.

Numeric Conversion: Ensured that Quantity, Discount, Profit, and Sales are treated as numeric columns.

Profit Filter: Removed records with negative profit to focus on profitable sales.

Merging Returns Data: Merged the returns information into the main dataset on Order ID and filled missing values with "No".

Outlier Removal: Applied the IQR method to filter out outliers in the Profit column.

# Objectives
# Objective 1: Trend Analysis (Monthly & Yearly Profit Trends)
Purpose: Understand seasonal fluctuations and year-over-year performance.

Method: Extracted Year and Month from Order Date, grouped by each, and plotted the totals with seaborn.lineplot.

Insights: Helps identify peak profit months and annual growth trends.

# Objective 2: Regional Performance (Average Profit per Order)
Purpose: Identify high- and low-performing regions.

Method: Grouped by Region, calculated the mean Profit, and visualized it as an annotated bar plot.

Insights: Highlights which regions are underperforming or excelling, aiding strategic focus.

# Objective 3: Returns Analysis by Category
Purpose: Assess which product categories face higher return rates.

Method: Created a Returned_flag column with .apply(), grouped by Category, and plotted a pie chart.

Insights: Understanding return behavior helps refine product quality and customer satisfaction strategies.

# Objective 4: Shipping Duration Analysis
Purpose: Evaluate delivery performance.

Method: Calculated the shipping duration in days and visualized it with histplot and a KDE curve.

Insights: Identifies delivery time trends and potential delays affecting customer experience.
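For reference, a minimal sketch of this step (assuming the Orders sheet is already loaded into a DataFrame `df` with parsed dates, as in pyNew.py):

```python
# Shipping duration in whole days, plus summary statistics for a quick read on delivery performance
df["Shipping Duration"] = (df["Ship Date"] - df["Order Date"]).dt.days
print(df["Shipping Duration"].describe())
```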
# Objective 5: Customer Segment Evaluation
Purpose: Analyze the contribution of each customer segment (Consumer, Corporate, Home Office) in terms of:

Sales

Quantity Ordered

Profit

Method: Grouped by Segment and plotted a bar chart for each metric.

Insights: Useful for targeted marketing and for understanding profitability per segment.

# Heatmap: Correlation Between Variables
Purpose: Visualize relationships between numerical features (Sales, Profit, Discount, etc.).

Method: Used sns.heatmap() with annot=True so the correlation values are shown on the plot.

Insights: Helps identify potential multicollinearity and the key variables influencing performance.

# Final Output
The cleaned and processed dataset is exported to an Excel file (output_data.xlsx) for further use.
--------------------------------------------------------------------------------
/pyNew.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

file_path = "C:\\Users\\HP\\OneDrive\\Documents\\Python programming(sem4)\\python files\\sample_-_superstore.xlsx"
df = pd.read_excel(file_path, sheet_name="Orders")
return_df = pd.read_excel(file_path, sheet_name="Returns")

# Initial quality checks
print(df.head())
print(df.info())
print(df.isnull().sum())
print(df["Order ID"].duplicated().sum())


# CLEANING AND PREPROCESSING

df.drop_duplicates(subset="Order ID", keep="first", inplace=True)
print(df["Order ID"].duplicated().sum())

df["Order Date"] = pd.to_datetime(df["Order Date"])
df["Ship Date"] = pd.to_datetime(df["Ship Date"])

print(df.columns)
numeric = ["Quantity", "Discount", "Profit", "Sales"]
df[numeric] = df[numeric].apply(pd.to_numeric)

# Keep only profitable orders
df1 = df[df["Profit"] >= 0].copy()
print(df1)

# Merge the Returns sheet into the Orders data
df1 = pd.merge(df1, return_df, on="Order ID", how="left")
print(df1)
df1["Returned"] = df1["Returned"].fillna("No")
print(df1)
print(df1.columns)

# Removing outliers in Profit with the IQR rule
Q1 = df1["Profit"].quantile(0.25)
Q3 = df1["Profit"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df_no_outliers = df1[(df1["Profit"] >= lower_bound) & (df1["Profit"] <= upper_bound)].copy()
print(df_no_outliers)

print(df_no_outliers.info())
print(df_no_outliers.describe())
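
# Optional sanity checks (a small sketch beyond the original steps): confirm the merge
# left only "Yes"/"No" in the Returned column and see how many rows the IQR filter removed.
print(df1["Returned"].value_counts())
print(f"Rows before outlier filter: {len(df1)}, after: {len(df_no_outliers)}")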

# EDA

# OBJECTIVES:
# 1. Examine monthly and yearly profit patterns to identify peak periods and seasonal trends.

df_no_outliers["Year"] = df_no_outliers["Order Date"].dt.year
df_no_outliers["Month"] = df_no_outliers["Order Date"].dt.month_name()
# print(df_no_outliers)

month_order = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]

# Total profit per calendar month, in calendar order
gb_m = df_no_outliers.groupby("Month")["Profit"].sum().reindex(month_order).reset_index()
plt.figure(figsize=(10, 6))
sns.lineplot(x="Month", y="Profit", data=gb_m, color='m')
plt.xticks(rotation=45)
plt.grid(True)
plt.title('Profit Trend by Month')
plt.show(block=False)

# Total profit per year
gb_y = df_no_outliers.groupby("Year")["Profit"].sum().reset_index()
plt.figure(figsize=(10, 6))
sns.lineplot(x="Year", y="Profit", data=gb_y, color='r')
plt.grid(True)
plt.title('Profit Trend by Year')
plt.show()

# 2. Average profit per order across regions to identify high- and low-performing areas.

region_p = df_no_outliers.groupby("Region")["Profit"].mean().reset_index()
br = sns.barplot(x="Region", y="Profit", data=region_p, hue="Region", palette='dark:pink', legend=False)
for container in br.containers:
    br.bar_label(container, fmt='%.2f')
plt.title('Average Profit per Order by Region')
plt.grid(False)
plt.show()

# 3. Return frequency by category to assess business impact.

def flag_returned(val):
    return 1 if val == "Yes" else 0

df_no_outliers["Returned_flag"] = df_no_outliers["Returned"].apply(flag_returned)
df_ret = df_no_outliers.groupby("Category")["Returned_flag"].sum()
plt.pie(df_ret.values, labels=df_ret.index, autopct="%.2f%%", colors=["#66c2a5", "#fc8d62", "#8da0cb"], explode=[0, 0, 0.08], pctdistance=0.5)
plt.title("Distribution of Returned Orders by Category")
plt.legend(loc="upper right", bbox_to_anchor=(1.25, 1))
plt.show()

# 4. Shipping time analysis to understand delivery performance and customer experience.

df_no_outliers["Shipping Duration"] = (df_no_outliers["Ship Date"] - df_no_outliers["Order Date"]).dt.days
sns.histplot(x="Shipping Duration", data=df_no_outliers, bins=8, kde=True, color='#BA55D3')

plt.title("Distribution of Shipping Duration")
plt.xlabel("Shipping Duration (days)")
plt.ylabel("Number of Orders")
plt.show()
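
# Complementary summary (a sketch; "Ship Mode" is a standard column in the Superstore
# Orders sheet and is assumed to be present here): average shipping duration per ship mode.
print(df_no_outliers.groupby("Ship Mode")["Shipping Duration"].mean().round(2))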

# 5. Evaluate customer segments (Consumer, Corporate, Home Office) by sales, quantity, and profit.

df_seg = df_no_outliers.groupby("Segment")[["Sales", "Profit", "Quantity"]].sum().reset_index()
plt.figure(figsize=(16, 6))

plt.subplot(1, 3, 1)
sns.barplot(x="Segment", y="Sales", data=df_seg, hue="Segment", palette='dark:yellow', legend=False)
plt.xticks(rotation=45)
plt.title("Segment by Sales")

plt.subplot(1, 3, 2)
sns.barplot(x="Segment", y="Quantity", data=df_seg, hue="Segment", palette='dark:orange', legend=False)
plt.xticks(rotation=45)
plt.title("Segment by Quantity")

plt.subplot(1, 3, 3)
sns.barplot(x="Segment", y="Profit", data=df_seg, hue="Segment", palette='dark:purple', legend=False)
plt.xticks(rotation=45)
plt.title("Segment by Profit")

plt.suptitle("Segment-Wise Customer Behaviour")
plt.tight_layout()
plt.show()


# Heat map of correlations between the numeric columns
sns.heatmap(df_no_outliers.corr(numeric_only=True), annot=True)
plt.title("Relationship Between Sales Variables")
plt.show()
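
# Final output described in the README: export the cleaned dataset to output_data.xlsx
# (a minimal sketch; file name taken from the README, written next to the script by default,
# and requiring an Excel writer such as openpyxl).
df_no_outliers.to_excel("output_data.xlsx", index=False)
--------------------------------------------------------------------------------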