├── README.md
├── finaldata.xlsx
└── finalproject.py


/README.md:
--------------------------------------------------------------------------------
# GST Data Analysis

This project analyzes Goods and Services Tax (GST) data to understand registration patterns, state-wise trends, and payer behavior. The analysis uses Python with libraries like Pandas, Matplotlib, Seaborn, and Scipy.

## Project Structure

* `finaldata.xlsx`: The dataset used for the analysis.
* `finalproject.py`: The Python script containing the analysis code.
* `README.md`: This file, providing an overview of the project.

The image files below are optional. They exist only if the code is extended to save its figures; the provided script displays plots with `plt.show()` and writes no files.

* `eligible_payers_distribution.png`
* `compliance_by_return_type.png`
* `top_states_by_registered.png`
* `state_compliance.png`
* `jk_trends.png`
* `time_trends.png`
* `compliance_trend.png`
* `high_value_states.png`
* `correlation_matrix.png`
* `eligibility_vs_registrations.png`
* `state_registration_comparison.png`
* `payer_distribution_pie.png`
* `compliance_by_return_boxplot.png`
* `actual_vs_predicted.png`

## Data Description

The dataset (`finaldata.xlsx`) contains GST-related information, including:

* `srcStateName`: Name of the state.
* `srcYear`: Year of the data.
* `srcMonth`: Month of the data.
* `GST (Goods and Service Tax) Return Type`: Type of GST return.
* `Payer eligible for GST (Goods and Service Tax) registration`: Number of payers eligible for registration.
* `GST (Goods and Service Tax) Payers registered before due date`: Number of payers registered before the due date.
* `GST (Goods and Service Tax) Payers registered after due date`: Number of payers registered after the due date.
* `YearCode`: Numerical code for the year.
* `Year`: Year.
* `MonthCode`: Numerical code for the month.
* `Month`: Month name.

Note: `finalproject.py` refers to these columns with spaces inside the parentheses, for example `GST ( Goods and Service Tax ) Return Type`.

## Code Description

The provided Python script performs the following analysis:

1. **Data Loading and Exploration:**
    * Loads the data from `finaldata.xlsx` using Pandas.
    * Displays the first few rows, dataset dimensions, column information, missing values, and a statistical summary.
    * Fills missing values with 0.
    * Calculates `Total_Registered` as the sum of payers registered before and after the due date.

2. **State-wise Analysis:**
    * Visualizes the top 10 states/UTs by the number of eligible GST payers.
    * Provides a summary of regional variations in payer eligibility.
    * **Interactive Question:** Which state do you think has the highest number of GST payers, and why might that be the case?

3. **Time-based Trends:**
    * Visualizes GST payer eligibility and registrations over time, aggregated by `srcMonth`.
    * Summarizes fluctuations and trends in payer activity.
    * **Interactive Question:** Can you identify any seasonal patterns or significant changes in GST activity over the observed period? What factors might explain these trends?

4. **Return Type Analysis:**
    * Visualizes the total number of registered GST payers by return type.
    * Provides a table comparing eligible and registered payers for each return type.
    * Summarizes the contribution of different return types to overall registrations.
    * **Interactive Question:** What does the difference between GSTR-1 and GSTR-3 registrations tell us about the filers?

5. **Correlation Analysis:**
    * Calculates and visualizes the correlation between key GST metrics (eligible payers, registrations before/after due date, total registrations).
    * Summarizes the relationship between payer eligibility and registration numbers.
    * **Interactive Question:** How does the strong correlation between eligible payers and total registered payers influence decision-making?

6. **Registration Timing:**
    * Compares the number of GST payers registered before and after the due date.
    * Summarizes the timely compliance of GST payers.
    * **Interactive Question:** What are the possible reasons for the number of registrations after the due date?

## Key Findings

The analysis reveals the following key insights:

* **State-wise Variation:** States such as Maharashtra and Uttar Pradesh have the highest number of eligible GST payers, indicating significant regional differences.
* **Time-based Trends:** GST payer eligibility and registrations fluctuate over time, possibly influenced by seasonal or policy changes.
* **Return Type Contribution:** GSTR-3 filers contribute a larger share of total GST registrations than GSTR-1 filers.
* **Correlation:** A strong positive correlation exists between the number of eligible payers and the total number of registered payers.
* **Registration Timing:** Most GST payers register before the due date, indicating good compliance.
* The data is right-skewed.

## Visualizations

The analysis includes the following visualizations:

* Bar plots showing the top 10 states/UTs by eligible GST payers.
* Line plots showing GST payer trends over time.
* Bar plots showing total registered payers by return type.
* Heatmaps visualizing the correlation between GST metrics.
* Bar plots comparing before and after due date registrations.
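
As a rough illustration of the return-type and registration-timing steps described above, the sketch below computes total registrations and the share registered before the due date for each return type. It is a minimal example and not the project's script: the rows are invented, the helper `summarize_by_return_type` and the `On_Time_Share` column are hypothetical names, and the column names are assumed to match those used in `finalproject.py`.

```python
import pandas as pd

# Column names assumed to match those used in finalproject.py
# (note the spaces inside the parentheses).
ELIGIBLE = 'Payer eligible for GST ( Goods and Service Tax ) registration'
BEFORE = 'GST ( Goods and Service Tax ) Payers registered before due date'
AFTER = 'GST ( Goods and Service Tax ) Payers registered after due date'
RETURN_TYPE = 'GST ( Goods and Service Tax ) Return Type'

# Invented rows for illustration only; they are not taken from finaldata.xlsx.
sample = pd.DataFrame({
    'srcStateName': ['State A', 'State A', 'State B', 'State B'],
    RETURN_TYPE: ['GSTR-1', 'GSTR-3', 'GSTR-1', 'GSTR-3'],
    ELIGIBLE: [100, 200, 50, None],  # one missing value on purpose
    BEFORE: [60, 150, 30, 40],
    AFTER: [20, 30, 10, 10],
})


def summarize_by_return_type(df):
    """Hypothetical helper: totals and on-time share per return type."""
    df = df.fillna(0)  # same missing-value handling as the script
    df = df.assign(Total_Registered=df[BEFORE] + df[AFTER])
    cols = [ELIGIBLE, BEFORE, AFTER, 'Total_Registered']
    summary = df.groupby(RETURN_TYPE)[cols].sum()
    # Share of registrations that happened before the due date.
    summary['On_Time_Share'] = summary[BEFORE] / summary['Total_Registered']
    return summary


summary = summarize_by_return_type(sample)
print(summary)
```

A ratio like this makes return types of very different sizes directly comparable. If a return type has no registrations at all, the division yields a missing value, which may need separate handling.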

## Dependencies

The code requires the following Python libraries:

* Pandas
* NumPy
* Matplotlib
* Seaborn
* Scipy
* Statsmodels (not imported by the current script)
* openpyxl (used by Pandas to read `.xlsx` files)

## Further Exploration

* Explore the reasons behind the state-wise variations in GST payer registration.
* Investigate the factors influencing the time-based trends in GST activity.
* Analyze the characteristics of businesses that file GSTR-1 versus GSTR-3 returns.
* Build a predictive model for registration compliance.
* Examine the impact of policy changes on GST registration and compliance.

--------------------------------------------------------------------------------
/finaldata.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/parika04/GST-Data-Analysis-using-Python/18bc1b5bea19a3965e7055b541da15ba256efdfd/finaldata.xlsx


--------------------------------------------------------------------------------
/finalproject.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Column names used throughout the script (note the spaces inside the parentheses)
ELIGIBLE = 'Payer eligible for GST ( Goods and Service Tax ) registration'
BEFORE_DUE = 'GST ( Goods and Service Tax ) Payers registered before due date'
AFTER_DUE = 'GST ( Goods and Service Tax ) Payers registered after due date'
RETURN_TYPE = 'GST ( Goods and Service Tax ) Return Type'

# Load the data
data = pd.read_excel('finaldata.xlsx')

print("\nEXPLORATORY DATA ANALYSIS-----------------------------------------------------")
# Display the first few rows
print("First 5 rows of the dataset:")
print(data.head())

# Check the shape of the dataset
print("\nDataset dimensions (rows, columns):", data.shape)

# Check column names and data types
print("\nColumn information:")
data.info()  # info() prints its report itself and returns None, so it is not wrapped in print()

# Check for missing values
print("\nMissing values per column:")
print(data.isnull().sum())

# Basic statistical summary
print("\nStatistical summary:")
print(data.describe())

print("Column Names:\n", data.columns, "\n")

# Replace missing values with 0 before aggregating
data = data.fillna(0)

# Total registrations = registrations before the due date + registrations after it
data['Total_Registered'] = data[BEFORE_DUE] + data[AFTER_DUE]


# Objective 1: top 10 states/UTs by eligible payers -----------------------
plt.figure(figsize=(10, 6))
state_eligible = data.groupby('srcStateName')[ELIGIBLE].sum().sort_values(ascending=False).head(10)
sns.barplot(x=state_eligible.values, y=state_eligible.index, hue=state_eligible.index, palette='pastel', legend=False)
plt.title("Top 10 States/UTs by Eligible GST Payers")
plt.xlabel("Number of Eligible Payers")
plt.ylabel("State/UT")
plt.tight_layout()
plt.show()
print("\nSummary: The top states/UTs, like Maharashtra and Uttar Pradesh, have the highest number of eligible GST payers, showing strong regional variation. Jammu and Kashmir's data indicates moderate eligibility compared to larger states.")


# Objective 2: eligibility and registrations by month ---------------------
time_trends = data.groupby('srcMonth').agg({
    ELIGIBLE: 'sum',
    'Total_Registered': 'sum'
})
# DataFrame.plot creates its own figure, so the size is passed here;
# a separate plt.figure() call would leave an empty extra window.
time_trends.plot(kind='line', marker='o', figsize=(10, 6))
plt.title("GST Payers Over Time")
plt.xlabel("Month")
plt.ylabel("Number of Payers")
plt.tight_layout()
plt.show()
print("\nSummary: GST payer eligibility and registrations fluctuate over time, with peaks in certain months, suggesting seasonal or policy-driven trends.")


# Objective 3: registered payers by return type ---------------------------
plt.figure(figsize=(8, 6))
return_counts = data.groupby(RETURN_TYPE)['Total_Registered'].sum()
sns.barplot(x=return_counts.index, y=return_counts.values, hue=return_counts.index, palette='coolwarm', legend=False)
plt.title("Total Registered GST Payers by Return Type")
plt.xlabel("Return Type")
plt.ylabel("Total Registered Payers")
plt.show()

print("\nHigh-Value Payers by Return Type:")
return_summary = data.groupby(RETURN_TYPE).agg({
    ELIGIBLE: 'sum',
    'Total_Registered': 'sum'
})
print(return_summary)
print("\nSummary: GSTR-3 return type has more registered payers (501M) than GSTR-1 (344M), indicating GSTR-3 filers contribute significantly to GST registrations.")

# Objective 4: correlation between GST metrics ----------------------------
numeric_cols = [ELIGIBLE, BEFORE_DUE, AFTER_DUE, 'Total_Registered']
corr_matrix = data[numeric_cols].corr()  # computed once, reused for the printout and the heatmap
print("\nCorrelation Matrix:\n", corr_matrix)

plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title("Correlation Heatmap of GST Metrics")
plt.tight_layout()
plt.show()
print("\nSummary: Eligible payers and total registered payers are highly correlated (0.99), suggesting that higher eligibility strongly predicts more registrations.")

# Objective 5: registrations before vs after the due date -----------------
plt.figure(figsize=(8, 6))
# Reshape to long format: one row per (state, registration type) observation
reg_data = pd.melt(data,
                   id_vars=['srcStateName'],
                   value_vars=[BEFORE_DUE, AFTER_DUE],
                   var_name='Registration_Type',
                   value_name='Count')
# barplot averages by default; estimator='sum' plots totals so the bars match the
# "Number of Payers" axis label, and errorbar=None drops the confidence-interval whiskers.
sns.barplot(x='Registration_Type', y='Count', hue='Registration_Type', data=reg_data,
            palette='muted', estimator='sum', errorbar=None, legend=False)
plt.title("Before vs After Due Date GST Registrations")
plt.xlabel("Registration Type")
plt.ylabel("Number of Payers")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
print("\nSummary (Before vs After Due Date Registrations): Before-due-date registrations significantly outnumber after-due-date registrations, indicating timely compliance among most GST payers.")

--------------------------------------------------------------------------------