├── README.md
└── project.py

/README.md:
--------------------------------------------------------------------------------
# 🧪 Exploratory Data Analysis on an Emergency Department Dataset

## 📊 Project Overview
This repository presents a comprehensive Exploratory Data Analysis (EDA) pipeline for an emergency department dataset. The analysis emphasizes data quality, distribution patterns, statistical relationships, and key operational metrics such as ED visits, diagnoses, and hospital attributes. Visualizations are used extensively to uncover trends and anomalies that can support operational decision-making in healthcare settings.

## 📁 Project Structure
```plaintext
.
├── project.py        # Main script for performing EDA
├── eda_summary.txt   # Auto-generated summary report of the analysis
├── eda_plots/        # Directory containing all generated plots
└── README.md         # Project documentation
```

## ⚙️ Features
- Missing value analysis and mode-based imputation
- Automated outlier detection and treatment using the IQR method
- Data type analysis and categorical exploration
- Distribution visualization via histograms, boxplots, and heatmaps
- Correlation analysis with a Pearson matrix and strong-correlation highlights (|corr| > 0.7)
- Feature engineering via quartile binning
- Advanced visual storytelling with:
  - Count plots
  - Stacked bar plots
  - Regression plots
  - Interactive Plotly scatter plots
- Summary report (`eda_summary.txt`) generated automatically

## 🏁 How to Run

### 1. Prerequisites
Ensure the following libraries are installed:

```bash
pip install pandas numpy matplotlib seaborn plotly scipy
```

### 2. Clone the Repository
```bash
git clone https://github.com/your-username/emergency-department-eda.git
cd emergency-department-eda
```

### 3. Update the File Path
Update the dataset path in `project.py`:

```python
file_path = "C:/Users/hp/Desktop/python_dataset.csv"
```

Replace this with your actual CSV file path, or place the CSV in the project directory and use a relative path.
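If you prefer not to edit the script, one option is to read the path from the command line. The sketch below is illustrative only; the `--csv` flag and the `argparse` wiring are not part of `project.py` as written:

```python
# Illustrative sketch: project.py currently hard-codes file_path near the top of the script.
import argparse

parser = argparse.ArgumentParser(description="Run the emergency department EDA")
parser.add_argument("--csv", default="python_dataset.csv",
                    help="Path to the emergency department CSV file")
args = parser.parse_args()
file_path = args.csv  # would replace the hard-coded Windows path
```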
### 4. Run the Script
```bash
python project.py
```

All plots will be saved to the `eda_plots/` directory and a summary will be written to `eda_summary.txt`.

## 📈 Sample Outputs
- Heatmap of missing values
- Boxplots after outlier treatment
- Pearson correlation matrix
- Histograms of numerical features
- Interactive scatter plot (Plotly)
- Regression analysis with slope, intercept, R², and p-value

## 🧠 Use Cases
- Healthcare operations planning
- Data-driven resource allocation
- Early anomaly detection in ED visit patterns
- Foundational layer for machine learning modeling

## 📌 Notes
- The script creates the plot directory automatically.
- Outliers are capped (not removed) to preserve dataset size.
- Categorical variables are explored using count plots and cross-tabulations.

## 🤝 Contributions
Contributions are welcome! Please fork the repository and submit a pull request with enhancements or bug fixes.

--------------------------------------------------------------------------------
/project.py:
--------------------------------------------------------------------------------
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import os
from scipy.stats import skew
from scipy import stats  # Used for the regression analysis
import warnings
warnings.filterwarnings('ignore')

# Set Seaborn theme and a unified Set2 color palette
sns.set_theme(style="whitegrid")
custom_palette = sns.color_palette("Set2")
sns.set_palette(custom_palette)

# Create a directory to save plots
if not os.path.exists("eda_plots"):
    os.makedirs("eda_plots")

# Detect and handle outliers in a column using the 1.5 * IQR rule.
# method='remove' drops outlying rows, method='cap' clips them to the bounds.
def handle_outliers(df, column, method='remove'):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)][column]
    print(f"\n🔹 Outliers in {column}: {len(outliers)}")

    if method == 'remove':
        df_clean = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
    elif method == 'cap':
        df_clean = df.copy()
        df_clean[column] = df_clean[column].clip(lower=lower_bound, upper=upper_bound)
    else:
        df_clean = df

    return df_clean, outliers
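# Usage sketch (hypothetical toy data, not part of the analysis below) showing the
# two modes of handle_outliers:
#   toy = pd.DataFrame({"x": [1, 2, 3, 100]})
#   capped, found = handle_outliers(toy, "x", method="cap")     # 100 is clipped to Q3 + 1.5 * IQR
#   trimmed, found = handle_outliers(toy, "x", method="remove") # the row containing 100 is dropped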
# Load the dataset with error handling
file_path = "C:/Users/hp/Desktop/python_dataset.csv"
try:
    df = pd.read_csv(file_path, encoding='latin1')
except FileNotFoundError:
    print(f"Error: Dataset file '{file_path}' not found. Please check the file path.")
    raise SystemExit
except Exception as e:
    print(f"Error loading dataset: {e}")
    raise SystemExit

# Check if the dataset is empty
if df.empty:
    print("Error: Dataset is empty.")
    raise SystemExit

# Basic info
print("🔹 Dataset Info:")
df.info()

# First few rows
print("\n🔹 First 5 Rows:")
print(df.head())

# Missing values analysis
print("\n🔹 Missing Values:")
missing_data = df.isnull().sum()
print(missing_data)

# Visualize missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='Blues')
plt.title("Missing Values Heatmap")
plt.savefig("eda_plots/missing_values_heatmap.png")
plt.show()

# Handle missing values (impute 'system' with its mode; extend for other columns as needed)
if df['system'].isnull().sum() > 0:
    df['system'] = df['system'].fillna(df['system'].mode()[0])
print("\n🔹 Missing Values After Imputation:")
print(df.isnull().sum())

# Summary statistics
print("\n🔹 Summary Statistics:")
print(df.describe(include='all'))

# Column data types
print("\n🔹 Column Types:")
print(df.dtypes)

# Unique values for each categorical column
print("\n🔹 Unique Values in Categorical Columns:")
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    print(f"\n{col} ➤ Unique Values: {df[col].nunique()}")
    print(df[col].value_counts().head(5))

# Check for duplicate rows
print(f"\n🔹 Duplicate Rows: {df.duplicated().sum()}")
df.drop_duplicates(inplace=True)
print(f"🔹 Duplicate Rows After Removal: {df.duplicated().sum()}")

# Skewness analysis for numeric columns
print("\n🔹 Skewness of Numeric Columns:")
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
for col in numeric_cols:
    skewness = skew(df[col].dropna())
    print(f"{col}: {skewness:.2f}")
    if abs(skewness) > 1:
        print(f"  ➤ High skewness detected. Consider a log transformation for {col}.")
        df[f"{col}_log"] = np.log1p(df[col].clip(lower=0))
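# Note: |skew| > 1 is a common rule of thumb for strong asymmetry; log1p is applied to a
# copy clipped at zero so that negative values cannot produce NaN or -inf in the new column.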
# Outlier detection and handling
for col in numeric_cols:
    df, outliers = handle_outliers(df, col, method='cap')  # cap rather than remove, to preserve dataset size
    plt.figure(figsize=(8, 4))
    sns.boxplot(x=df[col], color=custom_palette[1])
    plt.title(f"Boxplot of {col} After Outlier Handling")
    plt.savefig(f"eda_plots/boxplot_{col}_after.png")
    plt.show()

# Correlation matrix (Pearson)
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True, method='pearson'), annot=True, cmap='Blues', fmt=".2f")
plt.title("Pearson Correlation Matrix")
plt.savefig("eda_plots/pearson_correlation_matrix.png")
plt.show()

# Highlight strong correlations (upper triangle only, to avoid reporting each pair twice)
corr_matrix = df.corr(numeric_only=True)
strong_corrs = corr_matrix.where(np.triu(np.abs(corr_matrix) > 0.7, k=1)).stack()
print("\n🔹 Strong Correlations (|corr| > 0.7):")
print(strong_corrs)

# Histograms for numeric columns
df.select_dtypes(include='number').hist(figsize=(15, 10), bins=30, color=custom_palette[2], edgecolor='black')
plt.suptitle("Distribution of Numeric Columns", fontsize=16)
plt.savefig("eda_plots/numeric_histograms.png")
plt.show()

# Feature engineering: bin 'Tot_ED_NmbVsts' into quartile-based categories
df['ED_Visits_Category'] = pd.qcut(df['Tot_ED_NmbVsts'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])
print("\n🔹 New Feature 'ED_Visits_Category' Created:")
print(df['ED_Visits_Category'].value_counts())

# Count plot for each categorical column
for col in categorical_cols:
    plt.figure(figsize=(10, 4))
    sns.countplot(data=df, x=col, order=df[col].value_counts().index[:10], palette='Set2')
    plt.title(f"Top 10 Frequent Values in {col}")
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig(f"eda_plots/countplot_{col}.png")
    plt.show()

# Stacked bar plot: HospitalOwnership vs UrbanRuralDesi
pd.crosstab(df['HospitalOwnership'], df['UrbanRuralDesi']).plot(
    kind='bar', stacked=True, figsize=(10, 6), color=custom_palette)
plt.title("Stacked Bar Plot: HospitalOwnership vs UrbanRuralDesi")
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("eda_plots/stacked_bar_hospital_urban.png")
plt.show()

# Interactive scatter plot: Tot_ED_NmbVsts vs EDStations
fig = px.scatter(df, x='Tot_ED_NmbVsts', y='EDStations', color='UrbanRuralDesi',
                 title="Interactive Scatter Plot: ED Visits vs ED Stations",
                 color_discrete_sequence=px.colors.qualitative.Set2)
fig.write_html("eda_plots/interactive_scatter.html")
fig.show()

# Regression plot: Tot_ED_NmbVsts vs EDDXCount
print("\n🔹 Regression Analysis: Tot_ED_NmbVsts vs EDDXCount")
plt.figure(figsize=(10, 6))
sns.regplot(x='Tot_ED_NmbVsts', y='EDDXCount', data=df, scatter_kws={'alpha': 0.5, 'color': custom_palette[0]},
            line_kws={'color': custom_palette[4]})
plt.title("Regression Line: Total ED Visits vs ED Diagnoses")
plt.xlabel("Total ED Visits")
plt.ylabel("ED Diagnoses")
plt.tight_layout()
plt.savefig("eda_plots/regression_ed_visits_vs_diagnoses.png")
plt.show()

# Calculate and display the regression slope, intercept, R-squared and p-value
slope, intercept, r_value, p_value, std_err = stats.linregress(df['Tot_ED_NmbVsts'], df['EDDXCount'])
print(f"  ➤ Slope: {slope:.4f}, Intercept: {intercept:.4f}")
print(f"  ➤ R-squared: {r_value**2:.4f}, P-value: {p_value:.4f}")
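# Note: stats.linregress does not ignore missing values, so any NaNs remaining in either
# column propagate into the reported statistics; if needed, the fit can be restricted to
# complete pairs first, e.g. via df[['Tot_ED_NmbVsts', 'EDDXCount']].dropna() (not done here).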
# Optional: regression plots for other numeric pairs
other_pairs = [('Tot_ED_NmbVsts', 'EDStations')]
for x_col, y_col in other_pairs:
    print(f"\n🔹 Regression Analysis: {x_col} vs {y_col}")
    plt.figure(figsize=(10, 6))
    sns.regplot(x=x_col, y=y_col, data=df, scatter_kws={'alpha': 0.5, 'color': custom_palette[0]},
                line_kws={'color': custom_palette[4]})
    plt.title(f"Regression Line: {x_col} vs {y_col}")
    plt.xlabel(x_col)
    plt.ylabel(y_col)
    plt.tight_layout()
    plt.savefig(f"eda_plots/regression_{x_col}_vs_{y_col}.png")
    plt.show()

    slope, intercept, r_value, p_value, std_err = stats.linregress(df[x_col], df[y_col])
    print(f"  ➤ Slope: {slope:.4f}, Intercept: {intercept:.4f}")
    print(f"  ➤ R-squared: {r_value**2:.4f}, P-value: {p_value:.4f}")

# Generate a summary report
with open("eda_summary.txt", "w") as f:
    f.write("Exploratory Data Analysis Summary\n")
    f.write("=" * 40 + "\n")
    f.write(f"Dataset Shape: {df.shape}\n")
    f.write(f"Numeric Columns: {list(numeric_cols)}\n")
    f.write(f"Categorical Columns: {list(categorical_cols)}\n")
    f.write("\nMissing Values After Imputation:\n")
    f.write(str(df.isnull().sum()) + "\n")
    f.write("\nStrong Correlations (|corr| > 0.7):\n")
    f.write(str(strong_corrs) + "\n")
    f.write("\nPlots Saved in: eda_plots/\n")
print("\n🔹 EDA Summary saved to 'eda_summary.txt'")

print("\n🔹 EDA Completed! All plots saved in 'eda_plots/' directory.")
--------------------------------------------------------------------------------