├── README.md
└── project.py

/README.md:
--------------------------------------------------------------------------------
# 🧪 Exploratory Data Analysis on an Emergency Department Dataset

## 📊 Project Overview
This repository presents a comprehensive Exploratory Data Analysis (EDA) pipeline for an emergency department dataset. The analysis emphasizes data quality, distribution patterns, statistical relationships, and key operational metrics such as ED visits, diagnoses, and hospital attributes. Visualizations are used extensively to uncover trends and anomalies that can support operational decision-making in healthcare settings.

## 📁 Project Structure
```plaintext
.
├── project.py        # Main script for performing EDA
├── eda_summary.txt   # Auto-generated summary report of the analysis
├── eda_plots/        # Directory containing all generated plots
└── README.md         # Project documentation
```

## ⚙️ Features
- Missing value analysis and mode-based imputation
- Automated outlier detection and treatment using the IQR method
- Data type analysis and categorical exploration
- Distribution visualization via histograms, boxplots, and heatmaps
- Correlation analysis with a Pearson matrix and strong-correlation highlights (|corr| > 0.7)
- Feature engineering via quartile binning
- Advanced visual storytelling with:
  - Count plots
  - Stacked bar plots
  - Regression plots
  - Interactive Plotly scatter plots
- Summary report (`eda_summary.txt`) generated automatically

## 🏁 How to Run

### 1. Prerequisites
Ensure the following libraries are installed:

```bash
pip install pandas numpy matplotlib seaborn plotly scipy
```

### 2. Clone the Repository
```bash
git clone https://github.com/your-username/emergency-department-eda.git
cd emergency-department-eda
```

### 3. Update the File Path
Update the dataset path in `project.py`:

```python
file_path = "C:/Users/hp/Desktop/python_dataset.csv"
```

Replace this with your actual CSV file path, or place the CSV in the project directory and use a relative path.
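If you prefer not to edit the script, one option is to read the path from the command line. The sketch below is illustrative only; the `--csv` flag and the `argparse` wiring are not part of `project.py` as written:

```python
# Illustrative sketch: project.py currently hard-codes file_path near the top of the script.
import argparse

parser = argparse.ArgumentParser(description="Run the emergency department EDA")
parser.add_argument("--csv", default="python_dataset.csv",
                    help="Path to the emergency department CSV file")
args = parser.parse_args()
file_path = args.csv  # would replace the hard-coded Windows path
```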
### 4. Run the Script
```bash
python project.py
```

All plots will be saved to the `eda_plots/` directory and a summary will be written to `eda_summary.txt`.

## 📈 Sample Outputs
- Heatmap of missing values
- Boxplots after outlier treatment
- Pearson correlation matrix
- Histograms of numerical features
- Interactive scatter plot (Plotly)
- Regression analysis with slope, intercept, R², and p-value

## 🧠 Use Cases
- Healthcare operations planning
- Data-driven resource allocation
- Early anomaly detection in ED visit patterns
- Foundational layer for machine learning modeling

## 📌 Notes
- The script creates the plot directory automatically.
- Outliers are capped (not removed) to preserve dataset size.
- Categorical variables are explored using count plots and cross-tabulations.

## 🤝 Contributions
Contributions are welcome! Please fork the repository and submit a pull request with enhancements or bug fixes.

--------------------------------------------------------------------------------
/project.py:
--------------------------------------------------------------------------------
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import os
from scipy.stats import skew
from scipy import stats  # Used for the regression analysis
import warnings
warnings.filterwarnings('ignore')

# Set Seaborn theme and a unified Set2 color palette
sns.set_theme(style="whitegrid")
custom_palette = sns.color_palette("Set2")
sns.set_palette(custom_palette)

# Create a directory to save plots
if not os.path.exists("eda_plots"):
    os.makedirs("eda_plots")

# Detect and handle outliers in a column using the 1.5 * IQR rule.
# method='remove' drops outlying rows, method='cap' clips them to the bounds.
def handle_outliers(df, column, method='remove'):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)][column]
    print(f"\n🔹 Outliers in {column}: {len(outliers)}")

    if method == 'remove':
        df_clean = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
    elif method == 'cap':
        df_clean = df.copy()
        df_clean[column] = df_clean[column].clip(lower=lower_bound, upper=upper_bound)
    else:
        df_clean = df

    return df_clean, outliers
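# Usage sketch (hypothetical toy data, not part of the analysis below) showing the
# two modes of handle_outliers:
#   toy = pd.DataFrame({"x": [1, 2, 3, 100]})
#   capped, found = handle_outliers(toy, "x", method="cap")     # 100 is clipped to Q3 + 1.5 * IQR
#   trimmed, found = handle_outliers(toy, "x", method="remove") # the row containing 100 is dropped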
# Load the dataset with error handling
file_path = "C:/Users/hp/Desktop/python_dataset.csv"
try:
    df = pd.read_csv(file_path, encoding='latin1')
except FileNotFoundError:
    print(f"Error: Dataset file '{file_path}' not found. Please check the file path.")
    raise SystemExit
except Exception as e:
    print(f"Error loading dataset: {e}")
    raise SystemExit

# Check if the dataset is empty
if df.empty:
    print("Error: Dataset is empty.")
    raise SystemExit

# Basic info
print("🔹 Dataset Info:")
df.info()

# First few rows
print("\n🔹 First 5 Rows:")
print(df.head())

# Missing values analysis
print("\n🔹 Missing Values:")
missing_data = df.isnull().sum()
print(missing_data)

# Visualize missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='Blues')
plt.title("Missing Values Heatmap")
plt.savefig("eda_plots/missing_values_heatmap.png")
plt.show()

# Handle missing values (impute 'system' with its mode; extend for other columns as needed)
if df['system'].isnull().sum() > 0:
    df['system'] = df['system'].fillna(df['system'].mode()[0])
print("\n🔹 Missing Values After Imputation:")
print(df.isnull().sum())

# Summary statistics
print("\n🔹 Summary Statistics:")
print(df.describe(include='all'))

# Column data types
print("\n🔹 Column Types:")
print(df.dtypes)

# Unique values for each categorical column
print("\n🔹 Unique Values in Categorical Columns:")
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    print(f"\n{col} ➤ Unique Values: {df[col].nunique()}")
    print(df[col].value_counts().head(5))

# Check for duplicate rows
print(f"\n🔹 Duplicate Rows: {df.duplicated().sum()}")
df.drop_duplicates(inplace=True)
print(f"🔹 Duplicate Rows After Removal: {df.duplicated().sum()}")

# Skewness analysis for numeric columns
print("\n🔹 Skewness of Numeric Columns:")
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
for col in numeric_cols:
    skewness = skew(df[col].dropna())
    print(f"{col}: {skewness:.2f}")
    if abs(skewness) > 1:
        print(f"  ➤ High skewness detected. Consider a log transformation for {col}.")
        df[f"{col}_log"] = np.log1p(df[col].clip(lower=0))
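# Note: |skew| > 1 is a common rule of thumb for strong asymmetry; log1p is applied to a
# copy clipped at zero so that negative values cannot produce NaN or -inf in the new column.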
# Outlier detection and handling
for col in numeric_cols:
    df, outliers = handle_outliers(df, col, method='cap')  # cap rather than remove, to preserve dataset size
    plt.figure(figsize=(8, 4))
    sns.boxplot(x=df[col], color=custom_palette[1])
    plt.title(f"Boxplot of {col} After Outlier Handling")
    plt.savefig(f"eda_plots/boxplot_{col}_after.png")
    plt.show()

# Correlation matrix (Pearson)
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True, method='pearson'), annot=True, cmap='Blues', fmt=".2f")
plt.title("Pearson Correlation Matrix")
plt.savefig("eda_plots/pearson_correlation_matrix.png")
plt.show()

# Highlight strong correlations (upper triangle only, to avoid reporting each pair twice)
corr_matrix = df.corr(numeric_only=True)
strong_corrs = corr_matrix.where(np.triu(np.abs(corr_matrix) > 0.7, k=1)).stack()
print("\n🔹 Strong Correlations (|corr| > 0.7):")
print(strong_corrs)

# Histograms for numeric columns
df.select_dtypes(include='number').hist(figsize=(15, 10), bins=30, color=custom_palette[2], edgecolor='black')
plt.suptitle("Distribution of Numeric Columns", fontsize=16)
plt.savefig("eda_plots/numeric_histograms.png")
plt.show()

# Feature engineering: bin 'Tot_ED_NmbVsts' into quartile-based categories
df['ED_Visits_Category'] = pd.qcut(df['Tot_ED_NmbVsts'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])
print("\n🔹 New Feature 'ED_Visits_Category' Created:")
print(df['ED_Visits_Category'].value_counts())

# Count plot for each categorical column
for col in categorical_cols:
    plt.figure(figsize=(10, 4))
    sns.countplot(data=df, x=col, order=df[col].value_counts().index[:10], palette='Set2')
    plt.title(f"Top 10 Frequent Values in {col}")
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig(f"eda_plots/countplot_{col}.png")
    plt.show()

# Stacked bar plot: HospitalOwnership vs UrbanRuralDesi
pd.crosstab(df['HospitalOwnership'], df['UrbanRuralDesi']).plot(
    kind='bar', stacked=True, figsize=(10, 6), color=custom_palette)
plt.title("Stacked Bar Plot: HospitalOwnership vs UrbanRuralDesi")
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("eda_plots/stacked_bar_hospital_urban.png")
plt.show()

# Interactive scatter plot: Tot_ED_NmbVsts vs EDStations
fig = px.scatter(df, x='Tot_ED_NmbVsts', y='EDStations', color='UrbanRuralDesi',
                 title="Interactive Scatter Plot: ED Visits vs ED Stations",
                 color_discrete_sequence=px.colors.qualitative.Set2)
fig.write_html("eda_plots/interactive_scatter.html")
fig.show()

# Regression plot: Tot_ED_NmbVsts vs EDDXCount
print("\n🔹 Regression Analysis: Tot_ED_NmbVsts vs EDDXCount")
plt.figure(figsize=(10, 6))
sns.regplot(x='Tot_ED_NmbVsts', y='EDDXCount', data=df, scatter_kws={'alpha': 0.5, 'color': custom_palette[0]},
            line_kws={'color': custom_palette[4]})
plt.title("Regression Line: Total ED Visits vs ED Diagnoses")
plt.xlabel("Total ED Visits")
plt.ylabel("ED Diagnoses")
plt.tight_layout()
plt.savefig("eda_plots/regression_ed_visits_vs_diagnoses.png")
plt.show()

# Calculate and display the regression slope, intercept, R-squared and p-value
slope, intercept, r_value, p_value, std_err = stats.linregress(df['Tot_ED_NmbVsts'], df['EDDXCount'])
print(f"  ➤ Slope: {slope:.4f}, Intercept: {intercept:.4f}")
print(f"  ➤ R-squared: {r_value**2:.4f}, P-value: {p_value:.4f}")
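# Note: stats.linregress does not ignore missing values, so any NaNs remaining in either
# column propagate into the reported statistics; if needed, the fit can be restricted to
# complete pairs first, e.g. via df[['Tot_ED_NmbVsts', 'EDDXCount']].dropna() (not done here).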
# Optional: regression plots for other numeric pairs
other_pairs = [('Tot_ED_NmbVsts', 'EDStations')]
for x_col, y_col in other_pairs:
    print(f"\n🔹 Regression Analysis: {x_col} vs {y_col}")
    plt.figure(figsize=(10, 6))
    sns.regplot(x=x_col, y=y_col, data=df, scatter_kws={'alpha': 0.5, 'color': custom_palette[0]},
                line_kws={'color': custom_palette[4]})
    plt.title(f"Regression Line: {x_col} vs {y_col}")
    plt.xlabel(x_col)
    plt.ylabel(y_col)
    plt.tight_layout()
    plt.savefig(f"eda_plots/regression_{x_col}_vs_{y_col}.png")
    plt.show()

    slope, intercept, r_value, p_value, std_err = stats.linregress(df[x_col], df[y_col])
    print(f"  ➤ Slope: {slope:.4f}, Intercept: {intercept:.4f}")
    print(f"  ➤ R-squared: {r_value**2:.4f}, P-value: {p_value:.4f}")

# Generate a summary report
with open("eda_summary.txt", "w") as f:
    f.write("Exploratory Data Analysis Summary\n")
    f.write("=" * 40 + "\n")
    f.write(f"Dataset Shape: {df.shape}\n")
    f.write(f"Numeric Columns: {list(numeric_cols)}\n")
    f.write(f"Categorical Columns: {list(categorical_cols)}\n")
    f.write("\nMissing Values After Imputation:\n")
    f.write(str(df.isnull().sum()) + "\n")
    f.write("\nStrong Correlations (|corr| > 0.7):\n")
    f.write(str(strong_corrs) + "\n")
    f.write("\nPlots Saved in: eda_plots/\n")
print("\n🔹 EDA Summary saved to 'eda_summary.txt'")

print("\n🔹 EDA Completed! All plots saved in 'eda_plots/' directory.")
--------------------------------------------------------------------------------