├── README.md
└── projectint375.py


/README.md:
--------------------------------------------------------------------------------
Mortality Trends Analysis
Advanced Data Analysis | EDA | Time-Series Modeling | Forecasting

This project analyzes weekly mortality data from the United States using advanced data-science techniques including exploratory data analysis (EDA), feature engineering, trend decomposition, and machine-learning-based forecasting.

The goal is to identify major mortality patterns, understand cause-wise trends, and build predictive models to forecast all-cause mortality.

Project Highlights

Performed comprehensive EDA to understand mortality behavior

Built a feature-engineered time-series dataset

Applied linear regression forecasting using lag & moving-average features

Conducted trend + seasonality decomposition

Created heatmaps, time-series plots, and stacked area charts

Extracted meaningful insights such as peak mortality weeks and correlations

Dataset

Source: Weekly mortality records (Excel format) containing:

All-Cause deaths

COVID-19 deaths

Major causes (heart disease, cancer, etc.)

Date of week ending

Jurisdiction (state names)

Technologies Used

Python 3

Pandas, NumPy

Matplotlib, Seaborn

Statsmodels (seasonal decomposition)

Scikit-learn (machine learning)

Project Workflow
1. Import & Clean Data

Loaded Excel dataset

Converted date fields

Filtered to the U.S.-level aggregate

Filled missing values

Sorted chronologically

2. Feature Engineering

Engineered several predictive & analytical features:

Lag variables (1-week lag)

Moving Averages (4-week average)

Week number

Year

Rolling trend signals

3. Exploratory Data Analysis

Includes:

Time-series trend plots

Correlation matrix

Distribution of causes

Stacked cause-wise area charts

Jurisdiction heatmap

4. Time-Series Trend Decomposition

Using seasonal_decompose():

Trend

Seasonal patterns

Residual noise

5. Machine Learning Forecast

Built a Linear Regression model using:

AllCause_Lag1

COVID_Lag1

AllCause_MA4

COVID_MA4

Evaluated using MAE & R², with a prediction visualization.

6. Results & Key Insights
Peak Mortality

Highest COVID-19 mortality week identified

Highest All-Cause mortality week identified

Correlation

COVID-19 and All-Cause Mortality correlation example:

Strong positive correlation → pandemic impact visible in overall mortality.

Year-wise mortality totals

Helps track increase or decrease over years.
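
The three insights listed above (peak weeks, correlation, year-wise totals) come down to a few pandas calls. A minimal sketch on a small made-up frame: the numbers are illustrative only and are not taken from the dataset, and the column names are assumed to match those used in projectint375.py.

```python
import pandas as pd

# Made-up weekly values for illustration only (not taken from the dataset).
df = pd.DataFrame({
    "Week Ending Date": pd.to_datetime([
        "2020-12-26", "2021-01-02", "2021-01-09", "2021-01-16",
    ]),
    "All Cause": [80_000, 84_000, 87_000, 85_000],
    "COVID-19": [20_000, 23_000, 25_000, 26_000],
})
df["year"] = df["Week Ending Date"].dt.year

# Peak weeks: idxmax returns the row label of the maximum value.
peak_covid_week = df.loc[df["COVID-19"].idxmax(), "Week Ending Date"]
peak_allcause_week = df.loc[df["All Cause"].idxmax(), "Week Ending Date"]

# Pearson correlation between the two weekly series.
corr = df["COVID-19"].corr(df["All Cause"])

# Year-wise totals of all-cause deaths.
yearly_totals = df.groupby("year")["All Cause"].sum()

print("Peak COVID-19 week:", peak_covid_week.date())
print("Peak All-Cause week:", peak_allcause_week.date())
print(f"Correlation: {corr:.2f}")
print(yearly_totals)
```

Note that the year is taken from the week-ending date, so a week that straddles New Year is counted entirely in the later year.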

Visualizations Included

All-Cause vs Natural Deaths Trend

Correlation Heatmap

Time-Series Decomposition

Prediction vs Actual Line Plot

Stacked Area Chart for top causes

Jurisdiction Heatmap

--------------------------------------------------------------------------------
/projectint375.py:
--------------------------------------------------------------------------------
# ----------------------------
# Mortality Trends Analysis Project
# Data Analysis | EDA | Time-Series ML | Forecasting
# ----------------------------

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from statsmodels.tsa.seasonal import seasonal_decompose
import warnings
warnings.filterwarnings("ignore")

# --------------------------------------------
# 1. Load Dataset
# --------------------------------------------
df = pd.read_excel("python dataset.xlsx")

# Standard column cleanup
df.columns = df.columns.str.strip()

print("Dataset Shape:", df.shape)
print(df.head())

# --------------------------------------------
# 2. Data Cleaning
# --------------------------------------------

# Convert date column
df["Week Ending Date"] = pd.to_datetime(df["Week Ending Date"])

# Replace missing values
df.fillna(0, inplace=True)

# Keep the full, all-jurisdiction frame for the state-level heatmap in section 8
df_all = df.copy()

# Keep only USA-level records for the national analysis.
# .copy() avoids chained-assignment problems when columns are added below.
df = df[df["Jurisdiction of Occurrence"] == "United States"].copy()

# Sort by time
df = df.sort_values("Week Ending Date")

# --------------------------------------------
# 3. Feature Engineering
# --------------------------------------------

df["year"] = df["Week Ending Date"].dt.year
df["week"] = df["Week Ending Date"].dt.isocalendar().week

# Lag features for ML models
df["AllCause_Lag1"] = df["All Cause"].shift(1)
df["COVID_Lag1"] = df["COVID-19"].shift(1)

# 4-week rolling averages over the PREVIOUS four weeks.
# shift(1) keeps the current week (the prediction target) out of its own features.
df["AllCause_MA4"] = df["All Cause"].shift(1).rolling(4).mean()
df["COVID_MA4"] = df["COVID-19"].shift(1).rolling(4).mean()

# Drop initial rows with NaNs from the lag and rolling features
df = df.dropna()

# --------------------------------------------
# 4. Exploratory Data Analysis
# --------------------------------------------
plt.figure(figsize=(12, 6))
plt.plot(df["Week Ending Date"], df["All Cause"], label="All Cause Deaths")
plt.plot(df["Week Ending Date"], df["Natural Cause"], label="Natural Deaths")
plt.title("Mortality Trends Over Time")
plt.xlabel("Date")
plt.ylabel("Deaths")
plt.legend()
plt.grid(True)
plt.show()

# Correlation Matrix
plt.figure(figsize=(14, 8))
sns.heatmap(df.corr(numeric_only=True), annot=False, cmap="coolwarm")
plt.title("Correlation Matrix of Mortality Features")
plt.show()

# --------------------------------------------
# 5. Time-Series Decomposition (Trend + Seasonality)
# --------------------------------------------

# period=52 assumes weekly data with a yearly cycle; seasonal_decompose
# needs at least two full cycles (104 observations) for this period.
ts = df.set_index("Week Ending Date")["All Cause"]
result = seasonal_decompose(ts, model="additive", period=52)

result.plot()
plt.suptitle("Time-Series Decomposition: All-Cause Mortality", y=1.02)
plt.show()

# --------------------------------------------
# 6. ML Model → Predicting Mortality (Simple Regression)
# --------------------------------------------

features = ["AllCause_Lag1", "COVID_Lag1", "AllCause_MA4", "COVID_MA4"]
X = df[features]
y = df["All Cause"]

# Chronological train-test split (no shuffling, so the test set is the most recent 20%)
split = int(len(df) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
pred = model.predict(X_test)

print("MAE:", mean_absolute_error(y_test, pred))
print("R2 Score:", r2_score(y_test, pred))

# Plot predictions against dates rather than the raw row index
test_dates = df["Week Ending Date"].iloc[split:]
plt.figure(figsize=(12, 6))
plt.plot(test_dates, y_test, label="Actual")
plt.plot(test_dates, pred, label="Predicted")
plt.title("All-Cause Mortality Prediction")
plt.legend()
plt.show()

# --------------------------------------------
# 7. Stacked Area Plot of Major Causes
# --------------------------------------------
major_causes = df.set_index("Week Ending Date")[[
    "COVID-19",
    "Diseases of heart",
    "Malignant neoplasms",
]]

major_causes.plot.area(figsize=(12, 6), colormap="Set2")
plt.title("Major Causes of Death: Trend Comparison")
plt.xlabel("Date")
plt.ylabel("Deaths")
plt.show()

# --------------------------------------------
# 8. Heatmap by State (If multiple states exist)
# --------------------------------------------
# Uses df_all, because df holds only the "United States" rows after section 2.
# The national row is excluded so it does not dominate the color scale.
states = df_all[df_all["Jurisdiction of Occurrence"] != "United States"]
heatmap_df = states.pivot_table(
    index="Jurisdiction of Occurrence",
    columns="Week Ending Date",
    values="COVID-19",
    aggfunc="sum"
)

plt.figure(figsize=(16, 8))
sns.heatmap(heatmap_df, cmap="YlOrRd")
plt.title("COVID-19 Mortality Heatmap Across Jurisdictions")
plt.show()

# --------------------------------------------
# 9. Insight Extraction
# --------------------------------------------
peak_covid_week = df.loc[df["COVID-19"].idxmax(), "Week Ending Date"]
peak_allcause_week = df.loc[df["All Cause"].idxmax(), "Week Ending Date"]

print("\n----- INSIGHTS -----")
print(f"Highest COVID-19 deaths recorded during week ending: {peak_covid_week.date()}")
print(f"Highest overall mortality recorded during week ending: {peak_allcause_week.date()}")

corr = df["COVID-19"].corr(df["All Cause"])
print(f"Correlation between COVID-19 & All-Cause Mortality: {corr:.2f}")

yearly_summary = df.groupby("year")["All Cause"].sum()
print("\nYearly Mortality Totals:")
print(yearly_summary)
--------------------------------------------------------------------------------
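
A lag-based forecast like the one in section 6 of projectint375.py is easier to judge next to a naive baseline that predicts "same as last week". The sketch below is one way to set that up, under stated assumptions: the weekly series is synthetic (not real mortality data), the moving average is shifted by one week so it excludes the target week, and `np.linalg.lstsq` stands in for scikit-learn's `LinearRegression` to keep the example dependency-light. The helper `design` is a hypothetical name introduced for this example.

```python
import numpy as np
import pandas as pd

# Synthetic weekly series (illustrative only, not real mortality data):
# a yearly seasonal cycle plus noise around a constant level.
rng = np.random.default_rng(0)
n_weeks = 156
weeks = pd.date_range("2018-01-06", periods=n_weeks, freq="W-SAT")
level = 55_000 + 5_000 * np.sin(2 * np.pi * np.arange(n_weeks) / 52)
df = pd.DataFrame({
    "Week Ending Date": weeks,
    "All Cause": level + rng.normal(0, 500, size=n_weeks),
})

# Leak-free features: both use only values from earlier weeks.
df["AllCause_Lag1"] = df["All Cause"].shift(1)
df["AllCause_MA4"] = df["All Cause"].shift(1).rolling(4).mean()
df = df.dropna().reset_index(drop=True)

# Chronological 80/20 split (no shuffling).
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]


def design(frame):
    """Feature matrix with a leading intercept column."""
    X = frame[["AllCause_Lag1", "AllCause_MA4"]].to_numpy()
    return np.column_stack([np.ones(len(X)), X])


# Ordinary least squares via NumPy (stand-in for LinearRegression).
coef, *_ = np.linalg.lstsq(design(train), train["All Cause"].to_numpy(), rcond=None)
pred = design(test) @ coef

y_test = test["All Cause"].to_numpy()
mae_model = np.mean(np.abs(y_test - pred))
# Naive persistence baseline: this week equals last week.
mae_naive = np.mean(np.abs(y_test - test["AllCause_Lag1"].to_numpy()))

print(f"Model MAE: {mae_model:.1f}")
print(f"Naive MAE: {mae_naive:.1f}")
```

If the model's MAE is not clearly below the naive MAE, the lag and moving-average features are adding little beyond simple persistence.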