├── README.md
└── projectint375.py


/README.md:
--------------------------------------------------------------------------------
  1 | Mortality Trends Analysis
  2 | Advanced Data Analysis | EDA | Time-Series Modeling | Forecasting
  3 | 
  4 | This project analyzes weekly mortality data from the United States using advanced data-science techniques including exploratory data analysis (EDA), feature engineering, trend decomposition, and machine-learning-based forecasting.
  5 | 
  6 | The goal is to identify major mortality patterns, understand cause-wise trends, and build predictive models to forecast all-cause mortality.
  7 | 
  8 |  Project Highlights
  9 | 
 10 |  Performed comprehensive EDA to understand mortality behavior
 11 | 
 12 |  Built feature-engineered time-series dataset
 13 | 
 14 |  Applied linear regression forecasting using lag & moving-average features
 15 | 
 16 |  Conducted trend + seasonality decomposition
 17 | 
 18 |  Created heatmaps, time-series plots, and stacked area charts
 19 | 
 20 |  Extracted meaningful insights such as peak mortality weeks and correlations
 21 | 
 22 |  Dataset
 23 | 
 24 | Source: Weekly mortality records (Excel format) containing:
 25 | 
 26 | All-Cause deaths
 27 | 
 28 | COVID-19 deaths
 29 | 
 30 | Major causes (heart disease, cancer, etc.)
 31 | 
 32 | Date of week ending
 33 | 
 34 | Jurisdiction (state names)
 35 | 
 36 |  Technologies Used
 37 | 
 38 | Python 3
 39 | 
 40 | Pandas, NumPy
 41 | 
 42 | Matplotlib, Seaborn
 43 | 
 44 | Statsmodels (seasonal decomposition)
 45 | 
 46 | Scikit-learn (machine learning)
 47 | 
 48 |  Project Workflow
 49 | 1. Import & Clean Data
 50 | 
 51 | Loaded Excel dataset
 52 | 
 53 | Converted date fields
 54 | 
 55 | Filtered to the U.S.-level aggregate
 56 | 
 57 | Filled missing values
 58 | 
 59 | Sorted chronologically
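The cleaning steps above might look like the following minimal sketch. The small in-memory frame is a stand-in for the Excel file and its values are toy numbers, not real mortality counts; the column names follow projectint375.py, and the `clean` helper is hypothetical.

```python
import pandas as pd

# Toy stand-in for the Excel sheet (illustrative values only).
raw = pd.DataFrame({
    "Jurisdiction of Occurrence": ["United States", "Alabama", "United States", "United States"],
    "Week Ending Date": ["2020-01-18", "2020-01-04", "2020-01-04", "2020-01-11"],
    "All Cause": [300, 20, 310, None],
    "COVID-19": [0, 0, None, 5],
})

def clean(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame.copy()
    out.columns = out.columns.str.strip()                             # tidy header whitespace
    out["Week Ending Date"] = pd.to_datetime(out["Week Ending Date"])  # convert date field
    out = out[out["Jurisdiction of Occurrence"] == "United States"]    # national aggregate only
    out = out.fillna(0)                                                # fill missing values
    return out.sort_values("Week Ending Date").reset_index(drop=True)  # chronological order

us = clean(raw)
print(us)
```

Filling gaps with 0 is the simplest option; for death counts a missing week is not the same as zero deaths, so interpolation may be worth considering.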
 60 | 
 61 | 2. Feature Engineering
 62 | 
 63 | Engineered several predictive & analytical features:
 64 | 
 65 | Lag variables (1-week lag)
 66 | 
 67 | Moving Averages (4-week average)
 68 | 
 69 | Week number
 70 | 
 71 | Year
 72 | 
 73 | Rolling trend signals
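As a rough illustration of the features listed above, the sketch below builds a 1-week lag, a 4-week moving average, and the calendar fields on a toy series. The `add_features` helper and the sample values are assumptions made for this example. The moving average is taken over the four preceding weeks (shift, then roll) so that it can serve as a forecasting input without containing the week being predicted.

```python
import pandas as pd

# Toy weekly series (illustrative values only).
weekly = pd.DataFrame({
    "Week Ending Date": pd.date_range("2020-01-04", periods=8, freq="W-SAT"),
    "All Cause": [100, 110, 120, 130, 140, 150, 160, 170],
})

def add_features(frame: pd.DataFrame, column: str, prefix: str) -> pd.DataFrame:
    out = frame.copy()
    previous = out[column].shift(1)                    # value from the prior week
    out[f"{prefix}_Lag1"] = previous
    out[f"{prefix}_MA4"] = previous.rolling(4).mean()  # mean of the 4 prior weeks
    out["year"] = out["Week Ending Date"].dt.year
    out["week"] = out["Week Ending Date"].dt.isocalendar().week.astype(int)
    # The first rows have no history, so their lag/average is NaN; drop them.
    return out.dropna().reset_index(drop=True)

features = add_features(weekly, "All Cause", "AllCause")
print(features)
```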
 74 | 
 75 | 3. Exploratory Data Analysis
 76 | 
 77 | Includes:
 78 | 
 79 | Time-series trend plots
 80 | 
 81 | Correlation matrix
 82 | 
 83 | Distribution of causes
 84 | 
 85 | Stacked cause-wise area charts
 86 | 
 87 | Jurisdiction heatmap
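A jurisdiction heatmap needs one row per state, so the pivot has to be built from records that still include state-level rows. The sketch below shows one way to shape such a grid; the sample values are toy numbers, and dropping the national aggregate is an assumption made here so that it does not dominate the color scale.

```python
import pandas as pd

# Toy records covering two states plus the national aggregate.
records = pd.DataFrame({
    "Jurisdiction of Occurrence": ["Alabama", "Alabama", "Texas", "Texas",
                                   "United States", "United States"],
    "Week Ending Date": pd.to_datetime(["2020-04-04", "2020-04-11"] * 3),
    "COVID-19": [10, 20, 30, 50, 40, 70],
})

# Keep state rows only.
states = records[records["Jurisdiction of Occurrence"] != "United States"]

# Rows are jurisdictions, columns are weeks, cells are weekly deaths.
grid = states.pivot_table(
    index="Jurisdiction of Occurrence",
    columns="Week Ending Date",
    values="COVID-19",
    aggfunc="sum",
)
print(grid)
```

The resulting `grid` can be passed to `seaborn.heatmap(grid, cmap="YlOrRd")` for plotting.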
 88 | 
 89 | 4. Time-Series Trend Decomposition
 90 | 
 91 | Using seasonal_decompose():
 92 | 
 93 | Trend
 94 | 
 95 | Seasonal patterns
 96 | 
 97 | Residual noise
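To make the three components concrete, here is a hand-rolled version of the classical additive method (centered moving-average trend, per-position seasonal means, remainder as residual). It is a simplified sketch for illustration, not statsmodels code; the `additive_decompose` helper and the synthetic series are assumptions, and a short period of 4 is used so the result is easy to check by eye.

```python
import numpy as np
import pandas as pd

def additive_decompose(series: pd.Series, period: int):
    # Trend: centered moving average. For an even period the window is
    # period + 1 wide, with half weight on the two end points.
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    values = series.to_numpy(dtype=float)
    trend = np.full(len(values), np.nan)          # edges stay NaN
    trend[half:len(values) - half] = np.convolve(values, weights, mode="valid")

    # Seasonal: mean detrended value at each position in the cycle,
    # shifted so the pattern averages to zero over one period.
    detrended = values - trend
    position = np.arange(len(values)) % period
    pattern = np.array([np.nanmean(detrended[position == p]) for p in range(period)])
    pattern -= pattern.mean()
    seasonal = pattern[position]

    # Residual: whatever trend and seasonality do not explain.
    resid = values - trend - seasonal
    idx = series.index
    return pd.Series(trend, idx), pd.Series(seasonal, idx), pd.Series(resid, idx)

# Synthetic series: linear trend plus a repeating 4-step pattern.
n, period = 24, 4
true_pattern = np.array([5.0, -5.0, 3.0, -3.0])
y = pd.Series(100 + 2.0 * np.arange(n) + np.tile(true_pattern, n // period))
trend, seasonal, resid = additive_decompose(y, period)
print(pd.DataFrame({"trend": trend, "seasonal": seasonal, "resid": resid}).head(8))
```

Because the synthetic series has no noise, the residual is zero wherever the trend is defined; on real weekly data with `period=52`, at least two full years of observations are needed.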
 98 | 
 99 | 5. Machine Learning Forecast
100 | 
101 | Built a Linear Regression model using:
102 | 
103 | AllCause_Lag1
104 | 
105 | COVID_Lag1
106 | 
107 | AllCause_MA4
108 | 
109 | COVID_MA4
110 | 
111 | Evaluated using MAE & R², with a prediction visualization.
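The evaluation above can be sketched end to end as follows. To keep the example dependency-light, NumPy's least-squares solver stands in for scikit-learn's `LinearRegression` (both fit an ordinary least-squares model with an intercept). The helper names and the random-walk series are assumptions for illustration, not project data. The split is chronological, so the model is always tested on weeks later than those it was trained on.

```python
import numpy as np

def chronological_split(X, y, train_fraction=0.8):
    cut = int(len(y) * train_fraction)   # earliest rows train, latest rows test
    return X[:cut], X[cut:], y[:cut], y[cut:]

def fit_linear(X, y):
    A = np.column_stack([X, np.ones(len(X))])      # append intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # ordinary least squares
    return coef

def predict(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total variation
    return float(1.0 - ss_res / ss_tot)

# Synthetic weekly series (a random walk, not real mortality data).
rng = np.random.default_rng(0)
series = 1000 + np.cumsum(rng.normal(0, 10, size=120))

t = np.arange(4, len(series))
lag1 = series[t - 1]                                 # previous week
ma4 = np.array([series[i - 4:i].mean() for i in t])  # mean of 4 prior weeks
X = np.column_stack([lag1, ma4])
y = series[t]

X_train, X_test, y_train, y_test = chronological_split(X, y)
coef = fit_linear(X_train, y_train)
pred = predict(coef, X_test)
print("MAE:", mae(y_test, pred))
print("R2 Score:", r2(y_test, pred))
```

R² on a held-out window can be negative when the model does worse than predicting the test-set mean, so it is worth reading alongside MAE.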
112 | 
113 |  6. Results & Key Insights
114 |  Peak Mortality
115 | 
116 | Highest COVID-19 mortality week identified
117 | 
118 | Highest All-Cause mortality week identified
119 | 
120 |  Correlation
121 | 
122 | Correlation between weekly COVID-19 and All-Cause deaths (Pearson):
123 | 
124 | A strong positive correlation indicates that the pandemic's impact is visible in overall mortality.
125 | 
126 |  Year-wise mortality totals
127 | 
128 | Helps track increase or decrease over years.
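The three insights above (peak weeks, correlation, year-wise totals) can each be computed in a line or two of pandas. The sketch below uses toy numbers rather than real results, with column names taken from projectint375.py.

```python
import pandas as pd

# Toy weekly data spanning a year boundary (illustrative values only).
weekly = pd.DataFrame({
    "Week Ending Date": pd.to_datetime(
        ["2020-12-19", "2020-12-26", "2021-01-02", "2021-01-09"]),
    "All Cause": [100, 120, 150, 160],
    "COVID-19": [10, 20, 40, 50],
})
weekly["year"] = weekly["Week Ending Date"].dt.year

# Peak weeks: locate the row with the maximum, then read its date.
peak_covid_week = weekly.loc[weekly["COVID-19"].idxmax(), "Week Ending Date"]
peak_allcause_week = weekly.loc[weekly["All Cause"].idxmax(), "Week Ending Date"]

# Pearson correlation (the pandas default) between the two series.
corr = weekly["COVID-19"].corr(weekly["All Cause"])

# Year-wise totals of all-cause deaths.
yearly_totals = weekly.groupby("year")["All Cause"].sum()

print(peak_covid_week.date(), peak_allcause_week.date())
print(f"{corr:.2f}")
print(yearly_totals)
```

Totals for partial years (the first and last year in a dataset) cover fewer weeks, so they are not directly comparable with full years.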
129 | 
130 | Visualizations Included
131 | 
132 | All-Cause vs Natural Deaths Trend
133 | 
134 | Correlation Heatmap
135 | 
136 | Time-Series Decomposition
137 | 
138 | Prediction vs Actual Line Plot
139 | 
140 | Stacked Area Chart for top causes
141 | 
142 | Jurisdiction Heatmap 
143 | 


--------------------------------------------------------------------------------
/projectint375.py:
--------------------------------------------------------------------------------
  1 | # ----------------------------
  2 | # Mortality Trends Analysis Project
  3 | # Data Analysis | EDA | Time-Series ML | Forecasting
  4 | # ----------------------------
  5 | 
  6 | import pandas as pd
  7 | import numpy as np
  8 | import matplotlib.pyplot as plt
  9 | import seaborn as sns
 10 | from sklearn.preprocessing import MinMaxScaler
 11 | from sklearn.linear_model import LinearRegression
 12 | from sklearn.metrics import mean_absolute_error, r2_score
 13 | from statsmodels.tsa.seasonal import seasonal_decompose
 14 | import warnings
 15 | warnings.filterwarnings("ignore")
 16 | 
 17 | # --------------------------------------------
 18 | # 1. Load Dataset
 19 | # --------------------------------------------
 20 | df = pd.read_excel("python dataset.xlsx")
 21 | 
 22 | # Standard column cleanup
 23 | df.columns = df.columns.str.strip()
 24 | 
 25 | print("Dataset Shape:", df.shape)
 26 | print(df.head())
 27 | 
 28 | # --------------------------------------------
 29 | # 2. Data Cleaning
 30 | # --------------------------------------------
 31 | 
 32 | # Convert date column
 33 | df["Week Ending Date"] = pd.to_datetime(df["Week Ending Date"])
 34 | 
 35 | # Replace missing values
 36 | df.fillna(0, inplace=True)
 37 | 
 38 | # Keep only the national aggregate (this removes all state-level rows from df)
 39 | df = df[df["Jurisdiction of Occurrence"] == "United States"]
 40 | 
 41 | # Sort by time
 42 | df = df.sort_values("Week Ending Date")
 43 | 
 44 | # --------------------------------------------
 45 | # 3. Feature Engineering
 46 | # --------------------------------------------
 47 | 
 48 | df["year"] = df["Week Ending Date"].dt.year
 49 | df["week"] = df["Week Ending Date"].dt.isocalendar().week
 50 | 
 51 | # Lag features for ML models
 52 | df["AllCause_Lag1"] = df["All Cause"].shift(1)
 53 | df["COVID_Lag1"] = df["COVID-19"].shift(1)
 54 | 
 55 | # 4-week rolling averages over the preceding weeks (shifted so the target week is excluded)
 56 | df["AllCause_MA4"] = df["All Cause"].shift(1).rolling(4).mean()
 57 | df["COVID_MA4"] = df["COVID-19"].shift(1).rolling(4).mean()
 58 | 
 59 | # Drop initial NaNs from rolling windows
 60 | df = df.dropna()
 61 | 
 62 | # --------------------------------------------
 63 | # 4. Exploratory Data Analysis
 64 | # --------------------------------------------
 65 | plt.figure(figsize=(12, 6))
 66 | plt.plot(df["Week Ending Date"], df["All Cause"], label="All Cause Deaths")
 67 | plt.plot(df["Week Ending Date"], df["Natural Cause"], label="Natural Deaths")
 68 | plt.title("Mortality Trends Over Time")
 69 | plt.xlabel("Date")
 70 | plt.ylabel("Deaths")
 71 | plt.legend()
 72 | plt.grid(True)
 73 | plt.show()
 74 | 
 75 | # Correlation Matrix
 76 | plt.figure(figsize=(14, 8))
 77 | sns.heatmap(df.corr(numeric_only=True), annot=False, cmap="coolwarm")
 78 | plt.title("Correlation Matrix of Mortality Features")
 79 | plt.show()
 80 | 
 81 | # --------------------------------------------
 82 | # 5. Time-Series Decomposition (Trend + Seasonality)
 83 | # --------------------------------------------
 84 | 
 85 | ts = df.set_index("Week Ending Date")["All Cause"]
 86 | result = seasonal_decompose(ts, model="additive", period=52)
 87 | 
 88 | result.plot()
 89 | plt.suptitle("Time-Series Decomposition: All-Cause Mortality", y=1.02)
 90 | plt.show()
 91 | 
 92 | # --------------------------------------------
 93 | # 6. ML Model → Predicting Mortality (Simple Regression)
 94 | # --------------------------------------------
 95 | 
 96 | features = ["AllCause_Lag1", "COVID_Lag1", "AllCause_MA4", "COVID_MA4"]
 97 | X = df[features]
 98 | y = df["All Cause"]
 99 | 
100 | # Train-test split
101 | split = int(len(df) * 0.8)
102 | X_train, X_test = X.iloc[:split], X.iloc[split:]
103 | y_train, y_test = y.iloc[:split], y.iloc[split:]
104 | 
105 | model = LinearRegression()
106 | model.fit(X_train, y_train)
107 | 
108 | # Predictions
109 | pred = model.predict(X_test)
110 | 
111 | print("MAE:", mean_absolute_error(y_test, pred))
112 | print("R2 Score:", r2_score(y_test, pred))
113 | 
114 | # Plot predictions
115 | plt.figure(figsize=(12, 6))
116 | plt.plot(df["Week Ending Date"].iloc[split:], y_test, label="Actual")
117 | plt.plot(df["Week Ending Date"].iloc[split:], pred, label="Predicted")
118 | plt.title("All-Cause Mortality Prediction")
119 | plt.legend()
120 | plt.show()
121 | 
122 | # --------------------------------------------
123 | # 7. Stacked Area Plot of Major Causes
124 | # --------------------------------------------
125 | major_causes = df.set_index("Week Ending Date")[[
126 |     "COVID-19",
127 |     "Diseases of heart",
128 |     "Malignant neoplasms",
129 | ]]
130 | 
131 | major_causes.plot.area(figsize=(12, 6), colormap="Set2")
132 | plt.title("Major Causes of Death — Trend Comparison")
133 | plt.xlabel("Date")
134 | plt.ylabel("Deaths")
135 | plt.show()
136 | 
137 | # --------------------------------------------
138 | # 8. Heatmap by State (df holds only "United States" rows after step 2, so this shows a single row)
139 | # --------------------------------------------
140 | heatmap_df = df.pivot_table(
141 |     index="Jurisdiction of Occurrence",
142 |     columns="Week Ending Date",
143 |     values="COVID-19",
144 |     aggfunc="sum"
145 | )
146 | 
147 | plt.figure(figsize=(16, 8))
148 | sns.heatmap(heatmap_df, cmap="YlOrRd")
149 | plt.title("COVID-19 Mortality Heatmap Across Jurisdictions")
150 | plt.show()
151 | 
152 | # --------------------------------------------
153 | # 9. Insight Extraction
154 | # --------------------------------------------
155 | peak_covid_week = df.loc[df["COVID-19"].idxmax(), "Week Ending Date"]
156 | peak_allcause_week = df.loc[df["All Cause"].idxmax(), "Week Ending Date"]
157 | 
158 | print("\n----- INSIGHTS -----")
159 | print(f"Highest COVID-19 deaths recorded in week ending: {peak_covid_week.date()}")
160 | print(f"Highest all-cause deaths recorded in week ending: {peak_allcause_week.date()}")
161 | 
162 | corr = df["COVID-19"].corr(df["All Cause"])
163 | print(f"Correlation between COVID-19 & All-Cause Mortality: {corr:.2f}")
164 | 
165 | yearly_summary = df.groupby("year")["All Cause"].sum()
166 | print("\nYearly Mortality Totals:")
167 | print(yearly_summary)
168 | 


--------------------------------------------------------------------------------