├── README.md
└── code.py


/README.md:
--------------------------------------------------------------------------------
# Smartwatch Health Dataset Analysis: Uncovering Jaw-Dropping Insights!

> *If you want your ideas to resonate and lead the conversation, the right words, and data, make all the difference!*

This repository houses a comprehensive **Python-based analysis** of the **Smartwatch Health Dataset**, conducted as part of my academic journey at **Lovely Professional University (LPU)**. By leveraging **Pandas**, **NumPy**, **Seaborn**, **Matplotlib**, and **SciPy**, this project cleans, explores, and visualizes health metrics like heart rate, step count, stress levels, and sleep duration to reveal actionable insights about activity levels and well-being.

---

## 📑 Project Overview

The **Smartwatch Health Dataset Analysis** aims to uncover meaningful patterns in health metrics collected from smartwatch users. The dataset includes variables such as `Heart Rate (BPM)`, `Step Count`, `Blood Oxygen Level (%)`, `Sleep Duration (hours)`, and `Stress Level`, categorized by `Activity Level` (Active, Highly Active, Sedentary). Through rigorous data cleaning, exploratory data analysis (EDA), statistical testing, and visualizations, this project highlights how activity levels influence health outcomes.

Key objectives:
- Clean and preprocess the dataset to handle inconsistencies, missing values, and outliers.
- Visualize distributions, correlations, and relationships between health metrics.
- Perform statistical tests (ANOVA, T-tests) to validate findings.
- Derive insights to inform health and wellness strategies.
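
The cleaning objective can be illustrated on a few toy rows. This is a minimal sketch rather than the project's actual script: the `clean` helper and the sample values are hypothetical, only two numeric columns are shown, and `Very High` is mapped to the string `"10"` before numeric conversion (the assumption being that the stress scale tops out at 10).

```python
import pandas as pd

# Made-up rows that mimic the kinds of problems the raw CSV is described as having.
raw = pd.DataFrame({
    "Heart Rate (BPM)": ["72", "ERROR", "88", "65"],
    "Stress Level": ["3", "Very High", None, "5"],
    "Activity Level": ["Actve", "Sedentary", "Highly_Active", None],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: normalise labels, coerce to numbers, impute gaps."""
    out = df.copy()

    # Fix misspelled or inconsistently formatted category labels.
    out["Activity Level"] = out["Activity Level"].replace(
        {"Actve": "Active", "Highly_Active": "Highly Active", "Seddentary": "Sedentary"}
    )

    # Assumption: "Very High" corresponds to the top of a 1-10 stress scale.
    out["Stress Level"] = out["Stress Level"].replace("Very High", "10")

    # Anything that is not a number (e.g. "ERROR") becomes NaN, then the median fills it.
    for col in ["Heart Rate (BPM)", "Stress Level"]:
        out[col] = pd.to_numeric(out[col], errors="coerce")
        out[col] = out[col].fillna(out[col].median())

    # Missing categories take the most frequent label.
    out["Activity Level"] = out["Activity Level"].fillna(out["Activity Level"].mode()[0])
    return out


cleaned = clean(raw)
print(cleaned)
```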

---

## 🛠️ Tools and Technologies

- **Python 3.8+**
- **Libraries**:
  - `pandas`: Data manipulation and analysis
  - `numpy`: Numerical computations
  - `seaborn` & `matplotlib`: Data visualization
  - `scipy`: Statistical testing
- **Dataset**: `unclean_smartwatch_health_data.csv`
- **Environment**: Jupyter Notebook (recommended) or any Python IDE

---

## 📂 Repository Structure

```plaintext
├── data/
│   └── unclean_smartwatch_health_data.csv    # Raw dataset
├── notebooks/
│   └── smartwatch_health_analysis.ipynb      # Main analysis notebook
├── images/
│   └── (generated plots saved here)          # Visualizations (e.g., heatmaps, scatter plots)
├── README.md                                 # Project documentation
└── requirements.txt                          # Dependencies
```

--------------------------------------------------------------------------------
/code.py:
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

plt.style.use('seaborn-v0_8')

# Load the raw dataset and take a first look at it
df = pd.read_csv('unclean_smartwatch_health_data.csv')

print("Dataset Info:")
df.info()  # info() prints directly and returns None, so it is not wrapped in print()
print("\nDataset Description:")
print(df.describe())
print("\nFirst 5 Rows:")
print(df.head())
print("\nLast 5 Rows:")
print(df.tail())

# Normalise misspelled or inconsistently formatted activity labels
df['Activity Level'] = df['Activity Level'].replace({
    'Highly_Active': 'Highly Active',
    'Actve': 'Active',
    'Seddentary': 'Sedentary'
})

# Map the textual stress label onto the numeric scale
df['Stress Level'] = df['Stress Level'].replace('Very High', 10)

# Treat placeholder strings as missing values
df.replace(['ERROR', 'nan', ''], np.nan, inplace=True)

df['User ID'] = df['User ID'].fillna(0)

numeric_cols = ['Heart Rate (BPM)', 'Blood Oxygen Level (%)', 'Step Count', 'Sleep Duration (hours)', 'Stress Level']
for col in numeric_cols:
    # Values that cannot be parsed as numbers become NaN
    df[col] = pd.to_numeric(df[col], errors='coerce')

df['User ID'] = df['User ID'].astype(int)

# Impute missing values: mode for the category, median for numeric columns
df['Activity Level'] = df['Activity Level'].fillna(df['Activity Level'].mode()[0])

for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())

df.drop_duplicates(inplace=True)

df['Activity Level Encoded'] = pd.Categorical(df['Activity Level']).codes

# Sanity check: this runs after imputation, so the heatmap should show no missing values
plt.figure(figsize=(12, 8))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

# Distribution of each numeric column
plt.figure(figsize=(15, 10))
for i, col in enumerate(numeric_cols):
    plt.subplot(3, 2, i+1)
    sns.histplot(df[col], kde=True, color=sns.color_palette('husl', 5)[i])
    plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.show()

# Boxplots to spot outliers in each numeric column
plt.figure(figsize=(15, 10))
for i, col in enumerate(numeric_cols):
    plt.subplot(3, 2, i+1)
    sns.boxplot(y=df[col], color=sns.color_palette('Set2', 5)[i])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()

corr_matrix = df[numeric_cols].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.show()

cov_matrix = df[numeric_cols].cov()
plt.figure(figsize=(10, 8))
sns.heatmap(cov_matrix, annot=True, cmap='YlGnBu', fmt='.2f')
plt.title('Covariance Matrix')
plt.show()

plt.figure(figsize=(12, 6))
sns.scatterplot(data=df, x='Step Count', y='Heart Rate (BPM)', hue='Activity Level', size='Stress Level', palette='deep')
plt.title('Step Count vs Heart Rate by Activity Level and Stress')
plt.show()

plt.figure(figsize=(12, 6))
sns.scatterplot(data=df, x='Sleep Duration (hours)', y='Blood Oxygen Level (%)', hue='Activity Level', palette='muted')
plt.title('Sleep Duration vs Blood Oxygen Level by Activity Level')
plt.show()

plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='Activity Level', palette='pastel')
plt.title('Count of Activity Levels')
plt.show()

plt.figure(figsize=(10, 6))
sns.barplot(data=df, x='Activity Level', y='Stress Level', palette='bright')
plt.title('Average Stress Level by Activity Level')
plt.show()

plt.figure(figsize=(10, 6))
sns.barplot(data=df, x='Activity Level', y='Step Count', palette='dark')
plt.title('Average Step Count by Activity Level')
plt.show()

g = sns.pairplot(df[numeric_cols + ['Activity Level']], hue='Activity Level', palette='Set1')
g.fig.suptitle('Pairplot of Numeric Variables by Activity Level', y=1.02)
plt.show()

plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='Activity Level', y='Heart Rate (BPM)', palette='colorblind')
plt.title('Heart Rate Distribution by Activity Level')
plt.show()

plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='Activity Level', y='Sleep Duration (hours)', palette='cubehelix')
plt.title('Sleep Duration Distribution by Activity Level')
plt.show()

plt.figure(figsize=(10, 6))
sns.kdeplot(data=df, x='Heart Rate (BPM)', hue='Activity Level', fill=True, palette='viridis')
plt.title('Heart Rate Density by Activity Level')
plt.show()

plt.figure(figsize=(10, 6))
sns.kdeplot(data=df, x='Step Count', hue='Activity Level', fill=True, palette='magma')
plt.title('Step Count Density by Activity Level')
plt.show()

plt.figure(figsize=(10, 6))
sns.violinplot(data=df, x='Activity Level', y='Stress Level', palette='rocket')
plt.title('Stress Level Distribution by Activity Level')
plt.show()

plt.figure(figsize=(10, 6))
sns.violinplot(data=df, x='Activity Level', y='Blood Oxygen Level (%)', palette='mako')
plt.title('Blood Oxygen Level Distribution by Activity Level')
plt.show()

grouped_stats = df.groupby('Activity Level')[numeric_cols].agg(['mean', 'std', 'min', 'max'])
print("\nGrouped Statistics by Activity Level:")
print(grouped_stats)

plt.figure(figsize=(12, 6))
sns.lineplot(data=df, x='Sleep Duration (hours)', y='Stress Level', hue='Activity Level', palette='tab10')
plt.title('Stress Level vs Sleep Duration by Activity Level')
plt.show()

plt.figure(figsize=(12, 6))
sns.lineplot(data=df, x='Step Count', y='Heart Rate (BPM)', hue='Activity Level', palette='Accent')
plt.title('Heart Rate vs Step Count by Activity Level')
plt.show()

plt.figure(figsize=(10, 6))
sns.regplot(data=df, x='Step Count', y='Heart Rate (BPM)', scatter_kws={'alpha':0.5}, color='purple')
plt.title('Regression Plot: Step Count vs Heart Rate')
plt.show()

plt.figure(figsize=(10, 6))
sns.regplot(data=df, x='Sleep Duration (hours)', y='Stress Level', scatter_kws={'alpha':0.5}, color='teal')
plt.title('Regression Plot: Sleep Duration vs Stress Level')
plt.show()

# Mean of every numeric column per activity level
pivot_table = df.pivot_table(values=numeric_cols, index='Activity Level', aggfunc='mean')
print("\nPivot Table of Means by Activity Level:")
print(pivot_table)

# DataFrame.plot creates its own figure, so no separate plt.figure() call is needed
pivot_table.plot(kind='bar', figsize=(12, 6))
plt.title('Mean Values by Activity Level')
plt.ylabel('Mean Value')
plt.xticks(rotation=45)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

# Flag rows where any numeric column lies more than 3 standard deviations from its mean
z_scores = np.abs(stats.zscore(df[numeric_cols], nan_policy='omit'))
outliers = (z_scores > 3).any(axis=1)
print("\nNumber of Outliers Detected:", outliers.sum())

# Copy so that columns added later do not trigger SettingWithCopyWarning
df_cleaned = df[~outliers].copy()

plt.figure(figsize=(15, 10))
for i, col in enumerate(numeric_cols):
    plt.subplot(3, 2, i+1)
    sns.histplot(df_cleaned[col], kde=True, color=sns.color_palette('Paired', 5)[i])
    plt.title(f'Cleaned Distribution of {col}')
plt.tight_layout()
plt.show()

plt.figure(figsize=(10, 8))
sns.heatmap(df_cleaned[numeric_cols].corr(), annot=True, cmap='Spectral', center=0)
plt.title('Correlation Matrix (Cleaned Data)')
plt.show()

plt.figure(figsize=(12, 6))
sns.scatterplot(data=df_cleaned, x='Step Count', y='Heart Rate (BPM)', hue='Activity Level', size='Stress Level', palette='cool')
plt.title('Cleaned: Step Count vs Heart Rate by Activity Level and Stress')
plt.show()

skewness = df_cleaned[numeric_cols].skew()
kurtosis = df_cleaned[numeric_cols].kurtosis()
print("\nSkewness of Numeric Columns:")
print(skewness)
print("\nKurtosis of Numeric Columns:")
print(kurtosis)

plt.figure(figsize=(10, 6))
sns.heatmap(df_cleaned.groupby('Activity Level')[numeric_cols].mean(), annot=True, cmap='Blues', fmt='.2f')
plt.title('Mean Values by Activity Level (Heatmap)')
plt.show()

plt.figure(figsize=(12, 6))
sns.boxenplot(data=df_cleaned, x='Activity Level', y='Step Count', palette='Set3')
plt.title('Enhanced Boxplot: Step Count by Activity Level')
plt.show()

plt.figure(figsize=(12, 6))
sns.boxenplot(data=df_cleaned, x='Activity Level', y='Sleep Duration (hours)', palette='YlOrRd')
plt.title('Enhanced Boxplot: Sleep Duration by Activity Level')
plt.show()

# Equal-width bins for step count and sleep duration
df_cleaned['Step Count Binned'] = pd.cut(df_cleaned['Step Count'], bins=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
plt.figure(figsize=(10, 6))
sns.countplot(data=df_cleaned, x='Step Count Binned', hue='Activity Level', palette='deep')
plt.title('Step Count Bins by Activity Level')
plt.show()

df_cleaned['Sleep Duration Binned'] = pd.cut(df_cleaned['Sleep Duration (hours)'], bins=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
plt.figure(figsize=(10, 6))
sns.countplot(data=df_cleaned, x='Sleep Duration Binned', hue='Activity Level', palette='muted')
plt.title('Sleep Duration Bins by Activity Level')
plt.show()

print("\nValue Counts for Activity Level:")
print(df_cleaned['Activity Level'].value_counts())
print("\nValue Counts for Step Count Binned:")
print(df_cleaned['Step Count Binned'].value_counts())
print("\nValue Counts for Sleep Duration Binned:")
print(df_cleaned['Sleep Duration Binned'].value_counts())

# catplot is figure-level: it builds its own figure, so size and title are set on the returned grid
g = sns.catplot(data=df_cleaned, x='Activity Level', y='Heart Rate (BPM)', hue='Step Count Binned', kind='box', palette='viridis', height=6, aspect=2)
g.fig.suptitle('Heart Rate by Activity Level and Step Count Bin', y=1.02)
plt.show()

g = sns.catplot(data=df_cleaned, x='Activity Level', y='Stress Level', hue='Sleep Duration Binned', kind='box', palette='magma', height=6, aspect=2)
g.fig.suptitle('Stress Level by Activity Level and Sleep Duration Bin', y=1.02)
plt.show()

# Rolling mean over row order (the rows are not necessarily a time series)
rolling_mean = df_cleaned[numeric_cols].rolling(window=10).mean()
plt.figure(figsize=(15, 10))
for i, col in enumerate(numeric_cols):
    plt.subplot(3, 2, i+1)
    plt.plot(rolling_mean[col], color=sns.color_palette('Dark2', 5)[i])
    plt.title(f'Rolling Mean of {col} (Window=10)')
plt.tight_layout()
plt.show()

# One-way ANOVA: does mean heart rate differ across the three activity levels?
anova_result = stats.f_oneway(
    df_cleaned[df_cleaned['Activity Level'] == 'Active']['Heart Rate (BPM)'].dropna(),
    df_cleaned[df_cleaned['Activity Level'] == 'Highly Active']['Heart Rate (BPM)'].dropna(),
    df_cleaned[df_cleaned['Activity Level'] == 'Sedentary']['Heart Rate (BPM)'].dropna()
)
print("\nANOVA Test for Heart Rate by Activity Level:")
print(f"F-statistic: {anova_result.statistic:.2f}, p-value: {anova_result.pvalue:.4f}")

# Independent two-sample t-test: stress level of Active vs Sedentary users
t_stat, p_val = stats.ttest_ind(
    df_cleaned[df_cleaned['Activity Level'] == 'Active']['Stress Level'].dropna(),
    df_cleaned[df_cleaned['Activity Level'] == 'Sedentary']['Stress Level'].dropna()
)
print("\nT-test for Stress Level (Active vs Sedentary):")
print(f"T-statistic: {t_stat:.2f}, p-value: {p_val:.4f}")

plt.figure(figsize=(10, 6))
sns.heatmap(df_cleaned.groupby('Activity Level')[numeric_cols].std(), annot=True, cmap='OrRd', fmt='.2f')
plt.title('Standard Deviation by Activity Level (Heatmap)')
plt.show()

# Quartile-based heart rate bins
df_cleaned['Heart Rate Binned'] = pd.qcut(df_cleaned['Heart Rate (BPM)'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])
plt.figure(figsize=(10, 6))
sns.countplot(data=df_cleaned, x='Heart Rate Binned', hue='Activity Level', palette='cool')
plt.title('Heart Rate Bins by Activity Level')
plt.show()

# Running total of steps within each activity level, in row order
cumsum_steps = df_cleaned.groupby('Activity Level')['Step Count'].cumsum()
plt.figure(figsize=(12, 6))
for level in df_cleaned['Activity Level'].unique():
    plt.plot(cumsum_steps[df_cleaned['Activity Level'] == level], label=level)
plt.title('Cumulative Step Count by Activity Level')
plt.xlabel('Index')
plt.ylabel('Cumulative Steps')
plt.legend()
plt.show()

# multiple='stack' makes the plot match its "(Stacked)" title; the default would overlay the groups
plt.figure(figsize=(10, 6))
sns.histplot(data=df_cleaned, x='Step Count', hue='Activity Level', element='step', multiple='stack', palette='tab10')
plt.title('Step Count Histogram by Activity Level (Stacked)')
plt.show()

plt.figure(figsize=(10, 6))
sns.ecdfplot(data=df_cleaned, x='Sleep Duration (hours)', hue='Activity Level', palette='Set2')
plt.title('ECDF of Sleep Duration by Activity Level')
plt.show()

# Mean step count and heart rate per activity level, split into three stress-level bins
pivot_table_multi = df_cleaned.pivot_table(values=['Step Count', 'Heart Rate (BPM)'], index='Activity Level', columns=pd.cut(df_cleaned['Stress Level'], bins=3), aggfunc='mean')
plt.figure(figsize=(12, 6))
sns.heatmap(pivot_table_multi['Step Count'], annot=True, cmap='PuBu', fmt='.0f')
plt.title('Mean Step Count by Activity Level and Stress Level Bins')
plt.show()

print("\nDescriptive Statistics After Cleaning:")
print(df_cleaned[numeric_cols].describe())

print("\nMedian Values by Activity Level:")
print(df_cleaned.groupby('Activity Level')[numeric_cols].median())

print("\nQuantiles of Numeric Columns:")
print(df_cleaned[numeric_cols].quantile([0.25, 0.5, 0.75]))

# Note: the insights below are fixed text and are not computed from the results above.
# Check them against the printed statistics and p-values before relying on them.
print("\nInsights:")
print("1. Active users show higher step counts and moderate heart rates compared to Sedentary users.")
print("2. Stress levels are lower on average for Highly Active users.")
print("3. Blood Oxygen Levels are relatively stable across activity levels.")
print("4. Sleep Duration has a slight negative correlation with Stress Level.")
print("5. Outlier removal normalized distributions, especially for Step Count.")
print("6. ANOVA test indicates significant differences in Heart Rate across Activity Levels.")
print("7. T-test suggests Stress Levels differ significantly between Active and Sedentary groups.")
print("8. Step Count distributions are right-skewed, especially for Highly Active users.")
print("9. Sleep Duration shows less variability in Active users.")
print("10. Cumulative step counts highlight the dominance of Highly Active users in total activity.")

--------------------------------------------------------------------------------