├── README.md
└── code.py


/README.md:
--------------------------------------------------------------------------------
# Smartwatch Health Dataset Analysis: Uncovering Jaw-Dropping Insights!

> *If you want your ideas to resonate and lead the conversation, the right words (and data) make all the difference!*

This repository houses a comprehensive **Python-based analysis** of the **Smartwatch Health Dataset**, conducted as part of my academic journey at **Lovely Professional University (LPU)**. Using **Pandas**, **NumPy**, **Seaborn**, **Matplotlib**, and **SciPy**, the project cleans, explores, and visualizes health metrics such as heart rate, step count, stress level, and sleep duration to examine how they relate to activity level and well-being.

---

## 📑 Project Overview

The **Smartwatch Health Dataset Analysis** aims to uncover meaningful patterns in health metrics collected from smartwatch users. The dataset includes the variables `Heart Rate (BPM)`, `Step Count`, `Blood Oxygen Level (%)`, `Sleep Duration (hours)`, and `Stress Level`, grouped by `Activity Level` (Active, Highly Active, Sedentary). Through data cleaning, exploratory data analysis (EDA), statistical testing, and visualization, the project examines how these health metrics vary across activity levels.

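The cleaning steps summarized above can be illustrated on a tiny in-memory table. This is a minimal sketch, not the project's full pipeline: the column names and the misspelled labels match the ones handled in `code.py`, but the sample rows are hypothetical and invented purely for illustration.

```python
import pandas as pd

# Hypothetical sample rows (assumption); the real values come from the CSV.
sample = pd.DataFrame({
    'Heart Rate (BPM)': ['72.5', 'ERROR', '88.0'],
    'Stress Level': ['3', 'Very High', '5'],
    'Activity Level': ['Actve', 'Highly_Active', 'Seddentary'],
})

# Normalize misspelled or inconsistently formatted activity labels.
sample['Activity Level'] = sample['Activity Level'].replace({
    'Highly_Active': 'Highly Active',
    'Actve': 'Active',
    'Seddentary': 'Sedentary',
})

# Map the textual stress label to a number, then coerce both columns to
# numeric; unparseable entries such as 'ERROR' become NaN.
sample['Stress Level'] = sample['Stress Level'].replace('Very High', '10')
for col in ['Heart Rate (BPM)', 'Stress Level']:
    sample[col] = pd.to_numeric(sample[col], errors='coerce')

# Fill the remaining gap with the column median.
hr = sample['Heart Rate (BPM)']
sample['Heart Rate (BPM)'] = hr.fillna(hr.median())

print(sample)
```

Median imputation is used here because it is less sensitive to extreme readings than the mean; whether it is appropriate for the real data depends on how much is missing.
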
Key objectives:
- Clean and preprocess the dataset to handle inconsistencies, missing values, and outliers.
- Visualize distributions, correlations, and relationships between health metrics.
- Perform statistical tests (ANOVA, t-tests) to validate findings.
- Derive insights to inform health and wellness strategies.

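As a rough illustration of the statistical-testing objective, the sketch below runs a one-way ANOVA and a two-sample t-test with SciPy. The three groups are synthetic (drawn from normal distributions with made-up means), so the printed numbers say nothing about the actual dataset. The Welch variant (`equal_var=False`) is a choice made for this sketch and differs from the default pooled-variance test.

```python
import numpy as np
from scipy import stats

# Synthetic heart-rate samples per activity level (assumption, not real data).
rng = np.random.default_rng(0)
active = rng.normal(80, 5, 50)
highly_active = rng.normal(90, 5, 50)
sedentary = rng.normal(70, 5, 50)

# One-way ANOVA: do the three group means differ?
anova = stats.f_oneway(active, highly_active, sedentary)

# Welch's t-test: compare two groups without assuming equal variances.
welch = stats.ttest_ind(active, sedentary, equal_var=False)

alpha = 0.05  # conventional significance threshold
print(f"ANOVA: F = {anova.statistic:.2f}, p = {anova.pvalue:.4g}, "
      f"significant = {anova.pvalue < alpha}")
print(f"Welch t-test: t = {welch.statistic:.2f}, p = {welch.pvalue:.4g}, "
      f"significant = {welch.pvalue < alpha}")
```

Reporting the p-value next to the verdict keeps the conclusion tied to the computed result instead of to an expectation.
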
---

## 🛠️ Tools and Technologies

- **Python 3.8+**
- **Libraries**:
  - `pandas`: Data manipulation and analysis
  - `numpy`: Numerical computations
  - `seaborn` & `matplotlib`: Data visualization
  - `scipy`: Statistical testing
- **Dataset**: `unclean_smartwatch_health_data.csv`
- **Environment**: Jupyter Notebook (recommended) or any Python IDE

---

## 📂 Repository Structure

```plaintext
├── data/
│   └── unclean_smartwatch_health_data.csv  # Raw dataset
├── notebooks/
│   └── smartwatch_health_analysis.ipynb   # Main analysis notebook
├── images/
│   └── (generated plots saved here)       # Visualizations (e.g., heatmaps, scatter plots)
├── README.md                             # Project documentation
└── requirements.txt                      # Dependencies
```


--------------------------------------------------------------------------------
/code.py:
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# The 'seaborn-v0_8' style name requires matplotlib 3.6 or newer.
plt.style.use('seaborn-v0_8')

# Load the raw dataset (expected in the current working directory).
df = pd.read_csv('unclean_smartwatch_health_data.csv')

# Quick inspection of the raw data.
print("Dataset Info:")
df.info()  # info() prints its report directly and returns None
print("\nDataset Description:")
print(df.describe())
print("\nFirst 5 Rows:")
print(df.head())
print("\nLast 5 Rows:")
print(df.tail())

# Normalize misspelled or inconsistently formatted activity labels.
df['Activity Level'] = df['Activity Level'].replace({
    'Highly_Active': 'Highly Active',
    'Actve': 'Active',
    'Seddentary': 'Sedentary'
})

# Map the textual stress label onto the numeric scale.
df['Stress Level'] = df['Stress Level'].replace('Very High', 10)

# Treat placeholder strings as missing values.
df.replace(['ERROR', 'nan', ''], np.nan, inplace=True)

# Missing user IDs get the sentinel value 0.
df['User ID'] = df['User ID'].fillna(0)

# Coerce the measurement columns to numeric; unparseable entries become NaN.
numeric_cols = ['Heart Rate (BPM)', 'Blood Oxygen Level (%)', 'Step Count', 'Sleep Duration (hours)', 'Stress Level']
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

df['User ID'] = df['User ID'].astype(int)

# Impute missing values: mode for the categorical column, median for numeric ones.
df['Activity Level'] = df['Activity Level'].fillna(df['Activity Level'].mode()[0])

for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())

df.drop_duplicates(inplace=True)

# Integer codes for the activity categories (assigned in sorted label order).
df['Activity Level Encoded'] = pd.Categorical(df['Activity Level']).codes

# Sanity check: this runs after imputation, so the heatmap is expected to show
# few or no missing cells.
plt.figure(figsize=(12, 8))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

# Histograms with KDE overlays for each numeric column.
plt.figure(figsize=(15, 10))
for i, col in enumerate(numeric_cols):
    plt.subplot(3, 2, i+1)
    sns.histplot(df[col], kde=True, color=sns.color_palette('husl', 5)[i])
    plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.show()

# Boxplots to expose outliers in each numeric column.
plt.figure(figsize=(15, 10))
for i, col in enumerate(numeric_cols):
    plt.subplot(3, 2, i+1)
    sns.boxplot(y=df[col], color=sns.color_palette('Set2', 5)[i])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()

# Pairwise Pearson correlations between the numeric columns.
corr_matrix = df[numeric_cols].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.show()

# Covariances depend on each column's units, so values are not comparable across cells.
cov_matrix = df[numeric_cols].cov()
plt.figure(figsize=(10, 8))
sns.heatmap(cov_matrix, annot=True, cmap='YlGnBu', fmt='.2f')
plt.title('Covariance Matrix')
plt.show()

# Relationships between metrics, colored by activity level.
plt.figure(figsize=(12, 6))
sns.scatterplot(data=df, x='Step Count', y='Heart Rate (BPM)', hue='Activity Level', size='Stress Level', palette='deep')
plt.title('Step Count vs Heart Rate by Activity Level and Stress')
plt.show()

plt.figure(figsize=(12, 6))
sns.scatterplot(data=df, x='Sleep Duration (hours)', y='Blood Oxygen Level (%)', hue='Activity Level', palette='muted')
plt.title('Sleep Duration vs Blood Oxygen Level by Activity Level')
plt.show()

# Group sizes and per-group means (barplot shows the mean with a confidence interval).
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='Activity Level', palette='pastel')
plt.title('Count of Activity Levels')
plt.show()

plt.figure(figsize=(10, 6))
sns.barplot(data=df, x='Activity Level', y='Stress Level', palette='bright')
plt.title('Average Stress Level by Activity Level')
plt.show()

plt.figure(figsize=(10, 6))
sns.barplot(data=df, x='Activity Level', y='Step Count', palette='dark')
plt.title('Average Step Count by Activity Level')
plt.show()

# pairplot creates its own figure; the title is set on that figure.
g = sns.pairplot(df[numeric_cols + ['Activity Level']], hue='Activity Level', palette='Set1')
g.figure.suptitle('Pairplot of Numeric Variables by Activity Level', y=1.02)
plt.show()

# Distribution of each metric within every activity level.
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='Activity Level', y='Heart Rate (BPM)', palette='colorblind')
plt.title('Heart Rate Distribution by Activity Level')
plt.show()

plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='Activity Level', y='Sleep Duration (hours)', palette='cubehelix')
plt.title('Sleep Duration Distribution by Activity Level')
plt.show()

plt.figure(figsize=(10, 6))
sns.kdeplot(data=df, x='Heart Rate (BPM)', hue='Activity Level', fill=True, palette='viridis')
plt.title('Heart Rate Density by Activity Level')
plt.show()

plt.figure(figsize=(10, 6))
sns.kdeplot(data=df, x='Step Count', hue='Activity Level', fill=True, palette='magma')
plt.title('Step Count Density by Activity Level')
plt.show()

plt.figure(figsize=(10, 6))
sns.violinplot(data=df, x='Activity Level', y='Stress Level', palette='rocket')
plt.title('Stress Level Distribution by Activity Level')
plt.show()

plt.figure(figsize=(10, 6))
sns.violinplot(data=df, x='Activity Level', y='Blood Oxygen Level (%)', palette='mako')
plt.title('Blood Oxygen Level Distribution by Activity Level')
plt.show()

# Summary statistics per activity level.
grouped_stats = df.groupby('Activity Level')[numeric_cols].agg(['mean', 'std', 'min', 'max'])
print("\nGrouped Statistics by Activity Level:")
print(grouped_stats)

# Line plots aggregate repeated x values to their mean with a confidence band.
plt.figure(figsize=(12, 6))
sns.lineplot(data=df, x='Sleep Duration (hours)', y='Stress Level', hue='Activity Level', palette='tab10')
plt.title('Stress Level vs Sleep Duration by Activity Level')
plt.show()

plt.figure(figsize=(12, 6))
sns.lineplot(data=df, x='Step Count', y='Heart Rate (BPM)', hue='Activity Level', palette='Accent')
plt.title('Heart Rate vs Step Count by Activity Level')
plt.show()

# Scatter plots with a fitted linear regression line.
plt.figure(figsize=(10, 6))
sns.regplot(data=df, x='Step Count', y='Heart Rate (BPM)', scatter_kws={'alpha':0.5}, color='purple')
plt.title('Regression Plot: Step Count vs Heart Rate')
plt.show()

plt.figure(figsize=(10, 6))
sns.regplot(data=df, x='Sleep Duration (hours)', y='Stress Level', scatter_kws={'alpha':0.5}, color='teal')
plt.title('Regression Plot: Sleep Duration vs Stress Level')
plt.show()

# Mean of every numeric column per activity level.
pivot_table = df.pivot_table(values=numeric_cols, index='Activity Level', aggfunc='mean')
print("\nPivot Table of Means by Activity Level:")
print(pivot_table)

# DataFrame.plot creates its own figure, so no separate plt.figure() call is needed.
pivot_table.plot(kind='bar', figsize=(12, 6))
plt.title('Mean Values by Activity Level')
plt.ylabel('Mean Value')
plt.xticks(rotation=45)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

# Flag rows where any numeric column lies more than 3 standard deviations from its mean.
z_scores = np.abs(stats.zscore(df[numeric_cols], nan_policy='omit'))
outliers = (z_scores > 3).any(axis=1)
print("\nNumber of Outliers Detected:", outliers.sum())

# Copy the filtered rows so that columns added later do not trigger
# pandas' SettingWithCopyWarning.
df_cleaned = df[~outliers].copy()

plt.figure(figsize=(15, 10))
for i, col in enumerate(numeric_cols):
    plt.subplot(3, 2, i+1)
    sns.histplot(df_cleaned[col], kde=True, color=sns.color_palette('Paired', 5)[i])
    plt.title(f'Cleaned Distribution of {col}')
plt.tight_layout()
plt.show()

plt.figure(figsize=(10, 8))
sns.heatmap(df_cleaned[numeric_cols].corr(), annot=True, cmap='Spectral', center=0)
plt.title('Correlation Matrix (Cleaned Data)')
plt.show()

plt.figure(figsize=(12, 6))
sns.scatterplot(data=df_cleaned, x='Step Count', y='Heart Rate (BPM)', hue='Activity Level', size='Stress Level', palette='cool')
plt.title('Cleaned: Step Count vs Heart Rate by Activity Level and Stress')
plt.show()

# Shape of each distribution after outlier removal.
skewness = df_cleaned[numeric_cols].skew()
kurtosis = df_cleaned[numeric_cols].kurtosis()
print("\nSkewness of Numeric Columns:")
print(skewness)
print("\nKurtosis of Numeric Columns:")
print(kurtosis)

plt.figure(figsize=(10, 6))
sns.heatmap(df_cleaned.groupby('Activity Level')[numeric_cols].mean(), annot=True, cmap='Blues', fmt='.2f')
plt.title('Mean Values by Activity Level (Heatmap)')
plt.show()

plt.figure(figsize=(12, 6))
sns.boxenplot(data=df_cleaned, x='Activity Level', y='Step Count', palette='Set3')
plt.title('Enhanced Boxplot: Step Count by Activity Level')
plt.show()

plt.figure(figsize=(12, 6))
sns.boxenplot(data=df_cleaned, x='Activity Level', y='Sleep Duration (hours)', palette='YlOrRd')
plt.title('Enhanced Boxplot: Sleep Duration by Activity Level')
plt.show()

# Bin step count and sleep duration into five equal-width intervals.
df_cleaned['Step Count Binned'] = pd.cut(df_cleaned['Step Count'], bins=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
plt.figure(figsize=(10, 6))
sns.countplot(data=df_cleaned, x='Step Count Binned', hue='Activity Level', palette='deep')
plt.title('Step Count Bins by Activity Level')
plt.show()

df_cleaned['Sleep Duration Binned'] = pd.cut(df_cleaned['Sleep Duration (hours)'], bins=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
plt.figure(figsize=(10, 6))
sns.countplot(data=df_cleaned, x='Sleep Duration Binned', hue='Activity Level', palette='muted')
plt.title('Sleep Duration Bins by Activity Level')
plt.show()

print("\nValue Counts for Activity Level:")
print(df_cleaned['Activity Level'].value_counts())
print("\nValue Counts for Step Count Binned:")
print(df_cleaned['Step Count Binned'].value_counts())
print("\nValue Counts for Sleep Duration Binned:")
print(df_cleaned['Sleep Duration Binned'].value_counts())

# catplot is figure-level and creates its own figure, so the size is set via
# height/aspect instead of a preceding plt.figure() call.
g = sns.catplot(data=df_cleaned, x='Activity Level', y='Heart Rate (BPM)', hue='Step Count Binned', kind='box', palette='viridis', height=6, aspect=2)
g.figure.suptitle('Heart Rate by Activity Level and Step Count Bin', y=1.02)
plt.show()

g = sns.catplot(data=df_cleaned, x='Activity Level', y='Stress Level', hue='Sleep Duration Binned', kind='box', palette='magma', height=6, aspect=2)
g.figure.suptitle('Stress Level by Activity Level and Sleep Duration Bin', y=1.02)
plt.show()

# Rolling mean over row order (window of 10 rows); the rows are not a time
# series, so this only smooths values in file order.
rolling_mean = df_cleaned[numeric_cols].rolling(window=10).mean()
plt.figure(figsize=(15, 10))
for i, col in enumerate(numeric_cols):
    plt.subplot(3, 2, i+1)
    plt.plot(rolling_mean[col], color=sns.color_palette('Dark2', 5)[i])
    plt.title(f'Rolling Mean of {col} (Window=10)')
plt.tight_layout()
plt.show()

# One-way ANOVA: does mean heart rate differ across the three activity levels?
anova_result = stats.f_oneway(
    df_cleaned[df_cleaned['Activity Level'] == 'Active']['Heart Rate (BPM)'].dropna(),
    df_cleaned[df_cleaned['Activity Level'] == 'Highly Active']['Heart Rate (BPM)'].dropna(),
    df_cleaned[df_cleaned['Activity Level'] == 'Sedentary']['Heart Rate (BPM)'].dropna()
)
print("\nANOVA Test for Heart Rate by Activity Level:")
print(f"F-statistic: {anova_result.statistic:.2f}, p-value: {anova_result.pvalue:.4f}")

# Independent two-sample t-test (equal variances assumed by default).
t_stat, p_val = stats.ttest_ind(
    df_cleaned[df_cleaned['Activity Level'] == 'Active']['Stress Level'].dropna(),
    df_cleaned[df_cleaned['Activity Level'] == 'Sedentary']['Stress Level'].dropna()
)
print("\nT-test for Stress Level (Active vs Sedentary):")
print(f"T-statistic: {t_stat:.2f}, p-value: {p_val:.4f}")

plt.figure(figsize=(10, 6))
sns.heatmap(df_cleaned.groupby('Activity Level')[numeric_cols].std(), annot=True, cmap='OrRd', fmt='.2f')
plt.title('Standard Deviation by Activity Level (Heatmap)')
plt.show()

# Quartile-based heart rate bins (roughly equal counts per bin).
df_cleaned['Heart Rate Binned'] = pd.qcut(df_cleaned['Heart Rate (BPM)'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])
plt.figure(figsize=(10, 6))
sns.countplot(data=df_cleaned, x='Heart Rate Binned', hue='Activity Level', palette='cool')
plt.title('Heart Rate Bins by Activity Level')
plt.show()

# Running total of steps within each activity level, in row order.
cumsum_steps = df_cleaned.groupby('Activity Level')['Step Count'].cumsum()
plt.figure(figsize=(12, 6))
for level in df_cleaned['Activity Level'].unique():
    plt.plot(cumsum_steps[df_cleaned['Activity Level'] == level], label=level)
plt.title('Cumulative Step Count by Activity Level')
plt.xlabel('Index')
plt.ylabel('Cumulative Steps')
plt.legend()
plt.show()

# multiple='stack' is what actually stacks the groups, matching the title.
plt.figure(figsize=(10, 6))
sns.histplot(data=df_cleaned, x='Step Count', hue='Activity Level', multiple='stack', palette='tab10')
plt.title('Step Count Histogram by Activity Level (Stacked)')
plt.show()

plt.figure(figsize=(10, 6))
sns.ecdfplot(data=df_cleaned, x='Sleep Duration (hours)', hue='Activity Level', palette='Set2')
plt.title('ECDF of Sleep Duration by Activity Level')
plt.show()

# Mean step count and heart rate per activity level and stress-level tercile.
pivot_table_multi = df_cleaned.pivot_table(values=['Step Count', 'Heart Rate (BPM)'], index='Activity Level', columns=pd.cut(df_cleaned['Stress Level'], bins=3), aggfunc='mean')
plt.figure(figsize=(12, 6))
sns.heatmap(pivot_table_multi['Step Count'], annot=True, cmap='PuBu', fmt='.0f')
plt.title('Mean Step Count by Activity Level and Stress Level Bins')
plt.show()

print("\nDescriptive Statistics After Cleaning:")
print(df_cleaned[numeric_cols].describe())

print("\nMedian Values by Activity Level:")
print(df_cleaned.groupby('Activity Level')[numeric_cols].median())

print("\nQuantiles of Numeric Columns:")
print(df_cleaned[numeric_cols].quantile([0.25, 0.5, 0.75]))

# Items 6 and 7 are derived from the computed p-values; the remaining items are
# a fixed narrative summary and should be checked against the outputs above.
alpha = 0.05  # significance threshold
anova_verdict = "significant" if anova_result.pvalue < alpha else "no significant"
ttest_verdict = "differ significantly" if p_val < alpha else "do not differ significantly"

print("\nInsights:")
print("1. Active users show higher step counts and moderate heart rates compared to Sedentary users.")
print("2. Stress levels are lower on average for Highly Active users.")
print("3. Blood Oxygen Levels are relatively stable across activity levels.")
print("4. Sleep Duration has a slight negative correlation with Stress Level.")
print("5. Outlier removal normalized distributions, especially for Step Count.")
print(f"6. ANOVA test indicates {anova_verdict} differences in Heart Rate across Activity Levels (p = {anova_result.pvalue:.4f}).")
print(f"7. T-test suggests Stress Levels {ttest_verdict} between Active and Sedentary groups (p = {p_val:.4f}).")
print("8. Step Count distributions are right-skewed, especially for Highly Active users.")
print("9. Sleep Duration shows less variability in Active users.")
print("10. Cumulative step counts highlight the dominance of Highly Active users in total activity.")



--------------------------------------------------------------------------------