├── .gitignore
├── requirements.txt
├── README.md
└── Obesity_Classification.py

/.gitignore:
--------------------------------------------------------------------------------
obesity
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
appnope==0.1.0
argon2-cffi==20.1.0
astroid==2.4.2
attrs==20.1.0
backcall==0.2.0
bleach==3.1.5
certifi==2020.6.20
cffi==1.14.2
chardet==3.0.4
cycler==0.10.0
decorator==4.4.2
defusedxml==0.6.0
entrypoints==0.3
idna==2.10
ipykernel==5.3.4
ipython==7.17.0
ipython-genutils==0.2.0
isort==5.6.4
jedi==0.17.2
Jinja2==2.11.2
joblib==0.16.0
json5==0.9.5
jsonschema==3.2.0
jupyter-client==6.1.6
jupyter-core==4.6.3
jupyterlab==2.2.5
jupyterlab-server==1.2.0
kiwisolver==1.2.0
lazy-object-proxy==1.4.3
MarkupSafe==1.1.1
matplotlib==3.3.1
mccabe==0.6.1
mistune==0.8.4
nbconvert==5.6.1
nbformat==5.0.7
notebook==6.1.3
numpy==1.19.1
packaging==20.4
pandas==1.1.1
pandocfilters==1.4.2
parso==0.7.1
pexpect==4.8.0
pickleshare==0.7.5
Pillow==7.2.0
prometheus-client==0.8.0
prompt-toolkit==3.0.6
ptyprocess==0.6.0
pycparser==2.20
Pygments==2.6.1
pylint==2.6.0
pyparsing==2.4.7
pyrsistent==0.16.0
python-dateutil==2.8.1
pytz==2020.1
pyzmq==19.0.2
requests==2.24.0
scikit-learn==0.23.2
scipy==1.5.2
seaborn==0.10.1
Send2Trash==1.5.0
six==1.15.0
terminado==0.8.3
testpath==0.4.4
threadpoolctl==2.1.0
toml==0.10.2
tornado==6.0.4
traitlets==4.3.3
urllib3==1.25.10
wcwidth==0.2.5
webencodings==0.5.1
wrapt==1.12.1
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Obesity-Classification

[Link to Report](https://github.com/pymche/Machine-Learning-Obesity-Classification/blob/master/Obesity_Classification.ipynb)

Prediction of obesity levels using machine learning classification models.

Data collected from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition+). The official publication of the research that provides the data can be accessed [here](https://www.sciencedirect.com/science/article/pii/S2352340919306985?via%3Dihub).

## About the Data

Dietary, exercise and personal daily habits of individuals from Mexico, Peru and Colombia were recorded in order to estimate obesity levels.

Obesity Level is used as the target (y) variable and consists of 7 classes - Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III.

There are 17 attributes in total related to individual habits that are likely to determine obesity levels, such as number of main meals, time using technology devices, gender and transportation used.

Details of the questions and possible answers collected for the data can be found in the link provided above.
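For example, once the CSV is downloaded from the UCI page, the class balance of the target can be checked with pandas (a minimal sketch; it assumes the raw file keeps its original name and its original target column, `NObeyesdad`):

```python
import pandas as pd

df = pd.read_csv('ObesityDataSet_raw_and_data_sinthetic.csv')
print(df['NObeyesdad'].value_counts())  # counts for the 7 obesity-level classes
```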
## Preparing the data for exploratory analysis

The values of some of the attributes in the original data set are numerical codes that do not describe the actual answers provided by individuals. In order to visualise the data, I assigned the corresponding answer to each numerical value and temporarily transformed the entire data set into categorical data for plotting.

## Preprocessing the data

The original data set contains numerical (both continuous and discrete) and categorical (including ordinal, non-ordinal and binary - i.e. yes & no) data.
Imputation is used on both numerical and categorical data to fill in missing values. Feature scaling is employed for the continuous numerical values, i.e. age, weight and height; Ordinal Encoding is used for categorical data that is ordinal in nature (e.g. never, sometimes, always) and Label Encoding for the target obesity level, while One Hot Encoding is applied to non-ordinal categorical data, such as transportation used.

All of the above preprocessing procedures are bundled into a pipeline, which also applies multiple models to the data set in search of the best model.

## Models used for classification

* K Nearest Neighbour
* Decision Tree
* Random Forest
* AdaBoost
* Gradient Boosting
* Stochastic Gradient Descent
* Support Vector Machine

A classification report for each model is also included.


First commit: 1st September, 2020
--------------------------------------------------------------------------------
/Obesity_Classification.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# coding: utf-8

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
from collections import Counter

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import SGDClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report


# Load the UCI data set (expects the CSV in the working directory)
df = pd.read_csv('ObesityDataSet_raw_and_data_sinthetic.csv')


df


df.shape


df.info()


df.describe()


df.columns


# Rename the raw UCI columns to human-readable names
df.columns = ['Gender', 'Age', 'Height', 'Weight', 'Family History with Overweight',
              'Frequent consumption of high caloric food', 'Frequency of consumption of vegetables',
              'Number of main meals', 'Consumption of food between meals', 'Smoke',
              'Consumption of water daily', 'Calories consumption monitoring',
              'Physical activity frequency', 'Time using technology devices',
              'Consumption of alcohol', 'Transportation used', 'Obesity']

df
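# Illustrative aside: assigning to df.columns relies on the CSV column order
# staying fixed. A rename mapping keyed on the raw column names would fail
# loudly if the order ever changed - e.g., assuming the raw target column is
# named 'NObeyesdad' as documented on the UCI page:
#
#     df = df.rename(columns={'NObeyesdad': 'Obesity'})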
# Tidy up string values and round the continuous columns
df['Obesity'] = df['Obesity'].apply(lambda x: x.replace('_', ' '))
df['Transportation used'] = df['Transportation used'].apply(lambda x: x.replace('_', ' '))
df['Height'] = df['Height'] * 100  # metres to centimetres
df['Height'] = df['Height'].round(1)
df['Weight'] = df['Weight'].round(1)
df['Age'] = df['Age'].round(1)
df


for x in ['Frequency of consumption of vegetables', 'Number of main meals', 'Consumption of water daily', 'Physical activity frequency', 'Time using technology devices']:
    value = np.array(df[x])
    print(x, ':', 'min:', np.min(value), 'max:', np.max(value))


# ## Exploratory Data Analysis

# Round the survey-answer columns back to integer codes
for x in ['Frequency of consumption of vegetables', 'Number of main meals', 'Consumption of water daily', 'Physical activity frequency', 'Time using technology devices']:
    df[x] = df[x].apply(round)
    value = np.array(df[x])
    print(x, ':', 'min:', np.min(value), 'max:', np.max(value), df[x].dtype)
    print(df[x].unique())


# Keep a numeric copy for preprocessing before mapping codes back to answers
df1 = df.copy()


# Map the integer codes back to the answers from the questionnaire
mapping0 = {1: 'Never', 2: 'Sometimes', 3: 'Always'}
mapping1 = {1: '1', 2: '2', 3: '3', 4: '3+'}
mapping2 = {1: 'Less than a liter', 2: 'Between 1 and 2 L', 3: 'More than 2 L'}
mapping3 = {0: 'I do not have', 1: '1 or 2 days', 2: '2 or 4 days', 3: '4 or 5 days'}
mapping4 = {0: '0–2 hours', 1: '3–5 hours', 2: 'More than 5 hours'}


df['Frequency of consumption of vegetables'] = df['Frequency of consumption of vegetables'].replace(mapping0)
df['Number of main meals'] = df['Number of main meals'].replace(mapping1)
df['Consumption of water daily'] = df['Consumption of water daily'].replace(mapping2)
df['Physical activity frequency'] = df['Physical activity frequency'].replace(mapping3)
df['Time using technology devices'] = df['Time using technology devices'].replace(mapping4)


df


# ### Age, Height and Weight

# In terms of height, males and females are similarly distributed according to the box plots below. While males are generally taller, both genders share a similar average weight, with females having a much larger range of weight (and of BMI) than males. This is further illustrated by the steeper regression line between weight and height for females than for males.
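# Illustrative aside (kept as a separate Series so later column slices on df
# are unaffected): BMI is not a column in the data set, but since heights are
# now in cm it can be derived as weight in kg / (height in m)^2 to back up
# the remark above.
bmi = df['Weight'] / (df['Height'] / 100) ** 2
print(bmi.groupby(df['Gender']).describe())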
sns.set()
fig = plt.figure(figsize=(20, 10))
plt.subplot(1, 2, 1)
sns.boxplot(x='Gender', y='Height', data=df)
plt.subplot(1, 2, 2)
sns.boxplot(x='Gender', y='Weight', data=df)


sns.set()
g = sns.jointplot("Height", "Weight", data=df,
                  kind="reg", truncate=False,
                  xlim=(125, 200), ylim=(35, 180),
                  color="m", height=10)
g.set_axis_labels("Height (cm)", "Weight (kg)")


g = sns.lmplot(x="Height", y="Weight", hue="Gender",
               height=10, data=df)
g.set_axis_labels("Height (cm)", "Weight (kg)")


# ### Obesity

c = Counter(df['Obesity'])
print(c)


fig = plt.figure(figsize=(8, 8))
plt.pie([float(c[v]) for v in c], labels=[str(k) for k in c], autopct=None)
plt.title('Weight Category')
plt.tight_layout()


filt = df['Gender'] == 'Male'
c_m = Counter(df.loc[filt, 'Obesity'])
print(c_m)
c_f = Counter(df.loc[~filt, 'Obesity'])
print(c_f)


# A bigger proportion of females with a higher BMI is reflected by the large slice of Obesity Type III in the pie chart below, while Obesity Type II is the most prevalent type of obesity in males. Interestingly, there is also a higher proportion of Insufficient Weight in females compared to males; this could be explained by a heavier societal pressure on women to go on diets.

fig = plt.figure(figsize=(20, 8))
plt.subplot(1, 2, 1)
plt.pie([float(c_m[v]) for v in c_m], labels=[str(k) for k in c_m], autopct=None)
plt.title('Weight Category of Male')
plt.tight_layout()

plt.subplot(1, 2, 2)
plt.pie([float(c_f[v]) for v in c_f], labels=[str(k) for k in c_f], autopct=None)
plt.title('Weight Category of Female')
plt.tight_layout()


# ### Eating and Exercise Habits

# Bar plot of answer counts for every habit-related attribute
for a in df.columns[4:-1]:
    data = df[a].value_counts()
    values = data.index.to_list()
    counts = data.to_list()

    plt.figure(figsize=(12, 5))
    ax = sns.barplot(x=values, y=counts)

    plt.title(a)
    plt.xticks(rotation=45)
    print(a, values, counts)


# ## Data Preprocessing

df1.head()


# #### Since classifiers cannot operate on label data directly, One Hot Encoding and Label Encoding will be used to assign numeric values to each category

# Identify categorical variables (data type would be 'object')
cat = df1.dtypes == object

print(cat)

# Where dtype == object is True
print(cat[cat])
cat_labels = cat[cat].index
print('Categorical variables:', cat_labels)

# Where dtype == object is False
false = cat[~cat]
non_cat = false.index
print('Non Categorical variables:', non_cat)


# Identify categorical variables with more than 2 values/answers
col = [x for x in cat_labels]
multiple = [df1[x].unique() for x in cat_labels]

multi_col = {col: values for col, values in zip(col, multiple) if len(values) > 2}
print(multi_col)
print('\n')
print('Categorical variables with more than 2 values/answers:', multi_col.keys())


df1.head(3)


df1.columns

# Helper mapping each column name to its positional index
def col_no(x):
    d = {}
    d[df1.columns[x]] = x
    return d

print([col_no(x) for x in range(0, len(df1.columns))])
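# Toy illustration (separate from the pipeline built below) of why non-ordinal
# answers call for one-hot encoding: each category becomes its own 0/1 column,
# so no spurious ordering between answers is implied.
toy = pd.DataFrame({'Transportation used': ['Walking', 'Automobile', 'Bike']})
print(OneHotEncoder(sparse=False).fit_transform(toy))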
x = df1[df1.columns[:-1]]
y = df1['Obesity']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)


# The target value, obesity level, will be transformed into digit labels with
# LabelEncoder.
#
# StandardScaler is applied to attributes whose value ranges are not consistent
# with the rest (i.e. Age, Height, Weight), to avoid disproportionate weight
# being assigned to these values.
#
# Features that are ordinal in nature (i.e. answers such as 'never',
# 'sometimes', 'always') will be preprocessed with OrdinalEncoder (exactly the
# same function as LabelEncoder, except that it takes in multiple columns,
# whereas the latter is meant for the y-value only).
#
# Features that are non-ordinal in nature will be preprocessed with
# OneHotEncoder, so that the generated labels will not be interpreted as
# implying that one answer outranks another (e.g. that 3 is more important
# than 1).
#
# SimpleImputer is applied to all attributes to deal with missing values.
#
# All of these preprocessing techniques will be bundled into a pipeline, which
# will be deployed with classifiers later.

le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_train


Scale_features = ['Age', 'Height', 'Weight']
Scale_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('Scaling', StandardScaler())
])

Ordi_features = ['Consumption of food between meals', 'Consumption of alcohol']
Ordi_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('Ordi', OrdinalEncoder())
])

NonO_features = ['Gender', 'Family History with Overweight', 'Frequent consumption of high caloric food', 'Smoke', 'Calories consumption monitoring', 'Transportation used']
NonO_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('Non-O', OneHotEncoder())
])

Preprocessor = ColumnTransformer(transformers=[
    ('Scale', Scale_transformer, Scale_features),
    ('Ordinal', Ordi_transformer, Ordi_features),
    ('Non-Ordinal', NonO_transformer, NonO_features)
], remainder='passthrough')

clf = Pipeline(steps=[('preprocessor', Preprocessor)])


# Fit the preprocessing pipeline and transform the training features
trans_df = clf.fit_transform(x_train)
print(trans_df.shape)


# Column names of the first two steps in the pipeline

cols = [name for group in [Scale_features, Ordi_features] for name in group]
cols


# Column names of the OneHotEncoder step in the pipeline

ohe_cols = clf.named_steps['preprocessor'].transformers_[2][1].named_steps['Non-O'].get_feature_names(NonO_features)
ohe_cols = list(ohe_cols)
ohe_cols


# Column names of remainder='passthrough' - the remaining columns that were not transformed
non_cat


transformed_x_train = pd.DataFrame(trans_df, columns=['Age', 'Height', 'Weight',
                                                      'Consumption of food between meals',
                                                      'Consumption of alcohol',
                                                      'Gender_Female',
                                                      'Gender_Male',
                                                      'Family History with Overweight_no',
                                                      'Family History with Overweight_yes',
                                                      'Frequent consumption of high caloric food_no',
                                                      'Frequent consumption of high caloric food_yes',
                                                      'Smoke_no',
                                                      'Smoke_yes',
                                                      'Calories consumption monitoring_no',
                                                      'Calories consumption monitoring_yes',
                                                      'Transportation used_Automobile',
                                                      'Transportation used_Bike',
                                                      'Transportation used_Motorbike',
                                                      'Transportation used_Public Transportation',
                                                      'Transportation used_Walking',
                                                      'Frequency of consumption of vegetables',
                                                      'Number of main meals',
                                                      'Consumption of water daily',
                                                      'Physical activity frequency',
                                                      'Time using technology devices'])
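# A less error-prone alternative to typing the column list by hand (a sketch
# reusing the names computed above): ColumnTransformer outputs the transformed
# blocks in transformer order, followed by the remainder='passthrough' columns
# in their original order.
passthrough_cols = [c for c in x_train.columns
                    if c not in Scale_features + Ordi_features + NonO_features]
assert list(transformed_x_train.columns) == cols + ohe_cols + passthrough_cols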
# Transformed/processed features

transformed_x_train


# Encode y_test with the same label mapping, reusing the encoder fitted on
# y_train rather than fitting a fresh one (whose mapping could differ if a
# class were missing from the test split)
y_test = le.transform(y_test)
le_name_mapping = dict(zip(le.transform(le.classes_), le.classes_))
print(le_name_mapping)


# ## Model Selection

# Classifiers are stored in a list; each classifier is looped through, and the preprocessor is applied each time. The accuracy score of every classifier is printed out.

classifiers = [
    KNeighborsClassifier(n_neighbors=5),
    SVC(kernel="rbf", C=0.025, probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    SGDClassifier()
]

top_class = []

for classifier in classifiers:
    pipe = Pipeline(steps=[('preprocessor', Preprocessor),
                           ('classifier', classifier)])

    # Train the full pipeline (preprocessing + classifier)
    pipe.fit(x_train, y_train)
    print(classifier)

    acc_score = pipe.score(x_test, y_test)
    print("model score: %.3f" % acc_score)

    # Use the model to predict
    y_pred = pipe.predict(x_test)

    target_names = [le_name_mapping[x] for x in le_name_mapping]
    print(classification_report(y_test, y_pred, target_names=target_names))

    # Keep classifiers that clear an accuracy threshold of 0.8
    if acc_score > 0.8:
        top_class.append(classifier)


# ### Classification Report
#
# The classification report is used to investigate the performance of each classifier per class (level of obesity).
#
# 'Precision' is the proportion of the classifier's positive predictions for a class that are correct, i.e. True Positives / (True Positives + False Positives).
#
# 'Recall' is the proportion of the actual positive cases that the classifier is able to identify, i.e. True Positives / (True Positives + False Negatives).
#
# 'F1' is the harmonic mean of Precision and Recall.
#
# 'Support' is the number of occurrences of the given class in the dataset. The more consistent the 'Support' counts are across classes, the more balanced the dataset.
#
# The following models score the highest in terms of accuracy.

top_class
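# Worked example for the metric definitions above (toy numbers, not taken from
# this data set): for one class with 50 true positives, 10 false positives and
# 5 false negatives:
tp, fp, fn = 50, 10, 5
precision = tp / (tp + fp)                           # 50/60 ≈ 0.833
recall = tp / (tp + fn)                              # 50/55 ≈ 0.909
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean ≈ 0.870
print(precision, recall, f1)


--------------------------------------------------------------------------------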