├── .gitignore
├── requirements.txt
├── README.md
└── Obesity_Classification.py

/.gitignore:
--------------------------------------------------------------------------------
obesity
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
appnope==0.1.0
argon2-cffi==20.1.0
astroid==2.4.2
attrs==20.1.0
backcall==0.2.0
bleach==3.1.5
certifi==2020.6.20
cffi==1.14.2
chardet==3.0.4
cycler==0.10.0
decorator==4.4.2
defusedxml==0.6.0
entrypoints==0.3
idna==2.10
ipykernel==5.3.4
ipython==7.17.0
ipython-genutils==0.2.0
isort==5.6.4
jedi==0.17.2
Jinja2==2.11.2
joblib==0.16.0
json5==0.9.5
jsonschema==3.2.0
jupyter-client==6.1.6
jupyter-core==4.6.3
jupyterlab==2.2.5
jupyterlab-server==1.2.0
kiwisolver==1.2.0
lazy-object-proxy==1.4.3
MarkupSafe==1.1.1
matplotlib==3.3.1
mccabe==0.6.1
mistune==0.8.4
nbconvert==5.6.1
nbformat==5.0.7
notebook==6.1.3
numpy==1.19.1
packaging==20.4
pandas==1.1.1
pandocfilters==1.4.2
parso==0.7.1
pexpect==4.8.0
pickleshare==0.7.5
Pillow==7.2.0
prometheus-client==0.8.0
prompt-toolkit==3.0.6
ptyprocess==0.6.0
pycparser==2.20
Pygments==2.6.1
pylint==2.6.0
pyparsing==2.4.7
pyrsistent==0.16.0
python-dateutil==2.8.1
pytz==2020.1
pyzmq==19.0.2
requests==2.24.0
scikit-learn==0.23.2
scipy==1.5.2
seaborn==0.10.1
Send2Trash==1.5.0
six==1.15.0
terminado==0.8.3
testpath==0.4.4
threadpoolctl==2.1.0
toml==0.10.2
tornado==6.0.4
traitlets==4.3.3
urllib3==1.25.10
wcwidth==0.2.5
webencodings==0.5.1
wrapt==1.12.1
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Obesity-Classification

[Link to Report](https://github.com/pymche/Machine-Learning-Obesity-Classification/blob/master/Obesity_Classification.ipynb)

Prediction of obesity levels using machine learning classification models.

Data collected from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition+). The official publication of the research that provides the data can be accessed [here](https://www.sciencedirect.com/science/article/pii/S2352340919306985?via%3Dihub).

## About the Data

Dietary, exercise and personal daily habits of individuals from Mexico, Peru and Colombia were recorded in order to estimate obesity levels.

Obesity Level is used as the target (y) variable and consists of 7 classes - Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III.

There are 17 attributes in total related to individual habits that are likely to determine obesity levels, such as number of main meals, time using technology devices, gender and transportation used.

Details of the questions and possible answers collected for the data can be found in the link provided above.
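For example, once the CSV is downloaded from the UCI page, the class balance of the target can be checked with pandas (a minimal sketch; it assumes the raw file keeps its original name and its original target column, `NObeyesdad`):

```python
import pandas as pd

df = pd.read_csv('ObesityDataSet_raw_and_data_sinthetic.csv')
print(df['NObeyesdad'].value_counts())  # counts for the 7 obesity-level classes
```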
## Preparing the data for exploratory analysis

The values of some of the attributes in the original data set are numerical codes that do not describe the actual answers provided by individuals. In order to visualise the data, I assigned the corresponding answer to each numerical value and temporarily transformed the entire data set into categorical data for plotting.

## Preprocessing the data

The original data set contains numerical (both continuous and discrete) and categorical (including ordinal, non-ordinal and binary - i.e. yes & no) data.
Imputation is used on both numerical and categorical data to fill in missing values. Feature scaling is employed for the continuous numerical values, i.e. age, weight and height; Ordinal Encoding is used for categorical data that is ordinal in nature (e.g. never, sometimes, always) and Label Encoding for the target obesity level, while One Hot Encoding is applied to non-ordinal categorical data, such as transportation used.

All of the above preprocessing procedures are bundled into a pipeline, which also applies multiple models to the data set in search of the best model.

## Models used for classification

* K Nearest Neighbour
* Decision Tree
* Random Forest
* AdaBoost
* Gradient Boosting
* Stochastic Gradient Descent
* Support Vector Machine

A classification report for each model is also included.


First commit: 1st September, 2020
--------------------------------------------------------------------------------
/Obesity_Classification.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# coding: utf-8

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
from collections import Counter

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import SGDClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report


# Load the UCI data set (expects the CSV in the working directory)
df = pd.read_csv('ObesityDataSet_raw_and_data_sinthetic.csv')


df


df.shape


df.info()


df.describe()


df.columns


# Rename the raw UCI columns to human-readable names
df.columns = ['Gender', 'Age', 'Height', 'Weight', 'Family History with Overweight',
              'Frequent consumption of high caloric food', 'Frequency of consumption of vegetables',
              'Number of main meals', 'Consumption of food between meals', 'Smoke',
              'Consumption of water daily', 'Calories consumption monitoring',
              'Physical activity frequency', 'Time using technology devices',
              'Consumption of alcohol', 'Transportation used', 'Obesity']

df
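# Illustrative aside: assigning to df.columns relies on the CSV column order
# staying fixed. A rename mapping keyed on the raw column names would fail
# loudly if the order ever changed - e.g., assuming the raw target column is
# named 'NObeyesdad' as documented on the UCI page:
#
#     df = df.rename(columns={'NObeyesdad': 'Obesity'})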
# Tidy up string values and round the continuous columns
df['Obesity'] = df['Obesity'].apply(lambda x: x.replace('_', ' '))
df['Transportation used'] = df['Transportation used'].apply(lambda x: x.replace('_', ' '))
df['Height'] = df['Height'] * 100  # metres to centimetres
df['Height'] = df['Height'].round(1)
df['Weight'] = df['Weight'].round(1)
df['Age'] = df['Age'].round(1)
df


for x in ['Frequency of consumption of vegetables', 'Number of main meals', 'Consumption of water daily', 'Physical activity frequency', 'Time using technology devices']:
    value = np.array(df[x])
    print(x, ':', 'min:', np.min(value), 'max:', np.max(value))


# ## Exploratory Data Analysis

# Round the survey-answer columns back to integer codes
for x in ['Frequency of consumption of vegetables', 'Number of main meals', 'Consumption of water daily', 'Physical activity frequency', 'Time using technology devices']:
    df[x] = df[x].apply(round)
    value = np.array(df[x])
    print(x, ':', 'min:', np.min(value), 'max:', np.max(value), df[x].dtype)
    print(df[x].unique())


# Keep a numeric copy for preprocessing before mapping codes back to answers
df1 = df.copy()


# Map the integer codes back to the answers from the questionnaire
mapping0 = {1: 'Never', 2: 'Sometimes', 3: 'Always'}
mapping1 = {1: '1', 2: '2', 3: '3', 4: '3+'}
mapping2 = {1: 'Less than a liter', 2: 'Between 1 and 2 L', 3: 'More than 2 L'}
mapping3 = {0: 'I do not have', 1: '1 or 2 days', 2: '2 or 4 days', 3: '4 or 5 days'}
mapping4 = {0: '0–2 hours', 1: '3–5 hours', 2: 'More than 5 hours'}


df['Frequency of consumption of vegetables'] = df['Frequency of consumption of vegetables'].replace(mapping0)
df['Number of main meals'] = df['Number of main meals'].replace(mapping1)
df['Consumption of water daily'] = df['Consumption of water daily'].replace(mapping2)
df['Physical activity frequency'] = df['Physical activity frequency'].replace(mapping3)
df['Time using technology devices'] = df['Time using technology devices'].replace(mapping4)


df


# ### Age, Height and Weight

# In terms of height, males and females are similarly distributed according to the box plots below. While males are generally taller, both genders share a similar average weight, with females having a much larger range of weight (and of BMI) than males. This is further illustrated by the steeper regression line between weight and height for females than for males.
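# Illustrative aside (kept as a separate Series so later column slices on df
# are unaffected): BMI is not a column in the data set, but since heights are
# now in cm it can be derived as weight in kg / (height in m)^2 to back up
# the remark above.
bmi = df['Weight'] / (df['Height'] / 100) ** 2
print(bmi.groupby(df['Gender']).describe())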
sns.set()
fig = plt.figure(figsize=(20, 10))
plt.subplot(1, 2, 1)
sns.boxplot(x='Gender', y='Height', data=df)
plt.subplot(1, 2, 2)
sns.boxplot(x='Gender', y='Weight', data=df)


sns.set()
g = sns.jointplot("Height", "Weight", data=df,
                  kind="reg", truncate=False,
                  xlim=(125, 200), ylim=(35, 180),
                  color="m", height=10)
g.set_axis_labels("Height (cm)", "Weight (kg)")


g = sns.lmplot(x="Height", y="Weight", hue="Gender",
               height=10, data=df)
g.set_axis_labels("Height (cm)", "Weight (kg)")


# ### Obesity

c = Counter(df['Obesity'])
print(c)


fig = plt.figure(figsize=(8, 8))
plt.pie([float(c[v]) for v in c], labels=[str(k) for k in c], autopct=None)
plt.title('Weight Category')
plt.tight_layout()


filt = df['Gender'] == 'Male'
c_m = Counter(df.loc[filt, 'Obesity'])
print(c_m)
c_f = Counter(df.loc[~filt, 'Obesity'])
print(c_f)


# A bigger proportion of females with a higher BMI is reflected by the large slice of Obesity Type III in the pie chart below, while Obesity Type II is the most prevalent type of obesity in males. Interestingly, there is also a higher proportion of Insufficient Weight in females compared to males; this could be explained by a heavier societal pressure on women to go on diets.

fig = plt.figure(figsize=(20, 8))
plt.subplot(1, 2, 1)
plt.pie([float(c_m[v]) for v in c_m], labels=[str(k) for k in c_m], autopct=None)
plt.title('Weight Category of Male')
plt.tight_layout()

plt.subplot(1, 2, 2)
plt.pie([float(c_f[v]) for v in c_f], labels=[str(k) for k in c_f], autopct=None)
plt.title('Weight Category of Female')
plt.tight_layout()


# ### Eating and Exercise Habits

# Bar plot of answer counts for every habit-related attribute
for a in df.columns[4:-1]:
    data = df[a].value_counts()
    values = data.index.to_list()
    counts = data.to_list()

    plt.figure(figsize=(12, 5))
    ax = sns.barplot(x=values, y=counts)

    plt.title(a)
    plt.xticks(rotation=45)
    print(a, values, counts)


# ## Data Preprocessing

df1.head()


# #### Since classifiers cannot operate on label data directly, One Hot Encoding and Label Encoding will be used to assign numeric values to each category

# Identify categorical variables (data type would be 'object')
cat = df1.dtypes == object

print(cat)

# Where dtype == object is True
print(cat[cat])
cat_labels = cat[cat].index
print('Categorical variables:', cat_labels)

# Where dtype == object is False
false = cat[~cat]
non_cat = false.index
print('Non Categorical variables:', non_cat)


# Identify categorical variables with more than 2 values/answers
col = [x for x in cat_labels]
multiple = [df1[x].unique() for x in cat_labels]

multi_col = {col: values for col, values in zip(col, multiple) if len(values) > 2}
print(multi_col)
print('\n')
print('Categorical variables with more than 2 values/answers:', multi_col.keys())


df1.head(3)


df1.columns

# Helper mapping each column name to its positional index
def col_no(x):
    d = {}
    d[df1.columns[x]] = x
    return d

print([col_no(x) for x in range(0, len(df1.columns))])
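# Toy illustration (separate from the pipeline built below) of why non-ordinal
# answers call for one-hot encoding: each category becomes its own 0/1 column,
# so no spurious ordering between answers is implied.
toy = pd.DataFrame({'Transportation used': ['Walking', 'Automobile', 'Bike']})
print(OneHotEncoder(sparse=False).fit_transform(toy))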
x = df1[df1.columns[:-1]]
y = df1['Obesity']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)


# The target value, obesity level, will be transformed into digit labels with
# LabelEncoder.
#
# StandardScaler is applied to attributes whose value ranges are not consistent
# with the rest (i.e. Age, Height, Weight), to avoid disproportionate weight
# being assigned to these values.
#
# Features that are ordinal in nature (i.e. answers such as 'never',
# 'sometimes', 'always') will be preprocessed with OrdinalEncoder (exactly the
# same function as LabelEncoder, except that it takes in multiple columns,
# whereas the latter is meant for the y-value only).
#
# Features that are non-ordinal in nature will be preprocessed with
# OneHotEncoder, so that the generated labels will not be interpreted as
# implying that one answer outranks another (e.g. that 3 is more important
# than 1).
#
# SimpleImputer is applied to all attributes to deal with missing values.
#
# All of these preprocessing techniques will be bundled into a pipeline, which
# will be deployed with classifiers later.

le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_train


Scale_features = ['Age', 'Height', 'Weight']
Scale_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('Scaling', StandardScaler())
])

Ordi_features = ['Consumption of food between meals', 'Consumption of alcohol']
Ordi_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('Ordi', OrdinalEncoder())
])

NonO_features = ['Gender', 'Family History with Overweight', 'Frequent consumption of high caloric food', 'Smoke', 'Calories consumption monitoring', 'Transportation used']
NonO_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('Non-O', OneHotEncoder())
])

Preprocessor = ColumnTransformer(transformers=[
    ('Scale', Scale_transformer, Scale_features),
    ('Ordinal', Ordi_transformer, Ordi_features),
    ('Non-Ordinal', NonO_transformer, NonO_features)
], remainder='passthrough')

clf = Pipeline(steps=[('preprocessor', Preprocessor)])


# Fit the preprocessing pipeline and transform the training features
trans_df = clf.fit_transform(x_train)
print(trans_df.shape)


# Column names of the first two steps in the pipeline

cols = [name for group in [Scale_features, Ordi_features] for name in group]
cols


# Column names of the OneHotEncoder step in the pipeline

ohe_cols = clf.named_steps['preprocessor'].transformers_[2][1].named_steps['Non-O'].get_feature_names(NonO_features)
ohe_cols = list(ohe_cols)
ohe_cols


# Column names of remainder='passthrough' - the remaining columns that were not transformed
non_cat


transformed_x_train = pd.DataFrame(trans_df, columns=['Age', 'Height', 'Weight',
                                                      'Consumption of food between meals',
                                                      'Consumption of alcohol',
                                                      'Gender_Female',
                                                      'Gender_Male',
                                                      'Family History with Overweight_no',
                                                      'Family History with Overweight_yes',
                                                      'Frequent consumption of high caloric food_no',
                                                      'Frequent consumption of high caloric food_yes',
                                                      'Smoke_no',
                                                      'Smoke_yes',
                                                      'Calories consumption monitoring_no',
                                                      'Calories consumption monitoring_yes',
                                                      'Transportation used_Automobile',
                                                      'Transportation used_Bike',
                                                      'Transportation used_Motorbike',
                                                      'Transportation used_Public Transportation',
                                                      'Transportation used_Walking',
                                                      'Frequency of consumption of vegetables',
                                                      'Number of main meals',
                                                      'Consumption of water daily',
                                                      'Physical activity frequency',
                                                      'Time using technology devices'])
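# A less error-prone alternative to typing the column list by hand (a sketch
# reusing the names computed above): ColumnTransformer outputs the transformed
# blocks in transformer order, followed by the remainder='passthrough' columns
# in their original order.
passthrough_cols = [c for c in x_train.columns
                    if c not in Scale_features + Ordi_features + NonO_features]
assert list(transformed_x_train.columns) == cols + ohe_cols + passthrough_cols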
# Transformed/processed features

transformed_x_train


# Encode y_test with the same label mapping, reusing the encoder fitted on
# y_train rather than fitting a fresh one (whose mapping could differ if a
# class were missing from the test split)
y_test = le.transform(y_test)
le_name_mapping = dict(zip(le.transform(le.classes_), le.classes_))
print(le_name_mapping)


# ## Model Selection

# Classifiers are stored in a list; each classifier is looped through, and the preprocessor is applied each time. The accuracy score of every classifier is printed out.

classifiers = [
    KNeighborsClassifier(n_neighbors=5),
    SVC(kernel="rbf", C=0.025, probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    SGDClassifier()
]

top_class = []

for classifier in classifiers:
    pipe = Pipeline(steps=[('preprocessor', Preprocessor),
                           ('classifier', classifier)])

    # Train the full pipeline (preprocessing + classifier)
    pipe.fit(x_train, y_train)
    print(classifier)

    acc_score = pipe.score(x_test, y_test)
    print("model score: %.3f" % acc_score)

    # Use the model to predict
    y_pred = pipe.predict(x_test)

    target_names = [le_name_mapping[x] for x in le_name_mapping]
    print(classification_report(y_test, y_pred, target_names=target_names))

    # Keep classifiers that clear an accuracy threshold of 0.8
    if acc_score > 0.8:
        top_class.append(classifier)


# ### Classification Report
#
# The classification report is used to investigate the performance of each classifier per class (level of obesity).
#
# 'Precision' is the proportion of the classifier's positive predictions for a class that are correct, i.e. True Positives / (True Positives + False Positives).
#
# 'Recall' is the proportion of the actual positive cases that the classifier is able to identify, i.e. True Positives / (True Positives + False Negatives).
#
# 'F1' is the harmonic mean of Precision and Recall.
#
# 'Support' is the number of occurrences of the given class in the dataset. The more consistent the 'Support' counts are across classes, the more balanced the dataset.
#
# The following models score the highest in terms of accuracy.

top_class
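# Worked example for the metric definitions above (toy numbers, not taken from
# this data set): for one class with 50 true positives, 10 false positives and
# 5 false negatives:
tp, fp, fn = 50, 10, 5
precision = tp / (tp + fp)                           # 50/60 ≈ 0.833
recall = tp / (tp + fn)                              # 50/55 ≈ 0.909
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean ≈ 0.870
print(precision, recall, f1)


--------------------------------------------------------------------------------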