├── Chapter01
│   ├── Breast Cancer Detection with SVM (Jupyter Notebook).html
│   └── Breast Cancer Detection with SVM (Jupyter Notebook).ipynb
├── Chapter02
│   ├── Deep Learning Grid Search (Jupyter Notebook).html
│   └── Deep Learning Grid Search (Jupyter Notebook).ipynb
├── Chapter03
│   └── chapter3.ipynb
├── Chapter04
│   └── Heart Disease Prediction with Neural Networks.ipynb
├── Chapter05
│   └── Autism Screening with Machine Learning.ipynb
├── LICENSE
└── README.md
/Chapter03/chapter3.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Classifying DNA Sequences\n",
8 | "### Presented by Eduonix\n",
9 | "\n",
10 | "During this tutorial, we will explore the world of bioinformatics by using Markov models, K-nearest neighbor (KNN) algorithms, support vector machines, and other common classifiers to classify short E. Coli DNA sequences. This project will use a dataset from the UCI Machine Learning Repository that has 106 DNA sequences, with 57 sequential nucleotides (“base-pairs”) each. \n",
11 | "\n",
12 | "You will learn how to:\n",
13 | "* Import data from the UCI repository\n",
14 | "* Convert text inputs to numerical data\n",
15 | "* Build and train classification algorithms\n",
16 | "* Compare and contrast classification algorithms\n",
17 | "\n",
18 | "## Step 1: Importing the Dataset\n",
19 | "\n",
20 | "The following code cells will import necessary libraries and import the dataset from the UCI repository as a Pandas DataFrame."
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 1,
26 | "metadata": {},
27 | "outputs": [
28 | {
29 | "name": "stdout",
30 | "output_type": "stream",
31 | "text": [
32 | "Python: 2.7.13 |Continuum Analytics, Inc.| (default, May 11 2017, 13:17:26) [MSC v.1500 64 bit (AMD64)]\n",
33 | "Numpy: 1.14.0\n",
34 | "Sklearn: 0.19.1\n",
35 | "Pandas: 0.21.0\n"
36 | ]
37 | }
38 | ],
39 | "source": [
40 | "# To make sure all of the correct libraries are installed, import each module and print the version number\n",
41 | "\n",
42 | "import sys\n",
43 | "import numpy\n",
44 | "import sklearn\n",
45 | "import pandas\n",
46 | "\n",
47 | "print('Python: {}'.format(sys.version))\n",
48 | "print('Numpy: {}'.format(numpy.__version__))\n",
49 | "print('Sklearn: {}'.format(sklearn.__version__))\n",
50 | "print('Pandas: {}'.format(pandas.__version__))"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": 2,
56 | "metadata": {
57 | "collapsed": true
58 | },
59 | "outputs": [],
60 | "source": [
61 | "# Import, change module names\n",
62 | "import numpy as np\n",
63 | "import pandas as pd\n",
64 | "\n",
65 | "# import the uci Molecular Biology (Promoter Gene Sequences) Data Set\n",
66 | "url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/promoter-gene-sequences/promoters.data'\n",
67 | "names = ['Class', 'id', 'Sequence']\n",
68 | "data = pd.read_csv(url, names = names)"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 3,
74 | "metadata": {},
75 | "outputs": [
76 | {
77 | "name": "stdout",
78 | "output_type": "stream",
79 | "text": [
80 | "Class +\n",
81 | "id S10\n",
82 | "Sequence \\t\\ttactagcaatacgcttgcgttcggtggttaagtatgtataat...\n",
83 | "Name: 0, dtype: object\n"
84 | ]
85 | }
86 | ],
87 | "source": [
88 | "print(data.iloc[0])"
89 | ]
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "## Step 2: Preprocessing the Dataset\n",
96 | "\n",
97 | "The data is not in a usable form; as a result, we will need to process it before using it to train our algorithms."
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 4,
103 | "metadata": {},
104 | "outputs": [
105 | {
106 | "name": "stdout",
107 | "output_type": "stream",
108 | "text": [
109 | "0 +\n",
110 | "1 +\n",
111 | "2 +\n",
112 | "3 +\n",
113 | "4 +\n",
114 | "Name: Class, dtype: object\n"
115 | ]
116 | }
117 | ],
118 | "source": [
119 | "# Building our Dataset by creating a custom Pandas DataFrame\n",
120 | "# Each column in a DataFrame is called a Series. Lets start by making a series for each column.\n",
121 | "\n",
122 | "classes = data.loc[:, 'Class']\n",
123 | "print(classes[:5])"
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": 5,
129 | "metadata": {},
130 | "outputs": [
131 | {
132 | "name": "stdout",
133 | "output_type": "stream",
134 | "text": [
135 | "['t', 'a', 'c', 't', 'a', 'g', 'c', 'a', 'a', 't', 'a', 'c', 'g', 'c', 't', 't', 'g', 'c', 'g', 't', 't', 'c', 'g', 'g', 't', 'g', 'g', 't', 't', 'a', 'a', 'g', 't', 'a', 't', 'g', 't', 'a', 't', 'a', 'a', 't', 'g', 'c', 'g', 'c', 'g', 'g', 'g', 'c', 't', 't', 'g', 't', 'c', 'g', 't', '+']\n"
136 | ]
137 | }
138 | ],
139 | "source": [
140 | "# generate list of DNA sequences\n",
141 | "sequences = list(data.loc[:, 'Sequence'])\n",
142 | "dataset = {}\n",
143 | "\n",
144 | "# loop through sequences and split into individual nucleotides\n",
145 | "for i, seq in enumerate(sequences):\n",
146 | " \n",
147 | " # split into nucleotides, remove tab characters\n",
148 | " nucleotides = list(seq)\n",
149 | " nucleotides = [x for x in nucleotides if x != '\\t']\n",
150 | " \n",
151 | " # append class assignment\n",
152 | " nucleotides.append(classes[i])\n",
153 | " \n",
154 | " # add to dataset\n",
155 | " dataset[i] = nucleotides\n",
156 | " \n",
157 | "print(dataset[0])"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": 6,
163 | "metadata": {},
164 | "outputs": [
165 | {
166 | "name": "stdout",
167 | "output_type": "stream",
168 | "text": [
169 | " 0 1 2 3 4 5 6 7 8 9 ... 96 97 98 99 100 101 102 \\\n",
170 | "0 t t g a t a c t c t ... c c t a g c g \n",
171 | "1 a g t a c g a t g t ... c g a g a c t \n",
172 | "2 c c a t g g g t a t ... g c t a g t a \n",
173 | "3 t t c t a g g c c t ... a t g g a c t \n",
174 | "4 a a t g t g g t t a ... g a a g g a t \n",
175 | "5 g t a t a c g a t a ... t g c g c a c \n",
176 | "6 c c g g a a g c a a ... a g c t a t t \n",
177 | "7 a c a a t a t a a t ... g a g g t g c \n",
178 | "8 a t g t t g g a t t ... a c a t g g a \n",
179 | "9 t g a g a g g a a t ... c t a a t c a \n",
180 | "10 a a a t a a a a t c ... c t c c c c c \n",
181 | "11 c c c g c g g c a c ... c t g t a t a \n",
182 | "12 g a t t t g g a c t ... t c a c g c a \n",
183 | "13 c g a a a a a c t c ... t t g c c t g \n",
184 | "14 t t g t t t t t g t ... a t t a c a a \n",
185 | "15 t t t c t g t t c t ... g g c a t a t \n",
186 | "16 g g g g g g t g g g ... a t a g c a t \n",
187 | "17 c t c a a a a a a t ... g t a a g c a \n",
188 | "18 g c a a c a a t c c ... a g t a a g a \n",
189 | "19 t a t g g a g a a a ... g a c g c g c \n",
190 | "20 t c t t a g c c g g ... c t a a a g c \n",
191 | "21 c g a g a a c t g g ... a t g g a t g \n",
192 | "22 g c g t a g a g a c ... t t a g c c a \n",
193 | "23 g t c g a g t t c c ... g t c a t t c \n",
194 | "24 t g t t g t c a g g ... t c c a t t a \n",
195 | "25 g a t t c t t t t g ... c c g g g g g \n",
196 | "26 g t a g t g c g c a ... a a c a c a a \n",
197 | "27 t t t c g c c a c a ... g t t t a g t \n",
198 | "28 t g t g a c t g g t ... c g t g t g t \n",
199 | "29 a g t g a g g c t a ... c c t a a g c \n",
200 | "30 a t t a a t a a t a ... t g g g a g a \n",
201 | "31 g g t g a a t t c c ... c g a g a t a \n",
202 | "32 t t t t c t g a t t ... g t c c t t t \n",
203 | "33 a c t a c a a c g c ... a g t t g t c \n",
204 | "34 t g g g a a c a t c ... c t c a c t t \n",
205 | "35 g t t a c a g g g c ... a t t g t t c \n",
206 | "36 t t t t t g c t t t ... a t g a t t g \n",
207 | "37 a a a g a a a a a a ... c t g c t g t \n",
208 | "38 t c t t g a t t a t ... t g t g c c g \n",
209 | "39 a a c t a a a a a a ... t c a t t t g \n",
210 | "40 a a a a a c g a t a ... g g t c t g a \n",
211 | "41 t t t g t t t t c t ... c c t t g a t \n",
212 | "42 g c g a g a c t g g ... a a a c t a g \n",
213 | "43 c t c a c g a g c c ... t a c t a a g \n",
214 | "44 g a t t g a g c a g ... a t t g g g a \n",
215 | "45 c a a a c g c t a c ... a g g c a g c \n",
216 | "46 g c a c c t c t t c ... a t t a c a g \n",
217 | "47 g g c t t c c c g a ... t t g t g g t \n",
218 | "48 g c c a c c a a a c ... g a a g t g t \n",
219 | "49 c a a a c g t a a c ... c a a g g a c \n",
220 | "50 t t c c g t c c a a ... t t c a c a a \n",
221 | "51 t c c a t t a a t c ... t c a g c c a \n",
222 | "52 g g c a g t t g g t ... t g t t c t c \n",
223 | "53 t c g a g a g a g g ... c c t a t a a \n",
224 | "54 c c g c t g a a t a ... t t a t a t t \n",
225 | "55 g a c t a g a c t c ... t t t g c a t \n",
226 | "56 t a g c g t t a t a ... g t t a g t g \n",
227 | "57 + + + + + + + + + + ... - - - - - - - \n",
228 | "\n",
229 | " 103 104 105 \n",
230 | "0 c c t \n",
231 | "1 g t a \n",
232 | "2 c c a \n",
233 | "3 g g c \n",
234 | "4 a t a \n",
235 | "5 c c t \n",
236 | "6 t c t \n",
237 | "7 a t a \n",
238 | "8 c c a \n",
239 | "9 g a t \n",
240 | "10 a a a \n",
241 | "11 t t a \n",
242 | "12 g g a \n",
243 | "13 a g t \n",
244 | "14 g c a \n",
245 | "15 a c a \n",
246 | "16 t t g \n",
247 | "17 g c g \n",
248 | "18 c t a \n",
249 | "19 c a g \n",
250 | "20 t a g \n",
251 | "21 g a c \n",
252 | "22 a c t \n",
253 | "23 g g c \n",
254 | "24 t g t \n",
255 | "25 g g a \n",
256 | "26 c t a \n",
257 | "27 t c t \n",
258 | "28 t t g \n",
259 | "29 c t g \n",
260 | "30 c g c \n",
261 | "31 g a a \n",
262 | "32 t g c \n",
263 | "33 t g t \n",
264 | "34 a g c \n",
265 | "35 c g a \n",
266 | "36 t t t \n",
267 | "37 g t t \n",
268 | "38 g t a \n",
269 | "39 a t g \n",
270 | "40 t t c \n",
271 | "41 t t c \n",
272 | "42 g g a \n",
273 | "43 t c a \n",
274 | "44 c t t \n",
275 | "45 a g c \n",
276 | "46 c a a \n",
277 | "47 c a a \n",
278 | "48 a a t \n",
279 | "49 a g c \n",
280 | "50 g g a \n",
281 | "51 g a a \n",
282 | "52 c g g \n",
283 | "53 t g a \n",
284 | "54 t a a \n",
285 | "55 c a c \n",
286 | "56 c c t \n",
287 | "57 - - - \n",
288 | "\n",
289 | "[58 rows x 106 columns]\n"
290 | ]
291 | }
292 | ],
293 | "source": [
294 | "# turn dataset into pandas DataFrame\n",
295 | "dframe = pd.DataFrame(dataset)\n",
296 | "print(dframe)"
297 | ]
298 | },
299 | {
300 | "cell_type": "code",
301 | "execution_count": 7,
302 | "metadata": {},
303 | "outputs": [
304 | {
305 | "name": "stdout",
306 | "output_type": "stream",
307 | "text": [
308 | " 0 1 2 3 4 5 6 7 8 9 ... 48 49 50 51 52 53 54 55 56 57\n",
309 | "0 t a c t a g c a a t ... g c t t g t c g t +\n",
310 | "1 t g c t a t c c t g ... c a t c g c c a a +\n",
311 | "2 g t a c t a g a g a ... c a c c c g g c g +\n",
312 | "3 a a t t g t g a t g ... a a c a a a c t c +\n",
313 | "4 t c g a t a a t t a ... c c g t g g t a g +\n",
314 | "\n",
315 | "[5 rows x 58 columns]\n"
316 | ]
317 | }
318 | ],
319 | "source": [
320 | "# transpose the DataFrame\n",
321 | "df = dframe.transpose()\n",
322 | "print(df.iloc[:5])"
323 | ]
324 | },
325 | {
326 | "cell_type": "code",
327 | "execution_count": 8,
328 | "metadata": {},
329 | "outputs": [
330 | {
331 | "name": "stdout",
332 | "output_type": "stream",
333 | "text": [
334 | " 0 1 2 3 4 5 6 7 8 9 ... 48 49 50 51 52 53 54 55 56 Class\n",
335 | "0 t a c t a g c a a t ... g c t t g t c g t +\n",
336 | "1 t g c t a t c c t g ... c a t c g c c a a +\n",
337 | "2 g t a c t a g a g a ... c a c c c g g c g +\n",
338 | "3 a a t t g t g a t g ... a a c a a a c t c +\n",
339 | "4 t c g a t a a t t a ... c c g t g g t a g +\n",
340 | "\n",
341 | "[5 rows x 58 columns]\n"
342 | ]
343 | }
344 | ],
345 | "source": [
346 | "# for clarity, lets rename the last dataframe column to class\n",
347 | "df.rename(columns = {57: 'Class'}, inplace = True) \n",
348 | "print(df.iloc[:5])"
349 | ]
350 | },
351 | {
352 | "cell_type": "code",
353 | "execution_count": 9,
354 | "metadata": {},
355 | "outputs": [
356 | {
357 | "data": {
358 | "text/html": [
359 | "
\n",
360 | "\n",
373 | "
\n",
374 | " \n",
375 | " \n",
376 | " | \n",
377 | " 0 | \n",
378 | " 1 | \n",
379 | " 2 | \n",
380 | " 3 | \n",
381 | " 4 | \n",
382 | " 5 | \n",
383 | " 6 | \n",
384 | " 7 | \n",
385 | " 8 | \n",
386 | " 9 | \n",
387 | " ... | \n",
388 | " 48 | \n",
389 | " 49 | \n",
390 | " 50 | \n",
391 | " 51 | \n",
392 | " 52 | \n",
393 | " 53 | \n",
394 | " 54 | \n",
395 | " 55 | \n",
396 | " 56 | \n",
397 | " Class | \n",
398 | "
\n",
399 | " \n",
400 | " \n",
401 | " \n",
402 | " count | \n",
403 | " 106 | \n",
404 | " 106 | \n",
405 | " 106 | \n",
406 | " 106 | \n",
407 | " 106 | \n",
408 | " 106 | \n",
409 | " 106 | \n",
410 | " 106 | \n",
411 | " 106 | \n",
412 | " 106 | \n",
413 | " ... | \n",
414 | " 106 | \n",
415 | " 106 | \n",
416 | " 106 | \n",
417 | " 106 | \n",
418 | " 106 | \n",
419 | " 106 | \n",
420 | " 106 | \n",
421 | " 106 | \n",
422 | " 106 | \n",
423 | " 106 | \n",
424 | "
\n",
425 | " \n",
426 | " unique | \n",
427 | " 4 | \n",
428 | " 4 | \n",
429 | " 4 | \n",
430 | " 4 | \n",
431 | " 4 | \n",
432 | " 4 | \n",
433 | " 4 | \n",
434 | " 4 | \n",
435 | " 4 | \n",
436 | " 4 | \n",
437 | " ... | \n",
438 | " 4 | \n",
439 | " 4 | \n",
440 | " 4 | \n",
441 | " 4 | \n",
442 | " 4 | \n",
443 | " 4 | \n",
444 | " 4 | \n",
445 | " 4 | \n",
446 | " 4 | \n",
447 | " 2 | \n",
448 | "
\n",
449 | " \n",
450 | " top | \n",
451 | " t | \n",
452 | " a | \n",
453 | " a | \n",
454 | " c | \n",
455 | " a | \n",
456 | " a | \n",
457 | " a | \n",
458 | " a | \n",
459 | " a | \n",
460 | " a | \n",
461 | " ... | \n",
462 | " c | \n",
463 | " c | \n",
464 | " c | \n",
465 | " t | \n",
466 | " t | \n",
467 | " c | \n",
468 | " c | \n",
469 | " t | \n",
470 | " t | \n",
471 | " - | \n",
472 | "
\n",
473 | " \n",
474 | " freq | \n",
475 | " 38 | \n",
476 | " 34 | \n",
477 | " 30 | \n",
478 | " 30 | \n",
479 | " 36 | \n",
480 | " 42 | \n",
481 | " 38 | \n",
482 | " 34 | \n",
483 | " 33 | \n",
484 | " 36 | \n",
485 | " ... | \n",
486 | " 36 | \n",
487 | " 42 | \n",
488 | " 31 | \n",
489 | " 33 | \n",
490 | " 35 | \n",
491 | " 32 | \n",
492 | " 29 | \n",
493 | " 29 | \n",
494 | " 34 | \n",
495 | " 53 | \n",
496 | "
\n",
497 | " \n",
498 | "
\n",
499 | "
4 rows × 58 columns
\n",
500 | "
"
501 | ],
502 | "text/plain": [
503 | " 0 1 2 3 4 5 6 7 8 9 ... 48 49 50 \\\n",
504 | "count 106 106 106 106 106 106 106 106 106 106 ... 106 106 106 \n",
505 | "unique 4 4 4 4 4 4 4 4 4 4 ... 4 4 4 \n",
506 | "top t a a c a a a a a a ... c c c \n",
507 | "freq 38 34 30 30 36 42 38 34 33 36 ... 36 42 31 \n",
508 | "\n",
509 | " 51 52 53 54 55 56 Class \n",
510 | "count 106 106 106 106 106 106 106 \n",
511 | "unique 4 4 4 4 4 4 2 \n",
512 | "top t t c c t t - \n",
513 | "freq 33 35 32 29 29 34 53 \n",
514 | "\n",
515 | "[4 rows x 58 columns]"
516 | ]
517 | },
518 | "execution_count": 9,
519 | "metadata": {},
520 | "output_type": "execute_result"
521 | }
522 | ],
523 | "source": [
524 | "# looks good! Let's start to familiarize ourselves with the dataset so we can pick the most suitable \n",
525 | "# algorithms for this data\n",
526 | "\n",
527 | "df.describe()"
528 | ]
529 | },
530 | {
531 | "cell_type": "code",
532 | "execution_count": 10,
533 | "metadata": {},
534 | "outputs": [
535 | {
536 | "name": "stdout",
537 | "output_type": "stream",
538 | "text": [
539 | " 0 1 2 3 4 5 6 7 8 9 ... 48 \\\n",
540 | "+ NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN \n",
541 | "- NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN \n",
542 | "a 26.0 34.0 30.0 22.0 36.0 42.0 38.0 34.0 33.0 36.0 ... 23.0 \n",
543 | "c 27.0 22.0 21.0 30.0 19.0 18.0 21.0 20.0 22.0 22.0 ... 36.0 \n",
544 | "g 15.0 24.0 28.0 28.0 29.0 22.0 17.0 20.0 19.0 20.0 ... 26.0 \n",
545 | "t 38.0 26.0 27.0 26.0 22.0 24.0 30.0 32.0 32.0 28.0 ... 21.0 \n",
546 | "\n",
547 | " 49 50 51 52 53 54 55 56 Class \n",
548 | "+ NaN NaN NaN NaN NaN NaN NaN NaN 53.0 \n",
549 | "- NaN NaN NaN NaN NaN NaN NaN NaN 53.0 \n",
550 | "a 24.0 28.0 27.0 25.0 22.0 26.0 24.0 27.0 NaN \n",
551 | "c 42.0 31.0 32.0 21.0 32.0 29.0 29.0 17.0 NaN \n",
552 | "g 18.0 24.0 14.0 25.0 22.0 28.0 24.0 28.0 NaN \n",
553 | "t 22.0 23.0 33.0 35.0 30.0 23.0 29.0 34.0 NaN \n",
554 | "\n",
555 | "[6 rows x 58 columns]\n"
556 | ]
557 | }
558 | ],
559 | "source": [
560 | "# desribe does not tell us enough information since the attributes are text. Lets record value counts for each sequence\n",
561 | "series = []\n",
562 | "for name in df.columns:\n",
563 | " series.append(df[name].value_counts())\n",
564 | " \n",
565 | "info = pd.DataFrame(series)\n",
566 | "details = info.transpose()\n",
567 | "print(details)"
568 | ]
569 | },
570 | {
571 | "cell_type": "code",
572 | "execution_count": 11,
573 | "metadata": {},
574 | "outputs": [
575 | {
576 | "data": {
577 | "text/html": [
578 | "\n",
579 | "\n",
592 | "
\n",
593 | " \n",
594 | " \n",
595 | " | \n",
596 | " 0_a | \n",
597 | " 0_c | \n",
598 | " 0_g | \n",
599 | " 0_t | \n",
600 | " 1_a | \n",
601 | " 1_c | \n",
602 | " 1_g | \n",
603 | " 1_t | \n",
604 | " 2_a | \n",
605 | " 2_c | \n",
606 | " ... | \n",
607 | " 55_a | \n",
608 | " 55_c | \n",
609 | " 55_g | \n",
610 | " 55_t | \n",
611 | " 56_a | \n",
612 | " 56_c | \n",
613 | " 56_g | \n",
614 | " 56_t | \n",
615 | " Class_+ | \n",
616 | " Class_- | \n",
617 | "
\n",
618 | " \n",
619 | " \n",
620 | " \n",
621 | " 0 | \n",
622 | " 0 | \n",
623 | " 0 | \n",
624 | " 0 | \n",
625 | " 1 | \n",
626 | " 1 | \n",
627 | " 0 | \n",
628 | " 0 | \n",
629 | " 0 | \n",
630 | " 0 | \n",
631 | " 1 | \n",
632 | " ... | \n",
633 | " 0 | \n",
634 | " 0 | \n",
635 | " 1 | \n",
636 | " 0 | \n",
637 | " 0 | \n",
638 | " 0 | \n",
639 | " 0 | \n",
640 | " 1 | \n",
641 | " 1 | \n",
642 | " 0 | \n",
643 | "
\n",
644 | " \n",
645 | " 1 | \n",
646 | " 0 | \n",
647 | " 0 | \n",
648 | " 0 | \n",
649 | " 1 | \n",
650 | " 0 | \n",
651 | " 0 | \n",
652 | " 1 | \n",
653 | " 0 | \n",
654 | " 0 | \n",
655 | " 1 | \n",
656 | " ... | \n",
657 | " 1 | \n",
658 | " 0 | \n",
659 | " 0 | \n",
660 | " 0 | \n",
661 | " 1 | \n",
662 | " 0 | \n",
663 | " 0 | \n",
664 | " 0 | \n",
665 | " 1 | \n",
666 | " 0 | \n",
667 | "
\n",
668 | " \n",
669 | " 2 | \n",
670 | " 0 | \n",
671 | " 0 | \n",
672 | " 1 | \n",
673 | " 0 | \n",
674 | " 0 | \n",
675 | " 0 | \n",
676 | " 0 | \n",
677 | " 1 | \n",
678 | " 1 | \n",
679 | " 0 | \n",
680 | " ... | \n",
681 | " 0 | \n",
682 | " 1 | \n",
683 | " 0 | \n",
684 | " 0 | \n",
685 | " 0 | \n",
686 | " 0 | \n",
687 | " 1 | \n",
688 | " 0 | \n",
689 | " 1 | \n",
690 | " 0 | \n",
691 | "
\n",
692 | " \n",
693 | " 3 | \n",
694 | " 1 | \n",
695 | " 0 | \n",
696 | " 0 | \n",
697 | " 0 | \n",
698 | " 1 | \n",
699 | " 0 | \n",
700 | " 0 | \n",
701 | " 0 | \n",
702 | " 0 | \n",
703 | " 0 | \n",
704 | " ... | \n",
705 | " 0 | \n",
706 | " 0 | \n",
707 | " 0 | \n",
708 | " 1 | \n",
709 | " 0 | \n",
710 | " 1 | \n",
711 | " 0 | \n",
712 | " 0 | \n",
713 | " 1 | \n",
714 | " 0 | \n",
715 | "
\n",
716 | " \n",
717 | " 4 | \n",
718 | " 0 | \n",
719 | " 0 | \n",
720 | " 0 | \n",
721 | " 1 | \n",
722 | " 0 | \n",
723 | " 1 | \n",
724 | " 0 | \n",
725 | " 0 | \n",
726 | " 0 | \n",
727 | " 0 | \n",
728 | " ... | \n",
729 | " 1 | \n",
730 | " 0 | \n",
731 | " 0 | \n",
732 | " 0 | \n",
733 | " 0 | \n",
734 | " 0 | \n",
735 | " 1 | \n",
736 | " 0 | \n",
737 | " 1 | \n",
738 | " 0 | \n",
739 | "
\n",
740 | " \n",
741 | "
\n",
742 | "
5 rows × 230 columns
\n",
743 | "
"
744 | ],
745 | "text/plain": [
746 | " 0_a 0_c 0_g 0_t 1_a 1_c 1_g 1_t 2_a 2_c ... 55_a 55_c \\\n",
747 | "0 0 0 0 1 1 0 0 0 0 1 ... 0 0 \n",
748 | "1 0 0 0 1 0 0 1 0 0 1 ... 1 0 \n",
749 | "2 0 0 1 0 0 0 0 1 1 0 ... 0 1 \n",
750 | "3 1 0 0 0 1 0 0 0 0 0 ... 0 0 \n",
751 | "4 0 0 0 1 0 1 0 0 0 0 ... 1 0 \n",
752 | "\n",
753 | " 55_g 55_t 56_a 56_c 56_g 56_t Class_+ Class_- \n",
754 | "0 1 0 0 0 0 1 1 0 \n",
755 | "1 0 0 1 0 0 0 1 0 \n",
756 | "2 0 0 0 0 1 0 1 0 \n",
757 | "3 0 1 0 1 0 0 1 0 \n",
758 | "4 0 0 0 0 1 0 1 0 \n",
759 | "\n",
760 | "[5 rows x 230 columns]"
761 | ]
762 | },
763 | "execution_count": 11,
764 | "metadata": {},
765 | "output_type": "execute_result"
766 | }
767 | ],
768 | "source": [
769 | "# Unfortunately, we can't run machine learning algorithms on the data in 'String' formats. As a result, we need to switch\n",
770 | "# it to numerical data. This can easily be accomplished using the pd.get_dummies() function\n",
771 | "numerical_df = pd.get_dummies(df)\n",
772 | "numerical_df.iloc[:5]"
773 | ]
774 | },
775 | {
776 | "cell_type": "code",
777 | "execution_count": 12,
778 | "metadata": {},
779 | "outputs": [
780 | {
781 | "name": "stdout",
782 | "output_type": "stream",
783 | "text": [
784 | " 0_a 0_c 0_g 0_t 1_a 1_c 1_g 1_t 2_a 2_c ... 54_t 55_a 55_c \\\n",
785 | "0 0 0 0 1 1 0 0 0 0 1 ... 0 0 0 \n",
786 | "1 0 0 0 1 0 0 1 0 0 1 ... 0 1 0 \n",
787 | "2 0 0 1 0 0 0 0 1 1 0 ... 0 0 1 \n",
788 | "3 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 \n",
789 | "4 0 0 0 1 0 1 0 0 0 0 ... 1 1 0 \n",
790 | "\n",
791 | " 55_g 55_t 56_a 56_c 56_g 56_t Class \n",
792 | "0 1 0 0 0 0 1 1 \n",
793 | "1 0 0 1 0 0 0 1 \n",
794 | "2 0 0 0 0 1 0 1 \n",
795 | "3 0 1 0 1 0 0 1 \n",
796 | "4 0 0 0 0 1 0 1 \n",
797 | "\n",
798 | "[5 rows x 229 columns]\n"
799 | ]
800 | }
801 | ],
802 | "source": [
803 | "# We don't need both class columns. Lets drop one then rename the other to simply 'Class'.\n",
804 | "df = numerical_df.drop(columns=['Class_-'])\n",
805 | "\n",
806 | "df.rename(columns = {'Class_+': 'Class'}, inplace = True)\n",
807 | "print(df.iloc[:5])"
808 | ]
809 | },
810 | {
811 | "cell_type": "code",
812 | "execution_count": 13,
813 | "metadata": {},
814 | "outputs": [],
815 | "source": [
816 | "# Use the model_selection module to separate training and testing datasets\n",
817 | "from sklearn import model_selection\n",
818 | "\n",
819 | "# Create X and Y datasets for training\n",
820 | "X = np.array(df.drop(['Class'], 1))\n",
821 | "y = np.array(df['Class'])\n",
822 | "\n",
823 | "# define seed for reproducibility\n",
824 | "seed = 1\n",
825 | "\n",
826 | "# split data into training and testing datasets\n",
827 | "X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=seed)\n"
828 | ]
829 | },
830 | {
831 | "cell_type": "markdown",
832 | "metadata": {},
833 | "source": [
834 | "## Step 3: Training and Testing the Classification Algorithms\n",
835 | "\n",
836 | "Now that we have preprocessed the data and built our training and testing datasets, we can start to deploy different classification algorithms. It's relatively easy to test multiple models; as a result, we will compare and contrast the performance of ten different algorithms."
837 | ]
838 | },
839 | {
840 | "cell_type": "code",
841 | "execution_count": 14,
842 | "metadata": {},
843 | "outputs": [
844 | {
845 | "name": "stdout",
846 | "output_type": "stream",
847 | "text": [
848 | "Nearest Neighbors: 0.823214 (0.113908)\n",
849 | "Gaussian Process: 0.873214 (0.056158)\n",
850 | "Decision Tree: 0.750000 (0.185405)\n",
851 | "Random Forest: 0.580357 (0.106021)\n"
852 | ]
853 | },
854 | {
855 | "name": "stderr",
856 | "output_type": "stream",
857 | "text": [
858 | "C:\\Programdata\\anaconda2\\lib\\site-packages\\sklearn\\neural_network\\multilayer_perceptron.py:564: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.\n",
859 | " % self.max_iter, ConvergenceWarning)\n"
860 | ]
861 | },
862 | {
863 | "name": "stdout",
864 | "output_type": "stream",
865 | "text": [
866 | "Neural Net: 0.887500 (0.087500)\n",
867 | "AdaBoost: 0.912500 (0.112500)\n",
868 | "Naive Bayes: 0.837500 (0.137500)\n",
869 | "SVM Linear: 0.850000 (0.108972)\n",
870 | "SVM RBF: 0.737500 (0.117925)\n",
871 | "SVM Sigmoid: 0.569643 (0.159209)\n"
872 | ]
873 | }
874 | ],
875 | "source": [
876 | "# Now that we have our dataset, we can start building algorithms! We'll need to import each algorithm we plan on using\n",
877 | "# from sklearn. We also need to import some performance metrics, such as accuracy_score and classification_report.\n",
878 | "\n",
879 | "from sklearn.neighbors import KNeighborsClassifier\n",
880 | "from sklearn.neural_network import MLPClassifier\n",
881 | "from sklearn.gaussian_process import GaussianProcessClassifier\n",
882 | "from sklearn.gaussian_process.kernels import RBF\n",
883 | "from sklearn.tree import DecisionTreeClassifier\n",
884 | "from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier\n",
885 | "from sklearn.naive_bayes import GaussianNB\n",
886 | "from sklearn.svm import SVC\n",
887 | "from sklearn.metrics import classification_report, accuracy_score\n",
888 | "\n",
889 | "# define scoring method\n",
890 | "scoring = 'accuracy'\n",
891 | "\n",
892 | "# Define models to train\n",
893 | "names = [\"Nearest Neighbors\", \"Gaussian Process\",\n",
894 | " \"Decision Tree\", \"Random Forest\", \"Neural Net\", \"AdaBoost\",\n",
895 | " \"Naive Bayes\", \"SVM Linear\", \"SVM RBF\", \"SVM Sigmoid\"]\n",
896 | "\n",
897 | "classifiers = [\n",
898 | " KNeighborsClassifier(n_neighbors = 3),\n",
899 | " GaussianProcessClassifier(1.0 * RBF(1.0)),\n",
900 | " DecisionTreeClassifier(max_depth=5),\n",
901 | " RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),\n",
902 | " MLPClassifier(alpha=1),\n",
903 | " AdaBoostClassifier(),\n",
904 | " GaussianNB(),\n",
905 | " SVC(kernel = 'linear'), \n",
906 | " SVC(kernel = 'rbf'),\n",
907 | " SVC(kernel = 'sigmoid')\n",
908 | "]\n",
909 | "\n",
910 | "models = zip(names, classifiers)\n",
911 | "\n",
912 | "# evaluate each model in turn\n",
913 | "results = []\n",
914 | "names = []\n",
915 | "\n",
916 | "for name, model in models:\n",
917 | " kfold = model_selection.KFold(n_splits=10, random_state = seed)\n",
918 | " cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)\n",
919 | " results.append(cv_results)\n",
920 | " names.append(name)\n",
921 | " msg = \"%s: %f (%f)\" % (name, cv_results.mean(), cv_results.std())\n",
922 | " print(msg)"
923 | ]
924 | },
925 | {
926 | "cell_type": "code",
927 | "execution_count": 15,
928 | "metadata": {},
929 | "outputs": [
930 | {
931 | "name": "stdout",
932 | "output_type": "stream",
933 | "text": [
934 | "Nearest Neighbors\n",
935 | "0.7777777777777778\n",
936 | " precision recall f1-score support\n",
937 | "\n",
938 | " 0 1.00 0.65 0.79 17\n",
939 | " 1 0.62 1.00 0.77 10\n",
940 | "\n",
941 | "avg / total 0.86 0.78 0.78 27\n",
942 | "\n",
943 | "Gaussian Process\n",
944 | "0.8888888888888888\n",
945 | " precision recall f1-score support\n",
946 | "\n",
947 | " 0 1.00 0.82 0.90 17\n",
948 | " 1 0.77 1.00 0.87 10\n",
949 | "\n",
950 | "avg / total 0.91 0.89 0.89 27\n",
951 | "\n",
952 | "Decision Tree\n",
953 | "0.7777777777777778\n",
954 | " precision recall f1-score support\n",
955 | "\n",
956 | " 0 1.00 0.65 0.79 17\n",
957 | " 1 0.62 1.00 0.77 10\n",
958 | "\n",
959 | "avg / total 0.86 0.78 0.78 27\n",
960 | "\n",
961 | "Random Forest\n",
962 | "0.5925925925925926\n",
963 | " precision recall f1-score support\n",
964 | "\n",
965 | " 0 0.88 0.41 0.56 17\n",
966 | " 1 0.47 0.90 0.62 10\n",
967 | "\n",
968 | "avg / total 0.73 0.59 0.58 27\n",
969 | "\n",
970 | "Neural Net\n",
971 | "0.9259259259259259\n",
972 | " precision recall f1-score support\n",
973 | "\n",
974 | " 0 1.00 0.88 0.94 17\n",
975 | " 1 0.83 1.00 0.91 10\n",
976 | "\n",
977 | "avg / total 0.94 0.93 0.93 27\n",
978 | "\n",
979 | "AdaBoost\n",
980 | "0.8518518518518519\n",
981 | " precision recall f1-score support\n",
982 | "\n",
983 | " 0 1.00 0.76 0.87 17\n",
984 | " 1 0.71 1.00 0.83 10\n",
985 | "\n",
986 | "avg / total 0.89 0.85 0.85 27\n",
987 | "\n",
988 | "Naive Bayes\n",
989 | "0.9259259259259259\n",
990 | " precision recall f1-score support\n",
991 | "\n",
992 | " 0 1.00 0.88 0.94 17\n",
993 | " 1 0.83 1.00 0.91 10\n",
994 | "\n",
995 | "avg / total 0.94 0.93 0.93 27\n",
996 | "\n",
997 | "SVM Linear\n",
998 | "0.9629629629629629\n",
999 | " precision recall f1-score support\n",
1000 | "\n",
1001 | " 0 1.00 0.94 0.97 17\n",
1002 | " 1 0.91 1.00 0.95 10\n",
1003 | "\n",
1004 | "avg / total 0.97 0.96 0.96 27\n",
1005 | "\n",
1006 | "SVM RBF\n",
1007 | "0.7777777777777778\n",
1008 | " precision recall f1-score support\n",
1009 | "\n",
1010 | " 0 1.00 0.65 0.79 17\n",
1011 | " 1 0.62 1.00 0.77 10\n",
1012 | "\n",
1013 | "avg / total 0.86 0.78 0.78 27\n",
1014 | "\n",
1015 | "SVM Sigmoid\n",
1016 | "0.4444444444444444\n",
1017 | " precision recall f1-score support\n",
1018 | "\n",
1019 | " 0 1.00 0.12 0.21 17\n",
1020 | " 1 0.40 1.00 0.57 10\n",
1021 | "\n",
1022 | "avg / total 0.78 0.44 0.34 27\n",
1023 | "\n"
1024 | ]
1025 | }
1026 | ],
1027 | "source": [
1028 | "# Remember, performance on the training data is not that important. We want to know how well our algorithms\n",
1029 | "# can generalize to new data. To test this, let's make predictions on the validation dataset.\n",
1030 | "\n",
1031 | "for name, model in models:\n",
1032 | " model.fit(X_train, y_train)\n",
1033 | " predictions = model.predict(X_test)\n",
1034 | " print(name)\n",
1035 | " print(accuracy_score(y_test, predictions))\n",
1036 | " print(classification_report(y_test, predictions))\n",
1037 | " \n",
1038 | "# Accuracy - ratio of correctly predicted observation to the total observations. \n",
1039 | "# Precision - (false positives) ratio of correctly predicted positive observations to the total predicted positive observations\n",
1040 | "# Recall (Sensitivity) - (false negatives) ratio of correctly predicted positive observations to the all observations in actual class - yes.\n",
1041 | "# F1 score - F1 Score is the weighted average of Precision and Recall. Therefore, this score takes both false positives and false "
1042 | ]
1043 | },
1044 | {
1045 | "cell_type": "code",
1046 | "execution_count": null,
1047 | "metadata": {
1048 | "collapsed": true
1049 | },
1050 | "outputs": [],
1051 | "source": []
1052 | }
1053 | ],
1054 | "metadata": {
1055 | "kernelspec": {
1056 | "display_name": "Python [default]",
1057 | "language": "python",
1058 | "name": "python2"
1059 | },
1060 | "language_info": {
1061 | "codemirror_mode": {
1062 | "name": "ipython",
1063 | "version": 2
1064 | },
1065 | "file_extension": ".py",
1066 | "mimetype": "text/x-python",
1067 | "name": "python",
1068 | "nbconvert_exporter": "python",
1069 | "pygments_lexer": "ipython2",
1070 | "version": "2.7.13"
1071 | }
1072 | },
1073 | "nbformat": 4,
1074 | "nbformat_minor": 2
1075 | }
1076 |
--------------------------------------------------------------------------------
/Chapter05/Autism Screening with Machine Learning.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "### Childhood Autistic Spectrum Disorder Screening using Machine Learning\n",
8 | "\n",
9 | "The early diagnosis of neurodevelopment disorders can improve treatment and significantly decrease the associated \n",
10 | "healthcare costs. In this project, we will use supervised learning to diagnose Autistic Spectrum Disorder \n",
11 | "(ASD) based on behavioural features and individual characteristics. More specifically, we will build and deploy a neural network using the Keras API. \n",
12 | "\n",
13 | "This project will use a dataset provided by the UCI Machine Learning Repository that contains screening data for 292 patients. The dataset can be found at the following URL: \n",
14 | "https://archive.ics.uci.edu/ml/datasets/Autistic+Spectrum+Disorder+Screening+Data+for+Children++\n",
15 | "\n",
16 | "Let's dive right in! First, we will import a few of libraries we will use in this project. "
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "execution_count": 1,
22 | "metadata": {},
23 | "outputs": [
24 | {
25 | "name": "stderr",
26 | "output_type": "stream",
27 | "text": [
28 | "Using Theano backend.\n",
29 | "WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.\n"
30 | ]
31 | },
32 | {
33 | "name": "stdout",
34 | "output_type": "stream",
35 | "text": [
36 | "Python: 2.7.13 |Continuum Analytics, Inc.| (default, May 11 2017, 13:17:26) [MSC v.1500 64 bit (AMD64)]\n",
37 | "Pandas: 0.21.0\n",
38 | "Sklearn: 0.19.1\n",
39 | "Keras: 2.1.4\n"
40 | ]
41 | }
42 | ],
43 | "source": [
44 | "import sys\n",
45 | "import pandas as pd\n",
46 | "import sklearn\n",
47 | "import keras\n",
48 | "\n",
49 | "print 'Python: {}'.format(sys.version)\n",
50 | "print 'Pandas: {}'.format(pd.__version__)\n",
51 | "print 'Sklearn: {}'.format(sklearn.__version__)\n",
52 | "print 'Keras: {}'.format(keras.__version__)"
53 | ]
54 | },
55 | {
56 | "cell_type": "markdown",
57 | "metadata": {},
58 | "source": [
59 | "### 1. Importing the Dataset\n",
60 | "\n",
61 | "We will obtain the data from the UCI Machine Learning Repository; however, since the data isn't contained in a csv or txt file, we will have to download the compressed zip file and then extract the data manually. Once that is accomplished, we will read the information in from a text file using Pandas. "
62 | ]
63 | },
64 | {
65 | "cell_type": "code",
66 | "execution_count": 2,
67 | "metadata": {},
68 | "outputs": [],
69 | "source": [
70 | "# import the dataset\n",
71 | "file = 'C:/users/brend/tutorial/autism-data.txt'\n",
72 | "\n",
73 | "# read the csv\n",
74 | "data = pd.read_table(file, sep = ',', index_col = None)"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": 3,
80 | "metadata": {},
81 | "outputs": [
82 | {
83 | "name": "stdout",
84 | "output_type": "stream",
85 | "text": [
86 | "Shape of DataFrame: (292, 21)\n",
87 | "A1_Score 1\n",
88 | "A2_Score 1\n",
89 | "A3_Score 0\n",
90 | "A4_Score 0\n",
91 | "A5_Score 1\n",
92 | "A6_Score 1\n",
93 | "A7_Score 0\n",
94 | "A8_Score 1\n",
95 | "A9_Score 0\n",
96 | "A10_Score 0\n",
97 | "age 6\n",
98 | "gender m\n",
99 | "ethnicity Others\n",
100 | "jundice no\n",
101 | "family_history_of_PDD no\n",
102 | "contry_of_res Jordan\n",
103 | "used_app_before no\n",
104 | "result 5\n",
105 | "age_desc '4-11 years'\n",
106 | "relation Parent\n",
107 | "class NO\n",
108 | "Name: 0, dtype: object\n"
109 | ]
110 | }
111 | ],
112 | "source": [
113 | "# print the shape of the DataFrame, so we can see how many examples we have\n",
114 | "print 'Shape of DataFrame: {}'.format(data.shape)\n",
115 | "print data.loc[0]"
116 | ]
117 | },
118 | {
119 | "cell_type": "code",
120 | "execution_count": 4,
121 | "metadata": {},
122 | "outputs": [
123 | {
124 | "data": {
125 | "text/html": [
126 | "\n",
127 | "\n",
140 | "
\n",
141 | " \n",
142 | " \n",
143 | " | \n",
144 | " A1_Score | \n",
145 | " A2_Score | \n",
146 | " A3_Score | \n",
147 | " A4_Score | \n",
148 | " A5_Score | \n",
149 | " A6_Score | \n",
150 | " A7_Score | \n",
151 | " A8_Score | \n",
152 | " A9_Score | \n",
153 | " A10_Score | \n",
154 | " ... | \n",
155 | " gender | \n",
156 | " ethnicity | \n",
157 | " jundice | \n",
158 | " family_history_of_PDD | \n",
159 | " contry_of_res | \n",
160 | " used_app_before | \n",
161 | " result | \n",
162 | " age_desc | \n",
163 | " relation | \n",
164 | " class | \n",
165 | "
\n",
166 | " \n",
167 | " \n",
168 | " \n",
169 | " 0 | \n",
170 | " 1 | \n",
171 | " 1 | \n",
172 | " 0 | \n",
173 | " 0 | \n",
174 | " 1 | \n",
175 | " 1 | \n",
176 | " 0 | \n",
177 | " 1 | \n",
178 | " 0 | \n",
179 | " 0 | \n",
180 | " ... | \n",
181 | " m | \n",
182 | " Others | \n",
183 | " no | \n",
184 | " no | \n",
185 | " Jordan | \n",
186 | " no | \n",
187 | " 5 | \n",
188 | " '4-11 years' | \n",
189 | " Parent | \n",
190 | " NO | \n",
191 | "
\n",
192 | " \n",
193 | " 1 | \n",
194 | " 1 | \n",
195 | " 1 | \n",
196 | " 0 | \n",
197 | " 0 | \n",
198 | " 1 | \n",
199 | " 1 | \n",
200 | " 0 | \n",
201 | " 1 | \n",
202 | " 0 | \n",
203 | " 0 | \n",
204 | " ... | \n",
205 | " m | \n",
206 | " 'Middle Eastern ' | \n",
207 | " no | \n",
208 | " no | \n",
209 | " Jordan | \n",
210 | " no | \n",
211 | " 5 | \n",
212 | " '4-11 years' | \n",
213 | " Parent | \n",
214 | " NO | \n",
215 | "
\n",
216 | " \n",
217 | " 2 | \n",
218 | " 1 | \n",
219 | " 1 | \n",
220 | " 0 | \n",
221 | " 0 | \n",
222 | " 0 | \n",
223 | " 1 | \n",
224 | " 1 | \n",
225 | " 1 | \n",
226 | " 0 | \n",
227 | " 0 | \n",
228 | " ... | \n",
229 | " m | \n",
230 | " ? | \n",
231 | " no | \n",
232 | " no | \n",
233 | " Jordan | \n",
234 | " yes | \n",
235 | " 5 | \n",
236 | " '4-11 years' | \n",
237 | " ? | \n",
238 | " NO | \n",
239 | "
\n",
240 | " \n",
241 | " 3 | \n",
242 | " 0 | \n",
243 | " 1 | \n",
244 | " 0 | \n",
245 | " 0 | \n",
246 | " 1 | \n",
247 | " 1 | \n",
248 | " 0 | \n",
249 | " 0 | \n",
250 | " 0 | \n",
251 | " 1 | \n",
252 | " ... | \n",
253 | " f | \n",
254 | " ? | \n",
255 | " yes | \n",
256 | " no | \n",
257 | " Jordan | \n",
258 | " no | \n",
259 | " 4 | \n",
260 | " '4-11 years' | \n",
261 | " ? | \n",
262 | " NO | \n",
263 | "
\n",
264 | " \n",
265 | " 4 | \n",
266 | " 1 | \n",
267 | " 1 | \n",
268 | " 1 | \n",
269 | " 1 | \n",
270 | " 1 | \n",
271 | " 1 | \n",
272 | " 1 | \n",
273 | " 1 | \n",
274 | " 1 | \n",
275 | " 1 | \n",
276 | " ... | \n",
277 | " m | \n",
278 | " Others | \n",
279 | " yes | \n",
280 | " no | \n",
281 | " 'United States' | \n",
282 | " no | \n",
283 | " 10 | \n",
284 | " '4-11 years' | \n",
285 | " Parent | \n",
286 | " YES | \n",
287 | "
\n",
288 | " \n",
289 | " 5 | \n",
290 | " 0 | \n",
291 | " 0 | \n",
292 | " 1 | \n",
293 | " 0 | \n",
294 | " 1 | \n",
295 | " 1 | \n",
296 | " 0 | \n",
297 | " 1 | \n",
298 | " 0 | \n",
299 | " 1 | \n",
300 | " ... | \n",
301 | " m | \n",
302 | " ? | \n",
303 | " no | \n",
304 | " yes | \n",
305 | " Egypt | \n",
306 | " no | \n",
307 | " 5 | \n",
308 | " '4-11 years' | \n",
309 | " ? | \n",
310 | " NO | \n",
311 | "
\n",
312 | " \n",
313 | " 6 | \n",
314 | " 1 | \n",
315 | " 0 | \n",
316 | " 1 | \n",
317 | " 1 | \n",
318 | " 1 | \n",
319 | " 1 | \n",
320 | " 0 | \n",
321 | " 1 | \n",
322 | " 0 | \n",
323 | " 1 | \n",
324 | " ... | \n",
325 | " m | \n",
326 | " White-European | \n",
327 | " no | \n",
328 | " no | \n",
329 | " 'United Kingdom' | \n",
330 | " no | \n",
331 | " 7 | \n",
332 | " '4-11 years' | \n",
333 | " Parent | \n",
334 | " YES | \n",
335 | "
\n",
336 | " \n",
337 | " 7 | \n",
338 | " 1 | \n",
339 | " 1 | \n",
340 | " 1 | \n",
341 | " 1 | \n",
342 | " 1 | \n",
343 | " 1 | \n",
344 | " 1 | \n",
345 | " 1 | \n",
346 | " 0 | \n",
347 | " 0 | \n",
348 | " ... | \n",
349 | " f | \n",
350 | " 'Middle Eastern ' | \n",
351 | " no | \n",
352 | " no | \n",
353 | " Bahrain | \n",
354 | " no | \n",
355 | " 8 | \n",
356 | " '4-11 years' | \n",
357 | " Parent | \n",
358 | " YES | \n",
359 | "
\n",
360 | " \n",
361 | " 8 | \n",
362 | " 1 | \n",
363 | " 1 | \n",
364 | " 1 | \n",
365 | " 1 | \n",
366 | " 1 | \n",
367 | " 1 | \n",
368 | " 1 | \n",
369 | " 0 | \n",
370 | " 0 | \n",
371 | " 0 | \n",
372 | " ... | \n",
373 | " f | \n",
374 | " 'Middle Eastern ' | \n",
375 | " no | \n",
376 | " no | \n",
377 | " Bahrain | \n",
378 | " no | \n",
379 | " 7 | \n",
380 | " '4-11 years' | \n",
381 | " Parent | \n",
382 | " YES | \n",
383 | "
\n",
384 | " \n",
385 | " 9 | \n",
386 | " 0 | \n",
387 | " 0 | \n",
388 | " 1 | \n",
389 | " 1 | \n",
390 | " 1 | \n",
391 | " 0 | \n",
392 | " 1 | \n",
393 | " 1 | \n",
394 | " 0 | \n",
395 | " 0 | \n",
396 | " ... | \n",
397 | " f | \n",
398 | " ? | \n",
399 | " no | \n",
400 | " yes | \n",
401 | " Austria | \n",
402 | " no | \n",
403 | " 5 | \n",
404 | " '4-11 years' | \n",
405 | " ? | \n",
406 | " NO | \n",
407 | "
\n",
408 | " \n",
409 | " 10 | \n",
410 | " 1 | \n",
411 | " 0 | \n",
412 | " 0 | \n",
413 | " 0 | \n",
414 | " 1 | \n",
415 | " 1 | \n",
416 | " 1 | \n",
417 | " 1 | \n",
418 | " 1 | \n",
419 | " 1 | \n",
420 | " ... | \n",
421 | " m | \n",
422 | " White-European | \n",
423 | " yes | \n",
424 | " no | \n",
425 | " 'United Kingdom' | \n",
426 | " no | \n",
427 | " 7 | \n",
428 | " '4-11 years' | \n",
429 | " Self | \n",
430 | " YES | \n",
431 | "
\n",
432 | " \n",
433 | "
\n",
434 | "
11 rows × 21 columns
\n",
435 | "
"
436 | ],
437 | "text/plain": [
438 | " A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score \\\n",
439 | "0 1 1 0 0 1 1 0 \n",
440 | "1 1 1 0 0 1 1 0 \n",
441 | "2 1 1 0 0 0 1 1 \n",
442 | "3 0 1 0 0 1 1 0 \n",
443 | "4 1 1 1 1 1 1 1 \n",
444 | "5 0 0 1 0 1 1 0 \n",
445 | "6 1 0 1 1 1 1 0 \n",
446 | "7 1 1 1 1 1 1 1 \n",
447 | "8 1 1 1 1 1 1 1 \n",
448 | "9 0 0 1 1 1 0 1 \n",
449 | "10 1 0 0 0 1 1 1 \n",
450 | "\n",
451 | " A8_Score A9_Score A10_Score ... gender ethnicity jundice \\\n",
452 | "0 1 0 0 ... m Others no \n",
453 | "1 1 0 0 ... m 'Middle Eastern ' no \n",
454 | "2 1 0 0 ... m ? no \n",
455 | "3 0 0 1 ... f ? yes \n",
456 | "4 1 1 1 ... m Others yes \n",
457 | "5 1 0 1 ... m ? no \n",
458 | "6 1 0 1 ... m White-European no \n",
459 | "7 1 0 0 ... f 'Middle Eastern ' no \n",
460 | "8 0 0 0 ... f 'Middle Eastern ' no \n",
461 | "9 1 0 0 ... f ? no \n",
462 | "10 1 1 1 ... m White-European yes \n",
463 | "\n",
464 | " family_history_of_PDD contry_of_res used_app_before result \\\n",
465 | "0 no Jordan no 5 \n",
466 | "1 no Jordan no 5 \n",
467 | "2 no Jordan yes 5 \n",
468 | "3 no Jordan no 4 \n",
469 | "4 no 'United States' no 10 \n",
470 | "5 yes Egypt no 5 \n",
471 | "6 no 'United Kingdom' no 7 \n",
472 | "7 no Bahrain no 8 \n",
473 | "8 no Bahrain no 7 \n",
474 | "9 yes Austria no 5 \n",
475 | "10 no 'United Kingdom' no 7 \n",
476 | "\n",
477 | " age_desc relation class \n",
478 | "0 '4-11 years' Parent NO \n",
479 | "1 '4-11 years' Parent NO \n",
480 | "2 '4-11 years' ? NO \n",
481 | "3 '4-11 years' ? NO \n",
482 | "4 '4-11 years' Parent YES \n",
483 | "5 '4-11 years' ? NO \n",
484 | "6 '4-11 years' Parent YES \n",
485 | "7 '4-11 years' Parent YES \n",
486 | "8 '4-11 years' Parent YES \n",
487 | "9 '4-11 years' ? NO \n",
488 | "10 '4-11 years' Self YES \n",
489 | "\n",
490 | "[11 rows x 21 columns]"
491 | ]
492 | },
493 | "execution_count": 4,
494 | "metadata": {},
495 | "output_type": "execute_result"
496 | }
497 | ],
498 | "source": [
499 | "# print out multiple patients at the same time\n",
500 | "data.loc[:10]"
501 | ]
502 | },
503 | {
504 | "cell_type": "code",
505 | "execution_count": 5,
506 | "metadata": {},
507 | "outputs": [
508 | {
509 | "data": {
510 | "text/html": [
511 | "\n",
512 | "\n",
525 | "
\n",
526 | " \n",
527 | " \n",
528 | " | \n",
529 | " A1_Score | \n",
530 | " A2_Score | \n",
531 | " A3_Score | \n",
532 | " A4_Score | \n",
533 | " A5_Score | \n",
534 | " A6_Score | \n",
535 | " A7_Score | \n",
536 | " A8_Score | \n",
537 | " A9_Score | \n",
538 | " A10_Score | \n",
539 | " result | \n",
540 | "
\n",
541 | " \n",
542 | " \n",
543 | " \n",
544 | " count | \n",
545 | " 292.000000 | \n",
546 | " 292.000000 | \n",
547 | " 292.000000 | \n",
548 | " 292.000000 | \n",
549 | " 292.000000 | \n",
550 | " 292.000000 | \n",
551 | " 292.000000 | \n",
552 | " 292.000000 | \n",
553 | " 292.000000 | \n",
554 | " 292.000000 | \n",
555 | " 292.000000 | \n",
556 | "
\n",
557 | " \n",
558 | " mean | \n",
559 | " 0.633562 | \n",
560 | " 0.534247 | \n",
561 | " 0.743151 | \n",
562 | " 0.551370 | \n",
563 | " 0.743151 | \n",
564 | " 0.712329 | \n",
565 | " 0.606164 | \n",
566 | " 0.496575 | \n",
567 | " 0.493151 | \n",
568 | " 0.726027 | \n",
569 | " 6.239726 | \n",
570 | "
\n",
571 | " \n",
572 | " std | \n",
573 | " 0.482658 | \n",
574 | " 0.499682 | \n",
575 | " 0.437646 | \n",
576 | " 0.498208 | \n",
577 | " 0.437646 | \n",
578 | " 0.453454 | \n",
579 | " 0.489438 | \n",
580 | " 0.500847 | \n",
581 | " 0.500811 | \n",
582 | " 0.446761 | \n",
583 | " 2.284882 | \n",
584 | "
\n",
585 | " \n",
586 | " min | \n",
587 | " 0.000000 | \n",
588 | " 0.000000 | \n",
589 | " 0.000000 | \n",
590 | " 0.000000 | \n",
591 | " 0.000000 | \n",
592 | " 0.000000 | \n",
593 | " 0.000000 | \n",
594 | " 0.000000 | \n",
595 | " 0.000000 | \n",
596 | " 0.000000 | \n",
597 | " 0.000000 | \n",
598 | "
\n",
599 | " \n",
600 | " 25% | \n",
601 | " 0.000000 | \n",
602 | " 0.000000 | \n",
603 | " 0.000000 | \n",
604 | " 0.000000 | \n",
605 | " 0.000000 | \n",
606 | " 0.000000 | \n",
607 | " 0.000000 | \n",
608 | " 0.000000 | \n",
609 | " 0.000000 | \n",
610 | " 0.000000 | \n",
611 | " 5.000000 | \n",
612 | "
\n",
613 | " \n",
614 | " 50% | \n",
615 | " 1.000000 | \n",
616 | " 1.000000 | \n",
617 | " 1.000000 | \n",
618 | " 1.000000 | \n",
619 | " 1.000000 | \n",
620 | " 1.000000 | \n",
621 | " 1.000000 | \n",
622 | " 0.000000 | \n",
623 | " 0.000000 | \n",
624 | " 1.000000 | \n",
625 | " 6.000000 | \n",
626 | "
\n",
627 | " \n",
628 | " 75% | \n",
629 | " 1.000000 | \n",
630 | " 1.000000 | \n",
631 | " 1.000000 | \n",
632 | " 1.000000 | \n",
633 | " 1.000000 | \n",
634 | " 1.000000 | \n",
635 | " 1.000000 | \n",
636 | " 1.000000 | \n",
637 | " 1.000000 | \n",
638 | " 1.000000 | \n",
639 | " 8.000000 | \n",
640 | "
\n",
641 | " \n",
642 | " max | \n",
643 | " 1.000000 | \n",
644 | " 1.000000 | \n",
645 | " 1.000000 | \n",
646 | " 1.000000 | \n",
647 | " 1.000000 | \n",
648 | " 1.000000 | \n",
649 | " 1.000000 | \n",
650 | " 1.000000 | \n",
651 | " 1.000000 | \n",
652 | " 1.000000 | \n",
653 | " 10.000000 | \n",
654 | "
\n",
655 | " \n",
656 | "
\n",
657 | "
"
658 | ],
659 | "text/plain": [
660 | " A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score \\\n",
661 | "count 292.000000 292.000000 292.000000 292.000000 292.000000 292.000000 \n",
662 | "mean 0.633562 0.534247 0.743151 0.551370 0.743151 0.712329 \n",
663 | "std 0.482658 0.499682 0.437646 0.498208 0.437646 0.453454 \n",
664 | "min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
665 | "25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
666 | "50% 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 \n",
667 | "75% 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 \n",
668 | "max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 \n",
669 | "\n",
670 | " A7_Score A8_Score A9_Score A10_Score result \n",
671 | "count 292.000000 292.000000 292.000000 292.000000 292.000000 \n",
672 | "mean 0.606164 0.496575 0.493151 0.726027 6.239726 \n",
673 | "std 0.489438 0.500847 0.500811 0.446761 2.284882 \n",
674 | "min 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
675 | "25% 0.000000 0.000000 0.000000 0.000000 5.000000 \n",
676 | "50% 1.000000 0.000000 0.000000 1.000000 6.000000 \n",
677 | "75% 1.000000 1.000000 1.000000 1.000000 8.000000 \n",
678 | "max 1.000000 1.000000 1.000000 1.000000 10.000000 "
679 | ]
680 | },
681 | "execution_count": 5,
682 | "metadata": {},
683 | "output_type": "execute_result"
684 | }
685 | ],
686 | "source": [
687 | "# print out a description of the dataframe\n",
688 | "data.describe()"
689 | ]
690 | },
691 | {
692 | "cell_type": "markdown",
693 | "metadata": {},
694 | "source": [
695 | "### 2. Data Preprocessing\n",
696 | "\n",
697 | "This dataset is going to require multiple preprocessing steps. First, we have columns in our DataFrame (attributes) that we don't want to use when training our neural network. We will drop these columns first. Secondly, much of our data is reported using strings; as a result, we will convert our data to categorical labels. During our preprocessing, we will also split the dataset into X and Y datasets, where X has all of the attributes we want to use for prediction and Y has the class labels. "
698 | ]
699 | },
700 | {
701 | "cell_type": "code",
702 | "execution_count": 6,
703 | "metadata": {},
704 | "outputs": [],
705 | "source": [
706 | "# drop unwanted columns\n",
707 | "data = data.drop(['result', 'age_desc'], axis=1)"
708 | ]
709 | },
710 | {
711 | "cell_type": "code",
712 | "execution_count": 7,
713 | "metadata": {},
714 | "outputs": [
715 | {
716 | "data": {
717 | "text/html": [
718 | "\n",
719 | "\n",
732 | "
\n",
733 | " \n",
734 | " \n",
735 | " | \n",
736 | " A1_Score | \n",
737 | " A2_Score | \n",
738 | " A3_Score | \n",
739 | " A4_Score | \n",
740 | " A5_Score | \n",
741 | " A6_Score | \n",
742 | " A7_Score | \n",
743 | " A8_Score | \n",
744 | " A9_Score | \n",
745 | " A10_Score | \n",
746 | " age | \n",
747 | " gender | \n",
748 | " ethnicity | \n",
749 | " jundice | \n",
750 | " family_history_of_PDD | \n",
751 | " contry_of_res | \n",
752 | " used_app_before | \n",
753 | " relation | \n",
754 | " class | \n",
755 | "
\n",
756 | " \n",
757 | " \n",
758 | " \n",
759 | " 0 | \n",
760 | " 1 | \n",
761 | " 1 | \n",
762 | " 0 | \n",
763 | " 0 | \n",
764 | " 1 | \n",
765 | " 1 | \n",
766 | " 0 | \n",
767 | " 1 | \n",
768 | " 0 | \n",
769 | " 0 | \n",
770 | " 6 | \n",
771 | " m | \n",
772 | " Others | \n",
773 | " no | \n",
774 | " no | \n",
775 | " Jordan | \n",
776 | " no | \n",
777 | " Parent | \n",
778 | " NO | \n",
779 | "
\n",
780 | " \n",
781 | " 1 | \n",
782 | " 1 | \n",
783 | " 1 | \n",
784 | " 0 | \n",
785 | " 0 | \n",
786 | " 1 | \n",
787 | " 1 | \n",
788 | " 0 | \n",
789 | " 1 | \n",
790 | " 0 | \n",
791 | " 0 | \n",
792 | " 6 | \n",
793 | " m | \n",
794 | " 'Middle Eastern ' | \n",
795 | " no | \n",
796 | " no | \n",
797 | " Jordan | \n",
798 | " no | \n",
799 | " Parent | \n",
800 | " NO | \n",
801 | "
\n",
802 | " \n",
803 | " 2 | \n",
804 | " 1 | \n",
805 | " 1 | \n",
806 | " 0 | \n",
807 | " 0 | \n",
808 | " 0 | \n",
809 | " 1 | \n",
810 | " 1 | \n",
811 | " 1 | \n",
812 | " 0 | \n",
813 | " 0 | \n",
814 | " 6 | \n",
815 | " m | \n",
816 | " ? | \n",
817 | " no | \n",
818 | " no | \n",
819 | " Jordan | \n",
820 | " yes | \n",
821 | " ? | \n",
822 | " NO | \n",
823 | "
\n",
824 | " \n",
825 | " 3 | \n",
826 | " 0 | \n",
827 | " 1 | \n",
828 | " 0 | \n",
829 | " 0 | \n",
830 | " 1 | \n",
831 | " 1 | \n",
832 | " 0 | \n",
833 | " 0 | \n",
834 | " 0 | \n",
835 | " 1 | \n",
836 | " 5 | \n",
837 | " f | \n",
838 | " ? | \n",
839 | " yes | \n",
840 | " no | \n",
841 | " Jordan | \n",
842 | " no | \n",
843 | " ? | \n",
844 | " NO | \n",
845 | "
\n",
846 | " \n",
847 | " 4 | \n",
848 | " 1 | \n",
849 | " 1 | \n",
850 | " 1 | \n",
851 | " 1 | \n",
852 | " 1 | \n",
853 | " 1 | \n",
854 | " 1 | \n",
855 | " 1 | \n",
856 | " 1 | \n",
857 | " 1 | \n",
858 | " 5 | \n",
859 | " m | \n",
860 | " Others | \n",
861 | " yes | \n",
862 | " no | \n",
863 | " 'United States' | \n",
864 | " no | \n",
865 | " Parent | \n",
866 | " YES | \n",
867 | "
\n",
868 | " \n",
869 | " 5 | \n",
870 | " 0 | \n",
871 | " 0 | \n",
872 | " 1 | \n",
873 | " 0 | \n",
874 | " 1 | \n",
875 | " 1 | \n",
876 | " 0 | \n",
877 | " 1 | \n",
878 | " 0 | \n",
879 | " 1 | \n",
880 | " 4 | \n",
881 | " m | \n",
882 | " ? | \n",
883 | " no | \n",
884 | " yes | \n",
885 | " Egypt | \n",
886 | " no | \n",
887 | " ? | \n",
888 | " NO | \n",
889 | "
\n",
890 | " \n",
891 | " 6 | \n",
892 | " 1 | \n",
893 | " 0 | \n",
894 | " 1 | \n",
895 | " 1 | \n",
896 | " 1 | \n",
897 | " 1 | \n",
898 | " 0 | \n",
899 | " 1 | \n",
900 | " 0 | \n",
901 | " 1 | \n",
902 | " 5 | \n",
903 | " m | \n",
904 | " White-European | \n",
905 | " no | \n",
906 | " no | \n",
907 | " 'United Kingdom' | \n",
908 | " no | \n",
909 | " Parent | \n",
910 | " YES | \n",
911 | "
\n",
912 | " \n",
913 | " 7 | \n",
914 | " 1 | \n",
915 | " 1 | \n",
916 | " 1 | \n",
917 | " 1 | \n",
918 | " 1 | \n",
919 | " 1 | \n",
920 | " 1 | \n",
921 | " 1 | \n",
922 | " 0 | \n",
923 | " 0 | \n",
924 | " 5 | \n",
925 | " f | \n",
926 | " 'Middle Eastern ' | \n",
927 | " no | \n",
928 | " no | \n",
929 | " Bahrain | \n",
930 | " no | \n",
931 | " Parent | \n",
932 | " YES | \n",
933 | "
\n",
934 | " \n",
935 | " 8 | \n",
936 | " 1 | \n",
937 | " 1 | \n",
938 | " 1 | \n",
939 | " 1 | \n",
940 | " 1 | \n",
941 | " 1 | \n",
942 | " 1 | \n",
943 | " 0 | \n",
944 | " 0 | \n",
945 | " 0 | \n",
946 | " 11 | \n",
947 | " f | \n",
948 | " 'Middle Eastern ' | \n",
949 | " no | \n",
950 | " no | \n",
951 | " Bahrain | \n",
952 | " no | \n",
953 | " Parent | \n",
954 | " YES | \n",
955 | "
\n",
956 | " \n",
957 | " 9 | \n",
958 | " 0 | \n",
959 | " 0 | \n",
960 | " 1 | \n",
961 | " 1 | \n",
962 | " 1 | \n",
963 | " 0 | \n",
964 | " 1 | \n",
965 | " 1 | \n",
966 | " 0 | \n",
967 | " 0 | \n",
968 | " 11 | \n",
969 | " f | \n",
970 | " ? | \n",
971 | " no | \n",
972 | " yes | \n",
973 | " Austria | \n",
974 | " no | \n",
975 | " ? | \n",
976 | " NO | \n",
977 | "
\n",
978 | " \n",
979 | " 10 | \n",
980 | " 1 | \n",
981 | " 0 | \n",
982 | " 0 | \n",
983 | " 0 | \n",
984 | " 1 | \n",
985 | " 1 | \n",
986 | " 1 | \n",
987 | " 1 | \n",
988 | " 1 | \n",
989 | " 1 | \n",
990 | " 10 | \n",
991 | " m | \n",
992 | " White-European | \n",
993 | " yes | \n",
994 | " no | \n",
995 | " 'United Kingdom' | \n",
996 | " no | \n",
997 | " Self | \n",
998 | " YES | \n",
999 | "
\n",
1000 | " \n",
1001 | "
\n",
1002 | "
"
1003 | ],
1004 | "text/plain": [
1005 | " A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score \\\n",
1006 | "0 1 1 0 0 1 1 0 \n",
1007 | "1 1 1 0 0 1 1 0 \n",
1008 | "2 1 1 0 0 0 1 1 \n",
1009 | "3 0 1 0 0 1 1 0 \n",
1010 | "4 1 1 1 1 1 1 1 \n",
1011 | "5 0 0 1 0 1 1 0 \n",
1012 | "6 1 0 1 1 1 1 0 \n",
1013 | "7 1 1 1 1 1 1 1 \n",
1014 | "8 1 1 1 1 1 1 1 \n",
1015 | "9 0 0 1 1 1 0 1 \n",
1016 | "10 1 0 0 0 1 1 1 \n",
1017 | "\n",
1018 | " A8_Score A9_Score A10_Score age gender ethnicity jundice \\\n",
1019 | "0 1 0 0 6 m Others no \n",
1020 | "1 1 0 0 6 m 'Middle Eastern ' no \n",
1021 | "2 1 0 0 6 m ? no \n",
1022 | "3 0 0 1 5 f ? yes \n",
1023 | "4 1 1 1 5 m Others yes \n",
1024 | "5 1 0 1 4 m ? no \n",
1025 | "6 1 0 1 5 m White-European no \n",
1026 | "7 1 0 0 5 f 'Middle Eastern ' no \n",
1027 | "8 0 0 0 11 f 'Middle Eastern ' no \n",
1028 | "9 1 0 0 11 f ? no \n",
1029 | "10 1 1 1 10 m White-European yes \n",
1030 | "\n",
1031 | " family_history_of_PDD contry_of_res used_app_before relation class \n",
1032 | "0 no Jordan no Parent NO \n",
1033 | "1 no Jordan no Parent NO \n",
1034 | "2 no Jordan yes ? NO \n",
1035 | "3 no Jordan no ? NO \n",
1036 | "4 no 'United States' no Parent YES \n",
1037 | "5 yes Egypt no ? NO \n",
1038 | "6 no 'United Kingdom' no Parent YES \n",
1039 | "7 no Bahrain no Parent YES \n",
1040 | "8 no Bahrain no Parent YES \n",
1041 | "9 yes Austria no ? NO \n",
1042 | "10 no 'United Kingdom' no Self YES "
1043 | ]
1044 | },
1045 | "execution_count": 7,
1046 | "metadata": {},
1047 | "output_type": "execute_result"
1048 | }
1049 | ],
1050 | "source": [
1051 | "data.loc[:10]"
1052 | ]
1053 | },
1054 | {
1055 | "cell_type": "code",
1056 | "execution_count": 8,
1057 | "metadata": {},
1058 | "outputs": [],
1059 | "source": [
1060 | "# create X and Y datasets for training\n",
1061 | "x = data.drop(['class'], 1)\n",
1062 | "y = data['class']"
1063 | ]
1064 | },
1065 | {
1066 | "cell_type": "code",
1067 | "execution_count": 9,
1068 | "metadata": {},
1069 | "outputs": [
1070 | {
1071 | "data": {
1072 | "text/html": [
1073 | "\n",
1074 | "\n",
1087 | "
\n",
1088 | " \n",
1089 | " \n",
1090 | " | \n",
1091 | " A1_Score | \n",
1092 | " A2_Score | \n",
1093 | " A3_Score | \n",
1094 | " A4_Score | \n",
1095 | " A5_Score | \n",
1096 | " A6_Score | \n",
1097 | " A7_Score | \n",
1098 | " A8_Score | \n",
1099 | " A9_Score | \n",
1100 | " A10_Score | \n",
1101 | " age | \n",
1102 | " gender | \n",
1103 | " ethnicity | \n",
1104 | " jundice | \n",
1105 | " family_history_of_PDD | \n",
1106 | " contry_of_res | \n",
1107 | " used_app_before | \n",
1108 | " relation | \n",
1109 | "
\n",
1110 | " \n",
1111 | " \n",
1112 | " \n",
1113 | " 0 | \n",
1114 | " 1 | \n",
1115 | " 1 | \n",
1116 | " 0 | \n",
1117 | " 0 | \n",
1118 | " 1 | \n",
1119 | " 1 | \n",
1120 | " 0 | \n",
1121 | " 1 | \n",
1122 | " 0 | \n",
1123 | " 0 | \n",
1124 | " 6 | \n",
1125 | " m | \n",
1126 | " Others | \n",
1127 | " no | \n",
1128 | " no | \n",
1129 | " Jordan | \n",
1130 | " no | \n",
1131 | " Parent | \n",
1132 | "
\n",
1133 | " \n",
1134 | " 1 | \n",
1135 | " 1 | \n",
1136 | " 1 | \n",
1137 | " 0 | \n",
1138 | " 0 | \n",
1139 | " 1 | \n",
1140 | " 1 | \n",
1141 | " 0 | \n",
1142 | " 1 | \n",
1143 | " 0 | \n",
1144 | " 0 | \n",
1145 | " 6 | \n",
1146 | " m | \n",
1147 | " 'Middle Eastern ' | \n",
1148 | " no | \n",
1149 | " no | \n",
1150 | " Jordan | \n",
1151 | " no | \n",
1152 | " Parent | \n",
1153 | "
\n",
1154 | " \n",
1155 | " 2 | \n",
1156 | " 1 | \n",
1157 | " 1 | \n",
1158 | " 0 | \n",
1159 | " 0 | \n",
1160 | " 0 | \n",
1161 | " 1 | \n",
1162 | " 1 | \n",
1163 | " 1 | \n",
1164 | " 0 | \n",
1165 | " 0 | \n",
1166 | " 6 | \n",
1167 | " m | \n",
1168 | " ? | \n",
1169 | " no | \n",
1170 | " no | \n",
1171 | " Jordan | \n",
1172 | " yes | \n",
1173 | " ? | \n",
1174 | "
\n",
1175 | " \n",
1176 | " 3 | \n",
1177 | " 0 | \n",
1178 | " 1 | \n",
1179 | " 0 | \n",
1180 | " 0 | \n",
1181 | " 1 | \n",
1182 | " 1 | \n",
1183 | " 0 | \n",
1184 | " 0 | \n",
1185 | " 0 | \n",
1186 | " 1 | \n",
1187 | " 5 | \n",
1188 | " f | \n",
1189 | " ? | \n",
1190 | " yes | \n",
1191 | " no | \n",
1192 | " Jordan | \n",
1193 | " no | \n",
1194 | " ? | \n",
1195 | "
\n",
1196 | " \n",
1197 | " 4 | \n",
1198 | " 1 | \n",
1199 | " 1 | \n",
1200 | " 1 | \n",
1201 | " 1 | \n",
1202 | " 1 | \n",
1203 | " 1 | \n",
1204 | " 1 | \n",
1205 | " 1 | \n",
1206 | " 1 | \n",
1207 | " 1 | \n",
1208 | " 5 | \n",
1209 | " m | \n",
1210 | " Others | \n",
1211 | " yes | \n",
1212 | " no | \n",
1213 | " 'United States' | \n",
1214 | " no | \n",
1215 | " Parent | \n",
1216 | "
\n",
1217 | " \n",
1218 | " 5 | \n",
1219 | " 0 | \n",
1220 | " 0 | \n",
1221 | " 1 | \n",
1222 | " 0 | \n",
1223 | " 1 | \n",
1224 | " 1 | \n",
1225 | " 0 | \n",
1226 | " 1 | \n",
1227 | " 0 | \n",
1228 | " 1 | \n",
1229 | " 4 | \n",
1230 | " m | \n",
1231 | " ? | \n",
1232 | " no | \n",
1233 | " yes | \n",
1234 | " Egypt | \n",
1235 | " no | \n",
1236 | " ? | \n",
1237 | "
\n",
1238 | " \n",
1239 | " 6 | \n",
1240 | " 1 | \n",
1241 | " 0 | \n",
1242 | " 1 | \n",
1243 | " 1 | \n",
1244 | " 1 | \n",
1245 | " 1 | \n",
1246 | " 0 | \n",
1247 | " 1 | \n",
1248 | " 0 | \n",
1249 | " 1 | \n",
1250 | " 5 | \n",
1251 | " m | \n",
1252 | " White-European | \n",
1253 | " no | \n",
1254 | " no | \n",
1255 | " 'United Kingdom' | \n",
1256 | " no | \n",
1257 | " Parent | \n",
1258 | "
\n",
1259 | " \n",
1260 | " 7 | \n",
1261 | " 1 | \n",
1262 | " 1 | \n",
1263 | " 1 | \n",
1264 | " 1 | \n",
1265 | " 1 | \n",
1266 | " 1 | \n",
1267 | " 1 | \n",
1268 | " 1 | \n",
1269 | " 0 | \n",
1270 | " 0 | \n",
1271 | " 5 | \n",
1272 | " f | \n",
1273 | " 'Middle Eastern ' | \n",
1274 | " no | \n",
1275 | " no | \n",
1276 | " Bahrain | \n",
1277 | " no | \n",
1278 | " Parent | \n",
1279 | "
\n",
1280 | " \n",
1281 | " 8 | \n",
1282 | " 1 | \n",
1283 | " 1 | \n",
1284 | " 1 | \n",
1285 | " 1 | \n",
1286 | " 1 | \n",
1287 | " 1 | \n",
1288 | " 1 | \n",
1289 | " 0 | \n",
1290 | " 0 | \n",
1291 | " 0 | \n",
1292 | " 11 | \n",
1293 | " f | \n",
1294 | " 'Middle Eastern ' | \n",
1295 | " no | \n",
1296 | " no | \n",
1297 | " Bahrain | \n",
1298 | " no | \n",
1299 | " Parent | \n",
1300 | "
\n",
1301 | " \n",
1302 | " 9 | \n",
1303 | " 0 | \n",
1304 | " 0 | \n",
1305 | " 1 | \n",
1306 | " 1 | \n",
1307 | " 1 | \n",
1308 | " 0 | \n",
1309 | " 1 | \n",
1310 | " 1 | \n",
1311 | " 0 | \n",
1312 | " 0 | \n",
1313 | " 11 | \n",
1314 | " f | \n",
1315 | " ? | \n",
1316 | " no | \n",
1317 | " yes | \n",
1318 | " Austria | \n",
1319 | " no | \n",
1320 | " ? | \n",
1321 | "
\n",
1322 | " \n",
1323 | " 10 | \n",
1324 | " 1 | \n",
1325 | " 0 | \n",
1326 | " 0 | \n",
1327 | " 0 | \n",
1328 | " 1 | \n",
1329 | " 1 | \n",
1330 | " 1 | \n",
1331 | " 1 | \n",
1332 | " 1 | \n",
1333 | " 1 | \n",
1334 | " 10 | \n",
1335 | " m | \n",
1336 | " White-European | \n",
1337 | " yes | \n",
1338 | " no | \n",
1339 | " 'United Kingdom' | \n",
1340 | " no | \n",
1341 | " Self | \n",
1342 | "
\n",
1343 | " \n",
1344 | "
\n",
1345 | "
"
1346 | ],
1347 | "text/plain": [
1348 | " A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score \\\n",
1349 | "0 1 1 0 0 1 1 0 \n",
1350 | "1 1 1 0 0 1 1 0 \n",
1351 | "2 1 1 0 0 0 1 1 \n",
1352 | "3 0 1 0 0 1 1 0 \n",
1353 | "4 1 1 1 1 1 1 1 \n",
1354 | "5 0 0 1 0 1 1 0 \n",
1355 | "6 1 0 1 1 1 1 0 \n",
1356 | "7 1 1 1 1 1 1 1 \n",
1357 | "8 1 1 1 1 1 1 1 \n",
1358 | "9 0 0 1 1 1 0 1 \n",
1359 | "10 1 0 0 0 1 1 1 \n",
1360 | "\n",
1361 | " A8_Score A9_Score A10_Score age gender ethnicity jundice \\\n",
1362 | "0 1 0 0 6 m Others no \n",
1363 | "1 1 0 0 6 m 'Middle Eastern ' no \n",
1364 | "2 1 0 0 6 m ? no \n",
1365 | "3 0 0 1 5 f ? yes \n",
1366 | "4 1 1 1 5 m Others yes \n",
1367 | "5 1 0 1 4 m ? no \n",
1368 | "6 1 0 1 5 m White-European no \n",
1369 | "7 1 0 0 5 f 'Middle Eastern ' no \n",
1370 | "8 0 0 0 11 f 'Middle Eastern ' no \n",
1371 | "9 1 0 0 11 f ? no \n",
1372 | "10 1 1 1 10 m White-European yes \n",
1373 | "\n",
1374 | " family_history_of_PDD contry_of_res used_app_before relation \n",
1375 | "0 no Jordan no Parent \n",
1376 | "1 no Jordan no Parent \n",
1377 | "2 no Jordan yes ? \n",
1378 | "3 no Jordan no ? \n",
1379 | "4 no 'United States' no Parent \n",
1380 | "5 yes Egypt no ? \n",
1381 | "6 no 'United Kingdom' no Parent \n",
1382 | "7 no Bahrain no Parent \n",
1383 | "8 no Bahrain no Parent \n",
1384 | "9 yes Austria no ? \n",
1385 | "10 no 'United Kingdom' no Self "
1386 | ]
1387 | },
1388 | "execution_count": 9,
1389 | "metadata": {},
1390 | "output_type": "execute_result"
1391 | }
1392 | ],
1393 | "source": [
1394 | "x.loc[:10]"
1395 | ]
1396 | },
1397 | {
1398 | "cell_type": "code",
1399 | "execution_count": 10,
1400 | "metadata": {},
1401 | "outputs": [],
1402 | "source": [
1403 | "# convert the data to categorical values - one-hot-encoded vectors\n",
1404 | "X = pd.get_dummies(x)"
1405 | ]
1406 | },
1407 | {
1408 | "cell_type": "code",
1409 | "execution_count": 11,
1410 | "metadata": {},
1411 | "outputs": [
1412 | {
1413 | "data": {
1414 | "text/plain": [
1415 | "array(['A1_Score', 'A2_Score', 'A3_Score', 'A4_Score', 'A5_Score',\n",
1416 | " 'A6_Score', 'A7_Score', 'A8_Score', 'A9_Score', 'A10_Score',\n",
1417 | " 'age_10', 'age_11', 'age_4', 'age_5', 'age_6', 'age_7', 'age_8',\n",
1418 | " 'age_9', 'age_?', 'gender_f', 'gender_m',\n",
1419 | " \"ethnicity_'Middle Eastern '\", \"ethnicity_'South Asian'\",\n",
1420 | " 'ethnicity_?', 'ethnicity_Asian', 'ethnicity_Black',\n",
1421 | " 'ethnicity_Hispanic', 'ethnicity_Latino', 'ethnicity_Others',\n",
1422 | " 'ethnicity_Pasifika', 'ethnicity_Turkish',\n",
1423 | " 'ethnicity_White-European', 'jundice_no', 'jundice_yes',\n",
1424 | " 'family_history_of_PDD_no', 'family_history_of_PDD_yes',\n",
1425 | " \"contry_of_res_'Costa Rica'\", \"contry_of_res_'Isle of Man'\",\n",
1426 | " \"contry_of_res_'New Zealand'\", \"contry_of_res_'Saudi Arabia'\",\n",
1427 | " \"contry_of_res_'South Africa'\", \"contry_of_res_'South Korea'\",\n",
1428 | " \"contry_of_res_'U.S. Outlying Islands'\",\n",
1429 | " \"contry_of_res_'United Arab Emirates'\",\n",
1430 | " \"contry_of_res_'United Kingdom'\", \"contry_of_res_'United States'\",\n",
1431 | " 'contry_of_res_Afghanistan', 'contry_of_res_Argentina',\n",
1432 | " 'contry_of_res_Armenia', 'contry_of_res_Australia',\n",
1433 | " 'contry_of_res_Austria', 'contry_of_res_Bahrain',\n",
1434 | " 'contry_of_res_Bangladesh', 'contry_of_res_Bhutan',\n",
1435 | " 'contry_of_res_Brazil', 'contry_of_res_Bulgaria',\n",
1436 | " 'contry_of_res_Canada', 'contry_of_res_China',\n",
1437 | " 'contry_of_res_Egypt', 'contry_of_res_Europe',\n",
1438 | " 'contry_of_res_Georgia', 'contry_of_res_Germany',\n",
1439 | " 'contry_of_res_Ghana', 'contry_of_res_India', 'contry_of_res_Iraq',\n",
1440 | " 'contry_of_res_Ireland', 'contry_of_res_Italy',\n",
1441 | " 'contry_of_res_Japan', 'contry_of_res_Jordan',\n",
1442 | " 'contry_of_res_Kuwait', 'contry_of_res_Latvia',\n",
1443 | " 'contry_of_res_Lebanon', 'contry_of_res_Libya',\n",
1444 | " 'contry_of_res_Malaysia', 'contry_of_res_Malta',\n",
1445 | " 'contry_of_res_Mexico', 'contry_of_res_Nepal',\n",
1446 | " 'contry_of_res_Netherlands', 'contry_of_res_Nigeria',\n",
1447 | " 'contry_of_res_Oman', 'contry_of_res_Pakistan',\n",
1448 | " 'contry_of_res_Philippines', 'contry_of_res_Qatar',\n",
1449 | " 'contry_of_res_Romania', 'contry_of_res_Russia',\n",
1450 | " 'contry_of_res_Sweden', 'contry_of_res_Syria',\n",
1451 | " 'contry_of_res_Turkey', 'used_app_before_no',\n",
1452 | " 'used_app_before_yes', \"relation_'Health care professional'\",\n",
1453 | " 'relation_?', 'relation_Parent', 'relation_Relative',\n",
1454 | " 'relation_Self', 'relation_self'], dtype=object)"
1455 | ]
1456 | },
1457 | "execution_count": 11,
1458 | "metadata": {},
1459 | "output_type": "execute_result"
1460 | }
1461 | ],
1462 | "source": [
1463 | "# print the new categorical column labels\n",
1464 | "X.columns.values"
1465 | ]
1466 | },
1467 | {
1468 | "cell_type": "code",
1469 | "execution_count": 12,
1470 | "metadata": {},
1471 | "outputs": [
1472 | {
1473 | "data": {
1474 | "text/plain": [
1475 | "A1_Score 1\n",
1476 | "A2_Score 1\n",
1477 | "A3_Score 0\n",
1478 | "A4_Score 0\n",
1479 | "A5_Score 1\n",
1480 | "A6_Score 1\n",
1481 | "A7_Score 0\n",
1482 | "A8_Score 1\n",
1483 | "A9_Score 0\n",
1484 | "A10_Score 0\n",
1485 | "age_10 0\n",
1486 | "age_11 0\n",
1487 | "age_4 0\n",
1488 | "age_5 0\n",
1489 | "age_6 1\n",
1490 | "age_7 0\n",
1491 | "age_8 0\n",
1492 | "age_9 0\n",
1493 | "age_? 0\n",
1494 | "gender_f 0\n",
1495 | "gender_m 1\n",
1496 | "ethnicity_'Middle Eastern ' 1\n",
1497 | "ethnicity_'South Asian' 0\n",
1498 | "ethnicity_? 0\n",
1499 | "ethnicity_Asian 0\n",
1500 | "ethnicity_Black 0\n",
1501 | "ethnicity_Hispanic 0\n",
1502 | "ethnicity_Latino 0\n",
1503 | "ethnicity_Others 0\n",
1504 | "ethnicity_Pasifika 0\n",
1505 | " ..\n",
1506 | "contry_of_res_Italy 0\n",
1507 | "contry_of_res_Japan 0\n",
1508 | "contry_of_res_Jordan 1\n",
1509 | "contry_of_res_Kuwait 0\n",
1510 | "contry_of_res_Latvia 0\n",
1511 | "contry_of_res_Lebanon 0\n",
1512 | "contry_of_res_Libya 0\n",
1513 | "contry_of_res_Malaysia 0\n",
1514 | "contry_of_res_Malta 0\n",
1515 | "contry_of_res_Mexico 0\n",
1516 | "contry_of_res_Nepal 0\n",
1517 | "contry_of_res_Netherlands 0\n",
1518 | "contry_of_res_Nigeria 0\n",
1519 | "contry_of_res_Oman 0\n",
1520 | "contry_of_res_Pakistan 0\n",
1521 | "contry_of_res_Philippines 0\n",
1522 | "contry_of_res_Qatar 0\n",
1523 | "contry_of_res_Romania 0\n",
1524 | "contry_of_res_Russia 0\n",
1525 | "contry_of_res_Sweden 0\n",
1526 | "contry_of_res_Syria 0\n",
1527 | "contry_of_res_Turkey 0\n",
1528 | "used_app_before_no 1\n",
1529 | "used_app_before_yes 0\n",
1530 | "relation_'Health care professional' 0\n",
1531 | "relation_? 0\n",
1532 | "relation_Parent 1\n",
1533 | "relation_Relative 0\n",
1534 | "relation_Self 0\n",
1535 | "relation_self 0\n",
1536 | "Name: 1, Length: 96, dtype: int64"
1537 | ]
1538 | },
1539 | "execution_count": 12,
1540 | "metadata": {},
1541 | "output_type": "execute_result"
1542 | }
1543 | ],
1544 | "source": [
1545 | "# print an example patient from the categorical data\n",
1546 | "X.loc[1]"
1547 | ]
1548 | },
1549 | {
1550 | "cell_type": "code",
1551 | "execution_count": 13,
1552 | "metadata": {},
1553 | "outputs": [],
1554 | "source": [
1555 | "# convert the class data to categorical values - one-hot-encoded vectors\n",
1556 | "Y = pd.get_dummies(y)"
1557 | ]
1558 | },
1559 | {
1560 | "cell_type": "code",
1561 | "execution_count": 14,
1562 | "metadata": {},
1563 | "outputs": [
1564 | {
1565 | "data": {
1566 | "text/html": [
1567 | "\n",
1568 | "\n",
1581 | "
\n",
1582 | " \n",
1583 | " \n",
1584 | " | \n",
1585 | " NO | \n",
1586 | " YES | \n",
1587 | "
\n",
1588 | " \n",
1589 | " \n",
1590 | " \n",
1591 | " 0 | \n",
1592 | " 1 | \n",
1593 | " 0 | \n",
1594 | "
\n",
1595 | " \n",
1596 | " 1 | \n",
1597 | " 1 | \n",
1598 | " 0 | \n",
1599 | "
\n",
1600 | " \n",
1601 | " 2 | \n",
1602 | " 1 | \n",
1603 | " 0 | \n",
1604 | "
\n",
1605 | " \n",
1606 | " 3 | \n",
1607 | " 1 | \n",
1608 | " 0 | \n",
1609 | "
\n",
1610 | " \n",
1611 | " 4 | \n",
1612 | " 0 | \n",
1613 | " 1 | \n",
1614 | "
\n",
1615 | " \n",
1616 | " 5 | \n",
1617 | " 1 | \n",
1618 | " 0 | \n",
1619 | "
\n",
1620 | " \n",
1621 | " 6 | \n",
1622 | " 0 | \n",
1623 | " 1 | \n",
1624 | "
\n",
1625 | " \n",
1626 | " 7 | \n",
1627 | " 0 | \n",
1628 | " 1 | \n",
1629 | "
\n",
1630 | " \n",
1631 | " 8 | \n",
1632 | " 0 | \n",
1633 | " 1 | \n",
1634 | "
\n",
1635 | " \n",
1636 | " 9 | \n",
1637 | " 1 | \n",
1638 | " 0 | \n",
1639 | "
\n",
1640 | " \n",
1641 | "
\n",
1642 | "
"
1643 | ],
1644 | "text/plain": [
1645 | " NO YES\n",
1646 | "0 1 0\n",
1647 | "1 1 0\n",
1648 | "2 1 0\n",
1649 | "3 1 0\n",
1650 | "4 0 1\n",
1651 | "5 1 0\n",
1652 | "6 0 1\n",
1653 | "7 0 1\n",
1654 | "8 0 1\n",
1655 | "9 1 0"
1656 | ]
1657 | },
1658 | "execution_count": 14,
1659 | "metadata": {},
1660 | "output_type": "execute_result"
1661 | }
1662 | ],
1663 | "source": [
1664 | "Y.iloc[:10]"
1665 | ]
1666 | },
1667 | {
1668 | "cell_type": "markdown",
1669 | "metadata": {},
1670 | "source": [
1671 | "### 3. Split the Dataset into Training and Testing Datasets\n",
1672 | "\n",
1673 | "Before we can begin training our neural network, we need to split the dataset into training and testing datasets. This will allow us to test our network after we are done training to determine how well it will generalize to new data. This step is incredibly easy when using the train_test_split() function provided by scikit-learn!"
1674 | ]
1675 | },
1676 | {
1677 | "cell_type": "code",
1678 | "execution_count": 15,
1679 | "metadata": {},
1680 | "outputs": [],
1681 | "source": [
1682 | "from sklearn import model_selection\n",
1683 | "# split the X and Y data into training and testing datasets\n",
1684 | "X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size = 0.2)"
1685 | ]
1686 | },
1687 | {
1688 | "cell_type": "code",
1689 | "execution_count": 16,
1690 | "metadata": {},
1691 | "outputs": [
1692 | {
1693 | "name": "stdout",
1694 | "output_type": "stream",
1695 | "text": [
1696 | "(233, 96)\n",
1697 | "(59, 96)\n",
1698 | "(233, 2)\n",
1699 | "(59, 2)\n"
1700 | ]
1701 | }
1702 | ],
1703 | "source": [
1704 | "print X_train.shape\n",
1705 | "print X_test.shape\n",
1706 | "print Y_train.shape\n",
1707 | "print Y_test.shape"
1708 | ]
1709 | },
1710 | {
1711 | "cell_type": "markdown",
1712 | "metadata": {},
1713 | "source": [
1714 | "### 4. Building the Network - Keras\n",
1715 | "\n",
1716 | "In this project, we are going to use Keras to build and train our network. This model will be relatively simple and will only use dense (also known as fully connected) layers. This is the most common neural network layer. The network will have one hidden layer, use an Adam optimizer, and a categorical crossentropy loss. We won't worry about optimizing parameters such as learning rate, number of neurons in each layer, or activation functions in this project; however, if you have the time, manually adjusting these parameters and observing the results is a great way to learn about their function!"
1717 | ]
1718 | },
1719 | {
1720 | "cell_type": "code",
1721 | "execution_count": 17,
1722 | "metadata": {},
1723 | "outputs": [
1724 | {
1725 | "name": "stdout",
1726 | "output_type": "stream",
1727 | "text": [
1728 | "_________________________________________________________________\n",
1729 | "Layer (type) Output Shape Param # \n",
1730 | "=================================================================\n",
1731 | "dense_1 (Dense) (None, 8) 776 \n",
1732 | "_________________________________________________________________\n",
1733 | "dense_2 (Dense) (None, 4) 36 \n",
1734 | "_________________________________________________________________\n",
1735 | "dense_3 (Dense) (None, 2) 10 \n",
1736 | "=================================================================\n",
1737 | "Total params: 822\n",
1738 | "Trainable params: 822\n",
1739 | "Non-trainable params: 0\n",
1740 | "_________________________________________________________________\n",
1741 | "None\n"
1742 | ]
1743 | }
1744 | ],
1745 | "source": [
1746 | "# build a neural network using Keras\n",
1747 | "from keras.models import Sequential\n",
1748 | "from keras.layers import Dense\n",
1749 | "from keras.optimizers import Adam\n",
1750 | "\n",
1751 | "# define a function to build the keras model\n",
1752 | "def create_model():\n",
1753 | " # create model\n",
1754 | " model = Sequential()\n",
1755 | " model.add(Dense(8, input_dim=96, kernel_initializer='normal', activation='relu'))\n",
1756 | " model.add(Dense(4, kernel_initializer='normal', activation='relu'))\n",
1757 | " model.add(Dense(2, activation='sigmoid'))\n",
1758 | " \n",
1759 | " # compile model\n",
1760 | " adam = Adam(lr=0.001)\n",
1761 | " model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])\n",
1762 | " return model\n",
1763 | "\n",
1764 | "model = create_model()\n",
1765 | "\n",
1766 | "print(model.summary())"
1767 | ]
1768 | },
1769 | {
1770 | "cell_type": "markdown",
1771 | "metadata": {},
1772 | "source": [
1773 | "### 5. Training the Network\n",
1774 | "\n",
1775 | "Now it's time for the fun! Training a Keras model is as simple as calling model.fit()."
1776 | ]
1777 | },
1778 | {
1779 | "cell_type": "code",
1780 | "execution_count": 18,
1781 | "metadata": {},
1782 | "outputs": [
1783 | {
1784 | "name": "stdout",
1785 | "output_type": "stream",
1786 | "text": [
1787 | "Epoch 1/50\n",
1788 | "233/233 [==============================] - 0s 288us/step - loss: 0.6927 - acc: 0.5794\n",
1789 | "Epoch 2/50\n",
1790 | "233/233 [==============================] - 0s 245us/step - loss: 0.6910 - acc: 0.7210\n",
1791 | "Epoch 3/50\n",
1792 | "233/233 [==============================] - 0s 258us/step - loss: 0.6868 - acc: 0.7639\n",
1793 | "Epoch 4/50\n",
1794 | "233/233 [==============================] - 0s 236us/step - loss: 0.6779 - acc: 0.7082\n",
1795 | "Epoch 5/50\n",
1796 | "233/233 [==============================] - 0s 236us/step - loss: 0.6619 - acc: 0.8541\n",
1797 | "Epoch 6/50\n",
1798 | "233/233 [==============================] - 0s 305us/step - loss: 0.6340 - acc: 0.8283\n",
1799 | "Epoch 7/50\n",
1800 | "233/233 [==============================] - 0s 227us/step - loss: 0.5963 - acc: 0.8541\n",
1801 | "Epoch 8/50\n",
1802 | "233/233 [==============================] - 0s 305us/step - loss: 0.5446 - acc: 0.9399\n",
1803 | "Epoch 9/50\n",
1804 | "233/233 [==============================] - 0s 240us/step - loss: 0.4884 - acc: 0.8884\n",
1805 | "Epoch 10/50\n",
1806 | "233/233 [==============================] - 0s 227us/step - loss: 0.4220 - acc: 0.9227\n",
1807 | "Epoch 11/50\n",
1808 | "233/233 [==============================] - 0s 322us/step - loss: 0.3603 - acc: 0.9313\n",
1809 | "Epoch 12/50\n",
1810 | "233/233 [==============================] - 0s 245us/step - loss: 0.2935 - acc: 0.9614\n",
1811 | "Epoch 13/50\n",
1812 | "233/233 [==============================] - 0s 296us/step - loss: 0.2528 - acc: 0.9657\n",
1813 | "Epoch 14/50\n",
1814 | "233/233 [==============================] - 0s 330us/step - loss: 0.2087 - acc: 0.9657\n",
1815 | "Epoch 15/50\n",
1816 | "233/233 [==============================] - 0s 305us/step - loss: 0.1788 - acc: 0.9871\n",
1817 | "Epoch 16/50\n",
1818 | "233/233 [==============================] - 0s 313us/step - loss: 0.1605 - acc: 0.9700\n",
1819 | "Epoch 17/50\n",
1820 | "233/233 [==============================] - 0s 309us/step - loss: 0.1389 - acc: 0.9828\n",
1821 | "Epoch 18/50\n",
1822 | "233/233 [==============================] - 0s 335us/step - loss: 0.1258 - acc: 0.9785\n",
1823 | "Epoch 19/50\n",
1824 | "233/233 [==============================] - 0s 343us/step - loss: 0.1108 - acc: 0.9871\n",
1825 | "Epoch 20/50\n",
1826 | "233/233 [==============================] - 0s 399us/step - loss: 0.1004 - acc: 0.9871\n",
1827 | "Epoch 21/50\n",
1828 | "233/233 [==============================] - 0s 416us/step - loss: 0.0910 - acc: 0.9871\n",
1829 | "Epoch 22/50\n",
1830 | "233/233 [==============================] - 0s 343us/step - loss: 0.0820 - acc: 0.9871\n",
1831 | "Epoch 23/50\n",
1832 | "233/233 [==============================] - 0s 361us/step - loss: 0.0752 - acc: 0.9914\n",
1833 | "Epoch 24/50\n",
1834 | "233/233 [==============================] - 0s 356us/step - loss: 0.0714 - acc: 0.9957\n",
1835 | "Epoch 25/50\n",
1836 | "233/233 [==============================] - 0s 309us/step - loss: 0.0634 - acc: 0.9957\n",
1837 | "Epoch 26/50\n",
1838 | "233/233 [==============================] - 0s 339us/step - loss: 0.0585 - acc: 0.9957\n",
1839 | "Epoch 27/50\n",
1840 | "233/233 [==============================] - 0s 335us/step - loss: 0.0571 - acc: 1.0000\n",
1841 | "Epoch 28/50\n",
1842 | "233/233 [==============================] - 0s 429us/step - loss: 0.0526 - acc: 0.9957\n",
1843 | "Epoch 29/50\n",
1844 | "233/233 [==============================] - 0s 335us/step - loss: 0.0474 - acc: 1.0000\n",
1845 | "Epoch 30/50\n",
1846 | "233/233 [==============================] - 0s 322us/step - loss: 0.0463 - acc: 0.9957\n",
1847 | "Epoch 31/50\n",
1848 | "233/233 [==============================] - 0s 296us/step - loss: 0.0431 - acc: 1.0000\n",
1849 | "Epoch 32/50\n",
1850 | "233/233 [==============================] - 0s 348us/step - loss: 0.0381 - acc: 1.0000\n",
1851 | "Epoch 33/50\n",
1852 | "233/233 [==============================] - 0s 322us/step - loss: 0.0357 - acc: 1.0000\n",
1853 | "Epoch 34/50\n",
1854 | "233/233 [==============================] - 0s 292us/step - loss: 0.0331 - acc: 1.0000\n",
1855 | "Epoch 35/50\n",
1856 | "233/233 [==============================] - 0s 305us/step - loss: 0.0316 - acc: 1.0000\n",
1857 | "Epoch 36/50\n",
1858 | "233/233 [==============================] - 0s 335us/step - loss: 0.0294 - acc: 1.0000\n",
1859 | "Epoch 37/50\n",
1860 | "233/233 [==============================] - 0s 322us/step - loss: 0.0282 - acc: 1.0000\n",
1861 | "Epoch 38/50\n",
1862 | "233/233 [==============================] - 0s 236us/step - loss: 0.0281 - acc: 1.0000\n",
1863 | "Epoch 39/50\n",
1864 | "233/233 [==============================] - 0s 339us/step - loss: 0.0253 - acc: 1.0000\n",
1865 | "Epoch 40/50\n",
1866 | "233/233 [==============================] - 0s 223us/step - loss: 0.0252 - acc: 1.0000\n",
1867 | "Epoch 41/50\n",
1868 | "233/233 [==============================] - 0s 326us/step - loss: 0.0226 - acc: 1.0000\n",
1869 | "Epoch 42/50\n",
1870 | "233/233 [==============================] - 0s 326us/step - loss: 0.0213 - acc: 1.0000\n",
1871 | "Epoch 43/50\n",
1872 | "233/233 [==============================] - 0s 219us/step - loss: 0.0203 - acc: 1.0000\n",
1873 | "Epoch 44/50\n",
1874 | "233/233 [==============================] - 0s 215us/step - loss: 0.0193 - acc: 1.0000\n",
1875 | "Epoch 45/50\n",
1876 | "233/233 [==============================] - 0s 318us/step - loss: 0.0190 - acc: 1.0000\n",
1877 | "Epoch 46/50\n",
1878 | "233/233 [==============================] - 0s 232us/step - loss: 0.0176 - acc: 1.0000\n",
1879 | "Epoch 47/50\n",
1880 | "233/233 [==============================] - 0s 215us/step - loss: 0.0163 - acc: 1.0000\n",
1881 | "Epoch 48/50\n",
1882 | "233/233 [==============================] - 0s 202us/step - loss: 0.0161 - acc: 1.0000\n",
1883 | "Epoch 49/50\n",
1884 | "233/233 [==============================] - 0s 240us/step - loss: 0.0154 - acc: 1.0000\n",
1885 | "Epoch 50/50\n",
1886 | "233/233 [==============================] - 0s 223us/step - loss: 0.0150 - acc: 1.0000\n"
1887 | ]
1888 | },
1889 | {
1890 | "data": {
1891 | "text/plain": [
1892 | ""
1893 | ]
1894 | },
1895 | "execution_count": 18,
1896 | "metadata": {},
1897 | "output_type": "execute_result"
1898 | }
1899 | ],
1900 | "source": [
1901 | "# fit the model to the training data\n",
1902 | "model.fit(X_train, Y_train, epochs=50, batch_size=10, verbose = 1)"
1903 | ]
1904 | },
1905 | {
1906 | "cell_type": "markdown",
1907 | "metadata": {},
1908 | "source": [
1909 | "### 6. Testing and Performance Metrics\n",
1910 | "\n",
1911 | "Now that our model has been trained, we need to test its performance on the testing dataset. The model has never seen this information before; as a result, the testing dataset allows us to determine whether or not the model will be able to generalize to information that wasn't used during its training phase. We will use some of the metrics provided by scikit-learn for this purpose! "
1912 | ]
1913 | },
1914 | {
1915 | "cell_type": "code",
1916 | "execution_count": 19,
1917 | "metadata": {},
1918 | "outputs": [
1919 | {
1920 | "data": {
1921 | "text/plain": [
1922 | "array([1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1,\n",
1923 | " 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0,\n",
1924 | " 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0], dtype=int64)"
1925 | ]
1926 | },
1927 | "execution_count": 19,
1928 | "metadata": {},
1929 | "output_type": "execute_result"
1930 | }
1931 | ],
1932 | "source": [
1933 | "# generate classification report using predictions for categorical model\n",
1934 | "from sklearn.metrics import classification_report, accuracy_score\n",
1935 | "\n",
1936 | "predictions = model.predict_classes(X_test)\n",
1937 | "predictions"
1938 | ]
1939 | },
1940 | {
1941 | "cell_type": "code",
1942 | "execution_count": 20,
1943 | "metadata": {},
1944 | "outputs": [
1945 | {
1946 | "name": "stdout",
1947 | "output_type": "stream",
1948 | "text": [
1949 | "Results for Categorical Model\n",
1950 | "0.9661016949152542\n",
1951 | " precision recall f1-score support\n",
1952 | "\n",
1953 | " 0 0.97 0.97 0.97 36\n",
1954 | " 1 0.96 0.96 0.96 23\n",
1955 | "\n",
1956 | "avg / total 0.97 0.97 0.97 59\n",
1957 | "\n"
1958 | ]
1959 | }
1960 | ],
1961 | "source": [
1962 | "print('Results for Categorical Model')\n",
1963 | "print(accuracy_score(Y_test[['YES']], predictions))\n",
1964 | "print(classification_report(Y_test[['YES']], predictions))"
1965 | ]
1966 | },
1967 | {
1968 | "cell_type": "code",
1969 | "execution_count": null,
1970 | "metadata": {},
1971 | "outputs": [],
1972 | "source": []
1973 | }
1974 | ],
1975 | "metadata": {
1976 | "kernelspec": {
1977 | "display_name": "Python [default]",
1978 | "language": "python",
1979 | "name": "python2"
1980 | },
1981 | "language_info": {
1982 | "codemirror_mode": {
1983 | "name": "ipython",
1984 | "version": 2
1985 | },
1986 | "file_extension": ".py",
1987 | "mimetype": "text/x-python",
1988 | "name": "python",
1989 | "nbconvert_exporter": "python",
1990 | "pygments_lexer": "ipython2",
1991 | "version": "2.7.13"
1992 | }
1993 | },
1994 | "nbformat": 4,
1995 | "nbformat_minor": 2
1996 | }
1997 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 Dinesh Jinjala
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Machine-Learning-for-Healthcare-Analytics
2 | This is the code repository for [Machine Learning for Healthcare Analytics Projects](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-healthcare-analytics-projects?utm_source=github&utm_medium=repository&utm_campaign=9781789536591), published by Packt.
3 |
4 | Machine Learning (ML) has changed the way organizations and individuals use data to improve the efficiency of a system. ML algorithms allow strategists to deal with a variety of structured, unstructured, and semi-structured data. Machine Learning for Healthcare Analytics Projects is packed with new approaches and methodologies for creating powerful solutions for healthcare analytics.
5 |
6 | This book will teach you how to implement key machine learning algorithms and walk you through their use cases by employing a range of libraries from the Python ecosystem. You will build five end-to-end projects to evaluate the efficiency of Artificial Intelligence (AI) applications for carrying out simple-to-complex healthcare analytics tasks. With each project, you will gain new insights, which will then help you handle healthcare data efficiently. As you make your way through the book, you will use ML to detect cancer in a set of patients using support vector machine (SVM) and k-nearest neighbors (KNN) models. In the final chapters, you will create a deep neural network in Keras to predict the onset of diabetes in a huge dataset of patients. You will also learn how to predict heart disease using neural networks.
7 |
8 | By the end of this book, you will have learned how to address long-standing challenges, provide specialized solutions for how to deal with them, and carry out a range of cognitive tasks in the healthcare domain.
9 |
10 | # Table of Contents
11 | 1. Breast Cancer Detection
12 | 2. Diabetes Onset Detection
13 | 3. DNA classification
14 | 4. Diagnosing Coronary Artery Disease Using Machine Learning
15 | 5. Screening Children for Autistic Spectrum Disorder Using Machine Learning
16 |
17 | ## Instructions and Navigations
18 | All of the code is organized into folders. For example, Chapter02.
19 |
20 | The code will look like the following:
21 | 
22 |     import sys
23 |     import pandas as pd
24 |     import sklearn
25 |     import keras
26 | 
27 |     print('Python: {}'.format(sys.version))
28 |     print('Pandas: {}'.format(pd.__version__))
29 |     print('Sklearn: {}'.format(sklearn.__version__))
30 |     print('Keras: {}'.format(keras.__version__))
31 | 
38 |
39 | *Following is what you need for this book:*
40 | Machine Learning for Healthcare Analytics Projects is for data scientists, machine learning engineers, and healthcare professionals who want to implement machine learning algorithms to build smart AI applications. Basic knowledge of Python or any programming language is expected to get the most from this book.
41 |
42 | With the following software and hardware list you can run all code files present in the book (Chapters 1-5).
43 | ### Software and Hardware List
44 | | Chapter | Software required | OS required |
45 | | -------- | ------------------------------------ | ----------------------------------- |
46 | | All | Python 3.6 or later | Windows, Mac OS X, and Linux (Any) |
47 | | | Anaconda 5.2 | Windows, Mac OS X, and Linux (Any) |
48 | | | Jupyter Notebook | Windows, Mac OS X, and Linux (Any) |
49 |
50 |
51 | We also provide a PDF file that has color images of the screenshots/diagrams used in this book. [Click here to download it](https://www.packtpub.com/sites/default/files/downloads/9781789536591_ColorImages.pdf).
52 |
--------------------------------------------------------------------------------