├── kaggle48.JPG
├── Day_016_HW.JPG
├── 計畫說明.txt
├── README.md
├── Day_054_HW.ipynb
├── Day_035_HW.ipynb
├── Day_039_HW.ipynb
├── Day_037_HW.ipynb
├── FeatureEngineering.md
├── Day_032_HW.ipynb
├── EDA02.txt
├── EDA01.txt
├── Day_040_HW.ipynb
├── Day_066_HW.ipynb
├── Day_63_HW.ipynb
├── Day_067_HW.ipynb
├── Day_68-Keras_Sequential_model_HW.ipynb
├── EDA.md
├── Day_69-keras_model_api_HW.ipynb
├── Day_043_HW.ipynb
├── Day_057_HW.ipynb
├── Day_034_HW.ipynb
├── EDA編寫區.txt
├── Day_065_HW.ipynb
├── Day_044_HW.ipynb
├── Day_055_HW.ipynb
├── Day_016_HW.ipynb
├── Day_72-Activation_function_HW.ipynb
├── Day_018_HW.ipynb
├── .ipynb_checkpoints
    ├── Day_018_HW-checkpoint.ipynb
    └── Day_017_HW-checkpoint.ipynb
├── Day_042_HW.ipynb
├── Day_022_HW.ipynb
└── Day_017_HW.ipynb


/kaggle48.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PatrickRuan/ML100-Days/HEAD/kaggle48.JPG


--------------------------------------------------------------------------------
/Day_016_HW.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PatrickRuan/ML100-Days/HEAD/Day_016_HW.JPG


--------------------------------------------------------------------------------
/計畫說明.txt:
--------------------------------------------------------------------------------
1 | 1.) local directory is at : d:\git\ML100-Days
2 | 2.) Documents are stored at D:\User\GitHub\MLCM100 (in fact, these are the documents downloaded from Cupoy 100)
3 | 3.) Programs are stored at D:\MyPython\Cupoy100D; any program that needs to be committed is copied to c:\


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # ML100-Days
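 2 | 100-Day Machine Learning Marathon (機器學習百日馬拉松)
 3 | 
 4 | Date: 12/15 (2018) ~ 4/15 (2019)
 5 | 
 6 | 1. Data cleaning and preprocessing (16 days) 
 7 | 2. Feature engineering techniques for data science (14 days)
 8 | 3. Building basic machine learning models (14 days)
 9 | 4. Machine learning hyperparameter tuning (1 day)
10 | 5. Unsupervised machine learning (9 days)
11 | 6. Deep learning theory and practice (3 days)
12 | 7. Getting started with deep learning using Keras (27 days)
13 | 8. Deep learning applications: convolutional neural networks (11 days)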
14 | 


--------------------------------------------------------------------------------
/Day_054_HW.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
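 7 |     "# Homework\n",
 8 |     "* Think about whether unsupervised learning can use an evaluation function (metric) to tell good results from bad.  \n",
 9 |     "(Hint: consider two cases, \"with target values\" and \"without target values\")"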
10 |    ]
11 |   },
12 |   {
13 |    "cell_type": "markdown",
14 |    "metadata": {
15 |     "collapsed": true
16 |    },
17 |    "source": [
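18 |     "# My answer:\n",
19 |     "The output of unsupervised learning mostly helps us organize and understand the data,\n",
20 |     "so judging its quality is fairly subjective and hard to do with an evaluation function.\n",
21 |     "\n",
22 |     "For clustering, if we have expected target values there does seem to be a standard to measure against, but whether the resulting clusters match our expectations remains a big question; even so, the results give us many useful clues."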
23 |    ]
24 |   }
25 |  ],
26 |  "metadata": {
27 |   "kernelspec": {
28 |    "display_name": "Python 3",
29 |    "language": "python",
30 |    "name": "python3"
31 |   },
32 |   "language_info": {
33 |    "codemirror_mode": {
34 |     "name": "ipython",
35 |     "version": 3
36 |    },
37 |    "file_extension": ".py",
38 |    "mimetype": "text/x-python",
39 |    "name": "python",
40 |    "nbconvert_exporter": "python",
41 |    "pygments_lexer": "ipython3",
42 |    "version": "3.6.7"
43 |   }
44 |  },
45 |  "nbformat": 4,
46 |  "nbformat_minor": 1
47 | }
48 | 


--------------------------------------------------------------------------------
/Day_035_HW.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
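 7 |     "## Practice time\n",
 8 |     "By now everyone should have a basic understanding of regression and classification problems. For today's homework, search for examples of multi-label problems. The figure below shows the genres of the movie \"Doctor Strange\": it carries the \"Action\", \"Adventure\", and \"Fantasy\" labels at the same time, which makes this a multi-label problem. "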
 9 |    ]
10 |   },
11 |   {
12 |    "cell_type": "markdown",
13 |    "metadata": {},
14 |    "source": [
15 |     "![image](https://cdn-images-1.medium.com/max/1000/1*r0gYXMSQf5VhdMyl2bRDyg.png)"
16 |    ]
17 |   },
18 |   {
19 |    "cell_type": "markdown",
20 |    "metadata": {},
21 |    "source": [
22 |     "## Search for a machine learning case whose target is a multi-label problem, and find out its data source, goal, and evaluation metric\n",
23 |     "## Hint: clothing"
24 |    ]
25 |   },
26 |   {
27 |    "cell_type": "markdown",
28 |    "metadata": {},
29 |    "source": [
30 |     "![image](https://work.alibaba-inc.com/aliwork_tfs/g01_alibaba-inc_com/tfscom/TB1QNKYXQZmBKNjSZPiXXXFNVXa.tfsprivate.png)"
31 |    ]
32 |   },
33 |   {
34 |    "cell_type": "code",
35 |    "execution_count": 1,
36 |    "metadata": {},
37 |    "outputs": [],
38 |    "source": [
39 |     "# https://tianchi.aliyun.com/competition/entrance/231671/information"
40 |    ]
41 |   }
42 |  ],
43 |  "metadata": {
44 |   "kernelspec": {
45 |    "display_name": "Python 3",
46 |    "language": "python",
47 |    "name": "python3"
48 |   },
49 |   "language_info": {
50 |    "codemirror_mode": {
51 |     "name": "ipython",
52 |     "version": 3
53 |    },
54 |    "file_extension": ".py",
55 |    "mimetype": "text/x-python",
56 |    "name": "python",
57 |    "nbconvert_exporter": "python",
58 |    "pygments_lexer": "ipython3",
59 |    "version": "3.6.7"
60 |   }
61 |  },
62 |  "nbformat": 4,
63 |  "nbformat_minor": 2
64 | }
65 | 


--------------------------------------------------------------------------------
/Day_039_HW.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "## Homework"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "Read the references below and answer the following questions\n",
15 |     "\n",
16 |     "[Ridge Regression](https://blog.csdn.net/daunxx/article/details/51578787)\n",
17 |     "[The essential differences between Linear, Ridge, and Lasso Regression](https://www.zhihu.com/question/38121173)\n",
18 |     "\n",
19 |     "1. LASSO regression can be used as a feature selection tool; explain why a LASSO model can be used for feature selection\n",
20 |     "2. When the independent variables (X) are highly collinear, can Ridge Regression handle the problem?\n"
21 |    ]
22 |   },
23 |   {
24 |    "cell_type": "markdown",
25 |    "metadata": {},
26 |    "source": [
27 |     "> LASSO uses the L1 penalty, \n",
28 |     "- $L1 = \\sum |weight| $\n",
29 |     "- $Loss = \\sum (y-w*x)^2 + \\alpha*L1$, and the fitted weights are the $w$ that minimize this loss\n",
30 |     "- As alpha increases, more coefficients shrink to exactly zero; we can keep the features with non-zero coefficients and drop those with zero coefficients. \n",
31 |     "\n",
32 |     "> Ridge uses the L2 penalty,\n",
33 |     "- $L2 = \\sum (weight)^2 $\n",
34 |     "- $Loss = \\sum (y-w*x)^2 + \\lambda*L2$\n",
35 |     "- In some domains there are many independent variables and we are not sure which of them influence the dependent variable. In that kind of scenario, ridge regression performs better than plain linear regression."
36 |    ]
37 |   }
38 |  ],
39 |  "metadata": {
40 |   "kernelspec": {
41 |    "display_name": "Python 3",
42 |    "language": "python",
43 |    "name": "python3"
44 |   },
45 |   "language_info": {
46 |    "codemirror_mode": {
47 |     "name": "ipython",
48 |     "version": 3
49 |    },
50 |    "file_extension": ".py",
51 |    "mimetype": "text/x-python",
52 |    "name": "python",
53 |    "nbconvert_exporter": "python",
54 |    "pygments_lexer": "ipython3",
55 |    "version": "3.6.7"
56 |   }
57 |  },
58 |  "nbformat": 4,
59 |  "nbformat_minor": 2
60 | }
61 | 


--------------------------------------------------------------------------------
/Day_037_HW.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "## Homework"
 8 |    ]
 9 |   },
10 |   {
11 |    "cell_type": "markdown",
12 |    "metadata": {},
13 |    "source": [
14 |     "Read the references below and answer the following questions\n",
15 |     "\n",
16 |     "[A detailed introduction to Linear Regression](https://medium.com/@yehjames/%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90-%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%AC%AC3-3%E8%AC%9B-%E7%B7%9A%E6%80%A7%E5%88%86%E9%A1%9E-%E9%82%8F%E8%BC%AF%E6%96%AF%E5%9B%9E%E6%AD%B8-logistic-regression-%E4%BB%8B%E7%B4%B9-a1a5f47017e5)\n",
17 |     "\n",
18 |     "[A detailed introduction to Logistic Regression](https://medium.com/@yehjames/%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90-%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%AC%AC3-3%E8%AC%9B-%E7%B7%9A%E6%80%A7%E5%88%86%E9%A1%9E-%E9%82%8F%E8%BC%AF%E6%96%AF%E5%9B%9E%E6%AD%B8-logistic-regression-%E4%BB%8B%E7%B4%B9-a1a5f47017e5)\n",
19 |     "\n",
20 |     "[Logistic Regression: what you might not know](https://taweihuang.hpd.io/2017/12/22/logreg101/)\n"
21 |    ]
22 |   },
23 |   {
24 |    "cell_type": "markdown",
25 |    "metadata": {},
26 |    "source": [
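27 |     "1. Can a linear regression model accurately predict a dataset with nonlinear relationships?\n",
28 |     "2. Do regression models make basic assumptions about the data distribution?\n"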
29 |    ]
30 |   },
31 |   {
32 |    "cell_type": "markdown",
33 |    "metadata": {},
34 |    "source": [
35 |     ">\n",
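36 |     "    1.) The data must be basically linear, satisfying additivity and homogeneity ( https://zh.wikipedia.org/wiki/%E7%B7%9A%E6%80%A7%E9%97%9C%E4%BF%82 ), and the features must be linearly independent.\n",
37 |     "    2.) In regression model (2.1), the error terms εi are assumed to be independent normal random variables with mean 0 and constant variance (if the sample is large enough, the dependence among the residuals ei becomes relatively unimportant and can be ignored.)\n",
38 |     "    3.) Constant error variance \n",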
39 |     "        * 2.) and 3.) refer to :http://web.ncyu.edu.tw/~lanjc/lesson/C7/class/ch03-AN.pdf\n",
40 |     "    \n",
41 |     "\n"
42 |    ]
43 |   }
44 |  ],
45 |  "metadata": {
46 |   "kernelspec": {
47 |    "display_name": "Python 3",
48 |    "language": "python",
49 |    "name": "python3"
50 |   },
51 |   "language_info": {
52 |    "codemirror_mode": {
53 |     "name": "ipython",
54 |     "version": 3
55 |    },
56 |    "file_extension": ".py",
57 |    "mimetype": "text/x-python",
58 |    "name": "python",
59 |    "nbconvert_exporter": "python",
60 |    "pygments_lexer": "ipython3",
61 |    "version": "3.6.7"
62 |   }
63 |  },
64 |  "nbformat": 4,
65 |  "nbformat_minor": 2
66 | }
67 | 


--------------------------------------------------------------------------------
/FeatureEngineering.md:
--------------------------------------------------------------------------------
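 1 | # Part 2 Feature Engineering in ML100-Days
 2 | 100-Day Machine Learning Marathon:
 3 | Part 2: Feature Engineering, 14 days in total (Day 17 ~ Day 30)
 4 | 
 5 | Day 17: What is a feature, and what is feature engineering?
 6 | 	
 7 | 	We use data to infer a target outcome. For example, given a house's floor area, floor level, location, and district, these kinds of data influence the house price we care about; floor area, floor level, location, and district are the features, and the house price is the target.
 8 | 	Selecting, transforming, and combining features to improve prediction accuracy is feature engineering. The next 14 days are devoted to this work.
 9 | 	
10 | 	Today's homework has three key points: 
11 | 	1.) Go to https://www.kaggle.com/c/titanic and download the house-price prediction data to use as practice material for the next 14 days. Note that the data must be placed under part02\ in the parent directory of the program, so that pd.read_csv(r'..\part02\balabla.csv') can read it.
12 | 	2.) Identify which column is the target. Usually you work this out by reading the problem statement carefully; a lazy shortcut is to check which column the training data has that the testing data lacks. (why?)
13 | 	3.) The code below is a very typical approach. Think about why the train data and test data need to be concatenated!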
14 | 		train_Y = df_train['Survived']
15 | 		ids = df_test['PassengerId']
16 | 		df_train = df_train.drop(['PassengerId', 'Survived'] , axis=1)
17 | 		df_test = df_test.drop(['PassengerId'] , axis=1)
18 | 		df = pd.concat([df_train,df_test])
19 | 		df.head()
20 | 	Follow-up exercises: 
21 | 		1. to read house_train.csv.gz and house_test.csv.gz from ../part02/ (done)
22 | 		2. to concat train and test data, you know how to handle them before you concat these 2 datasets (done)
23 | 		3. to check features, to check the amount of features in dataset (done)
24 | 		4. to check data type first, to filter out object features, then to label encode them in a very simple way.
25 | 		5. to minmax scale all features.
26 | 		6. to submit to Kaggle
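
The follow-up exercises above (concatenate, label-encode, min-max scale) can be sketched roughly as follows. This is a minimal illustration and not the course solution: the tiny frames and the column names `Id`, `Area`, `District`, and `Price` are made up here to stand in for the real train and test files.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Tiny made-up stand-ins for the train and test files (hypothetical values).
df_train = pd.DataFrame({
    "Id": [1, 2, 3],
    "Area": [30.0, 45.0, 60.0],
    "District": ["north", "south", "north"],
    "Price": [100, 150, 210],  # target column, present only in train
})
df_test = pd.DataFrame({
    "Id": [4, 5],
    "Area": [50.0, 20.0],
    "District": ["east", "south"],
})

train_Y = df_train["Price"]
ids = df_test["Id"]
train_num = len(df_train)

# Concatenate so that encoding and scaling see every category
# and the full value range of both files.
df = pd.concat(
    [df_train.drop(["Id", "Price"], axis=1), df_test.drop(["Id"], axis=1)],
    ignore_index=True,
)

# Label-encode the non-numeric columns in the simplest possible way.
for col in df.select_dtypes(exclude="number").columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# Min-max scale every column into [0, 1].
scaled = MinMaxScaler().fit_transform(df)

# Split back into the original train and test rows.
train_X = scaled[:train_num]
test_X = scaled[train_num:]
print(train_X.shape, test_X.shape)
```

Note that the category `east` appears only in the test frame; encoding after the concat is one way to make sure it still receives a code.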
27 | 
28 | 
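29 | Day 18: Types of features
30 | 
31 | 	
32 | 	Features fall into two basic types, numeric and categorical. We already studied EDA in Day 1 ~ Day 16, so these two types should be familiar. Here you are also asked to get to know some smaller types, such as binary, ordinal, and datetime features; we will practice each of them over time.
33 | 	
34 | 	Today's homework is 
35 | 	1.) Repeat task 1.) of the Day 17 homework, but today move the classic Kaggle problem titanic under ..\part02\
36 | 	2.) As in task 3.) of the Day 17 homework, prepare the train data, test data, and the label (Y). 3.) Use df.dtypes to inspect the data types and how many columns each type has (my favorite is df.dtypes.value_counts(); also think about why .reset_index() is added at the end). 4.) Run basic operations such as mean() and max() on the numeric data. 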
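
As a rough illustration of tasks 3.) and 4.), the sketch below counts columns per dtype and runs basic statistics on the numeric columns. The miniature frame and its values are invented for the example; only `df.dtypes.value_counts()`, `.reset_index()`, `mean()`, and `max()` come from the notes above.

```python
import pandas as pd

# Made-up miniature frame with Titanic-like column names.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

# Count how many columns share each dtype. reset_index() turns the
# resulting Series into a two-column DataFrame that is easier to
# rename, merge, or display.
dtype_counts = df.dtypes.value_counts().reset_index()
dtype_counts.columns = ["dtype", "count"]
print(dtype_counts)

# Basic statistics on the numeric columns only.
numeric = df.select_dtypes(include="number")
print(numeric.mean())
print(numeric.max())
```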
37 | 
38 | 
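39 | Day 19: Numeric features: filling missing values and standardization  
40 | 	In EDA Day 6 we brute-force plotted a boxplot for every column and could see some outliers: check whether the farthest circles sit far from the normal box and whiskers, isolated on their own. Finding missing values is not hard either; just test with isnull(). 
41 | 	Here we quickly run through the same operations. (See the Cupoy ML100 day 19 material)
42 | 		Fill with a statistic
43 | 		• Fill with the mean : numeric columns with little skew 
44 | 		• Fill with the median : numeric columns with strong skew 
45 | 		• Fill with the mode : categorical columns 
46 | 		Fill with a specified value - other fills are possible when you already have domain knowledge of the column
47 | 		• Fill with 0 : when a gap inherently means 0, such as the number of rooms in some real-estate problems 
48 | 		• Fill with a value that cannot occur : categorical columns where the mode is not suitable 
49 | 		Fill with a predicted value - slower but more accurate; the fill is learned from the other columns 
50 | 		• Use this approach when the gaps are widespread and the column is an important feature 
51 | 		• Watch out for overfitting with this approach : the column may degenerate into a combination of other features.
52 | 	Then we practice the MinMaxScaler and StandardScaler introduced in EDA Day 7  
53 | 	
54 | 	Homework 
55 | 	1.) Compare LogisticRegression results across different fill values, using cross_val_score to see which works better.
56 | 	2.) Compare LogisticRegression results across different standardization methods, using cross_val_score to see which works better.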
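
The two homework comparisons could be set up along these lines. This is only a sketch on synthetic data: the columns `a` and `b`, the label rule, and the 10% missing rate are assumptions made for the example, and the scaler is fitted on all rows before cross-validation purely to keep the code short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
n = 200
# Synthetic data (hypothetical): two numeric features and a binary label.
df = pd.DataFrame({"a": rng.normal(size=n), "b": rng.exponential(size=n)})
y = (df["a"] + df["b"] + rng.normal(scale=0.5, size=n) > 1).astype(int)
# Blank out roughly 10% of column "b" to simulate missing values.
df.loc[rng.random(n) < 0.1, "b"] = np.nan


def cv_score(X):
    """Mean 5-fold cross-validated accuracy of a LogisticRegression."""
    return cross_val_score(LogisticRegression(), X, y, cv=5).mean()


# 1.) Compare different fill values.
fills = {
    "zero": df.fillna(0),
    "mean": df.fillna(df.mean()),
    "median": df.fillna(df.median()),
}
fill_scores = {name: cv_score(X) for name, X in fills.items()}

# 2.) Compare different scalers on one fixed fill.
filled = df.fillna(df.median())
scalers = {"minmax": MinMaxScaler(), "standard": StandardScaler()}
scale_scores = {
    name: cv_score(scaler.fit_transform(filled)) for name, scaler in scalers.items()
}

print(fill_scores)
print(scale_scores)
```

Which fill or scaler comes out ahead depends on the data, so the printed scores are something to inspect rather than a fixed result.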
57 | 	
58 | 		
59 | 


--------------------------------------------------------------------------------
/Day_032_HW.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
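 7 |     "## Practice time\n",
 8 |     "By now everyone should have a basic grasp of the workflow and steps of a machine learning project. For today's homework, take a look at the machine learning projects that the world's ML giants are working on. Taking Google as an example, the figure below shows the number of internal Google projects using machine learning; over time it has long since passed 2000 projects using ML."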
 9 |    ]
10 |   },
11 |   {
12 |    "cell_type": "markdown",
13 |    "metadata": {},
14 |    "source": [
15 |     "![image](https://cdn-images-1.medium.com/max/800/1*U_L8qI8RmYS-MOBrYvXhSA.png)"
16 |    ]
17 |   },
18 |   {
19 |    "cell_type": "markdown",
20 |    "metadata": {},
21 |    "source": [
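22 |     "Below are the blogs or machine learning sites of several well-known companies (you may also search on your own). Pick one article to read and try to answer\n",
23 |     "1. What is the project's goal? (What problem does it solve?)\n",
24 |     "2. What technique does it use? (Just the name is enough, e.g. image classification with a CNN, a convolutional neural network)\n",
25 |     "3. What is the data source? "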
26 |    ]
27 |   },
28 |   {
29 |    "cell_type": "markdown",
30 |    "metadata": {},
31 |    "source": [
32 |     "- [Google AI blog](https://ai.googleblog.com/)\n",
33 |     "- [Facebook Research blog](https://research.fb.com/blog/)\n",
34 |     "- [Apple machine learning journal](https://machinelearning.apple.com/)\n",
35 |     "- [機器之心](https://www.jiqizhixin.com/)\n",
36 |     "- [雷鋒網](http://www.leiphone.com/category/ai)"
37 |    ]
38 |   },
39 |   {
40 |    "cell_type": "markdown",
41 |    "metadata": {},
42 |    "source": [
43 |     "Soft Actor-Critic: Deep Reinforcement Learning for Robotics\n",
44 |     "https://ai.googleblog.com/2019/01/soft-actor-critic-deep-reinforcement.html\n",
45 |     " \n",
46 |     " \n",
47 |     "Importantly, SAC is efficient enough to solve real-world robot tasks in only a handful of hours, and works on a variety of environments with a single set of hyperparameters. Below, we discuss some of the research behind SAC, and also describe some of our recent experiments.\n",
48 |     " \n",
49 |     " SAC, Soft Actor-Critic, one method of Deep Reinforcement Learning.\n",
50 |     " \n",
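51 |     " Earlier Deep RL methods take too long to train, and a robot being trained has to explore slowly yet still inevitably meets environmental challenges early on that damage the hardware. SAC maximizes both the reward and the entropy. The latter increases the randomness of the policy during training; seeking a high reward under a random policy amounts to tolerating more uncertainty.\n",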
52 |     " \n",
53 |     " https://ai.googleblog.com/2019/01/soft-actor-critic-deep-reinforcement.html\n",
54 |     " "
55 |    ]
56 |   },
57 |   {
58 |    "cell_type": "code",
59 |    "execution_count": null,
60 |    "metadata": {},
61 |    "outputs": [],
62 |    "source": []
63 |   },
64 |   {
65 |    "cell_type": "code",
66 |    "execution_count": null,
67 |    "metadata": {},
68 |    "outputs": [],
69 |    "source": []
70 |   }
71 |  ],
72 |  "metadata": {
73 |   "kernelspec": {
74 |    "display_name": "Python 3",
75 |    "language": "python",
76 |    "name": "python3"
77 |   },
78 |   "language_info": {
79 |    "codemirror_mode": {
80 |     "name": "ipython",
81 |     "version": 3
82 |    },
83 |    "file_extension": ".py",
84 |    "mimetype": "text/x-python",
85 |    "name": "python",
86 |    "nbconvert_exporter": "python",
87 |    "pygments_lexer": "ipython3",
88 |    "version": "3.6.7"
89 |   }
90 |  },
91 |  "nbformat": 4,
92 |  "nbformat_minor": 2
93 | }
94 | 


--------------------------------------------------------------------------------
/EDA02.txt:
--------------------------------------------------------------------------------
 1 | 
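 2 | 【ML102 Future Experts: D4 ~ D5】
 3 | 
 4 | .
 5 | 
 6 | 【What the previous days covered】
 7 | 
 8 | To some extent, the first three days were about getting familiar with the environment.
 9 | The Day 1 lesson has little technical content but is very important: studying and understanding the problem has a positive effect on the direction of the data science work.
10 | From Day 2 we looked at how many rows the data has, how many features it contains, and so on. That is working from the outside in, a bit like the unboxing step of an unboxing post; from Day 4 we really start going inside the data, peeling the onion step by step.
11 | 
12 | This note gives a basic introduction to the work of Cupoy Day 4 ~ Day 5. Our target is to finish by 1/11; classmates who have fallen behind, please use the weekend to catch up. Keep going.
13 | 
14 | Homework is shared at
15 | https://github.com/freyatzeng/ML100-Days , and
16 | https://github.com/PatrickRuan/ML100-Days
17 | 
18 | 
19 | My personal takeaway: the statistics and probability needed for Day 1 ~ Day 16 come down to this: they let us describe the data in hand and make estimates about the larger population, and the characteristics we observe there relate to our machine learning.
20 | For example, after watching our family bakery long enough, we know that business is good when the weather is cool and it does not rain, and poor on Fridays and Saturdays during winter and summer vacation.
21 | The basic statistics we need are the mean, median, standard deviation, and mode, plus the histogram 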
22 | 
23 | 
24 | 
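25 | 【Progress for these two days】
26 | 
27 | Day 4: EDA: introduction to data types
28 | 
29 | 	On Day 2 we started EDA (exploring the data); at that time we looked at the overall shape of the data: how many features and how many rows. From today we start examining the feature data step by step.
30 | 	
31 | 	The first small step into studying features is to learn each feature's format, that is, its data type; the basic types are float64, int64, and object.
32 | 	
33 | 	Note that object is a data type the machine cannot process directly, so today introduces encoding for this kind of data; only after encoding is it fed into the model.
34 | 	
35 | 	HW: Read the reference https://medium.com/@contactsunny/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621
36 | 	, and complete the Day 4 program.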
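
To make the two encodings named in the reference concrete, here is a small sketch. The `color` and `size` columns are invented for the example, and pandas category codes are used as one simple way to label-encode; this is not necessarily how the Day 4 program does it.

```python
import pandas as pd

# Made-up frame with one categorical and one numeric column.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": [1.0, 2.5, 3.0, 2.0],
})

# Label encoding: each category becomes one integer code
# (categories are sorted, so blue=0, green=1, red=2).
label_encoded = df.copy()
label_encoded["color"] = df["color"].astype("category").cat.codes

# One-hot encoding: each category becomes its own indicator column.
one_hot = pd.get_dummies(df, columns=["color"])

print(label_encoded)
print(one_hot)
```

Label encoding keeps one column but implies an order among categories; one-hot encoding avoids that implied order at the cost of one column per category.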
37 | 
38 | 	
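39 | Day 5: EDA: data distribution
40 | 
41 | 	EDA of course has no fixed procedure, but a beginner already does not know where to start with the classic Titanic problem and its ten-odd features, let alone today's problem with more than a hundred features. So a step-by-step procedure to follow is very valuable in the beginner stage.
42 | 	
43 | 	Today counts as the third day of EDA. On Day 2 we took a first look at the data frame (how much data there is, whether it is wide or narrow, tall or short); on Day 4 we observed the data types of the features; today we look at the statistical properties of the feature data,
44 | 	Measures of central tendency
45 | 		•Mean
46 | 		•Median
47 | 		•Mode 
48 | 	Measures of dispersion
49 | 		•Min
50 | 		•Max
51 | 		•Range
52 | 		•Quartiles
53 | 		•Variance
54 | 		•Standard deviation
55 | 		
56 | 	HW: Today's homework is to pick three features that interest you and look at their statistical properties. You can pick them based on your understanding of the features from studying HomeCredit_columns_description.csv on Day 2; if you pick many, you can narrow them down at the end using an sns.heatmap() of the corr() matrix.
57 | 	Today you should learn DataFrame.describe() for viewing basic statistics, and also learn to draw a histogram with plt.hist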
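
The statistics listed above can be computed with pandas roughly as follows. The seven values are made up to stand in for one numeric feature, and `np.histogram` is used here only to show the bin counts that `plt.hist` would draw, so the sketch runs without a display.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for one numeric feature.
s = pd.Series([2, 4, 4, 5, 7, 9, 11], name="feature")

# Central tendency.
central = {"mean": s.mean(), "median": s.median(), "mode": s.mode().iloc[0]}

# Dispersion.
spread = {
    "min": s.min(),
    "max": s.max(),
    "range": s.max() - s.min(),
    "quartiles": s.quantile([0.25, 0.5, 0.75]).tolist(),
    "variance": s.var(),  # sample variance (ddof=1), the pandas default
    "std": s.std(),
}
print(central)
print(spread)

# describe() reports most of these in one call.
print(s.describe())

# Bin counts behind a histogram; plt.hist(s, bins=3) would draw the same bins.
counts, edges = np.histogram(s, bins=3)
print(counts, edges)
```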
58 | 
59 | Further references follow; please pick one article to read from each of the first three categories,
60 | 
61 | <Label Encoding and One Hot Encoding: three articles to choose from>
62 | https://medium.com/@contactsunny/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621 (English)
63 | https://tw.saowen.com/a/1b61dc9734ec055fd42ce2160acc79171cc5ff121248c3ceb44d65c4235feb57  (Chinese)
64 | https://www.smwenku.com/a/5baab6e32b7177781a0e6859/zh-cn/  (Chinese)
65 | 
66 | Introduction to distributions (for beginners the Normal, Binomial, and Poisson distributions are enough; it is good English practice, or Google it yourself) 
67 | https://www.healthknowledge.org.uk/public-health-textbook/research-methods/1b-statistical-methods/statistical-distributions (English)
68 | https://chrisalbon.com/python/data_wrangling/pandas_dataframe_descriptive_stats/ (English)
69 | https://amaozhao.gitbooks.io/pandas-notebook/content/pandas%E4%B8%AD%E7%9A%84%E7%BB%98%E5%9B%BE%E5%87%BD%E6%95%B0.html (Chinese)
70 | 
71 | Histograms
72 | https://zh.wikipedia.org/zh-tw/%E7%9B%B4%E6%96%B9%E5%9B%BE
73 | http://blog.topspeedsnail.com/archives/814
74 | https://amaozhao.gitbooks.io/pandas-notebook/content/pandas%E4%B8%AD%E7%9A%84%E7%BB%98%E5%9B%BE%E5%87%BD%E6%95%B0.html
75 | 
76 | Advanced: ROC (a task you will meet sooner or later)
77 | https://en.wikipedia.org/wiki/Receiver_operating_characteristic
78 | 
79 | Other
80 | https://ithelp.ithome.com.tw/articles/10185922 
81 | 


--------------------------------------------------------------------------------
/EDA01.txt:
--------------------------------------------------------------------------------
 1 | # Part 1 EDA in ML100-Days
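 2 | 100-Day Machine Learning Marathon:
 3 | Part 1: EDA, 16 days in total
 4 | 
 5 | 
 6 | https://www.cupoy.com/home
 7 | 
 8 | The next 100 days of study are planned like this:
 9 | we start from the data and from observing features, then begin building models, or the framework for "realizing artificial intelligence through machine learning", spend one day on tuning, and four to five days on a Kaggle problem as a checkpoint.
10 | Then, before heading into deep learning, we work through unsupervised learning with discussion of very practical problems, and the last thirty-odd days go into the more mystical machine learning lessons.
11 | 
12 | The first 16 days are data cleaning and preprocessing.
13 | 
14 | The Cupoy 100-Day Machine Learning Marathon course lacks an overall introduction and goes straight into the 100 days. Even though Day 1 has some reminders and excellent extended material,
15 | classmates who do not pay attention may miss the forest for the trees: after studying they may be able to solve problems later, yet not know where to start the moment they are handed one.
16 | 
17 | So before setting off, let me explain the process I have in mind. Data cleaning has no standard procedure, but beginners can take these 16 days of study as a template to start from and gradually become more flexible.
18 | 
19 | When we receive a problem, these 16 days of preprocessing need to be done well before we move on to feature selection and preparation,
20 | 	• Day 1: Ladies and gentlemen, when you receive a problem do not rush to jump in; take time to understand the problem, arrange resources, and plan the approach.
21 | 	• Day 2: Our first EDA starts from the number of rows and columns, to learn how much we have in hand. Of course, do not forget that we also read the column description file.
22 | 	• Day 4: We start going into the columns to look at the data format of every column, checking whether it is float, integer, or object data. Categorical data we encode right away.
23 | 	• Day 5: We stay in the columns and study their basic statistics.
24 | 		We can put it this way:
25 | 		Day 4 uses app_train.dtypes.value_counts() and app_train.select_dtypes(include=["object"]).apply(pd.Series.nunique, axis = 0) to learn how many columns each data type has, find all categorical data, and check how many categories each categorical column has.
26 | 		Day 5 uses .describe() to view the statistical characteristics of the column data and plt.hist() to draw a histogram, so that we get to know the column data.
27 | 	• Day 6: After observing the Day 5 statistics, on Day 6 we make a first judgment on whether there are outliers (anomalous data); this time we use boxplots.
28 | 	• Day 7: This continues from Day 6 with handling of some outliers. Once the outliers of the individual columns are handled, we also do not want a few columns, whose values are especially large because their units differ, to unbalance the later model computation, so we standardize as well.
29 | 	• Day 9, 10: After the basic data has been inspected and corrected, we compute the correlation of every feature (column) with the target.
30 | 	• Day 11: We observe the statistics of a single feature, even going inside it to see whether subgroups have their own characteristics; we introduce KDE and the pdf. These fine-grained observations give us a clearer understanding and command of the subject and its features.
31 | 	• Day 12, 13, 14: Discretize continuous variables. Some features may work well when binned; age, for example, split into < 25, 25 ~ 45, and > 45.
32 | 	• Day 16: Almost all initial data processing is complete, so we examine the correlations of the prepared features and get ready to try building the first machine learning program (Day16).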
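
Two of the steps in the list above, the Day 4 category count and the Day 12 to 14 age binning, can be sketched as follows. The miniature `app_train` frame and its column names are hypothetical, and `exclude="number"` is used in place of `include=["object"]` only so the example does not depend on how a given pandas version stores strings.

```python
import pandas as pd

# Hypothetical miniature of app_train.
app_train = pd.DataFrame({
    "AGE": [19, 25, 33, 45, 52, 70],
    "CONTRACT": ["cash", "revolving", "cash", "cash", "revolving", "cash"],
    "GENDER": ["F", "M", "F", "F", "M", "M"],
})

# Day 4: number of distinct categories in each non-numeric column.
categorical = app_train.select_dtypes(exclude="number")
n_categories = categorical.apply(pd.Series.nunique, axis=0)
print(n_categories)

# Day 12 to 14: discretize a continuous column into < 25, 25 to 45, > 45.
# pd.cut intervals are right-inclusive, so with whole-number ages
# the bins are (0, 24], (24, 45], (45, 200].
bins = [0, 24, 45, 200]
labels = ["<25", "25-45", ">45"]
app_train["AGE_GROUP"] = pd.cut(app_train["AGE"], bins=bins, labels=labels)
print(app_train["AGE_GROUP"].value_counts(sort=False))
```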
33 | 
34 | 
35 | The program homework for each day's progress is at
36 | https://github.com/PatrickRuan/ML100-Days
37 | Feel free to click watch or follow ^^
38 | 
39 | 
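40 | Day 1: When we meet a problem, we do not have to jump in and play right away;
41 | 	
42 | 	We first think about the problem and understand the goal. If there is data, we can also skim it and work out what interesting and useful results could be produced.
43 | 	If there is no data, collecting it only after thinking the problem through saves a lot of time by comparison.
44 | 	
45 | 	Only then do we work out a strategy for action, for example quickly completing one solution, a minimum viable solution (the Minimum Viable Product concept), and then optimizing and improving it afterwards.
46 | 	
47 | 	HW: Besides the simple Day 1 exercise (write Y=wX+b, plot(X,Y), define mse and mae) there is another very important task,
48 | 	Homework 2: If you ran 台灣大車隊 (Taiwan Taxi) today, how would you use data analysis to improve business? Think about it: besides engineers we also need data analysts and scientists.
49 | 	Program homework,
50 | 	1.) Build y = w * x + b with w = 3, b = 5, where x carries Gaussian noise of amplitude 5
51 | 	2.) Define two functions, mae and mse
52 | 	3.) Learn to write $ MSE = \frac{1}{n}\sum_{i=1}^{n}{(Y_i - \hat{Y}_i)^2} $ in a text cell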
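
A minimal sketch of the program homework might look like this. The x range of 0 to 100, the random seed, and the choice to add the noise to x before applying the line are assumptions made here; only w = 3, b = 5, the noise amplitude of 5, and the function names mae and mse come from the notes above.

```python
import numpy as np


def mae(y, y_hat):
    """Mean absolute error between two equal-length sequences."""
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(y_hat))))


def mse(y, y_hat):
    """Mean squared error between two equal-length sequences."""
    return float(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))


w, b = 3, 5
rng = np.random.default_rng(0)
x = np.linspace(0, 100, 101)

# Assumption: Gaussian noise of amplitude 5 perturbs x before the line is applied.
y = (x + rng.normal(size=x.size) * 5) * w + b
# Noise-free line used as the prediction.
y_hat = w * x + b

print("MSE:", mse(y, y_hat))
print("MAE:", mae(y, y_hat))
```

Plotting is left out so the sketch runs anywhere; `plt.plot(x, y, "b.")` followed by `plt.plot(x, y_hat, "r-")` would show the noisy points against the line.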
53 | 
54 | 
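55 | Day 2: First data exploration
56 | 
57 | 	For the first sixteen days we work on data preprocessing; the practice problem is this Kaggle problem 
58 | 	https://www.kaggle.com/c/home-credit-default-risk/data 
59 | 	
60 | 	The files we should need are application_train.csv and application_test.csv 
61 | 	For the final upload we will use submission.csv 
62 | 	But please do not forget to study HomeCredit_columns_description.csv 
63 | 	
64 | 	HW:
65 | 	On Day 2 we start reading the data and do the very first exploration of this prepared DataFrame: for instance, finding out how many rows there are and how many feature "columns", how to collect the feature names into a list, or how to display a slice of the data. In short, today we run once through the actions we take before diving into individual features and individual records.
66 | 	Day 2 also has a fun optional assignment: read Prof. Wu's data mining lecture notes. Why the fish of 杜河 are studied, and how, is a meaningful exercise, and the Trump story is great fun too.
67 | 	
68 | 	Program homework: learn every observation in this program, and raise anything you do not understand.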
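
The first-look operations described in the homework could be sketched like this. The three-row frame is a made-up stand-in with assumed column names; in the actual exercise the frame would come from reading application_train.csv with `pd.read_csv`.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the training file.
app_train = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100004],
    "TARGET": [1, 0, 0],
    "AMT_INCOME_TOTAL": [202500.0, 270000.0, 67500.0],
})

n_rows, n_cols = app_train.shape          # number of records and of columns
feature_names = list(app_train.columns)   # column names as a plain list
first_two = app_train.head(2)             # the first rows
# Slice rows by label (inclusive on both ends) and pick two columns.
subset = app_train.loc[1:2, ["TARGET", "AMT_INCOME_TOTAL"]]

print(n_rows, n_cols)
print(feature_names)
print(first_two)
print(subset)
```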
69 | 	
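70 | Day 3: About DataFrame
71 | 	
72 | 	When cleaning data and handling machine learning problems, we constantly need to be fluent with DataFrame operations. A DataFrame is like an EXCEL spreadsheet: the first row holds all the column names, which are the features we will process later. From the second row on, each row is one record.
73 | 	
74 | 	HW: The Day 3 exercise covers a few more formats beyond the pandas DataFrame; work hard, everyone!
75 | 	Program homework: learn every observation in today's programs (there are two), and raise anything you do not understand.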
76 | 


--------------------------------------------------------------------------------
/Day_040_HW.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "## Practice time\n",
  8 |     "Try another sklearn dataset (boston, ...) to train your own linear regression model, then add suitable regularization and observe how training behaves."
  9 |    ]
 10 |   },
 11 |   {
 12 |    "cell_type": "code",
 13 |    "execution_count": 1,
 14 |    "metadata": {},
 15 |    "outputs": [],
 16 |    "source": [
 17 |     "import numpy as np\n",
 18 |     "import pandas as pd\n",
 19 |     "import matplotlib.pyplot as plt\n",
 20 |     "from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso\n",
 21 |     "from sklearn.model_selection import train_test_split\n",
 22 |     "from sklearn.metrics import mean_squared_error, accuracy_score, r2_score\n",
 23 |     "from sklearn import datasets"
 24 |    ]
 25 |   },
 26 |   {
 27 |    "cell_type": "code",
 28 |    "execution_count": 2,
 29 |    "metadata": {},
 30 |    "outputs": [],
 31 |    "source": [
 32 |     "Boston = datasets.load_boston()"
 33 |    ]
 34 |   },
 35 |   {
 36 |    "cell_type": "code",
 37 |    "execution_count": 3,
 38 |    "metadata": {},
 39 |    "outputs": [],
 40 |    "source": [
 41 |     "x_train, x_test, y_train, y_test = train_test_split(Boston.data, Boston.target, test_size=0.1, random_state=4)"
 42 |    ]
 43 |   },
 44 |   {
 45 |    "cell_type": "code",
 46 |    "execution_count": null,
 47 |    "metadata": {},
 48 |    "outputs": [],
 49 |    "source": []
 50 |   },
 51 |   {
 52 |    "cell_type": "code",
 53 |    "execution_count": 4,
 54 |    "metadata": {},
 55 |    "outputs": [
 56 |     {
 57 |      "name": "stdout",
 58 |      "output_type": "stream",
 59 |      "text": [
 60 |       "mse= 17.04\n"
 61 |      ]
 62 |     }
 63 |    ],
 64 |    "source": [
 65 |     "model=LinearRegression()\n",
 66 |     "model.fit(x_train, y_train)\n",
 67 |     "pred=model.predict(x_test)\n",
 68 |     "print('mse= %.2f'%mean_squared_error(y_test, pred))"
 69 |    ]
 70 |   },
 71 |   {
 72 |    "cell_type": "markdown",
 73 |    "metadata": {},
 74 |    "source": [
 75 |     "# Ridge"
 76 |    ]
 77 |   },
 78 |   {
 79 |    "cell_type": "code",
 80 |    "execution_count": 5,
 81 |    "metadata": {},
 82 |    "outputs": [
 83 |     {
 84 |      "name": "stdout",
 85 |      "output_type": "stream",
 86 |      "text": [
 87 |       "mse= 17.35\n"
 88 |      ]
 89 |     }
 90 |    ],
 91 |    "source": [
 92 |     "model = Ridge()\n",
 93 |     "model.fit(x_train, y_train)\n",
 94 |     "pred=model.predict(x_test)\n",
 95 |     "print('mse= %.2f'%mean_squared_error(y_test, pred))"
 96 |    ]
 97 |   },
 98 |   {
 99 |    "cell_type": "markdown",
100 |    "metadata": {},
101 |    "source": [
102 |     "# Lasso"
103 |    ]
104 |   },
105 |   {
106 |    "cell_type": "code",
107 |    "execution_count": 6,
108 |    "metadata": {},
109 |    "outputs": [
110 |     {
111 |      "name": "stdout",
112 |      "output_type": "stream",
113 |      "text": [
114 |       "mse= 23.24\n"
115 |      ]
116 |     }
117 |    ],
118 |    "source": [
119 |     "model = Lasso()\n",
120 |     "model.fit(x_train, y_train)\n",
121 |     "pred=model.predict(x_test)\n",
122 |     "print('mse= %.2f'%mean_squared_error(y_test, pred))"
123 |    ]
124 |   }
125 |  ],
126 |  "metadata": {
127 |   "kernelspec": {
128 |    "display_name": "Python 3",
129 |    "language": "python",
130 |    "name": "python3"
131 |   },
132 |   "language_info": {
133 |    "codemirror_mode": {
134 |     "name": "ipython",
135 |     "version": 3
136 |    },
137 |    "file_extension": ".py",
138 |    "mimetype": "text/x-python",
139 |    "name": "python",
140 |    "nbconvert_exporter": "python",
141 |    "pygments_lexer": "ipython3",
142 |    "version": "3.6.7"
143 |   }
144 |  },
145 |  "nbformat": 4,
146 |  "nbformat_minor": 2
147 | }
148 | 


--------------------------------------------------------------------------------
/Day_066_HW.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": 1,
  6 |    "metadata": {},
  7 |    "outputs": [
  8 |     {
  9 |      "name": "stderr",
 10 |      "output_type": "stream",
 11 |      "text": [
 12 |       "Using TensorFlow backend.\n"
 13 |      ]
 14 |     },
 15 |     {
 16 |      "name": "stdout",
 17 |      "output_type": "stream",
 18 |      "text": [
 19 |       "{'name': 'model_1', 'layers': [{'name': 'input_1', 'class_name': 'InputLayer', 'config': {'batch_input_shape': (None, 32), 'dtype': 'float32', 'sparse': False, 'name': 'input_1'}, 'inbound_nodes': []}, {'name': 'dense_1', 'class_name': 'Dense', 'config': {'name': 'dense_1', 'trainable': True, 'units': 32, 'activation': 'linear', 'use_bias': True, 'kernel_initializer': {'class_name': 'VarianceScaling', 'config': {'scale': 1.0, 'mode': 'fan_avg', 'distribution': 'uniform', 'seed': None}}, 'bias_initializer': {'class_name': 'Zeros', 'config': {}}, 'kernel_regularizer': None, 'bias_regularizer': None, 'activity_regularizer': None, 'kernel_constraint': None, 'bias_constraint': None}, 'inbound_nodes': [[['input_1', 0, 0, {}]]]}], 'input_layers': [['input_1', 0, 0]], 'output_layers': [['dense_1', 0, 0]]}\n"
 20 |      ]
 21 |     }
 22 |    ],
 23 |    "source": [
 24 |     "from keras.utils import multi_gpu_model\n",
 25 |     "from keras.models import Model\n",
 26 |     "from keras.layers import Input, Dense\n",
 27 |     "\n",
 28 |     "\n",
 29 |     "a = Input(shape=(32,))\n",
 30 |     "b = Dense(32)(a)\n",
 31 |     "model = Model(inputs=a, outputs=b)\n",
 32 |     "\n",
 33 |     "config = model.get_config()\n",
 34 |     "print(config)"
 35 |    ]
 36 |   },
 37 |   {
 38 |    "cell_type": "code",
 39 |    "execution_count": 2,
 40 |    "metadata": {},
 41 |    "outputs": [
 42 |     {
 43 |      "name": "stdout",
 44 |      "output_type": "stream",
 45 |      "text": [
 46 |       "_________________________________________________________________\n",
 47 |       "Layer (type)                 Output Shape              Param #   \n",
 48 |       "=================================================================\n",
 49 |       "input_1 (InputLayer)         (None, 32)                0         \n",
 50 |       "_________________________________________________________________\n",
 51 |       "dense_1 (Dense)              (None, 32)                1056      \n",
 52 |       "=================================================================\n",
 53 |       "Total params: 1,056\n",
 54 |       "Trainable params: 1,056\n",
 55 |       "Non-trainable params: 0\n",
 56 |       "_________________________________________________________________\n"
 57 |      ]
 58 |     }
 59 |    ],
 60 |    "source": [
 61 |     "model.summary()"
 62 |    ]
 63 |   },
 64 |   {
 65 |    "cell_type": "markdown",
 66 |    "metadata": {},
 67 |    "source": [
 68 |     "# Homework:\n",
 69 |     "    Check the backend\n",
 70 |     "    Check the fuzz factor\n",
 71 |     "    Set the Keras float type to float16"
 72 |    ]
 73 |   },
 74 |   {
 75 |    "cell_type": "code",
 76 |    "execution_count": 3,
 77 |    "metadata": {},
 78 |    "outputs": [
 79 |     {
 80 |      "data": {
 81 |       "text/plain": [
 82 |        "'tensorflow'"
 83 |       ]
 84 |      },
 85 |      "execution_count": 3,
 86 |      "metadata": {},
 87 |      "output_type": "execute_result"
 88 |     }
 89 |    ],
 90 |    "source": [
 91 |     "import keras\n",
 92 |     "from keras import backend as K\n",
 93 |     "\n",
 94 |     "# Check the backend\n",
 95 |     "K.backend()"
 96 |    ]
 97 |   },
 98 |   {
 99 |    "cell_type": "code",
100 |    "execution_count": 4,
101 |    "metadata": {},
102 |    "outputs": [
103 |     {
104 |      "data": {
105 |       "text/plain": [
106 |        "1e-07"
107 |       ]
108 |      },
109 |      "execution_count": 4,
110 |      "metadata": {},
111 |      "output_type": "execute_result"
112 |     }
113 |    ],
114 |    "source": [
115 |     "# Check the fuzz factor\n",
116 |     "# https://stats.stackexchange.com/questions/342892/what-is-the-meaning-of-fuzz-factor\n",
117 |     "K.epsilon()"
118 |    ]
119 |   },
120 |   {
121 |    "cell_type": "code",
122 |    "execution_count": 7,
123 |    "metadata": {},
124 |    "outputs": [
125 |     {
126 |      "data": {
127 |       "text/plain": [
128 |        "'float16'"
129 |       ]
130 |      },
131 |      "execution_count": 7,
132 |      "metadata": {},
133 |      "output_type": "execute_result"
134 |     }
135 |    ],
136 |    "source": [
137 |     "# Set the Keras float type to float16\n",
138 |     "K.set_floatx('float16')\n",
139 |     "K.floatx()"
140 |    ]
141 |   }
142 |  ],
143 |  "metadata": {
144 |   "kernelspec": {
145 |    "display_name": "Python 3",
146 |    "language": "python",
147 |    "name": "python3"
148 |   },
149 |   "language_info": {
150 |    "codemirror_mode": {
151 |     "name": "ipython",
152 |     "version": 3
153 |    },
154 |    "file_extension": ".py",
155 |    "mimetype": "text/x-python",
156 |    "name": "python",
157 |    "nbconvert_exporter": "python",
158 |    "pygments_lexer": "ipython3",
159 |    "version": "3.6.7"
160 |   }
161 |  },
162 |  "nbformat": 4,
163 |  "nbformat_minor": 2
164 | }
165 | 


--------------------------------------------------------------------------------
/Day_63_HW.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": 1,
  6 |    "metadata": {},
  7 |    "outputs": [],
  8 |    "source": [
  9 |     "import numpy as np\n",
 10 |     "\n",
 11 |     "#Input array\n",
 12 |     "X = np.array([[1, 0, 1, 0], [1 ,0 ,1 ,1 ],[ 0 , 1 , 0 , 1 ]])\n",
 13 |     "\n",
 14 |     "#Output\n",
 15 |     "y = np.array([[1], [1], [0]])"
 16 |    ]
 17 |   },
 18 |   {
 19 |    "cell_type": "code",
 20 |    "execution_count": 2,
 21 |    "metadata": {},
 22 |    "outputs": [],
 23 |    "source": [
 24 |     "#Sigmoid Function\n",
 25 |     "def sigmoid(x): \n",
 26 |     "    return  1/(1+np.exp(-x))\n",
 27 |     "\n",
 28 |     "#Derivative of Sigmoid Function (x is the sigmoid output, not its input)\n",
 29 |     "def derivatives_sigmoid(x): \n",
 30 |     "    return x*(1-x)"
 31 |    ]
 32 |   },
 33 |   {
 34 |    "cell_type": "code",
 35 |    "execution_count": 3,
 36 |    "metadata": {},
 37 |    "outputs": [],
 38 |    "source": [
 39 |     "#Variable initialization\n",
 40 |     "epoch = 5000 #Setting training iterations\n",
 41 |     "lr = 0.1 #Setting learning rate\n",
 42 |     "inputlayer_neurons = X.shape[1] #number of features in data set \n",
 43 |     "hiddenlayer_neurons = 3 #number of hidden layer neurons\n",
 44 |     "output_neurons = 1 #number of neurons at output layer"
 45 |    ]
 46 |   },
 47 |   {
 48 |    "cell_type": "code",
 49 |    "execution_count": 4,
 50 |    "metadata": {},
 51 |    "outputs": [],
 52 |    "source": [
 53 |     "#weight and bias initialization\n",
 54 |     "wh = np.random.uniform( size = ( inputlayer_neurons , hiddenlayer_neurons ) ) \n",
 55 |     "bh = np.random.uniform( size = ( 1 , hiddenlayer_neurons ) ) \n",
 56 |     "wout = np.random.uniform( size = ( hiddenlayer_neurons , output_neurons ) ) \n",
 57 |     "bout = np.random.uniform( size = ( 1 , output_neurons ) )"
 58 |    ]
 59 |   },
 60 |   {
 61 |    "cell_type": "markdown",
 62 |    "metadata": {
 63 |     "collapsed": true
 64 |    },
 65 |    "source": [
 66 |     "# Homework \n",
 67 |     "* Following the hidden layer code in the example, complete the code for the output layer"
 68 |    ]
 69 |   },
 70 |   {
 71 |    "cell_type": "code",
 72 |    "execution_count": 5,
 73 |    "metadata": {},
 74 |    "outputs": [
 75 |     {
 76 |      "name": "stdout",
 77 |      "output_type": "stream",
 78 |      "text": [
 79 |       "output of Forward Propagation:\n",
 80 |       "[[0.77563347]\n",
 81 |       " [0.78326709]\n",
 82 |       " [0.77140564]]\n",
 83 |       "wout,bout of Backpropagation:\n",
 84 |       "[[0.27927802]\n",
 85 |       " [0.1851273 ]\n",
 86 |       " [0.41321647]],\n",
 87 |       "[[0.5395062]]\n"
 88 |      ]
 89 |     }
 90 |    ],
 91 |    "source": [
 92 |     "for i in range(epoch):\n",
 93 |     "    #Forward Propagation\n",
 94 |     "    hidden_layer_input1 = np.dot(X, wh)\n",
 95 |     "    hidden_layer_input = hidden_layer_input1 + bh\n",
 96 |     "    hiddenlayer_activations = sigmoid(hidden_layer_input)\n",
 97 |     "\n",
 98 |     "    output_layer_input1 = np.dot(hiddenlayer_activations, wout)\n",
 99 |     "    output_layer_input = output_layer_input1 + bout\n",
100 |     "    output_layer_activations = sigmoid(output_layer_input)\n",
101 |     "\n",
102 |     "print(\"output of Forward Propagation:\\n{}\".format(output_layer_activations))\n",
103 |     "print(\"wout,bout of Backpropagation:\\n{},\\n{}\".format(wout, bout))"
104 |    ]
105 |   },
106 |   {
107 |    "cell_type": "code",
108 |    "execution_count": 6,
109 |    "metadata": {},
110 |    "outputs": [
111 |     {
112 |      "name": "stdout",
113 |      "output_type": "stream",
114 |      "text": [
115 |       "output of Forward Propagation:\n",
116 |       "[[0.77563347]\n",
117 |       " [0.78326709]\n",
118 |       " [0.77140564]]\n",
119 |       "wout,bout of Backpropagation:\n",
120 |       "[[0.27927802]\n",
121 |       " [0.1851273 ]\n",
122 |       " [0.41321647]],\n",
123 |       "[[0.5395062]]\n"
124 |      ]
125 |     }
126 |    ],
127 |    "source": [
128 |     "for i in range(epoch):\n",
129 |     "    #Forward Propagation\n",
130 |     "    hidden_layer_input1 = np.dot(X, wh)\n",
131 |     "    hidden_layer_input = hidden_layer_input1 + bh\n",
132 |     "    hiddenlayer_activations = sigmoid(hidden_layer_input)\n",
133 |     "    output_layer_input1 = np.dot(hiddenlayer_activations, wout)\n",
134 |     "    output_layer_input = output_layer_input1 + bout\n",
135 |     "    output = sigmoid(output_layer_input)\n",
136 |     "\n",
137 |     "print(\"output of Forward Propagation:\\n{}\".format(output))\n",
138 |     "print(\"wout,bout of Backpropagation:\\n{},\\n{}\".format(wout, bout))"
139 |    ]
140 |   },
141 |   {
142 |    "cell_type": "code",
143 |    "execution_count": null,
144 |    "metadata": {},
145 |    "outputs": [],
146 |    "source": []
147 |   }
148 |  ],
149 |  "metadata": {
150 |   "kernelspec": {
151 |    "display_name": "Python 3",
152 |    "language": "python",
153 |    "name": "python3"
154 |   },
155 |   "language_info": {
156 |    "codemirror_mode": {
157 |     "name": "ipython",
158 |     "version": 3
159 |    },
160 |    "file_extension": ".py",
161 |    "mimetype": "text/x-python",
162 |    "name": "python",
163 |    "nbconvert_exporter": "python",
164 |    "pygments_lexer": "ipython3",
165 |    "version": "3.6.7"
166 |   }
167 |  },
168 |  "nbformat": 4,
169 |  "nbformat_minor": 2
170 | }
171 | 


--------------------------------------------------------------------------------
/Day_067_HW.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# 1. Import Library"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "code",
 12 |    "execution_count": 1,
 13 |    "metadata": {},
 14 |    "outputs": [
 15 |     {
 16 |      "name": "stderr",
 17 |      "output_type": "stream",
 18 |      "text": [
 19 |       "Using TensorFlow backend.\n"
 20 |      ]
 21 |     }
 22 |    ],
 23 |    "source": [
 24 |     "import numpy\n",
 25 |     "from keras.datasets import cifar10\n",
 26 |     "import numpy as np\n",
 27 |     "np.random.seed(10)"
 28 |    ]
 29 |   },
 30 |   {
 31 |    "cell_type": "markdown",
 32 |    "metadata": {},
 33 |    "source": [
 34 |     "# Data preparation"
 35 |    ]
 36 |   },
 37 |   {
 38 |    "cell_type": "code",
 39 |    "execution_count": 2,
 40 |    "metadata": {},
 41 |    "outputs": [
 42 |     {
 43 |      "name": "stdout",
 44 |      "output_type": "stream",
 45 |      "text": [
 46 |       "Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz\n",
 47 |       "170500096/170498071 [==============================] - 813s 5us/step\n"
 48 |      ]
 49 |     }
 50 |    ],
 51 |    "source": [
 52 |     "(x_img_train,y_label_train), \\\n",
 53 |     "(x_img_test, y_label_test)=cifar10.load_data()"
 54 |    ]
 55 |   },
 56 |   {
 57 |    "cell_type": "code",
 58 |    "execution_count": 28,
 59 |    "metadata": {},
 60 |    "outputs": [
 61 |     {
 62 |      "name": "stdout",
 63 |      "output_type": "stream",
 64 |      "text": [
 65 |       "train data shape is, (50000, 32, 32, 3)\n",
 66 |       "label shape is, (50000, 1)\n",
 67 |       "test data shape is, (10000, 32, 32, 3)\n",
 68 |       "first pixel at first train data is, [59 62 63]\n"
 69 |      ]
 70 |     }
 71 |    ],
 72 |    "source": [
 73 |     "print(f'train data shape is, {x_img_train.shape}')\n",
 74 |     "print(f'label shape is, {y_label_train.shape}')\n",
 75 |     "print(f'test data shape is, {x_img_test.shape}')\n",
 76 |     "print(f'first pixel at first train data is, {x_img_train[0][0][0]}')"
 77 |    ]
 78 |   },
 79 |   {
 80 |    "cell_type": "markdown",
 81 |    "metadata": {},
 82 |    "source": [
 83 |     "# Image normalize "
 84 |    ]
 85 |   },
 86 |   {
 87 |    "cell_type": "code",
 88 |    "execution_count": 39,
 89 |    "metadata": {},
 90 |    "outputs": [],
 91 |    "source": [
 92 |     "x_img_train_normalize = x_img_train / 255.0\n",
 93 |     "x_img_test_normalize = x_img_test / 255.0"
 94 |    ]
 95 |   },
 96 |   {
 97 |    "cell_type": "code",
 98 |    "execution_count": 41,
 99 |    "metadata": {},
100 |    "outputs": [
101 |     {
102 |      "data": {
103 |       "text/plain": [
104 |        "array([0.23137255, 0.24313725, 0.24705882])"
105 |       ]
106 |      },
107 |      "execution_count": 41,
108 |      "metadata": {},
109 |      "output_type": "execute_result"
110 |     }
111 |    ],
112 |    "source": [
113 |     "x_img_train_normalize[0][0][0]"
114 |    ]
115 |   },
116 |   {
117 |    "cell_type": "markdown",
118 |    "metadata": {},
119 |    "source": [
120 |     "# Convert labels to one-hot encoding"
121 |    ]
122 |   },
123 |   {
124 |    "cell_type": "code",
125 |    "execution_count": 45,
126 |    "metadata": {},
127 |    "outputs": [],
128 |    "source": [
129 |     "from keras.utils import np_utils\n",
130 |     "y_label_train_OneHot = np_utils.to_categorical(y_label_train)\n",
131 |     "y_label_test_OneHot = np_utils.to_categorical(y_label_test)"
132 |    ]
133 |   },
134 |   {
135 |    "cell_type": "code",
136 |    "execution_count": 48,
137 |    "metadata": {},
138 |    "outputs": [
139 |     {
140 |      "name": "stdout",
141 |      "output_type": "stream",
142 |      "text": [
143 |       "original label shape is, (50000, 1)\n",
144 |       "after ohe label shape is, (50000, 10)\n"
145 |      ]
146 |     }
147 |    ],
148 |    "source": [
149 |     "print(f'original label shape is, {y_label_train.shape}')\n",
150 |     "print(f'after ohe label shape is, {y_label_train_OneHot.shape}')\n"
151 |    ]
152 |   },
153 |   {
154 |    "cell_type": "code",
155 |    "execution_count": 49,
156 |    "metadata": {},
157 |    "outputs": [
158 |     {
159 |      "name": "stdout",
160 |      "output_type": "stream",
161 |      "text": [
162 |       "original the 1st label @ training data is, [6]\n",
163 |       "after ohe label the 1st label @training data is, [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]\n"
164 |      ]
165 |     }
166 |    ],
167 |    "source": [
168 |     "print(f'original the 1st label @ training data is, {y_label_train[0]}')\n",
169 |     "print(f'after ohe label the 1st label @training data is, {y_label_train_OneHot[0]}')\n"
170 |    ]
171 |   },
172 |   {
173 |    "cell_type": "markdown",
174 |    "metadata": {},
175 |    "source": [
176 |     "# Homework:\n",
177 |     "    Try switching to CIFAR100"
178 |    ]
179 |   },
180 |   {
181 |    "cell_type": "code",
182 |    "execution_count": null,
183 |    "metadata": {},
184 |    "outputs": [],
185 |    "source": []
186 |   }
187 |  ],
188 |  "metadata": {
189 |   "anaconda-cloud": {},
190 |   "kernelspec": {
191 |    "display_name": "Python 3",
192 |    "language": "python",
193 |    "name": "python3"
194 |   },
195 |   "language_info": {
196 |    "codemirror_mode": {
197 |     "name": "ipython",
198 |     "version": 3
199 |    },
200 |    "file_extension": ".py",
201 |    "mimetype": "text/x-python",
202 |    "name": "python",
203 |    "nbconvert_exporter": "python",
204 |    "pygments_lexer": "ipython3",
205 |    "version": "3.6.7"
206 |   }
207 |  },
208 |  "nbformat": 4,
209 |  "nbformat_minor": 1
210 | }
211 | 


--------------------------------------------------------------------------------
/Day_68-Keras_Sequential_model_HW.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": 1,
  6 |    "metadata": {},
  7 |    "outputs": [
  8 |     {
  9 |      "name": "stderr",
 10 |      "output_type": "stream",
 11 |      "text": [
 12 |       "Using TensorFlow backend.\n"
 13 |      ]
 14 |     }
 15 |    ],
 16 |    "source": [
 17 |     "# Load the required libraries\n",
 18 |     "import keras\n",
 19 |     "from keras.datasets import cifar10\n",
 20 |     "from keras.models import Sequential, load_model\n",
 21 |     "from keras.layers import Dense, Dropout, Activation, Flatten\n",
 22 |     "from keras.layers import Conv2D, MaxPooling2D\n"
 23 |    ]
 24 |   },
 25 |   {
 26 |    "cell_type": "markdown",
 27 |    "metadata": {},
 28 |    "source": [
 29 |     "# Homework:\n",
 30 |     "Modify the input shape setting of Conv2D(64, (3, 3)), add one Dense layer, and inspect the output of model.summary"
 31 |    ]
 32 |   },
 33 |   {
 34 |    "cell_type": "code",
 35 |    "execution_count": 4,
 36 |    "metadata": {},
 37 |    "outputs": [],
 38 |    "source": [
 39 |     "(x_train,y_train),(x_test, y_test)= cifar10.load_data()\n",
 40 |     "\n",
 41 |     "x_train=x_train/255\n",
 42 |     "x_test = x_test/255\n",
 43 |     "\n",
 44 |     "y_train = keras.utils.to_categorical(y_train)\n",
 45 |     "y_test = keras.utils.to_categorical(y_test)"
 46 |    ]
 47 |   },
 48 |   {
 49 |    "cell_type": "code",
 50 |    "execution_count": 6,
 51 |    "metadata": {},
 52 |    "outputs": [
 53 |     {
 54 |      "data": {
 55 |       "text/plain": [
 56 |        "(32, 32, 3)"
 57 |       ]
 58 |      },
 59 |      "execution_count": 6,
 60 |      "metadata": {},
 61 |      "output_type": "execute_result"
 62 |     }
 63 |    ],
 64 |    "source": [
 65 |     "x_train.shape[1:]"
 66 |    ]
 67 |   },
 68 |   {
 69 |    "cell_type": "code",
 70 |    "execution_count": 20,
 71 |    "metadata": {},
 72 |    "outputs": [
 73 |     {
 74 |      "name": "stdout",
 75 |      "output_type": "stream",
 76 |      "text": [
 77 |       "_________________________________________________________________\n",
 78 |       "Layer (type)                 Output Shape              Param #   \n",
 79 |       "=================================================================\n",
 80 |       "conv2d_14 (Conv2D)           (None, 16, 16, 64)        1792      \n",
 81 |       "_________________________________________________________________\n",
 82 |       "max_pooling2d_14 (MaxPooling (None, 8, 8, 64)          0         \n",
 83 |       "_________________________________________________________________\n",
 84 |       "flatten_14 (Flatten)         (None, 4096)              0         \n",
 85 |       "_________________________________________________________________\n",
 86 |       "dense_37 (Dense)             (None, 1024)              4195328   \n",
 87 |       "_________________________________________________________________\n",
 88 |       "activation_24 (Activation)   (None, 1024)              0         \n",
 89 |       "_________________________________________________________________\n",
 90 |       "dense_38 (Dense)             (None, 512)               524800    \n",
 91 |       "_________________________________________________________________\n",
 92 |       "dense_39 (Dense)             (None, 10)                5130      \n",
 93 |       "_________________________________________________________________\n",
 94 |       "activation_25 (Activation)   (None, 10)                0         \n",
 95 |       "=================================================================\n",
 96 |       "Total params: 4,727,050\n",
 97 |       "Trainable params: 4,727,050\n",
 98 |       "Non-trainable params: 0\n",
 99 |       "_________________________________________________________________\n",
100 |       "None\n"
101 |      ]
102 |     }
103 |    ],
104 |    "source": [
105 |     "model = Sequential()\n",
106 |     "model.add(Conv2D(filters=64,kernel_size=(3,3),\n",
107 |     "                 input_shape=(16, 16,3), \n",
108 |     "                 activation='relu', \n",
109 |     "                 padding='same'))\n",
110 |     "model.add(MaxPooling2D(pool_size=(2, 2)))\n",
111 |     "\n",
112 |     "model.add(Flatten())\n",
113 |     "model.add(Dense(1024))\n",
114 |     "model.add(Activation('relu'))\n",
115 |     "model.add(Dense(512, activation ='relu'))\n",
116 |     "model.add(Dense(10))\n",
117 |     "model.add(Activation('softmax'))\n",
118 |     "\n",
119 |     "print(model.summary())"
120 |    ]
121 |   },
122 |   {
123 |    "cell_type": "code",
124 |    "execution_count": null,
125 |    "metadata": {},
126 |    "outputs": [],
127 |    "source": []
128 |   },
129 |   {
130 |    "cell_type": "code",
131 |    "execution_count": null,
132 |    "metadata": {},
133 |    "outputs": [],
134 |    "source": []
135 |   },
136 |   {
137 |    "cell_type": "code",
138 |    "execution_count": null,
139 |    "metadata": {},
140 |    "outputs": [],
141 |    "source": []
142 |   },
143 |   {
144 |    "cell_type": "code",
145 |    "execution_count": null,
146 |    "metadata": {},
147 |    "outputs": [],
148 |    "source": []
149 |   }
150 |  ],
151 |  "metadata": {
152 |   "kernelspec": {
153 |    "display_name": "Python 3",
154 |    "language": "python",
155 |    "name": "python3"
156 |   },
157 |   "language_info": {
158 |    "codemirror_mode": {
159 |     "name": "ipython",
160 |     "version": 3
161 |    },
162 |    "file_extension": ".py",
163 |    "mimetype": "text/x-python",
164 |    "name": "python",
165 |    "nbconvert_exporter": "python",
166 |    "pygments_lexer": "ipython3",
167 |    "version": "3.6.7"
168 |   }
169 |  },
170 |  "nbformat": 4,
171 |  "nbformat_minor": 2
172 | }
173 | 


--------------------------------------------------------------------------------
/EDA.md:
--------------------------------------------------------------------------------
  1 | # Part 1 EDA in ML100-Days
  2 | Machine Learning 100-Day Marathon:
  3 | Part 1, EDA: 16 days in total
  4 | 
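  5 | Day 1: When we run into a problem, we don't have to dive straight in.
  6 | 	
  7 | 	First think the problem through and be clear about the goal. If data is available, skim it and work out what interesting, useful results it could yield.
  8 | 	If there is no data yet, think the problem through before collecting any; that saves a lot of time.
  9 | 	
 10 | 	Only then do we plan a strategy and act, for example by quickly finishing a minimal solution (the Minimum Viable Product idea) and then optimizing and improving it.
 11 | 	
 12 | 	HW: Besides the simple Day 1 exercise (code Y = wX + b, plot(X, Y), and define mse and mae), there is one more important task. Homework 2: if you ran Taiwan Taxi (台灣大車隊) today, how would you use data analysis to boost business? Think it over; besides engineers, we also need data analysts and data scientists.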
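
The Day 1 exercise can be sketched roughly as follows. This is a minimal sketch, not the course's reference solution: the values of `w` and `b`, the noise level, and the function names `mean_squared_error` and `mean_absolute_error` are assumptions chosen for illustration.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Hypothetical parameters, chosen only for illustration
w, b = 3.0, 0.5
rng = np.random.default_rng(0)

x = np.linspace(0, 100, 101)
y_line = w * x + b                                  # the exact line Y = wX + b
y_noisy = y_line + rng.normal(0, 10, size=x.shape)  # the line plus Gaussian noise

mse = mean_squared_error(y_noisy, y_line)
mae = mean_absolute_error(y_noisy, y_line)
print(f"MSE = {mse:.3f}, MAE = {mae:.3f}")
```

MSE squares each error, so a few large misses dominate it, while MAE weighs all errors linearly. If matplotlib is installed, `plt.plot(x, y_noisy, 'b.')` followed by `plt.plot(x, y_line, 'r-')` would show the noisy points against the line.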
 13 | 
 14 | 
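 15 | Day 2: First data exploration
 16 | 
 17 | 	For the first sixteen days we work on data cleaning and preprocessing, practicing on a Kaggle competition:
 18 | 	https://www.kaggle.com/c/home-credit-default-risk/data 
 19 | 	
 20 | 	The files we need should be application_train.csv and application_test.csv.
 21 | 	submission.csv is used for the final upload.
 22 | 	Don't forget to study HomeCredit_columns_description.csv as well.
 23 | 	
 24 | 	HW:
 25 | 	On Day 2 we start reading the data and take a first look at this already-tidied DataFrame: how many rows there are, how many feature columns, how to collect the feature names into a list, and how to display a slice of the data. In short, today we run through the steps we take before looking at any individual feature or record.
 26 | 	Day 2 also has a fun optional assignment: read Prof. Wu's data exploration lecture notes. Why study the fish of the Du River (杜河), and how to go about it, are meaningful exercises, and the Trump story is entertaining too.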
 27 | 	
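 28 | Day 3: About DataFrame
 29 | 	
 30 | 	When cleaning data and working on machine learning problems, we need to be fluent with DataFrame operations. A DataFrame is like an Excel spreadsheet: the first row holds the column names, which are the features we will work with, and every row after that is one record.
 31 | 	
 32 | 	HW: The Day 3 exercises go a bit beyond the pandas DataFrame and cover some other data formats. Keep at it!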
 33 | 	
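 34 | Day 4: EDA: introduction to data types
 35 | 
 36 | 	On Day 2 we started EDA by looking at the overall shape of the data: how many features and how many records. From today on we examine the feature data step by step.
 37 | 	
 38 | 	The first small step into feature analysis is to learn each feature's format, that is, its data type. The basic types are float64, int64, and object.
 39 | 	
 40 | 	Note that object is a type our models cannot process directly, so today introduces encoding for this kind of data; only after encoding is it fed into a model.
 41 | 	
 42 | 	HW: Study the reference https://medium.com/@contactsunny/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621
 43 | 	and complete the Day 4 code.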
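
The two encodings discussed in the reference can be compared on a toy table. This is a sketch under assumptions: the column names and values below are made up for illustration and are not taken from application_train.csv.

```python
import pandas as pd

# Hypothetical toy data standing in for the categorical columns
df = pd.DataFrame({
    "CONTRACT": ["Cash", "Revolving", "Cash"],
    "GENDER": ["F", "M", "F"],
})

# Label encoding: replace each category with an integer code.
# Converting to the "category" dtype sorts the categories, then
# .cat.codes gives each value the position of its category.
label_encoded = df.copy()
for col in df.columns:
    label_encoded[col] = df[col].astype("category").cat.codes

# One-hot encoding: one indicator column per (column, category) pair.
one_hot = pd.get_dummies(df)

print(label_encoded)
print(one_hot.columns.tolist())
```

Label encoding keeps one column per feature but implies an order between categories; one-hot encoding avoids that implied order at the cost of more columns.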
 44 | 	
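 45 | Day 5: EDA: data distribution
 46 | 
 47 | 	EDA has no fixed procedure, of course. But beginners already struggle to get started on the classic Titanic problem with its dozen or so features, let alone today's problem with more than a hundred. A step-by-step procedure to follow is therefore very valuable at the beginner stage.
 48 | 	
 49 | 	Today is effectively our third day of EDA. On Day 2 we took a first look at the data frame (how much data, whether it is wide or tall); on Day 4 we looked at the data types of the features; today we look at their statistical properties.
 50 | 	Measures of central tendency
 51 | 		•Mean
 52 | 		•Median
 53 | 		•Mode
 54 | 	Measures of dispersion
 55 | 		•Min
 56 | 		•Max
 57 | 		•Range
 58 | 		•Quartiles
 59 | 		•Variance
 60 | 		•Standard deviation
 61 | 		
 62 | 	HW: Pick three features you find interesting and look at their statistics. You can choose them based on what you learned from HomeCredit_columns_description.csv on Day 2; if you pick many, you can also narrow them down with sns.heatmap on the corr() matrix.
 63 | 	Today, learn DataFrame.describe() for basic statistics, and learn to draw a histogram with plt.hist.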
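
The statistics listed above can be computed by hand on a small array, which helps when reading the output of DataFrame.describe(). This is only a sketch; the sample values are hypothetical.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 4, 7, 9], dtype=float)  # hypothetical feature values

# Central tendency
mean = values.mean()
median = np.median(values)
uniques, counts = np.unique(values, return_counts=True)
mode = uniques[counts.argmax()]           # most frequent value

# Dispersion
value_min, value_max = values.min(), values.max()
value_range = value_max - value_min
q1, q2, q3 = np.percentile(values, [25, 50, 75])   # quartiles
variance = values.var(ddof=1)             # sample variance (divides by n - 1)
std = values.std(ddof=1)                  # sample standard deviation

print(mean, median, mode, value_range, (q1, q2, q3), variance, std)
```

`ddof=1` is used so the standard deviation matches the sample standard deviation that pandas reports; NumPy's default (`ddof=0`) divides by n instead.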
 64 | 	
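 65 | Day 6: EDA and outlier checks
 66 | 
 67 | 	When facing outliers, we should ask: why are there outliers? How do we decide that a value is an outlier? And how should outliers be handled?
 68 | 	Solving a problem leaves no room for dodging or wishful thinking. Data scientists and engineers spend countless hours on a project; expecting outstanding results from just three or four hours of work is unrealistic. The three features we looked at on Day 5 were only an appetizer, a way to get a feel for the data.
 69 | 	A real attempt has to be comprehensive, so today we go through every numeric feature, draw a boxplot for each, and look at all of them: are there odd values standing far away from the rest?
 70 | 	We need to find them, consider whether they are plausible, and decide whether and how to correct them.
 71 | 	HW: Complete today's code.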
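
One common way to turn "points far outside the box" into a rule is the 1.5 × IQR criterion, which is the convention boxplots typically use for their whiskers. The sketch below assumes that rule; the function name `iqr_outlier_mask` and the sample ages are hypothetical, and whether a flagged value is truly wrong still needs human judgment.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


ages = np.array([21, 25, 30, 35, 40, 45, 300])  # hypothetical ages with one suspicious entry
mask = iqr_outlier_mask(ages)
outliers = ages[mask]
print(outliers)
```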
 72 | 	
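 73 | Day 7: Common replacement values: median and quantiles; standardizing continuous values
 74 | 	
 75 | 	Day 7 covers how to fill in values when we meet missing values or outliers. Look up the definitions and calculation of the median, quantiles, mode, and mean. Today's homework also includes writing a normalize function (scaling to -1 ~ 1). This puts all features on an equal footing, so that features measured in large units and features measured in small units do not have unequal influence.
 76 | 	
 77 | 	HW: Complete today's code, and understand the median, quantiles, mode, mean, and standardization.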
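
Filling missing values with the median and then scaling to the range -1 ~ 1 might look like the sketch below. The function names `fill_with_median` and `normalize_to_unit_range` are hypothetical, and min-max scaling is assumed as the normalization; the homework may define it differently.

```python
import numpy as np


def fill_with_median(values):
    """Replace NaN entries with the median of the non-missing entries."""
    values = np.asarray(values, dtype=float)
    median = np.nanmedian(values)
    return np.where(np.isnan(values), median, values)


def normalize_to_unit_range(values):
    """Min-max scale values linearly so the minimum maps to -1 and the maximum to 1.

    Assumes the values are not all identical (otherwise the range is zero).
    """
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    return 2 * (values - lo) / (hi - lo) - 1


raw = np.array([1.0, np.nan, 3.0, 5.0])  # hypothetical column with one missing value
filled = fill_with_median(raw)
scaled = normalize_to_unit_range(filled)
print(filled, scaled)
```

The median is less sensitive to extreme values than the mean, which is why it is a common choice when the column still contains outliers.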
 78 | 	
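 79 | Day 8: Common DataFrame operations
 80 | 
 81 | 	Day 8 is not really EDA; it is a day for getting to know pandas and practicing DataFrame operations. Finish the homework and look up some DataFrame tutorials online.
 82 | 	Unless you cut corners and do only deep learning, being unfamiliar with DataFrame will make life very hard.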
 83 | 	
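 84 | Day 9: No homework today; this is an introduction to the correlation coefficient.
 85 | 
 86 | 	The correlation coefficient is a value between -1 and 1. Negative values indicate negative correlation and positive values indicate positive correlation.
 87 | 	Please look it up online yourself. We will practice on Day 10.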
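
To make the -1 to 1 range concrete, the Pearson correlation coefficient can be computed from its definition and checked against NumPy. This is a small sketch; the helper name `pearson_r` and the sample data are made up for illustration.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()          # center both variables
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


x = np.array([1, 2, 3, 4, 5])

r_positive = pearson_r(x, 2 * x + 1)   # y rises with x
r_negative = pearson_r(x, -x)          # y falls as x rises
r_numpy = np.corrcoef(x, 2 * x + 1)[0, 1]

print(r_positive, r_negative, r_numpy)
```

A perfectly increasing linear relation gives a coefficient of 1 and a perfectly decreasing one gives -1; values near 0 mean there is little linear relationship.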
 88 | 	
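 89 | Day 10: Correlation coefficient in practice
 90 | 
 91 | 	Continuing from Day 9, we start applying the correlation coefficient.
 92 | 	These two days seem light on work and content, so it is a good moment to catch our breath and review the EDA so far, to avoid missing the forest for the trees:
 93 | 	• Day 2: On our first day of EDA we got to know what we have, starting from the number of records and columns. And of course we also read the column description file.
 94 | 	• Day 4: We went into the columns to check every column's data format: float, integer, or object. Categorical data we encoded right away.
 95 | 	• Day 5: We stayed with the columns and looked at their basic statistics.
 96 | 		In other words,
 97 | 		Day 4 uses app_train.dtypes.value_counts() and app_train.select_dtypes(include=["object"]).apply(pd.Series.nunique, axis = 0) to count the columns of each data type, find all categorical columns, and check how many categories each one has.
 98 | 		Day 5 uses .describe() for each column's statistics and plt.hist() to draw histograms, to get to know the column data.
 99 | 	• Day 6: After the statistics of Day 5, on Day 6 we make a first judgment about whether there are outliers, this time using boxplots.
100 | 	• Day 7: This follows on directly from Day 6 and handles the outliers. Once each column's outliers are dealt with, we also don't want a few columns with different units and especially large values to unbalance later model computation, so we standardize as well.
101 | 	• Day 9, 10: With the basic data inspected and corrected, we compute the correlation of every feature (column) with the target.
102 | 	
103 | 	HW: Since this is a hands-on day, the main job is to work through the homework: one-hot encode every column of type object into numbers, then compute the correlation with Target using corr(). We don't pass it to sns.heatmap because there are far too many features and it would take very long to draw. Then pick out the largest few to examine.
104 | 	Today's analysis also compares the distribution of the same feature for target = 1 and target = 0 to see whether they differ, and reflects on what shows up. I also suggest playing with the table at the bottom of https://www.kaggle.com/c/home-credit-default-risk/data under application_test.csv (5.58 MB); it is a convenient way to pick up some information!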
105 | 	
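106 | Day 11: Plotting and styles & Kernel Density Estimation (KDE)
107 | 
108 | 	In the last task of the Day 10 homework we did more than look at one feature's distribution and its correlation with the target: we also split the feature and looked at the subsets belonging to different target outcomes. Day 11 goes further in this direction.
109 | 	Day 11 has two parts:
110 | 	• An introduction to KDE. I found this online: "kernel density estimation is a way to estimate the probability density function (PDF) of a random variable in a non-parametric way". So KDE is a method for estimating a PDF. Plotting with sns.kdeplot() gives a continuous curve; plotting with sns.distplot() gives that curve together with the relative frequency of individual values. The latter looks a lot like a histogram; think about how it relates to one!
111 | 	• The second part is what was mentioned above: splitting a feature into several subsets or groups and looking at the KDE of each.
112 | 	
113 | 	These fine-grained observations give us a clearer understanding and a better grasp of the subject and its features.
114 | 	
115 | 	HW: Much of the description above explains the Day 11 homework. Some of it I am still working through myself. Keep going.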
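
To see what a KDE curve represents, a Gaussian kernel density estimate can be written in a few lines of NumPy. This is a sketch under assumptions: the function name `gaussian_kde_1d`, the sample values, and the fixed bandwidth are all chosen for illustration, and plotting libraries choose the bandwidth automatically rather than taking a fixed one.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Estimate a density on `grid` by averaging one Gaussian bump per sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = standardized distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth


samples = [1.0, 2.0, 3.0]                 # hypothetical observations
grid = np.linspace(-5.0, 10.0, 1501)
density = gaussian_kde_1d(samples, grid, bandwidth=0.5)

# A density should integrate to about 1; approximate the integral with a Riemann sum.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 4))
```

A histogram counts samples in fixed bins, while KDE replaces each sample with a smooth bump and averages them, which is why the two shapes resemble each other.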
116 | 	
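117 | Day 12: Discretizing continuous variables
118 | 
119 | 	Day 12 has no homework and the content is simple. To borrow a favorite phrase from PTT: "as the title says", we discretize continuous variables.
120 | 	The last part of the Day 11 homework used a barplot and showed that the younger the applicant, the more likely a failure to repay. We could of course use the continuous values directly, but discretizing simplifies the problem; remember how many features we have and how complex the problem looks! It can also take care of outliers, for example by putting an age of 300 into the oldest group (is that right, though?).
121 | 	There are usually three ways to bin: equal width, equal frequency, and by category (等類).
122 | 	The Day 13 homework uses equal width and equal frequency. I have not yet studied binning by category properly.
123 | 	Here are some hints for Day 13:
124 | 	pd.cut(Data_feature, bins = integer n) is equal width: it splits the range from the minimum to the maximum into n intervals of equal width.
125 | 	pd.qcut(Data_feature, q = integer n) is equal frequency: it splits the range from the minimum to the maximum into n intervals, and barring surprises each interval holds the same number of samples.
126 | 	Think about why I did not say "always the same number" but added "barring surprises".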
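
The difference between equal-width and equal-frequency binning shows up clearly on a small series with one extreme value. This is a sketch with hypothetical data; note that `pd.qcut` takes the number of quantiles through its `q` parameter.

```python
import numpy as np
import pandas as pd

ages = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # hypothetical values, one extreme

# labels=False returns the integer index of each value's bin instead of an interval
equal_width_bins = pd.cut(ages, bins=4, labels=False)   # 4 intervals of equal width
equal_freq_bins = pd.qcut(ages, q=4, labels=False)      # 4 quantile-based intervals

# Count how many values fall into each of the 4 bins
equal_width_counts = np.bincount(equal_width_bins, minlength=4)
equal_freq_counts = np.bincount(equal_freq_bins, minlength=4)

print("equal width:    ", equal_width_counts.tolist())
print("equal frequency:", equal_freq_counts.tolist())
```

With equal width, the single extreme value stretches the range so that almost everything lands in the first bin. With equal frequency, the counts are as even as the data allows; they cannot be exactly equal here because 10 values do not divide evenly into 4 bins, and heavily repeated values can unbalance the bins further.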
127 | 	
128 | 
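129 | Day 13: Discretizing continuous variables
130 | 	
131 | 	After Day 13 the EDA part is nearly at its end (but not quite). Day 13 is about completing the homework based on the Day 12 material.
132 | 	
133 | Day 14: Discretizing continuous variables
134 | 		
135 | 	Day 14 is mainly about another visualization technique: laying out a series of related charts in a structured arrangement so they can be compared, which helps us observe better.
136 | 	HW: Students who are not yet comfortable with Python will find this hard, but it is also more practice and a chance to catch up. Keep going.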
137 | 	
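138 | Day 15: Heatmap & Grid-plot
139 | 
140 | 	We did say on Day 13 that the end was near, and it is, but the Day 15 material really matters.
141 | 	sns.heatmap(corr()) we have done before, but it is still important.
142 | 	sns.PairGrid() is even more interesting: it builds a grid of pairwise plots and lets us decide what each one shows.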
143 | 
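144 | Day 16: First model: Logistic Regression
145 | 
146 | 	All right, time to go on Kaggle. For Day 16, please do the following:
147 | 	◆ Look up what sklearn's Imputer does.
148 | 	◆ Watch the lecture videos of Prof. Hung-yi Lee (李宏毅) or Prof. Andrew Ng (吳恩達) to learn what Logistic Regression is.
149 | 	◆ Look up how to use sklearn's LogisticRegression.
150 | 	◆ Sort out among yourselves whether Logistic Regression is a regression method or a classification method.
151 | 	◆ Get out submission.csv from Day 2 and, after finishing the homework, compare the formats.
152 | 	
153 | 	HW: Finish the homework and get ready for Kaggle!
154 | 	Does the homework output have the same format as submission.csv? Would uploading it as is work? If not, what do you do?
155 | 	As for how to upload to Kaggle, please Google it!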
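
For the regression-or-classification question, a tiny hand-written logistic regression may help. This is a sketch, not the sklearn implementation: the toy data, learning rate, and iteration count are assumptions, and plain gradient descent stands in for the solvers a library would use.

```python
import numpy as np


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical one-feature data: negative x belongs to class 0, positive x to class 1
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

w, b = 0.0, 0.0
learning_rate = 0.5
for _ in range(200):
    p = sigmoid(w * x + b)                        # predicted probability of class 1
    w -= learning_rate * np.mean((p - y) * x)     # gradient of the mean log loss w.r.t. w
    b -= learning_rate * np.mean(p - y)           # gradient of the mean log loss w.r.t. b

probabilities = sigmoid(w * x + b)                # continuous outputs in (0, 1)
predicted_class = (probabilities >= 0.5).astype(int)   # thresholding gives class labels
print(probabilities.round(3), predicted_class)
```

The model fits a continuous quantity (a probability), which is where "regression" in the name comes from, but it is used as a classifier once the probability is thresholded.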
156 | 	
157 | 	
158 | 	
159 | 	
160 | 	
161 | 	
162 | 


--------------------------------------------------------------------------------
/Day_69-keras_model_api_HW.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Homework:\n",
  8 |     "Change the custom Layer names set via name\n",
  9 |     "Add one fully connected layer\n",
 10 |     "Declare the Model API using your own Input/Output layers\n",
 11 |     "Use model.summary to view the layer stack"
 12 |    ]
 13 |   },
 14 |   {
 15 |    "cell_type": "code",
 16 |    "execution_count": 1,
 17 |    "metadata": {},
 18 |    "outputs": [
 19 |     {
 20 |      "name": "stderr",
 21 |      "output_type": "stream",
 22 |      "text": [
 23 |       "Using TensorFlow backend.\n"
 24 |      ]
 25 |     }
 26 |    ],
 27 |    "source": [
 28 |     "from keras.layers import Input, Embedding, LSTM, Dense\n",
 29 |     "from keras.models import Model\n",
 30 |     "\n",
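 31 |     "# The main input receives the news headline itself, i.e. a sequence of integers (each integer encodes one word).\n",
 32 |     "# The integers are between 1 and 10,000 (a vocabulary of 10,000 words), and the sequence length is 100 words.\n",
 33 |     "# Give the Input a name\n",
 34 |     "main_input = Input(shape=(100,), dtype='int32', name='main_input')\n",
 35 |     "\n",
 36 |     "\n",
 37 |     "# The Embedding layer encodes the input sequence into a sequence of dense vectors,\n",
 38 |     "# each of dimension 512.\n",
 39 |     "x = Embedding(output_dim=512, input_dim=10000, input_length=100)(main_input)\n",
 40 |     "\n",
 41 |     "# The LSTM layer turns the vector sequence into a single vector\n",
 42 |     "# that carries the context of the whole sequence\n",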
 43 |     "lstm_out = LSTM(32)(x)"
 44 |    ]
 45 |   },
 46 |   {
 47 |    "cell_type": "code",
 48 |    "execution_count": 2,
 49 |    "metadata": {},
 50 |    "outputs": [],
 51 |    "source": [
 52 |     "# Insert an auxiliary loss so that the LSTM and Embedding layers can be trained smoothly even when the model's main loss is high\n",
 53 |     "news_output = Dense(1, activation='sigmoid', name='news_out')(lstm_out)"
 54 |    ]
 55 |   },
 56 |   {
 57 |    "cell_type": "code",
 58 |    "execution_count": 3,
 59 |    "metadata": {},
 60 |    "outputs": [],
 61 |    "source": [
 62 |     "# Concatenate the auxiliary input with the LSTM output and feed the result into the model\n",
 63 |     "import keras\n",
 64 |     "news_input = Input(shape=(5,), name='news_in')\n",
 65 |     "x = keras.layers.concatenate([lstm_out, news_input])\n",
 66 |     "\n",
 67 |     "\n",
 68 |     "# Stack several fully connected layers\n",
 69 |     "x = Dense(64, activation='relu')(x)\n",
 70 |     "x = Dense(64, activation='relu')(x)\n",
 71 |     "# Homework answer: four more layers added\n",
 72 |     "x = Dense(64, activation='relu')(x)\n",
 73 |     "x = Dense(64, activation='relu')(x)\n",
 74 |     "x = Dense(64, activation='relu')(x)\n",
 75 |     "x = Dense(64, activation='relu')(x)\n",
 76 |     "\n",
 77 |     "\n",
 78 |     "# Finally add the main logistic regression layer\n",
 79 |     "main_output = Dense(1, activation='sigmoid', name='main_output')(x)"
 80 |    ]
 81 |   },
 82 |   {
 83 |    "cell_type": "code",
 84 |    "execution_count": 4,
 85 |    "metadata": {},
 86 |    "outputs": [
 87 |     {
 88 |      "name": "stdout",
 89 |      "output_type": "stream",
 90 |      "text": [
 91 |       "__________________________________________________________________________________________________\n",
 92 |       "Layer (type)                    Output Shape         Param #     Connected to                     \n",
 93 |       "==================================================================================================\n",
 94 |       "main_input (InputLayer)         (None, 100)          0                                            \n",
 95 |       "__________________________________________________________________________________________________\n",
 96 |       "embedding_1 (Embedding)         (None, 100, 512)     5120000     main_input[0][0]                 \n",
 97 |       "__________________________________________________________________________________________________\n",
 98 |       "lstm_1 (LSTM)                   (None, 32)           69760       embedding_1[0][0]                \n",
 99 |       "__________________________________________________________________________________________________\n",
100 |       "news_in (InputLayer)            (None, 5)            0                                            \n",
101 |       "__________________________________________________________________________________________________\n",
102 |       "concatenate_1 (Concatenate)     (None, 37)           0           lstm_1[0][0]                     \n",
103 |       "                                                                 news_in[0][0]                    \n",
104 |       "__________________________________________________________________________________________________\n",
105 |       "dense_1 (Dense)                 (None, 64)           2432        concatenate_1[0][0]              \n",
106 |       "__________________________________________________________________________________________________\n",
107 |       "dense_2 (Dense)                 (None, 64)           4160        dense_1[0][0]                    \n",
108 |       "__________________________________________________________________________________________________\n",
109 |       "dense_3 (Dense)                 (None, 64)           4160        dense_2[0][0]                    \n",
110 |       "__________________________________________________________________________________________________\n",
111 |       "dense_4 (Dense)                 (None, 64)           4160        dense_3[0][0]                    \n",
112 |       "__________________________________________________________________________________________________\n",
113 |       "dense_5 (Dense)                 (None, 64)           4160        dense_4[0][0]                    \n",
114 |       "__________________________________________________________________________________________________\n",
115 |       "dense_6 (Dense)                 (None, 64)           4160        dense_5[0][0]                    \n",
116 |       "__________________________________________________________________________________________________\n",
117 |       "main_output (Dense)             (None, 1)            65          dense_6[0][0]                    \n",
118 |       "__________________________________________________________________________________________________\n",
119 |       "news_out (Dense)                (None, 1)            33          lstm_1[0][0]                     \n",
120 |       "==================================================================================================\n",
121 |       "Total params: 5,213,090\n",
122 |       "Trainable params: 5,213,090\n",
123 |       "Non-trainable params: 0\n",
124 |       "__________________________________________________________________________________________________\n"
125 |      ]
126 |     }
127 |    ],
128 |    "source": [
129 |     "model = Model(inputs=[main_input, news_input], outputs= [main_output, news_output])\n",
130 |     "model.summary()"
131 |    ]
132 |   }
133 |  ],
134 |  "metadata": {
135 |   "kernelspec": {
136 |    "display_name": "Python 3",
137 |    "language": "python",
138 |    "name": "python3"
139 |   },
140 |   "language_info": {
141 |    "codemirror_mode": {
142 |     "name": "ipython",
143 |     "version": 3
144 |    },
145 |    "file_extension": ".py",
146 |    "mimetype": "text/x-python",
147 |    "name": "python",
148 |    "nbconvert_exporter": "python",
149 |    "pygments_lexer": "ipython3",
150 |    "version": "3.6.7"
151 |   }
152 |  },
153 |  "nbformat": 4,
154 |  "nbformat_minor": 2
155 | }
156 | 


--------------------------------------------------------------------------------
/Day_043_HW.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "## Homework\n",
  8 |     "\n",
  9 |     "Read the two articles below to understand how random forests work, then try to answer the questions that follow.\n",
 10 |     "- [Random forest - in Chinese](http://hhtucode.blogspot.tw/2013/06/ml-random-forest.html)\n",
 11 |     "- [How random forest works - in English](https://medium.com/@Synced/how-random-forest-algorithm-works-in-machine-learning-3c0fe15b6674)"
 12 |    ]
 13 |   },
 14 |   {
 15 |    "cell_type": "markdown",
 16 |    "metadata": {},
 17 |    "source": [
 18 |     "\n",
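 19 |     "1. Each tree in a random forest is meant to\n",
 20 |     "\n",
 21 |     "    - grow without any restriction (so the tree becomes very deep and the model complex)\n",
 22 |     "    \n",
 23 |     "    - not grow excessively, to avoid overfitting\n",
 24 |     "    \n",
 25 |     "    \n",
 26 |     "2. Suppose there are N records in total and each tree is built from N records drawn with replacement. Roughly what percentage of distinct original records does such a tree use?\n",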
 27 |     "hint: 0.632 bootstrap\n"
 28 |    ]
 29 |   },
 30 |   {
 31 |    "cell_type": "markdown",
 32 |    "metadata": {},
 33 |    "source": [
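 34 |     "Suppose there are n samples and we draw from them n times at random, with replacement.\n",
 35 |     "\n",
 36 |     "On the first draw, the probability that sample A is selected is $\\frac{1}{n}$,\n",
 37 |     "\n",
 38 |     "so the probability that it is not selected is $(1-\\frac{1}{n})$. The draws are independent, so after n draws the probability that A was never selected is\n",
 39 |     "\n",
 40 |     "$$\\left(1-\\frac{1}{n}\\right)^n.$$\n",
 41 |     "\n",
 42 |     "Does this expression look familiar? It is the well-known limit from calculus:\n",
 43 |     "\n",
 44 |     "$$\\lim_{n\\to\\infty}\\left(1-\\frac{1}{n}\\right)^n = \\frac{1}{e}.$$\n",
 45 |     "\n",
 46 |     "So when the bootstrap sample size is large, the probability that any given sample is selected at least once is\n",
 47 |     "\n",
 48 |     "$1-\\frac{1}{e} \\approx 1-\\frac{1}{2.71828} \\approx 0.632$."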
 49 |    ]
 50 |   },
 51 |   {
 52 |    "cell_type": "markdown",
 53 |    "metadata": {},
 54 |    "source": [
 55 |     "\n",
 56 |     "I will get to the 0.632 estimator, but it'll be a somewhat long development:\n",
 57 |     "\n",
 58 |     "Suppose we want to predict Y with X using the function f, where f may depend on some parameters that are estimated using the data (Y,X), e.g. f(X)=Xβ\n",
 59 |     "\n",
 60 |     "A naïve estimate of prediction error is\n",
 61 |     "\n",
 62 |     "\n",
 63 |     "$$\\overline{\\mathrm{err}} = \\frac{1}{N}\\sum_{i=1}^{N} L(y_i, f(x_i))$$\n",
 64 |     "\n",
 65 |     "where L is some loss function, e.g. squared error loss, which gives $ MSE = \\frac{1}{n}\\sum_{i=1}^{n}{(Y_i - \\hat{Y}_i)^2} $.\n",
 66 |     "\n",
 67 |     "This is often called training error. Efron et al. call it the apparent error rate or resubstitution rate. It's not very good since we use our data $(x_i, y_i)$ to fit f. This results in $\\overline{\\mathrm{err}}$ being downward biased. You want to know how well your model f does in predicting new values.\n",
 68 |     "\n",
 69 |     "Often we use cross-validation as a simple way to estimate the expected extra-sample prediction error (how well does our model do on data not in our training set?).\n",
 70 |     "$$\\mathrm{Err} = E[L(Y, f(X))]$$\n",
 71 |     "A popular way to do this is to do K-fold cross-validation. Split your data into K groups (e.g. 10). For each group k, fit your model on the remaining K−1 groups and test it on the kth group. Our cross-validated extra-sample prediction error is just the average\n",
 72 |     "$$\\mathrm{Err}_{CV} = \\frac{1}{N}\\sum_{i=1}^{N} L(y_i, f_{-\\kappa(i)}(x_i))$$\n",
 73 |     "where $\\kappa$ is some index function that indicates the partition to which observation i is allocated and $f_{-\\kappa(i)}(x_i)$ is the predicted value of $x_i$ using data not in the $\\kappa(i)$th set.\n",
 74 |     "\n",
 75 |     "\n",
 76 |     "\n",
 77 |     "\n",
 78 |     "This estimator is approximately unbiased for the true prediction error when K=N and has larger variance and is more computationally expensive for larger K. So once again we see the bias–variance trade-off at play.\n",
 79 |     "\n",
 80 |     "Instead of cross-validation we could use the bootstrap to estimate the extra-sample prediction error. Bootstrap resampling can be used to estimate the sampling distribution of any statistic. If our training data is $X=(x_1,\\ldots,x_N)$, then we can think of taking B bootstrap samples (with replacement) from this set, $Z_1,\\ldots,Z_B$, where each $Z_i$ is a set of N samples. Now we can use our bootstrap samples to estimate extra-sample prediction error:\n",
 81 |     "$$\\mathrm{Err}_{boot} = \\frac{1}{B}\\sum_{b=1}^{B}\\frac{1}{N}\\sum_{i=1}^{N} L(y_i, f_b(x_i))$$\n",
 82 |     "\n",
 83 |     "\n",
 84 |     "where $f_b(x_i)$ is the predicted value at $x_i$ from the model fit to the bth bootstrap dataset. Unfortunately, this is not a particularly good estimator because bootstrap samples used to produce $f_b(x_i)$ may have contained $x_i$. The leave-one-out bootstrap estimator offers an improvement by mimicking cross-validation and is defined as:\n",
 85 |     "$$\\mathrm{Err}_{boot}^{(1)} = \\frac{1}{N}\\sum_{i=1}^{N}\\frac{1}{|C^{-i}|}\\sum_{b\\in C^{-i}} L(y_i, f_b(x_i))$$\n",
 86 |     "where $C^{-i}$ is the set of indices for the bootstrap samples that do not contain observation i, and $|C^{-i}|$ is the number of such samples. $\\mathrm{Err}_{boot}^{(1)}$ solves the overfitting problem, but is still biased (this one is upward biased). The bias is due to non-distinct observations in the bootstrap samples that result from sampling with replacement. The average number of distinct observations in each sample is about 0.632N (for an explanation, see the answer to *Why on average does each bootstrap sample contain roughly two thirds of observations?*). To solve the bias problem, Efron and Tibshirani proposed the 0.632 estimator:\n",
 87 |     "$$\\mathrm{Err}_{.632} = 0.368\\,\\overline{\\mathrm{err}} + 0.632\\,\\mathrm{Err}_{boot}^{(1)}$$\n",
 88 |     "where\n",
 89 |     "$$\\overline{\\mathrm{err}} = \\frac{1}{N}\\sum_{i=1}^{N} L(y_i, f(x_i))$$\n",
 90 |     "is the naïve estimate of prediction error often called training error. The idea is to average a downward biased estimate and an upward biased estimate.\n",
 91 |     "\n",
 92 |     "\n",
 93 |     "However, if we have a highly overfit prediction function (i.e. $\\overline{\\mathrm{err}}=0$) then even the .632 estimator will be downward biased. The .632+ estimator is designed to be a less-biased compromise between $\\overline{\\mathrm{err}}$ and $\\mathrm{Err}_{boot}^{(1)}$.\n",
 94 |     "$$\\mathrm{Err}_{.632+} = (1-w)\\,\\overline{\\mathrm{err}} + w\\,\\mathrm{Err}_{boot}^{(1)}$$\n",
 95 |     "with\n",
 96 |     "$$w = \\frac{0.632}{1-0.368R} \\quad \\text{and} \\quad R = \\frac{\\mathrm{Err}_{boot}^{(1)} - \\overline{\\mathrm{err}}}{\\gamma - \\overline{\\mathrm{err}}}$$\n",
 97 |     "where $\\gamma$ is the no-information error rate, estimated by evaluating the prediction model on all possible combinations of targets $y_i$ and predictors $x_j$:\n",
 98 |     "\n",
 99 |     "$$\\gamma = \\frac{1}{N^2}\\sum_{i=1}^{N}\\sum_{j=1}^{N} L(y_i, f(x_j))$$\n",
100 |     "\n",
101 |     "\n",
102 |     "Here R measures the relative overfitting rate. If there is no overfitting ($R=0$, when $\\mathrm{Err}_{boot}^{(1)} = \\overline{\\mathrm{err}}$) this is equal to the .632 estimator."
103 |    ]
104 |   },
105 |   {
106 |    "cell_type": "markdown",
107 |    "metadata": {},
108 |    "source": [
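109 |     "In my own experiments, the setting I currently tend to use is:\n",
110 |     "\n",
111 |     "sample at least (k / 2) + 1 features for each tree, which is what it takes to get reasonably good results, where k is the original number of features.\n",
112 |     "\n",
113 |     "Another common setting is sqrt(k).\n",
114 |     "\n",
115 |     "\n",
116 |     "Because of sampling with replacement, on average about 1/3 of the training data is never sampled for a given tree.\n",
117 |     "\n",
118 |     "So we collect these data and, once the forest has been built, feed them in for prediction to obtain an error rate.\n",
119 |     "\n",
120 |     "This approach is called Out-Of-Bag (OOB).\n",
121 |     "\n",
122 |     "Random Forest is in fact a very heuristic algorithm.\n",
123 |     "\n",
124 |     "It still has many parameters that need to be decided, such as how many trees to use.\n",
125 |     "\n",
126 |     "Well, the brute-force way is to build the forest up from 1 tree, 2 trees, ... all the way to n trees,\n",
127 |     "\n",
128 |     "and compute the OOB error each time; once the OOB error stops decreasing, that is about enough."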
129 |    ]
130 |   }
131 |  ],
132 |  "metadata": {
133 |   "kernelspec": {
134 |    "display_name": "Python 3",
135 |    "language": "python",
136 |    "name": "python3"
137 |   },
138 |   "language_info": {
139 |    "codemirror_mode": {
140 |     "name": "ipython",
141 |     "version": 3
142 |    },
143 |    "file_extension": ".py",
144 |    "mimetype": "text/x-python",
145 |    "name": "python",
146 |    "nbconvert_exporter": "python",
147 |    "pygments_lexer": "ipython3",
148 |    "version": "3.6.7"
149 |   }
150 |  },
151 |  "nbformat": 4,
152 |  "nbformat_minor": 2
153 | }
154 | 


--------------------------------------------------------------------------------
/Day_057_HW.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Homework\n",
  8 |     "### Try running hierarchical clustering on the iris data (datasets.load_iris())"
  9 |    ]
 10 |   },
 11 |   {
 12 |    "cell_type": "code",
 13 |    "execution_count": 1,
 14 |    "metadata": {},
 15 |    "outputs": [
 16 |     {
 17 |      "data": {
 18 |       "text/html": [
 19 |        "\n",
 20 |        "        <iframe\n",
 21 |        "            width=\"720\"\n",
 22 |        "            height=\"480\"\n",
 23 |        "            src=\"https://www.youtube.com/embed/https://www.youtube.com/watch?v=fX_guE7JNnY&index=21&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49\"\n",
 24 |        "            frameborder=\"0\"\n",
 25 |        "            allowfullscreen\n",
 26 |        "        ></iframe>\n",
 27 |        "        "
 28 |       ],
 29 |       "text/plain": [
 30 |        "<IPython.lib.display.YouTubeVideo at 0x19e9dbd2198>"
 31 |       ]
 32 |      },
 33 |      "execution_count": 1,
 34 |      "metadata": {},
 35 |      "output_type": "execute_result"
 36 |     }
 37 |    ],
 38 |    "source": [
 39 |     "from IPython.display import YouTubeVideo\n",
 40 |     "# HAC is covered at 6:10\n",
 41 |     "YouTubeVideo(\"fX_guE7JNnY\", width=720, height=480)  # expects the video id, not the full URL"
 42 |    ]
 43 |   },
 44 |   {
 45 |    "cell_type": "code",
 46 |    "execution_count": 2,
 47 |    "metadata": {},
 48 |    "outputs": [],
 49 |    "source": [
 50 |     "\n",
 51 |     "from sklearn import datasets\n",
 52 |     "from sklearn.cluster import AgglomerativeClustering\n",
 53 |     "\n",
 54 |     "iris = datasets.load_iris()\n",
 55 |     "X = iris.data\n",
 56 |     "y = iris.target\n",
 57 |     "\n",
 58 |     "\n"
 59 |    ]
 60 |   },
 61 |   {
 62 |    "cell_type": "markdown",
 63 |    "metadata": {},
 64 |    "source": [
 65 |     "Load the relevant packages and run the hierarchical clustering experiments ..."
 66 |    ]
 67 |   },
 68 |   {
 69 |    "cell_type": "code",
 70 |    "execution_count": 3,
 71 |    "metadata": {},
 72 |    "outputs": [],
 73 |    "source": [
 74 |     "estimators = [('hc_iris_ward', AgglomerativeClustering(n_clusters=3, linkage=\"ward\")),\n",
 75 |     "              ('hc_iris_complete', AgglomerativeClustering(n_clusters=3, linkage=\"complete\")),\n",
 76 |     "              ('hc_iris_average', AgglomerativeClustering(n_clusters=3, linkage=\"average\"))]"
 77 |    ]
 78 |   },
 79 |   {
 80 |    "cell_type": "code",
 81 |    "execution_count": 4,
 82 |    "metadata": {},
 83 |    "outputs": [
 84 |     {
 85 |      "name": "stdout",
 86 |      "output_type": "stream",
 87 |      "text": [
 88 |       "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
 89 |       " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
 90 |       " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n",
 91 |       " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n",
 92 |       " 2 2]\n",
 93 |       "[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
 94 |       " 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
 95 |       " 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 2 2 2\n",
 96 |       " 2 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 0 2 2 2 0 2 2 2 0 2 2 2 0 2\n",
 97 |       " 2 0]\n",
 98 |       "134 150\n"
 99 |      ]
100 |     }
101 |    ],
102 |    "source": [
103 |     "hrc = AgglomerativeClustering(n_clusters=3, linkage='ward')\n",
104 |     "hrc_y=hrc.fit_predict(X)\n",
105 |     "print(y)\n",
106 |     "print(hrc_y)\n",
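107 |     "# Cluster ids are arbitrary: from the printout, class 0 -> cluster 1, class 1 -> cluster 0, class 2 -> cluster 2\n",
108 |     "r = 0  # number of samples whose cluster agrees with that mapping\n",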
109 |     "for i in range(len(y)):\n",
110 |     "    if y[i]==0 and hrc_y[i]==1:\n",
111 |     "        r+=1\n",
112 |     "    if y[i]==1 and hrc_y[i]==0:\n",
113 |     "        r+=1\n",
114 |     "    if y[i]==2 and hrc_y[i]==2:\n",
115 |     "        r+=1\n",
116 |     "\n",
117 |     "print(r, len(y))"
118 |    ]
119 |   },
120 |   {
121 |    "cell_type": "code",
122 |    "execution_count": 5,
123 |    "metadata": {},
124 |    "outputs": [
125 |     {
126 |      "name": "stdout",
127 |      "output_type": "stream",
128 |      "text": [
129 |       "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
130 |       " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
131 |       " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n",
132 |       " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n",
133 |       " 2 2]\n",
134 |       "[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
135 |       " 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 2 0 2 0 2 0 2 2 2 2 0 2 0 2 2 0 2 0 2 0 0\n",
136 |       " 0 0 0 0 0 2 2 2 2 0 2 0 0 0 2 2 2 0 2 2 2 2 2 0 2 2 0 0 0 0 0 0 2 0 0 0 0\n",
137 |       " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
138 |       " 0 0]\n",
139 |       "126 150\n"
140 |      ]
141 |     }
142 |    ],
143 |    "source": [
144 |     "from sklearn.cluster import AgglomerativeClustering\n",
145 |     "hrc = AgglomerativeClustering(n_clusters=3, linkage='complete')\n",
146 |     "hrc_y=hrc.fit_predict(X)\n",
147 |     "print(y)\n",
148 |     "print(hrc_y)\n",
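149 |     "# Cluster ids are arbitrary: for complete linkage, class 0 -> cluster 1, class 1 -> cluster 2, class 2 -> cluster 0\n",
150 |     "r = 0  # number of samples whose cluster agrees with that mapping\n",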
151 |     "for i in range(len(y)):\n",
152 |     "    if y[i]==0 and hrc_y[i]==1:\n",
153 |     "        r+=1\n",
154 |     "    if y[i]==1 and hrc_y[i]==2:\n",
155 |     "        r+=1\n",
156 |     "    if y[i]==2 and hrc_y[i]==0:\n",
157 |     "        r+=1\n",
158 |     "\n",
159 |     "print(r, len(y))"
160 |    ]
161 |   },
162 |   {
163 |    "cell_type": "code",
164 |    "execution_count": 6,
165 |    "metadata": {},
166 |    "outputs": [
167 |     {
168 |      "name": "stdout",
169 |      "output_type": "stream",
170 |      "text": [
171 |       "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
172 |       " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
173 |       " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n",
174 |       " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n",
175 |       " 2 2]\n",
176 |       "[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
177 |       " 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
178 |       " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 2 2 2\n",
179 |       " 2 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 2 2 2 2 0 2 2 2 0 2 2 2 0 2\n",
180 |       " 2 0]\n",
181 |       "136 150\n"
182 |      ]
183 |     }
184 |    ],
185 |    "source": [
186 |     "from sklearn.cluster import AgglomerativeClustering\n",
187 |     "hrc = AgglomerativeClustering(n_clusters=3, linkage='average')\n",
188 |     "hrc_y=hrc.fit_predict(X)\n",
189 |     "print(y)\n",
190 |     "print(hrc_y)\n",
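191 |     "# Cluster ids are arbitrary: for average linkage, class 0 -> cluster 1, class 1 -> cluster 0, class 2 -> cluster 2\n",
192 |     "r = 0  # number of samples whose cluster agrees with that mapping\n",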
193 |     "for i in range(len(y)):\n",
194 |     "    if y[i]==0 and hrc_y[i]==1:\n",
195 |     "        r+=1\n",
196 |     "    if y[i]==1 and hrc_y[i]==0:\n",
197 |     "        r+=1\n",
198 |     "    if y[i]==2 and hrc_y[i]==2:\n",
199 |     "        r+=1\n",
200 |     "\n",
201 |     "print(r, len(y))"
202 |    ]
203 |   }
204 |  ],
205 |  "metadata": {
206 |   "kernelspec": {
207 |    "display_name": "Python 3",
208 |    "language": "python",
209 |    "name": "python3"
210 |   },
211 |   "language_info": {
212 |    "codemirror_mode": {
213 |     "name": "ipython",
214 |     "version": 3
215 |    },
216 |    "file_extension": ".py",
217 |    "mimetype": "text/x-python",
218 |    "name": "python",
219 |    "nbconvert_exporter": "python",
220 |    "pygments_lexer": "ipython3",
221 |    "version": "3.6.7"
222 |   }
223 |  },
224 |  "nbformat": 4,
225 |  "nbformat_minor": 2
226 | }
227 | 


--------------------------------------------------------------------------------
/Day_034_HW.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
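  7 |     "## Practice time\n",
  8 |     "Suppose the classes in our data are imbalanced, which can bias accuracy when we evaluate. Try to split the data so that y_test contains equal numbers of class 0 and class 1 (i.e. y_test is class-balanced)."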
  9 |    ]
 10 |   },
 11 |   {
 12 |    "cell_type": "code",
 13 |    "execution_count": 1,
 14 |    "metadata": {},
 15 |    "outputs": [],
 16 |    "source": [
 17 |     "from sklearn.model_selection import train_test_split\n",
 18 |     "\n",
 19 |     "\n",
 20 |     "import numpy as np\n",
 21 |     "X = np.arange(1000).reshape(200, 5)\n",
 22 |     "y = np.zeros(200)\n",
 23 |     "y[:40] = 1"
 24 |    ]
 25 |   },
 26 |   {
 27 |    "cell_type": "code",
 28 |    "execution_count": 2,
 29 |    "metadata": {},
 30 |    "outputs": [
 31 |     {
 32 |      "data": {
 33 |       "text/plain": [
 34 |        "array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
 35 |        "       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
 36 |        "       1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
 37 |        "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
 38 |        "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
 39 |        "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
 40 |        "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
 41 |        "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
 42 |        "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
 43 |        "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
 44 |        "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
 45 |        "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])"
 46 |       ]
 47 |      },
 48 |      "execution_count": 2,
 49 |      "metadata": {},
 50 |      "output_type": "execute_result"
 51 |     }
 52 |    ],
 53 |    "source": [
 54 |     "y"
 55 |    ]
 56 |   },
 57 |   {
 58 |    "cell_type": "markdown",
 59 |    "metadata": {},
 60 |    "source": [
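 61 |     "We can see that y has 160 samples of class 0 and 40 of class 1. Use the train_test_split function to produce a y_test with 10 samples of class 0 and 10 of class 1. (HINT: see the function's test_size parameter; you can split each class separately and then merge the results.)"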
 62 |    ]
 63 |   },
 64 |   {
 65 |    "cell_type": "markdown",
 66 |    "metadata": {},
 67 |    "source": [
 68 |     "?train_test_split"
 69 |    ]
 70 |   },
 71 |   {
 72 |    "cell_type": "code",
 73 |    "execution_count": 3,
 74 |    "metadata": {},
 75 |    "outputs": [],
 76 |    "source": [
 77 |     "y1_index = range(40)\n",
 78 |     "y0_index = range(40,200)"
 79 |    ]
 80 |   },
 81 |   {
 82 |    "cell_type": "code",
 83 |    "execution_count": 4,
 84 |    "metadata": {},
 85 |    "outputs": [],
 86 |    "source": [
 87 |     "X1_train, X1_test, y1_train, y1_test = train_test_split(X[y1_index],y[y1_index], random_state=16, test_size=10)\n",
 88 |     "X0_train, X0_test, y0_train, y0_test = train_test_split(X[y0_index],y[y0_index],random_state=16, test_size=10)"
 89 |    ]
 90 |   },
 91 |   {
 92 |    "cell_type": "code",
 93 |    "execution_count": 5,
 94 |    "metadata": {},
 95 |    "outputs": [
 96 |     {
 97 |      "data": {
 98 |       "text/plain": [
 99 |        "array([[465, 466, 467, 468, 469],\n",
100 |        "       [235, 236, 237, 238, 239],\n",
101 |        "       [755, 756, 757, 758, 759],\n",
102 |        "       [885, 886, 887, 888, 889],\n",
103 |        "       [700, 701, 702, 703, 704],\n",
104 |        "       [860, 861, 862, 863, 864],\n",
105 |        "       [890, 891, 892, 893, 894],\n",
106 |        "       [500, 501, 502, 503, 504],\n",
107 |        "       [870, 871, 872, 873, 874],\n",
108 |        "       [880, 881, 882, 883, 884]])"
109 |       ]
110 |      },
111 |      "execution_count": 5,
112 |      "metadata": {},
113 |      "output_type": "execute_result"
114 |     }
115 |    ],
116 |    "source": [
117 |     "X1_test\n",
118 |     "X0_test"
119 |    ]
120 |   },
121 |   {
122 |    "cell_type": "code",
123 |    "execution_count": 6,
124 |    "metadata": {},
125 |    "outputs": [],
126 |    "source": [
127 |     "X_test=np.concatenate([X1_test, X0_test])\n",
128 |     "X_train = np.concatenate([X1_train, X0_train])\n",
129 |     "y_train=np.concatenate([y1_train, y0_train])\n",
130 |     "y_test=np.concatenate([y1_test, y0_test])"
131 |    ]
132 |   },
133 |   {
134 |    "cell_type": "code",
135 |    "execution_count": 7,
136 |    "metadata": {},
137 |    "outputs": [
138 |     {
139 |      "data": {
140 |       "text/plain": [
141 |        "array([[ 55,  56,  57,  58,  59],\n",
142 |        "       [ 75,  76,  77,  78,  79],\n",
143 |        "       [180, 181, 182, 183, 184],\n",
144 |        "       [110, 111, 112, 113, 114],\n",
145 |        "       [ 95,  96,  97,  98,  99],\n",
146 |        "       [ 15,  16,  17,  18,  19],\n",
147 |        "       [ 30,  31,  32,  33,  34],\n",
148 |        "       [ 35,  36,  37,  38,  39],\n",
149 |        "       [105, 106, 107, 108, 109],\n",
150 |        "       [ 85,  86,  87,  88,  89],\n",
151 |        "       [465, 466, 467, 468, 469],\n",
152 |        "       [235, 236, 237, 238, 239],\n",
153 |        "       [755, 756, 757, 758, 759],\n",
154 |        "       [885, 886, 887, 888, 889],\n",
155 |        "       [700, 701, 702, 703, 704],\n",
156 |        "       [860, 861, 862, 863, 864],\n",
157 |        "       [890, 891, 892, 893, 894],\n",
158 |        "       [500, 501, 502, 503, 504],\n",
159 |        "       [870, 871, 872, 873, 874],\n",
160 |        "       [880, 881, 882, 883, 884]])"
161 |       ]
162 |      },
163 |      "execution_count": 7,
164 |      "metadata": {},
165 |      "output_type": "execute_result"
166 |     }
167 |    ],
168 |    "source": [
169 |     "X_test"
170 |    ]
171 |   },
172 |   {
173 |    "cell_type": "code",
174 |    "execution_count": 8,
175 |    "metadata": {},
176 |    "outputs": [
177 |     {
178 |      "data": {
179 |       "text/plain": [
180 |        "array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.,\n",
181 |        "       0., 0., 0.])"
182 |       ]
183 |      },
184 |      "execution_count": 8,
185 |      "metadata": {},
186 |      "output_type": "execute_result"
187 |     }
188 |    ],
189 |    "source": [
190 |     "y_test"
191 |    ]
192 |   },
193 |   {
194 |    "cell_type": "code",
195 |    "execution_count": 9,
196 |    "metadata": {},
197 |    "outputs": [
198 |     {
199 |      "data": {
200 |       "text/plain": [
201 |        "array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,\n",
202 |        "       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0.,\n",
203 |        "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
204 |        "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
205 |        "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
206 |        "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
207 |        "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
208 |        "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
209 |        "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
210 |        "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
211 |        "       0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])"
212 |       ]
213 |      },
214 |      "execution_count": 9,
215 |      "metadata": {},
216 |      "output_type": "execute_result"
217 |     }
218 |    ],
219 |    "source": [
220 |     "y_train"
221 |    ]
222 |   }
223 |  ],
224 |  "metadata": {
225 |   "kernelspec": {
226 |    "display_name": "Python 3",
227 |    "language": "python",
228 |    "name": "python3"
229 |   },
230 |   "language_info": {
231 |    "codemirror_mode": {
232 |     "name": "ipython",
233 |     "version": 3
234 |    },
235 |    "file_extension": ".py",
236 |    "mimetype": "text/x-python",
237 |    "name": "python",
238 |    "nbconvert_exporter": "python",
239 |    "pygments_lexer": "ipython3",
240 |    "version": "3.6.7"
241 |   }
242 |  },
243 |  "nbformat": 4,
244 |  "nbformat_minor": 2
245 | }
246 | 


--------------------------------------------------------------------------------
/EDA編寫區.txt:
--------------------------------------------------------------------------------
  1 | # Part 1 EDA in ML100-Days
  2 | 100-Day Machine Learning Marathon:
  3 | Part 1, EDA: 16 days in total
  4 | 
  5 | 
  6 | https://www.cupoy.com/home
  7 | 
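  8 | The next 100 days of study are planned as follows.
  9 | We start with the data and with observing features, then begin building models, or rather a framework for "practicing AI through machine learning", spend one day on tuning, and take four to five days to work through a Kaggle problem as a checkpoint.
 10 | Then, before moving on to deep learning, we look at unsupervised learning through very practical problems; the last thirty-odd days cover the more esoteric machine learning lessons.
 11 | 
 12 | The first 16 days are about data cleaning and preprocessing.
 13 | 
 14 | Cupoy's 100-Day Machine Learning Marathon lacks an organized overview and goes straight into the 100 daily lessons. Day 1 does offer some pointers and excellent supplementary material,
 15 | but students who miss them may not see the forest for the trees: after the course they may be able to solve problems, yet not know where to start the moment they are handed one.
 16 | 
 17 | So before setting out, I will lay out the process I have in mind. Data cleaning has no standard procedure, but beginners can take these 16 days as a template to start from and gradually adapt it.
 18 | 
 19 | We have been handed a problem. Over these 16 days we do the preprocessing properly before moving on to feature selection and engineering: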
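 20 | 	• Day 1: Everyone, when you receive a problem, don't rush in. Take the time to understand the problem, line up resources, and plan your approach.
 21 | 	• Day 2: Our first pass at EDA starts from the number of rows and columns, to learn how much we have in hand. And don't forget that we also read the column description file.
 22 | 	• Day 4: We go into the columns and check the data type of each one: float, integer, or object. Categorical columns are encoded right away.
 23 | 	• Day 5: We stay with the columns and look at their basic statistics.
 24 | 		Put another way:
 25 | 		Day 4 uses app_train.dtypes.value_counts() and app_train.select_dtypes(include=["object"]).apply(pd.Series.nunique, axis = 0) to count the columns of each data type, find all categorical columns, and see how many categories each one has.
 26 | 		Day 5 uses .describe() to view each column's summary statistics and plt.hist() to draw a histogram, so that we get to know the column's data.
 27 | 	• Day 6: After looking at the Day 5 statistics, on Day 6 we make a first judgment about whether there are outliers; this time we use boxplots.
 28 | 	• Day 7: A continuation of Day 6: we handle the outliers. Once each column's outliers are dealt with, we also don't want a few columns, because of different units and much larger values, to have a disproportionate influence on later model computation, so we standardize as well.
 29 | 	• Day 9, 10: With the basic inspection and corrections done, we compute the correlation of every feature (column) with the target.
 30 | 	• Day 11: We look at the statistics of a single feature, and even go inside it to see whether its subgroups behave differently; KDE and the pdf are introduced. These fine-grained observations give us a clearer understanding of the subject and its features.
 31 | 	• Day 12, 13, 14: Discretizing continuous variables. Some features may work better when binned, e.g. age split into < 25, 25 ~ 45, and > 45.
 32 | 	• Day 16: Nearly all of the initial data processing is done, so we look at the correlations among the cleaned-up features and get ready to build our first machine learning program (Day 16).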
 33 | 
 34 | 
 35 | The programming homework for each day is at
 36 | https://github.com/PatrickRuan/ML100-Days
 37 | Feel free to click watch or follow ^^
 38 | 
 39 | 
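 40 | Day 1: When we meet a problem, we don't have to jump straight in.
 41 | 	
 42 | 	First think about the problem and know the goal. If data is available, skim it and consider what interesting, useful results it could yield.
 43 | 	If there is no data yet, think the problem through before collecting any; this saves a lot of time.
 44 | 	
 45 | 	Only then do we work out a strategy and act, for example quickly finishing a minimal workable solution (the Minimum Viable Product idea) and improving it afterwards.
 46 | 	
 47 | 	HW: Besides the simple Day 1 exercise (write Y=wX+b, plot(X,Y), define mse and mae), there is one more important task:
 48 | 	Homework 2: If you ran Taiwan Taxi (台灣大車隊) today, how would you use data analysis to improve business? Think about it: besides engineers, we also need data analysts and data scientists.
 49 | 	Programming homework:
 50 | 	1.) Build y = w * x + b with w = 3, b = 5, and Gaussian noise of amplitude 5
 51 | 	2.) Define two functions, mae and mse
 52 | 	3.) Learn to write a formula in a text (Markdown) cell: $ MSE = \frac{1}{n}\sum_{i=1}^{n}{(Y_i - \hat{Y}_i)^2} $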
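
The three programming tasks above can be sketched roughly as follows. This is a minimal sketch, not the course's reference solution: the seed, the range of x, and the choice to add the noise to y are all assumptions made here for illustration.

```python
import numpy as np

def mean_absolute_error(y, y_hat):
    """Average of |y - y_hat|."""
    return np.mean(np.abs(np.asarray(y) - np.asarray(y_hat)))

def mean_squared_error(y, y_hat):
    """Average of (y - y_hat)^2, i.e. the MSE formula in item 3."""
    return np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)

w, b = 3, 5
rng = np.random.default_rng(0)            # fixed seed: an arbitrary choice
x = np.linspace(0, 100, 101)              # assumed input range
noise = rng.normal(0, 5, size=x.shape)    # assumption: noise (std 5) is added to y
y = w * x + b + noise                     # noisy observations
y_hat = w * x + b                         # the noise-free line

print("MAE:", mean_absolute_error(y, y_hat))
print("MSE:", mean_squared_error(y, y_hat))
```

With matplotlib installed, plt.plot(x, y, 'b.') followed by plt.plot(x, y_hat, 'r-') would cover the plot(X,Y) part of the exercise.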
 53 | 
 54 | 
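 55 | Day 2: First data exploration
 56 | 
 57 | 	For the first sixteen days we work on data preprocessing, practicing on this Kaggle problem:
 58 | 	https://www.kaggle.com/c/home-credit-default-risk/data 
 59 | 	
 60 | 	The files we will use should be application_train.csv and application_test.csv.
 61 | 	submission.csv is needed at the end for uploading.
 62 | 	Don't forget to study HomeCredit_columns_description.csv.
 63 | 	
 64 | 	HW:
 65 | 	On Day 2 we start reading the data and do the most basic exploration of the prepared DataFrame: how many rows there are, how many feature columns, how to collect the feature names into a list, how to display a slice of the data, and so on. In short, today we run through everything we do before diving into individual features and individual records.
 66 | 	Day 2 also has an interesting optional assignment: read Mr. Wu's data exploration lecture notes. Why one would study the fish of the Du River (杜河), and how to go about it, make a worthwhile exercise, and the Trump story is great fun too.
 67 | 	
 68 | 	Programming homework: learn every inspection step in this program, and raise anything you don't understand.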
 69 | 	
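 70 | Day 3: About DataFrame
 71 | 	
 72 | 	Data cleaning and machine learning work constantly call for fluent DataFrame operations. A DataFrame is like an Excel spreadsheet: the first row holds the column names, which are the features we will work with, and from the second row on each row is one record.
 73 | 	
 74 | 	HW: The Day 3 exercise goes a little beyond the pandas DataFrame into a few other data formats. Keep at it!
 75 | 	Programming homework: learn every inspection step in this program, and raise anything you don't understand.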
 76 | 	
 77 | 	
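 78 | Day 4: EDA, introduction to data types
 79 | 
 80 | 	We began EDA on Day 2 by looking at the overall shape of the data: how many features, how many records. From today we examine the feature data step by step.
 81 | 	
 82 | 	The first small step in studying features is to know each feature's format, i.e. its data type. The basic types are float64, int64, and object.
 83 | 	
 84 | 	Note that object is a type our models cannot process directly, so today introduces encoding for such data; only after encoding is it fed to a model.
 85 | 	
 86 | 	HW: Study the reference https://medium.com/@contactsunny/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621
 87 | 	and complete the Day 4 program.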
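
As a rough illustration of the Day 4 steps, the sketch below counts dtypes, counts categories per object column, and encodes them. The tiny app_train table and its column names are invented for this example; the real homework uses application_train.csv.

```python
import pandas as pd

# Toy stand-in for app_train; columns and values are made up for illustration.
app_train = pd.DataFrame({
    "AMT_INCOME": [100.0, 250.5, 80.0, 120.0],
    "CNT_CHILDREN": [0, 2, 1, 0],
    "CONTRACT_TYPE": pd.Series(["Cash", "Revolving", "Cash", "Cash"], dtype="object"),
    "HOUSING": pd.Series(["Own", "Rent", "Parents", "Own"], dtype="object"),
})

# How many columns of each data type?
print(app_train.dtypes.value_counts())

# Number of distinct categories in each object column.
object_cols = app_train.select_dtypes(include=["object"])
n_categories = object_cols.apply(pd.Series.nunique, axis=0)
print(n_categories)

# Label-encode the two-category column as integer codes ...
app_train["CONTRACT_TYPE"] = app_train["CONTRACT_TYPE"].astype("category").cat.codes
# ... and one-hot encode the remaining object columns.
encoded = pd.get_dummies(app_train)
print(encoded.columns.tolist())
```

Label encoding a column with more than two categories would impose an arbitrary order on them, which is one reason one-hot encoding is used for HOUSING here.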
 88 | 	
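 89 | Day 5: EDA, data distribution
 90 | 
 91 | 	EDA has no fixed procedure, of course. But a beginner already struggles to get started on the classic Titanic problem with its ten or so features, let alone today's problem with over a hundred. A step-by-step procedure to follow is therefore very valuable at the beginner stage.
 92 | 	
 93 | 	Today is effectively the third day of EDA. On Day 2 we first looked at the data frame (how much data; wide or narrow, tall or short), on Day 4 at the data types of the features, and today we look at their statistical properties.
 94 | 	Measures of central tendency
 95 | 		• Mean
 96 | 		• Median
 97 | 		• Mode
 98 | 	Measures of dispersion
 99 | 		• Min
100 | 		• Max
101 | 		• Range
102 | 		• Quartiles
103 | 		• Variance
104 | 		• Standard deviation
105 | 		
106 | 	HW: Pick three features that interest you and look at their statistics. You can choose based on what you learned about the features when reading HomeCredit_columns_description.csv on Day 2; if you have many candidates, you can also narrow them down with sns.heatmap() applied to the corr() matrix.
107 | 	Today, learn DataFrame.describe() for basic statistics, and learn to draw a histogram with plt.hist.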
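
The statistics listed above map onto pandas calls roughly as follows. The ages are made-up values chosen only to keep the numbers easy to check; plotting is left out so that the sketch needs nothing beyond pandas.

```python
import pandas as pd

ages = pd.Series([23, 25, 25, 31, 38, 44, 52, 60], name="AGE")  # made-up values

# count, mean, std, min, quartiles and max in one call
print(ages.describe())

# Central tendency
print("mean:", ages.mean())
print("median:", ages.median())
print("mode:", ages.mode().tolist())   # mode() returns every most-frequent value

# Dispersion
print("range:", ages.max() - ages.min())
print("IQR:", ages.quantile(0.75) - ages.quantile(0.25))
print("variance:", ages.var())         # sample variance (ddof=1)
print("std:", ages.std())
```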
108 | 	
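109 | Day 6: EDA and outlier checks
110 | 
111 | 	Faced with outliers, we should ask: why are there outliers? How do we decide that a value is an outlier? And how should we handle it?
112 | 	Solving a problem leaves no room for dodging or wishful thinking. Data scientists and engineers spend countless hours on such cases; expecting to finish in three or four hours with outstanding results is unrealistic. So the three features we looked at on Day 5 were really just an appetizer, a first taste.
113 | 	The real contest means facing everything, so today we go through every numeric feature, draw its boxplot, and look at all of them: are there odd values standing far off on their own?
114 | 	We need to find them, consider whether they are plausible, and decide whether and how to correct them.
115 | 	HW: Complete today's program.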
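
One common way to turn "standing far off on their own" into a rule is the 1.5 x IQR criterion, which is also what a default boxplot uses for its whiskers. The helper below is a hypothetical sketch of that rule, not the homework's method, and the ages are invented.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

ages = pd.Series([23, 25, 25, 31, 38, 44, 52, 60, 300])  # 300 is an obvious data error
print(iqr_outliers(ages))
```

Whether a flagged value is an error or a genuine extreme still has to be judged from what the column means; the rule only tells us where to look.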
116 | 	
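117 | Day 7: Common replacement values (median and quantiles); standardizing continuous values
118 | 	
119 | 	Day 7 covers how to fill in values when we meet missing values or outliers. Please look up the definitions and computation of the median, quantiles, mode, and mean. Today's homework also computes a normalize function (scaling to -1 ~ 1); this puts all features on an equal footing, so that features with large units and features with small units do not have unequal influence.
120 | 	
121 | 	HW: Complete today's program, and understand the median, quantiles, mode, mean, and standardization.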
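
A minimal sketch of the two Day 7 operations, filling missing values with the median and scaling to the range -1 ~ 1. The function name scale_to_minus1_1 and the income values are assumptions made for this example.

```python
import numpy as np
import pandas as pd

def scale_to_minus1_1(values):
    """Min-max scale so that the minimum maps to -1 and the maximum to 1."""
    lo, hi = values.min(), values.max()
    return 2 * (values - lo) / (hi - lo) - 1

income = pd.Series([100.0, np.nan, 300.0, 200.0, np.nan, 500.0])  # made-up values

filled = income.fillna(income.median())   # median() skips NaN by default
scaled = scale_to_minus1_1(filled)

print(filled.tolist())
print(scaled.tolist())
```

The median is often preferred over the mean for filling because a few extreme values barely move it.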
122 | 	
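123 | Day 8: Common DataFrame operations
124 | 
125 | 	Day 8 is not really EDA; it is a practice day for getting to know pandas and DataFrame operations. Finish the homework and also look up some DataFrame introductions online.
126 | 	Unless you take the shortcut of doing only deep learning, being unfamiliar with DataFrame will make life very hard.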
127 | 	
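128 | Day 9: No homework today; the topic is the correlation coefficient.
129 | 
130 | 	The correlation coefficient is a value between -1 and 1. A negative value indicates negative correlation, and a positive value indicates positive correlation.
131 | 	Please google it yourselves. We will practice on Day 10.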
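
To make the -1 to 1 range concrete, here is a small sketch of the Pearson correlation coefficient computed by hand and checked against np.corrcoef. The helper name pearson_r and the sample data are assumptions for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

x = np.arange(10)
print(pearson_r(x, 2 * x + 1))       # points on a rising line
print(pearson_r(x, -3 * x))          # points on a falling line
print(pearson_r(x, x ** 2))          # related, but not linearly
print(np.corrcoef(x, x ** 2)[0, 1])  # NumPy's built-in, for comparison
```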
132 | 	
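133 | Day 10: Correlation coefficients in practice
134 | 
135 | 	Continuing from Day 9, we start applying the correlation coefficient.
136 | 	These two days are light on work and content, so we can catch our breath and review the EDA so far, to avoid missing the forest for the trees:
137 | 	• Day 2: Our first day of EDA starts from the number of rows and columns, to learn how much we have in hand. And don't forget that we also read the column description file.
138 | 	• Day 4: We go into the columns and check the data type of each one: float, integer, or object. Categorical columns are encoded right away.
139 | 	• Day 5: We stay with the columns and look at their basic statistics.
140 | 		Put another way:
141 | 		Day 4 uses app_train.dtypes.value_counts() and app_train.select_dtypes(include=["object"]).apply(pd.Series.nunique, axis = 0) to count the columns of each data type, find all categorical columns, and see how many categories each one has.
142 | 		Day 5 uses .describe() to view each column's summary statistics and plt.hist() to draw a histogram, so that we get to know the column's data.
143 | 	• Day 6: After looking at the Day 5 statistics, on Day 6 we make a first judgment about whether there are outliers; this time we use boxplots.
144 | 	• Day 7: A continuation of Day 6: we handle the outliers. Once each column's outliers are dealt with, we also don't want a few columns, because of different units and much larger values, to have a disproportionate influence on later model computation, so we standardize as well.
145 | 	• Day 9, 10: With the basic inspection and corrections done, we compute the correlation of every feature (column) with the target.
146 | 	
147 | 	HW: Since this is a hands-on day, the main job is to do the homework properly. All Object-typed columns are one-hot encoded into numbers, then correlated with TARGET. We call corr() but do not pass the result to sns.heatmap, because there are far too many features and plotting would take a long time. Then we pick out the largest few to examine.
148 | 	Today's inspection also compares the distribution of the same feature when the target is 1 and when it is 0, and considers what is observed. I also suggest playing with the application_test.csv (5.58 MB) table at the bottom of https://www.kaggle.com/c/home-credit-default-risk/data; it is a convenient way to see some information.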
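
The Day 10 workflow (one-hot encode, correlate with TARGET, look at the extremes, then compare a feature across the two target values) might look roughly like this. The eight-row table is invented for illustration and stands in for application_train.csv.

```python
import pandas as pd

# Toy stand-in for app_train; all values are made up.
app_train = pd.DataFrame({
    "TARGET": [1, 0, 0, 1, 0, 1, 0, 0],
    "AGE": [22, 45, 51, 25, 39, 23, 60, 48],
    "INCOME": [90, 150, 200, 80, 170, 160, 120, 210],
    "HOUSING": pd.Series(
        ["Rent", "Own", "Own", "Rent", "Own", "Rent", "Own", "Rent"], dtype="object"
    ),
})

# One-hot encode the object column so every column is numeric.
encoded = pd.get_dummies(app_train, dtype=float)

# Correlation of every feature with TARGET, sorted from most negative to most positive.
corr = encoded.corr()["TARGET"].drop("TARGET").sort_values()
print(corr.head(2))   # most negatively correlated features
print(corr.tail(2))   # most positively correlated features

# Compare one feature's distribution for TARGET == 0 and TARGET == 1.
print(app_train.groupby("TARGET")["AGE"].describe())
```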
149 | 	
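150 | Day 11: Plotting and styles & Kernel Density Estimation (KDE)
151 | 
152 | 	In the last task of the Day 10 homework, we not only looked at one feature's distribution and its correlation with the target; we also split the feature and looked at the characteristics of the feature subsets for each target outcome. Day 11 continues in this direction.
153 | 	Day 11 has two parts:
154 | 	• An introduction to KDE. I found this online: "kernel density estimation is a way to estimate the probability density function (PDF) of a random variable in a non-parametric way". So KDE is a method for estimating the pdf. Plotting with sns.kdeplot() gives a continuous curve. Plotting with sns.distplot() (deprecated since seaborn 0.11 in favor of sns.histplot() and sns.displot()) gives that curve together with the relative frequency of the values. The latter looks a lot like a histogram; think about how it relates to a histogram!
155 | 	• The second part is what was mentioned above: splitting a feature into several subsets (groups) and looking at the KDE of each.
156 | 	
157 | 	These fine-grained observations all give us a clearer understanding of, and a firmer grip on, the data and its features.
158 | 	
159 | 	HW: Much of the description above is really an explanation of the Day 11 homework. Some parts I am still working through myself. Keep going!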
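	To see what a KDE curve actually is, here is a hand-rolled sketch with a Gaussian kernel in plain Python. It is not how seaborn implements sns.kdeplot(); the helper name gaussian_kde, the bandwidth of 3.0, and the ages are all assumptions chosen for illustration.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples`.

    Each sample contributes one Gaussian bump of width `bandwidth`;
    the estimate is the average of those bumps.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

ages = [22, 25, 27, 30, 31, 45, 48, 50]
density = gaussian_kde(ages, bandwidth=3.0)

# A density should integrate to about 1; check with a simple Riemann sum.
step = 0.1
grid = [i * step for i in range(1000)]  # 0.0 .. 99.9
area = sum(density(x) for x in grid) * step

print(density(27), density(38), area)
```

	The curve is high where samples cluster and low in the gap between the two groups, which is the same information a histogram gives, only smoothed.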
160 | 	
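161 | Day 12: Discretizing continuous variables
162 | 
163 | 	No homework on Day 12, and the content is simple. To borrow a phrase popular on PTT, "as the title says": discretize continuous variables.
164 | 	In the last part of the Day 11 homework, a barplot showed that the younger the applicant, the more likely they are to fail to repay. We could of course use the continuous values directly, but discretizing helps simplify the problem; remember how many features we have, so the problem is already complicated. It can also take care of outliers, for example by putting an age of 300 into the bin for the oldest group (is that right?).
165 | 	Generally there are three ways to bin: equal width, equal frequency, and 等類 (class-based).
166 | 	The Day 13 homework uses equal width and equal frequency. I have not studied 等類 properly yet.
167 | 	Here are some hints to help with Day 13:
168 | 	pd.cut(Data_feature, bins = integer n) is equal width: it splits the range from the minimum to the maximum into n intervals of equal width.
169 | 	pd.qcut(Data_feature, q = integer n) is equal frequency: it splits the range from the minimum to the maximum into n intervals, each of which, barring surprises, holds the same number of samples.
170 | 	Think about why I did not promise equal counts and added "barring surprises" instead.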
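	A small sketch of the two calls on made-up ages, including one case where equal-frequency binning cannot give equal counts. The data is invented; the point is only to compare the bin counts.

```python
import pandas as pd

ages = pd.Series([21, 22, 23, 24, 25, 30, 40, 50, 60, 70, 80, 90])

# Equal width: three intervals of the same length between min and max.
equal_width = pd.cut(ages, bins=3)
width_counts = equal_width.value_counts().sort_index()

# Equal frequency: three intervals cut at the 1/3 and 2/3 quantiles.
equal_freq = pd.qcut(ages, q=3)
freq_counts = equal_freq.value_counts().sort_index()

# With many tied values a quantile falls inside the tie, and all tied
# values must go to the same bin, so the counts come out unequal.
tied = pd.Series([0, 1, 1, 1, 1, 2])
tied_counts = pd.qcut(tied, q=2).value_counts().sort_index()

print(width_counts.tolist())  # [7, 2, 3]
print(freq_counts.tolist())   # [4, 4, 4]
print(tied_counts.tolist())   # [5, 1]
```

	Equal width follows the value range, so a skewed feature piles up in one bin; equal frequency follows the data, but repeated values are one of the "surprises" that break the equal counts.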
171 | 	
172 | 
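173 | Day 13: Discretizing continuous variables
174 | 	
175 | 	After Day 13 the EDA part is nearly at its end (but not quite yet). Day 13 is about completing the homework based on the Day 12 material.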
176 | 	
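177 | Day 14: Discretizing continuous variables
178 | 		
179 | 	Day 14 is mainly about another visualization technique: arranging several series of charts, or several related charts, in a structured layout for comparison, which helps us observe better.
180 | 	HW: Students who are not familiar with Python will find it hard, but it is also extra practice that helps you catch up with your classmates. Keep going.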
181 | 	
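182 | Day 15: Heatmap & Grid-plot
183 | 
184 | 	On Day 13 we said we were near the end, and that is true, but the Day 15 observations are really important.
185 | 	sns.heatmap(corr()) was done earlier, but it is still important.
186 | 	sns.PairGrid() is even more interesting: it lets us build a grid of pairwise plots in which we decide what is shown in each part.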
187 | 
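188 | Day 16: First taste of modeling: Logistic Regression
189 | 
190 | 	OK, we are going on Kaggle. For Day 16 please do the following:
191 | 	◆ Look up online what sklearn Imputer does (in newer scikit-learn versions it has been replaced by sklearn.impute.SimpleImputer).
192 | 	◆ Watch the teaching videos of Prof. Hung-yi Lee (李宏毅) or Prof. Andrew Ng (吳恩達) to learn what Logistic Regression is.
193 | 	◆ Look up online how to use sklearn LogisticRegression.
194 | 	◆ Clarify among yourselves: is Logistic Regression a regression method or a classification method?
195 | 	◆ Have the submission.csv from Day 2 ready, and compare the format after you finish the homework.
196 | 	
197 | 	HW: Finish the homework and get ready for Kaggle!
198 | 	Is the format your homework produces the same as submission.csv? Is there any problem uploading it as it is? If there is, what do you do?
199 | 	As for how to upload to Kaggle, please Google it!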
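	One way to answer the format question is to compare the header and row count of your file against the sample. The sketch below uses inline CSV text instead of real files. The column names SK_ID_CURR and TARGET and the helper same_layout are assumptions for illustration, so check them against your own submission.csv.

```python
import csv
import io

def same_layout(sample_csv, my_csv):
    """True if both CSV texts share the same header and the same number of rows."""
    sample_rows = list(csv.reader(io.StringIO(sample_csv)))
    my_rows = list(csv.reader(io.StringIO(my_csv)))
    return sample_rows[0] == my_rows[0] and len(sample_rows) == len(my_rows)

# Assumed layout of the sample file: an id column and a prediction column.
sample = "SK_ID_CURR,TARGET\n100001,0.5\n100005,0.5\n"

# What DataFrame.to_csv writes by default: an extra unnamed index column.
with_index = ",SK_ID_CURR,TARGET\n0,100001,0.08\n1,100005,0.21\n"

# What you get with to_csv(..., index=False).
without_index = "SK_ID_CURR,TARGET\n100001,0.08\n100005,0.21\n"

ok_with_index = same_layout(sample, with_index)
ok_without_index = same_layout(sample, without_index)
print(ok_with_index, ok_without_index)
```

	If the check fails because of a stray first column, passing index=False to to_csv is the usual fix.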
200 | 	
201 | 	
202 | 	
203 | 	
204 | 	
205 | 	
206 | 


--------------------------------------------------------------------------------
/Day_065_HW.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
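  7 |     "# Homework:\n",
  8 |     "Try adjusting the parameters:  \n",
  9 |     "sg: sg=1 selects skip-gram, sg=0 selects CBOW  \n",
 10 |     "window: how many words to look at on each side of the current word \n",
 11 |     "\n",
 12 |     "    Parameter notes\n",
 13 |     "    sentences: the collection of sentences to train on; without it there is nothing to run \n",
 14 |     "    size: the number of dimensions of the trained word vectors \n",
 15 |     "    alpha: the learning rate; it gradually decays toward min_alpha \n",
 16 |     "    sg: sg=1 selects skip-gram, sg=0 selects CBOW \n",
 17 |     "    window: how many words to look at on each side of the current word \n",
 18 |     "    workers: the number of worker threads \n",
 19 |     "    min_count: a word that appears fewer than min_count times is not used as a\n",
 20 |     "    training target\n",
 21 |     "    Note: if you run out of memory while running, try setting workers to 4 or lower\n"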
 22 |    ]
 23 |   },
 24 |   {
 25 |    "cell_type": "code",
 26 |    "execution_count": 1,
 27 |    "metadata": {},
 28 |    "outputs": [
 29 |     {
 30 |      "name": "stderr",
 31 |      "output_type": "stream",
 32 |      "text": [
 33 |       "C:\\Users\\User\\Anaconda3\\lib\\site-packages\\gensim\\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial\n",
 34 |       "  warnings.warn(\"detected Windows; aliasing chunkize to chunkize_serial\")\n"
 35 |      ]
 36 |     }
 37 |    ],
 38 |    "source": [
 39 |     "import gensim, logging\n",
 40 |     "from gensim.models import word2vec\n",
 41 |     "import warnings\n",
 42 |     "warnings.filterwarnings(\"ignore\")"
 43 |    ]
 44 |   },
 45 |   {
 46 |    "cell_type": "code",
 47 |    "execution_count": 2,
 48 |    "metadata": {},
 49 |    "outputs": [
 50 |     {
 51 |      "name": "stderr",
 52 |      "output_type": "stream",
 53 |      "text": [
 54 |       "2019-03-15 22:04:31,028 : INFO : collecting all words and their counts\n",
 55 |       "2019-03-15 22:04:31,032 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n",
 56 |       "2019-03-15 22:04:31,032 : INFO : collected 3 word types from a corpus of 4 raw words and 2 sentences\n",
 57 |       "2019-03-15 22:04:31,032 : INFO : Loading a fresh vocabulary\n",
 58 |       "2019-03-15 22:04:31,036 : INFO : min_count=1 retains 3 unique words (100% of original 3, drops 0)\n",
 59 |       "2019-03-15 22:04:31,036 : INFO : min_count=1 leaves 4 word corpus (100% of original 4, drops 0)\n",
 60 |       "2019-03-15 22:04:31,036 : INFO : deleting the raw counts dictionary of 3 items\n",
 61 |       "2019-03-15 22:04:31,040 : INFO : sample=0.001 downsamples 3 most-common words\n",
 62 |       "2019-03-15 22:04:31,044 : INFO : downsampling leaves estimated 0 word corpus (5.7% of prior 4)\n",
 63 |       "2019-03-15 22:04:31,044 : INFO : estimated required memory for 3 words and 256 dimensions: 7644 bytes\n",
 64 |       "2019-03-15 22:04:31,044 : INFO : resetting layer weights\n",
 65 |       "2019-03-15 22:04:31,044 : INFO : training model with 4 workers on 3 vocabulary and 256 features, using sg=1 hs=0 sample=0.001 negative=5 window=5\n",
 66 |       "2019-03-15 22:04:31,048 : INFO : worker thread finished; awaiting finish of 3 more threads\n",
 67 |       "2019-03-15 22:04:31,052 : INFO : worker thread finished; awaiting finish of 2 more threads\n",
 68 |       "2019-03-15 22:04:31,052 : INFO : worker thread finished; awaiting finish of 1 more threads\n",
 69 |       "2019-03-15 22:04:31,052 : INFO : worker thread finished; awaiting finish of 0 more threads\n",
 70 |       "2019-03-15 22:04:31,052 : INFO : EPOCH - 1 : training on 4 raw words (0 effective words) took 0.0s, 0 effective words/s\n",
 71 |       "2019-03-15 22:04:31,072 : INFO : worker thread finished; awaiting finish of 3 more threads\n",
 72 |       "2019-03-15 22:04:31,072 : INFO : worker thread finished; awaiting finish of 2 more threads\n",
 73 |       "2019-03-15 22:04:31,076 : INFO : worker thread finished; awaiting finish of 1 more threads\n",
 74 |       "2019-03-15 22:04:31,076 : INFO : worker thread finished; awaiting finish of 0 more threads\n",
 75 |       "2019-03-15 22:04:31,080 : INFO : EPOCH - 2 : training on 4 raw words (0 effective words) took 0.0s, 0 effective words/s\n",
 76 |       "2019-03-15 22:04:31,084 : INFO : worker thread finished; awaiting finish of 3 more threads\n",
 77 |       "2019-03-15 22:04:31,084 : INFO : worker thread finished; awaiting finish of 2 more threads\n",
 78 |       "2019-03-15 22:04:31,092 : INFO : worker thread finished; awaiting finish of 1 more threads\n",
 79 |       "2019-03-15 22:04:31,100 : INFO : worker thread finished; awaiting finish of 0 more threads\n",
 80 |       "2019-03-15 22:04:31,104 : INFO : EPOCH - 3 : training on 4 raw words (1 effective words) took 0.0s, 52 effective words/s\n",
 81 |       "2019-03-15 22:04:31,112 : INFO : worker thread finished; awaiting finish of 3 more threads\n",
 82 |       "2019-03-15 22:04:31,116 : INFO : worker thread finished; awaiting finish of 2 more threads\n",
 83 |       "2019-03-15 22:04:31,116 : INFO : worker thread finished; awaiting finish of 1 more threads\n",
 84 |       "2019-03-15 22:04:31,120 : INFO : worker thread finished; awaiting finish of 0 more threads\n",
 85 |       "2019-03-15 22:04:31,120 : INFO : EPOCH - 4 : training on 4 raw words (0 effective words) took 0.0s, 0 effective words/s\n",
 86 |       "2019-03-15 22:04:31,132 : INFO : worker thread finished; awaiting finish of 3 more threads\n",
 87 |       "2019-03-15 22:04:31,132 : INFO : worker thread finished; awaiting finish of 2 more threads\n",
 88 |       "2019-03-15 22:04:31,132 : INFO : worker thread finished; awaiting finish of 1 more threads\n",
 89 |       "2019-03-15 22:04:31,136 : INFO : worker thread finished; awaiting finish of 0 more threads\n",
 90 |       "2019-03-15 22:04:31,136 : INFO : EPOCH - 5 : training on 4 raw words (0 effective words) took 0.0s, 0 effective words/s\n",
 91 |       "2019-03-15 22:04:31,140 : INFO : training on a 20 raw words (1 effective words) took 0.1s, 11 effective words/s\n",
 92 |       "2019-03-15 22:04:31,140 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay\n"
 93 |      ]
 94 |     }
 95 |    ],
 96 |    "source": [
 97 |     "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) \n",
 98 |     "sentences = [['first', 'sentence'], ['second', 'sentence']]  \n",
 99 |     "model = word2vec.Word2Vec(sentences, size=256, min_count=1, window=5, workers=4, sg=1)"
100 |    ]
101 |   },
102 |   {
103 |    "cell_type": "code",
104 |    "execution_count": 3,
105 |    "metadata": {},
106 |    "outputs": [
107 |     {
108 |      "name": "stdout",
109 |      "output_type": "stream",
110 |      "text": [
111 |       "Word2Vec(vocab=3, size=256, alpha=0.025)\n"
112 |      ]
113 |     }
114 |    ],
115 |    "source": [
116 |     "print(model)"
117 |    ]
118 |   },
119 |   {
120 |    "cell_type": "code",
121 |    "execution_count": 4,
122 |    "metadata": {},
123 |    "outputs": [
124 |     {
125 |      "data": {
126 |       "text/plain": [
127 |        "0.09817103820984503"
128 |       ]
129 |      },
130 |      "execution_count": 4,
131 |      "metadata": {},
132 |      "output_type": "execute_result"
133 |     }
134 |    ],
135 |    "source": [
136 |     "model.wv.similarity('first','second')"
137 |    ]
138 |   },
139 |   {
140 |    "cell_type": "code",
141 |    "execution_count": 5,
142 |    "metadata": {},
143 |    "outputs": [
144 |     {
145 |      "name": "stderr",
146 |      "output_type": "stream",
147 |      "text": [
148 |       "2019-03-15 22:04:31,224 : INFO : saving Word2Vec object under mymodel, separately None\n",
149 |       "2019-03-15 22:04:31,224 : INFO : not storing attribute vectors_norm\n",
150 |       "2019-03-15 22:04:31,232 : INFO : not storing attribute cum_table\n",
151 |       "2019-03-15 22:04:31,252 : INFO : saved mymodel\n",
152 |       "2019-03-15 22:04:31,256 : INFO : loading Word2Vec object from mymodel\n",
153 |       "2019-03-15 22:04:31,256 : INFO : loading wv recursively from mymodel.wv.* with mmap=None\n",
154 |       "2019-03-15 22:04:31,260 : INFO : setting ignored attribute vectors_norm to None\n",
155 |       "2019-03-15 22:04:31,268 : INFO : loading vocabulary recursively from mymodel.vocabulary.* with mmap=None\n",
156 |       "2019-03-15 22:04:31,272 : INFO : loading trainables recursively from mymodel.trainables.* with mmap=None\n",
157 |       "2019-03-15 22:04:31,272 : INFO : setting ignored attribute cum_table to None\n",
158 |       "2019-03-15 22:04:31,272 : INFO : loaded mymodel\n"
159 |      ]
160 |     }
161 |    ],
162 |    "source": [
163 |     "model.save('mymodel')  \n",
164 |     "new_model = gensim.models.Word2Vec.load('mymodel')  "
165 |    ]
166 |   }
167 |  ],
168 |  "metadata": {
169 |   "kernelspec": {
170 |    "display_name": "Python 3",
171 |    "language": "python",
172 |    "name": "python3"
173 |   },
174 |   "language_info": {
175 |    "codemirror_mode": {
176 |     "name": "ipython",
177 |     "version": 3
178 |    },
179 |    "file_extension": ".py",
180 |    "mimetype": "text/x-python",
181 |    "name": "python",
182 |    "nbconvert_exporter": "python",
183 |    "pygments_lexer": "ipython3",
184 |    "version": "3.6.7"
185 |   }
186 |  },
187 |  "nbformat": 4,
188 |  "nbformat_minor": 2
189 | }
190 | 


--------------------------------------------------------------------------------
/Day_044_HW.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
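  7 |     "## Homework\n",
  8 |     "\n",
  9 |     "1. Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change.\n",
 10 |     "2. Switch to other datasets (boston, wine) and compare the results with the regression model and the decision tree."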
 11 |    ]
 12 |   },
 13 |   {
 14 |    "cell_type": "code",
 15 |    "execution_count": 1,
 16 |    "metadata": {},
 17 |    "outputs": [],
 18 |    "source": [
 19 |     "from sklearn import datasets, metrics\n",
 20 |     "from sklearn.tree import DecisionTreeClassifier,DecisionTreeRegressor\n",
 21 |     "from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor\n",
 22 |     "import pandas as pd\n",
 23 |     "from sklearn.model_selection import train_test_split\n",
 24 |     "from sklearn.metrics import mean_squared_error, r2_score, accuracy_score"
 25 |    ]
 26 |   },
 27 |   {
 28 |    "cell_type": "code",
 29 |    "execution_count": 2,
 30 |    "metadata": {},
 31 |    "outputs": [],
 32 |    "source": [
 33 |     "boston = datasets.load_boston()"
 34 |    ]
 35 |   },
 36 |   {
 37 |    "cell_type": "code",
 38 |    "execution_count": 3,
 39 |    "metadata": {},
 40 |    "outputs": [],
 41 |    "source": [
 42 |     "#boston.target  # regression"
 43 |    ]
 44 |   },
 45 |   {
 46 |    "cell_type": "code",
 47 |    "execution_count": 4,
 48 |    "metadata": {},
 49 |    "outputs": [],
 50 |    "source": [
 51 |     "x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=0)"
 52 |    ]
 53 |   },
 54 |   {
 55 |    "cell_type": "code",
 56 |    "execution_count": 5,
 57 |    "metadata": {},
 58 |    "outputs": [
 59 |     {
 60 |      "name": "stdout",
 61 |      "output_type": "stream",
 62 |      "text": [
 63 |       "MSE = 17.50611023622048\n"
 64 |      ]
 65 |     },
 66 |     {
 67 |      "name": "stderr",
 68 |      "output_type": "stream",
 69 |      "text": [
 70 |       "C:\\Users\\User\\Anaconda3\\lib\\site-packages\\sklearn\\ensemble\\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.\n",
 71 |       "  \"10 in version 0.20 to 100 in 0.22.\", FutureWarning)\n"
 72 |      ]
 73 |     }
 74 |    ],
 75 |    "source": [
 76 |     "modelR=RandomForestRegressor()\n",
 77 |     "modelR.fit(x_train, y_train)\n",
 78 |     "pred=modelR.predict(x_test)\n",
 79 |     "print(f'MSE = {mean_squared_error(y_test, pred)}')"
 80 |    ]
 81 |   },
 82 |   {
 83 |    "cell_type": "code",
 84 |    "execution_count": 6,
 85 |    "metadata": {},
 86 |    "outputs": [
 87 |     {
 88 |      "data": {
 89 |       "text/html": [
 90 |        "<div>\n",
 91 |        "<style scoped>\n",
 92 |        "    .dataframe tbody tr th:only-of-type {\n",
 93 |        "        vertical-align: middle;\n",
 94 |        "    }\n",
 95 |        "\n",
 96 |        "    .dataframe tbody tr th {\n",
 97 |        "        vertical-align: top;\n",
 98 |        "    }\n",
 99 |        "\n",
100 |        "    .dataframe thead th {\n",
101 |        "        text-align: right;\n",
102 |        "    }\n",
103 |        "</style>\n",
104 |        "<table border=\"1\" class=\"dataframe\">\n",
105 |        "  <thead>\n",
106 |        "    <tr style=\"text-align: right;\">\n",
107 |        "      <th></th>\n",
108 |        "      <th>CRIM</th>\n",
109 |        "      <th>ZN</th>\n",
110 |        "      <th>INDUS</th>\n",
111 |        "      <th>CHAS</th>\n",
112 |        "      <th>NOX</th>\n",
113 |        "      <th>RM</th>\n",
114 |        "      <th>AGE</th>\n",
115 |        "      <th>DIS</th>\n",
116 |        "      <th>RAD</th>\n",
117 |        "      <th>TAX</th>\n",
118 |        "      <th>PTRATIO</th>\n",
119 |        "      <th>B</th>\n",
120 |        "      <th>LSTAT</th>\n",
121 |        "    </tr>\n",
122 |        "  </thead>\n",
123 |        "  <tbody>\n",
124 |        "    <tr>\n",
125 |        "      <th>0</th>\n",
126 |        "      <td>0.04191</td>\n",
127 |        "      <td>0.000441</td>\n",
128 |        "      <td>0.005722</td>\n",
129 |        "      <td>0.000459</td>\n",
130 |        "      <td>0.012075</td>\n",
131 |        "      <td>0.440251</td>\n",
132 |        "      <td>0.011111</td>\n",
133 |        "      <td>0.031247</td>\n",
134 |        "      <td>0.002155</td>\n",
135 |        "      <td>0.024146</td>\n",
136 |        "      <td>0.01939</td>\n",
137 |        "      <td>0.01144</td>\n",
138 |        "      <td>0.399652</td>\n",
139 |        "    </tr>\n",
140 |        "  </tbody>\n",
141 |        "</table>\n",
142 |        "</div>"
143 |       ],
144 |       "text/plain": [
145 |        "      CRIM        ZN     INDUS      CHAS       NOX        RM       AGE  \\\n",
146 |        "0  0.04191  0.000441  0.005722  0.000459  0.012075  0.440251  0.011111   \n",
147 |        "\n",
148 |        "        DIS       RAD       TAX  PTRATIO        B     LSTAT  \n",
149 |        "0  0.031247  0.002155  0.024146  0.01939  0.01144  0.399652  "
150 |       ]
151 |      },
152 |      "execution_count": 6,
153 |      "metadata": {},
154 |      "output_type": "execute_result"
155 |     }
156 |    ],
157 |    "source": [
158 |     "F1=modelR.feature_importances_\n",
159 |     "F2=boston.feature_names\n",
160 |     "f= pd.DataFrame([F1], columns=F2)\n",
161 |     "f"
162 |    ]
163 |   },
164 |   {
165 |    "cell_type": "code",
166 |    "execution_count": 7,
167 |    "metadata": {},
168 |    "outputs": [
169 |     {
170 |      "name": "stdout",
171 |      "output_type": "stream",
172 |      "text": [
173 |       "MSE of RFR = 17.50611023622048\n",
174 |       "MSE of DTR = 26.58944881889764\n"
175 |      ]
176 |     }
177 |    ],
178 |    "source": [
179 |     "modelD=DecisionTreeRegressor()\n",
180 |     "modelD.fit(x_train, y_train)\n",
181 |     "predD=modelD.predict(x_test)\n",
182 |     "print(f'MSE of RFR = {mean_squared_error(y_test, pred)}')\n",
183 |     "print(f'MSE of DTR = {mean_squared_error(y_test, predD)}')"
184 |    ]
185 |   },
186 |   {
187 |    "cell_type": "code",
188 |    "execution_count": 8,
189 |    "metadata": {},
190 |    "outputs": [
191 |     {
192 |      "name": "stdout",
193 |      "output_type": "stream",
194 |      "text": [
195 |       "MSE = 15.637565354330707\n"
196 |      ]
197 |     },
198 |     {
199 |      "name": "stderr",
200 |      "output_type": "stream",
201 |      "text": [
202 |       "C:\\Users\\User\\Anaconda3\\lib\\site-packages\\sklearn\\ensemble\\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.\n",
203 |       "  \"10 in version 0.20 to 100 in 0.22.\", FutureWarning)\n"
204 |      ]
205 |     }
206 |    ],
207 |    "source": [
208 |     "modelR=RandomForestRegressor() #(max_features='sqrt') \n",
209 |     "# it seems default setting is not bad ~\n",
210 |     "modelR.fit(x_train, y_train)\n",
211 |     "pred=modelR.predict(x_test)\n",
212 |     "print(f'MSE = {mean_squared_error(y_test, pred)}')"
213 |    ]
214 |   },
215 |   {
216 |    "cell_type": "code",
217 |    "execution_count": null,
218 |    "metadata": {},
219 |    "outputs": [],
220 |    "source": []
221 |   },
222 |   {
223 |    "cell_type": "code",
224 |    "execution_count": 9,
225 |    "metadata": {},
226 |    "outputs": [],
227 |    "source": [
228 |     "# Wine"
229 |    ]
230 |   },
231 |   {
232 |    "cell_type": "code",
233 |    "execution_count": 10,
234 |    "metadata": {},
235 |    "outputs": [],
236 |    "source": [
237 |     "wine = datasets.load_wine()\n",
238 |     "x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.25, random_state=0)"
239 |    ]
240 |   },
241 |   {
242 |    "cell_type": "code",
243 |    "execution_count": 11,
244 |    "metadata": {},
245 |    "outputs": [
246 |     {
247 |      "name": "stdout",
248 |      "output_type": "stream",
249 |      "text": [
250 |       " Random Forest accuracy: 0.9777777777777777\n",
251 |       " Decision Tree accuracy: 0.9555555555555556\n"
252 |      ]
253 |     },
254 |     {
255 |      "name": "stderr",
256 |      "output_type": "stream",
257 |      "text": [
258 |       "C:\\Users\\User\\Anaconda3\\lib\\site-packages\\sklearn\\ensemble\\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.\n",
259 |       "  \"10 in version 0.20 to 100 in 0.22.\", FutureWarning)\n"
260 |      ]
261 |     }
262 |    ],
263 |    "source": [
264 |     "ModelRF = RandomForestClassifier()\n",
265 |     "ModelRF.fit(x_train, y_train)\n",
266 |     "predR=ModelRF.predict(x_test)\n",
267 |     "print(f' Random Forest accuracy: {metrics.accuracy_score(y_test, predR)}')\n",
268 |     "\n",
269 |     "modelD = DecisionTreeClassifier()\n",
270 |     "modelD.fit(x_train, y_train)\n",
271 |     "predD=modelD.predict(x_test)\n",
272 |     "print(f' Decision Tree accuracy: {metrics.accuracy_score(y_test, predD)}')"
273 |    ]
274 |   }
275 |  ],
276 |  "metadata": {
277 |   "kernelspec": {
278 |    "display_name": "Python 3",
279 |    "language": "python",
280 |    "name": "python3"
281 |   },
282 |   "language_info": {
283 |    "codemirror_mode": {
284 |     "name": "ipython",
285 |     "version": 3
286 |    },
287 |    "file_extension": ".py",
288 |    "mimetype": "text/x-python",
289 |    "name": "python",
290 |    "nbconvert_exporter": "python",
291 |    "pygments_lexer": "ipython3",
292 |    "version": "3.6.7"
293 |   }
294 |  },
295 |  "nbformat": 4,
296 |  "nbformat_minor": 2
297 | }
298 | 


--------------------------------------------------------------------------------
/Day_055_HW.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Homework\n",
  8 |     "### Try running k-means on the iris data (datasets.load_iris()); you can test different numbers of clusters, init, etc."
  9 |    ]
 10 |   },
 11 |   {
 12 |    "cell_type": "code",
 13 |    "execution_count": 1,
 14 |    "metadata": {},
 15 |    "outputs": [],
 16 |    "source": [
 17 |     "from sklearn import datasets\n",
 18 |     "\n",
 19 |     "iris = datasets.load_iris()\n",
 20 |     "X = iris.data\n",
 21 |     "y = iris.target"
 22 |    ]
 23 |   },
 24 |   {
 25 |    "cell_type": "markdown",
 26 |    "metadata": {},
 27 |    "source": [
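 28 |     "### Load the relevant packages and run the k-means experiments ...\n",
 29 |     "\n",
 30 |     "- Experiment with different numbers of clusters\n",
 31 |     "- Experiment with different initial values\n",
 32 |     "- Present the results"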
 33 |    ]
 34 |   },
 35 |   {
 36 |    "cell_type": "code",
 37 |    "execution_count": 2,
 38 |    "metadata": {},
 39 |    "outputs": [
 40 |     {
 41 |      "data": {
 42 |       "text/plain": [
 43 |        "array([[5.1, 3.5, 1.4, 0.2],\n",
 44 |        "       [4.9, 3. , 1.4, 0.2],\n",
 45 |        "       [4.7, 3.2, 1.3, 0.2],\n",
 46 |        "       [4.6, 3.1, 1.5, 0.2],\n",
 47 |        "       [5. , 3.6, 1.4, 0.2]])"
 48 |       ]
 49 |      },
 50 |      "execution_count": 2,
 51 |      "metadata": {},
 52 |      "output_type": "execute_result"
 53 |     }
 54 |    ],
 55 |    "source": [
 56 |     "X[:5]"
 57 |    ]
 58 |   },
 59 |   {
 60 |    "cell_type": "code",
 61 |    "execution_count": 3,
 62 |    "metadata": {},
 63 |    "outputs": [
 64 |     {
 65 |      "data": {
 66 |       "text/plain": [
 67 |        "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
 68 |        "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
 69 |        "       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
 70 |        "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
 71 |        "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
 72 |        "       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
 73 |        "       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])"
 74 |       ]
 75 |      },
 76 |      "execution_count": 3,
 77 |      "metadata": {},
 78 |      "output_type": "execute_result"
 79 |     }
 80 |    ],
 81 |    "source": [
 82 |     "y"
 83 |    ]
 84 |   },
 85 |   {
 86 |    "cell_type": "code",
 87 |    "execution_count": 4,
 88 |    "metadata": {},
 89 |    "outputs": [],
 90 |    "source": [
 91 |     "from sklearn.cluster import KMeans"
 92 |    ]
 93 |   },
 94 |   {
 95 |    "cell_type": "markdown",
 96 |    "metadata": {},
 97 |    "source": [
 98 |     "### (n_clusters, n_init) = (3, 1)"
100 |    ]
101 |   },
102 |   {
103 |    "cell_type": "code",
104 |    "execution_count": 5,
105 |    "metadata": {},
106 |    "outputs": [
107 |     {
108 |      "data": {
109 |       "text/plain": [
110 |        "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
111 |        "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
112 |        "       0, 0, 0, 0, 0, 0, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
113 |        "       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
114 |        "       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1,\n",
115 |        "       1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1,\n",
116 |        "       1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2])"
117 |       ]
118 |      },
119 |      "execution_count": 5,
120 |      "metadata": {},
121 |      "output_type": "execute_result"
122 |     }
123 |    ],
124 |    "source": [
125 |     "model=KMeans(n_clusters=3, n_init=1, random_state=0)\n",
126 |     "model.fit_predict(X)"
127 |    ]
128 |   },
129 |   {
130 |    "cell_type": "markdown",
131 |    "metadata": {},
132 |    "source": [
133 |     "### (n_clusters, n_init) = (3, 2)"
135 |    ]
136 |   },
137 |   {
138 |    "cell_type": "code",
139 |    "execution_count": 6,
140 |    "metadata": {},
141 |    "outputs": [
142 |     {
143 |      "data": {
144 |       "text/plain": [
145 |        "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
146 |        "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
147 |        "       1, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
148 |        "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
149 |        "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,\n",
150 |        "       2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2,\n",
151 |        "       2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])"
152 |       ]
153 |      },
154 |      "execution_count": 6,
155 |      "metadata": {},
156 |      "output_type": "execute_result"
157 |     }
158 |    ],
159 |    "source": [
160 |     "model=KMeans(n_clusters=3, n_init=2, random_state=0)\n",
161 |     "model.fit_predict(X)"
162 |    ]
163 |   },
164 |   {
165 |    "cell_type": "markdown",
166 |    "metadata": {},
167 |    "source": [
168 |     "### (n_clusters, n_init) = (3, 3)"
170 |    ]
171 |   },
172 |   {
173 |    "cell_type": "code",
174 |    "execution_count": 7,
175 |    "metadata": {},
176 |    "outputs": [
177 |     {
178 |      "data": {
179 |       "text/plain": [
180 |        "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
181 |        "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
182 |        "       1, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
183 |        "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
184 |        "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,\n",
185 |        "       2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2,\n",
186 |        "       2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])"
187 |       ]
188 |      },
189 |      "execution_count": 7,
190 |      "metadata": {},
191 |      "output_type": "execute_result"
192 |     }
193 |    ],
194 |    "source": [
195 |     "model=KMeans(n_clusters=3, n_init=3, random_state=0)\n",
196 |     "model.fit_predict(X)"
197 |    ]
198 |   },
199 |   {
200 |    "cell_type": "markdown",
201 |    "metadata": {},
202 |    "source": [
203 |     "### (n_clusters, n_init) = (4, 1)"
205 |    ]
206 |   },
207 |   {
208 |    "cell_type": "code",
209 |    "execution_count": 9,
210 |    "metadata": {},
211 |    "outputs": [
212 |     {
213 |      "data": {
214 |       "text/plain": [
215 |        "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
216 |        "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
217 |        "       0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 2, 1, 2, 1,\n",
218 |        "       2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 2,\n",
219 |        "       2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 1, 3, 1, 3, 3, 2, 3, 3, 3,\n",
220 |        "       1, 1, 3, 1, 1, 1, 1, 3, 3, 1, 3, 1, 3, 1, 3, 3, 1, 1, 1, 3, 3, 3,\n",
221 |        "       1, 1, 1, 3, 3, 1, 1, 3, 3, 1, 1, 3, 3, 1, 1, 1, 1, 1])"
222 |       ]
223 |      },
224 |      "execution_count": 9,
225 |      "metadata": {},
226 |      "output_type": "execute_result"
227 |     }
228 |    ],
229 |    "source": [
230 |     "model=KMeans(n_clusters=4, n_init=1, random_state=0)\n",
231 |     "model.fit_predict(X)"
232 |    ]
233 |   },
234 |   {
235 |    "cell_type": "markdown",
236 |    "metadata": {},
237 |    "source": [
238 |     "### (n_clusters, n_init) = (4, 2)"
240 |    ]
241 |   },
242 |   {
243 |    "cell_type": "code",
244 |    "execution_count": 11,
245 |    "metadata": {},
246 |    "outputs": [
247 |     {
248 |      "data": {
249 |       "text/plain": [
250 |        "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
251 |        "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
252 |        "       0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 2, 1, 2, 1,\n",
253 |        "       2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 2,\n",
254 |        "       2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 1, 3, 1, 3, 3, 2, 3, 3, 3,\n",
255 |        "       1, 1, 3, 1, 1, 1, 1, 3, 3, 1, 3, 1, 3, 1, 3, 3, 1, 1, 1, 3, 3, 3,\n",
256 |        "       1, 1, 1, 3, 3, 1, 1, 3, 3, 1, 1, 3, 3, 1, 1, 1, 1, 1])"
257 |       ]
258 |      },
259 |      "execution_count": 11,
260 |      "metadata": {},
261 |      "output_type": "execute_result"
262 |     }
263 |    ],
264 |    "source": [
265 |     "model=KMeans(n_clusters=4, n_init=2, random_state=0)\n",
266 |     "model.fit_predict(X)"
267 |    ]
268 |   },
269 |   {
270 |    "cell_type": "markdown",
271 |    "metadata": {},
272 |    "source": [
273 |     "### (n_clusters, n_init) = (4, 3)"
275 |    ]
276 |   },
277 |   {
278 |    "cell_type": "code",
279 |    "execution_count": 12,
280 |    "metadata": {},
281 |    "outputs": [
282 |     {
283 |      "data": {
284 |       "text/plain": [
285 |        "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
286 |        "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
287 |        "       1, 1, 1, 1, 1, 1, 3, 3, 3, 0, 3, 0, 3, 0, 3, 0, 0, 0, 0, 3, 0, 3,\n",
288 |        "       0, 0, 3, 0, 3, 0, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 3, 0, 3, 3, 3,\n",
289 |        "       0, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 0, 2, 3, 2, 2, 2, 2, 0, 2, 2, 2,\n",
290 |        "       3, 3, 2, 3, 3, 2, 2, 2, 2, 3, 2, 3, 2, 3, 2, 2, 3, 3, 2, 2, 2, 2,\n",
291 |        "       2, 3, 3, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 2, 3, 3, 2, 3])"
292 |       ]
293 |      },
294 |      "execution_count": 12,
295 |      "metadata": {},
296 |      "output_type": "execute_result"
297 |     }
298 |    ],
299 |    "source": [
300 |     "model=KMeans(n_clusters=4, n_init=3,random_state=0)\n",
301 |     "model.fit_predict(X)"
302 |    ]
303 |   }
304 |  ],
305 |  "metadata": {
306 |   "kernelspec": {
307 |    "display_name": "Python 3",
308 |    "language": "python",
309 |    "name": "python3"
310 |   },
311 |   "language_info": {
312 |    "codemirror_mode": {
313 |     "name": "ipython",
314 |     "version": 3
315 |    },
316 |    "file_extension": ".py",
317 |    "mimetype": "text/x-python",
318 |    "name": "python",
319 |    "nbconvert_exporter": "python",
320 |    "pygments_lexer": "ipython3",
321 |    "version": "3.6.7"
322 |   }
323 |  },
324 |  "nbformat": 4,
325 |  "nbformat_minor": 2
326 | }
327 | 
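
The last cells of this notebook refit `KMeans` with `n_clusters=4` while changing only `n_init`. As a minimal sketch of what that parameter does, the snippet below compares the final inertia for several `n_init` values. It assumes synthetic data from `make_blobs` (the name `X_demo` is hypothetical and stands in for the notebook's `X`, which is not reproduced here).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the notebook's X (an assumption, not the original data)
X_demo, _ = make_blobs(n_samples=150, centers=4, n_features=4, random_state=0)

inertias = {}
for n_init in (1, 2, 3, 10):
    # n_init = number of independent centroid initializations;
    # the run with the lowest inertia is the one that is kept
    model = KMeans(n_clusters=4, n_init=n_init, random_state=0)
    labels = model.fit_predict(X_demo)
    inertias[n_init] = model.inertia_
    # Cluster sizes are easier to compare than raw label arrays
    print(n_init, round(model.inertia_, 3), np.bincount(labels))
```

Cluster ids are arbitrary, so two fits can describe the same partition with permuted labels. Comparing inertia or cluster sizes is usually more informative than comparing the label arrays element by element.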


--------------------------------------------------------------------------------
/Day_016_HW.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "## Practice time\n",
  8 |     "Save your results as a CSV file and upload your first Kaggle submission\n",
  9 |     "\n",
 10 |     "Hints: https://stackoverflow.com/questions/16923281/pandas-writing-dataframe-to-csv-file"
 11 |    ]
 12 |   },
 13 |   {
 14 |    "cell_type": "code",
 15 |    "execution_count": 1,
 16 |    "metadata": {},
 17 |    "outputs": [],
 18 |    "source": [
 19 |     "# Import the required packages\n",
 20 |     "import os\n",
 21 |     "import numpy as np \n",
 22 |     "import pandas as pd\n",
 23 |     "\n",
 24 |     "import matplotlib.pyplot as plt\n",
 25 |     "%matplotlib inline\n",
 26 |     "import warnings\n",
 27 |     "warnings.filterwarnings('ignore')"
 28 |    ]
 29 |   },
 30 |   {
 31 |    "cell_type": "markdown",
 32 |    "metadata": {},
 33 |    "source": [
 34 |     "### Preprocessing done previously"
 35 |    ]
 36 |   },
 37 |   {
 38 |    "cell_type": "code",
 39 |    "execution_count": 2,
 40 |    "metadata": {},
 41 |    "outputs": [],
 42 |    "source": [
 43 |     "# Set data_path\n",
 44 |     "dir_data = '../part01/'\n",
 45 |     "f_app_train = os.path.join(dir_data, 'application_train.csv')\n",
 46 |     "f_app_test = os.path.join(dir_data, 'application_test.csv')\n",
 47 |     "\n",
 48 |     "app_train = pd.read_csv(f_app_train)\n",
 49 |     "app_test = pd.read_csv(f_app_test)\n",
 50 |     "\n",
 51 |     "from sklearn.preprocessing import LabelEncoder\n",
 52 |     "\n",
 53 |     "# Create a label encoder object\n",
 54 |     "le = LabelEncoder()\n",
 55 |     "le_count = 0\n",
 56 |     "\n",
 57 |     "# Iterate through the columns\n",
 58 |     "for col in app_train:\n",
 59 |     "    if app_train[col].dtype == 'object':\n",
 60 |     "        # If 2 or fewer unique categories\n",
 61 |     "        if len(list(app_train[col].unique())) <= 2:\n",
 62 |     "            # Train on the training data\n",
 63 |     "            le.fit(app_train[col])\n",
 64 |     "            # Transform both training and testing data\n",
 65 |     "            app_train[col] = le.transform(app_train[col])\n",
 66 |     "            app_test[col] = le.transform(app_test[col])\n",
 67 |     "            \n",
 68 |     "            # Keep track of how many columns were label encoded\n",
 69 |     "            le_count += 1\n",
 70 |     "            \n",
 71 |     "app_train = pd.get_dummies(app_train)\n",
 72 |     "app_test = pd.get_dummies(app_test)\n",
 73 |     "\n",
 74 |     "# Create an anomalous flag column\n",
 75 |     "app_train['DAYS_EMPLOYED_ANOM'] = app_train[\"DAYS_EMPLOYED\"] == 365243\n",
 76 |     "app_train['DAYS_EMPLOYED'].replace({365243: np.nan}, inplace = True)\n",
 77 |     "# also apply to testing dataset\n",
 78 |     "app_test['DAYS_EMPLOYED_ANOM'] = app_test[\"DAYS_EMPLOYED\"] == 365243\n",
 79 |     "app_test[\"DAYS_EMPLOYED\"].replace({365243: np.nan}, inplace = True)\n",
 80 |     "\n",
 81 |     "# absolute the value of DAYS_BIRTH\n",
 82 |     "app_train['DAYS_BIRTH'] = abs(app_train['DAYS_BIRTH'])\n",
 83 |     "app_test['DAYS_BIRTH'] = abs(app_test['DAYS_BIRTH'])\n"
 84 |    ]
 85 |   },
 86 |   {
 87 |    "cell_type": "markdown",
 88 |    "metadata": {},
 89 |    "source": [
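 90 |     "### Finish the preprocessing\n",
 91 |     "Before fitting the model, we must make sure the training and testing data have the same columns. One-hot encoding creates extra columns, and some categories appear in the training data but not in the testing data, so those extra columns have to be removed."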
 92 |    ]
 93 |   },
 94 |   {
 95 |    "cell_type": "code",
 96 |    "execution_count": 3,
 97 |    "metadata": {},
 98 |    "outputs": [],
 99 |    "source": [
100 |     "train_labels = app_train['TARGET']\n",
101 |     "\n",
102 |     "# Align the training and testing data, keep only columns present in both dataframes\n",
103 |     "app_train, app_test = app_train.align(app_test, join = 'inner', axis = 1)\n",
104 |     "\n",
105 |     "# Add the target back in\n",
106 |     "app_train['TARGET'] = train_labels"
107 |    ]
108 |   },
109 |   {
110 |    "cell_type": "code",
111 |    "execution_count": 4,
112 |    "metadata": {},
113 |    "outputs": [
114 |     {
115 |      "name": "stderr",
116 |      "output_type": "stream",
117 |      "text": [
118 |       "C:\\Users\\User\\Anaconda3\\lib\\site-packages\\sklearn\\utils\\deprecation.py:58: DeprecationWarning: Class Imputer is deprecated; Imputer was deprecated in version 0.20 and will be removed in 0.22. Import impute.SimpleImputer from sklearn instead.\n",
119 |       "  warnings.warn(msg, category=DeprecationWarning)\n"
120 |      ]
121 |     },
122 |     {
123 |      "name": "stdout",
124 |      "output_type": "stream",
125 |      "text": [
126 |       "Training data shape:  (307511, 240)\n",
127 |       "Testing data shape:  (48744, 240)\n"
128 |      ]
129 |     }
130 |    ],
131 |    "source": [
132 |     "from sklearn.preprocessing import MinMaxScaler, Imputer\n",
133 |     "\n",
134 |     "# Drop the target from the training data\n",
135 |     "if 'TARGET' in app_train:\n",
136 |     "    train = app_train.drop(labels = ['TARGET'], axis=1)\n",
137 |     "else:\n",
138 |     "    train = app_train.copy()\n",
139 |     "    \n",
140 |     "# Feature names\n",
141 |     "features = list(train.columns)\n",
142 |     "\n",
143 |     "# Copy of the testing data\n",
144 |     "test = app_test.copy()\n",
145 |     "\n",
146 |     "# Median imputation of missing values\n",
147 |     "imputer = Imputer(strategy = 'median')\n",
148 |     "\n",
149 |     "# Scale each feature to 0-1\n",
150 |     "scaler = MinMaxScaler(feature_range = (0, 1))\n",
151 |     "\n",
152 |     "# Fit on the training data\n",
153 |     "imputer.fit(train)\n",
154 |     "\n",
155 |     "# Transform both training and testing data\n",
156 |     "train = imputer.transform(train)\n",
157 |     "test = imputer.transform(app_test)\n",
158 |     "\n",
159 |     "# Repeat with the scaler\n",
160 |     "scaler.fit(train)\n",
161 |     "train = scaler.transform(train)\n",
162 |     "test = scaler.transform(test)\n",
163 |     "\n",
164 |     "print('Training data shape: ', train.shape)\n",
165 |     "print('Testing data shape: ', test.shape)"
166 |    ]
167 |   },
168 |   {
169 |    "cell_type": "markdown",
170 |    "metadata": {},
171 |    "source": [
172 |     "### Fit the model"
173 |    ]
174 |   },
175 |   {
176 |    "cell_type": "code",
177 |    "execution_count": 5,
178 |    "metadata": {},
179 |    "outputs": [
180 |     {
181 |      "data": {
182 |       "text/plain": [
183 |        "LogisticRegression(C=0.0001, class_weight=None, dual=False,\n",
184 |        "          fit_intercept=True, intercept_scaling=1, max_iter=100,\n",
185 |        "          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,\n",
186 |        "          solver='warn', tol=0.0001, verbose=0, warm_start=False)"
187 |       ]
188 |      },
189 |      "execution_count": 5,
190 |      "metadata": {},
191 |      "output_type": "execute_result"
192 |     }
193 |    ],
194 |    "source": [
195 |     "from sklearn.linear_model import LogisticRegression\n",
196 |     "\n",
197 |     "# Make the model with the specified regularization parameter\n",
198 |     "log_reg = LogisticRegression(C = 0.0001)\n",
199 |     "\n",
200 |     "# Train on the training data\n",
201 |     "log_reg.fit(train, train_labels)"
202 |    ]
203 |   },
204 |   {
205 |    "cell_type": "markdown",
206 |    "metadata": {},
207 |    "source": [
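208 |     "Once the model is fitted, it can be used to predict the probability that each customer in the testing data defaults or pays late on the loan. (Remember to use predict_proba to get probabilities.)"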
209 |    ]
210 |   },
211 |   {
212 |    "cell_type": "code",
213 |    "execution_count": 6,
214 |    "metadata": {},
215 |    "outputs": [],
216 |    "source": [
217 |     "# Make predictions\n",
218 |     "# Make sure to select the second column only\n",
219 |     "log_reg_pred = log_reg.predict_proba(test)[:, 1]"
220 |    ]
221 |   },
222 |   {
223 |    "cell_type": "code",
224 |    "execution_count": 7,
225 |    "metadata": {},
226 |    "outputs": [
227 |     {
228 |      "data": {
229 |       "text/plain": [
230 |        "0.9192711805431351"
231 |       ]
232 |      },
233 |      "execution_count": 7,
234 |      "metadata": {},
235 |      "output_type": "execute_result"
236 |     }
237 |    ],
238 |    "source": [
239 |     "log_reg.score(train, train_labels)  # training accuracy; other C values tried: 0.0007 and 0.002 give the same score, 0.005 and 0.01 are worse, 0.001 is better"
240 |    ]
241 |   },
242 |   {
243 |    "cell_type": "markdown",
244 |    "metadata": {},
245 |    "source": [
246 |     "### Save the predictions"
247 |    ]
248 |   },
249 |   {
250 |    "cell_type": "code",
251 |    "execution_count": 8,
252 |    "metadata": {},
253 |    "outputs": [
254 |     {
255 |      "data": {
256 |       "text/html": [
257 |        "<div>\n",
258 |        "<style scoped>\n",
259 |        "    .dataframe tbody tr th:only-of-type {\n",
260 |        "        vertical-align: middle;\n",
261 |        "    }\n",
262 |        "\n",
263 |        "    .dataframe tbody tr th {\n",
264 |        "        vertical-align: top;\n",
265 |        "    }\n",
266 |        "\n",
267 |        "    .dataframe thead th {\n",
268 |        "        text-align: right;\n",
269 |        "    }\n",
270 |        "</style>\n",
271 |        "<table border=\"1\" class=\"dataframe\">\n",
272 |        "  <thead>\n",
273 |        "    <tr style=\"text-align: right;\">\n",
274 |        "      <th></th>\n",
275 |        "      <th>SK_ID_CURR</th>\n",
276 |        "      <th>TARGET</th>\n",
277 |        "    </tr>\n",
278 |        "  </thead>\n",
279 |        "  <tbody>\n",
280 |        "    <tr>\n",
281 |        "      <th>0</th>\n",
282 |        "      <td>100001</td>\n",
283 |        "      <td>0.065051</td>\n",
284 |        "    </tr>\n",
285 |        "    <tr>\n",
286 |        "      <th>1</th>\n",
287 |        "      <td>100005</td>\n",
288 |        "      <td>0.126401</td>\n",
289 |        "    </tr>\n",
290 |        "    <tr>\n",
291 |        "      <th>2</th>\n",
292 |        "      <td>100013</td>\n",
293 |        "      <td>0.081239</td>\n",
294 |        "    </tr>\n",
295 |        "    <tr>\n",
296 |        "      <th>3</th>\n",
297 |        "      <td>100028</td>\n",
298 |        "      <td>0.061509</td>\n",
299 |        "    </tr>\n",
300 |        "    <tr>\n",
301 |        "      <th>4</th>\n",
302 |        "      <td>100038</td>\n",
303 |        "      <td>0.128308</td>\n",
304 |        "    </tr>\n",
305 |        "  </tbody>\n",
306 |        "</table>\n",
307 |        "</div>"
308 |       ],
309 |       "text/plain": [
310 |        "   SK_ID_CURR    TARGET\n",
311 |        "0      100001  0.065051\n",
312 |        "1      100005  0.126401\n",
313 |        "2      100013  0.081239\n",
314 |        "3      100028  0.061509\n",
315 |        "4      100038  0.128308"
316 |       ]
317 |      },
318 |      "execution_count": 8,
319 |      "metadata": {},
320 |      "output_type": "execute_result"
321 |     }
322 |    ],
323 |    "source": [
324 |     "# Submission dataframe\n",
325 |     "submit = app_test[['SK_ID_CURR']].copy()\n",
326 |     "submit['TARGET'] = log_reg_pred\n",
327 |     "\n",
328 |     "submit.head()"
329 |    ]
330 |   },
331 |   {
332 |    "cell_type": "code",
333 |    "execution_count": 9,
334 |    "metadata": {},
335 |    "outputs": [],
336 |    "source": [
337 |     "submit.to_csv('submit.csv', index=False)"
338 |    ]
339 |   },
340 |   {
341 |    "cell_type": "code",
342 |    "execution_count": 10,
343 |    "metadata": {},
344 |    "outputs": [
345 |     {
346 |      "name": "stdout",
347 |      "output_type": "stream",
348 |      "text": [
349 |       "0.9192744324593266\n"
350 |      ]
351 |     }
352 |    ],
353 |    "source": [
354 |     "# Refit with weaker regularization (larger C)\n",
355 |     "log_reg001 = LogisticRegression(C = 0.001)\n",
356 |     "\n",
357 |     "# Train on the training data\n",
358 |     "log_reg001.fit(train, train_labels)\n",
359 |     "print(log_reg001.score(train, train_labels))\n",
360 |     "log_reg_pred001 = log_reg001.predict_proba(test)[:, 1]\n",
361 |     "\n",
362 |     "submit = app_test[['SK_ID_CURR']].copy()\n",
363 |     "submit['TARGET'] = log_reg_pred001\n",
364 |     "submit.to_csv('submit001.csv', index=False)"
365 |    ]
366 |   }
367 |  ],
368 |  "metadata": {
369 |   "kernelspec": {
370 |    "display_name": "Python 3",
371 |    "language": "python",
372 |    "name": "python3"
373 |   },
374 |   "language_info": {
375 |    "codemirror_mode": {
376 |     "name": "ipython",
377 |     "version": 3
378 |    },
379 |    "file_extension": ".py",
380 |    "mimetype": "text/x-python",
381 |    "name": "python",
382 |    "nbconvert_exporter": "python",
383 |    "pygments_lexer": "ipython3",
384 |    "version": "3.6.7"
385 |   }
386 |  },
387 |  "nbformat": 4,
388 |  "nbformat_minor": 2
389 | }
390 | 
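
The notebook above goes from one-hot encoding through column alignment, imputation, scaling, and logistic regression to a submission file. The steps can be sketched end to end on toy data as follows. This is a minimal illustration under stated assumptions: the frames `toy_train` and `toy_test` and their `INCOME` and `HOUSING` columns are hypothetical, and `SimpleImputer` is swapped in for the `Imputer` class that the recorded deprecation warning says was slated for removal.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Toy data standing in for application_train / application_test (hypothetical values)
toy_train = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4, 5, 6],
    "INCOME": [10.0, np.nan, 30.0, 40.0, 50.0, 60.0],
    "HOUSING": ["rent", "own", "own", "rent", "parents", "own"],
    "TARGET": [1, 0, 0, 1, 1, 0],
})
toy_test = pd.DataFrame({
    "SK_ID_CURR": [7, 8],
    "INCOME": [25.0, np.nan],
    "HOUSING": ["own", "rent"],
})

labels = toy_train["TARGET"]
# dtype=float keeps every encoded column numeric
train_enc = pd.get_dummies(toy_train.drop(columns="TARGET"), dtype=float)
test_enc = pd.get_dummies(toy_test, dtype=float)

# 'HOUSING_parents' exists only in the training frame; the inner join drops it
train_enc, test_enc = train_enc.align(test_enc, join="inner", axis=1)

# Fit the imputer and scaler on training data only, then apply to both sets
imputer = SimpleImputer(strategy="median")
scaler = MinMaxScaler(feature_range=(0, 1))
train_X = scaler.fit_transform(imputer.fit_transform(train_enc))
test_X = scaler.transform(imputer.transform(test_enc))

model = LogisticRegression(C=0.0001)
model.fit(train_X, labels)

# Column 1 of predict_proba is the probability of the positive class
submit = toy_test[["SK_ID_CURR"]].copy()
submit["TARGET"] = model.predict_proba(test_X)[:, 1]
csv_text = submit.to_csv(index=False)  # index=False: only the id and prediction columns are written
print(csv_text)
```

Passing `index=False` keeps the row index out of the file, so the CSV contains only the `SK_ID_CURR` and `TARGET` columns. The `.copy()` avoids assigning into a slice of the original frame.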


--------------------------------------------------------------------------------
/Day_72-Activation_function_HW.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "metadata": {},
 6 |    "source": [
 7 |     "# Rectified Linear Unit (ReLU)\n",
 8 |     "\n",
 9 |     "f(x)=max(0,x)\n"
10 |    ]
11 |   },
12 |   {
13 |    "cell_type": "code",
14 |    "execution_count": 9,
15 |    "metadata": {},
16 |    "outputs": [
17 |     {
18 |      "data": {
19 |       "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD8CAYAAACMwORRAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3Xl4VOXZx/HvTQj7vkV2VEBBZQtF0L4tWOuCC21dikKrrX2pKHWpte5a7WVba2ulilur1ZZVRC1VXCsp2taFhH2TVYggS4AshJDtfv/I0OaNMZlMZnJmJr/Pdc3FTObMnPshyW+ePHPmPubuiIhIcmkSdAEiIhJ9CncRkSSkcBcRSUIKdxGRJKRwFxFJQgp3EZEkpHAXEUlCCncRkSSkcBcRSUJNg9pxly5dvF+/fhE99tChQ7Ru3Tq6BQVEY4lPyTKWZBkHaCxHZWZm7nP3rrVtF1i49+vXj6VLl0b02IyMDMaOHRvdggKiscSnZBlLsowDNJajzOyTcLbTsoyISBJSuIuIJCGFu4hIElK4i4gkIYW7iEgSqjXczayFmX1oZivMbI2Z3VvNNs3NbJ6ZbTKzD8ysXyyKFRGR8IQzcz8CnOHuQ4FhwDlmNrrKNlcBB9y9P/A74IHolikiInVRa7h7hYLQzdTQpeq5+SYAz4WuvwB8zcwsalWKiCSJ6W9v5JO8spjvx8I5h6qZpQCZQH9ghrvfUuX+1cA57p4dur0ZONXd91XZbgowBSAtLS197ty5ERVdUFBAmzZtInpsvNFY4lOyjCVZxgHJMZZ/flrCH1YVc3Zv57KTIhvLuHHjMt19ZK0bunvYF6ADsBg4ucrX1wC9Kt3eDHSu6bnS09M9UosXL474sfFGY4lPyTKWZBmHe+KPZd2uXD/hzkU+8cl/+9t/fyfi5wGWehh5XaejZdz9IJABnFPlrmygN4CZNQXaA/vr8twiIskqr6iEqTOzaNcild9fNpyUJrFftQ7naJmuZtYhdL0lcCawvspmC4ErQtcvBt4JvcKIiDRq7s5P569k+/5CHr18BF3bNm+Q/YbTOKw78Fxo3b0J8Ly7v2Jm91Hx58FC4GngL2a2iYoZ+8SYVSwikkCefm8rr6/5jDvGD2LUsZ0abL+1hru7rwSGV/P1uytdLwIuiW5pIiKJ7aNt+/nla+s556Rj+MH/HNug+9YnVEVEYmBv/hGmzc6id8eW/PqSITT00eEKdxGRKCstK+e6Ocs4WFjC45PTadcitcFrCOxkHSIiyeqhtz7m31ty+M0lQxnUvV0gNWjmLiISRW+v3c1jGZu5bFRvLk7vFVgdCncRkSjZnlPIj59fzsk923HPBScFWovCXUQkCopKypg6KxOAxyel0yI1JdB6tOYuIhIFP1u4hjU783j6ipH07tQq6HI0cxcRqa/5S3cw96MdXDP2eL42KC3ocgCFu4hIvazdmcedL69mzHGd+fHXBwZdzn8o3EVEIpRXVMI1szJp37KiIVjTlPiJVK25i4hEwN35yfMryD5wmLlTRjdYQ7Bwxc/LjIhIAvnDu1t4c+1ubj33REb2a7iGYOFSuIuI1NEHW3J44PUNjD/lGK76csM2BAuXwl1EpA725Bcxbc4y+nZqxQMXNXxDsHBpzV1EJEylZeVMm72M/KIS/nLVKNoG0BAsXAp3EZEwPfjmBj7cup+HLh3KiccE0xAsXFqWEREJw5trPuPJf2zh8lP78K0RwTUEC5fCXUSkFp/kHOKm+Ss4pWd77j5/cNDlhEXhLiJSg6KSMqbOzKKJGY9NGhF4Q7Bwac1dRKQG9/x1DWt35fHMlfHRECxcmrmLiHyB55fuYN7SHUwb158zToyPhmDhUriLiFRjzc5c7np5Naf378yNcdQQLFwKdxGRKnIPlzB1ZhYdWzVj+sThpDSJzw8q1URr7iIilbg7P5m/gp0HDzPvh6Pp0ia+GoK
Fq9aZu5n1NrPFZrbOzNaY2fXVbDPWzHLNbHnocndsyhURia0nl2zhrbW7uX38INL7xl9DsHCFM3MvBW5y9ywzawtkmtlb7r62ynbvuvv50S9RRKRhvL8lhwff2MB5Q7rzvdP7BV1OvdQ6c3f3Xe6eFbqeD6wDesa6MBGRhrQnr4hps5fRt3N8NwQLl7l7+Bub9QOWACe7e16lr48FFgDZwE7gJ+6+pprHTwGmAKSlpaXPnTs3oqILCgpo06ZNRI+NNxpLfEqWsSTLOCC2Yykrd379URFb88q5Z3RLeraN7bEm9RnLuHHjMt19ZK0buntYF6ANkAl8q5r72gFtQtfHAxtre7709HSP1OLFiyN+bLzRWOJTsowlWcbhHtux/OLVtd73llf8pazsmO2jsvqMBVjqYWR2WC9PZpZKxcx8lru/WM0LRJ67F4SuLwJSzaxLOM8tIhKk11d/xpNLtjB5dB++MTx5VpzDOVrGgKeBde7+0Bdsc0xoO8xsVOh5c6JZqIhItG3dd4ib569gaK/23JUgDcHCFc7RMqcD3wFWmdny0NduB/oAuPsTwMXAVDMrBQ4DE0N/PoiIxKXDxWVMnZlJSooxY9IImjdNjIZg4ao13N39PaDGt43d/VHg0WgVJSISS+7OXX9dzYbd+Txz5Zfo1TFxGoKFS+0HRKTRmffRDl7IzOZH4/oz7oRuQZcTEwp3EWlUVn+ay90L1/Dl/l24/szEawgWLoW7iDQauYUlTJ2VSefWzZg+cVhCNgQLlxqHiUijUF7u3DR/ObsOFjHvh2PonKANwcKlmbuINApPLNnM2+v2cOd5g0jv2zHocmJO4S4iSe9fm/fxmzc2cMHQHlxxWr+gy2kQCncRSWq784q4bs4yju3Sml9965SEbwgWLq25i0jSKikrZ9rsLAqLy5jzv6Np3bzxRF7jGamINDq/fn09H207wPSJwxiQ1jbochqUlmVEJCm9tmoXf3h3K98d05cJw5KnIVi4FO4iknS27C3g5hdWMrR3B+44b1DQ5QRC4S4iSeVwcRnXzMoiNcV4LAkbgoVLa+4ikjTcnTteXsWG3fk8+71R9OzQMuiSAqOZu4gkjTkf7uDFrE+5/msD+OrArkGXEyiFu4gkhVXZufxs4Rq+MrAr150xIOhyAqdwF5GEd7CwmKtnZtKlTTMe/vYwmiRxQ7Bwac1dRBJaeblz47zl7MkvYv7Vp9GpdbOgS4oLmrmLSEJ7LGMTizfs5a7zBzOsd4egy4kbCncRSVj/3LSPh976mAuH9uA7o/sGXU5cUbiLSEL6LLeiIdhxXdvwy0bUECxcWnMXkYRTUlbOtbOzOFxSxrzJIxpVQ7Bw6X9ERBLOLxetJ/OTAzxy2XD6d2tcDcHCpWUZEUkor67cxTP/3MqVp/XjgqE9gi4nbtUa7mbW28wWm9k6M1tjZtdXs42Z2e/NbJOZrTSzEbEpV0Qas817C/jpCysY3qcDt49vnA3BwhXOskwpcJO7Z5lZWyDTzN5y97WVtjkXGBC6nAo8HvpXRCQqjpQ6U2dm0jw1hRmXj6BZUy081KTW/x133+XuWaHr+cA6oGpz5AnAn73C+0AHM+se9WpFpFFyd55de4SNewqYPnEYPRpxQ7Bw1emlz8z6AcOBD6rc1RPYUel2Np9/ARARicisD7bz751l3HjmQP5nQONuCBausI+WMbM2wALgBnfPq3p3NQ/xap5jCjAFIC0tjYyMjPArraSgoCDix8YbjSU+JctYkmEcW3PLuP/9IgZ3dE5ukk1GxqdBl1RvDfF9CSvczSyVimCf5e4vVrNJNtC70u1ewM6qG7n7U8BTACNHjvSxY8fWtV4AMjIyiPSx8UZjiU/JMpZEH8eBQ8Xc8ch7pLVvyTXDjTPGjQu6pKhoiO9LOEfLGPA0sM7dH/qCzRYC3w0dNTMayHX3XVGsU0QamfJy58bnl7M3/wiPTRpBm2b6BGpdhDNzPx34DrDKzJaHvnY70AfA3Z8
AFgHjgU1AIfC96JcqIo3JjMWbyNiwl59/42SG9u5AxuagK0ostYa7u79H9Wvqlbdx4NpoFSUijdu7G/fy0Nsf841hPZh8ap+gy0lIOlBUROLKzoOHuX7ucgZ0a8Mv1BAsYgp3EYkbxaUVDcGKS8t5fHI6rZqp/VWk9D8nInHjF4vWsWz7QR6bNILju7YJupyEppm7iMSFv63YybP/2sb3Tz+W8afoA+71pXAXkcBt2lPArQtWkt63I7eNPzHocpKCwl1EAnXoSClTZ2bSItQQLDVFsRQNWnMXkcC4O7e9uIrNewv4y1Wnckz7FkGXlDT0EikigfnL+5+wcMVOfvz1gZzev0vQ5SQVhbuIBGLZ9gP8/JW1nHFiN64Z2z/ocpKOwl1EGtz+Q8VcOyuLtHYteOjSoTRpog8qRZvW3EWkQZWVOzfMW86+gmJemDqGDq2aBV1SUtLMXUQa1CPvbGTJx3u558LBDOnVIehykpbCXUQazD8+3sv0v2/kW8N7cvkoNQSLJYW7iDSITw8e5oa5yxjYrS33f1MNwWJN4S4iMVdcWs61s7IoKXMenzyCls1Sgi4p6ekNVRGJuftfXcvyHRUNwY5TQ7AGoZm7iMTUwhU7ee7fn3DVl9UQrCEp3EUkZjbuzufWBSsZ2bcjt56rhmANSeEuIjFRcKSUq2dm0qpZCjMmqSFYQ9Oau4hEnbtz64KVbN13iJk/OJW0dmoI1tD0UioiUffcv7bxyspd3HTWCZx2vBqCBUHhLiJRlbX9APcvWsfXTuzG1K8eH3Q5jZbCXUSi5mhDsGPat+ChS4epIViAtOYuIlFRVu5cP3cZOYeKeXHqabRvlRp0SY1arTN3M3vGzPaY2eovuH+smeWa2fLQ5e7olyki8W763zfy7sZ93HfhSZzcs33Q5TR64czcnwUeBf5cwzbvuvv5UalIRBJOxoY9PPLORi5O78W3v9Q76HKEMGbu7r4E2N8AtYhIAso+UMgN85ZzQlpbfj7hZDUEixPRekN1jJmtMLPXzOykKD2niMS5I6VlXDsri7Iy54nJ6WoIFkfM3WvfyKwf8Iq7n1zNfe2AcncvMLPxwHR3H/AFzzMFmAKQlpaWPnfu3IiKLigooE2b5Gg+pLHEp2QZS6zH8ee1R3hneyk/Gt6c9LTYHp+RLN8TqN9Yxo0bl+nuI2vd0N1rvQD9gNVhbrsN6FLbdunp6R6pxYsXR/zYeKOxxKdkGUssx/Hysmzve8srfv+ra2O2j8qS5XviXr+xAEs9jCyu97KMmR1joUU2MxtFxVJPTn2fV0Ti18e787l1wSpG9evEzWefEHQ5Uo1a/44ysznAWKCLmWUD9wCpAO7+BHAxMNXMSoHDwMTQq4uIJKGjDcFaN2/Ko5cPV0OwOFVruLv7ZbXc/ygVh0qKSJJzd25ZsJJt+w4x6wej6aaGYHFLL7kiErZn/7WNV1fu4uazT2TM8Z2DLkdqoHAXkbBkfnKA+19dx5mD0rj6q8cFXY7UQuEuIrXKKTjCtNlZ9OjQkt9eOlQfVEoAahwmIjWqaAi2/L8NwVqqIVgi0MxdRGr08Nsf896mffx8ghqCJRKFu4h8ocXr9/DIO5u4dGQvvv2lPkGXI3WgcBeRau3YX9EQbHD3dtw34XOdRyTOKdxF5HOOlJZx7ewsyt15fPIIWqSqIVii0RuqIvI59/1tLSuzc3nqO+n07dw66HIkApq5i8j/82JWNrM+2M4Pv3ocZ510TNDlSIQU7iLyH+s/y+P2l1Yx6thO3HyWGoIlMoW7iACQX1TC1JlZtG2RyqOXD6epGoIlNK25iwjuzk9fWMn2/YXM/sGpdGurhmCJTi/NIsLT723ltdWf8dOzT+DU49QQLBko3EUauaXb9vOr19Zz1uA0pnxFDcGShcJdpBHbV3CEa2dn0bNjSx68RA3BkonW3EUaqbJy57o5yzhYWMJL14xSQ7Ako3AXaaQeemsD/9qcw68vHsLgHu2CLkeiTMsyIo3
Q39ftZsbizXx7ZG8uHdk76HIkBhTuIo3Mjv2F3DhvOSf1aMe9E04KuhyJEYW7SCNSVFLG1FmZADw+KV0NwZKY1txFGpF7/7aW1Z/m8cfvjqRP51ZBlyMxpJm7SCOxIDObOR9uZ+rY4zlzcFrQ5UiMKdxFGoF1u/K44+VVjDmuMzd9fWDQ5UgDqDXczewZM9tjZqu/4H4zs9+b2SYzW2lmI6JfpohEKq+ohKkzM2nXIpXfX6aGYI1FON/lZ4Fzarj/XGBA6DIFeLz+ZYlINLg7P52/kh0HDvPo5SPo2rZ50CVJA6k13N19CbC/hk0mAH/2Cu8DHcyse7QKFJHIvb6tlNfXfMZt557IqGM7BV2ONKBoHC3TE9hR6XZ26Gu7ovDcIsHZkgH//D3gQVcSkdzDJZyXfZArOjZn0LZ2sC3oiupnyP79sCM5XqCOaToYGBvTfUQj3KvrNFTtb4OZTaFi6Ya0tDQyMjIi2mFBQUHEj403Gkt8KigoYNebfyJt9z/Ib3t80OXUWVm5syPf6dDE6d60mLy9+UGXVG9WVkbu3sNBlxEVJR36xPx3JRrhng1U/vxyL2BndRu6+1PAUwAjR470sWPHRrTDjIwMIn1svNFY4lNGRgbd2zcHH0T7qe8FXU6dlJaVM/npD1iee5A7RjVnyAVnBF1SVCTTz1dOA4wlGm+bLwS+GzpqZjSQ6+5akpHEV5gDrRJvGeA3b37M+1v2c/83TqF3Wx0Z01iFcyjkHODfwAlmlm1mV5nZ1WZ2dWiTRcAWYBPwB+CamFUr0pAKc6BVYp2V6K21u3niH5u5bFQfLkrvFXQ5EqBal2Xc/bJa7nfg2qhVJBIvEizct+cU8uPnl3Nyz3bcc8HgoMuRgOlvNpHqeBkcPpAw4X60IVgTMzUEE0CNw0SqlVpyCPCECfefLVzDmp15PHPlSHp3UkMw0cxdpFqpJXkVVxLgDdX5S3cw96MdXDvueM44UQ3BpILCXaQaqSW5FVfifOa+dmced768mtOO78yPv35C0OVIHFG4i1TjvzP3+A333MMlTJ2VSYdWqUyfOJyUJtV9nlAaK625i1QjtST0ic44DXd35+b5K/j0wGHmThmthmDyOZq5i1Qj3tfcn1qyhTfX7ua28YMY2S8+a5RgKdxFqpFakgeprSG1ZdClfM4HW3L49RsbOO+U7nz/9H5BlyNxSuEuUo3Ukry4XJLZk1fEtDnL6NupFb+66BTMtM4u1dOau0g1KsI9vpY7SsvKmTZnGQVFpcy86lTatkgNuiSJY5q5i1QjHmfuD765gQ+37ucX3zqZE45pG3Q5EucU7iLVSC3Jj6twf2PNZzz5jy1MOrUP3xyuhmBSO4W7SDXiaea+bd8hfvL8Cob0as/daggmYVK4i1RVWkzTssK4CPeKhmBZNGlizLh8BM2bqiGYhEdvqIpUdTh0PvjWwYf7XS+vZt2uPP505ZfUEEzqRDN3kaoKcyr+DXjmPu+j7czPzOZHZ/Rn3IndAq1FEo/CXaSqOAj3NTtzueuva/hy/y7ccObAwOqQxKVwF6kq4HDPPVzC1JlZdGrVjOkTh6khmEREa+4iVQUY7uXlzk3Pr2DnwcPM++EYOrdRQzCJjGbuIlUVht5QbdmxwXf95JItvL1uN7ePH0R634bfvyQPhbtIVYU5lDRtDSkN+/H+f2/O4cE31nPekO58Tw3BpJ4U7iJVFeZQktquQXe5J6+IH81ZRr8urXngoiFqCCb1pjV3kaoaONxLysqZNnsZh46UMvt/T6VNc/1aSv3pp0ikqgYO9wff2MCH2/YzfeIwBqapIZhER1jLMmZ2jpltMLNNZnZrNfdfaWZ7zWx56PKD6Jcq0kAO5VCS2jAh+/rqz3hqyRa+M7ovE4b1bJB9SuNQ68zdzFKAGcDXgWzgIzNb6O5rq2w6z92nxaBGkYZVmENJ+9jP3LfuO8TN81cwtHcH7jx/UMz3J41LODP3UcAmd9/i7sXAXGBCbMsSCUh
xIZQejvmyzOHiMqbOzCQlxZhx+XA1BJOoCyfcewI7Kt3ODn2tqovMbKWZvWBmvaNSnUhDC32AKZbh7u7c+fJqNuzO5+FvD6NXRzUEk+gL5w3V6o7J8iq3/wbMcfcjZnY18BxwxueeyGwKMAUgLS2NjIyMulUbUlBQEPFj443GEl/a5G9mJJBfmhqzsWTsKGHBmmImHJ8Ku9aSsavqCmf0JMP35CiNpY7cvcYLMAZ4o9Lt24Dbatg+Bcit7XnT09M9UosXL474sfFGY4kzG992v6edZ778WEyeflX2QR9wxyKf/Mf3vbSsPCb7qCwpvichGksFYKnXkq/uHtayzEfAADM71syaAROBhZU3MLPulW5eCKyr74uOSCBCrQdicbRMbmEJU2dl0rl1M6ZPHK6GYBJTtS7LuHupmU0D3qBiVv6Mu68xs/uoeAVZCFxnZhcCpcB+4MoY1iwSOzFacy8vd378/HI+yy1i3g/H0Kl1s6g+v0hVYX2Iyd0XAYuqfO3uStdvo2K5RiSxFeaANaG0aeuoPu3j/9jM39fv4d4LT2JEHzUEk9hTbxmRygpzKrpBWvQOTfzX5n389s0NXDC0B98d0zdqzytSE4W7SGWFOVHt4/5ZbhHXzVnGcV3b8KtvnaKGYNJg1FtGpLIohntFQ7AsCovLmDtlBK3VEEwakGbuIpUV7o9auD/w2nqWfnKAX100hP7d1BBMGpbCXaSywhxo1aneT/Paql388b2tXDGmLxcO7RGFwkTqRuEucpR7VJZltuwt4OYXVjKsdwfuOG9wlIoTqRuFu8hRR/KhvKRe4V5YXMrUmVmkphgzJo2gWVP9ikkw9JMnclToA0yRhru7c+dLq/l4Tz7TJw6nZ4eWUSxOpG4U7iJHhVoPRBrusz/czovLPuWGrw3kKwO7RrEwkbpTuIscVY+Z+8rsg9y7cC1fHdiVH53RP8qFidSdwl3kqP+Ee92OljlYWMzUmVl0bduch789jCZqCCZxQOEuclQEM/fycufGecvZk1/EjEkj6KiGYBInFO4iRxXmQJNUaB5+R8jHMjaxeMNe7j5/MMN6d4hhcSJ1o3AXOapwX8WsPcz+L+9t3Mdv3/qYCcN6MHm0GoJJfFG4ixxVh9YDu3IPc93cZfTv2oZfqiGYxCGFu8hRYbYeKC4t59pZWRwpKePxyem0aqaGYBJ/FO4iR4XZeuCXr60ja/tBHrh4CP27tWmAwkTqTuEuclQY4f7Kyp386Z/buPK0fpw/RA3BJH4p3EUAysvg8IEaw33TngJueWElw/t04PbxgxqwOJG6U7iLABTlgpd/YbgXFpdyzaxMmqem8JgagkkC0DtBIlDjB5jcndtfXMXGPQX8+fuj6N5eDcEk/mn6IQI1th6Y+cF2Xl6+kxvPHMj/DFBDMEkMCncR+MKZ+4odB/n539Yy9oSuTBunhmCSOBTuIlBtuB84VMw1s9QQTBJTWOFuZueY2QYz22Rmt1Zzf3Mzmxe6/wMz6xftQkViqkq4r99fxjcf+yd784/w+OQRdGilhmCSWGoNdzNLAWYA5wKDgcvMrOqJIa8CDrh7f+B3wAPRLlQkpgpzoGlL8stTueOlVfzqwyLKHZ77/iiG9FJDMEk84RwtMwrY5O5bAMxsLjABWFtpmwnAz0LXXwAeNTNzd49irSKxU7ifomYdOOt3S9idV8TZ/Zry8Pe/QstmKUFXJhKRcMK9J7Cj0u1s4NQv2sbdS80sF+gM7ItGkZWtzFhAn4y72LYkOd4u6FNerrHEga7l+9hW3o22HZvy2KTTyN2yQsEuCS2ccK/uXaSqM/JwtsHMpgBTANLS0sjIyAhj9/9fXvZuuqT0TJoufG6uscSBXU16sLX9KH56Ujm5W1ZQUFAQ0c9nvEmWcYDGUmfuXuMFGAO8Uen2bcBtVbZ5AxgTut6Uihm71fS86enpHqnFixdH/Nh4o7HEp2QZS7KMw11jOQpY6rXktruHdbTMR8AAMzvWzJoBE4G
FVbZZCFwRun4x8E6oCBERCUCtyzJesYY+jYrZeQrwjLuvMbP7qHgFWQg8DfzFzDYB+6l4ARARkYCE1VvG3RcBi6p87e5K14uAS6JbmoiIRCoxD20QEZEaKdxFRJKQwl1EJAkp3EVEkpDCXUQkCVlQh6Ob2V7gkwgf3oUYtDYIiMYSn5JlLMkyDtBYjurr7rWeNSawcK8PM1vq7iODriMaNJb4lCxjSZZxgMZSV1qWERFJQgp3EZEklKjh/lTQBUSRxhKfkmUsyTIO0FjqJCHX3EVEpGaJOnMXEZEaJGy4m9nPzWylmS03szfNrEfQNUXKzB40s/Wh8bxkZgl70k4zu8TM1phZuZkl3JENtZ0MPlGY2TNmtsfMVgddS32ZWW8zW2xm60I/W9cHXVMkzKyFmX1oZitC47g3pvtL1GUZM2vn7nmh69cBg9396oDLioiZnUVFD/xSM3sAwN1vCbisiJjZIKAceBL4ibsvDbiksIVOBv8x8HUqTif5EXCZu6+t8YFxyMy+AhQAf3b3k4Oupz7MrDvQ3d2zzKwtkAl8I9G+L1ZxmrLW7l5gZqnAe8D17v5+LPaXsDP3o8Ee0ppqTuuXKNz9TXcvDd18H+gVZD314e7r3H1D0HVE6D8ng3f3YuDoyeATjrsvoeLcCgnP3Xe5e1boej6wjorzNieU0ImUCkI3U0OXmOVWwoY7gJndb2Y7gEnA3bVtnyC+D7wWdBGNVHUng0+4EElmZtYPGA58EGwlkTGzFDNbDuwB3nL3mI0jrsPdzN42s9XVXCYAuPsd7t4bmAVMC7bamtU2ltA2dwClVIwnboUzlgQV1oneJRhm1gZYANxQ5S/3hOHuZe4+jIq/zkeZWcyWzMI6E1NQ3P3MMDedDbwK3BPDcuqltrGY2RXA+cDX4v38s3X4viSabKB3pdu9gJ0B1SKVhNaoFwCz3P3FoOupL3c/aGYZwDlATN70juuZe03MbEClmxcC64Oqpb7M7BzgFuBCdy8Mup5GLJyTwUsDC70R+TSwzt0fCrqeSJlZ16NHwplZS+BMYphbiXy0zALgBCqOzPgEuNrdPw22qsiETizeHMgJfen9BD7y55vAI0BX4CCw3N3PDraq8JnZeOBh/nsy+PsDLikiZjYHGEut1SvCAAAAaUlEQVRF98HdwD3u/nSgRUXIzL4MvAusouL3HeD20LmdE4aZDQGeo+JnqwnwvLvfF7P9JWq4i4jIF0vYZRkREfliCncRkSSkcBcRSUIKdxGRJKRwFxFJQgp3EZEkpHAXEUlCCncRkST0f2hGQF4v9RX6AAAAAElFTkSuQmCC\n",
20 |       "text/plain": [
21 |        "<Figure size 432x288 with 1 Axes>"
22 |       ]
23 |      },
24 |      "metadata": {
25 |       "needs_background": "light"
26 |      },
27 |      "output_type": "display_data"
28 |     },
29 |     {
30 |      "data": {
31 |       "text/plain": [
32 |        "'\\nAssignment:\\n    Implement ReLU and its first derivative dReLU,\\n    then plot them\\n'"
33 |       ]
34 |      },
35 |      "execution_count": 9,
36 |      "metadata": {},
37 |      "output_type": "execute_result"
38 |     }
39 |    ],
40 |    "source": [
41 |     "# ReLU and its first derivative, plotted on [-3, 3]\n",
42 |     "import numpy as np\n",
43 |     "import matplotlib.pyplot as plt\n",
44 |     "%matplotlib inline\n",
45 |     "\n",
46 |     "def ReLU(x):\n",
47 |     "    # f(x) = max(0, x), applied element-wise\n",
48 |     "    return np.maximum(0, x)\n",
49 |     "\n",
50 |     "def dReLU(x):\n",
51 |     "    # f'(x) = 1 for x > 0, else 0 (the value at x = 0 is set to 0 by convention)\n",
52 |     "    return (x > 0).astype(float)\n",
53 |     "\n",
54 |     "x = np.linspace(-3, 3, 60)\n",
55 |     "plt.plot(x, ReLU(x))\n",
56 |     "plt.plot(x, dReLU(x))\n",
57 |     "plt.grid()\n",
58 |     "plt.show()\n",
59 |     "\n",
60 |     "\n",
61 |     "'''\n",
62 |     "Assignment:\n",
63 |     "    Implement ReLU and its first derivative dReLU,\n",
64 |     "    then plot them\n",
65 |     "'''"
66 |    ]
67 |   }
68 |  ],
69 |  "metadata": {
70 |   "kernelspec": {
71 |    "display_name": "Python 3",
72 |    "language": "python",
73 |    "name": "python3"
74 |   },
75 |   "language_info": {
76 |    "codemirror_mode": {
77 |     "name": "ipython",
78 |     "version": 3
79 |    },
80 |    "file_extension": ".py",
81 |    "mimetype": "text/x-python",
82 |    "name": "python",
83 |    "nbconvert_exporter": "python",
84 |    "pygments_lexer": "ipython3",
85 |    "version": "3.6.7"
86 |   }
87 |  },
88 |  "nbformat": 4,
89 |  "nbformat_minor": 2
90 | }
91 | 
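
The notebook above defines ReLU as f(x) = max(0, x) and plots it together with its first derivative. One way to sanity-check the derivative is to compare it with a central finite difference at points away from the kink at 0. The sketch below does that; the helper names `relu` and `d_relu` and the step size `h` are illustrative choices, not taken from the notebook.

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), element-wise
    return np.maximum(0, x)

def d_relu(x):
    # f'(x) = 1 where x > 0, else 0
    return (x > 0).astype(float)

# Sample points away from x = 0, where ReLU is not differentiable
x = np.array([-2.0, -0.5, 0.5, 2.0])
h = 1e-6

# Central difference approximation of the derivative
numeric = (relu(x + h) - relu(x - h)) / (2 * h)

print(relu(x))
print(d_relu(x))
print(np.allclose(numeric, d_relu(x)))
```

At exactly x = 0 the finite difference would return 0.5, which matches neither one-sided slope. That is why the check skips that point and why implementations simply pick a convention there.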


--------------------------------------------------------------------------------
/Day_018_HW.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {
  6 |     "collapsed": true
  7 |    },
  8 |    "source": [
  9 |     "# Homework: (Kaggle) Titanic survival prediction"
 10 |    ]
 11 |   },
 12 |   {
 13 |    "cell_type": "code",
 14 |    "execution_count": 1,
 15 |    "metadata": {},
 16 |    "outputs": [
 17 |     {
 18 |      "data": {
 19 |       "text/plain": [
 20 |        "(891, 12)"
 21 |       ]
 22 |      },
 23 |      "execution_count": 1,
 24 |      "metadata": {},
 25 |      "output_type": "execute_result"
 26 |     }
 27 |    ],
 28 |    "source": [
 29 |     "# Load packages and data\n",
 30 |     "import pandas as pd\n",
 31 |     "import numpy as np\n",
 32 |     "\n",
 33 |     "data_path = '../part02/'\n",
 34 |     "df_train = pd.read_csv(data_path + 'titanic_train.csv')\n",
 35 |     "df_test = pd.read_csv(data_path + 'titanic_test.csv')\n",
 36 |     "df_train.shape"
 37 |    ]
 38 |   },
 39 |   {
 40 |    "cell_type": "code",
 41 |    "execution_count": 2,
 42 |    "metadata": {},
 43 |    "outputs": [
 44 |     {
 45 |      "data": {
 46 |       "text/html": [
 47 |        "<div>\n",
 48 |        "<style scoped>\n",
 49 |        "    .dataframe tbody tr th:only-of-type {\n",
 50 |        "        vertical-align: middle;\n",
 51 |        "    }\n",
 52 |        "\n",
 53 |        "    .dataframe tbody tr th {\n",
 54 |        "        vertical-align: top;\n",
 55 |        "    }\n",
 56 |        "\n",
 57 |        "    .dataframe thead th {\n",
 58 |        "        text-align: right;\n",
 59 |        "    }\n",
 60 |        "</style>\n",
 61 |        "<table border=\"1\" class=\"dataframe\">\n",
 62 |        "  <thead>\n",
 63 |        "    <tr style=\"text-align: right;\">\n",
 64 |        "      <th></th>\n",
 65 |        "      <th>Pclass</th>\n",
 66 |        "      <th>Name</th>\n",
 67 |        "      <th>Sex</th>\n",
 68 |        "      <th>Age</th>\n",
 69 |        "      <th>SibSp</th>\n",
 70 |        "      <th>Parch</th>\n",
 71 |        "      <th>Ticket</th>\n",
 72 |        "      <th>Fare</th>\n",
 73 |        "      <th>Cabin</th>\n",
 74 |        "      <th>Embarked</th>\n",
 75 |        "    </tr>\n",
 76 |        "  </thead>\n",
 77 |        "  <tbody>\n",
 78 |        "    <tr>\n",
 79 |        "      <th>0</th>\n",
 80 |        "      <td>3</td>\n",
 81 |        "      <td>Braund, Mr. Owen Harris</td>\n",
 82 |        "      <td>male</td>\n",
 83 |        "      <td>22.0</td>\n",
 84 |        "      <td>1</td>\n",
 85 |        "      <td>0</td>\n",
 86 |        "      <td>A/5 21171</td>\n",
 87 |        "      <td>7.2500</td>\n",
 88 |        "      <td>NaN</td>\n",
 89 |        "      <td>S</td>\n",
 90 |        "    </tr>\n",
 91 |        "    <tr>\n",
 92 |        "      <th>1</th>\n",
 93 |        "      <td>1</td>\n",
 94 |        "      <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
 95 |        "      <td>female</td>\n",
 96 |        "      <td>38.0</td>\n",
 97 |        "      <td>1</td>\n",
 98 |        "      <td>0</td>\n",
 99 |        "      <td>PC 17599</td>\n",
100 |        "      <td>71.2833</td>\n",
101 |        "      <td>C85</td>\n",
102 |        "      <td>C</td>\n",
103 |        "    </tr>\n",
104 |        "    <tr>\n",
105 |        "      <th>2</th>\n",
106 |        "      <td>3</td>\n",
107 |        "      <td>Heikkinen, Miss. Laina</td>\n",
108 |        "      <td>female</td>\n",
109 |        "      <td>26.0</td>\n",
110 |        "      <td>0</td>\n",
111 |        "      <td>0</td>\n",
112 |        "      <td>STON/O2. 3101282</td>\n",
113 |        "      <td>7.9250</td>\n",
114 |        "      <td>NaN</td>\n",
115 |        "      <td>S</td>\n",
116 |        "    </tr>\n",
117 |        "    <tr>\n",
118 |        "      <th>3</th>\n",
119 |        "      <td>1</td>\n",
120 |        "      <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
121 |        "      <td>female</td>\n",
122 |        "      <td>35.0</td>\n",
123 |        "      <td>1</td>\n",
124 |        "      <td>0</td>\n",
125 |        "      <td>113803</td>\n",
126 |        "      <td>53.1000</td>\n",
127 |        "      <td>C123</td>\n",
128 |        "      <td>S</td>\n",
129 |        "    </tr>\n",
130 |        "    <tr>\n",
131 |        "      <th>4</th>\n",
132 |        "      <td>3</td>\n",
133 |        "      <td>Allen, Mr. William Henry</td>\n",
134 |        "      <td>male</td>\n",
135 |        "      <td>35.0</td>\n",
136 |        "      <td>0</td>\n",
137 |        "      <td>0</td>\n",
138 |        "      <td>373450</td>\n",
139 |        "      <td>8.0500</td>\n",
140 |        "      <td>NaN</td>\n",
141 |        "      <td>S</td>\n",
142 |        "    </tr>\n",
143 |        "  </tbody>\n",
144 |        "</table>\n",
145 |        "</div>"
146 |       ],
147 |       "text/plain": [
148 |        "   Pclass                                               Name     Sex   Age  \\\n",
149 |        "0       3                            Braund, Mr. Owen Harris    male  22.0   \n",
150 |        "1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0   \n",
151 |        "2       3                             Heikkinen, Miss. Laina  female  26.0   \n",
152 |        "3       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0   \n",
153 |        "4       3                           Allen, Mr. William Henry    male  35.0   \n",
154 |        "\n",
155 |        "   SibSp  Parch            Ticket     Fare Cabin Embarked  \n",
156 |        "0      1      0         A/5 21171   7.2500   NaN        S  \n",
157 |        "1      1      0          PC 17599  71.2833   C85        C  \n",
158 |        "2      0      0  STON/O2. 3101282   7.9250   NaN        S  \n",
159 |        "3      1      0            113803  53.1000  C123        S  \n",
160 |        "4      0      0            373450   8.0500   NaN        S  "
161 |       ]
162 |      },
163 |      "execution_count": 2,
164 |      "metadata": {},
165 |      "output_type": "execute_result"
166 |     }
167 |    ],
168 |    "source": [
169 |     "# Reorganize the data into the format used for training / prediction\n",
170 |     "train_Y = df_train['Survived']\n",
171 |     "ids = df_test['PassengerId']\n",
172 |     "df_train = df_train.drop(['PassengerId', 'Survived'] , axis=1)\n",
173 |     "df_test = df_test.drop(['PassengerId'] , axis=1)\n",
174 |     "df = pd.concat([df_train,df_test])\n",
175 |     "df.head()"
176 |    ]
177 |   },
178 |   {
179 |    "cell_type": "code",
180 |    "execution_count": null,
181 |    "metadata": {
182 |     "collapsed": true
183 |    },
184 |    "outputs": [],
185 |    "source": [
186 |     "# Show the column dtypes and how many columns have each dtype\n",
187 |     "dtype_df = df.dtypes.reset_index()\n",
188 |     "dtype_df.columns = [\"Count\", \"Column Type\"]\n",
189 |     "dtype_df = dtype_df.groupby(\"Column Type\").aggregate('count').reset_index()\n",
190 |     "dtype_df"
191 |    ]
192 |   },
193 |   {
194 |    "cell_type": "code",
195 |    "execution_count": 16,
196 |    "metadata": {},
197 |    "outputs": [
198 |     {
199 |      "data": {
200 |       "text/plain": [
201 |        "0\n",
202 |        "int64      3\n",
203 |        "float64    2\n",
204 |        "object     5\n",
205 |        "dtype: int64"
206 |       ]
207 |      },
208 |      "execution_count": 16,
209 |      "metadata": {},
210 |      "output_type": "execute_result"
211 |     }
212 |    ],
213 |    "source": [
214 |     "dtype_df = df.dtypes.reset_index()\n",
215 |     "dtype_df\n",
216 |     "dtype_df.groupby(0).size()\n"
217 |    ]
218 |   },
219 |   {
220 |    "cell_type": "markdown",
221 |    "metadata": {},
222 |    "source": [
223 |     "## The output above confirms there are only three dtypes: float64, int64, and object"
224 |    ]
225 |   },
226 |   {
227 |    "cell_type": "code",
228 |    "execution_count": 17,
229 |    "metadata": {},
230 |    "outputs": [
231 |     {
232 |      "name": "stdout",
233 |      "output_type": "stream",
234 |      "text": [
235 |       "3 Integer Features : ['Pclass', 'SibSp', 'Parch']\n",
236 |       "\n",
237 |       "2 Float Features : ['Age', 'Fare']\n",
238 |       "\n",
239 |       "5 Object Features : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']\n"
240 |      ]
241 |     }
242 |    ],
243 |    "source": [
244 |     "# Having confirmed there are only three dtypes (int64, float64, object), store the column names in three separate lists\n",
245 |     "int_features = []\n",
246 |     "float_features = []\n",
247 |     "object_features = []\n",
248 |     "for dtype, feature in zip(df.dtypes, df.columns):\n",
249 |     "    if dtype == 'float64':\n",
250 |     "        float_features.append(feature)\n",
251 |     "    elif dtype == 'int64':\n",
252 |     "        int_features.append(feature)\n",
253 |     "    else:\n",
254 |     "        object_features.append(feature)\n",
255 |     "print(f'{len(int_features)} Integer Features : {int_features}\\n')\n",
256 |     "print(f'{len(float_features)} Float Features : {float_features}\\n')\n",
257 |     "print(f'{len(object_features)} Object Features : {object_features}')"
258 |    ]
259 |   },
260 |   {
261 |    "cell_type": "markdown",
262 |    "metadata": {
263 |     "collapsed": true
264 |    },
265 |    "source": [
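266 |     "# Homework 1 \n",
267 |     "* Run the homework code and observe the nine operations obtained by applying ( mean / max / nunique ) to each of the three column types (int / float / object).  \n",
268 |     "What problems come up? Try to explain why the code blocks that raise an Error fail.  \n",
269 |     "\n",
270 |     "# Answer:  \n",
271 |     "    The mean of an object column cannot be computed.\n",
272 |     "    min and max on object columns appear to follow character order: digits sort before uppercase letters, which sort before lowercase letters.\n",
273 |     "    \n",
274 |     "\n",
275 |     "\n",
276 |     "# Homework 2\n",
277 |     "* Name one or more data types beyond the five types covered today. Can the new types you name be assigned to one of the three main categories?  \n",
278 |     "Among the three main categories of features, which should be the most complicated to handle?\n",
279 |     "\n",
280 |     "# Answer: \n",
281 |     "    A Python string falls under object.\n",
282 |     "    A NumPy int32 is grouped with int64.\n",
283 |     "    \n",
284 |     "    Object features must be inspected and then Label Encoded or One-Hot Encoded, which makes them relatively complex."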
285 |    ]
286 |   },
287 |   {
288 |    "cell_type": "code",
289 |    "execution_count": 33,
290 |    "metadata": {},
291 |    "outputs": [
292 |     {
293 |      "name": "stdout",
294 |      "output_type": "stream",
295 |      "text": [
296 |       " operation 1 ~ 3 for integer variable basic statistics\n",
297 |       "Pclass    2.294882\n",
298 |       "SibSp     0.498854\n",
299 |       "Parch     0.385027\n",
300 |       "dtype: float64\n",
301 |       "Pclass    3\n",
302 |       "SibSp     8\n",
303 |       "Parch     9\n",
304 |       "dtype: int64\n",
305 |       "Pclass    1\n",
306 |       "SibSp     0\n",
307 |       "Parch     0\n",
308 |       "dtype: int64\n",
309 |       "\n",
310 |       "\n",
311 |       " operation 4 ~ 6 for float variable basic statistics\n",
312 |       "Age     29.881138\n",
313 |       "Fare    33.295479\n",
314 |       "dtype: float64\n",
315 |       "Age      80.0000\n",
316 |       "Fare    512.3292\n",
317 |       "dtype: float64\n",
318 |       "Age     0.17\n",
319 |       "Fare    0.00\n",
320 |       "dtype: float64\n",
321 |       "\n",
322 |       "\n",
323 |       " operation 7 ~ 9 for object variable basic statistics\n",
324 |       "\n",
325 |       " Series([], dtype: float64)\n",
326 |       "\n",
327 |       " Name      van Melkebeke, Mr. Philemon\n",
328 |       "Sex                              male\n",
329 |       "Ticket                      WE/P 5735\n",
330 |       "dtype: object\n",
331 |       "\n",
332 |       " Name      Abbing, Mr. Anthony\n",
333 |       "Sex                    female\n",
334 |       "Ticket                 110152\n",
335 |       "dtype: object\n"
336 |      ]
337 |     },
338 |     {
339 |      "data": {
340 |       "text/plain": [
341 |        "Name        object\n",
342 |        "Sex         object\n",
343 |        "Ticket      object\n",
344 |        "Cabin       object\n",
345 |        "Embarked    object\n",
346 |        "dtype: object"
347 |       ]
348 |      },
349 |      "execution_count": 33,
350 |      "metadata": {},
351 |      "output_type": "execute_result"
352 |     }
353 |    ],
354 |    "source": [
355 |     "# op 1 - 3\n",
356 |     "# Example: take the mean of the integer (int) features\n",
357 |     "df[int_features].mean()\n",
358 |     "\"\"\"\n",
359 |     "Your Code Here\n",
360 |     "\"\"\"\n",
361 |     "print(' operation 1 ~ 3 for integer variable basic statistics')\n",
362 |     "print(df[int_features].mean())\n",
363 |     "print(df[int_features].max())\n",
364 |     "print(df[int_features].min())\n",
365 |     "\n",
366 |     "print('\\n\\n operation 4 ~ 6 for float variable basic statistics')\n",
367 |     "print(df[float_features].mean())\n",
368 |     "print(df[float_features].max())\n",
369 |     "print(df[float_features].min())\n",
370 |     "\n",
371 |     "print('\\n\\n operation 7 ~ 9 for object variable basic statistics')\n",
372 |     "print('\\n',df[object_features].mean())\n",
373 |     "print('\\n',df[object_features].max())\n",
374 |     "print('\\n',df[object_features].min())\n",
375 |     "\n",
376 |     "df[object_features].dtypes"
377 |    ]
378 |   }
379 |  ],
380 |  "metadata": {
381 |   "kernelspec": {
382 |    "display_name": "Python 3",
383 |    "language": "python",
384 |    "name": "python3"
385 |   },
386 |   "language_info": {
387 |    "codemirror_mode": {
388 |     "name": "ipython",
389 |     "version": 3
390 |    },
391 |    "file_extension": ".py",
392 |    "mimetype": "text/x-python",
393 |    "name": "python",
394 |    "nbconvert_exporter": "python",
395 |    "pygments_lexer": "ipython3",
396 |    "version": "3.6.7"
397 |   }
398 |  },
399 |  "nbformat": 4,
400 |  "nbformat_minor": 2
401 | }
402 | 
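The notebook above groups column names by looping over `df.dtypes`. As a minimal sketch, the same grouping can be obtained with `DataFrame.select_dtypes`. The `toy` frame and its three rows are illustrative assumptions, not the Titanic data, and non-numeric columns are picked with `exclude="number"` so the sketch does not depend on how a particular pandas version stores strings.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the combined train/test data.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],                  # integer column
    "Age": [22.0, 38.0, 26.0],            # float column
    "Sex": ["male", "female", "female"],  # non-numeric column
})

# Group column names by dtype family without an explicit loop.
int_cols = toy.select_dtypes(include="int64").columns.tolist()
float_cols = toy.select_dtypes(include="float64").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

# mean is only meaningful for the numeric groups;
# nunique works for every group, including the non-numeric one.
numeric_means = toy[int_cols + float_cols].mean()
unique_counts = toy[other_cols].nunique()

print(int_cols, float_cols, other_cols)
print(numeric_means)
print(unique_counts)
```

Restricting `mean` to the numeric groups avoids the ill-defined "mean of strings" case that the homework asks about.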


--------------------------------------------------------------------------------
/.ipynb_checkpoints/Day_018_HW-checkpoint.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {
  6 |     "collapsed": true
  7 |    },
  8 |    "source": [
  9 |     "# 作業 : (Kaggle)鐵達尼生存預測"
 10 |    ]
 11 |   },
 12 |   {
 13 |    "cell_type": "code",
 14 |    "execution_count": 1,
 15 |    "metadata": {},
 16 |    "outputs": [
 17 |     {
 18 |      "data": {
 19 |       "text/plain": [
 20 |        "(891, 12)"
 21 |       ]
 22 |      },
 23 |      "execution_count": 1,
 24 |      "metadata": {},
 25 |      "output_type": "execute_result"
 26 |     }
 27 |    ],
 28 |    "source": [
 29 |     "# 載入套件與資料\n",
 30 |     "import pandas as pd\n",
 31 |     "import numpy as np\n",
 32 |     "\n",
 33 |     "data_path = '../part02/'\n",
 34 |     "df_train = pd.read_csv(data_path + 'titanic_train.csv')\n",
 35 |     "df_test = pd.read_csv(data_path + 'titanic_test.csv')\n",
 36 |     "df_train.shape"
 37 |    ]
 38 |   },
 39 |   {
 40 |    "cell_type": "code",
 41 |    "execution_count": 2,
 42 |    "metadata": {},
 43 |    "outputs": [
 44 |     {
 45 |      "data": {
 46 |       "text/html": [
 47 |        "<div>\n",
 48 |        "<style scoped>\n",
 49 |        "    .dataframe tbody tr th:only-of-type {\n",
 50 |        "        vertical-align: middle;\n",
 51 |        "    }\n",
 52 |        "\n",
 53 |        "    .dataframe tbody tr th {\n",
 54 |        "        vertical-align: top;\n",
 55 |        "    }\n",
 56 |        "\n",
 57 |        "    .dataframe thead th {\n",
 58 |        "        text-align: right;\n",
 59 |        "    }\n",
 60 |        "</style>\n",
 61 |        "<table border=\"1\" class=\"dataframe\">\n",
 62 |        "  <thead>\n",
 63 |        "    <tr style=\"text-align: right;\">\n",
 64 |        "      <th></th>\n",
 65 |        "      <th>Pclass</th>\n",
 66 |        "      <th>Name</th>\n",
 67 |        "      <th>Sex</th>\n",
 68 |        "      <th>Age</th>\n",
 69 |        "      <th>SibSp</th>\n",
 70 |        "      <th>Parch</th>\n",
 71 |        "      <th>Ticket</th>\n",
 72 |        "      <th>Fare</th>\n",
 73 |        "      <th>Cabin</th>\n",
 74 |        "      <th>Embarked</th>\n",
 75 |        "    </tr>\n",
 76 |        "  </thead>\n",
 77 |        "  <tbody>\n",
 78 |        "    <tr>\n",
 79 |        "      <th>0</th>\n",
 80 |        "      <td>3</td>\n",
 81 |        "      <td>Braund, Mr. Owen Harris</td>\n",
 82 |        "      <td>male</td>\n",
 83 |        "      <td>22.0</td>\n",
 84 |        "      <td>1</td>\n",
 85 |        "      <td>0</td>\n",
 86 |        "      <td>A/5 21171</td>\n",
 87 |        "      <td>7.2500</td>\n",
 88 |        "      <td>NaN</td>\n",
 89 |        "      <td>S</td>\n",
 90 |        "    </tr>\n",
 91 |        "    <tr>\n",
 92 |        "      <th>1</th>\n",
 93 |        "      <td>1</td>\n",
 94 |        "      <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
 95 |        "      <td>female</td>\n",
 96 |        "      <td>38.0</td>\n",
 97 |        "      <td>1</td>\n",
 98 |        "      <td>0</td>\n",
 99 |        "      <td>PC 17599</td>\n",
100 |        "      <td>71.2833</td>\n",
101 |        "      <td>C85</td>\n",
102 |        "      <td>C</td>\n",
103 |        "    </tr>\n",
104 |        "    <tr>\n",
105 |        "      <th>2</th>\n",
106 |        "      <td>3</td>\n",
107 |        "      <td>Heikkinen, Miss. Laina</td>\n",
108 |        "      <td>female</td>\n",
109 |        "      <td>26.0</td>\n",
110 |        "      <td>0</td>\n",
111 |        "      <td>0</td>\n",
112 |        "      <td>STON/O2. 3101282</td>\n",
113 |        "      <td>7.9250</td>\n",
114 |        "      <td>NaN</td>\n",
115 |        "      <td>S</td>\n",
116 |        "    </tr>\n",
117 |        "    <tr>\n",
118 |        "      <th>3</th>\n",
119 |        "      <td>1</td>\n",
120 |        "      <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
121 |        "      <td>female</td>\n",
122 |        "      <td>35.0</td>\n",
123 |        "      <td>1</td>\n",
124 |        "      <td>0</td>\n",
125 |        "      <td>113803</td>\n",
126 |        "      <td>53.1000</td>\n",
127 |        "      <td>C123</td>\n",
128 |        "      <td>S</td>\n",
129 |        "    </tr>\n",
130 |        "    <tr>\n",
131 |        "      <th>4</th>\n",
132 |        "      <td>3</td>\n",
133 |        "      <td>Allen, Mr. William Henry</td>\n",
134 |        "      <td>male</td>\n",
135 |        "      <td>35.0</td>\n",
136 |        "      <td>0</td>\n",
137 |        "      <td>0</td>\n",
138 |        "      <td>373450</td>\n",
139 |        "      <td>8.0500</td>\n",
140 |        "      <td>NaN</td>\n",
141 |        "      <td>S</td>\n",
142 |        "    </tr>\n",
143 |        "  </tbody>\n",
144 |        "</table>\n",
145 |        "</div>"
146 |       ],
147 |       "text/plain": [
148 |        "   Pclass                                               Name     Sex   Age  \\\n",
149 |        "0       3                            Braund, Mr. Owen Harris    male  22.0   \n",
150 |        "1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0   \n",
151 |        "2       3                             Heikkinen, Miss. Laina  female  26.0   \n",
152 |        "3       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0   \n",
153 |        "4       3                           Allen, Mr. William Henry    male  35.0   \n",
154 |        "\n",
155 |        "   SibSp  Parch            Ticket     Fare Cabin Embarked  \n",
156 |        "0      1      0         A/5 21171   7.2500   NaN        S  \n",
157 |        "1      1      0          PC 17599  71.2833   C85        C  \n",
158 |        "2      0      0  STON/O2. 3101282   7.9250   NaN        S  \n",
159 |        "3      1      0            113803  53.1000  C123        S  \n",
160 |        "4      0      0            373450   8.0500   NaN        S  "
161 |       ]
162 |      },
163 |      "execution_count": 2,
164 |      "metadata": {},
165 |      "output_type": "execute_result"
166 |     }
167 |    ],
168 |    "source": [
169 |     "# Reorganize the data into the format used for training / prediction\n",
170 |     "train_Y = df_train['Survived']\n",
171 |     "ids = df_test['PassengerId']\n",
172 |     "df_train = df_train.drop(['PassengerId', 'Survived'] , axis=1)\n",
173 |     "df_test = df_test.drop(['PassengerId'] , axis=1)\n",
174 |     "df = pd.concat([df_train,df_test])\n",
175 |     "df.head()"
176 |    ]
177 |   },
178 |   {
179 |    "cell_type": "code",
180 |    "execution_count": null,
181 |    "metadata": {
182 |     "collapsed": true
183 |    },
184 |    "outputs": [],
185 |    "source": [
186 |     "# Show the column dtypes and how many columns have each dtype\n",
187 |     "dtype_df = df.dtypes.reset_index()\n",
188 |     "dtype_df.columns = [\"Count\", \"Column Type\"]\n",
189 |     "dtype_df = dtype_df.groupby(\"Column Type\").aggregate('count').reset_index()\n",
190 |     "dtype_df"
191 |    ]
192 |   },
193 |   {
194 |    "cell_type": "code",
195 |    "execution_count": 16,
196 |    "metadata": {},
197 |    "outputs": [
198 |     {
199 |      "data": {
200 |       "text/plain": [
201 |        "0\n",
202 |        "int64      3\n",
203 |        "float64    2\n",
204 |        "object     5\n",
205 |        "dtype: int64"
206 |       ]
207 |      },
208 |      "execution_count": 16,
209 |      "metadata": {},
210 |      "output_type": "execute_result"
211 |     }
212 |    ],
213 |    "source": [
214 |     "dtype_df = df.dtypes.reset_index()\n",
215 |     "dtype_df\n",
216 |     "dtype_df.groupby(0).size()\n"
217 |    ]
218 |   },
219 |   {
220 |    "cell_type": "markdown",
221 |    "metadata": {},
222 |    "source": [
223 |     "## The output above confirms there are only three dtypes: float64, int64, and object"
224 |    ]
225 |   },
226 |   {
227 |    "cell_type": "code",
228 |    "execution_count": 17,
229 |    "metadata": {},
230 |    "outputs": [
231 |     {
232 |      "name": "stdout",
233 |      "output_type": "stream",
234 |      "text": [
235 |       "3 Integer Features : ['Pclass', 'SibSp', 'Parch']\n",
236 |       "\n",
237 |       "2 Float Features : ['Age', 'Fare']\n",
238 |       "\n",
239 |       "5 Object Features : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']\n"
240 |      ]
241 |     }
242 |    ],
243 |    "source": [
244 |     "# Having confirmed there are only three dtypes (int64, float64, object), store the column names in three separate lists\n",
245 |     "int_features = []\n",
246 |     "float_features = []\n",
247 |     "object_features = []\n",
248 |     "for dtype, feature in zip(df.dtypes, df.columns):\n",
249 |     "    if dtype == 'float64':\n",
250 |     "        float_features.append(feature)\n",
251 |     "    elif dtype == 'int64':\n",
252 |     "        int_features.append(feature)\n",
253 |     "    else:\n",
254 |     "        object_features.append(feature)\n",
255 |     "print(f'{len(int_features)} Integer Features : {int_features}\\n')\n",
256 |     "print(f'{len(float_features)} Float Features : {float_features}\\n')\n",
257 |     "print(f'{len(object_features)} Object Features : {object_features}')"
258 |    ]
259 |   },
260 |   {
261 |    "cell_type": "markdown",
262 |    "metadata": {
263 |     "collapsed": true
264 |    },
265 |    "source": [
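266 |     "# Homework 1 \n",
267 |     "* Run the homework code and observe the nine operations obtained by applying ( mean / max / nunique ) to each of the three column types (int / float / object).  \n",
268 |     "What problems come up? Try to explain why the code blocks that raise an Error fail.  \n",
269 |     "\n",
270 |     "# Answer:  \n",
271 |     "    The mean of an object column cannot be computed.\n",
272 |     "    min and max on object columns appear to follow character order: digits sort before uppercase letters, which sort before lowercase letters.\n",
273 |     "    \n",
274 |     "\n",
275 |     "\n",
276 |     "# Homework 2\n",
277 |     "* Name one or more data types beyond the five types covered today. Can the new types you name be assigned to one of the three main categories?  \n",
278 |     "Among the three main categories of features, which should be the most complicated to handle?\n",
279 |     "\n",
280 |     "# Answer: \n",
281 |     "    A Python string falls under object.\n",
282 |     "    A NumPy int32 is grouped with int64.\n",
283 |     "    \n",
284 |     "    Object features must be inspected and then Label Encoded or One-Hot Encoded, which makes them relatively complex."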
285 |    ]
286 |   },
287 |   {
288 |    "cell_type": "code",
289 |    "execution_count": 33,
290 |    "metadata": {},
291 |    "outputs": [
292 |     {
293 |      "name": "stdout",
294 |      "output_type": "stream",
295 |      "text": [
296 |       " operation 1 ~ 3 for integer variable basic statistics\n",
297 |       "Pclass    2.294882\n",
298 |       "SibSp     0.498854\n",
299 |       "Parch     0.385027\n",
300 |       "dtype: float64\n",
301 |       "Pclass    3\n",
302 |       "SibSp     8\n",
303 |       "Parch     9\n",
304 |       "dtype: int64\n",
305 |       "Pclass    1\n",
306 |       "SibSp     0\n",
307 |       "Parch     0\n",
308 |       "dtype: int64\n",
309 |       "\n",
310 |       "\n",
311 |       " operation 4 ~ 6 for float variable basic statistics\n",
312 |       "Age     29.881138\n",
313 |       "Fare    33.295479\n",
314 |       "dtype: float64\n",
315 |       "Age      80.0000\n",
316 |       "Fare    512.3292\n",
317 |       "dtype: float64\n",
318 |       "Age     0.17\n",
319 |       "Fare    0.00\n",
320 |       "dtype: float64\n",
321 |       "\n",
322 |       "\n",
323 |       " operation 7 ~ 9 for object variable basic statistics\n",
324 |       "\n",
325 |       " Series([], dtype: float64)\n",
326 |       "\n",
327 |       " Name      van Melkebeke, Mr. Philemon\n",
328 |       "Sex                              male\n",
329 |       "Ticket                      WE/P 5735\n",
330 |       "dtype: object\n",
331 |       "\n",
332 |       " Name      Abbing, Mr. Anthony\n",
333 |       "Sex                    female\n",
334 |       "Ticket                 110152\n",
335 |       "dtype: object\n"
336 |      ]
337 |     },
338 |     {
339 |      "data": {
340 |       "text/plain": [
341 |        "Name        object\n",
342 |        "Sex         object\n",
343 |        "Ticket      object\n",
344 |        "Cabin       object\n",
345 |        "Embarked    object\n",
346 |        "dtype: object"
347 |       ]
348 |      },
349 |      "execution_count": 33,
350 |      "metadata": {},
351 |      "output_type": "execute_result"
352 |     }
353 |    ],
354 |    "source": [
355 |     "# op 1 - 3\n",
356 |     "# Example: take the mean of the integer (int) features\n",
357 |     "df[int_features].mean()\n",
358 |     "\"\"\"\n",
359 |     "Your Code Here\n",
360 |     "\"\"\"\n",
361 |     "print(' operation 1 ~ 3 for integer variable basic statistics')\n",
362 |     "print(df[int_features].mean())\n",
363 |     "print(df[int_features].max())\n",
364 |     "print(df[int_features].min())\n",
365 |     "\n",
366 |     "print('\\n\\n operation 4 ~ 6 for float variable basic statistics')\n",
367 |     "print(df[float_features].mean())\n",
368 |     "print(df[float_features].max())\n",
369 |     "print(df[float_features].min())\n",
370 |     "\n",
371 |     "print('\\n\\n operation 7 ~ 9 for object variable basic statistics')\n",
372 |     "print('\\n',df[object_features].mean())\n",
373 |     "print('\\n',df[object_features].max())\n",
374 |     "print('\\n',df[object_features].min())\n",
375 |     "\n",
376 |     "df[object_features].dtypes"
377 |    ]
378 |   }
379 |  ],
380 |  "metadata": {
381 |   "kernelspec": {
382 |    "display_name": "Python 3",
383 |    "language": "python",
384 |    "name": "python3"
385 |   },
386 |   "language_info": {
387 |    "codemirror_mode": {
388 |     "name": "ipython",
389 |     "version": 3
390 |    },
391 |    "file_extension": ".py",
392 |    "mimetype": "text/x-python",
393 |    "name": "python",
394 |    "nbconvert_exporter": "python",
395 |    "pygments_lexer": "ipython3",
396 |    "version": "3.6.7"
397 |   }
398 |  },
399 |  "nbformat": 4,
400 |  "nbformat_minor": 2
401 | }
402 | 
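The recorded output of the notebook above shows `Abbing, Mr. Anthony` as the minimum name and `van Melkebeke, Mr. Philemon` as the maximum. A minimal sketch in plain Python of why this happens: strings are compared code point by code point, so digits sort before uppercase letters, which sort before lowercase letters. The lists are small illustrative samples, and `Zabour, Miss. Hileni` is an assumed extra entry added only to show the effect of case.

```python
# Illustrative sample values; "Zabour, Miss. Hileni" is an assumed extra entry.
names = [
    "Braund, Mr. Owen Harris",
    "van Melkebeke, Mr. Philemon",
    "Abbing, Mr. Anthony",
    "Zabour, Miss. Hileni",
]
tickets = ["A/5 21171", "110152", "WE/P 5735"]

# Default comparison is by code point: '1' < 'A' < 'Z' < 'v'.
smallest_name = min(names)
largest_name = max(names)
smallest_ticket = min(tickets)
largest_ticket = max(tickets)

# A case-insensitive comparison gives the ordering a human reader would expect.
largest_name_ignoring_case = max(names, key=str.casefold)

print(smallest_name)               # Abbing, Mr. Anthony
print(largest_name)                # van Melkebeke, Mr. Philemon
print(largest_name_ignoring_case)  # Zabour, Miss. Hileni
print(smallest_ticket, "|", largest_ticket)  # 110152 | WE/P 5735
```

The lowercase `v` outranks every uppercase initial, which is why a surname starting with "van" ends up as the maximum unless case is normalized first.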


--------------------------------------------------------------------------------
/Day_042_HW.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
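  7 |     "## Homework\n",
  8 |     "\n",
  9 |     "1. Try adjusting the parameters of DecisionTreeClassifier(...) and observe whether the results change.\n",
 10 |     "2. Switch to other datasets (boston, wine) and compare against the results of the regression models."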
 11 |    ]
 12 |   },
 13 |   {
 14 |    "cell_type": "code",
 15 |    "execution_count": 1,
 16 |    "metadata": {},
 17 |    "outputs": [],
 18 |    "source": [
 19 |     "from sklearn import datasets, metrics\n",
 20 |     "from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor\n",
 21 |     "from sklearn.model_selection import train_test_split\n",
 22 |     "import pandas as pd\n",
 23 |     "from sklearn.metrics import mean_squared_error, r2_score, accuracy_score"
 24 |    ]
 25 |   },
 26 |   {
 27 |    "cell_type": "code",
 28 |    "execution_count": 2,
 29 |    "metadata": {},
 30 |    "outputs": [],
 31 |    "source": [
 32 |     "# Load the iris dataset\n",
 33 |     "iris = datasets.load_iris()\n",
 34 |     "# Split into training and test sets\n",
 35 |     "x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)\n",
 36 |     "# Build the model\n",
 37 |     "clf = DecisionTreeClassifier()\n",
 38 |     "# Train the model\n",
 39 |     "clf.fit(x_train, y_train)\n",
 40 |     "# Predict on the test set\n",
 41 |     "y_pred = clf.predict(x_test)"
 42 |    ]
 43 |   },
 44 |   {
 45 |    "cell_type": "markdown",
 46 |    "metadata": {},
 47 |    "source": [
 48 |     "## Baseline accuracy: 0.9736842105263158\n",
 49 |     "\n",
 50 |     "    ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']\n",
 51 |     "\n",
 52 |     "    Feature importance:  \n",
 53 |     "    [0.         0.01796599 0.52229134 0.45974266]"
 54 |    ]
 55 |   },
 56 |   {
 57 |    "cell_type": "code",
 58 |    "execution_count": 3,
 59 |    "metadata": {},
 60 |    "outputs": [
 61 |     {
 62 |      "data": {
 63 |       "text/plain": [
 64 |        "\"\\nString form:\\nDecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,\\n           max_featu <...>      min_weight_fraction_leaf=0.0, presort=False, random_state=None,\\n           splitter='best')\\n\""
 65 |       ]
 66 |      },
 67 |      "execution_count": 3,
 68 |      "metadata": {},
 69 |      "output_type": "execute_result"
 70 |     }
 71 |    ],
 72 |    "source": [
 73 |     "#?clf\n",
 74 |     "'''\n",
 75 |     "String form:\n",
 76 |     "DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,\n",
 77 |     "           max_featu <...>      min_weight_fraction_leaf=0.0, presort=False, random_state=None,\n",
 78 |     "           splitter='best')\n",
 79 |     "'''"
 80 |    ]
 81 |   },
 82 |   {
 83 |    "cell_type": "code",
 84 |    "execution_count": null,
 85 |    "metadata": {},
 86 |    "outputs": [],
 96 |    "source": [
 97 |     "clf = DecisionTreeClassifier(criterion='entropy')\n",
 98 |     "clf.fit(x_train, y_train)\n",
 99 |     "acc = metrics.accuracy_score(y_test, clf.predict(x_test))  # score the refitted model, not the stale y_pred\n",
100 |     "print(\"Accuracy, baseline + entropy: \", acc)\n",
101 |     "\n",
102 |     "\n",
103 |     "clf = DecisionTreeClassifier(max_depth=3)\n",
104 |     "clf.fit(x_train, y_train)\n",
105 |     "acc = metrics.accuracy_score(y_test, clf.predict(x_test))  # score the refitted model, not the stale y_pred\n",
106 |     "print(\"Accuracy, baseline + max_depth=3: \", acc)"
108 |   },
109 |   {
110 |    "cell_type": "code",
111 |    "execution_count": 5,
112 |    "metadata": {},
113 |    "outputs": [],
114 |    "source": [
115 |     "# Wine"
116 |    ]
117 |   },
118 |   {
119 |    "cell_type": "code",
120 |    "execution_count": 6,
121 |    "metadata": {},
122 |    "outputs": [],
123 |    "source": [
124 |     "wine = datasets.load_wine()"
125 |    ]
126 |   },
127 |   {
128 |    "cell_type": "code",
129 |    "execution_count": 7,
130 |    "metadata": {},
131 |    "outputs": [
132 |     {
133 |      "data": {
134 |       "text/plain": [
135 |        "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
136 |        "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
137 |        "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,\n",
138 |        "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
139 |        "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
140 |        "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,\n",
141 |        "       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
142 |        "       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
143 |        "       2, 2])"
144 |       ]
145 |      },
146 |      "execution_count": 7,
147 |      "metadata": {},
148 |      "output_type": "execute_result"
149 |     }
150 |    ],
151 |    "source": [
152 |     "wine.target  #classification tasks"
153 |    ]
154 |   },
155 |   {
156 |    "cell_type": "code",
157 |    "execution_count": 8,
158 |    "metadata": {},
159 |    "outputs": [],
160 |    "source": [
161 |     "x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.25, random_state=4)\n",
162 |     "clf = DecisionTreeClassifier()\n",
163 |     "clf.fit(x_train, y_train)\n",
164 |     "y_pred = clf.predict(x_test)"
165 |    ]
166 |   },
167 |   {
168 |    "cell_type": "code",
169 |    "execution_count": 9,
170 |    "metadata": {},
171 |    "outputs": [
172 |     {
173 |      "name": "stdout",
174 |      "output_type": "stream",
175 |      "text": [
176 |       "Accuracy:  0.9111111111111111\n"
177 |      ]
178 |     }
179 |    ],
180 |    "source": [
181 |     "acc = metrics.accuracy_score(y_test, y_pred)\n",
182 |     "print(\"Accuracy: \", acc)"
183 |    ]
184 |   },
185 |   {
186 |    "cell_type": "code",
187 |    "execution_count": 10,
188 |    "metadata": {},
189 |    "outputs": [
190 |     {
191 |      "data": {
192 |       "text/html": [
193 |        "<div>\n",
194 |        "<style scoped>\n",
195 |        "    .dataframe tbody tr th:only-of-type {\n",
196 |        "        vertical-align: middle;\n",
197 |        "    }\n",
198 |        "\n",
199 |        "    .dataframe tbody tr th {\n",
200 |        "        vertical-align: top;\n",
201 |        "    }\n",
202 |        "\n",
203 |        "    .dataframe thead th {\n",
204 |        "        text-align: right;\n",
205 |        "    }\n",
206 |        "</style>\n",
207 |        "<table border=\"1\" class=\"dataframe\">\n",
208 |        "  <thead>\n",
209 |        "    <tr style=\"text-align: right;\">\n",
210 |        "      <th></th>\n",
211 |        "      <th>alcohol</th>\n",
212 |        "      <th>malic_acid</th>\n",
213 |        "      <th>ash</th>\n",
214 |        "      <th>alcalinity_of_ash</th>\n",
215 |        "      <th>magnesium</th>\n",
216 |        "      <th>total_phenols</th>\n",
217 |        "      <th>flavanoids</th>\n",
218 |        "      <th>nonflavanoid_phenols</th>\n",
219 |        "      <th>proanthocyanins</th>\n",
220 |        "      <th>color_intensity</th>\n",
221 |        "      <th>hue</th>\n",
222 |        "      <th>od280/od315_of_diluted_wines</th>\n",
223 |        "      <th>proline</th>\n",
224 |        "    </tr>\n",
225 |        "  </thead>\n",
226 |        "  <tbody>\n",
227 |        "    <tr>\n",
228 |        "      <th>0</th>\n",
229 |        "      <td>0.044407</td>\n",
230 |        "      <td>0.0</td>\n",
231 |        "      <td>0.0</td>\n",
232 |        "      <td>0.0</td>\n",
233 |        "      <td>0.0</td>\n",
234 |        "      <td>0.081586</td>\n",
235 |        "      <td>0.0</td>\n",
236 |        "      <td>0.0</td>\n",
237 |        "      <td>0.0</td>\n",
238 |        "      <td>0.399535</td>\n",
239 |        "      <td>0.0</td>\n",
240 |        "      <td>0.085821</td>\n",
241 |        "      <td>0.38865</td>\n",
242 |        "    </tr>\n",
243 |        "  </tbody>\n",
244 |        "</table>\n",
245 |        "</div>"
246 |       ],
247 |       "text/plain": [
248 |        "    alcohol  malic_acid  ash  alcalinity_of_ash  magnesium  total_phenols  \\\n",
249 |        "0  0.044407         0.0  0.0                0.0        0.0       0.081586   \n",
250 |        "\n",
251 |        "   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity  hue  \\\n",
252 |        "0         0.0                   0.0              0.0         0.399535  0.0   \n",
253 |        "\n",
254 |        "   od280/od315_of_diluted_wines  proline  \n",
255 |        "0                      0.085821  0.38865  "
256 |       ]
257 |      },
258 |      "execution_count": 10,
259 |      "metadata": {},
260 |      "output_type": "execute_result"
261 |     }
262 |    ],
263 |    "source": [
264 |     "f1=wine.feature_names\n",
265 |     "f2=clf.feature_importances_\n",
266 |     "f=pd.DataFrame([f2], columns=f1)\n",
267 |     "f"
268 |    ]
269 |   },
270 |   {
271 |    "cell_type": "code",
272 |    "execution_count": null,
273 |    "metadata": {},
274 |    "outputs": [],
275 |    "source": []
276 |   },
277 |   {
278 |    "cell_type": "code",
279 |    "execution_count": 11,
280 |    "metadata": {},
281 |    "outputs": [],
282 |    "source": [
283 |     "x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.25, random_state=4)\n",
284 |     "clf = DecisionTreeClassifier(criterion='entropy')\n",
285 |     "clf.fit(x_train, y_train)\n",
286 |     "y_pred = clf.predict(x_test)"
287 |    ]
288 |   },
289 |   {
290 |    "cell_type": "code",
291 |    "execution_count": 12,
292 |    "metadata": {},
293 |    "outputs": [
294 |     {
295 |      "name": "stdout",
296 |      "output_type": "stream",
297 |      "text": [
298 |       "Accuracy:  0.9777777777777777\n"
299 |      ]
300 |     }
301 |    ],
302 |    "source": [
303 |     "acc = metrics.accuracy_score(y_test, y_pred)\n",
304 |     "print(\"Accuracy: \", acc)"
305 |    ]
306 |   },
307 |   {
308 |    "cell_type": "code",
309 |    "execution_count": 13,
310 |    "metadata": {},
311 |    "outputs": [],
312 |    "source": [
313 |     "x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.25, random_state=4)\n",
314 |     "clf = DecisionTreeClassifier(criterion='entropy', max_depth=3)\n",
315 |     "clf.fit(x_train, y_train)\n",
316 |     "y_pred = clf.predict(x_test)"
317 |    ]
318 |   },
319 |   {
320 |    "cell_type": "code",
321 |    "execution_count": 14,
322 |    "metadata": {},
323 |    "outputs": [
324 |     {
325 |      "name": "stdout",
326 |      "output_type": "stream",
327 |      "text": [
328 |       "Accuracy:  0.9555555555555556\n"
329 |      ]
330 |     }
331 |    ],
332 |    "source": [
333 |     "acc = metrics.accuracy_score(y_test, y_pred)\n",
334 |     "print(\"Accuracy: \", acc)"
335 |    ]
336 |   },
337 |   {
338 |    "cell_type": "markdown",
339 |    "metadata": {},
340 |    "source": [
341 |     "# Boston"
342 |    ]
343 |   },
344 |   {
345 |    "cell_type": "code",
346 |    "execution_count": 15,
347 |    "metadata": {},
348 |    "outputs": [],
349 |    "source": [
350 |     "boston = datasets.load_boston() "
351 |    ]
352 |   },
353 |   {
354 |    "cell_type": "code",
355 |    "execution_count": 16,
356 |    "metadata": {},
357 |    "outputs": [],
358 |    "source": [
359 |     "#boston.target #regression"
360 |    ]
361 |   },
362 |   {
363 |    "cell_type": "code",
364 |    "execution_count": 17,
365 |    "metadata": {},
366 |    "outputs": [],
367 |    "source": [
368 |     "x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=0)\n",
369 |     "model = DecisionTreeRegressor()\n",
370 |     "model.fit(x_train, y_train)\n",
371 |     "pred=model.predict(x_test)"
372 |    ]
373 |   },
374 |   {
375 |    "cell_type": "code",
376 |    "execution_count": 18,
377 |    "metadata": {},
378 |    "outputs": [
379 |     {
380 |      "name": "stdout",
381 |      "output_type": "stream",
382 |      "text": [
383 |       "MSE: 27.464645669291336\n"
384 |      ]
385 |     }
386 |    ],
387 |    "source": [
388 |     "print(f'MSE: {mean_squared_error(y_test,pred)}')"
389 |    ]
390 |   },
391 |   {
392 |    "cell_type": "code",
393 |    "execution_count": 19,
394 |    "metadata": {},
395 |    "outputs": [
396 |     {
397 |      "name": "stdout",
398 |      "output_type": "stream",
399 |      "text": [
400 |       "MSE: 31.190157480314962\n"
401 |      ]
402 |     }
403 |    ],
404 |    "source": [
405 |     "model = DecisionTreeRegressor(criterion='mae', max_depth=2, min_samples_split=5)\n",
406 |     "model.fit(x_train, y_train)\n",
407 |     "pred=model.predict(x_test)\n",
408 |     "print(f'MSE: {mean_squared_error(y_test,pred)}')"
409 |    ]
410 |   },
411 |   {
412 |    "cell_type": "code",
413 |    "execution_count": 20,
414 |    "metadata": {},
415 |    "outputs": [
416 |     {
417 |      "name": "stdout",
418 |      "output_type": "stream",
419 |      "text": [
420 |       "MSE: 33.32551599016632\n"
421 |      ]
422 |     }
423 |    ],
424 |    "source": [
425 |     "model = DecisionTreeRegressor(criterion='mse', max_depth=2)\n",
426 |     "model.fit(x_train, y_train)\n",
427 |     "pred=model.predict(x_test)\n",
428 |     "print(f'MSE: {mean_squared_error(y_test,pred)}')"
429 |    ]
430 |   },
431 |   {
432 |    "cell_type": "code",
433 |    "execution_count": null,
434 |    "metadata": {},
435 |    "outputs": [],
436 |    "source": []
437 |   }
438 |  ],
439 |  "metadata": {
440 |   "kernelspec": {
441 |    "display_name": "Python 3",
442 |    "language": "python",
443 |    "name": "python3"
444 |   },
445 |   "language_info": {
446 |    "codemirror_mode": {
447 |     "name": "ipython",
448 |     "version": 3
449 |    },
450 |    "file_extension": ".py",
451 |    "mimetype": "text/x-python",
452 |    "name": "python",
453 |    "nbconvert_exporter": "python",
454 |    "pygments_lexer": "ipython3",
455 |    "version": "3.6.7"
456 |   }
457 |  },
458 |  "nbformat": 4,
459 |  "nbformat_minor": 2
460 | }
461 | 
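
The notebook above compares an unrestricted decision tree with a depth-limited one on a single train/test split. As a minimal sketch (not part of the original homework), the same comparison can be repeated with 5-fold cross-validation, which depends less on one particular split, and the feature importances can be ranked in a `pandas.Series`. The `random_state=0` setting and the `cv_scores` and `importances` names are assumptions added here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()

# Compare an unrestricted tree (max_depth=None) with a depth-limited tree
# using the mean accuracy over 5 cross-validation folds.
cv_scores = {}
for depth in (None, 3):
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    cv_scores[depth] = cross_val_score(clf, wine.data, wine.target, cv=5).mean()
    print(f"max_depth={depth}: mean CV accuracy = {cv_scores[depth]:.3f}")

# Fit once on the full dataset and rank the features by importance.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(wine.data, wine.target)
importances = pd.Series(clf.feature_importances_, index=wine.feature_names)
importances = importances.sort_values(ascending=False)
print(importances.head())
```

Sorting the importances makes it easier to see which few features the tree actually splits on; features that are never used keep an importance of 0.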


--------------------------------------------------------------------------------
/Day_022_HW.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {
  6 |     "collapsed": true
  7 |    },
  8 |    "source": [
  9 |     "# Homework: (Kaggle) Titanic Survival Prediction\n",
 10 |     "https://www.kaggle.com/c/titanic"
 11 |    ]
 12 |   },
 13 |   {
 14 |    "cell_type": "markdown",
 15 |    "metadata": {
 16 |     "collapsed": true
 17 |    },
 18 |    "source": [
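 19 |     "# Homework 1\n",
 20 |     "* Following the example, switch between label encoding (Label Encoder) and one-hot encoding (One Hot Encoder) in the house-price prediction task.  \n",
 21 |     "Which of the two models, linear regression or gradient boosting trees, is affected more?"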
 22 |    ]
 23 |   },
 24 |   {
 25 |    "cell_type": "code",
 26 |    "execution_count": 1,
 27 |    "metadata": {},
 28 |    "outputs": [
 29 |     {
 30 |      "data": {
 31 |       "text/html": [
 32 |        "<div>\n",
 33 |        "<style scoped>\n",
 34 |        "    .dataframe tbody tr th:only-of-type {\n",
 35 |        "        vertical-align: middle;\n",
 36 |        "    }\n",
 37 |        "\n",
 38 |        "    .dataframe tbody tr th {\n",
 39 |        "        vertical-align: top;\n",
 40 |        "    }\n",
 41 |        "\n",
 42 |        "    .dataframe thead th {\n",
 43 |        "        text-align: right;\n",
 44 |        "    }\n",
 45 |        "</style>\n",
 46 |        "<table border=\"1\" class=\"dataframe\">\n",
 47 |        "  <thead>\n",
 48 |        "    <tr style=\"text-align: right;\">\n",
 49 |        "      <th></th>\n",
 50 |        "      <th>Pclass</th>\n",
 51 |        "      <th>Name</th>\n",
 52 |        "      <th>Sex</th>\n",
 53 |        "      <th>Age</th>\n",
 54 |        "      <th>SibSp</th>\n",
 55 |        "      <th>Parch</th>\n",
 56 |        "      <th>Ticket</th>\n",
 57 |        "      <th>Fare</th>\n",
 58 |        "      <th>Cabin</th>\n",
 59 |        "      <th>Embarked</th>\n",
 60 |        "    </tr>\n",
 61 |        "  </thead>\n",
 62 |        "  <tbody>\n",
 63 |        "    <tr>\n",
 64 |        "      <th>0</th>\n",
 65 |        "      <td>3</td>\n",
 66 |        "      <td>Braund, Mr. Owen Harris</td>\n",
 67 |        "      <td>male</td>\n",
 68 |        "      <td>22.0</td>\n",
 69 |        "      <td>1</td>\n",
 70 |        "      <td>0</td>\n",
 71 |        "      <td>A/5 21171</td>\n",
 72 |        "      <td>7.2500</td>\n",
 73 |        "      <td>NaN</td>\n",
 74 |        "      <td>S</td>\n",
 75 |        "    </tr>\n",
 76 |        "    <tr>\n",
 77 |        "      <th>1</th>\n",
 78 |        "      <td>1</td>\n",
 79 |        "      <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
 80 |        "      <td>female</td>\n",
 81 |        "      <td>38.0</td>\n",
 82 |        "      <td>1</td>\n",
 83 |        "      <td>0</td>\n",
 84 |        "      <td>PC 17599</td>\n",
 85 |        "      <td>71.2833</td>\n",
 86 |        "      <td>C85</td>\n",
 87 |        "      <td>C</td>\n",
 88 |        "    </tr>\n",
 89 |        "    <tr>\n",
 90 |        "      <th>2</th>\n",
 91 |        "      <td>3</td>\n",
 92 |        "      <td>Heikkinen, Miss. Laina</td>\n",
 93 |        "      <td>female</td>\n",
 94 |        "      <td>26.0</td>\n",
 95 |        "      <td>0</td>\n",
 96 |        "      <td>0</td>\n",
 97 |        "      <td>STON/O2. 3101282</td>\n",
 98 |        "      <td>7.9250</td>\n",
 99 |        "      <td>NaN</td>\n",
100 |        "      <td>S</td>\n",
101 |        "    </tr>\n",
102 |        "    <tr>\n",
103 |        "      <th>3</th>\n",
104 |        "      <td>1</td>\n",
105 |        "      <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
106 |        "      <td>female</td>\n",
107 |        "      <td>35.0</td>\n",
108 |        "      <td>1</td>\n",
109 |        "      <td>0</td>\n",
110 |        "      <td>113803</td>\n",
111 |        "      <td>53.1000</td>\n",
112 |        "      <td>C123</td>\n",
113 |        "      <td>S</td>\n",
114 |        "    </tr>\n",
115 |        "    <tr>\n",
116 |        "      <th>4</th>\n",
117 |        "      <td>3</td>\n",
118 |        "      <td>Allen, Mr. William Henry</td>\n",
119 |        "      <td>male</td>\n",
120 |        "      <td>35.0</td>\n",
121 |        "      <td>0</td>\n",
122 |        "      <td>0</td>\n",
123 |        "      <td>373450</td>\n",
124 |        "      <td>8.0500</td>\n",
125 |        "      <td>NaN</td>\n",
126 |        "      <td>S</td>\n",
127 |        "    </tr>\n",
128 |        "  </tbody>\n",
129 |        "</table>\n",
130 |        "</div>"
131 |       ],
132 |       "text/plain": [
133 |        "   Pclass                                               Name     Sex   Age  \\\n",
134 |        "0       3                            Braund, Mr. Owen Harris    male  22.0   \n",
135 |        "1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0   \n",
136 |        "2       3                             Heikkinen, Miss. Laina  female  26.0   \n",
137 |        "3       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0   \n",
138 |        "4       3                           Allen, Mr. William Henry    male  35.0   \n",
139 |        "\n",
140 |        "   SibSp  Parch            Ticket     Fare Cabin Embarked  \n",
141 |        "0      1      0         A/5 21171   7.2500   NaN        S  \n",
142 |        "1      1      0          PC 17599  71.2833   C85        C  \n",
143 |        "2      0      0  STON/O2. 3101282   7.9250   NaN        S  \n",
144 |        "3      1      0            113803  53.1000  C123        S  \n",
145 |        "4      0      0            373450   8.0500   NaN        S  "
146 |       ]
147 |      },
148 |      "execution_count": 1,
149 |      "metadata": {},
150 |      "output_type": "execute_result"
151 |     }
152 |    ],
153 |    "source": [
154 |     "# Complete all preparation before feature engineering (same as the previous example)\n",
155 |     "import pandas as pd\n",
156 |     "import numpy as np\n",
157 |     "import copy, time\n",
158 |     "from sklearn.preprocessing import MinMaxScaler\n",
159 |     "from sklearn.model_selection import cross_val_score\n",
160 |     "from sklearn.linear_model import LogisticRegression\n",
161 |     "from sklearn.preprocessing import LabelEncoder\n",
162 |     "\n",
163 |     "data_path = '../part02/'\n",
164 |     "df_train = pd.read_csv(data_path + 'titanic_train.csv')\n",
165 |     "df_test = pd.read_csv(data_path + 'titanic_test.csv')\n",
166 |     "\n",
167 |     "train_Y = df_train['Survived']\n",
168 |     "ids = df_test['PassengerId']\n",
169 |     "df_train = df_train.drop(['PassengerId', 'Survived'] , axis=1)\n",
170 |     "df_test = df_test.drop(['PassengerId'] , axis=1)\n",
171 |     "df = pd.concat([df_train,df_test])\n",
172 |     "df.head()"
173 |    ]
174 |   },
175 |   {
176 |    "cell_type": "code",
177 |    "execution_count": 2,
178 |    "metadata": {},
179 |    "outputs": [],
180 |    "source": [
181 |     "train_Y.shape\n",
182 |     "num=train_Y.shape[0]"
183 |    ]
184 |   },
185 |   {
186 |    "cell_type": "code",
187 |    "execution_count": 3,
188 |    "metadata": {},
189 |    "outputs": [
190 |     {
191 |      "name": "stdout",
192 |      "output_type": "stream",
193 |      "text": [
194 |       "5 Object Features : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']\n",
195 |       "\n"
196 |      ]
197 |     },
198 |     {
199 |      "data": {
200 |       "text/html": [
201 |        "<div>\n",
202 |        "<style scoped>\n",
203 |        "    .dataframe tbody tr th:only-of-type {\n",
204 |        "        vertical-align: middle;\n",
205 |        "    }\n",
206 |        "\n",
207 |        "    .dataframe tbody tr th {\n",
208 |        "        vertical-align: top;\n",
209 |        "    }\n",
210 |        "\n",
211 |        "    .dataframe thead th {\n",
212 |        "        text-align: right;\n",
213 |        "    }\n",
214 |        "</style>\n",
215 |        "<table border=\"1\" class=\"dataframe\">\n",
216 |        "  <thead>\n",
217 |        "    <tr style=\"text-align: right;\">\n",
218 |        "      <th></th>\n",
219 |        "      <th>Name</th>\n",
220 |        "      <th>Sex</th>\n",
221 |        "      <th>Ticket</th>\n",
222 |        "      <th>Cabin</th>\n",
223 |        "      <th>Embarked</th>\n",
224 |        "    </tr>\n",
225 |        "  </thead>\n",
226 |        "  <tbody>\n",
227 |        "    <tr>\n",
228 |        "      <th>0</th>\n",
229 |        "      <td>Braund, Mr. Owen Harris</td>\n",
230 |        "      <td>male</td>\n",
231 |        "      <td>A/5 21171</td>\n",
232 |        "      <td>None</td>\n",
233 |        "      <td>S</td>\n",
234 |        "    </tr>\n",
235 |        "    <tr>\n",
236 |        "      <th>1</th>\n",
237 |        "      <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
238 |        "      <td>female</td>\n",
239 |        "      <td>PC 17599</td>\n",
240 |        "      <td>C85</td>\n",
241 |        "      <td>C</td>\n",
242 |        "    </tr>\n",
243 |        "    <tr>\n",
244 |        "      <th>2</th>\n",
245 |        "      <td>Heikkinen, Miss. Laina</td>\n",
246 |        "      <td>female</td>\n",
247 |        "      <td>STON/O2. 3101282</td>\n",
248 |        "      <td>None</td>\n",
249 |        "      <td>S</td>\n",
250 |        "    </tr>\n",
251 |        "    <tr>\n",
252 |        "      <th>3</th>\n",
253 |        "      <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
254 |        "      <td>female</td>\n",
255 |        "      <td>113803</td>\n",
256 |        "      <td>C123</td>\n",
257 |        "      <td>S</td>\n",
258 |        "    </tr>\n",
259 |        "    <tr>\n",
260 |        "      <th>4</th>\n",
261 |        "      <td>Allen, Mr. William Henry</td>\n",
262 |        "      <td>male</td>\n",
263 |        "      <td>373450</td>\n",
264 |        "      <td>None</td>\n",
265 |        "      <td>S</td>\n",
266 |        "    </tr>\n",
267 |        "  </tbody>\n",
268 |        "</table>\n",
269 |        "</div>"
270 |       ],
271 |       "text/plain": [
272 |        "                                                Name     Sex  \\\n",
273 |        "0                            Braund, Mr. Owen Harris    male   \n",
274 |        "1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   \n",
275 |        "2                             Heikkinen, Miss. Laina  female   \n",
276 |        "3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   \n",
277 |        "4                           Allen, Mr. William Henry    male   \n",
278 |        "\n",
279 |        "             Ticket Cabin Embarked  \n",
280 |        "0         A/5 21171  None        S  \n",
281 |        "1          PC 17599   C85        C  \n",
282 |        "2  STON/O2. 3101282  None        S  \n",
283 |        "3            113803  C123        S  \n",
284 |        "4            373450  None        S  "
285 |       ]
286 |      },
287 |      "execution_count": 3,
288 |      "metadata": {},
289 |      "output_type": "execute_result"
290 |     }
291 |    ],
292 |    "source": [
293 |     "# Keep only the categorical (object) columns, stored in object_features\n",
294 |     "object_features = []\n",
295 |     "for dtype, feature in zip(df.dtypes, df.columns):\n",
296 |     "    if dtype == 'object':\n",
297 |     "        object_features.append(feature)\n",
298 |     "print(f'{len(object_features)} Object Features : {object_features}\\n')\n",
299 |     "\n",
300 |     "# Keep only the categorical columns\n",
301 |     "df = df[object_features]\n",
302 |     "df = df.fillna('None')\n",
303 |     "train_num = train_Y.shape[0]\n",
304 |     "df.head()"
305 |    ]
306 |   },
307 |   {
308 |    "cell_type": "markdown",
309 |    "metadata": {
310 |     "collapsed": true
311 |    },
312 |    "source": [
313 |     "# Homework 2\n",
314 |     "* In the Titanic example, how do label encoding and one-hot encoding each affect the prediction results? (Hint: refer to today's example)"
315 |    ]
316 |   },
317 |   {
318 |    "cell_type": "code",
319 |    "execution_count": 4,
320 |    "metadata": {},
321 |    "outputs": [
322 |     {
323 |      "data": {
324 |       "text/plain": [
325 |        "0.780004837244799"
326 |       ]
327 |      },
328 |      "execution_count": 4,
329 |      "metadata": {},
330 |      "output_type": "execute_result"
331 |     }
332 |    ],
333 |    "source": [
334 |     "# Label encoding + logistic regression\n",
335 |     "import warnings\n",
336 |     "warnings.filterwarnings('ignore')\n",
337 |     "\"\"\"\n",
338 |     "Your Code Here\n",
339 |     "\"\"\"\n",
340 |     "from sklearn.preprocessing import LabelEncoder\n",
341 |     "Le = LabelEncoder()\n",
342 |     "df_Le = copy.deepcopy(df)\n",
343 |     "for i in df_Le.columns:\n",
344 |     "    df_Le[i]=Le.fit_transform(df_Le[i])\n",
345 |     "df_Le = df_Le[:num]\n",
346 |     "\n",
347 |     "from sklearn.linear_model import LogisticRegression\n",
348 |     "LR = LogisticRegression()\n",
349 |     "LR.fit(df_Le, train_Y)\n",
350 |     "cross_val_score(LR, df_Le, train_Y, cv=5).mean()"
351 |    ]
352 |   },
353 |   {
354 |    "cell_type": "code",
355 |    "execution_count": 5,
356 |    "metadata": {},
357 |    "outputs": [
358 |     {
359 |      "data": {
360 |       "text/plain": [
361 |        "0.8013346043513216"
362 |       ]
363 |      },
364 |      "execution_count": 5,
365 |      "metadata": {},
366 |      "output_type": "execute_result"
367 |     }
368 |    ],
369 |    "source": [
370 |     "# One-hot encoding + logistic regression\n",
371 |     "\n",
372 |     "\"\"\"\n",
373 |     "Your Code Here\n",
374 |     "\"\"\"\n",
375 |     "\n",
376 |     "df_OHe = copy.deepcopy(df)\n",
377 |     "df_OHe=pd.get_dummies(df_OHe)\n",
378 |     "df_OHe=df_OHe[:num]\n",
379 |     "\n",
380 |     "from sklearn.linear_model import LogisticRegression\n",
381 |     "LR = LogisticRegression()\n",
382 |     "cross_val_score(LR, df_OHe, train_Y, cv=5).mean()\n",
383 |     "\n"
384 |    ]
385 |   },
386 |   {
387 |    "cell_type": "code",
388 |    "execution_count": 6,
389 |    "metadata": {},
390 |    "outputs": [
391 |     {
392 |      "data": {
393 |       "text/plain": [
394 |        "(891, 2429)"
395 |       ]
396 |      },
397 |      "execution_count": 6,
398 |      "metadata": {},
399 |      "output_type": "execute_result"
400 |     }
401 |    ],
402 |    "source": [
403 |     "df_OHe.shape  # one-hot encoding yields 2429 features for only 891 rows, so overfitting is a risk"
404 |    ]
405 |   },
406 |   {
407 |    "cell_type": "code",
408 |    "execution_count": null,
409 |    "metadata": {},
410 |    "outputs": [],
411 |    "source": []
412 |   }
413 |  ],
414 |  "metadata": {
415 |   "kernelspec": {
416 |    "display_name": "Python 3",
417 |    "language": "python",
418 |    "name": "python3"
419 |   },
420 |   "language_info": {
421 |    "codemirror_mode": {
422 |     "name": "ipython",
423 |     "version": 3
424 |    },
425 |    "file_extension": ".py",
426 |    "mimetype": "text/x-python",
427 |    "name": "python",
428 |    "nbconvert_exporter": "python",
429 |    "pygments_lexer": "ipython3",
430 |    "version": "3.6.7"
431 |   }
432 |  },
433 |  "nbformat": 4,
434 |  "nbformat_minor": 2
435 | }
436 | 
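
The notebook above ends by noting that one-hot encoding the Titanic object columns produces 2429 features. A minimal sketch of why the two encodings differ so much in width, using a small made-up DataFrame in place of `titanic_train.csv` (the `toy`, `toy_le`, and `toy_ohe` names and all values are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the Titanic object columns; the values are made up.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Ticket": ["A1", "B2", "C3", "D4"],  # high cardinality: every value is unique
})

# Label encoding: each column is replaced by one integer column.
toy_le = toy.copy()
for col in toy_le.columns:
    toy_le[col] = LabelEncoder().fit_transform(toy_le[col])

# One-hot encoding: each distinct value becomes its own column.
toy_ohe = pd.get_dummies(toy)

print(toy_le.shape)   # (4, 3)
print(toy_ohe.shape)  # (4, 9)
```

Label encoding keeps one column per feature, while one-hot encoding adds a column per distinct value (2 + 3 + 4 = 9 here), so near-unique columns such as `Name` and `Ticket` account for most of the extra width.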


--------------------------------------------------------------------------------
/Day_017_HW.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Homework: (Kaggle) Titanic Survival Prediction, Simplified Version\n",
  8 |     "https://www.kaggle.com/c/titanic"
  9 |    ]
 10 |   },
 11 |   {
 12 |    "cell_type": "markdown",
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "source": [
 17 |     "### Homework 1\n",
 18 |     "* Among the five code blocks A to E below, which one performs feature engineering?\n",
 19 |     "# Homework 1: C"
 20 |    ]
 21 |   },
 22 |   {
 23 |    "cell_type": "markdown",
 24 |    "metadata": {},
 25 |    "source": [
 26 |     "### Homework 2\n",
 27 |     "* Comparing the results of code blocks B and C, which columns are \"categorical columns\"? (The English column names are sufficient.) \n",
 28 |     "# Homework 2: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']"
 29 |    ]
 30 |   },
 31 |   {
 32 |    "cell_type": "markdown",
 33 |    "metadata": {},
 34 |    "source": [
 35 |     "### Homework 3\n",
 36 |     "* Continuing from the previous question, which column is the \"target value\"?\n",
 37 |     "# Homework 3: 'Survived'"
 38 |    ]
 39 |   },
 40 |   {
 41 |    "cell_type": "code",
 42 |    "execution_count": 1,
 43 |    "metadata": {},
 44 |    "outputs": [
 45 |     {
 46 |      "data": {
 47 |       "text/plain": [
 48 |        "(891, 12)"
 49 |       ]
 50 |      },
 51 |      "execution_count": 1,
 52 |      "metadata": {},
 53 |      "output_type": "execute_result"
 54 |     }
 55 |    ],
 56 |    "source": [
 57 |     "# Code block A\n",
 58 |     "import pandas as pd\n",
 59 |     "import numpy as np\n",
 60 |     "from sklearn.preprocessing import LabelEncoder, MinMaxScaler\n",
 61 |     "\n",
 62 |     "data_path = '../part02/'\n",
 63 |     "df_train = pd.read_csv(data_path + 'titanic_train.csv')\n",
 64 |     "df_test = pd.read_csv(data_path + 'titanic_test.csv')\n",
 65 |     "df_train.shape"
 66 |    ]
 67 |   },
 68 |   {
 69 |    "cell_type": "code",
 70 |    "execution_count": 2,
 71 |    "metadata": {},
 72 |    "outputs": [
 73 |     {
 74 |      "name": "stdout",
 75 |      "output_type": "stream",
 76 |      "text": [
 77 |       "['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']\n"
 78 |      ]
 79 |     }
 80 |    ],
 81 |    "source": [
 82 |     "ls=[]\n",
 83 |     "for i in list(df_train):\n",
 84 |     "    if df_train[i].dtype == 'object':\n",
 85 |     "        ls.append(i)\n",
 86 |     "\n",
 87 |     "print(ls)\n",
 88 |     "        "
 89 |    ]
 90 |   },
 91 |   {
 92 |    "cell_type": "code",
 93 |    "execution_count": 3,
 94 |    "metadata": {},
 95 |    "outputs": [
 96 |     {
 97 |      "data": {
 98 |       "text/html": [
 99 |        "<div>\n",
100 |        "<style scoped>\n",
101 |        "    .dataframe tbody tr th:only-of-type {\n",
102 |        "        vertical-align: middle;\n",
103 |        "    }\n",
104 |        "\n",
105 |        "    .dataframe tbody tr th {\n",
106 |        "        vertical-align: top;\n",
107 |        "    }\n",
108 |        "\n",
109 |        "    .dataframe thead th {\n",
110 |        "        text-align: right;\n",
111 |        "    }\n",
112 |        "</style>\n",
113 |        "<table border=\"1\" class=\"dataframe\">\n",
114 |        "  <thead>\n",
115 |        "    <tr style=\"text-align: right;\">\n",
116 |        "      <th></th>\n",
117 |        "      <th>Pclass</th>\n",
118 |        "      <th>Name</th>\n",
119 |        "      <th>Sex</th>\n",
120 |        "      <th>Age</th>\n",
121 |        "      <th>SibSp</th>\n",
122 |        "      <th>Parch</th>\n",
123 |        "      <th>Ticket</th>\n",
124 |        "      <th>Fare</th>\n",
125 |        "      <th>Cabin</th>\n",
126 |        "      <th>Embarked</th>\n",
127 |        "    </tr>\n",
128 |        "  </thead>\n",
129 |        "  <tbody>\n",
130 |        "    <tr>\n",
131 |        "      <th>0</th>\n",
132 |        "      <td>3</td>\n",
133 |        "      <td>Braund, Mr. Owen Harris</td>\n",
134 |        "      <td>male</td>\n",
135 |        "      <td>22.0</td>\n",
136 |        "      <td>1</td>\n",
137 |        "      <td>0</td>\n",
138 |        "      <td>A/5 21171</td>\n",
139 |        "      <td>7.2500</td>\n",
140 |        "      <td>NaN</td>\n",
141 |        "      <td>S</td>\n",
142 |        "    </tr>\n",
143 |        "    <tr>\n",
144 |        "      <th>1</th>\n",
145 |        "      <td>1</td>\n",
146 |        "      <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
147 |        "      <td>female</td>\n",
148 |        "      <td>38.0</td>\n",
149 |        "      <td>1</td>\n",
150 |        "      <td>0</td>\n",
151 |        "      <td>PC 17599</td>\n",
152 |        "      <td>71.2833</td>\n",
153 |        "      <td>C85</td>\n",
154 |        "      <td>C</td>\n",
155 |        "    </tr>\n",
156 |        "    <tr>\n",
157 |        "      <th>2</th>\n",
158 |        "      <td>3</td>\n",
159 |        "      <td>Heikkinen, Miss. Laina</td>\n",
160 |        "      <td>female</td>\n",
161 |        "      <td>26.0</td>\n",
162 |        "      <td>0</td>\n",
163 |        "      <td>0</td>\n",
164 |        "      <td>STON/O2. 3101282</td>\n",
165 |        "      <td>7.9250</td>\n",
166 |        "      <td>NaN</td>\n",
167 |        "      <td>S</td>\n",
168 |        "    </tr>\n",
169 |        "    <tr>\n",
170 |        "      <th>3</th>\n",
171 |        "      <td>1</td>\n",
172 |        "      <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
173 |        "      <td>female</td>\n",
174 |        "      <td>35.0</td>\n",
175 |        "      <td>1</td>\n",
176 |        "      <td>0</td>\n",
177 |        "      <td>113803</td>\n",
178 |        "      <td>53.1000</td>\n",
179 |        "      <td>C123</td>\n",
180 |        "      <td>S</td>\n",
181 |        "    </tr>\n",
182 |        "    <tr>\n",
183 |        "      <th>4</th>\n",
184 |        "      <td>3</td>\n",
185 |        "      <td>Allen, Mr. William Henry</td>\n",
186 |        "      <td>male</td>\n",
187 |        "      <td>35.0</td>\n",
188 |        "      <td>0</td>\n",
189 |        "      <td>0</td>\n",
190 |        "      <td>373450</td>\n",
191 |        "      <td>8.0500</td>\n",
192 |        "      <td>NaN</td>\n",
193 |        "      <td>S</td>\n",
194 |        "    </tr>\n",
195 |        "  </tbody>\n",
196 |        "</table>\n",
197 |        "</div>"
198 |       ],
199 |       "text/plain": [
200 |        "   Pclass                                               Name     Sex   Age  \\\n",
201 |        "0       3                            Braund, Mr. Owen Harris    male  22.0   \n",
202 |        "1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0   \n",
203 |        "2       3                             Heikkinen, Miss. Laina  female  26.0   \n",
204 |        "3       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0   \n",
205 |        "4       3                           Allen, Mr. William Henry    male  35.0   \n",
206 |        "\n",
207 |        "   SibSp  Parch            Ticket     Fare Cabin Embarked  \n",
208 |        "0      1      0         A/5 21171   7.2500   NaN        S  \n",
209 |        "1      1      0          PC 17599  71.2833   C85        C  \n",
210 |        "2      0      0  STON/O2. 3101282   7.9250   NaN        S  \n",
211 |        "3      1      0            113803  53.1000  C123        S  \n",
212 |        "4      0      0            373450   8.0500   NaN        S  "
213 |       ]
214 |      },
215 |      "execution_count": 3,
216 |      "metadata": {},
217 |      "output_type": "execute_result"
218 |     }
219 |    ],
220 |    "source": [
221 |     "# Code block B\n",
222 |     "train_Y = df_train['Survived']\n",
223 |     "ids = df_test['PassengerId']\n",
224 |     "df_train = df_train.drop(['PassengerId', 'Survived'] , axis=1)\n",
225 |     "df_test = df_test.drop(['PassengerId'] , axis=1)\n",
226 |     "df = pd.concat([df_train,df_test])\n",
227 |     "df.head()"
228 |    ]
229 |   },
230 |   {
231 |    "cell_type": "code",
232 |    "execution_count": 4,
233 |    "metadata": {},
234 |    "outputs": [
235 |     {
236 |      "name": "stdout",
237 |      "output_type": "stream",
238 |      "text": [
239 |       "Name\n",
240 |       "Sex\n",
241 |       "Ticket\n",
242 |       "Cabin\n",
243 |       "Embarked\n"
244 |      ]
245 |     },
246 |     {
247 |      "data": {
248 |       "text/html": [
249 |        "<div>\n",
250 |        "<style scoped>\n",
251 |        "    .dataframe tbody tr th:only-of-type {\n",
252 |        "        vertical-align: middle;\n",
253 |        "    }\n",
254 |        "\n",
255 |        "    .dataframe tbody tr th {\n",
256 |        "        vertical-align: top;\n",
257 |        "    }\n",
258 |        "\n",
259 |        "    .dataframe thead th {\n",
260 |        "        text-align: right;\n",
261 |        "    }\n",
262 |        "</style>\n",
263 |        "<table border=\"1\" class=\"dataframe\">\n",
264 |        "  <thead>\n",
265 |        "    <tr style=\"text-align: right;\">\n",
266 |        "      <th></th>\n",
267 |        "      <th>Pclass</th>\n",
268 |        "      <th>Name</th>\n",
269 |        "      <th>Sex</th>\n",
270 |        "      <th>Age</th>\n",
271 |        "      <th>SibSp</th>\n",
272 |        "      <th>Parch</th>\n",
273 |        "      <th>Ticket</th>\n",
274 |        "      <th>Fare</th>\n",
275 |        "      <th>Cabin</th>\n",
276 |        "      <th>Embarked</th>\n",
277 |        "    </tr>\n",
278 |        "  </thead>\n",
279 |        "  <tbody>\n",
280 |        "    <tr>\n",
281 |        "      <th>0</th>\n",
282 |        "      <td>1.0</td>\n",
283 |        "      <td>0.118683</td>\n",
284 |        "      <td>1.0</td>\n",
285 |        "      <td>0.283951</td>\n",
286 |        "      <td>0.125</td>\n",
287 |        "      <td>0.0</td>\n",
288 |        "      <td>0.775862</td>\n",
289 |        "      <td>0.016072</td>\n",
290 |        "      <td>0.000000</td>\n",
291 |        "      <td>1.000000</td>\n",
292 |        "    </tr>\n",
293 |        "    <tr>\n",
294 |        "      <th>1</th>\n",
295 |        "      <td>0.0</td>\n",
296 |        "      <td>0.218989</td>\n",
297 |        "      <td>0.0</td>\n",
298 |        "      <td>0.481481</td>\n",
299 |        "      <td>0.125</td>\n",
300 |        "      <td>0.0</td>\n",
301 |        "      <td>0.879310</td>\n",
302 |        "      <td>0.140813</td>\n",
303 |        "      <td>0.575269</td>\n",
304 |        "      <td>0.333333</td>\n",
305 |        "    </tr>\n",
306 |        "    <tr>\n",
307 |        "      <th>2</th>\n",
308 |        "      <td>1.0</td>\n",
309 |        "      <td>0.400459</td>\n",
310 |        "      <td>0.0</td>\n",
311 |        "      <td>0.333333</td>\n",
312 |        "      <td>0.000</td>\n",
313 |        "      <td>0.0</td>\n",
314 |        "      <td>0.984914</td>\n",
315 |        "      <td>0.017387</td>\n",
316 |        "      <td>0.000000</td>\n",
317 |        "      <td>1.000000</td>\n",
318 |        "    </tr>\n",
319 |        "    <tr>\n",
320 |        "      <th>3</th>\n",
321 |        "      <td>0.0</td>\n",
322 |        "      <td>0.323124</td>\n",
323 |        "      <td>0.0</td>\n",
324 |        "      <td>0.444444</td>\n",
325 |        "      <td>0.125</td>\n",
326 |        "      <td>0.0</td>\n",
327 |        "      <td>0.070043</td>\n",
328 |        "      <td>0.105390</td>\n",
329 |        "      <td>0.381720</td>\n",
330 |        "      <td>1.000000</td>\n",
331 |        "    </tr>\n",
332 |        "    <tr>\n",
333 |        "      <th>4</th>\n",
334 |        "      <td>1.0</td>\n",
335 |        "      <td>0.016845</td>\n",
336 |        "      <td>1.0</td>\n",
337 |        "      <td>0.444444</td>\n",
338 |        "      <td>0.000</td>\n",
339 |        "      <td>0.0</td>\n",
340 |        "      <td>0.699353</td>\n",
341 |        "      <td>0.017630</td>\n",
342 |        "      <td>0.000000</td>\n",
343 |        "      <td>1.000000</td>\n",
344 |        "    </tr>\n",
345 |        "  </tbody>\n",
346 |        "</table>\n",
347 |        "</div>"
348 |       ],
349 |       "text/plain": [
350 |        "   Pclass      Name  Sex       Age  SibSp  Parch    Ticket      Fare  \\\n",
351 |        "0     1.0  0.118683  1.0  0.283951  0.125    0.0  0.775862  0.016072   \n",
352 |        "1     0.0  0.218989  0.0  0.481481  0.125    0.0  0.879310  0.140813   \n",
353 |        "2     1.0  0.400459  0.0  0.333333  0.000    0.0  0.984914  0.017387   \n",
354 |        "3     0.0  0.323124  0.0  0.444444  0.125    0.0  0.070043  0.105390   \n",
355 |        "4     1.0  0.016845  1.0  0.444444  0.000    0.0  0.699353  0.017630   \n",
356 |        "\n",
357 |        "      Cabin  Embarked  \n",
358 |        "0  0.000000  1.000000  \n",
359 |        "1  0.575269  0.333333  \n",
360 |        "2  0.000000  1.000000  \n",
361 |        "3  0.381720  1.000000  \n",
362 |        "4  0.000000  1.000000  "
363 |       ]
364 |      },
365 |      "execution_count": 4,
366 |      "metadata": {},
367 |      "output_type": "execute_result"
368 |     }
369 |    ],
370 |    "source": [
371 |     "# Code block C\n",
372 |     "\n",
373 |     "import warnings\n",
374 |     "warnings.filterwarnings('ignore')\n",
375 |     "LEncoder = LabelEncoder()\n",
376 |     "MMEncoder = MinMaxScaler()\n",
377 |     "for c in df.columns:\n",
378 |     "    df[c] = df[c].fillna(-1)\n",
379 |     "    if df[c].dtype == 'object':\n",
380 |     "        print(c)\n",
381 |     "        df[c] = LEncoder.fit_transform(list(df[c].values))\n",
382 |     "    df[c] = MMEncoder.fit_transform(df[c].values.reshape(-1, 1))\n",
383 |     "df.head()"
384 |    ]
385 |   },
386 |   {
387 |    "cell_type": "code",
388 |    "execution_count": 5,
389 |    "metadata": {},
390 |    "outputs": [],
391 |    "source": [
392 |     "# Code block D\n",
393 |     "train_num = train_Y.shape[0]\n",
394 |     "train_X = df[:train_num]\n",
395 |     "test_X = df[train_num:]\n",
396 |     "\n",
397 |     "from sklearn.linear_model import LogisticRegression\n",
398 |     "estimator = LogisticRegression()\n",
399 |     "estimator.fit(train_X, train_Y)\n",
400 |     "pred = estimator.predict(test_X)"
401 |    ]
402 |   },
403 |   {
404 |    "cell_type": "code",
405 |    "execution_count": 6,
406 |    "metadata": {},
407 |    "outputs": [],
408 |    "source": [
409 |     "# Code block E\n",
410 |     "sub = pd.DataFrame({'PassengerId': ids, 'Survived': pred})\n",
411 |     "sub.to_csv('titanic_baseline.csv', index=False) "
412 |    ]
413 |   }
414 |  ],
415 |  "metadata": {
416 |   "kernelspec": {
417 |    "display_name": "Python 3",
418 |    "language": "python",
419 |    "name": "python3"
420 |   },
421 |   "language_info": {
422 |    "codemirror_mode": {
423 |     "name": "ipython",
424 |     "version": 3
425 |    },
426 |    "file_extension": ".py",
427 |    "mimetype": "text/x-python",
428 |    "name": "python",
429 |    "nbconvert_exporter": "python",
430 |    "pygments_lexer": "ipython3",
431 |    "version": "3.6.7"
432 |   }
433 |  },
434 |  "nbformat": 4,
435 |  "nbformat_minor": 2
436 | }
437 | 
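
Code block C above label-encodes every text column and then rescales every column to the range 0 to 1. The following is a minimal sketch of that idea on a made-up frame, not the notebook's exact code: the `toy` data, the `"missing"` sentinel, and the use of `pd.api.types.is_numeric_dtype` to detect text columns are all assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical toy frame standing in for the Titanic features.
toy = pd.DataFrame({
    "Sex": ["male", "female", None, "female"],
    "Age": [22.0, 38.0, None, 35.0],
})

encoded = toy.copy()
for col in encoded.columns:
    if not pd.api.types.is_numeric_dtype(encoded[col]):
        # Text column: replace missing values with a string sentinel,
        # then map each distinct string to an integer code.
        filled = encoded[col].fillna("missing").astype(str)
        encoded[col] = LabelEncoder().fit_transform(filled)
    else:
        # Numeric column: mark missing values with -1, as the notebook does.
        encoded[col] = encoded[col].fillna(-1)
    # Rescale the column so its minimum maps to 0 and its maximum to 1.
    encoded[col] = MinMaxScaler().fit_transform(encoded[[col]]).ravel()

print(encoded)
```

Integer codes from `LabelEncoder` follow the sorted order of the strings, so the scaled values carry no real ordering between categories. That is acceptable for a quick baseline, but it is a simplification rather than a recommended encoding.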


--------------------------------------------------------------------------------
/.ipynb_checkpoints/Day_017_HW-checkpoint.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Homework: (Kaggle) Titanic Survival Prediction, Simplified Version\n",
  8 |     "https://www.kaggle.com/c/titanic"
  9 |    ]
 10 |   },
 11 |   {
 12 |    "cell_type": "markdown",
 13 |    "metadata": {
 14 |     "collapsed": true
 15 |    },
 16 |    "source": [
 17 |     "### Homework 1\n",
 18 |     "* Of the five code blocks A to E below, which one performs feature engineering?\n",
 19 |     "# Homework 1: C"
 20 |    ]
 21 |   },
 22 |   {
 23 |    "cell_type": "markdown",
 24 |    "metadata": {},
 25 |    "source": [
 26 |     "### Homework 2\n",
 27 |     "* Comparing the outputs of code blocks B and C, which columns are \"categorical columns\"? (The English column names are enough.)\n",
 28 |     "# Homework 2: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']"
 29 |    ]
 30 |   },
 31 |   {
 32 |    "cell_type": "markdown",
 33 |    "metadata": {},
 34 |    "source": [
 35 |     "### Homework 3\n",
 36 |     "* Continuing from the previous question, which column is the \"target value\"?\n",
 37 |     "# Homework 3: 'Survived'"
 38 |    ]
 39 |   },
 40 |   {
 41 |    "cell_type": "code",
 42 |    "execution_count": 1,
 43 |    "metadata": {},
 44 |    "outputs": [
 45 |     {
 46 |      "data": {
 47 |       "text/plain": [
 48 |        "(891, 12)"
 49 |       ]
 50 |      },
 51 |      "execution_count": 1,
 52 |      "metadata": {},
 53 |      "output_type": "execute_result"
 54 |     }
 55 |    ],
 56 |    "source": [
 57 |     "# Code block A\n",
 58 |     "import pandas as pd\n",
 59 |     "import numpy as np\n",
 60 |     "from sklearn.preprocessing import LabelEncoder, MinMaxScaler\n",
 61 |     "\n",
 62 |     "data_path = '../part02/'\n",
 63 |     "df_train = pd.read_csv(data_path + 'titanic_train.csv')\n",
 64 |     "df_test = pd.read_csv(data_path + 'titanic_test.csv')\n",
 65 |     "df_train.shape"
 66 |    ]
 67 |   },
 68 |   {
 69 |    "cell_type": "code",
 70 |    "execution_count": 2,
 71 |    "metadata": {},
 72 |    "outputs": [
 73 |     {
 74 |      "name": "stdout",
 75 |      "output_type": "stream",
 76 |      "text": [
 77 |       "['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']\n"
 78 |      ]
 79 |     }
 80 |    ],
 81 |    "source": [
 82 |     "ls=[]\n",
 83 |     "for i in list(df_train):\n",
 84 |     "    if df_train[i].dtype == 'object':\n",
 85 |     "        ls.append(i)\n",
 86 |     "\n",
 87 |     "print(ls)\n",
 88 |     "        "
 89 |    ]
 90 |   },
 91 |   {
 92 |    "cell_type": "code",
 93 |    "execution_count": 3,
 94 |    "metadata": {},
 95 |    "outputs": [
 96 |     {
 97 |      "data": {
 98 |       "text/html": [
 99 |        "<div>\n",
100 |        "<style scoped>\n",
101 |        "    .dataframe tbody tr th:only-of-type {\n",
102 |        "        vertical-align: middle;\n",
103 |        "    }\n",
104 |        "\n",
105 |        "    .dataframe tbody tr th {\n",
106 |        "        vertical-align: top;\n",
107 |        "    }\n",
108 |        "\n",
109 |        "    .dataframe thead th {\n",
110 |        "        text-align: right;\n",
111 |        "    }\n",
112 |        "</style>\n",
113 |        "<table border=\"1\" class=\"dataframe\">\n",
114 |        "  <thead>\n",
115 |        "    <tr style=\"text-align: right;\">\n",
116 |        "      <th></th>\n",
117 |        "      <th>Pclass</th>\n",
118 |        "      <th>Name</th>\n",
119 |        "      <th>Sex</th>\n",
120 |        "      <th>Age</th>\n",
121 |        "      <th>SibSp</th>\n",
122 |        "      <th>Parch</th>\n",
123 |        "      <th>Ticket</th>\n",
124 |        "      <th>Fare</th>\n",
125 |        "      <th>Cabin</th>\n",
126 |        "      <th>Embarked</th>\n",
127 |        "    </tr>\n",
128 |        "  </thead>\n",
129 |        "  <tbody>\n",
130 |        "    <tr>\n",
131 |        "      <th>0</th>\n",
132 |        "      <td>3</td>\n",
133 |        "      <td>Braund, Mr. Owen Harris</td>\n",
134 |        "      <td>male</td>\n",
135 |        "      <td>22.0</td>\n",
136 |        "      <td>1</td>\n",
137 |        "      <td>0</td>\n",
138 |        "      <td>A/5 21171</td>\n",
139 |        "      <td>7.2500</td>\n",
140 |        "      <td>NaN</td>\n",
141 |        "      <td>S</td>\n",
142 |        "    </tr>\n",
143 |        "    <tr>\n",
144 |        "      <th>1</th>\n",
145 |        "      <td>1</td>\n",
146 |        "      <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
147 |        "      <td>female</td>\n",
148 |        "      <td>38.0</td>\n",
149 |        "      <td>1</td>\n",
150 |        "      <td>0</td>\n",
151 |        "      <td>PC 17599</td>\n",
152 |        "      <td>71.2833</td>\n",
153 |        "      <td>C85</td>\n",
154 |        "      <td>C</td>\n",
155 |        "    </tr>\n",
156 |        "    <tr>\n",
157 |        "      <th>2</th>\n",
158 |        "      <td>3</td>\n",
159 |        "      <td>Heikkinen, Miss. Laina</td>\n",
160 |        "      <td>female</td>\n",
161 |        "      <td>26.0</td>\n",
162 |        "      <td>0</td>\n",
163 |        "      <td>0</td>\n",
164 |        "      <td>STON/O2. 3101282</td>\n",
165 |        "      <td>7.9250</td>\n",
166 |        "      <td>NaN</td>\n",
167 |        "      <td>S</td>\n",
168 |        "    </tr>\n",
169 |        "    <tr>\n",
170 |        "      <th>3</th>\n",
171 |        "      <td>1</td>\n",
172 |        "      <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
173 |        "      <td>female</td>\n",
174 |        "      <td>35.0</td>\n",
175 |        "      <td>1</td>\n",
176 |        "      <td>0</td>\n",
177 |        "      <td>113803</td>\n",
178 |        "      <td>53.1000</td>\n",
179 |        "      <td>C123</td>\n",
180 |        "      <td>S</td>\n",
181 |        "    </tr>\n",
182 |        "    <tr>\n",
183 |        "      <th>4</th>\n",
184 |        "      <td>3</td>\n",
185 |        "      <td>Allen, Mr. William Henry</td>\n",
186 |        "      <td>male</td>\n",
187 |        "      <td>35.0</td>\n",
188 |        "      <td>0</td>\n",
189 |        "      <td>0</td>\n",
190 |        "      <td>373450</td>\n",
191 |        "      <td>8.0500</td>\n",
192 |        "      <td>NaN</td>\n",
193 |        "      <td>S</td>\n",
194 |        "    </tr>\n",
195 |        "  </tbody>\n",
196 |        "</table>\n",
197 |        "</div>"
198 |       ],
199 |       "text/plain": [
200 |        "   Pclass                                               Name     Sex   Age  \\\n",
201 |        "0       3                            Braund, Mr. Owen Harris    male  22.0   \n",
202 |        "1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0   \n",
203 |        "2       3                             Heikkinen, Miss. Laina  female  26.0   \n",
204 |        "3       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0   \n",
205 |        "4       3                           Allen, Mr. William Henry    male  35.0   \n",
206 |        "\n",
207 |        "   SibSp  Parch            Ticket     Fare Cabin Embarked  \n",
208 |        "0      1      0         A/5 21171   7.2500   NaN        S  \n",
209 |        "1      1      0          PC 17599  71.2833   C85        C  \n",
210 |        "2      0      0  STON/O2. 3101282   7.9250   NaN        S  \n",
211 |        "3      1      0            113803  53.1000  C123        S  \n",
212 |        "4      0      0            373450   8.0500   NaN        S  "
213 |       ]
214 |      },
215 |      "execution_count": 3,
216 |      "metadata": {},
217 |      "output_type": "execute_result"
218 |     }
219 |    ],
220 |    "source": [
221 |     "# Code block B\n",
222 |     "train_Y = df_train['Survived']\n",
223 |     "ids = df_test['PassengerId']\n",
224 |     "df_train = df_train.drop(['PassengerId', 'Survived'] , axis=1)\n",
225 |     "df_test = df_test.drop(['PassengerId'] , axis=1)\n",
226 |     "df = pd.concat([df_train,df_test])\n",
227 |     "df.head()"
228 |    ]
229 |   },
230 |   {
231 |    "cell_type": "code",
232 |    "execution_count": 4,
233 |    "metadata": {},
234 |    "outputs": [
235 |     {
236 |      "name": "stdout",
237 |      "output_type": "stream",
238 |      "text": [
239 |       "Name\n",
240 |       "Sex\n",
241 |       "Ticket\n",
242 |       "Cabin\n",
243 |       "Embarked\n"
244 |      ]
245 |     },
246 |     {
247 |      "data": {
248 |       "text/html": [
249 |        "<div>\n",
250 |        "<style scoped>\n",
251 |        "    .dataframe tbody tr th:only-of-type {\n",
252 |        "        vertical-align: middle;\n",
253 |        "    }\n",
254 |        "\n",
255 |        "    .dataframe tbody tr th {\n",
256 |        "        vertical-align: top;\n",
257 |        "    }\n",
258 |        "\n",
259 |        "    .dataframe thead th {\n",
260 |        "        text-align: right;\n",
261 |        "    }\n",
262 |        "</style>\n",
263 |        "<table border=\"1\" class=\"dataframe\">\n",
264 |        "  <thead>\n",
265 |        "    <tr style=\"text-align: right;\">\n",
266 |        "      <th></th>\n",
267 |        "      <th>Pclass</th>\n",
268 |        "      <th>Name</th>\n",
269 |        "      <th>Sex</th>\n",
270 |        "      <th>Age</th>\n",
271 |        "      <th>SibSp</th>\n",
272 |        "      <th>Parch</th>\n",
273 |        "      <th>Ticket</th>\n",
274 |        "      <th>Fare</th>\n",
275 |        "      <th>Cabin</th>\n",
276 |        "      <th>Embarked</th>\n",
277 |        "    </tr>\n",
278 |        "  </thead>\n",
279 |        "  <tbody>\n",
280 |        "    <tr>\n",
281 |        "      <th>0</th>\n",
282 |        "      <td>1.0</td>\n",
283 |        "      <td>0.118683</td>\n",
284 |        "      <td>1.0</td>\n",
285 |        "      <td>0.283951</td>\n",
286 |        "      <td>0.125</td>\n",
287 |        "      <td>0.0</td>\n",
288 |        "      <td>0.775862</td>\n",
289 |        "      <td>0.016072</td>\n",
290 |        "      <td>0.000000</td>\n",
291 |        "      <td>1.000000</td>\n",
292 |        "    </tr>\n",
293 |        "    <tr>\n",
294 |        "      <th>1</th>\n",
295 |        "      <td>0.0</td>\n",
296 |        "      <td>0.218989</td>\n",
297 |        "      <td>0.0</td>\n",
298 |        "      <td>0.481481</td>\n",
299 |        "      <td>0.125</td>\n",
300 |        "      <td>0.0</td>\n",
301 |        "      <td>0.879310</td>\n",
302 |        "      <td>0.140813</td>\n",
303 |        "      <td>0.575269</td>\n",
304 |        "      <td>0.333333</td>\n",
305 |        "    </tr>\n",
306 |        "    <tr>\n",
307 |        "      <th>2</th>\n",
308 |        "      <td>1.0</td>\n",
309 |        "      <td>0.400459</td>\n",
310 |        "      <td>0.0</td>\n",
311 |        "      <td>0.333333</td>\n",
312 |        "      <td>0.000</td>\n",
313 |        "      <td>0.0</td>\n",
314 |        "      <td>0.984914</td>\n",
315 |        "      <td>0.017387</td>\n",
316 |        "      <td>0.000000</td>\n",
317 |        "      <td>1.000000</td>\n",
318 |        "    </tr>\n",
319 |        "    <tr>\n",
320 |        "      <th>3</th>\n",
321 |        "      <td>0.0</td>\n",
322 |        "      <td>0.323124</td>\n",
323 |        "      <td>0.0</td>\n",
324 |        "      <td>0.444444</td>\n",
325 |        "      <td>0.125</td>\n",
326 |        "      <td>0.0</td>\n",
327 |        "      <td>0.070043</td>\n",
328 |        "      <td>0.105390</td>\n",
329 |        "      <td>0.381720</td>\n",
330 |        "      <td>1.000000</td>\n",
331 |        "    </tr>\n",
332 |        "    <tr>\n",
333 |        "      <th>4</th>\n",
334 |        "      <td>1.0</td>\n",
335 |        "      <td>0.016845</td>\n",
336 |        "      <td>1.0</td>\n",
337 |        "      <td>0.444444</td>\n",
338 |        "      <td>0.000</td>\n",
339 |        "      <td>0.0</td>\n",
340 |        "      <td>0.699353</td>\n",
341 |        "      <td>0.017630</td>\n",
342 |        "      <td>0.000000</td>\n",
343 |        "      <td>1.000000</td>\n",
344 |        "    </tr>\n",
345 |        "  </tbody>\n",
346 |        "</table>\n",
347 |        "</div>"
348 |       ],
349 |       "text/plain": [
350 |        "   Pclass      Name  Sex       Age  SibSp  Parch    Ticket      Fare  \\\n",
351 |        "0     1.0  0.118683  1.0  0.283951  0.125    0.0  0.775862  0.016072   \n",
352 |        "1     0.0  0.218989  0.0  0.481481  0.125    0.0  0.879310  0.140813   \n",
353 |        "2     1.0  0.400459  0.0  0.333333  0.000    0.0  0.984914  0.017387   \n",
354 |        "3     0.0  0.323124  0.0  0.444444  0.125    0.0  0.070043  0.105390   \n",
355 |        "4     1.0  0.016845  1.0  0.444444  0.000    0.0  0.699353  0.017630   \n",
356 |        "\n",
357 |        "      Cabin  Embarked  \n",
358 |        "0  0.000000  1.000000  \n",
359 |        "1  0.575269  0.333333  \n",
360 |        "2  0.000000  1.000000  \n",
361 |        "3  0.381720  1.000000  \n",
362 |        "4  0.000000  1.000000  "
363 |       ]
364 |      },
365 |      "execution_count": 4,
366 |      "metadata": {},
367 |      "output_type": "execute_result"
368 |     }
369 |    ],
370 |    "source": [
371 |     "# Code block C\n",
372 |     "\n",
373 |     "import warnings\n",
374 |     "warnings.filterwarnings('ignore')\n",
375 |     "LEncoder = LabelEncoder()\n",
376 |     "MMEncoder = MinMaxScaler()\n",
377 |     "for c in df.columns:\n",
378 |     "    df[c] = df[c].fillna(-1)\n",
379 |     "    if df[c].dtype == 'object':\n",
380 |     "        print(c)\n",
381 |     "        df[c] = LEncoder.fit_transform(list(df[c].values))\n",
382 |     "    df[c] = MMEncoder.fit_transform(df[c].values.reshape(-1, 1))\n",
383 |     "df.head()"
384 |    ]
385 |   },
386 |   {
387 |    "cell_type": "code",
388 |    "execution_count": 5,
389 |    "metadata": {},
390 |    "outputs": [],
391 |    "source": [
392 |     "# Code block D\n",
393 |     "train_num = train_Y.shape[0]\n",
394 |     "train_X = df[:train_num]\n",
395 |     "test_X = df[train_num:]\n",
396 |     "\n",
397 |     "from sklearn.linear_model import LogisticRegression\n",
398 |     "estimator = LogisticRegression()\n",
399 |     "estimator.fit(train_X, train_Y)\n",
400 |     "pred = estimator.predict(test_X)"
401 |    ]
402 |   },
403 |   {
404 |    "cell_type": "code",
405 |    "execution_count": 6,
406 |    "metadata": {},
407 |    "outputs": [],
408 |    "source": [
409 |     "# Code block E\n",
410 |     "sub = pd.DataFrame({'PassengerId': ids, 'Survived': pred})\n",
411 |     "sub.to_csv('titanic_baseline.csv', index=False) "
412 |    ]
413 |   }
414 |  ],
415 |  "metadata": {
416 |   "kernelspec": {
417 |    "display_name": "Python 3",
418 |    "language": "python",
419 |    "name": "python3"
420 |   },
421 |   "language_info": {
422 |    "codemirror_mode": {
423 |     "name": "ipython",
424 |     "version": 3
425 |    },
426 |    "file_extension": ".py",
427 |    "mimetype": "text/x-python",
428 |    "name": "python",
429 |    "nbconvert_exporter": "python",
430 |    "pygments_lexer": "ipython3",
431 |    "version": "3.6.7"
432 |   }
433 |  },
434 |  "nbformat": 4,
435 |  "nbformat_minor": 2
436 | }
437 | 
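
Code blocks B and D above stack the training and test features into one frame, then split them back by row count before fitting a logistic regression. The sketch below reproduces that pattern on invented data. The column names `x1` and `x2` and all values are hypothetical, and `ignore_index=True` with `.iloc` is a choice made here for clarity, not something the notebook does.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical, already-scaled features; not the real Titanic data.
train = pd.DataFrame({"x1": [0.0, 0.1, 0.9, 1.0], "x2": [0.2, 0.0, 1.0, 0.8]})
test = pd.DataFrame({"x1": [0.05, 0.95], "x2": [0.1, 0.9]})
train_y = pd.Series([0, 0, 1, 1])

# Stack both sets so any preprocessing sees identical columns,
# then recover each part by position using the number of training rows.
combined = pd.concat([train, test], ignore_index=True)
train_num = len(train_y)
train_X = combined.iloc[:train_num]
test_X = combined.iloc[train_num:]

# Fit the baseline classifier on the training rows and predict the test rows.
model = LogisticRegression()
model.fit(train_X, train_y)
pred = model.predict(test_X)
print(pred)
```

Splitting by position only works when the row order is preserved between the concat and the split, so any step that sorts or drops rows in between would break the alignment with `train_y`.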


--------------------------------------------------------------------------------