├── README.md
├── day01.md
├── day02.md
├── day03.md
├── day04.md
├── day05.md
├── day06.md
├── day07.md
├── day08.md
├── day09.md
├── day10.md
├── day11.md
├── day12.md
├── day13.md
├── day14.md
├── day15.md
├── day16.md
├── day17.md
├── day18.md
├── day19.md
├── day20.md
├── day21.md
├── day22.md
├── day23.md
├── day24.md
├── day25.md
├── day26.md
├── day27.md
├── day28.md
├── day29.md
└── day30.md


/README.md:
--------------------------------------------------------------------------------
 1 | # 2017 iT 邦幫忙 Ironman Contest
 2 | 
 3 | ## Python Learning Notes for an R User
 4 | 
 5 | ---
 6 | 
 7 | ## Table of Contents
 8 | 
 9 | ### Basics
10 | 
11 | - [[Day 01] Setting Up a Development Environment and Using Python as a Calculator](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day01.md)
12 | - [[Day 02] Basic Variable Types](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day02.md)
13 | - [[Day 03] Converting Variable Types](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day03.md)
14 | - [[Day 04] Data Structures: List, Tuple, and Dictionary](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day04.md)
15 | - [[Day 05] Data Structures (2): ndarray](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day05.md)
16 | - [[Day 06] Data Structures (3): Data Frame](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day06.md)
17 | - [[Day 07] Loops and Flow Control](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day07.md)
18 | - [[Day 08] Functions](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day08.md)
19 | - [[Day 09] Functions (2)](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day09.md)
20 | - [[Day 10] Object-Oriented Programming: R](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day10.md)
21 | - [[Day 11] Object-Oriented Programming (2): Python](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day11.md)
22 | 
23 | ### Basic Applications
24 | 
25 | - [[Day 12] Common Attributes and Methods: Variables and Basic Data Structures](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day12.md)
26 | - [[Day 13] Common Attributes and Methods (2): ndarray](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day13.md)
27 | - [[Day 14] Common Attributes and Methods (3): Data Frame](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day14.md)
28 | - [[Day 15] Loading Data](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day15.md)
29 | - [[Day 16] Parsing Web Pages](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day16.md)
30 | - [[Day 17] Data Wrangling](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day17.md)
31 | 
32 | ### Visualization
33 | 
34 | - [[Day 18] Data Visualization: matplotlib](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day18.md)
35 | - [[Day 19] Data Visualization (2): Seaborn](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day19.md)
36 | - [[Day 20] Data Visualization (3): Bokeh](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day20.md)
37 | 
38 | ### Machine Learning
39 | 
40 | - [[Day 21] Machine Learning: Toy Datasets and Linear Regression](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day21.md)
41 | - [[Day 22] Machine Learning (2): Multiple Regression and Logistic Regression](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day22.md)
42 | - [[Day 23] Machine Learning (3): Decision Trees and k-NN Classifiers](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day23.md)
43 | - [[Day 24] Machine Learning (4): Clustering Algorithms](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day24.md)
44 | - [[Day 25] Machine Learning (5): Ensemble Learning](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day25.md)
45 | - [[Day 26] Machine Learning (6): Random Forests and Support Vector Machines](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day26.md)
46 | 
47 | ### Deep Learning
48 | 
49 | - [[Day 27] Deep Learning: TensorFlow](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day27.md)
50 | - [[Day 28] Deep Learning (2): TensorBoard](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day28.md)
51 | - [[Day 29] Deep Learning (3): MNIST Handwritten Digit Recognition](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day29.md)
52 | - [[Day 30] Deep Learning (4): Convolutional Neural Networks and Ironman Wrap-Up](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day30.md)
53 | 
54 | ## Versions
55 | 
56 | - Python 3.5.2
57 | - Anaconda 4.2.0
58 | - scikit-learn 0.18
59 | - TensorFlow r0.12
60 | 
61 | &copy; Tony Yao-Jen Kuo 2016


--------------------------------------------------------------------------------
/day01.md:
--------------------------------------------------------------------------------
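  1 | # [Day 01] Setting Up a Development Environment and Using Python as a Calculator
  2 | 
  3 | ---
  4 | 
  5 | Anyone working in data science inevitably wonders at the outset: "If my time is limited, should I learn R or Python?" There are already far too many discussion threads about this online. Since the premise is limited time, we should not spend it reading those threads; you would likely learn more about the pecking order among programming languages than about the original question.
  6 | 
  7 | This series is written from the perspective of an R user learning Python for data science. I hope it gives people who have not started learning either language a little prior knowledge of both. By reading along, you can spend some time with each of them, get a feel for the atmosphere, and then decide which one to pick as your entry point into data science.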
  8 | 
  9 | ## How These Notes Are Organized
 10 | 
 11 | These notes follow an R user learning Python for data science, comparing the two languages along the way. They are organized into five major topics:
 12 | 
 13 | ### Basics
 14 | 
 15 | - [[Day 01] Setting Up a Development Environment and Using Python as a Calculator](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day01.md)
 16 | - [[Day 02] Basic Variable Types](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day02.md)
 17 | - [[Day 03] Converting Variable Types](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day03.md)
 18 | - [[Day 04] Data Structures: List, Tuple, and Dictionary](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day04.md)
 19 | - [[Day 05] Data Structures (2): ndarray](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day05.md)
 20 | - [[Day 06] Data Structures (3): Data Frame](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day06.md)
 21 | - [[Day 07] Loops and Flow Control](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day07.md)
 22 | - [[Day 08] Functions](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day08.md)
 23 | - [[Day 09] Functions (2)](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day09.md)
 24 | - [[Day 10] Object-Oriented Programming: R](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day10.md)
 25 | - [[Day 11] Object-Oriented Programming (2): Python](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day11.md)
 26 | 
 27 | ### Basic Applications
 28 | 
 29 | - [[Day 12] Common Attributes and Methods: Variables and Basic Data Structures](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day12.md)
 30 | - [[Day 13] Common Attributes and Methods (2): ndarray](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day13.md)
 31 | - [[Day 14] Common Attributes and Methods (3): Data Frame](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day14.md)
 32 | - [[Day 15] Loading Data](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day15.md)
 33 | - [[Day 16] Parsing Web Pages](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day16.md)
 34 | - [[Day 17] Data Wrangling](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day17.md)
 35 | 
 36 | ### Visualization
 37 | 
 38 | - [[Day 18] Data Visualization: matplotlib](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day18.md)
 39 | - [[Day 19] Data Visualization (2): Seaborn](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day19.md)
 40 | - [[Day 20] Data Visualization (3): Bokeh](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day20.md)
 41 | 
 42 | ### Machine Learning
 43 | 
 44 | - [[Day 21] Machine Learning: Toy Datasets and Linear Regression](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day21.md)
 45 | - [[Day 22] Machine Learning (2): Multiple Regression and Logistic Regression](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day22.md)
 46 | - [[Day 23] Machine Learning (3): Decision Trees and k-NN Classifiers](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day23.md)
 47 | - [[Day 24] Machine Learning (4): Clustering Algorithms](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day24.md)
 48 | - [[Day 25] Machine Learning (5): Ensemble Learning](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day25.md)
 49 | - [[Day 26] Machine Learning (6): Random Forests and Support Vector Machines](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day26.md)
 50 | 
 51 | ### Deep Learning
 52 | 
 53 | - [[Day 27] Deep Learning: TensorFlow](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day27.md)
 54 | - [[Day 28] Deep Learning (2): TensorBoard](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day28.md)
 55 | - [[Day 29] Deep Learning (3): MNIST Handwritten Digit Recognition](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day29.md)
 56 | - [[Day 30] Deep Learning (4): Convolutional Neural Networks and Ironman Wrap-Up](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day30.md)
 57 | 
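 58 | ## Setting Up the Development Environment
 59 | 
 60 | R users can download R from [CRAN](https://cran.r-project.org/) and then download the handy IDE from [RStudio](https://www.rstudio.com/products/rstudio/download3/). Once both are installed, the R development environment is ready.
 61 | 
 62 | What about a Python development environment? My operating system is OS X, which ships with Python already installed. I only need to open a terminal (Mac users who do not know how can press `Command + Space` to open Spotlight Search, search for Terminal, and press `Enter`) and type `$ python` to get started. If I need to write a .py file and run it, I can use Sublime Text, my usual editor, or any other text editor, so it seems nothing more is needed. Compared with RStudio, however, this setup feels a bit thin. At a minimum, I want the editing area for .py files and the command line side by side, which makes development comfortable.
 63 | 
 64 | To avoid wearing myself out on the very first day, after a brief Google search I decided to use Jupyter Notebook as my Python development environment.
 65 | 
 66 | ### Installing Anaconda
 67 | 
 68 | The Jupyter website recommends that inexperienced Python users install it through Anaconda. The anaconda is a non-venomous snake from South America; like the python, it is a very large snake, and I am quite fond of the name.
 69 | 
 70 | Go to [Anaconda](https://www.continuum.io/downloads), download the .pkg file, and install it.
 71 | The Python version preinstalled on OS X is 2.7, and the Jupyter website recommends Python 3 or later, so I chose Anaconda 4.2.0 with Python 3.5. After the installation finishes, type `$ python` in the terminal to confirm that it succeeded.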
 72 | 
 73 | Python 3.5.2 |Anaconda 4.2.0 (x86_64)
 74 | 
 75 | Press `Ctrl + D` to exit Python.
 76 | 
 77 | ### Starting Jupyter Notebook
 78 | 
 79 | Installing Anaconda also installs Jupyter Notebook. Enter the following command in the terminal to start it.
 80 | 
 81 | ```
 82 | $ jupyter notebook
 83 | ```
 84 | 
 85 | We can clearly see that Jupyter Notebook is running on localhost:8888, but when I try to create a new notebook, the options shown are python [conda root] and python [default].
 86 | 
 87 | ![day0101](https://storage.googleapis.com/2017_ithome_ironman/day0101.png)
 88 | 
 89 | ### Fixing the Kernel Display Issue
 90 | 
 91 | Go back to the terminal, press `Ctrl + C` to stop Jupyter Notebook, and then enter this command.
 92 | 
 93 | ```
 94 | $ conda remove _nb_ext_conf
 95 | ```
 96 | 
 97 | Restart Jupyter Notebook by entering this command in the terminal.
 98 | 
 99 | ```
100 | $ jupyter notebook
101 | ```
102 | 
103 | Create a new Python 3 notebook; the problem is resolved.
104 | 
105 | ![day0102](https://storage.googleapis.com/2017_ithome_ironman/day0102.png)
106 | 
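107 | The development environment is now ready, so let's use it for the simplest calculator-style tasks!
108 | 
109 | ## Using Python as a Calculator
110 | 
111 | In the first cell of the new Python 3 notebook, enter some simple arithmetic.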
112 | 
113 | ```python
114 | print(2 + 3)
115 | print(2 - 3)
116 | print(2 * 3)
117 | print(10 / 2)
118 | print(3 ** 2) # R uses 3 ^ 2
119 | print(10 % 4) # R uses 10 %% 4
120 | ```
121 | 
122 | After typing this in, select the cell, click "Cell" in the toolbar at the top, and then click "Run Cells" to see the output.
123 | 
124 | ![day0103](https://storage.googleapis.com/2017_ithome_ironman/day0103.png)
125 | 
126 | ![day0105](https://storage.googleapis.com/2017_ithome_ironman/day0105.png)
127 | 
128 | The operators differ slightly from R's for exponentiation and remainders. Python uses `**` rather than `^` for exponentiation, and `%` rather than `%%` for the remainder.
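Two more operator differences are worth a quick check for R users. The sketch below is an addition to the original notes and assumes Python 3: `/` always returns a float, `//` performs floor division (like R's `%/%`), and `^` is bitwise XOR rather than exponentiation.

```python
# Operators that tend to surprise R users (Python 3 semantics).
true_division = 10 / 4     # always a float in Python 3
floor_division = 10 // 4   # floor division, like R's 10 %/% 4
power = 3 ** 2             # exponentiation
xor = 3 ^ 2                # bitwise XOR, NOT exponentiation
remainder = -7 % 3         # result takes the sign of the divisor, as in R's %%

print(true_division, floor_division, power, xor, remainder)
```

Because `3 ^ 2` runs without an error in Python, this is an easy mistake to miss when porting R code.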
129 | 
130 | ## Summary
131 | 
132 | On the first day, we set up a Python development environment on our own computer and used it for simple calculator-style tasks, comparing with R along the way.
133 | 
134 | ## References
135 | 
136 | - [Installing Jupyter](http://jupyter.org/install.html)
137 | - [Download Anaconda Now!](https://www.continuum.io/downloads)
138 | - [Issue #1716 jupyter/notebook](https://github.com/jupyter/notebook/issues/1716)


--------------------------------------------------------------------------------
/day02.md:
--------------------------------------------------------------------------------
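  1 | # [Day 02] Basic Variable Types
  2 | 
  3 | ---
  4 | 
  5 | Understanding variable types is one of the fundamentals of learning a programming language (others include data structures, flow control, and iteration syntax). Because the topic is dry, beginners are often put off, so many people prefer to start from the application side, such as data frames, visualization, or machine learning packages, and come back later to build their foundations. I actually encourage that shortcut when time is tight, but since iT 邦幫忙 has given us thirty consecutive days, I think the fundamentals should come first.
  6 | 
  7 | Let's first look at R's basic variable types and then study Python's.
  8 | 
  9 | ## Basic Variable Types in R
 10 | 
 11 | R's basic variable types fall into the following categories:
 12 | 
 13 | - Numbers
 14 |     - numeric
 15 |     - integer
 16 |     - complex
 17 | - Logical values (logical)
 18 | - Text (character)
 19 | 
 20 | The R function that returns a variable's type is `class()`. If you are not sure how to use it, open RStudio and enter `?class` or `help(class)` at the console to read the documentation.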
 21 | 
 22 | ![day0201](https://storage.googleapis.com/2017_ithome_ironman/day0201.png)
 23 | 
 24 | Enter the following commands one by one at the RStudio console to see the variable types returned.
 25 | 
 26 | ```
 27 | class(5) # "numeric"
 28 | class(5.5) # "numeric"
 29 | class(5L) # "integer"
 30 | class(5 + 3i) # "complex"
 31 | class(TRUE) # "logical"
 32 | class(FALSE) # "logical"
 33 | class("2017 ithome ironman") # "character"
 34 | ```
 35 | 
 36 | ![day0202](https://storage.googleapis.com/2017_ithome_ironman/day0202.png)
 37 | 
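 38 | What about Python's variable types? Do they correspond to R's?
 39 | 
 40 | ## Basic Variable Types in Python
 41 | 
 42 | Start Jupyter Notebook from the terminal and create a new Python 3 notebook. If any of that is unclear, I recommend reading the Day 1 notes: [[Day 01] Setting Up a Development Environment and Using Python as a Calculator](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day01.md).
 43 | 
 44 | Python's basic variable types fall into the following categories:
 45 | 
 46 | - Numbers
 47 |     - float
 48 |     - int
 49 |     - complex
 50 | - Boolean values (bool)
 51 | - Text (str)
 52 | 
 53 | The Python function that returns a variable's type is `type()`. If you are not sure which parameters it accepts, enter `help(type)` in a cell to read the documentation.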
 54 | 
 55 | ![day0203](https://storage.googleapis.com/2017_ithome_ironman/day0203.png)
 56 | 
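 57 | Enter the following commands in a cell and run it to print the variable types. If you are not sure how to run a cell, I again recommend the Day 1 notes: [[Day 01] Setting Up a Development Environment and Using Python as a Calculator](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day01.md).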
 58 | 
 59 | ```python
 60 | print(type(5)) # <class 'int'>
 61 | print(type(5.5)) # <class 'float'>
 62 | print(type(5 + 3j)) # <class 'complex'>
 63 | print(type(True)) # <class 'bool'>
 64 | print(type(False)) # <class 'bool'>
 65 | print(type("2017 ithome ironman")) # <class 'str'>
 66 | ```
 67 | 
 68 | ![day0204](https://storage.googleapis.com/2017_ithome_ironman/day0204.png)
 69 | 
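 70 | Unlike R, Python distinguishes int from float automatically based on how the number is written (`5` is an int and `5.5` is a float, whereas R treats `5` as numeric unless it is written `5L`). Complex numbers are written with `j` rather than `i`, and Boolean values are `True/False` rather than `TRUE/FALSE`. The type names differ as well, but they mostly map onto each other intuitively: character corresponds to str, logical to bool, and numeric to float.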
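As a small check of this mapping (an addition to the original notes), `isinstance()` tests a value against a type. One detail the mapping hides is that Python's `bool` is a subclass of `int`, which is why `True` can take part in arithmetic as `1`.

```python
# Check each literal against the type it is expected to have.
checks = {
    "5 is int": isinstance(5, int),
    "5.5 is float": isinstance(5.5, float),
    "5 + 3j is complex": isinstance(5 + 3j, complex),
    "True is bool": isinstance(True, bool),
    "text is str": isinstance("2017 ithome ironman", str),
}
for label, result in checks.items():
    print(label, result)

# bool is a subclass of int, so True also counts as an int.
print(issubclass(bool, int), isinstance(True, int))
```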
 71 | 
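 72 | ## Operations Across Different Variable Types
 73 | 
 74 | R treats `TRUE` as 1 and `FALSE` as 0 in arithmetic, so numbers and logical values can be mixed freely in calculations, but text cannot.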
 75 | 
 76 | ```
 77 | 1 == TRUE # TRUE
 78 | 0L == FALSE # TRUE
 79 | 1.2 + TRUE # 2.2
 80 | 3L + TRUE * 2 # 5
 81 | "2017 ithome ironman" + " rocks!" # Error
 82 | ```
 83 | 
 84 | ![day0205](https://storage.googleapis.com/2017_ithome_ironman/day0205.png)
 85 | 
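 86 | Python handles Boolean values the same way as R, so numbers and Booleans can also be mixed freely. In addition, Python is more flexible than R with text: `+` concatenates strings and `*` repeats them.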
 87 | 
 88 | ```python
 89 | print(1.0 == True) # True
 90 | print(0 == False) # True
 91 | print(1.2 + True) # 2.2
 92 | print(3 + True * 2) # 5
 93 | print("2017 ithome ironman" + " rocks!") # "2017 ithome ironman rocks!"
 94 | print("2017 ithome ironman " + "rocks" + "!" * 3) # "2017 ithome ironman rocks!!!"
 95 | ```
 96 | 
 97 | ![day0206](https://storage.googleapis.com/2017_ithome_ironman/day0206.png)
 98 | 
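 99 | ## Summary
100 | 
101 | On the second day, we covered Python's basic variable types, how they correspond to R's basic variable types, and how flexibly different types can be combined in operations.
102 | 
103 | ## References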
104 | 
105 | - [Introducing Python](http://shop.oreilly.com/product/0636920028659.do)


--------------------------------------------------------------------------------
/day03.md:
--------------------------------------------------------------------------------
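  1 | # [Day 03] Converting Variable Types
  2 | 
  3 | ---
  4 | 
  5 | Variable types are not walled off from one another. In Python, for example, `True/False` are automatically converted to `1` and `0` when added to or subtracted from numbers. In less obvious cases, however, we have to convert types by hand. For instance, asking Python to print the following sentence raises a TypeError.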
  6 | 
  7 | ```python
  8 | days = 30
  9 | print("In order to become an ironman, you have to publish an article a day for " + days + " days in a row.")
 10 | ```
 11 | 
 12 | ![day0301](https://storage.googleapis.com/2017_ithome_ironman/day0301.png)
 13 | 
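 14 | ## Creating Objects
 15 | 
 16 | This code already creates an object. In Python, the operator that assigns a value to a name is the familiar `=`, unlike R's customary `<-`. R also accepts `=` for assignment, but as a matter of style the vast majority of R users still prefer `<-`.
 17 | 
 18 | Python has convenient augmented assignment operators such as `+=`, `-=`, `*=`, `/=`, and `%=` that make code more concise. Take this code: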
 19 | 
 20 | ```python
 21 | days = 30
 22 | days = days + 3
 23 | days # 33
 24 | ```
 25 | 
 26 | ![day0305](https://storage.googleapis.com/2017_ithome_ironman/day0305.png)
 27 | 
 28 | Here `days = days + 3` can be written as `days += 3`.
 29 | 
 30 | ```python
 31 | days = 30
 32 | days += 3
 33 | days # 33
 34 | ```
 35 | 
 36 | ![day0306](https://storage.googleapis.com/2017_ithome_ironman/day0306.png)
 37 | 
 38 | These assignment operators are not available in R, so this style is unfamiliar to me; I wrote a few more lines to get a feel for it.
 39 | 
 40 | ```python
 41 | days = 30
 42 | days += 3
 43 | print(days) # 33
 44 | days -= 3
 45 | print(days) # 30
 46 | days *= 5
 47 | print(days) # 150
 48 | days /= 5
 49 | print(days) # 30.0
 50 | days %= 7
 51 | print(days) # 2.0
 52 | ```
 53 | 
 54 | ![day0307](https://storage.googleapis.com/2017_ithome_ironman/day0307.png)
 55 | 
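 56 | Having practiced creating objects, let's return to the main topic: how R converts variable types, and then how Python does.
 57 | 
 58 | ## Converting Variable Types in R
 59 | 
 60 | In R, the `paste()` function handles the example that raised a TypeError in Python without any type conversion.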
 61 | 
 62 | ```
 63 | days <- 30
 64 | paste("In order to become an ironman, you have to publish an article a day for", days, "days in a row.")
 65 | ```
 66 | 
 67 | ![day0302](https://storage.googleapis.com/2017_ithome_ironman/day0302.png)
 68 | 
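 69 | R's type-conversion functions all begin with `as.` followed by the target type, which makes them easy to remember.
 70 | 
 71 | - `as.numeric()`: converts to numeric
 72 | - `as.integer()`: converts to integer
 73 | - `as.complex()`: converts to complex
 74 | - `as.logical()`: converts to logical
 75 | - `as.character()`: converts to character
 76 | 
 77 | We use the logical value, the most flexible type, to demonstrate these functions.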
 78 | 
 79 | ```
 80 | my_logical <- TRUE
 81 | class(my_logical) # "logical"
 82 | as.numeric(my_logical) # 1
 83 | as.integer(my_logical) # 1
 84 | as.complex(my_logical) # 1+0i
 85 | as.character(my_logical) # "TRUE"
 86 | ```
 87 | 
 88 | ![day0303](https://storage.googleapis.com/2017_ithome_ironman/day0303.png)
 89 | 
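 90 | The conversion functions are not all-powerful. For example, `as.integer("TRUE")` does not succeed (it returns `NA` with a warning); to turn `"TRUE"` into an integer, apply two conversions: `as.integer(as.logical("TRUE"))`.
 91 | 
 92 | ## Converting Variable Types in Python
 93 | 
 94 | The `str()` function fixes the TypeError we ran into earlier.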
 95 | 
 96 | ```python
 97 | days = 30
 98 | print("In order to become an ironman, you have to publish an article a day for " + str(days) + " days in a row.")
 99 | ```
100 | 
101 | ![day0304](https://storage.googleapis.com/2017_ithome_ironman/day0304.png)
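An alternative to concatenating with `str()` is `str.format()`, which converts the value while filling a placeholder. This is a sketch added for comparison, not part of the original notes; it works on the Python 3.5 used in this series (f-strings would require Python 3.6 or later).

```python
days = 30

# str.format() converts days to text while filling the {} placeholder.
message = "In order to become an ironman, you have to publish an article a day for {} days in a row.".format(days)
print(message)
```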
102 | 
103 | Python's type-conversion functions:
104 | 
105 | - `float()`: converts to float
106 | - `int()`: converts to int
107 | - `complex()`: converts to complex
108 | - `bool()`: converts to bool
109 | - `str()`: converts to str
110 | 
111 | We use the Boolean value, the most flexible type, to demonstrate these functions.
112 | 
113 | ```python
114 | my_bool = True
115 | print(type(my_bool)) # <class 'bool'>
116 | print(float(my_bool)) # 1.0
117 | print(int(my_bool)) # 1
118 | print(complex(my_bool)) # (1+0j)
119 | print(type(str(my_bool))) # <class 'str'>
120 | ```
121 | 
122 | ![day0308](https://storage.googleapis.com/2017_ithome_ironman/day0308.png)
123 | 
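124 | As in R, the conversion functions are not all-powerful. For example, `int("True")` raises a ValueError; to turn `"True"` into an integer, apply two conversions: `int(bool("True"))`. Keep in mind that `bool()` only tests whether the string is non-empty, so this does not actually parse the text.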
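A quick check shows why `int(bool(...))` should be used with care: `bool()` returns `True` for any non-empty string, so `"False"` also becomes `1`. The sketch below is an addition to the original notes, and `text_to_int` is a hypothetical helper invented here (not a built-in) that compares the text explicitly instead.

```python
# bool() only checks whether the string is empty.
print(int(bool("True")))   # 1
print(int(bool("False")))  # 1, because "False" is a non-empty string
print(int(bool("")))       # 0, the empty string is the only falsy string

def text_to_int(text):
    """Hypothetical helper: map the text "True"/"False" to 1/0."""
    mapping = {"True": 1, "False": 0}
    if text not in mapping:
        raise ValueError("expected 'True' or 'False', got {!r}".format(text))
    return mapping[text]

print(text_to_int("True"))   # 1
print(text_to_int("False"))  # 0
```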
125 | 
126 | ## Summary
127 | 
128 | On the third day, we warmed up with Python's assignment operators, then studied Python's type-conversion functions and compared them with R's.
129 | 
130 | ## References
131 | 
132 | - [Introducing Python](http://shop.oreilly.com/product/0636920028659.do)


--------------------------------------------------------------------------------
/day04.md:
--------------------------------------------------------------------------------
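  1 | # [Day 04] Data Structures: List, Tuple, and Dictionary
  2 | 
  3 | ---
  4 | 
  5 | Today we move from ingredients to collections. What does that mean? If you look back at the code examples from the past few days, each object held just a single value. In both R and Python, however, an object is like a big box: we can politely put a single value in it, or we can make good use of the space inside as needed. For example, there are two ways to store the Ironman contest group names.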
  6 | 
  7 | ```python
  8 | # Approach 1: polite
  9 | ironman_group_1 = "Modern Web"
 10 | ironman_group_2 = "DevOps"
 11 | ironman_group_3 = "Cloud"
 12 | ironman_group_4 = "Big Data"
 13 | ironman_group_5 = "Security"
 14 | ironman_group_6 = "自我挑戰組"
 15 | 
 16 | # Approach 2: make good use of the box
 17 | ironman_groups = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"]
 18 | ```
 19 | 
 20 | `ironman_groups` uses Python's **list** data structure to store six strings in a single object.
 21 | 
 22 | In R, I would use a **vector**.
 23 | 
 24 | ```
 25 | ironman_groups <- c("Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組")
 26 | ```
 27 | 
 28 | Next, let's look at R's data structures and then study Python's.
 29 | 
 30 | ## Data Structures in R
 31 | 
 32 | R has roughly five basic data structures:
 33 | 
 34 | - vector
 35 | - factor
 36 | - matrix
 37 | - data frame
 38 | - list
 39 | 
 40 | We can use the functions `factor()`, `matrix()`, `data.frame()`, and `list()` to convert `ironman_groups`, originally a **vector**, into other data structures.
 41 | 
 42 | ```
 43 | ironman_groups <- c("Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組")
 44 | # Convert to different data structures
 45 | ironman_groups_factor <- factor(ironman_groups)
 46 | ironman_groups_matrix <- matrix(ironman_groups, nrow = 2)
 47 | ironman_groups_df <- data.frame(ironman_groups)
 48 | ironman_groups_list <- list(ironman_groups, ironman_groups_factor, ironman_groups_matrix, ironman_groups_df)
 49 | 
 50 | # Print these data structures
 51 | ironman_groups
 52 | ironman_groups_factor
 53 | ironman_groups_matrix
 54 | ironman_groups_df
 55 | ironman_groups_list
 56 | ```
 57 | 
 58 | ![day0401](https://storage.googleapis.com/2017_ithome_ironman/day0401.png)
 59 | 
 60 | ![day0402](https://storage.googleapis.com/2017_ithome_ironman/day0402.png)
 61 | 
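 62 | In R, vector and factor are one-dimensional data structures; matrix and data frame are two-dimensional, with rows and columns; and list is like a super container that can hold objects of different data structures. We select an object inside a list with double brackets `[[]]` and an index. R indexes start at 1, which is very different from Python, where they start at 0.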
 63 | 
 64 | ```
 65 | ironman_groups_list[[1]]
 66 | ironman_groups_list[[2]]
 67 | ironman_groups_list[[3]]
 68 | ironman_groups_list[[4]]
 69 | ```
 70 | 
 71 | ![day0403](https://storage.googleapis.com/2017_ithome_ironman/day0403.png)
 72 | 
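 73 | ## Data Structures in Python
 74 | 
 75 | Python relies on two packages, `numpy` and `pandas`, for data structures that correspond to R's (such as matrix and data frame). Today we set aside anything that requires importing a package and leave it for later days.
 76 | 
 77 | Python has roughly three basic data structures: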
 78 | 
 79 | - list
 80 | - tuple
 81 | - dictionary
 82 | 
 83 | ### list
 84 | 
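 85 | A Python list resembles an R list: it can hold different variable types and data structures. It looks more like an R vector, but the two behave very differently: an R vector coerces its elements to a single variable type, while a list does not. Consider the simple example below.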
 86 | 
 87 | ```
 88 | participated_group <- "Big Data"
 89 | current_ttl_articles <- 4
 90 | is_participating <- TRUE
 91 | 
 92 | # Create a vector
 93 | my_vector <- c(participated_group, current_ttl_articles, is_participating)
 94 | class(my_vector[1])
 95 | class(my_vector[2])
 96 | 
 97 | # Create a list
 98 | my_list <- list(participated_group, current_ttl_articles, is_participating)
 99 | class(my_list[[1]])
100 | class(my_list[[2]])
101 | ```
102 | 
103 | ![day0404](https://storage.googleapis.com/2017_ithome_ironman/day0404.png)
104 | 
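105 | To create a Python list, simply wrap the elements in square brackets `[]`; elements are also selected with square brackets `[]` and an index. Python indexes start at 0, which is very different from R, where they start at 1.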
106 | 
107 | ```python
108 | participated_group = "Big Data"
109 | current_ttl_articles = 4
110 | is_participating = True
111 | 
112 | my_status = [participated_group, current_ttl_articles, is_participating]
113 | print(type(my_status[0]))
114 | print(type(my_status[1]))
115 | print(type(my_status[2]))
116 | ```
117 | 
118 | ![day0405](https://storage.googleapis.com/2017_ithome_ironman/day0405.png)
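Besides single indexes, a list also supports negative indexes and slices. A brief sketch, added here and reusing the group names from the start of this post: a slice includes the start index and excludes the stop index, unlike R's inclusive `1:3`, and a negative index counts from the end, whereas in R it drops elements.

```python
ironman_groups = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"]

first = ironman_groups[0]          # first element (R: ironman_groups[1])
last = ironman_groups[-1]          # negative indexes count from the end
first_three = ironman_groups[0:3]  # indexes 0, 1, 2; the stop index 3 is excluded

print(first)
print(last)
print(first_three)
print(len(ironman_groups))         # number of elements, like R's length()
```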
119 | 
120 | ### tuple
121 | 
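122 | A tuple is much like a list, but we cannot add, delete, or update its elements; R has no directly corresponding data structure. We can convert an existing list with the `tuple()` function, or create a tuple with parentheses `()` instead of the square brackets `[]` used for a list.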
123 | 
124 | ```python
125 | # Create a list and a tuple
126 | ironman_groups_list = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security"]
127 | ironman_groups_tuple = tuple(ironman_groups_list)
128 | 
129 | # Add an element
130 | ironman_groups_list.insert(5, "自我挑戰組")
131 | ironman_groups_tuple.insert(5, "自我挑戰組")
132 | ```
133 | 
134 | ![day0406](https://storage.googleapis.com/2017_ithome_ironman/day0406.png)
135 | 
136 | We get an error message, but if we comment out the last line of this code, there is no problem.
137 | 
138 | ![day0407](https://storage.googleapis.com/2017_ithome_ironman/day0407.png)
139 | 
140 | You may find the `.insert()` method call in this code unfamiliar; set it aside for now. Here we only want to show that a list can take new elements while a tuple cannot.
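To see the difference without stopping the notebook on an error, we can catch the exceptions. This sketch is an addition to the original notes: calling `.insert()` on a tuple raises an `AttributeError` because tuples have no such method, and assigning to an index raises a `TypeError`.

```python
ironman_groups_list = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security"]
ironman_groups_tuple = tuple(ironman_groups_list)

# A list can be changed in place.
ironman_groups_list.insert(5, "自我挑戰組")
print(len(ironman_groups_list))  # 6

errors = []
try:
    ironman_groups_tuple.insert(5, "自我挑戰組")  # tuples have no insert()
except AttributeError as err:
    errors.append(type(err).__name__)

try:
    ironman_groups_tuple[0] = "Web"  # item assignment is not allowed either
except TypeError as err:
    errors.append(type(err).__name__)

print(errors)
print(len(ironman_groups_tuple))  # still 5
```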
141 | 
142 | ### dictionary
143 | 
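144 | A dictionary is like a list whose elements carry keys, so we can select elements by key inside square brackets `[]`. To create a Python dictionary, use curly braces `{}`, or use the `dict()` function to convert an existing list of key-value pairs.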
145 | 
146 | ```python
147 | participated_group = "Big Data"
148 | current_ttl_articles = 4
149 | is_participating = True
150 | 
151 | # Create a dictionary
152 | my_status = {
153 |     "group": participated_group,
154 |     "ttl_articles": current_ttl_articles,
155 |     "is_participating": is_participating
156 | }
157 | 
158 | # Select elements by key
159 | print(my_status["group"])
160 | print(my_status["ttl_articles"])
161 | print(my_status["is_participating"])
162 | ```
163 | 
164 | ![day0408](https://storage.googleapis.com/2017_ithome_ironman/day0408.png)
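The `dict()` conversion needs the list to contain key-value pairs rather than plain values. A minimal sketch, added here with `keys`, `values`, and `pairs` as illustrative variable names, uses `zip()` to pair two lists first:

```python
keys = ["group", "ttl_articles", "is_participating"]
values = ["Big Data", 4, True]

# zip() pairs the two lists element by element; dict() turns the pairs into a dictionary.
pairs = list(zip(keys, values))
my_status = dict(pairs)

print(pairs)
print(my_status["group"])

# Unlike a tuple, a dictionary can be updated in place.
my_status["ttl_articles"] += 1
print(my_status["ttl_articles"])  # 5
```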
165 | 
166 | Naming the elements when creating a list in R produces a data structure similar to a dictionary.
167 | 
168 | ```
169 | participated_group <- "Big Data"
170 | current_ttl_articles <- 4
171 | is_participating <- TRUE
172 | 
173 | # Create a named list
174 | my_status <- list(
175 |     group = participated_group,
176 |     ttl_articles = current_ttl_articles,
177 |     is_participating = is_participating
178 | )
179 | 
180 | # Select elements by name
181 | my_status[["group"]]
182 | my_status[["ttl_articles"]]
183 | my_status[["is_participating"]]
184 | ```
185 | 
186 | ![day0409](https://storage.googleapis.com/2017_ithome_ironman/day0409.png)
187 | 
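188 | ## Summary
189 | 
190 | On the fourth day, we formally moved from ingredients to collections, studied Python's basic data structures, and compared them with R's. A Python list and an R list may not look alike, but both can hold different variable types and data structures. In the coming days we will also explore the data structures in the `numpy` and `pandas` packages.
191 | 
192 | ## References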
193 | 
194 | - [Introducing Python](http://shop.oreilly.com/product/0636920028659.do)
195 | - [Python For Data Science Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PythonForDataScience.pdf)


--------------------------------------------------------------------------------
/day05.md:
--------------------------------------------------------------------------------
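  1 | # [Day 05] Data Structures (2): ndarray
  2 | 
  3 | ---
  4 | 
  5 | As of 11 a.m. on 2016-12-05, the groups in the 8th iT 邦幫忙 Ironman contest have 46, 8, 11, 11, 4, and 56 participants respectively. We want to calculate how many articles each group will have in total once its participants finish the contest.
  6 | 
  7 | In R, we simply store the group sizes in a vector and multiply by 30; the element-wise result is the answer we want.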
  8 | 
  9 | ```
 10 | ironmen <- c(46, 8, 11, 11, 4, 56)
 11 | articles <- ironmen * 30
 12 | articles
 13 | ```
 14 | 
 15 | ![day0501](https://storage.googleapis.com/2017_ithome_ironman/day0501.png)
 16 | 
 17 | With a Python list, however, Python outputs the contents of ironmen repeated 30 times, which is not the answer we want.
 18 | 
 19 | ```python
 20 | ironmen = [46, 8, 11, 11, 4, 56]
 21 | articles = ironmen * 30
 22 | print(articles)
 23 | ```
 24 | 
 25 | ![day0502](https://storage.googleapis.com/2017_ithome_ironman/day0502.png)
 26 | 
 27 | Suppose we write it more carefully and wrap 30 in a list as well. This time Python returns an error message.
 28 | 
 29 | ```python
 30 | ironmen = [46, 8, 11, 11, 4, 56]
 31 | article_multiplier = [30, 30, 30, 30, 30, 30]
 32 | articles = ironmen * article_multiplier
 33 | print(articles)
 34 | ```
 35 | 
 36 | ![day0503](https://storage.googleapis.com/2017_ithome_ironman/day0503.png)
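For reference, plain lists can still produce the element-wise result with a list comprehension, and `zip()` handles the list-times-list case. This sketch is an addition to the original notes; it is more verbose than the R one-liner, which is part of the motivation for the ndarray.

```python
ironmen = [46, 8, 11, 11, 4, 56]
article_multiplier = [30, 30, 30, 30, 30, 30]

# Multiply every element by 30.
articles = [n * 30 for n in ironmen]
print(articles)

# Multiply two lists element by element.
articles_zip = [n * m for n, m in zip(ironmen, article_multiplier)]
print(articles_zip)
```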
 37 | 
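 38 | To use element-wise operations easily in Python, we rely on a data structure provided by the `numpy` package, the **numpy array**, or to put it more precisely, the **ndarray**.
 39 | 
 40 | ## A First numpy Application
 41 | 
 42 | Let's use the **ndarray** from the `numpy` package to solve the problem we just ran into. Because our development environment was installed with [Anaconda](https://www.continuum.io/), there is no need to download and install `numpy`; we only need to import it at the top of the program (for setting up the Python development environment used in this series, see [[Day 01] Setting Up a Development Environment and Using Python as a Calculator](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day01.md)).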
 43 | 
 44 | ```python
 45 | import numpy # import the package
 46 | 
 47 | ironmen = numpy.array([46, 8, 11, 11, 4, 56]) # convert the list with numpy's array() function
 48 | print(ironmen) # see what ironmen looks like
 49 | print(type(ironmen)) # check the data structure of ironmen
 50 | articles = ironmen * 30
 51 | print(articles)
 52 | ```
 53 | 
 54 | ![day0504](https://storage.googleapis.com/2017_ithome_ironman/day0504.png)
 55 | 
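 56 | R users are accustomed to functional programming, so a call like `numpy.array()` can look a little odd. Comparing it with R's `package_name::function_name()` syntax, which names both the package and the function, makes it click immediately.
 57 | 
 58 | To save some typing, we follow convention and abbreviate `numpy` as `np` when importing it.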
 59 | 
 60 | ```python
 61 | import numpy as np # import the package and abbreviate it as np
 62 | 
 63 | ironmen = np.array([46, 8, 11, 11, 4, 56]) # convert the list with numpy's array() function
 64 | print(ironmen) # see what ironmen looks like
 65 | print(type(ironmen)) # check the data structure of ironmen
 66 | articles = ironmen * 30
 67 | print(articles)
 68 | ```
 69 | 
 70 | ![day0505](https://storage.googleapis.com/2017_ithome_ironman/day0505.png)
 71 | 
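 72 | Let's review the vector and matrix data structures in R, and then study `numpy`'s ndarray.
 73 | 
 74 | ## Vector and Matrix in R
 75 | 
 76 | ### A Single Data Type
 77 | 
 78 | R's vector and matrix allow only one data type. If numbers and logical values are stored together, they are automatically converted to numeric; if numbers, logical values, and text are stored together, they are all automatically converted to character.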
 79 | 
 80 | ```
 81 | my_vector <- c(1, TRUE)
 82 | class(my_vector) # "numeric"
 83 | my_vector <- c(1, TRUE, "one")
 84 | class(my_vector) # "character"
 85 | ```
 86 | 
 87 | ![day0506](https://storage.googleapis.com/2017_ithome_ironman/day0506.png)
 88 | 
 89 | ```
 90 | my_matrix <- matrix(c(1, 0, TRUE, FALSE), nrow = 2)
 91 | my_matrix
 92 | my_matrix <- matrix(c(1, "zero", TRUE, FALSE), nrow = 2)
 93 | my_matrix
 94 | ```
 95 | 
 96 | ![day0507](https://storage.googleapis.com/2017_ithome_ironman/day0507.png)
 97 | 
 98 | ### Element-wise Operations
 99 | 
100 | R's vector and matrix fully support element-wise operations.
101 | 
102 | ```
103 | my_vector <- 1:4
104 | my_vector ^ 2
105 | 
106 | my_matrix <- matrix(my_vector, nrow = 2)
107 | my_matrix ^ 2
108 | ```
109 | 
110 | ![day0508](https://storage.googleapis.com/2017_ithome_ironman/day0508.png)
111 | 
112 | ### Selecting Elements
113 | 
114 | Elements of an R vector or matrix are selected with square brackets `[]`, using either an index or logical values.
115 | 
116 | ```
117 | ironmen <- c(46, 8, 11, 11, 4, 56)
118 | ironmen[1] # number of participants in the Modern Web group
119 | ironmen > 10 # which groups have more than 10 participants
120 | ironmen[ironmen > 10] # participant counts above 10
121 | names(ironmen) <- c("Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組") # name the elements of the vector
122 | names(ironmen[ironmen > 10]) # names of the groups with more than 10 participants
123 | ```
124 | 
125 | ![day0509](https://storage.googleapis.com/2017_ithome_ironman/day0509.png)
126 | 
127 | ```
128 | ironmen <- c(46, 8, 11, 11, 4, 56)
129 | ironmen_mat <- matrix(ironmen, nrow = 2)
130 | ironmen_mat[1, 1] # number of participants in the Modern Web group
131 | ironmen_mat > 10 # which groups have more than 10 participants
132 | ironmen_mat[ironmen_mat > 10] # participant counts above 10
133 | ```
134 | 
135 | ![day0510](https://storage.googleapis.com/2017_ithome_ironman/day0510.png)
136 | 
137 | ### Functions for Checking the Size of a Matrix
138 | 
139 | In R, the `length()` and `dim()` functions report the size of a matrix.
140 | 
141 | ```
142 | ironmen <- c(46, 8, 11, 11, 4, 56)
143 | ironmen_mat <- matrix(ironmen, nrow = 2)
144 | length(ironmen_mat)
145 | dim(ironmen_mat)
146 | ```
147 | 
148 | ![day0511](https://storage.googleapis.com/2017_ithome_ironman/day0511.png)
149 | 
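150 | ## NumPy's ndarray
151 | 
152 | ### A Single Data Type
153 | 
154 | NumPy's ndarray allows only one data type. If numbers and Boolean values are stored together, they are automatically converted to numbers; if numbers, Boolean values, and text are stored together, they are all automatically converted to text.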
155 | 
156 | ```python
157 | import numpy as np
158 | 
159 | my_np_array = np.array([1, True])
160 | print(my_np_array.dtype) # int64
161 | my_np_array = np.array([1, True, "one"])
162 | print(my_np_array.dtype) # e.g. <U21, a Unicode string type (the width can vary by platform)
163 | ```
164 | 
165 | ![day0512](https://storage.googleapis.com/2017_ithome_ironman/day0512.png)
166 | 
167 | ```python
168 | import numpy as np
169 | 
170 | my_2d_array = np.array([[1, True],
171 |                         [0, False]])
172 | print(my_2d_array)
173 | 
174 | my_2d_array = np.array([[1, True],
175 |                         ["zero", False]])
176 | print(my_2d_array)
177 | ```
178 | 
179 | ![day0513](https://storage.googleapis.com/2017_ithome_ironman/day0513.png)
180 | 
181 | ### Element-wise Operations
182 | 
183 | NumPy's ndarray fully supports element-wise operations.
184 | 
185 | ```python
186 | import numpy as np
187 | 
188 | my_np_array = np.array([1, 2, 3, 4])
189 | print(my_np_array ** 2)
190 | 
191 | my_2d_array = np.array([[1, 3],
192 |                         [2, 4]])
193 | print(my_2d_array ** 2)
194 | ```
195 | 
196 | ![day0514](https://storage.googleapis.com/2017_ithome_ironman/day0514.png)
197 | 
198 | ### Selecting Elements
199 | 
200 | Elements of a NumPy ndarray are selected with square brackets `[]`, using either an index or Boolean values.
201 | 
202 | ```python
203 | import numpy as np
204 | 
205 | ironmen = np.array([46, 8, 11, 11, 4, 56])
206 | print(ironmen[0]) # number of participants in the Modern Web group
207 | print(ironmen > 10) # which groups have more than 10 participants
208 | print(ironmen[ironmen > 10]) # participant counts above 10
209 | ```
210 | 
211 | ![day0515](https://storage.googleapis.com/2017_ithome_ironman/day0515.png)
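The R example attached names to the vector to find which groups have more than 10 participants. An ndarray has no equivalent of `names()`; one way to get the same answer (a sketch added here, with `groups` and `over_ten` as illustrative variable names) is to keep the group names in a second array and index it with the same Boolean mask.

```python
import numpy as np

groups = np.array(["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"])
ironmen = np.array([46, 8, 11, 11, 4, 56])

over_ten = ironmen > 10   # Boolean mask, one value per group
print(groups[over_ten])   # names of the groups with more than 10 participants
print(ironmen[over_ten])  # the matching participant counts
```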
212 | 
213 | ```python
214 | import numpy as np
215 | 
216 | ironmen_2d_array = np.array([[46, 11, 4],
217 |                             [8, 11, 56]])
218 | print(ironmen_2d_array[0, 0]) # number of ironmen in the Modern Web group
219 | print(ironmen_2d_array > 10) # which groups have more than 10 ironmen
220 | print(ironmen_2d_array[ironmen_2d_array > 10]) # the counts that exceed 10
221 | ```
222 | 
223 | ![day0516](https://storage.googleapis.com/2017_ithome_ironman/day0516.png)
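Boolean selection can also combine several conditions. The following is a small sketch, assuming NumPy is installed; the names `mask`, `selected` and `positions` are only illustrative.

```python
import numpy as np

ironmen = np.array([46, 8, 11, 11, 4, 56])

# Each comparison needs its own parentheses because & binds tighter than > and <
mask = (ironmen > 10) & (ironmen < 50)

selected = ironmen[mask]        # the values that satisfy both conditions
positions = np.where(mask)[0]   # the indices at which the mask is True

print(mask)
print(selected)
print(positions)
```

Note that `&` and `|` are used instead of Python's `and` and `or`, because the comparison has to be made element by element.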
224 | 
225 | ### Attributes that describe a 2d array
226 | 
227 | NumPy reports the size of a 2d array through the `.size` and `.shape` attributes.
228 | 
229 | ```python
230 | import numpy as np
231 | 
232 | ironmen_2d_array = np.array([[46, 11, 4],
233 |                             [8, 11, 56]])
234 | print(ironmen_2d_array.size) # 6
235 | print(ironmen_2d_array.shape) # (2, 3)
236 | ```
237 | 
238 | ![day0517](https://storage.googleapis.com/2017_ithome_ironman/day0517.png)
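One difference worth knowing when comparing with R's `matrix()`: R fills a matrix column by column, while NumPy's `reshape()` fills row by row unless told otherwise. The sketch below assumes NumPy is installed and uses the same six numbers as the R example; the variable names are illustrative.

```python
import numpy as np

ironmen = np.array([46, 8, 11, 11, 4, 56])

row_major = ironmen.reshape((2, 3))           # NumPy default: fill row by row
r_like = ironmen.reshape((2, 3), order="F")   # fill column by column, as R's matrix() does

print(row_major)
print(r_like)
print(r_like.shape, r_like.size, r_like.ndim)
```

With `order="F"` the result has the same layout as `matrix(ironmen, nrow = 2)` in R.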
239 | 
240 | ## Summary
241 | 
242 | On day five we started using Python's `numpy` package. It provides a data structure called **ndarray**, which supports element-wise operations and can be compared with the vector and matrix in R.
243 | 
244 | ## References
245 | 
246 | - [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do)
247 | - [NumPy - Wikipedia (Chinese edition)](https://zh.wikipedia.org/wiki/NumPy)
248 | - [Python For Data Science Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PythonForDataScience.pdf)


--------------------------------------------------------------------------------
/day06.md:
--------------------------------------------------------------------------------
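  1 | # [Day 06] Data structures (3): Data Frame
  2 | 
  3 | ---
  4 | 
  5 | As of 7 a.m. on 2016-12-06, the groups of the 8th iT 邦幫忙 Ironman contest had 46, 8, 12, 12, 6 and 58 participants ("ironmen"). We want a table that records each group and its number of ironmen.
  6 | 
  7 | In R, a data frame can hold this information; the `data.frame()` function combines vectors into a data frame.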
  8 | 
  9 | ```
 10 | groups <- c("Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組")
 11 | ironmen <- c(46, 8, 12, 12, 6, 58)
 12 | ironmen_df <- data.frame(groups, ironmen)
 13 | ironmen_df
 14 | View(ironmen_df)
 15 | ```
 16 | 
 17 | ![day0601](https://storage.googleapis.com/2017_ithome_ironman/day0601.png)
 18 | 
 19 | ![day0602](https://storage.googleapis.com/2017_ithome_ironman/day0602.png)
 20 | 
 21 | To use data frames in Python we rely on the `pandas` package. As with the `numpy` package discussed on day five, our development environment is [Anaconda](https://www.continuum.io/), so `pandas` does not need to be downloaded or installed separately; importing it at the top of the program is enough. (For the Python environment used in this series, see [[Day 01] Setting up the development environment and using Python as a calculator](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day01.md).)
 22 | 
 23 | ## A first pandas application
 24 | 
 25 | After importing `pandas` we abbreviate it as `pd` by convention. The most basic way to build a data frame is to pass a dictionary to `pd.DataFrame()`. (If dictionaries are unfamiliar, see [[Day 04] Data structures: List, Tuple and Dictionary](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day04.md).)
 26 | 
 27 | ```python
 28 | import pandas as pd # import the package under the alias pd
 29 | 
 30 | groups = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"]
 31 | ironmen = [46, 8, 12, 12, 6, 58]
 32 | 
 33 | ironmen_dict = {"groups": groups,
 34 |                 "ironmen": ironmen
 35 |                 }
 36 | 
 37 | ironmen_df = pd.DataFrame(ironmen_dict)
 38 | 
 39 | print(ironmen_df) # look at the data frame
 40 | print(type(ironmen_df)) # pandas.core.frame.DataFrame
 41 | ```
 42 | 
 43 | ![day0603](https://storage.googleapis.com/2017_ithome_ironman/day0603.png)
 44 | 
 45 | R users who find the `pd.DataFrame()` notation unfamiliar can compare it with R's `package_name::function_name()` syntax, which names the package and the function together.
 46 | 
 47 | Let's review the data frame in R first and then study the data frame in `pandas`.
 48 | 
 49 | ## The R data frame
 50 | 
 51 | ### Can be converted from a matrix
 52 | 
 53 | If all variables share the same type, a data frame can also be created from a matrix.
 54 | 
 55 | ```
 56 | my_mat <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("col1", "col2")))
 57 | my_df <- data.frame(my_mat)
 58 | my_df
 59 | ```
 60 | 
 61 | ![day0604](https://storage.googleapis.com/2017_ithome_ironman/day0604.png)
 62 | 
 63 | ### Holds multiple data types
 64 | 
 65 | Like a list, and unlike a vector or matrix, a data frame is not restricted to a single data type; each column can have its own.
 66 | 
 67 | ```
 68 | groups <- c("Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組")
 69 | ironmen <- c(46, 8, 12, 12, 6, 58)
 70 | ironmen_df <- data.frame(groups, ironmen)
 71 | 
 72 | sapply(ironmen_df, FUN = class) # return the class of each column
 73 | ```
 74 | 
 75 | ![day0605](https://storage.googleapis.com/2017_ithome_ironman/day0605.png)
 76 | 
 77 | ### Selecting elements
 78 | 
 79 | In R, square brackets `[ , ]` or `$` flexibly select elements (values, rows or columns) from a data frame.
 80 | 
 81 | ```
 82 | groups <- c("Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組")
 83 | ironmen <- c(46, 8, 12, 12, 6, 58)
 84 | ironmen_df <- data.frame(groups, ironmen)
 85 | 
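 86 | ironmen_df[1, 2] # row 1, column 2: ironmen count of the Modern Web group
 87 | ironmen_df[1, ] # row 1: name and ironmen count of the Modern Web group
 88 | ironmen_df[, 2] # column 2: ironmen count of every group
 89 | ironmen_df[, "ironmen"] # ironmen count of every group
 90 | ironmen_df$ironmen # ironmen count of every group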
 91 | ```
 92 | 
 93 | ![day0606](https://storage.googleapis.com/2017_ithome_ironman/day0606.png)
 94 | 
 95 | ### Filtering with logical values
 96 | 
 97 | In R, logical values can filter the observations of a data frame.
 98 | 
 99 | ```
100 | groups <- c("Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組")
101 | ironmen <- c(46, 8, 12, 12, 6, 58)
102 | ironmen_df <- data.frame(groups, ironmen)
103 | 
104 | ironmen_df[ironmen_df$ironmen > 10, ] # keep the rows with more than 10 ironmen
105 | ```
106 | 
107 | ![day0607](https://storage.googleapis.com/2017_ithome_ironman/day0607.png)
108 | 
109 | ### Functions that give an overview of a data frame
110 | 
111 | In R, several functions give an overview of a data frame.
112 | 
113 | ```
114 | dim(ironmen_df) # number of rows and columns
115 | str(ironmen_df) # structure
116 | summary(ironmen_df) # descriptive statistics
117 | head(ironmen_df, n = 3) # first three observations
118 | tail(ironmen_df, n = 3) # last three observations
119 | names(ironmen_df) # column names
120 | ```
121 | 
122 | ![day0608](https://storage.googleapis.com/2017_ithome_ironman/day0608.png)
123 | 
124 | ## The pandas data frame
125 | 
126 | ### Can be converted from a NumPy 2d array
127 | 
128 | If all variables share the same type, a data frame can also be created from a NumPy 2d array.
129 | 
130 | ```python
131 | import numpy as np
132 | import pandas as pd
133 | 
134 | my_2d_array = np.array([[1, 3],
135 |                         [2, 4]
136 |                        ])
137 | 
138 | my_df = pd.DataFrame(my_2d_array, columns = ["col1", "col2"])
139 | print(my_df)
140 | ```
141 | 
142 | ![day0609](https://storage.googleapis.com/2017_ithome_ironman/day0609.png)
143 | 
144 | ### Holds multiple data types
145 | 
146 | Like a list, and unlike an ndarray, a data frame is not restricted to a single data type; each column can have its own.
147 | 
148 | ```python
149 | import pandas as pd
150 | 
151 | groups = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"]
152 | ironmen = [46, 8, 12, 12, 6, 58]
153 | 
154 | ironmen_dict = {"groups": groups,
155 |                 "ironmen": ironmen
156 |                 }
157 | 
158 | ironmen_df = pd.DataFrame(ironmen_dict)
159 | print(ironmen_df.dtypes) # data type of each column
160 | ```
161 | 
162 | ![day0610](https://storage.googleapis.com/2017_ithome_ironman/day0610.png)
163 | 
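164 | ### Selecting elements
165 | 
166 | In pandas, square brackets `[]` and `.iloc` flexibly select elements from a data frame. Note that Python slices exclude the end position: `0:1` does not include `1` and `0:2` does not include `2`. This is a major difference from R, where `1:2` includes both ends.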
167 | 
168 | ```python
169 | import pandas as pd
170 | 
171 | groups = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"]
172 | ironmen = [46, 8, 12, 12, 6, 58]
173 | 
174 | ironmen_dict = {"groups": groups,
175 |                 "ironmen": ironmen
176 |                 }
177 | 
178 | ironmen_df = pd.DataFrame(ironmen_dict)
179 | 
180 | print(ironmen_df.iloc[0:1, 1]) # row 1, column 2: ironmen count of the Modern Web group (a one-element Series)
181 | print("---")
182 | print(ironmen_df.iloc[0:1,:]) # row 1: name and ironmen count of the Modern Web group
183 | print("---")
184 | print(ironmen_df.iloc[:,1]) # column 2: ironmen count of every group
185 | print("---")
186 | print(ironmen_df["ironmen"]) # ironmen count of every group
187 | print("---")
188 | print(ironmen_df.ironmen) # ironmen count of every group
189 | ```
190 | 
191 | ![day0611](https://storage.googleapis.com/2017_ithome_ironman/day0611.png)
192 | 
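193 | The last two lines use shorthand selection syntax, but for production code the pandas documentation recommends the dedicated accessors such as `.iloc` and `.loc`. (The `.ix` accessor named in the quote below was later deprecated and removed from pandas.)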
194 | 
195 | > While standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, `.at`, `.iat`, `.loc`, `.iloc` and `.ix`. 
196 | > Quoted from: [10 Minutes to pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html#selection)
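The difference between position-based `.iloc` and label-based `.loc` is easiest to see side by side. This is a minimal sketch, assuming pandas is installed and that the data frame keeps its default integer index; the variable names are illustrative.

```python
import pandas as pd

ironmen_df = pd.DataFrame({
    "groups": ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"],
    "ironmen": [46, 8, 12, 12, 6, 58],
})

first_count = ironmen_df.iloc[0, 1]        # a single value: row position 0, column position 1
by_label = ironmen_df.loc[0, "ironmen"]    # the same value, by row label and column name

rows_iloc = ironmen_df.iloc[0:2, :]        # positions 0 and 1; the end position is excluded
rows_loc = ironmen_df.loc[0:2, :]          # labels 0, 1 and 2; the end label is included

print(first_count, by_label)
print(len(rows_iloc), len(rows_loc))
```

Using a single integer instead of the slice `0:1` returns the value itself rather than a one-element Series. The inclusive end of `.loc` slices is closer to what R users expect from `1:2`.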
197 | 
198 | ### Filtering with boolean values
199 | 
200 | In pandas, boolean values can filter the observations of a data frame.
201 | 
202 | ```python
203 | import pandas as pd
204 | 
205 | groups = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"]
206 | ironmen = [46, 8, 12, 12, 6, 58]
207 | 
208 | ironmen_dict = {"groups": groups,
209 |                 "ironmen": ironmen
210 |                 }
211 | 
212 | ironmen_df = pd.DataFrame(ironmen_dict)
213 | 
214 | print(ironmen_df[ironmen_df.loc[:,"ironmen"] > 10]) # keep the rows with more than 10 ironmen
215 | ```
216 | 
217 | ![day0612](https://storage.googleapis.com/2017_ithome_ironman/day0612.png)
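Several conditions can be combined in one filter. The sketch below assumes pandas is installed; `mask` and `filtered` are illustrative names.

```python
import pandas as pd

ironmen_df = pd.DataFrame({
    "groups": ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"],
    "ironmen": [46, 8, 12, 12, 6, 58],
})

# Parentheses around each comparison are required when combining with & or |
mask = (ironmen_df["ironmen"] > 10) & (ironmen_df["ironmen"] < 50)
filtered = ironmen_df.loc[mask, :]   # rows where the mask is True, all columns

print(filtered)
```

This mirrors `ironmen_df[ironmen_df$ironmen > 10 & ironmen_df$ironmen < 50, ]` in R.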
218 | 
219 | ### Getting an overview of a data frame
220 | 
221 | A pandas data frame has several methods and attributes that give an overview.
222 | 
223 | ```python
224 | import pandas as pd
225 | 
226 | groups = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"]
227 | ironmen = [46, 8, 12, 12, 6, 58]
228 | 
229 | ironmen_dict = {"groups": groups,
230 |                 "ironmen": ironmen
231 |                 }
232 | 
233 | ironmen_df = pd.DataFrame(ironmen_dict)
234 | 
235 | print(ironmen_df.shape) # number of rows and columns
236 | print("---")
237 | print(ironmen_df.describe()) # descriptive statistics
238 | print("---")
239 | print(ironmen_df.head(3)) # first three observations
240 | print("---")
241 | print(ironmen_df.tail(3)) # last three observations
242 | print("---")
243 | print(ironmen_df.columns) # column names
244 | print("---")
245 | print(ironmen_df.index) # the index
246 | ```
247 | 
248 | ![day0613](https://storage.googleapis.com/2017_ithome_ironman/day0613.png)
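The list above has no counterpart for R's `str()`. A reasonable candidate is the `info()` method, and `describe(include="all")` extends the summary to text columns. This is only a sketch, assuming pandas is installed; writing the output of `info()` to a buffer is optional and done here so that it can be stored in a variable.

```python
import io

import pandas as pd

ironmen_df = pd.DataFrame({
    "groups": ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"],
    "ironmen": [46, 8, 12, 12, 6, 58],
})

buffer = io.StringIO()
ironmen_df.info(buf=buffer)           # column names, non-null counts and data types
structure_text = buffer.getvalue()
print(structure_text)

full_summary = ironmen_df.describe(include="all")   # also summarises the text column
print(full_summary)
```

Calling `ironmen_df.info()` without arguments simply prints the same text to the console.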
249 | 
250 | ## R's factor and pandas' category
251 | 
252 | ### Nominal
253 | 
254 | So far we have compared four basic R data structures with Python. The last one is R's factor, whose counterpart in pandas is the category.
255 | 
256 | ```
257 | groups <- c("Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組")
258 | groups_factor <- factor(groups) # convert to a factor
259 | groups_factor
260 | ```
261 | 
262 | ![day0614](https://storage.googleapis.com/2017_ithome_ironman/day0614.png)
263 | 
264 | We use `pd.Categorical()` from the `pandas` package to convert a list into a pandas categorical.
265 | 
266 | ```python
267 | import pandas as pd
268 | 
269 | groups_categorical = pd.Categorical(["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"])
270 | print(groups_categorical)
271 | print("---")
272 | print(type(groups_categorical))
273 | ```
274 | 
275 | ![day0615](https://storage.googleapis.com/2017_ithome_ironman/day0615.png)
276 | 
277 | ### Ordinal
278 | 
279 | R adds the ordinal property by setting `ordered = TRUE` and specifying the `levels = ` argument.
280 | 
281 | ```
282 | temperature <- c("cold", "warm", "hot")
283 | temperature_factor <- factor(temperature, ordered = TRUE, levels = c("cold", "warm", "hot"))
284 | temperature_factor
285 | ```
286 | 
287 | ![day0616](https://storage.googleapis.com/2017_ithome_ironman/day0616.png)
288 | 
289 | Pandas adds the ordinal property by setting `ordered = True` and specifying the `categories = ` argument.
290 | 
291 | ```python
292 | import pandas as pd
293 | 
294 | temperature_list = ["cold", "warm", "hot"]
295 | temperature_categorical = pd.Categorical(temperature_list, categories = ["cold", "warm", "hot"], ordered = True)
296 | temperature = pd.Series(temperature_categorical)
297 | print(temperature)
298 | ```
299 | 
300 | ![day0617](https://storage.googleapis.com/2017_ithome_ironman/day0617.png)
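What the ordinal property buys us is that comparisons and sorting follow the declared category order rather than alphabetical order. The sketch below assumes pandas is installed; the sample values and variable names are made up for illustration.

```python
import pandas as pd

temperature = pd.Series(pd.Categorical(
    ["warm", "cold", "hot", "cold"],
    categories=["cold", "warm", "hot"],
    ordered=True,
))

warmer_than_cold = temperature > "cold"          # compares by category order
sorted_temperature = temperature.sort_values()   # cold < warm < hot
hottest = temperature.max()
codes = temperature.cat.codes                    # integer position of each value's category

print(warmer_than_cold.tolist())
print(sorted_temperature.tolist())
print(hottest)
print(codes.tolist())
```

Alphabetical sorting would put "hot" before "warm"; the ordered categorical keeps "hot" last.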
301 | 
302 | ## Summary
303 | 
304 | On day six we started using Python's `pandas` package. It brings the data frame and category data structures to Python, and we compared them with the data frame and factor in R.
305 | 
306 | ## References
307 | 
308 | - [10 Minutes to pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html)
309 | - [Categorical Data](https://pandas-docs.github.io/pandas-docs-travis/categorical.html)
310 | - [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do)


--------------------------------------------------------------------------------
/day07.md:
--------------------------------------------------------------------------------
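  1 | # [Day 07] Loops and flow control
  2 | 
  3 | ---
  4 | 
  5 | As of 11 a.m. on 2016-12-07, the groups of the 8th iT 邦幫忙 Ironman contest had 49, 8, 12, 12, 6 and 61 ironmen. We want to print these six counts to the console one by one.
  6 | 
  7 | Without writing a loop, we can still store the six numbers in an R vector or a Python list and print them one at a time by index. If these two data structures are unclear, see [[Day 04] Data structures: List, Tuple and Dictionary](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day04.md).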
  8 | 
  9 | ```
 10 | ironmen <- c(49, 8, 12, 12, 6, 61)
 11 | ironmen[1]
 12 | ironmen[2]
 13 | ironmen[3]
 14 | ironmen[4]
 15 | ironmen[5]
 16 | ironmen[6]
 17 | ```
 18 | 
 19 | ![day0701](https://storage.googleapis.com/2017_ithome_ironman/day0701.png)
 20 | 
 21 | ```python
 22 | ironmen = [49, 8, 12, 12, 6, 61]
 23 | print(ironmen[0])
 24 | print(ironmen[1])
 25 | print(ironmen[2])
 26 | print(ironmen[3])
 27 | print(ironmen[4])
 28 | print(ironmen[5])
 29 | ```
 30 | 
 31 | ![day0702](https://storage.googleapis.com/2017_ithome_ironman/day0702.png)
 32 | 
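 33 | Fortunately the contest has only six groups, so the manual approach still works. But what about 66, 666 or even 6,666 groups? Whenever we catch ourselves copying and pasting code, it is time to look away from the screen and think about how a loop could replace it.
 34 | 
 35 | ## Loops in R
 36 | 
 37 | We use `for` and `while` loops to do the job. With a loop, the same few lines work whether the `ironmen` vector records 6, 66, 666 or even 6,666 groups.
 38 | 
 39 | ### for loop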
 40 | 
 41 | ```
 42 | # without an index
 43 | ironmen <- c(49, 8, 12, 12, 6, 61)
 44 | for (ironman in ironmen) {
 45 |     print(ironman)
 46 | }
 47 | ironman # print the final value of the loop variable
 48 | 
 49 | # with an index
 50 | for (index in 1:length(ironmen)) {
 51 |     print(ironmen[index])
 52 | }
 53 | index # print the final value of the loop variable
 54 | ```
 55 | 
 56 | ![day0703](https://storage.googleapis.com/2017_ithome_ironman/day0703.png)
 57 | 
 58 | ### while loop
 59 | 
 60 | ```
 61 | ironmen <- c(49, 8, 12, 12, 6, 61)
 62 | index <- 1
 63 | while (index <= length(ironmen)) {
 64 |     print(ironmen[index])
 65 |     index <- index + 1
 66 | }
 67 | index # print the final value of the loop variable
 68 | ```
 69 | 
 70 | ![day0704](https://storage.googleapis.com/2017_ithome_ironman/day0704.png)
 71 | 
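 72 | Pause here and compare the final value of the loop variable across the three versions.
 73 | 
 74 | ## Loops in Python
 75 | 
 76 | Python loops are structured like R loops. The difference is that R wraps the body of each iteration in braces `{}`, whereas Python marks it with **indentation**.
 77 | 
 78 | ### for loop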
 79 | 
 80 | ```python
 81 | # without an index
 82 | ironmen = [49, 8, 12, 12, 6, 61]
 83 | for ironman in ironmen:
 84 |     print(ironman)
 85 | print("---")
 86 | print(ironman) # print the final value of the loop variable
 87 | 
 88 | print("\n") # blank line for readability
 89 | 
 90 | # with an index
 91 | for index in range(len(ironmen)): # range yields the indices 0 to 5
 92 |     print(ironmen[index])
 93 | print("---")
 94 | print(index) # print the final value of the loop variable
 95 | ```
 96 | 
 97 | ![day0705](https://storage.googleapis.com/2017_ithome_ironman/day0705.png)
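When both the index and the value are needed, Python's built-in `enumerate()` is a common alternative to looping over `range(len(...))`. A minimal sketch using the same list; the names `zero_based` and `one_based` are illustrative.

```python
ironmen = [49, 8, 12, 12, 6, 61]

zero_based = []
for index, ironman in enumerate(ironmen):        # index starts at 0
    zero_based.append((index, ironman))

one_based = list(enumerate(ironmen, start=1))    # index starts at 1, as in R

print(zero_based)
print(one_based)
```

The `start=1` argument only changes the counter that is reported; the list itself is still indexed from 0.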
 98 | 
 99 | ### while loop
100 | 
101 | ```python
102 | ironmen = [49, 8, 12, 12, 6, 61]
103 | index = 0
104 | while index < len(ironmen):
105 |     print(ironmen[index])
106 |     index += 1
107 | print("---")
108 | print(index) # print the final value of the loop variable
109 | ```
110 | 
111 | ![day0706](https://storage.googleapis.com/2017_ithome_ironman/day0706.png)
112 | 
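113 | When writing these loops, note the adjustment for the different starting index (R data structures are indexed from 1, Python from 0). If the `index += 1` notation in the while loop is unfamiliar, see [[Day 03] Converting variable types](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day03.md).
114 | 
115 | ## Flow control in R
116 | 
117 | R creates branches with the `if-else if-else` structure. With only two branches `if-else` is enough, and for multiple branches there is also the `switch()` function.
118 | 
119 | ### if-else
120 | 
121 | An integer divided by 2 has only two possible remainders, so an `if-else` structure prints a message according to the remainder.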
122 | 
123 | ```
124 | my_seq <- 1:10
125 | for (i in my_seq) {
126 |     if (i %% 2 == 0) {
127 |         print(paste(i, "是偶數"))
128 |     } else {
129 |         print(paste(i, "是奇數"))
130 |     }
131 | }
132 | ```
133 | 
134 | ![day0707](https://storage.googleapis.com/2017_ithome_ironman/day0707.png)
135 | 
136 | ### if-else if-else
137 | 
138 | An integer divided by 3 has three possible remainders, so an `if-else if-else` structure prints a message according to the remainder.
139 | 
140 | ```
141 | my_seq <- 1:10
142 | for (i in my_seq) {
143 |     if (i %% 3 == 0) {
144 |         print(paste(i, "可以被 3 整除"))
145 |     } else if (i %% 3 == 1) {
146 |         print(paste(i, "除以 3 餘數是 1"))
147 |     } else {
148 |         print(paste(i, "除以 3 餘數是 2"))
149 |     }
150 | }
151 | ```
152 | 
153 | ![day0708](https://storage.googleapis.com/2017_ithome_ironman/day0708.png)
154 | 
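155 | ### The switch() function
156 | 
157 | The `if-else if-else` structure can be rewritten with the `switch()` function. Note that the value must first be converted to a character string for the comparison.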
158 | 
159 | ```
160 | my_seq <- 1:10
161 | for (i in my_seq) {
162 |     ans <- i %% 3
163 |     switch(as.character(ans),
164 |         "0" = print(paste(i, "可以被 3 整除")),
165 |         "1" = print(paste(i, "除以 3 餘數是 1")),
166 |         "2" = print(paste(i, "除以 3 餘數是 2"))
167 |     )
168 | }
169 | ```
170 | 
171 | ![day0709](https://storage.googleapis.com/2017_ithome_ironman/day0709.png)
172 | 
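173 | ## Flow control in Python
174 | 
175 | Python creates branches with the `if-elif-else` structure. With only two branches `if-else` is enough. At the time of writing Python had no multi-branch `switch` or `case` statement; Python 3.10 later added the `match` statement.
176 | 
177 | ### if-else
178 | 
179 | An integer divided by 2 has only two possible remainders, so an `if-else` structure prints a message according to the remainder.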
180 | 
181 | ```python
182 | my_seq = list(range(1, 11))
183 | for index in my_seq:
184 |     if (index % 2 == 0):
185 |         print(index, "是偶數")
186 |     else:
187 |         print(index, "是奇數")
188 | ```
189 | 
190 | ![day0710](https://storage.googleapis.com/2017_ithome_ironman/day0710.png)
191 | 
192 | ### if-elif-else
193 | 
194 | An integer divided by 3 has three possible remainders, so an `if-elif-else` structure prints a message according to the remainder.
195 | 
196 | ```python
197 | my_seq = list(range(1,11))
198 | for index in my_seq:
199 |     if (index % 3 == 0):
200 |         print(index, "可以被 3 整除")
201 |     elif (index % 3 ==1):
202 |         print(index, "除以 3 餘數是 1")
203 |     else:
204 |         print(index, "除以 3 餘數是 2")
205 | ```
206 | 
207 | ![day0711](https://storage.googleapis.com/2017_ithome_ironman/day0711.png)
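A dictionary lookup is one common way to approximate R's `switch()` in Python. The sketch below is only an illustration of that idiom: the `describe()` helper and the English messages are hypothetical and not part of the original article.

```python
# Map each possible remainder to its message, similar to the branches of switch()
remainder_messages = {
    0: "is divisible by 3",
    1: "leaves remainder 1 when divided by 3",
    2: "leaves remainder 2 when divided by 3",
}

def describe(number):
    """Return a message for number based on number % 3 (hypothetical helper)."""
    return f"{number} {remainder_messages[number % 3]}"

for index in range(1, 11):
    print(describe(index))
```

Unlike R's `switch()`, the dictionary keys can stay integers, so no conversion to text is needed.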
208 | 
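209 | ## Making loops more flexible
210 | 
211 | We have started adding flow control inside loops; a few extra statements make loops more flexible.
212 | 
213 | ### R's break and next statements
214 | 
215 | `break` tells the for loop to stop when the loop variable (here `ironman`) is less than 10; `next` tells the for loop to skip the current iteration and continue when the loop variable is less than 10.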
216 | 
217 | ```
218 | # break statement
219 | ironmen <- c(49, 8, 12, 12, 6, 61)
220 | for (ironman in ironmen) {
221 |     if (ironman < 10) {
222 |         break
223 |     } else {
224 |         print(ironman)
225 |     }
226 | }
227 | ironman # print the final value of the loop variable
228 | 
229 | # next statement
230 | for (ironman in ironmen) {
231 |     if (ironman < 10) {
232 |         next
233 |     } else {
234 |         print(ironman)
235 |     }
236 | }
237 | ironman # print the final value of the loop variable
238 | ```
239 | 
240 | ![day0712](https://storage.googleapis.com/2017_ithome_ironman/day0712.png)
241 | 
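242 | ### Python's break and continue
243 | 
244 | `break` tells the for loop to stop when the loop variable (here `ironman`) is less than 10; `continue` tells the for loop to skip the current iteration and continue when the loop variable is less than 10.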
245 | 
246 | ```python
247 | # break statement
248 | ironmen = [49, 8, 12, 12, 6, 61]
249 | for ironman in ironmen:
250 |     if (ironman < 10):
251 |         break
252 |     else:
253 |         print(ironman)
254 | 
255 | print("---")
256 | print(ironman) # print the final value of the loop variable
257 | 
258 | print("\n") # blank line for readability
259 | 
260 | # continue statement
261 | for ironman in ironmen:
262 |     if (ironman < 10):
263 |         continue
264 |     else:
265 |         print(ironman)
266 | 
267 | print("---")
268 | print(ironman) # print the final value of the loop variable
269 | ```
270 | 
271 | ![day0713](https://storage.googleapis.com/2017_ithome_ironman/day0713.png)
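Collecting the values instead of printing them makes the difference between the two statements easy to compare. A small sketch with the same list; the variable names are illustrative.

```python
ironmen = [49, 8, 12, 12, 6, 61]

kept_with_break = []
for ironman in ironmen:
    if ironman < 10:
        break                      # leave the loop at the first value below 10
    kept_with_break.append(ironman)

kept_with_continue = []
for ironman in ironmen:
    if ironman < 10:
        continue                   # skip this value and move on to the next one
    kept_with_continue.append(ironman)

# A list comprehension expresses the same filter as the continue version
kept_with_comprehension = [ironman for ironman in ironmen if ironman >= 10]

print(kept_with_break)
print(kept_with_continue)
print(kept_with_comprehension)
```

The `break` version stops at the second element, while the `continue` version visits every element and only drops the small ones.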
272 | 
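273 | ## Summary
274 | 
275 | On day seven we covered loops and flow control in Python and compared them with R. The syntax is very similar; we need to remember that Python indexes from 0 rather than 1, uses indentation rather than braces `{}`, uses `if-elif-else` rather than `if-else if-else`, and uses `continue` rather than `next`.
276 | 
277 | ## References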
278 | 
279 | - [Introducing Python](http://shop.oreilly.com/product/0636920028659.do)


--------------------------------------------------------------------------------
/day08.md:
--------------------------------------------------------------------------------
  1 | # [Day 08] Functions
  2 | 
  3 | ---
  4 | 
  5 | As early as [[Day 02] Basic variable types](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day02.md) we were already using Python's built-in functions, such as `help()` to look up documentation and `type()` to check whether a variable is a str, int or bool. R users love functions, because R is at heart a functional programming language.
  6 | 
  7 | > (Almost) everything is a function call.
  8 | > By [John Chambers](https://en.wikipedia.org/wiki/John_Chambers_(statistician))
  9 | 
 10 | ## Using built-in functions
 11 | 
 12 | As of 8 a.m. on 2016-12-08, the groups of the 8th iT 邦幫忙 Ironman contest had 50, 8, 16, 12, 6 and 62 ironmen. We store this information in an R vector or a Python list and apply some built-in functions to it. If these two data structures are unclear, see [[Day 04] Data structures: List, Tuple and Dictionary](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day04.md).
 13 | 
 14 | ```
 15 | ironmen <- c(50, 8, 16, 12, 6, 62)
 16 | 
 17 | # apply functions
 18 | max(ironmen) # largest number of ironmen
 19 | min(ironmen) # smallest number of ironmen
 20 | length(ironmen) # number of groups
 21 | sort(ironmen) # ascending order
 22 | sort(ironmen, decreasing = TRUE) # descending order
 23 | ```
 24 | 
 25 | ![day0801](https://storage.googleapis.com/2017_ithome_ironman/day0801.png)
 26 | 
 27 | ```python
 28 | ironmen = [50, 8, 16, 12, 6, 62]
 29 | 
 30 | # apply functions
 31 | print(max(ironmen)) # largest number of ironmen
 32 | print(min(ironmen)) # smallest number of ironmen
 33 | print(len(ironmen)) # number of groups
 34 | print(sorted(ironmen)) # ascending order
 35 | print(sorted(ironmen, reverse = True)) # descending order
 36 | ```
 37 | 
 38 | ![day0802](https://storage.googleapis.com/2017_ithome_ironman/day0802.png)
 39 | 
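 40 | ## Looking up function documentation
 41 | 
 42 | In both R and Python we should make a habit of reading function documentation to learn what input a function takes and which parameters it accepts. In the example above, the documentation shows which parameter of the sorting function switches between ascending and descending order.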
 43 | 
 44 | ```
 45 | ?sort
 46 | help(sort)
 47 | ```
 48 | 
 49 | ![day0803](https://storage.googleapis.com/2017_ithome_ironman/day0803.png)
 50 | 
 51 | ```python
 52 | sorted? # IPython / Jupyter syntax only; not valid in plain Python
 53 | help(sorted)
 54 | ```
 55 | 
 56 | ![day0804](https://storage.googleapis.com/2017_ithome_ironman/day0804.png)
 57 | 
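 58 | To keep programs organized, manageable and reusable, we need to define our own functions in addition to using built-in ones. Custom functions use **loops** and **flow control**; if these two concepts are unclear, see [[Day 07] Loops and flow control](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day07.md).
 59 | 
 60 | ## Custom functions in R
 61 | 
 62 | The skeleton of a custom function in R: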
 63 | 
 64 | ```
 65 | function_name <- function(input, parameter_1, parameter_2, ...) {
 66 |     # what the function does
 67 |     return(result)
 68 | }
 69 | ```
 70 | 
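 71 | Two exercises help us get familiar with custom functions.
 72 | 
 73 | ### Area or circumference of a circle
 74 | 
 75 | The first exercise takes the radius of a circle and returns the area or the circumference, depending on an argument.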
 76 | 
 77 | ```
 78 | # define the function
 79 | circle_calculate <- function(radius, area = TRUE) {
 80 |     circle_area <- pi * radius^2
 81 |     circle_circum <- 2 * pi * radius
 82 |     if (area == TRUE) {
 83 |         return(circle_area)
 84 |     } else {
 85 |         return(circle_circum)
 86 |     }
 87 | }
 88 | 
 89 | # call the function
 90 | my_radius <- 3
 91 | circle_calculate(my_radius) # returns the area by default
 92 | circle_calculate(my_radius, area = FALSE) # returns the circumference when area is FALSE
 93 | ```
 94 | 
 95 | ![day0805](https://storage.googleapis.com/2017_ithome_ironman/day0805.png)
 96 | 
 97 | ### Exchange sort
 98 | 
 99 | The second exercise is a programming fundamental: exchange sort.
100 | 
101 | ```
102 | # define the function
103 | exchange_sort <- function(input_vector, decreasing = FALSE) {
104 |     input_vector_cloned <- input_vector # copy the input vector
105 |     # ascending order
106 |     if (decreasing == FALSE) {
107 |         for (i in 1:(length(input_vector) - 1)) {
108 |             for (j in (i + 1):length(input_vector)) {
109 |                # swap if the earlier number is larger than the later one
110 |                if (input_vector_cloned[i] > input_vector_cloned[j]) {
111 |                    temp <- input_vector_cloned[i]
112 |                    input_vector_cloned[i] <- input_vector_cloned[j]
113 |                    input_vector_cloned[j] <- temp
114 |                }
115 |             }
116 |         }
117 |     # descending order
118 |     } else {
119 |         for (i in 1:(length(input_vector) - 1)) {
120 |             for (j in (i + 1):length(input_vector)) {
121 |                # swap if the earlier number is smaller than the later one
122 |                if (input_vector_cloned[i] < input_vector_cloned[j]) {
123 |                    temp <- input_vector_cloned[i]
124 |                    input_vector_cloned[i] <- input_vector_cloned[j]
125 |                    input_vector_cloned[j] <- temp
126 |                }
127 |             }
128 |         } 
129 |     }
130 |     return(input_vector_cloned)
131 | }
132 | 
133 | # call the function
134 | my_vector <- round(runif(10) * 100) # generate a set of random numbers
135 | my_vector # before sorting
136 | exchange_sort(my_vector) # ascending by default
137 | exchange_sort(my_vector, decreasing = TRUE) # descending when decreasing is TRUE
138 | ```
139 | 
140 | ![day0806](https://storage.googleapis.com/2017_ithome_ironman/day0806.png)
141 | 
142 | ## Custom functions in Python
143 | 
144 | The skeleton of a custom function in Python:
145 | 
146 | ```python
147 | def function_name(input, parameter_1, parameter_2, ...):
148 |     '''
149 |     Docstrings
150 |     '''
151 |     # what the function does
152 |     return result
153 | ```
154 | 
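155 | Python users customarily add docstrings to document a custom function. Again, two exercises help us get familiar with custom functions.
156 | 
157 | ### Area or circumference of a circle
158 | 
159 | The first exercise takes the radius of a circle and returns the area or the circumference, depending on an argument.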
160 | 
161 | ```python
162 | import math # pi lives in the math module
163 | 
164 | # define the function
165 | def circle_calculate(radius, area = True):
166 |     'Return the area or the circumference of a circle, depending on the area argument' # a one-line docstring
167 |     circle_area = math.pi * radius**2
168 |     circle_circum = 2 * math.pi * radius
169 |     if area:
170 |         return circle_area
171 |     else:
172 |         return circle_circum
173 | 
174 | # call the function
175 | my_radius = 3
176 | print(circle_calculate(my_radius)) # returns the area by default
177 | print(circle_calculate(my_radius, area = False)) # returns the circumference when area is False
178 | ```
179 | 
180 | ![day0807](https://storage.googleapis.com/2017_ithome_ironman/day0807.png)
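A docstring is more than a comment: Python stores it on the function object, which is where `help()` finds it. The sketch below uses a simplified, hypothetical `circle_area()` function rather than the one from the article.

```python
import inspect
import math

def circle_area(radius):
    """Return the area of a circle with the given radius."""
    return math.pi * radius ** 2

doc_text = inspect.getdoc(circle_area)   # the text that help(circle_area) displays
print(doc_text)
print(circle_area.__doc__ == doc_text)
```

This is roughly comparable to the way `?sort` shows the help page of a packaged R function, except that in Python the documentation lives in the source code of the function itself.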
181 | 
182 | ### Exchange sort
183 | 
184 | The second exercise is a programming fundamental: exchange sort.
185 | 
186 | ```python
187 | import random # used to generate random integers when calling the function
188 | 
189 | # define the function
190 | def exchange_sort(input_list, reverse = False): # followed by a multi-line docstring
191 |     '''
192 |     Sort the numbers in the input list according to the reverse argument and return them.
193 |     reverse defaults to False (ascending); set it to True for descending order.
194 |     '''
195 |     input_list_cloned = list(input_list) # copy the list so the caller's list is not modified
196 |     # ascending order
197 |     if not reverse:
198 |         for i in range(0, len(input_list) - 1):
199 |             for j in range(i+1, len(input_list)):
200 |                 # swap if the earlier number is larger than the later one
201 |                 if input_list_cloned[i] > input_list_cloned[j]:
202 |                     temp = input_list_cloned[i]
203 |                     input_list_cloned[i] = input_list_cloned[j]
204 |                     input_list_cloned[j] = temp
205 |     # descending order
206 |     else:
207 |         for i in range(0, len(input_list) - 1):
208 |             for j in range(i+1, len(input_list)):
209 |                 # swap if the earlier number is smaller than the later one
210 |                 if input_list_cloned[i] < input_list_cloned[j]:
211 |                     temp = input_list_cloned[i]
212 |                     input_list_cloned[i] = input_list_cloned[j]
213 |                     input_list_cloned[j] = temp
214 |     return input_list_cloned
215 | 
216 | # call the function
217 | my_vector = random.sample(range(0, 100), 10) # generate a set of random numbers
218 | print(my_vector) # before sorting
219 | print(exchange_sort(my_vector)) # ascending by default
220 | print(exchange_sort(my_vector, reverse = True)) # descending when reverse is True
221 | ```
222 | 
223 | ![day0808](https://storage.googleapis.com/2017_ithome_ironman/day0808.png)
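One point where Python differs from R: assigning a list to a new name does not copy it, whereas an R vector behaves as if it were copied. A function that is meant to sort a clone of its input therefore has to create the clone explicitly. A minimal sketch with illustrative variable names:

```python
original = [3, 1, 2]

alias = original                # a second name for the same list object
shallow_copy = list(original)   # a new list holding the same elements

alias.sort()                    # sorting through the alias changes `original` as well

print(original)
print(shallow_copy)
print(alias is original)
```

`list(original)`, `original[:]` and `original.copy()` all produce such a copy for a flat list of numbers.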
224 | 
225 | ## Returning multiple values from a custom function
226 | 
227 | ### Returning multiple values in R
228 | 
229 | Wrap the return values in a list and access them by name.
230 | 
231 | ```
232 | # define the function
233 | ironmen_stats <- function(ironmen_vector) {
234 |     max_ironmen <- max(ironmen_vector)
235 |     min_ironmen <- min(ironmen_vector)
236 |     ttl_groups <- length(ironmen_vector)
237 |     ttl_ironmen <- sum(ironmen_vector)
238 |     
239 |     stats_list <- list(max_ironmen = max_ironmen,
240 |                        min_ironmen = min_ironmen,
241 |                        ttl_groups = ttl_groups,
242 |                        ttl_ironmen = ttl_ironmen
243 |                        )
244 |     
245 |     return(stats_list)
246 | }
247 | 
248 | # call the function
249 | ironmen <- c(50, 8, 16, 12, 6, 62)
250 | paste("最多：", ironmen_stats(ironmen)$max_ironmen, sep = '')
251 | paste("最少：", ironmen_stats(ironmen)$min_ironmen, sep = '')
252 | paste("總組數：", ironmen_stats(ironmen)$ttl_groups, sep = '')
253 | paste("總鐵人數：", ironmen_stats(ironmen)$ttl_ironmen, sep = '')
254 | ```
255 | 
256 | ![day0809](https://storage.googleapis.com/2017_ithome_ironman/day0809.png)
257 | 
258 | ### Returning multiple values in Python
259 | 
260 | Separating several values with commas `,` after `return` returns a tuple.
261 | 
262 | ```python
263 | # define the function
264 | def ironmen_stats(ironmen_list):
265 |     max_ironmen = max(ironmen_list)
266 |     min_ironmen = min(ironmen_list)
267 |     ttl_groups = len(ironmen_list)
268 |     ttl_ironmen = sum(ironmen_list)
269 |     return max_ironmen, min_ironmen, ttl_groups, ttl_ironmen
270 |     
271 | # call the function
272 | ironmen = [50, 8, 16, 12, 6, 62]
273 | max_ironmen, min_ironmen, ttl_groups, ttl_ironmen = ironmen_stats(ironmen)
274 | print("最多：", max_ironmen, "\n", "最少：", min_ironmen, "\n", "總組數：", ttl_groups, "\n", "總鐵人數：", ttl_ironmen)
275 | ```
276 | 
277 | ![day0810](https://storage.googleapis.com/2017_ithome_ironman/day0810.png)
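A plain tuple is accessed by position. If access by name is preferred, as with `$max_ironmen` on the R list, the standard library's `collections.namedtuple` is one option. The `IronmenStats` type and the `ironmen_stats_named()` function below are hypothetical variants written for this sketch, not the article's own code.

```python
from collections import namedtuple

# Hypothetical record type with one field per statistic
IronmenStats = namedtuple(
    "IronmenStats",
    ["max_ironmen", "min_ironmen", "ttl_groups", "ttl_ironmen"],
)

def ironmen_stats_named(ironmen_list):
    """Return the same four statistics as a namedtuple (hypothetical variant)."""
    return IronmenStats(
        max(ironmen_list),
        min(ironmen_list),
        len(ironmen_list),
        sum(ironmen_list),
    )

stats = ironmen_stats_named([50, 8, 16, 12, 6, 62])
print(stats.max_ironmen, stats.ttl_ironmen)   # access by name
largest, smallest, groups, total = stats      # unpacking still works
```

A namedtuple is still a tuple, so code that unpacks the four values keeps working unchanged.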
278 | 
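279 | ## Anonymous functions
280 | 
281 | We often do not bother to name a simple function; both R and Python conveniently support anonymous functions.
282 | 
283 | ### Anonymous functions in R
284 | 
285 | The anonymous function `function(x) x * 30` multiplies each group's ironmen count by 30, giving the expected total number of articles per group.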
286 | 
287 | ```
288 | ironmen <- c(50, 8, 16, 12, 6, 62)
289 | ironmen_articles <- sapply(ironmen, function(x) x * 30)
290 | ironmen_articles
291 | ```
292 | 
293 | ![day0811](https://storage.googleapis.com/2017_ithome_ironman/day0811.png)
294 | 
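295 | ### Python's lambda functions
296 | 
297 | Anonymous functions in Python are called lambda functions. `lambda x : x * 30` multiplies each group's ironmen count by 30, giving the expected total number of articles per group.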
298 | 
299 | ```python
300 | ironmen = [50, 8, 16, 12, 6, 62]
301 | ironmen_articles = list(map(lambda x : x * 30, ironmen))
302 | print(ironmen_articles)
303 | ```
304 | 
305 | ![day0812](https://storage.googleapis.com/2017_ithome_ironman/day0812.png)
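In everyday Python the same result is often written as a list comprehension, which avoids both `map()` and the lambda. A small sketch comparing the two; the variable names are illustrative.

```python
ironmen = [50, 8, 16, 12, 6, 62]

articles_map = list(map(lambda x: x * 30, ironmen))   # map with a lambda function
articles_comprehension = [x * 30 for x in ironmen]    # equivalent list comprehension

print(articles_map)
print(articles_comprehension)
```

Both forms return a new list and leave `ironmen` unchanged.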
306 | 
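307 | ## Summary
308 | 
309 | On day eight we covered functions in Python and compared them with R. Using built-in functions is much the same in both languages. For custom functions, we need to remember that Python has the good habit of writing docstrings, uses indentation rather than braces `{}` to set off the function body, can return multiple values more conveniently, and uses the `lambda` keyword for anonymous functions.
310 | 
311 | ## References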
312 | 
313 | - [Introducing Python](http://shop.oreilly.com/product/0636920028659.do)
314 | - [Advanced R by Hadley Wickham](https://www.amazon.com/dp/1466586966/ref=cm_sw_su_dp?tag=devtools-20)
315 | - [Rajath Kumar M P's Python-Lectures](https://github.com/rajathkumarmp/Python-Lectures)


--------------------------------------------------------------------------------
/day09.md:
--------------------------------------------------------------------------------
  1 | # [Day 09] Functions (2)
  2 | 
  3 | ---
  4 | 
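  5 | Once we have experienced the power of custom functions and the elegance of functional programming, there is a feeling of **no going back**, and we want to rewrite every past program in functional style. Let's spend one more day on functions and cover variable scope, nested functions, error handling and Python's flexible arguments.
  6 | 
  7 | ## Variable scope
  8 | 
  9 | Once we define our own functions we must be aware of **variable scope**: variables are now divided into **local variables** and **global variables**. Inside a function both kinds can be used, but outside a function only **global variables** are available. This is abstract in words, so let's define a function to make it concrete.
 10 | 
 11 | ### R
 12 | 
 13 | We define a simple `my_square()` function that returns the square of its input.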
 14 | 
 15 | ```
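 16 | # define the function
 17 | my_square <- function(input_number) {
 18 |     ans <- input_number^2 # local variable, usable only inside the function
 19 |     return(ans)
 20 | }
 21 | 
 22 | # call the function
 23 | my_square(3)
 24 | 
 25 | # print the variable
 26 | ans # fails: the local variable is not visible here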
 27 | ```
 28 | 
 29 | ![day0901](https://storage.googleapis.com/2017_ithome_ironman/day0901.png)
 30 | 
 31 | In another version we also assign `ans` outside the function.
 32 | 
 33 | ```
 34 | ans <- 1 # global variable
 35 | # define the function
 36 | my_square <- function(input_number) {
 37 |     ans <- input_number^2 # local variable, usable only inside the function
 38 |     return(ans)
 39 | }
 40 | 
 41 | # call the function
 42 | my_square(3)
 43 | 
 44 | # print the variable
 45 | ans # prints the global variable
 46 | ```
 47 | 
 48 | ![day0902](https://storage.googleapis.com/2017_ithome_ironman/day0902.png)
 49 | 
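 50 | This example shows that an R function looks for `ans` among its **local variables** first, so `my_square(3)` does not return `1`.
 51 | 
 52 | ### Python
 53 | 
 54 | We define a simple `my_square()` function that returns the square of its input.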
 55 | 
 56 | ```python
 57 | # 定義自訂函數
 58 | def my_square(input_number):
 59 |     '計算平方數'
 60 |     ans = input_number**2 # 區域變數，只有在函數中可以被使用
 61 |     return ans
 62 | 
 63 | # 呼叫函數
 64 | print(my_square(3))
 65 | 
 66 | # 印出變數
 67 | print(ans) # 無法印出區域變數
 68 | ```
 69 | 
 70 | ![day0903](https://storage.googleapis.com/2017_ithome_ironman/day0903.png)
 71 | 
 72 | This time we also assign `ans` outside the function.
 73 | 
 74 | ```python
 75 | ans = 1 # 全域變數
 76 | # 定義自訂函數
 77 | def my_square(input_number):
 78 |     '計算平方數'
 79 |     ans = input_number**2 # 區域變數，只有在函數中可以被使用
 80 |     return ans
 81 | 
 82 | # 呼叫函數
 83 | print(my_square(3))
 84 | 
 85 | # 印出變數
 86 | print(ans) # 全域變數
 87 | ```
 88 | 
 89 | ![day0904](https://storage.googleapis.com/2017_ithome_ironman/day0904.png)
 90 | 
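 91 | This example shows that a Python function likewise looks up `ans` in its **local** scope first, which is why `my_square(3)` returns `9` rather than `1`, and the global `ans` is still `1` afterwards.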
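
The lookup rule above also explains what happens when a function assigns to a name that exists globally. The following is a minimal sketch, not part of the original notes: the names `counter`, `set_local()` and `bump_global()` are made up for illustration. It shows that a plain assignment creates a local variable, while the `global` statement makes the assignment rebind the global name.

```python
counter = 0  # global variable (hypothetical name, for illustration only)

def set_local():
    """A plain assignment creates a local name; the global is untouched."""
    counter = 100  # local variable that shadows the global one
    return counter

def bump_global():
    """`global` makes the assignment below rebind the global name."""
    global counter
    counter += 1
    return counter

local_result = set_local()
after_local = counter      # the global value is unchanged here
global_result = bump_global()
after_global = counter     # the global value has been rebound here
print(local_result, after_local, global_result, after_global)
```

Returning a value and assigning it at the call site is usually easier to follow than `global`, so treat this as a way to see the scoping rule rather than as a recommended pattern.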
 92 | 
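 93 | ## Nested Functions
 94 | 
 95 | A function can be defined inside another function. For example, a function that computes a mean can contain two helper functions: `my_sum()` to compute the total and `my_length()` to count the elements.
 96 | 
 97 | ### R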
 98 | 
 99 | ```
100 | # 定義自訂函數
101 | my_mean <- function(input_vector) {
102 |     my_sum <- function(input_vector) {
103 |         temp_sum <- 0
104 |         for (i in input_vector) {
105 |             temp_sum <- temp_sum + i
106 |         }
107 |         return(temp_sum)
108 |     }
109 |     
110 |     my_length <- function(input_vector) {
111 |         temp_length <- 0
112 |         for (i in input_vector) {
113 |             temp_length <- temp_length + 1
114 |         }
115 |         return(temp_length)
116 |     }
117 |     return(my_sum(input_vector) / my_length(input_vector))
118 | }
119 | 
120 | # 呼叫自訂函數
121 | my_vector <- c(51, 8, 18, 13, 6, 64)
122 | my_mean(my_vector)
123 | ```
124 | 
125 | ![day0905](https://storage.googleapis.com/2017_ithome_ironman/day0905.png)
126 | 
127 | ### Python
128 | 
129 | ```python
130 | # 定義自訂函數
131 | def my_mean(input_list):
132 |     '計算平均數'
133 |     def my_sum(input_list):
134 |         '計算總和'
135 |         temp_sum = 0
136 |         for i in input_list:
137 |             temp_sum += i
138 |         return temp_sum
139 |     def my_length(input_list):
140 |         '計算個數'
141 |         temp_length = 0
142 |         for i in input_list:
143 |             temp_length += 1
144 |         return temp_length
145 |     return my_sum(input_list) / my_length(input_list)
146 | 
147 | # 呼叫自訂函數
148 | my_list = [51, 8, 18, 13, 6, 64]
149 | print(my_mean(my_list))
150 | ```
151 | 
152 | ![day0906](https://storage.googleapis.com/2017_ithome_ironman/day0906.png)
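
Nested functions tie back to variable scope. As a hedged sketch (a variation on `my_mean()` written for illustration, not the author's version), the inner functions below take no parameters and read `input_list` directly from the enclosing function, and the last lines check that `my_sum` is not visible outside `my_mean()`.

```python
def my_mean(input_list):
    """Compute a mean with helpers that read the enclosing variable."""
    def my_sum():
        temp_sum = 0
        for i in input_list:  # input_list comes from the enclosing scope
            temp_sum += i
        return temp_sum

    def my_length():
        temp_length = 0
        for _ in input_list:
            temp_length += 1
        return temp_length

    return my_sum() / my_length()

mean_value = my_mean([51, 8, 18, 13, 6, 64])
print(mean_value)

# The helper only exists while my_mean() runs; it is not a global name.
try:
    my_sum
    inner_visible = True
except NameError:
    inner_visible = False
print(inner_visible)
```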
153 | 
154 | ## Error Handling
155 | 
156 | Calls to built-in functions can fail or emit warnings for many reasons, and the messages they produce help us fix our code.
157 | 
158 | ```
159 | as.integer(TRUE)
160 | as.integer("TRUE")
161 | ```
162 | 
163 | ![day0907](https://storage.googleapis.com/2017_ithome_ironman/day0907.png)
164 | 
165 | ```python
166 | print(int(True))
167 | print(int("True"))
168 | ```
169 | 
170 | ![day0908](https://storage.googleapis.com/2017_ithome_ironman/day0908.png)
171 | 
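172 | When we write our own functions, anticipating specific errors and providing custom error messages helps the people who use those functions debug faster.
173 | 
174 | ### R
175 | 
176 | R handles errors with the `tryCatch()` function. Let's modify the earlier `my_square()` so that it prints a custom error message, 「請輸入數值。」 ("Please enter a number."), when the user passes in text.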
177 | 
178 | ```
179 | # 定義自訂函數
180 | my_square <- function(input_number) {
181 |     tryCatch({
182 |         ans <- input_number^2
183 |         return(ans)
184 |         },
185 |         error = function(e) {
186 |             print("請輸入數值。")
187 |         }
188 |     )
189 | }
190 | 
191 | # 呼叫自訂函數
192 | my_square(3)
193 | my_square('ironmen')
194 | ```
195 | 
196 | ![day0909](https://storage.googleapis.com/2017_ithome_ironman/day0909.png)
197 | 
198 | ### Python
199 | 
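200 | Python handles errors with the `try` / `except` statement. Let's modify the earlier `my_square()` so that it prints the same custom message, 「請輸入數值。」 ("Please enter a number."), when the user passes in text. We catch `TypeError` specifically, because a bare `except:` would also hide unrelated errors.
201 | 
202 | ```python
203 | # Define the function
204 | def my_square(input_number):
205 |     'Square a number, with error handling'
206 |     try:
207 |         ans = input_number**2
208 |         return ans
209 |     except TypeError: # raised when the input does not support **, e.g. a str
210 |         print("請輸入數值。")
211 | 
212 | # Call the function
213 | print(my_square(3))
214 | my_square('ironmen')
215 | ```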
216 | 
217 | ![day0910](https://storage.googleapis.com/2017_ithome_ironman/day0910.png)
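
Printing a message works for interactive use, but the function then silently returns `None`. A possible alternative, sketched here under the assumption that callers should decide how to react, is to re-raise the error with a clearer message; the English message text is made up for this example.

```python
def my_square(input_number):
    """Square a number; raise TypeError with a clearer message otherwise."""
    try:
        return input_number ** 2
    except TypeError as err:
        # Keep the original exception attached as the cause
        raise TypeError("Please enter a numeric value.") from err

squared = my_square(3)

# The caller chooses what to do with the error
try:
    result = my_square("ironmen")
except TypeError as err:
    result = None
    message = str(err)

print(squared, result, message)
```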
218 | 
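219 | ## Flexible Arguments in Python
220 | 
221 | Python functions can accept a variable number of positional arguments with `*args` and a variable number of keyword arguments with `**kwargs`. Inside the function, `args` is a tuple and `kwargs` is a dict, so callers do not have to pack the values into a data structure themselves.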
222 | 
223 | ### \*args
224 | 
225 | ```python
226 | # 定義自訂函數
227 | def ironmen_list(*args):
228 |     '列出各組參賽鐵人數'
229 |     for ironman in args:
230 |         print(ironman)
231 | 
232 | # 呼叫自訂函數
233 | ironmen_list(51, 8, 18, 13, 6) # 不含自我挑戰組
234 | print("---")
235 | ironmen_list(51, 8, 18, 13, 6, 64) # 含自我挑戰組
236 | ```
237 | 
238 | ![day0911](https://storage.googleapis.com/2017_ithome_ironman/day0911.png)
239 | 
240 | ### \*\*kwargs
241 | 
242 | ```python
243 | # 定義自訂函數
244 | def ironmen_list(**kwargs):
245 |     '列出各組參賽鐵人數'
246 |     for key in kwargs:
247 |         print(key, "：", kwargs[key], "人")
248 | 
249 | ironmen_list(ModernWeb = 51, DevOps = 8, Cloud = 18, BigData = 13, Security = 6, 自我挑戰組 = 64)
250 | ```
251 | 
252 | ![day0912](https://storage.googleapis.com/2017_ithome_ironman/day0912.png)
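
The two forms can be combined in one signature, and the same `*` and `**` syntax unpacks a list or dict at the call site. This is a small sketch with a hypothetical function `ironmen_summary()` that is not part of the original notes.

```python
def ironmen_summary(title, *args, **kwargs):
    """Collect a title, unnamed counts (args) and named counts (kwargs)."""
    lines = [title]
    for count in args:                   # args is a tuple
        lines.append("unnamed group: " + str(count))
    for name, count in kwargs.items():   # kwargs is a dict
        lines.append(name + ": " + str(count))
    total = sum(args) + sum(kwargs.values())
    lines.append("total: " + str(total))
    return lines

summary = ironmen_summary("Ironmen", 51, 8, Cloud=18, BigData=13)

# Unpacking existing collections gives the same call
counts = [51, 8]
named = {"Cloud": 18, "BigData": 13}
unpacked = ironmen_summary("Ironmen", *counts, **named)

for line in summary:
    print(line)
```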
253 | 
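254 | ## Summary
255 | 
256 | On day 9 we continued with Python functions and covered variable scope, nested functions and error handling, comparing each with R. Variable scope and nested functions work very similarly in both languages. Error handling differs more: Python uses the `try` / `except` statement, while R uses the `tryCatch()` function.
257 | 
258 | ## References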
259 | 
260 | - [Introducing Python](http://shop.oreilly.com/product/0636920028659.do)
261 | - [Python Tutorial](https://www.tutorialspoint.com/python/index.htm)
262 | - [Advanced R](https://www.amazon.com/dp/1466586966/ref=cm_sw_su_dp?tag=devtools-20)
263 | - [Using R - Basic error Handing with tryCatch()](http://mazamascience.com/WorkingWithData/?p=912)


--------------------------------------------------------------------------------
/day10.md:
--------------------------------------------------------------------------------
  1 | # [Day 10] Object-Oriented Programming in R
  2 | 
  3 | ---
  4 | 
  5 | > R, at its heart, is a functional programming language.
  6 | > [Hadley Wickham](http://hadley.nz/)
  7 | 
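  8 | R is at heart a functional programming language, while Python is an object-oriented one. I think this is what confuses R users most when they learn Python, especially when they meet unfamiliar terms such as class, attribute and method. Let's start from an exercise we have done before.
  9 | 
 10 | As of 1 p.m. on 2016-12-10, the groups of the 8th iT 邦幫忙 Ironman contest have 51, 8, 18, 14, 6 and 64 participants respectively. We record the groups and their participant counts in a data frame. If the data frame structure is unfamiliar, I recommend reading [[Day 06] Data Structures (3) Data Frame](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day06.md).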
 12 | 
 13 | ```
 14 | groups <- c("Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組")
 15 | ironmen <- c(51, 8, 18, 14, 6, 64)
 16 | 
 17 | # 跟 Python 程式比較這兩行
 18 | ironmen_df <- data.frame(groups, ironmen)
 19 | head(ironmen_df, n = 3)
 20 | ```
 21 | 
 22 | ![day1001](https://storage.googleapis.com/2017_ithome_ironman/day1001.png)
 23 | 
 24 | ```python
 25 | import pandas as pd # 引用套件並縮寫為 pd
 26 | 
 27 | groups = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"]
 28 | ironmen = [51, 8, 18, 14, 6, 64]
 29 | 
 30 | ironmen_dict = {"groups": groups,
 31 |                 "ironmen": ironmen
 32 |                 }
 33 | 
 34 | # 跟 R 語言程式比較這兩行
 35 | ironmen_df = pd.DataFrame(ironmen_dict)
 36 | ironmen_df.head(n = 3)
 37 | ```
 38 | 
 39 | ![day1002](https://storage.googleapis.com/2017_ithome_ironman/day1002.png)
 40 | 
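 41 | Looking at the last two lines of each program, we can get a first feel for the difference between a functional language and an object-oriented one:
 42 | 
 43 | - To create a data frame, R calls the `data.frame()` function; Python calls `DataFrame()`, the constructor of the `DataFrame` class in `pd` (pandas)
 44 | - To show the first three rows of the data frame, R passes it to the `head()` function; Python calls the `head()` method of `ironmen_df`
 45 | 
 46 | Today we discuss object orientation in R first.
 47 | 
 48 | ## Object Orientation in R
 49 | 
 50 | R has three object-oriented (OO) systems:
 51 | 
 52 | - S3
 53 | - S4
 54 | - RC (Reference classes)
 55 | 
 56 | There are also base types, the internal types that underlie every R object. Only the R core team can add new base types, so they are not included in the list above.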
 57 | 
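 58 | ### Base types
 59 | 
 60 | `typeof()` reports the base type of an object, and `is.primitive()` tells whether a function is a primitive, that is, one implemented internally rather than written in R.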
 61 | 
 62 | ```
 63 | # A user-defined function is a closure, not a primitive
 64 | my_square <- function(input_num) {
 65 |     return(input_num^2)
 66 | }
 67 | typeof(my_square)
 68 | is.primitive(my_square)
 69 | 
 70 | # sum() is a primitive (builtin) function
 71 | typeof(sum)
 72 | is.primitive(sum)
 73 | ```
 74 | 
 75 | ![day1003](https://storage.googleapis.com/2017_ithome_ironman/day1003.png)
 76 | 
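 77 | For a primitive function such as `sum()`, `typeof()` returns **builtin** and `is.primitive()` returns `TRUE`. For the user-defined `my_square()`, `typeof()` returns **closure** and `is.primitive()` returns `FALSE`.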
 78 | 
 79 | ### S3
 80 | 
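 81 | S3 is the most widely used OO system in R, and it is the only one used in the built-in `stats` and `base` packages. We can use `pryr::otype()` to check whether an object is an S3 object.
 82 | 
 83 | #### Creating an S3 object
 84 | 
 85 | An S3 object needs no formal declaration or class definition: just assign a class name to a list.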
 86 | 
 87 | ```
 88 | library(pryr)
 89 | 
 90 | ironmen_list <- list(
 91 |     group = c("Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"),
 92 |     participants = c(51, 8, 18, 14, 6, 64)
 93 | )
 94 | class(ironmen_list) <- "ironmen"
 95 | ironmen_list
 96 | otype(ironmen_list)
 97 | ```
 98 | 
 99 | ![day1004](https://storage.googleapis.com/2017_ithome_ironman/day1004.png)
100 | 
101 | #### Attributes
102 | 
103 | Use `$` to access the attributes of an S3 object. The syntax is the same as selecting an element of a list by name.
104 | 
105 | ```
106 | ironmen_list <- list(
107 |     group = c("Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"),
108 |     participants = c(51, 8, 18, 14, 6, 64)
109 | )
110 | class(ironmen_list) <- "ironmen"
111 | 
112 | # 取得屬性
113 | ironmen_list$group
114 | ironmen_list$participants
115 | ```
116 | 
117 | ![day1005](https://storage.googleapis.com/2017_ithome_ironman/day1005.png)
118 | 
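119 | #### Methods
120 | 
121 | We create a generic function `count_participants()` with `UseMethod()`, then define its method for the `ironmen` class, `count_participants.ironmen()`, to compute the total number of participants.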
122 | 
123 | ```
124 | # 建立一個 S3 物件
125 | ironmen_list <- list(
126 |     group = c("Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"),
127 |     participants = c(51, 8, 18, 14, 6, 64)
128 | )
129 | class(ironmen_list) <- "ironmen"
130 | 
131 | # 建立方法
132 | count_participants <- function(obj) {
133 |     UseMethod("count_participants")
134 | }
135 | count_participants.ironmen <- function(obj) {
136 |     return(sum(obj$participants))
137 | }
138 | 
139 | # 呼叫方法
140 | count_participants(ironmen_list)
141 | ```
142 | 
143 | ![day1006](https://storage.googleapis.com/2017_ithome_ironman/day1006.png)
144 | 
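145 | S3 methods belong to generic functions, not to objects: the generic `count_participants()` dispatches to `count_participants.ironmen()` based on the class of its argument.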
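
For readers coming from the Python side, the standard library's `functools.singledispatch` works in a loosely similar way: a plain function picks an implementation from the class of its first argument. The sketch below is only an analogy under that assumption, not how Python code is usually organised, and the `Ironmen` class here is a minimal stand-in written for this example.

```python
from functools import singledispatch

class Ironmen:
    """Minimal stand-in for the `ironmen` S3 class (illustrative only)."""
    def __init__(self, group, participants):
        self.group = group
        self.participants = participants

@singledispatch
def count_participants(obj):
    """Generic function: the fallback used when no implementation matches."""
    raise NotImplementedError("no method for " + type(obj).__name__)

@count_participants.register(Ironmen)
def _(obj):
    """Implementation chosen when the argument is an Ironmen object."""
    return sum(obj.participants)

ironmen_obj = Ironmen(
    ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"],
    [51, 8, 18, 14, 6, 64],
)
total = count_participants(ironmen_obj)
print(total)
```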
146 | 
147 | ### S4
148 | 
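149 | S4 is stricter and more formal than S3. We can use `isS4()` to check whether an object is an S4 object.
150 | 
151 | #### Creating an S4 object
152 | 
153 | We first define the class with `setClass()`, declaring its `slots` and their data types, and then create objects with `new()`.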
154 | 
155 | ```
156 | # 預先定義類別
157 | setClass("ironmen", slots = list(group="character", participants = "numeric"))
158 | 
159 | # 建立 S4 物件
160 | ironmen_list <- new("ironmen", group = c("Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"), participants = c(51, 8, 18, 14, 6, 64))
161 | ironmen_list
162 | isS4(ironmen_list)
163 | ```
164 | 
165 | ![day1007](https://storage.googleapis.com/2017_ithome_ironman/day1007.png)
166 | 
167 | #### Attributes
168 | 
169 | Use `@` to access the attributes (slots) of an S4 object.
170 | 
171 | ```
172 | # 預先定義類別
173 | setClass("ironmen", slots = list(group="character", participants = "numeric"))
174 | 
175 | # 建立 S4 物件
176 | ironmen_list <- new("ironmen", group = c("Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"), participants = c(51, 8, 18, 14, 6, 64))
177 | 
178 | # 取得屬性
179 | ironmen_list@group
180 | ironmen_list@participants
181 | ```
182 | 
183 | ![day1008](https://storage.googleapis.com/2017_ithome_ironman/day1008.png)
184 | 
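185 | #### Methods
186 | 
187 | We declare a generic function `count_participants()` with `setGeneric()` and use `setMethod()` to define its method for the `ironmen` class, which computes the total number of participants.
188 | 
189 | ```
190 | # Define the class
191 | setClass("ironmen", slots = list(group="character", participants = "numeric"))
192 | 
193 | # Declare the generic, then define its method for the ironmen class
194 | setGeneric("count_participants", function(obj) standardGeneric("count_participants"))
195 | setMethod("count_participants", "ironmen",
196 |          function(obj) {
197 |            return(sum(obj@participants))
198 |          }
199 | )
200 | 
201 | # Create an S4 object
202 | ironmen_list <- new("ironmen", group = c("Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"), participants = c(51, 8, 18, 14, 6, 64))
203 | 
204 | # Call the method
205 | count_participants(ironmen_list)
206 | ```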
207 | 
208 | ![day1009](https://storage.googleapis.com/2017_ithome_ironman/day1009.png)
209 | 
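210 | Like S3 methods, S4 methods belong to generic functions.
211 | 
212 | ### RC (Reference classes)
213 | 
214 | RC is the system closest to Python's object orientation: its methods belong to the RC objects themselves rather than to generic functions.
215 | 
216 | #### Creating an RC object
217 | 
218 | We first define the class with `setRefClass()`, declaring its `fields` and their data types.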
219 | 
220 | ```
221 | # 預先定義
222 | Ironmen <- setRefClass("Ironmen", fields = list(group = "character", participants = "numeric"))
223 | 
224 | # 建立 RC 物件
225 | ironmen_list <- Ironmen(group = c("Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"), participants = c(51, 8, 18, 14, 6, 64))
226 | ironmen_list
227 | ```
228 | 
229 | ![day1010](https://storage.googleapis.com/2017_ithome_ironman/day1010.png)
230 | 
231 | #### Attributes
232 | 
233 | Use `$` to access the attributes (fields) of an RC object.
234 | 
235 | ```
236 | # 預先定義
237 | Ironmen <- setRefClass("Ironmen", fields = list(group = "character", participants = "numeric"))
238 | 
239 | # 建立 RC 物件
240 | ironmen_list <- Ironmen(group = c("Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"), participants = c(51, 8, 18, 14, 6, 64))
241 | 
242 | ironmen_list$group
243 | ironmen_list$participants
244 | ```
245 | 
246 | ![day1011](https://storage.googleapis.com/2017_ithome_ironman/day1011.png)
247 | 
248 | #### Methods
249 | 
250 | Methods are written into the class definition itself, through the `methods` argument of `setRefClass()`.
251 | 
252 | ```
253 | Ironmen <- setRefClass("Ironmen",
254 |     fields = list(group = "character", participants = "numeric"),
255 |     methods = list(
256 |         count_participants = function() {
257 |             return(sum(participants))
258 |         }
259 |     )
260 | )
261 | ironmen_list <- Ironmen(group = c("Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"), participants = c(51, 8, 18, 14, 6, 64))
262 | ironmen_list$count_participants()
263 | ```
264 | 
265 | ![day1012](https://storage.googleapis.com/2017_ithome_ironman/day1012.png)
266 | 
267 | ### Comparison
268 | 
269 | A simple table compares the three OO systems.
270 | 
271 | |   |S3|S4|RC|
272 | |---|---|---|---|
273 | |Class definition|No formal declaration|Declared with `setClass()`|Declared with `setRefClass()`|
274 | |Attribute access|`$`|`@`|`$`|
275 | |Method definition|`UseMethod()`|`setMethod()`|`methods` argument of `setRefClass()`|
276 | |Methods belong to|Generic functions|Generic functions|RC objects|
277 | 
278 | If you cannot decide which OO system to use, choose S3!
279 | 
280 | > Three OO systems is a lot for one language, but for most R programming, S3 suffices. In R you usually create fairly simple objects and methods for pre-existing generic functions like print(), summary(), and plot(). S3 is well suited to this task, and the majority of OO code that I have written in R is S3. S3 is a little quirky, but it gets the job done with a minimum of code.
281 | > [Hadley Wickham](http://hadley.nz/)
282 | 
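283 | ## Summary
284 | 
285 | On day 10 we discussed the three OO systems in R: S3, S4 and RC (Reference classes). Through simple examples we defined classes, created objects of those classes, and defined our own methods.
286 | 
287 | ## References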
288 | 
289 | - [Advanced R](http://adv-r.had.co.nz/)
290 | - [PROGRAMIZ](https://www.programiz.com/)


--------------------------------------------------------------------------------
/day11.md:
--------------------------------------------------------------------------------
  1 | # [Day 11] Object-Oriented Programming (2) Python
  2 | 
  3 | ---
  4 | 
  5 | > Everything in Python, from numbers to modules, is an object.
  6 | > [Bill Lubanovic](http://www.oreilly.com/pub/au/2909)
  7 | 
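  8 | After R's three confusing and slightly maddening OO systems (S3, S4 and RC), we return to Python, a conventional object-oriented language. Here is a judgement call: data science beginners who are used to **functional programming** should learn R first, and those who are used to **object-oriented programming** should learn Python first. This judgement call does not apply broadly, though, because it is of no help to beginners who have never done either kind of programming.
  9 | 
 10 | Before discussing object orientation in Python, let's look at one more familiar example to see what attributes and methods are.
 11 | 
 12 | ## Attributes and Methods
 13 | 
 14 | An **object** can have **attributes** and **methods**. The object in our example is a data frame called `ironmen_df`, which has a `dtypes` attribute and a `head()` method.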
 15 | 
 16 | ```python
 17 | import pandas as pd # 引用套件並縮寫為 pd
 18 | 
 19 | groups = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"]
 20 | # 截至 2016-12-11 上午 11 時第 8 屆 iT 邦幫忙各組的鐵人分別是 54、8、18、14、6 與 65 人
 21 | ironmen = [54, 8, 18, 14, 6, 65]
 22 | 
 23 | ironmen_dict = {"groups": groups,
 24 |                 "ironmen": ironmen
 25 |                 }
 26 | 
 27 | ironmen_df = pd.DataFrame(ironmen_dict)
 28 | print(ironmen_df.dtypes) # ironmen_df 有 dtypes 屬性
 29 | ironmen_df.head(n = 3) # ironmen_df 有 head() 方法
 30 | ```
 31 | 
 32 | ![day1101](https://storage.googleapis.com/2017_ithome_ironman/day1101.png)
 33 | 
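 34 | In R, both of the last two lines would be function calls, such as `class()` and `head()`, so R users need to put a little extra effort into understanding attributes and methods when learning object orientation in Python.
 35 | 
 36 | ## Object Orientation in Python
 37 | 
 38 | ### Defining a class
 39 | 
 40 | We define a class with the `class` statement and give it a capitalized name. If the `__init__` method and the `self` parameter look confusing, just remember for now that `__init__` is a special method that Python runs to initialize each new object of the class, and that `self` refers to the object being initialized.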
 41 | 
 42 | ```python
 43 | class Ironmen:
 44 |     '''這是一個叫做 Ironmen 的類別''' # Doc string
 45 |     def __init__(self, group, participants):
 46 |         self.group = group
 47 |         self.participants = participants
 48 | 
 49 | print(Ironmen)
 50 | ```
 51 | 
 52 | ![day1102](https://storage.googleapis.com/2017_ithome_ironman/day1102.png)
 53 | 
 54 | ### Creating an object from a class
 55 | 
 56 | Once the `Ironmen` class is defined, we can call `Ironmen()` as a constructor to create objects.
 57 | 
 58 | ```python
 59 | class Ironmen:
 60 |     '''這是一個叫做 Ironmen 的類別''' # Doc string
 61 |     def __init__(self, group, participants):
 62 |         self.group = group
 63 |         self.participants = participants
 64 | 
 65 | # 根據 Ironmen 類別建立一個物件 modern_web
 66 | modern_web = Ironmen("Modern Web", 54)
 67 | print(modern_web)
 68 | ```
 69 | 
 70 | ![day1103](https://storage.googleapis.com/2017_ithome_ironman/day1103.png)
 71 | 
 72 | ### Using an object's attributes
 73 | 
 74 | Write `.` and the attribute name after the object name to access an attribute.
 75 | 
 76 | ```python
 77 | class Ironmen:
 78 |     '''這是一個叫做 Ironmen 的類別''' # Doc string
 79 |     def __init__(self, group, participants):
 80 |         self.group = group
 81 |         self.participants = participants
 82 | 
 83 | # 根據 Ironmen 類別建立一個物件 modern_web
 84 | modern_web = Ironmen("Modern Web", 54)
 85 | 
 86 | # 印出 modern_web 的兩個屬性
 87 | print(modern_web.group)
 88 | print(modern_web.participants)
 89 | 
 90 | # 印出 modern_web 的類別 doc string
 91 | print(modern_web.__doc__)
 92 | ```
 93 | 
 94 | ![day1104](https://storage.googleapis.com/2017_ithome_ironman/day1104.png)
 95 | 
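 96 | After creating the `modern_web` object of the `Ironmen` class, we can also press the **tab key** in Jupyter Notebook to see which attributes it has. Names with two leading and two trailing underscores (`__`) are default attributes.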
 97 | 
 98 | ![day1105](https://storage.googleapis.com/2017_ithome_ironman/day1105.png)
 99 | 
100 | ![day1106](https://storage.googleapis.com/2017_ithome_ironman/day1106.png)
101 | 
102 | We can also use the built-in function `dir()` to list all the attributes of an object.
103 | 
104 | ```python
105 | class Ironmen:
106 |     '''這是一個叫做 Ironmen 的類別''' # Doc string
107 |     def __init__(self, group, participants):
108 |         self.group = group
109 |         self.participants = participants
110 | 
111 | # 根據 Ironmen 類別建立一個物件 modern_web
112 | modern_web = Ironmen("Modern Web", 54)
113 | 
114 | # 使用 dir() 函數
115 | dir(modern_web)
116 | ```
117 | 
118 | ![day1107](https://storage.googleapis.com/2017_ithome_ironman/day1107.png)
119 | 
120 | ### Defining methods
121 | 
122 | Besides `__init__`, we can use `def` to define more methods that belong to the class.
123 | 
124 | ```python
125 | class Ironmen:
126 |     '''這是一個叫做 Ironmen 的類別''' # Doc string
127 |     def __init__(self, group, participants):
128 |         self.group = group
129 |         self.participants = participants
130 |     
131 |     def print_info(self):
132 |         print(self.group, "組有", self.participants, "位鐵人參賽！")
133 | 
134 | # 根據 Ironmen 類別建立一個物件 modern_web
135 | modern_web = Ironmen("Modern Web", 54)
136 | 
137 | # 根據 Ironmen 類別建立一個物件 dev_ops
138 | dev_ops = Ironmen("DevOps", 8)
139 | 
140 | # 使用 modern_web 的 print_info() 方法
141 | modern_web.print_info()
142 | 
143 | # 使用 dev_ops 的 print_info() 方法
144 | dev_ops.print_info()
145 | ```
146 | 
147 | ![day1108](https://storage.googleapis.com/2017_ithome_ironman/day1108.png)
148 | 
149 | ### Inheritance
150 | 
151 | A new class can inherit from an existing class and reuse its attributes and methods.
152 | 
153 | ```python
154 | class Ironmen:
155 |     '''這是一個叫做 Ironmen 的類別''' # Doc string
156 |     def __init__(self, group, participants):
157 |         self.group = group
158 |         self.participants = participants
159 |     
160 |     def print_info(self):
161 |         print(self.group, "組有", self.participants, "位鐵人參賽！")
162 | 
163 | # Articles 類別繼承 Ironmen 類別
164 | class Articles(Ironmen):
165 |     '''
166 |     這是一個叫做 Articles 的類別。
167 |     Articles 繼承 Ironmen 類別，她新增了一個 print_articles() 方法
168 |     '''
169 |     def print_articles(self):
170 |         print(self.group, "組預計會有", self.participants * 30, "篇文章！")
171 | 
172 | # 根據 Articles 類別建立一個物件 modern_web
173 | modern_web = Articles("Modern Web", 54)
174 | 
175 | # 使用 modern_web 的 print_articles() 方法
176 | modern_web.print_articles()
177 | 
178 | # 檢查 modern_web 是否還擁有 print_info() 方法
179 | modern_web.print_info()
180 | ```
181 | 
182 | ![day1109](https://storage.googleapis.com/2017_ithome_ironman/day1109.png)
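
One way to inspect the relationship that inheritance creates is with the built-in functions `isinstance()` and `issubclass()` and the `__mro__` attribute (the method resolution order). This is a small sketch that repeats trimmed versions of the two classes so that it runs on its own.

```python
class Ironmen:
    """Trimmed copy of the Ironmen class, kept short for this sketch."""
    def __init__(self, group, participants):
        self.group = group
        self.participants = participants

class Articles(Ironmen):
    """Articles inherits everything from Ironmen."""
    pass

modern_web = Articles("Modern Web", 54)

is_articles = isinstance(modern_web, Articles)
is_ironmen = isinstance(modern_web, Ironmen)   # an Articles object is also an Ironmen
is_subclass = issubclass(Articles, Ironmen)

# Order in which Python searches classes for attributes and methods
mro_names = [cls.__name__ for cls in Articles.__mro__]
print(is_articles, is_ironmen, is_subclass, mro_names)
```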
183 | 
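184 | #### Using super() when inheriting
185 | 
186 | `super()` gives access to the parent class, so a subclass can build new attributes or methods on top of the inherited ones. Here `Ironmen.__init__()` calls `super().__init__(group)` to let `OnlyGroup` set `group`, and then adds `participants` itself.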
187 | 
188 | ```python
189 | class OnlyGroup:
190 |     '''這是一個叫做 OnlyGroup 的類別''' # Doc string
191 |     def __init__(self, group):
192 |         self.group = group
193 | 
194 | # Ironmen 類別繼承 OnlyGroup 類別
195 | class Ironmen(OnlyGroup):
196 |     '''這是一個叫做 Ironmen 的類別''' # Doc string
197 |     def __init__(self, group, participants):
198 |         super().__init__(group)
199 |         self.participants = participants
200 | 
201 | # 根據 Ironmen 類別建立一個物件 modern_web
202 | modern_web = Ironmen("Modern Web", 54)
203 | 
204 | # 印出 modern_web 的兩個屬性
205 | print(modern_web.group)
206 | print(modern_web.participants)
207 | ```
208 | 
209 | ![day1110](https://storage.googleapis.com/2017_ithome_ironman/day1110.png)
210 | 
211 | #### Overriding methods when inheriting
212 | 
213 | Earlier we added a `print_articles()` method through inheritance. Now we override the `print_info()` method of the `Ironmen` class inside the `Articles` class.
214 | 
215 | ```python
216 | class Ironmen:
217 |     '''這是一個叫做 Ironmen 的類別''' # Doc string
218 |     def __init__(self, group, participants):
219 |         self.group = group
220 |         self.participants = participants
221 |     
222 |     def print_info(self):
223 |         print(self.group, "組有", self.participants, "位鐵人參賽！")
224 | 
225 | # Articles 類別繼承 Ironmen 類別
226 | class Articles(Ironmen):
227 |     '''
228 |     這是一個叫做 Articles 的類別。
229 |     Articles 繼承 Ironmen 類別，她新增了一個 print_articles() 方法
230 |     '''
231 |     def print_articles(self):
232 |         print(self.group, "組預計會有", self.participants * 30, "篇文章！")
233 |     
234 |     # 改寫 print_info() 方法
235 |     def print_info(self):
236 |         print(self.group, "組有", self.participants, "位鐵人參賽！p.s.我被改寫了！")
237 | 
238 | # 根據 Articles 類別建立一個物件 modern_web
239 | modern_web = Articles("Modern Web", 54)
240 | 
241 | # 檢查 modern_web 的 print_info() 方法是否被改寫
242 | modern_web.print_info()
243 | ```
244 | 
245 | ![day1111](https://storage.googleapis.com/2017_ithome_ironman/day1111.png)
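
Overriding and `super()` can also be combined, so that the subclass extends the parent's behaviour instead of replacing it. In this sketch the method is a hypothetical `info()` that returns a string (rather than the `print_info()` used above) so the result is easy to inspect.

```python
class Ironmen:
    def __init__(self, group, participants):
        self.group = group
        self.participants = participants

    def info(self):
        """Hypothetical method for this sketch: returns text instead of printing."""
        return "{} has {} participants".format(self.group, self.participants)

class Articles(Ironmen):
    def info(self):
        """Override info(), reusing the parent's text through super()."""
        base_text = super().info()
        return base_text + " and expects {} articles".format(self.participants * 30)

modern_web = Articles("Modern Web", 54)
text = modern_web.info()
print(text)
```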
246 | 
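247 | ## Summary
248 | 
249 | On day 11 we discussed object orientation in Python. Through simple examples we defined a class, specified its attributes and methods in the definition, and created objects of that class. We also covered inheritance: deriving a new class, adding methods and overriding methods.
250 | 
251 | ## References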
252 | 
253 | - [Introducing Python](http://shop.oreilly.com/product/0636920028659.do)
254 | - [PROGRAMIZ](https://www.programiz.com/)


--------------------------------------------------------------------------------
/day12.md:
--------------------------------------------------------------------------------
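  1 | # [Day 12] Common Attributes and Methods: Variables and Basic Data Structures
  2 | 
  3 | ---
  4 | 
  5 | We finally have a reasonable understanding of objects and of the attributes and methods that belong to them, which is quite a milestone for R users without a developer background! Remember how much time the earlier notes spent on Python's variable types and data structures? Now that we have the object-oriented concepts in place, we should get familiar with the attributes and methods available on those variable types and data structures.
  6 | 
  7 | ## Attributes and Methods of the Basic Variable Types
  8 | 
  9 | In [[Day 02] Basic Variable Types](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day02.md) we discussed Python's built-in types, grouped into numbers, text and booleans. Now let's see which methods they offer.
 10 | 
 11 | ### Numbers
 12 | 
 13 | #### float
 14 | 
 15 | A float is registered as an instance of the `numbers.Real` abstract base class.
 16 | 
 17 | - `as_integer_ratio()` method
 18 | - `is_integer()` method
 19 | - `hex()` method
 20 | - `fromhex()` class method
 21 | - ...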
 22 | 
 23 | ```python
 24 | my_float = 8.7
 25 | print(my_float.as_integer_ratio())
 26 | print(my_float.is_integer())
 27 | print(my_float.hex())
 28 | print(float.fromhex("0x1.1666666666666p+3"))
 29 | ```
 30 | 
 31 | ![day1201](https://storage.googleapis.com/2017_ithome_ironman/day1201.png)
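
These methods make it possible to look at how a float is actually stored. The sketch below uses the standard `fractions` module (an addition for illustration, not used in the original notes) to show that `8.7` is stored as a ratio whose denominator is a power of two, so it is close to but not exactly 87/10, while `hex()` and `fromhex()` round-trip the stored value exactly.

```python
from fractions import Fraction

my_float = 8.7

# Exact ratio of the stored binary value
numerator, denominator = my_float.as_integer_ratio()
stored_value = Fraction(numerator, denominator)

# The decimal 8.7 is not exactly representable as a binary float
is_exactly_87_tenths = (stored_value == Fraction(87, 10))

# The denominator of a finite float's ratio is always a power of two
denominator_is_power_of_two = (denominator & (denominator - 1)) == 0

# hex() and fromhex() round-trip the stored value without loss
round_trip = float.fromhex(my_float.hex())

print(is_exactly_87_tenths, denominator_is_power_of_two, round_trip == my_float)
```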
 32 | 
 33 | #### int
 34 | 
 35 | An int is registered as an instance of the `numbers.Integral` abstract base class.
 36 | 
 37 | - `bit_length()` method
 38 | - `to_bytes()` method
 39 | - `from_bytes()` class method
 40 | - ...
 41 | 
 42 | ```python
 43 | my_int = 87
 44 | print(my_int.bit_length())
 45 | print(my_int.to_bytes(length = 2, byteorder = "big"))
 46 | print(int.from_bytes(b'\x00W', byteorder = "big"))
 47 | print("---")
 48 | print(my_int.to_bytes(length = 10, byteorder = "big"))
 49 | print(int.from_bytes(b'\x00\x00\x00\x00\x00\x00\x00\x00\x00W', byteorder = "big"))
 50 | ```
 51 | 
 52 | ![day1202](https://storage.googleapis.com/2017_ithome_ironman/day1202.png)
 53 | 
 54 | #### complex
 55 | 
 56 | A complex number is registered as an instance of the `numbers.Complex` abstract base class.
 57 | 
 58 | - `real` attribute
 59 | - `imag` attribute
 60 | - `conjugate()` method
 61 | 
 62 | ```python
 63 | my_complex = 8 + 7j
 64 | print(my_complex.real)
 65 | print(my_complex.imag)
 66 | print(my_complex.conjugate())
 67 | ```
 68 | 
 69 | ![day1203](https://storage.googleapis.com/2017_ithome_ironman/day1203.png)
 70 | 
 71 | ### Text (str)
 72 | 
 73 | A str has a large number of methods; here are some commonly used ones.
 74 | 
 75 | - `startswith()` method
 76 | - `endswith()` method
 77 | - `find()` method
 78 | - `count()` method
 79 | - `strip()` method
 80 | - `capitalize()` method
 81 | - `title()` method
 82 | - `upper()` method
 83 | - `lower()` method
 84 | - `swapcase()` method
 85 | - `replace()` method
 86 | - ...
 87 | 
 88 | ```python
 89 | my_str = "It's the 2017 ithelp ironman contest!!!"
 90 | 
 91 | print(my_str.startswith("It's")) # True
 92 | print(my_str.endswith("contest??")) # False
 93 | print(my_str.find("2017")) # 9
 94 | print(my_str.count("!")) # 3
 95 | print(my_str.strip("!")) # It's the 2017 ithelp ironman contest
 96 | print(my_str.capitalize()) # It's the 2017 ithelp ironman contest!!!
 97 | print(my_str.title()) # It'S The 2017 Ithelp Ironman Contest!!!
 98 | print(my_str.upper()) # IT'S THE 2017 ITHELP IRONMAN CONTEST!!!
 99 | print(my_str.lower()) # it's the 2017 ithelp ironman contest!!!
100 | print(my_str.swapcase()) # iT'S THE 2017 ITHELP IRONMAN CONTEST!!!
101 | print(my_str.replace("contest", "competition")) # It's the 2017 ithelp ironman competition!!!
102 | ```
103 | 
104 | ![day1204](https://storage.googleapis.com/2017_ithome_ironman/day1204.png)
105 | 
106 | ### Booleans (bool)
107 | 
108 | `bool` is a subclass of `int`, so a boolean has the same attributes and methods as an int.
109 | 
110 | - `bit_length()` method
111 | - `to_bytes()` method
112 | - `from_bytes()` class method
113 | - ...
114 | 
115 | ```python
116 | my_bool = True
117 | print(my_bool.bit_length())
118 | print(my_bool.to_bytes(length = 2, byteorder = "big"))
119 | print(bool.from_bytes(b'\x00\x01', byteorder = "big"))
120 | print("---")
121 | print(my_bool.to_bytes(length = 10, byteorder = "big"))
122 | print(bool.from_bytes(b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01', byteorder = "big"))
123 | ```
124 | 
125 | ![day1205](https://storage.googleapis.com/2017_ithome_ironman/day1205.png)
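
Because `bool` is a subclass of `int`, `True` and `False` behave as `1` and `0` in arithmetic. A short sketch of what that allows, for example counting how many values satisfy a condition (the threshold of 10 is arbitrary):

```python
bool_is_int_subclass = issubclass(bool, int)
true_plus_true = True + True

# Count how many groups have more than 10 participants
ironmen = [56, 8, 18, 14, 6]
flags = [n > 10 for n in ironmen]   # a list of booleans
groups_over_ten = sum(flags)        # each True counts as 1

print(bool_is_int_subclass, true_plus_true, flags, groups_over_ten)
```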
126 | 
127 | ## Attributes and Methods of the Basic Data Structures
128 | 
129 | In [[Day 04] Data Structures: List, Tuple and Dictionary](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day04.md) we discussed Python's built-in collections: list, tuple and dict. Now let's see which methods they offer.
130 | 
131 | ### list
132 | 
133 | A list has many methods; here are some commonly used ones.
134 | 
135 | - `append()` method
136 | - `insert()` method
137 | - `remove()` method
138 | - `pop()` method
139 | - `index()` method
140 | - `count()` method
141 | - `sort()` method
142 | - `reverse()` method
143 | - ...
144 | 
145 | ```python
146 | # 截至 2016-12-12 下午 3 時第 8 屆 iT 邦幫忙各組的鐵人分別是 56、8、18、14、6 與 66 人
147 | ironmen = [56, 8, 18, 14, 6]
148 | 
149 | ironmen.append(66)
150 | print(ironmen)
151 | ironmen.pop()
152 | print(ironmen)
153 | ironmen.insert(5, 66)
154 | ironmen.remove(66)
155 | print(ironmen)
156 | ironmen.index(56)
157 | ironmen.append(66)
158 | ironmen.append(66)
159 | print(ironmen.count(66))
160 | ironmen.pop()
161 | ironmen.sort()
162 | print(ironmen)
163 | ironmen.reverse()
164 | print(ironmen)
165 | ```
166 | 
167 | ![day1206](https://storage.googleapis.com/2017_ithome_ironman/day1206.png)
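
A detail that often surprises R users: methods such as `sort()` and `reverse()` modify the list in place and return `None`, whereas the built-in function `sorted()` returns a new list and leaves the original alone. A minimal sketch:

```python
ironmen = [56, 8, 18, 14, 6, 66]

# sorted() is a built-in function: it returns a new, sorted list
sorted_copy = sorted(ironmen)
snapshot = list(ironmen)            # the original order is still intact here

# sort() is a list method: it sorts in place and returns None
return_value = ironmen.sort(reverse=True)

print(sorted_copy)
print(snapshot)
print(return_value, ironmen)
```

Writing `ironmen = ironmen.sort()` would therefore replace the list with `None`, which is a common mistake.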
168 | 
169 | ### tuple
170 | 
171 | A tuple is an immutable data structure, so it has no methods that change its contents.
172 | 
173 | - `index()` method
174 | - `count()` method
175 | 
176 | ```python
177 | my_tuple = (56, 8, 18, 14, 6, 6)
178 | print(my_tuple.index(56))
179 | print(my_tuple.count(6))
180 | ```
181 | 
182 | ![day1207](https://storage.googleapis.com/2017_ithome_ironman/day1207.png)
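
Immutability can be checked directly. In this sketch, assigning to an element raises a `TypeError`, list-style methods such as `append()` are absent, and "adding" an element really builds a new tuple.

```python
my_tuple = (56, 8, 18, 14, 6, 6)

# Item assignment is not supported on tuples
try:
    my_tuple[0] = 57
    error_name = None
except TypeError as err:
    error_name = type(err).__name__

has_append = hasattr(my_tuple, "append")

# Concatenation creates a new tuple; my_tuple itself is unchanged
longer_tuple = my_tuple + (66,)

print(error_name, has_append, my_tuple, longer_tuple)
```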
183 | 
184 | ### dictionary
185 | 
186 | A dictionary has many methods; here are some commonly used ones.
187 | 
188 | - `get()` method
189 | - `keys()` method
190 | - `items()` method
191 | - `values()` method
192 | - ...
193 | 
194 | ```python
195 | ironmen_dict = {"Modern Web": 56,
196 |                 "DevOps": 8,
197 |                 "Cloud": 18,
198 |                 "Big Data": 14,
199 |                 "Security": 6,
200 |                 "自我挑戰組": 66
201 |                 }
202 | 
203 | print(ironmen_dict.get("Modern Web"))
204 | print(ironmen_dict.keys())
205 | print(ironmen_dict.items())
206 | print(ironmen_dict.values())
207 | ```
208 | 
209 | ![day1208](https://storage.googleapis.com/2017_ithome_ironman/day1208.png)
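
Two common uses of these methods, sketched under the assumption of a lookup for a key that does not exist (the key `"IoT"` is made up for this example): `get()` returns `None` or a chosen default instead of raising `KeyError`, and `items()` lets a loop read keys and values together.

```python
ironmen_dict = {"Modern Web": 56,
                "DevOps": 8,
                "Cloud": 18,
                "Big Data": 14,
                "Security": 6,
                "自我挑戰組": 66
                }

# get() avoids a KeyError for a missing key ("IoT" is hypothetical)
missing_plain = ironmen_dict.get("IoT")
missing_with_default = ironmen_dict.get("IoT", 0)

# items() yields (key, value) pairs
lines = ["{}: {}".format(group, count) for group, count in ironmen_dict.items()]
total = sum(ironmen_dict.values())

print(missing_plain, missing_with_default, total)
for line in lines:
    print(line)
```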
210 | 
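211 | ## Finding Out Which Attributes and Methods Are Available
212 | 
213 | ### Tab completion in Jupyter Notebook
214 | 
215 | After creating an object (a variable or data structure) in Jupyter Notebook, type `.` after the object name and press the **tab** key.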
216 | 
217 | ```python
218 | ironmen_dict = {"Modern Web": 56,
219 |                 "DevOps": 8,
220 |                 "Cloud": 18,
221 |                 "Big Data": 14,
222 |                 "Security": 6,
223 |                 "自我挑戰組": 66
224 |                 }
225 | ```
226 | 
227 | ![day1209](https://storage.googleapis.com/2017_ithome_ironman/day1209.png)
228 | 
229 | ### Using the `dir()` function
230 | 
231 | Once an object (a variable or data structure) has been created, we can inspect it with the `dir()` function.
232 | 
233 | ```python
234 | ironmen_dict = {"Modern Web": 56,
235 |                 "DevOps": 8,
236 |                 "Cloud": 18,
237 |                 "Big Data": 14,
238 |                 "Security": 6,
239 |                 "自我挑戰組": 66
240 |                 }
241 | 
242 | dir(ironmen_dict)
243 | ```
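
The output of `dir()` is long because it includes the default names wrapped in double underscores. One possible way to narrow it down, shown here as a sketch with a shortened dictionary, is to filter the names with a list comprehension and keep only the callable ones.

```python
ironmen_dict = {"Modern Web": 56, "DevOps": 8}

# Drop names that start with an underscore
public_names = [name for name in dir(ironmen_dict) if not name.startswith("_")]

# Of those, keep the ones that can be called, i.e. the methods
method_names = [name for name in public_names
                if callable(getattr(ironmen_dict, name))]

print(method_names)
```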
244 | 
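245 | ### Official documentation and Google search
246 | 
247 | Or consult the [official Python documentation](https://docs.python.org/3/index.html) and search on Google.
248 | 
249 | ## Summary
250 | 
251 | On day 12 we reviewed the attributes and methods available on Python's basic variable types and data structures. Now that we understand what objects, attributes and methods are, we can clearly distinguish between calling a built-in function, accessing an object's attribute, and calling an object's method. Along the way we also found that most of these attributes and methods have intuitive names and are friendly to use.
252 | 
253 | ## References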
254 | 
255 | - [Introducing Python](http://shop.oreilly.com/product/0636920028659.do)
256 | - [Python 3.5.2 documentation](https://docs.python.org/3/index.html)


--------------------------------------------------------------------------------
/day13.md:
--------------------------------------------------------------------------------
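  1 | # [Day 13] Common Attributes and Methods (2) ndarray
  2 | 
  3 | ---
  4 | 
  5 | Yesterday's notes covered the attributes and methods of Python's basic variable types and data structures. Beyond the basic data structures, do you remember that importing the `numpy` package gives Python the **ndarray** data structure? As mentioned in [[Day 05] Data Structures (2) ndarray](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day05.md), a Python list cannot do element-wise arithmetic, so we turned to the **ndarray** from `numpy`. We should therefore get to know its common attributes and methods.
  6 | 
  7 | ## Common Attributes and Methods of numpy and ndarray
  8 | 
  9 | ### Getting an overview of an ndarray
 10 | 
 11 | - `ndim` attribute
 12 | - `shape` attribute
 13 | - `dtype` attribute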
 14 | 
 15 | ```python
 16 | import numpy as np
 17 | 
 18 | # As of 7 a.m. on 2016-12-06, the groups in the 8th iT 邦幫忙 Ironman contest had 56, 8, 19, 14, 6, and 71 participants
 19 | ironmen = [56, 8, 19, 14, 6, 71]
 20 | ironmen_array = np.array(ironmen)
 21 | 
 22 | print(ironmen_array.ndim) # number of dimensions
 23 | print(ironmen_array.shape) # size along each dimension
 24 | print(ironmen_array.dtype) # data type of the elements
 25 | print("\n") # blank line
 26 | 
 27 | # 2d array
 28 | ironmen_2d = [range(1, 7), [56, 8, 19, 14, 6, 71]]
 29 | ironmen_2d_array = np.array(ironmen_2d)
 30 | print(ironmen_2d_array.ndim) # number of dimensions
 31 | print(ironmen_2d_array.shape) # m*n
 32 | print(ironmen_2d_array.dtype) # data type of the elements
 33 | ```
 34 | 
 35 | ![day1301](https://storage.googleapis.com/2017_ithome_ironman/day1301.png)
 36 | 
 37 | ### Creating an ndarray
 38 | 
 39 | Besides `array()`, which converts a list into an ndarray, the `numpy` package offers other functions for creating an ndarray.
 40 | 
 41 | - `zeros()` function
 42 | - `empty()` function
 43 | - `arange()` function
 44 | 
 45 | ```python
 46 | import numpy as np
 47 | 
 48 | print(np.zeros(6)) # 1d array of six zeros
 49 | print("------") # separator
 50 | print(np.zeros((2, 6))) # 2d array of twelve zeros
 51 | print("------") # separator
 52 | print(np.empty((2, 6, 2))) # twenty-four uninitialized values
 53 | print("------") # separator
 54 | print(np.arange(11)) # 1d array of the eleven integers 0 to 10
 55 | ```
 56 | 
 57 | ![day1302](https://storage.googleapis.com/2017_ithome_ironman/day1302.png)
 58 | 
 59 | ### Converting the data type
 60 | 
 61 | The `astype()` method of an ndarray converts its elements to another data type.
 62 | 
 63 | ```python
 64 | import numpy as np
 65 | 
 66 | ironmen = ["56", "8", "19", "14", "6", "71"]
 67 | ironmen_str_array = np.array(ironmen)
 68 | print(ironmen_str_array.dtype)
 69 | print("---") # separator
 70 | 
 71 | # convert to int64
 72 | ironmen_int_array = ironmen_str_array.astype(np.int64)
 73 | print(ironmen_int_array.dtype)
 74 | ```
 75 | 
 76 | ![day1303](https://storage.googleapis.com/2017_ithome_ironman/day1303.png)
 77 | 
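 78 | ### Selecting with index values
 79 | 
 80 | Use `[]` with index values to select elements of an ndarray. The bracket syntax is the same as in R, but Python indices start at 0 rather than 1.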
 81 | 
 82 | ```python
 83 | import numpy as np
 84 | 
 85 | my_array = np.arange(10)
 86 | print(my_array[0])
 87 | print(my_array[0:5])
 88 | print("---") # separator
 89 | 
 90 | my_2d_array = np.array([np.arange(0, 5), np.arange(5, 10)])
 91 | print(my_2d_array)
 92 | print("---") # separator
 93 | print(my_2d_array[1, :]) # second row
 94 | print(my_2d_array[:, 1]) # second column
 95 | print(my_2d_array[1, 1]) # element in the second row, second column
 96 | ```
 97 | 
 98 | ![day1304](https://storage.googleapis.com/2017_ithome_ironman/day1304.png)
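
For readers coming from R, the indexing differences are easy to trip over. The following is a small sketch of the three that matter most here (the variable names `first`, `last`, and `first_five` are assumptions for illustration): indices start at 0, a negative index counts from the end instead of dropping an element, and the stop value of a slice is excluded.

```python
import numpy as np

my_array = np.arange(10)

first = my_array[0]         # first element; R would use my_array[1]
last = my_array[-1]         # last element; in R a negative index drops elements instead
first_five = my_array[0:5]  # positions 0 to 4; the stop index 5 is excluded

print(first, last, first_five)
```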
 99 | 
100 | ### Selecting with boolean values
101 | 
102 | Use boolean (bool) values to select elements of an ndarray, just as you would in R.
103 | 
104 | ```python
105 | import numpy as np
106 | 
107 | ironmen = [56, 8, 19, 14, 6, 71]
108 | groups = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"]
109 | ironmen_array = np.array(ironmen)
110 | groups_array = np.array(groups)
111 | 
112 | # use the head counts to select groups
113 | print(ironmen_array >= 10) # boolean array
114 | print(groups_array[ironmen_array >= 10]) # groups with at least 10 participants
115 | 
116 | # use the group names to select head counts
117 | print(groups_array != "自我挑戰組") # boolean array
118 | print(ironmen_array[groups_array != "自我挑戰組"]) # head counts of every group except 自我挑戰組
119 | ```
120 | 
121 | ![day1305](https://storage.googleapis.com/2017_ithome_ironman/day1305.png)
122 | 
123 | ### Transposing a 2d array
124 | 
125 | Use the `T` attribute.
126 | 
127 | ```python
128 | import numpy as np
129 | 
130 | # create a 2d array
131 | my_1d_array = np.arange(10)
132 | my_2d_array = my_1d_array.reshape((2, 5))
133 | print(my_2d_array)
134 | print("---") # separator
135 | print(my_2d_array.T)
136 | ```
137 | 
138 | ![day1306](https://storage.googleapis.com/2017_ithome_ironman/day1306.png)
139 | 
140 | ### numpy's where function
141 | 
142 | Use `numpy`'s `where()` function to apply conditional logic to the elements of an ndarray.
143 | 
144 | ```python
145 | import numpy as np
146 | 
147 | ironmen_array = np.array([56, 8, 19, 14, 6, np.nan])
148 | np.where(np.isnan(ironmen_array), 71, ironmen_array)
149 | ```
150 | 
151 | ![day1307](https://storage.googleapis.com/2017_ithome_ironman/day1307.png)
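
`np.where(condition, x, y)` works like a vectorized if/else, much like `ifelse()` in R: it takes `x` where the condition is true and `y` elsewhere. As a sketch with made-up labels (`"large"`, `"small"`, and the name `size_labels` are assumptions for illustration), the same call can classify each group by head count:

```python
import numpy as np

ironmen_array = np.array([56, 8, 19, 14, 6, 71])

# "large" where the head count is at least 10, "small" elsewhere
size_labels = np.where(ironmen_array >= 10, "large", "small")
print(size_labels)
```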
152 | 
153 | ### Sorting
154 | 
155 | Use the `sort()` method, which sorts the ndarray in place.
156 | 
157 | ```python
158 | import numpy as np
159 | 
160 | ironmen_array = np.array([56, 8, 19, 14, 6, 71])
161 | print(ironmen_array)
162 | ironmen_array.sort()
163 | print(ironmen_array)
164 | ```
165 | 
166 | ![day1308](https://storage.googleapis.com/2017_ithome_ironman/day1308.png)
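
Because the `sort()` method changes the array itself, it is sometimes handy to keep the original order. A minimal sketch of the alternatives (the names `sorted_copy`, `descending`, and `order` are assumptions for illustration): `np.sort()` returns a sorted copy, and `np.argsort()` returns the positions that would sort the array.

```python
import numpy as np

ironmen_array = np.array([56, 8, 19, 14, 6, 71])

sorted_copy = np.sort(ironmen_array)  # sorted copy; ironmen_array is unchanged
descending = sorted_copy[::-1]        # reverse the copy for descending order
order = np.argsort(ironmen_array)     # positions that would sort the array

print(ironmen_array)
print(sorted_copy)
print(descending)
print(order)
```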
167 | 
168 | ### Random numbers
169 | 
170 | The functions in `numpy`'s `random` module generate random numbers.
171 | 
172 | ```python
173 | import numpy as np
174 | 
175 | normal_samples = np.random.normal(size = 10) # draw 10 samples from the standard normal distribution (mean 0, standard deviation 1)
176 | uniform_samples = np.random.uniform(size = 10) # draw 10 samples from the uniform distribution between 0 and 1
177 | 
178 | print(normal_samples)
179 | print("---") # separator
180 | print(uniform_samples)
181 | ```
182 | 
183 | ![day1309](https://storage.googleapis.com/2017_ithome_ironman/day1309.png)
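
The samples above change on every run. If reproducible results are needed, one option is a seeded generator, similar in spirit to `set.seed()` in R. This sketch assumes NumPy 1.17 or later, where `np.random.default_rng()` is available; the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seeded generator; same seed, same samples

normal_samples = rng.normal(size=10)    # standard normal samples
uniform_samples = rng.uniform(size=10)  # uniform samples in [0, 1)

print(normal_samples)
print("---")
print(uniform_samples)
```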
184 | 
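185 | ## Summary
186 | 
187 | On day 13 we covered the attributes and methods of the `numpy` package and the **ndarray**, including creation, type conversion, selection, and sorting. Some of them belong to the `numpy` package itself, while others belong to objects created from the **ndarray** data structure, which makes them good practice for getting used to object-oriented concepts.
188 | 
189 | ## References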
190 | 
191 | - [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do)
192 | - [Quickstart tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)


--------------------------------------------------------------------------------
/day14.md:
--------------------------------------------------------------------------------
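  1 | # [Day 14] Common attributes and methods (3): Data Frame
  2 | 
  3 | ---
  4 | 
  5 | Beyond Python's basic data structures (list, tuple, and dictionary) and the ndarray from yesterday's notes, do you remember that in [[Day 06] Data structures (3): Data Frame](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day06.md) we used the data frame from the `pandas` package so that Python could have a data structure similar to R's data frame? We need to learn its common attributes and methods as well.
  6 | 
  7 | ## Common attributes and methods of pandas and data frames
  8 | 
  9 | ### Creating a data frame
 10 | 
 11 | Use `DataFrame()` from the `pandas` package to convert a **dictionary** into a data frame.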
 12 | 
 13 | ```python
 14 | import pandas as pd
 15 | 
 16 | # As of 11 a.m. on 2016-12-14, the groups in the 8th iT 邦幫忙 Ironman contest had 59, 9, 19, 14, 6, and 77 participants
 17 | groups = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"]
 18 | ironmen = [59, 9, 19, 14, 6, 77]
 19 | 
 20 | ironmen_dict = {
 21 |                 "groups": groups,
 22 |                 "ironmen": ironmen
 23 | }
 24 | 
 25 | ironmen_df = pd.DataFrame(ironmen_dict)
 26 | ironmen_df
 27 | ```
 28 | 
 29 | ![day1401](https://storage.googleapis.com/2017_ithome_ironman/day1401.png)
 30 | 
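 31 | If you have sharp eyes, you noticed that we did not specify an index when creating the data frame, yet the result automatically received one, similar to `row.names` in R. What a thoughtful design!
 32 | 
 33 | ### Getting an overview of a data frame
 34 | 
 35 | - `ndim` attribute
 36 | - `shape` attribute
 37 | - `dtypes` attribute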
 38 | 
 39 | ```python
 40 | import pandas as pd
 41 | 
 42 | groups = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"]
 43 | ironmen = [59, 9, 19, 14, 6, 77]
 44 | 
 45 | ironmen_dict = {
 46 |                 "groups": groups,
 47 |                 "ironmen": ironmen
 48 | }
 49 | 
 50 | # create the data frame
 51 | ironmen_df = pd.DataFrame(ironmen_dict)
 52 | 
 53 | # use the attributes
 54 | print(ironmen_df.ndim)
 55 | print("---") # separator
 56 | print(ironmen_df.shape)
 57 | print("---") # separator
 58 | print(ironmen_df.dtypes)
 59 | ```
 60 | 
 61 | ![day1402](https://storage.googleapis.com/2017_ithome_ironman/day1402.png)
 62 | 
 63 | ### Dropping observations or columns
 64 | 
 65 | A data frame's `drop()` method removes observations or columns: pass `axis = 0` to drop observations (rows) and `axis = 1` to drop columns.
 66 | 
 67 | ```python
 68 | import pandas as pd
 69 | 
 70 | groups = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"]
 71 | ironmen = [59, 9, 19, 14, 6, 77]
 72 | 
 73 | ironmen_dict = {
 74 |                 "groups": groups,
 75 |                 "ironmen": ironmen
 76 | }
 77 | 
 78 | # create the data frame
 79 | ironmen_df = pd.DataFrame(ironmen_dict)
 80 | 
 81 | # drop an observation
 82 | ironmen_df_no_mw = ironmen_df.drop(0, axis = 0)
 83 | print(ironmen_df_no_mw)
 84 | print("---") # separator
 85 | 
 86 | # drop a column
 87 | ironmen_df_no_groups = ironmen_df.drop("groups", axis = 1)
 88 | print(ironmen_df_no_groups)
 89 | ```
 90 | 
 91 | ![day1403](https://storage.googleapis.com/2017_ithome_ironman/day1403.png)
 92 | 
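 93 | ### Selecting from a data frame with the `loc` indexer
 94 | 
 95 | We can select from a data frame by index label. These notes originally used the `ix` attribute, which was deprecated in pandas 0.20 and removed in pandas 1.0; the label-based `loc` indexer used below gives the same results for these examples.
 96 | 
 97 | ```python
 98 | import pandas as pd
 99 | 
100 | groups = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"]
101 | ironmen = [59, 9, 19, 14, 6, 77]
102 | 
103 | ironmen_dict = {
104 |                 "groups": groups,
105 |                 "ironmen": ironmen
106 | }
107 | 
108 | # create the data frame
109 | ironmen_df = pd.DataFrame(ironmen_dict)
110 | 
111 | # select a column
112 | print(ironmen_df.loc[:, "groups"])
113 | print("---") # separator
114 | 
115 | # select an observation
116 | print(ironmen_df.loc[0])
117 | print("---") # separator
118 | 
119 | # select a column and an observation at the same time
120 | print(ironmen_df.loc[0, "groups"])
121 | ```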
122 | 
123 | ![day1404](https://storage.googleapis.com/2017_ithome_ironman/day1404.png)
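
Selecting by label and selecting by position are two different things in `pandas`, and the difference only becomes visible once the index is not the default 0, 1, 2. A short sketch, using a trimmed copy of the data with the group names as the index (the names `by_label` and `by_position` are assumptions for illustration):

```python
import pandas as pd

ironmen_df = pd.DataFrame(
    {"ironmen": [59, 9, 19]},
    index=["Modern Web", "DevOps", "Cloud"]
)

by_label = ironmen_df.loc["DevOps", "ironmen"]  # loc selects by row and column label
by_position = ironmen_df.iloc[1, 0]             # iloc selects by integer position

print(by_label, by_position)
```

Both lines pick the same cell; `loc` is the closer match to selecting by `row.names` in R, while `iloc` is the closer match to selecting by row number.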
124 | 
125 | ### Selecting from a data frame with boolean values
126 | 
127 | ```python
128 | import pandas as pd
129 | 
130 | groups = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"]
131 | ironmen = [59, 9, 19, 14, 6, 77]
132 | 
133 | ironmen_dict = {
134 |                 "groups": groups,
135 |                 "ironmen": ironmen
136 | }
137 | 
138 | # create the data frame
139 | ironmen_df = pd.DataFrame(ironmen_dict)
140 | 
141 | is_large = ironmen_df["ironmen"] > 10 # more than 10 participants
142 | ironmen_df[is_large] # select the matching rows of the data frame
143 | ```
144 | 
145 | ![day1405](https://storage.googleapis.com/2017_ithome_ironman/day1405.png)
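
Several conditions can be combined into one boolean mask. In `pandas` this is done with `&` (and) and `|` (or) rather than the Python keywords `and` and `or`, and each condition needs its own parentheses. A minimal sketch on the same data (the names `mask` and `filtered` are assumptions for illustration):

```python
import pandas as pd

groups = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"]
ironmen = [59, 9, 19, 14, 6, 77]
ironmen_df = pd.DataFrame({"groups": groups, "ironmen": ironmen})

# more than 10 participants AND not the 自我挑戰組 group
mask = (ironmen_df["ironmen"] > 10) & (ironmen_df["groups"] != "自我挑戰組")
filtered = ironmen_df[mask]
print(filtered)
```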
146 | 
147 | ### Sorting
148 | 
149 | - `sort_index()` method
150 | - `sort_values()` method
151 | 
152 | The `sort_index()` method of a data frame sorts by index.
153 | 
154 | ```python
155 | import pandas as pd
156 | 
157 | groups = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"]
158 | ironmen = [59, 9, 19, 14, 6, 77]
159 | 
160 | # create the data frame
161 | ironmen_df = pd.DataFrame(ironmen, columns = ["ironmen"], index = groups)
162 | 
163 | # sort by index
164 | ironmen_df.sort_index()
165 | ```
166 | 
167 | ![day1406](https://storage.googleapis.com/2017_ithome_ironman/day1406.png)
168 | 
169 | The `sort_values()` method of a data frame sorts by the values of a given column.
170 | 
171 | ```python
172 | import pandas as pd
173 | 
174 | groups = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"]
175 | ironmen = [59, 9, 19, 14, 6, 77]
176 | 
177 | # create the data frame
178 | ironmen_df = pd.DataFrame(ironmen, columns = ["ironmen"], index = groups)
179 | 
180 | # sort by value
181 | ironmen_df.sort_values(by = "ironmen")
182 | ```
183 | 
184 | ![day1407](https://storage.googleapis.com/2017_ithome_ironman/day1407.png)
185 | 
186 | ### Descriptive statistics
187 | 
188 | A data frame provides statistical methods such as `sum()`, `mean()`, `median()`, and `describe()`.
189 | 
190 | ```python
191 | import pandas as pd
192 | 
193 | groups = ["Modern Web", "DevOps", "Cloud", "Big Data", "Security", "自我挑戰組"]
194 | ironmen = [59, 9, 19, 14, 6, 77]
195 | 
196 | ironmen_dict = {
197 |                 "groups": groups,
198 |                 "ironmen": ironmen
199 | }
200 | 
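201 | # create the data frame
202 | ironmen_df = pd.DataFrame(ironmen_dict)
203 | 
204 | print(ironmen_df.sum(numeric_only = True)) # total number of participants (numeric columns only)
205 | print("---") # separator
206 | print(ironmen_df.mean(numeric_only = True)) # mean number of participants (skips the text column)
207 | print("---") # separator
208 | print(ironmen_df.median(numeric_only = True)) # median number of participants (skips the text column)
209 | print("---") # separator
210 | print(ironmen_df.describe()) # descriptive statistics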
211 | ```
212 | 
213 | ![day140801](https://storage.googleapis.com/2017_ithome_ironman/day140801.png)
214 | 
215 | ### Counting distinct values
216 | 
217 | The `value_counts()` method of a column (a `pandas` Series) counts how many times each distinct value occurs.
218 | 
219 | ```python
220 | import pandas as pd
221 | 
222 | gender = ["Male", "Male", "Female", "Male", "Male", "Male", "Female", "Male", "Male"]
223 | name = ["蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "騙人布", "文斯莫克·香吉士", "多尼多尼·喬巴", "妮可·羅賓", "佛朗基", "布魯克"]
224 | 
225 | # create the data frame
226 | ironmen_df = pd.DataFrame(gender, columns = ["gender"], index = name)
227 | 
228 | # count the observations for each gender
229 | ironmen_df["gender"].value_counts()
230 | ```
231 | 
232 | ![day1409](https://storage.googleapis.com/2017_ithome_ironman/day1409.png)
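
R users often follow `table()` with `prop.table()` to get proportions. A rough equivalent, sketched here on the same gender data, is the `normalize` argument of `value_counts()` (the names `counts` and `shares` are assumptions for illustration):

```python
import pandas as pd

gender = pd.Series(["Male", "Male", "Female", "Male", "Male", "Male", "Female", "Male", "Male"])

counts = gender.value_counts()                # number of observations per value
shares = gender.value_counts(normalize=True)  # proportion of observations per value

print(counts)
print(shares)
```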
233 | 
234 | ### Missing values
235 | 
236 | #### Detecting missing values
237 | 
238 | - `isnull()` method
239 | - `notnull()` method
240 | 
241 | ```python
242 | import numpy as np
243 | import pandas as pd
244 | 
245 | groups = ["Modern Web", "DevOps", np.nan, "Big Data", "Security", "自我挑戰組"]
246 | ironmen = [59, 9, 19, 14, 6, np.nan]
247 | 
248 | ironmen_dict = {
249 |                 "groups": groups,
250 |                 "ironmen": ironmen
251 | }
252 | 
253 | # create the data frame
254 | ironmen_df = pd.DataFrame(ironmen_dict)
255 | 
256 | print(ironmen_df.loc[:, "groups"].isnull()) # which group names are missing
257 | print("---") # separator
258 | print(ironmen_df.loc[:, "ironmen"].notnull()) # which head counts are not missing
259 | ```
260 | 
261 | ![day1410](https://storage.googleapis.com/2017_ithome_ironman/day1410.png)
262 | 
263 | #### Handling missing values
264 | 
265 | - `dropna()` method
266 | - `fillna()` method
267 | 
268 | ```python
269 | import numpy as np
270 | import pandas as pd
271 | 
272 | groups = ["Modern Web", "DevOps", np.nan, "Big Data", "Security", "自我挑戰組"]
273 | ironmen = [59, 9, 19, 14, 6, np.nan]
274 | 
275 | ironmen_dict = {
276 |                 "groups": groups,
277 |                 "ironmen": ironmen
278 | }
279 | 
280 | # create the data frame
281 | ironmen_df = pd.DataFrame(ironmen_dict)
282 | 
283 | ironmen_df_na_dropped = ironmen_df.dropna() # drop every observation that contains a missing value
284 | print(ironmen_df_na_dropped)
285 | print("---") # separator
286 | ironmen_df_na_filled = ironmen_df.fillna(0) # fill every missing value with 0
287 | print(ironmen_df_na_filled)
288 | print("---") # separator
289 | ironmen_df_na_filled = ironmen_df.fillna({"groups": "Cloud", "ironmen": 71}) # fill missing values column by column
290 | print(ironmen_df_na_filled)
291 | ```
292 | 
293 | ![day1411](https://storage.googleapis.com/2017_ithome_ironman/day1411.png)
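
Before deciding whether to drop or fill, it usually helps to know how many values are missing. As a small sketch on the same data (the names `missing_per_column` and `complete_rows` are assumptions for illustration), chaining `isnull()` with `sum()` counts the missing values in each column, much like `colSums(is.na(df))` in R:

```python
import numpy as np
import pandas as pd

groups = ["Modern Web", "DevOps", np.nan, "Big Data", "Security", "自我挑戰組"]
ironmen = [59, 9, 19, 14, 6, np.nan]
ironmen_df = pd.DataFrame({"groups": groups, "ironmen": ironmen})

missing_per_column = ironmen_df.isnull().sum()  # True counts as 1, so the sum is the number missing
complete_rows = ironmen_df.dropna().shape[0]    # rows left after dropping incomplete observations

print(missing_per_column)
print(complete_rows)
```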
294 | 
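295 | ## Summary
296 | 
297 | On day 14 we covered the attributes and methods of the `pandas` package and the **data frame**, including creation, selection, and sorting. Some of them belong to the `pandas` package itself, while others belong to objects created from the **data frame** data structure, which makes them good practice for getting used to object-oriented concepts.
298 | 
299 | ## References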
300 | 
301 | - [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do)
302 | - [pandas: powerful Python data analysis toolkit](http://pandas.pydata.org/pandas-docs/stable/)


--------------------------------------------------------------------------------
/day15.md:
--------------------------------------------------------------------------------
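  1 | # [Day 15] Loading data
  2 | 
  3 | ---
  4 | 
  5 | So far, every data structure we practiced with in these notes was created by hand, whether it was Python's built-in list, tuple, or dictionary, or the ndarray and data frame from the `numpy` and `pandas` packages. In applied data science, however, data is usually not typed in by hand; it is loaded into the working environment (R or Python) and then processed and analyzed.
  6 | 
  7 | We pick a few common data formats and load each of them with both R and Python.
  8 | 
  9 | - csv
 10 | - Data with other delimiters
 11 | - Excel spreadsheets
 12 | - JSON
 13 | 
 14 | ## Loading csv
 15 | 
 16 | As the name suggests, files with the **.csv** extension hold **comma separated values**, the most common format for tabular data.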
 17 | 
 18 | ![day1501](https://storage.googleapis.com/2017_ithome_ironman/day1501.png)
 19 | 
 20 | ### R
 21 | 
 22 | We use the `read.csv()` function to load it.
 23 | 
 24 | ```
 25 | url <- "https://storage.googleapis.com/2017_ithome_ironman/data/iris.csv" # a csv file stored in the cloud
 26 | iris_df <- read.csv(url)
 27 | head(iris_df)
 28 | ```
 29 | 
 30 | ![day1502](https://storage.googleapis.com/2017_ithome_ironman/day1502.png)
 31 | 
 32 | ### Python
 33 | 
 34 | We use `read_csv()` from the `pandas` package to load it.
 35 | 
 36 | ```python
 37 | import pandas as pd
 38 | 
 39 | url = "https://storage.googleapis.com/2017_ithome_ironman/data/iris.csv" # a csv file stored in the cloud
 40 | iris_df = pd.read_csv(url)
 41 | iris_df.head()
 42 | ```
 43 | 
 44 | ![day1503](https://storage.googleapis.com/2017_ithome_ironman/day1503.png)
 45 | 
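 46 | ## Loading data with other delimiters
 47 | 
 48 | Besides **comma separated values**, fields can be separated in other ways: by tabs (`"\t"`, tab separated values), by spaces (`" "`), or by colons (`":"`). For data that uses a different delimiter (delimiters/separators), we specify the `sep = ` argument when loading it.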
 49 | 
 50 | ![day1504](https://storage.googleapis.com/2017_ithome_ironman/day1504.png)
 51 | 
 52 | ![day1505](https://storage.googleapis.com/2017_ithome_ironman/day1505.png)
 53 | 
 54 | ### R
 55 | 
 56 | We use the `read.table()` function and set the `sep = ` argument to match the delimiter.
 57 | 
 58 | #### Separated by tabs ("\t")
 59 | 
 60 | ```
 61 | url <- "https://storage.googleapis.com/2017_ithome_ironman/data/iris.tsv" # a tsv file stored in the cloud
 62 | iris_tsv_df <- read.table(url, sep = "\t", header = TRUE)
 63 | head(iris_tsv_df)
 64 | ```
 65 | 
 66 | ![day1506](https://storage.googleapis.com/2017_ithome_ironman/day1506.png)
 67 | 
 68 | #### Separated by colons (":")
 69 | 
 70 | ```
 71 | url <- "https://storage.googleapis.com/2017_ithome_ironman/data/iris.txt" # a txt file stored in the cloud
 72 | iris_colon_sep_df <- read.table(url, sep = ":", header = TRUE)
 73 | head(iris_colon_sep_df)
 74 | ```
 75 | 
 76 | ![day1507](https://storage.googleapis.com/2017_ithome_ironman/day1507.png)
 77 | 
 78 | ### Python
 79 | 
 80 | We use `read_table()` from the `pandas` package and set the `sep = ` argument to match the delimiter.
 81 | 
 82 | #### Separated by tabs ("\t")
 83 | 
 84 | ```python
 85 | import pandas as pd
 86 | 
 87 | url = "https://storage.googleapis.com/2017_ithome_ironman/data/iris.tsv" # a tsv file stored in the cloud
 88 | iris_tsv_df = pd.read_table(url, sep = "\t")
 89 | iris_tsv_df.head()
 90 | ```
 91 | 
 92 | ![day1508](https://storage.googleapis.com/2017_ithome_ironman/day1508.png)
 93 | 
 94 | #### Separated by colons (":")
 95 | 
 96 | ```python
 97 | import pandas as pd
 98 | 
 99 | url = "https://storage.googleapis.com/2017_ithome_ironman/data/iris.txt" # a txt file stored in the cloud
100 | iris_colon_sep_df = pd.read_table(url, sep = ":")
101 | iris_colon_sep_df.head()
102 | ```
103 | 
104 | ![day1509](https://storage.googleapis.com/2017_ithome_ironman/day1509.png)
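
The `sep` argument can be tried out without downloading anything. The sketch below reads small in-memory strings through `io.StringIO` (the sample text and the names `colon_df` and `space_df` are assumptions for illustration); `read_csv()` accepts the same `sep` argument as `read_table()`, and a regular expression such as `\s+` matches one or more whitespace characters.

```python
import io
import pandas as pd

colon_text = "name:age\nLuffy:19\nZoro:21\n"
space_text = "name age\nLuffy 19\nZoro 21\n"

colon_df = pd.read_csv(io.StringIO(colon_text), sep=":")     # colon separated
space_df = pd.read_csv(io.StringIO(space_text), sep=r"\s+")  # whitespace separated

print(colon_df)
print(space_df)
```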
105 | 
106 | ## Loading Excel spreadsheets
107 | 
108 | We use an Excel spreadsheet with the `.xlsx` extension as the example.
109 | 
110 | ![day1510](https://storage.googleapis.com/2017_ithome_ironman/day1510.png)
111 | 
112 | ### R
113 | 
114 | We use the `read_excel()` function from the `readxl` package to load it.
115 | 
116 | ```
117 | library(readxl)
118 | 
119 | file_path <- "~/Downloads/iris.xlsx" # read_excel does not support https for now, so first download the spreadsheet from https://storage.googleapis.com/2017_ithome_ironman/data/iris.xlsx
120 | 
121 | iris_xlsx_df <- read_excel(file_path)
122 | head(iris_xlsx_df)
123 | ```
124 | 
125 | ![day1511](https://storage.googleapis.com/2017_ithome_ironman/day1511.png)
126 | 
127 | ### Python
128 | 
129 | We use `read_excel()` from the `pandas` package to load it.
130 | 
131 | ```python
132 | import pandas as pd
133 | 
134 | url = "https://storage.googleapis.com/2017_ithome_ironman/data/iris.xlsx" # an Excel spreadsheet stored in the cloud
135 | iris_xlsx_df = pd.read_excel(url)
136 | iris_xlsx_df.head()
137 | ```
138 | 
139 | ![day1512](https://storage.googleapis.com/2017_ithome_ironman/day1512.png)
140 | 
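141 | ## Loading JSON
142 | 
143 | JSON (JavaScript Object Notation) is a major format for exchanging data on the web and for storing data in NoSQL (Not only SQL) databases. Both R and Python have packages that load JSON data and convert it into the data frame we are familiar with.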
144 | 
145 | ![day1513](https://storage.googleapis.com/2017_ithome_ironman/day1513.png)
146 | 
147 | ### R
148 | 
149 | We use the `fromJSON()` function from the `jsonlite` package to load it.
150 | 
151 | ```
152 | library(jsonlite)
153 | 
154 | url <- "https://storage.googleapis.com/2017_ithome_ironman/data/iris.json" # a JSON file stored in the cloud
155 | iris_json_df <- fromJSON(url)
156 | head(iris_json_df)
157 | ```
158 | 
159 | ![day1514](https://storage.googleapis.com/2017_ithome_ironman/day1514.png)
160 | 
161 | ### Python
162 | 
163 | We use `read_json()` from the `pandas` package to load it.
164 | 
165 | ```python
166 | import pandas as pd
167 | 
168 | url = "https://storage.googleapis.com/2017_ithome_ironman/data/iris.json" # a JSON file stored in the cloud
169 | iris_json_df = pd.read_json(url)
170 | iris_json_df.head()
171 | ```
172 | 
173 | ![day1515](https://storage.googleapis.com/2017_ithome_ironman/day1515.png)
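
To see how a JSON document maps onto a data frame, it can help to start from a tiny example. This sketch reads a made-up JSON string through `io.StringIO` (the sample records and the names `json_text` and `people_df` are assumptions for illustration): a JSON array of objects becomes one row per object and one column per key.

```python
import io
import pandas as pd

json_text = '[{"name": "Luffy", "age": 19}, {"name": "Zoro", "age": 21}]'

people_df = pd.read_json(io.StringIO(json_text))  # one row per JSON object
print(people_df)
```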
174 | 
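175 | ## Summary
176 | 
177 | On day 15 we covered how to read csv files, data with other delimiters, Excel spreadsheets, and JSON into Python. With `read_csv()`, `read_table()`, `read_excel()`, and `read_json()` from the `pandas` package we can convert different data formats into the familiar data frame, and we compared them with the R functions that read the same formats.
178 | 
179 | ## References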
180 | 
181 | - [JSON - 維基百科，自由的百科全書](https://zh.wikipedia.org/wiki/JSON)
182 | - [Getting started with JSON and jsonlite](https://cran.r-project.org/web/packages/jsonlite/vignettes/json-aaquickstart.html)
183 | - [API Reference - pandas 0.19.1 documentation](http://pandas.pydata.org/pandas-docs/stable/api.html)
184 | - <https://cran.r-project.org/web/packages/readxl/readxl.pdf>


--------------------------------------------------------------------------------
/day17.md:
--------------------------------------------------------------------------------
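  1 | # [Day 17] Data wrangling
  2 | 
  3 | ---
  4 | 
  5 | By now we feel fairly confident when facing tabular data, Excel spreadsheets, JSON, or web page data: with the `pandas`, `requests`, and `BeautifulSoup` packages we can easily load data in different formats into Python. If loading data into Python is still unclear to you, I recommend reading the previous two days' notes, [[Day 15] Loading data](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day15.md) and [[Day 16] Parsing web pages](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day16.md).
  6 | 
  7 | Once the data is loaded, the chore that follows is data manipulation, also called data munging or, to use a trendier term, data wrangling. The goal of data wrangling is to get the data into the format that a visualization or a machine learning model requires.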
  8 | 
  9 | > Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
 10 | > [For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights](https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0)
 11 | 
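 12 | We discuss a few common techniques and practice each of them with Python and R.
 13 | 
 14 | - Joining
 15 | - Concatenating
 16 | - Transposing
 17 | - Removing duplicates
 18 | - Binning
 19 | - Exporting
 20 | 
 21 | ## Joining
 22 | 
 23 | This works like a join between database tables.
 24 | 
 25 | ### Python
 26 | 
 27 | We use `merge()` from the `pandas` package.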
 28 | 
 29 | ```python
 30 | import pandas as pd
 31 | 
 32 | name = ["蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "騙人布", "賓什莫克·香吉士", "多尼多尼·喬巴", "妮可·羅賓", "佛朗基", "布魯克"]
 33 | occupation = ["船長", "劍士", "航海士", "狙擊手", "廚師", "醫生", "考古學家", "船匠", "音樂家"]
 34 | 
 35 | # build the dict
 36 | straw_hat_dict = {"name": name,
 37 |                   "occupation": occupation
 38 | }
 39 | 
 40 | # create the first data frame
 41 | straw_hat_df = pd.DataFrame(straw_hat_dict)
 42 | 
 43 | name = ["蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓", "布魯克"]
 44 | devil_fruit = ["橡膠果實", "人人果實", "花花果實", "黃泉果實"]
 45 | 
 46 | # build the dict
 47 | devil_fruit_dict = {"name": name,
 48 |                     "devil_fruit": devil_fruit
 49 | }
 50 | 
 51 | # create the second data frame
 52 | devil_fruit_df = pd.DataFrame(devil_fruit_dict)
 53 | 
 54 | # join
 55 | straw_hat_merged = pd.merge(straw_hat_df, devil_fruit_df)
 56 | straw_hat_merged
 57 | ```
 58 | 
 59 | ![day1701](https://storage.googleapis.com/2017_ithome_ironman/day1701.png)
 60 | 
 61 | `merge()` from the `pandas` package performs an **inner join** by default. To join in a different way, set the `how = ` argument to `left`, `right`, or `outer`.
 62 | 
 63 | ```python
 64 | import pandas as pd
 65 | 
 66 | name = ["蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "騙人布", "賓什莫克·香吉士", "多尼多尼·喬巴", "妮可·羅賓", "佛朗基", "布魯克"]
 67 | occupation = ["船長", "劍士", "航海士", "狙擊手", "廚師", "醫生", "考古學家", "船匠", "音樂家"]
 68 | 
 69 | # build the dict
 70 | straw_hat_dict = {"name": name,
 71 |                   "occupation": occupation
 72 | }
 73 | 
 74 | # create the first data frame
 75 | straw_hat_df = pd.DataFrame(straw_hat_dict)
 76 | 
 77 | name = ["蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓", "布魯克"]
 78 | devil_fruit = ["橡膠果實", "人人果實", "花花果實", "黃泉果實"]
 79 | 
 80 | # build the dict
 81 | devil_fruit_dict = {"name": name,
 82 |                     "devil_fruit": devil_fruit
 83 | }
 84 | 
 85 | # create the second data frame
 86 | devil_fruit_df = pd.DataFrame(devil_fruit_dict)
 87 | 
 88 | # join
 89 | straw_hat_merged = pd.merge(straw_hat_df, devil_fruit_df, how = "left")
 90 | straw_hat_merged
 91 | ```
 92 | 
 93 | ![day1702](https://storage.googleapis.com/2017_ithome_ironman/day1702.png)
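
When a join returns fewer or more rows than expected, it helps to see where each row came from. As a sketch on a trimmed copy of the data (three crew members, one devil fruit; the names `inner_joined` and `left_joined` are assumptions for illustration), the `indicator` argument of `merge()` adds a `_merge` column that labels each row as `both`, `left_only`, or `right_only`:

```python
import pandas as pd

straw_hat_df = pd.DataFrame({
    "name": ["蒙其·D·魯夫", "羅羅亞·索隆", "娜美"],
    "occupation": ["船長", "劍士", "航海士"]
})
devil_fruit_df = pd.DataFrame({
    "name": ["蒙其·D·魯夫"],
    "devil_fruit": ["橡膠果實"]
})

# inner join keeps only the names found in both data frames
inner_joined = pd.merge(straw_hat_df, devil_fruit_df, on="name")

# left join keeps every row of the left data frame; unmatched rows get NaN
left_joined = pd.merge(straw_hat_df, devil_fruit_df, on="name", how="left", indicator=True)

print(inner_joined)
print(left_joined)
```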
 94 | 
 95 | ### R
 96 | 
 97 | We use the `merge()` function.
 98 | 
 99 | ```
100 | name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "騙人布", "賓什莫克·香吉士", "多尼多尼·喬巴", "妮可·羅賓", "佛朗基", "布魯克")
101 | occupation <- c("船長", "劍士", "航海士", "狙擊手", "廚師", "醫生", "考古學家", "船匠", "音樂家")
102 | 
103 | # create the first data frame
104 | straw_hat_df = data.frame(name, occupation)
105 | 
106 | name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓", "布魯克")
107 | devil_fruit <- c("橡膠果實", "人人果實", "花花果實", "黃泉果實")
108 | 
109 | # create the second data frame
110 | devil_fruit_df = data.frame(name, devil_fruit)
111 | 
112 | # join
113 | straw_hat_merged = merge(straw_hat_df, devil_fruit_df)
114 | View(straw_hat_merged)
115 | ```
116 | 
117 | ![day1703](https://storage.googleapis.com/2017_ithome_ironman/day1703.png)
118 | 
119 | R's `merge()` function also performs an **inner join** by default. To join in a different way, set the `all.x = ` and `all.y = ` arguments.
120 | 
121 | ```
122 | name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "騙人布", "賓什莫克·香吉士", "多尼多尼·喬巴", "妮可·羅賓", "佛朗基", "布魯克")
123 | occupation <- c("船長", "劍士", "航海士", "狙擊手", "廚師", "醫生", "考古學家", "船匠", "音樂家")
124 | 
125 | # create the first data frame
126 | straw_hat_df = data.frame(name, occupation)
127 | 
128 | name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓", "布魯克")
129 | devil_fruit <- c("橡膠果實", "人人果實", "花花果實", "黃泉果實")
130 | 
131 | # create the second data frame
132 | devil_fruit_df = data.frame(name, devil_fruit)
133 | 
134 | # join
135 | straw_hat_merged = merge(straw_hat_df, devil_fruit_df, all.x = TRUE)
136 | View(straw_hat_merged)
137 | ```
138 | 
139 | ![day1704](https://storage.googleapis.com/2017_ithome_ironman/day1704.png)
140 | 
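141 | ## Concatenating
142 | 
143 | This adds an observation or a variable column.
144 | 
145 | ### Python
146 | 
147 | We use `concat()` from the `pandas` package. By default it appends observations (rows); passing `axis = 1` adds a variable column instead.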
148 | 
149 | ```python
150 | import pandas as pd
151 | 
152 | name = ["蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "騙人布", "賓什莫克·香吉士", "多尼多尼·喬巴", "妮可·羅賓", "佛朗基", "布魯克"]
153 | occupation = ["船長", "劍士", "航海士", "狙擊手", "廚師", "醫生", "考古學家", "船匠", "音樂家"]
154 | 
155 | # build the dict
156 | straw_hat_dict = {"name": name,
157 |                   "occupation": occupation
158 | }
159 | 
160 | # create the first data frame
161 | straw_hat_df = pd.DataFrame(straw_hat_dict)
162 | 
163 | name = ["娜菲鲁塔利·薇薇"]
164 | occupation = ["阿拉巴斯坦王國公主"]
165 | princess_vivi_dict = {"name": name,
166 |                       "occupation": occupation
167 | }
168 | 
169 | # create the second data frame
170 | princess_vivi_df = pd.DataFrame(princess_vivi_dict, index = [9])
171 | 
172 | # add an observation
173 | straw_hat_df_w_vivi = pd.concat([straw_hat_df, princess_vivi_df])
174 | straw_hat_df_w_vivi
175 | 
176 | age = [19, 21, 20, 19, 21, 17, 30, 36, 90, 18]
177 | age_dict = {"age": age
178 | }
179 | 
180 | # create the third data frame
181 | age_df = pd.DataFrame(age_dict)
182 | 
183 | # add a variable column
184 | straw_hat_df_w_vivi_age = pd.concat([straw_hat_df_w_vivi, age_df], axis = 1)
185 | straw_hat_df_w_vivi_age
186 | ```
187 | 
188 | ![day1705](https://storage.googleapis.com/2017_ithome_ironman/day1705.png)
189 | 
190 | ![day1706](https://storage.googleapis.com/2017_ithome_ironman/day1706.png)
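
The example above gives the new observation `index = [9]` by hand so that it lines up after rows 0 to 8. An alternative, sketched here with placeholder data (the names `first_df`, `second_df`, `stacked`, and `renumbered` are assumptions for illustration), is to let `concat()` renumber the rows with `ignore_index`:

```python
import pandas as pd

first_df = pd.DataFrame({"name": ["A", "B"]})
second_df = pd.DataFrame({"name": ["C"]})

stacked = pd.concat([first_df, second_df])                        # keeps the original index labels
renumbered = pd.concat([first_df, second_df], ignore_index=True)  # renumbers the rows from 0

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Without `ignore_index`, the label 0 appears twice, which matters later because `concat()` with `axis = 1` aligns rows by index label.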
191 | 
192 | ### R
193 | 
194 | We use the `rbind()` function to add an observation and the `cbind()` function to add a variable column.
195 | 
196 | ```
197 | name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "騙人布", "賓什莫克·香吉士", "多尼多尼·喬巴", "妮可·羅賓", "佛朗基", "布魯克")
198 | occupation <- c("船長", "劍士", "航海士", "狙擊手", "廚師", "醫生", "考古學家", "船匠", "音樂家")
199 | 
200 | # create the first data frame
201 | straw_hat_df = data.frame(name, occupation)
202 | straw_hat_df$name <- as.character(straw_hat_df$name)
203 | straw_hat_df$occupation <- as.character(straw_hat_df$occupation)
204 | 
205 | # add an observation
206 | princess_vivi <- c("娜菲鲁塔利·薇薇", "阿拉巴斯坦王國公主")
207 | straw_hat_df_w_vivi <- rbind(straw_hat_df, princess_vivi)
208 | View(straw_hat_df_w_vivi)
209 | 
210 | # add a variable column
211 | age <- c(19, 21, 20, 19, 21, 17, 30, 36, 90, 18)
212 | straw_hat_df_w_vivi_age <- cbind(straw_hat_df_w_vivi, age)
213 | View(straw_hat_df_w_vivi_age)
214 | ```
215 | 
216 | ![day1707](https://storage.googleapis.com/2017_ithome_ironman/day1707.png)
217 | 
218 | ![day1708](https://storage.googleapis.com/2017_ithome_ironman/day1708.png)
219 | 
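220 | ## Transposing
221 | 
222 | Transposing here means converting between a wide table and a long table, an operation that is also commonly called reshaping.
223 | 
224 | ### Python
225 | 
226 | We use the `stack()` method of a data frame to turn a wide table into a long one, and the `unstack()` method to turn the long table back into a wide one.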
227 | 
228 | ```python
229 | import pandas as pd
230 | 
231 | name = ["蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "騙人布", "賓什莫克·香吉士", "多尼多尼·喬巴", "妮可·羅賓", "佛朗基", "布魯克"]
232 | age = [19, 21, 20, 19, 21, 17, 30, 36, 90]
233 | height = [174, 181, 170, 176, 180, 90, 188, 240, 277]
234 | 
235 | # build the dict
236 | straw_hat_dict = {
237 |     "name": name,
238 |     "age": age,
239 |     "height": height
240 | }
241 | 
242 | # create a wide table
243 | straw_hat_df_wide = pd.DataFrame(straw_hat_dict)
244 | 
245 | # convert to a long table
246 | straw_hat_df_long = straw_hat_df_wide.stack()
247 | straw_hat_df_long
248 | 
249 | # convert back to a wide table
250 | straw_hat_df_wide = straw_hat_df_long.unstack()
251 | straw_hat_df_wide
252 | ```
253 | 
254 | ![day1709](https://storage.googleapis.com/2017_ithome_ironman/day1709.png)
255 | 
256 | ![day1710](https://storage.googleapis.com/2017_ithome_ironman/day1710.png)
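
`stack()` moves every column, including `name`, into the long result. If the goal is a long table that keeps `name` as an identifier column, which is closer to what `gather()` and `spread()` from `tidyr` produce, one option is `melt()` together with `pivot()`. This is a sketch on a trimmed copy of the data, not the method used in the original notes; the column names `item` and `value` and the names `wide_df`, `long_df`, and `wide_again` are assumptions for illustration.

```python
import pandas as pd

wide_df = pd.DataFrame({
    "name": ["蒙其·D·魯夫", "羅羅亞·索隆", "娜美"],
    "age": [19, 21, 20],
    "height": [174, 181, 170]
})

# wide to long: one row per (name, item) pair
long_df = pd.melt(wide_df, id_vars="name", var_name="item", value_name="value")

# long to wide: one column per distinct item
wide_again = long_df.pivot(index="name", columns="item", values="value").reset_index()

print(long_df)
print(wide_again)
```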
257 | 
258 | ### R
259 | 
260 | We use the `gather()` function from the `tidyr` package to turn a wide table into a long one, and the `spread()` function to turn the long table back into a wide one.
261 | 
262 | ```
263 | library(tidyr)
264 | 
265 | name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "騙人布", "賓什莫克·香吉士", "多尼多尼·喬巴", "妮可·羅賓", "佛朗基", "布魯克")
266 | age <- c(19, 21, 20, 19, 21, 17, 30, 36, 90)
267 | height <- c(174, 181, 170, 176, 180, 90, 188, 240, 277)
268 | 
269 | # create a wide table
270 | straw_hat_df_wide <- data.frame(name, age, height)
271 | 
272 | # convert to a long table
273 | straw_hat_df_long <- gather(straw_hat_df_wide, key = item, value = value, age, height)
274 | View(straw_hat_df_long)
275 | 
276 | # convert back to a wide table
277 | straw_hat_df_wide <- spread(straw_hat_df_long, key = item, value = value)
278 | View(straw_hat_df_wide)
279 | ```
280 | 
281 | ![day1711](https://storage.googleapis.com/2017_ithome_ironman/day1711.png)
282 | 
283 | ![day1712](https://storage.googleapis.com/2017_ithome_ironman/day1712.png)
284 | 
285 | ## Removing duplicates
286 | 
287 | ### Python
288 | 
289 | We use the `duplicated()` and `drop_duplicates()` methods of a data frame.
290 | 
291 | ```python
292 | import pandas as pd
293 | 
294 | # create a data frame that contains duplicates
295 | name = ["蒙其·D·魯夫", "蒙其·D·魯夫", "蒙其·D·魯夫", "羅羅亞·索隆", "羅羅亞·索隆", "羅羅亞·索隆"]
296 | age = [19, 19, 17, 21, 21, 19]
297 | duplicated_dict = {
298 |     "name": name,
299 |     "age": age
300 | }
301 | duplicated_df = pd.DataFrame(duplicated_dict)
302 | 
303 | # check which observations are duplicates
304 | print(duplicated_df.duplicated())
305 | 
306 | # drop the duplicated observations
307 | print(duplicated_df.drop_duplicates())
308 | ```
309 | 
310 | ![day1713](https://storage.googleapis.com/2017_ithome_ironman/day1713.png)
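
By default an observation counts as a duplicate only when every column matches an earlier row. As a sketch on the same data (the names `first_per_name` and `last_per_name` are assumptions for illustration), the `subset` argument restricts the comparison to chosen columns, and `keep` decides which of the matching rows survives:

```python
import pandas as pd

name = ["蒙其·D·魯夫", "蒙其·D·魯夫", "蒙其·D·魯夫", "羅羅亞·索隆", "羅羅亞·索隆", "羅羅亞·索隆"]
age = [19, 19, 17, 21, 21, 19]
duplicated_df = pd.DataFrame({"name": name, "age": age})

# compare on name only and keep the first row of each name
first_per_name = duplicated_df.drop_duplicates(subset="name")

# compare on name only and keep the last row of each name
last_per_name = duplicated_df.drop_duplicates(subset="name", keep="last")

print(first_per_name)
print(last_per_name)
```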
311 | 
312 | ### R
313 | 
314 | We use the `duplicated()` function.
315 | 
316 | ```
317 | # create a data frame that contains duplicates
318 | name = c("蒙其·D·魯夫", "蒙其·D·魯夫", "蒙其·D·魯夫", "羅羅亞·索隆", "羅羅亞·索隆", "羅羅亞·索隆")
319 | age <- c(19, 19, 17, 21, 21, 19)
320 | duplicated_df <- data.frame(name, age)
321 | is_duplicates <- duplicated(duplicated_df)
322 | duplicated_df[!is_duplicates, ]
323 | ```
324 | 
325 | ![day1714](https://storage.googleapis.com/2017_ithome_ironman/day1714.png)
326 | 
327 | ## Binning
328 | 
329 | Binning splits a continuous numeric variable at a few cut points to create a new categorical variable.
330 | 
331 | ### Python
332 | 
333 | We use `cut()` from the `pandas` package.
334 | 
335 | ```python
336 | import pandas as pd
337 | 
338 | name = ["蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "騙人布", "賓什莫克·香吉士", "多尼多尼·喬巴", "妮可·羅賓", "佛朗基", "布魯克"]
339 | age = [19, 21, 20, 19, 21, 17, 30, 36, 90]
340 | 
341 | # build the dict
342 | straw_hat_dict = {
343 |     "name": name,
344 |     "age": age
345 | }
346 | 
347 | # create a data frame
348 | straw_hat_df = pd.DataFrame(straw_hat_dict)
349 | 
350 | # bin the ages: (0, 25] and (25, inf]
351 | bins = [0, 25, float("inf")]
352 | group_names = ["小於 25 歲", "超過 25 歲"]
353 | straw_hat_df["age_cat"] = pd.cut(straw_hat_df["age"], bins, labels = group_names)
354 | straw_hat_df
355 | ```
356 | 
357 | ![day1715](https://storage.googleapis.com/2017_ithome_ironman/day1715.png)
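
Which bin a value on a cut point falls into depends on the `right` argument of `cut()`. By default each bin includes its right edge, so an age of exactly 25 lands in the first bin, the same default as `cut()` in R. The sketch below uses three made-up ages and English labels (both are assumptions for illustration) to show the two settings side by side:

```python
import pandas as pd

ages = pd.Series([17, 25, 30])
bins = [0, 25, float("inf")]

# default right=True: bins are (0, 25] and (25, inf]
right_closed = pd.cut(ages, bins, labels=["25 or under", "over 25"])

# right=False: bins are [0, 25) and [25, inf)
left_closed = pd.cut(ages, bins, right=False, labels=["under 25", "25 or over"])

print(right_closed.tolist())
print(left_closed.tolist())
```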
358 | 
359 | ### R
360 | 
361 | We use the `cut()` function.
362 | 
363 | ```
364 | name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "騙人布", "賓什莫克·香吉士", "多尼多尼·喬巴", "妮可·羅賓", "佛朗基", "布魯克")
365 | age <- c(19, 21, 20, 19, 21, 17, 30, 36, 90)
366 | straw_hat_df <- data.frame(name, age)
367 | 
368 | # bin the ages
369 | bins <- c(0, 25, Inf)
370 | group_names <- c("小於 25 歲", "超過 25 歲")
371 | straw_hat_df$age_cat <- cut(straw_hat_df$age, breaks = bins, labels = group_names)
372 | View(straw_hat_df)
373 | ```
374 | 
375 | ![day1716](https://storage.googleapis.com/2017_ithome_ironman/day1716.png)
376 | 
377 | ## Exporting
378 | 
379 | After some intense data wrangling, we want to export the cleaned data as csv or JSON.
380 | 
381 | ### Python
382 | 
383 | We use the `to_csv()` and `to_json()` methods of a data frame.
384 | 
385 | ```python
386 | import pandas as pd
387 | 
388 | name = ["蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "騙人布", "賓什莫克·香吉士", "多尼多尼·喬巴", "妮可·羅賓", "佛朗基", "布魯克"]
389 | age = [19, 21, 20, 19, 21, 17, 30, 36, 90]
390 | 
391 | # Build a dict
392 | straw_hat_dict = {
393 |     "name": name,
394 |     "age": age
395 | }
396 | 
397 | # Build a data frame
398 | straw_hat_df = pd.DataFrame(straw_hat_dict)
399 | 
400 | # Export as csv
401 | straw_hat_df.to_csv("straw_hat.csv", index = False)
402 | 
403 | # Export as JSON
404 | straw_hat_df.to_json("straw_hat.json")
405 | ```
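
As a rough check of what ends up on disk, the sketch below writes the same kind of records with only the standard library and reads them back. It is not what `pandas` does internally: the row-per-record JSON layout here corresponds to `to_json(orient = "records")`, and `ensure_ascii=False` plays the role of `force_ascii = False`. The temporary directory and the three-row sample are assumptions for the example.

```python
import csv
import json
import os
import tempfile

name = ["蒙其·D·魯夫", "羅羅亞·索隆", "娜美"]
age = [19, 21, 20]
rows = [{"name": n, "age": a} for n, a in zip(name, age)]

out_dir = tempfile.mkdtemp()  # hypothetical scratch directory
csv_path = os.path.join(out_dir, "straw_hat.csv")
json_path = os.path.join(out_dir, "straw_hat.json")

# Write a header row followed by one line per record
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "age"])
    writer.writeheader()
    writer.writerows(rows)

# Write a list of records, keeping the non-ASCII names readable
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(rows, f, ensure_ascii=False)

# Read both files back to confirm nothing was lost
with open(csv_path, newline="", encoding="utf-8") as f:
    csv_rows = list(csv.DictReader(f))
with open(json_path, encoding="utf-8") as f:
    json_rows = json.load(f)

print(csv_rows)
print(json_rows)
```

csv stores everything as text, so the ages come back as strings, while JSON keeps them as numbers.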
406 | 
407 | ### R
408 | 
409 | We use the `write.csv()` function and the `toJSON()` function from the `jsonlite` package.
410 | 
411 | ```
412 | library(jsonlite)
413 | 
414 | name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "騙人布", "賓什莫克·香吉士", "多尼多尼·喬巴", "妮可·羅賓", "佛朗基", "布魯克")
415 | age <- c(19, 21, 20, 19, 21, 17, 30, 36, 90)
416 | straw_hat_df <- data.frame(name, age)
417 | 
418 | # Export as csv
419 | write.csv(straw_hat_df, file = "straw_hat.csv", row.names = FALSE)
420 | 
421 | # Export as JSON
422 | straw_hat_json <- toJSON(straw_hat_df)
423 | write(straw_hat_json, file = "straw_hat.json")
424 | ```
425 | 
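426 | ## Summary
427 | 
428 | On day 17 we covered common data wrangling techniques, including joining, merging and transposing. We practiced them with `pandas` and the data frame methods, and compared each one with the corresponding R functions.
429 | 
430 | ## References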
431 | 
432 | - [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do)
433 | - [pandas 0.19.1 documentation](http://pandas.pydata.org/pandas-docs/stable/index.html)


--------------------------------------------------------------------------------
/day18.md:
--------------------------------------------------------------------------------
  1 | # [Day 18] Data Visualization: matplotlib
  2 | 
  3 | ---
  4 | 
  5 | In yesterday's post, [[Day 17] Data Wrangling](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day17.md), we mentioned that data wrangling usually serves a later step: data visualization or building machine learning models. For visualization, R users have the static **Base plotting system** (R's built-in graphics) and the **ggplot2** package, plus the interactive **plotly** package. Python has the static **matplotlib** and **seaborn** packages, plus the interactive **bokeh** package.
  6 | 
  7 | Today we try out **matplotlib**, with R's **Base plotting system** alongside it, to draw some basic charts:
  8 | 
  9 | - Histogram
 10 | - Scatter plot
 11 | - Line plot
 12 | - Bar plot
 13 | - Box plot
 14 | 
 15 | Our development environment is **Jupyter Notebook**; the magic command below renders figures inline instead of opening a new window.
 16 | 
 17 | ```python
 18 | %matplotlib inline
 19 | ```
 20 | 
 21 | ## Histogram
 22 | 
 23 | ### Python
 24 | 
 25 | Use the `hist()` function from `matplotlib.pyplot`.
 26 | 
 27 | ```python
 28 | %matplotlib inline
 29 | 
 30 | import numpy as np
 31 | import matplotlib.pyplot as plt
 32 | 
 33 | normal_samples = np.random.normal(size = 100000) # draw 100000 samples from the standard normal distribution (mean 0, standard deviation 1)
 34 | uniform_samples = np.random.uniform(size = 100000) # draw 100000 samples from the uniform distribution between 0 and 1
 35 | 
 36 | plt.hist(normal_samples)
 37 | plt.show()
 38 | plt.hist(uniform_samples)
 39 | plt.show()
 40 | ```
 41 | 
 42 | ![day1801](https://storage.googleapis.com/2017_ithome_ironman/day1801.png)
 43 | 
 44 | ![day1802](https://storage.googleapis.com/2017_ithome_ironman/day1802.png)
 45 | 
 46 | If the `numpy.random` module is unfamiliar, I recommend [[Day 13] Common Attributes and Methods (2): ndarray](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day13.md).
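
A histogram is a count of samples per equal-width bin. The sketch below reproduces that counting in plain Python, assuming 10 bins between the minimum and the maximum (the default number of bins for `plt.hist()`). The function name `histogram_counts`, the seed and the smaller sample size are choices made for this illustration.

```python
import random


def histogram_counts(samples, n_bins=10):
    """Count samples in n_bins equal-width bins spanning min..max."""
    low, high = min(samples), max(samples)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in samples:
        # The maximum sits exactly on the last edge, so clamp it into the final bin
        index = min(int((x - low) / width), n_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(n_bins + 1)]
    return counts, edges


random.seed(0)  # arbitrary seed, only to make the run repeatable
normal_samples = [random.gauss(0, 1) for _ in range(10000)]

counts, edges = histogram_counts(normal_samples)
print(counts)
```

The counts are what the bar heights in the chart represent, and the edges are the bin boundaries along the X axis.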
 47 | 
 48 | ### R
 49 | 
 50 | Use the `hist()` function.
 51 | 
 52 | ```
 53 | normal_samples <- rnorm(100000) # draw 100000 samples from the standard normal distribution (mean 0, standard deviation 1)
 54 | uniform_samples <- runif(100000) # draw 100000 samples from the uniform distribution between 0 and 1
 55 | 
 56 | hist(normal_samples)
 57 | hist(uniform_samples)
 58 | ```
 59 | 
 60 | ![day1803](https://storage.googleapis.com/2017_ithome_ironman/day1803.png)
 61 | 
 62 | ![day1804](https://storage.googleapis.com/2017_ithome_ironman/day1804.png)
 63 | 
 64 | ## Scatter plot
 65 | 
 66 | ### Python
 67 | 
 68 | Use the `scatter()` function from `matplotlib.pyplot`.
 69 | 
 70 | ```python
 71 | %matplotlib inline
 72 | 
 73 | import matplotlib.pyplot as plt
 74 | 
 75 | speed = [4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 22, 23, 24, 24, 24, 24, 25]
 76 | dist = [2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34, 46, 26, 36, 60, 80, 20, 26, 54, 32, 40, 32, 40, 50, 42, 56, 76, 84, 36, 46, 68, 32, 48, 52, 56, 64, 66, 54, 70, 92, 93, 120, 85]
 77 | 
 78 | plt.scatter(speed, dist)
 79 | plt.show()
 80 | ```
 81 | 
 82 | ![day1805](https://storage.googleapis.com/2017_ithome_ironman/day1805.png)
 83 | 
 84 | ### R
 85 | 
 86 | Use the `plot()` function.
 87 | 
 88 | ```
 89 | plot(cars$speed, cars$dist)
 90 | ```
 91 | 
 92 | ![day1806](https://storage.googleapis.com/2017_ithome_ironman/day1806.png)
 93 | 
 94 | ## Line plot
 95 | 
 96 | ### Python
 97 | 
 98 | Use the `plot()` function from `matplotlib.pyplot`.
 99 | 
100 | ```python
101 | %matplotlib inline
102 | 
103 | import matplotlib.pyplot as plt
104 | 
105 | speed = [4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 22, 23, 24, 24, 24, 24, 25]
106 | dist = [2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34, 46, 26, 36, 60, 80, 20, 26, 54, 32, 40, 32, 40, 50, 42, 56, 76, 84, 36, 46, 68, 32, 48, 52, 56, 64, 66, 54, 70, 92, 93, 120, 85]
107 | 
108 | plt.plot(speed, dist)
109 | plt.show()
110 | ```
111 | 
112 | ![day1807](https://storage.googleapis.com/2017_ithome_ironman/day1807.png)
113 | 
114 | ### R
115 | 
116 | Use the `plot()` function with the argument `type = "l"`.
117 | 
118 | ```
119 | plot(cars$speed, cars$dist, type = "l")
120 | ```
121 | 
122 | ![day1808](https://storage.googleapis.com/2017_ithome_ironman/day1808.png)
123 | 
124 | ## Bar plot
125 | 
126 | ### Python
127 | 
128 | Use the `bar()` function from `matplotlib.pyplot`.
129 | 
130 | ```python
131 | %matplotlib inline
132 | 
133 | from collections import Counter
134 | import matplotlib.pyplot as plt
135 | import numpy as np
136 | 
137 | cyl = [6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4]
138 | 
139 | labels, values = zip(*sorted(Counter(cyl).items())) # count each cylinder value, sorted by label
140 | indexes = np.arange(len(labels)) # x positions of the bars
141 | width = 1
142 | plt.bar(indexes, values, width, align = "center")
143 | plt.xticks(indexes, labels)
144 | plt.show()
145 | ```
146 | 
147 | ![day1809](https://storage.googleapis.com/2017_ithome_ironman/day1809.png)
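
The bar heights come from a frequency count, which is what `table()` does on the R side. A small sketch of just that counting step, using the same `cyl` values:

```python
from collections import Counter

cyl = [6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4]

# Counter tallies each distinct value; sorting puts the categories in ascending order
counts = sorted(Counter(cyl).items())
labels = [str(value) for value, _ in counts]
values = [count for _, count in counts]

print(labels)
print(values)
```

Sorting the pairs keeps the bars in ascending category order rather than in the order the values first appear in the list.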
148 | 
149 | ### R
150 | 
151 | Use the `barplot()` function.
152 | 
153 | ```
154 | barplot(table(mtcars$cyl))
155 | ```
156 | 
157 | ![day1810](https://storage.googleapis.com/2017_ithome_ironman/day1810.png)
158 | 
159 | ## Box plot
160 | 
161 | ### Python
162 | 
163 | Use the `boxplot()` function from `matplotlib.pyplot`.
164 | 
165 | ```python
166 | %matplotlib inline
167 | 
168 | import numpy as np
169 | import matplotlib.pyplot as plt
170 | 
171 | normal_samples = np.random.normal(size = 100000) # draw 100000 samples from the standard normal distribution (mean 0, standard deviation 1)
172 | 
173 | plt.boxplot(normal_samples)
174 | plt.show()
175 | ```
176 | 
177 | ![day1811](https://storage.googleapis.com/2017_ithome_ironman/day1811.png)
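
To make the parts of a box plot concrete, the sketch below computes the quartiles, the whisker ends and the outliers for a small made-up sample. It assumes the common 1.5 × IQR whisker rule and linearly interpolated quartiles (`method="inclusive"` in the `statistics` module); the function name `box_stats` and the sample values are hypothetical.

```python
from statistics import quantiles


def box_stats(data):
    """Quartiles, whisker ends and outliers using the 1.5 * IQR rule."""
    ordered = sorted(data)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations that are still inside the fences
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical data with one extreme value
stats = box_stats(sample)
print(stats)
```

The box spans `q1` to `q3`, the line inside it is the median, and any point beyond the whiskers is drawn individually as an outlier.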
178 | 
179 | ### R
180 | 
181 | Use the `boxplot()` function.
182 | 
183 | ```
184 | normal_samples <- rnorm(100000) # draw 100000 samples from the standard normal distribution (mean 0, standard deviation 1)
185 | boxplot(normal_samples)
186 | ```
187 | 
188 | ![day1812](https://storage.googleapis.com/2017_ithome_ironman/day1812.png)
189 | 
190 | ## Saving figures
191 | 
192 | ### Python
193 | 
194 | Use the `savefig()` function from `matplotlib.pyplot`.
195 | 
196 | ```python
197 | import numpy as np
198 | import matplotlib.pyplot as plt
199 | 
200 | normal_samples = np.random.normal(size = 100000) # draw 100000 samples from the standard normal distribution (mean 0, standard deviation 1)
201 | 
202 | plt.hist(normal_samples)
203 | plt.savefig("my_hist.png", format = "png")
204 | ```
205 | 
206 | ![day1813](https://storage.googleapis.com/2017_ithome_ironman/day1813.png)
207 | 
208 | ### R
209 | 
210 | First open a `.png` graphics device with `png()`, draw the plot, then call `dev.off()` to close the device and write the file.
211 | 
212 | ```
213 | normal_samples <- rnorm(100000) # draw 100000 samples from the standard normal distribution (mean 0, standard deviation 1)
214 | png("my_hist.png")
215 | hist(normal_samples)
216 | dev.off()
217 | ```
218 | 
219 | ![day1814](https://storage.googleapis.com/2017_ithome_ironman/day1814.png)
220 | 
221 | ## Summary
222 | 
223 | On day 18 we practiced drawing basic charts with Python's **matplotlib** visualization package and compared it with R's **Base plotting system**.
224 | 
225 | ## References
226 | 
227 | - [pyplot - Matplotlib 1.5.3 documentation](http://matplotlib.org/api/pyplot_api.html)


--------------------------------------------------------------------------------
/day19.md:
--------------------------------------------------------------------------------
  1 | # [Day 19] Data Visualization (2): Seaborn
  2 | 
  3 | ---
  4 | 
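  5 | Building a chart with **matplotlib** means assembling the basic components it provides, such as the chart type, legend and labels. **Seaborn** is a high-level plotting package built on top of **matplotlib** that makes charts easier to create, so we can think of it as a complement to **matplotlib**. If **matplotlib** is unfamiliar, I recommend reading [[Day 18] Data Visualization: matplotlib](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day18.md).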
  6 | 
  7 | > Seaborn is a library for making attractive and informative statistical graphics in Python. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels.
  8 | > [Seaborn: statistical data visualization](http://seaborn.pydata.org/index.html)
  9 | 
 10 | Today we try out **Seaborn**, with R's **ggplot2** package alongside it, to draw some basic charts:
 11 | 
 12 | - Histogram
 13 | - Scatter plot
 14 | - Line plot
 15 | - Bar plot
 16 | - Box plot
 17 | 
 18 | **Seaborn** is not installed in our development environment, but we can install it from the terminal with `conda`.
 19 | 
 20 | ```
 21 | $ conda install -c anaconda seaborn=0.7.1
 22 | ```
 23 | 
 24 | Our development environment is Jupyter Notebook; the magic command below renders figures inline instead of opening a new window.
 25 | 
 26 | ```python
 27 | %matplotlib inline
 28 | ```
 29 | 
 30 | ## Histogram
 31 | 
 32 | ### Python
 33 | 
 34 | Use the `distplot()` function from the `seaborn` package.
 35 | 
 36 | ```python
 37 | %matplotlib inline
 38 | 
 39 | import seaborn as sns
 40 | import numpy as np
 41 | import matplotlib.pyplot as plt
 42 | 
 43 | normal_samples = np.random.normal(size = 100000) # draw 100000 samples from the standard normal distribution (mean 0, standard deviation 1)
 44 | sns.distplot(normal_samples)
 45 | ```
 46 | 
 47 | ![day1901](https://storage.googleapis.com/2017_ithome_ironman/day1901.png)
 48 | 
 49 | By default, a **kernel density estimate (KDE)** curve is overlaid on the histogram.
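
A KDE curve is the average of small bell curves centered on each sample. The sketch below builds one by hand with a Gaussian kernel. The bandwidth uses Scott's rule of thumb, which is an assumption here; the bandwidth Seaborn picks may differ. The function name `gaussian_kde`, the seed and the sample size are also choices made for this illustration.

```python
import math
import random
from statistics import stdev


def gaussian_kde(samples, bandwidth=None):
    """Return a density function built from Gaussian kernels on each sample."""
    n = len(samples)
    if bandwidth is None:
        bandwidth = stdev(samples) * n ** (-1 / 5)  # Scott's rule of thumb (assumed)
    norm = 1 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


random.seed(0)  # arbitrary seed, only to make the run repeatable
normal_samples = [random.gauss(0, 1) for _ in range(500)]
density = gaussian_kde(normal_samples)

# A density should integrate to roughly 1; check with a simple Riemann sum
step = 0.05
grid = [-6 + step * i for i in range(241)]
area = sum(density(x) * step for x in grid)
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve and a larger one gives a smoother curve, which is why the choice of bandwidth matters more than the choice of kernel.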
 50 | 
 51 | ### R
 52 | 
 53 | Use `geom_histogram()` from the `ggplot2` package to draw a histogram.
 54 | 
 55 | ```
 56 | library(ggplot2)
 57 | 
 58 | normal_samples <- rnorm(100000) # draw 100000 samples from the standard normal distribution (mean 0, standard deviation 1)
 59 | normal_samples_df <- data.frame(normal_samples)
 60 | ggplot(normal_samples_df, aes(x = normal_samples)) + geom_histogram(aes(y = ..density..)) + geom_density()
 61 | ```
 62 | 
 63 | ![day1902](https://storage.googleapis.com/2017_ithome_ironman/day1902.png)
 64 | 
 65 | ## Scatter plot
 66 | 
 67 | ### Python
 68 | 
 69 | Use the `jointplot()` function from the `seaborn` package.
 70 | 
 71 | ```python
 72 | %matplotlib inline
 73 | 
 74 | import seaborn as sns
 75 | import pandas as pd
 76 | import matplotlib.pyplot as plt
 77 | 
 78 | speed = [4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 22, 23, 24, 24, 24, 24, 25]
 79 | dist = [2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34, 46, 26, 36, 60, 80, 20, 26, 54, 32, 40, 32, 40, 50, 42, 56, 76, 84, 36, 46, 68, 32, 48, 52, 56, 64, 66, 54, 70, 92, 93, 120, 85]
 80 | 
 81 | cars_df = pd.DataFrame(
 82 |     {"speed": speed,
 83 |      "dist": dist
 84 |     }
 85 | )
 86 | 
 87 | sns.jointplot(x = "speed", y = "dist", data = cars_df)
 88 | ```
 89 | 
 90 | ![day1903](https://storage.googleapis.com/2017_ithome_ironman/day1903.png)
 91 | 
 92 | By default, histograms of the X and Y variables are drawn in the margins.
 93 | 
 94 | ### R
 95 | 
 96 | Use `geom_point()` from the `ggplot2` package to draw the scatter plot, then `ggMarginal()` from the `ggExtra` package to add histograms of the X and Y variables.
 97 | 
 98 | ```
 99 | library(ggplot2)
100 | library(ggExtra)
101 | 
102 | scatter_plot <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()
103 | ggMarginal(scatter_plot, type = "histogram")
104 | ```
105 | 
106 | ![day1904](https://storage.googleapis.com/2017_ithome_ironman/day1904.png)
107 | 
108 | ## Line plot
109 | 
110 | ### Python
111 | 
112 | Use the `factorplot()` function from the `seaborn` package.
113 | 
114 | ```python
115 | %matplotlib inline
116 | 
117 | import seaborn as sns
118 | import pandas as pd
119 | import matplotlib.pyplot as plt
120 | 
121 | speed = [4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 22, 23, 24, 24, 24, 24, 25]
122 | dist = [2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34, 46, 26, 36, 60, 80, 20, 26, 54, 32, 40, 32, 40, 50, 42, 56, 76, 84, 36, 46, 68, 32, 48, 52, 56, 64, 66, 54, 70, 92, 93, 120, 85]
123 | 
124 | cars_df = pd.DataFrame(
125 |     {"speed": speed,
126 |      "dist": dist
127 |     }
128 | )
129 | 
130 | sns.factorplot(data = cars_df, x="speed", y="dist", ci = None)
131 | ```
132 | 
133 | ![day1905](https://storage.googleapis.com/2017_ithome_ironman/day1905.png)
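
`factorplot()` draws a point plot by default: it treats each distinct `speed` as a category and plots the mean `dist` for that category, with `ci = None` turning off the error bars. The sketch below reproduces that aggregation in plain Python so the plotted values can be inspected; the variable names `groups` and `mean_dist` are made up for this example.

```python
from collections import defaultdict

speed = [4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 22, 23, 24, 24, 24, 24, 25]
dist = [2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34, 46, 26, 36, 60, 80, 20, 26, 54, 32, 40, 32, 40, 50, 42, 56, 76, 84, 36, 46, 68, 32, 48, 52, 56, 64, 66, 54, 70, 92, 93, 120, 85]

# Collect every stopping distance observed at each speed
groups = defaultdict(list)
for s, d in zip(speed, dist):
    groups[s].append(d)

# One mean per distinct speed, in ascending order of speed
mean_dist = {s: sum(ds) / len(ds) for s, ds in sorted(groups.items())}
print(mean_dist)
```

This is why the Seaborn line has one point per distinct speed, whereas the R line drawn with `geom_line()` passes through all 50 observations.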
134 | 
135 | ### R
136 | 
137 | Use `geom_line()` from the `ggplot2` package to draw a line plot.
138 | 
139 | ```
140 | library(ggplot2)
141 | 
142 | ggplot(cars, aes(x = speed, y = dist)) + geom_line()
143 | ```
144 | 
145 | ![day1906](https://storage.googleapis.com/2017_ithome_ironman/day1906.png)
146 | 
147 | ## Bar plot
148 | 
149 | ### Python
150 | 
151 | Use the `countplot()` function from the `seaborn` package.
152 | 
153 | ```python
154 | %matplotlib inline
155 | 
156 | import seaborn as sns
157 | import pandas as pd
158 | import matplotlib.pyplot as plt
159 | 
160 | cyl = [6 ,6 ,4 ,6 ,8 ,6 ,8 ,4 ,4 ,6 ,6 ,8 ,8 ,8 ,8 ,8 ,8 ,4 ,4 ,4 ,4 ,8 ,8 ,8 ,8 ,4 ,4 ,4 ,8 ,6 ,8 ,4]
161 | cyl_df = pd.DataFrame({"cyl": cyl})
162 | 
163 | sns.countplot(x = "cyl", data=cyl_df)
164 | ```
165 | 
166 | ![day1907](https://storage.googleapis.com/2017_ithome_ironman/day1907.png)
167 | 
168 | ### R
169 | 
170 | Use `geom_bar()` from the `ggplot2` package to draw a bar plot.
171 | 
172 | ```
173 | library(ggplot2)
174 | 
175 | ggplot(mtcars, aes(x = cyl)) + geom_bar()
176 | ```
177 | 
178 | ![day1908](https://storage.googleapis.com/2017_ithome_ironman/day1908.png)
179 | 
180 | ## Box plot
181 | 
182 | ### Python
183 | 
184 | Use the `boxplot()` function from the `seaborn` package.
185 | 
186 | ```python
187 | %matplotlib inline
188 | 
189 | import seaborn as sns
190 | import numpy as np
191 | import matplotlib.pyplot as plt
192 | 
193 | normal_samples = np.random.normal(size = 100000) # draw 100000 samples from the standard normal distribution (mean 0, standard deviation 1)
194 | sns.boxplot(normal_samples)
195 | ```
196 | 
197 | ![day1909](https://storage.googleapis.com/2017_ithome_ironman/day1909.png)
198 | 
199 | ### R
200 | 
201 | Use `geom_boxplot()` from the `ggplot2` package to draw a box plot.
202 | 
203 | ```
204 | library(ggplot2)
205 | 
206 | normal_samples <- rnorm(100000) # draw 100000 samples from the standard normal distribution (mean 0, standard deviation 1)
207 | normal_samples_df <- data.frame(normal_samples)
208 | ggplot(normal_samples_df, aes(y = normal_samples, x = 1)) + geom_boxplot() + coord_flip()
209 | ```
210 | 
211 | ![day1910](https://storage.googleapis.com/2017_ithome_ironman/day1910.png)
212 | 
213 | ## Summary
214 | 
215 | On day 19 we practiced drawing basic charts with Python's **Seaborn** visualization package and compared it with R's **ggplot2**.
216 | 
217 | ## References
218 | 
219 | - [Seaborn: statistical data visualization](http://seaborn.pydata.org/index.html)
220 | - [ggplot2 0.9.3.1](http://docs.ggplot2.org/0.9.3.1/index.html)
221 | - [Seaborn :: Anaconda Cloud](https://anaconda.org/anaconda/seaborn)


--------------------------------------------------------------------------------
/day20.md:
--------------------------------------------------------------------------------
  1 | # [Day 20] Data Visualization (3): Bokeh
  2 | 
  3 | ---
  4 | 
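  5 | The **matplotlib** and **Seaborn** packages from the past two days cover the vast majority of plotting needs. Their one shortcoming is that the charts are static. If we want some interactivity, such as showing a data point's values on hover or zooming the chart (effects that rely on JavaScript), we can use **Bokeh**, a high-level plotting package for Python.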
  6 | 
  7 | > Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.
  8 | > [Welcome to Bokeh - Bokeh 0.12.3 documentation](http://bokeh.pydata.org/en/latest/)
  9 | 
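 10 | As things stand, data visualization is led by [d3.js](https://d3js.org/) and other web projects based on JavaScript or derived from [d3.js](https://d3js.org/), such as [Leaflet](http://leafletjs.com/), [c3.js](http://c3js.org/) and [plotly](https://plot.ly/feed/), which we use today. Visualization packages for Python and R, in turn, try to let users draw interactive charts with concise methods and functions; **Bokeh** and **Plotly**, today's topics, are two examples. If data visualization is central to your work, spending time on web front-end technologies and [d3.js](https://d3js.org/) is a necessary investment.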
 11 | 
 12 | > Visualizations built on web technologies (that is, JavaScript-based) appear to be the inevitable future.
 13 | > [Wes McKinney](http://wesmckinney.com/)
 14 | 
 15 | Today we use **Bokeh**, with R's **Plotly** package alongside it, to draw some basic charts:
 16 | 
 17 | - Histogram
 18 | - Scatter plot
 19 | - Line plot
 20 | - Bar plot
 21 | - Box plot
 22 | 
 23 | The [Anaconda](https://www.continuum.io/downloads) distribution I downloaded already includes **Bokeh**. If yours does not, run this command in the terminal.
 24 | 
 25 | ```
 26 | $ conda install -c anaconda bokeh=0.12.3
 27 | ```
 28 | 
 29 | ## Histogram
 30 | 
 31 | ### Python
 32 | 
 33 | Use `Histogram()` from `bokeh.charts`.
 34 | 
 35 | ```python
 36 | from bokeh.charts import Histogram, show
 37 | import numpy as np
 38 | 
 39 | normal_samples = np.random.normal(size = 100000) # draw 100000 samples from the standard normal distribution (mean 0, standard deviation 1)
 40 | hist = Histogram(normal_samples)
 41 | show(hist)
 42 | ```
 43 | 
 44 | ![day2001](https://storage.googleapis.com/2017_ithome_ironman/day2001.png)
 45 | 
 46 | ### R
 47 | 
 48 | The `ggplotly()` function converts a chart drawn with **ggplot2** into a **Plotly** chart.
 49 | 
 50 | ```
 51 | library(ggplot2)
 52 | library(plotly)
 53 | 
 54 | normal_samples <- rnorm(100000) # draw 100000 samples from the standard normal distribution (mean 0, standard deviation 1)
 55 | normal_samples_df <- data.frame(normal_samples)
 56 | hist <- ggplot(normal_samples_df, aes(x = normal_samples)) + geom_histogram(aes(y = ..density..)) + geom_density()
 57 | ggplotly(hist)
 58 | ```
 59 | 
 60 | ![day2002](https://storage.googleapis.com/2017_ithome_ironman/day2002.png)
 61 | 
 62 | ## Scatter plot
 63 | 
 64 | ### Python
 65 | 
 66 | Use `Scatter()` from `bokeh.charts`.
 67 | 
 68 | ```python
 69 | from bokeh.charts import Scatter, show
 70 | import pandas as pd
 71 | 
 72 | speed = [4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 22, 23, 24, 24, 24, 24, 25]
 73 | dist = [2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34, 46, 26, 36, 60, 80, 20, 26, 54, 32, 40, 32, 40, 50, 42, 56, 76, 84, 36, 46, 68, 32, 48, 52, 56, 64, 66, 54, 70, 92, 93, 120, 85]
 74 | 
 75 | cars_df = pd.DataFrame(
 76 |     {"speed": speed,
 77 |      "dist": dist
 78 |     }
 79 | )
 80 | 
 81 | scatter = Scatter(cars_df, x = "speed", y = "dist")
 82 | show(scatter)
 83 | ```
 84 | 
 85 | ![day2003](https://storage.googleapis.com/2017_ithome_ironman/day2003.png)
 86 | 
 87 | ### R
 88 | 
 89 | The `ggplotly()` function converts a chart drawn with **ggplot2** into a **Plotly** chart.
 90 | 
 91 | ```
 92 | library(ggplot2)
 93 | library(plotly)
 94 | 
 95 | scatter_plot <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()
 96 | ggplotly(scatter_plot)
 97 | ```
 98 | 
 99 | ![day2004](https://storage.googleapis.com/2017_ithome_ironman/day2004.png)
100 | 
101 | ## Line plot
102 | 
103 | ### Python
104 | 
105 | Use `Line()` from `bokeh.charts`.
106 | 
107 | ```python
108 | from bokeh.charts import Line, show
109 | import pandas as pd
110 | 
111 | speed = [4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 22, 23, 24, 24, 24, 24, 25]
112 | dist = [2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34, 46, 26, 36, 60, 80, 20, 26, 54, 32, 40, 32, 40, 50, 42, 56, 76, 84, 36, 46, 68, 32, 48, 52, 56, 64, 66, 54, 70, 92, 93, 120, 85]
113 | 
114 | cars_df = pd.DataFrame(
115 |     {"speed": speed,
116 |      "dist": dist
117 |     }
118 | )
119 | 
120 | line = Line(cars_df, x = "speed", y = "dist")
121 | show(line)
122 | ```
123 | 
124 | ![day2005](https://storage.googleapis.com/2017_ithome_ironman/day2005.png)
125 | 
126 | ### R
127 | 
128 | The `ggplotly()` function converts a chart drawn with **ggplot2** into a **Plotly** chart.
129 | 
130 | ```
131 | library(ggplot2)
132 | library(plotly)
133 | 
134 | line <- ggplot(cars, aes(x = speed, y = dist)) + geom_line()
135 | ggplotly(line)
136 | ```
137 | 
138 | ![day2006](https://storage.googleapis.com/2017_ithome_ironman/day2006.png)
139 | 
140 | ## Bar plot
141 | 
142 | ### Python
143 | 
144 | Use `Bar()` from `bokeh.charts`.
145 | 
146 | ```python
147 | from bokeh.charts import Bar, show
148 | import pandas as pd
149 | 
150 | cyls = [11, 7, 14]
151 | labels = ["4", "6", "8"]
152 | cyl_df = pd.DataFrame({
153 |     "cyl": cyls,
154 |     "label": labels
155 | })
156 | 
157 | bar = Bar(cyl_df, values = "cyl", label = "label")
158 | show(bar)
159 | ```
160 | 
161 | ![day2007](https://storage.googleapis.com/2017_ithome_ironman/day2007.png)
162 | 
163 | ### R
164 | 
165 | The `ggplotly()` function converts a chart drawn with **ggplot2** into a **Plotly** chart.
166 | 
167 | ```
168 | library(ggplot2)
169 | library(plotly)
170 | 
171 | bar <- ggplot(mtcars, aes(x = cyl)) + geom_bar()
172 | ggplotly(bar)
173 | ```
174 | 
175 | ![day2008](https://storage.googleapis.com/2017_ithome_ironman/day2008.png)
176 | 
177 | ## Box plot
178 | 
179 | ### Python
180 | 
181 | Use `BoxPlot()` from `bokeh.charts`.
182 | 
183 | ```python
184 | from bokeh.charts import BoxPlot, show, output_notebook
185 | import pandas as pd
186 | 
187 | output_notebook()
188 | 
189 | mpg = [21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 19.7, 15, 21.4]
190 | cyl = [6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4]
191 | mtcars_df = pd.DataFrame({
192 |     "mpg": mpg,
193 |     "cyl": cyl
194 | })
195 | 
196 | box = BoxPlot(mtcars_df, values = "mpg", label = "cyl")
197 | show(box)
198 | ```
199 | 
200 | ![day2009](https://storage.googleapis.com/2017_ithome_ironman/day2009.png)
201 | 
202 | ### R
203 | 
204 | The `ggplotly()` function converts a chart drawn with **ggplot2** into a **Plotly** chart.
205 | 
206 | ```
207 | library(ggplot2)
208 | library(plotly)
209 | 
210 | box <- ggplot(mtcars, aes(y = mpg, x = factor(cyl))) + geom_boxplot()
211 | ggplotly(box)
212 | ```
213 | 
214 | ![day2010](https://storage.googleapis.com/2017_ithome_ironman/day2010.png)
215 | 
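216 | ## Summary
217 | 
218 | On day 20 we practiced drawing basic charts with Python's **Bokeh** visualization package, and in R we used the `ggplotly()` function from the `plotly` package to convert **ggplot2** charts into interactive **Plotly** charts.
219 | 
220 | ## References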
221 | 
222 | - [Making High-level Charts - Bokeh 0.12.3 documentation](http://bokeh.pydata.org/en/latest/docs/user_guide/charts.html)
223 | - [ggplot2 | plotly](https://plot.ly/ggplot2/)
224 | - [ggplot2 0.9.3.1](http://docs.ggplot2.org/0.9.3.1/index.html)
225 | - [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do)


--------------------------------------------------------------------------------
/day21.md:
--------------------------------------------------------------------------------
  1 | # [Day 21] Machine Learning: Toy Datasets and Linear Regression
  2 | 
  3 | ---
  4 | 
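  5 | As mentioned in [[Day 17] Data Wrangling](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day17.md), the purpose of data wrangling is to shape data into the format that visualization or machine learning models require. Data visualization sounds intuitive, but what about machine learning? I really like the clear, simple explanation that Professor [林軒田](http://www.csie.ntu.edu.tw/~htlin/) gives in [機器學習基石](https://www.youtube.com/watch?v=sS4523miLnw&list=PLXVfgk9fNX2I7tB6oIINGBmW50rrmFTqf&index=2):
  6 | 
  7 | > How did we learn to recognize a tree as children? Did our parents give us a hundred rules that define one? Not really. For the most part we observed many trees and many things that are not trees, and in doing so acquired and internalized the skill of recognizing a tree. Machine learning aims to do the same thing.
  8 | > [林軒田](http://www.csie.ntu.edu.tw/~htlin/)
  9 | 
 10 | The Python machine learning package we use is **scikit-learn**. It is built on **NumPy**, **SciPy** and **matplotlib**, is open source, and can be used commercially.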
 11 | 
 12 | > Scikit-learn is a Python module for machine learning built on top of SciPy and distributed under the 3-Clause BSD license. The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the [AUTHORS.rst](https://github.com/scikit-learn/scikit-learn/blob/master/AUTHORS.rst) file for a complete list of contributors. It is currently maintained by a team of volunteers.
 13 | > [scikit-learn](https://github.com/scikit-learn/scikit-learn)
 14 | 
 15 | The [home page](http://scikit-learn.org/stable/index.html) of **scikit-learn** shows its application areas at a glance:
 16 | 
 17 | - Supervised learning
 18 |     - Classification
 19 |     - Regression
 20 | - Unsupervised learning
 21 |     - Clustering
 22 | - Dimensionality reduction
 23 | - Model selection
 24 | - Preprocessing
 25 | 
 26 | ## Toy datasets
 27 | 
 28 | When practicing data visualization or machine learning, we can generate our own data or use so-called toy datasets. The term does not refer to one specific dataset; it covers small, tidy standard datasets in general. The `iris`, `cars` and `mtcars` data frames that R users reach for all the time are toy datasets.
 29 | 
 30 | ### Python
 31 | 
 32 | We use the `load_iris()` function from `sklearn`'s `datasets` module to load the iris data.
 33 | 
 34 | ```python
 35 | from sklearn import datasets
 36 | import pandas as pd
 37 | 
 38 | iris = datasets.load_iris()
 39 | print(type(iris.data)) # the data is stored as an ndarray
 40 | print(iris.feature_names) # variable names are available through the feature_names attribute
 41 | iris_df = pd.DataFrame(iris.data, columns=iris.feature_names) # convert to a data frame
 42 | iris_df["species"] = iris.target # add the species codes to the data frame
 43 | iris_df.head() # look at the first five observations
 44 | ```
 45 | 
 46 | ![day2101](https://storage.googleapis.com/2017_ithome_ironman/day2101.png)
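
Note that the `species` column above holds integer codes (0, 1, 2) rather than names, unlike the factor column in R's `iris`. The names are available through the `iris.target_names` attribute. The sketch below shows the lookup with hard-coded stand-ins so it runs without scikit-learn; the short `target` list is a hypothetical slice of `iris.target`.

```python
# Same order as iris.target_names in scikit-learn
target_names = ["setosa", "versicolor", "virginica"]

# Hypothetical slice standing in for iris.target
target = [0, 0, 1, 2, 1]

# Each integer code is an index into target_names
species = [target_names[code] for code in target]
print(species)
```

With the real objects, the same idea would be a lookup of `iris.target_names` by `iris.target` before assigning the column.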
 47 | 
 48 | There are more toy datasets: the Boston housing data can be loaded with `load_boston()` and the diabetes patient data with `load_diabetes()`. See [Dataset loading utilities - scikit-learn 0.18.1 documentation](http://scikit-learn.org/stable/datasets/) for details.
 49 | 
 50 | ### R
 51 | 
 52 | `iris` is available as soon as R starts and can be used directly.
 53 | 
 54 | ```
 55 | head(iris)
 56 | ```
 57 | 
 58 | ![day2102](https://storage.googleapis.com/2017_ithome_ironman/day2102.png)
 59 | 
 60 | Entering `data()` lists the datasets that can be used directly.
 61 | 
 62 | ![day210301](https://storage.googleapis.com/2017_ithome_ironman/day210301.png)
 63 | 
 64 | ## Building a linear regression model
 65 | 
 66 | I really like a simple example from [世界第一簡單統計學迴歸分析篇](http://www.books.com.tw/products/0010479438): predicting iced tea sales from the temperature.
 67 | 
 68 | ### Python
 69 | 
 70 | We use `LinearRegression()` from `sklearn.linear_model`.
 71 | 
 72 | ```python
 73 | import numpy as np
 74 | from sklearn.linear_model import LinearRegression
 75 | 
 76 | temperatures = np.array([29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30])
 77 | iced_tea_sales = np.array([77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84])
 78 | 
 79 | lm = LinearRegression()
 80 | lm.fit(np.reshape(temperatures, (len(temperatures), 1)), np.reshape(iced_tea_sales, (len(iced_tea_sales), 1)))
 81 | 
 82 | # Print the coefficient
 83 | print(lm.coef_)
 84 | 
 85 | # Print the intercept
 86 | print(lm.intercept_)
 87 | ```
 88 | 
 89 | ![day2104](https://storage.googleapis.com/2017_ithome_ironman/day2104.png)
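
For a single predictor, the fitted line has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the two means. The sketch below computes both in plain Python so the numbers from `LinearRegression()` can be cross-checked. The function name `fit_line` is made up for this illustration, and this is not how scikit-learn computes the fit internally.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


temperatures = [29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30]
iced_tea_sales = [77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84]

slope, intercept = fit_line(temperatures, iced_tea_sales)
predicted_at_30 = intercept + slope * 30  # predicted sales at 30 degrees

print(slope, intercept, predicted_at_30)
```

The slope is roughly 3.74 and the intercept roughly -36.36, so each extra degree is associated with about 3.7 more cups sold.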
 90 | 
 91 | ### R
 92 | 
 93 | We use the `lm()` function.
 94 | 
 95 | ```
 96 | temperatures <- c(29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30)
 97 | iced_tea_sales <- c(77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84)
 98 | 
 99 | lm_fit <- lm(iced_tea_sales ~ temperatures)
100 | 
101 | # Print the coefficient
102 | lm_fit$coefficients[2]
103 | 
104 | # Print the intercept
105 | lm_fit$coefficients[1]
106 | ```
107 | 
108 | ![day210501](https://storage.googleapis.com/2017_ithome_ironman/day210501.png)
109 | 
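110 | ## Predicting with the linear regression model
111 | 
112 | With the linear regression model in place, the owner of an iced tea shop can measure the temperature, use it to predict iced tea sales, and manage ingredients more precisely.
113 | 
114 | ### Python
115 | 
116 | We use the `predict()` method of `LinearRegression()`.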
117 | 
118 | ```python
119 | import numpy as np
120 | from sklearn.linear_model import LinearRegression
121 | 
122 | temperatures = np.array([29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30])
123 | iced_tea_sales = np.array([77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84])
124 | 
125 | lm = LinearRegression()
126 | lm.fit(np.reshape(temperatures, (len(temperatures), 1)), np.reshape(iced_tea_sales, (len(iced_tea_sales), 1)))
127 | 
128 | # New temperature
129 | to_be_predicted = np.array([30])
130 | predicted_sales = lm.predict(np.reshape(to_be_predicted, (len(to_be_predicted), 1)))
131 | 
132 | # Predicted iced tea sales
133 | print(predicted_sales)
134 | ```
135 | 
136 | ![day2106](https://storage.googleapis.com/2017_ithome_ironman/day2106.png)
137 | 
138 | ### R
139 | 
140 | We use the `predict()` function.
141 | 
142 | ```
143 | temperatures <- c(29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30)
144 | iced_tea_sales <- c(77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84)
145 | 
146 | lm_fit <- lm(iced_tea_sales ~ temperatures)
147 | 
148 | # New temperature
149 | to_be_predicted <- data.frame(temperatures = 30)
150 | predicted_sales <- predict(lm_fit, newdata = to_be_predicted)
151 | 
152 | # Predicted iced tea sales
153 | predicted_sales
154 | ```
155 | 
156 | ![day2107](https://storage.googleapis.com/2017_ithome_ironman/day2107.png)
157 | 
158 | ## Visualizing the linear regression
159 | 
160 | We can use Python's `matplotlib` package and R's **Base plotting system**, both covered in [[Day 18] Data Visualization: matplotlib](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day18.md).
161 | 
162 | ### Python
163 | 
164 | We use the `scatter()` and `plot()` functions from `matplotlib.pyplot`.
165 | 
166 | ```python
167 | %matplotlib inline
168 | 
169 | import matplotlib.pyplot as plt
170 | import numpy as np
171 | from sklearn.linear_model import LinearRegression
172 | 
173 | temperatures = np.array([29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30])
174 | iced_tea_sales = np.array([77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84])
175 | 
176 | lm = LinearRegression()
177 | lm.fit(np.reshape(temperatures, (len(temperatures), 1)), np.reshape(iced_tea_sales, (len(iced_tea_sales), 1)))
178 | 
179 | # New temperature
180 | to_be_predicted = np.array([30])
181 | predicted_sales = lm.predict(np.reshape(to_be_predicted, (len(to_be_predicted), 1)))
182 | 
183 | # Visualization
184 | plt.scatter(temperatures, iced_tea_sales, color='black')
185 | plt.plot(temperatures, lm.predict(np.reshape(temperatures, (len(temperatures), 1))), color='blue', linewidth=3)
186 | plt.plot(to_be_predicted, predicted_sales, color = 'red', marker = '^', markersize = 10)
187 | plt.xticks(())
188 | plt.yticks(())
189 | plt.show()
190 | ```
191 | 
192 | ![day2108](https://storage.googleapis.com/2017_ithome_ironman/day2108.png)
193 | 
194 | ### R
195 | 
196 | We use `plot()` to draw the scatter plot, `points()` to add the predicted point, and `abline()` to draw the regression line.
197 | 
198 | ```
199 | temperatures <- c(29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30)
200 | iced_tea_sales <- c(77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84)
201 | 
202 | lm_fit <- lm(iced_tea_sales ~ temperatures)
203 | 
204 | # New temperature
205 | to_be_predicted <- data.frame(temperatures = 30)
206 | predicted_sales <- predict(lm_fit, newdata = to_be_predicted)
207 | 
208 | plot(iced_tea_sales ~ temperatures, bg = "blue", pch = 16)
209 | points(x = to_be_predicted$temperatures, y = predicted_sales, col = "red", cex = 2, pch = 17)
210 | abline(reg = lm_fit, col = "blue", lwd = 4)
211 | ```
212 | 
213 | ![day2109](https://storage.googleapis.com/2017_ithome_ironman/day2109.png)
214 | 
215 | ## Performance of the linear regression model
216 | 
217 | Performance measures for a linear regression model include **mean squared error (MSE)** and **R-squared**.
218 | 
219 | ### Python
220 | 
221 | ```python
222 | import numpy as np
223 | from sklearn.linear_model import LinearRegression
224 | 
225 | temperatures = np.array([29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30])
226 | iced_tea_sales = np.array([77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84])
227 | 
228 | # Reshape into column vectors
229 | temperatures = np.reshape(temperatures, (len(temperatures), 1))
230 | iced_tea_sales = np.reshape(iced_tea_sales, (len(iced_tea_sales), 1))
231 | 
232 | lm = LinearRegression()
233 | lm.fit(temperatures, iced_tea_sales)
234 | 
235 | # Model performance
236 | mse = np.mean((lm.predict(temperatures) - iced_tea_sales) ** 2)
237 | r_squared = lm.score(temperatures, iced_tea_sales)
238 | 
239 | # Print the performance measures
240 | print(mse)
241 | print(r_squared)
242 | ```
243 | 
244 | ![day2110](https://storage.googleapis.com/2017_ithome_ironman/day2110.png)
245 | 
246 | ### R
247 | 
248 | ```
249 | temperatures <- c(29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30)
250 | iced_tea_sales <- c(77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84)
251 | 
252 | lm_fit <- lm(iced_tea_sales ~ temperatures)
253 | predicted_sales <- predict(lm_fit, newdata = data.frame(temperatures))
254 | 
255 | # Model performance
256 | mse <- mean((iced_tea_sales - predicted_sales) ^ 2)
257 | 
258 | # Print the performance measures
259 | mse
260 | summary(lm_fit)$r.squared
261 | ```
262 | 
263 | ![day2111](https://storage.googleapis.com/2017_ithome_ironman/day2111.png)
264 | 
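265 | ## Summary
266 | 
267 | On day 21 we practiced using Python's machine learning package **scikit-learn**: we loaded a few toy datasets, built a simple linear regression model that predicts iced tea sales from temperature, and compared each step with the equivalent R code.
268 | 
269 | ## References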
270 | 
271 | - [機器學習基石](https://www.youtube.com/watch?v=sS4523miLnw&list=PLXVfgk9fNX2I7tB6oIINGBmW50rrmFTqf&index=2)
272 | - [scikit-learn: machine learning in Python - scikit-learn 0.18.1 documentation](http://scikit-learn.org/stable/index.html)
273 | - [世界第一簡單統計學迴歸分析篇](http://www.books.com.tw/products/0010479438)


--------------------------------------------------------------------------------
/day22.md:
--------------------------------------------------------------------------------
  1 | # [Day 22] Machine Learning (2): Multiple Regression and Logistic Regression
  2 | 
  3 | ---
  4 | 
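  5 | Today we continue with the **scikit-learn** machine learning package and extend [yesterday](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day21.md)'s linear regression by practicing one multiple regression and one Logistic regression. If you remember the application areas on the [scikit-learn home page](http://scikit-learn.org/stable/index.html), linear regression and multiple regression clearly belong to **Regression**. But what about Logistic regression? It looks like it should be filed under **Classification**, yet its name contains the word regression. From [Generalized Linear Models - scikit-learn 0.18.1 documentation](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) we learn that Logistic regression predicts a probability and is used for binary classification.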
  6 | 
  7 | > Logistic regression, despite its name, is a linear model for classification rather than regression.
  8 | > [Generalized Linear Models - scikit-learn 0.18.1 documentation](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)
  9 | 
 10 | Next we practice multiple regression with the bakery data from [世界第一簡單統計學迴歸分析篇](http://www.books.com.tw/products/0010479438) and Logistic regression with the well-known [Titanic data](https://www.kaggle.com/c/titanic/data) from [Kaggle](https://www.kaggle.com/), implementing each in both Python and R.
 11 | 
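 12 | ## Building a multiple regression model
 13 | 
 14 | We use a bakery chain's **store area (坪, tsubo)** and **distance to the nearest station (meters)** to predict each branch's **monthly sales (in units of 10,000 yen)**.
 15 | 
 16 | ### Python
 17 | 
 18 | We use the `LinearRegression` class from `sklearn.linear_model`.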
 19 | 
 20 | ```python
 21 | import numpy as np
 22 | from sklearn.linear_model import LinearRegression
 23 | 
 24 | X = np.array([
 25 |     [10, 80], [8, 0], [8, 200], [5, 200], [7, 300], [8, 230], [7, 40], [9, 0], [6, 330], [9, 180]
 26 | ])
 27 | y = np.array([469, 366, 371, 208, 246, 297, 363, 436, 198, 364])
 28 | 
 29 | lm = LinearRegression()
 30 | lm.fit(X, y)
 31 | 
 32 | # Print the coefficients
 33 | print(lm.coef_)
 34 | 
 35 | # Print the intercept
 36 | print(lm.intercept_)
 37 | ```
 38 | 
 39 | ![day2201](https://storage.googleapis.com/2017_ithome_ironman/day2201.png)
 40 | 
 41 | ### R
 42 | 
 43 | We use the `lm()` function.
 44 | 
 45 | ```
 46 | store_area <- c(10, 8, 8, 5, 7, 8, 7, 9, 6, 9)
 47 | dist_to_station <- c(80, 0, 200, 200, 300, 230, 40, 0, 330, 180)
 48 | monthly_sales <- c(469, 366, 371, 208, 246, 297, 363, 436, 198, 364)
 49 | bakery_df <- data.frame(store_area, dist_to_station, monthly_sales)
 50 | 
 51 | lm_fit <- lm(monthly_sales ~ ., data = bakery_df)
 52 | 
 53 | # Print the coefficients
 54 | lm_fit$coefficients[-1]
 55 | 
 56 | # Print the intercept
 57 | lm_fit$coefficients[1]
 58 | ```
 59 | 
 60 | ![day2202](https://storage.googleapis.com/2017_ithome_ironman/day2202.png)
 61 | 
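 62 | ## Predicting with the multiple regression model
 63 | 
 64 | Once the multiple regression model is built, the owner of the bakery chain can predict a candidate location's monthly sales from its store area and distance to the station, and so plan rent and staffing costs more precisely.
 65 | 
 66 | ### Python
 67 | 
 68 | We use the `predict()` method of the fitted `LinearRegression` object.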
 69 | 
 70 | ```python
 71 | import numpy as np
 72 | from sklearn.linear_model import LinearRegression
 73 | 
 74 | X = np.array([
 75 |     [10, 80], [8, 0], [8, 200], [5, 200], [7, 300], [8, 230], [7, 40], [9, 0], [6, 330], [9, 180]
 76 | ])
 77 | y = np.array([469, 366, 371, 208, 246, 297, 363, 436, 198, 364])
 78 | 
 79 | lm = LinearRegression()
 80 | lm.fit(X, y)
 81 | 
 82 | # Data for the new bakery
 83 | to_be_predicted = np.array([
 84 |     [10, 110]
 85 | ])
 86 | predicted_sales = lm.predict(to_be_predicted)
 87 | 
 88 | # Predicted monthly sales of the new bakery
 89 | print(predicted_sales)
 90 | ```
 91 | 
 92 | ![day2203](https://storage.googleapis.com/2017_ithome_ironman/day2203.png)
 93 | 
 94 | ### R
 95 | 
 96 | We use the `predict()` function.
 97 | 
 98 | ```
 99 | store_area <- c(10, 8, 8, 5, 7, 8, 7, 9, 6, 9)
100 | dist_to_station <- c(80, 0, 200, 200, 300, 230, 40, 0, 330, 180)
101 | monthly_sales <- c(469, 366, 371, 208, 246, 297, 363, 436, 198, 364)
102 | bakery_df <- data.frame(store_area, dist_to_station, monthly_sales)
103 | 
104 | lm_fit <- lm(monthly_sales ~ ., data = bakery_df)
105 | 
106 | # Data for the new bakery
107 | to_be_predicted <- data.frame(store_area = 10, dist_to_station = 110)
108 | predicted_sales <- predict(lm_fit, newdata = to_be_predicted)
109 | 
110 | # Predicted monthly sales of the new bakery
111 | predicted_sales
112 | ```
113 | 
114 | ![day2204](https://storage.googleapis.com/2017_ithome_ironman/day2204.png)
115 | 
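116 | ## Performance of the multiple regression model
117 | 
118 | Performance measures for a multiple regression model include the **mean squared error (MSE)**, **R-squared**, and **adjusted R-squared**.
119 | 
120 | ### Python
121 | 
122 | We call the `score()` method of the fitted `LinearRegression` object to get R-squared, and compute MSE and adjusted R-squared ourselves.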
123 | 
124 | ```python
125 | import numpy as np
126 | from sklearn.linear_model import LinearRegression
127 | 
128 | X = np.array([
129 |     [10, 80], [8, 0], [8, 200], [5, 200], [7, 300], [8, 230], [7, 40], [9, 0], [6, 330], [9, 180]
130 | ])
131 | y = np.array([469, 366, 371, 208, 246, 297, 363, 436, 198, 364])
132 | 
133 | lm = LinearRegression()
134 | lm.fit(X, y)
135 | 
136 | # Model performance
137 | mse = np.mean((lm.predict(X) - y) ** 2)
138 | r_squared = lm.score(X, y)
139 | adj_r_squared = r_squared - (1 - r_squared) * (X.shape[1] / (X.shape[0] - X.shape[1] - 1))
140 | 
141 | # Print the performance measures
142 | print(mse)
143 | print(r_squared)
144 | print(adj_r_squared)
145 | ```
146 | 
147 | ![day2205](https://storage.googleapis.com/2017_ithome_ironman/day2205.png)
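
The adjusted R-squared line in the code above uses the form R² − (1 − R²)·p / (n − p − 1). The sketch below checks, assuming n is the number of observations and p the number of predictors, that this is algebraically the same as the more commonly quoted 1 − (1 − R²)(n − 1) / (n − p − 1). The function names are hypothetical, and the R² value is a made-up number for illustration, not the model's actual output.

```python
import math

def adj_r2_as_in_post(r_squared, n, p):
    # Form used in the code above
    return r_squared - (1 - r_squared) * (p / (n - p - 1))

def adj_r2_textbook(r_squared, n, p):
    # Form usually quoted in textbooks
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

n, p = 10, 2       # 10 bakeries and 2 predictors, as in the data above
r_squared = 0.9    # hypothetical value, for illustration only

a = adj_r2_as_in_post(r_squared, n, p)
b = adj_r2_textbook(r_squared, n, p)
print(a, b)
```

Both forms penalize R-squared more heavily as p grows relative to n, which is why the adjusted value is never larger than the plain one.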
148 | 
149 | ### R
150 | 
151 | We read the `r.squared` and `adj.r.squared` elements of `summary(lm_fit)`.
152 | 
153 | ```
154 | store_area <- c(10, 8, 8, 5, 7, 8, 7, 9, 6, 9)
155 | dist_to_station <- c(80, 0, 200, 200, 300, 230, 40, 0, 330, 180)
156 | monthly_sales <- c(469, 366, 371, 208, 246, 297, 363, 436, 198, 364)
157 | bakery_df <- data.frame(store_area, dist_to_station, monthly_sales)
158 | 
159 | lm_fit <- lm(monthly_sales ~ ., data = bakery_df)
160 | predicted_sales <- predict(lm_fit, newdata = data.frame(store_area, dist_to_station))
161 | 
162 | # Model performance
163 | mse <- mean((monthly_sales - predicted_sales) ^ 2)
164 | 
165 | # Print the performance measures
166 | mse
167 | summary(lm_fit)$r.squared
168 | summary(lm_fit)$adj.r.squared
169 | ```
170 | 
171 | ![day2206](https://storage.googleapis.com/2017_ithome_ironman/day2206.png)
172 | 
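173 | ## Coefficient tests for the multiple regression model
174 | 
175 | For a multiple regression model we usually also test whether each variable is significant, judging by whether its **p-value** is below 0.05 (a 95% confidence level).
176 | 
177 | ### Python
178 | 
179 | We use the `f_regression()` function from `sklearn.feature_selection`. Note that it runs a univariate F-test of each feature against `y` on its own, so its p-values are not the same as the per-coefficient t-test p-values that R's `summary()` reports for the jointly fitted model.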
180 | 
181 | ```python
182 | import numpy as np
183 | from sklearn.linear_model import LinearRegression
184 | from sklearn.feature_selection import f_regression
185 | 
186 | X = np.array([
187 |     [10, 80], [8, 0], [8, 200], [5, 200], [7, 300], [8, 230], [7, 40], [9, 0], [6, 330], [9, 180]
188 | ])
189 | y = np.array([469, 366, 371, 208, 246, 297, 363, 436, 198, 364])
190 | 
191 | lm = LinearRegression()
192 | lm.fit(X, y)
193 | 
194 | # Print the p-values (one univariate F-test per feature)
195 | print(f_regression(X, y)[1])
196 | ```
197 | 
198 | ![day2207](https://storage.googleapis.com/2017_ithome_ironman/day2207.png)
199 | 
200 | ### R
201 | 
202 | We read the `coefficients` element of `summary(lm_fit)`.
203 | 
204 | ```
205 | store_area <- c(10, 8, 8, 5, 7, 8, 7, 9, 6, 9)
206 | dist_to_station <- c(80, 0, 200, 200, 300, 230, 40, 0, 330, 180)
207 | monthly_sales <- c(469, 366, 371, 208, 246, 297, 363, 436, 198, 364)
208 | bakery_df <- data.frame(store_area, dist_to_station, monthly_sales)
209 | 
210 | lm_fit <- lm(monthly_sales ~ ., data = bakery_df)
211 | 
212 | # Print the p-values (t-test per coefficient, intercept excluded)
213 | summary(lm_fit)$coefficients[-1, 4]
214 | ```
215 | 
216 | ![day2208](https://storage.googleapis.com/2017_ithome_ironman/day2208.png)
217 | 
218 | ## Building a Logistic regression model
219 | 
220 | With the well-known Titanic data from [Kaggle](https://www.kaggle.com/), we use **Sex**, **Pclass**, and **Age** to predict **Survived**.
221 | 
222 | ### Python
223 | 
224 | We use the `LogisticRegression` class from `sklearn.linear_model`.
225 | 
226 | ```python
227 | import pandas as pd
228 | import numpy as np
229 | from sklearn import preprocessing, linear_model
230 | 
231 | url = "https://storage.googleapis.com/2017_ithome_ironman/data/kaggle_titanic_train.csv"
232 | titanic_train = pd.read_csv(url)
233 | 
234 | # Fill missing Age values with the median
235 | age_median = np.nanmedian(titanic_train["Age"])
236 | new_Age = np.where(titanic_train["Age"].isnull(), age_median, titanic_train["Age"])
237 | titanic_train["Age"] = new_Age
238 | titanic_train
239 | 
240 | # Encode Sex as 0/1 (a single dummy variable)
241 | label_encoder = preprocessing.LabelEncoder()
242 | encoded_Sex = label_encoder.fit_transform(titanic_train["Sex"])
243 | 
244 | # Build train_X
245 | train_X = pd.DataFrame([titanic_train["Pclass"],
246 |                         encoded_Sex,
247 |                         titanic_train["Age"]
248 | ]).T
249 | 
250 | # Fit the model
251 | logistic_regr = linear_model.LogisticRegression()
252 | logistic_regr.fit(train_X, titanic_train["Survived"])
253 | 
254 | # Print the coefficients
255 | print(logistic_regr.coef_)
256 | 
257 | # Print the intercept
258 | print(logistic_regr.intercept_)
259 | ```
260 | 
261 | ![day2209](https://storage.googleapis.com/2017_ithome_ironman/day2209.png)
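
To make "predicts a probability" concrete: the fitted coefficients and intercept define a linear score, and the logistic (sigmoid) function maps that score to a probability between 0 and 1. The sketch below uses made-up coefficient values purely for illustration (they are not the values in the screenshot above), and the helper name `predict_probability` is hypothetical. It assumes `LabelEncoder` has encoded `female` as 0 and `male` as 1, which follows from its alphabetical ordering of labels.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, coefficients, intercept):
    """Probability of the positive class (Survived = 1) for one observation."""
    score = intercept + sum(c * x for c, x in zip(coefficients, features))
    return sigmoid(score)

# Hypothetical coefficients for [Pclass, Sex (0 = female, 1 = male), Age]
coefficients = [-1.0, -2.5, -0.03]
intercept = 4.0

passenger = [3, 1, 22]  # third class, male, 22 years old
probability = predict_probability(passenger, coefficients, intercept)

# A 0.5 cutoff turns the probability into a class label
predicted_class = 1 if probability > 0.5 else 0
print(probability, predicted_class)
```

The same 0.5 cutoff appears explicitly in the R accuracy code later in this post, where `predict(..., type = "response")` returns probabilities rather than class labels.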
262 | 
263 | ### R
264 | 
265 | We use the `glm()` function with the argument `family = binomial(link = "logit")`.
266 | 
267 | ```
268 | url = "https://storage.googleapis.com/2017_ithome_ironman/data/kaggle_titanic_train.csv"
269 | titanic_train <- read.csv(url)
270 | 
271 | # Fill missing Age values with the median
272 | age_median <- median(titanic_train$Age, na.rm = TRUE)
273 | new_Age <- ifelse(is.na(titanic_train$Age), age_median, titanic_train$Age)
274 | titanic_train$Age <- new_Age
275 | 
276 | # Fit the model
277 | logistic_regr <- glm(Survived ~ Age + Pclass + Sex, data = titanic_train, family = binomial(link = "logit"))
278 | 
279 | # Print the coefficients
280 | logistic_regr$coefficients[-1]
281 | 
282 | # Print the intercept
283 | logistic_regr$coefficients[1]
284 | ```
285 | 
286 | ![day2210](https://storage.googleapis.com/2017_ithome_ironman/day2210.png)
287 | 
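288 | ## Coefficient tests for the Logistic regression model
289 | 
290 | We can also test the significance of each variable in a Logistic regression model, judging by whether its **p-value** is below 0.05 (a 95% confidence level).
291 | 
292 | ### Python
293 | 
294 | We use the `f_regression()` function from `sklearn.feature_selection`. As with the multiple regression, this is a univariate F-test of each feature against the target, not a test on the fitted Logistic regression coefficients, so the p-values differ from those in R's `summary()` output below.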
295 | 
296 | ```python
297 | import pandas as pd
298 | import numpy as np
299 | from sklearn import preprocessing, linear_model
300 | from sklearn.feature_selection import f_regression
301 | 
302 | url = "https://storage.googleapis.com/2017_ithome_ironman/data/kaggle_titanic_train.csv"
303 | titanic_train = pd.read_csv(url)
304 | 
305 | # Fill missing Age values with the median
306 | age_median = np.nanmedian(titanic_train["Age"])
307 | new_Age = np.where(titanic_train["Age"].isnull(), age_median, titanic_train["Age"])
308 | titanic_train["Age"] = new_Age
309 | titanic_train
310 | 
311 | # Encode Sex as 0/1 (a single dummy variable)
312 | label_encoder = preprocessing.LabelEncoder()
313 | encoded_Sex = label_encoder.fit_transform(titanic_train["Sex"])
314 | 
315 | # Build train_X
316 | train_X = pd.DataFrame([titanic_train["Pclass"],
317 |                         encoded_Sex,
318 |                         titanic_train["Age"]
319 | ]).T
320 | 
321 | # Fit the model
322 | logistic_regr = linear_model.LogisticRegression()
323 | logistic_regr.fit(train_X, titanic_train["Survived"])
324 | 
325 | # Print the p-values (one univariate F-test per feature)
326 | print(f_regression(train_X, titanic_train["Survived"])[1])
327 | ```
328 | 
329 | ![day2211](https://storage.googleapis.com/2017_ithome_ironman/day2211.png)
330 | 
331 | ### R
332 | 
333 | ```
334 | url = "https://storage.googleapis.com/2017_ithome_ironman/data/kaggle_titanic_train.csv"
335 | titanic_train <- read.csv(url)
336 | 
337 | # Fill missing Age values with the median
338 | age_median <- median(titanic_train$Age, na.rm = TRUE)
339 | new_Age <- ifelse(is.na(titanic_train$Age), age_median, titanic_train$Age)
340 | titanic_train$Age <- new_Age
341 | 
342 | # Fit the model
343 | logistic_regr <- glm(Survived ~ Age + Pclass + Sex, data = titanic_train, family = binomial(link = "logit"))
344 | 
345 | # Print the p-values (intercept excluded)
346 | summary(logistic_regr)$coefficients[-1, 4]
347 | ```
348 | 
349 | ![day2212](https://storage.googleapis.com/2017_ithome_ironman/day2212.png)
350 | 
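351 | ## Performance of the Logistic regression model
352 | 
353 | We measure the performance of a binary classification model with its **accuracy**, the share of observations whose predicted class matches the actual class.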
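
As a minimal sketch of that definition, the following standard-library Python builds a confusion matrix from two label lists and computes accuracy as the diagonal sum divided by the total, which is the same calculation the R code in this section performs with `table()` and `diag()`. The labels are made up for illustration, and the helper names are hypothetical.

```python
def confusion_matrix(actual, predicted, labels=(0, 1)):
    """Count (actual, predicted) pairs; rows are actual labels, columns are predicted labels."""
    matrix = [[0 for _ in labels] for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[labels.index(a)][labels.index(p)] += 1
    return matrix

def accuracy_from_matrix(matrix):
    """Correct predictions sit on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up labels, for illustration only
actual    = [0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 1, 1, 1, 0, 0, 1, 0]

matrix = confusion_matrix(actual, predicted)
accuracy = accuracy_from_matrix(matrix)
print(matrix)    # [[3, 1], [1, 3]]
print(accuracy)  # 0.75
```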
354 | 
355 | ### Python
356 | 
357 | ```python
358 | import pandas as pd
359 | import numpy as np
360 | from sklearn import preprocessing, linear_model
361 | 
362 | url = "https://storage.googleapis.com/2017_ithome_ironman/data/kaggle_titanic_train.csv"
363 | titanic_train = pd.read_csv(url)
364 | 
365 | # Fill missing Age values with the median
366 | age_median = np.nanmedian(titanic_train["Age"])
367 | new_Age = np.where(titanic_train["Age"].isnull(), age_median, titanic_train["Age"])
368 | titanic_train["Age"] = new_Age
369 | titanic_train
370 | 
371 | # Encode Sex as 0/1 (a single dummy variable)
372 | label_encoder = preprocessing.LabelEncoder()
373 | encoded_Sex = label_encoder.fit_transform(titanic_train["Sex"])
374 | 
375 | # Build train_X
376 | train_X = pd.DataFrame([titanic_train["Pclass"],
377 |                         encoded_Sex,
378 |                         titanic_train["Age"]
379 | ]).T
380 | 
381 | # Fit the model
382 | logistic_regr = linear_model.LogisticRegression()
383 | logistic_regr.fit(train_X, titanic_train["Survived"])
384 | 
385 | # Compute the accuracy on the training data
386 | survived_predictions = logistic_regr.predict(train_X)
387 | accuracy = logistic_regr.score(train_X, titanic_train["Survived"])
388 | print(accuracy)
389 | ```
390 | 
391 | ![day2213](https://storage.googleapis.com/2017_ithome_ironman/day2213.png)
392 | 
393 | ### R
394 | 
395 | ```
396 | url = "https://storage.googleapis.com/2017_ithome_ironman/data/kaggle_titanic_train.csv"
397 | titanic_train <- read.csv(url)
398 | 
399 | # Fill missing Age values with the median
400 | age_median <- median(titanic_train$Age, na.rm = TRUE)
401 | new_Age <- ifelse(is.na(titanic_train$Age), age_median, titanic_train$Age)
402 | titanic_train$Age <- new_Age
403 | 
404 | # Fit the model
405 | logistic_regr <- glm(Survived ~ Age + Pclass + Sex, data = titanic_train, family = binomial(link = "logit"))
406 | 
407 | # Compute the accuracy on the training data
408 | x_features <- titanic_train[, c("Age", "Pclass", "Sex")]
409 | survived_predictions <- predict(logistic_regr, newdata = x_features, type = "response")
410 | prediction_cutoff <- ifelse(survived_predictions > 0.5, 1, 0)
411 | confusion_matrix <- table(titanic_train$Survived, prediction_cutoff)
412 | accuracy <- sum(diag(confusion_matrix))/sum(confusion_matrix)
413 | accuracy
414 | ```
415 | 
416 | ![day2214](https://storage.googleapis.com/2017_ithome_ironman/day2214.png)
417 | 
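418 | ## Summary
419 | 
420 | On day 22 we kept practicing with Python's machine learning package **scikit-learn**. We built a multiple regression model that predicts a bakery's monthly sales from store area and distance to the station, and a Logistic regression model that predicts whether a Titanic passenger survived from sex, age, and passenger class (a proxy for socio-economic status), comparing each with the R equivalent.
421 | 
422 | ## References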
423 | 
424 | - [scikit-learn: machine learning in Python - scikit-learn 0.18.1 documentation](http://scikit-learn.org/stable/index.html)
425 | - [世界第一簡單統計學迴歸分析篇](http://www.books.com.tw/products/0010479438)
426 | - [Python for Data Analysis Part 28: Logistic Regression](http://hamelg.blogspot.tw/2015/11/python-for-data-analysis-part-28.html)
427 | - [Kaggle: Your Home for Data Science](https://www.kaggle.com/)


--------------------------------------------------------------------------------
/day23.md:
--------------------------------------------------------------------------------
  1 | # [Day 23] Machine Learning (3): Decision Tree and k-NN Classifiers
  2 | 
  3 | ---
  4 | 
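  5 | Today we keep practicing with the **scikit-learn** machine learning package. Remember that [yesterday](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day22.md) we mentioned that Logistic regression, despite having **regression** in its name, is actually a binary classification algorithm? Logistic regression was the first classifier we built. Classification and regression are both supervised learning; one predicts a categorical target variable, the other a continuous one.
  6 | 
  7 | Today we build two classifiers, a decision tree classifier and a k-Nearest Neighbors classifier. The biggest difference from yesterday's binary Logistic regression is that both are multiclass classification algorithms.
  8 | 
  9 | We also start using `train_test_split()` from `sklearn.model_selection` (it used to live in `sklearn.cross_validation`, a module that was removed in scikit-learn 0.20) to split the iris data conveniently into training and test sets. This is a very common preprocessing step, and `train_test_split()` does it in one line.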
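
As a rough idea of what such a split does, here is a standard-library sketch: shuffle the row indices, then hold out a fraction of them as test data. The function name `simple_train_test_split` and its `seed` parameter are hypothetical, and scikit-learn's real implementation differs in details such as how it rounds the test size.

```python
import random

def simple_train_test_split(X, y, test_size=0.3, seed=None):
    """Shuffle the row indices, then hold out `test_size` of the rows as test data."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_X = [X[i] for i in train_idx]
    test_X = [X[i] for i in test_idx]
    train_y = [y[i] for i in train_idx]
    test_y = [y[i] for i in test_idx]
    # Same return order as scikit-learn: X parts first, then y parts
    return train_X, test_X, train_y, test_y

# Made-up data: 10 rows with 2 features each, and a 0/1 label per row
X = [[i, i * 2] for i in range(10)]
y = [i % 2 for i in range(10)]

train_X, test_X, train_y, test_y = simple_train_test_split(X, y, test_size=0.3, seed=0)
print(len(train_X), len(test_X))  # 7 3
```

Shuffling before cutting matters for data like iris, whose rows are sorted by species; the R code in this post does the same thing with `sample(n)`.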
 10 | 
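 11 | ## Building a decision tree classifier
 12 | 
 13 | A decision tree classifier can handle multiclass classification problems. The two things we like most about it are:
 14 | 
 15 | - It can handle continuous and categorical variables together.
 16 | - It needs little data preprocessing. [Yesterday](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day22.md), before fitting the Logistic regression, we had to turn variables such as `Pclass` and `Sex` into dummy variables, whereas a decision tree does not need that step in principle. In practice R's `rpart()` accepts factors directly, while scikit-learn's `DecisionTreeClassifier` still requires numeric input.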
 17 | 
 18 | ```
 19 | # ... (earlier code omitted)
 20 | 
 21 | # Create dummy variables
 22 | label_encoder = preprocessing.LabelEncoder()
 23 | encoded_Sex = label_encoder.fit_transform(titanic_train["Sex"])
 24 | encoded_Pclass = label_encoder.fit_transform(titanic_train["Pclass"])
 25 | 
 26 | # ... (later code omitted)
 27 | ```
 28 | 
 29 | > Decision Trees Classifiers are a non-parametric supervised learning method used for classification, that are capable of performing multi-class classification on a dataset. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
 30 | > [1.10. Decision Trees - scikit-learn 0.18.1 documentation](http://scikit-learn.org/stable/modules/tree.html)
 31 | 
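 32 | We use the **iris data**, one of the toy datasets in the **scikit-learn** package, and predict the species from petal length and width and sepal length and width, building a three-class classifier with the decision tree algorithm. If toy datasets are unfamiliar, see [[Day 21] Machine Learning: Toy Datasets and Linear Regression](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day21.md).
 33 | 
 34 | ### Python
 35 | 
 36 | We use the `DecisionTreeClassifier` class from `sklearn.tree`.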
 37 | 
 38 | ```python
 39 | from sklearn.datasets import load_iris
 40 | from sklearn import tree
 41 | from sklearn.model_selection import train_test_split
 42 | 
 43 | # Load the iris data
 44 | iris = load_iris()
 45 | iris_X = iris.data
 46 | iris_y = iris.target
 47 | 
 48 | # Split into training and test data
 49 | train_X, test_X, train_y, test_y = train_test_split(iris_X, iris_y, test_size = 0.3)
 50 | 
 51 | # Build the classifier
 52 | clf = tree.DecisionTreeClassifier()
 53 | iris_clf = clf.fit(train_X, train_y)
 54 | 
 55 | # Predict
 56 | test_y_predicted = iris_clf.predict(test_X)
 57 | print(test_y_predicted)
 58 | 
 59 | # Ground truth
 60 | print(test_y)
 61 | ```
 62 | 
 63 | ![day2301](https://storage.googleapis.com/2017_ithome_ironman/day2301.png)
 64 | 
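 65 | If you look closely, you can spot the observations where the classifier's prediction differs from the ground truth.
 66 | 
 67 | ### R
 68 | 
 69 | We use the `rpart()` function from the `rpart` package.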
 70 | 
 71 | ```
 72 | library(rpart)
 73 | 
 74 | # Split into training and test data
 75 | n <- nrow(iris)
 76 | shuffled_iris <- iris[sample(n), ]
 77 | train_indices <- 1:round(0.7 * n)
 78 | train_iris <- shuffled_iris[train_indices, ]
 79 | test_indices <- (round(0.7 * n) + 1):n
 80 | test_iris <- shuffled_iris[test_indices, ]
 81 | 
 82 | # Build the classifier
 83 | iris_clf <- rpart(Species ~ ., data = train_iris, method = "class")
 84 | 
 85 | # Predict
 86 | test_iris_predicted <- predict(iris_clf, test_iris, type = "class")
 87 | test_iris_predicted
 88 | 
 89 | # Ground truth
 90 | test_iris$Species
 91 | ```
 92 | 
 93 | ![day2302](https://storage.googleapis.com/2017_ithome_ironman/day2302.png)
 94 | 
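 95 | If you look closely, you can spot the observations where the classifier's prediction differs from the ground truth.
 96 | 
 97 | ## Performance of the decision tree classifier
 98 | 
 99 | We use accuracy as the performance measure for classification algorithms.
100 | 
101 | ### Python
102 | 
103 | We compute the accuracy with the `accuracy_score()` function from `sklearn.metrics`.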
104 | 
105 | ```python
106 | from sklearn.datasets import load_iris
107 | from sklearn import tree
108 | from sklearn.model_selection import train_test_split
109 | from sklearn import metrics
110 | 
111 | # Load the iris data
112 | iris = load_iris()
113 | iris_X = iris.data
114 | iris_y = iris.target
115 | 
116 | # Split into training and test data
117 | train_X, test_X, train_y, test_y = train_test_split(iris_X, iris_y, test_size = 0.3)
118 | 
119 | # Build the classifier
120 | clf = tree.DecisionTreeClassifier()
121 | iris_clf = clf.fit(train_X, train_y)
122 | 
123 | # Predict
124 | test_y_predicted = iris_clf.predict(test_X)
125 | 
126 | # Performance
127 | accuracy = metrics.accuracy_score(test_y, test_y_predicted)
128 | print(accuracy)
129 | ```
130 | 
131 | ![day2303](https://storage.googleapis.com/2017_ithome_ironman/day2303.png)
132 | 
133 | ### R
134 | 
135 | We compute the accuracy from the confusion matrix.
136 | 
137 | ```
138 | library(rpart)
139 | 
140 | # Split into training and test data
141 | n <- nrow(iris)
142 | shuffled_iris <- iris[sample(n), ]
143 | train_indices <- 1:round(0.7 * n)
144 | train_iris <- shuffled_iris[train_indices, ]
145 | test_indices <- (round(0.7 * n) + 1):n
146 | test_iris <- shuffled_iris[test_indices, ]
147 | 
148 | # Build the classifier
149 | iris_clf <- rpart(Species ~ ., data = train_iris, method = "class")
150 | 
151 | # Predict
152 | test_iris_predicted <- predict(iris_clf, test_iris, type = "class")
153 | 
154 | # Performance
155 | conf_mat <- table(test_iris$Species, test_iris_predicted)
156 | accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
157 | accuracy
158 | ```
159 | 
160 | ![day2304](https://storage.googleapis.com/2017_ithome_ironman/day2304.png)
161 | 
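162 | ## Building a k-Nearest Neighbors classifier
163 | 
164 | The k-Nearest Neighbors classifier is another algorithm that handles multiclass problems. Because it classifies an unlabeled point by its **distance** to labeled points, categorical variables must first be converted to dummy variables and all numeric variables standardized; otherwise differences in units distort the distance calculation.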
165 | 
166 | > The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning.)
167 | > [1.6. Nearest Neighbors - scikit-learn 0.18.1 documentation](http://scikit-learn.org/stable/modules/neighbors.html)
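
The sketch below illustrates both ideas with the standard library only: a majority vote among the k nearest training points, and how standardization can change the answer when features are on very different scales. The data is made up (the class depends on the first, small-scale feature), and the helper names `standardize` and `knn_predict` are hypothetical rather than scikit-learn functions. It needs Python 3.8 or later for `math.dist`.

```python
import math
from collections import Counter

def standardize(rows, reference):
    """Z-score each column of `rows` using the column means and standard deviations of `reference`."""
    columns = list(zip(*reference))
    means = [sum(col) / len(col) for col in columns]
    sds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to `query` (Euclidean distance)."""
    distances = sorted((math.dist(x, query), label) for x, label in zip(train_X, train_y))
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up data: class "a" has a small first feature, class "b" a large one;
# the second feature is on a much larger scale and carries no class information
train_X = [(1.0, 1000), (1.2, 3000), (5.0, 1100), (5.2, 2900)]
train_y = ["a", "a", "b", "b"]
query = (1.1, 2900)

# Without standardization the large-scale second feature dominates the distance
raw_prediction = knn_predict(train_X, train_y, query, k=3)

# After standardization both features contribute comparably
scaled_train = standardize(train_X, train_X)
scaled_query = standardize([query], train_X)[0]
scaled_prediction = knn_predict(scaled_train, train_y, scaled_query, k=3)

print(raw_prediction, scaled_prediction)  # b a
```

The iris features used below are all measured in centimeters and span similar ranges, which is why the examples in this post get away without standardizing them.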
168 | 
169 | ### Python
170 | 
171 | We use the `KNeighborsClassifier` class from `sklearn.neighbors`; its default is k = 5.
172 | 
173 | ```python
174 | from sklearn.datasets import load_iris
175 | from sklearn import neighbors
176 | from sklearn.model_selection import train_test_split
177 | from sklearn import metrics
178 | 
179 | # Load the iris data
180 | iris = load_iris()
181 | iris_X = iris.data
182 | iris_y = iris.target
183 | 
184 | # Split into training and test data
185 | train_X, test_X, train_y, test_y = train_test_split(iris_X, iris_y, test_size = 0.3)
186 | 
187 | # Build the classifier
188 | clf = neighbors.KNeighborsClassifier()
189 | iris_clf = clf.fit(train_X, train_y)
190 | 
191 | # Predict
192 | test_y_predicted = iris_clf.predict(test_X)
193 | print(test_y_predicted)
194 | 
195 | # Ground truth
196 | print(test_y)
197 | ```
198 | 
199 | ![day2305](https://storage.googleapis.com/2017_ithome_ironman/day2305.png)
200 | 
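201 | If you look closely, you can spot the observations where the classifier's prediction differs from the ground truth.
202 | 
203 | ### R
204 | 
205 | We use the `knn()` function from the `class` package with the argument `k = 5`.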
206 | 
207 | ```
208 | library(class)
209 | 
210 | # Split into training and test data
211 | n <- nrow(iris)
212 | shuffled_iris <- iris[sample(n), ]
213 | train_indices <- 1:round(0.7 * n)
214 | train_iris <- shuffled_iris[train_indices, ]
215 | test_indices <- (round(0.7 * n) + 1):n
216 | test_iris <- shuffled_iris[test_indices, ]
217 | 
218 | # Separate X and y
219 | train_iris_X <- train_iris[, -5]
220 | test_iris_X <- test_iris[, -5]
221 | train_iris_y <- train_iris[, 5]
222 | test_iris_y <- test_iris[, 5]
223 | 
224 | # Predict
225 | test_y_predicted <- knn(train = train_iris_X, test = test_iris_X, cl = train_iris_y, k = 5)
226 | test_y_predicted
227 | 
228 | # Ground truth
229 | print(test_iris_y)
230 | ```
231 | 
232 | ![day2306](https://storage.googleapis.com/2017_ithome_ironman/day2306.png)
233 | 
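234 | If you look closely, you can spot the observations where the classifier's prediction differs from the ground truth.
235 | 
236 | ## How to choose k
237 | 
238 | We can let the program search for a suitable **k**. As a rule of thumb, the upper bound for k is about 20% of the number of training samples.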
239 | 
240 | ### Python
241 | 
242 | ```python
243 | from sklearn.datasets import load_iris
244 | from sklearn import neighbors
245 | from sklearn.model_selection import train_test_split
246 | from sklearn import metrics
247 | import numpy as np
248 | import matplotlib.pyplot as plt
249 | 
250 | # Load the iris data
251 | iris = load_iris()
252 | iris_X = iris.data
253 | iris_y = iris.target
254 | 
255 | # Split into training and test data
256 | train_X, test_X, train_y, test_y = train_test_split(iris_X, iris_y, test_size = 0.3)
257 | 
258 | # Candidate values of k (named k_range to avoid shadowing the built-in range)
259 | k_range = np.arange(1, int(round(0.2 * train_X.shape[0])) + 1)
260 | accuracies = []
261 | 
262 | for i in k_range:
263 |     clf = neighbors.KNeighborsClassifier(n_neighbors = i)
264 |     iris_clf = clf.fit(train_X, train_y)
265 |     test_y_predicted = iris_clf.predict(test_X)
266 |     accuracy = metrics.accuracy_score(test_y, test_y_predicted)
267 |     accuracies.append(accuracy)
268 | 
269 | # Visualization
270 | plt.scatter(k_range, accuracies)
271 | plt.show()
272 | appr_k = accuracies.index(max(accuracies)) + 1
273 | print(appr_k)
274 | ```
275 | 
276 | ![day2307](https://storage.googleapis.com/2017_ithome_ironman/day2307.png)
277 | 
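278 | In this run the accuracy is highest for k between 8 and 12; because the train/test split is random, the result varies from run to run.
279 | 
280 | ### R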
281 | 
282 | ```
283 | library(class)
284 | 
285 | # Split into training and test data
286 | n <- nrow(iris)
287 | shuffled_iris <- iris[sample(n), ]
288 | train_indices <- 1:round(0.7 * n)
289 | train_iris <- shuffled_iris[train_indices, ]
290 | test_indices <- (round(0.7 * n) + 1):n
291 | test_iris <- shuffled_iris[test_indices, ]
292 | 
293 | # Separate X and y
294 | train_iris_X <- train_iris[, -5]
295 | test_iris_X <- test_iris[, -5]
296 | train_iris_y <- train_iris[, 5]
297 | test_iris_y <- test_iris[, 5]
298 | 
299 | # Candidate values of k
300 | range <- 1:round(0.2 * nrow(train_iris_X))
301 | accuracies <- rep(NA, length(range))
302 | 
303 | for (i in range) {
304 |   test_y_predicted <- knn(train = train_iris_X, test = test_iris_X, cl = train_iris_y, k = i)
305 |   conf_mat <- table(test_iris_y, test_y_predicted)
306 |   accuracies[i] <- sum(diag(conf_mat))/sum(conf_mat)
307 | }
308 | 
309 | # Visualization
310 | plot(range, accuracies, xlab = "k")
311 | which.max(accuracies)
312 | ```
313 | 
314 | ![day2308](https://storage.googleapis.com/2017_ithome_ironman/day2308.png)
315 | 
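316 | In this run the accuracy is highest at k = 14 and k = 18; because `sample()` shuffles the rows randomly, the result varies from run to run.
317 | 
318 | ## Summary
319 | 
320 | On day 23 we kept practicing with Python's machine learning package **scikit-learn**. We split the familiar iris data into training and test sets, built a **decision tree** classifier and a **k-Nearest Neighbors** classifier, let the program choose a suitable k, and compared each step with R.
321 | 
322 | ## References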
323 | 
324 | - [scikit-learn: machine learning in Python - scikit-learn 0.18.1 documentation](http://scikit-learn.org/stable/index.html)


--------------------------------------------------------------------------------
/day24.md:
--------------------------------------------------------------------------------
  1 | # [Day 24] Machine Learning (4): Clustering Algorithms
  2 | 
  3 | ---
  4 | 
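  5 | Today we keep practicing with the **scikit-learn** machine learning package. After three days of supervised learning (regression and classification), we change pace and practice clustering, an important family of unsupervised learning algorithms. Think back: for iced tea sales, bakery monthly sales, Titanic survival, and iris species, the training data always **had labels (answers)**. The biggest difference between unsupervised and supervised learning is that its training data has **no labels (answers)**.
  6 | 
  7 | The goal of a clustering algorithm is easy to state: **large differences between clusters, small differences within clusters**. Here **difference** is measured by the distance between observations, most commonly the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance). Because distance is again the yardstick, preprocessing should standardize all numeric variables, just as with the k-Nearest Neighbors classifier, so that differing units do not distort the distances.
  8 | 
  9 | Today we use the familiar iris data, with petal length and width and sepal length and width, to practice two clustering algorithms: **K-Means** and **Hierarchical Clustering**.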
 10 | 
 11 | ## K-Means
 12 | 
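 13 | The K-Means algorithm completes a clustering task very quickly, but if the observations contain noise or outliers, the result is easily swayed by them. It suits large samples whose distribution is compact.
 14 | 
 15 | ### Quick implementation
 16 | 
 17 | #### Python
 18 | 
 19 | We use the `KMeans` class from `sklearn.cluster`.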
 20 | 
 21 | ```python
 22 | from sklearn import cluster, datasets
 23 | 
 24 | # Load the iris data
 25 | iris = datasets.load_iris()
 26 | iris_X = iris.data
 27 | 
 28 | # K-Means algorithm
 29 | kmeans_fit = cluster.KMeans(n_clusters = 3).fit(iris_X)
 30 | 
 31 | # Print the cluster labels
 32 | cluster_labels = kmeans_fit.labels_
 33 | print("Cluster labels:")
 34 | print(cluster_labels)
 35 | print("---")
 36 | 
 37 | # Print the actual species for comparison
 38 | iris_y = iris.target
 39 | print("Actual species:")
 40 | print(iris_y)
 41 | ```
 42 | 
 43 | ![day2401](https://storage.googleapis.com/2017_ithome_ironman/day2401.png)
 44 | 
 45 | It looks like setosa differs more clearly from the other two species in petal and sepal length and width.
 46 | 
 47 | #### R
 48 | 
 49 | We use the `kmeans()` function.
 50 | 
 51 | ```
 52 | # Load the iris data
 53 | iris_kmeans <- iris[, -5]
 54 | 
 55 | # K-Means algorithm
 56 | kmeans_fit <- kmeans(iris_kmeans, nstart=20, centers=3)
 57 | 
 58 | # Print the cluster labels
 59 | kmeans_fit$cluster
 60 | 
 61 | # Print the actual species for comparison
 62 | iris$Species
 63 | ```
 64 | 
 65 | ![day2402](https://storage.googleapis.com/2017_ithome_ironman/day2402.png)
 66 | 
 67 | ![day2403](https://storage.googleapis.com/2017_ithome_ironman/day2403.png)
 68 | 
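 69 | It looks like setosa differs more clearly from the other two species in petal and sepal length and width.
 70 | 
 71 | ### Performance
 72 | 
 73 | The performance of a clustering algorithm can be measured with the Silhouette coefficient or with WSS (Within Cluster Sum of Squares) / BSS (Between Cluster Sum of Squares).
 74 | 
 75 | #### Python
 76 | 
 77 | We use the `silhouette_score()` function from `sklearn.metrics`. The closer the value is to 1 the better the clustering, and the closer to -1 the worse.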
 78 | 
 79 | > Compute the mean Silhouette Coefficient of all samples.
 80 | The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). The best value is 1 and the worst value is -1.
 81 | > [sklearn.metrics.silhouette_score - scikit-learn 0.18.1 documentation](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score)
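
To show what the quoted definition computes, here is a standard-library sketch of the mean Silhouette coefficient with Euclidean distance. It is an illustration on made-up points, not scikit-learn's implementation; it assumes at least two clusters, and it scores a single-member cluster as 0 by convention. It needs Python 3.8 or later for `math.dist`.

```python
import math

def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b) over all samples, following the quoted definition."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the sample's own cluster
        same = [math.dist(p, q) for j, (q, l) in enumerate(zip(points, labels)) if l == label and j != i]
        if not same:
            scores.append(0.0)  # convention for a cluster with a single member
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == other) / labels.count(other)
            for other in set(labels) if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Made-up points: two tight pairs that lie far apart
points = [(0, 0), (0, 1), (10, 0), (10, 1)]

good = silhouette_score(points, [0, 0, 1, 1])  # clusters match the two pairs
bad = silhouette_score(points, [0, 1, 0, 1])   # each cluster straddles both pairs
print(good, bad)
```

The well-separated labelling scores close to 1, while the labelling that mixes the two pairs scores below 0.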
 82 | 
 83 | ```python
 84 | from sklearn import cluster, datasets, metrics
 85 | 
 86 | # Load the iris data
 87 | iris = datasets.load_iris()
 88 | iris_X = iris.data
 89 | 
 90 | # K-Means algorithm
 91 | kmeans_fit = cluster.KMeans(n_clusters = 3).fit(iris_X)
 92 | cluster_labels = kmeans_fit.labels_
 93 | 
 94 | # Print the performance
 95 | silhouette_avg = metrics.silhouette_score(iris_X, cluster_labels)
 96 | print(silhouette_avg)
 97 | ```
 98 | 
 99 | ![day2404](https://storage.googleapis.com/2017_ithome_ironman/day2404.png)
100 | 
101 | #### R
102 | 
103 | We use Total WSS (Total Within Cluster Sum of Squares) / Total SS (Total Sum of Squares); the lower this ratio, the better the clustering.
104 | 
105 | ```
106 | # Load the iris data
107 | iris_kmeans <- iris[, -5]
108 | 
109 | # K-Means algorithm
110 | kmeans_fit <- kmeans(iris_kmeans, nstart=20, centers=3)
111 | ratio <- kmeans_fit$tot.withinss / kmeans_fit$totss
112 | ratio
113 | ```
114 | 
115 | ![day240502](https://storage.googleapis.com/2017_ithome_ironman/day240502.png)
116 | 
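117 | ### How to choose k
118 | 
119 | As k increases, the within-cluster sum of squares of K-Means can only shrink. When k equals the number of observations we reach the extreme of **largest between-cluster difference, smallest within-cluster difference**, which is not what we want. In practice we let the program help choose a suitable k.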
120 | 
121 | #### Python
122 | 
123 | ```python
124 | from sklearn import cluster, datasets, metrics
125 | import matplotlib.pyplot as plt
126 | 
127 | # Load the iris data
128 | iris = datasets.load_iris()
129 | iris_X = iris.data
130 | 
131 | # Loop over candidate values of k
132 | silhouette_avgs = []
133 | ks = range(2, 11)
134 | for k in ks:
135 |     kmeans_fit = cluster.KMeans(n_clusters = k).fit(iris_X)
136 |     cluster_labels = kmeans_fit.labels_
137 |     silhouette_avg = metrics.silhouette_score(iris_X, cluster_labels)
138 |     silhouette_avgs.append(silhouette_avg)
139 | 
140 | # Plot and print the performance for k = 2 to 10
141 | plt.bar(ks, silhouette_avgs)
142 | plt.show()
143 | print(silhouette_avgs)
144 | ```
145 | 
146 | ![day2406](https://storage.googleapis.com/2017_ithome_ironman/day2406.png)
147 | 
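148 | K-Means performs better at k = 2 and k = 3. This confirms our earlier observation that setosa differs more clearly from the other two species in petal and sepal length and width, so K-Means may well put setosa in one cluster and versicolor and virginica together in another.
149 | 
150 | #### R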
151 | 
152 | ```
153 | # Load the iris data
154 | iris_kmeans <- iris[, -5]
155 | 
156 | # Loop over candidate values of k
157 | ratio <- rep(NA, times = 10)
158 | for (k in 2:length(ratio)) {
159 |   kmeans_fit <- kmeans(iris_kmeans, centers = k, nstart = 20)
160 |   ratio[k] <- kmeans_fit$tot.withinss / kmeans_fit$betweenss
161 | }
162 | plot(ratio, type="b", xlab="k")
163 | ```
164 | 
165 | ![day240701](https://storage.googleapis.com/2017_ithome_ironman/day240701.png)
166 | 
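167 | The plot above shows the elbow point at k = 2 or k = 3, which confirms our earlier observation: setosa differs considerably from the other two species in petal and sepal length and width, so K-Means is likely to put setosa in one cluster and versicolor and virginica together in another.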
168 | 
169 | ## Hierarchical Clustering
170 | 
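171 | Unlike K-Means, hierarchical clustering does not require k to be set in advance. The agglomerative algorithm starts with every observation in its own cluster (k = n, the number of observations) and merges the two closest clusters at each step, so along the way it yields a result for every k from n down to 1.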
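The merge-two-at-a-time idea can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: it works on one-dimensional toy values, uses single linkage (the distance between the closest members of two clusters), and the function names `single_linkage` and `agglomerate` are hypothetical; it is not how scikit-learn or R's `hclust()` are implemented.

```python
def single_linkage(cluster_a, cluster_b):
    # Distance between the closest pair of members (assumed linkage rule)
    return min(abs(a - b) for a in cluster_a for b in cluster_b)

def agglomerate(values):
    # Start with every observation in its own cluster (k = n)
    clusters = [[v] for v in values]
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the closest pair of clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge that pair, leaving every other cluster untouched
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
        history.append([list(c) for c in clusters])
    return history

# Hypothetical toy values with three visible groups
history = agglomerate([1.0, 1.2, 5.0, 5.1, 9.0])
for partition in history:
    print(len(partition), partition)
```

Each entry of `history` is the partition for one value of k, which is why a single run gives the result for every k; cutting the tree at a chosen k just picks one of those partitions.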
172 | 
173 | ### Quick implementation
174 | 
175 | #### Python
176 | 
177 | ```python
178 | from sklearn import cluster, datasets
179 | 
180 | # Load the iris data
181 | iris = datasets.load_iris()
182 | iris_X = iris.data
183 | 
184 | # Hierarchical clustering (Ward linkage works on Euclidean distance, the default)
185 | hclust = cluster.AgglomerativeClustering(linkage = 'ward', n_clusters = 3)
186 | 
187 | # Print the cluster labels
188 | hclust.fit(iris_X)
189 | cluster_labels = hclust.labels_
190 | print(cluster_labels)
191 | print("---")
192 | 
193 | # Print the species for comparison
194 | iris_y = iris.target
195 | print(iris_y)
196 | ```
197 | 
198 | ![day2408](https://storage.googleapis.com/2017_ithome_ironman/day2408.png)
199 | 
200 | It looks like setosa differs considerably from the other two species in petal and sepal length and width.
201 | 
202 | #### R
203 | 
204 | ```
205 | # Load the iris data
206 | iris_hclust <- iris[, -5]
207 | 
208 | # Hierarchical clustering algorithm
209 | dist_matrix <- dist(iris_hclust)
210 | hclust_fit <- hclust(dist_matrix, method = "single")
211 | hclust_fit_cut <- cutree(hclust_fit, k = 3)
212 | 
213 | # Print the cluster assignments
214 | hclust_fit_cut
215 | 
216 | # Print the species for comparison
217 | iris$Species
218 | ```
219 | 
220 | ![day2409](https://storage.googleapis.com/2017_ithome_ironman/day2409.png)
221 | 
222 | ![day2410](https://storage.googleapis.com/2017_ithome_ironman/day2410.png)
223 | 
224 | It looks like setosa differs considerably from the other two species in petal and sepal length and width.
225 | 
226 | ### Performance
227 | 
228 | #### Python
229 | 
230 | ```python
231 | from sklearn import cluster, datasets, metrics
232 | 
233 | # Load the iris data
234 | iris = datasets.load_iris()
235 | iris_X = iris.data
236 | 
237 | # Hierarchical clustering (Ward linkage works on Euclidean distance, the default)
238 | hclust = cluster.AgglomerativeClustering(linkage = 'ward', n_clusters = 3)
239 | 
240 | # Print the performance score
241 | hclust.fit(iris_X)
242 | cluster_labels = hclust.labels_
243 | silhouette_avg = metrics.silhouette_score(iris_X, cluster_labels)
244 | print(silhouette_avg)
245 | ```
246 | 
247 | ![day2411](https://storage.googleapis.com/2017_ithome_ironman/day2411.png)
248 | 
249 | #### R
250 | 
251 | ```
252 | library(GMD)
253 | 
254 | # Load the iris data
255 | iris_hclust <- iris[, -5]
256 | 
257 | # Hierarchical clustering algorithm
258 | dist_matrix <- dist(iris_hclust)
259 | hclust_fit <- hclust(dist_matrix)
260 | hclust_fit_cut <- cutree(hclust_fit, k = 3)
261 | 
262 | # Print the performance ratio
263 | hc_stats <- css(dist_matrix, clusters = hclust_fit_cut)
264 | hc_stats$totwss / hc_stats$totbss
265 | 
266 | # Dendrogram
267 | plot(hclust_fit)
268 | rect.hclust(hclust_fit, k = 2, border = "red")
269 | rect.hclust(hclust_fit, k = 3, border = "green")
270 | ```
271 | 
272 | ![day241201](https://storage.googleapis.com/2017_ithome_ironman/day241201.png)
273 | 
274 | ![day2413](https://storage.googleapis.com/2017_ithome_ironman/day2413.png)
275 | 
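276 | The dendrogram shows that setosa differs considerably from the other two species: it is assigned to a cluster of its own at both k = 2 and k = 3.
277 | 
278 | ## Summary
279 | 
280 | On day 24 we kept practicing Python's machine learning package **scikit-learn**. Still using the familiar iris data, we built unsupervised K-Means and hierarchical clustering models. Under both algorithms we found that setosa differs considerably from versicolor and virginica in its petals and sepals, while the latter two species are closer to each other. We also compared everything against R.
281 | 
282 | ## References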
283 | 
284 | - [scikit-learn: machine learning in Python - scikit-learn 0.18.1 documentation](http://scikit-learn.org/stable/index.html)
285 | - [Package ‘GMD’](https://cran.r-project.org/web/packages/GMD/GMD.pdf)


--------------------------------------------------------------------------------
/day25.md:
--------------------------------------------------------------------------------
  1 | # [Day 25] Machine Learning (5): Ensemble Learning
  2 | 
  3 | ---
  4 | 
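  5 | Today we keep practicing Python's **scikit-learn** machine learning package. Remember the decision tree and k-Nearest Neighbors classifiers we built in [[Day 23] Machine Learning (3): Decision Trees and k-NN Classifiers](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day23.md)? When a single classifier cannot reach good predictive performance, besides switching to another type of classifier there is another approach called **ensemble learning**, which combines the predictions of several classifiers and can thereby improve classification performance markedly.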
  6 | 
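  7 | So what is the idea behind ensemble learning? Take a true/false question. If we flip one coin to decide the answer, the probability of being right is 50%. With two coins, the probability that at least one of them lands on the right answer is 1-(50%\*50%)=75%, and with 5 coins it is 1-(50%)^5=96.875%. As the number of coins grows, so does the chance that the right answer shows up, and that is roughly the basic idea of ensemble learning.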
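The arithmetic of the coin example can be checked with a short sketch. The function name `prob_at_least_one_correct` is hypothetical, and the calculation assumes independent coins; it only reproduces the "at least one coin is right" figure used above, not the accuracy of a real voting ensemble.

```python
def prob_at_least_one_correct(n_coins, p_correct=0.5):
    # 1 minus the probability that every independent coin is wrong
    return 1 - (1 - p_correct) ** n_coins

for n in (1, 2, 5):
    print(n, prob_at_least_one_correct(n))
# 1 0.5
# 2 0.75
# 5 0.96875
```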
  8 | 
  9 | > The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.
 10 | > [1.11. Ensemble methods - scikit-learn 0.18.1 documentation](http://scikit-learn.org/stable/modules/ensemble.html)
 11 | 
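 12 | In the example above, each coin is a so-called base estimator, also known as a weak classifier. The choice of base estimator is arbitrary; in the classic ensemble learning algorithms **Bagging** and **AdaBoost**, decision trees are most often used as the base estimator. As in [[Day 22] Machine Learning (2): Multiple Regression and Logistic Regression](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day22.md), where we practiced logistic regression, we keep using the Titanic data and implement everything in both Python and R.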
 13 | 
 14 | ## Bagging
 15 | 
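 16 | Bagging is short for Bootstrap Aggregating. It uses bootstrap sampling from statistics to draw different training sets and then fits a series of base classifiers on them. Suppose the algorithm produces 5 base classifiers whose predictions for some observation are 1, 0, 1, 1, 1; the output of Bagging is then 1. This process is called voting among the base classifiers.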
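The two ingredients described above, bootstrap sampling and voting, can be sketched with the standard library alone. The helper names `bootstrap_sample` and `majority_vote` and the row indices are assumptions for illustration; `BaggingClassifier()` handles both steps internally and may differ in detail.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    # Draw len(rows) items with replacement, so some rows repeat and some are left out
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    # The label predicted by the most base classifiers wins
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)      # fixed seed so the sketch is repeatable
rows = list(range(10))       # hypothetical row indices of a training set
sample = bootstrap_sample(rows, rng)
print(sample)

# The five predictions from the paragraph above
print(majority_vote([1, 0, 1, 1, 1]))  # 1
```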
 17 | 
 18 | ### Python
 19 | 
 20 | We use `BaggingClassifier()` from `sklearn.ensemble`.
 21 | 
 22 | ```python
 23 | import numpy as np
 24 | import pandas as pd
 25 | from sklearn import model_selection, ensemble, preprocessing, metrics
 26 | 
 27 | # Load the data
 28 | url = "https://storage.googleapis.com/2017_ithome_ironman/data/kaggle_titanic_train.csv"
 29 | titanic_train = pd.read_csv(url)
 30 | 
 31 | # Fill in missing Age values with the median
 32 | age_median = np.nanmedian(titanic_train["Age"])
 33 | new_Age = np.where(titanic_train["Age"].isnull(), age_median, titanic_train["Age"])
 34 | titanic_train["Age"] = new_Age
 35 | 
 36 | # Encode Sex as an integer label
 37 | label_encoder = preprocessing.LabelEncoder()
 38 | encoded_Sex = label_encoder.fit_transform(titanic_train["Sex"])
 39 | 
 40 | # Build the training and test sets
 41 | titanic_X = pd.DataFrame([titanic_train["Pclass"],
 42 |                          encoded_Sex,
 43 |                          titanic_train["Age"]
 44 | ]).T
 45 | titanic_y = titanic_train["Survived"]
 46 | train_X, test_X, train_y, test_y = model_selection.train_test_split(titanic_X, titanic_y, test_size = 0.3)
 47 | 
 48 | # Build the bagging model
 49 | bag = ensemble.BaggingClassifier(n_estimators = 100)
 50 | bag_fit = bag.fit(train_X, train_y)
 51 | 
 52 | # Predict
 53 | test_y_predicted = bag.predict(test_X)
 54 | 
 55 | # Evaluate performance
 56 | accuracy = metrics.accuracy_score(test_y, test_y_predicted)
 57 | print(accuracy)
 58 | ```
 59 | 
 60 | ![day2501](https://storage.googleapis.com/2017_ithome_ironman/day2501.png)
 61 | 
 62 | ### R
 63 | 
 64 | We use the `bagging()` function from the `adabag` package.
 65 | 
 66 | ```
 67 | library(adabag)
 68 | library(rpart)
 69 | 
 70 | url = "https://storage.googleapis.com/2017_ithome_ironman/data/kaggle_titanic_train.csv"
 71 | titanic_train <- read.csv(url)
 72 | titanic_train$Survived <- factor(titanic_train$Survived)
 73 | 
 74 | # Fill missing Age values with the median
 75 | age_median <- median(titanic_train$Age, na.rm = TRUE)
 76 | new_Age <- ifelse(is.na(titanic_train$Age), age_median, titanic_train$Age)
 77 | titanic_train$Age <- new_Age
 78 | 
 79 | # Split into training and test sets
 80 | n <- nrow(titanic_train)
 81 | shuffled_titanic <- titanic_train[sample(n), ]
 82 | train_indices <- 1:round(0.7 * n)
 83 | train_titanic <- shuffled_titanic[train_indices, ]
 84 | test_indices <- (round(0.7 * n) + 1):n
 85 | test_titanic <- shuffled_titanic[test_indices, ]
 86 | 
 87 | # Build the model
 88 | bag_fit <- bagging(Survived ~ Pclass + Age + Sex, data = train_titanic, mfinal = 100)
 89 | 
 90 | # Predict
 91 | test_titanic_predicted <- predict(bag_fit, test_titanic)
 92 | 
 93 | # Evaluate performance
 94 | accuracy <- 1 - test_titanic_predicted$error
 95 | accuracy
 96 | ```
 97 | 
 98 | ![day2502](https://storage.googleapis.com/2017_ithome_ironman/day2502.png)
 99 | 
100 | ## AdaBoost
101 | 
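102 | AdaBoost is likewise an ensemble learning algorithm built on several base classifiers. It differs from Bagging in that, when forming each base classifier, it does not sample purely at random: observations misclassified by the previous base classifier get a higher sampling weight, so they are more likely to be selected when the next base classifier is formed, which raises their chance of being classified correctly. Put simply, it is an advanced Bagging-style algorithm that adjusts the sampling weights of observations on the fly.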
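As a rough sketch of the reweighting step, the following uses the textbook binary AdaBoost update. The function name `adaboost_reweight` and the four-observation example are assumptions for illustration, and library implementations such as `AdaBoostClassifier()` or `boosting()` may differ in detail.

```python
import math

def adaboost_reweight(weights, misclassified):
    # Weighted error of the current base classifier (assumed to satisfy 0 < err < 1)
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    # Say of the classifier: larger when the error is smaller
    alpha = 0.5 * math.log((1 - err) / err)
    # Raise the weight of misclassified observations, lower the others
    updated = [w * math.exp(alpha if wrong else -alpha)
               for w, wrong in zip(weights, misclassified)]
    # Normalize so the weights sum to 1 again
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Four observations with equal weights; only the first one was misclassified
alpha, new_weights = adaboost_reweight([0.25, 0.25, 0.25, 0.25],
                                       [True, False, False, False])
print(alpha)
print(new_weights)  # the misclassified observation now carries half of the total weight
```

After the update, the single misclassified observation weighs as much as the three correctly classified ones combined, so the next base classifier is pushed to get it right.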
103 | 
104 | ### Python
105 | 
106 | We use `AdaBoostClassifier()` from `sklearn.ensemble`.
107 | 
108 | ```python
109 | import numpy as np
110 | import pandas as pd
111 | from sklearn import model_selection, ensemble, preprocessing, metrics
112 | 
113 | # Load the data
114 | url = "https://storage.googleapis.com/2017_ithome_ironman/data/kaggle_titanic_train.csv"
115 | titanic_train = pd.read_csv(url)
116 | 
117 | # Fill in missing Age values with the median
118 | age_median = np.nanmedian(titanic_train["Age"])
119 | new_Age = np.where(titanic_train["Age"].isnull(), age_median, titanic_train["Age"])
120 | titanic_train["Age"] = new_Age
121 | 
122 | # Encode Sex as an integer label
123 | label_encoder = preprocessing.LabelEncoder()
124 | encoded_Sex = label_encoder.fit_transform(titanic_train["Sex"])
125 | 
126 | # Build the training and test sets
127 | titanic_X = pd.DataFrame([titanic_train["Pclass"],
128 |                          encoded_Sex,
129 |                          titanic_train["Age"]
130 | ]).T
131 | titanic_y = titanic_train["Survived"]
132 | train_X, test_X, train_y, test_y = model_selection.train_test_split(titanic_X, titanic_y, test_size = 0.3)
133 | 
134 | # Build the boosting model
135 | boost = ensemble.AdaBoostClassifier(n_estimators = 100)
136 | boost_fit = boost.fit(train_X, train_y)
137 | 
138 | # Predict
139 | test_y_predicted = boost.predict(test_X)
140 | 
141 | # Evaluate performance
142 | accuracy = metrics.accuracy_score(test_y, test_y_predicted)
143 | print(accuracy)
144 | ```
145 | 
146 | ![day2503](https://storage.googleapis.com/2017_ithome_ironman/day2503.png)
147 | 
148 | ### R
149 | 
150 | We use the `boosting()` function from the `adabag` package.
151 | 
152 | ```
153 | library(adabag)
154 | library(rpart)
155 | 
156 | url = "https://storage.googleapis.com/2017_ithome_ironman/data/kaggle_titanic_train.csv"
157 | titanic_train <- read.csv(url)
158 | titanic_train$Survived <- factor(titanic_train$Survived)
159 | 
160 | # Fill missing Age values with the median
161 | age_median <- median(titanic_train$Age, na.rm = TRUE)
162 | new_Age <- ifelse(is.na(titanic_train$Age), age_median, titanic_train$Age)
163 | titanic_train$Age <- new_Age
164 | 
165 | # Split into training and test sets
166 | n <- nrow(titanic_train)
167 | shuffled_titanic <- titanic_train[sample(n), ]
168 | train_indices <- 1:round(0.7 * n)
169 | train_titanic <- shuffled_titanic[train_indices, ]
170 | test_indices <- (round(0.7 * n) + 1):n
171 | test_titanic <- shuffled_titanic[test_indices, ]
172 | 
173 | # Build the model
174 | boost_fit <- boosting(Survived ~ Pclass + Age + Sex, data = train_titanic, mfinal = 100)
175 | 
176 | # Predict with the boosting model (not the earlier bag_fit)
177 | test_titanic_predicted <- predict(boost_fit, test_titanic)
178 | 
179 | # Evaluate performance
180 | accuracy <- 1 - test_titanic_predicted$error
181 | accuracy
182 | ```
183 | 
184 | ![day2504](https://storage.googleapis.com/2017_ithome_ironman/day2504.png)
185 | 
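186 | ## Summary
187 | 
188 | On day 25 we kept practicing Python's machine learning package **scikit-learn**. Using the familiar Titanic data, we built Bagging and AdaBoost **ensemble learning** classification models and compared them against R.
189 | 
190 | ## References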
191 | 
192 | - [1.11. Ensemble methods - scikit-learn 0.18.1 documentation](http://scikit-learn.org/stable/modules/ensemble.html)
193 | - [Package ‘adabag’](https://cran.r-project.org/web/packages/adabag/adabag.pdf)


--------------------------------------------------------------------------------
/day27.md:
--------------------------------------------------------------------------------
  1 | # [Day 27] Deep Learning: TensorFlow
  2 | 
  3 | ---
  4 | 
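  5 | Over the past few days we have gotten along fairly well with the powerful machine learning package **scikit-learn**, and it is not hard to see that what we know about it is only the tip of the iceberg. It offers a wide range of machine learning algorithms as well as data preprocessing and model evaluation tools. Besides the six main modules listed up front on its [home page](http://scikit-learn.org/stable/index.html), we can also consult the [machine learning map](http://scikit-learn.org/stable/tutorial/machine_learning_map/) on the official **scikit-learn** site: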
  6 | 
  7 | ![day2701](https://storage.googleapis.com/2017_ithome_ironman/day2701.png)
  8 | 
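  9 | Even though **scikit-learn** covers almost everything, picky users can still find something missing: where is deep learning, the hottest topic of recent years? For those of us who have been reading **scikit-learn** documentation for days on end, this is actually good news: we get a change of pace for a while!
 10 | 
 11 | The mainstream deep learning frameworks today are **Caffe**, **TensorFlow**, **Theano**, **Torch**, and **Keras**. **Keras** is a high-level framework whose API can run on top of either **TensorFlow** or **Theano**. The framework we pick to get started is **TensorFlow**.
 12 | 
 13 | ## Installing TensorFlow
 14 | 
 15 | Our development environment is [Anaconda](https://www.continuum.io/downloads). If you are interested in the environment used in this series, see [[Day 01] Setting Up the Development Environment and Using Python as a Calculator](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day01.md).
 16 | 
 17 | ### Step 1 (install with pip)
 18 | 
 19 | Install **TensorFlow** from the terminal with `pip`: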
 20 | 
 21 | ```
 22 | $ pip install tensorflow
 23 | ```
 24 | 
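 25 | For more installation details (the GPU version, other operating systems, and so on), see the official [installation guide](https://www.tensorflow.org/get_started/os_setup).
 26 | 
 27 | ### Step 2 (test)
 28 | 
 29 | Open a jupyter notebook and test the following code: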
 30 | 
 31 | ```python
 32 | import tensorflow as tf
 33 | hello = tf.constant('Hello, TensorFlow!')
 34 | sess = tf.Session()
 35 | print(sess.run(hello))
 36 | ```
 37 | 
 38 | ![day2702](https://storage.googleapis.com/2017_ithome_ironman/day2702.png)
 39 | 
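 40 | ## Quick implementation
 41 | 
 42 | Let's follow the official documentation and write our first **TensorFlow** program: we use gradient descent to recover the coefficient (0.1) and intercept (0.3) of a known regression model (y = 0.1x + 0.3) and compare the result.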
 43 | 
 44 | ```python
 45 | import tensorflow as tf
 46 | import numpy as np
 47 | 
 48 | # Prepare the data
 49 | x_data = np.random.rand(100).astype(np.float32)
 50 | y_data = x_data * 0.1 + 0.3
 51 | 
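 52 | # W is the coefficient (slope), initialized between -1 and 1
 53 | # b is the intercept, starting from 0 and moving toward whatever value fits
 54 | # y is the predicted value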
 55 | W = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
 56 | b = tf.Variable(tf.zeros([1]))
 57 | y = W * x_data + b
 58 | 
 59 | # Our goal is to minimize the loss (MSE)
 60 | loss = tf.reduce_mean(tf.square(y - y_data))
 61 | optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.5)
 62 | train = optimizer.minimize(loss)
 63 | 
 64 | # Create the variable initializer
 65 | init = tf.global_variables_initializer()
 66 | 
 67 | # Start a Session and run the initializer
 68 | sess = tf.Session()
 69 | sess.run(init)
 70 | 
 71 | # Estimate the coefficient and intercept of the regression line
 72 | # Print the current coefficient and intercept every 20 steps
 73 | for step in range(201):
 74 |     sess.run(train)
 75 |     if step % 20 == 0:
 76 |         print(step, sess.run(W), sess.run(b))
 77 |         
 78 | # Close the Session
 79 | sess.close()
 80 | ```
 81 | 
 82 | ![day2703](https://storage.googleapis.com/2017_ithome_ironman/day2703.png)
 83 | 
 84 | **TensorFlow** does no actual work until `sess.run()` is called; everything before that only builds the structure of the data flow.
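To see what the optimizer is doing under the hood, here is a minimal pure-Python sketch of gradient descent on the same model, y = 0.1x + 0.3, with an MSE loss. The function name `fit_line`, the fixed starting values, and the evenly spaced `xs` are assumptions chosen to keep the sketch deterministic; it is not TensorFlow's implementation.

```python
def fit_line(xs, ys, learning_rate=0.5, steps=201):
    # Fixed starting point (the TensorFlow example starts W at a random value instead)
    w, b = -0.5, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors under the current coefficient and intercept
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Step against the gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# 100 evenly spaced inputs in [0, 1] and their exact targets
xs = [i / 99 for i in range(100)]
ys = [0.1 * x + 0.3 for x in xs]
w, b = fit_line(xs, ys)
print(w, b)  # close to 0.1 and 0.3
```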
 85 | 
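 86 | ## Basic usage
 87 | 
 88 | To use **TensorFlow** we need to know its terms and definitions:
 89 | 
 90 | |Term|Definition|
 91 | |---|---|
 92 | |Graphs|Define the computation|
 93 | |Sessions|Execute the computation|
 94 | |Tensors|Data|
 95 | |Variables|Variables|
 96 | |Feeds|Data input|
 97 | |Fetches|Data output|
 98 | 
 99 | ### Building the graph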
100 | 
101 | ```python
102 | import tensorflow as tf
103 | 
104 | # A 1x2 matrix
105 | matrix1 = tf.constant([[3, 3]])
106 | 
107 | # A 2x1 matrix
108 | matrix2 = tf.constant([[2],
109 |                        [2]]
110 |                       )
111 | 
112 | # matmul() performs matrix multiplication; the answer is 12
113 | product = tf.matmul(matrix1, matrix2)
114 | ```
115 | 
116 | ![day2704](https://storage.googleapis.com/2017_ithome_ironman/day2704.png)
117 | 
118 | Our graph is now built. It has three nodes, two `constant()` nodes and one `matmul()` node, so the graph is fully constructed, but no computation has been executed yet.
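The build-first, run-later idea can be imitated in plain Python. The `Constant` and `MatMul` classes below are hypothetical stand-ins written for this sketch, not TensorFlow classes; they only show that constructing the nodes computes nothing until evaluation is requested.

```python
class Constant:
    # A node that simply holds a value
    def __init__(self, value):
        self.value = value

    def evaluate(self):
        return self.value

class MatMul:
    # A node that remembers its inputs; the product is computed only in evaluate()
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def evaluate(self):
        a = self.left.evaluate()
        b = self.right.evaluate()
        return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
                 for j in range(len(b[0]))]
                for i in range(len(a))]

# Building the graph: three nodes, nothing computed yet
matrix1 = Constant([[3, 3]])
matrix2 = Constant([[2], [2]])
product = MatMul(matrix1, matrix2)

# Running the graph, the analogue of sess.run(product)
result = product.evaluate()
print(result)  # [[12]]
```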
119 | 
120 | ### Running the computation
121 | 
122 | #### Method 1
123 | 
124 | Remember to close the Session with the `close()` method.
125 | 
126 | ```python
127 | import tensorflow as tf
128 | 
129 | # A 1x2 matrix
130 | matrix1 = tf.constant([[3, 3]])
131 | 
132 | # A 2x1 matrix
133 | matrix2 = tf.constant([[2],
134 |                        [2]]
135 |                       )
136 | 
137 | # matmul() performs matrix multiplication
138 | product = tf.matmul(matrix1, matrix2)
139 | 
140 | # Start the Session
141 | sess = tf.Session()
142 | result = sess.run(product)
143 | print(result)
144 | 
145 | # Close the Session
146 | sess.close()
147 | ```
148 | 
149 | ![day2705](https://storage.googleapis.com/2017_ithome_ironman/day2705.png)
150 | 
151 | #### Method 2
152 | 
153 | With a `with` block there is no need to close the Session separately.
154 | 
155 | ```python
156 | import tensorflow as tf
157 | 
158 | # A 1x2 matrix
159 | matrix1 = tf.constant([[3, 3]])
160 | 
161 | # A 2x1 matrix
162 | matrix2 = tf.constant([[2],
163 |                        [2]]
164 |                       )
165 | 
166 | # matmul() performs matrix multiplication
167 | product = tf.matmul(matrix1, matrix2)
168 | 
169 | # Start the Session
170 | with tf.Session() as sess:
171 |     result = sess.run([product])
172 |     print(result)
173 | ```
174 | 
175 | ![day2707](https://storage.googleapis.com/2017_ithome_ironman/day2707.png)
176 | 
177 | #### Method 3
178 | 
179 | Define `matrix1` as a `Variable` and then run its initializer.
180 | 
181 | ```python
182 | # Start the Session
183 | import tensorflow as tf
184 | sess = tf.InteractiveSession()
185 | 
186 | # A 1x2 matrix
187 | # Note that this is now a Variable
188 | matrix1 = tf.Variable([[3, 3]])
189 | 
190 | # A 2x1 matrix
191 | matrix2 = tf.constant([[2],
192 |                        [2]]
193 |                       )
194 | 
195 | # Initialize `matrix1`
196 | matrix1.initializer.run()
197 | 
198 | # Run the computation
199 | result = tf.matmul(matrix1, matrix2)
200 | print(result.eval())
201 | 
202 | # Close the Session
203 | sess.close()
204 | ```
205 | 
206 | ![day2708](https://storage.googleapis.com/2017_ithome_ironman/day2708.png)
207 | 
208 | ### Variables
209 | 
210 | ```python
211 | import tensorflow as tf
212 | 
213 | # Create a Variable
214 | state = tf.Variable(0, name="counter")
215 | 
216 | # Add 1 each time and write the result back to state
217 | one = tf.constant(1)
218 | new_value = tf.add(state, one)
219 | update = tf.assign(state, new_value)
220 | 
221 | # Create the variable initializer
222 | init_op = tf.global_variables_initializer()
223 | 
224 | # Run the computation
225 | with tf.Session() as sess:
226 |     sess.run(init_op)
227 |     # Print the initial value
228 |     print(sess.run(state))
229 |     # Update three times, printing the Variable after each update
230 |     for _ in range(3):
231 |         sess.run(update)
232 |         print(sess.run(state))
236 | ```
237 | 
238 | ![day2709](https://storage.googleapis.com/2017_ithome_ironman/day2709.png)
239 | 
240 | ### Feeding data
241 | 
242 | First declare the data type with `tf.placeholder()`; the data itself is only supplied at run time, as a dictionary.
243 | 
244 | ```python
245 | import tensorflow as tf
246 | 
247 | input1 = tf.placeholder(tf.float32)
248 | input2 = tf.placeholder(tf.float32)
249 | output = tf.multiply(input1, input2)
250 | 
251 | # Feed 7 into input1 and 3 into input2
252 | with tf.Session() as sess:
253 |     print(sess.run([output], feed_dict = {input1: [7], input2: [3]}))
254 | ```
255 | 
256 | ![day2710](https://storage.googleapis.com/2017_ithome_ironman/day2710.png)
257 | 
258 | ### Fetching output
259 | 
260 | ```python
261 | input1 = tf.constant([3])
262 | input2 = tf.constant([5])
263 | added = tf.add(input1, input2)
264 | multiplied = tf.multiply(input1, input2)
265 | 
266 | # Fetch both added and multiplied
267 | with tf.Session() as sess:
268 |     result = sess.run([added, multiplied])
269 |     print(result)
270 | ```
271 | 
272 | ![day2711](https://storage.googleapis.com/2017_ithome_ironman/day2711.png)
273 | 
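274 | ## Summary
275 | 
276 | On day 27 we started practicing with Python's neural network package **TensorFlow**. We installed `tensorflow` and worked through the first exercise of the official documentation: approximating the coefficient and intercept of a regression model with gradient descent. From the basic usage examples we also learned how a **TensorFlow** model is produced: we first define the operations and build the graph, then initialize the variables, and only at the end execute the computation and output the result. This resembles what we did in earlier days, where we first initialized a classifier and only then fed it data.
277 | 
278 | ## References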
279 | 
280 | - [Introduction - TensorFlow](https://www.tensorflow.org/get_started/)


--------------------------------------------------------------------------------
/day28.md:
--------------------------------------------------------------------------------
  1 | # [Day 28] Deep Learning (2): TensorBoard
  2 | 
  3 | ---
  4 | 
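  5 | Today we keep practicing with the neural network package **TensorFlow**. In [yesterday's](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day27.md) first exercise we built a very simple neural network that used gradient descent to approximate the coefficient and intercept of a linear regression model. A question quickly comes up: where exactly is the graph we keep talking about? Can we see it? Yes, we can: **TensorBoard** lets us visualize the neural network.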
  6 | 
  7 | > The computations you'll use TensorFlow for - like training a massive deep neural network - can be complex and confusing. To make it easier to understand, debug, and optimize TensorFlow programs, we've included a suite of visualization tools called TensorBoard. You can use TensorBoard to visualize your TensorFlow graph, plot quantitative metrics about the execution of your graph, and show additional data like images that pass through it.
  8 | > [TensorBoard: Visualizing Learning | TensorFlow](https://www.tensorflow.org/how_tos/summaries_and_tensorboard/)
  9 | 
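 10 | ## Refactoring the code
 11 | 
 12 | Before visualizing, we first rewrite [yesterday's](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day27.md) program in a more modular style, following the sample code from [MorvanZhou@GitHub](https://github.com/MorvanZhou/tutorials/blob/master/tensorflowTUT/tf15_tensorboard/full_code.py).
 13 | 
 14 | ### Structure of the rewrite
 15 | 
 16 | - Define a function that adds a layer: `add_layer()`
 17 | - Prepare the data (inputs)
 18 | - Create feeds (with `tf.placeholder()`) to pass in the data
 19 | - Add a hidden layer and an output layer
 20 | - Define `loss` and the optimizer to use (gradient descent)
 21 | - Initialize the graph and start the computation
 22 | 
 23 | ### The rewritten code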
 24 | 
 25 | ```python
 26 | import tensorflow as tf
 27 | import numpy as np
 28 | 
 29 | # Define a function that adds a layer
 30 | def add_layer(inputs, input_tensors, output_tensors, activation_function = None):
 31 |     W = tf.Variable(tf.random_normal([input_tensors, output_tensors]))
 32 |     b = tf.Variable(tf.zeros([1, output_tensors]))
 33 |     formula = tf.add(tf.matmul(inputs, W), b)
 34 |     if activation_function is None:
 35 |         outputs = formula
 36 |     else:
 37 |         outputs = activation_function(formula)
 38 |     return outputs
 39 | 
 40 | # Prepare the data
 41 | x_data = np.random.rand(100)
 42 | x_data = x_data.reshape(len(x_data), 1)
 43 | y_data = x_data * 0.1 + 0.3
 44 | 
 45 | # Create the feeds
 46 | x_feeds = tf.placeholder(tf.float32, shape = [None, 1])
 47 | y_feeds = tf.placeholder(tf.float32, shape = [None, 1])
 48 | 
 49 | # Add 1 hidden layer
 50 | hidden_layer = add_layer(x_feeds, input_tensors = 1, output_tensors = 10, activation_function = None)
 51 | 
 52 | # Add 1 output layer
 53 | output_layer = add_layer(hidden_layer, input_tensors = 10, output_tensors = 1, activation_function = None)
 54 | 
 55 | # Define `loss` and the optimizer to use
 56 | loss = tf.reduce_mean(tf.square(y_feeds - output_layer))
 57 | optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.01)
 58 | train = optimizer.minimize(loss)
 59 | 
 60 | # Initialize the graph and start the computation
 61 | init = tf.global_variables_initializer()
 62 | sess = tf.Session()
 63 | sess.run(init)
 64 | 
 65 | for step in range(201):
 66 |     sess.run(train, feed_dict = {x_feeds: x_data, y_feeds: y_data})
 67 |     if step % 20 == 0:
 68 |         print(sess.run(loss, feed_dict = {x_feeds: x_data, y_feeds: y_data}))
 69 | 
 70 | sess.close()
 71 | ```
 72 | 
 73 | ![day2801](https://storage.googleapis.com/2017_ithome_ironman/day2801.png)
 74 | 
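 75 | We can see that `loss` keeps decreasing as training proceeds, which means the fitted model keeps getting closer to the true model.
 76 | 
 77 | ## Visualization
 78 | 
 79 | Next, in the modular program we use `with tf.name_scope():` to name each group of operations, and after the Session is created we use `tf.summary.FileWriter()` to write out the visualization file.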
 80 | 
 81 | ```python
 82 | import tensorflow as tf
 83 | import numpy as np
 84 | 
 85 | # Define a function that adds a layer
 86 | def add_layer(inputs, input_tensors, output_tensors, activation_function = None):
 87 |     with tf.name_scope('Layer'):
 88 |         with tf.name_scope('Weights'):
 89 |             W = tf.Variable(tf.random_normal([input_tensors, output_tensors]))
 90 |         with tf.name_scope('Biases'):
 91 |             b = tf.Variable(tf.zeros([1, output_tensors]))
 92 |         with tf.name_scope('Formula'):
 93 |             formula = tf.add(tf.matmul(inputs, W), b)
 94 |         if activation_function is None:
 95 |             outputs = formula
 96 |         else:
 97 |             outputs = activation_function(formula)
 98 |         return outputs
 99 | 
100 | # Prepare the data
101 | x_data = np.random.rand(100)
102 | x_data = x_data.reshape(len(x_data), 1)
103 | y_data = x_data * 0.1 + 0.3
104 | 
105 | # Create the feeds
106 | with tf.name_scope('Inputs'):
107 |     x_feeds = tf.placeholder(tf.float32, shape = [None, 1])
108 |     y_feeds = tf.placeholder(tf.float32, shape = [None, 1])
109 | 
110 | # Add 1 hidden layer
111 | hidden_layer = add_layer(x_feeds, input_tensors = 1, output_tensors = 10, activation_function = None)
112 | 
113 | # Add 1 output layer
114 | output_layer = add_layer(hidden_layer, input_tensors = 10, output_tensors = 1, activation_function = None)
115 | 
116 | # Define `loss` and the optimizer to use
117 | with tf.name_scope('Loss'):
118 |     loss = tf.reduce_mean(tf.square(y_feeds - output_layer))
119 | with tf.name_scope('Train'):
120 |     optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.01)
121 |     train = optimizer.minimize(loss)
122 | 
123 | # Initialize the graph
124 | init = tf.global_variables_initializer()
125 | sess = tf.Session()
126 | 
127 | # Write out the graph for visualization
128 | writer = tf.summary.FileWriter("TensorBoard/", graph = sess.graph)
129 | 
130 | # Start the computation
131 | sess.run(init)
132 | for step in range(201):
133 |     sess.run(train, feed_dict = {x_feeds: x_data, y_feeds: y_data})
134 |     #if step % 20 == 0:
135 |         #print(sess.run(loss, feed_dict = {x_feeds: x_data, y_feeds: y_data}))
136 | 
137 | sess.close()
138 | ```
139 | 
140 | ![day280201](https://storage.googleapis.com/2017_ithome_ironman/day280201.png)
141 | 
142 | Back in the terminal we can see that the visualization file has been generated. Use `cd` to move to the parent directory of `TensorBoard/` and run this command:
143 | 
144 | ```
145 | $ tensorboard --logdir='TensorBoard/'
146 | ```
147 | 
148 | Once the system responds, open a browser and enter localhost:6006 in the address bar; the network graph is shown under the **Graphs** tab.
149 | 
150 | ![day2803](https://storage.googleapis.com/2017_ithome_ironman/day2803.png)
151 | 
152 | ## Visualization (2)
153 | 
154 | Besides naming each group of operations with `with tf.name_scope():`, we can also name variables and inputs through the `name = ` argument when creating them with `tf.Variable()` or `tf.placeholder()`.
155 | 
156 | ```python
157 | import tensorflow as tf
158 | import numpy as np
159 | 
160 | # Define a function that adds a layer
161 | def add_layer(inputs, input_tensors, output_tensors, activation_function = None):
162 |     with tf.name_scope('Layer'):
163 |         with tf.name_scope('Weights'):
164 |             W = tf.Variable(tf.random_normal([input_tensors, output_tensors]), name = 'W')
165 |         with tf.name_scope('Biases'):
166 |             b = tf.Variable(tf.zeros([1, output_tensors]), name = 'b')
167 |         with tf.name_scope('Formula'):
168 |             formula = tf.add(tf.matmul(inputs, W), b)
169 |         if activation_function is None:
170 |             outputs = formula
171 |         else:
172 |             outputs = activation_function(formula)
173 |         return outputs
174 | 
175 | # Prepare the data
176 | x_data = np.random.rand(100)
177 | x_data = x_data.reshape(len(x_data), 1)
178 | y_data = x_data * 0.1 + 0.3
179 | 
180 | # Create the feeds
181 | with tf.name_scope('Inputs'):
182 |     x_feeds = tf.placeholder(tf.float32, shape = [None, 1], name = 'x_inputs')
183 |     y_feeds = tf.placeholder(tf.float32, shape = [None, 1], name = 'y_inputs')
184 | 
185 | # Add 1 hidden layer
186 | hidden_layer = add_layer(x_feeds, input_tensors = 1, output_tensors = 10, activation_function = None)
187 | 
188 | # Add 1 output layer
189 | output_layer = add_layer(hidden_layer, input_tensors = 10, output_tensors = 1, activation_function = None)
190 | 
191 | # Define `loss` and the optimizer to use
192 | with tf.name_scope('Loss'):
193 |     loss = tf.reduce_mean(tf.square(y_feeds - output_layer))
194 | with tf.name_scope('Train'):
195 |     optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.01)
196 |     train = optimizer.minimize(loss)
197 | 
198 | # Initialize the graph
199 | init = tf.global_variables_initializer()
200 | sess = tf.Session()
201 | 
202 | # Write out the graph for visualization
203 | writer = tf.summary.FileWriter("TensorBoard/", graph = sess.graph)
204 | 
205 | # Start the computation
206 | sess.run(init)
207 | for step in range(201):
208 |     sess.run(train, feed_dict = {x_feeds: x_data, y_feeds: y_data})
209 |     #if step % 20 == 0:
210 |         #print(sess.run(loss, feed_dict = {x_feeds: x_data, y_feeds: y_data}))
211 | 
212 | sess.close()
213 | ```
214 | 
215 | ![day2804](https://storage.googleapis.com/2017_ithome_ironman/day2804.png)
216 | 
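217 | ## Visualization (3)
218 | 
219 | Another question quickly comes up: what about the other tabs? **TensorBoard** can also visualize the training process. We record it with `tf.summary.histogram()` and `tf.summary.scalar()` and then inspect it under the **Scalars** and **Histograms** tabs.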
220 | 
221 | ```python
222 | import tensorflow as tf
223 | import numpy as np
224 | 
225 | # Define a function that adds a layer
226 | def add_layer(inputs, input_tensors, output_tensors, n_layer, activation_function = None):
227 |     layer_name = 'layer%s' % n_layer
228 |     with tf.name_scope('Layer'):
229 |         with tf.name_scope('Weights'):
230 |             W = tf.Variable(tf.random_normal([input_tensors, output_tensors]), name = 'W')
231 |             tf.summary.histogram(name = layer_name + '/Weights', values = W)
232 |         with tf.name_scope('Biases'):
233 |             b = tf.Variable(tf.zeros([1, output_tensors]), name = 'b')
234 |             tf.summary.histogram(name = layer_name + '/Biases', values = b)
235 |         with tf.name_scope('Formula'):
236 |             formula = tf.add(tf.matmul(inputs, W), b)
237 |         if activation_function is None:
238 |             outputs = formula
239 |         else:
240 |             outputs = activation_function(formula)
241 |         tf.summary.histogram(name = layer_name + '/Outputs', values = outputs)
242 |         return outputs
243 | 
244 | # Prepare the data
245 | x_data = np.random.rand(100)
246 | x_data = x_data.reshape(len(x_data), 1)
247 | y_data = x_data * 0.1 + 0.3
248 | 
249 | # Create the feeds
250 | with tf.name_scope('Inputs'):
251 |     x_feeds = tf.placeholder(tf.float32, shape = [None, 1], name = 'x_inputs')
252 |     y_feeds = tf.placeholder(tf.float32, shape = [None, 1], name = 'y_inputs')
253 | 
254 | # Add 1 hidden layer
255 | hidden_layer = add_layer(x_feeds, input_tensors = 1, output_tensors = 10, n_layer = 1, activation_function = None)
256 | 
257 | # Add 1 output layer
258 | output_layer = add_layer(hidden_layer, input_tensors = 10, output_tensors = 1, n_layer = 2, activation_function = None)
259 | 
260 | # Define `loss` and the optimizer to use
261 | with tf.name_scope('Loss'):
262 |     loss = tf.reduce_mean(tf.square(y_feeds - output_layer))
263 |     tf.summary.scalar('loss', loss)
264 | with tf.name_scope('Train'):
265 |     optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.01)
266 |     train = optimizer.minimize(loss)
267 | 
268 | # Initialize the graph
269 | init = tf.global_variables_initializer()
270 | sess = tf.Session()
271 | 
272 | # Merge the summaries and write them out for visualization
273 | merged = tf.summary.merge_all()
274 | writer = tf.summary.FileWriter("TensorBoard/", graph = sess.graph)
275 | 
276 | # Start the computation
277 | sess.run(init)
278 | for step in range(400):
279 |     sess.run(train, feed_dict = {x_feeds: x_data, y_feeds: y_data})
280 |     if step % 20 == 0:
281 |         result = sess.run(merged, feed_dict={x_feeds: x_data, y_feeds: y_data})
282 |         writer.add_summary(result, step)
283 | 
284 | sess.close()
285 | ```
286 | 
287 | ![day2805](https://storage.googleapis.com/2017_ithome_ironman/day2805.png)
288 | 
289 | ![day2806](https://storage.googleapis.com/2017_ithome_ironman/day2806.png)
290 | 
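291 | ## Summary
292 | 
293 | On day 28 we kept practicing with Python's neural network package **TensorFlow**: building on yesterday's program, we used **TensorBoard** to visualize the neural network model.
294 | 
295 | ## References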
296 | 
297 | - [Tensorflow Python API | TensorFlow](https://www.tensorflow.org/api_docs/python/)
298 | - [TensorBoard: Visualizing Learning | TensorFlow](https://www.tensorflow.org/how_tos/summaries_and_tensorboard/)
299 | - [MorvanZhou@GitHub](https://github.com/MorvanZhou/tutorials/tree/master/tensorflowTUT)


--------------------------------------------------------------------------------
/day29.md:
--------------------------------------------------------------------------------
  1 | # [Day 29] Deep Learning (3): MNIST Handwritten Digit Recognition
  2 | 
  3 | ---
  4 | 
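  5 | Today we keep practicing with the neural network package **TensorFlow**. Throughout this series, whether the topic was visualization or machine learning, we have used a few common toy datasets, such as the **iris** data or the **cars** data on speed and stopping distance. These toy datasets are concise and easy to understand, so they let us get started quickly: **cars** for regression, **iris** for classification and clustering. Deep learning likewise has a classic dataset, **MNIST** handwritten digits, with which beginners can build an image classifier.
  6 | 
  7 | ## Loading MNIST
  8 | 
  9 | Just like loading **iris** in **scikit-learn**, loading **MNIST** in **TensorFlow** is easy. Both the training data and the test data have `images` and `labels` attributes; here is a quick comparison with **scikit-learn**:
 10 | 
 11 | Package|Features X|Target y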
 12 | ---|---|---
 13 | `sklearn`|data|target
 14 | `tensorflow`|images|labels
 15 | 
 16 | ```python
 17 | from tensorflow.examples.tutorials.mnist import input_data
 18 | import numpy as np
 19 | 
 20 | # Load MNIST
 21 | mnist = input_data.read_data_sets("MNIST_data/", one_hot = True)
 22 | x_train = mnist.train.images
 23 | y_train = mnist.train.labels
 24 | x_test = mnist.test.images
 25 | y_test = mnist.test.labels
 26 | 
 27 | # Inspect the shapes
 28 | print(x_train.shape)
 29 | print(y_train.shape)
 30 | print(x_test.shape)
 31 | print(y_test.shape)
 32 | print("---")
 33 | 
 34 | # Inspect a single observation
 35 | #print(x_train[1, :])
 36 | print(np.argmax(y_train[1, :])) # True label of the training image at index 1 (the second image)
 37 | ```
 38 | 
 39 | ![day2901](https://storage.googleapis.com/2017_ithome_ironman/day2901.png)
 40 | 
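 41 | Each **MNIST** image is 28 x 28 pixels, so one image can be recorded as 28 x 28 = 784 numbers. The output of `print(x_train.shape)` therefore tells us that there are 55,000 training images with 784 numbers each, while `print(y_train.shape)` gives the shape of the true labels for those 55,000 images: because we set `one_hot = True`, each label is a vector of length 10. The output of `print(np.argmax(y_train[1, :]))` tells us that the true label of the training image at index 1 (the second image, since Python indexing starts at 0) is **3**.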
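
To make the two ideas above concrete (flattening a 28 x 28 image into 784 numbers, and decoding a one-hot label with an argmax), here is a minimal sketch in plain Python. The helpers `one_hot`, `argmax`, and `flatten` are hypothetical names written for this illustration; they are not part of TensorFlow or NumPy.

```python
# Hypothetical helpers for illustration only; not part of TensorFlow or NumPy.

def one_hot(label, n_classes=10):
    """Return a list of n_classes zeros with a 1.0 at position `label`."""
    return [1.0 if i == label else 0.0 for i in range(n_classes)]

def argmax(values):
    """Return the index of the largest value, like np.argmax on a 1-D array."""
    return max(range(len(values)), key=lambda i: values[i])

def flatten(image):
    """Turn a list of rows into a single flat list, row by row."""
    return [pixel for row in image for pixel in row]

label_vector = one_hot(3)       # how the digit 3 is stored when one_hot = True
decoded = argmax(label_vector)  # back from the one-hot vector to the digit

blank_image = [[0.0] * 28 for _ in range(28)]  # a made-up 28 x 28 image of zeros
flat = flatten(blank_image)

print(label_vector)
print(decoded)
print(len(flat))
```

The length of `flat` is the 784 that shows up as the second dimension of `x_train.shape`.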
 42 | 
 43 | We can also use `matplotlib.pyplot` to draw that same training image (index 1).
 44 | 
 45 | ```python
 46 | from tensorflow.examples.tutorials.mnist import input_data
 47 | import numpy as np
 48 | import matplotlib.pyplot as plt
 49 | 
 50 | %matplotlib inline
 51 | 
 52 | # Load MNIST
 53 | mnist = input_data.read_data_sets("MNIST_data/", one_hot = True)
 54 | x_train = mnist.train.images
 55 | 
 56 | # Plot the training image at index 1
 57 | train_img = np.reshape(x_train[1, :], (28, 28))
 58 | plt.matshow(train_img, cmap = plt.get_cmap('gray'))
 59 | plt.show()
 60 | ```
 61 | 
 62 | ![day2906](https://storage.googleapis.com/2017_ithome_ironman/day2906.png)
 63 | 
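 64 | ## The Softmax Function
 65 | 
 66 | We need the softmax function to convert the scores (evidence) produced by the classifier into probabilities, and the class with the highest probability becomes the prediction. That is why the output layer of a classification network is typically a softmax function.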
 67 | 
 68 | ![day2902](https://storage.googleapis.com/2017_ithome_ironman/day2902.png)
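
As a small illustration of what softmax does, the sketch below converts three made-up scores into probabilities using only the standard library. The function name `softmax` and the example scores are assumptions for this illustration; in the TensorFlow code the same job is done by `tf.nn.softmax`.

```python
import math

def softmax(scores):
    """Convert a list of scores into probabilities that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp().
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # made-up evidence for three classes
probs = softmax(scores)
predicted_class = probs.index(max(probs))  # the class with the highest probability

print(probs)
print(predicted_class)
```

The ordering of the scores is preserved, so the class with the largest score also gets the largest probability.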
 69 | 
 70 | ## Cross-entropy
 71 | 
 72 | Earlier we defined the **loss** with **mean squared error**. For this classification model we define the **loss** with **cross-entropy** instead.
 73 | 
 74 | > One very common, very nice function to determine the loss of a model is called "cross-entropy." Cross-entropy arises from thinking about information compressing codes in information theory but it winds up being an important idea in lots of areas, from gambling to machine learning.
 75 | > [MNIST For ML Beginners | TensorFlow](https://www.tensorflow.org/tutorials/mnist/beginners/)
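
To see how cross-entropy behaves, here is a minimal sketch for a single observation with a one-hot label. The function `cross_entropy`, the small `eps` guard against `log(0)`, and both prediction vectors are assumptions made up for this illustration.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    # eps guards against log(0); it is an addition for this sketch.
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))

y_true = [0.0, 0.0, 1.0]        # the true class is index 2
confident = [0.05, 0.05, 0.90]  # made-up prediction that favors the true class
unsure = [0.40, 0.40, 0.20]     # made-up prediction that favors the wrong classes

loss_confident = cross_entropy(y_true, confident)
loss_unsure = cross_entropy(y_true, unsure)

print(loss_confident)
print(loss_unsure)
```

With a one-hot label only the probability assigned to the true class matters, so the loss is small when that probability is close to 1 and grows as it approaches 0.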
 76 | 
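 77 | ## TensorFlow Implementation
 78 | 
 79 | We build a handwritten digit classifier that can be inspected with **TensorBoard**. Note that this model consists of a single softmax layer with no hidden layers.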
 80 | 
 81 | ```python
 82 | import tensorflow as tf
 83 | from tensorflow.examples.tutorials.mnist import input_data
 84 | 
 85 | # Load MNIST
 86 | mnist = input_data.read_data_sets("MNIST_data/", one_hot = True)
 87 | x_train = mnist.train.images
 88 | y_train = mnist.train.labels
 89 | x_test = mnist.test.images
 90 | y_test = mnist.test.labels
 91 | 
 92 | # Set parameters
 93 | learning_rate = 0.5
 94 | training_steps = 1000
 95 | batch_size = 100
 96 | logs_path = 'TensorBoard/'
 97 | n_features = x_train.shape[1]
 98 | n_labels = y_train.shape[1]
 99 | 
100 | # Create the feeds (placeholders)
101 | with tf.name_scope('Inputs'):
102 |     x = tf.placeholder(tf.float32, [None, n_features], name = 'Input_Data')
103 | with tf.name_scope('Labels'):
104 |     y = tf.placeholder(tf.float32, [None, n_labels], name = 'Label_Data')
105 | 
106 | # Create the variables
107 | with tf.name_scope('ModelParameters'):
108 |     W = tf.Variable(tf.zeros([n_features, n_labels]), name = 'Weights')
109 |     b = tf.Variable(tf.zeros([n_labels]), name = 'Bias')
110 | 
111 | # Build the model
112 | with tf.name_scope('Model'):
113 |     # Softmax
114 |     prediction = tf.nn.softmax(tf.matmul(x, W) + b)
115 | with tf.name_scope('CrossEntropy'):
116 |     # Cross-entropy
117 |     loss = tf.reduce_mean(-tf.reduce_sum(y * tf.log(prediction), reduction_indices = 1))
118 |     tf.summary.scalar("Loss", loss)
119 | with tf.name_scope('GradientDescent'):
120 |     # Gradient Descent
121 |     optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
122 | with tf.name_scope('Accuracy'):
123 |     correct_prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
124 |     acc = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
125 |     tf.summary.scalar('Accuracy', acc)
126 | 
127 | # Initialize
128 | init = tf.global_variables_initializer()
129 | 
130 | # Start the session
131 | sess = tf.Session()
132 | sess.run(init)
133 | 
134 | # Write the graph and summaries for TensorBoard
135 | merged = tf.summary.merge_all()
136 | writer = tf.summary.FileWriter(logs_path, graph = tf.get_default_graph())
137 | 
138 | # Train
139 | for step in range(training_steps):
140 |     batch_xs, batch_ys = mnist.train.next_batch(batch_size)
141 |     sess.run(optimizer, feed_dict = {x: batch_xs, y: batch_ys})
142 |     if step % 50 == 0:
143 |         print(sess.run(loss, feed_dict = {x: batch_xs, y: batch_ys}))
144 |         summary = sess.run(merged, feed_dict = {x: batch_xs, y: batch_ys})
145 |         writer.add_summary(summary, step)
146 | 
147 | print("---")
148 | # Accuracy on the test data
149 | print("Accuracy: ", sess.run(acc, feed_dict={x: x_test, y: y_test}))
150 | 
151 | sess.close()
152 | ```
153 | 
154 | ![day2903](https://storage.googleapis.com/2017_ithome_ironman/day2903.png)
155 | 
156 | ![day2904](https://storage.googleapis.com/2017_ithome_ironman/day2904.png)
157 | 
158 | ![day2905](https://storage.googleapis.com/2017_ithome_ironman/day2905.png)
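
The accuracy reported by the model is simply the share of images whose most probable class matches the true class. The sketch below mirrors that idea in plain Python; the helper names and the four made-up predictions are assumptions for illustration, while the TensorFlow code expresses the same steps with `tf.argmax`, `tf.equal`, and `tf.reduce_mean`.

```python
def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(predictions, labels):
    """Share of rows where the most probable class equals the true class."""
    correct = [argmax(p) == argmax(y) for p, y in zip(predictions, labels)]
    return sum(correct) / float(len(correct))

# Made-up probabilities for four observations and three classes
predictions = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]
# One-hot true labels; the last prediction above is wrong
labels = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
]

acc = accuracy(predictions, labels)
print(acc)
```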
159 | 
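160 | If you are interested in how the **TensorBoard** visualizations are produced, see [yesterday's](http://ithelp.ithome.com.tw/articles/10187814) notes. Our model reaches an accuracy of about **92%**, which feels decent, but the official documentation tells us it is actually poor: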
161 | 
162 | > Getting 92% accuracy on MNIST is bad. It's almost embarrassingly bad.
163 | > [Deep MNIST for Experts | TensorFlow](https://www.tensorflow.org/tutorials/mnist/pros/)
164 | 
165 | Tomorrow we will follow the official tutorial and build a convolutional neural network (CNN) to improve digit recognition accuracy on **MNIST**.
166 | 
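167 | ## Summary
168 | 
169 | On day 29 we kept practicing with **TensorFlow**, Python's deep learning package. We built a neural network model for the **MNIST** data, reached about **92%** accuracy, and visualized it with **TensorBoard**.
170 | 
171 | ## References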
172 | 
173 | - [MNIST For ML Beginners | TensorFlow](https://www.tensorflow.org/tutorials/mnist/beginners/)
174 | - [Deep MNIST for Experts | TensorFlow](https://www.tensorflow.org/tutorials/mnist/pros/)
175 | - [aymericdamien@GitHub](https://github.com/aymericdamien/TensorFlow-Examples)
176 | - [Tensorflow Day3 : 熟悉 MNIST 手寫數字辨識資料集](http://ithelp.ithome.com.tw/articles/10186473)


--------------------------------------------------------------------------------
/day30.md:
--------------------------------------------------------------------------------
  1 | # [Day 30] Deep Learning (4): Convolutional Neural Networks and Ironman Contest Wrap-up
  2 | 
  3 | ---
  4 | 
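  5 | Today we use the neural network package **TensorFlow** to build our first deep learning model, a convolutional neural network (CNN), to improve on the **92%** accuracy of yesterday's **MNIST** handwritten digit model. Convolutional neural networks are widely used in image processing, so let's take a quick look at what makes them work.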
  6 | 
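  7 | ## Convolutional Neural Networks
  8 | 
  9 | Convolutional neural networks address a practical problem: a fully connected network that has to recognize high-resolution color images quickly runs out of computing power. They tackle it with two techniques:
 10 | 
 11 | ### Convolution
 12 | 
 13 | Some features can be captured without looking at the whole image. A convolution layer slides a small filter over the image and reuses the same weights at every position, so it detects local features with far fewer parameters than a fully connected layer.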
 14 | 
 15 | ![day3001](https://storage.googleapis.com/2017_ithome_ironman/day3001.gif)
 16 | Source: [Artificial Inteligence](https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/convolution.html)
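
The sliding-filter idea can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions: one channel, stride 1, and no padding, so the output shrinks (the TensorFlow code in this post uses `padding='SAME'`, which keeps the size). The function name `conv2d_valid`, the 5x5 image, and the 3x3 filter are made up for the example.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum the products."""
    k = len(kernel)
    out_size = len(image) - k + 1
    output = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            total = 0
            # Multiply the filter with the patch whose top-left corner is (r, c)
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
```

The same nine weights are reused at every position, which is why a convolution layer needs far fewer parameters than a fully connected layer over the same image.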
 17 | 
 18 | ### Max-pooling
 19 | 
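 20 | Max-pooling reduces the resolution of the feature maps produced by convolution while keeping the strongest response in each region, so the detected features survive. Calling it sub-sampling or down-sampling may make its purpose easier to grasp.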
 21 | 
 22 | ![day3002](https://storage.googleapis.com/2017_ithome_ironman/day3002.jpeg)
 23 | Source: [What is max pooling in convolutional neural networks?](https://www.quora.com/What-is-max-pooling-in-convolutional-neural-networks)
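
A 2x2 max-pooling step can be sketched as follows. The function name `max_pool` and the 4x4 feature map are assumptions for this illustration; the sketch assumes an even-sized input and non-overlapping blocks, which matches the `ksize` and `strides` used in the TensorFlow code of this post.

```python
def max_pool(feature_map):
    """Keep the largest value in each non-overlapping 2x2 block."""
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[r]) - 1, 2):
            block = [
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ]
            row.append(max(block))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

pooled = max_pool(feature_map)
for row in pooled:
    print(row)
```

The 4x4 input becomes 2x2: the resolution is halved in each direction while the largest value of every block is kept.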
 24 | 
 25 | Besides introducing **Convolution** and **Max-pooling**, today's deep learning model differs from [yesterday's](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day29.md) in three ways:
 26 | 
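 27 | ### A Different Activation Function
 28 | 
 29 | When we added neural network layers before without specifying an **activation function**, the layer was simply linear. A **CNN** commonly uses ReLU (Rectified Linear Unit) as its **activation function** to introduce non-linearity: negative inputs become 0 and positive inputs pass through unchanged, so the output is never negative and has no upper bound.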
 30 | 
 31 | ![day3004](https://storage.googleapis.com/2017_ithome_ironman/day3004.png)
 32 | Source: [Rectifier (neural networks) - Wikipedia](https://en.wikipedia.org/wiki/Rectifier_(neural_networks))
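
ReLU itself is a one-liner. The sketch below applies it to a few made-up inputs; in the TensorFlow code the same operation is `tf.nn.relu`.

```python
def relu(x):
    """Rectified Linear Unit: 0 for negative inputs, the input itself otherwise."""
    return max(0.0, x)

inputs = [-2.0, -0.5, 0.0, 0.5, 3.0]  # made-up pre-activation values
outputs = [relu(v) for v in inputs]

print(outputs)
```

Note that the positive values are not squashed: 3.0 stays 3.0.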
 33 | 
 34 | ### Adding Dropout
 35 | 
 36 | Dropout randomly switches off a share of the neurons during training, which helps to avoid overfitting.
 37 | 
 38 | > To reduce overfitting, we will apply [dropout](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf) before the readout layer.
 39 | > [Deep MNIST for Experts | TensorFlow](https://www.tensorflow.org/tutorials/mnist/pros/)
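
Here is a rough sketch of what dropout does to a layer's outputs during training. It follows the behavior of `tf.nn.dropout`, where each value is kept with probability `keep_prob` and the kept values are scaled by `1 / keep_prob`; the function name `dropout`, the fixed seed, and the made-up activations are assumptions for this illustration.

```python
import random

def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob and scale survivors by 1 / keep_prob."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)       # fixed seed so the sketch is repeatable
activations = [1.0] * 10000  # made-up outputs of a layer

dropped = dropout(activations, 0.5, rng)  # training: keep about half
kept_share = sum(1 for v in dropped if v != 0.0) / float(len(dropped))
mean_after = sum(dropped) / len(dropped)

unchanged = dropout(activations, 1.0, rng)  # evaluation: keep everything

print(kept_share)
print(mean_after)
```

Scaling the survivors keeps the average output roughly the same, which is why the code in this post can simply feed `keep_prob: 0.5` while training and `keep_prob: 1.0` while evaluating.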
 40 | 
 41 | ### Switching the Optimizer
 42 | 
 43 | We replace the plain gradient descent used so far with **ADAM**, a more advanced gradient-based optimizer that adapts the step size for each parameter.
 44 | 
 45 | > We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments.
 46 | > [Adam: A Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980)
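
To give a feel for the update rule described in the Adam paper, the sketch below minimizes the toy function f(x) = x^2 (gradient 2x) with a single parameter. The function name `adam_minimize`, the toy objective, and the hyperparameter values chosen here are assumptions for illustration; in the TensorFlow code all of this is handled by `tf.train.AdamOptimizer`.

```python
import math

def adam_minimize(grad, x0, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Minimize a one-dimensional function given its gradient, using Adam updates."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g      # running mean of the gradients
        v = beta2 * v + (1 - beta2) * g * g  # running mean of the squared gradients
        m_hat = m / (1 - beta1 ** t)         # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy problem: f(x) = x ** 2 has gradient 2 * x and its minimum at x = 0
x_min = adam_minimize(lambda x: 2.0 * x, x0=5.0)
print(x_min)
```

Unlike plain gradient descent, the step is the gradient average divided by the square root of the squared-gradient average, so its size depends on the recent gradient history rather than on the raw gradient alone.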
 47 | 
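 48 | ## TensorFlow Implementation
 49 | 
 50 | ### Network Architecture
 51 | 
 52 | Let's break down the convolutional neural network we are about to build:
 53 | 
 54 | - Input image (a 28x28 handwritten digit)
 55 | - The first layer is a convolution layer with 32 filters of size 5x5 that extract 32 feature maps; the 2x2 max-pooling that follows reduces the resolution to 14x14
 56 | - The second layer is a convolution layer with 64 filters of size 5x5 that extract 64 feature maps; the 2x2 max-pooling that follows reduces the resolution to 7x7
 57 | - The third layer is a densely connected layer with 1024 neurons; it flattens the 7 x 7 x 64 = 3136 values from the previous layer and maps them to 1024 features
 58 | - Dropout is applied before the output to avoid overfitting
 59 | - The fourth layer is the output layer with 10 neurons; as before, softmax turns its scores into probabilities (in the code it is applied inside the loss function)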
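
The resolutions in the list above can be checked with a little arithmetic. With `padding='SAME'`, TensorFlow produces an output of size ceil(input / stride), so a stride-1 convolution keeps the size and a stride-2 pooling halves it. The helper `same_padding_output` is a hypothetical name for this sketch.

```python
import math

def same_padding_output(size, stride):
    """Output size of a 'SAME'-padded convolution or pooling: ceil(size / stride)."""
    return int(math.ceil(size / float(stride)))

size = 28
shapes = [(size, size, 1)]  # the input image: 28 x 28 with 1 channel
for n_filters in (32, 64):
    size = same_padding_output(size, stride=1)  # 5x5 convolution, stride 1: size unchanged
    size = same_padding_output(size, stride=2)  # 2x2 max-pooling, stride 2: size halved
    shapes.append((size, size, n_filters))

# Number of values handed to the densely connected layer
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]

print(shapes)
print(flattened)
```

The final count is the `7 * 7 * 64` that appears in the shape of the first fully connected weight matrix.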
 60 | 
 61 | ### Implementation
 62 | 
 63 | ```python
 64 | import tensorflow as tf
 65 | from tensorflow.examples.tutorials.mnist import input_data
 66 | 
 67 | # Load MNIST
 68 | mnist = input_data.read_data_sets("MNIST_data/", one_hot = True)
 69 | x_train = mnist.train.images
 70 | y_train = mnist.train.labels
 71 | x_test = mnist.test.images
 72 | y_test = mnist.test.labels
 73 | 
 74 | # Set parameters
 75 | logs_path = 'TensorBoard/'
 76 | n_features = x_train.shape[1]
 77 | n_labels = y_train.shape[1]
 78 | 
 79 | # Start an InteractiveSession
 80 | sess = tf.InteractiveSession()
 81 | with tf.name_scope('Input'):
 82 |     x = tf.placeholder(tf.float32, shape=[None, n_features])
 83 | with tf.name_scope('Label'):
 84 |     y_ = tf.placeholder(tf.float32, shape=[None, n_labels])
 85 | 
 86 | # Helper functions to initialize weights and biases
 87 | def weight_variable(shape):
 88 |     initial = tf.truncated_normal(shape, stddev=0.1)
 89 |     return tf.Variable(initial)
 90 | 
 91 | def bias_variable(shape):
 92 |     initial = tf.constant(0.1, shape=shape)
 93 |     return tf.Variable(initial)
 94 |     
 95 | # Helper functions for convolution and max-pooling
 96 | def conv2d(x, W):
 97 |     return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
 98 | 
 99 | def max_pool_2x2(x):
100 |     return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
101 | 
102 | # First layer: convolution with 32 filters of size 5x5, then 2x2 max-pooling reduces the resolution to 14x14
103 | with tf.name_scope('FirstConvolutionLayer'):
104 |     W_conv1 = weight_variable([5, 5, 1, 32])
105 |     b_conv1 = bias_variable([32])
106 |     x_image = tf.reshape(x, [-1, 28, 28, 1])
107 |     h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
108 |     h_pool1 = max_pool_2x2(h_conv1)
109 | 
110 | # Second layer: convolution with 64 filters of size 5x5, then 2x2 max-pooling reduces the resolution to 7x7
111 | with tf.name_scope('SecondConvolutionLayer'):
112 |     W_conv2 = weight_variable([5, 5, 32, 64])
113 |     b_conv2 = bias_variable([64])
114 |     h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
115 |     h_pool2 = max_pool_2x2(h_conv2)
116 | 
117 | # Third layer: densely connected layer (1024 neurons) fed with the flattened 7 x 7 x 64 values
118 | with tf.name_scope('DenselyConnectedLayer'):
119 |     W_fc1 = weight_variable([7 * 7 * 64, 1024])
120 |     b_fc1 = bias_variable([1024])
121 |     h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
122 |     h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
123 | 
124 | # Apply dropout before the output to avoid overfitting
125 | with tf.name_scope('Dropout'):
126 |     keep_prob = tf.placeholder(tf.float32)
127 |     h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
128 | 
129 | # Fourth layer: output layer (10 neurons); softmax is applied inside the loss below
130 | with tf.name_scope('ReadoutLayer'):
131 |     W_fc2 = weight_variable([1024, 10])
132 |     b_fc2 = bias_variable([10])
133 |     y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
134 | 
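135 | # Training and model evaluation
136 | with tf.name_scope('CrossEntropy'):
137 |     cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels = y_, logits = y_conv)) # named arguments: TensorFlow 1.x rejects positional ones here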
138 |     tf.summary.scalar("CrossEntropy", cross_entropy)
139 | train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
140 | correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))
141 | with tf.name_scope('Accuracy'):
142 |     accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
143 |     tf.summary.scalar("Accuracy", accuracy)
144 | 
145 | # Initialize
146 | sess.run(tf.global_variables_initializer())
147 | 
148 | # Write the graph and summaries for TensorBoard
149 | merged = tf.summary.merge_all()
150 | writer = tf.summary.FileWriter(logs_path, graph = tf.get_default_graph())
151 | 
152 | for i in range(20000):
153 |     batch = mnist.train.next_batch(50)
154 |     if i%100 == 0:
155 |         train_accuracy = accuracy.eval(feed_dict = {x: batch[0], y_: batch[1], keep_prob: 1.0})
156 |         print("step %d, training accuracy %g"%(i, train_accuracy))
157 |         summary = sess.run(merged, feed_dict = {x: batch[0], y_: batch[1], keep_prob: 1.0})
158 |         writer.add_summary(summary, i)
159 |         writer.flush()
160 |     train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})
161 | 
162 | print("test accuracy %g"%accuracy.eval(feed_dict = {x: x_test, y_: y_test, keep_prob: 1.0}))
163 | 
164 | # Close the session
165 | sess.close()
166 | 
167 | # This script takes roughly 30 to 60 minutes to run, depending on your machine
168 | ```
169 | 
170 | ![day3003](https://storage.googleapis.com/2017_ithome_ironman/day3003.png)
171 | 
172 | ![day3005](https://storage.googleapis.com/2017_ithome_ironman/day3005.png)
173 | 
174 | ![day3006](https://storage.googleapis.com/2017_ithome_ironman/day3006.png)
175 | 
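176 | ## Summary
177 | 
178 | On day 30 we kept practicing with **TensorFlow**, Python's neural network package, and built our first deep learning model for the **MNIST** data: a convolutional neural network that reaches about **99%** accuracy.
179 | 
180 | ## References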
181 | 
182 | - [Deep MNIST for Experts | TensorFlow](https://www.tensorflow.org/tutorials/mnist/pros/)
183 | - [[DSC 2016] 系列活動：李宏毅 / 一天搞懂深度學習](http://www.slideshare.net/tw_dsconf/ss-62245351)
184 | - [Hvass-Labs@GitHub](https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/02_Convolutional_Neural_Network.ipynb)
185 | - [c1mone@GitHub](https://github.com/c1mone/Tensorflow-101/blob/master/notebooks/Ch1.2_MNIST_Convolutional_Network.ipynb)
186 | - [Start on TensorBoard](http://robromijnders.github.io/tensorflow_basic/)
187 | 
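188 | ## Ironman Contest Wrap-up
189 | 
190 | ### How These Notes Are Organized
191 | 
192 | These notes are written from the point of view of an R user learning how Python is applied in data science, comparing the two languages along the way. The whole series falls into five themes: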
193 | 
194 | #### Basics
195 | 
196 | - [[Day 01] Setting Up the Development Environment and Using It as a Calculator](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day01.md)
197 | - [[Day 02] Basic Variable Types](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day02.md)
198 | - [[Day 03] Converting Variable Types](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day03.md)
199 | - [[Day 04] Data Structures: List, Tuple, and Dictionary](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day04.md)
200 | - [[Day 05] Data Structures (2): ndarray](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day05.md)
201 | - [[Day 06] Data Structures (3): Data Frame](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day06.md)
202 | - [[Day 07] Loops and Flow Control](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day07.md)
203 | - [[Day 08] Functions](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day08.md)
204 | - [[Day 09] Functions (2)](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day09.md)
205 | - [[Day 10] Object-Oriented Programming: R](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day10.md)
206 | - [[Day 11] Object-Oriented Programming (2): Python](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day11.md)
207 | 
208 | #### Basic Applications
209 | 
210 | - [[Day 12] Common Attributes and Methods: Variables and Basic Data Structures](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day12.md)
211 | - [[Day 13] Common Attributes and Methods (2): ndarray](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day13.md)
212 | - [[Day 14] Common Attributes and Methods (3): Data Frame](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day14.md)
213 | - [[Day 15] Loading Data](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day15.md)
214 | - [[Day 16] Parsing Web Pages](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day16.md)
215 | - [[Day 17] Data Wrangling](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day17.md)
216 | 
217 | #### Visualization
218 | 
219 | - [[Day 18] Data Visualization: matplotlib](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day18.md)
220 | - [[Day 19] Data Visualization (2): Seaborn](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day19.md)
221 | - [[Day 20] Data Visualization (3): Bokeh](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day20.md)
222 | 
223 | #### Machine Learning
224 | 
225 | - [[Day 21] Machine Learning: Toy Datasets and Linear Regression](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day21.md)
226 | - [[Day 22] Machine Learning (2): Multiple Regression and Logistic Regression](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day22.md)
227 | - [[Day 23] Machine Learning (3): Decision Trees and the k-NN Classifier](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day23.md)
228 | - [[Day 24] Machine Learning (4): Clustering Algorithms](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day24.md)
229 | - [[Day 25] Machine Learning (5): Ensemble Learning](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day25.md)
230 | - [[Day 26] Machine Learning (6): Random Forests and Support Vector Machines](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day26.md)
231 | 
232 | #### Deep Learning
233 | 
234 | - [[Day 27] Deep Learning: TensorFlow](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day27.md)
235 | - [[Day 28] Deep Learning (2): TensorBoard](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day28.md)
236 | - [[Day 29] Deep Learning (3): MNIST Handwritten Digit Recognition](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day29.md)
237 | - [[Day 30] Deep Learning (4): Convolutional Neural Networks and Ironman Contest Wrap-up](https://github.com/yaojenkuo/learn_python_for_a_r_user/blob/master/day30.md)
238 | 
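239 | ### A Subjective Comparison
240 | 
241 | Predictably enough, we follow the progression of these notes from basic to advanced and compare the two languages aspect by aspect (purely personal opinion, so please keep it civil):
242 | 
243 | Aspect|R|Python|Reason
244 | ---|---|---|---
245 | Development environment|Tie|Tie|Both languages are easy to install, and both have excellent IDEs.
246 | Variables|Tie|Tie|Assigning, checking, and converting variables is intuitive in both languages.
247 | Data structures|Win|Loss|R has built-in data frames and element-wise operations; Python relies on `pandas` and `numpy`.
248 | Loading data|Tie|Tie|With the help of packages, both languages can load tabular data, JSON, and other formats.
249 | Parsing web pages|Tie|Tie|R's `rvest` and Python's `BeautifulSoup` both make it easy to parse web pages.
250 | Data wrangling|Tie|Tie|Packages and functions in both languages support the common data wrangling techniques.
251 | Toy datasets|Win|Loss|R ships with many built-in toy datasets, and they are well documented.
252 | Data visualization|Win|Loss|R's `ggplot2`, with its wide range of plot types, combined with `ggplotly()` makes it easier to draw good-looking charts.
253 | Machine learning (basic)|Win|Loss|`lm()`, `kmeans()`, and `glm()` from R's built-in `stats` package let us build basic machine learning models quickly.
254 | Machine learning (advanced)|Loss|Win|Python's `scikit-learn` exposes its machine learning algorithms through one consistent set of methods.
255 | Deep learning|Loss|Win|Today's mainstream neural network frameworks (TensorFlow, Theano, and the high-level Keras) are used primarily through Python.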
256 | 
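257 | ### Subjective Recommendations
258 | 
259 | Equally predictably, beyond the comparison table above, here are our subjective recommendations for a few user profiles:
260 | 
261 | User profile|Recommendation
262 | ---|---
263 | I have never written a program|R
264 | I like functional programming|R
265 | I like object-oriented programming|Python
266 | I mainly want to do statistical analysis|R
267 | I mainly want to do data visualization|R
268 | I mainly want to build deep learning models|Python
269 | I mainly want to build machine learning systems on a website back end|Python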
270 | 
271 | ### Learning Resources
272 | 
273 | And, just as predictably, here are the books and websites we used for self-study:
274 | 
275 | #### R
276 | 
277 | - [R in Action](https://www.manning.com/books/r-in-action-second-edition)
278 | - [R Cookbook](http://shop.oreilly.com/product/9780596809164.do)
279 | - [The Art of R Programming](https://www.amazon.com/Art-Programming-Statistical-Software-Design/dp/1593273843)
280 | - [Advanced R](https://www.amazon.com/Advanced-Chapman-Hall-Hadley-Wickham/dp/1466586966)
281 | - [R Graphics Cookbook](http://shop.oreilly.com/product/0636920023135.do)
282 | - [Machine Learning for Hackers](http://shop.oreilly.com/product/0636920018483.do)
283 | - [DataCamp](https://www.datacamp.com)
284 | 
285 | #### Python
286 | 
287 | - [Codecademy](https://www.codecademy.com)
288 | - [Introducing Python](http://shop.oreilly.com/product/0636920028659.do)
289 | - [Learn Python the Hard Way](https://www.amazon.com/Learn-Python-Hard-Way-Introduction/dp/0321884914)
290 | - [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do)
291 | - [scikit-learn - Machine Learning in Python](http://scikit-learn.org/stable/)
292 | - [TensorFlow](https://www.tensorflow.org/)
293 | - [DataCamp](https://www.datacamp.com)
294 | 
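295 | ### Why I Wrote This
296 | 
297 | I first heard of the iT 邦幫忙 Ironman Contest while teaching myself [Git](https://git-scm.com/) and reading 保哥's [30 天精通 Git 版本控管](https://github.com/doggy8088/Learn-Git-in-30-days). I quietly promised myself that I would enter when I got the chance, and use 30 days to show the world a year of iron-hard training!
298 | 
299 | These notes are ambitious. The original goal was to answer a common question: **I want to work in data science but only have time to learn one of R and Python. Which one should I start with?**
300 | 
301 | So in most chapters we let both languages tackle the same data science problem and put their syntax side by side, so that readers can see the similarities and differences for themselves. The goal: **after reading these notes (or some of the chapters), you will naturally make your own choice.** Pick one with confidence and start your data science journey; both are languages we can rely on heavily in data science.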
302 | 
303 | > Both languages have a lot of similarities in syntax and approach, and you can’t go wrong with one, the other, or both.
304 | > [Vik Paruchuri](https://www.dataquest.io/blog/python-vs-r/)
305 | 
306 | This is my first iT 邦幫忙 Ironman Contest. I am happy and proud to have finished, and I will see you at the 2018 contest!


--------------------------------------------------------------------------------