├── requirements.txt ├── data ├── digit_0_1_X.npy └── digit_0_1_y.npy ├── 资料 ├── 9.Week 3 practice lab logistic regression │ ├── testCasesEx2.mat │ ├── images │ │ ├── figure 1.png │ │ ├── figure 2.png │ │ ├── figure 3.png │ │ ├── figure 4.png │ │ ├── figure 5.png │ │ └── figure 6.png │ ├── utils.py │ ├── data │ │ ├── ex2data2.txt │ │ └── ex2data1.txt │ ├── solutions.py │ ├── public_tests.py │ └── test_utils.py └── 7.Practice lab decision trees │ ├── public_tests.py │ └── C2_W4_Decision_Tree_with_Markdown.ipynb ├── 第01章 概论.ipynb ├── README.md ├── .gitignore ├── 2024年考研807真题.ipynb ├── 第00章 补充的数学知识.ipynb ├── 第02章 统计决策方法.ipynb ├── 第10章 模式识别系统的评价.ipynb ├── 第07章 特征选择.ipynb └── 第08章 特征提取.ipynb /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy>=1.25.0 2 | matplotlib>=3.7.1 3 | scikit-learn>=1.2.2 4 | -------------------------------------------------------------------------------- /data/digit_0_1_X.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hwdqm88/pattern-recognition-807/HEAD/data/digit_0_1_X.npy -------------------------------------------------------------------------------- /data/digit_0_1_y.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hwdqm88/pattern-recognition-807/HEAD/data/digit_0_1_y.npy -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/testCasesEx2.mat: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hwdqm88/pattern-recognition-807/HEAD/资料/9.Week 3 practice lab logistic regression/testCasesEx2.mat -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/images/figure 1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hwdqm88/pattern-recognition-807/HEAD/资料/9.Week 3 practice lab logistic regression/images/figure 1.png -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/images/figure 2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hwdqm88/pattern-recognition-807/HEAD/资料/9.Week 3 practice lab logistic regression/images/figure 2.png -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/images/figure 3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hwdqm88/pattern-recognition-807/HEAD/资料/9.Week 3 practice lab logistic regression/images/figure 3.png -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/images/figure 4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hwdqm88/pattern-recognition-807/HEAD/资料/9.Week 3 practice lab logistic regression/images/figure 4.png -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/images/figure 5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hwdqm88/pattern-recognition-807/HEAD/资料/9.Week 3 practice 
lab logistic regression/images/figure 5.png -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/images/figure 6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hwdqm88/pattern-recognition-807/HEAD/资料/9.Week 3 practice lab logistic regression/images/figure 6.png -------------------------------------------------------------------------------- /第01章 概论.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "616cb730", 6 | "metadata": {}, 7 | "source": [ 8 | "# 第一章 概论\n", 9 | "---" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "id": "32d1dff8", 15 | "metadata": {}, 16 | "source": [ 17 | "## 二轮总结笔记\n", 18 | "\n", 19 | "### 一、模式识别的主要方法\n", 20 | "\n", 21 | "$$\n", 22 | "模式识别\\begin{cases}基于知识\\begin{cases}专家系统\\\\ 句法识别\\end{cases}\\\\ 基于数据\\end{cases}\n", 23 | "$$\n", 24 | "\n", 25 | "### 二、模式识别的基本步骤\n", 26 | "\n", 27 | "$$\n", 28 | "\\begin{cases}监督模式识别\\begin{cases}分析问题\\\\原始特征获取\\\\特征提取与选择\\\\分类器设计\\\\分类决策\\end{cases}\\\\非监督模式识别\\begin{cases}分析问题\\\\原始特征获取\\\\特征提取与选择\\\\聚类分析\\\\结果解释\\end{cases}\\end{cases}\n", 29 | "$$" 30 | ] 31 | } 32 | ], 33 | "metadata": { 34 | "kernelspec": { 35 | "display_name": "Python 3 (ipykernel)", 36 | "language": "python", 37 | "name": "python3" 38 | }, 39 | "language_info": { 40 | "codemirror_mode": { 41 | "name": "ipython", 42 | "version": 3 43 | }, 44 | "file_extension": ".py", 45 | "mimetype": "text/x-python", 46 | "name": "python", 47 | "nbconvert_exporter": "python", 48 | "pygments_lexer": "ipython3", 49 | "version": "3.9.16" 50 | } 51 | }, 52 | "nbformat": 4, 53 | "nbformat_minor": 5 54 | } 55 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ### 零、2024考研心得 2 | 3 | 今年的807非常简单,可以说有手就行,我把真题放在仓库里了,见 [2024年考研807真题.ipynb](./2024年考研807真题.ipynb),另外我总结几个值得注意的点: 4 | 5 | 1. 和2023年一样,2024年也引用改编了《模式分类》(Richard O. Duda等著)的课后习题,所以一定要重视这本书。但是这本书的课后习题非常的多,而且和考纲不完全重合,具体复习策略还需要自行把握。[点击跳转《模式分类》教材和书后题答案](https://github.com/kingloon/EBooks/tree/master/Pattern%20Classification) 6 | 2. 和考纲不一样,2024年没有填空题和名词解释题,可能是因为考这种题太low了所以没出。 7 | 3. 虽然考纲是第三版,但是第四版的教材也要看,有一些选择题会从第四版中出题。 8 | 9 | 考研经验贴 [https://zhuanlan.zhihu.com/p/690137049](https://zhuanlan.zhihu.com/p/690137049) 10 | 11 | ### 一、项目介绍 12 | 13 | 这个项目是我学习《模式识别(第三版)》(张学工 清华大学出版社)的笔记。 14 | 15 | 这本书是 **清华大学深圳研究生院 0854电子信息-人工智能** 考研的专业课(**807**)的考纲截至2024年的唯一指定教材。 16 | 17 | 前二章不涉及代码,都是一些数学的推导和运算。 18 | 19 | 从第三章开始,教材开始介绍各种模型,我会争取把教材上涉及到的所有模型用代码实现一遍(书上一行代码都没有)。 20 | 21 | 每一章笔记的格式是:最前面是二轮的总结,后面是一轮的学习笔记和模型的代码实现。 22 | 23 | ### 二、特别说明 24 | 25 | 1. 教材上用到了很多本科没学过的数学知识,我总结在了 [第00章 补充的数学知识.ipynb](./第00章%20补充的数学知识.ipynb) 中,建议看教材之前先看一下。 26 | 2. 一定要先看一遍教材,彻底看懂了之后(最好把公式都自己推导一遍),再来看我写的笔记和代码。因为我写的笔记只有实现模型的核心,如果你没理解书上的内容就来看笔记,大概率看不懂。 27 | 3. 关于公式的显示问题:显示效果 VS Code > 浏览器上的Jupyter notebook > Pycharm。因为各个环境渲染markdown的引擎不一样,而且很多地方不兼容(特别是这种全是Latex公式的内容)。我已经调整过,让公式在所有地方显示都正常。 28 | 4. 
如果用浏览器的Jupyter notebook,建议在看/写代码之前,先找个网上的教程给Jupyter notebook装插件,要不然代码显示丑,而且没有自动补全,用起来很难受。或者直接用IDE写代码。 29 | 30 | ### 三、运行说明 31 | 32 | 我的本机环境是 33 | 34 | + Python 3.10 35 | 36 | Python版本影响不大,不要差太多就行 37 | 38 | 其他包的安装: 39 | 40 | `pip install -r requirements.txt` 41 | 42 | 建议使用虚拟环境运行。 43 | 44 | ### 四、文件说明 45 | 46 | `./data/`里面保存了一些测试数据 47 | 48 | `./资料/`里面保存了一些笔记中引用的内容 49 | -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | def load_data(filename): 5 | data = np.loadtxt(filename, delimiter=',') 6 | X = data[:,:2] 7 | y = data[:,2] 8 | return X, y 9 | 10 | def sig(z): 11 | 12 | return 1/(1+np.exp(-z)) 13 | 14 | def map_feature(X1, X2): 15 | """ 16 | Feature mapping function to polynomial features 17 | """ 18 | X1 = np.atleast_1d(X1) 19 | X2 = np.atleast_1d(X2) 20 | degree = 6 21 | out = [] 22 | for i in range(1, degree+1): 23 | for j in range(i + 1): 24 | out.append((X1**(i-j) * (X2**j))) 25 | return np.stack(out, axis=1) 26 | 27 | 28 | def plot_data(X, y, pos_label="y=1", neg_label="y=0"): 29 | positive = y == 1 30 | negative = y == 0 31 | 32 | # Plot examples 33 | plt.plot(X[positive, 0], X[positive, 1], 'k+', label=pos_label) 34 | plt.plot(X[negative, 0], X[negative, 1], 'yo', label=neg_label) 35 | 36 | 37 | def plot_decision_boundary(w, b, X, y): 38 | # Credit to dibgerge on Github for this plotting code 39 | 40 | plot_data(X[:, 0:2], y) 41 | 42 | if X.shape[1] <= 2: 43 | plot_x = np.array([min(X[:, 0]), max(X[:, 0])]) 44 | plot_y = (-1. / w[1]) * (w[0] * plot_x + b) 45 | 46 | plt.plot(plot_x, plot_y, c="b") 47 | 48 | else: 49 | u = np.linspace(-1, 1.5, 50) 50 | v = np.linspace(-1, 1.5, 50) 51 | 52 | z = np.zeros((len(u), len(v))) 53 | 54 | # Evaluate z = theta*x over the grid 55 | for i in range(len(u)): 56 | for j in range(len(v)): 57 | z[i,j] = sig(np.dot(map_feature(u[i], v[j]), w) + b) 58 | 59 | # important to transpose z before calling contour 60 | z = z.T 61 | 62 | # Plot z = 0 63 | plt.contour(u,v,z, levels = [0.5], colors="g") 64 | 65 | -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/data/ex2data2.txt: -------------------------------------------------------------------------------- 1 | 0.051267,0.69956,1 2 | -0.092742,0.68494,1 3 | -0.21371,0.69225,1 4 | -0.375,0.50219,1 5 | -0.51325,0.46564,1 6 | -0.52477,0.2098,1 7 | -0.39804,0.034357,1 8 | -0.30588,-0.19225,1 9 | 0.016705,-0.40424,1 10 | 0.13191,-0.51389,1 11 | 0.38537,-0.56506,1 12 | 0.52938,-0.5212,1 13 | 0.63882,-0.24342,1 14 | 0.73675,-0.18494,1 15 | 0.54666,0.48757,1 16 | 0.322,0.5826,1 17 | 0.16647,0.53874,1 18 | -0.046659,0.81652,1 19 | -0.17339,0.69956,1 20 | -0.47869,0.63377,1 21 | -0.60541,0.59722,1 22 | -0.62846,0.33406,1 23 | -0.59389,0.005117,1 24 | -0.42108,-0.27266,1 25 | -0.11578,-0.39693,1 26 | 0.20104,-0.60161,1 27 | 0.46601,-0.53582,1 28 | 0.67339,-0.53582,1 29 | -0.13882,0.54605,1 30 | -0.29435,0.77997,1 31 | -0.26555,0.96272,1 32 | -0.16187,0.8019,1 33 | -0.17339,0.64839,1 34 | -0.28283,0.47295,1 35 | -0.36348,0.31213,1 36 | -0.30012,0.027047,1 37 | -0.23675,-0.21418,1 38 | -0.06394,-0.18494,1 39 | 0.062788,-0.16301,1 40 | 0.22984,-0.41155,1 41 | 0.2932,-0.2288,1 42 | 0.48329,-0.18494,1 43 | 0.64459,-0.14108,1 44 | 0.46025,0.012427,1 45 | 0.6273,0.15863,1 46 | 0.57546,0.26827,1 47 | 0.72523,0.44371,1 48 
| 0.22408,0.52412,1 49 | 0.44297,0.67032,1 50 | 0.322,0.69225,1 51 | 0.13767,0.57529,1 52 | -0.0063364,0.39985,1 53 | -0.092742,0.55336,1 54 | -0.20795,0.35599,1 55 | -0.20795,0.17325,1 56 | -0.43836,0.21711,1 57 | -0.21947,-0.016813,1 58 | -0.13882,-0.27266,1 59 | 0.18376,0.93348,0 60 | 0.22408,0.77997,0 61 | 0.29896,0.61915,0 62 | 0.50634,0.75804,0 63 | 0.61578,0.7288,0 64 | 0.60426,0.59722,0 65 | 0.76555,0.50219,0 66 | 0.92684,0.3633,0 67 | 0.82316,0.27558,0 68 | 0.96141,0.085526,0 69 | 0.93836,0.012427,0 70 | 0.86348,-0.082602,0 71 | 0.89804,-0.20687,0 72 | 0.85196,-0.36769,0 73 | 0.82892,-0.5212,0 74 | 0.79435,-0.55775,0 75 | 0.59274,-0.7405,0 76 | 0.51786,-0.5943,0 77 | 0.46601,-0.41886,0 78 | 0.35081,-0.57968,0 79 | 0.28744,-0.76974,0 80 | 0.085829,-0.75512,0 81 | 0.14919,-0.57968,0 82 | -0.13306,-0.4481,0 83 | -0.40956,-0.41155,0 84 | -0.39228,-0.25804,0 85 | -0.74366,-0.25804,0 86 | -0.69758,0.041667,0 87 | -0.75518,0.2902,0 88 | -0.69758,0.68494,0 89 | -0.4038,0.70687,0 90 | -0.38076,0.91886,0 91 | -0.50749,0.90424,0 92 | -0.54781,0.70687,0 93 | 0.10311,0.77997,0 94 | 0.057028,0.91886,0 95 | -0.10426,0.99196,0 96 | -0.081221,1.1089,0 97 | 0.28744,1.087,0 98 | 0.39689,0.82383,0 99 | 0.63882,0.88962,0 100 | 0.82316,0.66301,0 101 | 0.67339,0.64108,0 102 | 1.0709,0.10015,0 103 | -0.046659,-0.57968,0 104 | -0.23675,-0.63816,0 105 | -0.15035,-0.36769,0 106 | -0.49021,-0.3019,0 107 | -0.46717,-0.13377,0 108 | -0.28859,-0.060673,0 109 | -0.61118,-0.067982,0 110 | -0.66302,-0.21418,0 111 | -0.59965,-0.41886,0 112 | -0.72638,-0.082602,0 113 | -0.83007,0.31213,0 114 | -0.72062,0.53874,0 115 | -0.59389,0.49488,0 116 | -0.48445,0.99927,0 117 | -0.0063364,0.99927,0 118 | 0.63265,-0.030612,0 119 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | share/python-wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | MANIFEST 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .nox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *.cover 49 | *.py,cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | cover/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | .pybuilder/ 76 | target/ 77 | 78 | # Jupyter Notebook 79 | .ipynb_checkpoints 80 | 81 | # IPython 82 | profile_default/ 83 | ipython_config.py 84 | 85 | # pyenv 86 | # For a library or package, you might want to ignore these files since the code is 87 | # intended to run in multiple environments; otherwise, check them in: 88 | # .python-version 89 | 90 | # pipenv 91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 94 | # install all needed dependencies. 95 | #Pipfile.lock 96 | 97 | # poetry 98 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 99 | # This is especially recommended for binary packages to ensure reproducibility, and is more 100 | # commonly ignored for libraries. 101 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 102 | #poetry.lock 103 | 104 | # pdm 105 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 106 | #pdm.lock 107 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 108 | # in version control. 109 | # https://pdm.fming.dev/#use-with-ide 110 | .pdm.toml 111 | 112 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 113 | __pypackages__/ 114 | 115 | # Celery stuff 116 | celerybeat-schedule 117 | celerybeat.pid 118 | 119 | # SageMath parsed files 120 | *.sage.py 121 | 122 | # Environments 123 | .env 124 | .venv 125 | env/ 126 | venv/ 127 | ENV/ 128 | env.bak/ 129 | venv.bak/ 130 | 131 | # Spyder project settings 132 | .spyderproject 133 | .spyproject 134 | 135 | # Rope project settings 136 | .ropeproject 137 | 138 | # mkdocs documentation 139 | /site 140 | 141 | # mypy 142 | .mypy_cache/ 143 | .dmypy.json 144 | dmypy.json 145 | 146 | # Pyre type checker 147 | .pyre/ 148 | 149 | # pytype static type analyzer 150 | .pytype/ 151 | 152 | # Cython debug symbols 153 | cython_debug/ 154 | 155 | # PyCharm 156 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 157 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 158 | # and can be added to the global gitignore or merged into this file. For a more nuclear 159 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 
160 | .idea/ 161 | 162 | test.py 163 | -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/data/ex2data1.txt: -------------------------------------------------------------------------------- 1 | 34.62365962451697,78.0246928153624,0 2 | 30.28671076822607,43.89499752400101,0 3 | 35.84740876993872,72.90219802708364,0 4 | 60.18259938620976,86.30855209546826,1 5 | 79.0327360507101,75.3443764369103,1 6 | 45.08327747668339,56.3163717815305,0 7 | 61.10666453684766,96.51142588489624,1 8 | 75.02474556738889,46.55401354116538,1 9 | 76.09878670226257,87.42056971926803,1 10 | 84.43281996120035,43.53339331072109,1 11 | 95.86155507093572,38.22527805795094,0 12 | 75.01365838958247,30.60326323428011,0 13 | 82.30705337399482,76.48196330235604,1 14 | 69.36458875970939,97.71869196188608,1 15 | 39.53833914367223,76.03681085115882,0 16 | 53.9710521485623,89.20735013750205,1 17 | 69.07014406283025,52.74046973016765,1 18 | 67.94685547711617,46.67857410673128,0 19 | 70.66150955499435,92.92713789364831,1 20 | 76.97878372747498,47.57596364975532,1 21 | 67.37202754570876,42.83843832029179,0 22 | 89.67677575072079,65.79936592745237,1 23 | 50.534788289883,48.85581152764205,0 24 | 34.21206097786789,44.20952859866288,0 25 | 77.9240914545704,68.9723599933059,1 26 | 62.27101367004632,69.95445795447587,1 27 | 80.1901807509566,44.82162893218353,1 28 | 93.114388797442,38.80067033713209,0 29 | 61.83020602312595,50.25610789244621,0 30 | 38.78580379679423,64.99568095539578,0 31 | 61.379289447425,72.80788731317097,1 32 | 85.40451939411645,57.05198397627122,1 33 | 52.10797973193984,63.12762376881715,0 34 | 52.04540476831827,69.43286012045222,1 35 | 40.23689373545111,71.16774802184875,0 36 | 54.63510555424817,52.21388588061123,0 37 | 33.91550010906887,98.86943574220611,0 38 | 64.17698887494485,80.90806058670817,1 39 | 74.78925295941542,41.57341522824434,0 40 | 34.1836400264419,75.2377203360134,0 41 | 83.90239366249155,56.30804621605327,1 42 | 51.54772026906181,46.85629026349976,0 43 | 94.44336776917852,65.56892160559052,1 44 | 82.36875375713919,40.61825515970618,0 45 | 51.04775177128865,45.82270145776001,0 46 | 62.22267576120188,52.06099194836679,0 47 | 77.19303492601364,70.45820000180959,1 48 | 97.77159928000232,86.7278223300282,1 49 | 62.07306379667647,96.76882412413983,1 50 | 91.56497449807442,88.69629254546599,1 51 | 79.94481794066932,74.16311935043758,1 52 | 99.2725269292572,60.99903099844988,1 53 | 90.54671411399852,43.39060180650027,1 54 | 34.52451385320009,60.39634245837173,0 55 | 50.2864961189907,49.80453881323059,0 56 | 49.58667721632031,59.80895099453265,0 57 | 97.64563396007767,68.86157272420604,1 58 | 32.57720016809309,95.59854761387875,0 59 | 74.24869136721598,69.82457122657193,1 60 | 71.79646205863379,78.45356224515052,1 61 | 75.3956114656803,85.75993667331619,1 62 | 35.28611281526193,47.02051394723416,0 63 | 56.25381749711624,39.26147251058019,0 64 | 30.05882244669796,49.59297386723685,0 65 | 44.66826172480893,66.45008614558913,0 66 | 66.56089447242954,41.09209807936973,0 67 | 40.45755098375164,97.53518548909936,1 68 | 49.07256321908844,51.88321182073966,0 69 | 80.27957401466998,92.11606081344084,1 70 | 66.74671856944039,60.99139402740988,1 71 | 32.72283304060323,43.30717306430063,0 72 | 64.0393204150601,78.03168802018232,1 73 | 72.34649422579923,96.22759296761404,1 74 | 60.45788573918959,73.09499809758037,1 75 | 58.84095621726802,75.85844831279042,1 76 | 99.82785779692128,72.36925193383885,1 77 | 47.26426910848174,88.47586499559782,1 78 | 
50.45815980285988,75.80985952982456,1 79 | 60.45555629271532,42.50840943572217,0 80 | 82.22666157785568,42.71987853716458,0 81 | 88.9138964166533,69.80378889835472,1 82 | 94.83450672430196,45.69430680250754,1 83 | 67.31925746917527,66.58935317747915,1 84 | 57.23870631569862,59.51428198012956,1 85 | 80.36675600171273,90.96014789746954,1 86 | 68.46852178591112,85.59430710452014,1 87 | 42.0754545384731,78.84478600148043,0 88 | 75.47770200533905,90.42453899753964,1 89 | 78.63542434898018,96.64742716885644,1 90 | 52.34800398794107,60.76950525602592,0 91 | 94.09433112516793,77.15910509073893,1 92 | 90.44855097096364,87.50879176484702,1 93 | 55.48216114069585,35.57070347228866,0 94 | 74.49269241843041,84.84513684930135,1 95 | 89.84580670720979,45.35828361091658,1 96 | 83.48916274498238,48.38028579728175,1 97 | 42.2617008099817,87.10385094025457,1 98 | 99.31500880510394,68.77540947206617,1 99 | 55.34001756003703,64.9319380069486,1 100 | 74.77589300092767,89.52981289513276,1 101 | -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/solutions.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | def sigmoid(z): 4 | """ 5 | Compute the sigmoid of z 6 | 7 | Parameters 8 | ---------- 9 | z : array_like 10 | A scalar or numpy array of any size. 11 | 12 | Returns 13 | ------- 14 | g : array_like 15 | sigmoid(z) 16 | """ 17 | # (≈ 1 line of code) 18 | # s = 19 | ### START CODE HERE ### (≈ 1 line of code) 20 | s = 1/(1 + np.exp(-z)) 21 | ### END CODE HERE ### 22 | 23 | return s 24 | 25 | 26 | def compute_cost(X, y, w, b): 27 | m = X.shape[0] 28 | 29 | f_w = sigmoid(np.dot(X, w) + b) 30 | total_cost = (1/m)*np.sum(-y*np.log(f_w) - (1-y)*np.log(1-f_w)) 31 | 32 | return float(np.squeeze(total_cost)) 33 | 34 | def compute_gradient(X, y, w, b): 35 | """ 36 | Computes the gradient for logistic regression. 37 | 38 | Parameters 39 | ---------- 40 | X : array_like 41 | Shape (m, n+1) 42 | 43 | y : array_like 44 | Shape (m,) 45 | 46 | w : array_like 47 | Parameters of the model 48 | Shape (n+1,) 49 | b: scalar 50 | 51 | Returns 52 | ------- 53 | dw : array_like 54 | Shape (n+1,) 55 | The gradient 56 | db: scalar 57 | 58 | """ 59 | m = X.shape[0] 60 | f_w = sigmoid(np.dot(X, w) + b) 61 | err = (f_w - y) 62 | dw = (1/m)*np.dot(X.T, err) 63 | db = (1/m)*np.sum(err) 64 | 65 | return float(np.squeeze(db)), dw 66 | 67 | 68 | def predict(X, w, b): 69 | """ 70 | Predict whether the label is 0 or 1 using learned logistic 71 | regression parameters theta 72 | 73 | Parameters 74 | ---------- 75 | X : array_like 76 | Shape (m, n+1) 77 | 78 | w : array_like 79 | Parameters of the model 80 | Shape (n, 1) 81 | b : scalar 82 | 83 | Returns 84 | ------- 85 | 86 | p: array_like 87 | Shape (m,) 88 | The predictions for X using a threshold at 0.5 89 | i.e. 
if sigmoid (theta.T*X) >=0.5 predict 1 90 | """ 91 | 92 | # number of training examples 93 | m = X.shape[0] 94 | p = np.zeros(m) 95 | 96 | for i in range(m): 97 | f_w = sigmoid(np.dot(w.T, X[i]) + b) 98 | p[i] = f_w >=0.5 99 | 100 | return p 101 | 102 | def compute_cost_reg(X, y, w, b, lambda_=1): 103 | """ 104 | Computes the cost for logistic regression 105 | with regularization 106 | 107 | Parameters 108 | ---------- 109 | X : array_like 110 | Shape (m, n+1) 111 | 112 | y : array_like 113 | Shape (m,) 114 | 115 | w: array_like 116 | Parameters of the model 117 | Shape (n+1,) 118 | b: scalar 119 | 120 | Returns 121 | ------- 122 | cost : float 123 | The cost of using theta as the parameter for logistic 124 | regression to fit the data points in X and y 125 | 126 | """ 127 | # number of training examples 128 | m = X.shape[0] 129 | 130 | # You need to return the following variables correctly 131 | cost = 0 132 | 133 | f = sigmoid(np.dot(X, w) + b) 134 | reg = (lambda_/(2*m)) * np.sum(np.square(w)) 135 | cost = (1/m)*np.sum(-y*np.log(f) - (1-y)*np.log(1-f)) + reg 136 | return cost 137 | 138 | 139 | def compute_gradient_reg(X, y, w, b, lambda_=1): 140 | """ 141 | Computes the gradient for logistic regression 142 | with regularization 143 | 144 | Parameters 145 | ---------- 146 | X : array_like 147 | Shape (m, n+1) 148 | 149 | y : array_like 150 | Shape (m,) 151 | 152 | w : array_like 153 | Parameters of the model 154 | Shape (n+1,) 155 | b : scalar 156 | 157 | Returns 158 | ------- 159 | db: scalar 160 | dw: array_like 161 | Shape (n+1,) 162 | 163 | """ 164 | # number of training examples 165 | m = X.shape[0] 166 | 167 | # You need to return the following variables correctly 168 | cost = 0 169 | dw = np.zeros_like(w) 170 | 171 | f = sigmoid(np.dot(X, w) + b) 172 | err = (f - y) 173 | dw = (1/m)*np.dot(X.T, err) 174 | dw += (lambda_/m) * w 175 | db = (1/m) * np.sum(err) 176 | 177 | #print(db,dw) 178 | 179 | return db,dw 180 | -------------------------------------------------------------------------------- /2024年考研807真题.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "dc065fb3", 6 | "metadata": {}, 7 | "source": [ 8 | "## 2024年 信息技术基础综合(807)清华大学考研真题 \n", 9 | "\n", 10 | "---\n", 11 | "\n", 12 | "### 一、判断题 一道题3分\n", 13 | "\n", 14 | "1. 支持向量机既可以进行线性分类也可以进行非线性分类。\n", 15 | "2. 自组织映射是竞争神经网络。\n", 16 | "3. 召回率是判断为阴性的样本在全体阴性样本中的比例。\n", 17 | "4. 样本的特征越多越有利于分类。\n", 18 | "5. C均值是有监督模式分类算法。\n", 19 | "6. 假设样本有N维,Fisher判别是将样本投影到N-1维。\n", 20 | "7. 当样本数量较多时,用k-NN算法的剪辑算法能取得较好效果。\n", 21 | "8. 主成分分析和K-L变换在本质上是相同的。\n", 22 | "\n", 23 | "### 二、单选题 一道题3分\n", 24 | "\n", 25 | "1. {1, 2, 3, 4, -9, 0}的L1范数是\n", 26 | "+ A. 6\n", 27 | "+ B. 1\n", 28 | "+ C. 19\n", 29 | "+ D. 111\n", 30 | "\n", 31 | "2. 以下关于特征选择个数的说法正确的是\n", 32 | "+ A. 特征选择的个数越多越好\n", 33 | "+ B. 特征选择应选择能最好的区分不同样本的特征\n", 34 | "+ C. 特征选择的个数越少越好\n", 35 | "+ D. 以上说法都不对\n", 36 | "\n", 37 | "3. 下面不是决策树后剪枝方法的是\n", 38 | "+ A. 信息增益的统计显著性分析\n", 39 | "+ B. 减少分类错误修剪法\n", 40 | "+ C. 最小代价与复杂性的折中\n", 41 | "+ D. 最小描述长度准则\n", 42 | "\n", 43 | "4. 下面关于支持向量机软间隔和硬间隔说法错误的是\n", 44 | "+ A. 软间隔有利于最大化分类间隔\n", 45 | "+ B. 软间隔可以容忍错分样本\n", 46 | "+ C. 硬间隔有利于消除过拟合\n", 47 | "+ D. 硬间隔保证所有样本都是正确分类的\n", 48 | "\n", 49 | "5. 混合高斯模型计算概率密度使用的是\n", 50 | "+ A. 贝叶斯准则\n", 51 | "+ B. 模型准则\n", 52 | "+ C. 忘了\n", 53 | "+ D. 训练准则\n", 54 | "\n", 55 | "6. 下面对感知器说法错误的是\n", 56 | "+ A. 感知器可以解决非线性可分问题\n", 57 | "+ B. 给感知器设定阈值后可以用于分类\n", 58 | "+ C. 感知器是一个简单的前馈神经网络\n", 59 | "+ D. 
感知器的阈值会影响判别面位置\n", 60 | "\n", 61 | "7. 有一种产品60%由工厂A生产,40%由工厂B生产,甲乙两人各自买了一个这种产品,请问他们买的产品来自于不同工厂的概率是\n", 62 | "+ A. 忘了\n", 63 | "+ B. 48%\n", 64 | "+ C. 忘了\n", 65 | "+ D. 忘了\n", 66 | "\n", 67 | "### 三、多选题 一道题4分,多选不得分,少选得2分\n", 68 | "\n", 69 | "1. 以下是sigmoid函数的优点的是\n", 70 | "+ A. 处处连续,方便求导\n", 71 | "+ B. 可以将数值压缩在[0,1]之中\n", 72 | "+ C. 在深层反馈网络中不易产生梯度消失\n", 73 | "+ D. 可以用于二分类问题\n", 74 | "\n", 75 | "2. 以下是决策树方法的是\n", 76 | "+ A. ID3\n", 77 | "+ B. C4.5\n", 78 | "+ C. CART\n", 79 | "+ D. CNN\n", 80 | "\n", 81 | "3. 以下是生成模型的是\n", 82 | "+ A. 支持向量机\n", 83 | "+ B. 朴素贝叶斯\n", 84 | "+ C. 隐马尔科夫模型\n", 85 | "+ D. 高斯混合模型\n", 86 | "\n", 87 | "4. 以下是影响k-NN算法结果的因素是\n", 88 | "+ A. 最近邻样本的距离\n", 89 | "+ B. 相似性度量\n", 90 | "+ C. 对样本分类的方法\n", 91 | "+ D. k的大小\n", 92 | "\n", 93 | "5. 如果以特征向量的相关系数作为模式相似性测度,则影响聚类算法结果的主要因素有\n", 94 | "+ A. 已知类别样本质量\n", 95 | "+ B. 分类准则\n", 96 | "+ C. 特征选取\n", 97 | "+ D. 量纲\n", 98 | "\n", 99 | "### 四、计算题 12分\n", 100 | "\n", 101 | "已知两类样本$\\omega_1$:$\\{(1, 0),(2, 0),(1, 1)\\}$,$\\omega_2$:$\\{(-1, 0),(0, 1),(-1, 1)\\}$,两类样本概率相等(具体的数没记住,题目问法是正确的)\n", 102 | "\n", 103 | "(1) 求类内离散度矩阵$\\mathbf{S}_{\\text{w}}$(6分)\n", 104 | "\n", 105 | "(2) 求类间离散度矩阵$\\mathbf{S}_{\\text{b}}$(6分)\n", 106 | "\n", 107 | "### 五、计算题 12分\n", 108 | "\n", 109 | "(此题为模式分类P165-T3改编)\n", 110 | "\n", 111 | "设$p(x)$为$0$到$a$之间的均匀分布,即$p(x)\\sim U(0,a)$,Parzen窗估计函数为$\\displaystyle\\varphi (x)=\\begin{cases}e^{-x},&x>0\\\\0,&x\\leqslant 0\\end{cases}$\n", 112 | "\n", 113 | "(1) 设宽窗参数为$h_n$,求Parzen窗估计$\\bar{p}_n(x)$(分值忘了)\n", 114 | "\n", 115 | "(2) 要使在$0$到$a$之间,$99\\%$的情况下估计值的误差都小于$1\\%$,求$h_n$的取值范围(分值忘了)\n", 116 | "\n", 117 | "\n", 118 | "### 六、计算题 20分\n", 119 | "\n", 120 | "模式识别第三版/第四版教材上贝叶斯决策的例题(正常细胞异常细胞的那个)\n", 121 | "\n", 122 | "(1) 求最小错误率贝叶斯决策(8分)\n", 123 | "\n", 124 | "(2) 求最小风险贝叶斯决策(8分)\n", 125 | "\n", 126 | "(3) 解释两个决策结果的差异(6分)\n", 127 | "\n", 128 | "\n", 129 | "### 七、计算题 20分\n", 130 | "\n", 131 | "(此题为模式分类P52-T2改编)\n", 132 | "\n", 133 | "设概率密度函数$\\displaystyle p(x|\\omega_i)\\propto \\exp\\{-\\frac{|x-a_i|}{b_i}\\}$,$a_i>0$,$b_i>0$\n", 134 | "\n", 135 | "(1) 求概率密度函数表达式,即将概率密度归一化(分值忘了)\n", 136 | "\n", 137 | "(2) 求似然比$\\displaystyle \\frac{p(x|\\omega_1)}{p(x|\\omega_2)}$(分值忘了)\n", 138 | "\n", 139 | "(3) 当$a_1=0$,$b_1=1$,$a_2=1$,$b_2=2$时,求出似然比,并画出似然比的图像(分值忘了)\n", 140 | "\n", 141 | "### 八、计算题 21分\n", 142 | "\n", 143 | "(1) 已知两个样本$\\mathbf{x}_1$,$\\mathbf{x}_2$,求出最优分类超平面将两个样本分开,使$\\mathbf{x}_1$处于平面负侧(15分)\n", 144 | "\n", 145 | "(2) 验证$\\mathbf{x}_1$处于平面负侧(6分)" 146 | ] 147 | } 148 | ], 149 | "metadata": { 150 | "kernelspec": { 151 | "display_name": "Python 3 (ipykernel)", 152 | "language": "python", 153 | "name": "python3" 154 | }, 155 | "language_info": { 156 | "codemirror_mode": { 157 | "name": "ipython", 158 | "version": 3 159 | }, 160 | "file_extension": ".py", 161 | "mimetype": "text/x-python", 162 | "name": "python", 163 | "nbconvert_exporter": "python", 164 | "pygments_lexer": "ipython3", 165 | "version": "3.9.13" 166 | } 167 | }, 168 | "nbformat": 4, 169 | "nbformat_minor": 5 170 | } 171 | -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/public_tests.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import math 3 | 4 | def sigmoid_test(target): 5 | assert np.isclose(target(3.0), 0.9525741268224334), "Failed for scalar input" 6 | assert np.allclose(target(np.array([2.5, 0])), [0.92414182, 0.5]), "Failed for 1D array" 7 | assert np.allclose(target(np.array([[2.5, -2.5], [0, 1]])), 8 
| [[0.92414182, 0.07585818], [0.5, 0.73105858]]), "Failed for 2D array" 9 | print('\033[92mAll tests passed!') 10 | 11 | def compute_cost_test(target): 12 | X = np.array([[0, 0, 0, 0]]).T 13 | y = np.array([0, 0, 0, 0]) 14 | w = np.array([0]) 15 | b = 1 16 | result = target(X, y, w, b) 17 | if math.isinf(result): 18 | raise ValueError("Did you get the sigmoid of z_wb?") 19 | 20 | np.random.seed(17) 21 | X = np.random.randn(5, 2) 22 | y = np.array([1, 0, 0, 1, 1]) 23 | w = np.random.randn(2) 24 | b = 0 25 | result = target(X, y, w, b) 26 | assert np.isclose(result, 2.15510667), f"Wrong output. Expected: {2.15510667} got: {result}" 27 | 28 | X = np.random.randn(4, 3) 29 | y = np.array([1, 1, 0, 0]) 30 | w = np.random.randn(3) 31 | b = 0 32 | 33 | result = target(X, y, w, b) 34 | assert np.isclose(result, 0.80709376), f"Wrong output. Expected: {0.80709376} got: {result}" 35 | 36 | X = np.random.randn(4, 3) 37 | y = np.array([1, 0,1, 0]) 38 | w = np.random.randn(3) 39 | b = 3 40 | result = target(X, y, w, b) 41 | assert np.isclose(result, 0.4529660647), f"Wrong output. Expected: {0.4529660647} got: {result}. Did you inizialized z_wb = b?" 42 | 43 | print('\033[92mAll tests passed!') 44 | 45 | def compute_gradient_test(target): 46 | np.random.seed(1) 47 | X = np.random.randn(7, 3) 48 | y = np.array([1, 0, 1, 0, 1, 1, 0]) 49 | test_w = np.array([1, 0.5, -0.35]) 50 | test_b = 1.7 51 | dj_db, dj_dw = target(X, y, test_w, test_b) 52 | 53 | assert np.isclose(dj_db, 0.28936094), f"Wrong value for dj_db. Expected: {0.28936094} got: {dj_db}" 54 | assert dj_dw.shape == test_w.shape, f"Wrong shape for dj_dw. Expected: {test_w.shape} got: {dj_dw.shape}" 55 | assert np.allclose(dj_dw, [-0.11999166, 0.41498775, -0.71968405]), f"Wrong values for dj_dw. Got: {dj_dw}" 56 | 57 | print('\033[92mAll tests passed!') 58 | 59 | def predict_test(target): 60 | np.random.seed(5) 61 | b = 0.5 62 | w = np.random.randn(3) 63 | X = np.random.randn(8, 3) 64 | 65 | result = target(X, w, b) 66 | wrong_1 = [1., 1., 0., 0., 1., 0., 0., 1.] 67 | expected_1 = [1., 1., 1., 0., 1., 0., 0., 1.] 68 | if np.allclose(result, wrong_1): 69 | raise ValueError("Did you apply the sigmoid before applying the threshold?") 70 | assert result.shape == (len(X),), f"Wrong length. Expected : {(len(X),)} got: {result.shape}" 71 | assert np.allclose(result, expected_1), f"Wrong output: Expected : {expected_1} got: {result}" 72 | 73 | b = -1.7 74 | w = np.random.randn(4) + 0.6 75 | X = np.random.randn(6, 4) 76 | 77 | result = target(X, w, b) 78 | expected_2 = [0., 0., 0., 1., 1., 0.] 79 | assert result.shape == (len(X),), f"Wrong length. Expected : {(len(X),)} got: {result.shape}" 80 | assert np.allclose(result,expected_2), f"Wrong output: Expected : {expected_2} got: {result}" 81 | 82 | print('\033[92mAll tests passed!') 83 | 84 | def compute_cost_reg_test(target): 85 | np.random.seed(1) 86 | w = np.random.randn(3) 87 | b = 0.4 88 | X = np.random.randn(6, 3) 89 | y = np.array([0, 1, 1, 0, 1, 1]) 90 | lambda_ = 0.1 91 | expected_output = target(X, y, w, b, lambda_) 92 | 93 | assert np.isclose(expected_output, 0.5469746792761936), f"Wrong output. Expected: {0.5469746792761936} got:{expected_output}" 94 | 95 | w = np.random.randn(5) 96 | b = -0.6 97 | X = np.random.randn(8, 5) 98 | y = np.array([1, 0, 1, 0, 0, 1, 0, 1]) 99 | lambda_ = 0.01 100 | output = target(X, y, w, b, lambda_) 101 | assert np.isclose(output, 1.2608591964119995), f"Wrong output. 
Expected: {1.2608591964119995} got:{output}" 102 | 103 | w = np.array([2, 2, 2, 2, 2]) 104 | b = 0 105 | X = np.zeros((8, 5)) 106 | y = np.array([0.5] * 8) 107 | lambda_ = 3 108 | output = target(X, y, w, b, lambda_) 109 | expected = -np.log(0.5) + 3. / (2. * 8.) * 20. 110 | assert np.isclose(output, expected), f"Wrong output. Expected: {expected} got:{output}" 111 | 112 | print('\033[92mAll tests passed!') 113 | 114 | def compute_gradient_reg_test(target): 115 | np.random.seed(1) 116 | w = np.random.randn(5) 117 | b = 0.2 118 | X = np.random.randn(7, 5) 119 | y = np.array([0, 1, 1, 0, 1, 1, 0]) 120 | lambda_ = 0.1 121 | expected1 = (-0.1506447567869257, np.array([ 0.19530838, -0.00632206, 0.19687367, 0.15741161, 0.02791437])) 122 | dj_db, dj_dw = target(X, y, w, b, lambda_) 123 | 124 | assert np.isclose(dj_db, expected1[0]), f"Wrong dj_db. Expected: {expected1[0]} got: {dj_db}" 125 | assert np.allclose(dj_dw, expected1[1]), f"Wrong dj_dw. Expected: {expected1[1]} got: {dj_dw}" 126 | 127 | 128 | w = np.random.randn(7) 129 | b = 0 130 | X = np.random.randn(7, 7) 131 | y = np.array([1, 0, 0, 0, 1, 1, 0]) 132 | lambda_ = 0 133 | expected2 = (0.02660329857573818, np.array([ 0.23567643, -0.06921029, -0.19705212, -0.0002884 , 0.06490588, 134 | 0.26948175, 0.10777992])) 135 | dj_db, dj_dw = target(X, y, w, b, lambda_) 136 | assert np.isclose(dj_db, expected2[0]), f"Wrong dj_db. Expected: {expected2[0]} got: {dj_db}" 137 | assert np.allclose(dj_dw, expected2[1]), f"Wrong dj_dw. Expected: {expected2[1]} got: {dj_dw}" 138 | 139 | print('\033[92mAll tests passed!') 140 | -------------------------------------------------------------------------------- /第00章 补充的数学知识.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## 说明\n", 8 | "\n", 9 | "书中涉及到大量的数学分析和矩阵论的知识。在非数学专业本科学的内容不能完全覆盖书中需要的知识点。下面是我的补充。\n", 10 | "\n", 11 | "### 一、矩阵求导规则\n", 12 | "\n", 13 | "书上用到了很多向量和矩阵的求导,这里介绍一下:\n", 14 | "\n", 15 | "#### 1. 对向量求导\n", 16 | "\n", 17 | "$$\\begin{aligned}\n", 18 | "\\frac{\\partial \\mathbf{Ax}}{\\partial \\mathbf{x}} &= \\mathbf{A}^\\text{T}\\\\\n", 19 | "\\frac{\\partial \\mathbf{Ax}}{\\partial \\mathbf{x}^\\text{T}} &= \\mathbf{A}\\\\\n", 20 | "\\frac{\\partial \\mathbf{x}^\\text{T}\\mathbf{A}}{\\partial \\mathbf{x}} &= \\mathbf{A}\\\\\n", 21 | "\\frac{\\partial \\mathbf{x}^\\text{T}\\mathbf{A}\\mathbf{x}}{\\partial \\mathbf{x}} &= (\\mathbf{A}^\\text{T} + \\mathbf{A})\\mathbf{x} \\\\\n", 22 | "\\frac{\\partial \\mathbf{x\\cdot x}}{\\partial \\mathbf{x}} &= \\frac{\\partial \\mathbf{x}^\\text{T}\\mathbf{x}}{\\partial \\mathbf{x}} = 2\\mathbf{x}\\\\\n", 23 | "\\frac{\\partial \\mathbf{w\\cdot x}}{\\partial \\mathbf{x}} &= \\frac{\\partial \\mathbf{w}^\\text{T}\\mathbf{x}}{\\partial \\mathbf{x}} = \\mathbf{w}\n", 24 | "\\end{aligned}$$\n", 25 | "\n", 26 | "#### 2. 
对矩阵求导\n", 27 | "\n", 28 | "对矩阵求导规则比较复杂,这篇文章写的很好:[【必读】3分钟带你了解标量对矩阵求导方法](https://zhuanlan.zhihu.com/p/279209775)。\n", 29 | "\n", 30 | "另外我再总结一下书中证明要用到的的公式(如果你按照上面文章的方法学会了对矩阵求导的通用方法,可以很容易自己证明出来)\n", 31 | "\n", 32 | "$$\\begin{aligned}\n", 33 | "\\\\d(\\mathbf{AX}) &= \\mathbf{A}d\\mathbf{X}\\\\\n", 34 | "\\frac{\\partial tr(\\mathbf{X}^\\text{T}\\mathbf{A}\\mathbf{X})}{\\partial \\mathbf{X}} &= (\\mathbf{A}^\\text{T} + \\mathbf{A})\\mathbf{X} \\\\\n", 35 | "\\frac{\\partial \\ln|\\mathbf{X}|}{\\partial \\mathbf{X}} &= \\mathbf{X}^{-1}\\\\\n", 36 | "\\frac{\\partial \\mathbf{w}^{\\text{T}}\\mathbf{X}^{-1}\\mathbf{w}}{\\partial\\mathbf{X}} &= -\\mathbf{X}^{-1}\\mathbf{w}\\mathbf{w}^{\\text{T}}\\mathbf{X}^{-1},若\\mathbf{X}为对称阵\n", 37 | "\\end{aligned}$$\n", 38 | "\n", 39 | "### 二、多维随机变量的统计\n", 40 | "\n", 41 | "假设$\\mathbf{x}_1,\\mathbf{x}_2,\\cdots,\\mathbf{x}_N$是$d$维随机变量的样本,则\n", 42 | "\n", 43 | "#### 1. 样本均值\n", 44 | "\n", 45 | "$$\n", 46 | "\\mathbf{m} = \\frac{1}{N}\\sum_{i=1}^N\\mathbf{x}_i\n", 47 | "$$\n", 48 | "\n", 49 | "#### 2. 样本协方差矩阵(注意这个公式不求平均)\n", 50 | "\n", 51 | "$$\n", 52 | "\\mathbf{S}_{\\text{w}} = \\sum_{i=1}^N (\\mathbf{x}_i - \\mathbf{m})(\\mathbf{x}_i - \\mathbf{m})^{\\text{T}} = \\mathbf{X}^{\\text{T}}\\mathbf{X} - N\\mathbf{m}\\mathbf{m}^\\text{T}\n", 53 | "$$\n", 54 | "\n", 55 | "其中\n", 56 | "\n", 57 | "$$\n", 58 | "\\mathbf{X}^{\\text{T}}\\mathbf{X}=\n", 59 | "\\left[\\begin{matrix} \n", 60 | "\\mathbf{x}_1,\\mathbf{x}_2,\\cdots,\\mathbf{x}_N\n", 61 | "\\end{matrix}\\right]\n", 62 | "\\left[\\begin{matrix}\n", 63 | "\\mathbf{x}_1^{\\text{T}} \\\\\n", 64 | "\\mathbf{x}_2^{\\text{T}} \\\\\n", 65 | "\\vdots \\\\\n", 66 | "\\mathbf{x}_N^{\\text{T}}\n", 67 | "\\end{matrix}\\right]\n", 68 | "$$\n", 69 | "\n", 70 | "#### 3. 样本二阶矩阵\n", 71 | "\n", 72 | "$$\n", 73 | "\\sum_{i=1}^N \\mathbf{x}_i\\mathbf{x}_i^{\\text{T}} = \\mathbf{X}^{\\text{T}}\\mathbf{X}\n", 74 | "$$\n", 75 | "\n", 76 | "### 三、超空间几何\n", 77 | "\n", 78 | "#### 1. 点到超平面距离\n", 79 | "\n", 80 | "假设在$n$维空间中,有超平面$\\mathbf{w}\\cdot \\mathbf{x} + b = 0$,有一点$\\mathbf{x}_0$。\n", 81 | "\n", 82 | "则点到超平面的距离为:\n", 83 | "\n", 84 | "$$d = \\frac{|\\mathbf{w} \\cdot \\mathbf{x}_0 + b|}{||\\mathbf{w}||}$$\n", 85 | "\n", 86 | "#### 2. 平行的超平面的距离\n", 87 | "\n", 88 | "假设在$n$维空间中,有两个平行的超平面\n", 89 | "\n", 90 | "$$\\mathbf{w}\\cdot \\mathbf{x} + b_1 = 0\\\\\n", 91 | " \\mathbf{w}\\cdot \\mathbf{x} + b_2 = 0$$\n", 92 | "\n", 93 | "因为平行的超平面的权向量($\\mathbf{w}$向量)一定成比例,所以上面的公式把两个超平面的权向量化成一样的了,这时就只有$b$不同。\n", 94 | "\n", 95 | "则两个超平面之间的距离为:\n", 96 | "\n", 97 | "$$d = \\frac{|b_1-b_2|}{||\\mathbf{w}||}$$\n", 98 | "\n", 99 | "### 四、进阶版拉格朗日算子法(求多元函数条件极值)\n", 100 | "\n", 101 | "#### 1. 自变量是向量的多元函数\n", 102 | "\n", 103 | "高数中学的拉格朗日算子法的条件只能是等式,下面给出条件有不等式的求法:\n", 104 | "\n", 105 | "假设在约束条件$g(\\mathbf{x})\\leqslant 0$的情况下,求多元函数$f(\\mathbf{x})$的极值。\n", 106 | "\n", 107 | "1. 写出拉格朗日函数\n", 108 | "\n", 109 | "$$\n", 110 | "\\mathscr{L}(\\mathbf{x}) = f(\\mathbf{x}) + \\lambda g(\\mathbf{x})\n", 111 | "$$\n", 112 | "\n", 113 | "2. 写出条件不等式(KKT条件)\n", 114 | "\n", 115 | "$$\\begin{align}\n", 116 | "\\frac{\\partial\\mathscr{L}}{\\partial\\mathbf{x}} = \\frac{\\partial f}{\\partial\\mathbf{x}} + \\lambda\\frac{\\partial g}{\\partial\\mathbf{x}} &= 0 \\\\\n", 117 | "\\frac{\\partial\\mathscr{L}}{\\partial \\lambda} = g(\\mathbf{x}) &\\leqslant 0 \\\\\n", 118 | "\\lambda &\\geqslant 0 \\\\\n", 119 | "\\lambda g(\\mathbf{x}) &= 0\n", 120 | "\\end{align}$$\n", 121 | "\n", 122 | "这组不等式也叫KKT条件\n", 123 | "\n", 124 | "3. 
求解KKT条件\n", 125 | "\n", 126 | "如果有多个约束条件也是同理,把KKT条件扩展一下就行。\n", 127 | "\n", 128 | "#### 2. 自变量是矩阵的函数\n", 129 | "\n", 130 | "自变量是矩阵可能不好理解,但是教材162页的公式用到了这个方法。\n", 131 | "\n", 132 | "自变量是矩阵时就把上面的公式中的常数$\\lambda$换成对角阵$\\boldsymbol{\\Lambda}$,自变量向量$\\mathbf{x}$换成矩阵$\\mathbf{X}$即可,其他都不变。这里就不再赘述了。\n", 133 | "\n", 134 | "### 五、欧拉积分\n", 135 | "\n", 136 | "贝叶斯估计可能会用到欧拉积分。\n", 137 | "\n", 138 | "#### 1. 伽马函数\n", 139 | "\n", 140 | "$$\\begin{aligned}\n", 141 | "\\Gamma(s) &= \\int_0^{+\\infty }x^{s}e^{-x} \\text{d}x \\\\\n", 142 | "\\Gamma(s) &= s\\Gamma(s-1) \\\\\n", 143 | "\\Gamma(s) &= s! \\quad若s为正整数\n", 144 | "\\end{aligned}$$\n", 145 | "\n", 146 | "#### 2. 贝塔函数\n", 147 | "\n", 148 | "$$\\begin{aligned}\n", 149 | "\\beta(x,y) &= \\int_0^1P^{x}(1-P)^{y}\\text{d}P \\\\\n", 150 | "\\beta(x,y) &= \\frac{x!y!}{(x+y+1)!} \\quad若x,y为正整数\n", 151 | "\\end{aligned}$$" 152 | ] 153 | } 154 | ], 155 | "metadata": { 156 | "kernelspec": { 157 | "display_name": "Python 3 (ipykernel)", 158 | "language": "python", 159 | "name": "python3" 160 | }, 161 | "language_info": { 162 | "codemirror_mode": { 163 | "name": "ipython", 164 | "version": 3 165 | }, 166 | "file_extension": ".py", 167 | "mimetype": "text/x-python", 168 | "name": "python", 169 | "nbconvert_exporter": "python", 170 | "pygments_lexer": "ipython3", 171 | "version": "3.9.16" 172 | } 173 | }, 174 | "nbformat": 4, 175 | "nbformat_minor": 1 176 | } 177 | -------------------------------------------------------------------------------- /资料/7.Practice lab decision trees/public_tests.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | def compute_entropy_test(target): 4 | y = np.array([1] * 10) 5 | result = target(y) 6 | 7 | assert result == 0, "Entropy must be 0 with array of ones" 8 | 9 | y = np.array([0] * 10) 10 | result = target(y) 11 | 12 | assert result == 0, "Entropy must be 0 with array of zeros" 13 | 14 | y = np.array([0] * 12 + [1] * 12) 15 | result = target(y) 16 | 17 | assert result == 1, "Entropy must be 1 with same ammount of ones and zeros" 18 | 19 | y = np.array([1, 0, 1, 0, 1, 1, 1, 0, 1]) 20 | assert np.isclose(target(y), 0.918295, atol=1e-6), "Wrong value. Something between 0 and 1" 21 | assert np.isclose(target(-y + 1), target(y), atol=1e-6), "Wrong value" 22 | 23 | print("\033[92m All tests passed.") 24 | 25 | def split_dataset_test(target): 26 | X = np.array([[1, 0], 27 | [1, 0], 28 | [1, 1], 29 | [0, 0], 30 | [0, 1]]) 31 | X_t = np.array([[0, 1, 0, 1, 0]]) 32 | X = np.concatenate((X, X_t.T), axis=1) 33 | 34 | left, right = target(X, list(range(5)), 2) 35 | expected = {'left': np.array([1, 3]), 36 | 'right': np.array([0, 2, 4])} 37 | 38 | assert type(left) == list, f"Wrong type for left. Expected: list got: {type(left)}" 39 | assert type(right) == list, f"Wrong type for right. Expected: list got: {type(right)}" 40 | 41 | assert type(left[0]) == int, f"Wrong type for elements in the left list. Expected: int got: {type(left[0])}" 42 | assert type(right[0]) == int, f"Wrong type for elements in the right list. Expected: number got: {type(right[0])}" 43 | 44 | assert len(left) == 2, f"left must have 2 elements but got: {len(left)}" 45 | assert len(right) == 3, f"right must have 3 elements but got: {len(right)}" 46 | 47 | assert np.allclose(right, expected['right']), f"Wrong value for right. Expected: { expected['right']} \ngot: {right}" 48 | assert np.allclose(left, expected['left']), f"Wrong value for left. 
Expected: { expected['left']} \ngot: {left}" 49 | 50 | X = np.array([[0, 1], 51 | [1, 1], 52 | [1, 1], 53 | [0, 0], 54 | [1, 0]]) 55 | X_t = np.array([[0, 1, 0, 1, 0]]) 56 | X = np.concatenate((X_t.T, X), axis=1) 57 | 58 | left, right = target(X, list(range(5)), 0) 59 | expected = {'left': np.array([1, 3]), 60 | 'right': np.array([0, 2, 4])} 61 | 62 | assert np.allclose(right, expected['right']) and np.allclose(left, expected['left']), f"Wrong value when target is at index 0." 63 | 64 | X = (np.random.rand(11, 3) > 0.5) * 1 # Just random binary numbers 65 | X_t = np.array([[0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0]]) 66 | X = np.concatenate((X, X_t.T), axis=1) 67 | 68 | left, right = target(X, [1, 2, 3, 6, 7, 9, 10], 3) 69 | expected = {'left': np.array([1, 3, 6]), 70 | 'right': np.array([2, 7, 9, 10])} 71 | 72 | assert np.allclose(right, expected['right']) and np.allclose(left, expected['left']), f"Wrong value when target is at index 0. \nExpected: {expected} \ngot: \{left:{left}, 'right': {right}\}" 73 | 74 | 75 | print("\033[92m All tests passed.") 76 | 77 | def compute_information_gain_test(target): 78 | X = np.array([[1, 0], 79 | [1, 0], 80 | [1, 0], 81 | [0, 0], 82 | [0, 1]]) 83 | 84 | y = np.array([[0, 0, 0, 0, 0]]).T 85 | node_indexes = list(range(5)) 86 | 87 | result1 = target(X, y, node_indexes, 0) 88 | result2 = target(X, y, node_indexes, 0) 89 | 90 | assert result1 == 0 and result2 == 0, f"Information gain must be 0 when target variable is pure. Got {result1} and {result2}" 91 | 92 | y = np.array([[0, 1, 0, 1, 0]]).T 93 | node_indexes = list(range(5)) 94 | 95 | result = target(X, y, node_indexes, 0) 96 | assert np.isclose(result, 0.019973, atol=1e-6), f"Wrong information gain. Expected {0.019973} got: {result}" 97 | 98 | result = target(X, y, node_indexes, 1) 99 | assert np.isclose(result, 0.170951, atol=1e-6), f"Wrong information gain. Expected {0.170951} got: {result}" 100 | 101 | node_indexes = list(range(4)) 102 | result = target(X, y, node_indexes, 0) 103 | assert np.isclose(result, 0.311278, atol=1e-6), f"Wrong information gain. Expected {0.311278} got: {result}" 104 | 105 | result = target(X, y, node_indexes, 1) 106 | assert np.isclose(result, 0, atol=1e-6), f"Wrong information gain. Expected {0.0} got: {result}" 107 | 108 | print("\033[92m All tests passed.") 109 | 110 | def get_best_split_test(target): 111 | X = np.array([[1, 0], 112 | [1, 0], 113 | [1, 0], 114 | [0, 0], 115 | [0, 1]]) 116 | 117 | y = np.array([[0, 0, 0, 0, 0]]).T 118 | node_indexes = list(range(5)) 119 | 120 | result = target(X, y, node_indexes) 121 | 122 | assert result == -1, f"When the target variable is pure, there is no best split to do. Expected -1, got {result}" 123 | 124 | y = X[:,0] 125 | result = target(X, y, node_indexes) 126 | assert result == 0, f"If the target is fully correlated with other feature, that feature must be the best split. Expected 0, got {result}" 127 | y = X[:,1] 128 | result = target(X, y, node_indexes) 129 | assert result == 1, f"If the target is fully correlated with other feature, that feature must be the best split. Expected 1, got {result}" 130 | 131 | y = 1 - X[:,0] 132 | result = target(X, y, node_indexes) 133 | assert result == 0, f"If the target is fully correlated with other feature, that feature must be the best split. Expected 0, got {result}" 134 | 135 | y = np.array([[0, 1, 0, 1, 0]]).T 136 | result = target(X, y, node_indexes) 137 | assert result == 1, f"Wrong result. 
Expected 1, got {result}" 138 | 139 | y = np.array([[0, 1, 0, 1, 0]]).T 140 | node_indexes = [2, 3, 4] 141 | result = target(X, y, node_indexes) 142 | assert result == 0, f"Wrong result. Expected 0, got {result}" 143 | 144 | n_samples = 100 145 | X0 = np.array([[1] * n_samples]) 146 | X1 = np.array([[0] * n_samples]) 147 | X2 = (np.random.rand(1, 100) > 0.5) * 1 148 | X3 = np.array([[1] * int(n_samples / 2) + [0] * int(n_samples / 2)]) 149 | 150 | y = X2.T 151 | node_indexes = list(range(20, 80)) 152 | X = np.array([X0, X1, X2, X3]).T.reshape(n_samples, 4) 153 | result = target(X, y, node_indexes) 154 | 155 | assert result == 2, f"Wrong result. Expected 2, got {result}" 156 | 157 | y = X0.T 158 | result = target(X, y, node_indexes) 159 | assert result == -1, f"When the target variable is pure, there is no best split to do. Expected -1, got {result}" 160 | print("\033[92m All tests passed.") -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/test_utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from copy import deepcopy 3 | 4 | 5 | def datatype_check(expected_output, target_output, error): 6 | success = 0 7 | if isinstance(target_output, dict): 8 | for key in target_output.keys(): 9 | try: 10 | success += datatype_check(expected_output[key], 11 | target_output[key], error) 12 | except: 13 | print("Error: {} in variable {}. Got {} but expected type {}".format(error, 14 | key, 15 | type( 16 | target_output[key]), 17 | type(expected_output[key]))) 18 | if success == len(target_output.keys()): 19 | return 1 20 | else: 21 | return 0 22 | elif isinstance(target_output, tuple) or isinstance(target_output, list): 23 | for i in range(len(target_output)): 24 | try: 25 | success += datatype_check(expected_output[i], 26 | target_output[i], error) 27 | except: 28 | print("Error: {} in variable {}, expected type: {} but expected type {}".format(error, 29 | i, 30 | type( 31 | target_output[i]), 32 | type(expected_output[i] 33 | ))) 34 | if success == len(target_output): 35 | return 1 36 | else: 37 | return 0 38 | 39 | else: 40 | assert isinstance(target_output, type(expected_output)) 41 | return 1 42 | 43 | 44 | def equation_output_check(expected_output, target_output, error): 45 | success = 0 46 | if isinstance(target_output, dict): 47 | for key in target_output.keys(): 48 | try: 49 | success += equation_output_check(expected_output[key], 50 | target_output[key], error) 51 | except: 52 | print("Error: {} for variable {}.".format(error, 53 | key)) 54 | if success == len(target_output.keys()): 55 | return 1 56 | else: 57 | return 0 58 | elif isinstance(target_output, tuple) or isinstance(target_output, list): 59 | for i in range(len(target_output)): 60 | try: 61 | success += equation_output_check(expected_output[i], 62 | target_output[i], error) 63 | except: 64 | print("Error: {} for variable in position {}.".format(error, i)) 65 | if success == len(target_output): 66 | return 1 67 | else: 68 | return 0 69 | 70 | else: 71 | if hasattr(target_output, 'shape'): 72 | np.testing.assert_array_almost_equal( 73 | target_output, expected_output) 74 | else: 75 | assert target_output == expected_output 76 | return 1 77 | 78 | 79 | def shape_check(expected_output, target_output, error): 80 | success = 0 81 | if isinstance(target_output, dict): 82 | for key in target_output.keys(): 83 | try: 84 | success += shape_check(expected_output[key], 85 | target_output[key], error) 86 
| except: 87 | print("Error: {} for variable {}.".format(error, key)) 88 | if success == len(target_output.keys()): 89 | return 1 90 | else: 91 | return 0 92 | elif isinstance(target_output, tuple) or isinstance(target_output, list): 93 | for i in range(len(target_output)): 94 | try: 95 | success += shape_check(expected_output[i], 96 | target_output[i], error) 97 | except: 98 | print("Error: {} for variable {}.".format(error, i)) 99 | if success == len(target_output): 100 | return 1 101 | else: 102 | return 0 103 | 104 | else: 105 | if hasattr(target_output, 'shape'): 106 | assert target_output.shape == expected_output.shape 107 | return 1 108 | 109 | 110 | def single_test(test_cases, target): 111 | success = 0 112 | for test_case in test_cases: 113 | try: 114 | if test_case['name'] == "datatype_check": 115 | assert isinstance(target(*test_case['input']), 116 | type(test_case["expected"])) 117 | success += 1 118 | if test_case['name'] == "equation_output_check": 119 | assert np.allclose(test_case["expected"], 120 | target(*test_case['input'])) 121 | success += 1 122 | if test_case['name'] == "shape_check": 123 | assert test_case['expected'].shape == target( 124 | *test_case['input']).shape 125 | success += 1 126 | except: 127 | print("Error: " + test_case['error']) 128 | 129 | if success == len(test_cases): 130 | print("\033[92m All tests passed.") 131 | else: 132 | print('\033[92m', success, " Tests passed") 133 | print('\033[91m', len(test_cases) - success, " Tests failed") 134 | raise AssertionError( 135 | "Not all tests were passed for {}. Check your equations and avoid using global variables inside the function.".format(target.__name__)) 136 | 137 | 138 | def multiple_test(test_cases, target): 139 | success = 0 140 | for test_case in test_cases: 141 | try: 142 | test_input = deepcopy(test_case['input']) 143 | target_answer = target(*test_input) 144 | if test_case['name'] == "datatype_check": 145 | success += datatype_check(test_case['expected'], 146 | target_answer, test_case['error']) 147 | if test_case['name'] == "equation_output_check": 148 | success += equation_output_check( 149 | test_case['expected'], target_answer, test_case['error']) 150 | if test_case['name'] == "shape_check": 151 | success += shape_check(test_case['expected'], 152 | target_answer, test_case['error']) 153 | except: 154 | print('\33[30m', "Error: " + test_case['error']) 155 | 156 | if success == len(test_cases): 157 | print("\033[92m All tests passed.") 158 | else: 159 | print('\033[92m', success, " Tests passed") 160 | print('\033[91m', len(test_cases) - success, " Tests failed") 161 | raise AssertionError( 162 | "Not all tests were passed for {}. Check your equations and avoid using global variables inside the function.".format(target.__name__)) 163 | 164 | -------------------------------------------------------------------------------- /第02章 统计决策方法.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "5fdc03ba", 6 | "metadata": {}, 7 | "source": [ 8 | "# 第二章 统计决策方法\n", 9 | "---" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "id": "8b90dc9e", 15 | "metadata": {}, 16 | "source": [ 17 | "## 二轮总结笔记\n", 18 | "\n", 19 | "### 一、三种贝叶斯决策\n", 20 | "\n", 21 | "#### 1. 最小错误率贝叶斯决策\n", 22 | "\n", 23 | "##### (1) 对于$c$类的判别\n", 24 | "\n", 25 | "1. 判别函数:\n", 26 | "\n", 27 | "$$对于\\omega_i类,g_i(\\mathbf{x}) = p(\\mathbf{x}|\\omega_i)P(\\omega_i)$$\n", 28 | "\n", 29 | "2. 
判别规则:\n", 30 | "\n", 31 | "$$若g_i(\\mathbf{x}) = \\max\\limits_{j=1,\\cdots,c}{g_j(\\mathbf{x})},则\\mathbf{x}\\in\\omega_i$$\n", 32 | "\n", 33 | "3. 决策面:\n", 34 | "\n", 35 | "$$g_i(\\mathbf{x}) = g_j(\\mathbf{x})$$\n", 36 | "\n", 37 | "##### (2) 对于两类的判别\n", 38 | "\n", 39 | "1. 判别函数:\n", 40 | "\n", 41 | "$$g(\\mathbf{x}) = p(\\mathbf{x}|\\omega_1)P(\\omega_1) - p(\\mathbf{x}|\\omega_2)P(\\omega_2)$$\n", 42 | "\n", 43 | "2. 判别规则:\n", 44 | "\n", 45 | "$$若l(\\mathbf{x}) = \\frac{p(\\mathbf{x}|\\omega_1)}{p(\\mathbf{x}|\\omega_2)} \\gtrless \\lambda = \\frac{P(\\omega_2)}{P(\\omega_1)},则\\mathbf{x}\\in\\begin{cases} \\omega_1 \\\\ \\omega_2 \\end{cases}$$\n", 46 | "\n", 47 | "其中,\n", 48 | "\n", 49 | "$$p(\\mathbf{x}|\\omega_1)叫似然度,l(\\mathbf{x})叫似然比$$\n", 50 | "\n", 51 | "3. 决策面:\n", 52 | "\n", 53 | "$$g(\\mathbf{x}) = 0$$\n", 54 | "\n", 55 | "#### 2. 最小风险贝叶斯决策\n", 56 | "\n", 57 | "##### (1) 对于$c$类的判别\n", 58 | "\n", 59 | "判别规则:\n", 60 | "\n", 61 | "$$若R(\\alpha_i|\\mathbf{x}) = \\min\\limits_{j=1,\\cdots,k}R(\\alpha_j|\\mathbf{x}),则决策\\alpha=\\alpha_i$$\n", 62 | "\n", 63 | "其中,\n", 64 | "\n", 65 | "$$R(\\alpha_i|\\mathbf{x})=\\sum_{j=1}^c\\lambda(\\alpha_i,\\omega_j)P(\\omega_j|\\mathbf{x})$$\n", 66 | "\n", 67 | "##### (2) 对于两类的判别\n", 68 | "\n", 69 | "判别规则:\n", 70 | "\n", 71 | "$$若l(\\mathbf{x}) = \\frac{p(\\mathbf{x}|\\omega_1)}{p(\\mathbf{x}|\\omega_2)} \\gtrless \\lambda = \\frac{P(\\omega_2)}{P(\\omega_1)}\\cdot \\frac{\\lambda_{12}-\\lambda_{22}}{\\lambda_{21}-\\lambda_{11}},则\\mathbf{x}\\in\\begin{cases} \\omega_1 \\\\ \\omega_2\\\\\\end{cases}$$\n", 72 | "\n", 73 | "其中,\n", 74 | "\n", 75 | "$$\\lambda_{ij}=\\lambda(\\alpha_i, \\omega_j),即实际情况为\\omega_j时决策为\\alpha_i的风险。$$\n", 76 | "\n", 77 | "#### 3. Neyman-Pearson决策\n", 78 | "\n", 79 | "即固定一类错误率的情况下最小化另一类错误率。\n", 80 | "\n", 81 | "判别规则:\n", 82 | "\n", 83 | "$$若l(\\mathbf{x}) = \\frac{p(\\mathbf{x}|\\omega_1)}{p(\\mathbf{x}|\\omega_2)} \\gtrless \\lambda,则\\mathbf{x}\\in\\begin{cases} \\omega_1 \\\\ \\omega_2\\\\\\end{cases}$$\n", 84 | "\n", 85 | "$\\lambda$由固定的一类错误率计算出,假设固定第二类错误率(假阴性)为$\\varepsilon_0$,则决策边界$\\lambda$保证$\\displaystyle \\int_{\\mathscr{R}_1}p(\\mathbf{x}|\\omega_2)\\text{d}\\mathbf{x}=\\varepsilon_0$\n", 86 | "\n", 87 | "### 二、正态分布时的统计决策\n", 88 | "\n", 89 | "#### 1. 正态分布概率密度公式\n", 90 | "\n", 91 | "$$p(\\mathbf{x})=\\frac{1}{(2\\pi)^{\\frac{d}{2}}|\\boldsymbol{\\Sigma}|^{\\frac{1}{2}}}\\exp\\left\\{-\\frac{1}{2}(\\mathbf{x}-\\boldsymbol{\\mu})^T\\boldsymbol{\\Sigma}^{-1}(\\mathbf{x}-\\boldsymbol{\\mu})\\right\\}$$\n", 92 | "\n", 93 | "#### 2. 正态分布下最小错误率贝叶斯决策的性质\n", 94 | "\n", 95 | "设所有类别的概率密度都服从正态分布,$p(\\mathbf{x}|\\omega_i) \\sim N(\\boldsymbol{\\mu}_i, \\boldsymbol{\\Sigma}_i)$\n", 96 | "\n", 97 | "##### (1) 若$\\boldsymbol{\\Sigma}_i = \\sigma^2\\mathbf{I}$,$\\mathbf{I}$是单位矩阵\n", 98 | "\n", 99 | "$\\omega_i$和$\\omega_j$的决策面是平面,并且与$\\boldsymbol{\\mu}_i$和$\\boldsymbol{\\mu}_j$连线正交(垂直)。\n", 100 | "\n", 101 | "1. 所有先验概率$P(\\omega_i)$相同\n", 102 | "\n", 103 | "决策面不仅与$\\boldsymbol{\\mu}_i$和$\\boldsymbol{\\mu}_j$连线正交,还过$\\boldsymbol{\\mu}_i$和$\\boldsymbol{\\mu}_j$连线的中点,是垂直平分线(面)。\n", 104 | "\n", 105 | "此情况下**最小错误率**贝叶斯决策等价于**最小距离分类器**。\n", 106 | "\n", 107 | "2. 先验概率$P(\\omega_i)$不同\n", 108 | "\n", 109 | "决策面向先验概率小的类偏移,即先验概率大的类占据更大的决策空间。\n", 110 | "\n", 111 | "##### (2) 若$\\boldsymbol{\\Sigma}_i = \\boldsymbol{\\Sigma}$\n", 112 | "\n", 113 | "$\\omega_i$和$\\omega_j$的决策面是平面,但是和$\\boldsymbol{\\mu}_i$和$\\boldsymbol{\\mu}_j$连线**不**正交(**不**垂直)。\n", 114 | "\n", 115 | "1. 
所有先验概率$P(\\omega_i)$相同\n", 116 | "\n", 117 | "决策面过$\\boldsymbol{\\mu}_i$和$\\boldsymbol{\\mu}_j$连线的中点。\n", 118 | "\n", 119 | "2. 先验概率不同\n", 120 | "\n", 121 | "决策面向先验概率小的类偏移,即先验概率大的类占据更大的决策空间。\n", 122 | "\n", 123 | "对于两类情况,决策面为:\n", 124 | "\n", 125 | "$$\\begin{aligned}\n", 126 | "g(\\mathbf{x}) &= \\mathbf{w}^\\text{T}\\mathbf{x} + w_0 = 0 \\\\\n", 127 | "\\mathbf{w} &= \\boldsymbol{\\Sigma}^{-1}(\\boldsymbol{\\mu}_1 - \\boldsymbol{\\mu}_2) \\\\\n", 128 | "w_0 &= -\\frac{1}{2}(\\boldsymbol{\\mu}_1 + \\boldsymbol{\\mu}_2)^\\text{T}\\boldsymbol{\\Sigma}^{-1}(\\boldsymbol{\\mu}_1 - \\boldsymbol{\\mu}_1) - \\ln\\frac{P(\\omega_2)}{P(\\omega_1)}\n", 129 | "\\end{aligned}$$\n", 130 | "\n", 131 | "##### (3) 一般情况\n", 132 | "\n", 133 | "决策面是超二次曲面。\n", 134 | "\n", 135 | "### 三、错误率的计算\n", 136 | "\n", 137 | "#### 1. 贝叶斯错误率计算公式\n", 138 | "\n", 139 | "##### (1) 两类情况\n", 140 | "\n", 141 | "$$\\begin{aligned}\n", 142 | "P(e) &= P(\\omega_1)\\int_{\\mathscr{R}_2}p(\\mathbf{x}|\\omega_1)\\text{d}\\mathbf{x} + P(\\omega_2)\\int_{\\mathscr{R}_1}p(\\mathbf{x}|\\omega_2)\\text{d}\\mathbf{x} \\\\\n", 143 | " &= P(\\omega_1)P_1(e)+P(\\omega_2)P_2(e)\\\\ \n", 144 | "\\end{aligned}$$\n", 145 | "\n", 146 | "其中\n", 147 | "\n", 148 | "$$\\begin{aligned}\n", 149 | "P_1(e) &= \\alpha \\quad (假阳性) \\\\\n", 150 | "P_2(e) &= \\beta \\quad (假阴性)\\\\\n", 151 | "\\end{aligned}$$\n", 152 | "\n", 153 | "另外\n", 154 | "\n", 155 | "$$\\begin{aligned}\n", 156 | "灵敏度\\text{Sn} &= \\frac{\\text{TP}}{\\text{TP}+\\text{FN}} = 1-\\beta \\\\\n", 157 | "特异度\\text{Sp} &= \\frac{\\text{TN}}{\\text{TN}+\\text{FP}} = 1-\\alpha\n", 158 | "\\end{aligned}$$\n", 159 | "\n", 160 | "##### (2) 多类情况\n", 161 | "\n", 162 | "$$\n", 163 | "P(e) = \\int P(e|\\mathbf{x})p(\\mathbf{x})\\text{d}\\mathbf{x} = \\int 1-\\max\\limits_{i}\\{P(\\omega_i|\\mathbf{x})\\}\\text{d}\\mathbf{x}\n", 164 | "$$\n", 165 | "\n", 166 | "#### 2. ROC曲线\n", 167 | "\n", 168 | "横轴为$\\alpha = 1-\\text{Sp}$,纵轴为$1-\\beta = \\text{Sn}$\n", 169 | "\n", 170 | "#### 3. 正态分布且各类协方差矩阵相等情况下错误率的计算\n", 171 | "\n", 172 | "$$\\begin{aligned}\n", 173 | "P_1(e) &= \\int_t^{+\\infty} p(h|\\omega_1)\\text{d}h \\\\\n", 174 | " &= 1-\\Phi(\\frac{t+\\eta}{\\sigma}) \\\\\n", 175 | "P_2(e) &= \\int_{-\\infty}^t p(h|\\omega_2)\\text{d}h \\\\\n", 176 | " &= \\Phi(\\frac{t-\\eta}{\\sigma}) \\\\\n", 177 | "\\end{aligned}$$ \n", 178 | "其中,\n", 179 | "$$\\begin{aligned}\n", 180 | "t &= \\ln\\frac{P(\\omega_1)}{P(\\omega_2)} \\\\\n", 181 | "\\eta &= \\frac{1}{2}(\\boldsymbol{\\mu}_1-\\boldsymbol{\\mu}_2)^T\\boldsymbol{\\Sigma}^{-1}(\\boldsymbol{\\mu}_1 - \\boldsymbol{\\mu}_2) \\\\\n", 182 | "\\sigma &= \\sqrt{2\\eta}\n", 183 | "\\end{aligned}$$\n", 184 | "\n", 185 | "#### 4. 
高维独立随机变量时错误率的估计\n", 186 | "\n", 187 | "$d$维随机变量$\\mathbf{x}$各分量相互独立时,用中心极限定理把$h(\\mathbf{x})$近似为正态分布,按照上面正态分布的公式计算错误率。\n", 188 | "\n", 189 | "近似认为$h(\\mathbf{x}|\\omega_i) \\sim N(\\sum_{i=1}^{d}\\eta_{il},\\sum_{i=1}^{d}\\sigma_{il}^2)$。\n", 190 | "\n", 191 | "其中,$\\eta_{il}$是$\\omega_i$类第$l$个分量的对数似然比$p(h(x_l)|\\omega_i)$的期望,$\\sigma_{il}^2$是$\\omega_i$类第$l$个分量的对数似然比$p(h(x_l)|\\omega_i)$的方差。\n", 192 | "\n", 193 | "### 四、一阶马尔科夫链\n", 194 | "\n", 195 | "对数几率比:\n", 196 | "\n", 197 | "$$S(x)=\\sum_{i=1}^L \\log\\frac{a_{x_{i-1} x_i}^+}{a_{x_{i-1} x_i}^-} = \\sum_{i=1}^L \\beta_{x_{i-1}x_i}$$\n", 198 | "\n", 199 | "阈值根据不同决策方法确定。" 200 | ] 201 | } 202 | ], 203 | "metadata": { 204 | "kernelspec": { 205 | "display_name": "Python 3 (ipykernel)", 206 | "language": "python", 207 | "name": "python3" 208 | }, 209 | "language_info": { 210 | "codemirror_mode": { 211 | "name": "ipython", 212 | "version": 3 213 | }, 214 | "file_extension": ".py", 215 | "mimetype": "text/x-python", 216 | "name": "python", 217 | "nbconvert_exporter": "python", 218 | "pygments_lexer": "ipython3", 219 | "version": "3.11.0" 220 | }, 221 | "toc": { 222 | "base_numbering": 1, 223 | "nav_menu": {}, 224 | "number_sections": true, 225 | "sideBar": true, 226 | "skip_h1_title": false, 227 | "title_cell": "Table of Contents", 228 | "title_sidebar": "Contents", 229 | "toc_cell": false, 230 | "toc_position": {}, 231 | "toc_section_display": true, 232 | "toc_window_display": false 233 | }, 234 | "varInspector": { 235 | "cols": { 236 | "lenName": 16, 237 | "lenType": 16, 238 | "lenVar": 40 239 | }, 240 | "kernels_config": { 241 | "python": { 242 | "delete_cmd_postfix": "", 243 | "delete_cmd_prefix": "del ", 244 | "library": "var_list.py", 245 | "varRefreshCmd": "print(var_dic_list())" 246 | }, 247 | "r": { 248 | "delete_cmd_postfix": ") ", 249 | "delete_cmd_prefix": "rm(", 250 | "library": "var_list.r", 251 | "varRefreshCmd": "cat(var_dic_list()) " 252 | } 253 | }, 254 | "types_to_exclude": [ 255 | "module", 256 | "function", 257 | "builtin_function_or_method", 258 | "instance", 259 | "_Feature" 260 | ], 261 | "window_display": false 262 | } 263 | }, 264 | "nbformat": 4, 265 | "nbformat_minor": 5 266 | } 267 | -------------------------------------------------------------------------------- /第10章 模式识别系统的评价.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 第十章 模式识别系统的评价\n", 8 | "---" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "## 二轮总结笔记\n", 16 | "\n", 17 | "\n", 18 | "### 一、监督模式识别方法的错误率估计\n", 19 | "\n", 20 | "#### 1. 测试错误率\n", 21 | "\n", 22 | "##### (1) 先验概率$P(\\omega_1)$,$P(\\omega_2)$未知——随机抽样\n", 23 | "\n", 24 | "1. 错误率:测试集中有$N$个样本,错分了$k$个,则测试错误率$\\displaystyle\\hat\\varepsilon = \\frac{k}{N}$。\n", 25 | "2. 性质:\n", 26 | "+ 测试错误率是真实错误率的**最大似然估计**。\n", 27 | "+ 测试错误率的期望$E[\\hat\\varepsilon]=\\varepsilon$,是真实错误率的无偏估计。\n", 28 | "+ 测试错误率的方差$\\displaystyle D[\\hat\\varepsilon]=\\frac{\\varepsilon(1-\\varepsilon)}{N}$。\n", 29 | "+ $N$越**大**,测试错误率的置信区间越**小**,测试错误率越**可信**。\n", 30 | "\n", 31 | "##### (2) 先验概率$P(\\omega_1)$,$P(\\omega_2)$已知——选择性抽样\n", 32 | "\n", 33 | "1. 错误率:在两类中按照先验概率比例进行抽样,$N_1=P(\\omega_1)N$,$N_2=P(\\omega_2)N$,两类分别错分了$k_1$和$k_2$个,则测试错误率$\\displaystyle\\hat\\varepsilon = P(\\omega_1)\\frac{k_1}{N_1} + P(\\omega_2)\\frac{k_2}{N_2} = \\frac{k_1+k_2}{N}$。\n", 34 | "2. 
性质:\n", 35 | "+ 测试错误率是真实错误率的**最大似然估计**。\n", 36 | "+ 测试错误率的期望$E[\\hat\\varepsilon]=\\varepsilon$,是真实错误率的无偏估计。\n", 37 | "+ 测试错误率的方差$\\displaystyle D[\\hat\\varepsilon] = \\frac{1}{N}\\left[P(\\omega_1)\\frac{k_1}{N_1}\\left(1-\\frac{k_1}{N_1}\\right)+P(\\omega_2)\\frac{k_2}{N_2}\\left(1-\\frac{k_2}{N_2}\\right)\\right]$,方差比前一种方法**小**。\n", 38 | "\n", 39 | "#### 2. 交叉验证(cross-validation,CV)\n", 40 | "\n", 41 | "##### (1) 分类\n", 42 | "\n", 43 | "1. k轮n倍交叉验证(n-fold cross-validation):偏差大,方差小。\n", 44 | "2. 留一法交叉验证(leave-one-out cross-validation):适用于样本少的情况,偏差小,方差大。\n", 45 | "\n", 46 | "##### (2) 性质\n", 47 | "\n", 48 | "1. 临时测试集较小,错误率估计接近全部样本,多轮实验平均可以减少错误率方差。\n", 49 | "2. 可以用于选择分类器参数,用交叉验证错误率最小的参数在全部样本上训练分类器。\n", 50 | "\n", 51 | "#### 3. 自举法和.632估计\n", 52 | "\n", 53 | "##### (1) 自举估计\n", 54 | "\n", 55 | "对样本进行$B$次自举重采样,重采样后的样本作为训练集,没有抽到的样本作为测试集,得到的错误率的平均。\n", 56 | "\n", 57 | "##### (2) .632估计\n", 58 | "\n", 59 | "1. $\\text{B}.632=0.368\\times \\text{AE} + 0.632\\times\\text{B1}$。其中$\\text{AE}$是训练错误率(视在错误率),$\\text{B1}$是自举错误率。\n", 60 | "2. .632估计是对错误率更好的估计。\n", 61 | "\n", 62 | "\n", 63 | "### 二、有限样本下错误率的区间估计问题\n", 64 | "\n", 65 | "#### 1. 问题的提出\n", 66 | "\n", 67 | "##### (1) 标准数据集\n", 68 | "\n", 69 | "1. 一套数据多次划分成训练集和测试集。\n", 70 | "2. 固定划分。\n", 71 | "3. 同时统计错误率的平均值和方差。\n", 72 | "\n", 73 | "##### (2) 性质\n", 74 | "\n", 75 | "1. 各次实验不是独立的,估计的错误率区间会偏小。\n", 76 | "2. 仅基于交叉验证,不存在错误率估计量方差的无偏估计。\n", 77 | "\n", 78 | "#### 2. 用扰动重采样估计SVM错误率的置信区间\n", 79 | "\n", 80 | "##### (1) 性质\n", 81 | "\n", 82 | "1. 考虑了测试样本的不确定性和现有训练样本的不确定性,能够得到性能的全面评价。\n", 83 | "2. 适用于样本数目趋于无穷大,但是有限样本下表现依然较好。\n", 84 | "3. 只适用于线性核。\n", 85 | "\n", 86 | "\n", 87 | "### 三、特征提取与选择对分类器性能估计的影响\n", 88 | "\n", 89 | "#### 1. CV1\n", 90 | "\n", 91 | "##### (1) 定义\n", 92 | "\n", 93 | "对所有样本进行特征选择核提取,然后进行交叉验证。\n", 94 | "\n", 95 | "##### (2) 性质\n", 96 | "\n", 97 | "1. 对分类器性能估计偏乐观,训练集用了测试集的信息。\n", 98 | "2. 当初始特征维数高,样本数目小时过学习问题明显。\n", 99 | "\n", 100 | "#### 2. CV2\n", 101 | "\n", 102 | "##### (1) 定义\n", 103 | "\n", 104 | "只对训练集进行特征选择和提取,然后进行交叉验证。\n", 105 | "\n", 106 | "##### (2) 性质\n", 107 | "\n", 108 | "1. 能得到分类器性能的真实估计。\n", 109 | "2. 需要设计出唯一的特征选择和提取方案。可以全部样本重新特征选择和提取,也可以在CV2过程中的特征选择和提取中综合。\n", 110 | "\n", 111 | "\n", 112 | "### 四、从分类的显著性推断特征与类别的关系\n", 113 | "\n", 114 | "#### 1. 随机置换法\n", 115 | "\n", 116 | "##### (1) 如何得到空分布\n", 117 | "\n", 118 | "1. 保持原有样本中两类样本比例不变的情况下,随机打乱样本的类别标号。\n", 119 | "2. 用原有的特征选择和提取和分类方法进行分类。\n", 120 | "\n", 121 | "##### (2) 如何判断显著性\n", 122 | "\n", 123 | "用真实分类器性能和空分布比较,通常以小于$0.05$作为参考。\n", 124 | "\n", 125 | "\n", 126 | "### 五、非监督模式识别系统性能的评价\n", 127 | "\n", 128 | "#### 1. 紧致性(compactness)/一致性(homogeneity)\n", 129 | "\n", 130 | "##### (1) 公式\n", 131 | "\n", 132 | "$$\n", 133 | "V(C) = \\sqrt{\\frac{1}{N}\\sum_{C_k\\in C}\\sum_{i\\in C_k}\\delta(i,\\mu_k)}\n", 134 | "$$\n", 135 | "\n", 136 | "##### (2) 性质\n", 137 | "\n", 138 | "越小越好,取值范围$[0,\\infty]$。\n", 139 | "\n", 140 | "#### 2. 连接性质(connectedness)/连接度(connectivity)\n", 141 | "\n", 142 | "##### (1) 公式\n", 143 | "\n", 144 | "$$\\begin{aligned}\n", 145 | "\\text{Conn}(C) &= \\sum_{i=1}^N\\sum_{j=1}^Lx_{i,nn_{\\scriptstyle i(j)}} \\\\\n", 146 | "x_{i,nn_{\\scriptstyle i(j)}} &= \\begin{cases}\\displaystyle \\frac{1}{j}&第i个样本和其第j个近邻不在同一个聚类\\\\0&在同一个聚类\\end{cases}\n", 147 | "\\end{aligned}$$\n", 148 | "\n", 149 | "##### (2) 性质\n", 150 | "\n", 151 | "越小越好,取值范围$[0,\\infty]$。\n", 152 | "\n", 153 | "#### 3. 分离度(separation)\n", 154 | "\n", 155 | "##### (1) 性质\n", 156 | "\n", 157 | "越大越好。\n", 158 | "\n", 159 | "#### 4. 
Silhouette宽度\n", 160 | "\n", 161 | "##### (1) 公式\n", 162 | "\n", 163 | "$$\\begin{aligned}\n", 164 | "S(i) &= \\frac{b_i-a_i}{\\max(b_i,a_i)} \\\\\n", 165 | "a_i &= 样本i到和它\\textbf{同类}的所有样本的\\textbf{平均距离}\\\\\n", 166 | "b_i &= 样本i到其他聚类中\\textbf{最近}的一个聚类的所有样本的\\textbf{平均距离}\n", 167 | "\\end{aligned}$$\n", 168 | "\n", 169 | "##### (2) 性质\n", 170 | "\n", 171 | "越大越好,取值范围$[-1,1]$。\n", 172 | "\n", 173 | "#### 5. Dunn指数(Dunn index)\n", 174 | "\n", 175 | "##### (1) 公式\n", 176 | "\n", 177 | "$$\\begin{aligned}\n", 178 | "D(C) &= \\min\\limits_{C_k\\in C}\\left(\\min\\limits_{C_l\\in C}\\frac{\\text{dist}(C_k,C_l)}{\\displaystyle\\max\\limits_{C_m\\in C}\\text{diam}(C_m)}\\right)\\\\\n", 179 | "\\text{dist}(C_k,C_l) &= C_k,C_l两类中距离最近的样本间的距离\\\\\n", 180 | "\\text{diam}(C_m) &= C_m中最大的类内距离\n", 181 | "\\end{aligned}$$\n", 182 | "\n", 183 | "##### (2) 性质\n", 184 | "\n", 185 | "越大越好,取值范围$[0,\\infty]$。\n", 186 | "\n", 187 | "#### 6. 预测效力\n", 188 | "\n", 189 | "##### (1) 算法\n", 190 | "\n", 191 | "将样本随机分为两份,各自进行聚类,用一份得到的聚类结果作为临时训练样本,对另一份进行最近邻法分类。\n", 192 | "\n", 193 | "##### (2) 性质\n", 194 | "\n", 195 | "重合程度越大聚类结果越稳定。" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "---\n", 203 | "## 一轮学习笔记(包含代码实现)\n", 204 | "\n", 205 | "这一章介绍了大多数其他教材放在第一章的内容,即分类算法的检验方法。\n", 206 | "\n", 207 | "教材的重点放在了检验方法得出的错误率的理论分析上面,即从概率论的角度分析错误率的期望和方差。这部分都是数学推导,不需要写代码。\n", 208 | "\n", 209 | "而教材中提到的计算错误率的检验方法实现起来都很非常简单,而且在sklearn中都有库可以很方便的调用。\n", 210 | "\n", 211 | "只有扰动重采样估计SVM错误率的置信区间非常复杂,但是这个算法其实不重要,作者把它放在教材上主要还是因为这个算法是作者自己发的文章。\n", 212 | "\n", 213 | "下面我只实现一下bootstrap .632估计。" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 1, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "from sklearn.utils import resample\n", 223 | "from sklearn.svm import LinearSVC\n", 224 | "import numpy as np" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": {}, 230 | "source": [ 231 | "下面用手写数字识别的数据集。" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 2, 237 | "metadata": {}, 238 | "outputs": [], 239 | "source": [ 240 | "X = np.load(\"./data/digit_0_1_X.npy\")\n", 241 | "y = np.load(\"./data/digit_0_1_y.npy\")\n", 242 | "X = X[:1000]\n", 243 | "y = y[:1000] #前一千个是0和1,后面是其他数字\n", 244 | "y = y.reshape(1000)" 245 | ] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": {}, 250 | "source": [ 251 | "先计算自举错误率B1。" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": 3, 257 | "metadata": {}, 258 | "outputs": [ 259 | { 260 | "name": "stdout", 261 | "output_type": "stream", 262 | "text": [ 263 | "bootstrap第1次,自举样本集中包含了原始样本集62.5%的样本。\n", 264 | "本次训练错误率为:0.002666666666666706\n", 265 | "bootstrap第2次,自举样本集中包含了原始样本集61.8%的样本。\n", 266 | "本次训练错误率为:0.010471204188481686\n", 267 | "bootstrap第3次,自举样本集中包含了原始样本集62.2%的样本。\n", 268 | "本次训练错误率为:0.002645502645502673\n", 269 | "bootstrap第4次,自举样本集中包含了原始样本集64.3%的样本。\n", 270 | "本次训练错误率为:0.0028011204481792618\n", 271 | "bootstrap第5次,自举样本集中包含了原始样本集62.9%的样本。\n", 272 | "本次训练错误率为:0.0\n", 273 | "bootstrap第6次,自举样本集中包含了原始样本集61.199999999999996%的样本。\n", 274 | "本次训练错误率为:0.002577319587628857\n", 275 | "bootstrap第7次,自举样本集中包含了原始样本集63.800000000000004%的样本。\n", 276 | "本次训练错误率为:0.002762430939226568\n", 277 | "bootstrap第8次,自举样本集中包含了原始样本集62.9%的样本。\n", 278 | "本次训练错误率为:0.0026954177897574594\n", 279 | "bootstrap第9次,自举样本集中包含了原始样本集64.60000000000001%的样本。\n", 280 | "本次训练错误率为:0.0028248587570621764\n", 281 | "bootstrap第10次,自举样本集中包含了原始样本集64.7%的样本。\n", 282 | "本次训练错误率为:0.0\n", 283 | 
"自举平均错误率B1 = 0.002944452102250539\n" 284 | ] 285 | } 286 | ], 287 | "source": [ 288 | "# 下面计算十次bootstrap样本集在线性支持向量机上的错误率\n", 289 | "b1 = 0\n", 290 | "for i in range(10):\n", 291 | " bootstrap_index = []\n", 292 | " test_index = set(range(X.shape[0]))\n", 293 | " X_bootstrap = resample(X)\n", 294 | " for x in X_bootstrap:\n", 295 | " for j in range(X.shape[0]):\n", 296 | " if (X[j] == x).all():\n", 297 | " bootstrap_index.append(j)\n", 298 | " if j in test_index:\n", 299 | " test_index.remove(j)\n", 300 | " break\n", 301 | " test_index = list(test_index)\n", 302 | " y_bootstrap = y[bootstrap_index]\n", 303 | " X_test = X[test_index]\n", 304 | " y_test = y[test_index]\n", 305 | " print(\"bootstrap第\", i + 1, \"次,自举样本集中包含了原始样本集\",\n", 306 | " (1 - len(test_index) / X.shape[0]) * 100, \"%的样本。\", sep=\"\")\n", 307 | " lsvc = LinearSVC()\n", 308 | " lsvc.fit(X_bootstrap, y_bootstrap)\n", 309 | " err = 1 - lsvc.score(X_test, y_test)\n", 310 | " b1 += err\n", 311 | " print(\"本次训练错误率为:\", err,sep=\"\")\n", 312 | "b1 /= 10\n", 313 | "print(\"自举平均错误率B1 =\", b1)" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "再计算全部样本上的训练错误率(视在错误率)。" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": 4, 326 | "metadata": {}, 327 | "outputs": [ 328 | { 329 | "name": "stdout", 330 | "output_type": "stream", 331 | "text": [ 332 | "全部样本上的训练错误率AE = 0.0\n" 333 | ] 334 | } 335 | ], 336 | "source": [ 337 | "lsvc = LinearSVC()\n", 338 | "lsvc.fit(X, y)\n", 339 | "ae = 1 - lsvc.score(X, y)\n", 340 | "print(\"全部样本上的训练错误率AE =\", ae)" 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "接下来计算B.632错误率。" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": 5, 353 | "metadata": {}, 354 | "outputs": [ 355 | { 356 | "name": "stdout", 357 | "output_type": "stream", 358 | "text": [ 359 | "B.632错误率 = 0.0018608937286223406\n" 360 | ] 361 | } 362 | ], 363 | "source": [ 364 | "print(\"B.632错误率 =\", 0.368 * ae + 0.632 * b1)" 365 | ] 366 | } 367 | ], 368 | "metadata": { 369 | "kernelspec": { 370 | "display_name": "Python 3 (ipykernel)", 371 | "language": "python", 372 | "name": "python3" 373 | }, 374 | "language_info": { 375 | "codemirror_mode": { 376 | "name": "ipython", 377 | "version": 3 378 | }, 379 | "file_extension": ".py", 380 | "mimetype": "text/x-python", 381 | "name": "python", 382 | "nbconvert_exporter": "python", 383 | "pygments_lexer": "ipython3", 384 | "version": "3.9.16" 385 | } 386 | }, 387 | "nbformat": 4, 388 | "nbformat_minor": 1 389 | } 390 | -------------------------------------------------------------------------------- /第07章 特征选择.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 第七章 特征选择\n", 8 | "---" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "## 二轮总结笔记\n", 16 | "\n", 17 | "\n", 18 | "### 一、特征的评价准则\n", 19 | "\n", 20 | "#### 1. 
可分性判据的要求\n", 21 | "\n", 22 | "##### (1) 错误率单调性\n", 23 | "\n", 24 | "判据与错误率(或错误率的**上界**)有单调关系。\n", 25 | "\n", 26 | "##### (2) 可加性\n", 27 | "\n", 28 | "当特征**独立**时,判据对特征应该具有可加性,即\n", 29 | "\n", 30 | "$$\n", 31 | "J_{ij}(x_1,x_2,\\cdots,x_d)=\\sum_{k=1}^dJ_{ij}(x_k)\n", 32 | "$$\n", 33 | "\n", 34 | "##### (3) 度量特征\n", 35 | "\n", 36 | "$$\\begin{aligned}\n", 37 | "J_{ij} &> 0,\\quad 当i\\neq j时\\\\\n", 38 | "J_{ij} &= 0,\\quad 当i=j时\\\\\n", 39 | "J_{ij} &= J_{ji}\n", 40 | "\\end{aligned}$$\n", 41 | "\n", 42 | "##### (4) 特征单调性\n", 43 | "\n", 44 | "加入新的特征不会使判据减小,即\n", 45 | "\n", 46 | "$$\n", 47 | "J_{ij}(x_1,x_2,\\cdots,x_d) \\leq J_{ij}(x_1,x_2,\\cdots,x_d,x_{d+1})\n", 48 | "$$\n", 49 | "\n", 50 | "#### 2. 基于类内类间距离的可分性判据\n", 51 | "\n", 52 | "##### (1) 常用基于类内类间距离的可分性判据\n", 53 | "\n", 54 | "$$\\begin{aligned}\n", 55 | "J_1 &= \\text{tr}(\\mathbf{S}_\\text{w}+\\mathbf{S}_\\text{b}),\\quad J_1为各类之间的平均平方距离 \\\\\n", 56 | "J_2 &= \\text{tr}(\\mathbf{S}_\\text{w}^{-1}\\mathbf{S}_\\text{b}) \\\\\n", 57 | "J_3 &= \\ln\\frac{|\\mathbf{S}_\\text{b}|}{|\\mathbf{S}_\\text{w}|} \\\\\n", 58 | "J_4 &= \\frac{\\text{tr}\\mathbf{S}_\\text{b}}{\\text{tr}\\mathbf{S}_\\text{w}} \\\\\n", 59 | "J_5 &= \\frac{|\\mathbf{S}_\\text{b}-\\mathbf{S}_\\text{w}|}{|\\mathbf{S}_\\text{w}|}\n", 60 | "\\end{aligned}$$\n", 61 | "\n", 62 | "其中,$\\mathbf{S}_\\text{w}$和$\\mathbf{S}_\\text{b}$的估计值为\n", 63 | "\n", 64 | "$$\\begin{aligned}\n", 65 | "\\tilde{S}_\\text{w} &= \\sum_{i=1}^cP_i\\frac{1}{n_i}\\sum_{k=1}^{n_i}(\\mathbf{x}_k^{(i)}-\\mathbf{m}_i)(\\mathbf{x}_k^{(i)}-\\mathbf{m}_i)^\\text{T} \\\\\n", 66 | "\\tilde{S}_\\text{b} &= \\sum_{i=1}^cP_i(\\mathbf{m}_i-\\mathbf{m})(\\mathbf{m}_i-\\mathbf{m})^\\text{T}\n", 67 | "\\end{aligned}$$\n", 68 | "\n", 69 | "其中\n", 70 | "\n", 71 | "$$\\begin{aligned}\n", 72 | "\\mathbf{m}_i &= \\frac{1}{n_i}\\sum_{k=1}^{n_i}\\mathbf{x}_k^{(i)}\\\\\n", 73 | "\\mathbf{m} &= \\sum_{i=1}^{c}P_i\\mathbf{m}_i\\\\\n", 74 | "\\end{aligned}$$\n", 75 | "\n", 76 | "##### (2) 性质\n", 77 | "\n", 78 | "1. 很难从理论上建立起与分类错误率的联系。\n", 79 | "2. 两类样本分布有重叠时,不能反映重叠的情况。\n", 80 | "3. 各类样本分布协方差差别不大时,效果较好。\n", 81 | "\n", 82 | "#### 3. 基于概率分布的可分性判据\n", 83 | "\n", 84 | "##### (1) 概率距离度量的条件\n", 85 | "\n", 86 | "1. $J_P\\geqslant0$。\n", 87 | "2. 当两类完全不交叠($p(\\mathbf{x}|\\omega_1)$和$p(\\mathbf{x}|\\omega_2)$完全不交叠)时$J_P$取最大值。\n", 88 | "3. 当两类完全交叠($p(\\mathbf{x}|\\omega_1)=p(\\mathbf{x}|\\omega_2)$)时$J_P=0$.\n", 89 | "\n", 90 | "##### (2) 常用的概率距离度量\n", 91 | "\n", 92 | "1. Bhattacharyya距离:\n", 93 | "\n", 94 | "$$\n", 95 | "J_\\text{B}=-\\ln\\int[p(\\mathbf{x}|\\omega_1)p(\\mathbf{x}|\\omega_2)]^\\frac{1}{2}\\text{d}\\mathbf{x}\n", 96 | "$$\n", 97 | "\n", 98 | "理论错误率与Bhattacharyya距离的关系为\n", 99 | "\n", 100 | "$$\n", 101 | "P_\\text{e} \\leqslant [P(\\omega_1)P(\\omega_2)]^\\frac{1}{2}\\exp\\{-J_\\text{B}\\}\n", 102 | "$$\n", 103 | "\n", 104 | "2. Chernoff界限:\n", 105 | "\n", 106 | "$$\n", 107 | "J_\\text{C} = -\\ln\\int p^s(\\mathbf{x}|\\omega_1)p^{1-s}(\\mathbf{x}|\\omega_2)\\text{d}\\mathbf{x},\\quad s\\in[0,1]\n", 108 | "$$\n", 109 | "\n", 110 | "当$s=0.5$时,Chernoff界限与Bhattacharyya距离相同。\n", 111 | "\n", 112 | "3. 
散度:\n", 113 | "\n", 114 | "$$\n", 115 | "J_\\text{D}=\\int [p(\\mathbf{x}|\\omega_1)-p(\\mathbf{x}|\\omega_2)]\\ln\\frac{p(\\mathbf{x}|\\omega_1)}{p(\\mathbf{x}|\\omega_2)}\\text{d}\\mathbf{x}\n", 116 | "$$\n", 117 | "\n", 118 | "当两类样本都服从**正态分布**且**协方差矩阵相等**的情况下,散度为两类均值间的**Mahalanobis距离**,即\n", 119 | "\n", 120 | "$$\n", 121 | "J_\\text{D} = 8J_\\text{B} = (\\boldsymbol{\\mu}_1-\\boldsymbol{\\mu}_2)^\\text{T}\\boldsymbol{\\Sigma}^{-1}(\\boldsymbol{\\mu}_1-\\boldsymbol{\\mu}_2)\n", 122 | "$$\n", 123 | "\n", 124 | "4. 概率相关性判据:\n", 125 | "\n", 126 | "把上面三种判据的$p(\\mathbf{x}|\\omega_1)$换成$p(\\mathbf{x}|\\omega_i)$,$p(\\mathbf{x}|\\omega_2)$换成$p(\\mathbf{x})$。\n", 127 | "\n", 128 | "#### 4. 基于熵的可分性判据\n", 129 | "\n", 130 | "##### (1) 常用熵度量\n", 131 | "\n", 132 | "1. Shanno熵:\n", 133 | "\n", 134 | "$$\n", 135 | "H = -\\sum_{i=1}^cP(\\omega_i|\\mathbf{x})\\log_2P(\\omega_i|\\mathbf{x})\n", 136 | "$$\n", 137 | "\n", 138 | "2. 平方熵:\n", 139 | "\n", 140 | "$$\n", 141 | "H=2\\left(1-\\sum_{i=1}^cP^2(\\omega_i|\\mathbf{x})\\right)\n", 142 | "$$\n", 143 | "\n", 144 | "##### (2) 基于熵的可分性判据\n", 145 | "\n", 146 | "$$\n", 147 | "J_\\text{E}=\\int H(\\mathbf{x})p(\\mathbf{x})\\text{d}\\mathbf{x}\n", 148 | "$$\n", 149 | "\n", 150 | "##### (3) 性质\n", 151 | "\n", 152 | "$J_\\text{E}$越小,可分性越好。\n", 153 | "\n", 154 | "#### 5. 利用统计检验作为可分性判据\n", 155 | "\n", 156 | "##### (1) 统计检验的基本概念\n", 157 | "\n", 158 | "1. 空假设(null hypothesis):假设两类样本在所研究特征上不存在显著差异。\n", 159 | "2. 备择假设(alternative hypothesis):假设两类样本在所研究特征上存在显著差异。\n", 160 | "3. 空分布:空假设下统计量取值的分布。\n", 161 | "\n", 162 | "##### (2) $t$-检验\n", 163 | "\n", 164 | "1. 条件:\n", 165 | "\n", 166 | "两类样本都服从**正态分布**且**方差相同**。\n", 167 | "\n", 168 | "2. 统计量:\n", 169 | "\n", 170 | "假设两类样本分别有$m$个和$n$个,在考察特征上的值分别为$\\{x_i|i=1,\\cdots,n_1\\}$和$\\{y_i|i=1,\\cdots,n_2\\}$。其中$x_i\\sim N(\\mu_x,\\sigma^2)$,$y_i\\sim N(\\mu_y,\\sigma^2)$。\n", 171 | "\n", 172 | "则统计量为\n", 173 | "\n", 174 | "$$\n", 175 | "t=\\frac{\\bar{x}-\\bar{y}}{\\displaystyle s_\\text{P}\\sqrt{\\frac{1}{n_1}+\\frac{1}{n_2}}},\\quad t\\sim t(n_1+n_2-2)\n", 176 | "$$\n", 177 | "\n", 178 | "其中$s_\\text{p}^2$是总体样本方差\n", 179 | "\n", 180 | "$$\n", 181 | "s_\\text{p}^2=\\frac{(n_1-1)S_x^2+(n_2-1)S_y^2}{n_1+n_2-2}\n", 182 | "$$\n", 183 | "\n", 184 | "3. 空假设和备择假设:\n", 185 | "\n", 186 | "+ 双边$t$-检验:空假设为两类样本在考察特征上分布相同,即$\\mu_x=\\mu_y$。备择假设为$\\mu_x\\neq\\mu_y$。\n", 187 | "+ 单边$t$-检验:空假设为$\\mu_x\\leqslant\\mu_y$。备择假设为$\\mu_x>\\mu_y$。\n", 188 | "\n", 189 | "4. 拒绝域:\n", 190 | "\n", 191 | "+ 双边$t$-检验:$\\displaystyle |t|\\geqslant t_\\frac{\\alpha}{2}(m+n-2)$\n", 192 | "+ 单边$t$-检验:$\\displaystyle t\\geqslant t_\\alpha(m+n-2)$\n", 193 | "\n", 194 | "##### (3) Wilcoxon秩和检验(Mann-Whitney U检验)\n", 195 | "\n", 196 | "1. 步骤:\n", 197 | "\n", 198 | "+ 两类样本混合到一起,所有样本按照所考察的特征从小到大排序。\n", 199 | "+ 两类样本的**秩和**即为所得排序**序号之和**(如果特征取值相等,则并列采用中间的序号)。\n", 200 | "+ 考察两类样本秩和的差异。\n", 201 | "\n", 202 | "2. 统计量:\n", 203 | "\n", 204 | "某一类的秩和,比如第一类秩和$T_1$。\n", 205 | "\n", 206 | "3. 统计量$T_1$的分布:\n", 207 | "\n", 208 | "当$n_1$和$n_2$较大(比如都大于10)时,$T_1$近似服从正态分布,即\n", 209 | "\n", 210 | "$$\\begin{aligned}\n", 211 | "T_1 &\\sim N(\\mu_1,\\sigma_1^2)\\\\\n", 212 | "\\mu_1 &= \\frac{n_1(n_1+n_2+1)}{2} \\\\\n", 213 | "\\sigma_1^2 &= \\frac{n_1n_2(n_1+n_2+1)}{12}\n", 214 | "\\end{aligned}$$\n", 215 | "\n", 216 | "##### (4) 性质\n", 217 | "\n", 218 | "1. 样本服从正态分布情况下$t$-检验比秩和检验敏感性更好,但秩和检验没有对样本分布做假设,适用范围广。\n", 219 | "2. $t$-检验只检验分布均值,秩和检验同时受到**分布均值**和**分布形状**的影响。\n", 220 | "3. 统计检验方法通常只能检验单个特征,然后对特征进行排序,也叫**过滤法**(filtering method)。\n", 221 | "4. 
过滤准则与分类准则不一定有很好的联系。\n", 222 | "\n", 223 | "\n", 224 | "### 二、特征选择的最优算法\n", 225 | "\n", 226 | "#### 1. 思想\n", 227 | "\n", 228 | "分支界定法(branch and bound)。\n", 229 | "\n", 230 | "#### 2. 要求\n", 231 | "\n", 232 | "1. $D$个特征中选择$d$个最优的特征。\n", 233 | "2. 特征增多时判据值不会减小。\n", 234 | "\n", 235 | "#### 3. 实现方法\n", 236 | "\n", 237 | "##### (1) 树的形状\n", 238 | "\n", 239 | "1. 根节点为第$0$级,共$D-d$级。\n", 240 | "2. 每一级的节点在父节点基础上去掉一个特征,不出现相同组合的树枝和叶节点。\n", 241 | "3. 同一层中按照去掉**单个**特征后的准则函数值对结点进行排序,准则函数损失量最大的在左边。\n", 242 | "\n", 243 | "##### (2) 搜索策略\n", 244 | "\n", 245 | "1. 从右侧节点开始搜索,到达叶节点时更新准则函数值界限$B$。\n", 246 | "2. 搜索回溯中,遇到某一节点函数准则值小于$B$,则剪枝。\n", 247 | "\n", 248 | "#### 4. 性质\n", 249 | "\n", 250 | "$d$为$D$的一半时,分支界定法比穷举法节省的计算量最大。\n", 251 | "\n", 252 | "\n", 253 | "### 三、特征选择的次优算法\n", 254 | "\n", 255 | "#### 1. 单独最优特征的组合\n", 256 | "\n", 257 | "##### (1) 思想\n", 258 | "\n", 259 | "选取单个特征可分性判据值最大的$d$个特征。\n", 260 | "\n", 261 | "##### (2) 性质\n", 262 | "\n", 263 | "特征间**独立**并且所采用的判据时每个特征上的判据**之和**或**之积**时,这种方法选出来的是最优特征。\n", 264 | "\n", 265 | "#### 2. 顺序前进法(sequential forward selection,SFS)\n", 266 | "\n", 267 | "##### (1) 思想\n", 268 | "\n", 269 | "自底向上:第一个特征选取单独最优的特征,后面每一个特征都选择与已经入选的特征组合起来最优的特征。\n", 270 | "\n", 271 | "##### (2) 性质\n", 272 | "\n", 273 | "1. 计算量比单独最优特征组合大。\n", 274 | "2. 特征入选后无法剔除。\n", 275 | "\n", 276 | "#### 3. 顺序后退法(sequential backward selection,SBS)\n", 277 | "\n", 278 | "##### (1) 思想\n", 279 | "\n", 280 | "自顶向下:从所有特征开始逐一剔除特征,每次剔除使得剩余特征最优。\n", 281 | "\n", 282 | "##### (2) 性质\n", 283 | "\n", 284 | "1. 计算量比顺序前进法大。\n", 285 | "2. 特征剔除后无法加入。\n", 286 | "\n", 287 | "#### 4. 增$l$减$r$法\n", 288 | "\n", 289 | "##### (1) 思想\n", 290 | "\n", 291 | "1. 自底向上:$l>r$,先增再减。\n", 292 | "2. 自顶向下:$l np.ndarray:\n", 386 | " \"\"\"\n", 387 | " 计算m_i(样本均值)\n", 388 | " :param _X: 样本矩阵,每一行是一个样本\n", 389 | " :return: 均值(一维向量)\n", 390 | " \"\"\"\n", 391 | " return np.mean(_X, 0)\n", 392 | "\n", 393 | "\n", 394 | "def get_within_class_scatter_matrix(_X: np.ndarray) -> np.ndarray:\n", 395 | " \"\"\"\n", 396 | " 计算某一类的S_w_i(类内离散度矩阵)\n", 397 | " :param _X: 样本矩阵,每一行是一个样本\n", 398 | " :return: 类内离散度矩阵(D * D)\n", 399 | " \"\"\"\n", 400 | " ret = np.zeros((_X.shape[1], _X.shape[1]))\n", 401 | " m = get_mean(_X)\n", 402 | " for row in _X:\n", 403 | " ret += (row - m).reshape(m.shape[0], 1) @ (row - m).reshape(m.shape[0], 1).T\n", 404 | " return ret\n", 405 | "\n", 406 | "\n", 407 | "def get_pooled_within_class_scatter_matrix(_X1: np.ndarray, _X2: np.ndarray) -> np.ndarray:\n", 408 | " \"\"\"\n", 409 | " 计算最终的S_w(总类内离散度矩阵)\n", 410 | " :param _X1: 第一类样本矩阵,每一行是一个样本\n", 411 | " :param _X2: 第二类样本矩阵,每一行是一个样本\n", 412 | " :return: 总类内离散度矩阵(D * D)\n", 413 | " \"\"\"\n", 414 | " return get_within_class_scatter_matrix(_X1) + get_within_class_scatter_matrix(_X2) / (_X1.shape[0] + _X2.shape[0])\n", 415 | "\n", 416 | "def get_between_class_scatter_matrix(_X1: np.ndarray, _X2: np.ndarray) -> np.ndarray:\n", 417 | " \"\"\"\n", 418 | " 计算S_b(类间离散度矩阵)\n", 419 | " :param _X1: 第一类样本矩阵,每一行是一个样本\n", 420 | " :param _X2: 第二类样本矩阵,每一行是一个样本\n", 421 | " :return: 类间离散度矩阵(D * D)\n", 422 | " \"\"\"\n", 423 | " n1 = _X1.shape[0]\n", 424 | " n2 = _X2.shape[0]\n", 425 | " d = _X1.shape[1]\n", 426 | " N = n1 + n2\n", 427 | " m_1 = get_mean(_X1)\n", 428 | " m_2 = get_mean(_X2)\n", 429 | " m = get_mean(np.concatenate((_X1, _X2)))\n", 430 | " return n1 / N * ((m_1 - m).reshape(d, 1) @ (m_1 - m).reshape(d, 1).T) + \\\n", 431 | " n2 / N * ((m_2 - m).reshape(d, 1) @ (m_2 - m).reshape(d, 1).T)\n", 432 | "\n", 433 | "def get_W(_X1: np.ndarray, _X2: np.ndarray, d: int) -> np.ndarray:\n", 
434 | " \"\"\"\n", 435 | " 计算最终的W矩阵\n", 436 | " :param _X1: 第一类样本矩阵,每一行是一个样本\n", 437 | " :param _X2: 第二类样本矩阵,每一行是一个样本\n", 438 | " :param d: 最终要投影的特征维度数量\n", 439 | " :return: W矩阵\n", 440 | " \"\"\"\n", 441 | " S_w_inv_ = np.linalg.inv(get_pooled_within_class_scatter_matrix(_X1, _X2))\n", 442 | " S_b_ = get_between_class_scatter_matrix(_X1, _X2)\n", 443 | "\n", 444 | " # np.np.linalg.eig用来计算特征值和特征向量,返回的结果可能是复数\n", 445 | " eigen_values_, eigen_vectors_ = np.linalg.eig(S_w_inv_ @ S_b_)\n", 446 | "\n", 447 | " # 下面用来对特征值排序,并且记录特征值的序号\n", 448 | " eigen_temp_ = []\n", 449 | " for i in range(eigen_values_.shape[0]):\n", 450 | " # 找到是实数的特征值\n", 451 | " if eigen_values_[i].imag == 0:\n", 452 | " eigen_temp_.append([eigen_values_[i].real, i])\n", 453 | "\n", 454 | " eigen_temp_.sort(reverse=True)\n", 455 | "\n", 456 | " # 下面用来找出最大的d个实特征值的特征向量\n", 457 | " ret = []\n", 458 | " for i in range(d):\n", 459 | " ret.append(eigen_vectors_[:,eigen_temp_[i][1]].real)\n", 460 | " return np.array(ret)" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": 3, 466 | "metadata": {}, 467 | "outputs": [], 468 | "source": [ 469 | "# 加载数据,这次的数据用sklearn自带的水仙花的数据\n", 470 | "iris = load_iris()\n", 471 | "X = iris.data[:100]\n", 472 | "y = iris.target[:100]\n", 473 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", 474 | "\n", 475 | "# 寻找两类样本\n", 476 | "X1 = []\n", 477 | "X2 = []\n", 478 | "for i in range(X_train.shape[0]):\n", 479 | " if y_train[i] == 0:\n", 480 | " X1.append(X_train[i])\n", 481 | " else:\n", 482 | " X2.append(X_train[i])\n", 483 | "X1 = np.array(X1)\n", 484 | "X2 = np.array(X2)" 485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": 4, 490 | "metadata": {}, 491 | "outputs": [ 492 | { 493 | "name": "stdout", 494 | "output_type": "stream", 495 | "text": [ 496 | "W为:\n", 497 | "[[ 0.01869426 -0.2069536 0.92145113 0.32825073]]\n" 498 | ] 499 | } 500 | ], 501 | "source": [ 502 | "# 计算W, d设置为1,即投影后的新特征为1维\n", 503 | "W = get_W(X1, X2, 1)\n", 504 | "print(\"W为:\")\n", 505 | "print(W)" 506 | ] 507 | }, 508 | { 509 | "cell_type": "markdown", 510 | "metadata": {}, 511 | "source": [ 512 | "接下来生成降维后的数据,用支持向量机训练测试一下效果。" 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": 5, 518 | "metadata": {}, 519 | "outputs": [ 520 | { 521 | "name": "stdout", 522 | "output_type": "stream", 523 | "text": [ 524 | "降维之后的样本维度: 1\n", 525 | "降维之后的正确率: 1.0\n" 526 | ] 527 | } 528 | ], 529 | "source": [ 530 | "X_train_new = X_train @ W.T\n", 531 | "X_test_new = X_test @ W.T\n", 532 | "print(\"降维之后的样本维度:\", X_train_new.shape[1])\n", 533 | "\n", 534 | "svm_classifier = SVC(kernel=\"rbf\") # 用径向基函数作为核函数\n", 535 | "svm_classifier.fit(X_train_new, y_train)\n", 536 | "\n", 537 | "print(\"降维之后的正确率:\", svm_classifier.score(X_test_new, y_test))" 538 | ] 539 | }, 540 | { 541 | "cell_type": "markdown", 542 | "metadata": {}, 543 | "source": [ 544 | "测试结果表明,用基于分类可分性判据的特征提取方法,就算我们把样本从四维降成一维,还能保持100%的正确率。这说明这个降维方法是非常有效的。" 545 | ] 546 | }, 547 | { 548 | "cell_type": "markdown", 549 | "metadata": {}, 550 | "source": [ 551 | "### 二、主成分分析方法\n", 552 | "\n", 553 | "主成分分析是根据方差大小来对样本进行线性变换,最终实现对样本的降维。每一个主成分都是样本协方差矩阵的一个特征值。\n", 554 | "\n", 555 | "主成分分析的公式推导并不复杂,书上介绍的也很清楚。\n", 556 | "\n", 557 | "主要的问题在于主成分分析没有考虑样本分类信息,只是按照样本方差来处理数据,这样的方法只是把样本降维了,但是不一定对分类有利。\n", 558 | "\n", 559 | "主成分分析实际上是K-L变换的一种特殊情况,下面会实现K-L变换,因此这里我就不写代码实现了。" 560 | ] 561 | }, 562 | { 563 | "cell_type": "markdown", 564 | "metadata": {}, 565 | "source": [ 566 | "### 三、K-L变换\n", 567 | "\n", 568 | 
"原始的K-L变换的根据最小均方误差来进行线性变换,如果样本均值为$0$,那么K-L变换等价于主成分分析。\n", 569 | "\n", 570 | "为了应用于模式识别,需要考虑到样本的分类信息,K-L变换引入了新的方法。\n", 571 | "\n", 572 | "#### 1. 从类均值中提取判别信息\n", 573 | "\n", 574 | "这个方法是假设样本的主要分类信息包含在均值中,通过总类内离散度矩阵$\\mathbf{S}_{\\text{w}}$作为产生矩阵进行K-L变换。然后在根据可分性判据大小得出新特征的转换矩阵。\n", 575 | "\n", 576 | "具体方法如下:\n", 577 | "\n", 578 | "1. 计算总类内离散度矩阵$\\mathbf{S}_{\\text{w}}$和类间离散度矩阵$\\mathbf{S}_{\\text{b}}$。\n", 579 | "2. 用$\\mathbf{S}_{\\text{w}}$进行K-L变换,得到特征值$\\lambda_i$和对应的特征向量$\\mathbf{u}_i$(一个特征维度对应一个特征值和特征向量)。\n", 580 | "3. 通过特征值和特征向量计算样本每个维度的可分性判据大小,公式如下:\n", 581 | "\n", 582 | "$$\n", 583 | "J(y_i) = \\frac{\\mathbf{u}_i^\\text{T}\\mathbf{S}_{\\text{b}}\\mathbf{u}_i}{\\lambda_i}\n", 584 | "$$\n", 585 | "\n", 586 | "4. 找出$J(y_i)$最大的$d$个(d是要降低的目标维数)特征向量$\\mathbf{u}_1,\\mathbf{u}_2,\\cdots, \\mathbf{u}_d$,转换矩阵就是$\\mathbf{U} = [\\mathbf{u}_1,\\mathbf{u}_2,\\cdots, \\mathbf{u}_d]$,降维后的样本就是$\\mathbf{y} = \\mathbf{U}^\\text{T}\\mathbf{x}$。\n", 587 | "\n", 588 | "算法实现如下:" 589 | ] 590 | }, 591 | { 592 | "cell_type": "code", 593 | "execution_count": 6, 594 | "metadata": {}, 595 | "outputs": [ 596 | { 597 | "name": "stdout", 598 | "output_type": "stream", 599 | "text": [ 600 | "可分性判据J数值如下: [7.77946014e-03 9.29616393e-01 1.30417535e+00 1.23736680e-03]\n" 601 | ] 602 | } 603 | ], 604 | "source": [ 605 | "# 数据在上面已经生成过了,这里直接用上面生成的数据\n", 606 | "S_w = get_pooled_within_class_scatter_matrix(X1, X2)\n", 607 | "S_b = get_between_class_scatter_matrix(X1, X2)\n", 608 | "\n", 609 | "# 注意numpy生成的特征向量矩阵中,每一列是一个特征向量\n", 610 | "eigen_values, eigen_vectors = np.linalg.eig(S_w)\n", 611 | "\n", 612 | "# np.diag把方阵的对角元素提取出来,返回一个向量\n", 613 | "J = np.diag(eigen_vectors.T @ S_b @ eigen_vectors) / eigen_values\n", 614 | "print(\"可分性判据J数值如下:\", J)" 615 | ] 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": 7, 620 | "metadata": {}, 621 | "outputs": [ 622 | { 623 | "name": "stdout", 624 | "output_type": "stream", 625 | "text": [ 626 | "我们要把样本降维成一维,所以转换矩阵U退化成一个向量,U的值为:\n", 627 | "[ 0.42591017 -0.17061936 -0.82024461 -0.34159675]\n" 628 | ] 629 | } 630 | ], 631 | "source": [ 632 | "# 接下来找出J最大的d个特征向量,生成转换矩阵U\n", 633 | "# 这个例子中我们把原来4维的样本降成1维,所以d=1,那么只需要找出最大的J对应的特征向量即可\n", 634 | "\n", 635 | "maxn = -999999\n", 636 | "max_index = -1\n", 637 | "for i in range(J.shape[0]):\n", 638 | " if J[i] > maxn:\n", 639 | " maxn = J[i]\n", 640 | " max_index = i\n", 641 | "\n", 642 | "U = eigen_vectors[:, max_index].reshape(J.shape[0])\n", 643 | "print(\"我们要把样本降维成一维,所以转换矩阵U退化成一个向量,U的值为:\")\n", 644 | "print(U)" 645 | ] 646 | }, 647 | { 648 | "cell_type": "markdown", 649 | "metadata": {}, 650 | "source": [ 651 | "接下来把生成降维后的样本,再用SVM测试一下。" 652 | ] 653 | }, 654 | { 655 | "cell_type": "code", 656 | "execution_count": 8, 657 | "metadata": {}, 658 | "outputs": [ 659 | { 660 | "name": "stdout", 661 | "output_type": "stream", 662 | "text": [ 663 | "降维之后的样本维度: 1\n", 664 | "降维之后的正确率: 1.0\n" 665 | ] 666 | } 667 | ], 668 | "source": [ 669 | "X_train_new = X_train @ U.reshape(U.shape[0], 1)\n", 670 | "X_test_new = X_test @ U.reshape(U.shape[0], 1)\n", 671 | "print(\"降维之后的样本维度:\", X_train_new.shape[1])\n", 672 | "\n", 673 | "svm_classifier = SVC(kernel=\"rbf\") # 用径向基函数作为核函数\n", 674 | "svm_classifier.fit(X_train_new, y_train)\n", 675 | "\n", 676 | "print(\"降维之后的正确率:\", svm_classifier.score(X_test_new, y_test))" 677 | ] 678 | }, 679 | { 680 | "cell_type": "markdown", 681 | "metadata": {}, 682 | "source": [ 683 | "显然,从类均值中提取判别信息的降维效果也很好。" 684 | ] 685 | }, 686 | { 687 | "cell_type": "markdown", 688 | "metadata": {}, 689 | "source": [ 690 | "#### 2. 
包含在类平均向量中判别信息的最优压缩\n", 691 | "\n", 692 | "这个方法是使特征间互不相关的前提下最优压缩均值向量中包含的分类信息。\n", 693 | "\n", 694 | "具体方法如下:\n", 695 | "\n", 696 | "1. 求出$\\mathbf{S}_{\\text{w}}$和$\\mathbf{S}_{\\text{b}}$\n", 697 | "2. 对$\\mathbf{S}_{\\text{w}}$做K-L变换,求出特征变换矩阵$\\mathbf{U}$\n", 698 | "3. 令$\\mathbf{B}=\\mathbf{U}\\boldsymbol{\\Lambda}^{-\\frac{1}{2}}$,这个时候$\\mathbf{B}^\\text{T}\\mathbf{S}_{\\text{w}}\\mathbf{B} = \\mathbf{I}$,这一步称为白化变换\n", 699 | "4. 求$\\mathbf{S}_{\\text{b}}' = \\mathbf{B}^\\text{T}\\mathbf{S}_{\\text{b}}\\mathbf{B}$\n", 700 | "5. 对$\\mathbf{S}_{\\text{b}}'$进行K-L变换,求出特征变换矩阵$\\mathbf{V}'$,其中$\\mathbf{V}'$中只包含$\\mathbf{S}_{\\text{b}}'$中非零的实特征值对应的特征向量($c$类分类问题最多有$c-1$个)\n", 701 | "6. 最终总的变换矩阵是$\\mathbf{W} = \\mathbf{U}\\boldsymbol{\\Lambda}^{-\\frac{1}{2}}\\mathbf{V}'$\n", 702 | "\n", 703 | "最终得到的投影方向实际上就是Fisher线性判别的投影方向。\n", 704 | "\n", 705 | "这个方法没有指定到底降维成多少维,我们测试一下实际的降维效果。" 706 | ] 707 | }, 708 | { 709 | "cell_type": "code", 710 | "execution_count": 9, 711 | "metadata": {}, 712 | "outputs": [ 713 | { 714 | "name": "stdout", 715 | "output_type": "stream", 716 | "text": [ 717 | "最终计算出的投影方向W为:\n", 718 | "[[ 0.11277732 -0.23181291 -0.68124047 -0.24026011]\n", 719 | " [ 0.36302409 0.17933398 0.50943862 0.09630243]\n", 720 | " [-0.33659659 -0.74372745 0.51266842 -0.42787552]\n", 721 | " [-0.11711099 -0.28817791 0.05558638 1.52504861]]\n" 722 | ] 723 | } 724 | ], 725 | "source": [ 726 | "# 注意下面这段代码所有的矩阵都是公式中的矩阵,而不像之前一样是公式中矩阵的转置\n", 727 | "\n", 728 | "S_w = get_pooled_within_class_scatter_matrix(X1, X2)\n", 729 | "S_b = get_between_class_scatter_matrix(X1, X2)\n", 730 | "eigen_values, eigen_vectors = np.linalg.eig(S_w)\n", 731 | "U = eigen_vectors\n", 732 | "Lambda_of_neg_1_2 = np.diag(eigen_values ** (-1 / 2))\n", 733 | "B = U @ np.linalg.inv(Lambda_of_neg_1_2)\n", 734 | "S_b_1 = B.T @ S_b @ B\n", 735 | "eigen_values, eigen_vectors = np.linalg.eig(S_b_1)\n", 736 | "V_1 = []\n", 737 | "for i in range(eigen_values.shape[0]):\n", 738 | " if eigen_values[i] != 0:\n", 739 | " V_1.append(eigen_vectors[:, i].reshape(eigen_vectors.shape[0]))\n", 740 | "V_1 = np.array(V_1).T\n", 741 | "W = U @ Lambda_of_neg_1_2 @ V_1\n", 742 | "print(\"最终计算出的投影方向W为:\")\n", 743 | "print(W)" 744 | ] 745 | }, 746 | { 747 | "cell_type": "markdown", 748 | "metadata": {}, 749 | "source": [ 750 | "显然最后样本被降维成三维。这个降维可以理解为保留了样本几乎全部信息,虽然最终维度比较高,但是分类信息保留全面。" 751 | ] 752 | }, 753 | { 754 | "cell_type": "markdown", 755 | "metadata": {}, 756 | "source": [ 757 | "#### 3. 类中心化特征向量中分类信息的提取\n", 758 | "\n", 759 | "这个方法时考虑从样本的各类协方差中提取信息,具体方式是:先对$\\mathbf{S}_{\\text{w}}$进行K-L变换,然后在根据新样本的各个维度方差的大小来计算可分性判据$J(x_j)$。最后取$J(x_j)$最小的前d个特征。\n", 760 | "\n", 761 | "这个方法可以和前面介绍的第二种方法结合起来,先用第二种进行均值分类信息最优压缩,压缩后获得$d' \\leqslant c-1$个特征,再用第三种方法压缩获得另外$d - d'$个特征。\n", 762 | "\n", 763 | "算法的实现也和前面的大同小异,这里我就不实现了。" 764 | ] 765 | }, 766 | { 767 | "cell_type": "markdown", 768 | "metadata": {}, 769 | "source": [ 770 | "### 四、多维尺度法(MDS)\n", 771 | "\n", 772 | "MDS运用了缩放的思想,把高维空间的样本按比例缩放到低维空间,书上用地图的例子很好的解释了这个思想。\n", 773 | "\n", 774 | "#### 1. 古典尺度法\n", 775 | "\n", 776 | "古典尺度法就是已知各样本间的欧式距离,求他们在指定维度下的坐标。\n", 777 | "\n", 778 | "实现方法如下:\n", 779 | "\n", 780 | "1. 定义中心化矩阵$\\mathbf{J} = \\mathbf{I} - \\frac{1}{n}\\mathbf{11}^\\text{T}$\n", 781 | "2. 计算欧氏距离矩阵$\\mathbf{D}^{(2)}$\n", 782 | "3. 求样本两两内积组成的矩阵$\\mathbf{B} = -\\frac{1}{2}\\mathbf{JDJ}$\n", 783 | "4. 求$\\mathbf{B}$的特征值$\\lambda_1,\\lambda_2,\\cdots,\\lambda_d$和特征向量$\\mathbf{u}_1,\\mathbf{u}_2,\\cdots,\\mathbf{u}_d$\n", 784 | "5. 
假设要将样本降维到$k$维,则将特征值从大到小排序,取前$k$大的特征值对应的特征向量组成矩阵$\\mathbf{U}=[\\mathbf{u}_1,\\mathbf{u}_2,\\cdots,\\mathbf{u}_k]$\n", 785 | "6. 解出样本坐标矩阵$\\mathbf{X}=\\mathbf{U}\\boldsymbol{\\Lambda}^{1/2}$\n", 786 | "7. 将样本从中心化还原,$\\widetilde{\\mathbf{x}}_i = \\mathbf{x}_i + \\overline{\\mathbf{x}}$\n", 787 | "\n", 788 | "这个方法其实和主成分分析是相同的,代码实现如下:" 789 | ] 790 | }, 791 | { 792 | "cell_type": "code", 793 | "execution_count": 10, 794 | "metadata": { 795 | "scrolled": false 796 | }, 797 | "outputs": [ 798 | { 799 | "name": "stdout", 800 | "output_type": "stream", 801 | "text": [ 802 | "[-2.27826921e-08 -3.44144050e-08 -1.75140000e-08 2.28469080e-09\n", 803 | " -1.67971705e-08 5.34840161e-08 3.43430899e-09 2.98690681e-08\n", 804 | " -2.13786560e-08 1.63238719e-08 4.11062135e-08 -1.98371979e-08\n", 805 | " -2.90794537e-08 8.90252228e-09 2.37053957e-09 1.17263691e-08\n", 806 | " 2.07078762e-08 -9.75305242e-10 3.20626111e-08 2.71718079e-08\n", 807 | " 3.33402252e-08 1.26129911e-08 -3.12520589e-08 -2.70509339e-09\n", 808 | " -4.24973096e-08 -1.35221505e-08 7.88039006e-09 -2.87083186e-09\n", 809 | " -3.48658361e-08 -8.55255398e-09 -9.89458731e-09 2.34282814e-08\n", 810 | " 1.08033115e-08 -3.47947733e-08 -5.32301673e-09 -5.81293064e-09\n", 811 | " -2.45663341e-08 -2.86082110e-08 -4.12486016e-08 -8.36352275e-09\n", 812 | " -1.87246325e-08 -1.01729126e-08 -3.10669694e-08 -1.81945259e-08\n", 813 | " -1.08101679e-08 -3.65419330e-08 -1.08827192e-08 -4.30372786e-08\n", 814 | " 4.86084062e-09 -2.70163812e-08 4.15228906e-08 -6.41982072e-09\n", 815 | " -1.74987220e-08 2.09008665e-08 -2.50100769e-08 -8.61789227e-09\n", 816 | " -5.92910143e-09 -4.81297694e-09 -6.88161834e-09 4.16202265e-08\n", 817 | " 1.64202060e-09 1.42302506e-08 5.53318913e-08 -1.16508775e-08\n", 818 | " 3.62741257e-08 -6.47619201e-08 -1.11632165e-08 2.66905554e-08\n", 819 | " 2.05870468e-09 -1.41332411e-09 -2.92467205e-08 5.70420669e-09\n", 820 | " -2.31138382e-08 1.22018194e-08 -4.34116542e-08 -6.20514330e-08\n", 821 | " -4.79626460e-08 -1.99943528e-08 -3.56117949e-08 1.10452149e-09\n", 822 | " -2.38777311e-08 2.02512017e-08 -1.46882984e-08 -1.09512754e-08\n", 823 | " -3.65902431e-08 -2.46556440e-08 -1.87207383e-08 -2.54257930e-08\n", 824 | " 7.59315658e-10 1.32360475e-10 -1.79835738e-08 -8.77635709e-09\n", 825 | " 1.72098331e-08 6.08651473e-09 1.19661514e-08 4.29211566e-08\n", 826 | " 3.15975935e-08 5.21455748e-09 1.43383440e-08 1.35625299e-08]\n" 827 | ] 828 | } 829 | ], 830 | "source": [ 831 | "n = X.shape[0]\n", 832 | "J = np.eye(n) - 1 / n * np.ones(n)\n", 833 | "D = X @ X.T\n", 834 | "B = -1 / 2 * J @ D @ J\n", 835 | "eigen_values, eigen_vectors = np.linalg.eig(B)\n", 836 | "\n", 837 | "# 下面用来对特征值排序,并且记录特征值的序号\n", 838 | "eigen_temp = []\n", 839 | "for i in range(eigen_values.shape[0]):\n", 840 | " # 找到是实数的特征值\n", 841 | " if eigen_values[i].imag == 0:\n", 842 | " eigen_temp.append([eigen_values[i].real, i])\n", 843 | "eigen_temp.sort(reverse=True)\n", 844 | "\n", 845 | "# 下面用来找出最大的d个实特征值的特征向量,假设d=1\n", 846 | "U = []\n", 847 | "Lambda_of_1_2 = []\n", 848 | "for i in range(1):\n", 849 | " U.append(eigen_vectors[:,eigen_temp[i][1]].real.T)\n", 850 | " Lambda_of_1_2.append(np.sqrt(eigen_temp[i][0].real))\n", 851 | "U = np.array(U).T\n", 852 | "Lambda_of_1_2 = np.array(Lambda_of_1_2)\n", 853 | "X_new = U @ Lambda_of_1_2\n", 854 | "X_new = X_new + get_mean(X_new)\n", 855 | "print(X_new)" 856 | ] 857 | }, 858 | { 859 | "cell_type": "markdown", 860 | "metadata": {}, 861 | "source": [ 862 | "这个方法我就不测试了,和前面的都一样。" 863 | ] 864 | }, 865 | { 866 | "cell_type": 
"markdown", 867 | "metadata": {}, 868 | "source": [ 869 | "#### 2. 度量型MSD和非度量型MSD\n", 870 | "\n", 871 | "这两个书上都只是简单的介绍了一下,看一下就行。这类算法的特点是实际实现没有解析解,要使用梯度下降来计算。" 872 | ] 873 | }, 874 | { 875 | "cell_type": "markdown", 876 | "metadata": {}, 877 | "source": [ 878 | "### 三、核主成分分析\n", 879 | "\n", 880 | "核主成分分析的思想是通过核函数把样本投影到高维,这样原来非线性的规律也变成了线性的,再通过进行线性的主成分分析。\n", 881 | "\n", 882 | "具体算法为:\n", 883 | "\n", 884 | "1. 计算核函数矩阵$\\mathbf{K}=\\{k(\\mathbf{x}_i,\\mathbf{x}_j)\\}_{n\\times n}$。\n", 885 | "2. 对$\\mathbf{K}$进行特征值分解得到第$l$大的特征值的特征向量$\\boldsymbol\\alpha^l = [\\alpha_1^l,\\alpha_2^l,\\cdots,\\alpha_n^l]^\\text{T}$。\n", 886 | "3. 计算出样本在第$l$非线性主成分方向的投影(新特征)$\\displaystyle z^l(\\mathbf{x})=\\sum_{i=1}^n\\alpha_i^lk(\\mathbf{x}_i,\\mathbf{x})$,保留前$k$个。\n", 887 | "\n", 888 | "需要注意的是由于引入了非线性变换,特征值分解得到的非零特征值可能会超过样本原来的维数。\n", 889 | "\n", 890 | "代码实现如下:" 891 | ] 892 | }, 893 | { 894 | "cell_type": "code", 895 | "execution_count": 19, 896 | "metadata": {}, 897 | "outputs": [], 898 | "source": [ 899 | "def rbf(_x1: np.ndarray, _x2: np.ndarray, sigma2: float) -> float:\n", 900 | " \"\"\"\n", 901 | " 径向基函数\n", 902 | " :param _x1: 第一个向量\n", 903 | " :param _x2: 第二个向量\n", 904 | " :param sigma2: rbf的sigma参数的平方\n", 905 | " :return: 函数值\n", 906 | " \"\"\"\n", 907 | " return np.exp(-np.sum((_x1 - _x2) ** 2) / sigma2)\n", 908 | "\n", 909 | "def kpca(_X: np.ndarray, k: int, kernel) -> np.ndarray:\n", 910 | " \"\"\"\n", 911 | " KPCA函数\n", 912 | " :param _X: 样本集\n", 913 | " :param k: 要保留的非线性主成分数量\n", 914 | " :param kernel: 使用的核函数\n", 915 | " :return: 投影后的新样本\n", 916 | " \"\"\"\n", 917 | " sigma2 = 1 / (_X.shape[1] * X.var()) # sigma平方用这个公式进行估计(sklearn中默认也是用这个算法)\n", 918 | "\n", 919 | " # 1. 计算核函数矩阵K\n", 920 | " K = np.zeros((n, n))\n", 921 | " for i in range(n):\n", 922 | " for j in range(n):\n", 923 | " K[i][j] = kernel(_X[i], _X[j], sigma2)\n", 924 | "\n", 925 | " # 2. 通过特征值分解计算alpha,并且只保留特征值最大的k个非线性主成分\n", 926 | " eigen_values, eigen_vectors = np.linalg.eig(K)\n", 927 | " eigen_temp = []\n", 928 | " for i, eigen_value in enumerate(eigen_values):\n", 929 | " eigen_temp.append([eigen_value, i])\n", 930 | " eigen_temp.sort()\n", 931 | " Alpha = np.zeros((k, n))\n", 932 | " for i in range(k):\n", 933 | " Alpha[i] = eigen_vectors[eigen_temp[i][1]]\n", 934 | "\n", 935 | " # 3. 
计算样本在非线性主成分方向上的投影\n", 936 | " X_new = np.zeros((n, k))\n", 937 | " for i, x in enumerate(_X):\n", 938 | " for l in range(k):\n", 939 | " for j in range(n):\n", 940 | " X_new[i][l] += Alpha[l][j] * kernel(x, _X[j], sigma2)\n", 941 | "\n", 942 | " return X_new" 943 | ] 944 | }, 945 | { 946 | "cell_type": "markdown", 947 | "metadata": {}, 948 | "source": [ 949 | "接下来测试一下保留不同数量非线性主成分时的正确率(为了方便起见我就不分训练集测试集了,直接拿全部数据训练核测试)" 950 | ] 951 | }, 952 | { 953 | "cell_type": "code", 954 | "execution_count": 23, 955 | "metadata": {}, 956 | "outputs": [ 957 | { 958 | "name": "stdout", 959 | "output_type": "stream", 960 | "text": [ 961 | "降维之后的样本维度:4 降维之后的正确率为:0.77\n", 962 | "降维之后的样本维度:20 降维之后的正确率为:0.95\n", 963 | "降维之后的样本维度:40 降维之后的正确率为:0.99\n" 964 | ] 965 | } 966 | ], 967 | "source": [ 968 | "svm_classifier = SVC(kernel='rbf')\n", 969 | "X_new = kpca(X, 4, rbf)\n", 970 | "svm_classifier.fit(X_new, y)\n", 971 | "print(\"降维之后的样本维度:\", X_new.shape[1], \" 降维之后的正确率为:\", svm_classifier.score(X_new, y), sep='')\n", 972 | "X_new = kpca(X, 20, rbf)\n", 973 | "svm_classifier.fit(X_new, y)\n", 974 | "print(\"降维之后的样本维度:\", X_new.shape[1], \" 降维之后的正确率为:\", svm_classifier.score(X_new, y), sep='')\n", 975 | "X_new = kpca(X, 40, rbf)\n", 976 | "svm_classifier.fit(X_new, y)\n", 977 | "print(\"降维之后的样本维度:\", X_new.shape[1], \" 降维之后的正确率为:\", svm_classifier.score(X_new, y), sep='')" 978 | ] 979 | }, 980 | { 981 | "cell_type": "markdown", 982 | "metadata": {}, 983 | "source": [ 984 | "可以看到KPCA方法的效果在水仙花数据上的表现可以说非常差。\n", 985 | "\n", 986 | "原始数据只有4维,但是用KPCA进行特征值分解的得到的特征值数量是和样本数量一样多的,这个数量远远大于4。\n", 987 | "\n", 988 | "所以就算“降维”(实际上是升维了)到40个维度,分类正确率才勉强到达99%。\n", 989 | "\n", 990 | "这说明KPCA只适用于样本数量少而原始特征维度高的数据。" 991 | ] 992 | }, 993 | { 994 | "cell_type": "markdown", 995 | "metadata": {}, 996 | "source": [ 997 | "### 四、IsoMap方法和LLE方法\n", 998 | "\n", 999 | "IsoMap方法把相邻样本看作相连的节点,边长就是欧氏距离,然后算出每两个样本之间的最短路作为距离度量。\n", 1000 | "\n", 1001 | "LLE主要思想是采取分治,把数据切割成一个一个小块,对每个小块求平均抽象成新的样本,然后再进行映射。\n", 1002 | "\n", 1003 | "书上只是简单介绍了一下思想,了解一下就行。" 1004 | ] 1005 | } 1006 | ], 1007 | "metadata": { 1008 | "kernelspec": { 1009 | "display_name": "Python 3 (ipykernel)", 1010 | "language": "python", 1011 | "name": "python3" 1012 | }, 1013 | "language_info": { 1014 | "codemirror_mode": { 1015 | "name": "ipython", 1016 | "version": 3 1017 | }, 1018 | "file_extension": ".py", 1019 | "mimetype": "text/x-python", 1020 | "name": "python", 1021 | "nbconvert_exporter": "python", 1022 | "pygments_lexer": "ipython3", 1023 | "version": "3.9.16" 1024 | } 1025 | }, 1026 | "nbformat": 4, 1027 | "nbformat_minor": 1 1028 | } 1029 | -------------------------------------------------------------------------------- /资料/7.Practice lab decision trees/C2_W4_Decision_Tree_with_Markdown.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Practice Lab: Decision Trees\n", 8 | "\n", 9 | "In this exercise, you will implement a decision tree from scratch and apply it to the task of classifying whether a mushroom is edible or poisonous.\n", 10 | "\n", 11 | "# Outline\n", 12 | "- [ 1 - Packages ](#1)\n", 13 | "- [ 2 - Problem Statement](#2)\n", 14 | "- [ 3 - Dataset](#3)\n", 15 | " - [ 3.1 One hot encoded dataset](#3.1)\n", 16 | "- [ 4 - Decision Tree Refresher](#4)\n", 17 | " - [ 4.1 Calculate entropy](#4.1)\n", 18 | " - [ Exercise 1](#ex01)\n", 19 | " - [ 4.2 Split dataset](#4.2)\n", 20 | " - [ Exercise 2](#ex02)\n", 21 | " - [ 4.3 Calculate information 
gain](#4.3)\n", 22 | " - [ Exercise 3](#ex03)\n", 23 | " - [ 4.4 Get best split](#4.4)\n", 24 | " - [ Exercise 4](#ex04)\n", 25 | "- [ 5 - Building the tree](#5)\n" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "\n", 33 | "## 1 - Packages \n", 34 | "\n", 35 | "First, let's run the cell below to import all the packages that you will need during this assignment.\n", 36 | "- [numpy](www.numpy.org) is the fundamental package for working with matrices in Python.\n", 37 | "- [matplotlib](http://matplotlib.org) is a famous library to plot graphs in Python.\n", 38 | "- ``utils.py`` contains helper functions for this assignment. You do not need to modify code in this file.\n" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "import numpy as np\n", 48 | "import matplotlib.pyplot as plt\n", 49 | "from public_tests import *\n", 50 | "\n", 51 | "%matplotlib inline" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "\n", 59 | "## 2 - Problem Statement\n", 60 | "\n", 61 | "Suppose you are starting a company that grows and sells wild mushrooms. \n", 62 | "- Since not all mushrooms are edible, you'd like to be able to tell whether a given mushroom is edible or poisonous based on it's physical attributes\n", 63 | "- You have some existing data that you can use for this task. \n", 64 | "\n", 65 | "Can you use the data to help you identify which mushrooms can be sold safely? \n", 66 | "\n", 67 | "Note: The dataset used is for illustrative purposes only. It is not meant to be a guide on identifying edible mushrooms.\n", 68 | "\n", 69 | "\n", 70 | "\n", 71 | "\n", 72 | "## 3 - Dataset\n", 73 | "\n", 74 | "You will start by loading the dataset for this task. The dataset you have collected is as follows:\n", 75 | "\n", 76 | "| Cap Color | Stalk Shape | Solitary | Edible |\n", 77 | "|:---------:|:-----------:|:--------:|:------:|\n", 78 | "| Brown | Tapering | Yes | 1 |\n", 79 | "| Brown | Enlarging | Yes | 1 |\n", 80 | "| Brown | Enlarging | No | 0 |\n", 81 | "| Brown | Enlarging | No | 0 |\n", 82 | "| Brown | Tapering | Yes | 1 |\n", 83 | "| Red | Tapering | Yes | 0 |\n", 84 | "| Red | Enlarging | No | 0 |\n", 85 | "| Brown | Enlarging | Yes | 1 |\n", 86 | "| Red | Tapering | No | 1 |\n", 87 | "| Brown | Enlarging | No | 0 |\n", 88 | "\n", 89 | "\n", 90 | "- You have 10 examples of mushrooms. 
For each example, you have\n", 91 | " - Three features\n", 92 | " - Cap Color (`Brown` or `Red`),\n", 93 | " - Stalk Shape (`Tapering` or `Enlarging`), and\n", 94 | " - Solitary (`Yes` or `No`)\n", 95 | " - Label\n", 96 | " - Edible (`1` indicating yes or `0` indicating poisonous)\n", 97 | "\n", 98 | "\n", 99 | "### 3.1 One hot encoded dataset\n", 100 | "For ease of implementation, we have one-hot encoded the features (turned them into 0 or 1 valued features)\n", 101 | "\n", 102 | "| Brown Cap | Tapering Stalk Shape | Solitary | Edible |\n", 103 | "|:---------:|:--------------------:|:--------:|:------:|\n", 104 | "| 1 | 1 | 1 | 1 |\n", 105 | "| 1 | 0 | 1 | 1 |\n", 106 | "| 1 | 0 | 0 | 0 |\n", 107 | "| 1 | 0 | 0 | 0 |\n", 108 | "| 1 | 1 | 1 | 1 |\n", 109 | "| 0 | 1 | 1 | 0 |\n", 110 | "| 0 | 0 | 0 | 0 |\n", 111 | "| 1 | 0 | 1 | 1 |\n", 112 | "| 0 | 1 | 0 | 1 |\n", 113 | "| 1 | 0 | 0 | 0 |\n", 114 | "\n", 115 | "Therefore,\n", 116 | "- `X_train` contains three features for each example \n", 117 | " - Brown Color (A value of `1` indicates \"Brown\" cap color and `0` indicates \"Red\" cap color)\n", 118 | " - Tapering Shape (A value of `1` indicates \"Tapering Stalk Shape\" and `0` indicates \"Enlarging\" stalk shape)\n", 119 | " - Solitary (A value of `1` indicates \"Yes\" and `0` indicates \"No\")\n", 120 | "\n", 121 | "- `y_train` is whether the mushroom is edible \n", 122 | " - `y = 1` indicates edible\n", 123 | " - `y = 0` indicates poisonous" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": null, 129 | "metadata": {}, 130 | "outputs": [], 131 | "source": [ 132 | "X_train = np.array([[1,1,1],[1,0,1],[1,0,0],[1,0,0],[1,1,1],[0,1,1],[0,0,0],[1,0,1],[0,1,0],[1,0,0]])\n", 133 | "y_train = np.array([1,1,0,0,1,0,0,1,1,0])" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "#### View the variables\n", 141 | "Let's get more familiar with your dataset. \n", 142 | "- A good place to start is to just print out each variable and see what it contains.\n", 143 | "\n", 144 | "The code below prints the first few elements of `X_train` and the type of the variable." 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "print(\"First few elements of X_train:\\n\", X_train[:5])\n", 154 | "print(\"Type of X_train:\",type(X_train))" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "Now, let's do the same for `y_train`" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "print(\"First few elements of y_train:\", y_train[:5])\n", 171 | "print(\"Type of y_train:\",type(y_train))" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "#### Check the dimensions of your variables\n", 179 | "\n", 180 | "Another useful way to get familiar with your data is to view its dimensions.\n", 181 | "\n", 182 | "Please print the shape of `X_train` and `y_train` and see how many training examples you have in your dataset." 
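, "\n", "As an optional aside (illustrative only, not graded), the `Brown Cap` column of the one-hot table can be reproduced from the raw cap-color values; with 10 examples and 3 binary features you should also expect `X_train.shape` to be `(10, 3)` and `y_train.shape` to be `(10,)`.\n", "\n", "```python\n", "# Raw \"Cap Color\" column, copied from the table in Section 3\n", "cap_color = [\"Brown\", \"Brown\", \"Brown\", \"Brown\", \"Brown\", \"Red\", \"Red\", \"Brown\", \"Red\", \"Brown\"]\n", "\n", "# One-hot encode: 1 for \"Brown\", 0 for \"Red\"\n", "brown_cap = [1 if c == \"Brown\" else 0 for c in cap_color]\n", "print(brown_cap)  # [1, 1, 1, 1, 1, 0, 0, 1, 0, 1] -- matches the first column of X_train\n", "```"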
183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": {}, 189 | "outputs": [], 190 | "source": [ 191 | "print ('The shape of X_train is:', X_train.shape)\n", 192 | "print ('The shape of y_train is: ', y_train.shape)\n", 193 | "print ('Number of training examples (m):', len(X_train))" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "\n", 201 | "## 4 - Decision Tree Refresher\n", 202 | "\n", 203 | "In this practice lab, you will build a decision tree based on the dataset provided.\n", 204 | "\n", 205 | "- Recall that the steps for building a decision tree are as follows:\n", 206 | " - Start with all examples at the root node\n", 207 | " - Calculate information gain for splitting on all possible features, and pick the one with the highest information gain\n", 208 | " - Split dataset according to the selected feature, and create left and right branches of the tree\n", 209 | " - Keep repeating splitting process until stopping criteria is met\n", 210 | " \n", 211 | " \n", 212 | "- In this lab, you'll implement the following functions, which will let you split a node into left and right branches using the feature with the highest information gain\n", 213 | " - Calculate the entropy at a node \n", 214 | " - Split the dataset at a node into left and right branches based on a given feature\n", 215 | " - Calculate the information gain from splitting on a given feature\n", 216 | " - Choose the feature that maximizes information gain\n", 217 | " \n", 218 | "- We'll then use the helper functions you've implemented to build a decision tree by repeating the splitting process until the stopping criteria is met \n", 219 | " - For this lab, the stopping criteria we've chosen is setting a maximum depth of 2" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "\n", 227 | "### 4.1 Calculate entropy\n", 228 | "\n", 229 | "First, you'll write a helper function called `compute_entropy` that computes the entropy (measure of impurity) at a node. \n", 230 | "- The function takes in a numpy array (`y`) that indicates whether the examples in that node are edible (`1`) or poisonous(`0`) \n", 231 | "\n", 232 | "Complete the `compute_entropy()` function below to:\n", 233 | "* Compute $p_1$, which is the fraction of examples that are edible (i.e. have value = `1` in `y`)\n", 234 | "* The entropy is then calculated as \n", 235 | "\n", 236 | "$$H(p_1) = -p_1 \\text{log}_2(p_1) - (1- p_1) \\text{log}_2(1- p_1)$$\n", 237 | "* Note \n", 238 | " * The log is calculated with base $2$\n", 239 | " * For implementation purposes, $0\\text{log}_2(0) = 0$. That is, if `p_1 = 0` or `p_1 = 1`, set the entropy to `0`\n", 240 | " * Make sure to check that the data at a node is not empty (i.e. `len(y) != 0`). Return `0` if it is\n", 241 | " \n", 242 | "\n", 243 | "### Exercise 1\n", 244 | "\n", 245 | "Please complete the `compute_entropy()` function using the previous instructions.\n", 246 | " \n", 247 | "If you get stuck, you can check out the hints presented after the cell below to help you with the implementation." 
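, "\n", "Before implementing the function, it can help to evaluate the formula once by hand (optional, not graded). At the root node 5 of the 10 mushrooms are edible, so $p_1 = 0.5$:\n", "\n", "```python\n", "import numpy as np\n", "\n", "p1 = 0.5  # 5 edible out of 10 examples at the root node\n", "entropy = -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)\n", "print(entropy)  # 1.0 -- the maximum impurity for a binary label\n", "```"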
248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [ 256 | "# UNQ_C1\n", 257 | "# GRADED FUNCTION: compute_entropy\n", 258 | "\n", 259 | "def compute_entropy(y):\n", 260 | " \"\"\"\n", 261 | " Computes the entropy for \n", 262 | " \n", 263 | " Args:\n", 264 | " y (ndarray): Numpy array indicating whether each example at a node is\n", 265 | " edible (`1`) or poisonous (`0`)\n", 266 | " \n", 267 | " Returns:\n", 268 | " entropy (float): Entropy at that node\n", 269 | " \n", 270 | " \"\"\"\n", 271 | " # You need to return the following variables correctly\n", 272 | " entropy = 0.\n", 273 | " \n", 274 | " ### START CODE HERE ###\n", 275 | " \n", 276 | " ### END CODE HERE ### \n", 277 | " \n", 278 | " return entropy" 279 | ] 280 | }, 281 | { 282 | "cell_type": "markdown", 283 | "metadata": {}, 284 | "source": [ 285 | "
\n", 286 | " Click for hints\n", 287 | " \n", 288 | " \n", 289 | " * To calculate `p1`\n", 290 | " * You can get the subset of examples in `y` that have the value `1` as `y[y == 1]`\n", 291 | " * You can use `len(y)` to get the number of examples in `y`\n", 292 | " * To calculate `entropy`\n", 293 | " * np.log2 let's you calculate the logarithm to base 2 for a numpy array\n", 294 | " * If the value of `p1` is 0 or 1, make sure to set the entropy to `0` \n", 295 | " \n", 296 | "
\n", 297 | " Click for more hints\n", 298 | " \n", 299 | " * Here's how you can structure the overall implementation for this function\n", 300 | " ```python \n", 301 | " def compute_entropy(y):\n", 302 | " \n", 303 | " # You need to return the following variables correctly\n", 304 | " entropy = 0.\n", 305 | "\n", 306 | " ### START CODE HERE ###\n", 307 | " if len(y) != 0:\n", 308 | " # Your code here to calculate the fraction of edible examples (i.e with value = 1 in y)\n", 309 | " p1 =\n", 310 | "\n", 311 | " # For p1 = 0 and 1, set the entropy to 0 (to handle 0log0)\n", 312 | " if p1 != 0 and p1 != 1:\n", 313 | " # Your code here to calculate the entropy using the formula provided above\n", 314 | " entropy = \n", 315 | " else:\n", 316 | " entropy = 0. \n", 317 | " ### END CODE HERE ### \n", 318 | "\n", 319 | " return entropy\n", 320 | " ```\n", 321 | " \n", 322 | " If you're still stuck, you can check the hints presented below to figure out how to calculate `p1` and `entropy`.\n", 323 | " \n", 324 | "
\n", 325 | " Hint to calculate p1\n", 326 | "     You can compute p1 as p1 = len(y[y == 1]) / len(y) \n", 327 | "
\n", 328 | "\n", 329 | "
\n", 330 | " Hint to calculate entropy\n", 331 | "     You can compute entropy as entropy = -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)\n", 332 | "
\n", 333 | " \n", 334 | "
\n", 335 | "\n", 336 | "
\n", 337 | "\n", 338 | " \n" 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": {}, 344 | "source": [ 345 | "You can check if your implementation was correct by running the following test code:" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": null, 351 | "metadata": {}, 352 | "outputs": [], 353 | "source": [ 354 | "# Compute entropy at the root node (i.e. with all examples)\n", 355 | "# Since we have 5 edible and 5 non-edible mushrooms, the entropy should be 1\"\n", 356 | "\n", 357 | "print(\"Entropy at root node: \", compute_entropy(y_train)) \n", 358 | "\n", 359 | "# UNIT TESTS\n", 360 | "compute_entropy_test(compute_entropy)" 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "metadata": {}, 366 | "source": [ 367 | "**Expected Output**:\n", 368 | "\n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | "
Entropy at root node: 1.0
" 373 | ] 374 | }, 375 | { 376 | "cell_type": "markdown", 377 | "metadata": {}, 378 | "source": [ 379 | "\n", 380 | "### 4.2 Split dataset\n", 381 | "\n", 382 | "Next, you'll write a helper function called `split_dataset` that takes in the data at a node and a feature to split on and splits it into left and right branches. Later in the lab, you'll implement code to calculate how good the split is.\n", 383 | "\n", 384 | "- The function takes in the training data, the list of indices of data points at that node, along with the feature to split on. \n", 385 | "- It splits the data and returns the subset of indices at the left and the right branch.\n", 386 | "- For example, say we're starting at the root node (so `node_indices = [0,1,2,3,4,5,6,7,8,9]`), and we chose to split on feature `0`, which is whether or not the example has a brown cap.\n", 387 | " - The output of the function is then, `left_indices = [0,1,2,3,4,7,9]` and `right_indices = [5,6,8]`\n", 388 | " \n", 389 | "| Index | Brown Cap | Tapering Stalk Shape | Solitary | Edible |\n", 390 | "|:-----:|:---------:|:--------------------:|:--------:|:------:|\n", 391 | "| 0 | 1 | 1 | 1 | 1 |\n", 392 | "| 1 | 1 | 0 | 1 | 1 |\n", 393 | "| 2 | 1 | 0 | 0 | 0 |\n", 394 | "| 3 | 1 | 0 | 0 | 0 |\n", 395 | "| 4 | 1 | 1 | 1 | 1 |\n", 396 | "| 5 | 0 | 1 | 1 | 0 |\n", 397 | "| 6 | 0 | 0 | 0 | 0 |\n", 398 | "| 7 | 1 | 0 | 1 | 1 |\n", 399 | "| 8 | 0 | 1 | 0 | 1 |\n", 400 | "| 9 | 1 | 0 | 0 | 0 |\n", 401 | "\n", 402 | "\n", 403 | "### Exercise 2\n", 404 | "\n", 405 | "Please complete the `split_dataset()` function shown below\n", 406 | "\n", 407 | "- For each index in `node_indices`\n", 408 | " - If the value of `X` at that index for that feature is `1`, add the index to `left_indices`\n", 409 | " - If the value of `X` at that index for that feature is `0`, add the index to `right_indices`\n", 410 | "\n", 411 | "If you get stuck, you can check out the hints presented after the cell below to help you with the implementation." 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": null, 417 | "metadata": {}, 418 | "outputs": [], 419 | "source": [ 420 | "# UNQ_C2\n", 421 | "# GRADED FUNCTION: split_dataset\n", 422 | "\n", 423 | "def split_dataset(X, node_indices, feature):\n", 424 | " \"\"\"\n", 425 | " Splits the data at the given node into\n", 426 | " left and right branches\n", 427 | " \n", 428 | " Args:\n", 429 | " X (ndarray): Data matrix of shape(n_samples, n_features)\n", 430 | " node_indices (ndarray): List containing the active indices. I.e, the samples being considered at this step.\n", 431 | " feature (int): Index of feature to split on\n", 432 | " \n", 433 | " Returns:\n", 434 | " left_indices (ndarray): Indices with feature value == 1\n", 435 | " right_indices (ndarray): Indices with feature value == 0\n", 436 | " \"\"\"\n", 437 | " \n", 438 | " # You need to return the following variables correctly\n", 439 | " left_indices = []\n", 440 | " right_indices = []\n", 441 | " \n", 442 | " ### START CODE HERE ###\n", 443 | " \n", 444 | " ### END CODE HERE ###\n", 445 | " \n", 446 | " return left_indices, right_indices" 447 | ] 448 | }, 449 | { 450 | "cell_type": "markdown", 451 | "metadata": {}, 452 | "source": [ 453 | "
\n", 454 | " Click for hints\n", 455 | " \n", 456 | " \n", 457 | " * Here's how you can structure the overall implementation for this function\n", 458 | " ```python \n", 459 | " def split_dataset(X, node_indices, feature):\n", 460 | " \n", 461 | " # You need to return the following variables correctly\n", 462 | " left_indices = []\n", 463 | " right_indices = []\n", 464 | "\n", 465 | " ### START CODE HERE ###\n", 466 | " # Go through the indices of examples at that node\n", 467 | " for i in node_indices: \n", 468 | " if # Your code here to check if the value of X at that index for the feature is 1\n", 469 | " left_indices.append(i)\n", 470 | " else:\n", 471 | " right_indices.append(i)\n", 472 | " ### END CODE HERE ###\n", 473 | " \n", 474 | " return left_indices, right_indices\n", 475 | " ```\n", 476 | "
\n", 477 | " Click for more hints\n", 478 | " \n", 479 | " The condition is if X[i][feature] == 1:.\n", 480 | " \n", 481 | "
\n", 482 | "\n", 483 | "
\n", 484 | "\n", 485 | " \n" 486 | ] 487 | }, 488 | { 489 | "cell_type": "markdown", 490 | "metadata": {}, 491 | "source": [ 492 | "Now, let's check your implementation using the code blocks below. Let's try splitting the dataset at the root node, which contains all examples at feature 0 (Brown Cap) as we'd discussed above" 493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": null, 498 | "metadata": {}, 499 | "outputs": [], 500 | "source": [ 501 | "root_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n", 502 | "\n", 503 | "# Feel free to play around with these variables\n", 504 | "# The dataset only has three features, so this value can be 0 (Brown Cap), 1 (Tapering Stalk Shape) or 2 (Solitary)\n", 505 | "feature = 0\n", 506 | "\n", 507 | "left_indices, right_indices = split_dataset(X_train, root_indices, feature)\n", 508 | "\n", 509 | "print(\"Left indices: \", left_indices)\n", 510 | "print(\"Right indices: \", right_indices)\n", 511 | "\n", 512 | "# UNIT TESTS \n", 513 | "split_dataset_test(split_dataset)" 514 | ] 515 | }, 516 | { 517 | "cell_type": "markdown", 518 | "metadata": {}, 519 | "source": [ 520 | "**Expected Output**:\n", 521 | "```\n", 522 | "Left indices: [0, 1, 2, 3, 4, 7, 9]\n", 523 | "Right indices: [5, 6, 8]\n", 524 | "```" 525 | ] 526 | }, 527 | { 528 | "cell_type": "markdown", 529 | "metadata": {}, 530 | "source": [ 531 | "\n", 532 | "### 4.3 Calculate information gain\n", 533 | "\n", 534 | "Next, you'll write a function called `information_gain` that takes in the training data, the indices at a node and a feature to split on and returns the information gain from the split.\n", 535 | "\n", 536 | "\n", 537 | "### Exercise 3\n", 538 | "\n", 539 | "Please complete the `compute_information_gain()` function shown below to compute\n", 540 | "\n", 541 | "$$\\text{Information Gain} = H(p_1^\\text{node})- (w^{\\text{left}}H(p_1^\\text{left}) + w^{\\text{right}}H(p_1^\\text{right}))$$\n", 542 | "\n", 543 | "where \n", 544 | "- $H(p_1^\\text{node})$ is entropy at the node \n", 545 | "- $H(p_1^\\text{left})$ and $H(p_1^\\text{right})$ are the entropies at the left and the right branches resulting from the split\n", 546 | "- $w^{\\text{left}}$ and $w^{\\text{right}}$ are the proportion of examples at the left and right branch respectively\n", 547 | "\n", 548 | "Note:\n", 549 | "- You can use the `compute_entropy()` function that you implemented above to calculate the entropy\n", 550 | "- We've provided some starter code that uses the `split_dataset()` function you implemented above to split the dataset \n", 551 | "\n", 552 | "If you get stuck, you can check out the hints presented after the cell below to help you with the implementation." 553 | ] 554 | }, 555 | { 556 | "cell_type": "code", 557 | "execution_count": null, 558 | "metadata": {}, 559 | "outputs": [], 560 | "source": [ 561 | "# UNQ_C3\n", 562 | "# GRADED FUNCTION: compute_information_gain\n", 563 | "\n", 564 | "def compute_information_gain(X, y, node_indices, feature):\n", 565 | " \n", 566 | " \"\"\"\n", 567 | " Compute the information of splitting the node on a given feature\n", 568 | " \n", 569 | " Args:\n", 570 | " X (ndarray): Data matrix of shape(n_samples, n_features)\n", 571 | " y (array like): list or ndarray with n_samples containing the target variable\n", 572 | " node_indices (ndarray): List containing the active indices. 
I.e, the samples being considered in this step.\n", 573 | "   \n", 574 | "    Returns:\n", 575 | "        information_gain (float): Information gain computed from the split\n", 576 | "    \n", 577 | "    \"\"\"    \n", 578 | "    # Split dataset\n", 579 | "    left_indices, right_indices = split_dataset(X, node_indices, feature)\n", 580 | "    \n", 581 | "    # Some useful variables\n", 582 | "    X_node, y_node = X[node_indices], y[node_indices]\n", 583 | "    X_left, y_left = X[left_indices], y[left_indices]\n", 584 | "    X_right, y_right = X[right_indices], y[right_indices]\n", 585 | "    \n", 586 | "    # You need to return the following variables correctly\n", 587 | "    information_gain = 0\n", 588 | "    \n", 589 | "    ### START CODE HERE ###\n", 590 | "    \n", 591 | "    # Weights \n", 592 | "    \n", 593 | "    # Weighted entropy\n", 594 | "    \n", 595 | "    # Information gain    \n", 596 | "    \n", 597 | "    ### END CODE HERE ###  \n", 598 | "    \n", 599 | "    return information_gain" 600 | ] 601 | }, 602 | { 603 | "cell_type": "markdown", 604 | "metadata": {}, 605 | "source": [ 606 | "
\n", 607 | " Click for hints\n", 608 | " \n", 609 | " \n", 610 | " * Here's how you can structure the overall implementation for this function\n", 611 | " ```python \n", 612 | " def compute_information_gain(X, y, node_indices, feature):\n", 613 | " # Split dataset\n", 614 | " left_indices, right_indices = split_dataset(X, node_indices, feature)\n", 615 | "\n", 616 | " # Some useful variables\n", 617 | " X_node, y_node = X[node_indices], y[node_indices]\n", 618 | " X_left, y_left = X[left_indices], y[left_indices]\n", 619 | " X_right, y_right = X[right_indices], y[right_indices]\n", 620 | "\n", 621 | " # You need to return the following variables correctly\n", 622 | " information_gain = 0\n", 623 | "\n", 624 | " ### START CODE HERE ###\n", 625 | " # Your code here to compute the entropy at the node using compute_entropy()\n", 626 | " node_entropy = \n", 627 | " # Your code here to compute the entropy at the left branch\n", 628 | " left_entropy = \n", 629 | " # Your code here to compute the entropy at the right branch\n", 630 | " right_entropy = \n", 631 | "\n", 632 | " # Your code here to compute the proportion of examples at the left branch\n", 633 | " w_left = \n", 634 | " \n", 635 | " # Your code here to compute the proportion of examples at the right branch\n", 636 | " w_right = \n", 637 | "\n", 638 | " # Your code here to compute weighted entropy from the split using \n", 639 | " # w_left, w_right, left_entropy and right_entropy\n", 640 | " weighted_entropy = \n", 641 | "\n", 642 | " # Your code here to compute the information gain as the entropy at the node\n", 643 | " # minus the weighted entropy\n", 644 | " information_gain = \n", 645 | " ### END CODE HERE ### \n", 646 | "\n", 647 | " return information_gain\n", 648 | " ```\n", 649 | " If you're still stuck, check out the hints below.\n", 650 | " \n", 651 | "
\n", 652 | " Hint to calculate the entropies\n", 653 | " \n", 654 | " node_entropy = compute_entropy(y_node)
\n", 655 | " left_entropy = compute_entropy(y_left)
\n", 656 | " right_entropy = compute_entropy(y_right)\n", 657 | " \n", 658 | "
\n", 659 | " \n", 660 | "
\n", 661 | " Hint to calculate w_left and w_right\n", 662 | " w_left = len(X_left) / len(X_node)
\n", 663 | " w_right = len(X_right) / len(X_node)\n", 664 | "
\n", 665 | " \n", 666 | "
\n", 667 | " Hint to calculate weighted_entropy\n", 668 | " weighted_entropy = w_left * left_entropy + w_right * right_entropy\n", 669 | "
\n", 670 | " \n", 671 | "
\n", 672 | " Hint to calculate information_gain\n", 673 | " information_gain = node_entropy - weighted_entropy\n", 674 | "
\n", 675 | "\n", 676 | "\n", 677 | "
\n" 678 | ] 679 | }, 680 | { 681 | "cell_type": "markdown", 682 | "metadata": {}, 683 | "source": [ 684 | "You can now check your implementation using the cell below and calculate what the information gain would be from splitting on each of the featues" 685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": null, 690 | "metadata": {}, 691 | "outputs": [], 692 | "source": [ 693 | "info_gain0 = compute_information_gain(X_train, y_train, root_indices, feature=0)\n", 694 | "print(\"Information Gain from splitting the root on brown cap: \", info_gain0)\n", 695 | " \n", 696 | "info_gain1 = compute_information_gain(X_train, y_train, root_indices, feature=1)\n", 697 | "print(\"Information Gain from splitting the root on tapering stalk shape: \", info_gain1)\n", 698 | "\n", 699 | "info_gain2 = compute_information_gain(X_train, y_train, root_indices, feature=2)\n", 700 | "print(\"Information Gain from splitting the root on solitary: \", info_gain2)\n", 701 | "\n", 702 | "# UNIT TESTS\n", 703 | "compute_information_gain_test(compute_information_gain)" 704 | ] 705 | }, 706 | { 707 | "cell_type": "markdown", 708 | "metadata": {}, 709 | "source": [ 710 | "**Expected Output**:\n", 711 | "```\n", 712 | "Information Gain from splitting the root on brown cap: 0.034851554559677034\n", 713 | "Information Gain from splitting the root on tapering stalk shape: 0.12451124978365313\n", 714 | "Information Gain from splitting the root on solitary: 0.2780719051126377\n", 715 | "```" 716 | ] 717 | }, 718 | { 719 | "cell_type": "markdown", 720 | "metadata": {}, 721 | "source": [ 722 | "Splitting on \"Solitary\" (feature = 2) at the root node gives the maximum information gain. Therefore, it's the best feature to split on at the root node." 723 | ] 724 | }, 725 | { 726 | "cell_type": "markdown", 727 | "metadata": {}, 728 | "source": [ 729 | "\n", 730 | "### 4.4 Get best split\n", 731 | "Now let's write a function to get the best feature to split on by computing the information gain from each feature as we did above and returning the feature that gives the maximum information gain\n", 732 | "\n", 733 | "\n", 734 | "### Exercise 4\n", 735 | "Please complete the `get_best_split()` function shown below.\n", 736 | "- The function takes in the training data, along with the indices of datapoint at that node\n", 737 | "- The output of the function the feature that gives the maximum information gain \n", 738 | " - You can use the `compute_information_gain()` function to iterate through the features and calculate the information for each feature\n", 739 | "If you get stuck, you can check out the hints presented after the cell below to help you with the implementation." 740 | ] 741 | }, 742 | { 743 | "cell_type": "code", 744 | "execution_count": null, 745 | "metadata": {}, 746 | "outputs": [], 747 | "source": [ 748 | "# UNQ_C4\n", 749 | "# GRADED FUNCTION: get_best_split\n", 750 | "\n", 751 | "def get_best_split(X, y, node_indices): \n", 752 | " \"\"\"\n", 753 | " Returns the optimal feature and threshold value\n", 754 | " to split the node data \n", 755 | " \n", 756 | " Args:\n", 757 | " X (ndarray): Data matrix of shape(n_samples, n_features)\n", 758 | " y (array like): list or ndarray with n_samples containing the target variable\n", 759 | " node_indices (ndarray): List containing the active indices. 
I.e, the samples being considered in this step.\n", 760 | "\n", 761 | " Returns:\n", 762 | " best_feature (int): The index of the best feature to split\n", 763 | " \"\"\" \n", 764 | " \n", 765 | " # Some useful variables\n", 766 | " num_features = X.shape[1]\n", 767 | " \n", 768 | " # You need to return the following variables correctly\n", 769 | " best_feature = -1\n", 770 | " \n", 771 | " ### START CODE HERE ###\n", 772 | " \n", 773 | " ### END CODE HERE ## \n", 774 | " \n", 775 | " return best_feature" 776 | ] 777 | }, 778 | { 779 | "cell_type": "markdown", 780 | "metadata": {}, 781 | "source": [ 782 | "
\n", 783 | " Click for hints\n", 784 | " \n", 785 | " \n", 786 | " * Here's how you can structure the overall implementation for this function\n", 787 | " \n", 788 | " ```python \n", 789 | " def get_best_split(X, y, node_indices): \n", 790 | "\n", 791 | " # Some useful variables\n", 792 | " num_features = X.shape[1]\n", 793 | "\n", 794 | " # You need to return the following variables correctly\n", 795 | " best_feature = -1\n", 796 | "\n", 797 | " ### START CODE HERE ###\n", 798 | " max_info_gain = 0\n", 799 | "\n", 800 | " # Iterate through all features\n", 801 | " for feature in range(num_features): \n", 802 | " \n", 803 | " # Your code here to compute the information gain from splitting on this feature\n", 804 | " info_gain = \n", 805 | " \n", 806 | " # If the information gain is larger than the max seen so far\n", 807 | " if info_gain > max_info_gain: \n", 808 | " # Your code here to set the max_info_gain and best_feature\n", 809 | " max_info_gain = \n", 810 | " best_feature = \n", 811 | " ### END CODE HERE ## \n", 812 | " \n", 813 | " return best_feature\n", 814 | " ```\n", 815 | " If you're still stuck, check out the hints below.\n", 816 | " \n", 817 | "
\n", 818 | " Hint to calculate info_gain\n", 819 | " \n", 820 | " info_gain = compute_information_gain(X, y, node_indices, feature)\n", 821 | "
\n", 822 | " \n", 823 | "
\n", 824 | " Hint to update the max_info_gain and best_feature\n", 825 | " max_info_gain = info_gain
\n", 826 | " best_feature = feature\n", 827 | "
\n", 828 | "
\n" 829 | ] 830 | }, 831 | { 832 | "cell_type": "markdown", 833 | "metadata": {}, 834 | "source": [ 835 | "Now, let's check the implementation of your function using the cell below." 836 | ] 837 | }, 838 | { 839 | "cell_type": "code", 840 | "execution_count": null, 841 | "metadata": {}, 842 | "outputs": [], 843 | "source": [ 844 | "best_feature = get_best_split(X_train, y_train, root_indices)\n", 845 | "print(\"Best feature to split on: %d\" % best_feature)\n", 846 | "\n", 847 | "# UNIT TESTS\n", 848 | "get_best_split_test(get_best_split)" 849 | ] 850 | }, 851 | { 852 | "cell_type": "markdown", 853 | "metadata": {}, 854 | "source": [ 855 | "As we saw above, the function returns that the best feature to split on at the root node is feature 2 (\"Solitary\")" 856 | ] 857 | }, 858 | { 859 | "cell_type": "markdown", 860 | "metadata": {}, 861 | "source": [ 862 | "\n", 863 | "## 5 - Building the tree\n", 864 | "\n", 865 | "In this section, we use the functions you implemented above to generate a decision tree by successively picking the best feature to split on until we reach the stopping criteria (maximum depth is 2).\n", 866 | "\n", 867 | "You do not need to implement anything for this part." 868 | ] 869 | }, 870 | { 871 | "cell_type": "code", 872 | "execution_count": null, 873 | "metadata": {}, 874 | "outputs": [], 875 | "source": [ 876 | "# Not graded\n", 877 | "tree = []\n", 878 | "\n", 879 | "def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):\n", 880 | " \"\"\"\n", 881 | " Build a tree using the recursive algorithm that split the dataset into 2 subgroups at each node.\n", 882 | " This function just prints the tree.\n", 883 | " \n", 884 | " Args:\n", 885 | " X (ndarray): Data matrix of shape(n_samples, n_features)\n", 886 | " y (array like): list or ndarray with n_samples containing the target variable\n", 887 | " node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.\n", 888 | " branch_name (string): Name of the branch. ['Root', 'Left', 'Right']\n", 889 | " max_depth (int): Max depth of the resulting tree. \n", 890 | " current_depth (int): Current depth. Parameter used during recursive call.\n", 891 | " \n", 892 | " \"\"\" \n", 893 | "\n", 894 | " # Maximum depth reached - stop splitting\n", 895 | " if current_depth == max_depth:\n", 896 | " formatting = \" \"*current_depth + \"-\"*current_depth\n", 897 | " print(formatting, \"%s leaf node with indices\" % branch_name, node_indices)\n", 898 | " return\n", 899 | " \n", 900 | " # Otherwise, get best split and split the data\n", 901 | " # Get the best feature and threshold at this node\n", 902 | " best_feature = get_best_split(X, y, node_indices) \n", 903 | " tree.append((current_depth, branch_name, best_feature, node_indices))\n", 904 | " \n", 905 | " formatting = \"-\"*current_depth\n", 906 | " print(\"%s Depth %d, %s: Split on feature: %d\" % (formatting, current_depth, branch_name, best_feature))\n", 907 | " \n", 908 | " # Split the dataset at the best feature\n", 909 | " left_indices, right_indices = split_dataset(X, node_indices, best_feature)\n", 910 | " \n", 911 | " # continue splitting the left and the right child. 
Increment current depth\n", 912 | " build_tree_recursive(X, y, left_indices, \"Left\", max_depth, current_depth+1)\n", 913 | " build_tree_recursive(X, y, right_indices, \"Right\", max_depth, current_depth+1)\n" 914 | ] 915 | }, 916 | { 917 | "cell_type": "code", 918 | "execution_count": null, 919 | "metadata": {}, 920 | "outputs": [], 921 | "source": [ 922 | "build_tree_recursive(X_train, y_train, root_indices, \"Root\", max_depth=2, current_depth=0)" 923 | ] 924 | }, 925 | { 926 | "cell_type": "code", 927 | "execution_count": null, 928 | "metadata": {}, 929 | "outputs": [], 930 | "source": [] 931 | } 932 | ], 933 | "metadata": { 934 | "kernelspec": { 935 | "display_name": "Python 3", 936 | "language": "python", 937 | "name": "python3" 938 | }, 939 | "language_info": { 940 | "codemirror_mode": { 941 | "name": "ipython", 942 | "version": 3 943 | }, 944 | "file_extension": ".py", 945 | "mimetype": "text/x-python", 946 | "name": "python", 947 | "nbconvert_exporter": "python", 948 | "pygments_lexer": "ipython3", 949 | "version": "3.7.6" 950 | } 951 | }, 952 | "nbformat": 4, 953 | "nbformat_minor": 5 954 | } 955 | --------------------------------------------------------------------------------