├── requirements.txt ├── data ├── digit_0_1_X.npy └── digit_0_1_y.npy ├── 资料 ├── 9.Week 3 practice lab logistic regression │ ├── testCasesEx2.mat │ ├── images │ │ ├── figure 1.png │ │ ├── figure 2.png │ │ ├── figure 3.png │ │ ├── figure 4.png │ │ ├── figure 5.png │ │ └── figure 6.png │ ├── utils.py │ ├── data │ │ ├── ex2data2.txt │ │ └── ex2data1.txt │ ├── solutions.py │ ├── public_tests.py │ └── test_utils.py └── 7.Practice lab decision trees │ ├── public_tests.py │ └── C2_W4_Decision_Tree_with_Markdown.ipynb ├── 第01章 概论.ipynb ├── README.md ├── .gitignore ├── 2024年考研807真题.ipynb ├── 第00章 补充的数学知识.ipynb ├── 第02章 统计决策方法.ipynb ├── 第10章 模式识别系统的评价.ipynb ├── 第07章 特征选择.ipynb └── 第08章 特征提取.ipynb /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy>=1.25.0 2 | matplotlib>=3.7.1 3 | scikit-learn>=1.2.2 4 | -------------------------------------------------------------------------------- /data/digit_0_1_X.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hwdqm88/pattern-recognition-807/HEAD/data/digit_0_1_X.npy -------------------------------------------------------------------------------- /data/digit_0_1_y.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hwdqm88/pattern-recognition-807/HEAD/data/digit_0_1_y.npy -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/testCasesEx2.mat: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hwdqm88/pattern-recognition-807/HEAD/资料/9.Week 3 practice lab logistic regression/testCasesEx2.mat -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/images/figure 1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hwdqm88/pattern-recognition-807/HEAD/资料/9.Week 3 practice lab logistic regression/images/figure 1.png -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/images/figure 2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hwdqm88/pattern-recognition-807/HEAD/资料/9.Week 3 practice lab logistic regression/images/figure 2.png -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/images/figure 3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hwdqm88/pattern-recognition-807/HEAD/资料/9.Week 3 practice lab logistic regression/images/figure 3.png -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/images/figure 4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hwdqm88/pattern-recognition-807/HEAD/资料/9.Week 3 practice lab logistic regression/images/figure 4.png -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/images/figure 5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hwdqm88/pattern-recognition-807/HEAD/资料/9.Week 3 practice 
lab logistic regression/images/figure 5.png -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/images/figure 6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hwdqm88/pattern-recognition-807/HEAD/资料/9.Week 3 practice lab logistic regression/images/figure 6.png -------------------------------------------------------------------------------- /第01章 概论.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "616cb730", 6 | "metadata": {}, 7 | "source": [ 8 | "# 第一章 概论\n", 9 | "---" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "id": "32d1dff8", 15 | "metadata": {}, 16 | "source": [ 17 | "## 二轮总结笔记\n", 18 | "\n", 19 | "### 一、模式识别的主要方法\n", 20 | "\n", 21 | "$$\n", 22 | "模式识别\\begin{cases}基于知识\\begin{cases}专家系统\\\\ 句法识别\\end{cases}\\\\ 基于数据\\end{cases}\n", 23 | "$$\n", 24 | "\n", 25 | "### 二、模式识别的基本步骤\n", 26 | "\n", 27 | "$$\n", 28 | "\\begin{cases}监督模式识别\\begin{cases}分析问题\\\\原始特征获取\\\\特征提取与选择\\\\分类器设计\\\\分类决策\\end{cases}\\\\非监督模式识别\\begin{cases}分析问题\\\\原始特征获取\\\\特征提取与选择\\\\聚类分析\\\\结果解释\\end{cases}\\end{cases}\n", 29 | "$$" 30 | ] 31 | } 32 | ], 33 | "metadata": { 34 | "kernelspec": { 35 | "display_name": "Python 3 (ipykernel)", 36 | "language": "python", 37 | "name": "python3" 38 | }, 39 | "language_info": { 40 | "codemirror_mode": { 41 | "name": "ipython", 42 | "version": 3 43 | }, 44 | "file_extension": ".py", 45 | "mimetype": "text/x-python", 46 | "name": "python", 47 | "nbconvert_exporter": "python", 48 | "pygments_lexer": "ipython3", 49 | "version": "3.9.16" 50 | } 51 | }, 52 | "nbformat": 4, 53 | "nbformat_minor": 5 54 | } 55 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ### 零、2024考研心得 2 | 3 | 今年的807非常简单,可以说有手就行,我把真题放在仓库里了,见 [2024年考研807真题.ipynb](./2024年考研807真题.ipynb),另外我总结几个值得注意的点: 4 | 5 | 1. 和2023年一样,2024年也引用改编了《模式分类》(Richard O. Duda等著)的课后习题,所以一定要重视这本书。但是这本书的课后习题非常的多,而且和考纲不完全重合,具体复习策略还需要自行把握。[点击跳转《模式分类》教材和书后题答案](https://github.com/kingloon/EBooks/tree/master/Pattern%20Classification) 6 | 2. 和考纲不一样,2024年没有填空题和名词解释题,可能是因为考这种题太low了所以没出。 7 | 3. 虽然考纲是第三版,但是第四版的教材也要看,有一些选择题会从第四版中出题。 8 | 9 | 考研经验贴 [https://zhuanlan.zhihu.com/p/690137049](https://zhuanlan.zhihu.com/p/690137049) 10 | 11 | ### 一、项目介绍 12 | 13 | 这个项目是我学习《模式识别(第三版)》(张学工 清华大学出版社)的笔记。 14 | 15 | 这本书是 **清华大学深圳研究生院 0854电子信息-人工智能** 考研的专业课(**807**)的考纲截至2024年的唯一指定教材。 16 | 17 | 前二章不涉及代码,都是一些数学的推导和运算。 18 | 19 | 从第三章开始,教材开始介绍各种模型,我会争取把教材上涉及到的所有模型用代码实现一遍(书上一行代码都没有)。 20 | 21 | 每一章笔记的格式是:最前面是二轮的总结,后面是一轮的学习笔记和模型的代码实现。 22 | 23 | ### 二、特别说明 24 | 25 | 1. 教材上用到了很多本科没学过的数学知识,我总结在了 [第00章 补充的数学知识.ipynb](./第00章%20补充的数学知识.ipynb) 中,建议看教材之前先看一下。 26 | 2. 一定要先看一遍教材,彻底看懂了之后(最好把公式都自己推导一遍),再来看我写的笔记和代码。因为我写的笔记只有实现模型的核心,如果你没理解书上的内容就来看笔记,大概率看不懂。 27 | 3. 关于公式的显示问题:显示效果 VS Code > 浏览器上的Jupyter notebook > Pycharm。因为各个环境渲染markdown的引擎不一样,而且很多地方不兼容(特别是这种全是Latex公式的内容)。我已经调整过,让公式在所有地方显示都正常。 28 | 4. 
如果用浏览器的Jupyter notebook,建议在看/写代码之前,先找个网上的教程给Jupyter notebook装插件,要不然代码显示丑,而且没有自动补全,用起来很难受。或者直接用IDE写代码。 29 | 30 | ### 三、运行说明 31 | 32 | 我的本机环境是 33 | 34 | + Python 3.10 35 | 36 | Python版本影响不大,不要差太多就行 37 | 38 | 其他包的安装: 39 | 40 | `pip install -r requirements.txt` 41 | 42 | 建议使用虚拟环境运行。 43 | 44 | ### 四、文件说明 45 | 46 | `./data/`里面保存了一些测试数据 47 | 48 | `./资料/`里面保存了一些笔记中引用的内容 49 | -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | def load_data(filename): 5 | data = np.loadtxt(filename, delimiter=',') 6 | X = data[:,:2] 7 | y = data[:,2] 8 | return X, y 9 | 10 | def sig(z): 11 | 12 | return 1/(1+np.exp(-z)) 13 | 14 | def map_feature(X1, X2): 15 | """ 16 | Feature mapping function to polynomial features 17 | """ 18 | X1 = np.atleast_1d(X1) 19 | X2 = np.atleast_1d(X2) 20 | degree = 6 21 | out = [] 22 | for i in range(1, degree+1): 23 | for j in range(i + 1): 24 | out.append((X1**(i-j) * (X2**j))) 25 | return np.stack(out, axis=1) 26 | 27 | 28 | def plot_data(X, y, pos_label="y=1", neg_label="y=0"): 29 | positive = y == 1 30 | negative = y == 0 31 | 32 | # Plot examples 33 | plt.plot(X[positive, 0], X[positive, 1], 'k+', label=pos_label) 34 | plt.plot(X[negative, 0], X[negative, 1], 'yo', label=neg_label) 35 | 36 | 37 | def plot_decision_boundary(w, b, X, y): 38 | # Credit to dibgerge on Github for this plotting code 39 | 40 | plot_data(X[:, 0:2], y) 41 | 42 | if X.shape[1] <= 2: 43 | plot_x = np.array([min(X[:, 0]), max(X[:, 0])]) 44 | plot_y = (-1. / w[1]) * (w[0] * plot_x + b) 45 | 46 | plt.plot(plot_x, plot_y, c="b") 47 | 48 | else: 49 | u = np.linspace(-1, 1.5, 50) 50 | v = np.linspace(-1, 1.5, 50) 51 | 52 | z = np.zeros((len(u), len(v))) 53 | 54 | # Evaluate z = theta*x over the grid 55 | for i in range(len(u)): 56 | for j in range(len(v)): 57 | z[i,j] = sig(np.dot(map_feature(u[i], v[j]), w) + b) 58 | 59 | # important to transpose z before calling contour 60 | z = z.T 61 | 62 | # Plot z = 0 63 | plt.contour(u,v,z, levels = [0.5], colors="g") 64 | 65 | -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/data/ex2data2.txt: -------------------------------------------------------------------------------- 1 | 0.051267,0.69956,1 2 | -0.092742,0.68494,1 3 | -0.21371,0.69225,1 4 | -0.375,0.50219,1 5 | -0.51325,0.46564,1 6 | -0.52477,0.2098,1 7 | -0.39804,0.034357,1 8 | -0.30588,-0.19225,1 9 | 0.016705,-0.40424,1 10 | 0.13191,-0.51389,1 11 | 0.38537,-0.56506,1 12 | 0.52938,-0.5212,1 13 | 0.63882,-0.24342,1 14 | 0.73675,-0.18494,1 15 | 0.54666,0.48757,1 16 | 0.322,0.5826,1 17 | 0.16647,0.53874,1 18 | -0.046659,0.81652,1 19 | -0.17339,0.69956,1 20 | -0.47869,0.63377,1 21 | -0.60541,0.59722,1 22 | -0.62846,0.33406,1 23 | -0.59389,0.005117,1 24 | -0.42108,-0.27266,1 25 | -0.11578,-0.39693,1 26 | 0.20104,-0.60161,1 27 | 0.46601,-0.53582,1 28 | 0.67339,-0.53582,1 29 | -0.13882,0.54605,1 30 | -0.29435,0.77997,1 31 | -0.26555,0.96272,1 32 | -0.16187,0.8019,1 33 | -0.17339,0.64839,1 34 | -0.28283,0.47295,1 35 | -0.36348,0.31213,1 36 | -0.30012,0.027047,1 37 | -0.23675,-0.21418,1 38 | -0.06394,-0.18494,1 39 | 0.062788,-0.16301,1 40 | 0.22984,-0.41155,1 41 | 0.2932,-0.2288,1 42 | 0.48329,-0.18494,1 43 | 0.64459,-0.14108,1 44 | 0.46025,0.012427,1 45 | 0.6273,0.15863,1 46 | 0.57546,0.26827,1 47 | 0.72523,0.44371,1 48 
| 0.22408,0.52412,1 49 | 0.44297,0.67032,1 50 | 0.322,0.69225,1 51 | 0.13767,0.57529,1 52 | -0.0063364,0.39985,1 53 | -0.092742,0.55336,1 54 | -0.20795,0.35599,1 55 | -0.20795,0.17325,1 56 | -0.43836,0.21711,1 57 | -0.21947,-0.016813,1 58 | -0.13882,-0.27266,1 59 | 0.18376,0.93348,0 60 | 0.22408,0.77997,0 61 | 0.29896,0.61915,0 62 | 0.50634,0.75804,0 63 | 0.61578,0.7288,0 64 | 0.60426,0.59722,0 65 | 0.76555,0.50219,0 66 | 0.92684,0.3633,0 67 | 0.82316,0.27558,0 68 | 0.96141,0.085526,0 69 | 0.93836,0.012427,0 70 | 0.86348,-0.082602,0 71 | 0.89804,-0.20687,0 72 | 0.85196,-0.36769,0 73 | 0.82892,-0.5212,0 74 | 0.79435,-0.55775,0 75 | 0.59274,-0.7405,0 76 | 0.51786,-0.5943,0 77 | 0.46601,-0.41886,0 78 | 0.35081,-0.57968,0 79 | 0.28744,-0.76974,0 80 | 0.085829,-0.75512,0 81 | 0.14919,-0.57968,0 82 | -0.13306,-0.4481,0 83 | -0.40956,-0.41155,0 84 | -0.39228,-0.25804,0 85 | -0.74366,-0.25804,0 86 | -0.69758,0.041667,0 87 | -0.75518,0.2902,0 88 | -0.69758,0.68494,0 89 | -0.4038,0.70687,0 90 | -0.38076,0.91886,0 91 | -0.50749,0.90424,0 92 | -0.54781,0.70687,0 93 | 0.10311,0.77997,0 94 | 0.057028,0.91886,0 95 | -0.10426,0.99196,0 96 | -0.081221,1.1089,0 97 | 0.28744,1.087,0 98 | 0.39689,0.82383,0 99 | 0.63882,0.88962,0 100 | 0.82316,0.66301,0 101 | 0.67339,0.64108,0 102 | 1.0709,0.10015,0 103 | -0.046659,-0.57968,0 104 | -0.23675,-0.63816,0 105 | -0.15035,-0.36769,0 106 | -0.49021,-0.3019,0 107 | -0.46717,-0.13377,0 108 | -0.28859,-0.060673,0 109 | -0.61118,-0.067982,0 110 | -0.66302,-0.21418,0 111 | -0.59965,-0.41886,0 112 | -0.72638,-0.082602,0 113 | -0.83007,0.31213,0 114 | -0.72062,0.53874,0 115 | -0.59389,0.49488,0 116 | -0.48445,0.99927,0 117 | -0.0063364,0.99927,0 118 | 0.63265,-0.030612,0 119 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | share/python-wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | MANIFEST 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .nox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *.cover 49 | *.py,cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | cover/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | .pybuilder/ 76 | target/ 77 | 78 | # Jupyter Notebook 79 | .ipynb_checkpoints 80 | 81 | # IPython 82 | profile_default/ 83 | ipython_config.py 84 | 85 | # pyenv 86 | # For a library or package, you might want to ignore these files since the code is 87 | # intended to run in multiple environments; otherwise, check them in: 88 | # .python-version 89 | 90 | # pipenv 91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 94 | # install all needed dependencies. 95 | #Pipfile.lock 96 | 97 | # poetry 98 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 99 | # This is especially recommended for binary packages to ensure reproducibility, and is more 100 | # commonly ignored for libraries. 101 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 102 | #poetry.lock 103 | 104 | # pdm 105 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 106 | #pdm.lock 107 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 108 | # in version control. 109 | # https://pdm.fming.dev/#use-with-ide 110 | .pdm.toml 111 | 112 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 113 | __pypackages__/ 114 | 115 | # Celery stuff 116 | celerybeat-schedule 117 | celerybeat.pid 118 | 119 | # SageMath parsed files 120 | *.sage.py 121 | 122 | # Environments 123 | .env 124 | .venv 125 | env/ 126 | venv/ 127 | ENV/ 128 | env.bak/ 129 | venv.bak/ 130 | 131 | # Spyder project settings 132 | .spyderproject 133 | .spyproject 134 | 135 | # Rope project settings 136 | .ropeproject 137 | 138 | # mkdocs documentation 139 | /site 140 | 141 | # mypy 142 | .mypy_cache/ 143 | .dmypy.json 144 | dmypy.json 145 | 146 | # Pyre type checker 147 | .pyre/ 148 | 149 | # pytype static type analyzer 150 | .pytype/ 151 | 152 | # Cython debug symbols 153 | cython_debug/ 154 | 155 | # PyCharm 156 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 157 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 158 | # and can be added to the global gitignore or merged into this file. For a more nuclear 159 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 
160 | .idea/ 161 | 162 | test.py 163 | -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/data/ex2data1.txt: -------------------------------------------------------------------------------- 1 | 34.62365962451697,78.0246928153624,0 2 | 30.28671076822607,43.89499752400101,0 3 | 35.84740876993872,72.90219802708364,0 4 | 60.18259938620976,86.30855209546826,1 5 | 79.0327360507101,75.3443764369103,1 6 | 45.08327747668339,56.3163717815305,0 7 | 61.10666453684766,96.51142588489624,1 8 | 75.02474556738889,46.55401354116538,1 9 | 76.09878670226257,87.42056971926803,1 10 | 84.43281996120035,43.53339331072109,1 11 | 95.86155507093572,38.22527805795094,0 12 | 75.01365838958247,30.60326323428011,0 13 | 82.30705337399482,76.48196330235604,1 14 | 69.36458875970939,97.71869196188608,1 15 | 39.53833914367223,76.03681085115882,0 16 | 53.9710521485623,89.20735013750205,1 17 | 69.07014406283025,52.74046973016765,1 18 | 67.94685547711617,46.67857410673128,0 19 | 70.66150955499435,92.92713789364831,1 20 | 76.97878372747498,47.57596364975532,1 21 | 67.37202754570876,42.83843832029179,0 22 | 89.67677575072079,65.79936592745237,1 23 | 50.534788289883,48.85581152764205,0 24 | 34.21206097786789,44.20952859866288,0 25 | 77.9240914545704,68.9723599933059,1 26 | 62.27101367004632,69.95445795447587,1 27 | 80.1901807509566,44.82162893218353,1 28 | 93.114388797442,38.80067033713209,0 29 | 61.83020602312595,50.25610789244621,0 30 | 38.78580379679423,64.99568095539578,0 31 | 61.379289447425,72.80788731317097,1 32 | 85.40451939411645,57.05198397627122,1 33 | 52.10797973193984,63.12762376881715,0 34 | 52.04540476831827,69.43286012045222,1 35 | 40.23689373545111,71.16774802184875,0 36 | 54.63510555424817,52.21388588061123,0 37 | 33.91550010906887,98.86943574220611,0 38 | 64.17698887494485,80.90806058670817,1 39 | 74.78925295941542,41.57341522824434,0 40 | 34.1836400264419,75.2377203360134,0 41 | 83.90239366249155,56.30804621605327,1 42 | 51.54772026906181,46.85629026349976,0 43 | 94.44336776917852,65.56892160559052,1 44 | 82.36875375713919,40.61825515970618,0 45 | 51.04775177128865,45.82270145776001,0 46 | 62.22267576120188,52.06099194836679,0 47 | 77.19303492601364,70.45820000180959,1 48 | 97.77159928000232,86.7278223300282,1 49 | 62.07306379667647,96.76882412413983,1 50 | 91.56497449807442,88.69629254546599,1 51 | 79.94481794066932,74.16311935043758,1 52 | 99.2725269292572,60.99903099844988,1 53 | 90.54671411399852,43.39060180650027,1 54 | 34.52451385320009,60.39634245837173,0 55 | 50.2864961189907,49.80453881323059,0 56 | 49.58667721632031,59.80895099453265,0 57 | 97.64563396007767,68.86157272420604,1 58 | 32.57720016809309,95.59854761387875,0 59 | 74.24869136721598,69.82457122657193,1 60 | 71.79646205863379,78.45356224515052,1 61 | 75.3956114656803,85.75993667331619,1 62 | 35.28611281526193,47.02051394723416,0 63 | 56.25381749711624,39.26147251058019,0 64 | 30.05882244669796,49.59297386723685,0 65 | 44.66826172480893,66.45008614558913,0 66 | 66.56089447242954,41.09209807936973,0 67 | 40.45755098375164,97.53518548909936,1 68 | 49.07256321908844,51.88321182073966,0 69 | 80.27957401466998,92.11606081344084,1 70 | 66.74671856944039,60.99139402740988,1 71 | 32.72283304060323,43.30717306430063,0 72 | 64.0393204150601,78.03168802018232,1 73 | 72.34649422579923,96.22759296761404,1 74 | 60.45788573918959,73.09499809758037,1 75 | 58.84095621726802,75.85844831279042,1 76 | 99.82785779692128,72.36925193383885,1 77 | 47.26426910848174,88.47586499559782,1 78 | 
50.45815980285988,75.80985952982456,1 79 | 60.45555629271532,42.50840943572217,0 80 | 82.22666157785568,42.71987853716458,0 81 | 88.9138964166533,69.80378889835472,1 82 | 94.83450672430196,45.69430680250754,1 83 | 67.31925746917527,66.58935317747915,1 84 | 57.23870631569862,59.51428198012956,1 85 | 80.36675600171273,90.96014789746954,1 86 | 68.46852178591112,85.59430710452014,1 87 | 42.0754545384731,78.84478600148043,0 88 | 75.47770200533905,90.42453899753964,1 89 | 78.63542434898018,96.64742716885644,1 90 | 52.34800398794107,60.76950525602592,0 91 | 94.09433112516793,77.15910509073893,1 92 | 90.44855097096364,87.50879176484702,1 93 | 55.48216114069585,35.57070347228866,0 94 | 74.49269241843041,84.84513684930135,1 95 | 89.84580670720979,45.35828361091658,1 96 | 83.48916274498238,48.38028579728175,1 97 | 42.2617008099817,87.10385094025457,1 98 | 99.31500880510394,68.77540947206617,1 99 | 55.34001756003703,64.9319380069486,1 100 | 74.77589300092767,89.52981289513276,1 101 | -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/solutions.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | def sigmoid(z): 4 | """ 5 | Compute the sigmoid of z 6 | 7 | Parameters 8 | ---------- 9 | z : array_like 10 | A scalar or numpy array of any size. 11 | 12 | Returns 13 | ------- 14 | g : array_like 15 | sigmoid(z) 16 | """ 17 | # (≈ 1 line of code) 18 | # s = 19 | ### START CODE HERE ### (≈ 1 line of code) 20 | s = 1/(1 + np.exp(-z)) 21 | ### END CODE HERE ### 22 | 23 | return s 24 | 25 | 26 | def compute_cost(X, y, w, b): 27 | m = X.shape[0] 28 | 29 | f_w = sigmoid(np.dot(X, w) + b) 30 | total_cost = (1/m)*np.sum(-y*np.log(f_w) - (1-y)*np.log(1-f_w)) 31 | 32 | return float(np.squeeze(total_cost)) 33 | 34 | def compute_gradient(X, y, w, b): 35 | """ 36 | Computes the gradient for logistic regression. 37 | 38 | Parameters 39 | ---------- 40 | X : array_like 41 | Shape (m, n+1) 42 | 43 | y : array_like 44 | Shape (m,) 45 | 46 | w : array_like 47 | Parameters of the model 48 | Shape (n+1,) 49 | b: scalar 50 | 51 | Returns 52 | ------- 53 | dw : array_like 54 | Shape (n+1,) 55 | The gradient 56 | db: scalar 57 | 58 | """ 59 | m = X.shape[0] 60 | f_w = sigmoid(np.dot(X, w) + b) 61 | err = (f_w - y) 62 | dw = (1/m)*np.dot(X.T, err) 63 | db = (1/m)*np.sum(err) 64 | 65 | return float(np.squeeze(db)), dw 66 | 67 | 68 | def predict(X, w, b): 69 | """ 70 | Predict whether the label is 0 or 1 using learned logistic 71 | regression parameters theta 72 | 73 | Parameters 74 | ---------- 75 | X : array_like 76 | Shape (m, n+1) 77 | 78 | w : array_like 79 | Parameters of the model 80 | Shape (n, 1) 81 | b : scalar 82 | 83 | Returns 84 | ------- 85 | 86 | p: array_like 87 | Shape (m,) 88 | The predictions for X using a threshold at 0.5 89 | i.e. 
if sigmoid (theta.T*X) >=0.5 predict 1 90 | """ 91 | 92 | # number of training examples 93 | m = X.shape[0] 94 | p = np.zeros(m) 95 | 96 | for i in range(m): 97 | f_w = sigmoid(np.dot(w.T, X[i]) + b) 98 | p[i] = f_w >=0.5 99 | 100 | return p 101 | 102 | def compute_cost_reg(X, y, w, b, lambda_=1): 103 | """ 104 | Computes the cost for logistic regression 105 | with regularization 106 | 107 | Parameters 108 | ---------- 109 | X : array_like 110 | Shape (m, n+1) 111 | 112 | y : array_like 113 | Shape (m,) 114 | 115 | w: array_like 116 | Parameters of the model 117 | Shape (n+1,) 118 | b: scalar 119 | 120 | Returns 121 | ------- 122 | cost : float 123 | The cost of using theta as the parameter for logistic 124 | regression to fit the data points in X and y 125 | 126 | """ 127 | # number of training examples 128 | m = X.shape[0] 129 | 130 | # You need to return the following variables correctly 131 | cost = 0 132 | 133 | f = sigmoid(np.dot(X, w) + b) 134 | reg = (lambda_/(2*m)) * np.sum(np.square(w)) 135 | cost = (1/m)*np.sum(-y*np.log(f) - (1-y)*np.log(1-f)) + reg 136 | return cost 137 | 138 | 139 | def compute_gradient_reg(X, y, w, b, lambda_=1): 140 | """ 141 | Computes the gradient for logistic regression 142 | with regularization 143 | 144 | Parameters 145 | ---------- 146 | X : array_like 147 | Shape (m, n+1) 148 | 149 | y : array_like 150 | Shape (m,) 151 | 152 | w : array_like 153 | Parameters of the model 154 | Shape (n+1,) 155 | b : scalar 156 | 157 | Returns 158 | ------- 159 | db: scalar 160 | dw: array_like 161 | Shape (n+1,) 162 | 163 | """ 164 | # number of training examples 165 | m = X.shape[0] 166 | 167 | # You need to return the following variables correctly 168 | cost = 0 169 | dw = np.zeros_like(w) 170 | 171 | f = sigmoid(np.dot(X, w) + b) 172 | err = (f - y) 173 | dw = (1/m)*np.dot(X.T, err) 174 | dw += (lambda_/m) * w 175 | db = (1/m) * np.sum(err) 176 | 177 | #print(db,dw) 178 | 179 | return db,dw 180 | -------------------------------------------------------------------------------- /2024年考研807真题.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "dc065fb3", 6 | "metadata": {}, 7 | "source": [ 8 | "## 2024年 信息技术基础综合(807)清华大学考研真题 \n", 9 | "\n", 10 | "---\n", 11 | "\n", 12 | "### 一、判断题 一道题3分\n", 13 | "\n", 14 | "1. 支持向量机既可以进行线性分类也可以进行非线性分类。\n", 15 | "2. 自组织映射是竞争神经网络。\n", 16 | "3. 召回率是判断为阴性的样本在全体阴性样本中的比例。\n", 17 | "4. 样本的特征越多越有利于分类。\n", 18 | "5. C均值是有监督模式分类算法。\n", 19 | "6. 假设样本有N维,Fisher判别是将样本投影到N-1维。\n", 20 | "7. 当样本数量较多时,用k-NN算法的剪辑算法能取得较好效果。\n", 21 | "8. 主成分分析和K-L变换在本质上是相同的。\n", 22 | "\n", 23 | "### 二、单选题 一道题3分\n", 24 | "\n", 25 | "1. {1, 2, 3, 4, -9, 0}的L1范数是\n", 26 | "+ A. 6\n", 27 | "+ B. 1\n", 28 | "+ C. 19\n", 29 | "+ D. 111\n", 30 | "\n", 31 | "2. 以下关于特征选择个数的说法正确的是\n", 32 | "+ A. 特征选择的个数越多越好\n", 33 | "+ B. 特征选择应选择能最好的区分不同样本的特征\n", 34 | "+ C. 特征选择的个数越少越好\n", 35 | "+ D. 以上说法都不对\n", 36 | "\n", 37 | "3. 下面不是决策树后剪枝方法的是\n", 38 | "+ A. 信息增益的统计显著性分析\n", 39 | "+ B. 减少分类错误修剪法\n", 40 | "+ C. 最小代价与复杂性的折中\n", 41 | "+ D. 最小描述长度准则\n", 42 | "\n", 43 | "4. 下面关于支持向量机软间隔和硬间隔说法错误的是\n", 44 | "+ A. 软间隔有利于最大化分类间隔\n", 45 | "+ B. 软间隔可以容忍错分样本\n", 46 | "+ C. 硬间隔有利于消除过拟合\n", 47 | "+ D. 硬间隔保证所有样本都是正确分类的\n", 48 | "\n", 49 | "5. 混合高斯模型计算概率密度使用的是\n", 50 | "+ A. 贝叶斯准则\n", 51 | "+ B. 模型准则\n", 52 | "+ C. 忘了\n", 53 | "+ D. 训练准则\n", 54 | "\n", 55 | "6. 下面对感知器说法错误的是\n", 56 | "+ A. 感知器可以解决非线性可分问题\n", 57 | "+ B. 给感知器设定阈值后可以用于分类\n", 58 | "+ C. 感知器是一个简单的前馈神经网络\n", 59 | "+ D. 
感知器的阈值会影响判别面位置\n", 60 | "\n", 61 | "7. 有一种产品60%由工厂A生产,40%由工厂B生产,甲乙两人各自买了一个这种产品,请问他们买的产品来自于不同工厂的概率是\n", 62 | "+ A. 忘了\n", 63 | "+ B. 48%\n", 64 | "+ C. 忘了\n", 65 | "+ D. 忘了\n", 66 | "\n", 67 | "### 三、多选题 一道题4分,多选不得分,少选得2分\n", 68 | "\n", 69 | "1. 以下是sigmoid函数的优点的是\n", 70 | "+ A. 处处连续,方便求导\n", 71 | "+ B. 可以将数值压缩在[0,1]之中\n", 72 | "+ C. 在深层反馈网络中不易产生梯度消失\n", 73 | "+ D. 可以用于二分类问题\n", 74 | "\n", 75 | "2. 以下是决策树方法的是\n", 76 | "+ A. ID3\n", 77 | "+ B. C4.5\n", 78 | "+ C. CART\n", 79 | "+ D. CNN\n", 80 | "\n", 81 | "3. 以下是生成模型的是\n", 82 | "+ A. 支持向量机\n", 83 | "+ B. 朴素贝叶斯\n", 84 | "+ C. 隐马尔科夫模型\n", 85 | "+ D. 高斯混合模型\n", 86 | "\n", 87 | "4. 以下是影响k-NN算法结果的因素是\n", 88 | "+ A. 最近邻样本的距离\n", 89 | "+ B. 相似性度量\n", 90 | "+ C. 对样本分类的方法\n", 91 | "+ D. k的大小\n", 92 | "\n", 93 | "5. 如果以特征向量的相关系数作为模式相似性测度,则影响聚类算法结果的主要因素有\n", 94 | "+ A. 已知类别样本质量\n", 95 | "+ B. 分类准则\n", 96 | "+ C. 特征选取\n", 97 | "+ D. 量纲\n", 98 | "\n", 99 | "### 四、计算题 12分\n", 100 | "\n", 101 | "已知两类样本$\\omega_1$:$\\{(1, 0),(2, 0),(1, 1)\\}$,$\\omega_2$:$\\{(-1, 0),(0, 1),(-1, 1)\\}$,两类样本概率相等(具体的数没记住,题目问法是正确的)\n", 102 | "\n", 103 | "(1) 求类内离散度矩阵$\\mathbf{S}_{\\text{w}}$(6分)\n", 104 | "\n", 105 | "(2) 求类间离散度矩阵$\\mathbf{S}_{\\text{b}}$(6分)\n", 106 | "\n", 107 | "### 五、计算题 12分\n", 108 | "\n", 109 | "(此题为模式分类P165-T3改编)\n", 110 | "\n", 111 | "设$p(x)$为$0$到$a$之间的均匀分布,即$p(x)\\sim U(0,a)$,Parzen窗估计函数为$\\displaystyle\\varphi (x)=\\begin{cases}e^{-x},&x>0\\\\0,&x\\leqslant 0\\end{cases}$\n", 112 | "\n", 113 | "(1) 设宽窗参数为$h_n$,求Parzen窗估计$\\bar{p}_n(x)$(分值忘了)\n", 114 | "\n", 115 | "(2) 要使在$0$到$a$之间,$99\\%$的情况下估计值的误差都小于$1\\%$,求$h_n$的取值范围(分值忘了)\n", 116 | "\n", 117 | "\n", 118 | "### 六、计算题 20分\n", 119 | "\n", 120 | "模式识别第三版/第四版教材上贝叶斯决策的例题(正常细胞异常细胞的那个)\n", 121 | "\n", 122 | "(1) 求最小错误率贝叶斯决策(8分)\n", 123 | "\n", 124 | "(2) 求最小风险贝叶斯决策(8分)\n", 125 | "\n", 126 | "(3) 解释两个决策结果的差异(6分)\n", 127 | "\n", 128 | "\n", 129 | "### 七、计算题 20分\n", 130 | "\n", 131 | "(此题为模式分类P52-T2改编)\n", 132 | "\n", 133 | "设概率密度函数$\\displaystyle p(x|\\omega_i)\\propto \\exp\\{-\\frac{|x-a_i|}{b_i}\\}$,$a_i>0$,$b_i>0$\n", 134 | "\n", 135 | "(1) 求概率密度函数表达式,即将概率密度归一化(分值忘了)\n", 136 | "\n", 137 | "(2) 求似然比$\\displaystyle \\frac{p(x|\\omega_1)}{p(x|\\omega_2)}$(分值忘了)\n", 138 | "\n", 139 | "(3) 当$a_1=0$,$b_1=1$,$a_2=1$,$b_2=2$时,求出似然比,并画出似然比的图像(分值忘了)\n", 140 | "\n", 141 | "### 八、计算题 21分\n", 142 | "\n", 143 | "(1) 已知两个样本$\\mathbf{x}_1$,$\\mathbf{x}_2$,求出最优分类超平面将两个样本分开,使$\\mathbf{x}_1$处于平面负侧(15分)\n", 144 | "\n", 145 | "(2) 验证$\\mathbf{x}_1$处于平面负侧(6分)" 146 | ] 147 | } 148 | ], 149 | "metadata": { 150 | "kernelspec": { 151 | "display_name": "Python 3 (ipykernel)", 152 | "language": "python", 153 | "name": "python3" 154 | }, 155 | "language_info": { 156 | "codemirror_mode": { 157 | "name": "ipython", 158 | "version": 3 159 | }, 160 | "file_extension": ".py", 161 | "mimetype": "text/x-python", 162 | "name": "python", 163 | "nbconvert_exporter": "python", 164 | "pygments_lexer": "ipython3", 165 | "version": "3.9.13" 166 | } 167 | }, 168 | "nbformat": 4, 169 | "nbformat_minor": 5 170 | } 171 | -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/public_tests.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import math 3 | 4 | def sigmoid_test(target): 5 | assert np.isclose(target(3.0), 0.9525741268224334), "Failed for scalar input" 6 | assert np.allclose(target(np.array([2.5, 0])), [0.92414182, 0.5]), "Failed for 1D array" 7 | assert np.allclose(target(np.array([[2.5, -2.5], [0, 1]])), 8 
| [[0.92414182, 0.07585818], [0.5, 0.73105858]]), "Failed for 2D array" 9 | print('\033[92mAll tests passed!') 10 | 11 | def compute_cost_test(target): 12 | X = np.array([[0, 0, 0, 0]]).T 13 | y = np.array([0, 0, 0, 0]) 14 | w = np.array([0]) 15 | b = 1 16 | result = target(X, y, w, b) 17 | if math.isinf(result): 18 | raise ValueError("Did you get the sigmoid of z_wb?") 19 | 20 | np.random.seed(17) 21 | X = np.random.randn(5, 2) 22 | y = np.array([1, 0, 0, 1, 1]) 23 | w = np.random.randn(2) 24 | b = 0 25 | result = target(X, y, w, b) 26 | assert np.isclose(result, 2.15510667), f"Wrong output. Expected: {2.15510667} got: {result}" 27 | 28 | X = np.random.randn(4, 3) 29 | y = np.array([1, 1, 0, 0]) 30 | w = np.random.randn(3) 31 | b = 0 32 | 33 | result = target(X, y, w, b) 34 | assert np.isclose(result, 0.80709376), f"Wrong output. Expected: {0.80709376} got: {result}" 35 | 36 | X = np.random.randn(4, 3) 37 | y = np.array([1, 0,1, 0]) 38 | w = np.random.randn(3) 39 | b = 3 40 | result = target(X, y, w, b) 41 | assert np.isclose(result, 0.4529660647), f"Wrong output. Expected: {0.4529660647} got: {result}. Did you inizialized z_wb = b?" 42 | 43 | print('\033[92mAll tests passed!') 44 | 45 | def compute_gradient_test(target): 46 | np.random.seed(1) 47 | X = np.random.randn(7, 3) 48 | y = np.array([1, 0, 1, 0, 1, 1, 0]) 49 | test_w = np.array([1, 0.5, -0.35]) 50 | test_b = 1.7 51 | dj_db, dj_dw = target(X, y, test_w, test_b) 52 | 53 | assert np.isclose(dj_db, 0.28936094), f"Wrong value for dj_db. Expected: {0.28936094} got: {dj_db}" 54 | assert dj_dw.shape == test_w.shape, f"Wrong shape for dj_dw. Expected: {test_w.shape} got: {dj_dw.shape}" 55 | assert np.allclose(dj_dw, [-0.11999166, 0.41498775, -0.71968405]), f"Wrong values for dj_dw. Got: {dj_dw}" 56 | 57 | print('\033[92mAll tests passed!') 58 | 59 | def predict_test(target): 60 | np.random.seed(5) 61 | b = 0.5 62 | w = np.random.randn(3) 63 | X = np.random.randn(8, 3) 64 | 65 | result = target(X, w, b) 66 | wrong_1 = [1., 1., 0., 0., 1., 0., 0., 1.] 67 | expected_1 = [1., 1., 1., 0., 1., 0., 0., 1.] 68 | if np.allclose(result, wrong_1): 69 | raise ValueError("Did you apply the sigmoid before applying the threshold?") 70 | assert result.shape == (len(X),), f"Wrong length. Expected : {(len(X),)} got: {result.shape}" 71 | assert np.allclose(result, expected_1), f"Wrong output: Expected : {expected_1} got: {result}" 72 | 73 | b = -1.7 74 | w = np.random.randn(4) + 0.6 75 | X = np.random.randn(6, 4) 76 | 77 | result = target(X, w, b) 78 | expected_2 = [0., 0., 0., 1., 1., 0.] 79 | assert result.shape == (len(X),), f"Wrong length. Expected : {(len(X),)} got: {result.shape}" 80 | assert np.allclose(result,expected_2), f"Wrong output: Expected : {expected_2} got: {result}" 81 | 82 | print('\033[92mAll tests passed!') 83 | 84 | def compute_cost_reg_test(target): 85 | np.random.seed(1) 86 | w = np.random.randn(3) 87 | b = 0.4 88 | X = np.random.randn(6, 3) 89 | y = np.array([0, 1, 1, 0, 1, 1]) 90 | lambda_ = 0.1 91 | expected_output = target(X, y, w, b, lambda_) 92 | 93 | assert np.isclose(expected_output, 0.5469746792761936), f"Wrong output. Expected: {0.5469746792761936} got:{expected_output}" 94 | 95 | w = np.random.randn(5) 96 | b = -0.6 97 | X = np.random.randn(8, 5) 98 | y = np.array([1, 0, 1, 0, 0, 1, 0, 1]) 99 | lambda_ = 0.01 100 | output = target(X, y, w, b, lambda_) 101 | assert np.isclose(output, 1.2608591964119995), f"Wrong output. 
Expected: {1.2608591964119995} got:{output}" 102 | 103 | w = np.array([2, 2, 2, 2, 2]) 104 | b = 0 105 | X = np.zeros((8, 5)) 106 | y = np.array([0.5] * 8) 107 | lambda_ = 3 108 | output = target(X, y, w, b, lambda_) 109 | expected = -np.log(0.5) + 3. / (2. * 8.) * 20. 110 | assert np.isclose(output, expected), f"Wrong output. Expected: {expected} got:{output}" 111 | 112 | print('\033[92mAll tests passed!') 113 | 114 | def compute_gradient_reg_test(target): 115 | np.random.seed(1) 116 | w = np.random.randn(5) 117 | b = 0.2 118 | X = np.random.randn(7, 5) 119 | y = np.array([0, 1, 1, 0, 1, 1, 0]) 120 | lambda_ = 0.1 121 | expected1 = (-0.1506447567869257, np.array([ 0.19530838, -0.00632206, 0.19687367, 0.15741161, 0.02791437])) 122 | dj_db, dj_dw = target(X, y, w, b, lambda_) 123 | 124 | assert np.isclose(dj_db, expected1[0]), f"Wrong dj_db. Expected: {expected1[0]} got: {dj_db}" 125 | assert np.allclose(dj_dw, expected1[1]), f"Wrong dj_dw. Expected: {expected1[1]} got: {dj_dw}" 126 | 127 | 128 | w = np.random.randn(7) 129 | b = 0 130 | X = np.random.randn(7, 7) 131 | y = np.array([1, 0, 0, 0, 1, 1, 0]) 132 | lambda_ = 0 133 | expected2 = (0.02660329857573818, np.array([ 0.23567643, -0.06921029, -0.19705212, -0.0002884 , 0.06490588, 134 | 0.26948175, 0.10777992])) 135 | dj_db, dj_dw = target(X, y, w, b, lambda_) 136 | assert np.isclose(dj_db, expected2[0]), f"Wrong dj_db. Expected: {expected2[0]} got: {dj_db}" 137 | assert np.allclose(dj_dw, expected2[1]), f"Wrong dj_dw. Expected: {expected2[1]} got: {dj_dw}" 138 | 139 | print('\033[92mAll tests passed!') 140 | -------------------------------------------------------------------------------- /第00章 补充的数学知识.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## 说明\n", 8 | "\n", 9 | "书中涉及到大量的数学分析和矩阵论的知识。在非数学专业本科学的内容不能完全覆盖书中需要的知识点。下面是我的补充。\n", 10 | "\n", 11 | "### 一、矩阵求导规则\n", 12 | "\n", 13 | "书上用到了很多向量和矩阵的求导,这里介绍一下:\n", 14 | "\n", 15 | "#### 1. 对向量求导\n", 16 | "\n", 17 | "$$\\begin{aligned}\n", 18 | "\\frac{\\partial \\mathbf{Ax}}{\\partial \\mathbf{x}} &= \\mathbf{A}^\\text{T}\\\\\n", 19 | "\\frac{\\partial \\mathbf{Ax}}{\\partial \\mathbf{x}^\\text{T}} &= \\mathbf{A}\\\\\n", 20 | "\\frac{\\partial \\mathbf{x}^\\text{T}\\mathbf{A}}{\\partial \\mathbf{x}} &= \\mathbf{A}\\\\\n", 21 | "\\frac{\\partial \\mathbf{x}^\\text{T}\\mathbf{A}\\mathbf{x}}{\\partial \\mathbf{x}} &= (\\mathbf{A}^\\text{T} + \\mathbf{A})\\mathbf{x} \\\\\n", 22 | "\\frac{\\partial \\mathbf{x\\cdot x}}{\\partial \\mathbf{x}} &= \\frac{\\partial \\mathbf{x}^\\text{T}\\mathbf{x}}{\\partial \\mathbf{x}} = 2\\mathbf{x}\\\\\n", 23 | "\\frac{\\partial \\mathbf{w\\cdot x}}{\\partial \\mathbf{x}} &= \\frac{\\partial \\mathbf{w}^\\text{T}\\mathbf{x}}{\\partial \\mathbf{x}} = \\mathbf{w}\n", 24 | "\\end{aligned}$$\n", 25 | "\n", 26 | "#### 2. 
对矩阵求导\n", 27 | "\n", 28 | "对矩阵求导规则比较复杂,这篇文章写的很好:[【必读】3分钟带你了解标量对矩阵求导方法](https://zhuanlan.zhihu.com/p/279209775)。\n", 29 | "\n", 30 | "另外我再总结一下书中证明要用到的的公式(如果你按照上面文章的方法学会了对矩阵求导的通用方法,可以很容易自己证明出来)\n", 31 | "\n", 32 | "$$\\begin{aligned}\n", 33 | "\\\\d(\\mathbf{AX}) &= \\mathbf{A}d\\mathbf{X}\\\\\n", 34 | "\\frac{\\partial tr(\\mathbf{X}^\\text{T}\\mathbf{A}\\mathbf{X})}{\\partial \\mathbf{X}} &= (\\mathbf{A}^\\text{T} + \\mathbf{A})\\mathbf{X} \\\\\n", 35 | "\\frac{\\partial \\ln|\\mathbf{X}|}{\\partial \\mathbf{X}} &= \\mathbf{X}^{-1}\\\\\n", 36 | "\\frac{\\partial \\mathbf{w}^{\\text{T}}\\mathbf{X}^{-1}\\mathbf{w}}{\\partial\\mathbf{X}} &= -\\mathbf{X}^{-1}\\mathbf{w}\\mathbf{w}^{\\text{T}}\\mathbf{X}^{-1},若\\mathbf{X}为对称阵\n", 37 | "\\end{aligned}$$\n", 38 | "\n", 39 | "### 二、多维随机变量的统计\n", 40 | "\n", 41 | "假设$\\mathbf{x}_1,\\mathbf{x}_2,\\cdots,\\mathbf{x}_N$是$d$维随机变量的样本,则\n", 42 | "\n", 43 | "#### 1. 样本均值\n", 44 | "\n", 45 | "$$\n", 46 | "\\mathbf{m} = \\frac{1}{N}\\sum_{i=1}^N\\mathbf{x}_i\n", 47 | "$$\n", 48 | "\n", 49 | "#### 2. 样本协方差矩阵(注意这个公式不求平均)\n", 50 | "\n", 51 | "$$\n", 52 | "\\mathbf{S}_{\\text{w}} = \\sum_{i=1}^N (\\mathbf{x}_i - \\mathbf{m})(\\mathbf{x}_i - \\mathbf{m})^{\\text{T}} = \\mathbf{X}^{\\text{T}}\\mathbf{X} - N\\mathbf{m}\\mathbf{m}^\\text{T}\n", 53 | "$$\n", 54 | "\n", 55 | "其中\n", 56 | "\n", 57 | "$$\n", 58 | "\\mathbf{X}^{\\text{T}}\\mathbf{X}=\n", 59 | "\\left[\\begin{matrix} \n", 60 | "\\mathbf{x}_1,\\mathbf{x}_2,\\cdots,\\mathbf{x}_N\n", 61 | "\\end{matrix}\\right]\n", 62 | "\\left[\\begin{matrix}\n", 63 | "\\mathbf{x}_1^{\\text{T}} \\\\\n", 64 | "\\mathbf{x}_2^{\\text{T}} \\\\\n", 65 | "\\vdots \\\\\n", 66 | "\\mathbf{x}_N^{\\text{T}}\n", 67 | "\\end{matrix}\\right]\n", 68 | "$$\n", 69 | "\n", 70 | "#### 3. 样本二阶矩阵\n", 71 | "\n", 72 | "$$\n", 73 | "\\sum_{i=1}^N \\mathbf{x}_i\\mathbf{x}_i^{\\text{T}} = \\mathbf{X}^{\\text{T}}\\mathbf{X}\n", 74 | "$$\n", 75 | "\n", 76 | "### 三、超空间几何\n", 77 | "\n", 78 | "#### 1. 点到超平面距离\n", 79 | "\n", 80 | "假设在$n$维空间中,有超平面$\\mathbf{w}\\cdot \\mathbf{x} + b = 0$,有一点$\\mathbf{x}_0$。\n", 81 | "\n", 82 | "则点到超平面的距离为:\n", 83 | "\n", 84 | "$$d = \\frac{|\\mathbf{w} \\cdot \\mathbf{x}_0 + b|}{||\\mathbf{w}||}$$\n", 85 | "\n", 86 | "#### 2. 平行的超平面的距离\n", 87 | "\n", 88 | "假设在$n$维空间中,有两个平行的超平面\n", 89 | "\n", 90 | "$$\\mathbf{w}\\cdot \\mathbf{x} + b_1 = 0\\\\\n", 91 | " \\mathbf{w}\\cdot \\mathbf{x} + b_2 = 0$$\n", 92 | "\n", 93 | "因为平行的超平面的权向量($\\mathbf{w}$向量)一定成比例,所以上面的公式把两个超平面的权向量化成一样的了,这时就只有$b$不同。\n", 94 | "\n", 95 | "则两个超平面之间的距离为:\n", 96 | "\n", 97 | "$$d = \\frac{|b_1-b_2|}{||\\mathbf{w}||}$$\n", 98 | "\n", 99 | "### 四、进阶版拉格朗日算子法(求多元函数条件极值)\n", 100 | "\n", 101 | "#### 1. 自变量是向量的多元函数\n", 102 | "\n", 103 | "高数中学的拉格朗日算子法的条件只能是等式,下面给出条件有不等式的求法:\n", 104 | "\n", 105 | "假设在约束条件$g(\\mathbf{x})\\leqslant 0$的情况下,求多元函数$f(\\mathbf{x})$的极值。\n", 106 | "\n", 107 | "1. 写出拉格朗日函数\n", 108 | "\n", 109 | "$$\n", 110 | "\\mathscr{L}(\\mathbf{x}) = f(\\mathbf{x}) + \\lambda g(\\mathbf{x})\n", 111 | "$$\n", 112 | "\n", 113 | "2. 写出条件不等式(KKT条件)\n", 114 | "\n", 115 | "$$\\begin{align}\n", 116 | "\\frac{\\partial\\mathscr{L}}{\\partial\\mathbf{x}} = \\frac{\\partial f}{\\partial\\mathbf{x}} + \\lambda\\frac{\\partial g}{\\partial\\mathbf{x}} &= 0 \\\\\n", 117 | "\\frac{\\partial\\mathscr{L}}{\\partial \\lambda} = g(\\mathbf{x}) &\\leqslant 0 \\\\\n", 118 | "\\lambda &\\geqslant 0 \\\\\n", 119 | "\\lambda g(\\mathbf{x}) &= 0\n", 120 | "\\end{align}$$\n", 121 | "\n", 122 | "这组不等式也叫KKT条件\n", 123 | "\n", 124 | "3. 
求解KKT条件\n", 125 | "\n", 126 | "如果有多个约束条件也是同理,把KKT条件扩展一下就行。\n", 127 | "\n", 128 | "#### 2. 自变量是矩阵的函数\n", 129 | "\n", 130 | "自变量是矩阵可能不好理解,但是教材162页的公式用到了这个方法。\n", 131 | "\n", 132 | "自变量是矩阵时就把上面的公式中的常数$\\lambda$换成对角阵$\\boldsymbol{\\Lambda}$,自变量向量$\\mathbf{x}$换成矩阵$\\mathbf{X}$即可,其他都不变。这里就不再赘述了。\n", 133 | "\n", 134 | "### 五、欧拉积分\n", 135 | "\n", 136 | "贝叶斯估计可能会用到欧拉积分。\n", 137 | "\n", 138 | "#### 1. 伽马函数\n", 139 | "\n", 140 | "$$\\begin{aligned}\n", 141 | "\\Gamma(s) &= \\int_0^{+\\infty }x^{s}e^{-x} \\text{d}x \\\\\n", 142 | "\\Gamma(s) &= s\\Gamma(s-1) \\\\\n", 143 | "\\Gamma(s) &= s! \\quad若s为正整数\n", 144 | "\\end{aligned}$$\n", 145 | "\n", 146 | "#### 2. 贝塔函数\n", 147 | "\n", 148 | "$$\\begin{aligned}\n", 149 | "\\beta(x,y) &= \\int_0^1P^{x}(1-P)^{y}\\text{d}P \\\\\n", 150 | "\\beta(x,y) &= \\frac{x!y!}{(x+y+1)!} \\quad若x,y为正整数\n", 151 | "\\end{aligned}$$" 152 | ] 153 | } 154 | ], 155 | "metadata": { 156 | "kernelspec": { 157 | "display_name": "Python 3 (ipykernel)", 158 | "language": "python", 159 | "name": "python3" 160 | }, 161 | "language_info": { 162 | "codemirror_mode": { 163 | "name": "ipython", 164 | "version": 3 165 | }, 166 | "file_extension": ".py", 167 | "mimetype": "text/x-python", 168 | "name": "python", 169 | "nbconvert_exporter": "python", 170 | "pygments_lexer": "ipython3", 171 | "version": "3.9.16" 172 | } 173 | }, 174 | "nbformat": 4, 175 | "nbformat_minor": 1 176 | } 177 | -------------------------------------------------------------------------------- /资料/7.Practice lab decision trees/public_tests.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | def compute_entropy_test(target): 4 | y = np.array([1] * 10) 5 | result = target(y) 6 | 7 | assert result == 0, "Entropy must be 0 with array of ones" 8 | 9 | y = np.array([0] * 10) 10 | result = target(y) 11 | 12 | assert result == 0, "Entropy must be 0 with array of zeros" 13 | 14 | y = np.array([0] * 12 + [1] * 12) 15 | result = target(y) 16 | 17 | assert result == 1, "Entropy must be 1 with same ammount of ones and zeros" 18 | 19 | y = np.array([1, 0, 1, 0, 1, 1, 1, 0, 1]) 20 | assert np.isclose(target(y), 0.918295, atol=1e-6), "Wrong value. Something between 0 and 1" 21 | assert np.isclose(target(-y + 1), target(y), atol=1e-6), "Wrong value" 22 | 23 | print("\033[92m All tests passed.") 24 | 25 | def split_dataset_test(target): 26 | X = np.array([[1, 0], 27 | [1, 0], 28 | [1, 1], 29 | [0, 0], 30 | [0, 1]]) 31 | X_t = np.array([[0, 1, 0, 1, 0]]) 32 | X = np.concatenate((X, X_t.T), axis=1) 33 | 34 | left, right = target(X, list(range(5)), 2) 35 | expected = {'left': np.array([1, 3]), 36 | 'right': np.array([0, 2, 4])} 37 | 38 | assert type(left) == list, f"Wrong type for left. Expected: list got: {type(left)}" 39 | assert type(right) == list, f"Wrong type for right. Expected: list got: {type(right)}" 40 | 41 | assert type(left[0]) == int, f"Wrong type for elements in the left list. Expected: int got: {type(left[0])}" 42 | assert type(right[0]) == int, f"Wrong type for elements in the right list. Expected: number got: {type(right[0])}" 43 | 44 | assert len(left) == 2, f"left must have 2 elements but got: {len(left)}" 45 | assert len(right) == 3, f"right must have 3 elements but got: {len(right)}" 46 | 47 | assert np.allclose(right, expected['right']), f"Wrong value for right. Expected: { expected['right']} \ngot: {right}" 48 | assert np.allclose(left, expected['left']), f"Wrong value for left. 
Expected: { expected['left']} \ngot: {left}" 49 | 50 | X = np.array([[0, 1], 51 | [1, 1], 52 | [1, 1], 53 | [0, 0], 54 | [1, 0]]) 55 | X_t = np.array([[0, 1, 0, 1, 0]]) 56 | X = np.concatenate((X_t.T, X), axis=1) 57 | 58 | left, right = target(X, list(range(5)), 0) 59 | expected = {'left': np.array([1, 3]), 60 | 'right': np.array([0, 2, 4])} 61 | 62 | assert np.allclose(right, expected['right']) and np.allclose(left, expected['left']), f"Wrong value when target is at index 0." 63 | 64 | X = (np.random.rand(11, 3) > 0.5) * 1 # Just random binary numbers 65 | X_t = np.array([[0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0]]) 66 | X = np.concatenate((X, X_t.T), axis=1) 67 | 68 | left, right = target(X, [1, 2, 3, 6, 7, 9, 10], 3) 69 | expected = {'left': np.array([1, 3, 6]), 70 | 'right': np.array([2, 7, 9, 10])} 71 | 72 | assert np.allclose(right, expected['right']) and np.allclose(left, expected['left']), f"Wrong value when target is at index 0. \nExpected: {expected} \ngot: \{left:{left}, 'right': {right}\}" 73 | 74 | 75 | print("\033[92m All tests passed.") 76 | 77 | def compute_information_gain_test(target): 78 | X = np.array([[1, 0], 79 | [1, 0], 80 | [1, 0], 81 | [0, 0], 82 | [0, 1]]) 83 | 84 | y = np.array([[0, 0, 0, 0, 0]]).T 85 | node_indexes = list(range(5)) 86 | 87 | result1 = target(X, y, node_indexes, 0) 88 | result2 = target(X, y, node_indexes, 0) 89 | 90 | assert result1 == 0 and result2 == 0, f"Information gain must be 0 when target variable is pure. Got {result1} and {result2}" 91 | 92 | y = np.array([[0, 1, 0, 1, 0]]).T 93 | node_indexes = list(range(5)) 94 | 95 | result = target(X, y, node_indexes, 0) 96 | assert np.isclose(result, 0.019973, atol=1e-6), f"Wrong information gain. Expected {0.019973} got: {result}" 97 | 98 | result = target(X, y, node_indexes, 1) 99 | assert np.isclose(result, 0.170951, atol=1e-6), f"Wrong information gain. Expected {0.170951} got: {result}" 100 | 101 | node_indexes = list(range(4)) 102 | result = target(X, y, node_indexes, 0) 103 | assert np.isclose(result, 0.311278, atol=1e-6), f"Wrong information gain. Expected {0.311278} got: {result}" 104 | 105 | result = target(X, y, node_indexes, 1) 106 | assert np.isclose(result, 0, atol=1e-6), f"Wrong information gain. Expected {0.0} got: {result}" 107 | 108 | print("\033[92m All tests passed.") 109 | 110 | def get_best_split_test(target): 111 | X = np.array([[1, 0], 112 | [1, 0], 113 | [1, 0], 114 | [0, 0], 115 | [0, 1]]) 116 | 117 | y = np.array([[0, 0, 0, 0, 0]]).T 118 | node_indexes = list(range(5)) 119 | 120 | result = target(X, y, node_indexes) 121 | 122 | assert result == -1, f"When the target variable is pure, there is no best split to do. Expected -1, got {result}" 123 | 124 | y = X[:,0] 125 | result = target(X, y, node_indexes) 126 | assert result == 0, f"If the target is fully correlated with other feature, that feature must be the best split. Expected 0, got {result}" 127 | y = X[:,1] 128 | result = target(X, y, node_indexes) 129 | assert result == 1, f"If the target is fully correlated with other feature, that feature must be the best split. Expected 1, got {result}" 130 | 131 | y = 1 - X[:,0] 132 | result = target(X, y, node_indexes) 133 | assert result == 0, f"If the target is fully correlated with other feature, that feature must be the best split. Expected 0, got {result}" 134 | 135 | y = np.array([[0, 1, 0, 1, 0]]).T 136 | result = target(X, y, node_indexes) 137 | assert result == 1, f"Wrong result. 
Expected 1, got {result}" 138 | 139 | y = np.array([[0, 1, 0, 1, 0]]).T 140 | node_indexes = [2, 3, 4] 141 | result = target(X, y, node_indexes) 142 | assert result == 0, f"Wrong result. Expected 0, got {result}" 143 | 144 | n_samples = 100 145 | X0 = np.array([[1] * n_samples]) 146 | X1 = np.array([[0] * n_samples]) 147 | X2 = (np.random.rand(1, 100) > 0.5) * 1 148 | X3 = np.array([[1] * int(n_samples / 2) + [0] * int(n_samples / 2)]) 149 | 150 | y = X2.T 151 | node_indexes = list(range(20, 80)) 152 | X = np.array([X0, X1, X2, X3]).T.reshape(n_samples, 4) 153 | result = target(X, y, node_indexes) 154 | 155 | assert result == 2, f"Wrong result. Expected 2, got {result}" 156 | 157 | y = X0.T 158 | result = target(X, y, node_indexes) 159 | assert result == -1, f"When the target variable is pure, there is no best split to do. Expected -1, got {result}" 160 | print("\033[92m All tests passed.") -------------------------------------------------------------------------------- /资料/9.Week 3 practice lab logistic regression/test_utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from copy import deepcopy 3 | 4 | 5 | def datatype_check(expected_output, target_output, error): 6 | success = 0 7 | if isinstance(target_output, dict): 8 | for key in target_output.keys(): 9 | try: 10 | success += datatype_check(expected_output[key], 11 | target_output[key], error) 12 | except: 13 | print("Error: {} in variable {}. Got {} but expected type {}".format(error, 14 | key, 15 | type( 16 | target_output[key]), 17 | type(expected_output[key]))) 18 | if success == len(target_output.keys()): 19 | return 1 20 | else: 21 | return 0 22 | elif isinstance(target_output, tuple) or isinstance(target_output, list): 23 | for i in range(len(target_output)): 24 | try: 25 | success += datatype_check(expected_output[i], 26 | target_output[i], error) 27 | except: 28 | print("Error: {} in variable {}, expected type: {} but expected type {}".format(error, 29 | i, 30 | type( 31 | target_output[i]), 32 | type(expected_output[i] 33 | ))) 34 | if success == len(target_output): 35 | return 1 36 | else: 37 | return 0 38 | 39 | else: 40 | assert isinstance(target_output, type(expected_output)) 41 | return 1 42 | 43 | 44 | def equation_output_check(expected_output, target_output, error): 45 | success = 0 46 | if isinstance(target_output, dict): 47 | for key in target_output.keys(): 48 | try: 49 | success += equation_output_check(expected_output[key], 50 | target_output[key], error) 51 | except: 52 | print("Error: {} for variable {}.".format(error, 53 | key)) 54 | if success == len(target_output.keys()): 55 | return 1 56 | else: 57 | return 0 58 | elif isinstance(target_output, tuple) or isinstance(target_output, list): 59 | for i in range(len(target_output)): 60 | try: 61 | success += equation_output_check(expected_output[i], 62 | target_output[i], error) 63 | except: 64 | print("Error: {} for variable in position {}.".format(error, i)) 65 | if success == len(target_output): 66 | return 1 67 | else: 68 | return 0 69 | 70 | else: 71 | if hasattr(target_output, 'shape'): 72 | np.testing.assert_array_almost_equal( 73 | target_output, expected_output) 74 | else: 75 | assert target_output == expected_output 76 | return 1 77 | 78 | 79 | def shape_check(expected_output, target_output, error): 80 | success = 0 81 | if isinstance(target_output, dict): 82 | for key in target_output.keys(): 83 | try: 84 | success += shape_check(expected_output[key], 85 | target_output[key], error) 86 
| except: 87 | print("Error: {} for variable {}.".format(error, key)) 88 | if success == len(target_output.keys()): 89 | return 1 90 | else: 91 | return 0 92 | elif isinstance(target_output, tuple) or isinstance(target_output, list): 93 | for i in range(len(target_output)): 94 | try: 95 | success += shape_check(expected_output[i], 96 | target_output[i], error) 97 | except: 98 | print("Error: {} for variable {}.".format(error, i)) 99 | if success == len(target_output): 100 | return 1 101 | else: 102 | return 0 103 | 104 | else: 105 | if hasattr(target_output, 'shape'): 106 | assert target_output.shape == expected_output.shape 107 | return 1 108 | 109 | 110 | def single_test(test_cases, target): 111 | success = 0 112 | for test_case in test_cases: 113 | try: 114 | if test_case['name'] == "datatype_check": 115 | assert isinstance(target(*test_case['input']), 116 | type(test_case["expected"])) 117 | success += 1 118 | if test_case['name'] == "equation_output_check": 119 | assert np.allclose(test_case["expected"], 120 | target(*test_case['input'])) 121 | success += 1 122 | if test_case['name'] == "shape_check": 123 | assert test_case['expected'].shape == target( 124 | *test_case['input']).shape 125 | success += 1 126 | except: 127 | print("Error: " + test_case['error']) 128 | 129 | if success == len(test_cases): 130 | print("\033[92m All tests passed.") 131 | else: 132 | print('\033[92m', success, " Tests passed") 133 | print('\033[91m', len(test_cases) - success, " Tests failed") 134 | raise AssertionError( 135 | "Not all tests were passed for {}. Check your equations and avoid using global variables inside the function.".format(target.__name__)) 136 | 137 | 138 | def multiple_test(test_cases, target): 139 | success = 0 140 | for test_case in test_cases: 141 | try: 142 | test_input = deepcopy(test_case['input']) 143 | target_answer = target(*test_input) 144 | if test_case['name'] == "datatype_check": 145 | success += datatype_check(test_case['expected'], 146 | target_answer, test_case['error']) 147 | if test_case['name'] == "equation_output_check": 148 | success += equation_output_check( 149 | test_case['expected'], target_answer, test_case['error']) 150 | if test_case['name'] == "shape_check": 151 | success += shape_check(test_case['expected'], 152 | target_answer, test_case['error']) 153 | except: 154 | print('\33[30m', "Error: " + test_case['error']) 155 | 156 | if success == len(test_cases): 157 | print("\033[92m All tests passed.") 158 | else: 159 | print('\033[92m', success, " Tests passed") 160 | print('\033[91m', len(test_cases) - success, " Tests failed") 161 | raise AssertionError( 162 | "Not all tests were passed for {}. Check your equations and avoid using global variables inside the function.".format(target.__name__)) 163 | 164 | -------------------------------------------------------------------------------- /第02章 统计决策方法.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "5fdc03ba", 6 | "metadata": {}, 7 | "source": [ 8 | "# 第二章 统计决策方法\n", 9 | "---" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "id": "8b90dc9e", 15 | "metadata": {}, 16 | "source": [ 17 | "## 二轮总结笔记\n", 18 | "\n", 19 | "### 一、三种贝叶斯决策\n", 20 | "\n", 21 | "#### 1. 最小错误率贝叶斯决策\n", 22 | "\n", 23 | "##### (1) 对于$c$类的判别\n", 24 | "\n", 25 | "1. 判别函数:\n", 26 | "\n", 27 | "$$对于\\omega_i类,g_i(\\mathbf{x}) = p(\\mathbf{x}|\\omega_i)P(\\omega_i)$$\n", 28 | "\n", 29 | "2. 
判别规则:\n", 30 | "\n", 31 | "$$若g_i(\\mathbf{x}) = \\max\\limits_{j=1,\\cdots,c}{g_j(\\mathbf{x})},则\\mathbf{x}\\in\\omega_i$$\n", 32 | "\n", 33 | "3. 决策面:\n", 34 | "\n", 35 | "$$g_i(\\mathbf{x}) = g_j(\\mathbf{x})$$\n", 36 | "\n", 37 | "##### (2) 对于两类的判别\n", 38 | "\n", 39 | "1. 判别函数:\n", 40 | "\n", 41 | "$$g(\\mathbf{x}) = p(\\mathbf{x}|\\omega_1)P(\\omega_1) - p(\\mathbf{x}|\\omega_2)P(\\omega_2)$$\n", 42 | "\n", 43 | "2. 判别规则:\n", 44 | "\n", 45 | "$$若l(\\mathbf{x}) = \\frac{p(\\mathbf{x}|\\omega_1)}{p(\\mathbf{x}|\\omega_2)} \\gtrless \\lambda = \\frac{P(\\omega_2)}{P(\\omega_1)},则\\mathbf{x}\\in\\begin{cases} \\omega_1 \\\\ \\omega_2 \\end{cases}$$\n", 46 | "\n", 47 | "其中,\n", 48 | "\n", 49 | "$$p(\\mathbf{x}|\\omega_1)叫似然度,l(\\mathbf{x})叫似然比$$\n", 50 | "\n", 51 | "3. 决策面:\n", 52 | "\n", 53 | "$$g(\\mathbf{x}) = 0$$\n", 54 | "\n", 55 | "#### 2. 最小风险贝叶斯决策\n", 56 | "\n", 57 | "##### (1) 对于$c$类的判别\n", 58 | "\n", 59 | "判别规则:\n", 60 | "\n", 61 | "$$若R(\\alpha_i|\\mathbf{x}) = \\min\\limits_{j=1,\\cdots,k}R(\\alpha_j|\\mathbf{x}),则决策\\alpha=\\alpha_i$$\n", 62 | "\n", 63 | "其中,\n", 64 | "\n", 65 | "$$R(\\alpha_i|\\mathbf{x})=\\sum_{j=1}^c\\lambda(\\alpha_i,\\omega_j)P(\\omega_j|\\mathbf{x})$$\n", 66 | "\n", 67 | "##### (2) 对于两类的判别\n", 68 | "\n", 69 | "判别规则:\n", 70 | "\n", 71 | "$$若l(\\mathbf{x}) = \\frac{p(\\mathbf{x}|\\omega_1)}{p(\\mathbf{x}|\\omega_2)} \\gtrless \\lambda = \\frac{P(\\omega_2)}{P(\\omega_1)}\\cdot \\frac{\\lambda_{12}-\\lambda_{22}}{\\lambda_{21}-\\lambda_{11}},则\\mathbf{x}\\in\\begin{cases} \\omega_1 \\\\ \\omega_2\\\\\\end{cases}$$\n", 72 | "\n", 73 | "其中,\n", 74 | "\n", 75 | "$$\\lambda_{ij}=\\lambda(\\alpha_i, \\omega_j),即实际情况为\\omega_j时决策为\\alpha_i的风险。$$\n", 76 | "\n", 77 | "#### 3. Neyman-Pearson决策\n", 78 | "\n", 79 | "即固定一类错误率的情况下最小化另一类错误率。\n", 80 | "\n", 81 | "判别规则:\n", 82 | "\n", 83 | "$$若l(\\mathbf{x}) = \\frac{p(\\mathbf{x}|\\omega_1)}{p(\\mathbf{x}|\\omega_2)} \\gtrless \\lambda,则\\mathbf{x}\\in\\begin{cases} \\omega_1 \\\\ \\omega_2\\\\\\end{cases}$$\n", 84 | "\n", 85 | "$\\lambda$由固定的一类错误率计算出,假设固定第二类错误率(假阴性)为$\\varepsilon_0$,则决策边界$\\lambda$保证$\\displaystyle \\int_{\\mathscr{R}_1}p(\\mathbf{x}|\\omega_2)\\text{d}\\mathbf{x}=\\varepsilon_0$\n", 86 | "\n", 87 | "### 二、正态分布时的统计决策\n", 88 | "\n", 89 | "#### 1. 正态分布概率密度公式\n", 90 | "\n", 91 | "$$p(\\mathbf{x})=\\frac{1}{(2\\pi)^{\\frac{d}{2}}|\\boldsymbol{\\Sigma}|^{\\frac{1}{2}}}\\exp\\left\\{-\\frac{1}{2}(\\mathbf{x}-\\boldsymbol{\\mu})^T\\boldsymbol{\\Sigma}^{-1}(\\mathbf{x}-\\boldsymbol{\\mu})\\right\\}$$\n", 92 | "\n", 93 | "#### 2. 正态分布下最小错误率贝叶斯决策的性质\n", 94 | "\n", 95 | "设所有类别的概率密度都服从正态分布,$p(\\mathbf{x}|\\omega_i) \\sim N(\\boldsymbol{\\mu}_i, \\boldsymbol{\\Sigma}_i)$\n", 96 | "\n", 97 | "##### (1) 若$\\boldsymbol{\\Sigma}_i = \\sigma^2\\mathbf{I}$,$\\mathbf{I}$是单位矩阵\n", 98 | "\n", 99 | "$\\omega_i$和$\\omega_j$的决策面是平面,并且与$\\boldsymbol{\\mu}_i$和$\\boldsymbol{\\mu}_j$连线正交(垂直)。\n", 100 | "\n", 101 | "1. 所有先验概率$P(\\omega_i)$相同\n", 102 | "\n", 103 | "决策面不仅与$\\boldsymbol{\\mu}_i$和$\\boldsymbol{\\mu}_j$连线正交,还过$\\boldsymbol{\\mu}_i$和$\\boldsymbol{\\mu}_j$连线的中点,是垂直平分线(面)。\n", 104 | "\n", 105 | "此情况下**最小错误率**贝叶斯决策等价于**最小距离分类器**。\n", 106 | "\n", 107 | "2. 先验概率$P(\\omega_i)$不同\n", 108 | "\n", 109 | "决策面向先验概率小的类偏移,即先验概率大的类占据更大的决策空间。\n", 110 | "\n", 111 | "##### (2) 若$\\boldsymbol{\\Sigma}_i = \\boldsymbol{\\Sigma}$\n", 112 | "\n", 113 | "$\\omega_i$和$\\omega_j$的决策面是平面,但是和$\\boldsymbol{\\mu}_i$和$\\boldsymbol{\\mu}_j$连线**不**正交(**不**垂直)。\n", 114 | "\n", 115 | "1. 
所有先验概率$P(\\omega_i)$相同\n", 116 | "\n", 117 | "决策面过$\\boldsymbol{\\mu}_i$和$\\boldsymbol{\\mu}_j$连线的中点。\n", 118 | "\n", 119 | "2. 先验概率不同\n", 120 | "\n", 121 | "决策面向先验概率小的类偏移,即先验概率大的类占据更大的决策空间。\n", 122 | "\n", 123 | "对于两类情况,决策面为:\n", 124 | "\n", 125 | "$$\\begin{aligned}\n", 126 | "g(\\mathbf{x}) &= \\mathbf{w}^\\text{T}\\mathbf{x} + w_0 = 0 \\\\\n", 127 | "\\mathbf{w} &= \\boldsymbol{\\Sigma}^{-1}(\\boldsymbol{\\mu}_1 - \\boldsymbol{\\mu}_2) \\\\\n", 128 | "w_0 &= -\\frac{1}{2}(\\boldsymbol{\\mu}_1 + \\boldsymbol{\\mu}_2)^\\text{T}\\boldsymbol{\\Sigma}^{-1}(\\boldsymbol{\\mu}_1 - \\boldsymbol{\\mu}_1) - \\ln\\frac{P(\\omega_2)}{P(\\omega_1)}\n", 129 | "\\end{aligned}$$\n", 130 | "\n", 131 | "##### (3) 一般情况\n", 132 | "\n", 133 | "决策面是超二次曲面。\n", 134 | "\n", 135 | "### 三、错误率的计算\n", 136 | "\n", 137 | "#### 1. 贝叶斯错误率计算公式\n", 138 | "\n", 139 | "##### (1) 两类情况\n", 140 | "\n", 141 | "$$\\begin{aligned}\n", 142 | "P(e) &= P(\\omega_1)\\int_{\\mathscr{R}_2}p(\\mathbf{x}|\\omega_1)\\text{d}\\mathbf{x} + P(\\omega_2)\\int_{\\mathscr{R}_1}p(\\mathbf{x}|\\omega_2)\\text{d}\\mathbf{x} \\\\\n", 143 | " &= P(\\omega_1)P_1(e)+P(\\omega_2)P_2(e)\\\\ \n", 144 | "\\end{aligned}$$\n", 145 | "\n", 146 | "其中\n", 147 | "\n", 148 | "$$\\begin{aligned}\n", 149 | "P_1(e) &= \\alpha \\quad (假阳性) \\\\\n", 150 | "P_2(e) &= \\beta \\quad (假阴性)\\\\\n", 151 | "\\end{aligned}$$\n", 152 | "\n", 153 | "另外\n", 154 | "\n", 155 | "$$\\begin{aligned}\n", 156 | "灵敏度\\text{Sn} &= \\frac{\\text{TP}}{\\text{TP}+\\text{FN}} = 1-\\beta \\\\\n", 157 | "特异度\\text{Sp} &= \\frac{\\text{TN}}{\\text{TN}+\\text{FP}} = 1-\\alpha\n", 158 | "\\end{aligned}$$\n", 159 | "\n", 160 | "##### (2) 多类情况\n", 161 | "\n", 162 | "$$\n", 163 | "P(e) = \\int P(e|\\mathbf{x})p(\\mathbf{x})\\text{d}\\mathbf{x} = \\int 1-\\max\\limits_{i}\\{P(\\omega_i|\\mathbf{x})\\}\\text{d}\\mathbf{x}\n", 164 | "$$\n", 165 | "\n", 166 | "#### 2. ROC曲线\n", 167 | "\n", 168 | "横轴为$\\alpha = 1-\\text{Sp}$,纵轴为$1-\\beta = \\text{Sn}$\n", 169 | "\n", 170 | "#### 3. 正态分布且各类协方差矩阵相等情况下错误率的计算\n", 171 | "\n", 172 | "$$\\begin{aligned}\n", 173 | "P_1(e) &= \\int_t^{+\\infty} p(h|\\omega_1)\\text{d}h \\\\\n", 174 | " &= 1-\\Phi(\\frac{t+\\eta}{\\sigma}) \\\\\n", 175 | "P_2(e) &= \\int_{-\\infty}^t p(h|\\omega_2)\\text{d}h \\\\\n", 176 | " &= \\Phi(\\frac{t-\\eta}{\\sigma}) \\\\\n", 177 | "\\end{aligned}$$ \n", 178 | "其中,\n", 179 | "$$\\begin{aligned}\n", 180 | "t &= \\ln\\frac{P(\\omega_1)}{P(\\omega_2)} \\\\\n", 181 | "\\eta &= \\frac{1}{2}(\\boldsymbol{\\mu}_1-\\boldsymbol{\\mu}_2)^T\\boldsymbol{\\Sigma}^{-1}(\\boldsymbol{\\mu}_1 - \\boldsymbol{\\mu}_2) \\\\\n", 182 | "\\sigma &= \\sqrt{2\\eta}\n", 183 | "\\end{aligned}$$\n", 184 | "\n", 185 | "#### 4. 
高维独立随机变量时错误率的估计\n", 186 | "\n", 187 | "$d$维随机变量$\\mathbf{x}$各分量相互独立时,用中心极限定理把$h(\\mathbf{x})$近似为正态分布,按照上面正态分布的公式计算错误率。\n", 188 | "\n", 189 | "近似认为$h(\\mathbf{x}|\\omega_i) \\sim N(\\sum_{i=1}^{d}\\eta_{il},\\sum_{i=1}^{d}\\sigma_{il}^2)$。\n", 190 | "\n", 191 | "其中,$\\eta_{il}$是$\\omega_i$类第$l$个分量的对数似然比$p(h(x_l)|\\omega_i)$的期望,$\\sigma_{il}^2$是$\\omega_i$类第$l$个分量的对数似然比$p(h(x_l)|\\omega_i)$的方差。\n", 192 | "\n", 193 | "### 四、一阶马尔科夫链\n", 194 | "\n", 195 | "对数几率比:\n", 196 | "\n", 197 | "$$S(x)=\\sum_{i=1}^L \\log\\frac{a_{x_{i-1} x_i}^+}{a_{x_{i-1} x_i}^-} = \\sum_{i=1}^L \\beta_{x_{i-1}x_i}$$\n", 198 | "\n", 199 | "阈值根据不同决策方法确定。" 200 | ] 201 | } 202 | ], 203 | "metadata": { 204 | "kernelspec": { 205 | "display_name": "Python 3 (ipykernel)", 206 | "language": "python", 207 | "name": "python3" 208 | }, 209 | "language_info": { 210 | "codemirror_mode": { 211 | "name": "ipython", 212 | "version": 3 213 | }, 214 | "file_extension": ".py", 215 | "mimetype": "text/x-python", 216 | "name": "python", 217 | "nbconvert_exporter": "python", 218 | "pygments_lexer": "ipython3", 219 | "version": "3.11.0" 220 | }, 221 | "toc": { 222 | "base_numbering": 1, 223 | "nav_menu": {}, 224 | "number_sections": true, 225 | "sideBar": true, 226 | "skip_h1_title": false, 227 | "title_cell": "Table of Contents", 228 | "title_sidebar": "Contents", 229 | "toc_cell": false, 230 | "toc_position": {}, 231 | "toc_section_display": true, 232 | "toc_window_display": false 233 | }, 234 | "varInspector": { 235 | "cols": { 236 | "lenName": 16, 237 | "lenType": 16, 238 | "lenVar": 40 239 | }, 240 | "kernels_config": { 241 | "python": { 242 | "delete_cmd_postfix": "", 243 | "delete_cmd_prefix": "del ", 244 | "library": "var_list.py", 245 | "varRefreshCmd": "print(var_dic_list())" 246 | }, 247 | "r": { 248 | "delete_cmd_postfix": ") ", 249 | "delete_cmd_prefix": "rm(", 250 | "library": "var_list.r", 251 | "varRefreshCmd": "cat(var_dic_list()) " 252 | } 253 | }, 254 | "types_to_exclude": [ 255 | "module", 256 | "function", 257 | "builtin_function_or_method", 258 | "instance", 259 | "_Feature" 260 | ], 261 | "window_display": false 262 | } 263 | }, 264 | "nbformat": 4, 265 | "nbformat_minor": 5 266 | } 267 | -------------------------------------------------------------------------------- /第10章 模式识别系统的评价.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 第十章 模式识别系统的评价\n", 8 | "---" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "## 二轮总结笔记\n", 16 | "\n", 17 | "\n", 18 | "### 一、监督模式识别方法的错误率估计\n", 19 | "\n", 20 | "#### 1. 测试错误率\n", 21 | "\n", 22 | "##### (1) 先验概率$P(\\omega_1)$,$P(\\omega_2)$未知——随机抽样\n", 23 | "\n", 24 | "1. 错误率:测试集中有$N$个样本,错分了$k$个,则测试错误率$\\displaystyle\\hat\\varepsilon = \\frac{k}{N}$。\n", 25 | "2. 性质:\n", 26 | "+ 测试错误率是真实错误率的**最大似然估计**。\n", 27 | "+ 测试错误率的期望$E[\\hat\\varepsilon]=\\varepsilon$,是真实错误率的无偏估计。\n", 28 | "+ 测试错误率的方差$\\displaystyle D[\\hat\\varepsilon]=\\frac{\\varepsilon(1-\\varepsilon)}{N}$。\n", 29 | "+ $N$越**大**,测试错误率的置信区间越**小**,测试错误率越**可信**。\n", 30 | "\n", 31 | "##### (2) 先验概率$P(\\omega_1)$,$P(\\omega_2)$已知——选择性抽样\n", 32 | "\n", 33 | "1. 错误率:在两类中按照先验概率比例进行抽样,$N_1=P(\\omega_1)N$,$N_2=P(\\omega_2)N$,两类分别错分了$k_1$和$k_2$个,则测试错误率$\\displaystyle\\hat\\varepsilon = P(\\omega_1)\\frac{k_1}{N_1} + P(\\omega_2)\\frac{k_2}{N_2} = \\frac{k_1+k_2}{N}$。\n", 34 | "2. 
性质:\n", 35 | "+ 测试错误率是真实错误率的**最大似然估计**。\n", 36 | "+ 测试错误率的期望$E[\\hat\\varepsilon]=\\varepsilon$,是真实错误率的无偏估计。\n", 37 | "+ 测试错误率的方差$\\displaystyle D[\\hat\\varepsilon] = \\frac{1}{N}\\left[P(\\omega_1)\\frac{k_1}{N_1}\\left(1-\\frac{k_1}{N_1}\\right)+P(\\omega_2)\\frac{k_2}{N_2}\\left(1-\\frac{k_2}{N_2}\\right)\\right]$,方差比前一种方法**小**。\n", 38 | "\n", 39 | "#### 2. 交叉验证(cross-validation,CV)\n", 40 | "\n", 41 | "##### (1) 分类\n", 42 | "\n", 43 | "1. k轮n倍交叉验证(n-fold cross-validation):偏差大,方差小。\n", 44 | "2. 留一法交叉验证(leave-one-out cross-validation):适用于样本少的情况,偏差小,方差大。\n", 45 | "\n", 46 | "##### (2) 性质\n", 47 | "\n", 48 | "1. 临时测试集较小,错误率估计接近全部样本,多轮实验平均可以减少错误率方差。\n", 49 | "2. 可以用于选择分类器参数,用交叉验证错误率最小的参数在全部样本上训练分类器。\n", 50 | "\n", 51 | "#### 3. 自举法和.632估计\n", 52 | "\n", 53 | "##### (1) 自举估计\n", 54 | "\n", 55 | "对样本进行$B$次自举重采样,重采样后的样本作为训练集,没有抽到的样本作为测试集,得到的错误率的平均。\n", 56 | "\n", 57 | "##### (2) .632估计\n", 58 | "\n", 59 | "1. $\\text{B}.632=0.368\\times \\text{AE} + 0.632\\times\\text{B1}$。其中$\\text{AE}$是训练错误率(视在错误率),$\\text{B1}$是自举错误率。\n", 60 | "2. .632估计是对错误率更好的估计。\n", 61 | "\n", 62 | "\n", 63 | "### 二、有限样本下错误率的区间估计问题\n", 64 | "\n", 65 | "#### 1. 问题的提出\n", 66 | "\n", 67 | "##### (1) 标准数据集\n", 68 | "\n", 69 | "1. 一套数据多次划分成训练集和测试集。\n", 70 | "2. 固定划分。\n", 71 | "3. 同时统计错误率的平均值和方差。\n", 72 | "\n", 73 | "##### (2) 性质\n", 74 | "\n", 75 | "1. 各次实验不是独立的,估计的错误率区间会偏小。\n", 76 | "2. 仅基于交叉验证,不存在错误率估计量方差的无偏估计。\n", 77 | "\n", 78 | "#### 2. 用扰动重采样估计SVM错误率的置信区间\n", 79 | "\n", 80 | "##### (1) 性质\n", 81 | "\n", 82 | "1. 考虑了测试样本的不确定性和现有训练样本的不确定性,能够得到性能的全面评价。\n", 83 | "2. 适用于样本数目趋于无穷大,但是有限样本下表现依然较好。\n", 84 | "3. 只适用于线性核。\n", 85 | "\n", 86 | "\n", 87 | "### 三、特征提取与选择对分类器性能估计的影响\n", 88 | "\n", 89 | "#### 1. CV1\n", 90 | "\n", 91 | "##### (1) 定义\n", 92 | "\n", 93 | "对所有样本进行特征选择核提取,然后进行交叉验证。\n", 94 | "\n", 95 | "##### (2) 性质\n", 96 | "\n", 97 | "1. 对分类器性能估计偏乐观,训练集用了测试集的信息。\n", 98 | "2. 当初始特征维数高,样本数目小时过学习问题明显。\n", 99 | "\n", 100 | "#### 2. CV2\n", 101 | "\n", 102 | "##### (1) 定义\n", 103 | "\n", 104 | "只对训练集进行特征选择和提取,然后进行交叉验证。\n", 105 | "\n", 106 | "##### (2) 性质\n", 107 | "\n", 108 | "1. 能得到分类器性能的真实估计。\n", 109 | "2. 需要设计出唯一的特征选择和提取方案。可以全部样本重新特征选择和提取,也可以在CV2过程中的特征选择和提取中综合。\n", 110 | "\n", 111 | "\n", 112 | "### 四、从分类的显著性推断特征与类别的关系\n", 113 | "\n", 114 | "#### 1. 随机置换法\n", 115 | "\n", 116 | "##### (1) 如何得到空分布\n", 117 | "\n", 118 | "1. 保持原有样本中两类样本比例不变的情况下,随机打乱样本的类别标号。\n", 119 | "2. 用原有的特征选择和提取和分类方法进行分类。\n", 120 | "\n", 121 | "##### (2) 如何判断显著性\n", 122 | "\n", 123 | "用真实分类器性能和空分布比较,通常以小于$0.05$作为参考。\n", 124 | "\n", 125 | "\n", 126 | "### 五、非监督模式识别系统性能的评价\n", 127 | "\n", 128 | "#### 1. 紧致性(compactness)/一致性(homogeneity)\n", 129 | "\n", 130 | "##### (1) 公式\n", 131 | "\n", 132 | "$$\n", 133 | "V(C) = \\sqrt{\\frac{1}{N}\\sum_{C_k\\in C}\\sum_{i\\in C_k}\\delta(i,\\mu_k)}\n", 134 | "$$\n", 135 | "\n", 136 | "##### (2) 性质\n", 137 | "\n", 138 | "越小越好,取值范围$[0,\\infty]$。\n", 139 | "\n", 140 | "#### 2. 连接性质(connectedness)/连接度(connectivity)\n", 141 | "\n", 142 | "##### (1) 公式\n", 143 | "\n", 144 | "$$\\begin{aligned}\n", 145 | "\\text{Conn}(C) &= \\sum_{i=1}^N\\sum_{j=1}^Lx_{i,nn_{\\scriptstyle i(j)}} \\\\\n", 146 | "x_{i,nn_{\\scriptstyle i(j)}} &= \\begin{cases}\\displaystyle \\frac{1}{j}&第i个样本和其第j个近邻不在同一个聚类\\\\0&在同一个聚类\\end{cases}\n", 147 | "\\end{aligned}$$\n", 148 | "\n", 149 | "##### (2) 性质\n", 150 | "\n", 151 | "越小越好,取值范围$[0,\\infty]$。\n", 152 | "\n", 153 | "#### 3. 分离度(separation)\n", 154 | "\n", 155 | "##### (1) 性质\n", 156 | "\n", 157 | "越大越好。\n", 158 | "\n", 159 | "#### 4. 
Silhouette宽度\n", 160 | "\n", 161 | "##### (1) 公式\n", 162 | "\n", 163 | "$$\\begin{aligned}\n", 164 | "S(i) &= \\frac{b_i-a_i}{\\max(b_i,a_i)} \\\\\n", 165 | "a_i &= 样本i到和它\\textbf{同类}的所有样本的\\textbf{平均距离}\\\\\n", 166 | "b_i &= 样本i到其他聚类中\\textbf{最近}的一个聚类的所有样本的\\textbf{平均距离}\n", 167 | "\\end{aligned}$$\n", 168 | "\n", 169 | "##### (2) 性质\n", 170 | "\n", 171 | "越大越好,取值范围$[-1,1]$。\n", 172 | "\n", 173 | "#### 5. Dunn指数(Dunn index)\n", 174 | "\n", 175 | "##### (1) 公式\n", 176 | "\n", 177 | "$$\\begin{aligned}\n", 178 | "D(C) &= \\min\\limits_{C_k\\in C}\\left(\\min\\limits_{C_l\\in C}\\frac{\\text{dist}(C_k,C_l)}{\\displaystyle\\max\\limits_{C_m\\in C}\\text{diam}(C_m)}\\right)\\\\\n", 179 | "\\text{dist}(C_k,C_l) &= C_k,C_l两类中距离最近的样本间的距离\\\\\n", 180 | "\\text{diam}(C_m) &= C_m中最大的类内距离\n", 181 | "\\end{aligned}$$\n", 182 | "\n", 183 | "##### (2) 性质\n", 184 | "\n", 185 | "越大越好,取值范围$[0,\\infty]$。\n", 186 | "\n", 187 | "#### 6. 预测效力\n", 188 | "\n", 189 | "##### (1) 算法\n", 190 | "\n", 191 | "将样本随机分为两份,各自进行聚类,用一份得到的聚类结果作为临时训练样本,对另一份进行最近邻法分类。\n", 192 | "\n", 193 | "##### (2) 性质\n", 194 | "\n", 195 | "重合程度越大聚类结果越稳定。" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "---\n", 203 | "## 一轮学习笔记(包含代码实现)\n", 204 | "\n", 205 | "这一章介绍了大多数其他教材放在第一章的内容,即分类算法的检验方法。\n", 206 | "\n", 207 | "教材的重点放在了检验方法得出的错误率的理论分析上面,即从概率论的角度分析错误率的期望和方差。这部分都是数学推导,不需要写代码。\n", 208 | "\n", 209 | "而教材中提到的计算错误率的检验方法实现起来都很非常简单,而且在sklearn中都有库可以很方便的调用。\n", 210 | "\n", 211 | "只有扰动重采样估计SVM错误率的置信区间非常复杂,但是这个算法其实不重要,作者把它放在教材上主要还是因为这个算法是作者自己发的文章。\n", 212 | "\n", 213 | "下面我只实现一下bootstrap .632估计。" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 1, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "from sklearn.utils import resample\n", 223 | "from sklearn.svm import LinearSVC\n", 224 | "import numpy as np" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": {}, 230 | "source": [ 231 | "下面用手写数字识别的数据集。" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 2, 237 | "metadata": {}, 238 | "outputs": [], 239 | "source": [ 240 | "X = np.load(\"./data/digit_0_1_X.npy\")\n", 241 | "y = np.load(\"./data/digit_0_1_y.npy\")\n", 242 | "X = X[:1000]\n", 243 | "y = y[:1000] #前一千个是0和1,后面是其他数字\n", 244 | "y = y.reshape(1000)" 245 | ] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": {}, 250 | "source": [ 251 | "先计算自举错误率B1。" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": 3, 257 | "metadata": {}, 258 | "outputs": [ 259 | { 260 | "name": "stdout", 261 | "output_type": "stream", 262 | "text": [ 263 | "bootstrap第1次,自举样本集中包含了原始样本集62.5%的样本。\n", 264 | "本次训练错误率为:0.002666666666666706\n", 265 | "bootstrap第2次,自举样本集中包含了原始样本集61.8%的样本。\n", 266 | "本次训练错误率为:0.010471204188481686\n", 267 | "bootstrap第3次,自举样本集中包含了原始样本集62.2%的样本。\n", 268 | "本次训练错误率为:0.002645502645502673\n", 269 | "bootstrap第4次,自举样本集中包含了原始样本集64.3%的样本。\n", 270 | "本次训练错误率为:0.0028011204481792618\n", 271 | "bootstrap第5次,自举样本集中包含了原始样本集62.9%的样本。\n", 272 | "本次训练错误率为:0.0\n", 273 | "bootstrap第6次,自举样本集中包含了原始样本集61.199999999999996%的样本。\n", 274 | "本次训练错误率为:0.002577319587628857\n", 275 | "bootstrap第7次,自举样本集中包含了原始样本集63.800000000000004%的样本。\n", 276 | "本次训练错误率为:0.002762430939226568\n", 277 | "bootstrap第8次,自举样本集中包含了原始样本集62.9%的样本。\n", 278 | "本次训练错误率为:0.0026954177897574594\n", 279 | "bootstrap第9次,自举样本集中包含了原始样本集64.60000000000001%的样本。\n", 280 | "本次训练错误率为:0.0028248587570621764\n", 281 | "bootstrap第10次,自举样本集中包含了原始样本集64.7%的样本。\n", 282 | "本次训练错误率为:0.0\n", 283 | 
"自举平均错误率B1 = 0.002944452102250539\n" 284 | ] 285 | } 286 | ], 287 | "source": [ 288 | "# 下面计算十次bootstrap样本集在线性支持向量机上的错误率\n", 289 | "b1 = 0\n", 290 | "for i in range(10):\n", 291 | " bootstrap_index = []\n", 292 | " test_index = set(range(X.shape[0]))\n", 293 | " X_bootstrap = resample(X)\n", 294 | " for x in X_bootstrap:\n", 295 | " for j in range(X.shape[0]):\n", 296 | " if (X[j] == x).all():\n", 297 | " bootstrap_index.append(j)\n", 298 | " if j in test_index:\n", 299 | " test_index.remove(j)\n", 300 | " break\n", 301 | " test_index = list(test_index)\n", 302 | " y_bootstrap = y[bootstrap_index]\n", 303 | " X_test = X[test_index]\n", 304 | " y_test = y[test_index]\n", 305 | " print(\"bootstrap第\", i + 1, \"次,自举样本集中包含了原始样本集\",\n", 306 | " (1 - len(test_index) / X.shape[0]) * 100, \"%的样本。\", sep=\"\")\n", 307 | " lsvc = LinearSVC()\n", 308 | " lsvc.fit(X_bootstrap, y_bootstrap)\n", 309 | " err = 1 - lsvc.score(X_test, y_test)\n", 310 | " b1 += err\n", 311 | " print(\"本次训练错误率为:\", err,sep=\"\")\n", 312 | "b1 /= 10\n", 313 | "print(\"自举平均错误率B1 =\", b1)" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "再计算全部样本上的训练错误率(视在错误率)。" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": 4, 326 | "metadata": {}, 327 | "outputs": [ 328 | { 329 | "name": "stdout", 330 | "output_type": "stream", 331 | "text": [ 332 | "全部样本上的训练错误率AE = 0.0\n" 333 | ] 334 | } 335 | ], 336 | "source": [ 337 | "lsvc = LinearSVC()\n", 338 | "lsvc.fit(X, y)\n", 339 | "ae = 1 - lsvc.score(X, y)\n", 340 | "print(\"全部样本上的训练错误率AE =\", ae)" 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "接下来计算B.632错误率。" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": 5, 353 | "metadata": {}, 354 | "outputs": [ 355 | { 356 | "name": "stdout", 357 | "output_type": "stream", 358 | "text": [ 359 | "B.632错误率 = 0.0018608937286223406\n" 360 | ] 361 | } 362 | ], 363 | "source": [ 364 | "print(\"B.632错误率 =\", 0.368 * ae + 0.632 * b1)" 365 | ] 366 | } 367 | ], 368 | "metadata": { 369 | "kernelspec": { 370 | "display_name": "Python 3 (ipykernel)", 371 | "language": "python", 372 | "name": "python3" 373 | }, 374 | "language_info": { 375 | "codemirror_mode": { 376 | "name": "ipython", 377 | "version": 3 378 | }, 379 | "file_extension": ".py", 380 | "mimetype": "text/x-python", 381 | "name": "python", 382 | "nbconvert_exporter": "python", 383 | "pygments_lexer": "ipython3", 384 | "version": "3.9.16" 385 | } 386 | }, 387 | "nbformat": 4, 388 | "nbformat_minor": 1 389 | } 390 | -------------------------------------------------------------------------------- /第07章 特征选择.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 第七章 特征选择\n", 8 | "---" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "## 二轮总结笔记\n", 16 | "\n", 17 | "\n", 18 | "### 一、特征的评价准则\n", 19 | "\n", 20 | "#### 1. 
可分性判据的要求\n", 21 | "\n", 22 | "##### (1) 错误率单调性\n", 23 | "\n", 24 | "判据与错误率(或错误率的**上界**)有单调关系。\n", 25 | "\n", 26 | "##### (2) 可加性\n", 27 | "\n", 28 | "当特征**独立**时,判据对特征应该具有可加性,即\n", 29 | "\n", 30 | "$$\n", 31 | "J_{ij}(x_1,x_2,\\cdots,x_d)=\\sum_{k=1}^dJ_{ij}(x_k)\n", 32 | "$$\n", 33 | "\n", 34 | "##### (3) 度量特征\n", 35 | "\n", 36 | "$$\\begin{aligned}\n", 37 | "J_{ij} &> 0,\\quad 当i\\neq j时\\\\\n", 38 | "J_{ij} &= 0,\\quad 当i=j时\\\\\n", 39 | "J_{ij} &= J_{ji}\n", 40 | "\\end{aligned}$$\n", 41 | "\n", 42 | "##### (4) 特征单调性\n", 43 | "\n", 44 | "加入新的特征不会使判据减小,即\n", 45 | "\n", 46 | "$$\n", 47 | "J_{ij}(x_1,x_2,\\cdots,x_d) \\leq J_{ij}(x_1,x_2,\\cdots,x_d,x_{d+1})\n", 48 | "$$\n", 49 | "\n", 50 | "#### 2. 基于类内类间距离的可分性判据\n", 51 | "\n", 52 | "##### (1) 常用基于类内类间距离的可分性判据\n", 53 | "\n", 54 | "$$\\begin{aligned}\n", 55 | "J_1 &= \\text{tr}(\\mathbf{S}_\\text{w}+\\mathbf{S}_\\text{b}),\\quad J_1为各类之间的平均平方距离 \\\\\n", 56 | "J_2 &= \\text{tr}(\\mathbf{S}_\\text{w}^{-1}\\mathbf{S}_\\text{b}) \\\\\n", 57 | "J_3 &= \\ln\\frac{|\\mathbf{S}_\\text{b}|}{|\\mathbf{S}_\\text{w}|} \\\\\n", 58 | "J_4 &= \\frac{\\text{tr}\\mathbf{S}_\\text{b}}{\\text{tr}\\mathbf{S}_\\text{w}} \\\\\n", 59 | "J_5 &= \\frac{|\\mathbf{S}_\\text{b}-\\mathbf{S}_\\text{w}|}{|\\mathbf{S}_\\text{w}|}\n", 60 | "\\end{aligned}$$\n", 61 | "\n", 62 | "其中,$\\mathbf{S}_\\text{w}$和$\\mathbf{S}_\\text{b}$的估计值为\n", 63 | "\n", 64 | "$$\\begin{aligned}\n", 65 | "\\tilde{S}_\\text{w} &= \\sum_{i=1}^cP_i\\frac{1}{n_i}\\sum_{k=1}^{n_i}(\\mathbf{x}_k^{(i)}-\\mathbf{m}_i)(\\mathbf{x}_k^{(i)}-\\mathbf{m}_i)^\\text{T} \\\\\n", 66 | "\\tilde{S}_\\text{b} &= \\sum_{i=1}^cP_i(\\mathbf{m}_i-\\mathbf{m})(\\mathbf{m}_i-\\mathbf{m})^\\text{T}\n", 67 | "\\end{aligned}$$\n", 68 | "\n", 69 | "其中\n", 70 | "\n", 71 | "$$\\begin{aligned}\n", 72 | "\\mathbf{m}_i &= \\frac{1}{n_i}\\sum_{k=1}^{n_i}\\mathbf{x}_k^{(i)}\\\\\n", 73 | "\\mathbf{m} &= \\sum_{i=1}^{c}P_i\\mathbf{m}_i\\\\\n", 74 | "\\end{aligned}$$\n", 75 | "\n", 76 | "##### (2) 性质\n", 77 | "\n", 78 | "1. 很难从理论上建立起与分类错误率的联系。\n", 79 | "2. 两类样本分布有重叠时,不能反映重叠的情况。\n", 80 | "3. 各类样本分布协方差差别不大时,效果较好。\n", 81 | "\n", 82 | "#### 3. 基于概率分布的可分性判据\n", 83 | "\n", 84 | "##### (1) 概率距离度量的条件\n", 85 | "\n", 86 | "1. $J_P\\geqslant0$。\n", 87 | "2. 当两类完全不交叠($p(\\mathbf{x}|\\omega_1)$和$p(\\mathbf{x}|\\omega_2)$完全不交叠)时$J_P$取最大值。\n", 88 | "3. 当两类完全交叠($p(\\mathbf{x}|\\omega_1)=p(\\mathbf{x}|\\omega_2)$)时$J_P=0$.\n", 89 | "\n", 90 | "##### (2) 常用的概率距离度量\n", 91 | "\n", 92 | "1. Bhattacharyya距离:\n", 93 | "\n", 94 | "$$\n", 95 | "J_\\text{B}=-\\ln\\int[p(\\mathbf{x}|\\omega_1)p(\\mathbf{x}|\\omega_2)]^\\frac{1}{2}\\text{d}\\mathbf{x}\n", 96 | "$$\n", 97 | "\n", 98 | "理论错误率与Bhattacharyya距离的关系为\n", 99 | "\n", 100 | "$$\n", 101 | "P_\\text{e} \\leqslant [P(\\omega_1)P(\\omega_2)]^\\frac{1}{2}\\exp\\{-J_\\text{B}\\}\n", 102 | "$$\n", 103 | "\n", 104 | "2. Chernoff界限:\n", 105 | "\n", 106 | "$$\n", 107 | "J_\\text{C} = -\\ln\\int p^s(\\mathbf{x}|\\omega_1)p^{1-s}(\\mathbf{x}|\\omega_2)\\text{d}\\mathbf{x},\\quad s\\in[0,1]\n", 108 | "$$\n", 109 | "\n", 110 | "当$s=0.5$时,Chernoff界限与Bhattacharyya距离相同。\n", 111 | "\n", 112 | "3. 
散度:\n", 113 | "\n", 114 | "$$\n", 115 | "J_\\text{D}=\\int [p(\\mathbf{x}|\\omega_1)-p(\\mathbf{x}|\\omega_2)]\\ln\\frac{p(\\mathbf{x}|\\omega_1)}{p(\\mathbf{x}|\\omega_2)}\\text{d}\\mathbf{x}\n", 116 | "$$\n", 117 | "\n", 118 | "当两类样本都服从**正态分布**且**协方差矩阵相等**的情况下,散度为两类均值间的**Mahalanobis距离**,即\n", 119 | "\n", 120 | "$$\n", 121 | "J_\\text{D} = 8J_\\text{B} = (\\boldsymbol{\\mu}_1-\\boldsymbol{\\mu}_2)^\\text{T}\\boldsymbol{\\Sigma}^{-1}(\\boldsymbol{\\mu}_1-\\boldsymbol{\\mu}_2)\n", 122 | "$$\n", 123 | "\n", 124 | "4. 概率相关性判据:\n", 125 | "\n", 126 | "把上面三种判据的$p(\\mathbf{x}|\\omega_1)$换成$p(\\mathbf{x}|\\omega_i)$,$p(\\mathbf{x}|\\omega_2)$换成$p(\\mathbf{x})$。\n", 127 | "\n", 128 | "#### 4. 基于熵的可分性判据\n", 129 | "\n", 130 | "##### (1) 常用熵度量\n", 131 | "\n", 132 | "1. Shanno熵:\n", 133 | "\n", 134 | "$$\n", 135 | "H = -\\sum_{i=1}^cP(\\omega_i|\\mathbf{x})\\log_2P(\\omega_i|\\mathbf{x})\n", 136 | "$$\n", 137 | "\n", 138 | "2. 平方熵:\n", 139 | "\n", 140 | "$$\n", 141 | "H=2\\left(1-\\sum_{i=1}^cP^2(\\omega_i|\\mathbf{x})\\right)\n", 142 | "$$\n", 143 | "\n", 144 | "##### (2) 基于熵的可分性判据\n", 145 | "\n", 146 | "$$\n", 147 | "J_\\text{E}=\\int H(\\mathbf{x})p(\\mathbf{x})\\text{d}\\mathbf{x}\n", 148 | "$$\n", 149 | "\n", 150 | "##### (3) 性质\n", 151 | "\n", 152 | "$J_\\text{E}$越小,可分性越好。\n", 153 | "\n", 154 | "#### 5. 利用统计检验作为可分性判据\n", 155 | "\n", 156 | "##### (1) 统计检验的基本概念\n", 157 | "\n", 158 | "1. 空假设(null hypothesis):假设两类样本在所研究特征上不存在显著差异。\n", 159 | "2. 备择假设(alternative hypothesis):假设两类样本在所研究特征上存在显著差异。\n", 160 | "3. 空分布:空假设下统计量取值的分布。\n", 161 | "\n", 162 | "##### (2) $t$-检验\n", 163 | "\n", 164 | "1. 条件:\n", 165 | "\n", 166 | "两类样本都服从**正态分布**且**方差相同**。\n", 167 | "\n", 168 | "2. 统计量:\n", 169 | "\n", 170 | "假设两类样本分别有$m$个和$n$个,在考察特征上的值分别为$\\{x_i|i=1,\\cdots,n_1\\}$和$\\{y_i|i=1,\\cdots,n_2\\}$。其中$x_i\\sim N(\\mu_x,\\sigma^2)$,$y_i\\sim N(\\mu_y,\\sigma^2)$。\n", 171 | "\n", 172 | "则统计量为\n", 173 | "\n", 174 | "$$\n", 175 | "t=\\frac{\\bar{x}-\\bar{y}}{\\displaystyle s_\\text{P}\\sqrt{\\frac{1}{n_1}+\\frac{1}{n_2}}},\\quad t\\sim t(n_1+n_2-2)\n", 176 | "$$\n", 177 | "\n", 178 | "其中$s_\\text{p}^2$是总体样本方差\n", 179 | "\n", 180 | "$$\n", 181 | "s_\\text{p}^2=\\frac{(n_1-1)S_x^2+(n_2-1)S_y^2}{n_1+n_2-2}\n", 182 | "$$\n", 183 | "\n", 184 | "3. 空假设和备择假设:\n", 185 | "\n", 186 | "+ 双边$t$-检验:空假设为两类样本在考察特征上分布相同,即$\\mu_x=\\mu_y$。备择假设为$\\mu_x\\neq\\mu_y$。\n", 187 | "+ 单边$t$-检验:空假设为$\\mu_x\\leqslant\\mu_y$。备择假设为$\\mu_x>\\mu_y$。\n", 188 | "\n", 189 | "4. 拒绝域:\n", 190 | "\n", 191 | "+ 双边$t$-检验:$\\displaystyle |t|\\geqslant t_\\frac{\\alpha}{2}(m+n-2)$\n", 192 | "+ 单边$t$-检验:$\\displaystyle t\\geqslant t_\\alpha(m+n-2)$\n", 193 | "\n", 194 | "##### (3) Wilcoxon秩和检验(Mann-Whitney U检验)\n", 195 | "\n", 196 | "1. 步骤:\n", 197 | "\n", 198 | "+ 两类样本混合到一起,所有样本按照所考察的特征从小到大排序。\n", 199 | "+ 两类样本的**秩和**即为所得排序**序号之和**(如果特征取值相等,则并列采用中间的序号)。\n", 200 | "+ 考察两类样本秩和的差异。\n", 201 | "\n", 202 | "2. 统计量:\n", 203 | "\n", 204 | "某一类的秩和,比如第一类秩和$T_1$。\n", 205 | "\n", 206 | "3. 统计量$T_1$的分布:\n", 207 | "\n", 208 | "当$n_1$和$n_2$较大(比如都大于10)时,$T_1$近似服从正态分布,即\n", 209 | "\n", 210 | "$$\\begin{aligned}\n", 211 | "T_1 &\\sim N(\\mu_1,\\sigma_1^2)\\\\\n", 212 | "\\mu_1 &= \\frac{n_1(n_1+n_2+1)}{2} \\\\\n", 213 | "\\sigma_1^2 &= \\frac{n_1n_2(n_1+n_2+1)}{12}\n", 214 | "\\end{aligned}$$\n", 215 | "\n", 216 | "##### (4) 性质\n", 217 | "\n", 218 | "1. 样本服从正态分布情况下$t$-检验比秩和检验敏感性更好,但秩和检验没有对样本分布做假设,适用范围广。\n", 219 | "2. $t$-检验只检验分布均值,秩和检验同时受到**分布均值**和**分布形状**的影响。\n", 220 | "3. 统计检验方法通常只能检验单个特征,然后对特征进行排序,也叫**过滤法**(filtering method)。\n", 221 | "4. 
过滤准则与分类准则不一定有很好的联系。\n", 222 | "\n", 223 | "\n", 224 | "### 二、特征选择的最优算法\n", 225 | "\n", 226 | "#### 1. 思想\n", 227 | "\n", 228 | "分支界定法(branch and bound)。\n", 229 | "\n", 230 | "#### 2. 要求\n", 231 | "\n", 232 | "1. $D$个特征中选择$d$个最优的特征。\n", 233 | "2. 特征增多时判据值不会减小。\n", 234 | "\n", 235 | "#### 3. 实现方法\n", 236 | "\n", 237 | "##### (1) 树的形状\n", 238 | "\n", 239 | "1. 根节点为第$0$级,共$D-d$级。\n", 240 | "2. 每一级的节点在父节点基础上去掉一个特征,不出现相同组合的树枝和叶节点。\n", 241 | "3. 同一层中按照去掉**单个**特征后的准则函数值对结点进行排序,准则函数损失量最大的在左边。\n", 242 | "\n", 243 | "##### (2) 搜索策略\n", 244 | "\n", 245 | "1. 从右侧节点开始搜索,到达叶节点时更新准则函数值界限$B$。\n", 246 | "2. 搜索回溯中,遇到某一节点函数准则值小于$B$,则剪枝。\n", 247 | "\n", 248 | "#### 4. 性质\n", 249 | "\n", 250 | "$d$为$D$的一半时,分支界定法比穷举法节省的计算量最大。\n", 251 | "\n", 252 | "\n", 253 | "### 三、特征选择的次优算法\n", 254 | "\n", 255 | "#### 1. 单独最优特征的组合\n", 256 | "\n", 257 | "##### (1) 思想\n", 258 | "\n", 259 | "选取单个特征可分性判据值最大的$d$个特征。\n", 260 | "\n", 261 | "##### (2) 性质\n", 262 | "\n", 263 | "特征间**独立**并且所采用的判据时每个特征上的判据**之和**或**之积**时,这种方法选出来的是最优特征。\n", 264 | "\n", 265 | "#### 2. 顺序前进法(sequential forward selection,SFS)\n", 266 | "\n", 267 | "##### (1) 思想\n", 268 | "\n", 269 | "自底向上:第一个特征选取单独最优的特征,后面每一个特征都选择与已经入选的特征组合起来最优的特征。\n", 270 | "\n", 271 | "##### (2) 性质\n", 272 | "\n", 273 | "1. 计算量比单独最优特征组合大。\n", 274 | "2. 特征入选后无法剔除。\n", 275 | "\n", 276 | "#### 3. 顺序后退法(sequential backward selection,SBS)\n", 277 | "\n", 278 | "##### (1) 思想\n", 279 | "\n", 280 | "自顶向下:从所有特征开始逐一剔除特征,每次剔除使得剩余特征最优。\n", 281 | "\n", 282 | "##### (2) 性质\n", 283 | "\n", 284 | "1. 计算量比顺序前进法大。\n", 285 | "2. 特征剔除后无法加入。\n", 286 | "\n", 287 | "#### 4. 增$l$减$r$法\n", 288 | "\n", 289 | "##### (1) 思想\n", 290 | "\n", 291 | "1. 自底向上:$l>r$,先增再减。\n", 292 | "2. 自顶向下:$l np.ndarray:\n", 386 | " \"\"\"\n", 387 | " 计算m_i(样本均值)\n", 388 | " :param _X: 样本矩阵,每一行是一个样本\n", 389 | " :return: 均值(一维向量)\n", 390 | " \"\"\"\n", 391 | " return np.mean(_X, 0)\n", 392 | "\n", 393 | "\n", 394 | "def get_within_class_scatter_matrix(_X: np.ndarray) -> np.ndarray:\n", 395 | " \"\"\"\n", 396 | " 计算某一类的S_w_i(类内离散度矩阵)\n", 397 | " :param _X: 样本矩阵,每一行是一个样本\n", 398 | " :return: 类内离散度矩阵(D * D)\n", 399 | " \"\"\"\n", 400 | " ret = np.zeros((_X.shape[1], _X.shape[1]))\n", 401 | " m = get_mean(_X)\n", 402 | " for row in _X:\n", 403 | " ret += (row - m).reshape(m.shape[0], 1) @ (row - m).reshape(m.shape[0], 1).T\n", 404 | " return ret\n", 405 | "\n", 406 | "\n", 407 | "def get_pooled_within_class_scatter_matrix(_X1: np.ndarray, _X2: np.ndarray) -> np.ndarray:\n", 408 | " \"\"\"\n", 409 | " 计算最终的S_w(总类内离散度矩阵)\n", 410 | " :param _X1: 第一类样本矩阵,每一行是一个样本\n", 411 | " :param _X2: 第二类样本矩阵,每一行是一个样本\n", 412 | " :return: 总类内离散度矩阵(D * D)\n", 413 | " \"\"\"\n", 414 | " return get_within_class_scatter_matrix(_X1) + get_within_class_scatter_matrix(_X2) / (_X1.shape[0] + _X2.shape[0])\n", 415 | "\n", 416 | "def get_between_class_scatter_matrix(_X1: np.ndarray, _X2: np.ndarray) -> np.ndarray:\n", 417 | " \"\"\"\n", 418 | " 计算S_b(类间离散度矩阵)\n", 419 | " :param _X1: 第一类样本矩阵,每一行是一个样本\n", 420 | " :param _X2: 第二类样本矩阵,每一行是一个样本\n", 421 | " :return: 类间离散度矩阵(D * D)\n", 422 | " \"\"\"\n", 423 | " n1 = _X1.shape[0]\n", 424 | " n2 = _X2.shape[0]\n", 425 | " d = _X1.shape[1]\n", 426 | " N = n1 + n2\n", 427 | " m_1 = get_mean(_X1)\n", 428 | " m_2 = get_mean(_X2)\n", 429 | " m = get_mean(np.concatenate((_X1, _X2)))\n", 430 | " return n1 / N * ((m_1 - m).reshape(d, 1) @ (m_1 - m).reshape(d, 1).T) + \\\n", 431 | " n2 / N * ((m_2 - m).reshape(d, 1) @ (m_2 - m).reshape(d, 1).T)\n", 432 | "\n", 433 | "def get_W(_X1: np.ndarray, _X2: np.ndarray, d: int) -> np.ndarray:\n", 
434 | " \"\"\"\n", 435 | " 计算最终的W矩阵\n", 436 | " :param _X1: 第一类样本矩阵,每一行是一个样本\n", 437 | " :param _X2: 第二类样本矩阵,每一行是一个样本\n", 438 | " :param d: 最终要投影的特征维度数量\n", 439 | " :return: W矩阵\n", 440 | " \"\"\"\n", 441 | " S_w_inv_ = np.linalg.inv(get_pooled_within_class_scatter_matrix(_X1, _X2))\n", 442 | " S_b_ = get_between_class_scatter_matrix(_X1, _X2)\n", 443 | "\n", 444 | " # np.np.linalg.eig用来计算特征值和特征向量,返回的结果可能是复数\n", 445 | " eigen_values_, eigen_vectors_ = np.linalg.eig(S_w_inv_ @ S_b_)\n", 446 | "\n", 447 | " # 下面用来对特征值排序,并且记录特征值的序号\n", 448 | " eigen_temp_ = []\n", 449 | " for i in range(eigen_values_.shape[0]):\n", 450 | " # 找到是实数的特征值\n", 451 | " if eigen_values_[i].imag == 0:\n", 452 | " eigen_temp_.append([eigen_values_[i].real, i])\n", 453 | "\n", 454 | " eigen_temp_.sort(reverse=True)\n", 455 | "\n", 456 | " # 下面用来找出最大的d个实特征值的特征向量\n", 457 | " ret = []\n", 458 | " for i in range(d):\n", 459 | " ret.append(eigen_vectors_[:,eigen_temp_[i][1]].real)\n", 460 | " return np.array(ret)" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": 3, 466 | "metadata": {}, 467 | "outputs": [], 468 | "source": [ 469 | "# 加载数据,这次的数据用sklearn自带的水仙花的数据\n", 470 | "iris = load_iris()\n", 471 | "X = iris.data[:100]\n", 472 | "y = iris.target[:100]\n", 473 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n", 474 | "\n", 475 | "# 寻找两类样本\n", 476 | "X1 = []\n", 477 | "X2 = []\n", 478 | "for i in range(X_train.shape[0]):\n", 479 | " if y_train[i] == 0:\n", 480 | " X1.append(X_train[i])\n", 481 | " else:\n", 482 | " X2.append(X_train[i])\n", 483 | "X1 = np.array(X1)\n", 484 | "X2 = np.array(X2)" 485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": 4, 490 | "metadata": {}, 491 | "outputs": [ 492 | { 493 | "name": "stdout", 494 | "output_type": "stream", 495 | "text": [ 496 | "W为:\n", 497 | "[[ 0.01869426 -0.2069536 0.92145113 0.32825073]]\n" 498 | ] 499 | } 500 | ], 501 | "source": [ 502 | "# 计算W, d设置为1,即投影后的新特征为1维\n", 503 | "W = get_W(X1, X2, 1)\n", 504 | "print(\"W为:\")\n", 505 | "print(W)" 506 | ] 507 | }, 508 | { 509 | "cell_type": "markdown", 510 | "metadata": {}, 511 | "source": [ 512 | "接下来生成降维后的数据,用支持向量机训练测试一下效果。" 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": 5, 518 | "metadata": {}, 519 | "outputs": [ 520 | { 521 | "name": "stdout", 522 | "output_type": "stream", 523 | "text": [ 524 | "降维之后的样本维度: 1\n", 525 | "降维之后的正确率: 1.0\n" 526 | ] 527 | } 528 | ], 529 | "source": [ 530 | "X_train_new = X_train @ W.T\n", 531 | "X_test_new = X_test @ W.T\n", 532 | "print(\"降维之后的样本维度:\", X_train_new.shape[1])\n", 533 | "\n", 534 | "svm_classifier = SVC(kernel=\"rbf\") # 用径向基函数作为核函数\n", 535 | "svm_classifier.fit(X_train_new, y_train)\n", 536 | "\n", 537 | "print(\"降维之后的正确率:\", svm_classifier.score(X_test_new, y_test))" 538 | ] 539 | }, 540 | { 541 | "cell_type": "markdown", 542 | "metadata": {}, 543 | "source": [ 544 | "测试结果表明,用基于分类可分性判据的特征提取方法,就算我们把样本从四维降成一维,还能保持100%的正确率。这说明这个降维方法是非常有效的。" 545 | ] 546 | }, 547 | { 548 | "cell_type": "markdown", 549 | "metadata": {}, 550 | "source": [ 551 | "### 二、主成分分析方法\n", 552 | "\n", 553 | "主成分分析是根据方差大小来对样本进行线性变换,最终实现对样本的降维。每一个主成分都是样本协方差矩阵的一个特征值。\n", 554 | "\n", 555 | "主成分分析的公式推导并不复杂,书上介绍的也很清楚。\n", 556 | "\n", 557 | "主要的问题在于主成分分析没有考虑样本分类信息,只是按照样本方差来处理数据,这样的方法只是把样本降维了,但是不一定对分类有利。\n", 558 | "\n", 559 | "主成分分析实际上是K-L变换的一种特殊情况,下面会实现K-L变换,因此这里我就不写代码实现了。" 560 | ] 561 | }, 562 | { 563 | "cell_type": "markdown", 564 | "metadata": {}, 565 | "source": [ 566 | "### 三、K-L变换\n", 567 | "\n", 568 | 
"原始的K-L变换的根据最小均方误差来进行线性变换,如果样本均值为$0$,那么K-L变换等价于主成分分析。\n", 569 | "\n", 570 | "为了应用于模式识别,需要考虑到样本的分类信息,K-L变换引入了新的方法。\n", 571 | "\n", 572 | "#### 1. 从类均值中提取判别信息\n", 573 | "\n", 574 | "这个方法是假设样本的主要分类信息包含在均值中,通过总类内离散度矩阵$\\mathbf{S}_{\\text{w}}$作为产生矩阵进行K-L变换。然后在根据可分性判据大小得出新特征的转换矩阵。\n", 575 | "\n", 576 | "具体方法如下:\n", 577 | "\n", 578 | "1. 计算总类内离散度矩阵$\\mathbf{S}_{\\text{w}}$和类间离散度矩阵$\\mathbf{S}_{\\text{b}}$。\n", 579 | "2. 用$\\mathbf{S}_{\\text{w}}$进行K-L变换,得到特征值$\\lambda_i$和对应的特征向量$\\mathbf{u}_i$(一个特征维度对应一个特征值和特征向量)。\n", 580 | "3. 通过特征值和特征向量计算样本每个维度的可分性判据大小,公式如下:\n", 581 | "\n", 582 | "$$\n", 583 | "J(y_i) = \\frac{\\mathbf{u}_i^\\text{T}\\mathbf{S}_{\\text{b}}\\mathbf{u}_i}{\\lambda_i}\n", 584 | "$$\n", 585 | "\n", 586 | "4. 找出$J(y_i)$最大的$d$个(d是要降低的目标维数)特征向量$\\mathbf{u}_1,\\mathbf{u}_2,\\cdots, \\mathbf{u}_d$,转换矩阵就是$\\mathbf{U} = [\\mathbf{u}_1,\\mathbf{u}_2,\\cdots, \\mathbf{u}_d]$,降维后的样本就是$\\mathbf{y} = \\mathbf{U}^\\text{T}\\mathbf{x}$。\n", 587 | "\n", 588 | "算法实现如下:" 589 | ] 590 | }, 591 | { 592 | "cell_type": "code", 593 | "execution_count": 6, 594 | "metadata": {}, 595 | "outputs": [ 596 | { 597 | "name": "stdout", 598 | "output_type": "stream", 599 | "text": [ 600 | "可分性判据J数值如下: [7.77946014e-03 9.29616393e-01 1.30417535e+00 1.23736680e-03]\n" 601 | ] 602 | } 603 | ], 604 | "source": [ 605 | "# 数据在上面已经生成过了,这里直接用上面生成的数据\n", 606 | "S_w = get_pooled_within_class_scatter_matrix(X1, X2)\n", 607 | "S_b = get_between_class_scatter_matrix(X1, X2)\n", 608 | "\n", 609 | "# 注意numpy生成的特征向量矩阵中,每一列是一个特征向量\n", 610 | "eigen_values, eigen_vectors = np.linalg.eig(S_w)\n", 611 | "\n", 612 | "# np.diag把方阵的对角元素提取出来,返回一个向量\n", 613 | "J = np.diag(eigen_vectors.T @ S_b @ eigen_vectors) / eigen_values\n", 614 | "print(\"可分性判据J数值如下:\", J)" 615 | ] 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": 7, 620 | "metadata": {}, 621 | "outputs": [ 622 | { 623 | "name": "stdout", 624 | "output_type": "stream", 625 | "text": [ 626 | "我们要把样本降维成一维,所以转换矩阵U退化成一个向量,U的值为:\n", 627 | "[ 0.42591017 -0.17061936 -0.82024461 -0.34159675]\n" 628 | ] 629 | } 630 | ], 631 | "source": [ 632 | "# 接下来找出J最大的d个特征向量,生成转换矩阵U\n", 633 | "# 这个例子中我们把原来4维的样本降成1维,所以d=1,那么只需要找出最大的J对应的特征向量即可\n", 634 | "\n", 635 | "maxn = -999999\n", 636 | "max_index = -1\n", 637 | "for i in range(J.shape[0]):\n", 638 | " if J[i] > maxn:\n", 639 | " maxn = J[i]\n", 640 | " max_index = i\n", 641 | "\n", 642 | "U = eigen_vectors[:, max_index].reshape(J.shape[0])\n", 643 | "print(\"我们要把样本降维成一维,所以转换矩阵U退化成一个向量,U的值为:\")\n", 644 | "print(U)" 645 | ] 646 | }, 647 | { 648 | "cell_type": "markdown", 649 | "metadata": {}, 650 | "source": [ 651 | "接下来把生成降维后的样本,再用SVM测试一下。" 652 | ] 653 | }, 654 | { 655 | "cell_type": "code", 656 | "execution_count": 8, 657 | "metadata": {}, 658 | "outputs": [ 659 | { 660 | "name": "stdout", 661 | "output_type": "stream", 662 | "text": [ 663 | "降维之后的样本维度: 1\n", 664 | "降维之后的正确率: 1.0\n" 665 | ] 666 | } 667 | ], 668 | "source": [ 669 | "X_train_new = X_train @ U.reshape(U.shape[0], 1)\n", 670 | "X_test_new = X_test @ U.reshape(U.shape[0], 1)\n", 671 | "print(\"降维之后的样本维度:\", X_train_new.shape[1])\n", 672 | "\n", 673 | "svm_classifier = SVC(kernel=\"rbf\") # 用径向基函数作为核函数\n", 674 | "svm_classifier.fit(X_train_new, y_train)\n", 675 | "\n", 676 | "print(\"降维之后的正确率:\", svm_classifier.score(X_test_new, y_test))" 677 | ] 678 | }, 679 | { 680 | "cell_type": "markdown", 681 | "metadata": {}, 682 | "source": [ 683 | "显然,从类均值中提取判别信息的降维效果也很好。" 684 | ] 685 | }, 686 | { 687 | "cell_type": "markdown", 688 | "metadata": {}, 689 | "source": [ 690 | "#### 2. 
包含在类平均向量中判别信息的最优压缩\n", 691 | "\n", 692 | "这个方法是使特征间互不相关的前提下最优压缩均值向量中包含的分类信息。\n", 693 | "\n", 694 | "具体方法如下:\n", 695 | "\n", 696 | "1. 求出$\\mathbf{S}_{\\text{w}}$和$\\mathbf{S}_{\\text{b}}$\n", 697 | "2. 对$\\mathbf{S}_{\\text{w}}$做K-L变换,求出特征变换矩阵$\\mathbf{U}$\n", 698 | "3. 令$\\mathbf{B}=\\mathbf{U}\\boldsymbol{\\Lambda}^{-\\frac{1}{2}}$,这个时候$\\mathbf{B}^\\text{T}\\mathbf{S}_{\\text{w}}\\mathbf{B} = \\mathbf{I}$,这一步称为白化变换\n", 699 | "4. 求$\\mathbf{S}_{\\text{b}}' = \\mathbf{B}^\\text{T}\\mathbf{S}_{\\text{b}}\\mathbf{B}$\n", 700 | "5. 对$\\mathbf{S}_{\\text{b}}'$进行K-L变换,求出特征变换矩阵$\\mathbf{V}'$,其中$\\mathbf{V}'$中只包含$\\mathbf{S}_{\\text{b}}'$中非零的实特征值对应的特征向量($c$类分类问题最多有$c-1$个)\n", 701 | "6. 最终总的变换矩阵是$\\mathbf{W} = \\mathbf{U}\\boldsymbol{\\Lambda}^{-\\frac{1}{2}}\\mathbf{V}'$\n", 702 | "\n", 703 | "最终得到的投影方向实际上就是Fisher线性判别的投影方向。\n", 704 | "\n", 705 | "这个方法没有指定到底降维成多少维,我们测试一下实际的降维效果。" 706 | ] 707 | }, 708 | { 709 | "cell_type": "code", 710 | "execution_count": 9, 711 | "metadata": {}, 712 | "outputs": [ 713 | { 714 | "name": "stdout", 715 | "output_type": "stream", 716 | "text": [ 717 | "最终计算出的投影方向W为:\n", 718 | "[[ 0.11277732 -0.23181291 -0.68124047 -0.24026011]\n", 719 | " [ 0.36302409 0.17933398 0.50943862 0.09630243]\n", 720 | " [-0.33659659 -0.74372745 0.51266842 -0.42787552]\n", 721 | " [-0.11711099 -0.28817791 0.05558638 1.52504861]]\n" 722 | ] 723 | } 724 | ], 725 | "source": [ 726 | "# 注意下面这段代码所有的矩阵都是公式中的矩阵,而不像之前一样是公式中矩阵的转置\n", 727 | "\n", 728 | "S_w = get_pooled_within_class_scatter_matrix(X1, X2)\n", 729 | "S_b = get_between_class_scatter_matrix(X1, X2)\n", 730 | "eigen_values, eigen_vectors = np.linalg.eig(S_w)\n", 731 | "U = eigen_vectors\n", 732 | "Lambda_of_neg_1_2 = np.diag(eigen_values ** (-1 / 2))\n", 733 | "B = U @ np.linalg.inv(Lambda_of_neg_1_2)\n", 734 | "S_b_1 = B.T @ S_b @ B\n", 735 | "eigen_values, eigen_vectors = np.linalg.eig(S_b_1)\n", 736 | "V_1 = []\n", 737 | "for i in range(eigen_values.shape[0]):\n", 738 | " if eigen_values[i] != 0:\n", 739 | " V_1.append(eigen_vectors[:, i].reshape(eigen_vectors.shape[0]))\n", 740 | "V_1 = np.array(V_1).T\n", 741 | "W = U @ Lambda_of_neg_1_2 @ V_1\n", 742 | "print(\"最终计算出的投影方向W为:\")\n", 743 | "print(W)" 744 | ] 745 | }, 746 | { 747 | "cell_type": "markdown", 748 | "metadata": {}, 749 | "source": [ 750 | "显然最后样本被降维成三维。这个降维可以理解为保留了样本几乎全部信息,虽然最终维度比较高,但是分类信息保留全面。" 751 | ] 752 | }, 753 | { 754 | "cell_type": "markdown", 755 | "metadata": {}, 756 | "source": [ 757 | "#### 3. 类中心化特征向量中分类信息的提取\n", 758 | "\n", 759 | "这个方法时考虑从样本的各类协方差中提取信息,具体方式是:先对$\\mathbf{S}_{\\text{w}}$进行K-L变换,然后在根据新样本的各个维度方差的大小来计算可分性判据$J(x_j)$。最后取$J(x_j)$最小的前d个特征。\n", 760 | "\n", 761 | "这个方法可以和前面介绍的第二种方法结合起来,先用第二种进行均值分类信息最优压缩,压缩后获得$d' \\leqslant c-1$个特征,再用第三种方法压缩获得另外$d - d'$个特征。\n", 762 | "\n", 763 | "算法的实现也和前面的大同小异,这里我就不实现了。" 764 | ] 765 | }, 766 | { 767 | "cell_type": "markdown", 768 | "metadata": {}, 769 | "source": [ 770 | "### 四、多维尺度法(MDS)\n", 771 | "\n", 772 | "MDS运用了缩放的思想,把高维空间的样本按比例缩放到低维空间,书上用地图的例子很好的解释了这个思想。\n", 773 | "\n", 774 | "#### 1. 古典尺度法\n", 775 | "\n", 776 | "古典尺度法就是已知各样本间的欧式距离,求他们在指定维度下的坐标。\n", 777 | "\n", 778 | "实现方法如下:\n", 779 | "\n", 780 | "1. 定义中心化矩阵$\\mathbf{J} = \\mathbf{I} - \\frac{1}{n}\\mathbf{11}^\\text{T}$\n", 781 | "2. 计算欧氏距离矩阵$\\mathbf{D}^{(2)}$\n", 782 | "3. 求样本两两内积组成的矩阵$\\mathbf{B} = -\\frac{1}{2}\\mathbf{JDJ}$\n", 783 | "4. 求$\\mathbf{B}$的特征值$\\lambda_1,\\lambda_2,\\cdots,\\lambda_d$和特征向量$\\mathbf{u}_1,\\mathbf{u}_2,\\cdots,\\mathbf{u}_d$\n", 784 | "5. 
假设要将样本降维到$k$维,则将特征值从大到小排序,取前$k$大的特征值对应的特征向量组成矩阵$\\mathbf{U}=[\\mathbf{u}_1,\\mathbf{u}_2,\\cdots,\\mathbf{u}_k]$\n", 785 | "6. 解出样本坐标矩阵$\\mathbf{X}=\\mathbf{U}\\boldsymbol{\\Lambda}^{1/2}$\n", 786 | "7. 将样本从中心化还原,$\\widetilde{\\mathbf{x}}_i = \\mathbf{x}_i + \\overline{\\mathbf{x}}$\n", 787 | "\n", 788 | "这个方法其实和主成分分析是相同的,代码实现如下:" 789 | ] 790 | }, 791 | { 792 | "cell_type": "code", 793 | "execution_count": 10, 794 | "metadata": { 795 | "scrolled": false 796 | }, 797 | "outputs": [ 798 | { 799 | "name": "stdout", 800 | "output_type": "stream", 801 | "text": [ 802 | "[-2.27826921e-08 -3.44144050e-08 -1.75140000e-08 2.28469080e-09\n", 803 | " -1.67971705e-08 5.34840161e-08 3.43430899e-09 2.98690681e-08\n", 804 | " -2.13786560e-08 1.63238719e-08 4.11062135e-08 -1.98371979e-08\n", 805 | " -2.90794537e-08 8.90252228e-09 2.37053957e-09 1.17263691e-08\n", 806 | " 2.07078762e-08 -9.75305242e-10 3.20626111e-08 2.71718079e-08\n", 807 | " 3.33402252e-08 1.26129911e-08 -3.12520589e-08 -2.70509339e-09\n", 808 | " -4.24973096e-08 -1.35221505e-08 7.88039006e-09 -2.87083186e-09\n", 809 | " -3.48658361e-08 -8.55255398e-09 -9.89458731e-09 2.34282814e-08\n", 810 | " 1.08033115e-08 -3.47947733e-08 -5.32301673e-09 -5.81293064e-09\n", 811 | " -2.45663341e-08 -2.86082110e-08 -4.12486016e-08 -8.36352275e-09\n", 812 | " -1.87246325e-08 -1.01729126e-08 -3.10669694e-08 -1.81945259e-08\n", 813 | " -1.08101679e-08 -3.65419330e-08 -1.08827192e-08 -4.30372786e-08\n", 814 | " 4.86084062e-09 -2.70163812e-08 4.15228906e-08 -6.41982072e-09\n", 815 | " -1.74987220e-08 2.09008665e-08 -2.50100769e-08 -8.61789227e-09\n", 816 | " -5.92910143e-09 -4.81297694e-09 -6.88161834e-09 4.16202265e-08\n", 817 | " 1.64202060e-09 1.42302506e-08 5.53318913e-08 -1.16508775e-08\n", 818 | " 3.62741257e-08 -6.47619201e-08 -1.11632165e-08 2.66905554e-08\n", 819 | " 2.05870468e-09 -1.41332411e-09 -2.92467205e-08 5.70420669e-09\n", 820 | " -2.31138382e-08 1.22018194e-08 -4.34116542e-08 -6.20514330e-08\n", 821 | " -4.79626460e-08 -1.99943528e-08 -3.56117949e-08 1.10452149e-09\n", 822 | " -2.38777311e-08 2.02512017e-08 -1.46882984e-08 -1.09512754e-08\n", 823 | " -3.65902431e-08 -2.46556440e-08 -1.87207383e-08 -2.54257930e-08\n", 824 | " 7.59315658e-10 1.32360475e-10 -1.79835738e-08 -8.77635709e-09\n", 825 | " 1.72098331e-08 6.08651473e-09 1.19661514e-08 4.29211566e-08\n", 826 | " 3.15975935e-08 5.21455748e-09 1.43383440e-08 1.35625299e-08]\n" 827 | ] 828 | } 829 | ], 830 | "source": [ 831 | "n = X.shape[0]\n", 832 | "J = np.eye(n) - 1 / n * np.ones(n)\n", 833 | "D = X @ X.T\n", 834 | "B = -1 / 2 * J @ D @ J\n", 835 | "eigen_values, eigen_vectors = np.linalg.eig(B)\n", 836 | "\n", 837 | "# 下面用来对特征值排序,并且记录特征值的序号\n", 838 | "eigen_temp = []\n", 839 | "for i in range(eigen_values.shape[0]):\n", 840 | " # 找到是实数的特征值\n", 841 | " if eigen_values[i].imag == 0:\n", 842 | " eigen_temp.append([eigen_values[i].real, i])\n", 843 | "eigen_temp.sort(reverse=True)\n", 844 | "\n", 845 | "# 下面用来找出最大的d个实特征值的特征向量,假设d=1\n", 846 | "U = []\n", 847 | "Lambda_of_1_2 = []\n", 848 | "for i in range(1):\n", 849 | " U.append(eigen_vectors[:,eigen_temp[i][1]].real.T)\n", 850 | " Lambda_of_1_2.append(np.sqrt(eigen_temp[i][0].real))\n", 851 | "U = np.array(U).T\n", 852 | "Lambda_of_1_2 = np.array(Lambda_of_1_2)\n", 853 | "X_new = U @ Lambda_of_1_2\n", 854 | "X_new = X_new + get_mean(X_new)\n", 855 | "print(X_new)" 856 | ] 857 | }, 858 | { 859 | "cell_type": "markdown", 860 | "metadata": {}, 861 | "source": [ 862 | "这个方法我就不测试了,和前面的都一样。" 863 | ] 864 | }, 865 | { 866 | "cell_type": 
"markdown", 867 | "metadata": {}, 868 | "source": [ 869 | "#### 2. 度量型MSD和非度量型MSD\n", 870 | "\n", 871 | "这两个书上都只是简单的介绍了一下,看一下就行。这类算法的特点是实际实现没有解析解,要使用梯度下降来计算。" 872 | ] 873 | }, 874 | { 875 | "cell_type": "markdown", 876 | "metadata": {}, 877 | "source": [ 878 | "### 三、核主成分分析\n", 879 | "\n", 880 | "核主成分分析的思想是通过核函数把样本投影到高维,这样原来非线性的规律也变成了线性的,再通过进行线性的主成分分析。\n", 881 | "\n", 882 | "具体算法为:\n", 883 | "\n", 884 | "1. 计算核函数矩阵$\\mathbf{K}=\\{k(\\mathbf{x}_i,\\mathbf{x}_j)\\}_{n\\times n}$。\n", 885 | "2. 对$\\mathbf{K}$进行特征值分解得到第$l$大的特征值的特征向量$\\boldsymbol\\alpha^l = [\\alpha_1^l,\\alpha_2^l,\\cdots,\\alpha_n^l]^\\text{T}$。\n", 886 | "3. 计算出样本在第$l$非线性主成分方向的投影(新特征)$\\displaystyle z^l(\\mathbf{x})=\\sum_{i=1}^n\\alpha_i^lk(\\mathbf{x}_i,\\mathbf{x})$,保留前$k$个。\n", 887 | "\n", 888 | "需要注意的是由于引入了非线性变换,特征值分解得到的非零特征值可能会超过样本原来的维数。\n", 889 | "\n", 890 | "代码实现如下:" 891 | ] 892 | }, 893 | { 894 | "cell_type": "code", 895 | "execution_count": 19, 896 | "metadata": {}, 897 | "outputs": [], 898 | "source": [ 899 | "def rbf(_x1: np.ndarray, _x2: np.ndarray, sigma2: float) -> float:\n", 900 | " \"\"\"\n", 901 | " 径向基函数\n", 902 | " :param _x1: 第一个向量\n", 903 | " :param _x2: 第二个向量\n", 904 | " :param sigma2: rbf的sigma参数的平方\n", 905 | " :return: 函数值\n", 906 | " \"\"\"\n", 907 | " return np.exp(-np.sum((_x1 - _x2) ** 2) / sigma2)\n", 908 | "\n", 909 | "def kpca(_X: np.ndarray, k: int, kernel) -> np.ndarray:\n", 910 | " \"\"\"\n", 911 | " KPCA函数\n", 912 | " :param _X: 样本集\n", 913 | " :param k: 要保留的非线性主成分数量\n", 914 | " :param kernel: 使用的核函数\n", 915 | " :return: 投影后的新样本\n", 916 | " \"\"\"\n", 917 | " sigma2 = 1 / (_X.shape[1] * X.var()) # sigma平方用这个公式进行估计(sklearn中默认也是用这个算法)\n", 918 | "\n", 919 | " # 1. 计算核函数矩阵K\n", 920 | " K = np.zeros((n, n))\n", 921 | " for i in range(n):\n", 922 | " for j in range(n):\n", 923 | " K[i][j] = kernel(_X[i], _X[j], sigma2)\n", 924 | "\n", 925 | " # 2. 通过特征值分解计算alpha,并且只保留特征值最大的k个非线性主成分\n", 926 | " eigen_values, eigen_vectors = np.linalg.eig(K)\n", 927 | " eigen_temp = []\n", 928 | " for i, eigen_value in enumerate(eigen_values):\n", 929 | " eigen_temp.append([eigen_value, i])\n", 930 | " eigen_temp.sort()\n", 931 | " Alpha = np.zeros((k, n))\n", 932 | " for i in range(k):\n", 933 | " Alpha[i] = eigen_vectors[eigen_temp[i][1]]\n", 934 | "\n", 935 | " # 3. 
计算样本在非线性主成分方向上的投影\n", 936 | " X_new = np.zeros((n, k))\n", 937 | " for i, x in enumerate(_X):\n", 938 | " for l in range(k):\n", 939 | " for j in range(n):\n", 940 | " X_new[i][l] += Alpha[l][j] * kernel(x, _X[j], sigma2)\n", 941 | "\n", 942 | " return X_new" 943 | ] 944 | }, 945 | { 946 | "cell_type": "markdown", 947 | "metadata": {}, 948 | "source": [ 949 | "接下来测试一下保留不同数量非线性主成分时的正确率(为了方便起见我就不分训练集测试集了,直接拿全部数据训练核测试)" 950 | ] 951 | }, 952 | { 953 | "cell_type": "code", 954 | "execution_count": 23, 955 | "metadata": {}, 956 | "outputs": [ 957 | { 958 | "name": "stdout", 959 | "output_type": "stream", 960 | "text": [ 961 | "降维之后的样本维度:4 降维之后的正确率为:0.77\n", 962 | "降维之后的样本维度:20 降维之后的正确率为:0.95\n", 963 | "降维之后的样本维度:40 降维之后的正确率为:0.99\n" 964 | ] 965 | } 966 | ], 967 | "source": [ 968 | "svm_classifier = SVC(kernel='rbf')\n", 969 | "X_new = kpca(X, 4, rbf)\n", 970 | "svm_classifier.fit(X_new, y)\n", 971 | "print(\"降维之后的样本维度:\", X_new.shape[1], \" 降维之后的正确率为:\", svm_classifier.score(X_new, y), sep='')\n", 972 | "X_new = kpca(X, 20, rbf)\n", 973 | "svm_classifier.fit(X_new, y)\n", 974 | "print(\"降维之后的样本维度:\", X_new.shape[1], \" 降维之后的正确率为:\", svm_classifier.score(X_new, y), sep='')\n", 975 | "X_new = kpca(X, 40, rbf)\n", 976 | "svm_classifier.fit(X_new, y)\n", 977 | "print(\"降维之后的样本维度:\", X_new.shape[1], \" 降维之后的正确率为:\", svm_classifier.score(X_new, y), sep='')" 978 | ] 979 | }, 980 | { 981 | "cell_type": "markdown", 982 | "metadata": {}, 983 | "source": [ 984 | "可以看到KPCA方法的效果在水仙花数据上的表现可以说非常差。\n", 985 | "\n", 986 | "原始数据只有4维,但是用KPCA进行特征值分解的得到的特征值数量是和样本数量一样多的,这个数量远远大于4。\n", 987 | "\n", 988 | "所以就算“降维”(实际上是升维了)到40个维度,分类正确率才勉强到达99%。\n", 989 | "\n", 990 | "这说明KPCA只适用于样本数量少而原始特征维度高的数据。" 991 | ] 992 | }, 993 | { 994 | "cell_type": "markdown", 995 | "metadata": {}, 996 | "source": [ 997 | "### 四、IsoMap方法和LLE方法\n", 998 | "\n", 999 | "IsoMap方法把相邻样本看作相连的节点,边长就是欧氏距离,然后算出每两个样本之间的最短路作为距离度量。\n", 1000 | "\n", 1001 | "LLE主要思想是采取分治,把数据切割成一个一个小块,对每个小块求平均抽象成新的样本,然后再进行映射。\n", 1002 | "\n", 1003 | "书上只是简单介绍了一下思想,了解一下就行。" 1004 | ] 1005 | } 1006 | ], 1007 | "metadata": { 1008 | "kernelspec": { 1009 | "display_name": "Python 3 (ipykernel)", 1010 | "language": "python", 1011 | "name": "python3" 1012 | }, 1013 | "language_info": { 1014 | "codemirror_mode": { 1015 | "name": "ipython", 1016 | "version": 3 1017 | }, 1018 | "file_extension": ".py", 1019 | "mimetype": "text/x-python", 1020 | "name": "python", 1021 | "nbconvert_exporter": "python", 1022 | "pygments_lexer": "ipython3", 1023 | "version": "3.9.16" 1024 | } 1025 | }, 1026 | "nbformat": 4, 1027 | "nbformat_minor": 1 1028 | } 1029 | -------------------------------------------------------------------------------- /资料/7.Practice lab decision trees/C2_W4_Decision_Tree_with_Markdown.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Practice Lab: Decision Trees\n", 8 | "\n", 9 | "In this exercise, you will implement a decision tree from scratch and apply it to the task of classifying whether a mushroom is edible or poisonous.\n", 10 | "\n", 11 | "# Outline\n", 12 | "- [ 1 - Packages ](#1)\n", 13 | "- [ 2 - Problem Statement](#2)\n", 14 | "- [ 3 - Dataset](#3)\n", 15 | " - [ 3.1 One hot encoded dataset](#3.1)\n", 16 | "- [ 4 - Decision Tree Refresher](#4)\n", 17 | " - [ 4.1 Calculate entropy](#4.1)\n", 18 | " - [ Exercise 1](#ex01)\n", 19 | " - [ 4.2 Split dataset](#4.2)\n", 20 | " - [ Exercise 2](#ex02)\n", 21 | " - [ 4.3 Calculate information 
gain](#4.3)\n", 22 | " - [ Exercise 3](#ex03)\n", 23 | " - [ 4.4 Get best split](#4.4)\n", 24 | " - [ Exercise 4](#ex04)\n", 25 | "- [ 5 - Building the tree](#5)\n" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "\n", 33 | "## 1 - Packages \n", 34 | "\n", 35 | "First, let's run the cell below to import all the packages that you will need during this assignment.\n", 36 | "- [numpy](www.numpy.org) is the fundamental package for working with matrices in Python.\n", 37 | "- [matplotlib](http://matplotlib.org) is a famous library to plot graphs in Python.\n", 38 | "- ``utils.py`` contains helper functions for this assignment. You do not need to modify code in this file.\n" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "import numpy as np\n", 48 | "import matplotlib.pyplot as plt\n", 49 | "from public_tests import *\n", 50 | "\n", 51 | "%matplotlib inline" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "\n", 59 | "## 2 - Problem Statement\n", 60 | "\n", 61 | "Suppose you are starting a company that grows and sells wild mushrooms. \n", 62 | "- Since not all mushrooms are edible, you'd like to be able to tell whether a given mushroom is edible or poisonous based on it's physical attributes\n", 63 | "- You have some existing data that you can use for this task. \n", 64 | "\n", 65 | "Can you use the data to help you identify which mushrooms can be sold safely? \n", 66 | "\n", 67 | "Note: The dataset used is for illustrative purposes only. It is not meant to be a guide on identifying edible mushrooms.\n", 68 | "\n", 69 | "\n", 70 | "\n", 71 | "\n", 72 | "## 3 - Dataset\n", 73 | "\n", 74 | "You will start by loading the dataset for this task. The dataset you have collected is as follows:\n", 75 | "\n", 76 | "| Cap Color | Stalk Shape | Solitary | Edible |\n", 77 | "|:---------:|:-----------:|:--------:|:------:|\n", 78 | "| Brown | Tapering | Yes | 1 |\n", 79 | "| Brown | Enlarging | Yes | 1 |\n", 80 | "| Brown | Enlarging | No | 0 |\n", 81 | "| Brown | Enlarging | No | 0 |\n", 82 | "| Brown | Tapering | Yes | 1 |\n", 83 | "| Red | Tapering | Yes | 0 |\n", 84 | "| Red | Enlarging | No | 0 |\n", 85 | "| Brown | Enlarging | Yes | 1 |\n", 86 | "| Red | Tapering | No | 1 |\n", 87 | "| Brown | Enlarging | No | 0 |\n", 88 | "\n", 89 | "\n", 90 | "- You have 10 examples of mushrooms. 
For each example, you have\n", 91 | " - Three features\n", 92 | " - Cap Color (`Brown` or `Red`),\n", 93 | " - Stalk Shape (`Tapering` or `Enlarging`), and\n", 94 | " - Solitary (`Yes` or `No`)\n", 95 | " - Label\n", 96 | " - Edible (`1` indicating yes or `0` indicating poisonous)\n", 97 | "\n", 98 | "\n", 99 | "### 3.1 One hot encoded dataset\n", 100 | "For ease of implementation, we have one-hot encoded the features (turned them into 0 or 1 valued features)\n", 101 | "\n", 102 | "| Brown Cap | Tapering Stalk Shape | Solitary | Edible |\n", 103 | "|:---------:|:--------------------:|:--------:|:------:|\n", 104 | "| 1 | 1 | 1 | 1 |\n", 105 | "| 1 | 0 | 1 | 1 |\n", 106 | "| 1 | 0 | 0 | 0 |\n", 107 | "| 1 | 0 | 0 | 0 |\n", 108 | "| 1 | 1 | 1 | 1 |\n", 109 | "| 0 | 1 | 1 | 0 |\n", 110 | "| 0 | 0 | 0 | 0 |\n", 111 | "| 1 | 0 | 1 | 1 |\n", 112 | "| 0 | 1 | 0 | 1 |\n", 113 | "| 1 | 0 | 0 | 0 |\n", 114 | "\n", 115 | "Therefore,\n", 116 | "- `X_train` contains three features for each example \n", 117 | " - Brown Color (A value of `1` indicates \"Brown\" cap color and `0` indicates \"Red\" cap color)\n", 118 | " - Tapering Shape (A value of `1` indicates \"Tapering Stalk Shape\" and `0` indicates \"Enlarging\" stalk shape)\n", 119 | " - Solitary (A value of `1` indicates \"Yes\" and `0` indicates \"No\")\n", 120 | "\n", 121 | "- `y_train` is whether the mushroom is edible \n", 122 | " - `y = 1` indicates edible\n", 123 | " - `y = 0` indicates poisonous" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": null, 129 | "metadata": {}, 130 | "outputs": [], 131 | "source": [ 132 | "X_train = np.array([[1,1,1],[1,0,1],[1,0,0],[1,0,0],[1,1,1],[0,1,1],[0,0,0],[1,0,1],[0,1,0],[1,0,0]])\n", 133 | "y_train = np.array([1,1,0,0,1,0,0,1,1,0])" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "#### View the variables\n", 141 | "Let's get more familiar with your dataset. \n", 142 | "- A good place to start is to just print out each variable and see what it contains.\n", 143 | "\n", 144 | "The code below prints the first few elements of `X_train` and the type of the variable." 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": null, 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "print(\"First few elements of X_train:\\n\", X_train[:5])\n", 154 | "print(\"Type of X_train:\",type(X_train))" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "Now, let's do the same for `y_train`" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "print(\"First few elements of y_train:\", y_train[:5])\n", 171 | "print(\"Type of y_train:\",type(y_train))" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "#### Check the dimensions of your variables\n", 179 | "\n", 180 | "Another useful way to get familiar with your data is to view its dimensions.\n", 181 | "\n", 182 | "Please print the shape of `X_train` and `y_train` and see how many training examples you have in your dataset." 
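, "\n", "As an optional aside (illustrative only, not graded), the `Brown Cap` column of the one-hot table can be reproduced from the raw cap-color values; with 10 examples and 3 binary features you should also expect `X_train.shape` to be `(10, 3)` and `y_train.shape` to be `(10,)`.\n", "\n", "```python\n", "# Raw \"Cap Color\" column, copied from the table in Section 3\n", "cap_color = [\"Brown\", \"Brown\", \"Brown\", \"Brown\", \"Brown\", \"Red\", \"Red\", \"Brown\", \"Red\", \"Brown\"]\n", "\n", "# One-hot encode: 1 for \"Brown\", 0 for \"Red\"\n", "brown_cap = [1 if c == \"Brown\" else 0 for c in cap_color]\n", "print(brown_cap)  # [1, 1, 1, 1, 1, 0, 0, 1, 0, 1] -- matches the first column of X_train\n", "```"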
183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": {}, 189 | "outputs": [], 190 | "source": [ 191 | "print ('The shape of X_train is:', X_train.shape)\n", 192 | "print ('The shape of y_train is: ', y_train.shape)\n", 193 | "print ('Number of training examples (m):', len(X_train))" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "\n", 201 | "## 4 - Decision Tree Refresher\n", 202 | "\n", 203 | "In this practice lab, you will build a decision tree based on the dataset provided.\n", 204 | "\n", 205 | "- Recall that the steps for building a decision tree are as follows:\n", 206 | " - Start with all examples at the root node\n", 207 | " - Calculate information gain for splitting on all possible features, and pick the one with the highest information gain\n", 208 | " - Split dataset according to the selected feature, and create left and right branches of the tree\n", 209 | " - Keep repeating splitting process until stopping criteria is met\n", 210 | " \n", 211 | " \n", 212 | "- In this lab, you'll implement the following functions, which will let you split a node into left and right branches using the feature with the highest information gain\n", 213 | " - Calculate the entropy at a node \n", 214 | " - Split the dataset at a node into left and right branches based on a given feature\n", 215 | " - Calculate the information gain from splitting on a given feature\n", 216 | " - Choose the feature that maximizes information gain\n", 217 | " \n", 218 | "- We'll then use the helper functions you've implemented to build a decision tree by repeating the splitting process until the stopping criteria is met \n", 219 | " - For this lab, the stopping criteria we've chosen is setting a maximum depth of 2" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "\n", 227 | "### 4.1 Calculate entropy\n", 228 | "\n", 229 | "First, you'll write a helper function called `compute_entropy` that computes the entropy (measure of impurity) at a node. \n", 230 | "- The function takes in a numpy array (`y`) that indicates whether the examples in that node are edible (`1`) or poisonous(`0`) \n", 231 | "\n", 232 | "Complete the `compute_entropy()` function below to:\n", 233 | "* Compute $p_1$, which is the fraction of examples that are edible (i.e. have value = `1` in `y`)\n", 234 | "* The entropy is then calculated as \n", 235 | "\n", 236 | "$$H(p_1) = -p_1 \\text{log}_2(p_1) - (1- p_1) \\text{log}_2(1- p_1)$$\n", 237 | "* Note \n", 238 | " * The log is calculated with base $2$\n", 239 | " * For implementation purposes, $0\\text{log}_2(0) = 0$. That is, if `p_1 = 0` or `p_1 = 1`, set the entropy to `0`\n", 240 | " * Make sure to check that the data at a node is not empty (i.e. `len(y) != 0`). Return `0` if it is\n", 241 | " \n", 242 | "\n", 243 | "### Exercise 1\n", 244 | "\n", 245 | "Please complete the `compute_entropy()` function using the previous instructions.\n", 246 | " \n", 247 | "If you get stuck, you can check out the hints presented after the cell below to help you with the implementation." 
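, "\n", "Before implementing the function, it can help to evaluate the formula once by hand (optional, not graded). At the root node 5 of the 10 mushrooms are edible, so $p_1 = 0.5$:\n", "\n", "```python\n", "import numpy as np\n", "\n", "p1 = 0.5  # 5 edible out of 10 examples at the root node\n", "entropy = -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)\n", "print(entropy)  # 1.0 -- the maximum impurity for a binary label\n", "```"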
248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [ 256 | "# UNQ_C1\n", 257 | "# GRADED FUNCTION: compute_entropy\n", 258 | "\n", 259 | "def compute_entropy(y):\n", 260 | " \"\"\"\n", 261 | " Computes the entropy for \n", 262 | " \n", 263 | " Args:\n", 264 | " y (ndarray): Numpy array indicating whether each example at a node is\n", 265 | " edible (`1`) or poisonous (`0`)\n", 266 | " \n", 267 | " Returns:\n", 268 | " entropy (float): Entropy at that node\n", 269 | " \n", 270 | " \"\"\"\n", 271 | " # You need to return the following variables correctly\n", 272 | " entropy = 0.\n", 273 | " \n", 274 | " ### START CODE HERE ###\n", 275 | " \n", 276 | " ### END CODE HERE ### \n", 277 | " \n", 278 | " return entropy" 279 | ] 280 | }, 281 | { 282 | "cell_type": "markdown", 283 | "metadata": {}, 284 | "source": [ 285 | "
\n", 286 | " Click for hints\n", 287 | " \n", 288 | " \n", 289 | " * To calculate `p1`\n", 290 | " * You can get the subset of examples in `y` that have the value `1` as `y[y == 1]`\n", 291 | " * You can use `len(y)` to get the number of examples in `y`\n", 292 | " * To calculate `entropy`\n", 293 | " * np.log2 let's you calculate the logarithm to base 2 for a numpy array\n", 294 | " * If the value of `p1` is 0 or 1, make sure to set the entropy to `0` \n", 295 | " \n", 296 | "
\n", 297 | " Click for more hints\n", 298 | " \n", 299 | " * Here's how you can structure the overall implementation for this function\n", 300 | " ```python \n", 301 | " def compute_entropy(y):\n", 302 | " \n", 303 | " # You need to return the following variables correctly\n", 304 | " entropy = 0.\n", 305 | "\n", 306 | " ### START CODE HERE ###\n", 307 | " if len(y) != 0:\n", 308 | " # Your code here to calculate the fraction of edible examples (i.e with value = 1 in y)\n", 309 | " p1 =\n", 310 | "\n", 311 | " # For p1 = 0 and 1, set the entropy to 0 (to handle 0log0)\n", 312 | " if p1 != 0 and p1 != 1:\n", 313 | " # Your code here to calculate the entropy using the formula provided above\n", 314 | " entropy = \n", 315 | " else:\n", 316 | " entropy = 0. \n", 317 | " ### END CODE HERE ### \n", 318 | "\n", 319 | " return entropy\n", 320 | " ```\n", 321 | " \n", 322 | " If you're still stuck, you can check the hints presented below to figure out how to calculate `p1` and `entropy`.\n", 323 | " \n", 324 | "
\n", 325 | " Hint to calculate p1\n", 326 | "     You can compute p1 as p1 = len(y[y == 1]) / len(y) \n", 327 | "
\n", 328 | "\n", 329 | "
\n", 330 | " Hint to calculate entropy\n", 331 | "     You can compute entropy as entropy = -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)\n", 332 | "
\n", 333 | " \n", 334 | "
\n", 335 | "\n", 336 | "
\n", 337 | "\n", 338 | " \n" 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": {}, 344 | "source": [ 345 | "You can check if your implementation was correct by running the following test code:" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": null, 351 | "metadata": {}, 352 | "outputs": [], 353 | "source": [ 354 | "# Compute entropy at the root node (i.e. with all examples)\n", 355 | "# Since we have 5 edible and 5 non-edible mushrooms, the entropy should be 1\"\n", 356 | "\n", 357 | "print(\"Entropy at root node: \", compute_entropy(y_train)) \n", 358 | "\n", 359 | "# UNIT TESTS\n", 360 | "compute_entropy_test(compute_entropy)" 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "metadata": {}, 366 | "source": [ 367 | "**Expected Output**:\n", 368 | "\n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | "
Entropy at root node: 1.0
" 373 | ] 374 | }, 375 | { 376 | "cell_type": "markdown", 377 | "metadata": {}, 378 | "source": [ 379 | "\n", 380 | "### 4.2 Split dataset\n", 381 | "\n", 382 | "Next, you'll write a helper function called `split_dataset` that takes in the data at a node and a feature to split on and splits it into left and right branches. Later in the lab, you'll implement code to calculate how good the split is.\n", 383 | "\n", 384 | "- The function takes in the training data, the list of indices of data points at that node, along with the feature to split on. \n", 385 | "- It splits the data and returns the subset of indices at the left and the right branch.\n", 386 | "- For example, say we're starting at the root node (so `node_indices = [0,1,2,3,4,5,6,7,8,9]`), and we chose to split on feature `0`, which is whether or not the example has a brown cap.\n", 387 | " - The output of the function is then, `left_indices = [0,1,2,3,4,7,9]` and `right_indices = [5,6,8]`\n", 388 | " \n", 389 | "| Index | Brown Cap | Tapering Stalk Shape | Solitary | Edible |\n", 390 | "|:-----:|:---------:|:--------------------:|:--------:|:------:|\n", 391 | "| 0 | 1 | 1 | 1 | 1 |\n", 392 | "| 1 | 1 | 0 | 1 | 1 |\n", 393 | "| 2 | 1 | 0 | 0 | 0 |\n", 394 | "| 3 | 1 | 0 | 0 | 0 |\n", 395 | "| 4 | 1 | 1 | 1 | 1 |\n", 396 | "| 5 | 0 | 1 | 1 | 0 |\n", 397 | "| 6 | 0 | 0 | 0 | 0 |\n", 398 | "| 7 | 1 | 0 | 1 | 1 |\n", 399 | "| 8 | 0 | 1 | 0 | 1 |\n", 400 | "| 9 | 1 | 0 | 0 | 0 |\n", 401 | "\n", 402 | "\n", 403 | "### Exercise 2\n", 404 | "\n", 405 | "Please complete the `split_dataset()` function shown below\n", 406 | "\n", 407 | "- For each index in `node_indices`\n", 408 | " - If the value of `X` at that index for that feature is `1`, add the index to `left_indices`\n", 409 | " - If the value of `X` at that index for that feature is `0`, add the index to `right_indices`\n", 410 | "\n", 411 | "If you get stuck, you can check out the hints presented after the cell below to help you with the implementation." 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": null, 417 | "metadata": {}, 418 | "outputs": [], 419 | "source": [ 420 | "# UNQ_C2\n", 421 | "# GRADED FUNCTION: split_dataset\n", 422 | "\n", 423 | "def split_dataset(X, node_indices, feature):\n", 424 | " \"\"\"\n", 425 | " Splits the data at the given node into\n", 426 | " left and right branches\n", 427 | " \n", 428 | " Args:\n", 429 | " X (ndarray): Data matrix of shape(n_samples, n_features)\n", 430 | " node_indices (ndarray): List containing the active indices. I.e, the samples being considered at this step.\n", 431 | " feature (int): Index of feature to split on\n", 432 | " \n", 433 | " Returns:\n", 434 | " left_indices (ndarray): Indices with feature value == 1\n", 435 | " right_indices (ndarray): Indices with feature value == 0\n", 436 | " \"\"\"\n", 437 | " \n", 438 | " # You need to return the following variables correctly\n", 439 | " left_indices = []\n", 440 | " right_indices = []\n", 441 | " \n", 442 | " ### START CODE HERE ###\n", 443 | " \n", 444 | " ### END CODE HERE ###\n", 445 | " \n", 446 | " return left_indices, right_indices" 447 | ] 448 | }, 449 | { 450 | "cell_type": "markdown", 451 | "metadata": {}, 452 | "source": [ 453 | "
\n", 454 | " Click for hints\n", 455 | " \n", 456 | " \n", 457 | " * Here's how you can structure the overall implementation for this function\n", 458 | " ```python \n", 459 | " def split_dataset(X, node_indices, feature):\n", 460 | " \n", 461 | " # You need to return the following variables correctly\n", 462 | " left_indices = []\n", 463 | " right_indices = []\n", 464 | "\n", 465 | " ### START CODE HERE ###\n", 466 | " # Go through the indices of examples at that node\n", 467 | " for i in node_indices: \n", 468 | " if # Your code here to check if the value of X at that index for the feature is 1\n", 469 | " left_indices.append(i)\n", 470 | " else:\n", 471 | " right_indices.append(i)\n", 472 | " ### END CODE HERE ###\n", 473 | " \n", 474 | " return left_indices, right_indices\n", 475 | " ```\n", 476 | "
\n", 477 | " Click for more hints\n", 478 | " \n", 479 | " The condition is if X[i][feature] == 1:.\n", 480 | " \n", 481 | "
\n", 482 | "\n", 483 | "
\n", 484 | "\n", 485 | " \n" 486 | ] 487 | }, 488 | { 489 | "cell_type": "markdown", 490 | "metadata": {}, 491 | "source": [ 492 | "Now, let's check your implementation using the code blocks below. Let's try splitting the dataset at the root node, which contains all examples at feature 0 (Brown Cap) as we'd discussed above" 493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": null, 498 | "metadata": {}, 499 | "outputs": [], 500 | "source": [ 501 | "root_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n", 502 | "\n", 503 | "# Feel free to play around with these variables\n", 504 | "# The dataset only has three features, so this value can be 0 (Brown Cap), 1 (Tapering Stalk Shape) or 2 (Solitary)\n", 505 | "feature = 0\n", 506 | "\n", 507 | "left_indices, right_indices = split_dataset(X_train, root_indices, feature)\n", 508 | "\n", 509 | "print(\"Left indices: \", left_indices)\n", 510 | "print(\"Right indices: \", right_indices)\n", 511 | "\n", 512 | "# UNIT TESTS \n", 513 | "split_dataset_test(split_dataset)" 514 | ] 515 | }, 516 | { 517 | "cell_type": "markdown", 518 | "metadata": {}, 519 | "source": [ 520 | "**Expected Output**:\n", 521 | "```\n", 522 | "Left indices: [0, 1, 2, 3, 4, 7, 9]\n", 523 | "Right indices: [5, 6, 8]\n", 524 | "```" 525 | ] 526 | }, 527 | { 528 | "cell_type": "markdown", 529 | "metadata": {}, 530 | "source": [ 531 | "\n", 532 | "### 4.3 Calculate information gain\n", 533 | "\n", 534 | "Next, you'll write a function called `information_gain` that takes in the training data, the indices at a node and a feature to split on and returns the information gain from the split.\n", 535 | "\n", 536 | "\n", 537 | "### Exercise 3\n", 538 | "\n", 539 | "Please complete the `compute_information_gain()` function shown below to compute\n", 540 | "\n", 541 | "$$\\text{Information Gain} = H(p_1^\\text{node})- (w^{\\text{left}}H(p_1^\\text{left}) + w^{\\text{right}}H(p_1^\\text{right}))$$\n", 542 | "\n", 543 | "where \n", 544 | "- $H(p_1^\\text{node})$ is entropy at the node \n", 545 | "- $H(p_1^\\text{left})$ and $H(p_1^\\text{right})$ are the entropies at the left and the right branches resulting from the split\n", 546 | "- $w^{\\text{left}}$ and $w^{\\text{right}}$ are the proportion of examples at the left and right branch respectively\n", 547 | "\n", 548 | "Note:\n", 549 | "- You can use the `compute_entropy()` function that you implemented above to calculate the entropy\n", 550 | "- We've provided some starter code that uses the `split_dataset()` function you implemented above to split the dataset \n", 551 | "\n", 552 | "If you get stuck, you can check out the hints presented after the cell below to help you with the implementation." 553 | ] 554 | }, 555 | { 556 | "cell_type": "code", 557 | "execution_count": null, 558 | "metadata": {}, 559 | "outputs": [], 560 | "source": [ 561 | "# UNQ_C3\n", 562 | "# GRADED FUNCTION: compute_information_gain\n", 563 | "\n", 564 | "def compute_information_gain(X, y, node_indices, feature):\n", 565 | " \n", 566 | " \"\"\"\n", 567 | " Compute the information of splitting the node on a given feature\n", 568 | " \n", 569 | " Args:\n", 570 | " X (ndarray): Data matrix of shape(n_samples, n_features)\n", 571 | " y (array like): list or ndarray with n_samples containing the target variable\n", 572 | " node_indices (ndarray): List containing the active indices. 
I.e, the samples being considered in this step.\n", 573 | "   \n", 574 | "    Returns:\n", 575 | "        information_gain (float): Information gain computed from the split\n", 576 | "    \n", 577 | "    \"\"\"    \n", 578 | "    # Split dataset\n", 579 | "    left_indices, right_indices = split_dataset(X, node_indices, feature)\n", 580 | "    \n", 581 | "    # Some useful variables\n", 582 | "    X_node, y_node = X[node_indices], y[node_indices]\n", 583 | "    X_left, y_left = X[left_indices], y[left_indices]\n", 584 | "    X_right, y_right = X[right_indices], y[right_indices]\n", 585 | "    \n", 586 | "    # You need to return the following variables correctly\n", 587 | "    information_gain = 0\n", 588 | "    \n", 589 | "    ### START CODE HERE ###\n", 590 | "    \n", 591 | "    # Weights \n", 592 | "    \n", 593 | "    # Weighted entropy\n", 594 | "    \n", 595 | "    # Information gain    \n", 596 | "    \n", 597 | "    ### END CODE HERE ###  \n", 598 | "    \n", 599 | "    return information_gain" 600 | ] 601 | }, 602 | { 603 | "cell_type": "markdown", 604 | "metadata": {}, 605 | "source": [ 606 | "
\n", 607 | " Click for hints\n", 608 | " \n", 609 | " \n", 610 | " * Here's how you can structure the overall implementation for this function\n", 611 | " ```python \n", 612 | " def compute_information_gain(X, y, node_indices, feature):\n", 613 | " # Split dataset\n", 614 | " left_indices, right_indices = split_dataset(X, node_indices, feature)\n", 615 | "\n", 616 | " # Some useful variables\n", 617 | " X_node, y_node = X[node_indices], y[node_indices]\n", 618 | " X_left, y_left = X[left_indices], y[left_indices]\n", 619 | " X_right, y_right = X[right_indices], y[right_indices]\n", 620 | "\n", 621 | " # You need to return the following variables correctly\n", 622 | " information_gain = 0\n", 623 | "\n", 624 | " ### START CODE HERE ###\n", 625 | " # Your code here to compute the entropy at the node using compute_entropy()\n", 626 | " node_entropy = \n", 627 | " # Your code here to compute the entropy at the left branch\n", 628 | " left_entropy = \n", 629 | " # Your code here to compute the entropy at the right branch\n", 630 | " right_entropy = \n", 631 | "\n", 632 | " # Your code here to compute the proportion of examples at the left branch\n", 633 | " w_left = \n", 634 | " \n", 635 | " # Your code here to compute the proportion of examples at the right branch\n", 636 | " w_right = \n", 637 | "\n", 638 | " # Your code here to compute weighted entropy from the split using \n", 639 | " # w_left, w_right, left_entropy and right_entropy\n", 640 | " weighted_entropy = \n", 641 | "\n", 642 | " # Your code here to compute the information gain as the entropy at the node\n", 643 | " # minus the weighted entropy\n", 644 | " information_gain = \n", 645 | " ### END CODE HERE ### \n", 646 | "\n", 647 | " return information_gain\n", 648 | " ```\n", 649 | " If you're still stuck, check out the hints below.\n", 650 | " \n", 651 | "
\n", 652 | " Hint to calculate the entropies\n", 653 | " \n", 654 | " node_entropy = compute_entropy(y_node)
\n", 655 | " left_entropy = compute_entropy(y_left)
\n", 656 | " right_entropy = compute_entropy(y_right)\n", 657 | " \n", 658 | "
\n", 659 | " \n", 660 | "
\n", 661 | " Hint to calculate w_left and w_right\n", 662 | " w_left = len(X_left) / len(X_node)
\n", 663 | " w_right = len(X_right) / len(X_node)\n", 664 | "
\n", 665 | " \n", 666 | "
\n", 667 | " Hint to calculate weighted_entropy\n", 668 | " weighted_entropy = w_left * left_entropy + w_right * right_entropy\n", 669 | "
\n", 670 | " \n", 671 | "
\n", 672 | " Hint to calculate information_gain\n", 673 | " information_gain = node_entropy - weighted_entropy\n", 674 | "
\n", 675 | "\n", 676 | "\n", 677 | "
\n" 678 | ] 679 | }, 680 | { 681 | "cell_type": "markdown", 682 | "metadata": {}, 683 | "source": [ 684 | "You can now check your implementation using the cell below and calculate what the information gain would be from splitting on each of the featues" 685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": null, 690 | "metadata": {}, 691 | "outputs": [], 692 | "source": [ 693 | "info_gain0 = compute_information_gain(X_train, y_train, root_indices, feature=0)\n", 694 | "print(\"Information Gain from splitting the root on brown cap: \", info_gain0)\n", 695 | " \n", 696 | "info_gain1 = compute_information_gain(X_train, y_train, root_indices, feature=1)\n", 697 | "print(\"Information Gain from splitting the root on tapering stalk shape: \", info_gain1)\n", 698 | "\n", 699 | "info_gain2 = compute_information_gain(X_train, y_train, root_indices, feature=2)\n", 700 | "print(\"Information Gain from splitting the root on solitary: \", info_gain2)\n", 701 | "\n", 702 | "# UNIT TESTS\n", 703 | "compute_information_gain_test(compute_information_gain)" 704 | ] 705 | }, 706 | { 707 | "cell_type": "markdown", 708 | "metadata": {}, 709 | "source": [ 710 | "**Expected Output**:\n", 711 | "```\n", 712 | "Information Gain from splitting the root on brown cap: 0.034851554559677034\n", 713 | "Information Gain from splitting the root on tapering stalk shape: 0.12451124978365313\n", 714 | "Information Gain from splitting the root on solitary: 0.2780719051126377\n", 715 | "```" 716 | ] 717 | }, 718 | { 719 | "cell_type": "markdown", 720 | "metadata": {}, 721 | "source": [ 722 | "Splitting on \"Solitary\" (feature = 2) at the root node gives the maximum information gain. Therefore, it's the best feature to split on at the root node." 723 | ] 724 | }, 725 | { 726 | "cell_type": "markdown", 727 | "metadata": {}, 728 | "source": [ 729 | "\n", 730 | "### 4.4 Get best split\n", 731 | "Now let's write a function to get the best feature to split on by computing the information gain from each feature as we did above and returning the feature that gives the maximum information gain\n", 732 | "\n", 733 | "\n", 734 | "### Exercise 4\n", 735 | "Please complete the `get_best_split()` function shown below.\n", 736 | "- The function takes in the training data, along with the indices of datapoint at that node\n", 737 | "- The output of the function the feature that gives the maximum information gain \n", 738 | " - You can use the `compute_information_gain()` function to iterate through the features and calculate the information for each feature\n", 739 | "If you get stuck, you can check out the hints presented after the cell below to help you with the implementation." 740 | ] 741 | }, 742 | { 743 | "cell_type": "code", 744 | "execution_count": null, 745 | "metadata": {}, 746 | "outputs": [], 747 | "source": [ 748 | "# UNQ_C4\n", 749 | "# GRADED FUNCTION: get_best_split\n", 750 | "\n", 751 | "def get_best_split(X, y, node_indices): \n", 752 | " \"\"\"\n", 753 | " Returns the optimal feature and threshold value\n", 754 | " to split the node data \n", 755 | " \n", 756 | " Args:\n", 757 | " X (ndarray): Data matrix of shape(n_samples, n_features)\n", 758 | " y (array like): list or ndarray with n_samples containing the target variable\n", 759 | " node_indices (ndarray): List containing the active indices. 
I.e, the samples being considered in this step.\n", 760 | "\n", 761 | " Returns:\n", 762 | " best_feature (int): The index of the best feature to split\n", 763 | " \"\"\" \n", 764 | " \n", 765 | " # Some useful variables\n", 766 | " num_features = X.shape[1]\n", 767 | " \n", 768 | " # You need to return the following variables correctly\n", 769 | " best_feature = -1\n", 770 | " \n", 771 | " ### START CODE HERE ###\n", 772 | " \n", 773 | " ### END CODE HERE ## \n", 774 | " \n", 775 | " return best_feature" 776 | ] 777 | }, 778 | { 779 | "cell_type": "markdown", 780 | "metadata": {}, 781 | "source": [ 782 | "
\n", 783 | " Click for hints\n", 784 | " \n", 785 | " \n", 786 | " * Here's how you can structure the overall implementation for this function\n", 787 | " \n", 788 | " ```python \n", 789 | " def get_best_split(X, y, node_indices): \n", 790 | "\n", 791 | " # Some useful variables\n", 792 | " num_features = X.shape[1]\n", 793 | "\n", 794 | " # You need to return the following variables correctly\n", 795 | " best_feature = -1\n", 796 | "\n", 797 | " ### START CODE HERE ###\n", 798 | " max_info_gain = 0\n", 799 | "\n", 800 | " # Iterate through all features\n", 801 | " for feature in range(num_features): \n", 802 | " \n", 803 | " # Your code here to compute the information gain from splitting on this feature\n", 804 | " info_gain = \n", 805 | " \n", 806 | " # If the information gain is larger than the max seen so far\n", 807 | " if info_gain > max_info_gain: \n", 808 | " # Your code here to set the max_info_gain and best_feature\n", 809 | " max_info_gain = \n", 810 | " best_feature = \n", 811 | " ### END CODE HERE ## \n", 812 | " \n", 813 | " return best_feature\n", 814 | " ```\n", 815 | " If you're still stuck, check out the hints below.\n", 816 | " \n", 817 | "
\n", 818 | " Hint to calculate info_gain\n", 819 | " \n", 820 | " info_gain = compute_information_gain(X, y, node_indices, feature)\n", 821 | "
\n", 822 | " \n", 823 | "
\n", 824 | " Hint to update the max_info_gain and best_feature\n", 825 | " max_info_gain = info_gain
\n", 826 | " best_feature = feature\n", 827 | "
\n", 828 | "
\n" 829 | ] 830 | }, 831 | { 832 | "cell_type": "markdown", 833 | "metadata": {}, 834 | "source": [ 835 | "Now, let's check the implementation of your function using the cell below." 836 | ] 837 | }, 838 | { 839 | "cell_type": "code", 840 | "execution_count": null, 841 | "metadata": {}, 842 | "outputs": [], 843 | "source": [ 844 | "best_feature = get_best_split(X_train, y_train, root_indices)\n", 845 | "print(\"Best feature to split on: %d\" % best_feature)\n", 846 | "\n", 847 | "# UNIT TESTS\n", 848 | "get_best_split_test(get_best_split)" 849 | ] 850 | }, 851 | { 852 | "cell_type": "markdown", 853 | "metadata": {}, 854 | "source": [ 855 | "As we saw above, the function returns that the best feature to split on at the root node is feature 2 (\"Solitary\")" 856 | ] 857 | }, 858 | { 859 | "cell_type": "markdown", 860 | "metadata": {}, 861 | "source": [ 862 | "\n", 863 | "## 5 - Building the tree\n", 864 | "\n", 865 | "In this section, we use the functions you implemented above to generate a decision tree by successively picking the best feature to split on until we reach the stopping criteria (maximum depth is 2).\n", 866 | "\n", 867 | "You do not need to implement anything for this part." 868 | ] 869 | }, 870 | { 871 | "cell_type": "code", 872 | "execution_count": null, 873 | "metadata": {}, 874 | "outputs": [], 875 | "source": [ 876 | "# Not graded\n", 877 | "tree = []\n", 878 | "\n", 879 | "def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):\n", 880 | " \"\"\"\n", 881 | " Build a tree using the recursive algorithm that split the dataset into 2 subgroups at each node.\n", 882 | " This function just prints the tree.\n", 883 | " \n", 884 | " Args:\n", 885 | " X (ndarray): Data matrix of shape(n_samples, n_features)\n", 886 | " y (array like): list or ndarray with n_samples containing the target variable\n", 887 | " node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.\n", 888 | " branch_name (string): Name of the branch. ['Root', 'Left', 'Right']\n", 889 | " max_depth (int): Max depth of the resulting tree. \n", 890 | " current_depth (int): Current depth. Parameter used during recursive call.\n", 891 | " \n", 892 | " \"\"\" \n", 893 | "\n", 894 | " # Maximum depth reached - stop splitting\n", 895 | " if current_depth == max_depth:\n", 896 | " formatting = \" \"*current_depth + \"-\"*current_depth\n", 897 | " print(formatting, \"%s leaf node with indices\" % branch_name, node_indices)\n", 898 | " return\n", 899 | " \n", 900 | " # Otherwise, get best split and split the data\n", 901 | " # Get the best feature and threshold at this node\n", 902 | " best_feature = get_best_split(X, y, node_indices) \n", 903 | " tree.append((current_depth, branch_name, best_feature, node_indices))\n", 904 | " \n", 905 | " formatting = \"-\"*current_depth\n", 906 | " print(\"%s Depth %d, %s: Split on feature: %d\" % (formatting, current_depth, branch_name, best_feature))\n", 907 | " \n", 908 | " # Split the dataset at the best feature\n", 909 | " left_indices, right_indices = split_dataset(X, node_indices, best_feature)\n", 910 | " \n", 911 | " # continue splitting the left and the right child. 
Increment current depth\n", 912 | " build_tree_recursive(X, y, left_indices, \"Left\", max_depth, current_depth+1)\n", 913 | " build_tree_recursive(X, y, right_indices, \"Right\", max_depth, current_depth+1)\n" 914 | ] 915 | }, 916 | { 917 | "cell_type": "code", 918 | "execution_count": null, 919 | "metadata": {}, 920 | "outputs": [], 921 | "source": [ 922 | "build_tree_recursive(X_train, y_train, root_indices, \"Root\", max_depth=2, current_depth=0)" 923 | ] 924 | }, 925 | { 926 | "cell_type": "code", 927 | "execution_count": null, 928 | "metadata": {}, 929 | "outputs": [], 930 | "source": [] 931 | } 932 | ], 933 | "metadata": { 934 | "kernelspec": { 935 | "display_name": "Python 3", 936 | "language": "python", 937 | "name": "python3" 938 | }, 939 | "language_info": { 940 | "codemirror_mode": { 941 | "name": "ipython", 942 | "version": 3 943 | }, 944 | "file_extension": ".py", 945 | "mimetype": "text/x-python", 946 | "name": "python", 947 | "nbconvert_exporter": "python", 948 | "pygments_lexer": "ipython3", 949 | "version": "3.7.6" 950 | } 951 | }, 952 | "nbformat": 4, 953 | "nbformat_minor": 5 954 | } 955 | --------------------------------------------------------------------------------