├── imgs
│   ├── 1-1.PNG
│   ├── 2-10.JPG
│   ├── 2-11.JPG
│   ├── 2-12.JPG
│   ├── 2-13.JPG
│   ├── 2-14.JPG
│   ├── 2-15.JPG
│   ├── 2-16.JPG
│   ├── 2-17.JPG
│   ├── 2-18.JPG
│   ├── 2-19.JPG
│   ├── 2-20.JPG
│   ├── 2-21.JPG
│   ├── 2-22.JPG
│   ├── 2-23.JPG
│   ├── 2-24.JPG
│   ├── 2-25.JPG
│   ├── 2-26.JPG
│   ├── 2-27.JPG
│   ├── 2-28.JPG
│   ├── 2-29.JPG
│   ├── 2-30.JPG
│   ├── 2-31.JPG
│   ├── 2-32.JPG
│   ├── 2-33.JPG
│   ├── 2-34.JPG
│   ├── 2-35.JPG
│   ├── 2-4.JPG
│   ├── 2-5.JPG
│   ├── 2-6.JPG
│   ├── 2-7.JPG
│   ├── 2-8.JPG
│   ├── 2-9.JPG
│   └── cover.PNG
├── mglearn
│   ├── __pycache__
│   │   ├── plots.cpython-35.pyc
│   │   ├── tools.cpython-35.pyc
│   │   ├── __init__.cpython-35.pyc
│   │   ├── datasets.cpython-35.pyc
│   │   ├── plot_nmf.cpython-35.pyc
│   │   ├── plot_pca.cpython-35.pyc
│   │   ├── plot_dbscan.cpython-35.pyc
│   │   ├── plot_kmeans.cpython-35.pyc
│   │   ├── plot_ridge.cpython-35.pyc
│   │   ├── plot_helpers.cpython-35.pyc
│   │   ├── plot_metrics.cpython-35.pyc
│   │   ├── plot_nn_graphs.cpython-35.pyc
│   │   ├── plot_scaling.cpython-35.pyc
│   │   ├── plot_animal_tree.cpython-35.pyc
│   │   ├── plot_grid_search.cpython-35.pyc
│   │   ├── plot_2d_separator.cpython-35.pyc
│   │   ├── plot_agglomerative.cpython-35.pyc
│   │   ├── plot_decomposition.cpython-35.pyc
│   │   ├── plot_knn_regression.cpython-35.pyc
│   │   ├── plot_cross_validation.cpython-35.pyc
│   │   ├── plot_interactive_tree.cpython-35.pyc
│   │   ├── plot_knn_classification.cpython-35.pyc
│   │   ├── plot_linear_regression.cpython-35.pyc
│   │   ├── plot_rbf_svm_parameters.cpython-35.pyc
│   │   ├── plot_tree_nonmonotonous.cpython-35.pyc
│   │   ├── plot_improper_preprocessing.cpython-35.pyc
│   │   └── plot_linear_svc_regularization.cpython-35.pyc
│   ├── __init__.py
│   ├── plot_animal_tree.py
│   ├── plot_tree_nonmonotonous.py
│   ├── plot_kneighbors_regularization.py
│   ├── plot_ridge.py
│   ├── plot_linear_regression.py
│   ├── plot_knn_classification.py
│   ├── plot_decomposition.py
│   ├── plot_linear_svc_regularization.py
│   ├── plot_rbf_svm_parameters.py
│   ├── plot_knn_regression.py
│   ├── plot_scaling.py
│   ├── plot_dbscan.py
│   ├── datasets.py
│   ├── plot_interactive_tree.py
│   ├── plot_improper_preprocessing.py
│   ├── plot_nmf.py
│   ├── plots.py
│   ├── plot_helpers.py
│   ├── plot_agglomerative.py
│   ├── make_blobs.py
│   ├── plot_nn_graphs.py
│   ├── plot_metrics.py
│   ├── plot_grid_search.py
│   ├── plot_pca.py
│   ├── tools.py
│   ├── plot_2d_separator.py
│   ├── plot_kmeans.py
│   └── plot_cross_validation.py
├── README.md
├── data
│   └── ram_price.csv
├── 第2章-监督学习-决策树集成.ipynb
├── 第2章-监督学习-K近邻.ipynb
├── 第2章-监督学习-决策树.ipynb
└── 第2章-监督学习-线性模型.ipynb
/imgs/1-1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/1-1.PNG --------------------------------------------------------------------------------
/imgs/2-10.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-10.JPG --------------------------------------------------------------------------------
/imgs/2-11.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-11.JPG --------------------------------------------------------------------------------
/imgs/2-12.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-12.JPG --------------------------------------------------------------------------------
/imgs/2-13.JPG: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-13.JPG -------------------------------------------------------------------------------- /imgs/2-14.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-14.JPG -------------------------------------------------------------------------------- /imgs/2-15.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-15.JPG -------------------------------------------------------------------------------- /imgs/2-16.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-16.JPG -------------------------------------------------------------------------------- /imgs/2-17.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-17.JPG -------------------------------------------------------------------------------- /imgs/2-18.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-18.JPG -------------------------------------------------------------------------------- /imgs/2-19.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-19.JPG -------------------------------------------------------------------------------- /imgs/2-20.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-20.JPG -------------------------------------------------------------------------------- /imgs/2-21.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-21.JPG -------------------------------------------------------------------------------- /imgs/2-22.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-22.JPG -------------------------------------------------------------------------------- /imgs/2-23.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-23.JPG -------------------------------------------------------------------------------- /imgs/2-24.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-24.JPG -------------------------------------------------------------------------------- /imgs/2-25.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-25.JPG 
-------------------------------------------------------------------------------- /imgs/2-26.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-26.JPG -------------------------------------------------------------------------------- /imgs/2-27.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-27.JPG -------------------------------------------------------------------------------- /imgs/2-28.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-28.JPG -------------------------------------------------------------------------------- /imgs/2-29.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-29.JPG -------------------------------------------------------------------------------- /imgs/2-30.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-30.JPG -------------------------------------------------------------------------------- /imgs/2-31.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-31.JPG -------------------------------------------------------------------------------- /imgs/2-32.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-32.JPG -------------------------------------------------------------------------------- /imgs/2-33.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-33.JPG -------------------------------------------------------------------------------- /imgs/2-34.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-34.JPG -------------------------------------------------------------------------------- /imgs/2-35.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-35.JPG -------------------------------------------------------------------------------- /imgs/2-4.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-4.JPG -------------------------------------------------------------------------------- /imgs/2-5.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-5.JPG -------------------------------------------------------------------------------- /imgs/2-6.JPG: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-6.JPG -------------------------------------------------------------------------------- /imgs/2-7.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-7.JPG -------------------------------------------------------------------------------- /imgs/2-8.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-8.JPG -------------------------------------------------------------------------------- /imgs/2-9.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/2-9.JPG -------------------------------------------------------------------------------- /imgs/cover.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/imgs/cover.PNG -------------------------------------------------------------------------------- /mglearn/__pycache__/plots.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plots.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/tools.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/tools.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/__init__.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/__init__.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/datasets.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/datasets.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_nmf.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_nmf.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_pca.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_pca.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_dbscan.cpython-35.pyc: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_dbscan.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_kmeans.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_kmeans.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_ridge.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_ridge.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_helpers.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_helpers.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_metrics.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_metrics.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_nn_graphs.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_nn_graphs.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_scaling.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_scaling.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_animal_tree.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_animal_tree.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_grid_search.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_grid_search.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_2d_separator.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_2d_separator.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_agglomerative.cpython-35.pyc: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_agglomerative.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_decomposition.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_decomposition.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_knn_regression.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_knn_regression.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_cross_validation.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_cross_validation.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_interactive_tree.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_interactive_tree.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_knn_classification.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_knn_classification.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_linear_regression.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_linear_regression.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_rbf_svm_parameters.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_rbf_svm_parameters.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_tree_nonmonotonous.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_tree_nonmonotonous.cpython-35.pyc -------------------------------------------------------------------------------- /mglearn/__pycache__/plot_improper_preprocessing.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_improper_preprocessing.cpython-35.pyc -------------------------------------------------------------------------------- 
/mglearn/__pycache__/plot_linear_svc_regularization.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Holy-Shine/Introduciton-2-ML-with-Python-notebook/HEAD/mglearn/__pycache__/plot_linear_svc_regularization.cpython-35.pyc --------------------------------------------------------------------------------
/mglearn/__init__.py: -------------------------------------------------------------------------------- 1 | from . import plots 2 | from . import tools 3 | from .plots import cm3, cm2 4 | from .tools import discrete_scatter 5 | from .plot_helpers import ReBl 6 | 7 | __all__ = ['tools', 'plots', 'cm3', 'cm2', 'discrete_scatter', 'ReBl'] 8 | --------------------------------------------------------------------------------
/README.md: -------------------------------------------------------------------------------- 1 | # Introduciton-2-ML-with-Python-notebook 2 | Chinese-language Jupyter notebooks for *Introduction to Machine Learning with Python* (《Python机器学习基础教程》) 3 | 4 | Essentially the content of *Introduction to Machine Learning with Python* carried over into `jupyter notebook` files, to make it easier to take notes and study. 5 | 6 | 7 | 8 | ## Quick start 9 | 10 | 1. Open a command line in the directory that contains the `.ipynb` files 11 | 2. Type `jupyter notebook` (it must be installed first; the Anaconda Python distribution ships with it) 12 | 3. Open `localhost:8888` in a browser 13 | 14 | --------------------------------------------------------------------------------
/mglearn/plot_animal_tree.py: -------------------------------------------------------------------------------- 1 | from scipy.misc import imread 2 | import matplotlib.pyplot as plt 3 | 4 | 5 | def plot_animal_tree(ax=None): 6 | import graphviz 7 | if ax is None: 8 | ax = plt.gca() 9 | mygraph = graphviz.Digraph(node_attr={'shape': 'box'}, 10 | edge_attr={'labeldistance': "10.5"}, 11 | format="png") 12 | mygraph.node("0", "Has feathers?") 13 | mygraph.node("1", "Can fly?") 14 | mygraph.node("2", "Has fins?") 15 | mygraph.node("3", "Hawk") 16 | mygraph.node("4", "Penguin") 17 | mygraph.node("5", "Dolphin") 18 | mygraph.node("6", "Bear") 19 | mygraph.edge("0", "1", label="True") 20 | mygraph.edge("0", "2", label="False") 21 | mygraph.edge("1", "3", label="True") 22 | mygraph.edge("1", "4", label="False") 23 | mygraph.edge("2", "5", label="True") 24 | mygraph.edge("2", "6", label="False") 25 | mygraph.render("tmp") 26 | ax.imshow(imread("tmp.png")) 27 | ax.set_axis_off() 28 | --------------------------------------------------------------------------------
/mglearn/plot_tree_nonmonotonous.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | from sklearn.datasets import make_blobs 3 | from sklearn.tree import DecisionTreeClassifier, export_graphviz 4 | from .tools import discrete_scatter 5 | from .plot_2d_separator import plot_2d_separator 6 | 7 | 8 | def plot_tree_not_monotone(): 9 | import graphviz 10 | # make a simple 2d dataset 11 | X, y = make_blobs(centers=4, random_state=8) 12 | y = y % 2 13 | plt.figure() 14 | discrete_scatter(X[:, 0], X[:, 1], y) 15 | plt.legend(["Class 0", "Class 1"], loc="best") 16 | 17 | # learn a decision tree model 18 | tree = DecisionTreeClassifier(random_state=0).fit(X, y) 19 | plot_2d_separator(tree, X, linestyle="dashed") 20 | 21 | # visualize the tree 22 | export_graphviz(tree, out_file="mytree.dot", impurity=False, filled=True) 23 | with open("mytree.dot") as f: 24 | dot_graph = f.read() 25 | print("Feature importances: %s" % tree.feature_importances_) 26 | return graphviz.Source(dot_graph) 27 | --------------------------------------------------------------------------------
/mglearn/plot_kneighbors_regularization.py:
-------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | from sklearn.neighbors import KNeighborsRegressor 5 | 6 | 7 | def plot_kneighbors_regularization(): 8 | rnd = np.random.RandomState(42) 9 | x = np.linspace(-3, 3, 100) 10 | y_no_noise = np.sin(4 * x) + x 11 | y = y_no_noise + rnd.normal(size=len(x)) 12 | X = x[:, np.newaxis] 13 | fig, axes = plt.subplots(1, 3, figsize=(15, 5)) 14 | 15 | x_test = np.linspace(-3, 3, 1000) 16 | 17 | for n_neighbors, ax in zip([2, 5, 20], axes.ravel()): 18 | kneighbor_regression = KNeighborsRegressor(n_neighbors=n_neighbors) 19 | kneighbor_regression.fit(X, y) 20 | ax.plot(x, y_no_noise, label="true function") 21 | ax.plot(x, y, "o", label="data") 22 | ax.plot(x_test, kneighbor_regression.predict(x_test[:, np.newaxis]), 23 | label="prediction") 24 | ax.legend() 25 | ax.set_title("n_neighbors = %d" % n_neighbors) 26 | 27 | 28 | if __name__ == "__main__": 29 | plot_kneighbors_regularization() 30 | plt.show() 31 | -------------------------------------------------------------------------------- /mglearn/plot_ridge.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | import numpy as np 3 | 4 | from sklearn.linear_model import Ridge, LinearRegression 5 | from sklearn.model_selection import learning_curve, KFold 6 | 7 | from .datasets import load_extended_boston 8 | 9 | 10 | def plot_learning_curve(est, X, y): 11 | training_set_size, train_scores, test_scores = learning_curve( 12 | est, X, y, train_sizes=np.linspace(.1, 1, 20), cv=KFold(20, shuffle=True, random_state=1)) 13 | estimator_name = est.__class__.__name__ 14 | line = plt.plot(training_set_size, train_scores.mean(axis=1), '--', 15 | label="training " + estimator_name) 16 | plt.plot(training_set_size, test_scores.mean(axis=1), '-', 17 | label="test " + estimator_name, c=line[0].get_color()) 18 | plt.xlabel('Training set size') 19 | plt.ylabel('Score (R^2)') 20 | plt.ylim(0, 1.1) 21 | 22 | 23 | def plot_ridge_n_samples(): 24 | X, y = load_extended_boston() 25 | 26 | plot_learning_curve(Ridge(alpha=1), X, y) 27 | plot_learning_curve(LinearRegression(), X, y) 28 | plt.legend(loc=(0, 1.05), ncol=2, fontsize=11) 29 | -------------------------------------------------------------------------------- /mglearn/plot_linear_regression.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | from sklearn.linear_model import LinearRegression 5 | from sklearn.model_selection import train_test_split 6 | from .datasets import make_wave 7 | from .plot_helpers import cm2 8 | 9 | 10 | def plot_linear_regression_wave(): 11 | X, y = make_wave(n_samples=60) 12 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42) 13 | 14 | line = np.linspace(-3, 3, 100).reshape(-1, 1) 15 | 16 | lr = LinearRegression().fit(X_train, y_train) 17 | print("w[0]: %f b: %f" % (lr.coef_[0], lr.intercept_)) 18 | 19 | plt.figure(figsize=(8, 8)) 20 | plt.plot(line, lr.predict(line)) 21 | plt.plot(X, y, 'o', c=cm2(0)) 22 | ax = plt.gca() 23 | ax.spines['left'].set_position('center') 24 | ax.spines['right'].set_color('none') 25 | ax.spines['bottom'].set_position('center') 26 | ax.spines['top'].set_color('none') 27 | ax.set_ylim(-3, 3) 28 | #ax.set_xlabel("Feature") 29 | #ax.set_ylabel("Target") 30 | ax.legend(["model", "training data"], loc="best") 31 | ax.grid(True) 32 | 
ax.set_aspect('equal') 33 | -------------------------------------------------------------------------------- /mglearn/plot_knn_classification.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | from sklearn.metrics import euclidean_distances 5 | from sklearn.neighbors import KNeighborsClassifier 6 | 7 | from .datasets import make_forge 8 | from .plot_helpers import discrete_scatter 9 | 10 | 11 | def plot_knn_classification(n_neighbors=1): 12 | X, y = make_forge() 13 | 14 | X_test = np.array([[8.2, 3.66214339], [9.9, 3.2], [11.2, .5]]) 15 | dist = euclidean_distances(X, X_test) 16 | closest = np.argsort(dist, axis=0) 17 | 18 | for x, neighbors in zip(X_test, closest.T): 19 | for neighbor in neighbors[:n_neighbors]: 20 | plt.arrow(x[0], x[1], X[neighbor, 0] - x[0], 21 | X[neighbor, 1] - x[1], head_width=0, fc='k', ec='k') 22 | 23 | clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y) 24 | test_points = discrete_scatter(X_test[:, 0], X_test[:, 1], clf.predict(X_test), markers="*") 25 | training_points = discrete_scatter(X[:, 0], X[:, 1], y) 26 | plt.legend(training_points + test_points, ["training class 0", "training class 1", 27 | "test pred 0", "test pred 1"]) 28 | -------------------------------------------------------------------------------- /mglearn/plot_decomposition.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | from matplotlib.offsetbox import OffsetImage, AnnotationBbox 3 | 4 | 5 | def plot_decomposition(people, pca): 6 | image_shape = people.images[0].shape 7 | plt.figure(figsize=(20, 3)) 8 | ax = plt.gca() 9 | 10 | imagebox = OffsetImage(people.images[0], zoom=1.5, cmap="gray") 11 | ab = AnnotationBbox(imagebox, (.05, 0.4), pad=0.0, xycoords='data') 12 | ax.add_artist(ab) 13 | 14 | for i in range(4): 15 | imagebox = OffsetImage(pca.components_[i].reshape(image_shape), zoom=1.5, cmap="viridis") 16 | 17 | ab = AnnotationBbox(imagebox, (.3 + .2 * i, 0.4), 18 | pad=0.0, 19 | xycoords='data' 20 | ) 21 | ax.add_artist(ab) 22 | if i == 0: 23 | plt.text(.18, .25, 'x_%d *' % i, fontdict={'fontsize': 50}) 24 | else: 25 | plt.text(.15 + .2 * i, .25, '+ x_%d *' % i, fontdict={'fontsize': 50}) 26 | 27 | plt.text(.95, .25, '+ ...', fontdict={'fontsize': 50}) 28 | 29 | plt.rc('text', usetex=True) 30 | plt.text(.13, .3, r'\approx', fontdict={'fontsize': 50}) 31 | plt.axis("off") 32 | -------------------------------------------------------------------------------- /mglearn/plot_linear_svc_regularization.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | import numpy as np 3 | from sklearn.svm import LinearSVC 4 | from sklearn.datasets import make_blobs 5 | 6 | from .plot_helpers import discrete_scatter 7 | 8 | 9 | def plot_linear_svc_regularization(): 10 | X, y = make_blobs(centers=2, random_state=4, n_samples=30) 11 | fig, axes = plt.subplots(1, 3, figsize=(12, 4)) 12 | 13 | # a carefully hand-designed dataset lol 14 | y[7] = 0 15 | y[27] = 0 16 | x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5 17 | y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5 18 | 19 | for ax, C in zip(axes, [1e-2, 10, 1e3]): 20 | discrete_scatter(X[:, 0], X[:, 1], y, ax=ax) 21 | 22 | svm = LinearSVC(C=C, tol=0.00001, dual=False).fit(X, y) 23 | w = svm.coef_[0] 24 | a = -w[0] / w[1] 25 | xx = np.linspace(6, 13) 26 | yy = a * xx - (svm.intercept_[0]) / w[1] 
27 | ax.plot(xx, yy, c='k') 28 | ax.set_xlim(x_min, x_max) 29 | ax.set_ylim(y_min, y_max) 30 | ax.set_xticks(()) 31 | ax.set_yticks(()) 32 | ax.set_title("C = %f" % C) 33 | axes[0].legend(loc="best") 34 | 35 | if __name__ == "__main__": 36 | plot_linear_svc_regularization() 37 | plt.show() 38 | -------------------------------------------------------------------------------- /mglearn/plot_rbf_svm_parameters.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | from sklearn.svm import SVC 3 | from .plot_2d_separator import plot_2d_separator 4 | from .tools import make_handcrafted_dataset 5 | from .plot_helpers import discrete_scatter 6 | 7 | 8 | def plot_svm(log_C, log_gamma, ax=None): 9 | X, y = make_handcrafted_dataset() 10 | C = 10. ** log_C 11 | gamma = 10. ** log_gamma 12 | svm = SVC(kernel='rbf', C=C, gamma=gamma).fit(X, y) 13 | if ax is None: 14 | ax = plt.gca() 15 | plot_2d_separator(svm, X, ax=ax, eps=.5) 16 | # plot data 17 | discrete_scatter(X[:, 0], X[:, 1], y, ax=ax) 18 | # plot support vectors 19 | sv = svm.support_vectors_ 20 | # class labels of support vectors are given by the sign of the dual coefficients 21 | sv_labels = svm.dual_coef_.ravel() > 0 22 | discrete_scatter(sv[:, 0], sv[:, 1], sv_labels, s=15, markeredgewidth=3, ax=ax) 23 | ax.set_title("C = %.4f gamma = %.4f" % (C, gamma)) 24 | 25 | 26 | def plot_svm_interactive(): 27 | from IPython.html.widgets import interactive, FloatSlider 28 | C_slider = FloatSlider(min=-3, max=3, step=.1, value=0, readout=False) 29 | gamma_slider = FloatSlider(min=-2, max=2, step=.1, value=0, readout=False) 30 | return interactive(plot_svm, log_C=C_slider, log_gamma=gamma_slider) 31 | -------------------------------------------------------------------------------- /mglearn/plot_knn_regression.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | from sklearn.neighbors import KNeighborsRegressor 5 | from sklearn.metrics import euclidean_distances 6 | 7 | from .datasets import make_wave 8 | from .plot_helpers import cm3 9 | 10 | 11 | def plot_knn_regression(n_neighbors=1): 12 | X, y = make_wave(n_samples=40) 13 | X_test = np.array([[-1.5], [0.9], [1.5]]) 14 | 15 | dist = euclidean_distances(X, X_test) 16 | closest = np.argsort(dist, axis=0) 17 | 18 | plt.figure(figsize=(10, 6)) 19 | 20 | reg = KNeighborsRegressor(n_neighbors=n_neighbors).fit(X, y) 21 | y_pred = reg.predict(X_test) 22 | 23 | for x, y_, neighbors in zip(X_test, y_pred, closest.T): 24 | for neighbor in neighbors[:n_neighbors]: 25 | plt.arrow(x[0], y_, X[neighbor, 0] - x[0], y[neighbor] - y_, 26 | head_width=0, fc='k', ec='k') 27 | 28 | train, = plt.plot(X, y, 'o', c=cm3(0)) 29 | test, = plt.plot(X_test, -3 * np.ones(len(X_test)), '*', c=cm3(2), 30 | markersize=20) 31 | pred, = plt.plot(X_test, y_pred, '*', c=cm3(0), markersize=20) 32 | plt.vlines(X_test, -3.1, 3.1, linestyle="--") 33 | plt.legend([train, test, pred], 34 | ["training data/target", "test data", "test prediction"], 35 | ncol=3, loc=(.1, 1.025)) 36 | plt.ylim(-3.1, 3.1) 37 | plt.xlabel("Feature") 38 | plt.ylabel("Target") 39 | -------------------------------------------------------------------------------- /mglearn/plot_scaling.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | import numpy as np 3 | from sklearn.datasets import make_blobs 4 | from sklearn.preprocessing 
import (StandardScaler, MinMaxScaler, Normalizer, 5 | RobustScaler) 6 | from .plot_helpers import cm2 7 | 8 | 9 | def plot_scaling(): 10 | X, y = make_blobs(n_samples=50, centers=2, random_state=4, cluster_std=1) 11 | X += 3 12 | 13 | plt.figure(figsize=(15, 8)) 14 | main_ax = plt.subplot2grid((2, 4), (0, 0), rowspan=2, colspan=2) 15 | 16 | main_ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cm2, s=60) 17 | maxx = np.abs(X[:, 0]).max() 18 | maxy = np.abs(X[:, 1]).max() 19 | 20 | main_ax.set_xlim(-maxx + 1, maxx + 1) 21 | main_ax.set_ylim(-maxy + 1, maxy + 1) 22 | main_ax.set_title("Original Data") 23 | other_axes = [plt.subplot2grid((2, 4), (i, j)) 24 | for j in range(2, 4) for i in range(2)] 25 | 26 | for ax, scaler in zip(other_axes, [StandardScaler(), RobustScaler(), 27 | MinMaxScaler(), Normalizer(norm='l2')]): 28 | X_ = scaler.fit_transform(X) 29 | ax.scatter(X_[:, 0], X_[:, 1], c=y, cmap=cm2, s=60) 30 | ax.set_xlim(-2, 2) 31 | ax.set_ylim(-2, 2) 32 | ax.set_title(type(scaler).__name__) 33 | 34 | other_axes.append(main_ax) 35 | 36 | for ax in other_axes: 37 | ax.spines['left'].set_position('center') 38 | ax.spines['right'].set_color('none') 39 | ax.spines['bottom'].set_position('center') 40 | ax.spines['top'].set_color('none') 41 | ax.xaxis.set_ticks_position('bottom') 42 | ax.yaxis.set_ticks_position('left') 43 | -------------------------------------------------------------------------------- /mglearn/plot_dbscan.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | from sklearn.cluster import DBSCAN 4 | from sklearn.datasets import make_blobs 5 | 6 | from .plot_helpers import discrete_scatter, cm3 7 | 8 | 9 | def plot_dbscan(): 10 | X, y = make_blobs(random_state=0, n_samples=12) 11 | 12 | dbscan = DBSCAN() 13 | clusters = dbscan.fit_predict(X) 14 | clusters 15 | 16 | fig, axes = plt.subplots(3, 4, figsize=(11, 8), 17 | subplot_kw={'xticks': (), 'yticks': ()}) 18 | # Plot clusters as red, green and blue, and outliers (-1) as white 19 | colors = [cm3(1), cm3(0), cm3(2)] 20 | markers = ['o', '^', 'v'] 21 | 22 | # iterate over settings of min_samples and eps 23 | for i, min_samples in enumerate([2, 3, 5]): 24 | for j, eps in enumerate([1, 1.5, 2, 3]): 25 | # instantiate DBSCAN with a particular setting 26 | dbscan = DBSCAN(min_samples=min_samples, eps=eps) 27 | # get cluster assignments 28 | clusters = dbscan.fit_predict(X) 29 | print("min_samples: %d eps: %f cluster: %s" 30 | % (min_samples, eps, clusters)) 31 | if np.any(clusters == -1): 32 | c = ['w'] + colors 33 | m = ['o'] + markers 34 | else: 35 | c = colors 36 | m = markers 37 | discrete_scatter(X[:, 0], X[:, 1], clusters, ax=axes[i, j], c=c, 38 | s=8, markers=m) 39 | inds = dbscan.core_sample_indices_ 40 | # vizualize core samples and clusters. 
41 | if len(inds): 42 | discrete_scatter(X[inds, 0], X[inds, 1], clusters[inds], 43 | ax=axes[i, j], s=15, c=colors, 44 | markers=markers) 45 | axes[i, j].set_title("min_samples: %d eps: %.1f" 46 | % (min_samples, eps)) 47 | fig.tight_layout() 48 | -------------------------------------------------------------------------------- /mglearn/datasets.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import os 4 | from scipy import signal 5 | from sklearn.datasets import load_boston 6 | from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures 7 | from sklearn.datasets import make_blobs 8 | 9 | DATA_PATH = os.path.join(os.path.dirname(__file__), "..", "data") 10 | 11 | 12 | def make_forge(): 13 | # a carefully hand-designed dataset lol 14 | X, y = make_blobs(centers=2, random_state=4, n_samples=30) 15 | y[np.array([7, 27])] = 0 16 | mask = np.ones(len(X), dtype=np.bool) 17 | mask[np.array([0, 1, 5, 26])] = 0 18 | X, y = X[mask], y[mask] 19 | return X, y 20 | 21 | 22 | def make_wave(n_samples=100): 23 | rnd = np.random.RandomState(42) 24 | x = rnd.uniform(-3, 3, size=n_samples) 25 | y_no_noise = (np.sin(4 * x) + x) 26 | y = (y_no_noise + rnd.normal(size=len(x))) / 2 27 | return x.reshape(-1, 1), y 28 | 29 | 30 | def load_extended_boston(): 31 | boston = load_boston() 32 | X = boston.data 33 | 34 | X = MinMaxScaler().fit_transform(boston.data) 35 | X = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X) 36 | return X, boston.target 37 | 38 | 39 | def load_citibike(): 40 | data_mine = pd.read_csv(os.path.join(DATA_PATH, "citibike.csv")) 41 | data_mine['one'] = 1 42 | data_mine['starttime'] = pd.to_datetime(data_mine.starttime) 43 | data_starttime = data_mine.set_index("starttime") 44 | data_resampled = data_starttime.resample("3h").sum().fillna(0) 45 | return data_resampled.one 46 | 47 | 48 | def make_signals(): 49 | # fix a random state seed 50 | rng = np.random.RandomState(42) 51 | n_samples = 2000 52 | time = np.linspace(0, 8, n_samples) 53 | # create three signals 54 | s1 = np.sin(2 * time) # Signal 1 : sinusoidal signal 55 | s2 = np.sign(np.sin(3 * time)) # Signal 2 : square signal 56 | s3 = signal.sawtooth(2 * np.pi * time) # Signal 3: saw tooth signal 57 | 58 | # concatenate the signals, add noise 59 | S = np.c_[s1, s2, s3] 60 | S += 0.2 * rng.normal(size=S.shape) 61 | 62 | S /= S.std(axis=0) # Standardize data 63 | S -= S.min() 64 | return S 65 | -------------------------------------------------------------------------------- /mglearn/plot_interactive_tree.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | from sklearn.tree import DecisionTreeClassifier 5 | 6 | from sklearn.externals.six import StringIO # doctest: +SKIP 7 | from sklearn.tree import export_graphviz 8 | from scipy.misc import imread 9 | from scipy import ndimage 10 | from sklearn.datasets import make_moons 11 | 12 | import re 13 | 14 | from .tools import discrete_scatter 15 | from .plot_helpers import cm2 16 | 17 | 18 | def tree_image(tree, fout=None): 19 | try: 20 | import graphviz 21 | except ImportError: 22 | # make a hacky white plot 23 | x = np.ones((10, 10)) 24 | x[0, 0] = 0 25 | return x 26 | dot_data = StringIO() 27 | export_graphviz(tree, out_file=dot_data, max_depth=3, impurity=False) 28 | data = dot_data.getvalue() 29 | data = re.sub(r"samples = [0-9]+\\n", "", data) 30 | data = re.sub(r"\\nsamples = [0-9]+", 
"", data) 31 | data = re.sub(r"value", "counts", data) 32 | 33 | graph = graphviz.Source(data, format="png") 34 | if fout is None: 35 | fout = "tmp" 36 | graph.render(fout) 37 | return imread(fout + ".png") 38 | 39 | 40 | def plot_tree_progressive(): 41 | X, y = make_moons(n_samples=100, noise=0.25, random_state=3) 42 | plt.figure() 43 | ax = plt.gca() 44 | discrete_scatter(X[:, 0], X[:, 1], y, ax=ax) 45 | ax.set_xlabel("Feature 0") 46 | ax.set_ylabel("Feature 1") 47 | plt.legend(["Class 0", "Class 1"], loc='best') 48 | 49 | axes = [] 50 | for i in range(3): 51 | fig, ax = plt.subplots(1, 2, figsize=(12, 4), 52 | subplot_kw={'xticks': (), 'yticks': ()}) 53 | axes.append(ax) 54 | axes = np.array(axes) 55 | 56 | for i, max_depth in enumerate([1, 2, 9]): 57 | tree = plot_tree(X, y, max_depth=max_depth, ax=axes[i, 0]) 58 | axes[i, 1].imshow(tree_image(tree)) 59 | axes[i, 1].set_axis_off() 60 | 61 | 62 | def plot_tree_partition(X, y, tree, ax=None): 63 | if ax is None: 64 | ax = plt.gca() 65 | eps = X.std() / 2. 66 | 67 | x_min, x_max = X[:, 0].min() - eps, X[:, 0].max() + eps 68 | y_min, y_max = X[:, 1].min() - eps, X[:, 1].max() + eps 69 | xx = np.linspace(x_min, x_max, 1000) 70 | yy = np.linspace(y_min, y_max, 1000) 71 | 72 | X1, X2 = np.meshgrid(xx, yy) 73 | X_grid = np.c_[X1.ravel(), X2.ravel()] 74 | 75 | Z = tree.predict(X_grid) 76 | Z = Z.reshape(X1.shape) 77 | faces = tree.apply(X_grid) 78 | faces = faces.reshape(X1.shape) 79 | border = ndimage.laplace(faces) != 0 80 | ax.contourf(X1, X2, Z, alpha=.4, cmap=cm2, levels=[0, .5, 1]) 81 | ax.scatter(X1[border], X2[border], marker='.', s=1) 82 | 83 | discrete_scatter(X[:, 0], X[:, 1], y, ax=ax) 84 | ax.set_xlim(x_min, x_max) 85 | ax.set_ylim(y_min, y_max) 86 | ax.set_xticks(()) 87 | ax.set_yticks(()) 88 | return ax 89 | 90 | 91 | def plot_tree(X, y, max_depth=1, ax=None): 92 | tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0).fit(X, y) 93 | ax = plot_tree_partition(X, y, tree, ax=ax) 94 | ax.set_title("depth = %d" % max_depth) 95 | return tree 96 | -------------------------------------------------------------------------------- /mglearn/plot_improper_preprocessing.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | 3 | 4 | def make_bracket(s, xy, textxy, width, ax): 5 | annotation = ax.annotate( 6 | s, xy, textxy, ha="center", va="center", size=20, 7 | arrowprops=dict(arrowstyle="-[", fc="w", ec="k", 8 | lw=2,), bbox=dict(boxstyle="square", fc="w")) 9 | annotation.arrow_patch.get_arrowstyle().widthB = width 10 | 11 | 12 | def plot_improper_processing(): 13 | fig, axes = plt.subplots(2, 1, figsize=(15, 10)) 14 | 15 | for axis in axes: 16 | bars = axis.barh([0, 0, 0], [11.9, 2.9, 4.9], left=[0, 12, 15], 17 | color=['white', 'grey', 'grey'], hatch="//", 18 | align='edge', edgecolor='k') 19 | bars[2].set_hatch(r"") 20 | axis.set_yticks(()) 21 | axis.set_frame_on(False) 22 | axis.set_ylim(-.1, 6) 23 | axis.set_xlim(-0.1, 20.1) 24 | axis.set_xticks(()) 25 | axis.tick_params(length=0, labeltop=True, labelbottom=False) 26 | axis.text(6, -.3, "training folds", 27 | fontdict={'fontsize': 14}, horizontalalignment="center") 28 | axis.text(13.5, -.3, "validation fold", 29 | fontdict={'fontsize': 14}, horizontalalignment="center") 30 | axis.text(17.5, -.3, "test set", 31 | fontdict={'fontsize': 14}, horizontalalignment="center") 32 | 33 | make_bracket("scaler fit", (7.5, 1.3), (7.5, 2.), 15, axes[0]) 34 | make_bracket("SVC fit", (6, 3), (6, 4), 12, axes[0]) 35 | 
make_bracket("SVC predict", (13.4, 3), (13.4, 4), 2.5, axes[0]) 36 | 37 | axes[0].set_title("Cross validation") 38 | axes[1].set_title("Test set prediction") 39 | 40 | make_bracket("scaler fit", (7.5, 1.3), (7.5, 2.), 15, axes[1]) 41 | make_bracket("SVC fit", (7.5, 3), (7.5, 4), 15, axes[1]) 42 | make_bracket("SVC predict", (17.5, 3), (17.5, 4), 4.8, axes[1]) 43 | 44 | 45 | def plot_proper_processing(): 46 | fig, axes = plt.subplots(2, 1, figsize=(15, 8)) 47 | 48 | for axis in axes: 49 | bars = axis.barh([0, 0, 0], [11.9, 2.9, 4.9], 50 | left=[0, 12, 15], color=['white', 'grey', 'grey'], 51 | hatch="//", align='edge', edgecolor='k') 52 | bars[2].set_hatch(r"") 53 | axis.set_yticks(()) 54 | axis.set_frame_on(False) 55 | axis.set_ylim(-.1, 4.5) 56 | axis.set_xlim(-0.1, 20.1) 57 | axis.set_xticks(()) 58 | axis.tick_params(length=0, labeltop=True, labelbottom=False) 59 | axis.text(6, -.3, "training folds", fontdict={'fontsize': 14}, 60 | horizontalalignment="center") 61 | axis.text(13.5, -.3, "validation fold", fontdict={'fontsize': 14}, 62 | horizontalalignment="center") 63 | axis.text(17.5, -.3, "test set", fontdict={'fontsize': 14}, 64 | horizontalalignment="center") 65 | 66 | make_bracket("scaler fit", (6, 1.3), (6, 2.), 12, axes[0]) 67 | make_bracket("SVC fit", (6, 3), (6, 4), 12, axes[0]) 68 | make_bracket("SVC predict", (13.4, 3), (13.4, 4), 2.5, axes[0]) 69 | 70 | axes[0].set_title("Cross validation") 71 | axes[1].set_title("Test set prediction") 72 | 73 | make_bracket("scaler fit", (7.5, 1.3), (7.5, 2.), 15, axes[1]) 74 | make_bracket("SVC fit", (7.5, 3), (7.5, 4), 15, axes[1]) 75 | make_bracket("SVC predict", (17.5, 3), (17.5, 4), 4.8, axes[1]) 76 | fig.subplots_adjust(hspace=.3) 77 | -------------------------------------------------------------------------------- /mglearn/plot_nmf.py: -------------------------------------------------------------------------------- 1 | from sklearn.decomposition import NMF 2 | import matplotlib.pyplot as plt 3 | import numpy as np 4 | 5 | from sklearn.externals.joblib import Memory 6 | 7 | memory = Memory(cachedir="cache") 8 | 9 | 10 | def plot_nmf_illustration(): 11 | rnd = np.random.RandomState(5) 12 | X_ = rnd.normal(size=(300, 2)) 13 | # Add 8 to make sure every point lies in the positive part of the space 14 | X_blob = np.dot(X_, rnd.normal(size=(2, 2))) + rnd.normal(size=2) + 8 15 | 16 | nmf = NMF(random_state=0) 17 | nmf.fit(X_blob) 18 | X_nmf = nmf.transform(X_blob) 19 | 20 | fig, axes = plt.subplots(1, 2, figsize=(15, 5)) 21 | 22 | axes[0].scatter(X_blob[:, 0], X_blob[:, 1], c=X_nmf[:, 0], linewidths=0, 23 | s=60, cmap='viridis') 24 | axes[0].set_xlabel("feature 1") 25 | axes[0].set_ylabel("feature 2") 26 | axes[0].set_xlim(0, 12) 27 | axes[0].set_ylim(0, 12) 28 | axes[0].arrow(0, 0, nmf.components_[0, 0], nmf.components_[0, 1], width=.1, 29 | head_width=.3, color='k') 30 | axes[0].arrow(0, 0, nmf.components_[1, 0], nmf.components_[1, 1], width=.1, 31 | head_width=.3, color='k') 32 | axes[0].set_aspect('equal') 33 | axes[0].set_title("NMF with two components") 34 | 35 | # second plot 36 | nmf = NMF(random_state=0, n_components=1) 37 | nmf.fit(X_blob) 38 | 39 | axes[1].scatter(X_blob[:, 0], X_blob[:, 1], c=X_nmf[:, 0], linewidths=0, 40 | s=60, cmap='viridis') 41 | axes[1].set_xlabel("feature 1") 42 | axes[1].set_ylabel("feature 2") 43 | axes[1].set_xlim(0, 12) 44 | axes[1].set_ylim(0, 12) 45 | axes[1].arrow(0, 0, nmf.components_[0, 0], nmf.components_[0, 1], width=.1, 46 | head_width=.3, color='k') 47 | 48 | axes[1].set_aspect('equal') 49 | 
axes[1].set_title("NMF with one component") 50 | 51 | 52 | @memory.cache 53 | def nmf_faces(X_train, X_test): 54 | # Build NMF models with 10, 50, 100 and 500 components 55 | # this list will hold the back-transformd test-data 56 | reduced_images = [] 57 | for n_components in [10, 50, 100, 500]: 58 | # build the NMF model 59 | nmf = NMF(n_components=n_components, random_state=0) 60 | nmf.fit(X_train) 61 | # transform the test data (afterwards has n_components many dimensions) 62 | X_test_nmf = nmf.transform(X_test) 63 | # back-transform the transformed test-data 64 | # (afterwards it's in the original space again) 65 | X_test_back = np.dot(X_test_nmf, nmf.components_) 66 | reduced_images.append(X_test_back) 67 | return reduced_images 68 | 69 | 70 | def plot_nmf_faces(X_train, X_test, image_shape): 71 | reduced_images = nmf_faces(X_train, X_test) 72 | 73 | # plot the first three images in the test set: 74 | fix, axes = plt.subplots(3, 5, figsize=(15, 12), 75 | subplot_kw={'xticks': (), 'yticks': ()}) 76 | for i, ax in enumerate(axes): 77 | # plot original image 78 | ax[0].imshow(X_test[i].reshape(image_shape), 79 | vmin=0, vmax=1) 80 | # plot the four back-transformed images 81 | for a, X_test_back in zip(ax[1:], reduced_images): 82 | a.imshow(X_test_back[i].reshape(image_shape), vmin=0, vmax=1) 83 | 84 | # label the top row 85 | axes[0, 0].set_title("original image") 86 | for ax, n_components in zip(axes[0, 1:], [10, 50, 100, 500]): 87 | ax.set_title("%d components" % n_components) 88 | -------------------------------------------------------------------------------- /mglearn/plots.py: -------------------------------------------------------------------------------- 1 | from .plot_linear_svc_regularization import plot_linear_svc_regularization 2 | from .plot_interactive_tree import plot_tree_progressive, plot_tree_partition 3 | from .plot_animal_tree import plot_animal_tree 4 | from .plot_rbf_svm_parameters import plot_svm 5 | from .plot_knn_regression import plot_knn_regression 6 | from .plot_knn_classification import plot_knn_classification 7 | from .plot_2d_separator import plot_2d_classification, plot_2d_separator 8 | from .plot_nn_graphs import (plot_logistic_regression_graph, 9 | plot_single_hidden_layer_graph, 10 | plot_two_hidden_layer_graph) 11 | from .plot_linear_regression import plot_linear_regression_wave 12 | from .plot_tree_nonmonotonous import plot_tree_not_monotone 13 | from .plot_scaling import plot_scaling 14 | from .plot_pca import plot_pca_illustration, plot_pca_whitening, plot_pca_faces 15 | from .plot_decomposition import plot_decomposition 16 | from .plot_nmf import plot_nmf_illustration, plot_nmf_faces 17 | from .plot_helpers import cm2, cm3 18 | from .plot_agglomerative import plot_agglomerative, plot_agglomerative_algorithm 19 | from .plot_kmeans import plot_kmeans_algorithm, plot_kmeans_boundaries, plot_kmeans_faces 20 | from .plot_improper_preprocessing import plot_improper_processing, plot_proper_processing 21 | from .plot_cross_validation import (plot_threefold_split, plot_group_kfold, 22 | plot_shuffle_split, plot_cross_validation, 23 | plot_stratified_cross_validation) 24 | 25 | from .plot_grid_search import plot_grid_search_overview, plot_cross_val_selection 26 | from .plot_metrics import (plot_confusion_matrix_illustration, 27 | plot_binary_confusion_matrix, 28 | plot_decision_threshold) 29 | from .plot_dbscan import plot_dbscan 30 | from .plot_ridge import plot_ridge_n_samples 31 | 32 | __all__ = ['plot_linear_svc_regularization', 33 | "plot_animal_tree", 
"plot_tree_progressive", 34 | 'plot_tree_partition', 'plot_svm', 35 | 'plot_knn_regression', 36 | 'plot_logistic_regression_graph', 37 | 'plot_single_hidden_layer_graph', 38 | 'plot_two_hidden_layer_graph', 39 | 'plot_2d_classification', 40 | 'plot_2d_separator', 41 | 'plot_knn_classification', 42 | 'plot_linear_regression_wave', 43 | 'plot_tree_not_monotone', 44 | 'plot_scaling', 45 | 'plot_pca_illustration', 46 | 'plot_pca_faces', 47 | 'plot_pca_whitening', 48 | 'plot_decomposition', 49 | 'plot_nmf_illustration', 50 | 'plot_nmf_faces', 51 | 'plot_agglomerative', 52 | 'plot_agglomerative_algorithm', 53 | 'plot_kmeans_boundaries', 54 | 'plot_kmeans_algorithm', 55 | 'plot_kmeans_faces', 56 | 'cm3', 'cm2', 'plot_improper_processing', 'plot_proper_processing', 57 | 'plot_group_kfold', 58 | 'plot_shuffle_split', 59 | 'plot_stratified_cross_validation', 60 | 'plot_threefold_split', 61 | 'plot_cross_validation', 62 | 'plot_grid_search_overview', 63 | 'plot_cross_val_selection', 64 | 'plot_confusion_matrix_illustration', 65 | 'plot_binary_confusion_matrix', 66 | 'plot_decision_threshold', 67 | 'plot_dbscan', 68 | 'plot_ridge_n_samples' 69 | ] 70 | -------------------------------------------------------------------------------- /mglearn/plot_helpers.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib as mpl 3 | import matplotlib.pyplot as plt 4 | from matplotlib.colors import ListedColormap, colorConverter, LinearSegmentedColormap 5 | 6 | 7 | cm_cycle = ListedColormap(['#0000aa', '#ff5050', '#50ff50', '#9040a0', '#fff000']) 8 | cm3 = ListedColormap(['#0000aa', '#ff2020', '#50ff50']) 9 | cm2 = ListedColormap(['#0000aa', '#ff2020']) 10 | 11 | # create a smooth transition from the first to to the second color of cm3 12 | # similar to RdBu but with our red and blue, also not going through white, 13 | # which is really bad for greyscale 14 | 15 | cdict = {'red': [(0.0, 0.0, cm2(0)[0]), 16 | (1.0, cm2(1)[0], 1.0)], 17 | 18 | 'green': [(0.0, 0.0, cm2(0)[1]), 19 | (1.0, cm2(1)[1], 1.0)], 20 | 21 | 'blue': [(0.0, 0.0, cm2(0)[2]), 22 | (1.0, cm2(1)[2], 1.0)]} 23 | 24 | ReBl = LinearSegmentedColormap("ReBl", cdict) 25 | 26 | 27 | def discrete_scatter(x1, x2, y=None, markers=None, s=10, ax=None, 28 | labels=None, padding=.2, alpha=1, c=None, markeredgewidth=None): 29 | """Adaption of matplotlib.pyplot.scatter to plot classes or clusters. 30 | 31 | Parameters 32 | ---------- 33 | 34 | x1 : nd-array 35 | input data, first axis 36 | 37 | x2 : nd-array 38 | input data, second axis 39 | 40 | y : nd-array 41 | input data, discrete labels 42 | 43 | cmap : colormap 44 | Colormap to use. 45 | 46 | markers : list of string 47 | List of markers to use, or None (which defaults to 'o'). 48 | 49 | s : int or float 50 | Size of the marker 51 | 52 | padding : float 53 | Fraction of the dataset range to use for padding the axes. 54 | 55 | alpha : float 56 | Alpha value for all points. 
57 | """ 58 | if ax is None: 59 | ax = plt.gca() 60 | 61 | if y is None: 62 | y = np.zeros(len(x1)) 63 | 64 | unique_y = np.unique(y) 65 | 66 | if markers is None: 67 | markers = ['o', '^', 'v', 'D', 's', '*', 'p', 'h', 'H', '8', '<', '>'] * 10 68 | 69 | if len(markers) == 1: 70 | markers = markers * len(unique_y) 71 | 72 | if labels is None: 73 | labels = unique_y 74 | 75 | # lines in the matplotlib sense, not actual lines 76 | lines = [] 77 | 78 | current_cycler = mpl.rcParams['axes.prop_cycle'] 79 | 80 | for i, (yy, cycle) in enumerate(zip(unique_y, current_cycler())): 81 | mask = y == yy 82 | # if c is none, use color cycle 83 | if c is None: 84 | color = cycle['color'] 85 | elif len(c) > 1: 86 | color = c[i] 87 | else: 88 | color = c 89 | # use light edge for dark markers 90 | if np.mean(colorConverter.to_rgb(color)) < .4: 91 | markeredgecolor = "grey" 92 | else: 93 | markeredgecolor = "black" 94 | 95 | lines.append(ax.plot(x1[mask], x2[mask], markers[i], markersize=s, 96 | label=labels[i], alpha=alpha, c=color, 97 | markeredgewidth=markeredgewidth, 98 | markeredgecolor=markeredgecolor)[0]) 99 | 100 | if padding != 0: 101 | pad1 = x1.std() * padding 102 | pad2 = x2.std() * padding 103 | xlim = ax.get_xlim() 104 | ylim = ax.get_ylim() 105 | ax.set_xlim(min(x1.min() - pad1, xlim[0]), max(x1.max() + pad1, xlim[1])) 106 | ax.set_ylim(min(x2.min() - pad2, ylim[0]), max(x2.max() + pad2, ylim[1])) 107 | 108 | return lines 109 | -------------------------------------------------------------------------------- /mglearn/plot_agglomerative.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | import numpy as np 3 | from sklearn.datasets import make_blobs 4 | from sklearn.cluster import AgglomerativeClustering 5 | from sklearn.neighbors import KernelDensity 6 | 7 | 8 | def plot_agglomerative_algorithm(): 9 | # generate synthetic two-dimensional data 10 | X, y = make_blobs(random_state=0, n_samples=12) 11 | 12 | agg = AgglomerativeClustering(n_clusters=X.shape[0], compute_full_tree=True).fit(X) 13 | 14 | fig, axes = plt.subplots(X.shape[0] // 5, 5, subplot_kw={'xticks': (), 15 | 'yticks': ()}, 16 | figsize=(20, 8)) 17 | 18 | eps = X.std() / 2 19 | 20 | x_min, x_max = X[:, 0].min() - eps, X[:, 0].max() + eps 21 | y_min, y_max = X[:, 1].min() - eps, X[:, 1].max() + eps 22 | 23 | xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100)) 24 | gridpoints = np.c_[xx.ravel().reshape(-1, 1), yy.ravel().reshape(-1, 1)] 25 | 26 | for i, ax in enumerate(axes.ravel()): 27 | ax.set_xlim(x_min, x_max) 28 | ax.set_ylim(y_min, y_max) 29 | agg.n_clusters = X.shape[0] - i 30 | agg.fit(X) 31 | ax.set_title("Step %d" % i) 32 | ax.scatter(X[:, 0], X[:, 1], s=60, c='grey') 33 | bins = np.bincount(agg.labels_) 34 | for cluster in range(agg.n_clusters): 35 | if bins[cluster] > 1: 36 | points = X[agg.labels_ == cluster] 37 | other_points = X[agg.labels_ != cluster] 38 | 39 | kde = KernelDensity(bandwidth=.5).fit(points) 40 | scores = kde.score_samples(gridpoints) 41 | score_inside = np.min(kde.score_samples(points)) 42 | score_outside = np.max(kde.score_samples(other_points)) 43 | levels = .8 * score_inside + .2 * score_outside 44 | ax.contour(xx, yy, scores.reshape(100, 100), levels=[levels], 45 | colors='k', linestyles='solid', linewidths=2) 46 | 47 | axes[0, 0].set_title("Initialization") 48 | 49 | 50 | def plot_agglomerative(): 51 | X, y = make_blobs(random_state=0, n_samples=12) 52 | agg = 
AgglomerativeClustering(n_clusters=3) 53 | 54 | eps = X.std() / 2. 55 | 56 | x_min, x_max = X[:, 0].min() - eps, X[:, 0].max() + eps 57 | y_min, y_max = X[:, 1].min() - eps, X[:, 1].max() + eps 58 | 59 | xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100)) 60 | gridpoints = np.c_[xx.ravel().reshape(-1, 1), yy.ravel().reshape(-1, 1)] 61 | 62 | ax = plt.gca() 63 | for i, x in enumerate(X): 64 | ax.text(x[0] + .1, x[1], "%d" % i, horizontalalignment='left', verticalalignment='center') 65 | 66 | ax.scatter(X[:, 0], X[:, 1], s=60, c='grey') 67 | ax.set_xticks(()) 68 | ax.set_yticks(()) 69 | 70 | for i in range(11): 71 | agg.n_clusters = X.shape[0] - i 72 | agg.fit(X) 73 | 74 | bins = np.bincount(agg.labels_) 75 | for cluster in range(agg.n_clusters): 76 | if bins[cluster] > 1: 77 | points = X[agg.labels_ == cluster] 78 | other_points = X[agg.labels_ != cluster] 79 | 80 | kde = KernelDensity(bandwidth=.5).fit(points) 81 | scores = kde.score_samples(gridpoints) 82 | score_inside = np.min(kde.score_samples(points)) 83 | score_outside = np.max(kde.score_samples(other_points)) 84 | levels = .8 * score_inside + .2 * score_outside 85 | ax.contour(xx, yy, scores.reshape(100, 100), levels=[levels], 86 | colors='k', linestyles='solid', linewidths=1) 87 | 88 | ax.set_xlim(x_min, x_max) 89 | ax.set_ylim(y_min, y_max) 90 | -------------------------------------------------------------------------------- /mglearn/make_blobs.py: -------------------------------------------------------------------------------- 1 | import numbers 2 | import numpy as np 3 | 4 | from sklearn.utils import check_array, check_random_state 5 | from sklearn.utils import shuffle as shuffle_ 6 | from sklearn.utils.deprecation import deprecated 7 | 8 | 9 | @deprecated("Please import make_blobs directly from scikit-learn") 10 | def make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=1.0, 11 | center_box=(-10.0, 10.0), shuffle=True, random_state=None): 12 | """Generate isotropic Gaussian blobs for clustering. 13 | 14 | Read more in the :ref:`User Guide `. 15 | 16 | Parameters 17 | ---------- 18 | n_samples : int, or tuple, optional (default=100) 19 | The total number of points equally divided among clusters. 20 | 21 | n_features : int, optional (default=2) 22 | The number of features for each sample. 23 | 24 | centers : int or array of shape [n_centers, n_features], optional 25 | (default=3) 26 | The number of centers to generate, or the fixed center locations. 27 | 28 | cluster_std: float or sequence of floats, optional (default=1.0) 29 | The standard deviation of the clusters. 30 | 31 | center_box: pair of floats (min, max), optional (default=(-10.0, 10.0)) 32 | The bounding box for each cluster center when centers are 33 | generated at random. 34 | 35 | shuffle : boolean, optional (default=True) 36 | Shuffle the samples. 37 | 38 | random_state : int, RandomState instance or None, optional (default=None) 39 | If int, random_state is the seed used by the random number generator; 40 | If RandomState instance, random_state is the random number generator; 41 | If None, the random number generator is the RandomState instance used 42 | by `np.random`. 43 | 44 | Returns 45 | ------- 46 | X : array of shape [n_samples, n_features] 47 | The generated samples. 48 | 49 | y : array of shape [n_samples] 50 | The integer labels for cluster membership of each sample. 
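    Notes
    -----
    This local copy is kept only for backwards compatibility; as the
    @deprecated decorator above indicates, prefer sklearn.datasets.make_blobs.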
51 | 52 | Examples 53 | -------- 54 | >>> from sklearn.datasets.samples_generator import make_blobs 55 | >>> X, y = make_blobs(n_samples=10, centers=3, n_features=2, 56 | ... random_state=0) 57 | >>> print(X.shape) 58 | (10, 2) 59 | >>> y 60 | array([0, 0, 1, 0, 2, 2, 2, 1, 1, 0]) 61 | 62 | See also 63 | -------- 64 | make_classification: a more intricate variant 65 | """ 66 | generator = check_random_state(random_state) 67 | 68 | if isinstance(centers, numbers.Integral): 69 | centers = generator.uniform(center_box[0], center_box[1], 70 | size=(centers, n_features)) 71 | else: 72 | centers = check_array(centers) 73 | n_features = centers.shape[1] 74 | 75 | if isinstance(cluster_std, numbers.Real): 76 | cluster_std = np.ones(len(centers)) * cluster_std 77 | 78 | X = [] 79 | y = [] 80 | 81 | n_centers = centers.shape[0] 82 | if isinstance(n_samples, numbers.Integral): 83 | n_samples_per_center = [int(n_samples // n_centers)] * n_centers 84 | for i in range(n_samples % n_centers): 85 | n_samples_per_center[i] += 1 86 | else: 87 | n_samples_per_center = n_samples 88 | 89 | for i, (n, std) in enumerate(zip(n_samples_per_center, cluster_std)): 90 | X.append(centers[i] + generator.normal(scale=std, 91 | size=(n, n_features))) 92 | y += [i] * n 93 | 94 | X = np.concatenate(X) 95 | y = np.array(y) 96 | 97 | if shuffle: 98 | X, y = shuffle_(X, y, random_state=generator) 99 | 100 | return X, y 101 | -------------------------------------------------------------------------------- /mglearn/plot_nn_graphs.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | def plot_logistic_regression_graph(): 4 | import graphviz 5 | lr_graph = graphviz.Digraph(node_attr={'shape': 'circle', 'fixedsize': 'True'}, 6 | graph_attr={'rankdir': 'LR', 'splines': 'line'}) 7 | inputs = graphviz.Digraph(node_attr={'shape': 'circle'}, name="cluster_0") 8 | output = graphviz.Digraph(node_attr={'shape': 'circle'}, name="cluster_2") 9 | 10 | for i in range(4): 11 | inputs.node("x[%d]" % i, labelloc="c") 12 | inputs.body.append('label = "inputs"') 13 | inputs.body.append('color = "white"') 14 | 15 | lr_graph.subgraph(inputs) 16 | 17 | output.body.append('label = "output"') 18 | output.body.append('color = "white"') 19 | output.node("y") 20 | 21 | lr_graph.subgraph(output) 22 | 23 | for i in range(4): 24 | lr_graph.edge("x[%d]" % i, "y", label="w[%d]" % i) 25 | return lr_graph 26 | 27 | 28 | def plot_single_hidden_layer_graph(): 29 | import graphviz 30 | nn_graph = graphviz.Digraph(node_attr={'shape': 'circle', 'fixedsize': 'True'}, 31 | graph_attr={'rankdir': 'LR', 'splines': 'line'}) 32 | 33 | inputs = graphviz.Digraph(node_attr={'shape': 'circle'}, name="cluster_0") 34 | hidden = graphviz.Digraph(node_attr={'shape': 'circle'}, name="cluster_1") 35 | output = graphviz.Digraph(node_attr={'shape': 'circle'}, name="cluster_2") 36 | 37 | for i in range(4): 38 | inputs.node("x[%d]" % i) 39 | 40 | inputs.body.append('label = "inputs"') 41 | inputs.body.append('color = "white"') 42 | 43 | hidden.body.append('label = "hidden layer"') 44 | hidden.body.append('color = "white"') 45 | 46 | for i in range(3): 47 | hidden.node("h%d" % i, label="h[%d]" % i) 48 | 49 | output.node("y") 50 | output.body.append('label = "output"') 51 | output.body.append('color = "white"') 52 | 53 | nn_graph.subgraph(inputs) 54 | nn_graph.subgraph(hidden) 55 | nn_graph.subgraph(output) 56 | 57 | for i in range(4): 58 | for j in range(3): 59 | nn_graph.edge("x[%d]" % i, "h%d" % j) 60 | 61 | for i in range(3): 62 | 
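        # connect each hidden unit h[i] to the single output node y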
nn_graph.edge("h%d" % i, "y") 63 | return nn_graph 64 | 65 | 66 | def plot_two_hidden_layer_graph(): 67 | import graphviz 68 | nn_graph = graphviz.Digraph(node_attr={'shape': 'circle', 'fixedsize': 'True'}, 69 | graph_attr={'rankdir': 'LR', 'splines': 'line'}) 70 | 71 | inputs = graphviz.Digraph(node_attr={'shape': 'circle'}, name="cluster_0") 72 | hidden = graphviz.Digraph(node_attr={'shape': 'circle'}, name="cluster_1") 73 | hidden2 = graphviz.Digraph(node_attr={'shape': 'circle'}, name="cluster_2") 74 | 75 | output = graphviz.Digraph(node_attr={'shape': 'circle'}, name="cluster_3") 76 | 77 | for i in range(4): 78 | inputs.node("x[%d]" % i) 79 | 80 | inputs.body.append('label = "inputs"') 81 | inputs.body.append('color = "white"') 82 | 83 | for i in range(3): 84 | hidden.node("h1[%d]" % i) 85 | 86 | for i in range(3): 87 | hidden2.node("h2[%d]" % i) 88 | 89 | hidden.body.append('label = "hidden layer 1"') 90 | hidden.body.append('color = "white"') 91 | 92 | hidden2.body.append('label = "hidden layer 2"') 93 | hidden2.body.append('color = "white"') 94 | 95 | output.node("y") 96 | output.body.append('label = "output"') 97 | output.body.append('color = "white"') 98 | 99 | nn_graph.subgraph(inputs) 100 | nn_graph.subgraph(hidden) 101 | nn_graph.subgraph(hidden2) 102 | 103 | nn_graph.subgraph(output) 104 | 105 | for i in range(4): 106 | for j in range(3): 107 | nn_graph.edge("x[%d]" % i, "h1[%d]" % j, label="") 108 | 109 | for i in range(3): 110 | for j in range(3): 111 | nn_graph.edge("h1[%d]" % i, "h2[%d]" % j, label="") 112 | 113 | for i in range(3): 114 | nn_graph.edge("h2[%d]" % i, "y", label="") 115 | 116 | return nn_graph 117 | -------------------------------------------------------------------------------- /mglearn/plot_metrics.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | from .tools import plot_2d_separator, plot_2d_scores, cm, discrete_scatter 5 | from .plot_helpers import ReBl 6 | 7 | 8 | def plot_confusion_matrix_illustration(): 9 | plt.figure(figsize=(8, 8)) 10 | confusion = np.array([[401, 2], [8, 39]]) 11 | plt.text(0.40, .7, confusion[0, 0], size=70, horizontalalignment='right') 12 | plt.text(0.40, .2, confusion[1, 0], size=70, horizontalalignment='right') 13 | plt.text(.90, .7, confusion[0, 1], size=70, horizontalalignment='right') 14 | plt.text(.90, 0.2, confusion[1, 1], size=70, horizontalalignment='right') 15 | plt.xticks([.25, .75], ["predicted 'not nine'", "predicted 'nine'"], size=20) 16 | plt.yticks([.25, .75], ["true 'nine'", "true 'not nine'"], size=20) 17 | plt.plot([.5, .5], [0, 1], '--', c='k') 18 | plt.plot([0, 1], [.5, .5], '--', c='k') 19 | 20 | plt.xlim(0, 1) 21 | plt.ylim(0, 1) 22 | 23 | 24 | def plot_binary_confusion_matrix(): 25 | plt.text(0.45, .6, "TN", size=100, horizontalalignment='right') 26 | plt.text(0.45, .1, "FN", size=100, horizontalalignment='right') 27 | plt.text(.95, .6, "FP", size=100, horizontalalignment='right') 28 | plt.text(.95, 0.1, "TP", size=100, horizontalalignment='right') 29 | plt.xticks([.25, .75], ["predicted negative", "predicted positive"], size=15) 30 | plt.yticks([.25, .75], ["positive class", "negative class"], size=15) 31 | plt.plot([.5, .5], [0, 1], '--', c='k') 32 | plt.plot([0, 1], [.5, .5], '--', c='k') 33 | 34 | plt.xlim(0, 1) 35 | plt.ylim(0, 1) 36 | 37 | 38 | def plot_decision_threshold(): 39 | from sklearn.datasets import make_blobs 40 | from sklearn.svm import SVC 41 | from sklearn.model_selection import 
train_test_split 42 | 43 | X, y = make_blobs(n_samples=(400, 50), cluster_std=[7.0, 2], 44 | random_state=22) 45 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) 46 | 47 | fig, axes = plt.subplots(2, 3, figsize=(15, 8), subplot_kw={'xticks': (), 'yticks': ()}) 48 | plt.suptitle("decision_threshold") 49 | axes[0, 0].set_title("training data") 50 | discrete_scatter(X_train[:, 0], X_train[:, 1], y_train, ax=axes[0, 0]) 51 | 52 | svc = SVC(gamma=.05).fit(X_train, y_train) 53 | axes[0, 1].set_title("decision with threshold 0") 54 | discrete_scatter(X_train[:, 0], X_train[:, 1], y_train, ax=axes[0, 1]) 55 | plot_2d_scores(svc, X_train, function="decision_function", alpha=.7, 56 | ax=axes[0, 1], cm=ReBl) 57 | plot_2d_separator(svc, X_train, linewidth=3, ax=axes[0, 1]) 58 | axes[0, 2].set_title("decision with threshold -0.8") 59 | discrete_scatter(X_train[:, 0], X_train[:, 1], y_train, ax=axes[0, 2]) 60 | plot_2d_separator(svc, X_train, linewidth=3, ax=axes[0, 2], threshold=-.8) 61 | plot_2d_scores(svc, X_train, function="decision_function", alpha=.7, 62 | ax=axes[0, 2], cm=ReBl) 63 | 64 | axes[1, 0].set_axis_off() 65 | 66 | mask = np.abs(X_train[:, 1] - 7) < 5 67 | bla = np.sum(mask) 68 | 69 | line = np.linspace(X_train.min(), X_train.max(), 100) 70 | axes[1, 1].set_title("Cross-section with threshold 0") 71 | axes[1, 1].plot(line, svc.decision_function(np.c_[line, 10 * np.ones(100)]), c='k') 72 | dec = svc.decision_function(np.c_[line, 10 * np.ones(100)]) 73 | contour = (dec > 0).reshape(1, -1).repeat(10, axis=0) 74 | axes[1, 1].contourf(line, np.linspace(-1.5, 1.5, 10), contour, alpha=0.4, cmap=cm) 75 | discrete_scatter(X_train[mask, 0], np.zeros(bla), y_train[mask], ax=axes[1, 1]) 76 | axes[1, 1].set_xlim(X_train.min(), X_train.max()) 77 | axes[1, 1].set_ylim(-1.5, 1.5) 78 | axes[1, 1].set_xticks(()) 79 | axes[1, 1].set_ylabel("Decision value") 80 | 81 | contour2 = (dec > -.8).reshape(1, -1).repeat(10, axis=0) 82 | axes[1, 2].set_title("Cross-section with threshold -0.8") 83 | axes[1, 2].contourf(line, np.linspace(-1.5, 1.5, 10), contour2, alpha=0.4, cmap=cm) 84 | discrete_scatter(X_train[mask, 0], np.zeros(bla), y_train[mask], alpha=.1, ax=axes[1, 2]) 85 | axes[1, 2].plot(line, svc.decision_function(np.c_[line, 10 * np.ones(100)]), c='k') 86 | axes[1, 2].set_xlim(X_train.min(), X_train.max()) 87 | axes[1, 2].set_ylim(-1.5, 1.5) 88 | axes[1, 2].set_xticks(()) 89 | axes[1, 2].set_ylabel("Decision value") 90 | axes[1, 0].legend(['negative class', 'positive class']) 91 | -------------------------------------------------------------------------------- /mglearn/plot_grid_search.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | from sklearn.svm import SVC 4 | from sklearn.model_selection import GridSearchCV, train_test_split 5 | from sklearn.datasets import load_iris 6 | import pandas as pd 7 | 8 | 9 | def plot_cross_val_selection(): 10 | iris = load_iris() 11 | X_trainval, X_test, y_trainval, y_test = train_test_split(iris.data, 12 | iris.target, 13 | random_state=0) 14 | 15 | param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 16 | 'gamma': [0.001, 0.01, 0.1, 1, 10, 100]} 17 | grid_search = GridSearchCV(SVC(), param_grid, cv=5, 18 | return_train_score=True) 19 | grid_search.fit(X_trainval, y_trainval) 20 | results = pd.DataFrame(grid_search.cv_results_)[15:] 21 | 22 | best = np.argmax(results.mean_test_score.values) 23 | plt.figure(figsize=(10, 3)) 24 | plt.xlim(-1, 
len(results)) 25 | plt.ylim(0, 1.1) 26 | for i, (_, row) in enumerate(results.iterrows()): 27 | scores = row[['split%d_test_score' % i for i in range(5)]] 28 | marker_cv, = plt.plot([i] * 5, scores, '^', c='gray', markersize=5, 29 | alpha=.5) 30 | marker_mean, = plt.plot(i, row.mean_test_score, 'v', c='none', alpha=1, 31 | markersize=10, markeredgecolor='k') 32 | if i == best: 33 | marker_best, = plt.plot(i, row.mean_test_score, 'o', c='red', 34 | fillstyle="none", alpha=1, markersize=20, 35 | markeredgewidth=3) 36 | 37 | plt.xticks(range(len(results)), [str(x).strip("{}").replace("'", "") for x 38 | in grid_search.cv_results_['params']], 39 | rotation=90) 40 | plt.ylabel("Validation accuracy") 41 | plt.xlabel("Parameter settings") 42 | plt.legend([marker_cv, marker_mean, marker_best], 43 | ["cv accuracy", "mean accuracy", "best parameter setting"], 44 | loc=(1.05, .4)) 45 | 46 | 47 | def plot_grid_search_overview(): 48 | plt.figure(figsize=(10, 3), dpi=70) 49 | axes = plt.gca() 50 | axes.yaxis.set_visible(False) 51 | axes.xaxis.set_visible(False) 52 | axes.set_frame_on(False) 53 | 54 | def draw(ax, text, start, target=None): 55 | if target is not None: 56 | patchB = target.get_bbox_patch() 57 | end = target.get_position() 58 | else: 59 | end = start 60 | patchB = None 61 | annotation = ax.annotate(text, end, start, xycoords='axes pixels', 62 | textcoords='axes pixels', size=20, 63 | arrowprops=dict( 64 | arrowstyle="-|>", fc="w", ec="k", 65 | patchB=patchB, 66 | connectionstyle="arc3,rad=0.0"), 67 | bbox=dict(boxstyle="round", fc="w"), 68 | horizontalalignment="center", 69 | verticalalignment="center") 70 | plt.draw() 71 | return annotation 72 | 73 | step = 100 74 | grr = 400 75 | 76 | final_evaluation = draw(axes, "final evaluation", (5 * step, grr - 3 * 77 | step)) 78 | retrained_model = draw(axes, "retrained model", (3 * step, grr - 3 * step), 79 | final_evaluation) 80 | best_parameters = draw(axes, "best parameters", (.5 * step, grr - 3 * 81 | step), retrained_model) 82 | cross_validation = draw(axes, "cross-validation", (.5 * step, grr - 2 * 83 | step), best_parameters) 84 | draw(axes, "parameter grid", (0.0, grr - 0), cross_validation) 85 | training_data = draw(axes, "training data", (2 * step, grr - step), 86 | cross_validation) 87 | draw(axes, "training data", (2 * step, grr - step), retrained_model) 88 | test_data = draw(axes, "test data", (5 * step, grr - step), 89 | final_evaluation) 90 | draw(axes, "data set", (3.5 * step, grr - 0.0), training_data) 91 | draw(axes, "data set", (3.5 * step, grr - 0.0), test_data) 92 | plt.ylim(0, 1) 93 | plt.xlim(0, 1.5) 94 | -------------------------------------------------------------------------------- /mglearn/plot_pca.py: -------------------------------------------------------------------------------- 1 | from sklearn.decomposition import PCA 2 | import matplotlib.pyplot as plt 3 | import numpy as np 4 | 5 | from sklearn.externals.joblib import Memory 6 | 7 | memory = Memory(cachedir="cache") 8 | 9 | 10 | def plot_pca_illustration(): 11 | rnd = np.random.RandomState(5) 12 | X_ = rnd.normal(size=(300, 2)) 13 | X_blob = np.dot(X_, rnd.normal(size=(2, 2))) + rnd.normal(size=2) 14 | 15 | pca = PCA() 16 | pca.fit(X_blob) 17 | X_pca = pca.transform(X_blob) 18 | 19 | S = X_pca.std(axis=0) 20 | 21 | fig, axes = plt.subplots(2, 2, figsize=(10, 10)) 22 | axes = axes.ravel() 23 | 24 | axes[0].set_title("Original data") 25 | axes[0].scatter(X_blob[:, 0], X_blob[:, 1], c=X_pca[:, 0], linewidths=0, 26 | s=60, cmap='viridis') 27 | 
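    # all four panels color the points by their first principal component
    # score (X_pca[:, 0]), so the same gradient can be followed throughout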
axes[0].set_xlabel("feature 1") 28 | axes[0].set_ylabel("feature 2") 29 | axes[0].arrow(pca.mean_[0], pca.mean_[1], S[0] * pca.components_[0, 0], 30 | S[0] * pca.components_[0, 1], width=.1, head_width=.3, 31 | color='k') 32 | axes[0].arrow(pca.mean_[0], pca.mean_[1], S[1] * pca.components_[1, 0], 33 | S[1] * pca.components_[1, 1], width=.1, head_width=.3, 34 | color='k') 35 | axes[0].text(-1.5, -.5, "Component 2", size=14) 36 | axes[0].text(-4, -4, "Component 1", size=14) 37 | axes[0].set_aspect('equal') 38 | 39 | axes[1].set_title("Transformed data") 40 | axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=X_pca[:, 0], linewidths=0, 41 | s=60, cmap='viridis') 42 | axes[1].set_xlabel("First principal component") 43 | axes[1].set_ylabel("Second principal component") 44 | axes[1].set_aspect('equal') 45 | axes[1].set_ylim(-8, 8) 46 | 47 | pca = PCA(n_components=1) 48 | pca.fit(X_blob) 49 | X_inverse = pca.inverse_transform(pca.transform(X_blob)) 50 | 51 | axes[2].set_title("Transformed data w/ second component dropped") 52 | axes[2].scatter(X_pca[:, 0], np.zeros(X_pca.shape[0]), c=X_pca[:, 0], 53 | linewidths=0, s=60, cmap='viridis') 54 | axes[2].set_xlabel("First principal component") 55 | axes[2].set_aspect('equal') 56 | axes[2].set_ylim(-8, 8) 57 | 58 | axes[3].set_title("Back-rotation using only first component") 59 | axes[3].scatter(X_inverse[:, 0], X_inverse[:, 1], c=X_pca[:, 0], 60 | linewidths=0, s=60, cmap='viridis') 61 | axes[3].set_xlabel("feature 1") 62 | axes[3].set_ylabel("feature 2") 63 | axes[3].set_aspect('equal') 64 | axes[3].set_xlim(-8, 4) 65 | axes[3].set_ylim(-8, 4) 66 | 67 | 68 | def plot_pca_whitening(): 69 | rnd = np.random.RandomState(5) 70 | X_ = rnd.normal(size=(300, 2)) 71 | X_blob = np.dot(X_, rnd.normal(size=(2, 2))) + rnd.normal(size=2) 72 | 73 | pca = PCA(whiten=True) 74 | pca.fit(X_blob) 75 | X_pca = pca.transform(X_blob) 76 | 77 | fig, axes = plt.subplots(1, 2, figsize=(10, 10)) 78 | axes = axes.ravel() 79 | 80 | axes[0].set_title("Original data") 81 | axes[0].scatter(X_blob[:, 0], X_blob[:, 1], c=X_pca[:, 0], linewidths=0, s=60, cmap='viridis') 82 | axes[0].set_xlabel("feature 1") 83 | axes[0].set_ylabel("feature 2") 84 | axes[0].set_aspect('equal') 85 | 86 | axes[1].set_title("Whitened data") 87 | axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=X_pca[:, 0], linewidths=0, s=60, cmap='viridis') 88 | axes[1].set_xlabel("First principal component") 89 | axes[1].set_ylabel("Second principal component") 90 | axes[1].set_aspect('equal') 91 | axes[1].set_xlim(-3, 4) 92 | 93 | 94 | @memory.cache 95 | def pca_faces(X_train, X_test): 96 | # copy and pasted from nmf. refactor? 
97 | # Build NMF models with 10, 50, 100, 500 components 98 | # this list will hold the back-transformd test-data 99 | reduced_images = [] 100 | for n_components in [10, 50, 100, 500]: 101 | # build the NMF model 102 | pca = PCA(n_components=n_components) 103 | pca.fit(X_train) 104 | # transform the test data (afterwards has n_components many dimensions) 105 | X_test_pca = pca.transform(X_test) 106 | # back-transform the transformed test-data 107 | # (afterwards it's in the original space again) 108 | X_test_back = pca.inverse_transform(X_test_pca) 109 | reduced_images.append(X_test_back) 110 | return reduced_images 111 | 112 | 113 | def plot_pca_faces(X_train, X_test, image_shape): 114 | reduced_images = pca_faces(X_train, X_test) 115 | 116 | # plot the first three images in the test set: 117 | fix, axes = plt.subplots(3, 5, figsize=(15, 12), 118 | subplot_kw={'xticks': (), 'yticks': ()}) 119 | for i, ax in enumerate(axes): 120 | # plot original image 121 | ax[0].imshow(X_test[i].reshape(image_shape), 122 | vmin=0, vmax=1) 123 | # plot the four back-transformed images 124 | for a, X_test_back in zip(ax[1:], reduced_images): 125 | a.imshow(X_test_back[i].reshape(image_shape), vmin=0, vmax=1) 126 | 127 | # label the top row 128 | axes[0, 0].set_title("original image") 129 | for ax, n_components in zip(axes[0, 1:], [10, 50, 100, 500]): 130 | ax.set_title("%d components" % n_components) 131 | -------------------------------------------------------------------------------- /mglearn/tools.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.datasets import make_blobs 3 | from sklearn.tree import export_graphviz 4 | import matplotlib.pyplot as plt 5 | from .plot_2d_separator import (plot_2d_separator, plot_2d_classification, 6 | plot_2d_scores) 7 | from .plot_helpers import cm2 as cm, discrete_scatter 8 | 9 | 10 | def visualize_coefficients(coefficients, feature_names, n_top_features=25): 11 | """Visualize coefficients of a linear model. 12 | 13 | Parameters 14 | ---------- 15 | coefficients : nd-array, shape (n_features,) 16 | Model coefficients. 17 | 18 | feature_names : list or nd-array of strings, shape (n_features,) 19 | Feature names for labeling the coefficients. 20 | 21 | n_top_features : int, default=25 22 | How many features to show. The function will show the largest (most 23 | positive) and smallest (most negative) n_top_features coefficients, 24 | for a total of 2 * n_top_features coefficients. 
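    Examples
    --------
    A typical call, assuming a fitted linear model ``lr`` and a fitted
    vectorizer ``vect`` (illustrative names, not defined in this module)::

        visualize_coefficients(lr.coef_, vect.get_feature_names(),
                               n_top_features=25)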
25 | """ 26 | coefficients = coefficients.squeeze() 27 | if coefficients.ndim > 1: 28 | # this is not a row or column vector 29 | raise ValueError("coeffients must be 1d array or column vector, got" 30 | " shape {}".format(coefficients.shape)) 31 | coefficients = coefficients.ravel() 32 | 33 | if len(coefficients) != len(feature_names): 34 | raise ValueError("Number of coefficients {} doesn't match number of" 35 | "feature names {}.".format(len(coefficients), 36 | len(feature_names))) 37 | # get coefficients with large absolute values 38 | coef = coefficients.ravel() 39 | positive_coefficients = np.argsort(coef)[-n_top_features:] 40 | negative_coefficients = np.argsort(coef)[:n_top_features] 41 | interesting_coefficients = np.hstack([negative_coefficients, 42 | positive_coefficients]) 43 | # plot them 44 | plt.figure(figsize=(15, 5)) 45 | colors = [cm(1) if c < 0 else cm(0) 46 | for c in coef[interesting_coefficients]] 47 | plt.bar(np.arange(2 * n_top_features), coef[interesting_coefficients], 48 | color=colors) 49 | feature_names = np.array(feature_names) 50 | plt.subplots_adjust(bottom=0.3) 51 | plt.xticks(np.arange(1, 1 + 2 * n_top_features), 52 | feature_names[interesting_coefficients], rotation=60, 53 | ha="right") 54 | plt.ylabel("Coefficient magnitude") 55 | plt.xlabel("Feature") 56 | 57 | 58 | def heatmap(values, xlabel, ylabel, xticklabels, yticklabels, cmap=None, 59 | vmin=None, vmax=None, ax=None, fmt="%0.2f"): 60 | if ax is None: 61 | ax = plt.gca() 62 | # plot the mean cross-validation scores 63 | img = ax.pcolor(values, cmap=cmap, vmin=vmin, vmax=vmax) 64 | img.update_scalarmappable() 65 | ax.set_xlabel(xlabel) 66 | ax.set_ylabel(ylabel) 67 | ax.set_xticks(np.arange(len(xticklabels)) + .5) 68 | ax.set_yticks(np.arange(len(yticklabels)) + .5) 69 | ax.set_xticklabels(xticklabels) 70 | ax.set_yticklabels(yticklabels) 71 | ax.set_aspect(1) 72 | 73 | for p, color, value in zip(img.get_paths(), img.get_facecolors(), 74 | img.get_array()): 75 | x, y = p.vertices[:-2, :].mean(0) 76 | if np.mean(color[:3]) > 0.5: 77 | c = 'k' 78 | else: 79 | c = 'w' 80 | ax.text(x, y, fmt % value, color=c, ha="center", va="center") 81 | return img 82 | 83 | 84 | def make_handcrafted_dataset(): 85 | # a carefully hand-designed dataset lol 86 | X, y = make_blobs(centers=2, random_state=4, n_samples=30) 87 | y[np.array([7, 27])] = 0 88 | mask = np.ones(len(X), dtype=np.bool) 89 | mask[np.array([0, 1, 5, 26])] = 0 90 | X, y = X[mask], y[mask] 91 | return X, y 92 | 93 | 94 | def print_topics(topics, feature_names, sorting, topics_per_chunk=6, 95 | n_words=20): 96 | for i in range(0, len(topics), topics_per_chunk): 97 | # for each chunk: 98 | these_topics = topics[i: i + topics_per_chunk] 99 | # maybe we have less than topics_per_chunk left 100 | len_this_chunk = len(these_topics) 101 | # print topic headers 102 | print(("topic {:<8}" * len_this_chunk).format(*these_topics)) 103 | print(("-------- {0:<5}" * len_this_chunk).format("")) 104 | # print top n_words frequent words 105 | for i in range(n_words): 106 | try: 107 | print(("{:<14}" * len_this_chunk).format( 108 | *feature_names[sorting[these_topics, i]])) 109 | except: 110 | pass 111 | print("\n") 112 | 113 | 114 | def get_tree(tree, **kwargs): 115 | try: 116 | # python3 117 | from io import StringIO 118 | except ImportError: 119 | # python2 120 | from StringIO import StringIO 121 | f = StringIO() 122 | export_graphviz(tree, f, **kwargs) 123 | import graphviz 124 | return graphviz.Source(f.getvalue()) 125 | 126 | __all__ = ['plot_2d_separator', 
'plot_2d_classification', 'plot_2d_scores', 127 | 'cm', 'visualize_coefficients', 'print_topics', 'heatmap', 128 | 'discrete_scatter'] 129 | -------------------------------------------------------------------------------- /mglearn/plot_2d_separator.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | from .plot_helpers import cm2, cm3, discrete_scatter 4 | 5 | def _call_classifier_chunked(classifier_pred_or_decide, X): 6 | # The chunk_size is used to chunk the large arrays to work with x86 7 | # memory models that are restricted to < 2 GB in memory allocation. The 8 | # chunk_size value used here is based on a measurement with the 9 | # MLPClassifier using the following parameters: 10 | # MLPClassifier(solver='lbfgs', random_state=0, 11 | # hidden_layer_sizes=[1000,1000,1000]) 12 | # by reducing the value it is possible to trade in time for memory. 13 | # It is possible to chunk the array as the calculations are independent of 14 | # each other. 15 | # Note: an intermittent version made a distinction between 16 | # 32- and 64 bit architectures avoiding the chunking. Testing revealed 17 | # that even on 64 bit architectures the chunking increases the 18 | # performance by a factor of 3-5, largely due to the avoidance of memory 19 | # swapping. 20 | chunk_size = 10000 21 | 22 | # We use a list to collect all result chunks 23 | Y_result_chunks = [] 24 | 25 | # Call the classifier in chunks. 26 | for x_chunk in np.array_split(X, np.arange(chunk_size, X.shape[0], 27 | chunk_size, dtype=np.int32), 28 | axis=0): 29 | Y_result_chunks.append(classifier_pred_or_decide(x_chunk)) 30 | 31 | return np.concatenate(Y_result_chunks) 32 | 33 | 34 | def plot_2d_classification(classifier, X, fill=False, ax=None, eps=None, 35 | alpha=1, cm=cm3): 36 | # multiclass 37 | if eps is None: 38 | eps = X.std() / 2. 39 | 40 | if ax is None: 41 | ax = plt.gca() 42 | 43 | x_min, x_max = X[:, 0].min() - eps, X[:, 0].max() + eps 44 | y_min, y_max = X[:, 1].min() - eps, X[:, 1].max() + eps 45 | xx = np.linspace(x_min, x_max, 1000) 46 | yy = np.linspace(y_min, y_max, 1000) 47 | 48 | X1, X2 = np.meshgrid(xx, yy) 49 | X_grid = np.c_[X1.ravel(), X2.ravel()] 50 | decision_values = classifier.predict(X_grid) 51 | ax.imshow(decision_values.reshape(X1.shape), extent=(x_min, x_max, 52 | y_min, y_max), 53 | aspect='auto', origin='lower', alpha=alpha, cmap=cm) 54 | ax.set_xlim(x_min, x_max) 55 | ax.set_ylim(y_min, y_max) 56 | ax.set_xticks(()) 57 | ax.set_yticks(()) 58 | 59 | 60 | def plot_2d_scores(classifier, X, ax=None, eps=None, alpha=1, cm="viridis", 61 | function=None): 62 | # binary with fill 63 | if eps is None: 64 | eps = X.std() / 2. 
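    # by default, extend the plotted region half a standard deviation of the
    # data beyond its range on every side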
65 | 66 | if ax is None: 67 | ax = plt.gca() 68 | 69 | x_min, x_max = X[:, 0].min() - eps, X[:, 0].max() + eps 70 | y_min, y_max = X[:, 1].min() - eps, X[:, 1].max() + eps 71 | xx = np.linspace(x_min, x_max, 100) 72 | yy = np.linspace(y_min, y_max, 100) 73 | 74 | X1, X2 = np.meshgrid(xx, yy) 75 | X_grid = np.c_[X1.ravel(), X2.ravel()] 76 | if function is None: 77 | function = getattr(classifier, "decision_function", 78 | getattr(classifier, "predict_proba")) 79 | else: 80 | function = getattr(classifier, function) 81 | decision_values = function(X_grid) 82 | if decision_values.ndim > 1 and decision_values.shape[1] > 1: 83 | # predict_proba 84 | decision_values = decision_values[:, 1] 85 | grr = ax.imshow(decision_values.reshape(X1.shape), 86 | extent=(x_min, x_max, y_min, y_max), aspect='auto', 87 | origin='lower', alpha=alpha, cmap=cm) 88 | 89 | ax.set_xlim(x_min, x_max) 90 | ax.set_ylim(y_min, y_max) 91 | ax.set_xticks(()) 92 | ax.set_yticks(()) 93 | return grr 94 | 95 | 96 | def plot_2d_separator(classifier, X, fill=False, ax=None, eps=None, alpha=1, 97 | cm=cm2, linewidth=None, threshold=None, 98 | linestyle="solid"): 99 | # binary? 100 | if eps is None: 101 | eps = X.std() / 2. 102 | 103 | if ax is None: 104 | ax = plt.gca() 105 | 106 | x_min, x_max = X[:, 0].min() - eps, X[:, 0].max() + eps 107 | y_min, y_max = X[:, 1].min() - eps, X[:, 1].max() + eps 108 | xx = np.linspace(x_min, x_max, 1000) 109 | yy = np.linspace(y_min, y_max, 1000) 110 | 111 | X1, X2 = np.meshgrid(xx, yy) 112 | X_grid = np.c_[X1.ravel(), X2.ravel()] 113 | if hasattr(classifier, "decision_function"): 114 | decision_values = _call_classifier_chunked(classifier.decision_function, 115 | X_grid) 116 | levels = [0] if threshold is None else [threshold] 117 | fill_levels = [decision_values.min()] + levels + [ 118 | decision_values.max()] 119 | else: 120 | # no decision_function 121 | decision_values = _call_classifier_chunked(classifier.predict_proba, 122 | X_grid)[:, 1] 123 | levels = [.5] if threshold is None else [threshold] 124 | fill_levels = [0] + levels + [1] 125 | if fill: 126 | ax.contourf(X1, X2, decision_values.reshape(X1.shape), 127 | levels=fill_levels, alpha=alpha, cmap=cm) 128 | else: 129 | ax.contour(X1, X2, decision_values.reshape(X1.shape), levels=levels, 130 | colors="black", alpha=alpha, linewidths=linewidth, 131 | linestyles=linestyle, zorder=5) 132 | 133 | ax.set_xlim(x_min, x_max) 134 | ax.set_ylim(y_min, y_max) 135 | ax.set_xticks(()) 136 | ax.set_yticks(()) 137 | 138 | 139 | if __name__ == '__main__': 140 | from sklearn.datasets import make_blobs 141 | from sklearn.linear_model import LogisticRegression 142 | X, y = make_blobs(centers=2, random_state=42) 143 | clf = LogisticRegression(solver='lbfgs').fit(X, y) 144 | plot_2d_separator(clf, X, fill=True) 145 | discrete_scatter(X[:, 0], X[:, 1], y) 146 | plt.show() 147 | -------------------------------------------------------------------------------- /mglearn/plot_kmeans.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | from sklearn.datasets import make_blobs 4 | from sklearn.cluster import KMeans 5 | from sklearn.metrics import pairwise_distances 6 | import matplotlib.pyplot as plt 7 | import matplotlib as mpl 8 | from cycler import cycler 9 | 10 | from .tools import discrete_scatter 11 | from .plot_2d_separator import plot_2d_classification 12 | from .plot_helpers import cm3 13 | 14 | 15 | def plot_kmeans_algorithm(): 16 | 17 | X, y = make_blobs(random_state=1) 18 | # we don't want 
cyan in there 19 | with mpl.rc_context(rc={'axes.prop_cycle': cycler('color', ['#0000aa', 20 | '#ff2020', 21 | '#50ff50'])}): 22 | fig, axes = plt.subplots(3, 3, figsize=(10, 8), subplot_kw={'xticks': (), 'yticks': ()}) 23 | axes = axes.ravel() 24 | axes[0].set_title("Input data") 25 | discrete_scatter(X[:, 0], X[:, 1], ax=axes[0], markers=['o'], c='w') 26 | 27 | axes[1].set_title("Initialization") 28 | init = X[:3, :] 29 | discrete_scatter(X[:, 0], X[:, 1], ax=axes[1], markers=['o'], c='w') 30 | discrete_scatter(init[:, 0], init[:, 1], [0, 1, 2], ax=axes[1], 31 | markers=['^'], markeredgewidth=2) 32 | 33 | axes[2].set_title("Assign Points (1)") 34 | km = KMeans(n_clusters=3, init=init, max_iter=1, n_init=1).fit(X) 35 | centers = km.cluster_centers_ 36 | # need to compute labels by hand. scikit-learn does two e-steps for max_iter=1 37 | # (and it's totally my fault) 38 | labels = np.argmin(pairwise_distances(init, X), axis=0) 39 | discrete_scatter(X[:, 0], X[:, 1], labels, markers=['o'], 40 | ax=axes[2]) 41 | discrete_scatter(init[:, 0], init[:, 1], [0, 1, 2], 42 | ax=axes[2], markers=['^'], markeredgewidth=2) 43 | 44 | axes[3].set_title("Recompute Centers (1)") 45 | discrete_scatter(X[:, 0], X[:, 1], labels, markers=['o'], 46 | ax=axes[3]) 47 | discrete_scatter(centers[:, 0], centers[:, 1], [0, 1, 2], 48 | ax=axes[3], markers=['^'], markeredgewidth=2) 49 | 50 | axes[4].set_title("Reassign Points (2)") 51 | km = KMeans(n_clusters=3, init=init, max_iter=1, n_init=1).fit(X) 52 | labels = km.labels_ 53 | discrete_scatter(X[:, 0], X[:, 1], labels, markers=['o'], 54 | ax=axes[4]) 55 | discrete_scatter(centers[:, 0], centers[:, 1], [0, 1, 2], 56 | ax=axes[4], markers=['^'], markeredgewidth=2) 57 | 58 | km = KMeans(n_clusters=3, init=init, max_iter=2, n_init=1).fit(X) 59 | axes[5].set_title("Recompute Centers (2)") 60 | centers = km.cluster_centers_ 61 | discrete_scatter(X[:, 0], X[:, 1], labels, markers=['o'], 62 | ax=axes[5]) 63 | discrete_scatter(centers[:, 0], centers[:, 1], [0, 1, 2], 64 | ax=axes[5], markers=['^'], markeredgewidth=2) 65 | 66 | axes[6].set_title("Reassign Points (3)") 67 | labels = km.labels_ 68 | discrete_scatter(X[:, 0], X[:, 1], labels, markers=['o'], 69 | ax=axes[6]) 70 | markers = discrete_scatter(centers[:, 0], centers[:, 1], [0, 1, 2], 71 | ax=axes[6], markers=['^'], 72 | markeredgewidth=2) 73 | 74 | axes[7].set_title("Recompute Centers (3)") 75 | km = KMeans(n_clusters=3, init=init, max_iter=3, n_init=1).fit(X) 76 | centers = km.cluster_centers_ 77 | discrete_scatter(X[:, 0], X[:, 1], labels, markers=['o'], 78 | ax=axes[7]) 79 | discrete_scatter(centers[:, 0], centers[:, 1], [0, 1, 2], 80 | ax=axes[7], markers=['^'], markeredgewidth=2) 81 | axes[8].set_axis_off() 82 | axes[8].legend(markers, ["Cluster 0", "Cluster 1", "Cluster 2"], loc='best') 83 | 84 | 85 | def plot_kmeans_boundaries(): 86 | X, y = make_blobs(random_state=1) 87 | init = X[:3, :] 88 | km = KMeans(n_clusters=3, init=init, max_iter=2, n_init=1).fit(X) 89 | discrete_scatter(X[:, 0], X[:, 1], km.labels_, markers=['o']) 90 | discrete_scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], 91 | [0, 1, 2], markers=['^'], markeredgewidth=2) 92 | plot_2d_classification(km, X, cm=cm3, alpha=.4) 93 | 94 | 95 | def plot_kmeans_faces(km, pca, X_pca, X_people, y_people, target_names): 96 | n_clusters = 10 97 | image_shape = (87, 65) 98 | fig, axes = plt.subplots(n_clusters, 11, subplot_kw={'xticks': (), 'yticks': ()}, 99 | figsize=(10, 15), gridspec_kw={"hspace": .3}) 100 | 101 | for cluster in 
range(n_clusters): 102 | center = km.cluster_centers_[cluster] 103 | mask = km.labels_ == cluster 104 | dists = np.sum((X_pca - center) ** 2, axis=1) 105 | dists[~mask] = np.inf 106 | inds = np.argsort(dists)[:5] 107 | dists[~mask] = -np.inf 108 | inds = np.r_[inds, np.argsort(dists)[-5:]] 109 | axes[cluster, 0].imshow(pca.inverse_transform(center).reshape(image_shape), vmin=0, vmax=1) 110 | for image, label, asdf, ax in zip(X_people[inds], y_people[inds], 111 | km.labels_[inds], axes[cluster, 1:]): 112 | ax.imshow(image.reshape(image_shape), vmin=0, vmax=1) 113 | ax.set_title("%s" % (target_names[label].split()[-1]), fontdict={'fontsize': 9}) 114 | 115 | # add some boxes to illustrate which are similar and which dissimilar 116 | rec = plt.Rectangle([-5, -30], 73, 1295, fill=False, lw=2) 117 | rec = axes[0, 0].add_patch(rec) 118 | rec.set_clip_on(False) 119 | axes[0, 0].text(0, -40, "Center") 120 | 121 | rec = plt.Rectangle([-5, -30], 385, 1295, fill=False, lw=2) 122 | rec = axes[0, 1].add_patch(rec) 123 | rec.set_clip_on(False) 124 | axes[0, 1].text(0, -40, "Close to center") 125 | 126 | rec = plt.Rectangle([-5, -30], 385, 1295, fill=False, lw=2) 127 | rec = axes[0, 6].add_patch(rec) 128 | rec.set_clip_on(False) 129 | axes[0, 6].text(0, -40, "Far from center") 130 | -------------------------------------------------------------------------------- /data/ram_price.csv: -------------------------------------------------------------------------------- 1 | ,date,price 2 | 0,1957.0,411041792.0 3 | 1,1959.0,67947725.0 4 | 2,1960.0,5242880.0 5 | 3,1965.0,2642412.0 6 | 4,1970.0,734003.0 7 | 5,1973.0,399360.0 8 | 6,1974.0,314573.0 9 | 7,1975.0,421888.0 10 | 8,1975.08,180224.0 11 | 9,1975.25,67584.0 12 | 10,1975.75,49920.0 13 | 11,1976.0,40704.0 14 | 12,1976.17,48960.0 15 | 13,1976.42,23040.0 16 | 14,1976.58,32000.0 17 | 15,1977.08,36800.0 18 | 16,1978.17,28000.0 19 | 17,1978.25,29440.0 20 | 18,1978.33,19200.0 21 | 19,1978.5,24000.0 22 | 20,1978.58,16000.0 23 | 21,1978.75,15200.0 24 | 22,1979.0,10528.0 25 | 23,1979.75,6704.0 26 | 24,1980.0,6480.0 27 | 25,1981.0,8800.0 28 | 26,1981.58,4479.0 29 | 27,1982.0,3520.0 30 | 28,1982.17,4464.0 31 | 29,1982.67,1980.0 32 | 30,1983.0,2396.0 33 | 31,1983.67,1980.0 34 | 32,1984.0,1379.0 35 | 33,1984.58,1331.0 36 | 34,1985.0,880.0 37 | 35,1985.33,720.0 38 | 36,1985.42,550.0 39 | 37,1985.5,420.0 40 | 38,1985.58,350.0 41 | 39,1985.67,300.0 42 | 40,1985.83,300.0 43 | 41,1985.92,300.0 44 | 42,1986.0,300.0 45 | 43,1986.08,300.0 46 | 44,1986.17,300.0 47 | 45,1986.25,300.0 48 | 46,1986.33,190.0 49 | 47,1986.42,190.0 50 | 48,1986.5,190.0 51 | 49,1986.58,190.0 52 | 50,1986.67,190.0 53 | 51,1986.75,190.0 54 | 52,1986.92,190.0 55 | 53,1987.0,176.0 56 | 54,1987.08,176.0 57 | 55,1987.17,157.0 58 | 56,1987.25,154.0 59 | 57,1987.33,154.0 60 | 58,1987.42,154.0 61 | 59,1987.5,154.0 62 | 60,1987.58,154.0 63 | 61,1987.67,163.0 64 | 62,1987.75,133.0 65 | 63,1987.83,163.0 66 | 64,1987.92,163.0 67 | 65,1988.0,163.0 68 | 66,1988.08,182.0 69 | 67,1988.17,199.0 70 | 68,1988.33,199.0 71 | 69,1988.42,199.0 72 | 70,1988.5,505.0 73 | 71,1988.58,505.0 74 | 72,1988.67,505.0 75 | 73,1988.75,505.0 76 | 74,1988.83,505.0 77 | 75,1988.92,505.0 78 | 76,1989.0,505.0 79 | 77,1989.08,505.0 80 | 78,1989.17,505.0 81 | 79,1989.25,505.0 82 | 80,1989.42,344.0 83 | 81,1989.5,197.0 84 | 82,1989.58,188.0 85 | 83,1989.67,188.0 86 | 84,1989.75,128.0 87 | 85,1989.83,117.0 88 | 86,1989.92,113.0 89 | 87,1990.0,106.0 90 | 88,1990.17,98.3 91 | 89,1990.33,98.3 92 | 90,1990.42,89.5 93 | 91,1990.5,82.8 94 | 92,1990.58,81.1 
95 | 93,1990.67,71.5 96 | 94,1990.75,59.0 97 | 95,1990.83,51.0 98 | 96,1990.92,45.5 99 | 97,1991.0,44.5 100 | 98,1991.08,44.5 101 | 99,1991.17,45.0 102 | 100,1991.25,45.0 103 | 101,1991.33,45.0 104 | 102,1991.42,43.8 105 | 103,1991.5,43.8 106 | 104,1991.58,41.3 107 | 105,1991.67,46.3 108 | 106,1991.75,45.0 109 | 107,1991.83,39.8 110 | 108,1991.92,39.8 111 | 109,1992.0,36.3 112 | 110,1992.08,36.3 113 | 111,1992.17,36.3 114 | 112,1992.25,34.8 115 | 113,1992.33,30.0 116 | 114,1992.42,32.5 117 | 115,1992.5,33.5 118 | 116,1992.58,31.0 119 | 117,1992.67,27.5 120 | 118,1992.75,26.3 121 | 119,1992.83,26.3 122 | 120,1992.92,26.3 123 | 121,1993.0,33.1 124 | 122,1993.08,27.5 125 | 123,1993.17,27.5 126 | 124,1993.25,27.5 127 | 125,1993.33,27.5 128 | 126,1993.42,30.0 129 | 127,1993.5,30.0 130 | 128,1993.58,30.0 131 | 129,1993.67,30.0 132 | 130,1993.75,36.0 133 | 131,1993.83,39.8 134 | 132,1993.92,35.8 135 | 133,1994.0,35.8 136 | 134,1994.08,35.8 137 | 135,1994.17,36.0 138 | 136,1994.25,37.3 139 | 137,1994.33,37.3 140 | 138,1994.42,37.3 141 | 139,1994.5,38.5 142 | 140,1994.58,37.0 143 | 141,1994.67,34.0 144 | 142,1994.75,33.5 145 | 143,1994.83,32.3 146 | 144,1994.92,32.3 147 | 145,1995.0,32.3 148 | 146,1995.08,32.0 149 | 147,1995.17,32.0 150 | 148,1995.25,31.2 151 | 149,1995.33,31.2 152 | 150,1995.42,31.1 153 | 151,1995.5,31.2 154 | 152,1995.58,30.6 155 | 153,1995.67,33.1 156 | 154,1995.75,33.1 157 | 155,1995.83,30.9 158 | 156,1995.92,30.9 159 | 157,1996.0,29.9 160 | 158,1996.08,28.8 161 | 159,1996.17,26.1 162 | 160,1996.25,24.7 163 | 161,1996.33,17.2 164 | 162,1996.42,14.9 165 | 163,1996.5,11.3 166 | 164,1996.58,9.06 167 | 165,1996.67,8.44 168 | 166,1996.75,8.0 169 | 167,1996.83,5.25 170 | 168,1996.92,5.25 171 | 169,1997.0,4.63 172 | 170,1997.08,3.63 173 | 171,1997.17,3.0 174 | 172,1997.25,3.0 175 | 173,1997.33,3.0 176 | 174,1997.42,3.69 177 | 175,1997.5,4.0 178 | 176,1997.58,4.13 179 | 177,1997.67,3.63 180 | 178,1997.75,3.41 181 | 179,1997.83,3.25 182 | 180,1997.92,2.16 183 | 181,1998.0,2.16 184 | 182,1998.08,0.91 185 | 183,1998.17,0.97 186 | 184,1998.25,1.22 187 | 185,1998.33,1.19 188 | 186,1998.42,0.97 189 | 187,1998.58,1.03 190 | 188,1998.67,0.97 191 | 189,1998.75,1.16 192 | 190,1998.83,0.84 193 | 191,1998.92,0.84 194 | 192,1999.08,1.44 195 | 193,1999.13,0.84 196 | 194,1999.17,1.25 197 | 195,1999.25,1.25 198 | 196,1999.33,0.86 199 | 197,1999.5,0.78 200 | 198,1999.67,0.87 201 | 199,1999.75,1.04 202 | 200,1999.83,1.34 203 | 201,1999.92,2.35 204 | 202,2000.0,1.56 205 | 203,2000.08,1.48 206 | 204,2000.17,1.08 207 | 205,2000.25,0.84 208 | 206,2000.33,0.7 209 | 207,2000.42,0.9 210 | 208,2000.5,0.77 211 | 209,2000.58,0.84 212 | 210,2000.67,1.07 213 | 211,2000.75,1.12 214 | 212,2000.83,1.12 215 | 213,2000.92,0.9 216 | 214,2001.0,0.75 217 | 215,2001.08,0.464 218 | 216,2001.17,0.464 219 | 217,2001.25,0.383 220 | 218,2001.33,0.387 221 | 219,2001.42,0.305 222 | 220,2001.5,0.352 223 | 221,2001.5,0.27 224 | 222,2001.58,0.191 225 | 223,2001.67,0.191 226 | 224,2001.75,0.169 227 | 225,2001.77,0.148 228 | 226,2002.08,0.134 229 | 227,2002.08,0.207 230 | 228,2002.25,0.193 231 | 229,2002.33,0.193 232 | 230,2002.42,0.33 233 | 231,2002.58,0.193 234 | 232,2002.75,0.193 235 | 233,2003.17,0.176 236 | 234,2003.25,0.076 237 | 235,2003.33,0.126 238 | 236,2003.42,0.115 239 | 237,2003.5,0.133 240 | 238,2003.58,0.129 241 | 239,2003.67,0.143 242 | 240,2003.75,0.148 243 | 241,2003.83,0.16 244 | 242,2003.99,0.166 245 | 243,2004.0,0.174 246 | 244,2004.08,0.148 247 | 245,2004.17,0.146 248 | 246,2004.33,0.156 249 | 247,2004.42,0.203 
250 | 248,2004.5,0.176 251 | 249,2005.25,0.185 252 | 250,2005.42,0.149 253 | 251,2005.83,0.116 254 | 252,2005.92,0.185 255 | 253,2006.17,0.112 256 | 254,2006.33,0.073 257 | 255,2006.5,0.082 258 | 256,2006.67,0.073 259 | 257,2006.75,0.088 260 | 258,2006.83,0.098 261 | 259,2006.99,0.092 262 | 260,2007.0,0.082 263 | 261,2007.08,0.078 264 | 262,2007.17,0.066 265 | 263,2007.33,0.0464 266 | 264,2007.5,0.0386 267 | 265,2007.67,0.0351 268 | 266,2007.75,0.0322 269 | 267,2007.83,0.0244 270 | 268,2007.92,0.0244 271 | 269,2008.0,0.0232 272 | 270,2008.08,0.022 273 | 271,2008.33,0.022 274 | 272,2008.5,0.0207 275 | 273,2008.58,0.0176 276 | 274,2008.67,0.0146 277 | 275,2008.83,0.011 278 | 276,2008.92,0.0098 279 | 277,2009.0,0.0098 280 | 278,2009.08,0.0107 281 | 279,2009.25,0.0105 282 | 280,2009.42,0.0115 283 | 281,2009.5,0.011 284 | 282,2009.58,0.0127 285 | 283,2009.75,0.0183 286 | 284,2009.92,0.0205 287 | 285,2010.0,0.019 288 | 286,2010.08,0.0202 289 | 287,2010.17,0.0195 290 | 288,2010.33,0.0242 291 | 289,2010.5,0.021 292 | 290,2010.58,0.022 293 | 291,2010.75,0.0171 294 | 292,2010.83,0.0146 295 | 293,2010.92,0.0122 296 | 294,2011.0,0.01 297 | 295,2011.08,0.0103 298 | 296,2011.33,0.01 299 | 297,2011.42,0.0085 300 | 298,2011.67,0.0054 301 | 299,2011.75,0.0051 302 | 300,2012.0,0.0049 303 | 301,2012.08,0.0049 304 | 302,2012.25,0.005 305 | 303,2012.33,0.0049 306 | 304,2012.58,0.0048 307 | 305,2012.67,0.004 308 | 306,2012.83,0.0037 309 | 307,2013.0,0.0043 310 | 308,2013.08,0.0054 311 | 309,2013.33,0.0067 312 | 310,2013.42,0.0061 313 | 311,2013.58,0.0073 314 | 312,2013.67,0.0065 315 | 313,2013.75,0.0082 316 | 314,2013.83,0.0085 317 | 315,2013.92,0.0079 318 | 316,2014.08,0.0095 319 | 317,2014.17,0.0079 320 | 318,2014.25,0.0073 321 | 319,2014.42,0.0079 322 | 320,2014.58,0.0085 323 | 321,2014.67,0.0085 324 | 322,2014.83,0.0085 325 | 323,2015.0,0.0078 326 | 324,2015.08,0.0073 327 | 325,2015.25,0.0061 328 | 326,2015.33,0.0056 329 | 327,2015.5,0.0049 330 | 328,2015.58,0.0045 331 | 329,2015.67,0.0043 332 | 330,2015.75,0.0042 333 | 331,2015.83,0.0038 334 | 332,2015.92,0.0037 335 | -------------------------------------------------------------------------------- /mglearn/plot_cross_validation.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | 5 | def plot_group_kfold(): 6 | from sklearn.model_selection import GroupKFold 7 | groups = [0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3] 8 | 9 | plt.figure(figsize=(10, 2)) 10 | plt.title("GroupKFold") 11 | 12 | axes = plt.gca() 13 | axes.set_frame_on(False) 14 | 15 | n_folds = 12 16 | n_samples = 12 17 | n_iter = 3 18 | n_samples_per_fold = 1 19 | 20 | cv = GroupKFold(n_splits=3) 21 | mask = np.zeros((n_iter, n_samples)) 22 | for i, (train, test) in enumerate(cv.split(range(12), groups=groups)): 23 | mask[i, train] = 1 24 | mask[i, test] = 2 25 | 26 | for i in range(n_folds): 27 | # test is grey 28 | colors = ["grey" if x == 2 else "white" for x in mask[:, i]] 29 | # not selected has no hatch 30 | 31 | boxes = axes.barh(y=range(n_iter), width=[1 - 0.1] * n_iter, 32 | left=i * n_samples_per_fold, height=.6, color=colors, 33 | hatch="//", edgecolor="k", align='edge') 34 | for j in np.where(mask[:, i] == 0)[0]: 35 | boxes[j].set_hatch("") 36 | 37 | axes.barh(y=[n_iter] * n_folds, width=[1 - 0.1] * n_folds, 38 | left=np.arange(n_folds) * n_samples_per_fold, height=.6, 39 | color="w", edgecolor='k', align="edge") 40 | 41 | for i in range(12): 42 | axes.text((i + .5) * n_samples_per_fold, 3.5, 
"%d" % 43 | groups[i], horizontalalignment="center") 44 | 45 | axes.invert_yaxis() 46 | axes.set_xlim(0, n_samples + 1) 47 | axes.set_ylabel("CV iterations") 48 | axes.set_xlabel("Data points") 49 | axes.set_xticks(np.arange(n_samples) + .5) 50 | axes.set_xticklabels(np.arange(1, n_samples + 1)) 51 | axes.set_yticks(np.arange(n_iter + 1) + .3) 52 | axes.set_yticklabels( 53 | ["Split %d" % x for x in range(1, n_iter + 1)] + ["Group"]) 54 | plt.legend([boxes[0], boxes[1]], ["Training set", "Test set"], loc=(1, .3)) 55 | plt.tight_layout() 56 | 57 | 58 | def plot_shuffle_split(): 59 | from sklearn.model_selection import ShuffleSplit 60 | plt.figure(figsize=(10, 2)) 61 | plt.title("ShuffleSplit with 10 points" 62 | ", train_size=5, test_size=2, n_splits=4") 63 | 64 | axes = plt.gca() 65 | axes.set_frame_on(False) 66 | 67 | n_folds = 10 68 | n_samples = 10 69 | n_iter = 4 70 | n_samples_per_fold = 1 71 | 72 | ss = ShuffleSplit(n_splits=4, train_size=5, test_size=2, random_state=43) 73 | mask = np.zeros((n_iter, n_samples)) 74 | for i, (train, test) in enumerate(ss.split(range(10))): 75 | mask[i, train] = 1 76 | mask[i, test] = 2 77 | 78 | for i in range(n_folds): 79 | # test is grey 80 | colors = ["grey" if x == 2 else "white" for x in mask[:, i]] 81 | # not selected has no hatch 82 | 83 | boxes = axes.barh(y=range(n_iter), width=[1 - 0.1] * n_iter, 84 | left=i * n_samples_per_fold, height=.6, color=colors, 85 | hatch="//", edgecolor='k', align='edge') 86 | for j in np.where(mask[:, i] == 0)[0]: 87 | boxes[j].set_hatch("") 88 | 89 | axes.invert_yaxis() 90 | axes.set_xlim(0, n_samples + 1) 91 | axes.set_ylabel("CV iterations") 92 | axes.set_xlabel("Data points") 93 | axes.set_xticks(np.arange(n_samples) + .5) 94 | axes.set_xticklabels(np.arange(1, n_samples + 1)) 95 | axes.set_yticks(np.arange(n_iter) + .3) 96 | axes.set_yticklabels(["Split %d" % x for x in range(1, n_iter + 1)]) 97 | # legend hacked for this random state 98 | plt.legend([boxes[1], boxes[0], boxes[2]], [ 99 | "Training set", "Test set", "Not selected"], loc=(1, .3)) 100 | plt.tight_layout() 101 | 102 | 103 | def plot_stratified_cross_validation(): 104 | fig, both_axes = plt.subplots(2, 1, figsize=(12, 5)) 105 | # plt.title("cross_validation_not_stratified") 106 | axes = both_axes[0] 107 | axes.set_title("Standard cross-validation with sorted class labels") 108 | 109 | axes.set_frame_on(False) 110 | 111 | n_folds = 3 112 | n_samples = 150 113 | 114 | n_samples_per_fold = n_samples / float(n_folds) 115 | 116 | for i in range(n_folds): 117 | colors = ["w"] * n_folds 118 | colors[i] = "grey" 119 | axes.barh(y=range(n_folds), width=[n_samples_per_fold - 1] * 120 | n_folds, left=i * n_samples_per_fold, height=.6, 121 | color=colors, hatch="//", edgecolor='k', align='edge') 122 | 123 | axes.barh(y=[n_folds] * n_folds, width=[n_samples_per_fold - 1] * 124 | n_folds, left=np.arange(3) * n_samples_per_fold, height=.6, 125 | color="w", edgecolor='k', align='edge') 126 | 127 | axes.invert_yaxis() 128 | axes.set_xlim(0, n_samples + 1) 129 | axes.set_ylabel("CV iterations") 130 | axes.set_xlabel("Data points") 131 | axes.set_xticks(np.arange(n_samples_per_fold / 2., 132 | n_samples, n_samples_per_fold)) 133 | axes.set_xticklabels(["Fold %d" % x for x in range(1, n_folds + 1)]) 134 | axes.set_yticks(np.arange(n_folds + 1) + .3) 135 | axes.set_yticklabels( 136 | ["Split %d" % x for x in range(1, n_folds + 1)] + ["Class label"]) 137 | for i in range(3): 138 | axes.text((i + .5) * n_samples_per_fold, 3.5, "Class %d" % 139 | i, 
horizontalalignment="center") 140 | 141 | ax = both_axes[1] 142 | ax.set_title("Stratified Cross-validation") 143 | ax.set_frame_on(False) 144 | ax.invert_yaxis() 145 | ax.set_xlim(0, n_samples + 1) 146 | ax.set_ylabel("CV iterations") 147 | ax.set_xlabel("Data points") 148 | 149 | ax.set_yticks(np.arange(n_folds + 1) + .3) 150 | ax.set_yticklabels( 151 | ["Split %d" % x for x in range(1, n_folds + 1)] + ["Class label"]) 152 | 153 | n_subsplit = n_samples_per_fold / 3. 154 | for i in range(n_folds): 155 | test_bars = ax.barh( 156 | y=[i] * n_folds, width=[n_subsplit - 1] * n_folds, 157 | left=np.arange(n_folds) * n_samples_per_fold + i * n_subsplit, 158 | height=.6, color="grey", hatch="//", edgecolor='k', align='edge') 159 | 160 | w = 2 * n_subsplit - 1 161 | ax.barh(y=[0] * n_folds, width=[w] * n_folds, left=np.arange(n_folds) 162 | * n_samples_per_fold + (0 + 1) * n_subsplit, height=.6, color="w", 163 | hatch="//", edgecolor='k', align='edge') 164 | ax.barh(y=[1] * (n_folds + 1), width=[w / 2., w, w, w / 2.], 165 | left=np.maximum(0, np.arange(n_folds + 1) * n_samples_per_fold - 166 | n_subsplit), height=.6, color="w", hatch="//", 167 | edgecolor='k', align='edge') 168 | training_bars = ax.barh(y=[2] * n_folds, width=[w] * n_folds, 169 | left=np.arange(n_folds) * n_samples_per_fold, 170 | height=.6, color="w", hatch="//", edgecolor='k', 171 | align='edge') 172 | 173 | ax.barh(y=[n_folds] * n_folds, width=[n_samples_per_fold - 1] * 174 | n_folds, left=np.arange(n_folds) * n_samples_per_fold, height=.6, 175 | color="w", edgecolor='k', align='edge') 176 | 177 | for i in range(3): 178 | ax.text((i + .5) * n_samples_per_fold, 3.5, "Class %d" % 179 | i, horizontalalignment="center") 180 | ax.set_ylim(4, -0.1) 181 | plt.legend([training_bars[0], test_bars[0]], [ 182 | 'Training data', 'Test data'], loc=(1.05, 1), frameon=False) 183 | 184 | fig.tight_layout() 185 | 186 | 187 | def plot_cross_validation(): 188 | plt.figure(figsize=(12, 2)) 189 | plt.title("cross_validation") 190 | axes = plt.gca() 191 | axes.set_frame_on(False) 192 | 193 | n_folds = 5 194 | n_samples = 25 195 | 196 | n_samples_per_fold = n_samples / float(n_folds) 197 | 198 | for i in range(n_folds): 199 | colors = ["w"] * n_folds 200 | colors[i] = "grey" 201 | bars = plt.barh( 202 | y=range(n_folds), width=[n_samples_per_fold - 0.1] * n_folds, 203 | left=i * n_samples_per_fold, height=.6, color=colors, hatch="//", 204 | edgecolor='k', align='edge') 205 | axes.invert_yaxis() 206 | axes.set_xlim(0, n_samples + 1) 207 | plt.ylabel("CV iterations") 208 | plt.xlabel("Data points") 209 | plt.xticks(np.arange(n_samples_per_fold / 2., n_samples, 210 | n_samples_per_fold), 211 | ["Fold %d" % x for x in range(1, n_folds + 1)]) 212 | plt.yticks(np.arange(n_folds) + .3, 213 | ["Split %d" % x for x in range(1, n_folds + 1)]) 214 | plt.legend([bars[0], bars[4]], ['Training data', 'Test data'], 215 | loc=(1.05, 0.4), frameon=False) 216 | 217 | 218 | def plot_threefold_split(): 219 | plt.figure(figsize=(15, 1)) 220 | axis = plt.gca() 221 | bars = axis.barh([0, 0, 0], [11.9, 2.9, 4.9], left=[0, 12, 15], color=[ 222 | 'white', 'grey', 'grey'], hatch="//", edgecolor='k', 223 | align='edge') 224 | bars[2].set_hatch(r"") 225 | axis.set_yticks(()) 226 | axis.set_frame_on(False) 227 | axis.set_ylim(-.1, .8) 228 | axis.set_xlim(-0.1, 20.1) 229 | axis.set_xticks([6, 13.3, 17.5]) 230 | axis.set_xticklabels(["training set", "validation set", 231 | "test set"], fontdict={'fontsize': 20}) 232 | axis.tick_params(length=0, labeltop=True, labelbottom=False) 
233 | axis.text(6, -.3, "Model fitting", 234 | fontdict={'fontsize': 13}, horizontalalignment="center") 235 | axis.text(13.3, -.3, "Parameter selection", 236 | fontdict={'fontsize': 13}, horizontalalignment="center") 237 | axis.text(17.5, -.3, "Evaluation", 238 | fontdict={'fontsize': 13}, horizontalalignment="center") 239 | -------------------------------------------------------------------------------- /第2章-监督学习-决策树集成.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "导入必要的包" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import numpy as np\n", 19 | "import matplotlib.pyplot as plt\n", 20 | "import pandas as pd\n", 21 | "import mglearn\n", 22 | "import os\n", 23 | "%matplotlib inline " 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "**集成**(ensemble)是合并多个机器学习模型来构建更强大模型的方法。在机器学习文献中有许多模型都属于这一类,但已证明有两种集成模型对大量分类和回归的数据集都是有效的,二者都以决策树为基础,分别是随机森林(random forest)和梯度提升决策树(gradient boosted decision tree)。" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "# 1. 随机森林\n", 38 | "我们刚刚说过,决策树的一个主要缺点在于经常对训练数据过拟合。随机森林是解决这个问题的一种方法。随机森林本质上是许多决策树的集合,其中每棵树都和其他树略有不同。随机森林背后的思想是,每棵树的预测可能都相对较好,但可能对部分数据过拟合。如果构造很多树,并且每棵树的预测都很好,但都以不同的方式过拟合,那么我们可以对这些树的结果取平均值来降低过拟合。既能减少过拟合又能保持树的预测能力,这可以在数学上严格证明。\n", 39 | "\n", 40 | "为了实现这一策略,我们需要构造许多决策树。每棵树都应该对目标值做出可以接受的预测,还应该与其他树不同。随机森林的名字来自于将随机性添加到树的构造过程中,以确保每棵树都各不相同。随机森林中树的随机化方法有两种:一种是通过选择用于构造树的数据点,另一种是通过选择每次划分测试的特征。我们来更深入地研究这一过程。\n", 41 | "\n", 42 | "\n", 43 | "## 1.1 构造随机森林\n", 44 | "想要构造一个随机森林模型,你需要确定用于构造的树的个数( RandomForestRegressor 或 RandomForestClassifier 的 n_estimators 参数)。比如我们想要构造 10 棵树。这些树在构造时彼此完全独立,算法对每棵树进行不同的随机选择,以确保树和树之间是有区别的。想要构造一棵树,首先要对数据进行自助采样(bootstrap sample)。也就是说,从 n_samples 个数据点中有放回地(即同一样本可以被多次抽取)重复随机抽取一个样本,共抽取n_samples 次。这样会创建一个与原数据集大小相同的数据集,但有些数据点会缺失(大约三分之一),有些会重复。\n", 45 | "\n", 46 | "举例说明,比如我们想要创建列表 ['a', 'b', 'c', 'd'] 的自助采样。一种可能的自主采样是 ['b', 'd', 'd', 'c'] ,另一种可能的采样为 ['d', 'a', 'd', 'a'] 。\n", 47 | "\n", 48 | "接下来,基于这个新创建的数据集来构造决策树。但是,要对我们在介绍决策树时描述的算法稍作修改。在每个结点处,算法随机选择特征的一个子集,并对其中一个特征寻找最佳测试,而不是对每个结点都寻找最佳测试。选择的特征个数由 max_features 参数来控制。每个结点中特征子集的选择是相互独立的,这样树的每个结点可以使用特征的不同子集来做出决策。\n", 49 | "\n", 50 | "由于使用了自助采样,**随机森林中构造每棵决策树的数据集都是略有不同的**。**由于每个结点的特征选择,每棵树中的每次划分都是基于特征的不同子集**。这两种方法共同保证随机森林中所有树都不相同。\n", 51 | "\n", 52 | "在这个过程中的一个关键参数是 max_features 。如果我们设置 max_features 等于n_features ,那么每次划分都要考虑数据集的所有特征,在特征选择的过程中没有添加随机性(不过自助采样依然存在随机性)。如果设置 max_features 等于 1 ,那么在划分时将无法选择对哪个特征进行测试,只能对随机选择的某个特征搜索不同的阈值。因此,如果 max_features 较大,那么随机森林中的树将会十分相似,利用最独特的特征可以轻松拟合数据。如果 max_features 较小,那么随机森林中的树将会差异很大,为了很好地拟合数据,每棵树的深度都要很大。\n", 53 | "\n", 54 | "想要利用随机森林进行预测,算法首先对森林中的每棵树进行预测。对于回归问题,我们可以对这些结果取平均值作为最终预测。对于分类问题,则用到了“软投票”(soft voting)策略。也就是说,每个算法做出“软”预测,给出每个可能的输出标签的概率。对所有树的预测概率取平均值,然后将概率最大的类别作为预测结果。\n", 55 | "\n", 56 | "## 1.2 分析随机森林\n", 57 | "下面将由 5 棵树组成的随机森林应用到前面研究过的 two_moons 数据集上:" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": { 64 | "collapsed": false 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "from sklearn.ensemble import RandomForestClassifier\n", 69 | "from sklearn.datasets import make_moons\n", 70 | "from sklearn.model_selection import train_test_split\n", 71 | "\n", 72 | "X,y = make_moons(n_samples=100, noise=0.25, random_state=3)\n", 73 | "X_train, X_test, y_train, y_test = 
train_test_split(X, y, stratify=y, random_state=42)\n", 74 | "\n", 75 | "forest = RandomForestClassifier(n_estimators=5, random_state=2)\n", 76 | "forest.fit(X_train, y_train)" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "作为随机森林的一部分,树被保存在 estimator\\_ 属性中。我们将每棵树学到的决策边界可视化,也将它们的总预测(即整个森林做出的预测)可视化(图 2-33):" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": { 90 | "collapsed": false 91 | }, 92 | "outputs": [], 93 | "source": [ 94 | "fig, axes = plt.subplots(2, 3, figsize=(20,10))\n", 95 | "for i, (ax, tree) in enumerate(zip(axes.ravel(), forest.estimators_)):\n", 96 | " ax.set_title(\"Tree {}\".format(i))\n", 97 | " mglearn.plots.plot_tree_partition(X_train, y_train, tree, ax=ax)\n", 98 | "mglearn.plots.plot_2d_separator(forest, X_train, fill=True, ax=axes[-1,-1], alpha=.4)\n", 99 | "axes[-1,-1].set_title(\"Random Forest\")\n", 100 | "mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train)" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "
\n", 108 | "
图 2-33:5 棵随机化的决策树找到的决策边界,以及将它们的预测概率取平均后得到的决策边界
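上面说到,构造随机森林的每棵树之前要先做自助采样,并且"有些数据点会缺失(大约三分之一)"。下面是一段与书中代码无关的小示意(纯 NumPy,样本数是随意取的),用来验证这个比例:

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples = 10000

# 有放回地抽取 n_samples 次,得到一个自助采样(bootstrap sample)的下标
bootstrap_idx = rng.randint(0, n_samples, size=n_samples)

# 统计没有被抽到的样本所占比例,理论值约为 1/e ≈ 0.368,即"大约三分之一"
missing_ratio = 1 - len(np.unique(bootstrap_idx)) / n_samples
print("未被抽到的样本比例: {:.3f}".format(missing_ratio))
```

把同样的抽样逻辑套在训练集下标上,就得到随机森林中每棵树各自略有不同的训练数据。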
" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "你可以清楚地看到,这 5 棵树学到的决策边界大不相同。每棵树都犯了一些错误,因为这里画出的一些训练点实际上并没有包含在这些树的训练集中,原因在于自助采样。\n", 116 | "\n", 117 | "随机森林比单独每一棵树的过拟合都要小,给出的决策边界也更符合直觉。在任何实际应用中,我们会用到更多棵树(通常是几百或上千),从而得到更平滑的边界。\n", 118 | "\n", 119 | "再举一个例子,我们将包含 100 棵树的随机森林应用在乳腺癌数据集上:" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": { 126 | "collapsed": false 127 | }, 128 | "outputs": [], 129 | "source": [ 130 | "from sklearn.datasets import load_breast_cancer\n", 131 | "cancer = load_breast_cancer()\n", 132 | "\n", 133 | "X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)\n", 134 | "forest = RandomForestClassifier(n_estimators=100, random_state=0)\n", 135 | "forest.fit(X_train, y_train)\n", 136 | "\n", 137 | "print(\"Accuracy on training set: {:.3f}\".format(forest.score(X_train, y_train)))\n", 138 | "print(\"Accuracy on test set: {:.3f}\".format(forest.score(X_test, y_test)))" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "在没有调节任何参数的情况下,随机森林的精度为 97%,比线性模型或单棵决策树都要好。我们可以调节 max_features 参数,或者像单棵决策树那样进行预剪枝。但是,随机森林的默认参数通常就已经可以给出很好的结果。\n", 146 | "\n", 147 | "与决策树类似,随机森林也可以给出特征重要性,计算方法是将森林中所有树的特征重要性求和并取平均。一般来说,随机森林给出的特征重要性要比单棵树给出的更为可靠。参见图 2-34。" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": null, 153 | "metadata": { 154 | "collapsed": false 155 | }, 156 | "outputs": [], 157 | "source": [ 158 | "def plot_feature_importance_cancer(model):\n", 159 | " n_features = cancer.data.shape[1]\n", 160 | " plt.barh(range(n_features), model.feature_importances_, align='center')\n", 161 | " plt.yticks(np.arange(n_features), cancer.feature_names)\n", 162 | " plt.xlabel(\"Feature importance\")\n", 163 | " plt.ylabel(\"Feature\")\n", 164 | "\n", 165 | "plot_feature_importance_cancer(forest)" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "
\n", 173 | "
图 2-34:拟合乳腺癌数据集得到的随机森林的特征重要性
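正文说随机森林的特征重要性"是将森林中所有树的特征重要性求和并取平均"。下面这段示意代码粗略核对这一点;它假设 forest 就是上面在乳腺癌数据集上训练好的 RandomForestClassifier,并且依赖 scikit-learn 当前实现的细节,结果应当非常接近(通常完全一致):

```python
import numpy as np

# 对森林中每棵树的 feature_importances_ 逐特征取平均,再与整体属性比较
mean_importance = np.mean(
    [tree.feature_importances_ for tree in forest.estimators_], axis=0)
print(np.allclose(mean_importance, forest.feature_importances_))
```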
" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": {}, 179 | "source": [ 180 | "如你所见,与单棵树相比,随机森林中有更多特征的重要性不为零。与单棵决策树类似,随机森林也给了“worst radius”(最大半径)特征很大的重要性,但从总体来看,它实际上却选择“worst perimeter”(最大周长)作为信息量最大的特征。由于构造随机森林过程中的随机性,算法需要考虑多种可能的解释,结果就是随机森林比单棵树更能从总体把握数据的特征。" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "## 1.3 优点、缺点和参数。\n", 188 | "用于回归和分类的随机森林是目前应用最广泛的机器学习方法之一。这种方法非常强大,通常不需要反复调节参数就可以给出很好的结果,也不需要对数据进\n", 189 | "行缩放。\n", 190 | "\n", 191 | "从本质上看,随机森林拥有决策树的所有优点,同时弥补了决策树的一些缺陷。仍然使用决策树的一个原因是需要决策过程的紧凑表示。基本上不可能对几十棵甚至上百棵树做出详细解释,随机森林中树的深度往往比决策树还要大(因为用到了特征子集)。因此,如果你需要以可视化的方式向非专家总结预测过程,那么选择单棵决策树可能更好。虽然在大型数据集上构建随机森林可能比较费时间,但在一台计算机的多个 CPU 内核上并行计算也很容易。如果你用的是多核处理器(几乎所有的现代化计算机都是),你可以用 n_jobs 参数来调节使用的内核个数。使用更多的 CPU 内核,可以让速度线性增加(使用 2 个内核,随机森林的训练速度会加倍),但设置 n_jobs 大于内核个数是没有用的。**你可以设置 n_jobs=-1 来使用计算机的所有内核**。\n", 192 | "\n", 193 | "你应该记住,随机森林本质上是随机的,设置不同的随机状态(或者不设置 random_state参数)可以彻底改变构建的模型。森林中的树越多,它对随机状态选择的鲁棒性就越好。如果你希望结果可以重现,固定 random_state 是很重要的。\n", 194 | "\n", 195 | "**对于维度非常高的稀疏数据(比如文本数据),随机森林的表现往往不是很好。对于这种数据,使用线性模型可能更合适**。即使是非常大的数据集,随机森林的表现通常也很好,训练过程很容易并行在功能强大的计算机的多个 CPU 内核上。不过,随机森林需要更大的内存,训练和预测的速度也比线性模型要慢。对一个应用来说,如果时间和内存很重要的话,那么换用线性模型可能更为明智。\n", 196 | "\n", 197 | "**需要调节的重要参数有 n_estimators 和 max_features ,可能还包括预剪枝选项(如 max_depth )。 n_estimators 总是越大越好**。对更多的树取平均可以降低过拟合,从而得到鲁棒性更好的集成。不过收益是递减的,而且树越多需要的内存也越多,训练时间也越长。常用的经验法则就是“在你的时间 / 内存允许的情况下尽量多”。\n", 198 | "\n", 199 | "前面说过, **max_features 决定每棵树的随机性大小,较小的 max_features 可以降低过拟合**。一般来说,好的经验就是使用默认值:对于分类,默认值是 max_features=sqrt(n_features) ;对于回归,默认值是 max_features=n_features 。增大 max_features 或 max_leaf_nodes 有时也可以提高性能。它还可以大大降低用于训练和预测的时间和空间要求。" 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": {}, 205 | "source": [ 206 | "# 2. 
梯度提升回归树(梯度提升机)\n", 207 | "梯度提升回归树是另一种集成方法,通过合并多个决策树来构建一个更为强大的模型。虽然名字中含有“回归”,但这个模型既可以用于回归也可以用于分类。与随机森林方法不同,梯度提升采用连续的方式构造树,每棵树都试图纠正前一棵树的错误。默认情况下,梯度提升回归树中没有随机化,而是用到了强预剪枝。梯度提升树通常使用深度很小(1 到 5 之间)的树,这样模型占用的内存更少,预测速度也更快。\n", 208 | "\n", 209 | "梯度提升背后的主要思想是合并许多简单的模型(在这个语境中叫作弱学习器),比如深度较小的树。每棵树只能对部分数据做出好的预测,因此,添加的树越来越多,可以不断迭代提高性能。\n", 210 | "\n", 211 | "梯度提升树经常是机器学习竞赛的优胜者,并且广泛应用于业界。与随机森林相比,它通常对参数设置更为敏感,但如果参数设置正确的话,模型精度更高。\n", 212 | "\n", 213 | "除了预剪枝与集成中树的数量之外,梯度提升的另一个重要参数是 learning_rate (学习率),用于控制每棵树纠正前一棵树的错误的强度。较高的学习率意味着每棵树都可以做出较强的修正,这样模型更为复杂。通过增大 n_estimators 来向集成中添加更多树,也可以增加模型复杂度,因为模型有更多机会纠正训练集上的错误。\n", 214 | "\n", 215 | "下面是在乳腺癌数据集上应用 GradientBoostingClassifier 的示例。默认使用 100 棵树,最大深度是 3,学习率为 0.1:" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "metadata": { 222 | "collapsed": false 223 | }, 224 | "outputs": [], 225 | "source": [ 226 | "from sklearn.ensemble import GradientBoostingClassifier\n", 227 | "\n", 228 | "X_trian, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)\n", 229 | "\n", 230 | "gbrt = GradientBoostingClassifier(random_state=0)\n", 231 | "gbrt.fit(X_train, y_train)\n", 232 | "\n", 233 | "print(\"Accuracy on training set: {:.3f}\".format(gbrt.score(X_train, y_train)))\n", 234 | "print(\"Accuracy on test set: {:.3f}\".format(gbrt.score(X_test, y_test)))" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "由于训练集精度达到 100%,所以很可能存在过拟合。为了降低过拟合,我们可以限制最大深度来加强预剪枝,也可以降低学习率:" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": { 248 | "collapsed": false 249 | }, 250 | "outputs": [], 251 | "source": [ 252 | "gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)\n", 253 | "gbrt.fit(X_train,y_train)\n", 254 | "\n", 255 | "print(\"Accuracy on training set: {:.3f}\".format(gbrt.score(X_train, y_train)))\n", 256 | "print(\"Accuracy on test set: {:.3f}\".format(gbrt.score(X_test, y_test)))" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": null, 262 | "metadata": { 263 | "collapsed": false 264 | }, 265 | "outputs": [], 266 | "source": [ 267 | "gbrt = GradientBoostingClassifier(random_state=0, learning_rate=0.01)\n", 268 | "gbrt.fit(X_train, y_train)\n", 269 | "print(\"Accuracy on training set: {:.3f}\".format(gbrt.score(X_train, y_train)))\n", 270 | "print(\"Accuracy on test set: {:.3f}\".format(gbrt.score(X_test, y_test)))" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "metadata": {}, 276 | "source": [ 277 | "降低模型复杂度的两种方法都降低了训练集精度,这和预期相同。在这个例子中,减小树的最大深度显著提升了模型性能,而降低学习率仅稍稍提高了泛化性能。\n", 278 | "\n", 279 | "对于其他基于决策树的模型,我们也可以将特征重要性可视化,以便更好地理解模型(图 2-35)。由于我们用到了 100 棵树,所以即使所有树的深度都是 1,查看所有树也是不现实的:" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "metadata": { 286 | "collapsed": false 287 | }, 288 | "outputs": [], 289 | "source": [ 290 | "gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)\n", 291 | "gbrt.fit(X_train, y_train)\n", 292 | "plot_feature_importance_cancer(gbrt)" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "
\n", 300 | "
图 2-35:用于拟合乳腺癌数据集的梯度提升分类器给出的特征重要性
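为了更直观地理解"每棵树都试图纠正前一棵树的错误",下面给出一个极简的回归版梯度提升示意:在平方误差下,"纠正错误"就是让新的树去拟合当前的残差。这只是原理演示(初始预测简化为 0,数据也是随意构造的),并不是 GradientBoostingClassifier / GradientBoostingRegressor 的真实实现:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=100)

learning_rate = 0.1
prediction = np.zeros(len(y))                 # 初始预测简化为 0
for _ in range(100):
    residual = y - prediction                 # 前面所有树还没拟合好的部分
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residual)   # 深度很小的弱学习器
    prediction += learning_rate * tree.predict(X)                # 学习率控制每棵树的修正强度

print("训练集均方误差: {:.4f}".format(np.mean((y - prediction) ** 2)))
```

可以试着调大或调小 learning_rate 和循环次数,观察训练误差的变化,这正好对应正文中 learning_rate 与 n_estimators 相互制约的说法。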
" 301 | ] 302 | }, 303 | { 304 | "cell_type": "markdown", 305 | "metadata": {}, 306 | "source": [ 307 | "可以看到,梯度提升树的特征重要性与随机森林的特征重要性有些类似,不过梯度提升完全忽略了某些特征。\n", 308 | "\n", 309 | "由于梯度提升和随机森林两种方法在类似的数据上表现得都很好,因此一种常用的方法就是先尝试随机森林,它的鲁棒性很好。如果随机森林效果很好,但预测时间太长,或者机器学习模型精度小数点后第二位的提高也很重要,那么切换成梯度提升通常会有用。\n", 310 | "\n", 311 | "如果你想要将梯度提升应用在大规模问题上,可以研究一下 xgboost 包及其 Python 接口,在写作本书时,这个库在许多数据集上的速度都比 scikit-learn 对梯度提升的实现要快(有时调参也更简单)。\n", 312 | "\n", 313 | "## 2.1 优点、缺点和参数。\n", 314 | "梯度提升决策树是监督学习中最强大也最常用的模型之一。其**主要缺点是需要仔细调参**,而且训练时间可能会比较长。与其他基于树的模型类似,这一算法不需要对数据进行缩放就可以表现得很好,而且也适用于二元特征与连续特征同时存在的数据集。与其他基于树的模型相同,它也**通常不适用于高维稀疏数据**。(1是训练时间慢,2是特征选择会浪费大量的有效特征)\n", 315 | "\n", 316 | "梯度提升树模型的主要参数包括树的数量 n_estimators 和学习率 learning_rate ,后者用于控制每棵树对前一棵树的错误的纠正强度。这两个参数高度相关,因为 learning_rate 越低,就需要更多的树来构建具有相似复杂度的模型。随机森林的 n_estimators 值总是越大越好,但梯度提升不同,增大 n_estimators 会导致模型更加复杂,进而可能导致过拟合。通常的做法是根据时间和内存的预算选择合适的 n_estimators ,然后对不同的learning_rate 进行遍历。\n", 317 | "\n", 318 | "另一个重要参数是 max_depth (或 max_leaf_nodes ),用于降低每棵树的复杂度。梯度提升模型的 max_depth 通常都设置得很小,一般不超过 5。" 319 | ] 320 | } 321 | ], 322 | "metadata": { 323 | "anaconda-cloud": {}, 324 | "kernelspec": { 325 | "display_name": "Python [conda env:Anaconda3]", 326 | "language": "python", 327 | "name": "conda-env-Anaconda3-py" 328 | }, 329 | "language_info": { 330 | "codemirror_mode": { 331 | "name": "ipython", 332 | "version": 3 333 | }, 334 | "file_extension": ".py", 335 | "mimetype": "text/x-python", 336 | "name": "python", 337 | "nbconvert_exporter": "python", 338 | "pygments_lexer": "ipython3", 339 | "version": "3.5.6" 340 | } 341 | }, 342 | "nbformat": 4, 343 | "nbformat_minor": 1 344 | } 345 | -------------------------------------------------------------------------------- /第2章-监督学习-K近邻.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "先导入必要的包" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import numpy as np\n", 19 | "import matplotlib.pyplot as plt\n", 20 | "import pandas as pd\n", 21 | "import mglearn\n", 22 | "\n", 23 | "%matplotlib inline " 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "k-NN 算法可以说是最简单的机器学习算法。构建模型只需要保存训练数据集即可。想要对新数据点做出预测,算法会在训练数据集中找到最近的数据点,也就是它的“最近邻”。\n", 31 | "## 1. k近邻分类\n", 32 | "k-NN 算法最简单的版本只考虑一个最近邻,也就是与我们想要预测的数据点最近的训练数据点。预测结果就是这个训练数据点的已知输出。下面代码的运行结果(图2-4) 给出了这种分类方法在 forge数据集上的应用:" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": { 39 | "collapsed": false 40 | }, 41 | "outputs": [], 42 | "source": [ 43 | "mglearn.plots.plot_knn_classification(n_neighbors=1)" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "\n", 51 | "
图 2-4:单一最近邻模型对 forge 数据集的预测结果
\n", 52 | "\n", 53 | "这里我们添加了 3 个新数据点(用五角星表示)。对于每个新数据点,我们标记了训练集中与它最近的点。单一最近邻算法的预测结果就是那个点的标签(对应五角星的颜色)。\n", 54 | "\n", 55 | "除了仅考虑最近邻,我还可以考虑任意个(k 个)邻居。这也是 k 近邻算法名字的来历。在考虑多于一个邻居的情况时,我们用“投票法”(voting)来指定标签。也就是说,对于每个测试点,我们数一数多少个邻居属于类别 0,多少个邻居属于类别 1。然后将出现次数更多的类别(也就是 k 个近邻中占多数的类别)作为预测结果。下面的例子(图 2-5)用到了 3 个近邻:" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": { 62 | "collapsed": false 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "mglearn.plots.plot_knn_classification(n_neighbors=3)" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "\n", 74 | "
图 2-5:3 近邻模型对 forge 数据集的预测结果
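"投票法"本身用几行 NumPy 就能写出来。下面是一个与 scikit-learn 无关的示意(训练点和新数据点都是随意构造的),只为了说明"数一数 k 个最近邻中哪个类别出现得最多":

```python
import numpy as np

X_demo = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [0.2, 0.8]])
y_demo = np.array([0, 1, 1, 0])
x_new = np.array([0.9, 0.8])
k = 3

# 计算新点到每个训练点的欧式距离,取最近的 k 个
dist = np.sqrt(((X_demo - x_new) ** 2).sum(axis=1))
nearest = np.argsort(dist)[:k]

# 统计这 k 个邻居的类别,出现次数最多的类别就是预测结果
votes = np.bincount(y_demo[nearest])
print("k 个近邻的类别:", y_demo[nearest], "预测类别:", np.argmax(votes))
```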
" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "和上面一样,预测结果可以从五角星的颜色看出。你可以发现,左上角新数据点的预测结果与只用一个邻居时的预测结果不同。\n", 82 | "\n", 83 | "虽然这张图对应的是一个二分类问题,但方法同样适用于多分类的数据集。对于多分类问题,我们数一数每个类别分别有多少个邻居,然后将最常见的类别作为预测结果。\n", 84 | "\n", 85 | "现在看一下如何通过 scikit-learn 来应用 k 近邻算法。首先,正如第 1 章所述,将数据分为训练集和测试集,以便评估泛化性能:" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": { 92 | "collapsed": true 93 | }, 94 | "outputs": [], 95 | "source": [ 96 | "from sklearn.model_selection import train_test_split\n", 97 | "X, y = mglearn.datasets.make_forge()\n", 98 | "\n", 99 | "X_train, X_test, y_train, y_test =train_test_split(X, y, random_state=0)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "然后,导入类并将其实例化。这时可以设定参数,比如邻居的个数。这里我们将其设为 3:" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": { 113 | "collapsed": true 114 | }, 115 | "outputs": [], 116 | "source": [ 117 | "from sklearn.neighbors import KNeighborsClassifier\n", 118 | "clf = KNeighborsClassifier(n_neighbors=3)" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "现在,利用训练集对这个分类器进行拟合。对于 KNeighborsClassifier 来说就是保存数据集,以便在预测时计算与邻居之间的距离:" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": { 132 | "collapsed": false 133 | }, 134 | "outputs": [], 135 | "source": [ 136 | "clf.fit(X_train, y_train)" 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "调用 predict 方法来对测试数据进行预测。对于测试集中的每个数据点,都要计算它在训练集的最近邻,然后找出其中出现次数最多的类别:" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": { 150 | "collapsed": false 151 | }, 152 | "outputs": [], 153 | "source": [ 154 | "print(\"Test set predictions: {}\".format(clf.predict(X_test)))" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "为了评估模型的泛化能力好坏,我们可以对测试数据和测试标签调用 score 方法:" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": { 168 | "collapsed": false 169 | }, 170 | "outputs": [], 171 | "source": [ 172 | "print(\"Test set accuracy: {:.2f}\".format(clf.score(X_test, y_test)))" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "可以看到,我们的模型精度约为 86%,也就是说,在测试数据集中,模型对其中 86% 的样本预测的类别都是正确的。" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "# 2. 
分析 KNeighborsClassifier" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "对于二维数据集,我们还可以在 xy 平面上画出所有可能的测试点的预测结果。我们根据平面中每个点所属的类别对平面进行着色。这样可以查看**决策边界**(decision boundary),即算法对类别 0 和类别 1 的分界线。\n", 194 | "\n", 195 | "下列代码分别将 1 个、3 个和 9 个邻居三种情况的决策边界可视化,见图 2-6:" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "metadata": { 202 | "collapsed": false 203 | }, 204 | "outputs": [], 205 | "source": [ 206 | "fig, axes = plt.subplots(1,3, figsize=(10,3))\n", 207 | "\n", 208 | "for n_neighbors, ax in zip([1,3,9], axes):\n", 209 | " # fit 方法返回对象本身,所以我们可以将实例化和拟合放在一行代码中\n", 210 | " clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X,y)\n", 211 | " mglearn.plots.plot_2d_separator(clf, X, fill=True, eps=0.5, ax=ax, alpha=.4)\n", 212 | " mglearn.discrete_scatter(X[:,0], X[:,1], y, ax=ax)\n", 213 | " ax.set_title(\"{} neighbor(s)\".format(n_neighbors))\n", 214 | " ax.set_xlabel(\"feature 0\")\n", 215 | " ax.set_ylabel(\"feature 1\")\n", 216 | "axes[0].legend(loc=3)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "\n", 224 | "
图 2-6:不同 n_neighbors 值的 k 近邻模型的决策边界
" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": {}, 230 | "source": [ 231 | "从左图可以看出,使用单一邻居绘制的决策边界紧跟着训练数据。随着邻居个数越来越多,决策边界也越来越平滑。更平滑的边界对应更简单的模型。换句话说,使用更少的邻居对应更高的模型复杂度,而使用更多的邻居对应更低的模型复杂度。假如考虑极端情况,即邻居个数等于训练集中所有数据点的个数,那么每个测试点的邻居都完全相同(即所有训练点),所有预测结果也完全相同(即训练集中出现次数最多的类别)。\n", 232 | "\n", 233 | "我们来研究一下能否证实之前讨论过的模型复杂度和泛化能力之间的关系。我们将在现实世界的乳腺癌数据集上进行研究。先将数据集分成训练集和测试集,然后用不同的邻居个数对训练集和测试集的性能进行评估。输出结果见图 2-7:" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "metadata": { 240 | "collapsed": false 241 | }, 242 | "outputs": [], 243 | "source": [ 244 | "from sklearn.datasets import load_breast_cancer\n", 245 | "\n", 246 | "cancer = load_breast_cancer()\n", 247 | "X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=66)\n", 248 | "\n", 249 | "training_accuracy=[]\n", 250 | "test_accuracy=[]\n", 251 | "\n", 252 | "# neighbors取值从1到10\n", 253 | "neighbors_settings=range(1,11)\n", 254 | "\n", 255 | "for n_neighbors in neighbors_settings:\n", 256 | " # 构建模型\n", 257 | " clf = KNeighborsClassifier(n_neighbors=n_neighbors)\n", 258 | " clf.fit(X_train, y_train)\n", 259 | " # 记录训练集精度\n", 260 | " training_accuracy.append(clf.score(X_train, y_train))\n", 261 | " # 记录泛化精度\n", 262 | " test_accuracy.append(clf.score(X_test, y_test))\n", 263 | "\n", 264 | "plt.plot(neighbors_settings, training_accuracy, label=\"training accuracy\")\n", 265 | "plt.plot(neighbors_settings, test_accuracy, label=\"test accuracy\")\n", 266 | "plt.ylabel(\"Accuracy\")\n", 267 | "plt.xlabel(\"n_neighbors\")\n", 268 | "plt.legend()" 269 | ] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "metadata": {}, 274 | "source": [ 275 | "\n", 276 | "
图 2-7:以 n_neighbors 为自变量,对比训练集精度和测试集精度
" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "图像的 x 轴是 n_neighbors ,y 轴是训练集精度和测试集精度。虽然现实世界的图像很少有非常平滑的,但我们仍可以看出过拟合与欠拟合的一些特征。仅考虑单一近邻时,训练集上的预测结果十分完美。但随着邻居个数的增多,模型变得更简单,训练集精度也随之下降。单一邻居时的测试集精度比使用更多邻居时要低,这表示单一近邻的模型过于复杂。与之相反,当考虑 10 个邻居时,模型又过于简单,性能甚至变得更差。最佳性能在中间的某处,邻居个数大约为 6。不过最好记住这张图的坐标轴刻度。最差的性能约为 88% 的精度,这个结果仍然可以接受。" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "# 3. K近邻回归" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "k 近邻算法还可以用于回归(即把邻居的平均值赋给目标)。我们还是先从单一近邻开始,这次使用 wave 数据集。我们添加了 3 个测试数据点,在 x 轴上用绿色五角星表示。利用单一邻居的预测结果就是最近邻的目标值。在图 2-8 中用蓝色五角星表示:" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "metadata": { 304 | "collapsed": false 305 | }, 306 | "outputs": [], 307 | "source": [ 308 | "mglearn.plots.plot_knn_regression(n_neighbors=1)" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": {}, 314 | "source": [ 315 | "\n", 316 | "
图 2-8:单一近邻回归对 wave 数据集的预测结果
" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "同样,也可以用多个近邻进行回归。在使用多个近邻时,预测结果为这些邻居的平均值" 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": null, 329 | "metadata": { 330 | "collapsed": false 331 | }, 332 | "outputs": [], 333 | "source": [ 334 | "mglearn.plots.plot_knn_regression(n_neighbors=3)" 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": {}, 340 | "source": [ 341 | "\n", 342 | "
图 2-9:3 个近邻回归对 wave 数据集的预测结果
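可以用一个很小的例子核对"预测结果为这些邻居目标值的平均值"。下面的数据完全是随意构造的,只为验证这一行为:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0.0, 1.0, 4.0, 9.0])

reg = KNeighborsRegressor(n_neighbors=3).fit(X_demo, y_demo)

# 对 x=1.1 来说,最近的 3 个训练点是 1.0、2.0 和 0.0,对应目标值 1、4、0
print(reg.predict([[1.1]]))          # 应约等于 (1 + 4 + 0) / 3
print((1.0 + 4.0 + 0.0) / 3)
```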
" 343 | ] 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "metadata": {}, 348 | "source": [ 349 | "用于回归的 k 近邻算法在 scikit-learn 的 KNeighborsRegressor 类中实现。其用法与KNeighborsClassifier 类似:" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": null, 355 | "metadata": { 356 | "collapsed": false 357 | }, 358 | "outputs": [], 359 | "source": [ 360 | "from sklearn.neighbors import KNeighborsRegressor\n", 361 | "\n", 362 | "X, y=mglearn.datasets.make_wave(n_samples=40)\n", 363 | "\n", 364 | "# 划分训练测试集\n", 365 | "X_train, X_test, y_train, y_test=train_test_split(X,y, random_state=0)\n", 366 | "\n", 367 | "# 实例化模型,邻居设定为3\n", 368 | "reg=KNeighborsRegressor(n_neighbors=3)\n", 369 | "# 拟合模型\n", 370 | "reg.fit(X_train, y_train)" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "现在可以对测试集进行预测:" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": null, 383 | "metadata": { 384 | "collapsed": false 385 | }, 386 | "outputs": [], 387 | "source": [ 388 | "print(\"Test set predictions:\\n{}\".format(reg.predict(X_test)))" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": {}, 394 | "source": [ 395 | "我们还可以用 score 方法来评估模型,对于回归问题,这一方法返回的是 $R^2$ 分数。$R^2$ 分数也叫作决定系数,是回归模型预测的优度度量,位于 0 到 1 之间。$R^2$ 等于 1 对应完美预测,$R^2$ 等于 0 对应常数模型,即总是预测训练集响应( y_train )的平均值:" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": null, 401 | "metadata": { 402 | "collapsed": false 403 | }, 404 | "outputs": [], 405 | "source": [ 406 | "print(\"Test set R^2: {:.2f}\".format(reg.score(X_test, y_test)))" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "这里的分数是 0.83,表示模型的拟合相对较好。" 414 | ] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "metadata": {}, 419 | "source": [ 420 | "# 4. 分析 KNeighborsRegressor\n", 421 | "对于我们的一维数据集,可以查看所有特征取值对应的预测结果(图 2-10)。为了便于绘\n", 422 | "图,我们创建一个由许多点组成的测试数据集" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": null, 428 | "metadata": { 429 | "collapsed": false 430 | }, 431 | "outputs": [], 432 | "source": [ 433 | "fig, axes = plt.subplots(1,3, figsize=(15,4))\n", 434 | "#创建1000个数据点,在-3和3之间均匀分布\n", 435 | "line=np.linspace(-3,3,1000).reshape(-1,1)\n", 436 | "for n_neighbors,ax in zip([1,3,9], axes):\n", 437 | " # 利用1,3,9个邻居分别进行预测\n", 438 | " reg=KNeighborsRegressor(n_neighbors=n_neighbors)\n", 439 | " reg.fit(X_train, y_train)\n", 440 | " ax.plot(line, reg.predict(line))\n", 441 | " ax.plot(X_train, y_train, '^', c=mglearn.cm2(0), markersize=8)\n", 442 | " ax.plot(X_test, y_test, 'v', c=mglearn.cm2(1), markersize=8)\n", 443 | " ax.set_title(\n", 444 | " \"{} neighbor(s)\\n train score: {:.2f} test score: {:.2f}\".format(\n", 445 | " n_neighbors, \n", 446 | " reg.score(X_train, y_train),\n", 447 | " reg.score(X_test, y_test)))\n", 448 | " ax.set_xlabel(\"Feature\")\n", 449 | " ax.set_ylabel(\"Target\")\n", 450 | "axes[0].legend([\"Model predictions\", \"Training data/target\",\n", 451 | " \"Test data/target\"], loc=\"best\")" 452 | ] 453 | }, 454 | { 455 | "cell_type": "markdown", 456 | "metadata": {}, 457 | "source": [ 458 | "\n", 459 | "
图 2-10:不同 n_neighbors 值的 k 近邻回归的预测结果对比
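回到前面提到的 $R^2$ 分数:按通常的定义 $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 \,/\, \sum_i (y_i - \bar{y})^2$,"总是预测均值的常数模型对应 $R^2=0$"可以直接用几行代码核对(示例数据随意构造):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.full_like(y_true, y_true.mean())   # 常数模型:总是预测均值

r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(r2)   # 0.0
```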
" 460 | ] 461 | }, 462 | { 463 | "cell_type": "markdown", 464 | "metadata": {}, 465 | "source": [ 466 | "从图中可以看出,仅使用单一邻居,训练集中的每个点都对预测结果有显著影响,预测结果的图像经过所有数据点。这导致预测结果非常不稳定。考虑更多的邻居之后,预测结果变得更加平滑,但对训练数据的拟合也不好。" 467 | ] 468 | }, 469 | { 470 | "cell_type": "markdown", 471 | "metadata": {}, 472 | "source": [ 473 | "# 5. 优点、缺点和参数\n", 474 | "一般来说, KNeighbors 分类器有 2 个重要参数:**邻居个数与数据点之间距离的度量方法**。在实践中,使用较小的邻居个数(比如 3 个或 5 个)往往可以得到比较好的结果,但你应该调节这个参数。选择合适的距离度量方法超出了本书的范围。默认使用欧式距离,它在许多情况下的效果都很好。\n", 475 | "\n", 476 | "k-NN 的优点之一就是模型很容易理解,通常不需要过多调节就可以得到不错的性能。在考虑使用更高级的技术之前,尝试此算法是一种很好的基准方法。构建最近邻模型的速度通常很快,但如果训练集很大(特征数很多或者样本数很大),预测速度可能会比较慢。使用 k-NN 算法时,对数据进行预处理是很重要的(见第 3 章)。这一算法对于有很多特征(几百或更多)的数据集往往效果不好,对于大多数特征的大多数取值都为 0 的数据集(所谓的稀疏数据集)来说,这一算法的效果尤其不好。\n", 477 | "\n", 478 | "虽然 k 近邻算法很容易理解,**但由于预测速度慢且不能处理具有很多特征的数据集**,所以在实践中往往不会用到。" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": null, 484 | "metadata": { 485 | "collapsed": true 486 | }, 487 | "outputs": [], 488 | "source": [] 489 | } 490 | ], 491 | "metadata": { 492 | "anaconda-cloud": {}, 493 | "kernelspec": { 494 | "display_name": "Python [conda env:Anaconda3]", 495 | "language": "python", 496 | "name": "conda-env-Anaconda3-py" 497 | }, 498 | "language_info": { 499 | "codemirror_mode": { 500 | "name": "ipython", 501 | "version": 3 502 | }, 503 | "file_extension": ".py", 504 | "mimetype": "text/x-python", 505 | "name": "python", 506 | "nbconvert_exporter": "python", 507 | "pygments_lexer": "ipython3", 508 | "version": "3.5.6" 509 | } 510 | }, 511 | "nbformat": 4, 512 | "nbformat_minor": 1 513 | } 514 | -------------------------------------------------------------------------------- /第2章-监督学习-决策树.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "先导入必要的包" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import numpy as np\n", 19 | "import matplotlib.pyplot as plt\n", 20 | "import pandas as pd\n", 21 | "import mglearn\n", 22 | "import os\n", 23 | "%matplotlib inline " 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "决策树是广泛用于分类和回归任务的模型。本质上,它从一层层的 if/else 问题中进行学\n", 31 | "习,并得出结论。\n", 32 | "\n", 33 | "这些问题类似于你在“20 Questions”游戏中可能会问的问题。想象一下,你想要区分下面这四种动物:熊、鹰、企鹅和海豚。你的目标是通过提出尽可能少的 if/else 问题来得到正确答案。你可能首先会问:这种动物有没有羽毛,这个问题会将可能的动物减少到只有两种。如果答案是“有”,你可以问下一个问题,帮你区分鹰和企鹅。例如,你可以问这种动物会不会飞。如果这种动物没有羽毛,那么可能是海豚或熊,所以你需要问一个问题来区分这两种动物——比如问这种动物有没有鳍。\n", 34 | "\n", 35 | "这一系列问题可以表示为一棵决策树,如图 2-22 所示。" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": { 42 | "collapsed": false 43 | }, 44 | "outputs": [], 45 | "source": [ 46 | "mglearn.plots.plot_animal_tree()" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "
\n", 54 | "
图 2-22:区分几种动物的决策树
" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": { 60 | "collapsed": false 61 | }, 62 | "source": [ 63 | "在这张图中,树的每个结点代表一个问题或一个包含答案的终结点(也叫叶结点)。树的边将问题的答案与将问的下一个问题连接起来。\n", 64 | "\n", 65 | "用机器学习的语言来说就是,为了区分四类动物(鹰、企鹅、海豚和熊),我们利用三个特征(“有没有羽毛”“会不会飞”和“有没有鳍”)来构建一个模型。我们可以利用监督学习从数据中学习模型,而无需人为构建模型。" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "# 1. 构造决策树\n", 73 | "我们在图 2-23 所示的二维分类数据集上构造决策树。这个数据集由 2 个半月形组成,每个类别都包含 50 个数据点。我们将这个数据集称为 two_moons 。\n", 74 | "\n", 75 | "学习决策树,就是学习一系列 if/else 问题,使我们能够以最快的速度得到正确答案。在机器学习中,这些问题叫作测试(不要与测试集弄混,测试集是用来测试模型泛化性能的数据)。数据通常并不是像动物的例子那样具有二元特征(是 / 否)的形式,而是表示为连续特征,比如图 2-23 所示的二维数据集。用于连续数据的测试形式是:“特征 i 的值是否大于 a ?”" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "
\n", 83 | "
图 2-23:用于构造决策树的 two_moons 数据集
" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "为了构造决策树,算法搜遍所有可能的测试,找出对目标变量来说信息量最大的那一个。图 2-24 展示了选出的第一个测试。将数据集在 x[1]=0.0596 处垂直划分可以得到最多信息,它在最大程度上将类别 0 中的点与类别 1 中的点进行区分。顶结点(也叫**根结点**)表示整个数据集,包含属于类别 0 的 50 个点和属于类别 1 的 50 个点。通过测试 x[1] <=0.0596 的真假来对数据集进行划分,在图中表示为一条黑线。如果测试结果为真,那么将这个点分配给左结点,左结点里包含属于类别 0 的 2 个点和属于类别 1 的 32 个点。否则将这个点分配给右结点,右结点里包含属于类别 0 的 48 个点和属于类别 1 的 18 个点。这两个结点对应于图 2-24 中的顶部区域和底部区域。尽管第一次划分已经对两个类别做了很好的区分,但底部区域仍包含属于类别 0 的点,顶部区域也仍包含属于类别 1 的点。我们可以在两个区域中重复寻找最佳测试的过程,从而构建出更准确的模型。图 2-25 展示了信息量最大的下一次划分,这次划分是基于 x[0] 做出的,分为左右两个区域。" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": { 96 | "collapsed": true 97 | }, 98 | "source": [ 99 | "
\n", 100 | "
图 2-24:深度为 1 的树的决策边界(左)与相应的树(右)
\n", 101 | "
\n", 102 | "
图 2-25:深度为 2 的树的决策边界(左)与相应的树(右)
" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "这一递归过程生成一棵二元决策树,其中每个结点都包含一个测试。或者你可以将每个测试看成沿着一条轴对当前数据进行划分。这是一种将算法看作分层划分的观点。由于每个测试仅关注一个特征,所以划分后的区域边界始终与坐标轴平行。\n", 110 | "\n", 111 | "对数据反复进行递归划分,直到划分后的每个区域(决策树的每个叶结点)只包含单一目标值(单一类别或单一回归值)。如果树中某个叶结点所包含数据点的目标值都相同,那么这个叶结点就是纯的(pure)。这个数据集的最终划分结果见图 2-26。" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "
\n", 119 | "
图 2-26:深度为 9 的树的决策边界(左)与相应的树的一部分(右);完整的决策树非常大,很难可视化
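正文说算法会"搜遍所有可能的测试,找出对目标变量来说信息量最大的那一个"。下面用熵和信息增益给出一个极简示意;注意 scikit-learn 的决策树默认用的是基尼不纯度,这里选熵只是为了演示"信息量"这个说法,数据也是随意构造的:

```python
import numpy as np

def entropy(labels):
    # 类别分布的熵:越接近 0 表示结点越"纯"
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# 按阈值把样本分成左右两个结点,计算这个测试带来的信息增益
x_demo = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0])
y_demo = np.array([0, 0, 0, 1, 1, 1])
threshold = 2.7

left, right = y_demo[x_demo <= threshold], y_demo[x_demo > threshold]
gain = entropy(y_demo) - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y_demo)
print("阈值 {} 的信息增益: {:.3f}".format(threshold, gain))
```

对每个特征的每个候选阈值都算一遍类似的量,然后选收益最大的那个测试,就是"信息量最大"的含义。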
" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "想要对新数据点进行预测,首先要查看这个点位于特征空间划分的哪个区域,然后将该区域的多数目标值(如果是纯的叶结点,就是单一目标值)作为预测结果。从根结点开始对树进行遍历就可以找到这一区域,每一步向左还是向右取决于是否满足相应的测试。\n", 127 | "\n", 128 | "决策树也可以用于回归任务,使用的方法完全相同。预测的方法是,基于每个结点的测试对树进行遍历,最终找到新数据点所属的叶结点。这一数据点的输出即为此叶结点中所有训练点的平均目标值。\n", 129 | "\n", 130 | "# 2. 控制决策树的复杂度\n", 131 | "通常来说,构造决策树直到所有叶结点都是纯的叶结点,这会导致模型非常复杂,并且对训练数据高度过拟合。纯叶结点的存在说明这棵树在训练集上的精度是 100%。训练集中的每个数据点都位于分类正确的叶结点中。在图 2-26 的左图中可以看出过拟合。你可以看到,在所有属于类别 0 的点中间有一块属于类别 1 的区域。另一方面,有一小条属于类别 0 的区域,包围着最右侧属于类别 0 的那个点。这并不是人们想象中决策边界的样子,这个决策边界过于关注远离同类别其他点的单个异常点。\n", 132 | "\n", 133 | "防止过拟合有两种常见的策略:一种是及早停止树的生长,也叫**预剪枝**(pre-pruning);另一种是先构造树,但随后删除或折叠信息量很少的结点,也叫**后剪枝**(post-pruning)或剪枝(pruning)。预剪枝的限制条件可能包括限制树的最大深度、限制叶结点的最大数目,或者规定一个结点中数据点的最小数目来防止继续划分。\n", 134 | "\n", 135 | "scikit-learn 的决策树在 DecisionTreeRegressor 类和 DecisionTreeClassifier 类中实现。scikit-learn 只实现了预剪枝,没有实现后剪枝。\n", 136 | "\n", 137 | "我们在乳腺癌数据集上更详细地看一下预剪枝的效果。和前面一样,我们导入数据集并将其分为训练集和测试集。然后利用默认设置来构建模型,默认将树完全展开(树不断分支,直到所有叶结点都是纯的)。我们固定树的 random_state ,用于在内部解决平局问题:" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": { 144 | "collapsed": false 145 | }, 146 | "outputs": [], 147 | "source": [ 148 | "from sklearn.tree import DecisionTreeClassifier\n", 149 | "from sklearn.datasets import load_breast_cancer\n", 150 | "from sklearn.model_selection import train_test_split\n", 151 | "\n", 152 | "cancer = load_breast_cancer()\n", 153 | "X_train, X_test, y_train, y_test = train_test_split(\n", 154 | " cancer.data, cancer.target, stratify=cancer.target, random_state=42)\n", 155 | "tree = DecisionTreeClassifier(random_state=0)\n", 156 | "tree.fit(X_train, y_train)\n", 157 | "print(\"Accuracy on training set: {:.3f}\".format(tree.score(X_train, y_train)))\n", 158 | "print(\"Accuracy on test set: {:.3f}\".format(tree.score(X_test, y_test)))" 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": {}, 164 | "source": [ 165 | "不出所料,训练集上的精度是 100%,这是因为叶结点都是纯的,树的深度很大,足以完美地记住训练数据的所有标签。测试集精度比之前讲过的线性模型略低,线性模型的精度约为 95%。\n", 166 | "\n", 167 | "如果我们不限制决策树的深度,它的深度和复杂度都可以变得特别大。因此,未剪枝的树容易过拟合,对新数据的泛化性能不佳。现在我们将预剪枝应用在决策树上,这可以在完美拟合训练数据之前阻止树的展开。一种选择是在到达一定深度后停止树的展开。这里我们设置 max_depth=4 ,这意味着只可以连续问 4 个问题(参见图 2-24 和图 2-26)。限制树的深度可以减少过拟合。这会降低训练集的精度,但可以提高测试集的精度:" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "metadata": { 174 | "collapsed": false 175 | }, 176 | "outputs": [], 177 | "source": [ 178 | "tree=DecisionTreeClassifier(max_depth=4, random_state=0)\n", 179 | "tree.fit(X_train, y_train)\n", 180 | "print(\"Accuracy on training set: {:.3f}\".format(tree.score(X_train, y_train)))\n", 181 | "print(\"Accuracy on test set: {:.3f}\".format(tree.score(X_test, y_test)))" 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": {}, 187 | "source": [ 188 | "# 3. 
分析决策树\n", 189 | "我们可以利用 tree 模块的 export_graphviz 函数来将树可视化。这个函数会生成一个 .dot 格式的文件,这是一种用于保存图形的文本文件格式。我们设置为结点添加颜色的选项,颜色表示每个结点中的多数类别,同时传入类别名称和特征名称,这样可以对树正确标记:" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": null, 195 | "metadata": { 196 | "collapsed": true 197 | }, 198 | "outputs": [], 199 | "source": [ 200 | "from sklearn.tree import export_graphviz\n", 201 | "export_graphviz(tree, out_file=\"tree.dot\", class_names=[\"malignant\",\"benign\"],\n", 202 | " feature_names=cancer.feature_names, impurity=False, filled=True)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "我们可以利用 graphviz 模块读取这个文件并将其可视化(你也可以使用任何能够读取 .dot文件的程序),见图 2-27:" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "metadata": { 216 | "collapsed": true 217 | }, 218 | "outputs": [], 219 | "source": [ 220 | "import graphviz\n", 221 | "with open(\"tree.dot\") as f:\n", 222 | " dot_graph = f.read()\n", 223 | " graphviz.Source(dot_graph)" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": {}, 229 | "source": [ 230 | "
\n", 231 | "
图 2-27:基于乳腺癌数据集构造的决策树的可视化
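一个小提示:在某些 Jupyter 环境下,上面那个单元可能不会把树的图像显示出来,因为 graphviz.Source(dot_graph) 写在了 with 代码块内部,而 Jupyter 通常只自动渲染单元最后一个顶层表达式。可以改成下面的写法(假设 tree.dot 已由前面的 export_graphviz 生成):

```python
import graphviz

with open("tree.dot") as f:
    dot_graph = f.read()

# 作为单元最后一个顶层表达式,Jupyter 会自动渲染这棵树
graphviz.Source(dot_graph)
```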
" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "树的可视化有助于深入理解算法是如何进行预测的,也是易于向非专家解释的机器学习算法的优秀示例。不过,即使这里树的深度只有 4 层,也有点太大了。深度更大的树(深度为 10 并不罕见)更加难以理解。一种观察树的方法可能有用,就是找出大部分数据的实际路径。图 2-27 中每个结点的 samples 给出了该结点中的样本个数, values 给出的是每个类别的样本个数。观察 worst radius <= 16.795 分支右侧的子结点,我们发现它只包含8 个良性样本,但有 134 个恶性样本。树的这一侧的其余分支只是利用一些更精细的区别将这 8 个良性样本分离出来。在第一次划分右侧的 142 个样本中,几乎所有样本(132 个)最后都进入最右侧的叶结点中。\n", 239 | "\n", 240 | "再来看一下根结点的左侧子结点,对于 worst radius > 16.795 ,我们得到 25 个恶性样本和 259 个良性样本。几乎所有良性样本最终都进入左数第二个叶结点中,大部分其他叶结点都只包含很少的样本。" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "# 4. 树的特征重要性\n", 248 | "查看整个树可能非常费劲,除此之外,我还可以利用一些有用的属性来总结树的工作原理。其中最常用的是**特征重要性**(feature importance),它为每个特征对树的决策的重要性进行排序。对于每个特征来说,它都是一个介于 0 和 1 之间的数字,其中 0 表示“根本没用到”,1 表示“完美预测目标值”。特征重要性的求和始终为 1:" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": null, 254 | "metadata": { 255 | "collapsed": false 256 | }, 257 | "outputs": [], 258 | "source": [ 259 | "print(\"Feature importances:\\n{}\".format(tree.feature_importances_))" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "我们可以将特征重要性可视化,与我们将线性模型的系数可视化的方法类似(图 2-28):" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": null, 272 | "metadata": { 273 | "collapsed": false 274 | }, 275 | "outputs": [], 276 | "source": [ 277 | "def plot_feature_importance_cancer(model):\n", 278 | " n_features = cancer.data.shape[1]\n", 279 | " plt.barh(range(n_features), model.feature_importances_, align='center')\n", 280 | " plt.yticks(np.arange(n_features), cancer.feature_names)\n", 281 | " plt.xlabel(\"Feature importance\")\n", 282 | " plt.ylabel(\"Feature\")\n", 283 | "\n", 284 | "plot_feature_importance_cancer(tree)" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "
\n", 292 | "
图 2-28:在乳腺癌数据集上学到的决策树的特征重要性
" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "这里我们看到,顶部划分用到的特征(“worst radius”)是最重要的特征。这也证实了我们在分析树时的观察结论,即第一层划分已经将两个类别区分得很好。\n", 300 | "\n", 301 | "但是,如果某个特征的 feature_importance\\_ 很小,并不能说明这个特征没有提供任何信息。这只能说明该特征没有被树选中,可能是因为另一个特征也包含了同样的信息。\n", 302 | "\n", 303 | "与线性模型的系数不同,特征重要性始终为正数,也不能说明该特征对应哪个类别。特征重要性告诉我们“worst radius”(最大半径)特征很重要,但并没有告诉我们半径大表示样本是良性还是恶性。事实上,在特征和类别之间可能没有这样简单的关系,你可以在下面的例子中看出这一点(图 2-29 和图 2-30):" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": null, 309 | "metadata": { 310 | "collapsed": false 311 | }, 312 | "outputs": [], 313 | "source": [ 314 | "tree = mglearn.plots.plot_tree_not_monotone()" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "metadata": {}, 320 | "source": [ 321 | "
\n", 322 | "
图 2-29:一个二维数据集( y 轴上的特征与类别标签是非单调的关系)与决策树给出的决策边界
\n", 323 | "
\n", 324 | "
图 2-30:从图 2-29 的数据中学到的决策树
" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": {}, 330 | "source": [ 331 | "该图显示的是有两个特征和两个类别的数据集。这里所有信息都包含在 X[1] 中,没有用到X[0] 。但 X[1] 和输出类别之间并不是单调关系,即我们不能这么说:“较大的 X[1] 对应类别 0,较小的 X[1] 对应类别 1”(反之亦然)。\n", 332 | "\n", 333 | "虽然我们主要讨论的是用于分类的决策树,但对用于回归的决策树来说,所有内容都是类似的,在 DecisionTreeRegressor 中实现。回归树的用法和分析与分类树非常类似。但在将基于树的模型用于回归时,我们想要指出它的一个特殊性质。 DecisionTreeRegressor(以及其他所有基于树的回归模型)不能外推(extrapolate),也不能在训练数据范围之外进行预测。\n", 334 | "\n", 335 | "我们利用计算机内存(RAM)历史价格的数据集来更详细地研究这一点。图 2-31 给出了这个数据集的图像,x 轴为日期,y 轴为那一年 1 兆字节(MB)RAM 的价格:" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": { 342 | "collapsed": false 343 | }, 344 | "outputs": [], 345 | "source": [ 346 | "import pandas as pd\n", 347 | "ram_prices = pd.read_csv(\"data/ram_price.csv\")\n", 348 | "\n", 349 | "plt.semilogy(ram_prices.date, ram_prices.price)\n", 350 | "plt.xlabel('Year')\n", 351 | "plt.ylabel(\"Pirce in $/Mbyte\")" 352 | ] 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "metadata": {}, 357 | "source": [ 358 | "
\n", 359 | "
图 2-31:用对数坐标绘制 RAM 价格的历史发展
" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "注意 y 轴的对数刻度。在用对数坐标绘图时,二者的线性关系看起来非常好,所以预测应该相对比较容易,除了一些不平滑之处之外。\n", 367 | "\n", 368 | "我们将利用 2000 年前的历史数据来预测 2000 年后的价格,只用日期作为特征。我们将对比两个简单的模型: DecisionTreeRegressor 和 LinearRegression 。我们对价格取对数,使得二者关系的线性相对更好。这对 DecisionTreeRegressor 不会产生什么影响,但对LinearRegression 的影响却很大(我们将在第 4 章中进一步讨论)。训练模型并做出预测之后,我们应用指数映射来做对数变换的逆运算。为了便于可视化,我们这里对整个数据集进行预测,但如果是为了定量评估,我们将只考虑测试数据集:" 369 | ] 370 | }, 371 | { 372 | "cell_type": "code", 373 | "execution_count": null, 374 | "metadata": { 375 | "collapsed": true 376 | }, 377 | "outputs": [], 378 | "source": [ 379 | "from sklearn.tree import DecisionTreeRegressor\n", 380 | "from sklearn.linear_model import LinearRegression\n", 381 | "# 利用历史数据预测2000年后的价格\n", 382 | "data_train = ram_prices[ram_prices.date < 2000]\n", 383 | "data_test = ram_prices[ram_prices.date >= 2000]\n", 384 | "\n", 385 | "# 基于日期来预测价格\n", 386 | "X_train = data_train.date[:, np.newaxis]\n", 387 | "# 我们利用对数变换得到数据和目标之间更简单的关系\n", 388 | "y_train = np.log(data_train.price)\n", 389 | "\n", 390 | "tree = DecisionTreeRegressor().fit(X_train, y_train)\n", 391 | "linear_reg = LinearRegression().fit(X_train, y_train)\n", 392 | "\n", 393 | "# 对所有数据进行预测\n", 394 | "X_all = ram_prices.date[:, np.newaxis]\n", 395 | "\n", 396 | "pred_tree = tree.predict(X_all)\n", 397 | "pred_lr = linear_reg.predict(X_all)\n", 398 | "\n", 399 | "# 对数变换逆运算\n", 400 | "price_tree = np.exp(pred_tree)\n", 401 | "price_lr = np.exp(pred_lr)" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "这里创建的图 2-32 将决策树和线性回归模型的预测结果与真实值进行对比:" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": null, 414 | "metadata": { 415 | "collapsed": false 416 | }, 417 | "outputs": [], 418 | "source": [ 419 | "plt.semilogy(data_train.date, data_train.price, label=\"Training data\")\n", 420 | "plt.semilogy(data_test.date, data_test.price, label=\"Test data\")\n", 421 | "plt.semilogy(ram_prices.date, price_tree, label=\"Tree prediction\")\n", 422 | "plt.semilogy(ram_prices.date, price_lr, label=\"Linear prediction\")\n", 423 | "plt.legend()" 424 | ] 425 | }, 426 | { 427 | "cell_type": "markdown", 428 | "metadata": {}, 429 | "source": [ 430 | "
\n", 431 | "
图 2-32:线性模型和回归树对 RAM 价格数据的预测结果对比
" 432 | ] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "metadata": {}, 437 | "source": [ 438 | "两个模型之间的差异非常明显。线性模型用一条直线对数据做近似,这是我们所知道的。这条线对测试数据(2000 年后的价格)给出了相当好的预测,不过忽略了训练数据和测试数据中一些更细微的变化。与之相反,树模型完美预测了训练数据。由于我们没有限制树的复杂度,因此它记住了整个数据集。但是,一旦输入超出了模型训练数据的范围,模型就只能持续预测最后一个已知数据点。树不能在训练数据的范围之外生成“新的”响应。所有基于树的模型都有这个缺点。\n", 439 | "\n", 440 | "# 5. 优点、缺点和参数\n", 441 | "如前所述,控制决策树模型复杂度的参数是预剪枝参数,它在树完全展开之前停止树的构造。通常来说,选择一种预剪枝策略(设置 max_depth 、 max_leaf_nodes 或 min_samples_leaf )足以防止过拟合。\n", 442 | "\n", 443 | "与前面讨论过的许多算法相比,决策树有两个优点:**一是得到的模型很容易可视化,非专家也很容易理解(至少对于较小的树而言)**;**二是算法完全不受数据缩放的影响**。由于每个特征被单独处理,而且数据的划分也不依赖于缩放,因此**决策树算法不需要特征预处理**,比如归一化或标准化。特别是特征的尺度完全不一样时或者二元特征和连续特征同时存在时,决策树的效果很好。**决策树的主要缺点在于,即使做了预剪枝,它也经常会过拟合,泛化性能很差**。因此,在大多数应用中,往往使用下面介绍的集成方法来替代单棵决策树。" 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": null, 449 | "metadata": { 450 | "collapsed": true 451 | }, 452 | "outputs": [], 453 | "source": [] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "execution_count": null, 458 | "metadata": { 459 | "collapsed": true 460 | }, 461 | "outputs": [], 462 | "source": [] 463 | } 464 | ], 465 | "metadata": { 466 | "anaconda-cloud": {}, 467 | "kernelspec": { 468 | "display_name": "Python [conda env:Anaconda3]", 469 | "language": "python", 470 | "name": "conda-env-Anaconda3-py" 471 | }, 472 | "language_info": { 473 | "codemirror_mode": { 474 | "name": "ipython", 475 | "version": 3 476 | }, 477 | "file_extension": ".py", 478 | "mimetype": "text/x-python", 479 | "name": "python", 480 | "nbconvert_exporter": "python", 481 | "pygments_lexer": "ipython3", 482 | "version": "3.5.6" 483 | } 484 | }, 485 | "nbformat": 4, 486 | "nbformat_minor": 1 487 | } 488 | -------------------------------------------------------------------------------- /第2章-监督学习-线性模型.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "先导入必要的包" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import numpy as np\n", 19 | "import matplotlib.pyplot as plt\n", 20 | "import pandas as pd\n", 21 | "import mglearn\n", 22 | "\n", 23 | "%matplotlib inline " 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "线性模型是在实践中广泛使用的一类模型,几十年来被广泛研究,它可以追溯到一百多年前。线性模型利用输入特征的线性函数(linear function)进行预测,稍后会对此进行解释。" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "# 1. 用于回归的线性模型\n", 38 | "对于回归问题,线性模型预测的一般公式如下:\n", 39 | "\n", 40 | "$$ŷ = w[0] * x[0] + w[1] * x[1] + … + w[p] * x[p] + b$$\n", 41 | "\n", 42 | "这里 $x[0]$ 到 $x[p]$ 表示单个数据点的特征(本例中特征个数为 $p+1$),$w$ 和 $b$ 是学习模型的\n", 43 | "参数,$ŷ$ 是模型的预测结果。对于单一特征的数据集,公式如下:\n", 44 | "\n", 45 | "$$ŷ = w[0] * x[0] + b$$\n", 46 | "\n", 47 | "你可能还记得,这就是高中数学里的直线方程。这里 $w[0]$ 是斜率,$b$ 是 $y$ 轴偏移。对于有\n", 48 | "更多特征的数据集,$w$ 包含沿每个特征坐标轴的斜率。或者,你也可以将预测的响应值看\n", 49 | "作输入特征的加权求和,权重由 $w$ 的元素给出(可以取负值)。\n", 50 | "下列代码可以在一维 wave 数据集上学习参数 $w[0]$ 和 $b$:" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": { 57 | "collapsed": false 58 | }, 59 | "outputs": [], 60 | "source": [ 61 | "mglearn.plots.plot_linear_regression_wave()" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "
\n", 69 | "
图 2-11:线性模型对 wave 数据集的预测结果
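为了把上面的公式落到实处,下面用 NumPy 手动算一个两特征线性模型的预测值;权重、截距和输入都是随意取的示例数字,并不对应书中某个具体模型:

```python
import numpy as np

w = np.array([0.4, -1.2])   # 两个特征各自的斜率(权重)
b = 0.5                     # 截距
x = np.array([1.0, 2.0])    # 单个数据点的两个特征取值

y_hat = np.dot(w, x) + b    # 即 w[0]*x[0] + w[1]*x[1] + b
print(y_hat)                # 0.4*1.0 + (-1.2)*2.0 + 0.5 = -1.5
```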
" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "我们在图中添加了坐标网格,便于理解直线的含义。从 w[0] 可以看出,斜率应该在 0.4 左右,在图像中也可以直观地确认这一点。截距是指预测直线与 y 轴的交点:比 0 略小,也可以在图像中确认。\n", 77 | "\n", 78 | "用于回归的线性模型可以表示为这样的回归模型:对单一特征的预测结果是一条直线,两个特征时是一个平面,或者在更高维度(即更多特征)时是一个超平面。\n", 79 | "\n", 80 | "如果将直线的预测结果与上一章图 2-10 中 KNeighborsRegressor 的预测结果进行比较,你会发现直线的预测能力非常受限。似乎数据的所有细节都丢失了。从某种意义上来说,这种说法是正确的。假设目标 y 是特征的线性组合,这是一个非常强的(也有点不现实的)假设。但观察一维数据得出的观点有些片面。对于有多个特征的数据集而言,线性模型可以非常强大。特别地,如果特征数量大于训练数据点的数量,任何目标 y 都可以(在训练集上)用线性函数完美拟合。\n", 81 | "\n", 82 | "有许多不同的线性回归模型。这些模型之间的区别在于如何从训练数据中学习参数 w 和 b,以及如何控制模型复杂度。下面介绍最常见的线性回归模型。" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "# 2. 线性回归(又名普通最小二乘法)\n", 90 | "线性回归,或者普通最小二乘法(ordinary least squares,OLS),是回归问题最简单也最经典的线性方法。线性回归寻找参数 w 和 b,使得对训练集的预测值与真实的回归目标值 y 之间的均方误差最小。均方误差(mean squared error)是预测值与真实值之差的平方和除以样本数。线性回归没有参数,这是一个优点,但也因此无法控制模型的复杂度。\n", 91 | "\n", 92 | "下列代码可以生成图 2-11 中的模型:" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": { 99 | "collapsed": false 100 | }, 101 | "outputs": [], 102 | "source": [ 103 | "from sklearn.linear_model import LinearRegression\n", 104 | "from sklearn.model_selection import train_test_split\n", 105 | "\n", 106 | "X, y = mglearn.datasets.make_wave(n_samples=60)\n", 107 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n", 108 | "\n", 109 | "lr = LinearRegression().fit(X_train, y_train)" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "“斜率”参数(w,也叫作权重或系数)被保存在 coef\\_ 属性中,而偏移或截距(b)被保存在 intercept\\_ 属性中:" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "metadata": { 123 | "collapsed": false 124 | }, 125 | "outputs": [], 126 | "source": [ 127 | "print(\"lr.coef_: {}\".format(lr.coef_))\n", 128 | "print(\"lr.intercept_: {}\".format(lr.intercept_))" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "你可能注意到了 coef\\_ 和 intercept\\_ 结尾处奇怪的下划线。scikit-learn 总是将从训练数据中得出的值保存在以下划线结尾的属性中。这是为了将其与用户设置的参数区分开。" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "intercept\\_ 属性是一个浮点数,而 coef\\_ 属性是一个 NumPy 数组,每个元素对应一个输入特征。由于 wave 数据集中只有一个输入特征,所以 lr.coef\\_ 中只有一个元素。我们来看一下训练集和测试集的性能:" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "metadata": { 149 | "collapsed": false 150 | }, 151 | "outputs": [], 152 | "source": [ 153 | "print(\"Training set score: {:.2f}\".format(lr.score(X_train, y_train)))\n", 154 | "print(\"Test set score: {:.2f}\".format(lr.score(X_test, y_test)))" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "$R^2$ 约为 0.66,这个结果不是很好,但我们可以看到,训练集和测试集上的分数非常接近。这说明可能存在欠拟合,而不是过拟合。对于这个一维数据集来说,过拟合的风险很小,因为模型非常简单(或受限)。然而,对于更高维的数据集(即有大量特征的数据集),线性模型将变得更加强大,过拟合的可能性也会变大。我们来看一下 LinearRegression 在更复杂的数据集上的表现,比如波士顿房价数据集。记住,这个数据集有 506 个样本和 105个导出特征。首先,加载数据集并将其分为训练集和测试集。然后像前面一样构建线性回归模型:" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": { 168 | "collapsed": true 169 | }, 170 | "outputs": [], 171 | "source": [ 172 | "X, y= mglearn.datasets.load_extended_boston()\n", 173 | "\n", 174 | "X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=0)\n", 175 | "lr=LinearRegression().fit(X_train, y_train)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 
| "source": [ 182 | "比较一下训练集和测试集的分数就可以发现,我们在训练集上的预测非常准确,但测试集上的 $R^2$ 要低很多:" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": { 189 | "collapsed": false 190 | }, 191 | "outputs": [], 192 | "source": [ 193 | "print(\"Training set score: {:.2f}\".format(lr.score(X_train, y_train)))\n", 194 | "print(\"Test set score: {:.2f}\".format(lr.score(X_test, y_test)))" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "训练集和测试集之间的性能差异是过拟合的明显标志,因此我们应该试图找到一个可以控制复杂度的模型。标准线性回归最常用的替代方法之一就是岭回归(ridge regression),下面来看一下。" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "metadata": {}, 207 | "source": [ 208 | "# 3. 岭回归\n", 209 | "岭回归也是一种用于回归的线性模型,因此它的预测公式与普通最小二乘法相同。但在岭回归中,对系数(w)的选择不仅要在训练数据上得到好的预测结果,而且还要拟合附加约束。我们还希望系数尽量小。换句话说,w 的所有元素都应接近于 0。直观上来看,这意味着每个特征对输出的影响应尽可能小(即斜率很小),同时仍给出很好的预测结果。这种约束是所谓正则化(regularization)的一个例子。正则化是指对模型做显式约束,以避免过拟合。岭回归用到的这种被称为 L2 正则化。\n", 210 | "\n", 211 | "岭回归在 linear_model.Ridge 中实现。来看一下它对扩展的波士顿房价数据集的效果如何" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "metadata": { 218 | "collapsed": false 219 | }, 220 | "outputs": [], 221 | "source": [ 222 | "from sklearn.linear_model import Ridge\n", 223 | "\n", 224 | "ridge = Ridge().fit(X_train, y_train)\n", 225 | "print(\"Training set score: {:.2f}\".format(ridge.score(X_train, y_train)))\n", 226 | "print(\"Test set score: {:.2f}\".format(ridge.score(X_test, y_test)))" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "可以看出, Ridge 在训练集上的分数要低于 LinearRegression ,但在测试集上的分数更高。这和我们的预期一致。线性回归对数据存在过拟合。 Ridge 是一种约束更强的模型,所以更不容易过拟合。复杂度更小的模型意味着在训练集上的性能更差,但泛化性能更好。由于我们只对泛化性能感兴趣,所以应该选择 Ridge 模型而不是 LinearRegression 模型。\n", 234 | "\n", 235 | "Ridge 模型在模型的简单性(系数都接近于 0)与训练集性能之间做出权衡。简单性和训练集性能二者对于模型的重要程度可以由用户通过设置 alpha 参数来指定。在前面的例子中,我们用的是默认参数 alpha=1.0 。但没有理由认为这会给出最佳权衡。 alpha 的最佳设定值取决于用到的具体数据集。增大 alpha 会使得系数更加趋向于 0,从而降低训练集性能,但可能会提高泛化性能。例如:" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": null, 241 | "metadata": { 242 | "collapsed": false 243 | }, 244 | "outputs": [], 245 | "source": [ 246 | "ridge10=Ridge(alpha=10).fit(X_train, y_train)\n", 247 | "print(\"Training set score: {:.2f}\".format(ridge10.score(X_train, y_train)))\n", 248 | "print(\"Test set score: {:.2f}\".format(ridge10.score(X_test, y_test)))" 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "metadata": {}, 254 | "source": [ 255 | "减小 alpha 可以让系数受到的限制更小。对于非常小的 alpha 值,系数几乎没有受到限制,我们得到一个与 LinearRegression 类似的模型:" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": null, 261 | "metadata": { 262 | "collapsed": false 263 | }, 264 | "outputs": [], 265 | "source": [ 266 | "ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)\n", 267 | "print(\"Training set score: {:.2f}\".format(ridge01.score(X_train, y_train)))\n", 268 | "print(\"Test set score: {:.2f}\".format(ridge01.score(X_test, y_test)))" 269 | ] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "metadata": {}, 274 | "source": [ 275 | "这里 alpha=0.1 似乎效果不错。我们可以尝试进一步减小 alpha 以提高泛化性能。第 5 章将会讨论选择参数的正确方法。\n", 276 | "\n", 277 | "我们还可以查看 alpha 取不同值时模型的 coef\\_ 属性,从而更加定性地理解 alpha 参数是如何改变模型的。更大的 alpha 表示约束更强的模型,所以我们预计大 alpha 对应的 coef\\_ 元素比小 alpha 对应的 coef\\_ 元素要小。这一点可以在图 2-12 中得到证实:" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": null, 283 | "metadata": { 284 | "collapsed": false 285 | }, 286 | "outputs": [], 287 | "source": [ 288 | "plt.plot(ridge.coef_, 's', 
label=\"Ridge alpha=1\")\n", 289 | "plt.plot(ridge10.coef_, '^', label=\"Ridge alpha=10\")\n", 290 | "plt.plot(ridge01.coef_, 'v', label=\"Ridge alpha=0.1\")\n", 291 | "plt.plot(lr.coef_, 'o', label=\"LinearRegression\")\n", 292 | "plt.xlabel(\"Coefficient index\")\n", 293 | "plt.ylabel(\"Coefficient magnitude\")\n", 294 | "plt.hlines(0, 0, len(lr.coef_))\n", 295 | "plt.ylim(-25, 25)\n", 296 | "plt.legend()" 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "
\n", 304 | "
图 2-12:不同 alpha 值的岭回归与线性回归的系数比较
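把上面对岭回归的文字描述写成公式会更直观。忽略实现细节,scikit-learn 的 Ridge 大致是在最小化下面这个目标,其中第一项是训练集上的平方误差,第二项是 L2 惩罚($w \cdot x_i$ 就是前面公式里的 $w[0]x[0]+\dots+w[p]x[p]$):

$$\min_{w,\,b}\ \sum_{i=1}^{n}\big(y_i - (w \cdot x_i + b)\big)^2 + \alpha \sum_{j=1}^{p} w_j^2$$

alpha 越大,第二项的权重越大,系数就被压得越接近 0,这正是图 2-12 中看到的现象。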
" 305 | ] 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "metadata": {}, 310 | "source": [ 311 | "这里 x 轴对应 coef_ 的元素: x=0 对应第一个特征的系数, x=1 对应第二个特征的系数,以此类推,一直到 x=100 。y 轴表示该系数的具体数值。这里需要记住的是,对于 alpha=10 ,系数大多在 -3 和 3 之间。对于 alpha=1 的 Ridge 模型,系数要稍大一点。对于 alpha=0.1 ,点的范围更大。对于没有做正则化的线性回归(即 alpha=0 ),点的范围很大,许多点都超出了图像的范围。 \n", 312 | "\n", 313 | "还有一种方法可以用来理解正则化的影响,就是固定 alpha 值,但改变训练数据量。对于图 2-13 来说,我们对波士顿房价数据集做二次抽样,并在数据量逐渐增加的子数据集上分别对 LinearRegression 和 Ridge(alpha=1) 两个模型进行评估(将模型性能作为数据集大小的函数进行绘图,这样的图像叫作学习曲线):" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "metadata": { 320 | "collapsed": false 321 | }, 322 | "outputs": [], 323 | "source": [ 324 | "mglearn.plots.plot_ridge_n_samples()" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": {}, 330 | "source": [ 331 | "
\n", 332 | "
图 2-13:岭回归和线性回归在波士顿房价数据集上的学习曲线
" 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "正如所预计的那样,无论是岭回归还是线性回归,所有数据集大小对应的训练分数都要高于测试分数。由于岭回归是正则化的,因此它的训练分数要整体低于线性回归的训练分数。但岭回归的测试分数要更高,特别是对较小的子数据集。如果少于 400 个数据点,线性回归学不到任何内容。随着模型可用的数据越来越多,两个模型的性能都在提升,最终线性回归的性能追上了岭回归。这里要记住的是,如果有足够多的训练数据,正则化变得不那么重要,并且岭回归和线性回归将具有相同的性能(在这个例子中,二者相同恰好发生在整个数据集的情况下,这只是一个巧合)。图 2-13 中还有一个有趣之处,就是线性回归的训练性能在下降。如果添加更多数据,模型将更加难以过拟合或记住所有的数据。" 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": {}, 345 | "source": [ 346 | "# 4. lasso\n", 347 | "除了 Ridge ,还有一种正则化的线性回归是 Lasso 。与岭回归相同,使用 lasso 也是约束系数使其接近于 0,但用到的方法不同,叫作 L1 正则化。L1 正则化的结果是,使用 lasso 时某些系数刚好为 0。这说明某些特征被模型完全忽略。这可以看作是一种自动化的特征选择。某些系数刚好为 0,这样模型更容易解释,也可以呈现模型最重要的特征。\n", 348 | "\n", 349 | "我们将 lasso 应用在扩展的波士顿房价数据集上:" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": null, 355 | "metadata": { 356 | "collapsed": false 357 | }, 358 | "outputs": [], 359 | "source": [ 360 | "from sklearn.linear_model import Lasso\n", 361 | "\n", 362 | "lasso = Lasso().fit(X_train, y_train)\n", 363 | "print(\"Training set score: {:.2f}\".format(lasso.score(X_train, y_train)))\n", 364 | "print(\"Test set score: {:.2f}\".format(lasso.score(X_test, y_test)))\n", 365 | "print(\"Number of features used: {}\".format(np.sum(lasso.coef_ != 0)))\n", 366 | "print(\"Number of all feature: {}\".format(lasso.coef_.shape[0]))" 367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "metadata": {}, 372 | "source": [ 373 | "如你所见, Lasso 在训练集与测试集上的表现都很差。这表示存在欠拟合,我们发现模型只用到了 105 个特征中的 4 个。与 Ridge 类似, Lasso 也有一个正则化参数 alpha ,可以控制系数趋向于 0 的强度。在上一个例子中,我们用的是默认值 alpha=1.0 。为了降低欠拟合,我们尝试减小 alpha 。这么做的同时,我们还需要增加 max_iter 的值(运行迭代的最大次数):" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": null, 379 | "metadata": { 380 | "collapsed": false 381 | }, 382 | "outputs": [], 383 | "source": [ 384 | "# 我们增大max_iter的值,否则模型会警告我们,说应该增大max_iter\n", 385 | "lasso001 = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)\n", 386 | "print(\"Training set score: {:.2f}\".format(lasso001.score(X_train, y_train)))\n", 387 | "print(\"Test set score: {:.2f}\".format(lasso001.score(X_test, y_test)))\n", 388 | "print(\"Number of features used: {}\".format(np.sum(lasso001.coef_ != 0)))" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": {}, 394 | "source": [ 395 | "alpha 值变小,我们可以拟合一个更复杂的模型,在训练集和测试集上的表现也更好。模型性能比使用 Ridge 时略好一点,而且我们只用到了 105 个特征中的 33 个。这样模型可能更容易理解。\n", 396 | "\n", 397 | "但如果把 alpha 设得太小,那么就会消除正则化的效果,并出现过拟合,得到与LinearRegression 类似的结果:" 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": null, 403 | "metadata": { 404 | "collapsed": false 405 | }, 406 | "outputs": [], 407 | "source": [ 408 | "lasso00001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)\n", 409 | "print(\"Training set score: {:.2f}\".format(lasso00001.score(X_train, y_train)))\n", 410 | "print(\"Test set score: {:.2f}\".format(lasso00001.score(X_test, y_test)))\n", 411 | "print(\"Number of features used: {}\".format(np.sum(lasso00001.coef_ != 0)))" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": {}, 417 | "source": [ 418 | "再次像图 2-12 那样对不同模型的系数进行作图,见图 2-14:" 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": null, 424 | "metadata": { 425 | "collapsed": false 426 | }, 427 | "outputs": [], 428 | "source": [ 429 | "plt.plot(lasso.coef_, 's', label=\"Lasso alpha=1\")\n", 430 | "plt.plot(lasso001.coef_, '^', label=\"Lasso alpha=0.01\")\n", 431 | "plt.plot(lasso00001.coef_, 'v', 
label=\"Lasso alpha=0.0001\")\n", 432 | "plt.plot(ridge01.coef_, 'o', label=\"Ridge alpha=0.1\")\n", 433 | "plt.legend(ncol=2, loc=(0, 1.05))\n", 434 | "plt.ylim(-25, 25)\n", 435 | "plt.xlabel(\"Coefficient index\")\n", 436 | "plt.ylabel(\"Coefficient magnitude\")" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "metadata": {}, 442 | "source": [ 443 | "
\n", 444 | "
图 2-14:不同 alpha 值的 lasso 回归与岭回归的系数比较
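与岭回归对照,lasso 用的 L1 惩罚是系数绝对值之和。按 scikit-learn 文档中 Lasso 的写法,目标函数大致是:

$$\min_{w,\,b}\ \frac{1}{2n} \sum_{i=1}^{n}\big(y_i - (w \cdot x_i + b)\big)^2 + \alpha \sum_{j=1}^{p} \lvert w_j \rvert$$

正是绝对值惩罚的形状,使得一部分系数会被压到恰好等于 0,对应正文所说的"自动化的特征选择"。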
" 445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "在 alpha=1 时,我们发现不仅大部分系数都是 0(我们已经知道这一点),而且其他系数也都很小。将 alpha 减小至 0.01 ,我们得到图中向上的三角形,大部分特征等于 0。alpha=0.0001 时,我们得到正则化很弱的模型,大部分系数都不为 0,并且还很大。为了便于比较,图中用圆形表示 Ridge 的最佳结果。 alpha=0.1 的 Ridge 模型的预测性能与alpha=0.01 的 Lasso 模型类似,但 Ridge 模型的所有系数都不为 0。\n", 452 | "\n", 453 | "在实践中,在两个模型中一般首选岭回归。但如果特征很多,你认为只有其中几个是重要的,那么选择 Lasso 可能更好。同样,如果你想要一个容易解释的模型, Lasso 可以给出更容易理解的模型,因为它只选择了一部分输入特征。 scikit-learn 还提供了 ElasticNet类,结合了 Lasso 和 Ridge 的惩罚项。在实践中,这种结合的效果最好,不过代价是要调节两个参数:一个用于 L1 正则化,一个用于 L2 正则化。" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": {}, 459 | "source": [ 460 | "# 5. 用于分类的线性模型\n", 461 | "线性模型也广泛应用于分类问题。我们首先来看二分类。这时可以利用下面的公式进行\n", 462 | "预测:\n", 463 | "\n", 464 | "$$ŷ = w[0] * x[0] + w[1] * x[1] + …+ w[p] * x[p] + b > 0$$\n", 465 | "\n", 466 | "这个公式看起来与线性回归的公式非常相似,但我们没有返回特征的加权求和,而是为预测设置了阈值(0)。如果函数值小于 0,我们就预测类别 -1;如果函数值大于 0,我们就预测类别 +1。对于所有用于分类的线性模型,这个预测规则都是通用的。同样,有很多种不同的方法来找出系数(w)和截距(b)。\n", 467 | "\n", 468 | "对于用于回归的线性模型,输出 ŷ 是特征的线性函数,是直线、平面或超平面(对于更高维的数据集)。对于用于分类的线性模型,**决策边界**是输入的线性函数。换句话说,(二元)线性分类器是利用直线、平面或超平面来分开两个类别的分类器。本节我们将看到这方面的例子。\n", 469 | "\n", 470 | "学习线性模型有很多种算法。这些算法的区别在于以下两点:\n", 471 | "- 系数和截距的特定组合对训练数据拟合好坏的度量方法;\n", 472 | "- 是否使用正则化,以及使用哪种正则化方法。\n", 473 | "\n", 474 | "不同的算法使用不同的方法来度量“对训练集拟合好坏”。由于数学上的技术原因,不可能调节 w 和 b 使得算法产生的误分类数量最少。对于我们的目的,以及对于许多应用而言,上面第一点(称为损失函数)的选择并不重要。\n", 475 | "\n", 476 | "最常见的两种线性分类算法是 **Logistic 回归**(logistic regression)和**线性支持向量机**(linear support vector machine,线性 SVM),前者在 linear_model.LogisticRegression 中实现,后者在 svm.LinearSVC (SVC 代表支持向量分类器)中实现。虽然 LogisticRegression的名字中含有回归(regression),但它是一种分类算法,并不是回归算法,不应与LinearRegression 混淆。\n", 477 | "\n", 478 | "我们可以将 LogisticRegression 和 LinearSVC 模型应用到 forge 数据集上,并将线性模型找到的决策边界可视化(图 2-15):" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": null, 484 | "metadata": { 485 | "collapsed": false 486 | }, 487 | "outputs": [], 488 | "source": [ 489 | "from sklearn.linear_model import LogisticRegression\n", 490 | "from sklearn.svm import LinearSVC\n", 491 | "\n", 492 | "X, y = mglearn.datasets.make_forge()\n", 493 | "\n", 494 | "figs, axes = plt.subplots(1,2, figsize=(10,3))\n", 495 | "\n", 496 | "for model, ax in zip([LinearSVC(), LogisticRegression()], axes):\n", 497 | " clf = model.fit(X,y)\n", 498 | " mglearn.plots.plot_2d_separator(clf, X, fill=False, eps=0.5, ax=ax, alpha=.7)\n", 499 | " mglearn.discrete_scatter(X[:,0], X[:,1],y, ax=ax)\n", 500 | " ax.set_title(\"{}\".format(clf.__class__.__name__))\n", 501 | " ax.set_xlabel(\"Feature 0\")\n", 502 | " ax.set_ylabel(\"Feature 1\")\n", 503 | "axes[0].legend() " 504 | ] 505 | }, 506 | { 507 | "cell_type": "markdown", 508 | "metadata": {}, 509 | "source": [ 510 | "
\n", 511 | "
图 2-15:线性 SVM 和 Logistic 回归在 forge 数据集上的决策边界(均为默认参数)
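可以用一小段代码核对"预测就是看 $w\cdot x + b$ 是否大于 0"。下面以默认参数的 LogisticRegression 在 forge 数据集上为例(LinearSVC 同理),这只是一个验证用的示意:

```python
import numpy as np
import mglearn
from sklearn.linear_model import LogisticRegression

X, y = mglearn.datasets.make_forge()
clf = LogisticRegression().fit(X, y)

# decision_function 返回的就是 w·x + b
scores = X.dot(clf.coef_[0]) + clf.intercept_[0]
print(np.allclose(scores, clf.decision_function(X)))        # 应输出 True

# 它的符号(是否大于 0)决定预测类别是 1 还是 0
print(np.all((scores > 0).astype(int) == clf.predict(X)))   # 应输出 True
```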
" 512 | ] 513 | }, 514 | { 515 | "cell_type": "markdown", 516 | "metadata": {}, 517 | "source": [ 518 | "在这张图中, forge 数据集的第一个特征位于 x 轴,第二个特征位于 y 轴,与前面相同。图中分别展示了 LinearSVC 和 LogisticRegression 得到的决策边界,都是直线,将顶部归为类别 1 的区域和底部归为类别 0 的区域分开了。换句话说,对于每个分类器而言,位于黑线上方的新数据点都会被划为类别 1,而在黑线下方的点都会被划为类别 0。\n", 519 | "\n", 520 | "两个模型得到了相似的决策边界。注意,两个模型中都有两个点的分类是错误的。两个模型都默认使用 L2 正则化,就像 Ridge 对回归所做的那样。\n", 521 | "\n", 522 | "对于 LogisticRegression 和 LinearSVC ,决定正则化强度的权衡参数叫作 C 。 C 值越大,对应的正则化越弱。换句话说,如果参数 C 值较大,那么 LogisticRegression 和LinearSVC 将尽可能将训练集拟合到最好,而如果 C 值较小,那么模型更强调使系数向量(w)接近于 0。参数 C 的作用还有另一个有趣之处。较小的 C 值可以让算法尽量适应“大多数”数据点,而较大的 C 值更强调每个数据点都分类正确的重要性。下面是使用 LinearSVC 的图示(图 2-16):" 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": null, 528 | "metadata": { 529 | "collapsed": false 530 | }, 531 | "outputs": [], 532 | "source": [ 533 | "mglearn.plots.plot_linear_svc_regularization()" 534 | ] 535 | }, 536 | { 537 | "cell_type": "markdown", 538 | "metadata": {}, 539 | "source": [ 540 | "
\n", 541 | "
图 2-16:不同 C 值的线性 SVM 在 forge 数据集上的决策边界
" 542 | ] 543 | }, 544 | { 545 | "cell_type": "markdown", 546 | "metadata": {}, 547 | "source": [ 548 | "在左侧的图中, C 值很小,对应强正则化。大部分属于类别 0 的点都位于底部,大部分属于类别 1 的点都位于顶部。强正则化的模型会选择一条相对水平的线,有两个点分类错误。在中间的图中, C 值稍大,模型更关注两个分类错误的样本,使决策边界的斜率变大。最后,在右侧的图中,模型的 C 值非常大,使得决策边界的斜率也很大,现在模型对类别 0 中所有点的分类都是正确的。类别 1 中仍有一个点分类错误,这是因为对这个数据集来说,不可能用一条直线将所有点都分类正确。右侧图中的模型尽量使所有点的分类都正确,但可能无法掌握类别的整体分布。换句话说,这个模型很可能过拟合。\n", 549 | "\n", 550 | "与回归的情况类似,用于分类的线性模型在低维空间中看起来可能非常受限,决策边界只能是直线或平面。同样,在高维空间中,用于分类的线性模型变得非常强大,当考虑更多特征时,避免过拟合变得越来越重要。\n", 551 | "\n", 552 | "我们在乳腺癌数据集上详细分析 LogisticRegression :" 553 | ] 554 | }, 555 | { 556 | "cell_type": "code", 557 | "execution_count": null, 558 | "metadata": { 559 | "collapsed": false 560 | }, 561 | "outputs": [], 562 | "source": [ 563 | "from sklearn.datasets import load_breast_cancer\n", 564 | "cancer = load_breast_cancer()\n", 565 | "X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)\n", 566 | "logreg = LogisticRegression().fit(X_train, y_train)\n", 567 | "print(\"Training set score: {:.3f}\".format(logreg.score(X_train, y_train)))\n", 568 | "print(\"Test set score: {:.3f}\".format(logreg.score(X_test, y_test)))" 569 | ] 570 | }, 571 | { 572 | "cell_type": "markdown", 573 | "metadata": {}, 574 | "source": [ 575 | "C=1 的默认值给出了相当好的性能,在训练集和测试集上都达到 95% 的精度。但由于训练集和测试集的性能非常接近,所以模型很可能是欠拟合的。我们尝试增大 C 来拟合一个更灵活的模型:" 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": null, 581 | "metadata": { 582 | "collapsed": false 583 | }, 584 | "outputs": [], 585 | "source": [ 586 | "logreg100 = LogisticRegression(C=100).fit(X_train, y_train)\n", 587 | "print(\"Training set score: {:.3f}\".format(logreg100.score(X_train, y_train)))\n", 588 | "print(\"Test set score: {:.3f}\".format(logreg100.score(X_test, y_test)))" 589 | ] 590 | }, 591 | { 592 | "cell_type": "markdown", 593 | "metadata": {}, 594 | "source": [ 595 | "使用 C=100 可以得到更高的训练集精度,也得到了稍高的测试集精度,这也证实了我们的直觉,即更复杂的模型应该性能更好。\n", 596 | "我们还可以研究使用正则化更强的模型时会发生什么。设置 C=0.01 :" 597 | ] 598 | }, 599 | { 600 | "cell_type": "code", 601 | "execution_count": null, 602 | "metadata": { 603 | "collapsed": false 604 | }, 605 | "outputs": [], 606 | "source": [ 607 | "logreg001 = LogisticRegression(C=0.01).fit(X_train, y_train)\n", 608 | "print(\"Training set score: {:.3f}\".format(logreg001.score(X_train, y_train)))\n", 609 | "print(\"Test set score: {:.3f}\".format(logreg001.score(X_test, y_test)))" 610 | ] 611 | }, 612 | { 613 | "cell_type": "markdown", 614 | "metadata": {}, 615 | "source": [ 616 | "正如我们所料,在图 2-1 中将已经欠拟合的模型继续向左移动,训练集和测试集的精度都比采用默认参数时更小。\n", 617 | "\n", 618 | "最后,来看一下正则化参数 C 取三个不同的值时模型学到的系数(图 2-17):" 619 | ] 620 | }, 621 | { 622 | "cell_type": "code", 623 | "execution_count": null, 624 | "metadata": { 625 | "collapsed": false 626 | }, 627 | "outputs": [], 628 | "source": [ 629 | "plt.plot(logreg.coef_.T, 'o', label=\"C=1\")\n", 630 | "plt.plot(logreg100.coef_.T, '^', label=\"C=100\")\n", 631 | "plt.plot(logreg001.coef_.T, 'v', label=\"C=0.001\")\n", 632 | "plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)\n", 633 | "plt.hlines(0, 0, cancer.data.shape[1])\n", 634 | "plt.ylim(-5, 5)\n", 635 | "plt.xlabel(\"Coefficient index\")\n", 636 | "plt.ylabel(\"Coefficient magnitude\")\n", 637 | "plt.legend()" 638 | ] 639 | }, 640 | { 641 | "cell_type": "markdown", 642 | "metadata": {}, 643 | "source": [ 644 | "
\n", 645 | "
图 2-17:不同 C 值的 Logistic 回归在乳腺癌数据集上学到的系数
" 646 | ] 647 | }, 648 | { 649 | "cell_type": "markdown", 650 | "metadata": {}, 651 | "source": [ 652 | "由于 LogisticRegression 默认应用 L2 正则化,所以其结果与图 2-12 中 Ridge 的结果类似。更强的正则化使得系数更趋向于 0,但系数永远不会正好等于 0。进一步观察图像,还可以在第 3 个系数那里发现有趣之处,这个系数是“平均周长”(mean perimeter)。C=100 和 C=1 时,这个系数为负,而C=0.001 时这个系数为正,其绝对值比 C=1 时还要大。在解释这样的模型时,人们可能会认为,系数可以告诉我们某个特征与哪个类别有关。例如,人们可能会认为高“纹理错误”(texture error)特征与“恶性”样本有关。但“平均周长”系数的正负号发生变化,说明较大的“平均周长”可以被当作“良性”的指标或“恶性”的指标,具体取决于我们考虑的是哪个模型。这也说明,对线性模型系数的解释应该始终持保留态度。" 653 | ] 654 | }, 655 | { 656 | "cell_type": "markdown", 657 | "metadata": {}, 658 | "source": [ 659 | "如果想要一个可解释性更强的模型,使用 L1 正则化可能更好,因为它约束模型只使用少\n", 660 | "数几个特征。下面是使用 L1 正则化的系数图像和分类精度(图 2-18)。" 661 | ] 662 | }, 663 | { 664 | "cell_type": "markdown", 665 | "metadata": {}, 666 | "source": [ 667 | "
\n", 668 | "
图 2-18:对于不同的 C 值,L1 惩罚的 Logistic 回归在乳腺癌数据集上学到的系数
" 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": null, 674 | "metadata": { 675 | "collapsed": false 676 | }, 677 | "outputs": [], 678 | "source": [ 679 | "for C, marker in zip([0.001, 1, 100], ['o', '^', 'v']):\n", 680 | " lr_l1 = LogisticRegression(C=C, penalty=\"l1\").fit(X_train, y_train)\n", 681 | " print(\"Training accuracy of l1 logreg with C={:.3f}: {:.2f}\".format(\n", 682 | " C, lr_l1.score(X_train, y_train)))\n", 683 | " print(\"Test accuracy of l1 logreg with C={:.3f}: {:.2f}\".format(\n", 684 | " C, lr_l1.score(X_test, y_test)))\n", 685 | " plt.plot(lr_l1.coef_.T, marker, label=\"C={:.3f}\".format(C))\n", 686 | "plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)\n", 687 | "plt.hlines(0, 0, cancer.data.shape[1])\n", 688 | "plt.xlabel(\"Coefficient index\")\n", 689 | "plt.ylabel(\"Coefficient magnitude\")\n", 690 | "plt.ylim(-5, 5)\n", 691 | "plt.legend(loc=3)" 692 | ] 693 | }, 694 | { 695 | "cell_type": "markdown", 696 | "metadata": {}, 697 | "source": [ 698 | "如你所见,用于二分类的线性模型与用于回归的线性模型有许多相似之处。与用于回归的线性模型一样,模型的主要差别在于 penalty 参数,这个参数会影响正则化,也会影响模型是使用所有可用特征还是只选择特征的一个子集。" 699 | ] 700 | }, 701 | { 702 | "cell_type": "markdown", 703 | "metadata": {}, 704 | "source": [ 705 | "# 6. 用于多分类的线性模型\n", 706 | "许多线性分类模型只适用于二分类问题,不能轻易推广到多类别问题(除了 Logistic 回归)。将二分类算法推广到多分类算法的一种常见方法是“一对其余”(one-vs.-rest)方法。在“一对其余”方法中,对每个类别都学习一个二分类模型,将这个类别与所有其他类别尽量分开,这样就生成了与类别个数一样多的二分类模型。在测试点上运行所有二类分类器来进行预测。在对应类别上分数最高的分类器“胜出”,将这个类别标签返回作为预测结果。\n", 707 | "\n", 708 | "每个类别都对应一个二类分类器,这样每个类别也都有一个系数(w)向量和一个截距(b)。下面给出的是分类置信方程,其结果中最大值对应的类别即为预测的类别标签:\n", 709 | "\n", 710 | "$$ w[0] * x[0] + w[1] * x[1] + … + w[p] * x[p] + b$$\n", 711 | "\n", 712 | "多分类 Logistic 回归背后的数学与“一对其余”方法稍有不同,但它也是对每个类别都有一个系数向量和一个截距,也使用了相同的预测方法。\n", 713 | "\n", 714 | "我们将“一对其余”方法应用在一个简单的三分类数据集上。我们用到了一个二维数据集,每个类别的数据都是从一个高斯分布中采样得出的(见图 2-19):" 715 | ] 716 | }, 717 | { 718 | "cell_type": "code", 719 | "execution_count": null, 720 | "metadata": { 721 | "collapsed": false 722 | }, 723 | "outputs": [], 724 | "source": [ 725 | "from sklearn.datasets import make_blobs\n", 726 | "\n", 727 | "X, y=make_blobs(random_state=42)\n", 728 | "mglearn.discrete_scatter(X[:,0],X[:,1],y)\n", 729 | "plt.xlabel(\"Feature 0\")\n", 730 | "plt.ylabel(\"Feature 1\")\n", 731 | "plt.legend([\"Class 0\", \"Class 1\", \"Class 2\"])" 732 | ] 733 | }, 734 | { 735 | "cell_type": "markdown", 736 | "metadata": {}, 737 | "source": [ 738 | "
\n", 739 | "
图 2-19:包含 3 个类别的二维玩具数据集
" 740 | ] 741 | }, 742 | { 743 | "cell_type": "markdown", 744 | "metadata": {}, 745 | "source": [ 746 | "现在,在这个数据集上训练一个 LinearSVC 分类器:" 747 | ] 748 | }, 749 | { 750 | "cell_type": "code", 751 | "execution_count": null, 752 | "metadata": { 753 | "collapsed": false 754 | }, 755 | "outputs": [], 756 | "source": [ 757 | "linear_svm = LinearSVC().fit(X, y)\n", 758 | "print(\"Coefficient shape: \", linear_svm.coef_.shape)\n", 759 | "print(\"Intercept shape: \", linear_svm.intercept_.shape)" 760 | ] 761 | }, 762 | { 763 | "cell_type": "markdown", 764 | "metadata": {}, 765 | "source": [ 766 | "我们看到, coef\\_ 的形状是 (3, 2) ,说明 coef\\_ 每行包含三个类别之一的系数向量,每列包含某个特征(这个数据集有 2 个特征)对应的系数值。现在 intercept_ 是一维数组,保存每个类别的截距。\n", 767 | "\n", 768 | "我们将这 3 个二类分类器给出的直线可视化(图 2-20):" 769 | ] 770 | }, 771 | { 772 | "cell_type": "code", 773 | "execution_count": null, 774 | "metadata": { 775 | "collapsed": false 776 | }, 777 | "outputs": [], 778 | "source": [ 779 | "mglearn.discrete_scatter(X[:, 0], X[:, 1], y)\n", 780 | "line = np.linspace(-15, 15)\n", 781 | "for coef, intercept, color in zip(linear_svm.coef_, linear_svm.intercept_, ['b','r','g']):\n", 782 | " plt.plot(line, -(line*coef[0]+intercept)/coef[1], c=color)\n", 783 | "plt.ylim(-10,15)\n", 784 | "plt.xlim(-10, 8)\n", 785 | "plt.xlabel(\"Feature 0\")\n", 786 | "plt.ylabel(\"Feature 1\")\n", 787 | "plt.legend(['Class 0', 'Class 1', 'Class 2', 'Line class 0', 'Line class 1',\n", 788 | "'Line class 2'], loc=(1.01, 0.3))" 789 | ] 790 | }, 791 | { 792 | "cell_type": "markdown", 793 | "metadata": {}, 794 | "source": [ 795 | "
\n", 796 | "
图 2-20:三个“一对其余”分类器学到的决策边界
" 797 | ] 798 | }, 799 | { 800 | "cell_type": "markdown", 801 | "metadata": {}, 802 | "source": [ 803 | "你可以看到,训练集中所有属于类别 0 的点都在与类别 0 对应的直线上方,这说明它们位于这个二类分类器属于“类别 0”的那一侧。属于类别 0 的点位于与类别 2 对应的直线上方,这说明它们被类别 2 的二类分类器划为“其余”。属于类别 0 的点位于与类别 1 对应的直线左侧,这说明类别 1 的二元分类器将它们划为“其余”。因此,这一区域的所有点都会被最终分类器划为类别 0(类别 0 的分类器的分类置信方程的结果大于 0,其他两个类别对应的结果都小于 0)。\n", 804 | "\n", 805 | "但图像中间的三角形区域属于哪一个类别呢,3 个二类分类器都将这一区域内的点划为“其余”。这里的点应该划归到哪一个类别呢?答案是分类方程结果最大的那个类别,即最接近的那条线对应的类别。下面的例子(图 2-21)给出了二维空间中所有区域的预测结果:" 806 | ] 807 | }, 808 | { 809 | "cell_type": "code", 810 | "execution_count": null, 811 | "metadata": { 812 | "collapsed": false 813 | }, 814 | "outputs": [], 815 | "source": [ 816 | "mglearn.plots.plot_2d_classification(linear_svm, X, fill=True, alpha=.7)\n", 817 | "mglearn.discrete_scatter(X[:, 0], X[:, 1], y)\n", 818 | "line = np.linspace(-15, 15)\n", 819 | "for coef, intercept, color in zip(linear_svm.coef_, linear_svm.intercept_,\n", 820 | "['b', 'r', 'g']):\n", 821 | " plt.plot(line, -(line * coef[0] + intercept) / coef[1], c=color)\n", 822 | "plt.legend(['Class 0', 'Class 1', 'Class 2', 'Line class 0', 'Line class 1',\n", 823 | "'Line class 2'], loc=(1.01, 0.3))\n", 824 | "plt.xlabel(\"Feature 0\")\n", 825 | "plt.ylabel(\"Feature 1\")" 826 | ] 827 | }, 828 | { 829 | "cell_type": "markdown", 830 | "metadata": {}, 831 | "source": [ 832 | "
\n", 833 | "
图 2-21:三个“一对其余”分类器得到的多分类决策边界
" 834 | ] 835 | }, 836 | { 837 | "cell_type": "markdown", 838 | "metadata": {}, 839 | "source": [ 840 | "# 7. 优点、缺点和参数\n", 841 | "线性模型的主要参数是正则化参数,在回归模型中叫作 alpha ,在 LinearSVC 和 Logistic-Regression 中叫作 C 。 alpha 值较大或 C 值较小,说明模型比较简单。特别是对于回归模型而言,调节这些参数非常重要。通常在对数尺度上对 C 和 alpha 进行搜索。你还需要确定的是用 L1 正则化还是 L2 正则化。如果你假定只有几个特征是真正重要的,那么你应该用L1 正则化,否则应默认使用 L2 正则化。如果模型的可解释性很重要的话,使用 L1 也会有帮助。由于 L1 只用到几个特征,所以更容易解释哪些特征对模型是重要的,以及这些特征的作用。\n", 842 | "\n", 843 | "**线性模型的训练速度非常快,预测速度也很快。这种模型可以推广到非常大的数据集,对稀疏数据也很有效**。如果你的数据包含数十万甚至上百万个样本,你可能需要研究如何使用 LogisticRegression 和 Ridge 模型的 solver='sag' 选项,在处理大型数据时,这一选项比默认值要更快。其他选项还有 SGDClassifier 类和 SGDRegressor 类,它们对本节介绍的线性模型实现了可扩展性更强的版本。\n", 844 | "\n", 845 | "线性模型的另一个优点在于,利用我们之间见过的用于回归和分类的公式,理解如何进行预测是相对比较容易的。不幸的是,往往并不完全清楚系数为什么是这样的。如果你的数据集中包含高度相关的特征,这一问题尤为突出。在这种情况下,可能很难对系数做出解释。\n", 846 | "\n", 847 | "**如果特征数量大于样本数量,线性模型的表现通常都很好**。它也常用于非常大的数据集,只是因为训练其他模型并不可行。但在更低维的空间中,其他模型的泛化性能可能更好。2.3.7 节会介绍几个线性模型不适用的例子。" 848 | ] 849 | }, 850 | { 851 | "cell_type": "code", 852 | "execution_count": null, 853 | "metadata": { 854 | "collapsed": true 855 | }, 856 | "outputs": [], 857 | "source": [] 858 | } 859 | ], 860 | "metadata": { 861 | "anaconda-cloud": {}, 862 | "kernelspec": { 863 | "display_name": "Python [conda env:Anaconda3]", 864 | "language": "python", 865 | "name": "conda-env-Anaconda3-py" 866 | }, 867 | "language_info": { 868 | "codemirror_mode": { 869 | "name": "ipython", 870 | "version": 3 871 | }, 872 | "file_extension": ".py", 873 | "mimetype": "text/x-python", 874 | "name": "python", 875 | "nbconvert_exporter": "python", 876 | "pygments_lexer": "ipython3", 877 | "version": "3.5.6" 878 | } 879 | }, 880 | "nbformat": 4, 881 | "nbformat_minor": 1 882 | } 883 | --------------------------------------------------------------------------------