├── .gitignore
├── 1. Data Preprocessing.ipynb
├── 2. Feature Selection.ipynb
├── 3. Dimension Reduction.ipynb
├── GA.py
├── README.md
├── SA.py
├── images
│   ├── Embedded_Pipeline.png
│   ├── Filter_Pipeline.png
│   ├── GA_Pseudo_Code.png
│   ├── SA_Pseudo_Code.png
│   ├── Wrapper_Pipeline.png
│   ├── 基于基因算法特征选择.png
│   ├── 基于模拟退火特征选择.png
│   ├── 封装法工作流.png
│   ├── 嵌入法工作流.png
│   └── 过滤法工作流.png
├── 中文版.md
└── 中文版
    ├── 1. 数据预处理.ipynb
    ├── 2. 特征选择.ipynb
    └── 3. 特征降维.ipynb
/.gitignore:
--------------------------------------------------------------------------------
1 |
2 | .DS_Store
3 | .ipynb_checkpoints
4 |
--------------------------------------------------------------------------------
/3. Dimension Reduction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "**Feature Engineering Notebook Three: Dimensionality Reduction** \n",
8 | "*Author: Yingxiang Chen, Zihan Yang*"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "**Reference**\n",
16 | "- https://scikit-learn.org/stable/modules/decomposition.html#principal-component-analysis-pca\n",
17 | "- https://sebastianraschka.com/faq/docs/lda-vs-pca.html\n",
18 | "- https://en.wikipedia.org/wiki/Linear_discriminant_analysis"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {
24 | "toc": true
25 | },
26 | "source": [
27 | "Table of Contents\n",
28 | ""
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "# Dimension Reduction"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "After data preprocessing and feature selection, we have a good feature subset. Sometimes, however, this subset still contains too many features and is expensive to train on. In this case, we can use dimension reduction techniques to compress the feature subset further, although this might degrade model performance.\n",
43 | "\n",
44 | "We can also apply dimension reduction directly after data preprocessing if we don't have much time to spend on feature selection. The dimension reduction algorithm compresses the original feature space into a smaller set of new features.\n",
45 | "\n",
46 | "Specifically, we will introduce PCA and LDA (Linear Discriminant Analysis)."
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "## Unsupervised Methods"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {},
59 | "source": [
60 | "### PCA (Principal Components Analysis)"
61 | ]
62 | },
63 | {
64 | "cell_type": "markdown",
65 | "metadata": {},
66 | "source": [
67 | "PCA is an **unsupervised** technique that finds the directions of maximal variance. It represents the original features with a few uncorrelated components while retaining as much information (variance) as possible. More mathematical detail is available in a [repo](https://github.com/YC-Coder-Chen/Unsupervised-Notes/blob/master/PCA.md) we wrote on GitHub."
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "execution_count": 1,
73 | "metadata": {
74 | "ExecuteTime": {
75 | "end_time": "2020-03-30T03:37:18.226607Z",
76 | "start_time": "2020-03-30T03:37:15.126399Z"
77 | }
78 | },
79 | "outputs": [],
80 | "source": [
81 | "import numpy as np\n",
82 | "import pandas as pd\n",
83 | "from sklearn.decomposition import PCA\n",
84 | "\n",
85 | "# load dataset\n",
86 | "from sklearn.datasets import fetch_california_housing\n",
87 | "dataset = fetch_california_housing()\n",
88 | "X, y = dataset.data, dataset.target # use california_housing dataset as example\n",
89 | "\n",
90 | "# use the first 20000 observations as train_set\n",
91 | "# and the remaining observations as test_set\n",
92 | "train_set = X[0:20000,]\n",
93 | "test_set = X[20000:,]\n",
94 | "train_y = y[0:20000]\n",
95 | "\n",
96 | "# we need to standardize the data first; otherwise PCA will be dominated by\n",
97 | "# the features with the largest scale\n",
98 | "from sklearn.preprocessing import StandardScaler\n",
99 | "model = StandardScaler()\n",
100 | "model.fit(train_set) \n",
101 | "standardized_train = model.transform(train_set)\n",
102 | "standardized_test = model.transform(test_set)\n",
103 | "\n",
104 | "# start compressing\n",
105 | "compressor = PCA(n_components=0.9) # set n_components=0.9 =>\n",
106 | "# select the number of components such that the amount of variance\n",
107 | "# explained is greater than 90% of the original variance\n",
108 | "# we can also set n_components to be the number of features we want directly\n",
109 | "\n",
110 | "compressor.fit(standardized_train) # fit on trainset\n",
111 | "transformed_trainset = compressor.transform(standardized_train) # transform trainset (20000,5)\n",
112 | "transformed_testset = compressor.transform(standardized_test) # transform test set\n",
113 | "\n",
114 | "assert transformed_trainset.shape[1] == transformed_testset.shape[1] # same number of features"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": 2,
120 | "metadata": {
121 | "ExecuteTime": {
122 | "end_time": "2020-03-30T03:37:18.780647Z",
123 | "start_time": "2020-03-30T03:37:18.242698Z"
124 | }
125 | },
126 | "outputs": [
127 | {
128 | "data": {
129 | "image/png": "\n",
130 | "text/plain": [
131 | ""
132 | ]
133 | },
134 | "metadata": {
135 | "needs_background": "light"
136 | },
137 | "output_type": "display_data"
138 | }
139 | ],
140 | "source": [
141 | "# visualize the relationship between cumulative variance explained and number of components\n",
142 | "%matplotlib inline\n",
143 | "import matplotlib.pyplot as plt\n",
144 | "plt.plot(np.array(range(len(compressor.explained_variance_ratio_))) + 1, \n",
145 | " np.cumsum(compressor.explained_variance_ratio_))\n",
146 | "plt.xlabel('number of components')\n",
147 | "plt.ylabel('cumulative explained variance')\n",
148 | "plt.show(); # the top 5 components already explain 90% of the original variance"
149 | ]
150 | },
151 | {
152 | "cell_type": "markdown",
153 | "metadata": {},
154 | "source": [
155 | "## Supervised Methods"
156 | ]
157 | },
158 | {
159 | "cell_type": "markdown",
160 | "metadata": {},
161 | "source": [
162 | "### LDA (Linear Discriminant Analysis)"
163 | ]
164 | },
165 | {
166 | "cell_type": "markdown",
167 | "metadata": {},
168 | "source": [
169 | "In contrast to PCA, LDA is a **supervised** technique that looks for a projection that maximizes linear class separability: projected observations with the same class label should be as close together as possible, while the centers of different classes should be as far apart as possible. LDA can only be applied to classification problems, and it assumes that the classes are normally distributed and share the same covariance matrix. \n",
170 | " \n",
171 | "Mathematical details are available on the [official website](https://scikit-learn.org/stable/modules/lda_qda.html#lda-qda) of sklearn. Traditionally, LDA reduces the dimension to (K-1), where K is the number of classes. sklearn also allows further dimension reduction by incorporating a form of PCA into LDA."
172 | ]
173 | },
174 | {
175 | "cell_type": "code",
176 | "execution_count": 3,
177 | "metadata": {
178 | "ExecuteTime": {
179 | "end_time": "2020-03-30T03:37:18.844214Z",
180 | "start_time": "2020-03-30T03:37:18.787150Z"
181 | }
182 | },
183 | "outputs": [],
184 | "source": [
185 | "import numpy as np\n",
186 | "import pandas as pd\n",
187 | "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA\n",
188 | "\n",
189 | "# classification example\n",
190 | "# use iris dataset\n",
191 | "from sklearn.datasets import load_iris\n",
192 | "iris = load_iris()\n",
193 | "X, y = iris.data, iris.target\n",
194 | "\n",
195 | "# randomly shuffle the dataset\n",
196 | "# use the first 100 observations as train_set\n",
197 | "# and the remaining 50 observations as test_set\n",
198 | "np.random.seed(1234)\n",
199 | "idx = np.random.permutation(len(X))\n",
200 | "X = X[idx]\n",
201 | "y = y[idx]\n",
202 | "\n",
203 | "train_set = X[0:100,:]\n",
204 | "test_set = X[100:,]\n",
205 | "train_y = y[0:100]\n",
206 | "test_y = y[100:,]\n",
207 | "\n",
208 | "# standardize the features so they share a common scale (this does not make them normal, which LDA assumes)\n",
209 | "from sklearn.preprocessing import StandardScaler\n",
210 | "model = StandardScaler()\n",
211 | "model.fit(train_set) \n",
212 | "standardized_train = model.transform(train_set)\n",
213 | "standardized_test = model.transform(test_set)\n",
214 | "\n",
215 | "# start compressing\n",
216 | "compressor = LDA(n_components=2) # set n_components=2\n",
217 | "# n_components <= min(n_classes - 1, n_features)\n",
218 | "\n",
219 | "compressor.fit(standardized_train, train_y) # fit on trainset\n",
220 | "transformed_trainset = compressor.transform(standardized_train) # transform trainset, (100,2)\n",
221 | "transformed_testset = compressor.transform(standardized_test) # transform test set\n",
222 | "assert transformed_trainset.shape[1] == transformed_testset.shape[1] # same number of features"
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": 4,
228 | "metadata": {
229 | "ExecuteTime": {
230 | "end_time": "2020-03-30T03:37:19.129206Z",
231 | "start_time": "2020-03-30T03:37:18.847470Z"
232 | }
233 | },
234 | "outputs": [
235 | {
236 | "data": {
237 | "image/png": "\n",
238 | "text/plain": [
239 | ""
240 | ]
241 | },
242 | "metadata": {
243 | "needs_background": "light"
244 | },
245 | "output_type": "display_data"
246 | }
247 | ],
248 | "source": [
249 | "# visualize the relationship between cumulative variance explained and number of components\n",
250 | "%matplotlib inline\n",
251 | "import matplotlib.pyplot as plt\n",
252 | "plt.plot(np.array(range(len(compressor.explained_variance_ratio_))) + 1, \n",
253 | " np.cumsum(compressor.explained_variance_ratio_))\n",
254 | "plt.xlabel('number of components')\n",
255 | "plt.ylabel('cumulative explained variance')\n",
256 | "plt.show(); # LDA compresses the original 4 variables into 2 discriminant components\n",
257 | "# These 2 components explain 100% of the between-class variance"
258 | ]
259 | },
260 | {
261 | "cell_type": "code",
262 | "execution_count": null,
263 | "metadata": {},
264 | "outputs": [],
265 | "source": []
266 | }
267 | ],
268 | "metadata": {
269 | "kernelspec": {
270 | "display_name": "Python 3",
271 | "language": "python",
272 | "name": "python3"
273 | },
274 | "language_info": {
275 | "codemirror_mode": {
276 | "name": "ipython",
277 | "version": 3
278 | },
279 | "file_extension": ".py",
280 | "mimetype": "text/x-python",
281 | "name": "python",
282 | "nbconvert_exporter": "python",
283 | "pygments_lexer": "ipython3",
284 | "version": "3.6.8"
285 | },
286 | "toc": {
287 | "base_numbering": 1,
288 | "nav_menu": {},
289 | "number_sections": true,
290 | "sideBar": false,
291 | "skip_h1_title": false,
292 | "title_cell": "Table of Contents",
293 | "title_sidebar": "Contents",
294 | "toc_cell": true,
295 | "toc_position": {
296 | "height": "690.488px",
297 | "left": "28.9922px",
298 | "top": "134px",
299 | "width": "319.961px"
300 | },
301 | "toc_section_display": true,
302 | "toc_window_display": true
303 | }
304 | },
305 | "nbformat": 4,
306 | "nbformat_minor": 2
307 | }
308 |
--------------------------------------------------------------------------------
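The notebook above passes `n_components=0.9` to `PCA` so that enough components are kept to explain more than 90% of the variance. The selection rule itself can be sketched in plain Python as follows. This is a minimal illustration, not sklearn's implementation: the helper name `components_for_variance` is hypothetical, and the ratios are made-up values rather than results from the California housing data.

```python
from itertools import accumulate


def components_for_variance(ratios, threshold):
    """Return the smallest k whose cumulative explained-variance ratio exceeds threshold.

    `ratios` is assumed to be sorted in decreasing order, like
    `explained_variance_ratio_` of a fitted PCA.
    """
    for k, total in enumerate(accumulate(ratios), start=1):
        if total > threshold:
            return k
    return len(ratios)  # threshold never exceeded: keep every component


# Illustrative ratios only (an assumption, not taken from the notebook's output).
example_ratios = [0.40, 0.25, 0.15, 0.12, 0.05, 0.03]
k = components_for_variance(example_ratios, 0.9)
print(k)
```

With a fitted `PCA(n_components=None)`, the same helper could be applied to `explained_variance_ratio_` to compare several thresholds without refitting.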
/GA.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 | """
4 | Created on Mon Jan 27 11:18:30 2020
5 |
6 | @author: chenyingxiang
7 | """
8 |
9 | import random
10 | import numpy as np
11 | from tqdm import tqdm
13 | from sklearn.model_selection import KFold
14 | from deap import base, creator, tools, algorithms
15 |
16 | random.seed()
17 | np.random.seed()
18 |
19 | import warnings
20 | warnings.filterwarnings("ignore", category=RuntimeWarning)
21 |
22 | class Genetic_Algorithm(object):
23 | """
24 | Genetic Algorithm for feature selection
25 |
26 | Parameters
27 | ----------
28 | n_pop: int, default = 20
29 | The population size
30 |
31 | n_gen: int, default = 20
32 | The number of generations
33 |
34 | both: boolean, default = True
35 | Whether offsprings can result from both crossover and mutation
36 | If False, offsprings can result from one of them.
37 |
38 | n_children: int, default = None
39 | The number of children to produce when offsprings can only result from one of the operations
40 | including crossover, mutation and reproduction
41 | Default None will set n_children = n_pop
42 | n_children corresponds with the lambda_ parameter in deap.algorithms.varOr
43 |
44 | cxpb: float, default = 0.5
45 | The probability of mating two individuals
46 | The sum of cxpb and mutpb shall be in [0,1]
47 |
48 | mutpb: float, default = 0.2
49 | The probability of mutating an individual
50 | The sum of cxpb and mutpb shall be in [0,1]
51 |
52 | cx_indpb: float, default = 0.25
53 | The independent probability for each attribute to be exchanged under uniform crossover.
54 |
55 | mu_indpb: float, default = 0.25
56 | The independent probability for each attribute to be flipped under mutFlipBit.
57 |
58 | algorithm: string, default="one-max"
59 | The offspring selection algorithm
60 | "NSGA2" is also available
61 |
62 | loss_func: object
63 | The loss function of the ML task.
64 | loss_func(y_true, y_pred) should return the loss.
65 |
66 | estimator: object
67 | A supervised learning estimator
68 | It has to have the `fit` and `predict` method (or `predict_proba` method for classification)
69 |
70 | predict_type: string, default="predict"
71 | Final prediction type.
72 | - For some classification loss functions, probability output is required.
73 | Should set predict_type to "predict_proba"
74 |
75 | Attributes
76 | ----------
77 | best_sol: list of bool
78 | Boolean mask marking the best subset of features.
79 |
80 | best_loss: float
81 | The loss associated with the best_sol
82 |
83 | loss_dict: dictionary
84 | Store the evaluation results to speed up fitting process
85 |
86 | References
87 | ----------
88 | 1. https://deap.readthedocs.io/en/master/index.html
89 | 2. https://github.com/kaushalshetty/FeatureSelectionGA
90 | 3. Haupt, R. L. (1995). An introduction to genetic algorithms for electromagnetics.
91 | IEEE Antennas and Propagation Magazine, 37(2), 7-15.
92 | 4. Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. A. M. T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II.
93 | IEEE transactions on evolutionary computation, 6(2), 182-197.
94 | 5. Mkaouer, W., Kessentini, M., Shaout, A., Koligheu, P., Bechikh, S., Deb, K., & Ouni, A. (2015). Many-objective software remodularization using NSGA-III.
95 | ACM Transactions on Software Engineering and Methodology (TOSEM), 24(3), 1-45.
96 | 6. Fortin, F. A., Rainville, F. M. D., Gardner, M. A., Parizeau, M., & Gagné, C. (2012). DEAP: Evolutionary algorithms made easy.
97 | Journal of Machine Learning Research, 13(Jul), 2171-2175.
98 |
99 | """
100 |
101 | def __init__(self, loss_func, estimator, n_pop = 20, n_gen = 20, both = True, n_children = None,
102 | cxpb = 0.5, mutpb = 0.2, cx_indpb = 0.25, mu_indpb = 0.25,
103 | algorithm = "one-max", predict_type = 'predict'):
104 |
105 | #### check type
106 | if not hasattr(estimator, 'fit'):
107 | raise ValueError("Estimator doesn't have a fit method")
108 | if not hasattr(estimator, 'predict') and not hasattr(estimator, 'predict_proba'):
109 | raise ValueError("Estimator doesn't have a predict or predict_proba method")
110 |
111 | for instant in [cxpb, mutpb, cx_indpb, mu_indpb]:
112 | if type(instant) != float:
113 | raise TypeError(f'{instant} should be float type')
114 | if (instant > 1) or (instant < 0):
115 | raise ValueError(f'{instant} should be within range [0,1]')
116 |
117 | for instant in [n_pop, n_gen]:
118 | if type(instant) != int:
119 | raise TypeError(f'{instant} should be int type')
120 |
121 | if type(both) != bool:
122 | raise TypeError(f'{both} should be boolean type')
123 |
124 | if predict_type not in ['predict', 'predict_proba']:
125 | raise ValueError('predict_type should be "predict" or "predict_proba"')
126 |
127 | if algorithm not in ['one-max', 'NSGA2']:
128 | raise ValueError('algorithm should be "one-max" or "NSGA2"')
129 |
130 | if not n_children:
131 | n_children = n_pop
132 |
133 | if type(n_children) != int:
134 | raise TypeError(f'{n_children} should be int type')
135 |
136 | if (cxpb + mutpb) > 1.0:
137 | raise ValueError('The sum of cxpb and mutpb shall be in [0,1]')
138 |
139 | self.n_pop = n_pop
140 | self.n_gen = n_gen
141 | self.both = both
142 | self.n_children = n_children
143 | self.cxpb = cxpb
144 | self.mutpb = mutpb
145 | self.cx_indpb = cx_indpb
146 | self.mu_indpb = mu_indpb
147 | self.algorithm = algorithm
148 | self.loss_func = loss_func
149 | self.estimator = estimator
150 | self.predict_type = predict_type
151 | self.loss_dict = dict()
152 |
153 | def _get_cost(self, X, y, estimator, loss_func, X_test = None, y_test = None):
154 |
155 | estimator.fit(X, y.ravel())
156 | if type(X_test) is np.ndarray:
157 | if self.predict_type == "predict_proba": # if loss function requires probability
158 | y_test_pred = estimator.predict_proba(X_test)
159 | return loss_func(y_test, y_test_pred)
160 | else:
161 | y_test_pred = estimator.predict(X_test)
162 | return loss_func(y_test, y_test_pred)
163 |
164 | y_pred = estimator.predict(X)
165 |
166 | return loss_func(y, y_pred)
167 |
168 |
169 | def _cross_val(self, X, y, estimator, loss_func, cv):
170 |
171 | loss_record = []
172 |
173 | for train_index, test_index in KFold(n_splits = cv).split(X): # k-fold
174 |
175 | try:
176 | X_train, X_test = X[train_index], X[test_index]
177 | y_train, y_test = y[train_index], y[test_index]
178 | estimator.fit(X_train, y_train.ravel())
179 |
180 | if self.predict_type == "predict_proba":
181 | y_test_pred = estimator.predict_proba(X_test)
182 | loss = loss_func(y_test, y_test_pred)
183 | loss_record.append(loss)
184 | else:
185 | y_test_pred = estimator.predict(X_test)
186 | loss = loss_func(y_test, y_test_pred)
187 | loss_record.append(loss)
188 | except Exception: # skip folds on which the estimator or loss function fails
189 | continue
190 |
191 | return np.array(loss_record).mean()
192 |
193 | def _eval_fitness(self, individual):
194 |
195 | individual = [True if x else False for x in individual]
196 |
197 | if sum(individual) == 0:
198 | current_loss = np.inf # an empty feature subset cannot be evaluated
199 | else:
200 | encoded_str = ''.join(['1' if x else '0' for x in individual])
201 | if encoded_str in self.loss_dict: # reuse the cached loss (a cached 0.0 is valid)
202 | current_loss = self.loss_dict[encoded_str]
203 | else:
204 | if self.cv:
205 | current_loss = self._cross_val(self.X_train[:,individual], self.y_train,
206 | self.estimator, self.loss_func, self.cv)
207 | current_loss = np.round(current_loss, 4)
208 |
209 | elif type(self.X_val) is np.ndarray:
210 | current_loss = self._get_cost(self.X_train[:,individual], self.y_train,
211 | self.estimator, self.loss_func,
212 | self.X_val[:,individual], self.y_val)
213 | current_loss = np.round(current_loss, 4)
214 |
215 | else:
216 | current_loss = self._get_cost(self.X_train[:,individual], self.y_train,
217 | self.estimator, self.loss_func, None, None)
218 | current_loss = np.round(current_loss, 4)
219 | self.loss_dict[encoded_str] = current_loss
220 |
221 | if self.algorithm == "one-max":
222 | return current_loss,
223 | else:
224 | return current_loss, sum(individual)
225 |
226 | def fit(self, X_train, y_train, cv = None, X_val = None, y_val = None,
227 | init_sol = None, stop_point = 5):
228 |
229 |
230 | """
231 | Fit method.
232 |
233 | Parameters
234 | ----------
235 | X_train: numpy array shape = (n_samples, n_features).
236 | The training input samples.
237 |
238 | y_train: numpy array, shape = (n_samples,).
239 | The target values (class labels in classification, real numbers in regression).
240 |
241 | cv: int or None, default = None
242 | Specify the number of folds in KFold. None means GA will not use
243 | k-fold cross-validation results to select features.
244 | [1] If cv = None and X_val = None, the GA will evaluate each subset on trainset.
245 | [2] If cv != None and X_val = None, the GA will evaluate each subset on generated validation set using k-fold.
246 | [3] If cv = None and X_val != None, the GA will evaluate each subset on the user-provided validation set.
247 |
248 | X_val: numpy array, shape = (n_samples, n_features) or None. default = None.
249 | The validation input samples. None means no validation set is provided.
250 | [1] If cv = None and X_val = None, the GA will evaluate each subset on trainset.
251 | [2] If cv != None and X_val = None, the GA will evaluate each subset on generated validation set using k-fold.
252 | [3] If cv = None and X_val != None, the GA will evaluate each subset on the user-provided validation set.
253 |
254 | y_val: numpy array, shape = (n_samples, ) or None. default = None.
255 | The validation target values (class labels in classification, real numbers in regression).
256 |
257 | Returns
258 | -------
259 | self : object
260 |
261 | """
262 |
263 | # make sure input has two dimensions
264 | assert len(X_train.shape) == 2
265 | num_feature = X_train.shape[1]
266 |
267 | # save them for _eval_fitness function
268 | self.X_train = X_train
269 | self.y_train = y_train
270 | self.cv = cv
271 | self.X_val = X_val
272 | self.y_val = y_val
273 |
274 | # creator
275 | if self.algorithm == "one-max":
276 | creator.create("FitnessMin", base.Fitness, weights=(-1.0,)) # minimize the loss
277 | creator.create("Individual", list, fitness=creator.FitnessMin)
278 | else:
279 | creator.create("FitnessMulti", base.Fitness, weights=(-1.0, -0.1))
280 | creator.create("Individual", list, fitness=creator.FitnessMulti)
281 |
282 | # register
283 | toolbox = base.Toolbox()
284 | toolbox.register("gene", random.randint, 0, 1)
285 | toolbox.register("individual", tools.initRepeat, creator.Individual,
286 | toolbox.gene, n = num_feature)
287 | toolbox.register("population", tools.initRepeat, list, toolbox.individual,
288 | n = self.n_pop)
289 | toolbox.register("evaluate", self._eval_fitness)
290 | toolbox.register("mate", tools.cxUniform, indpb = self.cx_indpb)
291 | toolbox.register("mutate", tools.mutFlipBit, indpb = self.mu_indpb)
292 |
293 | if self.algorithm == "one-max":
294 | toolbox.register("select", tools.selTournament, tournsize=5)
295 | else:
296 | toolbox.register("select", tools.selNSGA2)
297 |
298 | # start evolution
299 | # evaluate initial population
300 | population = toolbox.population()
301 | fits = toolbox.map(toolbox.evaluate, population)
302 | for ind, fit in zip(population, fits):
303 | ind.fitness.values = fit
304 |
305 | # evolving
306 | for gen in tqdm(range(self.n_gen)):
307 | if self.both: # each offspring may undergo both crossover and mutation
308 | offspring = algorithms.varAnd(population, toolbox, cxpb = self.cxpb,
309 | mutpb = self.mutpb)
310 | else: # each offspring results from exactly one operation
311 | offspring = algorithms.varOr(population, toolbox,
312 | lambda_ = self.n_children, cxpb = self.cxpb,
313 | mutpb = self.mutpb)
314 | invalid_ind = [ind for ind in offspring if not ind.fitness.valid]
315 | fitnesses = map(toolbox.evaluate, invalid_ind)
316 | for ind, fit in zip(invalid_ind, fitnesses):
317 | ind.fitness.values = fit
318 |
319 | if self.algorithm == 'one-max':
320 | population = toolbox.select(offspring, k = self.n_pop)
321 | else:
322 | population = toolbox.select(offspring + population, k = self.n_pop)
323 |
324 | fits = list(toolbox.map(toolbox.evaluate, population))
325 | if self.algorithm != "one-max":
326 | fits = [x[0] for x in fits]
327 |
328 | try:
329 | best_idx = np.argmin(np.array(fits))
330 | self.best_sol = [True if x else False for x in population[best_idx]]
331 | self.best_loss = fits[best_idx]
332 |
333 | if np.isinf(self.best_loss): # if best loss is inf
334 | best_key = min([(value, key) for key, value in self.loss_dict.items()])[1]
335 | self.best_sol = [True if x == '1' else False for x in best_key]
336 | self.best_loss = min([(value, key) for key, value in self.loss_dict.items()])[0]
337 | except Exception: # fall back to the best cached solution
338 | best_key = min([(value, key) for key, value in self.loss_dict.items()])[1]
339 | self.best_sol = [True if x == '1' else False for x in best_key]
340 | self.best_loss = min([(value, key) for key, value in self.loss_dict.items()])[0]
341 |
342 | def transform(self, X):
343 | """
344 | Transform method.
345 |
346 | Parameters
347 | ----------
348 | X: numpy array shape = (n_samples, n_features).
349 | The data set needs feature reduction.
350 |
351 | Returns
352 | -------
353 | transform_X: numpy array shape = (n_samples, n_best_features).
354 | The data set after feature reduction.
355 |
356 | """
357 | transform_X = X[:, self.best_sol]
358 | return transform_X
--------------------------------------------------------------------------------
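GA.py above delegates the genetic operators to DEAP (`cxUniform`, `mutFlipBit`, `selTournament`). For readers without DEAP installed, the same three operators can be sketched in plain Python on a toy problem. This is a simplified illustration under stated assumptions, not the GA.py implementation: every function name below is hypothetical, `toy_loss` stands in for a real estimator's validation loss, and every child undergoes both crossover and mutation.

```python
import random


def uniform_crossover(a, b, indpb, rng):
    """Swap each gene between two parents independently with probability indpb."""
    a, b = list(a), list(b)
    for i in range(len(a)):
        if rng.random() < indpb:
            a[i], b[i] = b[i], a[i]
    return a, b


def flip_bit_mutation(ind, indpb, rng):
    """Flip each bit independently with probability indpb."""
    return [1 - g if rng.random() < indpb else g for g in ind]


def tournament(pop, losses, k, rng):
    """Return the lowest-loss individual among k randomly drawn contestants."""
    contestants = rng.sample(range(len(pop)), k)
    return pop[min(contestants, key=lambda i: losses[i])]


def toy_loss(mask, useful=(0, 2)):
    """Hypothetical loss: penalize missing useful features and extra features."""
    if not any(mask):
        return float("inf")  # an empty subset cannot be evaluated
    missing = sum(1 for i in useful if not mask[i])
    extra = sum(mask) - sum(mask[i] for i in useful)
    return missing + 0.1 * extra


rng = random.Random(0)
n_features, n_pop, n_gen = 6, 20, 30
pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_pop)]
best = min(pop, key=toy_loss)
for _ in range(n_gen):
    losses = [toy_loss(ind) for ind in pop]
    children = []
    while len(children) < n_pop:
        p1 = tournament(pop, losses, 5, rng)
        p2 = tournament(pop, losses, 5, rng)
        c1, c2 = uniform_crossover(p1, p2, 0.25, rng)
        children += [flip_bit_mutation(c1, 0.25, rng),
                     flip_bit_mutation(c2, 0.25, rng)]
    pop = children[:n_pop]
    best = min(pop + [best], key=toy_loss)  # keep the best mask seen so far
print(best, toy_loss(best))
```

Each individual is a 0/1 mask over the feature columns, so a mask can be applied to a NumPy array as `X[:, np.array(mask, dtype=bool)]`, which mirrors how `transform` in GA.py indexes with `best_sol`.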
/README.md:
--------------------------------------------------------------------------------
1 | Feature-Engineering-Handbook
2 | ============
3 | Welcome! This repo provides an interactive and complete practical feature engineering tutorial in Jupyter Notebook. It contains three parts: [Data Preprocessing](1.%20Data%20Preprocessing.ipynb), [Feature Selection](2.%20Feature%20Selection.ipynb) and [Dimension Reduction](3.%20Dimension%20Reduction.ipynb). Each part is demonstrated separately in one notebook. Since some feature selection algorithms such as Simulated Annealing and Genetic Algorithm lack a complete implementation in Python, we also provide corresponding Python scripts ([Simulated Annealing](SA.py), [Genetic Algorithm](GA.py)) and cover them in our tutorial for your reference.
4 |
5 |
6 | Brief Introduction
7 | ------------
8 | - [Notebook One](1.%20Data%20Preprocessing.ipynb) covers data preprocessing on static continuous features based on [scikit-learn](https://scikit-learn.org/stable/), on static categorical features based on [Category Encoders](https://contrib.scikit-learn.org/categorical-encoding/), and on time series features based on [Featuretools](https://www.featuretools.com/).
9 |
10 | - [Notebook Two](2.%20Feature%20Selection.ipynb) covers feature selection including univariate filter methods based on [scikit-learn](https://scikit-learn.org/stable/), multivariate filter methods based on [scikit-feature](http://featureselection.asu.edu/), deterministic wrapper methods based on [scikit-learn](https://scikit-learn.org/stable/), randomized wrapper methods based on our implementations in Python scripts, and embedded methods based on [scikit-learn](https://scikit-learn.org/stable/).
11 |
12 | - [Notebook Three](3.%20Dimension%20Reduction.ipynb) covers supervised and unsupervised dimension reduction based on [scikit-learn](https://scikit-learn.org/stable/).
13 |
14 |
15 | Table of Contents
16 | ------------
17 | - 1 Data Preprocessing
- 1.1 Static Continuous Variables
- 1.1.1 Discretization
- 1.1.1.1 Binarization
- 1.1.1.2 Binning
- 1.1.2 Scaling
- 1.1.2.1 Standard Scaling (Z-score standardization)
- 1.1.2.2 MinMaxScaler (Scale to range)
- 1.1.2.3 RobustScaler (Anti-outliers scaling)
- 1.1.2.4 Power Transform (Non-linear transformation)
- 1.1.3 Normalization
- 1.1.4 Imputation of missing values
- 1.1.4.1 Univariate feature imputation
- 1.1.4.2 Multivariate feature imputation
- 1.1.4.3 Marking imputed values
- 1.1.5 Feature Transformation
- 1.1.5.1 Polynomial Transformation
- 1.1.5.2 Custom Transformation
- 1.2 Static Categorical Variables
- 1.2.1 Ordinal Encoding
- 1.2.2 One-hot Encoding
- 1.2.3 Hashing Encoding
- 1.2.4 Helmert Coding
- 1.2.5 Sum (Deviation) Coding
- 1.2.6 Target Encoding
- 1.2.7 M-estimate Encoding
- 1.2.8 James-Stein Encoder
- 1.2.9 Weight of Evidence Encoder
- 1.2.10 Leave One Out Encoder
- 1.2.11 Catboost Encoder
- 1.3 Time Series Variables
- 1.3.1 Time Series Categorical Features
- 1.3.2 Time Series Continuous Features
- 1.3.3 Implementation
- 1.3.3.1 Create EntitySet
- 1.3.3.2 Set up cut-time
- 1.3.3.3 Auto Feature Engineering
- 2 Feature Selection
- 2.1 Filter Methods
- 2.1.1 Univariate Filter Methods
- 2.1.1.1 Variance Threshold
- 2.1.1.2 Pearson Correlation (regression problem)
- 2.1.1.3 Distance Correlation (regression problem)
- 2.1.1.4 F-Score (regression problem)
- 2.1.1.5 Mutual Information (regression problem)
- 2.1.1.6 Chi-squared Statistics (classification problem)
- 2.1.1.7 F-Score (classification problem)
- 2.1.1.8 Mutual Information (classification problem)
- 2.1.2 Multivariate Filter Methods
- 2.1.2.1 Max-Relevance Min-Redundancy (mRMR)
- 2.1.2.2 Correlation-based Feature Selection (CFS)
- 2.1.2.3 Fast Correlation-based Filter (FCBF)
- 2.1.2.4 ReliefF
- 2.1.2.5 Spectral Feature Selection (SPEC)
- 2.2 Wrapper Methods
- 2.2.1 Deterministic Algorithms
- 2.2.1.1 Recursive Feature Elimination (SBS)
- 2.2.2 Randomized Algorithms
- 2.2.2.1 Simulated Annealing (SA)
- 2.2.2.2 Genetic Algorithm (GA)
- 2.3 Embedded Methods
- 2.3.1 Regularization Based Methods
- 2.3.1.1 Lasso Regression (Linear Regression with L1 Norm)
- 2.3.1.2 Logistic Regression (with L1 Norm)
- 2.3.1.3 LinearSVR/ LinearSVC
- 2.3.2 Tree Based Methods
- 3 Dimension Reduction
- 3.1 Unsupervised Methods
- 3.1.1 PCA (Principal Components Analysis)
- 3.2 Supervised Methods
- 3.2.1 LDA (Linear Discriminant Analysis)
18 |
19 | Reference
20 | ------------
21 | References have been included in each Jupyter Notebook.
22 |
23 | Author
24 | ------------
25 | [**@Yingxiang Chen**](https://github.com/YC-Coder-Chen)
26 | [**@Zihan Yang**](https://github.com/echoyang48)
27 |
28 | Contact
29 | ------------
30 | **If there are any mistakes, please feel free to reach out and correct us!**
31 |
32 | Yingxiang Chen E-mail: chenyingxiang3526@gmail.com
33 | Zihan Yang E-mail: echoyang48@gmail.com
34 |
--------------------------------------------------------------------------------
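The README lists Simulated Annealing as a randomized wrapper method. Its core acceptance rule, which SA.py implements in `_judge`, can be sketched on its own as follows. This is a minimal sketch: the function names `accept` and `acceptance_probability` are hypothetical, and the costs and temperatures are illustrative values only.

```python
import math
import random


def acceptance_probability(delta_cost, temp, k=1.0):
    """Probability of accepting a move that changes the cost by delta_cost."""
    if delta_cost < 0:
        return 1.0  # improvements are always accepted
    return math.exp(-delta_cost / (k * temp))


def accept(new_cost, old_cost, temp, k=1.0, rng=random):
    """Accept better solutions always, worse ones with a temperature-dependent chance."""
    return acceptance_probability(new_cost - old_cost, temp, k) > rng.random()


# The same worsening of 0.5 is accepted far more readily at a high temperature.
for temp in (100.0, 10.0, 1.0, 0.1):
    print(temp, round(acceptance_probability(0.5, temp), 4))
```

A high temperature lets the search escape local minima early on, and as the temperature decays the rule behaves more and more like greedy descent.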
/SA.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 | """
4 | Simulated Annealing
5 |
6 | @author: chenyingxiang
7 | """
8 |
9 | import random
10 | import numpy as np
11 | from sklearn.model_selection import KFold
12 |
13 | random.seed()
14 | np.random.seed()
15 |
16 | class Simulated_Annealing(object):
17 | """
18 | Simulated Annealing Algorithm for feature selection
19 |
20 | Parameters
21 | ----------
22 | init_temp: float, default: 100.0
23 | The initial temperature
24 |
25 | min_temp: float, default: 0.01
26 | The minimum temperature to stop
27 |
28 | max_perturb: float, default: 0.2
29 | The maximum percentage of perturbation generated
30 |
31 | alpha: float, default: 0.98
32 | The decay coefficient of temperature
33 |
34 | k: float, default: 1.0
35 | The constant for computing probability
36 |
37 | loss_func: object
38 | The loss function of the ML task.
39 | loss_func(y_true, y_pred) should return the loss.
40 |
41 | iteration: int, default: 50
42 | Number of iterations at each temperature level above min_temp.
43 |
44 | estimator: object
45 | A supervised learning estimator
46 | It has to have the `fit` and `predict` method (or `predict_proba` method for classification)
47 |
48 | predict_type: string, default="predict"
49 | Final prediction type.
50 | - For some classification loss functions, probability output is required.
51 | Should set predict_type to "predict_proba"
52 |
53 | Attributes
54 | ----------
55 | best_sol: np.array of int
56 | The index of the best subset of features.
57 |
58 | best_loss: float
59 | The loss associated with the best_sol
60 |
61 | References
62 | ----------
63 | 1. https://blog.csdn.net/Joseph__Lagrange/article/details/94410317
64 | 2. https://github.com/JeromeBau/SimulatedAnnealing/blob/master/gibbs_annealing.py
65 |
66 | """
67 |
68 |
69 | def __init__(self, loss_func, estimator, init_temp = 100.0, min_temp = 0.01, k = 1.0,
70 | max_perturb = 0.2, alpha = 0.98, iteration = 50, predict_type = 'predict'):
71 |
72 | #### check type
73 | if not hasattr(estimator, 'fit'):
74 | raise ValueError("Estimator doesn't have a fit method")
75 | if not hasattr(estimator, 'predict') and not hasattr(estimator, 'predict_proba'):
76 | raise ValueError("Estimator doesn't have a predict or predict_proba method")
77 |
78 | for instant in [init_temp, min_temp, k, max_perturb, alpha]:
79 | if type(instant) != float:
80 | raise TypeError(f'{instant} should be float type')
81 |
82 | if type(iteration) != int:
83 | raise TypeError(f'{iteration} should be int type')
84 |
85 | if predict_type not in ['predict', 'predict_proba']:
86 | raise ValueError('predict_type should be "predict" or "predict_proba"')
87 |
88 | self.loss_func = loss_func
89 | self.estimator = estimator
90 | self.init_temp = init_temp
91 | self.min_temp = min_temp
92 | self.k = k
93 | self.max_perturb = max_perturb
94 | self.alpha = alpha
95 | self.iteration = iteration
96 | self.predict_type = predict_type
97 | self.loss_dict = dict()
98 |
99 | def _judge(self, new_cost, old_cost, temp):
100 |
101 | delta_cost = new_cost - old_cost
102 |
103 | if delta_cost < 0: # new solution is better
104 | proceed = 1
105 | else:
106 | probability = np.exp(-1 * delta_cost / (self.k * temp))
107 | if probability > np.random.random():
108 | proceed = 1
109 |
110 | else:
111 | proceed = 0
112 |
113 | return proceed
114 |
115 | def _get_neighbor(self, num_feature, current_sol, max_perturb):
116 |
117 | all_feature = np.ones(shape=(num_feature,)).astype(bool)
118 | outside_feature = np.where(all_feature != current_sol)[0]
119 | inside_feature = np.where(all_feature == current_sol)[0]
120 | num_perturb_in = int(max(np.ceil(len(inside_feature) * max_perturb),1))
121 | num_perturb_out = int(max(np.ceil(len(outside_feature) * max_perturb),1))
122 | if len(outside_feature) == 0:
123 | feature_in = np.array([])
124 | else:
125 | feature_in = np.random.choice(outside_feature,
126 | size = min(len(outside_feature),
127 | np.random.randint(0, num_perturb_in + 1)),
128 | replace = False) # uniform distribution
129 | if len(inside_feature) == 0:
130 | feature_out = np.array([])
131 | else:
132 | feature_out = np.random.choice(inside_feature ,
133 | size = min(len(inside_feature),
134 | np.random.randint(0, num_perturb_out + 1)),
135 | replace = False) # uniform distribution
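136 | feature_change = np.append(feature_in, feature_out).astype(int)
137 | neighbor = np.asarray(current_sol).astype(bool) # copy of the current solution, not an all-ones mask
138 | neighbor[feature_change] = ~neighbor[feature_change] # flip only the chosen features
139 | return neighbor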
140 |
141 | def _get_cost(self, X, y, estimator, loss_func, X_test = None, y_test = None):
142 |
143 | estimator.fit(X, y.ravel())
144 | if type(X_test) is np.ndarray:
145 | if self.predict_type == "predict_proba": # if loss function requires probability
146 | y_test_pred = estimator.predict_proba(X_test)
147 | return loss_func(y_test, y_test_pred)
148 | else:
149 | y_test_pred = estimator.predict(X_test)
150 | return loss_func(y_test, y_test_pred)
151 |
152 | y_pred = estimator.predict(X)
153 |
154 | return loss_func(y, y_pred)
155 |
156 |
157 | def _cross_val(self, X, y, estimator, loss_func, cv):
158 |
159 | loss_record = []
160 |
161 | for train_index, test_index in KFold(n_splits = cv).split(X): # k-fold
162 |
163 | try:
164 | X_train, X_test = X[train_index], X[test_index]
165 | y_train, y_test = y[train_index], y[test_index]
166 | estimator.fit(X_train, y_train.ravel())
167 |
168 | if self.predict_type == "predict_proba":
169 | y_test_pred = estimator.predict_proba(X_test)
170 | loss = loss_func(y_test, y_test_pred)
171 | loss_record.append(loss)
172 | else:
173 | y_test_pred = estimator.predict(X_test)
174 | loss = loss_func(y_test, y_test_pred)
175 | loss_record.append(loss)
176 | except Exception: # skip a fold that fails, but let KeyboardInterrupt through
177 | continue
178 |
179 | return np.array(loss_record).mean()
180 |
181 | def fit(self, X_train, y_train, cv = None, X_val = None, y_val = None,
182 | init_sol = None, stop_point = 5):
183 |
184 |
185 | """
186 | Fit method.
187 |
188 | Parameters
189 | ----------
190 | X_train: numpy array shape = (n_samples, n_features).
191 | The training input samples.
192 |
193 | y_train: numpy array, shape = (n_samples,).
194 | The target values (class labels in classification, real numbers in regression).
195 |
196 | cv: int or None, default = None
197 | Specify the number of folds in KFold. None means SA will not use
198 | k-fold cross-validation results to select features.
199 | [1] If cv = None and X_val = None, the SA will evaluate each subset on trainset.
200 | [2] If cv != None and X_val = None, the SA will evaluate each subset on generated validation set using k-fold.
201 | [3] If cv = None and X_val != None, the SA will evaluate each subset on the user-provided validation set.
202 |
203 | X_val: numpy array, shape = (n_samples, n_features) or None. default = None.
204 | The validation input samples. None means no validation set is provided.
205 | [1] If cv = None and X_val = None, the SA will evaluate each subset on trainset.
206 | [2] If cv != None and X_val = None, the SA will evaluate each subset on generated validation set using k-fold.
207 | [3] If cv = None and X_val != None, the SA will evaluate each subset on the user-provided validation set.
208 |
209 | y_val: numpy array, shape = (n_samples, ) or None. default = None.
210 | The validation target values (class labels in classification, real numbers in regression).
211 |
212 | init_sol: numpy array, shape = (num_feature, ) or None. default = None.
213 | The initial solution provided by the user. It should contain bools.
214 | A good initial solution will save the SA algorithm a lot of searching time.
215 | None means the SA will randomly generate an initial solution.
216 |
217 | stop_point: int, default = 5.
218 | The stopping condition. If the optimal loss stays the same for `stop_point` consecutive iterations, the search stops.
219 |
220 | Returns
221 | -------
222 | self : object
223 |
224 | """
225 |
226 | # make sure input has two dimensions
227 | assert len(X_train.shape) == 2
228 | num_feature = X_train.shape[1]
229 |
230 | # get initial solution
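231 | if init_sol is None: # `== None` is ambiguous for numpy arrays
232 | init_sol = np.random.randint(2, size=num_feature).astype(bool) # boolean mask, not column indices
233 | while init_sol.sum() == 0:
234 | init_sol = np.random.randint(2, size=num_feature).astype(bool)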
235 |
236 | current_sol = init_sol
237 | if cv:
238 | current_loss = self._cross_val(X_train[:,current_sol], y_train,
239 | self.estimator, self.loss_func, cv)
240 | current_loss = np.round(current_loss, 4)
241 |
242 | elif type(X_val) is np.ndarray:
243 | current_loss = self._get_cost(X_train[:,current_sol], y_train, self.estimator,
244 | self.loss_func, X_val[:,current_sol], y_val)
245 | current_loss = np.round(current_loss, 4)
246 |
247 | else:
248 | current_loss = self._get_cost(X_train[:,current_sol], y_train, self.estimator,
249 | self.loss_func, None, None)
250 | current_loss = np.round(current_loss, 4)
251 |
252 | encoded_str = ''.join(['1' if x else '0' for x in current_sol])
253 | self.loss_dict[encoded_str] = current_loss
254 | temp_history = [self.init_temp]
255 | loss_history = [current_loss]
256 | sol_history = [current_sol]
257 |
258 | current_temp = self.init_temp
259 | current_temp = np.round(current_temp, 4)
260 |
261 | best_loss = current_loss
262 | best_sol = current_sol
263 |
264 | # start looping
265 | while current_temp > self.min_temp:
266 | for step in range(self.iteration):
267 | current_sol = self._get_neighbor(num_feature, current_sol, self.max_perturb)
268 | if np.sum(current_sol) == 0: # no feature selected
269 | current_loss = np.inf
270 | else:
271 | encoded_str = ''.join(['1' if x else '0' for x in current_sol])
272 | if encoded_str in self.loss_dict: # reuse the cached loss, even if it is 0.0
273 | current_loss = self.loss_dict[encoded_str]
274 | else:
275 | if cv:
276 | current_loss = self._cross_val(X_train[:,current_sol], y_train,
277 | self.estimator, self.loss_func, cv)
278 | current_loss = np.round(current_loss, 4)
279 |
280 | elif type(X_val) is np.ndarray:
281 | current_loss = self._get_cost(X_train[:,current_sol], y_train, self.estimator,
282 | self.loss_func, X_val[:,current_sol], y_val)
283 | current_loss = np.round(current_loss, 4)
284 |
285 | else:
286 | current_loss = self._get_cost(X_train[:,current_sol], y_train, self.estimator,
287 | self.loss_func, None, None)
288 | current_loss = np.round(current_loss, 4)
289 | self.loss_dict[encoded_str] = current_loss
290 |
291 | if (current_loss - best_loss) <= 0: # update temperature
292 | current_temp = current_temp * self.alpha
293 | current_temp = np.round(current_temp, 4)
294 |
295 | # judge
296 | if self._judge(current_loss, best_loss, current_temp): # take new solution
297 | best_sol = current_sol
298 | best_loss = current_loss
299 |
300 | # keep record
301 | temp_history.append(current_temp)
302 | loss_history.append(best_loss)
303 | sol_history.append(best_sol)
304 |
305 | # debugging Pipeline
306 | # print(f"Current temperature is {current_temp}")
307 | # print(f"Current best loss is {best_loss}")
308 | # print(f"Current best solution is {best_sol}")
309 |
310 | # check stopping condition
311 | if len(loss_history) > stop_point:
312 | if len(np.unique(loss_history[-1 * stop_point : ])) == 1:
313 | print("Stopping condition reached!")
314 | break
315 |
316 | best_idx = np.argmin(loss_history)
317 | self.best_sol = sol_history[best_idx]
318 | self.best_loss = loss_history[best_idx]
319 | return self
320 | def transform(self, X):
321 | """
322 | Transform method.
323 |
324 | Parameters
325 | ----------
326 | X: numpy array shape = (n_samples, n_features).
327 | The data set needs feature reduction.
328 |
329 | Returns
330 | -------
331 | transform_X: numpy array shape = (n_samples, n_best_features).
332 | The data set after feature reduction.
333 |
334 | """
335 | transform_X = X[:, self.best_sol]
336 | return transform_X
--------------------------------------------------------------------------------
/images/Embedded_Pipeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YC-Coder-Chen/feature-engineering-handbook/2f2cd6f314437ffdd7fa5c9d18aafd0bc8e7fd04/images/Embedded_Pipeline.png
--------------------------------------------------------------------------------
/images/Filter_Pipeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YC-Coder-Chen/feature-engineering-handbook/2f2cd6f314437ffdd7fa5c9d18aafd0bc8e7fd04/images/Filter_Pipeline.png
--------------------------------------------------------------------------------
/images/GA_Pseudo_Code.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YC-Coder-Chen/feature-engineering-handbook/2f2cd6f314437ffdd7fa5c9d18aafd0bc8e7fd04/images/GA_Pseudo_Code.png
--------------------------------------------------------------------------------
/images/SA_Pseudo_Code.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YC-Coder-Chen/feature-engineering-handbook/2f2cd6f314437ffdd7fa5c9d18aafd0bc8e7fd04/images/SA_Pseudo_Code.png
--------------------------------------------------------------------------------
/images/Wrapper_Pipeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YC-Coder-Chen/feature-engineering-handbook/2f2cd6f314437ffdd7fa5c9d18aafd0bc8e7fd04/images/Wrapper_Pipeline.png
--------------------------------------------------------------------------------
/images/基于基因算法特征选择.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YC-Coder-Chen/feature-engineering-handbook/2f2cd6f314437ffdd7fa5c9d18aafd0bc8e7fd04/images/基于基因算法特征选择.png
--------------------------------------------------------------------------------
/images/基于模拟退火特征选择.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YC-Coder-Chen/feature-engineering-handbook/2f2cd6f314437ffdd7fa5c9d18aafd0bc8e7fd04/images/基于模拟退火特征选择.png
--------------------------------------------------------------------------------
/images/封装法工作流.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YC-Coder-Chen/feature-engineering-handbook/2f2cd6f314437ffdd7fa5c9d18aafd0bc8e7fd04/images/封装法工作流.png
--------------------------------------------------------------------------------
/images/嵌入法工作流.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YC-Coder-Chen/feature-engineering-handbook/2f2cd6f314437ffdd7fa5c9d18aafd0bc8e7fd04/images/嵌入法工作流.png
--------------------------------------------------------------------------------
/images/过滤法工作流.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/YC-Coder-Chen/feature-engineering-handbook/2f2cd6f314437ffdd7fa5c9d18aafd0bc8e7fd04/images/过滤法工作流.png
--------------------------------------------------------------------------------
/中文版.md:
--------------------------------------------------------------------------------
1 | Feature Engineering Handbook Based on Jupyter
2 | ============
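3 | Welcome! This project provides an interactive, practical feature engineering handbook built on Jupyter Notebook. It has three parts: [Data Preprocessing](./中文版/1.%20数据预处理.ipynb), [Feature Selection](./中文版/2.%20特征选择.ipynb), and [Dimension Reduction](./中文版/3.%20特征降维.ipynb). Each part is demonstrated in its own notebook. Because some feature selection algorithms, such as Simulated Annealing and the Genetic Algorithm, lack a complete implementation in Python, we also provide Python scripts that implement them ([Simulated Annealing](SA.py), [Genetic Algorithm](GA.py)) and cover them in the tutorial for your reference.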
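
As a rough illustration of the decision Simulated Annealing makes at every step, the sketch below shows a Metropolis-style acceptance rule: a candidate with lower loss is always accepted, and a worse one is accepted with probability exp(-delta / (k * T)). The function name `accept` and its arguments are assumptions made for this example; they are not the interface of SA.py.

```python
import math
import random

def accept(new_loss, old_loss, temp, k=1.0, rng=random):
    """Return True if the candidate solution should replace the current one."""
    delta = new_loss - old_loss
    if delta < 0:  # candidate is strictly better: always accept
        return True
    # candidate is equal or worse: accept with probability exp(-delta / (k * temp))
    return rng.random() < math.exp(-delta / (k * temp))
```

At a high temperature the rule accepts most worse candidates, which lets the search escape local optima; as the temperature decays, it behaves more and more like a greedy search.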
4 |
5 | Introduction
6 | ------------
7 | - [Notebook One](./中文版/1.%20数据预处理.ipynb) covers data preprocessing: handling static continuous features with [scikit-learn](https://scikit-learn.org/stable/), static categorical features with [Category Encoders](https://contrib.scikit-learn.org/categorical-encoding/), and time series problems with [Featuretools](https://www.featuretools.com/).
8 |
9 | - [Notebook Two](./中文版/2.%20特征选择.ipynb) covers feature selection: univariate filter methods with [scikit-learn](https://scikit-learn.org/stable/), multivariate filter methods with [scikit-feature](http://featureselection.asu.edu/), deterministic wrapper methods with [scikit-learn](https://scikit-learn.org/stable/), randomized wrapper methods with our [Simulated Annealing](SA.py) and [Genetic Algorithm](GA.py) scripts, and embedded methods with [scikit-learn](https://scikit-learn.org/stable/).
10 |
11 | - [Notebook Three](./中文版/3.%20特征降维.ipynb) covers dimension reduction: supervised and unsupervised methods with [scikit-learn](https://scikit-learn.org/stable/).
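
The randomized wrapper methods listed for the second notebook search over boolean feature masks, where each entry says whether a feature is kept. A minimal sketch of one neighbor move, assuming a plain Python list as the mask, is shown below; `flip_neighbor` and `max_flips` are hypothetical names for this example and do not come from SA.py or GA.py.

```python
import random

def flip_neighbor(mask, max_flips=1, rng=random):
    """Return a copy of `mask` with between 1 and `max_flips` entries flipped."""
    n_flips = rng.randint(1, max(1, min(max_flips, len(mask))))
    neighbor = list(mask)  # copy, so the current solution is left untouched
    for i in rng.sample(range(len(mask)), n_flips):
        neighbor[i] = not neighbor[i]
    return neighbor

rng = random.Random(0)  # seeded generator for reproducibility
current = [True, False, True, False, False]
candidate = flip_neighbor(current, max_flips=2, rng=rng)
```

Starting each move from a copy of the current mask keeps the candidate close to the current solution, which is what makes it a neighbor rather than a fresh random subset.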
12 |
13 | |Content|English version | Chinese version |
14 | |------ |------ | ------ |
15 | |README | [README](./README.md) | [Chinese README](./中文版.md) |
16 | |Data Preprocessing| [Notebook 1](./1.%20Data%20Preprocessing.ipynb) | [Notebook One](./中文版/1.%20数据预处理.ipynb) |
17 | |Feature Selection | [Notebook 2](2.%20Feature%20Selection.ipynb) | [Notebook Two](./中文版/2.%20特征选择.ipynb) |
18 | |Dimension Reduction | [Notebook 3](3.%20Dimension%20Reduction.ipynb) | [Notebook Three](./中文版/3.%20特征降维.ipynb) |
19 |
20 |
21 | Table of Contents
22 | ------------
23 | - 1 Data Preprocessing
- 1.1 Static Continuous Variables
- 1.1.1 Discretization
- 1.1.1.1 Binarization
- 1.1.1.2 Binning
- 1.1.2 Scaling
- 1.1.2.1 Standard Scaling (Z-score standardization)
- 1.1.2.2 MinMaxScaler (Scale to range)
- 1.1.2.3 RobustScaler (Anti-outliers scaling)
- 1.1.2.4 Power Transform (Non-linear transformation)
- 1.1.3 Normalization
- 1.1.4 Imputation of missing values
- 1.1.4.1 Univariate feature imputation
- 1.1.4.2 Multivariate feature imputation
- 1.1.4.3 Marking imputed values
- 1.1.5 Feature Transformation
- 1.1.5.1 Polynomial Transformation
- 1.1.5.2 Custom Transformation
- 1.2 Static Categorical Variables
- 1.2.1 Ordinal Encoding
- 1.2.2 One-hot Encoding
- 1.2.3 Hashing Encoding
- 1.2.4 Helmert Coding
- 1.2.5 Sum (Deviation) Coding
- 1.2.6 Target Encoding
- 1.2.7 M-estimate Encoding
- 1.2.8 James-Stein Encoder
- 1.2.9 Weight of Evidence Encoder
- 1.2.10 Leave One Out Encoder
- 1.2.11 Catboost Encoder
- 1.3 Time Series Variables
- 1.3.1 Time Series Categorical Features
- 1.3.2 Time Series Continuous Features
- 1.3.3 Implementation
- 1.3.3.1 Create EntitySet
- 1.3.3.2 Set up cut-time
- 1.3.3.3 Auto Feature Engineering
- 2 Feature Selection
- 2.1 Filter Methods
- 2.1.1 Univariate Filter Methods
- 2.1.1.1 Variance Threshold
- 2.1.1.2 Pearson Correlation (regression problem)
- 2.1.1.3 Distance Correlation (regression problem)
- 2.1.1.4 F-Score (regression problem)
- 2.1.1.5 Mutual Information (regression problem)
- 2.1.1.6 Chi-squared Statistics (classification problem)
- 2.1.1.7 F-Score (classification problem)
- 2.1.1.8 Mutual Information (classification problem)
- 2.1.2 Multivariate Filter Methods
- 2.1.2.1 Max-Relevance Min-Redundancy (mRMR)
- 2.1.2.2 Correlation-based Feature Selection (CFS)
- 2.1.2.3 Fast Correlation-based Filter (FCBF)
- 2.1.2.4 ReliefF
- 2.1.2.5 Spectral Feature Selection (SPEC)
- 2.2 Wrapper Methods
- 2.2.1 Deterministic Algorithms
- 2.2.1.1 Recursive Feature Elimination (SBS)
- 2.2.2 Randomized Algorithms
- 2.2.2.1 Simulated Annealing (SA)
- 2.2.2.2 Genetic Algorithm (GA)
- 2.3 Embedded Methods
- 2.3.1 Regularization Based Methods
- 2.3.1.1 Lasso Regression (Linear Regression with L1 Norm)
- 2.3.1.2 Logistic Regression (with L1 Norm)
- 2.3.1.3 LinearSVR/ LinearSVC
- 2.3.2 Tree Based Methods
- 3 Dimension Reduction
- 3.1 Unsupervised Methods
- 3.1.1 PCA (Principal Components Analysis)
- 3.2 Supervised Methods
- 3.2.1 LDA (Linear Discriminant Analysis)
24 |
25 | *Note: some of this content had no prior translation, so some translations may not be fully accurate.*
26 |
27 | References
28 | ------------
29 | References are listed separately in each notebook.
30 |
31 | Author
32 | ------------
33 | [**@Yingxiang Chen**](https://github.com/YC-Coder-Chen)
34 | [**@Zihan Yang**](https://github.com/echoyang48)
35 |
36 | Contact
37 | ------------
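38 | **This material was prepared in the authors' spare time. Corrections are welcome; let's make this learning resource better together.**
39 |
40 | Yingxiang Chen E-mail: chenyingxiang3526@gmail.com
41 | Zihan Yang E-mail: echoyang48@gmail.com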
42 |
--------------------------------------------------------------------------------
/中文版/3. 特征降维.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "**Feature Engineering Notebook Three: Dimensionality Reduction** \n",
8 | "*Authors: Yingxiang Chen, Zihan Yang*"
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {
14 | "toc": true
15 | },
16 | "source": [
17 | "Table of Contents\n",
18 | ""
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "**Reference**\n",
26 | "- https://scikit-learn.org/stable/modules/decomposition.html#principal-component-analysis-pca\n",
27 | "- https://sebastianraschka.com/faq/docs/lda-vs-pca.html\n",
28 | "- https://en.wikipedia.org/wiki/Linear_discriminant_analysis"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "# Dimension Reduction"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
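42 | "After data preprocessing and feature selection, we have generated a good feature subset. Sometimes, however, this subset still contains too many features and takes too much computing power to train on. In that case, we can use dimension reduction techniques to compress the feature subset further, although this may reduce model performance.\n",
43 | "\n",
44 | "If we do not have much time for feature selection, we can also apply dimension reduction directly after data preprocessing, using a dimension reduction algorithm to compress the original feature space into a feature subset.\n",
45 | "\n",
46 | "Specifically, we will introduce PCA and LDA (Linear Discriminant Analysis) in turn."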
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "## Unsupervised Methods"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {},
59 | "source": [
60 | "### PCA (Principal Components Analysis)"
61 | ]
62 | },
63 | {
64 | "cell_type": "markdown",
65 | "metadata": {},
66 | "source": [
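67 | "Principal component analysis (PCA) is an unsupervised machine learning model that uses a linear transformation to project the original features onto a set of linearly uncorrelated unit vectors while retaining as much information (variance) as possible. You can find more mathematical details in the [repo](https://github.com/YC-Coder-Chen/Unsupervised-Notes/blob/master/PCA.md) we wrote on GitHub."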
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "execution_count": 1,
73 | "metadata": {
74 | "ExecuteTime": {
75 | "end_time": "2020-03-30T03:37:20.574170Z",
76 | "start_time": "2020-03-30T03:37:18.879555Z"
77 | }
78 | },
79 | "outputs": [],
80 | "source": [
81 | "import numpy as np\n",
82 | "import pandas as pd\n",
83 | "from sklearn.decomposition import PCA\n",
84 | "\n",
85 | "# Load the dataset directly\n",
86 | "from sklearn.datasets import fetch_california_housing\n",
87 | "dataset = fetch_california_housing()\n",
88 | "X, y = dataset.data, dataset.target # use the california_housing dataset for the demo\n",
89 | "\n",
90 | "# Use the first 15000 observations as the training set\n",
91 | "# and the rest as the test set\n",
92 | "train_set = X[0:15000,:]\n",
93 | "test_set = X[15000:,]\n",
94 | "train_y = y[0:15000]\n",
95 | "\n",
96 | "# Scale the features before PCA; otherwise PCA gives too much weight to large-scale features\n",
97 | "from sklearn.preprocessing import StandardScaler\n",
98 | "model = StandardScaler()\n",
99 | "model.fit(train_set) \n",
100 | "standardized_train = model.transform(train_set)\n",
101 | "standardized_test = model.transform(test_set)\n",
102 | "\n",
103 | "# Start compressing the features\n",
104 | "compressor = PCA(n_components=0.9) \n",
105 | "# Setting n_components to 0.9 =>\n",
106 | "# keep enough principal components to retain at least 90% of the variance of the original features\n",
107 | "# An integer n_components instead sets the number of output components directly\n",
108 | "\n",
109 | "compressor.fit(standardized_train) # fit on the training set\n",
110 | "transformed_trainset = compressor.transform(standardized_train) # transform the training set (15000,5)\n",
111 | "# i.e. the first 5 of the 8 principal components retain at least 90% of the variance\n",
112 | "\n",
113 | "transformed_testset = compressor.transform(standardized_test) # transform the test set\n",
114 | "assert transformed_trainset.shape[1] == transformed_testset.shape[1] \n",
115 | "# the transformed training and test sets have the same number of features"
116 | ]
117 | },
118 | {
119 | "cell_type": "code",
120 | "execution_count": 2,
121 | "metadata": {
122 | "ExecuteTime": {
123 | "end_time": "2020-03-30T03:37:21.146354Z",
124 | "start_time": "2020-03-30T03:37:20.576580Z"
125 | }
126 | },
127 | "outputs": [
128 | {
129 | "data": {
130 | "image/png": "\n",
131 | "text/plain": [
132 | ""
133 | ]
134 | },
135 | "metadata": {
136 | "needs_background": "light"
137 | },
138 | "output_type": "display_data"
139 | }
140 | ],
141 | "source": [
142 | "# Visualize explained variance against the number of selected principal components\n",
143 | "\n",
144 | "import matplotlib.pyplot as plt\n",
145 | "plt.rcParams['font.sans-serif']=['SimHei']\n",
146 | "%matplotlib inline\n",
147 | "\n",
148 | "\n",
149 | "plt.plot(np.array(range(len(compressor.explained_variance_ratio_))) + 1, \n",
150 | " np.cumsum(compressor.explained_variance_ratio_))\n",
151 | "plt.xlabel('Number of selected principal components')\n",
152 | "plt.ylabel('Cumulative explained variance')\n",
153 | "plt.show(); # the first 5 principal components retain at least 90% of the variance"
154 | ]
155 | },
156 | {
157 | "cell_type": "markdown",
158 | "metadata": {},
159 | "source": [
160 | "## Supervised Methods"
161 | ]
162 | },
163 | {
164 | "cell_type": "markdown",
165 | "metadata": {},
166 | "source": [
167 | "### LDA (Linear Discriminant Analysis)"
168 | ]
169 | },
170 | {
171 | "cell_type": "markdown",
172 | "metadata": {},
173 | "source": [
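174 | "Unlike PCA, linear discriminant analysis (LDA) is a supervised machine learning model that seeks a feature subspace that maximizes linear class separability: projections of samples from the same class should lie as close together as possible, while the projected class centers should lie as far apart as possible. LDA applies only to classification problems, and it assumes that the samples of each class follow a Gaussian distribution with a shared covariance matrix.\n",
175 | "\n",
176 | "More details on the theory are available on the sklearn [official website](https://scikit-learn.org/stable/modules/lda_qda.html#lda-qda). LDA compresses the original features into (K-1) components, where K is the number of classes of the target variable. In sklearn, however, LDA can compress the features further by incorporating ideas from PCA."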
177 | ]
178 | },
179 | {
180 | "cell_type": "code",
181 | "execution_count": 3,
182 | "metadata": {
183 | "ExecuteTime": {
184 | "end_time": "2020-03-30T03:37:21.167912Z",
185 | "start_time": "2020-03-30T03:37:21.148967Z"
186 | }
187 | },
188 | "outputs": [],
189 | "source": [
190 | "import numpy as np\n",
191 | "import pandas as pd\n",
192 | "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA\n",
193 | "\n",
194 | "# LDA applies only to classification problems\n",
195 | "# Load the dataset\n",
196 | "from sklearn.datasets import load_iris\n",
197 | "iris = load_iris()\n",
198 | "X, y = iris.data, iris.target\n",
199 | "\n",
200 | "# The iris dataset must be shuffled before use\n",
201 | "np.random.seed(1234)\n",
202 | "idx = np.random.permutation(len(X))\n",
203 | "X = X[idx]\n",
204 | "y = y[idx]\n",
205 | "\n",
206 | "# Use the first 100 observations as the training set\n",
207 | "# and the remaining 50 observations as the test set\n",
208 | "\n",
209 | "train_set = X[0:100,:]\n",
210 | "test_set = X[100:,]\n",
211 | "train_y = y[0:100]\n",
212 | "test_y = y[100:,]\n",
213 | "\n",
214 | "# Scale the features before applying LDA\n",
215 | "# because LDA assumes normally distributed data\n",
216 | "\n",
217 | "from sklearn.preprocessing import StandardScaler # a power transform is another option\n",
218 | "model = StandardScaler()\n",
219 | "model.fit(train_set) \n",
220 | "standardized_train = model.transform(train_set)\n",
221 | "standardized_test = model.transform(test_set)\n",
222 | "\n",
223 | "# Start compressing the features\n",
224 | "compressor = LDA(n_components=2) # set n_components to 2\n",
225 | "# n_components <= min(n_classes - 1, n_features)\n",
226 | "\n",
227 | "compressor.fit(standardized_train, train_y) # fit on the training set\n",
228 | "transformed_trainset = compressor.transform(standardized_train) # transform the training set (100,2)\n",
229 | "transformed_testset = compressor.transform(standardized_test) # transform the test set\n",
230 | "assert transformed_trainset.shape[1] == transformed_testset.shape[1]\n",
231 | "# the transformed training and test sets have the same number of features"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": 4,
237 | "metadata": {
238 | "ExecuteTime": {
239 | "end_time": "2020-03-30T03:37:21.314293Z",
240 | "start_time": "2020-03-30T03:37:21.169760Z"
241 | }
242 | },
243 | "outputs": [
244 | {
245 | "data": {
246 | "image/png": "\n",
247 | "text/plain": [
248 | ""
249 | ]
250 | },
251 | "metadata": {
252 | "needs_background": "light"
253 | },
254 | "output_type": "display_data"
255 | }
256 | ],
257 | "source": [
258 | "# Visualize explained variance against the number of selected features\n",
259 | "import matplotlib.pyplot as plt\n",
260 | "plt.plot(np.array(range(len(compressor.explained_variance_ratio_))) + 1, \n",
261 | " np.cumsum(compressor.explained_variance_ratio_))\n",
262 | "plt.xlabel('Number of selected features')\n",
263 | "plt.ylabel('Cumulative explained variance')\n",
264 | "plt.show(); # LDA compresses the original 4 features into 2, which together explain 100% of the variance"
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "execution_count": null,
270 | "metadata": {},
271 | "outputs": [],
272 | "source": []
273 | }
274 | ],
275 | "metadata": {
276 | "kernelspec": {
277 | "display_name": "Python 3",
278 | "language": "python",
279 | "name": "python3"
280 | },
281 | "language_info": {
282 | "codemirror_mode": {
283 | "name": "ipython",
284 | "version": 3
285 | },
286 | "file_extension": ".py",
287 | "mimetype": "text/x-python",
288 | "name": "python",
289 | "nbconvert_exporter": "python",
290 | "pygments_lexer": "ipython3",
291 | "version": "3.6.8"
292 | },
293 | "toc": {
294 | "base_numbering": 1,
295 | "nav_menu": {},
296 | "number_sections": true,
297 | "sideBar": false,
298 | "skip_h1_title": false,
299 | "title_cell": "Table of Contents",
300 | "title_sidebar": "Contents",
301 | "toc_cell": true,
302 | "toc_position": {
303 | "height": "786.465px",
304 | "left": "40.9922px",
305 | "top": "113.535px",
306 | "width": "370.918px"
307 | },
308 | "toc_section_display": true,
309 | "toc_window_display": true
310 | }
311 | },
312 | "nbformat": 4,
313 | "nbformat_minor": 2
314 | }
315 |
--------------------------------------------------------------------------------