├── LICENSE
├── README.md
├── datasets
│   ├── iris.csv
│   └── nuclear.csv
└── gaFeatureSelection.py

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018 Renato Sousa

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# Genetic Algorithm For Feature Selection
> Search for the best feature subset for your classification model

## Description
Feature selection is the process of finding the most relevant variables for a predictive model. These techniques can be used to identify and remove unneeded, irrelevant, and redundant features that do not contribute to the accuracy of the predictive model, or may even decrease it.

In nature, the genes of organisms tend to evolve over successive generations to better adapt to the environment. The Genetic Algorithm is a heuristic optimization method inspired by the procedures of natural evolution.

In feature selection, the function to optimize is the generalization performance of a predictive model. More specifically, we want to minimize the error of the model on an independent data set not used to create the model.


## Dependencies
[Pandas](https://pandas.pydata.org/)

[Numpy](http://www.numpy.org/)

[scikit-learn](http://scikit-learn.org/stable/)

[Deap](https://deap.readthedocs.io/en/master/)


## Usage
1. Go to the repository folder
1. Run:
```
python gaFeatureSelection.py path n_population n_generation
```
Notes:
- `path` should be the path to a dataset in CSV format (the last column is used as the class label)
- `n_population` and `n_generation` must be integers
- You can edit the code and change the classifier so that the search is optimized for your model (see the sketch below).
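
For example, a minimal sketch of that change (swapping in `RandomForestClassifier` is just an illustration; the script itself ships with `LogisticRegression`):
```
# inside getFitness() in gaFeatureSelection.py, any scikit-learn
# estimator can take the place of the default classifier:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100)  # was: clf = LogisticRegression()
```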

## Usage Example
```
python gaFeatureSelection.py datasets/nuclear.csv 20 6
```
Returns:
```
Accuracy with all features: (0.90833333333333344,)

gen  nevals  avg       min       max
0    20      0.849167  0.683333  0.941667
1    12      0.919167  0.766667  0.966667
2    18      0.934167  0.908333  0.966667
3    9       0.941667  0.908333  0.966667
4    9       0.946667  0.908333  0.966667
5    12      0.955833  0.908333  0.966667
6    12      0.9625    0.883333  0.966667
Best Accuracy: (0.96666666666666679,)
Number of Features in Subset: 5
Individual: [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
Feature Subset: ['cost', 'date', 't1', 'pr', 'ne']


creating a new classifier with the result
Accuracy with Feature Subset: 0.966666666667

```
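
## How an individual encodes a feature subset
Each individual in the population is a bit vector with one bit per column of `X`: a `1` keeps the feature and a `0` drops it, and its fitness is the mean 5-fold cross-validation accuracy of a `LogisticRegression` trained on the kept columns. A minimal sketch of the decoding step (the variable names here are illustrative, not the script's own):
```
import pandas as pd

df = pd.read_csv('datasets/nuclear.csv')
X = df.iloc[:, :-1]  # last column is the label

individual = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]  # one bit per feature
subset = [col for col, bit in zip(X.columns, individual) if bit == 1]
print(subset)  # ['cost', 'date', 't1', 'pr', 'ne']
```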
"60",5.2,2.7,3.9,1.4,"versicolor" 62 | "61",5,2,3.5,1,"versicolor" 63 | "62",5.9,3,4.2,1.5,"versicolor" 64 | "63",6,2.2,4,1,"versicolor" 65 | "64",6.1,2.9,4.7,1.4,"versicolor" 66 | "65",5.6,2.9,3.6,1.3,"versicolor" 67 | "66",6.7,3.1,4.4,1.4,"versicolor" 68 | "67",5.6,3,4.5,1.5,"versicolor" 69 | "68",5.8,2.7,4.1,1,"versicolor" 70 | "69",6.2,2.2,4.5,1.5,"versicolor" 71 | "70",5.6,2.5,3.9,1.1,"versicolor" 72 | "71",5.9,3.2,4.8,1.8,"versicolor" 73 | "72",6.1,2.8,4,1.3,"versicolor" 74 | "73",6.3,2.5,4.9,1.5,"versicolor" 75 | "74",6.1,2.8,4.7,1.2,"versicolor" 76 | "75",6.4,2.9,4.3,1.3,"versicolor" 77 | "76",6.6,3,4.4,1.4,"versicolor" 78 | "77",6.8,2.8,4.8,1.4,"versicolor" 79 | "78",6.7,3,5,1.7,"versicolor" 80 | "79",6,2.9,4.5,1.5,"versicolor" 81 | "80",5.7,2.6,3.5,1,"versicolor" 82 | "81",5.5,2.4,3.8,1.1,"versicolor" 83 | "82",5.5,2.4,3.7,1,"versicolor" 84 | "83",5.8,2.7,3.9,1.2,"versicolor" 85 | "84",6,2.7,5.1,1.6,"versicolor" 86 | "85",5.4,3,4.5,1.5,"versicolor" 87 | "86",6,3.4,4.5,1.6,"versicolor" 88 | "87",6.7,3.1,4.7,1.5,"versicolor" 89 | "88",6.3,2.3,4.4,1.3,"versicolor" 90 | "89",5.6,3,4.1,1.3,"versicolor" 91 | "90",5.5,2.5,4,1.3,"versicolor" 92 | "91",5.5,2.6,4.4,1.2,"versicolor" 93 | "92",6.1,3,4.6,1.4,"versicolor" 94 | "93",5.8,2.6,4,1.2,"versicolor" 95 | "94",5,2.3,3.3,1,"versicolor" 96 | "95",5.6,2.7,4.2,1.3,"versicolor" 97 | "96",5.7,3,4.2,1.2,"versicolor" 98 | "97",5.7,2.9,4.2,1.3,"versicolor" 99 | "98",6.2,2.9,4.3,1.3,"versicolor" 100 | "99",5.1,2.5,3,1.1,"versicolor" 101 | "100",5.7,2.8,4.1,1.3,"versicolor" 102 | "101",6.3,3.3,6,2.5,"virginica" 103 | "102",5.8,2.7,5.1,1.9,"virginica" 104 | "103",7.1,3,5.9,2.1,"virginica" 105 | "104",6.3,2.9,5.6,1.8,"virginica" 106 | "105",6.5,3,5.8,2.2,"virginica" 107 | "106",7.6,3,6.6,2.1,"virginica" 108 | "107",4.9,2.5,4.5,1.7,"virginica" 109 | "108",7.3,2.9,6.3,1.8,"virginica" 110 | "109",6.7,2.5,5.8,1.8,"virginica" 111 | "110",7.2,3.6,6.1,2.5,"virginica" 112 | "111",6.5,3.2,5.1,2,"virginica" 113 | "112",6.4,2.7,5.3,1.9,"virginica" 114 | "113",6.8,3,5.5,2.1,"virginica" 115 | "114",5.7,2.5,5,2,"virginica" 116 | "115",5.8,2.8,5.1,2.4,"virginica" 117 | "116",6.4,3.2,5.3,2.3,"virginica" 118 | "117",6.5,3,5.5,1.8,"virginica" 119 | "118",7.7,3.8,6.7,2.2,"virginica" 120 | "119",7.7,2.6,6.9,2.3,"virginica" 121 | "120",6,2.2,5,1.5,"virginica" 122 | "121",6.9,3.2,5.7,2.3,"virginica" 123 | "122",5.6,2.8,4.9,2,"virginica" 124 | "123",7.7,2.8,6.7,2,"virginica" 125 | "124",6.3,2.7,4.9,1.8,"virginica" 126 | "125",6.7,3.3,5.7,2.1,"virginica" 127 | "126",7.2,3.2,6,1.8,"virginica" 128 | "127",6.2,2.8,4.8,1.8,"virginica" 129 | "128",6.1,3,4.9,1.8,"virginica" 130 | "129",6.4,2.8,5.6,2.1,"virginica" 131 | "130",7.2,3,5.8,1.6,"virginica" 132 | "131",7.4,2.8,6.1,1.9,"virginica" 133 | "132",7.9,3.8,6.4,2,"virginica" 134 | "133",6.4,2.8,5.6,2.2,"virginica" 135 | "134",6.3,2.8,5.1,1.5,"virginica" 136 | "135",6.1,2.6,5.6,1.4,"virginica" 137 | "136",7.7,3,6.1,2.3,"virginica" 138 | "137",6.3,3.4,5.6,2.4,"virginica" 139 | "138",6.4,3.1,5.5,1.8,"virginica" 140 | "139",6,3,4.8,1.8,"virginica" 141 | "140",6.9,3.1,5.4,2.1,"virginica" 142 | "141",6.7,3.1,5.6,2.4,"virginica" 143 | "142",6.9,3.1,5.1,2.3,"virginica" 144 | "143",5.8,2.7,5.1,1.9,"virginica" 145 | "144",6.8,3.2,5.9,2.3,"virginica" 146 | "145",6.7,3.3,5.7,2.5,"virginica" 147 | "146",6.7,3,5.2,2.3,"virginica" 148 | "147",6.3,2.5,5,1.9,"virginica" 149 | "148",6.5,3,5.2,2,"virginica" 150 | "149",6.2,3.4,5.4,2.3,"virginica" 151 | "150",5.9,3,5.1,1.8,"virginica" 152 | 
--------------------------------------------------------------------------------
/datasets/nuclear.csv:
--------------------------------------------------------------------------------
cost,date,t1,t2,cap,pr,ne,ct,bw,cum.n,pt
460.05,68.58,14,46,687,0,1,0,0,14,0
452.99,67.33,10,73,1065,0,0,1,0,1,0
443.22,67.33,10,85,1065,1,0,1,0,1,0
652.32,68,11,67,1065,0,1,1,0,12,0
642.23,68,11,78,1065,1,1,1,0,12,0
345.39,67.92,13,51,514,0,1,1,0,3,0
272.37,68.17,12,50,822,0,0,0,0,5,0
317.21,68.42,14,59,457,0,0,0,0,1,0
457.12,68.42,15,55,822,1,0,0,0,5,0
690.19,68.33,12,71,792,0,1,1,1,2,0
350.63,68.58,12,64,560,0,0,0,0,3,0
402.59,68.75,13,47,790,0,1,0,0,6,0
412.18,68.42,15,62,530,0,0,1,0,2,0
495.58,68.92,17,52,1050,0,0,0,0,7,0
394.36,68.92,13,65,850,0,0,0,1,16,0
423.32,68.42,11,67,778,0,0,0,0,3,0
712.27,69.5,18,60,845,0,1,0,0,17,0
289.66,68.42,15,76,530,1,0,1,0,2,0
881.24,69.17,15,67,1090,0,0,0,0,1,0
490.88,68.92,16,59,1050,1,0,0,0,8,0
567.79,68.75,11,70,913,0,0,1,1,15,0
665.99,70.92,22,57,828,1,1,0,0,20,0
621.45,69.67,16,59,786,0,0,1,0,18,0
608.8,70.08,19,58,821,1,0,0,0,3,0
473.64,70.42,19,44,538,0,0,1,0,19,0
697.14,71.08,20,57,1130,0,0,1,0,21,0
207.51,67.25,13,63,745,0,0,0,0,8,1
288.48,67.17,9,48,821,0,0,1,0,7,1
284.88,67.83,12,63,886,0,0,0,1,11,1
280.36,67.83,12,71,886,1,0,0,1,11,1
217.38,67.25,13,72,745,1,0,0,0,8,1
270.71,67.83,7,80,886,1,0,0,1,11,1
--------------------------------------------------------------------------------
/gaFeatureSelection.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np
import random
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from deap import creator, base, tools, algorithms
import sys


def avg(l):
    """
    Returns the average of the list elements
    """
    return sum(l) / float(len(l))


def getFitness(individual, X, y):
    """
    Feature subset fitness function
    """
    if individual.count(0) != len(individual):
        # get the indexes of the features to drop (bit == 0)
        cols = [index for index in range(
            len(individual)) if individual[index] == 0]

        # get the feature subset
        X_parsed = X.drop(X.columns[cols], axis=1)
        X_subset = pd.get_dummies(X_parsed)

        # apply classification algorithm
        clf = LogisticRegression()

        # DEAP expects the fitness to be a tuple
        return (avg(cross_val_score(clf, X_subset, y, cv=5)),)
    else:
        # an all-zero individual selects no features at all
        return (0,)


def geneticAlgorithm(X, y, n_population, n_generation):
    """
    Set up the DEAP toolbox and run the eaSimple evolutionary loop
    """
    # create individual
    creator.create("FitnessMax", base.Fitness, weights=(1.0,))
    creator.create("Individual", list, fitness=creator.FitnessMax)

    # create toolbox
    toolbox = base.Toolbox()
    toolbox.register("attr_bool", random.randint, 0, 1)
    toolbox.register("individual", tools.initRepeat,
                     creator.Individual, toolbox.attr_bool, len(X.columns))
    toolbox.register("population", tools.initRepeat, list,
                     toolbox.individual)
    toolbox.register("evaluate", getFitness, X=X, y=y)
    toolbox.register("mate", tools.cxOnePoint)
    toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
    toolbox.register("select", tools.selTournament, tournsize=3)
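    # Operator choices, for reference:
    # - cxOnePoint cuts two parent bit vectors at one random position and
    #   swaps the tails, e.g. 1100|10 x 0011|01 -> 110001 and 001110
    # - mutFlipBit flips each bit independently with probability indpb=0.05
    # - selTournament picks each parent as the fittest of 3 random individuals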

    # initialize parameters
    pop = toolbox.population(n=n_population)
    # the hall of fame keeps the best individuals ever seen during the run
    hof = tools.HallOfFame(n_population * n_generation)
    stats = tools.Statistics(lambda ind: ind.fitness.values)
    stats.register("avg", np.mean)
    stats.register("min", np.min)
    stats.register("max", np.max)

    # genetic algorithm
    pop, log = algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2,
                                   ngen=n_generation, stats=stats,
                                   halloffame=hof, verbose=True)

    # return hall of fame
    return hof


def bestIndividual(hof, X, y):
    """
    Get the best individual
    """
    maxAccuracy = 0.0
    for individual in hof:
        # fitness.values is a tuple; compare its first (and only) entry
        if individual.fitness.values[0] > maxAccuracy:
            maxAccuracy = individual.fitness.values[0]
            _individual = individual

    _individualHeader = [list(X)[i] for i in range(
        len(_individual)) if _individual[i] == 1]
    return _individual.fitness.values, _individual, _individualHeader


def getArguments():
    """
    Get arguments from the command line
    If only the dataframe path is passed, pop and gen get default values
    """
    dfPath = sys.argv[1]
    if len(sys.argv) == 4:
        pop = int(sys.argv[2])
        gen = int(sys.argv[3])
    else:
        pop = 10
        gen = 2
    return dfPath, pop, gen


if __name__ == '__main__':
    # get dataframe path, population size and generation count from the command-line arguments
    dataframePath, n_pop, n_gen = getArguments()
    # read dataframe from csv
    df = pd.read_csv(dataframePath, sep=',')

    # encode the labels column to numbers
    le = LabelEncoder()
    le.fit(df.iloc[:, -1])
    y = le.transform(df.iloc[:, -1])
    X = df.iloc[:, :-1]

    # get accuracy with all features
    individual = [1 for i in range(len(X.columns))]
    print("Accuracy with all features: \t" +
          str(getFitness(individual, X, y)) + "\n")

    # apply genetic algorithm
    hof = geneticAlgorithm(X, y, n_pop, n_gen)

    # select the best individual
    accuracy, individual, header = bestIndividual(hof, X, y)
    print('Best Accuracy: \t' + str(accuracy))
    print('Number of Features in Subset: \t' + str(individual.count(1)))
    print('Individual: \t\t' + str(individual))
    print('Feature Subset: \t' + str(header))

    print('\n\ncreating a new classifier with the result')

    # read dataframe from csv one more time
    df = pd.read_csv(dataframePath, sep=',')

    # keep only the selected feature subset
    X = df[header]

    clf = LogisticRegression()

    scores = cross_val_score(clf, X, y, cv=5)
    print("Accuracy with Feature Subset: \t" + str(avg(scores)) + "\n")
--------------------------------------------------------------------------------