├── InfoImputer
│   ├── __init__.py
│   └── Auto.py
├── dist
│   ├── autoimputer-0.0.1.tar.gz
│   └── autoimputer-0.0.1-py3-none-any.whl
├── pyproject.toml
├── license.txt
└── README.md
--------------------------------------------------------------------------------
/InfoImputer/__init__.py:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/dist/autoimputer-0.0.1.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KhashayarRahimi/InfoImputer/HEAD/dist/autoimputer-0.0.1.tar.gz
--------------------------------------------------------------------------------
/dist/autoimputer-0.0.1-py3-none-any.whl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KhashayarRahimi/InfoImputer/HEAD/dist/autoimputer-0.0.1-py3-none-any.whl
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "AutoImputer"
version = "0.0.1"
authors = [
    { name="KhashayarRahimi" },
]
description = "An imputer that fills missing values in each feature by deriving information from other correlated, informative features."
readme = "README.md"
requires-python = ">=3.7"
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
]
--------------------------------------------------------------------------------
/license.txt:
--------------------------------------------------------------------------------
Copyright 2023 Khashayar Rahimi

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## InfoImputer

In data science, one of the most common challenges is preparing datasets for machine learning algorithms, and dealing with missing values is a critical part of that process. To address this challenge, data scientists have developed various imputation techniques that aim to fill missing values as accurately as possible.

Among the popular imputers are:

- **SimpleImputer**: fills missing values using a per-column statistic such as the mean, median, or most frequent value.
- **KNNImputer**: completes missing values using the k-nearest neighbors algorithm.
- **IterativeImputer**: estimates each feature from all the other features in an iterative manner.
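All three ship with scikit-learn. As a quick reference, here is a minimal sketch on a toy array (the values are illustrative):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0], [7.0, np.nan]])
print(SimpleImputer(strategy="mean").fit_transform(X))    # fill with column means
print(KNNImputer(n_neighbors=2).fit_transform(X))         # average the 2 nearest rows
print(IterativeImputer(random_state=0).fit_transform(X))  # round-robin regression
```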
## Introducing InfoImputer

InfoImputer appears to perform better than the imputers above. It is similar in nature to IterativeImputer but comes with some notable differences:

**Handling uncorrelated features:** IterativeImputer exposes a hyperparameter called `n_nearest_features`, which determines how many other features are used to estimate the missing values of each feature column. Using all other columns to estimate the target feature can lead to weak predictions and slower processing, especially when the features are uncorrelated. InfoImputer instead offers two approaches: one sets an absolute correlation coefficient threshold to select only the most relevant features for estimation, and the other selects the n most informative features for each specific feature. Both make the imputation process more effective and efficient.

**Separate estimators for classification and regression:** IterativeImputer uses a single estimator for both categorical and numerical columns. InfoImputer recognizes the different nature of classification and regression tasks and employs a separate estimator for each, which leads to more accurate imputed values.

**Automated conversion of categorical values:** With IterativeImputer, categorical values must be converted to numeric format manually. InfoImputer automates this step by factorizing categorical values into numeric representations, which simplifies the imputation workflow, particularly when dealing with categorical data.

By addressing these issues, InfoImputer offers an improved approach to handling missing values: it considers both the correlation and the mutual information scores between features, uses separate estimators for classification and regression tasks, and automates the conversion of categorical values to numeric representations.

## The Main Motivation

As I show in the following notebook, correlation can only detect linear dependency between random variables, whereas mutual information can also detect nonlinear relations and dependencies:

https://www.kaggle.com/code/khashayarrahimi94/why-you-should-not-use-correlation

Therefore, besides some automation and ease of use, this imputer adds the mutual information score as a criterion for selecting dependent, informative features for each feature whose missing values we want to fill.
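To see the difference concretely, here is a minimal sketch (the data is synthetic): for y = x² on a symmetric interval, the Pearson correlation is near zero while the estimated mutual information is clearly positive:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(21)
x = rng.uniform(-1, 1, 1000)
y = x ** 2  # fully determined by x, but nonlinearly

print(np.corrcoef(x, y)[0, 1])                         # ~0: correlation misses the dependency
print(mutual_info_regression(x.reshape(-1, 1), y)[0])  # > 0: mutual information detects it
```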
## Install

```shell
pip install Info-Imputer
```

## Example

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingRegressor
from InfoImputer.Auto import Imputer

# Load the data as a pandas DataFrame
data = pd.read_csv(r"your directory")
TargetName = "your target column"  # the label column is left untouched

# To use a correlation coefficient threshold (here threshold = 0.1):
FilledData = Imputer(data, TargetName, 0.1, GradientBoostingRegressor, ExtraTreesClassifier)

# To use the N most informative features by mutual information (here N = 3):
FilledData = Imputer(data, TargetName, 3, GradientBoostingRegressor, ExtraTreesClassifier)
```
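Note that estimator classes, not instances, are passed: `Imputer` instantiates them internally. If you want non-default hyperparameters, one option (a sketch, not a package feature) is `functools.partial`, which still yields a zero-argument callable:

```python
from functools import partial

# hypothetical hyperparameter choices, shown for illustration only
reg = partial(GradientBoostingRegressor, n_estimators=200, random_state=0)
clf = partial(ExtraTreesClassifier, n_estimators=300, random_state=0)
FilledData = Imputer(data, TargetName, 3, reg, clf)
```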
--------------------------------------------------------------------------------
/InfoImputer/Auto.py:
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression
from sklearn.impute import SimpleImputer


def separate_cat_num_cols(data):
    """Return the names of the numerical and categorical columns that contain missing values."""
    Numerical_col = []
    Categorical_col = []
    for col in data.columns:
        if data[col].isnull().values.any():
            if data[col].dtypes == 'O':
                Categorical_col.append(col)
            else:
                Numerical_col.append(col)
    return Numerical_col, Categorical_col


def prepare_data_for_corr(data):
    """Build an all-numeric, fully observed frame from the columns that contain
    missing values, so that correlation / mutual information can be computed on it.
    Categorical columns are factorized; missing values are filled with simple statistics."""
    if not data.isna().any().any():
        return data

    data = data.copy()  # do not mutate the caller's frame
    Numerical_col, Categorical_col = separate_cat_num_cols(data)
    num_col_index = [data.columns.tolist().index(c) for c in Numerical_col]
    cat_col_index = [data.columns.tolist().index(c) for c in Categorical_col]

    # Factorize categorical features; pd.factorize encodes NaN as -1, so restore
    # those entries as NaN (only in the factorized columns, to keep legitimate
    # -1 values in numeric columns intact)
    for col in Categorical_col:
        data[col] = pd.factorize(data[col])[0]
    data[Categorical_col] = data[Categorical_col].replace(-1, np.nan)

    # Mean-impute the numerical columns
    imputer1 = SimpleImputer(missing_values=np.nan, strategy='mean')
    num_data_for_corr = imputer1.fit_transform(data.values[:, num_col_index])
    num_data_for_corr = pd.DataFrame(num_data_for_corr, columns=Numerical_col)

    # Mode-impute the (factorized) categorical columns
    if len(cat_col_index) > 0:
        imputer2 = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
        Cat_data_for_corr = imputer2.fit_transform(data.values[:, cat_col_index])
        Cat_data_for_corr = pd.DataFrame(Cat_data_for_corr, columns=Categorical_col)
        data_for_corr = pd.concat([num_data_for_corr, Cat_data_for_corr], axis=1)
    else:
        data_for_corr = num_data_for_corr

    return data_for_corr.astype(float)


def most_correlated_columns(data, corr_coef):
    """For every column, collect the other columns whose absolute Pearson
    correlation with it exceeds corr_coef."""
    data = prepare_data_for_corr(data)
    corr_table = data.corr('pearson')

    correlated_features = {}
    for col in corr_table.columns:
        a = []
        for other in corr_table.columns:
            # skip the column itself; its self-correlation is always 1
            if other != col and abs(corr_table.loc[other, col]) > corr_coef:
                a.append(other)
        correlated_features[col] = a
    return correlated_features


def mutual_information(Data, n_nearest_features):
    """For every column, rank the other columns by estimated mutual information
    and keep the n_nearest_features most informative ones."""
    Data = prepare_data_for_corr(Data)
    mic_ordered = {}
    for col1 in Data.columns:
        HighInformation = {}
        for col2 in Data.columns:
            if col1 != col2:
                score = mutual_info_regression(Data[col1].values.reshape(-1, 1),
                                               Data[col2].values, random_state=21)
                HighInformation[col2] = score[0]
        # sort the other columns by mutual information, highest first
        sorted_mic = sorted(HighInformation, key=HighInformation.get, reverse=True)
        mic_ordered[col1] = sorted_mic[:n_nearest_features]
    return mic_ordered
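# Illustrative shapes (hypothetical column names, not produced by real data):
# both selection helpers map each column to a list of predictor columns, e.g.
#     most_correlated_columns(df, 0.1) -> {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}
#     mutual_information(df, 2)        -> {'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['b', 'a']}
# The fill_nan_* functions below consume these dicts.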
def fill_nan_numeric_cols(data, regression_estimator, S):
    """Fill the missing values of each numerical column by predicting them from
    its selected predictor columns with the given regression estimator."""
    data_org = data.copy()
    Numerical_col, Categorical_col = separate_cat_num_cols(data)
    correlated_features = S

    # Factorize categorical features so they can serve as predictors;
    # pd.factorize encodes NaN as -1, so restore those entries as NaN
    for col in Categorical_col:
        data[col] = pd.factorize(data[col])[0]
    data[Categorical_col] = data[Categorical_col].replace(-1, np.nan)

    for i in Numerical_col:
        m = correlated_features[i]
        if len(m) == 0:
            # no sufficiently related predictors: fall back to the column mean
            data[i] = data[i].fillna(data[i].mean())
            continue

        num_data = data[m].copy()
        label = data[i]
        # If you need to scale the numerical predictor columns, this is the place to do it.

        # Row positions where column i is missing vs. observed
        nan = np.flatnonzero(data[i].isnull().values)
        fill = np.flatnonzero(data[i].notnull().values)

        # Temporarily fill NaNs in the predictor columns with the mode
        imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
        num_data = pd.DataFrame(imputer.fit_transform(num_data.values))
        num_data[i] = label.values

        # Train the regression model on the observed rows and predict the missing ones
        X = num_data.values[fill, :-1]
        Y = num_data.values[fill, -1]
        reg = regression_estimator
        reg.fit(X, Y)
        predicted = reg.predict(num_data.values[nan, :-1])
        data.iloc[nan, data.columns.get_loc(i)] = predicted

    # Restore the original (non-factorized) categorical columns
    data.drop(Categorical_col, axis=1, inplace=True)
    for i in Categorical_col:
        data[i] = data_org[i]

    return data


def fill_nan_categoric_cols(data, classification_estimator, S):
    """Fill the missing values of each categorical column by predicting its
    factorized codes from the selected predictor columns with the given classifier."""
    Numerical_col, Categorical_col = separate_cat_num_cols(data)
    correlated_features = S

    # Factorize categorical features; pd.factorize encodes NaN as -1
    for col in Categorical_col:
        data[col] = pd.factorize(data[col])[0]
    data[Categorical_col] = data[Categorical_col].replace(-1, np.nan)

    for i in Categorical_col:
        m = correlated_features[i]
        if len(m) == 0:
            # no sufficiently related predictors: fall back to the column mode
            data[i] = data[i].fillna(data[i].mode()[0])
            continue

        cat_data = data[m].copy()
        label = data[i]

        # Row positions where column i is missing vs. observed
        nan = np.flatnonzero(data[i].isnull().values)
        fill = np.flatnonzero(data[i].notnull().values)

        # Temporarily fill NaNs in the predictor columns with the mode
        imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
        cat_data = pd.DataFrame(imputer.fit_transform(cat_data.values))
        cat_data[i] = label.values

        # Train the classifier on the observed rows and predict the missing ones
        X = cat_data.values[fill, :-1]
        Y = cat_data.values[fill, -1]
        classify = classification_estimator
        classify.fit(X, Y)
        predicted = classify.predict(cat_data.values[nan, :-1])
        data.iloc[nan, data.columns.get_loc(i)] = predicted

    return data
def Imputer(train, label, similarity, regression_estimator, classification_estimator):
    """Impute every feature of `train` except the `label` column, which is kept as-is.

    If `similarity` < 1 it is treated as an absolute correlation coefficient threshold;
    if it is an integer >= 1 it is treated as the number of most informative features
    (by mutual information) to use for each column."""
    All = train.copy()  # do not mutate the caller's frame
    Label = All[label].reset_index(drop=True)
    All.drop([label], axis=1, inplace=True)
    All = All.reset_index(drop=True)
    All_org = All.copy()

    # A similarity value < 1 is interpreted as a correlation coefficient threshold
    if similarity < 1:
        s = most_correlated_columns(All, similarity)
    # An integer value >= 1 is interpreted as the number of nearest features
    else:
        s = mutual_information(All, similarity)

    numeric = fill_nan_numeric_cols(All_org, regression_estimator(), s)
    categoric = fill_nan_categoric_cols(numeric, classification_estimator(), s)
    categoric[label] = Label

    return categoric
--------------------------------------------------------------------------------