├── InfoImputer
│   ├── __init__.py
│   └── Auto.py
├── dist
│   ├── autoimputer-0.0.1.tar.gz
│   └── autoimputer-0.0.1-py3-none-any.whl
├── pyproject.toml
├── license.txt
└── README.md
--------------------------------------------------------------------------------
/InfoImputer/__init__.py:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/dist/autoimputer-0.0.1.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KhashayarRahimi/InfoImputer/HEAD/dist/autoimputer-0.0.1.tar.gz
--------------------------------------------------------------------------------
/dist/autoimputer-0.0.1-py3-none-any.whl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KhashayarRahimi/InfoImputer/HEAD/dist/autoimputer-0.0.1-py3-none-any.whl
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "AutoImputer"
version = "0.0.1"
authors = [
    { name="KhashayarRahimi" },
]
description = "An imputer that fills missing values in each feature by deriving information from other correlated, informative features."
readme = "README.md"
requires-python = ">=3.7"
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
]
--------------------------------------------------------------------------------
/license.txt:
--------------------------------------------------------------------------------
Copyright 2023 Khashayar Rahimi

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## InfoImputer

In data science, one of the most common challenges is preparing datasets for machine learning algorithms, and dealing with missing values is a critical part of that process. To address this challenge, data scientists have developed various imputation techniques that aim to fill missing values as accurately as possible.

Among the popular imputers are:

- **SimpleImputer**: fills missing values using a per-column statistic such as the mean, median, or most frequent value.
- **KNNImputer**: completes missing values using the k-nearest neighbors algorithm.
- **IterativeImputer**: estimates each feature from all the other features in an iterative manner.
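All three ship with scikit-learn. As a quick reference, here is a minimal sketch on a toy array (the values are illustrative):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0], [7.0, np.nan]])
print(SimpleImputer(strategy="mean").fit_transform(X))    # fill with column means
print(KNNImputer(n_neighbors=2).fit_transform(X))         # average the 2 nearest rows
print(IterativeImputer(random_state=0).fit_transform(X))  # round-robin regression
```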
## Introducing InfoImputer

InfoImputer appears to perform better than the imputers above. It is similar in nature to IterativeImputer but comes with some notable differences:

**Handling uncorrelated features:** IterativeImputer exposes a hyperparameter called `n_nearest_features`, which determines how many other features are used to estimate the missing values of each feature column. Using all other columns to estimate the target feature can lead to weak predictions and slower processing, especially when the features are uncorrelated. InfoImputer instead offers two approaches: one sets an absolute correlation coefficient threshold to select only the most relevant features for estimation, and the other selects the n most informative features for each specific feature. Both make the imputation process more effective and efficient.

**Separate estimators for classification and regression:** IterativeImputer uses a single estimator for both categorical and numerical columns. InfoImputer recognizes the different nature of classification and regression tasks and employs a separate estimator for each, which leads to more accurate imputed values.

**Automated conversion of categorical values:** With IterativeImputer, categorical values must be converted to numeric format manually. InfoImputer automates this step by factorizing categorical values into numeric representations, which simplifies the imputation workflow, particularly when dealing with categorical data.

By addressing these issues, InfoImputer offers an improved approach to handling missing values: it considers both the correlation and the mutual information scores between features, uses separate estimators for classification and regression tasks, and automates the conversion of categorical values to numeric representations.

## The Main Motivation

As I show in the following notebook, correlation can only detect linear dependency between random variables, whereas mutual information can also detect nonlinear relations and dependencies:

https://www.kaggle.com/code/khashayarrahimi94/why-you-should-not-use-correlation

Therefore, besides some automation and ease of use, this imputer adds the mutual information score as a criterion for selecting dependent, informative features for each feature whose missing values we want to fill.
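To see the difference concretely, here is a minimal sketch (the data is synthetic): for y = x² on a symmetric interval, the Pearson correlation is near zero while the estimated mutual information is clearly positive:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(21)
x = rng.uniform(-1, 1, 1000)
y = x ** 2  # fully determined by x, but nonlinearly

print(np.corrcoef(x, y)[0, 1])                         # ~0: correlation misses the dependency
print(mutual_info_regression(x.reshape(-1, 1), y)[0])  # > 0: mutual information detects it
```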
## Install

```shell
pip install Info-Imputer
```

## Example

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingRegressor
from InfoImputer.Auto import Imputer

# Load the data as a pandas DataFrame
data = pd.read_csv(r"your directory")
TargetName = "your target column"  # the label column is left untouched

# To use a correlation coefficient threshold (here threshold = 0.1):
FilledData = Imputer(data, TargetName, 0.1, GradientBoostingRegressor, ExtraTreesClassifier)

# To use the N most informative features by mutual information (here N = 3):
FilledData = Imputer(data, TargetName, 3, GradientBoostingRegressor, ExtraTreesClassifier)
```
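Note that estimator classes, not instances, are passed: `Imputer` instantiates them internally. If you want non-default hyperparameters, one option (a sketch, not a package feature) is `functools.partial`, which still yields a zero-argument callable:

```python
from functools import partial

# hypothetical hyperparameter choices, shown for illustration only
reg = partial(GradientBoostingRegressor, n_estimators=200, random_state=0)
clf = partial(ExtraTreesClassifier, n_estimators=300, random_state=0)
FilledData = Imputer(data, TargetName, 3, reg, clf)
```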
--------------------------------------------------------------------------------
/InfoImputer/Auto.py:
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression
from sklearn.impute import SimpleImputer


def separate_cat_num_cols(data):
    """Return the names of the numerical and categorical columns that contain missing values."""
    Numerical_col = []
    Categorical_col = []
    for col in data.columns:
        if data[col].isnull().values.any():
            if data[col].dtypes == 'O':
                Categorical_col.append(col)
            else:
                Numerical_col.append(col)
    return Numerical_col, Categorical_col


def prepare_data_for_corr(data):
    """Build an all-numeric, fully observed frame from the columns that contain
    missing values, so that correlation / mutual information can be computed on it.
    Categorical columns are factorized; missing values are filled with simple statistics."""
    if not data.isna().any().any():
        return data

    data = data.copy()  # do not mutate the caller's frame
    Numerical_col, Categorical_col = separate_cat_num_cols(data)
    num_col_index = [data.columns.tolist().index(c) for c in Numerical_col]
    cat_col_index = [data.columns.tolist().index(c) for c in Categorical_col]

    # Factorize categorical features; pd.factorize encodes NaN as -1, so restore
    # those entries as NaN (only in the factorized columns, to keep legitimate
    # -1 values in numeric columns intact)
    for col in Categorical_col:
        data[col] = pd.factorize(data[col])[0]
    data[Categorical_col] = data[Categorical_col].replace(-1, np.nan)

    # Mean-impute the numerical columns
    imputer1 = SimpleImputer(missing_values=np.nan, strategy='mean')
    num_data_for_corr = imputer1.fit_transform(data.values[:, num_col_index])
    num_data_for_corr = pd.DataFrame(num_data_for_corr, columns=Numerical_col)

    # Mode-impute the (factorized) categorical columns
    if len(cat_col_index) > 0:
        imputer2 = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
        Cat_data_for_corr = imputer2.fit_transform(data.values[:, cat_col_index])
        Cat_data_for_corr = pd.DataFrame(Cat_data_for_corr, columns=Categorical_col)
        data_for_corr = pd.concat([num_data_for_corr, Cat_data_for_corr], axis=1)
    else:
        data_for_corr = num_data_for_corr

    return data_for_corr.astype(float)


def most_correlated_columns(data, corr_coef):
    """For every column, collect the other columns whose absolute Pearson
    correlation with it exceeds corr_coef."""
    data = prepare_data_for_corr(data)
    corr_table = data.corr('pearson')

    correlated_features = {}
    for col in corr_table.columns:
        a = []
        for other in corr_table.columns:
            # skip the column itself; its self-correlation is always 1
            if other != col and abs(corr_table.loc[other, col]) > corr_coef:
                a.append(other)
        correlated_features[col] = a
    return correlated_features


def mutual_information(Data, n_nearest_features):
    """For every column, rank the other columns by estimated mutual information
    and keep the n_nearest_features most informative ones."""
    Data = prepare_data_for_corr(Data)
    mic_ordered = {}
    for col1 in Data.columns:
        HighInformation = {}
        for col2 in Data.columns:
            if col1 != col2:
                score = mutual_info_regression(Data[col1].values.reshape(-1, 1),
                                               Data[col2].values, random_state=21)
                HighInformation[col2] = score[0]
        # sort the other columns by mutual information, highest first
        sorted_mic = sorted(HighInformation, key=HighInformation.get, reverse=True)
        mic_ordered[col1] = sorted_mic[:n_nearest_features]
    return mic_ordered
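# Illustrative shapes (hypothetical column names, not produced by real data):
# both selection helpers map each column to a list of predictor columns, e.g.
#     most_correlated_columns(df, 0.1) -> {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}
#     mutual_information(df, 2)        -> {'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['b', 'a']}
# The fill_nan_* functions below consume these dicts.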
def fill_nan_numeric_cols(data, regression_estimator, S):
    """Fill the missing values of each numerical column by predicting them from
    its selected predictor columns with the given regression estimator."""
    data_org = data.copy()
    Numerical_col, Categorical_col = separate_cat_num_cols(data)
    correlated_features = S

    # Factorize categorical features so they can serve as predictors;
    # pd.factorize encodes NaN as -1, so restore those entries as NaN
    for col in Categorical_col:
        data[col] = pd.factorize(data[col])[0]
    data[Categorical_col] = data[Categorical_col].replace(-1, np.nan)

    for i in Numerical_col:
        m = correlated_features[i]
        if len(m) == 0:
            # no sufficiently related predictors: fall back to the column mean
            data[i] = data[i].fillna(data[i].mean())
            continue

        num_data = data[m].copy()
        label = data[i]
        # If you need to scale the numerical predictor columns, this is the place to do it.

        # Row positions where column i is missing vs. observed
        nan = np.flatnonzero(data[i].isnull().values)
        fill = np.flatnonzero(data[i].notnull().values)

        # Temporarily fill NaNs in the predictor columns with the mode
        imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
        num_data = pd.DataFrame(imputer.fit_transform(num_data.values))
        num_data[i] = label.values

        # Train the regression model on the observed rows and predict the missing ones
        X = num_data.values[fill, :-1]
        Y = num_data.values[fill, -1]
        reg = regression_estimator
        reg.fit(X, Y)
        predicted = reg.predict(num_data.values[nan, :-1])
        data.iloc[nan, data.columns.get_loc(i)] = predicted

    # Restore the original (non-factorized) categorical columns
    data.drop(Categorical_col, axis=1, inplace=True)
    for i in Categorical_col:
        data[i] = data_org[i]

    return data


def fill_nan_categoric_cols(data, classification_estimator, S):
    """Fill the missing values of each categorical column by predicting its
    factorized codes from the selected predictor columns with the given classifier."""
    Numerical_col, Categorical_col = separate_cat_num_cols(data)
    correlated_features = S

    # Factorize categorical features; pd.factorize encodes NaN as -1
    for col in Categorical_col:
        data[col] = pd.factorize(data[col])[0]
    data[Categorical_col] = data[Categorical_col].replace(-1, np.nan)

    for i in Categorical_col:
        m = correlated_features[i]
        if len(m) == 0:
            # no sufficiently related predictors: fall back to the column mode
            data[i] = data[i].fillna(data[i].mode()[0])
            continue

        cat_data = data[m].copy()
        label = data[i]

        # Row positions where column i is missing vs. observed
        nan = np.flatnonzero(data[i].isnull().values)
        fill = np.flatnonzero(data[i].notnull().values)

        # Temporarily fill NaNs in the predictor columns with the mode
        imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
        cat_data = pd.DataFrame(imputer.fit_transform(cat_data.values))
        cat_data[i] = label.values

        # Train the classifier on the observed rows and predict the missing ones
        X = cat_data.values[fill, :-1]
        Y = cat_data.values[fill, -1]
        classify = classification_estimator
        classify.fit(X, Y)
        predicted = classify.predict(cat_data.values[nan, :-1])
        data.iloc[nan, data.columns.get_loc(i)] = predicted

    return data
def Imputer(train, label, similarity, regression_estimator, classification_estimator):
    """Impute every feature of `train` except the `label` column, which is kept as-is.

    If `similarity` < 1 it is treated as an absolute correlation coefficient threshold;
    if it is an integer >= 1 it is treated as the number of most informative features
    (by mutual information) to use for each column."""
    All = train.copy()  # do not mutate the caller's frame
    Label = All[label].reset_index(drop=True)
    All.drop([label], axis=1, inplace=True)
    All = All.reset_index(drop=True)
    All_org = All.copy()

    # A similarity value < 1 is interpreted as a correlation coefficient threshold
    if similarity < 1:
        s = most_correlated_columns(All, similarity)
    # An integer value >= 1 is interpreted as the number of nearest features
    else:
        s = mutual_information(All, similarity)

    numeric = fill_nan_numeric_cols(All_org, regression_estimator(), s)
    categoric = fill_nan_categoric_cols(numeric, classification_estimator(), s)
    categoric[label] = Label

    return categoric
--------------------------------------------------------------------------------