├── README.md
├── Titanic_Logistic_Model_Fit_AllIn.py
├── data.csv
├── data2.csv
└── logistic.py

/README.md:
--------------------------------------------------------------------------------
1 | # LogisticRegression
2 | Logistic regression from scratch in Python
3 | 
4 | This example uses gradient descent to fit the model.
5 | It also includes scikit-learn's implementation of logistic regression, so the two approaches can be compared.
6 | 
--------------------------------------------------------------------------------
/Titanic_Logistic_Model_Fit_AllIn.py:
--------------------------------------------------------------------------------
1 | # Logistic Regression
2 | 
3 | # Importing the libraries
4 | import numpy as np
5 | import matplotlib.pyplot as plt
6 | import pandas as pd
7 | import seaborn as sns
8 | 
9 | # Importing the dataset
10 | train = pd.read_csv('titanic_train.csv')
11 | 
12 | '''
13 | Logistic regression does not make many of the key assumptions of linear
14 | regression and general linear models that are based on ordinary least squares
15 | algorithms – particularly regarding linearity, normality, homoscedasticity,
16 | and measurement level.
17 | 
18 | First, logistic regression does not require a linear relationship between the
19 | dependent and independent variables.
20 | Second, the error terms (residuals) do not need to be normally distributed.
21 | Third, homoscedasticity is not required. Finally, the dependent variable
22 | in logistic regression is not measured on an interval or ratio scale.
23 | '''
24 | 
25 | # EDA
26 | sns.countplot(x='Survived',data=train,palette='RdBu_r')
27 | sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')
28 | sns.countplot(x='Survived',hue='Pclass',data=train,palette='rainbow')
29 | sns.histplot(train['Age'].dropna(),color='darkred',bins=30)  # histplot replaces the deprecated distplot
30 | sns.countplot(x='SibSp',data=train)
31 | train['Fare'].hist(color='green',bins=40,figsize=(8,4))
32 | sns.boxplot(x='Pclass',y='Age',data=train,palette='winter')
33 | 
34 | 
35 | '''
36 | Binary logistic regression requires the dependent variable to be binary,
37 | and ordinal logistic regression requires the dependent variable to be ordinal.
38 | 
39 | Logistic regression requires the observations to be independent of each
40 | other. In other words, the observations should not come from repeated
41 | measurements or matched data.
42 | 
43 | Logistic regression typically requires a large sample size.
44 | A general guideline is that you need a minimum of 10 cases of the least
45 | frequent outcome for each independent variable in your model. For example,
46 | if you have 5 independent variables and the expected probability of your
47 | least frequent outcome is .10, then you would need a minimum sample
48 | size of 500 (10*5 / .10).
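As a quick illustration of that rule of thumb (the numbers simply repeat the example
above; the variable names are only illustrative):

    n_independent_variables = 5
    p_least_frequent_outcome = 0.10
    minimum_sample_size = 10 * n_independent_variables / p_least_frequent_outcome  # = 500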
49 | '''
50 |
51 |
52 |
53 |
54 |
55 | sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
56 | # Taking care of missing data
57 | 
58 | def impute_age(cols):
59 |     Age = cols['Age']
60 |     Pclass = cols['Pclass']
61 | 
62 |     if pd.isnull(Age):
63 | 
64 |         if Pclass == 1:
65 |             return train.groupby('Pclass')['Age'].mean().iloc[0]
66 | 
67 |         elif Pclass == 2:
68 |             return train.groupby('Pclass')['Age'].mean().iloc[1]
69 | 
70 |         else:
71 |             return train.groupby('Pclass')['Age'].mean().iloc[2]
72 | 
73 |     else:
74 |         return Age
75 | 
76 | 
77 | train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
78 | 
79 | train.drop('Cabin', axis=1, inplace=True)
80 | train.dropna(inplace=True)
81 | 
82 | '''
83 | from sklearn.impute import SimpleImputer
84 | 
85 | imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
86 | imputer = imputer.fit(train['Age'].values.reshape(-1, 1))
87 | train['Age'] = imputer.transform(train['Age'].values.reshape(-1, 1))'''
88 |
89 |
90 |
91 |
92 |
93 |
94 |
95 | X = train.iloc[:, [2, 4, 5, 6, 7, 9, 10]]
96 | y = train.iloc[:, 1]
97 |
98 |
99 |
100 |
101 |
102 |
103 | '''No multicollinearity - also check the condition number
104 | Logistic regression requires there to be little or no multicollinearity
105 | among the independent variables. This means that the independent variables
106 | should not be too highly correlated with each other.
107 | 
108 | We observe it when two or more variables have a high correlation.
109 | If a can be represented using b, there is no point in using both.
110 | Say c and d have a correlation of 90% (imperfect multicollinearity): if c can almost
111 | be represented using d, there is no point in using both.
112 | FIX: a) Drop one of the two variables. b) Transform them into one variable by taking the
113 | mean. c) Keep them both but use caution.
114 | Test: before creating the model, find the correlation between each pair of variables.
115 | '''
116 | multicollinearity_check = train.corr(numeric_only=True)
117 |
118 |
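# Added sketch (not part of the original script): a common numeric follow-up to the
# correlation matrix above is the variance inflation factor (VIF). It only works on an
# all-numeric design matrix, so it would be run on X after the dummy encoding below;
# kept quoted out here, like the SimpleImputer example above.
'''
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns, name='VIF')
print(vif)  # rule of thumb: VIF above ~5-10 points to problematic multicollinearity
'''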
119 | # Encoding categorical data
120 | sex = pd.get_dummies(X['Sex'], prefix = 'Sex', dtype=int)
121 | sex.drop('Sex_male', inplace = True, axis=1)
122 | 
123 | embark = pd.get_dummies(X['Embarked'], prefix = 'Embarked', drop_first=True, dtype=int)
124 | 
125 | passenger_class = pd.get_dummies(X['Pclass'], prefix = 'Pclass', dtype=int)
126 | passenger_class.drop('Pclass_3', inplace = True, axis=1)
127 | 
128 | X.drop(['Sex','Embarked','Pclass'],axis=1,inplace=True)
129 | X = pd.concat([X,sex,embark, passenger_class],axis=1)
130 | 
131 | # Outliers
132 | sns.boxplot(data= X).set_title("Outlier Box Plot")
133 | 
134 | linearity_check_df = pd.concat([pd.DataFrame(X),y],axis=1)
135 | 
136 | '''
137 | Box-Tidwell test
138 | Logistic regression assumes linearity between the independent variables and the log odds.
139 | Although this analysis does not require the dependent and independent
140 | variables to be related linearly, it does require that the independent variables
141 | are linearly related to the log odds.'''
142 | sns.regplot(x= 'Age', y= 'Survived', data= linearity_check_df, logistic= True).set_title("Log Odds Linear Plot")
143 | sns.regplot(x= 'Fare', y= 'Survived', data= linearity_check_df, logistic= True).set_title("Log Odds Linear Plot")
144 | sns.regplot(x= 'Sex_female', y= 'Survived', data= linearity_check_df, logistic= True).set_title("Log Odds Linear Plot")  # Sex_male was dropped above; the remaining dummy is Sex_female
145 |
146 |
147 | # Splitting the dataset into the Training set and Test set
148 | from sklearn.model_selection import train_test_split
149 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
150 |
151 |
152 |
153 |
154 |
155 | # Feature Scaling (must be done after the split, so the test set does not leak into the scaler)
156 | from sklearn.preprocessing import StandardScaler
157 | sc = StandardScaler()
158 | X_train.iloc[:, [0,3]] = sc.fit_transform(X_train.iloc[:, [0,3]])
159 | X_test.iloc[:, [0,3]] = sc.transform(X_test.iloc[:, [0,3]])
160 |
161 |
162 |
163 |
164 |
165 | # Fitting Logistic Regression to the Training set
166 | from sklearn.linear_model import LogisticRegression
167 | classifier = LogisticRegression()
168 | classifier.fit(X_train, y_train)
169 |
170 |
171 |
172 |
173 | # Find relevant features
174 | from sklearn.model_selection import StratifiedKFold
175 | from sklearn.feature_selection import RFECV
176 | 
177 | # The "accuracy" scoring is proportional to the number of correct
178 | # classifications
179 | rfecv = RFECV(estimator=classifier, step=1, cv=StratifiedKFold(2), scoring='accuracy')
180 | rfecv.fit(X_train, y_train)
181 | 
182 | print("Optimal number of features : %d" % rfecv.n_features_)
183 |
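# Added sketch (not part of the original script): rfecv.support_ is a boolean mask over
# the columns of X_train, so the names of the selected features can be listed directly.
print("Selected features:", list(X_train.columns[rfecv.support_]))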
184 | # Plot number of features vs. cross-validation scores
185 | plt.figure()
186 | plt.xlabel("Number of features selected")
187 | plt.ylabel("Cross validation score (nb of correct classifications)")
188 | plt.plot(range(1, len(rfecv.cv_results_['mean_test_score']) + 1), rfecv.cv_results_['mean_test_score'])  # cv_results_ replaces the removed grid_scores_
189 | plt.show()
190 |
191 |
192 |
193 |
194 | from sklearn.feature_selection import RFE
195 | 
196 | rfe = RFE(classifier, n_features_to_select=rfecv.n_features_, step=1)
197 | rfe = rfe.fit(X_train, y_train.values.ravel())
198 | print(rfe.support_)
199 | print(rfe.ranking_)
200 | 
201 | # Can select columns based on the returned mask
202 | # X.loc[:, rfe.support_]
203 |
204 |
205 | # Predicting the Test set results
206 | y_pred = classifier.predict(X_test)
207 |
208 |
209 |
210 |
211 |
212 | # K-Fold cross validation
213 | from sklearn.model_selection import cross_val_score
214 | accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
215 | model_accuracy = accuracies.mean()
216 | model_standard_deviation = accuracies.std()
217 |
218 |
219 |
220 |
221 |
222 |
223 | # Making the Confusion Matrix
224 | from sklearn.metrics import confusion_matrix
225 | cm = confusion_matrix(y_test, y_pred)  # renamed so the imported function is not shadowed
226 | pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
227 | 
228 | from sklearn.metrics import classification_report
229 | print(classification_report(y_test, y_pred))
230 |
231 |
232 | # Generate reports
233 | import statsmodels.api as sm
234 | 
235 | #X_set = X[['Age', 'SibSp', 'Parch', 'Fare', 'Sex_male', 'Embarked_Q', 'Embarked_S', 'Pclass_1', 'Pclass_2']]
236 | X_set = X.loc[:, rfe.support_]
237 | X_set = sm.add_constant(X_set)
238 | 
239 | logit_model=sm.Logit(y,X_set)
240 | result=logit_model.fit()
241 | print(result.summary2())
242 |
243 |
244 | # GETTING THE ODDS RATIOS, P-VALUES, AND 95% CI
245 | model_odds = pd.DataFrame(np.exp(result.params), columns= ['OR'])
246 | model_odds['p-value']= result.pvalues
247 | model_odds[['2.5%', '97.5%']] = np.exp(result.conf_int())
248 |
249 |
250 |
251 |
252 |
253 |
254 |
255 |
256 | # ROC Curve
257 | from sklearn.metrics import roc_auc_score
258 | from sklearn.metrics import roc_curve
259 | area_under_curve = roc_auc_score(y_test, classifier.predict_proba(X_test)[:,1])  # AUC should use predicted probabilities, not hard class predictions
260 | fpr, tpr, thresholds = roc_curve(y_test, classifier.predict_proba(X_test)[:,1])
261 | plt.figure()
262 | plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % area_under_curve)
263 | plt.plot([0, 1], [0, 1],'r--')
264 | plt.xlim([0.0, 1.0])
265 | plt.ylim([0.0, 1.05])
266 | plt.xlabel('False Positive Rate')
267 | plt.ylabel('True Positive Rate')
268 | plt.title('Receiver operating characteristic')
269 | plt.legend(loc="lower right")
270 | #plt.savefig('Log_ROC')
271 | plt.show()
--------------------------------------------------------------------------------
/data.csv:
--------------------------------------------------------------------------------
1 | grade1,grade2,label;;;;
2 | 34.62365962451697,78.0246928153624,0;;;;
3 | 30.28671076822607,43.89499752400101,0;;;;
4 | 35.84740876993872,72.90219802708364,0;;;;
5 | 60.18259938620976,86.30855209546826,1;;;;
6 | 79.0327360507101,75.3443764369103,1;;;;
7 | 45.08327747668339,56.3163717815305,0;;;;
8 | 61.10666453684766,96.51142588489624,1;;;;
9 | 75.02474556738889,46.55401354116538,1;;;;
10 | 76.09878670226257,87.42056971926803,1;;;;
11 | 84.43281996120035,43.53339331072109,1;;;;
12 | 95.86155507093572,38.22527805795094,0;;;;
13 | 75.01365838958247,30.60326323428011,0;;;;
14 | 82.30705337399482,76.48196330235604,1;;;;
15 | 
69.36458875970939,97.71869196188608,1;;;; 16 | 39.53833914367223,76.03681085115882,0;;;; 17 | 53.9710521485623,89.20735013750205,1;;;; 18 | 69.07014406283025,52.74046973016765,1;;;; 19 | 67.94685547711617,46.67857410673128,0;;;; 20 | 70.66150955499435,92.92713789364831,1;;;; 21 | 76.97878372747498,47.57596364975532,1;;;; 22 | 67.37202754570876,42.83843832029179,0;;;; 23 | 89.67677575072079,65.79936592745237,1;;;; 24 | 50.534788289883,48.85581152764205,0;;;; 25 | 34.21206097786789,44.20952859866288,0;;;; 26 | 77.9240914545704,68.9723599933059,1;;;; 27 | 62.27101367004632,69.95445795447587,1;;;; 28 | 80.1901807509566,44.82162893218353,1;;;; 29 | 93.114388797442,38.80067033713209,0;;;; 30 | 61.83020602312595,50.25610789244621,0;;;; 31 | 38.78580379679423,64.99568095539578,0;;;; 32 | 61.379289447425,72.80788731317097,1;;;; 33 | 85.40451939411645,57.05198397627122,1;;;; 34 | 52.10797973193984,63.12762376881715,0;;;; 35 | 52.04540476831827,69.43286012045222,1;;;; 36 | 40.23689373545111,71.16774802184875,0;;;; 37 | 54.63510555424817,52.21388588061123,0;;;; 38 | 33.91550010906887,98.86943574220611,0;;;; 39 | 64.17698887494485,80.90806058670817,1;;;; 40 | 74.78925295941542,41.57341522824434,0;;;; 41 | 34.1836400264419,75.2377203360134,0;;;; 42 | 83.90239366249155,56.30804621605327,1;;;; 43 | 51.54772026906181,46.85629026349976,0;;;; 44 | 94.44336776917852,65.56892160559052,1;;;; 45 | 82.36875375713919,40.61825515970618,0;;;; 46 | 51.04775177128865,45.82270145776001,0;;;; 47 | 62.22267576120188,52.06099194836679,0;;;; 48 | 77.19303492601364,70.45820000180959,1;;;; 49 | 97.77159928000232,86.7278223300282,1;;;; 50 | 62.07306379667647,96.76882412413983,1;;;; 51 | 91.56497449807442,88.69629254546599,1;;;; 52 | 79.94481794066932,74.16311935043758,1;;;; 53 | 99.2725269292572,60.99903099844988,1;;;; 54 | 90.54671411399852,43.39060180650027,1;;;; 55 | 34.52451385320009,60.39634245837173,0;;;; 56 | 50.2864961189907,49.80453881323059,0;;;; 57 | 49.58667721632031,59.80895099453265,0;;;; 58 | 97.64563396007767,68.86157272420604,1;;;; 59 | 32.57720016809309,95.59854761387875,0;;;; 60 | 74.24869136721598,69.82457122657193,1;;;; 61 | 71.79646205863379,78.45356224515052,1;;;; 62 | 75.3956114656803,85.75993667331619,1;;;; 63 | 35.28611281526193,47.02051394723416,0;;;; 64 | 56.25381749711624,39.26147251058019,0;;;; 65 | 30.05882244669796,49.59297386723685,0;;;; 66 | 44.66826172480893,66.45008614558913,0;;;; 67 | 66.56089447242954,41.09209807936973,0;;;; 68 | 40.45755098375164,97.53518548909936,1;;;; 69 | 49.07256321908844,51.88321182073966,0;;;; 70 | 80.27957401466998,92.11606081344084,1;;;; 71 | 66.74671856944039,60.99139402740988,1;;;; 72 | 32.72283304060323,43.30717306430063,0;;;; 73 | 64.0393204150601,78.03168802018232,1;;;; 74 | 72.34649422579923,96.22759296761404,1;;;; 75 | 60.45788573918959,73.09499809758037,1;;;; 76 | 58.84095621726802,75.85844831279042,1;;;; 77 | 99.82785779692128,72.36925193383885,1;;;; 78 | 47.26426910848174,88.47586499559782,1;;;; 79 | 50.45815980285988,75.80985952982456,1;;;; 80 | 60.45555629271532,42.50840943572217,0;;;; 81 | 82.22666157785568,42.71987853716458,0;;;; 82 | 88.9138964166533,69.80378889835472,1;;;; 83 | 94.83450672430196,45.69430680250754,1;;;; 84 | 67.31925746917527,66.58935317747915,1;;;; 85 | 57.23870631569862,59.51428198012956,1;;;; 86 | 80.36675600171273,90.96014789746954,1;;;; 87 | 68.46852178591112,85.59430710452014,1;;;; 88 | 42.0754545384731,78.84478600148043,0;;;; 89 | 75.47770200533905,90.42453899753964,1;;;; 90 | 78.63542434898018,96.64742716885644,1;;;; 91 | 
52.34800398794107,60.76950525602592,0;;;; 92 | 94.09433112516793,77.15910509073893,1;;;; 93 | 90.44855097096364,87.50879176484702,1;;;; 94 | 55.48216114069585,35.57070347228866,0;;;; 95 | 74.49269241843041,84.84513684930135,1;;;; 96 | 89.84580670720979,45.35828361091658,1;;;; 97 | 83.48916274498238,48.38028579728175,1;;;; 98 | 42.2617008099817,87.10385094025457,1;;;; 99 | 99.31500880510394,68.77540947206617,1;;;; 100 | 55.34001756003703,64.9319380069486,1;;;; 101 | 74.77589300092767,89.52981289513276,1;;;; -------------------------------------------------------------------------------- /data2.csv: -------------------------------------------------------------------------------- 1 | ,grade1,grade2,label 2 | 0,-0.869144322982118,0.38930975149006475,0.0 3 | 1,-0.9934673506553683,-0.6105909031833399,0.0 4 | 2,-0.8340643153747651,0.23923557565476328,0.0 5 | 3,-0.13647145074321654,0.6320026981070463,1.0 6 | 4,0.4038867918461304,0.3107842891714938,1.0 7 | 5,-0.5693087928028135,-0.2466808200133095,0.0 8 | 6,-0.10998218810688143,0.9309171798950318,1.0 9 | 7,0.2889936888183451,-0.5326894794040691,1.0 10 | 8,0.3197821648086534,0.6645815752565056,1.0 11 | 9,0.5586856616709217,-0.621184853305208,1.0 12 | 10,0.8863019187215162,-0.7766971680509589,0.0 13 | 11,0.2886758636470277,-1.0,0.0 14 | 12,0.4977484113123718,0.344112270619807,1.0 15 | 13,0.12673956621891636,0.966286559271023,1.0 16 | 14,-0.728260061232338,0.33107060049700765,0.0 17 | 15,-0.3145317379886119,0.7169290367470866,1.0 18 | 16,0.11829901102416285,-0.3514443337710986,1.0 19 | 17,0.08609880702032502,-0.5290402176690145,0.0 20 | 18,0.1639171132145063,0.8259079825262541,1.0 21 | 19,0.3450081700356158,-0.502749318090022,1.0 22 | 20,0.06962078267838256,-0.6415450101705609,0.0 23 | 21,0.7090089609167003,0.031143285177906543,1.0 24 | 22,-0.4130357187712068,-0.46525350337335136,0.0 25 | 23,-0.8809432146991368,-0.601376058902291,0.0 26 | 24,0.3721063726221989,0.12410276860244229,1.0 27 | 25,-0.07660494195881817,0.15287537808354212,1.0 28 | 26,0.43706611543678653,-0.5834433021346668,1.0 29 | 27,0.8075516175398065,-0.759839850347543,0.0 30 | 28,-0.08924113922620536,-0.4242288988478353,0.0 31 | 29,-0.7498322484669888,0.007597656573540945,0.0 32 | 30,-0.10216711916666066,0.23647254645749216,1.0 33 | 31,0.5865404092115298,-0.22512952549316312,1.0 34 | 32,-0.3679385941181339,-0.047130977476119496,0.0 35 | 33,-0.36973236877210347,0.13759408092386627,1.0 36 | 34,-0.7082352869675854,0.18842124283030293,0.0 37 | 35,-0.2954959751361519,-0.3668717066619773,0.0 38 | 36,-0.889444432103401,1.0,0.0 39 | 37,-0.021968233988555852,0.4737840281462229,1.0 40 | 38,0.28224305490771817,-0.6786065018456822,0.0 41 | 39,-0.881757930031901,0.3076595760968148,0.0 42 | 40,0.543480455061824,-0.24692473483001498,1.0 43 | 41,-0.3839989985673473,-0.5238336519501634,0.0 44 | 42,0.8456481446041515,0.02439193781519644,1.0 45 | 43,0.49951711523194486,-0.7065899095408261,0.0 46 | 44,-0.3983311014913291,-0.5541147931880068,0.0 47 | 45,-0.07799059703063604,-0.37135105350764075,0.0 48 | 46,0.351149897449881,0.1676335527058339,1.0 49 | 47,0.941055268813821,0.6442860946754201,1.0 50 | 48,-0.08227937539124852,0.9384581985221923,1.0 51 | 49,0.7631360887428289,0.7019565379746755,1.0 52 | 50,0.4300325421888376,0.27617689745532337,1.0 53 | 51,0.9840808787200093,-0.109492545209249,1.0 54 | 52,0.7339466244205979,-0.6253682284374302,1.0 55 | 53,-0.8719864368459312,-0.12714956384488962,0.0 56 | 54,-0.4201532651052198,-0.4374585574803942,0.0 57 | 55,-0.44021428212107705,-0.1443584227060153,0.0 58 | 
56,0.9374443454495434,0.12085702433321455,1.0 59 | 57,-0.9278081541831988,0.9041725057033894,0.0 60 | 58,0.26674730985445905,0.14907007530671335,1.0 61 | 59,0.19645167522879436,0.4018743765168544,1.0 62 | 60,0.2996249350848226,0.6159298643155111,1.0 63 | 61,-0.8501544319102516,-0.5190223763886008,0.0 64 | 62,-0.2490939592635648,-0.7463396889493163,0.0 65 | 63,-1.0,-0.4436567941244416,0.0 66 | 64,-0.5812056392990044,0.05020749206781838,0.0 67 | 65,0.046368832319386266,-0.6927076922652486,0.0 68 | 66,-0.701909923654394,0.9609103541596133,1.0 69 | 67,-0.4549518801013601,-0.3765594933863079,0.0 70 | 68,0.4396286638012481,0.8021457866857502,1.0 71 | 69,0.05169566810199555,-0.10971628621476981,1.0 72 | 70,-0.9236334405217844,-0.6278124475619153,0.0 73 | 71,-0.02591463970260044,0.38951469061475663,1.0 74 | 72,0.21221890389705456,0.9226017022036688,1.0 75 | 73,-0.12858008886332306,0.24488405610748964,1.0 76 | 74,-0.1749310098357859,0.3258450976801428,1.0 77 | 75,1.0,0.2236218075565759,1.0 78 | 76,-0.5067884606568891,0.6954986528532396,1.0 79 | 77,-0.4152323518947254,0.3244215878748955,1.0 80 | 78,-0.1286468648037551,-0.6512138951378939,0.0 81 | 79,0.49544389912482156,-0.6450184664600123,0.0 82 | 80,0.6871402528218256,0.14846121362738618,1.0 83 | 81,0.8568605385597223,-0.5578763825824484,1.0 84 | 82,0.06810807503469518,0.05428761042728425,1.0 85 | 83,-0.2208611246360408,-0.15299136647824363,1.0 86 | 84,0.442127823682279,0.7682809053395139,1.0 87 | 85,0.10105289965400743,0.6110774004168849,1.0 88 | 86,-0.6555310810460653,0.4133360929704759,0.0 89 | 87,0.30197814347438046,0.7525891247620184,1.0 90 | 88,0.3924974498626703,0.9349016213530437,1.0 91 | 89,-0.3610580559310608,-0.11621698087018517,0.0 92 | 90,0.8356426560014061,0.36395055255378406,1.0 93 | 91,0.7311326785908441,0.6671662242074556,1.0 94 | 92,-0.2712142695859556,-0.8544684708247909,0.0 95 | 93,0.27374184690057435,0.5891288942183923,1.0 96 | 94,0.7138544043326944,-0.5677208832845186,1.0 97 | 95,0.5316347726488497,-0.47918502210043656,1.0 98 | 96,-0.650192143204265,0.6553026376105815,1.0 99 | 97,0.9852986646800268,0.11833269203291286,1.0 100 | 98,-0.27528895916551743,0.005730173862692256,1.0 101 | 99,0.2818600781782652,0.7263762562346971,1.0 102 | -------------------------------------------------------------------------------- /logistic.py: -------------------------------------------------------------------------------- 1 | """ 2 | This program performs two different logistic regression implementations on two 3 | different datasets of the format [float,float,boolean], one 4 | implementation is in this file and one from the sklearn library. The program 5 | then compares the two implementations for how well the can predict the given outcome 6 | for each input tuple in the datasets. 
7 | 
8 | @author Per Harald Borgen
9 | """
10 | 
11 | import math
12 | import numpy as np
13 | import pandas as pd
14 | from pandas import DataFrame
15 | from sklearn import preprocessing
16 | from sklearn.linear_model import LogisticRegression
17 | #from sklearn.cross_validation import train_test_split
18 | from sklearn.model_selection import train_test_split
19 | from numpy import loadtxt, where
20 | from pylab import scatter, show, legend, xlabel, ylabel
21 | 
22 | # scale the feature values to between -1 and 1, based on the largest
23 | # value in the data
24 | min_max_scaler = preprocessing.MinMaxScaler(feature_range=(-1,1))
25 | df = pd.read_csv("data.csv", header=0)
26 | 
27 | # clean up data
28 | df.columns = ["grade1","grade2","label"]
29 | 
30 | x = df["label"].map(lambda x: float(x.rstrip(';')))
31 | 
32 | # formats the input data into two arrays, one of independent variables
33 | # and one of the dependent variable
34 | X = df[["grade1","grade2"]]
35 | X = np.array(X)
36 | X = min_max_scaler.fit_transform(X)
37 | Y = df["label"].map(lambda x: float(x.rstrip(';')))
38 | Y = np.array(Y)
39 |
40 |
41 | # if want to create a new clean dataset
42 | ##X = pd.DataFrame.from_records(X,columns=['grade1','grade2'])
43 | ##X.insert(2,'label',Y)
44 | ##X.to_csv('data2.csv')
45 | 
46 | # creating testing and training set
47 | X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.33)
48 | 
49 | # train scikit learn model
50 | clf = LogisticRegression()
51 | clf.fit(X_train,Y_train)
52 | print ('score Scikit learn: ', clf.score(X_test,Y_test))
53 | 
54 | # visualize the data (comment out show() to skip the plot)
55 | pos = where(Y == 1)
56 | neg = where(Y == 0)
57 | scatter(X[pos, 0], X[pos, 1], marker='o', c='b')
58 | scatter(X[neg, 0], X[neg, 1], marker='x', c='r')
59 | xlabel('Exam 1 score')
60 | ylabel('Exam 2 score')
61 | legend(['Not Admitted', 'Admitted'])
62 | show()
63 | 
64 | ##The sigmoid function squashes the linear hypothesis into a probability between 0 and 1, so worse estimations are penalised proportionally by the cost function
65 | def Sigmoid(z):
66 |     G_of_Z = 1.0 / (1.0 + math.exp(-z))
67 |     return G_of_Z
68 | 
69 | ##The hypothesis is the linear combination of all the known factors x[i] and their current estimated coefficients theta[i]
70 | ##This hypothesis will be used to calculate each instance of the Cost Function
71 | def Hypothesis(theta, x):
72 |     z = 0
73 |     for i in range(len(theta)):
74 |         z += x[i]*theta[i]
75 |     return Sigmoid(z)
76 | 
77 | ##For each member of the dataset, the result (Y) determines which variation of the cost function is used
78 | ##The Y = 0 cost function punishes high probability estimations, and the Y = 1 cost function punishes low ones
79 | ##The "punishment" makes the change in the gradient of ThetaCurrent - Average(CostFunction(Dataset)) greater
80 | def Cost_Function(X,Y,theta,m):
81 |     sumOfErrors = 0
82 |     for i in range(m):
83 |         xi = X[i]
84 |         hi = Hypothesis(theta,xi)
85 |         if Y[i] == 1:
86 |             error = Y[i] * math.log(hi)
87 |         elif Y[i] == 0:
88 |             error = (1-Y[i]) * math.log(1-hi)
89 |         sumOfErrors += error
90 |     const = -1/m
91 |     J = const * sumOfErrors
92 |     print ('cost is ', J )
93 |     return J
94 | 
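## Added sketch (not part of the original implementation): the same cross-entropy cost can
## be computed without explicit Python loops; X is assumed to be an (m, n) NumPy array,
## Y a length-m array of 0/1 labels and theta a length-n array of coefficients.
def Vectorized_Cost(X, Y, theta):
    h = 1.0 / (1.0 + np.exp(-np.dot(X, theta)))   # sigmoid of the linear hypothesis
    return -np.mean(Y * np.log(h) + (1 - Y) * np.log(1 - h))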
95 | ##This function creates the gradient component for each Theta value:
96 | ##the partial derivative of the cost with respect to theta[j], scaled by the
97 | ##"learning speed factor" alpha and averaged over the whole dataset
98 | ##For each Theta there is a cost function term calculated for each member of the dataset
99 | def Cost_Function_Derivative(X,Y,theta,j,m,alpha):
100 |     sumErrors = 0
101 |     for i in range(m):
102 |         xi = X[i]
103 |         xij = xi[j]
104 |         hi = Hypothesis(theta,X[i])
105 |         error = (hi - Y[i])*xij
106 |         sumErrors += error
107 |     m = len(Y)
108 |     constant = float(alpha)/float(m)
109 |     J = constant * sumErrors
110 |     return J
111 | 
112 | ##For each theta, the partial differential:
113 | ##the gradient, or vector from the current point in Theta-space (each theta value is its own dimension) to a more accurate point,
114 | ##is the vector whose components are the partial differentials for each theta value
115 | def Gradient_Descent(X,Y,theta,m,alpha):
116 |     new_theta = []
117 |     constant = alpha/m  # (unused: alpha/m is already applied inside Cost_Function_Derivative)
118 |     for j in range(len(theta)):
119 |         CFDerivative = Cost_Function_Derivative(X,Y,theta,j,m,alpha)
120 |         new_theta_value = theta[j] - CFDerivative
121 |         new_theta.append(new_theta_value)
122 |     return new_theta
123 | 
124 | ##The high-level function for the LR algorithm: for a number of steps (num_iters) it takes gradient steps that move
125 | ##the Theta values (coefficients of the known factors) from an estimate (new_theta) closer to their "optimum estimation",
126 | ##the set of values that best represents the system in a linear combination model
127 | def Logistic_Regression(X,Y,alpha,theta,num_iters):
128 |     m = len(Y)
129 |     for x in range(num_iters):
130 |         new_theta = Gradient_Descent(X,Y,theta,m,alpha)
131 |         theta = new_theta
132 |         if x % 100 == 0:
133 |             # every 100 iterations, report the current theta and cost to track convergence
134 |             Cost_Function(X,Y,theta,m)
135 |             print ('theta ', theta)
136 |             print ('cost is ', Cost_Function(X,Y,theta,m))
137 |     Declare_Winner(theta)
138 | 
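## Added sketch (not part of the original implementation): the loops in
## Cost_Function_Derivative and Gradient_Descent above amount to the single vectorized
## update theta := theta - (alpha/m) * X^T (sigmoid(X theta) - Y).
def Vectorized_Gradient_Step(X, Y, theta, alpha):
    theta = np.asarray(theta, dtype=float)
    m = len(Y)
    h = 1.0 / (1.0 + np.exp(-np.dot(X, theta)))   # predicted probabilities
    return theta - (alpha / m) * np.dot(X.T, h - Y)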
139 | ##This method compares the accuracy of the model generated by the scikit library with the model generated by this implementation
140 | def Declare_Winner(theta):
141 |     score = 0
142 |     winner = ""
143 |     #first, scikit LR is tested on each independent var in the test set and its prediction is compared against the dependent var
144 |     #if the prediction matches the measured value in the dataset, it counts as a point for the scikit version of LR
145 |     scikit_score = clf.score(X_test,Y_test)
146 |     length = len(X_test)
147 |     for i in range(length):
148 |         prediction = round(Hypothesis(theta, X_test[i]))  # arguments in the order the function defines them
149 |         answer = Y_test[i]
150 |         if prediction == answer:
151 |             score += 1
152 |     #the same process is repeated for the implementation in this module, and the scores are compared to find the higher match rate
153 |     my_score = float(score) / float(length)
154 |     if my_score > scikit_score:
155 |         print ('You won!')
156 |     elif my_score == scikit_score:
157 |         print ("It's a tie!")
158 |     else:
159 |         print ('Scikit won.. :(')
160 |     print ('Your score: ', my_score)
161 |     print ('Scikits score: ', scikit_score )
162 | 
163 | # These are the initial guesses for theta, as well as the learning rate of the algorithm
164 | # A learning rate too low will not close in on the most accurate values within a reasonable number of iterations
165 | # An alpha too high might overshoot the accurate values or cause erratic guesses
166 | # Each iteration increases model accuracy but with diminishing returns,
167 | # and costs roughly O(n)*|Theta| per iteration, n = dataset length
168 | initial_theta = [0,0]
169 | alpha = 0.1
170 | iterations = 1000
171 | ##Logistic_Regression(X,Y,alpha,initial_theta,iterations)
172 | 
--------------------------------------------------------------------------------
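A minimal usage sketch (an assumption, not part of the repository): with the data
preparation in logistic.py already run, the from-scratch trainer can be invoked on the
training split instead of the full dataset, e.g.

    Logistic_Regression(X_train, Y_train, alpha, initial_theta, iterations)

Note that Logistic_Regression prints its progress and finishes by calling
Declare_Winner(theta) itself rather than returning the fitted theta.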