├── README.md
├── Titanic_Logistic_Model_Fit_AllIn.py
├── data.csv
├── data2.csv
└── logistic.py

/README.md:
--------------------------------------------------------------------------------
1 | # LogisticRegression
2 | Logistic regression from scratch in Python
3 | 
4 | This example uses gradient descent to fit the model.
5 | It also includes scikit-learn's implementation of logistic regression, so the two approaches can be compared.
6 | 
--------------------------------------------------------------------------------
/Titanic_Logistic_Model_Fit_AllIn.py:
--------------------------------------------------------------------------------
1 | # Logistic Regression
2 | 
3 | # Importing the libraries
4 | import numpy as np
5 | import matplotlib.pyplot as plt
6 | import pandas as pd
7 | import seaborn as sns
8 | 
9 | # Importing the dataset
10 | train = pd.read_csv('titanic_train.csv')
11 | 
12 | '''
13 | Logistic regression does not make many of the key assumptions of linear
14 | regression and general linear models that are based on ordinary least squares
15 | algorithms – particularly regarding linearity, normality, homoscedasticity,
16 | and measurement level.
17 | 
18 | First, logistic regression does not require a linear relationship between the
19 | dependent and independent variables.
20 | Second, the error terms (residuals) do not need to be normally distributed.
21 | Third, homoscedasticity is not required. Finally, the dependent variable
22 | in logistic regression is not measured on an interval or ratio scale.
23 | '''
24 | 
25 | # EDA
26 | sns.countplot(x='Survived',data=train,palette='RdBu_r')
27 | sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')
28 | sns.countplot(x='Survived',hue='Pclass',data=train,palette='rainbow')
29 | sns.histplot(train['Age'].dropna(),color='darkred',bins=30)  # histplot replaces the deprecated distplot
30 | sns.countplot(x='SibSp',data=train)
31 | train['Fare'].hist(color='green',bins=40,figsize=(8,4))
32 | sns.boxplot(x='Pclass',y='Age',data=train,palette='winter')
33 | 
34 | 
35 | '''
36 | Binary logistic regression requires the dependent variable to be binary,
37 | and ordinal logistic regression requires the dependent variable to be ordinal.
38 | 
39 | Logistic regression requires the observations to be independent of each
40 | other. In other words, the observations should not come from repeated
41 | measurements or matched data.
42 | 
43 | Logistic regression typically requires a large sample size.
44 | A general guideline is that you need a minimum of 10 cases of the least
45 | frequent outcome for each independent variable in your model. For example,
46 | if you have 5 independent variables and the expected probability of your
47 | least frequent outcome is .10, then you would need a minimum sample
48 | size of 500 (10*5 / .10).
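As a quick illustration of that rule of thumb (the numbers simply repeat the example
above; the variable names are only illustrative):

    n_independent_variables = 5
    p_least_frequent_outcome = 0.10
    minimum_sample_size = 10 * n_independent_variables / p_least_frequent_outcome  # = 500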
49 | '''
50 |
51 |
52 |
53 |
54 |
55 | sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
56 | # Taking care of missing data
57 | 
58 | def impute_age(cols):
59 |     Age = cols['Age']
60 |     Pclass = cols['Pclass']
61 | 
62 |     if pd.isnull(Age):
63 | 
64 |         if Pclass == 1:
65 |             return train.groupby('Pclass')['Age'].mean().iloc[0]
66 | 
67 |         elif Pclass == 2:
68 |             return train.groupby('Pclass')['Age'].mean().iloc[1]
69 | 
70 |         else:
71 |             return train.groupby('Pclass')['Age'].mean().iloc[2]
72 | 
73 |     else:
74 |         return Age
75 | 
76 | 
77 | train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
78 | 
79 | train.drop('Cabin', axis=1, inplace=True)
80 | train.dropna(inplace=True)
81 | 
82 | '''
83 | from sklearn.impute import SimpleImputer
84 | 
85 | imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
86 | imputer = imputer.fit(train['Age'].values.reshape(-1, 1))
87 | train['Age'] = imputer.transform(train['Age'].values.reshape(-1, 1))'''
88 |
89 |
90 |
91 |
92 |
93 |
94 |
95 | X = train.iloc[:, [2, 4, 5, 6, 7, 9, 10]]
96 | y = train.iloc[:, 1]
97 |
98 |
99 |
100 |
101 |
102 |
103 | '''No multicollinearity - also check the condition number
104 | Logistic regression requires there to be little or no multicollinearity
105 | among the independent variables. This means that the independent variables
106 | should not be too highly correlated with each other.
107 | 
108 | We observe it when two or more variables have a high correlation.
109 | If a can be represented using b, there is no point in using both.
110 | Say c and d have a correlation of 90% (imperfect multicollinearity): if c can almost
111 | be represented using d, there is no point in using both.
112 | FIX: a) Drop one of the two variables. b) Transform them into one variable by taking the
113 | mean. c) Keep them both but use caution.
114 | Test: before creating the model, find the correlation between each pair of variables.
115 | '''
116 | multicollinearity_check = train.corr(numeric_only=True)
117 |
118 |
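# Added sketch (not part of the original script): a common numeric follow-up to the
# correlation matrix above is the variance inflation factor (VIF). It only works on an
# all-numeric design matrix, so it would be run on X after the dummy encoding below;
# kept quoted out here, like the SimpleImputer example above.
'''
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns, name='VIF')
print(vif)  # rule of thumb: VIF above ~5-10 points to problematic multicollinearity
'''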
119 | # Encoding categorical data
120 | sex = pd.get_dummies(X['Sex'], prefix = 'Sex', dtype=int)
121 | sex.drop('Sex_male', inplace = True, axis=1)
122 | 
123 | embark = pd.get_dummies(X['Embarked'], prefix = 'Embarked', drop_first=True, dtype=int)
124 | 
125 | passenger_class = pd.get_dummies(X['Pclass'], prefix = 'Pclass', dtype=int)
126 | passenger_class.drop('Pclass_3', inplace = True, axis=1)
127 | 
128 | X.drop(['Sex','Embarked','Pclass'],axis=1,inplace=True)
129 | X = pd.concat([X,sex,embark, passenger_class],axis=1)
130 | 
131 | # Outliers
132 | sns.boxplot(data= X).set_title("Outlier Box Plot")
133 | 
134 | linearity_check_df = pd.concat([pd.DataFrame(X),y],axis=1)
135 | 
136 | '''
137 | Box-Tidwell test
138 | Logistic regression assumes linearity between the independent variables and the log odds.
139 | Although this analysis does not require the dependent and independent
140 | variables to be related linearly, it does require that the independent variables
141 | are linearly related to the log odds.'''
142 | sns.regplot(x= 'Age', y= 'Survived', data= linearity_check_df, logistic= True).set_title("Log Odds Linear Plot")
143 | sns.regplot(x= 'Fare', y= 'Survived', data= linearity_check_df, logistic= True).set_title("Log Odds Linear Plot")
144 | sns.regplot(x= 'Sex_female', y= 'Survived', data= linearity_check_df, logistic= True).set_title("Log Odds Linear Plot")  # Sex_male was dropped above; the remaining dummy is Sex_female
145 |
146 |
147 | # Splitting the dataset into the Training set and Test set
148 | from sklearn.model_selection import train_test_split
149 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
150 |
151 |
152 |
153 |
154 |
155 | # Feature Scaling (must be done after the split, so the test set does not leak into the scaler)
156 | from sklearn.preprocessing import StandardScaler
157 | sc = StandardScaler()
158 | X_train.iloc[:, [0,3]] = sc.fit_transform(X_train.iloc[:, [0,3]])
159 | X_test.iloc[:, [0,3]] = sc.transform(X_test.iloc[:, [0,3]])
160 |
161 |
162 |
163 |
164 |
165 | # Fitting Logistic Regression to the Training set
166 | from sklearn.linear_model import LogisticRegression
167 | classifier = LogisticRegression()
168 | classifier.fit(X_train, y_train)
169 |
170 |
171 |
172 |
173 | # Find relevant features
174 | from sklearn.model_selection import StratifiedKFold
175 | from sklearn.feature_selection import RFECV
176 | 
177 | # The "accuracy" scoring is proportional to the number of correct
178 | # classifications
179 | rfecv = RFECV(estimator=classifier, step=1, cv=StratifiedKFold(2), scoring='accuracy')
180 | rfecv.fit(X_train, y_train)
181 | 
182 | print("Optimal number of features : %d" % rfecv.n_features_)
183 |
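# Added sketch (not part of the original script): rfecv.support_ is a boolean mask over
# the columns of X_train, so the names of the selected features can be listed directly.
print("Selected features:", list(X_train.columns[rfecv.support_]))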
184 | # Plot number of features vs. cross-validation scores
185 | plt.figure()
186 | plt.xlabel("Number of features selected")
187 | plt.ylabel("Cross validation score (nb of correct classifications)")
188 | plt.plot(range(1, len(rfecv.cv_results_['mean_test_score']) + 1), rfecv.cv_results_['mean_test_score'])  # cv_results_ replaces the removed grid_scores_
189 | plt.show()
190 |
191 |
192 |
193 |
194 | from sklearn.feature_selection import RFE
195 | 
196 | rfe = RFE(classifier, n_features_to_select=rfecv.n_features_, step=1)
197 | rfe = rfe.fit(X_train, y_train.values.ravel())
198 | print(rfe.support_)
199 | print(rfe.ranking_)
200 | 
201 | # Can select columns based on the returned mask
202 | # X.loc[:, rfe.support_]
203 |
204 |
205 | # Predicting the Test set results
206 | y_pred = classifier.predict(X_test)
207 |
208 |
209 |
210 |
211 |
212 | # K-Fold cross validation
213 | from sklearn.model_selection import cross_val_score
214 | accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
215 | model_accuracy = accuracies.mean()
216 | model_standard_deviation = accuracies.std()
217 |
218 |
219 |
220 |
221 |
222 |
223 | # Making the Confusion Matrix
224 | from sklearn.metrics import confusion_matrix
225 | cm = confusion_matrix(y_test, y_pred)  # renamed so the imported function is not shadowed
226 | pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
227 | 
228 | from sklearn.metrics import classification_report
229 | print(classification_report(y_test, y_pred))
230 |
231 |
232 | # Generate reports
233 | import statsmodels.api as sm
234 | 
235 | #X_set = X[['Age', 'SibSp', 'Parch', 'Fare', 'Sex_male', 'Embarked_Q', 'Embarked_S', 'Pclass_1', 'Pclass_2']]
236 | X_set = X.loc[:, rfe.support_]
237 | X_set = sm.add_constant(X_set)
238 | 
239 | logit_model=sm.Logit(y,X_set)
240 | result=logit_model.fit()
241 | print(result.summary2())
242 |
243 |
244 | # GETTING THE ODDS RATIOS, P-VALUES, AND 95% CI
245 | model_odds = pd.DataFrame(np.exp(result.params), columns= ['OR'])
246 | model_odds['p-value']= result.pvalues
247 | model_odds[['2.5%', '97.5%']] = np.exp(result.conf_int())
248 |
249 |
250 |
251 |
252 |
253 |
254 |
255 |
256 | # ROC Curve
257 | from sklearn.metrics import roc_auc_score
258 | from sklearn.metrics import roc_curve
259 | area_under_curve = roc_auc_score(y_test, classifier.predict_proba(X_test)[:,1])  # AUC should use predicted probabilities, not hard class predictions
260 | fpr, tpr, thresholds = roc_curve(y_test, classifier.predict_proba(X_test)[:,1])
261 | plt.figure()
262 | plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % area_under_curve)
263 | plt.plot([0, 1], [0, 1],'r--')
264 | plt.xlim([0.0, 1.0])
265 | plt.ylim([0.0, 1.05])
266 | plt.xlabel('False Positive Rate')
267 | plt.ylabel('True Positive Rate')
268 | plt.title('Receiver operating characteristic')
269 | plt.legend(loc="lower right")
270 | #plt.savefig('Log_ROC')
271 | plt.show()
--------------------------------------------------------------------------------
/data.csv:
--------------------------------------------------------------------------------
1 | grade1,grade2,label;;;;
2 | 34.62365962451697,78.0246928153624,0;;;;
3 | 30.28671076822607,43.89499752400101,0;;;;
4 | 35.84740876993872,72.90219802708364,0;;;;
5 | 60.18259938620976,86.30855209546826,1;;;;
6 | 79.0327360507101,75.3443764369103,1;;;;
7 | 45.08327747668339,56.3163717815305,0;;;;
8 | 61.10666453684766,96.51142588489624,1;;;;
9 | 75.02474556738889,46.55401354116538,1;;;;
10 | 76.09878670226257,87.42056971926803,1;;;;
11 | 84.43281996120035,43.53339331072109,1;;;;
12 | 95.86155507093572,38.22527805795094,0;;;;
13 | 75.01365838958247,30.60326323428011,0;;;;
14 | 82.30705337399482,76.48196330235604,1;;;;
15 | 
69.36458875970939,97.71869196188608,1;;;; 16 | 39.53833914367223,76.03681085115882,0;;;; 17 | 53.9710521485623,89.20735013750205,1;;;; 18 | 69.07014406283025,52.74046973016765,1;;;; 19 | 67.94685547711617,46.67857410673128,0;;;; 20 | 70.66150955499435,92.92713789364831,1;;;; 21 | 76.97878372747498,47.57596364975532,1;;;; 22 | 67.37202754570876,42.83843832029179,0;;;; 23 | 89.67677575072079,65.79936592745237,1;;;; 24 | 50.534788289883,48.85581152764205,0;;;; 25 | 34.21206097786789,44.20952859866288,0;;;; 26 | 77.9240914545704,68.9723599933059,1;;;; 27 | 62.27101367004632,69.95445795447587,1;;;; 28 | 80.1901807509566,44.82162893218353,1;;;; 29 | 93.114388797442,38.80067033713209,0;;;; 30 | 61.83020602312595,50.25610789244621,0;;;; 31 | 38.78580379679423,64.99568095539578,0;;;; 32 | 61.379289447425,72.80788731317097,1;;;; 33 | 85.40451939411645,57.05198397627122,1;;;; 34 | 52.10797973193984,63.12762376881715,0;;;; 35 | 52.04540476831827,69.43286012045222,1;;;; 36 | 40.23689373545111,71.16774802184875,0;;;; 37 | 54.63510555424817,52.21388588061123,0;;;; 38 | 33.91550010906887,98.86943574220611,0;;;; 39 | 64.17698887494485,80.90806058670817,1;;;; 40 | 74.78925295941542,41.57341522824434,0;;;; 41 | 34.1836400264419,75.2377203360134,0;;;; 42 | 83.90239366249155,56.30804621605327,1;;;; 43 | 51.54772026906181,46.85629026349976,0;;;; 44 | 94.44336776917852,65.56892160559052,1;;;; 45 | 82.36875375713919,40.61825515970618,0;;;; 46 | 51.04775177128865,45.82270145776001,0;;;; 47 | 62.22267576120188,52.06099194836679,0;;;; 48 | 77.19303492601364,70.45820000180959,1;;;; 49 | 97.77159928000232,86.7278223300282,1;;;; 50 | 62.07306379667647,96.76882412413983,1;;;; 51 | 91.56497449807442,88.69629254546599,1;;;; 52 | 79.94481794066932,74.16311935043758,1;;;; 53 | 99.2725269292572,60.99903099844988,1;;;; 54 | 90.54671411399852,43.39060180650027,1;;;; 55 | 34.52451385320009,60.39634245837173,0;;;; 56 | 50.2864961189907,49.80453881323059,0;;;; 57 | 49.58667721632031,59.80895099453265,0;;;; 58 | 97.64563396007767,68.86157272420604,1;;;; 59 | 32.57720016809309,95.59854761387875,0;;;; 60 | 74.24869136721598,69.82457122657193,1;;;; 61 | 71.79646205863379,78.45356224515052,1;;;; 62 | 75.3956114656803,85.75993667331619,1;;;; 63 | 35.28611281526193,47.02051394723416,0;;;; 64 | 56.25381749711624,39.26147251058019,0;;;; 65 | 30.05882244669796,49.59297386723685,0;;;; 66 | 44.66826172480893,66.45008614558913,0;;;; 67 | 66.56089447242954,41.09209807936973,0;;;; 68 | 40.45755098375164,97.53518548909936,1;;;; 69 | 49.07256321908844,51.88321182073966,0;;;; 70 | 80.27957401466998,92.11606081344084,1;;;; 71 | 66.74671856944039,60.99139402740988,1;;;; 72 | 32.72283304060323,43.30717306430063,0;;;; 73 | 64.0393204150601,78.03168802018232,1;;;; 74 | 72.34649422579923,96.22759296761404,1;;;; 75 | 60.45788573918959,73.09499809758037,1;;;; 76 | 58.84095621726802,75.85844831279042,1;;;; 77 | 99.82785779692128,72.36925193383885,1;;;; 78 | 47.26426910848174,88.47586499559782,1;;;; 79 | 50.45815980285988,75.80985952982456,1;;;; 80 | 60.45555629271532,42.50840943572217,0;;;; 81 | 82.22666157785568,42.71987853716458,0;;;; 82 | 88.9138964166533,69.80378889835472,1;;;; 83 | 94.83450672430196,45.69430680250754,1;;;; 84 | 67.31925746917527,66.58935317747915,1;;;; 85 | 57.23870631569862,59.51428198012956,1;;;; 86 | 80.36675600171273,90.96014789746954,1;;;; 87 | 68.46852178591112,85.59430710452014,1;;;; 88 | 42.0754545384731,78.84478600148043,0;;;; 89 | 75.47770200533905,90.42453899753964,1;;;; 90 | 78.63542434898018,96.64742716885644,1;;;; 91 | 
52.34800398794107,60.76950525602592,0;;;; 92 | 94.09433112516793,77.15910509073893,1;;;; 93 | 90.44855097096364,87.50879176484702,1;;;; 94 | 55.48216114069585,35.57070347228866,0;;;; 95 | 74.49269241843041,84.84513684930135,1;;;; 96 | 89.84580670720979,45.35828361091658,1;;;; 97 | 83.48916274498238,48.38028579728175,1;;;; 98 | 42.2617008099817,87.10385094025457,1;;;; 99 | 99.31500880510394,68.77540947206617,1;;;; 100 | 55.34001756003703,64.9319380069486,1;;;; 101 | 74.77589300092767,89.52981289513276,1;;;; -------------------------------------------------------------------------------- /data2.csv: -------------------------------------------------------------------------------- 1 | ,grade1,grade2,label 2 | 0,-0.869144322982118,0.38930975149006475,0.0 3 | 1,-0.9934673506553683,-0.6105909031833399,0.0 4 | 2,-0.8340643153747651,0.23923557565476328,0.0 5 | 3,-0.13647145074321654,0.6320026981070463,1.0 6 | 4,0.4038867918461304,0.3107842891714938,1.0 7 | 5,-0.5693087928028135,-0.2466808200133095,0.0 8 | 6,-0.10998218810688143,0.9309171798950318,1.0 9 | 7,0.2889936888183451,-0.5326894794040691,1.0 10 | 8,0.3197821648086534,0.6645815752565056,1.0 11 | 9,0.5586856616709217,-0.621184853305208,1.0 12 | 10,0.8863019187215162,-0.7766971680509589,0.0 13 | 11,0.2886758636470277,-1.0,0.0 14 | 12,0.4977484113123718,0.344112270619807,1.0 15 | 13,0.12673956621891636,0.966286559271023,1.0 16 | 14,-0.728260061232338,0.33107060049700765,0.0 17 | 15,-0.3145317379886119,0.7169290367470866,1.0 18 | 16,0.11829901102416285,-0.3514443337710986,1.0 19 | 17,0.08609880702032502,-0.5290402176690145,0.0 20 | 18,0.1639171132145063,0.8259079825262541,1.0 21 | 19,0.3450081700356158,-0.502749318090022,1.0 22 | 20,0.06962078267838256,-0.6415450101705609,0.0 23 | 21,0.7090089609167003,0.031143285177906543,1.0 24 | 22,-0.4130357187712068,-0.46525350337335136,0.0 25 | 23,-0.8809432146991368,-0.601376058902291,0.0 26 | 24,0.3721063726221989,0.12410276860244229,1.0 27 | 25,-0.07660494195881817,0.15287537808354212,1.0 28 | 26,0.43706611543678653,-0.5834433021346668,1.0 29 | 27,0.8075516175398065,-0.759839850347543,0.0 30 | 28,-0.08924113922620536,-0.4242288988478353,0.0 31 | 29,-0.7498322484669888,0.007597656573540945,0.0 32 | 30,-0.10216711916666066,0.23647254645749216,1.0 33 | 31,0.5865404092115298,-0.22512952549316312,1.0 34 | 32,-0.3679385941181339,-0.047130977476119496,0.0 35 | 33,-0.36973236877210347,0.13759408092386627,1.0 36 | 34,-0.7082352869675854,0.18842124283030293,0.0 37 | 35,-0.2954959751361519,-0.3668717066619773,0.0 38 | 36,-0.889444432103401,1.0,0.0 39 | 37,-0.021968233988555852,0.4737840281462229,1.0 40 | 38,0.28224305490771817,-0.6786065018456822,0.0 41 | 39,-0.881757930031901,0.3076595760968148,0.0 42 | 40,0.543480455061824,-0.24692473483001498,1.0 43 | 41,-0.3839989985673473,-0.5238336519501634,0.0 44 | 42,0.8456481446041515,0.02439193781519644,1.0 45 | 43,0.49951711523194486,-0.7065899095408261,0.0 46 | 44,-0.3983311014913291,-0.5541147931880068,0.0 47 | 45,-0.07799059703063604,-0.37135105350764075,0.0 48 | 46,0.351149897449881,0.1676335527058339,1.0 49 | 47,0.941055268813821,0.6442860946754201,1.0 50 | 48,-0.08227937539124852,0.9384581985221923,1.0 51 | 49,0.7631360887428289,0.7019565379746755,1.0 52 | 50,0.4300325421888376,0.27617689745532337,1.0 53 | 51,0.9840808787200093,-0.109492545209249,1.0 54 | 52,0.7339466244205979,-0.6253682284374302,1.0 55 | 53,-0.8719864368459312,-0.12714956384488962,0.0 56 | 54,-0.4201532651052198,-0.4374585574803942,0.0 57 | 55,-0.44021428212107705,-0.1443584227060153,0.0 58 | 
56,0.9374443454495434,0.12085702433321455,1.0 59 | 57,-0.9278081541831988,0.9041725057033894,0.0 60 | 58,0.26674730985445905,0.14907007530671335,1.0 61 | 59,0.19645167522879436,0.4018743765168544,1.0 62 | 60,0.2996249350848226,0.6159298643155111,1.0 63 | 61,-0.8501544319102516,-0.5190223763886008,0.0 64 | 62,-0.2490939592635648,-0.7463396889493163,0.0 65 | 63,-1.0,-0.4436567941244416,0.0 66 | 64,-0.5812056392990044,0.05020749206781838,0.0 67 | 65,0.046368832319386266,-0.6927076922652486,0.0 68 | 66,-0.701909923654394,0.9609103541596133,1.0 69 | 67,-0.4549518801013601,-0.3765594933863079,0.0 70 | 68,0.4396286638012481,0.8021457866857502,1.0 71 | 69,0.05169566810199555,-0.10971628621476981,1.0 72 | 70,-0.9236334405217844,-0.6278124475619153,0.0 73 | 71,-0.02591463970260044,0.38951469061475663,1.0 74 | 72,0.21221890389705456,0.9226017022036688,1.0 75 | 73,-0.12858008886332306,0.24488405610748964,1.0 76 | 74,-0.1749310098357859,0.3258450976801428,1.0 77 | 75,1.0,0.2236218075565759,1.0 78 | 76,-0.5067884606568891,0.6954986528532396,1.0 79 | 77,-0.4152323518947254,0.3244215878748955,1.0 80 | 78,-0.1286468648037551,-0.6512138951378939,0.0 81 | 79,0.49544389912482156,-0.6450184664600123,0.0 82 | 80,0.6871402528218256,0.14846121362738618,1.0 83 | 81,0.8568605385597223,-0.5578763825824484,1.0 84 | 82,0.06810807503469518,0.05428761042728425,1.0 85 | 83,-0.2208611246360408,-0.15299136647824363,1.0 86 | 84,0.442127823682279,0.7682809053395139,1.0 87 | 85,0.10105289965400743,0.6110774004168849,1.0 88 | 86,-0.6555310810460653,0.4133360929704759,0.0 89 | 87,0.30197814347438046,0.7525891247620184,1.0 90 | 88,0.3924974498626703,0.9349016213530437,1.0 91 | 89,-0.3610580559310608,-0.11621698087018517,0.0 92 | 90,0.8356426560014061,0.36395055255378406,1.0 93 | 91,0.7311326785908441,0.6671662242074556,1.0 94 | 92,-0.2712142695859556,-0.8544684708247909,0.0 95 | 93,0.27374184690057435,0.5891288942183923,1.0 96 | 94,0.7138544043326944,-0.5677208832845186,1.0 97 | 95,0.5316347726488497,-0.47918502210043656,1.0 98 | 96,-0.650192143204265,0.6553026376105815,1.0 99 | 97,0.9852986646800268,0.11833269203291286,1.0 100 | 98,-0.27528895916551743,0.005730173862692256,1.0 101 | 99,0.2818600781782652,0.7263762562346971,1.0 102 | -------------------------------------------------------------------------------- /logistic.py: -------------------------------------------------------------------------------- 1 | """ 2 | This program performs two different logistic regression implementations on two 3 | different datasets of the format [float,float,boolean], one 4 | implementation is in this file and one from the sklearn library. The program 5 | then compares the two implementations for how well the can predict the given outcome 6 | for each input tuple in the datasets. 
7 | 
8 | @author Per Harald Borgen
9 | """
10 | 
11 | import math
12 | import numpy as np
13 | import pandas as pd
14 | from pandas import DataFrame
15 | from sklearn import preprocessing
16 | from sklearn.linear_model import LogisticRegression
17 | #from sklearn.cross_validation import train_test_split
18 | from sklearn.model_selection import train_test_split
19 | from numpy import loadtxt, where
20 | from pylab import scatter, show, legend, xlabel, ylabel
21 | 
22 | # scale the feature values to between -1 and 1, based on the largest
23 | # value in the data
24 | min_max_scaler = preprocessing.MinMaxScaler(feature_range=(-1,1))
25 | df = pd.read_csv("data.csv", header=0)
26 | 
27 | # clean up data
28 | df.columns = ["grade1","grade2","label"]
29 | 
30 | x = df["label"].map(lambda x: float(x.rstrip(';')))
31 | 
32 | # formats the input data into two arrays, one of independent variables
33 | # and one of the dependent variable
34 | X = df[["grade1","grade2"]]
35 | X = np.array(X)
36 | X = min_max_scaler.fit_transform(X)
37 | Y = df["label"].map(lambda x: float(x.rstrip(';')))
38 | Y = np.array(Y)
39 |
40 |
41 | # if want to create a new clean dataset
42 | ##X = pd.DataFrame.from_records(X,columns=['grade1','grade2'])
43 | ##X.insert(2,'label',Y)
44 | ##X.to_csv('data2.csv')
45 | 
46 | # creating testing and training set
47 | X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.33)
48 | 
49 | # train scikit learn model
50 | clf = LogisticRegression()
51 | clf.fit(X_train,Y_train)
52 | print ('score Scikit learn: ', clf.score(X_test,Y_test))
53 | 
54 | # visualize the data (comment out show() to skip the plot)
55 | pos = where(Y == 1)
56 | neg = where(Y == 0)
57 | scatter(X[pos, 0], X[pos, 1], marker='o', c='b')
58 | scatter(X[neg, 0], X[neg, 1], marker='x', c='r')
59 | xlabel('Exam 1 score')
60 | ylabel('Exam 2 score')
61 | legend(['Not Admitted', 'Admitted'])
62 | show()
63 | 
64 | ##The sigmoid function squashes the linear hypothesis into a probability between 0 and 1, so worse estimations are penalised proportionally by the cost function
65 | def Sigmoid(z):
66 |     G_of_Z = 1.0 / (1.0 + math.exp(-z))
67 |     return G_of_Z
68 | 
69 | ##The hypothesis is the linear combination of all the known factors x[i] and their current estimated coefficients theta[i]
70 | ##This hypothesis will be used to calculate each instance of the Cost Function
71 | def Hypothesis(theta, x):
72 |     z = 0
73 |     for i in range(len(theta)):
74 |         z += x[i]*theta[i]
75 |     return Sigmoid(z)
76 | 
77 | ##For each member of the dataset, the result (Y) determines which variation of the cost function is used
78 | ##The Y = 0 cost function punishes high probability estimations, and the Y = 1 cost function punishes low ones
79 | ##The "punishment" makes the change in the gradient of ThetaCurrent - Average(CostFunction(Dataset)) greater
80 | def Cost_Function(X,Y,theta,m):
81 |     sumOfErrors = 0
82 |     for i in range(m):
83 |         xi = X[i]
84 |         hi = Hypothesis(theta,xi)
85 |         if Y[i] == 1:
86 |             error = Y[i] * math.log(hi)
87 |         elif Y[i] == 0:
88 |             error = (1-Y[i]) * math.log(1-hi)
89 |         sumOfErrors += error
90 |     const = -1/m
91 |     J = const * sumOfErrors
92 |     print ('cost is ', J )
93 |     return J
94 | 
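## Added sketch (not part of the original implementation): the same cross-entropy cost can
## be computed without explicit Python loops; X is assumed to be an (m, n) NumPy array,
## Y a length-m array of 0/1 labels and theta a length-n array of coefficients.
def Vectorized_Cost(X, Y, theta):
    h = 1.0 / (1.0 + np.exp(-np.dot(X, theta)))   # sigmoid of the linear hypothesis
    return -np.mean(Y * np.log(h) + (1 - Y) * np.log(1 - h))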
95 | ##This function creates the gradient component for each Theta value:
96 | ##the partial derivative of the cost with respect to theta[j], scaled by the
97 | ##"learning speed factor" alpha and averaged over the whole dataset
98 | ##For each Theta there is a cost function term calculated for each member of the dataset
99 | def Cost_Function_Derivative(X,Y,theta,j,m,alpha):
100 |     sumErrors = 0
101 |     for i in range(m):
102 |         xi = X[i]
103 |         xij = xi[j]
104 |         hi = Hypothesis(theta,X[i])
105 |         error = (hi - Y[i])*xij
106 |         sumErrors += error
107 |     m = len(Y)
108 |     constant = float(alpha)/float(m)
109 |     J = constant * sumErrors
110 |     return J
111 | 
112 | ##For each theta, the partial differential:
113 | ##the gradient, or vector from the current point in Theta-space (each theta value is its own dimension) to a more accurate point,
114 | ##is the vector whose components are the partial differentials for each theta value
115 | def Gradient_Descent(X,Y,theta,m,alpha):
116 |     new_theta = []
117 |     constant = alpha/m  # (unused: alpha/m is already applied inside Cost_Function_Derivative)
118 |     for j in range(len(theta)):
119 |         CFDerivative = Cost_Function_Derivative(X,Y,theta,j,m,alpha)
120 |         new_theta_value = theta[j] - CFDerivative
121 |         new_theta.append(new_theta_value)
122 |     return new_theta
123 | 
124 | ##The high-level function for the LR algorithm: for a number of steps (num_iters) it takes gradient steps that move
125 | ##the Theta values (coefficients of the known factors) from an estimate (new_theta) closer to their "optimum estimation",
126 | ##the set of values that best represents the system in a linear combination model
127 | def Logistic_Regression(X,Y,alpha,theta,num_iters):
128 |     m = len(Y)
129 |     for x in range(num_iters):
130 |         new_theta = Gradient_Descent(X,Y,theta,m,alpha)
131 |         theta = new_theta
132 |         if x % 100 == 0:
133 |             # every 100 iterations, report the current theta and cost to track convergence
134 |             Cost_Function(X,Y,theta,m)
135 |             print ('theta ', theta)
136 |             print ('cost is ', Cost_Function(X,Y,theta,m))
137 |     Declare_Winner(theta)
138 | 
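## Added sketch (not part of the original implementation): the loops in
## Cost_Function_Derivative and Gradient_Descent above amount to the single vectorized
## update theta := theta - (alpha/m) * X^T (sigmoid(X theta) - Y).
def Vectorized_Gradient_Step(X, Y, theta, alpha):
    theta = np.asarray(theta, dtype=float)
    m = len(Y)
    h = 1.0 / (1.0 + np.exp(-np.dot(X, theta)))   # predicted probabilities
    return theta - (alpha / m) * np.dot(X.T, h - Y)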
139 | ##This method compares the accuracy of the model generated by the scikit library with the model generated by this implementation
140 | def Declare_Winner(theta):
141 |     score = 0
142 |     winner = ""
143 |     #first, scikit LR is tested on each independent var in the test set and its prediction is compared against the dependent var
144 |     #if the prediction matches the measured value in the dataset, it counts as a point for the scikit version of LR
145 |     scikit_score = clf.score(X_test,Y_test)
146 |     length = len(X_test)
147 |     for i in range(length):
148 |         prediction = round(Hypothesis(theta, X_test[i]))  # arguments in the order the function defines them
149 |         answer = Y_test[i]
150 |         if prediction == answer:
151 |             score += 1
152 |     #the same process is repeated for the implementation in this module, and the scores are compared to find the higher match rate
153 |     my_score = float(score) / float(length)
154 |     if my_score > scikit_score:
155 |         print ('You won!')
156 |     elif my_score == scikit_score:
157 |         print ("It's a tie!")
158 |     else:
159 |         print ('Scikit won.. :(')
160 |     print ('Your score: ', my_score)
161 |     print ('Scikits score: ', scikit_score )
162 | 
163 | # These are the initial guesses for theta, as well as the learning rate of the algorithm
164 | # A learning rate too low will not close in on the most accurate values within a reasonable number of iterations
165 | # An alpha too high might overshoot the accurate values or cause erratic guesses
166 | # Each iteration increases model accuracy but with diminishing returns,
167 | # and costs roughly O(n)*|Theta| per iteration, n = dataset length
168 | initial_theta = [0,0]
169 | alpha = 0.1
170 | iterations = 1000
171 | ##Logistic_Regression(X,Y,alpha,initial_theta,iterations)
172 | 
--------------------------------------------------------------------------------
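A minimal usage sketch (an assumption, not part of the repository): with the data
preparation in logistic.py already run, the from-scratch trainer can be invoked on the
training split instead of the full dataset, e.g.

    Logistic_Regression(X_train, Y_train, alpha, initial_theta, iterations)

Note that Logistic_Regression prints its progress and finishes by calling
Declare_Winner(theta) itself rather than returning the fitted theta.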