├── 50_Startups.csv
├── README.md
└── multiple_linear_regression.py

/50_Startups.csv:
--------------------------------------------------------------------------------
R&D Spend,Administration,Marketing Spend,State,Profit
165349.2,136897.8,471784.1,New York,192261.83
162597.7,151377.59,443898.53,California,191792.06
153441.51,101145.55,407934.54,Florida,191050.39
144372.41,118671.85,383199.62,New York,182901.99
142107.34,91391.77,366168.42,Florida,166187.94
131876.9,99814.71,362861.36,New York,156991.12
134615.46,147198.87,127716.82,California,156122.51
130298.13,145530.06,323876.68,Florida,155752.6
120542.52,148718.95,311613.29,New York,152211.77
123334.88,108679.17,304981.62,California,149759.96
101913.08,110594.11,229160.95,Florida,146121.95
100671.96,91790.61,249744.55,California,144259.4
93863.75,127320.38,249839.44,Florida,141585.52
91992.39,135495.07,252664.93,California,134307.35
119943.24,156547.42,256512.92,Florida,132602.65
114523.61,122616.84,261776.23,New York,129917.04
78013.11,121597.55,264346.06,California,126992.93
94657.16,145077.58,282574.31,New York,125370.37
91749.16,114175.79,294919.57,Florida,124266.9
86419.7,153514.11,0,New York,122776.86
76253.86,113867.3,298664.47,California,118474.03
78389.47,153773.43,299737.29,New York,111313.02
73994.56,122782.75,303319.26,Florida,110352.25
67532.53,105751.03,304768.73,Florida,108733.99
77044.01,99281.34,140574.81,New York,108552.04
64664.71,139553.16,137962.62,California,107404.34
75328.87,144135.98,134050.07,Florida,105733.54
72107.6,127864.55,353183.81,New York,105008.31
66051.52,182645.56,118148.2,Florida,103282.38
65605.48,153032.06,107138.38,New York,101004.64
61994.48,115641.28,91131.24,Florida,99937.59
61136.38,152701.92,88218.23,New York,97483.56
63408.86,129219.61,46085.25,California,97427.84
55493.95,103057.49,214634.81,Florida,96778.92
46426.07,157693.92,210797.67,California,96712.8
46014.02,85047.44,205517.64,New York,96479.51
28663.76,127056.21,201126.82,Florida,90708.19
44069.95,51283.14,197029.42,California,89949.14
20229.59,65947.93,185265.1,New York,81229.06
38558.51,82982.09,174999.3,California,81005.76
28754.33,118546.05,172795.67,California,78239.91
27892.92,84710.77,164470.71,Florida,77798.83
23640.93,96189.63,148001.11,California,71498.49
15505.73,127382.3,35534.17,New York,69758.98
22177.74,154806.14,28334.72,California,65200.33
1000.23,124153.04,1903.93,New York,64926.08
1315.46,115816.21,297114.46,Florida,49490.75
0,135426.92,0,California,42559.73
542.05,51743.15,0,New York,35673.41
0,116983.8,45173.06,California,14681.4
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Multiple-Linear-Regression

A simple Python program that implements Multiple Linear Regression using the LinearRegression class from sklearn.linear_model.

The program also performs Backward Elimination to select the independent variables that best fit the regressor object of the LinearRegression class.

The program uses the statsmodels library to obtain the P values of the independent variables. Any variable with a P value greater than the significance level (set to 0.05) is removed and the model is refitted. This process repeats until only variables with P values below the significance level remain; the reduced set of independent variables is called X_Optimal.

X_Optimal is again split into a training set and a test set using the train_test_split function from sklearn.model_selection.
The regressor is fitted with the X_Optimal_Train and Y_Train variables, and the prediction for Y_Test (the dependent variable) is obtained with regressor.predict(X_Optimal_Test).
--------------------------------------------------------------------------------
/multiple_linear_regression.py:
--------------------------------------------------------------------------------
# Multiple Linear Regression

import numpy as np
import pandas as pd

# Importing the dataset

datasets = pd.read_csv('50_Startups.csv')
X = datasets.iloc[:, :-1].values
Y = datasets.iloc[:, 4].values

# Encoding categorical data

# Encoding the independent variable (State).
# OneHotEncoder's categorical_features argument was removed in newer
# scikit-learn releases; ColumnTransformer is the current approach and
# places the dummy columns first, as the old argument did.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder = 'passthrough')
X = ct.fit_transform(X).astype(float)

# Avoiding the Dummy Variable Trap by dropping one dummy column
X = X[:, 1:]

# Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

# Fitting Multiple Linear Regression to the Training set

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_Train, Y_Train)

# Predicting the Test set results

Y_Pred = regressor.predict(X_Test)

# Building the optimal model using Backward Elimination.
# sm.OLS lives in statsmodels.api, not statsmodels.formula.api.

import statsmodels.api as sm
X = np.append(arr = np.ones((len(X), 1)).astype(int), values = X, axis = 1)

X_Optimal = X[:, [0,1,2,3,4,5]]
regressor_OLS = sm.OLS(endog = Y, exog = X_Optimal).fit()
print(regressor_OLS.summary())

X_Optimal = X[:, [0,1,2,4,5]]
regressor_OLS = sm.OLS(endog = Y, exog = X_Optimal).fit()
print(regressor_OLS.summary())

X_Optimal = X[:, [0,1,4,5]]
regressor_OLS = sm.OLS(endog = Y, exog = X_Optimal).fit()
print(regressor_OLS.summary())

X_Optimal = X[:, [0,1,4]]
regressor_OLS = sm.OLS(endog = Y, exog = X_Optimal).fit()
print(regressor_OLS.summary())

# Fitting Multiple Linear Regression to the Optimal Training set.
# Y must be split alongside X_Optimal so the train/test rows stay aligned.

X_Optimal_Train, X_Optimal_Test, Y_Train, Y_Test = train_test_split(X_Optimal, Y, test_size = 0.2, random_state = 0)
regressor.fit(X_Optimal_Train, Y_Train)

# Predicting the Optimal Test set results

Y_Optimal_Pred = regressor.predict(X_Optimal_Test)
--------------------------------------------------------------------------------
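The final fit-and-predict step can be checked quantitatively on the held-out test set. The sketch below uses scikit-learn's r2_score for that check; the synthetic data stands in for X_Optimal and Y, so the numbers are illustrative, not results from 50_Startups.csv.

```python
# Illustrative end-to-end check: split X and Y together, fit on the
# training rows, and score predictions on the held-out test rows.
# The synthetic data here is a stand-in for X_Optimal and Y.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_Optimal = rng.normal(size=(50, 2))
Y = 0.8 * X_Optimal[:, 0] + 0.1 * rng.normal(size=50)

# Splitting X and Y in one call keeps their rows aligned
X_Optimal_Train, X_Optimal_Test, Y_Train, Y_Test = train_test_split(
    X_Optimal, Y, test_size=0.2, random_state=0)

regressor = LinearRegression().fit(X_Optimal_Train, Y_Train)
Y_Optimal_Pred = regressor.predict(X_Optimal_Test)
print(r2_score(Y_Test, Y_Optimal_Pred))
```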