├── 50_Startups.csv
├── README.md
└── multiple_linear_regression.py

/50_Startups.csv:
--------------------------------------------------------------------------------
R&D Spend,Administration,Marketing Spend,State,Profit
165349.2,136897.8,471784.1,New York,192261.83
162597.7,151377.59,443898.53,California,191792.06
153441.51,101145.55,407934.54,Florida,191050.39
144372.41,118671.85,383199.62,New York,182901.99
142107.34,91391.77,366168.42,Florida,166187.94
131876.9,99814.71,362861.36,New York,156991.12
134615.46,147198.87,127716.82,California,156122.51
130298.13,145530.06,323876.68,Florida,155752.6
120542.52,148718.95,311613.29,New York,152211.77
123334.88,108679.17,304981.62,California,149759.96
101913.08,110594.11,229160.95,Florida,146121.95
100671.96,91790.61,249744.55,California,144259.4
93863.75,127320.38,249839.44,Florida,141585.52
91992.39,135495.07,252664.93,California,134307.35
119943.24,156547.42,256512.92,Florida,132602.65
114523.61,122616.84,261776.23,New York,129917.04
78013.11,121597.55,264346.06,California,126992.93
94657.16,145077.58,282574.31,New York,125370.37
91749.16,114175.79,294919.57,Florida,124266.9
86419.7,153514.11,0,New York,122776.86
76253.86,113867.3,298664.47,California,118474.03
78389.47,153773.43,299737.29,New York,111313.02
73994.56,122782.75,303319.26,Florida,110352.25
67532.53,105751.03,304768.73,Florida,108733.99
77044.01,99281.34,140574.81,New York,108552.04
64664.71,139553.16,137962.62,California,107404.34
75328.87,144135.98,134050.07,Florida,105733.54
72107.6,127864.55,353183.81,New York,105008.31
66051.52,182645.56,118148.2,Florida,103282.38
65605.48,153032.06,107138.38,New York,101004.64
61994.48,115641.28,91131.24,Florida,99937.59
61136.38,152701.92,88218.23,New York,97483.56
63408.86,129219.61,46085.25,California,97427.84
55493.95,103057.49,214634.81,Florida,96778.92
46426.07,157693.92,210797.67,California,96712.8
46014.02,85047.44,205517.64,New York,96479.51
28663.76,127056.21,201126.82,Florida,90708.19
44069.95,51283.14,197029.42,California,89949.14
20229.59,65947.93,185265.1,New York,81229.06
38558.51,82982.09,174999.3,California,81005.76
28754.33,118546.05,172795.67,California,78239.91
27892.92,84710.77,164470.71,Florida,77798.83
23640.93,96189.63,148001.11,California,71498.49
15505.73,127382.3,35534.17,New York,69758.98
22177.74,154806.14,28334.72,California,65200.33
1000.23,124153.04,1903.93,New York,64926.08
1315.46,115816.21,297114.46,Florida,49490.75
0,135426.92,0,California,42559.73
542.05,51743.15,0,New York,35673.41
0,116983.8,45173.06,California,14681.4
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Multiple-Linear-Regression

A simple Python program that implements Multiple Linear Regression using the LinearRegression class from sklearn.linear_model.

The program also performs Backward Elimination to select the independent variables that best fit the regressor object of the LinearRegression class.

The program uses the statsmodels library to obtain the P values of the independent variables. Any variable with a P value greater than the significance level (set to 0.05) is removed and the model is refitted. This process repeats until only variables with P values below the significance level remain; the reduced set of independent variables is called X_Optimal.

X_Optimal is again split into a training set and a test set using the train_test_split function from sklearn.model_selection.
The regressor is fitted with the X_Optimal_Train and Y_Train variables, and the prediction for Y_Test (the dependent variable) is obtained with regressor.predict(X_Optimal_Test).
--------------------------------------------------------------------------------
/multiple_linear_regression.py:
--------------------------------------------------------------------------------
# Multiple Linear Regression

import numpy as np
import pandas as pd

# Importing the dataset

datasets = pd.read_csv('50_Startups.csv')
X = datasets.iloc[:, :-1].values
Y = datasets.iloc[:, 4].values

# Encoding categorical data

# Encoding the independent variable (State).
# OneHotEncoder's categorical_features argument was removed in newer
# scikit-learn releases; ColumnTransformer is the current approach and
# places the dummy columns first, as the old argument did.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder = 'passthrough')
X = ct.fit_transform(X).astype(float)

# Avoiding the Dummy Variable Trap by dropping one dummy column
X = X[:, 1:]

# Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

# Fitting Multiple Linear Regression to the Training set

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_Train, Y_Train)

# Predicting the Test set results

Y_Pred = regressor.predict(X_Test)

# Building the optimal model using Backward Elimination.
# sm.OLS lives in statsmodels.api, not statsmodels.formula.api.

import statsmodels.api as sm
X = np.append(arr = np.ones((len(X), 1)).astype(int), values = X, axis = 1)

X_Optimal = X[:, [0,1,2,3,4,5]]
regressor_OLS = sm.OLS(endog = Y, exog = X_Optimal).fit()
print(regressor_OLS.summary())

X_Optimal = X[:, [0,1,2,4,5]]
regressor_OLS = sm.OLS(endog = Y, exog = X_Optimal).fit()
print(regressor_OLS.summary())

X_Optimal = X[:, [0,1,4,5]]
regressor_OLS = sm.OLS(endog = Y, exog = X_Optimal).fit()
print(regressor_OLS.summary())

X_Optimal = X[:, [0,1,4]]
regressor_OLS = sm.OLS(endog = Y, exog = X_Optimal).fit()
print(regressor_OLS.summary())

# Fitting Multiple Linear Regression to the Optimal Training set.
# Y must be split alongside X_Optimal so the train/test rows stay aligned.

X_Optimal_Train, X_Optimal_Test, Y_Train, Y_Test = train_test_split(X_Optimal, Y, test_size = 0.2, random_state = 0)
regressor.fit(X_Optimal_Train, Y_Train)

# Predicting the Optimal Test set results

Y_Optimal_Pred = regressor.predict(X_Optimal_Test)
--------------------------------------------------------------------------------
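The final fit-and-predict step can be checked quantitatively on the held-out test set. The sketch below uses scikit-learn's r2_score for that check; the synthetic data stands in for X_Optimal and Y, so the numbers are illustrative, not results from 50_Startups.csv.

```python
# Illustrative end-to-end check: split X and Y together, fit on the
# training rows, and score predictions on the held-out test rows.
# The synthetic data here is a stand-in for X_Optimal and Y.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_Optimal = rng.normal(size=(50, 2))
Y = 0.8 * X_Optimal[:, 0] + 0.1 * rng.normal(size=50)

# Splitting X and Y in one call keeps their rows aligned
X_Optimal_Train, X_Optimal_Test, Y_Train, Y_Test = train_test_split(
    X_Optimal, Y, test_size=0.2, random_state=0)

regressor = LinearRegression().fit(X_Optimal_Train, Y_Train)
Y_Optimal_Pred = regressor.predict(X_Optimal_Test)
print(r2_score(Y_Test, Y_Optimal_Pred))
```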