├── README.md
└── main.py

/README.md:
--------------------------------------------------------------------------------
# Enigma-codeFest-Machine-Learning 2018

Following is the approach I took for the Analytics Vidhya Enigma-codeFest-Machine-Learning competition. I got a public leaderboard loss of 787.56 and ranked 8th overall.

1. My solution approach was simple and straightforward. First, I did data visualization to see the distribution of the data. I used seaborn to plot and find correlations between features and identify the important ones. Plotting histograms showed that features like Views and Reputation are skewed toward high values, while ID and Username are unnecessary features that can be removed.

2. Next, I spent most of my time on data preprocessing, where I removed unwanted features, scaled and filtered the data, and did feature engineering, which was a key element of my success. I used a Binarizer to create a new feature in my training data indicating whether the Answers feature is above a threshold.

3. I tried different algorithms such as SGDRegressor, SVR, and even decision trees, and none of them worked as I first expected, because the data distribution is polynomial. Therefore I combined polynomial features with LassoLars linear regression, which I found to be the best fit for this data.

4. Cross-validation was the key: I split my data with a test size of 0.22 and trained it to get an R² score of 0.91 on the validation set. To lower the loss I tried different approaches such as progressive learning, XGBoost, and an ANN, and none of them worked because they overfitted. So I decided to fine-tune the hyperparameters I had in order to get a good score.

5. One important outlier I found was in the Views feature, so I removed rows with Views above 3,000,000 from the training set, as I found that only 4 similar values exist in the test set and they contributed to the majority of the loss.
Once I removed those rows and varied my Binarizer threshold, I ranked 1st on the public leaderboard, but in the final ranking I stood 8th. I feel the reason is that I should also have tried simpler approaches like plain linear regression to get a higher rank on the private leaderboard.

You can download the data from the Analytics Vidhya webpage.

https://www.analyticsvidhya.com/

--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer versions
from sklearn.metrics import r2_score
from sklearn.preprocessing import LabelEncoder, StandardScaler, PolynomialFeatures, Binarizer
from sklearn import linear_model

train = pd.read_csv('train_NIR5Yl1.csv')
# Drop the extreme Views outliers described in the README.
train = train.drop(train[train.Views > 3000000].index)

labelencoder_X = LabelEncoder()
train['Tag'] = labelencoder_X.fit_transform(train['Tag'])
train.drop(['ID', 'Username'], axis=1, inplace=True)
target = train['Upvotes']

# Engineered flag: 1 if Answers is above the threshold, else 0.
bn = Binarizer(threshold=7)
train['pd_watched'] = bn.transform(train[['Answers']]).ravel()

feature_names = [x for x in train.columns if x not in ['Upvotes']]

x_train, x_val, y_train, y_val = train_test_split(
    train[feature_names], target, test_size=0.22, random_state=205)
sc_X = StandardScaler()
x_train = sc_X.fit_transform(x_train)
x_val = sc_X.transform(x_val)

poly_reg = PolynomialFeatures(degree=4, interaction_only=False, include_bias=True)
X_poly = poly_reg.fit_transform(x_train)
lin_reg_1 = linear_model.LassoLars(alpha=0.021, max_iter=150)
lin_reg_1.fit(X_poly, y_train)

# predicting
pred_val = lin_reg_1.predict(poly_reg.transform(x_val))

print(r2_score(y_val, pred_val))

# ---------------------------------------------------------------------------------------

# testing

test = pd.read_csv('test_8i3B3FC.csv')
ids = test['ID']
test.drop(['ID', 'Username'], axis=1, inplace=True)

# Reuse the encoder fitted on the training data so the labels stay consistent.
test['Tag'] = labelencoder_X.transform(test['Tag'])

test['pd_watched'] = bn.transform(test[['Answers']]).ravel()

# transform (not fit_transform): reuse the scaling learned on the training set.
test = sc_X.transform(test)

pred_test = lin_reg_1.predict(poly_reg.transform(test))
pred_test = np.abs(pred_test)  # upvote counts cannot be negative

submission = pd.DataFrame({'ID': ids,
                           'Upvotes': pred_test
                           })

submission.to_csv("final_sub477.csv", index=False)
--------------------------------------------------------------------------------
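
The scale → polynomial-expansion → LassoLars steps described above can also be chained as a scikit-learn `Pipeline`, which guarantees every transformer is fitted only on training data and applied unchanged to validation/test data. This is a minimal sketch, not the competition solution: the DataFrame below is synthetic stand-in data (the real CSVs are not included here), with column names taken from the README (Reputation, Answers, Views, Upvotes).

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LassoLars
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Hypothetical toy data standing in for the competition CSVs.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    'Reputation': rng.integers(0, 10_000, n),
    'Answers': rng.integers(0, 20, n),
    'Views': rng.integers(0, 100_000, n),
})
df['Upvotes'] = 0.001 * df['Views'] + 0.5 * df['Answers'] + rng.normal(0, 1, n)

# The engineered Binarizer-style flag: does Answers exceed the threshold?
df['pd_watched'] = (df['Answers'] > 7).astype(int)

X = df.drop(columns='Upvotes')
y = df['Upvotes']
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.22, random_state=205)

# Scale -> expand to degree-4 polynomial features -> LassoLars, as in main.py.
model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=4),
    LassoLars(alpha=0.021, max_iter=150),
)
model.fit(X_tr, y_tr)
score = r2_score(y_va, model.predict(X_va))
print(score)
```

Because `model.predict` reapplies the fitted scaler and polynomial expansion automatically, there is no way to accidentally call `fit_transform` on the test set, which is the main failure mode of doing these steps by hand.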