├── README.md
└── main.py

/README.md:
--------------------------------------------------------------------------------
# Enigma-codeFest-Machine-Learning 2018

Following is the approach I took for the Analytics Vidhya Enigma-codeFest-Machine-Learning competition. I got a public leaderboard loss of 787.56 and ranked 8th overall.

1. My solution approach was simple and straightforward. First, I did data visualization to see the distribution of the data. I used seaborn to plot and find correlations between features and identify the important ones. Plotting histograms showed that features like Views and Reputation are skewed toward high values, while ID and Username are unnecessary features that can be removed.

2. Next, I spent most of my time on data preprocessing, where I removed unwanted features, scaled and filtered the data, and did feature engineering, which was a key element of my success. I used a Binarizer to create a new feature in my training data indicating whether the Answers feature is above a threshold.

3. I tried different algorithms such as SGDRegressor, SVR, and even decision trees, and none of them worked as I first expected, because the data distribution is polynomial. Therefore I combined polynomial features with LassoLars linear regression, which I found to be the best fit for this data.

4. Cross-validation was the key: I split my data with a test size of 0.22 and trained it to get an R² score of 0.91 on the validation set. To lower the loss I tried different approaches such as progressive learning, XGBoost, and an ANN, and none of them worked because they overfitted. So I decided to fine-tune the hyperparameters I had in order to get a good score.

5. One important outlier I found was in the Views feature, so I removed rows with Views above 3,000,000 from the training set, as I found that only 4 similar values exist in the test set and they contributed to the majority of the loss.
Once I removed those rows and varied my Binarizer threshold, I ranked 1st on the public leaderboard, but in the final ranking I stood 8th. I feel the reason is that I should also have tried simpler approaches like plain linear regression to get a higher rank on the private leaderboard.

You can download the data from the Analytics Vidhya webpage.

https://www.analyticsvidhya.com/

--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer versions
from sklearn.metrics import r2_score
from sklearn.preprocessing import LabelEncoder, StandardScaler, PolynomialFeatures, Binarizer
from sklearn import linear_model

train = pd.read_csv('train_NIR5Yl1.csv')
# Drop the extreme Views outliers described in the README.
train = train.drop(train[train.Views > 3000000].index)

labelencoder_X = LabelEncoder()
train['Tag'] = labelencoder_X.fit_transform(train['Tag'])
train.drop(['ID', 'Username'], axis=1, inplace=True)
target = train['Upvotes']

# Engineered flag: 1 if Answers is above the threshold, else 0.
bn = Binarizer(threshold=7)
train['pd_watched'] = bn.transform(train[['Answers']]).ravel()

feature_names = [x for x in train.columns if x not in ['Upvotes']]

x_train, x_val, y_train, y_val = train_test_split(
    train[feature_names], target, test_size=0.22, random_state=205)
sc_X = StandardScaler()
x_train = sc_X.fit_transform(x_train)
x_val = sc_X.transform(x_val)

poly_reg = PolynomialFeatures(degree=4, interaction_only=False, include_bias=True)
X_poly = poly_reg.fit_transform(x_train)
lin_reg_1 = linear_model.LassoLars(alpha=0.021, max_iter=150)
lin_reg_1.fit(X_poly, y_train)

# predicting
pred_val = lin_reg_1.predict(poly_reg.transform(x_val))

print(r2_score(y_val, pred_val))

# ---------------------------------------------------------------------------------------

# testing

test = pd.read_csv('test_8i3B3FC.csv')
ids = test['ID']
test.drop(['ID', 'Username'], axis=1, inplace=True)

# Reuse the encoder fitted on the training data so the labels stay consistent.
test['Tag'] = labelencoder_X.transform(test['Tag'])

test['pd_watched'] = bn.transform(test[['Answers']]).ravel()

# transform (not fit_transform): reuse the scaling learned on the training set.
test = sc_X.transform(test)

pred_test = lin_reg_1.predict(poly_reg.transform(test))
pred_test = np.abs(pred_test)  # upvote counts cannot be negative

submission = pd.DataFrame({'ID': ids,
                           'Upvotes': pred_test
                           })

submission.to_csv("final_sub477.csv", index=False)
--------------------------------------------------------------------------------
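
The scale → polynomial-expansion → LassoLars steps described above can also be chained as a scikit-learn `Pipeline`, which guarantees every transformer is fitted only on training data and applied unchanged to validation/test data. This is a minimal sketch, not the competition solution: the DataFrame below is synthetic stand-in data (the real CSVs are not included here), with column names taken from the README (Reputation, Answers, Views, Upvotes).

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LassoLars
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Hypothetical toy data standing in for the competition CSVs.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    'Reputation': rng.integers(0, 10_000, n),
    'Answers': rng.integers(0, 20, n),
    'Views': rng.integers(0, 100_000, n),
})
df['Upvotes'] = 0.001 * df['Views'] + 0.5 * df['Answers'] + rng.normal(0, 1, n)

# The engineered Binarizer-style flag: does Answers exceed the threshold?
df['pd_watched'] = (df['Answers'] > 7).astype(int)

X = df.drop(columns='Upvotes')
y = df['Upvotes']
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.22, random_state=205)

# Scale -> expand to degree-4 polynomial features -> LassoLars, as in main.py.
model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=4),
    LassoLars(alpha=0.021, max_iter=150),
)
model.fit(X_tr, y_tr)
score = r2_score(y_va, model.predict(X_va))
print(score)
```

Because `model.predict` reapplies the fitted scaler and polynomial expansion automatically, there is no way to accidentally call `fit_transform` on the test set, which is the main failure mode of doing these steps by hand.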