├── Accenture Digital Hack Up ├── Readme.md ├── bidirectional_lstm_kernel.ipynb ├── data │ ├── CommentScorerDiagram.png │ └── HE_Accen_dataset.rar └── he_accen.ipynb ├── Affine ├── HE_Recommend_start.ipynb ├── data │ ├── Dataset.rar │ └── tesrt.txt └── readme.md ├── BrainWaves-17 ├── README.md ├── data │ └── HE_Ml_Chal_data.rar ├── hackerEarth_chal.ipynb └── hackerEarth_chal2.ipynb ├── Capgemini ├── README.md ├── SolutionPresentation1.pptx ├── data │ ├── Demandv1.1.xlsx │ ├── demand_trend.csv │ └── headcount.csv ├── supply_optimization.html └── supply_optimization.ipynb ├── Cavoo ├── Readme.md ├── cnn_con.py ├── convert1.py ├── resnet_con.py ├── resnet_pred_generator.py └── tes.txt ├── EnigmaIIT ├── Anno_Vid_Enigma_analysis.ipynb ├── ReadMe.md ├── data │ └── analiticv.rar ├── diff_outlier_note.ipynb └── solution_note.ipynb ├── Euristica-18 ├── HE_flight_predictor.ipynb ├── HE_smart_engineer.ipynb ├── HE_stack_flight_predictor.ipynb ├── README.md └── data │ ├── Electricity_Production_data.rar │ └── flight_predictor_data.rar ├── Euristica-19 ├── Data_Q1.ipynb ├── Data_Q2.ipynb ├── README.MD ├── data │ ├── 28406216-5-data_q2.zip │ ├── 299b9fb8-5-data-Q1.zip │ ├── new.txt │ └── resumes.zip ├── extract.py ├── new.txt └── techskill.csv ├── Expedia └── Expedia_hackerrank.ipynb ├── Innoplexus -19 ├── Elmo with tensorflow hub.ipynb ├── Neural net embedding.ipynb ├── Readme.md ├── bert-commited.ipynb ├── data │ ├── sample_submission_usrypCc.zip │ ├── test_XEV14AD.zip │ └── train.7z ├── innoplexus_NN.ipynb ├── memorization-innoplexus.ipynb ├── memorization_kernel.ipynb └── neural-net-and-anago.ipynb ├── Innoplexus ├── Readme.md ├── data │ ├── tes.txt │ ├── test_c2Mvube.zip │ └── train_bnHAB63.zip └── inno_vidhya.ipynb ├── Intel Scene Classification ├── Intel scene with FastAi (2).ipynb ├── Readme.md ├── Scene classification Chllenge.ipynb └── intel-scene-ensemble-places365-cadene.ipynb ├── Quartic ├── Quartic_kernel.ipynb └── Readme.md ├── README.md ├── Wns ├── Graphs │ ├── Pca_initial.png │ ├── correlation_matrix.png │ ├── pca_Later.png │ └── previous_year_rating.png ├── ReadMe.md ├── data │ ├── sample_submission_M0L0uXE.csv │ ├── test_2umaH9m.csv │ └── train_LZdllcl.csv ├── kernel.ipynb ├── wns_analytics_vidhya.ipynb └── wns_plan.txt └── ericsson_2019 ├── Readme.md ├── data ├── NLP_Datac2476d7.zip └── Predictive_Data32f5357.zip ├── ericsson_HE.ipynb └── ericsson_HE_NN.ipynb /Accenture Digital Hack Up/Readme.md: -------------------------------------------------------------------------------- 1 | # Accenture Digital Hack Up 2 | First Place solution for Accenture Digital Hack Up machine Learning challenge 3 | https://www.hackerearth.com/challenges/competitive/Accenture-ml/machine-learning/predict-comment-score/ 4 | 5 | ## Problem Statement 6 | In this competition we were chellenged to build a model that can predict scores of comments based upon The parent comment to which sarcastic comments are made and the Reply to a parent comment. 7 | 8 | Train Having 45000 comment replies and the test set having 30000 rows of comments.Here Task was to build a model that can predict scores of comments present in the test dataset. 9 | 10 | 11 | ![alt text](https://github.com/lucky630/ML-Challenges/blob/master/Accenture%20Digital%20Hack%20Up/data/CommentScorerDiagram.png) 12 | 13 | 1. Used tfidf vectorization for both word and character level to convert the comments into vector form. 14 | 2. Generate new features like comment length,sentiment value of comment & profanity value of comment. 15 | 3. 
Concatenate both and apply the lightgbmRegressor model for 6 folds to get the final prediction 16 | 4. lighgbmRegressor was tuned using the bayesian optimization. 17 | 18 | ## Team Member 19 | - [utsav aggarwal](https://github.com/utsav1) 20 | - [Arjun Rana](https://github.com/monsterspy) 21 | -------------------------------------------------------------------------------- /Accenture Digital Hack Up/bidirectional_lstm_kernel.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "metadata": { 5 | "_cell_guid": "9d2dbdb3-6c74-4f96-9865-2951dfd653ce", 6 | "_uuid": "bb41ad86b25fecf332927b0c8f55dd710101e33f" 7 | }, 8 | "cell_type": "markdown", 9 | "source": "## BiDirectional LSTM baseline" 10 | }, 11 | { 12 | "metadata": { 13 | "_cell_guid": "2f9b7a76-8625-443d-811f-8f49781aef81", 14 | "_uuid": "598f965bc881cfe6605d92903b758778d400fa8b", 15 | "trusted": true 16 | }, 17 | "cell_type": "code", 18 | "source": "import sys, os, re, csv, codecs, numpy as np, pandas as pd\n\nfrom keras.preprocessing.text import Tokenizer\nfrom keras import backend as K\nfrom keras.preprocessing.sequence import pad_sequences\nfrom keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation,CuDNNLSTM\nfrom keras.layers import Bidirectional, GlobalMaxPool1D\nfrom keras.models import Model\nfrom keras import initializers, regularizers, constraints, optimizers, layers\n\nfrom nltk import sent_tokenize", 19 | "execution_count": 1, 20 | "outputs": [ 21 | { 22 | "output_type": "stream", 23 | "text": "Using TensorFlow backend.\n", 24 | "name": "stderr" 25 | } 26 | ] 27 | }, 28 | { 29 | "metadata": { 30 | "_cell_guid": "c297fa80-beea-464b-ac90-f380ebdb02fe", 31 | "_uuid": "d961885dfde18796893922f72ade1bf64456404e" 32 | }, 33 | "cell_type": "markdown", 34 | "source": "We include the GloVe word vectors in our input files. To include these in your kernel, simple click 'input files' at the top of the notebook, and search 'glove' in the 'datasets' section." 35 | }, 36 | { 37 | "metadata": { 38 | "_cell_guid": "66a6b5fd-93f0-4f95-ad62-3253815059ba", 39 | "_uuid": "729b0f0c2a02c678631b8c072d62ff46146a82ef", 40 | "trusted": true 41 | }, 42 | "cell_type": "code", 43 | "source": "EMBEDDING_FILE='../input/glove6b50d/glove.6B.50d.txt'", 44 | "execution_count": 2, 45 | "outputs": [] 46 | }, 47 | { 48 | "metadata": { 49 | "trusted": true, 50 | "_uuid": "9afd9451058e4a636bb4716077859e5a4a17033f" 51 | }, 52 | "cell_type": "code", 53 | "source": "test = pd.read_csv('../input/he-accenture/test.csv')\ntrain = pd.read_csv('../input/he-accenture/train.csv')", 54 | "execution_count": 3, 55 | "outputs": [] 56 | }, 57 | { 58 | "metadata": { 59 | "trusted": true, 60 | "scrolled": true, 61 | "_uuid": "37164a2d565d7478e8ab8a9fef3c7cc074eaf6c4" 62 | }, 63 | "cell_type": "code", 64 | "source": "train.head()", 65 | "execution_count": 4, 66 | "outputs": [ 67 | { 68 | "output_type": "execute_result", 69 | "execution_count": 4, 70 | "data": { 71 | "text/plain": " UID ... score\n0 Tr-1 ... 2\n1 Tr-2 ... -4\n2 Tr-3 ... 3\n3 Tr-4 ... -8\n4 Tr-5 ... 6\n\n[5 rows x 5 columns]", 72 | "text/html": "
<div>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>UID</th>\n      <th>comment</th>\n      <th>date</th>\n      <th>parent_comment</th>\n      <th>score</th>\n    </tr>\n  </thead>\n  <tbody>\n
    <tr>\n      <th>0</th>\n      <td>Tr-1</td>\n      <td>NC and NH.</td>\n      <td>2016-10</td>\n      <td>Yeah, I get that argument. At this point, I'd ...</td>\n      <td>2</td>\n    </tr>\n
    <tr>\n      <th>1</th>\n      <td>Tr-2</td>\n      <td>You do know west teams play against west teams...</td>\n      <td>2016-11</td>\n      <td>The blazers and Mavericks (The wests 5 and 6 s...</td>\n      <td>-4</td>\n    </tr>\n
    <tr>\n      <th>2</th>\n      <td>Tr-3</td>\n      <td>They were underdogs earlier today, but since G...</td>\n      <td>2016-09</td>\n      <td>They're favored to win.</td>\n      <td>3</td>\n    </tr>\n
    <tr>\n      <th>3</th>\n      <td>Tr-4</td>\n      <td>This meme isn't funny none of the \"new york ni...</td>\n      <td>2016-10</td>\n      <td>deadass don't kill my buzz</td>\n      <td>-8</td>\n    </tr>\n
    <tr>\n      <th>4</th>\n      <td>Tr-5</td>\n      <td>I could use one of those tools.</td>\n      <td>2016-12</td>\n      <td>Yep can confirm I saw the tool they use for th...</td>\n      <td>6</td>\n    </tr>\n  </tbody>\n</table>\n</div>
" 73 | }, 74 | "metadata": {} 75 | } 76 | ] 77 | }, 78 | { 79 | "metadata": { 80 | "_cell_guid": "98f2b724-7d97-4da8-8b22-52164463a942", 81 | "_uuid": "b62d39216c8d00b3e6b78b825212fd190757dff9" 82 | }, 83 | "cell_type": "markdown", 84 | "source": "Set some basic config parameters:" 85 | }, 86 | { 87 | "metadata": { 88 | "_cell_guid": "2807a0a5-2220-4af6-92d6-4a7100307de2", 89 | "_uuid": "d365d5f8d9292bb9bf57d21d6186f8b619cbe8c3", 90 | "trusted": true 91 | }, 92 | "cell_type": "code", 93 | "source": "embed_size = 50\nmax_features = 20000\nmaxlen = 100", 94 | "execution_count": 5, 95 | "outputs": [] 96 | }, 97 | { 98 | "metadata": { 99 | "_cell_guid": "b3a8d783-95c2-4819-9897-1320e3295183", 100 | "_uuid": "4dd8a02e7ef983f10ec9315721c6dda2958024af" 101 | }, 102 | "cell_type": "markdown", 103 | "source": "Read in our data and replace missing values:" 104 | }, 105 | { 106 | "metadata": { 107 | "_cell_guid": "ac2e165b-1f6e-4e69-8acf-5ad7674fafc3", 108 | "_uuid": "8ab6dad952c65e9afcf16e43c4043179ef288780", 109 | "trusted": true 110 | }, 111 | "cell_type": "code", 112 | "source": "list_sentences_train = train[\"parent_comment\"].fillna(\"_na_\").values\ny = train['score'].values\nlist_sentences_test = test[\"parent_comment\"].fillna(\"_na_\").values", 113 | "execution_count": 6, 114 | "outputs": [] 115 | }, 116 | { 117 | "metadata": { 118 | "_cell_guid": "79afc0e9-b5f0-42a2-9257-a72458e91dbb", 119 | "_uuid": "c292c2830522bfe59d281ecac19f3a9415c07155", 120 | "trusted": true 121 | }, 122 | "cell_type": "code", 123 | "source": "tokenizer = Tokenizer(num_words=max_features)\ntokenizer.fit_on_texts(list(list_sentences_train))\nlist_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)\nlist_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)\nX_t = pad_sequences(list_tokenized_train, maxlen=maxlen)\nX_te = pad_sequences(list_tokenized_test, maxlen=maxlen)", 124 | "execution_count": 7, 125 | "outputs": [] 126 | }, 127 | { 128 | "metadata": { 129 | "_cell_guid": "7d19392b-7750-4a1b-ac30-ed75b8a62d52", 130 | "_uuid": "e9e3b4fa7c4658e0f22dd48cb1a289d9deb745fc", 131 | "trusted": true 132 | }, 133 | "cell_type": "code", 134 | "source": "def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')\nembeddings_index = dict(get_coefs(*o.strip().split()) for o in open(EMBEDDING_FILE))", 135 | "execution_count": 8, 136 | "outputs": [] 137 | }, 138 | { 139 | "metadata": { 140 | "_cell_guid": "4d29d827-377d-4d2f-8582-4a92f9569719", 141 | "_uuid": "96fc33012e7f07a2169a150c61574858d49a561b", 142 | "trusted": true, 143 | "collapsed": true 144 | }, 145 | "cell_type": "code", 146 | "source": "all_embs = np.stack(embeddings_index.values())\nemb_mean,emb_std = all_embs.mean(), all_embs.std()\nemb_mean,emb_std", 147 | "execution_count": 9, 148 | "outputs": [ 149 | { 150 | "output_type": "stream", 151 | "text": "/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: arrays to stack must be passed as a \"sequence\" type such as list or tuple. 
Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.\n \"\"\"Entry point for launching an IPython kernel.\n", 152 | "name": "stderr" 153 | }, 154 | { 155 | "output_type": "execute_result", 156 | "execution_count": 9, 157 | "data": { 158 | "text/plain": "(0.020940498, 0.6441043)" 159 | }, 160 | "metadata": {} 161 | } 162 | ] 163 | }, 164 | { 165 | "metadata": { 166 | "_cell_guid": "62acac54-0495-4a26-ab63-2520d05b3e19", 167 | "_uuid": "574c91e270add444a7bc8175440274bdd83b7173", 168 | "trusted": true 169 | }, 170 | "cell_type": "code", 171 | "source": "word_index = tokenizer.word_index\nnb_words = min(max_features, len(word_index))\nembedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))\nfor word, i in word_index.items():\n if i >= max_features: continue\n embedding_vector = embeddings_index.get(word)\n if embedding_vector is not None: embedding_matrix[i] = embedding_vector", 172 | "execution_count": 10, 173 | "outputs": [] 174 | }, 175 | { 176 | "metadata": { 177 | "trusted": true, 178 | "collapsed": true, 179 | "_uuid": "eded28dcbbfc4b6dfa297afd60e58ad452b47f7a" 180 | }, 181 | "cell_type": "code", 182 | "source": "embedding_matrix[0]", 183 | "execution_count": 11, 184 | "outputs": [ 185 | { 186 | "output_type": "execute_result", 187 | "execution_count": 11, 188 | "data": { 189 | "text/plain": "array([-0.47107019, -1.10401832, 1.29062063, 0.89376371, 0.40853162,\n 0.27875739, -0.69818692, 0.62945259, 0.87593289, 0.57920566,\n 0.17053255, -0.34860669, 0.29394367, 0.96782405, -1.47327159,\n 0.37755123, -0.99562403, 0.81564513, 0.90965464, 0.33451224,\n -0.24781101, -0.7812317 , -0.34947127, 0.02051574, -0.65041278,\n -0.17706355, -0.21324179, 0.34074742, 0.8865554 , -0.83005841,\n 0.40199546, 0.16029716, -0.35504682, 0.47784263, -1.21908015,\n 0.47664462, 0.47666782, -0.08967921, -0.34191007, -0.40728989,\n 0.19196549, -0.55590391, 0.2833674 , 0.43772412, -0.06909767,\n 0.06685626, 0.48040836, -0.20854702, 0.67613445, 0.02588322])" 190 | }, 191 | "metadata": {} 192 | } 193 | ] 194 | }, 195 | { 196 | "metadata": { 197 | "trusted": true, 198 | "_uuid": "7b7f4df6f79c48436ec8f336c04dc96ae49315d8" 199 | }, 200 | "cell_type": "code", 201 | "source": "def root_mean_squared_error(y_true, y_pred):\n return K.sqrt(K.mean(K.square(y_pred - y_true), axis=-1))", 202 | "execution_count": 12, 203 | "outputs": [] 204 | }, 205 | { 206 | "metadata": { 207 | "_cell_guid": "0d4cb718-7f9a-4eab-acda-8f55b4712439", 208 | "_uuid": "dc51af0bd046e1eccc29111a8e2d77bdf7c60d28", 209 | "trusted": true 210 | }, 211 | "cell_type": "code", 212 | "source": "inp = Input(shape=(maxlen,))\nx = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)\nx = Bidirectional(CuDNNLSTM(50, return_sequences=True))(x)\nx = GlobalMaxPool1D()(x)\nx = Dense(50, activation=\"relu\")(x)\nx = Dropout(0.1)(x)\nx = Dense(1, activation=\"sigmoid\")(x)\nmodel = Model(inputs=inp, outputs=x)\nmodel.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])", 213 | "execution_count": 13, 214 | "outputs": [] 215 | }, 216 | { 217 | "metadata": { 218 | "_cell_guid": "333626f1-a838-4fea-af99-0c78f1ef5f5c", 219 | "scrolled": false, 220 | "_uuid": "c1558c6b2802fc632edc4510c074555a590efbd8", 221 | "trusted": true 222 | }, 223 | "cell_type": "code", 224 | "source": "model.fit(X_t, y, batch_size=32, epochs=3, validation_split=0.1);", 225 | "execution_count": 14, 226 | "outputs": [ 227 | { 228 | "output_type": "stream", 229 | "text": "Train 
on 40500 samples, validate on 4500 samples\nEpoch 1/3\n40500/40500 [==============================] - 59s 1ms/step - loss: 2439.0314 - acc: 0.3427 - val_loss: 1475.4313 - val_acc: 0.3382\nEpoch 2/3\n40500/40500 [==============================] - 57s 1ms/step - loss: 2438.9404 - acc: 0.3434 - val_loss: 1475.4313 - val_acc: 0.3382\nEpoch 3/3\n40500/40500 [==============================] - 57s 1ms/step - loss: 2438.9404 - acc: 0.3434 - val_loss: 1475.4313 - val_acc: 0.3382\n", 230 | "name": "stdout" 231 | } 232 | ] 233 | }, 234 | { 235 | "metadata": { 236 | "_cell_guid": "d6fa2ace-aa92-40cf-913f-a8f5d5a4b130", 237 | "_uuid": "3dbaa4d0c22271b8b0dc7e58bcad89ddc607beaf" 238 | }, 239 | "cell_type": "markdown", 240 | "source": "And finally, get predictions for the test set and prepare a submission CSV:" 241 | }, 242 | { 243 | "metadata": { 244 | "_cell_guid": "617e974a-57ee-436e-8484-0fb362306db2", 245 | "_uuid": "2b969bab77ab952ecd5abf2abe2596a0e23df251", 246 | "trusted": true 247 | }, 248 | "cell_type": "code", 249 | "source": "prediction = model.predict([X_te])", 250 | "execution_count": 15, 251 | "outputs": [] 252 | }, 253 | { 254 | "metadata": { 255 | "trusted": true, 256 | "_uuid": "8e1373f6000dbba9f818ba66cfa49298f96683a2" 257 | }, 258 | "cell_type": "code", 259 | "source": "# prediction[:50]", 260 | "execution_count": 16, 261 | "outputs": [] 262 | }, 263 | { 264 | "metadata": { 265 | "trusted": true, 266 | "_uuid": "09fdfc2d1454ae17e2c39a353f28ea1e28c7ff97" 267 | }, 268 | "cell_type": "code", 269 | "source": "submission = pd.DataFrame.from_dict({'UID': test['UID']})\nsubmission['score'] = prediction\nsubmission.to_csv('submission.csv', index=False)", 270 | "execution_count": 17, 271 | "outputs": [] 272 | }, 273 | { 274 | "metadata": { 275 | "trusted": true, 276 | "_uuid": "293c8aa8b4c4a39ec3401dc3d9ac51deddfcff35" 277 | }, 278 | "cell_type": "code", 279 | "source": "", 280 | "execution_count": null, 281 | "outputs": [] 282 | } 283 | ], 284 | "metadata": { 285 | "kernelspec": { 286 | "display_name": "Python 3", 287 | "language": "python", 288 | "name": "python3" 289 | }, 290 | "language_info": { 291 | "name": "python", 292 | "version": "3.6.6", 293 | "mimetype": "text/x-python", 294 | "codemirror_mode": { 295 | "name": "ipython", 296 | "version": 3 297 | }, 298 | "pygments_lexer": "ipython3", 299 | "nbconvert_exporter": "python", 300 | "file_extension": ".py" 301 | } 302 | }, 303 | "nbformat": 4, 304 | "nbformat_minor": 1 305 | } -------------------------------------------------------------------------------- /Accenture Digital Hack Up/data/CommentScorerDiagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/Accenture Digital Hack Up/data/CommentScorerDiagram.png -------------------------------------------------------------------------------- /Accenture Digital Hack Up/data/HE_Accen_dataset.rar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/Accenture Digital Hack Up/data/HE_Accen_dataset.rar -------------------------------------------------------------------------------- /Affine/data/Dataset.rar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/Affine/data/Dataset.rar 
-------------------------------------------------------------------------------- /Affine/data/tesrt.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Affine/readme.md: -------------------------------------------------------------------------------- 1 | # Affine Analytics ML Challenge 2 | Third place solution for the Affine Analytics challenge 3 | 4 | ## About 5 | In this competition we were challenged to build a property recommendation system. The dataset for this competition has the following tables: 6 | 1. Accounts: information on customers/accounts. These are the accounts to whom the properties are being marketed for sale. 7 | 2. Opportunities: the historic deals for the accounts, i.e. a transaction summary of the successful deals that have happened between the accounts and the properties. 8 | 3. Property: the universal list of properties and their details. 9 | There are also two mapping tables: 10 | 1. Accounts to Properties: properties that have already been bought by the accounts, i.e. the account information plus the property details of the lead. 11 | 2. Deal to Properties: the deals that have materialized on the properties, i.e. the deal-to-property mapping. 12 | 13 | The training set has 2,727 accounts along with information about their historic deals. The test set, on the other hand, has 29 new accounts with no information about their past behaviour. The task was to recommend a finite number of properties to these new accounts. 14 | 15 | ## Approach 16 | 1. The approach was to use the account features and the property features and map them to each other in some way. 17 | 2. Because no historical deal information is known for the accounts in the test set, we cannot apply content-based filtering, which relies on a user's own historical behaviour to make suggestions. The most suitable methods in this cold-start situation are collaborative filtering and hybrid filtering, which use the similarity between customers' personal information to make suggestions. 18 | 3. KNN (nearest neighbours) was used to find the most similar accounts; the properties those similar customers had bought were then retrieved via the mapping tables. Finally, nearest neighbours was applied again, this time on the properties, to find similar properties, and those properties were recommended to the new accounts (a rough sketch of this lookup is given after the findings below). 19 | 20 | ## Findings 21 | 1. Properties built before 1985 are not recommended to the new users. 22 | 2. Properties whose sale year in the Opportunities table is before 2003 are not recommended to the new users. 23 | 3. Properties with a demolished status are also not considered for recommendation. After filtering on these rules we were able to remove half of the properties from the Property table. 24 | 4. Some properties have been sold more than once in their lifecycle, and sometimes more than one property is sold in a single deal.
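The neighbour-based lookup described in step 3 of the approach can be sketched as follows. This is a minimal illustration under assumed inputs, not the competition code: the frame names (`accounts`, `account_property_map`, `properties`), the column names (`account_id`, `property_id`) and the cosine-distance `NearestNeighbors` settings are placeholders chosen for the example only.

```python
# Minimal sketch of the cold-start, KNN-based recommendation flow described above.
# All names below (accounts, account_property_map, properties, account_id,
# property_id) are illustrative placeholders, not taken from the original code.
from sklearn.neighbors import NearestNeighbors


def recommend_properties(new_account_vector, accounts, account_property_map,
                         properties, n_accounts=5, n_properties=3):
    """Recommend property ids for a new account with no deal history."""
    # 1) Find the training accounts whose profile features are closest
    #    to the new account's profile.
    account_features = accounts.drop(columns=['account_id'])
    account_knn = NearestNeighbors(n_neighbors=n_accounts, metric='cosine')
    account_knn.fit(account_features)
    _, account_idx = account_knn.kneighbors([new_account_vector])
    similar_accounts = accounts.iloc[account_idx[0]]['account_id']

    # 2) Look up the properties those similar accounts already bought,
    #    using the Accounts-to-Properties mapping table.
    seed_properties = account_property_map.loc[
        account_property_map['account_id'].isin(similar_accounts), 'property_id'
    ].unique()

    # 3) Expand the seed set with properties similar to the purchased ones
    #    and return the result as the recommendation list.
    property_features = properties.drop(columns=['property_id'])
    property_knn = NearestNeighbors(n_neighbors=n_properties, metric='cosine')
    property_knn.fit(property_features)
    seed_features = property_features[properties['property_id'].isin(seed_properties)]
    _, property_idx = property_knn.kneighbors(seed_features)
    return properties.iloc[property_idx.ravel()]['property_id'].unique()
```

In practice the property table would first be filtered by the rules in the findings above (built after 1985, sold after 2003, not demolished) before fitting the neighbour models.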
25 | 26 | ## Team Members 27 | - [Arjun Rana](https://github.com/monsterspy) 28 | - [utsav aggarwal](https://github.com/utsav1) 29 | -------------------------------------------------------------------------------- /BrainWaves-17/README.md: -------------------------------------------------------------------------------- 1 | # HE_Ml_Chal 2 | Solution for the HackerEarth BrainWaves 2017-18 machine learning challenge 3 | -------------------------------------------------------------------------------- /BrainWaves-17/data/HE_Ml_Chal_data.rar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/BrainWaves-17/data/HE_Ml_Chal_data.rar -------------------------------------------------------------------------------- /Capgemini/README.md: -------------------------------------------------------------------------------- 1 | # Capgemini Data Science Challenge 2 | Ranked 13th on the leaderboard 3 | https://techchallenge.in.capgemini.com/techchallenge/data-science?leaderboard=true 4 | 5 | ## Problem Statement 6 | - The motivation behind the problem is to optimize people supply chain management for InterstellarX Inc. 7 | - Build a predictive demand model that can forecast the demand for the next two months. 8 | - Plan the optimized supply needed per month for the next 12 months based on the demand forecast. 9 | - Showcase the net profit or loss of the business if variable factors are changed and demand and supply change as a consequence. 10 | 11 | ## Constraints and Assumptions 12 | - The total budget for maintaining the bench for the current year is $5.76 Mn. 13 | - The average cost per resource per month is $685. 14 | - The current bench strength is 400, which means the annual bench budget consumption already stands at $3.288 Mn on day 1 of the year. 15 | - The end-of-year average bench cost cannot exceed the total budget of $5.76 Mn. 16 | - Average annual attrition is 20% of total headcount; total headcount at the beginning of the year is 10,000 and cannot exceed 12,000 at the end of the year. 17 | - Once billed, a resource stays with the same account forever and does not come back into the bench or move to any other project. 18 | 19 | ## Implementation Overview 20 | - Problem understanding through extensive exploratory data analysis in Tableau as well as in Python. 21 | - Aggregated the records on a monthly basis for the demand and headcount datasets. 22 | - Used monthly aggregated headcount data from 2004 to 2015 for training the model and 2016 demand data for testing. 23 | - Performed the Dickey-Fuller test to check the stationarity of the time series, then decomposed the signal into trend, seasonality and residual. 24 | - Applied time series models such as ARIMA and Prophet to forecast the demand (a rough sketch of this step follows the list below). 25 | - Used the forecasted demand to plan the supply of resources.
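The stationarity check and forecasting step described above can be sketched roughly as follows. This is only an illustrative outline, assuming a monthly-aggregated frame with `month` (datetime) and `demand` columns; the column names, the `(1, 1, 1)` ARIMA order and the 12-month horizon are placeholders and are not taken from the original notebook, which also used Prophet in the same role.

```python
# Rough sketch of the Dickey-Fuller check, decomposition and forecast described
# above. The frame/column names and model order are assumptions for illustration.
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA


def forecast_monthly_demand(demand_monthly: pd.DataFrame, horizon: int = 12) -> pd.Series:
    """Return a `horizon`-month demand forecast from a monthly aggregate."""
    series = demand_monthly.set_index('month')['demand'].asfreq('MS')

    # Augmented Dickey-Fuller test: a small p-value (< ~0.05) suggests the
    # series is stationary; otherwise differencing is needed before ARIMA.
    adf_stat, p_value, *_ = adfuller(series.dropna())
    print(f'ADF statistic = {adf_stat:.3f}, p-value = {p_value:.3f}')

    # Decompose into trend, seasonality and residual for visual inspection
    # (decomposition.trend / .seasonal / .resid can be plotted).
    decomposition = seasonal_decompose(series, model='additive', period=12)

    # Placeholder ARIMA fit; the order would normally be chosen from the
    # ADF result and ACF/PACF plots (or the model swapped for Prophet).
    model = ARIMA(series, order=(1, 1, 1)).fit()
    return model.forecast(steps=horizon)
```

The forecasted monthly demand, combined with the attrition and bench-cost constraints listed above, then drives the supply plan.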
26 | 27 | 28 | ## Stack Used 29 | - Tableau 30 | - Fbprophet 31 | - Plotly 32 | - Statsmodels 33 | - IBM Watson Studio 34 | 35 | ## Team Member 36 | - [Arjun Rana](https://github.com/monsterspy) 37 | -------------------------------------------------------------------------------- /Capgemini/SolutionPresentation1.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/Capgemini/SolutionPresentation1.pptx -------------------------------------------------------------------------------- /Capgemini/data/Demandv1.1.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/Capgemini/data/Demandv1.1.xlsx -------------------------------------------------------------------------------- /Cavoo/Readme.md: -------------------------------------------------------------------------------- 1 | ## Caavo Computer Vision Challenge 2 | Winner solution for Caavo Computer Vision Challenge 3 | -------------------------------------------------------------------------------- /Cavoo/cnn_con.py: -------------------------------------------------------------------------------- 1 | import os 2 | from keras.preprocessing.image import ImageDataGenerator 3 | from keras import optimizers 4 | from keras.models import Sequential 5 | from keras.layers import Dropout, Flatten, Dense, Activation 6 | from keras.layers.convolutional import Convolution2D, MaxPooling2D 7 | from keras import callbacks 8 | 9 | os.chdir('C:\\Users\\royal\\Downloads\\Compressed\\dataset52bd6ce') 10 | 11 | epochs = 2 12 | train_data_path = 'dataset\\train' 13 | validation_data_path = 'dataset\\val' 14 | 15 | """Parameters""" 16 | img_width, img_height = 224, 224 17 | batch_size = 32 18 | #samples_per_epoch = 1000 19 | samples_per_epoch = 2 20 | validation_steps = 1 21 | nb_filters1 = 32 22 | nb_filters2 = 64 23 | conv1_size = 3 24 | conv2_size = 2 25 | pool_size = 2 26 | classes_num = 15 27 | lr = 0.0004 28 | 29 | model = Sequential() 30 | model.add(Convolution2D(nb_filters1, conv1_size, conv1_size, border_mode ="same", input_shape=(img_width, img_height, 3))) 31 | model.add(Activation("relu")) 32 | model.add(MaxPooling2D(pool_size=(pool_size, pool_size))) 33 | 34 | model.add(Convolution2D(nb_filters2, conv2_size, conv2_size, border_mode ="same")) 35 | model.add(Activation("relu")) 36 | model.add(MaxPooling2D(pool_size=(pool_size, pool_size), dim_ordering='th')) 37 | 38 | model.add(Flatten()) 39 | model.add(Dense(256)) 40 | model.add(Activation("relu")) 41 | model.add(Dropout(0.5)) 42 | model.add(Dense(classes_num, activation='softmax')) 43 | 44 | model.compile(loss='categorical_crossentropy',optimizer=optimizers.RMSprop(lr=lr),metrics=['accuracy']) 45 | 46 | #train_datagen = ImageDataGenerator( 47 | # rescale=1. 
/ 255, 48 | # shear_range=0.2, 49 | # zoom_range=0.2, 50 | # horizontal_flip=True) 51 | 52 | train_datagen = ImageDataGenerator( 53 | rotation_range=30, 54 | horizontal_flip=True, 55 | width_shift_range = 0.2, 56 | height_shift_range = 0.2) 57 | 58 | test_datagen = ImageDataGenerator() 59 | 60 | train_generator = train_datagen.flow_from_directory( 61 | train_data_path, 62 | target_size=(img_height, img_width), 63 | batch_size=batch_size, 64 | class_mode='categorical') 65 | 66 | validation_generator = test_datagen.flow_from_directory( 67 | validation_data_path, 68 | target_size=(img_height, img_width), 69 | class_mode='categorical') 70 | 71 | 72 | model.fit_generator( 73 | train_generator,verbose=1, 74 | steps_per_epoch=samples_per_epoch, 75 | epochs=epochs, 76 | validation_data=validation_generator, 77 | validation_steps=2) 78 | 79 | model.save('model_cnn.h5') 80 | model.save_weights('model_cnn_weights.h5') 81 | -------------------------------------------------------------------------------- /Cavoo/convert1.py: -------------------------------------------------------------------------------- 1 | from keras.preprocessing.image import load_img 2 | from keras.preprocessing.image import img_to_array 3 | from keras.applications.imagenet_utils import decode_predictions 4 | import matplotlib.pyplot as plt 5 | import os 6 | import numpy as np 7 | 8 | os.chdir('C:\\Users\\royal\\Downloads\\Compressed\\dataset52bd6ce') 9 | 10 | filename = 'n02854926_2_0.jpg' 11 | # load an image in PIL format 12 | original = load_img(filename, target_size=(224, 224)) 13 | print('PIL image size',original.size) 14 | plt.imshow(original) 15 | plt.show() 16 | 17 | # convert the PIL image to a numpy array 18 | # IN PIL - image is in (width, height, channel) 19 | # In Numpy - image is in (height, width, channel) 20 | numpy_image = img_to_array(original) 21 | plt.imshow(np.uint8(numpy_image)) 22 | plt.show() 23 | print('numpy array size',numpy_image.shape) 24 | 25 | # Convert the image / images into batch format 26 | # expand_dims will add an extra dimension to the data at a particular axis 27 | # We want the input matrix to the network to be of the form (batchsize, height, width, channels) 28 | # Thus we add the extra dimension to the axis 0. 
29 | image_batch = np.expand_dims(numpy_image, axis=0) 30 | print('image batch size', image_batch.shape) 31 | plt.imshow(np.uint8(image_batch[0])) 32 | -------------------------------------------------------------------------------- /Cavoo/resnet_con.py: -------------------------------------------------------------------------------- 1 | #from tensorflow.python.keras.applications import ResNet50,InceptionV3,VGG16 2 | #from tensorflow.python.keras.models import Sequential 3 | #from tensorflow.python.keras.layers import Dense, Flatten, GlobalAveragePooling2D,Dropout 4 | 5 | #from tensorflow.python.keras.applications.resnet50 import preprocess_input 6 | #from tensorflow.python.keras.preprocessing.image import ImageDataGenerator 7 | #from keras.callbacks import EarlyStopping, ModelCheckpoint, LearningRateScheduler 8 | 9 | from keras.preprocessing import image 10 | from keras.preprocessing.image import ImageDataGenerator 11 | from keras.applications.resnet50 import preprocess_input, ResNet50 12 | 13 | from keras.applications.xception import Xception 14 | from keras.applications.vgg16 import VGG16 15 | from keras.applications.vgg19 import VGG19 16 | from keras.applications.resnet50 import ResNet50 17 | from keras.applications.inception_v3 import InceptionV3 18 | from keras.applications.inception_resnet_v2 import InceptionResNetV2 19 | from keras.applications.mobilenet import MobileNet 20 | 21 | from keras.models import Sequential, Model 22 | from keras.layers import Dense, Dropout, Flatten 23 | from keras.callbacks import EarlyStopping, ModelCheckpoint 24 | from keras.layers.normalization import BatchNormalization 25 | from keras import optimizers, Input 26 | 27 | import os 28 | 29 | os.chdir('C:\\Users\\royal\\Downloads\\Compressed\\dataset52bd6ce') 30 | 31 | num_classes = 15 32 | 33 | inception_resnet_weight = 'inception_resnet_v2_weights_tf_dim_ordering_tf_kernels_notop.h5' 34 | inception_weights_path = 'inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5' 35 | resnet_weights_path = 'resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5' 36 | 37 | main_model = InceptionResNetV2(include_top=False, weights=inception_resnet_weight, input_tensor=Input(shape=(224, 224, 3))) 38 | #main_model = ResNet50(include_top=False, weights=resnet_weights_path, input_tensor=Input(shape=(224, 224, 3))) 39 | main_model.trainable = False 40 | for layer in main_model.layers[:-2]: 41 | layer.trainable = False 42 | 43 | top_model = Sequential() 44 | top_model.add(Flatten(input_shape=main_model.output_shape[1:])) 45 | top_model.add(Dense(256, activation='relu')) 46 | top_model.add(Dropout(0.3)) 47 | top_model.add(Dense(num_classes, activation='sigmoid')) 48 | my_new_model = Model(inputs=main_model.input,outputs=top_model(main_model.output)) 49 | opt = optimizers.Adam(lr=0.001, decay=0.1) 50 | my_new_model.compile(loss='binary_crossentropy',optimizer=opt,metrics=['accuracy']) 51 | 52 | #my_new_model = Sequential() 53 | #my_new_model.add(InceptionV3(include_top=False, weights=inception_weights_path)) 54 | #my_new_model.add(ResNet50(include_top=False, pooling='avg', weights=resnet_weights_path)) 55 | #my_new_model.add(Flatten()) 56 | #my_new_model.add(Dropout(0.1)) 57 | #my_new_model.add(Dense(num_classes, activation='softmax')) 58 | 59 | # Say not to train first layer (ResNet) model. 
It is already trained 60 | #for layer in my_new_model.layers[0].layers[:-4]: 61 | # layer.trainable = False 62 | 63 | #my_new_model.layers[0].trainable = False 64 | 65 | #categorical_crossentropy 66 | #my_new_model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy']) 67 | #my_new_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']) 68 | 69 | image_size = 224 70 | 71 | data_generator_with_aug = ImageDataGenerator( 72 | rotation_range=30, 73 | horizontal_flip=True, 74 | width_shift_range = 0.2, 75 | rescale=1./255, 76 | height_shift_range = 0.2) 77 | 78 | data_generator_with_aug = ImageDataGenerator(rescale=1./255) 79 | 80 | train_generator = data_generator_with_aug.flow_from_directory( 81 | 'dataset\\train', 82 | target_size=(image_size, image_size), 83 | batch_size=24, class_mode='categorical') 84 | 85 | label_map = (train_generator.class_indices) 86 | 87 | print(label_map) 88 | 89 | data_generator = ImageDataGenerator(rescale=1./255) 90 | 91 | validation_generator = data_generator.flow_from_directory( 92 | 'dataset\\val', 93 | target_size=(image_size, image_size), 94 | class_mode='categorical') 95 | 96 | my_new_model.fit_generator( 97 | train_generator, verbose=1, 98 | steps_per_epoch=1, 99 | epochs=15, 100 | validation_data=validation_generator,validation_steps=1) 101 | 102 | # serialize weights to HDF5 103 | mod_weight = 'incep_resnet_model_weight.h5' 104 | mod_only = 'incep_resnet_model.h5' 105 | #mod_weight = 'resnet_model_weight.h5' 106 | #mod_only = 'resnet_model.h5' 107 | #mod_weight = 'inception_model_weight.h5' 108 | #mod_only = 'inception_model.h5' 109 | 110 | my_new_model.save_weights(mod_weight) 111 | print("Saved weight to disk") 112 | my_new_model.save(mod_only) 113 | print("Saved model to disk") 114 | 115 | data_generator = ImageDataGenerator(rescale=1./255) 116 | image_size = 224 117 | test_generator = data_generator.flow_from_directory( 118 | 'dataset\\aa', 119 | target_size=(image_size, image_size), 120 | batch_size=1, 121 | class_mode='categorical') 122 | 123 | filenames = test_generator.filenames 124 | nb_samples = len(filenames) 125 | print(nb_samples) 126 | 127 | print('test prediction') 128 | predict = my_new_model.predict_generator(test_generator,steps = nb_samples) 129 | 130 | #print(predict) 131 | print(predict.shape) 132 | 133 | pred = predict.argmax(axis=1) 134 | print(pred.shape) 135 | print(pred) 136 | 137 | dd={} 138 | dd1 = {0: '0', 1: '1', 2: '10', 3: '11', 4: '12', 5: '13', 6: '14', 7: '2', 8: '3', 9: '4', 139 | 10: '5', 11: '6', 12: '7', 13: '8', 14: '9'} 140 | 141 | writ = open('sub.csv','w') 142 | writ.writelines('image_name,category'+'\n') 143 | 144 | for j,i in enumerate(filenames): 145 | writ.writelines(i+','+ dd1[pred[j]] + '\n') 146 | writ.close() 147 | print('closed sub file..') 148 | 149 | 150 | writ_prob = open('sub_prob.csv','w') 151 | writ_prob.writelines('image_name,category'+'\n') 152 | 153 | for j,i in enumerate(filenames): 154 | writ_prob.writelines(i+',') 155 | for k in predict[j]: 156 | writ_prob.writelines(str(k)+',') 157 | writ_prob.writelines('\n') 158 | writ_prob.close() 159 | print('closed prob file') 160 | -------------------------------------------------------------------------------- /Cavoo/resnet_pred_generator.py: -------------------------------------------------------------------------------- 1 | #from keras.models import load_model 2 | import tensorflow as tf 3 | from keras.models import load_model 4 | from keras.preprocessing.image import load_img 5 | from 
keras.preprocessing.image import img_to_array 6 | from keras.preprocessing.image import ImageDataGenerator 7 | from keras.applications.imagenet_utils import decode_predictions 8 | import os 9 | import numpy as np 10 | import cv2 11 | import csv 12 | 13 | os.chdir('C:\\Users\\royal\\Downloads\\Compressed\\dataset52bd6ce') 14 | 15 | model = load_model('resnet_model.h5') 16 | 17 | print('loaded') 18 | 19 | dd={} 20 | dd1 = {0: '0', 1: '1', 2: '10', 3: '11', 4: '12', 5: '13', 6: '14', 7: '2', 8: '3', 9: '4', 21 | 10: '5', 11: '6', 12: '7', 13: '8', 14: '9'} 22 | 23 | data_generator = ImageDataGenerator(rescale=1./255) 24 | 25 | image_size = 224 26 | 27 | test_generator = data_generator.flow_from_directory( 28 | 'dataset\\aa', 29 | target_size=(image_size, image_size), 30 | class_mode='categorical') 31 | 32 | filenames = test_generator.filenames 33 | nb_samples = len(filenames) 34 | print(nb_samples) 35 | 36 | predict = model.predict_generator(test_generator,steps = nb_samples / 22) 37 | 38 | print(predict) 39 | print(predict.shape) 40 | print(dd) 41 | 42 | print(predict.argmax(axis=1)) 43 | -------------------------------------------------------------------------------- /Cavoo/tes.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /EnigmaIIT/ReadMe.md: -------------------------------------------------------------------------------- 1 | ## Enigma IIT-BHU machine Learning challenge 2 | ### About 3 | Enigma Machine learning challenge were organized by IIT BHU.In this Regression Challenge we were to predict the number of upvotes on the questions posted by the users. 4 | -------------------------------------------------------------------------------- /EnigmaIIT/data/analiticv.rar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/EnigmaIIT/data/analiticv.rar -------------------------------------------------------------------------------- /EnigmaIIT/diff_outlier_note.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "metadata": { 5 | "trusted": true, 6 | "_uuid": "bececbbba8deee8a46217e4d69fca445d0e188ab", 7 | "collapsed": true 8 | }, 9 | "cell_type": "code", 10 | "source": "import numpy as np\nimport os\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nimport pandas as pd\nimport random\nimport xgboost as xgb\n\nfrom sklearn import preprocessing\nfrom sklearn.linear_model import LogisticRegression\nfrom xgboost import XGBRegressor\nimport lightgbm as lgb\nfrom lightgbm import LGBMRegressor\nfrom sklearn.preprocessing import OneHotEncoder,LabelEncoder\n\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.decomposition import PCA\nfrom sklearn.model_selection import KFold\nfrom sklearn.metrics import r2_score,mean_squared_error\nfrom math import sqrt\nfrom scipy import stats\nfrom scipy.stats import norm, skew #for some statistics\nfrom sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV\nfrom sklearn.linear_model import Ridge\nfrom sklearn.linear_model import Lasso\nfrom sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor\nfrom sklearn.svm import SVR, LinearSVR\nfrom sklearn.linear_model import ElasticNet, SGDRegressor, BayesianRidge\nfrom sklearn.kernel_ridge import KernelRidge\n\ndef 
apply_log(train,test):\n train[\"Upvotes\"] = np.log1p(train[\"Upvotes\"])\n train[\"Reputation\"] = np.log1p(train[\"Reputation\"])\n train[\"Views\"] = np.log1p(train[\"Views\"])\n train[\"Answers\"] = np.log1p(train[\"Answers\"])\n\n test[\"Reputation\"] = np.log1p(test[\"Reputation\"])\n test[\"Views\"] = np.log1p(test[\"Views\"])\n test[\"Answers\"] = np.log1p(test[\"Answers\"])\n return train,test\n\ndef get_label_id(train,test):\n #Drop the irrelevant features from the test set\n ID_train = train['ID']\n train_labels = np.array(train['Upvotes'])\n train_features = train.drop(['Upvotes','ID','Username','Tag'], axis=1)\n train_features = train_features.fillna(0)\n #Drop the irrelevant features from the test set\n ID_test = test['ID']\n test_features = test.drop(['ID','Username','Tag'], axis=1)\n test_features = test_features.fillna(0)\n \n return ID_train,train_features,train_labels,ID_test,test_features\n \n\ndef train_mod(train_features,test_features,train_labels):\n st_train = train_features.values\n st_test = test_features.values\n Y = train_labels\n# clf = lgb.LGBMRegressor()\n# clf = xgb.XGBRegressor()\n# clf = LinearRegression()\n clf = BayesianRidge()\n # clf = lgb.LGBMRegressor(max_depth=6,learning_rate=0.0716,n_estimators=128,num_leaves=24,reg_alpha=1.7250,reg_lambda=0.0888,subsample=0.6361,colsample_bytree=0.9365)\n# clf = xgb.XGBRegressor(gamma = 0.76,learning_rate = 0.0100,max_depth = 5,min_child_weight = 2,n_estimators = 107,subsample = 0.60,colsample_bytree = 0.9900)\n# clf = xgb.XGBRegressor(gamma = 0.9257,learning_rate = 0.2797,max_depth = 5,min_child_weight = 9,n_estimators = 305,subsample = 0.6239,colsample_bytree = 0.7443)\n# clf = xgb.XGBRegressor(gamma = 0.9019,learning_rate = 0.2530,max_depth = 4,min_child_weight = 8,n_estimators = 300,subsample = 0.6409,colsample_bytree = 0.7380)\n fold = 5\n cv = KFold(n_splits=fold, shuffle=True, random_state=42)\n X_preds = np.zeros(st_train.shape[0])\n preds = np.zeros(st_test.shape[0])\n for i, (tr, ts) in enumerate(cv.split(st_train)):\n print(ts.shape)\n mod = clf.fit(st_train[tr], Y[tr])\n X_preds[ts] = mod.predict(st_train[ts])\n preds += mod.predict(st_test)\n print(\"fold {}, RMSE : {:.3f}\".format(i, sqrt(mean_squared_error(Y[ts], X_preds[ts]))))\n score = sqrt(mean_squared_error(Y, X_preds))\n print(score)\n preds1 = preds/fold\n preds1 = np.abs(np.expm1(preds1))\n X_preds = np.abs(np.expm1(X_preds))\n return X_preds,preds1\n\ndef save_off(train_ID,X_preds,test_id,preds1):\n tr_sub = pd.DataFrame({'ID': train_ID, 'Upvotes': X_preds})\n tr_sub=tr_sub.reindex(columns=[\"ID\",\"Upvotes\"])\n tr_sub.to_csv('train_oof.csv', index=False)\n\n sub = pd.DataFrame({'ID': test_id, 'Upvotes': preds1})\n sub=sub.reindex(columns=[\"ID\",\"Upvotes\"])\n sub.to_csv('submission.csv', index=False)\n\ntrain = pd.read_csv('../input/train_NIR5Yl1.csv')\ntest = pd.read_csv('../input/test_8i3B3FC.csv')\n\ntest_id=[]\ntrain_id=[]\ntest_pred=[]\ntrain_pred=[]\n\nprint(train.shape)\nprint(test.shape)\nprint(train.columns)\nprint(train.Tag.unique())\n\n#train1 = train1.drop(train1[(train1['Views']>3100000) | (train1['Reputation'] > 900000) | (train1['Upvotes'] > 210000) | (train1['Answers'] > 65)].index)\n#array(['a', 'c', 'r', 'j', 'p', 's', 'h', 'o', 'i', 'x']\n\n##a##\ntrain1 = train[train.Tag == 'a']\ntest1 = test[test.Tag == 'a']\nprint(\"Before train1 drop: \",train1.shape)\nprint(\"Before test1 drop\",test1.shape)\ntrain1 = train1.drop(train1[(train1['Views']>2000000) | (train1['Upvotes'] > 100000) | (train1['Answers'] > 
50)].index)\nprint(\"After train1 drop: \",train1.shape)\n\ntrain1,test1 = apply_log(train1,test1)\nID_train,train_features,train_labels,ID_test,test_features = get_label_id(train1,test1)\nX_preds,preds1 = train_mod(train_features,test_features,train_labels)\n\ntest_id.extend(ID_test)\ntrain_id.extend(ID_train)\ntest_pred.extend(preds1)\ntrain_pred.extend(X_preds)\n\n##c##\ntrain1 = train[train.Tag == 'c']\ntest1 = test[test.Tag == 'c']\ntrain1 = train1.drop(train1[(train1['Views']>1700000) | (train1['Upvotes'] > 150000) | (train1['Answers'] > 50)].index)\n\ntrain1,test1 = apply_log(train1,test1)\nID_train,train_features,train_labels,ID_test,test_features = get_label_id(train1,test1)\nX_preds,preds1 = train_mod(train_features,test_features,train_labels)\n\ntest_id.extend(list(ID_test))\ntrain_id.extend(list(ID_train))\ntest_pred.extend(list(preds1))\ntrain_pred.extend(list(X_preds))\n\n##r##\ntrain1 = train[train.Tag == 'r']\ntest1 = test[test.Tag == 'r']\ntrain1 = train1.drop(train1[(train1['Views']>600000) | (train1['Upvotes'] > 80000) | (train1['Answers'] > 20)].index)\n\ntrain1,test1 = apply_log(train1,test1)\nID_train,train_features,train_labels,ID_test,test_features = get_label_id(train1,test1)\nX_preds,preds1 = train_mod(train_features,test_features,train_labels)\n\ntest_id.extend(list(ID_test))\ntrain_id.extend(list(ID_train))\ntest_pred.extend(list(preds1))\ntrain_pred.extend(list(X_preds))\n\n##j##\ntrain1 = train[train.Tag == 'j']\ntest1 = test[test.Tag == 'j']\ntrain1 = train1.drop(train1[(train1['Views']>3000000) | (train1['Upvotes'] > 300000) | (train1['Answers'] > 60) | (train1['Reputation'] > 700000)].index)\n\ntrain1,test1 = apply_log(train1,test1)\nID_train,train_features,train_labels,ID_test,test_features = get_label_id(train1,test1)\nX_preds,preds1 = train_mod(train_features,test_features,train_labels)\n\ntest_id.extend(list(ID_test))\ntrain_id.extend(list(ID_train))\ntest_pred.extend(list(preds1))\ntrain_pred.extend(list(X_preds))\n\n##p##\ntrain1 = train[train.Tag == 'p']\ntest1 = test[test.Tag == 'p']\ntrain1 = train1.drop(train1[(train1['Views']>2300000) | (train1['Upvotes'] > 160000) | (train1['Answers'] > 39) | (train1['Reputation'] > 600000)].index)\n\ntrain1,test1 = apply_log(train1,test1)\nID_train,train_features,train_labels,ID_test,test_features = get_label_id(train1,test1)\nX_preds,preds1 = train_mod(train_features,test_features,train_labels)\n\ntest_id.extend(list(ID_test))\ntrain_id.extend(list(ID_train))\ntest_pred.extend(list(preds1))\ntrain_pred.extend(list(X_preds))\n\n##s##\ntrain1 = train[train.Tag == 's']\ntest1 = test[test.Tag == 's']\ntrain1 = train1.drop(train1[(train1['Views']>2500000) | (train1['Upvotes'] > 160000) | (train1['Reputation'] > 630000)].index)\n\ntrain1,test1 = apply_log(train1,test1)\nID_train,train_features,train_labels,ID_test,test_features = get_label_id(train1,test1)\nX_preds,preds1 = train_mod(train_features,test_features,train_labels)\n\ntest_id.extend(list(ID_test))\ntrain_id.extend(list(ID_train))\ntest_pred.extend(list(preds1))\ntrain_pred.extend(list(X_preds))\n\n##h##\ntrain1 = train[train.Tag == 'h']\ntest1 = test[test.Tag == 'h']\ntrain1 = train1.drop(train1[(train1['Views']>2200000) | (train1['Upvotes'] > 150000) | (train1['Answers'] > 42) | (train1['Reputation'] > 500000)].index)\n\ntrain1,test1 = apply_log(train1,test1)\nID_train,train_features,train_labels,ID_test,test_features = get_label_id(train1,test1)\nX_preds,preds1 = 
train_mod(train_features,test_features,train_labels)\n\ntest_id.extend(list(ID_test))\ntrain_id.extend(list(ID_train))\ntest_pred.extend(list(preds1))\ntrain_pred.extend(list(X_preds))\n\n##o##\ntrain1 = train[train.Tag == 'o']\ntest1 = test[test.Tag == 'o']\ntrain1 = train1.drop(train1[(train1['Views']>450000) | (train1['Upvotes'] > 20000) | (train1['Reputation'] > 280000)].index)\n\ntrain1,test1 = apply_log(train1,test1)\nID_train,train_features,train_labels,ID_test,test_features = get_label_id(train1,test1)\nX_preds,preds1 = train_mod(train_features,test_features,train_labels)\n\ntest_id.extend(list(ID_test))\ntrain_id.extend(list(ID_train))\ntest_pred.extend(list(preds1))\ntrain_pred.extend(list(X_preds))\n\n##i##\ntrain1 = train[train.Tag == 'i']\ntest1 = test[test.Tag == 'i']\ntrain1 = train1.drop(train1[(train1['Views']>550000) | (train1['Upvotes'] > 22000) | (train1['Reputation'] > 400000)].index)\n\ntrain1,test1 = apply_log(train1,test1)\nID_train,train_features,train_labels,ID_test,test_features = get_label_id(train1,test1)\nX_preds,preds1 = train_mod(train_features,test_features,train_labels)\n\ntest_id.extend(list(ID_test))\ntrain_id.extend(list(ID_train))\ntest_pred.extend(list(preds1))\ntrain_pred.extend(list(X_preds))\n\n##x##\ntrain1 = train[train.Tag == 'x']\ntest1 = test[test.Tag == 'x']\ntrain1 = train1.drop(train1[(train1['Views']>600000) | (train1['Upvotes'] > 20000) | (train1['Answers'] > 28) | (train1['Reputation'] > 400000)].index)\n\ntrain1,test1 = apply_log(train1,test1)\nID_train,train_features,train_labels,ID_test,test_features = get_label_id(train1,test1)\nX_preds,preds1 = train_mod(train_features,test_features,train_labels)\n\ntest_id.extend(list(ID_test))\ntrain_id.extend(list(ID_train))\ntest_pred.extend(list(preds1))\ntrain_pred.extend(list(X_preds))\n\n##end of all tags##\n\nprint(\"test_id length: \" ,len(test_id))\nprint(\"test_pred length: \" ,len(test_pred))\nprint(\"train_id length: \" ,len(train_id))\nprint(\"train_pred length: \" ,len(train_pred))\n\nsave_off(train_id,train_pred,test_id,test_pred)\nprint('Complete')\n", 11 | "execution_count": null, 12 | "outputs": [] 13 | }, 14 | { 15 | "metadata": { 16 | "trusted": true, 17 | "_uuid": "a7d173a7ed3151fc67c59815edd8ec90a8a54342" 18 | }, 19 | "cell_type": "markdown", 20 | "source": "#xgboost bayesian optimization\nfrom sklearn.cross_validation import cross_val_score\nfrom bayes_opt import BayesianOptimization\n\ntrain2 = train_features.values\nY = train_labels\n\ndef xgboostcv(max_depth,learning_rate,n_estimators,gamma,min_child_weight,subsample,colsample_bytree):\n return cross_val_score(xgb.XGBRegressor(max_depth=int(max_depth),learning_rate=learning_rate,n_estimators=int(n_estimators),\n silent=True,nthread=-1,gamma=gamma,min_child_weight=min_child_weight,\n subsample=subsample,colsample_bytree=colsample_bytree),\n train2,Y,\"neg_mean_squared_error\",cv=5).mean()\n\nxgboostBO = BayesianOptimization(xgboostcv,{'max_depth': (5, 10),'learning_rate': (0.01, 0.3),'n_estimators': (50, 1200),\n 'gamma': (0.01,1.0),'min_child_weight': (2, 10),\n 'subsample': (0.6, 0.8),'colsample_bytree' :(0.5, 0.99)})\n\nxgboostBO.maximize()\nprint('-'*53)\nprint('Final Results')\nprint('XGBOOST: %f' % xgboostBO.res['max']['max_val'])" 21 | }, 22 | { 23 | "metadata": { 24 | "_uuid": "a2a0217b01a66626a1068183f59d65bba07168ce" 25 | }, 26 | "cell_type": "markdown", 27 | "source": "#lightgbm bayesian optimization\nfrom sklearn.cross_validation import cross_val_score\nfrom bayes_opt import BayesianOptimization\n\ntrain2 = 
train_features.values\nY = train_labels\n\ndef xgboostcv(max_depth,learning_rate,n_estimators,num_leaves,reg_alpha,reg_lambda,subsample,colsample_bytree):\n return cross_val_score(lgb.LGBMRegressor(max_depth=int(max_depth),learning_rate=learning_rate,n_estimators=int(n_estimators),\n silent=True,nthread=-1,num_leaves=int(num_leaves),reg_alpha=reg_alpha,\n reg_lambda=reg_lambda,subsample=subsample,colsample_bytree=colsample_bytree),\n train2,Y,\"r2\",cv=5).mean()\n\nxgboostBO = BayesianOptimization(xgboostcv,{'max_depth': (2, 10),'learning_rate': (0.001, 0.1),'n_estimators': (10, 900),\n 'num_leaves': (3,30),'reg_alpha': (1, 5),'reg_lambda': (0, 0.1),\n 'subsample': (0.4, 0.8),'colsample_bytree' :(0.4, 0.99)})\n\nxgboostBO.maximize()\nprint('-'*53)\nprint('Final Results')\nprint('XGBOOST: %f' % xgboostBO.res['max']['max_val'])" 28 | } 29 | ], 30 | "metadata": { 31 | "kernelspec": { 32 | "display_name": "Python 3", 33 | "language": "python", 34 | "name": "python3" 35 | }, 36 | "language_info": { 37 | "name": "python", 38 | "version": "3.6.4", 39 | "mimetype": "text/x-python", 40 | "codemirror_mode": { 41 | "name": "ipython", 42 | "version": 3 43 | }, 44 | "pygments_lexer": "ipython3", 45 | "nbconvert_exporter": "python", 46 | "file_extension": ".py" 47 | } 48 | }, 49 | "nbformat": 4, 50 | "nbformat_minor": 1 51 | } -------------------------------------------------------------------------------- /EnigmaIIT/solution_note.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "metadata": { 5 | "trusted": true, 6 | "_uuid": "e397044aef454c69bd17520c68202c9a30b12684" 7 | }, 8 | "cell_type": "code", 9 | "source": "%%time\n%matplotlib inline\nimport numpy as np\nimport os\nimport matplotlib.pyplot as plt\nimport pandas as pd\nimport random\nimport xgboost as xgb\nfrom sklearn import preprocessing\nfrom sklearn.model_selection import KFold\nfrom sklearn.metrics import mean_squared_error\nfrom math import sqrt\nfrom sklearn.model_selection import cross_val_score, GridSearchCV", 10 | "execution_count": 2, 11 | "outputs": [ 12 | { 13 | "output_type": "stream", 14 | "text": "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\nWall time: 2.07 ms\n", 15 | "name": "stdout" 16 | } 17 | ] 18 | }, 19 | { 20 | "metadata": { 21 | "collapsed": true, 22 | "trusted": true, 23 | "_uuid": "c0adcc54eaa154268b7e857e2051a2c4ea5061c3" 24 | }, 25 | "cell_type": "code", 26 | "source": "##for printing multiple output..\nfrom IPython.core.interactiveshell import InteractiveShell\nInteractiveShell.ast_node_interactivity = \"all\"", 27 | "execution_count": 3, 28 | "outputs": [] 29 | }, 30 | { 31 | "metadata": { 32 | "trusted": true, 33 | "collapsed": true, 34 | "_uuid": "1176459eea0be7241e1bea5e8080289dd9c49c69" 35 | }, 36 | "cell_type": "code", 37 | "source": "train = pd.read_csv('../input/train_NIR5Yl1.csv')\ntest = pd.read_csv('../input/test_8i3B3FC.csv')", 38 | "execution_count": 4, 39 | "outputs": [] 40 | }, 41 | { 42 | "metadata": { 43 | "trusted": true, 44 | "collapsed": true, 45 | "_uuid": "6b6528b42f91298b6a915b055e811ee78ed7e4f1" 46 | }, 47 | "cell_type": "code", 48 | "source": "#Deleting outliers\ntrain = train.drop(train[(train['Views']>3100000) | (train['Reputation'] > 900000) | (train['Upvotes'] > 210000) | (train['Answers'] > 65)].index)", 49 | "execution_count": 5, 50 | "outputs": [] 51 | }, 52 | { 53 | "metadata": { 54 | "trusted": true, 55 | "collapsed": true, 56 | "_uuid": "799f6df96a5ca9820c9cf80858f9a796aa6569ee" 57 | }, 58 | "cell_type": "code", 
59 | "source": "train[\"Upvotes\"] = np.log1p(train[\"Upvotes\"])\ntrain[\"Reputation\"] = np.log1p(train[\"Reputation\"])\ntrain[\"Views\"] = np.log1p(train[\"Views\"])\ntrain[\"Answers\"] = np.log1p(train[\"Answers\"])\n\ntest[\"Reputation\"] = np.log1p(test[\"Reputation\"])\ntest[\"Views\"] = np.log1p(test[\"Views\"])\ntest[\"Answers\"] = np.log1p(test[\"Answers\"])", 60 | "execution_count": 6, 61 | "outputs": [] 62 | }, 63 | { 64 | "metadata": { 65 | "trusted": true, 66 | "scrolled": true, 67 | "_uuid": "aa5d219d92c211c6e9396241d6ccbc08f53a7d8d" 68 | }, 69 | "cell_type": "code", 70 | "source": "train_labels = np.array(train['Upvotes'])\ntrain_features = train.drop(['Upvotes','ID','Username'], axis=1)\ntrain_features = pd.get_dummies(train_features)\ntrain_features = train_features.fillna(0)\n\nID_test = test['ID']\ntest_features = test.drop(['ID','Username'], axis=1)\ntest_features = pd.get_dummies(test_features)", 71 | "execution_count": 8, 72 | "outputs": [] 73 | }, 74 | { 75 | "metadata": { 76 | "trusted": true, 77 | "_uuid": "7cd170a84e8d73d7178abe2c4ebfa5f545a2917f", 78 | "collapsed": true 79 | }, 80 | "cell_type": "code", 81 | "source": "##generate the cross validation fold.\nst_train = train_features.values\nst_test = test_features.values\nY = train_labels\n#1052 (gamma = 0.9257,learning_rate = 0.2797,max_depth = 5,min_child_weight = 10,n_estimators = 305,subsample = 0.6239,colsample_bytree = 0.7443)\nclf = xgb.XGBRegressor(gamma = 0.9257,learning_rate = 0.2797,max_depth = 5,min_child_weight = 9,n_estimators = 305,subsample = 0.6239,colsample_bytree = 0.7443)\nfold = 5\ncv = KFold(n_splits=fold, shuffle=True, random_state=2018)\nX_preds = np.zeros(st_train.shape[0])\npreds = np.zeros(st_test.shape[0])\nfor i, (tr, ts) in enumerate(cv.split(st_train)):\n print(ts.shape)\n mod = clf.fit(st_train[tr], Y[tr])\n X_preds[ts] = mod.predict(st_train[ts])\n preds += mod.predict(st_test)\n print(\"fold {}, RMSE : {:.3f}\".format(i, sqrt(mean_squared_error(Y[ts], X_preds[ts]))))\nscore = sqrt(mean_squared_error(Y, X_preds))\nprint(score)\npreds1 = preds/fold", 82 | "execution_count": null, 83 | "outputs": [] 84 | }, 85 | { 86 | "metadata": { 87 | "trusted": true, 88 | "collapsed": true, 89 | "_uuid": "70a66025297175a00ccaf7e1d49a858e0a145b0f" 90 | }, 91 | "cell_type": "code", 92 | "source": "preds1 = np.abs(np.expm1(preds1))\nsub = pd.DataFrame({'ID': ID_test, 'Upvotes': preds1})\nsub=sub.reindex(columns=[\"ID\",\"Upvotes\"])\nsub.to_csv('submission.csv', index=False)", 93 | "execution_count": null, 94 | "outputs": [] 95 | } 96 | ], 97 | "metadata": { 98 | "kernelspec": { 99 | "display_name": "Python 3", 100 | "language": "python", 101 | "name": "python3" 102 | }, 103 | "language_info": { 104 | "name": "python", 105 | "version": "3.6.4", 106 | "mimetype": "text/x-python", 107 | "codemirror_mode": { 108 | "name": "ipython", 109 | "version": 3 110 | }, 111 | "pygments_lexer": "ipython3", 112 | "nbconvert_exporter": "python", 113 | "file_extension": ".py" 114 | } 115 | }, 116 | "nbformat": 4, 117 | "nbformat_minor": 1 118 | } -------------------------------------------------------------------------------- /Euristica-18/HE_stack_flight_predictor.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [ 10 | { 11 | "name": "stderr", 12 | "output_type": "stream", 13 | "text": [ 14 | 
"C:\\Users\\royal\\Anaconda3\\lib\\site-packages\\sklearn\\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n", 15 | " \"This module will be removed in 0.20.\", DeprecationWarning)\n", 16 | "C:\\Users\\royal\\Anaconda3\\lib\\site-packages\\sklearn\\grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.\n", 17 | " DeprecationWarning)\n" 18 | ] 19 | } 20 | ], 21 | "source": [ 22 | "%matplotlib inline\n", 23 | "import numpy as np\n", 24 | "import os\n", 25 | "import matplotlib.pyplot as plt\n", 26 | "import seaborn as sns\n", 27 | "import pandas as pd\n", 28 | "import random\n", 29 | "import xgboost as xgb\n", 30 | "\n", 31 | "from sklearn.preprocessing import MinMaxScaler\n", 32 | "from sklearn import preprocessing\n", 33 | "from sklearn.linear_model import LogisticRegression\n", 34 | "from xgboost import XGBClassifier\n", 35 | "from xgboost import XGBRegressor\n", 36 | "import lightgbm as lgb\n", 37 | "from lightgbm import LGBMRegressor\n", 38 | "from sklearn.metrics import accuracy_score\n", 39 | "from sklearn.model_selection import GridSearchCV,cross_val_score\n", 40 | "from sklearn.cross_validation import StratifiedKFold\n", 41 | "from sklearn.metrics import matthews_corrcoef, roc_auc_score\n", 42 | "from sklearn.grid_search import RandomizedSearchCV\n", 43 | "from catboost import CatBoostClassifier\n", 44 | "\n", 45 | "from rgf.sklearn import RGFClassifier\n", 46 | "from sklearn.ensemble import RandomForestClassifier\n", 47 | "from sklearn.linear_model import LogisticRegression,LinearRegression\n", 48 | "\n", 49 | "from sklearn.ensemble import ExtraTreesClassifier\n", 50 | "\n", 51 | "from sklearn.preprocessing import StandardScaler\n", 52 | "from sklearn.decomposition import PCA\n", 53 | "from sklearn.model_selection import KFold\n", 54 | "from xgboost import XGBRegressor" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 2, 60 | "metadata": {}, 61 | "outputs": [ 62 | { 63 | "name": "stdout", 64 | "output_type": "stream", 65 | "text": [ 66 | "Wall time: 0 ns\n" 67 | ] 68 | } 69 | ], 70 | "source": [ 71 | "%%time\n", 72 | "os.chdir('C:\\\\Users\\\\royal\\\\Downloads\\\\Compressed\\\\flight_predictor_data\\\\oof+test')" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "os.listdir()" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 160, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "name": "stdout", 91 | "output_type": "stream", 92 | "text": [ 93 | "Wall time: 8.4 s\n" 94 | ] 95 | } 96 | ], 97 | "source": [ 98 | "%%time\n", 99 | "os.chdir('C:\\\\Users\\\\royal\\\\Downloads\\\\Compressed\\\\flight_predictor_data')\n", 100 | "train=pd.read_csv('weather_data_train.csv')\n", 101 | "test=pd.read_csv('weather_data_test.csv')\n", 102 | "train_flt = pd.read_csv('flight_data_train.csv')" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 162, 108 | "metadata": { 109 | "collapsed": true 110 | }, 111 | "outputs": [], 112 | "source": [ 113 | "#generate the output total and labels.\n", 114 | 
"train_flt['total']=0\n", 115 | "for i in range(1,289):\n", 116 | " train_flt['total']+=train_flt['Spot'+str(i)+' totalFlights']\n", 117 | "\n", 118 | "train_flt['label']=0\n", 119 | "for i,j in enumerate(train_flt['total']):\n", 120 | " if j>=15:\n", 121 | " train_flt.loc[i, 'label']=1" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 154, 127 | "metadata": { 128 | "collapsed": true 129 | }, 130 | "outputs": [], 131 | "source": [ 132 | "Y=train_flt['label'].values" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 156, 138 | "metadata": { 139 | "collapsed": true 140 | }, 141 | "outputs": [], 142 | "source": [ 143 | "train11 = pd.DataFrame()\n", 144 | "test11 = pd.DataFrame()" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 157, 150 | "metadata": { 151 | "collapsed": true 152 | }, 153 | "outputs": [], 154 | "source": [ 155 | "##get the difference between top and lowerheight parameters.\n", 156 | "aa=['Dew Point','Pressure','Temperature','Wind Direction','Wind Speed']\n", 157 | "hh=['Station1','Station2','Station3','Station4','Station5']\n", 158 | "for i in aa:\n", 159 | " for j in hh:\n", 160 | " train11[j+' '+i+' Height Diff']=train[j+' '+i+' Height45']-train[j+' '+i+' Height1']\n", 161 | " test11[j+' '+i+' Height Diff']=test[j+' '+i+' Height45']-test[j+' '+i+' Height1']\n", 162 | "\n", 163 | "##get the max,min,standard deviation,variance,mean and diff for all the conditions\n", 164 | "aa=['Dew Point','Pressure','Temperature','Wind Direction','Wind Speed']\n", 165 | "hh=['Station1','Station2','Station3','Station4','Station5']\n", 166 | "for i in aa:\n", 167 | " for j in hh:\n", 168 | " ls=[]\n", 169 | " for k in range(1,46):\n", 170 | " ls.append(j+' '+i+' Height'+str(k))\n", 171 | " train11[j+' '+i+' Height max']=train[ls].max(axis=1)\n", 172 | " train11[j+' '+i+' Height min']=train[ls].min(axis=1)\n", 173 | " train11[j+' '+i+' Height std']=train[ls].std(axis=1)\n", 174 | " train11[j+' '+i+' Height var']=train[ls].var(axis=1)\n", 175 | " train11[j+' '+i+' Height var']=train[ls].mean(axis=1)\n", 176 | " \n", 177 | " test11[j+' '+i+' Height var']=test[ls].mean(axis=1)\n", 178 | " test11[j+' '+i+' Height var']=test[ls].var(axis=1)\n", 179 | " test11[j+' '+i+' Height std']=test[ls].std(axis=1)\n", 180 | " test11[j+' '+i+' Height max']=test[ls].max(axis=1)\n", 181 | " test11[j+' '+i+' Height min']=test[ls].min(axis=1)\n", 182 | " \n", 183 | " train11[j+' '+i+' Height max-min']=train11[j+' '+i+' Height max']-train11[j+' '+i+' Height min']\n", 184 | " test11[j+' '+i+' Height max-min']=test11[j+' '+i+' Height max']-test11[j+' '+i+' Height min']" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 54, 190 | "metadata": { 191 | "collapsed": true 192 | }, 193 | "outputs": [], 194 | "source": [ 195 | "##test and train binary predictions\n", 196 | "##test\n", 197 | "os.chdir('C:\\\\Users\\\\royal\\\\Downloads\\\\Compressed\\\\flight_predictor_data\\\\oof+test')\n", 198 | "cat_test = pd.read_csv('bin_cat_1150d_0.76519_test.csv')\n", 199 | "ext_test = pd.read_csv('bin_extra_0.77593_test.csv')\n", 200 | "ext1455_test = pd.read_csv('bin_extra_1455d_0.77927_test.csv')\n", 201 | "ext251_test = pd.read_csv('bin_extra_251d_0.77202_test.csv')\n", 202 | "lgb1151_test = pd.read_csv('bin_lgb_1151d_0.79085_test.csv')\n", 203 | "lgb1201_test = pd.read_csv('bin_lgb_1201d_0.79777_test.csv')\n", 204 | "lgb225_test = pd.read_csv('bin_lgb_225d_0.78732_test.csv')\n", 205 | "lgbaae_test = 
pd.read_csv('bin_lgb_aae_0.78894_test.csv')\n", 206 | "lgbpca_test = pd.read_csv('bin_lgb_pca_0.77974_test.csv')\n", 207 | "lr_test = pd.read_csv('bin_lr_225d_0.77011_test.csv')\n", 208 | "nnaae_test = pd.read_csv('bin_nn_aae_0.75009_test.csv')\n", 209 | "rgf225_test = pd.read_csv('bin_rgf_225d_0.74622_test.csv')\n", 210 | "\n", 211 | "xgb1450_test = pd.read_csv('bin_xgb_1450d_0.79268_test.csv')\n", 212 | "xgbaae_test = pd.read_csv('bin_xgb_aae_0.79059_test.csv')\n", 213 | "xgbtune_test = pd.read_csv('bin_xgb_tune_0.79425_test.csv')\n", 214 | "stac_test = pd.read_csv('submission_stack_152.csv')\n", 215 | "\n", 216 | "##train \n", 217 | "cat_train = pd.read_csv('bin_cat_1150d_0.76519_train.csv')\n", 218 | "ext_train = pd.read_csv('bin_extra_0.77593_train.csv')\n", 219 | "ext1455_train = pd.read_csv('bin_extra_1455d_0.77927_train.csv')\n", 220 | "ext251_train = pd.read_csv('bin_extra_251d_0.77202_train.csv')\n", 221 | "lgb1151_train = pd.read_csv('bin_lgb_1151d_0.79085_train.csv')\n", 222 | "lgb1201_train = pd.read_csv('bin_lgb_1201d_0.79777_train.csv')\n", 223 | "lgb225_train = pd.read_csv('bin_lgb_225d_0.78732_train.csv')\n", 224 | "lgbaae_train = pd.read_csv('bin_lgb_aae_0.78894_train.csv')\n", 225 | "lgbpca_train = pd.read_csv('bin_lgb_pca_0.77974_train.csv')\n", 226 | "lr_train = pd.read_csv('bin_lr_225d_0.77011_train.csv')\n", 227 | "nnaae_train = pd.read_csv('bin_nn_aae_0.75009_train.csv')\n", 228 | "rgf225_train = pd.read_csv('bin_rgf_225d_0.74622_train.csv')\n", 229 | "xgb1450_train = pd.read_csv('bin_xgb_1450d_0.79268_train.csv')\n", 230 | "xgbaae_train = pd.read_csv('bin_xgb_aae_0.79059_train.csv')\n", 231 | "xgbtune_train = pd.read_csv('bin_xgb_tune_0.79425_train.csv')\n", 232 | "sub_146 = pd.read_csv('submission_146.csv')\n", 233 | "\n", 234 | "ts_PERID = cat_test['Day_Id']" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": 158, 240 | "metadata": { 241 | "collapsed": true 242 | }, 243 | "outputs": [], 244 | "source": [ 245 | "##test and train probability predictions\n", 246 | "##test\n", 247 | "os.chdir('C:\\\\Users\\\\royal\\\\Downloads\\\\Compressed\\\\flight_predictor_data\\\\oof+test')\n", 248 | "cat_test = pd.read_csv('prob_cat_1150d_0.76519_test.csv')\n", 249 | "ext_test = pd.read_csv('prob_extra_0.77593_test.csv')\n", 250 | "ext1455_test = pd.read_csv('prob_extra_1455d_0.77927_test.csv')\n", 251 | "ext251_test = pd.read_csv('prob_extra_251d_0.77202_test.csv')\n", 252 | "lgb1151_test = pd.read_csv('prob_lgb_1151d_0.79085_test.csv')\n", 253 | "lgb225_test = pd.read_csv('prob_lgb_225d_0.78732_test.csv')\n", 254 | "lgbaae_test = pd.read_csv('prob_lgb_aae_0.78894_test.csv')\n", 255 | "lr_test = pd.read_csv('prob_lr_225d_0.77011_test.csv')\n", 256 | "nnaae_test = pd.read_csv('prob_nn_aae_0.75009_test.csv')\n", 257 | "rgf225_test = pd.read_csv('prob_rgf_225d_0.74622_test.csv')\n", 258 | "xgb1450_test = pd.read_csv('prob_xgb_1450d_0.79268_test.csv')\n", 259 | "xgbaae_test = pd.read_csv('prob_xgb_aae_0.79059_test.csv')\n", 260 | "xgbtune_test = pd.read_csv('prob_xgb_tune_0.79425_test.csv')\n", 261 | "\n", 262 | "##train \n", 263 | "cat_train = pd.read_csv('prob_cat_1150d_0.76519_train.csv')\n", 264 | "ext_train = pd.read_csv('prob_extra_0.77593_train.csv')\n", 265 | "ext1455_train = pd.read_csv('prob_extra_1455d_0.77927_train.csv')\n", 266 | "ext251_train = pd.read_csv('prob_extra_251d_0.77202_train.csv')\n", 267 | "lgb1151_train = pd.read_csv('prob_lgb_1151d_0.79085_train.csv')\n", 268 | "lgb225_train = pd.read_csv('prob_lgb_225d_0.78732_train.csv')\n", 
269 | "lgbaae_train = pd.read_csv('prob_lgb_aae_0.78894_train.csv')\n", 270 | "lr_train = pd.read_csv('prob_lr_225d_0.77011_train.csv')\n", 271 | "nnaae_train = pd.read_csv('prob_nn_aae_0.75009_train.csv')\n", 272 | "rgf225_train = pd.read_csv('prob_rgf_225d_0.74622_train.csv')\n", 273 | "xgb1450_train = pd.read_csv('prob_xgb_1450d_0.79268_train.csv')\n", 274 | "xgbaae_train = pd.read_csv('prob_xgb_aae_0.79059_train.csv')\n", 275 | "xgbtune_train = pd.read_csv('prob_xgb_tune_0.79425_train.csv')\n", 276 | "\n", 277 | "ts_PERID = cat_test['Day_Id']" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 29, 283 | "metadata": { 284 | "collapsed": true 285 | }, 286 | "outputs": [], 287 | "source": [ 288 | "##average blend. \n", 289 | "preds1=(ext_test['Good_Bad']+lgbpca_test['Good_Bad']+xgb1450_test['Good_Bad']+lgb1151_test['Good_Bad'])/4\n", 290 | "\n", 291 | "prediction_rfc=list(range(len(preds1)))\n", 292 | "for i in range(len(preds1)):\n", 293 | " prediction_rfc[i]=1 if preds1[i]>=0.25 else 0\n", 294 | "\n", 295 | "sub = pd.DataFrame({'Day_Id': ts_PERID, 'Good_Bad': prediction_rfc})\n", 296 | "sub=sub.reindex(columns=[\"Day_Id\",\"Good_Bad\"])\n", 297 | "filename = 'submission.csv'\n", 298 | "sub.to_csv(filename, index=False)" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": 56, 304 | "metadata": {}, 305 | "outputs": [ 306 | { 307 | "name": "stdout", 308 | "output_type": "stream", 309 | "text": [ 310 | "216 0 1\n", 311 | "380 1 0\n", 312 | "908 0 1\n", 313 | "1335 1 0\n", 314 | "1586 1 0\n", 315 | "1711 1 0\n", 316 | "1940 1 0\n", 317 | "2300 1 0\n" 318 | ] 319 | } 320 | ], 321 | "source": [ 322 | "for k,i,j in zip(cat_test['Day_Id'],lgbpca_test['Good_Bad'],lgb1151_test['Good_Bad']):\n", 323 | " if i!=j:\n", 324 | " print(k,i,j)\n", 325 | "# ext_test.loc[ext_test['Day_Id'] == k, 'Good_Bad'] = 1" 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": 43, 331 | "metadata": {}, 332 | "outputs": [ 333 | { 334 | "name": "stdout", 335 | "output_type": "stream", 336 | "text": [ 337 | "{0, 1}\n", 338 | "49 51\n" 339 | ] 340 | } 341 | ], 342 | "source": [ 343 | "#number of 0 and 1 prediction in the binary file.\n", 344 | "su = ext1455_test\n", 345 | "print(set(su['Good_Bad']))\n", 346 | "prediction_rfc=list(range(len(su['Good_Bad'])))\n", 347 | "i0=0\n", 348 | "i1=0\n", 349 | "for j,i in enumerate(su['Good_Bad']):\n", 350 | " if i==0:\n", 351 | " prediction_rfc[j]=0\n", 352 | " i0=i0+1\n", 353 | " else:\n", 354 | " i1=i1+1\n", 355 | " prediction_rfc[j]=1\n", 356 | "print(i0,i1)" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 1, 362 | "metadata": { 363 | "collapsed": true 364 | }, 365 | "outputs": [], 366 | "source": [ 367 | "# aa=[369,379,1335,1940,1960] #0\n", 368 | "# bb=[908,1781] #1\n", 369 | "# aa=[1940,1960,908,1781,216,1931]\n", 370 | "# bb=[1335,379]" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": 58, 376 | "metadata": { 377 | "collapsed": true 378 | }, 379 | "outputs": [], 380 | "source": [ 381 | "su = lgbpca_test\n", 382 | "for k,i in zip(cat_test['Day_Id'],su['Good_Bad']):\n", 383 | " if k in aa:\n", 384 | " su.loc[su['Day_Id'] == k, 'Good_Bad'] = 0\n", 385 | " if k in bb:\n", 386 | " su.loc[su['Day_Id'] == k, 'Good_Bad'] = 1" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": 59, 392 | "metadata": { 393 | "collapsed": true 394 | }, 395 | "outputs": [], 396 | "source": [ 397 | "filename = 'subm.csv'\n", 398 | "su.to_csv(filename, 
index=False)" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": 106, 404 | "metadata": {}, 405 | "outputs": [ 406 | { 407 | "data": { 408 | "text/plain": [ 409 | "0.8070413941267186" 410 | ] 411 | }, 412 | "execution_count": 106, 413 | "metadata": {}, 414 | "output_type": "execute_result" 415 | } 416 | ], 417 | "source": [ 418 | "##correlation between 2 files.\n", 419 | "corr1[1].corr(corr1[8])" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": 163, 425 | "metadata": { 426 | "collapsed": true 427 | }, 428 | "outputs": [], 429 | "source": [ 430 | "train_day=train['Day_Id']\n", 431 | "train1=train.drop(['Day_Id'],axis=1)\n", 432 | "train1=train1.values\n", 433 | "Y=train_flt['label'].values\n", 434 | "\n", 435 | "test_day=test['Day_Id']\n", 436 | "test1=test.drop(['Day_Id'],axis=1)\n", 437 | "test1=test1.values" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": 159, 443 | "metadata": { 444 | "collapsed": true 445 | }, 446 | "outputs": [], 447 | "source": [ 448 | "#stack level1 predictions\n", 449 | "X_ts=np.column_stack((cat_test['Good_Bad'],ext_test['Good_Bad'],ext1455_test['Good_Bad'],lgb1151_test['Good_Bad'],lgb225_test['Good_Bad'],lr_test['Good_Bad'],nnaae_test['Good_Bad'],rgf225_test['Good_Bad'],xgb1450_test['Good_Bad'],xgbaae_test['Good_Bad'],xgbtune_test['Good_Bad']))\n", 450 | "X_tr=np.column_stack((cat_train['Good_Bad'],ext_train['Good_Bad'],ext1455_train['Good_Bad'],lgb1151_train['Good_Bad'],lgb225_train['Good_Bad'],lr_train['Good_Bad'],nnaae_train['Good_Bad'],rgf225_train['Good_Bad'],xgb1450_train['Good_Bad'],xgbaae_train['Good_Bad'],xgbtune_train['Good_Bad']))" 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 164, 456 | "metadata": { 457 | "collapsed": true 458 | }, 459 | "outputs": [], 460 | "source": [ 461 | "X_tr1=np.column_stack((X_tr,train11.values,train1))\n", 462 | "X_ts1=np.column_stack((X_ts,test11.values,test1))" 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": 165, 468 | "metadata": {}, 469 | "outputs": [ 470 | { 471 | "name": "stdout", 472 | "output_type": "stream", 473 | "text": [ 474 | "(100, 11)\n", 475 | "(2183, 1286)\n", 476 | "(2183, 150)\n", 477 | "(100, 150)\n" 478 | ] 479 | } 480 | ], 481 | "source": [ 482 | "print(X_ts.shape)\n", 483 | "print(X_tr1.shape)\n", 484 | "print(train11.shape)\n", 485 | "print(test11.shape)" 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": 166, 491 | "metadata": {}, 492 | "outputs": [ 493 | { 494 | "name": "stdout", 495 | "output_type": "stream", 496 | "text": [ 497 | "(437,)\n", 498 | "fold 0, ROC AUC: 0.807\n", 499 | "(437,)\n", 500 | "fold 1, ROC AUC: 0.785\n", 501 | "(437,)\n", 502 | "fold 2, ROC AUC: 0.808\n", 503 | "(437,)\n", 504 | "fold 3, ROC AUC: 0.779\n", 505 | "(435,)\n", 506 | "fold 4, ROC AUC: 0.781\n", 507 | "0.7914202003610796\n" 508 | ] 509 | } 510 | ], 511 | "source": [ 512 | "from sklearn.cross_validation import StratifiedKFold\n", 513 | "SEED=42\n", 514 | "clf = lgb.LGBMClassifier()\n", 515 | "st_train = X_tr1\n", 516 | "st_test = X_ts1\n", 517 | "# clf = lgb.LGBMClassifier(max_depth= 10, learning_rate=0.044, n_estimators=255, num_leaves= 17, reg_alpha=1.0824, reg_lambda= 0.0386)\n", 518 | "# clf=CatBoostClassifier(iterations=50)\n", 519 | "# clf = XGBClassifier()\n", 520 | "# clf=ExtraTreesClassifier(n_estimators=10000, criterion='entropy', max_depth=9, min_samples_leaf=1, n_jobs=30, random_state=1)\n", 521 | "# clf = XGBClassifier(gamma = 
1.0,learning_rate = 0.010,max_depth = 5,min_child_weight = 10,n_estimators = 338,subsample = 0.800,colsample_bytree = 0.50)\n", 522 | "# clf = RGFClassifier(max_leaf=500,algorithm=\"RGF\",test_interval=100, loss=\"LS\")\n", 523 | "# clf = LogisticRegression()\n", 524 | "\n", 525 | "fold = 5\n", 526 | "cv = StratifiedKFold(Y, n_folds=fold,shuffle=True, random_state=42)\n", 527 | "X_preds = np.zeros(st_train.shape[0])\n", 528 | "preds = np.zeros(st_test.shape[0])\n", 529 | "for i, (tr, ts) in enumerate(cv):\n", 530 | " print(ts.shape)\n", 531 | " mod = clf.fit(st_train[tr], Y[tr])\n", 532 | " X_preds[ts] = mod.predict_proba(st_train[ts])[:,1]\n", 533 | " preds += mod.predict_proba(st_test)[:,1]\n", 534 | " print(\"fold {}, ROC AUC: {:.3f}\".format(i, roc_auc_score(Y[ts], X_preds[ts])))\n", 535 | "score = roc_auc_score(Y, X_preds)\n", 536 | "print(score)\n", 537 | "preds1 = preds/fold" 538 | ] 539 | }, 540 | { 541 | "cell_type": "code", 542 | "execution_count": 167, 543 | "metadata": {}, 544 | "outputs": [ 545 | { 546 | "name": "stdout", 547 | "output_type": "stream", 548 | "text": [ 549 | "0.7301163632835377\n", 550 | "0.35000000000000003\n" 551 | ] 552 | } 553 | ], 554 | "source": [ 555 | "# pick the best threshold out-of-fold\n", 556 | "thresholds = np.linspace(0.01, 0.99, 50)\n", 557 | "mcc = np.array([roc_auc_score(Y, X_preds>thr) for thr in thresholds])\n", 558 | "best_threshold = thresholds[mcc.argmax()]\n", 559 | "print(mcc.max())\n", 560 | "print(best_threshold)" 561 | ] 562 | }, 563 | { 564 | "cell_type": "code", 565 | "execution_count": 168, 566 | "metadata": { 567 | "collapsed": true 568 | }, 569 | "outputs": [], 570 | "source": [ 571 | "#prediction file generation\n", 572 | "prediction_rfc=list(range(len(preds1)))\n", 573 | "for i in range(len(preds1)):\n", 574 | " prediction_rfc[i]=1 if preds1[i]>best_threshold else 0\n", 575 | "\n", 576 | "sub = pd.DataFrame({'Day_Id': ts_PERID, 'Good_Bad': prediction_rfc})\n", 577 | "sub=sub.reindex(columns=[\"Day_Id\",\"Good_Bad\"])\n", 578 | "filename = 'submission_stack.csv'\n", 579 | "sub.to_csv(filename, index=False)" 580 | ] 581 | } 582 | ], 583 | "metadata": { 584 | "kernelspec": { 585 | "display_name": "Python 3", 586 | "language": "python", 587 | "name": "python3" 588 | }, 589 | "language_info": { 590 | "codemirror_mode": { 591 | "name": "ipython", 592 | "version": 3 593 | }, 594 | "file_extension": ".py", 595 | "mimetype": "text/x-python", 596 | "name": "python", 597 | "nbconvert_exporter": "python", 598 | "pygments_lexer": "ipython3", 599 | "version": "3.6.1" 600 | } 601 | }, 602 | "nbformat": 4, 603 | "nbformat_minor": 2 604 | } 605 | -------------------------------------------------------------------------------- /Euristica-18/README.md: -------------------------------------------------------------------------------- 1 | # HE_Flight_Pred 2 | Solution for HackerEarth ML Codesprint - Euristica'18 3 | 4 | ## About 5 | 6 | Problem statement and data can be dowloaded from the competition site 7 | https://www.hackerearth.com/challenge/college/euristica-ml/ 8 | 9 | ## Solution 10 | use all the newly created features from below and train different models on that. 11 | 12 | tuned the model using bayesian optimization and ensemble those model into final submission 13 | 14 | 15 | ## Newly created Features and findings 16 | 17 | - with heights pressure decrease. so get the differenece bet pressure at height 45 and height 1. pressure should always decrease with height increase. 18 | - high speed winds- max of the wind speed in 45 heights. 
19 | - dew point temperature can never be higher than air temperature. 20 | - wind speed should be between 5 and 14 km/hr for pleasant weather. 21 | - high pressure provides higher efficiency. 22 | - 100% humidity doesn't mean rain will happen. 23 | - Take dew point and temperature as a ratio and you will get humidity as a percentage. Humidity is the amount of water vapor present in the air relative to what the air can hold. 24 | - Get the max, min, standard deviation, variance, mean and the difference between the top and lowest height parameters for all the conditions. 25 | - Total sum of flights across all 289 spots. 26 | 27 | ## Requirement 28 | - lightgbm 29 | - keras 30 | - sklearn 31 | - xgboost 32 | - catboost 33 | - bayes_opt 34 | -------------------------------------------------------------------------------- /Euristica-18/data/Electricity_Production_data.rar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/Euristica-18/data/Electricity_Production_data.rar -------------------------------------------------------------------------------- /Euristica-18/data/flight_predictor_data.rar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/Euristica-18/data/flight_predictor_data.rar -------------------------------------------------------------------------------- /Euristica-19/README.MD: -------------------------------------------------------------------------------- 1 | # HE_Flight_Pred 2 | Solution for HackerEarth ML Codesprint - Euristica'19 3 | 4 | ## About 5 | 6 | Problem statement and data can be downloaded from the competition site 7 | https://www.hackerearth.com/challenges/college/cognitia19/ 8 | 9 | ## Solution 10 | - For Question 1 the solution was an ensemble of H2OAutoML with a StackingRegressor of lightgbm, xgboost and rgf models. 11 | - For Dataset-2, simple addition interactions of all variables and a VotingClassifier on lightgbm and xgboost models, with the outputs rounded off, gave the highest accuracy (see the sketch after this list). 12 | - For the 3rd question we extracted names and skills using NLTK named entity recognition, and phone numbers and emails using regular expressions.
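A minimal sketch of the Dataset-2 approach above: addition interactions of the input columns followed by a soft-voting ensemble of LightGBM and XGBoost. The data, column names and model settings below are synthetic placeholders, not the actual competition configuration or tuned hyper-parameters.

```python
# Illustrative sketch only: synthetic stand-in data, default model settings.
from itertools import combinations

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# Stand-in for the real Dataset-2 features and target.
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

# "Addition interaction" of all variables: one new column col_a + col_b per pair.
base_cols = list(X.columns)
for a, b in combinations(base_cols, 2):
    X[f"{a}_plus_{b}"] = X[a] + X[b]

vote = VotingClassifier(
    estimators=[("lgb", LGBMClassifier()), ("xgb", XGBClassifier())],
    voting="soft",  # average the two models' predicted probabilities
)
vote.fit(X, y)
preds = vote.predict(X)  # final labels; the write-up mentions rounding off outputs
```

Soft voting averages the predicted probabilities of the two boosted models, which tends to be more stable than hard voting when only two estimators are combined.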
13 | 14 | ## Requirement 15 | - lightgbm 16 | - H2OAutoML 17 | - sklearn 18 | - xgboost 19 | - bayes_opt 20 | -------------------------------------------------------------------------------- /Euristica-19/data/28406216-5-data_q2.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/Euristica-19/data/28406216-5-data_q2.zip -------------------------------------------------------------------------------- /Euristica-19/data/299b9fb8-5-data-Q1.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/Euristica-19/data/299b9fb8-5-data-Q1.zip -------------------------------------------------------------------------------- /Euristica-19/data/new.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Euristica-19/data/resumes.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/Euristica-19/data/resumes.zip -------------------------------------------------------------------------------- /Euristica-19/extract.py: -------------------------------------------------------------------------------- 1 | # information-extraction.py 2 | 3 | import re 4 | import nltk 5 | from nltk.corpus import stopwords 6 | stop = stopwords.words('english') 7 | import os 8 | import spacy 9 | import sys 10 | import csv 11 | import json 12 | 13 | os.chdir(r'C:\Users\gurvinder1.singh\Downloads\Data\HE_indore\resumes') 14 | 15 | with open('techskill.csv', 'r') as f: 16 | reader = csv.reader(f) 17 | your_list = list(reader) 18 | 19 | string = open('CV of Binnu Thomas.txt').read() 20 | 21 | #Function to extract names from the string using spacy 22 | def extract_name(string): 23 | r1 = str(string) 24 | nlp = spacy.load('en_core_web_sm') 25 | doc = nlp(r1) 26 | for ent in doc: 27 | #print(ent.pos_) 28 | if(ent.pos_ == 'PROPN'): 29 | #print('name: ' + ent.text) 30 | return ent.text 31 | 32 | def extract_phone_numbers(string): 33 | r = re.compile(r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})') 34 | phone_numbers = r.findall(string) 35 | return [re.sub(r'\D', '', number) for number in phone_numbers] 36 | 37 | def extract_email_addresses(string): 38 | r = re.compile(r'[\w\.-]+@[\w\.-]+') 39 | return r.findall(string) 40 | 41 | def ie_preprocess(document): 42 | document = ' '.join([i for i in document.split() if i not in stop]) 43 | sentences = nltk.sent_tokenize(document) 44 | sentences = [nltk.word_tokenize(sent) for sent in sentences] 45 | sentences = [nltk.pos_tag(sent) for sent in sentences] 46 | return sentences 47 | 48 | def extract_names(document): 49 | names = [] 50 | sentences = ie_preprocess(document) 51 | for tagged_sentence in sentences: 52 | for chunk in nltk.ne_chunk(tagged_sentence): 53 | if type(chunk) == nltk.tree.Tree: 54 | if chunk.label() == 'PERSON': 55 | names.append(' '.join([c[0] for c in chunk])) 56 | return names 57 | 58 | if __name__ == '__main__': 59 | numbers = extract_phone_numbers(string) 60 | emails = extract_email_addresses(string) 61 | names = extract_names(string) 62 | name = extract_name(string) 63 | data = {} 64 | a=[] 65 | a.append(name) 66 | data['name']=a 
67 | data['email']=emails 68 | data['phone']=numbers 69 | data['edu']=list() 70 | data['exp']=names 71 | #print('name: '+ name) 72 | #print('email: '+' , '.join(emails)) 73 | #print('phone: '+' , '.join(numbers)) 74 | #print('exp: '+ ','.join(names)) 75 | #print('exp: '+list(set(names).intersection(set(your_list[0])))) 76 | #print(names) 77 | with open('data.json', 'w') as outfile: 78 | json.dump(data, outfile) 79 | -------------------------------------------------------------------------------- /Euristica-19/new.txt: -------------------------------------------------------------------------------- 1 | qwerty 2 | -------------------------------------------------------------------------------- /Euristica-19/techskill.csv: -------------------------------------------------------------------------------- 1 | ajenti,django-suit,django-xadmin,flask-admin,flower,grappelli,wooey,algorithms,pypattyrn,python-patterns,sortedcontainers,libraries,django-simple-captcha,django-simple-spam-blocker,django-compressor,django-pipeline,django-storages,fanstatic,fileconveyor,flask-assets,jinja-assets-compressor,webassets,audiolazy,audioread,beets,dejavu,django-elastic-transcoder,eyed3,id3reader,m3u8,mingus,pyaudioanalysis,pydub,pyechonest,talkbox,timeside,tinytag,authomatic,django-allauth,django-oauth-toolkit,flask-oauthlib,oauthlib,python-oauth2,python-social-auth,rauth,sanction,jose,pyjwt,python-jws,python-jwt,bitbake,buildout,platformio,pybuilder,scons,django-cms,djedi-cms,feincms,kotti,mezzanine,opps,plone,quokka,wagtail,widgy,libraries,beaker,diskcache,django-cache-machine,django-cacheops,django-viewlet,dogpile.cache,hermescache,johnny-cache,pylibmc,errbot,coala,code2flow,pycallgraph,flake8,pylama,pylint,mypy,asciimatics,cement,click,cliff,clint,colorama,docopt,gooey,python-fire,python-prompt-toolkit,aws-cli,bashplotlib,caniusepython3,cookiecutter,doitlive,howdoi,httpie,mycli,pathpicker,percol,pgcli,saws,thefuck,try,python-future,python-modernize,six,opencv,pyocr,pytesseract,simplecv,eventlet,gevent,multiprocessing,threading,tomorrow,uvloop,config,configobj,configparser,profig,python-decouple,cryptography,hashids,paramiko,passlib,pynacl,blaze,open,orange,pandas,cerberus,colander,jsonschema,schema,schematics,valideer,voluptuous,altair,bokeh,ggplot,matplotlib,pygal,pygraphviz,pyqtgraph,seaborn,vispy,pickledb,pipelinedb,tinydb,zodb,mysql,mysql-python,mysqlclient,oursql,pymysql,postgresql,psycopg2,queries,txpostgres,apsw,dataset,pymssql,nosql,cassandra-python-driver,happybase,plyvel,py2neo,pycassa,pymongo,redis-py,telephus,txredis,arrow,chronyk,dateutil,delorean,moment,pendulum,pytime,pytz,when.py,ipdb,pdb++,pudb,remote-pdb,wdb,line_profiler,memory_profiler,profiling,vprof,caffe,keras,mxnet,neupy,pytorch,tensorflow,theano,ansible,cloud-init,cuisine,docker,fabric,fabtools,honcho,openstack,pexpect,psutil,saltstack,supervisor,dh-virtualenv,nuitka,py2app,py2exe,pyinstaller,pynsist,sphinx,awesome-sphinxdoc,mkdocs,pdoc,pycco,s3cmd,s4cmd,you-get,youtube-dl,alipay,cartridge,django-oscar,django-shop,merchant,money,python-currencies,forex-python,shoop,emacs,elpy,sublime,anaconda,sublimejedi,vim,jedi-vim,python-mode,youcompleteme,ptvs,visual,python,magic,liclipse,pycharm,spyder,libraries,envelopes,flanker,imbox,inbox.py,lamson,marrow,modoboa,nylas,yagmail,pipenv,p,pyenv,venv,virtualenv,virtualenvwrapper,imghdr,mimetypes,path.py,pathlib,python-magic,unipath,watchdog,cffi,ctypes,pycuda,swig,deform,django-bootstrap3,django-crispy-forms,django-remote-forms,wtforms,cytoolz,fn.py,funcy,toolz,curses,enaml,flexx,kivy,pyglet,pygobj
ect,pyqt,pyside,pywebview,tkinter,toga,urwid,wxpython,cocos2d,panda3d,pygame,pyogre,pyopengl,pysdl2,renpy,django-countries,geodjango,geoip,geojson,geopy,pygeoip,beautifulsoup,bleach,cssutils,html5lib,lxml,markupsafe,pyquery,untangle,weasyprint,xmldataset,xmltodict,grequests,httplib2,requests,treq,urllib3,ino,keyboard,mouse,pingo,pyro,pyuserinput,scapy,wifi,hmap,imgseek,nude.py,pagan,pillow,pybarcode,pygram,python-qrcode,quads,scikit-image,thumbor,wand,clpython,cpython,cython,grumpy,ironpython,jython,micropython,numba,peachpy,pyjion,pypy,pysec,pyston,stackless,interactive,bpython,jupyter,ptpython,babel,pyicu,apscheduler,django-schedule,doit,gunnery,joblib,plan,schedule,spiff,taskflow,eliot,logbook,logging,sentry,metrics,nupic,scikit-learn,spark,vowpal_porpoise,xgboost,pyspark,luigi,mrjob,streamparse,dask,python(x,y),pythonlibs,pythonnet,pywin32,winpython,gensim,jieba,langid.py,nltk,pattern,polyglot,snownlp,spacy,textblob,mininet,pox,pyretic,sdx,asyncio,diesel,pulsar,pyzmq,twisted,txzmq,napalm,django-activity-stream,stream-framework,django,sqlalchemy,awesome-sqlalchemy,orator,peewee,ponyorm,pydal,python-sql,pip,python,conda,curdling,pip-tools,wheel,warehouse,warehouse,bandersnatch,devpi,localshop,carteblanche,django-guardian,django-rules,delegator.py subprocesses for,sarge,sh,celery,huey,mrq,rq,simpleq,annoy,fastfm,implicit,libffm,lightfm,surprise,tensorrec,django-rest-framework,django-tastypie,flask,eve,flask-api-utils,flask-api,flask-restful,flask-restless,pyramid,cornice,framework,falcon,hug,restless,ripozo,sandman,apistar,simplejsonrpcserver,simplexmlrpcserver,zerorpc,astropy,bcbio-nextgen,bccb,biopython,cclib,networkx,nipy,numpy,open,obspy,pydy,pymc,rdkit,scipy,statsmodels,sympy,zipline,simpy,django-haystack,elasticsearch-dsl-py,elasticsearch-py,esengine,pysolr,solrpy,whoosh,marshmallow,apex,python-lambda,zappa,tablib,marmir,openpyxl,pyexcel,python-docx,relatorio,unoconv,xlsxwriter,xlwings,xlwt / xlrd,pdf,pdfminer,pypdf2,reportlab,markdown,mistune,python-markdown,yaml,pyyaml,csvkit,unp,cactus,hyde,lektor,nikola,pelican,tinkerer,django-taggit,genshi,jinja2,mako,hypothesis,mamba,nose,nose2,pytest,robot,unittest,green,tox,locust,pyautogui,selenium,sixpack,splinter,doublex,freezegun,httmock,httpretty,mock,responses,vcr.py,factory_boy,mixer,model_mommy,mimesis,fake2db,faker,radar,chardet,difflib,ftfy,fuzzywuzzy,levenshtein,pangu.py,pyfiglet,pypinyin,shortuuid,unidecode,uniout,xpinyin,slugify,awesome-slugify,python-slugify,unicode-slugify,parser,phonenumbers,ply,pygments,pyparsing,python-nameparser,python-user-agents,sqlparse,apache-libcloud,boto3,django-wordpress,facebook-sdk,facepy,gmail,google-api-python-client,gspread,twython,furl,purl,pyshorteners,short_url,webargs,moviepy,scikit-video,wsgi-compatible,bjoern,fapws3,gunicorn,meinheld,netius,paste,rocket,uwsgi,waitress,werkzeug,haul,html2text,lassie,micawber,newspaper,opengraph,python-goose,python-readability,sanitize,sumy,textract,cola,demiurge,feedparser,grab,mechanicalsoup,portia,pyspider,robobrowser,scrapy,bottle,cherrypy,django,awesome-django,flask,awesome-flask,pyramid,awesome-pyramid,sanic,tornado,turbogears,web2py,github,autobahnpython,crossbar,django-socketio,websocket-for-python,javascript,php,c#,c++,ruby,css,c,objective-c,shell,scala,swift,matlab,clojure,octave -------------------------------------------------------------------------------- /Innoplexus -19/Elmo with tensorflow hub.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | 
"execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import tensorflow as tf" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "tf.test.is_gpu_available()" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "import keras\n", 28 | "from keras.models import Sequential\n", 29 | "from keras.layers import Dense, Dropout, Activation, Flatten\n", 30 | "from keras.layers.normalization import BatchNormalization\n", 31 | "from keras.callbacks import EarlyStopping\n", 32 | "from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, LearningRateScheduler\n", 33 | "from keras.models import load_model\n", 34 | "from keras.initializers import glorot_normal, Zeros, Ones\n", 35 | "import keras.backend as K\n", 36 | "from keras.optimizers import RMSprop\n", 37 | "import tensorflow as tf" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "%%time\n", 47 | "%matplotlib inline\n", 48 | "import warnings\n", 49 | "warnings.filterwarnings('ignore')\n", 50 | "\n", 51 | "import numpy as np\n", 52 | "import os\n", 53 | "import matplotlib.pyplot as plt\n", 54 | "import pandas as pd\n", 55 | "import random\n", 56 | "\n", 57 | "from sklearn import preprocessing\n", 58 | "import lightgbm as lgb\n", 59 | "from sklearn.ensemble import RandomForestClassifier\n", 60 | "from sklearn.preprocessing import OneHotEncoder\n", 61 | "\n", 62 | "from sklearn.preprocessing import StandardScaler,MinMaxScaler\n", 63 | "from sklearn.decomposition import PCA\n", 64 | "from math import sqrt\n", 65 | "from scipy import stats\n", 66 | "from scipy.stats import norm, skew #for some statistics" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "from IPython.core.interactiveshell import InteractiveShell\n", 76 | "InteractiveShell.ast_node_interactivity = \"all\"" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "tf.__version__\n", 86 | "keras.__version__\n", 87 | "np.__version__" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "train = pd.read_csv('train.csv')\n", 104 | "test = pd.read_csv('test.csv')\n", 105 | "subm = pd.read_csv('sample_submission.csv')\n", 106 | "train = train.fillna(method=\"ffill\")\n", 107 | "test = test.fillna(method=\"ffill\")" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "import anago" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "train.head()" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "train.nunique()" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "words = list(set(train[\"Word\"].values))\n", 144 | 
"words.append(\"ENDPAD\")\n", 145 | "n_words = len(words); n_words" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": {}, 152 | "outputs": [], 153 | "source": [ 154 | "tags = list(set(train[\"tag\"].values))\n", 155 | "n_tags = len(tags); n_tags" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "metadata": {}, 162 | "outputs": [], 163 | "source": [ 164 | "class SentenceGetter(object):\n", 165 | " \n", 166 | " def __init__(self, data):\n", 167 | " self.n_sent = 1\n", 168 | " self.data = data\n", 169 | " self.empty = False\n", 170 | " agg_func = lambda s: [(w, t) for w, t in zip(s[\"Word\"].values.tolist(),\n", 171 | " s[\"tag\"].values.tolist())]\n", 172 | " self.grouped = self.data.groupby(\"Sent_ID\").apply(agg_func)\n", 173 | " self.sentences = [s for s in self.grouped]\n", 174 | " \n", 175 | " def get_next(self):\n", 176 | " try:\n", 177 | " s = self.grouped[self.n_sent]\n", 178 | " self.n_sent += 1\n", 179 | " return s\n", 180 | " except:\n", 181 | " return None" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": null, 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "getter = SentenceGetter(train)\n", 191 | "sentences = getter.sentences" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [ 200 | "max_len = 50\n", 201 | "tag2idx = {t: i for i, t in enumerate(tags)}" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "X = [[w[0] for w in s] for s in sentences]" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [ 219 | "new_X = []\n", 220 | "for seq in X:\n", 221 | " new_seq = []\n", 222 | " for i in range(max_len):\n", 223 | " try:\n", 224 | " new_seq.append(seq[i])\n", 225 | " except:\n", 226 | " new_seq.append(\"__PAD__\")\n", 227 | " new_X.append(new_seq)\n", 228 | "X = new_X" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": null, 234 | "metadata": {}, 235 | "outputs": [], 236 | "source": [ 237 | "# X[1]" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": null, 243 | "metadata": {}, 244 | "outputs": [], 245 | "source": [ 246 | "y = [[tag2idx[w[1]] for w in s] for s in sentences]" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "metadata": {}, 253 | "outputs": [], 254 | "source": [ 255 | "from keras.preprocessing.sequence import pad_sequences\n", 256 | "y = pad_sequences(maxlen=max_len, sequences=y, padding=\"post\", value=tag2idx[\"O\"])" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": null, 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "y[0]" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": null, 271 | "metadata": {}, 272 | "outputs": [], 273 | "source": [ 274 | "from sklearn.model_selection import train_test_split" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | "metadata": {}, 281 | "outputs": [], 282 | "source": [ 283 | "X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=2018)" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": null, 289 | "metadata": {}, 290 | "outputs": [], 291 | "source": [ 292 | "batch_size 
= 32" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": null, 298 | "metadata": {}, 299 | "outputs": [], 300 | "source": [ 301 | "import tensorflow as tf\n", 302 | "import tensorflow_hub as hub\n", 303 | "from keras import backend as K" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": null, 309 | "metadata": {}, 310 | "outputs": [], 311 | "source": [ 312 | "sess = tf.Session()\n", 313 | "K.set_session(sess)" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "metadata": {}, 320 | "outputs": [], 321 | "source": [ 322 | "elmo_model = hub.Module(\"https://tfhub.dev/google/elmo/2\", trainable=True)" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": null, 328 | "metadata": {}, 329 | "outputs": [], 330 | "source": [ 331 | "sess.run(tf.global_variables_initializer())\n", 332 | "sess.run(tf.tables_initializer())" 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": null, 338 | "metadata": {}, 339 | "outputs": [], 340 | "source": [ 341 | "def ElmoEmbedding(x):\n", 342 | " return elmo_model(inputs={\n", 343 | " \"tokens\": tf.squeeze(tf.cast(x, tf.string)),\n", 344 | " \"sequence_len\": tf.constant(batch_size*[max_len])\n", 345 | " },\n", 346 | " signature=\"tokens\",\n", 347 | " as_dict=True)[\"elmo\"]" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": {}, 354 | "outputs": [], 355 | "source": [ 356 | "from keras.models import Model, Input\n", 357 | "from keras.layers.merge import add\n", 358 | "from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional, Lambda" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": null, 364 | "metadata": {}, 365 | "outputs": [], 366 | "source": [ 367 | "n_tags" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": null, 373 | "metadata": {}, 374 | "outputs": [], 375 | "source": [ 376 | "input_text = Input(shape=(max_len,), dtype=tf.string)\n", 377 | "embedding = Lambda(ElmoEmbedding, output_shape=( None, 1024))(input_text)\n", 378 | "x = Bidirectional(LSTM(units=512, return_sequences=True,\n", 379 | " recurrent_dropout=0.2, dropout=0.2))(embedding)\n", 380 | "x_rnn = Bidirectional(LSTM(units=512, return_sequences=True,\n", 381 | " recurrent_dropout=0.2, dropout=0.2))(x)\n", 382 | "x = add([x, x_rnn]) # residual connection to the first biLSTM\n", 383 | "out = TimeDistributed(Dense(n_tags, activation=\"softmax\"))(x)" 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": null, 389 | "metadata": {}, 390 | "outputs": [], 391 | "source": [ 392 | "model = Model(input_text, out)" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": null, 398 | "metadata": {}, 399 | "outputs": [], 400 | "source": [ 401 | "model.compile(optimizer=\"adam\", loss=\"sparse_categorical_crossentropy\", metrics=[\"accuracy\"])" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": null, 407 | "metadata": {}, 408 | "outputs": [], 409 | "source": [ 410 | "X_tr, X_val = X_tr[:1213*batch_size], X_tr[-135*batch_size:]\n", 411 | "y_tr, y_val = y_tr[:1213*batch_size], y_tr[-135*batch_size:]\n", 412 | "y_tr = y_tr.reshape(y_tr.shape[0], y_tr.shape[1], 1)\n", 413 | "y_val = y_val.reshape(y_val.shape[0], y_val.shape[1], 1)" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": null, 419 | "metadata": {}, 420 | "outputs": [], 421 | "source": [ 422 | 
"history = model.fit(np.array(X_tr), y_tr, validation_data=(np.array(X_val), y_val),\n", 423 | " batch_size=batch_size, epochs=1, verbose=1)" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": null, 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": null, 436 | "metadata": {}, 437 | "outputs": [], 438 | "source": [] 439 | } 440 | ], 441 | "metadata": { 442 | "kernelspec": { 443 | "display_name": "Python [conda env:tf-gpu]", 444 | "language": "python", 445 | "name": "conda-env-tf-gpu-py" 446 | }, 447 | "language_info": { 448 | "codemirror_mode": { 449 | "name": "ipython", 450 | "version": 3 451 | }, 452 | "file_extension": ".py", 453 | "mimetype": "text/x-python", 454 | "name": "python", 455 | "nbconvert_exporter": "python", 456 | "pygments_lexer": "ipython3", 457 | "version": "3.6.8" 458 | } 459 | }, 460 | "nbformat": 4, 461 | "nbformat_minor": 2 462 | } 463 | -------------------------------------------------------------------------------- /Innoplexus -19/Readme.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Innoplexus -19/data/sample_submission_usrypCc.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/Innoplexus -19/data/sample_submission_usrypCc.zip -------------------------------------------------------------------------------- /Innoplexus -19/data/test_XEV14AD.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/Innoplexus -19/data/test_XEV14AD.zip -------------------------------------------------------------------------------- /Innoplexus -19/data/train.7z: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/Innoplexus -19/data/train.7z -------------------------------------------------------------------------------- /Innoplexus -19/memorization-innoplexus.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"metadata":{"trusted":true},"cell_type":"code","source":"import keras\nfrom keras.models import Sequential\nfrom keras.layers import Dense, Dropout, Activation, Flatten\nfrom keras.layers.normalization import BatchNormalization\nfrom keras.callbacks import EarlyStopping\nfrom keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, LearningRateScheduler\nfrom keras.models import load_model\nfrom keras.initializers import glorot_normal, Zeros, Ones\nimport keras.backend as K\nfrom keras.optimizers import RMSprop\nimport tensorflow as tf","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"%%time\n%matplotlib inline\nimport warnings\nwarnings.filterwarnings('ignore')\n\nimport numpy as np\nimport os\nimport matplotlib.pyplot as plt\nimport pandas as pd\nimport random\n\nfrom sklearn import preprocessing\nimport lightgbm as lgb\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.preprocessing import OneHotEncoder\n\nfrom sklearn.preprocessing import StandardScaler,MinMaxScaler\nfrom sklearn.decomposition import PCA\nfrom math import sqrt\nfrom scipy import 
stats\nfrom scipy.stats import norm, skew #for some statistics","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from IPython.core.interactiveshell import InteractiveShell\nInteractiveShell.ast_node_interactivity = \"all\"","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"import os\nos.listdir('../input/')","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train = pd.read_csv('../input/train_3pirksi/train.csv')\ntest = pd.read_csv('../input/test_xev14ad/test.csv')\nsubm = pd.read_csv('../input/sample_submission_usrypcc/sample_submission.csv')","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train = train.fillna(method=\"ffill\")\ntest = test.fillna(method=\"ffill\")","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"test.head(5)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"words = list(set(train[\"Word\"].values))\nn_words = len(words); n_words","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"tags = list(set(train[\"tag\"].values))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"n_tags = len(tags); n_tags","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"class SentenceGetter(object):\n \n def __init__(self, data):\n self.n_sent = 1\n self.data = data\n self.empty = False\n \n def get_next(self):\n try:\n s = self.data[self.data[\"Sent_ID\"] == self.n_sent]\n self.n_sent += 1\n return s[\"Word\"].values.tolist(), s[\"tag\"].values.tolist() \n except:\n self.empty = True\n return None, None","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"class TestSentenceGetter(object):\n def __init__(self, data):\n self.n_sent = 191283\n self.data = data\n self.empty = False\n \n def get_next(self):\n try:\n s = self.data[self.data[\"Sent_ID\"] == self.n_sent]\n self.n_sent += 1\n return s[\"Word\"].values.tolist() \n except:\n self.empty = True\n return None","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"getter = SentenceGetter(train)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"testgetter = TestSentenceGetter(test)\ntestgetter.get_next()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"sent, tag = getter.get_next()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"print(sent)\nprint(tag)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"Memorization"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.base import BaseEstimator, TransformerMixin\n\n\nclass MemoryTagger(BaseEstimator, TransformerMixin):\n \n def fit(self, X, y):\n '''\n Expects a list of words as X and a list of tags as y.\n '''\n voc = {}\n self.tags = []\n for x, t in zip(X, y):\n if t not in self.tags:\n self.tags.append(t)\n if x in voc:\n if t in voc[x]:\n voc[x][t] += 1\n else:\n voc[x][t] = 1\n else:\n voc[x] = {t: 1}\n self.memory = {}\n for k, d in voc.items():\n self.memory[k] = max(d, key=d.get)\n \n def predict(self, X, y=None):\n '''\n Predict the the tag from memory. 
If word is unknown, predict 'O'.\n '''\n return [self.memory.get(x, 'O') for x in X]","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.model_selection import cross_val_predict\nfrom sklearn.metrics import classification_report\n\nwords = train[\"Word\"].values.tolist()\ntags = train[\"tag\"].values.tolist()\n\npred = cross_val_predict(estimator=MemoryTagger(), X=words, y=tags, cv=5)\n","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"report = classification_report(y_pred=pred, y_true=tags)\nprint(report)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"tagger = MemoryTagger()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"tagger.fit(words, tags)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# print(tagger.predict(testgetter.get_next()))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"testwords = test[\"Word\"].values.tolist()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"pred = tagger.predict(testwords)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"set(pred)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"len(pred)\nset(pred)\ntest.shape","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"set(subm['tag'])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"subm['tag']=pred\nsubm.to_csv('submission.csv',index=False)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"set(subm['tag'])\nsubm.head()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"machine learning approach"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.ensemble import RandomForestClassifier","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"def feature_map(word):\n '''Simple feature map.'''\n return np.array([ str(word).istitle(),word.islower(), word.isupper(), len(word),\n word.isdigit(), word.isalpha()])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"words = [feature_map(w) for w in train[\"Word\"].values.tolist()]","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.model_selection import cross_val_predict\npred = cross_val_predict(RandomForestClassifier(n_estimators=20),\n X=words, y=tags, cv=3)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"report = classification_report(y_pred=pred, y_true=tags)\nprint(report)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.preprocessing import LabelEncoder\nfrom sklearn.base import BaseEstimator, TransformerMixin\n\nclass FeatureTransformer(BaseEstimator, TransformerMixin):\n \n def __init__(self):\n self.memory_tagger = MemoryTagger()\n self.tag_encoder = LabelEncoder()\n \n def fit(self, X, y):\n words = X[\"Word\"].values.tolist()\n tags = X[\"tag\"].values.tolist()\n self.memory_tagger.fit(words, tags)\n self.tag_encoder.fit(tags)\n return self\n \n def transform(self, X, y=None):\n words = 
X[\"Word\"].values.tolist()\n out = []\n for i in range(len(words)):\n w = str(words[i])\n if i < len(words) - 1:\n wp = self.tag_encoder.transform(self.memory_tagger.predict([words[i+1]]))[0]\n else:\n wp = self.tag_encoder.transform(['O'])[0]\n if i > 0:\n if words[i-1] != \".\":\n wm = self.tag_encoder.transform(self.memory_tagger.predict([words[i-1]]))[0]\n else:\n wm = self.tag_encoder.transform(['O'])[0]\n else:\n wm = self.tag_encoder.transform(['O'])[0]\n out.append(np.array([w.istitle(), w.islower(), w.isupper(), len(w), w.isdigit(), w.isalpha(),\n self.tag_encoder.transform(self.memory_tagger.predict([w]))[0], wp, wm]))\n return out","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.pipeline import Pipeline","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"pred = cross_val_predict(Pipeline([(\"feature_map\", FeatureTransformer()), \n (\"clf\",lgb.LGBMClassifier())]),\n X=train, y=tags, cv=2)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"'aaa'","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"report = classification_report(y_pred=pred, y_true=tags)\nprint(report)","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"display_name":"tf-gpu","language":"python","name":"tf-gpu"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.6.8"}},"nbformat":4,"nbformat_minor":1} -------------------------------------------------------------------------------- /Innoplexus/Readme.md: -------------------------------------------------------------------------------- 1 | ## Innoplexus Document Referencing Challenge 2 | Secured rank 22 in this Competition 3 | 4 | ### About 5 | Here From the corpus of research papers we need to predict the citations for the papers. 
6 | 7 | ### Stack Used 8 | - Sklearn 9 | - Nltk 10 | - Gensim 11 | 12 | # Contributors 13 | - [utsav aggarwal](https://github.com/utsav1) 14 | -------------------------------------------------------------------------------- /Innoplexus/data/tes.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Innoplexus/data/test_c2Mvube.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/Innoplexus/data/test_c2Mvube.zip -------------------------------------------------------------------------------- /Innoplexus/data/train_bnHAB63.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/Innoplexus/data/train_bnHAB63.zip -------------------------------------------------------------------------------- /Intel Scene Classification/Intel scene with FastAi (2).ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"metadata":{"_uuid":"50e875c7c1b7e8c998507d9b2573224593861b9a"},"cell_type":"markdown","source":"# Scene classification"},{"metadata":{"_uuid":"edce220d096f99b124ff0e01d8ba1c4e8a8a01cc"},"cell_type":"markdown","source":"**Import the relevant libraries**"},{"metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"cell_type":"code","source":"path = \"../input/scene_classification/scene_classification/train/\"","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"79c7e3d0-c299-4dcb-8224-4455121ee9b0","_uuid":"d629ff2d2480ee46fbb7e2d37f6b5fab8052498a","trusted":true},"cell_type":"code","source":"from fastai import *\nfrom fastai.vision import *","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"!pip install pretrainedmodels\n\nfrom torchvision.models import *\nimport pretrainedmodels\n\nfrom fastai.vision import *\nfrom fastai.vision.models import *\nfrom fastai.vision.learner import model_meta\n\nfrom utils import *\nimport sys","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"cfee912478cf492fadf22cc800717c1d8fa089db"},"cell_type":"code","source":"bs = 8","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"aa3bdc3cacf18ff21e7450279cc0ecabf1dc57dd"},"cell_type":"code","source":"df = pd.read_csv('../input/scene_classification/scene_classification/train.csv')\ndf.head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"import os\nfilenames = os.listdir('../input/scene_classification/scene_classification/test/')","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"d35d9b01a47db718d86f1df43627b8e93bd18ad2"},"cell_type":"code","source":"tfms = get_transforms(flip_vert=False,max_zoom=1.0,max_warp=0,do_flip=False,xtra_tfms=[cutout()])\ndata = (ImageList.from_csv(path, csv_name = '../train.csv') \n .split_by_rand_pct() \n .label_from_df() \n .add_test_folder(test_folder = '../test') \n .transform(tfms, size=150)\n 
.databunch(num_workers=0))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"8d919b4cff4c2c8771e7059acefe74e0f30a80ff","_kg_hide-output":true,"collapsed":true},"cell_type":"code","source":"data.show_batch(rows=3, figsize=(8,10))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"672f015ee5b2485901cf7ec774125c72a5a33ac1"},"cell_type":"code","source":"print(data.classes)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"545e7cf0d173a51dfdef958a6cbe5b9a9b6c0736"},"cell_type":"code","source":"learn = cnn_learner(data, models.resnet152, metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\")","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"3c5443eadbd87d4a9ade44df6d54fc52eb4d44f8"},"cell_type":"code","source":"learn.fit_one_cycle(6)\nlearn.unfreeze()\nlearn.lr_find()\nlearn.fit_one_cycle(6, max_lr=slice(1e-6, 1e-4))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"## resnet101\n# 0\t0.389204\t0.276102\t0.102173\t0.897827\t01:35\n# 1\t0.265536\t0.231576\t0.080446\t0.919554\t01:18\n# 2\t0.219510\t0.204855\t0.070464\t0.929536\t01:18\n# 3\t0.170762\t0.200797\t0.065766\t0.934234\t01:17","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"preds3,_ = learn.TTA(ds_type=DatasetType.Test)\npred1 = preds3\n\nlabelled_preds = []\nfor pred in pred1:\n labelled_preds.append(int(np.argmax(pred)))\n\nsubmission = pd.DataFrame(\n {'image_name': filenames,\n 'label': labelled_preds,\n })\nsubmission.to_csv('new_submission1.csv',index=False)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"preds_oof,_ = learn.get_preds(ds_type=DatasetType.Train)\n# DatasetType.Test\ntrain_preds = []\nfor pred in preds_oof:\n train_preds.append(int(np.argmax(pred)))\n\n","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"learn = cnn_learner(data, models.resnet152, metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\")\n\nlearn.fit_one_cycle(6)\n\n# preds2,_ = learn.get_preds(ds_type=DatasetType.Test)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"preds2,_ = learn.get_preds(ds_type=DatasetType.Test)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"preds1","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"preds2","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"http://places2.csail.mit.edu/models_places365/resnet18_places365.pth.tar\nlearn11 = cnn_learner(data, models.ResNet, metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\")","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# learn1 = cnn_learner(data, models.densenet201, metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\")\n\n# learn2 = cnn_learner(data, models.resnet152, metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\")\n\nlearn3 = cnn_learner(data, models.densenet169, metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\")\n\nlearn4 = cnn_learner(data, models.resnet101, metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\")\n\nlearn5 = cnn_learner(data, models.densenet121, metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\")\n\nlearn6 = cnn_learner(data, models.resnet50, metrics=[error_rate, accuracy], 
model_dir=\"/tmp/model/\")\n\n# learn1.fit_one_cycle(6)\n# learn2.fit_one_cycle(6)\nlearn3.fit_one_cycle(6, max_lr=slice(1e-6, 1e-4))\nlearn4.fit_one_cycle(6, max_lr=slice(1e-6, 1e-4))\nlearn5.fit_one_cycle(6, max_lr=slice(1e-6, 1e-4))\nlearn6.fit_one_cycle(6, max_lr=slice(1e-6, 1e-4))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"preds3,_ = learn3.get_preds(ds_type=DatasetType.Test)\n\npreds4,_ = learn4.get_preds(ds_type=DatasetType.Test)\n\npreds5,_ = learn5.get_preds(ds_type=DatasetType.Test)\n\npreds6,_ = learn6.get_preds(ds_type=DatasetType.Test)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"pred = 5 * preds1 + 5 * preds2 + 4 * preds3 + 3 * preds4 + 3 * preds5 + 3 * preds6","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"labelled_preds = []\nfor pred1 in pred:\n labelled_preds.append(int(np.argmax(pred1)))\n\nsubmission = pd.DataFrame(\n {'image_name': filenames,\n 'label': labelled_preds,\n })\n\nsubmission.to_csv('new_submission_add.csv',index=False)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"926adfe2f169f3ad56628c47305362a23b2cb357"},"cell_type":"code","source":"interp = ClassificationInterpretation.from_learner(learn)\n\nlosses,idxs = interp.top_losses()\n\nlen(data.valid_ds)==len(losses)==len(idxs)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"5409a1bc4c92c7c47bbbefc74f04b9b34f7caf49","collapsed":true},"cell_type":"code","source":"interp.plot_top_losses(9, figsize=(15,11))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"6dee36fb7c178a811ca25094485ef8e1e46dc53e","collapsed":true},"cell_type":"code","source":"interp.most_confused(min_val=2)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"6d84aca6e8fb7100fff9eb71e4b9e383df24e228"},"cell_type":"code","source":"learn.save('/kaggle/working/stage-1-50-128')","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"preds,_ = learn.get_preds(ds_type=DatasetType.Test)\n\nlabelled_preds = []\nfor pred in preds:\n labelled_preds.append(int(np.argmax(pred)))\n\nsubmission = pd.DataFrame(\n {'image_name': filenames,\n 'label': labelled_preds,\n })\n\nsubmission.to_csv('new_submission1.csv',index=False)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"4dde475f0bf74c12c2d24e06b1925a19b635a39a"},"cell_type":"code","source":"learn.unfreeze()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"d5c3db9e7dbb9eae17d0795874a96a9d956074af"},"cell_type":"code","source":"learn.lr_find()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"732aaf3cb012bf84b758ac6542583495933d14a8"},"cell_type":"code","source":"learn.recorder.plot()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":""},{"metadata":{"trusted":true,"_uuid":"20a7e1cde5a9122c36aaf3a15502f7de32e47830"},"cell_type":"code","source":"learn.fit_one_cycle(7, max_lr=slice(1e-6, 1e-4))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"2fb38793b4561ae822691049babe78207244236b"},"cell_type":"code","source":"learn.save('/kaggle/working/stage-2-50-128')","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"preds,_ = 
learn.get_preds(ds_type=DatasetType.Test)\n\nlabelled_preds = []\nfor pred in preds:\n labelled_preds.append(int(np.argmax(pred)))\n\nsubmission = pd.DataFrame(\n {'image_name': filenames,\n 'label': labelled_preds,\n })\n\nsubmission.to_csv('new_submission2.csv',index=False)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from IPython.display import HTML\nimport pandas as pd\nimport numpy as np\nimport base64\n\n# download it (will only work for files < 2MB or so)\ndef create_download_link(df, title = \"Download CSV file\", filename = \"subm.csv\"): \n csv = df.to_csv(index=False)\n b64 = base64.b64encode(csv.encode())\n payload = b64.decode()\n html = '{title}'\n html = html.format(payload=payload,title=title,filename=filename)\n return HTML(html)\n\ncreate_download_link(submission)","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"15810e0b260e078aafa3b18ce212f3caf7d8f091"},"cell_type":"markdown","source":"**Changing image resolution to 256**"},{"metadata":{"trusted":true,"_uuid":"652e3d2fae8575839ab420dd791f671d3fcbe957"},"cell_type":"code","source":"tfms = get_transforms(flip_vert=False,max_zoom=1.0,max_warp=0,do_flip=False)\ndata = (ImageList.from_csv(path, csv_name = '../train.csv') \n .split_by_rand_pct() \n .label_from_df() \n .add_test_folder(test_folder = '../test') \n .transform(tfms, size=300)\n .databunch(num_workers=0))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"72f1b24d7db4fe05239d5549df1fbcd4bb3556fa"},"cell_type":"code","source":"data.show_batch(rows=3, figsize=(8,10))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"74cec8d01711132595d27093108276073ebd6fba","scrolled":false},"cell_type":"code","source":"learn.load('/kaggle/working/stage-2-50-128')","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"56d622bbc43334ae75d65d1761c9136c09bd491a"},"cell_type":"code","source":"learn.unfreeze()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"370bef1b90ae8c0b251f58e7c329d4ff06e7292f"},"cell_type":"code","source":"learn.lr_find()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"fc206515ebc967ac2d95113278f4006f5b14572e"},"cell_type":"code","source":"learn.recorder.plot()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"6d5e1a4d5ef7e2a3107fcb31f8f1efc6b138c9cc"},"cell_type":"code","source":"learn.fit_one_cycle(7, max_lr=slice(1e-6, 1e-4))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"2fd5521f6d97f8fd0bd493c9d3d9979e313c4210"},"cell_type":"code","source":"learn.save('/kaggle/working/stage-1-50-256')","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"7f22b4abf64b88a22a9de01ab100a0215f298d7c"},"cell_type":"code","source":"preds,_ = learn.get_preds(ds_type=DatasetType.Test)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"162800f15eceb88fed88e2fd4d86080d0c004baf"},"cell_type":"code","source":"labelled_preds = []\nfor pred in preds:\n labelled_preds.append(int(np.argmax(pred)))\n \n# labelled_preds[0:10]\nlen(labelled_preds)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"74eb936f33c1d46f974b815dc0364c4a53bf980a"},"cell_type":"code","source":"len(filenames) == len(labelled_preds)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"c207c8ae9b3975262f6c7893a628cfbd50810921"},"cell_type":"code","source":"submission = pd.DataFrame(\n {'image_name': 
filenames,\n 'label': labelled_preds,\n })","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"b6f4d8478add0271a8478d3fbc70cadfb926d929"},"cell_type":"code","source":"submission.to_csv('new_submission3.csv',index=False)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"submission.head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"45f7b2356d117d8f1e114b2de6382aa3bcf7d61f"},"cell_type":"code","source":"# from IPython.display import FileLinks","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"b15e8ebf9293c6c4c707ff86a018eeec70d217cc"},"cell_type":"code","source":"# FileLinks('.') # download the files without committing","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"pretrainedmodels.model_names","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"def resnext101_32x4d(pretrained=False):\n pretrained = 'imagenet' if pretrained else None\n model = pretrainedmodels.resnext101_32x4d(pretrained=pretrained)\n all_layers = list(model.children())\n return nn.Sequential(*all_layers[0], *all_layers[1:])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# arch_summary(resnext101_32x4d)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"collapsed":true},"cell_type":"code","source":"learn = create_cnn(data, resnext101_32x4d, pretrained=False,\n cut=-2, split_on=lambda m: (m[0][6], m[1]))","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Pnasnet5large"},{"metadata":{"trusted":true},"cell_type":"code","source":"def identity(x): return x\n\ndef pnasnet5large(pretrained=False): \n pretrained = 'imagenet' if pretrained else None\n model = pretrainedmodels.pnasnet5large(pretrained=pretrained, num_classes=1000) \n model.logits = identity\n return nn.Sequential(model)\n\nmodel_meta[pnasnet5large] = { 'cut': None, \n 'split': lambda m: (list(m[0][0].children())[8], m[1]) }","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"collapsed":true},"cell_type":"code","source":"learn = create_cnn(data, pnasnet5large(), metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\")","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Xception"},{"metadata":{"trusted":true},"cell_type":"code","source":"def xception(pretrained=True):\n pretrained = 'imagenet' if pretrained else None\n model = pretrainedmodels.xception(pretrained=pretrained)\n return nn.Sequential(*list(model.children()))\n\nlearn = create_cnn(data, xception, pretrained=True, metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\",\n cut=-1, split_on=lambda m: (m[0][11], m[1]))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"learn.fit_one_cycle(6)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"learn.unfreeze()\nlearn.lr_find()\nlearn.recorder.plot()\n\nlearn.fit_one_cycle(6, max_lr=slice(1e-6,1e-4)) ","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":""},{"metadata":{"trusted":true},"cell_type":"code","source":"preds,_ = learn.TTA(ds_type=DatasetType.Test)\n# preds1,_ = 
learn.get_preds(ds_type=DatasetType.Test)\n\n# pred1 = preds + preds1\n\npred1 = preds\n\nlabelled_preds = []\nfor pred in pred1:\n labelled_preds.append(int(np.argmax(pred)))\n\nsubmission = pd.DataFrame(\n {'image_name': filenames,\n 'label': labelled_preds,\n })\n\nsubmission.to_csv('new_submission2.csv',index=False)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"learn.one_cycle_scheduler","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from IPython.display import HTML\nimport pandas as pd\nimport numpy as np\nimport base64\n\n# download it (will only work for files < 2MB or so)\ndef create_download_link(df, title = \"Download CSV file\", filename = \"subm.csv\"): \n csv = df.to_csv(index=False)\n b64 = base64.b64encode(csv.encode())\n payload = b64.decode()\n html = '{title}'\n html = html.format(payload=payload,title=title,filename=filename)\n return HTML(html)\n\ncreate_download_link(submission)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# epoch\ttrain_loss\tvalid_loss\terror_rate\taccuracy\ttime\n# 0\t0.266674\t0.184001\t0.061656\t0.938344\t04:05\n# 1\t0.221602\t0.156560\t0.056665\t0.943335\t03:59\n# 2\t0.186194\t0.150860\t0.057546\t0.942455\t03:57\n# 3\t0.157000\t0.134333\t0.047563\t0.952437\t03:57\n# 4\t0.123728\t0.128506\t0.047270\t0.952730\t03:58\n# 5\t0.111995\t0.127972\t0.046389\t0.953611\t03:58","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Inception"},{"metadata":{"trusted":true},"cell_type":"code","source":"def inceptionv4(pretrained=True):\n pretrained = 'imagenet' if pretrained else None\n model = pretrainedmodels.inceptionv4(pretrained=pretrained)\n all_layers = list(model.children())\n return nn.Sequential(*all_layers[0], *all_layers[1:])\n\nlearn = create_cnn(data, inceptionv4, pretrained=True, metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\",\n cut=-2, split_on=lambda m: (m[0][11], m[1]))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"learn.fit_one_cycle(3) \nlearn.unfreeze()\nlearn.fit_one_cycle(1, max_lr=slice(1e-6,1e-4))","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## InceptionResnet"},{"metadata":{"trusted":true},"cell_type":"code","source":"def inceptionresnetv2(pretrained=True):\n pretrained = 'imagenet' if pretrained else None\n model11 = pretrainedmodels.inceptionresnetv2(pretrained=pretrained)\n return nn.Sequential(*model11.children())\n\nlearn = create_cnn(data, inceptionresnetv2, pretrained=True, metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\",\n cut=-2, split_on=lambda m: (m[0][9], m[1]))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"model11 = pretrainedmodels.inceptionresnetv2(pretrained='imagenet')\naa = nn.Sequential(*model11.children())","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"aa","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#'http://places2.csail.mit.edu/models_places365/resnet18_places365.pth.tar'\ndef inceptionresnetv2(pretrained=True):\n pretrained = 'imagenet' if pretrained else None\n model = pretrainedmodels.resnet18(pretrained=pretrained)\n# model = resnet18(pretrained=pretrained)\n return nn.Sequential(*model.children())\n\nlearn = create_cnn(data, inceptionresnetv2, pretrained=True, 
metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\",\n cut=-2, split_on=lambda m: (m[0][6], m[1]))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"arch = 'resnet18'\n\n# load the pre-trained weights\nmodel_file = '%s_places365.pth.tar' % arch\nif not os.access(model_file, os.W_OK):\n weight_url = 'http://places2.csail.mit.edu/models_places365/' + model_file\n os.system('wget ' + weight_url)\n\nmodel1 = models.__dict__[arch](num_classes=365)\ncheckpoint = torch.load(model_file, map_location=lambda storage, loc: storage)\nstate_dict = {str.replace(k,'module.',''): v for k,v in checkpoint['state_dict'].items()}\nmodel1.load_state_dict(state_dict)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"model1","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#'http://places2.csail.mit.edu/models_places365/resnet18_places365.pth.tar'\ndef inceptionresnetv21( pretrained=True):\n pretrained = 'imagenet' if pretrained else None\n# model = pretrainedmodels.resnet18(pretrained=pretrained)\n# arch = 'alexnet'\n arch = 'resnet50'\n # load the pre-trained weights\n model_file = '%s_places365.pth.tar' % arch\n if not os.access(model_file, os.W_OK):\n weight_url = 'http://places2.csail.mit.edu/models_places365/' + model_file\n os.system('wget ' + weight_url)\n\n model = models.__dict__[arch](num_classes=365)\n checkpoint = torch.load(model_file, map_location=lambda storage, loc: storage)\n state_dict = {str.replace(k,'module.',''): v for k,v in checkpoint['state_dict'].items()}\n model.load_state_dict(state_dict)\n return nn.Sequential(*model.children())\n\nlearn = create_cnn(data, inceptionresnetv21, pretrained=True, metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\",\n cut=-2, split_on=lambda m: (m[0][6], m[1]))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"tfms = get_transforms(flip_vert=False,max_zoom=1.0,max_warp=0)\ndata = (ImageList.from_csv(path, csv_name = '../train.csv') \n .split_by_rand_pct() \n .label_from_df() \n .add_test_folder(test_folder = '../test') \n .transform(tfms, size=150)\n .databunch(num_workers=0))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"def inceptionresnetv11( pretrained=True):\n pretrained = 'imagenet' if pretrained else None\n arch = 'resnet18'\n model_file = '%s_places365.pth.tar' % arch\n if not os.access(model_file, os.W_OK):\n weight_url = 'http://places2.csail.mit.edu/models_places365/' + model_file\n os.system('wget ' + weight_url)\n\n model = models.__dict__[arch](num_classes=365)\n checkpoint = torch.load(model_file, map_location=lambda storage, loc: storage)\n state_dict = {str.replace(k,'module.',''): v for k,v in checkpoint['state_dict'].items()}\n model.load_state_dict(state_dict)\n return nn.Sequential(*model.children())\n\nlearn1 = create_cnn(data, inceptionresnetv11, pretrained=True, metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\",\n cut=-2, split_on=lambda m: (m[0][6], m[1]))\n\ndef inceptionresnetv12( pretrained=True):\n pretrained = 'imagenet' if pretrained else None\n arch = 'resnet50'\n model_file = 
'%s_places365.pth.tar' % arch\n if not os.access(model_file, os.W_OK):\n weight_url = 'http://places2.csail.mit.edu/models_places365/' + model_file\n os.system('wget ' + weight_url)\n model = models.__dict__[arch](num_classes=365)\n checkpoint = torch.load(model_file, map_location=lambda storage, loc: storage)\n state_dict = {str.replace(k,'module.',''): v for k,v in checkpoint['state_dict'].items()}\n model.load_state_dict(state_dict)\n return nn.Sequential(*model.children())\n\nlearn2 = create_cnn(data, inceptionresnetv12, pretrained=True, metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\",\n cut=-2, split_on=lambda m: (m[0][6], m[1]))\n\ndef inceptionresnetv13( pretrained=True):\n pretrained = 'imagenet' if pretrained else None\n# model = pretrainedmodels.resnet18(pretrained=pretrained)\n arch = 'alexnet'\n model_file = '%s_places365.pth.tar' % arch\n if not os.access(model_file, os.W_OK):\n weight_url = 'http://places2.csail.mit.edu/models_places365/' + model_file\n os.system('wget ' + weight_url)\n model = models.__dict__[arch](num_classes=365)\n checkpoint = torch.load(model_file, map_location=lambda storage, loc: storage)\n state_dict = {str.replace(k,'module.',''): v for k,v in checkpoint['state_dict'].items()}\n model.load_state_dict(state_dict)\n return nn.Sequential(*model.children())\n\nlearn3 = create_cnn(data, inceptionresnetv13, pretrained=True, metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\",\n cut=-2, split_on=lambda m: (m[0][0][6], m[1]))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"learn2.fit_one_cycle(6) \nlearn2.unfreeze()\nlearn2.lr_find()\nlearn2.fit_one_cycle(6, max_lr=slice(1e-6,1e-4))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"learn1.fit_one_cycle(6) \nlearn1.unfreeze()\nlearn1.lr_find()\nlearn1.fit_one_cycle(6, max_lr=slice(1e-6,1e-4)) \n\n# learn2.fit_one_cycle(3) \n# learn2.unfreeze()\n# learn2.lr_find()\n# learn2.fit_one_cycle(6, max_lr=slice(1e-6,1e-4)) \n\n# learn3.fit_one_cycle(3) \n# learn3.unfreeze()\n# learn3.lr_find()\n# learn3.fit_one_cycle(6, max_lr=slice(1e-6,1e-4)) ","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# preds2,_ = learn2.TTA(ds_type=DatasetType.Test)\n\n# preds1,_ = learn1.TTA(ds_type=DatasetType.Test)\n\npred1 = preds2 + preds1\n\nlabelled_preds = []\nfor pred in pred1:\n labelled_preds.append(int(np.argmax(pred)))\n\nsubmission = pd.DataFrame(\n {'image_name': filenames,\n 'label': labelled_preds,\n })\nsubmission.to_csv('new_submission1.csv',index=False)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"preds1,_ = learn1.TTA(ds_type=DatasetType.Test)\npreds2,_ = learn2.TTA(ds_type=DatasetType.Test)\npreds3,_ = learn3.TTA(ds_type=DatasetType.Test)\n\npred1 = preds1 + preds2 + preds3\n\nlabelled_preds = []\nfor pred in pred1:\n labelled_preds.append(int(np.argmax(pred)))\n\nsubmission = pd.DataFrame(\n {'image_name': filenames,\n 'label': labelled_preds,\n })\nsubmission.to_csv('new_submission1.csv',index=False)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"tfms = get_transforms(flip_vert=False,max_zoom=1.0,do_flip=False,max_warp=0,xtra_tfms=[cutout()])\ndata = (ImageList.from_csv(path, csv_name = '../train.csv') \n .split_by_rand_pct() \n .label_from_df() \n 
.add_test_folder(test_folder = '../test') \n .transform(tfms, size=300)\n .databunch(num_workers=0))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"def inceptionresnetv14( pretrained=True):\n pretrained = 'imagenet' if pretrained else None\n arch = 'resnet18'\n model_file = '%s_places365.pth.tar' % arch\n if not os.access(model_file, os.W_OK):\n weight_url = 'http://places2.csail.mit.edu/models_places365/' + model_file\n os.system('wget ' + weight_url)\n\n model = models.__dict__[arch](num_classes=365)\n checkpoint = torch.load(model_file, map_location=lambda storage, loc: storage)\n state_dict = {str.replace(k,'module.',''): v for k,v in checkpoint['state_dict'].items()}\n model.load_state_dict(state_dict)\n return nn.Sequential(*model.children())\n\nlearn4 = create_cnn(data, inceptionresnetv14, pretrained=True, metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\",\n cut=-2, split_on=lambda m: (m[0][6], m[1]))\n\ndef inceptionresnetv15( pretrained=True):\n pretrained = 'imagenet' if pretrained else None\n arch = 'resnet50'\n model_file = '%s_places365.pth.tar' % arch\n if not os.access(model_file, os.W_OK):\n weight_url = 'http://places2.csail.mit.edu/models_places365/' + model_file\n os.system('wget ' + weight_url)\n model = models.__dict__[arch](num_classes=365)\n checkpoint = torch.load(model_file, map_location=lambda storage, loc: storage)\n state_dict = {str.replace(k,'module.',''): v for k,v in checkpoint['state_dict'].items()}\n model.load_state_dict(state_dict)\n return nn.Sequential(*model.children())\n\nlearn5 = create_cnn(data, inceptionresnetv15, pretrained=True, metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\",\n cut=-2, split_on=lambda m: (m[0][6], m[1]))\n\ndef inceptionresnetv16( pretrained=True):\n pretrained = 'imagenet' if pretrained else None\n# model = pretrainedmodels.resnet18(pretrained=pretrained)\n arch = 'alexnet'\n model_file = '%s_places365.pth.tar' % arch\n if not os.access(model_file, os.W_OK):\n weight_url = 'http://places2.csail.mit.edu/models_places365/' + model_file\n os.system('wget ' + weight_url)\n model = models.__dict__[arch](num_classes=365)\n checkpoint = torch.load(model_file, map_location=lambda storage, loc: storage)\n state_dict = {str.replace(k,'module.',''): v for k,v in checkpoint['state_dict'].items()}\n model.load_state_dict(state_dict)\n return nn.Sequential(*model.children())\n\nlearn6 = create_cnn(data, inceptionresnetv16, pretrained=True, metrics=[error_rate, accuracy], model_dir=\"/tmp/model/\",\n cut=-2, split_on=lambda m: (m[0][0][6], m[1]))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"learn4.fit_one_cycle(6) \nlearn4.unfreeze()\nlearn4.lr_find()\nlearn4.fit_one_cycle(6, max_lr=slice(1e-6,1e-4)) ","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"learn5.fit_one_cycle(6)\nlearn5.unfreeze()\nlearn5.lr_find()\nlearn5.fit_one_cycle(6, max_lr=slice(1e-6,1e-4))\n\n# learn6.fit_one_cycle(3) \n# learn6.unfreeze()\n# learn6.lr_find()\n# learn6.fit_one_cycle(6, max_lr=slice(1e-6,1e-4)) ","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"preds11","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# preds4,_ = learn4.TTA(ds_type=DatasetType.Test)\npreds5,_ = learn5.TTA(ds_type=DatasetType.Test)\n# preds6,_ = learn6.TTA(ds_type=DatasetType.Test)\n\npred1 = preds5\n\n# pred1 = preds4 + preds5 + 
preds6\n\nlabelled_preds = []\nfor pred in pred1:\n labelled_preds.append(int(np.argmax(pred)))\n\nsubmission = pd.DataFrame(\n {'image_name': filenames,\n 'label': labelled_preds,\n })\nsubmission.to_csv('new_submission2.csv',index=False)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"pred1 = preds1 + preds2 + preds3 + preds4 + preds5 + preds6\n\nlabelled_preds = []\nfor pred in pred1:\n labelled_preds.append(int(np.argmax(pred)))\n\nsubmission = pd.DataFrame(\n {'image_name': filenames,\n 'label': labelled_preds,\n })\nsubmission.to_csv('new_submission3.csv',index=False)","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"name":"python","version":"3.6.6","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat":4,"nbformat_minor":1} -------------------------------------------------------------------------------- /Intel Scene Classification/Readme.md: -------------------------------------------------------------------------------- 1 | The Approach which i have taken for this challenge is as follow: 2 | 3 | - Used the learned weights from places365 dataset.used Resnet50,resnet18,Alexnet pretrained weights.densenet weights didn’t used because it’s not compatible with pytorch 1.0. Link to the pretrained weights are below: 4 | https://github.com/CSAILVision/places365/ 5 | 6 | - Used imagenet pretrained weights avaiable in Cadene.here model used are Resnet152,densenet169,Resnext.link to the pretrained weights are below: 7 | https://github.com/Cadene/pretrained-models.pytorch 8 | 9 | - These 6-7 different model have been trained for different sizes that is 128,150(Default),300 and also with different Augmentation.link to the transformations are below: 10 | https://docs.fast.ai/vision.transform.html 11 | 12 | - Weighted Average of probability of each class have given me the final solution. 13 | 14 | - Last but not the least have Created oof for all the models and do ensembling by training decision classifier model. 15 | 16 | I have started to work on this problem from last 4 days only.link to the challenge problem is below: 17 | https://datahack.analyticsvidhya.com/contest/practice-problem-intel-scene-classification-challe/ 18 | 19 | Dataset is available in following link: 20 | https://www.kaggle.com/dipam7/intel-data-scene 21 | -------------------------------------------------------------------------------- /Quartic/Readme.md: -------------------------------------------------------------------------------- 1 | # Quartic Machine Learning Challenge 2 | 3 | ## Objective 4 | Have to build the most accurate model which can predict target column for data_test.csv. 5 | 6 | 1. The column details are below: 7 | > * id: id column for data_train, data_test, respectively 8 | > * num*: numerical features 9 | > * der*: derived features from other features 10 | > * cat*: categorical features 11 | > * target: target column, only exists in data_train. it is binary. 12 | 2. There are potentially missing values in each column. 
The goal is to predict the target column for data_test.csv. The solution should produce a result CSV file with two columns: 13 | > * 'id': the id column from data_test.csv 14 | > * 'target': the predicted probability of the target being 1 15 | 16 | ## Stack used 17 | - Sklearn 18 | - xgboost 19 | - lightgbm 20 | - eli5 21 | - Keras 22 | - Tensorflow 23 | - bayes_opt 24 | 25 | ## Approach Discussion 26 | Q.1:- Briefly describe the conceptual approach you chose! What are the trade-offs? 27 | > The first part of the chosen solution is a gradient boosting ensemble, because the decision boundary between the two classes is not linearly separable and only a non-linear model performs well here. A VotingClassifier built from two different gradient boosting packages (XGBoost, LightGBM) forms this first part. 28 | The second part is the same VotingClassifier with XGBoost and LightGBM trained on the representation produced by a Denoising Autoencoder, chosen because the train and test distributions are the same and the test set has more records than the train set. 29 | The final solution is a weighted average of both models, where the weights were decided from the cross-validation scores on the train set. 30 | > - Trade-offs: only gradient boosting models were used in the final submission; this causes higher correlation between the models in the ensemble, which can lead to overfitting the train set. 31 | 32 | Q.2:- What's the model performance? What is the complexity? Where are the bottlenecks? 33 | > ROC AUC was chosen to measure model performance; accuracy cannot be used because of the unbalanced class counts in the dataset. The first model of the ensemble scores 64% ROC AUC on 3-fold CV, the second model scores 63%, and the weighted average lifts the score to 65%. 34 | > - Because the ensemble combines 4 different predictive models, it is difficult to interpret the model's behaviour, and the runtime is higher than for a single model. 35 | 36 | 37 | Q.3:- If you had more time, what improvements would you make, and in what order of priority? 38 | > 1. More feature generation through feature interactions, removing unhelpful features and keeping the useful ones. 39 | > 2. Data imputation using predictive modelling instead of the mode. Features like num18 have high feature importance but many null values, so filling these nulls carefully could increase accuracy. 40 | > 3. Tune the XGBoost and DAE models over a different parameter space. Running the DAE network for more epochs or changing its architecture could also help; currently the network is trained for 10 epochs. 41 | > 4. Two-level stacking with different and diverse models. 42 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Machine Learning Challenges Solutions 2 | 3 | Here are some machine learning challenge solutions and the approaches used to solve them. 4 | 5 | - Accenture Digital Hack-up Challenge: In this challenge I built a model that predicts the score of a comment based on the parent comment to which sarcastic comments are made and the reply to that parent comment. 6 | - Affine Analytics Challenge: In this challenge I built a property recommendation system, which recommends a finite number of properties to new accounts. 7 | - BrainWaves Challenge: Here I built a model which predicts the return for the portfolio.
8 | - Capgemini Data Science Challenge: I built a predictive demand model which can forecast the demand for the next two months. 9 | - Cavoo Computer Vision Challenge: I built a ResNet classification model which classifies the type of clothes in the images. 10 | - Enigma IIT-BHU Challenge: Solved the problem of predicting the number of upvotes on the questions posted by users. 11 | - Euristica 2018 Challenge: Here I created a model which predicts whether a day is good or bad for paragliding. 12 | - Euristica 2019 Challenge: The model built for this predicts the average damage inflicted in a PUBG game by an online player. 13 | - Expedia Challenge: In this challenge I built a model which predicts the number of minutes a flight was delayed based on the given features. 14 | - Innoplexus Challenge: I created a Named Entity Recognition model which identifies named entities in the text dataset. 15 | - Innoplexus Document Referencing Challenge: Here the model predicts the citations for papers from a corpus of research papers. 16 | - Intel Scene Classification Challenge: Here I used pretrained ResNet models to classify images into 5 different scenes. 17 | - Quartic Challenge: In this I created a binary classifier which ensembles several models to predict the target variable. 18 | - WNS analyticsvidhya Challenge: In this I created an employee churn prediction model. 19 | -------------------------------------------------------------------------------- /Wns/Graphs/Pca_initial.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/Wns/Graphs/Pca_initial.png -------------------------------------------------------------------------------- /Wns/Graphs/correlation_matrix.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/Wns/Graphs/correlation_matrix.png -------------------------------------------------------------------------------- /Wns/Graphs/pca_Later.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/Wns/Graphs/pca_Later.png -------------------------------------------------------------------------------- /Wns/Graphs/previous_year_rating.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/Wns/Graphs/previous_year_rating.png -------------------------------------------------------------------------------- /Wns/ReadMe.md: -------------------------------------------------------------------------------- 1 | # WNS Machine Learning Challenge 2 | Finished 7th on the public leaderboard and 72nd on the private leaderboard. 3 | 4 | ## Things Tried 5 | - Binning of the age feature. 6 | - Apply mean encoding to text-valued features with frequency greater than 3. 7 | - Avg points * no. of comp to get the total points. 8 | - Take a weighted sum of some of the performance features to get a single performance score. 9 | - Count of 0/1 or yes/no values in the row. 10 | - Mean or median encoding instead of label encoding. 11 | - Normalize or scale, then check that the distributions stay the same.
12 | - Use data imputation techniques (mean, median, predictive-model imputation) for the missing values in education & previous_year_rating. 13 | - Add the awards_won, KPIs_met & previous_year_rating features; multiply avg_training_score and no_of_trainings to get a total training score. 14 | - Convert education into numbers where M.Tech > B.Tech > other. 15 | - Remove recruitment_channel, which has no effect on the target. 16 | - (age - length_of_service) for getting the joining age. 17 | 18 | ## Final Solution Summary 19 | - The missing values in education are imputed with the mode ("Bachelor's"), and the missing values in previous_year_rating are imputed with the mode as well as with predictive modeling. 20 | After correlation-matrix analysis, features like previous_year_rating, length_of_service and KPIs_met have the highest correlation with the target value. 21 | - The count of promotion vs no promotion is unbalanced. 22 | - PCA on the train set and its target values shows overlapping decision boundaries which cannot be separated by linear models; for this type of overlapping target, decision trees work best. 23 | Different types of scaling on the input features give different distributions of the target values under PCA. 24 | 25 | ## Newly created Features 26 | - sum_performance = sum of the important factors for promotion (awards_won, KPIs_met & previous_year_rating). 27 | - Total training score = avg_training_score * no_of_trainings 28 | - recruitment_channel has no impact on promotion, so it was removed. 29 | - Apply PCA to the input set to get a single column that summarizes the input features in one dimension, and use it as a new feature; this helped improve the score by 0.5 percent. 30 | 31 | ## Models 32 | - LightGBM model with parameters tuned by Bayesian optimization, trained with 8 stratified K-folds. 33 | - OOF predictions were used for finding the right threshold value (a minimal sketch of this thresholding is shown below, just before the Contributors section). 34 | - For the second layer of predictions, the average of the results from different LightGBM models trained on different input features was used. 35 | PCA on the first-layer predictions and target values shows a linearly separable boundary between the two target values, so logistic regression was used for ensembling the models. 36 | 37 | ## What didn't work 38 | 1. Linear models and neural networks gave low scores compared to decision trees. 39 | 2. Mean encoding of missing values and one-hot encoding of categorical values gave almost the same score on the leaderboard. 40 | 3. Blind addition, multiplication and division of features gave lower scores. 41 | 4. Additional feature creation with a variance threshold gave the same score. 42 | 5. Predicting missing values using data from train and test gave the same score on the leaderboard. 43 | 6. Ensembling didn't work because of highly correlated models in the first layer (only LightGBM was used for the first-layer predictions). 44 | 7. Ensembling with a VotingClassifier gave no improvement over the single model. 45 | 46 | ## Mistakes 47 | 1. Should have selected the ensembled prediction for the final submission instead of the single tuned LightGBM model. 48 | 2. Lack of diverse models in the first layer of stacking. 49 | 3. Choosing submissions based on the public leaderboard score leads to overfitting and a drop in position on the private leaderboard; a better local validation strategy is needed.
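Below is a minimal, self-contained sketch of the out-of-fold prediction and threshold search described in the Models section. It uses a synthetic imbalanced dataset from `make_classification` and default LightGBM parameters instead of the Bayesian-optimised ones, so it illustrates the procedure rather than reproducing the competition model.

```python
import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

# Synthetic stand-in for the WNS data: imbalanced binary target.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

oof = np.zeros(len(y))  # out-of-fold predicted probabilities
cv = StratifiedKFold(n_splits=8, shuffle=True, random_state=30)
for train_idx, valid_idx in cv.split(X, y):
    model = lgb.LGBMClassifier()  # the competition model was tuned with Bayesian optimisation
    model.fit(X[train_idx], y[train_idx])
    oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]

# Pick the probability cut-off that maximises F1 on the out-of-fold predictions.
thresholds = np.linspace(0.01, 0.99, 99)
scores = [f1_score(y, oof > t) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]
print(f"best threshold {best_threshold:.2f}, OOF F1 {max(scores):.3f}")
```

Tuning the cut-off on out-of-fold predictions keeps the test set (and the public leaderboard) out of the threshold decision, which is exactly the local-validation discipline the Mistakes section calls for.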
50 | 51 | # Contributors 52 | - [utsav aggarwal](https://github.com/utsav1) 53 | -------------------------------------------------------------------------------- /Wns/kernel.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "metadata": { 5 | "_cell_guid": "54a7bcb8-c4c2-4273-9d18-005245341701", 6 | "_uuid": "4e0bc0ade19424a63e01fc8c58b6b28b4ee0c0af", 7 | "trusted": true 8 | }, 9 | "cell_type": "code", 10 | "source": "%%time\n%matplotlib inline\nimport numpy as np\nimport os\nimport glob\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nimport pandas as pd\nimport random\nimport xgboost as xgb\nfrom sklearn.metrics import matthews_corrcoef\n\nfrom sklearn import preprocessing\nfrom sklearn.linear_model import LogisticRegression\nfrom xgboost import XGBRegressor\nimport lightgbm as lgb\nfrom lightgbm import LGBMRegressor\nfrom sklearn.metrics import accuracy_score\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.cross_validation import StratifiedKFold\nfrom sklearn.metrics import matthews_corrcoef, roc_auc_score\nfrom sklearn.grid_search import RandomizedSearchCV\nfrom catboost import CatBoostClassifier,CatBoostRegressor\n\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.preprocessing import OneHotEncoder,LabelEncoder\n\nfrom sklearn.svm import SVC\nfrom sklearn.svm import SVR\nfrom sklearn.ensemble import ExtraTreesClassifier,ExtraTreesRegressor\nfrom sklearn.feature_selection import VarianceThreshold\n\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.decomposition import PCA\nfrom sklearn.model_selection import KFold\nfrom sklearn.metrics import r2_score,mean_squared_error\nfrom math import sqrt\nfrom scipy import stats\nfrom scipy.stats import norm, skew #for some statistics\nfrom sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV\n\nfrom sklearn.metrics import matthews_corrcoef\nfrom sklearn.metrics import f1_score\nfrom sklearn.metrics import fbeta_score", 11 | "execution_count": null, 12 | "outputs": [] 13 | }, 14 | { 15 | "metadata": { 16 | "trusted": true, 17 | "_uuid": "91ea30387d05728a8a6937abc2978d83df7fd467" 18 | }, 19 | "cell_type": "code", 20 | "source": "from IPython.core.interactiveshell import InteractiveShell\nInteractiveShell.ast_node_interactivity = \"all\"", 21 | "execution_count": null, 22 | "outputs": [] 23 | }, 24 | { 25 | "metadata": { 26 | "_cell_guid": "f30e9e3a-5428-4389-a8fa-80efb1827d43", 27 | "_uuid": "bb9e5c3b09eb027ed4d74d5f5ca1988cf32d94aa", 28 | "trusted": true 29 | }, 30 | "cell_type": "code", 31 | "source": "train=pd.read_csv('../input/train_LZdllcl.csv')\ntest=pd.read_csv('../input/test_2umaH9m.csv')\nsub=pd.read_csv('../input/sample_submission_M0L0uXE.csv')", 32 | "execution_count": null, 33 | "outputs": [] 34 | }, 35 | { 36 | "metadata": { 37 | "trusted": true, 38 | "_uuid": "bebdeaaf55aab974f9300b94eff1b113ef2e2814" 39 | }, 40 | "cell_type": "code", 41 | "source": "train.describe()", 42 | "execution_count": null, 43 | "outputs": [] 44 | }, 45 | { 46 | "metadata": { 47 | "trusted": true, 48 | "_uuid": "70cc3689b0448b5ceecf2068a52839268db3455b" 49 | }, 50 | "cell_type": "code", 51 | "source": "train['education'].replace(np.nan,\"Bachelor's\",inplace=True)\ntest['education'].replace(np.nan,\"Bachelor's\",inplace=True)\n\ntrain['education'].replace(\"Master's & above\",3,inplace=True)\ntest['education'].replace(\"Master's & 
above\",3,inplace=True)\ntrain['education'].replace(\"Bachelor's\",2,inplace=True)\ntest['education'].replace(\"Bachelor's\",2,inplace=True)\ntrain['education'].replace(\"Below Secondary\",1,inplace=True)\ntest['education'].replace(\"Below Secondary\",1,inplace=True)", 52 | "execution_count": null, 53 | "outputs": [] 54 | }, 55 | { 56 | "metadata": { 57 | "trusted": true, 58 | "_uuid": "05bf8e919781b53fc77a94882b78b6df17e148df" 59 | }, 60 | "cell_type": "code", 61 | "source": "train['previous_year_rating'].replace(np.nan,3.,inplace=True)\ntest['previous_year_rating'].replace(np.nan,3.,inplace=True)", 62 | "execution_count": null, 63 | "outputs": [] 64 | }, 65 | { 66 | "metadata": { 67 | "trusted": true, 68 | "_uuid": "196321509b79dbccdd7aac77819e4b7fc9ae04ae" 69 | }, 70 | "cell_type": "code", 71 | "source": "train['sum_metric'] = train['awards_won?']+train['KPIs_met >80%'] + train['previous_year_rating']\ntest['sum_metric'] = test['awards_won?']+test['KPIs_met >80%'] + test['previous_year_rating']\n\ntrain['tot_score'] = train['avg_training_score'] * train['no_of_trainings']\ntest['tot_score'] = test['avg_training_score'] * test['no_of_trainings']", 72 | "execution_count": null, 73 | "outputs": [] 74 | }, 75 | { 76 | "metadata": { 77 | "trusted": true, 78 | "_uuid": "8e4c79b19e1dd4872c80c5bd65d85715d34440c2" 79 | }, 80 | "cell_type": "code", 81 | "source": "from sklearn import preprocessing\nle = preprocessing.LabelEncoder()\n\ntrain['department'] = le.fit_transform(train['department'])\ntest['department'] = le.transform(test['department'])\ntrain['region'] = le.fit_transform(train['region'])\ntest['region'] = le.transform(test['region'])\ntrain['education'] = le.fit_transform(train['education'])\ntest['education'] = le.transform(test['education'])\ntrain['gender'] = le.fit_transform(train['gender'])\ntest['gender'] = le.transform(test['gender'])\n\ntrain['recruitment_channel'] = le.fit_transform(train['recruitment_channel'])\ntest['recruitment_channel'] = le.transform(test['recruitment_channel'])", 82 | "execution_count": null, 83 | "outputs": [] 84 | }, 85 | { 86 | "metadata": { 87 | "trusted": true, 88 | "_uuid": "a460a28bef9d399311c22a7760d2555bd875da48" 89 | }, 90 | "cell_type": "code", 91 | "source": "Y1=train['is_promoted']\ntrain1=train.drop(['employee_id','is_promoted','recruitment_channel'],axis=1)\ntrain1=train1.values\nY=Y1.values\n\ntest_id=test['employee_id']\ntest1 = test.drop(['employee_id','recruitment_channel'],axis=1)\ntest1=test1.values", 92 | "execution_count": null, 93 | "outputs": [] 94 | }, 95 | { 96 | "metadata": { 97 | "trusted": true, 98 | "_uuid": "86c5f6a88ccdf41d1769763c7d55648213a0d28a" 99 | }, 100 | "cell_type": "code", 101 | "source": "scaler = StandardScaler()\nscaler.fit(train1)\ntrain2 = scaler.transform(train1)\ntest2 = scaler.transform(test1)", 102 | "execution_count": null, 103 | "outputs": [] 104 | }, 105 | { 106 | "metadata": { 107 | "trusted": true, 108 | "_uuid": "403aa2c49433fbab2eabccee9df2e698559a5c18" 109 | }, 110 | "cell_type": "code", 111 | "source": "pca = PCA(n_components=1)\npca.fit(train2)\ntrain_pca = pca.transform(train2)\ntest_pca = pca.transform(test2)\ntrain3=np.column_stack((train2,train_pca))\ntest3=np.column_stack((test2,test_pca))", 112 | "execution_count": null, 113 | "outputs": [] 114 | }, 115 | { 116 | "metadata": { 117 | "trusted": true, 118 | "_uuid": "53ea829d17e42c69a3603c142b3d684f2ef08eab" 119 | }, 120 | "cell_type": "code", 121 | "source": "#create the cross validation fold for different boosting and linear 
model.\nfrom sklearn.cross_validation import StratifiedKFold\nfrom sklearn.ensemble import RandomForestClassifier, VotingClassifier\nSEED=42\n# clf = lgb.LGBMClassifier()\nst_train = train3\nst_test = test3\n# clf = xgb.XGBClassifier()\n# Y=Y1\n# clf = SVC(probability=True)\n# clf = RandomForestClassifier(max_depth=4, random_state=0)\nclf = lgb.LGBMClassifier(max_depth= 8, learning_rate=0.0941, n_estimators=197, num_leaves= 17, reg_alpha=3.4492 , reg_lambda= 0.0422) #lgb_pca\n#clf = lgb.LGBMClassifier(max_depth= 8, learning_rate=0.0941, n_estimators=197, num_leaves= 17, reg_alpha=3.4492 , reg_lambda= 0.0422) #lgb_pca\n# clf=CatBoostClassifier()\n# clf = XGBClassifier()\n# clf = Ridge()\n\n# clf=ExtraTreesClassifier(n_estimators=10000, criterion='entropy', max_depth=9, min_samples_leaf=1, n_jobs=30, random_state=1)\n# clf = xgb.XGBClassifier(random_state=42,colsample_bytree = 0.9279,gamma = 0.6494,learning_rate = 0.1573,max_depth = 7,min_child_weight = 6,n_estimators = 70,subsample = 0.6404)\n# clf = RGFClassifier(max_leaf=500,algorithm=\"RGF\",test_interval=100, loss=\"LS\")\n# clf = LogisticRegression()\n# clf = LogisticRegression(class_weight ={1:4})\n\nclf1 = lgb.LGBMClassifier(max_depth= 8, learning_rate=0.0941, n_estimators=197, num_leaves= 17, reg_alpha=3.4492 , reg_lambda= 0.0422) #lgb_pca\n# clf2 = RGFClassifier(max_leaf=500,algorithm=\"RGF\",test_interval=100, loss=\"LS\")\nclf3 = xgb.XGBClassifier(random_state=42,colsample_bytree = 0.9279,gamma = 0.6494,learning_rate = 0.1573,max_depth = 7,min_child_weight = 6,n_estimators = 70,subsample = 0.6404)\n\n# clf = VotingClassifier(estimators=[('LR',clf2),('LGB',clf1),('LGB1',clf3)],voting='soft',\n# weights=[3,4,2])\n\nfold = 8\ncv = StratifiedKFold(Y, n_folds=fold,shuffle=True, random_state=30)\nX_preds = np.zeros(st_train.shape[0])\npreds = np.zeros(st_test.shape[0])\nfor i, (tr, ts) in enumerate(cv):\n print(ts.shape)\n mod = clf.fit(st_train[tr], Y[tr])\n X_preds[ts] = mod.predict_proba(st_train[ts])[:,1]\n preds += mod.predict_proba(st_test)[:,1]\n print(\"fold {}, ROC AUC: {:.3f}\".format(i, roc_auc_score(Y[ts], X_preds[ts])))\n predictions = [round(value) for value in X_preds[ts]]\n print(f1_score(Y[ts], predictions))\nscore = roc_auc_score(Y, X_preds)\nprint(score)\npreds1 = preds/fold\n", 122 | "execution_count": null, 123 | "outputs": [] 124 | }, 125 | { 126 | "metadata": { 127 | "trusted": true, 128 | "_uuid": "ac31da2bd41a0b35bfa7799909993767287a4010" 129 | }, 130 | "cell_type": "code", 131 | "source": "# pick the best threshold out-of-fold\nthresholds = np.linspace(0.01, 0.99, 50)\nmcc = np.array([f1_score(Y, X_preds>thr) for thr in thresholds])\nplt.plot(thresholds, mcc)\nbest_threshold = thresholds[mcc.argmax()]\nprint(mcc.max())\nprint(best_threshold)", 132 | "execution_count": null, 133 | "outputs": [] 134 | }, 135 | { 136 | "metadata": { 137 | "trusted": true, 138 | "_uuid": "2ae47b222329bfcecf8ed602c81bf59e1d9e9847" 139 | }, 140 | "cell_type": "code", 141 | "source": "##create the submission file.\nprediction_rfc=list(range(len(preds1)))\nfor i in range(len(preds1)):\n prediction_rfc[i]=1 if preds1[i]>best_threshold else 0\n\nsub = pd.DataFrame({'employee_id': test_id, 'is_promoted': prediction_rfc})\nsub=sub.reindex(columns=[\"employee_id\",\"is_promoted\"])\nsub.to_csv('submission.csv', index=False)", 142 | "execution_count": null, 143 | "outputs": [] 144 | }, 145 | { 146 | "metadata": { 147 | "trusted": true, 148 | "_uuid": "bbc7241d982a860013a3ca9d14cf6262e69a88f0" 149 | }, 150 | "cell_type": "code", 151 | 
"source": "#lightgbm bayesian optimization\nfrom sklearn.cross_validation import cross_val_score\nfrom bayes_opt import BayesianOptimization\n\ndef xgboostcv(max_depth,learning_rate,n_estimators,num_leaves,reg_alpha,reg_lambda):\n cv = StratifiedKFold(Y, n_folds=8,shuffle=True, random_state=30)\n return cross_val_score(lgb.LGBMClassifier(max_depth=int(max_depth),learning_rate=learning_rate,n_estimators=int(n_estimators),\n silent=True,nthread=-1,num_leaves=int(num_leaves),reg_alpha=reg_alpha,\n reg_lambda=reg_lambda),\n train3,Y,\"roc_auc\",cv=cv).mean()\n\nxgboostBO = BayesianOptimization(xgboostcv,{'max_depth': (4, 10),'learning_rate': (0.001, 0.1),'n_estimators': (10, 1000),\n 'num_leaves': (4,30),'reg_alpha': (1, 5),'reg_lambda': (0, 0.1)})\n\nxgboostBO.maximize()\nprint('-'*53)\nprint('Final Results')\nprint('XGBOOST: %f' % xgboostBO.res['max']['max_val'])", 152 | "execution_count": null, 153 | "outputs": [] 154 | }, 155 | { 156 | "metadata": { 157 | "trusted": true, 158 | "_uuid": "fd489a91afc2489d8c88f857e45db8ff89e793fe" 159 | }, 160 | "cell_type": "markdown", 161 | "source": "#xgboost bayesian optimization\nfrom sklearn.cross_validation import cross_val_score\nfrom bayes_opt import BayesianOptimization\n\ndef xgboostcv(max_depth,learning_rate,n_estimators,gamma,min_child_weight):\n cv = StratifiedKFold(Y, n_folds=8,shuffle=True, random_state=42)\n return cross_val_score(xgb.XGBClassifier(max_depth=int(max_depth),learning_rate=learning_rate,n_estimators=int(n_estimators),\n silent=True,nthread=-1,gamma=gamma,min_child_weight=min_child_weight),\n train1,Y,\"f1\",cv=8).mean()\n\nxgboostBO = BayesianOptimization(xgboostcv,{'max_depth': (4, 10),'learning_rate': (0.001, 0.3),'n_estimators': (50, 1000),\n 'gamma': (0.01,1.0),'min_child_weight': (2, 10)})\n\nxgboostBO.maximize()\nprint('-'*53)\nprint('Final Results')\nprint('XGBOOST: %f' % xgboostBO.res['max']['max_val'])" 162 | }, 163 | { 164 | "metadata": { 165 | "trusted": true, 166 | "_uuid": "8b5fbbf86a39e0971db31f2eb4a161c92300c212" 167 | }, 168 | "cell_type": "code", 169 | "source": "", 170 | "execution_count": null, 171 | "outputs": [] 172 | } 173 | ], 174 | "metadata": { 175 | "kernelspec": { 176 | "display_name": "Python 3", 177 | "language": "python", 178 | "name": "python3" 179 | }, 180 | "language_info": { 181 | "name": "python", 182 | "version": "3.6.6", 183 | "mimetype": "text/x-python", 184 | "codemirror_mode": { 185 | "name": "ipython", 186 | "version": 3 187 | }, 188 | "pygments_lexer": "ipython3", 189 | "nbconvert_exporter": "python", 190 | "file_extension": ".py" 191 | } 192 | }, 193 | "nbformat": 4, 194 | "nbformat_minor": 1 195 | } -------------------------------------------------------------------------------- /Wns/wns_plan.txt: -------------------------------------------------------------------------------- 1 | 1. Need to do binning of the age features. 2 | 2. Apply the mean encoding on the feature having text value and frequency more than 3. 3 | 3. Avg points * no. of comp to get the total points. 4 | 4. need to weighted sum some of the performance features to get one score for performance. 5 | 5. count of 0/1 or yes/no in the row. 6 | 6. mean encoding or median encoding instead of label. 7 | 7. normalize and then check the distribution it should be same. 8 | 8. Use the data imputation for missing values in (education & previous_year_ratings.) 9 | 10 | 9. Add the (awards_won;KpIs_met & previous_year_rating) features, 11 | multiply the avg_training_score and no_of_training to get total training score. 12 | 13 | 10. 
convert education into numbers where mtech > btech > +2. 14 | 15 | 11. Remove the recruitment_channel or gender feature and then check the accuracy. 16 | 17 | 12. (age - length_of_service) for getting the joining age. 18 | 19 | 13. combine region and department into a new region_depart feature. 20 | -------------------------------------------------------------------------------- /ericsson_2019/Readme.md: -------------------------------------------------------------------------------- 1 | # Ericsson-Challenge-Solution 2 | 3 | A single submission was made for both problems. Link to the problem statement: https://www.hackerearth.com/challenges/hiring/ericsson-ml-challenge-2019/instructions/ 4 | 5 | Approach- 6 | 7 | Simple text cleaning, then vectors were built with TF-IDF and CountVectorizer; finally, LightGBM and Naive Bayes models were applied to get the predictions. 8 | -------------------------------------------------------------------------------- /ericsson_2019/data/NLP_Datac2476d7.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/ericsson_2019/data/NLP_Datac2476d7.zip -------------------------------------------------------------------------------- /ericsson_2019/data/Predictive_Data32f5357.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucky630/ML-Challenges/9b639d8cff8cd4d26f95e6423c30ebc8ba524cd5/ericsson_2019/data/Predictive_Data32f5357.zip -------------------------------------------------------------------------------- /ericsson_2019/ericsson_HE.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"metadata":{"trusted":true},"cell_type":"code","source":"%%time\n%matplotlib inline\nimport warnings\nwarnings.filterwarnings('ignore')\n\nimport numpy as np\nimport os\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nimport pandas as pd\nimport random\nimport xgboost as xgb\n\nfrom sklearn import preprocessing\nimport lightgbm as lgb\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.preprocessing import OneHotEncoder\n\nfrom sklearn.preprocessing import StandardScaler,MinMaxScaler\nfrom sklearn.decomposition import PCA\nfrom math import sqrt\nfrom scipy import stats\nfrom scipy.stats import norm, skew #for some statistics","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.metrics import matthews_corrcoef\nfrom sklearn.metrics import f1_score\nfrom sklearn.metrics import fbeta_score\nfrom sklearn.metrics import recall_score\nfrom sklearn.metrics import roc_curve, auc, roc_auc_score, average_precision_score\nfrom sklearn.metrics import classification_report\nfrom sklearn.metrics import accuracy_score\nfrom sklearn.model_selection import train_test_split","execution_count":null,"outputs":[]},{"metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"cell_type":"code","source":"import numpy as np # linear algebra\nimport pandas as pd # data processing, CSV file I/O (e.g. 
pd.read_csv)\nimport os\nprint(os.listdir(\"../input\"))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from IPython.core.interactiveshell import InteractiveShell\nfrom tqdm import tqdm_notebook\nInteractiveShell.ast_node_interactivity = \"all\"","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"collapsed":true},"cell_type":"code","source":"# download flair library #\nimport torch\n!pip install flair\nimport flair","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"79c7e3d0-c299-4dcb-8224-4455121ee9b0","_uuid":"d629ff2d2480ee46fbb7e2d37f6b5fab8052498a","trusted":true},"cell_type":"code","source":"os.listdir('../input/nlp_datac2476d7/NLP_Data')","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train = pd.read_csv('../input/predictive_data32f5357/Predictive_Data/train_file.csv')\ntest = pd.read_csv('../input/predictive_data32f5357/Predictive_Data/test_file.csv')\nsubm = pd.read_csv('../input/predictive_data32f5357/Predictive_Data/sample_submission.csv')","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train = train.drop(['UsageClass','CheckoutType','CheckoutYear','CheckoutMonth'],axis=1)\ntest = test.drop(['UsageClass','CheckoutType','CheckoutYear','CheckoutMonth'],axis=1)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train.head()\ntest.head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train.shape\ntest.shape","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"print('class count in numbers: ')\ntrain['MaterialType'].value_counts()\nprint('percentage of class count : ')\ntrain['MaterialType'].value_counts()/train.shape[0] * 100","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# Number of NaNs in each rows\ntrain.isnull().sum(axis=1).head(5)\ntrain.isnull().sum(axis=0)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"test.isnull().sum(axis=0)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train.nunique()\ntest.nunique()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"set(train['PublicationYear'])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train['PublicationYear'].replace(np.nan,\"0000\",inplace=True)\ntest['PublicationYear'].replace(np.nan,\"0000\",inplace=True)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train['Subjects'].replace(np.nan,\"\",inplace=True)\ntest['Subjects'].replace(np.nan,\"\",inplace=True)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"import re\ndef matc(a):\n return re.findall(r\"[0-9]{4}\",a)\ntrain['PublicationYear'] = train['PublicationYear'].apply(lambda x: matc(x))\ntest['PublicationYear'] = test['PublicationYear'].apply(lambda x: matc(x))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train['combined'] = train['Title'] + ' . ' + train['Subjects'] + ' . ' + train['Creator'] + ' . ' + train['Publisher']\ntest['combined'] = test['Title'] + ' . ' + test['Subjects'] + ' . ' + test['Creator'] + ' . 
' + test['Publisher']","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"def ext(a):\n #print(a)\n if len(a)==0:\n return 0\n #print(int(a[0]))\n return int(a[0])\ntrain['PublicationYear'] = train['PublicationYear'].apply(lambda x:ext(x))\ntest['PublicationYear'] = test['PublicationYear'].apply(lambda x:ext(x))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# set(train['MaterialType'])\n\ntrain['MaterialType'].replace(\"BOOK\",0,inplace=True)\ntrain['MaterialType'].replace(\"CR\",1,inplace=True)\ntrain['MaterialType'].replace(\"MIXED\",2,inplace=True)\ntrain['MaterialType'].replace(\"MUSIC\",3,inplace=True)\ntrain['MaterialType'].replace(\"SOUNDCASS\",4,inplace=True)\ntrain['MaterialType'].replace(\"SOUNDDISC\",5,inplace=True)\ntrain['MaterialType'].replace(\"VIDEOCASS\",6,inplace=True)\ntrain['MaterialType'].replace(\"VIDEODISC\",7,inplace=True)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train['Publisher'].replace(np.nan,\"\",inplace=True)\ntest['Publisher'].replace(np.nan,\"\",inplace=True)\n\ntrain['Creator'].replace(np.nan,\"\",inplace=True)\ntest['Creator'].replace(np.nan,\"\",inplace=True)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Tfidf"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.feature_extraction.text import TfidfVectorizer\ntfidf_vectorizer = TfidfVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')\n# TF-IDF feature matrix\ntrain1 = tfidf_vectorizer.fit_transform(train['combined'])\ntest1 = tfidf_vectorizer.fit_transform(test['combined'])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train['combined'].shape\ntest['combined'].shape\n\nnp.hstack([train['combined'].values,test['combined'].values]).shape","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.feature_extraction.text import CountVectorizer\n# create the transform\nvectorizer = CountVectorizer(stop_words=\"english\", analyzer='word', \n ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None)\n# tokenize and build vocab\nvectorizer.fit(np.hstack([train['combined'].values,test['combined'].values]))\ntrain2 = vectorizer.transform(train['combined'])\ntest2 = vectorizer.transform(test['combined'])\nprint(train2.shape)\nprint(type(train2))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train2\ntest2","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"csr_matrix([train['Checkouts']]).T","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from scipy.sparse import hstack, coo_matrix, csr_matrix,vstack\ntrain3 = hstack((train1, train2))\n\ntest3 = hstack((test1, test2))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train3\ntest3","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.model_selection import train_test_split\nfrom sklearn.metrics import f1_score\n\n# splitting data into training and validation set\nxtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train3, train['MaterialType'], random_state=42, test_size=0.2)\n\n# lreg = LogisticRegression()\nlreg = lgb.LGBMRegressor()\nlreg.fit(xtrain_bow, ytrain) # training the 
model\n\nprediction = lreg.predict(xvalid_bow) # predicting on the validation set\n# prediction_int = prediction[:,1] >= 0.3 # if prediction is greater than or equal to 0.3 than 1 else 0\nprediction_int = prediction.astype(np.int)\n\nf1_score(yvalid, prediction_int,average='micro') # calculating f1 score\n\ntest_pred = lreg.predict(test3)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Logistic"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import f1_score\n\n\n# splitting data into training and validation set\nxtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train3, train['MaterialType'], random_state=42, test_size=0.2)\n\nlreg = LogisticRegression()\n# lreg = lgb.LGBMRegressor()\nlreg.fit(xtrain_bow, ytrain) # training the model\n\nprediction = lreg.predict_proba(xvalid_bow) # predicting on the validation set\nprediction_int = prediction[:,1] >= 0.3 # if prediction is greater than or equal to 0.3 than 1 else 0\nprediction_int = prediction_int.astype(np.int)\n\nf1_score(yvalid, prediction_int,average='micro') # calculating f1 score\n\ntest_pred = lreg.predict_proba(test3)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.naive_bayes import MultinomialNB\nnb = MultinomialNB(alpha=0.8)\nnb.fit(xtrain_bow, ytrain)\n\npreds = nb.predict(xvalid_bow)\nf1_score(yvalid, preds,average='micro') # calculating f1 score\n\ntest_pred = nb.predict(test3)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"set(test_pred.astype(np.int))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# set(prediction_int)\n# max(prediction[:,1])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"subfin = pd.DataFrame({'ID': test['ID'].values, 'MaterialType': test_pred})\nsubfin=subfin.reindex(columns=[\"ID\",\"MaterialType\"])\nsubfin.to_csv('submission.csv', index=False)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"subfin.head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"subfin['MaterialType'].replace(0,\"BOOK\",inplace=True)\nsubfin['MaterialType'].replace(1,\"CR\",inplace=True)\nsubfin['MaterialType'].replace(2,\"MIXED\",inplace=True)\nsubfin['MaterialType'].replace(3,\"MUSIC\",inplace=True)\nsubfin['MaterialType'].replace(4,\"SOUNDCASS\",inplace=True)\nsubfin['MaterialType'].replace(5,\"SOUNDDISC\",inplace=True)\nsubfin['MaterialType'].replace(6,\"VIDEOCASS\",inplace=True)\nsubfin['MaterialType'].replace(7,\"VIDEODISC\",inplace=True)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"subfin.to_csv('submission.csv', index=False)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# import the modules we'll need\nfrom IPython.display import HTML\nimport pandas as pd\nimport numpy as np\nimport base64\n\n# function that takes in a dataframe and creates a text link to \n# download it (will only work for files < 2MB or so)\ndef create_download_link(df, title = \"Download CSV file\", filename = \"submission.csv\"): \n csv = df.to_csv(index=False)\n b64 = base64.b64encode(csv.encode())\n payload = b64.decode()\n html = '{title}'\n html = 
html.format(payload=payload,title=title,filename=filename)\n return HTML(html)\n\n# create a link to download the dataframe\ncreate_download_link(subfin)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Flair"},{"metadata":{"trusted":true,"collapsed":true},"cell_type":"code","source":"from sklearn import preprocessing\nle = preprocessing.LabelEncoder()\n\ntrain['Publisher'] = le.fit_transform(train['Publisher'])\ntest['Publisher'] = le.transform(test['Publisher'])\n\ntrain['Creator'] = le.fit_transform(train['Creator'])\ntest['Creator'] = le.transform(test['Creator'])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"collapsed":true},"cell_type":"code","source":"set(train['Creator']) - set(test['Creator'])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from flair.data import Sentence\n# create a sentence #\nsentence = Sentence('Awesomeness come from the core.')\n# print the sentence to see what’s in it. #\nprint(Sentence.get_embedding)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#extracting the tweet part#\ntext = train['combined'] \n ## txt is a list of tweets ##\ntxt = text.tolist()\nprint(txt[:10])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"## Importing the Embeddings ##\nfrom flair.embeddings import WordEmbeddings\nfrom flair.embeddings import CharacterEmbeddings\nfrom flair.embeddings import StackedEmbeddings\nfrom flair.embeddings import FlairEmbeddings\nfrom flair.embeddings import BertEmbeddings\nfrom flair.embeddings import ELMoEmbeddings\nfrom flair.embeddings import FlairEmbeddings\n\n### Initialising embeddings (un-comment to use others) ###\n#glove_embedding = WordEmbeddings('glove')\n#character_embeddings = CharacterEmbeddings()\nflair_forward = FlairEmbeddings('news-forward-fast')\nflair_backward = FlairEmbeddings('news-backward-fast')\n#bert_embedding = BertEmbedding()\n#elmo_embedding = ElmoEmbedding()\n\nstacked_embeddings = StackedEmbeddings( embeddings = [ \n flair_forward, \n flair_backward\n ])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# create a sentence #\nsentence = Sentence(\"Awesomeness come from the core.\")\n# embed words in sentence #\nstacked_embeddings.embed(sentence)\nfor token in sentence:\n print(token.embedding)\n# data type and size of embedding #\nprint(type(token.embedding))\n# storing size (length) #\nz = token.embedding.size()[0]","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from flair.embeddings import DocumentPoolEmbeddings\nfrom tqdm import tqdm\n### initialize the document embeddings, mode = mean ###\ndocument_embeddings = DocumentPoolEmbeddings([flair_backward, flair_forward ])\ns = torch.zeros(0,z)\n# iterating Sentences #\nfor tweet in tqdm(txt):\n #print(tweet)\n sentence = Sentence(tweet)\n document_embeddings.embed(sentence)\n # Storing Size of embedding #\n z = sentence.embedding.size()[0]\n # Adding Document embeddings to list #\n s = torch.cat((s, sentence.embedding.view(-1,z)),0)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# s = torch.cat((s,torch.zeros(2048).view(-1,2048)),0)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"## tensor to numpy array ##\nX = 
s.numpy()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"np.save('train_embed',np.column_stack((X, train['Checkouts'],train['PublicationYear'])))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"#extracting the tweet part#\ntext = test['combined'] \ntxt = text.tolist()\n\nfrom flair.embeddings import WordEmbeddings\nfrom flair.embeddings import CharacterEmbeddings\nfrom flair.embeddings import StackedEmbeddings\nfrom flair.embeddings import FlairEmbeddings\nfrom flair.embeddings import BertEmbeddings\nfrom flair.embeddings import ELMoEmbeddings\nfrom flair.embeddings import FlairEmbeddings\n\nflair_forward = FlairEmbeddings('news-forward-fast')\nflair_backward = FlairEmbeddings('news-backward-fast')\n\nstacked_embeddings = StackedEmbeddings( embeddings = [ flair_forward, flair_backward])\n\n# create a sentence #\nsentence = Sentence(\"Analytics Vidhya blogs are Awesome.\")\n# embed words in sentence #\nstacked_embeddings.embed(sentence)\nfor token in sentence:\n print(token.embedding)\n# storing size (length) #\nz = token.embedding.size()[0]\n\nfrom flair.embeddings import DocumentPoolEmbeddings\nfrom tqdm import tqdm\n### initialize the document embeddings, mode = mean ###\ndocument_embeddings = DocumentPoolEmbeddings([flair_backward,flair_forward])\ns = torch.zeros(0,z)\n# iterating Sentences #\nfor tweet in tqdm(txt):\n sentence = Sentence(tweet)\n document_embeddings.embed(sentence)\n # Storing Size of embedding #\n z = sentence.embedding.size()[0]\n # Adding Document embeddings to list #\n s = torch.cat((s, sentence.embedding.view(-1,z)),0)\nX1 = s.numpy()\nnp.save('test_embed',np.column_stack((X1, test['Checkouts'],test['PublicationYear'])))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train1 = np.column_stack((X, train['Checkouts'],train['PublicationYear']))\ntest1 = np.column_stack((X1, test['Checkouts'],test['PublicationYear']))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"target = train['MaterialType'].values","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"target","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"import xgboost as xgb\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import f1_score\n\n### Splitting training set ###\nx_train, x_valid, y_train, y_valid = train_test_split(train1, target,random_state=42,test_size=0.2)\n\n### XGBoost compatible data ###\ndtrain = xgb.DMatrix(x_train,y_train) \ndvalid = xgb.DMatrix(x_valid, label = y_valid)\n\n### defining parameters ###\nparams = {\n 'colsample': 0.9,\n 'colsample_bytree': 0.5,\n 'eta': 0.1,\n 'max_depth': 8,\n 'min_child_weight': 6,\n 'objective': 'multi:softmax',\n 'subsample': 0.9,\n 'num_class':8\n }\n\n### Training the model ###\nxgb_model = xgb.train(\n params,\n dtrain,\n feval= custom_eval,\n num_boost_round= 1000,\n maximize=True,\n evals=[(dvalid, \"Validation\")],\n early_stopping_rounds=30\n )","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# params = {'task': 'train',\n# 'boosting_type': 'gbdt',\n# 'objective': 'multiclass',\n# 'num_class':8,\n# 'metric': 'custom_eval',\n# 'learning_rate': 0.002296,\n# 'max_depth': 7,\n# 'num_leaves': 17,\n# 'metric': 
['multi_error'],\n# 'feature_fraction': 0.4,\n# 'bagging_fraction': 0.6,\n# 'bagging_freq': 17}\n\nimport lightgbm as lgb\n### Splitting training set ###\nx_train, x_valid, y_train, y_valid = train_test_split(train1, target,random_state=42,test_size=0.2)\nprint(x_train.shape)\nd_train = lgb.Dataset(x_train, label=y_train)\nparams = {}\n# params['learning_rate'] = 0.003\nparams['boosting_type'] = 'gbdt'\nparams['objective'] = 'multiclass'\nparams['num_class'] = 8\nparams['metric'] = ['multi_error']\n# params['sub_feature'] = 0.5\n# params['num_leaves'] = 10\n# params['min_data'] = 50\n# params['max_depth'] = 10\nparams['silent'] = 2\nclf = lgb.train(params, d_train, 2000)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"'done'","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"y_pred=clf.predict(test1)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# y_pred.shape\nset(np.argmax(y_pred,axis=1))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"def custom_eval(preds, dtrain):\n labels = dtrain.get_label().astype(np.int)\n preds = (preds >= 0.3).astype(np.int)\n return [('f1_score', f1_score(labels, preds, average=None))]","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# import tensorflow_hub as hub\n# import tensorflow as tf\n\n# elmo = hub.Module(\"https://tfhub.dev/google/elmo/2\", trainable=True)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# # just a random sentence\n# x = [\"Roasted ants are a popular snack in Columbia\"]\n\n# # Extract ELMo features \n# embeddings = elmo(x, signature=\"default\", as_dict=True)[\"elmo\"]\n\n# embeddings.shape","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# embeddings.shape","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# def elmo_vectors(x):\n# embeddings = elmo(x.tolist(), signature=\"default\", as_dict=True)[\"elmo\"]\n\n# with tf.Session() as sess:\n# sess.run(tf.global_variables_initializer())\n# sess.run(tf.tables_initializer())\n# # return average of ELMo features\n# return sess.run(tf.reduce_mean(embeddings,1))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# list_train = [train[i:i+100] for i in range(0,train.shape[0],100)]\n# list_test = [test[i:i+100] for i in range(0,test.shape[0],100)]","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# %%time\n# # Extract ELMo embeddings\n# elmo_train = [elmo_vectors(x['Subjects']) for x in list_train]\n# elmo_test = [elmo_vectors(x['Subjects']) for x in list_test]","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"markdown","source":"### 2nd Problem"},{"metadata":{"trusted":true},"cell_type":"code","source":"nl_train = pd.read_csv('../input/nlp_datac2476d7/NLP_Data/train.csv')\nnl_test = pd.read_csv('../input/nlp_datac2476d7/NLP_Data/test.csv')\nnl_subm = 
pd.read_csv('../input/nlp_datac2476d7/NLP_Data/sample_submission.csv')","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"nl_train.head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"nl_test.index[nl_test['date']==' Jan 0, 0000']","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### changing the wrong date values to default date"},{"metadata":{"trusted":true},"cell_type":"code","source":"nl_test['date'].replace('None',\" Jan 1, 2000\",inplace=True)\nnl_test['date'].replace(' Nov 0, 0000',\" Jan 1, 2000\",inplace=True)\nnl_test['date'].replace(' Jan 0, 0000',\" Jan 1, 2000\",inplace=True)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### create features like day,month & year from the date."},{"metadata":{"trusted":true},"cell_type":"code","source":"from datetime import datetime\ndef convert_date(d):\n return datetime.strptime(d,' %b %d, %Y')\n# datetime.strptime(nl_train['date'][0], ' %b %d, %Y').date().day\nnl_train['day'] = nl_train['date'].apply(lambda x: convert_date(x).date().day)\nnl_train['month'] = nl_train['date'].apply(lambda x: convert_date(x).date().month)\nnl_train['year'] = nl_train['date'].apply(lambda x: convert_date(x).date().year)\n\nnl_test['day'] = nl_test['date'].apply(lambda x: convert_date(x).date().day)\nnl_test['month'] = nl_test['date'].apply(lambda x: convert_date(x).date().month)\nnl_test['year'] = nl_test['date'].apply(lambda x: convert_date(x).date().year)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"collapsed":true},"cell_type":"code","source":"pd.to_datetime(nl_train.date.str, format='%b %d,%Y', yearfirst=False)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### finding the frequency of target classes"},{"metadata":{"trusted":true,"collapsed":true},"cell_type":"code","source":"print('class count in numbers: ')\nnl_train['overall'].value_counts()\nprint('percentage of class count : ')\nnl_train['overall'].value_counts()/train.shape[0] * 100","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"nl_train.nunique()\nnl_test.nunique()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# Number of NaNs in each rows\nnl_train.isnull().sum(axis=0)\nnl_test.isnull().sum(axis=0)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### label encode the text fields"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.preprocessing import LabelEncoder\nle = LabelEncoder()\nle.fit(nl_train['Place'])\nnl_train['Place'] = le.transform(nl_train['Place'])\nnl_test['Place'] = le.transform(nl_test['Place'])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"le = LabelEncoder()\nle.fit(nl_train['status'])\nnl_train['status'] = le.transform(nl_train['status'])\nnl_test['status'] = 
le.transform(nl_test['status'])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"nl_test['negatives'].replace(np.nan,\"\",inplace=True)\n\nnl_train['advice_to_mgmt'].replace(np.nan,\"\",inplace=True)\nnl_test['advice_to_mgmt'].replace(np.nan,\"\",inplace=True)\n\nnl_train['summary'].replace(np.nan,\"\",inplace=True)\nnl_test['summary'].replace(np.nan,\"\",inplace=True)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### combine all the text column into one"},{"metadata":{"trusted":true},"cell_type":"code","source":"nl_train['combined'] = nl_train['summary'] + ' . ' + nl_train['positives'] + ' . ' + nl_train['negatives'] + ' . ' + nl_train['advice_to_mgmt']\nnl_test['combined'] = nl_test['summary'] + ' . ' + nl_test['positives'] + ' . ' + nl_test['negatives'] + ' . ' + nl_test['advice_to_mgmt']","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.feature_extraction.text import TfidfVectorizer\ntfidf_vectorizer = TfidfVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')\n# TF-IDF feature matrix\ntrain1 = tfidf_vectorizer.fit_transform(nl_train['combined'])\ntest1 = tfidf_vectorizer.fit_transform(nl_test['combined'])\n\nfrom sklearn.feature_extraction.text import CountVectorizer\n# create the transform\nvectorizer = CountVectorizer(stop_words=\"english\", analyzer='word', \n ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None)\n# tokenize and build vocab\nvectorizer.fit(np.hstack([nl_train['combined'].values,nl_test['combined'].values]))\nprint(train2.shape)\nprint(type(train2))\ntrain2 = vectorizer.transform(nl_train['combined'])\ntest2 = vectorizer.transform(nl_test['combined'])\n\nfrom scipy.sparse import hstack, csr_matrix,vstack\ntrain3 = hstack((train1, train2))\ntest3 = hstack((test1, test2))\n\ntrain3\ntest3","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### naive Bayes"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.naive_bayes import MultinomialNB\nnb = MultinomialNB(alpha=0.8)\n# splitting data into training and validation set\nxtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train3, nl_train['overall'], random_state=42, test_size=0.2)\n\nnb.fit(xtrain_bow, ytrain)\n\npreds = nb.predict(xvalid_bow)\nf1_score(yvalid, preds,average='micro') # calculating f1 score\ntest_pred = nb.predict(test3)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### create more features"},{"metadata":{"trusted":true},"cell_type":"code","source":"%%time\ndef get_polarity(txt):\n return TextBlob(txt).sentiment.polarity\nnl_train['summ_score'] = nl_train['summary'].apply(lambda x:get_polarity(x))\nnl_train['pos_score'] = nl_train['positives'].apply(lambda x:get_polarity(x))\nnl_train['neg_score'] = nl_train['negatives'].apply(lambda x:get_polarity(x))\nnl_train['adv_score'] = nl_train['advice_to_mgmt'].apply(lambda x:get_polarity(x))\n\nnl_test['summ_score'] = nl_test['summary'].apply(lambda x:get_polarity(x))\nnl_test['pos_score'] = nl_test['positives'].apply(lambda x:get_polarity(x))\nnl_test['neg_score'] = nl_test['negatives'].apply(lambda x:get_polarity(x))\nnl_test['adv_score'] = nl_test['advice_to_mgmt'].apply(lambda 
x:get_polarity(x))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"nl_train['score_1'].replace(np.nan,nl_train['score_1'].mean(),inplace=True)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"column = ['Place','status','score_1','score_2','score_3','score_4','score_5','score_6','day','month','year']\n#'summ_score','pos_score','neg_score','adv_score'","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"nl_train['score_1'].replace(np.nan,nl_train['score_1'].mean(),inplace=True)\nnl_train['score_2'].replace(np.nan,nl_train['score_2'].mean(),inplace=True)\nnl_train['score_3'].replace(np.nan,nl_train['score_3'].mean(),inplace=True)\nnl_train['score_4'].replace(np.nan,nl_train['score_4'].mean(),inplace=True)\nnl_train['score_5'].replace(np.nan,nl_train['score_5'].mean(),inplace=True)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.naive_bayes import MultinomialNB\nnb = MultinomialNB(alpha=0.8)\n\n# splitting data into training and validation set\nxtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(nl_train[column], nl_train['overall'], random_state=42, test_size=0.2)\n\nnb.fit(xtrain_bow, ytrain)\n\npreds = nb.predict(xvalid_bow)\nf1_score(yvalid, preds,average='micro') # calculating f1 score\ntest_pred = nb.predict(nl_test[column])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"collapsed":true},"cell_type":"code","source":"from sklearn.model_selection import train_test_split\nfrom sklearn.metrics import f1_score\n\n# splitting data into training and validation set\n# xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(nl_train[column], nl_train['overall'], random_state=42, test_size=0.2)\nxtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train3, nl_train['overall'], random_state=42, test_size=0.2)\n\n# lreg = LogisticRegression()\nlreg = lgb.LGBMRegressor()\nlreg.fit(xtrain_bow, ytrain) # training the model\n\nprediction = lreg.predict(xvalid_bow) # predicting on the validation set\n# prediction_int = prediction[:,1] >= 0.3 # if prediction is greater than or equal to 0.3 than 1 else 0\nprediction_int = prediction.astype(np.int)\n\nf1_score(yvalid, prediction_int,average='micro') # calculating f1 score\n\ntest_pred = lreg.predict(test3)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"set(test_pred.astype(np.int))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"subfin = pd.DataFrame({'ID': nl_test['ID'].values, 'overall': test_pred.astype(np.int)})\nsubfin=subfin.reindex(columns=[\"ID\",\"overall\"])\nsubfin.to_csv('submission.csv', index=False)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# import the modules we'll need\nfrom IPython.display import HTML\nimport pandas as pd\nimport numpy as np\nimport base64\n\n# function that takes in a dataframe and creates a text link to \n# download it (will only work for files < 2MB or so)\ndef create_download_link(df, title = \"Download CSV file\", filename = \"submission.csv\"): \n csv = df.to_csv(index=False)\n b64 = base64.b64encode(csv.encode())\n payload = b64.decode()\n html = '{title}'\n html = html.format(payload=payload,title=title,filename=filename)\n return HTML(html)\n\n# create a link to download the 
dataframe\ncreate_download_link(subfin)","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"name":"python","version":"3.6.4","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat":4,"nbformat_minor":1} -------------------------------------------------------------------------------- /ericsson_2019/ericsson_HE_NN.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"cell_type":"code","source":"%%time\n%matplotlib inline\nimport warnings\nwarnings.filterwarnings('ignore')\n\nimport numpy as np\nimport os\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nimport pandas as pd\nimport random\nimport xgboost as xgb\n\nfrom sklearn import preprocessing\nimport lightgbm as lgb\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.preprocessing import OneHotEncoder\n\nfrom sklearn.preprocessing import StandardScaler,MinMaxScaler\nfrom sklearn.decomposition import PCA\nfrom math import sqrt\nfrom scipy import stats\nfrom scipy.stats import norm, skew #for some statistics","execution_count":null,"outputs":[]},{"metadata":{"_cell_guid":"79c7e3d0-c299-4dcb-8224-4455121ee9b0","_uuid":"d629ff2d2480ee46fbb7e2d37f6b5fab8052498a","trusted":true},"cell_type":"code","source":"from sklearn.metrics import matthews_corrcoef\nfrom sklearn.metrics import f1_score\nfrom sklearn.metrics import fbeta_score\nfrom sklearn.metrics import recall_score\nfrom sklearn.metrics import roc_curve, auc, roc_auc_score, average_precision_score\nfrom sklearn.metrics import classification_report\nfrom sklearn.metrics import accuracy_score\nfrom sklearn.model_selection import train_test_split","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"import tensorflow as tf\nfrom keras.models import Model\nfrom keras.layers import Dense, Embedding, Input, Concatenate, Conv1D, Activation, TimeDistributed, Flatten, RepeatVector, Permute,multiply,GlobalAveragePooling1D, GlobalMaxPooling1D ,CuDNNGRU\nfrom keras.layers import LSTM, Bidirectional, GlobalMaxPool1D, Dropout, GRU, GlobalAveragePooling1D, MaxPooling1D, SpatialDropout1D, BatchNormalization, GlobalAveragePooling1D\nfrom keras.preprocessing import text, sequence\nfrom keras.callbacks import EarlyStopping, ModelCheckpoint\nfrom keras.optimizers import Adam\nfrom keras.preprocessing.text import Tokenizer\nfrom keras.callbacks import Callback\nfrom keras.models import load_model","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from keras.callbacks import Callback\nfrom sklearn.model_selection import KFold","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"import numpy as np # linear algebra\nimport pandas as pd # data processing, CSV file I/O (e.g. 
pd.read_csv)\nimport os\nprint(os.listdir(\"../input\"))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from IPython.core.interactiveshell import InteractiveShell\nfrom tqdm import tqdm_notebook\nInteractiveShell.ast_node_interactivity = \"all\"","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from nltk import pos_tag\nfrom nltk.stem.wordnet import WordNetLemmatizer \nfrom nltk.tokenize import word_tokenize\n# Tweet tokenizer does not split at apostophes which is what we want\nfrom nltk.tokenize import TweetTokenizer\nimport re\nimport string\nimport nltk\nfrom nltk.corpus import stopwords","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"tokenizer=TweetTokenizer()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"import os\nos.listdir('../input/glove840b300dtxt')","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train = pd.read_csv('../input/ericsson-data/predictive_data32f5357/Predictive_Data/train_file.csv')\ntest = pd.read_csv('../input/ericsson-data/predictive_data32f5357/Predictive_Data/test_file.csv')\nsubm = pd.read_csv('../input/ericsson-data/predictive_data32f5357/Predictive_Data/sample_submission.csv')","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### These having only one value so we are removing that"},{"metadata":{"trusted":true},"cell_type":"code","source":"train = train.drop(['UsageClass','CheckoutType','CheckoutYear','CheckoutMonth'],axis=1)\ntest = test.drop(['UsageClass','CheckoutType','CheckoutYear','CheckoutMonth'],axis=1)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train.head()\ntest.head()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"print('class count in numbers: ')\ntrain['MaterialType'].value_counts()\nprint('percentage of class count : ')\ntrain['MaterialType'].value_counts()/train.shape[0] * 100","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### encode the target vriable"},{"metadata":{"trusted":true},"cell_type":"code","source":"train['MaterialType'].replace(\"BOOK\",0,inplace=True)\ntrain['MaterialType'].replace(\"CR\",1,inplace=True)\ntrain['MaterialType'].replace(\"MIXED\",2,inplace=True)\ntrain['MaterialType'].replace(\"MUSIC\",3,inplace=True)\ntrain['MaterialType'].replace(\"SOUNDCASS\",4,inplace=True)\ntrain['MaterialType'].replace(\"SOUNDDISC\",5,inplace=True)\ntrain['MaterialType'].replace(\"VIDEOCASS\",6,inplace=True)\ntrain['MaterialType'].replace(\"VIDEODISC\",7,inplace=True)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"## used to find out number of unique variables in each column\ntrain.nunique()\ntest.nunique()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"## impute the nan values\ntrain['PublicationYear'].replace(np.nan,\"0000\",inplace=True)\ntest['PublicationYear'].replace(np.nan,\"0000\",inplace=True)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train['Subjects'].replace(np.nan,\"\",inplace=True)\ntest['Subjects'].replace(np.nan,\"\",inplace=True)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"## clean the PublicationYear column\nimport 
re\ndef matc(a):\n return re.findall(r\"[0-9]{4}\",a)\ntrain['PublicationYear'] = train['PublicationYear'].apply(lambda x: matc(x))\ntest['PublicationYear'] = test['PublicationYear'].apply(lambda x: matc(x))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train['Creator'].replace(np.nan,\"\",inplace=True)\ntrain['Publisher'].replace(np.nan,\"\",inplace=True)\n\ntest['Creator'].replace(np.nan,\"\",inplace=True)\ntest['Publisher'].replace(np.nan,\"\",inplace=True)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Combine all the Text columns"},{"metadata":{"trusted":true},"cell_type":"code","source":"train['combined'] = train['Title'] + ' . ' + train['Subjects'] + ' . ' + train['Creator'] + ' . ' + train['Publisher']\ntest['combined'] = test['Title'] + ' . ' + test['Subjects'] + ' . ' + test['Creator'] + ' . ' + test['Publisher']","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train['combined'].replace(np.nan,\"\",inplace=True)\ntest['combined'].replace(np.nan,\"\",inplace=True)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"def ext(a):\n #print(a)\n if len(a)==0:\n return 0\n #print(int(a[0]))\n return int(a[0])\ntrain['PublicationYear'] = train['PublicationYear'].apply(lambda x:ext(x))\ntest['PublicationYear'] = test['PublicationYear'].apply(lambda x:ext(x))","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Text Cleaning before computing embeddings"},{"metadata":{"trusted":true},"cell_type":"code","source":"merge=pd.concat([train.iloc[:,8:9],test.iloc[:,8:9]])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# corpus = train['combined']\ncorpus = merge['combined'].str.lower()\ndef isEnglish(s):\n try:\n s.encode(encoding='utf-8').decode('ascii')\n except UnicodeDecodeError:\n return False\n else:\n return True","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"\"\"\"\nThis function receives comments and returns clean word-list\n\"\"\"\ndef clean(comment):\n #Convert to lower case , so that Hi and hi are the same\n #comment=comment.lower()\n #remove \\n\n comment=re.sub(\"\\\\n\",\" \",comment)\n #removing usernames\n comment=re.sub(\"\\[\\[.*\\]\",\" \",comment)\n #remove hyperlink\n comment=re.sub(\"http\\S+|www.\\S+\",\" \",comment)\n #remove numbers\n comment = clearup(comment, string.punctuation+string.digits)\n #Split the sentences into words\n words=tokenizer.tokenize(comment)\n # (')aphostophe replacement (ie) you're --> you are \n # ( basic dictionary lookup : master dictionary present in a hidden block of code)\n words=[APPO[word] if word in APPO else word for word in words]\n #other commonly misspeled words.\n words=[repl[word] if word in replkeys else word for word in words]\n clean_sent=\" \".join(words)\n # remove any non alphanum,digit character\n clean_sent=re.sub(\"\\W+\",\" \",clean_sent)\n # remove the punctuations from the text\n remove_punctuations(clean_sent)\n clean_sent=re.sub(\" \",\" \",clean_sent)\n clean_sent=clean_sent.strip()\n return(clean_sent)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"%%time\nclean_corpus=corpus.apply(lambda x :clean(x))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"print('loading embeddings vectors')\ndef get_coefs(word,*arr): 
return word, np.asarray(arr, dtype='float32')","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Using Glove Embeddings"},{"metadata":{"trusted":true},"cell_type":"code","source":"%%time\nembeddings_index = dict(get_coefs(*o.split(' ')) for o in open('../input/glove840b300dtxt/glove.840B.300d.txt'))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# list_sentences_train","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"max_features=150000\nmaxlen=200\n\nlist_sentences_train = train[\"combined\"].str.lower()\nlist_classes = [\"MaterialType\"]\n\ntrain[list_classes] = train[list_classes].astype(np.int8)\ntarget = train[list_classes]\n\nlist_sentences_test = test[\"combined\"]\ntrb_nan_idx = list_sentences_test[pd.isnull(list_sentences_test)].index.tolist()\nlist_sentences_test.loc[trb_nan_idx] = ' '\nlist_sentences_test = list_sentences_test.str.lower()\n\nprint('mean text len:',train[\"combined\"].str.count('\\S+').mean())\nprint('max text len:',train[\"combined\"].str.count('\\S+').max())\n\ntokenizer = Tokenizer(num_words=max_features)\ntokenizer.fit_on_texts(list(list_sentences_train) + list(list_sentences_test))\nlist_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)\nlist_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)\nprint('padding sequences')\nX_train = train[\"combined\"]\nX_test = test[\"combined\"]\nX_train = sequence.pad_sequences(list_tokenized_train, maxlen=maxlen)\nX_test = sequence.pad_sequences(list_tokenized_test, maxlen=maxlen)\nX_train = np.array(X_train)\nX_test = np.array(X_test)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train[list_classes] = train[list_classes].astype(np.int8)\ntarget = train[list_classes]","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"print('numerical variables')\nembed_size=300\nall_embs = np.stack(embeddings_index.values())\nemb_mean,emb_std = all_embs.mean(), all_embs.std()\nprint('create embedding matrix')\nword_index = tokenizer.word_index\nnb_words = min(max_features, len(word_index))\nembedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))\n#embedding_matrix = np.zeros((nb_words, embed_size))\nfor word, i in word_index.items():\n if i == 51688:\n continue\n if i >= max_features:\n continue\n embedding_vector = embeddings_index.get(word)\n if embedding_vector is not None: \n embedding_matrix[i] = embedding_vector","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# embedding_matrix[51687]\nlen(word_index.values())","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### one hot encode the target variables"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.preprocessing import OneHotEncoder\nenc = OneHotEncoder(handle_unknown='ignore')\ntarget = enc.fit_transform(target).toarray()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"target = pd.DataFrame(data=target[0:,0:], index=list(range(0,len(target))), columns=target[0,0:]) ","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"X_train.shape\ntarget.shape\nX_test.shape","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### model used is the single Gru Cnn model with concatenating the 
max_pool and avg_pool layer"},{"metadata":{"trusted":true},"cell_type":"code","source":"def get_single_grucnn_model():\n \n inp = Input(shape=(200, ))\n x = Embedding(51688, 300, weights=[embedding_matrix], trainable = False)(inp)\n x = SpatialDropout1D(0.3)(x)\n x2 = Bidirectional(CuDNNGRU(80, kernel_initializer='glorot_uniform', return_sequences=True))(x)\n x2 = Conv1D(120, kernel_size = 2, padding = \"valid\",activation = \"relu\",strides = 1)(x2)\n #x = Dropout(0.2)(x)\n max_pool = GlobalMaxPooling1D()(x2)\n avg_pool = GlobalAveragePooling1D()(x2)\n x = Concatenate()([avg_pool,max_pool])\n x = Dense(8, activation=\"sigmoid\")(x)\n model = Model(inputs=inp, outputs=x)\n model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n print(model.summary())\n return model","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"print('start modeling')\nearly_stop = EarlyStopping(monitor = \"loss\", mode = \"min\", patience = 5)\nscores = []\npredict = np.zeros((test.shape[0],8))\noof_predict = np.zeros((train.shape[0],8))\n\nprint(\"model selection\")","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### two fold cross-validation"},{"metadata":{"trusted":true},"cell_type":"code","source":"num_folds = 2\nkf = KFold(n_splits=num_folds, shuffle=True, random_state=239)\nfor train_index, test_index in kf.split(X_train):\n X_train1 , X_valid = X_train[train_index] , X_train[test_index]\n y_train, y_val = target.loc[train_index], target.loc[test_index]\n model = get_single_grucnn_model()\n model.fit(X_train1, y_train, batch_size=256, epochs=10, verbose=1,callbacks = [early_stop])\n print('Predicting....')\n oof_predict[test_index] = model.predict(X_valid, batch_size=1024)\n# cv_score = roc_auc_score(y_test, oof_predict[test_index])\n #scores.append(cv_score)\n #print('score: ',cv_score)\n print('pridicting test')\n predict += model.predict(X_test, batch_size=1024)\n\npredict = predict / num_folds\n#print('Total CV score is {}'.format(np.mean(scores)))\nsample_submission = pd.DataFrame.from_dict({'ID': test['ID']})","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"preds = set(np.argmax(predict,axis=1))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Vectorization Approach"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.feature_extraction.text import TfidfVectorizer\ntfidf_vectorizer = TfidfVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')\n# TF-IDF feature matrix\ntrain1 = tfidf_vectorizer.fit_transform(train['combined'])\ntest1 = tfidf_vectorizer.fit_transform(test['combined'])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"train['combined'].shape\ntest['combined'].shape\n\nnp.hstack([train['combined'].values,test['combined'].values]).shape","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.feature_extraction.text import CountVectorizer\n# create the transform\nvectorizer = CountVectorizer(stop_words=\"english\", analyzer='word', \n ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None)\n# tokenize and build vocab\nvectorizer.fit(np.hstack([train['combined'].values,test['combined'].values]))\ntrain2 = vectorizer.transform(train['combined'])\ntest2 = 
vectorizer.transform(test['combined'])\nprint(train2.shape)\nprint(type(train2))","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"csr_matrix([train['Checkouts']]).T","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### concat both the vectors together into single."},{"metadata":{"trusted":true},"cell_type":"code","source":"from scipy.sparse import hstack, coo_matrix, csr_matrix,vstack\ntrain3 = hstack((train1, train2))\n\ntest3 = hstack((test1, test2))","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### lightGbm"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.model_selection import train_test_split\nfrom sklearn.metrics import f1_score\n\n# splitting data into training and validation set\nxtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train3, train['MaterialType'], random_state=42, test_size=0.2)\n\n# lreg = LogisticRegression()\nlreg = lgb.LGBMRegressor()\nlreg.fit(xtrain_bow, ytrain) # training the model\n\nprediction = lreg.predict(xvalid_bow) # predicting on the validation set\n# prediction_int = prediction[:,1] >= 0.3 # if prediction is greater than or equal to 0.3 than 1 else 0\nprediction_int = prediction.astype(np.int)\n\nf1_score(yvalid, prediction_int,average='micro') # calculating f1 score\n\ntest_pred = lreg.predict(test3)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Logistic"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import f1_score\n\n\n# splitting data into training and validation set\nxtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train3, train['MaterialType'], random_state=42, test_size=0.2)\n\nlreg = LogisticRegression()\n# lreg = lgb.LGBMRegressor()\nlreg.fit(xtrain_bow, ytrain) # training the model\n\nprediction = lreg.predict_proba(xvalid_bow) # predicting on the validation set\nprediction_int = prediction[:,1] >= 0.3 # if prediction is greater than or equal to 0.3 than 1 else 0\nprediction_int = prediction_int.astype(np.int)\n\nf1_score(yvalid, prediction_int,average='micro') # calculating f1 score\n\ntest_pred = lreg.predict_proba(test3)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Naive Bayes"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.naive_bayes import MultinomialNB\nnb = MultinomialNB(alpha=0.8)\nnb.fit(xtrain_bow, ytrain)\n\npreds = nb.predict(xvalid_bow)\nf1_score(yvalid, preds,average='micro') # calculating f1 score\n\ntest_pred = nb.predict(test3)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"subfin = pd.DataFrame({'ID': test['ID'].values, 'MaterialType': test_pred})\nsubfin=subfin.reindex(columns=[\"ID\",\"MaterialType\"])\nsubfin.to_csv('submission.csv', index=False)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### convert again to the submission 
format"},{"metadata":{"trusted":true},"cell_type":"code","source":"subfin['MaterialType'].replace(0,\"BOOK\",inplace=True)\nsubfin['MaterialType'].replace(1,\"CR\",inplace=True)\nsubfin['MaterialType'].replace(2,\"MIXED\",inplace=True)\nsubfin['MaterialType'].replace(3,\"MUSIC\",inplace=True)\nsubfin['MaterialType'].replace(4,\"SOUNDCASS\",inplace=True)\nsubfin['MaterialType'].replace(5,\"SOUNDDISC\",inplace=True)\nsubfin['MaterialType'].replace(6,\"VIDEOCASS\",inplace=True)\nsubfin['MaterialType'].replace(7,\"VIDEODISC\",inplace=True)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"subfin.to_csv('submission.csv', index=False)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# import the modules we'll need\nfrom IPython.display import HTML\nimport pandas as pd\nimport numpy as np\nimport base64\n\n# function that takes in a dataframe and creates a text link to \n# download it (will only work for files < 2MB or so)\ndef create_download_link(df, title = \"Download CSV file\", filename = \"submission.csv\"): \n csv = df.to_csv(index=False)\n b64 = base64.b64encode(csv.encode())\n payload = b64.decode()\n html = '{title}'\n html = html.format(payload=payload,title=title,filename=filename)\n return HTML(html)\n\n# create a link to download the dataframe\ncreate_download_link(subfin)","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"name":"python","version":"3.6.4","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat":4,"nbformat_minor":1} --------------------------------------------------------------------------------
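The Ericsson readme above compresses the whole pipeline into one sentence, so here is a minimal, self-contained sketch of that vectorize-and-classify baseline: TF-IDF and CountVectorizer features stacked together and fed to a Multinomial Naive Bayes classifier. This is not the exact competition code. Only the Naive Bayes branch is shown (the notebooks also try LightGBM and logistic regression), the vectorizers are fit once on the pooled train and test text, and the file names train_file.csv / test_file.csv as well as the ID, Title, Subjects, Creator, Publisher and MaterialType columns are taken from the notebooks but assumed to be available locally.

```python
# Hedged sketch of the TF-IDF + CountVectorizer + Naive Bayes baseline described in
# ericsson_2019/Readme.md. Paths and column names follow the notebooks and are
# assumptions about the local data layout, not files shipped with this repository.
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

train = pd.read_csv("train_file.csv")
test = pd.read_csv("test_file.csv")

# Merge the free-text columns into a single field, as the notebooks do.
text_cols = ["Title", "Subjects", "Creator", "Publisher"]
for df in (train, test):
    df[text_cols] = df[text_cols].fillna("")
    df["combined"] = (df["Title"] + " . " + df["Subjects"] + " . "
                      + df["Creator"] + " . " + df["Publisher"])

# Fit both vectorizers on all text so train and test share one vocabulary.
all_text = pd.concat([train["combined"], test["combined"]])
tfidf = TfidfVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words="english")
counts = CountVectorizer(stop_words="english")
tfidf.fit(all_text)
counts.fit(all_text)

X = hstack([tfidf.transform(train["combined"]), counts.transform(train["combined"])])
X_test = hstack([tfidf.transform(test["combined"]), counts.transform(test["combined"])])
y = train["MaterialType"]

# Hold out 20% and score with micro-averaged F1, mirroring the notebooks.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
model = MultinomialNB(alpha=0.8)
model.fit(X_tr, y_tr)
print("validation micro-F1:", f1_score(y_val, model.predict(X_val), average="micro"))

submission = pd.DataFrame({"ID": test["ID"], "MaterialType": model.predict(X_test)})
submission.to_csv("submission.csv", index=False)
```

Fitting the vectorizers once on the pooled text is the one deliberate deviation from the notebooks, which call fit_transform on the test text a second time for the TF-IDF features and therefore end up with a TF-IDF vocabulary that does not match between train and test.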