├── .gitattributes ├── .github └── FUNDING.yml ├── .gitignore ├── .ipynb_checkpoints ├── Subreddit_and_solution-checkpoint.ipynb ├── Tf-Idf-checkpoint.ipynb ├── W2VXAuthor-checkpoint.ipynb ├── data-mining-challange-checkpoint.Rproj ├── resources -checkpoint.Rmd ├── scoreXSparseValidation-checkpoint.csv ├── submission-checkpoint.csv └── tutorial-checkpoint.ipynb ├── LICENSE ├── Notebooks ├── .DS_Store ├── other-attempts │ ├── .DS_Store │ ├── keras-neural-networks │ │ ├── .DS_Store │ │ ├── .ipynb_checkpoints │ │ │ ├── embeddings-checkpoint.ipynb │ │ │ └── simple-net-prediction-checkpoint.ipynb │ │ ├── README.md │ │ ├── embeddings.ipynb │ │ ├── pretrained-embeddings.ipynb │ │ ├── simple-net-grid.ipynb │ │ ├── simple-net-prediction.ipynb │ │ └── simple-net.ipynb │ └── spaCy │ │ ├── .DS_Store │ │ ├── ReadMe.md │ │ ├── data_preparation │ │ ├── .ipynb_checkpoints │ │ │ ├── lemmatizer-checkpoint.ipynb │ │ │ └── vectorizer-checkpoint.ipynb │ │ ├── lemmatizer.ipynb │ │ └── vectorizer.ipynb │ │ ├── finals │ │ ├── .ipynb_checkpoints │ │ │ ├── ReadMe-checkpoint.ipynb │ │ │ ├── final_bal_lr-checkpoint.ipynb │ │ │ ├── final_bodieswS-checkpoint.ipynb │ │ │ ├── final_lr-checkpoint.ipynb │ │ │ ├── final_subreddits-checkpoint.ipynb │ │ │ ├── final_svm-checkpoint.ipynb │ │ │ ├── lemmatizer_final-checkpoint.ipynb │ │ │ ├── solution-checkpoint.ipynb │ │ │ ├── solution_bal-checkpoint.ipynb │ │ │ └── spacyW2v_Final-checkpoint.ipynb │ │ ├── ReadMe.md │ │ ├── final_bal_lr.ipynb │ │ ├── final_bodieswS.ipynb │ │ ├── final_lr.ipynb │ │ ├── final_subreddits.ipynb │ │ ├── final_svm.ipynb │ │ ├── lemmatizer_final.ipynb │ │ ├── solution.ipynb │ │ ├── solution_bal.ipynb │ │ └── spacyW2v_Final.ipynb │ │ ├── images │ │ ├── bodieswS_test_ensemble_balanced_e15_wS.png │ │ └── bodieswS_test_ensemble_balanced_e3.png │ │ ├── intermediate_models │ │ ├── .ipynb_checkpoints │ │ │ ├── ReadMe-checkpoint.ipynb │ │ │ ├── final-checkpoint.ipynb │ │ │ ├── spactW2v-checkpoint.ipynb │ │ │ ├── spacyBowAggAveraged-checkpoint.ipynb │ │ │ ├── spacySubreddits-checkpoint.ipynb │ │ │ ├── spacyTransformerEnsembleLemmatizedAveraged-checkpoint.ipynb │ │ │ └── subreddits-checkpoint.ipynb │ │ ├── ReadMe.md │ │ ├── final.ipynb │ │ ├── spactW2v.ipynb │ │ ├── spacyBowAggAveraged.ipynb │ │ ├── spacySubreddits.ipynb │ │ ├── spacyTransformerEnsembleLemmatizedAveraged.ipynb │ │ └── subreddits.ipynb │ │ └── outputs │ │ ├── .ipynb_checkpoints │ │ ├── Untitled-checkpoint.ipynb │ │ ├── bow_bal_lPunctAgg-checkpoint.txt │ │ ├── bow_bal_lPunctNumAgg-checkpoint.txt │ │ ├── bow_dlPunctNumStopLemOovAgg-checkpoint.txt │ │ ├── bow_lPunctAgg-checkpoint.txt │ │ ├── bow_lPunctNumAgg-checkpoint.txt │ │ ├── bow_lPunctNumLemAgg-checkpoint.txt │ │ ├── bow_lPunctNumLemOovAgg-checkpoint.txt │ │ ├── bow_lPunctNumOovAgg-checkpoint.txt │ │ ├── bow_lPunctNumPersAgg-checkpoint.txt │ │ ├── bow_lPunctNumPersLemAgg-checkpoint.txt │ │ ├── bow_lPunctNumPersLemOovAgg-checkpoint.txt │ │ ├── bow_lPunctNumStopLemAgg-checkpoint.txt │ │ ├── bow_lPunctNumStopLemOovAgg-checkpoint.txt │ │ ├── bow_lPunctNumStopOovAgg-checkpoint.txt │ │ ├── ensemble_bal_lPunctAgg-checkpoint.txt │ │ ├── ensemble_bal_lPunctNumStopLemOovAgg-checkpoint.txt │ │ ├── ensemble_dlPunctNumLemOovAgg-checkpoint.txt │ │ ├── ensemble_dlPunctNumStopLemOovAgg-checkpoint.txt │ │ ├── ensemble_lPunctAgg-checkpoint.txt │ │ ├── ensemble_lPunctNumAgg-checkpoint.txt │ │ ├── ensemble_lPunctNumLemAgg-checkpoint.txt │ │ ├── ensemble_lPunctNumLemOovAgg-checkpoint.txt │ │ ├── ensemble_lPunctNumOovAgg-checkpoint.txt │ │ ├── 
ensemble_lPunctNumPersAgg-checkpoint.txt │ │ ├── ensemble_lPunctNumPersLemAgg-checkpoint.txt │ │ ├── ensemble_lPunctNumPersLemOovAgg-checkpoint.txt │ │ ├── ensemble_lPunctNumStopLemAgg-checkpoint.txt │ │ ├── ensemble_lPunctNumStopLemOovAgg-checkpoint.txt │ │ ├── ensemble_lPunctNumStopOovAgg-checkpoint.txt │ │ └── spacyW2vMlp-checkpoint.txt │ │ ├── bow_bal_lPunctAgg.txt │ │ ├── bow_bal_lPunctNumAgg.txt │ │ ├── bow_bal_lPunctNumLemAgg.txt │ │ ├── bow_bal_lPunctNumLemOovAgg.txt │ │ ├── bow_bal_lPunctNumOovAgg.txt │ │ ├── bow_bal_lPunctNumPersAgg.txt │ │ ├── bow_bal_lPunctNumPersLemAgg.txt │ │ ├── bow_bal_lPunctNumPersLemOovAgg.txt │ │ ├── bow_bal_lPunctNumStopLemAgg.txt │ │ ├── bow_bal_lPunctNumStopLemOovAgg.txt │ │ ├── bow_bal_lPunctNumStopOovAgg.txt │ │ ├── bow_dlPunctNumStopLemOovAgg.txt │ │ ├── bow_lPunctAgg.txt │ │ ├── bow_lPunctNumAgg.txt │ │ ├── bow_lPunctNumLemAgg.txt │ │ ├── bow_lPunctNumLemOovAgg.txt │ │ ├── bow_lPunctNumOovAgg.txt │ │ ├── bow_lPunctNumPersAgg.txt │ │ ├── bow_lPunctNumPersLemAgg.txt │ │ ├── bow_lPunctNumPersLemOovAgg.txt │ │ ├── bow_lPunctNumStopLemAgg.txt │ │ ├── bow_lPunctNumStopLemOovAgg.txt │ │ ├── bow_lPunctNumStopOovAgg.txt │ │ ├── ensemble_bal_lPunctAgg.txt │ │ ├── ensemble_bal_lPunctNumAgg.txt │ │ ├── ensemble_bal_lPunctNumLemAgg.txt │ │ ├── ensemble_bal_lPunctNumLemOovAgg.txt │ │ ├── ensemble_bal_lPunctNumOovAgg.txt │ │ ├── ensemble_bal_lPunctNumPersAgg.txt │ │ ├── ensemble_bal_lPunctNumPersLemAgg.txt │ │ ├── ensemble_bal_lPunctNumPersLemOovAgg.txt │ │ ├── ensemble_bal_lPunctNumStopLemAgg.txt │ │ ├── ensemble_bal_lPunctNumStopLemOovAgg.txt │ │ ├── ensemble_bal_lPunctNumStopOovAgg.txt │ │ ├── ensemble_dlPunctNumLemOovAgg.txt │ │ ├── ensemble_dlPunctNumStopLemOovAgg.txt │ │ ├── ensemble_lPunctAgg.txt │ │ ├── ensemble_lPunctNumAgg.txt │ │ ├── ensemble_lPunctNumLemAgg.txt │ │ ├── ensemble_lPunctNumLemOovAgg.txt │ │ ├── ensemble_lPunctNumOovAgg.txt │ │ ├── ensemble_lPunctNumPersAgg.txt │ │ ├── ensemble_lPunctNumPersLemAgg.txt │ │ ├── ensemble_lPunctNumPersLemOovAgg.txt │ │ ├── ensemble_lPunctNumStopLemAgg.txt │ │ ├── ensemble_lPunctNumStopLemOovAgg.txt │ │ ├── ensemble_lPunctNumStopOovAgg.txt │ │ ├── softmax_bert_lPunctNumStopLemOovAgg.txt │ │ └── spacyW2vMlp.txt └── successful-models │ ├── .DS_Store │ ├── .ipynb_checkpoints │ ├── Final_sub-checkpoint.ipynb │ ├── MLPs(90)_sub-checkpoint.ipynb │ ├── MLPs_test_sub-checkpoint.ipynb │ ├── doc2vec-4000-checkpoint.ipynb │ ├── doc2vec-5000-checkpoint.ipynb │ ├── final-model-selection-checkpoint.ipynb │ ├── mlp-subreddits-4000-checkpoint.ipynb │ ├── mlp-subreddits-5000-checkpoint.ipynb │ ├── submission-checkpoint.ipynb │ ├── xgb-4000-checkpoint.ipynb │ ├── xgb-5000-checkpoint.ipynb │ └── xgb-gridsearch-checkpoint.ipynb │ ├── doc2vec-4000.ipynb │ ├── doc2vec-5000.ipynb │ ├── final-model-selection.ipynb │ ├── mlp-subreddits-4000.ipynb │ ├── mlp-subreddits-5000.ipynb │ ├── submission.ipynb │ ├── xgb-4000.ipynb │ ├── xgb-5000.ipynb │ └── xgb-gridsearch.ipynb ├── README.md ├── _config.yml ├── images └── flow-chart.png └── index.md /.gitattributes: -------------------------------------------------------------------------------- 1 | *.csv -------------------------------------------------------------------------------- /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | # GitHub Sponsors 2 | 3 | github: [pitmonticone] 4 | -------------------------------------------------------------------------------- /.gitignore: 
-------------------------------------------------------------------------------- 1 | .DS_Store -------------------------------------------------------------------------------- /.ipynb_checkpoints/data-mining-challange-checkpoint.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/resources -checkpoint.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Data Mining Challenge: Resources" 3 | author: "Pietro Monticone" 4 | date: "`r Sys.Date()` | Turin University" 5 | output: 6 | prettydoc::html_pretty: 7 | theme: cayman 8 | highlight: github 9 | toc: true 10 | --- 11 | 12 | ```{r setup, include=FALSE} 13 | knitr::opts_chunk$set( 14 | echo = FALSE, 15 | message = FALSE, 16 | warning = FALSE 17 | ) 18 | ``` 19 | 20 | # Data Camp 21 | 22 | ## Theory 23 | * [Data Science for Everyone](https://www.datacamp.com/courses/data-science-for-everyone) 24 | * [Machine Learning for Everyone](https://www.datacamp.com/courses/machine-learning-for-everyone) 25 | 26 | ## Python 27 | 28 | ### Programming 29 | * [Introduction to Python](https://www.datacamp.com/courses/intro-to-python-for-data-science) 30 | * [Intermediate Python](https://www.datacamp.com/courses/intermediate-python) 31 | * [Data Science Toolbox 1](https://www.datacamp.com/courses/python-data-science-toolbox-part-1) 32 | * [Data Science Toolbox 2](https://www.datacamp.com/courses/python-data-science-toolbox-part-2) 33 | 34 | #### Coding Best Practices with Python 35 | * [Writing Efficient Python Code](https://www.datacamp.com/courses/writing-efficient-python-code) 36 | * [Writing Efficient Code with pandas](https://www.datacamp.com/courses/writing-efficient-code-with-pandas) 37 | * [Writing Functions in Python](https://www.datacamp.com/courses/writing-functions-in-python) 38 | * [Object-Oriented Programming in Python](https://www.datacamp.com/courses/object-oriented-programming-in-python) 39 | 40 | ### Data Collection & Cleaning 41 | 42 | * [Introduction](https://www.datacamp.com/courses/introduction-to-importing-data-in-python) 43 | * [Intermediate](https://www.datacamp.com/courses/intermediate-importing-data-in-python) 44 | * [Cleaning](https://www.datacamp.com/courses/cleaning-data-in-python) 45 | 46 | ### Data Manipulation 47 | * [pandas Foundations](https://www.datacamp.com/courses/pandas-foundations) 48 | * [Manipulating DataFrames with pandas](https://www.datacamp.com/courses/manipulating-dataframes-with-pandas) 49 | * [Merging DataFrames with pandas](https://www.datacamp.com/courses/merging-dataframes-with-pandas) 50 | 51 | ### Data Visualization 52 | * [Introduction to Data Viz with Matplotlib](https://www.datacamp.com/courses/introduction-to-data-visualization-with-matplotlib) 53 | * [Introduction to Data Viz with Seaborn](https://www.datacamp.com/courses/introduction-to-data-visualization-with-seaborn) 54 | * [Improving Data Viz](https://www.datacamp.com/courses/improving-your-data-visualizations-in-python) 55 | * [Interactive Data Viz](https://www.datacamp.com/courses/interactive-data-visualization-with-bokeh) 56 | 57 | ### Machine Learning 58 | * [Supervised Learning with
scikit-learn](https://www.datacamp.com/courses/supervised-learning-with-scikit-learn) 59 | * [Unsupervised Learning in Python](https://www.datacamp.com/courses/unsupervised-learning-in-python) 60 | * [Linear Classifiers in Python](https://www.datacamp.com/courses/linear-classifiers-in-python) 61 | 62 | #### NLP 63 | * [Introduction to Natural Language Processing in Python](https://www.datacamp.com/courses/introduction-to-natural-language-processing-in-python) 64 | * [Advanced NLP with spaCy](https://learn.datacamp.com/courses/advanced-nlp-with-spacy) 65 | 66 | # Kaggle 67 | * Python 68 | * Bag of words 69 | 70 | # Notebooks 71 | 72 | # Github 73 | * **2018** [Project by Simone Azeglio](https://github.com/simoneazeglio/DataMiningChallenge2018) 74 | 75 | # Lectures 76 | 77 | ## MIT 78 | 79 | * **2020**[6.S191: Introduction to 80 | Deep Learning](http://introtodeeplearning.com) 81 | 82 | 83 | ## Caltech 84 | 85 | ## Stanford 86 | 87 | * **2020** [CS224n: Natural Language Processing with Deep Learning](http://web.stanford.edu/class/cs224n/index.html) 88 | * **2019** [CS224n: Natural Language Processing with Deep Learning](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/) 89 | 90 | ## 3B1B 91 | * [Neural Networks](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) 92 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Interdisciplinary Physics Team (InPhyT) 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /Notebooks/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pitmonticone/RedditTextClassification/fdd8b3a6e649781df9147599889c4669517f65ab/Notebooks/.DS_Store -------------------------------------------------------------------------------- /Notebooks/other-attempts/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pitmonticone/RedditTextClassification/fdd8b3a6e649781df9147599889c4669517f65ab/Notebooks/other-attempts/.DS_Store -------------------------------------------------------------------------------- /Notebooks/other-attempts/keras-neural-networks/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pitmonticone/RedditTextClassification/fdd8b3a6e649781df9147599889c4669517f65ab/Notebooks/other-attempts/keras-neural-networks/.DS_Store -------------------------------------------------------------------------------- /Notebooks/other-attempts/keras-neural-networks/.ipynb_checkpoints/simple-net-prediction-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 6, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "[name: \"/device:CPU:0\"\n", 13 | "device_type: \"CPU\"\n", 14 | "memory_limit: 268435456\n", 15 | "locality {\n", 16 | "}\n", 17 | "incarnation: 1331614535948791056\n", 18 | ", name: \"/device:GPU:0\"\n", 19 | "device_type: \"GPU\"\n", 20 | "memory_limit: 7473294746\n", 21 | "locality {\n", 22 | " bus_id: 1\n", 23 | " links {\n", 24 | " }\n", 25 | "}\n", 26 | "incarnation: 17851818086571483571\n", 27 | "physical_device_desc: \"device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1\"\n", 28 | "]\n", 29 | "2.3.1\n", 30 | "Wall time: 1.1 s\n" 31 | ] 32 | } 33 | ], 34 | "source": [ 35 | "%%time\n", 36 | "#print(\"1\")\n", 37 | "import tensorflow as tf\n", 38 | "from numba import cuda\n", 39 | "from tensorflow.python.client import device_lib\n", 40 | "print(device_lib.list_local_devices())\n", 41 | "from keras.preprocessing.sequence import pad_sequences\n", 42 | "#print(\"2\")\n", 43 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 44 | "from sklearn.preprocessing import StandardScaler\n", 45 | "import pickle\n", 46 | "from keras.layers import Dense, Input, Dropout\n", 47 | "#print(\"3\")\n", 48 | "from keras import Sequential\n", 49 | "#print(\"4\")\n", 50 | "from sklearn.preprocessing import StandardScaler\n", 51 | "from tensorflow.keras.callbacks import EarlyStopping\n", 52 | "import matplotlib.pyplot as plt\n", 53 | "import keras\n", 54 | "print(keras.__version__)\n", 55 | "from sklearn.model_selection import train_test_split\n", 56 | "from keras.constraints import maxnorm\n", 57 | "import numpy as np" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 7, 63 | "metadata": {}, 64 | "outputs": [ 65 | { 66 | "name": "stdout", 67 | "output_type": "stream", 68 | "text": [ 69 | "training tfidf and tranforming\n", 70 | "vocab_size = 120536\n", 71 | "padding\n", 72 | "done\n", 73 | "Wall time: 1min 50s\n" 74 | ] 75 | } 76 | ], 77 | "source": [ 78 | "%%time\n", 79 | "\n", 80 | "with open(r\"comments.txt\", \"rb\") as f:\n", 81 | " 
clean_train_comments = pickle.load(f) \n", 82 | " f.close()\n", 83 | "\n", 84 | "with open(r\"targets.txt\", \"rb\") as ft:\n", 85 | " y= pickle.load(ft) \n", 86 | " ft.close()\n", 87 | "\n", 88 | " \n", 89 | "y = [int(s) for s in y]\n", 90 | "\n", 91 | "\n", 92 | "\n", 93 | "#tfidf vectorization\n", 94 | "tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5,\n", 95 | " ngram_range=(1, 2), \n", 96 | " stop_words='english')\n", 97 | "\n", 98 | "# We transform each complaint into a vector\n", 99 | "print(\"training tfidf and tranforming\")\n", 100 | "X = tfidf.fit_transform(clean_train_comments).toarray() #clean-train_comments # as this: https://stats.stackexchange.com/questions/154660/tfidfvectorizer-should-it-be-used-on-train-only-or-traintest and this: https://stackoverflow.com/questions/47778403/computing-tf-idf-on-the-whole-dataset-or-only-on-training-data suggest,train tfidf only on training set\n", 101 | "vocab_size = len(tfidf.vocabulary_) + 1\n", 102 | "print(\"vocab_size = \", vocab_size)\n", 103 | "# evaluate max len train data\n", 104 | "maxlen = max([len(x) for x in X])\n", 105 | "# pad train data accordingly\n", 106 | "print(\"padding\")\n", 107 | "X_pad = pad_sequences(X, padding='post', maxlen=maxlen, dtype='float32') \n", 108 | "\n", 109 | "print(\"done\")" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 8, 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "# Define the models.\n", 119 | "\n", 120 | "def model0(): # from https://medium.com/@am.benatmane/keras-hyperparameter-tuning-using-sklearn-pipelines-grid-search-with-cross-validation-ccfc74b0ce9f\n", 121 | "\n", 122 | " METRICS = [ \n", 123 | " tf.keras.metrics.BinaryAccuracy(name='accuracy'),\n", 124 | " tf.keras.metrics.AUC(name='auc'),\n", 125 | " ]\n", 126 | "\n", 127 | " optimizer=\"Adamax\" #\"adam\"\n", 128 | " dropout=0.1 #0.1\n", 129 | " init='uniform'\n", 130 | " nbr_features= vocab_size-1 #2500\n", 131 | " dense_nparams=256\n", 132 | "\n", 133 | " model = Sequential()\n", 134 | " model.add(Dense(dense_nparams, activation='softsign', input_shape=(nbr_features,), kernel_initializer=init, kernel_constraint=maxnorm(3))) # maxnorm(0) & softmax & sigmoid -> 0.89 # maxnorm(0) & softmax & softmax -> 0.5 maxnorm(2) & relu & sigmoid ->0.92 maxnorm(1) & relu & sigmoid ->0.82\n", 135 | " model.add(Dropout(dropout))\n", 136 | " model.add(Dense(1, activation='sigmoid')) # relu & \"softmax\" fa 0.5-> non va bene #' relu & softplus' -> 0.75 #'sigmoid'\n", 137 | " model.compile(loss='binary_crossentropy', optimizer=optimizer,metrics = METRICS)\n", 138 | " return model\n", 139 | " " 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 9, 145 | "metadata": {}, 146 | "outputs": [ 147 | { 148 | "name": "stdout", 149 | "output_type": "stream", 150 | "text": [ 151 | "Epoch 1/3\n", 152 | "Epoch 2/3\n", 153 | "Epoch 3/3\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "\n", 159 | "model = model0()\n", 160 | "\n", 161 | "history = model.fit(x=X_pad, y=y, batch_size = 8, epochs = 3, verbose=10, shuffle=True, max_queue_size=10, workers=4, use_multiprocessing=True) #, callbacks=callbacks , validation_split=0.2\n", 162 | "\n", 163 | "# reset gpu memory https://stackoverflow.com/a/60354785/13110508 (but be warned: it crashes python, so use it just at the end)\n", 164 | "# device = cuda.get_current_device()\n", 165 | "# device.reset()" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": 10, 171 | "metadata": {}, 172 | "outputs": [], 173 | 
"source": [ 174 | "import pandas as pd\n", 175 | "with open(r\"comments_test.txt\", \"rb\") as f:\n", 176 | " clean_test_comments = pickle.load(f) \n", 177 | " f.close()\n", 178 | " \n", 179 | " \n", 180 | "X_test = tfidf.transform(clean_test_comments).toarray()\n", 181 | "maxlen_test = max([len(x) for x in X_test])\n", 182 | "X_test_pad = pad_sequences(X_test, padding='post', maxlen=maxlen, dtype='float32')\n", 183 | "#X_test_pad_scal = scaler.transform(X_test_pad)\n", 184 | "\n", 185 | "y_pred = model.predict_proba(X_test_pad)\n", 186 | "y_pred_unp = [y_pred[i][0] for i in range(len(y_pred))]\n", 187 | "with open(r\"authors_test.txt\", \"rb\") as f:\n", 188 | " authors = pickle.load(f) \n", 189 | " f.close()\n", 190 | " \n", 191 | "solution = pd.DataFrame({\"author\":authors, \"gender\":y_pred_unp})\n", 192 | "\n", 193 | "solution.to_csv(r\"Q:\\tooBigToDrive\\data-mining\\kaggle\\data\\challengedadata\\solutions\\simpleNetNoScalProbaGridD0_sol.csv\",index = False)" 194 | ] 195 | } 196 | ], 197 | "metadata": { 198 | "kernelspec": { 199 | "display_name": "Python 3", 200 | "language": "python", 201 | "name": "python3" 202 | }, 203 | "language_info": { 204 | "codemirror_mode": { 205 | "name": "ipython", 206 | "version": 3 207 | }, 208 | "file_extension": ".py", 209 | "mimetype": "text/x-python", 210 | "name": "python", 211 | "nbconvert_exporter": "python", 212 | "pygments_lexer": "ipython3", 213 | "version": "3.7.4" 214 | } 215 | }, 216 | "nbformat": 4, 217 | "nbformat_minor": 4 218 | } 219 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/keras-neural-networks/README.md: -------------------------------------------------------------------------------- 1 | ## Keras Neural Networks 2 | 3 | This folder contains three approaches with neural networks. 4 | 5 | ### 1. TFIDF classification 6 | 7 | This task is brought forward by the `simple-net.ipynb`,`simple-net-prediction.ipynb`,`simple-net-grid.ipynb`. The first validates the model, the third gridsearchs it and the fourth outpus the predictions. It does roc = 89.7 on the test set. 8 | 9 | ### 2. Embeddings classification 10 | 11 | Trains and predicts an embedding layer before classifying. Several netowrks have been tried. Due to a poorer validation performance if compared to more transparent models like an MLP on doc2vec (see [successful-models](https://github.com/pitmonticone/data-mining-challange/tree/master/successful-models)), we thought it not to be worth a gridsearch & prediction effort. Releted notebook: `embeddings.ipynb` 12 | 13 | ### 3. Embeddings classification 14 | 15 | Same as above, but with glove vectors pretrained on 6B words and 300 dimensions. Related notebook: `pretrained-embeddings.ipynb`. 
16 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/keras-neural-networks/simple-net-prediction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 6, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "[name: \"/device:CPU:0\"\n", 13 | "device_type: \"CPU\"\n", 14 | "memory_limit: 268435456\n", 15 | "locality {\n", 16 | "}\n", 17 | "incarnation: 1331614535948791056\n", 18 | ", name: \"/device:GPU:0\"\n", 19 | "device_type: \"GPU\"\n", 20 | "memory_limit: 7473294746\n", 21 | "locality {\n", 22 | " bus_id: 1\n", 23 | " links {\n", 24 | " }\n", 25 | "}\n", 26 | "incarnation: 17851818086571483571\n", 27 | "physical_device_desc: \"device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1\"\n", 28 | "]\n", 29 | "2.3.1\n", 30 | "Wall time: 1.1 s\n" 31 | ] 32 | } 33 | ], 34 | "source": [ 35 | "%%time\n", 36 | "#print(\"1\")\n", 37 | "import tensorflow as tf\n", 38 | "from numba import cuda\n", 39 | "from tensorflow.python.client import device_lib\n", 40 | "print(device_lib.list_local_devices())\n", 41 | "from keras.preprocessing.sequence import pad_sequences\n", 42 | "#print(\"2\")\n", 43 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 44 | "from sklearn.preprocessing import StandardScaler\n", 45 | "import pickle\n", 46 | "from keras.layers import Dense, Input, Dropout\n", 47 | "#print(\"3\")\n", 48 | "from keras import Sequential\n", 49 | "#print(\"4\")\n", 50 | "from sklearn.preprocessing import StandardScaler\n", 51 | "from tensorflow.keras.callbacks import EarlyStopping\n", 52 | "import matplotlib.pyplot as plt\n", 53 | "import keras\n", 54 | "print(keras.__version__)\n", 55 | "from sklearn.model_selection import train_test_split\n", 56 | "from keras.constraints import maxnorm\n", 57 | "import numpy as np" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 7, 63 | "metadata": {}, 64 | "outputs": [ 65 | { 66 | "name": "stdout", 67 | "output_type": "stream", 68 | "text": [ 69 | "training tfidf and tranforming\n", 70 | "vocab_size = 120536\n", 71 | "padding\n", 72 | "done\n", 73 | "Wall time: 1min 50s\n" 74 | ] 75 | } 76 | ], 77 | "source": [ 78 | "%%time\n", 79 | "\n", 80 | "with open(r\"comments.txt\", \"rb\") as f:\n", 81 | " clean_train_comments = pickle.load(f) \n", 82 | " f.close()\n", 83 | "\n", 84 | "with open(r\"targets.txt\", \"rb\") as ft:\n", 85 | " y= pickle.load(ft) \n", 86 | " ft.close()\n", 87 | "\n", 88 | " \n", 89 | "y = [int(s) for s in y]\n", 90 | "\n", 91 | "\n", 92 | "\n", 93 | "#tfidf vectorization\n", 94 | "tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5,\n", 95 | " ngram_range=(1, 2), \n", 96 | " stop_words='english')\n", 97 | "\n", 98 | "# We transform each complaint into a vector\n", 99 | "print(\"training tfidf and tranforming\")\n", 100 | "X = tfidf.fit_transform(clean_train_comments).toarray() #clean-train_comments # as this: https://stats.stackexchange.com/questions/154660/tfidfvectorizer-should-it-be-used-on-train-only-or-traintest and this: https://stackoverflow.com/questions/47778403/computing-tf-idf-on-the-whole-dataset-or-only-on-training-data suggest,train tfidf only on training set\n", 101 | "vocab_size = len(tfidf.vocabulary_) + 1\n", 102 | "print(\"vocab_size = \", vocab_size)\n", 103 | "# evaluate max len train data\n", 104 | "maxlen = max([len(x) for x 
in X])\n", 105 | "# pad train data accordingly\n", 106 | "print(\"padding\")\n", 107 | "X_pad = pad_sequences(X, padding='post', maxlen=maxlen, dtype='float32') \n", 108 | "\n", 109 | "print(\"done\")" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 8, 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "# Define the models.\n", 119 | "\n", 120 | "def model0(): # from https://medium.com/@am.benatmane/keras-hyperparameter-tuning-using-sklearn-pipelines-grid-search-with-cross-validation-ccfc74b0ce9f\n", 121 | "\n", 122 | " METRICS = [ \n", 123 | " tf.keras.metrics.BinaryAccuracy(name='accuracy'),\n", 124 | " tf.keras.metrics.AUC(name='auc'),\n", 125 | " ]\n", 126 | "\n", 127 | " optimizer=\"Adamax\" #\"adam\"\n", 128 | " dropout=0.1 #0.1\n", 129 | " init='uniform'\n", 130 | " nbr_features= vocab_size-1 #2500\n", 131 | " dense_nparams=256\n", 132 | "\n", 133 | " model = Sequential()\n", 134 | " model.add(Dense(dense_nparams, activation='softsign', input_shape=(nbr_features,), kernel_initializer=init, kernel_constraint=maxnorm(3))) # maxnorm(0) & softmax & sigmoid -> 0.89 # maxnorm(0) & softmax & softmax -> 0.5 maxnorm(2) & relu & sigmoid ->0.92 maxnorm(1) & relu & sigmoid ->0.82\n", 135 | " model.add(Dropout(dropout))\n", 136 | " model.add(Dense(1, activation='sigmoid')) # relu & \"softmax\" fa 0.5-> non va bene #' relu & softplus' -> 0.75 #'sigmoid'\n", 137 | " model.compile(loss='binary_crossentropy', optimizer=optimizer,metrics = METRICS)\n", 138 | " return model\n", 139 | " " 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 9, 145 | "metadata": {}, 146 | "outputs": [ 147 | { 148 | "name": "stdout", 149 | "output_type": "stream", 150 | "text": [ 151 | "Epoch 1/3\n", 152 | "Epoch 2/3\n", 153 | "Epoch 3/3\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "\n", 159 | "model = model0()\n", 160 | "\n", 161 | "history = model.fit(x=X_pad, y=y, batch_size = 8, epochs = 3, verbose=10, shuffle=True, max_queue_size=10, workers=4, use_multiprocessing=True) #, callbacks=callbacks , validation_split=0.2\n", 162 | "\n", 163 | "# reset gpu memory https://stackoverflow.com/a/60354785/13110508 (but be warned: it crashes python, so use it just at the end)\n", 164 | "# device = cuda.get_current_device()\n", 165 | "# device.reset()" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": 10, 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [ 174 | "import pandas as pd\n", 175 | "with open(r\"comments_test.txt\", \"rb\") as f:\n", 176 | " clean_test_comments = pickle.load(f) \n", 177 | " f.close()\n", 178 | " \n", 179 | " \n", 180 | "X_test = tfidf.transform(clean_test_comments).toarray()\n", 181 | "maxlen_test = max([len(x) for x in X_test])\n", 182 | "X_test_pad = pad_sequences(X_test, padding='post', maxlen=maxlen, dtype='float32')\n", 183 | "#X_test_pad_scal = scaler.transform(X_test_pad)\n", 184 | "\n", 185 | "y_pred = model.predict_proba(X_test_pad)\n", 186 | "y_pred_unp = [y_pred[i][0] for i in range(len(y_pred))]\n", 187 | "with open(r\"authors_test.txt\", \"rb\") as f:\n", 188 | " authors = pickle.load(f) \n", 189 | " f.close()\n", 190 | " \n", 191 | "solution = pd.DataFrame({\"author\":authors, \"gender\":y_pred_unp})\n", 192 | "\n", 193 | "solution.to_csv(r\"Q:\\tooBigToDrive\\data-mining\\kaggle\\data\\challengedadata\\solutions\\simpleNetNoScalProbaGridD0_sol.csv\",index = False)" 194 | ] 195 | } 196 | ], 197 | "metadata": { 198 | "kernelspec": { 199 | "display_name": "Python 3", 200 | 
"language": "python", 201 | "name": "python3" 202 | }, 203 | "language_info": { 204 | "codemirror_mode": { 205 | "name": "ipython", 206 | "version": 3 207 | }, 208 | "file_extension": ".py", 209 | "mimetype": "text/x-python", 210 | "name": "python", 211 | "nbconvert_exporter": "python", 212 | "pygments_lexer": "ipython3", 213 | "version": "3.7.4" 214 | } 215 | }, 216 | "nbformat": 4, 217 | "nbformat_minor": 4 218 | } 219 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pitmonticone/RedditTextClassification/fdd8b3a6e649781df9147599889c4669517f65ab/Notebooks/other-attempts/spaCy/.DS_Store -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/ReadMe.md: -------------------------------------------------------------------------------- 1 | # NLP with spaCy 2 | 3 | Notebooks and ReadMes inside folders provide concise decriptions of the code. To have a more in-depth resume and the big picture, please read [this Stack Overflow Question](https://stackoverflow.com/questions/60821793/text-classification-with-spacy-going-beyond-the-basics-to-improve-performance), this [GitHub Issue](https://github.com/explosion/spaCy/issues/5224) and a comment to a [Feature Request](https://github.com/explosion/spaCy/issues/2253#issuecomment-605502320). 4 | 5 | To access data, submisisons and the various lemmatizations/vectorizations, please visit this [Google Drive Link](https://drive.google.com/open?id=1ARPbyK6uyudZTZ9m0UEDY5_xgrH7D6PX) 6 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/data_preparation/.ipynb_checkpoints/vectorizer-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Vectorizer\n", 8 | "\n", 9 | "This notebook takes all preprocessings and vectorizes them, in order to be classified with the MLP. As an exploration, we used spaCy's pre-trained vectors. Note that the docuemnt vectors are obtained from the word vectors via an average. 
" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import spacy\n", 19 | "import pandas as pd\n", 20 | "import numpy as np\n", 21 | "from progressbar import ProgressBar, Bar, Percentage\n", 22 | "from os import listdir\n", 23 | "from os.path import isfile, join" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "Load the big model (as per [documentation](https://spacy.io/usage/vectors-similarity)" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "nlp = spacy.load(r\"Q:\\anaconda\\Lib\\site-packages\\en_core_web_lg\\en_core_web_lg-2.2.5\")" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "%%time\n", 49 | "\n", 50 | "def_str = r\"Q:\\\\tooBigToDrive\\data-mining\\kaggle\\data\\csv\"\n", 51 | "path = r\"Q:\\tooBigToDrive\\data-mining\\kaggle\\data\\csv\"\n", 52 | "files = listdir(def_str)\n", 53 | "files = [f.replace(\".csv\",\"\") for f in files if \"Agg\" in f]\n", 54 | "\n", 55 | "for s in files:\n", 56 | " csvPath = def_str +\"\\\\\"+ s + \".csv\"\n", 57 | " npyPath = def_str +\"\\\\\"+ s +\"sSub\"+ \".npy\"\n", 58 | " train = pd.read_csv(csvPath)\n", 59 | " train.replace(to_replace = \"empty\", value = \"\", inplace = True)\n", 60 | " train[\"body\"].fillna(\"\",inplace = True)\n", 61 | " # enable this to add subreddits to body \n", 62 | " train[\"body\"] = train[\"subreddit\"]+\" \"+train[\"body\"]\n", 63 | " to_be_vectorized = train[\"body\"].tolist()\n", 64 | " vectorsl = []\n", 65 | " print(\"doing\"+\" \"+s+\".csv ...\", \"len(to_be_vectorized) = \",len(to_be_vectorized) )\n", 66 | " pbar = ProgressBar(widgets=[Percentage(), Bar()], maxval=len(to_be_vectorized)).start()\n", 67 | " i = 0\n", 68 | " # disable parser and ner pipes to have better performance\n", 69 | " with nlp.disable_pipes():\n", 70 | " for tex in to_be_vectorized:\n", 71 | " vectorsl.append(nlp(tex).vector)\n", 72 | " i += 1\n", 73 | " pbar.update(i)\n", 74 | " pbar.finish()\n", 75 | " vectors = np.array(vectorsl)\n", 76 | " np.save(npyPath,vectors)\n", 77 | " print(\"done\")\n" 78 | ] 79 | } 80 | ], 81 | "metadata": { 82 | "kernelspec": { 83 | "display_name": "Python [conda env:myEnv]", 84 | "language": "python", 85 | "name": "conda-env-myEnv-py" 86 | }, 87 | "language_info": { 88 | "codemirror_mode": { 89 | "name": "ipython", 90 | "version": 3 91 | }, 92 | "file_extension": ".py", 93 | "mimetype": "text/x-python", 94 | "name": "python", 95 | "nbconvert_exporter": "python", 96 | "pygments_lexer": "ipython3", 97 | "version": "3.7.6" 98 | } 99 | }, 100 | "nbformat": 4, 101 | "nbformat_minor": 4 102 | } 103 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/data_preparation/vectorizer.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Vectorizer\n", 8 | "\n", 9 | "This notebook takes all preprocessings and vectorizes them, in order to be classified with the MLP. As an exploration, we used spaCy's pre-trained vectors. Note that the docuemnt vectors are obtained from the word vectors via an average. 
" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import spacy\n", 19 | "import pandas as pd\n", 20 | "import numpy as np\n", 21 | "from progressbar import ProgressBar, Bar, Percentage\n", 22 | "from os import listdir\n", 23 | "from os.path import isfile, join" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "Load the big model (as per [documentation](https://spacy.io/usage/vectors-similarity)" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "nlp = spacy.load(r\"Q:\\anaconda\\Lib\\site-packages\\en_core_web_lg\\en_core_web_lg-2.2.5\")" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "%%time\n", 49 | "\n", 50 | "def_str = r\"Q:\\\\tooBigToDrive\\data-mining\\kaggle\\data\\csv\"\n", 51 | "path = r\"Q:\\tooBigToDrive\\data-mining\\kaggle\\data\\csv\"\n", 52 | "files = listdir(def_str)\n", 53 | "files = [f.replace(\".csv\",\"\") for f in files if \"Agg\" in f]\n", 54 | "\n", 55 | "for s in files:\n", 56 | " csvPath = def_str +\"\\\\\"+ s + \".csv\"\n", 57 | " npyPath = def_str +\"\\\\\"+ s +\"sSub\"+ \".npy\"\n", 58 | " train = pd.read_csv(csvPath)\n", 59 | " train.replace(to_replace = \"empty\", value = \"\", inplace = True)\n", 60 | " train[\"body\"].fillna(\"\",inplace = True)\n", 61 | " # enable this to add subreddits to body \n", 62 | " train[\"body\"] = train[\"subreddit\"]+\" \"+train[\"body\"]\n", 63 | " to_be_vectorized = train[\"body\"].tolist()\n", 64 | " vectorsl = []\n", 65 | " print(\"doing\"+\" \"+s+\".csv ...\", \"len(to_be_vectorized) = \",len(to_be_vectorized) )\n", 66 | " pbar = ProgressBar(widgets=[Percentage(), Bar()], maxval=len(to_be_vectorized)).start()\n", 67 | " i = 0\n", 68 | " # disable parser and ner pipes to have better performance\n", 69 | " with nlp.disable_pipes():\n", 70 | " for tex in to_be_vectorized:\n", 71 | " vectorsl.append(nlp(tex).vector)\n", 72 | " i += 1\n", 73 | " pbar.update(i)\n", 74 | " pbar.finish()\n", 75 | " vectors = np.array(vectorsl)\n", 76 | " np.save(npyPath,vectors)\n", 77 | " print(\"done\")\n" 78 | ] 79 | } 80 | ], 81 | "metadata": { 82 | "kernelspec": { 83 | "display_name": "Python [conda env:myEnv]", 84 | "language": "python", 85 | "name": "conda-env-myEnv-py" 86 | }, 87 | "language_info": { 88 | "codemirror_mode": { 89 | "name": "ipython", 90 | "version": 3 91 | }, 92 | "file_extension": ".py", 93 | "mimetype": "text/x-python", 94 | "name": "python", 95 | "nbconvert_exporter": "python", 96 | "pygments_lexer": "ipython3", 97 | "version": "3.7.6" 98 | } 99 | }, 100 | "nbformat": 4, 101 | "nbformat_minor": 4 102 | } 103 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/finals/.ipynb_checkpoints/ReadMe-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/finals/.ipynb_checkpoints/final_bal_lr-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 26, 6 | "metadata": {}, 7 | "outputs": 
[], 8 | "source": [ 9 | "import pandas as pd \n", 10 | "from sklearn.metrics import roc_curve, auc\n", 11 | "from sklearn.model_selection import train_test_split\n", 12 | "from sklearn.model_selection import KFold\n", 13 | "import matplotlib.pyplot as plt\n", 14 | "import numpy as np\n", 15 | "from sklearn.linear_model import LogisticRegression\n", 16 | "import joblib" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 27, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "subs = pd.read_csv(r\"subs_bal.csv\")\n", 26 | "W2v= pd.read_csv(r\"W2v_bal.csv\")\n", 27 | "bodieswSdrop = pd.read_csv(r\"bodieswSdrop_bal.csv\")" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 28, 33 | "metadata": {}, 34 | "outputs": [ 35 | { 36 | "name": "stdout", 37 | "output_type": "stream", 38 | "text": [ 39 | " true_y bodieswSdrop_y subs_y W2v_y\n", 40 | "0 0 0.004841 0.354758 0.505174\n", 41 | "1 0 0.959119 0.556274 0.217193\n", 42 | "2 0 0.124737 0.321295 0.338535\n", 43 | "3 0 0.975953 0.398921 0.818394\n", 44 | "4 0 0.978466 0.354308 0.656013\n" 45 | ] 46 | } 47 | ], 48 | "source": [ 49 | "df = pd.DataFrame({\"true_y\": bodieswSdrop[\"true_y\"].tolist(), \"bodieswSdrop_y\":bodieswSdrop[\"pred_y\"].tolist(), \"subs_y\": subs[\"pred_y\"].tolist(), \"W2v_y\": W2v[\"pred_y\"].tolist() })\n", 50 | "print(df.head(5))" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 29, 56 | "metadata": {}, 57 | "outputs": [ 58 | { 59 | "name": "stdout", 60 | "output_type": "stream", 61 | "text": [ 62 | " bodieswSdrop_y subs_y W2v_y\n", 63 | "0 0.004841 0.354758 0.505174\n", 64 | "1 0.959119 0.556274 0.217193\n", 65 | "2 0.124737 0.321295 0.338535\n", 66 | "3 0.975953 0.398921 0.818394\n", 67 | "4 0.978466 0.354308 0.656013\n" 68 | ] 69 | } 70 | ], 71 | "source": [ 72 | "X = df.loc[:, [\"bodieswSdrop_y\", \"subs_y\", \"W2v_y\"]] # \"bodieswSdrop_y\", \"subs_y\", \"W2v_y\" #, \"subs_y\", \"W2v_y\"\n", 73 | "print(X.head(5))\n", 74 | "y = df.true_y" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 30, 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "name": "stdout", 84 | "output_type": "stream", 85 | "text": [ 86 | "2842 2842 (2842, 3) (2842,)\n" 87 | ] 88 | } 89 | ], 90 | "source": [ 91 | "X = X.to_numpy()\n", 92 | "y = y.to_numpy()\n", 93 | "print(len(X), len(y), X.shape, y.shape)" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 31, 99 | "metadata": {}, 100 | "outputs": [ 101 | { 102 | "name": "stdout", 103 | "output_type": "stream", 104 | "text": [ 105 | "done 1\n", 106 | "done 1\n", 107 | "done 1\n", 108 | "done 1\n", 109 | "done 1\n", 110 | "done 1\n", 111 | "done 1\n", 112 | "done 1\n", 113 | "done 1\n", 114 | "done 1\n" 115 | ] 116 | } 117 | ], 118 | "source": [ 119 | "lrClf = LogisticRegression(C = 1) #modello\n", 120 | " \n", 121 | "kf = KFold(n_splits = 10, shuffle = True)\n", 122 | "\n", 123 | "for train_indices, test_indices in kf.split(X):\n", 124 | " lrClf.fit(X[train_indices], y[train_indices])\n", 125 | " print(\"done 1\")" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 32, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "data": { 135 | "text/plain": [ 136 | "['Q:\\\\tooBigToDrive\\\\data-mining\\\\kaggle\\\\my_models\\\\spaCy\\\\savedModels\\\\bal_lr\\\\bal_lr.sav']" 137 | ] 138 | }, 139 | "execution_count": 32, 140 | "metadata": {}, 141 | "output_type": "execute_result" 142 | } 143 | ], 144 | "source": [ 145 | "joblib.dump(lrClf , 
r\"bal_lr\\bal_lr.sav\")" 146 | ] 147 | } 148 | ], 149 | "metadata": { 150 | "kernelspec": { 151 | "display_name": "Python [conda env:myEnv]", 152 | "language": "python", 153 | "name": "conda-env-myEnv-py" 154 | }, 155 | "language_info": { 156 | "codemirror_mode": { 157 | "name": "ipython", 158 | "version": 3 159 | }, 160 | "file_extension": ".py", 161 | "mimetype": "text/x-python", 162 | "name": "python", 163 | "nbconvert_exporter": "python", 164 | "pygments_lexer": "ipython3", 165 | "version": "3.7.6" 166 | } 167 | }, 168 | "nbformat": 4, 169 | "nbformat_minor": 4 170 | } 171 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/finals/.ipynb_checkpoints/final_lr-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 5, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stderr", 10 | "output_type": "stream", 11 | "text": [ 12 | "Using TensorFlow backend.\n", 13 | "Q:\\anaconda\\envs\\myEnv\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 14 | " _np_qint8 = np.dtype([(\"qint8\", np.int8, 1)])\n", 15 | "Q:\\anaconda\\envs\\myEnv\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 16 | " _np_quint8 = np.dtype([(\"quint8\", np.uint8, 1)])\n", 17 | "Q:\\anaconda\\envs\\myEnv\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 18 | " _np_qint16 = np.dtype([(\"qint16\", np.int16, 1)])\n", 19 | "Q:\\anaconda\\envs\\myEnv\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 20 | " _np_quint16 = np.dtype([(\"quint16\", np.uint16, 1)])\n", 21 | "Q:\\anaconda\\envs\\myEnv\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 22 | " _np_qint32 = np.dtype([(\"qint32\", np.int32, 1)])\n", 23 | "Q:\\anaconda\\envs\\myEnv\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 24 | " np_resource = np.dtype([(\"resource\", np.ubyte, 1)])\n" 25 | ] 26 | } 27 | ], 28 | "source": [ 29 | "import pandas as pd \n", 30 | "from sklearn.metrics import roc_curve, auc\n", 31 | "from sklearn.model_selection import train_test_split\n", 32 | "from sklearn.model_selection import KFold\n", 33 | "import matplotlib.pyplot as plt\n", 34 | "import numpy as np\n", 35 | "from sklearn.linear_model import LogisticRegression\n", 36 | "import joblib\n", 37 | "from imblearn.over_sampling import ADASYN " 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 6, 43 | 
"metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "bodies = pd.read_csv(r\"bodies.csv\")\n", 47 | "bodieswS = pd.read_csv(r\"bodieswS.csv\")\n", 48 | "subs = pd.read_csv(r\"subs.csv\")\n", 49 | "W2v= pd.read_csv(r\"W2v.csv\")\n", 50 | "W2vwS = pd.read_csv(r\"W2vwS.csv\")\n", 51 | "bodieswSdrop = pd.read_csv(r\"bodieswSdrop.csv\")" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 7, 57 | "metadata": {}, 58 | "outputs": [ 59 | { 60 | "name": "stdout", 61 | "output_type": "stream", 62 | "text": [ 63 | " true_y bodies_y bodieswS_y subs_y W2v_y W2vwS_y bodieswSdrop_y\n", 64 | "0 0 0.094856 0.120936 0.193810 0.093180 0.060750 0.046726\n", 65 | "1 1 0.106738 0.099757 0.478376 0.178994 0.149258 0.044623\n", 66 | "2 0 0.549541 0.253948 0.338182 0.892713 0.913806 0.222466\n", 67 | "3 1 0.425894 0.838085 0.374291 0.856181 0.820273 0.900915\n", 68 | "4 0 0.553865 0.341898 0.284349 0.461019 0.478457 0.525452\n" 69 | ] 70 | } 71 | ], 72 | "source": [ 73 | "df = pd.DataFrame({\"true_y\": bodies[\"true_y\"].tolist(), \"bodies_y\":bodies[\"pred_y\"].tolist(), \"bodieswS_y\": bodieswS[\"pred_y\"].tolist(), \"subs_y\": subs[\"pred_y\"].tolist(), \"W2v_y\": W2v[\"pred_y\"].tolist(), \"W2vwS_y\": W2vwS[\"pred_y\"].tolist(), \"bodieswSdrop_y\":bodieswSdrop[\"pred_y\"].tolist() })\n", 74 | "print(df.head(5))" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 8, 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "name": "stdout", 84 | "output_type": "stream", 85 | "text": [ 86 | " bodieswSdrop_y subs_y W2v_y\n", 87 | "0 0.046726 0.193810 0.093180\n", 88 | "1 0.044623 0.478376 0.178994\n", 89 | "2 0.222466 0.338182 0.892713\n", 90 | "3 0.900915 0.374291 0.856181\n", 91 | "4 0.525452 0.284349 0.461019\n", 92 | "len(X) before adasyn: 1000 len(y_train) before adasyn: 1000 percentage before: 0.265\n", 93 | "len(X) after adasyn: 1467 len(y) after adasyn: 1467 percentage after: 0.49897750511247446\n" 94 | ] 95 | } 96 | ], 97 | "source": [ 98 | "X = df.loc[:, [\"bodieswSdrop_y\", \"subs_y\", \"W2v_y\"]] #, \"subs_y\", \"W2v_y\"\n", 99 | "print(X.head(5))\n", 100 | "y = df.true_y\n", 101 | "\n", 102 | "sm = ADASYN()\n", 103 | "print(\"len(X) before adasyn: \",len(X), \"len(y_train) before adasyn:\", len(y), \"percentage before: \", sum(y.tolist())/len(y.tolist()))\n", 104 | "X, y = sm.fit_sample(X, y)\n", 105 | "print(\"len(X) after adasyn: \",len(X), \"len(y) after adasyn:\", len(y), \"percentage after: \", sum(y.tolist())/len(y.tolist()))\n", 106 | "#sum(y_validation.tolist())/len(y_validation.tolist())\n" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 9, 112 | "metadata": {}, 113 | "outputs": [ 114 | { 115 | "name": "stdout", 116 | "output_type": "stream", 117 | "text": [ 118 | "1467 1467 (1467, 3) (1467,)\n" 119 | ] 120 | } 121 | ], 122 | "source": [ 123 | "X = X.to_numpy()\n", 124 | "y = y.to_numpy()\n", 125 | "print(len(X), len(y), X.shape, y.shape)" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 10, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "name": "stdout", 135 | "output_type": "stream", 136 | "text": [ 137 | "done 1\n", 138 | "done 1\n", 139 | "done 1\n", 140 | "done 1\n", 141 | "done 1\n", 142 | "done 1\n", 143 | "done 1\n", 144 | "done 1\n", 145 | "done 1\n", 146 | "done 1\n" 147 | ] 148 | } 149 | ], 150 | "source": [ 151 | "lrClf = LogisticRegression(C = 1) #modello\n", 152 | " \n", 153 | "kf = KFold(n_splits = 10)\n", 154 | "\n", 155 | "for train_indices, test_indices in 
kf.split(X):\n", 156 | " lrClf.fit(X[train_indices], y[train_indices])\n", 157 | " print(\"done 1\")\n", 158 | "# print(svm.score(x_train[test_indices], y_train[test_indices]))\n", 159 | "# y_scoreSVM = svm.predict_proba(x_validation)[:,1]" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 11, 165 | "metadata": {}, 166 | "outputs": [ 167 | { 168 | "data": { 169 | "text/plain": [ 170 | "['Q:\\\\tooBigToDrive\\\\data-mining\\\\kaggle\\\\my_models\\\\spaCy\\\\savedModels\\\\lr_adasyn\\\\lr_adasyn.sav']" 171 | ] 172 | }, 173 | "execution_count": 11, 174 | "metadata": {}, 175 | "output_type": "execute_result" 176 | } 177 | ], 178 | "source": [ 179 | "joblib.dump(lrClf , r\"lr_adasyn\\lr_adasyn.sav\")" 180 | ] 181 | } 182 | ], 183 | "metadata": { 184 | "kernelspec": { 185 | "display_name": "Python [conda env:myEnv]", 186 | "language": "python", 187 | "name": "conda-env-myEnv-py" 188 | }, 189 | "language_info": { 190 | "codemirror_mode": { 191 | "name": "ipython", 192 | "version": 3 193 | }, 194 | "file_extension": ".py", 195 | "mimetype": "text/x-python", 196 | "name": "python", 197 | "nbconvert_exporter": "python", 198 | "pygments_lexer": "ipython3", 199 | "version": "3.7.6" 200 | } 201 | }, 202 | "nbformat": 4, 203 | "nbformat_minor": 4 204 | } 205 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/finals/.ipynb_checkpoints/final_svm-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd \n", 10 | "from sklearn.metrics import roc_curve, auc\n", 11 | "from sklearn.model_selection import train_test_split\n", 12 | "from sklearn.model_selection import KFold\n", 13 | "import matplotlib.pyplot as plt\n", 14 | "import numpy as np\n", 15 | "from sklearn import svm\n", 16 | "import joblib" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 2, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "bodies = pd.read_csv(r\"bodies.csv\")\n", 26 | "bodieswS = pd.read_csv(r\"bodieswS.csv\")\n", 27 | "subs = pd.read_csv(r\"subs.csv\")\n", 28 | "W2v= pd.read_csv(r\"W2v.csv\")\n", 29 | "W2vwS = pd.read_csv(r\"W2vwS.csv\")\n", 30 | "bodieswSdrop = pd.read_csv(r\"bodieswSdrop.csv\")" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 3, 36 | "metadata": {}, 37 | "outputs": [ 38 | { 39 | "name": "stdout", 40 | "output_type": "stream", 41 | "text": [ 42 | " true_y bodies_y bodieswS_y subs_y W2v_y W2vwS_y bodieswSdrop_y\n", 43 | "0 0 0.094856 0.120936 0.193810 0.093180 0.060750 0.046726\n", 44 | "1 1 0.106738 0.099757 0.478376 0.178994 0.149258 0.044623\n", 45 | "2 0 0.549541 0.253948 0.338182 0.892713 0.913806 0.222466\n", 46 | "3 1 0.425894 0.838085 0.374291 0.856181 0.820273 0.900915\n", 47 | "4 0 0.553865 0.341898 0.284349 0.461019 0.478457 0.525452\n" 48 | ] 49 | } 50 | ], 51 | "source": [ 52 | "df = pd.DataFrame({\"true_y\": bodies[\"true_y\"].tolist(), \"bodies_y\":bodies[\"pred_y\"].tolist(), \"bodieswS_y\": bodieswS[\"pred_y\"].tolist(), \"subs_y\": subs[\"pred_y\"].tolist(), \"W2v_y\": W2v[\"pred_y\"].tolist(), \"W2vwS_y\": W2vwS[\"pred_y\"].tolist(), \"bodieswSdrop_y\":bodieswSdrop[\"pred_y\"].tolist() })\n", 53 | "print(df.head(5))" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 4, 59 | "metadata": {}, 60 | "outputs": [ 61 | { 62 | "name": 
"stdout", 63 | "output_type": "stream", 64 | "text": [ 65 | " bodieswSdrop_y subs_y W2v_y\n", 66 | "0 0.046726 0.193810 0.093180\n", 67 | "1 0.044623 0.478376 0.178994\n", 68 | "2 0.222466 0.338182 0.892713\n", 69 | "3 0.900915 0.374291 0.856181\n", 70 | "4 0.525452 0.284349 0.461019\n" 71 | ] 72 | } 73 | ], 74 | "source": [ 75 | "X = df.loc[:, [\"bodieswSdrop_y\", \"subs_y\", \"W2v_y\"]] #, \"subs_y\", \"W2v_y\"\n", 76 | "print(X.head(5))\n", 77 | "y = df.true_y" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 5, 83 | "metadata": {}, 84 | "outputs": [ 85 | { 86 | "name": "stdout", 87 | "output_type": "stream", 88 | "text": [ 89 | "1000 1000 (1000, 3) (1000,)\n" 90 | ] 91 | } 92 | ], 93 | "source": [ 94 | "X = X.to_numpy()\n", 95 | "y = y.to_numpy()\n", 96 | "print(len(X), len(y), X.shape, y.shape)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 6, 102 | "metadata": {}, 103 | "outputs": [ 104 | { 105 | "name": "stdout", 106 | "output_type": "stream", 107 | "text": [ 108 | "done 1\n", 109 | "done 1\n", 110 | "done 1\n", 111 | "done 1\n", 112 | "done 1\n", 113 | "done 1\n", 114 | "done 1\n", 115 | "done 1\n", 116 | "done 1\n", 117 | "done 1\n" 118 | ] 119 | } 120 | ], 121 | "source": [ 122 | "svm = svm.SVC(C=1.0, kernel='poly', degree=2, gamma='scale', coef0=0.0, shrinking=True, probability=True, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1,\n", 123 | " decision_function_shape='ovr', break_ties=False, random_state=None)\n", 124 | " \n", 125 | "kf = KFold(n_splits = 10)\n", 126 | "\n", 127 | "for train_indices, test_indices in kf.split(X):\n", 128 | " svm.fit(X[train_indices], y[train_indices])\n", 129 | " print(\"done 1\")\n", 130 | "# print(svm.score(x_train[test_indices], y_train[test_indices]))\n", 131 | "# y_scoreSVM = svm.predict_proba(x_validation)[:,1]" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 8, 137 | "metadata": {}, 138 | "outputs": [ 139 | { 140 | "data": { 141 | "text/plain": [ 142 | "['Q:\\\\tooBigToDrive\\\\data-mining\\\\kaggle\\\\my_models\\\\spaCy\\\\savedModels\\\\svm\\\\svm.sav']" 143 | ] 144 | }, 145 | "execution_count": 8, 146 | "metadata": {}, 147 | "output_type": "execute_result" 148 | } 149 | ], 150 | "source": [ 151 | "joblib.dump(svm , r\"svm\\svm.sav\")" 152 | ] 153 | } 154 | ], 155 | "metadata": { 156 | "kernelspec": { 157 | "display_name": "Python [conda env:myEnv]", 158 | "language": "python", 159 | "name": "conda-env-myEnv-py" 160 | }, 161 | "language_info": { 162 | "codemirror_mode": { 163 | "name": "ipython", 164 | "version": 3 165 | }, 166 | "file_extension": ".py", 167 | "mimetype": "text/x-python", 168 | "name": "python", 169 | "nbconvert_exporter": "python", 170 | "pygments_lexer": "ipython3", 171 | "version": "3.7.6" 172 | } 173 | }, 174 | "nbformat": 4, 175 | "nbformat_minor": 4 176 | } 177 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/finals/.ipynb_checkpoints/solution-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd\n", 10 | "import joblib" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 2, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "subs = pd.read_csv(r\"subs_test_predictions.csv\")\n", 20 | "bodieswS = 
pd.read_csv(r\"bodieswS_test_predictions.csv\")\n", 21 | "W2v = pd.read_csv(r\"spacyW2v_test_predictions.csv\")" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 3, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "#svm = joblib.load(r\"Q:\\tooBigToDrive\\data-mining\\kaggle\\my_models\\spaCy\\savedModels\\svm\\svm.sav\")\n", 31 | "lr = joblib.load(r\"Q:\\tooBigToDrive\\data-mining\\kaggle\\my_models\\spaCy\\savedModels\\lr_adasyn\\lr_adasyn.sav\")" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 4, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "df = pd.DataFrame({\"subs\": subs[\"pred_y\"].tolist(), \"bodieswS\": bodieswS[\"pred_y\"].tolist(), \"W2v\": W2v[\"pred_y\"].tolist()})" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 5, 46 | "metadata": {}, 47 | "outputs": [ 48 | { 49 | "name": "stdout", 50 | "output_type": "stream", 51 | "text": [ 52 | "[[0.27114913 0.01077099 0.04565962]\n", 53 | " [0.32025427 0.95815259 0.584287 ]\n", 54 | " [0.11948037 0.0448686 0.23606443]\n", 55 | " [0.27441698 0.46897009 0.28487484]\n", 56 | " [0.1256479 0.05315585 0.78758538]]\n" 57 | ] 58 | } 59 | ], 60 | "source": [ 61 | "X = df.to_numpy()\n", 62 | "print(X[0:5])" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 6, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "# sols = svm.predict_proba(X)[:,1]\n", 72 | "# print(sols[:5])" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 7, 78 | "metadata": {}, 79 | "outputs": [ 80 | { 81 | "name": "stdout", 82 | "output_type": "stream", 83 | "text": [ 84 | "[0.14656961 0.97952294 0.17106152 0.71290015 0.28597205]\n" 85 | ] 86 | } 87 | ], 88 | "source": [ 89 | "sols = lr.predict_proba(X)[:,1]\n", 90 | "print(sols[:5])" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 8, 103 | "metadata": {}, 104 | "outputs": [ 105 | { 106 | "name": "stdout", 107 | "output_type": "stream", 108 | "text": [ 109 | " author gender\n", 110 | "0 --redbeard-- 0.146570\n", 111 | "1 -Allaina- 0.979523\n", 112 | "2 -AllonsyAlonso 0.171062\n", 113 | "3 -Beth- 0.712900\n", 114 | "4 -Greeny- 0.285972\n" 115 | ] 116 | } 117 | ], 118 | "source": [ 119 | "solution = pd.DataFrame({\"author\": subs[\"author\"].tolist(), \"gender\":sols})\n", 120 | "print(solution.head())" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 9, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [ 129 | "solution.to_csv(r\"Q:\\tooBigToDrive\\data-mining\\kaggle\\my_models\\spaCy\\results\\finals\\csv\\test\\lrSolution_adasyn.csv\", index = False)" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 10, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "# sols1 = [1 if s >= 0.5 else 0 for s in sols]\n", 139 | "# print(sols1[:5])" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 11, 145 | "metadata": {}, 146 | "outputs": [], 147 | "source": [ 148 | "# solution1 = pd.DataFrame({\"author\": subs[\"author\"].tolist(), \"gender\": sols1})\n", 149 | "# solution1.to_csv(r\"Q:\\tooBigToDrive\\data-mining\\kaggle\\my_models\\spaCy\\results\\finals\\csv\\test\\lrSolution_adasyn.csv\", index = False)" 150 | ] 151 | } 152 | ], 153 | "metadata": { 154 | "kernelspec": { 155 | "display_name": "Python [conda env:myEnv]", 
156 | "language": "python", 157 | "name": "conda-env-myEnv-py" 158 | }, 159 | "language_info": { 160 | "codemirror_mode": { 161 | "name": "ipython", 162 | "version": 3 163 | }, 164 | "file_extension": ".py", 165 | "mimetype": "text/x-python", 166 | "name": "python", 167 | "nbconvert_exporter": "python", 168 | "pygments_lexer": "ipython3", 169 | "version": "3.7.6" 170 | } 171 | }, 172 | "nbformat": 4, 173 | "nbformat_minor": 4 174 | } 175 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/finals/.ipynb_checkpoints/solution_bal-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 3, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd\n", 10 | "import joblib" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 4, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "subs = pd.read_csv(r\"subs_bal_test_predictions.csv\")\n", 20 | "bodieswS = pd.read_csv(r\"bodieswS_bal_test_predictions.csv\")\n", 21 | "W2v = pd.read_csv(r\"spacyW2v_bal_test_predictions.csv\")" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 5, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "lr = joblib.load(r\"Qbal_lr\\bal_lr.sav\")" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 6, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "df = pd.DataFrame({\"subs\": subs[\"pred_y\"].tolist(), \"bodieswS\": bodieswS[\"pred_y\"].tolist(), \"W2v\": W2v[\"pred_y\"].tolist()})" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 7, 45 | "metadata": {}, 46 | "outputs": [ 47 | { 48 | "name": "stdout", 49 | "output_type": "stream", 50 | "text": [ 51 | "[[0.48052195 0.09621675 0.53980164]\n", 52 | " [0.51025856 0.98378307 0.56349511]\n", 53 | " [0.26692441 0.34320322 0.47294487]\n", 54 | " [0.40504095 0.37116724 0.69072162]\n", 55 | " [0.43180564 0.83457643 0.59301484]]\n" 56 | ] 57 | } 58 | ], 59 | "source": [ 60 | "X = df.to_numpy()\n", 61 | "print(X[0:5])" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 8, 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "name": "stdout", 71 | "output_type": "stream", 72 | "text": [ 73 | "[0.01094408 0.62939482 0.030943 0.06394275 0.42612817]\n" 74 | ] 75 | } 76 | ], 77 | "source": [ 78 | "sols = lr.predict_proba(X)[:,1]\n", 79 | "print(sols[:5])" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 9, 85 | "metadata": {}, 86 | "outputs": [ 87 | { 88 | "name": "stdout", 89 | "output_type": "stream", 90 | "text": [ 91 | " author gender\n", 92 | "0 --redbeard-- 0.010944\n", 93 | "1 -Allaina- 0.629395\n", 94 | "2 -AllonsyAlonso 0.030943\n", 95 | "3 -Beth- 0.063943\n", 96 | "4 -Greeny- 0.426128\n" 97 | ] 98 | } 99 | ], 100 | "source": [ 101 | "solution = pd.DataFrame({\"author\": subs[\"author\"].tolist(), \"gender\":sols})\n", 102 | "print(solution.head())" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 10, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "solution.to_csv(r\"bal_lrSolution.csv\", index = False)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [] 120 | } 121 | ], 122 | "metadata": { 123 | "kernelspec": { 124 | "display_name": "Python [conda env:myEnv]", 125 | "language": 
"python", 126 | "name": "conda-env-myEnv-py" 127 | }, 128 | "language_info": { 129 | "codemirror_mode": { 130 | "name": "ipython", 131 | "version": 3 132 | }, 133 | "file_extension": ".py", 134 | "mimetype": "text/x-python", 135 | "name": "python", 136 | "nbconvert_exporter": "python", 137 | "pygments_lexer": "ipython3", 138 | "version": "3.7.6" 139 | } 140 | }, 141 | "nbformat": 4, 142 | "nbformat_minor": 4 143 | } 144 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/finals/ReadMe.md: -------------------------------------------------------------------------------- 1 | # Folder explanation 2 | 3 | These models are the same as the ones in intemrediate_models folder. They are here trained over all 5000 points and used to predict the test_data.csv. Also, these mdel arwe saved and some of them output the prediction distribution as an image. -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/finals/final_bal_lr.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 26, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd \n", 10 | "from sklearn.metrics import roc_curve, auc\n", 11 | "from sklearn.model_selection import train_test_split\n", 12 | "from sklearn.model_selection import KFold\n", 13 | "import matplotlib.pyplot as plt\n", 14 | "import numpy as np\n", 15 | "from sklearn.linear_model import LogisticRegression\n", 16 | "import joblib" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 27, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "subs = pd.read_csv(r\"subs_bal.csv\")\n", 26 | "W2v= pd.read_csv(r\"W2v_bal.csv\")\n", 27 | "bodieswSdrop = pd.read_csv(r\"bodieswSdrop_bal.csv\")" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 28, 33 | "metadata": {}, 34 | "outputs": [ 35 | { 36 | "name": "stdout", 37 | "output_type": "stream", 38 | "text": [ 39 | " true_y bodieswSdrop_y subs_y W2v_y\n", 40 | "0 0 0.004841 0.354758 0.505174\n", 41 | "1 0 0.959119 0.556274 0.217193\n", 42 | "2 0 0.124737 0.321295 0.338535\n", 43 | "3 0 0.975953 0.398921 0.818394\n", 44 | "4 0 0.978466 0.354308 0.656013\n" 45 | ] 46 | } 47 | ], 48 | "source": [ 49 | "df = pd.DataFrame({\"true_y\": bodieswSdrop[\"true_y\"].tolist(), \"bodieswSdrop_y\":bodieswSdrop[\"pred_y\"].tolist(), \"subs_y\": subs[\"pred_y\"].tolist(), \"W2v_y\": W2v[\"pred_y\"].tolist() })\n", 50 | "print(df.head(5))" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 29, 56 | "metadata": {}, 57 | "outputs": [ 58 | { 59 | "name": "stdout", 60 | "output_type": "stream", 61 | "text": [ 62 | " bodieswSdrop_y subs_y W2v_y\n", 63 | "0 0.004841 0.354758 0.505174\n", 64 | "1 0.959119 0.556274 0.217193\n", 65 | "2 0.124737 0.321295 0.338535\n", 66 | "3 0.975953 0.398921 0.818394\n", 67 | "4 0.978466 0.354308 0.656013\n" 68 | ] 69 | } 70 | ], 71 | "source": [ 72 | "X = df.loc[:, [\"bodieswSdrop_y\", \"subs_y\", \"W2v_y\"]] # \"bodieswSdrop_y\", \"subs_y\", \"W2v_y\" #, \"subs_y\", \"W2v_y\"\n", 73 | "print(X.head(5))\n", 74 | "y = df.true_y" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 30, 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "name": "stdout", 84 | "output_type": "stream", 85 | "text": [ 86 | "2842 2842 (2842, 3) (2842,)\n" 87 | ] 88 | } 89 | ], 90 | "source": [ 91 | "X = 
X.to_numpy()\n", 92 | "y = y.to_numpy()\n", 93 | "print(len(X), len(y), X.shape, y.shape)" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 31, 99 | "metadata": {}, 100 | "outputs": [ 101 | { 102 | "name": "stdout", 103 | "output_type": "stream", 104 | "text": [ 105 | "done 1\n", 106 | "done 1\n", 107 | "done 1\n", 108 | "done 1\n", 109 | "done 1\n", 110 | "done 1\n", 111 | "done 1\n", 112 | "done 1\n", 113 | "done 1\n", 114 | "done 1\n" 115 | ] 116 | } 117 | ], 118 | "source": [ 119 | "lrClf = LogisticRegression(C = 1) #modello\n", 120 | " \n", 121 | "kf = KFold(n_splits = 10, shuffle = True)\n", 122 | "\n", 123 | "for train_indices, test_indices in kf.split(X):\n", 124 | " lrClf.fit(X[train_indices], y[train_indices])\n", 125 | " print(\"done 1\")" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 32, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "data": { 135 | "text/plain": [ 136 | "['Q:\\\\tooBigToDrive\\\\data-mining\\\\kaggle\\\\my_models\\\\spaCy\\\\savedModels\\\\bal_lr\\\\bal_lr.sav']" 137 | ] 138 | }, 139 | "execution_count": 32, 140 | "metadata": {}, 141 | "output_type": "execute_result" 142 | } 143 | ], 144 | "source": [ 145 | "joblib.dump(lrClf , r\"bal_lr\\bal_lr.sav\")" 146 | ] 147 | } 148 | ], 149 | "metadata": { 150 | "kernelspec": { 151 | "display_name": "Python [conda env:myEnv]", 152 | "language": "python", 153 | "name": "conda-env-myEnv-py" 154 | }, 155 | "language_info": { 156 | "codemirror_mode": { 157 | "name": "ipython", 158 | "version": 3 159 | }, 160 | "file_extension": ".py", 161 | "mimetype": "text/x-python", 162 | "name": "python", 163 | "nbconvert_exporter": "python", 164 | "pygments_lexer": "ipython3", 165 | "version": "3.7.6" 166 | } 167 | }, 168 | "nbformat": 4, 169 | "nbformat_minor": 4 170 | } 171 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/finals/final_lr.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 5, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stderr", 10 | "output_type": "stream", 11 | "text": [ 12 | "Using TensorFlow backend.\n", 13 | "Q:\\anaconda\\envs\\myEnv\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 14 | " _np_qint8 = np.dtype([(\"qint8\", np.int8, 1)])\n", 15 | "Q:\\anaconda\\envs\\myEnv\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 16 | " _np_quint8 = np.dtype([(\"quint8\", np.uint8, 1)])\n", 17 | "Q:\\anaconda\\envs\\myEnv\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 18 | " _np_qint16 = np.dtype([(\"qint16\", np.int16, 1)])\n", 19 | "Q:\\anaconda\\envs\\myEnv\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 20 | " _np_quint16 = np.dtype([(\"quint16\", 
np.uint16, 1)])\n", 21 | "Q:\\anaconda\\envs\\myEnv\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 22 | " _np_qint32 = np.dtype([(\"qint32\", np.int32, 1)])\n", 23 | "Q:\\anaconda\\envs\\myEnv\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 24 | " np_resource = np.dtype([(\"resource\", np.ubyte, 1)])\n" 25 | ] 26 | } 27 | ], 28 | "source": [ 29 | "import pandas as pd \n", 30 | "from sklearn.metrics import roc_curve, auc\n", 31 | "from sklearn.model_selection import train_test_split\n", 32 | "from sklearn.model_selection import KFold\n", 33 | "import matplotlib.pyplot as plt\n", 34 | "import numpy as np\n", 35 | "from sklearn.linear_model import LogisticRegression\n", 36 | "import joblib\n", 37 | "from imblearn.over_sampling import ADASYN " 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 6, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "bodies = pd.read_csv(r\"bodies.csv\")\n", 47 | "bodieswS = pd.read_csv(r\"bodieswS.csv\")\n", 48 | "subs = pd.read_csv(r\"subs.csv\")\n", 49 | "W2v= pd.read_csv(r\"W2v.csv\")\n", 50 | "W2vwS = pd.read_csv(r\"W2vwS.csv\")\n", 51 | "bodieswSdrop = pd.read_csv(r\"bodieswSdrop.csv\")" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 7, 57 | "metadata": {}, 58 | "outputs": [ 59 | { 60 | "name": "stdout", 61 | "output_type": "stream", 62 | "text": [ 63 | " true_y bodies_y bodieswS_y subs_y W2v_y W2vwS_y bodieswSdrop_y\n", 64 | "0 0 0.094856 0.120936 0.193810 0.093180 0.060750 0.046726\n", 65 | "1 1 0.106738 0.099757 0.478376 0.178994 0.149258 0.044623\n", 66 | "2 0 0.549541 0.253948 0.338182 0.892713 0.913806 0.222466\n", 67 | "3 1 0.425894 0.838085 0.374291 0.856181 0.820273 0.900915\n", 68 | "4 0 0.553865 0.341898 0.284349 0.461019 0.478457 0.525452\n" 69 | ] 70 | } 71 | ], 72 | "source": [ 73 | "df = pd.DataFrame({\"true_y\": bodies[\"true_y\"].tolist(), \"bodies_y\":bodies[\"pred_y\"].tolist(), \"bodieswS_y\": bodieswS[\"pred_y\"].tolist(), \"subs_y\": subs[\"pred_y\"].tolist(), \"W2v_y\": W2v[\"pred_y\"].tolist(), \"W2vwS_y\": W2vwS[\"pred_y\"].tolist(), \"bodieswSdrop_y\":bodieswSdrop[\"pred_y\"].tolist() })\n", 74 | "print(df.head(5))" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 8, 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "name": "stdout", 84 | "output_type": "stream", 85 | "text": [ 86 | " bodieswSdrop_y subs_y W2v_y\n", 87 | "0 0.046726 0.193810 0.093180\n", 88 | "1 0.044623 0.478376 0.178994\n", 89 | "2 0.222466 0.338182 0.892713\n", 90 | "3 0.900915 0.374291 0.856181\n", 91 | "4 0.525452 0.284349 0.461019\n", 92 | "len(X) before adasyn: 1000 len(y_train) before adasyn: 1000 percentage before: 0.265\n", 93 | "len(X) after adasyn: 1467 len(y) after adasyn: 1467 percentage after: 0.49897750511247446\n" 94 | ] 95 | } 96 | ], 97 | "source": [ 98 | "X = df.loc[:, [\"bodieswSdrop_y\", \"subs_y\", \"W2v_y\"]] #, \"subs_y\", \"W2v_y\"\n", 99 | "print(X.head(5))\n", 100 | "y = df.true_y\n", 101 | "\n", 102 | "sm = ADASYN()\n", 103 | "print(\"len(X) before adasyn: \",len(X), \"len(y_train) before adasyn:\", len(y), \"percentage before: \", sum(y.tolist())/len(y.tolist()))\n", 104 | "X, y = 
sm.fit_sample(X, y)\n", 105 | "print(\"len(X) after adasyn: \",len(X), \"len(y) after adasyn:\", len(y), \"percentage after: \", sum(y.tolist())/len(y.tolist()))\n", 106 | "#sum(y_validation.tolist())/len(y_validation.tolist())\n" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 9, 112 | "metadata": {}, 113 | "outputs": [ 114 | { 115 | "name": "stdout", 116 | "output_type": "stream", 117 | "text": [ 118 | "1467 1467 (1467, 3) (1467,)\n" 119 | ] 120 | } 121 | ], 122 | "source": [ 123 | "X = X.to_numpy()\n", 124 | "y = y.to_numpy()\n", 125 | "print(len(X), len(y), X.shape, y.shape)" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 10, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "name": "stdout", 135 | "output_type": "stream", 136 | "text": [ 137 | "done 1\n", 138 | "done 1\n", 139 | "done 1\n", 140 | "done 1\n", 141 | "done 1\n", 142 | "done 1\n", 143 | "done 1\n", 144 | "done 1\n", 145 | "done 1\n", 146 | "done 1\n" 147 | ] 148 | } 149 | ], 150 | "source": [ 151 | "lrClf = LogisticRegression(C = 1) #modello\n", 152 | " \n", 153 | "kf = KFold(n_splits = 10)\n", 154 | "\n", 155 | "for train_indices, test_indices in kf.split(X):\n", 156 | " lrClf.fit(X[train_indices], y[train_indices])\n", 157 | " print(\"done 1\")\n", 158 | "# print(svm.score(x_train[test_indices], y_train[test_indices]))\n", 159 | "# y_scoreSVM = svm.predict_proba(x_validation)[:,1]" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 11, 165 | "metadata": {}, 166 | "outputs": [ 167 | { 168 | "data": { 169 | "text/plain": [ 170 | "['Q:\\\\tooBigToDrive\\\\data-mining\\\\kaggle\\\\my_models\\\\spaCy\\\\savedModels\\\\lr_adasyn\\\\lr_adasyn.sav']" 171 | ] 172 | }, 173 | "execution_count": 11, 174 | "metadata": {}, 175 | "output_type": "execute_result" 176 | } 177 | ], 178 | "source": [ 179 | "joblib.dump(lrClf , r\"lr_adasyn\\lr_adasyn.sav\")" 180 | ] 181 | } 182 | ], 183 | "metadata": { 184 | "kernelspec": { 185 | "display_name": "Python [conda env:myEnv]", 186 | "language": "python", 187 | "name": "conda-env-myEnv-py" 188 | }, 189 | "language_info": { 190 | "codemirror_mode": { 191 | "name": "ipython", 192 | "version": 3 193 | }, 194 | "file_extension": ".py", 195 | "mimetype": "text/x-python", 196 | "name": "python", 197 | "nbconvert_exporter": "python", 198 | "pygments_lexer": "ipython3", 199 | "version": "3.7.6" 200 | } 201 | }, 202 | "nbformat": 4, 203 | "nbformat_minor": 4 204 | } 205 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/finals/final_svm.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd \n", 10 | "from sklearn.metrics import roc_curve, auc\n", 11 | "from sklearn.model_selection import train_test_split\n", 12 | "from sklearn.model_selection import KFold\n", 13 | "import matplotlib.pyplot as plt\n", 14 | "import numpy as np\n", 15 | "from sklearn import svm\n", 16 | "import joblib" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 2, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "bodies = pd.read_csv(r\"bodies.csv\")\n", 26 | "bodieswS = pd.read_csv(r\"bodieswS.csv\")\n", 27 | "subs = pd.read_csv(r\"subs.csv\")\n", 28 | "W2v= pd.read_csv(r\"W2v.csv\")\n", 29 | "W2vwS = pd.read_csv(r\"W2vwS.csv\")\n", 30 | "bodieswSdrop = 
pd.read_csv(r\"bodieswSdrop.csv\")" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 3, 36 | "metadata": {}, 37 | "outputs": [ 38 | { 39 | "name": "stdout", 40 | "output_type": "stream", 41 | "text": [ 42 | " true_y bodies_y bodieswS_y subs_y W2v_y W2vwS_y bodieswSdrop_y\n", 43 | "0 0 0.094856 0.120936 0.193810 0.093180 0.060750 0.046726\n", 44 | "1 1 0.106738 0.099757 0.478376 0.178994 0.149258 0.044623\n", 45 | "2 0 0.549541 0.253948 0.338182 0.892713 0.913806 0.222466\n", 46 | "3 1 0.425894 0.838085 0.374291 0.856181 0.820273 0.900915\n", 47 | "4 0 0.553865 0.341898 0.284349 0.461019 0.478457 0.525452\n" 48 | ] 49 | } 50 | ], 51 | "source": [ 52 | "df = pd.DataFrame({\"true_y\": bodies[\"true_y\"].tolist(), \"bodies_y\":bodies[\"pred_y\"].tolist(), \"bodieswS_y\": bodieswS[\"pred_y\"].tolist(), \"subs_y\": subs[\"pred_y\"].tolist(), \"W2v_y\": W2v[\"pred_y\"].tolist(), \"W2vwS_y\": W2vwS[\"pred_y\"].tolist(), \"bodieswSdrop_y\":bodieswSdrop[\"pred_y\"].tolist() })\n", 53 | "print(df.head(5))" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 4, 59 | "metadata": {}, 60 | "outputs": [ 61 | { 62 | "name": "stdout", 63 | "output_type": "stream", 64 | "text": [ 65 | " bodieswSdrop_y subs_y W2v_y\n", 66 | "0 0.046726 0.193810 0.093180\n", 67 | "1 0.044623 0.478376 0.178994\n", 68 | "2 0.222466 0.338182 0.892713\n", 69 | "3 0.900915 0.374291 0.856181\n", 70 | "4 0.525452 0.284349 0.461019\n" 71 | ] 72 | } 73 | ], 74 | "source": [ 75 | "X = df.loc[:, [\"bodieswSdrop_y\", \"subs_y\", \"W2v_y\"]] #, \"subs_y\", \"W2v_y\"\n", 76 | "print(X.head(5))\n", 77 | "y = df.true_y" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 5, 83 | "metadata": {}, 84 | "outputs": [ 85 | { 86 | "name": "stdout", 87 | "output_type": "stream", 88 | "text": [ 89 | "1000 1000 (1000, 3) (1000,)\n" 90 | ] 91 | } 92 | ], 93 | "source": [ 94 | "X = X.to_numpy()\n", 95 | "y = y.to_numpy()\n", 96 | "print(len(X), len(y), X.shape, y.shape)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 6, 102 | "metadata": {}, 103 | "outputs": [ 104 | { 105 | "name": "stdout", 106 | "output_type": "stream", 107 | "text": [ 108 | "done 1\n", 109 | "done 1\n", 110 | "done 1\n", 111 | "done 1\n", 112 | "done 1\n", 113 | "done 1\n", 114 | "done 1\n", 115 | "done 1\n", 116 | "done 1\n", 117 | "done 1\n" 118 | ] 119 | } 120 | ], 121 | "source": [ 122 | "svm = svm.SVC(C=1.0, kernel='poly', degree=2, gamma='scale', coef0=0.0, shrinking=True, probability=True, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1,\n", 123 | " decision_function_shape='ovr', break_ties=False, random_state=None)\n", 124 | " \n", 125 | "kf = KFold(n_splits = 10)\n", 126 | "\n", 127 | "for train_indices, test_indices in kf.split(X):\n", 128 | " svm.fit(X[train_indices], y[train_indices])\n", 129 | " print(\"done 1\")\n", 130 | "# print(svm.score(x_train[test_indices], y_train[test_indices]))\n", 131 | "# y_scoreSVM = svm.predict_proba(x_validation)[:,1]" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 8, 137 | "metadata": {}, 138 | "outputs": [ 139 | { 140 | "data": { 141 | "text/plain": [ 142 | "['Q:\\\\tooBigToDrive\\\\data-mining\\\\kaggle\\\\my_models\\\\spaCy\\\\savedModels\\\\svm\\\\svm.sav']" 143 | ] 144 | }, 145 | "execution_count": 8, 146 | "metadata": {}, 147 | "output_type": "execute_result" 148 | } 149 | ], 150 | "source": [ 151 | "joblib.dump(svm , r\"svm\\svm.sav\")" 152 | ] 153 | } 154 | ], 155 | "metadata": { 156 | 
"kernelspec": { 157 | "display_name": "Python [conda env:myEnv]", 158 | "language": "python", 159 | "name": "conda-env-myEnv-py" 160 | }, 161 | "language_info": { 162 | "codemirror_mode": { 163 | "name": "ipython", 164 | "version": 3 165 | }, 166 | "file_extension": ".py", 167 | "mimetype": "text/x-python", 168 | "name": "python", 169 | "nbconvert_exporter": "python", 170 | "pygments_lexer": "ipython3", 171 | "version": "3.7.6" 172 | } 173 | }, 174 | "nbformat": 4, 175 | "nbformat_minor": 4 176 | } 177 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/finals/solution.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd\n", 10 | "import joblib" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 2, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "subs = pd.read_csv(r\"subs_test_predictions.csv\")\n", 20 | "bodieswS = pd.read_csv(r\"bodieswS_test_predictions.csv\")\n", 21 | "W2v = pd.read_csv(r\"spacyW2v_test_predictions.csv\")" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 3, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "#svm = joblib.load(r\"Q:\\tooBigToDrive\\data-mining\\kaggle\\my_models\\spaCy\\savedModels\\svm\\svm.sav\")\n", 31 | "lr = joblib.load(r\"Q:\\tooBigToDrive\\data-mining\\kaggle\\my_models\\spaCy\\savedModels\\lr_adasyn\\lr_adasyn.sav\")" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 4, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "df = pd.DataFrame({\"subs\": subs[\"pred_y\"].tolist(), \"bodieswS\": bodieswS[\"pred_y\"].tolist(), \"W2v\": W2v[\"pred_y\"].tolist()})" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 5, 46 | "metadata": {}, 47 | "outputs": [ 48 | { 49 | "name": "stdout", 50 | "output_type": "stream", 51 | "text": [ 52 | "[[0.27114913 0.01077099 0.04565962]\n", 53 | " [0.32025427 0.95815259 0.584287 ]\n", 54 | " [0.11948037 0.0448686 0.23606443]\n", 55 | " [0.27441698 0.46897009 0.28487484]\n", 56 | " [0.1256479 0.05315585 0.78758538]]\n" 57 | ] 58 | } 59 | ], 60 | "source": [ 61 | "X = df.to_numpy()\n", 62 | "print(X[0:5])" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 6, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "# sols = svm.predict_proba(X)[:,1]\n", 72 | "# print(sols[:5])" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 7, 78 | "metadata": {}, 79 | "outputs": [ 80 | { 81 | "name": "stdout", 82 | "output_type": "stream", 83 | "text": [ 84 | "[0.14656961 0.97952294 0.17106152 0.71290015 0.28597205]\n" 85 | ] 86 | } 87 | ], 88 | "source": [ 89 | "sols = lr.predict_proba(X)[:,1]\n", 90 | "print(sols[:5])" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 8, 103 | "metadata": {}, 104 | "outputs": [ 105 | { 106 | "name": "stdout", 107 | "output_type": "stream", 108 | "text": [ 109 | " author gender\n", 110 | "0 --redbeard-- 0.146570\n", 111 | "1 -Allaina- 0.979523\n", 112 | "2 -AllonsyAlonso 0.171062\n", 113 | "3 -Beth- 0.712900\n", 114 | "4 -Greeny- 0.285972\n" 115 | ] 116 | } 117 | ], 118 | "source": [ 119 | "solution = 
pd.DataFrame({\"author\": subs[\"author\"].tolist(), \"gender\":sols})\n", 120 | "print(solution.head())" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 9, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [ 129 | "solution.to_csv(r\"Q:\\tooBigToDrive\\data-mining\\kaggle\\my_models\\spaCy\\results\\finals\\csv\\test\\lrSolution_adasyn.csv\", index = False)" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 10, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "# sols1 = [1 if s >= 0.5 else 0 for s in sols]\n", 139 | "# print(sols1[:5])" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 11, 145 | "metadata": {}, 146 | "outputs": [], 147 | "source": [ 148 | "# solution1 = pd.DataFrame({\"author\": subs[\"author\"].tolist(), \"gender\": sols1})\n", 149 | "# solution1.to_csv(r\"Q:\\tooBigToDrive\\data-mining\\kaggle\\my_models\\spaCy\\results\\finals\\csv\\test\\lrSolution_adasyn.csv\", index = False)" 150 | ] 151 | } 152 | ], 153 | "metadata": { 154 | "kernelspec": { 155 | "display_name": "Python [conda env:myEnv]", 156 | "language": "python", 157 | "name": "conda-env-myEnv-py" 158 | }, 159 | "language_info": { 160 | "codemirror_mode": { 161 | "name": "ipython", 162 | "version": 3 163 | }, 164 | "file_extension": ".py", 165 | "mimetype": "text/x-python", 166 | "name": "python", 167 | "nbconvert_exporter": "python", 168 | "pygments_lexer": "ipython3", 169 | "version": "3.7.6" 170 | } 171 | }, 172 | "nbformat": 4, 173 | "nbformat_minor": 4 174 | } 175 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/finals/solution_bal.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 3, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pandas as pd\n", 10 | "import joblib" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 4, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "subs = pd.read_csv(r\"subs_bal_test_predictions.csv\")\n", 20 | "bodieswS = pd.read_csv(r\"bodieswS_bal_test_predictions.csv\")\n", 21 | "W2v = pd.read_csv(r\"spacyW2v_bal_test_predictions.csv\")" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 5, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "lr = joblib.load(r\"Qbal_lr\\bal_lr.sav\")" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 6, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "df = pd.DataFrame({\"subs\": subs[\"pred_y\"].tolist(), \"bodieswS\": bodieswS[\"pred_y\"].tolist(), \"W2v\": W2v[\"pred_y\"].tolist()})" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 7, 45 | "metadata": {}, 46 | "outputs": [ 47 | { 48 | "name": "stdout", 49 | "output_type": "stream", 50 | "text": [ 51 | "[[0.48052195 0.09621675 0.53980164]\n", 52 | " [0.51025856 0.98378307 0.56349511]\n", 53 | " [0.26692441 0.34320322 0.47294487]\n", 54 | " [0.40504095 0.37116724 0.69072162]\n", 55 | " [0.43180564 0.83457643 0.59301484]]\n" 56 | ] 57 | } 58 | ], 59 | "source": [ 60 | "X = df.to_numpy()\n", 61 | "print(X[0:5])" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 8, 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "name": "stdout", 71 | "output_type": "stream", 72 | "text": [ 73 | "[0.01094408 0.62939482 0.030943 0.06394275 0.42612817]\n" 74 | ] 75 
| } 76 | ], 77 | "source": [ 78 | "sols = lr.predict_proba(X)[:,1]\n", 79 | "print(sols[:5])" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 9, 85 | "metadata": {}, 86 | "outputs": [ 87 | { 88 | "name": "stdout", 89 | "output_type": "stream", 90 | "text": [ 91 | " author gender\n", 92 | "0 --redbeard-- 0.010944\n", 93 | "1 -Allaina- 0.629395\n", 94 | "2 -AllonsyAlonso 0.030943\n", 95 | "3 -Beth- 0.063943\n", 96 | "4 -Greeny- 0.426128\n" 97 | ] 98 | } 99 | ], 100 | "source": [ 101 | "solution = pd.DataFrame({\"author\": subs[\"author\"].tolist(), \"gender\":sols})\n", 102 | "print(solution.head())" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 10, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "solution.to_csv(r\"bal_lrSolution.csv\", index = False)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [] 120 | } 121 | ], 122 | "metadata": { 123 | "kernelspec": { 124 | "display_name": "Python [conda env:myEnv]", 125 | "language": "python", 126 | "name": "conda-env-myEnv-py" 127 | }, 128 | "language_info": { 129 | "codemirror_mode": { 130 | "name": "ipython", 131 | "version": 3 132 | }, 133 | "file_extension": ".py", 134 | "mimetype": "text/x-python", 135 | "name": "python", 136 | "nbconvert_exporter": "python", 137 | "pygments_lexer": "ipython3", 138 | "version": "3.7.6" 139 | } 140 | }, 141 | "nbformat": 4, 142 | "nbformat_minor": 4 143 | } 144 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/images/bodieswS_test_ensemble_balanced_e15_wS.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pitmonticone/RedditTextClassification/fdd8b3a6e649781df9147599889c4669517f65ab/Notebooks/other-attempts/spaCy/images/bodieswS_test_ensemble_balanced_e15_wS.png -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/images/bodieswS_test_ensemble_balanced_e3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pitmonticone/RedditTextClassification/fdd8b3a6e649781df9147599889c4669517f65ab/Notebooks/other-attempts/spaCy/images/bodieswS_test_ensemble_balanced_e3.png -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/intermediate_models/.ipynb_checkpoints/ReadMe-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Description\n", 8 | "\n", 9 | "Here there are the implementations of the models used to obtain the 1000 prediction on which we trained the logistic regeression" 10 | ] 11 | } 12 | ], 13 | "metadata": { 14 | "kernelspec": { 15 | "display_name": "Python [conda env:myEnv]", 16 | "language": "python", 17 | "name": "conda-env-myEnv-py" 18 | }, 19 | "language_info": { 20 | "codemirror_mode": { 21 | "name": "ipython", 22 | "version": 3 23 | }, 24 | "file_extension": ".py", 25 | "mimetype": "text/x-python", 26 | "name": "python", 27 | "nbconvert_exporter": "python", 28 | "pygments_lexer": "ipython3", 29 | "version": "3.7.6" 30 | } 31 | }, 32 | "nbformat": 4, 33 | "nbformat_minor": 4 34 | } 35 | -------------------------------------------------------------------------------- 
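To make the pipeline described in that ReadMe concrete: each intermediate model saves its held-out probabilities as a CSV with `pred_y`/`true_y` columns, and the `finals` notebooks stack those columns and fit a logistic-regression meta-model on them (see `final_lr.ipynb`). Below is a minimal sketch of that stacking step, assuming the per-model CSVs (`bodieswSdrop.csv`, `subs.csv`, `W2v.csv`) are available locally as in the notebooks; the `cross_val_score` call is a stand-in for the bare `KFold` refit loop used in the notebook and is only there to report a validation AUC before the final fit.

```python
# Hedged sketch of the stacking step: the per-model CSVs (pred_y/true_y columns)
# are the held-out predictions written by the intermediate models; paths are
# assumed to be local, as in the notebooks.
import pandas as pd
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

base_csvs = {"bodieswSdrop_y": "bodieswSdrop.csv",
             "subs_y": "subs.csv",
             "W2v_y": "W2v.csv"}

# One column of held-out probabilities per base model.
X = pd.DataFrame({name: pd.read_csv(path)["pred_y"]
                  for name, path in base_csvs.items()}).to_numpy()
y = pd.read_csv("bodieswSdrop.csv")["true_y"].to_numpy()  # every CSV carries the same targets

meta = LogisticRegression(C=1.0)

# Cross-validated AUC of the meta-model (the notebook instead refits inside a bare KFold loop).
scores = cross_val_score(meta, X, y, cv=10, scoring="roc_auc")
print("10-fold ROC AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))

meta.fit(X, y)                      # final fit on all stacked held-out predictions
joblib.dump(meta, "lr_meta.sav")    # placeholder name; the notebook saves lr_adasyn/lr_adasyn.sav
```

`final_lr.ipynb` additionally rebalances the classes with imblearn's `ADASYN()` before this step; its `sm.fit_sample(X, y)` call was renamed `fit_resample` in later imblearn releases, which is worth keeping in mind if the notebook is rerun in a fresh environment.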
/Notebooks/other-attempts/spaCy/intermediate_models/.ipynb_checkpoints/spactW2v-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# W2v\n", 8 | "\n", 9 | "A classification of W2v spaCy vectors, using scikit MLP" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": { 16 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", 17 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5" 18 | }, 19 | "outputs": [], 20 | "source": [ 21 | "from sklearn.preprocessing import StandardScaler # For scaling\n", 22 | "from sklearn.model_selection import train_test_split # for creating valid set and train set \n", 23 | "from sklearn.neural_network import MLPClassifier\n", 24 | "from sklearn.model_selection import KFold\n", 25 | "from sklearn.svm import SVC, LinearSVC\n", 26 | "import numpy as np\n", 27 | "import matplotlib.pyplot as plt\n", 28 | "from sklearn.metrics import roc_curve, auc\n", 29 | "import os\n", 30 | "from os import listdir\n", 31 | "from os.path import isfile, join\n", 32 | "from sklearn.decomposition import PCA\n", 33 | "import pandas as pd\n", 34 | "import math\n", 35 | "from sklearn.utils import shuffle" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": { 42 | "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0", 43 | "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a" 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "y = np.load(\"targets.npy\").tolist()\n", 48 | "files = listdir(\"npy5000/\")\n", 49 | "files = [f for f in files if f == \"lPunctNumStopLemOovAgg.npy\"]\n", 50 | "for f in files:\n", 51 | " for i in range(1):\n", 52 | " X = np.load(\"../input/mydata/npy5000/\"+f)\n", 53 | " i = 75\n", 54 | " pca = PCA(i)\n", 55 | " pca.fit(X)\n", 56 | " U = pca.transform(X)\n", 57 | " U = U.tolist()\n", 58 | " df = pd.DataFrame({\"vect\": U, \"gender\": y})\n", 59 | " # unbalnced \n", 60 | " seed = 100\n", 61 | " split = math.floor(len(df)*0.8)\n", 62 | " train_df = df.sample(split, random_state = 100)\n", 63 | " test_df = df.drop(train_df.index)\n", 64 | " x_train = np.array(train_df[\"vect\"].tolist())\n", 65 | " print(\"x_train.shape = \", x_train.shape)\n", 66 | " #print(\"x_train[0] = \", x_train[0])\n", 67 | " x_validation = np.array(test_df[\"vect\"].tolist())\n", 68 | " print(\"x_validation.shape = \", x_validation.shape)\n", 69 | " y_train = np.array(train_df[\"gender\"].tolist())\n", 70 | " print(\"y_train.shape = \", y_train.shape)\n", 71 | " y_validation = np.array(test_df[\"gender\"].tolist())\n", 72 | " print(\"y_validation.shape = \", y_validation.shape) \n", 73 | "\n", 74 | "# end of unbalanced\n", 75 | " \n", 76 | " # balanced part\n", 77 | "# U_m = df.loc[df[\"gender\"] == 0, :]\n", 78 | "# U_f = df.loc[df[\"gender\"] == 1, :]\n", 79 | "\n", 80 | "# split = math.floor(len(U_f)*0.8)\n", 81 | "# print(\"split = \",split)\n", 82 | "\n", 83 | "# seed = 100\n", 84 | "\n", 85 | "# train_data_sample_m = U_m.sample(n = split, random_state = seed)\n", 86 | "# train_vects_m =train_data_sample_m[\"vect\"].tolist()\n", 87 | "# test_data_sample_m = U_m.drop(train_data_sample_m.index)\n", 88 | "# #test_data_sample_m = test_data_sample_m.reset_index() \n", 89 | "# test_vects_m = test_data_sample_m[\"vect\"].tolist()\n", 90 | "\n", 91 | "# train_data_sample_f = U_f.sample(n = split, random_state = seed)\n", 92 | "# train_vects_f = 
train_data_sample_f[\"vect\"].tolist()\n", 93 | "# test_data_sample_f = U_f.drop(train_data_sample_f.index)\n", 94 | "# #test_data_sample_f = test_data_sample_f.reset_index() \n", 95 | "# test_vects_f = test_data_sample_f[\"vect\"].tolist()\n", 96 | "\n", 97 | "# train_vects = train_vects_m + train_vects_f\n", 98 | "# test_vects = test_vects_m + test_vects_f\n", 99 | "\n", 100 | "# train_labels = [0 for i in range(split)] + [1 for i in range(split)]\n", 101 | "# test_labels = [0 for i in range(len(U_m)-split)] + [1 for i in range(len(U_f)-split)]\n", 102 | "# x_train = np.array(train_vects)\n", 103 | "# print(\"x_train.shape = \", x_train.shape)\n", 104 | "# #print(\"x_train[0] = \", x_train[0])\n", 105 | "# x_validation = np.array(test_vects)\n", 106 | "# print(\"x_validation.shape = \", x_validation.shape)\n", 107 | "# y_train = np.array(train_labels)\n", 108 | "# print(\"y_train.shape = \", y_train.shape)\n", 109 | "# y_validation = np.array(test_labels)\n", 110 | "# print(\"y_validation.shape = \", y_validation.shape)\n", 111 | "# x_train, y_train = shuffle(x_train, y_train, random_state = 0)\n", 112 | " # end of balanced\n", 113 | " \n", 114 | " \n", 115 | " # model\n", 116 | " mlpClf = MLPClassifier(solver = 'adam', activation= 'relu' ,alpha = 0.02, verbose = False, early_stopping = True,\n", 117 | " learning_rate = 'invscaling', max_iter = 400)\n", 118 | "\n", 119 | " # Cross validation - 10 Fold \n", 120 | " kf = KFold(n_splits = 10)\n", 121 | "\n", 122 | " for train_indices, test_indices in kf.split(x_train):\n", 123 | " mlpClf.fit(x_train[train_indices], y_train[train_indices])\n", 124 | " print(mlpClf.score(x_train[test_indices], y_train[test_indices]))\n", 125 | " y_score = mlpClf.predict_proba(x_validation)[:,1]\n", 126 | " fpr, tpr, thresholds = roc_curve(y_validation, y_score)\n", 127 | " roc_auc = auc(fpr, tpr)\n", 128 | " roc = str(roc_auc)\n", 129 | " name = f.replace(\".npy\",\"\")+\"_\"+str(i)\n", 130 | " print(name+\" : \"+str(roc_auc))\n", 131 | " # with open( \"spacyW2vMlp\" + \".txt\", \"a\") as file: #name\n", 132 | " # file.write(\"\\t pca_\" +str(i)+ \" : \" + roc+\"\\n\")\n", 133 | " # file.close()\n", 134 | "\n", 135 | "\n", 136 | "# df_res = pd.DataFrame({\"pred_y\": y_score, \"true_y\":y_validation})\n", 137 | "# df_res.to_csv (r'../working/W2v.csv', index = False, header=True)\n", 138 | "\n", 139 | " " 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "### balanced results different pca's\n", 147 | "\n", 148 | "split = 1079
\n", 149 | "x_train.shape = (2158, 10)
\n", 150 | "x_validation.shape = (2842, 10)
\n", 151 | "y_train.shape = (2158,)
\n", 152 | "y_validation.shape = (2842,)
\n", 153 | "\n", 154 | "lPunctNumStopLemOovAgg_10 : 0.7773587350959046
\n", 155 | "lPunctNumStopLemOovAgg_20 : 0.7794985887909682
\n", 156 | "lPunctNumStopLemOovAgg_30 : 0.8106258280052993
\n", 157 | "lPunctNumStopLemOovAgg_40 : 0.8099159034617822
\n", 158 | "lPunctNumStopLemOovAgg_50 : 0.8292624272795345
\n", 159 | "lPunctNumStopLemOovAgg_60 : 0.8135677668337078
\n", 160 | "lPunctNumStopLemOovAgg_70 : 0.8236233511894476
\n", 161 | "lPunctNumStopLemOovAgg_75 : 0.8321510857669489
\n", 162 | "lPunctNumStopLemOovAgg_80 : 0.8347517424111515
\n", 163 | "lPunctNumStopLemOovAgg_85 : 0.7923204308507575
\n", 164 | "lPunctNumStopLemOovAgg_90 : 0.8280326594090203
\n", 165 | "lPunctNumStopLemOovAgg_100 : 0.8135144864927135
\n", 166 | "lPunctNumStopLemOovAgg_110 : 0.7994311963596568
\n", 167 | "lPunctNumStopLemOovAgg_120 : 0.8126332008524854
\n", 168 | "lPunctNumStopLemOovAgg_130 : 0.7794049881919244
\n", 169 | "lPunctNumStopLemOovAgg_140 : 0.8027504176026727
\n", 170 | "lPunctNumStopLemOovAgg_150 : 0.7867101549449917
\n", 171 | "lPunctNumStopLemOovAgg_160 : 0.8010440066816429
\n", 172 | "lPunctNumStopLemOovAgg_165 : 0.8039686653994587
\n", 173 | "lPunctNumStopLemOovAgg_170 : 0.8058219572605264
\n", 174 | "lPunctNumStopLemOovAgg_180 : 0.8310192385231266
\n", 175 | "lPunctNumStopLemOovAgg_190 : 0.8154628189620413
" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [ 184 | "# save results\n", 185 | "df_res = pd.DataFrame({\"pred_y\": y_score, \"true_y\":y_validation})\n", 186 | "df_res.to_csv (r'../working/W2v_bal.csv', index = False, header=True)" 187 | ] 188 | } 189 | ], 190 | "metadata": { 191 | "kernelspec": { 192 | "display_name": "Python 3", 193 | "language": "python", 194 | "name": "python3" 195 | }, 196 | "language_info": { 197 | "codemirror_mode": { 198 | "name": "ipython", 199 | "version": 3 200 | }, 201 | "file_extension": ".py", 202 | "mimetype": "text/x-python", 203 | "name": "python", 204 | "nbconvert_exporter": "python", 205 | "pygments_lexer": "ipython3", 206 | "version": "3.7.4" 207 | } 208 | }, 209 | "nbformat": 4, 210 | "nbformat_minor": 4 211 | } 212 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/intermediate_models/ReadMe.md: -------------------------------------------------------------------------------- 1 | # Description 2 | 3 | Here there are the implementations of the models used to obtain the 1000 prediction on which we trained the logistic regeression -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/intermediate_models/spactW2v.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# W2v\n", 8 | "\n", 9 | "A classification of W2v spaCy vectors, using scikit MLP" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": { 16 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", 17 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5" 18 | }, 19 | "outputs": [], 20 | "source": [ 21 | "from sklearn.preprocessing import StandardScaler # For scaling\n", 22 | "from sklearn.model_selection import train_test_split # for creating valid set and train set \n", 23 | "from sklearn.neural_network import MLPClassifier\n", 24 | "from sklearn.model_selection import KFold\n", 25 | "from sklearn.svm import SVC, LinearSVC\n", 26 | "import numpy as np\n", 27 | "import matplotlib.pyplot as plt\n", 28 | "from sklearn.metrics import roc_curve, auc\n", 29 | "import os\n", 30 | "from os import listdir\n", 31 | "from os.path import isfile, join\n", 32 | "from sklearn.decomposition import PCA\n", 33 | "import pandas as pd\n", 34 | "import math\n", 35 | "from sklearn.utils import shuffle" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": { 42 | "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0", 43 | "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a" 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "y = np.load(\"targets.npy\").tolist()\n", 48 | "files = listdir(\"npy5000/\")\n", 49 | "files = [f for f in files if f == \"lPunctNumStopLemOovAgg.npy\"]\n", 50 | "for f in files:\n", 51 | " for i in range(1):\n", 52 | " X = np.load(\"../input/mydata/npy5000/\"+f)\n", 53 | " i = 75\n", 54 | " pca = PCA(i)\n", 55 | " pca.fit(X)\n", 56 | " U = pca.transform(X)\n", 57 | " U = U.tolist()\n", 58 | " df = pd.DataFrame({\"vect\": U, \"gender\": y})\n", 59 | " # unbalnced \n", 60 | " seed = 100\n", 61 | " split = math.floor(len(df)*0.8)\n", 62 | " train_df = df.sample(split, random_state = 100)\n", 63 | " test_df = df.drop(train_df.index)\n", 64 | " x_train = 
np.array(train_df[\"vect\"].tolist())\n", 65 | " print(\"x_train.shape = \", x_train.shape)\n", 66 | " #print(\"x_train[0] = \", x_train[0])\n", 67 | " x_validation = np.array(test_df[\"vect\"].tolist())\n", 68 | " print(\"x_validation.shape = \", x_validation.shape)\n", 69 | " y_train = np.array(train_df[\"gender\"].tolist())\n", 70 | " print(\"y_train.shape = \", y_train.shape)\n", 71 | " y_validation = np.array(test_df[\"gender\"].tolist())\n", 72 | " print(\"y_validation.shape = \", y_validation.shape) \n", 73 | "\n", 74 | "# end of unbalanced\n", 75 | " \n", 76 | " # balanced part\n", 77 | "# U_m = df.loc[df[\"gender\"] == 0, :]\n", 78 | "# U_f = df.loc[df[\"gender\"] == 1, :]\n", 79 | "\n", 80 | "# split = math.floor(len(U_f)*0.8)\n", 81 | "# print(\"split = \",split)\n", 82 | "\n", 83 | "# seed = 100\n", 84 | "\n", 85 | "# train_data_sample_m = U_m.sample(n = split, random_state = seed)\n", 86 | "# train_vects_m =train_data_sample_m[\"vect\"].tolist()\n", 87 | "# test_data_sample_m = U_m.drop(train_data_sample_m.index)\n", 88 | "# #test_data_sample_m = test_data_sample_m.reset_index() \n", 89 | "# test_vects_m = test_data_sample_m[\"vect\"].tolist()\n", 90 | "\n", 91 | "# train_data_sample_f = U_f.sample(n = split, random_state = seed)\n", 92 | "# train_vects_f = train_data_sample_f[\"vect\"].tolist()\n", 93 | "# test_data_sample_f = U_f.drop(train_data_sample_f.index)\n", 94 | "# #test_data_sample_f = test_data_sample_f.reset_index() \n", 95 | "# test_vects_f = test_data_sample_f[\"vect\"].tolist()\n", 96 | "\n", 97 | "# train_vects = train_vects_m + train_vects_f\n", 98 | "# test_vects = test_vects_m + test_vects_f\n", 99 | "\n", 100 | "# train_labels = [0 for i in range(split)] + [1 for i in range(split)]\n", 101 | "# test_labels = [0 for i in range(len(U_m)-split)] + [1 for i in range(len(U_f)-split)]\n", 102 | "# x_train = np.array(train_vects)\n", 103 | "# print(\"x_train.shape = \", x_train.shape)\n", 104 | "# #print(\"x_train[0] = \", x_train[0])\n", 105 | "# x_validation = np.array(test_vects)\n", 106 | "# print(\"x_validation.shape = \", x_validation.shape)\n", 107 | "# y_train = np.array(train_labels)\n", 108 | "# print(\"y_train.shape = \", y_train.shape)\n", 109 | "# y_validation = np.array(test_labels)\n", 110 | "# print(\"y_validation.shape = \", y_validation.shape)\n", 111 | "# x_train, y_train = shuffle(x_train, y_train, random_state = 0)\n", 112 | " # end of balanced\n", 113 | " \n", 114 | " \n", 115 | " # model\n", 116 | " mlpClf = MLPClassifier(solver = 'adam', activation= 'relu' ,alpha = 0.02, verbose = False, early_stopping = True,\n", 117 | " learning_rate = 'invscaling', max_iter = 400)\n", 118 | "\n", 119 | " # Cross validation - 10 Fold \n", 120 | " kf = KFold(n_splits = 10)\n", 121 | "\n", 122 | " for train_indices, test_indices in kf.split(x_train):\n", 123 | " mlpClf.fit(x_train[train_indices], y_train[train_indices])\n", 124 | " print(mlpClf.score(x_train[test_indices], y_train[test_indices]))\n", 125 | " y_score = mlpClf.predict_proba(x_validation)[:,1]\n", 126 | " fpr, tpr, thresholds = roc_curve(y_validation, y_score)\n", 127 | " roc_auc = auc(fpr, tpr)\n", 128 | " roc = str(roc_auc)\n", 129 | " name = f.replace(\".npy\",\"\")+\"_\"+str(i)\n", 130 | " print(name+\" : \"+str(roc_auc))\n", 131 | " # with open( \"spacyW2vMlp\" + \".txt\", \"a\") as file: #name\n", 132 | " # file.write(\"\\t pca_\" +str(i)+ \" : \" + roc+\"\\n\")\n", 133 | " # file.close()\n", 134 | "\n", 135 | "\n", 136 | "# df_res = pd.DataFrame({\"pred_y\": y_score, 
\"true_y\":y_validation})\n", 137 | "# df_res.to_csv (r'../working/W2v.csv', index = False, header=True)\n", 138 | "\n", 139 | " " 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "### balanced results different pca's\n", 147 | "\n", 148 | "split = 1079
\n", 149 | "x_train.shape = (2158, 10)
\n", 150 | "x_validation.shape = (2842, 10)
\n", 151 | "y_train.shape = (2158,)
\n", 152 | "y_validation.shape = (2842,)
\n", 153 | "\n", 154 | "lPunctNumStopLemOovAgg_10 : 0.7773587350959046
\n", 155 | "lPunctNumStopLemOovAgg_20 : 0.7794985887909682
\n", 156 | "lPunctNumStopLemOovAgg_30 : 0.8106258280052993
\n", 157 | "lPunctNumStopLemOovAgg_40 : 0.8099159034617822
\n", 158 | "lPunctNumStopLemOovAgg_50 : 0.8292624272795345
\n", 159 | "lPunctNumStopLemOovAgg_60 : 0.8135677668337078
\n", 160 | "lPunctNumStopLemOovAgg_70 : 0.8236233511894476
\n", 161 | "lPunctNumStopLemOovAgg_75 : 0.8321510857669489
\n", 162 | "lPunctNumStopLemOovAgg_80 : 0.8347517424111515
\n", 163 | "lPunctNumStopLemOovAgg_85 : 0.7923204308507575
\n", 164 | "lPunctNumStopLemOovAgg_90 : 0.8280326594090203
\n", 165 | "lPunctNumStopLemOovAgg_100 : 0.8135144864927135
\n", 166 | "lPunctNumStopLemOovAgg_110 : 0.7994311963596568
\n", 167 | "lPunctNumStopLemOovAgg_120 : 0.8126332008524854
\n", 168 | "lPunctNumStopLemOovAgg_130 : 0.7794049881919244
\n", 169 | "lPunctNumStopLemOovAgg_140 : 0.8027504176026727
\n", 170 | "lPunctNumStopLemOovAgg_150 : 0.7867101549449917
\n", 171 | "lPunctNumStopLemOovAgg_160 : 0.8010440066816429
\n", 172 | "lPunctNumStopLemOovAgg_165 : 0.8039686653994587
\n", 173 | "lPunctNumStopLemOovAgg_170 : 0.8058219572605264
\n", 174 | "lPunctNumStopLemOovAgg_180 : 0.8310192385231266
\n", 175 | "lPunctNumStopLemOovAgg_190 : 0.8154628189620413
" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [ 184 | "# save results\n", 185 | "df_res = pd.DataFrame({\"pred_y\": y_score, \"true_y\":y_validation})\n", 186 | "df_res.to_csv (r'../working/W2v_bal.csv', index = False, header=True)" 187 | ] 188 | } 189 | ], 190 | "metadata": { 191 | "kernelspec": { 192 | "display_name": "Python 3", 193 | "language": "python", 194 | "name": "python3" 195 | }, 196 | "language_info": { 197 | "codemirror_mode": { 198 | "name": "ipython", 199 | "version": 3 200 | }, 201 | "file_extension": ".py", 202 | "mimetype": "text/x-python", 203 | "name": "python", 204 | "nbconvert_exporter": "python", 205 | "pygments_lexer": "ipython3", 206 | "version": "3.7.4" 207 | } 208 | }, 209 | "nbformat": 4, 210 | "nbformat_minor": 4 211 | } 212 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/bow_bal_lPunctAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | bow_bal_lPunctAgg 2 | epoch = 0, losses = {'textcat': 9.083782873424466}, roc = 0.9084506940844422 3 | epoch = 1, losses = {'textcat': 12.229801848536603}, roc = 0.88266228903865 4 | epoch = 2, losses = {'textcat': 13.809510084057532}, roc = 0.9050421922700305 5 | epoch = 3, losses = {'textcat': 14.714626950012775}, roc = 0.91474713438166 6 | epoch = 4, losses = {'textcat': 15.311987071944198}, roc = 0.9039636253672024 7 | epoch = 5, losses = {'textcat': 15.714608116661427}, roc = 0.9031608202292495 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/bow_bal_lPunctNumAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | bow_bal_lPunctNumAgg 2 | epoch = 0, losses = {'textcat': 9.070996835451297}, roc = 0.8518705719716606 3 | epoch = 1, losses = {'textcat': 12.505073593091929}, roc = 0.843479638269685 4 | epoch = 2, losses = {'textcat': 14.681906531202715}, roc = 0.8455186913196244 5 | epoch = 3, losses = {'textcat': 15.9343872602596}, roc = 0.8499308795576291 6 | epoch = 4, losses = {'textcat': 16.56670094780488}, roc = 0.8508596855019872 7 | epoch = 5, losses = {'textcat': 16.947070740996637}, roc = 0.848119347963827 8 | epoch = 6, losses = {'textcat': 17.27540420358229}, roc = 0.8479767870514371 9 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/bow_dlPunctNumStopLemOovAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | bow_dlPunctNumStopLemOovAgg 2 | epoch = 0, losses = {'textcat': 8.904687333793845}, roc = 0.8557315879618302 3 | epoch = 1, losses = {'textcat': 12.338626464243827}, roc = 0.8615503425495473 4 | epoch = 2, losses = {'textcat': 14.526519826557568}, roc = 0.8663368607780768 5 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/bow_lPunctAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctAgg 2 | epoch = 0, losses = {'textcat': 10.300575028954164}, roc = 0.8472442561930431 3 | epoch = 1, losses = {'textcat': 15.44146652551996}, roc = 0.8600077011936851 4 | epoch = 2, losses = {'textcat': 18.253285435726045}, roc = 0.8596842510589141 5 | 
epoch = 3, losses = {'textcat': 20.024297386326218}, roc = 0.8545860608394301 6 | epoch = 4, losses = {'textcat': 21.157677383200436}, roc = 0.8565370299063022 7 | epoch = 5, losses = {'textcat': 22.03504304502588}, roc = 0.8572660762418175 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/bow_lPunctNumAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctNumAgg 2 | epoch = 0, losses = {'textcat': 10.316491311206551}, roc = 0.8487459889616223 3 | epoch = 1, losses = {'textcat': 15.3834985842316}, roc = 0.8674675908099089 4 | epoch = 2, losses = {'textcat': 18.139316414716088}, roc = 0.8616275189321012 5 | epoch = 3, losses = {'textcat': 19.851637275560474}, roc = 0.8596585804132975 6 | epoch = 4, losses = {'textcat': 20.907842139506286}, roc = 0.8587010653317931 7 | epoch = 5, losses = {'textcat': 21.722357641635657}, roc = 0.8530252855859325 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/bow_lPunctNumLemAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctNumLemAgg 2 | epoch = 0, losses = {'textcat': 10.29036844101829}, roc = 0.8553972532409191 3 | epoch = 1, losses = {'textcat': 15.721021297912802}, roc = 0.8621486330381211 4 | epoch = 2, losses = {'textcat': 18.663440811308213}, roc = 0.8572506738544474 5 | epoch = 3, losses = {'textcat': 20.400520129909953}, roc = 0.8703195995379285 6 | epoch = 4, losses = {'textcat': 21.658334924989223}, roc = 0.8669901168014378 7 | epoch = 5, losses = {'textcat': 22.603086117967706}, roc = 0.8675908099088694 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/bow_lPunctNumLemOovAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctNumLemOovAgg 2 | epoch = 0, losses = {'textcat': 10.197543423597457}, roc = 0.8483275574380695 3 | epoch = 1, losses = {'textcat': 15.81021447241982}, roc = 0.8581542805801566 4 | epoch = 2, losses = {'textcat': 19.228670098632225}, roc = 0.8616968296752663 5 | epoch = 3, losses = {'textcat': 21.795570858355152}, roc = 0.8520806058272367 6 | epoch = 4, losses = {'textcat': 23.384679663050104}, roc = 0.8497907842382235 7 | epoch = 5, losses = {'textcat': 24.688360156274754}, roc = 0.8550686689770248 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/bow_lPunctNumOovAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctNumOovAgg 2 | epoch = 0, losses = {'textcat': 10.606671018825214}, roc = 0.840441535104608 3 | epoch = 1, losses = {'textcat': 16.482640654656333}, roc = 0.8509151585162368 4 | epoch = 2, losses = {'textcat': 19.598626431889233}, roc = 0.8592555512771146 5 | epoch = 3, losses = {'textcat': 21.42554874333712}, roc = 0.865665511487614 6 | epoch = 4, losses = {'textcat': 22.87796032469087}, roc = 0.8701116673084329 7 | epoch = 5, losses = {'textcat': 23.912508471060537}, roc = 0.8683788987293031 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/bow_lPunctNumPersAgg-checkpoint.txt: 
-------------------------------------------------------------------------------- 1 | bow_lPunctNumPersAgg 2 | epoch = 0, losses = {'textcat': 9.659992030909795}, roc = 0.8598382749326147 3 | epoch = 1, losses = {'textcat': 13.254407076947524}, roc = 0.8739340264407651 4 | epoch = 2, losses = {'textcat': 15.111549782759436}, roc = 0.8792940572455397 5 | epoch = 3, losses = {'textcat': 15.997021635469657}, roc = 0.877930945963291 6 | epoch = 4, losses = {'textcat': 16.575501656098304}, roc = 0.8770196380438967 7 | epoch = 5, losses = {'textcat': 16.928584417800174}, roc = 0.8734591194968553 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/bow_lPunctNumPersLemAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctNumPersLemAgg 2 | epoch = 0, losses = {'textcat': 9.383625615081417}, roc = 0.8685303555384417 3 | epoch = 1, losses = {'textcat': 13.28497734584512}, roc = 0.8756565267616481 4 | epoch = 2, losses = {'textcat': 15.407424367942612}, roc = 0.8725272750609678 5 | epoch = 3, losses = {'textcat': 16.60949956353454}, roc = 0.8703658067000386 6 | epoch = 4, losses = {'textcat': 17.3632590999101}, roc = 0.8677140290078296 7 | epoch = 5, losses = {'textcat': 17.834758405944516}, roc = 0.8680580156590938 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/bow_lPunctNumPersLemOovAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctNumPersLemOovAgg 2 | epoch = 0, losses = {'textcat': 9.513648208222001}, roc = 0.8517058144012322 3 | epoch = 1, losses = {'textcat': 13.863807656152193}, roc = 0.864453857014504 4 | epoch = 2, losses = {'textcat': 16.46012714357848}, roc = 0.8673289693235784 5 | epoch = 3, losses = {'textcat': 18.004219502897023}, roc = 0.8680041073032987 6 | epoch = 4, losses = {'textcat': 19.085807765813307}, roc = 0.8678192786548582 7 | epoch = 5, losses = {'textcat': 19.84717152449107}, roc = 0.8641509433962263 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/bow_lPunctNumStopLemAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctNumStopLemAgg 2 | epoch = 0, losses = {'textcat': 8.883919595161183}, roc = 0.8709690668720317 3 | epoch = 1, losses = {'textcat': 12.155519061331876}, roc = 0.8812706969580286 4 | epoch = 2, losses = {'textcat': 13.865362251984102}, roc = 0.8741060197663971 5 | epoch = 3, losses = {'textcat': 14.811687069882568}, roc = 0.8711564625850341 6 | epoch = 4, losses = {'textcat': 15.499547603402688}, roc = 0.8669721473495058 7 | epoch = 5, losses = {'textcat': 15.88002811915746}, roc = 0.862133230650751 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/bow_lPunctNumStopLemOovAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctNumStopLemOovAgg 2 | epoch = 0, losses = {'textcat': 8.991991002214835}, roc = 0.8685611603131819 3 | epoch = 1, losses = {'textcat': 12.712942150187756}, roc = 0.8808060582723656 4 | epoch = 2, losses = {'textcat': 14.934007483348498}, roc = 0.8775561545372866 5 | epoch = 3, losses = {'textcat': 16.26776459135772}, roc = 
0.8723527146707739 6 | epoch = 4, losses = {'textcat': 17.113776076989428}, roc = 0.868710050057759 7 | epoch = 5, losses = {'textcat': 17.63977204713988}, roc = 0.8639199075856757 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/bow_lPunctNumStopOovAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctNumStopOovAgg 2 | epoch = 0, losses = {'textcat': 9.135657884018336}, roc = 0.8800667436786035 3 | epoch = 1, losses = {'textcat': 12.349658746887858}, roc = 0.8848029777948915 4 | epoch = 2, losses = {'textcat': 14.074350827866946}, roc = 0.8828571428571428 5 | epoch = 3, losses = {'textcat': 14.991081261942854}, roc = 0.8824207418816584 6 | epoch = 4, losses = {'textcat': 15.623346206312354}, roc = 0.8787061994609164 7 | epoch = 5, losses = {'textcat': 16.009936011346397}, roc = 0.8759440379925556 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/ensemble_bal_lPunctAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | ensemble_bal_lPunctAgg 2 | epoch = 0, losses = {'textcat': 10.771006600931287}, roc = 0.7958095731812684 3 | epoch = 1, losses = {'textcat': 17.779007678705966}, roc = 0.8307917170669892 4 | epoch = 2, losses = {'textcat': 21.071200552220034}, roc = 0.8122213582166926 5 | epoch = 3, losses = {'textcat': 22.43127129951489}, roc = 0.7965295777892979 6 | epoch = 4, losses = {'textcat': 23.157470632704094}, roc = 0.8051638730487874 7 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/ensemble_bal_lPunctNumStopLemOovAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | ensemble_bal_lPunctNumStopLemOovAgg 2 | epoch = 0, losses = {'textcat': 10.73626277083531}, roc = 0.8194502044813087 3 | epoch = 1, losses = {'textcat': 17.72877553733997}, roc = 0.8599965439778816 4 | epoch = 2, losses = {'textcat': 21.301636069205415}, roc = 0.8570013248084787 5 | epoch = 3, losses = {'textcat': 22.83226279049137}, roc = 0.852296814699614 6 | epoch = 4, losses = {'textcat': 23.535492210306984}, roc = 0.8362507920050689 7 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/ensemble_dlPunctNumLemOovAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | bow_dlPunctNumLemOovAgg 2 | epoch = 0, losses = {'textcat': 11.14930248935707}, roc = 0.8058885898376968 3 | epoch = 1, losses = {'textcat': 18.137150909838965}, roc = 0.870167604599951 4 | epoch = 2, losses = {'textcat': 21.942909762162685}, roc = 0.8624806296386918 5 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/ensemble_dlPunctNumStopLemOovAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | ensemble_bal_dlPunctNumStopLemOovAgg 2 | epoch = 0, losses = {'textcat': 10.573204169631936}, roc = 0.8677208221189137 3 | epoch = 1, losses = {'textcat': 16.051280780535308}, roc = 0.8713655085229588 4 | epoch = 2, losses = {'textcat': 19.00531985885982}, roc = 0.8542482260827013 5 | 
-------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/ensemble_lPunctAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctAgg epoch = 0, losses = {'textcat': 10.729639297001995}, roc = 0.842746759080991 2 | epoch = 1, losses = {'textcat': 17.388430001898087}, roc = 0.8567423950712361 3 | epoch = 2, losses = {'textcat': 21.088454915356124}, roc = 0.8632678731870107 4 | epoch = 3, losses = {'textcat': 22.955513816137795}, roc = 0.846782184571942 5 | epoch = 4, losses = {'textcat': 23.7844230104636}, roc = 0.8460377358490567 6 | epoch = 5, losses = {'textcat': 24.454516666314753}, roc = 0.8463355153382108 7 | epoch = 6, losses = {'textcat': 24.73753832080388}, roc = 0.8378025927352073 8 | epoch = 7, losses = {'textcat': 25.07438993472748}, roc = 0.8332126812989348 9 | epoch = 8, losses = {'textcat': 25.190393376835182}, roc = 0.8299011680143754 10 | epoch = 9, losses = {'textcat': 25.372851968512258}, roc = 0.8320677705044282 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/ensemble_lPunctNumAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctNumAgg epoch = 0, losses = {'textcat': 10.849251543346327}, roc = 0.8439994865870877 2 | epoch = 1, losses = {'textcat': 17.44105708837742}, roc = 0.8709896033885252 3 | epoch = 2, losses = {'textcat': 21.194835011254327}, roc = 0.8760467205750225 4 | epoch = 3, losses = {'textcat': 23.208214861448596}, roc = 0.8752714670773971 5 | epoch = 4, losses = {'textcat': 24.236906560875724}, roc = 0.8654550121935566 6 | epoch = 5, losses = {'textcat': 24.778005035438593}, roc = 0.8700346553715826 7 | epoch = 6, losses = {'textcat': 25.133107319834863}, roc = 0.8669747144140675 8 | epoch = 7, losses = {'textcat': 25.325295451989195}, roc = 0.8677037607495829 9 | epoch = 8, losses = {'textcat': 25.620281110071257}, roc = 0.8670722628674111 10 | epoch = 9, losses = {'textcat': 25.646207855614715}, roc = 0.8636786035168784 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/ensemble_lPunctNumLemAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctNumLemAgg epoch = 0, losses = {'textcat': 10.678632560709957}, roc = 0.8591400333718393 2 | epoch = 1, losses = {'textcat': 17.103498125987244}, roc = 0.8714414067513797 3 | epoch = 2, losses = {'textcat': 20.854591814086234}, roc = 0.8708920549351816 4 | epoch = 3, losses = {'textcat': 22.87714549644079}, roc = 0.8660043640097548 5 | epoch = 4, losses = {'textcat': 23.852497358807643}, roc = 0.8584161211654473 6 | epoch = 5, losses = {'textcat': 24.415554903051355}, roc = 0.8565627005519187 7 | epoch = 6, losses = {'textcat': 24.75346492489465}, roc = 0.8536721858554742 8 | epoch = 7, losses = {'textcat': 24.919653205470567}, roc = 0.8565832370684123 9 | epoch = 8, losses = {'textcat': 24.99600611099764}, roc = 0.8565113592606854 10 | epoch = 9, losses = {'textcat': 25.150001712737367}, roc = 0.8512129380053909 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/ensemble_lPunctNumLemOovAgg-checkpoint.txt: 
-------------------------------------------------------------------------------- 1 | ensemble_lPunctNumLemOovAgg epoch = 0, losses = {'textcat': 10.632765143818688}, roc = 0.8680785521755873 2 | epoch = 1, losses = {'textcat': 16.603710685754777}, roc = 0.8756873315363882 3 | epoch = 2, losses = {'textcat': 20.098456279241873}, roc = 0.8748402002310357 4 | epoch = 3, losses = {'textcat': 22.197325268262148}, roc = 0.8670260557053011 5 | epoch = 4, losses = {'textcat': 23.376831758485014}, roc = 0.8527788473880118 6 | epoch = 5, losses = {'textcat': 24.049448740595956}, roc = 0.8547862918752405 7 | epoch = 6, losses = {'textcat': 24.454368137002774}, roc = 0.8548992427159544 8 | epoch = 7, losses = {'textcat': 24.841723376879056}, roc = 0.8579643178025927 9 | epoch = 8, losses = {'textcat': 24.95344296212421}, roc = 0.8519573867282763 10 | epoch = 9, losses = {'textcat': 25.03913710188129}, roc = 0.8460582723655501 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/ensemble_lPunctNumOovAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctNumOovAgg epoch = 0, losses = {'textcat': 10.849000597256236}, roc = 0.8273905788730586 2 | epoch = 1, losses = {'textcat': 17.615710976722767}, roc = 0.8691618534206136 3 | epoch = 2, losses = {'textcat': 21.519193997845832}, roc = 0.8813810807341804 4 | epoch = 3, losses = {'textcat': 23.555163298897355}, roc = 0.8833423180592992 5 | epoch = 4, losses = {'textcat': 24.592904370654782}, roc = 0.8853908355795148 6 | epoch = 5, losses = {'textcat': 25.232358423387602}, roc = 0.8834604030291362 7 | epoch = 6, losses = {'textcat': 25.71876220397092}, roc = 0.8715902964959569 8 | epoch = 7, losses = {'textcat': 25.852575093551987}, roc = 0.8668823000898472 9 | epoch = 8, losses = {'textcat': 26.09547118551534}, roc = 0.8692619689385188 10 | epoch = 9, losses = {'textcat': 26.272543905184413}, roc = 0.8659581568476448 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/ensemble_lPunctNumPersAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctNumPersAgg epoch = 0, losses = {'textcat': 10.358336042787414}, roc = 0.8718983442433577 2 | epoch = 1, losses = {'textcat': 15.565269033104414}, roc = 0.8941599281221924 3 | epoch = 2, losses = {'textcat': 17.932983758057844}, roc = 0.8852214093184444 4 | epoch = 3, losses = {'textcat': 19.04217331202392}, roc = 0.8730381209087408 5 | epoch = 4, losses = {'textcat': 19.687709225343976}, roc = 0.874799127198049 6 | epoch = 5, losses = {'textcat': 20.034264428140364}, roc = 0.8718829418559876 7 | epoch = 6, losses = {'textcat': 20.304004829154785}, roc = 0.870219484020023 8 | epoch = 7, losses = {'textcat': 20.454276567252627}, roc = 0.8683506610191246 9 | epoch = 8, losses = {'textcat': 20.554914588447136}, roc = 0.8717391862405339 10 | epoch = 9, losses = {'textcat': 20.633671653256233}, roc = 0.8760826594788859 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/ensemble_lPunctNumPersLemAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctNumPersLemAgg epoch = 0, losses = {'textcat': 10.325701178051531}, roc = 0.8664304967269927 2 | epoch = 1, losses 
= {'textcat': 15.913291760210996}, roc = 0.8915158516236684 3 | epoch = 2, losses = {'textcat': 18.7300274250465}, roc = 0.8803131818765243 4 | epoch = 3, losses = {'textcat': 20.120089923865635}, roc = 0.8712360415864459 5 | epoch = 4, losses = {'textcat': 20.804775718720222}, roc = 0.8688127326402258 6 | epoch = 5, losses = {'textcat': 21.14956892632512}, roc = 0.8665588499550764 7 | epoch = 6, losses = {'textcat': 21.423461501962542}, roc = 0.8723090745732256 8 | epoch = 7, losses = {'textcat': 21.56527505127891}, roc = 0.8700295212424592 9 | epoch = 8, losses = {'textcat': 21.688171283806636}, roc = 0.8668258246694905 10 | epoch = 9, losses = {'textcat': 21.79666106960388}, roc = 0.8704813246053137 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/ensemble_lPunctNumPersLemOovAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctNumPersLemOovAgg epoch = 0, losses = {'textcat': 10.597144522122107}, roc = 0.8578462328327556 2 | epoch = 1, losses = {'textcat': 16.965219413206796}, roc = 0.8762982929020665 3 | epoch = 2, losses = {'textcat': 20.944239850628946}, roc = 0.8677756385573098 4 | epoch = 3, losses = {'textcat': 23.205930521911796}, roc = 0.8661275831087152 5 | epoch = 4, losses = {'textcat': 24.416154009721595}, roc = 0.8681504299833142 6 | epoch = 5, losses = {'textcat': 25.26161684111277}, roc = 0.8606956744962136 7 | epoch = 6, losses = {'textcat': 25.85344494000153}, roc = 0.8537440636632012 8 | epoch = 7, losses = {'textcat': 26.17937598666605}, roc = 0.8558952637658837 9 | epoch = 8, losses = {'textcat': 26.458660723634825}, roc = 0.8617533050956231 10 | epoch = 9, losses = {'textcat': 26.5519901394971}, roc = 0.8586676934924914 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/ensemble_lPunctNumStopLemAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctNumStopLemAgg epoch = 0, losses = {'textcat': 10.419062311004382}, roc = 0.8478706199460917 2 | epoch = 1, losses = {'textcat': 16.18246567517781}, roc = 0.8796816839943525 3 | epoch = 2, losses = {'textcat': 19.30511438575013}, roc = 0.8775920934411501 4 | epoch = 3, losses = {'textcat': 20.81128679516405}, roc = 0.8779412142215377 5 | epoch = 4, losses = {'textcat': 21.478999641421396}, roc = 0.8684687459889617 6 | epoch = 5, losses = {'textcat': 21.86197143290219}, roc = 0.8719034783724811 7 | epoch = 6, losses = {'textcat': 21.966874151151934}, roc = 0.866907970735464 8 | epoch = 7, losses = {'textcat': 22.130305263283137}, roc = 0.865177769220896 9 | epoch = 8, losses = {'textcat': 22.24113396179771}, roc = 0.8653112565781029 10 | epoch = 9, losses = {'textcat': 22.4314443554733}, roc = 0.8684841483763316 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/ensemble_lPunctNumStopLemOovAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctNumStopLemOovAgg epoch = 0, losses = {'textcat': 10.346970351412892}, roc = 0.8683763316647415 2 | epoch = 1, losses = {'textcat': 15.972997819677403}, roc = 0.9038069567449621 3 | epoch = 2, losses = {'textcat': 18.941932999511664}, roc = 0.8922859709921704 4 | epoch = 3, losses = {'textcat': 20.461612598504217}, 
roc = 0.8880811192401488 5 | epoch = 4, losses = {'textcat': 21.24935047177315}, roc = 0.8843691438839687 6 | epoch = 5, losses = {'textcat': 21.649178645025486}, roc = 0.8818636888717752 7 | epoch = 6, losses = {'textcat': 21.950326816203656}, roc = 0.8808984725965858 8 | epoch = 7, losses = {'textcat': 22.071821111654856}, roc = 0.8750198947503529 9 | epoch = 8, losses = {'textcat': 22.14719987215875}, roc = 0.8718213323065076 10 | epoch = 9, losses = {'textcat': 22.282656720771637}, roc = 0.8829880631497883 11 | ensemble_lPunctNumStopLemOovAgg 12 | epoch = 0, losses = {'textcat': 10.490000442718156}, roc = 0.8498061866255937 13 | epoch = 1, losses = {'textcat': 16.547183903574478}, roc = 0.8772121678860224 14 | epoch = 2, losses = {'textcat': 19.78883196215702}, roc = 0.8746707739699653 15 | epoch = 3, losses = {'textcat': 21.212445096770466}, roc = 0.8524759337697343 16 | epoch = 4, losses = {'textcat': 21.841305388159622}, roc = 0.8331664741368244 17 | epoch = 5, losses = {'textcat': 22.164312648375436}, roc = 0.8297214734950583 18 | epoch = 6, losses = {'textcat': 22.348416818236934}, roc = 0.826420228468746 19 | epoch = 7, losses = {'textcat': 22.57289976127649}, roc = 0.8245514054678474 20 | epoch = 8, losses = {'textcat': 22.755370378420064}, roc = 0.8239917853934027 21 | epoch = 9, losses = {'textcat': 22.91652452440097}, roc = 0.8339417276344502 22 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/.ipynb_checkpoints/ensemble_lPunctNumStopOovAgg-checkpoint.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctNumStopOovAgg epoch = 0, losses = {'textcat': 10.528761210793164}, roc = 0.8673289693235784 2 | epoch = 1, losses = {'textcat': 16.21045877024153}, roc = 0.889338980875369 3 | epoch = 2, losses = {'textcat': 19.006704837200232}, roc = 0.8886818123475806 4 | epoch = 3, losses = {'textcat': 20.306096649514416}, roc = 0.87391092285971 5 | epoch = 4, losses = {'textcat': 20.95130760752823}, roc = 0.8736644846617894 6 | epoch = 5, losses = {'textcat': 21.27921311019021}, roc = 0.8782441278398152 7 | epoch = 6, losses = {'textcat': 21.510209205060605}, roc = 0.8773328199204212 8 | epoch = 7, losses = {'textcat': 21.645089315511605}, roc = 0.8748350661019125 9 | epoch = 8, losses = {'textcat': 21.741663373617406}, roc = 0.8733923758182519 10 | epoch = 9, losses = {'textcat': 21.76332084200426}, roc = 0.8705532024130407 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_bal_lPunctAgg.txt: -------------------------------------------------------------------------------- 1 | bow_bal_lPunctAgg 2 | epoch = 0, losses = {'textcat': 9.302060287109384}, roc = 0.822655664996256 3 | epoch = 1, losses = {'textcat': 13.162341818209484}, roc = 0.830693076435689 4 | epoch = 2, losses = {'textcat': 15.271727846679754}, roc = 0.8276380968838201 5 | epoch = 3, losses = {'textcat': 16.73283363978179}, roc = 0.8206403720983815 6 | epoch = 4, losses = {'textcat': 17.417289779973682}, roc = 0.8212559760382466 7 | epoch = 5, losses = {'textcat': 17.877576482184697}, roc = 0.8197432463567768 8 | epoch = 6, losses = {'textcat': 18.1894339097069}, roc = 0.8176357928690744 9 | bow_bal_lPunctAgg 10 | epoch = 0, losses = {'textcat': 9.32271635598022}, roc = 0.8531601002246415 11 | -------------------------------------------------------------------------------- 
/Notebooks/other-attempts/spaCy/outputs/bow_bal_lPunctNumAgg.txt: -------------------------------------------------------------------------------- 1 | bow_bal_lPunctNumAgg 2 | epoch = 0, losses = {'textcat': 9.070996835451297}, roc = 0.8518705719716606 3 | epoch = 1, losses = {'textcat': 12.505073593091929}, roc = 0.843479638269685 4 | epoch = 2, losses = {'textcat': 14.681906531202715}, roc = 0.8455186913196244 5 | epoch = 3, losses = {'textcat': 15.9343872602596}, roc = 0.8499308795576291 6 | epoch = 4, losses = {'textcat': 16.56670094780488}, roc = 0.8508596855019872 7 | epoch = 5, losses = {'textcat': 16.947070740996637}, roc = 0.848119347963827 8 | epoch = 6, losses = {'textcat': 17.27540420358229}, roc = 0.8479767870514371 9 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_bal_lPunctNumLemAgg.txt: -------------------------------------------------------------------------------- 1 | bow_bal_lPunctNumLemAgg 2 | epoch = 0, losses = {'textcat': 8.920310128462006}, roc = 0.8499539197050862 3 | epoch = 1, losses = {'textcat': 12.793419339724329}, roc = 0.8537411439433212 4 | epoch = 2, losses = {'textcat': 15.015833610607764}, roc = 0.8455287713841368 5 | epoch = 3, losses = {'textcat': 16.48632318486966}, roc = 0.8568537238638327 6 | epoch = 4, losses = {'textcat': 17.36400533343001}, roc = 0.857495967974195 7 | epoch = 5, losses = {'textcat': 17.890244026477234}, roc = 0.8565642820114049 8 | epoch = 6, losses = {'textcat': 18.371415728931197}, roc = 0.851052646736939 9 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_bal_lPunctNumLemOovAgg.txt: -------------------------------------------------------------------------------- 1 | bow_bal_lPunctNumLemOovAgg 2 | epoch = 0, losses = {'textcat': 8.70647652117259}, roc = 0.84155218593399 3 | epoch = 1, losses = {'textcat': 13.266189976038829}, roc = 0.848024307355567 4 | epoch = 2, losses = {'textcat': 16.120750505303732}, roc = 0.8411057830770117 5 | epoch = 3, losses = {'textcat': 18.030914867106905}, roc = 0.8529383388053684 6 | epoch = 4, losses = {'textcat': 19.245343244869073}, roc = 0.8566902828178098 7 | epoch = 5, losses = {'textcat': 19.979103932247735}, roc = 0.8562568400437762 8 | epoch = 6, losses = {'textcat': 20.6539379920545}, roc = 0.8522132941650826 9 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_bal_lPunctNumOovAgg.txt: -------------------------------------------------------------------------------- 1 | bow_bal_lPunctNumOovAgg 2 | epoch = 0, losses = {'textcat': 8.897850947932056}, roc = 0.8423024307355567 3 | epoch = 1, losses = {'textcat': 12.809824797757333}, roc = 0.8437762801681931 4 | epoch = 2, losses = {'textcat': 15.446070999403444}, roc = 0.8499942399631357 5 | epoch = 3, losses = {'textcat': 17.059131857684534}, roc = 0.8522507344047002 6 | epoch = 4, losses = {'textcat': 18.157829035962052}, roc = 0.8514767294510686 7 | epoch = 5, losses = {'textcat': 18.70078135735473}, roc = 0.8495960774148954 8 | epoch = 6, losses = {'textcat': 19.13037149310106}, roc = 0.8482842290190657 9 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_bal_lPunctNumPersAgg.txt: -------------------------------------------------------------------------------- 1 | bow_bal_lPunctNumPersAgg 2 | epoch = 0, losses = {'textcat': 
7.942127413491107}, roc = 0.8563216404584989 3 | epoch = 1, losses = {'textcat': 10.657648807610341}, roc = 0.8576126087206958 4 | epoch = 2, losses = {'textcat': 12.023256314509185}, roc = 0.8550933125972007 5 | epoch = 3, losses = {'textcat': 12.718376671023494}, roc = 0.853201140487299 6 | epoch = 4, losses = {'textcat': 13.205239917829605}, roc = 0.8536230631876045 7 | epoch = 5, losses = {'textcat': 13.471313663067724}, roc = 0.8510332066125224 8 | epoch = 6, losses = {'textcat': 13.686726476120343}, roc = 0.8495355970278211 9 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_bal_lPunctNumPersLemAgg.txt: -------------------------------------------------------------------------------- 1 | bow_bal_lPunctNumPersLemAgg 2 | epoch = 0, losses = {'textcat': 8.12144905304558}, roc = 0.8489804734750304 3 | epoch = 1, losses = {'textcat': 11.43250018897337}, roc = 0.8533357813490006 4 | epoch = 2, losses = {'textcat': 13.055823825370439}, roc = 0.8501929612349518 5 | epoch = 3, losses = {'textcat': 14.094245906230416}, roc = 0.8575866885548068 6 | epoch = 4, losses = {'textcat': 14.65576545783411}, roc = 0.8568710039744254 7 | epoch = 5, losses = {'textcat': 15.022920232249685}, roc = 0.8543229076666092 8 | epoch = 6, losses = {'textcat': 15.303569548450263}, roc = 0.8525725764644894 9 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_bal_lPunctNumPersLemOovAgg.txt: -------------------------------------------------------------------------------- 1 | bow_bal_lPunctNumPersLemOovAgg 2 | epoch = 0, losses = {'textcat': 7.9846539189748}, roc = 0.857177005932838 3 | epoch = 1, losses = {'textcat': 11.407339283906628}, roc = 0.8603176660330626 4 | epoch = 2, losses = {'textcat': 13.37722663999433}, roc = 0.8533451414089049 5 | epoch = 3, losses = {'textcat': 14.584860815160596}, roc = 0.858619175162721 6 | epoch = 4, losses = {'textcat': 15.31267213681097}, roc = 0.8586213351765453 7 | epoch = 5, losses = {'textcat': 15.78524821235868}, roc = 0.8571561257992052 8 | epoch = 6, losses = {'textcat': 16.206703309529505}, roc = 0.8544035481827085 9 | bow_bal_lPunctNumPersLemOovAgg 10 | epoch = 0, losses = {'textcat': 10.543100330862217}, roc = 0.8660865100757283 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_bal_lPunctNumStopLemAgg.txt: -------------------------------------------------------------------------------- 1 | bow_bal_lPunctNumStopLemAgg 2 | epoch = 0, losses = {'textcat': 7.6965910377621185}, roc = 0.8627325614883935 3 | epoch = 1, losses = {'textcat': 10.470721331600874}, roc = 0.8617792753873624 4 | epoch = 2, losses = {'textcat': 11.779465222827628}, roc = 0.8552315534819421 5 | epoch = 3, losses = {'textcat': 12.66895769556537}, roc = 0.8541529865791141 6 | epoch = 4, losses = {'textcat': 13.206042752635458}, roc = 0.8515631300040319 7 | epoch = 5, losses = {'textcat': 13.554137631988128}, roc = 0.8463330165313057 8 | epoch = 6, losses = {'textcat': 13.809782768729407}, roc = 0.8421296296296296 9 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_bal_lPunctNumStopLemOovAgg.txt: -------------------------------------------------------------------------------- 1 | bow_bal_lPunctNumStopLemOovAgg 2 | epoch = 0, losses = {'textcat': 7.722838822774065}, roc = 0.8669237083117332 3 | epoch = 
1, losses = {'textcat': 10.851907326930231}, roc = 0.8626447209262138 4 | epoch = 2, losses = {'textcat': 12.364781665208959}, roc = 0.8523414549853119 5 | epoch = 3, losses = {'textcat': 13.364213726853716}, roc = 0.8538973849432637 6 | epoch = 4, losses = {'textcat': 14.003232195298567}, roc = 0.8549809918783481 7 | epoch = 5, losses = {'textcat': 14.470020648783343}, roc = 0.8541781867403951 8 | epoch = 6, losses = {'textcat': 14.829334140441915}, roc = 0.8510605667876274 9 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_bal_lPunctNumStopOovAgg.txt: -------------------------------------------------------------------------------- 1 | bow_bal_lPunctNumStopOovAgg 2 | epoch = 0, losses = {'textcat': 7.644319462408021}, roc = 0.8646297736305512 3 | epoch = 1, losses = {'textcat': 10.299294181489332}, roc = 0.8604047865906342 4 | epoch = 2, losses = {'textcat': 11.559321389345481}, roc = 0.8573375669604286 5 | epoch = 3, losses = {'textcat': 12.309172212446663}, roc = 0.8520440930821958 6 | epoch = 4, losses = {'textcat': 12.788964540282382}, roc = 0.8490193537238638 7 | epoch = 5, losses = {'textcat': 13.128131629973968}, roc = 0.8473633431253961 8 | epoch = 6, losses = {'textcat': 13.391003333925637}, roc = 0.8459773342549393 9 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_dlPunctNumStopLemOovAgg.txt: -------------------------------------------------------------------------------- 1 | bow_dlPunctNumStopLemOovAgg 2 | epoch = 0, losses = {'textcat': 8.904687333793845}, roc = 0.8557315879618302 3 | epoch = 1, losses = {'textcat': 12.338626464243827}, roc = 0.8615503425495473 4 | epoch = 2, losses = {'textcat': 14.526519826557568}, roc = 0.8663368607780768 5 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_lPunctAgg.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctAgg 2 | epoch = 0, losses = {'textcat': 10.300575028954164}, roc = 0.8472442561930431 3 | epoch = 1, losses = {'textcat': 15.44146652551996}, roc = 0.8600077011936851 4 | epoch = 2, losses = {'textcat': 18.253285435726045}, roc = 0.8596842510589141 5 | epoch = 3, losses = {'textcat': 20.024297386326218}, roc = 0.8545860608394301 6 | epoch = 4, losses = {'textcat': 21.157677383200436}, roc = 0.8565370299063022 7 | epoch = 5, losses = {'textcat': 22.03504304502588}, roc = 0.8572660762418175 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_lPunctNumAgg.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctNumAgg 2 | epoch = 0, losses = {'textcat': 10.316491311206551}, roc = 0.8487459889616223 3 | epoch = 1, losses = {'textcat': 15.3834985842316}, roc = 0.8674675908099089 4 | epoch = 2, losses = {'textcat': 18.139316414716088}, roc = 0.8616275189321012 5 | epoch = 3, losses = {'textcat': 19.851637275560474}, roc = 0.8596585804132975 6 | epoch = 4, losses = {'textcat': 20.907842139506286}, roc = 0.8587010653317931 7 | epoch = 5, losses = {'textcat': 21.722357641635657}, roc = 0.8530252855859325 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_lPunctNumLemAgg.txt: -------------------------------------------------------------------------------- 1 | 
bow_lPunctNumLemAgg 2 | epoch = 0, losses = {'textcat': 10.29036844101829}, roc = 0.8553972532409191 3 | epoch = 1, losses = {'textcat': 15.721021297912802}, roc = 0.8621486330381211 4 | epoch = 2, losses = {'textcat': 18.663440811308213}, roc = 0.8572506738544474 5 | epoch = 3, losses = {'textcat': 20.400520129909953}, roc = 0.8703195995379285 6 | epoch = 4, losses = {'textcat': 21.658334924989223}, roc = 0.8669901168014378 7 | epoch = 5, losses = {'textcat': 22.603086117967706}, roc = 0.8675908099088694 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_lPunctNumLemOovAgg.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctNumLemOovAgg 2 | epoch = 0, losses = {'textcat': 10.197543423597457}, roc = 0.8483275574380695 3 | epoch = 1, losses = {'textcat': 15.81021447241982}, roc = 0.8581542805801566 4 | epoch = 2, losses = {'textcat': 19.228670098632225}, roc = 0.8616968296752663 5 | epoch = 3, losses = {'textcat': 21.795570858355152}, roc = 0.8520806058272367 6 | epoch = 4, losses = {'textcat': 23.384679663050104}, roc = 0.8497907842382235 7 | epoch = 5, losses = {'textcat': 24.688360156274754}, roc = 0.8550686689770248 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_lPunctNumOovAgg.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctNumOovAgg 2 | epoch = 0, losses = {'textcat': 10.606671018825214}, roc = 0.840441535104608 3 | epoch = 1, losses = {'textcat': 16.482640654656333}, roc = 0.8509151585162368 4 | epoch = 2, losses = {'textcat': 19.598626431889233}, roc = 0.8592555512771146 5 | epoch = 3, losses = {'textcat': 21.42554874333712}, roc = 0.865665511487614 6 | epoch = 4, losses = {'textcat': 22.87796032469087}, roc = 0.8701116673084329 7 | epoch = 5, losses = {'textcat': 23.912508471060537}, roc = 0.8683788987293031 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_lPunctNumPersAgg.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctNumPersAgg 2 | epoch = 0, losses = {'textcat': 9.659992030909795}, roc = 0.8598382749326147 3 | epoch = 1, losses = {'textcat': 13.254407076947524}, roc = 0.8739340264407651 4 | epoch = 2, losses = {'textcat': 15.111549782759436}, roc = 0.8792940572455397 5 | epoch = 3, losses = {'textcat': 15.997021635469657}, roc = 0.877930945963291 6 | epoch = 4, losses = {'textcat': 16.575501656098304}, roc = 0.8770196380438967 7 | epoch = 5, losses = {'textcat': 16.928584417800174}, roc = 0.8734591194968553 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_lPunctNumPersLemAgg.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctNumPersLemAgg 2 | epoch = 0, losses = {'textcat': 9.383625615081417}, roc = 0.8685303555384417 3 | epoch = 1, losses = {'textcat': 13.28497734584512}, roc = 0.8756565267616481 4 | epoch = 2, losses = {'textcat': 15.407424367942612}, roc = 0.8725272750609678 5 | epoch = 3, losses = {'textcat': 16.60949956353454}, roc = 0.8703658067000386 6 | epoch = 4, losses = {'textcat': 17.3632590999101}, roc = 0.8677140290078296 7 | epoch = 5, losses = {'textcat': 17.834758405944516}, roc = 0.8680580156590938 8 | 
-------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_lPunctNumPersLemOovAgg.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctNumPersLemOovAgg 2 | epoch = 0, losses = {'textcat': 9.513648208222001}, roc = 0.8517058144012322 3 | epoch = 1, losses = {'textcat': 13.863807656152193}, roc = 0.864453857014504 4 | epoch = 2, losses = {'textcat': 16.46012714357848}, roc = 0.8673289693235784 5 | epoch = 3, losses = {'textcat': 18.004219502897023}, roc = 0.8680041073032987 6 | epoch = 4, losses = {'textcat': 19.085807765813307}, roc = 0.8678192786548582 7 | epoch = 5, losses = {'textcat': 19.84717152449107}, roc = 0.8641509433962263 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_lPunctNumStopLemAgg.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctNumStopLemAgg 2 | epoch = 0, losses = {'textcat': 8.883919595161183}, roc = 0.8709690668720317 3 | epoch = 1, losses = {'textcat': 12.155519061331876}, roc = 0.8812706969580286 4 | epoch = 2, losses = {'textcat': 13.865362251984102}, roc = 0.8741060197663971 5 | epoch = 3, losses = {'textcat': 14.811687069882568}, roc = 0.8711564625850341 6 | epoch = 4, losses = {'textcat': 15.499547603402688}, roc = 0.8669721473495058 7 | epoch = 5, losses = {'textcat': 15.88002811915746}, roc = 0.862133230650751 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_lPunctNumStopLemOovAgg.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctNumStopLemOovAgg 2 | epoch = 0, losses = {'textcat': 8.991991002214835}, roc = 0.8685611603131819 3 | epoch = 1, losses = {'textcat': 12.712942150187756}, roc = 0.8808060582723656 4 | epoch = 2, losses = {'textcat': 14.934007483348498}, roc = 0.8775561545372866 5 | epoch = 3, losses = {'textcat': 16.26776459135772}, roc = 0.8723527146707739 6 | epoch = 4, losses = {'textcat': 17.113776076989428}, roc = 0.868710050057759 7 | epoch = 5, losses = {'textcat': 17.63977204713988}, roc = 0.8639199075856757 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/bow_lPunctNumStopOovAgg.txt: -------------------------------------------------------------------------------- 1 | bow_lPunctNumStopOovAgg 2 | epoch = 0, losses = {'textcat': 9.135657884018336}, roc = 0.8800667436786035 3 | epoch = 1, losses = {'textcat': 12.349658746887858}, roc = 0.8848029777948915 4 | epoch = 2, losses = {'textcat': 14.074350827866946}, roc = 0.8828571428571428 5 | epoch = 3, losses = {'textcat': 14.991081261942854}, roc = 0.8824207418816584 6 | epoch = 4, losses = {'textcat': 15.623346206312354}, roc = 0.8787061994609164 7 | epoch = 5, losses = {'textcat': 16.009936011346397}, roc = 0.8759440379925556 8 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_bal_lPunctAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_bal_lPunctAgg 2 | epoch = 0, losses = {'textcat': 10.771006600931287}, roc = 0.7958095731812684 3 | epoch = 1, losses = {'textcat': 17.779007678705966}, roc = 0.8307917170669892 4 | epoch = 2, losses = {'textcat': 21.071200552220034}, roc = 0.8122213582166926 5 | epoch = 3, losses = 
{'textcat': 22.43127129951489}, roc = 0.7965295777892979 6 | epoch = 4, losses = {'textcat': 23.157470632704094}, roc = 0.8051638730487874 7 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_bal_lPunctNumAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_bal_lPunctNumAgg 2 | epoch = 0, losses = {'textcat': 11.111040544696152}, roc = 0.7908804216346984 3 | epoch = 1, losses = {'textcat': 19.10246680257842}, roc = 0.8461609354299866 4 | epoch = 2, losses = {'textcat': 23.33828358113533}, roc = 0.8425306721963021 5 | epoch = 3, losses = {'textcat': 25.14938404349232}, roc = 0.828738263924889 6 | epoch = 4, losses = {'textcat': 25.967542636937917}, roc = 0.8076925292321872 7 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_bal_lPunctNumLemAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_bal_lPunctNumLemAgg 2 | epoch = 0, losses = {'textcat': 10.921858021989465}, roc = 0.8557384367259951 3 | epoch = 1, losses = {'textcat': 18.427781807025895}, roc = 0.8557974771038537 4 | epoch = 2, losses = {'textcat': 22.933604589139577}, roc = 0.8830294913887449 5 | epoch = 3, losses = {'textcat': 25.160109569373162}, roc = 0.869435804389148 6 | epoch = 4, losses = {'textcat': 26.235555890430874}, roc = 0.854176026726571 7 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_bal_lPunctNumLemOovAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_bal_lPunctNumLemOovAgg 2 | epoch = 0, losses = {'textcat': 11.09798441780731}, roc = 0.8316643626519211 3 | epoch = 1, losses = {'textcat': 19.27897498011589}, roc = 0.8633877656817004 4 | epoch = 2, losses = {'textcat': 24.195329733367544}, roc = 0.8631213639767296 5 | epoch = 3, losses = {'textcat': 26.952326160404482}, roc = 0.8620413570646853 6 | epoch = 4, losses = {'textcat': 28.221509296149407}, roc = 0.8533926617130351 7 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_bal_lPunctNumOovAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_bal_lPunctNumOovAgg 2 | epoch = 0, losses = {'textcat': 10.964338312391192}, roc = 0.7805598755832037 3 | epoch = 1, losses = {'textcat': 18.715853770030662}, roc = 0.8555757156845805 4 | epoch = 2, losses = {'textcat': 23.049163569317898}, roc = 0.8676962732561488 5 | epoch = 3, losses = {'textcat': 25.176179951930294}, roc = 0.860621507977651 6 | epoch = 4, losses = {'textcat': 26.10593884226084}, roc = 0.8471415817061229 7 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_bal_lPunctNumPersAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_bal_lPunctNumPersAgg 2 | epoch = 0, losses = {'textcat': 10.61521192966029}, roc = 0.8436812395599332 3 | epoch = 1, losses = {'textcat': 17.363522986648604}, roc = 0.8743044755486434 4 | epoch = 2, losses = {'textcat': 20.616124440028216}, roc = 0.8676768331317319 5 | epoch = 3, losses = {'textcat': 21.97993107588991}, roc = 0.864473532630609 6 | epoch = 4, losses = {'textcat': 22.626839527907563}, roc = 0.8439764414492251 7 | 
-------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_bal_lPunctNumPersLemAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_bal_lPunctNumPersLemAgg 2 | epoch = 0, losses = {'textcat': 10.641894780797884}, roc = 0.8053179540349057 3 | epoch = 1, losses = {'textcat': 17.899752411001828}, roc = 0.8604811070790852 4 | epoch = 2, losses = {'textcat': 21.62845124416799}, roc = 0.8628232820690054 5 | epoch = 3, losses = {'textcat': 23.424733716310357}, roc = 0.8736708714935776 6 | epoch = 4, losses = {'textcat': 24.162155677251732}, roc = 0.8674586717354992 7 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_bal_lPunctNumPersLemOovAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_bal_lPunctNumPersLemOovAgg 2 | epoch = 0, losses = {'textcat': 10.861616820562631}, roc = 0.8464489372731985 3 | epoch = 1, losses = {'textcat': 18.343847768148407}, roc = 0.870999654397788 4 | epoch = 2, losses = {'textcat': 22.567757138167508}, roc = 0.8573786072230863 5 | epoch = 3, losses = {'textcat': 24.70317370561679}, roc = 0.8381976844651806 6 | epoch = 4, losses = {'textcat': 25.663438785077915}, roc = 0.8116050342722194 7 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_bal_lPunctNumStopLemAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_bal_lPunctNumStopLemAgg 2 | epoch = 0, losses = {'textcat': 10.737517139408737}, roc = 0.8420871493577559 3 | epoch = 1, losses = {'textcat': 17.890409937361255}, roc = 0.8593499798398709 4 | epoch = 2, losses = {'textcat': 21.311185453138023}, roc = 0.8693004435228385 5 | epoch = 3, losses = {'textcat': 22.775944610608576}, roc = 0.8635216865387938 6 | epoch = 4, losses = {'textcat': 23.41604380202767}, roc = 0.8496759979263868 7 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_bal_lPunctNumStopLemOovAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_bal_lPunctNumStopLemOovAgg 2 | epoch = 0, losses = {'textcat': 10.73626277083531}, roc = 0.8194502044813087 3 | epoch = 1, losses = {'textcat': 17.72877553733997}, roc = 0.8599965439778816 4 | epoch = 2, losses = {'textcat': 21.301636069205415}, roc = 0.8570013248084787 5 | epoch = 3, losses = {'textcat': 22.83226279049137}, roc = 0.852296814699614 6 | epoch = 4, losses = {'textcat': 23.535492210306984}, roc = 0.8362507920050689 7 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_bal_lPunctNumStopOovAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_bal_lPunctNumStopOovAgg 2 | epoch = 0, losses = {'textcat': 10.784322501625866}, roc = 0.8235895109728703 3 | epoch = 1, losses = {'textcat': 17.672744434559718}, roc = 0.8856099879039225 4 | epoch = 2, losses = {'textcat': 20.84777028957251}, roc = 0.8836371752779217 5 | epoch = 3, losses = {'textcat': 22.28241401606772}, roc = 0.8754190426818732 6 | epoch = 4, losses = {'textcat': 22.99313979086327}, roc = 0.8619981567882034 7 | -------------------------------------------------------------------------------- 
/Notebooks/other-attempts/spaCy/outputs/ensemble_dlPunctNumLemOovAgg.txt: -------------------------------------------------------------------------------- 1 | bow_dlPunctNumLemOovAgg 2 | epoch = 0, losses = {'textcat': 11.14930248935707}, roc = 0.8058885898376968 3 | epoch = 1, losses = {'textcat': 18.137150909838965}, roc = 0.870167604599951 4 | epoch = 2, losses = {'textcat': 21.942909762162685}, roc = 0.8624806296386918 5 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_dlPunctNumStopLemOovAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_bal_dlPunctNumStopLemOovAgg 2 | epoch = 0, losses = {'textcat': 10.573204169631936}, roc = 0.8677208221189137 3 | epoch = 1, losses = {'textcat': 16.051280780535308}, roc = 0.8713655085229588 4 | epoch = 2, losses = {'textcat': 19.00531985885982}, roc = 0.8542482260827013 5 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_lPunctAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctAgg epoch = 0, losses = {'textcat': 10.729639297001995}, roc = 0.842746759080991 2 | epoch = 1, losses = {'textcat': 17.388430001898087}, roc = 0.8567423950712361 3 | epoch = 2, losses = {'textcat': 21.088454915356124}, roc = 0.8632678731870107 4 | epoch = 3, losses = {'textcat': 22.955513816137795}, roc = 0.846782184571942 5 | epoch = 4, losses = {'textcat': 23.7844230104636}, roc = 0.8460377358490567 6 | epoch = 5, losses = {'textcat': 24.454516666314753}, roc = 0.8463355153382108 7 | epoch = 6, losses = {'textcat': 24.73753832080388}, roc = 0.8378025927352073 8 | epoch = 7, losses = {'textcat': 25.07438993472748}, roc = 0.8332126812989348 9 | epoch = 8, losses = {'textcat': 25.190393376835182}, roc = 0.8299011680143754 10 | epoch = 9, losses = {'textcat': 25.372851968512258}, roc = 0.8320677705044282 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_lPunctNumAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctNumAgg epoch = 0, losses = {'textcat': 10.849251543346327}, roc = 0.8439994865870877 2 | epoch = 1, losses = {'textcat': 17.44105708837742}, roc = 0.8709896033885252 3 | epoch = 2, losses = {'textcat': 21.194835011254327}, roc = 0.8760467205750225 4 | epoch = 3, losses = {'textcat': 23.208214861448596}, roc = 0.8752714670773971 5 | epoch = 4, losses = {'textcat': 24.236906560875724}, roc = 0.8654550121935566 6 | epoch = 5, losses = {'textcat': 24.778005035438593}, roc = 0.8700346553715826 7 | epoch = 6, losses = {'textcat': 25.133107319834863}, roc = 0.8669747144140675 8 | epoch = 7, losses = {'textcat': 25.325295451989195}, roc = 0.8677037607495829 9 | epoch = 8, losses = {'textcat': 25.620281110071257}, roc = 0.8670722628674111 10 | epoch = 9, losses = {'textcat': 25.646207855614715}, roc = 0.8636786035168784 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_lPunctNumLemAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctNumLemAgg epoch = 0, losses = {'textcat': 10.678632560709957}, roc = 0.8591400333718393 2 | epoch = 1, losses = {'textcat': 17.103498125987244}, roc = 0.8714414067513797 3 | epoch = 2, losses = 
{'textcat': 20.854591814086234}, roc = 0.8708920549351816 4 | epoch = 3, losses = {'textcat': 22.87714549644079}, roc = 0.8660043640097548 5 | epoch = 4, losses = {'textcat': 23.852497358807643}, roc = 0.8584161211654473 6 | epoch = 5, losses = {'textcat': 24.415554903051355}, roc = 0.8565627005519187 7 | epoch = 6, losses = {'textcat': 24.75346492489465}, roc = 0.8536721858554742 8 | epoch = 7, losses = {'textcat': 24.919653205470567}, roc = 0.8565832370684123 9 | epoch = 8, losses = {'textcat': 24.99600611099764}, roc = 0.8565113592606854 10 | epoch = 9, losses = {'textcat': 25.150001712737367}, roc = 0.8512129380053909 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_lPunctNumLemOovAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctNumLemOovAgg epoch = 0, losses = {'textcat': 10.632765143818688}, roc = 0.8680785521755873 2 | epoch = 1, losses = {'textcat': 16.603710685754777}, roc = 0.8756873315363882 3 | epoch = 2, losses = {'textcat': 20.098456279241873}, roc = 0.8748402002310357 4 | epoch = 3, losses = {'textcat': 22.197325268262148}, roc = 0.8670260557053011 5 | epoch = 4, losses = {'textcat': 23.376831758485014}, roc = 0.8527788473880118 6 | epoch = 5, losses = {'textcat': 24.049448740595956}, roc = 0.8547862918752405 7 | epoch = 6, losses = {'textcat': 24.454368137002774}, roc = 0.8548992427159544 8 | epoch = 7, losses = {'textcat': 24.841723376879056}, roc = 0.8579643178025927 9 | epoch = 8, losses = {'textcat': 24.95344296212421}, roc = 0.8519573867282763 10 | epoch = 9, losses = {'textcat': 25.03913710188129}, roc = 0.8460582723655501 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_lPunctNumOovAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctNumOovAgg epoch = 0, losses = {'textcat': 10.849000597256236}, roc = 0.8273905788730586 2 | epoch = 1, losses = {'textcat': 17.615710976722767}, roc = 0.8691618534206136 3 | epoch = 2, losses = {'textcat': 21.519193997845832}, roc = 0.8813810807341804 4 | epoch = 3, losses = {'textcat': 23.555163298897355}, roc = 0.8833423180592992 5 | epoch = 4, losses = {'textcat': 24.592904370654782}, roc = 0.8853908355795148 6 | epoch = 5, losses = {'textcat': 25.232358423387602}, roc = 0.8834604030291362 7 | epoch = 6, losses = {'textcat': 25.71876220397092}, roc = 0.8715902964959569 8 | epoch = 7, losses = {'textcat': 25.852575093551987}, roc = 0.8668823000898472 9 | epoch = 8, losses = {'textcat': 26.09547118551534}, roc = 0.8692619689385188 10 | epoch = 9, losses = {'textcat': 26.272543905184413}, roc = 0.8659581568476448 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_lPunctNumPersAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctNumPersAgg epoch = 0, losses = {'textcat': 10.358336042787414}, roc = 0.8718983442433577 2 | epoch = 1, losses = {'textcat': 15.565269033104414}, roc = 0.8941599281221924 3 | epoch = 2, losses = {'textcat': 17.932983758057844}, roc = 0.8852214093184444 4 | epoch = 3, losses = {'textcat': 19.04217331202392}, roc = 0.8730381209087408 5 | epoch = 4, losses = {'textcat': 19.687709225343976}, roc = 0.874799127198049 6 | epoch = 5, losses = {'textcat': 20.034264428140364}, roc = 0.8718829418559876 7 
| epoch = 6, losses = {'textcat': 20.304004829154785}, roc = 0.870219484020023 8 | epoch = 7, losses = {'textcat': 20.454276567252627}, roc = 0.8683506610191246 9 | epoch = 8, losses = {'textcat': 20.554914588447136}, roc = 0.8717391862405339 10 | epoch = 9, losses = {'textcat': 20.633671653256233}, roc = 0.8760826594788859 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_lPunctNumPersLemAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctNumPersLemAgg epoch = 0, losses = {'textcat': 10.325701178051531}, roc = 0.8664304967269927 2 | epoch = 1, losses = {'textcat': 15.913291760210996}, roc = 0.8915158516236684 3 | epoch = 2, losses = {'textcat': 18.7300274250465}, roc = 0.8803131818765243 4 | epoch = 3, losses = {'textcat': 20.120089923865635}, roc = 0.8712360415864459 5 | epoch = 4, losses = {'textcat': 20.804775718720222}, roc = 0.8688127326402258 6 | epoch = 5, losses = {'textcat': 21.14956892632512}, roc = 0.8665588499550764 7 | epoch = 6, losses = {'textcat': 21.423461501962542}, roc = 0.8723090745732256 8 | epoch = 7, losses = {'textcat': 21.56527505127891}, roc = 0.8700295212424592 9 | epoch = 8, losses = {'textcat': 21.688171283806636}, roc = 0.8668258246694905 10 | epoch = 9, losses = {'textcat': 21.79666106960388}, roc = 0.8704813246053137 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_lPunctNumPersLemOovAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctNumPersLemOovAgg epoch = 0, losses = {'textcat': 10.597144522122107}, roc = 0.8578462328327556 2 | epoch = 1, losses = {'textcat': 16.965219413206796}, roc = 0.8762982929020665 3 | epoch = 2, losses = {'textcat': 20.944239850628946}, roc = 0.8677756385573098 4 | epoch = 3, losses = {'textcat': 23.205930521911796}, roc = 0.8661275831087152 5 | epoch = 4, losses = {'textcat': 24.416154009721595}, roc = 0.8681504299833142 6 | epoch = 5, losses = {'textcat': 25.26161684111277}, roc = 0.8606956744962136 7 | epoch = 6, losses = {'textcat': 25.85344494000153}, roc = 0.8537440636632012 8 | epoch = 7, losses = {'textcat': 26.17937598666605}, roc = 0.8558952637658837 9 | epoch = 8, losses = {'textcat': 26.458660723634825}, roc = 0.8617533050956231 10 | epoch = 9, losses = {'textcat': 26.5519901394971}, roc = 0.8586676934924914 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_lPunctNumStopLemAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctNumStopLemAgg epoch = 0, losses = {'textcat': 10.419062311004382}, roc = 0.8478706199460917 2 | epoch = 1, losses = {'textcat': 16.18246567517781}, roc = 0.8796816839943525 3 | epoch = 2, losses = {'textcat': 19.30511438575013}, roc = 0.8775920934411501 4 | epoch = 3, losses = {'textcat': 20.81128679516405}, roc = 0.8779412142215377 5 | epoch = 4, losses = {'textcat': 21.478999641421396}, roc = 0.8684687459889617 6 | epoch = 5, losses = {'textcat': 21.86197143290219}, roc = 0.8719034783724811 7 | epoch = 6, losses = {'textcat': 21.966874151151934}, roc = 0.866907970735464 8 | epoch = 7, losses = {'textcat': 22.130305263283137}, roc = 0.865177769220896 9 | epoch = 8, losses = {'textcat': 22.24113396179771}, roc = 0.8653112565781029 10 | epoch = 9, losses = {'textcat': 
22.4314443554733}, roc = 0.8684841483763316 11 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_lPunctNumStopLemOovAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctNumStopLemOovAgg epoch = 0, losses = {'textcat': 10.346970351412892}, roc = 0.8683763316647415 2 | epoch = 1, losses = {'textcat': 15.972997819677403}, roc = 0.9038069567449621 3 | epoch = 2, losses = {'textcat': 18.941932999511664}, roc = 0.8922859709921704 4 | epoch = 3, losses = {'textcat': 20.461612598504217}, roc = 0.8880811192401488 5 | epoch = 4, losses = {'textcat': 21.24935047177315}, roc = 0.8843691438839687 6 | epoch = 5, losses = {'textcat': 21.649178645025486}, roc = 0.8818636888717752 7 | epoch = 6, losses = {'textcat': 21.950326816203656}, roc = 0.8808984725965858 8 | epoch = 7, losses = {'textcat': 22.071821111654856}, roc = 0.8750198947503529 9 | epoch = 8, losses = {'textcat': 22.14719987215875}, roc = 0.8718213323065076 10 | epoch = 9, losses = {'textcat': 22.282656720771637}, roc = 0.8829880631497883 11 | ensemble_lPunctNumStopLemOovAgg 12 | epoch = 0, losses = {'textcat': 10.490000442718156}, roc = 0.8498061866255937 13 | epoch = 1, losses = {'textcat': 16.547183903574478}, roc = 0.8772121678860224 14 | epoch = 2, losses = {'textcat': 19.78883196215702}, roc = 0.8746707739699653 15 | epoch = 3, losses = {'textcat': 21.212445096770466}, roc = 0.8524759337697343 16 | epoch = 4, losses = {'textcat': 21.841305388159622}, roc = 0.8331664741368244 17 | epoch = 5, losses = {'textcat': 22.164312648375436}, roc = 0.8297214734950583 18 | epoch = 6, losses = {'textcat': 22.348416818236934}, roc = 0.826420228468746 19 | epoch = 7, losses = {'textcat': 22.57289976127649}, roc = 0.8245514054678474 20 | epoch = 8, losses = {'textcat': 22.755370378420064}, roc = 0.8239917853934027 21 | epoch = 9, losses = {'textcat': 22.91652452440097}, roc = 0.8339417276344502 22 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/ensemble_lPunctNumStopOovAgg.txt: -------------------------------------------------------------------------------- 1 | ensemble_lPunctNumStopOovAgg epoch = 0, losses = {'textcat': 10.528761210793164}, roc = 0.8673289693235784 2 | epoch = 1, losses = {'textcat': 16.21045877024153}, roc = 0.889338980875369 3 | epoch = 2, losses = {'textcat': 19.006704837200232}, roc = 0.8886818123475806 4 | epoch = 3, losses = {'textcat': 20.306096649514416}, roc = 0.87391092285971 5 | epoch = 4, losses = {'textcat': 20.95130760752823}, roc = 0.8736644846617894 6 | epoch = 5, losses = {'textcat': 21.27921311019021}, roc = 0.8782441278398152 7 | epoch = 6, losses = {'textcat': 21.510209205060605}, roc = 0.8773328199204212 8 | epoch = 7, losses = {'textcat': 21.645089315511605}, roc = 0.8748350661019125 9 | epoch = 8, losses = {'textcat': 21.741663373617406}, roc = 0.8733923758182519 10 | epoch = 9, losses = {'textcat': 21.76332084200426}, roc = 0.8705532024130407 11 | ensemble_lPunctNumStopOovAgg 12 | epoch = 0, losses = {'textcat': 10.605348191806115}, roc = 0.8644795276601206 13 | epoch = 1, losses = {'textcat': 16.36319550999906}, roc = 0.8788910281093569 14 | epoch = 2, losses = {'textcat': 19.37329041323028}, roc = 0.8780028237710179 15 | epoch = 3, losses = {'textcat': 20.752987753380012}, roc = 0.8651315620587856 16 | epoch = 4, losses = {'textcat': 21.361254807152925}, roc = 0.859535361314337 17 | epoch = 5, losses 
= {'textcat': 21.65353330239163}, roc = 0.8487896290591709 18 | epoch = 6, losses = {'textcat': 21.97102079426821}, roc = 0.8549197792324477 19 | epoch = 7, losses = {'textcat': 22.053787052982}, roc = 0.8488461044795277 20 | epoch = 8, losses = {'textcat': 22.097230001885983}, roc = 0.838906430496727 21 | epoch = 9, losses = {'textcat': 22.243994371655553}, roc = 0.8437376460017969 22 | -------------------------------------------------------------------------------- /Notebooks/other-attempts/spaCy/outputs/softmax_bert_lPunctNumStopLemOovAgg.txt: -------------------------------------------------------------------------------- 1 | softmax_bert_lPunctNumStopLemOovAgg 2 | epoch = 0, roc = 0.803136085548301 3 | epoch = 1, roc = 0.6947483419875585 4 | -------------------------------------------------------------------------------- /Notebooks/successful-models/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pitmonticone/RedditTextClassification/fdd8b3a6e649781df9147599889c4669517f65ab/Notebooks/successful-models/.DS_Store -------------------------------------------------------------------------------- /Notebooks/successful-models/.ipynb_checkpoints/mlp-subreddits-5000-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Data Mining Challange: *Reddit Gender Text-Classification* (MLP) \n", 8 | "\n", 9 | "### Modules" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "name": "stdout", 19 | "output_type": "stream", 20 | "text": [ 21 | "Populating the interactive namespace from numpy and matplotlib\n" 22 | ] 23 | } 24 | ], 25 | "source": [ 26 | "# Numpy & matplotlib for notebooks \n", 27 | "%pylab inline\n", 28 | "\n", 29 | "# Pandas for data analysis and manipulation \n", 30 | "import pandas as pd \n", 31 | "\n", 32 | "# Sklearn \n", 33 | "from sklearn.preprocessing import StandardScaler # to standardize features by removing the mean and scaling to unit variance (z=(x-u)/s)\n", 34 | "from sklearn.neural_network import MLPClassifier # Multi-layer Perceptron classifier which optimizes the log-loss function using LBFGS or sdg.\n", 35 | "from sklearn.model_selection import train_test_split # to split arrays or matrices into random train and test subsets\n", 36 | "from sklearn.model_selection import KFold # K-Folds cross-validator providing train/test indices to split data in train/test sets.\n", 37 | "from sklearn.decomposition import PCA, TruncatedSVD # Principal component analysis (PCA); dimensionality reduction using truncated SVD.\n", 38 | "from sklearn.linear_model import LogisticRegression \n", 39 | "from sklearn.naive_bayes import MultinomialNB # Naive Bayes classifier for multinomial models\n", 40 | "from sklearn.feature_extraction.text import CountVectorizer # Convert a collection of text documents to a matrix of token counts\n", 41 | "from sklearn.metrics import roc_auc_score as roc # Compute Area Under the Receiver Operating Characteristic Curve from prediction scores\n", 42 | "from sklearn.metrics import roc_curve, auc # Compute ROC; Compute Area Under the Curve (AUC) using the trapezoidal rule\n", 43 | "\n", 44 | "# Matplotlib\n", 45 | "import matplotlib # Data visualization\n", 46 | "import matplotlib.pyplot as plt \n", 47 | "import matplotlib.patches as mpatches \n", 48 | "\n", 49 | "# 
Seaborn\n", 50 | "import seaborn as sns # Statistical data visualization (based on matplotlib)" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "### Data Collection " 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 2, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "# Import the training dataset, test dataset and target\n", 67 | "\n", 68 | "# Import the training dataset\n", 69 | "train_data = pd.read_csv(\"../input/dataset/train_data.csv\", encoding=\"utf8\")\n", 70 | "\n", 71 | "# Import the test dataset\n", 72 | "test_data = pd.read_csv(\"../input/dataset/test_data.csv\", encoding=\"utf8\")\n", 73 | "\n", 74 | "# Import the target\n", 75 | "target = pd.read_csv(\"../input/dataset/train_target.csv\")\n", 76 | "\n", 77 | "# Create a dictionary of authors\n", 78 | "author_gender = {}\n", 79 | "for i in range(len(target)):\n", 80 | " author_gender[target.author[i]] = target.gender[i]" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "### Data Manipulation " 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 3, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "# Create a list of aggregated binary subreddits \n", 97 | "Xs = []\n", 98 | "# Create a list of genders\n", 99 | "y = []\n", 100 | "# Create a list of authors\n", 101 | "a = []\n", 102 | "\n", 103 | "# Populate the lists \n", 104 | "for author, group in train_data.groupby(\"author\"):\n", 105 | " Xs.append(group.subreddit.str.cat(sep = \" \"))\n", 106 | " y.append(author_gender[author])\n", 107 | " a.append(author)\n", 108 | " \n", 109 | "# Lower text in comments \n", 110 | "clean_train_subreddits = [xs.lower() for xs in Xs]" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "### Models Definition & Training\n", 118 | "\n", 119 | "#### CountVectorizer" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 4, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "# Define CountVectorizer \n", 129 | "vectorizer_ = CountVectorizer(analyzer = \"word\", \n", 130 | " tokenizer = None, \n", 131 | " preprocessor = None, \n", 132 | " stop_words = None,\n", 133 | " binary=True\n", 134 | " ) #500\n", 135 | "# Train CountVectorizer \n", 136 | "train_data_subreddits = vectorizer_.fit_transform(clean_train_subreddits).toarray()\n", 137 | "\n", 138 | "sum(train_data_subreddits[1])\n", 139 | "\n", 140 | "y = np.array(y)" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "#### MLP Classifier" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 5, 153 | "metadata": { 154 | "collapsed": true, 155 | "jupyter": { 156 | "outputs_hidden": true 157 | } 158 | }, 159 | "outputs": [ 160 | { 161 | "name": "stdout", 162 | "output_type": "stream", 163 | "text": [ 164 | "Iteration 1, loss = 0.59613047\n", 165 | "Validation score: 0.734000\n", 166 | "Iteration 2, loss = 0.47953355\n", 167 | "Validation score: 0.814000\n", 168 | "Iteration 3, loss = 0.39179575\n", 169 | "Validation score: 0.864000\n", 170 | "Iteration 4, loss = 0.33398556\n", 171 | "Validation score: 0.870000\n", 172 | "Iteration 5, loss = 0.29788160\n", 173 | "Validation score: 0.860000\n", 174 | "Iteration 6, loss = 0.27413851\n", 175 | "Validation score: 0.858000\n", 176 | "Iteration 7, loss = 0.25758394\n", 177 | "Validation score: 0.856000\n", 178 | "Iteration 8, loss = 
0.24291078\n", 179 | "Validation score: 0.858000\n", 180 | "Iteration 9, loss = 0.23275980\n", 181 | "Validation score: 0.864000\n", 182 | "Iteration 10, loss = 0.22383857\n", 183 | "Validation score: 0.860000\n", 184 | "Iteration 11, loss = 0.21650923\n", 185 | "Validation score: 0.860000\n", 186 | "Iteration 12, loss = 0.21024405\n", 187 | "Validation score: 0.850000\n", 188 | "Iteration 13, loss = 0.20492907\n", 189 | "Validation score: 0.850000\n", 190 | "Iteration 14, loss = 0.20017990\n", 191 | "Validation score: 0.848000\n", 192 | "Iteration 15, loss = 0.19573230\n", 193 | "Validation score: 0.850000\n", 194 | "Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.\n" 195 | ] 196 | }, 197 | { 198 | "data": { 199 | "text/plain": [ 200 | "MLPClassifier(activation='relu', alpha=0.05, batch_size='auto', beta_1=0.9,\n", 201 | " beta_2=0.999, early_stopping=True, epsilon=1e-08,\n", 202 | " hidden_layer_sizes=(100,), learning_rate='invscaling',\n", 203 | " learning_rate_init=0.001, max_fun=15000, max_iter=400,\n", 204 | " momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,\n", 205 | " power_t=0.5, random_state=0, shuffle=True, solver='adam',\n", 206 | " tol=0.0001, validation_fraction=0.1, verbose=True,\n", 207 | " warm_start=False)" 208 | ] 209 | }, 210 | "execution_count": 5, 211 | "metadata": {}, 212 | "output_type": "execute_result" 213 | } 214 | ], 215 | "source": [ 216 | "# Define MLP Classifier:\n", 217 | "## Activation function for the hidden layer: \"rectified linear unit function\"\n", 218 | "## Solver for weight optimization: \"stochastic gradient-based optimizer\"\n", 219 | "## Alpha: regularization parameter\n", 220 | "## Learning rate schedule for weight updates: \"gradually decreases the learning rate at each time step t using an inverse scaling exponent of power_t\"\n", 221 | "## Verbose: \"True\" in order to print progress messages to stdout.\n", 222 | "## Early stopping: \"True\" in order to use early stopping to terminate training when validation score is not improving. 
It automatically sets aside 10% of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs.\n", 223 | "\n", 224 | "mlpClf = MLPClassifier(activation= 'relu', solver = 'adam', \n", 225 | " alpha = 0.05, learning_rate = 'invscaling', verbose = True, \n", 226 | " early_stopping = True, max_iter = 400, random_state=0)\n", 227 | "\n", 228 | " \n", 229 | "# K fold per la cross-validation\n", 230 | "kfold = KFold(n_splits = 10)\n", 231 | "\n", 232 | "# Training and validation on all K folds\n", 233 | "# for train_indices, test_indices in kf.split(train_data_subreddits):\n", 234 | "# mlpClf.fit(train_data_subreddits[train_indices], y[train_indices])\n", 235 | "# print(mlpClf.score(train_data_subreddits[test_indices], y[test_indices]))\n", 236 | " \n", 237 | "# cross_val_score resets parameters of my_model and fits it on X_train and t_train with cross validation (we did it for consistency).\n", 238 | "# results = cross_val_score(my_model, s, y, cv=kfold, scoring='roc_auc')\n", 239 | "# print(\"roc = \", np.mean(results))\n", 240 | " \n", 241 | "# Model fit\n", 242 | "mlpClf.fit(train_data_subreddits, y)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "### Prediction " 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 6, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "Xs_test = []\n", 259 | "for author, group in test_data.groupby(\"author\"):\n", 260 | " Xs_test.append(group.subreddit.str.cat(sep = \" \"))\n", 261 | " \n", 262 | "clean_test_subreddits = [xs.lower() for xs in Xs_test]\n", 263 | "\n", 264 | "test_data_subreddits = vectorizer_.transform(clean_test_subreddits).toarray()\n", 265 | "\n", 266 | "y_score = mlpClf.predict_proba(test_data_subreddits)[:,1]\n", 267 | "\n", 268 | "np.save(\"y_testMLPs\",y_score)" 269 | ] 270 | } 271 | ], 272 | "metadata": { 273 | "kernelspec": { 274 | "display_name": "Python 3", 275 | "language": "python", 276 | "name": "python3" 277 | }, 278 | "language_info": { 279 | "codemirror_mode": { 280 | "name": "ipython", 281 | "version": 3 282 | }, 283 | "file_extension": ".py", 284 | "mimetype": "text/x-python", 285 | "name": "python", 286 | "nbconvert_exporter": "python", 287 | "pygments_lexer": "ipython3", 288 | "version": "3.7.4" 289 | } 290 | }, 291 | "nbformat": 4, 292 | "nbformat_minor": 4 293 | } 294 | -------------------------------------------------------------------------------- /Notebooks/successful-models/.ipynb_checkpoints/xgb-gridsearch-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Data Mining Challange: *Reddit Gender Text-Classification* \n", 8 | "\n", 9 | "## Modules" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "%%time\n", 19 | "\n", 20 | "#Numpy\n", 21 | "import numpy as np\n", 22 | "\n", 23 | "#Sklearn\n", 24 | "from sklearn.model_selection import RandomizedSearchCV, GridSearchCV # Exhaustive search over specified parameter values for a given estimator\n", 25 | "from sklearn.model_selection import cross_val_score # Evaluate a score by cross-validation\n", 26 | "from sklearn.model_selection import KFold # K-Folds cross-validator providing train/test indices to split data in train/test sets.\n", 27 | "from 
sklearn.model_selection import StratifiedKFold\n", 28 | "from sklearn.metrics import roc_auc_score # Compute Area Under the Receiver Operating Characteristic Curve from prediction scores\n", 29 | "from sklearn.feature_extraction.text import CountVectorizer # Convert a collection of text documents to a matrix of token counts\n", 30 | "\n", 31 | "#XGBoost\n", 32 | "from xgboost import XGBRegressor\n", 33 | "\n", 34 | "# Matplotlib\n", 35 | "import matplotlib # Data visualization\n", 36 | "import matplotlib.pyplot as plt \n", 37 | "import matplotlib.patches as mpatches \n", 38 | "\n", 39 | "#Pickle\n", 40 | "import pickle # To load files\n", 41 | "\n", 42 | "# Joblib\n", 43 | "import joblib # To save models " 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "## Data Collection" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "# load preprocessed data to save tine\n", 60 | "with open(\"../input/challengedadata/comments.txt\", \"rb\") as f:\n", 61 | " clean_train_comments = pickle.load(f) \n", 62 | " f.close()\n", 63 | "\n", 64 | "with open(\"../input/challengedadata/targets.txt\", \"rb\") as ft:\n", 65 | " y = pickle.load(ft) \n", 66 | " ft.close()" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "## Data Manipulation" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "vectorizer = CountVectorizer(analyzer = \"word\",\n", 83 | " max_features = 2000, ngram_range=(1, 2)) \n", 84 | "# converts in np array\n", 85 | "train_data_features = vectorizer.fit_transform(clean_train_comments).toarray()\n", 86 | "\n", 87 | "# create vocabulary\n", 88 | "vocab = vectorizer.get_feature_names()\n", 89 | "\n", 90 | "# counts how many times a word appears\n", 91 | "dist = np.sum(train_data_features, axis=0)\n", 92 | "\n", 93 | "# removes the 40 most utilized words\n", 94 | "for _ in range(40):\n", 95 | " index = np.argmax(dist)\n", 96 | " train_data_features = np.delete(train_data_features, index, axis = 1)\n", 97 | " \n", 98 | "X_len = [[len(x)] for x in train_data_features] \n", 99 | "s = np.concatenate((train_data_features,np.array(X_len)),axis = 1)\n", 100 | "\n", 101 | "# 5000 rows (one per author), and 2000-40+1 (X_len) features\n", 102 | "s.shape\n", 103 | "\n", 104 | "y = np.array(y) " 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "## Model Exploration" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "parameters = {\"learning_rate\":[0.03,0.05,0.07,0.01,0.15,0.2,0.25,0.3],'min_child_weight': [1,4,5,8],'gamma': [0.0, 0.1,0.2, 0.3,0.4,0.5,0.6,0.8],\n", 121 | " 'subsample': [0.6,0.7,0.8,0.9,1], 'colsample_bytree': [0.3,0.4,0.5, 0.6,0.7,0.8,0.9,1],\n", 122 | " 'max_depth': [2,3,4,5,6,7,8,10,12,15], 'scale_pos_weight': [1,2.70, 10, 25, 50, 75, 100, 1000] }\n", 123 | "\n", 124 | "parameters0 = {'min_child_weight': [1,8],'gamma': [0.6,0.8],\n", 125 | " 'subsample': [0.9], 'colsample_bytree': [0.6],\n", 126 | " 'max_depth': [4], 'scale_pos_weight': [1,2.70, 10, 25, 50, 75, 100, 1000] }\n", 127 | "\n", 128 | " \n", 129 | "xgb = XGBRegressor(objective = \"reg:logistic\", n_estimators=10000, \n", 130 | " tree_method = \"gpu_hist\", gpu_id = 0)\n", 131 | "\n", 132 | "\n", 133 | "# Model 
exploration\n", 134 | "xgbClf = GridSearchCV(xgb, param_grid = parameters0, cv = StratifiedKFold(n_splits=10, shuffle = True, random_state = 1001), scoring = \"roc_auc\" ,verbose=True, n_jobs=-1)\n", 135 | "\n", 136 | "# Model fit\n", 137 | "xgbClf.fit(s, y, verbose=False)\n", 138 | "\n", 139 | "# Save model\n", 140 | "joblib.dump(xgbClf, '../working/xgbClf.pkl')\n", 141 | "\n", 142 | "print(\"xgbCLf.best_score = \", xgbClf.best_score_)\n", 143 | "print(\"xgbCLf.best_estimator_ = \", xgbClf.best_estimator_)" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "```Fitting 10 folds for each of 32 candidates, totalling 320 fits\n", 151 | "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.\n", 152 | "[Parallel(n_jobs=-1)]: Done 46 tasks | elapsed: 24.1min\n", 153 | "[Parallel(n_jobs=-1)]: Done 196 tasks | elapsed: 105.6min\n", 154 | "[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed: 172.3min finished\n", 155 | "xgbCLf.best_score = 0.8425215483825477\n", 156 | "xgbCLf.best_estimator_ = XGBRegressor(base_score=0.5, booster=None, colsample_bylevel=1,\n", 157 | " colsample_bynode=1, colsample_bytree=0.6, gamma=0.8, gpu_id=0,\n", 158 | " importance_type='gain', interaction_constraints=None,\n", 159 | " learning_rate=0.300000012, max_delta_step=0, max_depth=4,\n", 160 | " min_child_weight=1, missing=nan, monotone_constraints=None,\n", 161 | " n_estimators=10000, n_jobs=0, num_parallel_tree=1,\n", 162 | " objective='reg:logistic', random_state=0, reg_alpha=0,\n", 163 | " reg_lambda=1, scale_pos_weight=1, subsample=0.9,\n", 164 | " tree_method='gpu_hist', validate_parameters=False, verbosity=None)\n", 165 | "CPU times: user 1min, sys: 14.5 s, total: 1min 14s\n", 166 | "Wall time: 2h 53min 29s\n", 167 | "```" 168 | ] 169 | } 170 | ], 171 | "metadata": { 172 | "kernelspec": { 173 | "display_name": "Python 3", 174 | "language": "python", 175 | "name": "python3" 176 | }, 177 | "language_info": { 178 | "codemirror_mode": { 179 | "name": "ipython", 180 | "version": 3 181 | }, 182 | "file_extension": ".py", 183 | "mimetype": "text/x-python", 184 | "name": "python", 185 | "nbconvert_exporter": "python", 186 | "pygments_lexer": "ipython3", 187 | "version": "3.7.4" 188 | } 189 | }, 190 | "nbformat": 4, 191 | "nbformat_minor": 4 192 | } 193 | -------------------------------------------------------------------------------- /Notebooks/successful-models/mlp-subreddits-5000.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Data Mining Challange: *Reddit Gender Text-Classification* (MLP) \n", 8 | "\n", 9 | "### Modules" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "name": "stdout", 19 | "output_type": "stream", 20 | "text": [ 21 | "Populating the interactive namespace from numpy and matplotlib\n" 22 | ] 23 | } 24 | ], 25 | "source": [ 26 | "# Numpy & matplotlib for notebooks \n", 27 | "%pylab inline\n", 28 | "\n", 29 | "# Pandas for data analysis and manipulation \n", 30 | "import pandas as pd \n", 31 | "\n", 32 | "# Sklearn \n", 33 | "from sklearn.preprocessing import StandardScaler # to standardize features by removing the mean and scaling to unit variance (z=(x-u)/s)\n", 34 | "from sklearn.neural_network import MLPClassifier # Multi-layer Perceptron classifier which optimizes the log-loss function using LBFGS or sdg.\n", 
35 | "from sklearn.model_selection import train_test_split # to split arrays or matrices into random train and test subsets\n", 36 | "from sklearn.model_selection import KFold # K-Folds cross-validator providing train/test indices to split data in train/test sets.\n", 37 | "from sklearn.decomposition import PCA, TruncatedSVD # Principal component analysis (PCA); dimensionality reduction using truncated SVD.\n", 38 | "from sklearn.linear_model import LogisticRegression \n", 39 | "from sklearn.naive_bayes import MultinomialNB # Naive Bayes classifier for multinomial models\n", 40 | "from sklearn.feature_extraction.text import CountVectorizer # Convert a collection of text documents to a matrix of token counts\n", 41 | "from sklearn.metrics import roc_auc_score as roc # Compute Area Under the Receiver Operating Characteristic Curve from prediction scores\n", 42 | "from sklearn.metrics import roc_curve, auc # Compute ROC; Compute Area Under the Curve (AUC) using the trapezoidal rule\n", 43 | "\n", 44 | "# Matplotlib\n", 45 | "import matplotlib # Data visualization\n", 46 | "import matplotlib.pyplot as plt \n", 47 | "import matplotlib.patches as mpatches \n", 48 | "\n", 49 | "# Seaborn\n", 50 | "import seaborn as sns # Statistical data visualization (based on matplotlib)" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "### Data Collection " 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 2, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "# Import the training dataset, test dataset and target\n", 67 | "\n", 68 | "# Import the training dataset\n", 69 | "train_data = pd.read_csv(\"../input/dataset/train_data.csv\", encoding=\"utf8\")\n", 70 | "\n", 71 | "# Import the test dataset\n", 72 | "test_data = pd.read_csv(\"../input/dataset/test_data.csv\", encoding=\"utf8\")\n", 73 | "\n", 74 | "# Import the target\n", 75 | "target = pd.read_csv(\"../input/dataset/train_target.csv\")\n", 76 | "\n", 77 | "# Create a dictionary of authors\n", 78 | "author_gender = {}\n", 79 | "for i in range(len(target)):\n", 80 | " author_gender[target.author[i]] = target.gender[i]" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "### Data Manipulation " 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 3, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "# Create a list of aggregated binary subreddits \n", 97 | "Xs = []\n", 98 | "# Create a list of genders\n", 99 | "y = []\n", 100 | "# Create a list of authors\n", 101 | "a = []\n", 102 | "\n", 103 | "# Populate the lists \n", 104 | "for author, group in train_data.groupby(\"author\"):\n", 105 | " Xs.append(group.subreddit.str.cat(sep = \" \"))\n", 106 | " y.append(author_gender[author])\n", 107 | " a.append(author)\n", 108 | " \n", 109 | "# Lower text in comments \n", 110 | "clean_train_subreddits = [xs.lower() for xs in Xs]" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "### Models Definition & Training\n", 118 | "\n", 119 | "#### CountVectorizer" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 4, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "# Define CountVectorizer \n", 129 | "vectorizer_ = CountVectorizer(analyzer = \"word\", \n", 130 | " tokenizer = None, \n", 131 | " preprocessor = None, \n", 132 | " stop_words = None,\n", 133 | " binary=True\n", 134 | " ) #500\n", 135 | "# Train 
CountVectorizer \n", 136 | "train_data_subreddits = vectorizer_.fit_transform(clean_train_subreddits).toarray()\n", 137 | "\n", 138 | "sum(train_data_subreddits[1])\n", 139 | "\n", 140 | "y = np.array(y)" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "#### MLP Classifier" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 5, 153 | "metadata": { 154 | "collapsed": true, 155 | "jupyter": { 156 | "outputs_hidden": true 157 | } 158 | }, 159 | "outputs": [ 160 | { 161 | "name": "stdout", 162 | "output_type": "stream", 163 | "text": [ 164 | "Iteration 1, loss = 0.59613047\n", 165 | "Validation score: 0.734000\n", 166 | "Iteration 2, loss = 0.47953355\n", 167 | "Validation score: 0.814000\n", 168 | "Iteration 3, loss = 0.39179575\n", 169 | "Validation score: 0.864000\n", 170 | "Iteration 4, loss = 0.33398556\n", 171 | "Validation score: 0.870000\n", 172 | "Iteration 5, loss = 0.29788160\n", 173 | "Validation score: 0.860000\n", 174 | "Iteration 6, loss = 0.27413851\n", 175 | "Validation score: 0.858000\n", 176 | "Iteration 7, loss = 0.25758394\n", 177 | "Validation score: 0.856000\n", 178 | "Iteration 8, loss = 0.24291078\n", 179 | "Validation score: 0.858000\n", 180 | "Iteration 9, loss = 0.23275980\n", 181 | "Validation score: 0.864000\n", 182 | "Iteration 10, loss = 0.22383857\n", 183 | "Validation score: 0.860000\n", 184 | "Iteration 11, loss = 0.21650923\n", 185 | "Validation score: 0.860000\n", 186 | "Iteration 12, loss = 0.21024405\n", 187 | "Validation score: 0.850000\n", 188 | "Iteration 13, loss = 0.20492907\n", 189 | "Validation score: 0.850000\n", 190 | "Iteration 14, loss = 0.20017990\n", 191 | "Validation score: 0.848000\n", 192 | "Iteration 15, loss = 0.19573230\n", 193 | "Validation score: 0.850000\n", 194 | "Validation score did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.\n" 195 | ] 196 | }, 197 | { 198 | "data": { 199 | "text/plain": [ 200 | "MLPClassifier(activation='relu', alpha=0.05, batch_size='auto', beta_1=0.9,\n", 201 | " beta_2=0.999, early_stopping=True, epsilon=1e-08,\n", 202 | " hidden_layer_sizes=(100,), learning_rate='invscaling',\n", 203 | " learning_rate_init=0.001, max_fun=15000, max_iter=400,\n", 204 | " momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,\n", 205 | " power_t=0.5, random_state=0, shuffle=True, solver='adam',\n", 206 | " tol=0.0001, validation_fraction=0.1, verbose=True,\n", 207 | " warm_start=False)" 208 | ] 209 | }, 210 | "execution_count": 5, 211 | "metadata": {}, 212 | "output_type": "execute_result" 213 | } 214 | ], 215 | "source": [ 216 | "# Define MLP Classifier:\n", 217 | "## Activation function for the hidden layer: \"rectified linear unit function\"\n", 218 | "## Solver for weight optimization: \"stochastic gradient-based optimizer\"\n", 219 | "## Alpha: regularization parameter\n", 220 | "## Learning rate schedule for weight updates: \"gradually decreases the learning rate at each time step t using an inverse scaling exponent of power_t\"\n", 221 | "## Verbose: \"True\" in order to print progress messages to stdout.\n", 222 | "## Early stopping: \"True\" in order to use early stopping to terminate training when validation score is not improving. 
It automatically sets aside 10% of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs.\n", 223 | "\n", 224 | "mlpClf = MLPClassifier(activation= 'relu', solver = 'adam', \n", 225 | " alpha = 0.05, learning_rate = 'invscaling', verbose = True, \n", 226 | " early_stopping = True, max_iter = 400, random_state=0)\n", 227 | "\n", 228 | " \n", 229 | "# K fold per la cross-validation\n", 230 | "kfold = KFold(n_splits = 10)\n", 231 | "\n", 232 | "# Training and validation on all K folds\n", 233 | "# for train_indices, test_indices in kf.split(train_data_subreddits):\n", 234 | "# mlpClf.fit(train_data_subreddits[train_indices], y[train_indices])\n", 235 | "# print(mlpClf.score(train_data_subreddits[test_indices], y[test_indices]))\n", 236 | " \n", 237 | "# cross_val_score resets parameters of my_model and fits it on X_train and t_train with cross validation (we did it for consistency).\n", 238 | "# results = cross_val_score(my_model, s, y, cv=kfold, scoring='roc_auc')\n", 239 | "# print(\"roc = \", np.mean(results))\n", 240 | " \n", 241 | "# Model fit\n", 242 | "mlpClf.fit(train_data_subreddits, y)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "### Prediction " 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 6, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "Xs_test = []\n", 259 | "for author, group in test_data.groupby(\"author\"):\n", 260 | " Xs_test.append(group.subreddit.str.cat(sep = \" \"))\n", 261 | " \n", 262 | "clean_test_subreddits = [xs.lower() for xs in Xs_test]\n", 263 | "\n", 264 | "test_data_subreddits = vectorizer_.transform(clean_test_subreddits).toarray()\n", 265 | "\n", 266 | "y_score = mlpClf.predict_proba(test_data_subreddits)[:,1]\n", 267 | "\n", 268 | "np.save(\"y_testMLPs\",y_score)" 269 | ] 270 | } 271 | ], 272 | "metadata": { 273 | "kernelspec": { 274 | "display_name": "Python 3", 275 | "language": "python", 276 | "name": "python3" 277 | }, 278 | "language_info": { 279 | "codemirror_mode": { 280 | "name": "ipython", 281 | "version": 3 282 | }, 283 | "file_extension": ".py", 284 | "mimetype": "text/x-python", 285 | "name": "python", 286 | "nbconvert_exporter": "python", 287 | "pygments_lexer": "ipython3", 288 | "version": "3.7.4" 289 | } 290 | }, 291 | "nbformat": 4, 292 | "nbformat_minor": 4 293 | } 294 | -------------------------------------------------------------------------------- /Notebooks/successful-models/xgb-gridsearch.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Data Mining Challange: *Reddit Gender Text-Classification* \n", 8 | "\n", 9 | "## Modules" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "%%time\n", 19 | "\n", 20 | "#Numpy\n", 21 | "import numpy as np\n", 22 | "\n", 23 | "#Sklearn\n", 24 | "from sklearn.model_selection import RandomizedSearchCV, GridSearchCV # Exhaustive search over specified parameter values for a given estimator\n", 25 | "from sklearn.model_selection import cross_val_score # Evaluate a score by cross-validation\n", 26 | "from sklearn.model_selection import KFold # K-Folds cross-validator providing train/test indices to split data in train/test sets.\n", 27 | "from sklearn.model_selection import 
StratifiedKFold\n", 28 | "from sklearn.metrics import roc_auc_score # Compute Area Under the Receiver Operating Characteristic Curve from prediction scores\n", 29 | "from sklearn.feature_extraction.text import CountVectorizer # Convert a collection of text documents to a matrix of token counts\n", 30 | "\n", 31 | "#XGBoost\n", 32 | "from xgboost import XGBRegressor\n", 33 | "\n", 34 | "# Matplotlib\n", 35 | "import matplotlib # Data visualization\n", 36 | "import matplotlib.pyplot as plt \n", 37 | "import matplotlib.patches as mpatches \n", 38 | "\n", 39 | "#Pickle\n", 40 | "import pickle # To load files\n", 41 | "\n", 42 | "# Joblib\n", 43 | "import joblib # To save models " 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "## Data Collection" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "# load preprocessed data to save tine\n", 60 | "with open(\"../input/challengedadata/comments.txt\", \"rb\") as f:\n", 61 | " clean_train_comments = pickle.load(f) \n", 62 | " f.close()\n", 63 | "\n", 64 | "with open(\"../input/challengedadata/targets.txt\", \"rb\") as ft:\n", 65 | " y = pickle.load(ft) \n", 66 | " ft.close()" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "## Data Manipulation" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "vectorizer = CountVectorizer(analyzer = \"word\",\n", 83 | " max_features = 2000, ngram_range=(1, 2)) \n", 84 | "# converts in np array\n", 85 | "train_data_features = vectorizer.fit_transform(clean_train_comments).toarray()\n", 86 | "\n", 87 | "# create vocabulary\n", 88 | "vocab = vectorizer.get_feature_names()\n", 89 | "\n", 90 | "# counts how many times a word appears\n", 91 | "dist = np.sum(train_data_features, axis=0)\n", 92 | "\n", 93 | "# removes the 40 most utilized words\n", 94 | "for _ in range(40):\n", 95 | " index = np.argmax(dist)\n", 96 | " train_data_features = np.delete(train_data_features, index, axis = 1)\n", 97 | " \n", 98 | "X_len = [[len(x)] for x in train_data_features] \n", 99 | "s = np.concatenate((train_data_features,np.array(X_len)),axis = 1)\n", 100 | "\n", 101 | "# 5000 rows (one per author), and 2000-40+1 (X_len) features\n", 102 | "s.shape\n", 103 | "\n", 104 | "y = np.array(y) " 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "## Model Exploration" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "parameters = {\"learning_rate\":[0.03,0.05,0.07,0.01,0.15,0.2,0.25,0.3],'min_child_weight': [1,4,5,8],'gamma': [0.0, 0.1,0.2, 0.3,0.4,0.5,0.6,0.8],\n", 121 | " 'subsample': [0.6,0.7,0.8,0.9,1], 'colsample_bytree': [0.3,0.4,0.5, 0.6,0.7,0.8,0.9,1],\n", 122 | " 'max_depth': [2,3,4,5,6,7,8,10,12,15], 'scale_pos_weight': [1,2.70, 10, 25, 50, 75, 100, 1000] }\n", 123 | "\n", 124 | "parameters0 = {'min_child_weight': [1,8],'gamma': [0.6,0.8],\n", 125 | " 'subsample': [0.9], 'colsample_bytree': [0.6],\n", 126 | " 'max_depth': [4], 'scale_pos_weight': [1,2.70, 10, 25, 50, 75, 100, 1000] }\n", 127 | "\n", 128 | " \n", 129 | "xgb = XGBRegressor(objective = \"reg:logistic\", n_estimators=10000, \n", 130 | " tree_method = \"gpu_hist\", gpu_id = 0)\n", 131 | "\n", 132 | "\n", 133 | "# Model exploration\n", 134 | "xgbClf = 
GridSearchCV(xgb, param_grid = parameters0, cv = StratifiedKFold(n_splits=10, shuffle = True, random_state = 1001), scoring = \"roc_auc\" ,verbose=True, n_jobs=-1)\n", 135 | "\n", 136 | "# Model fit\n", 137 | "xgbClf.fit(s, y, verbose=False)\n", 138 | "\n", 139 | "# Save model\n", 140 | "joblib.dump(xgbClf, '../working/xgbClf.pkl')\n", 141 | "\n", 142 | "print(\"xgbCLf.best_score = \", xgbClf.best_score_)\n", 143 | "print(\"xgbCLf.best_estimator_ = \", xgbClf.best_estimator_)" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "```Fitting 10 folds for each of 32 candidates, totalling 320 fits\n", 151 | "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.\n", 152 | "[Parallel(n_jobs=-1)]: Done 46 tasks | elapsed: 24.1min\n", 153 | "[Parallel(n_jobs=-1)]: Done 196 tasks | elapsed: 105.6min\n", 154 | "[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed: 172.3min finished\n", 155 | "xgbCLf.best_score = 0.8425215483825477\n", 156 | "xgbCLf.best_estimator_ = XGBRegressor(base_score=0.5, booster=None, colsample_bylevel=1,\n", 157 | " colsample_bynode=1, colsample_bytree=0.6, gamma=0.8, gpu_id=0,\n", 158 | " importance_type='gain', interaction_constraints=None,\n", 159 | " learning_rate=0.300000012, max_delta_step=0, max_depth=4,\n", 160 | " min_child_weight=1, missing=nan, monotone_constraints=None,\n", 161 | " n_estimators=10000, n_jobs=0, num_parallel_tree=1,\n", 162 | " objective='reg:logistic', random_state=0, reg_alpha=0,\n", 163 | " reg_lambda=1, scale_pos_weight=1, subsample=0.9,\n", 164 | " tree_method='gpu_hist', validate_parameters=False, verbosity=None)\n", 165 | "CPU times: user 1min, sys: 14.5 s, total: 1min 14s\n", 166 | "Wall time: 2h 53min 29s\n", 167 | "```" 168 | ] 169 | } 170 | ], 171 | "metadata": { 172 | "kernelspec": { 173 | "display_name": "Python 3", 174 | "language": "python", 175 | "name": "python3" 176 | }, 177 | "language_info": { 178 | "codemirror_mode": { 179 | "name": "ipython", 180 | "version": 3 181 | }, 182 | "file_extension": ".py", 183 | "mimetype": "text/x-python", 184 | "name": "python", 185 | "nbconvert_exporter": "python", 186 | "pygments_lexer": "ipython3", 187 | "version": "3.7.4" 188 | } 189 | }, 190 | "nbformat": 4, 191 | "nbformat_minor": 4 192 | } 193 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 |

Badges: Size · Forks · Stars · Languages · Contributors · MIT Licence · Twitter

# Reddit Gender Text-Classification

Badges: Kaggle · nbviewer · Colab

43 | 44 | ## How to Explore this Work 45 | 46 | * Read the code in the [Jupyter notebooks](https://nbviewer.jupyter.org/github/pitmonticone/RedditTextClassification/blob/master/Notebooks/notebook.ipynb). 47 | * Run the code in the [Kaggle notebook](https://www.kaggle.com/inphyt2020/dataminingchallange). 48 | 49 | ## Overview 50 | 51 | ### Description 52 | 53 | [Reddit](http://www.reddit.com/) is an entertainment, social networking, and news website where registered community members can submit content, such as text posts or direct links, making it essentially an online bulletin board system. Registered users can then vote submissions up or down to organize the posts and determine their position on the site's pages. Content entries are organized by areas of interest called "subreddits". The subreddit topics include news, gaming, movies, music, books, fitness, food, and photosharing, among many others. 54 | 55 | When items (links or text posts) are submitted to a subreddit, users (redditors) can vote for or against them (upvote/downvote). Each subreddit has a front page that shows newer submissions that have been rated highly. Redditors can also post comments about the submission, and respond back and forth in a conversation-tree of comments; the comments themselves can also be upvoted and downvoted. The front page of the site itself shows a combination of the highest-rated posts out of all the subreddits a user is subscribed to. 56 | 57 | The Reddit website has an API and its code is [open source](https://github.com/reddit/reddit/#apis). In July 2015, a Reddit user identified as Stuck_In_the_Matrix [made public a dataset of Reddit comments](https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment) for research. The dataset has approximately 1.7 billion comments and takes 250 GB compressed. Each entry contains comment, score, author, subreddit, position in comment tree and other fields that are available through Reddit's API. 58 | 59 | One of the user attributes that is not natively supported by the Reddit platform is gender. However, in some subreddits, users can self report their genders as part of the subreddit rules. In the scope of this competition, users that self reported their gender are selected from the dataset, and your goal is to predict the gender of these users. 60 | 61 | ### Evaluation 62 | 63 | The evaluation metric for this competition is the [Area Under the ROC Curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic). This metric is used to evaluate binary classification, and in the scope of this competition we are representing the gender of the users as binary classes: the class "female" is represented as 1 and the class "male" as 0. The class prediction for each Reddit author corresponds to your confidence that the author is a female, which is a "score" computed for the author (e.g. estimated probability in logistic regression). 64 | 65 | #### Submission Format 66 | 67 | **For every author in the dataset**, submission files should contain two columns: author and gender. The column author should be a string. The column gender can be any real value. The higher your confidence that the author is female, the higher the corresponding value in the gender column should be. 68 | 69 | ## Data ([Download](https://kaggle.com/datasets/f85c0f5c74874fb56104d5e9875752c594aa98464c6a85b1be42c6755ccf6ad8)) 70 | 71 | We selected a total of 20k users with self reported gender.
Among these, we selected 5000 for training, and the remaining 15000 are used for evaluation. 72 | 73 | ### File Descriptions 74 | 75 | * **train_data.csv**: contains all comments of the users selected for training 76 | * **train_target.csv**: contains the genders of the users selected for training 77 | * **test_data.csv**: contains the comments of the users selected for evaluation 78 | * **sample.csv**: a sample submission file in the correct format 79 | 80 | ### Data Fields 81 | 82 | Each comment has the following structure: 83 | 84 | * **author**: contains the username of the author 85 | * **subreddit**: contains the subreddit in which the comment was posted 86 | * **created_utc**: contains the date of submission in unixtime format 87 | * **body**: contains the text of the comment 88 | 89 | ## Solution 90 | 91 | ### Unsuccessful Models 92 | 93 | An exploration of [SpaCy](https://github.com/explosion/spaCy) was performed. One may find the relevant notebooks [here](https://nbviewer.jupyter.org/github/InPhyT/DataMiningChallange/tree/master/spaCy/). The model works and has a similar strategy to the one presented below, though its performance is lower (roc = 0.894). The exploration concluded with this [Stack Overflow question](https://stackoverflow.com/questions/60821793/text-classification-with-spacy-going-beyond-the-basics-to-improve-performance), this [GitHub Issue](https://github.com/explosion/spaCy/issues/5224) and a comment on a [Feature Request](https://github.com/explosion/spaCy/issues/2253#issuecomment-605502320). 94 | We've also tried [neural networks](https://nbviewer.jupyter.org/github/pitmonticone/RedditTextClassification/tree/master/Notebooks/other-attempts/keras-neural-networks) with `Keras`. 95 | 96 | ### Successful Models 97 | 98 | The training set has been grouped by author and the resulting texts, as if aggregated with `" ".join`, have been turned into a BOW (see this [brief Kaggle tutorial](https://www.kaggle.com/matleonard/text-classification#Bag-of-Words)). 99 | 100 | 1. 80% of the resulting data has been used to train an [XGBoost](https://www.kaggle.com/alexisbcook/xgboost), which was later used to predict the remaining 20%. 101 | 2. A [Document Embedding model](https://medium.com/wisio/a-gentle-introduction-to-doc2vec-db3e8c0cce5e) has been fitted on test and training texts. 80% of the training vectors were later used to train a [Multi Layer Perceptron](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html), which then predicted the remaining 20% and the test set. 102 | 3. An MLP on the binary count-vectorized subreddits has been trained, just like the models above. 103 | 4. The predictions on that 20% from the XGBoost and the two MLPs were used to train and validate a final logistic regression. 104 | 5. Finally, a new XGBoost and two new MLPs were trained on the full training set, and their predictions were fed to the logistic regression to output the final submission (minimal sketches of the document-embedding and stacking steps follow below).
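To make step 2 more concrete, here is a minimal `gensim` sketch of how per-author document vectors could be obtained. The author texts, tags, and all hyper-parameter values below are placeholders chosen only so the snippet runs; they are not the preprocessing or settings used in the notebooks.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder per-author texts (in the real pipeline these are the aggregated comments).
author_texts = [
    "cats are great and so is hiking",
    "i mostly post about football and cars",
]

# One TaggedDocument per author; the tag lets us retrieve the trained vector afterwards.
documents = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(author_texts)]

# Tiny vector_size and min_count=1 only so this toy example trains; real values would differ.
model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=40, seed=0)

# Trained vector of each training author (model.docvecs[i] in gensim 3.x),
# plus an inferred vector for an unseen author.
train_vectors = [model.dv[i] for i in range(len(author_texts))]
unseen_vector = model.infer_vector("new author comments about cooking".split())
```

These dense per-author vectors are what the second MLP is trained on.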
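The stacking described in steps 1–5 can be summarized with the sketch below. The feature matrices (`X_bow`, `X_d2v`, `X_sub`), their sizes, and the model hyper-parameters are illustrative placeholders rather than the repository's actual variables, which live in the `successful-models` notebooks; the point is only the data flow: three base models fitted on 80% of the authors, and their scores on the held-out 20% used to fit a logistic-regression meta-model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from xgboost import XGBRegressor

# Placeholder per-author feature views and labels (random, just to make the sketch runnable):
# BOW of comment texts, doc2vec vectors, binary count-vectorized subreddits.
rng = np.random.default_rng(0)
n_authors = 200
X_bow = rng.integers(0, 3, size=(n_authors, 50)).astype(float)
X_d2v = rng.normal(size=(n_authors, 20))
X_sub = rng.integers(0, 2, size=(n_authors, 30)).astype(float)
y = rng.integers(0, 2, size=n_authors)

# Steps 1-3: hold out 20% of the authors and fit one base model per feature view on the 80%.
idx_train, idx_val = train_test_split(np.arange(n_authors), test_size=0.2,
                                      random_state=0, stratify=y)
xgb = XGBRegressor(objective="reg:logistic", n_estimators=100)  # BOW view
mlp_d2v = MLPClassifier(max_iter=500, random_state=0)           # doc2vec view
mlp_sub = MLPClassifier(max_iter=500, random_state=0)           # subreddit view
xgb.fit(X_bow[idx_train], y[idx_train])
mlp_d2v.fit(X_d2v[idx_train], y[idx_train])
mlp_sub.fit(X_sub[idx_train], y[idx_train])

# Step 4: the base-model scores on the held-out 20% become the features of the meta-model.
meta_features = np.column_stack([
    xgb.predict(X_bow[idx_val]),                  # already in [0, 1] with reg:logistic
    mlp_d2v.predict_proba(X_d2v[idx_val])[:, 1],
    mlp_sub.predict_proba(X_sub[idx_val])[:, 1],
])
meta_model = LogisticRegression().fit(meta_features, y[idx_val])
print("validation AUC of each base model:",
      [round(roc_auc_score(y[idx_val], meta_features[:, j]), 3) for j in range(3)])

# Step 5: refit the base models on all training authors, score the test set with them,
# and pass those three scores through the fitted logistic regression for the submission.
```

Keeping the meta-model a plain logistic regression over three scores keeps the stacking layer small, which limits the risk of over-fitting the 20% split it is trained on.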
105 | 106 | ![](https://github.com/pitmonticone/RedditTextClassification/blob/master/images/flow-chart.png) 107 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | title: [Reddit Gender Text Classification] 2 | description: [] 3 | theme: jekyll-theme-cayman 4 | -------------------------------------------------------------------------------- /images/flow-chart.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pitmonticone/RedditTextClassification/fdd8b3a6e649781df9147599889c4669517f65ab/images/flow-chart.png -------------------------------------------------------------------------------- /index.md: -------------------------------------------------------------------------------- 1 | ## Documents 2 | 3 | * Read the code in the [Jupyter notebooks](https://nbviewer.jupyter.org/github/pitmonticone/RedditTextClassification/blob/master/Notebooks/notebook.ipynb). 4 | * Run the code in the [Kaggle notebook](https://www.kaggle.com/inphyt2020/dataminingchallange). 5 | 6 | ## Overview 7 | 8 | ### Description 9 | 10 | [Reddit](http://www.reddit.com/) is an entertainment, social networking, and news website where registered community members can submit content, such as text posts or direct links, making it essentially an online bulletin board system. Registered users can then vote submissions up or down to organize the posts and determine their position on the site's pages. Content entries are organized by areas of interest called "subreddits". The subreddit topics include news, gaming, movies, music, books, fitness, food, and photosharing, among many others. 11 | 12 | When items (links or text posts) are submitted to a subreddit, users (redditors) can vote for or against them (upvote/downvote). Each subreddit has a front page that shows newer submissions that have been rated highly. Redditors can also post comments about the submission, and respond back and forth in a conversation-tree of comments; the comments themselves can also be upvoted and downvoted. The front page of the site itself shows a combination of the highest-rated posts out of all the subreddits a user is subscribed to. 13 | 14 | The Reddit website has an API and its code is [open source](https://github.com/reddit/reddit/#apis). In July 2015, a Reddit user identified as Stuck_In_the_Matrix [made public a dataset of Reddit comments](https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment) for research. The dataset has approximately 1.7 billion comments and takes 250 GB compressed. Each entry contains comment, score, author, subreddit, position in comment tree and other fields that are available through Reddit's API. 15 | 16 | One of the user attributes that is not natively supported by the Reddit platform is the gender. However, in some subreddits, users can self report their genders as part of the subreddit rules. In the scope of this competition, users that self reported their gender are selected from the dataset, and your goal is to predict the gender of these users. 17 | 18 | ### Evaluation 19 | 20 | The evaluation metric for this competition is the [Area Under the ROC Curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic). 
This metric is used to evaluate binary classification, and in the scope of this competition we are representing the gender of the users as binary classes: the class "female" is represented as 1 and the class "male" as 0. The class prediction for each Reddit author corresponds to your confidence that the author is a female, which is a "score" computed for the author (e.g. estimated probability in logistic regression). 21 | 22 | #### Submission Format 23 | 24 | **For every author in the dataset**, submission files should contain two columns: author and gender. The column author should be a string. The column gender can be any real value. The higher your confidence that the author is female, the higher the corresponding value in the gender column should be. 25 | 26 | ## Data 27 | 28 | We selected a total of 20k users with self reported gender. Among these, we selected 5000 for training, and the remaining 15000 are used for evaluation. 29 | 30 | ### File Descriptions 31 | 32 | * **train_data.csv.gz**: contains all comments of the users selected for training 33 | * **train_target.csv**: contains the genders of the users selected for training 34 | * **test_data.csv.gz**: contains the comments of the users selected for evaluation 35 | * **sample.csv**: a sample submission file in the correct format 36 | 37 | ### Data Fields 38 | 39 | Each comment has the following structure: 40 | 41 | * **author**: contains the username of the author 42 | * **subreddit**: contains the subreddit in which the comment was posted 43 | * **created_utc**: contains the date of submission in unixtime format 44 | * **body**: contains the text of the comment 45 | 46 | ## Solution 47 | 48 | ### Unsuccessful Models 49 | 50 | An exploration of [SpaCy](https://github.com/explosion/spaCy) was performed. One may find the relevant notebooks [here](https://nbviewer.jupyter.org/github/InPhyT/DataMiningChallange/tree/master/spaCy/). The model works and has a similar strategy to the one presented below, though its performance is lower (roc = 0.894). The exploration concluded with this [Stack Overflow question](https://stackoverflow.com/questions/60821793/text-classification-with-spacy-going-beyond-the-basics-to-improve-performance), this [GitHub Issue](https://github.com/explosion/spaCy/issues/5224) and a comment on a [Feature Request](https://github.com/explosion/spaCy/issues/2253#issuecomment-605502320). 51 | We've also tried [neural networks](https://nbviewer.jupyter.org/github/pitmonticone/RedditTextClassification/tree/master/Notebooks/other-attempts/keras-neural-networks) with `Keras`. 52 | 53 | ### Successful Models 54 | 55 | The training set has been grouped by author and the resulting texts, as if aggregated with `" ".join`, have been turned into a BOW (see this [brief Kaggle tutorial](https://www.kaggle.com/matleonard/text-classification#Bag-of-Words)). 56 | 57 | 1. 80% of the resulting data has been used to train an [XGBoost](https://www.kaggle.com/alexisbcook/xgboost), which was later used to predict the remaining 20%. 58 | 2. A [Document Embedding model](https://medium.com/wisio/a-gentle-introduction-to-doc2vec-db3e8c0cce5e) has been fitted on test and training texts. 80% of the training vectors were later used to train a [Multi Layer Perceptron](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html), which then predicted the remaining 20% and the test set. 59 | 3. An MLP on the binary count-vectorized subreddits has been trained, just like the models above. 60 | 4.
The predictions on that 20% from the XGBoost and the two MLPs were used to train and validate a final logistic regression. 61 | 5. Finally, a new XGBoost and two new MLPs were trained on the full training set, and their predictions were fed to the logistic regression to output the final submission. 62 | 63 | ![](https://github.com/pitmonticone/RedditTextClassification/blob/master/images/flow-chart.png) 64 | --------------------------------------------------------------------------------