├── .DS_Store ├── notebooks ├── .DS_Store ├── images │ ├── example.png │ ├── 1_2i-GJO7JX0Yz6498jUvhEg.png │ ├── 1_7uW5hLXztSu_FOmZOWpB6g.png │ ├── 1_BME1JjIlBEAI9BV5pOO5Mg.png │ └── 1_x8gTiprhLs7zflmEn1UjAQ.png ├── 1_Getting_Data_from_json_Files .ipynb ├── 5_Keras_CNN_with_3_Conv_Layers_and_RNN_with_2_GRU_Layers.ipynb ├── .ipynb_checkpoints │ ├── 5_Keras_CNN_with_3_Conv_Layers_and_RNN_with_2_GRU_Layers-checkpoint.ipynb │ └── 6_Keras_with_Different_Layer_Types_and_Numbers-checkpoint.ipynb ├── 6_Keras_with_Different_Layer_Types_and_Numbers.ipynb └── 4_Torch_Models.ipynb └── README.md /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ezgigm/sentiment_analysis_and_product_recommendation/HEAD/.DS_Store -------------------------------------------------------------------------------- /notebooks/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ezgigm/sentiment_analysis_and_product_recommendation/HEAD/notebooks/.DS_Store -------------------------------------------------------------------------------- /notebooks/images/example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ezgigm/sentiment_analysis_and_product_recommendation/HEAD/notebooks/images/example.png -------------------------------------------------------------------------------- /notebooks/images/1_2i-GJO7JX0Yz6498jUvhEg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ezgigm/sentiment_analysis_and_product_recommendation/HEAD/notebooks/images/1_2i-GJO7JX0Yz6498jUvhEg.png -------------------------------------------------------------------------------- /notebooks/images/1_7uW5hLXztSu_FOmZOWpB6g.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ezgigm/sentiment_analysis_and_product_recommendation/HEAD/notebooks/images/1_7uW5hLXztSu_FOmZOWpB6g.png -------------------------------------------------------------------------------- /notebooks/images/1_BME1JjIlBEAI9BV5pOO5Mg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ezgigm/sentiment_analysis_and_product_recommendation/HEAD/notebooks/images/1_BME1JjIlBEAI9BV5pOO5Mg.png -------------------------------------------------------------------------------- /notebooks/images/1_x8gTiprhLs7zflmEn1UjAQ.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ezgigm/sentiment_analysis_and_product_recommendation/HEAD/notebooks/images/1_x8gTiprhLs7zflmEn1UjAQ.png -------------------------------------------------------------------------------- /notebooks/1_Getting_Data_from_json_Files .ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Aim of This Notebook" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "In this notebook, I get the data from json files and write them to csv to use easily everytime. 
" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 2, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "# to read the data\n", 24 | "import os\n", 25 | "import json\n", 26 | "import gzip\n", 27 | "\n", 28 | "from urllib.request import urlopen\n", 29 | "\n", 30 | "# dataframe and series \n", 31 | "import pandas as pd\n", 32 | "import numpy as np" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 4, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "# getting data from json.gz file\n", 42 | "\n", 43 | "df_data = []\n", 44 | "with gzip.open('raw data/Kindle_Store_5.json.gz') as data:\n", 45 | " for i in data:\n", 46 | " df_data.append(json.loads(i.strip()))" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 5, 52 | "metadata": {}, 53 | "outputs": [ 54 | { 55 | "name": "stdout", 56 | "output_type": "stream", 57 | "text": [ 58 | "2222983\n", 59 | "{'overall': 4.0, 'verified': True, 'reviewTime': '07 3, 2014', 'reviewerID': 'A2LSKD2H9U8N0J', 'asin': 'B000FA5KK0', 'style': {'Format:': ' Kindle Edition'}, 'reviewerName': 'sandra sue marsolek', 'reviewText': 'pretty good story, a little exaggerated, but I liked it pretty well. liked the characters, the plot..it had mystery, action, love, all of the main things. I think most western lovers would injoy this book', 'summary': 'pretty good story', 'unixReviewTime': 1404345600}\n" 60 | ] 61 | } 62 | ], 63 | "source": [ 64 | "# to see the length of the data, it means total number of reviews also\n", 65 | "print(len(df_data))\n", 66 | "\n", 67 | "# to see the first row of the list\n", 68 | "print(df_data[0])" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "df = pd.DataFrame.from_dict(df_data) # convert dictionary to dataframe" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "df.to_csv('kindle_data.csv', index = False) # to use easily everytime, I write it to csv" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "# Getting Meta Data to Get More Information About Products" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 7, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "data_meta = []\n", 103 | "with gzip.open('raw data/meta_Kindle_Store.json.gz') as d:\n", 104 | " for i in d:\n", 105 | " data_meta.append(json.loads(i.strip()))" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 8, 111 | "metadata": {}, 112 | "outputs": [ 113 | { 114 | "name": "stdout", 115 | "output_type": "stream", 116 | "text": [ 117 | "493552\n", 118 | "{'category': ['Kindle Store', 'Kindle eBooks', 'Literature & Fiction'], 'tech1': '', 'description': [], 'fit': '', 'title': '', 'also_buy': [], 'image': [], 'tech2': '', 'brand': \"Visit Amazon's Rama Bijapurkar Page\", 'feature': [], 'rank': '1,857,911 Paid in Kindle Store (', 'also_view': [], 'main_cat': 'Buy a Kindle', 'similar_item': '', 'date': '', 'price': '', 'asin': '0143065971'}\n" 119 | ] 120 | } 121 | ], 122 | "source": [ 123 | "# to see the length of the data, it means total number of products also\n", 124 | "print(len(data_meta))\n", 125 | "\n", 126 | "# to see the first row of the list\n", 127 | "print(data_meta[0])" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": null, 133 | "metadata": {}, 134 | 
"outputs": [], 135 | "source": [ 136 | "df_meta = pd.DataFrame.from_dict(data_meta) #convert dictionary to dataframe" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [ 145 | "df_meta.to_csv('kindle_meta_last.csv', index = False) # to use easily everytime, I write it to csv" 146 | ] 147 | } 148 | ], 149 | "metadata": { 150 | "kernelspec": { 151 | "display_name": "Python 3", 152 | "language": "python", 153 | "name": "python3" 154 | }, 155 | "language_info": { 156 | "codemirror_mode": { 157 | "name": "ipython", 158 | "version": 3 159 | }, 160 | "file_extension": ".py", 161 | "mimetype": "text/x-python", 162 | "name": "python", 163 | "nbconvert_exporter": "python", 164 | "pygments_lexer": "ipython3", 165 | "version": "3.6.9" 166 | } 167 | }, 168 | "nbformat": 4, 169 | "nbformat_minor": 2 170 | } 171 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Sentiment Analysis and Product Recommendation from Reviews 2 | From Kindle Store Reviews on Amazon, sentiment analysis and book recommendation 3 | 4 | **Problem:** 5 | 6 | Day by day, demand in e-commerce is increasing. With the increasing demand in online stores, the voice of customer concepts such as reviews and customer experiences are getting more important because customers buy products without seeing or touching them. If the company fails to meet this need of customers, it loses money because of not taking strategic decisions. 7 | 8 | **Aim:** 9 | 10 | Protect company from losing money with increasing customer satisfaction and giving importance to their feedback. 11 | 12 | **Solution:** 13 | 14 | - define good or bad products as quick as possible according to reviews and take action for this 15 | 16 | For this solution, I worked on sentiment analysis with different models. The model predicts reviews as positive or negative from text. 17 | 18 | - recommend customers related products to increase satisfaction with decreasing search time for suitable product 19 | 20 | I built recommendation system in this project for this solution. 21 | 22 | **What Will These Solutions Bring to The Company?** 23 | 24 | - easy product comparison 25 | - defining dislikes easily 26 | - saving time 27 | - more money with selling more products 28 | - happier customers = more customers = more money 29 | - less time on server = less problem 30 | 31 | **Data:** 32 | 33 | In this project, I worked on sentiment analysis of Kindle Store reviews in Amazon. I choose this dataset because it is more easy to buy and read a book with Kindle. Going to the book store, finding a book which you like need more time than reaching every book from your tablet. 34 | 35 | The data is obtained from github.io page of [UC San Diego Computer Science and Engineering Department academic staff](https://nijianmo.github.io/amazon/index.html#subsets). Dataset contains product reviews and metadata, including 142.8 million reviews from May 1996 to July 2014. I prefer to use 5-core sample dataset(such that each of the remaining users and items have at least 5 reviews) and metadata for Kindle Store. The reasons to choose 5-core data is that continuous users contains more information than single reviewers. To reach and download metadata, people have to fill the form and submit it. My filtered Kindle Store data consists of 2,222,983 rows and 12 columns. 
Also, I used the metadata to find the corresponding titles of the books from the product ID. The format of the raw data is json. 36 | 37 | **Plan:** 38 | 39 | - ***Understanding, Cleaning and Exploring Data:*** To analyze the distributions of data points, I observed each column separately and compared common words in positive, negative and neutral reviews. The first challenge of this data is to clean the text of items that are unnecessary for modeling, such as punctuation, upper-case letters etc. Detailed data analysis can be found [here](https://github.com/ezgigm/sentiment_analysis_and_product_recommendation/blob/master/notebooks/2_Understanding_EDA_Preparation.ipynb). 40 | 41 | - ***Preparing Data for Modeling:*** The target was changed to a binary class. Machine learning models and neural net models have different preparation strategies. Mainly, vectorization/tokenization, splitting into train and test sets, and padding were done. Detailed pre-processing techniques and steps can be found in the corresponding notebooks. 42 | 43 | - ***Modeling:*** First, LogReg, DecisionTree, Extra-Trees, RandomForest, XGBoost and LGBM classifiers were tried. Then, the FastText class of Torch models was tried with different parameters. Keras models were built as a CNN with 3 convolutional layers, an RNN with 2 GRU layers, an RNN with 2 LSTM layers, an RNN with 2 CuDNNGRU layers and a CNN with 2 convolutional layers. At last, a pre-trained BERT model was tried. 44 | 45 | - ***Evaluation and Results:*** To compare my results, I used balanced accuracy for the machine learning models and loss values for the deep learning models. I also calculated accuracy values for the neural nets to present my results clearly. Although the maximum accuracy among the machine learning models is 87% on the test set with LogReg, it is 95% among the deep learning models with the pre-trained BertForSequenceClassification model from BERT. It means that my model can predict the sentiment of a review as positive or negative with 95% accuracy. 46 | 47 | - ***Recommendation Systems:*** Three different systems were established. The first is collaborative filtering with matrix factorization (SVDS), and the second is user-user based cosine similarity. As the last one, I tried to solve the cold-start problem by taking a small amount of information, such as keywords, from the new user. As a different approach, without looking at summaries or genres of books, recommendations were made from the cosine similarity of keywords and reviews. To this last system, the rating effect, rating count effect and positive rating effect were added in order, and scores were compared. 48 | 49 | Resources were added to the corresponding notebooks. If I was inspired by external resources such as models or ideas, I cited them in the corresponding notebooks. 50 | 51 | **Findings:** 52 | 53 | - My target is highly imbalanced, but neural nets are very good at handling this issue. 54 | - Most of the data points belong to recent years, especially after 2014. 55 | - The data contains balanced points from each date and month. Also, the distribution of target classes is similar in each time period. Books rated 4 and 5 are in the majority for all periods. 56 | - A high average rating does not mean a book is better than others; it is more accurate to also look at review counts. It is hard to say that a book with a 5.0 average rating from 8 reviews is better than a book with a 4.3 average rating from 2000 reviews. 57 | - Although some top common words are similar in negative and positive reviews, some of them are totally different. 58 | - Transfer learning is easier than building a model and training it from zero.
59 | - Deep learning models can handle the overfitting problem better than machine learning models. 60 | - Results change when the layer types and layer numbers change. 61 | - Bi-gram and tri-gram feature engineering affects results. 62 | - When the type of recommendation system changes, recommendations also change. 63 | - When more data is added to the recommendation system, the score increases. 64 | 65 | More findings about the data can be found [here](https://github.com/ezgigm/sentiment_analysis_and_product_recommendation/blob/master/notebooks/2_Understanding_EDA_Preparation.ipynb), and different findings for each model can be found in the corresponding notebooks. 66 | 67 | **Future Improvements:** 68 | 69 | Each model has its own possible improvements, which can be found in the notebooks. Here, I will focus on the BERT model (which gave the best results) and the recommendation systems. 70 | 71 | - Batch size and epoch number can be tuned in a better way for modeling. 72 | - Learning rate and epsilon values can be changed for modeling. 73 | - More data can be added to the recommendation systems. 74 | - The effect of the ratio of positive to negative review counts can be added to the recommendation function. 75 | 76 | # Repository Guide 77 | 78 | **CSV Files:** 79 | 80 | The sample data is available in this repo: https://github.com/ezgigm/sentiment_analysis_and_product_recommendation/blob/master/sample_data.csv 81 | 82 | **Notebooks:** 83 | 84 | There are 9 notebooks in total in this repo. All of them are collected in the notebooks folder. For details: 85 | 86 | Getting data from json files: https://github.com/ezgigm/sentiment_analysis_and_product_recommendation/blob/master/notebooks/1_Getting_Data_from_json_Files%20.ipynb 87 | 88 | More information about the importance of sentiment analysis and every step of data understanding, cleaning and EDA: https://github.com/ezgigm/sentiment_analysis_and_product_recommendation/blob/master/notebooks/2_Understanding_EDA_Preparation.ipynb 89 | 90 | Machine learning models with metrics: https://github.com/ezgigm/sentiment_analysis_and_product_recommendation/blob/master/notebooks/3_Machine_Learning_Models.ipynb 91 | 92 | Torch models with different parameter settings: https://github.com/ezgigm/sentiment_analysis_and_product_recommendation/blob/master/notebooks/4_Torch_Models.ipynb 93 | 94 | Keras models for a CNN with 3 convolutional layers and an RNN with 2 GRU layers: https://github.com/ezgigm/sentiment_analysis_and_product_recommendation/blob/master/notebooks/5_Keras_CNN_with_3_Conv_Layers_and_RNN_with_2_GRU_Layers.ipynb 95 | 96 | Keras models with different layer numbers and types, and a comparison of results for all Keras models: 97 | https://github.com/ezgigm/sentiment_analysis_and_product_recommendation/blob/master/notebooks/6_Keras_with_Different_Layer_Types_and_Numbers.ipynb 98 | 99 | Pre-trained BERT model: https://github.com/ezgigm/sentiment_analysis_and_product_recommendation/blob/master/notebooks/7_Pre_trained_BERT_model.ipynb 100 | 101 | Recommendation systems for collaborative filtering with matrix factorization and cosine similarity: https://github.com/ezgigm/sentiment_analysis_and_product_recommendation/blob/master/notebooks/8_Recommendation_Systems.ipynb 102 | 103 | Recommendation system from cosine similarity of keywords with reviews: https://github.com/ezgigm/sentiment_analysis_and_product_recommendation/blob/master/notebooks/9_Recommendation_from_Keywords.ipynb 104 | 105 | **Presentation:** 106 | 107 | The presentation can be found here:
https://www.canva.com/design/DAD9b33z0KQ/_3tVMrOAw2UuhuG6afl_XA/view?utm_content=DAD9b33z0KQ&utm_campaign=designshare&utm_medium=link&utm_source=publishsharelink 108 | 109 | **Video:** 110 | 111 | The video of presentation is [here](https://youtu.be/q_USHnbOE24). 112 | 113 | **Reproduction:** 114 | 115 | - Clone this repo (for help see this [tutorial](https://help.github.com/en/github/creating-cloning-and-archiving-repositories/cloning-a-repository)). 116 | - Sample data is available in this repo, notebooks can be run with this sample data. 117 | 118 | **Contact:** [Ezgi Gumusbas](https://www.linkedin.com/in/ezgi-gumusbas-6b08a51a0/) 119 | -------------------------------------------------------------------------------- /notebooks/5_Keras_CNN_with_3_Conv_Layers_and_RNN_with_2_GRU_Layers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Keras Models - CNN with 3 Convolutional Layers and RNN with 2 GRU Layers" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Keras is the deep learning framework which is simple to use. I found it easier than to build a new model in torch, so I wanted to use and get results with Keras models. \n", 15 | "\n", 16 | "I tried 5 Keras models for this project. First 2 models can be found in this notebook. Other 3 models were run in Google Colab to get more fast results. So, they can be found in next notebook (number 6 notebook). This gives me a chance to run different models at the same time. \n", 17 | "\n", 18 | "### Aim of This Notebook:\n", 19 | "\n", 20 | "In this notebook, my aim is to get predictions with using Convolutional Neural Net models and Recurrent Neural Net models. " 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 31, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "# dataframe imports\n", 30 | "import pandas as pd\n", 31 | "import numpy as np\n", 32 | "import matplotlib.pyplot as plt\n", 33 | "\n", 34 | "from sklearn.metrics import f1_score, roc_auc_score, accuracy_score\n", 35 | "from sklearn.model_selection import train_test_split\n", 36 | "import re\n", 37 | "\n", 38 | "import matplotlib.pyplot as plt\n", 39 | "import seaborn as sns; sns.set()\n", 40 | "%matplotlib inline\n", 41 | "\n", 42 | "#tensorflow imports for keras\n", 43 | "import tensorflow\n", 44 | "from tensorflow.python.keras import models, layers, optimizers\n", 45 | "from tensorflow.python.keras.preprocessing.text import Tokenizer, text_to_word_sequence\n", 46 | "from tensorflow.python.keras.preprocessing.sequence import pad_sequences" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 20, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "df = pd.read_csv('train.csv') # taking data" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 3, 61 | "metadata": { 62 | "scrolled": true 63 | }, 64 | "outputs": [ 65 | { 66 | "data": { 67 | "text/html": [ 68 | "
\n", 69 | "\n", 82 | "\n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | "
review_cleansentiment
0i am shocked harrison at the very end gives p...1
1the best self help book ive ever read half of ...1
2quite interesting a time of intrigue and excit...1
3i love the bibliophile series i saw that a eb...1
4this is a really great story filled with wonde...1
\n", 118 | "
" 119 | ], 120 | "text/plain": [ 121 | " review_clean sentiment\n", 122 | "0 i am shocked harrison at the very end gives p... 1\n", 123 | "1 the best self help book ive ever read half of ... 1\n", 124 | "2 quite interesting a time of intrigue and excit... 1\n", 125 | "3 i love the bibliophile series i saw that a eb... 1\n", 126 | "4 this is a really great story filled with wonde... 1" 127 | ] 128 | }, 129 | "execution_count": 3, 130 | "metadata": {}, 131 | "output_type": "execute_result" 132 | } 133 | ], 134 | "source": [ 135 | "df.head() # to check" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 21, 141 | "metadata": {}, 142 | "outputs": [], 143 | "source": [ 144 | "df.dropna(inplace=True) # last more cleaning to make sure for null values" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "# Splitting Data to Train and Test" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "To make sure about using same sample data in each notebook, I always get same data and divide with same random state and test_size for validation. " 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 22, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "train_data, test_data = train_test_split(df, test_size=0.2,random_state = 42)" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "# Data Preparation for Keras" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "I will divide text and target to prepare data to model. I will do it for both train and test set. " 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 23, 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "train_target = train_data.sentiment\n", 191 | "train_texts = train_data.review_clean\n", 192 | "\n", 193 | "test_target = test_data.sentiment\n", 194 | "test_texts = test_data.review_clean" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "After here, I got inspired from this notebook which is the solution of one Kaggle compatition below,\n", 202 | "\n", 203 | "https://www.kaggle.com/muonneutrino/sentiment-analysis-with-amazon-reviews\n", 204 | "\n", 205 | "I used steps in this notebook to get baseline for Keras and I changed layers types and layers numbers to get better results." 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 18, 211 | "metadata": {}, 212 | "outputs": [], 213 | "source": [ 214 | "# I get together my text\n", 215 | "def converting_texts(texts):\n", 216 | " collected_texts = []\n", 217 | " for text in texts:\n", 218 | " collected_texts.append(text)\n", 219 | " return collected_texts\n", 220 | " \n", 221 | "train_texts = converting_texts(train_texts)\n", 222 | "test_texts = converting_texts(test_texts)" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "I need to tokenize my text and padding sequences before modeling my data. I will use Keras proprocessing tools for this." 
230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 25, 235 | "metadata": {}, 236 | "outputs": [], 237 | "source": [ 238 | "max_feat= 12000 #seting max features to define max number of tokenizer words\n", 239 | "\n", 240 | "tokenizer = Tokenizer(num_words=max_feat)\n", 241 | "tokenizer.fit_on_texts(train_texts)\n", 242 | "# updates internal vocabulary based on a list of texts\n", 243 | "# in the case where texts contains lists, we assume each entry of the lists to be a token\n", 244 | "# required before using texts_to_sequences or texts_to_matrix\n", 245 | "\n", 246 | "train_texts = tokenizer.texts_to_sequences(train_texts)\n", 247 | "test_texts = tokenizer.texts_to_sequences(test_texts)\n", 248 | "# transforms each text in texts to a sequence of integers\n", 249 | "# Only top num_words-1 most frequent words will be taken into account \n", 250 | "# Only words known by the tokenizer will be taken into account" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "To use batches productively, I need to turn my sequences to same lenght. I prefer to set everything to maximum lenght of the longest sentence in train data." 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": 20, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "max_len = max(len(train_ex) for train_ex in train_texts) #setting the max length\n", 267 | "\n", 268 | "# using pad_sequence tool from Keras\n", 269 | "# transforms a list of sequences to into a 2D Numpy array of shape \n", 270 | "# the maxlen argument for the length of the longest sequence in the list\n", 271 | "train_texts = pad_sequences(train_texts, maxlen=max_len)\n", 272 | "test_texts = pad_sequences(test_texts, maxlen=max_len)" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "## Building Model" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "In this simple model, convolutional neural nets were used with 64 embedding dimension. 3-convolutional layers used, first two have batch normalization and maximum pooling arguments. The last one has glabal maximum pooling. Results were passed to a dense layer and output for prediction.\n", 287 | "\n", 288 | "Batch normalizations normalize and scale inputs or activations by reducing the amount what the hidden unit values shift around. Max Pool downsamples the input representation by taking the maximum value over the window defined by pool size." 
289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": 21, 294 | "metadata": {}, 295 | "outputs": [ 296 | { 297 | "name": "stdout", 298 | "output_type": "stream", 299 | "text": [ 300 | "WARNING:tensorflow:From /Users/ezgi/opt/anaconda3/envs/learn-env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:497: calling conv1d (from tensorflow.python.ops.nn_ops) with data_format=NHWC is deprecated and will be removed in a future version.\n", 301 | "Instructions for updating:\n", 302 | "`NHWC` for data_format is deprecated, use `NWC` instead\n" 303 | ] 304 | } 305 | ], 306 | "source": [ 307 | "def build_model():\n", 308 | " sequences = layers.Input(shape=(max_len,))\n", 309 | " embedded = layers.Embedding(max_feat, 64)(sequences)\n", 310 | " x = layers.Conv1D(64, 3, activation='relu')(embedded)\n", 311 | " x = layers.BatchNormalization()(x)\n", 312 | " x = layers.MaxPool1D(3)(x)\n", 313 | " x = layers.Conv1D(64, 5, activation='relu')(x)\n", 314 | " x = layers.BatchNormalization()(x)\n", 315 | " x = layers.MaxPool1D(5)(x)\n", 316 | " x = layers.Conv1D(64, 5, activation='relu')(x)\n", 317 | " x = layers.GlobalMaxPool1D()(x)\n", 318 | " x = layers.Flatten()(x)\n", 319 | " x = layers.Dense(100, activation='relu')(x)\n", 320 | " predictions = layers.Dense(1, activation='sigmoid')(x)\n", 321 | " model = models.Model(inputs=sequences, outputs=predictions)\n", 322 | " model.compile(\n", 323 | " optimizer='rmsprop',\n", 324 | " loss='binary_crossentropy',\n", 325 | " metrics=['binary_accuracy']\n", 326 | " )\n", 327 | " return model\n", 328 | " \n", 329 | "model = build_model()" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "# Fitting Model to My Pre-processed Data" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": 22, 342 | "metadata": {}, 343 | "outputs": [ 344 | { 345 | "name": "stdout", 346 | "output_type": "stream", 347 | "text": [ 348 | "Train on 63983 samples, validate on 15996 samples\n", 349 | "Epoch 1/2\n", 350 | "63983/63983 [==============================]63983/63983 [==============================] - 963s 15ms/step - loss: 0.2943 - binary_accuracy: 0.9136 - val_loss: 0.2673 - val_binary_accuracy: 0.9155\n", 351 | "\n", 352 | "Epoch 2/2\n", 353 | "63983/63983 [==============================]63983/63983 [==============================] - 1005s 16ms/step - loss: 0.2185 - binary_accuracy: 0.9197 - val_loss: 0.2632 - val_binary_accuracy: 0.9175\n", 354 | "\n" 355 | ] 356 | }, 357 | { 358 | "data": { 359 | "text/plain": [ 360 | "" 361 | ] 362 | }, 363 | "execution_count": 22, 364 | "metadata": {}, 365 | "output_type": "execute_result" 366 | } 367 | ], 368 | "source": [ 369 | "model.fit(\n", 370 | " train_texts, \n", 371 | " train_target, \n", 372 | " batch_size=128,\n", 373 | " epochs=2,\n", 374 | " validation_data=(test_texts, test_target) )" 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "metadata": {}, 380 | "source": [ 381 | "This model gives me 0.263 loss value and 0.92% accuracy on validation data." 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "# Recurrent Neural Net Model" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": {}, 394 | "source": [ 395 | "For RNN model layers, I have embedding layer and also I used GRU layers which followed by 2 dense layers. 
" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": 27, 401 | "metadata": {}, 402 | "outputs": [], 403 | "source": [ 404 | "def build_rnn_model():\n", 405 | " sequences = layers.Input(shape=(max_len,))\n", 406 | " embedded = layers.Embedding(max_feat, 64)(sequences)\n", 407 | " x = layers.GRU(128, return_sequences=True)(embedded)\n", 408 | " x = layers.GRU(128)(x)\n", 409 | " x = layers.Dense(32, activation='relu')(x)\n", 410 | " x = layers.Dense(100, activation='relu')(x)\n", 411 | " predictions = layers.Dense(1, activation='sigmoid')(x)\n", 412 | " model = models.Model(inputs=sequences, outputs=predictions)\n", 413 | " model.compile(\n", 414 | " optimizer='rmsprop',\n", 415 | " loss='binary_crossentropy',\n", 416 | " metrics=['binary_accuracy']\n", 417 | " )\n", 418 | " return model\n", 419 | " \n", 420 | "rnn_model = build_rnn_model()" 421 | ] 422 | }, 423 | { 424 | "cell_type": "markdown", 425 | "metadata": {}, 426 | "source": [ 427 | "## Fitting Model to My Data" 428 | ] 429 | }, 430 | { 431 | "cell_type": "code", 432 | "execution_count": 45, 433 | "metadata": {}, 434 | "outputs": [ 435 | { 436 | "name": "stdout", 437 | "output_type": "stream", 438 | "text": [ 439 | "Train on 63983 samples, validate on 15996 samples\n", 440 | "Epoch 1/1\n", 441 | "63983/63983 [==============================]63983/63983 [==============================] - 7834s 122ms/step - loss: 0.2611 - binary_accuracy: 0.9169 - val_loss: 0.2110 - val_binary_accuracy: 0.9203\n", 442 | "\n" 443 | ] 444 | }, 445 | { 446 | "data": { 447 | "text/plain": [ 448 | "" 449 | ] 450 | }, 451 | "execution_count": 45, 452 | "metadata": {}, 453 | "output_type": "execute_result" 454 | } 455 | ], 456 | "source": [ 457 | "rnn_model.fit(\n", 458 | " train_texts, \n", 459 | " train_target, \n", 460 | " batch_size=128,\n", 461 | " epochs=1,\n", 462 | " validation_data=(test_texts, test_target) )" 463 | ] 464 | }, 465 | { 466 | "cell_type": "markdown", 467 | "metadata": {}, 468 | "source": [ 469 | "Even for one epoch this takes too much time, so I opened this notebook in Google Colab and tried the epochs=2 version there and pasted results here." 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": 30, 475 | "metadata": {}, 476 | "outputs": [], 477 | "source": [ 478 | "# This cell was runned in Colab\n", 479 | "# rnn_model.fit(\n", 480 | "# train_texts, \n", 481 | "# train_target, \n", 482 | "# batch_size=128,\n", 483 | "# epochs=2,\n", 484 | "# validation_data=(test_texts, test_target) )" 485 | ] 486 | }, 487 | { 488 | "cell_type": "markdown", 489 | "metadata": {}, 490 | "source": [ 491 | "Results for 2 epoches;\n", 492 | "\n", 493 | "- loss: 0.1623\n", 494 | "- acc: 0.9371\n", 495 | "- val loss: 0.1615\n", 496 | "- val acc: 0.9377" 497 | ] 498 | }, 499 | { 500 | "cell_type": "markdown", 501 | "metadata": {}, 502 | "source": [ 503 | "I will compare 5 Keras results in next notebook as total. But from here, I found RNN model more accurate. 
" 504 | ] 505 | } 506 | ], 507 | "metadata": { 508 | "kernelspec": { 509 | "display_name": "Python 3", 510 | "language": "python", 511 | "name": "python3" 512 | }, 513 | "language_info": { 514 | "codemirror_mode": { 515 | "name": "ipython", 516 | "version": 3 517 | }, 518 | "file_extension": ".py", 519 | "mimetype": "text/x-python", 520 | "name": "python", 521 | "nbconvert_exporter": "python", 522 | "pygments_lexer": "ipython3", 523 | "version": "3.6.9" 524 | } 525 | }, 526 | "nbformat": 4, 527 | "nbformat_minor": 2 528 | } 529 | -------------------------------------------------------------------------------- /notebooks/.ipynb_checkpoints/5_Keras_CNN_with_3_Conv_Layers_and_RNN_with_2_GRU_Layers-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Keras Models - CNN with 3 Convolutional Layers and RNN with 2 GRU Layers" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Keras is the deep learning framework which is simple to use. I found it easier than to build a new model in torch, so I wanted to use and get results with Keras models. \n", 15 | "\n", 16 | "I tried 5 Keras models for this project. First 2 models can be found in this notebook. Other 3 models were run in Google Colab to get more fast results. So, they can be found in next notebook (number 6 notebook). This gives me a chance to run different models at the same time. \n", 17 | "\n", 18 | "### Aim of This Notebook:\n", 19 | "\n", 20 | "In this notebook, my aim is to get predictions with using Convolutional Neural Net models and Recurrent Neural Net models. " 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 31, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "# dataframe imports\n", 30 | "import pandas as pd\n", 31 | "import numpy as np\n", 32 | "import matplotlib.pyplot as plt\n", 33 | "\n", 34 | "from sklearn.metrics import f1_score, roc_auc_score, accuracy_score\n", 35 | "from sklearn.model_selection import train_test_split\n", 36 | "import re\n", 37 | "\n", 38 | "import matplotlib.pyplot as plt\n", 39 | "import seaborn as sns; sns.set()\n", 40 | "%matplotlib inline\n", 41 | "\n", 42 | "#tensorflow imports for keras\n", 43 | "import tensorflow\n", 44 | "from tensorflow.python.keras import models, layers, optimizers\n", 45 | "from tensorflow.python.keras.preprocessing.text import Tokenizer, text_to_word_sequence\n", 46 | "from tensorflow.python.keras.preprocessing.sequence import pad_sequences" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 20, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "df = pd.read_csv('train.csv') # taking data" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 3, 61 | "metadata": { 62 | "scrolled": true 63 | }, 64 | "outputs": [ 65 | { 66 | "data": { 67 | "text/html": [ 68 | "
\n", 69 | "\n", 82 | "\n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | "
review_cleansentiment
0i am shocked harrison at the very end gives p...1
1the best self help book ive ever read half of ...1
2quite interesting a time of intrigue and excit...1
3i love the bibliophile series i saw that a eb...1
4this is a really great story filled with wonde...1
\n", 118 | "
" 119 | ], 120 | "text/plain": [ 121 | " review_clean sentiment\n", 122 | "0 i am shocked harrison at the very end gives p... 1\n", 123 | "1 the best self help book ive ever read half of ... 1\n", 124 | "2 quite interesting a time of intrigue and excit... 1\n", 125 | "3 i love the bibliophile series i saw that a eb... 1\n", 126 | "4 this is a really great story filled with wonde... 1" 127 | ] 128 | }, 129 | "execution_count": 3, 130 | "metadata": {}, 131 | "output_type": "execute_result" 132 | } 133 | ], 134 | "source": [ 135 | "df.head() # to check" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 21, 141 | "metadata": {}, 142 | "outputs": [], 143 | "source": [ 144 | "df.dropna(inplace=True) # last more cleaning to make sure for null values" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "# Splitting Data to Train and Test" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "To make sure about using same sample data in each notebook, I always get same data and divide with same random state and test_size for validation. " 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 22, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "train_data, test_data = train_test_split(df, test_size=0.2,random_state = 42)" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "# Data Preparation for Keras" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "I will divide text and target to prepare data to model. I will do it for both train and test set. " 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 23, 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "train_target = train_data.sentiment\n", 191 | "train_texts = train_data.review_clean\n", 192 | "\n", 193 | "test_target = test_data.sentiment\n", 194 | "test_texts = test_data.review_clean" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "After here, I got inspired from this notebook which is the solution of one Kaggle compatition below,\n", 202 | "\n", 203 | "https://www.kaggle.com/muonneutrino/sentiment-analysis-with-amazon-reviews\n", 204 | "\n", 205 | "I used steps in this notebook to get baseline for Keras and I changed layers types and layers numbers to get better results." 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 18, 211 | "metadata": {}, 212 | "outputs": [], 213 | "source": [ 214 | "# I get together my text\n", 215 | "def converting_texts(texts):\n", 216 | " collected_texts = []\n", 217 | " for text in texts:\n", 218 | " collected_texts.append(text)\n", 219 | " return collected_texts\n", 220 | " \n", 221 | "train_texts = converting_texts(train_texts)\n", 222 | "test_texts = converting_texts(test_texts)" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "I need to tokenize my text and padding sequences before modeling my data. I will use Keras proprocessing tools for this." 
230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 25, 235 | "metadata": {}, 236 | "outputs": [], 237 | "source": [ 238 | "max_feat= 12000 #seting max features to define max number of tokenizer words\n", 239 | "\n", 240 | "tokenizer = Tokenizer(num_words=max_feat)\n", 241 | "tokenizer.fit_on_texts(train_texts)\n", 242 | "# updates internal vocabulary based on a list of texts\n", 243 | "# in the case where texts contains lists, we assume each entry of the lists to be a token\n", 244 | "# required before using texts_to_sequences or texts_to_matrix\n", 245 | "\n", 246 | "train_texts = tokenizer.texts_to_sequences(train_texts)\n", 247 | "test_texts = tokenizer.texts_to_sequences(test_texts)\n", 248 | "# transforms each text in texts to a sequence of integers\n", 249 | "# Only top num_words-1 most frequent words will be taken into account \n", 250 | "# Only words known by the tokenizer will be taken into account" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "To use batches productively, I need to turn my sequences to same lenght. I prefer to set everything to maximum lenght of the longest sentence in train data." 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": 20, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "max_len = max(len(train_ex) for train_ex in train_texts) #setting the max length\n", 267 | "\n", 268 | "# using pad_sequence tool from Keras\n", 269 | "# transforms a list of sequences to into a 2D Numpy array of shape \n", 270 | "# the maxlen argument for the length of the longest sequence in the list\n", 271 | "train_texts = pad_sequences(train_texts, maxlen=max_len)\n", 272 | "test_texts = pad_sequences(test_texts, maxlen=max_len)" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "## Building Model" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "In this simple model, convolutional neural nets were used with 64 embedding dimension. 3-convolutional layers used, first two have batch normalization and maximum pooling arguments. The last one has glabal maximum pooling. Results were passed to a dense layer and output for prediction.\n", 287 | "\n", 288 | "Batch normalizations normalize and scale inputs or activations by reducing the amount what the hidden unit values shift around. Max Pool downsamples the input representation by taking the maximum value over the window defined by pool size." 
289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": 21, 294 | "metadata": {}, 295 | "outputs": [ 296 | { 297 | "name": "stdout", 298 | "output_type": "stream", 299 | "text": [ 300 | "WARNING:tensorflow:From /Users/ezgi/opt/anaconda3/envs/learn-env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:497: calling conv1d (from tensorflow.python.ops.nn_ops) with data_format=NHWC is deprecated and will be removed in a future version.\n", 301 | "Instructions for updating:\n", 302 | "`NHWC` for data_format is deprecated, use `NWC` instead\n" 303 | ] 304 | } 305 | ], 306 | "source": [ 307 | "def build_model():\n", 308 | " sequences = layers.Input(shape=(max_len,))\n", 309 | " embedded = layers.Embedding(max_feat, 64)(sequences)\n", 310 | " x = layers.Conv1D(64, 3, activation='relu')(embedded)\n", 311 | " x = layers.BatchNormalization()(x)\n", 312 | " x = layers.MaxPool1D(3)(x)\n", 313 | " x = layers.Conv1D(64, 5, activation='relu')(x)\n", 314 | " x = layers.BatchNormalization()(x)\n", 315 | " x = layers.MaxPool1D(5)(x)\n", 316 | " x = layers.Conv1D(64, 5, activation='relu')(x)\n", 317 | " x = layers.GlobalMaxPool1D()(x)\n", 318 | " x = layers.Flatten()(x)\n", 319 | " x = layers.Dense(100, activation='relu')(x)\n", 320 | " predictions = layers.Dense(1, activation='sigmoid')(x)\n", 321 | " model = models.Model(inputs=sequences, outputs=predictions)\n", 322 | " model.compile(\n", 323 | " optimizer='rmsprop',\n", 324 | " loss='binary_crossentropy',\n", 325 | " metrics=['binary_accuracy']\n", 326 | " )\n", 327 | " return model\n", 328 | " \n", 329 | "model = build_model()" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "# Fitting Model to My Pre-processed Data" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": 22, 342 | "metadata": {}, 343 | "outputs": [ 344 | { 345 | "name": "stdout", 346 | "output_type": "stream", 347 | "text": [ 348 | "Train on 63983 samples, validate on 15996 samples\n", 349 | "Epoch 1/2\n", 350 | "63983/63983 [==============================]63983/63983 [==============================] - 963s 15ms/step - loss: 0.2943 - binary_accuracy: 0.9136 - val_loss: 0.2673 - val_binary_accuracy: 0.9155\n", 351 | "\n", 352 | "Epoch 2/2\n", 353 | "63983/63983 [==============================]63983/63983 [==============================] - 1005s 16ms/step - loss: 0.2185 - binary_accuracy: 0.9197 - val_loss: 0.2632 - val_binary_accuracy: 0.9175\n", 354 | "\n" 355 | ] 356 | }, 357 | { 358 | "data": { 359 | "text/plain": [ 360 | "" 361 | ] 362 | }, 363 | "execution_count": 22, 364 | "metadata": {}, 365 | "output_type": "execute_result" 366 | } 367 | ], 368 | "source": [ 369 | "model.fit(\n", 370 | " train_texts, \n", 371 | " train_target, \n", 372 | " batch_size=128,\n", 373 | " epochs=2,\n", 374 | " validation_data=(test_texts, test_target) )" 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "metadata": {}, 380 | "source": [ 381 | "This model gives me 0.263 loss value and 0.92% accuracy on validation data." 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "# Recurrent Neural Net Model" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": {}, 394 | "source": [ 395 | "For RNN model layers, I have embedding layer and also I used GRU layers which followed by 2 dense layers. 
" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": 27, 401 | "metadata": {}, 402 | "outputs": [], 403 | "source": [ 404 | "def build_rnn_model():\n", 405 | " sequences = layers.Input(shape=(max_len,))\n", 406 | " embedded = layers.Embedding(max_feat, 64)(sequences)\n", 407 | " x = layers.GRU(128, return_sequences=True)(embedded)\n", 408 | " x = layers.GRU(128)(x)\n", 409 | " x = layers.Dense(32, activation='relu')(x)\n", 410 | " x = layers.Dense(100, activation='relu')(x)\n", 411 | " predictions = layers.Dense(1, activation='sigmoid')(x)\n", 412 | " model = models.Model(inputs=sequences, outputs=predictions)\n", 413 | " model.compile(\n", 414 | " optimizer='rmsprop',\n", 415 | " loss='binary_crossentropy',\n", 416 | " metrics=['binary_accuracy']\n", 417 | " )\n", 418 | " return model\n", 419 | " \n", 420 | "rnn_model = build_rnn_model()" 421 | ] 422 | }, 423 | { 424 | "cell_type": "markdown", 425 | "metadata": {}, 426 | "source": [ 427 | "## Fitting Model to My Data" 428 | ] 429 | }, 430 | { 431 | "cell_type": "code", 432 | "execution_count": 45, 433 | "metadata": {}, 434 | "outputs": [ 435 | { 436 | "name": "stdout", 437 | "output_type": "stream", 438 | "text": [ 439 | "Train on 63983 samples, validate on 15996 samples\n", 440 | "Epoch 1/1\n", 441 | "63983/63983 [==============================]63983/63983 [==============================] - 7834s 122ms/step - loss: 0.2611 - binary_accuracy: 0.9169 - val_loss: 0.2110 - val_binary_accuracy: 0.9203\n", 442 | "\n" 443 | ] 444 | }, 445 | { 446 | "data": { 447 | "text/plain": [ 448 | "" 449 | ] 450 | }, 451 | "execution_count": 45, 452 | "metadata": {}, 453 | "output_type": "execute_result" 454 | } 455 | ], 456 | "source": [ 457 | "rnn_model.fit(\n", 458 | " train_texts, \n", 459 | " train_target, \n", 460 | " batch_size=128,\n", 461 | " epochs=1,\n", 462 | " validation_data=(test_texts, test_target) )" 463 | ] 464 | }, 465 | { 466 | "cell_type": "markdown", 467 | "metadata": {}, 468 | "source": [ 469 | "Even for one epoch this takes too much time, so I opened this notebook in Google Colab and tried the epochs=2 version there and pasted results here." 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": 30, 475 | "metadata": {}, 476 | "outputs": [], 477 | "source": [ 478 | "# This cell was runned in Colab\n", 479 | "# rnn_model.fit(\n", 480 | "# train_texts, \n", 481 | "# train_target, \n", 482 | "# batch_size=128,\n", 483 | "# epochs=2,\n", 484 | "# validation_data=(test_texts, test_target) )" 485 | ] 486 | }, 487 | { 488 | "cell_type": "markdown", 489 | "metadata": {}, 490 | "source": [ 491 | "Results for 2 epoches;\n", 492 | "\n", 493 | "- loss: 0.1623\n", 494 | "- acc: 0.9371\n", 495 | "- val loss: 0.1615\n", 496 | "- val acc: 0.9377" 497 | ] 498 | }, 499 | { 500 | "cell_type": "markdown", 501 | "metadata": {}, 502 | "source": [ 503 | "I will compare 5 Keras results in next notebook as total. But from here, I found RNN model more accurate. 
" 504 | ] 505 | } 506 | ], 507 | "metadata": { 508 | "kernelspec": { 509 | "display_name": "Python 3", 510 | "language": "python", 511 | "name": "python3" 512 | }, 513 | "language_info": { 514 | "codemirror_mode": { 515 | "name": "ipython", 516 | "version": 3 517 | }, 518 | "file_extension": ".py", 519 | "mimetype": "text/x-python", 520 | "name": "python", 521 | "nbconvert_exporter": "python", 522 | "pygments_lexer": "ipython3", 523 | "version": "3.6.9" 524 | } 525 | }, 526 | "nbformat": 4, 527 | "nbformat_minor": 2 528 | } 529 | -------------------------------------------------------------------------------- /notebooks/6_Keras_with_Different_Layer_Types_and_Numbers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Keras with Different Layer Types and Numbers" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "This notebook was run in Google Colab to get more quick results same time without hurting my computer.\n", 15 | "\n", 16 | "### Aim of This Notebook:\n", 17 | "\n", 18 | "My aim in this notebook is to find better results than CNN with 3-conv layers and RNN with 2-GRU layers models. So, I tried CNN with 2-conv layers and RNN with 2-CuDNNGRU layer and 2-LSTM layers.\n", 19 | "\n", 20 | "The detailed comparison of results for all 5 Keras models can be found at the end of this notebook with feature advices. " 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 0, 26 | "metadata": { 27 | "colab": {}, 28 | "colab_type": "code", 29 | "id": "8neitrJ7lqZA" 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "# necessary imports\n", 34 | "\n", 35 | "import pandas as pd\n", 36 | "import numpy as np\n", 37 | "\n", 38 | "from sklearn.metrics import f1_score, roc_auc_score, accuracy_score\n", 39 | "from sklearn.model_selection import train_test_split\n", 40 | "import re\n", 41 | "\n", 42 | "import matplotlib.pyplot as plt\n", 43 | "import seaborn as sns; sns.set()\n", 44 | "%matplotlib inline\n", 45 | "\n", 46 | "#tensorflow imports\n", 47 | "import tensorflow\n", 48 | "from tensorflow.python.keras import models, layers, optimizers\n", 49 | "from tensorflow.python.keras.preprocessing.text import Tokenizer, text_to_word_sequence\n", 50 | "from tensorflow.python.keras.preprocessing.sequence import pad_sequences\n" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 10, 56 | "metadata": { 57 | "colab": { 58 | "base_uri": "https://localhost:8080/", 59 | "height": 122 60 | }, 61 | "colab_type": "code", 62 | "id": "W8EBMG9Plqaa", 63 | "outputId": "40f2a018-a5aa-434c-c1ef-1d60f44d0e3c" 64 | }, 65 | "outputs": [ 66 | { 67 | "name": "stdout", 68 | "output_type": "stream", 69 | "text": [ 70 | "Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly\n", 71 | "\n", 72 | "Enter your authorization code:\n", 73 | "··········\n", 74 | "Mounted at /content/drive\n" 75 | ] 76 | } 77 | ], 78 | "source": [ 79 | "# this cell is to connect my google drive with colab notebook to get data\n", 80 | "from 
google.colab import drive\n", 81 | "drive.mount('/content/drive')" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 0, 87 | "metadata": { 88 | "colab": {}, 89 | "colab_type": "code", 90 | "id": "qNVwgeH_lqbM" 91 | }, 92 | "outputs": [], 93 | "source": [ 94 | "# this cell is for geting data from path in the drive\n", 95 | "path = '/content/drive/My Drive/train/train.csv'\n", 96 | "df = pd.read_csv(path)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 0, 102 | "metadata": { 103 | "colab": {}, 104 | "colab_type": "code", 105 | "id": "RMdeR89Llqbs" 106 | }, 107 | "outputs": [], 108 | "source": [ 109 | "df.dropna(inplace=True) # last check to make sure about nulls" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": { 115 | "colab": {}, 116 | "colab_type": "code", 117 | "id": "oGlbaSUTlqbw" 118 | }, 119 | "source": [ 120 | "To make sure about using same sample data in each notebook, I always get same data and divide with same random state and test_size for validation." 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 0, 126 | "metadata": { 127 | "colab": {}, 128 | "colab_type": "code", 129 | "id": "4a-3vu6Nlqb_" 130 | }, 131 | "outputs": [], 132 | "source": [ 133 | "train_data, test_data = train_test_split(df, test_size=0.2,random_state = 42)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "Now, I will split my train and test to X and y as text and target." 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 0, 146 | "metadata": { 147 | "colab": {}, 148 | "colab_type": "code", 149 | "id": "Ezso4Q7flqcE" 150 | }, 151 | "outputs": [], 152 | "source": [ 153 | "train_target = train_data.sentiment\n", 154 | "train_texts = train_data.review_clean\n", 155 | "\n", 156 | "test_target = test_data.sentiment\n", 157 | "test_texts = test_data.review_clean" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "Like previous notebook, I inspired by https://www.kaggle.com/muonneutrino/sentiment-analysis-with-amazon-reviews to build model. I changed layers and layer types to get better results." 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "# Preparing Data to Model" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 0, 177 | "metadata": { 178 | "colab": {}, 179 | "colab_type": "code", 180 | "id": "1qxLjVxQlqcO" 181 | }, 182 | "outputs": [], 183 | "source": [ 184 | "# I get together my text\n", 185 | "def converting_texts(texts):\n", 186 | " collected_texts = []\n", 187 | " for text in texts:\n", 188 | " collected_texts.append(text)\n", 189 | " return collected_texts\n", 190 | " \n", 191 | "train_texts = converting_texts(train_texts)\n", 192 | "test_texts = converting_texts(test_texts)" 193 | ] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "metadata": {}, 198 | "source": [ 199 | "I need to tokenize my text and padding sequences before modeling my data. I will use Keras proprocessing tools for this." 
200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 0, 205 | "metadata": { 206 | "colab": {}, 207 | "colab_type": "code", 208 | "id": "E4JTCuhslqcW" 209 | }, 210 | "outputs": [], 211 | "source": [ 212 | "max_feat= 12000 #seting max features to define max number of tokenizer words\n", 213 | "\n", 214 | "tokenizer = Tokenizer(num_words=max_feat)\n", 215 | "tokenizer.fit_on_texts(train_texts)\n", 216 | "# updates internal vocabulary based on a list of texts\n", 217 | "# in the case where texts contains lists, we assume each entry of the lists to be a token\n", 218 | "# required before using texts_to_sequences or texts_to_matrix\n", 219 | "\n", 220 | "train_texts = tokenizer.texts_to_sequences(train_texts)\n", 221 | "test_texts = tokenizer.texts_to_sequences(test_texts)\n", 222 | "# transforms each text in texts to a sequence of integers\n", 223 | "# Only top num_words-1 most frequent words will be taken into account \n", 224 | "# Only words known by the tokenizer will be taken into account" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": {}, 230 | "source": [ 231 | "To use batches productively, I need to turn my sequences to same lenght. I prefer to set everything to maximum lenght of the longest sentence in train data." 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 0, 237 | "metadata": { 238 | "colab": {}, 239 | "colab_type": "code", 240 | "id": "QjtN-UPJlqcb" 241 | }, 242 | "outputs": [], 243 | "source": [ 244 | "max_len = max(len(train_ex) for train_ex in train_texts) #setting the max length\n", 245 | "\n", 246 | "# using pad_sequence tool from Keras\n", 247 | "# transforms a list of sequences to into a 2D Numpy array of shape \n", 248 | "# the maxlen argument for the length of the longest sequence in the list\n", 249 | "train_texts = pad_sequences(train_texts, maxlen=max_len)\n", 250 | "test_texts = pad_sequences(test_texts, maxlen=max_len)" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "# Building Model" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "In this model, I reduced the layer number in CNN. I set one convolutional layer with batch normalization and max pooling. Additionaly, one convolutional layer with global max pooling. 
" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 0, 270 | "metadata": { 271 | "colab": {}, 272 | "colab_type": "code", 273 | "id": "2TpNmPSMlqcg" 274 | }, 275 | "outputs": [], 276 | "source": [ 277 | "def build_model():\n", 278 | " sequences = layers.Input(shape=(max_len,))\n", 279 | " embedded = layers.Embedding(max_feat, 64)(sequences)\n", 280 | " x = layers.Conv1D(64, 3, activation='relu')(embedded)\n", 281 | " x = layers.BatchNormalization()(x)\n", 282 | " x = layers.MaxPool1D(3)(x)\n", 283 | " x = layers.Conv1D(64, 5, activation='relu')(x)\n", 284 | " x = layers.GlobalMaxPool1D()(x)\n", 285 | " x = layers.Flatten()(x)\n", 286 | " x = layers.Dense(100, activation='relu')(x)\n", 287 | " predictions = layers.Dense(1, activation='sigmoid')(x)\n", 288 | " model = models.Model(inputs=sequences, outputs=predictions)\n", 289 | " model.compile(\n", 290 | " optimizer='rmsprop',\n", 291 | " loss='binary_crossentropy',\n", 292 | " metrics=['binary_accuracy']\n", 293 | " )\n", 294 | " return model\n", 295 | " \n", 296 | "model = build_model()" 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "# Fitting Model to Pre-processed Data" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 22, 309 | "metadata": { 310 | "colab": { 311 | "base_uri": "https://localhost:8080/", 312 | "height": 102 313 | }, 314 | "colab_type": "code", 315 | "id": "t6Po8kWmlqcw", 316 | "outputId": "8dbd125b-97a5-41e1-9248-300c5d89924f" 317 | }, 318 | "outputs": [ 319 | { 320 | "name": "stdout", 321 | "output_type": "stream", 322 | "text": [ 323 | "Epoch 1/2\n", 324 | "500/500 [==============================] - 56s 112ms/step - loss: 0.2157 - binary_accuracy: 0.9233 - val_loss: 0.1995 - val_binary_accuracy: 0.9208\n", 325 | "Epoch 2/2\n", 326 | "500/500 [==============================] - 55s 111ms/step - loss: 0.1456 - binary_accuracy: 0.9436 - val_loss: 0.1830 - val_binary_accuracy: 0.9384\n" 327 | ] 328 | }, 329 | { 330 | "data": { 331 | "text/plain": [ 332 | "" 333 | ] 334 | }, 335 | "execution_count": 22, 336 | "metadata": { 337 | "tags": [] 338 | }, 339 | "output_type": "execute_result" 340 | } 341 | ], 342 | "source": [ 343 | "model.fit(\n", 344 | " train_target, \n", 345 | " train_labels, \n", 346 | " batch_size=128,\n", 347 | " epochs=2,\n", 348 | " validation_data=(test_texts, test_target) )" 349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "metadata": {}, 354 | "source": [ 355 | "It gives good results with less loss and high accuracy." 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "# RNN Model with CuDNNGRU Layers" 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": {}, 368 | "source": [ 369 | "I changed the layer types for RNN to see the difference." 
370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": 0, 375 | "metadata": { 376 | "colab": {}, 377 | "colab_type": "code", 378 | "id": "_72Lbyablqd1" 379 | }, 380 | "outputs": [], 381 | "source": [ 382 | "def build_rnn_model():\n", 383 | " sequences = layers.Input(shape=(max_len,))\n", 384 | " embedded = layers.Embedding(max_feat, 64)(sequences)\n", 385 | " x = layers.CuDNNGRU(128, return_sequences=True)(embedded)\n", 386 | " x = layers.CuDNNGRU(128)(x)\n", 387 | " x = layers.Dense(32, activation='relu')(x)\n", 388 | " x = layers.Dense(100, activation='relu')(x)\n", 389 | " predictions = layers.Dense(1, activation='sigmoid')(x)\n", 390 | " model = models.Model(inputs=sequences, outputs=predictions)\n", 391 | " model.compile(\n", 392 | " optimizer='rmsprop',\n", 393 | " loss='binary_crossentropy',\n", 394 | " metrics=['binary_accuracy']\n", 395 | " )\n", 396 | " return model\n", 397 | " \n", 398 | "rnn_model = build_rnn_model()" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": 28, 404 | "metadata": { 405 | "colab": { 406 | "base_uri": "https://localhost:8080/", 407 | "height": 102 408 | }, 409 | "colab_type": "code", 410 | "id": "DEz1HZGklqd-", 411 | "outputId": "20633893-b53b-43aa-d1a6-cb2c4fd11d3b" 412 | }, 413 | "outputs": [ 414 | { 415 | "name": "stdout", 416 | "output_type": "stream", 417 | "text": [ 418 | "Epoch 1/2\n", 419 | "500/500 [==============================] - 331s 661ms/step - loss: 0.2204 - binary_accuracy: 0.9177 - val_loss: 0.1871 - val_binary_accuracy: 0.9280\n", 420 | "Epoch 2/2\n", 421 | "500/500 [==============================] - 329s 658ms/step - loss: 0.1611 - binary_accuracy: 0.9361 - val_loss: 0.1792 - val_binary_accuracy: 0.9369\n" 422 | ] 423 | }, 424 | { 425 | "data": { 426 | "text/plain": [ 427 | "" 428 | ] 429 | }, 430 | "execution_count": 28, 431 | "metadata": { 432 | "tags": [] 433 | }, 434 | "output_type": "execute_result" 435 | } 436 | ], 437 | "source": [ 438 | "rnn_model.fit(\n", 439 | " train_texts, \n", 440 | " train_target, \n", 441 | " batch_size=128,\n", 442 | " epochs=2,\n", 443 | " validation_data=(test_texts, test_target) )" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": {}, 449 | "source": [ 450 | "It gives a bit better than CNN." 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "# RNN with LSTM Layers" 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": {}, 463 | "source": [ 464 | "I will try LSTM layers this time." 
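The LSTM cell below builds its input layers from `MAX_LENGTH` and `MAX_FEATURES`, names that are not defined in this notebook and appear to be carried over from the referenced Kaggle kernel; they presumably correspond to `max_len` and `max_feat` from the preprocessing step. A two-line alias, shown here as an assumption rather than the author's code, lets the cell run unchanged.

```python
# Hypothetical aliases so the LSTM cell below runs with this notebook's variables;
# MAX_LENGTH/MAX_FEATURES are assumed to mean the padded length and vocabulary cap.
MAX_LENGTH = max_len
MAX_FEATURES = max_feat
```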
465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": 0, 470 | "metadata": { 471 | "colab": {}, 472 | "colab_type": "code", 473 | "id": "mN8DwiCxlqeI" 474 | }, 475 | "outputs": [], 476 | "source": [ 477 | "def build_lstm_model():\n", 478 | " sequences = layers.Input(shape=(MAX_LENGTH,))\n", 479 | " embedded = layers.Embedding(MAX_FEATURES, 64)(sequences)\n", 480 | " x = layers.LSTM(128, return_sequences=True)(embedded)\n", 481 | " x = layers.LSTM(128)(x)\n", 482 | " x = layers.Dense(32, activation='relu')(x)\n", 483 | " x = layers.Dense(100, activation='relu')(x)\n", 484 | " predictions = layers.Dense(1, activation='sigmoid')(x)\n", 485 | " model = models.Model(inputs=sequences, outputs=predictions)\n", 486 | " model.compile(\n", 487 | " optimizer='rmsprop',\n", 488 | " loss='binary_crossentropy',\n", 489 | " metrics=['binary_accuracy']\n", 490 | " )\n", 491 | " return model\n", 492 | " \n", 493 | "rnn_model = build_lstm_model()" 494 | ] 495 | }, 496 | { 497 | "cell_type": "code", 498 | "execution_count": 26, 499 | "metadata": { 500 | "colab": { 501 | "base_uri": "https://localhost:8080/", 502 | "height": 102 503 | }, 504 | "colab_type": "code", 505 | "id": "38nly0ekoXmf", 506 | "outputId": "fb3238e2-6f5e-4db4-c6e1-e7cedaa3fdf1" 507 | }, 508 | "outputs": [ 509 | { 510 | "name": "stdout", 511 | "output_type": "stream", 512 | "text": [ 513 | "Epoch 1/2\n", 514 | "500/500 [==============================] - 392s 783ms/step - loss: 0.2149 - binary_accuracy: 0.9245 - val_loss: 0.1718 - val_binary_accuracy: 0.9322\n", 515 | "Epoch 2/2\n", 516 | "500/500 [==============================] - 391s 781ms/step - loss: 0.1652 - binary_accuracy: 0.9367 - val_loss: 0.1656 - val_binary_accuracy: 0.9375\n" 517 | ] 518 | }, 519 | { 520 | "data": { 521 | "text/plain": [ 522 | "" 523 | ] 524 | }, 525 | "execution_count": 26, 526 | "metadata": { 527 | "tags": [] 528 | }, 529 | "output_type": "execute_result" 530 | } 531 | ], 532 | "source": [ 533 | "rnn_model.fit(\n", 534 | " train_texts, \n", 535 | " train_target, \n", 536 | " batch_size=128,\n", 537 | " epochs=2,\n", 538 | " validation_data=(test_texts, test_target) )" 539 | ] 540 | }, 541 | { 542 | "cell_type": "markdown", 543 | "metadata": { 544 | "colab": {}, 545 | "colab_type": "code", 546 | "id": "hSw7jRbfvVOq" 547 | }, 548 | "source": [ 549 | "It gives higher values." 550 | ] 551 | }, 552 | { 553 | "cell_type": "markdown", 554 | "metadata": {}, 555 | "source": [ 556 | "# Comparing All 5 Results" 557 | ] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": {}, 562 | "source": [ 563 | "I checked all results and there is no huge differences between train and validation results which show overfitting. So, I will compare validation results." 
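A small portability note on the comparison table below: `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so on newer pandas the same table can be built from a list of dicts. The numbers are the validation results quoted in this notebook and the previous one.

```python
# Equivalent construction of the comparison table without DataFrame.append
# (values taken from the results reported above and in the previous notebook).
results = [
    {"Model": "CNN-3-Conv",     "val_loss": 0.2632, "val_acc": 0.9175},
    {"Model": "RNN-2-GRU",      "val_loss": 0.1615, "val_acc": 0.9377},
    {"Model": "CNN-2-Conv",     "val_loss": 0.1839, "val_acc": 0.9384},
    {"Model": "RNN-2-CuDNNGRU", "val_loss": 0.1792, "val_acc": 0.9369},
    {"Model": "RNN-2-LSTM",     "val_loss": 0.1656, "val_acc": 0.9375},
]
df_results = pd.DataFrame(results, columns=["Model", "val_loss", "val_acc"])
```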
564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": 7, 569 | "metadata": {}, 570 | "outputs": [], 571 | "source": [ 572 | "df_results = pd.DataFrame(columns=[\"Model\", \"val_loss\",'val_acc']) # to see all results" 573 | ] 574 | }, 575 | { 576 | "cell_type": "code", 577 | "execution_count": 8, 578 | "metadata": {}, 579 | "outputs": [], 580 | "source": [ 581 | "df_results = df_results.append({ # written in results \n", 582 | " \"Model\": 'CNN-3-Conv' ,\"val_loss\": 0.2632 ,'val_acc' : 0.9175}, ignore_index=True)\n", 583 | "df_results = df_results.append({ # written in results \n", 584 | " \"Model\": 'RNN-2-GRU' ,\"val_loss\": 0.1615 ,'val_acc' : 0.9377}, ignore_index=True)\n", 585 | "df_results = df_results.append({ # written in results \n", 586 | " \"Model\": 'CNN-2-Conv' ,\"val_loss\": 0.1839 ,'val_acc' : 0.9384}, ignore_index=True)\n", 587 | "df_results = df_results.append({ # written in results \n", 588 | " \"Model\": 'RNN-2-CuDNNGRU' ,\"val_loss\": 0.1792 ,'val_acc' : 0.9369}, ignore_index=True)\n", 589 | "df_results = df_results.append({ # written in results \n", 590 | " \"Model\": 'RNN-2-LSTM' ,\"val_loss\": 0.1656 ,'val_acc' : 0.9375}, ignore_index=True)" 591 | ] 592 | }, 593 | { 594 | "cell_type": "code", 595 | "execution_count": 9, 596 | "metadata": { 597 | "scrolled": true 598 | }, 599 | "outputs": [ 600 | { 601 | "data": { 602 | "text/html": [ 603 | "
\n", 604 | "\n", 617 | "\n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | "
   Model           val_loss  val_acc
0  CNN-3-Conv        0.2632   0.9175
1  RNN-2-GRU         0.1615   0.9377
2  CNN-2-Conv        0.1839   0.9384
3  RNN-2-CuDNNGRU    0.1792   0.9369
4  RNN-2-LSTM        0.1656   0.9375
\n", 659 | "
" 660 | ], 661 | "text/plain": [ 662 | " Model val_loss val_acc\n", 663 | "0 CNN-3-Conv 0.2632 0.9175\n", 664 | "1 RNN-2-GRU 0.1615 0.9377\n", 665 | "2 CNN-2-Conv 0.1839 0.9384\n", 666 | "3 RNN-2-CuDNNGRU 0.1792 0.9369\n", 667 | "4 RNN-2-LSTM 0.1656 0.9375" 668 | ] 669 | }, 670 | "execution_count": 9, 671 | "metadata": {}, 672 | "output_type": "execute_result" 673 | } 674 | ], 675 | "source": [ 676 | "df_results" 677 | ] 678 | }, 679 | { 680 | "cell_type": "markdown", 681 | "metadata": {}, 682 | "source": [ 683 | "According to accuracy CNN with 2-Conv is highest but its loss is higher than RNN-2-GRU also. So, I decided to best model with Keras as RNN with GRU layers." 684 | ] 685 | }, 686 | { 687 | "cell_type": "code", 688 | "execution_count": 11, 689 | "metadata": {}, 690 | "outputs": [ 691 | { 692 | "data": { 693 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAmkAAAF2CAYAAAA1GQ8BAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAftklEQVR4nO3de5QdZZnv8W93boAhAUOQBMTLzPgoKkYJigdQFNRZzPE2gB7JyMWDjAqKAjJoQBEGR9HhooKeEYaAmTgiLlTkomKCoowD6Amg6OOIilxyIAYGSARC0n3+qGrYbPqyO3R1v939/azFoqveqtrP3m9X57ffunX19vYiSZKksnSPdQGSJEl6MkOaJElSgQxpkiRJBTKkSZIkFciQJkmSVKCpY13ACJsB7AqsAjaOcS2SJEmDmQLMA64HHmlvnGghbVfgmrEuQpIkaRj2BH7cPnOihbRVAPfdt46eHu//JkmSytXd3cXWWz8N6vzSbqKFtI0APT29hjRJkjRe9HuKlhcOSJIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFWjqWBcwlractRmbzZg21mVMeA8/8igPPvDwWJchSdK4MqlD2mYzpnHgcf821mVMeMtOW8SDGNIkSRoOD3dKkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBWo0ScORMSBwAnANODMzDy7rf3NwCeALuD3wKGZeV9EHAx8Cri7XvSyzFzcZK2SJEklaSykRcT2wKnALsAjwLURsSIzb6nbZwFfBHbNzDsj4mTgJOAoYCFwdGZ+tan6JEmSStbk4c59gOWZeW9mrgMuBvZvaZ8GHJGZd9bTNwE71j/vChwcETdHxNKI2LrBOiVJkorTZEibD6xqmV4F7NA3kZlrMvMSgIjYHDge+GbLsqcAOwO3A19osE5JkqTiNHlOWjfQ2zLdBfS0LxQRs4FLgBsz8wKAzHxrS/tpwK3DeeE5c2ZuSr1q0Ny5W451CZIkjStNhrQ7gD1bprcD7mpdICLmAd8FlgMfqufNBt6VmWfUi3UBG4bzwmvWrKWnp3fI5QwOo2f16gfHugRJkorS3d016MBSk4c7rwL2joi5EbEFsB9wZV9jREwBLgUuyswPZmZfqloLHBcRr6inj6QaaZMkSZo0GhtJq6/YXAysAKYD52bmdRFxOfAx4JnAy4CpEdF3QcENmXlYRLwN+GJ9rtpvgIOaqlOSJKlEjd4nLTOXAcva5u1b/3gDA4zkZeY1VAFOkiRpUvKJA5IkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBVoapMbj4gDgROAacCZmXl2W/ubgU8AXcDvgUMz876I2BFYCmwLJLAoM9c2WaskSVJJGhtJi4jtgVOBPYAFwOERsVNL+yzgi8DfZOZLgJuAk+rmc4BzMvP5wA3AiU3VKUmSVKImD3fuAyzPzHszcx1wMbB/S/s04IjMvLOevgnYMSKmAa+qlwdYAhzQYJ2SJEnFafJw53xgVcv0KuDlfROZuQa4BCAiNgeOBz4PbAM8kJkbWtbbocE6JUmSitNkSOsGelumu4Ce9oUiYjZVWLsxMy+oD5P2ti32pPUGM2fOzGGWqqbNnbvlWJcgSdK40mRIuwPYs2V6O+Cu1gUiYh7wXWA58KF69j3A7IiYkpkbgXnt6w1lzZq19PS057wnMziMntWrHxzrEiRJKkp3d9egA0tNnpN2FbB3RMyNiC2A/YAr+xojYgpwKXBRZn4wM3sBMvNR4Brg7fWiBwFXNFinJElScRobScvMOyNiMbACmA6cm5nXRcTlwMeAZwIvA6ZGRN8FBTdk5mHA+4ALIuIE4I/AO5qqU5IkqUSN3ictM5cBy9rm7Vv/eAMDjORl5m3AXk3WJkmSVDKfOCBJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBpo51AZImp1mzZzBj+vSxLmNCe2T9eh64/5GxLkPSJjKkSRoTM6ZP55DzjxrrMia0JYeeBRjSpPHKw52SJEkFMqRJkiQVyJAmSZJUIEOaJElSgbxwQJI0LFttOZ1pm80Y6zImtEcffoT/fnD9WJehMWZIkyQNy7TNZnD5QYeOdRkT2r4Xng+GtEnPw52SJEkFMqRJkiQVyJAm
SZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVaOpYFyBJkkbP7FmbM32G//w3af0jG7j/gYee8nbsJUmSJpHpM6byycUXj3UZE9pHT91/RLbj4U5JkqQCGdIkSZIKZEiTJEkqkOekadzaevZ0pk6fMdZlTGgb1j/CffevH+syJGlSMqRp3Jo6fQY/O+2wsS5jQtvluHMBQ5okjQUPd0qSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUoI5CWkRsERG71T+/JyLOi4gdmy1NkiRp8ur0PmnnA7+LiI3AccCFwJeBNwy2UkQcCJwATAPOzMyzB1juQmB5Zi6ppw8GPgXcXS9yWWYu7rBWSZKkca/TkPbczHx7RJwMLMnMkyPi+sFWiIjtgVOBXYBHgGsjYkVm3tKyzHzg/wB7A8tbVl8IHJ2ZXx3Ge5EkSZowOj0nbVr9/zcAyyNiCjBziHX2oRoduzcz1wEXA/u3LbMI+BZwUdv8XYGDI+LmiFgaEVt3WKckSdKE0GlIuzYibgE2B64Frqr/G8x8YFXL9Cpgh9YFMvMzmXluP+uuAk4BdgZuB77QYZ2SJEkTQqeHO98PvBK4OTN7IuKzwBVDrNMN9LZMdwE9nbxYZr617+eIOA24tcM6AZgzZ6hBPo22uXO3HOsStInsu/HN/hu/7LvxbST6r9OQNgPYkJn3R8R7qA5H3gz8cZB17gD2bJneDrhrqBeKiNnAuzLzjHpWF7ChwzoBWLNmLT09vUMu5w4welavfnDEt2n/jY4m+g7sv9Hivjd+ue+Nb530X3d316ADS50e7jwfeHNE7Ep1deftVFd3DuYqYO+ImBsRWwD7AVd28FprgeMi4hX19JHAJR3WKUmSNCF0GtKem5kfAd5IdXXnScDTB1shM+8EFgMrgJXAssy8LiIuj4iFg6y3EXgb8MWI+BXV1aHHdVinJEnShNDp4c7WqzuP6fDqTjJzGbCsbd6+/Sx3SNv0NcDLOqxNkiRpwuk0pPVd3bmB6urOHzD01Z2SJEnaRJ0e7nw/cDiwR2b2AJ8FjmqsKkmSpEmuo5BWnyc2HzgrIpYC29RhTZIkSQ3o9AHrxwIfBW4Efg58KCJOaLIwSZKkyazTc9IOojrU+QBARJwH/BT4x6YKkyRJmsw6PSeNvoBW/3w/8GgjFUmSJKnjkbQ/RMRRwDn19BEM/rQBSZIkPQWdjqS9F3gr8Of6v/2ogpokSZIa0NFIWv30gL3qxzt1Z+baZsuSJEma3AYNaRFxKfCkJ5VHBACZ+aZmypIkSZrchhpJu3ioDUTE0zJz3QjVI0mSJIYIaZl5QQfb8DmbkiRJI6zjW3AMomsEtiFJkqQWIxHSnnTOmiRJkp6akQhpkiRJGmGGNEmSpAIZ0iRJkgo0EiHNG9tKkiSNsKFuZnv0YO2ZeXpmvmpkS5IkSdJQN7N98ahUIUmSpCcY6ma2h45WIZIkSXpcRw9Yj4hXAscDM6luXjsFeE5m7thgbZIkSZNWpxcOnAtcC8wC/g14APhGU0VJkiRNdp2GtN7M/DRwNfBr4G3A65sqSpIkabLrNKT13WbjVuBFmfkQsLGZkiRJktTROWnATyPia8CJwGUR8TxgQ3NlSZIkTW6djqTNB27KzN8AR9XrvaOxqiRJkia5TkPacuCNEfFb4CXApzMzmytLkiRpcusopGXmlzJzN+CNwNbAtRFxSaOVSZIkTWLDfXbn5sAMqnuleeGAJElSQzq9me3RwCFUAe08YLfMvLvBuiRJkia1Tq/u3AX4QGZe3WAtkiRJqnUU0jJzUdOFSJIk6XHDPSdNkiRJo8CQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBVoapMbj4gDgROAacCZmXn2AMtdCCzPzCX19I7AUmBbIIFFmbm2yVolSZJK0thIWkRsD5wK7AEsAA6PiJ3alpkfEZcC+7etfg5wTmY+H7gBOLGpOiVJkkrU5OHOfahGx+7NzHXAxTw5jC0CvgVc1DcjIqYBr6qXB1gCHNBgnZIkScVp8nDnfGBVy/Qq4OWtC2TmZwAiYo+W2dsAD2Tmhpb1dmiwTkmSpOI0GdK6gd6W6S6gZxPWo8P1HjNnzszhLK5RMHfulmNdgjaRfTe+2X/jl303vo1E/zUZ0u4A9myZ3g64q4P17gFmR8SUzNwIzOtwvcesWbOWnp72nPdk7gCjZ/XqB0d8m/bf6Gii78D+Gy3ue+OX+9741kn/dXd3DTqw1OQ5aVcBe0fE3IjYAtgPuHKolTLzUeAa4O31rIOAKxqrUpIkqUCNhbTMvBNYDKwAVgLLMvO6iLg8IhYOsfr7qK4GvYVqNO6EpuqUJEkqUaP3ScvMZcCytnn79rPcIW3TtwF7NVmbJElSyXzigCRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQWa2uTGI+JA4ARgGnBmZp7d1r4AOBeYBfwIeE9mboiIg4FPAXfXi16WmYubrFWSJKkkjY2kRcT2wKnAHsAC4PCI2KltsaXAkZn5PKALeHc9fyFwdGYuqP8zoEmSpEmlycOd+wDLM/PezFwHXAzs39cYEc8CNs/Mn9azlgAH1D/vChwcETdHxNKI2LrBOiVJkorTZEibD6xqmV4F7NBh+yrgFGBn4HbgC82VKUmSVJ4mz0nrBnpbpruAnk7aM/OtfTMj4jTg1uG88Jw5M4dbqxo2d+6WY12CNpF9N77Zf+OXfTe+jUT/NRnS7gD2bJneDrirrX1ee3tEzAbelZln1PO7gA3DeeE1a9bS09M75HLuAKNn9eoHR3yb9t/oaKLvwP4bLe5745f73vjWSf91d3cNOrDU5OHOq4C9I2JuRGwB7Adc2deYmbcBD0fE7vWsdwJXAGuB4yLiFfX8I4FLGqxTkiSpOI2FtMy8E1gMrABWAssy87qIuDwiFtaLLQLOiIhfAzOBz2XmRuBtwBcj4lfALsBxTdUpSZJUokbvk5aZy4BlbfP2bfn5RuDl/ax3DfCyJmuTJEkqmU8
ckCRJKpAhTZIkqUCGNEmSpAIZ0iRJkgpkSJMkSSqQIU2SJKlAhjRJkqQCGdIkSZIKZEiTJEkqkCFNkiSpQIY0SZKkAhnSJEmSCmRIkyRJKpAhTZIkqUCGNEmSpAIZ0iRJkgpkSJMkSSqQIU2SJKlAhjRJkqQCGdIkSZIKZEiTJEkqkCFNkiSpQIY0SZKkAhnSJEmSCmRIkyRJKpAhTZIkqUCGNEmSpAIZ0iRJkgpkSJMkSSqQIU2SJKlAhjRJkqQCGdIkSZIKZEiTJEkqkCFNkiSpQIY0SZKkAhnSJEmSCmRIkyRJKpAhTZIkqUCGNEmSpAIZ0iRJkgpkSJMkSSqQIU2SJKlAhjRJkqQCGdIkSZIKZEiTJEkqkCFNkiSpQIY0SZKkAhnSJEmSCmRIkyRJKpAhTZIkqUBTm9x4RBwInABMA87MzLPb2hcA5wKzgB8B78nMDRGxI7AU2BZIYFFmrm2yVkmSpJI0NpIWEdsDpwJ7AAuAwyNip7bFlgJHZubzgC7g3fX8c4BzMvP5wA3AiU3VKUmSVKImD3fuAyzPzHszcx1wMbB/X2NEPAvYPDN/Ws9aAhwQEdOAV9XLPza/wTolSZKK0+ThzvnAqpbpVcDLh2jfAdgGeCAzN7TN78QUgO7uro6L3Gbrp3W8rDbdcPpkOKbPmtPIdvW4pvoOYJuZT29s26o01X+bb+O+17Qm973ZW23R2LZV6aT/WpaZ0l97kyGtG+htme4Cejpob59P23qDmQew9TCC1+c+8paOl9WmmzNnZiPbffF7Pt3IdvW4pvoO4LMHfLyxbavSVP+95vTPNrJdPa7Jfe+ID+/b2LZVGWb/zQNubZ/ZZEi7A9izZXo74K629nn9tN8DzI6IKZm5sV6mdb3BXF+/5ipg4ybWLUmSNBqmUOWc6/trbDKkXQWcFBFzgXXAfsDhfY2ZeVtEPBwRu2fmT4B3Aldk5qMRcQ3wdmAZcBBwRYev+Qjw45F8E5IkSQ160ghan8YuHMjMO4HFwApgJbAsM6+LiMsjYmG92CLgjIj4NTAT+Fw9/31UV4PeQjUydkJTdUqSJJWoq7e3/fQvSZIkjTWfOCBJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUoCbvkzahRcQs4J+AVwMbgPuAY4B7gd8Dr8/M77cs/wdgr3pywPbM/EPb68wELgD+iuoGvR/OzKsGqGk3qofab0N1g7wfAcdk5kNP4a1OOBHxbOA3wC31rG5gFtXnfD4j33//Cjyf6qkap2bmvw9Q13OATwO7UP1OrQb+ob6PYN9r/BlYX6+yFXADcHBmruuvhoi4GjgpM68e7DMZT0Z53+u07yb0vlfwPjPszz0iDgFOB/5Yb38G8FXgHzNzY93+WeCFmXl3y/u/OjOfPVR7Pb0N8Kn6/T8KPES1H367br+a6nGHa+uyZgG/AxZl5t397bcRsaR+jSUDvbemlNj/A30eETGDqn9fTfW0ov+m+p24PiLOBnYHpgN/2fJ+zqJ60tH5wIGZ+dWW7X0QOAN4Tnuto8GRtE0QEd3A5VT/KCzIzAXAyVQ33Z1DtVN+OSK2HGATQ7W3Ogb4r8zcGXgHcOEANe0MXAJ8NDNfAiyg+gX/l47f2ORyV2YuqP/bGfgfwLHA5oxs/x0P/LF+jb2B0yPiGe0LRcQcqhsxfzcz/yIzA/gw8PW25fftqxt4HrAj1Q2fJ4VR3vc67bvJsu+Vts88lc/92/X7eAnVl6I9gZNa2rcEvjTI+gO21yFhBVUIjMx8AdXf7s/XNfc5rGVf/kvgAeDoDmofK0X1/yA+SJVtXlxv4yPAtyNiWmYeUX/e+7a9n/Prde8A9m/b3t9SBb0xYUjbNK+h+sfx430Pgs/MFcChVN/m7gK+D/zzAOsP1f6YzPwEj9/M9zlUowb9+TBwXmb+Z73eBuAfgG8CRMQzIuI7EXFTRPw8Iv66nn9SRHw5Iq6OiN9HxOJ6/s8jYpf65ykRcUdEbDtUvePYPKo/8HMYwf4Dfkh9k+bMvIcqXGzXz3J/D/wkM8/rm5GZ/0EV0gd6GO1WwOx6m5PFqO17dN53k3XfG+t9ZqjPfUk96kU93e9NQTNzHfBR4L0R0fe0628AfxURBw5Q42Dt+wEPZebJ9aMNycwE3svAR6+eRjUaOJ725bHu/4FsRzVSNq3exk94/O9DJ6+9MCKeBhAROwIPAvcP4/VHlIc7N81LgZWZ+YQHv2fm5fWwMFT/uN4cEa9rHeJtMVR763Y3RMR3qb5VHD7AYi+l+lbZut4DVH9MAD4PLM/M0yPiucCPI+KlddvOVN8ktwJurYeEv0L17e9nwGuBG+sdZqKYHxErgc2o/jheD7yV6psUjFD/tQ3vv43q0Mov+1l0N+B7/az/1bZZl0fEBuAZwO3AF4CLBnr9CWjU9r1h9N1k2fdK22eG+tyH4xdUYWNuPb0eOAT4TkT8oJ/lB2vfjeqw6xNk5uVts86NiHXAtlRB5N+pDquVqrT+H8hZwGXA6vqw8Q+ACzLz4Q7W3QB8l2qk7etUj6e8CPjEMF5/RDmStml6gEE7vP5j8W4GGOIdqr2f5d8A/AVwSkS8YBNqei1wXr2t3wH/CbyibluRmetbvrXMpjpHY7/6m+U7gKVD1TjO3FUPe+9E9Y9iN9U3PWDk+y8iDqD647F/3whQP3pblr8wIlZGxG8j4tiWZfath/DfR/UPytczs2+9JwSXWtcA88erUd/3Oui7ybLvlbbPDPm7MAx9+9Bj57Jl5g1U50b1e1hziPbWfflT9b6cEXFWyzKH1Ydb9wOeDlySmX3nm5a4L5fW//2qzxt7EfA6qn3tIGBlRGzV4SYu4vFDnm+hHpkdK4a0TXMD8LKWoXEAIuKTVIdjAMjM7zHIEG9/7RFxcr1Dr4yIN0XEqyNiXr38bcC1wAsj4tyW5RbWNS1s3X5EzIqISyNiOk/u6y4eH0lt/UPXC3Rl5v8DkurEz32Abw35qYxD9YjMh4Htqc6vaG17yv1Xz3t/vczrM/PGel57/11PdUJr37YPqv8gLqV6rm37a38DuJLqH4k+91GNyLTaloEPkY9Ho7bv1fM66btJte8VtM8M9bn3Un3WRMS0Id7WzsAdmflg2/yTqC7aGuiwZ3/t7fvy8fW+/E9UIfwJMvNaqsN7yyKi7/ei2H25oP7vV/23YH5mXpeZn8zMhVSHW1/X4VtcAewaES8C/pSZY3aoEzzcuamuAe4BPh4Rp2R1RdAbqI57v6Vt2WOAmxn4mPoT2jPzY8DH+hoj4jSqExePqsParsCxmXlx60Yi4gzg+xFxZVYPsp9G9Ut+f2auj4jlwP+mOgnzuVR/RN5L9cdpIF+pt7EiM/88+EcyftWHk4+lGt5uP0zxVPvvLcCHgN0z8/aW1zysdSNRXen086jOobkgM3ujOln2lVQXFPTnROC3EfE3mXkZ1bD+uyLiqHr9V1MFvF8N/gmMK6O573Xad5Nu3ytknxnqc/8T8MJ68fbfjdbtzAZOAc7u532ur/fJH1Jdbd1J+0XAMVGdY3haZj5av8ZrqA6n9e
d0qvNS/76u4wfAQRHxnfqzDqoLHP5joPcxmgrp/4HK2x44MSI+UPfPdlQB9+YO39vGiPg+1QUoX+hknSY5krYJ6sNLb6I6/PiLiLiJ6oTVfYG725btG+KdPsC2Bm2n+uMxLyJuprqq7YP1iFr7dm4G/g44KyJuBG6i+pb+7nqRDwCvrbfzTaqh9lVDvNVLqL4llnK4pTGZeSXVH8BT2uY/1f77BNXVT5cO9i0wM/9EdbXU3lS/U7+m+sb5A+AzA7z2PVS37PhM/Q38FKqrzn4REb+op988nEMFpRvlfa/TvpuU+14B+8xQn/uXgL3q35HdgdbP/E31dv8v1Zega6n2pf7qvQE4c4Ban9SemY9QBbL5VIfZfkl12O0OBrh6s15nMXBSHej+heq2FTfWvzdfobo9x58GqmO0jXX/174UEWtb/tsTOJIq2/ym/uyvoLqV0a+H8fYuorpa+NvDWKcRXb29/V7wIkmSpDHkSJokSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiaNiHh2RPRGxA/7aVtSt20zjO19J1qeDznAMnvVt0SRpGExpEmabB4GIiKe1Tcjqgcq7z7wKpI0+nzigKTJZiPwNWAR8Ml63t9SPX7pGICIOJzqJrQbqW6Se2Rm/iYi5gMXUN2o9DaqO5lTr/MCqmcNzgGmAJ/LzNbHdknSsDiSJmkyuhB4Z8v0wcCS+ufXAscBr6kfgL0M+GZUzws9G/hpZr6QKsQ9H6B+4sPFwPGZuQvwauDYiNhtFN6LpAnKkCZp0snMnwEbI2KXiHgmsGVm9p039tfA1zJzdb3sEqrnAT6b6oHnS+r5vwWW1+s8j+pRVf8aESupnuW4OfDS0Xg/kiYmD3dKmqy+QvXsx9X1z3166/9adQHT6vldLfP7nos6herB3gv6GiLiGcD9gKNpkjaJI2mSJqulwAHA26kOafa5EvhfETEXICIOBdYAv63bDq/n70j1IG2ABB6KiL+r254J/ALYpfm3IWmiMqRJmpQy807gV8B/Zea9LU0rgDOA5RHxS6rz1f5nZvYARwA7RcSvgPOAlfW21gNvBg6LiJuA7wEnZuZPRu0NSZpwunp720f1JUmSNNYcSZMkSSqQIU2SJKlAhjRJkqQCGdIkSZIKZEiTJEkqkCFNkiSpQIY0SZKkAhnSJEmSCvT/AfuB13IOpjRJAAAAAElFTkSuQmCC\n", 694 | "text/plain": [ 695 | "
" 696 | ] 697 | }, 698 | "metadata": { 699 | "needs_background": "light" 700 | }, 701 | "output_type": "display_data" 702 | } 703 | ], 704 | "source": [ 705 | "plt.figure(figsize=(10,6))\n", 706 | "ax = sns.barplot(x='Model', y= 'val_loss',data=df_results)" 707 | ] 708 | }, 709 | { 710 | "cell_type": "markdown", 711 | "metadata": {}, 712 | "source": [ 713 | "# Future Advices:\n", 714 | "\n", 715 | "- More layers can be added.\n", 716 | "- Models can be run for more epoches.\n" 717 | ] 718 | }, 719 | { 720 | "cell_type": "code", 721 | "execution_count": null, 722 | "metadata": {}, 723 | "outputs": [], 724 | "source": [] 725 | } 726 | ], 727 | "metadata": { 728 | "accelerator": "GPU", 729 | "colab": { 730 | "name": "Keras_LSTM_GRU.ipynb", 731 | "provenance": [] 732 | }, 733 | "kernelspec": { 734 | "display_name": "Python 3", 735 | "language": "python", 736 | "name": "python3" 737 | }, 738 | "language_info": { 739 | "codemirror_mode": { 740 | "name": "ipython", 741 | "version": 3 742 | }, 743 | "file_extension": ".py", 744 | "mimetype": "text/x-python", 745 | "name": "python", 746 | "nbconvert_exporter": "python", 747 | "pygments_lexer": "ipython3", 748 | "version": "3.6.9" 749 | } 750 | }, 751 | "nbformat": 4, 752 | "nbformat_minor": 1 753 | } 754 | -------------------------------------------------------------------------------- /notebooks/.ipynb_checkpoints/6_Keras_with_Different_Layer_Types_and_Numbers-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Keras with Different Layer Types and Numbers" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "This notebook was run in Google Colab to get more quick results same time without hurting my computer.\n", 15 | "\n", 16 | "### Aim of This Notebook:\n", 17 | "\n", 18 | "My aim in this notebook is to find better results than CNN with 3-conv layers and RNN with 2-GRU layers models. So, I tried CNN with 2-conv layers and RNN with 2-CuDNNGRU layer and 2-LSTM layers.\n", 19 | "\n", 20 | "The detailed comparison of results for all 5 Keras models can be found at the end of this notebook with feature advices. 
" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 0, 26 | "metadata": { 27 | "colab": {}, 28 | "colab_type": "code", 29 | "id": "8neitrJ7lqZA" 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "# necessary imports\n", 34 | "\n", 35 | "import pandas as pd\n", 36 | "import numpy as np\n", 37 | "\n", 38 | "from sklearn.metrics import f1_score, roc_auc_score, accuracy_score\n", 39 | "from sklearn.model_selection import train_test_split\n", 40 | "import re\n", 41 | "\n", 42 | "import matplotlib.pyplot as plt\n", 43 | "import seaborn as sns; sns.set()\n", 44 | "%matplotlib inline\n", 45 | "\n", 46 | "#tensorflow imports\n", 47 | "import tensorflow\n", 48 | "from tensorflow.python.keras import models, layers, optimizers\n", 49 | "from tensorflow.python.keras.preprocessing.text import Tokenizer, text_to_word_sequence\n", 50 | "from tensorflow.python.keras.preprocessing.sequence import pad_sequences\n" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 10, 56 | "metadata": { 57 | "colab": { 58 | "base_uri": "https://localhost:8080/", 59 | "height": 122 60 | }, 61 | "colab_type": "code", 62 | "id": "W8EBMG9Plqaa", 63 | "outputId": "40f2a018-a5aa-434c-c1ef-1d60f44d0e3c" 64 | }, 65 | "outputs": [ 66 | { 67 | "name": "stdout", 68 | "output_type": "stream", 69 | "text": [ 70 | "Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly\n", 71 | "\n", 72 | "Enter your authorization code:\n", 73 | "··········\n", 74 | "Mounted at /content/drive\n" 75 | ] 76 | } 77 | ], 78 | "source": [ 79 | "# this cell is to connect my google drive with colab notebook to get data\n", 80 | "from google.colab import drive\n", 81 | "drive.mount('/content/drive')" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 0, 87 | "metadata": { 88 | "colab": {}, 89 | "colab_type": "code", 90 | "id": "qNVwgeH_lqbM" 91 | }, 92 | "outputs": [], 93 | "source": [ 94 | "# this cell is for geting data from path in the drive\n", 95 | "path = '/content/drive/My Drive/train/train.csv'\n", 96 | "df = pd.read_csv(path)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 0, 102 | "metadata": { 103 | "colab": {}, 104 | "colab_type": "code", 105 | "id": "RMdeR89Llqbs" 106 | }, 107 | "outputs": [], 108 | "source": [ 109 | "df.dropna(inplace=True) # last check to make sure about nulls" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": { 115 | "colab": {}, 116 | "colab_type": "code", 117 | "id": "oGlbaSUTlqbw" 118 | }, 119 | "source": [ 120 | "To make sure about using same sample data in each notebook, I always get same data and divide with same random state and test_size for validation." 
121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 0, 126 | "metadata": { 127 | "colab": {}, 128 | "colab_type": "code", 129 | "id": "4a-3vu6Nlqb_" 130 | }, 131 | "outputs": [], 132 | "source": [ 133 | "train_data, test_data = train_test_split(df, test_size=0.2,random_state = 42)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "Now, I will split my train and test to X and y as text and target." 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 0, 146 | "metadata": { 147 | "colab": {}, 148 | "colab_type": "code", 149 | "id": "Ezso4Q7flqcE" 150 | }, 151 | "outputs": [], 152 | "source": [ 153 | "train_target = train_data.sentiment\n", 154 | "train_texts = train_data.review_clean\n", 155 | "\n", 156 | "test_target = test_data.sentiment\n", 157 | "test_texts = test_data.review_clean" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "Like previous notebook, I inspired by https://www.kaggle.com/muonneutrino/sentiment-analysis-with-amazon-reviews to build model. I changed layers and layer types to get better results." 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "# Preparing Data to Model" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 0, 177 | "metadata": { 178 | "colab": {}, 179 | "colab_type": "code", 180 | "id": "1qxLjVxQlqcO" 181 | }, 182 | "outputs": [], 183 | "source": [ 184 | "# I get together my text\n", 185 | "def converting_texts(texts):\n", 186 | " collected_texts = []\n", 187 | " for text in texts:\n", 188 | " collected_texts.append(text)\n", 189 | " return collected_texts\n", 190 | " \n", 191 | "train_texts = converting_texts(train_texts)\n", 192 | "test_texts = converting_texts(test_texts)" 193 | ] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "metadata": {}, 198 | "source": [ 199 | "I need to tokenize my text and padding sequences before modeling my data. I will use Keras proprocessing tools for this." 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 0, 205 | "metadata": { 206 | "colab": {}, 207 | "colab_type": "code", 208 | "id": "E4JTCuhslqcW" 209 | }, 210 | "outputs": [], 211 | "source": [ 212 | "max_feat= 12000 #seting max features to define max number of tokenizer words\n", 213 | "\n", 214 | "tokenizer = Tokenizer(num_words=max_feat)\n", 215 | "tokenizer.fit_on_texts(train_texts)\n", 216 | "# updates internal vocabulary based on a list of texts\n", 217 | "# in the case where texts contains lists, we assume each entry of the lists to be a token\n", 218 | "# required before using texts_to_sequences or texts_to_matrix\n", 219 | "\n", 220 | "train_texts = tokenizer.texts_to_sequences(train_texts)\n", 221 | "test_texts = tokenizer.texts_to_sequences(test_texts)\n", 222 | "# transforms each text in texts to a sequence of integers\n", 223 | "# Only top num_words-1 most frequent words will be taken into account \n", 224 | "# Only words known by the tokenizer will be taken into account" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": {}, 230 | "source": [ 231 | "To use batches productively, I need to turn my sequences to same lenght. I prefer to set everything to maximum lenght of the longest sentence in train data." 
232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 0, 237 | "metadata": { 238 | "colab": {}, 239 | "colab_type": "code", 240 | "id": "QjtN-UPJlqcb" 241 | }, 242 | "outputs": [], 243 | "source": [ 244 | "max_len = max(len(train_ex) for train_ex in train_texts) #setting the max length\n", 245 | "\n", 246 | "# using pad_sequence tool from Keras\n", 247 | "# transforms a list of sequences to into a 2D Numpy array of shape \n", 248 | "# the maxlen argument for the length of the longest sequence in the list\n", 249 | "train_texts = pad_sequences(train_texts, maxlen=max_len)\n", 250 | "test_texts = pad_sequences(test_texts, maxlen=max_len)" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "# Building Model" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "In this model, I reduced the layer number in CNN. I set one convolutional layer with batch normalization and max pooling. Additionaly, one convolutional layer with global max pooling. " 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 0, 270 | "metadata": { 271 | "colab": {}, 272 | "colab_type": "code", 273 | "id": "2TpNmPSMlqcg" 274 | }, 275 | "outputs": [], 276 | "source": [ 277 | "def build_model():\n", 278 | " sequences = layers.Input(shape=(max_len,))\n", 279 | " embedded = layers.Embedding(max_feat, 64)(sequences)\n", 280 | " x = layers.Conv1D(64, 3, activation='relu')(embedded)\n", 281 | " x = layers.BatchNormalization()(x)\n", 282 | " x = layers.MaxPool1D(3)(x)\n", 283 | " x = layers.Conv1D(64, 5, activation='relu')(x)\n", 284 | " x = layers.GlobalMaxPool1D()(x)\n", 285 | " x = layers.Flatten()(x)\n", 286 | " x = layers.Dense(100, activation='relu')(x)\n", 287 | " predictions = layers.Dense(1, activation='sigmoid')(x)\n", 288 | " model = models.Model(inputs=sequences, outputs=predictions)\n", 289 | " model.compile(\n", 290 | " optimizer='rmsprop',\n", 291 | " loss='binary_crossentropy',\n", 292 | " metrics=['binary_accuracy']\n", 293 | " )\n", 294 | " return model\n", 295 | " \n", 296 | "model = build_model()" 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "# Fitting Model to Pre-processed Data" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 22, 309 | "metadata": { 310 | "colab": { 311 | "base_uri": "https://localhost:8080/", 312 | "height": 102 313 | }, 314 | "colab_type": "code", 315 | "id": "t6Po8kWmlqcw", 316 | "outputId": "8dbd125b-97a5-41e1-9248-300c5d89924f" 317 | }, 318 | "outputs": [ 319 | { 320 | "name": "stdout", 321 | "output_type": "stream", 322 | "text": [ 323 | "Epoch 1/2\n", 324 | "500/500 [==============================] - 56s 112ms/step - loss: 0.2157 - binary_accuracy: 0.9233 - val_loss: 0.1995 - val_binary_accuracy: 0.9208\n", 325 | "Epoch 2/2\n", 326 | "500/500 [==============================] - 55s 111ms/step - loss: 0.1456 - binary_accuracy: 0.9436 - val_loss: 0.1830 - val_binary_accuracy: 0.9384\n" 327 | ] 328 | }, 329 | { 330 | "data": { 331 | "text/plain": [ 332 | "" 333 | ] 334 | }, 335 | "execution_count": 22, 336 | "metadata": { 337 | "tags": [] 338 | }, 339 | "output_type": "execute_result" 340 | } 341 | ], 342 | "source": [ 343 | "model.fit(\n", 344 | " train_target, \n", 345 | " train_labels, \n", 346 | " batch_size=128,\n", 347 | " epochs=2,\n", 348 | " validation_data=(test_texts, test_target) )" 349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | 
"metadata": {}, 354 | "source": [ 355 | "It gives good results with less loss and high accuracy." 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "# RNN Model with CuDNNGRU Layers" 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": {}, 368 | "source": [ 369 | "I changed the layer types for RNN to see the difference." 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": 0, 375 | "metadata": { 376 | "colab": {}, 377 | "colab_type": "code", 378 | "id": "_72Lbyablqd1" 379 | }, 380 | "outputs": [], 381 | "source": [ 382 | "def build_rnn_model():\n", 383 | " sequences = layers.Input(shape=(max_len,))\n", 384 | " embedded = layers.Embedding(max_feat, 64)(sequences)\n", 385 | " x = layers.CuDNNGRU(128, return_sequences=True)(embedded)\n", 386 | " x = layers.CuDNNGRU(128)(x)\n", 387 | " x = layers.Dense(32, activation='relu')(x)\n", 388 | " x = layers.Dense(100, activation='relu')(x)\n", 389 | " predictions = layers.Dense(1, activation='sigmoid')(x)\n", 390 | " model = models.Model(inputs=sequences, outputs=predictions)\n", 391 | " model.compile(\n", 392 | " optimizer='rmsprop',\n", 393 | " loss='binary_crossentropy',\n", 394 | " metrics=['binary_accuracy']\n", 395 | " )\n", 396 | " return model\n", 397 | " \n", 398 | "rnn_model = build_rnn_model()" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": 28, 404 | "metadata": { 405 | "colab": { 406 | "base_uri": "https://localhost:8080/", 407 | "height": 102 408 | }, 409 | "colab_type": "code", 410 | "id": "DEz1HZGklqd-", 411 | "outputId": "20633893-b53b-43aa-d1a6-cb2c4fd11d3b" 412 | }, 413 | "outputs": [ 414 | { 415 | "name": "stdout", 416 | "output_type": "stream", 417 | "text": [ 418 | "Epoch 1/2\n", 419 | "500/500 [==============================] - 331s 661ms/step - loss: 0.2204 - binary_accuracy: 0.9177 - val_loss: 0.1871 - val_binary_accuracy: 0.9280\n", 420 | "Epoch 2/2\n", 421 | "500/500 [==============================] - 329s 658ms/step - loss: 0.1611 - binary_accuracy: 0.9361 - val_loss: 0.1792 - val_binary_accuracy: 0.9369\n" 422 | ] 423 | }, 424 | { 425 | "data": { 426 | "text/plain": [ 427 | "" 428 | ] 429 | }, 430 | "execution_count": 28, 431 | "metadata": { 432 | "tags": [] 433 | }, 434 | "output_type": "execute_result" 435 | } 436 | ], 437 | "source": [ 438 | "rnn_model.fit(\n", 439 | " train_texts, \n", 440 | " train_target, \n", 441 | " batch_size=128,\n", 442 | " epochs=2,\n", 443 | " validation_data=(test_texts, test_target) )" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": {}, 449 | "source": [ 450 | "It gives a bit better than CNN." 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "# RNN with LSTM Layers" 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": {}, 463 | "source": [ 464 | "I will try LSTM layers this time." 
465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": 0, 470 | "metadata": { 471 | "colab": {}, 472 | "colab_type": "code", 473 | "id": "mN8DwiCxlqeI" 474 | }, 475 | "outputs": [], 476 | "source": [ 477 | "def build_lstm_model():\n", 478 | " sequences = layers.Input(shape=(MAX_LENGTH,))\n", 479 | " embedded = layers.Embedding(MAX_FEATURES, 64)(sequences)\n", 480 | " x = layers.LSTM(128, return_sequences=True)(embedded)\n", 481 | " x = layers.LSTM(128)(x)\n", 482 | " x = layers.Dense(32, activation='relu')(x)\n", 483 | " x = layers.Dense(100, activation='relu')(x)\n", 484 | " predictions = layers.Dense(1, activation='sigmoid')(x)\n", 485 | " model = models.Model(inputs=sequences, outputs=predictions)\n", 486 | " model.compile(\n", 487 | " optimizer='rmsprop',\n", 488 | " loss='binary_crossentropy',\n", 489 | " metrics=['binary_accuracy']\n", 490 | " )\n", 491 | " return model\n", 492 | " \n", 493 | "rnn_model = build_lstm_model()" 494 | ] 495 | }, 496 | { 497 | "cell_type": "code", 498 | "execution_count": 26, 499 | "metadata": { 500 | "colab": { 501 | "base_uri": "https://localhost:8080/", 502 | "height": 102 503 | }, 504 | "colab_type": "code", 505 | "id": "38nly0ekoXmf", 506 | "outputId": "fb3238e2-6f5e-4db4-c6e1-e7cedaa3fdf1" 507 | }, 508 | "outputs": [ 509 | { 510 | "name": "stdout", 511 | "output_type": "stream", 512 | "text": [ 513 | "Epoch 1/2\n", 514 | "500/500 [==============================] - 392s 783ms/step - loss: 0.2149 - binary_accuracy: 0.9245 - val_loss: 0.1718 - val_binary_accuracy: 0.9322\n", 515 | "Epoch 2/2\n", 516 | "500/500 [==============================] - 391s 781ms/step - loss: 0.1652 - binary_accuracy: 0.9367 - val_loss: 0.1656 - val_binary_accuracy: 0.9375\n" 517 | ] 518 | }, 519 | { 520 | "data": { 521 | "text/plain": [ 522 | "" 523 | ] 524 | }, 525 | "execution_count": 26, 526 | "metadata": { 527 | "tags": [] 528 | }, 529 | "output_type": "execute_result" 530 | } 531 | ], 532 | "source": [ 533 | "rnn_model.fit(\n", 534 | " train_texts, \n", 535 | " train_target, \n", 536 | " batch_size=128,\n", 537 | " epochs=2,\n", 538 | " validation_data=(test_texts, test_target) )" 539 | ] 540 | }, 541 | { 542 | "cell_type": "markdown", 543 | "metadata": { 544 | "colab": {}, 545 | "colab_type": "code", 546 | "id": "hSw7jRbfvVOq" 547 | }, 548 | "source": [ 549 | "It gives higher values." 550 | ] 551 | }, 552 | { 553 | "cell_type": "markdown", 554 | "metadata": {}, 555 | "source": [ 556 | "# Comparing All 5 Results" 557 | ] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": {}, 562 | "source": [ 563 | "I checked all results and there is no huge differences between train and validation results which show overfitting. So, I will compare validation results." 
564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": 7, 569 | "metadata": {}, 570 | "outputs": [], 571 | "source": [ 572 | "df_results = pd.DataFrame(columns=[\"Model\", \"val_loss\",'val_acc']) # to see all results" 573 | ] 574 | }, 575 | { 576 | "cell_type": "code", 577 | "execution_count": 8, 578 | "metadata": {}, 579 | "outputs": [], 580 | "source": [ 581 | "df_results = df_results.append({ # written in results \n", 582 | " \"Model\": 'CNN-3-Conv' ,\"val_loss\": 0.2632 ,'val_acc' : 0.9175}, ignore_index=True)\n", 583 | "df_results = df_results.append({ # written in results \n", 584 | " \"Model\": 'RNN-2-GRU' ,\"val_loss\": 0.1615 ,'val_acc' : 0.9377}, ignore_index=True)\n", 585 | "df_results = df_results.append({ # written in results \n", 586 | " \"Model\": 'CNN-2-Conv' ,\"val_loss\": 0.1839 ,'val_acc' : 0.9384}, ignore_index=True)\n", 587 | "df_results = df_results.append({ # written in results \n", 588 | " \"Model\": 'RNN-2-CuDNNGRU' ,\"val_loss\": 0.1792 ,'val_acc' : 0.9369}, ignore_index=True)\n", 589 | "df_results = df_results.append({ # written in results \n", 590 | " \"Model\": 'RNN-2-LSTM' ,\"val_loss\": 0.1656 ,'val_acc' : 0.9375}, ignore_index=True)" 591 | ] 592 | }, 593 | { 594 | "cell_type": "code", 595 | "execution_count": 9, 596 | "metadata": { 597 | "scrolled": true 598 | }, 599 | "outputs": [ 600 | { 601 | "data": { 602 | "text/html": [ 603 | "
\n", 604 | "\n", 617 | "\n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | "
   Model           val_loss  val_acc
0  CNN-3-Conv        0.2632   0.9175
1  RNN-2-GRU         0.1615   0.9377
2  CNN-2-Conv        0.1839   0.9384
3  RNN-2-CuDNNGRU    0.1792   0.9369
4  RNN-2-LSTM        0.1656   0.9375
\n", 659 | "
" 660 | ], 661 | "text/plain": [ 662 | " Model val_loss val_acc\n", 663 | "0 CNN-3-Conv 0.2632 0.9175\n", 664 | "1 RNN-2-GRU 0.1615 0.9377\n", 665 | "2 CNN-2-Conv 0.1839 0.9384\n", 666 | "3 RNN-2-CuDNNGRU 0.1792 0.9369\n", 667 | "4 RNN-2-LSTM 0.1656 0.9375" 668 | ] 669 | }, 670 | "execution_count": 9, 671 | "metadata": {}, 672 | "output_type": "execute_result" 673 | } 674 | ], 675 | "source": [ 676 | "df_results" 677 | ] 678 | }, 679 | { 680 | "cell_type": "markdown", 681 | "metadata": {}, 682 | "source": [ 683 | "According to accuracy CNN with 2-Conv is highest but its loss is higher than RNN-2-GRU also. So, I decided to best model with Keras as RNN with GRU layers." 684 | ] 685 | }, 686 | { 687 | "cell_type": "code", 688 | "execution_count": 11, 689 | "metadata": {}, 690 | "outputs": [ 691 | { 692 | "data": { 693 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAmkAAAF2CAYAAAA1GQ8BAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAftklEQVR4nO3de5QdZZnv8W93boAhAUOQBMTLzPgoKkYJigdQFNRZzPE2gB7JyMWDjAqKAjJoQBEGR9HhooKeEYaAmTgiLlTkomKCoowD6Amg6OOIilxyIAYGSARC0n3+qGrYbPqyO3R1v939/azFoqveqtrP3m9X57ffunX19vYiSZKksnSPdQGSJEl6MkOaJElSgQxpkiRJBTKkSZIkFciQJkmSVKCpY13ACJsB7AqsAjaOcS2SJEmDmQLMA64HHmlvnGghbVfgmrEuQpIkaRj2BH7cPnOihbRVAPfdt46eHu//JkmSytXd3cXWWz8N6vzSbqKFtI0APT29hjRJkjRe9HuKlhcOSJIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFWjqWBcwlractRmbzZg21mVMeA8/8igPPvDwWJchSdK4MqlD2mYzpnHgcf821mVMeMtOW8SDGNIkSRoOD3dKkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBWo0ScORMSBwAnANODMzDy7rf3NwCeALuD3wKGZeV9EHAx8Cri7XvSyzFzcZK2SJEklaSykRcT2wKnALsAjwLURsSIzb6nbZwFfBHbNzDsj4mTgJOAoYCFwdGZ+tan6JEmSStbk4c59gOWZeW9mrgMuBvZvaZ8GHJGZd9bTNwE71j/vChwcETdHxNKI2LrBOiVJkorTZEibD6xqmV4F7NA3kZlrMvMSgIjYHDge+GbLsqcAOwO3A19osE5JkqTiNHlOWjfQ2zLdBfS0LxQRs4FLgBsz8wKAzHxrS/tpwK3DeeE5c2ZuSr1q0Ny5W451CZIkjStNhrQ7gD1bprcD7mpdICLmAd8FlgMfqufNBt6VmWfUi3UBG4bzwmvWrKWnp3fI5QwOo2f16gfHugRJkorS3d016MBSk4c7rwL2joi5EbEFsB9wZV9jREwBLgUuyswPZmZfqloLHBcRr6inj6QaaZMkSZo0GhtJq6/YXAysAKYD52bmdRFxOfAx4JnAy4CpEdF3QcENmXlYRLwN+GJ9rtpvgIOaqlOSJKlEjd4nLTOXAcva5u1b/3gDA4zkZeY1VAFOkiRpUvKJA5IkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBVoapMbj4gDgROAacCZmXl2W/ubgU8AXcDvgUMz876I2BFYCmwLJLAoM9c2WaskSVJJGhtJi4jtgVOBPYAFwOERsVNL+yzgi8DfZOZLgJuAk+rmc4BzMvP5wA3AiU3VKUmSVKImD3fuAyzPzHszcx1wMbB/S/s04IjMvLOevgnYMSKmAa+qlwdYAhzQYJ2SJEnFafJw53xgVcv0KuDlfROZuQa4BCAiNgeOBz4PbAM8kJkbWtbbocE6JUmSitNkSOsGelumu4Ce9oUiYjZVWLsxMy+oD5P2ti32pPUGM2fOzGGWqqbNnbvlWJcgSdK40mRIuwPYs2V6O+Cu1gUiYh7wXWA58KF69j3A7IiYkpkbgXnt6w1lzZq19PS057wnMziMntWrHxzrEiRJKkp3d9egA0tNnpN2FbB3RMyNiC2A/YAr+xojYgpwKXBRZn4wM3sBMvNR4Brg7fWiBwFXNFinJElScRobScvMOyNiMbACmA6cm5nXRcTlwMeAZwIvA6ZGRN8FBTdk5mHA+4ALIuIE4I/AO5qqU5IkqUSN3ictM5cBy9rm7Vv/eAMDjORl5m3AXk3WJkmSVDKfOCBJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBpo51AZImp1mzZzBj+vSxLmNCe2T9eh64/5GxLkPSJjKkSRoTM6ZP55DzjxrrMia0JYeeBRjSpPHKw52SJEkFMqRJkiQVyJAmSZJUIEOaJElSgbxwQJI0LFttOZ1pm80Y6zImtEcffoT/fnD9WJehMWZIkyQNy7TNZnD5QYeOdRkT2r4Xng+GtEnPw52SJEkFMqRJkiQVyJAm
SZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVaOpYFyBJkkbP7FmbM32G//w3af0jG7j/gYee8nbsJUmSJpHpM6byycUXj3UZE9pHT91/RLbj4U5JkqQCGdIkSZIKZEiTJEkqkOekadzaevZ0pk6fMdZlTGgb1j/CffevH+syJGlSMqRp3Jo6fQY/O+2wsS5jQtvluHMBQ5okjQUPd0qSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUoI5CWkRsERG71T+/JyLOi4gdmy1NkiRp8ur0PmnnA7+LiI3AccCFwJeBNwy2UkQcCJwATAPOzMyzB1juQmB5Zi6ppw8GPgXcXS9yWWYu7rBWSZKkca/TkPbczHx7RJwMLMnMkyPi+sFWiIjtgVOBXYBHgGsjYkVm3tKyzHzg/wB7A8tbVl8IHJ2ZXx3Ge5EkSZowOj0nbVr9/zcAyyNiCjBziHX2oRoduzcz1wEXA/u3LbMI+BZwUdv8XYGDI+LmiFgaEVt3WKckSdKE0GlIuzYibgE2B64Frqr/G8x8YFXL9Cpgh9YFMvMzmXluP+uuAk4BdgZuB77QYZ2SJEkTQqeHO98PvBK4OTN7IuKzwBVDrNMN9LZMdwE9nbxYZr617+eIOA24tcM6AZgzZ6hBPo22uXO3HOsStInsu/HN/hu/7LvxbST6r9OQNgPYkJn3R8R7qA5H3gz8cZB17gD2bJneDrhrqBeKiNnAuzLzjHpWF7ChwzoBWLNmLT09vUMu5w4welavfnDEt2n/jY4m+g7sv9Hivjd+ue+Nb530X3d316ADS50e7jwfeHNE7Ep1deftVFd3DuYqYO+ImBsRWwD7AVd28FprgeMi4hX19JHAJR3WKUmSNCF0GtKem5kfAd5IdXXnScDTB1shM+8EFgMrgJXAssy8LiIuj4iFg6y3EXgb8MWI+BXV1aHHdVinJEnShNDp4c7WqzuP6fDqTjJzGbCsbd6+/Sx3SNv0NcDLOqxNkiRpwuk0pPVd3bmB6urOHzD01Z2SJEnaRJ0e7nw/cDiwR2b2AJ8FjmqsKkmSpEmuo5BWnyc2HzgrIpYC29RhTZIkSQ3o9AHrxwIfBW4Efg58KCJOaLIwSZKkyazTc9IOojrU+QBARJwH/BT4x6YKkyRJmsw6PSeNvoBW/3w/8GgjFUmSJKnjkbQ/RMRRwDn19BEM/rQBSZIkPQWdjqS9F3gr8Of6v/2ogpokSZIa0NFIWv30gL3qxzt1Z+baZsuSJEma3AYNaRFxKfCkJ5VHBACZ+aZmypIkSZrchhpJu3ioDUTE0zJz3QjVI0mSJIYIaZl5QQfb8DmbkiRJI6zjW3AMomsEtiFJkqQWIxHSnnTOmiRJkp6akQhpkiRJGmGGNEmSpAIZ0iRJkgo0EiHNG9tKkiSNsKFuZnv0YO2ZeXpmvmpkS5IkSdJQN7N98ahUIUmSpCcY6ma2h45WIZIkSXpcRw9Yj4hXAscDM6luXjsFeE5m7thgbZIkSZNWpxcOnAtcC8wC/g14APhGU0VJkiRNdp2GtN7M/DRwNfBr4G3A65sqSpIkabLrNKT13WbjVuBFmfkQsLGZkiRJktTROWnATyPia8CJwGUR8TxgQ3NlSZIkTW6djqTNB27KzN8AR9XrvaOxqiRJkia5TkPacuCNEfFb4CXApzMzmytLkiRpcusopGXmlzJzN+CNwNbAtRFxSaOVSZIkTWLDfXbn5sAMqnuleeGAJElSQzq9me3RwCFUAe08YLfMvLvBuiRJkia1Tq/u3AX4QGZe3WAtkiRJqnUU0jJzUdOFSJIk6XHDPSdNkiRJo8CQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBVoapMbj4gDgROAacCZmXn2AMtdCCzPzCX19I7AUmBbIIFFmbm2yVolSZJK0thIWkRsD5wK7AEsAA6PiJ3alpkfEZcC+7etfg5wTmY+H7gBOLGpOiVJkkrU5OHOfahGx+7NzHXAxTw5jC0CvgVc1DcjIqYBr6qXB1gCHNBgnZIkScVp8nDnfGBVy/Qq4OWtC2TmZwAiYo+W2dsAD2Tmhpb1dmiwTkmSpOI0GdK6gd6W6S6gZxPWo8P1HjNnzszhLK5RMHfulmNdgjaRfTe+2X/jl303vo1E/zUZ0u4A9myZ3g64q4P17gFmR8SUzNwIzOtwvcesWbOWnp72nPdk7gCjZ/XqB0d8m/bf6Gii78D+Gy3ue+OX+9741kn/dXd3DTqw1OQ5aVcBe0fE3IjYAtgPuHKolTLzUeAa4O31rIOAKxqrUpIkqUCNhbTMvBNYDKwAVgLLMvO6iLg8IhYOsfr7qK4GvYVqNO6EpuqUJEkqUaP3ScvMZcCytnn79rPcIW3TtwF7NVmbJElSyXzigCRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQUypEmSJBXIkCZJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiRJBTKkSZIkFciQJkmSVCBDmiRJUoEMaZIkSQWa2uTGI+JA4ARgGnBmZp7d1r4AOBeYBfwIeE9mboiIg4FPAXfXi16WmYubrFWSJKkkjY2kRcT2wKnAHsAC4PCI2KltsaXAkZn5PKALeHc9fyFwdGYuqP8zoEmSpEmlycOd+wDLM/PezFwHXAzs39cYEc8CNs/Mn9azlgAH1D/vChwcETdHxNKI2LrBOiVJkorTZEibD6xqmV4F7NBh+yrgFGBn4HbgC82VKUmSVJ4mz0nrBnpbpruAnk7aM/OtfTMj4jTg1uG88Jw5M4dbqxo2d+6WY12CNpF9N77Zf+OXfTe+jUT/NRnS7gD2bJneDrirrX1ee3tEzAbelZln1PO7gA3DeeE1a9bS09M75HLuAKNn9eoHR3yb9t/oaKLvwP4bLe5745f73vjWSf91d3cNOrDU5OHOq4C9I2JuRGwB7Adc2deYmbcBD0fE7vWsdwJXAGuB4yLiFfX8I4FLGqxTkiSpOI2FtMy8E1gMrABWAssy87qIuDwiFtaLLQLOiIhfAzOBz2XmRuBtwBcj4lfALsBxTdUpSZJUokbvk5aZy4BlbfP2bfn5RuDl/ax3DfCyJmuTJEkqmU8
ckCRJKpAhTZIkqUCGNEmSpAIZ0iRJkgpkSJMkSSqQIU2SJKlAhjRJkqQCGdIkSZIKZEiTJEkqkCFNkiSpQIY0SZKkAhnSJEmSCmRIkyRJKpAhTZIkqUCGNEmSpAIZ0iRJkgpkSJMkSSqQIU2SJKlAhjRJkqQCGdIkSZIKZEiTJEkqkCFNkiSpQIY0SZKkAhnSJEmSCmRIkyRJKpAhTZIkqUCGNEmSpAIZ0iRJkgpkSJMkSSqQIU2SJKlAhjRJkqQCGdIkSZIKZEiTJEkqkCFNkiSpQIY0SZKkAhnSJEmSCmRIkyRJKpAhTZIkqUCGNEmSpAIZ0iRJkgpkSJMkSSqQIU2SJKlAhjRJkqQCGdIkSZIKZEiTJEkqkCFNkiSpQIY0SZKkAhnSJEmSCmRIkyRJKpAhTZIkqUBTm9x4RBwInABMA87MzLPb2hcA5wKzgB8B78nMDRGxI7AU2BZIYFFmrm2yVkmSpJI0NpIWEdsDpwJ7AAuAwyNip7bFlgJHZubzgC7g3fX8c4BzMvP5wA3AiU3VKUmSVKImD3fuAyzPzHszcx1wMbB/X2NEPAvYPDN/Ws9aAhwQEdOAV9XLPza/wTolSZKK0+ThzvnAqpbpVcDLh2jfAdgGeCAzN7TN78QUgO7uro6L3Gbrp3W8rDbdcPpkOKbPmtPIdvW4pvoOYJuZT29s26o01X+bb+O+17Qm973ZW23R2LZV6aT/WpaZ0l97kyGtG+htme4Cejpob59P23qDmQew9TCC1+c+8paOl9WmmzNnZiPbffF7Pt3IdvW4pvoO4LMHfLyxbavSVP+95vTPNrJdPa7Jfe+ID+/b2LZVGWb/zQNubZ/ZZEi7A9izZXo74K629nn9tN8DzI6IKZm5sV6mdb3BXF+/5ipg4ybWLUmSNBqmUOWc6/trbDKkXQWcFBFzgXXAfsDhfY2ZeVtEPBwRu2fmT4B3Aldk5qMRcQ3wdmAZcBBwRYev+Qjw45F8E5IkSQ160ghan8YuHMjMO4HFwApgJbAsM6+LiMsjYmG92CLgjIj4NTAT+Fw9/31UV4PeQjUydkJTdUqSJJWoq7e3/fQvSZIkjTWfOCBJklQgQ5okSVKBDGmSJEkFMqRJkiQVyJAmSZJUoCbvkzahRcQs4J+AVwMbgPuAY4B7gd8Dr8/M77cs/wdgr3pywPbM/EPb68wELgD+iuoGvR/OzKsGqGk3qofab0N1g7wfAcdk5kNP4a1OOBHxbOA3wC31rG5gFtXnfD4j33//Cjyf6qkap2bmvw9Q13OATwO7UP1OrQb+ob6PYN9r/BlYX6+yFXADcHBmruuvhoi4GjgpM68e7DMZT0Z53+u07yb0vlfwPjPszz0iDgFOB/5Yb38G8FXgHzNzY93+WeCFmXl3y/u/OjOfPVR7Pb0N8Kn6/T8KPES1H367br+a6nGHa+uyZgG/AxZl5t397bcRsaR+jSUDvbemlNj/A30eETGDqn9fTfW0ov+m+p24PiLOBnYHpgN/2fJ+zqJ60tH5wIGZ+dWW7X0QOAN4Tnuto8GRtE0QEd3A5VT/KCzIzAXAyVQ33Z1DtVN+OSK2HGATQ7W3Ogb4r8zcGXgHcOEANe0MXAJ8NDNfAiyg+gX/l47f2ORyV2YuqP/bGfgfwLHA5oxs/x0P/LF+jb2B0yPiGe0LRcQcqhsxfzcz/yIzA/gw8PW25fftqxt4HrAj1Q2fJ4VR3vc67bvJsu+Vts88lc/92/X7eAnVl6I9gZNa2rcEvjTI+gO21yFhBVUIjMx8AdXf7s/XNfc5rGVf/kvgAeDoDmofK0X1/yA+SJVtXlxv4yPAtyNiWmYeUX/e+7a9n/Prde8A9m/b3t9SBb0xYUjbNK+h+sfx430Pgs/MFcChVN/m7gK+D/zzAOsP1f6YzPwEj9/M9zlUowb9+TBwXmb+Z73eBuAfgG8CRMQzIuI7EXFTRPw8Iv66nn9SRHw5Iq6OiN9HxOJ6/s8jYpf65ykRcUdEbDtUvePYPKo/8HMYwf4Dfkh9k+bMvIcqXGzXz3J/D/wkM8/rm5GZ/0EV0gd6GO1WwOx6m5PFqO17dN53k3XfG+t9ZqjPfUk96kU93e9NQTNzHfBR4L0R0fe0628AfxURBw5Q42Dt+wEPZebJ9aMNycwE3svAR6+eRjUaOJ725bHu/4FsRzVSNq3exk94/O9DJ6+9MCKeBhAROwIPAvcP4/VHlIc7N81LgZWZ+YQHv2fm5fWwMFT/uN4cEa9rHeJtMVR763Y3RMR3qb5VHD7AYi+l+lbZut4DVH9MAD4PLM/M0yPiucCPI+KlddvOVN8ktwJurYeEv0L17e9nwGuBG+sdZqKYHxErgc2o/jheD7yV6psUjFD/tQ3vv43q0Mov+1l0N+B7/az/1bZZl0fEBuAZwO3AF4CLBnr9CWjU9r1h9N1k2fdK22eG+tyH4xdUYWNuPb0eOAT4TkT8oJ/lB2vfjeqw6xNk5uVts86NiHXAtlRB5N+pDquVqrT+H8hZwGXA6vqw8Q+ACzLz4Q7W3QB8l2qk7etUj6e8CPjEMF5/RDmStml6gEE7vP5j8W4GGOIdqr2f5d8A/AVwSkS8YBNqei1wXr2t3wH/CbyibluRmetbvrXMpjpHY7/6m+U7gKVD1TjO3FUPe+9E9Y9iN9U3PWDk+y8iDqD647F/3whQP3pblr8wIlZGxG8j4tiWZfath/DfR/UPytczs2+9JwSXWtcA88erUd/3Oui7ybLvlbbPDPm7MAx9+9Bj57Jl5g1U50b1e1hziPbWfflT9b6cEXFWyzKH1Ydb9wOeDlySmX3nm5a4L5fW//2qzxt7EfA6qn3tIGBlRGzV4SYu4vFDnm+hHpkdK4a0TXMD8LKWoXEAIuKTVIdjAMjM7zHIEG9/7RFxcr1Dr4yIN0XEqyNiXr38bcC1wAsj4tyW5RbWNS1s3X5EzIqISyNiOk/u6y4eH0lt/UPXC3Rl5v8DkurEz32Abw35qYxD9YjMh4Htqc6vaG17yv1Xz3t/vczrM/PGel57/11PdUJr37YPqv8gLqV6rm37a38DuJLqH4k+91GNyLTaloEPkY9Ho7bv1fM66btJte8VtM8M9bn3Un3WRMS0Id7WzsAdmflg2/yTqC7aGuiwZ3/t7fvy8fW+/E9UIfwJMvNaqsN7yyKi7/ei2H25oP7vV/23YH5mXpeZn8zMhVSHW1/X4VtcAewaES8C/pSZY3aoEzzcuamuAe4BPh4Rp2R1RdAbqI57v6Vt2WOAmxn4mPoT2jPzY8DH+hoj4jSqExePqsParsCxmXlx60Yi4gzg+xFxZVYPsp9G9Ut+f2auj4jlwP+mOgnzuVR/RN5L9cdpIF+pt7EiM/88+EcyftWHk4+lGt5uP0zxVPvvLcCHgN0z8/aW1zysdSNRXen086jOobkgM3ujOln2lVQXFPTnROC3EfE3mXkZ1bD+uyLiqHr9V1MFvF8N/gmMK6O573Xad5Nu3ytknxnqc/8T8MJ68fbfjdbtzAZOAc7u532ur/fJH1Jdbd1J+0XAMVGdY3haZj5av8ZrqA6n9e
d0qvNS/76u4wfAQRHxnfqzDqoLHP5joPcxmgrp/4HK2x44MSI+UPfPdlQB9+YO39vGiPg+1QUoX+hknSY5krYJ6sNLb6I6/PiLiLiJ6oTVfYG725btG+KdPsC2Bm2n+uMxLyJuprqq7YP1iFr7dm4G/g44KyJuBG6i+pb+7nqRDwCvrbfzTaqh9lVDvNVLqL4llnK4pTGZeSXVH8BT2uY/1f77BNXVT5cO9i0wM/9EdbXU3lS/U7+m+sb5A+AzA7z2PVS37PhM/Q38FKqrzn4REb+op988nEMFpRvlfa/TvpuU+14B+8xQn/uXgL3q35HdgdbP/E31dv8v1Zega6n2pf7qvQE4c4Ban9SemY9QBbL5VIfZfkl12O0OBrh6s15nMXBSHej+heq2FTfWvzdfobo9x58GqmO0jXX/174UEWtb/tsTOJIq2/ym/uyvoLqV0a+H8fYuorpa+NvDWKcRXb29/V7wIkmSpDHkSJokSVKBDGmSJEkFMqRJkiQVyJAmSZJUIEOaJElSgQxpkiaNiHh2RPRGxA/7aVtSt20zjO19J1qeDznAMnvVt0SRpGExpEmabB4GIiKe1Tcjqgcq7z7wKpI0+nzigKTJZiPwNWAR8Ml63t9SPX7pGICIOJzqJrQbqW6Se2Rm/iYi5gMXUN2o9DaqO5lTr/MCqmcNzgGmAJ/LzNbHdknSsDiSJmkyuhB4Z8v0wcCS+ufXAscBr6kfgL0M+GZUzws9G/hpZr6QKsQ9H6B+4sPFwPGZuQvwauDYiNhtFN6LpAnKkCZp0snMnwEbI2KXiHgmsGVm9p039tfA1zJzdb3sEqrnAT6b6oHnS+r5vwWW1+s8j+pRVf8aESupnuW4OfDS0Xg/kiYmD3dKmqy+QvXsx9X1z3166/9adQHT6vldLfP7nos6herB3gv6GiLiGcD9gKNpkjaJI2mSJqulwAHA26kOafa5EvhfETEXICIOBdYAv63bDq/n70j1IG2ABB6KiL+r254J/ALYpfm3IWmiMqRJmpQy807gV8B/Zea9LU0rgDOA5RHxS6rz1f5nZvYARwA7RcSvgPOAlfW21gNvBg6LiJuA7wEnZuZPRu0NSZpwunp720f1JUmSNNYcSZMkSSqQIU2SJKlAhjRJkqQCGdIkSZIKZEiTJEkqkCFNkiSpQIY0SZKkAhnSJEmSCvT/AfuB13IOpjRJAAAAAElFTkSuQmCC\n", 694 | "text/plain": [ 695 | "
" 696 | ] 697 | }, 698 | "metadata": { 699 | "needs_background": "light" 700 | }, 701 | "output_type": "display_data" 702 | } 703 | ], 704 | "source": [ 705 | "plt.figure(figsize=(10,6))\n", 706 | "ax = sns.barplot(x='Model', y= 'val_loss',data=df_results)" 707 | ] 708 | }, 709 | { 710 | "cell_type": "markdown", 711 | "metadata": {}, 712 | "source": [ 713 | "# Future Advices:\n", 714 | "\n", 715 | "- More layers can be added.\n", 716 | "- Models can be run for more epoches.\n" 717 | ] 718 | }, 719 | { 720 | "cell_type": "code", 721 | "execution_count": null, 722 | "metadata": {}, 723 | "outputs": [], 724 | "source": [] 725 | } 726 | ], 727 | "metadata": { 728 | "accelerator": "GPU", 729 | "colab": { 730 | "name": "Keras_LSTM_GRU.ipynb", 731 | "provenance": [] 732 | }, 733 | "kernelspec": { 734 | "display_name": "Python 3", 735 | "language": "python", 736 | "name": "python3" 737 | }, 738 | "language_info": { 739 | "codemirror_mode": { 740 | "name": "ipython", 741 | "version": 3 742 | }, 743 | "file_extension": ".py", 744 | "mimetype": "text/x-python", 745 | "name": "python", 746 | "nbconvert_exporter": "python", 747 | "pygments_lexer": "ipython3", 748 | "version": "3.6.9" 749 | } 750 | }, 751 | "nbformat": 4, 752 | "nbformat_minor": 1 753 | } 754 | -------------------------------------------------------------------------------- /notebooks/4_Torch_Models.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Torch Models" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Torch is one of the user-friendly libraries in python and also FastText function is used for text classification. \n", 15 | "\n", 16 | "## Aim of This Notebook\n", 17 | "\n", 18 | "My aim in this notebook is to find a baseline with torch model, then getting better results with changing some parameters.\n", 19 | "\n", 20 | "Preparing data for modeling steps and all details about modeling steps can be found in this notebook. Also, future improvement plans were added to the end of the notebook. \n", 21 | "\n", 22 | "### Metric : \n", 23 | "\n", 24 | "As metric, I prefer to look at loss values for train and validation data. But, I also see the accuracy to interpret my results in more smarter way.\n", 25 | "\n", 26 | "### Best Results of This Notebook:\n", 27 | "\n", 28 | "91.66% accuracy with 0.380 loss for train, 91.68% accucary with 0.305 loss for validation obtained. Detailed parameters and values can be found in this notebook. 
" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "# Importing Libraries" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 1, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "# dataframe and series \n", 45 | "import pandas as pd\n", 46 | "import numpy as np\n", 47 | "\n", 48 | "# sklearn imports for modeling part\n", 49 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 50 | "from sklearn.linear_model import LogisticRegression\n", 51 | "from sklearn.metrics import accuracy_score,balanced_accuracy_score\n", 52 | "from sklearn.model_selection import train_test_split\n", 53 | "\n", 54 | "from mlxtend.evaluate import confusion_matrix\n", 55 | "from mlxtend.plotting import plot_confusion_matrix\n", 56 | "from mlxtend.plotting import plot_decision_regions\n", 57 | "\n", 58 | "from sklearn.metrics import confusion_matrix\n", 59 | "\n", 60 | "# To plot\n", 61 | "import matplotlib.pyplot as plt \n", 62 | "%matplotlib inline \n", 63 | "import matplotlib as mpl\n", 64 | "import seaborn as sns\n", 65 | "\n", 66 | "from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator\n", 67 | "from PIL import Image\n", 68 | "\n", 69 | "# torch model\n", 70 | "import torch\n", 71 | "from torchtext import data\n", 72 | "from torchtext import datasets\n", 73 | "import random\n", 74 | "\n", 75 | "import torch.nn as nn\n", 76 | "import torch.nn.functional as F\n", 77 | "import torch.optim as optim\n", 78 | "import time\n", 79 | "\n", 80 | "import random\n", 81 | "import os\n", 82 | "\n", 83 | "import spacy" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": 2, 89 | "metadata": {}, 90 | "outputs": [ 91 | { 92 | "name": "stderr", 93 | "output_type": "stream", 94 | "text": [ 95 | "I0520 13:01:43.983011 4716817856 file_utils.py:39] PyTorch version 1.4.0 available.\n", 96 | "/Users/ezgi/opt/anaconda3/envs/learn-env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 97 | " _np_qint8 = np.dtype([(\"qint8\", np.int8, 1)])\n", 98 | "/Users/ezgi/opt/anaconda3/envs/learn-env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 99 | " _np_quint8 = np.dtype([(\"quint8\", np.uint8, 1)])\n", 100 | "/Users/ezgi/opt/anaconda3/envs/learn-env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 101 | " _np_qint16 = np.dtype([(\"qint16\", np.int16, 1)])\n", 102 | "/Users/ezgi/opt/anaconda3/envs/learn-env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 103 | " _np_quint16 = np.dtype([(\"quint16\", np.uint16, 1)])\n", 104 | "/Users/ezgi/opt/anaconda3/envs/learn-env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:521: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / 
'(1,)type'.\n", 105 | " _np_qint32 = np.dtype([(\"qint32\", np.int32, 1)])\n", 106 | "/Users/ezgi/opt/anaconda3/envs/learn-env/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 107 | " np_resource = np.dtype([(\"resource\", np.ubyte, 1)])\n" 108 | ] 109 | } 110 | ], 111 | "source": [ 112 | "#for text augmentation\n", 113 | "import nlpaug.augmenter.char as nac\n", 114 | "import nlpaug.augmenter.word as naw\n", 115 | "import nlpaug.augmenter.sentence as nas\n", 116 | "import nlpaug.flow as nafc\n", 117 | "\n", 118 | "from nlpaug.util import Action" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 4, 124 | "metadata": {}, 125 | "outputs": [], 126 | "source": [ 127 | "nlp = spacy.load('en')\n", 128 | "# import en_core_web_sm\n", 129 | "# nlp = en_core_web_sm.load()" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 5, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "df = pd.read_csv('cleaned_data.csv') # taking data" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 6, 144 | "metadata": {}, 145 | "outputs": [ 146 | { 147 | "data": { 148 | "text/html": [ 149 | "
\n", 150 | "\n", 163 | "\n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | "
overallverifiedreviewTimereviewerIDasinstylereviewTextsummarytitledaymonthyearsentimentreview_clean
04True2014-07-03A2LSKD2H9U8N0JB000FA5KK0{'Format:': ' Kindle Edition'}pretty good story, a little exaggerated, but i...pretty good storyNaN3720142pretty good story a little exaggerated but i l...
15True2014-05-26A2QP13XTJND1QSB000FA5KK0{'Format:': ' Kindle Edition'}if you've read other max brand westerns, you k...A very good bookNaN26520142if youve read other max brand westerns you kno...
25True2016-09-16A8WQ7MAG3HFOZB000FA5KK0{'Format:': ' Kindle Edition'}love max, always a fun twistFive StarsNaN16920162love max always a fun twist
35True2016-03-03A1E0MODSRYP7OB000FA5KK0{'Format:': ' Kindle Edition'}as usual for him, a good booka goodNaN3320162as usual for him a good book
45True2015-09-10AYUTCGVSM1H7TB000FA5KK0{'Format:': ' Kindle Edition'}mb is one of the original western writers and ...A WesternNaN10920152mb is one of the original western writers and ...
\n", 271 | "
" 272 | ], 273 | "text/plain": [ 274 | " overall verified reviewTime reviewerID asin \\\n", 275 | "0 4 True 2014-07-03 A2LSKD2H9U8N0J B000FA5KK0 \n", 276 | "1 5 True 2014-05-26 A2QP13XTJND1QS B000FA5KK0 \n", 277 | "2 5 True 2016-09-16 A8WQ7MAG3HFOZ B000FA5KK0 \n", 278 | "3 5 True 2016-03-03 A1E0MODSRYP7O B000FA5KK0 \n", 279 | "4 5 True 2015-09-10 AYUTCGVSM1H7T B000FA5KK0 \n", 280 | "\n", 281 | " style \\\n", 282 | "0 {'Format:': ' Kindle Edition'} \n", 283 | "1 {'Format:': ' Kindle Edition'} \n", 284 | "2 {'Format:': ' Kindle Edition'} \n", 285 | "3 {'Format:': ' Kindle Edition'} \n", 286 | "4 {'Format:': ' Kindle Edition'} \n", 287 | "\n", 288 | " reviewText summary title \\\n", 289 | "0 pretty good story, a little exaggerated, but i... pretty good story NaN \n", 290 | "1 if you've read other max brand westerns, you k... A very good book NaN \n", 291 | "2 love max, always a fun twist Five Stars NaN \n", 292 | "3 as usual for him, a good book a good NaN \n", 293 | "4 mb is one of the original western writers and ... A Western NaN \n", 294 | "\n", 295 | " day month year sentiment \\\n", 296 | "0 3 7 2014 2 \n", 297 | "1 26 5 2014 2 \n", 298 | "2 16 9 2016 2 \n", 299 | "3 3 3 2016 2 \n", 300 | "4 10 9 2015 2 \n", 301 | "\n", 302 | " review_clean \n", 303 | "0 pretty good story a little exaggerated but i l... \n", 304 | "1 if youve read other max brand westerns you kno... \n", 305 | "2 love max always a fun twist \n", 306 | "3 as usual for him a good book \n", 307 | "4 mb is one of the original western writers and ... " 308 | ] 309 | }, 310 | "execution_count": 6, 311 | "metadata": {}, 312 | "output_type": "execute_result" 313 | } 314 | ], 315 | "source": [ 316 | "df.head()" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "# Changing Ternary Class to Binary Class" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [ 330 | "My aim is to compare products and determine less seller products with giving importance to negative reviewed books to take action. So, to focus on less ratings, I will divide my target to two-class as positive and negative where 1 and 2 rating values counted as negative and others are positive." 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": 6, 336 | "metadata": {}, 337 | "outputs": [], 338 | "source": [ 339 | "def calc_two_sentiment(overall):\n", 340 | " '''This function encodes the rating 1 and 2 as 0, others as 1'''\n", 341 | " if overall >= 3:\n", 342 | " return 1\n", 343 | " else:\n", 344 | " return 0" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": 7, 350 | "metadata": {}, 351 | "outputs": [], 352 | "source": [ 353 | "df['sentiment'] = df['overall'].apply(calc_two_sentiment) #applying function" 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": 8, 359 | "metadata": { 360 | "scrolled": true 361 | }, 362 | "outputs": [ 363 | { 364 | "data": { 365 | "text/plain": [ 366 | "1 2031419\n", 367 | "0 109546\n", 368 | "Name: sentiment, dtype: int64" 369 | ] 370 | }, 371 | "execution_count": 8, 372 | "metadata": {}, 373 | "output_type": "execute_result" 374 | } 375 | ], 376 | "source": [ 377 | "df['sentiment'].value_counts()" 378 | ] 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "metadata": {}, 383 | "source": [ 384 | "# Taking Sample Data" 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": {}, 390 | "source": [ 391 | "My data is big enough for runing usual computers. 
I can use cloud but it also takes hours for each running. Even for 100000 rows, some models took more than 3 hours. If I try models for whole set, I have to wait more than half day for each change in model. So, I prefer to choose sample, find the best model with it and apply this model to whole dataset. For deep learning models, I prefer to take my sample unbalanced data as first 100000 row of the clean data. Because, deep learning model can handle unbalanced classes better than linear or gradient machine learning models. I would like to see how the model will do with unbalanced version." 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": 9, 397 | "metadata": {}, 398 | "outputs": [], 399 | "source": [ 400 | "df_torch = df.head(100000)" 401 | ] 402 | }, 403 | { 404 | "cell_type": "markdown", 405 | "metadata": {}, 406 | "source": [ 407 | "To write codes easily and keep less data in memory, I will just only choose the columns which I need for modeling." 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": 10, 413 | "metadata": {}, 414 | "outputs": [], 415 | "source": [ 416 | "df_torch= df_torch.loc[:, ['review_clean', 'sentiment']]" 417 | ] 418 | }, 419 | { 420 | "cell_type": "markdown", 421 | "metadata": {}, 422 | "source": [ 423 | "To use more easily I will divide train-test splits and write them csv files. Normally, I will split validation data from train and I will use this as unseen data to compare how the model does. But, firstly I will split my data to keep test set in computer. If i need to check small size data or same test future, I would like to keep test data also." 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": 11, 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "train_data, test_data = train_test_split(df_torch, test_size=0.2,random_state = 42)" 433 | ] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "execution_count": 12, 438 | "metadata": {}, 439 | "outputs": [], 440 | "source": [ 441 | "train_data.to_csv('train.csv', index = False) # my main df" 442 | ] 443 | }, 444 | { 445 | "cell_type": "code", 446 | "execution_count": 13, 447 | "metadata": {}, 448 | "outputs": [], 449 | "source": [ 450 | "test_data.to_csv('test.csv', index = False) # not for using now, for keeping just in case as small data example" 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 14, 456 | "metadata": {}, 457 | "outputs": [ 458 | { 459 | "name": "stdout", 460 | "output_type": "stream", 461 | "text": [ 462 | "Number of training examples: 80000\n", 463 | "Number of testing examples: 20000\n" 464 | ] 465 | } 466 | ], 467 | "source": [ 468 | "print(f'Number of training examples: {len(train_data)}')\n", 469 | "print(f'Number of testing examples: {len(test_data)}')" 470 | ] 471 | }, 472 | { 473 | "cell_type": "markdown", 474 | "metadata": {}, 475 | "source": [ 476 | "## Preparing Data to Torch Model" 477 | ] 478 | }, 479 | { 480 | "cell_type": "markdown", 481 | "metadata": {}, 482 | "source": [ 483 | "These are two pages which gave me idea to get baseline and how to use FastText class. The code of the FastText class also is taken from these two people. 
I combined their work and changed according to my data and tried different parameters for models.\n", 484 | "\n", 485 | "https://github.com/bentrevett/pytorch-sentiment-analysis\n", 486 | "\n", 487 | "https://www.kaggle.com/lalwaniabhishek/abhishek-lalwani-bits-twitter-text" 488 | ] 489 | }, 490 | { 491 | "cell_type": "markdown", 492 | "metadata": {}, 493 | "source": [ 494 | "One of the important concepts of TorchText is the Field function, which defines how the data should be processed. \n", 495 | "\n", 496 | "I will use TEXT to define how the reviews will be processed and use TARGET field to process the target. As a preprocessing technique, I will use bi-grams. It creates a set of co-occuring words." 497 | ] 498 | }, 499 | { 500 | "cell_type": "code", 501 | "execution_count": 19, 502 | "metadata": {}, 503 | "outputs": [], 504 | "source": [ 505 | "def generate_bigrams(text):\n", 506 | " '''creating set of co-occuring words'''\n", 507 | " bi_grams = set(zip(*[text[i:] for i in range(2)]))\n", 508 | " for bi_gram in bi_grams:\n", 509 | " text.append(' '.join(bi_gram))\n", 510 | " return text" 511 | ] 512 | }, 513 | { 514 | "cell_type": "code", 515 | "execution_count": 20, 516 | "metadata": {}, 517 | "outputs": [ 518 | { 519 | "data": { 520 | "text/plain": [ 521 | "['I', 'love', 'this', 'book', 'I love', 'this book', 'love this']" 522 | ] 523 | }, 524 | "execution_count": 20, 525 | "metadata": {}, 526 | "output_type": "execute_result" 527 | } 528 | ], 529 | "source": [ 530 | "# To check bi-gram function is working proporly or not\n", 531 | "generate_bigrams(['I', 'love', 'this', 'book'])" 532 | ] 533 | }, 534 | { 535 | "cell_type": "markdown", 536 | "metadata": {}, 537 | "source": [ 538 | "My bi-gram function is working properly, I can see two-words couples." 539 | ] 540 | }, 541 | { 542 | "cell_type": "markdown", 543 | "metadata": {}, 544 | "source": [ 545 | "I will define my model to preprocess with bi-grams, SpaCy tokenizer and LabelField to handle the target." 546 | ] 547 | }, 548 | { 549 | "cell_type": "code", 550 | "execution_count": 21, 551 | "metadata": {}, 552 | "outputs": [], 553 | "source": [ 554 | "SEED = 1234\n", 555 | "\n", 556 | "torch.manual_seed(SEED)\n", 557 | "torch.backends.cudnn.deterministic = True\n", 558 | "\n", 559 | "TEXT = data.Field(tokenize = 'spacy', preprocessing = generate_bigrams)\n", 560 | "TARGET = data.LabelField(dtype = torch.float)" 561 | ] 562 | }, 563 | { 564 | "cell_type": "code", 565 | "execution_count": 22, 566 | "metadata": {}, 567 | "outputs": [], 568 | "source": [ 569 | "fields_train = [('review_clean', TEXT),('sentiment', TARGET)]" 570 | ] 571 | }, 572 | { 573 | "cell_type": "markdown", 574 | "metadata": {}, 575 | "source": [ 576 | "With using TabularDataset, we will take our train, test splits easily each time and preprocessed with bi-grams. 
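The held-out test.csv written earlier can be materialised in exactly the same way when it is finally needed. A minimal sketch, reusing `fields_train` from above together with the `BATCH_SIZE` and `device` values defined further down in this notebook; this is what would supply the `test_iterator` referenced in the commented-out evaluation near the end:

```python
# Sketch only: load the held-out test split with the same fields as the training data,
# then wrap it in an iterator so it can be passed to evaluate() at the end of the notebook.
test_dataset = data.TabularDataset(path='test.csv',
                                   format='csv',
                                   fields=fields_train,
                                   skip_header=True)

test_iterator = data.Iterator(dataset=test_dataset,
                              batch_size=BATCH_SIZE,
                              device=device,
                              shuffle=None,
                              train=False,
                              sort_key=lambda x: len(x.review_clean),
                              sort_within_batch=False)
```

The training split itself is loaded in the next cell.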
" 577 | ] 578 | }, 579 | { 580 | "cell_type": "code", 581 | "execution_count": 23, 582 | "metadata": {}, 583 | "outputs": [], 584 | "source": [ 585 | "# Taking training data from train.csv\n", 586 | "train_data = data.TabularDataset(path = 'train.csv',\n", 587 | " format = 'csv',\n", 588 | " fields = fields_train,\n", 589 | " skip_header = True)" 590 | ] 591 | }, 592 | { 593 | "cell_type": "code", 594 | "execution_count": 18, 595 | "metadata": {}, 596 | "outputs": [], 597 | "source": [ 598 | "# # To check the first elements in train\n", 599 | "# print(vars(train_data[0]))" 600 | ] 601 | }, 602 | { 603 | "cell_type": "markdown", 604 | "metadata": {}, 605 | "source": [ 606 | "Now, I want to split a validation data from my train data, to make sure my model is doing good. I will use default for split sizes and define my random seed to get same data each time." 607 | ] 608 | }, 609 | { 610 | "cell_type": "markdown", 611 | "metadata": {}, 612 | "source": [ 613 | "### Building Validation Set " 614 | ] 615 | }, 616 | { 617 | "cell_type": "code", 618 | "execution_count": 25, 619 | "metadata": {}, 620 | "outputs": [], 621 | "source": [ 622 | "# Creating validation set from train data\n", 623 | "\n", 624 | "train_data, valid_data = train_data.split(random_state = random.seed(SEED))" 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": 21, 630 | "metadata": {}, 631 | "outputs": [ 632 | { 633 | "name": "stdout", 634 | "output_type": "stream", 635 | "text": [ 636 | "Number of training examples: 56000\n", 637 | "Number of validation examples: 24000\n", 638 | "Number of testing examples: 20000\n" 639 | ] 640 | } 641 | ], 642 | "source": [ 643 | "print(f'Number of training examples: {len(train_data)}')\n", 644 | "print(f'Number of validation examples: {len(valid_data)}')\n", 645 | "print(f'Number of testing examples: {len(test_data)}')" 646 | ] 647 | }, 648 | { 649 | "cell_type": "markdown", 650 | "metadata": {}, 651 | "source": [ 652 | "Now, I need to build a vocabulary. There are lots of words so I will define maximum top words sizes. Then, I will load the pre-trained word embeddings." 653 | ] 654 | }, 655 | { 656 | "cell_type": "markdown", 657 | "metadata": {}, 658 | "source": [ 659 | "### Building Vocabulary with Pre-Trained Embeddings" 660 | ] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": 26, 665 | "metadata": {}, 666 | "outputs": [ 667 | { 668 | "name": "stderr", 669 | "output_type": "stream", 670 | "text": [ 671 | "I0519 16:21:52.618643 4541910464 vocab.py:431] Loading vectors from .vector_cache/glove.6B.100d.txt.pt\n" 672 | ] 673 | } 674 | ], 675 | "source": [ 676 | "MAX_VOCAB_SIZE = 25_000\n", 677 | "\n", 678 | "TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE, \n", 679 | " vectors = \"glove.6B.100d\", \n", 680 | " unk_init = torch.Tensor.normal_)\n", 681 | "\n", 682 | "TARGET.build_vocab(train_data)" 683 | ] 684 | }, 685 | { 686 | "cell_type": "markdown", 687 | "metadata": {}, 688 | "source": [ 689 | "I only build vocabulary on train set. Because, in machine learning models test set must not be seen before to test it well. So I do not add validation set, because I want it to reflect the test set as much as possible." 
690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "execution_count": 23, 695 | "metadata": {}, 696 | "outputs": [ 697 | { 698 | "name": "stdout", 699 | "output_type": "stream", 700 | "text": [ 701 | "# of unique tokens in TEXT vocabulary: 25002\n", 702 | "# of unique tokens in TARGET vocabulary: 2\n" 703 | ] 704 | } 705 | ], 706 | "source": [ 707 | "print(f\"# of unique tokens in TEXT vocabulary: {len(TEXT.vocab)}\")\n", 708 | "print(f\"# of unique tokens in TARGET vocabulary: {len(TARGET.vocab)}\")" 709 | ] 710 | }, 711 | { 712 | "cell_type": "markdown", 713 | "metadata": {}, 714 | "source": [ 715 | "I chose my max vocabulary size 25000, it means there is two additional tokens like <...> default. Because all sentences in the batches must be at same size. To make each sentence equal in the batch, it padded longer or shorter batches." 716 | ] 717 | }, 718 | { 719 | "cell_type": "code", 720 | "execution_count": 24, 721 | "metadata": {}, 722 | "outputs": [ 723 | { 724 | "name": "stdout", 725 | "output_type": "stream", 726 | "text": [ 727 | "[('the', 237680), ('and', 149130), ('a', 139606), ('i', 135782), ('to', 130757), ('of', 98507), (' ', 88735), ('is', 78287), ('this', 76680), ('it', 73911), ('in', 65307), ('was', 60753), ('that', 56410), ('book', 51555), ('for', 43846), ('story', 39234), ('but', 37773), ('her', 37674), ('with', 37553), ('read', 35264), ('you', 34167), ('nt', 33702), ('\\n\\n', 32400), ('she', 28707), ('not', 28230)]\n" 728 | ] 729 | } 730 | ], 731 | "source": [ 732 | "print(TEXT.vocab.freqs.most_common(25)) # to see most common words in the vocabulary with their frequencies" 733 | ] 734 | }, 735 | { 736 | "cell_type": "markdown", 737 | "metadata": {}, 738 | "source": [ 739 | "### Setting Iterators" 740 | ] 741 | }, 742 | { 743 | "cell_type": "markdown", 744 | "metadata": {}, 745 | "source": [ 746 | "Now, I create my vocabulary using pre-trained embeddings. The final step of preparing data to Torch model is creating iterators. I will iterate train and evaluation loop and get a batch of examples which indexed and converted into tensors for each iteration. I will use Iterator function of torch. Also, I need to keep the tensors which returned by iterators in GPU so I will use torch.device function." 747 | ] 748 | }, 749 | { 750 | "cell_type": "code", 751 | "execution_count": 27, 752 | "metadata": {}, 753 | "outputs": [], 754 | "source": [ 755 | "# To set batch size and iterators for train and validation data \n", 756 | "\n", 757 | "BATCH_SIZE = 64\n", 758 | "\n", 759 | "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", 760 | "\n", 761 | "train_iterator = data.Iterator(dataset = train_data, batch_size = BATCH_SIZE,device = device, \n", 762 | " shuffle = None, train = True, sort_key = lambda x: len(x.review_clean), \n", 763 | " sort_within_batch = False)\n", 764 | "valid_iterator = data.Iterator(dataset = valid_data, batch_size = BATCH_SIZE,device = device, \n", 765 | " shuffle = None, train = False, sort_key = lambda x: len(x.review_clean), \n", 766 | " sort_within_batch = False)" 767 | ] 768 | }, 769 | { 770 | "cell_type": "markdown", 771 | "metadata": {}, 772 | "source": [ 773 | "## Building the Model" 774 | ] 775 | }, 776 | { 777 | "cell_type": "markdown", 778 | "metadata": {}, 779 | "source": [ 780 | "There are many ready classes to building a model. I prefer to use FastText class for baseline model, because gets comparable results significantly faster and using around half of the parameters. 
The details about this class can be found in [Bag of Tricks for Efficient Text Classification paper](https://arxiv.org/abs/1607.01759). " 781 | ] 782 | }, 783 | { 784 | "cell_type": "code", 785 | "execution_count": 48, 786 | "metadata": {}, 787 | "outputs": [], 788 | "source": [ 789 | "class FastText(nn.Module):\n", 790 | " def __init__(self, vocab_size, embedding_dim, output_dim, pad_idx):\n", 791 | " \n", 792 | " super().__init__()\n", 793 | " \n", 794 | " self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)\n", 795 | " \n", 796 | " self.fc = nn.Linear(embedding_dim, output_dim)\n", 797 | " \n", 798 | "# self.dropout = nn.Dropout(0.5) # for adding dropout\n", 799 | " \n", 800 | " def forward(self, text):\n", 801 | " \n", 802 | " \n", 803 | " embedded = self.embedding(text)\n", 804 | " \n", 805 | " embedded = embedded.permute(1, 0, 2)\n", 806 | " \n", 807 | " pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1) \n", 808 | " \n", 809 | " return self.fc(pooled)" 810 | ] 811 | }, 812 | { 813 | "cell_type": "markdown", 814 | "metadata": {}, 815 | "source": [ 816 | "This model only has 2 layers that have any parameters, the linear and the embedding layer. There in no RNN layer. It will calculate the word embedding by using embedding layer, and taking average of them feeds the linear layer. Now, I will create my FastText class with defining dimensions and tokens." 817 | ] 818 | }, 819 | { 820 | "cell_type": "code", 821 | "execution_count": 49, 822 | "metadata": {}, 823 | "outputs": [], 824 | "source": [ 825 | "INPUT_DIM = len(TEXT.vocab) #vocabulary size \n", 826 | "EMBEDDING_DIM = 100 # embedding dimension\n", 827 | "OUTPUT_DIM = 1 # our output has only 2 classes - 0/1. So, it is one-dimensional.\n", 828 | "PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] # string to integer method on padding tokens\n", 829 | "\n", 830 | "model = FastText(INPUT_DIM, EMBEDDING_DIM, OUTPUT_DIM, PAD_IDX)" 831 | ] 832 | }, 833 | { 834 | "cell_type": "markdown", 835 | "metadata": {}, 836 | "source": [ 837 | "To compare trainable parameters in different models, count parameters function will be used. " 838 | ] 839 | }, 840 | { 841 | "cell_type": "code", 842 | "execution_count": 50, 843 | "metadata": { 844 | "scrolled": true 845 | }, 846 | "outputs": [ 847 | { 848 | "name": "stdout", 849 | "output_type": "stream", 850 | "text": [ 851 | "The model has 2,500,301 trainable parameters.\n" 852 | ] 853 | } 854 | ], 855 | "source": [ 856 | "def count_parameters(model):\n", 857 | " return sum(p.numel() for p in model.parameters() if p.requires_grad)\n", 858 | "\n", 859 | "print(f'The model has {count_parameters(model):,} trainable parameters.')" 860 | ] 861 | }, 862 | { 863 | "cell_type": "markdown", 864 | "metadata": {}, 865 | "source": [ 866 | "Now I will copy pre-trained vectors to my embedding layers." 
867 | ] 868 | }, 869 | { 870 | "cell_type": "code", 871 | "execution_count": 51, 872 | "metadata": {}, 873 | "outputs": [ 874 | { 875 | "data": { 876 | "text/plain": [ 877 | "tensor([[-0.1117, -0.4966, 0.1631, ..., 1.2647, -0.2753, -0.1325],\n", 878 | " [-0.8555, -0.7208, 1.3755, ..., 0.0825, -1.1314, 0.3997],\n", 879 | " [-0.0382, -0.2449, 0.7281, ..., -0.1459, 0.8278, 0.2706],\n", 880 | " ...,\n", 881 | " [-0.6578, 0.9299, 0.0580, ..., -0.9173, 1.2022, 0.2694],\n", 882 | " [-0.3626, 0.1501, 1.4050, ..., 0.0213, 0.3717, -0.6314],\n", 883 | " [-1.3447, -1.4811, 0.7253, ..., -0.5115, -0.9313, -0.3301]])" 884 | ] 885 | }, 886 | "execution_count": 51, 887 | "metadata": {}, 888 | "output_type": "execute_result" 889 | } 890 | ], 891 | "source": [ 892 | "pretrained_embeddings = TEXT.vocab.vectors\n", 893 | "\n", 894 | "model.embedding.weight.data.copy_(pretrained_embeddings)" 895 | ] 896 | }, 897 | { 898 | "cell_type": "markdown", 899 | "metadata": {}, 900 | "source": [ 901 | "I must assign zero for initial weight for unknown and padding tokens. I have already defined padding token before as PAD_IDX. So, I will define unknows as UNK_IDX and set initials to zeros." 902 | ] 903 | }, 904 | { 905 | "cell_type": "code", 906 | "execution_count": 52, 907 | "metadata": {}, 908 | "outputs": [], 909 | "source": [ 910 | "UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]\n", 911 | "\n", 912 | "model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)\n", 913 | "model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)" 914 | ] 915 | }, 916 | { 917 | "cell_type": "markdown", 918 | "metadata": {}, 919 | "source": [ 920 | "## Training the Model" 921 | ] 922 | }, 923 | { 924 | "cell_type": "markdown", 925 | "metadata": {}, 926 | "source": [ 927 | "To train the model, firstly I will create optimizer and criterion. Optimizer updates parameters of module. I will use SGD and Adam as optimizer. SGD is a variant of gradient descent. It does not perform on whole dataset, it computes on a small subset or random selection. It performs good when the learning rate is low. Optimizer needs two parameters, one is optimizer type and second is learning rate. Adam optimizer is a technique which implementing adaptive learning rate. " 928 | ] 929 | }, 930 | { 931 | "cell_type": "markdown", 932 | "metadata": {}, 933 | "source": [ 934 | "I tried both optimizers one by one with uncommenting the cell below also with different learning rates." 935 | ] 936 | }, 937 | { 938 | "cell_type": "code", 939 | "execution_count": 60, 940 | "metadata": {}, 941 | "outputs": [], 942 | "source": [ 943 | "optimizer = optim.SGD(model.parameters(), lr=1e-4)\n", 944 | "# optimizer = optim.Adam(model.parameters(),lr=1e-4)" 945 | ] 946 | }, 947 | { 948 | "cell_type": "markdown", 949 | "metadata": {}, 950 | "source": [ 951 | "Now, I will define loss function. My target contains binary labels, so I will choose binary loss function as criterion.\n", 952 | "Cross-entropy loss is commonly used for classification porblems. Also, BCEWithLogitsLoss is contains one sigmoid layer and binary cross-entropy loss. So, I will use this one." 
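To make the point concrete that `BCEWithLogitsLoss` bundles a sigmoid layer with binary cross-entropy, here is a small self-contained check (not part of the original experiments) showing that it matches an explicit sigmoid followed by `BCELoss`:

```python
import torch
import torch.nn as nn

logits = torch.tensor([0.5, -1.2, 2.0])   # raw, unbounded model outputs
targets = torch.tensor([1.0, 0.0, 1.0])   # binary labels as floats

combined = nn.BCEWithLogitsLoss()(logits, targets)        # sigmoid + BCE in one call
manual = nn.BCELoss()(torch.sigmoid(logits), targets)     # explicit sigmoid, then BCE

print(combined.item(), manual.item())     # identical up to floating-point error
```

The fused version is also more numerically stable for large positive or negative logits, which is another reason to feed it raw logits rather than applying the sigmoid by hand.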
953 | ] 954 | }, 955 | { 956 | "cell_type": "code", 957 | "execution_count": 54, 958 | "metadata": {}, 959 | "outputs": [], 960 | "source": [ 961 | "criterion = nn.BCEWithLogitsLoss()\n", 962 | "\n", 963 | "# keeping model and criterion in GPU\n", 964 | "model = model.to(device)\n", 965 | "criterion = criterion.to(device)" 966 | ] 967 | }, 968 | { 969 | "cell_type": "markdown", 970 | "metadata": {}, 971 | "source": [ 972 | "The loss will be calculated by using criterion but I want to see accuracy to compare models. This function turn the values to 0-1 with rounding them in sigmoid layer. Then, it calculates the rounded predictions equal actual labels and take the mean of the batch." 973 | ] 974 | }, 975 | { 976 | "cell_type": "code", 977 | "execution_count": 55, 978 | "metadata": {}, 979 | "outputs": [], 980 | "source": [ 981 | "def binary_accuracy(pred, target):\n", 982 | " \n", 983 | " # rounding predictions to the closest integer\n", 984 | " rounded_pred = torch.round(torch.sigmoid(pred))\n", 985 | " true = (rounded_pred == target).float() # convert into float for taking mean \n", 986 | " accuracy = true.sum() / len(true)\n", 987 | " return accuracy" 988 | ] 989 | }, 990 | { 991 | "cell_type": "code", 992 | "execution_count": 56, 993 | "metadata": {}, 994 | "outputs": [], 995 | "source": [ 996 | "# setting the train method\n", 997 | "\n", 998 | "def train(model, iterator, optimizer, criterion):\n", 999 | " \n", 1000 | " epoch_loss = 0 # \n", 1001 | " epoch_acc = 0\n", 1002 | " \n", 1003 | " model.train()\n", 1004 | " \n", 1005 | " for batch in iterator: # for each batch\n", 1006 | " \n", 1007 | " optimizer.zero_grad() # zero gradient\n", 1008 | " # PyTorch does not automatically zero the gradients calculated from the last gradient calculation\n", 1009 | " \n", 1010 | " predictions = model(batch.review_clean).squeeze(1) # with feeding batch with reviews no need to .forward\n", 1011 | " \n", 1012 | " #squeeze for removing dimension in the list and taking only batch size \n", 1013 | " #bec. torch wants predictions input as batch size\n", 1014 | " \n", 1015 | " loss = criterion(predictions, batch.sentiment) # calculating loss\n", 1016 | " \n", 1017 | " acc = binary_accuracy(predictions, batch.sentiment) # calculating accuracy with taking mean\n", 1018 | " \n", 1019 | " loss.backward() #gradient of each parameter\n", 1020 | " \n", 1021 | " optimizer.step() #update the optimizer algorithm\n", 1022 | " \n", 1023 | " # loss and accuracy by epoches\n", 1024 | " epoch_loss += loss.item() \n", 1025 | " epoch_acc += acc.item()\n", 1026 | " \n", 1027 | " return epoch_loss / len(iterator), epoch_acc / len(iterator) # returning loss and acc avg across epoch" 1028 | ] 1029 | }, 1030 | { 1031 | "cell_type": "markdown", 1032 | "metadata": {}, 1033 | "source": [ 1034 | "I will do same function for evaluate validation part below." 
1035 | ] 1036 | }, 1037 | { 1038 | "cell_type": "code", 1039 | "execution_count": 57, 1040 | "metadata": {}, 1041 | "outputs": [], 1042 | "source": [ 1043 | "def evaluate(model, iterator, criterion):\n", 1044 | " '''Evaluating validation set'''\n", 1045 | " epoch_loss = 0\n", 1046 | " epoch_acc = 0\n", 1047 | " \n", 1048 | " model.eval()\n", 1049 | " \n", 1050 | " with torch.no_grad():\n", 1051 | " \n", 1052 | " for batch in iterator:\n", 1053 | "\n", 1054 | " predictions = model(batch.review_clean).squeeze(1)\n", 1055 | " \n", 1056 | " loss = criterion(predictions, batch.sentiment)\n", 1057 | " \n", 1058 | " acc = binary_accuracy(predictions, batch.sentiment)\n", 1059 | "\n", 1060 | " epoch_loss += loss.item()\n", 1061 | " epoch_acc += acc.item()\n", 1062 | " \n", 1063 | " return epoch_loss / len(iterator), epoch_acc / len(iterator)" 1064 | ] 1065 | }, 1066 | { 1067 | "cell_type": "markdown", 1068 | "metadata": {}, 1069 | "source": [ 1070 | "I also use a function which informs that how long each epoch takes." 1071 | ] 1072 | }, 1073 | { 1074 | "cell_type": "code", 1075 | "execution_count": 58, 1076 | "metadata": {}, 1077 | "outputs": [], 1078 | "source": [ 1079 | "import time\n", 1080 | "\n", 1081 | "def epoch_time(start_time, end_time):\n", 1082 | " elapsed_time = end_time - start_time\n", 1083 | " elapsed_mins = int(elapsed_time / 60)\n", 1084 | " elapsed_secs = int(elapsed_time - (elapsed_mins * 60))\n", 1085 | " return elapsed_mins, elapsed_secs" 1086 | ] 1087 | }, 1088 | { 1089 | "cell_type": "markdown", 1090 | "metadata": {}, 1091 | "source": [ 1092 | "# Training the Model for Baseline" 1093 | ] 1094 | }, 1095 | { 1096 | "cell_type": "markdown", 1097 | "metadata": {}, 1098 | "source": [ 1099 | "I tried for different epoch numbers and from result I prefer to choose 5, because it keeps the information mainly in first 5 epoches." 1100 | ] 1101 | }, 1102 | { 1103 | "cell_type": "markdown", 1104 | "metadata": {}, 1105 | "source": [ 1106 | "# Adam Optimizer" 1107 | ] 1108 | }, 1109 | { 1110 | "cell_type": "code", 1111 | "execution_count": 38, 1112 | "metadata": {}, 1113 | "outputs": [ 1114 | { 1115 | "name": "stdout", 1116 | "output_type": "stream", 1117 | "text": [ 1118 | "Epoch: 01 | Epoch Time: 2m 8s\n", 1119 | "\tTrain Loss: 0.454 | Train Acc: 88.82%\n", 1120 | "\t Val. Loss: 0.717 | Val. Acc: 91.69%\n", 1121 | "Epoch: 02 | Epoch Time: 2m 11s\n", 1122 | "\tTrain Loss: 0.318 | Train Acc: 91.66%\n", 1123 | "\t Val. Loss: 0.446 | Val. Acc: 91.62%\n", 1124 | "Epoch: 03 | Epoch Time: 2m 25s\n", 1125 | "\tTrain Loss: 0.236 | Train Acc: 91.84%\n", 1126 | "\t Val. Loss: 0.438 | Val. Acc: 92.45%\n", 1127 | "Epoch: 04 | Epoch Time: 2m 23s\n", 1128 | "\tTrain Loss: 0.194 | Train Acc: 92.48%\n", 1129 | "\t Val. Loss: 0.549 | Val. Acc: 92.58%\n", 1130 | "Epoch: 05 | Epoch Time: 2m 7s\n", 1131 | "\tTrain Loss: 0.171 | Train Acc: 93.04%\n", 1132 | "\t Val. Loss: 0.665 | Val. 
Acc: 92.56%\n" 1133 | ] 1134 | } 1135 | ], 1136 | "source": [ 1137 | "# with Adam optimizer\n", 1138 | "N_EPOCHS = 5\n", 1139 | "\n", 1140 | "best_valid_loss = float('inf')\n", 1141 | "\n", 1142 | "for epoch in range(N_EPOCHS):\n", 1143 | "\n", 1144 | " start_time = time.time()\n", 1145 | " \n", 1146 | " train_loss, train_acc = train(model, train_iterator, optimizer, criterion)\n", 1147 | " valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)\n", 1148 | " \n", 1149 | " end_time = time.time()\n", 1150 | "\n", 1151 | " epoch_mins, epoch_secs = epoch_time(start_time, end_time)\n", 1152 | " \n", 1153 | " # to keep model for test set\n", 1154 | " if valid_loss < best_valid_loss:\n", 1155 | " best_valid_loss = valid_loss\n", 1156 | " torch.save(model.state_dict(), 'tut1-model.pt')\n", 1157 | " \n", 1158 | " print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')\n", 1159 | " print(f'\\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')\n", 1160 | " print(f'\\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')" 1161 | ] 1162 | }, 1163 | { 1164 | "cell_type": "markdown", 1165 | "metadata": {}, 1166 | "source": [ 1167 | "It looks overfit, so I added dropout and run again. " 1168 | ] 1169 | }, 1170 | { 1171 | "cell_type": "markdown", 1172 | "metadata": {}, 1173 | "source": [ 1174 | "# Adam Optimizer with Dropout" 1175 | ] 1176 | }, 1177 | { 1178 | "cell_type": "code", 1179 | "execution_count": 41, 1180 | "metadata": {}, 1181 | "outputs": [ 1182 | { 1183 | "name": "stdout", 1184 | "output_type": "stream", 1185 | "text": [ 1186 | "Epoch: 01 | Epoch Time: 2m 23s\n", 1187 | "\tTrain Loss: 0.427 | Train Acc: 91.66%\n", 1188 | "\t Val. Loss: 0.641 | Val. Acc: 91.71%\n", 1189 | "Epoch: 02 | Epoch Time: 2m 33s\n", 1190 | "\tTrain Loss: 0.302 | Train Acc: 91.67%\n", 1191 | "\t Val. Loss: 0.404 | Val. Acc: 91.78%\n", 1192 | "Epoch: 03 | Epoch Time: 2m 26s\n", 1193 | "\tTrain Loss: 0.227 | Train Acc: 91.94%\n", 1194 | "\t Val. Loss: 0.441 | Val. Acc: 92.55%\n", 1195 | "Epoch: 04 | Epoch Time: 2m 28s\n", 1196 | "\tTrain Loss: 0.189 | Train Acc: 92.58%\n", 1197 | "\t Val. Loss: 0.568 | Val. Acc: 92.31%\n", 1198 | "Epoch: 05 | Epoch Time: 2m 27s\n", 1199 | "\tTrain Loss: 0.168 | Train Acc: 93.16%\n", 1200 | "\t Val. Loss: 0.687 | Val. Acc: 92.38%\n" 1201 | ] 1202 | } 1203 | ], 1204 | "source": [ 1205 | "# with Adam optimizer with dropout\n", 1206 | "N_EPOCHS = 5\n", 1207 | "\n", 1208 | "best_valid_loss = float('inf')\n", 1209 | "\n", 1210 | "for epoch in range(N_EPOCHS):\n", 1211 | "\n", 1212 | " start_time = time.time()\n", 1213 | " \n", 1214 | " train_loss, train_acc = train(model, train_iterator, optimizer, criterion)\n", 1215 | " valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)\n", 1216 | " \n", 1217 | " end_time = time.time()\n", 1218 | "\n", 1219 | " epoch_mins, epoch_secs = epoch_time(start_time, end_time)\n", 1220 | " \n", 1221 | " # to keep model for test set\n", 1222 | " if valid_loss < best_valid_loss:\n", 1223 | " best_valid_loss = valid_loss\n", 1224 | " torch.save(model.state_dict(), 'tut1-model.pt')\n", 1225 | " \n", 1226 | " print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')\n", 1227 | " print(f'\\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')\n", 1228 | " print(f'\\t Val. Loss: {valid_loss:.3f} | Val. 
Acc: {valid_acc*100:.2f}%')" 1229 | ] 1230 | }, 1231 | { 1232 | "cell_type": "markdown", 1233 | "metadata": {}, 1234 | "source": [ 1235 | "It is still overfit, so I changed learning rate and tried again." 1236 | ] 1237 | }, 1238 | { 1239 | "cell_type": "markdown", 1240 | "metadata": {}, 1241 | "source": [ 1242 | "# Adam Optimizer with Different Learning Rates" 1243 | ] 1244 | }, 1245 | { 1246 | "cell_type": "markdown", 1247 | "metadata": {}, 1248 | "source": [ 1249 | "I run the code with different learning rates to see which one gives better results." 1250 | ] 1251 | }, 1252 | { 1253 | "cell_type": "code", 1254 | "execution_count": 43, 1255 | "metadata": {}, 1256 | "outputs": [ 1257 | { 1258 | "name": "stdout", 1259 | "output_type": "stream", 1260 | "text": [ 1261 | "Epoch: 01 | Epoch Time: 2m 23s\n", 1262 | "\tTrain Loss: 0.157 | Train Acc: 93.55%\n", 1263 | "\t Val. Loss: 0.708 | Val. Acc: 92.24%\n", 1264 | "Epoch: 02 | Epoch Time: 2m 36s\n", 1265 | "\tTrain Loss: 0.156 | Train Acc: 93.56%\n", 1266 | "\t Val. Loss: 0.723 | Val. Acc: 92.22%\n", 1267 | "Epoch: 03 | Epoch Time: 2m 32s\n", 1268 | "\tTrain Loss: 0.154 | Train Acc: 93.71%\n", 1269 | "\t Val. Loss: 0.737 | Val. Acc: 92.21%\n", 1270 | "Epoch: 04 | Epoch Time: 2m 19s\n", 1271 | "\tTrain Loss: 0.153 | Train Acc: 93.77%\n", 1272 | "\t Val. Loss: 0.747 | Val. Acc: 92.29%\n", 1273 | "Epoch: 05 | Epoch Time: 2m 25s\n", 1274 | "\tTrain Loss: 0.152 | Train Acc: 93.75%\n", 1275 | "\t Val. Loss: 0.767 | Val. Acc: 92.20%\n" 1276 | ] 1277 | } 1278 | ], 1279 | "source": [ 1280 | "# with Adam optimizer with dropout lr e-4\n", 1281 | "N_EPOCHS = 5\n", 1282 | "\n", 1283 | "best_valid_loss = float('inf')\n", 1284 | "\n", 1285 | "for epoch in range(N_EPOCHS):\n", 1286 | "\n", 1287 | " start_time = time.time()\n", 1288 | " \n", 1289 | " train_loss, train_acc = train(model, train_iterator, optimizer, criterion)\n", 1290 | " valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)\n", 1291 | " \n", 1292 | " end_time = time.time()\n", 1293 | "\n", 1294 | " epoch_mins, epoch_secs = epoch_time(start_time, end_time)\n", 1295 | " \n", 1296 | " # to keep model for test set\n", 1297 | " if valid_loss < best_valid_loss:\n", 1298 | " best_valid_loss = valid_loss\n", 1299 | " torch.save(model.state_dict(), 'tut1-model.pt')\n", 1300 | " \n", 1301 | " print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')\n", 1302 | " print(f'\\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')\n", 1303 | " print(f'\\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')" 1304 | ] 1305 | }, 1306 | { 1307 | "cell_type": "markdown", 1308 | "metadata": {}, 1309 | "source": [ 1310 | "Overfitting problem is still available." 1311 | ] 1312 | }, 1313 | { 1314 | "cell_type": "markdown", 1315 | "metadata": {}, 1316 | "source": [ 1317 | "# Changing Optimizer" 1318 | ] 1319 | }, 1320 | { 1321 | "cell_type": "markdown", 1322 | "metadata": {}, 1323 | "source": [ 1324 | "Now, I will change my optimizer and try again." 1325 | ] 1326 | }, 1327 | { 1328 | "cell_type": "code", 1329 | "execution_count": 36, 1330 | "metadata": { 1331 | "scrolled": true 1332 | }, 1333 | "outputs": [ 1334 | { 1335 | "name": "stdout", 1336 | "output_type": "stream", 1337 | "text": [ 1338 | "Epoch: 01 | Epoch Time: 2m 45s\n", 1339 | "\tTrain Loss: 0.647 | Train Acc: 76.40%\n", 1340 | "\t Val. Loss: 0.527 | Val. Acc: 91.68%\n", 1341 | "Epoch: 02 | Epoch Time: 2m 44s\n", 1342 | "\tTrain Loss: 0.531 | Train Acc: 91.66%\n", 1343 | "\t Val. Loss: 0.415 | Val. 
Acc: 91.68%\n", 1344 | "Epoch: 03 | Epoch Time: 2m 32s\n", 1345 | "\tTrain Loss: 0.459 | Train Acc: 91.66%\n", 1346 | "\t Val. Loss: 0.355 | Val. Acc: 91.68%\n", 1347 | "Epoch: 04 | Epoch Time: 2m 20s\n", 1348 | "\tTrain Loss: 0.412 | Train Acc: 91.66%\n", 1349 | "\t Val. Loss: 0.323 | Val. Acc: 91.68%\n", 1350 | "Epoch: 05 | Epoch Time: 2m 10s\n", 1351 | "\tTrain Loss: 0.380 | Train Acc: 91.66%\n", 1352 | "\t Val. Loss: 0.305 | Val. Acc: 91.68%\n" 1353 | ] 1354 | } 1355 | ], 1356 | "source": [ 1357 | "# with SGD optimizer - lr e-3\n", 1358 | "\n", 1359 | "N_EPOCHS = 5\n", 1360 | "\n", 1361 | "best_valid_loss = float('inf')\n", 1362 | "\n", 1363 | "for epoch in range(N_EPOCHS):\n", 1364 | "\n", 1365 | " start_time = time.time()\n", 1366 | " \n", 1367 | " train_loss, train_acc = train(model, train_iterator, optimizer, criterion)\n", 1368 | " valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)\n", 1369 | " \n", 1370 | " end_time = time.time()\n", 1371 | "\n", 1372 | " epoch_mins, epoch_secs = epoch_time(start_time, end_time)\n", 1373 | " \n", 1374 | " # to keep model for test set\n", 1375 | " if valid_loss < best_valid_loss:\n", 1376 | " best_valid_loss = valid_loss\n", 1377 | " torch.save(model.state_dict(), 'tut2-model.pt')\n", 1378 | " \n", 1379 | " print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')\n", 1380 | " print(f'\\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')\n", 1381 | " print(f'\\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')" 1382 | ] 1383 | }, 1384 | { 1385 | "cell_type": "markdown", 1386 | "metadata": {}, 1387 | "source": [ 1388 | "Accuracy is less than Adam optimizer but, these results are better for loss and overfitting problem." 1389 | ] 1390 | }, 1391 | { 1392 | "cell_type": "markdown", 1393 | "metadata": {}, 1394 | "source": [ 1395 | "# Adam Optimizer with Dropout" 1396 | ] 1397 | }, 1398 | { 1399 | "cell_type": "code", 1400 | "execution_count": 45, 1401 | "metadata": {}, 1402 | "outputs": [ 1403 | { 1404 | "name": "stdout", 1405 | "output_type": "stream", 1406 | "text": [ 1407 | "Epoch: 01 | Epoch Time: 2m 14s\n", 1408 | "\tTrain Loss: 0.151 | Train Acc: 93.81%\n", 1409 | "\t Val. Loss: 0.767 | Val. Acc: 92.21%\n", 1410 | "Epoch: 02 | Epoch Time: 2m 7s\n", 1411 | "\tTrain Loss: 0.151 | Train Acc: 93.80%\n", 1412 | "\t Val. Loss: 0.767 | Val. Acc: 92.21%\n", 1413 | "Epoch: 03 | Epoch Time: 2m 11s\n", 1414 | "\tTrain Loss: 0.151 | Train Acc: 93.86%\n", 1415 | "\t Val. Loss: 0.766 | Val. Acc: 92.22%\n", 1416 | "Epoch: 04 | Epoch Time: 2m 10s\n", 1417 | "\tTrain Loss: 0.151 | Train Acc: 93.85%\n", 1418 | "\t Val. Loss: 0.766 | Val. Acc: 92.22%\n", 1419 | "Epoch: 05 | Epoch Time: 2m 7s\n", 1420 | "\tTrain Loss: 0.151 | Train Acc: 93.89%\n", 1421 | "\t Val. Loss: 0.766 | Val. 
Acc: 92.22%\n" 1422 | ] 1423 | } 1424 | ], 1425 | "source": [ 1426 | "# with SGD optimizer with dropout\n", 1427 | "\n", 1428 | "N_EPOCHS = 5\n", 1429 | "\n", 1430 | "best_valid_loss = float('inf')\n", 1431 | "\n", 1432 | "for epoch in range(N_EPOCHS):\n", 1433 | "\n", 1434 | " start_time = time.time()\n", 1435 | " \n", 1436 | " train_loss, train_acc = train(model, train_iterator, optimizer, criterion)\n", 1437 | " valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)\n", 1438 | " \n", 1439 | " end_time = time.time()\n", 1440 | "\n", 1441 | " epoch_mins, epoch_secs = epoch_time(start_time, end_time)\n", 1442 | " \n", 1443 | " # to keep model for test set\n", 1444 | " if valid_loss < best_valid_loss:\n", 1445 | " best_valid_loss = valid_loss\n", 1446 | " torch.save(model.state_dict(), 'tut2-model.pt')\n", 1447 | " \n", 1448 | " print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')\n", 1449 | " print(f'\\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')\n", 1450 | " print(f'\\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')" 1451 | ] 1452 | }, 1453 | { 1454 | "cell_type": "markdown", 1455 | "metadata": {}, 1456 | "source": [ 1457 | "It is overfitting again." 1458 | ] 1459 | }, 1460 | { 1461 | "cell_type": "markdown", 1462 | "metadata": {}, 1463 | "source": [ 1464 | "# SGD Optimizer with Different Learning Rates" 1465 | ] 1466 | }, 1467 | { 1468 | "cell_type": "code", 1469 | "execution_count": 47, 1470 | "metadata": {}, 1471 | "outputs": [ 1472 | { 1473 | "name": "stdout", 1474 | "output_type": "stream", 1475 | "text": [ 1476 | "Epoch: 01 | Epoch Time: 1m 59s\n", 1477 | "\tTrain Loss: 0.150 | Train Acc: 93.80%\n", 1478 | "\t Val. Loss: 0.766 | Val. Acc: 92.22%\n", 1479 | "Epoch: 02 | Epoch Time: 2m 23s\n", 1480 | "\tTrain Loss: 0.151 | Train Acc: 93.82%\n", 1481 | "\t Val. Loss: 0.766 | Val. Acc: 92.22%\n", 1482 | "Epoch: 03 | Epoch Time: 2m 19s\n", 1483 | "\tTrain Loss: 0.150 | Train Acc: 93.88%\n", 1484 | "\t Val. Loss: 0.766 | Val. Acc: 92.22%\n", 1485 | "Epoch: 04 | Epoch Time: 2m 10s\n", 1486 | "\tTrain Loss: 0.151 | Train Acc: 93.87%\n", 1487 | "\t Val. Loss: 0.766 | Val. Acc: 92.22%\n", 1488 | "Epoch: 05 | Epoch Time: 2m 8s\n", 1489 | "\tTrain Loss: 0.151 | Train Acc: 93.81%\n", 1490 | "\t Val. Loss: 0.766 | Val. Acc: 92.22%\n" 1491 | ] 1492 | } 1493 | ], 1494 | "source": [ 1495 | "# with SGD optimizer with dropout different lr \n", 1496 | "\n", 1497 | "N_EPOCHS = 5\n", 1498 | "\n", 1499 | "best_valid_loss = float('inf')\n", 1500 | "\n", 1501 | "for epoch in range(N_EPOCHS):\n", 1502 | "\n", 1503 | " start_time = time.time()\n", 1504 | " \n", 1505 | " train_loss, train_acc = train(model, train_iterator, optimizer, criterion)\n", 1506 | " valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)\n", 1507 | " \n", 1508 | " end_time = time.time()\n", 1509 | "\n", 1510 | " epoch_mins, epoch_secs = epoch_time(start_time, end_time)\n", 1511 | " \n", 1512 | " # to keep model for test set\n", 1513 | " if valid_loss < best_valid_loss:\n", 1514 | " best_valid_loss = valid_loss\n", 1515 | " torch.save(model.state_dict(), 'tut2-model.pt')\n", 1516 | " \n", 1517 | " print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')\n", 1518 | " print(f'\\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')\n", 1519 | " print(f'\\t Val. Loss: {valid_loss:.3f} | Val. 
Acc: {valid_acc*100:.2f}%')" 1520 | ] 1521 | }, 1522 | { 1523 | "cell_type": "markdown", 1524 | "metadata": {}, 1525 | "source": [ 1526 | "Accuracy is higher but validation loss is higher also, so I decided to run with different parameters again." 1527 | ] 1528 | }, 1529 | { 1530 | "cell_type": "markdown", 1531 | "metadata": {}, 1532 | "source": [ 1533 | "# SGD Optimizer without Dropout Layer with Smaller Learning Rate" 1534 | ] 1535 | }, 1536 | { 1537 | "cell_type": "code", 1538 | "execution_count": 61, 1539 | "metadata": {}, 1540 | "outputs": [ 1541 | { 1542 | "name": "stdout", 1543 | "output_type": "stream", 1544 | "text": [ 1545 | "Epoch: 01 | Epoch Time: 2m 9s\n", 1546 | "\tTrain Loss: 0.703 | Train Acc: 9.90%\n", 1547 | "\t Val. Loss: 0.705 | Val. Acc: 30.88%\n", 1548 | "Epoch: 02 | Epoch Time: 2m 32s\n", 1549 | "\tTrain Loss: 0.686 | Train Acc: 85.03%\n", 1550 | "\t Val. Loss: 0.679 | Val. Acc: 76.10%\n", 1551 | "Epoch: 03 | Epoch Time: 2m 57s\n", 1552 | "\tTrain Loss: 0.670 | Train Acc: 91.66%\n", 1553 | "\t Val. Loss: 0.655 | Val. Acc: 88.40%\n", 1554 | "Epoch: 04 | Epoch Time: 3m 7s\n", 1555 | "\tTrain Loss: 0.655 | Train Acc: 91.66%\n", 1556 | "\t Val. Loss: 0.633 | Val. Acc: 90.53%\n", 1557 | "Epoch: 05 | Epoch Time: 2m 35s\n", 1558 | "\tTrain Loss: 0.640 | Train Acc: 91.66%\n", 1559 | "\t Val. Loss: 0.612 | Val. Acc: 91.02%\n" 1560 | ] 1561 | } 1562 | ], 1563 | "source": [ 1564 | "# with SGD optimizer without dropout lr e-4\n", 1565 | "\n", 1566 | "N_EPOCHS = 5\n", 1567 | "\n", 1568 | "best_valid_loss = float('inf')\n", 1569 | "\n", 1570 | "for epoch in range(N_EPOCHS):\n", 1571 | "\n", 1572 | " start_time = time.time()\n", 1573 | " \n", 1574 | " train_loss, train_acc = train(model, train_iterator, optimizer, criterion)\n", 1575 | " valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)\n", 1576 | " \n", 1577 | " end_time = time.time()\n", 1578 | "\n", 1579 | " epoch_mins, epoch_secs = epoch_time(start_time, end_time)\n", 1580 | " \n", 1581 | " # to keep model for test set\n", 1582 | " if valid_loss < best_valid_loss:\n", 1583 | " best_valid_loss = valid_loss\n", 1584 | " torch.save(model.state_dict(), 'tut2-model.pt')\n", 1585 | " \n", 1586 | " print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')\n", 1587 | " print(f'\\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')\n", 1588 | " print(f'\\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')" 1589 | ] 1590 | }, 1591 | { 1592 | "cell_type": "markdown", 1593 | "metadata": {}, 1594 | "source": [ 1595 | "I found best results with SGD optimizer and learning rate as e-3. " 1596 | ] 1597 | }, 1598 | { 1599 | "cell_type": "markdown", 1600 | "metadata": {}, 1601 | "source": [ 1602 | "# Adding Tri-Gram Function" 1603 | ] 1604 | }, 1605 | { 1606 | "cell_type": "markdown", 1607 | "metadata": {}, 1608 | "source": [ 1609 | "Now, I would like to see what it will change if I will group my words as three instead of two. I will do some steps again because my torch ready data will change. 
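The dedicated tri-gram helper is defined in the next cell. As a side note (a sketch only, not used in the runs below), the bi-gram and tri-gram helpers differ only in the value of `n`, so they could be folded into one generalised function:

```python
def generate_ngrams(text, n=2):
    '''Append every n-gram of the token list to the original token list.'''
    n_grams = set(zip(*[text[i:] for i in range(n)]))
    for n_gram in n_grams:
        text.append(' '.join(n_gram))
    return text

# generate_ngrams(['I', 'love', 'this', 'book'], n=3)
# -> the four tokens plus 'I love this' and 'love this book'
#    (the order of the added n-grams may vary, since a set is used)
```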
" 1610 | ] 1611 | }, 1612 | { 1613 | "cell_type": "code", 1614 | "execution_count": 16, 1615 | "metadata": {}, 1616 | "outputs": [], 1617 | "source": [ 1618 | "def generate_trigrams(text):\n", 1619 | " '''creating set of 3 co-occuring words'''\n", 1620 | " tri_grams = set(zip(*[text[i:] for i in range(3)]))\n", 1621 | " for tri_gram in tri_grams:\n", 1622 | " text.append(' '.join(tri_gram))\n", 1623 | " return text" 1624 | ] 1625 | }, 1626 | { 1627 | "cell_type": "code", 1628 | "execution_count": 14, 1629 | "metadata": {}, 1630 | "outputs": [ 1631 | { 1632 | "data": { 1633 | "text/plain": [ 1634 | "['I', 'love', 'this', 'book', 'love this book', 'I love this']" 1635 | ] 1636 | }, 1637 | "execution_count": 14, 1638 | "metadata": {}, 1639 | "output_type": "execute_result" 1640 | } 1641 | ], 1642 | "source": [ 1643 | "# To check tri-gram function is working proporly or not\n", 1644 | "generate_trigrams(['I', 'love', 'this', 'book'])" 1645 | ] 1646 | }, 1647 | { 1648 | "cell_type": "code", 1649 | "execution_count": 3, 1650 | "metadata": {}, 1651 | "outputs": [], 1652 | "source": [ 1653 | "SEED = 1234\n", 1654 | "\n", 1655 | "torch.manual_seed(SEED)\n", 1656 | "torch.backends.cudnn.deterministic = True\n", 1657 | "\n", 1658 | "TEXT = data.Field(tokenize = 'spacy', preprocessing = generate_trigrams)\n", 1659 | "TARGET = data.LabelField(dtype = torch.float)" 1660 | ] 1661 | }, 1662 | { 1663 | "cell_type": "code", 1664 | "execution_count": 16, 1665 | "metadata": {}, 1666 | "outputs": [], 1667 | "source": [ 1668 | "fields_train = [('review_clean', TEXT),('sentiment', TARGET)]" 1669 | ] 1670 | }, 1671 | { 1672 | "cell_type": "code", 1673 | "execution_count": 17, 1674 | "metadata": {}, 1675 | "outputs": [], 1676 | "source": [ 1677 | "# Taking training data from train.csv\n", 1678 | "train_data = data.TabularDataset(path = 'train.csv',\n", 1679 | " format = 'csv',\n", 1680 | " fields = fields_train,\n", 1681 | " skip_header = True)" 1682 | ] 1683 | }, 1684 | { 1685 | "cell_type": "code", 1686 | "execution_count": 18, 1687 | "metadata": {}, 1688 | "outputs": [], 1689 | "source": [ 1690 | "# print(vars(train_data[0])) # to check tri-grams" 1691 | ] 1692 | }, 1693 | { 1694 | "cell_type": "code", 1695 | "execution_count": 20, 1696 | "metadata": {}, 1697 | "outputs": [], 1698 | "source": [ 1699 | "# Creating validation set from train data\n", 1700 | "\n", 1701 | "train_data, valid_data = train_data.split(random_state = random.seed(SEED))" 1702 | ] 1703 | }, 1704 | { 1705 | "cell_type": "code", 1706 | "execution_count": 21, 1707 | "metadata": {}, 1708 | "outputs": [], 1709 | "source": [ 1710 | "MAX_VOCAB_SIZE = 25_000\n", 1711 | "\n", 1712 | "TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE, \n", 1713 | " vectors = \"glove.6B.100d\", \n", 1714 | " unk_init = torch.Tensor.normal_)\n", 1715 | "\n", 1716 | "TARGET.build_vocab(train_data)" 1717 | ] 1718 | }, 1719 | { 1720 | "cell_type": "code", 1721 | "execution_count": 22, 1722 | "metadata": {}, 1723 | "outputs": [], 1724 | "source": [ 1725 | "BATCH_SIZE = 64\n", 1726 | "\n", 1727 | "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", 1728 | "\n", 1729 | "train_iterator = data.Iterator(dataset = train_data, batch_size = BATCH_SIZE,device = device, \n", 1730 | " shuffle = None, train = True, sort_key = lambda x: len(x.review_clean), \n", 1731 | " sort_within_batch = False)\n", 1732 | "valid_iterator = data.Iterator(dataset = valid_data, batch_size = BATCH_SIZE,device = device, \n", 1733 | " shuffle = None, train = False, sort_key = 
lambda x: len(x.review_clean), \n", 1734 | " sort_within_batch = False)" 1735 | ] 1736 | }, 1737 | { 1738 | "cell_type": "code", 1739 | "execution_count": 25, 1740 | "metadata": {}, 1741 | "outputs": [], 1742 | "source": [ 1743 | "INPUT_DIM = len(TEXT.vocab) #vocabulary size \n", 1744 | "EMBEDDING_DIM = 100 # embedding dimension\n", 1745 | "OUTPUT_DIM = 1 # our output has only 2 classes - 0/1. So, it is one-dimensional.\n", 1746 | "PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] # string to integer method on padding tokens\n", 1747 | "\n", 1748 | "model = FastText(INPUT_DIM, EMBEDDING_DIM, OUTPUT_DIM, PAD_IDX)" 1749 | ] 1750 | }, 1751 | { 1752 | "cell_type": "code", 1753 | "execution_count": 27, 1754 | "metadata": {}, 1755 | "outputs": [ 1756 | { 1757 | "data": { 1758 | "text/plain": [ 1759 | "tensor([[-0.1117, -0.4966, 0.1631, ..., 1.2647, -0.2753, -0.1325],\n", 1760 | " [-0.8555, -0.7208, 1.3755, ..., 0.0825, -1.1314, 0.3997],\n", 1761 | " [-0.0382, -0.2449, 0.7281, ..., -0.1459, 0.8278, 0.2706],\n", 1762 | " ...,\n", 1763 | " [ 1.5221, -0.3108, -0.2902, ..., -0.2051, -0.9059, -0.8559],\n", 1764 | " [ 0.9666, -0.3822, -0.2585, ..., -1.0574, -0.6668, 0.1646],\n", 1765 | " [ 1.8935, -0.8303, 0.2935, ..., -0.6399, -1.8376, -1.9168]])" 1766 | ] 1767 | }, 1768 | "execution_count": 27, 1769 | "metadata": {}, 1770 | "output_type": "execute_result" 1771 | } 1772 | ], 1773 | "source": [ 1774 | "pretrained_embeddings = TEXT.vocab.vectors\n", 1775 | "\n", 1776 | "model.embedding.weight.data.copy_(pretrained_embeddings)" 1777 | ] 1778 | }, 1779 | { 1780 | "cell_type": "code", 1781 | "execution_count": 28, 1782 | "metadata": {}, 1783 | "outputs": [], 1784 | "source": [ 1785 | "UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]\n", 1786 | "\n", 1787 | "model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)\n", 1788 | "model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)" 1789 | ] 1790 | }, 1791 | { 1792 | "cell_type": "code", 1793 | "execution_count": 29, 1794 | "metadata": {}, 1795 | "outputs": [], 1796 | "source": [ 1797 | "# optimizer = optim.Adam(model.parameters())" 1798 | ] 1799 | }, 1800 | { 1801 | "cell_type": "code", 1802 | "execution_count": 37, 1803 | "metadata": {}, 1804 | "outputs": [], 1805 | "source": [ 1806 | "optimizer = optim.SGD(model.parameters(), lr=1e-3)" 1807 | ] 1808 | }, 1809 | { 1810 | "cell_type": "code", 1811 | "execution_count": 38, 1812 | "metadata": {}, 1813 | "outputs": [], 1814 | "source": [ 1815 | "criterion = nn.BCEWithLogitsLoss()\n", 1816 | "\n", 1817 | "# keeping model and criterion in GPU\n", 1818 | "model = model.to(device)\n", 1819 | "criterion = criterion.to(device)" 1820 | ] 1821 | }, 1822 | { 1823 | "cell_type": "markdown", 1824 | "metadata": {}, 1825 | "source": [ 1826 | "### Results with Adam Optimizer" 1827 | ] 1828 | }, 1829 | { 1830 | "cell_type": "code", 1831 | "execution_count": 35, 1832 | "metadata": {}, 1833 | "outputs": [ 1834 | { 1835 | "name": "stdout", 1836 | "output_type": "stream", 1837 | "text": [ 1838 | "Epoch: 01 | Epoch Time: 2m 39s\n", 1839 | "\tTrain Loss: 0.443 | Train Acc: 91.66%\n", 1840 | "\t Val. Loss: 0.686 | Val. Acc: 91.69%\n", 1841 | "Epoch: 02 | Epoch Time: 2m 28s\n", 1842 | "\tTrain Loss: 0.324 | Train Acc: 91.66%\n", 1843 | "\t Val. Loss: 0.422 | Val. Acc: 91.63%\n", 1844 | "Epoch: 03 | Epoch Time: 2m 31s\n", 1845 | "\tTrain Loss: 0.248 | Train Acc: 91.74%\n", 1846 | "\t Val. Loss: 0.409 | Val. 
Acc: 92.18%\n", 1847 | "Epoch: 04 | Epoch Time: 2m 46s\n", 1848 | "\tTrain Loss: 0.206 | Train Acc: 92.15%\n", 1849 | "\t Val. Loss: 0.539 | Val. Acc: 92.41%\n", 1850 | "Epoch: 05 | Epoch Time: 2m 24s\n", 1851 | "\tTrain Loss: 0.182 | Train Acc: 92.63%\n", 1852 | "\t Val. Loss: 0.681 | Val. Acc: 92.34%\n" 1853 | ] 1854 | } 1855 | ], 1856 | "source": [ 1857 | "N_EPOCHS = 5\n", 1858 | "\n", 1859 | "best_valid_loss = float('inf')\n", 1860 | "\n", 1861 | "for epoch in range(N_EPOCHS):\n", 1862 | "\n", 1863 | " start_time = time.time()\n", 1864 | " \n", 1865 | " train_loss, train_acc = train(model, train_iterator, optimizer, criterion)\n", 1866 | " valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)\n", 1867 | " \n", 1868 | " end_time = time.time()\n", 1869 | "\n", 1870 | " epoch_mins, epoch_secs = epoch_time(start_time, end_time)\n", 1871 | " \n", 1872 | " # to keep the best model (lowest validation loss) for the test set\n", 1873 | " if valid_loss < best_valid_loss:\n", 1874 | " best_valid_loss = valid_loss\n", 1875 | " torch.save(model.state_dict(), 'tut3-model.pt')\n", 1876 | " \n", 1877 | " print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')\n", 1878 | " print(f'\\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')\n", 1879 | " print(f'\\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')" 1880 | ] 1881 | }, 1882 | { 1883 | "cell_type": "markdown", 1884 | "metadata": {}, 1885 | "source": [ 1886 | "The model looks overfit: training loss keeps decreasing while validation loss rises again after epoch 3." 1887 | ] 1888 | }, 1889 | { 1890 | "cell_type": "markdown", 1891 | "metadata": {}, 1892 | "source": [ 1893 | "### Results with SGD Optimizer" 1894 | ] 1895 | }, 1896 | { 1897 | "cell_type": "code", 1898 | "execution_count": 39, 1899 | "metadata": {}, 1900 | "outputs": [ 1901 | { 1902 | "name": "stdout", 1903 | "output_type": "stream", 1904 | "text": [ 1905 | "Epoch: 01 | Epoch Time: 2m 15s\n", 1906 | "\tTrain Loss: 0.221 | Train Acc: 92.00%\n", 1907 | "\t Val. Loss: 0.410 | Val. Acc: 92.22%\n", 1908 | "Epoch: 02 | Epoch Time: 2m 29s\n", 1909 | "\tTrain Loss: 0.221 | Train Acc: 91.98%\n", 1910 | "\t Val. Loss: 0.411 | Val. Acc: 92.28%\n", 1911 | "Epoch: 03 | Epoch Time: 2m 15s\n", 1912 | "\tTrain Loss: 0.220 | Train Acc: 91.97%\n", 1913 | "\t Val. Loss: 0.411 | Val. Acc: 92.27%\n", 1914 | "Epoch: 04 | Epoch Time: 2m 33s\n", 1915 | "\tTrain Loss: 0.220 | Train Acc: 91.96%\n", 1916 | "\t Val. Loss: 0.412 | Val. Acc: 92.27%\n", 1917 | "Epoch: 05 | Epoch Time: 2m 34s\n", 1918 | "\tTrain Loss: 0.220 | Train Acc: 91.95%\n", 1919 | "\t Val. Loss: 0.412 | Val. Acc: 92.27%\n" 1920 | ] 1921 | } 1922 | ], 1923 | "source": [ 1924 | "# SGD\n", 1925 | "N_EPOCHS = 5\n", 1926 | "\n", 1927 | "best_valid_loss = float('inf')\n", 1928 | "\n", 1929 | "for epoch in range(N_EPOCHS):\n", 1930 | "\n", 1931 | " start_time = time.time()\n", 1932 | " \n", 1933 | " train_loss, train_acc = train(model, train_iterator, optimizer, criterion)\n", 1934 | " valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)\n", 1935 | " \n", 1936 | " end_time = time.time()\n", 1937 | "\n", 1938 | " epoch_mins, epoch_secs = epoch_time(start_time, end_time)\n", 1939 | " \n", 1940 | " # to keep the best model (lowest validation loss) for the test set\n", 1941 | " if valid_loss < best_valid_loss:\n", 1942 | " best_valid_loss = valid_loss\n", 1943 | " torch.save(model.state_dict(), 'tut4-model.pt')\n", 1944 | " \n", 1945 | " print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')\n", 1946 | " print(f'\\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')\n", 1947 | " print(f'\\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')" 1948 | ] 1949 | }, 1950 | { 1951 | "cell_type": "markdown", 1952 | "metadata": {}, 1953 | "source": [ 1954 | "With SGD the validation loss is lower and more stable than with the Adam optimizer, but the results are still not better than the bi-gram model." 1955 | ] 1956 | }, 1957 | { 1958 | "cell_type": "markdown", 1959 | "metadata": {}, 1960 | "source": [ 1961 | "I insert the (commented-out) code below in case someone wants to see how to load the saved model and evaluate it on different, unseen data. A short sketch of scoring a single raw review this way is added at the end of this file." 1962 | ] 1963 | }, 1964 | { 1965 | "cell_type": "code", 1966 | "execution_count": 40, 1967 | "metadata": {}, 1968 | "outputs": [ 1969 | { 1970 | "name": "stdout", 1971 | "output_type": "stream", 1972 | "text": [ 1973 | "Test Loss: 0.402 | Test Acc: 92.25%\n" 1974 | ] 1975 | } 1976 | ], 1977 | "source": [ 1978 | "# model.load_state_dict(torch.load('tut4-model.pt'))\n", 1979 | "\n", 1980 | "# test_loss, test_acc = evaluate(model, test_iterator, criterion)\n", 1981 | "\n", 1982 | "# print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')" 1983 | ] 1984 | }, 1985 | { 1986 | "cell_type": "markdown", 1987 | "metadata": {}, 1988 | "source": [ 1989 | "### Best Results:\n", 1990 | " - With the SGD optimizer and a learning rate of 1e-3, 91.66% train accuracy and 91.68% validation accuracy were obtained." 1991 | ] 1992 | }, 1993 | { 1994 | "cell_type": "markdown", 1995 | "metadata": {}, 1996 | "source": [ 1997 | "# Adding Augmentation To Data" 1998 | ] 1999 | }, 2000 | { 2001 | "cell_type": "markdown", 2002 | "metadata": {}, 2003 | "source": [ 2004 | "To get better results, I will try to add augmentation to my data.\n", 2005 | "\n", 2006 | "Further information can be found at https://github.com/makcedward/nlpaug/blob/master/example/textual_augmenter.ipynb.\n", 2007 | "\n", 2008 | "I try only one method below, synonym augmentation, to see how it changes the text; a couple of other nlpaug word-level augmenters are sketched right below."
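The linked nlpaug example notebook also covers word-level augmenters beyond WordNet synonyms. A minimal, illustrative sketch of two of them follows, assuming the same environment as the rest of this notebook (nlpaug installed and imported as `naw`); the augmented output is random, and newer nlpaug releases return a list such as `['I like book']` rather than a plain string.

```python
# a minimal sketch of two other nlpaug word-level augmenters (illustrative only)
import nlpaug.augmenter.word as naw

text = 'I like this book'

# randomly swap neighbouring words
swap_aug = naw.RandomWordAug(action='swap')
print(swap_aug.augment(text))

# randomly delete words
delete_aug = naw.RandomWordAug(action='delete')
print(delete_aug.augment(text))
```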
2009 | ] 2010 | }, 2011 | { 2012 | "cell_type": "code", 2013 | "execution_count": 19, 2014 | "metadata": {}, 2015 | "outputs": [ 2016 | { 2017 | "name": "stderr", 2018 | "output_type": "stream", 2019 | "text": [ 2020 | "[nltk_data] Downloading package wordnet to /Users/ezgi/nltk_data...\n", 2021 | "[nltk_data] Unzipping corpora/wordnet.zip.\n" 2022 | ] 2023 | }, 2024 | { 2025 | "data": { 2026 | "text/plain": [ 2027 | "True" 2028 | ] 2029 | }, 2030 | "execution_count": 19, 2031 | "metadata": {}, 2032 | "output_type": "execute_result" 2033 | } 2034 | ], 2035 | "source": [ 2036 | "import nltk\n", 2037 | "nltk.download('wordnet')" 2038 | ] 2039 | }, 2040 | { 2041 | "cell_type": "code", 2042 | "execution_count": 22, 2043 | "metadata": { 2044 | "scrolled": true 2045 | }, 2046 | "outputs": [ 2047 | { 2048 | "name": "stderr", 2049 | "output_type": "stream", 2050 | "text": [ 2051 | "[nltk_data] Downloading package averaged_perceptron_tagger to\n", 2052 | "[nltk_data] /Users/ezgi/nltk_data...\n", 2053 | "[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.\n" 2054 | ] 2055 | }, 2056 | { 2057 | "data": { 2058 | "text/plain": [ 2059 | "True" 2060 | ] 2061 | }, 2062 | "execution_count": 22, 2063 | "metadata": {}, 2064 | "output_type": "execute_result" 2065 | } 2066 | ], 2067 | "source": [ 2068 | "nltk.download('averaged_perceptron_tagger')" 2069 | ] 2070 | }, 2071 | { 2072 | "cell_type": "code", 2073 | "execution_count": 23, 2074 | "metadata": {}, 2075 | "outputs": [ 2076 | { 2077 | "name": "stdout", 2078 | "output_type": "stream", 2079 | "text": [ 2080 | "Original:\n", 2081 | "I like this book\n", 2082 | "Augmented Text:\n", 2083 | "I care this book\n" 2084 | ] 2085 | } 2086 | ], 2087 | "source": [ 2088 | "# to see the differences between the original and augmented texts\n", 2089 | "aug = naw.SynonymAug(aug_src='wordnet')\n", 2090 | "text = 'I like this book'\n", 2091 | "augmented_text = aug.augment(text)\n", 2092 | "print(\"Original:\")\n", 2093 | "print(text)\n", 2094 | "print(\"Augmented Text:\")\n", 2095 | "print(augmented_text)" 2096 | ] 2097 | }, 2098 | { 2099 | "cell_type": "code", 2100 | "execution_count": 24, 2101 | "metadata": {}, 2102 | "outputs": [], 2103 | "source": [ 2104 | "def embedding(text):\n", 2105 | " '''replace words in the text with WordNet synonyms (synonym augmentation)'''\n", 2106 | " aug = naw.SynonymAug(aug_src='wordnet')\n", 2107 | " augmented_text = aug.augment(text)\n", 2108 | " return augmented_text" 2109 | ] 2110 | }, 2111 | { 2112 | "cell_type": "code", 2113 | "execution_count": 30, 2114 | "metadata": {}, 2115 | "outputs": [], 2116 | "source": [ 2117 | "df.dropna(subset=['review_clean'], inplace=True) # dropping rows with null review_clean values" 2118 | ] 2119 | }, 2120 | { 2121 | "cell_type": "code", 2122 | "execution_count": 31, 2123 | "metadata": {}, 2124 | "outputs": [], 2125 | "source": [ 2126 | "df_aug = df.head(100000) # taking the first 100,000 rows as a sample" 2127 | ] 2128 | }, 2129 | { 2130 | "cell_type": "code", 2131 | "execution_count": 32, 2132 | "metadata": {}, 2133 | "outputs": [], 2134 | "source": [ 2135 | "df_aug= df_aug.loc[:, ['review_clean', 'sentiment']]" 2136 | ] 2137 | }, 2138 | { 2139 | "cell_type": "code", 2140 | "execution_count": 33, 2141 | "metadata": {}, 2142 | "outputs": [], 2143 | "source": [ 2144 | "train_aug, test_aug = train_test_split(df_aug, test_size=0.2,random_state = 42) # splitting with the same random_state to use the same rows as the train part" 2145 | ] 2146 | }, 2147 | { 2148 | "cell_type": "code", 2149 | "execution_count": 34, 2150 | "metadata": {}, 2151 | "outputs": [ 2152 | { 2153 | "name": "stderr", 
2154 | "output_type": "stream", 2155 | "text": [ 2156 | "/Users/ezgi/opt/anaconda3/envs/learn-env/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: \n", 2157 | "A value is trying to be set on a copy of a slice from a DataFrame.\n", 2158 | "Try using .loc[row_indexer,col_indexer] = value instead\n", 2159 | "\n", 2160 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", 2161 | " \"\"\"Entry point for launching an IPython kernel.\n" 2162 | ] 2163 | }, 2164 | { 2165 | "data": { 2166 | "text/html": [ 2167 | "
\n", 2168 | "\n", 2181 | "\n", 2182 | " \n", 2183 | " \n", 2184 | " \n", 2185 | " \n", 2186 | " \n", 2187 | " \n", 2188 | " \n", 2189 | " \n", 2190 | " \n", 2191 | " \n", 2192 | " \n", 2193 | " \n", 2194 | " \n", 2195 | " \n", 2196 | " \n", 2197 | " \n", 2198 | " \n", 2199 | " \n", 2200 | " \n", 2201 | " \n", 2202 | " \n", 2203 | " \n", 2204 | " \n", 2205 | " \n", 2206 | " \n", 2207 | " \n", 2208 | " \n", 2209 | " \n", 2210 | " \n", 2211 | " \n", 2212 | " \n", 2213 | " \n", 2214 | " \n", 2215 | " \n", 2216 | " \n", 2217 | " \n", 2218 | " \n", 2219 | " \n", 2220 | " \n", 2221 | " \n", 2222 | "
review_cleansentimentreview_emb
75241new to me author but he spins a good tale i en...2new to me author but he spin around a serious ...
48970had to keep reading addicted must know what ha...2get to keep reading hook mustiness know what h...
44979good book that will be enjoyed by inmates for ...2good holy scripture that will live enjoyed by ...
13571this book was a nice little slightly erotic re...2this book was a gracious little somewhat eroti...
92751ive never read a work by mary jane clark befor...1ive never read a work by mary jane clark befor...
\n", 2223 | "
" 2224 | ], 2225 | "text/plain": [ 2226 | " review_clean sentiment \\\n", 2227 | "75241 new to me author but he spins a good tale i en... 2 \n", 2228 | "48970 had to keep reading addicted must know what ha... 2 \n", 2229 | "44979 good book that will be enjoyed by inmates for ... 2 \n", 2230 | "13571 this book was a nice little slightly erotic re... 2 \n", 2231 | "92751 ive never read a work by mary jane clark befor... 1 \n", 2232 | "\n", 2233 | " review_emb \n", 2234 | "75241 new to me author but he spin around a serious ... \n", 2235 | "48970 get to keep reading hook mustiness know what h... \n", 2236 | "44979 good holy scripture that will live enjoyed by ... \n", 2237 | "13571 this book was a gracious little somewhat eroti... \n", 2238 | "92751 ive never read a work by mary jane clark befor... " 2239 | ] 2240 | }, 2241 | "execution_count": 34, 2242 | "metadata": {}, 2243 | "output_type": "execute_result" 2244 | } 2245 | ], 2246 | "source": [ 2247 | "train_aug['review_emb'] = train_aug['review_clean'].apply(lambda x: embedding(x))\n", 2248 | "train_aug.head()" 2249 | ] 2250 | }, 2251 | { 2252 | "cell_type": "code", 2253 | "execution_count": 35, 2254 | "metadata": {}, 2255 | "outputs": [], 2256 | "source": [ 2257 | "train_aug.to_csv('train_aug.csv', index = False) # to keep augmented dataframe" 2258 | ] 2259 | }, 2260 | { 2261 | "cell_type": "markdown", 2262 | "metadata": {}, 2263 | "source": [ 2264 | "Our reviews were changed according to synonym. This wss the first step to figure out how data augmentation works. When we put this data to torch model, it will not give good results because it needs more deeper work." 2265 | ] 2266 | }, 2267 | { 2268 | "cell_type": "markdown", 2269 | "metadata": {}, 2270 | "source": [ 2271 | "# Future Improvements for This Notebook\n", 2272 | "\n", 2273 | "After talking our instructor Bryan Arnold, some steps were determined to improve more this model;\n", 2274 | "\n", 2275 | "- Tri-grams and bi-grams applied to model seperately, new dictionary can be formed which contains both of them.\n", 2276 | "- Data augmentation will be added to data.\n", 2277 | "- Dropout layer and other layers can be changed or new layers can be added.\n", 2278 | "- Test-time augmentation can be added. \n", 2279 | "- I have already run the model for different learning rates but more different values can be tried. \n", 2280 | "- I will run the model for higher epoch numbers. Each epoch takes time to I only tried for 5, it can be increased.\n" 2281 | ] 2282 | }, 2283 | { 2284 | "cell_type": "markdown", 2285 | "metadata": {}, 2286 | "source": [ 2287 | "I will continue to try other deep learning models to find better results and to find easily tuned models My next step is to work on Keras models." 2288 | ] 2289 | }, 2290 | { 2291 | "cell_type": "code", 2292 | "execution_count": null, 2293 | "metadata": {}, 2294 | "outputs": [], 2295 | "source": [] 2296 | } 2297 | ], 2298 | "metadata": { 2299 | "kernelspec": { 2300 | "display_name": "Python 3", 2301 | "language": "python", 2302 | "name": "python3" 2303 | }, 2304 | "language_info": { 2305 | "codemirror_mode": { 2306 | "name": "ipython", 2307 | "version": 3 2308 | }, 2309 | "file_extension": ".py", 2310 | "mimetype": "text/x-python", 2311 | "name": "python", 2312 | "nbconvert_exporter": "python", 2313 | "pygments_lexer": "ipython3", 2314 | "version": "3.6.9" 2315 | } 2316 | }, 2317 | "nbformat": 4, 2318 | "nbformat_minor": 2 2319 | } 2320 | --------------------------------------------------------------------------------