├── LICENSE
├── README.md
├── __pycache__
│   ├── misc.cpython-36.pyc
│   ├── model.cpython-36.pyc
│   └── text_cleaning.cpython-36.pyc
├── data
│   ├── test.csv
│   └── train.csv
├── data_config.json
├── data_processing.py
├── misc.py
├── model.py
├── project_config.json
├── readme
├── requirement.txt
├── saved_model
│   └── sentiment_classifer_rnn_sagar.pt
├── text_cleaning.py
└── train.py

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2021 Sagchakr

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Twitter Sentiment Analysis Using RNN - PyTorch

[![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)
[PyTorch](https://pytorch.org/)
[Pandas](https://pandas.pydata.org/)
[NumPy](https://numpy.org/)
[Kaggle](https://www.kaggle.com/kazanova/sentiment140)

[![Python 3.6.8](https://img.shields.io/badge/python-3.6.8-blue.svg)](https://www.python.org/downloads/release/python-368/)
[![torch 1.8.1](https://img.shields.io/badge/torch-1.8.1-orange.svg)](https://pypi.org/project/torch/1.8.1/)
[![torch_text 0.9.1](https://img.shields.io/badge/torchtext-0.9.1-orange.svg)](https://pypi.org/project/torchtext/0.9.1/)
[![pytorch_ignite 0.4.4](https://img.shields.io/badge/pytorch--ignite-0.4.4-orange.svg)](https://pypi.org/project/pytorch-ignite/0.4.4/)

[![Stargazers repo roster for @BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN](https://reporoster.com/stars/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN)](https://github.com/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN/stargazers)
[![Forkers repo roster for @BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN](https://reporoster.com/forks/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN)](https://github.com/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN/network/members)

## Introduction

![intro_image](https://i.morioh.com/2020/02/04/beef36fd707d.jpg) \
Image Source: morioh
Sentiment analysis is the automated process of recognizing and categorizing subjective information in text. Twitter sentiment analysis applies this to tweets: machine learning models separate positive tweets from negative ones through classification, text mining, text analysis, and data visualization built on natural language processing.

![project_workflow_example](https://user-images.githubusercontent.com/49767657/121781346-dbb30000-cbc1-11eb-809a-a016d7a6092f.png) \
Image Source: Google

## Requirements

This project was developed on [Windows 10](https://www.microsoft.com/en-in/software-download/windows10). \
Clone or download this repository, then run `pip install -r requirement.txt` to install all the libraries required to run this project.
You can also click on the specific badges mentioned [above](#requirements) and download each dependency individually.

- **torch==1.8.1+cu**
- **torchtext==0.9.1**
- **pytorch-ignite==0.4.4**
- **pandas==1.0.5**
- **numpy==1.19.3**

### IDE
I used [PyCharm](https://www.jetbrains.com/pycharm/) for this project; I generally prefer PyCharm for building large end-to-end projects.
For data analytics and visualization, though, I recommend [Jupyter](https://jupyter.org/). If you don't know Jupyter yet (it is quite beginner-friendly), go through this [tutorial](https://www.tutorialspoint.com/jupyter/index.htm).

## To Run this Project
1. Download the dataset from [Kaggle](#requirements) (linked above) and keep it in the main directory.
2. Put the CSV file in the working directory and set its full path in **data_config.json**.
3. In **data_config.json** you need to provide:

   i) `dataset_full_path`: **full path of the main dataset** \
   ii) `num_neg_labels`: **number of negative samples the dataset should contain** \
   iii) `num_pos_labels`: **number of positive samples the dataset should contain** \
   iv) `trainset_fullpath`: **path to save train.csv** \
   v) `testset_fullpath`: **path to save test.csv** \
   vi) `num_training_sample`: **number of training samples in train.csv**

4. *Run* `python data_processing.py`: **this prepares the data by splitting it into train and test sets.**
5. You also need to provide certain parameters in **project_config.json** (see the sketch after this list for how they are read):

   i) `data_path`: **data** \
   ii) `model_dir`: **saved_model** \
   iii) `device`: **-1** \
   iv) `model_name`: **sentiment_classifer_rnn_sagar.pt** \
   v) `embedding_dim`: **100** \
   vi) `hidden_dim`: **256** \
   vii) `output_dim`: **1** \
   viii) `batch_size`: **64** \
   ix) `max_vocab_size`: **25000** \
   x) `learning_rate`: **1e-3** \
   xi) `epoch`: **20**
6. After performing the above steps, execute `python train.py`.

**To change the dataset or any training parameter, edit project_config.json.**
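Every value in **project_config.json** is stored as a string, so train.py casts the numeric parameters when it loads the file. A minimal sketch of that loading step, mirroring the casts train.py actually performs:

```python
import json

# Minimal sketch of how train.py consumes project_config.json.
# All values are stored as strings in the JSON, so numeric
# parameters are cast explicitly, exactly as train.py does.
with open('project_config.json', 'r') as f:
    config = json.load(f)

embedding_dim = int(config['embedding_dim'])    # 100
hidden_dim = int(config['hidden_dim'])          # 256
batch_size = int(config['batch_size'])          # 64
learning_rate = float(config['learning_rate'])  # 1e-3
num_epoch = int(config['epoch'])                # 20
print(embedding_dim, hidden_dim, batch_size, learning_rate, num_epoch)
```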
## Experiment Details

- Epochs: **20**
- Optimizer: **Adam**
- Learning rate: **1e-3**
- Loss function: **BCEWithLogitsLoss**
- Train accuracy: **99.72%**
- Validation accuracy: **67.68%**
- Test accuracy: **64.54%**
- Number of training examples: **10500**
- Number of validation examples: **4500**
- Number of testing examples: **3962**
- Unique tokens in TEXT vocabulary: **18609**
- Unique tokens in LABEL vocabulary: **2**
- The model has **1,952,805** trainable parameters

Due to RAM limitations I took **20000 samples** from the main dataset, which achieved **99.23% accuracy** on the train dataset.

The test accuracy is about `68%`, which can be improved by training on more data or for more epochs.

## Author
#### Sagar Chakraborty
[Sagar_Chakraborty_LinkedIn](https://www.linkedin.com/in/binaryblackhole/)
[Sagar_Chakraborty_Gmail](https://mail.google.com/mail/u/0/#search/csagar963%40gmail.com)
[Sagar_Chakraborty_GitHub](https://github.com/BinaryBlackhole)

## Thanks to ALL the amazing contributors!
[Akshata_Kulkarni](https://github.com/Akshata-Kulk1)
![GitHub Contributors Image](https://contrib.rocks/image?repo=BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN)

## References
#### Akurniawan - sentiment analysis
[Akurniawan_sentiment_analysis_github](https://github.com/akurniawan/pytorch-sentiment-analysis)
#### Bentrevett - sentiment analysis
[Bentrevett_sentiment_analysis](https://github.com/bentrevett/pytorch-sentiment-analysis)

**If you like this project, please fork and star it!**

![open_source](https://forthebadge.com/images/badges/open-source.svg)
[![Built-With-Love](http://ForTheBadge.com/images/badges/built-with-love.svg)](https://GitHub.com/Naereen/) \
[![MIT_Licence](https://img.shields.io/github/license/Ileriayo/markdown-badges?style=for-the-badge)](./LICENSE)

--------------------------------------------------------------------------------
/__pycache__/misc.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN/7dd0b14399c203b6f3f46b0caf41dcd17b35964a/__pycache__/misc.cpython-36.pyc

--------------------------------------------------------------------------------
/__pycache__/model.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN/7dd0b14399c203b6f3f46b0caf41dcd17b35964a/__pycache__/model.cpython-36.pyc

--------------------------------------------------------------------------------
/__pycache__/text_cleaning.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN/7dd0b14399c203b6f3f46b0caf41dcd17b35964a/__pycache__/text_cleaning.cpython-36.pyc

--------------------------------------------------------------------------------
/data_config.json:
--------------------------------------------------------------------------------
{
    "dataset_full_path": "twiteer_dataset_main.csv",
    "num_neg_labels": "10000",
    "num_pos_labels": "10000",
    "trainset_fullpath": "data/train.csv",
    "testset_fullpath": "data/test.csv",
    "num_training_sample": "15000"
}
"dataset_full_path": "twiteer_dataset_main.csv", 3 | "num_neg_labels": "10000", 4 | "num_pos_labels": "10000", 5 | "trainset_fullpath": "data/train.csv", 6 | "testset_fullpath": "data/test.csv", 7 | "num_training_sample": "15000" 8 | } -------------------------------------------------------------------------------- /data_processing.py: -------------------------------------------------------------------------------- 1 | "Author: Sagar Chakraborty" 2 | import pandas as pd 3 | import numpy as np 4 | import json 5 | 6 | 7 | # parsing the data_config.json to get the params 8 | """ 9 | { 10 | "dataset_full_path": "twiteer_dataset_main.csv", 11 | "num_neg_labels": "10000", 12 | "num_pos_labels": "10000", 13 | "trainset_fullpath": "data/train.csv", 14 | "testset_fullpath": "data/test.csv", 15 | "num_training_sample": "15000" 16 | }""" 17 | 18 | 19 | f = open('data_config.json','r') 20 | config_data = json.loads(f.read()) 21 | 22 | dataset= config_data['dataset_full_path'] 23 | neg_labels = int(config_data['num_neg_labels']) 24 | pos_labels = int(config_data['num_pos_labels']) 25 | train_filepath = config_data['trainset_fullpath'] 26 | test_filepath = config_data['testset_fullpath'] 27 | training_samples = int(config_data['num_training_sample']) 28 | 29 | 30 | 31 | 32 | 33 | 34 | dataset_df = pd.read_csv(dataset,names=['labels','id','datetime','query','username','sentences'],header= None,sep=',',encoding = "ISO-8859-1") 35 | # 36 | print(dataset_df['labels'].head(5)) 37 | 38 | # if dataset needs to cut short here we took 10000 for each class 39 | filter_0_df = dataset_df[dataset_df['labels']== int(0)].sample(n=neg_labels) 40 | filter_4_df = dataset_df[dataset_df['labels']== int(4)].sample(n=pos_labels) 41 | 42 | 43 | # dropping unneccessary columns 44 | filter_0_df.drop(['id','datetime','query','username'], axis=1,inplace=True) 45 | filter_4_df.drop(['id','datetime','query','username'],axis=1,inplace=True) 46 | 47 | print(filter_0_df) 48 | print(filter_4_df) 49 | 50 | 51 | 52 | # merging datasframe of two class to construct final dataframe 53 | final_df = pd.concat([filter_0_df,filter_4_df]) 54 | 55 | print(len(final_df)) 56 | #drop the rows wherever we have links/url in the tweets mostly spam 57 | final_df= final_df.drop(final_df[final_df.sentences.str.contains(r'http\S+|www.\S+')].index) 58 | 59 | #shuffle the dataset 60 | final_df=final_df.sample(n= len(final_df), random_state=42) 61 | 62 | # dirty way to convert labels in 0 and 1 63 | final_df[final_df['labels']>0]=1 64 | print(len(final_df)) 65 | 66 | # removing the headers 67 | final_df= final_df[1:] 68 | 69 | 70 | # train_test split 71 | train_df= final_df[:training_samples] 72 | test_df= final_df[training_samples:] 73 | 74 | #saving data 75 | train_df.to_csv(train_filepath) 76 | test_df.to_csv(test_filepath) 77 | 78 | -------------------------------------------------------------------------------- /misc.py: -------------------------------------------------------------------------------- 1 | "Author: Sagar Chakraborty" 2 | from ignite.engine import Engine 3 | import os 4 | from torch.autograd import Variable 5 | 6 | from ignite.exceptions import NotComputableError 7 | from ignite.metrics.metric import Metric 8 | import torch 9 | import torch.nn as nn 10 | import torch.optim as optim 11 | from torchtext.legacy import data 12 | from text_cleaning import cleanup_text 13 | from model import RNN 14 | 15 | from pydoc import locate 16 | from torch.nn.parallel import DataParallel 17 | 18 | from ignite.engine import Engine 19 | from ignite.handlers 
--------------------------------------------------------------------------------
/misc.py:
--------------------------------------------------------------------------------
"Author: Sagar Chakraborty"
import glob
import logging
import os

import torch
import torch.nn as nn
from ignite.engine import Engine


def create_supervised_evaluator(model, inference_fn, metrics=None, cuda=False):
    """
    Factory function for creating an evaluator for supervised models.
    Extended version of ignite's create_supervised_evaluator.

    Args:
        model (torch.nn.Module): the model to evaluate
        inference_fn (function): inference function
        metrics (dict of str: Metric, optional): a map of metric names to Metrics
        cuda (bool, optional): whether or not to transfer the batch to the GPU
            (default: False)

    Returns:
        Engine: an evaluator engine with a supervised inference function
    """
    engine = Engine(inference_fn)

    for name, metric in (metrics or {}).items():
        metric.attach(engine, name)

    return engine


class ModelLoader(object):
    def __init__(self, model, dirname, filename_prefix):
        self._dirname = dirname
        self._fname_prefix = filename_prefix
        self._model = model
        self._fname = os.path.join(dirname, filename_prefix)
        self.skip_load = False

        # Ensure model is an nn.Module
        if not isinstance(model, nn.Module):
            raise ValueError("model should be an object of nn.Module")

        # Ensure that dirname exists
        if not os.path.exists(dirname):
            self.skip_load = True
            logging.warning(
                "Dir '{}' is not found, skip restoring model".format(dirname)
            )

        if len(glob.glob(self._fname + "_*")) == 0:
            self.skip_load = True
            logging.warning(
                "File '{}-*.pth' is not found, skip restoring model".format(self._fname)
            )

    def _load(self, path):
        if not self.skip_load:
            # Pick the latest checkpoint matching the prefix.
            models = sorted(glob.glob(path))
            latest_model = models[-1]

            try:
                if isinstance(self._model, nn.parallel.DataParallel):
                    self._model.module.load_state_dict(torch.load(latest_model))
                else:
                    self._model.load_state_dict(torch.load(latest_model))
                print("Successfully loaded {}!".format(latest_model))
            except Exception as e:
                logging.exception(
                    "Something went wrong while restoring the model: %s" % str(e)
                )

    def __call__(self, engine, infix_name):
        path = self._fname + "_" + infix_name + "_*"
        self._load(path=path)

--------------------------------------------------------------------------------
/model.py:
--------------------------------------------------------------------------------
"Author: Sagar Chakraborty"

import torch
import torch.nn as nn


class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()

        # input_dim is the vocabulary size; each token id maps to a dense vector
        self.embedding = nn.Embedding(input_dim, embedding_dim)

        # a vanilla (Elman) RNN over the embedded sequence
        self.rnn = nn.RNN(embedding_dim, hidden_dim)

        # linear head mapping the final hidden state to a single logit
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):
        # text = [sent len, batch size]

        # first pass the text through the embedding layer
        embedded = self.embedding(text)
        # embedded = [sent len, batch size, emb dim]

        output, hidden = self.rnn(embedded)
        # output = [sent len, batch size, hid dim]  (hidden state at every step)
        # hidden = [1, batch size, hid dim]         (final hidden state)

        # the last timestep of output is exactly the final hidden state
        assert torch.equal(output[-1, :, :], hidden.squeeze(0))

        return self.fc(hidden.squeeze(0))
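A small shape check for the RNN class above. The dimensions mirror project_config.json; the batch itself is random and purely illustrative:

```python
import torch

from model import RNN

# Dimensions taken from project_config.json; the input batch is synthetic.
model = RNN(input_dim=25000, embedding_dim=100, hidden_dim=256, output_dim=1)

# A fake batch: 10 tokens per sentence, 64 sentences, token ids in [0, 25000).
text = torch.randint(0, 25000, (10, 64))

logits = model(text)
print(logits.shape)  # torch.Size([64, 1]), one logit per sentence
```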
--------------------------------------------------------------------------------
/project_config.json:
--------------------------------------------------------------------------------
{
    "data_path": "data",
    "model_dir": "saved_model",
    "device": "-1",
    "model_name": "sentiment_classifer_rnn_sagar.pt",
    "embedding_dim": "100",
    "hidden_dim": "256",
    "output_dim": "1",
    "batch_size": "64",
    "max_vocab_size": "25000",
    "learning_rate": "1e-3",
    "epoch": "20"
}

--------------------------------------------------------------------------------
/readme:
--------------------------------------------------------------------------------
1. Please download the dataset from this link and keep it in the main directory: https://www.kaggle.com/kazanova/sentiment140
2. Put the csv file in the working directory and set its full path in data_config.json.
3. In data_config.json you need to provide:

{
    "dataset_full_path": "twiteer_dataset_main.csv",  # full path of the main dataset
    "num_neg_labels": "10000",                        # number of negative samples the dataset should contain
    "num_pos_labels": "10000",                        # number of positive samples the dataset should contain
    "trainset_fullpath": "data/train.csv",            # path to save train.csv
    "testset_fullpath": "data/test.csv",              # path to save test.csv
    "num_training_sample": "15000"                    # number of training samples in train.csv
}

4. Run python data_processing.py. This will prepare the data by splitting it into train and test sets.
5. To run this project run: python train.py. To run train.py you need to provide certain params in project_config.json:

{
    "data_path": "data",
    "model_dir": "saved_model",
    "device": "-1",
    "model_name": "sentiment_classifer_rnn_sagar.pt",
    "embedding_dim": "100",
    "hidden_dim": "256",
    "output_dim": "1",
    "batch_size": "64",
    "max_vocab_size": "25000",
    "learning_rate": "1e-3",
    "epoch": "20"
}

To change the dataset or any training parameter, edit project_config.json.

Due to RAM limitations I took 20000 samples from the main dataset, which achieved 99.23% accuracy on the train dataset.
The test accuracy is ~68%, which can be improved by using more data or training for more epochs.
References:
https://github.com/akurniawan/pytorch-sentiment-analysis
https://github.com/bentrevett/pytorch-sentiment-analysis

--------------------------------------------------------------------------------
/requirement.txt:
--------------------------------------------------------------------------------
torch==1.8.1
torchtext==0.9.1
pytorch-ignite==0.4.4
pandas==1.0.5
numpy==1.19.3

--------------------------------------------------------------------------------
/saved_model/sentiment_classifer_rnn_sagar.pt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN/7dd0b14399c203b6f3f46b0caf41dcd17b35964a/saved_model/sentiment_classifer_rnn_sagar.pt

--------------------------------------------------------------------------------
/text_cleaning.py:
--------------------------------------------------------------------------------
"Author: Sagar Chakraborty"
import re


def cleanup_text(texts):
    cleaned_text = []
    for text in texts:
        # remove ugly &quot; and &amp; HTML entities
        text = re.sub(r"&quot;(.*?)&quot;", r"\g<1>", text)
        text = re.sub(r"&amp;", "", text)

        # replace emoticons with a placeholder token
        text = re.sub(
            r"(^| )(\:\w+\:|\<[\/\\]?3|[\(\)\\\D|\*\$][\-\^]?[\:\;\=]|[\:\;\=B8][\-\^]?[3DOPp\@\$\*\\\)\(\/\|])(?=\s|[\!\.\?]|$)",
            r"\g<1>TOKEMOTICON",
            text,
        )

        # lowercase everything, then restore the emoticon token's casing
        text = text.lower()
        text = text.replace("tokemoticon", "TOKEMOTICON")

        # replace URLs
        text = re.sub(
            r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?",
            "TOKURL",
            text,
        )

        # replace @mentions
        text = re.sub(r"@[\w]+", "TOKMENTION", text)

        # replace #hashtags
        text = re.sub(r"#[\w]+", "TOKHASHTAG", text)

        # replace dollar amounts
        text = re.sub(r"\$\d+", "TOKDOLLAR", text)

        # remove punctuation
        text = re.sub("[^a-zA-Z0-9]", " ", text)

        # collapse multiple spaces
        text = re.sub(r" +", " ", text)

        # remove newlines
        text = re.sub(r"\n", " ", text)

        cleaned_text.append(text)
    return cleaned_text
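A quick illustration of cleanup_text on a made-up tweet; the output shown in the comment is what I would expect from the substitution order above:

```python
from text_cleaning import cleanup_text

# Hypothetical raw tweet; cleanup_text expects an iterable of strings.
tweets = ["@user check https://example.com :) #fun"]
print(cleanup_text(tweets))
# expected: ['TOKMENTION check TOKURL TOKEMOTICON TOKHASHTAG']
```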
--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
"Author: Sagar Chakraborty"
import json
import os
import random
import time

import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.legacy import data

from misc import ModelLoader, create_supervised_evaluator
from model import RNN
from text_cleaning import cleanup_text


class trainer(object):
    def __init__(self, data_path, model_dir, model_name, device=-1):
        self.data_path = data_path
        self.model_dir = model_dir
        self.model_name = model_name
        self.device = device

    @staticmethod
    def train(model, iterator, optimizer, criterion):
        """Run one training epoch. model.train() puts the model in training mode.
        Every batch picked up from the iterator is sent to the model to get
        predictions; the criterion compares predictions with the true labels to
        produce the loss, binary_accuracy measures how many predictions round to
        the correct label, and loss.backward() backpropagates before
        optimizer.step() updates the weights."""
        epoch_loss = 0
        epoch_acc = 0

        model.train()

        for batch in iterator:
            optimizer.zero_grad()

            predictions = model(batch.sentences[0]).squeeze(1)

            loss = criterion(predictions, batch.labels)

            acc = trainer.binary_accuracy(predictions, batch.labels)

            loss.backward()  # back propagation

            optimizer.step()  # weight update

            epoch_loss += loss.item()
            epoch_acc += acc.item()

        return epoch_loss / len(iterator), epoch_acc / len(iterator)

    @staticmethod
    def count_parameters(model):
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    @staticmethod
    def binary_accuracy(preds, y):
        """
        Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8.
        """
        # round predictions to the closest integer
        rounded_preds = torch.round(torch.sigmoid(preds))
        correct = (rounded_preds == y).float()  # convert to float for division
        acc = correct.sum() / len(correct)
        return acc

    @staticmethod
    def evaluate(model, iterator, criterion):
        epoch_loss = 0
        epoch_acc = 0

        model.eval()

        with torch.no_grad():
            for batch in iterator:
                predictions = model(batch.sentences[0]).squeeze(1)

                loss = criterion(predictions, batch.labels)

                acc = trainer.binary_accuracy(predictions, batch.labels)

                epoch_loss += loss.item()
                epoch_acc += acc.item()

        return epoch_loss / len(iterator), epoch_acc / len(iterator)

    @staticmethod
    def epoch_time(start_time, end_time):
        elapsed_time = end_time - start_time
        elapsed_mins = int(elapsed_time / 60)
        elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
        return elapsed_mins, elapsed_secs


# This is what the project_config.json data looks like:
"""
{
    "data_path": "data",
    "model_dir": "saved_model",
    "device": "-1",
    "model_name": "sentiment_classifer_rnn_sagar.pt",
    "embedding_dim": "100",
    "hidden_dim": "256",
    "output_dim": "1",
    "batch_size": "64",
    "max_vocab_size": "25000",
    "learning_rate": "1e-3",
    "epoch": "20"
}
"""

with open('project_config.json', 'r') as f:
    config_data = json.load(f)

EMBEDDING_DIM = int(config_data['embedding_dim'])
HIDDEN_DIM = int(config_data['hidden_dim'])
OUTPUT_DIM = int(config_data['output_dim'])
BATCH_SIZE = int(config_data['batch_size'])
MAX_VOCAB_SIZE = int(config_data['max_vocab_size'])

# Parameters we have provided for our model (from project_config.json):
# EMBEDDING_DIM = 100, HIDDEN_DIM = 256, OUTPUT_DIM = 1,
# BATCH_SIZE = 64, MAX_VOCAB_SIZE = 25_000

data_path = config_data['data_path']
model_dir = config_data['model_dir']
device = -1
model_name = config_data['model_name']
learning_rate = float(config_data['learning_rate'])
num_epoch = int(config_data['epoch'])


################-------##################
Model_trainer = trainer(data_path, model_dir, model_name, device)

# seed for reproducibility
torch.manual_seed(0)
if torch.cuda.is_available():
    torch.cuda.manual_seed(0)
    device = None
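# The torchtext (legacy) pipeline below works in four steps:
#   1. data.Field / data.LabelField declare how each CSV column is processed;
#      cleanup_text runs as token-level preprocessing, and include_lengths=True
#      makes each batch a (tensor, lengths) tuple, which is why the trainer
#      indexes batch.sentences[0].
#   2. data.TabularDataset.splits reads train.csv and test.csv from data_path.
#   3. build_vocab builds the token-to-id mapping from the training split,
#      capped at MAX_VOCAB_SIZE.
#   4. data.BucketIterator.splits groups sentences of similar length into
#      batches to minimise padding.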
tokenize = lambda s: s.split()

text = data.Field(
    preprocessing=cleanup_text, include_lengths=True, tokenize=tokenize
)

sentiment = data.LabelField(dtype=torch.float)
train, test = data.TabularDataset.splits(
    Model_trainer.data_path,
    train="train.csv",
    validation="test.csv",
    format="csv",
    fields=[("labels", sentiment), ("sentences", text)],
)

print(len(train), len(test))

print(vars(train.examples[5]))

train_data, valid_data = train.split(random_state=random.seed(42))
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test)}')

# build the vocabularies from the training split only
text.build_vocab(train_data, max_size=MAX_VOCAB_SIZE)
sentiment.build_vocab(train_data)

print(f"Unique tokens in TEXT vocabulary: {len(text.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(sentiment.vocab)}")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    datasets=[train_data, valid_data, test],
    batch_size=BATCH_SIZE,
    sort_within_batch=True,
    sort_key=lambda x: len(x.sentences),
    device=device,
)

INPUT_DIM = len(text.vocab)
model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)
print(f'The model has {Model_trainer.count_parameters(model):,} trainable parameters')
model = model.to(device)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.BCEWithLogitsLoss()
criterion = criterion.to(device)


N_EPOCHS = num_epoch

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()

    train_loss, train_acc = Model_trainer.train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = Model_trainer.evaluate(model, valid_iterator, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = Model_trainer.epoch_time(start_time, end_time)

    # keep the checkpoint with the best validation loss
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), os.path.join(Model_trainer.model_dir, Model_trainer.model_name))

    print(f'Epoch: {epoch + 1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Accuracy: {train_acc * 100:.2f}%')
    print(f'\tValidation Loss: {valid_loss:.3f} | Validation Accuracy: {valid_acc * 100:.2f}%')


# Load the best checkpoint from disk and report the test loss and accuracy
model.load_state_dict(torch.load(os.path.join(Model_trainer.model_dir, Model_trainer.model_name)))

test_loss, test_acc = Model_trainer.evaluate(model, test_iterator, criterion)

print(f'Overall Test Loss: {test_loss:.3f} | Overall Test Accuracy: {test_acc * 100:.2f}%')
--------------------------------------------------------------------------------
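Finally, a hedged sketch (not part of the repository) of scoring a single tweet with the trained checkpoint. It reuses `text`, `model`, `device`, `sentiment`, and `cleanup_text` from train.py's namespace; the tweet is made up, and whether a probability near 1.0 means "positive" depends on the label mapping built during training, so `sentiment.vocab.stoi` is printed for reference:

```python
def predict_sentiment(sentence):
    """Return the sigmoid probability the model assigns to a raw tweet."""
    model.eval()
    tokens = cleanup_text(sentence.split())      # same tokenize + cleaning as the text Field
    ids = [text.vocab.stoi[t] for t in tokens]   # numericalize with the trained vocabulary
    tensor = torch.LongTensor(ids).unsqueeze(1).to(device)  # shape [sent len, 1]
    with torch.no_grad():
        return torch.sigmoid(model(tensor)).item()

print(predict_sentiment("I love this movie, what a great day!"))
print(sentiment.vocab.stoi)  # shows which class index corresponds to label 0 or 1
```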