├── LICENSE
├── README.md
├── __pycache__
│   ├── misc.cpython-36.pyc
│   ├── model.cpython-36.pyc
│   └── text_cleaning.cpython-36.pyc
├── data
│   ├── test.csv
│   └── train.csv
├── data_config.json
├── data_processing.py
├── misc.py
├── model.py
├── project_config.json
├── readme
├── requirement.txt
├── saved_model
│   └── sentiment_classifer_rnn_sagar.pt
├── text_cleaning.py
└── train.py

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2021 Sagchakr

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Twitter Sentiment Analysis Using RNN - PyTorch

[![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)
[PyTorch](https://pytorch.org/)
[Pandas](https://pandas.pydata.org/)
[NumPy](https://numpy.org/)
[Kaggle](https://www.kaggle.com/kazanova/sentiment140)

[![Python 3.6.8](https://img.shields.io/badge/python-3.6.8-blue.svg)](https://www.python.org/downloads/release/python-368/)
[![torch 1.8.1](https://img.shields.io/badge/torch-1.8.1-orange.svg)](https://pypi.org/project/torch/1.8.1/)
[![torch_text 0.9.1](https://img.shields.io/badge/torchtext-0.9.1-orange.svg)](https://pypi.org/project/torchtext/0.9.1/)
[![pytorch_ignite 0.4.4](https://img.shields.io/badge/pytorch--ignite-0.4.4-orange.svg)](https://pypi.org/project/pytorch-ignite/0.4.4/)

[![Stargazers repo roster for @BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN](https://reporoster.com/stars/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN)](https://github.com/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN/stargazers)
[![Forkers repo roster for @BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN](https://reporoster.com/forks/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN)](https://github.com/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN/network/members)

## Introduction

![intro_image](https://i.morioh.com/2020/02/04/beef36fd707d.jpg) \
Image Source: morioh
Sentiment analysis is the automated process of recognizing and categorizing subjective information in text. Twitter sentiment analysis applies this to tweets: machine learning models separate positive tweets from negative ones through classification, text mining, text analysis, and data visualization built on natural language processing.

![project_workflow_example](https://user-images.githubusercontent.com/49767657/121781346-dbb30000-cbc1-11eb-809a-a016d7a6092f.png) \
Image Source: Google

## Requirements

This project was developed on [Windows 10](https://www.microsoft.com/en-in/software-download/windows10). \
Clone or download this repository, then run `pip install -r requirement.txt` to install all the libraries required to run this project.
You can also click on the specific badges mentioned [above](#requirements) and download each dependency individually.

- **torch==1.8.1+cu**
- **torchtext==0.9.1**
- **pytorch-ignite==0.4.4**
- **pandas==1.0.5**
- **numpy==1.19.3**

### IDE
I used [PyCharm](https://www.jetbrains.com/pycharm/) for this project; I generally prefer PyCharm for building large end-to-end projects.
For data analytics and visualization, though, I recommend [Jupyter](https://jupyter.org/). If you don't know Jupyter yet (it is quite beginner-friendly), go through this [tutorial](https://www.tutorialspoint.com/jupyter/index.htm).

## To Run this Project
1. Download the dataset from [Kaggle](#requirements) (linked above) and keep it in the main directory.
2. Put the CSV file in the working directory and set its full path in **data_config.json**.
3. In **data_config.json** you need to provide:

   i) `dataset_full_path`: **full path of the main dataset** \
   ii) `num_neg_labels`: **number of negative samples the dataset should contain** \
   iii) `num_pos_labels`: **number of positive samples the dataset should contain** \
   iv) `trainset_fullpath`: **path to save train.csv** \
   v) `testset_fullpath`: **path to save test.csv** \
   vi) `num_training_sample`: **number of training samples in train.csv**

4. *Run* `python data_processing.py`: **this prepares the data by splitting it into train and test sets.**
5. You also need to provide certain parameters in **project_config.json** (see the sketch after this list for how they are read):

   i) `data_path`: **data** \
   ii) `model_dir`: **saved_model** \
   iii) `device`: **-1** \
   iv) `model_name`: **sentiment_classifer_rnn_sagar.pt** \
   v) `embedding_dim`: **100** \
   vi) `hidden_dim`: **256** \
   vii) `output_dim`: **1** \
   viii) `batch_size`: **64** \
   ix) `max_vocab_size`: **25000** \
   x) `learning_rate`: **1e-3** \
   xi) `epoch`: **20**
6. After performing the above steps, execute `python train.py`.

**To change the dataset or any training parameter, edit project_config.json.**
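Every value in **project_config.json** is stored as a string, so train.py casts the numeric parameters when it loads the file. A minimal sketch of that loading step, mirroring the casts train.py actually performs:

```python
import json

# Minimal sketch of how train.py consumes project_config.json.
# All values are stored as strings in the JSON, so numeric
# parameters are cast explicitly, exactly as train.py does.
with open('project_config.json', 'r') as f:
    config = json.load(f)

embedding_dim = int(config['embedding_dim'])    # 100
hidden_dim = int(config['hidden_dim'])          # 256
batch_size = int(config['batch_size'])          # 64
learning_rate = float(config['learning_rate'])  # 1e-3
num_epoch = int(config['epoch'])                # 20
print(embedding_dim, hidden_dim, batch_size, learning_rate, num_epoch)
```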
## Experiment Details

- Epochs: **20**
- Optimizer: **Adam**
- Learning rate: **1e-3**
- Loss function: **BCEWithLogitsLoss**
- Train accuracy: **99.72%**
- Validation accuracy: **67.68%**
- Test accuracy: **64.54%**
- Number of training examples: **10500**
- Number of validation examples: **4500**
- Number of testing examples: **3962**
- Unique tokens in TEXT vocabulary: **18609**
- Unique tokens in LABEL vocabulary: **2**
- The model has **1,952,805** trainable parameters

Due to RAM limitations I took **20000 samples** from the main dataset, which achieved **99.23% accuracy** on the train dataset.

The test accuracy is about `68%`, which can be improved by training on more data or for more epochs.

## Author
#### Sagar Chakraborty
[Sagar_Chakraborty_LinkedIn](https://www.linkedin.com/in/binaryblackhole/)
[Sagar_Chakraborty_Gmail](https://mail.google.com/mail/u/0/#search/csagar963%40gmail.com)
[Sagar_Chakraborty_GitHub](https://github.com/BinaryBlackhole)

## Thanks to ALL the amazing contributors!
[Akshata_Kulkarni](https://github.com/Akshata-Kulk1)
![GitHub Contributors Image](https://contrib.rocks/image?repo=BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN)

## References
#### Akurniawan - sentiment analysis
[Akurniawan_sentiment_analysis_github](https://github.com/akurniawan/pytorch-sentiment-analysis)
#### Bentrevett - sentiment analysis
[Bentrevett_sentiment_analysis](https://github.com/bentrevett/pytorch-sentiment-analysis)

**If you like this project, please fork and star it!**

![open_source](https://forthebadge.com/images/badges/open-source.svg)
[![Built-With-Love](http://ForTheBadge.com/images/badges/built-with-love.svg)](https://GitHub.com/Naereen/) \
[![MIT_Licence](https://img.shields.io/github/license/Ileriayo/markdown-badges?style=for-the-badge)](./LICENSE)

--------------------------------------------------------------------------------
/__pycache__/misc.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN/7dd0b14399c203b6f3f46b0caf41dcd17b35964a/__pycache__/misc.cpython-36.pyc

--------------------------------------------------------------------------------
/__pycache__/model.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN/7dd0b14399c203b6f3f46b0caf41dcd17b35964a/__pycache__/model.cpython-36.pyc

--------------------------------------------------------------------------------
/__pycache__/text_cleaning.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN/7dd0b14399c203b6f3f46b0caf41dcd17b35964a/__pycache__/text_cleaning.cpython-36.pyc

--------------------------------------------------------------------------------
/data_config.json:
--------------------------------------------------------------------------------
{
    "dataset_full_path": "twiteer_dataset_main.csv",
    "num_neg_labels": "10000",
    "num_pos_labels": "10000",
    "trainset_fullpath": "data/train.csv",
    "testset_fullpath": "data/test.csv",
    "num_training_sample": "15000"
}
"dataset_full_path": "twiteer_dataset_main.csv", 3 | "num_neg_labels": "10000", 4 | "num_pos_labels": "10000", 5 | "trainset_fullpath": "data/train.csv", 6 | "testset_fullpath": "data/test.csv", 7 | "num_training_sample": "15000" 8 | } -------------------------------------------------------------------------------- /data_processing.py: -------------------------------------------------------------------------------- 1 | "Author: Sagar Chakraborty" 2 | import pandas as pd 3 | import numpy as np 4 | import json 5 | 6 | 7 | # parsing the data_config.json to get the params 8 | """ 9 | { 10 | "dataset_full_path": "twiteer_dataset_main.csv", 11 | "num_neg_labels": "10000", 12 | "num_pos_labels": "10000", 13 | "trainset_fullpath": "data/train.csv", 14 | "testset_fullpath": "data/test.csv", 15 | "num_training_sample": "15000" 16 | }""" 17 | 18 | 19 | f = open('data_config.json','r') 20 | config_data = json.loads(f.read()) 21 | 22 | dataset= config_data['dataset_full_path'] 23 | neg_labels = int(config_data['num_neg_labels']) 24 | pos_labels = int(config_data['num_pos_labels']) 25 | train_filepath = config_data['trainset_fullpath'] 26 | test_filepath = config_data['testset_fullpath'] 27 | training_samples = int(config_data['num_training_sample']) 28 | 29 | 30 | 31 | 32 | 33 | 34 | dataset_df = pd.read_csv(dataset,names=['labels','id','datetime','query','username','sentences'],header= None,sep=',',encoding = "ISO-8859-1") 35 | # 36 | print(dataset_df['labels'].head(5)) 37 | 38 | # if dataset needs to cut short here we took 10000 for each class 39 | filter_0_df = dataset_df[dataset_df['labels']== int(0)].sample(n=neg_labels) 40 | filter_4_df = dataset_df[dataset_df['labels']== int(4)].sample(n=pos_labels) 41 | 42 | 43 | # dropping unneccessary columns 44 | filter_0_df.drop(['id','datetime','query','username'], axis=1,inplace=True) 45 | filter_4_df.drop(['id','datetime','query','username'],axis=1,inplace=True) 46 | 47 | print(filter_0_df) 48 | print(filter_4_df) 49 | 50 | 51 | 52 | # merging datasframe of two class to construct final dataframe 53 | final_df = pd.concat([filter_0_df,filter_4_df]) 54 | 55 | print(len(final_df)) 56 | #drop the rows wherever we have links/url in the tweets mostly spam 57 | final_df= final_df.drop(final_df[final_df.sentences.str.contains(r'http\S+|www.\S+')].index) 58 | 59 | #shuffle the dataset 60 | final_df=final_df.sample(n= len(final_df), random_state=42) 61 | 62 | # dirty way to convert labels in 0 and 1 63 | final_df[final_df['labels']>0]=1 64 | print(len(final_df)) 65 | 66 | # removing the headers 67 | final_df= final_df[1:] 68 | 69 | 70 | # train_test split 71 | train_df= final_df[:training_samples] 72 | test_df= final_df[training_samples:] 73 | 74 | #saving data 75 | train_df.to_csv(train_filepath) 76 | test_df.to_csv(test_filepath) 77 | 78 | -------------------------------------------------------------------------------- /misc.py: -------------------------------------------------------------------------------- 1 | "Author: Sagar Chakraborty" 2 | from ignite.engine import Engine 3 | import os 4 | from torch.autograd import Variable 5 | 6 | from ignite.exceptions import NotComputableError 7 | from ignite.metrics.metric import Metric 8 | import torch 9 | import torch.nn as nn 10 | import torch.optim as optim 11 | from torchtext.legacy import data 12 | from text_cleaning import cleanup_text 13 | from model import RNN 14 | 15 | from pydoc import locate 16 | from torch.nn.parallel import DataParallel 17 | 18 | from ignite.engine import Engine 19 | from ignite.handlers 
--------------------------------------------------------------------------------
/misc.py:
--------------------------------------------------------------------------------
"Author: Sagar Chakraborty"
import glob
import logging
import os

import torch
import torch.nn as nn
from ignite.engine import Engine


def create_supervised_evaluator(model, inference_fn, metrics=None, cuda=False):
    """
    Factory function for creating an evaluator for supervised models.
    Extended version of ignite's create_supervised_evaluator.

    Args:
        model (torch.nn.Module): the model to evaluate
        inference_fn (function): inference function
        metrics (dict of str: Metric, optional): a map of metric names to Metrics
        cuda (bool, optional): whether or not to transfer the batch to the GPU
            (default: False)

    Returns:
        Engine: an evaluator engine with a supervised inference function
    """
    engine = Engine(inference_fn)

    for name, metric in (metrics or {}).items():
        metric.attach(engine, name)

    return engine


class ModelLoader(object):
    def __init__(self, model, dirname, filename_prefix):
        self._dirname = dirname
        self._fname_prefix = filename_prefix
        self._model = model
        self._fname = os.path.join(dirname, filename_prefix)
        self.skip_load = False

        # Ensure model is an nn.Module
        if not isinstance(model, nn.Module):
            raise ValueError("model should be an object of nn.Module")

        # Ensure that dirname exists
        if not os.path.exists(dirname):
            self.skip_load = True
            logging.warning(
                "Dir '{}' is not found, skip restoring model".format(dirname)
            )

        if len(glob.glob(self._fname + "_*")) == 0:
            self.skip_load = True
            logging.warning(
                "File '{}-*.pth' is not found, skip restoring model".format(self._fname)
            )

    def _load(self, path):
        if not self.skip_load:
            # Pick the latest checkpoint matching the prefix.
            models = sorted(glob.glob(path))
            latest_model = models[-1]

            try:
                if isinstance(self._model, nn.parallel.DataParallel):
                    self._model.module.load_state_dict(torch.load(latest_model))
                else:
                    self._model.load_state_dict(torch.load(latest_model))
                print("Successfully loaded {}!".format(latest_model))
            except Exception as e:
                logging.exception(
                    "Something went wrong while restoring the model: %s" % str(e)
                )

    def __call__(self, engine, infix_name):
        path = self._fname + "_" + infix_name + "_*"
        self._load(path=path)

--------------------------------------------------------------------------------
/model.py:
--------------------------------------------------------------------------------
"Author: Sagar Chakraborty"

import torch
import torch.nn as nn


class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()

        # input_dim is the vocabulary size; each token id maps to a dense vector
        self.embedding = nn.Embedding(input_dim, embedding_dim)

        # a vanilla (Elman) RNN over the embedded sequence
        self.rnn = nn.RNN(embedding_dim, hidden_dim)

        # linear head mapping the final hidden state to a single logit
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):
        # text = [sent len, batch size]

        # first pass the text through the embedding layer
        embedded = self.embedding(text)
        # embedded = [sent len, batch size, emb dim]

        output, hidden = self.rnn(embedded)
        # output = [sent len, batch size, hid dim]  (hidden state at every step)
        # hidden = [1, batch size, hid dim]         (final hidden state)

        # the last timestep of output is exactly the final hidden state
        assert torch.equal(output[-1, :, :], hidden.squeeze(0))

        return self.fc(hidden.squeeze(0))
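A small shape check for the RNN class above. The dimensions mirror project_config.json; the batch itself is random and purely illustrative:

```python
import torch

from model import RNN

# Dimensions taken from project_config.json; the input batch is synthetic.
model = RNN(input_dim=25000, embedding_dim=100, hidden_dim=256, output_dim=1)

# A fake batch: 10 tokens per sentence, 64 sentences, token ids in [0, 25000).
text = torch.randint(0, 25000, (10, 64))

logits = model(text)
print(logits.shape)  # torch.Size([64, 1]), one logit per sentence
```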
--------------------------------------------------------------------------------
/project_config.json:
--------------------------------------------------------------------------------
{
    "data_path": "data",
    "model_dir": "saved_model",
    "device": "-1",
    "model_name": "sentiment_classifer_rnn_sagar.pt",
    "embedding_dim": "100",
    "hidden_dim": "256",
    "output_dim": "1",
    "batch_size": "64",
    "max_vocab_size": "25000",
    "learning_rate": "1e-3",
    "epoch": "20"
}

--------------------------------------------------------------------------------
/readme:
--------------------------------------------------------------------------------
1. Please download the dataset from this link and keep it in the main directory: https://www.kaggle.com/kazanova/sentiment140
2. Put the csv file in the working directory and set its full path in data_config.json.
3. In data_config.json you need to provide:

{
    "dataset_full_path": "twiteer_dataset_main.csv",  # full path of the main dataset
    "num_neg_labels": "10000",                        # number of negative samples the dataset should contain
    "num_pos_labels": "10000",                        # number of positive samples the dataset should contain
    "trainset_fullpath": "data/train.csv",            # path to save train.csv
    "testset_fullpath": "data/test.csv",              # path to save test.csv
    "num_training_sample": "15000"                    # number of training samples in train.csv
}

4. Run python data_processing.py. This will prepare the data by splitting it into train and test sets.
5. To run this project run: python train.py. To run train.py you need to provide certain params in project_config.json:

{
    "data_path": "data",
    "model_dir": "saved_model",
    "device": "-1",
    "model_name": "sentiment_classifer_rnn_sagar.pt",
    "embedding_dim": "100",
    "hidden_dim": "256",
    "output_dim": "1",
    "batch_size": "64",
    "max_vocab_size": "25000",
    "learning_rate": "1e-3",
    "epoch": "20"
}

To change the dataset or any training parameter, edit project_config.json.

Due to RAM limitations I took 20000 samples from the main dataset, which achieved 99.23% accuracy on the train dataset.
The test accuracy is ~68%, which can be improved by using more data or training for more epochs.
References:
https://github.com/akurniawan/pytorch-sentiment-analysis
https://github.com/bentrevett/pytorch-sentiment-analysis

--------------------------------------------------------------------------------
/requirement.txt:
--------------------------------------------------------------------------------
torch==1.8.1
torchtext==0.9.1
pytorch-ignite==0.4.4
pandas==1.0.5
numpy==1.19.3

--------------------------------------------------------------------------------
/saved_model/sentiment_classifer_rnn_sagar.pt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN/7dd0b14399c203b6f3f46b0caf41dcd17b35964a/saved_model/sentiment_classifer_rnn_sagar.pt

--------------------------------------------------------------------------------
/text_cleaning.py:
--------------------------------------------------------------------------------
"Author: Sagar Chakraborty"
import re


def cleanup_text(texts):
    cleaned_text = []
    for text in texts:
        # remove ugly &quot; and &amp; HTML entities
        text = re.sub(r"&quot;(.*?)&quot;", r"\g<1>", text)
        text = re.sub(r"&amp;", "", text)

        # replace emoticons with a placeholder token
        text = re.sub(
            r"(^| )(\:\w+\:|\<[\/\\]?3|[\(\)\\\D|\*\$][\-\^]?[\:\;\=]|[\:\;\=B8][\-\^]?[3DOPp\@\$\*\\\)\(\/\|])(?=\s|[\!\.\?]|$)",
            r"\g<1>TOKEMOTICON",
            text,
        )

        # lowercase everything, then restore the emoticon token's casing
        text = text.lower()
        text = text.replace("tokemoticon", "TOKEMOTICON")

        # replace URLs
        text = re.sub(
            r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?",
            "TOKURL",
            text,
        )

        # replace @mentions
        text = re.sub(r"@[\w]+", "TOKMENTION", text)

        # replace #hashtags
        text = re.sub(r"#[\w]+", "TOKHASHTAG", text)

        # replace dollar amounts
        text = re.sub(r"\$\d+", "TOKDOLLAR", text)

        # remove punctuation
        text = re.sub("[^a-zA-Z0-9]", " ", text)

        # collapse multiple spaces
        text = re.sub(r" +", " ", text)

        # remove newlines
        text = re.sub(r"\n", " ", text)

        cleaned_text.append(text)
    return cleaned_text
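A quick illustration of cleanup_text on a made-up tweet; the output shown in the comment is what I would expect from the substitution order above:

```python
from text_cleaning import cleanup_text

# Hypothetical raw tweet; cleanup_text expects an iterable of strings.
tweets = ["@user check https://example.com :) #fun"]
print(cleanup_text(tweets))
# expected: ['TOKMENTION check TOKURL TOKEMOTICON TOKHASHTAG']
```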
--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
"Author: Sagar Chakraborty"
import json
import os
import random
import time

import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.legacy import data

from misc import ModelLoader, create_supervised_evaluator
from model import RNN
from text_cleaning import cleanup_text


class trainer(object):
    def __init__(self, data_path, model_dir, model_name, device=-1):
        self.data_path = data_path
        self.model_dir = model_dir
        self.model_name = model_name
        self.device = device

    @staticmethod
    def train(model, iterator, optimizer, criterion):
        """Run one training epoch. model.train() puts the model in training mode.
        Every batch picked up from the iterator is sent to the model to get
        predictions; the criterion compares predictions with the true labels to
        produce the loss, binary_accuracy measures how many predictions round to
        the correct label, and loss.backward() backpropagates before
        optimizer.step() updates the weights."""
        epoch_loss = 0
        epoch_acc = 0

        model.train()

        for batch in iterator:
            optimizer.zero_grad()

            predictions = model(batch.sentences[0]).squeeze(1)

            loss = criterion(predictions, batch.labels)

            acc = trainer.binary_accuracy(predictions, batch.labels)

            loss.backward()  # back propagation

            optimizer.step()  # weight update

            epoch_loss += loss.item()
            epoch_acc += acc.item()

        return epoch_loss / len(iterator), epoch_acc / len(iterator)

    @staticmethod
    def count_parameters(model):
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    @staticmethod
    def binary_accuracy(preds, y):
        """
        Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8.
        """
        # round predictions to the closest integer
        rounded_preds = torch.round(torch.sigmoid(preds))
        correct = (rounded_preds == y).float()  # convert to float for division
        acc = correct.sum() / len(correct)
        return acc

    @staticmethod
    def evaluate(model, iterator, criterion):
        epoch_loss = 0
        epoch_acc = 0

        model.eval()

        with torch.no_grad():
            for batch in iterator:
                predictions = model(batch.sentences[0]).squeeze(1)

                loss = criterion(predictions, batch.labels)

                acc = trainer.binary_accuracy(predictions, batch.labels)

                epoch_loss += loss.item()
                epoch_acc += acc.item()

        return epoch_loss / len(iterator), epoch_acc / len(iterator)

    @staticmethod
    def epoch_time(start_time, end_time):
        elapsed_time = end_time - start_time
        elapsed_mins = int(elapsed_time / 60)
        elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
        return elapsed_mins, elapsed_secs


# This is what the project_config.json data looks like:
"""
{
    "data_path": "data",
    "model_dir": "saved_model",
    "device": "-1",
    "model_name": "sentiment_classifer_rnn_sagar.pt",
    "embedding_dim": "100",
    "hidden_dim": "256",
    "output_dim": "1",
    "batch_size": "64",
    "max_vocab_size": "25000",
    "learning_rate": "1e-3",
    "epoch": "20"
}
"""

with open('project_config.json', 'r') as f:
    config_data = json.load(f)

EMBEDDING_DIM = int(config_data['embedding_dim'])
HIDDEN_DIM = int(config_data['hidden_dim'])
OUTPUT_DIM = int(config_data['output_dim'])
BATCH_SIZE = int(config_data['batch_size'])
MAX_VOCAB_SIZE = int(config_data['max_vocab_size'])

# Parameters we have provided for our model (from project_config.json):
# EMBEDDING_DIM = 100, HIDDEN_DIM = 256, OUTPUT_DIM = 1,
# BATCH_SIZE = 64, MAX_VOCAB_SIZE = 25_000

data_path = config_data['data_path']
model_dir = config_data['model_dir']
device = -1
model_name = config_data['model_name']
learning_rate = float(config_data['learning_rate'])
num_epoch = int(config_data['epoch'])


################-------##################
Model_trainer = trainer(data_path, model_dir, model_name, device)

# seed for reproducibility
torch.manual_seed(0)
if torch.cuda.is_available():
    torch.cuda.manual_seed(0)
    device = None
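# The torchtext (legacy) pipeline below works in four steps:
#   1. data.Field / data.LabelField declare how each CSV column is processed;
#      cleanup_text runs as token-level preprocessing, and include_lengths=True
#      makes each batch a (tensor, lengths) tuple, which is why the trainer
#      indexes batch.sentences[0].
#   2. data.TabularDataset.splits reads train.csv and test.csv from data_path.
#   3. build_vocab builds the token-to-id mapping from the training split,
#      capped at MAX_VOCAB_SIZE.
#   4. data.BucketIterator.splits groups sentences of similar length into
#      batches to minimise padding.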
tokenize = lambda s: s.split()

text = data.Field(
    preprocessing=cleanup_text, include_lengths=True, tokenize=tokenize
)

sentiment = data.LabelField(dtype=torch.float)
train, test = data.TabularDataset.splits(
    Model_trainer.data_path,
    train="train.csv",
    validation="test.csv",
    format="csv",
    fields=[("labels", sentiment), ("sentences", text)],
)

print(len(train), len(test))

print(vars(train.examples[5]))

train_data, valid_data = train.split(random_state=random.seed(42))
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test)}')

# build the vocabularies from the training split only
text.build_vocab(train_data, max_size=MAX_VOCAB_SIZE)
sentiment.build_vocab(train_data)

print(f"Unique tokens in TEXT vocabulary: {len(text.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(sentiment.vocab)}")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    datasets=[train_data, valid_data, test],
    batch_size=BATCH_SIZE,
    sort_within_batch=True,
    sort_key=lambda x: len(x.sentences),
    device=device,
)

INPUT_DIM = len(text.vocab)
model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)
print(f'The model has {Model_trainer.count_parameters(model):,} trainable parameters')
model = model.to(device)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.BCEWithLogitsLoss()
criterion = criterion.to(device)


N_EPOCHS = num_epoch

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()

    train_loss, train_acc = Model_trainer.train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = Model_trainer.evaluate(model, valid_iterator, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = Model_trainer.epoch_time(start_time, end_time)

    # keep the checkpoint with the best validation loss
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), os.path.join(Model_trainer.model_dir, Model_trainer.model_name))

    print(f'Epoch: {epoch + 1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Accuracy: {train_acc * 100:.2f}%')
    print(f'\tValidation Loss: {valid_loss:.3f} | Validation Accuracy: {valid_acc * 100:.2f}%')


# Load the best checkpoint from disk and report the test loss and accuracy
model.load_state_dict(torch.load(os.path.join(Model_trainer.model_dir, Model_trainer.model_name)))

test_loss, test_acc = Model_trainer.evaluate(model, test_iterator, criterion)

print(f'Overall Test Loss: {test_loss:.3f} | Overall Test Accuracy: {test_acc * 100:.2f}%')
--------------------------------------------------------------------------------
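Finally, a hedged sketch (not part of the repository) of scoring a single tweet with the trained checkpoint. It reuses `text`, `model`, `device`, `sentiment`, and `cleanup_text` from train.py's namespace; the tweet is made up, and whether a probability near 1.0 means "positive" depends on the label mapping built during training, so `sentiment.vocab.stoi` is printed for reference:

```python
def predict_sentiment(sentence):
    """Return the sigmoid probability the model assigns to a raw tweet."""
    model.eval()
    tokens = cleanup_text(sentence.split())      # same tokenize + cleaning as the text Field
    ids = [text.vocab.stoi[t] for t in tokens]   # numericalize with the trained vocabulary
    tensor = torch.LongTensor(ids).unsqueeze(1).to(device)  # shape [sent len, 1]
    with torch.no_grad():
        return torch.sigmoid(model(tensor)).item()

print(predict_sentiment("I love this movie, what a great day!"))
print(sentiment.vocab.stoi)  # shows which class index corresponds to label 0 or 1
```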