├── LICENSE
├── README.md
├── __pycache__
│   ├── misc.cpython-36.pyc
│   ├── model.cpython-36.pyc
│   └── text_cleaning.cpython-36.pyc
├── data
│   ├── test.csv
│   └── train.csv
├── data_config.json
├── data_processing.py
├── misc.py
├── model.py
├── project_config.json
├── readme
├── requirement.txt
├── saved_model
│   └── sentiment_classifer_rnn_sagar.pt
├── text_cleaning.py
└── train.py
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021 Sagchakr
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Twitter Sentiment Analysis Using RNN - PyTorch
3 |
4 |
5 | [Python](https://www.python.org/)
6 | [PyTorch](https://pytorch.org/)
7 | [Pandas](https://pandas.pydata.org/)
8 | [NumPy](https://numpy.org/)
9 | [Dataset: Sentiment140](https://www.kaggle.com/kazanova/sentiment140)
10 |
11 | [Python 3.6.8](https://www.python.org/downloads/release/python-368/)
12 | [torch 1.8.1](https://pypi.org/project/torch/1.8.1/)
13 | [torchtext 0.9.1](https://pypi.org/project/torchtext/0.9.1/)
14 | [pytorch-ignite 0.4.4](https://pypi.org/project/pytorch-ignite/0.4.4/)
15 |
16 |
17 | [Stargazers](https://github.com/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN/stargazers)
18 | [Forks](https://github.com/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN/network/members)
19 | ## Introduction
20 |
21 |  \
22 | Image Source: morioh
23 |
24 | Sentiment analysis is the automated process of recognizing and categorizing subjective information in text. Twitter sentiment analysis applies this to tweets: machine learning models separate positive tweets from negative ones using classification, text mining, text analysis, and data visualization built on natural language processing.
25 |
26 |  \
27 | Image Source: Google
28 |
29 | ## Requirements
30 |
31 | This project was developed on [Windows 10](https://www.microsoft.com/en-in/software-download/windows10) \
32 | Clone or download this repository after installing the above [requirements](#requirements), then run `pip install -r requirement.txt` to install all the libraries required to run this project.
33 | You can also click on the specific badges mentioned [above](#requirements) and download them individually.
34 |
35 |
36 | - **torch==1.8.1+cu**
37 | - **torchtext==0.9.1**
38 | - **pytorch-ignite==0.4.4**
39 | - **pandas==1.0.5**
40 | - **numpy==1.19.3**
41 | ### IDE
42 | I have used [PyCharm](https://www.jetbrains.com/pycharm/) in this project; I mostly prefer PyCharm for building large end-to-end projects.
43 | For data analytics & visualizations, though, I would recommend [Jupyter](https://jupyter.org/). If you are new to Jupyter (it is quite easy for beginners), go through this [tutorial](https://www.tutorialspoint.com/jupyter/index.htm).
44 |
45 |
46 |
47 |
48 |
49 | ## To Run this Project
50 | 1. Download the dataset from [Kaggle](#requirements) above and keep it in the main directory.
51 | 2. Put the csv file in the working directory and mention its full path in **data_config.json**.
52 | 3. In **data_config.json** you need to provide the following (see the example below):
53 |
54 | i) `dataset_full_path`: **full path of the main dataset** \
55 | ii) `num_neg_labels`: **number of negative samples the dataset should contain** \
56 | iii) `num_pos_labels`: **number of positive samples the dataset should contain** \
57 | iv) `trainset_fullpath`: **path to save train.csv** \
58 | v) `testset_fullpath`: **path to save test.csv** \
59 | vi) `num_training_sample`: **number of training samples in train.csv**
60 |
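For reference, the repository's own **data_config.json** ships with these values:

```json
{
  "dataset_full_path": "twiteer_dataset_main.csv",
  "num_neg_labels": "10000",
  "num_pos_labels": "10000",
  "trainset_fullpath": "data/train.csv",
  "testset_fullpath": "data/test.csv",
  "num_training_sample": "15000"
}
```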
61 | 4. *Run* `python data_processing.py`: **this will prepare the data by splitting it into train and test sets.**
62 | 5. To run this project, you need to provide certain parameters in **project_config.json** (see the example below):
63 |
64 | i) `data_path`: **data** \
65 | ii) `model_dir`: **saved_model** \
66 | iii) `device`: **-1** \
67 | iv) `model_name`: **sentiment_classifer_rnn_sagar.pt** \
68 | v) `embedding_dim`: **100** \
69 | vi) `hidden_dim`: **256** \
70 | vii) `output_dim`: **1** \
71 | viii) `batch_size`: **64** \
72 | ix) `max_vocab_size`: **25000** \
73 | x) `learning_rate`: **1e-3** \
74 | xi) `epoch`: **20**
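
These are exactly the values shipped in the repository's **project_config.json** (note that every value, including the numbers, is stored as a string):

```json
{
  "data_path": "data",
  "model_dir": "saved_model",
  "device": "-1",
  "model_name": "sentiment_classifer_rnn_sagar.pt",
  "embedding_dim": "100",
  "hidden_dim": "256",
  "output_dim": "1",
  "batch_size": "64",
  "max_vocab_size": "25000",
  "learning_rate": "1e-3",
  "epoch": "20"
}
```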
75 | 6. After performing the above steps, execute: `python train.py`
76 |
77 | **To change the dataset or any training parameters, edit project_config.json.**
78 |
79 | ## Experiment Details
80 |
81 | - Epochs: **20**
82 | - Optimizer: **Adam**
83 | - Learning rate: **1e-3**
84 | - Loss function: **BCEWithLogitsLoss**
85 | - Train Acc: **99.72**
86 | - Validation accuracy: **67.68**
87 | - Test accuracy: **64.54**
88 | - Number of training examples: **10500**
89 | - Number of validation examples: **4500**
- Number of testing examples: **3962**
90 | - Unique tokens in TEXT vocabulary: **18609**
91 | - Unique tokens in LABEL vocabulary: **2**
92 | - The model has **1,952,805** trainable parameters (see the breakdown below)
93 |
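The parameter count checks out against the architecture in `model.py` and the reported vocabulary size of 18,609. A minimal sanity check (not part of the repository code):

```python
# Trainable parameters of RNN(input_dim=18609, embedding_dim=100, hidden_dim=256, output_dim=1)
vocab_size, emb_dim, hid_dim, out_dim = 18609, 100, 256, 1

embedding = vocab_size * emb_dim                           # nn.Embedding weights: 1,860,900
rnn = emb_dim * hid_dim + hid_dim * hid_dim + 2 * hid_dim  # nn.RNN input/hidden weights + biases: 91,648
fc = hid_dim * out_dim + out_dim                           # nn.Linear weights + bias: 257

print(embedding + rnn + fc)                                # 1952805
```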
94 | Due to RAM limitations I have taken **20000 samples** from the main dataset, which achieved **99.23% accuracy** on the train dataset.
95 |
96 | The test accuracy is about `68%`, which can be improved by using more data or training for more epochs.
97 |
98 | ## Author
99 | #### Sagar Chakraborty
100 | [LinkedIn](https://www.linkedin.com/in/binaryblackhole/)
101 | [Gmail](https://mail.google.com/mail/u/0/#search/csagar963%40gmail.com)
102 | [GitHub](https://github.com/BinaryBlackhole)
103 |
104 | ## Thanks to ALL the amazing contributors!
105 | [Akshata-Kulk1](https://github.com/Akshata-Kulk1)
106 |
107 |
108 | ## References
109 | #### Akurniawan-sentiment analysis
110 | [akurniawan/pytorch-sentiment-analysis](https://github.com/akurniawan/pytorch-sentiment-analysis)
111 | #### Bentrevett-sentiment analysis
112 | [bentrevett/pytorch-sentiment-analysis](https://github.com/bentrevett/pytorch-sentiment-analysis)
113 |
114 | **If you like this project, please fork and star it!**
115 |
116 |
117 | 
118 | [](https://GitHub.com/Naereen/) \
119 | [](./LICENSE)
120 |
121 |
--------------------------------------------------------------------------------
/__pycache__/misc.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN/7dd0b14399c203b6f3f46b0caf41dcd17b35964a/__pycache__/misc.cpython-36.pyc
--------------------------------------------------------------------------------
/__pycache__/model.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN/7dd0b14399c203b6f3f46b0caf41dcd17b35964a/__pycache__/model.cpython-36.pyc
--------------------------------------------------------------------------------
/__pycache__/text_cleaning.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN/7dd0b14399c203b6f3f46b0caf41dcd17b35964a/__pycache__/text_cleaning.cpython-36.pyc
--------------------------------------------------------------------------------
/data_config.json:
--------------------------------------------------------------------------------
1 | {
2 | "dataset_full_path": "twiteer_dataset_main.csv",
3 | "num_neg_labels": "10000",
4 | "num_pos_labels": "10000",
5 | "trainset_fullpath": "data/train.csv",
6 | "testset_fullpath": "data/test.csv",
7 | "num_training_sample": "15000"
8 | }
--------------------------------------------------------------------------------
/data_processing.py:
--------------------------------------------------------------------------------
1 | "Author: Sagar Chakraborty"
2 | import pandas as pd
3 | import numpy as np
4 | import json
5 |
6 |
7 | # parsing the data_config.json to get the params
8 | """
9 | {
10 | "dataset_full_path": "twiteer_dataset_main.csv",
11 | "num_neg_labels": "10000",
12 | "num_pos_labels": "10000",
13 | "trainset_fullpath": "data/train.csv",
14 | "testset_fullpath": "data/test.csv",
15 | "num_training_sample": "15000"
16 | }"""
17 |
18 |
19 | f = open('data_config.json','r')
20 | config_data = json.loads(f.read())
21 |
22 | dataset = config_data['dataset_full_path']
23 | neg_labels = int(config_data['num_neg_labels'])
24 | pos_labels = int(config_data['num_pos_labels'])
25 | train_filepath = config_data['trainset_fullpath']
26 | test_filepath = config_data['testset_fullpath']
27 | training_samples = int(config_data['num_training_sample'])
28 |
29 |
30 |
31 |
32 |
33 |
34 | dataset_df = pd.read_csv(dataset,names=['labels','id','datetime','query','username','sentences'],header= None,sep=',',encoding = "ISO-8859-1")
35 | #
36 | print(dataset_df['labels'].head(5))
37 |
38 | # optionally cut the dataset short; here we sample 10000 rows for each class
39 | filter_0_df = dataset_df[dataset_df['labels'] == 0].sample(n=neg_labels)
40 | filter_4_df = dataset_df[dataset_df['labels'] == 4].sample(n=pos_labels)
41 |
42 |
43 | # dropping unnecessary columns
44 | filter_0_df.drop(['id','datetime','query','username'], axis=1,inplace=True)
45 | filter_4_df.drop(['id','datetime','query','username'],axis=1,inplace=True)
46 |
47 | print(filter_0_df)
48 | print(filter_4_df)
49 |
50 |
51 |
52 | # merge the per-class dataframes into the final dataframe
53 | final_df = pd.concat([filter_0_df,filter_4_df])
54 |
55 | print(len(final_df))
56 | # drop rows whose tweets contain links/URLs (mostly spam)
57 | final_df = final_df.drop(final_df[final_df.sentences.str.contains(r'http\S+|www\.\S+')].index)
58 |
59 | #shuffle the dataset
60 | final_df=final_df.sample(n= len(final_df), random_state=42)
61 |
62 | # map the original labels (0 = negative, 4 = positive) to 0/1; .loc targets only the labels column
63 | final_df.loc[final_df['labels'] > 0, 'labels'] = 1
64 | print(len(final_df))
65 |
66 | # drop the first row (meant to guard against a stray header row)
67 | final_df = final_df[1:]
68 |
69 |
70 | # train_test split
71 | train_df= final_df[:training_samples]
72 | test_df= final_df[training_samples:]
73 |
74 | # saving data; index=False keeps the csvs to just the labels and sentences columns
75 | train_df.to_csv(train_filepath, index=False)
76 | test_df.to_csv(test_filepath, index=False)
77 |
78 |
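# With index=False the saved csvs contain only the two remaining columns,
# e.g. (illustrative rows, not real data):
#   labels,sentences
#   0,awful day my flight got cancelled
#   1,loving the new update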
--------------------------------------------------------------------------------
/misc.py:
--------------------------------------------------------------------------------
1 | "Author: Sagar Chakraborty"
2 | from ignite.engine import Engine
3 | import os
4 | from torch.autograd import Variable
5 |
6 | from ignite.exceptions import NotComputableError
7 | from ignite.metrics.metric import Metric
8 | import torch
9 | import torch.nn as nn
10 | import torch.optim as optim
11 | from torchtext.legacy import data
12 | from text_cleaning import cleanup_text
13 | from model import RNN
14 |
15 | from pydoc import locate
16 | from torch.nn.parallel import DataParallel
17 |
19 | from ignite.handlers import ModelCheckpoint
20 | from ignite.metrics import Accuracy, Precision, Recall, Loss
21 | import glob
22 | import logging
23 |
24 | def create_supervised_evaluator(model, inference_fn, metrics=None, cuda=False):
25 | """
26 | Factory function for creating an evaluator for supervised models.
27 | Extended version from ignite's create_supervised_evaluator
28 | Args:
29 | model (torch.nn.Module): the model to train
30 | inference_fn (function): inference function
31 | metrics (dict of str: Metric): a map of metric names to Metrics
32 | cuda (bool, optional): whether or not to transfer batch to GPU (default: False)
33 | Returns:
34 | Engine: an evaluator engine with supervised inference function
35 | """
36 |
37 | engine = Engine(inference_fn)
38 |
39 | for name, metric in (metrics or {}).items():
40 | metric.attach(engine, name)
41 |
42 | return engine
43 |
44 |
45 | class ModelLoader(object):
46 | def __init__(self, model, dirname, filename_prefix):
47 | self._dirname = dirname
48 | self._fname_prefix = filename_prefix
49 | self._model = model
50 | self._fname = os.path.join(dirname, filename_prefix)
51 | self.skip_load = False
52 |
53 | # Ensure model is not None
54 | if not isinstance(model, nn.Module):
55 | raise ValueError("model should be an instance of nn.Module")
56 |
57 | # Ensure that dirname exists
58 | if not os.path.exists(dirname):
59 | self.skip_load = True
60 | logging.warning(
61 | "Dir '{}' not found, skipping model restore".format(dirname)
62 | )
63 |
64 | if len(glob.glob(self._fname + "_*")) == 0:
65 | self.skip_load = True
66 | logging.warning(
67 | "File '{}_*' not found, skipping model restore".format(self._fname)
68 | )
69 |
70 | def _load(self, path):
71 | if not self.skip_load:
72 | models = sorted(glob.glob(path))
73 | latest_model = models[-1]
74 |
75 | try:
76 | if isinstance(self._model, nn.parallel.DataParallel):
77 | self._model.module.load_state_dict(torch.load(latest_model))
78 | else:
79 | self._model.load_state_dict(torch.load(latest_model))
80 | print("Successfully loaded {}!".format(latest_model))
81 | except Exception as e:
82 | logging.exception(
83 | "Something went wrong while restoring the model: %s" % str(e)
84 | )
85 |
86 | def __call__(self, engine, infix_name):
87 | path = self._fname + "_" + infix_name + "_*"
88 |
89 | self._load(path=path)
90 |
91 |
92 |
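# Illustrative usage (ModelLoader is imported by train.py but not wired up there;
# the names below are hypothetical):
#   loader = ModelLoader(model, "saved_model", "sentiment_classifer_rnn_sagar")
#   loader(engine, "checkpoint")  # loads the latest "saved_model/sentiment_classifer_rnn_sagar_checkpoint_*"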
--------------------------------------------------------------------------------
/model.py:
--------------------------------------------------------------------------------
1 | "Author: Sagar Chakraborty"
2 |
3 | import torch
4 | import torch.nn as nn
5 | import torch.nn.functional as F
6 |
7 |
8 | import numpy as np
9 | from torch.nn.utils.rnn import pack_padded_sequence,pad_packed_sequence
10 |
11 |
12 |
13 |
14 |
15 | class RNN(nn.Module):
16 | def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
17 | super().__init__()
18 |
19 | self.embedding = nn.Embedding(input_dim, embedding_dim)
20 |
21 | self.rnn = nn.RNN(embedding_dim, hidden_dim)
22 |
23 | self.fc = nn.Linear(hidden_dim, output_dim)
24 |
25 | def forward(self, text):
26 | #forward propagation
27 | # text = [sent len, batch size]
28 | # first pass the text into embedding layer
29 | embedded = self.embedding(text)
30 |
31 | # embedded = [sent len, batch size, emb dim]
32 |
33 | #
34 | output, hidden = self.rnn(embedded)
35 |
36 | # output = [sent len, batch size, hid dim]
37 | # hidden = [1, batch size, hid dim]
38 |
39 | assert torch.equal(output[-1, :, :], hidden.squeeze(0))
40 |
41 | return self.fc(hidden.squeeze(0))
42 |
43 |
44 |
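# Minimal shape sanity check (illustrative, not part of the original training flow;
# dims mirror project_config.json and the vocabulary size reported in the README):
if __name__ == "__main__":
    model = RNN(input_dim=18609, embedding_dim=100, hidden_dim=256, output_dim=1)
    dummy = torch.randint(0, 18609, (30, 64))  # [sent len, batch size]
    out = model(dummy)                         # fc applied to the final hidden state
    print(out.shape)                           # torch.Size([64, 1])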
--------------------------------------------------------------------------------
/project_config.json:
--------------------------------------------------------------------------------
1 | {
2 | "data_path": "data",
3 | "model_dir": "saved_model",
4 | "device": "-1",
5 | "model_name": "sentiment_classifer_rnn_sagar.pt",
6 | "embedding_dim": "100",
7 | "hidden_dim": "256",
8 | "output_dim": "1",
9 | "batch_size": "64",
10 | "max_vocab_size": "25000",
11 | "learning_rate": "1e-3",
12 | "epoch": "20"
13 | }
--------------------------------------------------------------------------------
/readme:
--------------------------------------------------------------------------------
1 | 1. Please download the dataset from this link and keep it in the main directory: https://www.kaggle.com/kazanova/sentiment140
2 | 2. Put the csv file in the working directory and mention its full path in data_config.json
3 | 3. In data_config.json you need to provide:
4 |
5 | {
6 | "dataset_full_path": "twiteer_dataset_main.csv", # full path of the main dataset
7 | "num_neg_labels": "10000", # number of negetive sample the dataset should contain
8 | "num_pos_labels": "10000", # number of positive samples the dataset should contain
9 | "trainset_fullpath": "data/train.csv", # path to save the train.csv
10 | "testset_fullpath": "data/test.csv", # path to save test.csv
11 | "num_training_sample": "15000" # number of training samples in train.csv
12 | }
13 |
14 |
15 | 4. Run python data_processing.py. This will prepare the data by splitting it into train and test sets.
16 | 5. To run this project, run: python train.py. train.py needs certain params provided in project_config.json (see the parsing note below the block):
17 |
18 | {
19 | "data_path": "data",
20 | "model_dir": "saved_model",
21 | "device": "-1",
22 | "model_name": "sentiment_classifer_rnn_sagar.pt",
23 | "embedding_dim": "100",
24 | "hidden_dim": "256",
25 | "output_dim": "1",
26 | "batch_size": "64",
27 | "max_vocab_size": "25000",
28 | "learning_rate": "1e-3",
29 | "epoch": "20"
30 | }
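
Note: every value in project_config.json is stored as a string; train.py parses them on load, roughly like this (sketch of the actual code):

    import json
    config_data = json.loads(open('project_config.json').read())
    EMBEDDING_DIM = int(config_data['embedding_dim'])
    learning_rate = float(config_data['learning_rate'])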
31 | To change the dataset or any training parameters, edit project_config.json
32 |
33 | Due to RAM limitations I have taken 20000 samples from the main dataset, which achieved 99.23% accuracy on the train dataset.
34 | The test accuracy is ~68%, which can be improved by using more data / training for more epochs.
35 |
36 | References:
37 | https://github.com/akurniawan/pytorch-sentiment-analysis
38 | https://github.com/bentrevett/pytorch-sentiment-analysis
39 |
40 |
--------------------------------------------------------------------------------
/requirement.txt:
--------------------------------------------------------------------------------
1 | torch==1.8.1
2 | torchtext==0.9.1
3 | pytorch-ignite==0.4.4
4 | pandas==1.0.5
5 | numpy==1.19.3
6 |
--------------------------------------------------------------------------------
/saved_model/sentiment_classifer_rnn_sagar.pt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BinaryBlackhole/Pytorch-implementation-twitter-sentiment-analysis-using-RNN/7dd0b14399c203b6f3f46b0caf41dcd17b35964a/saved_model/sentiment_classifer_rnn_sagar.pt
--------------------------------------------------------------------------------
/text_cleaning.py:
--------------------------------------------------------------------------------
1 | "Author: Sagar Chakraborty"
2 | import re
3 |
4 |
5 | def cleanup_text(texts):
6 | cleaned_text = []
7 | for text in texts:
8 | # remove HTML-escaped quotes (&quot;) and ampersands (&amp;)
9 | text = re.sub(r"&quot;(.*?)&quot;", r"\g<1>", text)
10 | text = re.sub(r"&amp;", "", text)
11 |
12 | # replace emoticon
13 | text = re.sub(
14 | r"(^| )(\:\w+\:|\<[\/\\]?3|[\(\)\\\D|\*\$][\-\^]?[\:\;\=]|[\:\;\=B8][\-\^]?[3DOPp\@\$\*\\\)\(\/\|])(?=\s|[\!\.\?]|$)",
15 | "\g<1>TOKEMOTICON",
16 | text,
17 | )
18 |
19 | text = text.lower()
20 | text = text.replace("tokemoticon", "TOKEMOTICON")
21 |
22 | # replace url
23 | text = re.sub(
24 | r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?",
25 | "TOKURL",
26 | text,
27 | )
28 |
29 | # replace mention
30 | text = re.sub(r"@[\w]+", "TOKMENTION", text)
31 |
32 | # replace hashtag
33 | text = re.sub(r"#[\w]+", "TOKHASHTAG", text)
34 |
35 | # replace dollar
36 | text = re.sub(r"\$\d+", "TOKDOLLAR", text)
37 |
38 | # remove punctuation
39 | text = re.sub("[^a-zA-Z0-9]", " ", text)
40 |
41 | # remove multiple spaces
42 | text = re.sub(r" +", " ", text)
43 |
44 | # remove newline
45 | text = re.sub(r"\n", " ", text)
46 |
47 | cleaned_text.append(text)
48 | return cleaned_text
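
# Illustrative behaviour (cleanup_text is used by train.py as the torchtext Field's
# `preprocessing` step, i.e. it runs on the list of tokens after tokenization):
#   cleanup_text(["@user", "check", "http://t.co/abc", "#fun", ":)"])
#   -> ["TOKMENTION", "check", "TOKURL", "TOKHASHTAG", "TOKEMOTICON"]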
--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
1 | "Author: Sagar Chakraborty"
2 | import torch
3 | import torch.nn as nn
4 | import torch.optim as optim
5 | from torchtext.legacy import data
6 | from text_cleaning import cleanup_text
7 | from model import RNN
8 |
9 | from pydoc import locate
10 | from torch.nn.parallel import DataParallel
11 |
12 | from ignite.engine import Engine
13 | from ignite.handlers import ModelCheckpoint
14 | from ignite.metrics import Accuracy, Precision, Recall, Loss
15 |
16 | import json
18 | import time
19 | import os
20 | from misc import create_supervised_evaluator
21 | from misc import ModelLoader
22 | from ignite.engine import Events
23 | import random
24 |
27 |
28 |
29 | class trainer(object):
30 | def __init__(self,data_path,model_dir,model_name,device=-1):
31 | self.data_path= data_path
32 | self.model_dir = model_dir
33 | self.model_name= model_name
34 | self.device = device
35 |
36 | @staticmethod
37 | def train(model, iterator, optimizer, criterion):
38 | """Train function to start the training the declared model. model.train() initialize it.
39 | for every batch picked up from the iterator we send it to the model and get predictions.
40 | loss = Predicted_y - Actual_y and based of the loss we calculate accuracy.
41 | loss.backward is for back propagation"""
42 | epoch_loss = 0
43 | epoch_acc = 0
44 |
45 | model.train()
46 |
47 | for batch in iterator:
48 | optimizer.zero_grad()
49 |
50 | predictions = model(batch.sentences[0]).squeeze(1)
51 |
52 | loss = criterion(predictions, batch.labels)
53 |
54 | acc = trainer.binary_accuracy(predictions, batch.labels)
55 |
56 | loss.backward() #back propagation
57 |
58 | optimizer.step() #weight update
59 |
60 | epoch_loss += loss.item()
61 | epoch_acc += acc.item()
62 |
63 | return epoch_loss / len(iterator), epoch_acc / len(iterator)
64 |
65 | @staticmethod
66 | def count_parameters(model):
67 | return sum(p.numel() for p in model.parameters() if p.requires_grad)
68 |
69 | @staticmethod
70 | def binary_accuracy(preds, y):
71 | """
72 | Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
73 | """
74 |
75 | # round predictions to the closest integer
76 | rounded_preds = torch.round(torch.sigmoid(preds))
77 | correct = (rounded_preds == y).float() # convert into float for division
78 | acc = correct.sum() / len(correct)
79 | return acc
80 |
81 | @staticmethod
82 | def evaluate(model, iterator, criterion):
83 | epoch_loss = 0
84 | epoch_acc = 0
85 |
86 | model.eval()
87 |
88 | with torch.no_grad():
89 | for batch in iterator:
90 | predictions = model(batch.sentences[0]).squeeze(1)
91 |
92 | loss = criterion(predictions, batch.labels)
93 |
94 | acc = trainer.binary_accuracy(predictions, batch.labels)
95 |
96 | epoch_loss += loss.item()
97 | epoch_acc += acc.item()
98 |
99 | return epoch_loss / len(iterator), epoch_acc / len(iterator)
100 |
101 | @staticmethod
102 | def epoch_time(start_time, end_time):
103 | elapsed_time = end_time - start_time
104 | elapsed_mins = int(elapsed_time / 60)
105 | elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
106 | return elapsed_mins, elapsed_secs
107 |
108 |
109 |
110 |
111 |
112 | # This is what project_config.json looks like:
113 | """
114 | {
115 | "data_path": "data",
116 | "model_dir": "saved_model",
117 | "device": "-1",
118 | "model_name": "sentiment_classifer_rnn_sagar.pt",
119 | "embedding_dim": "100",
120 | "hidden_dim": "256",
121 | "output_dim": "1",
122 | "batch_size": "64",
123 | "max_vocab_size": "25000",
"learning_rate": "1e-3",
"epoch": "20"
124 | }
125 | """
126 |
127 |
128 |
129 | f = open('project_config.json','r')
130 |
131 | config_data = json.loads(f.read())
132 |
133 | EMBEDDING_DIM = int(config_data['embedding_dim'])
134 | HIDDEN_DIM = int(config_data['hidden_dim'])
135 | OUTPUT_DIM = int(config_data['output_dim'])
136 | BATCH_SIZE = int(config_data['batch_size'])
137 | MAX_VOCAB_SIZE = int(config_data['max_vocab_size'])
138 |
139 | #Parameters we have provided for our model
140 | # EMBEDDING_DIM = 100
141 | # HIDDEN_DIM = 256
142 | # OUTPUT_DIM = 1
143 | # BATCH_SIZE = 64
144 | # MAX_VOCAB_SIZE = 25_000
145 |
146 |
147 | data_path = config_data['data_path']
148 | model_dir = config_data['model_dir']
149 | device = int(config_data['device'])  # "-1" in project_config.json
150 | model_name = config_data['model_name']
151 | learning_rate = float(config_data['learning_rate'])
152 | num_epoch = int(config_data['epoch'])
153 |
154 |
155 |
156 | ################-------##################
157 | Model_trainer = trainer(data_path,model_dir,model_name,device)
158 |
159 | # seed
160 | torch.manual_seed(0)
161 | if torch.cuda.is_available():
162 | torch.cuda.manual_seed(0)
163 | device = None
164 |
165 | tokenize = lambda s: s.split()
166 |
167 | text = data.Field(
168 | preprocessing=cleanup_text, include_lengths=True, tokenize=tokenize
169 | )
170 |
171 | sentiment = data.LabelField(dtype=torch.float)
172 | train, test = data.TabularDataset.splits(
173 | Model_trainer.data_path,
174 | train="train.csv",
175 | validation="test.csv",
176 | format="csv",
177 | fields=[("labels", sentiment), ("sentences", text)], skip_header=True,  # skip the csv header row
178 | )
179 |
180 | # the TEXT/LABEL vocabularies are built below, after the train/validation split
181 |
182 |
183 | print(len(train), len(test))
184 |
185 | print(vars(train.examples[5]))
186 |
187 | train_data, valid_data = train.split(random_state=random.seed(42))
188 | print(f'Number of training examples: {len(train_data)}')
189 | print(f'Number of validation examples: {len(valid_data)}')
190 | print(f'Number of testing examples: {len(test)}')
191 |
192 |
193 |
194 | text.build_vocab(train_data, max_size=MAX_VOCAB_SIZE)
195 | sentiment.build_vocab(train_data)
196 |
197 | print(f"Unique tokens in TEXT vocabulary: {len(text.vocab)}")
198 | print(f"Unique tokens in LABEL vocabulary: {len(sentiment.vocab)}")
199 |
200 |
201 | device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
202 |
203 | train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
204 | datasets=[train_data, valid_data, test],
205 | batch_size=BATCH_SIZE,
206 | sort_within_batch=True,
207 | sort_key=lambda x: len(x.sentences),
208 | device=device, )
209 |
210 | INPUT_DIM = len(text.vocab)
211 | model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)
212 | print(f'The model has {Model_trainer.count_parameters(model):,} trainable parameters')
213 | model = model.to(device)
214 |
215 | optimizer = optim.Adam(model.parameters(), lr=learning_rate)
216 | criterion = nn.BCEWithLogitsLoss()
217 | criterion = criterion.to(device)
218 |
219 |
220 | N_EPOCHS = num_epoch
221 |
222 | best_valid_loss = float('inf')
223 |
224 | for epoch in range(N_EPOCHS):
225 |
226 | start_time = time.time()
227 |
228 | train_loss, train_acc = Model_trainer.train(model, train_iterator, optimizer, criterion)
229 | valid_loss, valid_acc = Model_trainer.evaluate(model, valid_iterator, criterion)
230 |
231 | end_time = time.time()
232 |
233 | epoch_mins, epoch_secs = Model_trainer.epoch_time(start_time, end_time)
234 |
235 | if valid_loss < best_valid_loss:
236 | best_valid_loss = valid_loss
237 | torch.save(model.state_dict(), os.path.join(Model_trainer.model_dir,Model_trainer.model_name))
238 |
239 | print(f'Epoch: {epoch + 1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
240 | print(f'\tTrain Loss: {train_loss:.3f} | Train Accuracy: {train_acc * 100:.2f}%')
241 | print(f'\t Validation Loss: {valid_loss:.3f} | Validation Accuracy: {valid_acc * 100:.2f}%')
242 |
243 |
244 | # load the best saved model from disk and evaluate on the test set
245 | model.load_state_dict(torch.load(os.path.join(Model_trainer.model_dir,Model_trainer.model_name)))
246 |
247 | test_loss, test_acc = Model_trainer.evaluate(model, test_iterator, criterion)
248 |
249 | print(f'Overall Test Loss: {test_loss:.3f} | Overall Test Accuracy: {test_acc * 100:.2f}%')
--------------------------------------------------------------------------------