├── .gitignore
├── LICENSE
├── README.md
├── requirements.txt
└── src
├── constant.py
├── custom_data.py
├── lstm.py
└── main.py
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 | MANIFEST
27 |
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 |
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 |
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 | .pytest_cache/
49 |
50 | # Translations
51 | *.mo
52 | *.pot
53 |
54 | # Django stuff:
55 | *.log
56 | local_settings.py
57 | db.sqlite3
58 |
59 | # Flask stuff:
60 | instance/
61 | .webassets-cache
62 |
63 | # Scrapy stuff:
64 | .scrapy
65 |
66 | # Sphinx documentation
67 | docs/_build/
68 |
69 | # PyBuilder
70 | target/
71 |
72 | # Jupyter Notebook
73 | .ipynb_checkpoints
74 |
75 | # pyenv
76 | .python-version
77 |
78 | # celery beat schedule file
79 | celerybeat-schedule
80 |
81 | # SageMath parsed files
82 | *.sage.py
83 |
84 | # Environments
85 | .env
86 | .venv
87 | env/
88 | venv/
89 | ENV/
90 | env.bak/
91 | venv.bak/
92 |
93 | # Spyder project settings
94 | .spyderproject
95 | .spyproject
96 |
97 | # Rope project settings
98 | .ropeproject
99 |
100 | # mkdocs documentation
101 | /site
102 |
103 | # mypy
104 | .mypy_cache/
105 |
106 | .idea
107 | data/
108 | model/
109 | src/runs
110 | saved_models/
111 | src/__pycache__
112 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021 Jaewoo Song
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # lstm-bayesian-optimization-pytorch
2 | This is a simple application of an LSTM to a text classification task in PyTorch, using **Bayesian Optimization** for hyperparameter tuning.
3 |
4 | The dataset used is the *Yelp 2014* review data[[1]](#1), which can be downloaded from [here](http://www.thunlp.org/~chm/data/data.zip).
5 |
6 | Detailed instructions are explained below.
7 |
8 |
9 |
10 | ---
11 |
12 | ### Configurations
13 |
14 | You can set various hyperparameters in the `src/constant.py` file.
15 |
16 | The description of each variable is as follows.
17 |
18 | Note that for Bayesian Optimization, each hyperparameter to be tuned should be passed in the form of a `tuple`.
19 |
20 | So you can set each argument either as a `tuple` or as a single value.
21 |
22 | The former means the argument is included as a subject of Bayesian Optimization, while the latter means it is kept fixed.
23 |
24 |
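For example, the defaults in `src/constant.py` tune the learning rate and the batch size over ranges while keeping the sequence length fixed:

```python
# src/constant.py (excerpt)
learning_rates = (0.0001, 0.001)  # tuple -> tuned by Bayesian Optimization
batch_sizes = (16, 128)           # tuple -> tuned by Bayesian Optimization
seq_len = 512                     # single value -> kept fixed
```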
25 |
26 | Argument | Type | Description | Default
27 | ---------|------|---------------|------------
28 | `device` | `torch.device` | The device type. (CUDA or CPU) | `torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')`
29 | `learning_rates` | `tuple (float, float)` or `float` | The range of learning rates. (or a value) | `(0.0001, 0.001)`
30 | `batch_sizes` | `tuple (int, int)` or `int` | The range of batch sizes. (or a value) | `(16, 128)`
31 | `seq_len` | `tuple (int, int)` or `int` | The range of maximum sequence lengths. (or a value) | `512`
32 | `d_w` | `tuple (int, int)` or `int` | The range of word embedding dimensions. (or a value) | `256`
33 | `d_h` | `tuple (int, int)` or `int` | The range of hidden state dimensions in the LSTM. (or a value) | `256`
34 | `drop_out_rate` | `tuple (float, float)` or `float` | The range of dropout rates. (or a value) | `0.5`
35 | `layer_num` | `tuple (int, int)` or `int` | The range of LSTM layer numbers. (or a value) | `3`
36 | `bidirectional` | `bool` | The flag which determines whether the LSTM is bidirectional or not. | `True`
37 | `class_num` | `int` | The number of classes. | `5`
38 | `epoch_num` | `tuple (int, int)` or `int` | The range of training epoch numbers. (or a value) | `10`
39 | `ckpt_dir` | `str` | The path for saved checkpoints. | `../saved_model`
40 | `init_points` | `int` | The number of initial points to start Bayesian Optimization. | `2`
41 | `n_iter` | `int` | The number of iterations for Bayesian Optimization. | `8`
42 |
43 |
44 |
45 |
46 |
47 | ### How to run
48 |
49 | 1. Install all required packages.
50 |
51 | ```shell
52 | pip install -r requirements.txt
53 | ```
54 |
55 |
56 |
57 | 2. Download the dataset and extract it.
58 |
59 | Of course, you can use another text classification dataset, but make sure that the file names and formats are the same as those of the *Yelp 2014* review dataset. (See the next step and the parsing sketch below it.)
60 |
61 |
62 |
63 | 3. Make a directory named `data`.
64 |
65 | Get files named `train.txt`, `test.txt`, `dev.txt` and `wordlist.txt` from `yelp14` and put them into `data`.
66 |
67 | The directory structure should be as follows.
68 |
69 | - data
70 | - train.txt
71 | - test.txt
72 | - dev.txt
73 | - wordlist.txt
74 |
75 |
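For reference, `src/custom_data.py` reads each line of `train.txt`/`dev.txt`/`test.txt` by splitting on tabs, taking the last field as the review text and the third field from the end as the 1-to-5 rating. A minimal parsing sketch (the sample line is illustrative, not taken from the actual dataset):

```python
# How read_file() in src/custom_data.py interprets one line.
line = "some_user\t\tsome_product\t\t4\t\tthe food was great !"  # hypothetical sample
fields = line.strip().split('\t')
text = fields[-1]            # review text (last field)
score = int(fields[-3]) - 1  # 1-5 rating shifted to a 0-4 label
print(score, text)           # -> 3 the food was great !
```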
76 |
77 | 4. Execute the command below to train the model.
78 |
79 | ```shell
80 | python src/main.py --mode='train'
81 | ```
82 |
83 | - `--mode`: This specifies the running mode, which can be either `train` or `test`.
84 |
85 |
86 |
87 | Bayesian Optimization is used for hyperparameter tuning in this task.
88 |
89 | You can add/modify the hyperparameter list to tune in `main.py`.
90 |
91 | ```python
92 | self.pbounds = {
93 | 'learning_rate': learning_rates,
94 | 'batch_size': batch_sizes
95 | }
96 |
97 | self.bayes_optimizer = BayesianOptimization(
98 | f=self.train,
99 | pbounds=self.pbounds,
100 | random_state=777
101 | )
102 | ```
103 |
104 | Currently, only the batch size and the learning rate are subject to tuning.
105 |
106 | If you want to modify `self.pbounds`, add the desired hyperparameter and change its value in `constant.py` into a tuple of two values, the minimum and the maximum.
107 |
108 | Then add that hyperparameter as an additional parameter of the `train` function, just like `batch_size` and `learning_rate`, as in the sketch below.
109 |
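A minimal sketch of such a change, using the dropout rate as a hypothetical example (after turning `drop_out_rate` in `constant.py` into a `(min, max)` tuple); actually wiring the sampled value into the model is up to you:

```python
self.pbounds = {
    'learning_rate': learning_rates,
    'batch_size': batch_sizes,
    'drop_out_rate': drop_out_rate  # assumed to be a (min, max) tuple now
}

# Manager.train then receives the sampled value as an extra argument.
def train(self, learning_rate, batch_size, drop_out_rate):
    ...
```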
110 |
111 |
112 | 5. After training, you can test the model on the test data with the following command.
113 |
114 | ```shell
115 | python src/main.py --mode='test' --model_name=MODEL_NAME --inference_batch_size=BATCH_SIZE
116 | ```
117 |
118 | - `model_name`: This is the file name of the trained model you want to test. The model is located in the `saved_model` directory if you did not change the checkpoint directory setting. (default: `None`)
119 | - `inference_batch_size`: This is the batch size for the inference step. It is independent of `batch_sizes` in `src/constant.py`, since that argument may be subject to the Bayesian Optimization process; you can set a separate batch size just for inference. (default: `128`)
120 |
121 |
122 |
123 | ---
124 |
125 | ### References
126 |
127 | [1] *Yelp Open Dataset*. ([https://www.yelp.com/dataset](https://www.yelp.com/dataset))
128 |
129 | ---
130 |
131 |
132 |
133 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | torch==1.5.0
2 | tqdm==4.47.0
3 | scikit-learn==0.23.1
4 | bayesian-optimization==1.2.0
--------------------------------------------------------------------------------
/src/constant.py:
--------------------------------------------------------------------------------
1 | import torch
2 |
3 | # Parameters for training and modeling
4 | device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
5 | learning_rates = (0.0001, 0.001)
6 | batch_sizes = (16, 128)
7 | seq_len = 512
8 | d_w = 256
9 | d_h = 256
10 | drop_out_rate = 0.5
11 | layer_num = 3
12 | bidirectional = True
13 | class_num = 5
14 | epoch_num = 10
15 | ckpt_dir = '../saved_model'
16 |
17 | # Parameters for Bayesian Optimization
18 | init_points = 2
19 | n_iter = 8
20 |
21 | # Path for tensorboard
22 | summary_path = '../runs'
--------------------------------------------------------------------------------
/src/custom_data.py:
--------------------------------------------------------------------------------
1 | from tqdm import tqdm
2 | from constant import *
3 | from torch.utils.data import Dataset
4 |
5 | import torch
6 | import matplotlib.pyplot as plt
7 |
8 |
9 | # Path or parameters for data
10 | DATA_PATH = '../data'
11 | vocab_name = 'wordlist.txt'
12 | train_name = 'train.txt'
13 | dev_name = 'dev.txt'
14 | test_name = 'test.txt'
15 |
16 |
17 | def read_file(name):
18 | score2text = {}
19 | with open(f'{DATA_PATH}/{name}', 'r') as f:
20 | lines = f.readlines()
21 |
22 | for line in tqdm(lines):
23 | line = line.strip()
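# The review text is the last tab-separated field; the 1-to-5 rating is the third field from the end.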
24 | text = line.split('\t')[-1]
25 | score = int(line.split('\t')[-3])-1
26 |
27 | if score not in score2text:
28 | score2text[score] = []
29 |
30 | score2text[score].append(text)
31 |
32 | return score2text
33 |
34 |
35 | def read_vocab():
36 | word2idx = {'<pad>': 0, '<unk>': 1}  # reserve indices 0 and 1 for the padding and unknown tokens
37 | with open(f'{DATA_PATH}/{vocab_name}', 'r') as f:
38 | lines = f.readlines()
39 |
40 | for line in lines:
41 | word = line.strip()
42 | word2idx[word] = len(word2idx)
43 |
44 | return word2idx
45 |
46 |
47 | class CustomDataset(Dataset):
48 | def __init__(self, score2text, word2idx):
49 | scores = []
50 | texts = []
51 | lens = []
52 | for score, text_list in tqdm(score2text.items()):
53 | for text in text_list:
54 | scores.append(score)
55 | words = [word for word in text.split(' ')]
56 | words_idx = []
57 | for word in words:
58 | if word in word2idx:
59 | words_idx.append(word2idx[word])
60 | else:
61 | words_idx.append(word2idx['<unk>'])
62 | text_len = len(words_idx)
63 |
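# Truncate to seq_len, or pad with '<pad>' so that every sample has the same fixed length.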
64 | if len(words_idx) > seq_len:
65 | text_len = seq_len
66 | words_idx = words_idx[:seq_len]
67 | else:
68 | words_idx += ([word2idx['<pad>']] * (seq_len - len(words_idx)))
69 |
70 | texts.append(words_idx)
71 | lens.append(text_len)
72 |
73 | self.x = torch.LongTensor(texts)
74 | self.y = torch.LongTensor(scores)
75 | self.lens = torch.LongTensor(lens)
76 |
77 | assert self.x.shape[0] == self.y.shape[0], "The number of samples is not correct."
78 | assert self.x.shape == torch.Size([self.x.shape[0], seq_len]), "There is a sample with different length."
79 |
80 | def __len__(self):
81 | return self.x.shape[0]
82 |
83 | def __getitem__(self, idx):
84 | return self.x[idx], self.y[idx], self.lens[idx]
85 |
86 |
87 | def get_data():
88 | print("Making vocab dict...")
89 | word2idx = read_vocab()
90 |
91 | print("Reading data...")
92 | train_data = read_file(train_name)
93 | dev_data = read_file(dev_name)
94 | test_data = read_file(test_name)
95 |
96 | print("Making custom datasets...")
97 | train_set = CustomDataset(train_data, word2idx)
98 | dev_set = CustomDataset(dev_data, word2idx)
99 | test_set = CustomDataset(test_data, word2idx)
100 |
101 | return train_set, dev_set, test_set, word2idx
102 |
103 |
104 | if __name__=='__main__':
105 | print("Reading data...")
106 | train_data = read_file(train_name)
107 | dev_data = read_file(dev_name)
108 | test_data = read_file(test_name)
109 |
110 | i = 0
111 | for score, text_list in train_data.items():
112 | for text in tqdm(text_list):
113 | words = [word for word in text.split(' ')]
114 | plt.scatter(i, len(words))
115 | i += 1
116 |
117 | for score, text_list in dev_data.items():
118 | for text in tqdm(text_list):
119 | words = [word for word in text.split(' ')]
120 | plt.scatter(i, len(words))
121 | i += 1
122 |
123 | for score, text_list in test_data.items():
124 | for text in tqdm(text_list):
125 | words = [word for word in text.split(' ')]
126 | plt.scatter(i, len(words))
127 | i += 1
128 |
129 | plt.show()
130 |
--------------------------------------------------------------------------------
/src/lstm.py:
--------------------------------------------------------------------------------
1 | from constant import *
2 | from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
3 |
4 | import torch
5 | import torch.nn as nn
6 | import torch.nn.functional as F
7 | import numpy as np
8 | import random
9 |
10 |
11 | class LSTM(nn.Module):
12 | def __init__(self, vocab_size):
13 | super().__init__()
14 |
15 | # Seed fixing
16 | np.random.seed(777)
17 | torch.manual_seed(777)
18 | torch.cuda.manual_seed_all(777)
19 | random.seed(777)
20 |
21 | self.embedding = nn.Embedding(vocab_size, d_w)
22 | self.lstm = nn.LSTM(
23 | input_size=d_w,
24 | hidden_size=d_h,
25 | bidirectional=bidirectional,
26 | batch_first=True,
27 | dropout=drop_out_rate,
28 | num_layers=layer_num
29 | )
30 | self.dir_num = 2 if bidirectional else 1
31 | self.query = nn.Linear(d_h * self.dir_num, 1)
32 | self.output_linear = nn.Linear(d_h * self.dir_num, class_num)
33 | self.softmax = nn.LogSoftmax(dim=-1)
34 |
35 | def init_hidden(self, input_shape):
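# Zero-initialize the hidden and cell states: (layer_num * dir_num, batch_size, d_h).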
36 | h0 = torch.zeros((layer_num * self.dir_num, input_shape[0], d_h)).to(device)
37 | c0 = torch.zeros((layer_num * self.dir_num, input_shape[0], d_h)).to(device)
38 |
39 | return h0, c0
40 |
41 | def forward(self, x, lens):
42 | h0, c0 = self.init_hidden(x.shape)
43 |
44 | embedded = self.embedding(x) # (B, L) => (B, L, d_w)
45 | packed_input = pack_padded_sequence(embedded, lens, batch_first=True)
46 |
47 | output, _ = self.lstm(packed_input, (h0, c0))
48 | output = pad_packed_sequence(output, batch_first=True)[0] # (B, L, d_h * dir_num)
49 |
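# Attention pooling: score each time step, normalize the scores with softmax, and take the weighted sum of the LSTM outputs.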
50 | attn_score = self.query(output).squeeze(dim=-1) # (B, L)
51 | attn_distrib = F.softmax(attn_score, dim=-1) # (B, L)
52 | output = torch.bmm(attn_distrib.unsqueeze(dim=1), output).squeeze(dim=1) # (B, d_h * dir_num)
53 |
54 | output = self.output_linear(output) # (B, class_num)
55 |
56 | return self.softmax(output)
--------------------------------------------------------------------------------
/src/main.py:
--------------------------------------------------------------------------------
1 | from tqdm import tqdm
2 | from custom_data import *
3 | from lstm import *
4 | from constant import *
5 | from torch.utils.data import DataLoader
6 | from sklearn.metrics import f1_score
7 | from bayes_opt import BayesianOptimization
8 |
9 | import torch
10 | import torch.optim as optim
11 | import torch.nn as nn
12 | import os
13 | import argparse
14 | import numpy as np
15 |
16 |
17 | class Manager:
18 | def __init__(self):
19 | print("Loading dataset & vocab dict...")
20 | self.train_set, self.dev_set, self.test_set, self.word2idx = get_data()
21 |
22 | self.pbounds = {
23 | 'learning_rate': learning_rates,
24 | 'batch_size': batch_sizes
25 | }
26 |
27 | self.bayes_optimizer = BayesianOptimization(
28 | f=self.train,
29 | pbounds=self.pbounds,
30 | random_state=777
31 | )
32 |
33 | def train(self, learning_rate, batch_size):
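# Objective function passed to BayesianOptimization: the optimizer samples continuous
# values from pbounds, so batch_size is rounded to an int, and the best validation F1
# returned at the end is the value being maximized.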
34 | batch_size = round(batch_size)
35 | train_loader = DataLoader(self.train_set, batch_size=batch_size, shuffle=True)
36 | valid_loader = DataLoader(self.dev_set, batch_size=batch_size, shuffle=True)
37 |
38 | print("Loading model...")
39 | model = LSTM(len(self.word2idx)).to(device)
40 | criterion = nn.NLLLoss(reduction='mean')
41 |
42 | if not os.path.isdir(ckpt_dir):
43 | os.mkdir(ckpt_dir)
44 |
45 | for p in model.parameters():
46 | if p.dim() > 1:
47 | nn.init.xavier_uniform_(p)
48 |
49 | print("Initializing optimizer & loss function...")
50 | optimizer = optim.Adam(model.parameters(), lr=learning_rate)
51 |
52 | best_f1 = 0.0
53 |
54 | print("Train starts.")
55 | for epoch in range(1, epoch_num+1):
56 | model.train()
57 |
58 | total_train_losses = []
59 | total_train_preds = []
60 | total_train_targs = []
61 |
62 | for batch in tqdm(train_loader):
63 | x, y, lens = batch
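# Sort the batch by length in descending order, as required by pack_padded_sequence inside the model.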
64 | lens_sorted, idx = lens.sort(dim=0, descending=True)
65 | x_sorted = x[idx]
66 | y_sorted = y[idx]
67 |
68 | x, y, lens = x_sorted.to(device), y_sorted.to(device), lens_sorted.to(device)
69 |
70 | output = model(x, lens) # (B, class_num)
71 | loss = criterion(output, y) # ()
72 |
73 | optimizer.zero_grad()
74 | loss.backward()
75 | optimizer.step()
76 |
77 | total_train_losses.append(loss.item())
78 | total_train_preds += torch.argmax(output, dim=-1).tolist()
79 | total_train_targs += y.tolist()
80 |
81 | train_loss = np.mean(total_train_losses)
82 | train_f1 = f1_score(total_train_targs, total_train_preds, average='weighted')
83 |
84 | print(f"########## Epoch: {epoch} ##########")
85 | print(f"Train loss: {train_loss} || Train f1 score: {train_f1}")
86 |
87 | valid_loss, valid_f1 = self.validate(model, criterion, valid_loader)
88 |
89 | if valid_f1 > best_f1:
90 | print("***** Current best model saved. *****")
91 | torch.save(model.state_dict(), f"{ckpt_dir}/best_model_batch|{batch_size}_lr|{round(learning_rate, 4)}.pth")
92 | best_f1 = valid_f1
93 |
94 | print(f"Valid loss: {valid_loss} || Valid f1 score: {valid_f1} || Best f1 score: {best_f1}")
95 |
96 | return best_f1
97 |
98 | def validate(self, model, criterion, valid_loader):
99 | model.eval()
100 | total_valid_losses = []
101 | total_valid_preds = []
102 | total_valid_targs = []
103 |
104 | for batch in tqdm(valid_loader):
105 | x, y, lens = batch
106 | lens_sorted, idx = lens.sort(dim=0, descending=True)
107 | x_sorted = x[idx]
108 | y_sorted = y[idx]
109 |
110 | x, y, lens = x_sorted.to(device), y_sorted.to(device), lens_sorted.to(device)
111 |
112 | output = model(x, lens) # (B, class_num)
113 | loss = criterion(output, y) # ()
114 |
115 | total_valid_losses.append(loss.item())
116 | total_valid_preds += torch.argmax(output, dim=-1).tolist()
117 | total_valid_targs += y.tolist()
118 |
119 | valid_loss = np.mean(total_valid_losses)
120 | valid_f1 = f1_score(total_valid_targs, total_valid_preds, average='weighted')
121 |
122 | return valid_loss, valid_f1
123 |
124 | def test(self, model_name, batch_size):
125 | test_loader = DataLoader(self.test_set, batch_size=batch_size, shuffle=True)
126 |
127 | print("Loading model...")
128 | model = LSTM(len(self.word2idx)).to(device)
129 | criterion = nn.NLLLoss(reduction='mean')
130 |
131 | model.load_state_dict(torch.load(f"{ckpt_dir}/{model_name}", map_location=device))
132 |
133 | model.eval()
134 | total_test_losses = []
135 | total_test_preds = []
136 | total_test_targs = []
137 |
138 | for batch in tqdm(test_loader):
139 | x, y, lens = batch
140 | lens_sorted, idx = lens.sort(dim=0, descending=True)
141 | x_sorted = x[idx]
142 | y_sorted = y[idx]
143 |
144 | x, y, lens = x_sorted.to(device), y_sorted.to(device), lens_sorted.to(device)
145 |
146 | output = model(x, lens) # (B, class_num)
147 | loss = criterion(output, y) # ()
148 |
149 | total_test_losses.append(loss.item())
150 | total_test_preds += torch.argmax(output, dim=-1).tolist()
151 | total_test_targs += y.tolist()
152 |
153 | test_loss = np.mean(total_test_losses)
154 | test_f1 = f1_score(total_test_targs, total_test_preds, average='weighted')
155 |
156 | print("######## Test Results ########")
157 | print(f"Test loss: {test_loss} || Test f1 score: {test_f1}")
158 |
159 |
160 | if __name__=='__main__':
161 | parser = argparse.ArgumentParser()
162 | parser.add_argument('--mode', type=str, required=True, help='train or test?')
163 | parser.add_argument('--model_name', type=str, help='name of model file if you want to test.')
164 | parser.add_argument('--inference_batch_size', type=int, default=128, help='Batch size for inferencing.')
165 |
166 | args = parser.parse_args()
167 |
168 | assert args.mode == 'train' or args.mode == 'test', "Please specify correct mode."
169 |
170 | manager = Manager()
171 |
172 | if args.mode == 'train':
173 | print("Training starts.")
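# 'ei' = Expected Improvement acquisition function; xi controls the exploration/exploitation trade-off.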
174 | manager.bayes_optimizer.maximize(init_points=init_points, n_iter=n_iter, acq='ei', xi=0.01)
175 |
176 | print("Best optimization option")
177 | print(manager.bayes_optimizer.max)
178 |
179 | elif args.mode == 'test':
180 | assert args.model_name is not None, "Please give the model name if you want to conduct test."
181 |
182 | print("Testing starts.")
183 | manager.test(args.model_name, batch_size=args.inference_batch_size)
184 |
--------------------------------------------------------------------------------