├── .gitignore
├── README.md
├── code
│   ├── README.md
│   ├── bert
│   │   ├── README.md
│   │   ├── double_text_classifier.py
│   │   ├── models
│   │   │   └── rubert_cased_L-12_H-768_A-12_pt
│   │   │       ├── config.json
│   │   │       └── vocab.txt
│   │   ├── out
│   │   │   ├── RuMedDaNet.jsonl
│   │   │   ├── RuMedNER.jsonl
│   │   │   ├── RuMedNLI.jsonl
│   │   │   ├── RuMedSymptomRec.jsonl
│   │   │   └── RuMedTop3.jsonl
│   │   ├── run.sh
│   │   ├── single_text_classifier.py
│   │   ├── token_classifier.py
│   │   └── utils.py
│   ├── bilstm
│   │   ├── README.md
│   │   ├── double_text_classifier.py
│   │   ├── out
│   │   │   ├── RuMedDaNet.jsonl
│   │   │   ├── RuMedNER.jsonl
│   │   │   ├── RuMedNLI.jsonl
│   │   │   ├── RuMedSymptomRec.jsonl
│   │   │   └── RuMedTop3.jsonl
│   │   ├── run.sh
│   │   ├── single_text_classifier.py
│   │   ├── token_classifier.py
│   │   └── utils.py
│   ├── eval.py
│   ├── human
│   │   ├── RuMedDaNet.jsonl
│   │   ├── RuMedNER.jsonl
│   │   ├── RuMedNLI.jsonl
│   │   ├── RuMedSymptomRec.jsonl
│   │   └── RuMedTop3.jsonl
│   ├── linear_models
│   │   ├── README.md
│   │   ├── double_text_classifier.py
│   │   ├── out
│   │   │   ├── RuMedDaNet.jsonl
│   │   │   ├── RuMedNER.jsonl
│   │   │   ├── RuMedNLI.jsonl
│   │   │   ├── RuMedSymptomRec.jsonl
│   │   │   └── RuMedTop3.jsonl
│   │   ├── run.sh
│   │   ├── single_text_classifier.py
│   │   └── token_classifier.py
│   ├── naive
│   │   ├── RuMedDaNet.jsonl
│   │   ├── RuMedNER.jsonl
│   │   ├── RuMedNLI.jsonl
│   │   ├── RuMedSymptomRec.jsonl
│   │   └── RuMedTop3.jsonl
│   ├── requirements.txt
│   └── tasks_builder.py
├── data
│   ├── README.md
│   ├── RuMedDaNet
│   │   ├── dev_v1.jsonl
│   │   ├── private_test_v1.jsonl
│   │   ├── test_v1.jsonl
│   │   └── train_v1.jsonl
│   ├── RuMedNER
│   │   ├── dev_v1.jsonl
│   │   ├── test_v1.jsonl
│   │   └── train_v1.jsonl
│   ├── RuMedNLI
│   │   ├── README.md
│   │   ├── dev_v1.jsonl
│   │   ├── private_test_v1.jsonl
│   │   ├── test_v1.jsonl
│   │   └── train_v1.jsonl
│   ├── RuMedSymptomRec
│   │   ├── dev_v1.jsonl
│   │   ├── test_v1.jsonl
│   │   └── train_v1.jsonl
│   ├── RuMedTest
│   │   └── private_test_v1.jsonl
│   ├── RuMedTop3
│   │   ├── dev_v1.jsonl
│   │   ├── test_v1.jsonl
│   │   └── train_v1.jsonl
│   └── raw
│       ├── RuDReC.csv
│       ├── RuMedPrimeData.tsv
│       └── rec_markup.csv
└── lb_submissions
    ├── SAI
    │   ├── ChatGPT
    │   │   ├── README.md
    │   │   ├── RuMedDaNet.jsonl
    │   │   ├── RuMedNLI.jsonl
    │   │   ├── RuMedTest.jsonl
    │   │   ├── chat-rmb.ipynb
    │   │   ├── rmdanet_dev_gpt3_1202_log.json
    │   │   ├── rmnli_priv_gpt3_1502.pd.pickle
    │   │   └── rmtest_gpt3_1002_log.json
    │   ├── ECGAuto
    │   │   ├── ECGBaselineLib
    │   │   │   ├── autobaseline.py
    │   │   │   ├── datasets.py
    │   │   │   └── utils.py
    │   │   ├── README.md
    │   │   ├── requirements.txt
    │   │   └── training.py
    │   ├── ECGBinary
    │   │   ├── ECGBaselineLib
    │   │   │   ├── datasets.py
    │   │   │   ├── neurobaseline.py
    │   │   │   └── utils.py
    │   │   ├── README.md
    │   │   ├── requirements.txt
    │   │   └── training.py
    │   ├── ECGMultihead
    │   │   ├── ECGBaselineLib
    │   │   │   ├── datasets.py
    │   │   │   ├── neurobaseline.py
    │   │   │   └── utils.py
    │   │   ├── README.md
    │   │   ├── requirements.txt
    │   │   └── training.py
    │   ├── Gigachat
    │   │   ├── .gitignore
    │   │   ├── README.md
    │   │   ├── convert_sogma.py
    │   │   ├── out
    │   │   │   ├── RuMedDaNet.jsonl
    │   │   │   ├── RuMedNLI.jsonl
    │   │   │   └── RuMedTest.jsonl
    │   │   ├── requirements.txt
    │   │   ├── rumed_da_net.py
    │   │   ├── rumed_nli.py
    │   │   ├── rumed_test.py
    │   │   ├── rumed_utils.py
    │   │   ├── s00-prepare.sh
    │   │   ├── s01-run-all-trains.sh
    │   │   └── s02-run-all-tests.sh
    │   ├── Human
    │   │   └── README.md
    │   ├── Naive
    │   │   ├── README.md
    │   │   └── sample_submission.zip
    │   ├── RNN
    │   │   ├── README.md
    │   │   ├── double_text_classifier.py
    │   │   ├── rnn.zip
    │   │   ├── run.sh
    │   │   ├── test_solver.py
    │   │   └── utils.py
    │   ├── RuBERT
    │   │   ├── README.md
    │   │   ├── bert.zip
    │   │   ├── double_text_classifier.py
    │   │   ├── pool.zip
    │   │   ├── requirements.txt
    │   │   ├── run.sh
    │   │   ├── test_solver.py
    │   │   └── utils.py
    │   └── TF-IDF
    │       ├── README.md
    │       ├── double_text_classifier.py
    │       ├── run.sh
    │       ├── test_solver.py
    │       └── tfidf.zip
    └── SAI_junior
        └── RuBioRoBERTa
            ├── README.md
            ├── RuMedDaNet.ipynb
            ├── RuMedDaNet.jsonl
            ├── RuMedNLI.ipynb
            ├── RuMedNLI.jsonl
            ├── RuMedTest.ipynb
            ├── RuMedTest.jsonl
            └── requirements.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | **/__pycache__
2 | **/.ipynb_checkpoints
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # The repository is closed.
2 | # For the current version, please visit https://github.com/sb-ai-lab/MedBench
3 |
4 | ## Citation
5 | ```bibtex
6 | @misc{blinov2022rumedbench,
7 | title={RuMedBench: A Russian Medical Language Understanding Benchmark},
8 | author={Pavel Blinov and Arina Reshetnikova and Aleksandr Nesterov and Galina Zubkova and Vladimir Kokh},
9 | year={2022},
10 | eprint={2201.06499},
11 | archivePrefix={arXiv},
12 | primaryClass={cs.CL}
13 | }
14 | ```
15 |
--------------------------------------------------------------------------------
/code/README.md:
--------------------------------------------------------------------------------
1 | ## Dependencies and Library Versions
2 | For specific library versions, see `requirements.txt` and install them with
3 | ```bash
4 | pip install -r requirements.txt
5 | ```
6 |
7 | ## Hardware Requirements
8 | The code runs on the following hardware:
9 | ```
10 | CPU Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
11 | GPU NVIDIA Tesla V100-PCIE-16GB
12 | RAM 16GB
13 | HDD 16GB
14 | ```
15 |
16 | ## General Description
17 | Each of the directories `bert`, `bilstm`, and `linear_models` contains a baseline model.
18 |
19 | Each model should produce an output directory (e.g. `bert/out`) with result `jsonl` files named after the task (e.g. `RuMedTop3.jsonl`).
20 |
21 | Each file contains the same samples as the corresponding test part, enhanced with a `prediction` field.
22 | Examples:
23 | for `RuMedTop3.jsonl`
24 | ```
25 | {
26 | "idx": "qaf1454f",
27 | "code": "I11",
28 | "prediction": ["I11", "I20", "I10"]
29 | }
30 | ```
31 |
32 | or `RuMedSymptomRec.jsonl`
33 | ```
34 | {
35 | "idx": "q45f6321",
36 | "code": "боль в шее",
37 | "prediction": ["тошнота", "боль в шее", "частые головные боли"]
38 | }
39 | ```
40 |
41 | or `RuMedDaNet.jsonl`
42 | ```
43 | {
44 | "pairID": "f5309eadb4eacf0f144b24e260643ea2",
45 | "answer": "да",
46 | "prediction": "нет"
47 | }
48 | ```
49 |
50 | or `RuMedNLI.jsonl`
51 | ```
52 | {
53 | "pairID": "1f2a8146-66c7-11e7-b4f2-f45c89b91419",
54 | "gold_label": "entailment",
55 | "prediction": "neutral"
56 | }
57 | ```
58 |
59 | or `RuMedNER.jsonl`
60 | ```
61 | {
62 | "idx": "769708.tsv_5",
63 | "ner_tags": ["B-Drugname", "O", "B-Drugclass", "O", "O"],
64 | "prediction": ["B-Drugclass", "O", "O", "O", "O"]
65 | }
66 | ```
67 |
68 | ### tasks_builder.py
69 |
70 | This script prepares the data for the benchmark tasks from the raw data files.
71 |
72 | ```bash
73 | python tasks_builder.py
74 | ```
75 |
76 | ### eval.py
77 |
78 | This script evaluates the test results.
79 |
80 | Run it like
81 | ```bash
82 | python eval.py --out_dir bert/out
83 | ```
84 | or
85 | ```bash
86 | python eval.py --out_dir human
87 | ```
88 |
--------------------------------------------------------------------------------
/code/bert/README.md:
--------------------------------------------------------------------------------
1 | To run the BERT models:
2 | 1) Download the [RuBERT model](http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_pt.tar.gz) & extract it to `models/rubert_cased_L-12_H-768_A-12_pt`.
3 | ```bash
4 | mkdir -p models/; cd models/
5 | wget "http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_pt.tar.gz"
6 | tar -xvzf rubert_cased_L-12_H-768_A-12_pt.tar.gz
7 | ```
8 | 2) Run
9 | `./run.sh bert` for *RuBERT* model
10 | or
11 | `./run.sh pool` for *RuPoolBERT* model.
--------------------------------------------------------------------------------
/code/bert/models/rubert_cased_L-12_H-768_A-12_pt/config.json:
--------------------------------------------------------------------------------
1 | {
2 | "attention_probs_dropout_prob": 0.1,
3 | "directionality": "bidi",
4 | "hidden_act": "gelu",
5 | "hidden_dropout_prob": 0.1,
6 | "hidden_size": 768,
7 | "initializer_range": 0.02,
8 | "intermediate_size": 3072,
9 | "max_position_embeddings": 512,
10 | "num_attention_heads": 12,
11 | "num_hidden_layers": 12,
12 | "pooler_fc_size": 768,
13 | "pooler_num_attention_heads": 12,
14 | "pooler_num_fc_layers": 3,
15 | "pooler_size_per_head": 128,
16 | "pooler_type": "first_token_transform",
17 | "type_vocab_size": 2,
18 | "vocab_size": 119547
19 | }
20 |
--------------------------------------------------------------------------------
/code/bert/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | type=$1
4 |
5 | out=$(pwd)'/out'
6 | mkdir -p $out
7 |
8 | # run the tasks sequentially
9 | python -u single_text_classifier.py --gpu 0 --task_name 'RuMedTop3' --bert_type $type
10 | python -u single_text_classifier.py --gpu 0 --task_name 'RuMedSymptomRec' --bert_type $type
11 | python -u double_text_classifier.py --gpu 0 --task_name 'RuMedDaNet' --bert_type $type
12 | python -u double_text_classifier.py --gpu 0 --task_name 'RuMedNLI' --bert_type $type
13 | python -u token_classifier.py --gpu 0 --task_name 'RuMedNER' --bert_type $type
14 |
15 | # # or run in parallel on multiple gpus
16 | # python -u single_text_classifier.py --gpu 0 --task_name 'RuMedTop3' --bert_type $type &
17 | # python -u single_text_classifier.py --gpu 1 --task_name 'RuMedSymptomRec' --bert_type $type &
18 | # python -u double_text_classifier.py --gpu 2 --task_name 'RuMedDaNet' --bert_type $type &
19 | # python -u double_text_classifier.py --gpu 3 --task_name 'RuMedNLI' --bert_type $type &
20 | # wait
21 | # python -u token_classifier.py --gpu 0 --task_name 'RuMedNER' --bert_type $type
22 |
--------------------------------------------------------------------------------
/code/bert/single_text_classifier.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import gc
3 | import os
4 | import pandas as pd
5 | import numpy as np
6 | import json
7 |
8 | import torch
9 | from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
10 |
11 | from transformers import BertTokenizer, BertConfig
12 | from transformers.optimization import AdamW
13 |
14 | import argparse
15 | from scipy.special import expit
16 | from keras.preprocessing.sequence import pad_sequences
17 |
18 | from utils import seed_everything, seed_worker
19 |
20 | def encode_texts(tokenizer, sentences):
21 | bs = 20000
22 | input_ids, attention_masks = [], []
23 | for i in range(0, len(sentences), bs):
24 | b_sentences = ['[CLS] ' + sentence + ' [SEP]' for sentence in sentences[i:i+bs]]
25 | tokenized_texts = [tokenizer.tokenize(sent) for sent in b_sentences]
26 | b_input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
27 | b_input_ids = pad_sequences(b_input_ids, maxlen=MAX_LEN, dtype='long', truncating='post', padding='post')
28 | b_attention_masks = []
29 | for seq in b_input_ids:
30 | seq_mask = [float(i>0) for i in seq]
31 | b_attention_masks.append(seq_mask)
32 |
33 | attention_masks.append(b_attention_masks)
34 | input_ids.append(b_input_ids)
35 | input_ids, attention_masks = np.vstack(input_ids), np.vstack(attention_masks)
36 | return input_ids, attention_masks
37 |
38 | def hit_at_n(y_true, y_pred, index2label, n=3):
39 | assert len(y_true) == len(y_pred)
40 | hit_count = 0
41 | for l, row in zip(y_true, y_pred):
42 | order = (np.argsort(row)[::-1])[:n]
43 | order = [index2label[i] for i in order]
44 | order = set(order)
45 | hit_count += int(l in order)
46 | return hit_count/float(len(y_true))
47 |
48 | SEED = 128
49 | seed_everything(SEED)
50 |
51 | MAX_LEN = 256
52 |
53 | def setup_parser():
54 | parser = argparse.ArgumentParser()
55 |
56 | parser.add_argument('--gpu',
57 | default=None,
58 | type=int,
59 | required=True,
60 | help='The index of the gpu to run.')
61 | parser.add_argument('--task_name',
62 | default='',
63 | type=str,
64 | required=True,
65 | help='The name of the task to run.')
66 | parser.add_argument('--bert_type',
67 | default='',
68 | type=str,
69 | required=True,
70 | help='The type of BERT model (bert or pool).')
71 | return parser
72 |
73 | if __name__ == '__main__':
74 | parser = setup_parser()
75 | args = parser.parse_args()
76 |
77 | os.environ['CUDA_VISIBLE_DEVICES'] = str(args.gpu)
78 | device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
79 |
80 | if args.bert_type=='pool': # select the BERT model class to use
81 | from utils import PoolBertForSequenceClassification as BertModel
82 | else:
83 | from transformers import BertForSequenceClassification as BertModel
84 |
85 | task_name = args.task_name
86 |
87 | base_path = os.path.abspath( os.path.join(os.path.dirname( __file__ ) ) )
88 | out_dir = os.path.join(base_path, 'out')
89 | model_path = os.path.join(base_path, 'models/rubert_cased_L-12_H-768_A-12_pt/')
90 |
91 | base_path = os.path.abspath( os.path.join(base_path, '../..') )
92 |
93 | parts = ['train', 'dev', 'test']
94 | data_path = os.path.join(base_path, 'data', task_name)
95 |
96 | text1_id, label_id, index_id = 'symptoms', 'code', 'idx'
97 | if task_name=='RuMedTop3':
98 | pass
99 | elif task_name=='RuMedSymptomRec':
100 | pass
101 | else:
102 | raise ValueError('unknown task')
103 |
104 | part2indices = {p:set() for p in parts}
105 | all_ids, sentences, labels = [], [], []
106 | for p in parts:
107 | fname = '{}_v1.jsonl'.format(p)
108 | with open(os.path.join( data_path, fname)) as f:
109 | for line in f:
110 | data = json.loads(line)
111 | s1 = data[text1_id]
112 | sentences.append( s1 )
113 | labels.append( data[label_id] )
114 | idx = data[index_id]
115 | all_ids.append( idx )
116 | part2indices[p].add( idx )
117 | all_ids = np.array(all_ids)
118 | print ('len(total)', len(sentences))
119 |
120 | code_set = set(labels)
121 | l2i = {code:i for i, code in enumerate(sorted(code_set))}
122 | i2l = {l2i[l]:l for l in l2i}
123 | print ( 'len(l2i)', len(l2i) )
124 |
125 | tokenizer = BertTokenizer.from_pretrained(
126 | os.path.join(base_path, model_path),
127 | do_lower_case=True,
128 | max_length=MAX_LEN
129 | )
130 |
131 | input_ids, attention_masks = encode_texts(tokenizer, sentences)
132 |
133 | label_indices = np.array([l2i[l] for l in labels])
134 |
135 | labels = np.zeros((input_ids.shape[0], len(l2i)))
136 | for row, col in enumerate(label_indices):
137 | labels[row, col] = 1
138 |
139 | # prepare test data loader
140 | test_ids = part2indices['test']
141 | test_mask = np.array([sid in test_ids for sid in all_ids])
142 | test_ids = all_ids[test_mask]
143 | tst_inputs, tst_masks, tst_labels = input_ids[test_mask], attention_masks[test_mask], labels[test_mask]
144 |
145 | tst_inputs = torch.tensor(tst_inputs)
146 | tst_masks = torch.tensor(tst_masks)
147 | tst_labels = torch.tensor(tst_labels)
148 |
149 | test_data = TensorDataset(tst_inputs, tst_masks, tst_labels)
150 | test_sampler = SequentialSampler(test_data)
151 | test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=8, worker_init_fn=seed_worker)
152 |
153 | batch_size = 4
154 | epochs = 25
155 | lr = 3e-5
156 | max_grad_norm = 1.0
157 |
158 | cv_res = {}
159 | for fold in range(1):
160 | best_dev_score = -1
161 | seed_everything(SEED)
162 | train_ids = part2indices['train']
163 | dev_ids = part2indices['dev']
164 |
165 | train_mask = np.array([sid in train_ids for sid in all_ids])
166 | dev_mask = np.array([sid in dev_ids for sid in all_ids])
167 |
168 | input_ids_train, attention_masks_train, labels_train = input_ids[train_mask], attention_masks[train_mask], labels[train_mask]
169 | input_ids_dev, attention_masks_dev, labels_dev = input_ids[dev_mask], attention_masks[dev_mask], labels[dev_mask]
170 | print ('fold', fold, input_ids_train.shape, input_ids_dev.shape)
171 |
172 | input_ids_train = torch.tensor(input_ids_train)
173 | attention_masks_train = torch.tensor(attention_masks_train)
174 | labels_train = torch.tensor(labels_train)
175 |
176 | train_data = TensorDataset(input_ids_train, attention_masks_train, labels_train)
177 | train_sampler = RandomSampler(train_data)
178 | train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size, worker_init_fn=seed_worker)
179 |
180 | ##prediction_dataloader
181 | input_ids_dev = torch.tensor(input_ids_dev)
182 | attention_masks_dev = torch.tensor(attention_masks_dev)
183 | labels_dev = torch.tensor(labels_dev)
184 | prediction_data = TensorDataset(input_ids_dev, attention_masks_dev, labels_dev)
185 | prediction_sampler = SequentialSampler(prediction_data)
186 | prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size, worker_init_fn=seed_worker)
187 |
188 | ## take appropriate config and init a BERT model
189 | config_path = os.path.join( base_path, model_path, 'bert_config.json' )
190 | conf = BertConfig.from_json_file( config_path )
191 | conf.num_labels = len(l2i)
192 | model = BertModel(conf)
193 | output_model_file = os.path.join( base_path, model_path, 'pytorch_model.bin' )
194 | model.load_state_dict(torch.load(output_model_file), strict=False)
195 | model = model.cuda()
196 |
197 | param_optimizer = list(model.named_parameters())
198 | no_decay = ['bias', 'gamma', 'beta']
199 | optimizer_grouped_parameters = [
200 | {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay_rate': 0.01},
201 | {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay_rate': 0.0}
202 | ]
203 |
204 | # The optimizer holds all of the hyperparameter settings our training loop needs
205 | optimizer = AdamW(optimizer_grouped_parameters, lr=lr, correct_bias=False)
206 | scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=lr, steps_per_epoch=len(train_dataloader), epochs=epochs)
207 |
208 | train_loss = []
209 | for _ in range(epochs):
210 | model.train(); torch.cuda.empty_cache()
211 |
212 | tr_loss = 0
213 | nb_tr_examples, nb_tr_steps = 0, 0
214 | for step, batch in enumerate(train_dataloader):
215 | batch = tuple(t.to(device) for t in batch)
216 | b_input_ids, b_input_mask, b_labels = batch
217 | optimizer.zero_grad()
218 |
219 | outputs = model( b_input_ids, attention_mask=b_input_mask, labels=b_labels )
220 | loss, logits = outputs[:2]
221 | train_loss.append(loss.item())
222 | loss.backward()
223 | torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
224 | optimizer.step()
225 | scheduler.step()
226 |
227 | tr_loss += loss.item()
228 | nb_tr_examples += b_input_ids.size(0)
229 | nb_tr_steps += 1
230 | avg_train_loss = tr_loss/nb_tr_steps
231 |
232 | ### val
233 | model.eval()
234 | predictions = []
235 | tr_loss, nb_tr_steps = 0, 0
236 | for step, batch in enumerate(prediction_dataloader):
237 | batch = tuple(t.to(device) for t in batch)
238 | b_input_ids, b_input_mask, b_labels = batch
239 | with torch.no_grad():
240 | outputs = model( b_input_ids, attention_mask=b_input_mask, labels=b_labels )
241 | loss, logits = outputs[:2]
242 | tr_loss += loss.item()
243 | nb_tr_steps += 1
244 | logits = logits.detach().cpu().numpy()
245 | predictions.append(logits)
246 | predictions = expit(np.vstack(predictions))
247 | edev_loss = tr_loss/nb_tr_steps
248 |
249 | y_indices = np.argmax(labels_dev.detach().cpu().numpy(), axis=1)
250 | dev_codes = [i2l[i] for i in y_indices]
251 |
252 | dev_acc = hit_at_n(dev_codes, predictions, i2l, n=1)*100
253 | dev_hit_at3 = hit_at_n(dev_codes, predictions, i2l, n=3)*100
254 | print ('{} epoch {} average train_loss: {:.6f}\tdev_loss: {:.6f}\tdev_acc {:.2f}\tdev_hit_at3 {:.2f}'.format(task_name, _, avg_train_loss, edev_loss, dev_acc, dev_hit_at3))
255 |
256 | score = (dev_acc+dev_hit_at3)/2
257 | if score>best_dev_score: # compute result for test part and store to out file, if we found better model
258 | best_dev_score = score
259 | cv_res[fold] = (dev_acc, dev_hit_at3)
260 |
261 | predictions, true_labels = [], []
262 | for batch in test_dataloader:
263 | batch = tuple(t.to(device) for t in batch)
264 | b_input_ids, b_input_mask, b_labels = batch
265 |
266 | with torch.no_grad():
267 | outputs = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)
268 | logits = outputs[1].detach().cpu().numpy()
269 | label_ids = b_labels.to('cpu').numpy()
270 | predictions.append(logits)
271 | true_labels.append(label_ids)
272 | predictions = expit(np.vstack(predictions))
273 | true_labels = np.concatenate(true_labels)
274 | assert len(true_labels) == len(predictions)
275 | recs = []
276 | for idx, l, row in zip(test_ids, true_labels, predictions):
277 | gt = i2l[np.argmax(l)]
278 | order = (np.argsort(row)[::-1])[:3]
279 | pred = [i2l[i] for i in order]
280 | recs.append( (idx, gt, pred) )
281 |
282 | out_fname = os.path.join(out_dir, task_name+'.jsonl')
283 | with open(out_fname, 'w') as fw:
284 | for rec in recs:
285 | data = {index_id:rec[0], label_id:rec[1], 'prediction':rec[2]}
286 | json.dump(data, fw, ensure_ascii=False)
287 | fw.write('\n')
288 | del model; gc.collect(); torch.cuda.empty_cache()
289 |
290 | dev_acc, dev_hit_at3 = cv_res[0]
291 | print ('\ntask scores {}: {:.2f}/{:.2f}'.format(task_name, dev_acc, dev_hit_at3))
292 |
--------------------------------------------------------------------------------
/code/bert/utils.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import os
3 | import random
4 | import torch
5 | import numpy as np
6 |
7 | def seed_everything(seed):
8 | random.seed(seed)
9 | os.environ['PYTHONHASHSEED'] = str(seed)
10 | np.random.seed(seed)
11 | torch.manual_seed(seed)
12 | torch.cuda.manual_seed_all(seed)
13 | torch.cuda.manual_seed(seed)
14 | torch.backends.cudnn.deterministic = True
15 | torch.backends.cudnn.benchmark = False
16 |
17 | def seed_worker(worker_id):
18 | worker_seed = torch.initial_seed() % 2**32
19 | np.random.seed(worker_seed)
20 | random.seed(worker_seed)
21 |
22 |
23 | from torch import nn
24 | import torch.nn.functional as F
25 | from transformers import BertTokenizer, BertConfig, BertPreTrainedModel, BertModel
26 |
27 | class PoolBertForTokenClassification(BertPreTrainedModel):
28 | def __init__(self, config):
29 | super().__init__(config)
30 | self.num_labels = config.num_labels
31 |
32 | self.bert = BertModel(config, add_pooling_layer=False)
33 | self.dropout = nn.Dropout(config.hidden_dropout_prob)
34 | self.classifier = nn.Linear(config.hidden_size*3, config.num_labels)
35 |
36 | self.w_size = 4
37 |
38 | self.init_weights()
39 |
40 | def forward(
41 | self,
42 | input_ids=None,
43 | attention_mask=None,
44 | token_type_ids=None,
45 | position_ids=None,
46 | head_mask=None,
47 | inputs_embeds=None,
48 | labels=None,
49 | output_attentions=None,
50 | output_hidden_states=None,
51 | return_dict=None,
52 | ):
53 | outputs = self.bert(
54 | input_ids,
55 | attention_mask=attention_mask,
56 | token_type_ids=token_type_ids,
57 | position_ids=position_ids,
58 | head_mask=head_mask,
59 | inputs_embeds=inputs_embeds,
60 | output_attentions=output_attentions,
61 | output_hidden_states=output_hidden_states,
62 | return_dict=return_dict,
63 | )
64 |
65 | sequence_output = outputs['last_hidden_state']
66 |
67 | shape = list(sequence_output.shape)
68 | shape[1]+=self.w_size-1
69 |
70 | t_ext = torch.zeros(shape, dtype=sequence_output.dtype, device=sequence_output.device)
71 | t_ext[:, self.w_size-1:, :] = sequence_output
72 |
73 | unfold_t = t_ext.unfold(1, self.w_size, 1).transpose(3,2)
74 | pooled_output_mean = torch.mean(unfold_t, 2)
75 |
76 | pooled_output, _ = torch.max(unfold_t, 2)
77 | pooled_output = torch.relu(pooled_output)
78 |
79 | sequence_output = torch.cat((pooled_output, pooled_output_mean, sequence_output), 2)
80 |
81 | sequence_output = self.dropout(sequence_output)
82 |
83 | logits = self.classifier(sequence_output)
84 |
85 | loss = None
86 | if labels is not None:
87 | loss_fct = nn.CrossEntropyLoss()
88 | # Only keep active parts of the loss
89 | if attention_mask is not None:
90 | active_loss_mask = attention_mask.view(-1) == 1
91 | active_logits = logits.view(-1, self.num_labels)
92 |
93 | active_labels = torch.where(
94 | active_loss_mask,
95 | labels.view(-1),
96 | torch.tensor(loss_fct.ignore_index).type_as(labels)
97 | )
98 |
99 | loss = loss_fct(active_logits, active_labels)
100 | else:
101 | loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
102 |
103 | output = (logits,) + outputs[2:]
104 | return ((loss,) + output) if loss is not None else output
105 |
106 | class PoolBertForSequenceClassification(BertPreTrainedModel):
107 | def __init__(self, config):
108 | super().__init__(config)
109 | self.num_labels = config.num_labels
110 |
111 | self.bert = BertModel(config)
112 | self.dropout = nn.Dropout(config.hidden_dropout_prob)
113 | self.classifier = nn.Linear(config.hidden_size*3, self.config.num_labels)
114 |
115 | self.init_weights()
116 |
117 | def forward(
118 | self,
119 | input_ids=None,
120 | attention_mask=None,
121 | token_type_ids=None,
122 | position_ids=None,
123 | head_mask=None,
124 | inputs_embeds=None,
125 | labels=None,
126 | ):
127 | outputs = self.bert(
128 | input_ids,
129 | attention_mask=attention_mask,
130 | token_type_ids=token_type_ids,
131 | position_ids=position_ids,
132 | head_mask=head_mask,
133 | inputs_embeds=inputs_embeds,
134 | )
135 |
136 | encoder_out = outputs['last_hidden_state']
137 | cls = encoder_out[:, 0, :]
138 |
139 | pooled_output, _ = torch.max(encoder_out, 1)
140 | pooled_output = torch.relu(pooled_output)
141 |
142 | pooled_output_mean = torch.mean(encoder_out, 1)
143 | pooled_output = torch.cat((pooled_output, pooled_output_mean, cls), 1)
144 |
145 | pooled_output = self.dropout(pooled_output)
146 | logits = self.classifier(pooled_output)
147 |
148 | outputs = (logits,) + outputs[2:] # add hidden states and attention if they are here
149 |
150 | if labels is not None:
151 | if self.num_labels == 1:
152 | # We are doing regression
153 | loss_fct = nn.MSELoss()
154 | loss = loss_fct(logits.view(-1), labels.view(-1))
155 | else:
156 | loss = F.binary_cross_entropy_with_logits( logits.view(-1), labels.view(-1) )
157 | outputs = (loss,) + outputs
158 |
159 | return outputs # (loss), logits, (hidden_states), (attentions)
160 |
--------------------------------------------------------------------------------
/code/bilstm/README.md:
--------------------------------------------------------------------------------
1 | This directory contains a BiLSTM model with randomly initialized embeddings.
2 |
3 | ### How to run
4 |
5 | Run `./run.sh`, or run the model for each task separately, e.g.
6 |
7 | ```bash
8 | python single_text_classifier.py --task-name='RuMedSymptomRec' --device=0
9 | ```
10 |
11 | The models write results in `.jsonl` format to the output directory `out`.
12 |
--------------------------------------------------------------------------------
/code/bilstm/double_text_classifier.py:
--------------------------------------------------------------------------------
1 | from collections import defaultdict
2 | import json
3 | import pathlib
4 |
5 | import click
6 | import numpy as np
7 | import pandas as pd
8 | from sklearn.metrics import accuracy_score
9 | import torch
10 | from torch import nn
11 | from torch.utils.data import DataLoader
12 | from tqdm import tqdm
13 |
14 | from utils import preprocess, seed_everything, seed_worker, DataPreprocessor
15 |
16 | SEED = 101
17 | seed_everything(SEED)
18 | class Classifier(nn.Module):
19 |
20 | def __init__(self, n_classes, vocab_size, emb_dim=300, hidden_dim=256):
21 |
22 | super().__init__()
23 |
24 | self.emb_dim = emb_dim
25 | self.hidden_dim = hidden_dim
26 |
27 | self.embedding_layer = nn.Embedding(vocab_size, self.emb_dim)
28 | self.lstm_layer = nn.LSTM(self.emb_dim, self.hidden_dim, batch_first=True, num_layers=2,
29 | bidirectional=True)
30 | self.linear_layer = nn.Linear(self.hidden_dim * 2, n_classes)
31 |
32 | def forward(self, x):
33 | x = self.embedding_layer(x)
34 | _, (hidden, _) = self.lstm_layer(x)
35 | hidden = torch.cat([hidden[0, :, :], hidden[1, :, :]], axis=1)
36 | return self.linear_layer(hidden)
37 |
38 |
39 | def preprocess_two_seqs(text1, text2, seq_len):
40 | seq1_len = int(seq_len * 0.75)
41 | seq2_len = seq_len - seq1_len
42 |
43 | tokens1 = preprocess(text1)[:seq1_len]
44 | tokens2 = preprocess(text2)[:seq2_len]
45 |
46 | return tokens1 + tokens2
47 |
48 |
49 | def build_vocab(text_data, min_freq=1):
50 | word2freq = defaultdict(int)
51 | word2index = {'PAD': 0, 'UNK': 1}
52 |
53 | for text in text_data:
54 | for token in text:
55 | word2freq[token] += 1
56 |
57 | for word, freq in word2freq.items():
58 | if freq > min_freq:
59 | word2index[word] = len(word2index)
60 | return word2index
61 |
62 |
63 | def train_step(data, model, optimizer, criterion, device, losses, epoch):
64 |
65 | model.train()
66 |
67 | pbar = tqdm(total=len(data.dataset), desc=f'Epoch: {epoch + 1}')
68 |
69 | for x, y in data:
70 |
71 | x = x.to(device)
72 | y = y.to(device)
73 |
74 | optimizer.zero_grad()
75 | pred = model(x)
76 |
77 | loss = criterion(pred, y)
78 |
79 | loss.backward()
80 | optimizer.step()
81 |
82 | losses.append(loss.item())
83 |
84 | pbar.set_postfix(train_loss = np.mean(losses[-100:]))
85 | pbar.update(x.shape[0])
86 |
87 | pbar.close()
88 |
89 | return losses
90 |
91 | def eval_step(data, model, criterion, device, mode='dev'):
92 |
93 | test_losses = []
94 | test_preds = []
95 | test_true = []
96 |
97 | pbar = tqdm(total=len(data.dataset), desc=f'Predictions on {mode} set')
98 |
99 | model.eval()
100 |
101 | for x, y in data:
102 |
103 | x = x.to(device)
104 | y = y.to(device)
105 |
106 | with torch.no_grad():
107 |
108 | pred = model(x)
109 |
110 | loss = criterion(pred, y)
111 | test_losses.append(loss.item())
112 |
113 | test_preds.append(torch.argmax(pred, dim=1).cpu().numpy())
114 | test_true.append(y.cpu().numpy())
115 |
116 | pbar.update(x.shape[0])
117 | pbar.close()
118 |
119 | test_preds = np.concatenate(test_preds)
120 |
121 | if mode == 'dev':
122 | test_true = np.concatenate(test_true)
123 | mean_test_loss = np.mean(test_losses)
124 | accuracy = round(accuracy_score(test_true, test_preds) * 100, 2)
125 | return mean_test_loss, accuracy
126 |
127 | else:
128 | return test_preds
129 |
130 |
131 | def train(train_data, dev_data, model, optimizer, criterion, device, n_epochs=50, max_patience=3):
132 |
133 | losses = []
134 | best_accuracy = 0.
135 |
136 | patience = 0
137 | best_test_loss = 10.
138 |
139 | for epoch in range(n_epochs):
140 |
141 | losses = train_step(train_data, model, optimizer, criterion, device, losses, epoch)
142 | mean_dev_loss, accuracy = eval_step(dev_data, model, criterion, device)
143 |
144 | if accuracy > best_accuracy:
145 | best_accuracy = accuracy
146 |
147 | print(f'\nDev loss: {mean_dev_loss} \naccuracy: {accuracy}')
148 |
149 | if mean_dev_loss < best_test_loss:
150 | best_test_loss = mean_dev_loss
151 | elif patience == max_patience:
152 | print(f'Dev loss did not improve in {patience} epochs, early stopping')
153 | break
154 | else:
155 | patience += 1
156 | return best_accuracy
157 |
158 |
159 | @click.command()
160 | @click.option('--task-name',
161 | default='RuMedNLI',
162 | type=click.Choice(['RuMedDaNet', 'RuMedNLI']),
163 | help='The name of the task to run.')
164 | @click.option('--device',
165 | default=-1,
166 | help='Gpu to train the model on.')
167 | @click.option('--seq-len',
168 | default=256,
169 | help='Max sequence length.')
170 | def main(task_name, device, seq_len):
171 | print(f'\n{task_name} task')
172 |
173 | base_path = pathlib.Path(__file__).absolute().parent.parent.parent
174 | out_path = base_path / 'code' / 'bilstm' / 'out'
175 | data_path = base_path / 'data' / task_name
176 |
177 | train_data = pd.read_json(data_path / 'train_v1.jsonl', lines=True)
178 | dev_data = pd.read_json(data_path / 'dev_v1.jsonl', lines=True)
179 | test_data = pd.read_json(data_path / 'test_v1.jsonl', lines=True)
180 |
181 | index_id = 'pairID'
182 | if task_name == 'RuMedNLI':
183 | l2i = {'neutral': 0, 'entailment': 1, 'contradiction': 2}
184 | text1_id = 'ru_sentence1'
185 | text2_id = 'ru_sentence2'
186 | label_id = 'gold_label'
187 |
188 | elif task_name == 'RuMedDaNet':
189 | l2i = {'нет': 0, 'да': 1}
190 | text1_id = 'context'
191 | text2_id = 'question'
192 | label_id = 'answer'
193 | else:
194 | raise ValueError('unknown task')
195 |
196 | i2l = {i: label for label, i in l2i.items()}
197 |
198 | text_data_train = [preprocess_two_seqs(text1, text2, seq_len) for text1, text2 in \
199 | zip(train_data[text1_id], train_data[text2_id])]
200 | text_data_dev = [preprocess_two_seqs(text1, text2, seq_len) for text1, text2 in \
201 | zip(dev_data[text1_id], dev_data[text2_id])]
202 | text_data_test = [preprocess_two_seqs(text1, text2, seq_len) for text1, text2 in \
203 | zip(test_data[text1_id], test_data[text2_id])]
204 |
205 | word2index = build_vocab(text_data_train, min_freq=0)
206 | print(f'Total: {len(word2index)} tokens')
207 |
208 | train_dataset = DataPreprocessor(text_data_train, train_data[label_id], word2index, l2i, \
209 | sequence_length=seq_len, preprocessing=False)
210 | dev_dataset = DataPreprocessor(text_data_dev, dev_data[label_id], word2index, l2i, \
211 | sequence_length=seq_len, preprocessing=False)
212 | test_dataset = DataPreprocessor(text_data_test, test_data[label_id], word2index, l2i, \
213 | sequence_length=seq_len, preprocessing=False)
214 |
215 | gen = torch.Generator()
216 | gen.manual_seed(SEED)
217 | train_dataset = DataLoader(train_dataset, batch_size=64, worker_init_fn=seed_worker, generator=gen)
218 | dev_dataset = DataLoader(dev_dataset, batch_size=64, worker_init_fn=seed_worker, generator=gen)
219 | test_dataset = DataLoader(test_dataset, batch_size=64, worker_init_fn=seed_worker, generator=gen)
220 |
221 | if device == -1:
222 | device = torch.device('cpu')
223 | else:
224 | device = torch.device(device)
225 |
226 | model = Classifier(n_classes=len(l2i), vocab_size=len(word2index))
227 | criterion = nn.CrossEntropyLoss()
228 | optimizer = torch.optim.Adam(params=model.parameters())
229 |
230 | model = model.to(device)
231 | criterion = criterion.to(device)
232 |
233 | accuracy = train(train_dataset, dev_dataset, model, optimizer, criterion, device)
234 | print (f'\n{task_name} task score on dev set: {accuracy}')
235 |
236 | test_preds = eval_step(test_dataset, model, criterion, device, mode='test')
237 |
238 | recs = []
239 | for i, true, pred in zip(test_data[index_id], test_data[label_id], test_preds):
240 | recs.append({index_id: i, label_id: true, 'prediction': i2l[pred]})
241 |
242 | out_fname = out_path / f'{task_name}.jsonl'
243 | with open(out_fname, 'w') as fw:
244 | for rec in recs:
245 | json.dump(rec, fw, ensure_ascii=False)
246 | fw.write('\n')
247 |
248 |
249 | if __name__ == '__main__':
250 | main()
251 |
--------------------------------------------------------------------------------
/code/bilstm/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | out=$(pwd)'/out'
4 | mkdir -p $out
5 |
6 | python -u single_text_classifier.py --task-name 'RuMedTop3' --device 0
7 | python -u single_text_classifier.py --task-name 'RuMedSymptomRec' --device 0
8 | python -u double_text_classifier.py --task-name 'RuMedDaNet' --device 0
9 | python -u double_text_classifier.py --task-name 'RuMedNLI' --device 0
10 | python -u token_classifier.py --task-name='RuMedNER' --device 0
11 |
--------------------------------------------------------------------------------
/code/bilstm/single_text_classifier.py:
--------------------------------------------------------------------------------
1 | from collections import defaultdict
2 | import json
3 | import pathlib
4 |
5 | import click
6 | import numpy as np
7 | import pandas as pd
8 | import torch
9 | from torch import nn
10 | from torch.utils.data import DataLoader
11 | from tqdm import tqdm
12 |
13 | from utils import preprocess, seed_everything, seed_worker, DataPreprocessor
14 |
15 | SEED = 101
16 | seed_everything(SEED)
17 | class Classifier(nn.Module):
18 |
19 | def __init__(self, n_classes, vocab_size, emb_dim=300, hidden_dim=256):
20 |
21 | super().__init__()
22 |
23 | self.emb_dim = emb_dim
24 | self.hidden_dim = hidden_dim
25 |
26 | self.embedding_layer = nn.Embedding(vocab_size, self.emb_dim)
27 | self.lstm_layer = nn.LSTM(self.emb_dim, self.hidden_dim, batch_first=True, num_layers=2,
28 | bidirectional=True)
29 | self.linear_layer = nn.Linear(self.hidden_dim * 2, n_classes)
30 |
31 | def forward(self, x):
32 | x = self.embedding_layer(x)
33 | _, (hidden, _) = self.lstm_layer(x)
34 | hidden = torch.cat([hidden[0, :, :], hidden[1, :, :]], axis=1)
35 | return self.linear_layer(hidden)
36 |
37 |
38 | def hit_at_n(y_true, y_pred, n=3):
39 | assert len(y_true) == len(y_pred)
40 | hit_count = 0
41 | for l, row in zip(y_true, y_pred):
42 | order = (np.argsort(row)[::-1])[:n]
43 | hit_count += int(l in order)
44 | return round(hit_count / float(len(y_true)) * 100, 2)
45 |
46 |
47 | def logits2codes(logits, i2l, n=3):
48 | codes = []
49 | for row in logits:
50 | order = np.argsort(row)[::-1]
51 | codes.append([i2l[i] for i in order[:n]])
52 | return codes
53 |
54 |
55 | def build_vocab(text_data, min_freq=1):
56 | word2freq = defaultdict(int)
57 | word2index = {'PAD': 0, 'UNK': 1}
58 |
59 | for text in text_data:
60 | for t in preprocess(text):
61 | word2freq[t] += 1
62 |
63 | for word, freq in word2freq.items():
64 | if freq > min_freq:
65 | word2index[word] = len(word2index)
66 | return word2index
67 |
68 |
69 | def train_step(data, model, optimizer, criterion, device, losses, epoch):
70 |
71 | model.train()
72 |
73 | pbar = tqdm(total=len(data.dataset), desc=f'Epoch: {epoch + 1}')
74 |
75 | for x, y in data:
76 |
77 | x = x.to(device)
78 | y = y.to(device)
79 |
80 | optimizer.zero_grad()
81 | pred = model(x)
82 |
83 | loss = criterion(pred, y)
84 |
85 | loss.backward()
86 | optimizer.step()
87 |
88 | losses.append(loss.item())
89 |
90 | pbar.set_postfix(train_loss = np.mean(losses[-100:]))
91 | pbar.update(x.shape[0])
92 |
93 | pbar.close()
94 |
95 | return losses
96 |
97 | def eval_step(data, model, criterion, device, mode='dev'):
98 |
99 | test_losses = []
100 | test_preds = []
101 | test_true = []
102 |
103 | pbar = tqdm(total=len(data.dataset), desc=f'Predictions on {mode} set')
104 |
105 | model.eval()
106 |
107 | for x, y in data:
108 |
109 | x = x.to(device)
110 | y = y.to(device)
111 |
112 | with torch.no_grad():
113 |
114 | pred = model(x)
115 |
116 | loss = criterion(pred, y)
117 | test_losses.append(loss.item())
118 |
119 | test_preds.append(pred.cpu().numpy())
120 | test_true.append(y.cpu().numpy())
121 |
122 | pbar.update(x.shape[0])
123 | pbar.close()
124 |
125 | test_preds = np.concatenate(test_preds)
126 |
127 | if mode == 'dev':
128 | test_true = np.concatenate(test_true)
129 | mean_test_loss = np.mean(test_losses)
130 | accuracy = hit_at_n(test_true, test_preds, n=1)
131 | hit_3 = hit_at_n(test_true, test_preds, n=3)
132 | return mean_test_loss, accuracy, hit_3
133 |
134 | else:
135 | return test_preds
136 |
137 |
138 | def train(train_data, dev_data, model, optimizer, criterion, device, n_epochs=50, max_patience=3):
139 |
140 | losses = []
141 | best_metrics = [0.0, 0.0]
142 |
143 | patience = 0
144 | best_test_loss = 10.
145 |
146 | for epoch in range(n_epochs):
147 |
148 | losses = train_step(train_data, model, optimizer, criterion, device, losses, epoch)
149 | mean_dev_loss, accuracy, hit_3 = eval_step(dev_data, model, criterion, device)
150 |
151 | if accuracy > best_metrics[0] and hit_3 > best_metrics[1]:
152 | best_metrics = [accuracy, hit_3]
153 |
154 | print(f'\nDev loss: {mean_dev_loss} \naccuracy: {accuracy}, hit@3: {hit_3}')
155 |
156 | if mean_dev_loss < best_test_loss:
157 | best_test_loss = mean_dev_loss
158 | elif patience == max_patience:
159 | print(f'Dev loss did not improve in {patience} epochs, early stopping')
160 | break
161 | else:
162 | patience += 1
163 | return best_metrics
164 |
165 |
166 | @click.command()
167 | @click.option('--task-name',
168 | default='RuMedTop3',
169 | type=click.Choice(['RuMedTop3', 'RuMedSymptomRec']),
170 | help='The name of the task to run.')
171 | @click.option('--device',
172 | default=-1,
173 | help='Gpu to train the model on.')
174 | def main(task_name, device):
175 | print(f'\n{task_name} task')
176 |
177 | base_path = pathlib.Path(__file__).absolute().parent.parent.parent
178 | out_path = base_path / 'code' / 'bilstm' / 'out'
179 | data_path = base_path / 'data' / task_name
180 |
181 | train_data = pd.read_json(data_path / 'train_v1.jsonl', lines=True)
182 | dev_data = pd.read_json(data_path / 'dev_v1.jsonl', lines=True)
183 | test_data = pd.read_json(data_path / 'test_v1.jsonl', lines=True)
184 |
185 | text_id = 'symptoms'
186 | label_id = 'code'
187 | index_id = 'idx'
188 |
189 | i2l = dict(enumerate(sorted(train_data[label_id].unique())))
190 | l2i = {label: i for i, label in i2l.items()}
191 |
192 | word2index = build_vocab(train_data[text_id], min_freq=0)
193 | print(f'Total: {len(word2index)} tokens')
194 |
195 | train_dataset = DataPreprocessor(train_data[text_id], train_data[label_id], word2index, l2i)
196 | dev_dataset = DataPreprocessor(dev_data[text_id], dev_data[label_id], word2index, l2i)
197 | test_dataset = DataPreprocessor(test_data[text_id], test_data[label_id], word2index, l2i)
198 |
199 | gen = torch.Generator()
200 | gen.manual_seed(SEED)
201 | train_dataset = DataLoader(train_dataset, batch_size=64, worker_init_fn=seed_worker, generator=gen)
202 | dev_dataset = DataLoader(dev_dataset, batch_size=64, worker_init_fn=seed_worker, generator=gen)
203 | test_dataset = DataLoader(test_dataset, batch_size=64, worker_init_fn=seed_worker, generator=gen)
204 |
205 | if device == -1:
206 | device = torch.device('cpu')
207 | else:
208 | device = torch.device(device)
209 |
210 | model = Classifier(n_classes=len(l2i), vocab_size=len(word2index))
211 | criterion = nn.CrossEntropyLoss()
212 | optimizer = torch.optim.Adam(params=model.parameters())
213 |
214 | model = model.to(device)
215 | criterion = criterion.to(device)
216 |
217 | accuracy, hit_3 = train(train_dataset, dev_dataset, model, optimizer, criterion, device)
218 | print (f'\n{task_name} task scores on dev set: {accuracy} / {hit_3}')
219 |
220 | test_logits = eval_step(test_dataset, model, criterion, device, mode='test')
221 | test_codes = logits2codes(test_logits, i2l)
222 |
223 | recs = []
224 | for i, true, pred in zip(test_data[index_id], test_data[label_id], test_codes):
225 | recs.append({index_id: i, label_id: true, 'prediction': pred})
226 |
227 | out_fname = out_path / f'{task_name}.jsonl'
228 | with open(out_fname, 'w') as fw:
229 | for rec in recs:
230 | json.dump(rec, fw, ensure_ascii=False)
231 | fw.write('\n')
232 |
233 |
234 | if __name__ == '__main__':
235 | main()
236 |
--------------------------------------------------------------------------------
/code/bilstm/token_classifier.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import torch
3 | from torch import nn
4 | from torch.optim import AdamW
5 | from torchtext import data
6 | from torchtext.data import Field, BucketIterator
7 |
8 | import os
9 | import click
10 | import json
11 | import random
12 | import numpy as np
13 | import pandas as pd
14 |
15 | from seqeval.metrics import f1_score, accuracy_score
16 |
17 | def seed_everything(seed):
18 | os.environ['PYTHONHASHSEED'] = str(seed)
19 | os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
20 | np.random.seed(seed)
21 | random.seed(seed)
22 | torch.manual_seed(seed)
23 | torch.cuda.manual_seed_all(seed)
24 | torch.cuda.manual_seed(seed)
25 | torch.backends.cudnn.deterministic = True
26 | torch.backends.cudnn.benchmark = False
27 |
28 | SEED = 101
29 | seed_everything(SEED)
30 |
31 | class SequenceTaggingDataset(data.Dataset):
32 | @staticmethod
33 | def sort_key(example):
34 | for attr in dir(example):
35 | if not callable(getattr(example, attr)) and not attr.startswith('__'):
36 | return len(getattr(example, attr))
37 | return 0
38 |
39 | def __init__(self, list_of_lists, fields, **kwargs):
40 | examples = []
41 | columns = []
42 | for tup in list_of_lists:
43 | columns = list(tup)
44 | examples.append(data.Example.fromlist(columns, fields))
45 |
46 | super(SequenceTaggingDataset, self).__init__(examples, fields, **kwargs)
47 |
48 | class Corpus(object):
49 | def __init__(self, input_folder, min_word_freq, batch_size):
50 | # list all the fields
51 | self.word_field = Field(lower=True)
52 | self.tag_field = Field(unk_token=None)
53 |
54 | parts = ['train', 'dev']
55 | p2data = {}
56 | for p in parts:
57 | fname = os.path.join(input_folder, '{}_v1.jsonl'.format(p))
58 | paired_lists = []
59 | with open(fname) as f:
60 | for line in f:
61 | rec = json.loads(line)  # avoid shadowing the torchtext `data` module
62 | paired_lists.append( (rec['tokens'], rec['ner_tags']) )
63 | p2data[p] = paired_lists
64 |
65 | field_values = (('word', self.word_field), ('tag', self.tag_field))
66 |
67 | self.train_dataset = SequenceTaggingDataset( p2data['train'], fields=field_values )
68 | self.dev_dataset = SequenceTaggingDataset( p2data['dev'], fields=field_values )
69 |
70 | # convert fields to vocabulary list
71 | self.word_field.build_vocab(self.train_dataset.word, min_freq=min_word_freq)
72 | self.tag_field.build_vocab(self.train_dataset.tag)
73 | # create iterator for batch input
74 | self.train_iter, self.dev_iter = BucketIterator.splits(
75 | datasets=(self.train_dataset, self.dev_dataset),
76 | batch_size=batch_size
77 | )
78 | # prepare padding index to be ignored during model training/evaluation
79 | self.word_pad_idx = self.word_field.vocab.stoi[self.word_field.pad_token]
80 | self.tag_pad_idx = self.tag_field.vocab.stoi[self.tag_field.pad_token]
81 |
82 | class BiLSTM(nn.Module):
83 | def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, lstm_layers,
84 | emb_dropout, lstm_dropout, fc_dropout, word_pad_idx):
85 | super().__init__()
86 | self.embedding_dim = embedding_dim
87 | # LAYER 1: Embedding
88 | self.embedding = nn.Embedding(
89 | num_embeddings=input_dim,
90 | embedding_dim=embedding_dim,
91 | padding_idx=word_pad_idx
92 | )
93 | self.emb_dropout = nn.Dropout(emb_dropout)
94 | # LAYER 2: BiLSTM
95 | self.lstm = nn.LSTM(
96 | input_size=embedding_dim,
97 | hidden_size=hidden_dim,
98 | num_layers=lstm_layers,
99 | bidirectional=True,
100 | dropout=lstm_dropout if lstm_layers > 1 else 0
101 | )
102 | # LAYER 3: Fully-connected
103 | self.fc_dropout = nn.Dropout(fc_dropout)
104 | self.fc = nn.Linear(hidden_dim * 2, output_dim) # times 2 for bidirectional
105 |
106 | def forward(self, sentence):
107 | # sentence = [sentence length, batch size]
108 | # embedding_out = [sentence length, batch size, embedding dim]
109 | embedding_out = self.emb_dropout(self.embedding(sentence))
110 | # lstm_out = [sentence length, batch size, hidden dim * 2]
111 | lstm_out, _ = self.lstm(embedding_out)
112 | # ner_out = [sentence length, batch size, output dim]
113 | ner_out = self.fc(self.fc_dropout(lstm_out))
114 | return ner_out
115 |
116 | def init_weights(self):
117 | # initialize all parameters from a normal distribution;
118 | # this helps training converge
119 | for name, param in self.named_parameters():
120 | nn.init.normal_(param.data, mean=0, std=0.1)
121 |
122 | def init_embeddings(self, word_pad_idx):
123 | # initialize embedding for padding as zero
124 | self.embedding.weight.data[word_pad_idx] = torch.zeros(self.embedding_dim)
125 |
126 | def count_parameters(self):
127 | return sum(p.numel() for p in self.parameters() if p.requires_grad)
128 |
129 | class NER(object):
130 | def __init__(self, model, data, optimizer_cls, loss_fn_cls, device=torch.device('cpu')):
131 | self.device = device
132 | self.model = model
133 | self.data = data
134 | self.optimizer = optimizer_cls(model.parameters(), lr=0.0015, weight_decay=0.01)
135 | self.loss_fn = loss_fn_cls(ignore_index=self.data.tag_pad_idx)
136 | self.loss_fn = self.loss_fn.to(self.device)
137 |
138 | def accuracy(self, preds, y):
139 | max_preds = preds.argmax(dim=1, keepdim=True) # get the index of the max probability
140 | non_pad_elements = (y != self.data.tag_pad_idx).nonzero() # prepare masking for paddings
141 | correct = max_preds[non_pad_elements].squeeze(1).eq(y[non_pad_elements])
142 | denom = torch.tensor([y[non_pad_elements].shape[0]], dtype=torch.float, device=y.device)  # works on CPU and GPU
143 | return correct.sum() / denom
144 |
145 | def epoch(self):
146 | epoch_loss = 0
147 | epoch_acc = 0
148 | self.model.train()
149 | for batch in self.data.train_iter:
150 | # text = [sent len, batch size]
151 | text = batch.word.to(self.device)
152 | # tags = [sent len, batch size]
153 | true_tags = batch.tag.to(self.device)
154 | self.optimizer.zero_grad()
155 | pred_tags = self.model(text)
156 | # to calculate the loss and accuracy, we flatten both prediction and true tags
157 | # flatten pred_tags to [sent len * batch size, output dim]
158 | pred_tags = pred_tags.view(-1, pred_tags.shape[-1])
159 | # flatten true_tags to [sent len * batch size]
160 | true_tags = true_tags.view(-1)
161 | batch_loss = self.loss_fn(pred_tags, true_tags)
162 | batch_acc = self.accuracy(pred_tags, true_tags)
163 | batch_loss.backward()
164 | self.optimizer.step()
165 | epoch_loss += batch_loss.item()
166 | epoch_acc += batch_acc.item()
167 | return epoch_loss / len(self.data.train_iter), epoch_acc / len(self.data.train_iter)
168 |
169 | def evaluate(self, iterator):
170 | epoch_loss = 0
171 | epoch_acc = 0
172 | self.model.eval()
173 | cum = 0
174 | whole_gt_seq, whole_pred_seq = [], []
175 | with torch.no_grad():
176 | # similar to epoch() but model is in evaluation mode and no backprop
177 | for batch in iterator:
178 | text = batch.word.to(self.device)
179 | true_tags = batch.tag.to(self.device)
180 | pred_tags = self.model(text)
181 |
182 | #[sentence length, batch size, output dim]
183 | for i, (row, tag_row) in enumerate(zip(text.T, true_tags.T)):
184 | mask = row!=1
185 | gt_seq = [self.data.tag_field.vocab.itos[j.item()] for j in tag_row[mask]]
186 | pred_idx = pred_tags[:,i,:].argmax(-1)[mask]
187 | pred_seq = [self.data.tag_field.vocab.itos[j.item()] for j in pred_idx]
188 | whole_gt_seq.append(gt_seq)
189 | whole_pred_seq.append(pred_seq)
190 | pred_tags = pred_tags.view(-1, pred_tags.shape[-1])
191 |
192 | true_tags = true_tags.view(-1)
193 | batch_loss = self.loss_fn(pred_tags, true_tags)
194 | batch_acc = self.accuracy(pred_tags, true_tags)
195 | epoch_loss += batch_loss.item()
196 | epoch_acc += batch_acc.item()
197 | acc = accuracy_score(whole_gt_seq, whole_pred_seq)
198 | f1 = f1_score(whole_gt_seq, whole_pred_seq)
199 | return epoch_loss / len(iterator), acc, f1, whole_gt_seq, whole_pred_seq
200 |
201 | def train(self, n_epochs):
202 | for epoch in range(n_epochs):
203 | train_loss, train_acc = self.epoch()
204 | dev_loss, dev_acc, dev_f1, _, _ = self.evaluate(self.data.dev_iter)
205 | print (f'Epoch {epoch:02d}\t| Dev Loss: {dev_loss:.3f} | Dev Acc: {dev_acc * 100:.2f}% | Dev F1: {dev_f1 * 100:.2f}%')
206 |
207 | def infer(self, tokens):
208 | tokens = [t.lower() for t in tokens]
209 | self.model.eval()
210 | # transform to indices based on corpus vocab
211 | numericalized_tokens = [self.data.word_field.vocab.stoi[t] for t in tokens]
212 | # begin prediction
213 | token_tensor = torch.LongTensor(numericalized_tokens)
214 | token_tensor = token_tensor.unsqueeze(-1)
215 | predictions = self.model(token_tensor.to(self.device))
216 | # convert results to tags
217 | top_predictions = predictions.argmax(-1)
218 | predicted_tags = [self.data.tag_field.vocab.itos[t.item()] for t in top_predictions]
219 | return predicted_tags
220 |
221 | @click.command()
222 | @click.option('--task-name',
223 | default='RuMedNER',
224 | type=click.Choice(['RuMedNER']),
225 | help='The name of the task to run.')
226 | @click.option('--device',
227 | default=-1,
228 | help='Gpu to train the model on.')
229 | def main(task_name, device):
230 | os.environ['CUDA_VISIBLE_DEVICES'] = str(device)
231 | device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
232 |
233 | print(f'\n{task_name} task')
234 |
235 | base_path = os.path.abspath( os.path.join(os.path.dirname( __file__ ) ) )
236 | out_dir = os.path.join(base_path, 'out')
237 |
238 | base_path = os.path.abspath( os.path.join(base_path, '../..') )
239 |
240 | data_path = os.path.join(base_path, 'data', task_name)
241 |
242 | corpus = Corpus(
243 | input_folder=data_path,
244 | min_word_freq=1,
245 | batch_size=32
246 | )
247 | print (f'Train set: {len(corpus.train_dataset)} sentences')
248 | print (f'Dev set: {len(corpus.dev_dataset)} sentences')
249 |
250 | bilstm = BiLSTM(
251 | input_dim=len(corpus.word_field.vocab),
252 | embedding_dim=300,
253 | hidden_dim=256,
254 | output_dim=len(corpus.tag_field.vocab),
255 | lstm_layers=2,
256 | emb_dropout=0.5,
257 | lstm_dropout=0.1,
258 | fc_dropout=0.25,
259 | word_pad_idx=corpus.word_pad_idx
260 | )
261 |
262 | bilstm.init_weights()
263 | bilstm.init_embeddings(word_pad_idx=corpus.word_pad_idx)
264 | print (f'The model has {bilstm.count_parameters():,} trainable parameters.')
265 | print (bilstm)
266 |
267 | ner = NER(
268 | model=bilstm.to(device),
269 | data=corpus,
270 | optimizer_cls=AdamW,
271 | loss_fn_cls=nn.CrossEntropyLoss,
272 | device=device
273 | )
274 |
275 | ner.train(20)
276 |
277 | test_data = pd.read_json(os.path.join(data_path, 'test_v1.jsonl'), lines=True)
278 |
279 | out_fname = os.path.join(out_dir, task_name+'.jsonl')
280 | with open(out_fname, 'w') as fw:
281 | for i, true, tokens in zip(test_data.idx, test_data.ner_tags, test_data.tokens):
282 | prediction = ner.infer(tokens)
283 | rec = {'idx': i, 'ner_tags': true, 'prediction': prediction}
284 | json.dump(rec, fw, ensure_ascii=False)
285 | fw.write('\n')
286 |
287 | if __name__ == '__main__':
288 | main()
289 |
--------------------------------------------------------------------------------
/code/bilstm/utils.py:
--------------------------------------------------------------------------------
1 | import os
2 | from string import punctuation
3 | import random
4 |
5 | from nltk.tokenize import ToktokTokenizer
6 | import numpy as np
7 | import pandas as pd
8 | import torch
9 | from torch.utils.data import Dataset
10 |
11 | from typing import List, Dict, Union, Tuple, Set, Any
12 |
13 | TOKENIZER = ToktokTokenizer()
14 |
15 |
16 | def seed_everything(seed):
17 | os.environ['PYTHONHASHSEED'] = str(seed)
18 | os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
19 | np.random.seed(seed)
20 | random.seed(seed)
21 | torch.manual_seed(seed)
22 | torch.cuda.manual_seed_all(seed)
23 | torch.cuda.manual_seed(seed)
24 | torch.backends.cudnn.deterministic = True
25 | torch.backends.cudnn.benchmark = False
26 |
27 |
28 | def seed_worker(worker_id):
29 | worker_seed = torch.initial_seed() % 2**32
30 | np.random.seed(worker_seed)
31 | random.seed(worker_seed)
32 |
33 |
34 | def preprocess(text, tokenizer=TOKENIZER):
35 | res = []
36 | tokens = tokenizer.tokenize(text.lower())
37 | for t in tokens:
38 | if t not in punctuation:
39 | res.append(t.strip(punctuation))
40 | return res
41 |
42 |
43 | class DataPreprocessor(Dataset):
44 |
45 | def __init__(self, x_data, y_data, word2index, label2index,
46 | sequence_length=128, pad_token='PAD', unk_token='UNK', preprocessing=True):
47 |
48 | super().__init__()
49 |
50 | self.x_data = []
51 | self.y_data = y_data.map(label2index)
52 |
53 | self.word2index = word2index
54 | self.sequence_length = sequence_length
55 |
56 | self.pad_token = pad_token
57 | self.unk_token = unk_token
58 | self.pad_index = self.word2index[self.pad_token]
59 |
60 | self.preprocessing = preprocessing
61 |
62 | self.load(x_data)
63 |
64 | def load(self, data):
65 |
66 | for text in data:
67 | if self.preprocessing:
68 | words = preprocess(text)
69 | else:
70 | words = text
71 | indexed_words = self.indexing(words)
72 | self.x_data.append(indexed_words)
73 |
74 | def indexing(self, tokenized_text):
75 | unk_index = self.word2index[self.unk_token]
76 | return [self.word2index.get(token, unk_index) for token in tokenized_text]
77 |
78 | def padding(self, sequence):
79 | sequence = sequence + [self.pad_index] * (max(self.sequence_length - len(sequence), 0))
80 | return sequence[:self.sequence_length]
81 |
82 | def __len__(self):
83 | return len(self.x_data)
84 |
85 | def __getitem__(self, idx):
86 | x = self.x_data[idx]
87 | x = self.padding(x)
88 | x = torch.Tensor(x).long()
89 |
90 | y = self.y_data[idx]
91 |
92 | return x, y
93 |
94 |
95 | def preprocess_for_tokens(
96 | tokens: List[str]
97 | ) -> List[str]:
98 |
99 | return tokens
100 |
101 | class DataPreprocessorNer(Dataset):
102 |
103 | def __init__(
104 | self,
105 | x_data: pd.Series,
106 | y_data: pd.Series,
107 | word2index: Dict[str, int],
108 | label2index: Dict[str, int],
109 | sequence_length: int = 128,
110 | pad_token: str = 'PAD',
111 | unk_token: str = 'UNK'
112 | ) -> None:
113 |
114 | super().__init__()
115 |
116 | self.word2index = word2index
117 | self.label2index = label2index
118 |
119 | self.sequence_length = sequence_length
120 | self.pad_token = pad_token
121 | self.unk_token = unk_token
122 | self.pad_index = self.word2index[self.pad_token]
123 | self.unk_index = self.word2index[self.unk_token]
124 |
125 | self.x_data = self.load(x_data, self.word2index)
126 | self.y_data = self.load(y_data, self.label2index)
127 |
128 |
129 | def load(
130 | self,
131 | data: pd.Series,
132 | mapping: Dict[str, int]
133 | ) -> List[List[int]]:
134 |
135 | indexed_data = []
136 | for case in data:
137 | processed_case = preprocess_for_tokens(case)
138 | indexed_case = self.indexing(processed_case, mapping)
139 | indexed_data.append(indexed_case)
140 |
141 | return indexed_data
142 |
143 |
144 | def indexing(
145 | self,
146 | tokenized_case: List[str],
147 | mapping: Dict[str, int]
148 | ) -> List[int]:
149 |
150 | return [mapping.get(token, self.unk_index) for token in tokenized_case]
151 |
152 |
153 | def padding(
154 | self,
155 | sequence: List[int]
156 | ) -> List[int]:
157 | sequence = sequence + [self.pad_index] * max(self.sequence_length - len(sequence), 0)
158 | return sequence[:self.sequence_length]
159 |
160 |
161 | def __len__(self):
162 | return len(self.x_data)
163 |
164 |
165 | def __getitem__(
166 | self,
167 | idx: int
168 | ) -> Tuple[torch.Tensor, torch.Tensor]:
169 |
170 | x = self.x_data[idx]
171 | y = self.y_data[idx]
172 |
173 | assert len(x) > 0
174 |
175 | x = self.padding(x)
176 | y = self.padding(y)
177 |
178 | x = torch.tensor(x, dtype=torch.int64)
179 | y = torch.tensor(y, dtype=torch.int64)
180 |
181 | return x, y
--------------------------------------------------------------------------------
/code/eval.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import os
3 | import json
4 | import argparse
5 | import numpy as np
6 | from sklearn.metrics import accuracy_score
7 | from seqeval.metrics import f1_score
8 | from seqeval.metrics import accuracy_score as seq_accuracy_score
9 |
10 | def hit_at_3(y_true, y_pred):
11 | assert len(y_true) == len(y_pred)
12 | hit_count = 0
13 | for l, row in zip(y_true, y_pred):
14 | hit_count += l in row
15 | return hit_count/float(len(y_true))
16 |
17 | if __name__ == '__main__':
18 | parser = argparse.ArgumentParser()
19 | parser.add_argument('--out_dir',
20 | default='out/',
21 | type=str,
22 | help='The output directory with task results.')
23 | args = parser.parse_args()
24 |
25 | out_dir = args.out_dir
26 | if not os.path.exists(out_dir):
27 | raise ValueError('{} directory does not exist'.format(out_dir))
28 |
29 | files = set( os.listdir(out_dir) )
30 |
31 | metrics = {}
32 | label_id = 'code'
33 | for task in ['RuMedTop3', 'RuMedSymptomRec']:
34 | fname = '{}.jsonl'.format(task)
35 | if fname in files:
36 | fname = os.path.join(out_dir, fname)
37 | with open(fname) as f:
38 | result = [json.loads(line) for line in list(f)]
39 | gt = [d[label_id] for d in result]
40 | top1 = [d['prediction'][0] for d in result]
41 | top3 = [set(d['prediction']) for d in result]
42 | acc = accuracy_score(gt, top1)*100
43 | hit = hit_at_3(gt, top3)*100
44 | metrics[(task, 'acc')] = acc
45 | metrics[(task, 'hit3')] = hit
46 | else:
47 | print ('skip task {}'.format(task))
48 |
49 | for task, label_id in [('RuMedDaNet', 'answer'), ('RuMedNLI', 'gold_label')]:
50 | fname = '{}.jsonl'.format(task)
51 | if fname in files:
52 | fname = os.path.join(out_dir, fname)
53 | with open(fname) as f:
54 | result = [json.loads(line) for line in list(f)]
55 | gt = [d[label_id] for d in result]
56 | prediction = [d['prediction'] for d in result]
57 | acc = accuracy_score(gt, prediction)*100
58 | metrics[(task, 'acc')] = acc
59 | else:
60 | print ('skip task {}'.format(task))
61 |
62 | task = 'RuMedNER'
63 | fname = '{}.jsonl'.format(task)
64 | if fname in files:
65 | fname = os.path.join(out_dir, fname)
66 | with open(fname) as f:
67 | result = [json.loads(line) for line in list(f)]
68 | gt = [d['ner_tags'] for d in result]
69 | prediction = [d['prediction'] for d in result]
70 | for seq0, seq1 in zip(gt, prediction):
71 | assert len(seq0)==len(seq1)
72 | metrics[(task, 'acc')] = seq_accuracy_score(gt, prediction)*100
73 | metrics[(task, 'f1')] = f1_score(gt, prediction)*100
74 | else:
75 | print ('skip task {}'.format(task))
76 |
77 | top3_acc, top3_hit = metrics.get( ('RuMedTop3', 'acc'), 0 ), metrics.get( ('RuMedTop3', 'hit3'), 0 )
78 | rec_acc, rec_hit = metrics.get( ('RuMedSymptomRec', 'acc'), 0 ), metrics.get( ('RuMedSymptomRec', 'hit3'), 0 )
79 | danet_acc, nli_acc = metrics.get( ('RuMedDaNet', 'acc'), 0 ), metrics.get( ('RuMedNLI', 'acc'), 0 )
80 | ner_acc, ner_f1 = metrics.get( ('RuMedNER', 'acc'), 0 ), metrics.get( ('RuMedNER', 'f1'), 0 )
81 |
82 | overall = np.mean([
83 | (top3_acc+top3_hit)/2,
84 | (rec_acc+rec_hit)/2,
85 | danet_acc,
86 | nli_acc,
87 | (ner_acc+ner_f1)/2,
88 | ])
89 |
90 | result_line = '| {}\t| {:.2f} / {:.2f}\t| {:.2f} / {:.2f}\t| {:.2f}\t| {:.2f}\t| {:.2f} / {:.2f}\t| {:.2f}\t|'.format(
91 | out_dir,
92 | top3_acc, top3_hit,
93 | rec_acc, rec_hit,
94 | danet_acc,
95 | nli_acc,
96 | ner_acc, ner_f1,
97 | overall
98 | )
99 | print ('| Model\t\t| RuMedTop3\t| RuMedSymptomRec\t| RuMedDaNet\t| RuMedNLI\t| RuMedNER\t| Overall\t|')
100 | print (result_line)
101 |
--------------------------------------------------------------------------------
/code/linear_models/README.md:
--------------------------------------------------------------------------------
1 | This directory contains feature-based models (a logistic regression over tf-idf features, and a CRF).
2 |
3 | ### How to run
4 |
5 | `./run.sh` or you can run the model for different tasks separately, e.g.
6 |
7 | ```bash
8 | python single_text_classifier.py --task-name='RuMedSymptomRec'
9 | ```
10 |
11 | The models write their results in `.jsonl` format to the `out` directory.
12 |
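13 | To score the generated files, you can then run the shared evaluation script from this directory, e.g.
14 |
15 | ```bash
16 | python ../eval.py --out_dir out/
17 | ```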
--------------------------------------------------------------------------------
/code/linear_models/double_text_classifier.py:
--------------------------------------------------------------------------------
1 | import json
2 | import pathlib
3 |
4 | import click
5 | import pandas as pd
6 | from sklearn.feature_extraction.text import TfidfVectorizer
7 | from sklearn.linear_model import LogisticRegression
8 | from sklearn.metrics import accuracy_score
9 |
10 |
11 | def preprocess_sentences(column1, column2):
12 | return [sent1 + ' ' + sent2 for sent1, sent2 in zip(column1, column2)]
13 |
14 |
15 | def encode_text(tfidf, text_data, labels, l2i, mode='train'):
16 | if mode == 'train':
17 | X = tfidf.fit_transform(text_data)
18 | else:
19 | X = tfidf.transform(text_data)
20 | y = labels.map(l2i)
21 | return X, y
22 |
23 |
24 | @click.command()
25 | @click.option('--task-name',
26 | default='RuMedNLI',
27 | type=click.Choice(['RuMedDaNet', 'RuMedNLI']),
28 | help='The name of the task to run.')
29 | def main(task_name):
30 | print(f'\n{task_name} task')
31 |
32 | base_path = pathlib.Path(__file__).absolute().parent.parent.parent
33 | out_path = base_path / 'code' / 'linear_models' / 'out'
34 | data_path = base_path / 'data' / task_name
35 |
36 | train_data = pd.read_json(data_path / 'train_v1.jsonl', lines=True)
37 | dev_data = pd.read_json(data_path / 'dev_v1.jsonl', lines=True)
38 | test_data = pd.read_json(data_path / 'test_v1.jsonl', lines=True)
39 |
40 | index_id = 'pairID'
41 | if task_name == 'RuMedNLI':
42 | l2i = {'neutral': 0, 'entailment': 1, 'contradiction': 2}
43 | text1_id = 'ru_sentence1'
44 | text2_id = 'ru_sentence2'
45 | label_id = 'gold_label'
46 | elif task_name == 'RuMedDaNet':
47 | l2i = {'нет': 0, 'да': 1}
48 | text1_id = 'context'
49 | text2_id = 'question'
50 | label_id = 'answer'
51 | else:
52 | raise ValueError('unknown task')
53 |
54 | i2l = {i: label for label, i in l2i.items()}
55 |
56 | text_data_train = preprocess_sentences(train_data[text1_id], train_data[text2_id])
57 | text_data_dev = preprocess_sentences(dev_data[text1_id], dev_data[text2_id])
58 | text_data_test = preprocess_sentences(test_data[text1_id], test_data[text2_id])
59 |
60 | tfidf = TfidfVectorizer(analyzer='char', ngram_range=(3, 8))
61 | clf = LogisticRegression(penalty='l2', C=10, multi_class='ovr', n_jobs=10, verbose=1)
62 |
63 | X, y = encode_text(tfidf, text_data_train, train_data[label_id], l2i)
64 |
65 | clf.fit(X, y)
66 |
67 | X_val, y_val = encode_text(tfidf, text_data_dev, dev_data[label_id], l2i, mode='val')
68 | y_val_pred = clf.predict(X_val)
69 | accuracy = round(accuracy_score(y_val, y_val_pred) * 100, 2)
70 | print (f'\n{task_name} task score on dev set: {accuracy}')
71 |
72 | X_test, _ = encode_text(tfidf, text_data_test, test_data[label_id], l2i, mode='test')
73 | y_test_pred = clf.predict(X_test)
74 |
75 | recs = []
76 | for i, true, pred in zip(test_data[index_id], test_data[label_id], y_test_pred):
77 | recs.append({index_id: i, label_id: true, 'prediction': i2l[pred]})
78 |
79 | out_fname = out_path / f'{task_name}.jsonl'
80 | with open(out_fname, 'w') as fw:
81 | for rec in recs:
82 | json.dump(rec, fw, ensure_ascii=False)
83 | fw.write('\n')
84 |
85 |
86 | if __name__ == '__main__':
87 | main()
88 |
--------------------------------------------------------------------------------
/code/linear_models/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | out=$(pwd)'/out'
4 | mkdir -p $out
5 |
6 | python -u single_text_classifier.py --task-name 'RuMedTop3'
7 | python -u single_text_classifier.py --task-name 'RuMedSymptomRec'
8 | python -u double_text_classifier.py --task-name 'RuMedNLI'
9 | python -u double_text_classifier.py --task-name 'RuMedDaNet'
10 | python -u token_classifier.py --task-name 'RuMedNER'
11 |
--------------------------------------------------------------------------------
/code/linear_models/single_text_classifier.py:
--------------------------------------------------------------------------------
1 | import json
2 | import pathlib
3 |
4 | import click
5 | import numpy as np
6 | import pandas as pd
7 | from sklearn.feature_extraction.text import TfidfVectorizer
8 | from sklearn.linear_model import LogisticRegression
9 |
10 |
11 | def hit_at_n(y_true, y_pred, n=3):
12 | assert len(y_true) == len(y_pred)
13 | hit_count = 0
14 | for l, row in zip(y_true, y_pred):
15 | order = (np.argsort(row)[::-1])[:n]
16 | hit_count += int(l in order)
17 | return round(hit_count / float(len(y_true)) * 100, 2)
18 |
19 |
20 | def encode_text(tfidf, text_data, labels, l2i, mode='train'):
21 | if mode == 'train':
22 | X = tfidf.fit_transform(text_data)
23 | else:
24 | X = tfidf.transform(text_data)
25 | y = labels.map(l2i)
26 | return X, y
27 |
28 |
29 | def logits2codes(logits, i2l, n=3):
30 | codes = []
31 | for row in logits:
32 | order = np.argsort(row)[::-1]
33 | codes.append([i2l[i] for i in order[:n]])
34 | return codes
35 |
36 |
37 | @click.command()
38 | @click.option('--task-name',
39 | default='RuMedTop3',
40 | type=click.Choice(['RuMedTop3', 'RuMedSymptomRec']),
41 | help='The name of the task to run.')
42 | def main(task_name):
43 | print(f'\n{task_name} task')
44 |
45 | base_path = pathlib.Path(__file__).absolute().parent.parent.parent
46 | out_path = base_path / 'code' / 'linear_models' / 'out'
47 | data_path = base_path / 'data' / task_name
48 |
49 | train_data = pd.read_json(data_path / 'train_v1.jsonl', lines=True)
50 | dev_data = pd.read_json(data_path / 'dev_v1.jsonl', lines=True)
51 | test_data = pd.read_json(data_path / 'test_v1.jsonl', lines=True)
52 |
53 | text_id = 'symptoms'
54 | label_id = 'code'
55 | index_id = 'idx'
56 |
57 | i2l = dict(enumerate(sorted(train_data[label_id].unique())))
58 | l2i = {label: i for i, label in i2l.items()}
59 |
60 | tfidf = TfidfVectorizer(analyzer='char', ngram_range=(3, 8))
61 | clf = LogisticRegression(penalty='l2', C=10, multi_class='ovr', n_jobs=10, verbose=1)
62 |
63 | X, y = encode_text(tfidf, train_data[text_id], train_data[label_id], l2i)
64 |
65 | clf.fit(X, y)
66 |
67 | X_val, y_val = encode_text(tfidf, dev_data[text_id], dev_data[label_id], l2i, mode='val')
68 | y_val_pred = clf.predict_proba(X_val)
69 |
70 | accuracy = hit_at_n(y_val, y_val_pred, n=1)
71 | hit_3 = hit_at_n(y_val, y_val_pred, n=3)
72 | print (f'\n{task_name} task scores on dev set: {accuracy} / {hit_3}')
73 |
74 | X_test, _ = encode_text(tfidf, test_data[text_id], test_data[label_id], l2i, mode='test')
75 | y_test_pred = clf.predict_proba(X_test)
76 |
77 | test_codes = logits2codes(y_test_pred, i2l)
78 |
79 | recs = []
80 | for i, true, pred in zip(test_data[index_id], test_data[label_id], test_codes):
81 | recs.append({index_id: i, label_id: true, 'prediction': pred})
82 |
83 | out_fname = out_path / f'{task_name}.jsonl'
84 | with open(out_fname, 'w') as fw:
85 | for rec in recs:
86 | json.dump(rec, fw, ensure_ascii=False)
87 | fw.write('\n')
88 |
89 |
90 | if __name__ == '__main__':
91 | main()
92 |
--------------------------------------------------------------------------------
/code/linear_models/token_classifier.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import gc
3 | import os
4 | import json
5 | import numpy as np
6 | import random
7 | import click
8 | from seqeval.metrics import accuracy_score, f1_score
9 | import sklearn_crfsuite
10 |
11 | SEED = 128
12 | random.seed(SEED)
13 | np.random.seed(SEED)
14 |
15 | def load_sents(fname):
16 | sents = []
17 | with open(fname) as f:
18 | for line in f:
19 | data = json.loads(line)
20 | idx = data['idx']
21 | codes = data['ner_tags']
22 | tokens = data['tokens']
23 | sample = []
24 | for token, code in zip(tokens,codes):
25 | sample.append( (token, code) )
26 | sents.append( (idx, sample) )
27 | return sents
28 |
29 | def word2features(sent, i):
30 | word = sent[i][0]
31 |
32 | features = {
33 | 'bias': 1.0,
34 | 'word.lower()': word.lower(),
35 | 'word[-3:]': word[-3:],
36 | 'word[-2:]': word[-2:],
37 | 'word.isupper()': word.isupper(),
38 | 'word.istitle()': word.istitle(),
39 | 'word.isdigit()': word.isdigit(),
40 | }
41 | if i > 0:
42 | word1 = sent[i-1][0]
43 | features.update({
44 | '-1:word.lower()': word1.lower(),
45 | '-1:word.istitle()': word1.istitle(),
46 | '-1:word.isupper()': word1.isupper(),
47 | })
48 | else:
49 | features['BOS'] = True
50 |
51 | if i < len(sent)-1:
52 | word1 = sent[i+1][0]
53 | features.update({
54 | '+1:word.lower()': word1.lower(),
55 | '+1:word.istitle()': word1.istitle(),
56 | '+1:word.isupper()': word1.isupper(),
57 | })
58 | else:
59 | features['EOS'] = True
60 |
61 | return features
62 |
63 |
64 | def sent2features(sent):
65 | return [word2features(sent, i) for i in range(len(sent))]
66 |
67 | def sent2labels(sent):
68 | return [label for token, label in sent]
69 |
70 | def sent2tokens(sent):
71 | return [token for token, label in sent]
72 |
73 | @click.command()
74 | @click.option('--task-name',
75 | default='RuMedNER',
76 | type=click.Choice(['RuMedNER']),
77 | help='The name of the task to run.'
78 | )
79 |
80 | def main(task_name):
81 | print(f'\n{task_name} task')
82 | base_path = os.path.abspath( os.path.join(os.path.dirname( __file__ ) ) )
83 | out_dir = os.path.join(base_path, 'out')
84 |
85 | base_path = os.path.abspath( os.path.join(base_path, '../..') )
86 |
87 | parts = ['train', 'dev', 'test']
88 | data_path = os.path.join(base_path, 'data', task_name)
89 |
90 | text1_id, label_id, index_id = 'tokens', 'ner_tags', 'idx'
91 | part2data = {}
92 | for p in parts:
93 | fname = os.path.join( data_path, '{}_v1.jsonl'.format(p) )
94 | sents = load_sents(fname)
95 | part2data[p] = sents
96 |
97 | part2feat = {}
98 | for p in parts:
99 | p_X = [sent2features(s) for idx, s in part2data[p]]
100 | p_y = [sent2labels(s) for idx, s in part2data[p]]
101 | p_ids = [idx for idx, _ in part2data[p]]
102 | part2feat[p] = (p_X, p_y, p_ids)
103 |
104 | crf = sklearn_crfsuite.CRF(
105 | algorithm='lbfgs',
106 | c1=0.1,
107 | c2=0.01,
108 | max_iterations=200,
109 | all_possible_transitions=True,
110 | verbose=True
111 | )
112 | X_train, y_train = part2feat['train'][0], part2feat['train'][1]
113 | crf = crf.fit(X_train, y_train)
114 |
115 | X_dev = part2feat['dev'][0]
116 | y_pred_dev = crf.predict(X_dev)
117 |
118 | y_dev = part2feat['dev'][1]
119 | dev_acc, dev_f1 = accuracy_score(y_dev, y_pred_dev)*100, f1_score(y_dev, y_pred_dev)*100
120 |
121 | print ('\n{} task scores on dev set: {:.2f}/{:.2f}'.format(task_name, dev_acc, dev_f1))
122 |
123 | X_test = part2feat['test'][0]
124 | y_pred_test = crf.predict(X_test)
125 | out_fname = os.path.join(out_dir, task_name+'.jsonl')
126 | with open(out_fname, 'w') as fw:
127 | for idx, labels, prediction in zip(part2feat['test'][-1], part2feat['test'][1], y_pred_test):
128 | data = {index_id:idx, label_id:labels, 'prediction':prediction}
129 | json.dump(data, fw, ensure_ascii=False)
130 | fw.write('\n')
131 |
132 | if __name__ == '__main__':
133 | main()
134 |
--------------------------------------------------------------------------------
/code/requirements.txt:
--------------------------------------------------------------------------------
1 | seqeval==1.2.2
2 | torch==1.9.0
3 | torchtext==0.6.0
4 | tensorflow==2.6.0
5 | keras==2.6.0
6 | pandas==1.3.5
7 | transformers==4.12.5
8 | click==7.1.2
9 | nltk==3.4.5
10 | sklearn-crfsuite==0.3.6
11 |
--------------------------------------------------------------------------------
/code/tasks_builder.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import gc
3 | import os
4 | import ast
5 | import json
6 | import pandas as pd
7 | import numpy as np
8 | from sklearn import model_selection
9 | from collections import Counter
10 |
11 | SEED = 53
12 |
13 | def df2jsonl(in_df, fname, code2freq, th=10):
14 | with open(fname, 'w', encoding='utf-8') as fw:
15 | for idx, symptoms, code in zip(in_df.new_event_id, in_df.symptoms, in_df.code):
16 | if code in code2freq and code2freq[code]>th:
17 | data = {
18 | 'idx':idx,
19 | 'symptoms': symptoms,
20 | 'code': code,
21 | }
22 | json.dump(data, fw, ensure_ascii=False)
23 | fw.write("\n")
24 |
25 | def ner2jsonl(in_df, ids, fname):
26 | trim_ids = np.array([s.split('_')[0] for s in in_df['Sentence#'].values])
27 |
28 | with open(fname, 'w') as fw:
29 | for i in ids:
30 | mask = trim_ids==i
31 | sample_ids = np.array(list(set(in_df['Sentence#'].values[mask])))
32 | order = np.argsort([int(k.split('_')[-1]) for k in sample_ids])
33 | sample_ids = sample_ids[order]
34 | for idx in sample_ids:
35 | sub_mask = in_df['Sentence#'].values==idx
36 | tokens = list(in_df.Word[sub_mask].values)
37 | ner_tags = list(in_df.Tag[sub_mask].values)
38 | assert len(tokens)==len(ner_tags)
39 | data = {
40 | 'idx':idx,
41 | 'tokens': tokens,
42 | 'ner_tags': ner_tags,
43 | }
44 | json.dump(data, fw, ensure_ascii=False)
45 | fw.write("\n")
46 |
47 | def jsonl2jsonl(source, target):
48 | with open(target, 'w') as fw:
49 | with open(source) as f:
50 | for line in f:
51 | data = json.loads(line)
52 | selected = {field:data[field] for field in ['ru_sentence1', 'ru_sentence2', 'gold_label', 'pairID']}
53 | json.dump(selected, fw, ensure_ascii=False)
54 | fw.write("\n")
55 |
56 | if __name__ == '__main__':
57 | base_path = os.path.abspath( os.path.join(os.path.dirname( __file__ ), '..') )
58 |
59 | data_path = os.path.join(base_path, 'data/')
60 |
61 | data_fname = os.path.join(data_path, 'raw', 'RuMedPrimeData.tsv')
62 |
63 | if not os.path.isfile(data_fname):
64 | raise ValueError('Have you downloaded the data file RuMedPrimeData.tsv and placed it into the data/ directory?')
65 |
66 | base_split_names = ['train', 'dev', 'test']
67 | ## prepare data for RuMedTop3 task
68 | df = pd.read_csv(data_fname, sep='\t')
69 | df['code'] = df.icd10.apply(lambda s: s.split('.')[0])
70 | # parts will be a list of [train, dev, test] dataframes
71 | parts = np.split(df.sample(frac=1, random_state=SEED), [int(0.735*len(df)), int(0.8675*len(df))])
72 |
73 | code2freq = dict(parts[0]['code'].value_counts())
74 |
75 | for i, part in enumerate(base_split_names):
76 | df2jsonl(
77 | parts[i],
78 | os.path.join(data_path, 'RuMedTop3', '{}_v1.jsonl'.format(part)),
79 | code2freq
80 | )
81 |
82 | ## prepare data for RuMedSymptomRec task
83 | df.drop(columns=['code'], inplace=True)
84 | rec_markup = pd.read_csv( os.path.join(data_path, 'raw', 'rec_markup.csv') )
85 | df = pd.merge(df, rec_markup, on='new_event_id')
86 |
87 | mask = ~df.code.isna().values
88 | df = df.iloc[mask]
89 |
90 | symptoms_reduced = []
91 | for text, span in zip(df.symptoms, df.keep_spans):
92 | span = ast.literal_eval(span)
93 | reduced_text = (''.join([text[s[0]:s[1]] for s in span])).strip()
94 | symptoms_reduced.append(reduced_text)
95 | df['symptoms'] = symptoms_reduced
96 |
97 | parts = np.split(df.sample(frac=1, random_state=SEED), [int(0.735*len(df)), int(0.8675*len(df))])
98 |
99 | code2freq = dict(parts[0]['code'].value_counts())
100 |
101 | for i, part in enumerate(base_split_names):
102 | df2jsonl(
103 | parts[i],
104 | os.path.join(data_path, 'RuMedSymptomRec', '{}_v1.jsonl'.format(part)),
105 | code2freq
106 | )
107 |
108 | ## prepare data for RuMedNER task
109 | df = pd.read_csv( os.path.join(data_path, 'raw', 'RuDReC.csv') )
110 |
111 | d = Counter(df['Sentence#'].apply(lambda s: s.split('_')[0]))
112 | ids = np.array(list(d.keys()))
113 | lens = np.array(list(d.values()))
114 | lens = np.array([len(str(i)) for i in lens]) # stratify splits by the digit count of per-id sentence counts
115 |
116 | sss = model_selection.StratifiedShuffleSplit(n_splits=1, test_size=75, random_state=7)
117 | for fold, (train_idx, test_idx) in enumerate(sss.split(ids, lens)):
118 | train_ids, test_ids = ids[train_idx], ids[test_idx]
119 |
120 | sss = model_selection.StratifiedShuffleSplit(n_splits=1, test_size=75, random_state=6)
121 | for fold, (train_idx, test_idx) in enumerate(sss.split(train_ids, lens[train_idx])):
122 | train_ids, dev_ids = train_ids[train_idx], train_ids[test_idx]
123 | parts = [train_ids, dev_ids, test_ids]
124 |
125 | for i, part in enumerate(base_split_names):
126 | ner2jsonl(
127 | df,
128 | parts[i],
129 | os.path.join(data_path, 'RuMedNER', '{}_v1.jsonl'.format(part))
130 | )
131 |
132 | ## prepare data for RuMedNLI task
133 | for part in base_split_names:
134 | fname = os.path.join(data_path, 'raw', 'ru_mli_{}_v1.jsonl'.format(part))
135 | jsonl2jsonl(
136 | fname,
137 | os.path.join(data_path, 'RuMedNLI', '{}_v1.jsonl'.format(part))
138 | )
139 |
--------------------------------------------------------------------------------
/data/README.md:
--------------------------------------------------------------------------------
1 | Each task directory (starting with *RuMed*\*) contains `train/dev/test` data files in `jsonl` format.
2 |
3 | ### RuMedTop3
4 | ```
5 | {
6 | "idx": "qd4405c5",
7 | "symptoms": "Сердцебиение, нарушение сна, ощущение нехватки воздуха.
8 | Боль и хруст в шеи, головные боли по 3 суток подряд.",
9 | "code": "M54"
10 | }
11 | ```
12 |
13 | ### RuMedSymptomRec
14 | ```
15 | {
16 | "idx": "qbaecae4",
17 | "symptoms": "пациентка на приеме с родственниками. Со слов родственников - жалобы на плохой сон,
18 | чувство страха, на навязчивые мысли,что 'ее кто-то бьет'",
19 | "code": "колебания артериального давления"
20 | }
21 | ```
22 |
23 | ### RuMedDaNet
24 | ```
25 | {
26 | "pairID": "b2d69800b0a141aa63bd1104c6d53488",
27 | "context": "Эпилепсия — хроническое полиэтиологическое заболевание головного мозга, доминирующим
28 | проявлением которого являются повторяющиеся эпилептические припадки, возникающие вследствие
29 | усиленного гиперсинхронного разряда нейронов головного мозга.",
30 | "question": "Эпилепсию относят к заболеваниям головного мозга человека?",
31 | "answer": "да",
32 | }
33 | ```
34 |
35 | ### RuMedNLI
36 | ```
37 | {
38 | "pairID": "1892e470-66c7-11e7-9a53-f45c89b91419",
39 | "ru_sentence1": "Во время госпитализации у пациента постепенно усиливалась одышка, что потребовало
40 | выполнения процедуры неинвазивной вентиляции лёгких с положительным давлением, а затем маска без ребризера.",
41 | "ru_sentence2": "Пациент находится при комнатном воздухе.",
42 | "gold_label": "contradiction",
43 | }
44 | ```
45 |
46 | ### RuMedNER
47 | ```
48 | {
49 | "idx": "769708.tsv_5",
50 | "tokens": ["Виферон", "обладает", "противовирусным", "действием", "."],
51 | "ner_tags": ["B-Drugname", "O", "B-Drugclass", "O", "O"]
52 | }
53 | ```
54 |
55 | ### ECG2Pathology
56 | ```
57 | {
58 | "record_name": "00009_hr",
59 | "age": 55.0,
60 | "sex": 0,
61 | ...,
62 | "targets": [37,54]
63 | }
64 | ```
65 |
66 |
67 | ### raw
68 |
69 | The directory contains raw data files.
70 |
71 | The tasks `RuMedTop3` and `RuMedSymptomRec` are based on the [`RuMedPrime`](https://zenodo.org/record/5765873#.YbBlXT9Bzmw) dataset.
72 | The file `RuMedPrimeData.tsv` contains:
73 | ```
74 | symptoms anamnesis icd10 new_patient_id new_event_id new_event_time
75 | Сухость кожи... Месяц назад... E01.8 qf156c36 q5fc2cb1 2027-05-19
76 | Жалобы ГБ... Начало острое... J06.9 q9321cf8 qe173f20 2023-03-24
77 | ```
78 | - `symptoms` is the text field with patient symptoms and complaints;
79 | - `icd10` is the ICD-10 disease code;
80 | - `new_event_id` is the sample id.
81 |
82 | The file `rec_markup.csv` contains markup for the recommendation task:
83 | ```
84 | new_event_id,code,keep_spans
85 | q5fc2cb1,"кожа, сухая","[(0, 0), (7, 12), (13, 108)]"
86 | qe173f20,боль в мышцах,"[(0, 138), (151, 279)]"
87 | q653efaa,боль в мышцах,"[(0, 57), (70, 129)]"
88 | qe48681b,боль жгучая,"[(0, 45), (56, 181)]"
89 | ```
90 | - `new_event_id` is the sample id;
91 | - `code` is a symptom to predict;
92 | - `keep_spans` is a list of `(start, end)` tuples, used to transform the original text so that the target symptom code is excluded (see the reconstruction sketch at the end of this file).
93 |
94 | The `RuMedNLI` data is based on the translated [MedNLI](https://jgc128.github.io/mednli/) data.
95 |
96 | > Important! This repository does not contain the RuMedNLI files; please download them (`ru_mli_train_v1.jsonl`, `ru_mli_dev_v1.jsonl` and `ru_mli_test_v1.jsonl`) from [RuMedNLI: A Russian Natural Language Inference Dataset For The Clinical Domain](https://doi.org/10.13026/gxzd-cf80) into the `raw` directory. Then run `python tasks_builder.py` from the `code/` directory.
97 |
98 | The task `RuMedNER` is based on the RuDReC data - https://github.com/cimm-kzn/RuDReC.
99 | `RuDReC.csv` is a dataframe file with named entities in [IOB format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)).
100 | ```
101 | Sentence#,Word,Tag
102 | 172744.tsv_0,нам,O
103 | 172744.tsv_0,прописали,O
104 | 172744.tsv_0,",",O
105 | 172744.tsv_0,так,O
106 | 172744.tsv_0,мой,O
107 | 172744.tsv_0,ребенок,O
108 | 172744.tsv_0,сыпью,B-ADR
109 | 172744.tsv_0,покрылся,I-ADR
110 | 172744.tsv_0,",",O
111 | ```
112 |
113 |
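114 | For reference, the reduced `symptoms` text in `RuMedSymptomRec` is rebuilt from `keep_spans` as in `code/tasks_builder.py`; a minimal sketch (the `text` value here is a placeholder):
115 |
116 | ```python
117 | import ast
118 |
119 | keep_spans = "[(0, 0), (7, 12), (13, 108)]"  # `keep_spans` cell from rec_markup.csv
120 | text = "..."  # the matching `symptoms` text from RuMedPrimeData.tsv
121 |
122 | spans = ast.literal_eval(keep_spans)  # -> list of (start, end) tuples
123 | reduced = ''.join(text[s[0]:s[1]] for s in spans).strip()
124 | ```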
--------------------------------------------------------------------------------
/data/RuMedNLI/README.md:
--------------------------------------------------------------------------------
1 | > Important! This repository does not contain the RuMedNLI files; please download them (`ru_mli_train_v1.jsonl`, `ru_mli_dev_v1.jsonl` and `ru_mli_test_v1.jsonl`) from [RuMedNLI: A Russian Natural Language Inference Dataset For The Clinical Domain](https://doi.org/10.13026/gxzd-cf80) into the `raw` directory. Then run `python tasks_builder.py` from the `code/` directory.
--------------------------------------------------------------------------------
/lb_submissions/SAI/ChatGPT/README.md:
--------------------------------------------------------------------------------
1 | # Testing MedBench tasks with ChatGPT
2 |
3 | ## Summary
4 | We tested the performance of the ChatGPT (proxied by `text-davinci-003`) model on RuMedBench tasks, with the following results:
5 | - RuMedTest: 35.0% (above other models to date, yet way below human level)
6 |   - for reference, random guessing over 1,000 attempts gives (min, mean, max) = (19.14%, 24.85%, 31.74%) for this metric
7 | - RuMedDaNet: 89.3% (way above other models to date, closer to human level)
8 | - RuMedNLI: 61.3% (lagging behind other models to date, way below human level)
9 |
10 | While some preliminary tests were done to select better prompting techniques, these results may be improved with more prompt testing and further model fine-tuning.
11 |
12 | ## Premises
13 | We intended to evaluate the performance of the ChatGPT large language model on RuMedBench questions. The evaluation started in the [ChatGPT web interface](https://chat.openai.com/chat), and the main runs were then done through the OpenAI API with the `text-davinci-003` model. The latter is not precisely ChatGPT, but it is supposedly closely related and showed similar answer quality (measured on a sample of 20 questions from RuMedTest). An additional benefit of API access was that `text-davinci-003` adhered more closely to the requested answer format, without resorting to the longer explanations ChatGPT sometimes produced.
14 |
15 | To improve quality, we first compared performance on smaller samples of similar tasks (the dev sample in the case of RuMedDaNet and RuMedNLI, and similar questions from medical school exams for RuMedTest). The tested approaches included zero- and one-shot prompts and translation into English; see the individual tasks' sections for details of these prompt tests.
16 |
17 | All tests were run between 10 and 15 Feb 2023.
18 |
19 | ## RuMedTest
20 | For prompt evaluation, we used questions from a similar test [found here](https://geetest.ru/tests/terapiya_(dlya_internov)_sogma_). It has 775 questions; among them, only 355 have a one-line format similar to RuMedTest. A minor (~5%) overlap in questions with RuMedTest was noticed; those questions were excluded from prompt evaluation.
21 |
22 | The prompt tests included zero- and one-shot prompts and translation into English. No significant difference was observed (perhaps the prompt test sample was too small), so the main benchmark tests were performed using a plain prompt in Russian:
23 | ```
24 | You are a medical doctor and need to answer which one of the four statements is correct:
25 | 1. "Лёгочное сердце" может возникнуть при ишемической болезни сердца".
26 | 2. "Лёгочное сердце" может возникнуть при гипертонической болезни".
27 | 3. "Лёгочное сердце" может возникнуть при хронической обструктивной болезни лёгких".
28 | 4. "Лёгочное сердце" может возникнуть при гипертиреозе".
29 | The number of the correct statements is:
30 | ```
31 | The resulting accuracy was 35%, which beats simpler models but leaves a large gap to human performance. Possible ways to improve this result with the same model are more extensive prompt testing (greater variety and a larger testing sample), fine-tuning (if a large quantity of similar test data and sufficient budget are available), and cleaning the test data of typos, rare acronyms, and abbreviations.
32 |
33 | ## RuMedDaNet
34 | Prompt testing was done in both English and Russian, with zero- and one-shot prompts, with and without context. The zero-shot Russian prompt with context included performed reasonably well (85% on the pre-test sample) and was chosen for the full benchmark run.
35 |
36 | Note that the model could answer prompts without context significantly better than random, at 67%, which indicates some domain knowledge in the model. It may be worth exploring unprompted yes/no tests further, possibly adding them as a benchmark component.
37 |
38 | Example of prompt + question used:
39 | ```
40 | Imagine that you are a medical doctor and know everything about medicine and need to pass a degree exam.
41 | The context is: Природа полос поглощения в ик-области связана с колебательными переходами и изменением колебательных состояний ядер, входящих в молекулу поглощающего вещества. Поэтому поглощением в ИК-области обладают молекулы, дипольные моменты которых изменяются при возбуждении колебательных движений ядер. Область применения ИК-спектроскопии аналогична, но более широка, чем УФ-метода. ИК-спектр однозначно характеризует всю структуру молекулы, включая незначительные ее изменения. Важные преимущества данного метода — высокая специфичность, объективность полученных результатов, возможность анализа веществ в кристаллическом состоянии.
42 | The question is: Возможности ИК-спектроскопии позволяют анализировать вещества в кристаллическом состоянии?
43 | You should answer only yes or no.
44 | The answer is
45 | ```
46 | The resulting accuracy was 89.3%, which beats simpler models (the best registered result to date being 68%) and gets close to human performance (93%).
47 |
48 | ## RuMedNLI
49 | Prompt testing was done in Russian only, with zero- and few-shot prompts. There was no significant difference in performance (0.75 accuracy in prompt tests); the few-shot Russian prompt was chosen for the full benchmark run.
50 |
51 | Example of the prompt:
52 | ```
53 | You are a medical doctor and need to pass a degree exam. You are given two statements and need to answer how the second statement relates to the first statement. Possible answers are 'entailment', 'contradiction', or 'neutral'
54 | Statement 1: "В анамнезе нет тромбозов или ТГВ, никогда не было болей в груди до случаев недельной давности."
55 | Statement 2: "Пациент страдает стенокардией"
56 | Answer: "entailment"
57 | Statement 1: "В течение последней недели стал более сонливым и трудно возбудимым."
58 | Statement 2: "В последнюю неделю он был менее внимателен"
59 | Answer: "entailment"
60 | Statement 1: "КТ головы показала небольшое правое височное внутрипаренхиматозное кровоизлияние 2х2 см, на повторной КТ головы осталось без изменений."
61 | Statement 2: "у пациента было гипертоническое кровотечение"
62 | Answer: "neutral"
63 | Statement 1: "Рентгенограмма чистая, не похоже наличие инфекции."
64 | Statement 2: "Рентген грудной клетки выявил инфильтраты"
65 | Answer: "contradiction"
66 | Statement 1: "КТ головы показала небольшое правое височное внутрипаренхиматозное кровоизлияние 2х2 см, на повторной КТ головы осталось без изменений."
67 | Statement 2: "у пациента была нормальная КТ головы"
68 | Answer: "contradiction"
69 | Statement 1: "Сахарный диабет II типа."
70 | Statement 2: "Инсулин не требуется."
71 | Answer:
72 | ```
73 | The resulting accuracy was 61.3%, which does not beat simpler models (67% best to date), let alone the human level (85.7%). This could be because NLI tasks are more sensitive to fine-tuning; different prompting techniques may also be required.
74 |
75 | ### An interesting case of disagreement
76 | When testing NLI and other questions with ChatGPT, we observed cases where the model gave incorrect (vs. benchmark ground truth) answers supported by a detailed explanation. Here is an example:
77 | ```
78 | pairID 18935380-66c7-11e7-9dbb-f45c89b91419
79 | Gold label = CONTRADICTION
80 |
81 | PROMPT:
82 | You are a medical doctor and need to pass exam. You are given two statements:
83 | The first statement is absolutely correct and should be the basis for your answer: "На [**2810-12-14**] также был удален желудочковый дренаж."
84 | The second statement is "Отсутствие в анамнезе неврологических симптомов".
85 | You should answer if the second statement is 'entailment', 'contradiction', or 'neutral'.
86 | The answer is
87 |
88 | The relationship between the two statements is neutral.
89 |
90 | CHAT_GPT:
91 | The first statement describes a surgical procedure that was performed on a specific date to remove a cardiac drain. The second statement mentions the absence of neurological symptoms in the patient's medical history. While the two statements may be related in some way, they do not necessarily entail or contradict each other.
92 |
93 | The absence of neurological symptoms in a patient's medical history does not necessarily have a direct relationship with the removal of a cardiac drain. The relationship between the two statements may be influenced by other factors such as the patient's overall health, the reason for the drain placement, and the specifics of the surgical procedure. Without further information, it is not possible to determine a definite relationship between the two statements.
94 | ```
95 | It may be worth exploring such cases with domain experts to ensure that test questions reflect modern medical knowledge.
96 |
97 | ### File descriptions
98 | `chat-rmb.ipynb` - the main codebase for interaction with the model
99 |
100 | `*.jsonl` - answer files
101 |
102 | `rm*` - files with intermediate data and logs
103 |
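104 | For reference, below is a minimal sketch of a single completion call of the kind used for these runs, via the (then-current) OpenAI Python client; the prompt text, `max_tokens` value, and key handling are illustrative, not the exact notebook code:
105 |
106 | ```python
107 | import openai
108 |
109 | openai.api_key = "sk-..."  # your API key
110 |
111 | # a prompt built as in the task sections above
112 | prompt = "You are a medical doctor and need to answer ..."
113 | resp = openai.Completion.create(
114 |     model="text-davinci-003",
115 |     prompt=prompt,
116 |     max_tokens=8,
117 |     temperature=0,
118 | )
119 | print(resp["choices"][0]["text"].strip())
120 | ```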
--------------------------------------------------------------------------------
/lb_submissions/SAI/ChatGPT/rmnli_priv_gpt3_1502.pd.pickle:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pavel-blinov/RuMedBench/dd56a60faf55994ba402615c4067f0652ea0d2bb/lb_submissions/SAI/ChatGPT/rmnli_priv_gpt3_1502.pd.pickle
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGAuto/ECGBaselineLib/autobaseline.py:
--------------------------------------------------------------------------------
1 | import warnings
2 | warnings.filterwarnings("ignore", category=UserWarning)
3 | warnings.filterwarnings("ignore", category=FutureWarning)
4 |
5 | from sklearn.metrics import precision_recall_curve
6 |
7 | from lightautoml.automl.presets.tabular_presets import TabularUtilizedAutoML
8 | from lightautoml.tasks import Task
9 |
10 |
11 | def lama_train(df_list, random_seed):
12 |
13 | roles = {
14 | "target": "targets",
15 | "category": "device"
16 | }
17 |
18 | # https://github.com/sb-ai-lab/LightAutoML
19 | # define that machine learning problem is binary classification
20 | task = Task("binary")
21 |
22 | utilized_automl = TabularUtilizedAutoML(
23 | task = task,
24 | timeout = 180,
25 | cpu_limit = 8,
26 | reader_params = {'n_jobs': 8, 'cv': 5, 'random_state': random_seed}
27 | )
28 |
29 | _ = utilized_automl.fit_predict(df_list[0], roles = roles, verbose = 1)
30 |
31 | # threshold search
32 | val_pred = utilized_automl.predict(df_list[1].drop(columns=["targets"]))
33 | precision, recall, thresholds = precision_recall_curve(df_list[1]["targets"], val_pred.data.squeeze())
34 | best_thrsh = thresholds[(2*recall*precision / (recall + precision)).argmax()]
35 |
36 | pub_pred = utilized_automl.predict(df_list[2])
37 | pub_res = (pub_pred.data.squeeze() > best_thrsh) * 1  # cast boolean predictions to 0/1
38 | return pub_res, utilized_automl
39 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGAuto/ECGBaselineLib/datasets.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | from pathlib import Path
4 | from sklearn.model_selection import train_test_split
5 |
6 |
7 |
8 | def prepare_data(X, y):
9 | df = pd.DataFrame(X).reset_index(drop=True)
10 | df.loc[:, 'noise'] = df[['baseline_drift', 'static_noise', 'burst_noise', 'electrodes_problems', 'extra_beats', 'pacemaker']].isna().sum(axis=1).apply(lambda x: 1 if x==6 else 0)
11 | df = df[['age', 'sex', 'validated_by_human', 'site', 'device', 'noise']]
12 | if y is None:
13 | return df
14 | df['targets'] = pd.DataFrame(y)
15 | return df
16 |
17 |
18 | ##### Split for the AutoML baseline ######
19 | def get_dataset_baseline(data_path, class_id, dtype, random_state):
20 | assert dtype in ["train", "test"]
21 | classes_splits = {"ecgs":[], "targets":[], "names": []}
22 | metadata = pd.read_json(Path(data_path) / (dtype + "/" + dtype + ".jsonl"), lines=True)
23 | for _, signal in metadata.iterrows():
24 | if dtype == "train":
25 | classes_splits["ecgs"].append(signal[2:-6])
26 | classes_splits["targets"].append((class_id in signal["labels"]) * 1)
27 | else:
28 | classes_splits["ecgs"].append(signal[2:-5])
29 | classes_splits["names"].append(signal["record_name"])
30 | classes_splits["targets"] = np.array(classes_splits["targets"])
31 | if dtype == "test":
32 | del classes_splits["targets"]
33 | return prepare_data(classes_splits['ecgs'], None), classes_splits["names"]
34 | else:
35 | del classes_splits["names"]
36 | X_train, X_val, y_train, y_val = train_test_split(
37 | classes_splits["ecgs"],
38 | classes_splits["targets"],
39 | test_size=0.2,
40 | random_state=random_state,
41 | stratify=classes_splits["targets"]
42 | )
43 | return prepare_data(X_train, y_train), prepare_data(X_val, y_val)
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGAuto/ECGBaselineLib/utils.py:
--------------------------------------------------------------------------------
1 | from sklearn.metrics import precision_recall_curve
2 | import numpy as np
3 |
4 |
5 | def find_threshold_f1(trues, logits, eps=1e-9):
6 | if len(trues.shape) > 1:
7 | threshold = []
8 | for i in range(trues.shape[1]):
9 | precision, recall, thresholds = precision_recall_curve(trues[:,i], logits[:,i])
10 | f1_scores = 2 * precision * recall / (precision + recall + eps)
11 | threshold.append(float(thresholds[np.argmax(f1_scores)]))
12 | return threshold
13 | else:
14 | precision, recall, thresholds = precision_recall_curve(trues, logits)
15 | f1_scores = 2 * precision * recall / (precision + recall + eps)
16 | # single-class case: return a one-element list so callers can always index by class
17 | return [float(thresholds[np.argmax(f1_scores)])]
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGAuto/README.md:
--------------------------------------------------------------------------------
1 | The solution is an ensemble of 73 models, each solving a binary classification task for its corresponding class. Models were selected with the [LightAutoML](https://github.com/sb-ai-lab/LightAutoML) library, and *only the metadata* of each record was used for training.
2 |
3 | ### How to run the code
4 |
5 | Requires `python 3.8`
6 |
7 | `pip install -r requirements.txt`
8 |
9 | `python training.py data_path model_path # run training and prediction`
10 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGAuto/requirements.txt:
--------------------------------------------------------------------------------
1 | scikit-learn==1.2.2
2 | lightautoml==0.3.7.3
3 | numpy==1.24.4
4 | pandas==1.4.3
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGAuto/training.py:
--------------------------------------------------------------------------------
1 | from ECGBaselineLib.autobaseline import lama_train
2 | from ECGBaselineLib.datasets import get_dataset_baseline
3 |
4 | import sys
5 | import logging
6 | import os
7 | import argparse
8 | from pathlib import Path
9 | import json
10 | import joblib
11 | import numpy as np
12 |
13 |
14 | def main(args):
15 | # Logger
16 | logger = logging.getLogger('automl_baseline_training')
17 | log_format = '%(asctime)s %(message)s'
18 | logging.basicConfig(stream=sys.stdout, level=logging.INFO,
19 | format=log_format, datefmt='%m/%d %I:%M:%S %p')
20 | fh = logging.FileHandler(Path(args.model_path + "/summary/") / 'log_automl.txt')
21 | fh.setFormatter(logging.Formatter(log_format))
22 | logger.addHandler(fh)
23 | # Load the list of target classes
24 | with open(Path(args.data_path) / "train/idx2pathology.jsonl", "r") as f:
25 | classes = json.load(f)
26 | for class_name in classes:
27 | os.makedirs(args.model_path + "/models/" + class_name, exist_ok=True)
28 | logger.info("---------- Working with LAMA and {} class ----------".format(class_name))
29 | X_train, X_val = get_dataset_baseline(args.data_path, int(class_name), "train", args.random_state)
30 | X_public, public_names = get_dataset_baseline(args.data_path, int(class_name), "test", args.random_state)
31 | # Train the model and predict on the public test set
32 | pub_res, model = lama_train([X_train, X_val, X_public], random_seed = args.random_state)
33 | if class_name == '0':
34 | preds_dict = {key: [val] for key, val in dict(zip(public_names, pub_res)).items()}
35 | else:
36 | for i, key in enumerate(preds_dict):
37 | preds_dict[key].append(pub_res[i])
38 | joblib.dump(model, args.model_path + "/models/" + class_name + "/model.pkl")
39 |
40 | out_fname = Path(args.model_path) / "ECG2Pathology.jsonl"
41 | with open(out_fname, 'w') as fw:
42 | for rec in preds_dict:
43 | json.dump({"record_name":rec, "labels":np.array(preds_dict[rec]).nonzero()[0].tolist()}, fw, ensure_ascii=False)
44 | fw.write('\n')
45 |
46 |
47 | if __name__ == '__main__':
48 | parser = argparse.ArgumentParser(description = 'Baselines training script (LAMA)')
49 | parser.add_argument('data_path', help='dataset path (path to the folder containing test and train subfolders)', type=str)
50 | parser.add_argument('model_path', help='path to save the model and logs', type=str)
51 | parser.add_argument('--random_state', help='random state number', type=int, default=19)
52 | args = parser.parse_args()
53 |
54 | os.makedirs(args.model_path + "/models/", exist_ok=True)
55 | os.makedirs(args.model_path + "/summary/", exist_ok=True)
56 | main(args)
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGBinary/ECGBaselineLib/datasets.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | from pathlib import Path
4 | from sklearn.model_selection import train_test_split
5 |
6 |
7 | ##### Split for the N models baseline ######
8 | def get_dataset_baseline(data_path, class_name, class_id, dtype, random_state):
9 | assert dtype in ["train", "test"]
10 | classes_splits = {"ecgs":[], "targets":[], "names":[]}
11 | metadata = pd.read_json(Path(data_path) / (dtype + "/" + dtype + ".jsonl"), lines=True)
12 | for signal in (Path(data_path) / dtype).glob("*.npy"):
13 | signal_name = signal.stem  # record name without the ".npy" extension
14 | classes_splits["names"].append(signal_name)
15 | with open(signal, "rb") as f:
16 | signal_value = np.load(f, allow_pickle=True)
17 | classes_splits['ecgs'].append(signal_value)
18 | if dtype == "train":
19 | classes_splits["targets"].append((class_id in metadata.loc[metadata.record_name == signal_name, "labels"].item()) * 1)
20 | classes_splits["targets"] = np.array(classes_splits["targets"])
21 | if dtype == "test":
22 | del classes_splits["targets"]
23 | return classes_splits['ecgs'], classes_splits["names"]
24 | else:
25 | X_train, X_val, y_train, y_val, names_train, names_val = train_test_split(
26 | classes_splits["ecgs"],
27 | classes_splits["targets"],
28 | classes_splits["names"],
29 | test_size=0.33,
30 | random_state=random_state,
31 | stratify=classes_splits["targets"]
32 | )
33 | return X_train, X_val, y_train, y_val, names_train, names_val
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGBinary/ECGBaselineLib/neurobaseline.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import random
3 |
4 | import torch
5 | torch.set_default_dtype(torch.float32)
6 | import torch.nn as nn
7 | from torch.utils.data import Dataset, DataLoader
8 | import torch.nn.functional as F
9 | from torch.utils.tensorboard import SummaryWriter
10 |
11 | from sklearn.metrics import average_precision_score
12 | from .utils import find_threshold_f1
13 |
14 | import os
15 | from tqdm import tqdm
16 | import pickle
17 | import json
18 |
19 |
20 | # from https://wandb.ai/sauravmaheshkar/RSNA-MICCAI/reports/How-to-Set-Random-Seeds-in-PyTorch-and-Tensorflow--VmlldzoxMDA2MDQy
21 | def set_seed(seed: int = 42) -> None:
22 | np.random.seed(seed)
23 | random.seed(seed)
24 | torch.manual_seed(seed)
25 | torch.cuda.manual_seed(seed)
26 | # When running on the CuDNN backend, two further options must be set
27 | torch.backends.cudnn.deterministic = True
28 | torch.backends.cudnn.benchmark = False
29 | # Set a fixed value for the hash seed
30 | os.environ["PYTHONHASHSEED"] = str(seed)
31 | print(f"random seed set as {seed}")
32 |
33 |
34 | ##### ECG Dataset #####
35 | class ECGRuDataset(Dataset):
36 | """ECG.RU Dataset."""
37 |
38 | def __init__(self, ecgs, labels, names):
39 | """
40 | Args:
41 | labels: array with labels
42 | ecgs: array with num_ch-lead ecgs
43 | """
44 | self.ecgs = ecgs
45 | if labels is not None:
46 | self.labels = torch.from_numpy(labels).float()
47 | else:
48 | self.labels = None
49 | self.names = names
50 |
51 | def __len__(self):
52 | return len(self.names)
53 |
54 | def __getitem__(self, idx):
55 | if self.labels is not None:
56 | sample = {'value': self.ecgs[idx], 'target': self.labels[idx], 'names': self.names[idx]}
57 | else:
58 | sample = {'value': self.ecgs[idx], 'names': self.names[idx]}
59 | return sample
60 |
61 |
62 | ##### Multihead 1d-CNN model for the 1-st and 2-nd baselines #####
63 | class CNN1dMultihead(nn.Module):
64 | def __init__(self, k=1, num_ch=12):
65 | super().__init__()
66 | """
67 | Args:
68 | num_ch: number of channels of an ecg-signal
69 | k: number of classes
70 | """
71 | self.layer1 = nn.Sequential(
72 | nn.Conv1d(num_ch, 24, 10, stride=2),
73 | nn.BatchNorm1d(24),
74 | nn.ReLU(),
75 | nn.Conv1d(24, 48, 10, stride=2),
76 | nn.BatchNorm1d(48),
77 | nn.ReLU(),
78 | nn.MaxPool1d(6, 2)
79 | )
80 | self.layer2 = nn.Sequential(
81 | nn.Conv1d(48, 64, 10, stride=2),
82 | nn.BatchNorm1d(64),
83 | nn.ReLU(),
84 | nn.Conv1d(64, 128, 10, stride=2),
85 | nn.BatchNorm1d(128),
86 | nn.ReLU(),
87 | nn.AdaptiveMaxPool1d(10)
88 | )
89 | self.classification_layers = nn.ModuleList([nn.Sequential(
90 | nn.Linear(128*10, 120),
91 | nn.ReLU(),
92 | nn.Linear(120, 160),
93 | nn.ReLU(),
94 | nn.Linear(160, 1)
95 | ) for i in range(k)])
96 |
97 | def forward(self, x):
98 | x = self.layer1(x)
99 | x = self.layer2(x)
100 | x = torch.flatten(x, 1)
101 | preds = torch.stack([torch.squeeze(classification_layer(x)) for classification_layer in self.classification_layers])
102 | return torch.swapaxes(preds, 0, 1)
103 |
104 |
105 | ##### Trainer for 1d-CNN model #####
106 | class CNN1dTrainer:
107 | """
108 | class_name - dict if multilabel (id2label), str in binary
109 | """
110 | def __init__(self, class_name,
111 | model, optimizer, loss,
112 | train_dataset, val_dataset, test_dataset, model_path,
113 | batch_size=128, cuda_id=1):
114 |
115 | torch.manual_seed(0)
116 | random.seed(0)
117 | np.random.seed(0)
118 |
119 | self.model = model
120 | self.optimizer = optimizer
121 | self.loss = loss
122 |
123 | self.train_dataset = train_dataset
124 | self.val_dataset = val_dataset
125 | self.test_public = test_dataset
126 |
127 | self.result_output = {}
128 |
129 | self.batch_size = batch_size
130 |
131 | self.device = torch.device("cuda:" + str(cuda_id) if (torch.cuda.is_available() and cuda_id != -1) else "cpu")
132 | self.model = self.model.to(self.device)
133 |
134 | self.global_step = 0
135 | self.alpha = 0.8
136 |
137 | self.class_name = class_name
138 |
139 | self.result_output['class'] = class_name
140 |
141 | # create per-class output directories for model checkpoints and tensorboard summaries
142 | os.makedirs(model_path + "/models" + "/" + self.class_name, exist_ok=True)
143 | os.makedirs(model_path + "/summary" + "/" + self.class_name, exist_ok=True)
144 | self.writer = SummaryWriter(model_path + "/summary" + "/" + self.class_name)
145 | self.model_path = model_path
146 |
147 | def save_checkpoint(self, path):
148 | torch.save(self.model.state_dict(), path)
149 |
150 | def train(self, num_epochs):
151 |
152 | model = self.model
153 | optimizer = self.optimizer
154 |
155 | self.train_loader = DataLoader(self.train_dataset, shuffle=True, pin_memory=True, batch_size=self.batch_size, num_workers=4)
156 | self.val_loader = DataLoader(self.val_dataset, shuffle=False, pin_memory=True, batch_size=len(self.val_dataset), num_workers=4)
157 |
158 | best_val = -np.inf  # sentinel below any possible average precision
159 | for epoch in tqdm(range(num_epochs)):
160 | model.train()
161 | train_logits = []
162 | train_gts = []
163 |
164 | for batch in self.train_loader:
165 | batch = {k: v.to(self.device) for k, v in batch.items() if k != 'names'}
166 | optimizer.zero_grad()
167 | logits = model(batch['value']).squeeze()
168 | train_logits.append(logits.cpu().detach())
169 | train_gts.append(batch['target'].cpu())
170 | loss = self.loss(logits, batch['target'])
171 | loss.backward()
172 | optimizer.step()
173 | self.writer.add_scalar("Train Loss", loss.item(), global_step=self.global_step)
174 | self.global_step += 1
175 |
176 | train_logits = np.concatenate(train_logits)
177 | train_gts = np.concatenate(train_gts)
178 |
179 | if self.class_name != "multihead":
180 | train_logits = train_logits[:,None]
181 | train_gts = train_gts[:,None]
182 |
183 | res_ap = []
184 | for i in range(train_logits.shape[1]):
185 | res_ap.append(average_precision_score(train_gts[:,i], train_logits[:,i]))
186 | self.writer.add_scalar("Train AP/{}".format(self.class_name), np.mean(res_ap), global_step=epoch)
187 |
188 | model.eval()
189 | with torch.no_grad():
190 | for batch in self.val_loader:
191 | batch = {k: v.to(self.device) for k, v in batch.items() if k != 'names'}
192 | logits = model(batch['value']).cpu().squeeze()
193 | gts = batch['target'].cpu()
194 |
195 | if self.class_name != "multihead":
196 | logits = logits[:,None]
197 | gts = gts[:,None]
198 |
199 | res_ap = []
200 | for i in range(logits.shape[1]):
201 | res_ap.append(average_precision_score(gts[:,i], logits[:,i]))
202 | mean_val = np.mean(res_ap)
203 |
204 | if mean_val > best_val:
205 | self.save_checkpoint(self.model_path + "/models" + "/" +self.class_name+"/best_checkpoint.pth")
206 | best_val = mean_val
207 | self.result_output['threshold_f1'] = find_threshold_f1(gts, logits)
208 | self.test(self.model, self.test_public, "public", epoch)
209 | self.writer.add_scalar("Val AP/{}".format(self.class_name), mean_val, global_step=epoch)
210 | with open(self.model_path + "/models" + "/" +self.class_name+"/log.pickle", 'wb') as handle:
211 | pickle.dump(self.result_output, handle, protocol=pickle.HIGHEST_PROTOCOL)
212 |
213 |
214 | def test(self, model, test_dataset, name, epoch):
215 | model.eval()
216 |
217 | test_loader = DataLoader(test_dataset, shuffle=True, pin_memory=True, batch_size=len(test_dataset), num_workers=4)
218 | for batch in test_loader:
219 | names = batch['names']
220 | batch = {k: v.to(self.device) for k, v in batch.items() if k != 'names'}
221 | with torch.no_grad():
222 | logits = model(batch['value']).cpu().detach().squeeze()
223 |
224 | if self.class_name != "multihead":
225 | logits = logits[:,None]
226 |
227 | preds = []
228 | for i in range(logits.shape[1]):
229 | preds.append((logits[:,i] > self.result_output['threshold_f1'][i])*1)
230 |
231 | out_fname = self.model_path + "/models" + "/" +self.class_name + "/ECG2Pathology.jsonl"
232 | with open(out_fname, 'w') as fw:
233 | for rec in preds:
234 | res = dict(zip(names, rec.tolist()))
235 | json.dump(res, fw, ensure_ascii=False)
236 | fw.write('\n')
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGBinary/ECGBaselineLib/utils.py:
--------------------------------------------------------------------------------
1 | from sklearn.metrics import precision_recall_curve
2 | import numpy as np
3 |
4 |
5 | def find_threshold_f1(trues, logits, eps=1e-9):
6 | if len(trues.shape) > 1:
7 | threshold = []
8 | for i in range(trues.shape[1]):
9 | precision, recall, thresholds = precision_recall_curve(trues[:,i], logits[:,i])
10 | f1_scores = 2 * precision * recall / (precision + recall + eps)
11 | threshold.append(float(thresholds[np.argmax(f1_scores)]))
12 | return threshold
13 | else:
14 | precision, recall, thresholds = precision_recall_curve(trues, logits)
15 | f1_scores = 2 * precision * recall / (precision + recall + eps)
16 | # single-class case: return a one-element list so callers can always index by class
17 | return [float(thresholds[np.argmax(f1_scores)])]
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGBinary/README.md:
--------------------------------------------------------------------------------
1 | The solution is an ensemble of 73 models, each solving a binary classification task for its corresponding class.
2 |
3 | ### How to run the code
4 |
5 | `pip install -r requirements.txt`
6 |
7 | `python training.py data_path model_path # run training and prediction`
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGBinary/requirements.txt:
--------------------------------------------------------------------------------
1 | torch==1.11.0
2 | numpy==1.24.4
3 | pandas==2.0.2
4 | scikit-learn==1.2.2
5 | tqdm==4.65.0
6 | tensorboard==2.13.0
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGBinary/training.py:
--------------------------------------------------------------------------------
1 | import torch.optim as optim
2 | import torch.nn as nn
3 | import numpy as np
4 |
5 | from ECGBaselineLib.datasets import get_dataset_baseline
6 | from ECGBaselineLib.neurobaseline import set_seed, ECGRuDataset, CNN1dTrainer, CNN1dMultihead
7 |
8 | import sys
9 | import logging
10 |
11 | import argparse
12 |
13 | from pathlib import Path
14 | import json
15 |
16 |
17 | def main(args):
18 | # Fix seed
19 | set_seed(seed = args.random_state)
20 | # Logger
21 | logger = logging.getLogger('binary_baseline_training')
22 | log_format = '%(asctime)s %(message)s'
23 | logging.basicConfig(stream=sys.stdout, level=logging.INFO,
24 | format=log_format, datefmt='%m/%d %I:%M:%S %p', filemode='w')
25 | fh = logging.FileHandler(Path(args.model_path) / "log_binary.txt")
26 | fh.setFormatter(logging.Formatter(log_format))
27 | logger.addHandler(fh)
28 | # Data preparing
29 | with open(Path(args.data_path) / "train/idx2pathology.jsonl", "r") as f:
30 | classes = json.load(f)
31 | for class_name in classes:
32 | logger.info("---------- Working with %s ----------" % (classes[class_name]))
33 | X_train, X_val, y_train, y_val, names_train, names_val = get_dataset_baseline(args.data_path, classes[class_name], int(class_name), "train", args.random_state)
34 | X_public, names_public = get_dataset_baseline(args.data_path, classes[class_name], int(class_name), "test", args.random_state)
35 | model = CNN1dMultihead()
36 | opt = optim.AdamW(model.parameters(), lr=3e-3)
37 |
38 | train_ds = ECGRuDataset(X_train, y_train, names_train)
39 | val_ds = ECGRuDataset(X_val, y_val, names_val)
40 | test_public = ECGRuDataset(X_public, None, names_public)
41 |
42 | trainer = CNN1dTrainer(class_name = class_name,
43 | model = model, optimizer = opt, loss = nn.BCEWithLogitsLoss(),
44 | train_dataset = train_ds, val_dataset = val_ds, test_dataset = test_public,
45 | model_path = args.model_path,
46 | cuda_id = args.cuda_id)
47 | logger.info("---------- Model training started! ----------")
48 | trainer.train(args.num_epochs)
49 | with open(Path(args.model_path) / ( "models/" + class_name + "/ECG2Pathology.jsonl"), "r") as f:
50 | pred_i = json.load(f)
51 | if int(class_name) == 0:
52 | preds_dict = {k:[v] for k,v in pred_i.items()}
53 | else:
54 | for k in preds_dict:
55 | preds_dict[k].append(pred_i[k])
56 |
57 | out_fname = Path(args.model_path) / "ECG2Pathology.jsonl"
58 | with open(out_fname, 'w') as fw:
59 | for rec in preds_dict:
60 | json.dump({"record_name":rec, "labels":np.array(preds_dict[rec]).nonzero()[0].tolist()}, fw, ensure_ascii=False)
61 | fw.write('\n')
62 |
63 |
64 | if __name__ == '__main__':
65 | parser = argparse.ArgumentParser(description = 'Baselines training script (1d-CNN)')
66 | parser.add_argument('data_path', help='dataset path (path to the folder containing test and train subfolders)', type=str)
67 | parser.add_argument('model_path', help='path to save the model and logs', type=str)
68 | parser.add_argument('--cuda_id', help='CUDA device number on a single GPU; use -1 if you want to work on CPU', type=int, default=0)
69 | parser.add_argument('--k', help='number of positive examples for class', type=int, default=11)
70 | parser.add_argument('--num_epochs', help='number of epochs', type=int, default=5)
71 | parser.add_argument('--random_state', help='random state number', type=int, default=19)
72 | args = parser.parse_args()
73 | Path(args.model_path).mkdir(parents = False, exist_ok = True)
74 | main(args)
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGMultihead/ECGBaselineLib/datasets.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import os
4 | from pathlib import Path
5 |
6 | from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
7 |
8 |
9 | # Iterative stratification
10 | def make_stratification(df, strat_matrix, random_state):
11 | msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=random_state)
12 | train_split, test_split = list(msss.split(df.record_name.values[:,None], strat_matrix))[0]
13 | # Obtain record numbers
14 | train_names = df.loc[train_split, "record_name"].values
15 | test_names = df.loc[test_split, "record_name"].values
16 | # Make val/test split
17 | msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=random_state)
18 | val_split, test_split = list(msss.split(test_names[:,None], strat_matrix[test_split]))[0]
19 |     assert (np.intersect1d(train_names, test_names[val_split]).shape[0] == 0) and (np.intersect1d(test_names[val_split], test_names[test_split]).shape[0] == 0) and (np.intersect1d(test_names[test_split], train_names).shape[0] == 0), "Records are repeated across the splits!"
20 | return train_names, test_names[val_split], test_names[test_split]
21 |
22 |
23 | ##### Split for the multihead baseline ######
24 | def get_dataset_baseline(data_path, dtype, random_state):
25 | assert dtype in ["train", "test"]
26 | classes_splits = {"ecgs":[], "targets":[], "names": []}
27 | metadata = pd.read_json(Path(data_path) / (dtype + "/" + dtype + ".jsonl"), lines=True)
28 | for signal in (Path(data_path) / dtype).glob("*.npy"):
29 |         signal_name = signal.stem  # file name without the .npy extension
30 | classes_splits["names"].append(signal_name)
31 | with open(signal, "rb") as f:
32 | signal_value = np.load(f, allow_pickle=True)
33 | classes_splits['ecgs'].append(signal_value)
34 | if dtype == "train":
35 | signal_target = np.zeros(73)
36 | signal_target[metadata.loc[metadata.record_name == signal_name, "labels"].item()] = 1
37 | classes_splits["targets"].append(signal_target)
38 | classes_splits["ecgs"] = np.array(classes_splits["ecgs"])
39 | classes_splits["names"] = np.array(classes_splits["names"])
40 | if dtype == "test":
41 | del classes_splits["targets"]
42 | return classes_splits['ecgs'], classes_splits["names"]
43 | else:
44 | classes_splits["targets"] = np.array(classes_splits["targets"])
45 | msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.33, random_state=random_state)
46 | train_split, val_split = list(msss.split(classes_splits["ecgs"], classes_splits["targets"]))[0]
47 | X_train, X_val, y_train, y_val = classes_splits["ecgs"][train_split], classes_splits["ecgs"][val_split], \
48 | classes_splits["targets"][train_split], classes_splits["targets"][val_split]
49 | return X_train, X_val, y_train, y_val, classes_splits["names"][train_split], classes_splits["names"][val_split]
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGMultihead/ECGBaselineLib/neurobaseline.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import random
3 |
4 | import torch
5 | torch.set_default_dtype(torch.float32)
6 | import torch.nn as nn
7 | from torch.utils.data import Dataset, DataLoader
8 | import torch.nn.functional as F
9 | from torch.utils.tensorboard import SummaryWriter
10 |
11 | from sklearn.metrics import average_precision_score
12 | from .utils import find_threshold_f1
13 |
14 | import os
15 | from tqdm import tqdm
16 | import pickle
17 | import json
18 |
19 |
20 | # from https://wandb.ai/sauravmaheshkar/RSNA-MICCAI/reports/How-to-Set-Random-Seeds-in-PyTorch-and-Tensorflow--VmlldzoxMDA2MDQy
21 | def set_seed(seed: int = 42) -> None:
22 | np.random.seed(seed)
23 | random.seed(seed)
24 | torch.manual_seed(seed)
25 | torch.cuda.manual_seed(seed)
26 | # When running on the CuDNN backend, two further options must be set
27 | torch.backends.cudnn.deterministic = True
28 | torch.backends.cudnn.benchmark = False
29 | # Set a fixed value for the hash seed
30 | os.environ["PYTHONHASHSEED"] = str(seed)
31 | print(f"random seed set as {seed}")
32 |
33 |
34 | ##### ECG Dataset #####
35 | class ECGRuDataset(Dataset):
36 | """ECG.RU Dataset."""
37 |
38 | def __init__(self, ecgs, labels, names):
39 | """
40 | Args:
41 | labels: array with labels
42 | ecgs: array with num_ch-lead ecgs
43 | """
44 | self.ecgs = ecgs
45 | if labels is not None:
46 | self.labels = torch.from_numpy(labels).float()
47 | else:
48 | self.labels = None
49 | self.names = names
50 |
51 | def __len__(self):
52 | return len(self.names)
53 |
54 | def __getitem__(self, idx):
55 | if self.labels is not None:
56 | sample = {'value': self.ecgs[idx], 'target': self.labels[idx], 'names': self.names[idx]}
57 | else:
58 | sample = {'value': self.ecgs[idx], 'names': self.names[idx]}
59 | return sample
60 |
61 |
62 | ##### Multihead 1d-CNN model for the 1-st and 2-nd baselines #####
63 | class CNN1dMultihead(nn.Module):
64 | def __init__(self, k=1, num_ch=12):
65 | super().__init__()
66 | """
67 | Args:
68 | num_ch: number of channels of an ecg-signal
69 | k: number of classes
70 | """
71 | self.layer1 = nn.Sequential(
72 | nn.Conv1d(num_ch, 24, 10, stride=2),
73 | nn.BatchNorm1d(24),
74 | nn.ReLU(),
75 | nn.Conv1d(24, 48, 10, stride=2),
76 | nn.BatchNorm1d(48),
77 | nn.ReLU(),
78 | nn.MaxPool1d(6, 2)
79 | )
80 | self.layer2 = nn.Sequential(
81 | nn.Conv1d(48, 64, 10, stride=2),
82 | nn.BatchNorm1d(64),
83 | nn.ReLU(),
84 | nn.Conv1d(64, 128, 10, stride=2),
85 | nn.BatchNorm1d(128),
86 | nn.ReLU(),
87 | nn.AdaptiveMaxPool1d(10)
88 | )
89 | self.classification_layers = nn.ModuleList([nn.Sequential(
90 | nn.Linear(128*10, 120),
91 | nn.ReLU(),
92 | nn.Linear(120, 160),
93 | nn.ReLU(),
94 | nn.Linear(160, 1)
95 | ) for i in range(k)])
96 |
97 | def forward(self, x):
98 | x = self.layer1(x)
99 | x = self.layer2(x)
100 | x = torch.flatten(x, 1)
101 | preds = torch.stack([torch.squeeze(classification_layer(x)) for classification_layer in self.classification_layers])
102 | return torch.swapaxes(preds, 0, 1)
103 |
104 |
105 | ##### Trainer for 1d-CNN model #####
106 | class CNN1dTrainer:
107 | """
108 | class_name - dict if multilabel (id2label), str in binary
109 | """
110 | def __init__(self, class_name,
111 | model, optimizer, loss,
112 | train_dataset, val_dataset, test_dataset, model_path,
113 | batch_size=128, cuda_id=1):
114 |
115 | torch.manual_seed(0)
116 | random.seed(0)
117 | np.random.seed(0)
118 |
119 | self.model = model
120 | self.optimizer = optimizer
121 | self.loss = loss
122 |
123 | self.train_dataset = train_dataset
124 | self.val_dataset = val_dataset
125 | self.test_public = test_dataset
126 |
127 | self.result_output = {}
128 |
129 | self.batch_size = batch_size
130 |
131 |         self.device = torch.device("cuda:" + str(cuda_id) if (torch.cuda.is_available() and cuda_id != -1) else "cpu")
132 | self.model = self.model.to(self.device)
133 |
134 | self.global_step = 0
135 | self.alpha = 0.8
136 |
137 | self.class_name = class_name
138 |
139 | self.result_output['class'] = class_name
140 |
141 |         os.makedirs(model_path + "/models" + "/" + self.class_name, exist_ok=True)
142 |         os.makedirs(model_path + "/summary" + "/" + self.class_name, exist_ok=True)
143 |
144 | self.writer = SummaryWriter(model_path + "/summary" + "/" + self.class_name)
145 | self.model_path = model_path
146 |
147 | def save_checkpoint(self, path):
148 | torch.save(self.model.state_dict(), path)
149 |
150 | def train(self, num_epochs):
151 |
152 | model = self.model
153 | optimizer = self.optimizer
154 |
155 | self.train_loader = DataLoader(self.train_dataset, shuffle=True, pin_memory=True, batch_size=self.batch_size, num_workers=4)
156 | self.val_loader = DataLoader(self.val_dataset, shuffle=False, pin_memory=True, batch_size=len(self.val_dataset), num_workers=4)
157 |
158 |         best_val = -38  # sentinel below any possible average-precision value
159 | for epoch in tqdm(range(num_epochs)):
160 | model.train()
161 | train_logits = []
162 | train_gts = []
163 |
164 | for batch in self.train_loader:
165 | batch = {k: v.to(self.device) for k, v in batch.items() if k != 'names'}
166 | optimizer.zero_grad()
167 | logits = model(batch['value']).squeeze()
168 | train_logits.append(logits.cpu().detach())
169 | train_gts.append(batch['target'].cpu())
170 | loss = self.loss(logits, batch['target'])
171 | loss.backward()
172 | optimizer.step()
173 | self.writer.add_scalar("Train Loss", loss.item(), global_step=self.global_step)
174 | self.global_step += 1
175 |
176 | train_logits = np.concatenate(train_logits)
177 | train_gts = np.concatenate(train_gts)
178 |
179 | if self.class_name != "multihead":
180 | train_logits = train_logits[:,None]
181 | train_gts = train_gts[:,None]
182 |
183 | res_ap = []
184 | for i in range(train_logits.shape[1]):
185 | res_ap.append(average_precision_score(train_gts[:,i], train_logits[:,i]))
186 | self.writer.add_scalar("Train AP/{}".format(self.class_name), np.mean(res_ap), global_step=epoch)
187 |
188 | model.eval()
189 | with torch.no_grad():
190 | for batch in self.val_loader:
191 | batch = {k: v.to(self.device) for k, v in batch.items() if k != 'names'}
192 | logits = model(batch['value']).cpu().squeeze()
193 | gts = batch['target'].cpu()
194 |
195 | if self.class_name != "multihead":
196 | logits = logits[:,None]
197 | gts = gts[:,None]
198 |
199 | res_ap = []
200 | for i in range(logits.shape[1]):
201 | res_ap.append(average_precision_score(gts[:,i], logits[:,i]))
202 | mean_val = np.mean(res_ap)
203 |
204 | if mean_val > best_val:
205 | self.save_checkpoint(self.model_path + "/models" + "/" +self.class_name+"/best_checkpoint.pth")
206 | best_val = mean_val
207 | self.result_output['threshold_f1'] = find_threshold_f1(gts, logits)
208 | self.test(self.model, self.test_public, "public", epoch)
209 | self.writer.add_scalar("Val AP/{}".format(self.class_name), mean_val, global_step=epoch)
210 | with open(self.model_path + "/models" + "/" +self.class_name+"/log.pickle", 'wb') as handle:
211 | pickle.dump(self.result_output, handle, protocol=pickle.HIGHEST_PROTOCOL)
212 |
213 |
214 | def test(self, model, test_dataset, name, epoch):
215 | model.eval()
216 |
217 | test_loader = DataLoader(test_dataset, shuffle=True, pin_memory=True, batch_size=len(test_dataset), num_workers=4)
218 | for batch in test_loader:
219 | names = batch['names']
220 | batch = {k: v.to(self.device) for k, v in batch.items() if k != 'names'}
221 | with torch.no_grad():
222 | logits = model(batch['value']).cpu().detach().squeeze()
223 |
224 | if self.class_name != "multihead":
225 | logits = logits[:,None]
226 |
227 | preds = []
228 | for i in range(logits.shape[1]):
229 | preds.append((logits[:,i] > self.result_output['threshold_f1'][i])*1)
230 |
231 | out_fname = self.model_path + "/models" + "/" +self.class_name + "/ECG2Pathology.jsonl"
232 | with open(out_fname, 'w') as fw:
233 | for rec in preds:
234 | res = dict(zip(names, rec.tolist()))
235 | json.dump(res, fw, ensure_ascii=False)
236 | fw.write('\n')
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGMultihead/ECGBaselineLib/utils.py:
--------------------------------------------------------------------------------
1 | from sklearn.metrics import precision_recall_curve
2 | import numpy as np
3 |
4 |
5 | def find_threshold_f1(trues, logits, eps=1e-9):
6 | if len(trues.shape) > 1:
7 | threshold = []
8 | for i in range(trues.shape[1]):
9 | precision, recall, thresholds = precision_recall_curve(trues[:,i], logits[:,i])
10 | f1_scores = 2 * precision * recall / (precision + recall + eps)
11 | threshold.append(float(thresholds[np.argmax(f1_scores)]))
12 | return threshold
13 |     else:
14 |         precision, recall, thresholds = precision_recall_curve(trues, logits)
15 |         f1_scores = 2 * precision * recall / (precision + recall + eps)
16 |         # return a one-element list so callers can index the result uniformly
17 |         return [float(thresholds[np.argmax(f1_scores)])]
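18 | 
19 | # Usage sketch (toy data, added for illustration):
20 | #   trues  = np.array([[0, 1], [1, 0], [1, 1]])
21 | #   logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.7, 0.6]])
22 | #   find_threshold_f1(trues, logits)  # -> one threshold per column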
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGMultihead/README.md:
--------------------------------------------------------------------------------
1 | The solution is a single model with 73 classification heads, each solving a binary classification task for its corresponding class. The model output is a 73-dimensional vector (see the sketch below for how it is binarized).
2 | 
3 | ### Running the code
4 | 
5 | `pip install -r requirements.txt`
6 | 
7 | `python training.py data_path model_path  # run training and prediction`
8 |
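9 | A minimal sketch (illustrative; names assumed) of turning the 73-dimensional logit vector into binary labels with the per-class thresholds selected by `find_threshold_f1` on validation:
10 | 
11 | ```python
12 | import numpy as np
13 | 
14 | logits = np.random.randn(5, 73)             # model outputs for 5 records (toy data)
15 | thresholds = np.zeros(73)                   # per-class thresholds from find_threshold_f1
16 | preds = (logits > thresholds).astype(int)   # (5, 73) multilabel predictions
17 | ```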
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGMultihead/requirements.txt:
--------------------------------------------------------------------------------
1 | torch==1.11.0
2 | numpy==1.24.4
3 | pandas==2.0.2
4 | iterative-stratification==0.1.7
5 | scikit-learn==1.2.2
6 | tqdm==4.65.0
7 | tensorboard==2.13.0
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGMultihead/training.py:
--------------------------------------------------------------------------------
1 | import torch.optim as optim
2 | import torch.nn as nn
3 | import numpy as np
4 |
5 | from ECGBaselineLib.datasets import get_dataset_baseline
6 | from ECGBaselineLib.neurobaseline import set_seed, ECGRuDataset, CNN1dTrainer, CNN1dMultihead
7 |
8 | import sys
9 | import logging
10 |
11 | import argparse
12 |
13 | from pathlib import Path
14 | import json
15 |
16 |
17 | def main(args):
18 | # Fix seed
19 | set_seed(seed = args.random_state)
20 | # Logger
21 | logger = logging.getLogger('baseline_multihead_training')
22 | log_format = '%(asctime)s %(message)s'
23 | logging.basicConfig(stream=sys.stdout, level=logging.INFO,
24 | format=log_format, datefmt='%m/%d %I:%M:%S %p', filemode='w')
25 | fh = logging.FileHandler(Path(args.model_path) / "log_multihead.txt")
26 | fh.setFormatter(logging.Formatter(log_format))
27 | logger.addHandler(fh)
28 | # Data preparing
29 | with open(Path(args.data_path) / "train/idx2pathology.jsonl", "r") as f:
30 | classes = json.load(f)
31 | logger.info("---------- Working with multihead model ----------")
32 | X_train, X_val, y_train, y_val, names_train, names_val = get_dataset_baseline(args.data_path, "train", args.random_state)
33 | X_public, names_public = get_dataset_baseline(args.data_path, "test", args.random_state)
34 | model = CNN1dMultihead(k=73)
35 | opt = optim.AdamW(model.parameters(), lr=3e-3)
36 |
37 | train_ds = ECGRuDataset(X_train, y_train, names_train)
38 | val_ds = ECGRuDataset(X_val, y_val, names_val)
39 | test_public = ECGRuDataset(X_public, None, names_public)
40 |
41 | trainer = CNN1dTrainer(class_name = "multihead",
42 | model = model, optimizer = opt, loss = nn.BCEWithLogitsLoss(),
43 | train_dataset = train_ds, val_dataset = val_ds, test_dataset = test_public,
44 | model_path = args.model_path,
45 | cuda_id = args.cuda_id)
46 | logger.info("---------- Model training started! ----------")
47 | trainer.train(args.num_epochs)
48 |
49 |     # gather the per-class prediction rows written by the trainer
50 |     with open(Path(args.model_path) / ("models/" + "multihead" + "/ECG2Pathology.jsonl"), "r") as fr:
51 |         for i, line in enumerate(fr):
52 | line = json.loads(line)
53 | if i == 0:
54 | preds_dict = {k:[v] for k,v in line.items()}
55 | else:
56 | for k in preds_dict:
57 | preds_dict[k].append(line[k])
58 |
59 | out_fname = Path(args.model_path) / "ECG2Pathology.jsonl"
60 | with open(out_fname, 'w') as fw:
61 | for rec in preds_dict:
62 | json.dump({"record_name":rec, "labels":np.array(preds_dict[rec]).nonzero()[0].tolist()}, fw, ensure_ascii=False)
63 | fw.write('\n')
64 |
65 |
66 | if __name__ == '__main__':
67 | parser = argparse.ArgumentParser(description = 'Baselines training script (1d-CNN)')
68 | parser.add_argument('data_path', help='dataset path (path to the folder containing test and train subfolders)', type=str)
69 | parser.add_argument('model_path', help='path to save the model and logs', type=str)
70 | parser.add_argument('--cuda_id', help='CUDA device number on a single GPU; use -1 if you want to work on CPU', type=int, default=0)
71 | parser.add_argument('--k', help='number of positive examples for class', type=int, default=11)
72 | parser.add_argument('--num_epochs', help='number of epochs', type=int, default=5)
73 | parser.add_argument('--random_state', help='random state number', type=int, default=19)
74 | args = parser.parse_args()
75 | main(args)
--------------------------------------------------------------------------------
/lb_submissions/SAI/Gigachat/.gitignore:
--------------------------------------------------------------------------------
1 | config*.ini
2 | *.log
3 | MedBench
4 | *.db
5 | *.xml
6 | logs*
7 | RuMedTest--sogma--dev.jsonl
--------------------------------------------------------------------------------
/lb_submissions/SAI/Gigachat/README.md:
--------------------------------------------------------------------------------
1 | # Testing MedBench with Gigachat
2 |
3 | ## Summary
4 |
5 | We tested the performance of the Gigachat (`GigachatPro, uncensored, 2024-03-04`) model on RuMedBench tasks with the following results:
6 |
7 | | Task | Result |
8 | |------------|--------|
9 | | RuMedNLI |`65.17%`|
10 | | RuMedDaNet |`92.58%`|
11 | | RuMedTest |`72.04%`|
12 |
13 | ## Experiments description
14 | ### RuMedDaNet ( `rumed_da_net.py` )
15 |
16 | Only one simple prompt was used -- just `{context}`, `{question}`, and a request to answer "Yes" or "No".
17 |
18 | **Accuracy (dev)**: `95.70 %`.
19 | ### RuMedNLI ( `rumed_nli.py` )
20 | Three approaches were used:
21 | 0. **Simple doctor prompt**: one prompt (with a doctor role description, the instruction, and a request to __respond in one word__) is sent to the LLM.
22 | 1. **Complex doctor prompt with moderator**: one prompt (with a doctor role description, the instruction, and a request to __respond in detail__) is sent to the LLM. Then a second prompt (with a moderator role description and a request to choose the right answer and respond in one word) is sent to the LLM.
23 | 2. **Complex doctor prompt with chat**: one prompt (with a doctor role description, the instruction, and a request to __respond in detail__) is sent to the LLM. Then, if the response isn't specific, a new request with the chat history is sent to the LLM (at most 3 times).
24 |
25 | **Accuracy (dev)**:
26 |
27 | | Approach | Accuracy |
28 | |---------------------|-----------|
29 | | v0: simple | `60.55 %` |
30 | | v1: doctor + prompt | `67.51 %` |
31 | | v2: doctor + chat | `67.93 %` |
32 |
33 | Approach `v2` was used for test evaluation.
34 |
35 | ### RuMedTest ( `rumed_test.py` )
36 | For prompt evaluation, the SOGMA dataset was used ([link](https://geetest.ru/tests/terapiya_(dlya_internov)_sogma_)).
37 |
38 | Eight prompt variants were checked:
39 | 0. **Simple prompt**: one prompt with the question instruction is sent to the LLM. Invalid answers are ignored.
40 | 1. **Simple doctor prompt**: one prompt with a doctor role, the question instruction, and a request to respond with a single number is sent to the LLM. Invalid answers are ignored.
41 | 2. **Complex doctor prompt with moderator**: like approach [RuMedNLI:1].
42 | 3. **Complex doctor prompt with moderator (2)**: like the previous approach, [RuMedTest:2].
43 | 4. **Complex doctor prompt with moderator (3)**: like [RuMedTest:2].
44 | 5. **Complex doctor prompt with chat**: like [RuMedNLI:2].
45 | 6. **Simple alphabetic doctor prompt with chat**: like the previous approach, [RuMedTest:5], but the variant numbers are replaced with letters (`1 -> a`, `2 -> b`, etc.).
46 | 7. **Complex alphabetic doctor prompt with chat**: like the previous approach, [RuMedTest:6], but with an instruction to __respond in detail__.
47 |
48 | **Accuracy (sogma)**:
49 | | Approach | Accuracy |
50 | |----------|-----------|
51 | | v0 | `49.15 %` |
52 | | v1 | `55.15 %` |
53 | | v2 | `53.46 %` |
54 | | v3 | `53.46 %` |
55 | | v4 | `49.41 %` |
56 | | v5 | `52.93 %` |
57 | | v6 | `57.24 %` |
58 | | v7 | `~26.1 %` |
59 |
60 | Approach `v6` was used for test evaluation.
61 | ## Usage
62 | -1. Run `pip install -r requirements.txt`.
63 |
64 | 0. Put `config.ini` with Gigachat credentials to this directory.
65 |
66 | Example of the content (replace the placeholder values with your own):
67 | ```ini
68 | [credentials]
69 | user = your-user
70 | credentials = your-credentials
71 | scope = your-scope
72 | [base_url]
73 | base_url = https://developers.sber.ru/...
74 | ```
75 |
76 | 1. Run `s00-prepare.sh` to download tests.
77 | 2. Run `s01-run-all-trains.sh` to evaluate train/dev. (optional)
78 | 3. Run `s02-run-all-tests.sh` to generate test samples.
79 | ### Notes
80 | - you can reuse this framework to check other LLMs: replace `rumed_utils#create_llm_gigachat` with another factory (see the sketch below)
81 | - caching is used to speed up test reruns (see `set_llm_cache` in `rumed_utils`)
82 |
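83 | A minimal sketch of such a swap (illustrative only: `create_llm_openai` is a hypothetical name, and the exact `ChatOpenAI` parameters depend on your langchain version):
84 | 
85 | ```python
86 | from langchain.chat_models import ChatOpenAI  # assumes a langchain version with this import path
87 | 
88 | def create_llm_openai(config):
89 |     # drop-in replacement for rumed_utils.create_llm_gigachat: the scripts only need
90 |     # an object exposing .invoke(message) whose result has a .content attribute
91 |     return ChatOpenAI(model_name='gpt-4', openai_api_key=config['credentials']['credentials'])
92 | ```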
--------------------------------------------------------------------------------
/lb_submissions/SAI/Gigachat/convert_sogma.py:
--------------------------------------------------------------------------------
1 | import json
2 | import sys
3 | from pathlib import Path
4 |
5 | import fire
6 | import xmltodict
7 |
8 | def read_xml_sogma_tasks(xml_content):
9 | # https://geetest.ru/tests/terapiya_(dlya_internov)_sogma_/download
10 | tree = xmltodict.parse(xml_content)
11 | questions = tree['geetest']['test']['questions']['question']
12 | for qu in questions:
13 | task = dict(id=qu['@id'], question=qu['text'])
14 | for an in qu['answers']['answer']:
15 | task[an['@num']] = an['#text']
16 | answers = [an['@num'] for an in qu['answers']['answer'] if an['@isCorrect'] == '1']
17 | if len(answers) != 1:
18 | continue
19 | task['answer'] = answers[0]
20 | yield task
21 |
22 | def main(path_in='sogma-test.xml', path_out='RuMedTest--sogma--dev.jsonl'):
23 | assert path_in.endswith('.xml')
24 | assert path_out.endswith('.jsonl')
25 |
26 | xml_content = Path(path_in).read_text()
27 | tasks = list(read_xml_sogma_tasks(xml_content))
28 | jsonl_content = '\n'.join(json.dumps(task, ensure_ascii=False) for task in tasks)
29 | Path(path_out).write_text(jsonl_content)
30 | print('Done! Converted {0} tasks!'.format(len(tasks)))
31 |
32 | if __name__ == '__main__':
33 | fire.Fire(main)
34 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/Gigachat/requirements.txt:
--------------------------------------------------------------------------------
1 | fire==0.5.0
2 | gigachat==0.1.16
3 | httpx
4 | gigachain
5 | pandas==1.5.3
6 | tqdm==4.66.1
7 | xmltodict==0.13.0
8 | certifi
9 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/Gigachat/rumed_da_net.py:
--------------------------------------------------------------------------------
1 | """Checking RumedDaNet."""
2 |
3 | from rumed_utils import log_answer, parse_element, run_main, wrapped_fire
4 |
5 | PROMPT = """Контекст: {context}
6 | Вопрос: {question}
7 |
8 | Обязательно ответь либо "Да", либо "Нет"."""
9 |
10 | def get_answer_basic(llm, q_dict):
11 | possible_answers = {'да', 'нет'}
12 | input_message = PROMPT.format(**q_dict)
13 | llm_response = llm.invoke(input_message).content
14 | answer = parse_element(llm_response, possible_answers)
15 | true_answer = q_dict.get('answer')
16 | log_answer(q_dict, possible_answers, true_answer, input_message, llm_response, answer)
17 | return answer
18 |
19 | def answer_to_test_output(q_dict, answer):
20 | return {'pairID': q_dict['pairID'], 'answer': answer}
21 |
22 | get_answer_map = {
23 | 'v0': get_answer_basic,
24 | }
25 |
26 | def main(path_in, config_path='config.ini', path_out=None):
27 | run_main(
28 | path_in=path_in,
29 | config_path=config_path,
30 | path_out=path_out,
31 | get_answer_map=get_answer_map,
32 | answer_mode='v0',
33 | answer_field='answer',
34 | answer_to_test_output=answer_to_test_output,
35 | )
36 |
37 | if __name__ == '__main__':
38 | wrapped_fire(main)
39 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/Gigachat/rumed_nli.py:
--------------------------------------------------------------------------------
1 | """Checking RumedDaNet."""
2 |
3 | from langchain.schema import AIMessage, HumanMessage, SystemMessage
4 |
5 | from rumed_utils import log_answer, parse_element, run_main, wrapped_fire
6 |
7 | PROMPT = """Ты доктор и должен пройти экзамен. Тебе даны два утверждения.
8 | Первое -- абсолютно верное и должно быть базой для твоего ответа: "{ru_sentence1}".
9 |
10 | Второе утверждение таково: "{ru_sentence2}".
11 |
12 | Ты должен ответить, чем является второе утверждение в контексте первого:
13 | 1. Следствие
14 | 2. Противоречие
15 | 3. Нейтральность.
16 |
17 | Ответь одним словом:""".strip()
18 |
19 | PROMPT_2_COT = """Ты доктор и должен пройти экзамен. Тебе даны два утверждения.
20 | Первое -- абсолютно верное и должно быть базой для твоего ответа: "{ru_sentence1}".
21 |
22 | Второе утверждение таково: "{ru_sentence2}".
23 |
24 | Вопрос: чем является второе утверждение в контексте первого?
25 | 1. Следствие
26 | 2. Противоречие
27 | 3. Нейтральность.
28 |
29 | Рассуждай шаг за шагом и выбери правильный вариант.
30 | """
31 |
32 | LABELS_MAP = {
33 | 'следствие': 'entailment',
34 | 'противоречие': 'contradiction',
35 | 'нейтральность': 'neutral',
36 | }
37 |
38 | POSSIBLE_ANSWERS = {'следствие', 'противоречие', 'нейтральность'}
39 |
40 | def get_answer_v0(llm, q_dict):
41 | input_message = PROMPT.format(**q_dict)
42 | llm_response = llm.invoke(input_message).content
43 | answer = parse_element(llm_response, POSSIBLE_ANSWERS)
44 |     if answer is None and llm_response.strip():
45 |         fst_word = llm_response.split()[0].strip(',.').lower()  # map bare option numbers back to labels
46 |         if fst_word == '2':
47 |             answer = 'противоречие'
48 |         elif fst_word == '3':
49 |             answer = 'нейтральность'
50 | answer = LABELS_MAP.get(answer)
51 | true_answer = q_dict.get('gold_label')
52 | possible_answers = LABELS_MAP.values()
53 | log_answer(q_dict, possible_answers, true_answer, input_message, llm_response, answer)
54 | return answer
55 |
56 | def get_answer_v1(llm, q_dict):
57 | input_message = PROMPT_2_COT.format(**q_dict)
58 | doctor_answer = llm.invoke(input_message).content
59 | possible_answers_pretty = '{' + ', '.join(POSSIBLE_ANSWERS) + '}'
60 | ru_sentence1 = q_dict['ru_sentence1']
61 | ru_sentence2 = q_dict['ru_sentence2']
62 | moderator_msg = f'''Ниже представлены вопрос из теста, а также ответ на этот вопрос со стороны врача. Врач отвечает развёрнуто. Твоя задача -- понять, какой же именно вариант ответа из {possible_answers_pretty} выбрал врач.
63 | =====
64 | Задача:
65 | Даны два утверждения.
66 | Первое -- абсолютно верное и должно быть базой для твоего ответа: "{ru_sentence1}".
67 |
68 | Второе утверждение таково -- "{ru_sentence2}".
69 |
70 | Ты должен ответить, чем является второе утверждение в контексте первого:
71 | 1. Следствие
72 | 2. Противоречие
73 | 3. Нейтральность.
74 | =====
75 | Ответ врача:
76 | {doctor_answer}
77 | =====
78 | Ответь одним словом из {possible_answers_pretty}:
79 | '''
80 | moderator_answer = llm.invoke(moderator_msg).content
81 | answer = parse_element(moderator_answer, POSSIBLE_ANSWERS)
82 | answer = LABELS_MAP.get(answer)
83 | true_answer = q_dict.get('gold_label')
84 | possible_answers = LABELS_MAP.values()
85 | log_answer(q_dict, possible_answers, true_answer, moderator_msg, moderator_answer, answer)
86 | return answer
87 |
88 | ASK_ATTEMPTS = 3
89 | def get_answer_v2(llm, q_dict):
90 | input_message = PROMPT_2_COT.format(**q_dict)
91 | possible_answers_pretty = '{' + ', '.join(POSSIBLE_ANSWERS) + '}'
92 |
93 | system_msg = SystemMessage(content=input_message)
94 | memory = [system_msg]
95 | for at in range(ASK_ATTEMPTS):
96 | ai_msg = llm.invoke(memory)
97 | text = ai_msg.content
98 | answer = parse_element(text, POSSIBLE_ANSWERS)
99 | answer = LABELS_MAP.get(answer)
100 | if answer:
101 | break
102 | memory.append(ai_msg)
103 | memory.append(HumanMessage(content=f'Ответь одним словом из {possible_answers_pretty}.'))
104 | true_answer = q_dict.get('gold_label')
105 | possible_answers = LABELS_MAP.values()
106 | moderator_msg = input_message
107 | moderator_answer = '{\n' + '\n###\n'.join(msg.content for msg in memory[1:]) + '\n}'
108 | log_answer(q_dict, possible_answers, true_answer, moderator_msg, moderator_answer, answer)
109 | return answer
110 |
111 | get_answer_map = {
112 | 'v0': get_answer_v0,
113 | 'v1': get_answer_v1,
114 | 'v2': get_answer_v2,
115 | }
116 |
117 | def answer_to_test_output(q_dict, answer) -> dict:
118 | return {'pairID': q_dict['pairID'], 'gold_label': answer}
119 |
120 | def main(path_in, config_path='config.ini', answer_mode='v0', path_out=None):
121 | run_main(
122 | path_in=path_in,
123 | config_path=config_path,
124 | path_out=path_out,
125 | get_answer_map=get_answer_map,
126 | answer_mode=answer_mode,
127 | answer_field='gold_label',
128 | answer_to_test_output=answer_to_test_output,
129 | )
130 |
131 | if __name__ == '__main__':
132 | wrapped_fire(main)
133 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/Gigachat/rumed_test.py:
--------------------------------------------------------------------------------
1 | """Checking RuMedTest."""
2 |
3 | import re
4 |
5 | from langchain.schema import AIMessage, HumanMessage, SystemMessage
6 |
7 | from rumed_utils import log_answer, parse_element, run_main, wrapped_fire
8 |
9 | def extract_answer_keys(q_dict):
10 | snums = map(str, range(1, 10))
11 | return [si for si in snums if si in q_dict]
12 |
13 | def make_input_message_v0(q_dict):
14 | answers = extract_answer_keys(q_dict)
15 | prompt = '\n'.join([
16 | 'Выбери номер наиболее корректного утверждения:',
17 | *('%s. {0} {%s}.' % (si, si) for si in answers),
18 | '\nНомер наиболее корректного утверждения:',
19 | ])
20 | input_message = prompt.format(q_dict['question'], *(q_dict[ii] for ii in answers))
21 | return input_message
22 |
23 | def make_input_message_v1(q_dict):
24 | answers = extract_answer_keys(q_dict)
25 | prompt = '\n'.join([
26 | 'Ты врач, сдаёшь медицинский экзамен. Тебе нужно дать правильный ответ на вопрос:\n{0}\n\nВарианты ответа:',
27 | *('%s. {%s}.' % (si, si) for si in answers),
28 | '\nКакой из ответов {0} наиболее корректен? Обязательно ответь одним числом.'.format(', '.join(answers)),
29 | ])
30 | input_message = prompt.format(q_dict['question'], *(q_dict[ii] for ii in answers))
31 | return input_message
32 |
33 | def make_input_message_v2(q_dict):
34 | answers = extract_answer_keys(q_dict)
35 | prompt = '\n'.join([
36 | 'Ты врач, сдаёшь медицинский экзамен. Тебе нужно дать правильный ответ на вопрос:\n{0}\n\nВарианты ответа:',
37 | *('%s. {%s}.' % (si, si) for si in answers),
38 | '\nКакой из ответов {0} наиболее корректен? Ответь и объясни, почему'.format(', '.join(answers)),
39 | ])
40 | input_message = prompt.format(q_dict['question'], *(q_dict[ii] for ii in answers))
41 | return input_message
42 |
43 | def make_input_message_v3(q_dict):
44 | answers = extract_answer_keys(q_dict)
45 | prompt = '\n'.join([
46 | 'Ты врач, сдаёшь медицинский экзамен. Тебе нужно дать правильный ответ на вопрос:\n{0}\n\nВарианты ответа:',
47 | *('%s. {%s}.' % (si, si) for si in answers),
48 | '\nРассуждай шаг за шагом и скажи, какой из ответов {0} наиболее корректен? Помни, что правильный ответ только один!'.format(', '.join(answers)),
49 | ])
50 | input_message = prompt.format(q_dict['question'], *(q_dict[ii] for ii in answers))
51 | return input_message
52 |
53 | def make_input_message_v4(q_dict):
54 | answers = extract_answer_keys(q_dict)
55 | prompt = '\n'.join([
56 | 'Ты сдаёшь тест с одним правильным ответом. Вопрос:\n{0}',
57 | *('%s. {%s}.' % (si, si) for si in answers),
58 | '\nРассуждай шаг за шагом и скажи, какой из ответов {0} правильный? Правильный ответ только один!'.format(', '.join(answers)),
59 | ])
60 | input_message = prompt.format(q_dict['question'], *(q_dict[ii] for ii in answers))
61 | return input_message
62 |
63 | def get_answer_basic(llm, q_dict, message_maker):
64 | possible_answers = extract_answer_keys(q_dict)
65 | input_message = message_maker(q_dict)
66 | llm_response = llm.invoke(input_message).content
67 | answer = parse_element(llm_response, possible_answers)
68 | true_answer = q_dict.get('answer')
69 | log_answer(q_dict, possible_answers, true_answer, input_message, llm_response, answer)
70 | return answer
71 |
72 | def get_answer_via_roles(llm, q_dict, message_maker):
73 | input_message = message_maker(q_dict)
74 | possible_answers = extract_answer_keys(q_dict)
75 | possible_answers_pretty = '{' + ', '.join(possible_answers) + '}'
76 | question = q_dict['question']
77 | variants_fmt = '\n'.join('%s. {%s}' % (si, si) for si in possible_answers)
78 | variants = '\n'.join('{}. {}'.format(ii, q_dict[ii]) for ii in possible_answers)
79 | full_answer = llm.invoke(input_message).content
80 | moderator_msg = f'''Ниже представлены тест в виде вопроса и вариантов ответа, а также ответ на этот вопрос со стороны врача. Врач отвечает развёрнуто. Твоя задача -- понять, какой же именно вариант ответа из {possible_answers_pretty} выбрал врач.
81 | =====
82 | Вопрос:
83 | {question}
84 |
85 | Варианты ответа:
86 | {variants}
87 | =====
88 | Ответ врача:
89 | {full_answer}
90 | =====
91 | Ответь одним числом из {possible_answers_pretty}:
92 | '''
93 | llm_response = llm.invoke(moderator_msg).content
94 | answer = parse_element(llm_response, possible_answers)
95 | true_answer = q_dict.get('answer')
96 | log_answer(q_dict, possible_answers, true_answer, moderator_msg, llm_response, answer)
97 | return answer
98 |
99 | ASK_ATTEMPTS = 5
100 | def get_answer_v5(llm, q_dict):
101 | possible_answers = extract_answer_keys(q_dict)
102 | possible_answers_pretty = '[' + ', '.join(possible_answers) + ']'
103 | input_prompt = '\n'.join([
104 | 'Ты врач, сдаёшь тест. Вопрос:\n{0}',
105 | *('%s. {%s}.' % (si, si) for si in possible_answers),
106 | f'\nРассуждай шаг за шагом и скажи, какой из ответов {possible_answers_pretty} правильный? Если правильных ответов несколько выбери один, самый правдоподобный.',
107 | ])
108 | input_message = input_prompt.format(q_dict['question'], *(q_dict[ii] for ii in possible_answers))
109 | memory = [SystemMessage(content=input_message)]
110 | for at in range(ASK_ATTEMPTS):
111 | ai_msg = llm.invoke(memory)
112 | text = ai_msg.content
113 | answer = parse_element(text, possible_answers)
114 | if answer:
115 | break
116 | memory.append(ai_msg)
117 |             if re.findall(r'^\d+$', ai_msg.content):
118 | parts = ', '.join([f'либо {si}' for si in ai_msg.content])
119 | hc = f'Правильный ответ только один. Остальные неверные. Выбери тот, который тебе кажется наиболее похожим на правду: {parts}'
120 | else:
121 | hc = f'Ответь одним числом из {possible_answers_pretty}. Если ты думаешь, что правильных ответов несколько, выбери один, самый правдоподобный'
122 | memory.append(HumanMessage(content=hc))
123 | true_answer = q_dict.get('answer')
124 | moderator_msg = input_message
125 | moderator_answer = '{\n' + '\n###\n'.join(msg.content for msg in memory[1:]) + '\n}'
126 | log_answer(q_dict, possible_answers, true_answer, moderator_msg, moderator_answer, answer)
127 | return answer
128 |
129 |
130 | VS = 'abcdef'
131 | # VS = 'alpha beta gamma delta epsilon zeta'.split() # works, but slightly worse
132 | ALPHA_MAPPER = dict(zip(map(str, range(1, 7)), VS))
133 | ALPHA_INV_MAPPER = {val:key for key, val in ALPHA_MAPPER.items()}
134 |
135 | def get_answer_v6(llm, q_dict):
136 | possible_answers = extract_answer_keys(q_dict)
137 | pretty_answers = [ALPHA_MAPPER[an] for an in possible_answers]
138 | prompt = '\n'.join([
139 | 'Ты врач, сдаёшь медицинский экзамен. Тебе нужно дать правильный ответ на вопрос:\n{0}\n\nВарианты ответа:',
140 | *('%s. {%s}.' % (ALPHA_MAPPER[si], si) for si in possible_answers),
141 | '\nКакой из ответов {0} наиболее корректен? Обязательно ответь одной буквой.'.format('[' + ', '.join(pretty_answers) + ']'),
142 | ])
143 | input_message = prompt.format(q_dict['question'], *(q_dict[ii] for ii in possible_answers))
144 | memory = [SystemMessage(content=input_message)]
145 | for at in range(ASK_ATTEMPTS):
146 | ai_msg = llm.invoke(memory)
147 | text = ai_msg.content
148 | answer = parse_element(text, pretty_answers)
149 |
150 | memory.append(ai_msg)
151 | if answer:
152 | break
153 | hc = f'Ответь одной буквой из {pretty_answers}. Если ты думаешь, что правильных ответов несколько, выбери один, самый правдоподобный'
154 | memory.append(HumanMessage(content=hc))
155 | answer = ALPHA_INV_MAPPER.get(answer)
156 | true_answer = q_dict.get('answer')
157 | moderator_msg = input_message
158 | moderator_answer = '{\n' + '\n###\n'.join(msg.content for msg in memory[1:]) + '\n}'
159 | log_answer(q_dict, possible_answers, true_answer, input_message, moderator_answer, answer)
160 | return answer
161 |
162 | def get_answer_v7(llm, q_dict):
163 | possible_answers = extract_answer_keys(q_dict)
164 | pretty_answers = [ALPHA_MAPPER[an] for an in possible_answers]
165 | prompt = '\n'.join([
166 | 'Ты врач, сдаёшь медицинский экзамен. Тебе нужно дать правильный ответ на вопрос:\n{0}\n\nВарианты ответа:',
167 | *('%s. {%s}.' % (ALPHA_MAPPER[si], si) for si in possible_answers),
168 | '\nКакой из ответов {0} наиболее корректен? Порассуждай последовательно про каждый из вариантов, но будь краток, в конце дай ответ'.format('[' + ', '.join(pretty_answers) + ']'),
169 | ])
170 | input_message = prompt.format(q_dict['question'], *(q_dict[ii] for ii in possible_answers))
171 | memory = [SystemMessage(content=input_message)]
172 | for at in range(ASK_ATTEMPTS):
173 | ai_msg = llm.invoke(memory)
174 | text = ai_msg.content
175 | answer = parse_element(text, pretty_answers)
176 | memory.append(ai_msg)
177 | if answer:
178 | break
179 | hc = f'Ответь одной буквой из {pretty_answers}. Если ты думаешь, что правильных ответов несколько, выбери один, самый правдоподобный'
180 | memory.append(HumanMessage(content=hc))
181 | answer = ALPHA_INV_MAPPER.get(answer)
182 | true_answer = q_dict.get('answer')
183 | moderator_msg = input_message
184 | moderator_answer = '{\n' + '\n###\n'.join(msg.content for msg in memory[1:]) + '\n}'
185 | log_answer(q_dict, possible_answers, true_answer, input_message, moderator_answer, answer)
186 | return answer
187 |
188 | get_answer_map = {
189 | 'v0': lambda llm, q_dict: get_answer_basic(llm, q_dict, make_input_message_v0),
190 | 'v1': lambda llm, q_dict: get_answer_basic(llm, q_dict, make_input_message_v1),
191 | 'v2': lambda llm, q_dict: get_answer_via_roles(llm, q_dict, make_input_message_v2),
192 | 'v3': lambda llm, q_dict: get_answer_via_roles(llm, q_dict, make_input_message_v3),
193 | 'v4': lambda llm, q_dict: get_answer_via_roles(llm, q_dict, make_input_message_v4),
194 | 'v5': get_answer_v5,
195 | 'v6': get_answer_v6,
196 | 'v7': get_answer_v7,
197 | }
198 |
199 | def answer_to_test_output(q_dict, answer):
200 | return {'idx': q_dict['idx'], 'answer': answer}
201 |
202 | def main(path_in, config_path='config.ini', path_out=None, answer_mode='v1'):
203 | run_main(
204 | path_in=path_in,
205 | config_path=config_path,
206 | path_out=path_out,
207 | get_answer_map=get_answer_map,
208 | answer_mode=answer_mode,
209 | answer_field='answer',
210 | answer_to_test_output=answer_to_test_output,
211 | )
212 |
213 | if __name__ == '__main__':
214 | wrapped_fire(main)
215 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/Gigachat/rumed_utils.py:
--------------------------------------------------------------------------------
1 | import configparser
2 | import json
3 | import logging
4 | import time
5 | from datetime import datetime as dt
6 | from logging.config import dictConfig
7 | from pathlib import Path
8 |
9 | import fire
10 | import httpx
11 | from gigachat.exceptions import GigaChatException
12 | from langchain.cache import SQLiteCache
13 | from langchain.chat_models.gigachat import GigaChat
14 | from langchain.globals import set_llm_cache
15 | from tqdm import tqdm
16 |
17 | def extract_tags(path_in: Path):
18 | tags = []
19 | if path_in.parent.name.startswith('Ru'):
20 | tags.append(path_in.parent.name)
21 | tags.append(path_in.stem)
22 | return tags
23 |
24 | def init_logging(path_in, answer_mode):
25 | path_in = Path(path_in)
26 | Path('./logs').mkdir(exist_ok=True)
27 | dt_now_pretty = dt.strftime(dt.now(), '%Y-%m-%d--%H-%M-%S')
28 | tags = [dt_now_pretty] + extract_tags(path_in) + [answer_mode]
29 | filename_log = './logs/{0}.log'.format('--'.join(tags))
30 |
31 | logging_config = {
32 | 'version': 1,
33 | 'handlers': {
34 | 'file_handler': {
35 | 'class': 'logging.FileHandler',
36 | 'filename': filename_log,
37 | 'level': 'DEBUG',
38 | 'formatter': 'standard',
39 | },
40 | 'benchmarks_file_handler': {
41 | 'class': 'logging.FileHandler',
42 | 'filename': 'benchmarks.log',
43 | 'level': 'INFO',
44 | 'formatter': 'standard',
45 | },
46 | 'stream_handler': {
47 | 'class': 'logging.StreamHandler',
48 | 'level': 'WARNING',
49 | 'formatter': 'standard',
50 | },
51 | },
52 | 'formatters': {
53 | 'standard': {
54 | 'format': '%(asctime)s [%(levelname)s] %(message)s',
55 | 'datefmt': '%Y-%m-%d--%H-%M-%S',
56 | },
57 | },
58 | 'loggers': {
59 | 'root': {
60 | 'level': 'DEBUG',
61 | 'handlers': ['file_handler', 'stream_handler'],
62 | },
63 | 'benchmarks': {
64 | 'level': 'INFO',
65 | 'handlers': ['benchmarks_file_handler'],
66 | }
67 | }
68 | }
69 | dictConfig(logging_config)
70 | logging.warning('Logs_path: %s', filename_log)
71 |
72 | GIGACHAT_MODEL = 'GigaChat-Pro'
73 |
74 | def create_llm_gigachat(config):
75 | credentials = dict(config['credentials'])
76 | base_url = config['base_url']['base_url']
77 | logging.info('credentials: %s', credentials)
78 | logging.info('base_url: %s', base_url)
79 |
80 | # https://python.langchain.com/docs/modules/model_io/llms/llm_caching
81 | user = config['credentials']['user']
82 | database_path = "{0}.langchain.db".format(user)
83 | set_llm_cache(SQLiteCache(database_path=database_path))
84 |
85 | return GigaChat(
86 | verify_ssl_certs=False,
87 | profanity_check=False,
88 | model=GIGACHAT_MODEL,
89 | base_url=base_url,
90 | **credentials,
91 | )
92 |
93 | def parse_element(answer, elements):
94 |     fst_word = answer.split()[0].strip(',.').lower() if answer.split() else ''  # guard against empty responses
95 |     if fst_word in elements:
96 |         return fst_word
97 |     return None
98 |
99 | def format_accuracy(correct, total):
100 | return '{0:.2f} %'.format(correct / total * 100)
101 |
102 | ATTEMPTS = 10
103 | WAIT_SECONDS = 6
104 | def repeater(callback, skip_ex):
105 | def wrapped_callback(*args, **kwargs):
106 | wait_s = WAIT_SECONDS
107 | for at in range(1, ATTEMPTS + 1):
108 | try:
109 | return callback(*args, **kwargs)
110 | except Exception as ex:
111 | if skip_ex and isinstance(ex, skip_ex):
112 | logging.warning('Failed to execute: %s, attempt=%d', ex, at)
113 | time.sleep(wait_s)
114 | wait_s *= 2
115 | if at == ATTEMPTS:
116 | logging.exception('Attempts out...')
117 | raise ex
118 | else:
119 | raise ex
120 | return wrapped_callback
121 |
122 | def init_llm(config_path):
123 | config = configparser.ConfigParser()
124 | config.read(config_path)
125 | return create_llm_gigachat(config)
126 |
127 | def read_json_tasks(path_in):
128 | tasks = [json.loads(line) for line in path_in.read_text().strip().splitlines()]
129 | return tasks
130 |
131 | CONNECTION_EXCEPTIONS = (GigaChatException, httpx.HTTPError, json.decoder.JSONDecodeError)
132 |
133 | def benchmark_check(path_in, llm, tasks, get_answer, answer_field, tags=None):
134 | logging.warning('Path_in: %s', path_in)
135 | logging.warning('Tasks: %d', len(tasks))
136 | correct_total = 0
137 | pbar = tqdm(range(len(tasks)))
138 | w_get_answer = repeater(get_answer, skip_ex=CONNECTION_EXCEPTIONS)
139 | for ti in pbar:
140 | td = tasks[ti]
141 | answer = w_get_answer(llm, td)
142 | true_answer = td[answer_field]
143 | check = answer == true_answer
144 | correct_total += check
145 | acc = format_accuracy(correct_total, ti + 1)
146 | pbar.set_description('acc: {0}'.format(acc))
147 | if ti % 10 == 0:
148 | logging.info('index=%d, acc: %s', ti, acc)
149 | tags = extract_tags(path_in) + (tags or [])
150 | b_info = 'Tags: {0}, final accuracy: {1}'.format(' '.join(tags), format_accuracy(correct_total, len(tasks)))
151 | logging.getLogger('benchmarks').info(b_info)
152 | logging.warning('Done! %s', b_info)
153 |
154 | def benchmark_test(llm, tasks, path_out, get_answer, answer_to_test_output):
155 | logging.warning('Tasks: %d', len(tasks))
156 | lines = []
157 | pbar = tqdm(range(len(tasks)))
158 | w_get_answer = repeater(get_answer, skip_ex=CONNECTION_EXCEPTIONS)
159 | for ti in pbar:
160 | td = tasks[ti]
161 | answer = w_get_answer(llm, td)
162 | answer_output = answer_to_test_output(td, answer)
163 | lines.append(json.dumps(answer_output, ensure_ascii=False))
164 | Path(path_out).write_text('\n'.join(lines))
165 | logging.info('Done! Saved to %s', path_out)
166 |
167 | def log_answer(q_dict, possible_answers, true_answer, input_message, llm_response, answer):
168 | log_callback = logging.debug if (answer is not None) else logging.warning
169 |     # dev/train tasks carry a gold answer; test tasks do not
170 | if true_answer is not None:
171 | check = answer == true_answer
172 | log_callback('input_message: %s\nllm_response: %s\nanswer: %s\ntrue_answer: %s\ncorrect: %s', input_message, llm_response, answer, true_answer, check)
173 | else:
174 | log_callback('input_message: %s\nllm_response: %s\nanswer: %s', input_message, llm_response, answer)
175 |
176 | if answer not in possible_answers:
177 | logging.warning('Expected answer `{0}` not in possible answers `[{1}]`, q_dict: {2}'.format(answer, ', '.join(possible_answers), q_dict))
178 |
179 | def choose_get_answer(get_answer_map, answer_mode):
180 | get_answer = get_answer_map.get(answer_mode)
181 | if get_answer is None:
182 | raise ValueError('Supported answer versions: {0}, found: {1}'.format(list(get_answer_map.keys()), answer_mode))
183 | return get_answer
184 |
185 | def run_main(path_in, config_path, path_out, get_answer_map, answer_mode, answer_field, answer_to_test_output):
186 | get_answer = choose_get_answer(get_answer_map, answer_mode)
187 | path_in = Path(path_in)
188 | if not path_in.exists():
189 |         raise ValueError('`path_in`=`{0}` does not exist!'.format(path_in))
190 | init_logging(path_in, answer_mode)
191 | logging.warning('Answer mode: {0}'.format(answer_mode))
192 | tags = [answer_mode]
193 |
194 | llm = init_llm(config_path)
195 | tasks = read_json_tasks(path_in)
196 | if any(sub in path_in.stem for sub in ('dev', 'train')):
197 | benchmark_check(path_in, llm, tasks, get_answer, answer_field, tags=tags)
198 | elif 'test' in path_in.stem:
199 | if path_out is None:
200 | raise ValueError('`path_out` should be passed')
201 | benchmark_test(llm, tasks, path_out, get_answer, answer_to_test_output)
202 | else:
203 |         raise ValueError('Cannot recognize mode, expected `dev`, `train` or `test` in `path_in`')
204 |
205 | def wrapped_fire(main):
206 | try:
207 | fire.Fire(main)
208 | except KeyboardInterrupt:
209 | logging.warning('Cancelled!')
210 | raise KeyboardInterrupt('Cancelled!')
211 |
212 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/Gigachat/s00-prepare.sh:
--------------------------------------------------------------------------------
1 | set -e
2 | sogma_xml_file="sogma-test.xml"
3 | sogma_jsonl_file="RuMedTest--sogma--dev.jsonl"
4 | medbench_url="https://medbench.ru/files/MedBench_data.zip"
5 | medbench_dir="../../../data"
6 | medbench_zip_path="$medbench_dir/MedBench_data.zip"
7 |
8 | if [ ! -e "$sogma_jsonl_file" ]; then
9 | echo "$sogma_jsonl_file does not exist. Downloading..."
10 | wget -nc "https://geetest.ru/content/files/terapiya_(dlya_internov)_sogma_.xml" -O "$sogma_xml_file"
11 | echo "Download complete."
12 | python convert_sogma.py --path-in="$sogma_xml_file" --path-out="$sogma_jsonl_file"
13 | rm -f "$sogma_xml_file"
14 | fi
15 |
16 | if [ ! -e $medbench_dir ]; then
17 | echo "$medbench_dir folder does not exist. Downloading and extracting..."
18 | mkdir $medbench_dir
19 | wget -nc "$medbench_url" -O "$medbench_zip_path"
20 | unzip "$medbench_zip_path" -d "$medbench_dir"
21 | rm -f "$medbench_zip_path"
22 | echo "Download and extraction complete."
23 | fi
24 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/Gigachat/s01-run-all-trains.sh:
--------------------------------------------------------------------------------
1 | rm -f benchmarks.log
2 |
3 | python rumed_nli.py --answer-mode='v0' --path-in='MedBench/RuMedNLI/dev.jsonl'
4 | python rumed_nli.py --answer-mode='v1' --path-in='MedBench/RuMedNLI/dev.jsonl'
5 | python rumed_nli.py --answer-mode='v2' --path-in='MedBench/RuMedNLI/dev.jsonl'
6 |
7 | python rumed_test.py --path-in="RuMedTest--sogma--dev.jsonl" --answer-mode=v0
8 | python rumed_test.py --path-in="RuMedTest--sogma--dev.jsonl" --answer-mode=v1
9 | python rumed_test.py --path-in="RuMedTest--sogma--dev.jsonl" --answer-mode=v2
10 | python rumed_test.py --path-in="RuMedTest--sogma--dev.jsonl" --answer-mode=v3
11 | python rumed_test.py --path-in="RuMedTest--sogma--dev.jsonl" --answer-mode=v4
12 | python rumed_test.py --path-in="RuMedTest--sogma--dev.jsonl" --answer-mode=v5
13 | python rumed_test.py --path-in="RuMedTest--sogma--dev.jsonl" --answer-mode=v6
14 |
15 | python rumed_da_net.py --path-in='MedBench/RuMedDaNet/dev.jsonl'
16 | # python rumed_da_net.py --path-in='MedBench/RuMedDaNet/train.jsonl'
17 |
18 | echo 'Benchmarks:'
19 | cat benchmarks.log
20 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/Gigachat/s02-run-all-tests.sh:
--------------------------------------------------------------------------------
1 | mkdir -p 'out'
2 |
3 | # python rumed_nli.py --path-in='MedBench/RuMedNLI/test.jsonl' --path-out='out/medbench--rumednli--v0.jsonl' --answer-mode='v0'
4 | # python rumed_nli.py --path-in='MedBench/RuMedNLI/test.jsonl' --path-out='out/medbench--rumednli--v1.jsonl' --answer-mode='v1'
5 | python rumed_nli.py --path-in='../../../data/RuMedNLI/test.jsonl' --path-out='out/RuMedNLI.jsonl' --answer-mode='v2'
6 |
7 | # python rumed_test.py --path-in='MedBench/RuMedTest/test.jsonl' --path-out='out/medbench--rumedtest--v2.jsonl' --answer-mode=v2
8 | # python rumed_test.py --path-in='MedBench/RuMedTest/test.jsonl' --path-out='out/medbench--rumedtest--v3.jsonl' --answer-mode=v3
9 | # python rumed_test.py --path-in='MedBench/RuMedTest/test.jsonl' --path-out='out/medbench--rumedtest--v4.jsonl' --answer-mode=v4
10 | python rumed_test.py --path-in='../../../data/RuMedTest/test.jsonl' --path-out='out/RuMedTest.jsonl' --answer-mode=v6
11 |
12 | python rumed_da_net.py --path-in='../../../data/RuMedDaNet/test.jsonl' --path-out='out/RuMedDaNet.jsonl'
13 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/Human/README.md:
--------------------------------------------------------------------------------
1 | To estimate the human level on the proposed tasks, the following procedures were carried out:
2 | 
3 | - **RuMedDaNet** The examples from the private test part were split among several assessors (without special medical education) with no overlap, so each example was solved by exactly one participant;
4 | 
5 | - **RuMedNLI** The answers were obtained via a procedure analogous to the RuMedDaNet task, with the only difference that the assessors were specialists with a medical education;
6 | 
7 | - **RuMedTest** The score for this task is the consensus of the teaching community of higher medical schools regarding the minimum required level of training of a general practitioner.
8 | 
9 | - **ECG2Pathology** Initially, the signals were annotated by cardiologists as follows: the specialists were divided into three groups (1000 signals per group), which assessed the signals according to the [thesaurus](https://ecg.ru/thesaurus). Each group consisted of three cardiologist-annotators, who labeled the signals, and a cardiologist-validator, who gave the final assessment based on their own opinion and the annotators' opinions. From this procedure the human baseline was computed as the macro F1 score of each cardiologist-annotator relative to their validator. The F1 averaging was done in two stages: first over the 73 classes, then over the annotators (see the sketch below).
10 | 
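11 | An illustrative numpy sketch of the two-stage averaging (toy data; the shape of `f1` is an assumption):
12 | 
13 | ```python
14 | import numpy as np
15 | 
16 | # f1[annotator, class]: per-class F1 of one cardiologist-annotator vs the validator
17 | f1 = np.random.rand(9, 73)             # e.g. 3 groups x 3 annotators, 73 classes
18 | per_annotator = f1.mean(axis=1)        # stage 1: average over the 73 classes
19 | human_baseline = per_annotator.mean()  # stage 2: average over the annotators
20 | ```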
--------------------------------------------------------------------------------
/lb_submissions/SAI/Naive/README.md:
--------------------------------------------------------------------------------
1 | The naive solution uses the most frequent (or a random) label as the answer:
2 | 
3 | - **RuMedDaNet** - the answer to every question is always "да" (yes);
4 | 
5 | - **RuMedNLI** - the answer is always "neutral";
6 | 
7 | - **RuMedTest** - the answer is always the first option.
8 | 
9 | - **ECG2Pathology** - the answer is always the most frequent class ("Нормальное положение ЭОС", i.e. normal electrical axis of the heart).
10 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/Naive/sample_submission.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pavel-blinov/RuMedBench/dd56a60faf55994ba402615c4067f0652ea0d2bb/lb_submissions/SAI/Naive/sample_submission.zip
--------------------------------------------------------------------------------
/lb_submissions/SAI/RNN/README.md:
--------------------------------------------------------------------------------
1 | The solution is based on a model from the RNN family:
2 |
3 | - **RuMedDaNet** / **RuMedNLI** - we use a two-layer BiLSTM model with 300-dimensional word embeddings;
4 |
5 | - **RuMedTest** - we use the model trained on the RuMedNLI task to obtain a matrix of question embeddings and 4 matrices, one per answer option. The answer is chosen by the maximum cosine similarity between the question and answer vectors (see the sketch below).
6 |
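In sketch form, the sentence-embedding step reused for RuMedTest (assuming `model` is the trained `Classifier` from `double_text_classifier.py` and `token_ids` is a padded index tensor as built by `DataPreprocessor`):

```python
import torch

def sentence_vector(model, token_ids):  # token_ids: LongTensor of shape (seq_len,)
    with torch.no_grad():
        emb = model.embedding_layer(token_ids[None, :])  # (1, seq_len, emb_dim)
        _, (hidden, _) = model.lstm_layer(emb)           # (layers*2, 1, hidden_dim)
        # concatenate the last layer's forward/backward hidden states
        return torch.cat([hidden[-2], hidden[-1]], dim=1)[0]  # (hidden_dim*2,)
```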
7 | ### To run the code
8 |
9 | `./run.sh`
10 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/RNN/double_text_classifier.py:
--------------------------------------------------------------------------------
1 | from collections import defaultdict
2 | import json
3 | import pathlib
4 |
5 | import joblib
6 | import click
7 | import numpy as np
8 | import pandas as pd
9 | from sklearn.metrics import accuracy_score
10 | import torch
11 | from torch import nn
12 | from torch.utils.data import DataLoader
13 | from tqdm import tqdm
14 |
15 | from utils import preprocess, seed_everything, seed_worker, DataPreprocessor
16 |
17 | SEED = 101
18 | seed_everything(SEED)
19 | class Classifier(nn.Module):
20 |
21 | def __init__(self, n_classes, vocab_size, emb_dim=300, hidden_dim=256):
22 |
23 | super().__init__()
24 |
25 | self.emb_dim = emb_dim
26 | self.hidden_dim = hidden_dim
27 |
28 | self.embedding_layer = nn.Embedding(vocab_size, self.emb_dim)
29 | self.lstm_layer = nn.LSTM(self.emb_dim, self.hidden_dim, batch_first=True, num_layers=2,
30 | bidirectional=True)
31 | self.linear_layer = nn.Linear(self.hidden_dim * 2, n_classes)
32 |
33 | def forward(self, x):
34 | x = self.embedding_layer(x)
35 | _, (hidden, _) = self.lstm_layer(x)
36 | hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)  # last layer's forward/backward states
37 | return self.linear_layer(hidden)
38 |
39 |
40 | def preprocess_two_seqs(text1, text2, seq_len):
41 | seq1_len = int(seq_len * 0.75)
42 | seq2_len = seq_len - seq1_len
43 |
44 | tokens1 = preprocess(text1)[:seq1_len]
45 | tokens2 = preprocess(text2)[:seq2_len]
46 |
47 | return tokens1 + tokens2
48 |
49 |
50 | def build_vocab(text_data, min_freq=1):
51 | word2freq = defaultdict(int)
52 | word2index = {'PAD': 0, 'UNK': 1}
53 |
54 | for text in text_data:
55 | for token in text:
56 | word2freq[token] += 1
57 |
58 | for word, freq in word2freq.items():
59 | if freq > min_freq:
60 | word2index[word] = len(word2index)
61 | return word2index
62 |
63 |
64 | def train_step(data, model, optimizer, criterion, device, losses, epoch):
65 |
66 | model.train()
67 |
68 | pbar = tqdm(total=len(data.dataset), desc=f'Epoch: {epoch + 1}')
69 |
70 | for x, y in data:
71 |
72 | x = x.to(device)
73 | y = y.to(device)
74 |
75 | optimizer.zero_grad()
76 | pred = model(x)
77 |
78 | loss = criterion(pred, y)
79 |
80 | loss.backward()
81 | optimizer.step()
82 |
83 | losses.append(loss.item())
84 |
85 | pbar.set_postfix(train_loss=np.mean(losses[-100:]))
86 | pbar.update(x.shape[0])
87 |
88 | pbar.close()
89 |
90 | return losses
91 |
92 | def eval_step(data, model, criterion, device, mode='dev'):
93 |
94 | test_losses = []
95 | test_preds = []
96 | test_true = []
97 |
98 | pbar = tqdm(total=len(data.dataset), desc=f'Predictions on {mode} set')
99 |
100 | model.eval()
101 |
102 | for x, y in data:
103 |
104 | x = x.to(device)
105 | y = y.to(device)
106 |
107 | with torch.no_grad():
108 |
109 | pred = model(x)
110 |
111 | loss = criterion(pred, y)
112 | test_losses.append(loss.item())
113 |
114 | test_preds.append(torch.argmax(pred, dim=1).cpu().numpy())
115 | test_true.append(y.cpu().numpy())
116 |
117 | pbar.update(x.shape[0])
118 | pbar.close()
119 |
120 | test_preds = np.concatenate(test_preds)
121 |
122 | if mode == 'dev':
123 | test_true = np.concatenate(test_true)
124 | mean_test_loss = np.mean(test_losses)
125 | accuracy = round(accuracy_score(test_true, test_preds) * 100, 2)
126 | return mean_test_loss, accuracy
127 |
128 | else:
129 | return test_preds
130 |
131 |
132 | def train(train_data, dev_data, model, optimizer, criterion, device, n_epochs=50, max_patience=3):
133 |
134 | losses = []
135 | best_accuracy = 0.
136 |
137 | patience = 0
138 | best_test_loss = 10.
139 |
140 | for epoch in range(n_epochs):
141 |
142 | losses = train_step(train_data, model, optimizer, criterion, device, losses, epoch)
143 | mean_dev_loss, accuracy = eval_step(dev_data, model, criterion, device)
144 |
145 | if accuracy > best_accuracy:
146 | best_accuracy = accuracy
147 |
148 | print(f'\nDev loss: {mean_dev_loss} \naccuracy: {accuracy}')
149 |
150 | if mean_dev_loss < best_test_loss:
151 | best_test_loss = mean_dev_loss; patience = 0  # reset the counter once dev loss improves
152 | elif patience == max_patience:
153 | print(f'Dev loss did not improve in {patience} epochs, early stopping')
154 | break
155 | else:
156 | patience += 1
157 | return best_accuracy
158 |
159 |
160 | @click.command()
161 | @click.option('--task-name',
162 | default='RuMedNLI',
163 | type=click.Choice(['RuMedDaNet', 'RuMedNLI']),
164 | help='The name of the task to run.')
165 | @click.option('--device',
166 | default=-1,
167 | help='Gpu to train the model on.')
168 | @click.option('--seq-len',
169 | default=256,
170 | help='Max sequence length.')
171 | @click.option('--data-path',
172 | default='../../../MedBench_data/',
173 | help='Path to the data files.')
174 | def main(task_name, data_path, device, seq_len):
175 | print(f'\n{task_name} task')
176 |
177 | out_path = pathlib.Path('.').absolute()
178 | data_path = pathlib.Path(data_path).absolute() / task_name
179 |
180 | train_data = pd.read_json(data_path / 'train.jsonl', lines=True)
181 | dev_data = pd.read_json(data_path / 'dev.jsonl', lines=True)
182 | test_data = pd.read_json(data_path / 'test.jsonl', lines=True)
183 |
184 | index_id = 'pairID'
185 | if task_name == 'RuMedNLI':
186 | l2i = {'neutral': 0, 'entailment': 1, 'contradiction': 2}
187 | text1_id = 'ru_sentence1'
188 | text2_id = 'ru_sentence2'
189 | label_id = 'gold_label'
190 |
191 | elif task_name == 'RuMedDaNet':
192 | l2i = {'нет': 0, 'да': 1}
193 | text1_id = 'context'
194 | text2_id = 'question'
195 | label_id = 'answer'
196 | else:
197 | raise ValueError('unknown task')
198 |
199 | i2l = {i: label for label, i in l2i.items()}
200 |
201 | text_data_train = [preprocess_two_seqs(text1, text2, seq_len) for text1, text2 in \
202 | zip(train_data[text1_id], train_data[text2_id])]
203 | text_data_dev = [preprocess_two_seqs(text1, text2, seq_len) for text1, text2 in \
204 | zip(dev_data[text1_id], dev_data[text2_id])]
205 | text_data_test = [preprocess_two_seqs(text1, text2, seq_len) for text1, text2 in \
206 | zip(test_data[text1_id], test_data[text2_id])]
207 |
208 | word2index = build_vocab(text_data_train, min_freq=0)
209 | print(f'Total: {len(word2index)} tokens')
210 |
211 | train_dataset = DataPreprocessor(text_data_train, train_data[label_id], word2index, l2i, \
212 | sequence_length=seq_len, preprocessing=False)
213 | dev_dataset = DataPreprocessor(text_data_dev, dev_data[label_id], word2index, l2i, \
214 | sequence_length=seq_len, preprocessing=False)
215 | test_dataset = DataPreprocessor(text_data_test, None, word2index, l2i, \
216 | sequence_length=seq_len, preprocessing=False)
217 |
218 | gen = torch.Generator()
219 | gen.manual_seed(SEED)
220 | train_dataset = DataLoader(train_dataset, batch_size=64, worker_init_fn=seed_worker, generator=gen)
221 | dev_dataset = DataLoader(dev_dataset, batch_size=64, worker_init_fn=seed_worker, generator=gen)
222 | test_dataset = DataLoader(test_dataset, batch_size=64, worker_init_fn=seed_worker, generator=gen)
223 |
224 | if device == -1:
225 | device = torch.device('cpu')
226 | else:
227 | device = torch.device(device)
228 |
229 | model = Classifier(n_classes=len(l2i), vocab_size=len(word2index))
230 | criterion = nn.CrossEntropyLoss()
231 | optimizer = torch.optim.Adam(params=model.parameters())
232 |
233 | model = model.to(device)
234 | criterion = criterion.to(device)
235 |
236 | accuracy = train(train_dataset, dev_dataset, model, optimizer, criterion, device)
237 | print(f'\n{task_name} task score on dev set: {accuracy}')
238 |
239 | test_preds = eval_step(test_dataset, model, criterion, device, mode='test')
240 | if task_name == 'RuMedNLI':
241 | torch.save(model.state_dict(), 'model.bin')
242 | joblib.dump(word2index, 'word2index.pkl')
243 | joblib.dump(l2i, 'l2i.pkl')
244 |
245 | recs = []
246 | for i, pred in zip(test_data[index_id], test_preds):
247 | recs.append({index_id: i, label_id: i2l[pred]})
248 |
249 | out_fname = out_path / f'{task_name}.jsonl'
250 | with open(out_fname, 'w') as fw:
251 | for rec in recs:
252 | json.dump(rec, fw, ensure_ascii=False)
253 | fw.write('\n')
254 |
255 |
256 | if __name__ == '__main__':
257 | main()
--------------------------------------------------------------------------------
/lb_submissions/SAI/RNN/rnn.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pavel-blinov/RuMedBench/dd56a60faf55994ba402615c4067f0652ea0d2bb/lb_submissions/SAI/RNN/rnn.zip
--------------------------------------------------------------------------------
/lb_submissions/SAI/RNN/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | python -u double_text_classifier.py --task-name 'RuMedDaNet' --device 2
4 | python -u double_text_classifier.py --task-name 'RuMedNLI' --device 2
5 | python -u test_solver.py --task-name 'RuMedTest'
6 |
7 | zip -m rnn.zip RuMedDaNet.jsonl RuMedNLI.jsonl RuMedTest.jsonl
8 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/RNN/test_solver.py:
--------------------------------------------------------------------------------
1 | import json
2 | import pathlib
3 |
4 | import click
5 | import numpy as np
6 | import pandas as pd
7 |
8 | import torch
9 | import joblib
10 | from utils import preprocess, DataPreprocessor
11 | from double_text_classifier import Classifier
12 | from sklearn.metrics.pairwise import cosine_similarity
13 |
14 | seq_len = 256
15 |
16 | @click.command()
17 | @click.option('--task-name',
18 | default='RuMedTest',
19 | type=click.Choice(['RuMedTest']),
20 | help='The name of the task to run.')
21 | @click.option('--data-path',
22 | default='../../../MedBench_data/',
23 | help='Path to the data files.')
24 | def main(task_name, data_path):
25 | print(f'\n{task_name} task')
26 |
27 | out_path = pathlib.Path('.').absolute()
28 | data_path = pathlib.Path(data_path).absolute() / task_name
29 |
30 | test_data = pd.read_json(data_path / 'test.jsonl', lines=True)
31 |
32 | index_id = 'idx'
33 | if task_name == 'RuMedTest':
34 | options = ['1', '2', '3', '4']
35 | question_id = 'question'
36 | label_id = 'answer'
37 | else:
38 | raise ValueError('unknown task')
39 |
40 | word2index = joblib.load('word2index.pkl')
41 | l2i = joblib.load('l2i.pkl')
42 |
43 | model = Classifier(n_classes=len(l2i), vocab_size=len(word2index))
44 | model.load_state_dict(torch.load('model.bin'))
45 | model.eval()
46 |
47 | text_data_test = [preprocess(text1) for text1 in test_data[question_id]]
48 |
49 | test_dataset = DataPreprocessor(text_data_test, None, word2index, l2i, \
50 | sequence_length=seq_len, preprocessing=False)
51 |
52 | q_vecs = []
53 | for x, _ in test_dataset:
54 | with torch.no_grad():
55 | x = model.embedding_layer(x[None, :])
56 | _, (hidden, _) = model.lstm_layer(x)
57 | hidden = torch.cat([hidden[-2], hidden[-1]], dim=1).detach().cpu().numpy()  # last layer's states
58 | q_vecs.append(hidden[0])
59 | q_vecs = np.array(q_vecs)
60 |
61 | sims = []
62 | for option in options:
63 | text_data_test = [preprocess(text1) for text1 in test_data[option]]
64 | test_dataset = DataPreprocessor(text_data_test, None, word2index, l2i, \
65 | sequence_length=seq_len, preprocessing=False)
66 |
67 | option_vecs = []
68 | for x, _ in test_dataset:
69 | with torch.no_grad():
70 | x = model.embedding_layer(x[None, :])
71 | _, (hidden, _) = model.lstm_layer(x)
72 | hidden = torch.cat([hidden[-2], hidden[-1]], dim=1).detach().cpu().numpy()  # last layer's states
73 | option_vecs.append(hidden[0])
74 | option_vecs = np.array(option_vecs)
75 |
76 | sim = cosine_similarity(q_vecs, option_vecs).diagonal()
77 | sims.append(sim)
78 | sims = np.array(sims).T
79 |
80 | recs = []
81 | for i, pred in zip(test_data[index_id], sims):
82 | recs.append( { index_id: i, label_id: str(1+np.argmax(pred)) } )
83 |
84 | out_fname = out_path / f'{task_name}.jsonl'
85 | with open(out_fname, 'w') as fw:
86 | for rec in recs:
87 | json.dump(rec, fw, ensure_ascii=False)
88 | fw.write('\n')
89 |
90 |
91 | if __name__ == '__main__':
92 | main()
93 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/RNN/utils.py:
--------------------------------------------------------------------------------
1 | import os
2 | from string import punctuation
3 | import random
4 |
5 | from nltk.tokenize import ToktokTokenizer
6 | import numpy as np
7 | import pandas as pd
8 | import torch
9 | from torch.utils.data import Dataset
10 |
11 | from typing import List, Dict, Union, Tuple, Set, Any
12 |
13 | TOKENIZER = ToktokTokenizer()
14 |
15 |
16 | def seed_everything(seed):
17 | os.environ['PYTHONHASHSEED'] = str(seed)
18 | os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
19 | np.random.seed(seed)
20 | random.seed(seed)
21 | torch.manual_seed(seed)
22 | torch.cuda.manual_seed_all(seed)
23 | torch.cuda.manual_seed(seed)
24 | torch.backends.cudnn.deterministic = True
25 | torch.backends.cudnn.benchmark = False
26 |
27 |
28 | def seed_worker(worker_id):
29 | worker_seed = torch.initial_seed() % 2**32
30 | np.random.seed(worker_seed)
31 | random.seed(worker_seed)
32 |
33 |
34 | def preprocess(text, tokenizer=TOKENIZER):
35 | res = []
36 | tokens = tokenizer.tokenize(text.lower())
37 | for t in tokens:
38 | if t not in punctuation:
39 | res.append(t.strip(punctuation))
40 | return res
41 |
42 |
43 | class DataPreprocessor(Dataset):
44 |
45 | def __init__(self, x_data, y_data, word2index, label2index,
46 | sequence_length=128, pad_token='PAD', unk_token='UNK', preprocessing=True):
47 |
48 | super().__init__()
49 |
50 | self.x_data = []
51 | self.y_data = len(x_data)*[list(label2index.values())[0]]
52 | if y_data is not None:
53 | self.y_data = y_data.map(label2index)
54 |
55 | self.word2index = word2index
56 | self.sequence_length = sequence_length
57 |
58 | self.pad_token = pad_token
59 | self.unk_token = unk_token
60 | self.pad_index = self.word2index[self.pad_token]
61 |
62 | self.preprocessing = preprocessing
63 |
64 | self.load(x_data)
65 |
66 | def load(self, data):
67 |
68 | for text in data:
69 | if self.preprocessing:
70 | words = preprocess(text)
71 | else:
72 | words = text
73 | indexed_words = self.indexing(words)
74 | self.x_data.append(indexed_words)
75 |
76 | def indexing(self, tokenized_text):
77 | unk_index = self.word2index[self.unk_token]
78 | return [self.word2index.get(token, unk_index) for token in tokenized_text]
79 |
80 | def padding(self, sequence):
81 | sequence = sequence + [self.pad_index] * (max(self.sequence_length - len(sequence), 0))
82 | return sequence[:self.sequence_length]
83 |
84 | def __len__(self):
85 | return len(self.x_data)
86 |
87 | def __getitem__(self, idx):
88 | x = self.x_data[idx]
89 | x = self.padding(x)
90 | x = torch.Tensor(x).long()
91 |
92 | if self.y_data is None:
93 | y = None
94 | else:
95 | y = self.y_data[idx]
96 |
97 | return x, y
98 |
99 |
100 | def preprocess_for_tokens(
101 | tokens: List[str]
102 | ) -> List[str]:
103 | # identity hook: the NER inputs are already tokenized
104 | return tokens
105 |
106 | class DataPreprocessorNer(Dataset):
107 |
108 | def __init__(
109 | self,
110 | x_data: pd.Series,
111 | y_data: pd.Series,
112 | word2index: Dict[str, int],
113 | label2index: Dict[str, int],
114 | sequence_length: int = 128,
115 | pad_token: str = 'PAD',
116 | unk_token: str = 'UNK'
117 | ) -> None:
118 |
119 | super().__init__()
120 |
121 | self.word2index = word2index
122 | self.label2index = label2index
123 |
124 | self.sequence_length = sequence_length
125 | self.pad_token = pad_token
126 | self.unk_token = unk_token
127 | self.pad_index = self.word2index[self.pad_token]
128 | self.unk_index = self.word2index[self.unk_token]
129 |
130 | self.x_data = self.load(x_data, self.word2index)
131 | self.y_data = self.load(y_data, self.label2index)
132 |
133 |
134 | def load(
135 | self,
136 | data: pd.Series,
137 | mapping: Dict[str, int]
138 | ) -> List[List[int]]:
139 |
140 | indexed_data = []
141 | for case in data:
142 | processed_case = preprocess_for_tokens(case)
143 | indexed_case = self.indexing(processed_case, mapping)
144 | indexed_data.append(indexed_case)
145 |
146 | return indexed_data
147 |
148 |
149 | def indexing(
150 | self,
151 | tokenized_case: List[str],
152 | mapping: Dict[str, int]
153 | ) -> List[int]:
154 |
155 | return [mapping.get(token, self.unk_index) for token in tokenized_case]
156 |
157 |
158 | def padding(
159 | self,
160 | sequence: List[int]
161 | ) -> List[int]:
162 | sequence = sequence + [self.pad_index] * (max(self.sequence_length - len(sequence), 0))
163 | return sequence[:self.sequence_length]
164 |
165 |
166 | def __len__(self):
167 | return len(self.x_data)
168 |
169 |
170 | def __getitem__(
171 | self,
172 | idx: int
173 | ) -> Tuple[torch.Tensor, torch.Tensor]:
174 |
175 | x = self.x_data[idx]
176 | y = self.y_data[idx]
177 |
178 | assert len(x) > 0
179 |
180 | x = self.padding(x)
181 | y = self.padding(y)
182 |
183 | x = torch.tensor(x, dtype=torch.int64)
184 | y = torch.tensor(y, dtype=torch.int64)
185 |
186 | return x, y
187 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/RuBERT/README.md:
--------------------------------------------------------------------------------
1 | The solution is based on two variants of the RuBERT model:
2 |
3 | - **RuMedDaNet** / **RuMedNLI** - we join each pair of input texts into a single string and fine-tune the corresponding model for the specific task;
4 |
5 | - **RuMedTest** - we use the pretrained RuBERT model to obtain contextualized embeddings (of the question and each of the 4 answer options). The answer is chosen by the maximum cosine similarity between the question and answer vectors (see the sketch below).
6 |
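The two variants differ only in how a fixed-size feature vector is read off the encoder output; a sketch of that logic (cf. `BertFeatureExtractor` in `utils.py`):

```python
import torch

def features(encoder_out, bert_type='bert'):  # encoder_out: (batch, seq, hidden)
    cls = encoder_out[:, 0, :]                # [CLS] vector -> the "bert" variant
    if bert_type != 'pool':
        return cls
    mx = torch.relu(torch.max(encoder_out, 1).values)  # max over token positions
    mean = torch.mean(encoder_out, 1)                  # mean over token positions
    return torch.cat((mx, mean, cls), 1)      # "pool" variant: 3*hidden features
```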
7 | ### To run the code
8 |
9 | `pip install -r requirements.txt`
10 |
11 | `./run.sh bert` to run the *RuBERT* model
12 | or
13 | `./run.sh pool` to run the *RuPoolBERT* model variant.
14 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/RuBERT/bert.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pavel-blinov/RuMedBench/dd56a60faf55994ba402615c4067f0652ea0d2bb/lb_submissions/SAI/RuBERT/bert.zip
--------------------------------------------------------------------------------
/lb_submissions/SAI/RuBERT/pool.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pavel-blinov/RuMedBench/dd56a60faf55994ba402615c4067f0652ea0d2bb/lb_submissions/SAI/RuBERT/pool.zip
--------------------------------------------------------------------------------
/lb_submissions/SAI/RuBERT/requirements.txt:
--------------------------------------------------------------------------------
1 | torch==1.9.0
2 | torchtext==0.6.0
3 | tensorflow==2.6.0
4 | keras==2.6.0
5 | pandas==1.3.5
6 | transformers==4.12.5
7 | click==7.1.2
8 | nltk==3.4.5
9 | sklearn-crfsuite==0.3.6
10 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/RuBERT/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | type=$@
4 |
5 | models=$(pwd)'/models'
6 | mkdir -p $models;
7 |
8 | if [ ! -f $models'/rubert_cased_L-12_H-768_A-12_pt/pytorch_model.bin' ]; then
9 | echo $models'/rubert_cased_L-12_H-768_A-12_pt/pytorch_model.bin'
10 | cd $models;
11 | wget "http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_pt.tar.gz"
12 | tar -xvzf rubert_cased_L-12_H-768_A-12_pt.tar.gz
13 | cd ../;
14 | fi
15 |
16 | python -u double_text_classifier.py --task-name 'RuMedDaNet' --device 0 --bert-type $type
17 | python -u double_text_classifier.py --task-name 'RuMedNLI' --device 0 --bert-type $type
18 | python -u test_solver.py --task-name 'RuMedTest' --device 0 --bert-type $type
19 |
20 | zip -m $type.zip RuMedDaNet.jsonl RuMedNLI.jsonl RuMedTest.jsonl
21 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/RuBERT/test_solver.py:
--------------------------------------------------------------------------------
1 | import json
2 | import pathlib
3 |
4 | import click
5 | import numpy as np
6 | import pandas as pd
7 | from scipy.special import expit
8 |
9 | import torch
10 | from torch.utils.data import TensorDataset, DataLoader, SequentialSampler
11 | import joblib
12 | from sklearn.metrics.pairwise import cosine_similarity
13 | from keras.preprocessing.sequence import pad_sequences
14 | from transformers import BertTokenizer, BertConfig
15 | from utils import seed_everything, seed_worker
16 |
17 | def encode_text_pairs(tokenizer, sentences):
18 | bs = 20000
19 | input_ids, attention_masks, token_type_ids = [], [], []
20 |
21 | for i in range(0, len(sentences), bs):
22 | tokenized_texts = []
23 | for sentence in sentences[i:i+bs]:
24 | # truncate so that the [CLS] and [SEP] special tokens fit into MAX_LEN
25 | final_tokens = ['[CLS]'] + tokenizer.tokenize(sentence)[:MAX_LEN-2] + ['[SEP]']
26 | tokenized_texts.append(final_tokens)
27 |
28 |
29 | b_input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
30 |
31 | b_input_ids = pad_sequences(b_input_ids, maxlen=MAX_LEN, dtype='long', truncating='post', padding='post')
32 |
33 | # single-segment inputs: every token type id stays zero
34 | b_token_type_ids = []
35 | for row in b_input_ids:
36 | row = np.array(row)
37 | token_type_row = np.zeros(row.shape[0], dtype=np.int64)
38 | b_token_type_ids.append(token_type_row)
39 |
40 |
41 | b_attention_masks = []
42 | for seq in b_input_ids:
43 | seq_mask = [float(tok_id > 0) for tok_id in seq]  # mark non-padding positions
44 | b_attention_masks.append(seq_mask)
45 |
46 | attention_masks.append(b_attention_masks)
47 | input_ids.append(b_input_ids)
48 | token_type_ids.append(b_token_type_ids)
49 | input_ids, attention_masks = np.vstack(input_ids), np.vstack(attention_masks)
50 | token_type_ids = np.vstack(token_type_ids)
51 |
52 | return input_ids, attention_masks, token_type_ids
53 |
54 | SEED = 128
55 | seed_everything(SEED)
56 |
57 | MAX_LEN = 512
58 |
59 | @click.command()
60 | @click.option('--task-name',
61 | default='RuMedTest',
62 | type=click.Choice(['RuMedTest']),
63 | help='The name of the task to run.')
64 | @click.option('--device',
65 | default=-1,
66 | help='Gpu to train the model on.')
67 | @click.option('--data-path',
68 | default='../../../MedBench_data/',
69 | help='Path to the data files.')
70 | @click.option('--bert-type',
71 | default='bert',
72 | help='BERT model variant.')
73 | def main(task_name, data_path, device, bert_type):
74 | print(f'\n{task_name} task')
75 |
76 | if device == -1:
77 | device = torch.device('cpu')
78 | else:
79 | device = torch.device(device)
80 |
81 | out_path = pathlib.Path('.').absolute()
82 | data_path = pathlib.Path(data_path).absolute() / task_name
83 |
84 | test_data = pd.read_json(data_path / 'test.jsonl', lines=True)
85 |
86 | index_id = 'idx'
87 | if task_name == 'RuMedTest':
88 | options = ['1', '2', '3', '4']
89 | question_id = 'question'
90 | label_id = 'answer'
91 | else:
92 | raise ValueError('unknown task')
93 |
94 | tokenizer = BertTokenizer.from_pretrained(
95 | out_path / 'models/rubert_cased_L-12_H-768_A-12_pt/',
96 | do_lower_case=True,
97 | max_length=MAX_LEN
98 | )
99 |
100 | from utils import BertFeatureExtractor as BertModel
101 | ## take appropriate config and init a BERT model
102 | config_path = out_path / 'models/rubert_cased_L-12_H-768_A-12_pt/bert_config.json'
103 | conf = BertConfig.from_json_file( config_path )
104 | model = BertModel(conf)
105 | ## preload it with weights
106 | output_model_file = out_path / 'models/rubert_cased_L-12_H-768_A-12_pt/pytorch_model.bin'
107 | model.load_state_dict(torch.load(output_model_file), strict=False)
108 | model = model.to(device)
109 | model.eval()
110 |
111 | def get_embeddings(texts):
112 | input_ids, attention_masks, token_type_ids = encode_text_pairs(tokenizer, texts)
113 | ##prediction_dataloader
114 | input_ids = torch.tensor(input_ids)
115 | attention_masks = torch.tensor(attention_masks)
116 | token_type_ids = torch.tensor(token_type_ids)
117 |
118 | batch_size = 16
119 | prediction_data = TensorDataset(input_ids, attention_masks, token_type_ids)
120 | prediction_sampler = SequentialSampler(prediction_data)
121 | prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size, worker_init_fn=seed_worker)
122 |
123 | predictions = []
124 | for step, batch in enumerate(prediction_dataloader):
125 | batch = tuple(t.to(device) for t in batch)
126 | b_input_ids, b_input_mask, b_token_type_ids = batch
127 | with torch.no_grad():
128 | outputs = model( b_input_ids, token_type_ids=b_token_type_ids, attention_mask=b_input_mask, bert_type=bert_type )
129 | outputs = outputs.detach().cpu().numpy()
130 | predictions.append(outputs)
131 | predictions = expit(np.vstack(predictions))
132 | return predictions
133 |
134 | q_vecs = get_embeddings(test_data['question'])
135 |
136 | sims = []
137 | for option in options:
138 | option_vecs = get_embeddings(test_data[option])
139 | sim = cosine_similarity(q_vecs, option_vecs).diagonal()
140 | sims.append(sim)
141 | sims = np.array(sims).T
142 |
143 | recs = []
144 | for i, pred in zip(test_data[index_id], sims):
145 | recs.append( { index_id: i, label_id: str(1+np.argmax(pred)) } )
146 |
147 | out_fname = out_path / f'{task_name}.jsonl'
148 | with open(out_fname, 'w') as fw:
149 | for rec in recs:
150 | json.dump(rec, fw, ensure_ascii=False)
151 | fw.write('\n')
152 |
153 |
154 | if __name__ == '__main__':
155 | main()
156 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/RuBERT/utils.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import os
3 | import random
4 | import torch
5 | import numpy as np
6 |
7 | def seed_everything(seed):
8 | random.seed(seed)
9 | os.environ['PYTHONHASHSEED'] = str(seed)
10 | np.random.seed(seed)
11 | torch.manual_seed(seed)
12 | torch.cuda.manual_seed_all(seed)
13 | torch.cuda.manual_seed(seed)
14 | torch.backends.cudnn.deterministic = True
15 | torch.backends.cudnn.benchmark = False
16 |
17 | def seed_worker(worker_id):
18 | worker_seed = torch.initial_seed() % 2**32
19 | np.random.seed(worker_seed)
20 | random.seed(worker_seed)
21 |
22 |
23 | from torch import nn
24 | import torch.nn.functional as F
25 | from transformers import BertTokenizer, BertConfig, BertPreTrainedModel, BertModel
26 |
27 | class PoolBertForTokenClassification(BertPreTrainedModel):
28 | def __init__(self, config):
29 | super().__init__(config)
30 | self.num_labels = config.num_labels
31 |
32 | self.bert = BertModel(config, add_pooling_layer=False)
33 | self.dropout = nn.Dropout(config.hidden_dropout_prob)
34 | self.classifier = nn.Linear(config.hidden_size*3, config.num_labels)
35 |
36 | self.w_size = 4
37 |
38 | self.init_weights()
39 |
40 | def forward(
41 | self,
42 | input_ids=None,
43 | attention_mask=None,
44 | token_type_ids=None,
45 | position_ids=None,
46 | head_mask=None,
47 | inputs_embeds=None,
48 | labels=None,
49 | output_attentions=None,
50 | output_hidden_states=None,
51 | return_dict=None,
52 | ):
53 | outputs = self.bert(
54 | input_ids,
55 | attention_mask=attention_mask,
56 | token_type_ids=token_type_ids,
57 | position_ids=position_ids,
58 | head_mask=head_mask,
59 | inputs_embeds=inputs_embeds,
60 | output_attentions=output_attentions,
61 | output_hidden_states=output_hidden_states,
62 | return_dict=return_dict,
63 | )
64 |
65 | sequence_output = outputs['last_hidden_state']
66 |
67 | shape = list(sequence_output.shape)
68 | shape[1]+=self.w_size-1
69 |
70 | t_ext = torch.zeros(shape, dtype=sequence_output.dtype, device=sequence_output.device)
71 | t_ext[:, self.w_size-1:, :] = sequence_output
72 |
73 | unfold_t = t_ext.unfold(1, self.w_size, 1).transpose(3,2)
74 | pooled_output_mean = torch.mean(unfold_t, 2)
75 |
76 | pooled_output, _ = torch.max(unfold_t, 2)
77 | pooled_output = torch.relu(pooled_output)
78 |
79 | sequence_output = torch.cat((pooled_output, pooled_output_mean, sequence_output), 2)
80 |
81 | sequence_output = self.dropout(sequence_output)
82 |
83 | logits = self.classifier(sequence_output)
84 |
85 | loss = None
86 | if labels is not None:
87 | loss_fct = nn.CrossEntropyLoss()
88 | # Only keep active parts of the loss
89 | if attention_mask is not None:
90 | active_loss_mask = attention_mask.view(-1) == 1
91 | active_logits = logits.view(-1, self.num_labels)
92 |
93 | active_labels = torch.where(
94 | active_loss_mask,
95 | labels.view(-1),
96 | torch.tensor(loss_fct.ignore_index).type_as(labels)
97 | )
98 |
99 | loss = loss_fct(active_logits, active_labels)
100 | else:
101 | loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
102 |
103 | output = (logits,) + outputs[2:]
104 | return ((loss,) + output) if loss is not None else output
105 |
106 | class PoolBertForSequenceClassification(BertPreTrainedModel):
107 | def __init__(self, config):
108 | super().__init__(config)
109 | self.num_labels = config.num_labels
110 |
111 | self.bert = BertModel(config)
112 | self.dropout = nn.Dropout(config.hidden_dropout_prob)
113 | self.classifier = nn.Linear(config.hidden_size*3, self.config.num_labels)
114 |
115 | self.init_weights()
116 |
117 | def forward(
118 | self,
119 | input_ids=None,
120 | attention_mask=None,
121 | token_type_ids=None,
122 | position_ids=None,
123 | head_mask=None,
124 | inputs_embeds=None,
125 | labels=None,
126 | ):
127 | outputs = self.bert(
128 | input_ids,
129 | attention_mask=attention_mask,
130 | token_type_ids=token_type_ids,
131 | position_ids=position_ids,
132 | head_mask=head_mask,
133 | inputs_embeds=inputs_embeds,
134 | )
135 |
136 | encoder_out = outputs['last_hidden_state']
137 | cls = encoder_out[:, 0, :]
138 |
139 | pooled_output, _ = torch.max(encoder_out, 1)
140 | pooled_output = torch.relu(pooled_output)
141 |
142 | pooled_output_mean = torch.mean(encoder_out, 1)
143 | pooled_output = torch.cat((pooled_output, pooled_output_mean, cls), 1)
144 |
145 | pooled_output = self.dropout(pooled_output)
146 | logits = self.classifier(pooled_output)
147 |
148 | outputs = (logits,) + outputs[2:] # add hidden states and attention if they are here
149 |
150 | if labels is not None:
151 | if self.num_labels == 1:
152 | # We are doing regression
153 | loss_fct = nn.MSELoss()
154 | loss = loss_fct(logits.view(-1), labels.view(-1))
155 | else:
156 | loss = F.binary_cross_entropy_with_logits( logits.view(-1), labels.view(-1) )
157 | outputs = (loss,) + outputs
158 |
159 | return outputs # (loss), logits, (hidden_states), (attentions)
160 |
161 | class BertFeatureExtractor(BertPreTrainedModel):
162 | def __init__(self, config):
163 | super().__init__(config)
164 |
165 | self.bert = BertModel(config)
166 |
167 | self.init_weights()
168 |
169 | def forward(
170 | self,
171 | input_ids=None,
172 | attention_mask=None,
173 | token_type_ids=None,
174 | position_ids=None,
175 | head_mask=None,
176 | inputs_embeds=None,
177 | bert_type='cls',
178 | ):
179 | outputs = self.bert(
180 | input_ids,
181 | attention_mask=attention_mask,
182 | token_type_ids=token_type_ids,
183 | position_ids=position_ids,
184 | head_mask=head_mask,
185 | inputs_embeds=inputs_embeds,
186 | )
187 |
188 | encoder_out = outputs['last_hidden_state']
189 | cls = encoder_out[:, 0, :]
190 | if bert_type!='pool':
191 | return cls
192 |
193 | pooled_output, _ = torch.max(encoder_out, 1)
194 | pooled_output = torch.relu(pooled_output)
195 |
196 | pooled_output_mean = torch.mean(encoder_out, 1)
197 | pooled_output = torch.cat((pooled_output, pooled_output_mean, cls), 1)
198 | return pooled_output
199 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/TF-IDF/README.md:
--------------------------------------------------------------------------------
1 | The solution is based on tf-idf features and simple linear models:
2 |
3 | - **RuMedDaNet** / **RuMedNLI** - we join each pair of input texts into a single string, build a tf-idf feature matrix, and train a logistic regression model to predict the target variable;
4 |
5 | - **RuMedTest** - we build a tf-idf feature matrix for the questions and 4 matrices, one per answer option. The answer is chosen by the maximum cosine similarity between the question and answer vectors (see the sketch below).
6 |
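A self-contained sketch of this selection rule on made-up toy texts (only the `TfidfVectorizer` settings match `test_solver.py`):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

questions = ['первый вопрос', 'второй вопрос']                 # toy questions
options = {'1': ['ответ а', 'ответ б'], '2': ['ответ в', 'ответ г']}

tfidf = TfidfVectorizer(analyzer='char', ngram_range=(3, 8))
Q = tfidf.fit_transform(questions)
# row-wise similarity of each question to its own candidate answers
sims = np.array([cosine_similarity(Q, tfidf.transform(texts)).diagonal()
                 for texts in options.values()]).T
answers = [str(1 + j) for j in sims.argmax(axis=1)]            # one option id per question
```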
7 | ### To run the code
8 |
9 | `./run.sh`
10 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/TF-IDF/double_text_classifier.py:
--------------------------------------------------------------------------------
1 | import json
2 | import pathlib
3 |
4 | import click
5 | import pandas as pd
6 | from sklearn.feature_extraction.text import TfidfVectorizer
7 | from sklearn.linear_model import LogisticRegression
8 | from sklearn.metrics import accuracy_score
9 |
10 |
11 | def preprocess_sentences(column1, column2):
12 | return [sent1 + ' ' + sent2 for sent1, sent2 in zip(column1, column2)]
13 |
14 |
15 | def encode_text(tfidf, text_data, l2i, labels=None, mode='train'):
16 | if mode == 'train':
17 | X = tfidf.fit_transform(text_data)
18 | else:
19 | X = tfidf.transform(text_data)
20 | y = None
21 | if labels is not None:
22 | y = labels.map(l2i)
23 | return X, y
24 |
25 |
26 | @click.command()
27 | @click.option('--task-name',
28 | default='RuMedNLI',
29 | type=click.Choice(['RuMedDaNet', 'RuMedNLI']),
30 | help='The name of the task to run.')
31 | @click.option('--data-path',
32 | default='../../../MedBench_data/',
33 | help='Path to the data files.')
34 | def main(task_name, data_path):
35 | print(f'\n{task_name} task')
36 |
37 | out_path = pathlib.Path('.').absolute()
38 | data_path = pathlib.Path(data_path).absolute() / task_name
39 |
40 | train_data = pd.read_json(data_path / 'train.jsonl', lines=True)
41 | dev_data = pd.read_json(data_path / 'dev.jsonl', lines=True)
42 | test_data = pd.read_json(data_path / 'test.jsonl', lines=True)
43 |
44 | index_id = 'pairID'
45 | if task_name == 'RuMedNLI':
46 | l2i = {'neutral': 0, 'entailment': 1, 'contradiction': 2}
47 | text1_id = 'ru_sentence1'
48 | text2_id = 'ru_sentence2'
49 | label_id = 'gold_label'
50 | elif task_name == 'RuMedDaNet':
51 | l2i = {'нет': 0, 'да': 1}
52 | text1_id = 'context'
53 | text2_id = 'question'
54 | label_id = 'answer'
55 | else:
56 | raise ValueError('unknown task')
57 |
58 | i2l = {i: label for label, i in l2i.items()}
59 |
60 | text_data_train = preprocess_sentences(train_data[text1_id], train_data[text2_id])
61 | text_data_dev = preprocess_sentences(dev_data[text1_id], dev_data[text2_id])
62 | text_data_test = preprocess_sentences(test_data[text1_id], test_data[text2_id])
63 |
64 | tfidf = TfidfVectorizer(analyzer='char', ngram_range=(3, 8))
65 | clf = LogisticRegression(penalty='l2', C=10, multi_class='ovr', n_jobs=10, max_iter=1000, verbose=1)
66 |
67 | X, y = encode_text(tfidf, text_data_train, l2i, labels=train_data[label_id])
68 |
69 | clf.fit(X, y)
70 |
71 | X_val, y_val = encode_text(tfidf, text_data_dev, l2i, labels=dev_data[label_id], mode='dev')
72 | y_val_pred = clf.predict(X_val)
73 | accuracy = round(accuracy_score(y_val, y_val_pred) * 100, 2)
74 | print(f'\n{task_name} task score on dev set: {accuracy}')
75 |
76 | X_test, _ = encode_text(tfidf, text_data_test, l2i, mode='test')
77 | y_test_pred = clf.predict(X_test)
78 |
79 | recs = []
80 | for i, pred in zip(test_data[index_id], y_test_pred):
81 | recs.append({index_id: i, label_id: i2l[pred]})
82 |
83 | out_fname = out_path / f'{task_name}.jsonl'
84 | with open(out_fname, 'w') as fw:
85 | for rec in recs:
86 | json.dump(rec, fw, ensure_ascii=False)
87 | fw.write('\n')
88 |
89 |
90 | if __name__ == '__main__':
91 | main()
92 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/TF-IDF/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | python -u double_text_classifier.py --task-name 'RuMedNLI'
4 | python -u double_text_classifier.py --task-name 'RuMedDaNet'
5 | python -u test_solver.py --task-name 'RuMedTest'
6 |
7 | zip -m tfidf.zip RuMedDaNet.jsonl RuMedNLI.jsonl RuMedTest.jsonl
8 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/TF-IDF/test_solver.py:
--------------------------------------------------------------------------------
1 | import json
2 | import pathlib
3 |
4 | import click
5 | import numpy as np
6 | import pandas as pd
7 | from sklearn.feature_extraction.text import TfidfVectorizer
10 | from sklearn.metrics.pairwise import cosine_similarity
11 |
12 | @click.command()
13 | @click.option('--task-name',
14 | default='RuMedTest',
15 | type=click.Choice(['RuMedTest']),
16 | help='The name of the task to run.')
17 | @click.option('--data-path',
18 | default='../../../MedBench_data/',
19 | help='Path to the data files.')
20 | def main(task_name, data_path):
21 | print(f'\n{task_name} task')
22 |
23 | out_path = pathlib.Path('.').absolute()
24 | data_path = pathlib.Path(data_path).absolute() / task_name
25 |
26 | test_data = pd.read_json(data_path / 'test.jsonl', lines=True)
27 |
28 | index_id = 'idx'
29 | if task_name == 'RuMedTest':
30 | l2i = {'1': 1, '2': 2, '3': 3, '4': 4}
31 | question_id = 'question'
32 | label_id = 'answer'
33 | else:
34 | raise ValueError('unknown task')
35 |
36 | i2l = {i: label for label, i in l2i.items()}
37 |
38 | tfidf = TfidfVectorizer(analyzer='char', ngram_range=(3, 8))
39 |
40 | text_data = test_data[question_id]
41 |
42 | X = tfidf.fit_transform(text_data)
43 |
44 | sims = []
45 | for l in sorted(list(l2i.keys())):
46 | option_X = tfidf.transform( test_data[l] )
47 | sim = cosine_similarity(X, option_X).diagonal()
48 | sims.append(sim)
49 | sims = np.array(sims).T
50 |
51 | recs = []
52 | for i, pred in zip(test_data[index_id], sims):
53 | recs.append({index_id: i, label_id: i2l[1+np.argmax(pred)]})
54 |
55 | out_fname = out_path / f'{task_name}.jsonl'
56 | with open(out_fname, 'w') as fw:
57 | for rec in recs:
58 | json.dump(rec, fw, ensure_ascii=False)
59 | fw.write('\n')
60 |
61 |
62 | if __name__ == '__main__':
63 | main()
64 |
--------------------------------------------------------------------------------
/lb_submissions/SAI/TF-IDF/tfidf.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pavel-blinov/RuMedBench/dd56a60faf55994ba402615c4067f0652ea0d2bb/lb_submissions/SAI/TF-IDF/tfidf.zip
--------------------------------------------------------------------------------
/lb_submissions/SAI_junior/RuBioRoBERTa/README.md:
--------------------------------------------------------------------------------
1 | The solution is implemented with the [RuBioRoBERTa](https://huggingface.co/alexyalunin/RuBioRoBERTa) model.
2 |
3 | - **RuMedDaNet** / **RuMedNLI** - the context and the question are concatenated through a space before being fed to the model (see the sketch below), and the model is fine-tuned for the specific task;
4 |
5 | - **RuMedTest** - the pretrained RuBioRoBERTa model is used to obtain contextualized embeddings (of the question and each of the 4 answer options). The answer is chosen by the maximum cosine similarity between the question and answer vectors.
6 |
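A sketch of the input construction described above (the model id comes from the link; the truncation length is an assumption):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('alexyalunin/RuBioRoBERTa')
context, question = 'текст контекста', 'текст вопроса'  # toy pair
enc = tok(context + ' ' + question, truncation=True,
          max_length=512, return_tensors='pt')
# enc['input_ids'] and enc['attention_mask'] are fed to the fine-tuned model
```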
7 | ### Hyperparameters used for the RuMedDaNet task:
8 | - `seed = 128`
9 | - `batch_size = 10`
10 | - `epochs = 25`
11 | - `lr = 2e-5`
12 |
13 | ### Hyperparameters used for the RuMedNLI task:
14 | - `seed = 128`
15 | - `batch_size = 8`
16 | - `epochs = 25`
17 | - `lr = 3e-5`
18 |
19 | ### To run:
20 | `pip install -r requirements.txt`
21 |
22 | Open each task's notebook and run all the cells.
23 |
24 | Add all the solutions to a zip archive - `zip -r solution.zip RuMedTest.jsonl RuMedNLI.jsonl RuMedDaNet.jsonl`
25 |
--------------------------------------------------------------------------------
/lb_submissions/SAI_junior/RuBioRoBERTa/requirements.txt:
--------------------------------------------------------------------------------
1 | torch==1.12.1
2 | torchtext==0.6.0
3 | tensorflow==2.6.0
4 | keras==2.6.0
5 | pandas==1.3.5
6 | transformers==4.12.5
7 | scikit-learn==1.0.2
8 |
--------------------------------------------------------------------------------