├── .gitignore ├── README.md ├── code ├── README.md ├── bert │ ├── README.md │ ├── double_text_classifier.py │ ├── models │ │ └── rubert_cased_L-12_H-768_A-12_pt │ │ │ ├── config.json │ │ │ └── vocab.txt │ ├── out │ │ ├── RuMedDaNet.jsonl │ │ ├── RuMedNER.jsonl │ │ ├── RuMedNLI.jsonl │ │ ├── RuMedSymptomRec.jsonl │ │ └── RuMedTop3.jsonl │ ├── run.sh │ ├── single_text_classifier.py │ ├── token_classifier.py │ └── utils.py ├── bilstm │ ├── README.md │ ├── double_text_classifier.py │ ├── out │ │ ├── RuMedDaNet.jsonl │ │ ├── RuMedNER.jsonl │ │ ├── RuMedNLI.jsonl │ │ ├── RuMedSymptomRec.jsonl │ │ └── RuMedTop3.jsonl │ ├── run.sh │ ├── single_text_classifier.py │ ├── token_classifier.py │ └── utils.py ├── eval.py ├── human │ ├── RuMedDaNet.jsonl │ ├── RuMedNER.jsonl │ ├── RuMedNLI.jsonl │ ├── RuMedSymptomRec.jsonl │ └── RuMedTop3.jsonl ├── linear_models │ ├── README.md │ ├── double_text_classifier.py │ ├── out │ │ ├── RuMedDaNet.jsonl │ │ ├── RuMedNER.jsonl │ │ ├── RuMedNLI.jsonl │ │ ├── RuMedSymptomRec.jsonl │ │ └── RuMedTop3.jsonl │ ├── run.sh │ ├── single_text_classifier.py │ └── token_classifier.py ├── naive │ ├── RuMedDaNet.jsonl │ ├── RuMedNER.jsonl │ ├── RuMedNLI.jsonl │ ├── RuMedSymptomRec.jsonl │ └── RuMedTop3.jsonl ├── requirements.txt └── tasks_builder.py ├── data ├── README.md ├── RuMedDaNet │ ├── dev_v1.jsonl │ ├── private_test_v1.jsonl │ ├── test_v1.jsonl │ └── train_v1.jsonl ├── RuMedNER │ ├── dev_v1.jsonl │ ├── test_v1.jsonl │ └── train_v1.jsonl ├── RuMedNLI │ ├── README.md │ ├── dev_v1.jsonl │ ├── private_test_v1.jsonl │ ├── test_v1.jsonl │ └── train_v1.jsonl ├── RuMedSymptomRec │ ├── dev_v1.jsonl │ ├── test_v1.jsonl │ └── train_v1.jsonl ├── RuMedTest │ └── private_test_v1.jsonl ├── RuMedTop3 │ ├── dev_v1.jsonl │ ├── test_v1.jsonl │ └── train_v1.jsonl └── raw │ ├── RuDReC.csv │ ├── RuMedPrimeData.tsv │ └── rec_markup.csv └── lb_submissions ├── SAI ├── ChatGPT │ ├── README.md │ ├── RuMedDaNet.jsonl │ ├── RuMedNLI.jsonl │ ├── RuMedTest.jsonl │ ├── chat-rmb.ipynb │ ├── rmdanet_dev_gpt3_1202_log.json │ ├── rmnli_priv_gpt3_1502.pd.pickle │ └── rmtest_gpt3_1002_log.json ├── ECGAuto │ ├── ECGBaselineLib │ │ ├── autobaseline.py │ │ ├── datasets.py │ │ └── utils.py │ ├── README.md │ ├── requirements.txt │ └── training.py ├── ECGBinary │ ├── ECGBaselineLib │ │ ├── datasets.py │ │ ├── neurobaseline.py │ │ └── utils.py │ ├── README.md │ ├── requirements.txt │ └── training.py ├── ECGMultihead │ ├── ECGBaselineLib │ │ ├── datasets.py │ │ ├── neurobaseline.py │ │ └── utils.py │ ├── README.md │ ├── requirements.txt │ └── training.py ├── Gigachat │ ├── .gitignore │ ├── README.md │ ├── convert_sogma.py │ ├── out │ │ ├── RuMedDaNet.jsonl │ │ ├── RuMedNLI.jsonl │ │ └── RuMedTest.jsonl │ ├── requirements.txt │ ├── rumed_da_net.py │ ├── rumed_nli.py │ ├── rumed_test.py │ ├── rumed_utils.py │ ├── s00-prepare.sh │ ├── s01-run-all-trains.sh │ └── s02-run-all-tests.sh ├── Human │ └── README.md ├── Naive │ ├── README.md │ └── sample_submission.zip ├── RNN │ ├── README.md │ ├── double_text_classifier.py │ ├── rnn.zip │ ├── run.sh │ ├── test_solver.py │ └── utils.py ├── RuBERT │ ├── README.md │ ├── bert.zip │ ├── double_text_classifier.py │ ├── pool.zip │ ├── requirements.txt │ ├── run.sh │ ├── test_solver.py │ └── utils.py └── TF-IDF │ ├── README.md │ ├── double_text_classifier.py │ ├── run.sh │ ├── test_solver.py │ └── tfidf.zip └── SAI_junior └── RuBioRoBERTa ├── README.md ├── RuMedDaNet.ipynb ├── RuMedDaNet.jsonl ├── RuMedNLI.ipynb ├── RuMedNLI.jsonl ├── RuMedTest.ipynb ├── RuMedTest.jsonl └── 
requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | **/__pycache__ 2 | **/.ipynb_checkpoints -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # This repository is closed. 2 | # For the current version, please visit https://github.com/sb-ai-lab/MedBench 3 | 4 | ## Citation 5 | ```bibtex 6 | @misc{blinov2022rumedbench, 7 | title={RuMedBench: A Russian Medical Language Understanding Benchmark}, 8 | author={Pavel Blinov and Arina Reshetnikova and Aleksandr Nesterov and Galina Zubkova and Vladimir Kokh}, 9 | year={2022}, 10 | eprint={2201.06499}, 11 | archivePrefix={arXiv}, 12 | primaryClass={cs.CL} 13 | } 14 | ``` 15 | -------------------------------------------------------------------------------- /code/README.md: -------------------------------------------------------------------------------- 1 | ## Dependencies and Library Versions 2 | For the specific library versions, see `requirements.txt` and install them with 3 | ```bash 4 | pip install -r requirements.txt 5 | ``` 6 | 7 | ## Hardware Requirements 8 | The code runs on: 9 | ``` 10 | CPU Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz 11 | GPU NVIDIA Tesla V100-PCIE-16GB 12 | RAM 16GB 13 | HDD 16GB 14 | ``` 15 | 16 | ## General Description 17 | Each of the directories `bert`, `bilstm`, and `linear_models` contains a baseline model. 18 | 19 | Generally, a model should produce an output directory (e.g. `bert/out`) with result `.jsonl` files named after the task (e.g. `RuMedTop3.jsonl`). 20 | 21 | Each file contains the same samples as the corresponding test part, enhanced with a `prediction` field.
22 | Examples:
23 | for `RuMedTop3.jsonl` 24 | ``` 25 | { 26 | "idx": "qaf1454f", 27 | "code": "I11", 28 | "prediction": ["I11", "I20", "I10"] 29 | } 30 | ``` 31 | 32 | or `RuMedSymptomRec.jsonl` 33 | ``` 34 | { 35 | "idx": "q45f6321", 36 | "code": "боль в шее", 37 | "prediction": ["тошнота", "боль в шее", "частые головные боли"] 38 | } 39 | ``` 40 | 41 | or `RuMedDaNet.jsonl` 42 | ``` 43 | { 44 | "pairID": "f5309eadb4eacf0f144b24e260643ea2", 45 | "answer": "да", 46 | "prediction": "нет" 47 | } 48 | ``` 49 | 50 | or `RuMedNLI.jsonl` 51 | ``` 52 | { 53 | "pairID": "1f2a8146-66c7-11e7-b4f2-f45c89b91419", 54 | "gold_label": "entailment", 55 | "prediction": "neutral" 56 | } 57 | ``` 58 | 59 | or `RuMedNER.jsonl` 60 | ``` 61 | { 62 | "idx": "769708.tsv_5", 63 | "ner_tags": ["B-Drugname", "O", "B-Drugclass", "O", "O"], 64 | "prediction": ["B-Drugclass", "O", "O", "O", "O"] 65 | } 66 | ``` 67 | 68 | ### tasks_builder.py 69 | 70 | This script prepares the data for the benchmark tasks from the raw data files. 71 | 72 | ```bash 73 | python tasks_builder.py 74 | ``` 75 | 76 | ### eval.py 77 | 78 | This script evaluates the test results. 79 | 80 | Run it like 81 | ```bash 82 | python eval.py --out_dir bert/out 83 | ``` 84 | or 85 | ```bash 86 | python eval.py --out_dir human 87 | ``` 88 | -------------------------------------------------------------------------------- /code/bert/README.md: -------------------------------------------------------------------------------- 1 | To run the BERT models: 2 | 1) Download the [RuBERT model](http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_pt.tar.gz) and extract it to `models/rubert_cased_L-12_H-768_A-12_pt`. 3 | ```bash 4 | mkdir -p models/; cd models/ 5 | wget "http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_pt.tar.gz" 6 | tar -xvzf rubert_cased_L-12_H-768_A-12_pt.tar.gz 7 | ``` 8 | 2) Run
9 | `./run.sh bert` for *RuBERT* model
10 | or
11 | `./run.sh pool` for *RuPoolBERT* model. -------------------------------------------------------------------------------- /code/bert/models/rubert_cased_L-12_H-768_A-12_pt/config.json: -------------------------------------------------------------------------------- 1 | { 2 | "attention_probs_dropout_prob": 0.1, 3 | "directionality": "bidi", 4 | "hidden_act": "gelu", 5 | "hidden_dropout_prob": 0.1, 6 | "hidden_size": 768, 7 | "initializer_range": 0.02, 8 | "intermediate_size": 3072, 9 | "max_position_embeddings": 512, 10 | "num_attention_heads": 12, 11 | "num_hidden_layers": 12, 12 | "pooler_fc_size": 768, 13 | "pooler_num_attention_heads": 12, 14 | "pooler_num_fc_layers": 3, 15 | "pooler_size_per_head": 128, 16 | "pooler_type": "first_token_transform", 17 | "type_vocab_size": 2, 18 | "vocab_size": 119547 19 | } 20 | -------------------------------------------------------------------------------- /code/bert/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | type=$@ 4 | 5 | out=$(pwd)'/out' 6 | mkdir -p $out 7 | 8 | # run the tasks sequentially 9 | python -u single_text_classifier.py --gpu 0 --task_name 'RuMedTop3' --bert_type $type 10 | python -u single_text_classifier.py --gpu 0 --task_name 'RuMedSymptomRec' --bert_type $type 11 | python -u double_text_classifier.py --gpu 0 --task_name 'RuMedDaNet' --bert_type $type 12 | python -u double_text_classifier.py --gpu 0 --task_name 'RuMedNLI' --bert_type $type 13 | python -u token_classifier.py --gpu 0 --task_name 'RuMedNER' --bert_type $type 14 | 15 | # # or run in parallel on multiple gpus 16 | # python -u single_text_classifier.py --gpu 0 --task_name 'RuMedTop3' --bert_type $type & 17 | # python -u single_text_classifier.py --gpu 1 --task_name 'RuMedSymptomRec' --bert_type $type & 18 | # python -u double_text_classifier.py --gpu 2 --task_name 'RuMedDaNet' --bert_type $type & 19 | # python -u double_text_classifier.py --gpu 3 --task_name 'RuMedNLI' --bert_type $type & 20 | # wait 21 | # python -u token_classifier.py --gpu 0 --task_name 'RuMedNER' --bert_type $type 22 | -------------------------------------------------------------------------------- /code/bert/single_text_classifier.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import gc 3 | import os 4 | import pandas as pd 5 | import numpy as np 6 | import json 7 | 8 | import torch 9 | from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler 10 | 11 | from transformers import BertTokenizer, BertConfig 12 | from transformers.optimization import AdamW 13 | 14 | import argparse 15 | from scipy.special import expit 16 | from keras.preprocessing.sequence import pad_sequences 17 | 18 | from utils import seed_everything, seed_worker 19 | 20 | def encode_texts(tokenizer, sentences): 21 | bs = 20000 22 | input_ids, attention_masks = [], [] 23 | for _, i in enumerate(range(0, len(sentences), bs)): 24 | b_sentences = ['[CLS] ' + sentence + ' [SEP]' for sentence in sentences[i:i+bs]] 25 | tokenized_texts = [tokenizer.tokenize(sent) for sent in b_sentences] 26 | b_input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts] 27 | b_input_ids = pad_sequences(b_input_ids, maxlen=MAX_LEN, dtype='long', truncating='post', padding='post') 28 | b_attention_masks = [] 29 | for seq in b_input_ids: 30 | seq_mask = [float(i>0) for i in seq] 31 | b_attention_masks.append(seq_mask) 32 | 33 | attention_masks.append(b_attention_masks) 34 | 
input_ids.append(b_input_ids) 35 | input_ids, attention_masks = np.vstack(input_ids), np.vstack(attention_masks) 36 | return input_ids, attention_masks 37 | 38 | def hit_at_n(y_true, y_pred, index2label, n=3): 39 | assert len(y_true) == len(y_pred) 40 | hit_count = 0 41 | for l, row in zip(y_true, y_pred): 42 | order = (np.argsort(row)[::-1])[:n] 43 | order = [index2label[i] for i in order] 44 | order = set(order) 45 | hit_count += int(l in order) 46 | return hit_count/float(len(y_true)) 47 | 48 | SEED = 128 49 | seed_everything(SEED) 50 | 51 | MAX_LEN = 256 52 | 53 | def setup_parser(): 54 | parser = argparse.ArgumentParser() 55 | 56 | parser.add_argument('--gpu', 57 | default=None, 58 | type=int, 59 | required=True, 60 | help='The index of the gpu to run.') 61 | parser.add_argument('--task_name', 62 | default='', 63 | type=str, 64 | required=True, 65 | help='The name of the task to run.') 66 | parser.add_argument('--bert_type', 67 | default='', 68 | type=str, 69 | required=True, 70 | help='The type of BERT model (bert or pool).') 71 | return parser 72 | 73 | if __name__ == '__main__': 74 | parser = setup_parser() 75 | args = parser.parse_args() 76 | 77 | os.environ['CUDA_VISIBLE_DEVICES'] = str(args.gpu) 78 | device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 79 | 80 | if args.bert_type=='pool': #get model type of BERT model 81 | from utils import PoolBertForSequenceClassification as BertModel 82 | else: 83 | from transformers import BertForSequenceClassification as BertModel 84 | 85 | task_name = args.task_name 86 | 87 | base_path = os.path.abspath( os.path.join(os.path.dirname( __file__ ) ) ) 88 | out_dir = os.path.join(base_path, 'out') 89 | model_path = os.path.join(base_path, 'models/rubert_cased_L-12_H-768_A-12_pt/') 90 | 91 | base_path = os.path.abspath( os.path.join(base_path, '../..') ) 92 | 93 | parts = ['train', 'dev', 'test'] 94 | data_path = os.path.join(base_path, 'data', task_name) 95 | 96 | text1_id, label_id, index_id = 'symptoms', 'code', 'idx' 97 | if task_name=='RuMedTop3': 98 | pass 99 | elif task_name=='RuMedSymptomRec': 100 | pass 101 | else: 102 | raise ValueError('unknown task') 103 | 104 | part2indices = {p:set() for p in parts} 105 | all_ids, sentences, labels = [], [], [] 106 | for p in parts: 107 | fname = '{}_v1.jsonl'.format(p) 108 | with open(os.path.join( data_path, fname)) as f: 109 | for line in f: 110 | data = json.loads(line) 111 | s1 = data[text1_id] 112 | sentences.append( s1 ) 113 | labels.append( data[label_id] ) 114 | idx = data[index_id] 115 | all_ids.append( idx ) 116 | part2indices[p].add( idx ) 117 | all_ids = np.array(all_ids) 118 | print ('len(total)', len(sentences)) 119 | 120 | code_set = set(labels) 121 | l2i = {code:i for i, code in enumerate(sorted(code_set))} 122 | i2l = {l2i[l]:l for l in l2i} 123 | print ( 'len(l2i)', len(l2i) ) 124 | 125 | tokenizer = BertTokenizer.from_pretrained( 126 | os.path.join(base_path, model_path), 127 | do_lower_case=True, 128 | max_length=MAX_LEN 129 | ) 130 | 131 | input_ids, attention_masks = encode_texts(tokenizer, sentences) 132 | 133 | label_indices = np.array([l2i[l] for l in labels]) 134 | 135 | labels = np.zeros((input_ids.shape[0], len(l2i))) 136 | for _, i in enumerate(label_indices): 137 | labels[_, i] = 1 138 | 139 | # prepare test data loader 140 | test_ids = part2indices['test'] 141 | test_mask = np.array([sid in test_ids for sid in all_ids]) 142 | test_ids = all_ids[test_mask] 143 | tst_inputs, tst_masks, tst_labels = input_ids[test_mask], attention_masks[test_mask], 
labels[test_mask] 144 | 145 | tst_inputs = torch.tensor(tst_inputs) 146 | tst_masks = torch.tensor(tst_masks) 147 | tst_labels = torch.tensor(tst_labels) 148 | 149 | test_data = TensorDataset(tst_inputs, tst_masks, tst_labels) 150 | test_sampler = SequentialSampler(test_data) 151 | test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=8, worker_init_fn=seed_worker) 152 | 153 | batch_size = 4 154 | epochs = 25 155 | lr = 3e-5 156 | max_grad_norm = 1.0 157 | 158 | cv_res = {} 159 | for fold in range(1): 160 | best_dev_score = -1 161 | seed_everything(SEED) 162 | train_ids = part2indices['train'] 163 | dev_ids = part2indices['dev'] 164 | 165 | train_mask = np.array([sid in train_ids for sid in all_ids]) 166 | dev_mask = np.array([sid in dev_ids for sid in all_ids]) 167 | 168 | input_ids_train, attention_masks_train, labels_train = input_ids[train_mask], attention_masks[train_mask], labels[train_mask] 169 | input_ids_dev, attention_masks_dev, labels_dev = input_ids[dev_mask], attention_masks[dev_mask], labels[dev_mask] 170 | print ('fold', fold, input_ids_train.shape, input_ids_dev.shape) 171 | 172 | input_ids_train = torch.tensor(input_ids_train) 173 | attention_masks_train = torch.tensor(attention_masks_train) 174 | labels_train = torch.tensor(labels_train) 175 | 176 | train_data = TensorDataset(input_ids_train, attention_masks_train, labels_train) 177 | train_sampler = RandomSampler(train_data) 178 | train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size, worker_init_fn=seed_worker) 179 | 180 | ##prediction_dataloader 181 | input_ids_dev = torch.tensor(input_ids_dev) 182 | attention_masks_dev = torch.tensor(attention_masks_dev) 183 | labels_dev = torch.tensor(labels_dev) 184 | prediction_data = TensorDataset(input_ids_dev, attention_masks_dev, labels_dev) 185 | prediction_sampler = SequentialSampler(prediction_data) 186 | prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size, worker_init_fn=seed_worker) 187 | 188 | ## take appropriate config and init a BERT model 189 | config_path = os.path.join( base_path, model_path, 'bert_config.json' ) 190 | conf = BertConfig.from_json_file( config_path ) 191 | conf.num_labels = len(l2i) 192 | model = BertModel(conf) 193 | output_model_file = os.path.join( base_path, model_path, 'pytorch_model.bin' ) 194 | model.load_state_dict(torch.load(output_model_file), strict=False) 195 | model = model.cuda() 196 | 197 | param_optimizer = list(model.named_parameters()) 198 | no_decay = ['bias', 'gamma', 'beta'] 199 | optimizer_grouped_parameters = [ 200 | {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay_rate': 0.01}, 201 | {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay_rate': 0.0} 202 | ] 203 | 204 | # This variable contains all of the hyperparemeter information our training loop needs 205 | optimizer = AdamW(optimizer_grouped_parameters, lr=lr, correct_bias=False) 206 | scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=lr, steps_per_epoch=len(train_dataloader), epochs=epochs) 207 | 208 | train_loss = [] 209 | for _ in range(epochs): 210 | model.train(); torch.cuda.empty_cache() 211 | 212 | tr_loss = 0 213 | nb_tr_examples, nb_tr_steps = 0, 0 214 | for step, batch in enumerate(train_dataloader): 215 | batch = tuple(t.to(device) for t in batch) 216 | b_input_ids, b_input_mask, b_labels = batch 217 | optimizer.zero_grad() 218 | 219 | outputs = model( 
b_input_ids, attention_mask=b_input_mask, labels=b_labels ) 220 | loss, logits = outputs[:2] 221 | train_loss.append(loss.item()) 222 | loss.backward() 223 | torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm) 224 | optimizer.step() 225 | scheduler.step() 226 | 227 | tr_loss += loss.item() 228 | nb_tr_examples += b_input_ids.size(0) 229 | nb_tr_steps += 1 230 | avg_train_loss = tr_loss/nb_tr_steps 231 | 232 | ### val 233 | model.eval() 234 | predictions = [] 235 | tr_loss, nb_tr_steps = 0, 0 236 | for step, batch in enumerate(prediction_dataloader): 237 | batch = tuple(t.to(device) for t in batch) 238 | b_input_ids, b_input_mask, b_labels = batch 239 | with torch.no_grad(): 240 | outputs = model( b_input_ids, attention_mask=b_input_mask, labels=b_labels ) 241 | loss, logits = outputs[:2] 242 | tr_loss += loss.item() 243 | nb_tr_steps += 1 244 | logits = logits.detach().cpu().numpy() 245 | predictions.append(logits) 246 | predictions = expit(np.vstack(predictions)) 247 | edev_loss = tr_loss/nb_tr_steps 248 | 249 | y_indices = np.argmax(labels_dev.detach().cpu().numpy(), axis=1) 250 | dev_codes = [i2l[i] for i in y_indices] 251 | 252 | dev_acc = hit_at_n(dev_codes, predictions, i2l, n=1)*100 253 | dev_hit_at3 = hit_at_n(dev_codes, predictions, i2l, n=3)*100 254 | print ('{} epoch {} average train_loss: {:.6f}\tdev_loss: {:.6f}\tdev_acc {:.2f}\tdev_hit_at3 {:.2f}'.format(task_name, _, avg_train_loss, edev_loss, dev_acc, dev_hit_at3)) 255 | 256 | score = (dev_acc+dev_hit_at3)/2 257 | if score>best_dev_score: # compute result for test part and store to out file, if we found better model 258 | best_dev_score = score 259 | cv_res[fold] = (dev_acc, dev_hit_at3) 260 | 261 | predictions, true_labels = [], [] 262 | for batch in test_dataloader: 263 | batch = tuple(t.to(device) for t in batch) 264 | b_input_ids, b_input_mask, b_labels = batch 265 | 266 | with torch.no_grad(): 267 | outputs = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels) 268 | logits = outputs[1].detach().cpu().numpy() 269 | label_ids = b_labels.to('cpu').numpy() 270 | predictions.append(logits) 271 | true_labels.append(label_ids) 272 | predictions = expit(np.vstack(predictions)) 273 | true_labels = np.concatenate(true_labels) 274 | assert len(true_labels) == len(predictions) 275 | recs = [] 276 | for idx, l, row in zip(test_ids, true_labels, predictions): 277 | gt = i2l[np.argmax(l)] 278 | order = (np.argsort(row)[::-1])[:3] 279 | pred = [i2l[i] for i in order] 280 | recs.append( (idx, gt, pred) ) 281 | 282 | out_fname = os.path.join(out_dir, task_name+'.jsonl') 283 | with open(out_fname, 'w') as fw: 284 | for rec in recs: 285 | data = {index_id:rec[0], label_id:rec[1], 'prediction':rec[2]} 286 | json.dump(data, fw, ensure_ascii=False) 287 | fw.write('\n') 288 | del model; gc.collect(); torch.cuda.empty_cache() 289 | 290 | dev_acc, dev_hit_at3 = cv_res[0] 291 | print ('\ntask scores {}: {:.2f}/{:.2f}'.format(task_name, dev_acc, dev_hit_at3)) 292 | -------------------------------------------------------------------------------- /code/bert/utils.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import os 3 | import random 4 | import torch 5 | import numpy as np 6 | 7 | def seed_everything(seed): 8 | random.seed(seed) 9 | os.environ['PYTHONHASHSEED'] = str(seed) 10 | np.random.seed(seed) 11 | torch.manual_seed(seed) 12 | torch.cuda.manual_seed_all(seed) 13 | torch.cuda.manual_seed(seed) 14 | torch.backends.cudnn.deterministic = True 15 | 
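# also disable cuDNN benchmark autotuning, so kernel selection stays deterministic across runs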
torch.backends.cudnn.benchmark = False 16 | 17 | def seed_worker(worker_id): 18 | worker_seed = torch.initial_seed() % 2**32 19 | np.random.seed(worker_seed) 20 | random.seed(worker_seed) 21 | 22 | 23 | from torch import nn 24 | import torch.nn.functional as F 25 | from transformers import BertTokenizer, BertConfig, BertPreTrainedModel, BertModel 26 | 27 | class PoolBertForTokenClassification(BertPreTrainedModel): 28 | def __init__(self, config): 29 | super().__init__(config) 30 | self.num_labels = config.num_labels 31 | 32 | self.bert = BertModel(config, add_pooling_layer=False) 33 | self.dropout = nn.Dropout(config.hidden_dropout_prob) 34 | self.classifier = nn.Linear(config.hidden_size*3, config.num_labels) 35 | 36 | self.w_size = 4 37 | 38 | self.init_weights() 39 | 40 | def forward( 41 | self, 42 | input_ids=None, 43 | attention_mask=None, 44 | token_type_ids=None, 45 | position_ids=None, 46 | head_mask=None, 47 | inputs_embeds=None, 48 | labels=None, 49 | output_attentions=None, 50 | output_hidden_states=None, 51 | return_dict=None, 52 | ): 53 | outputs = self.bert( 54 | input_ids, 55 | attention_mask=attention_mask, 56 | token_type_ids=token_type_ids, 57 | position_ids=position_ids, 58 | head_mask=head_mask, 59 | inputs_embeds=inputs_embeds, 60 | output_attentions=output_attentions, 61 | output_hidden_states=output_hidden_states, 62 | return_dict=return_dict, 63 | ) 64 | 65 | sequence_output = outputs['last_hidden_state'] 66 | 67 | shape = list(sequence_output.shape) 68 | shape[1]+=self.w_size-1 69 | 70 | t_ext = torch.zeros(shape, dtype=sequence_output.dtype, device=sequence_output.device) 71 | t_ext[:, self.w_size-1:, :] = sequence_output 72 | 73 | unfold_t = t_ext.unfold(1, self.w_size, 1).transpose(3,2) 74 | pooled_output_mean = torch.mean(unfold_t, 2) 75 | 76 | pooled_output, _ = torch.max(unfold_t, 2) 77 | pooled_output = torch.relu(pooled_output) 78 | 79 | sequence_output = torch.cat((pooled_output, pooled_output_mean, sequence_output), 2) 80 | 81 | sequence_output = self.dropout(sequence_output) 82 | 83 | logits = self.classifier(sequence_output) 84 | 85 | loss = None 86 | if labels is not None: 87 | loss_fct = nn.CrossEntropyLoss() 88 | # Only keep active parts of the loss 89 | if attention_mask is not None: 90 | active_loss_mask = attention_mask.view(-1) == 1 91 | active_logits = logits.view(-1, self.num_labels) 92 | 93 | active_labels = torch.where( 94 | active_loss_mask, 95 | labels.view(-1), 96 | torch.tensor(loss_fct.ignore_index).type_as(labels) 97 | ) 98 | 99 | loss = loss_fct(active_logits, active_labels) 100 | else: 101 | loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) 102 | 103 | output = (logits,) + outputs[2:] 104 | return ((loss,) + output) if loss is not None else output 105 | 106 | class PoolBertForSequenceClassification(BertPreTrainedModel): 107 | def __init__(self, config): 108 | super().__init__(config) 109 | self.num_labels = config.num_labels 110 | 111 | self.bert = BertModel(config) 112 | self.dropout = nn.Dropout(config.hidden_dropout_prob) 113 | self.classifier = nn.Linear(config.hidden_size*3, self.config.num_labels) 114 | 115 | self.init_weights() 116 | 117 | def forward( 118 | self, 119 | input_ids=None, 120 | attention_mask=None, 121 | token_type_ids=None, 122 | position_ids=None, 123 | head_mask=None, 124 | inputs_embeds=None, 125 | labels=None, 126 | ): 127 | outputs = self.bert( 128 | input_ids, 129 | attention_mask=attention_mask, 130 | token_type_ids=token_type_ids, 131 | position_ids=position_ids, 132 | 
head_mask=head_mask, 133 | inputs_embeds=inputs_embeds, 134 | ) 135 | 136 | encoder_out = outputs['last_hidden_state'] 137 | cls = encoder_out[:, 0, :] 138 | 139 | pooled_output, _ = torch.max(encoder_out, 1) 140 | pooled_output = torch.relu(pooled_output) 141 | 142 | pooled_output_mean = torch.mean(encoder_out, 1) 143 | pooled_output = torch.cat((pooled_output, pooled_output_mean, cls), 1) 144 | 145 | pooled_output = self.dropout(pooled_output) 146 | logits = self.classifier(pooled_output) 147 | 148 | outputs = (logits,) + outputs[2:] # add hidden states and attention if they are here 149 | 150 | if labels is not None: 151 | if self.num_labels == 1: 152 | # We are doing regression 153 | loss_fct = nn.MSELoss() 154 | loss = loss_fct(logits.view(-1), labels.view(-1)) 155 | else: 156 | loss = F.binary_cross_entropy_with_logits( logits.view(-1), labels.view(-1) ) 157 | outputs = (loss,) + outputs 158 | 159 | return outputs # (loss), logits, (hidden_states), (attentions) 160 | -------------------------------------------------------------------------------- /code/bilstm/README.md: -------------------------------------------------------------------------------- 1 | This directory contains a BiLSTM model with randomly initialized embeddings. 2 | 3 | ### How to run 4 | 5 | Run `./run.sh`, or run the model on individual tasks separately, e.g. 6 | 7 | ```bash 8 | python single_text_classifier.py --task-name='RuMedSymptomRec' --device=0 9 | ``` 10 | 11 | The models write their results in `.jsonl` format to the output directory `out`. 12 | -------------------------------------------------------------------------------- /code/bilstm/double_text_classifier.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | import json 3 | import pathlib 4 | 5 | import click 6 | import numpy as np 7 | import pandas as pd 8 | from sklearn.metrics import accuracy_score 9 | import torch 10 | from torch import nn 11 | from torch.utils.data import DataLoader 12 | from tqdm import tqdm 13 | 14 | from utils import preprocess, seed_everything, seed_worker, DataPreprocessor 15 | 16 | SEED = 101 17 | seed_everything(SEED) 18 | class Classifier(nn.Module): 19 | 20 | def __init__(self, n_classes, vocab_size, emb_dim=300, hidden_dim=256): 21 | 22 | super().__init__() 23 | 24 | self.emb_dim = emb_dim 25 | self.hidden_dim = hidden_dim 26 | 27 | self.embedding_layer = nn.Embedding(vocab_size, self.emb_dim) 28 | self.lstm_layer = nn.LSTM(self.emb_dim, self.hidden_dim, batch_first=True, num_layers=2, 29 | bidirectional=True) 30 | self.linear_layer = nn.Linear(self.hidden_dim * 2, n_classes) 31 | 32 | def forward(self, x): 33 | x = self.embedding_layer(x) 34 | _, (hidden, _) = self.lstm_layer(x) 35 | hidden = torch.cat([hidden[0, :, :], hidden[1, :, :]], axis=1) 36 | return self.linear_layer(hidden) 37 | 38 | 39 | def preprocess_two_seqs(text1, text2, seq_len): 40 | seq1_len = int(seq_len * 0.75) 41 | seq2_len = seq_len - seq1_len 42 | 43 | tokens1 = preprocess(text1)[:seq1_len] 44 | tokens2 = preprocess(text2)[:seq2_len] 45 | 46 | return tokens1 + tokens2 47 | 48 | 49 | def build_vocab(text_data, min_freq=1): 50 | word2freq = defaultdict(int) 51 | word2index = {'PAD': 0, 'UNK': 1} 52 | 53 | for text in text_data: 54 | for token in text: 55 | word2freq[token] += 1 56 | 57 | for word, freq in word2freq.items(): 58 | if freq > min_freq: 59 | word2index[word] = len(word2index) 60 | return word2index 61 | 62 | 63 | def train_step(data, model, optimizer, criterion, device,
losses, epoch): 64 | 65 | model.train() 66 | 67 | pbar = tqdm(total=len(data.dataset), desc=f'Epoch: {epoch + 1}') 68 | 69 | for x, y in data: 70 | 71 | x = x.to(device) 72 | y = y.to(device) 73 | 74 | optimizer.zero_grad() 75 | pred = model(x) 76 | 77 | loss = criterion(pred, y) 78 | 79 | loss.backward() 80 | optimizer.step() 81 | 82 | losses.append(loss.item()) 83 | 84 | pbar.set_postfix(train_loss = np.mean(losses[-100:])) 85 | pbar.update(x.shape[0]) 86 | 87 | pbar.close() 88 | 89 | return losses 90 | 91 | def eval_step(data, model, criterion, device, mode='dev'): 92 | 93 | test_losses = [] 94 | test_preds = [] 95 | test_true = [] 96 | 97 | pbar = tqdm(total=len(data.dataset), desc=f'Predictions on {mode} set') 98 | 99 | model.eval() 100 | 101 | for x, y in data: 102 | 103 | x = x.to(device) 104 | y = y.to(device) 105 | 106 | with torch.no_grad(): 107 | 108 | pred = model(x) 109 | 110 | loss = criterion(pred, y) 111 | test_losses.append(loss.item()) 112 | 113 | test_preds.append(torch.argmax(pred, dim=1).cpu().numpy()) 114 | test_true.append(y.cpu().numpy()) 115 | 116 | pbar.update(x.shape[0]) 117 | pbar.close() 118 | 119 | test_preds = np.concatenate(test_preds) 120 | 121 | if mode == 'dev': 122 | test_true = np.concatenate(test_true) 123 | mean_test_loss = np.mean(test_losses) 124 | accuracy = round(accuracy_score(test_true, test_preds) * 100, 2) 125 | return mean_test_loss, accuracy 126 | 127 | else: 128 | return test_preds 129 | 130 | 131 | def train(train_data, dev_data, model, optimizer, criterion, device, n_epochs=50, max_patience=3): 132 | 133 | losses = [] 134 | best_accuracy = 0. 135 | 136 | patience = 0 137 | best_test_loss = 10. 138 | 139 | for epoch in range(n_epochs): 140 | 141 | losses = train_step(train_data, model, optimizer, criterion, device, losses, epoch) 142 | mean_dev_loss, accuracy = eval_step(dev_data, model, criterion, device) 143 | 144 | if accuracy > best_accuracy: 145 | best_accuracy = accuracy 146 | 147 | print(f'\nDev loss: {mean_dev_loss} \naccuracy: {accuracy}') 148 | 149 | if mean_dev_loss < best_test_loss: 150 | best_test_loss = mean_dev_loss 151 | elif patience == max_patience: 152 | print(f'Dev loss did not improve in {patience} epochs, early stopping') 153 | break 154 | else: 155 | patience += 1 156 | return best_accuracy 157 | 158 | 159 | @click.command() 160 | @click.option('--task-name', 161 | default='RuMedNLI', 162 | type=click.Choice(['RuMedDaNet', 'RuMedNLI']), 163 | help='The name of the task to run.') 164 | @click.option('--device', 165 | default=-1, 166 | help='Gpu to train the model on.') 167 | @click.option('--seq-len', 168 | default=256, 169 | help='Max sequence length.') 170 | def main(task_name, device, seq_len): 171 | print(f'\n{task_name} task') 172 | 173 | base_path = pathlib.Path(__file__).absolute().parent.parent.parent 174 | out_path = base_path / 'code' / 'bilstm' / 'out' 175 | data_path = base_path / 'data' / task_name 176 | 177 | train_data = pd.read_json(data_path / 'train_v1.jsonl', lines=True) 178 | dev_data = pd.read_json(data_path / 'dev_v1.jsonl', lines=True) 179 | test_data = pd.read_json(data_path / 'test_v1.jsonl', lines=True) 180 | 181 | index_id = 'pairID' 182 | if task_name == 'RuMedNLI': 183 | l2i = {'neutral': 0, 'entailment': 1, 'contradiction': 2} 184 | text1_id = 'ru_sentence1' 185 | text2_id = 'ru_sentence2' 186 | label_id = 'gold_label' 187 | 188 | elif task_name == 'RuMedDaNet': 189 | l2i = {'нет': 0, 'да': 1} 190 | text1_id = 'context' 191 | text2_id = 'question' 192 | label_id = 'answer' 193 | else: 194 
| raise ValueError('unknown task') 195 | 196 | i2l = {i: label for label, i in l2i.items()} 197 | 198 | text_data_train = [preprocess_two_seqs(text1, text2, seq_len) for text1, text2 in \ 199 | zip(train_data[text1_id], train_data[text2_id])] 200 | text_data_dev = [preprocess_two_seqs(text1, text2, seq_len) for text1, text2 in \ 201 | zip(dev_data[text1_id], dev_data[text2_id])] 202 | text_data_test = [preprocess_two_seqs(text1, text2, seq_len) for text1, text2 in \ 203 | zip(test_data[text1_id], test_data[text2_id])] 204 | 205 | word2index = build_vocab(text_data_train, min_freq=0) 206 | print(f'Total: {len(word2index)} tokens') 207 | 208 | train_dataset = DataPreprocessor(text_data_train, train_data[label_id], word2index, l2i, \ 209 | sequence_length=seq_len, preprocessing=False) 210 | dev_dataset = DataPreprocessor(text_data_dev, dev_data[label_id], word2index, l2i, \ 211 | sequence_length=seq_len, preprocessing=False) 212 | test_dataset = DataPreprocessor(text_data_test, test_data[label_id], word2index, l2i, \ 213 | sequence_length=seq_len, preprocessing=False) 214 | 215 | gen = torch.Generator() 216 | gen.manual_seed(SEED) 217 | train_dataset = DataLoader(train_dataset, batch_size=64, worker_init_fn=seed_worker, generator=gen) 218 | dev_dataset = DataLoader(dev_dataset, batch_size=64, worker_init_fn=seed_worker, generator=gen) 219 | test_dataset = DataLoader(test_dataset, batch_size=64, worker_init_fn=seed_worker, generator=gen) 220 | 221 | if device == -1: 222 | device = torch.device('cpu') 223 | else: 224 | device = torch.device(device) 225 | 226 | model = Classifier(n_classes=len(l2i), vocab_size=len(word2index)) 227 | criterion = nn.CrossEntropyLoss() 228 | optimizer = torch.optim.Adam(params=model.parameters()) 229 | 230 | model = model.to(device) 231 | criterion = criterion.to(device) 232 | 233 | accuracy = train(train_dataset, dev_dataset, model, optimizer, criterion, device) 234 | print (f'\n{task_name} task score on dev set: {accuracy}') 235 | 236 | test_preds = eval_step(test_dataset, model, criterion, device, mode='test') 237 | 238 | recs = [] 239 | for i, true, pred in zip(test_data[index_id], test_data[label_id], test_preds): 240 | recs.append({index_id: i, label_id: true, 'prediction': i2l[pred]}) 241 | 242 | out_fname = out_path / f'{task_name}.jsonl' 243 | with open(out_fname, 'w') as fw: 244 | for rec in recs: 245 | json.dump(rec, fw, ensure_ascii=False) 246 | fw.write('\n') 247 | 248 | 249 | if __name__ == '__main__': 250 | main() 251 | -------------------------------------------------------------------------------- /code/bilstm/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | out=$(pwd)'/out' 4 | mkdir -p $out 5 | 6 | python -u single_text_classifier.py --task-name 'RuMedTop3' --device 0 7 | python -u single_text_classifier.py --task-name 'RuMedSymptomRec' --device 0 8 | python -u double_text_classifier.py --task-name 'RuMedDaNet' --device 0 9 | python -u double_text_classifier.py --task-name 'RuMedNLI' --device 0 10 | python -u token_classifier.py --task-name='RuMedNER' --device 0 11 | -------------------------------------------------------------------------------- /code/bilstm/single_text_classifier.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | import json 3 | import pathlib 4 | 5 | import click 6 | import numpy as np 7 | import pandas as pd 8 | import torch 9 | from torch import nn 10 | from torch.utils.data import 
DataLoader 11 | from tqdm import tqdm 12 | 13 | from utils import preprocess, seed_everything, seed_worker, DataPreprocessor 14 | 15 | SEED = 101 16 | seed_everything(SEED) 17 | class Classifier(nn.Module): 18 | 19 | def __init__(self, n_classes, vocab_size, emb_dim=300, hidden_dim=256): 20 | 21 | super().__init__() 22 | 23 | self.emb_dim = emb_dim 24 | self.hidden_dim = hidden_dim 25 | 26 | self.embedding_layer = nn.Embedding(vocab_size, self.emb_dim) 27 | self.lstm_layer = nn.LSTM(self.emb_dim, self.hidden_dim, batch_first=True, num_layers=2, 28 | bidirectional=True) 29 | self.linear_layer = nn.Linear(self.hidden_dim * 2, n_classes) 30 | 31 | def forward(self, x): 32 | x = self.embedding_layer(x) 33 | _, (hidden, _) = self.lstm_layer(x) 34 | hidden = torch.cat([hidden[0, :, :], hidden[1, :, :]], axis=1) 35 | return self.linear_layer(hidden) 36 | 37 | 38 | def hit_at_n(y_true, y_pred, n=3): 39 | assert len(y_true) == len(y_pred) 40 | hit_count = 0 41 | for l, row in zip(y_true, y_pred): 42 | order = (np.argsort(row)[::-1])[:n] 43 | hit_count += int(l in order) 44 | return round(hit_count / float(len(y_true)) * 100, 2) 45 | 46 | 47 | def logits2codes(logits, i2l, n=3): 48 | codes = [] 49 | for row in logits: 50 | order = np.argsort(row)[::-1] 51 | codes.append([i2l[i] for i in order[:n]]) 52 | return codes 53 | 54 | 55 | def build_vocab(text_data, min_freq=1): 56 | word2freq = defaultdict(int) 57 | word2index = {'PAD': 0, 'UNK': 1} 58 | 59 | for text in text_data: 60 | for t in preprocess(text): 61 | word2freq[t] += 1 62 | 63 | for word, freq in word2freq.items(): 64 | if freq > min_freq: 65 | word2index[word] = len(word2index) 66 | return word2index 67 | 68 | 69 | def train_step(data, model, optimizer, criterion, device, losses, epoch): 70 | 71 | model.train() 72 | 73 | pbar = tqdm(total=len(data.dataset), desc=f'Epoch: {epoch + 1}') 74 | 75 | for x, y in data: 76 | 77 | x = x.to(device) 78 | y = y.to(device) 79 | 80 | optimizer.zero_grad() 81 | pred = model(x) 82 | 83 | loss = criterion(pred, y) 84 | 85 | loss.backward() 86 | optimizer.step() 87 | 88 | losses.append(loss.item()) 89 | 90 | pbar.set_postfix(train_loss = np.mean(losses[-100:])) 91 | pbar.update(x.shape[0]) 92 | 93 | pbar.close() 94 | 95 | return losses 96 | 97 | def eval_step(data, model, criterion, device, mode='dev'): 98 | 99 | test_losses = [] 100 | test_preds = [] 101 | test_true = [] 102 | 103 | pbar = tqdm(total=len(data.dataset), desc=f'Predictions on {mode} set') 104 | 105 | model.eval() 106 | 107 | for x, y in data: 108 | 109 | x = x.to(device) 110 | y = y.to(device) 111 | 112 | with torch.no_grad(): 113 | 114 | pred = model(x) 115 | 116 | loss = criterion(pred, y) 117 | test_losses.append(loss.item()) 118 | 119 | test_preds.append(pred.cpu().numpy()) 120 | test_true.append(y.cpu().numpy()) 121 | 122 | pbar.update(x.shape[0]) 123 | pbar.close() 124 | 125 | test_preds = np.concatenate(test_preds) 126 | 127 | if mode == 'dev': 128 | test_true = np.concatenate(test_true) 129 | mean_test_loss = np.mean(test_losses) 130 | accuracy = hit_at_n(test_true, test_preds, n=1) 131 | hit_3 = hit_at_n(test_true, test_preds, n=3) 132 | return mean_test_loss, accuracy, hit_3 133 | 134 | else: 135 | return test_preds 136 | 137 | 138 | def train(train_data, dev_data, model, optimizer, criterion, device, n_epochs=50, max_patience=3): 139 | 140 | losses = [] 141 | best_metrics = [0.0, 0.0] 142 | 143 | patience = 0 144 | best_test_loss = 10. 
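# early-stopping state: training halts once the dev loss fails to improve for max_patience epochs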
145 | 146 | for epoch in range(n_epochs): 147 | 148 | losses = train_step(train_data, model, optimizer, criterion, device, losses, epoch) 149 | mean_dev_loss, accuracy, hit_3 = eval_step(dev_data, model, criterion, device) 150 | 151 | if accuracy > best_metrics[0] and hit_3 > best_metrics[1]: 152 | best_metrics = [accuracy, hit_3] 153 | 154 | print(f'\nDev loss: {mean_dev_loss} \naccuracy: {accuracy}, hit@3: {hit_3}') 155 | 156 | if mean_dev_loss < best_test_loss: 157 | best_test_loss = mean_dev_loss 158 | elif patience == max_patience: 159 | print(f'Dev loss did not improve in {patience} epochs, early stopping') 160 | break 161 | else: 162 | patience += 1 163 | return best_metrics 164 | 165 | 166 | @click.command() 167 | @click.option('--task-name', 168 | default='RuMedTop3', 169 | type=click.Choice(['RuMedTop3', 'RuMedSymptomRec']), 170 | help='The name of the task to run.') 171 | @click.option('--device', 172 | default=-1, 173 | help='Gpu to train the model on.') 174 | def main(task_name, device): 175 | print(f'\n{task_name} task') 176 | 177 | base_path = pathlib.Path(__file__).absolute().parent.parent.parent 178 | out_path = base_path / 'code' / 'bilstm' / 'out' 179 | data_path = base_path / 'data' / task_name 180 | 181 | train_data = pd.read_json(data_path / 'train_v1.jsonl', lines=True) 182 | dev_data = pd.read_json(data_path / 'dev_v1.jsonl', lines=True) 183 | test_data = pd.read_json(data_path / 'test_v1.jsonl', lines=True) 184 | 185 | text_id = 'symptoms' 186 | label_id = 'code' 187 | index_id = 'idx' 188 | 189 | i2l = dict(enumerate(sorted(train_data[label_id].unique()))) 190 | l2i = {label: i for i, label in i2l.items()} 191 | 192 | word2index = build_vocab(train_data[text_id], min_freq=0) 193 | print(f'Total: {len(word2index)} tokens') 194 | 195 | train_dataset = DataPreprocessor(train_data[text_id], train_data[label_id], word2index, l2i) 196 | dev_dataset = DataPreprocessor(dev_data[text_id], dev_data[label_id], word2index, l2i) 197 | test_dataset = DataPreprocessor(test_data[text_id], test_data[label_id], word2index, l2i) 198 | 199 | gen = torch.Generator() 200 | gen.manual_seed(SEED) 201 | train_dataset = DataLoader(train_dataset, batch_size=64, worker_init_fn=seed_worker, generator=gen) 202 | dev_dataset = DataLoader(dev_dataset, batch_size=64, worker_init_fn=seed_worker, generator=gen) 203 | test_dataset = DataLoader(test_dataset, batch_size=64, worker_init_fn=seed_worker, generator=gen) 204 | 205 | if device == -1: 206 | device = torch.device('cpu') 207 | else: 208 | device = torch.device(device) 209 | 210 | model = Classifier(n_classes=len(l2i), vocab_size=len(word2index)) 211 | criterion = nn.CrossEntropyLoss() 212 | optimizer = torch.optim.Adam(params=model.parameters()) 213 | 214 | model = model.to(device) 215 | criterion = criterion.to(device) 216 | 217 | accuracy, hit_3 = train(train_dataset, dev_dataset, model, optimizer, criterion, device) 218 | print (f'\n{task_name} task scores on dev set: {accuracy} / {hit_3}') 219 | 220 | test_logits = eval_step(test_dataset, model, criterion, device, mode='test') 221 | test_codes = logits2codes(test_logits, i2l) 222 | 223 | recs = [] 224 | for i, true, pred in zip(test_data[index_id], test_data[label_id], test_codes): 225 | recs.append({index_id: i, label_id: true, 'prediction': pred}) 226 | 227 | out_fname = out_path / f'{task_name}.jsonl' 228 | with open(out_fname, 'w') as fw: 229 | for rec in recs: 230 | json.dump(rec, fw, ensure_ascii=False) 231 | fw.write('\n') 232 | 233 | 234 | if __name__ == '__main__': 235 | main() 
236 | -------------------------------------------------------------------------------- /code/bilstm/token_classifier.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import torch 3 | from torch import nn 4 | from torch.optim import AdamW 5 | from torchtext import data 6 | from torchtext.data import Field, BucketIterator 7 | 8 | import os 9 | import click 10 | import json 11 | import random 12 | import numpy as np 13 | import pandas as pd 14 | 15 | from seqeval.metrics import f1_score, accuracy_score 16 | 17 | def seed_everything(seed): 18 | os.environ['PYTHONHASHSEED'] = str(seed) 19 | os.environ['CUDA_LAUNCH_BLOCKING'] = '1' 20 | np.random.seed(seed) 21 | random.seed(seed) 22 | torch.manual_seed(seed) 23 | torch.cuda.manual_seed_all(seed) 24 | torch.cuda.manual_seed(seed) 25 | torch.backends.cudnn.deterministic = True 26 | torch.backends.cudnn.benchmark = False 27 | 28 | SEED = 101 29 | seed_everything(SEED) 30 | 31 | class SequenceTaggingDataset(data.Dataset): 32 | @staticmethod 33 | def sort_key(example): 34 | for attr in dir(example): 35 | if not callable(getattr(example, attr)) and not attr.startswith('__'): 36 | return len(getattr(example, attr)) 37 | return 0 38 | 39 | def __init__(self, list_of_lists, fields, **kwargs): 40 | examples = [] 41 | columns = [] 42 | for tup in list_of_lists: 43 | columns = list(tup) 44 | examples.append(data.Example.fromlist(columns, fields)) 45 | 46 | super(SequenceTaggingDataset, self).__init__(examples, fields, **kwargs) 47 | 48 | class Corpus(object): 49 | def __init__(self, input_folder, min_word_freq, batch_size): 50 | # list all the fields 51 | self.word_field = Field(lower=True) 52 | self.tag_field = Field(unk_token=None) 53 | 54 | parts = ['train', 'dev'] 55 | p2data = {} 56 | for p in parts: 57 | fname = os.path.join(input_folder, '{}_v1.jsonl'.format(p)) 58 | paired_lists = [] 59 | with open(fname) as f: 60 | for line in f: 61 | data = json.loads(line) 62 | paired_lists.append( (data['tokens'], data['ner_tags']) ) 63 | p2data[p] = paired_lists 64 | 65 | field_values = (('word', self.word_field), ('tag', self.tag_field)) 66 | 67 | self.train_dataset = SequenceTaggingDataset( p2data['train'], fields=field_values ) 68 | self.dev_dataset = SequenceTaggingDataset( p2data['dev'], fields=field_values ) 69 | 70 | # convert fields to vocabulary list 71 | self.word_field.build_vocab(self.train_dataset.word, min_freq=min_word_freq) 72 | self.tag_field.build_vocab(self.train_dataset.tag) 73 | # create iterator for batch input 74 | self.train_iter, self.dev_iter = BucketIterator.splits( 75 | datasets=(self.train_dataset, self.dev_dataset), 76 | batch_size=batch_size 77 | ) 78 | # prepare padding index to be ignored during model training/evaluation 79 | self.word_pad_idx = self.word_field.vocab.stoi[self.word_field.pad_token] 80 | self.tag_pad_idx = self.tag_field.vocab.stoi[self.tag_field.pad_token] 81 | 82 | class BiLSTM(nn.Module): 83 | def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, lstm_layers, 84 | emb_dropout, lstm_dropout, fc_dropout, word_pad_idx): 85 | super().__init__() 86 | self.embedding_dim = embedding_dim 87 | # LAYER 1: Embedding 88 | self.embedding = nn.Embedding( 89 | num_embeddings=input_dim, 90 | embedding_dim=embedding_dim, 91 | padding_idx=word_pad_idx 92 | ) 93 | self.emb_dropout = nn.Dropout(emb_dropout) 94 | # LAYER 2: BiLSTM 95 | self.lstm = nn.LSTM( 96 | input_size=embedding_dim, 97 | hidden_size=hidden_dim, 98 | num_layers=lstm_layers, 99 | 
bidirectional=True, 100 | dropout=lstm_dropout if lstm_layers > 1 else 0 101 | ) 102 | # LAYER 3: Fully-connected 103 | self.fc_dropout = nn.Dropout(fc_dropout) 104 | self.fc = nn.Linear(hidden_dim * 2, output_dim) # times 2 for bidirectional 105 | 106 | def forward(self, sentence): 107 | # sentence = [sentence length, batch size] 108 | # embedding_out = [sentence length, batch size, embedding dim] 109 | embedding_out = self.emb_dropout(self.embedding(sentence)) 110 | # lstm_out = [sentence length, batch size, hidden dim * 2] 111 | lstm_out, _ = self.lstm(embedding_out) 112 | # ner_out = [sentence length, batch size, output dim] 113 | ner_out = self.fc(self.fc_dropout(lstm_out)) 114 | return ner_out 115 | 116 | def init_weights(self): 117 | # to initialize all parameters from normal distribution 118 | # helps with converging during training 119 | for name, param in self.named_parameters(): 120 | nn.init.normal_(param.data, mean=0, std=0.1) 121 | 122 | def init_embeddings(self, word_pad_idx): 123 | # initialize embedding for padding as zero 124 | self.embedding.weight.data[word_pad_idx] = torch.zeros(self.embedding_dim) 125 | 126 | def count_parameters(self): 127 | return sum(p.numel() for p in self.parameters() if p.requires_grad) 128 | 129 | class NER(object): 130 | def __init__(self, model, data, optimizer_cls, loss_fn_cls, device=torch.device('cpu')): 131 | self.device = device 132 | self.model = model 133 | self.data = data 134 | self.optimizer = optimizer_cls(model.parameters(), lr=0.0015, weight_decay=0.01) 135 | self.loss_fn = loss_fn_cls(ignore_index=self.data.tag_pad_idx) 136 | self.loss_fn = self.loss_fn.to(self.device) 137 | 138 | def accuracy(self, preds, y): 139 | max_preds = preds.argmax(dim=1, keepdim=True) # get the index of the max probability 140 | non_pad_elements = (y != self.data.tag_pad_idx).nonzero() # prepare masking for paddings 141 | correct = max_preds[non_pad_elements].squeeze(1).eq(y[non_pad_elements]) 142 | denom = torch.tensor([y[non_pad_elements].shape[0]], dtype=torch.float, device=y.device) 143 | return correct.sum() / denom 144 | 145 | def epoch(self): 146 | epoch_loss = 0 147 | epoch_acc = 0 148 | self.model.train() 149 | for batch in self.data.train_iter: 150 | # text = [sent len, batch size] 151 | text = batch.word.to(self.device) 152 | # tags = [sent len, batch size] 153 | true_tags = batch.tag.to(self.device) 154 | self.optimizer.zero_grad() 155 | pred_tags = self.model(text) 156 | # to calculate the loss and accuracy, we flatten both prediction and true tags 157 | # flatten pred_tags to [sent len * batch size, output dim] 158 | pred_tags = pred_tags.view(-1, pred_tags.shape[-1]) 159 | # flatten true_tags to [sent len * batch size] 160 | true_tags = true_tags.view(-1) 161 | batch_loss = self.loss_fn(pred_tags, true_tags) 162 | batch_acc = self.accuracy(pred_tags, true_tags) 163 | batch_loss.backward() 164 | self.optimizer.step() 165 | epoch_loss += batch_loss.item() 166 | epoch_acc += batch_acc.item() 167 | return epoch_loss / len(self.data.train_iter), epoch_acc / len(self.data.train_iter) 168 | 169 | def evaluate(self, iterator): 170 | epoch_loss = 0 171 | epoch_acc = 0 172 | self.model.eval() 173 | cum = 0 174 | whole_gt_seq, whole_pred_seq = [], [] 175 | with torch.no_grad(): 176 | # similar to epoch() but model is in evaluation mode and no backprop 177 | for batch in iterator: 178 | text = batch.word.to(self.device) 179 | true_tags = batch.tag.to(self.device) 180 | pred_tags = self.model(text) 181 | 182 | #[sentence length, batch size, output dim] 183 | for i, (row,
tag_row) in enumerate(zip(text.T, true_tags.T)): 184 | mask = row!=1 185 | gt_seq = [self.data.tag_field.vocab.itos[j.item()] for j in tag_row[mask]] 186 | pred_idx = pred_tags[:,i,:].argmax(-1)[mask] 187 | pred_seq = [self.data.tag_field.vocab.itos[j.item()] for j in pred_idx] 188 | whole_gt_seq.append(gt_seq) 189 | whole_pred_seq.append(pred_seq) 190 | pred_tags = pred_tags.view(-1, pred_tags.shape[-1]) 191 | 192 | true_tags = true_tags.view(-1) 193 | batch_loss = self.loss_fn(pred_tags, true_tags) 194 | batch_acc = self.accuracy(pred_tags, true_tags) 195 | epoch_loss += batch_loss.item() 196 | epoch_acc += batch_acc.item() 197 | acc = accuracy_score(whole_gt_seq, whole_pred_seq) 198 | f1 = f1_score(whole_gt_seq, whole_pred_seq) 199 | return epoch_loss / len(iterator), acc, f1, whole_gt_seq, whole_pred_seq 200 | 201 | def train(self, n_epochs): 202 | for epoch in range(n_epochs): 203 | train_loss, train_acc = self.epoch() 204 | dev_loss, dev_acc, dev_f1, _, _ = self.evaluate(self.data.dev_iter) 205 | print (f'Epoch {epoch:02d}\t| Dev Loss: {dev_loss:.3f} | Dev Acc: {dev_acc * 100:.2f}% | Dev F1: {dev_f1 * 100:.2f}%') 206 | 207 | def infer(self, tokens): 208 | tokens = [t.lower() for t in tokens] 209 | self.model.eval() 210 | # transform to indices based on corpus vocab 211 | numericalized_tokens = [self.data.word_field.vocab.stoi[t] for t in tokens] 212 | # begin prediction 213 | token_tensor = torch.LongTensor(numericalized_tokens) 214 | token_tensor = token_tensor.unsqueeze(-1) 215 | predictions = self.model(token_tensor.to(self.device)) 216 | # convert results to tags 217 | top_predictions = predictions.argmax(-1) 218 | predicted_tags = [self.data.tag_field.vocab.itos[t.item()] for t in top_predictions] 219 | return predicted_tags 220 | 221 | @click.command() 222 | @click.option('--task-name', 223 | default='RuMedNER', 224 | type=click.Choice(['RuMedNER']), 225 | help='The name of the task to run.') 226 | @click.option('--device', 227 | default=-1, 228 | help='Gpu to train the model on.') 229 | def main(task_name, device): 230 | os.environ['CUDA_VISIBLE_DEVICES'] = str(device) 231 | device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 232 | 233 | print(f'\n{task_name} task') 234 | 235 | base_path = os.path.abspath( os.path.join(os.path.dirname( __file__ ) ) ) 236 | out_dir = os.path.join(base_path, 'out') 237 | 238 | base_path = os.path.abspath( os.path.join(base_path, '../..') ) 239 | 240 | data_path = os.path.join(base_path, 'data', task_name) 241 | 242 | corpus = Corpus( 243 | input_folder=data_path, 244 | min_word_freq=1, 245 | batch_size=32 246 | ) 247 | print (f'Train set: {len(corpus.train_dataset)} sentences') 248 | print (f'Dev set: {len(corpus.dev_dataset)} sentences') 249 | 250 | bilstm = BiLSTM( 251 | input_dim=len(corpus.word_field.vocab), 252 | embedding_dim=300, 253 | hidden_dim=256, 254 | output_dim=len(corpus.tag_field.vocab), 255 | lstm_layers=2, 256 | emb_dropout=0.5, 257 | lstm_dropout=0.1, 258 | fc_dropout=0.25, 259 | word_pad_idx=corpus.word_pad_idx 260 | ) 261 | 262 | bilstm.init_weights() 263 | bilstm.init_embeddings(word_pad_idx=corpus.word_pad_idx) 264 | print (f'The model has {bilstm.count_parameters():,} trainable parameters.') 265 | print (bilstm) 266 | 267 | ner = NER( 268 | model=bilstm.to(device), 269 | data=corpus, 270 | optimizer_cls=AdamW, 271 | loss_fn_cls=nn.CrossEntropyLoss, 272 | device=device 273 | ) 274 | 275 | ner.train(20) 276 | 277 | test_data = pd.read_json(os.path.join(data_path, 'test_v1.jsonl'), lines=True) 278 | 279 | 
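# write the test-set predictions in the benchmark's jsonl format ('idx', 'ner_tags', 'prediction')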
out_fname = os.path.join(out_dir, task_name+'.jsonl') 280 | with open(out_fname, 'w') as fw: 281 | for i, true, tokens in zip(test_data.idx, test_data.ner_tags, test_data.tokens): 282 | prediction = ner.infer(tokens) 283 | rec = {'idx': i, 'ner_tags': true, 'prediction': prediction} 284 | json.dump(rec, fw, ensure_ascii=False) 285 | fw.write('\n') 286 | 287 | if __name__ == '__main__': 288 | main() 289 | -------------------------------------------------------------------------------- /code/bilstm/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | from string import punctuation 3 | import random 4 | 5 | from nltk.tokenize import ToktokTokenizer 6 | import numpy as np 7 | import pandas as pd 8 | import torch 9 | from torch.utils.data import Dataset 10 | 11 | from typing import List, Dict, Union, Tuple, Set, Any 12 | 13 | TOKENIZER = ToktokTokenizer() 14 | 15 | 16 | def seed_everything(seed): 17 | os.environ['PYTHONHASHSEED'] = str(seed) 18 | os.environ['CUDA_LAUNCH_BLOCKING'] = '1' 19 | np.random.seed(seed) 20 | random.seed(seed) 21 | torch.manual_seed(seed) 22 | torch.cuda.manual_seed_all(seed) 23 | torch.cuda.manual_seed(seed) 24 | torch.backends.cudnn.deterministic = True 25 | torch.backends.cudnn.benchmark = False 26 | 27 | 28 | def seed_worker(worker_id): 29 | worker_seed = torch.initial_seed() % 2**32 30 | np.random.seed(worker_seed) 31 | random.seed(worker_seed) 32 | 33 | 34 | def preprocess(text, tokenizer=TOKENIZER): 35 | res = [] 36 | tokens = tokenizer.tokenize(text.lower()) 37 | for t in tokens: 38 | if t not in punctuation: 39 | res.append(t.strip(punctuation)) 40 | return res 41 | 42 | 43 | class DataPreprocessor(Dataset): 44 | 45 | def __init__(self, x_data, y_data, word2index, label2index, 46 | sequence_length=128, pad_token='PAD', unk_token='UNK', preprocessing=True): 47 | 48 | super().__init__() 49 | 50 | self.x_data = [] 51 | self.y_data = y_data.map(label2index) 52 | 53 | self.word2index = word2index 54 | self.sequence_length = sequence_length 55 | 56 | self.pad_token = pad_token 57 | self.unk_token = unk_token 58 | self.pad_index = self.word2index[self.pad_token] 59 | 60 | self.preprocessing = preprocessing 61 | 62 | self.load(x_data) 63 | 64 | def load(self, data): 65 | 66 | for text in data: 67 | if self.preprocessing: 68 | words = preprocess(text) 69 | else: 70 | words = text 71 | indexed_words = self.indexing(words) 72 | self.x_data.append(indexed_words) 73 | 74 | def indexing(self, tokenized_text): 75 | unk_index = self.word2index[self.unk_token] 76 | return [self.word2index.get(token, unk_index) for token in tokenized_text] 77 | 78 | def padding(self, sequence): 79 | sequence = sequence + [self.pad_index] * (max(self.sequence_length - len(sequence), 0)) 80 | return sequence[:self.sequence_length] 81 | 82 | def __len__(self): 83 | return len(self.x_data) 84 | 85 | def __getitem__(self, idx): 86 | x = self.x_data[idx] 87 | x = self.padding(x) 88 | x = torch.Tensor(x).long() 89 | 90 | y = self.y_data[idx] 91 | 92 | return x, y 93 | 94 | 95 | def preprocess_for_tokens( 96 | tokens: List[str] 97 | ) -> List[str]: 98 | 99 | return tokens 100 | 101 | class DataPreprocessorNer(Dataset): 102 | 103 | def __init__( 104 | self, 105 | x_data: pd.Series, 106 | y_data: pd.Series, 107 | word2index: Dict[str, int], 108 | label2index: Dict[str, int], 109 | sequence_length: int = 128, 110 | pad_token: str = 'PAD', 111 | unk_token: str = 'UNK' 112 | ) -> None: 113 | 114 | super().__init__() 115 | 116 | self.word2index = 
word2index 117 | self.label2index = label2index 118 | 119 | self.sequence_length = sequence_length 120 | self.pad_token = pad_token 121 | self.unk_token = unk_token 122 | self.pad_index = self.word2index[self.pad_token] 123 | self.unk_index = self.word2index[self.unk_token] 124 | 125 | self.x_data = self.load(x_data, self.word2index) 126 | self.y_data = self.load(y_data, self.label2index) 127 | 128 | 129 | def load( 130 | self, 131 | data: pd.Series, 132 | mapping: Dict[str, int] 133 | ) -> List[List[int]]: 134 | 135 | indexed_data = [] 136 | for case in data: 137 | processed_case = preprocess_for_tokens(case) 138 | indexed_case = self.indexing(processed_case, mapping) 139 | indexed_data.append(indexed_case) 140 | 141 | return indexed_data 142 | 143 | 144 | def indexing( 145 | self, 146 | tokenized_case: List[str], 147 | mapping: Dict[str, int] 148 | ) -> List[int]: 149 | 150 | return [mapping.get(token, self.unk_index) for token in tokenized_case] 151 | 152 | 153 | def padding( 154 | self, 155 | sequence: List[int] 156 | ) -> List[int]: 157 | sequence = sequence + [self.pad_index] * (max(self.sequence_length - len(sequence), 0)) 158 | return sequence[:self.sequence_length] 159 | 160 | 161 | def __len__(self): 162 | return len(self.x_data) 163 | 164 | 165 | def __getitem__( 166 | self, 167 | idx: int 168 | ) -> Tuple[torch.tensor, torch.tensor]: 169 | 170 | x = self.x_data[idx] 171 | y = self.y_data[idx] 172 | 173 | assert len(x) > 0 174 | 175 | x = self.padding(x) 176 | y = self.padding(y) 177 | 178 | x = torch.tensor(x, dtype=torch.int64) 179 | y = torch.tensor(y, dtype=torch.int64) 180 | 181 | return x, y -------------------------------------------------------------------------------- /code/eval.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import os 3 | import json 4 | import argparse 5 | import numpy as np 6 | from sklearn.metrics import accuracy_score 7 | from seqeval.metrics import f1_score 8 | from seqeval.metrics import accuracy_score as seq_accuracy_score 9 | 10 | def hit_at_3(y_true, y_pred): 11 | assert len(y_true) == len(y_pred) 12 | hit_count = 0 13 | for l, row in zip(y_true, y_pred): 14 | hit_count += l in row 15 | return hit_count/float(len(y_true)) 16 | 17 | if __name__ == '__main__': 18 | parser = argparse.ArgumentParser() 19 | parser.add_argument('--out_dir', 20 | default='out/', 21 | type=str, 22 | help='The output directory with task results.') 23 | args = parser.parse_args() 24 | 25 | out_dir = args.out_dir 26 | if not os.path.exists(out_dir): 27 | raise ValueError('{} directory does not exist'.format(out_dir)) 28 | 29 | files = set( os.listdir(out_dir) ) 30 | 31 | metrics = {} 32 | label_id = 'code' 33 | for task in ['RuMedTop3', 'RuMedSymptomRec']: 34 | fname = '{}.jsonl'.format(task) 35 | if fname in files: 36 | fname = os.path.join(out_dir, fname) 37 | with open(fname) as f: 38 | result = [json.loads(line) for line in list(f)] 39 | gt = [d[label_id] for d in result] 40 | top1 = [d['prediction'][0] for d in result] 41 | top3 = [set(d['prediction']) for d in result] 42 | acc = accuracy_score(gt, top1)*100 43 | hit = hit_at_3(gt, top3)*100 44 | metrics[(task, 'acc')] = acc 45 | metrics[(task, 'hit3')] = hit 46 | else: 47 | print ('skip task {}'.format(task)) 48 | 49 | for task, label_id in [('RuMedDaNet', 'answer'), ('RuMedNLI', 'gold_label')]: 50 | fname = '{}.jsonl'.format(task) 51 | if fname in files: 52 | fname = os.path.join(out_dir, fname) 53 | with open(fname) as f: 54 | result = 
[json.loads(line) for line in list(f)] 55 | gt = [d[label_id] for d in result] 56 | prediction = [d['prediction'] for d in result] 57 | acc = accuracy_score(gt, prediction)*100 58 | metrics[(task, 'acc')] = acc 59 | else: 60 | print ('skip task {}'.format(task)) 61 | 62 | task = 'RuMedNER' 63 | fname = '{}.jsonl'.format(task) 64 | if fname in files: 65 | fname = os.path.join(out_dir, fname) 66 | with open(fname) as f: 67 | result = [json.loads(line) for line in list(f)] 68 | gt = [d['ner_tags'] for d in result] 69 | prediction = [d['prediction'] for d in result] 70 | for seq0, seq1 in zip(gt, prediction): 71 | assert len(seq0)==len(seq1) 72 | metrics[(task, 'acc')] = seq_accuracy_score(gt, prediction)*100 73 | metrics[(task, 'f1')] = f1_score(gt, prediction)*100 74 | else: 75 | print ('skip task {}'.format(task)) 76 | 77 | top3_acc, top3_hit = metrics.get( ('RuMedTop3', 'acc'), 0 ), metrics.get( ('RuMedTop3', 'hit3'), 0 ) 78 | rec_acc, rec_hit = metrics.get( ('RuMedSymptomRec', 'acc'), 0 ), metrics.get( ('RuMedSymptomRec', 'hit3'), 0 ) 79 | danet_acc, nli_acc = metrics.get( ('RuMedDaNet', 'acc'), 0 ), metrics.get( ('RuMedNLI', 'acc'), 0 ) 80 | ner_acc, ner_f1 = metrics.get( ('RuMedNER', 'acc'), 0 ), metrics.get( ('RuMedNER', 'f1'), 0 ) 81 | 82 | overall = np.mean([ 83 | (top3_acc+top3_hit)/2, 84 | (rec_acc+rec_hit)/2, 85 | danet_acc, 86 | nli_acc, 87 | (ner_acc+ner_f1)/2, 88 | ]) 89 | 90 | result_line = '| {}\t| {:.2f} / {:.2f}\t| {:.2f} / {:.2f}\t| {:.2f}\t| {:.2f}\t| {:.2f} / {:.2f}\t| {:.2f}\t|'.format( 91 | out_dir, 92 | top3_acc, top3_hit, 93 | rec_acc, rec_hit, 94 | danet_acc, 95 | nli_acc, 96 | ner_acc, ner_f1, 97 | overall 98 | ) 99 | print ('| Model\t\t| RuMedTop3\t| RuMedSymptomRec\t| RuMedDaNet\t| RuMedNLI\t| RuMedNER\t| Overall\t|') 100 | print (result_line) 101 | -------------------------------------------------------------------------------- /code/linear_models/README.md: -------------------------------------------------------------------------------- 1 | This directory contains feature-based models (logistic regression model with tf-idf vectorizer and CRF). 2 | 3 | ### How to run 4 | 5 | `./run.sh` or you can run the model for different tasks separately, e.g. 6 | 7 | ```bash 8 | python single_text_classifier.py --task-name='RuMedSymptomRec' 9 | ``` 10 | 11 | The models produce results in `.jsonl` format to output directory `out`. 
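After a run, the whole `out` directory can be scored with the shared script: `python ../eval.py --out_dir out/`. For a quick sanity check of a single file, the scoring logic is easy to reproduce; a minimal sketch for a RuMedTop3-style output (assuming each record carries the true `code` and a ranked `prediction` list, as described above):

```python
import json

def score_top3(fname):
    # one JSON record per line, as produced by the classifiers in this directory
    with open(fname) as f:
        recs = [json.loads(line) for line in f]
    gt = [r['code'] for r in recs]
    top1 = [r['prediction'][0] for r in recs]
    top3 = [set(r['prediction']) for r in recs]
    acc = sum(t == p for t, p in zip(gt, top1)) / len(gt) * 100
    hit3 = sum(t in p for t, p in zip(gt, top3)) / len(gt) * 100
    return acc, hit3

print(score_top3('out/RuMedTop3.jsonl'))
```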
12 | -------------------------------------------------------------------------------- /code/linear_models/double_text_classifier.py: -------------------------------------------------------------------------------- 1 | import json 2 | import pathlib 3 | 4 | import click 5 | import pandas as pd 6 | from sklearn.feature_extraction.text import TfidfVectorizer 7 | from sklearn.linear_model import LogisticRegression 8 | from sklearn.metrics import accuracy_score 9 | 10 | 11 | def preprocess_sentences(column1, column2): 12 | return [sent1 + ' ' + sent2 for sent1, sent2 in zip(column1, column2)] 13 | 14 | 15 | def encode_text(tfidf, text_data, labels, l2i, mode='train'): 16 | if mode == 'train': 17 | X = tfidf.fit_transform(text_data) 18 | else: 19 | X = tfidf.transform(text_data) 20 | y = labels.map(l2i) 21 | return X, y 22 | 23 | 24 | @click.command() 25 | @click.option('--task-name', 26 | default='RuMedNLI', 27 | type=click.Choice(['RuMedDaNet', 'RuMedNLI']), 28 | help='The name of the task to run.') 29 | def main(task_name): 30 | print(f'\n{task_name} task') 31 | 32 | base_path = pathlib.Path(__file__).absolute().parent.parent.parent 33 | out_path = base_path / 'code' / 'linear_models' / 'out' 34 | data_path = base_path / 'data' / task_name 35 | 36 | train_data = pd.read_json(data_path / 'train_v1.jsonl', lines=True) 37 | dev_data = pd.read_json(data_path / 'dev_v1.jsonl', lines=True) 38 | test_data = pd.read_json(data_path / 'test_v1.jsonl', lines=True) 39 | 40 | index_id = 'pairID' 41 | if task_name == 'RuMedNLI': 42 | l2i = {'neutral': 0, 'entailment': 1, 'contradiction': 2} 43 | text1_id = 'ru_sentence1' 44 | text2_id = 'ru_sentence2' 45 | label_id = 'gold_label' 46 | elif task_name == 'RuMedDaNet': 47 | l2i = {'нет': 0, 'да': 1} 48 | text1_id = 'context' 49 | text2_id = 'question' 50 | label_id = 'answer' 51 | else: 52 | raise ValueError('unknown task') 53 | 54 | i2l = {i: label for label, i in l2i.items()} 55 | 56 | text_data_train = preprocess_sentences(train_data[text1_id], train_data[text2_id]) 57 | text_data_dev = preprocess_sentences(dev_data[text1_id], dev_data[text2_id]) 58 | text_data_test = preprocess_sentences(test_data[text1_id], test_data[text2_id]) 59 | 60 | tfidf = TfidfVectorizer(analyzer='char', ngram_range=(3, 8)) 61 | clf = LogisticRegression(penalty='l2', C=10, multi_class='ovr', n_jobs=10, verbose=1) 62 | 63 | X, y = encode_text(tfidf, text_data_train, train_data[label_id], l2i) 64 | 65 | clf.fit(X, y) 66 | 67 | X_val, y_val = encode_text(tfidf, text_data_dev, dev_data[label_id], l2i, mode='val') 68 | y_val_pred = clf.predict(X_val) 69 | accuracy = round(accuracy_score(y_val, y_val_pred) * 100, 2) 70 | print (f'\n{task_name} task score on dev set: {accuracy}') 71 | 72 | X_test, _ = encode_text(tfidf, text_data_test, test_data[label_id], l2i, mode='test') 73 | y_test_pred = clf.predict(X_test) 74 | 75 | recs = [] 76 | for i, true, pred in zip(test_data[index_id], test_data[label_id], y_test_pred): 77 | recs.append({index_id: i, label_id: true, 'prediction': i2l[pred]}) 78 | 79 | out_fname = out_path / f'{task_name}.jsonl' 80 | with open(out_fname, 'w') as fw: 81 | for rec in recs: 82 | json.dump(rec, fw, ensure_ascii=False) 83 | fw.write('\n') 84 | 85 | 86 | if __name__ == '__main__': 87 | main() 88 | -------------------------------------------------------------------------------- /code/linear_models/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | out=$(pwd)'/out' 4 | mkdir -p $out 5 | 6 | python -u 
single_text_classifier.py --task-name 'RuMedTop3' 7 | python -u single_text_classifier.py --task-name 'RuMedSymptomRec' 8 | python -u double_text_classifier.py --task-name 'RuMedNLI' 9 | python -u double_text_classifier.py --task-name 'RuMedDaNet' 10 | python -u token_classifier.py --task-name 'RuMedNER' 11 | -------------------------------------------------------------------------------- /code/linear_models/single_text_classifier.py: -------------------------------------------------------------------------------- 1 | import json 2 | import pathlib 3 | 4 | import click 5 | import numpy as np 6 | import pandas as pd 7 | from sklearn.feature_extraction.text import TfidfVectorizer 8 | from sklearn.linear_model import LogisticRegression 9 | 10 | 11 | def hit_at_n(y_true, y_pred, n=3): 12 | assert len(y_true) == len(y_pred) 13 | hit_count = 0 14 | for l, row in zip(y_true, y_pred): 15 | order = (np.argsort(row)[::-1])[:n] 16 | hit_count += int(l in order) 17 | return round(hit_count / float(len(y_true)) * 100, 2) 18 | 19 | 20 | def encode_text(tfidf, text_data, labels, l2i, mode='train'): 21 | if mode == 'train': 22 | X = tfidf.fit_transform(text_data) 23 | else: 24 | X = tfidf.transform(text_data) 25 | y = labels.map(l2i) 26 | return X, y 27 | 28 | 29 | def logits2codes(logits, i2l, n=3): 30 | codes = [] 31 | for row in logits: 32 | order = np.argsort(row)[::-1] 33 | codes.append([i2l[i] for i in order[:n]]) 34 | return codes 35 | 36 | 37 | @click.command() 38 | @click.option('--task-name', 39 | default='RuMedTop3', 40 | type=click.Choice(['RuMedTop3', 'RuMedSymptomRec']), 41 | help='The name of the task to run.') 42 | def main(task_name): 43 | print(f'\n{task_name} task') 44 | 45 | base_path = pathlib.Path(__file__).absolute().parent.parent.parent 46 | out_path = base_path / 'code' / 'linear_models' / 'out' 47 | data_path = base_path / 'data' / task_name 48 | 49 | train_data = pd.read_json(data_path / 'train_v1.jsonl', lines=True) 50 | dev_data = pd.read_json(data_path / 'dev_v1.jsonl', lines=True) 51 | test_data = pd.read_json(data_path / 'test_v1.jsonl', lines=True) 52 | 53 | text_id = 'symptoms' 54 | label_id = 'code' 55 | index_id = 'idx' 56 | 57 | i2l = dict(enumerate(sorted(train_data[label_id].unique()))) 58 | l2i = {label: i for i, label in i2l.items()} 59 | 60 | tfidf = TfidfVectorizer(analyzer='char', ngram_range=(3, 8)) 61 | clf = LogisticRegression(penalty='l2', C=10, multi_class='ovr', n_jobs=10, verbose=1) 62 | 63 | X, y = encode_text(tfidf, train_data[text_id], train_data[label_id], l2i) 64 | 65 | clf.fit(X, y) 66 | 67 | X_val, y_val = encode_text(tfidf, dev_data[text_id], dev_data[label_id], l2i, mode='val') 68 | y_val_pred = clf.predict_proba(X_val) 69 | 70 | accuracy = hit_at_n(y_val, y_val_pred, n=1) 71 | hit_3 = hit_at_n(y_val, y_val_pred, n=3) 72 | print (f'\n{task_name} task scores on dev set: {accuracy} / {hit_3}') 73 | 74 | X_test, _ = encode_text(tfidf, test_data[text_id], test_data[label_id], l2i, mode='test') 75 | y_test_pred = clf.predict_proba(X_test) 76 | 77 | test_codes = logits2codes(y_test_pred, i2l) 78 | 79 | recs = [] 80 | for i, true, pred in zip(test_data[index_id], test_data[label_id], test_codes): 81 | recs.append({index_id: i, label_id: true, 'prediction': pred}) 82 | 83 | out_fname = out_path / f'{task_name}.jsonl' 84 | with open(out_fname, 'w') as fw: 85 | for rec in recs: 86 | json.dump(rec, fw, ensure_ascii=False) 87 | fw.write('\n') 88 | 89 | 90 | if __name__ == '__main__': 91 | main() 92 | 
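# Added illustration (not part of the original script): a tiny self-contained
# check of the ranking helpers defined above; the three ICD-10-style codes
# below are hypothetical and not tied to the dataset.
def _demo_ranking_helpers():
    import numpy as np  # local import keeps the sketch self-contained
    i2l = {0: 'J06.9', 1: 'M54', 2: 'I11.9'}
    logits = np.array([[0.1, 0.7, 0.2],
                       [0.5, 0.2, 0.3]])
    # highest-scoring labels come first
    assert logits2codes(logits, i2l)[0] == ['M54', 'I11.9', 'J06.9']
    # both true label indices are ranked first, so hit@1 is 100.0
    assert hit_at_n([1, 0], logits, n=1) == 100.0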
-------------------------------------------------------------------------------- /code/linear_models/token_classifier.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import gc 3 | import os 4 | import json 5 | import numpy as np 6 | import random 7 | import click 8 | from seqeval.metrics import accuracy_score, f1_score 9 | import sklearn_crfsuite 10 | 11 | SEED = 128 12 | random.seed(SEED) 13 | np.random.seed(SEED) 14 | 15 | def load_sents(fname): 16 | sents = [] 17 | with open(fname) as f: 18 | for line in f: 19 | data = json.loads(line) 20 | idx = data['idx'] 21 | codes = data['ner_tags'] 22 | tokens = data['tokens'] 23 | sample = [] 24 | for token, code in zip(tokens,codes): 25 | sample.append( (token, code) ) 26 | sents.append( (idx, sample) ) 27 | return sents 28 | 29 | def word2features(sent, i): 30 | word = sent[i][0] 31 | 32 | features = { 33 | 'bias': 1.0, 34 | 'word.lower()': word.lower(), 35 | 'word[-3:]': word[-3:], 36 | 'word[-2:]': word[-2:], 37 | 'word.isupper()': word.isupper(), 38 | 'word.istitle()': word.istitle(), 39 | 'word.isdigit()': word.isdigit(), 40 | } 41 | if i > 0: 42 | word1 = sent[i-1][0] 43 | features.update({ 44 | '-1:word.lower()': word1.lower(), 45 | '-1:word.istitle()': word1.istitle(), 46 | '-1:word.isupper()': word1.isupper(), 47 | }) 48 | else: 49 | features['BOS'] = True 50 | 51 | if i < len(sent)-1: 52 | word1 = sent[i+1][0] 53 | features.update({ 54 | '+1:word.lower()': word1.lower(), 55 | '+1:word.istitle()': word1.istitle(), 56 | '+1:word.isupper()': word1.isupper(), 57 | }) 58 | else: 59 | features['EOS'] = True 60 | 61 | return features 62 | 63 | 64 | def sent2features(sent): 65 | return [word2features(sent, i) for i in range(len(sent))] 66 | 67 | def sent2labels(sent): 68 | return [label for token, label in sent] 69 | 70 | def sent2tokens(sent): 71 | return [token for token, label in sent] 72 | 73 | @click.command() 74 | @click.option('--task-name', 75 | default='RuMedNER', 76 | type=click.Choice(['RuMedNER']), 77 | help='The name of the task to run.' 
78 | ) 79 | 80 | def main(task_name): 81 | print(f'\n{task_name} task') 82 | base_path = os.path.abspath( os.path.join(os.path.dirname( __file__ ) ) ) 83 | out_dir = os.path.join(base_path, 'out') 84 | 85 | base_path = os.path.abspath( os.path.join(base_path, '../..') ) 86 | 87 | parts = ['train', 'dev', 'test'] 88 | data_path = os.path.join(base_path, 'data', task_name) 89 | 90 | text1_id, label_id, index_id = 'tokens', 'ner_tags', 'idx' 91 | part2data = {} 92 | for p in parts: 93 | fname = os.path.join( data_path, '{}_v1.jsonl'.format(p) ) 94 | sents = load_sents(fname) 95 | part2data[p] = sents 96 | 97 | part2feat = {} 98 | for p in parts: 99 | p_X = [sent2features(s) for idx, s in part2data[p]] 100 | p_y = [sent2labels(s) for idx, s in part2data[p]] 101 | p_ids = [idx for idx, _ in part2data[p]] 102 | part2feat[p] = (p_X, p_y, p_ids) 103 | 104 | crf = sklearn_crfsuite.CRF( 105 | algorithm='lbfgs', 106 | c1=0.1, 107 | c2=0.01, 108 | max_iterations=200, 109 | all_possible_transitions=True, 110 | verbose=True 111 | ) 112 | X_train, y_train = part2feat['train'][0], part2feat['train'][1] 113 | crf = crf.fit(X_train, y_train) 114 | 115 | X_dev = part2feat['dev'][0] 116 | y_pred_dev = crf.predict(X_dev) 117 | 118 | y_dev = part2feat['dev'][1] 119 | dev_acc, dev_f1 = accuracy_score(y_dev, y_pred_dev)*100, f1_score(y_dev, y_pred_dev)*100 120 | 121 | print ('\n{} task scores on dev set: {:.2f}/{:.2f}'.format(task_name, dev_acc, dev_f1)) 122 | 123 | X_test = part2feat['test'][0] 124 | y_pred_test = crf.predict(X_test) 125 | out_fname = os.path.join(out_dir, task_name+'.jsonl') 126 | with open(out_fname, 'w') as fw: 127 | for idx, labels, prediction in zip(part2feat['test'][-1], part2feat['test'][1], y_pred_test): 128 | data = {index_id:idx, label_id:labels, 'prediction':prediction} 129 | json.dump(data, fw, ensure_ascii=False) 130 | fw.write('\n') 131 | 132 | if __name__ == '__main__': 133 | main() 134 | -------------------------------------------------------------------------------- /code/requirements.txt: -------------------------------------------------------------------------------- 1 | seqeval==1.2.2 2 | torch==1.9.0 3 | torchtext==0.6.0 4 | tensorflow==2.6.0 5 | keras==2.6.0 6 | pandas==1.3.5 7 | transformers==4.12.5 8 | click==7.1.2 9 | nltk==3.4.5 10 | sklearn-crfsuite==0.3.6 11 | -------------------------------------------------------------------------------- /code/tasks_builder.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import gc 3 | import os 4 | import ast 5 | import json 6 | import pandas as pd 7 | import numpy as np 8 | from sklearn import model_selection 9 | from collections import Counter 10 | 11 | SEED = 53 12 | 13 | def df2jsonl(in_df, fname, code2freq, th=10): 14 | with open(fname, 'w', encoding='utf-8') as fw: 15 | for idx, symptoms, code in zip(in_df.new_event_id, in_df.symptoms, in_df.code): 16 | if code in code2freq and code2freq[code]>th: 17 | data = { 18 | 'idx':idx, 19 | 'symptoms': symptoms, 20 | 'code': code, 21 | } 22 | json.dump(data, fw, ensure_ascii=False) 23 | fw.write("\n") 24 | 25 | def ner2jsonl(in_df, ids, fname): 26 | trim_ids = np.array([s.split('_')[0] for s in in_df['Sentence#'].values]) 27 | 28 | with open(fname, 'w') as fw: 29 | for i in ids: 30 | mask = trim_ids==i 31 | sample_ids = np.array(list(set(in_df['Sentence#'].values[mask]))) 32 | order = np.argsort([int(k.split('_')[-1]) for k in sample_ids]) 33 | sample_ids = sample_ids[order] 34 | for idx in sample_ids: 35 | sub_mask = 
in_df['Sentence#'].values==idx 36 | tokens = list(in_df.Word[sub_mask].values) 37 | ner_tags = list(in_df.Tag[sub_mask].values) 38 | assert len(tokens)==len(ner_tags) 39 | data = { 40 | 'idx':idx, 41 | 'tokens': tokens, 42 | 'ner_tags': ner_tags, 43 | } 44 | json.dump(data, fw, ensure_ascii=False) 45 | fw.write("\n") 46 | 47 | def jsonl2jsonl(source, target): 48 | with open(target, 'w') as fw: 49 | with open(source) as f: 50 | for line in f: 51 | data = json.loads(line) 52 | selected = {field:data[field] for field in ['ru_sentence1', 'ru_sentence2', 'gold_label', 'pairID']} 53 | json.dump(selected, fw, ensure_ascii=False) 54 | fw.write("\n") 55 | 56 | if __name__ == '__main__': 57 | base_path = os.path.abspath( os.path.join(os.path.dirname( __file__ ), '..') ) 58 | 59 | data_path = os.path.join(base_path, 'data/') 60 | 61 | data_fname = os.path.join(data_path, 'raw', 'RuMedPrimeData.tsv') 62 | 63 | if not os.path.isfile(data_fname): 64 | raise ValueError('Have you downloaded the data file RuMedPrimeData.tsv and place it into data/ directory?') 65 | 66 | base_split_names = ['train', 'dev', 'test'] 67 | ## prepare data for RuMedTop3 task 68 | df = pd.read_csv(data_fname, sep='\t') 69 | df['code'] = df.icd10.apply(lambda s: s.split('.')[0]) 70 | # parts'll be list of [train, dev, test] 71 | parts = np.split(df.sample(frac=1, random_state=SEED), [int(0.735*len(df)), int(0.8675*len(df))]) 72 | 73 | code2freq = dict(parts[0]['code'].value_counts()) 74 | 75 | for i, part in enumerate(base_split_names): 76 | df2jsonl( 77 | parts[i], 78 | os.path.join(data_path, 'RuMedTop3', '{}_v1.jsonl'.format(part)), 79 | code2freq 80 | ) 81 | 82 | ## prepare data for RuMedSymptomRec task 83 | df.drop(columns=['code'], inplace=True) 84 | rec_markup = pd.read_csv( os.path.join(data_path, 'raw', 'rec_markup.csv') ) 85 | df = pd.merge(df, rec_markup, on='new_event_id') 86 | 87 | mask = ~df.code.isna().values 88 | df = df.iloc[mask] 89 | 90 | symptoms_reduced = [] 91 | for text, span in zip(df.symptoms, df.keep_spans): 92 | span = ast.literal_eval(span) 93 | reduced_text = (''.join([text[s[0]:s[1]] for s in span])).strip() 94 | symptoms_reduced.append(reduced_text) 95 | df['symptoms'] = symptoms_reduced 96 | 97 | parts = np.split(df.sample(frac=1, random_state=SEED), [int(0.735*len(df)), int(0.8675*len(df))]) 98 | 99 | code2freq = dict(parts[0]['code'].value_counts()) 100 | 101 | for i, part in enumerate(base_split_names): 102 | df2jsonl( 103 | parts[i], 104 | os.path.join(data_path, 'RuMedSymptomRec', '{}_v1.jsonl'.format(part)), 105 | code2freq 106 | ) 107 | 108 | ## prepare data for RuMedNER task 109 | df = pd.read_csv( os.path.join(data_path, 'raw', 'RuDReC.csv') ) 110 | 111 | d = Counter(df['Sentence#'].apply(lambda s: s.split('_')[0])) 112 | ids = np.array(list(d.keys())) 113 | lens = np.array(list(d.values())) 114 | lens = np.array([len(str(i)) for i in lens]) 115 | 116 | sss = model_selection.StratifiedShuffleSplit(n_splits=1, test_size=75, random_state=7) 117 | for fold, (train_idx, test_idx) in enumerate(sss.split(ids, lens)): 118 | train_ids, test_ids = ids[train_idx], ids[test_idx] 119 | 120 | sss = model_selection.StratifiedShuffleSplit(n_splits=1, test_size=75, random_state=6) 121 | for fold, (train_idx, test_idx) in enumerate(sss.split(train_ids, lens[train_idx])): 122 | train_ids, dev_ids = train_ids[train_idx], train_ids[test_idx] 123 | parts = [train_ids, dev_ids, test_ids] 124 | 125 | for i, part in enumerate(base_split_names): 126 | ner2jsonl( 127 | df, 128 | parts[i], 129 | 
os.path.join(data_path, 'RuMedNER', '{}_v1.jsonl'.format(part)) 130 | ) 131 | 132 | ## prepare data for RuMedNLI task 133 | for part in base_split_names: 134 | fname = os.path.join(data_path, 'raw', 'ru_mli_{}_v1.jsonl'.format(part)) 135 | jsonl2jsonl( 136 | fname, 137 | os.path.join(data_path, 'RuMedNLI', '{}_v1.jsonl'.format(part)) 138 | ) 139 | -------------------------------------------------------------------------------- /data/README.md: -------------------------------------------------------------------------------- 1 | Each task directory (starting with *RuMed*\*) contains `train/dev/test` data files in `jsonl`-format. 2 | 3 | ### RuMedTop3 4 | ``` 5 | { 6 | "idx": "qd4405c5", 7 | "symptoms": "Сердцебиение, нарушение сна, ощущение нехватки воздуха. 8 | Боль и хруст в шеи, головные боли по 3 суток подряд.", 9 | "code": "M54" 10 | } 11 | ``` 12 | 13 | ### RuMedSymptomRec 14 | ``` 15 | { 16 | "idx": "qbaecae4", 17 | "symptoms": "пациентка на приеме с родственниками. Со слов родственников - жалобы на плохой сон, 18 | чувство страха, на навязчивые мысли,что 'ее кто-то бьет'", 19 | "code": "колебания артериального давления" 20 | } 21 | ``` 22 | 23 | ### RuMedDaNet 24 | ``` 25 | { 26 | "pairID": "b2d69800b0a141aa63bd1104c6d53488", 27 | "context": "Эпилепсия — хроническое полиэтиологическое заболевание головного мозга, доминирующим 28 | проявлением которого являются повторяющиеся эпилептические припадки, возникающие вследствие 29 | усиленного гиперсинхронного разряда нейронов головного мозга.", 30 | "question": "Эпилепсию относят к заболеваниям головного мозга человека?", 31 | "answer": "да", 32 | } 33 | ``` 34 | 35 | ### RuMedNLI 36 | ``` 37 | { 38 | "pairID": "1892e470-66c7-11e7-9a53-f45c89b91419", 39 | "ru_sentence1": "Во время госпитализации у пациента постепенно усиливалась одышка, что потребовало 40 | выполнения процедуры неинвазивной вентиляции лёгких с положительным давлением, а затем маска без ребризера.", 41 | "ru_sentence2": "Пациент находится при комнатном воздухе.", 42 | "gold_label": "contradiction", 43 | } 44 | ``` 45 | 46 | ### RuMedNER 47 | ``` 48 | { 49 | "idx": "769708.tsv_5", 50 | "tokens": ["Виферон", "обладает", "противовирусным", "действием", "."], 51 | "ner_tags": ["B-Drugname", "O", "B-Drugclass", "O", "O"] 52 | } 53 | ``` 54 | 55 | ### ECG2Pathology 56 | ``` 57 | { 58 | "record_name": "00009_hr", 59 | "age": 55.0, 60 | "sex": 0, 61 | ..., 62 | "targets": [37,54] 63 | } 64 | ``` 65 | 66 |
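All task files share this one-record-per-line `jsonl` layout, so every split can be loaded the same way; a minimal sketch (the path assumes this repository's layout, mirroring how the baseline code reads the data):

```python
import pandas as pd

# each *_v1.jsonl file stores one JSON record per line
train = pd.read_json('data/RuMedTop3/train_v1.jsonl', lines=True)
print(train[['idx', 'code']].head())
```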
 67 | ### raw
 68 | 
 69 | The directory contains the raw data files.
 70 | 
 71 | The tasks `RuMedTop3` and `RuMedSymptomRec` are based on the [`RuMedPrime`](https://zenodo.org/record/5765873#.YbBlXT9Bzmw) dataset.
 72 | The file `RuMedPrimeData.tsv` contains:
 73 | ```
 74 | symptoms          anamnesis         icd10   new_patient_id  new_event_id  new_event_time
 75 | Сухость кожи...   Месяц назад...    E01.8   qf156c36        q5fc2cb1      2027-05-19
 76 | Жалобы ГБ...      Начало острое...  J06.9   q9321cf8        qe173f20      2023-03-24
 77 | ```
 78 | - `symptoms` is the text field with patient symptoms and complaints;
 79 | - `icd10` is the ICD-10 disease code;
 80 | - `new_event_id` is the sample id.
 81 | 
 82 | The file `rec_markup.csv` contains the markup for the recommendation task:
 83 | ```
 84 | new_event_id,code,keep_spans
 85 | q5fc2cb1,"кожа, сухая","[(0, 0), (7, 12), (13, 108)]"
 86 | qe173f20,боль в мышцах,"[(0, 138), (151, 279)]"
 87 | q653efaa,боль в мышцах,"[(0, 57), (70, 129)]"
 88 | qe48681b,боль жгучая,"[(0, 45), (56, 181)]"
 89 | ```
 90 | - `new_event_id` is the sample id;
 91 | - `code` is the symptom to predict;
 92 | - `keep_spans` is a list of `(start, end)` tuples, as we need to transform the original text to exclude the target symptom code.
 93 | 
 94 | The `RuMedNLI` data is based on the translated [MedNLI](https://jgc128.github.io/mednli/) data.
 95 | 
 96 | > Important! This repository does not contain the RuMedNLI files. Please download them (`ru_mli_train_v1.jsonl`, `ru_mli_dev_v1.jsonl` and `ru_mli_test_v1.jsonl`) from [RuMedNLI: A Russian Natural Language Inference Dataset For The Clinical Domain](https://doi.org/10.13026/gxzd-cf80) into the `raw` directory, then run `python tasks_builder.py` from the `code/` directory.
 97 | 
 98 | The task `RuMedNER` is based on the RuDReC data - https://github.com/cimm-kzn/RuDReC.
 99 | `RuDReC.csv` is a dataframe file with named entities in [IOB format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)).
 100 | ```
 101 | Sentence#,Word,Tag
 102 | 172744.tsv_0,нам,O
 103 | 172744.tsv_0,прописали,O
 104 | 172744.tsv_0,",",O
 105 | 172744.tsv_0,так,O
 106 | 172744.tsv_0,мой,O
 107 | 172744.tsv_0,ребенок,O
 108 | 172744.tsv_0,сыпью,B-ADR
 109 | 172744.tsv_0,покрылся,I-ADR
 110 | 172744.tsv_0,",",O
 111 | ```
 112 | 
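The `keep_spans` transformation is straightforward to reproduce; below is a minimal sketch mirroring the logic in `code/tasks_builder.py` (the sample text and spans are hypothetical):

```python
import ast

text = 'жалобы на сухость кожи и слабость'  # hypothetical symptoms text
keep_spans = '[(0, 9), (22, 33)]'           # hypothetical rec_markup.csv value
spans = ast.literal_eval(keep_spans)        # parse the string into a list of tuples
reduced = ''.join(text[s:e] for s, e in spans).strip()
print(reduced)  # 'жалобы на и слабость' - the target symptom 'сухость кожи' is cut out
```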
 113 | 
--------------------------------------------------------------------------------
/data/RuMedNLI/README.md:
--------------------------------------------------------------------------------
 1 | > Important! This repository does not contain the RuMedNLI files. Please download them (`ru_mli_train_v1.jsonl`, `ru_mli_dev_v1.jsonl` and `ru_mli_test_v1.jsonl`) from [RuMedNLI: A Russian Natural Language Inference Dataset For The Clinical Domain](https://doi.org/10.13026/gxzd-cf80) into the `raw` directory, then run `python tasks_builder.py` from the `code/` directory.
--------------------------------------------------------------------------------
/lb_submissions/SAI/ChatGPT/README.md:
--------------------------------------------------------------------------------
 1 | # Testing MedBench tasks with ChatGPT
 2 | 
 3 | ## Summary
 4 | We tested the performance of ChatGPT (proxied by `text-davinci-003`) on the RuMedBench tasks, with the following results:
 5 | - RuMedTest: 35.0% (above all other models to date, yet well below human level)
 6 | 
 7 |   -- Random guessing over 1,000 attempts gives the following statistics for this metric: (min, mean, max) = (19.14%, 24.85%, 31.74%)
 8 | - RuMedDaNet: 89.3% (well above other models to date, close to human level)
 9 | - RuMedNLI: 61.3% (lagging behind other models to date, well below human level)
 10 | While some preliminary tests were done to select better prompting techniques, these results could likely be improved with more extensive prompt testing and further model fine-tuning.
 11 | 
 12 | ## Premises
 13 | We set out to evaluate the performance of the ChatGPT large language model on RuMedBench questions. The initial evaluation was done via the [ChatGPT web interface](https://chat.openai.com/chat); the main runs were then done through the OpenAI API with the `text-davinci-003` model. The latter is not precisely ChatGPT, but it is supposedly closely related and showed similar answer quality (measured on a sample of 20 questions from RuMedTest). An additional benefit of API access was that `text-davinci-003` was better at keeping to the requested answer format, without resorting to the longer explanations ChatGPT sometimes gave.
 14 | 
 15 | To improve quality, we compared prompt variants on smaller samples of similar tasks (the dev sample in the case of RuMedDaNet and RuMedNLI, and similar medical questions from medical school exams for RuMedTest). The tested approaches included zero- and one-shot prompts and translation into English. See the details of these prompt tests in the individual tasks' sections.
 16 | 
 17 | All tests were run between 10 and 15 Feb 2023.
 18 | 
 19 | ## RuMedTest
 20 | For prompt evaluation, we used questions from a similar test [found here](https://geetest.ru/tests/terapiya_(dlya_internov)_sogma_). It has 775 questions; among them, only 355 have the one-line format similar to RuMedTest. A minor (~5%) overlap with the RuMedTest questions was noticed; those questions were excluded from prompt evaluation.
 21 | 
 22 | The prompt tests included zero- and one-shot prompts and translation into English. No significant difference was observed (perhaps the prompt test sample was too small), so the main benchmark tests were performed with a straightforward prompt in Russian:
 23 | ```
 24 | You are a medical doctor and need to answer which one of the four statements is correct:
 25 | 1. "Лёгочное сердце" может возникнуть при ишемической болезни сердца".
 26 | 2. "Лёгочное сердце" может возникнуть при гипертонической болезни".
 27 | 3. "Лёгочное сердце" может возникнуть при хронической обструктивной болезни лёгких".
 28 | 4. "Лёгочное сердце" может возникнуть при гипертиреозе".
 29 | The number of the correct statements is:
 30 | ```
 31 | The resulting accuracy was 35%, which beats the simpler models but leaves a large gap to human performance. Possible ways to improve this result with the same model would be more extensive prompt testing (a greater variety of prompts and a larger testing sample), fine-tuning (if a large quantity of similar test data and a budget are available), and cleaning the test data of typos, rare acronyms, and abbreviations.
 32 | 
 33 | ## RuMedDaNet
 34 | Prompt testing was done in both English and Russian, zero- and one-shot, with and without context. The zero-shot setup in Russian with the context included performed reasonably well (85% on the pre-test sample) and was chosen for the whole benchmark run.
 35 | 
 36 | Note that the model could answer prompts without context significantly better than randomly, at 67%, which indicates some domain knowledge in the model. It may be worth exploring such unprompted Yes/No tests further, possibly adding them as a benchmark component.
 37 | 
 38 | Example of prompt + question used:
 39 | ```
 40 | Imagine that you are a medical doctor and know everything about medicine and need to pass a degree exam.
 41 | The context is: Природа полос поглощения в ик-области связана с колебательными переходами и изменением колебательных состояний ядер, входящих в молекулу поглощающего вещества. Поэтому поглощением в ИК-области обладают молекулы, дипольные моменты которых изменяются при возбуждении колебательных движений ядер. Область применения ИК-спектроскопии аналогична, но более широка, чем УФ-метода. ИК-спектр однозначно характеризует всю структуру молекулы, включая незначительные ее изменения. Важные преимущества данного метода — высокая специфичность, объективность полученных результатов, возможность анализа веществ в кристаллическом состоянии.
 42 | The question is: Возможности ИК-спектроскопии позволяют анализировать вещества в кристаллическом состоянии?
 43 | You should answer only yes or no.
 44 | The answer is
 45 | ```
 46 | The resulting accuracy was 89.3%, which beats the simpler models (the best registered result to date being 68%) and gets close to human performance (93%).
 47 | 
 48 | ## RuMedNLI
 49 | Prompt testing was done in Russian only, zero- and few-shot. There was no significant difference in performance (0.75 accuracy in the prompt tests); the few-shot setup in Russian was chosen for the whole benchmark run.
 50 | 
 51 | Example of the prompt:
 52 | ```
 53 | You are a medical doctor and need to pass a degree exam. You are given two statements and need to answer how the second statement relates to the first statement. Possible answers are 'entailment', 'contradiction', or 'neutral'
 54 | Statement 1: "В анамнезе нет тромбозов или ТГВ, никогда не было болей в груди до случаев недельной давности."
 55 | Statement 2: "Пациент страдает стенокардией"
 56 | Answer: "entailment"
 57 | Statement 1: "В течение последней недели стал более сонливым и трудно возбудимым."
 58 | Statement 2: "В последнюю неделю он был менее внимателен"
 59 | Answer: "entailment"
 60 | Statement 1: "КТ головы показала небольшое правое височное внутрипаренхиматозное кровоизлияние 2х2 см, на повторной КТ головы осталось без изменений."
 61 | Statement 2: "у пациента было гипертоническое кровотечение"
 62 | Answer: "neutral"
 63 | Statement 1: "Рентгенограмма чистая, не похоже наличие инфекции."
 64 | Statement 2: "Рентген грудной клетки выявил инфильтраты"
 65 | Answer: "contradiction"
 66 | Statement 1: "КТ головы показала небольшое правое височное внутрипаренхиматозное кровоизлияние 2х2 см, на повторной КТ головы осталось без изменений."
 67 | Statement 2: "у пациента была нормальная КТ головы"
 68 | Answer: "contradiction"
 69 | Statement 1: "Сахарный диабет II типа."
 70 | Statement 2: "Инсулин не требуется."
 71 | Answer:
 72 | ```
 73 | The resulting accuracy was 61.3%, which does not beat the simpler models (the best to date being 67%), let alone the human level (85.7%). This could be because NLI tasks are more sensitive to fine-tuning; different prompting techniques may also be required.
 74 | 
 75 | ### An interesting case of disagreement
 76 | When testing NLI and other questions with ChatGPT, we observed cases where the model gave answers that were incorrect (vs. the benchmark ground truth) yet supported by a detailed explanation. Here is an example:
 77 | ```
 78 | pairID 18935380-66c7-11e7-9dbb-f45c89b91419
 79 | Gold label = CONTRADICTION
 80 | 
 81 | PROMPT:
 82 | You are a medical doctor and need to pass exam. You are given two statements:
 83 | The first statement is absolutely correct and should be the basis for your answer: "На [**2810-12-14**] также был удален желудочковый дренаж."
 84 | The second statement is "Отсутствие в анамнезе неврологических симптомов".
 85 | You should answer if the second statement is 'entailment', 'contradiction', or 'neutral'.
 86 | The answer is
 87 | 
 88 | The relationship between the two statements is neutral.
 89 | 
 90 | CHAT_GPT:
 91 | The first statement describes a surgical procedure that was performed on a specific date to remove a cardiac drain. The second statement mentions the absence of neurological symptoms in the patient's medical history. While the two statements may be related in some way, they do not necessarily entail or contradict each other.
 92 | 
 93 | The absence of neurological symptoms in a patient's medical history does not necessarily have a direct relationship with the removal of a cardiac drain. The relationship between the two statements may be influenced by other factors such as the patient's overall health, the reason for the drain placement, and the specifics of the surgical procedure. Without further information, it is not possible to determine a definite relationship between the two statements.
 94 | ```
 95 | It may be worth exploring such cases with domain experts to ensure that the test questions reflect modern medical knowledge.
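For reference, the runs behind these numbers follow the plain text-completion pattern; below is a minimal sketch of a RuMedDaNet query, assuming the 2023-era `openai` Python SDK (the helper name and the yes/no parsing are illustrative, not the exact notebook code):

```python
import openai  # pre-1.0 SDK; openai.api_key must be set beforehand

def ask_danet(context: str, question: str) -> str:
    """Zero-shot RuMedDaNet query in the prompt format shown above (illustrative)."""
    prompt = (
        'Imagine that you are a medical doctor and know everything about medicine '
        'and need to pass a degree exam.\n'
        f'The context is: {context}\n'
        f'The question is: {question}\n'
        'You should answer only yes or no.\n'
        'The answer is'
    )
    resp = openai.Completion.create(
        model='text-davinci-003',
        prompt=prompt,
        temperature=0,  # deterministic decoding for benchmarking
        max_tokens=3,
    )
    text = resp['choices'][0]['text'].strip().lower()
    # map the English or Russian reply onto the benchmark's answer vocabulary
    return 'да' if ('yes' in text or 'да' in text) else 'нет'
```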
 96 | 
 97 | ### File descriptions
 98 | `chat-rmb.ipynb` - the main code for interacting with the model
 99 | 
 100 | `*.jsonl` - answer files
 101 | 
 102 | `rm*` - files with intermediate data and logs
 103 | 
--------------------------------------------------------------------------------
/lb_submissions/SAI/ChatGPT/rmnli_priv_gpt3_1502.pd.pickle:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pavel-blinov/RuMedBench/dd56a60faf55994ba402615c4067f0652ea0d2bb/lb_submissions/SAI/ChatGPT/rmnli_priv_gpt3_1502.pd.pickle
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGAuto/ECGBaselineLib/autobaseline.py:
--------------------------------------------------------------------------------
 1 | import warnings
 2 | warnings.filterwarnings("ignore", category=UserWarning)
 3 | warnings.filterwarnings("ignore", category=FutureWarning)
 4 | 
 5 | from sklearn.metrics import precision_recall_curve
 6 | 
 7 | from lightautoml.automl.presets.tabular_presets import TabularUtilizedAutoML
 8 | from lightautoml.tasks import Task
 9 | 
 10 | 
 11 | def lama_train(df_list, random_seed):
 12 | 
 13 |     roles = {
 14 |         "target": "targets",
 15 |         "category": "device"
 16 |     }
 17 | 
 18 |     # https://github.com/sb-ai-lab/LightAutoML
 19 |     # define the machine learning problem as binary classification
 20 |     task = Task("binary")
 21 | 
 22 |     utilized_automl = TabularUtilizedAutoML(
 23 |         task = task,
 24 |         timeout = 180,
 25 |         cpu_limit = 8,
 26 |         reader_params = {'n_jobs': 8, 'cv': 5, 'random_state': random_seed}
 27 |     )
 28 | 
 29 |     _ = utilized_automl.fit_predict(df_list[0], roles = roles, verbose = 1)
 30 | 
 31 |     # threshold search: pick the decision threshold that maximizes F1 on the validation set
 32 |     val_pred = utilized_automl.predict(df_list[1].drop(columns=["targets"]))
 33 |     precision, recall, thresholds = precision_recall_curve(df_list[1]["targets"], val_pred.data.squeeze())
 34 |     best_thrsh = thresholds[(2*recall*precision / (recall + precision)).argmax()]
 35 | 
 36 |     pub_pred = utilized_automl.predict(df_list[2])
 37 |     pub_res = pub_pred.data.squeeze() > best_thrsh
 38 |     return pub_res, utilized_automl
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGAuto/ECGBaselineLib/datasets.py:
--------------------------------------------------------------------------------
 1 | import numpy as np
 2 | import pandas as pd
 3 | from pathlib import Path
 4 | from sklearn.model_selection import train_test_split
 5 | 
 6 | 
 7 | 
 8 | def prepare_data(X, y):
 9 |     df = pd.DataFrame(X).reset_index(drop=True)
 10 |     df.loc[:, 'noise'] = df[['baseline_drift', 'static_noise', 'burst_noise', 'electrodes_problems', 'extra_beats', 'pacemaker']].isna().sum(axis=1).apply(lambda x: 1 if x==6 else 0)  # noise-free iff all six noise columns are NaN
 11 |     df = df[['age', 'sex', 'validated_by_human', 'site', 'device', 'noise']]
 12 |     if y is None:
 13 |         return df
 14 |     df['targets'] = pd.DataFrame(y)
 15 |     return df
 16 | 
 17 | 
 18 | ##### Split for the AutoML baseline ######
 19 | def get_dataset_baseline(data_path, class_id, dtype, random_state):
 20 |     assert dtype in ["train", "test"]
 21 |     classes_splits = {"ecgs":[], "targets":[], "names": []}
 22 |     metadata = pd.read_json(Path(data_path) / (dtype + "/" + dtype + ".jsonl"), lines=True)
 23 |     for _, signal in metadata.iterrows():
 24 |         if dtype == "train":
 25 |             classes_splits["ecgs"].append(signal[2:-6])  # positional slice keeps only the metadata columns
 26 |             classes_splits["targets"].append((class_id in signal["labels"]) * 1)
 27 |         else:
 28 |             classes_splits["ecgs"].append(signal[2:-5])  # the test rows have one fewer trailing service column
 29 |             classes_splits["names"].append(signal["record_name"])
 30 |     classes_splits["targets"] = np.array(classes_splits["targets"])
 31 |     if dtype == "test":
 32 |         del classes_splits["targets"]
 33 |         return prepare_data(classes_splits['ecgs'], None), classes_splits["names"]
 34 |     else:
 35 |         del classes_splits["names"]
 36 |         X_train, X_val, y_train, y_val = train_test_split(
 37 |             classes_splits["ecgs"],
 38 |             classes_splits["targets"],
 39 |             test_size=0.2,
 40 |             random_state=random_state,
 41 |             stratify=classes_splits["targets"]
 42 |         )
 43 |         return prepare_data(X_train, y_train), prepare_data(X_val, y_val)
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGAuto/ECGBaselineLib/utils.py:
--------------------------------------------------------------------------------
 1 | from sklearn.metrics import precision_recall_curve
 2 | import numpy as np
 3 | 
 4 | 
 5 | def find_threshold_f1(trues, logits, eps=1e-9):  # F1-optimal decision threshold(s) from the precision-recall curve
 6 |     if len(trues.shape) > 1:
 7 |         threshold = []
 8 |         for i in range(trues.shape[1]):
 9 |             precision, recall, thresholds = precision_recall_curve(trues[:,i], logits[:,i])
 10 |             f1_scores = 2 * precision * recall / (precision + recall + eps)
 11 |             threshold.append(float(thresholds[np.argmax(f1_scores)]))
 12 |         return threshold
 13 |     else:
 14 |         precision, recall, thresholds = precision_recall_curve(trues, logits)
 15 |         f1_scores = 2 * precision * recall / (precision + recall + eps)
 16 |         threshold = [float(thresholds[np.argmax(f1_scores)])]
 17 |         return threshold
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGAuto/README.md:
--------------------------------------------------------------------------------
 1 | The solution is an ensemble of 73 models, each solving a binary classification task for its corresponding class. Model selection was done with the [LightAutoML](https://github.com/sb-ai-lab/LightAutoML) library, and *only the metadata* of each record was used for training.
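Each binary model votes only for its own class id; the final submission merges the 73 per-class votes into label lists, as in `training.py`. A minimal sketch with hypothetical toy values:

```python
import numpy as np

# hypothetical per-record votes from the binary models (only 5 classes shown)
preds_dict = {'00001_hr': [0, 1, 0, 1, 0]}
for rec, votes in preds_dict.items():
    labels = np.array(votes).nonzero()[0].tolist()
    print({'record_name': rec, 'labels': labels})  # {'record_name': '00001_hr', 'labels': [1, 3]}
```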
 2 | 
 3 | ### To run the code
 4 | 
 5 | Requires `python 3.8`
 6 | 
 7 | `pip install -r requirements.txt`
 8 | 
 9 | `python training.py data_path model_path  # run training and prediction`
 10 | 
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGAuto/requirements.txt:
--------------------------------------------------------------------------------
 1 | scikit-learn==1.2.2
 2 | lightautoml==0.3.7.3
 3 | numpy==1.24.4
 4 | pandas==1.4.3
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGAuto/training.py:
--------------------------------------------------------------------------------
 1 | from ECGBaselineLib.autobaseline import lama_train
 2 | from ECGBaselineLib.datasets import get_dataset_baseline
 3 | 
 4 | import sys
 5 | import logging
 6 | import os
 7 | import argparse
 8 | from pathlib import Path
 9 | import json
 10 | import joblib
 11 | import numpy as np
 12 | 
 13 | 
 14 | def main(args):
 15 |     # Logger
 16 |     logger = logging.getLogger('automl_baseline_training')
 17 |     log_format = '%(asctime)s %(message)s'
 18 |     logging.basicConfig(stream=sys.stdout, level=logging.INFO,
 19 |                         format=log_format, datefmt='%m/%d %I:%M:%S %p')
 20 |     fh = logging.FileHandler(Path(args.model_path + "/summary/") / 'log_automl.txt')
 21 |     fh.setFormatter(logging.Formatter(log_format))
 22 |     logger.addHandler(fh)
 23 |     # Load the mapping of class ids to pathology names
 24 |     with open(Path(args.data_path) / "train/idx2pathology.jsonl", "r") as f:
 25 |         classes = json.load(f)
 26 |     for class_name in classes:
 27 |         os.makedirs(args.model_path + "/models/" + class_name, exist_ok=True)
 28 |         logger.info("---------- Working with LAMA and {} class ----------".format(class_name))
 29 |         X_train, X_val = get_dataset_baseline(args.data_path, int(class_name), "train", args.random_state)
 30 |         X_public, public_names = get_dataset_baseline(args.data_path, int(class_name), "test", args.random_state)
 31 |         # train the per-class model and predict on the public test set
 32 |         pub_res, model = lama_train([X_train, X_val, X_public], random_seed=args.random_state)
 33 |         if class_name == '0':
 34 |             preds_dict = {key: [val] for key, val in dict(zip(public_names, pub_res)).items()}
 35 |         else:
 36 |             for i, key in enumerate(preds_dict):
 37 |                 preds_dict[key].append(pub_res[i])
 38 |         joblib.dump(model, args.model_path + "/models/" + class_name + "/model.pkl")
 39 | 
 40 |     out_fname = Path(args.model_path) / "ECG2Pathology.jsonl"
 41 |     with open(out_fname, 'w') as fw:
 42 |         for rec in preds_dict:
 43 |             json.dump({"record_name":rec, "labels":np.array(preds_dict[rec]).nonzero()[0].tolist()}, fw, ensure_ascii=False)
 44 |             fw.write('\n')
 45 | 
 46 | 
 47 | if __name__ == '__main__':
 48 |     parser = argparse.ArgumentParser(description = 'Baselines training script (LAMA)')
 49 |     parser.add_argument('data_path', help='dataset path (path to the folder containing test and train subfolders)', type=str)
 50 |     parser.add_argument('model_path', help='path to save the model and logs', type=str)
 51 |     parser.add_argument('--random_state', help='random state number', type=int, default=19)
 52 |     args = parser.parse_args()
 53 | 
 54 |     os.makedirs(args.model_path + "/models/", exist_ok=True)
 55 |     os.makedirs(args.model_path + "/summary/", exist_ok=True)
 56 |     main(args)
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGBinary/ECGBaselineLib/datasets.py:
--------------------------------------------------------------------------------
 1 | import numpy as np
 2 | import pandas as pd
 3 | from pathlib import Path
 4 | from
sklearn.model_selection import train_test_split 5 | 6 | 7 | ##### Split for the N models baseline ###### 8 | def get_dataset_baseline(data_path, class_name, class_id, dtype, random_state): 9 | assert dtype in ["train", "test"] 10 | classes_splits = {"ecgs":[], "targets":[], "names":[]} 11 | metadata = pd.read_json(Path(data_path) / (dtype + "/" + dtype + ".jsonl"), lines=True) 12 | for signal in (Path(data_path) / dtype).glob("*.npy"): 13 | signal_name = signal.name[:signal.name.rfind('/')-3] 14 | classes_splits["names"].append(signal_name) 15 | with open(signal, "rb") as f: 16 | signal_value = np.load(f, allow_pickle=True) 17 | classes_splits['ecgs'].append(signal_value) 18 | if dtype == "train": 19 | classes_splits["targets"].append((class_id in metadata.loc[metadata.record_name == signal_name, "labels"].item()) * 1) 20 | classes_splits["targets"] = np.array(classes_splits["targets"]) 21 | if dtype == "test": 22 | del classes_splits["targets"] 23 | return classes_splits['ecgs'], classes_splits["names"] 24 | else: 25 | X_train, X_val, y_train, y_val, names_train, names_val = train_test_split( 26 | classes_splits["ecgs"], 27 | classes_splits["targets"], 28 | classes_splits["names"], 29 | test_size=0.33, 30 | random_state=random_state, 31 | stratify=classes_splits["targets"] 32 | ) 33 | return X_train, X_val, y_train, y_val, names_train, names_val -------------------------------------------------------------------------------- /lb_submissions/SAI/ECGBinary/ECGBaselineLib/neurobaseline.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import random 3 | 4 | import torch 5 | torch.set_default_dtype(torch.float32) 6 | import torch.nn as nn 7 | from torch.utils.data import Dataset, DataLoader 8 | import torch.nn.functional as F 9 | from torch.utils.tensorboard import SummaryWriter 10 | 11 | from sklearn.metrics import average_precision_score 12 | from .utils import find_threshold_f1 13 | 14 | import os 15 | from tqdm import tqdm 16 | import pickle 17 | import json 18 | 19 | 20 | # from https://wandb.ai/sauravmaheshkar/RSNA-MICCAI/reports/How-to-Set-Random-Seeds-in-PyTorch-and-Tensorflow--VmlldzoxMDA2MDQy 21 | def set_seed(seed: int = 42) -> None: 22 | np.random.seed(seed) 23 | random.seed(seed) 24 | torch.manual_seed(seed) 25 | torch.cuda.manual_seed(seed) 26 | # When running on the CuDNN backend, two further options must be set 27 | torch.backends.cudnn.deterministic = True 28 | torch.backends.cudnn.benchmark = False 29 | # Set a fixed value for the hash seed 30 | os.environ["PYTHONHASHSEED"] = str(seed) 31 | print(f"random seed set as {seed}") 32 | 33 | 34 | ##### ECG Dataset ##### 35 | class ECGRuDataset(Dataset): 36 | """ECG.RU Dataset.""" 37 | 38 | def __init__(self, ecgs, labels, names): 39 | """ 40 | Args: 41 | labels: array with labels 42 | ecgs: array with num_ch-lead ecgs 43 | """ 44 | self.ecgs = ecgs 45 | if labels is not None: 46 | self.labels = torch.from_numpy(labels).float() 47 | else: 48 | self.labels = None 49 | self.names = names 50 | 51 | def __len__(self): 52 | return len(self.names) 53 | 54 | def __getitem__(self, idx): 55 | if self.labels is not None: 56 | sample = {'value': self.ecgs[idx], 'target': self.labels[idx], 'names': self.names[idx]} 57 | else: 58 | sample = {'value': self.ecgs[idx], 'names': self.names[idx]} 59 | return sample 60 | 61 | 62 | ##### Multihead 1d-CNN model for the 1-st and 2-nd baselines ##### 63 | class CNN1dMultihead(nn.Module): 64 | def __init__(self, k=1, num_ch=12): 65 | 
super().__init__() 66 | """ 67 | Args: 68 | num_ch: number of channels of an ecg-signal 69 | k: number of classes 70 | """ 71 | self.layer1 = nn.Sequential( 72 | nn.Conv1d(num_ch, 24, 10, stride=2), 73 | nn.BatchNorm1d(24), 74 | nn.ReLU(), 75 | nn.Conv1d(24, 48, 10, stride=2), 76 | nn.BatchNorm1d(48), 77 | nn.ReLU(), 78 | nn.MaxPool1d(6, 2) 79 | ) 80 | self.layer2 = nn.Sequential( 81 | nn.Conv1d(48, 64, 10, stride=2), 82 | nn.BatchNorm1d(64), 83 | nn.ReLU(), 84 | nn.Conv1d(64, 128, 10, stride=2), 85 | nn.BatchNorm1d(128), 86 | nn.ReLU(), 87 | nn.AdaptiveMaxPool1d(10) 88 | ) 89 | self.classification_layers = nn.ModuleList([nn.Sequential( 90 | nn.Linear(128*10, 120), 91 | nn.ReLU(), 92 | nn.Linear(120, 160), 93 | nn.ReLU(), 94 | nn.Linear(160, 1) 95 | ) for i in range(k)]) 96 | 97 | def forward(self, x): 98 | x = self.layer1(x) 99 | x = self.layer2(x) 100 | x = torch.flatten(x, 1) 101 | preds = torch.stack([torch.squeeze(classification_layer(x)) for classification_layer in self.classification_layers]) 102 | return torch.swapaxes(preds, 0, 1) 103 | 104 | 105 | ##### Trainer for 1d-CNN model ##### 106 | class CNN1dTrainer: 107 | """ 108 | class_name - dict if multilabel (id2label), str in binary 109 | """ 110 | def __init__(self, class_name, 111 | model, optimizer, loss, 112 | train_dataset, val_dataset, test_dataset, model_path, 113 | batch_size=128, cuda_id=1): 114 | 115 | torch.manual_seed(0) 116 | random.seed(0) 117 | np.random.seed(0) 118 | 119 | self.model = model 120 | self.optimizer = optimizer 121 | self.loss = loss 122 | 123 | self.train_dataset = train_dataset 124 | self.val_dataset = val_dataset 125 | self.test_public = test_dataset 126 | 127 | self.result_output = {} 128 | 129 | self.batch_size = batch_size 130 | 131 | self.device = torch.device("cuda:" + str(cuda_id) if (torch.cuda.is_available() or cuda_id != -1) else "cpu") 132 | self.model = self.model.to(self.device) 133 | 134 | self.global_step = 0 135 | self.alpha = 0.8 136 | 137 | self.class_name = class_name 138 | 139 | self.result_output['class'] = class_name 140 | 141 | os.makedirs(model_path + "/models" + "/" +self.class_name, exist_ok=True) 142 | os.makedirs(model_path + "/summary" + "/" + self.class_name, exist_ok=True) 143 | os.makedirs(model_path + "/models" + "/" + self.class_name, exist_ok=True) 144 | self.writer = SummaryWriter(model_path + "/summary" + "/" + self.class_name) 145 | self.model_path = model_path 146 | 147 | def save_checkpoint(self, path): 148 | torch.save(self.model.state_dict(), path) 149 | 150 | def train(self, num_epochs): 151 | 152 | model = self.model 153 | optimizer = self.optimizer 154 | 155 | self.train_loader = DataLoader(self.train_dataset, shuffle=True, pin_memory=True, batch_size=self.batch_size, num_workers=4) 156 | self.val_loader = DataLoader(self.val_dataset, shuffle=False, pin_memory=True, batch_size=len(self.val_dataset), num_workers=4) 157 | 158 | best_val = -38 159 | for epoch in tqdm(range(num_epochs)): 160 | model.train() 161 | train_logits = [] 162 | train_gts = [] 163 | 164 | for batch in self.train_loader: 165 | batch = {k: v.to(self.device) for k, v in batch.items() if k != 'names'} 166 | optimizer.zero_grad() 167 | logits = model(batch['value']).squeeze() 168 | train_logits.append(logits.cpu().detach()) 169 | train_gts.append(batch['target'].cpu()) 170 | loss = self.loss(logits, batch['target']) 171 | loss.backward() 172 | optimizer.step() 173 | self.writer.add_scalar("Train Loss", loss.item(), global_step=self.global_step) 174 | self.global_step += 1 175 | 176 | 
train_logits = np.concatenate(train_logits) 177 | train_gts = np.concatenate(train_gts) 178 | 179 | if self.class_name != "multihead": 180 | train_logits = train_logits[:,None] 181 | train_gts = train_gts[:,None] 182 | 183 | res_ap = [] 184 | for i in range(train_logits.shape[1]): 185 | res_ap.append(average_precision_score(train_gts[:,i], train_logits[:,i])) 186 | self.writer.add_scalar("Train AP/{}".format(self.class_name), np.mean(res_ap), global_step=epoch) 187 | 188 | model.eval() 189 | with torch.no_grad(): 190 | for batch in self.val_loader: 191 | batch = {k: v.to(self.device) for k, v in batch.items() if k != 'names'} 192 | logits = model(batch['value']).cpu().squeeze() 193 | gts = batch['target'].cpu() 194 | 195 | if self.class_name != "multihead": 196 | logits = logits[:,None] 197 | gts = gts[:,None] 198 | 199 | res_ap = [] 200 | for i in range(logits.shape[1]): 201 | res_ap.append(average_precision_score(gts[:,i], logits[:,i])) 202 | mean_val = np.mean(res_ap) 203 | 204 | if mean_val > best_val: 205 | self.save_checkpoint(self.model_path + "/models" + "/" +self.class_name+"/best_checkpoint.pth") 206 | best_val = mean_val 207 | self.result_output['threshold_f1'] = find_threshold_f1(gts, logits) 208 | self.test(self.model, self.test_public, "public", epoch) 209 | self.writer.add_scalar("Val AP/{}".format(self.class_name), mean_val, global_step=epoch) 210 | with open(self.model_path + "/models" + "/" +self.class_name+"/log.pickle", 'wb') as handle: 211 | pickle.dump(self.result_output, handle, protocol=pickle.HIGHEST_PROTOCOL) 212 | 213 | 214 | def test(self, model, test_dataset, name, epoch): 215 | model.eval() 216 | 217 | test_loader = DataLoader(test_dataset, shuffle=True, pin_memory=True, batch_size=len(test_dataset), num_workers=4) 218 | for batch in test_loader: 219 | names = batch['names'] 220 | batch = {k: v.to(self.device) for k, v in batch.items() if k != 'names'} 221 | with torch.no_grad(): 222 | logits = model(batch['value']).cpu().detach().squeeze() 223 | 224 | if self.class_name != "multihead": 225 | logits = logits[:,None] 226 | 227 | preds = [] 228 | for i in range(logits.shape[1]): 229 | preds.append((logits[:,i] > self.result_output['threshold_f1'][i])*1) 230 | 231 | out_fname = self.model_path + "/models" + "/" +self.class_name + "/ECG2Pathology.jsonl" 232 | with open(out_fname, 'w') as fw: 233 | for rec in preds: 234 | res = dict(zip(names, rec.tolist())) 235 | json.dump(res, fw, ensure_ascii=False) 236 | fw.write('\n') -------------------------------------------------------------------------------- /lb_submissions/SAI/ECGBinary/ECGBaselineLib/utils.py: -------------------------------------------------------------------------------- 1 | from sklearn.metrics import precision_recall_curve 2 | import numpy as np 3 | 4 | 5 | def find_threshold_f1(trues, logits, eps=1e-9): 6 | if len(trues.shape) > 1: 7 | threshold = [] 8 | for i in range(trues.shape[1]): 9 | precision, recall, thresholds = precision_recall_curve(trues[:,i], logits[:,i]) 10 | f1_scores = 2 * precision * recall / (precision + recall + eps) 11 | threshold.append(float(thresholds[np.argmax(f1_scores)])) 12 | return threshold 13 | else: 14 | precision, recall, thresholds = precision_recall_curve(trues, logits) 15 | f1_scores = 2 * precision * recall / (precision + recall + eps) 16 | threshold.append(float(thresholds[np.argmax(f1_scores)])) 17 | return threshold -------------------------------------------------------------------------------- /lb_submissions/SAI/ECGBinary/README.md: 
--------------------------------------------------------------------------------
 1 | The solution is an ensemble of 73 models, each solving a binary classification task for its corresponding class.
 2 | 
 3 | ### To run the code
 4 | 
 5 | `pip install -r requirements.txt`
 6 | 
 7 | `python training.py data_path model_path  # run training and prediction`
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGBinary/requirements.txt:
--------------------------------------------------------------------------------
 1 | torch==1.11.0
 2 | numpy==1.24.4
 3 | pandas==2.0.2
 4 | scikit-learn==1.2.2
 5 | tqdm==4.65.0
 6 | tensorboard==2.13.0
--------------------------------------------------------------------------------
/lb_submissions/SAI/ECGBinary/training.py:
--------------------------------------------------------------------------------
 1 | import torch.optim as optim
 2 | import torch.nn as nn
 3 | import numpy as np
 4 | 
 5 | from ECGBaselineLib.datasets import get_dataset_baseline
 6 | from ECGBaselineLib.neurobaseline import set_seed, ECGRuDataset, CNN1dTrainer, CNN1dMultihead
 7 | 
 8 | import sys
 9 | import logging
 10 | 
 11 | import argparse
 12 | 
 13 | from pathlib import Path
 14 | import json
 15 | 
 16 | 
 17 | def main(args):
 18 |     # Fix seed
 19 |     set_seed(seed = args.random_state)
 20 |     # Logger
 21 |     logger = logging.getLogger('binary_baseline_training')
 22 |     log_format = '%(asctime)s %(message)s'
 23 |     logging.basicConfig(stream=sys.stdout, level=logging.INFO,
 24 |                         format=log_format, datefmt='%m/%d %I:%M:%S %p', filemode='w')
 25 |     fh = logging.FileHandler(Path(args.model_path) / "log_binary.txt")
 26 |     fh.setFormatter(logging.Formatter(log_format))
 27 |     logger.addHandler(fh)
 28 |     # Data preparing
 29 |     with open(Path(args.data_path) / "train/idx2pathology.jsonl", "r") as f:
 30 |         classes = json.load(f)
 31 |     for class_name in classes:
 32 |         logger.info("---------- Working with %s ----------" % (classes[class_name]))
 33 |         X_train, X_val, y_train, y_val, names_train, names_val = get_dataset_baseline(args.data_path, classes[class_name], int(class_name), "train", args.random_state)
 34 |         X_public, names_public = get_dataset_baseline(args.data_path, classes[class_name], int(class_name), "test", args.random_state)
 35 |         model = CNN1dMultihead()
 36 |         opt = optim.AdamW(model.parameters(), lr=3e-3)
 37 | 
 38 |         train_ds = ECGRuDataset(X_train, y_train, names_train)
 39 |         val_ds = ECGRuDataset(X_val, y_val, names_val)
 40 |         test_public = ECGRuDataset(X_public, None, names_public)
 41 | 
 42 |         trainer = CNN1dTrainer(class_name = class_name,
 43 |                                model = model, optimizer = opt, loss = nn.BCEWithLogitsLoss(),
 44 |                                train_dataset = train_ds, val_dataset = val_ds, test_dataset = test_public,
 45 |                                model_path = args.model_path,
 46 |                                cuda_id = args.cuda_id)
 47 |         logger.info("---------- Model training started! 
----------") 48 | trainer.train(args.num_epochs) 49 | with open(Path(args.model_path) / ( "models/" + class_name + "/ECG2Pathology.jsonl"), "r") as f: 50 | pred_i = json.load(f) 51 | if int(class_name) == 0: 52 | preds_dict = {k:[v] for k,v in pred_i.items()} 53 | else: 54 | for k in preds_dict: 55 | preds_dict[k].append(pred_i[k]) 56 | 57 | out_fname = Path(args.model_path) / "ECG2Pathology.jsonl" 58 | with open(out_fname, 'w') as fw: 59 | for rec in preds_dict: 60 | json.dump({"record_name":rec, "labels":np.array(preds_dict[rec]).nonzero()[0].tolist()}, fw, ensure_ascii=False) 61 | fw.write('\n') 62 | 63 | 64 | if __name__ == '__main__': 65 | parser = argparse.ArgumentParser(description = 'Baselines training script (1d-CNN)') 66 | parser.add_argument('data_path', help='dataset path (path to the folder containing test and train subfolders)', type=str) 67 | parser.add_argument('model_path', help='path to save the model and logs', type=str) 68 | parser.add_argument('--cuda_id', help='CUDA device number on a single GPU; use -1 if you want to work on CPU', type=int, default=0) 69 | parser.add_argument('--k', help='number of positive examples for class', type=int, default=11) 70 | parser.add_argument('--num_epochs', help='number of epochs', type=int, default=5) 71 | parser.add_argument('--random_state', help='random state number', type=int, default=19) 72 | args = parser.parse_args() 73 | Path(args.model_path).mkdir(parents = False, exist_ok = True) 74 | main(args) -------------------------------------------------------------------------------- /lb_submissions/SAI/ECGMultihead/ECGBaselineLib/datasets.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import os 4 | from pathlib import Path 5 | 6 | from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit 7 | from pathlib import Path 8 | 9 | # Iterative stratification 10 | def make_stratification(df, strat_matrix, random_state): 11 | msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=random_state) 12 | train_split, test_split = list(msss.split(df.record_name.values[:,None], strat_matrix))[0] 13 | # Obtain record numbers 14 | train_names = df.loc[train_split, "record_name"].values 15 | test_names = df.loc[test_split, "record_name"].values 16 | # Make val/test split 17 | msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=random_state) 18 | val_split, test_split = list(msss.split(test_names[:,None], strat_matrix[test_split]))[0] 19 | assert np.intersect1d(train_names, test_names[val_split]).shape[0] == 0 & np.intersect1d(test_names[val_split], test_names[test_split]).shape[0] == 0 & np.intersect1d(test_names[test_split], train_names).shape[0] == 0, "В разбияниях повторяются записи!" 
20 | return train_names, test_names[val_split], test_names[test_split] 21 | 22 | 23 | ##### Split for the multihead baseline ###### 24 | def get_dataset_baseline(data_path, dtype, random_state): 25 | assert dtype in ["train", "test"] 26 | classes_splits = {"ecgs":[], "targets":[], "names": []} 27 | metadata = pd.read_json(Path(data_path) / (dtype + "/" + dtype + ".jsonl"), lines=True) 28 | for signal in (Path(data_path) / dtype).glob("*.npy"): 29 | signal_name = signal.name[:signal.name.rfind('/')-3] 30 | classes_splits["names"].append(signal_name) 31 | with open(signal, "rb") as f: 32 | signal_value = np.load(f, allow_pickle=True) 33 | classes_splits['ecgs'].append(signal_value) 34 | if dtype == "train": 35 | signal_target = np.zeros(73) 36 | signal_target[metadata.loc[metadata.record_name == signal_name, "labels"].item()] = 1 37 | classes_splits["targets"].append(signal_target) 38 | classes_splits["ecgs"] = np.array(classes_splits["ecgs"]) 39 | classes_splits["names"] = np.array(classes_splits["names"]) 40 | if dtype == "test": 41 | del classes_splits["targets"] 42 | return classes_splits['ecgs'], classes_splits["names"] 43 | else: 44 | classes_splits["targets"] = np.array(classes_splits["targets"]) 45 | msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.33, random_state=random_state) 46 | train_split, val_split = list(msss.split(classes_splits["ecgs"], classes_splits["targets"]))[0] 47 | X_train, X_val, y_train, y_val = classes_splits["ecgs"][train_split], classes_splits["ecgs"][val_split], \ 48 | classes_splits["targets"][train_split], classes_splits["targets"][val_split] 49 | return X_train, X_val, y_train, y_val, classes_splits["names"][train_split], classes_splits["names"][val_split] -------------------------------------------------------------------------------- /lb_submissions/SAI/ECGMultihead/ECGBaselineLib/neurobaseline.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import random 3 | 4 | import torch 5 | torch.set_default_dtype(torch.float32) 6 | import torch.nn as nn 7 | from torch.utils.data import Dataset, DataLoader 8 | import torch.nn.functional as F 9 | from torch.utils.tensorboard import SummaryWriter 10 | 11 | from sklearn.metrics import average_precision_score 12 | from .utils import find_threshold_f1 13 | 14 | import os 15 | from tqdm import tqdm 16 | import pickle 17 | import json 18 | 19 | 20 | # from https://wandb.ai/sauravmaheshkar/RSNA-MICCAI/reports/How-to-Set-Random-Seeds-in-PyTorch-and-Tensorflow--VmlldzoxMDA2MDQy 21 | def set_seed(seed: int = 42) -> None: 22 | np.random.seed(seed) 23 | random.seed(seed) 24 | torch.manual_seed(seed) 25 | torch.cuda.manual_seed(seed) 26 | # When running on the CuDNN backend, two further options must be set 27 | torch.backends.cudnn.deterministic = True 28 | torch.backends.cudnn.benchmark = False 29 | # Set a fixed value for the hash seed 30 | os.environ["PYTHONHASHSEED"] = str(seed) 31 | print(f"random seed set as {seed}") 32 | 33 | 34 | ##### ECG Dataset ##### 35 | class ECGRuDataset(Dataset): 36 | """ECG.RU Dataset.""" 37 | 38 | def __init__(self, ecgs, labels, names): 39 | """ 40 | Args: 41 | labels: array with labels 42 | ecgs: array with num_ch-lead ecgs 43 | """ 44 | self.ecgs = ecgs 45 | if labels is not None: 46 | self.labels = torch.from_numpy(labels).float() 47 | else: 48 | self.labels = None 49 | self.names = names 50 | 51 | def __len__(self): 52 | return len(self.names) 53 | 54 | def __getitem__(self, idx): 55 | if self.labels is not None: 
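# 'target' is attached only when labels exist (train/val); test items carry just the signal and its name.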
56 | sample = {'value': self.ecgs[idx], 'target': self.labels[idx], 'names': self.names[idx]} 57 | else: 58 | sample = {'value': self.ecgs[idx], 'names': self.names[idx]} 59 | return sample 60 | 61 | 62 | ##### Multihead 1d-CNN model for the 1-st and 2-nd baselines ##### 63 | class CNN1dMultihead(nn.Module): 64 | def __init__(self, k=1, num_ch=12): 65 | super().__init__() 66 | """ 67 | Args: 68 | num_ch: number of channels of an ecg-signal 69 | k: number of classes 70 | """ 71 | self.layer1 = nn.Sequential( 72 | nn.Conv1d(num_ch, 24, 10, stride=2), 73 | nn.BatchNorm1d(24), 74 | nn.ReLU(), 75 | nn.Conv1d(24, 48, 10, stride=2), 76 | nn.BatchNorm1d(48), 77 | nn.ReLU(), 78 | nn.MaxPool1d(6, 2) 79 | ) 80 | self.layer2 = nn.Sequential( 81 | nn.Conv1d(48, 64, 10, stride=2), 82 | nn.BatchNorm1d(64), 83 | nn.ReLU(), 84 | nn.Conv1d(64, 128, 10, stride=2), 85 | nn.BatchNorm1d(128), 86 | nn.ReLU(), 87 | nn.AdaptiveMaxPool1d(10) 88 | ) 89 | self.classification_layers = nn.ModuleList([nn.Sequential( 90 | nn.Linear(128*10, 120), 91 | nn.ReLU(), 92 | nn.Linear(120, 160), 93 | nn.ReLU(), 94 | nn.Linear(160, 1) 95 | ) for i in range(k)]) 96 | 97 | def forward(self, x): 98 | x = self.layer1(x) 99 | x = self.layer2(x) 100 | x = torch.flatten(x, 1) 101 | preds = torch.stack([torch.squeeze(classification_layer(x)) for classification_layer in self.classification_layers]) 102 | return torch.swapaxes(preds, 0, 1) 103 | 104 | 105 | ##### Trainer for 1d-CNN model ##### 106 | class CNN1dTrainer: 107 | """ 108 | class_name - dict if multilabel (id2label), str in binary 109 | """ 110 | def __init__(self, class_name, 111 | model, optimizer, loss, 112 | train_dataset, val_dataset, test_dataset, model_path, 113 | batch_size=128, cuda_id=1): 114 | 115 | torch.manual_seed(0) 116 | random.seed(0) 117 | np.random.seed(0) 118 | 119 | self.model = model 120 | self.optimizer = optimizer 121 | self.loss = loss 122 | 123 | self.train_dataset = train_dataset 124 | self.val_dataset = val_dataset 125 | self.test_public = test_dataset 126 | 127 | self.result_output = {} 128 | 129 | self.batch_size = batch_size 130 | 131 | self.device = torch.device("cuda:" + str(cuda_id) if (torch.cuda.is_available() and cuda_id != -1) else "cpu") 132 | self.model = self.model.to(self.device) 133 | 134 | self.global_step = 0 135 | self.alpha = 0.8 136 | 137 | self.class_name = class_name 138 | 139 | self.result_output['class'] = class_name 140 | 141 | os.makedirs(model_path + "/models" + "/" +self.class_name, exist_ok=True) 142 | os.makedirs(model_path + "/summary" + "/" + self.class_name, exist_ok=True) 143 | os.makedirs(model_path + "/models" + "/" + self.class_name, exist_ok=True) 144 | self.writer = SummaryWriter(model_path + "/summary" + "/" + self.class_name) 145 | self.model_path = model_path 146 | 147 | def save_checkpoint(self, path): 148 | torch.save(self.model.state_dict(), path) 149 | 150 | def train(self, num_epochs): 151 | 152 | model = self.model 153 | optimizer = self.optimizer 154 | 155 | self.train_loader = DataLoader(self.train_dataset, shuffle=True, pin_memory=True, batch_size=self.batch_size, num_workers=4) 156 | self.val_loader = DataLoader(self.val_dataset, shuffle=False, pin_memory=True, batch_size=len(self.val_dataset), num_workers=4) 157 | 158 | best_val = -38 159 | for epoch in tqdm(range(num_epochs)): 160 | model.train() 161 | train_logits = [] 162 | train_gts = [] 163 | 164 | for batch in self.train_loader: 165 | batch = {k: v.to(self.device) for k, v in batch.items() if k != 'names'} 166 | optimizer.zero_grad() 167 | 
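# Forward pass on the batch; BCEWithLogitsLoss below applies the sigmoid to these raw logits internally.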
logits = model(batch['value']).squeeze() 168 | train_logits.append(logits.cpu().detach()) 169 | train_gts.append(batch['target'].cpu()) 170 | loss = self.loss(logits, batch['target']) 171 | loss.backward() 172 | optimizer.step() 173 | self.writer.add_scalar("Train Loss", loss.item(), global_step=self.global_step) 174 | self.global_step += 1 175 | 176 | train_logits = np.concatenate(train_logits) 177 | train_gts = np.concatenate(train_gts) 178 | 179 | if self.class_name != "multihead": 180 | train_logits = train_logits[:,None] 181 | train_gts = train_gts[:,None] 182 | 183 | res_ap = [] 184 | for i in range(train_logits.shape[1]): 185 | res_ap.append(average_precision_score(train_gts[:,i], train_logits[:,i])) 186 | self.writer.add_scalar("Train AP/{}".format(self.class_name), np.mean(res_ap), global_step=epoch) 187 | 188 | model.eval() 189 | with torch.no_grad(): 190 | for batch in self.val_loader: 191 | batch = {k: v.to(self.device) for k, v in batch.items() if k != 'names'} 192 | logits = model(batch['value']).cpu().squeeze() 193 | gts = batch['target'].cpu() 194 | 195 | if self.class_name != "multihead": 196 | logits = logits[:,None] 197 | gts = gts[:,None] 198 | 199 | res_ap = [] 200 | for i in range(logits.shape[1]): 201 | res_ap.append(average_precision_score(gts[:,i], logits[:,i])) 202 | mean_val = np.mean(res_ap) 203 | 204 | if mean_val > best_val: 205 | self.save_checkpoint(self.model_path + "/models" + "/" +self.class_name+"/best_checkpoint.pth") 206 | best_val = mean_val 207 | self.result_output['threshold_f1'] = find_threshold_f1(gts, logits) 208 | self.test(self.model, self.test_public, "public", epoch) 209 | self.writer.add_scalar("Val AP/{}".format(self.class_name), mean_val, global_step=epoch) 210 | with open(self.model_path + "/models" + "/" +self.class_name+"/log.pickle", 'wb') as handle: 211 | pickle.dump(self.result_output, handle, protocol=pickle.HIGHEST_PROTOCOL) 212 | 213 | 214 | def test(self, model, test_dataset, name, epoch): 215 | model.eval() 216 | 217 | test_loader = DataLoader(test_dataset, shuffle=True, pin_memory=True, batch_size=len(test_dataset), num_workers=4) 218 | for batch in test_loader: 219 | names = batch['names'] 220 | batch = {k: v.to(self.device) for k, v in batch.items() if k != 'names'} 221 | with torch.no_grad(): 222 | logits = model(batch['value']).cpu().detach().squeeze() 223 | 224 | if self.class_name != "multihead": 225 | logits = logits[:,None] 226 | 227 | preds = [] 228 | for i in range(logits.shape[1]): 229 | preds.append((logits[:,i] > self.result_output['threshold_f1'][i])*1) 230 | 231 | out_fname = self.model_path + "/models" + "/" +self.class_name + "/ECG2Pathology.jsonl" 232 | with open(out_fname, 'w') as fw: 233 | for rec in preds: 234 | res = dict(zip(names, rec.tolist())) 235 | json.dump(res, fw, ensure_ascii=False) 236 | fw.write('\n') -------------------------------------------------------------------------------- /lb_submissions/SAI/ECGMultihead/ECGBaselineLib/utils.py: -------------------------------------------------------------------------------- 1 | from sklearn.metrics import precision_recall_curve 2 | import numpy as np 3 | 4 | 5 | def find_threshold_f1(trues, logits, eps=1e-9): 6 | if len(trues.shape) > 1: 7 | threshold = [] 8 | for i in range(trues.shape[1]): 9 | precision, recall, thresholds = precision_recall_curve(trues[:,i], logits[:,i]) 10 | f1_scores = 2 * precision * recall / (precision + recall + eps) 11 | threshold.append(float(thresholds[np.argmax(f1_scores)])) 12 | return threshold 13 | else: 14 | precision, 
recall, thresholds = precision_recall_curve(trues, logits) 15 | f1_scores = 2 * precision * recall / (precision + recall + eps) 16 | threshold = [float(thresholds[np.argmax(f1_scores)])] 17 | return threshold -------------------------------------------------------------------------------- /lb_submissions/SAI/ECGMultihead/README.md: -------------------------------------------------------------------------------- 1 | The solution is a single model with 73 classification heads, each of which solves a binary classification task for its corresponding class. The model output is a 73-dimensional vector. 2 | 3 | ### To run the code 4 | 5 | `pip install -r requirements.txt` 6 | 7 | `python training.py data_path model_path # run training and prediction` 8 | -------------------------------------------------------------------------------- /lb_submissions/SAI/ECGMultihead/requirements.txt: -------------------------------------------------------------------------------- 1 | torch==1.11.0 2 | numpy==1.24.4 3 | pandas==2.0.2 4 | iterative-stratification==0.1.7 5 | scikit-learn==1.2.2 6 | tqdm==4.65.0 7 | tensorboard==2.13.0 -------------------------------------------------------------------------------- /lb_submissions/SAI/ECGMultihead/training.py: -------------------------------------------------------------------------------- 1 | import torch.optim as optim 2 | import torch.nn as nn 3 | import numpy as np 4 | 5 | from ECGBaselineLib.datasets import get_dataset_baseline 6 | from ECGBaselineLib.neurobaseline import set_seed, ECGRuDataset, CNN1dTrainer, CNN1dMultihead 7 | 8 | import sys 9 | import logging 10 | 11 | import argparse 12 | 13 | from pathlib import Path 14 | import json 15 | 16 | 17 | def main(args): 18 | # Fix seed 19 | set_seed(seed = args.random_state) 20 | # Logger 21 | logger = logging.getLogger('baseline_multihead_training') 22 | log_format = '%(asctime)s %(message)s' 23 | logging.basicConfig(stream=sys.stdout, level=logging.INFO, 24 | format=log_format, datefmt='%m/%d %I:%M:%S %p', filemode='w') 25 | fh = logging.FileHandler(Path(args.model_path) / "log_multihead.txt") 26 | fh.setFormatter(logging.Formatter(log_format)) 27 | logger.addHandler(fh) 28 | # Data preparing 29 | with open(Path(args.data_path) / "train/idx2pathology.jsonl", "r") as f: 30 | classes = json.load(f) 31 | logger.info("---------- Working with multihead model ----------") 32 | X_train, X_val, y_train, y_val, names_train, names_val = get_dataset_baseline(args.data_path, "train", args.random_state) 33 | X_public, names_public = get_dataset_baseline(args.data_path, "test", args.random_state) 34 | model = CNN1dMultihead(k=73) 35 | opt = optim.AdamW(model.parameters(), lr=3e-3) 36 | 37 | train_ds = ECGRuDataset(X_train, y_train, names_train) 38 | val_ds = ECGRuDataset(X_val, y_val, names_val) 39 | test_public = ECGRuDataset(X_public, None, names_public) 40 | 41 | trainer = CNN1dTrainer(class_name = "multihead", 42 | model = model, optimizer = opt, loss = nn.BCEWithLogitsLoss(), 43 | train_dataset = train_ds, val_dataset = val_ds, test_dataset = test_public, 44 | model_path = args.model_path, 45 | cuda_id = args.cuda_id) 46 | logger.info("---------- Model training started! ----------")
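# After training, the per-head outputs written under models/multihead/ are merged into one multilabel submission file below.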
----------") 47 | trainer.train(args.num_epochs) 48 | 49 | out_fname = Path(args.model_path) / "ECG2Pathology.jsonl" 50 | with open(Path(args.model_path) / ( "models/" + "multihead" + "/ECG2Pathology.jsonl"), 'r') as fw: 51 | for i, line in enumerate(fw): 52 | line = json.loads(line) 53 | if i == 0: 54 | preds_dict = {k:[v] for k,v in line.items()} 55 | else: 56 | for k in preds_dict: 57 | preds_dict[k].append(line[k]) 58 | 59 | out_fname = Path(args.model_path) / "ECG2Pathology.jsonl" 60 | with open(out_fname, 'w') as fw: 61 | for rec in preds_dict: 62 | json.dump({"record_name":rec, "labels":np.array(preds_dict[rec]).nonzero()[0].tolist()}, fw, ensure_ascii=False) 63 | fw.write('\n') 64 | 65 | 66 | if __name__ == '__main__': 67 | parser = argparse.ArgumentParser(description = 'Baselines training script (1d-CNN)') 68 | parser.add_argument('data_path', help='dataset path (path to the folder containing test and train subfolders)', type=str) 69 | parser.add_argument('model_path', help='path to save the model and logs', type=str) 70 | parser.add_argument('--cuda_id', help='CUDA device number on a single GPU; use -1 if you want to work on CPU', type=int, default=0) 71 | parser.add_argument('--k', help='number of positive examples for class', type=int, default=11) 72 | parser.add_argument('--num_epochs', help='number of epochs', type=int, default=5) 73 | parser.add_argument('--random_state', help='random state number', type=int, default=19) 74 | args = parser.parse_args() 75 | main(args) -------------------------------------------------------------------------------- /lb_submissions/SAI/Gigachat/.gitignore: -------------------------------------------------------------------------------- 1 | config*.ini 2 | *.log 3 | MedBench 4 | *.db 5 | *.xml 6 | logs* 7 | RuMedTest--sogma--dev.jsonl -------------------------------------------------------------------------------- /lb_submissions/SAI/Gigachat/README.md: -------------------------------------------------------------------------------- 1 | # Testing MedBench with Gigachat 2 | 3 | ## Summary 4 | 5 | We tested the performance of the Gigachat (`GigachatPro, uncensored, 2024-03-04`) model on RuMedBench tasks with the following results: 6 | 7 | | Task | Result | 8 | |------------|--------| 9 | | RuMedNLI |`65.17%`| 10 | | RuMedDaNet |`92.58%`| 11 | | RuMedTest |`72.04%`| 12 | 13 | ## Experiments description 14 | ### RuMedDaNet ( `rumed_da_net.py` ) 15 | 16 | Only one simple prompt was used -- just `{context}`, `{question}` and request to answer "Yes" or "No". 17 | 18 | **Accuracy (dev)**: `95.70 %`. 19 | ### RuMedNLI ( `rumed_nli.py` ) 20 | 3 approaches was used: 21 | 0. **Simple doctor prompt**: one prompt (with doctor role description, instruction and request to __respond in one word__) sent to LLM. 22 | 1. **Complex doctor prompt with moderator**: one prompt (with doctor role description, instruction and request to __respond in details__) sent to LLM. Then another one prompt (with moderator role description, request to choose the right answer and to respond in one word) sent to LLM. 23 | 2. **Complex doctor prompt with chat**: one prompt (with doctor role description, instruction and request to __respond in details__) sent to LLM. Then, if response isn't specific, new request with chat history sent to LLM. 
24 | 25 | **Accuracy (dev)**: 26 | 27 | | Approach | Accuracy | 28 | |---------------------|-----------| 29 | | v0: simple | `60.55 %` | 30 | | v1: doctor + prompt | `67.51 %` | 31 | | v2: doctor + chat | `67.93 %` | 32 | 33 | Approach `v2` was used for test evaluation. 34 | 35 | ### RuMedTest ( `rumed_test.py` ) 36 | For prompt evaluation, the Sogma dataset was used ([link](https://geetest.ru/tests/terapiya_(dlya_internov)_sogma_)). 37 | 38 | Eight approaches were checked: 39 | 0. **Simple prompt**: one prompt with the question instruction is sent to the LLM. Invalid answers are ignored. 40 | 1. **Simple doctor prompt**: one prompt with a doctor role, the question instruction, and a request to respond with a single number is sent to the LLM. Invalid answers are ignored. 41 | 2. **Complex doctor prompt with moderator**: like approach [RuMedNLI:1]. 42 | 3. **Complex doctor prompt with moderator (2)**: like the previous approach, [RuMedTest:2]. 43 | 4. **Complex doctor prompt with moderator (3)**: like [RuMedTest:2]. 44 | 5. **Complex doctor prompt with chat**: like [RuMedNLI:2]. 45 | 6. **Simple alphabetic doctor prompt with chat**: like the previous approach, [RuMedTest:5], but the numbers of the variants are replaced with letters (`1 -> a`, `2 -> b`, etc.). 46 | 7. **Complex alphabetic doctor prompt with chat**: like the previous approach, [RuMedTest:6], but with an instruction to __respond in detail__. 47 | 48 | **Accuracy (sogma)**: 49 | | Approach | Accuracy | 50 | |----------|-----------| 51 | | v0 | `49.15 %` | 52 | | v1 | `55.15 %` | 53 | | v2 | `53.46 %` | 54 | | v3 | `53.46 %` | 55 | | v4 | `49.41 %` | 56 | | v5 | `52.93 %` | 57 | | v6 | `57.24 %` | 58 | | v7 | `~26.1 %` | 59 | 60 | Approach `v6` was used for test evaluation. 61 | ## Usage 62 | 1. Run `pip install -r requirements.txt`. 63 | 64 | 2. Put `config.ini` with Gigachat credentials into this directory. 65 | 66 | Example of its content: 67 | ```ini 68 | [credentials] 69 | user = your-user 70 | credentials = your-credentials 71 | scope = your-scope 72 | [base_url] 73 | base_url = https://developers.sber.ru/... 74 | ``` 75 | 76 | 3. Run `s00-prepare.sh` to download the tests. 77 | 4. Run `s01-run-all-trains.sh` to evaluate on train/dev (optional). 78 | 5. Run `s02-run-all-tests.sh` to generate the test submissions.
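A single task can also be run directly, e.g. `python rumed_nli.py --answer-mode='v2' --path-in='MedBench/RuMedNLI/dev.jsonl'`; see `s01-run-all-trains.sh` and `s02-run-all-tests.sh` for the full command lists.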
79 | ### Notes 80 | - You can reuse this framework to check other LLMs: replace `rumed_utils#create_llm_gigachat` with something else. 81 | - To speed up test reruns, LLM responses are cached (see `set_llm_cache` in `rumed_utils`). 82 | -------------------------------------------------------------------------------- /lb_submissions/SAI/Gigachat/convert_sogma.py: -------------------------------------------------------------------------------- 1 | import json 2 | import sys 3 | from pathlib import Path 4 | 5 | import fire 6 | import xmltodict 7 | 8 | def read_xml_sogma_tasks(xml_content): 9 | # https://geetest.ru/tests/terapiya_(dlya_internov)_sogma_/download 10 | tree = xmltodict.parse(xml_content) 11 | questions = tree['geetest']['test']['questions']['question'] 12 | for qu in questions: 13 | task = dict(id=qu['@id'], question=qu['text']) 14 | for an in qu['answers']['answer']: 15 | task[an['@num']] = an['#text'] 16 | answers = [an['@num'] for an in qu['answers']['answer'] if an['@isCorrect'] == '1'] 17 | if len(answers) != 1: 18 | continue 19 | task['answer'] = answers[0] 20 | yield task 21 | 22 | def main(path_in='sogma-test.xml', path_out='RuMedTest--sogma--dev.jsonl'): 23 | assert path_in.endswith('.xml') 24 | assert path_out.endswith('.jsonl') 25 | 26 | xml_content = Path(path_in).read_text() 27 | tasks = list(read_xml_sogma_tasks(xml_content)) 28 | jsonl_content = '\n'.join(json.dumps(task, ensure_ascii=False) for task in tasks) 29 | Path(path_out).write_text(jsonl_content) 30 | print('Done! Converted {0} tasks!'.format(len(tasks))) 31 | 32 | if __name__ == '__main__': 33 | fire.Fire(main) 34 | -------------------------------------------------------------------------------- /lb_submissions/SAI/Gigachat/requirements.txt: -------------------------------------------------------------------------------- 1 | fire==0.5.0 2 | gigachat==0.1.16 3 | httpx 4 | gigachain 5 | pandas==1.5.3 6 | tqdm==4.66.1 7 | xmltodict==0.13.0 8 | certifi 9 | -------------------------------------------------------------------------------- /lb_submissions/SAI/Gigachat/rumed_da_net.py: -------------------------------------------------------------------------------- 1 | """Checking RuMedDaNet.""" 2 | 3 | from rumed_utils import log_answer, parse_element, run_main, wrapped_fire 4 | 5 | PROMPT = """Контекст: {context} 6 | Вопрос: {question} 7 | 8 | Обязательно ответь либо "Да", либо "Нет".""" 9 | 10 | def get_answer_basic(llm, q_dict): 11 | possible_answers = {'да', 'нет'} 12 | input_message = PROMPT.format(**q_dict) 13 | llm_response = llm.invoke(input_message).content 14 | answer = parse_element(llm_response, possible_answers) 15 | true_answer = q_dict.get('answer') 16 | log_answer(q_dict, possible_answers, true_answer, input_message, llm_response, answer) 17 | return answer 18 | 19 | def answer_to_test_output(q_dict, answer): 20 | return {'pairID': q_dict['pairID'], 'answer': answer} 21 | 22 | get_answer_map = { 23 | 'v0': get_answer_basic, 24 | } 25 | 26 | def main(path_in, config_path='config.ini', path_out=None): 27 | run_main( 28 | path_in=path_in, 29 | config_path=config_path, 30 | path_out=path_out, 31 | get_answer_map=get_answer_map, 32 | answer_mode='v0', 33 | answer_field='answer', 34 | answer_to_test_output=answer_to_test_output, 35 | ) 36 | 37 | if __name__ == '__main__': 38 | wrapped_fire(main) 39 | -------------------------------------------------------------------------------- /lb_submissions/SAI/Gigachat/rumed_nli.py: -------------------------------------------------------------------------------- 1 | 
"""Checking RumedDaNet.""" 2 | 3 | from langchain.schema import AIMessage, HumanMessage, SystemMessage 4 | 5 | from rumed_utils import log_answer, parse_element, run_main, wrapped_fire 6 | 7 | PROMPT = """Ты доктор и должен пройти экзамен. Тебе даны два утверждения. 8 | Первое -- абсолютно верное и должно быть базой для твоего ответа: "{ru_sentence1}". 9 | 10 | Второе утверждение таково: "{ru_sentence2}". 11 | 12 | Ты должен ответить, чем является второе утверждение в контексте первого: 13 | 1. Следствие 14 | 2. Противоречие 15 | 3. Нейтральность. 16 | 17 | Ответь одним словом:""".strip() 18 | 19 | PROMPT_2_COT = """Ты доктор и должен пройти экзамен. Тебе даны два утверждения. 20 | Первое -- абсолютно верное и должно быть базой для твоего ответа: "{ru_sentence1}". 21 | 22 | Второе утверждение таково: "{ru_sentence2}". 23 | 24 | Вопрос: чем является второе утверждение в контексте первого? 25 | 1. Следствие 26 | 2. Противоречие 27 | 3. Нейтральность. 28 | 29 | Рассуждай шаг за шагом и выбери правильный вариант. 30 | """ 31 | 32 | LABELS_MAP = { 33 | 'следствие': 'entailment', 34 | 'противоречие': 'contradiction', 35 | 'нейтральность': 'neutral', 36 | } 37 | 38 | POSSIBLE_ANSWERS = {'следствие', 'противоречие', 'нейтральность'} 39 | 40 | def get_answer_v0(llm, q_dict): 41 | input_message = PROMPT.format(**q_dict) 42 | llm_response = llm.invoke(input_message).content 43 | answer = parse_element(llm_response, POSSIBLE_ANSWERS) 44 | if answer == '2. Противоречие': 45 | answer = 'противоречие' 46 | elif answer == '3. Нейтральность.': 47 | answer = 'нейтральность' 48 | elif answer == '2.': 49 | answer = 'противоречие' 50 | answer = LABELS_MAP.get(answer) 51 | true_answer = q_dict.get('gold_label') 52 | possible_answers = LABELS_MAP.values() 53 | log_answer(q_dict, possible_answers, true_answer, input_message, llm_response, answer) 54 | return answer 55 | 56 | def get_answer_v1(llm, q_dict): 57 | input_message = PROMPT_2_COT.format(**q_dict) 58 | doctor_answer = llm.invoke(input_message).content 59 | possible_answers_pretty = '{' + ', '.join(POSSIBLE_ANSWERS) + '}' 60 | ru_sentence1 = q_dict['ru_sentence1'] 61 | ru_sentence2 = q_dict['ru_sentence2'] 62 | moderator_msg = f'''Ниже представлены вопрос из теста, а также ответ на этот вопрос со стороны врача. Врач отвечает развёрнуто. Твоя задача -- понять, какой же именно вариант ответа из {possible_answers_pretty} выбрал врач. 63 | ===== 64 | Задача: 65 | Даны два утверждения. 66 | Первое -- абсолютно верное и должно быть базой для твоего ответа: "{ru_sentence1}". 67 | 68 | Второе утверждение таково -- "{ru_sentence2}". 69 | 70 | Ты должен ответить, чем является второе утверждение в контексте первого: 71 | 1. Следствие 72 | 2. Противоречие 73 | 3. Нейтральность. 
74 | ===== 75 | Ответ врача: 76 | {doctor_answer} 77 | ===== 78 | Ответь одним словом из {possible_answers_pretty}: 79 | ''' 80 | moderator_answer = llm.invoke(moderator_msg).content 81 | answer = parse_element(moderator_answer, POSSIBLE_ANSWERS) 82 | answer = LABELS_MAP.get(answer) 83 | true_answer = q_dict.get('gold_label') 84 | possible_answers = LABELS_MAP.values() 85 | log_answer(q_dict, possible_answers, true_answer, moderator_msg, moderator_answer, answer) 86 | return answer 87 | 88 | ASK_ATTEMPTS = 3 89 | def get_answer_v2(llm, q_dict): 90 | input_message = PROMPT_2_COT.format(**q_dict) 91 | possible_answers_pretty = '{' + ', '.join(POSSIBLE_ANSWERS) + '}' 92 | 93 | system_msg = SystemMessage(content=input_message) 94 | memory = [system_msg] 95 | for at in range(ASK_ATTEMPTS): 96 | ai_msg = llm.invoke(memory) 97 | text = ai_msg.content 98 | answer = parse_element(text, POSSIBLE_ANSWERS) 99 | answer = LABELS_MAP.get(answer) 100 | if answer: 101 | break 102 | memory.append(ai_msg) 103 | memory.append(HumanMessage(content=f'Ответь одним словом из {possible_answers_pretty}.')) 104 | true_answer = q_dict.get('gold_label') 105 | possible_answers = LABELS_MAP.values() 106 | moderator_msg = input_message 107 | moderator_answer = '{\n' + '\n###\n'.join(msg.content for msg in memory[1:]) + '\n}' 108 | log_answer(q_dict, possible_answers, true_answer, moderator_msg, moderator_answer, answer) 109 | return answer 110 | 111 | get_answer_map = { 112 | 'v0': get_answer_v0, 113 | 'v1': get_answer_v1, 114 | 'v2': get_answer_v2, 115 | } 116 | 117 | def answer_to_test_output(q_dict, answer) -> dict: 118 | return {'pairID': q_dict['pairID'], 'gold_label': answer} 119 | 120 | def main(path_in, config_path='config.ini', answer_mode='v0', path_out=None): 121 | run_main( 122 | path_in=path_in, 123 | config_path=config_path, 124 | path_out=path_out, 125 | get_answer_map=get_answer_map, 126 | answer_mode=answer_mode, 127 | answer_field='gold_label', 128 | answer_to_test_output=answer_to_test_output, 129 | ) 130 | 131 | if __name__ == '__main__': 132 | wrapped_fire(main) 133 | -------------------------------------------------------------------------------- /lb_submissions/SAI/Gigachat/rumed_test.py: -------------------------------------------------------------------------------- 1 | """Checking RuMedTest.""" 2 | 3 | import re 4 | 5 | from langchain.schema import AIMessage, HumanMessage, SystemMessage 6 | 7 | from rumed_utils import log_answer, parse_element, run_main, wrapped_fire 8 | 9 | def extract_answer_keys(q_dict): 10 | snums = map(str, range(1, 10)) 11 | return [si for si in snums if si in q_dict] 12 | 13 | def make_input_message_v0(q_dict): 14 | answers = extract_answer_keys(q_dict) 15 | prompt = '\n'.join([ 16 | 'Выбери номер наиболее корректного утверждения:', 17 | *('%s. {0} {%s}.' % (si, si) for si in answers), 18 | '\nНомер наиболее корректного утверждения:', 19 | ]) 20 | input_message = prompt.format(q_dict['question'], *(q_dict[ii] for ii in answers)) 21 | return input_message 22 | 23 | def make_input_message_v1(q_dict): 24 | answers = extract_answer_keys(q_dict) 25 | prompt = '\n'.join([ 26 | 'Ты врач, сдаёшь медицинский экзамен. Тебе нужно дать правильный ответ на вопрос:\n{0}\n\nВарианты ответа:', 27 | *('%s. {%s}.' % (si, si) for si in answers), 28 | '\nКакой из ответов {0} наиболее корректен? 
Обязательно ответь одним числом.'.format(', '.join(answers)), 29 | ]) 30 | input_message = prompt.format(q_dict['question'], *(q_dict[ii] for ii in answers)) 31 | return input_message 32 | 33 | def make_input_message_v2(q_dict): 34 | answers = extract_answer_keys(q_dict) 35 | prompt = '\n'.join([ 36 | 'Ты врач, сдаёшь медицинский экзамен. Тебе нужно дать правильный ответ на вопрос:\n{0}\n\nВарианты ответа:', 37 | *('%s. {%s}.' % (si, si) for si in answers), 38 | '\nКакой из ответов {0} наиболее корректен? Ответь и объясни, почему'.format(', '.join(answers)), 39 | ]) 40 | input_message = prompt.format(q_dict['question'], *(q_dict[ii] for ii in answers)) 41 | return input_message 42 | 43 | def make_input_message_v3(q_dict): 44 | answers = extract_answer_keys(q_dict) 45 | prompt = '\n'.join([ 46 | 'Ты врач, сдаёшь медицинский экзамен. Тебе нужно дать правильный ответ на вопрос:\n{0}\n\nВарианты ответа:', 47 | *('%s. {%s}.' % (si, si) for si in answers), 48 | '\nРассуждай шаг за шагом и скажи, какой из ответов {0} наиболее корректен? Помни, что правильный ответ только один!'.format(', '.join(answers)), 49 | ]) 50 | input_message = prompt.format(q_dict['question'], *(q_dict[ii] for ii in answers)) 51 | return input_message 52 | 53 | def make_input_message_v4(q_dict): 54 | answers = extract_answer_keys(q_dict) 55 | prompt = '\n'.join([ 56 | 'Ты сдаёшь тест с одним правильным ответом. Вопрос:\n{0}', 57 | *('%s. {%s}.' % (si, si) for si in answers), 58 | '\nРассуждай шаг за шагом и скажи, какой из ответов {0} правильный? Правильный ответ только один!'.format(', '.join(answers)), 59 | ]) 60 | input_message = prompt.format(q_dict['question'], *(q_dict[ii] for ii in answers)) 61 | return input_message 62 | 63 | def get_answer_basic(llm, q_dict, message_maker): 64 | possible_answers = extract_answer_keys(q_dict) 65 | input_message = message_maker(q_dict) 66 | llm_response = llm.invoke(input_message).content 67 | answer = parse_element(llm_response, possible_answers) 68 | true_answer = q_dict.get('answer') 69 | log_answer(q_dict, possible_answers, true_answer, input_message, llm_response, answer) 70 | return answer 71 | 72 | def get_answer_via_roles(llm, q_dict, message_maker): 73 | input_message = message_maker(q_dict) 74 | possible_answers = extract_answer_keys(q_dict) 75 | possible_answers_pretty = '{' + ', '.join(possible_answers) + '}' 76 | question = q_dict['question'] 77 | variants_fmt = '\n'.join('%s. {%s}' % (si, si) for si in possible_answers) 78 | variants = '\n'.join('{}. {}'.format(ii, q_dict[ii]) for ii in possible_answers) 79 | full_answer = llm.invoke(input_message).content 80 | moderator_msg = f'''Ниже представлены тест в виде вопроса и вариантов ответа, а также ответ на этот вопрос со стороны врача. Врач отвечает развёрнуто. Твоя задача -- понять, какой же именно вариант ответа из {possible_answers_pretty} выбрал врач. 
81 | ===== 82 | Вопрос: 83 | {question} 84 | 85 | Варианты ответа: 86 | {variants} 87 | ===== 88 | Ответ врача: 89 | {full_answer} 90 | ===== 91 | Ответь одним числом из {possible_answers_pretty}: 92 | ''' 93 | llm_response = llm.invoke(moderator_msg).content 94 | answer = parse_element(llm_response, possible_answers) 95 | true_answer = q_dict.get('answer') 96 | log_answer(q_dict, possible_answers, true_answer, moderator_msg, llm_response, answer) 97 | return answer 98 | 99 | ASK_ATTEMPTS = 5 100 | def get_answer_v5(llm, q_dict): 101 | possible_answers = extract_answer_keys(q_dict) 102 | possible_answers_pretty = '[' + ', '.join(possible_answers) + ']' 103 | input_prompt = '\n'.join([ 104 | 'Ты врач, сдаёшь тест. Вопрос:\n{0}', 105 | *('%s. {%s}.' % (si, si) for si in possible_answers), 106 | f'\nРассуждай шаг за шагом и скажи, какой из ответов {possible_answers_pretty} правильный? Если правильных ответов несколько выбери один, самый правдоподобный.', 107 | ]) 108 | input_message = input_prompt.format(q_dict['question'], *(q_dict[ii] for ii in possible_answers)) 109 | memory = [SystemMessage(content=input_message)] 110 | for at in range(ASK_ATTEMPTS): 111 | ai_msg = llm.invoke(memory) 112 | text = ai_msg.content 113 | answer = parse_element(text, possible_answers) 114 | if answer: 115 | break 116 | memory.append(ai_msg) 117 | if re.findall('^\d+$', ai_msg.content): 118 | parts = ', '.join([f'либо {si}' for si in ai_msg.content]) 119 | hc = f'Правильный ответ только один. Остальные неверные. Выбери тот, который тебе кажется наиболее похожим на правду: {parts}' 120 | else: 121 | hc = f'Ответь одним числом из {possible_answers_pretty}. Если ты думаешь, что правильных ответов несколько, выбери один, самый правдоподобный' 122 | memory.append(HumanMessage(content=hc)) 123 | true_answer = q_dict.get('answer') 124 | moderator_msg = input_message 125 | moderator_answer = '{\n' + '\n###\n'.join(msg.content for msg in memory[1:]) + '\n}' 126 | log_answer(q_dict, possible_answers, true_answer, moderator_msg, moderator_answer, answer) 127 | return answer 128 | 129 | 130 | VS = 'abcdef' 131 | # VS = 'alpha beta gamma delta epsilon zeta'.split() # works, but slightly worse 132 | ALPHA_MAPPER = dict(zip(map(str, range(1, 7)), VS)) 133 | ALPHA_INV_MAPPER = {val:key for key, val in ALPHA_MAPPER.items()} 134 | 135 | def get_answer_v6(llm, q_dict): 136 | possible_answers = extract_answer_keys(q_dict) 137 | pretty_answers = [ALPHA_MAPPER[an] for an in possible_answers] 138 | prompt = '\n'.join([ 139 | 'Ты врач, сдаёшь медицинский экзамен. Тебе нужно дать правильный ответ на вопрос:\n{0}\n\nВарианты ответа:', 140 | *('%s. {%s}.' % (ALPHA_MAPPER[si], si) for si in possible_answers), 141 | '\nКакой из ответов {0} наиболее корректен? Обязательно ответь одной буквой.'.format('[' + ', '.join(pretty_answers) + ']'), 142 | ]) 143 | input_message = prompt.format(q_dict['question'], *(q_dict[ii] for ii in possible_answers)) 144 | memory = [SystemMessage(content=input_message)] 145 | for at in range(ASK_ATTEMPTS): 146 | ai_msg = llm.invoke(memory) 147 | text = ai_msg.content 148 | answer = parse_element(text, pretty_answers) 149 | # import pdb; pdb.set_trace() 150 | memory.append(ai_msg) 151 | if answer: 152 | break 153 | hc = f'Ответь одной буквой из {pretty_answers}. 
Если ты думаешь, что правильных ответов несколько, выбери один, самый правдоподобный' 154 | memory.append(HumanMessage(content=hc)) 155 | answer = ALPHA_INV_MAPPER.get(answer) 156 | true_answer = q_dict.get('answer') 157 | moderator_msg = input_message 158 | moderator_answer = '{\n' + '\n###\n'.join(msg.content for msg in memory[1:]) + '\n}' 159 | log_answer(q_dict, possible_answers, true_answer, input_message, moderator_answer, answer) 160 | return answer 161 | 162 | def get_answer_v7(llm, q_dict): 163 | possible_answers = extract_answer_keys(q_dict) 164 | pretty_answers = [ALPHA_MAPPER[an] for an in possible_answers] 165 | prompt = '\n'.join([ 166 | 'Ты врач, сдаёшь медицинский экзамен. Тебе нужно дать правильный ответ на вопрос:\n{0}\n\nВарианты ответа:', 167 | *('%s. {%s}.' % (ALPHA_MAPPER[si], si) for si in possible_answers), 168 | '\nКакой из ответов {0} наиболее корректен? Порассуждай последовательно про каждый из вариантов, но будь краток, в конце дай ответ'.format('[' + ', '.join(pretty_answers) + ']'), 169 | ]) 170 | input_message = prompt.format(q_dict['question'], *(q_dict[ii] for ii in possible_answers)) 171 | memory = [SystemMessage(content=input_message)] 172 | for at in range(ASK_ATTEMPTS): 173 | ai_msg = llm.invoke(memory) 174 | text = ai_msg.content 175 | answer = parse_element(text, pretty_answers) 176 | memory.append(ai_msg) 177 | if answer: 178 | break 179 | hc = f'Ответь одной буквой из {pretty_answers}. Если ты думаешь, что правильных ответов несколько, выбери один, самый правдоподобный' 180 | memory.append(HumanMessage(content=hc)) 181 | answer = ALPHA_INV_MAPPER.get(answer) 182 | true_answer = q_dict.get('answer') 183 | moderator_msg = input_message 184 | moderator_answer = '{\n' + '\n###\n'.join(msg.content for msg in memory[1:]) + '\n}' 185 | log_answer(q_dict, possible_answers, true_answer, input_message, moderator_answer, answer) 186 | return answer 187 | 188 | get_answer_map = { 189 | 'v0': lambda llm, q_dict: get_answer_basic(llm, q_dict, make_input_message_v0), 190 | 'v1': lambda llm, q_dict: get_answer_basic(llm, q_dict, make_input_message_v1), 191 | 'v2': lambda llm, q_dict: get_answer_via_roles(llm, q_dict, make_input_message_v2), 192 | 'v3': lambda llm, q_dict: get_answer_via_roles(llm, q_dict, make_input_message_v3), 193 | 'v4': lambda llm, q_dict: get_answer_via_roles(llm, q_dict, make_input_message_v4), 194 | 'v5': get_answer_v5, 195 | 'v6': get_answer_v6, 196 | 'v7': get_answer_v7, 197 | } 198 | 199 | def answer_to_test_output(q_dict, answer): 200 | return {'idx': q_dict['idx'], 'answer': answer} 201 | 202 | def main(path_in, config_path='config.ini', path_out=None, answer_mode='v1'): 203 | run_main( 204 | path_in=path_in, 205 | config_path=config_path, 206 | path_out=path_out, 207 | get_answer_map=get_answer_map, 208 | answer_mode=answer_mode, 209 | answer_field='answer', 210 | answer_to_test_output=answer_to_test_output, 211 | ) 212 | 213 | if __name__ == '__main__': 214 | wrapped_fire(main) 215 | -------------------------------------------------------------------------------- /lb_submissions/SAI/Gigachat/rumed_utils.py: -------------------------------------------------------------------------------- 1 | import configparser 2 | import json 3 | import logging 4 | import time 5 | from datetime import datetime as dt 6 | from logging.config import dictConfig 7 | from pathlib import Path 8 | 9 | import fire 10 | import httpx 11 | from gigachat.exceptions import GigaChatException 12 | from langchain.cache import SQLiteCache 13 | from 
langchain.chat_models.gigachat import GigaChat 14 | from langchain.globals import set_llm_cache 15 | from tqdm import tqdm 16 | 17 | def extract_tags(path_in: Path): 18 | tags = [] 19 | if path_in.parent.name.startswith('Ru'): 20 | tags.append(path_in.parent.name) 21 | tags.append(path_in.stem) 22 | return tags 23 | 24 | def init_logging(path_in, answer_mode): 25 | path_in = Path(path_in) 26 | Path('./logs').mkdir(exist_ok=True) 27 | dt_now_pretty = dt.strftime(dt.now(), '%Y-%m-%d--%H-%M-%S') 28 | tags = [dt_now_pretty] + extract_tags(path_in) + [answer_mode] 29 | filename_log = './logs/{0}.log'.format('--'.join(tags)) 30 | 31 | logging_config = { 32 | 'version': 1, 33 | 'handlers': { 34 | 'file_handler': { 35 | 'class': 'logging.FileHandler', 36 | 'filename': filename_log, 37 | 'level': 'DEBUG', 38 | 'formatter': 'standard', 39 | }, 40 | 'benchmarks_file_handler': { 41 | 'class': 'logging.FileHandler', 42 | 'filename': 'benchmarks.log', 43 | 'level': 'INFO', 44 | 'formatter': 'standard', 45 | }, 46 | 'stream_handler': { 47 | 'class': 'logging.StreamHandler', 48 | 'level': 'WARNING', 49 | 'formatter': 'standard', 50 | }, 51 | }, 52 | 'formatters': { 53 | 'standard': { 54 | 'format': '%(asctime)s [%(levelname)s] %(message)s', 55 | 'datefmt': '%Y-%m-%d--%H-%M-%S', 56 | }, 57 | }, 58 | 'loggers': { 59 | 'root': { 60 | 'level': 'DEBUG', 61 | 'handlers': ['file_handler', 'stream_handler'], 62 | }, 63 | 'benchmarks': { 64 | 'level': 'INFO', 65 | 'handlers': ['benchmarks_file_handler'], 66 | } 67 | } 68 | } 69 | dictConfig(logging_config) 70 | logging.warning('Logs_path: %s', filename_log) 71 | 72 | GIGACHAT_MODEL = 'GigaChat-Pro' 73 | 74 | def create_llm_gigachat(config): 75 | credentials = dict(config['credentials']) 76 | base_url = config['base_url']['base_url'] 77 | logging.info('credentials: %s', credentials) 78 | logging.info('base_url: %s', base_url) 79 | 80 | # https://python.langchain.com/docs/modules/model_io/llms/llm_caching 81 | user = config['credentials']['user'] 82 | database_path = "{0}.langchain.db".format(user) 83 | set_llm_cache(SQLiteCache(database_path=database_path)) 84 | 85 | return GigaChat( 86 | verify_ssl_certs=False, 87 | profanity_check=False, 88 | model=GIGACHAT_MODEL, 89 | base_url=base_url, 90 | **credentials, 91 | ) 92 | 93 | def parse_element(answer, elements): 94 | fst_word = answer.split()[0].strip(',.').strip().lower() 95 | if fst_word in elements: 96 | return fst_word 97 | return None 98 | 99 | def format_accuracy(correct, total): 100 | return '{0:.2f} %'.format(correct / total * 100) 101 | 102 | ATTEMPTS = 10 103 | WAIT_SECONDS = 6 104 | def repeater(callback, skip_ex): 105 | def wrapped_callback(*args, **kwargs): 106 | wait_s = WAIT_SECONDS 107 | for at in range(1, ATTEMPTS + 1): 108 | try: 109 | return callback(*args, **kwargs) 110 | except Exception as ex: 111 | if skip_ex and isinstance(ex, skip_ex): 112 | logging.warning('Failed to execute: %s, attempt=%d', ex, at) 113 | time.sleep(wait_s) 114 | wait_s *= 2 115 | if at == ATTEMPTS: 116 | logging.exception('Attempts out...') 117 | raise ex 118 | else: 119 | raise ex 120 | return wrapped_callback 121 | 122 | def init_llm(config_path): 123 | config = configparser.ConfigParser() 124 | config.read(config_path) 125 | return create_llm_gigachat(config) 126 | 127 | def read_json_tasks(path_in): 128 | tasks = [json.loads(line) for line in path_in.read_text().strip().splitlines()] 129 | return tasks 130 | 131 | CONNECTION_EXCEPTIONS = (GigaChatException, httpx.HTTPError, json.decoder.JSONDecodeError) 132 | 133 | 
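# Example (illustrative) of how `repeater` is used below: wrap a flaky LLM call so that
# transient connection errors are retried with exponential backoff, e.g.
#   w_get_answer = repeater(get_answer, skip_ex=CONNECTION_EXCEPTIONS)
#   answer = w_get_answer(llm, task_dict)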
def benchmark_check(path_in, llm, tasks, get_answer, answer_field, tags=None): 134 | logging.warning('Path_in: %s', path_in) 135 | logging.warning('Tasks: %d', len(tasks)) 136 | correct_total = 0 137 | pbar = tqdm(range(len(tasks))) 138 | w_get_answer = repeater(get_answer, skip_ex=CONNECTION_EXCEPTIONS) 139 | for ti in pbar: 140 | td = tasks[ti] 141 | answer = w_get_answer(llm, td) 142 | true_answer = td[answer_field] 143 | check = answer == true_answer 144 | correct_total += check 145 | acc = format_accuracy(correct_total, ti + 1) 146 | pbar.set_description('acc: {0}'.format(acc)) 147 | if ti % 10 == 0: 148 | logging.info('index=%d, acc: %s', ti, acc) 149 | tags = extract_tags(path_in) + (tags or []) 150 | b_info = 'Tags: {0}, final accuracy: {1}'.format(' '.join(tags), format_accuracy(correct_total, len(tasks))) 151 | logging.getLogger('benchmarks').info(b_info) 152 | logging.warning('Done! %s', b_info) 153 | 154 | def benchmark_test(llm, tasks, path_out, get_answer, answer_to_test_output): 155 | logging.warning('Tasks: %d', len(tasks)) 156 | lines = [] 157 | pbar = tqdm(range(len(tasks))) 158 | w_get_answer = repeater(get_answer, skip_ex=CONNECTION_EXCEPTIONS) 159 | for ti in pbar: 160 | td = tasks[ti] 161 | answer = w_get_answer(llm, td) 162 | answer_output = answer_to_test_output(td, answer) 163 | lines.append(json.dumps(answer_output, ensure_ascii=False)) 164 | Path(path_out).write_text('\n'.join(lines)) 165 | logging.info('Done! Saved to %s', path_out) 166 | 167 | def log_answer(q_dict, possible_answers, true_answer, input_message, llm_response, answer): 168 | log_callback = logging.debug if (answer is not None) else logging.warning 169 | # todo fix, use `None` here 170 | if true_answer is not None: 171 | check = answer == true_answer 172 | log_callback('input_message: %s\nllm_response: %s\nanswer: %s\ntrue_answer: %s\ncorrect: %s', input_message, llm_response, answer, true_answer, check) 173 | else: 174 | log_callback('input_message: %s\nllm_response: %s\nanswer: %s', input_message, llm_response, answer) 175 | 176 | if answer not in possible_answers: 177 | logging.warning('Expected answer `{0}` not in possible answers `[{1}]`, q_dict: {2}'.format(answer, ', '.join(possible_answers), q_dict)) 178 | 179 | def choose_get_answer(get_answer_map, answer_mode): 180 | get_answer = get_answer_map.get(answer_mode) 181 | if get_answer is None: 182 | raise ValueError('Supported answer versions: {0}, found: {1}'.format(list(get_answer_map.keys()), answer_mode)) 183 | return get_answer 184 | 185 | def run_main(path_in, config_path, path_out, get_answer_map, answer_mode, answer_field, answer_to_test_output): 186 | get_answer = choose_get_answer(get_answer_map, answer_mode) 187 | path_in = Path(path_in) 188 | if not path_in.exists(): 189 | raise ValueError('`path_in`=`{0}` not exists!'.format(path_in)) 190 | init_logging(path_in, answer_mode) 191 | logging.warning('Answer mode: {0}'.format(answer_mode)) 192 | tags = [answer_mode] 193 | 194 | llm = init_llm(config_path) 195 | tasks = read_json_tasks(path_in) 196 | if any(sub in path_in.stem for sub in ('dev', 'train')): 197 | benchmark_check(path_in, llm, tasks, get_answer, answer_field, tags=tags) 198 | elif 'test' in path_in.stem: 199 | if path_out is None: 200 | raise ValueError('`path_out` should be passed') 201 | benchmark_test(llm, tasks, path_out, get_answer, answer_to_test_output) 202 | else: 203 | raise ValueError('Can not recognize mode, expected `dev`, `train` or `test` in `path_in`') 204 | 205 | def wrapped_fire(main): 206 | try: 207 | 
fire.Fire(main) 208 | except KeyboardInterrupt: 209 | logging.warning('Cancelled!') 210 | raise KeyboardInterrupt('Cancelled!') 211 | 212 | -------------------------------------------------------------------------------- /lb_submissions/SAI/Gigachat/s00-prepare.sh: -------------------------------------------------------------------------------- 1 | set -e 2 | sogma_xml_file="sogma-test.xml" 3 | sogma_jsonl_file="RuMedTest--sogma--dev.jsonl" 4 | medbench_url="https://medbench.ru/files/MedBench_data.zip" 5 | medbench_dir="../../../data" 6 | medbench_zip_path="$medbench_dir/MedBench_data.zip" 7 | 8 | if [ ! -e "$sogma_jsonl_file" ]; then 9 | echo "$sogma_jsonl_file does not exist. Downloading..." 10 | wget -nc "https://geetest.ru/content/files/terapiya_(dlya_internov)_sogma_.xml" -O "$sogma_xml_file" 11 | echo "Download complete." 12 | python convert_sogma.py --path-in="$sogma_xml_file" --path-out="$sogma_jsonl_file" 13 | rm -f "$sogma_xml_file" 14 | fi 15 | 16 | if [ ! -e $medbench_dir ]; then 17 | echo "$medbench_dir folder does not exist. Downloading and extracting..." 18 | mkdir $medbench_dir 19 | wget -nc "$medbench_url" -O "$medbench_zip_path" 20 | unzip "$medbench_zip_path" -d "$medbench_dir" 21 | rm -f "$medbench_zip_path" 22 | echo "Download and extraction complete." 23 | fi 24 | -------------------------------------------------------------------------------- /lb_submissions/SAI/Gigachat/s01-run-all-trains.sh: -------------------------------------------------------------------------------- 1 | rm -f benchmarks.log 2 | 3 | python rumed_nli.py --answer-mode='v0' --path-in='MedBench/RuMedNLI/dev.jsonl' 4 | python rumed_nli.py --answer-mode='v1' --path-in='MedBench/RuMedNLI/dev.jsonl' 5 | python rumed_nli.py --answer-mode='v2' --path-in='MedBench/RuMedNLI/dev.jsonl' 6 | 7 | python rumed_test.py --path-in="RuMedTest--sogma--dev.jsonl" --answer-mode=v0 8 | python rumed_test.py --path-in="RuMedTest--sogma--dev.jsonl" --answer-mode=v1 9 | python rumed_test.py --path-in="RuMedTest--sogma--dev.jsonl" --answer-mode=v2 10 | python rumed_test.py --path-in="RuMedTest--sogma--dev.jsonl" --answer-mode=v3 11 | python rumed_test.py --path-in="RuMedTest--sogma--dev.jsonl" --answer-mode=v4 12 | python rumed_test.py --path-in="RuMedTest--sogma--dev.jsonl" --answer-mode=v5 13 | python rumed_test.py --path-in="RuMedTest--sogma--dev.jsonl" --answer-mode=v6 14 | 15 | python rumed_da_net.py --path-in='MedBench/RuMedDaNet/dev.jsonl' 16 | # python rumed_da_net.py --path-in='MedBench/RuMedDaNet/train.jsonl' 17 | 18 | echo 'Benchmarks:' 19 | cat benchmarks.log 20 | -------------------------------------------------------------------------------- /lb_submissions/SAI/Gigachat/s02-run-all-tests.sh: -------------------------------------------------------------------------------- 1 | mkdir -p 'out' 2 | 3 | # python rumed_nli.py --path-in='MedBench/RuMedNLI/test.jsonl' --path-out='out/medbench--rumednli--v0.jsonl' --answer-mode='v0' 4 | # python rumed_nli.py --path-in='MedBench/RuMedNLI/test.jsonl' --path-out='out/medbench--rumednli--v1.jsonl' --answer-mode='v1' 5 | python rumed_nli.py --path-in='../../../data/RuMedNLI/test.jsonl' --path-out='out/RuMedNLI.jsonl' --answer-mode='v2' 6 | 7 | # python rumed_test.py --path-in='MedBench/RuMedTest/test.jsonl' --path-out='out/medbench--rumedtest--v2.jsonl' --answer-mode=v2 8 | # python rumed_test.py --path-in='MedBench/RuMedTest/test.jsonl' --path-out='out/medbench--rumedtest--v3.jsonl' --answer-mode=v3 9 | # python rumed_test.py --path-in='MedBench/RuMedTest/test.jsonl' 
--path-out='out/medbench--rumedtest--v4.jsonl' --answer-mode=v4 10 | python rumed_test.py --path-in='../../../data/RuMedTest/test.jsonl' --path-out='out/RuMedTest.jsonl' --answer-mode=v6 11 | 12 | python rumed_da_net.py --path-in='../../../data/RuMedDaNet/test.jsonl' --path-out='out/RuMedDaNet.jsonl' 13 | -------------------------------------------------------------------------------- /lb_submissions/SAI/Human/README.md: -------------------------------------------------------------------------------- 1 | To estimate human-level performance on the proposed tasks, the following procedures were carried out: 2 | 3 | - **RuMedDaNet** The examples from the private test part were split among several assessors (with no special medical education) without overlap, so each example was solved by exactly one participant; 4 | 5 | - **RuMedNLI** The answers were obtained via a procedure analogous to the RuMedDaNet task, with the only difference that the assessors were specialists with a medical education; 6 | 7 | - **RuMedTest** The score for this task is the consensus of the teaching community of higher medical schools on the minimum required qualification level of a general practitioner. 8 | 9 | - **ECG2Pathology** Initially, the signals were annotated by cardiologists as follows: the specialists were split into three groups (1000 signals per group), which assessed the signals according to the [thesaurus](https://ecg.ru/thesaurus). Each group consisted of three annotating cardiologists, who labeled the signals, and a validating cardiologist, who gave the final assessment based on their own opinion and the annotators' opinions. The human baseline was computed from this procedure as the macro F1 score of each annotating cardiologist with respect to their validator. The F1 scores were averaged in two stages: first over the 73 classes, then over the annotators. 10 | -------------------------------------------------------------------------------- /lb_submissions/SAI/Naive/README.md: -------------------------------------------------------------------------------- 1 | The naive solution uses the most frequent (or a random) label as the answer: 2 | 3 | - **RuMedDaNet** - the answer to every question is always "да" ("yes"); 4 | 5 | - **RuMedNLI** - the answer is always "neutral"; 6 | 7 | - **RuMedTest** - the answer is always the first option. 8 | 9 | - **ECG2Pathology** - the answer is always the most frequent class ("Нормальное положение ЭОС", i.e. normal electrical heart axis). 10 | -------------------------------------------------------------------------------- /lb_submissions/SAI/Naive/sample_submission.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pavel-blinov/RuMedBench/dd56a60faf55994ba402615c4067f0652ea0d2bb/lb_submissions/SAI/Naive/sample_submission.zip -------------------------------------------------------------------------------- /lb_submissions/SAI/RNN/README.md: -------------------------------------------------------------------------------- 1 | The solution is based on a model from the RNN family: 2 | 3 | - **RuMedDaNet** / **RuMedNLI** - we use a two-layer BiLSTM model with 300-dimensional word embeddings; 4 | 5 | - **RuMedTest** - we use the model trained on the RuMedNLI task to obtain an embedding matrix for the questions and four matrices, one for each answer option. The answer is chosen by the maximum cosine similarity between the question and answer vectors (see the sketch at the end of this file). 6 | 7 | ### To run the code 8 | 9 | `./run.sh` 10 |
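A minimal sketch of the RuMedTest answer selection (illustrative only; the actual implementation is in `test_solver.py`, which builds `q_vecs` and the per-option matrices with the BiLSTM encoder):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def pick_answers(q_vecs, option_vecs):
    # q_vecs: (n, d) question embeddings; option_vecs: list of four (n, d) matrices
    sims = np.stack([cosine_similarity(q_vecs, ov).diagonal() for ov in option_vecs], axis=1)
    return [str(1 + i) for i in sims.argmax(axis=1)]  # options are labeled '1'..'4'
```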
-------------------------------------------------------------------------------- /lb_submissions/SAI/RNN/double_text_classifier.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | import json 3 | import pathlib 4 | 5 | import joblib 6 | import click 7 | import numpy as np 8 | import pandas as pd 9 | from sklearn.metrics import accuracy_score 10 | import torch 11 | from torch import nn 12 | from torch.utils.data import DataLoader 13 | from tqdm import tqdm 14 | 15 | from utils import preprocess, seed_everything, seed_worker, DataPreprocessor 16 | 17 | SEED = 101 18 | seed_everything(SEED) 19 | class Classifier(nn.Module): 20 | 21 | def __init__(self, n_classes, vocab_size, emb_dim=300, hidden_dim=256): 22 | 23 | super().__init__() 24 | 25 | self.emb_dim = emb_dim 26 | self.hidden_dim = hidden_dim 27 | 28 | self.embedding_layer = nn.Embedding(vocab_size, self.emb_dim) 29 | self.lstm_layer = nn.LSTM(self.emb_dim, self.hidden_dim, batch_first=True, num_layers=2, 30 | bidirectional=True) 31 | self.linear_layer = nn.Linear(self.hidden_dim * 2, n_classes) 32 | 33 | def forward(self, x): 34 | x = self.embedding_layer(x) 35 | _, (hidden, _) = self.lstm_layer(x) 36 | hidden = torch.cat([hidden[0, :, :], hidden[1, :, :]], axis=1) 37 | return self.linear_layer(hidden) 38 | 39 | 40 | def preprocess_two_seqs(text1, text2, seq_len): 41 | seq1_len = int(seq_len * 0.75) 42 | seq2_len = seq_len - seq1_len 43 | 44 | tokens1 = preprocess(text1)[:seq1_len] 45 | tokens2 = preprocess(text2)[:seq2_len] 46 | 47 | return tokens1 + tokens2 48 | 49 | 50 | def build_vocab(text_data, min_freq=1): 51 | word2freq = defaultdict(int) 52 | word2index = {'PAD': 0, 'UNK': 1} 53 | 54 | for text in text_data: 55 | for token in text: 56 | word2freq[token] += 1 57 | 58 | for word, freq in word2freq.items(): 59 | if freq > min_freq: 60 | word2index[word] = len(word2index) 61 | return word2index 62 | 63 | 64 | def train_step(data, model, optimizer, criterion, device, losses, epoch): 65 | 66 | model.train() 67 | 68 | pbar = tqdm(total=len(data.dataset), desc=f'Epoch: {epoch + 1}') 69 | 70 | for x, y in data: 71 | 72 | x = x.to(device) 73 | y = y.to(device) 74 | 75 | optimizer.zero_grad() 76 | pred = model(x) 77 | 78 | loss = criterion(pred, y) 79 | 80 | loss.backward() 81 | optimizer.step() 82 | 83 | losses.append(loss.item()) 84 | 85 | pbar.set_postfix(train_loss = np.mean(losses[-100:])) 86 | pbar.update(x.shape[0]) 87 | 88 | pbar.close() 89 | 90 | return losses 91 | 92 | def eval_step(data, model, criterion, device, mode='dev'): 93 | 94 | test_losses = [] 95 | test_preds = [] 96 | test_true = [] 97 | 98 | pbar = tqdm(total=len(data.dataset), desc=f'Predictions on {mode} set') 99 | 100 | model.eval() 101 | 102 | for x, y in data: 103 | 104 | x = x.to(device) 105 | y = y.to(device) 106 | 107 | with torch.no_grad(): 108 | 109 | pred = model(x) 110 | 111 | loss = criterion(pred, y) 112 | test_losses.append(loss.item()) 113 | 114 | test_preds.append(torch.argmax(pred, dim=1).cpu().numpy()) 115 | test_true.append(y.cpu().numpy()) 116 | 117 | pbar.update(x.shape[0]) 118 | pbar.close() 119 | 120 | test_preds = np.concatenate(test_preds) 121 | 122 | if mode == 'dev': 123 | test_true = np.concatenate(test_true) 124 | mean_test_loss = np.mean(test_losses) 125 | accuracy = round(accuracy_score(test_true, test_preds) * 100, 2) 126 | return mean_test_loss, accuracy 127 | 128 | else: 129 | return test_preds 130 | 131 
| 132 | def train(train_data, dev_data, model, optimizer, criterion, device, n_epochs=50, max_patience=3): 133 | 134 | losses = [] 135 | best_accuracy = 0. 136 | 137 | patience = 0 138 | best_test_loss = 10. 139 | 140 | for epoch in range(n_epochs): 141 | 142 | losses = train_step(train_data, model, optimizer, criterion, device, losses, epoch) 143 | mean_dev_loss, accuracy = eval_step(dev_data, model, criterion, device) 144 | 145 | if accuracy > best_accuracy: 146 | best_accuracy = accuracy 147 | 148 | print(f'\nDev loss: {mean_dev_loss} \naccuracy: {accuracy}') 149 | 150 | if mean_dev_loss < best_test_loss: 151 | best_test_loss = mean_dev_loss 152 | elif patience == max_patience: 153 | print(f'Dev loss did not improve in {patience} epochs, early stopping') 154 | break 155 | else: 156 | patience += 1 157 | return best_accuracy 158 | 159 | 160 | @click.command() 161 | @click.option('--task-name', 162 | default='RuMedNLI', 163 | type=click.Choice(['RuMedDaNet', 'RuMedNLI']), 164 | help='The name of the task to run.') 165 | @click.option('--device', 166 | default=-1, 167 | help='Gpu to train the model on.') 168 | @click.option('--seq-len', 169 | default=256, 170 | help='Max sequence length.') 171 | @click.option('--data-path', 172 | default='../../../MedBench_data/', 173 | help='Path to the data files.') 174 | def main(task_name, data_path, device, seq_len): 175 | print(f'\n{task_name} task') 176 | 177 | out_path = pathlib.Path('.').absolute() 178 | data_path = pathlib.Path(data_path).absolute() / task_name 179 | 180 | train_data = pd.read_json(data_path / 'train.jsonl', lines=True) 181 | dev_data = pd.read_json(data_path / 'dev.jsonl', lines=True) 182 | test_data = pd.read_json(data_path / 'test.jsonl', lines=True) 183 | 184 | index_id = 'pairID' 185 | if task_name == 'RuMedNLI': 186 | l2i = {'neutral': 0, 'entailment': 1, 'contradiction': 2} 187 | text1_id = 'ru_sentence1' 188 | text2_id = 'ru_sentence2' 189 | label_id = 'gold_label' 190 | 191 | elif task_name == 'RuMedDaNet': 192 | l2i = {'нет': 0, 'да': 1} 193 | text1_id = 'context' 194 | text2_id = 'question' 195 | label_id = 'answer' 196 | else: 197 | raise ValueError('unknown task') 198 | 199 | i2l = {i: label for label, i in l2i.items()} 200 | 201 | text_data_train = [preprocess_two_seqs(text1, text2, seq_len) for text1, text2 in \ 202 | zip(train_data[text1_id], train_data[text2_id])] 203 | text_data_dev = [preprocess_two_seqs(text1, text2, seq_len) for text1, text2 in \ 204 | zip(dev_data[text1_id], dev_data[text2_id])] 205 | text_data_test = [preprocess_two_seqs(text1, text2, seq_len) for text1, text2 in \ 206 | zip(test_data[text1_id], test_data[text2_id])] 207 | 208 | word2index = build_vocab(text_data_train, min_freq=0) 209 | print(f'Total: {len(word2index)} tokens') 210 | 211 | train_dataset = DataPreprocessor(text_data_train, train_data[label_id], word2index, l2i, \ 212 | sequence_length=seq_len, preprocessing=False) 213 | dev_dataset = DataPreprocessor(text_data_dev, dev_data[label_id], word2index, l2i, \ 214 | sequence_length=seq_len, preprocessing=False) 215 | test_dataset = DataPreprocessor(text_data_test, None, word2index, l2i, \ 216 | sequence_length=seq_len, preprocessing=False) 217 | 218 | gen = torch.Generator() 219 | gen.manual_seed(SEED) 220 | train_dataset = DataLoader(train_dataset, batch_size=64, worker_init_fn=seed_worker, generator=gen) 221 | dev_dataset = DataLoader(dev_dataset, batch_size=64, worker_init_fn=seed_worker, generator=gen) 222 | test_dataset = DataLoader(test_dataset, batch_size=64, 
worker_init_fn=seed_worker, generator=gen) 223 | 224 | if device == -1: 225 | device = torch.device('cpu') 226 | else: 227 | device = torch.device(device) 228 | 229 | model = Classifier(n_classes=len(l2i), vocab_size=len(word2index)) 230 | criterion = nn.CrossEntropyLoss() 231 | optimizer = torch.optim.Adam(params=model.parameters()) 232 | 233 | model = model.to(device) 234 | criterion = criterion.to(device) 235 | 236 | accuracy = train(train_dataset, dev_dataset, model, optimizer, criterion, device) 237 | print (f'\n{task_name} task score on dev set: {accuracy}') 238 | 239 | test_preds = eval_step(test_dataset, model, criterion, device, mode='test') 240 | if task_name == 'RuMedNLI': 241 | torch.save(model.state_dict(), 'model.bin') 242 | joblib.dump(word2index, 'word2index.pkl') 243 | joblib.dump(l2i, 'l2i.pkl') 244 | 245 | recs = [] 246 | for i, pred in zip(test_data[index_id], test_preds): 247 | recs.append({index_id: i, label_id: i2l[pred]}) 248 | 249 | out_fname = out_path / f'{task_name}.jsonl' 250 | with open(out_fname, 'w') as fw: 251 | for rec in recs: 252 | json.dump(rec, fw, ensure_ascii=False) 253 | fw.write('\n') 254 | 255 | 256 | if __name__ == '__main__': 257 | main() -------------------------------------------------------------------------------- /lb_submissions/SAI/RNN/rnn.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pavel-blinov/RuMedBench/dd56a60faf55994ba402615c4067f0652ea0d2bb/lb_submissions/SAI/RNN/rnn.zip -------------------------------------------------------------------------------- /lb_submissions/SAI/RNN/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | python -u double_text_classifier.py --task-name 'RuMedDaNet' --device 2 4 | python -u double_text_classifier.py --task-name 'RuMedNLI' --device 2 5 | python -u test_solver.py --task-name 'RuMedTest' 6 | 7 | zip -m rnn.zip RuMedDaNet.jsonl RuMedNLI.jsonl RuMedTest.jsonl 8 | -------------------------------------------------------------------------------- /lb_submissions/SAI/RNN/test_solver.py: -------------------------------------------------------------------------------- 1 | import json 2 | import pathlib 3 | 4 | import click 5 | import numpy as np 6 | import pandas as pd 7 | 8 | import torch 9 | import joblib 10 | from utils import preprocess, DataPreprocessor 11 | from double_text_classifier import Classifier 12 | from sklearn.metrics.pairwise import cosine_similarity 13 | 14 | seq_len = 256 15 | 16 | @click.command() 17 | @click.option('--task-name', 18 | default='RuMedTest', 19 | type=click.Choice(['RuMedTest']), 20 | help='The name of the task to run.') 21 | @click.option('--data-path', 22 | default='../../../MedBench_data/', 23 | help='Path to the data files.') 24 | def main(task_name, data_path): 25 | print(f'\n{task_name} task') 26 | 27 | out_path = pathlib.Path('.').absolute() 28 | data_path = pathlib.Path(data_path).absolute() / task_name 29 | 30 | test_data = pd.read_json(data_path / 'test.jsonl', lines=True) 31 | 32 | index_id = 'idx' 33 | if task_name == 'RuMedTest': 34 | options = ['1', '2', '3', '4'] 35 | question_id = 'question' 36 | label_id = 'answer' 37 | else: 38 | raise ValueError('unknown task') 39 | 40 | word2index = joblib.load('word2index.pkl') 41 | l2i = joblib.load('l2i.pkl') 42 | 43 | model = Classifier(n_classes=len(l2i), vocab_size=len(word2index)) 44 | model.load_state_dict(torch.load('model.bin')) 45 | model.eval(); 46 | 47 | text_data_test = 
[preprocess(text1) for text1 in test_data['question']] 48 | 49 | test_dataset = DataPreprocessor(text_data_test, None, word2index, l2i, \ 50 | sequence_length=seq_len, preprocessing=False) 51 | 52 | q_vecs = [] 53 | for x, _ in test_dataset: 54 | with torch.no_grad(): 55 | x = model.embedding_layer(x[None, :]) 56 | _, (hidden, _) = model.lstm_layer(x) 57 | hidden = torch.cat([hidden[0, :, :], hidden[1, :, :]], axis=1).detach().cpu().numpy() 58 | q_vecs.append(hidden[0]) 59 | q_vecs = np.array(q_vecs) 60 | 61 | sims = [] 62 | for option in options: 63 | text_data_test = [preprocess(text1) for text1 in test_data[option]] 64 | test_dataset = DataPreprocessor(text_data_test, None, word2index, l2i, \ 65 | sequence_length=seq_len, preprocessing=False) 66 | 67 | option_vecs = [] 68 | for x, _ in test_dataset: 69 | with torch.no_grad(): 70 | x = model.embedding_layer(x[None, :]) 71 | _, (hidden, _) = model.lstm_layer(x) 72 | hidden = torch.cat([hidden[0, :, :], hidden[1, :, :]], axis=1).detach().cpu().numpy() 73 | option_vecs.append(hidden[0]) 74 | option_vecs = np.array(option_vecs) 75 | 76 | sim = cosine_similarity(q_vecs, option_vecs).diagonal() 77 | sims.append(sim) 78 | sims = np.array(sims).T 79 | 80 | recs = [] 81 | for i, pred in zip(test_data[index_id], sims): 82 | recs.append( { index_id: i, label_id: str(1+np.argmax(pred)) } ) 83 | 84 | out_fname = out_path / f'{task_name}.jsonl' 85 | with open(out_fname, 'w') as fw: 86 | for rec in recs: 87 | json.dump(rec, fw, ensure_ascii=False) 88 | fw.write('\n') 89 | 90 | 91 | if __name__ == '__main__': 92 | main() 93 | -------------------------------------------------------------------------------- /lb_submissions/SAI/RNN/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | from string import punctuation 3 | import random 4 | 5 | from nltk.tokenize import ToktokTokenizer 6 | import numpy as np 7 | import pandas as pd 8 | import torch 9 | from torch.utils.data import Dataset 10 | 11 | from typing import List, Dict, Union, Tuple, Set, Any 12 | 13 | TOKENIZER = ToktokTokenizer() 14 | 15 | 16 | def seed_everything(seed): 17 | os.environ['PYTHONHASHSEED'] = str(seed) 18 | os.environ['CUDA_LAUNCH_BLOCKING'] = '1' 19 | np.random.seed(seed) 20 | random.seed(seed) 21 | torch.manual_seed(seed) 22 | torch.cuda.manual_seed_all(seed) 23 | torch.cuda.manual_seed(seed) 24 | torch.backends.cudnn.deterministic = True 25 | torch.backends.cudnn.benchmark = False 26 | 27 | 28 | def seed_worker(worker_id): 29 | worker_seed = torch.initial_seed() % 2**32 30 | np.random.seed(worker_seed) 31 | random.seed(worker_seed) 32 | 33 | 34 | def preprocess(text, tokenizer=TOKENIZER): 35 | res = [] 36 | tokens = tokenizer.tokenize(text.lower()) 37 | for t in tokens: 38 | if t not in punctuation: 39 | res.append(t.strip(punctuation)) 40 | return res 41 | 42 | 43 | class DataPreprocessor(Dataset): 44 | 45 | def __init__(self, x_data, y_data, word2index, label2index, 46 | sequence_length=128, pad_token='PAD', unk_token='UNK', preprocessing=True): 47 | 48 | super().__init__() 49 | 50 | self.x_data = [] 51 | self.y_data = len(x_data)*[list(label2index.values())[0]] 52 | if type(y_data)!=type(None): 53 | self.y_data = y_data.map(label2index) 54 | 55 | self.word2index = word2index 56 | self.sequence_length = sequence_length 57 | 58 | self.pad_token = pad_token 59 | self.unk_token = unk_token 60 | self.pad_index = self.word2index[self.pad_token] 61 | 62 | self.preprocessing = preprocessing 63 | 64 | self.load(x_data) 65 | 66 | def 
load(self, data): 67 | 68 | for text in data: 69 | if self.preprocessing: 70 | words = preprocess(text) 71 | else: 72 | words = text 73 | indexed_words = self.indexing(words) 74 | self.x_data.append(indexed_words) 75 | 76 | def indexing(self, tokenized_text): 77 | unk_index = self.word2index[self.unk_token] 78 | return [self.word2index.get(token, unk_index) for token in tokenized_text] 79 | 80 | def padding(self, sequence): 81 | sequence = sequence + [self.pad_index] * (max(self.sequence_length - len(sequence), 0)) 82 | return sequence[:self.sequence_length] 83 | 84 | def __len__(self): 85 | return len(self.x_data) 86 | 87 | def __getitem__(self, idx): 88 | x = self.x_data[idx] 89 | x = self.padding(x) 90 | x = torch.Tensor(x).long() 91 | 92 | if type(self.y_data)==type(None): 93 | y = None 94 | else: 95 | y = self.y_data[idx] 96 | 97 | return x, y 98 | 99 | 100 | def preprocess_for_tokens( 101 | tokens: List[str] 102 | ) -> List[str]: 103 | 104 | return tokens 105 | 106 | class DataPreprocessorNer(Dataset): 107 | 108 | def __init__( 109 | self, 110 | x_data: pd.Series, 111 | y_data: pd.Series, 112 | word2index: Dict[str, int], 113 | label2index: Dict[str, int], 114 | sequence_length: int = 128, 115 | pad_token: str = 'PAD', 116 | unk_token: str = 'UNK' 117 | ) -> None: 118 | 119 | super().__init__() 120 | 121 | self.word2index = word2index 122 | self.label2index = label2index 123 | 124 | self.sequence_length = sequence_length 125 | self.pad_token = pad_token 126 | self.unk_token = unk_token 127 | self.pad_index = self.word2index[self.pad_token] 128 | self.unk_index = self.word2index[self.unk_token] 129 | 130 | self.x_data = self.load(x_data, self.word2index) 131 | self.y_data = self.load(y_data, self.label2index) 132 | 133 | 134 | def load( 135 | self, 136 | data: pd.Series, 137 | mapping: Dict[str, int] 138 | ) -> List[List[int]]: 139 | 140 | indexed_data = [] 141 | for case in data: 142 | processed_case = preprocess_for_tokens(case) 143 | indexed_case = self.indexing(processed_case, mapping) 144 | indexed_data.append(indexed_case) 145 | 146 | return indexed_data 147 | 148 | 149 | def indexing( 150 | self, 151 | tokenized_case: List[str], 152 | mapping: Dict[str, int] 153 | ) -> List[int]: 154 | 155 | return [mapping.get(token, self.unk_index) for token in tokenized_case] 156 | 157 | 158 | def padding( 159 | self, 160 | sequence: List[int] 161 | ) -> List[int]: 162 | sequence = sequence + [self.pad_index] * (max(self.sequence_length - len(sequence), 0)) 163 | return sequence[:self.sequence_length] 164 | 165 | 166 | def __len__(self): 167 | return len(self.x_data) 168 | 169 | 170 | def __getitem__( 171 | self, 172 | idx: int 173 | ) -> Tuple[torch.tensor, torch.tensor]: 174 | 175 | x = self.x_data[idx] 176 | y = self.y_data[idx] 177 | 178 | assert len(x) > 0 179 | 180 | x = self.padding(x) 181 | y = self.padding(y) 182 | 183 | x = torch.tensor(x, dtype=torch.int64) 184 | y = torch.tensor(y, dtype=torch.int64) 185 | 186 | return x, y 187 | -------------------------------------------------------------------------------- /lb_submissions/SAI/RuBERT/README.md: -------------------------------------------------------------------------------- 1 | Решение основано на использовании 2х вариантов RuBERT модели: 2 | 3 | - **RuMedDaNet** / **RuMedNLI** - объединяем пары входных текстов в единую строку, выполняем дообучение нужной модели под конкретную задачу; 4 | 5 | - **RuMedTest** - Используем предобученную модель RuBERT для получения контекстуализированных эмбеддингов (вопросов и каждого из 4 
вариантов ответов). Ответ выбираем по максимальному значению косинусной близости векторов вопроса и ответа. 6 | 7 | ### Для запуска кода 8 | 9 | `pip install -r requirements.txt` 10 | 11 | `./run.sh bert` для запуска *RuBERT* модели
12 | или
13 | `./run.sh pool` для запуска *RuPoolBERT* варианта модели. 14 | -------------------------------------------------------------------------------- /lb_submissions/SAI/RuBERT/bert.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pavel-blinov/RuMedBench/dd56a60faf55994ba402615c4067f0652ea0d2bb/lb_submissions/SAI/RuBERT/bert.zip -------------------------------------------------------------------------------- /lb_submissions/SAI/RuBERT/pool.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pavel-blinov/RuMedBench/dd56a60faf55994ba402615c4067f0652ea0d2bb/lb_submissions/SAI/RuBERT/pool.zip -------------------------------------------------------------------------------- /lb_submissions/SAI/RuBERT/requirements.txt: -------------------------------------------------------------------------------- 1 | torch==1.9.0 2 | torchtext==0.6.0 3 | tensorflow==2.6.0 4 | keras==2.6.0 5 | pandas==1.3.5 6 | transformers==4.12.5 7 | click==7.1.2 8 | nltk==3.4.5 9 | sklearn-crfsuite==0.3.6 10 | -------------------------------------------------------------------------------- /lb_submissions/SAI/RuBERT/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | type=$@ 4 | 5 | models=$(pwd)'/models' 6 | mkdir -p $models; 7 | 8 | if [ ! -f $models'/rubert_cased_L-12_H-768_A-12_pt/pytorch_model.bin' ]; then 9 | echo $models'/rubert_cased_L-12_H-768_A-12_pt/pytorch_model.bin' 10 | cd $models; 11 | wget "http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_pt.tar.gz" 12 | tar -xvzf rubert_cased_L-12_H-768_A-12_pt.tar.gz 13 | cd ../; 14 | fi 15 | 16 | python -u double_text_classifier.py --task-name 'RuMedDaNet' --device 0 --bert-type $type 17 | python -u double_text_classifier.py --task-name 'RuMedNLI' --device 0 --bert-type $type 18 | python -u test_solver.py --task-name 'RuMedTest' --device 0 --bert-type $type 19 | 20 | zip -m $type.zip RuMedDaNet.jsonl RuMedNLI.jsonl RuMedTest.jsonl 21 | -------------------------------------------------------------------------------- /lb_submissions/SAI/RuBERT/test_solver.py: -------------------------------------------------------------------------------- 1 | import json 2 | import pathlib 3 | 4 | import click 5 | import numpy as np 6 | import pandas as pd 7 | from scipy.special import expit 8 | 9 | import torch 10 | from torch.utils.data import TensorDataset, DataLoader, SequentialSampler 11 | import joblib 12 | from sklearn.metrics.pairwise import cosine_similarity 13 | from keras.preprocessing.sequence import pad_sequences 14 | from transformers import BertTokenizer, BertConfig 15 | from utils import seed_everything, seed_worker 16 | 17 | def encode_text_pairs(tokenizer, sentences): 18 | bs = 20000 19 | input_ids, attention_masks, token_type_ids = [], [], [] 20 | 21 | for _, i in enumerate(range(0, len(sentences), bs)): 22 | tokenized_texts = [] 23 | for sentence in sentences[i:i+bs]: 24 | final_tokens = ['[CLS]']+tokenizer.tokenize( sentence )[:MAX_LEN-2]+['[SEP]'] 25 | arr = np.array(final_tokens) 26 | mask = arr == '[SEP]' 27 | tokenized_texts.append(final_tokens) 28 | 29 | b_input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts] 30 | 31 | b_input_ids = pad_sequences(b_input_ids, maxlen=MAX_LEN, dtype='long', truncating='post', padding='post') 32 | 33 | b_token_type_ids = [] 34 | for i, row in enumerate(b_input_ids): 35 | row = np.array(row) 36 
| mask = row==tokenizer.convert_tokens_to_ids('[SEP]') 37 | idx = np.where(mask)[0][0] # first [SEP] position (currently unused) 38 | token_type_row = np.zeros(row.shape[0], dtype=np.int64) # single-segment input -> all-zero segment ids; np.int is deprecated 39 | b_token_type_ids.append(token_type_row) 40 | 41 | b_attention_masks = [] 42 | for seq in b_input_ids: 43 | seq_mask = [float(i>0) for i in seq] 44 | b_attention_masks.append(seq_mask) 45 | 46 | attention_masks.append(b_attention_masks) 47 | input_ids.append(b_input_ids) 48 | token_type_ids.append(b_token_type_ids) 49 | input_ids, attention_masks = np.vstack(input_ids), np.vstack(attention_masks) 50 | token_type_ids = np.vstack(token_type_ids) 51 | 52 | return input_ids, attention_masks, token_type_ids 53 | 54 | SEED = 128 55 | seed_everything(SEED) 56 | 57 | MAX_LEN = 512 58 | 59 | @click.command() 60 | @click.option('--task-name', 61 | default='RuMedTest', 62 | type=click.Choice(['RuMedTest']), 63 | help='The name of the task to run.') 64 | @click.option('--device', 65 | default=-1, 66 | help='Gpu to train the model on.') 67 | @click.option('--data-path', 68 | default='../../../MedBench_data/', 69 | help='Path to the data files.') 70 | @click.option('--bert-type', 71 | default='bert', 72 | help='BERT model variant.') 73 | def main(task_name, data_path, device, bert_type): 74 | print(f'\n{task_name} task') 75 | 76 | if device == -1: 77 | device = torch.device('cpu') 78 | else: 79 | device = torch.device(device) 80 | 81 | out_path = pathlib.Path('.').absolute() 82 | data_path = pathlib.Path(data_path).absolute() / task_name 83 | 84 | test_data = pd.read_json(data_path / 'test.jsonl', lines=True) 85 | 86 | index_id = 'idx' 87 | if task_name == 'RuMedTest': 88 | options = ['1', '2', '3', '4'] 89 | question_id = 'question' 90 | label_id = 'answer' 91 | else: 92 | raise ValueError('unknown task') 93 | 94 | tokenizer = BertTokenizer.from_pretrained( 95 | out_path / 'models/rubert_cased_L-12_H-768_A-12_pt/', 96 | do_lower_case=True, 97 | max_length=MAX_LEN 98 | ) 99 | 100 | from utils import BertFeatureExtractor as BertModel 101 | ## take appropriate config and init a BERT model 102 | config_path = out_path / 'models/rubert_cased_L-12_H-768_A-12_pt/bert_config.json' 103 | conf = BertConfig.from_json_file( config_path ) 104 | model = BertModel(conf) 105 | ## preload it with weights 106 | output_model_file = out_path / 'models/rubert_cased_L-12_H-768_A-12_pt/pytorch_model.bin' 107 | model.load_state_dict(torch.load(output_model_file), strict=False) 108 | model = model.to(device) 109 | model.eval() 110 | 111 | def get_embeddings(texts): 112 | input_ids, attention_masks, token_type_ids = encode_text_pairs(tokenizer, texts) 113 | ##prediction_dataloader 114 | input_ids = torch.tensor(input_ids) 115 | attention_masks = torch.tensor(attention_masks) 116 | token_type_ids = torch.tensor(token_type_ids) 117 | 118 | batch_size = 16 119 | prediction_data = TensorDataset(input_ids, attention_masks, token_type_ids) 120 | prediction_sampler = SequentialSampler(prediction_data) 121 | prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size, worker_init_fn=seed_worker) 122 | 123 | predictions = [] 124 | for step, batch in enumerate(prediction_dataloader): 125 | batch = tuple(t.to(device) for t in batch) 126 | b_input_ids, b_input_mask, b_token_type_ids = batch 127 | with torch.no_grad(): 128 | outputs = model( b_input_ids, token_type_ids=b_token_type_ids, attention_mask=b_input_mask, bert_type=bert_type ) 129 | outputs = outputs.detach().cpu().numpy() 130 | predictions.append(outputs) 131 | predictions = 
expit(np.vstack(predictions)) 132 | return predictions 133 | 134 | q_vecs = get_embeddings(test_data['question']) 135 | 136 | sims = [] 137 | for option in options: 138 | option_vecs = get_embeddings(test_data[option]) 139 | sim = cosine_similarity(q_vecs, option_vecs).diagonal() 140 | sims.append(sim) 141 | sims = np.array(sims).T 142 | 143 | recs = [] 144 | for i, pred in zip(test_data[index_id], sims): 145 | recs.append( { index_id: i, label_id: str(1+np.argmax(pred)) } ) 146 | 147 | out_fname = out_path / f'{task_name}.jsonl' 148 | with open(out_fname, 'w') as fw: 149 | for rec in recs: 150 | json.dump(rec, fw, ensure_ascii=False) 151 | fw.write('\n') 152 | 153 | 154 | if __name__ == '__main__': 155 | main() 156 | -------------------------------------------------------------------------------- /lb_submissions/SAI/RuBERT/utils.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import os 3 | import random 4 | import torch 5 | import numpy as np 6 | 7 | def seed_everything(seed): 8 | random.seed(seed) 9 | os.environ['PYTHONHASHSEED'] = str(seed) 10 | np.random.seed(seed) 11 | torch.manual_seed(seed) 12 | torch.cuda.manual_seed_all(seed) 13 | torch.cuda.manual_seed(seed) 14 | torch.backends.cudnn.deterministic = True 15 | torch.backends.cudnn.benchmark = False 16 | 17 | def seed_worker(worker_id): 18 | worker_seed = torch.initial_seed() % 2**32 19 | np.random.seed(worker_seed) 20 | random.seed(worker_seed) 21 | 22 | 23 | from torch import nn 24 | import torch.nn.functional as F 25 | from transformers import BertTokenizer, BertConfig, BertPreTrainedModel, BertModel 26 | 27 | class PoolBertForTokenClassification(BertPreTrainedModel): 28 | def __init__(self, config): 29 | super().__init__(config) 30 | self.num_labels = config.num_labels 31 | 32 | self.bert = BertModel(config, add_pooling_layer=False) 33 | self.dropout = nn.Dropout(config.hidden_dropout_prob) 34 | self.classifier = nn.Linear(config.hidden_size*3, config.num_labels) 35 | 36 | self.w_size = 4 37 | 38 | self.init_weights() 39 | 40 | def forward( 41 | self, 42 | input_ids=None, 43 | attention_mask=None, 44 | token_type_ids=None, 45 | position_ids=None, 46 | head_mask=None, 47 | inputs_embeds=None, 48 | labels=None, 49 | output_attentions=None, 50 | output_hidden_states=None, 51 | return_dict=None, 52 | ): 53 | outputs = self.bert( 54 | input_ids, 55 | attention_mask=attention_mask, 56 | token_type_ids=token_type_ids, 57 | position_ids=position_ids, 58 | head_mask=head_mask, 59 | inputs_embeds=inputs_embeds, 60 | output_attentions=output_attentions, 61 | output_hidden_states=output_hidden_states, 62 | return_dict=return_dict, 63 | ) 64 | 65 | sequence_output = outputs['last_hidden_state'] 66 | 67 | shape = list(sequence_output.shape) 68 | shape[1]+=self.w_size-1 69 | 70 | t_ext = torch.zeros(shape, dtype=sequence_output.dtype, device=sequence_output.device) 71 | t_ext[:, self.w_size-1:, :] = sequence_output 72 | 73 | unfold_t = t_ext.unfold(1, self.w_size, 1).transpose(3,2) 74 | pooled_output_mean = torch.mean(unfold_t, 2) 75 | 76 | pooled_output, _ = torch.max(unfold_t, 2) 77 | pooled_output = torch.relu(pooled_output) 78 | 79 | sequence_output = torch.cat((pooled_output, pooled_output_mean, sequence_output), 2) 80 | 81 | sequence_output = self.dropout(sequence_output) 82 | 83 | logits = self.classifier(sequence_output) 84 | 85 | loss = None 86 | if labels is not None: 87 | loss_fct = nn.CrossEntropyLoss() 88 | # Only keep active parts of the loss 89 | if 
attention_mask is not None: 90 | active_loss_mask = attention_mask.view(-1) == 1 91 | active_logits = logits.view(-1, self.num_labels) 92 | 93 | active_labels = torch.where( 94 | active_loss_mask, 95 | labels.view(-1), 96 | torch.tensor(loss_fct.ignore_index).type_as(labels) 97 | ) 98 | 99 | loss = loss_fct(active_logits, active_labels) 100 | else: 101 | loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) 102 | 103 | output = (logits,) + outputs[2:] 104 | return ((loss,) + output) if loss is not None else output 105 | 106 | class PoolBertForSequenceClassification(BertPreTrainedModel): 107 | def __init__(self, config): 108 | super().__init__(config) 109 | self.num_labels = config.num_labels 110 | 111 | self.bert = BertModel(config) 112 | self.dropout = nn.Dropout(config.hidden_dropout_prob) 113 | self.classifier = nn.Linear(config.hidden_size*3, self.config.num_labels) 114 | 115 | self.init_weights() 116 | 117 | def forward( 118 | self, 119 | input_ids=None, 120 | attention_mask=None, 121 | token_type_ids=None, 122 | position_ids=None, 123 | head_mask=None, 124 | inputs_embeds=None, 125 | labels=None, 126 | ): 127 | outputs = self.bert( 128 | input_ids, 129 | attention_mask=attention_mask, 130 | token_type_ids=token_type_ids, 131 | position_ids=position_ids, 132 | head_mask=head_mask, 133 | inputs_embeds=inputs_embeds, 134 | ) 135 | 136 | encoder_out = outputs['last_hidden_state'] 137 | cls = encoder_out[:, 0, :] 138 | 139 | pooled_output, _ = torch.max(encoder_out, 1) 140 | pooled_output = torch.relu(pooled_output) 141 | 142 | pooled_output_mean = torch.mean(encoder_out, 1) 143 | pooled_output = torch.cat((pooled_output, pooled_output_mean, cls), 1) 144 | 145 | pooled_output = self.dropout(pooled_output) 146 | logits = self.classifier(pooled_output) 147 | 148 | outputs = (logits,) + outputs[2:] # add hidden states and attention if they are here 149 | 150 | if labels is not None: 151 | if self.num_labels == 1: 152 | # We are doing regression 153 | loss_fct = MSELoss() 154 | loss = loss_fct(logits.view(-1), labels.view(-1)) 155 | else: 156 | loss = F.binary_cross_entropy_with_logits( logits.view(-1), labels.view(-1) ) 157 | outputs = (loss,) + outputs 158 | 159 | return outputs # (loss), logits, (hidden_states), (attentions) 160 | 161 | class BertFeatureExtractor(BertPreTrainedModel): 162 | def __init__(self, config): 163 | super().__init__(config) 164 | 165 | self.bert = BertModel(config) 166 | 167 | self.init_weights() 168 | 169 | def forward( 170 | self, 171 | input_ids=None, 172 | attention_mask=None, 173 | token_type_ids=None, 174 | position_ids=None, 175 | head_mask=None, 176 | inputs_embeds=None, 177 | bert_type='cls', 178 | ): 179 | outputs = self.bert( 180 | input_ids, 181 | attention_mask=attention_mask, 182 | token_type_ids=token_type_ids, 183 | position_ids=position_ids, 184 | head_mask=head_mask, 185 | inputs_embeds=inputs_embeds, 186 | ) 187 | 188 | encoder_out = outputs['last_hidden_state'] 189 | cls = encoder_out[:, 0, :] 190 | if bert_type!='pool': 191 | return cls 192 | 193 | pooled_output, _ = torch.max(encoder_out, 1) 194 | pooled_output = torch.relu(pooled_output) 195 | 196 | pooled_output_mean = torch.mean(encoder_out, 1) 197 | pooled_output = torch.cat((pooled_output, pooled_output_mean, cls), 1) 198 | return pooled_output 199 | -------------------------------------------------------------------------------- /lb_submissions/SAI/TF-IDF/README.md: -------------------------------------------------------------------------------- 1 | Решение основано на 
использовании tf.idf признаков и простых линейных моделей: 2 | 3 | - **RuMedDaNet** / **RuMedNLI** - объединяем пары входных текстов в единую строку, получаем матрицу tf.idf признаков, обучаем модель логистической регрессии для предсказания целевой переменной; 4 | 5 | - **RuMedTest** - получаем матрицу tf.idf признаков для вопросов и 4 матрицы для каждого из вариантов ответов. Ответ выбираем по максимальному значению косинусной близости векторов вопроса и ответа. 6 | 7 | ### Для запуска кода 8 | 9 | `./run.sh` 10 | -------------------------------------------------------------------------------- /lb_submissions/SAI/TF-IDF/double_text_classifier.py: -------------------------------------------------------------------------------- 1 | import json 2 | import pathlib 3 | 4 | import click 5 | import pandas as pd 6 | from sklearn.feature_extraction.text import TfidfVectorizer 7 | from sklearn.linear_model import LogisticRegression 8 | from sklearn.metrics import accuracy_score 9 | 10 | 11 | def preprocess_sentences(column1, column2): 12 | return [sent1 + ' ' + sent2 for sent1, sent2 in zip(column1, column2)] 13 | 14 | 15 | def encode_text(tfidf, text_data, l2i, labels=None, mode='train'): 16 | if mode == 'train': 17 | X = tfidf.fit_transform(text_data) 18 | else: 19 | X = tfidf.transform(text_data) 20 | y = None 21 | if type(labels)!=type(None): 22 | y = labels.map(l2i) 23 | return X, y 24 | 25 | 26 | @click.command() 27 | @click.option('--task-name', 28 | default='RuMedNLI', 29 | type=click.Choice(['RuMedDaNet', 'RuMedNLI']), 30 | help='The name of the task to run.') 31 | @click.option('--data-path', 32 | default='../../../MedBench_data/', 33 | help='Path to the data files.') 34 | def main(task_name, data_path): 35 | print(f'\n{task_name} task') 36 | 37 | out_path = pathlib.Path('.').absolute() 38 | data_path = pathlib.Path(data_path).absolute() / task_name 39 | 40 | train_data = pd.read_json(data_path / 'train.jsonl', lines=True) 41 | dev_data = pd.read_json(data_path / 'dev.jsonl', lines=True) 42 | test_data = pd.read_json(data_path / 'test.jsonl', lines=True) 43 | 44 | index_id = 'pairID' 45 | if task_name == 'RuMedNLI': 46 | l2i = {'neutral': 0, 'entailment': 1, 'contradiction': 2} 47 | text1_id = 'ru_sentence1' 48 | text2_id = 'ru_sentence2' 49 | label_id = 'gold_label' 50 | elif task_name == 'RuMedDaNet': 51 | l2i = {'нет': 0, 'да': 1} 52 | text1_id = 'context' 53 | text2_id = 'question' 54 | label_id = 'answer' 55 | else: 56 | raise ValueError('unknown task') 57 | 58 | i2l = {i: label for label, i in l2i.items()} 59 | 60 | text_data_train = preprocess_sentences(train_data[text1_id], train_data[text2_id]) 61 | text_data_dev = preprocess_sentences(dev_data[text1_id], dev_data[text2_id]) 62 | text_data_test = preprocess_sentences(test_data[text1_id], test_data[text2_id]) 63 | 64 | tfidf = TfidfVectorizer(analyzer='char', ngram_range=(3, 8)) 65 | clf = LogisticRegression(penalty='l2', C=10, multi_class='ovr', n_jobs=10, max_iter=1000, verbose=1) 66 | 67 | X, y = encode_text(tfidf, text_data_train, l2i, labels=train_data[label_id]) 68 | 69 | clf.fit(X, y) 70 | 71 | X_val, y_val = encode_text(tfidf, text_data_dev, l2i, labels=dev_data[label_id], mode='dev') 72 | y_val_pred = clf.predict(X_val) 73 | accuracy = round(accuracy_score(y_val, y_val_pred) * 100, 2) 74 | print (f'\n{task_name} task score on dev set: {accuracy}') 75 | 76 | X_test, _ = encode_text(tfidf, text_data_test, l2i, mode='test') 77 | y_test_pred = clf.predict(X_test) 78 | 79 | recs = [] 80 | for i, pred in 
zip(test_data[index_id], y_test_pred): 81 | recs.append({index_id: i, label_id: i2l[pred]}) 82 | 83 | out_fname = out_path / f'{task_name}.jsonl' 84 | with open(out_fname, 'w') as fw: 85 | for rec in recs: 86 | json.dump(rec, fw, ensure_ascii=False) 87 | fw.write('\n') 88 | 89 | 90 | if __name__ == '__main__': 91 | main() 92 | -------------------------------------------------------------------------------- /lb_submissions/SAI/TF-IDF/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | python -u double_text_classifier.py --task-name 'RuMedNLI' 4 | python -u double_text_classifier.py --task-name 'RuMedDaNet' 5 | python -u test_solver.py --task-name 'RuMedTest' 6 | 7 | zip -m tfidf.zip RuMedDaNet.jsonl RuMedNLI.jsonl RuMedTest.jsonl 8 | -------------------------------------------------------------------------------- /lb_submissions/SAI/TF-IDF/test_solver.py: -------------------------------------------------------------------------------- 1 | import json 2 | import pathlib 3 | 4 | import click 5 | import numpy as np 6 | import pandas as pd 7 | from sklearn.feature_extraction.text import TfidfVectorizer 8 | from sklearn.linear_model import LogisticRegression 9 | from sklearn.metrics import accuracy_score 10 | from sklearn.metrics.pairwise import cosine_similarity 11 | 12 | @click.command() 13 | @click.option('--task-name', 14 | default='RuMedTest', 15 | type=click.Choice(['RuMedTest']), 16 | help='The name of the task to run.') 17 | @click.option('--data-path', 18 | default='../../../MedBench_data/', 19 | help='Path to the data files.') 20 | def main(task_name, data_path): 21 | print(f'\n{task_name} task') 22 | 23 | out_path = pathlib.Path('.').absolute() 24 | data_path = pathlib.Path(data_path).absolute() / task_name 25 | 26 | test_data = pd.read_json(data_path / 'test.jsonl', lines=True) 27 | 28 | index_id = 'idx' 29 | if task_name == 'RuMedTest': 30 | l2i = {'1': 1, '2': 2, '3': 3, '4': 4} 31 | question_id = 'question' 32 | label_id = 'answer' 33 | else: 34 | raise ValueError('unknown task') 35 | 36 | i2l = {i: label for label, i in l2i.items()} 37 | 38 | tfidf = TfidfVectorizer(analyzer='char', ngram_range=(3, 8)) 39 | 40 | text_data = test_data[question_id] 41 | 42 | X = tfidf.fit_transform(text_data) 43 | 44 | sims = [] 45 | for l in sorted(list(l2i.keys())): 46 | option_X = tfidf.transform( test_data[l] ) 47 | sim = cosine_similarity(X, option_X).diagonal() 48 | sims.append(sim) 49 | sims = np.array(sims).T 50 | 51 | recs = [] 52 | for i, pred in zip(test_data[index_id], sims): 53 | recs.append({index_id: i, label_id: i2l[1+np.argmax(pred)]}) 54 | 55 | out_fname = out_path / f'{task_name}.jsonl' 56 | with open(out_fname, 'w') as fw: 57 | for rec in recs: 58 | json.dump(rec, fw, ensure_ascii=False) 59 | fw.write('\n') 60 | 61 | 62 | if __name__ == '__main__': 63 | main() 64 | -------------------------------------------------------------------------------- /lb_submissions/SAI/TF-IDF/tfidf.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pavel-blinov/RuMedBench/dd56a60faf55994ba402615c4067f0652ea0d2bb/lb_submissions/SAI/TF-IDF/tfidf.zip -------------------------------------------------------------------------------- /lb_submissions/SAI_junior/RuBioRoBERTa/README.md: -------------------------------------------------------------------------------- 1 | Решение реализовано с помощью модели [RuBioRoBERTa](https://huggingface.co/alexyalunin/RuBioRoBERTa). 
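Ниже приведён минимальный иллюстративный набросок подхода к **RuMedTest**, описанного в пунктах ниже: ответ выбирается по максимальной косинусной близости эмбеддингов вопроса и вариантов ответа. Это не код из блокнота `RuMedTest.ipynb`: способ пуллинга (усреднение последнего скрытого слоя по токенам) и путь к данным здесь взяты условно.

```python
# Набросок: выбор ответа в RuMedTest по косинусной близости эмбеддингов
# вопроса и каждого из 4 вариантов ответа.
import numpy as np
import pandas as pd
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('alexyalunin/RuBioRoBERTa')
model = AutoModel.from_pretrained('alexyalunin/RuBioRoBERTa')
model.eval()

@torch.no_grad()
def embed(texts):
    # Контекстуализированные эмбеддинги: mean pooling последнего скрытого
    # слоя по не-pad токенам (конкретный способ пуллинга - предположение).
    enc = tokenizer(list(texts), padding=True, truncation=True,
                    max_length=512, return_tensors='pt')
    hidden = model(**enc).last_hidden_state             # (batch, seq_len, dim)
    mask = enc['attention_mask'].unsqueeze(-1).float()  # (batch, seq_len, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

test_data = pd.read_json('test.jsonl', lines=True)  # путь к данным условный
q_vecs = embed(test_data['question'])
# Близость "вопрос - вариант" считается попарно, по диагонали матрицы.
sims = np.stack([cosine_similarity(q_vecs, embed(test_data[opt])).diagonal()
                 for opt in ['1', '2', '3', '4']], axis=1)
test_data['answer'] = (sims.argmax(axis=1) + 1).astype(str)
```

Тот же принцип (эмбеддинги и косинусная близость) используется в решениях RuBERT и TF-IDF выше.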
2 | 3 | - **RuMedDaNet** / **RuMedNLI** - перед подачей в модель контекст и вопрос конкатенируются через пробел, дообучение модели выполняется под конкретную задачу; 4 | 5 | - **RuMedTest** - используется предобученная модель RuBioRoBERTa для получения контекстуализированных эмбеддингов (вопросов и каждого из 4 вариантов ответов). Ответ выбирается по максимальному значению косинусной близости векторов вопроса и ответа. 6 | 7 | ### В задаче RuMedDaNet для модели были использованы следующие гиперпараметры: 8 | - `seed = 128` 9 | - `batch_size = 10` 10 | - `epochs = 25` 11 | - `lr = 2e-5` 12 | 13 | ### В задаче RuMedNLI были использованы следующие гиперпараметры: 14 | - `seed = 128` 15 | - `batch_size = 8` 16 | - `epochs = 25` 17 | - `lr = 3e-5` 18 | 19 | ### Для запуска: 20 | `pip install -r requirements.txt` 21 | 22 | Открыть блокнот каждой задачи и выполнить все ячейки. 23 | 24 | Добавить все решения в zip-архив: `zip -r solution.zip RuMedTest.jsonl RuMedNLI.jsonl RuMedDaNet.jsonl` 25 | -------------------------------------------------------------------------------- /lb_submissions/SAI_junior/RuBioRoBERTa/requirements.txt: -------------------------------------------------------------------------------- 1 | torch==1.12.1 2 | torchtext==0.6.0 3 | tensorflow==2.6.0 4 | keras==2.6.0 5 | pandas==1.3.5 6 | transformers==4.12.5 7 | scikit-learn==1.0.2 8 | --------------------------------------------------------------------------------