├── run.sh
├── README.md
├── model.py
└── run.py

/run.sh:
--------------------------------------------------------------------------------
1 | python run.py \
2 |     --model_name_or_path hfl/chinese-roberta-wwm-ext \
3 |     --do_train \
4 |     --do_eval \
5 |     --train_file ./train_data_1w.json \
6 |     --validation_file ./dev_data.json \
7 |     --test_file ./test_data.json \
8 |     --metric_for_best_model eval_accuracy \
9 |     --load_best_model_at_end \
10 |     --learning_rate 5e-5 \
11 |     --evaluation_strategy epoch \
12 |     --num_train_epochs 5 \
13 |     --output_dir ./tmp \
14 |     --per_device_eval_batch_size 32 \
15 |     --per_device_train_batch_size 32 \
16 |     --seed 42 \
17 |     --max_seq_length 512 \
18 |     --warmup_ratio 0.1 \
19 |     --save_strategy epoch \
20 |     --overwrite_output_dir
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # ChID_baseline
2 | Baseline implementation for the course project of Computational Linguistics, fall semester of the 2022-23 academic year.
3 | 
4 | This assignment uses ChID, a Chinese idiom cloze dataset. Given a passage with blanks, the model has to pick the most appropriate idiom from a list of candidates for each blank.
5 | ```
6 | {
7 | "groundTruth": ["一目了然", "先入为主"],
8 | "candidates": [["明明白白", "添油加醋", "一目了然", "残兵败将", "杂乱无章", "心中有数", "打抱不平"], ["矫揉造作", "死不瞑目", "先入为主", "以偏概全", "期期艾艾", "似是而非", "追根究底"]],
9 | "content": "【分析】父母对孩子的期望这一点可以从第一段中找到“即使是学校也只是我们送孩子去接受实用教育的地方,而不是让他们为了知识而去追求知识的地方。”至此,答案选项[C]#idiom#。而选项[B]显然错误。选项[A]这个干扰项是出题人故意拿出一个本身没有问题,但是不适合本处的说法来干扰考生。考生一定要警惕#idiom#的思维模式,在做阅读理解的时候,不能按照自己的直觉和知识瞎猜,一定要以原文为根据。选项[D]显然也是不符合家长的期望的。",
10 | "realCount": 2
11 | }
12 | ```
13 | As shown above, the `content` field contains two `#idiom#` tags. Taking the first tag as an example, the model has to choose the most suitable idiom `一目了然` from the candidate list `["明明白白", "添油加醋", "一目了然", "残兵败将", "杂乱无章", "心中有数", "打抱不平"]` and fill it into that position.
14 | 
15 | ## requirements
16 | ```
17 | accelerate==0.7.1
18 | allennlp==2.9.1
19 | datasets==2.5.1
20 | evaluate==0.2.2
21 | huggingface_hub==0.10.0
22 | numpy==1.22.3
23 | torch==1.11.0
24 | tqdm==4.56.0
25 | transformers==4.16.2
26 | ```
27 | 
28 | ## Dataset download
29 | The ChID dataset was introduced in **[ChID: A Large-scale Chinese IDiom Dataset for Cloze Test](https://www.aclweb.org/anthology/P19-1075)**. The training set contains 500k sentences; the validation and test sets each contain 20k sentences.
30 | 
31 | [Download link (PKU netdisk)](https://disk.pku.edu.cn:443/link/3510A73BA4793A830B0179DF795330C8)
32 | 
33 | Since compute budgets differ between students, we encourage you to **pick one** of the three training-set sizes, 1w (10k), 5w (50k) or 10w (100k) sentences; the corresponding training splits have been prepared in advance and can be downloaded from the netdisk.
34 | 
35 | ## Model
36 | The baseline is built on masked language modeling (MLM), the objective commonly used to pre-train BERT-style models, with a Chinese RoBERTa backbone.
37 | 
38 | We replace `#idiom#` with `[MASK][MASK][MASK][MASK]`, use the LM head to produce, at each `[MASK]` position, a probability distribution over the candidate characters, and select the idiom with the highest overall probability.
39 | 
40 | For the example above, the first `[MASK]` token receives from the LM head a distribution over the first characters of all candidate idioms (明, 添, 一, 残, 杂, 心, 打); the other `[MASK]` tokens likewise yield distributions over the corresponding seven candidate characters.
41 | 
42 | The probability of filling the `#idiom#` slot with 一目了然 is the product of the probabilities of its characters (一, 目, 了, 然) at the four `[MASK]` positions; the candidate with the highest probability is the final prediction.
43 | 
44 | See `model.py` for the concrete implementation; a standalone sketch of the same scoring idea is shown below.
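The scoring described above can also be reproduced with a few lines of plain `transformers` code. The sketch below is for illustration only and is **not** the training code in `model.py`/`run.py`; it assumes the `hfl/chinese-roberta-wwm-ext` checkpoint from `run.sh` and scores the second blank of the example above in a zero-shot fashion.

```python
# Illustration: score one blank with an off-the-shelf MLM checkpoint (zero-shot).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "hfl/chinese-roberta-wwm-ext"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)
model.eval()

candidates = ["矫揉造作", "死不瞑目", "先入为主", "以偏概全", "期期艾艾", "似是而非", "追根究底"]
# Second blank of the example above: #idiom# is replaced with four [MASK] tokens.
text = "考生一定要警惕" + tokenizer.mask_token * 4 + "的思维模式,在做阅读理解的时候,不能按照自己的直觉和知识瞎猜,一定要以原文为根据。"

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0]                                # (seq_len, vocab_size)

mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
cand_ids = torch.tensor([tokenizer.convert_tokens_to_ids(list(c)) for c in candidates])  # (7, 4)
cand_logits = logits[mask_pos].gather(1, cand_ids.T)                  # (4, 7): j-th candidate's i-th character at [MASK] i
# Normalise over the candidates at each position, then sum the four per-character log-probabilities.
scores = torch.log_softmax(cand_logits, dim=1).sum(dim=0)             # (7,)
print(candidates[scores.argmax().item()])
```

`model.py` implements the same computation in batched form and additionally turns it into a cross-entropy training loss over the candidates.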
47 | ## Results
48 | We report the accuracy of our baseline model under different training-set sizes. The row "0" is the zero-shot setting, where the pretrained model is used for prediction without any fine-tuning; compare your results with the baseline row that matches the training-set size you use.
49 | 
50 | | #train data | dev | test |
51 | |-------------|:-----:|:-----:|
52 | | 0 | 51.54 | 51.87 |
53 | | 1w | 64.93 | 64.83 |
54 | | 5w | 71.94 | 71.92 |
55 | | 10w | 74.49 | 74.42 |
56 | | full (50w) | 80.72 | 81.11 |
57 | 
58 | We also compare our baseline with the LSTM-based Attentive Reader from the original paper and with human performance. The baseline still lags clearly behind humans.
59 | 
60 | | Model | dev | test |
61 | |-------------------------|:-----:|:-----:|
62 | | Ours (RoBERTa) | 80.72 | 81.11 |
63 | | Attentive Reader (LSTM) | 72.7 | 72.4 |
64 | | Human | - | 87.1 |
65 | 
66 | ## Assignment requirements
67 | The assignment can be done individually or in groups of up to three. Each group should design experiments around the characteristics of the task and try some ideas of its own, including but not limited to changes to the model, reformulations of the task, or ideas driven by the data itself. With limited training data, incorporating external knowledge (e.g. an idiom dictionary; see the dictionary-loading sketch at the end of this README) is also a direction worth exploring. We suggest using our code or the original authors' code as the baseline, but you are also encouraged to design a new baseline yourselves.
68 | 
69 | Each group is required to give an in-class presentation on December 8 / December 15, and to submit its report and project code to the course mailbox jsyyxpku2022@163.com by December 22.
70 | 
71 | The project grade will be based on the presentation, the report, and the performance of the model.
72 | ### Report requirements
73 | The report should include the following:
74 | 1. Purpose of the experiments (describe the task)
75 | 2. Method (describe the model)
76 | 3. Experimental procedure (describe the steps; pay attention to reproducibility)
77 | 4. Results and analysis (report the results and analyse them or individual cases; please fix the random seed to 42, or report the mean over several seeds)
78 | 5. Discussion of the limitations of your model
79 | 6. Summary of the process and reflections (one from each group member)
80 | 7. Division of work (state each member's contribution)
81 | ### Other requirements and suggestions
82 | 1. Cheating is scored 0, including but not limited to plagiarised code and fabricated results. Open-source code may be used, but mark in the report which parts use it and cite the source.
83 | 2. Using the label information of the test set in any form is forbidden.
84 | 3. The experimental results are not the only criterion: improvements tailored to the task and in-depth analysis of failed attempts also count in your favour.
85 | ## References
86 | * ChID: A Large-scale Chinese IDiom Dataset for Cloze Test. Chujie Zheng, Minlie Huang, Aixin Sun. ACL (1) 2019: 778-787
87 | * It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners. Timo Schick, Hinrich Schütze. NAACL-HLT 2021: 2339-2352
88 | 
89 | ## Useful links
90 | * https://github.com/chujiezheng/ChID-Dataset
91 | * https://github.com/pwxcoo/chinese-xinhua
92 | * https://github.com/by-syk/chinese-idiom-db
93 | * https://huggingface.co/docs/transformers/index
94 | 
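For the external-knowledge direction mentioned in the assignment requirements, the idiom dictionaries linked above are a natural resource. The snippet below is only a hypothetical starting point: it assumes the `data/idiom.json` file of the chinese-xinhua repository with `word` and `explanation` fields, which you should adapt to whichever dictionary you actually use.

```python
# Hypothetical helper: map each idiom to its dictionary definition (external knowledge).
# Assumes data/idiom.json from https://github.com/pwxcoo/chinese-xinhua; adjust the path and field names as needed.
import json

def load_idiom_dict(path="data/idiom.json"):
    with open(path, encoding="utf-8") as f:
        return {entry["word"]: entry["explanation"] for entry in json.load(f)}

idiom2def = load_idiom_dict()
# One possible use: append the candidates' definitions to the passage before encoding it.
print(idiom2def.get("一目了然", "<not in dictionary>"))
```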
--------------------------------------------------------------------------------
/model.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import torch.nn as nn
3 | from torch.nn import CrossEntropyLoss
4 | import torch.nn.functional as F
5 | 
6 | from allennlp.nn.util import batched_index_select
7 | from transformers import BertPreTrainedModel, BertModel, BertConfig
8 | from transformers.modeling_outputs import MaskedLMOutput
9 | from transformers.models.bert.modeling_bert import BertOnlyMLMHead
10 | 
11 | from typing import Optional, Tuple, Union
12 | 
13 | class BertForChID(BertPreTrainedModel):
14 | 
15 |     _keys_to_ignore_on_load_unexpected = [r"pooler"]
16 |     _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"]
17 | 
18 |     def __init__(self, config):
19 |         super().__init__(config)
20 | 
21 |         # if config.is_decoder:
22 |         #     logger.warning(
23 |         #         "If you want to use `BertForMaskedLM` make sure `config.is_decoder=False` for "
24 |         #         "bi-directional self-attention."
25 |         #     )
26 | 
27 |         self.bert = BertModel(config, add_pooling_layer=False)
28 |         self.cls = BertOnlyMLMHead(config)
29 | 
30 |         # Initialize weights and apply final processing
31 |         self.post_init()
32 | 
33 |     def get_output_embeddings(self):
34 |         return self.cls.predictions.decoder
35 | 
36 |     def set_output_embeddings(self, new_embeddings):
37 |         self.cls.predictions.decoder = new_embeddings
38 | 
39 |     # @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
40 |     # @add_code_sample_docstrings(
41 |     #     processor_class=_TOKENIZER_FOR_DOC,
42 |     #     checkpoint=_CHECKPOINT_FOR_DOC,
43 |     #     output_type=MaskedLMOutput,
44 |     #     config_class=_CONFIG_FOR_DOC,
45 |     #     expected_output="'paris'",
46 |     #     expected_loss=0.88,
47 |     # )
48 |     def forward(
49 |         self,
50 |         input_ids: Optional[torch.Tensor] = None,
51 |         attention_mask: Optional[torch.Tensor] = None,
52 |         token_type_ids: Optional[torch.Tensor] = None,
53 |         position_ids: Optional[torch.Tensor] = None,
54 |         head_mask: Optional[torch.Tensor] = None,
55 |         inputs_embeds: Optional[torch.Tensor] = None,
56 |         encoder_hidden_states: Optional[torch.Tensor] = None,
57 |         encoder_attention_mask: Optional[torch.Tensor] = None,
58 |         labels: Optional[torch.Tensor] = None,
59 |         candidates: Optional[torch.Tensor] = None,
60 |         candidate_mask: Optional[torch.Tensor] = None,
61 |         output_attentions: Optional[bool] = None,
62 |         output_hidden_states: Optional[bool] = None,
63 |         return_dict: Optional[bool] = None,
64 |     ) -> Union[Tuple[torch.Tensor], MaskedLMOutput]:
65 |         r"""
66 |         labels: torch.LongTensor of shape `(batch_size, )`, index of the ground-truth idiom among the candidates
67 |         candidates: torch.LongTensor of shape `(batch_size, num_choices, 4)`, token ids of the four characters of each candidate idiom
68 |         candidate_mask: torch.BoolTensor of shape `(batch_size, seq_len)`, True at the four [MASK] positions
69 |         """
70 | 
71 |         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
72 | 
73 |         outputs = self.bert(
74 |             input_ids,
75 |             attention_mask=attention_mask,
76 |             token_type_ids=token_type_ids,
77 |             position_ids=position_ids,
78 |             head_mask=head_mask,
79 |             inputs_embeds=inputs_embeds,
80 |             encoder_hidden_states=encoder_hidden_states,
81 |             encoder_attention_mask=encoder_attention_mask,
82 |             output_attentions=output_attentions,
83 |             output_hidden_states=output_hidden_states,
84 |             return_dict=return_dict,
85 |         )
86 | 
87 |         sequence_output = outputs[0]
88 |         prediction_scores = self.cls(sequence_output)  # (Batch_size, Seq_len, Vocab_size)
89 | 
90 |         masked_lm_loss = None
91 |         candidate_final_scores = None
92 |         # Candidates are scored whenever they are provided, so the model can also be used for prediction without labels.
93 |         if candidates is not None and candidate_mask is not None:
94 |             candidate_prediction_scores = torch.masked_select(prediction_scores, candidate_mask.unsqueeze(-1)).reshape(-1, prediction_scores.shape[-1], 1)  # (Batch_size x 4, Vocab_size, 1)
95 |             candidate_indices = candidates.transpose(-1, -2).reshape(-1, candidates.shape[1])  # (Batch_size x 4, num_choices)
96 |             candidate_logits = batched_index_select(candidate_prediction_scores, candidate_indices).squeeze(-1).reshape(prediction_scores.shape[0], 4, -1).transpose(-1, -2)  # (Batch_size, num_choices, 4)
97 |             # An idiom's score is the sum, over the four [MASK] positions, of its characters' log-probabilities, normalised over the candidates at each position.
98 |             candidate_final_scores = torch.sum(F.log_softmax(candidate_logits, dim=-2), dim=-1)  # (Batch_size, num_choices)
99 |             if labels is not None:
100 |                 candidate_labels = labels.reshape(labels.shape[0], 1).repeat(1, 4)  # (Batch_size, 4)
101 |                 masked_lm_loss = CrossEntropyLoss()(candidate_logits, candidate_labels)
102 | 
103 |         if not return_dict:
104 |             output = (prediction_scores,) + outputs[2:]
105 |             return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
106 |         return MaskedLMOutput(
107 |             loss=masked_lm_loss,
108 |
logits=candidate_final_scores, 109 | hidden_states=outputs.hidden_states, 110 | attentions=outputs.attentions, 111 | ) -------------------------------------------------------------------------------- /run.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding=utf-8 3 | # Copyright The HuggingFace Team and The HuggingFace Inc. team. All rights reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | """ 17 | Fine-tuning the library models for ChID. 18 | """ 19 | 20 | 21 | import logging 22 | import os 23 | import sys 24 | from dataclasses import dataclass, field 25 | from typing import Optional, Union 26 | 27 | import datasets 28 | import numpy as np 29 | import torch 30 | from datasets import load_dataset 31 | from model import BertForChID 32 | 33 | import transformers 34 | from transformers import ( 35 | AutoConfig, 36 | AutoTokenizer, 37 | HfArgumentParser, 38 | Trainer, 39 | TrainingArguments, 40 | default_data_collator, 41 | set_seed, 42 | ) 43 | from transformers.tokenization_utils_base import PreTrainedTokenizerBase 44 | from transformers.trainer_utils import get_last_checkpoint 45 | 46 | 47 | 48 | logger = logging.getLogger(__name__) 49 | 50 | 51 | @dataclass 52 | class ModelArguments: 53 | """ 54 | Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. 55 | """ 56 | 57 | model_name_or_path: str = field( 58 | metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"} 59 | ) 60 | config_name: Optional[str] = field( 61 | default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} 62 | ) 63 | tokenizer_name: Optional[str] = field( 64 | default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} 65 | ) 66 | cache_dir: Optional[str] = field( 67 | default=None, 68 | metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"}, 69 | ) 70 | use_fast_tokenizer: bool = field( 71 | default=True, 72 | metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."}, 73 | ) 74 | model_revision: str = field( 75 | default="main", 76 | metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, 77 | ) 78 | use_auth_token: bool = field( 79 | default=False, 80 | metadata={ 81 | "help": ( 82 | "Will use the token generated when running `huggingface-cli login` (necessary to use this script " 83 | "with private models)." 84 | ) 85 | }, 86 | ) 87 | 88 | 89 | @dataclass 90 | class DataTrainingArguments: 91 | """ 92 | Arguments pertaining to what data we are going to input our model for training and eval. 
93 | """ 94 | 95 | train_file: Optional[str] = field(default=None, metadata={"help": "The input training data file (a text file)."}) 96 | validation_file: Optional[str] = field( 97 | default=None, 98 | metadata={"help": "An optional input evaluation data file (a text file)."}, 99 | ) 100 | test_file: Optional[str] = field( 101 | default=None, 102 | metadata={"help": "An optional input test data file (a text file)."}, 103 | ) 104 | overwrite_cache: bool = field( 105 | default=False, metadata={"help": "Overwrite the cached training and evaluation sets"} 106 | ) 107 | preprocessing_num_workers: Optional[int] = field( 108 | default=None, 109 | metadata={"help": "The number of processes to use for the preprocessing."}, 110 | ) 111 | max_seq_length: Optional[int] = field( 112 | default=None, 113 | metadata={ 114 | "help": ( 115 | "The maximum total input sequence length after tokenization. If passed, sequences longer " 116 | "than this will be truncated, sequences shorter will be padded." 117 | ) 118 | }, 119 | ) 120 | pad_to_max_length: bool = field( 121 | default=False, 122 | metadata={ 123 | "help": ( 124 | "Whether to pad all samples to the maximum sentence length. " 125 | "If False, will pad the samples dynamically when batching to the maximum length in the batch. More " 126 | "efficient on GPU but very bad for TPU." 127 | ) 128 | }, 129 | ) 130 | max_train_samples: Optional[int] = field( 131 | default=None, 132 | metadata={ 133 | "help": ( 134 | "For debugging purposes or quicker training, truncate the number of training examples to this " 135 | "value if set." 136 | ) 137 | }, 138 | ) 139 | max_eval_samples: Optional[int] = field( 140 | default=None, 141 | metadata={ 142 | "help": ( 143 | "For debugging purposes or quicker training, truncate the number of evaluation examples to this " 144 | "value if set." 145 | ) 146 | }, 147 | ) 148 | 149 | def __post_init__(self): 150 | if self.train_file is not None: 151 | extension = self.train_file.split(".")[-1] 152 | assert extension in ["csv", "json"], "`train_file` should be a csv or a json file." 153 | if self.validation_file is not None: 154 | extension = self.validation_file.split(".")[-1] 155 | assert extension in ["csv", "json"], "`validation_file` should be a csv or a json file." 156 | 157 | 158 | @dataclass 159 | class DataCollatorForChID: 160 | """ 161 | Data collator that will dynamically pad the inputs. 162 | Candidate masks will be computed to indicate which tokens are candidates. 163 | 164 | Args: 165 | tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]): 166 | The tokenizer used for encoding the data. 167 | padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`): 168 | Select a strategy to pad the returned sequences (according to the model's padding side and padding index) 169 | among: 170 | 171 | - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single sequence 172 | if provided). 173 | - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum 174 | acceptable input length for the model if that argument is not provided. 175 | - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different 176 | lengths). 177 | max_length (`int`, *optional*): 178 | Maximum length of the returned list and optionally padding length (see above). 179 | pad_to_multiple_of (`int`, *optional*): 180 | If set will pad the sequence to a multiple of the provided value. 
181 | 182 | This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 183 | 7.5 (Volta). 184 | """ 185 | 186 | tokenizer: PreTrainedTokenizerBase 187 | padding: Union[bool, str] = True 188 | max_length: Optional[int] = None 189 | pad_to_multiple_of: Optional[int] = None 190 | 191 | def __call__(self, features): 192 | label_name = "label" if "label" in features[0].keys() else "labels" 193 | labels = [feature.pop(label_name) for feature in features] 194 | 195 | batch = self.tokenizer.pad( 196 | features, 197 | padding=self.padding, 198 | max_length=self.max_length, 199 | pad_to_multiple_of=self.pad_to_multiple_of, 200 | return_tensors="pt", 201 | ) 202 | 203 | 204 | # Add back labels 205 | batch["labels"] = torch.tensor(labels, dtype=torch.int64) 206 | # Compute candidate masks 207 | batch["candidate_mask"] = batch["input_ids"] == self.tokenizer.mask_token_id 208 | return batch 209 | 210 | 211 | def main(): 212 | os.environ["CUDA_LAUNCH_BLOCKING"] = "1" 213 | # See all possible arguments in src/transformers/training_args.py 214 | # or by passing the --help flag to this script. 215 | # We now keep distinct sets of args, for a cleaner separation of concerns. 216 | 217 | parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments)) 218 | if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): 219 | # If we pass only one argument to the script and it's the path to a json file, 220 | # let's parse it to get our arguments. 221 | model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) 222 | else: 223 | model_args, data_args, training_args = parser.parse_args_into_dataclasses() 224 | 225 | 226 | # Setup logging 227 | logging.basicConfig( 228 | format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", 229 | datefmt="%m/%d/%Y %H:%M:%S", 230 | handlers=[logging.StreamHandler(sys.stdout)], 231 | ) 232 | log_level = training_args.get_process_log_level() 233 | logger.setLevel(log_level) 234 | datasets.utils.logging.set_verbosity(log_level) 235 | transformers.utils.logging.set_verbosity(log_level) 236 | transformers.utils.logging.enable_default_handler() 237 | transformers.utils.logging.enable_explicit_format() 238 | 239 | # Log on each process the small summary: 240 | logger.warning( 241 | f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" 242 | + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" 243 | ) 244 | logger.info(f"Training/evaluation parameters {training_args}") 245 | 246 | # Detecting last checkpoint. 247 | last_checkpoint = None 248 | if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: 249 | last_checkpoint = get_last_checkpoint(training_args.output_dir) 250 | if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: 251 | raise ValueError( 252 | f"Output directory ({training_args.output_dir}) already exists and is not empty. " 253 | "Use --overwrite_output_dir to overcome." 254 | ) 255 | elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: 256 | logger.info( 257 | f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " 258 | "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." 259 | ) 260 | 261 | # Set seed before initializing model. 
262 | set_seed(training_args.seed) 263 | 264 | # Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below) 265 | # or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/ 266 | # (the dataset will be downloaded automatically from the datasets Hub). 267 | 268 | # For CSV/JSON files, this script will use the column called 'text' or the first column if no column called 269 | # 'text' is found. You can easily tweak this behavior (see below). 270 | 271 | # In distributed training, the load_dataset function guarantee that only one local process can concurrently 272 | # download the dataset. 273 | if data_args.train_file is not None or data_args.validation_file is not None: 274 | data_files = {} 275 | if data_args.train_file is not None: 276 | data_files["train"] = data_args.train_file 277 | if data_args.validation_file is not None: 278 | data_files["validation"] = data_args.validation_file 279 | if data_args.test_file is not None: 280 | data_files["test"] = data_args.test_file 281 | extension = data_args.train_file.split(".")[-1] 282 | raw_datasets = load_dataset( 283 | extension, 284 | data_files=data_files, 285 | cache_dir=model_args.cache_dir, 286 | use_auth_token=True if model_args.use_auth_token else None, 287 | ) 288 | else: 289 | # Downloading and loading the chid dataset from the hub. This code is not supposed to be executed in. 290 | raw_datasets = load_dataset( 291 | "YuAnthony/chid", 292 | cache_dir=model_args.cache_dir, 293 | use_auth_token=True if model_args.use_auth_token else None, 294 | ) 295 | # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at 296 | # https://huggingface.co/docs/datasets/loading_datasets.html. 297 | 298 | # Load pretrained model and tokenizer 299 | 300 | # Distributed training: 301 | # The .from_pretrained methods guarantee that only one local process can concurrently 302 | # download model & vocab. 303 | config = AutoConfig.from_pretrained( 304 | model_args.config_name if model_args.config_name else model_args.model_name_or_path, 305 | cache_dir=model_args.cache_dir, 306 | revision=model_args.model_revision, 307 | use_auth_token=True if model_args.use_auth_token else None, 308 | ) 309 | tokenizer = AutoTokenizer.from_pretrained( 310 | model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, 311 | cache_dir=model_args.cache_dir, 312 | use_fast=model_args.use_fast_tokenizer, 313 | revision=model_args.model_revision, 314 | use_auth_token=True if model_args.use_auth_token else None, 315 | ) 316 | model = BertForChID.from_pretrained( 317 | model_args.model_name_or_path, 318 | from_tf=bool(".ckpt" in model_args.model_name_or_path), 319 | config=config, 320 | cache_dir=model_args.cache_dir, 321 | revision=model_args.model_revision, 322 | use_auth_token=True if model_args.use_auth_token else None, 323 | ) 324 | 325 | label_column_name = "labels" 326 | idiom_tag = '#idiom#' 327 | 328 | if data_args.max_seq_length is None: 329 | max_seq_length = tokenizer.model_max_length 330 | if max_seq_length > 1024: 331 | logger.warning( 332 | f"The tokenizer picked seems to have a very large `model_max_length` ({tokenizer.model_max_length}). " 333 | "Picking 1024 instead. You can change that default value by passing --max_seq_length xxx." 
334 | ) 335 | max_seq_length = 1024 336 | else: 337 | if data_args.max_seq_length > tokenizer.model_max_length: 338 | logger.warning( 339 | f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the" 340 | f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." 341 | ) 342 | max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) 343 | 344 | # Preprocessing the datasets. 345 | 346 | # We only consider one idiom per instance in the dataset, a sentence containing multiple idioms will be split into multiple instances. 347 | # The idiom tag of each instance will be replaced with 4 [MASK] tokens. 348 | def preprocess_function_resize(examples): 349 | return_dic = {} 350 | return_dic_keys = ['candidates', 'content', 'labels'] 351 | for k in return_dic_keys: 352 | return_dic[k] = [] 353 | 354 | for i in range(len(examples['content'])): 355 | idx = -1 356 | text = examples['content'][i] 357 | for j in range(examples['realCount'][i]): 358 | return_dic['candidates'].append(examples['candidates'][i][j]) 359 | idx = text.find(idiom_tag, idx+1) 360 | return_dic['content'].append(text[:idx] + tokenizer.mask_token*4 + text[idx+len(idiom_tag):]) 361 | for k, candidate in enumerate(examples['candidates'][i][j]): 362 | if candidate == examples['groundTruth'][i][j]: 363 | return_dic['labels'].append(k) 364 | break 365 | return return_dic 366 | 367 | # tokenize all instances 368 | def preprocess_function_tokenize(examples): 369 | first_sentences = examples['content'] 370 | labels = examples[label_column_name] 371 | # truncate the first sentences. 372 | for i, sentence in enumerate(first_sentences): 373 | if len(sentence) <= 500: 374 | continue 375 | if sentence.find(tokenizer.mask_token*4) > len(sentence) // 2: 376 | first_sentences[i] = sentence[-500:] 377 | else: 378 | first_sentences[i] = sentence[:500] 379 | 380 | tokenized_examples = tokenizer( 381 | first_sentences, 382 | max_length=max_seq_length, 383 | padding="max_length" if data_args.pad_to_max_length else False, 384 | truncation=True, 385 | ) 386 | tokenized_examples["labels"] = labels 387 | tokenized_candidates = [[tokenizer.convert_tokens_to_ids(list(candidate)) for candidate in candidates]for candidates in examples['candidates']] 388 | tokenized_examples["candidates"] = tokenized_candidates 389 | return tokenized_examples 390 | 391 | 392 | if training_args.do_train: 393 | if "train" not in raw_datasets: 394 | raise ValueError("--do_train requires a train dataset") 395 | train_dataset = raw_datasets["train"] 396 | if data_args.max_train_samples is not None: 397 | max_train_samples = min(len(train_dataset), data_args.max_train_samples) 398 | train_dataset = train_dataset.select(range(max_train_samples)) 399 | with training_args.main_process_first(desc="train dataset map pre-processing"): 400 | train_dataset = train_dataset.map( 401 | preprocess_function_resize, 402 | batched=True, 403 | remove_columns=["groundTruth", "realCount"], 404 | num_proc=data_args.preprocessing_num_workers, 405 | load_from_cache_file=not data_args.overwrite_cache, 406 | ) 407 | 408 | train_dataset = train_dataset.map( 409 | preprocess_function_tokenize, 410 | batched=True, 411 | num_proc=data_args.preprocessing_num_workers, 412 | load_from_cache_file=not data_args.overwrite_cache, 413 | ) 414 | # for index in range(3): 415 | # logger.info(f"Sample {index} of the training set: {train_dataset[index]}.") 416 | if training_args.do_eval: 417 | if "validation" not in raw_datasets: 418 | 
raise ValueError("--do_eval requires a validation dataset") 419 | eval_dataset = raw_datasets["validation"] 420 | if data_args.max_eval_samples is not None: 421 | max_eval_samples = min(len(eval_dataset), data_args.max_eval_samples) 422 | eval_dataset = eval_dataset.select(range(max_eval_samples)) 423 | with training_args.main_process_first(desc="validation dataset map pre-processing"): 424 | eval_dataset = eval_dataset.map( 425 | preprocess_function_resize, 426 | batched=True, 427 | remove_columns=["groundTruth", "realCount"], 428 | num_proc=data_args.preprocessing_num_workers, 429 | load_from_cache_file=not data_args.overwrite_cache, 430 | ) 431 | eval_dataset = eval_dataset.map( 432 | preprocess_function_tokenize, 433 | batched=True, 434 | num_proc=data_args.preprocessing_num_workers, 435 | load_from_cache_file=not data_args.overwrite_cache, 436 | ) 437 | test_dataset = raw_datasets["test"] 438 | with training_args.main_process_first(desc="test dataset map pre-processing"): 439 | test_dataset = test_dataset.map( 440 | preprocess_function_resize, 441 | batched=True, 442 | remove_columns=["groundTruth", "realCount"], 443 | num_proc=data_args.preprocessing_num_workers, 444 | load_from_cache_file=not data_args.overwrite_cache, 445 | ) 446 | test_dataset = test_dataset.map( 447 | preprocess_function_tokenize, 448 | batched=True, 449 | num_proc=data_args.preprocessing_num_workers, 450 | load_from_cache_file=not data_args.overwrite_cache, 451 | ) 452 | # Data collator 453 | data_collator = ( 454 | default_data_collator 455 | if data_args.pad_to_max_length 456 | else DataCollatorForChID(tokenizer=tokenizer, pad_to_multiple_of=8 if training_args.fp16 else None) 457 | ) 458 | # data_collator = default_data_collator 459 | 460 | 461 | 462 | # Metric 463 | def compute_metrics(eval_predictions): 464 | predictions, label_ids = eval_predictions 465 | preds = np.argmax(predictions, axis=1) 466 | return {"accuracy": (preds == label_ids).astype(np.float32).mean().item()} 467 | 468 | # Initialize our Trainer 469 | trainer = Trainer( 470 | model=model, 471 | args=training_args, 472 | train_dataset=train_dataset if training_args.do_train else None, 473 | eval_dataset=eval_dataset if training_args.do_eval else None, 474 | tokenizer=tokenizer, 475 | data_collator=data_collator, 476 | compute_metrics=compute_metrics, 477 | ) 478 | 479 | 480 | # Training 481 | if training_args.do_train: 482 | checkpoint = None 483 | if training_args.resume_from_checkpoint is not None: 484 | checkpoint = training_args.resume_from_checkpoint 485 | elif last_checkpoint is not None: 486 | checkpoint = last_checkpoint 487 | train_result = trainer.train(resume_from_checkpoint=checkpoint) 488 | trainer.save_model() # Saves the tokenizer too for easy upload 489 | metrics = train_result.metrics 490 | 491 | max_train_samples = ( 492 | data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset) 493 | ) 494 | metrics["train_samples"] = min(max_train_samples, len(train_dataset)) 495 | 496 | trainer.log_metrics("train", metrics) 497 | trainer.save_metrics("train", metrics) 498 | trainer.save_state() 499 | 500 | # Evaluation 501 | if training_args.do_eval: 502 | logger.info("*** Evaluate ***") 503 | 504 | metrics = trainer.evaluate(eval_dataset=test_dataset) 505 | metrics["test_samples"] = len(test_dataset) 506 | 507 | trainer.log_metrics("test", metrics) 508 | trainer.save_metrics("test", metrics) 509 | 510 | # kwargs = dict( 511 | # finetuned_from=model_args.model_name_or_path, 512 | # 
tasks="multiple-choice", 513 | # dataset_tags="swag", 514 | # dataset_args="regular", 515 | # dataset="SWAG", 516 | # language="en", 517 | # ) 518 | 519 | # if training_args.push_to_hub: 520 | # trainer.push_to_hub(**kwargs) 521 | # else: 522 | # trainer.create_model_card(**kwargs) 523 | 524 | 525 | def _mp_fn(index): 526 | # For xla_spawn (TPUs) 527 | main() 528 | 529 | 530 | if __name__ == "__main__": 531 | main() 532 | --------------------------------------------------------------------------------