├── imgs
    └── kjt.png
├── requirements.txt
├── examples
    └── finetune_edubert.sh
├── LICENSE
├── docs
    └── 使用文档.md
├── README.md
└── src
    └── finetune.py


/imgs/kjt.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tal-tech/edu-bert/HEAD/imgs/kjt.png

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
datasets==1.1.2
numpy==1.19.4
pandas==1.1.4
pyOpenSSL==19.1.0
scikit-learn==0.23.2
scipy==1.5.4
torch==1.4.0
transformers==3.5.1
dataclasses==0.8

--------------------------------------------------------------------------------
/examples/finetune_edubert.sh:
--------------------------------------------------------------------------------
# To load a TensorFlow model, the model folder name must end with .ckpt
# (finetune.py checks for ".ckpt" in the path), as in the commented-out line below.
export MODEL_PATH=../../path_to_EduBERT
#export MODEL_PATH=../edu-bert.ckpt  # example folder name for loading a TensorFlow model
export DATA_DIR=../data/
export TASK_NAME=CoLA
python ../src/finetune.py \
  --model_name_or_path $MODEL_PATH \
  --task_name $TASK_NAME \
  --data_dir ${DATA_DIR} \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir ./output \
  --overwrite_output_dir

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2020 好未来技术

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/docs/使用文档.md:
--------------------------------------------------------------------------------

# EduBERT

## Dependencies

```
conda create -n py36_test python=3.6 --yes
conda activate py36_test

pip install -r requirements.txt
```

## Code structure

To make EduBERT easier to use, we have also open-sourced the code for fine-tuning on downstream tasks, together with sample input data (data/train.tsv and data/dev.tsv). For data-privacy reasons we provide only 10 synthetic records, intended purely as a format reference. You are welcome to use them.

[data](../data/): sample input data for fine-tuning

```
Data format:
\t[label]\t\t[text]

[label]: the label of the text
[text]: the content of the text
\t: tab character
```

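As an illustration of the four-column layout above, the following minimal sketch formats and parses one row using only the Python standard library. The helper names `format_row` and `parse_row` and the sample values are hypothetical, and the sketch assumes that the first and third columns are simply left empty, as the format string suggests.

```python
def format_row(label, text):
    """Build one line in the layout: <empty>\t<label>\t<empty>\t<text>."""
    if "\t" in text or "\n" in text:
        # Tabs or newlines inside the text would break the column layout.
        raise ValueError("text must not contain tab or newline characters")
    return "\t".join(["", str(label), "", text])


def parse_row(line):
    """Return (label, text) from one line; both are returned as strings."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError("expected 4 tab-separated fields, got %d" % len(fields))
    return fields[1], fields[3]


if __name__ == "__main__":
    row = format_row(1, "sample sentence")  # hypothetical label and text
    print(repr(row))
    print(parse_row(row))
```

Keeping the label in column 2 and the text in column 4 matches the format description; the content of the other two columns is not specified there, so the sketch leaves them empty.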
[examples](../examples/): usage example

```
# ./examples/finetune_edubert.sh

export MODEL_PATH=/YourPath/EduBERT
# MODEL_PATH is the directory where the downloaded model is stored

export DATA_DIR=../data/
# Directory of the fine-tuning input data; defaults to the sample data directory

export TASK_NAME=CoLA
python ../src/finetune.py \
  --model_name_or_path $MODEL_PATH \
  --task_name $TASK_NAME \
  --data_dir ${DATA_DIR} \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir ./output \
  --overwrite_output_dir

```

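In this repository, src/finetune.py decides whether `MODEL_PATH` points to a TensorFlow checkpoint by testing whether the path contains `.ckpt`, and passes the result as `from_tf`. The sketch below isolates that check so it can be tried on a path before launching a run; the function name `looks_like_tf_checkpoint` is hypothetical and is not part of the repository.

```python
def looks_like_tf_checkpoint(model_name_or_path: str) -> bool:
    """Mirror the substring check used for from_tf in src/finetune.py."""
    return ".ckpt" in model_name_or_path


if __name__ == "__main__":
    # A folder named with the .ckpt suffix is treated as a TensorFlow model.
    print(looks_like_tf_checkpoint("../edu-bert.ckpt"))
    # A plain folder name is loaded as a PyTorch model.
    print(looks_like_tf_checkpoint("../../path_to_EduBERT"))
    # A misspelled suffix such as .cpkt is not detected.
    print(looks_like_tf_checkpoint("../edu-bert.cpkt"))
```

Because the check is a plain substring test, any path containing `.ckpt` anywhere is treated as a TensorFlow model, so it is worth keeping that string out of PyTorch model paths.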
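[src](../src): source code files

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# TAL-EduBERT: the first open-source Chinese pre-trained model for online teaching in the education domain, released by TAL (好未来)

## I. Background and download links

### 1. Background

The outbreak of Covid-19 in early 2020 had a considerable impact on every industry and, within a short time, dealt a heavy blow to traditional education, which relies mainly on in-person teaching. More people began to see the value of technology for the education market. Online education became the best option for teaching during this exceptional period and reached schools and households on a large scale. Its rapid growth produced massive amounts of online-teaching Automatic Speech Recognition (ASR) text data, which has greatly advanced technology in the education domain.

Data is one of the most central and valuable resources of the industry, and it is the foundation on which Natural Language Processing (NLP) is applied and developed in every field. Online-education text data has properties that set it apart from general-domain data, and these pose major challenges for NLP research, application, and development in online education. First, text transcribed from audio and video contains many ASR errors, which can substantially affect text-processing tasks. Second, the data contains a large amount of education-specific vocabulary, which existing general-domain open-source word embeddings and pre-trained language models (such as Google BERT Base [1] and Roberta [2]) represent only to a limited extent, and this in turn affects downstream tasks.

To help address these two problems, the machine learning team of TAL's AI platform (好未来AI中台) collected more than 20 million Chinese ASR text records from the education domain (about 380 million tokens) from multiple sources, used them to build TAL-EduBERT, the first Chinese pre-trained model for online teaching in the education domain, and released it as open source.

Since Google released the pre-trained model BERT in 2018, pre-trained language models represented by BERT have reached SOTA results on a wide range of NLP tasks. As a general-purpose pre-trained language model, BERT lets NLP engineers fine-tune directly on a downstream task, without laborious changes to the network architecture, and still obtain better results than earlier deep learning methods. This significantly reduces the work of adjusting model architectures and lowers the cost of applying algorithms. Pre-trained language models have become an indispensable basic technology in day-to-day work.

However, the open-source deep pre-trained models currently available for Chinese mostly target general-domain needs, and we have not seen comparable open-source models in many vertical domains, education included. Compared with Google BERT Base and the open-source Chinese Roberta model, **the TAL-EduBERT released here achieves better results on several downstream tasks in the education domain**. TAL hopes that this release will help advance the application of NLP in education, and colleagues are welcome to download and use the model.

### 2. Model download

Download links:

PyTorch version: [https://ai.100tal.com/download/TAL-EduBERT.zip](https://ai.100tal.com/download/TAL-EduBERT.zip)

TensorFlow version: [https://ai.100tal.com/download/TAL-EduBERT-TF.zip](https://ai.100tal.com/download/TAL-EduBERT-TF.zip)


## II. Model architecture and training data

### 1. Model architecture
TAL-EduBERT uses the same network architecture as Google BERT Base: a 12-layer Transformer encoder, 768 hidden units, and 12 heads in multi-head attention. We chose the BERT Base architecture for convenience and broad applicability in practice. Further pre-trained language models for education-domain ASR will be open-sourced later.

### 2. Training corpus
The pre-training corpus of TAL-EduBERT comes mainly from text obtained by ASR transcription of the large volume of teacher instruction audio accumulated inside TAL. After filtering and preprocessing, more than 20 million education ASR text records were selected, containing about 380 million tokens.

### 3. Pre-training method

![Alt text](imgs/kjt.png?raw=true "")

As the figure above shows, TAL-EduBERT is pre-trained with the same two tasks as BERT: a character-level task on education-domain text (Masked Language Modeling, MLM) and a sentence-level task (Next Sentence Prediction, NSP). Through these two tasks, TAL-EduBERT captures character-, word-, and sentence-level syntactic and semantic information in education ASR text.

## III. Downstream task results
To evaluate TAL-EduBERT on downstream tasks, we drew four typical datasets for teaching-behavior prediction in online education from real business scenarios; see [3][4] for details. On these datasets we compared TAL-EduBERT with Google BERT Base, the most widely used model for Chinese, and with Roberta, which performs well. The results show that TAL-EduBERT achieves better results on education ASR downstream tasks.

### 1. Experiment overview: teacher behavior prediction
This task comes from our work on automated evaluation of teachers' teaching behavior. Specifically, we evaluate four teacher behaviors: guiding students to summarize after class (Conclude), leading students in taking notes (Note), praising students (Praise), and asking students questions (QA). Classifying teaching behavior and tagging teachers with behavior labels makes teaching behavior easier to analyze, which in turn helps teachers teach better and improves teaching quality.

### 2. Results:
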
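| Task\Model | Metric | Conclude | Note | Praise | QA |
| --- | --- | --- | --- | --- | --- |
| Google BERT | Acc | 0.7036 | 0.8436 | 0.8652 | 0.8948 |
| Google BERT | F1 | 0.6404 | 0.8356 | 0.8683 | 0.8469 |
| Roberta | Acc | 0.7097 | 0.8558 | 0.8689 | 0.8979 |
| Roberta | F1 | 0.6382 | 0.8464 | 0.8668 | 0.8433 |
| TAL-EduBERT | Acc | 0.7270 | 0.8638 | 0.8731 | 0.9147 |
| TAL-EduBERT | F1 | 0.6486 | 0.8549 | 0.8688 | 0.8721 |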
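
To make the size of the differences in the results above easier to read, the following sketch computes the per-task gain of TAL-EduBERT over each baseline in percentage points. The scores are copied from the results table; the names `SCORES`, `TASKS`, and `gains` are hypothetical helpers introduced only for this illustration.

```python
TASKS = ["Conclude", "Note", "Praise", "QA"]

# Scores copied from the results table (order follows TASKS).
SCORES = {
    "Google BERT": {"Acc": [0.7036, 0.8436, 0.8652, 0.8948],
                    "F1": [0.6404, 0.8356, 0.8683, 0.8469]},
    "Roberta": {"Acc": [0.7097, 0.8558, 0.8689, 0.8979],
                "F1": [0.6382, 0.8464, 0.8668, 0.8433]},
    "TAL-EduBERT": {"Acc": [0.7270, 0.8638, 0.8731, 0.9147],
                    "F1": [0.6486, 0.8549, 0.8688, 0.8721]},
}


def gains(metric, baseline):
    """Gain of TAL-EduBERT over a baseline, per task, in percentage points."""
    ours = SCORES["TAL-EduBERT"][metric]
    base = SCORES[baseline][metric]
    return {task: round((o - b) * 100, 2) for task, o, b in zip(TASKS, ours, base)}


if __name__ == "__main__":
    for metric in ("Acc", "F1"):
        for baseline in ("Google BERT", "Roberta"):
            print(metric, "vs", baseline, gains(metric, baseline))
```

The largest F1 gain in this comparison is on QA against Roberta, and the smallest is on Praise against Google BERT.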

## IV. Scope, usage, and example
### 1. Scope:
Compared with Google BERT Base and Roberta, TAL-EduBERT is trained on a large amount of education ASR text, so it is more robust to ASR recognition errors and performs better on downstream tasks in education scenarios. We therefore recommend the model to NLP engineers who work in education on tasks involving ASR text, and we hope that this release will advance the application and development of NLP in education.

### 2. Usage:
Usage is the same as for the original BERT released by Google. The model supports the transformers package, so you only need to replace the model path.

### 3. Example:
```
from transformers import BertTokenizer, BertModel
import torch

# Directory of the downloaded and unzipped TAL-EduBERT model
path_to_tal_edubert = "/YourPath/TAL-EduBERT/"

tokenizer = BertTokenizer.from_pretrained(path_to_tal_edubert)
model = BertModel.from_pretrained(path_to_tal_edubert)
model.eval()  # disable dropout for inference

sentence = "让我们来看一下这道题,这个题的也是一种比较经典类型的这个数列题目他呢,有个特点就是前面的是an+1,后面是一个an的式子加上一个根号下an的,一个二次的一个式子。"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Index 0 is the last hidden state for both tuple outputs (what the pinned
# transformers==3.5.1 returns unless return_dict=True) and ModelOutput objects
last_hidden_states = outputs[0]
```
## V. Summary
To evaluate TAL-EduBERT on education-domain downstream tasks, we ran comparison experiments on four kinds of business problems and data from education scenarios. Compared with the two general-domain pre-trained models Google BERT Base and Roberta, TAL-EduBERT scores higher on all four tasks, with a maximum F1 gain of about 3 percentage points (QA: 0.8721 versus 0.8433 for Roberta). Engineers who want to explore NLP in the education domain can therefore use TAL-EduBERT directly as a starting point for more specialized education-technology work.

This article has introduced the open-source background, the data background, and the comparison results of TAL-EduBERT. TAL's AI platform will continue its theoretical and practical work and will open up more of its work over time. We warmly welcome partners working in related fields to contribute more and richer comparison experiments and real-world use cases, so that together we can advance the application and development of NLP in education and bring new momentum to education in China.


## References:
[1] Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
[2] Liu, Yinhan, et al. "RoBERTa: A robustly optimized BERT pretraining approach." arXiv preprint arXiv:1907.11692 (2019).
[3] Huang, Gale Yan, et al. "Neural Multi-Task Learning for Teacher Question Detection in Online Classrooms." International Conference on Artificial Intelligence in Education. Springer, Cham, 2020.
[4] Xu, Shiting, Wenbiao Ding, and Zitao Liu. "Automatic Dialogic Instruction Detection for K-12 Online One-on-one Classes." International Conference on Artificial Intelligence in Education. Springer, Cham, 2020.


--------------------------------------------------------------------------------
/src/finetune.py:
--------------------------------------------------------------------------------
""" Finetuning the library models for sequence classification on GLUE (Bert, XLM, XLNet, RoBERTa, Albert, XLM-RoBERTa)."""


import dataclasses
import logging
import os
import sys
from dataclasses import dataclass, field
from typing import Dict, Optional

import numpy as np

from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, EvalPrediction, GlueDataset
from transformers import GlueDataTrainingArguments as DataTrainingArguments
from transformers import (
    HfArgumentParser,
    Trainer,
    TrainingArguments,
    glue_compute_metrics,
    glue_output_modes,
    glue_tasks_num_labels,
    set_seed,
)


logger = logging.getLogger(__name__)


@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """

    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
    )


def main():
    # See all possible arguments in src/transformers/training_args.py
    # or by passing the --help flag to this script.
    # We now keep distinct sets of args, for a cleaner separation of concerns.

    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))

    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
        # If we pass only one argument to the script and it's the path to a json file,
        # let's parse it to get our arguments.
        model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
    else:
        model_args, data_args, training_args = parser.parse_args_into_dataclasses()

    if (
        os.path.exists(training_args.output_dir)
        and os.listdir(training_args.output_dir)
        and training_args.do_train
        and not training_args.overwrite_output_dir
    ):
        raise ValueError(
            f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
        )

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN,
    )
    logger.warning(
        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
        training_args.local_rank,
        training_args.device,
        training_args.n_gpu,
        bool(training_args.local_rank != -1),
        training_args.fp16,
    )
    logger.info("Training/evaluation parameters %s", training_args)

    # Set seed
    set_seed(training_args.seed)

    try:
        num_labels = glue_tasks_num_labels[data_args.task_name]
        output_mode = glue_output_modes[data_args.task_name]
    except KeyError:
        raise ValueError("Task not found: %s" % (data_args.task_name))

    # Load pretrained model and tokenizer
    #
    # Distributed training:
    # The .from_pretrained methods guarantee that only one local process can concurrently
    # download model & vocab.

    config = AutoConfig.from_pretrained(
        model_args.config_name if model_args.config_name else model_args.model_name_or_path,
        num_labels=num_labels,
        finetuning_task=data_args.task_name,
        cache_dir=model_args.cache_dir,
    )
    tokenizer = AutoTokenizer.from_pretrained(
        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
    )
    model = AutoModelForSequenceClassification.from_pretrained(
        model_args.model_name_or_path,
        from_tf=bool(".ckpt" in model_args.model_name_or_path),
        config=config,
        cache_dir=model_args.cache_dir,
    )

    # Get datasets
    train_dataset = GlueDataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
    eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="dev") if training_args.do_eval else None
    test_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="test") if training_args.do_predict else None

    def compute_metrics(p: EvalPrediction) -> Dict:
        if output_mode == "classification":
            preds = np.argmax(p.predictions, axis=1)
        elif output_mode == "regression":
            preds = np.squeeze(p.predictions)
        return glue_compute_metrics(data_args.task_name, preds, p.label_ids)

    # Initialize our Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )

    # Training
    if training_args.do_train:
        trainer.train(
            model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
        )
        trainer.save_model()
        # For convenience, we also re-save the tokenizer to the same directory,
        # so that you can share your model easily on huggingface.co/models =)
        if trainer.is_world_master():
            tokenizer.save_pretrained(training_args.output_dir)

    # Evaluation
    eval_results = {}
    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        # Loop to handle MNLI double evaluation (matched, mis-matched)
        eval_datasets = [eval_dataset]
        if data_args.task_name == "mnli":
            mnli_mm_data_args = dataclasses.replace(data_args, task_name="mnli-mm")
            eval_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, mode="dev"))

        for eval_dataset in eval_datasets:
            eval_result = trainer.evaluate(eval_dataset=eval_dataset)

            output_eval_file = os.path.join(
                training_args.output_dir, f"eval_results_{eval_dataset.args.task_name}.txt"
            )
            if trainer.is_world_master():
                with open(output_eval_file, "w") as writer:
                    logger.info("***** Eval results {} *****".format(eval_dataset.args.task_name))
                    for key, value in eval_result.items():
                        logger.info(" %s = %s", key, value)
                        writer.write("%s = %s\n" % (key, value))

            eval_results.update(eval_result)

    if training_args.do_predict:
        # Use the module-level logger, consistent with the rest of the script
        logger.info("*** Test ***")
        test_datasets = [test_dataset]
        if data_args.task_name == "mnli":
            mnli_mm_data_args = dataclasses.replace(data_args, task_name="mnli-mm")
            test_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, mode="test"))

        for test_dataset in test_datasets:
            predictions = trainer.predict(test_dataset=test_dataset).predictions
            if output_mode == "classification":
                predictions = np.argmax(predictions, axis=1)

            output_test_file = os.path.join(
                training_args.output_dir, f"test_results_{test_dataset.args.task_name}.txt"
            )
            if trainer.is_world_master():
                with open(output_test_file, "w") as writer:
                    logger.info("***** Test results {} *****".format(test_dataset.args.task_name))
                    writer.write("index\tprediction\n")
                    for index, item in enumerate(predictions):
                        if output_mode == "regression":
                            writer.write("%d\t%3.3f\n" % (index, item))
                        else:
                            item = test_dataset.get_labels()[item]
                            writer.write("%d\t%s\n" % (index, item))
    return eval_results


def _mp_fn(index):
    # For xla_spawn (TPUs)
    main()


if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
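
The README reports accuracy and F1, while finetune.py delegates metric computation to `glue_compute_metrics` for the selected GLUE task. As a hedged illustration of how those two numbers can be derived from classifier logits, the sketch below uses plain Python. It assumes binary classification with label 1 as the positive class, which the README does not state, and the function names `argmax_rows` and `accuracy_and_f1` are hypothetical rather than part of the repository.

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row of logits."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy_and_f1(logits, labels, positive=1):
    """Compute accuracy and binary F1 (for the `positive` class) from logits."""
    preds = argmax_rows(logits)
    correct = sum(p == y for p, y in zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"acc": correct / len(labels), "f1": f1}


if __name__ == "__main__":
    # Hypothetical logits for four examples and their gold labels.
    example_logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
    example_labels = [1, 0, 0, 1]
    print(accuracy_and_f1(example_logits, example_labels))
```

A function with this shape could be adapted into the `compute_metrics` callback by reading `p.predictions` and `p.label_ids`, but that wiring is an assumption and is not shown in the repository.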