├── imgs
│   └── kjt.png
├── requirements.txt
├── examples
│   └── finetune_edubert.sh
├── LICENSE
├── docs
│   └── 使用文档.md
├── README.md
└── src
    └── finetune.py
/imgs/kjt.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tal-tech/edu-bert/HEAD/imgs/kjt.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | datasets==1.1.2
2 | numpy==1.19.4
3 | pandas==1.1.4
4 | pyOpenSSL==19.1.0
5 | scikit-learn==0.23.2
6 | scipy==1.5.4
7 | torch==1.4.0
8 | transformers==3.5.1
9 | dataclasses==0.8
10 |
--------------------------------------------------------------------------------
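The pins in `requirements.txt` above can be compared with an environment before fine-tuning. The following is a minimal sketch and not part of the repository: `parse_pins` and `find_mismatches` are hypothetical helper names, and the `installed` mapping is illustrative data standing in for what `pip freeze` would report.

```python
# Hypothetical helpers (not part of the edu-bert repository) for comparing
# the pinned versions in requirements.txt with an installed environment.

REQUIREMENTS = """\
datasets==1.1.2
numpy==1.19.4
pandas==1.1.4
pyOpenSSL==19.1.0
scikit-learn==0.23.2
scipy==1.5.4
torch==1.4.0
transformers==3.5.1
dataclasses==0.8
"""


def parse_pins(text):
    """Return {package: version} for every 'name==version' line."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.lower()] = version
    return pins


def find_mismatches(pins, installed):
    """Return {package: (pinned, found)}; found is None when the package is missing."""
    installed = {name.lower(): version for name, version in installed.items()}
    return {
        name: (pinned, installed.get(name))
        for name, pinned in pins.items()
        if installed.get(name) != pinned
    }


pins = parse_pins(REQUIREMENTS)
# Illustrative environment: transformers is newer than the pin, torch is absent.
installed = dict(pins, transformers="4.0.0")
del installed["torch"]
print(find_mismatches(pins, installed))
```

An empty result means every pinned package was found at the pinned version.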
/examples/finetune_edubert.sh:
--------------------------------------------------------------------------------
1 | export MODEL_PATH=../../path_to_EduBERT # To load a TensorFlow model, the model folder name must end with .ckpt, as shown on the next line
2 | #export MODEL_PATH=../edu-bert.ckpt # Example folder name for loading a TensorFlow model
3 | export DATA_DIR=../data/
4 | export TASK_NAME=CoLA
5 | python ../src/finetune.py \
6 |   --model_name_or_path $MODEL_PATH \
7 |   --task_name $TASK_NAME \
8 |   --data_dir ${DATA_DIR} \
9 |   --do_train \
10 |   --do_eval \
11 |   --max_seq_length 128 \
12 |   --per_device_train_batch_size 32 \
13 |   --learning_rate 2e-5 \
14 |   --num_train_epochs 3.0 \
15 |   --output_dir ./output \
16 |   --overwrite_output_dir
17 |
--------------------------------------------------------------------------------
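`src/finetune.py` also accepts a single argument ending in `.json` and parses it with `HfArgumentParser.parse_json_file`, so the flags in the script above can be kept in a file instead. The following is a minimal sketch, assuming the JSON keys mirror the command-line flag names. `build_finetune_args`, `write_args`, and the file name `finetune_args.json` are hypothetical, and the paths are the placeholders from the script.

```python
import json
import os
import tempfile


def build_finetune_args(model_path, data_dir, output_dir):
    """Mirror the flags used in examples/finetune_edubert.sh as a dict."""
    return {
        "model_name_or_path": model_path,
        "task_name": "CoLA",
        "data_dir": data_dir,
        "do_train": True,
        "do_eval": True,
        "max_seq_length": 128,
        "per_device_train_batch_size": 32,
        "learning_rate": 2e-5,
        "num_train_epochs": 3.0,
        "output_dir": output_dir,
        "overwrite_output_dir": True,
    }


def write_args(args, path):
    """Write the arguments as UTF-8 JSON and return the path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(args, handle, indent=2)
    return path


args = build_finetune_args("../../path_to_EduBERT", "../data/", "./output")
# Written to a temporary directory here; any location works.
args_file = write_args(args, os.path.join(tempfile.gettempdir(), "finetune_args.json"))
print(args_file)
```

The script would then be started with the JSON file as its only argument, for example `python ../src/finetune.py finetune_args.json`.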
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2020 好未来技术
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/docs/使用文档.md:
--------------------------------------------------------------------------------
1 |
2 | # EduBERT
3 |
4 | ## Dependencies
5 |
6 |
7 |
8 | ```
9 | conda create -n py36_test python=3.6 --yes
10 | conda activate py36_test
11 | pip install -r requirements.txt
12 | ```
13 |
14 |
15 |
16 |
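17 | ## Code Structure
18 |
19 |
20 |
21 | To make EduBERT easier to use, we have also open-sourced the code for fine-tuning on downstream tasks, together with sample input data (data/train.tsv, data/dev.tsv). Because of data privacy, we provide only 10 synthetic records, intended purely as a format reference. Feel free to use them.
22 |
23 |
24 |
25 | [data](../data/): Sample input data for fine-tuning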
26 |
27 |
28 |
29 | ```
30 | Data format:
31 | \t[label]\t\t[text]
32 |
33 | [label]: the label of the text
34 | [text]: the content of the text
35 | \t: tab character
36 | ```
37 |
38 |
39 |
40 |
41 | [examples](../examples/): Usage example
42 |
43 |
44 |
45 | ```
46 | # ./examples/finetune_edubert.sh
47 |
48 | export MODEL_PATH=/YourPath/EduBERT
49 | # MODEL_PATH is the directory where the downloaded model is stored
50 |
51 | export DATA_DIR=../data/
52 | # Directory of the fine-tuning input data; defaults to the sample data directory
53 |
54 | export TASK_NAME=CoLA
55 | python ../src/finetune.py \
56 |   --model_name_or_path $MODEL_PATH \
57 |   --task_name $TASK_NAME \
58 |   --data_dir ${DATA_DIR} \
59 |   --do_train \
60 |   --do_eval \
61 |   --max_seq_length 128 \
62 |   --per_device_train_batch_size 32 \
63 |   --learning_rate 2e-5 \
64 |   --num_train_epochs 3.0 \
65 |   --output_dir ./output \
66 |   --overwrite_output_dir
67 |
68 | ```
69 |
70 |
71 |
72 | [src](../src): Source code
73 |
74 |
75 |
76 |
77 |
78 |
79 |
80 |
--------------------------------------------------------------------------------
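The four-column row layout described in the usage document above (empty first column, label, empty third column, text) can be produced and read back with a few lines of Python. The following is a minimal sketch: `format_row` and `parse_row` are hypothetical helpers, the sample rows are made up for illustration, and it assumes the text contains no tab or newline characters.

```python
# Hypothetical helpers for the TSV layout "\t[label]\t\t[text]".


def format_row(label, text):
    """Build one line: empty column, label, empty column, text."""
    if "\t" in text or "\n" in text:
        raise ValueError("text must not contain tab or newline characters")
    return "\t{}\t\t{}".format(label, text)


def parse_row(line):
    """Return (label, text) from one line, validating the column count."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 4:
        raise ValueError("expected 4 tab-separated columns, got %d" % len(columns))
    return columns[1], columns[3]


# Made-up sample rows, for format illustration only.
rows = [format_row(1, "同学们回答得非常好"), format_row(0, "我们来看下一道题")]
for row in rows:
    print(parse_row(row))
```

Validating the column count up front catches rows where a stray tab inside the text would otherwise shift the label or text into the wrong column.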
/README.md:
--------------------------------------------------------------------------------
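1 | # TAL-EduBERT: TAL Education Group (好未来) Open-Sources the First Chinese Pre-trained Model for Online Teaching in the Education Domain
2 |
3 | ## I. Background and Download Links
4 |
5 | ### 1. Background
6 |
7 | The outbreak of Covid-19 in early 2020 had a considerable impact across industries. In the short term it hit traditional, primarily offline education very hard, and more people began to see the value of technology for the education market. Online education became the best option for teaching during this period and reached schools and households at scale. Its rapid growth has produced massive amounts of text from automatic speech recognition (ASR) of online classes, which has greatly advanced technology in the education domain.
8 |
9 | Data is one of an industry's most central and valuable resources, and it is the foundation on which natural language processing (NLP) is applied and developed in any domain. Online education text has properties that set it apart from general-domain data, which poses two major challenges for NLP research and application in online education. First, text transcribed from audio and video contains many ASR errors, which can substantially hurt downstream text-processing tasks. Second, the data contains a large number of education-specific terms; existing general-domain open-source word embeddings and pre-trained language models (such as Google BERT Base [1] and RoBERTa [2]) represent the semantics of these terms poorly, which in turn hurts downstream tasks.
10 |
11 | To help address these two problems, the machine learning team of TAL's AI Platform (AI中台) collected more than 20 million Chinese ASR text records (about 380 million tokens) from multiple sources in the education domain, built TAL-EduBERT on them as the first Chinese pre-trained model for online teaching in the education domain, and released it as open source.
12 |
13 | Since Google released BERT in 2018, pre-trained language models typified by BERT have achieved state-of-the-art results on a wide range of NLP tasks. As a general-purpose pre-trained language model, BERT lets NLP engineers fine-tune directly on a downstream task, without heavy changes to the network architecture, and still outperform earlier deep learning approaches. This substantially reduces the architecture-tuning workload and lowers the cost of applying these algorithms. Pre-trained language models have become an indispensable foundational technology in practice.
14 |
15 | However, the open-source Chinese deep pre-trained models available today mostly target general-domain needs, and we have not seen comparable open-source models for vertical domains such as education. Compared with Google BERT Base and the open-source Chinese RoBERTa model, **the TAL-EduBERT model released here achieves clearly better results on multiple downstream tasks in the education domain**. TAL hopes this release helps advance the application of NLP in education and welcomes everyone to download and use the model.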
16 |
17 | ### 2. Model Download
18 |
19 | Download links:
20 |
21 | PyTorch version: [https://ai.100tal.com/download/TAL-EduBERT.zip](https://ai.100tal.com/download/TAL-EduBERT.zip)
22 |
23 | TensorFlow version: [https://ai.100tal.com/download/TAL-EduBERT-TF.zip](https://ai.100tal.com/download/TAL-EduBERT-TF.zip)
24 |
25 |
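26 | ## II. Model Architecture and Training Data
27 |
28 | ### 1. Model Architecture
29 | TAL-EduBERT uses the same network architecture as Google BERT Base: a 12-layer Transformer encoder, a hidden size of 768, and 12 attention heads in multi-head attention. We chose the BERT Base architecture for convenience and broad applicability in practice. We plan to open-source further pre-trained language models for education ASR text.
30 |
31 | ### 2. Training Corpus
32 | The pre-training corpus comes mainly from text obtained by ASR transcription of TAL's large internal collection of teachers' classroom speech. After filtering and preprocessing, we selected more than 20 million education ASR text records, about 380 million tokens.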
33 |
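34 | ### 3. Pre-training Method
35 |
36 | 
37 |
38 | As the figure above shows, TAL-EduBERT is pre-trained with the same two tasks as BERT: a character-level task on education text (Masked Language Modeling, MLM) and a sentence-level task (Next Sentence Prediction, NSP). Through these two tasks, TAL-EduBERT learns character-, word-, and sentence-level syntactic and semantic information from education ASR text.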
39 |
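40 | ## III. Downstream Task Results
41 | To evaluate TAL-EduBERT on downstream tasks, we drew datasets for four typical teaching-behavior prediction tasks in online education from real business scenarios; see [3][4] for details. On these datasets we compared TAL-EduBERT with Google BERT Base, the most widely used model for Chinese, and with RoBERTa, which also performs well. The results show that TAL-EduBERT performs better on education ASR downstream tasks.
42 |
43 | ### 1. Experiment Overview: Teacher Behavior Prediction
44 | These tasks come from our work on automatically assessing teachers' instructional behavior. We evaluate four teacher behaviors: guiding students to summarize after class (Conclude), leading students in taking notes (Note), praising students (Praise), and asking students questions (QA). Classifying instructional behavior and tagging teachers with behavior labels makes it easier to analyze how they teach, which in turn helps teachers teach better and improves teaching quality.
45 |
46 | ### 2. Results: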
47 |
48 |
49 | | Model | Metric | Conclude | Note | Praise | QA |
50 | | --- | --- | --- | --- | --- | --- |
51 | | Google BERT | Acc | 0.7036 | 0.8436 | 0.8652 | 0.8948 |
52 | | Google BERT | F1 | 0.6404 | 0.8356 | 0.8683 | 0.8469 |
53 | | RoBERTa | Acc | 0.7097 | 0.8558 | 0.8689 | 0.8979 |
54 | | RoBERTa | F1 | 0.6382 | 0.8464 | 0.8668 | 0.8433 |
55 | | TAL-EduBERT | Acc | 0.7270 | 0.8638 | 0.8731 | 0.9147 |
56 | | TAL-EduBERT | F1 | 0.6486 | 0.8549 | 0.8688 | 0.8721 |
68 |
69 |
70 |
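71 | ## IV. Scope, Usage, and Example
72 | ### 1. Scope:
73 | Because TAL-EduBERT is trained on a large amount of education ASR text, it is more robust to ASR recognition errors than Google BERT Base and RoBERTa, and it performs better on downstream tasks in education scenarios. We therefore recommend the model to NLP engineers who work in education on ASR text, and we hope this release advances the application and development of NLP in education.
74 |
75 | ### 2. Usage:
76 | Usage is the same as for Google's original BERT. The model is supported by the transformers package, so you only need to replace the model path.
77 |
78 | ### 3. Example: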
79 | ```
80 | from transformers import BertTokenizer, BertModel
81 | import torch
82 |
83 | path_to_tal_edubert = "/YourPath/TAL-EduBERT/"  # directory of the downloaded model
84 |
85 | tokenizer = BertTokenizer.from_pretrained(path_to_tal_edubert)
86 | model = BertModel.from_pretrained(path_to_tal_edubert)
87 |
88 | sentence = "让我们来看一下这道题,这个题的也是一种比较经典类型的这个数列题目他呢,有个特点就是前面的是an+1,后面是一个an的式子加上一个根号下an的,一个二次的一个式子。"
89 | inputs = tokenizer(sentence, return_tensors="pt")
90 | outputs = model(**inputs)
91 | last_hidden_states = outputs[0]  # index 0 is the last hidden state; this also works when the model returns a plain tuple
92 | ```
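93 | ## V. Summary
94 | To evaluate TAL-EduBERT's advantage on downstream education tasks, we ran comparative experiments on four types of business problems and data from education scenarios. Compared with Google BERT Base and RoBERTa, two general-domain pre-trained models, TAL-EduBERT reaches higher accuracy and F1 on all four tasks, with F1 gains of up to about 3 percentage points. Colleagues who want to explore NLP in education can therefore use TAL-EduBERT directly in their own work.
95 |
96 | This article has covered the background of TAL-EduBERT's release, its data, and the comparative results. TAL's AI Platform will continue its theoretical and practical work and will open-source more broadly. We warmly welcome further comparative experiments and real-world use cases from others in the field, so that together we can advance NLP in education and bring new momentum to education in China.
97 |
98 |
99 | ## References:
100 | [1] Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
101 | [2] Liu, Yinhan, et al. "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv preprint arXiv:1907.11692 (2019).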
102 | [3] Huang, Gale Yan, et al. "Neural Multi-Task Learning for Teacher Question Detection in Online Classrooms." International Conference on Artificial Intelligence in Education. Springer, Cham, 2020.
103 | [4] Xu, Shiting, Wenbiao Ding, and Zitao Liu. "Automatic Dialogic Instruction Detection for K-12 Online One-on-one Classes." International Conference on Artificial Intelligence in Education. Springer, Cham, 2020.
104 |
105 |
--------------------------------------------------------------------------------
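The README summary's claim of an F1 gain of up to about 3 percentage points can be checked against its results table. The following is a minimal sketch that recomputes the gains from F1 values copied from that table; the variable and function names are illustrative.

```python
# F1 scores copied from the results table in README.md.
TASKS = ["Conclude", "Note", "Praise", "QA"]
F1 = {
    "Google BERT": [0.6404, 0.8356, 0.8683, 0.8469],
    "RoBERTa": [0.6382, 0.8464, 0.8668, 0.8433],
    "TAL-EduBERT": [0.6486, 0.8549, 0.8688, 0.8721],
}


def f1_gains(baseline):
    """Gain of TAL-EduBERT over a baseline, in percentage points per task."""
    return {
        task: round((ours - base) * 100, 2)
        for task, ours, base in zip(TASKS, F1["TAL-EduBERT"], F1[baseline])
    }


gains = {name: f1_gains(name) for name in F1 if name != "TAL-EduBERT"}
# Largest single gain across both baselines and all four tasks.
best = max(
    (points, task, name)
    for name, per_task in gains.items()
    for task, points in per_task.items()
)
for name, per_task in gains.items():
    print(name, per_task)
print("largest gain: %.2f points on %s vs %s" % best)
```

With these numbers the largest gain is on QA against RoBERTa, slightly under 3 points, and the smallest is on Praise.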
/src/finetune.py:
--------------------------------------------------------------------------------
1 | """ Finetuning the library models for sequence classification on GLUE (Bert, XLM, XLNet, RoBERTa, Albert, XLM-RoBERTa)."""
2 |
3 |
4 | import dataclasses
5 | import logging
6 | import os
7 | import sys
8 | from dataclasses import dataclass, field
9 | from typing import Dict, Optional
10 |
11 | import numpy as np
12 |
13 | from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, EvalPrediction, GlueDataset
14 | from transformers import GlueDataTrainingArguments as DataTrainingArguments
15 | from transformers import (
16 |     HfArgumentParser,
17 |     Trainer,
18 |     TrainingArguments,
19 |     glue_compute_metrics,
20 |     glue_output_modes,
21 |     glue_tasks_num_labels,
22 |     set_seed,
23 | )
24 |
25 |
26 | logger = logging.getLogger(__name__)
27 |
28 |
29 | @dataclass
30 | class ModelArguments:
31 |     """
32 |     Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
33 |     """
34 |
35 |     model_name_or_path: str = field(
36 |         metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
37 |     )
38 |     config_name: Optional[str] = field(
39 |         default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
40 |     )
41 |     tokenizer_name: Optional[str] = field(
42 |         default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
43 |     )
44 |     cache_dir: Optional[str] = field(
45 |         default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
46 |     )
47 |
48 |
49 | def main():
50 |     # See all possible arguments in src/transformers/training_args.py
51 |     # or by passing the --help flag to this script.
52 |     # We now keep distinct sets of args, for a cleaner separation of concerns.
53 |
54 |     parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
55 |
56 |     if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
57 |         # If we pass only one argument to the script and it's the path to a json file,
58 |         # let's parse it to get our arguments.
59 |         model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
60 |     else:
61 |         model_args, data_args, training_args = parser.parse_args_into_dataclasses()
62 |
63 |     if (
64 |         os.path.exists(training_args.output_dir)
65 |         and os.listdir(training_args.output_dir)
66 |         and training_args.do_train
67 |         and not training_args.overwrite_output_dir
68 |     ):
69 |         raise ValueError(
70 |             f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
71 |         )
72 |
73 |     # Setup logging
74 |     logging.basicConfig(
75 |         format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
76 |         datefmt="%m/%d/%Y %H:%M:%S",
77 |         level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN,
78 |     )
79 |     logger.warning(
80 |         "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
81 |         training_args.local_rank,
82 |         training_args.device,
83 |         training_args.n_gpu,
84 |         bool(training_args.local_rank != -1),
85 |         training_args.fp16,
86 |     )
87 |     logger.info("Training/evaluation parameters %s", training_args)
88 |
89 |     # Set seed
90 |     set_seed(training_args.seed)
91 |
92 |     try:
93 |         num_labels = glue_tasks_num_labels[data_args.task_name]
94 |         output_mode = glue_output_modes[data_args.task_name]
95 |     except KeyError:
96 |         raise ValueError("Task not found: %s" % (data_args.task_name))
97 |
98 |     # Load pretrained model and tokenizer
99 |     #
100 |     # Distributed training:
101 |     # The .from_pretrained methods guarantee that only one local process can concurrently
102 |     # download model & vocab.
103 |
104 |     config = AutoConfig.from_pretrained(
105 |         model_args.config_name if model_args.config_name else model_args.model_name_or_path,
106 |         num_labels=num_labels,
107 |         finetuning_task=data_args.task_name,
108 |         cache_dir=model_args.cache_dir,
109 |     )
110 |     tokenizer = AutoTokenizer.from_pretrained(
111 |         model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
112 |         cache_dir=model_args.cache_dir,
113 |     )
114 |     model = AutoModelForSequenceClassification.from_pretrained(
115 |         model_args.model_name_or_path,
116 |         from_tf=bool(".ckpt" in model_args.model_name_or_path),
117 |         config=config,
118 |         cache_dir=model_args.cache_dir,
119 |     )
120 |
121 |     # Get datasets
122 |     train_dataset = GlueDataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
123 |     eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="dev") if training_args.do_eval else None
124 |     test_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="test") if training_args.do_predict else None
125 |
126 |     def compute_metrics(p: EvalPrediction) -> Dict:
127 |         if output_mode == "classification":
128 |             preds = np.argmax(p.predictions, axis=1)
129 |         elif output_mode == "regression":
130 |             preds = np.squeeze(p.predictions)
131 |         return glue_compute_metrics(data_args.task_name, preds, p.label_ids)
132 |
133 |     # Initialize our Trainer
134 |     trainer = Trainer(
135 |         model=model,
136 |         args=training_args,
137 |         train_dataset=train_dataset,
138 |         eval_dataset=eval_dataset,
139 |         compute_metrics=compute_metrics,
140 |     )
141 |
142 |     # Training
143 |     if training_args.do_train:
144 |         trainer.train(
145 |             model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
146 |         )
147 |         trainer.save_model()
148 |         # For convenience, we also re-save the tokenizer to the same directory,
149 |         # so that you can share your model easily on huggingface.co/models =)
150 |         if trainer.is_world_master():
151 |             tokenizer.save_pretrained(training_args.output_dir)
152 |
153 |     # Evaluation
154 |     eval_results = {}
155 |     if training_args.do_eval:
156 |         logger.info("*** Evaluate ***")
157 |
158 |         # Loop to handle MNLI double evaluation (matched, mis-matched)
159 |         eval_datasets = [eval_dataset]
160 |         if data_args.task_name == "mnli":
161 |             mnli_mm_data_args = dataclasses.replace(data_args, task_name="mnli-mm")
162 |             eval_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, mode="dev"))
163 |
164 |         for eval_dataset in eval_datasets:
165 |             eval_result = trainer.evaluate(eval_dataset=eval_dataset)
166 |
167 |             output_eval_file = os.path.join(
168 |                 training_args.output_dir, f"eval_results_{eval_dataset.args.task_name}.txt"
169 |             )
170 |             if trainer.is_world_master():
171 |                 with open(output_eval_file, "w") as writer:
172 |                     logger.info("***** Eval results {} *****".format(eval_dataset.args.task_name))
173 |                     for key, value in eval_result.items():
174 |                         logger.info(" %s = %s", key, value)
175 |                         writer.write("%s = %s\n" % (key, value))
176 |
177 |             eval_results.update(eval_result)
178 |
179 |     if training_args.do_predict:
180 |         logger.info("*** Test ***")
181 |         test_datasets = [test_dataset]
182 |         if data_args.task_name == "mnli":
183 |             mnli_mm_data_args = dataclasses.replace(data_args, task_name="mnli-mm")
184 |             test_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, mode="test"))
185 |
186 |         for test_dataset in test_datasets:
187 |             predictions = trainer.predict(test_dataset=test_dataset).predictions
188 |             if output_mode == "classification":
189 |                 predictions = np.argmax(predictions, axis=1)
190 |
191 |             output_test_file = os.path.join(
192 |                 training_args.output_dir, f"test_results_{test_dataset.args.task_name}.txt"
193 |             )
194 |             if trainer.is_world_master():
195 |                 with open(output_test_file, "w") as writer:
196 |                     logger.info("***** Test results {} *****".format(test_dataset.args.task_name))
197 |                     writer.write("index\tprediction\n")
198 |                     for index, item in enumerate(predictions):
199 |                         if output_mode == "regression":
200 |                             writer.write("%d\t%3.3f\n" % (index, item))
201 |                         else:
202 |                             item = test_dataset.get_labels()[item]
203 |                             writer.write("%d\t%s\n" % (index, item))
204 |     return eval_results
205 |
206 |
207 | def _mp_fn(index):
208 |     # For xla_spawn (TPUs)
209 |     main()
210 |
211 |
212 | if __name__ == "__main__":
213 |     main()
214 |
--------------------------------------------------------------------------------
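For the `CoLA` task used in the example script, `compute_metrics` in `src/finetune.py` takes the argmax over the logits and passes the predictions to `glue_compute_metrics`, which for CoLA reports the Matthews correlation coefficient. The following is a dependency-free sketch of that calculation for the binary case. `argmax_rows` and `matthews_corrcoef_binary` are hypothetical names, and the logits and labels are made-up values.

```python
import math


def argmax_rows(logits):
    """Index of the largest value in each row (first index on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def matthews_corrcoef_binary(labels, preds):
    """Matthews correlation coefficient for 0/1 labels; 0.0 when undefined."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up logits for four examples with two classes each.
logits = [[2.0, 1.0], [0.0, 3.0], [1.0, 0.0], [0.2, 0.9]]
labels = [0, 1, 1, 1]
preds = argmax_rows(logits)
mcc = matthews_corrcoef_binary(labels, preds)
print(preds, round(mcc, 4))
```

Unlike accuracy, the coefficient stays near zero when a model predicts a single class for every example, which is why it suits imbalanced binary labels.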