├── LICENSE ├── README.md └── Tokenization Model ├── Current Tokenization Model ├── README.md ├── Tibetan_bpe.model ├── Tibetan_bpe.vocab ├── training_log2.txt └── 训练命令.txt └── Legacy Tokenization Model ├── README.md └── tokenizer30000.model /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
202 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# TiLamb (Tibetan Large Language Model Base)

## A Tibetan large language model built on LLaMA2-7B through continued pre-training

## Contents
| Section | Description |
| ------------------------------------- | ------------------------------------------------------------ |
| [💁🏻‍♂️Model Overview](#model-overview) | A brief introduction to TiLamb-7B |
| [⏬Model Download](#model-download) | Download links for TiLamb-7B |
| [💯Task Evaluation](#task-evaluation) | Results of TiLamb-7B on several Tibetan NLP downstream tasks |
| [🔔Acknowledgements](#acknowledgements) | Thanks to the excellent projects that helped this one |
| [📳Disclaimer](#disclaimer) | Disclaimer for the use of TiLamb-7B |

### Model Overview

**TiLamb-7B** is a foundation (base) model for Tibetan. Starting from Meta's commercially usable [LLaMA2-7B](https://github.com/facebookresearch/llama), it was continually pre-trained with the LoRA method on 26.43 GB of Tibetan text. The LLaMA2 vocabulary was extended with Tibetan tokens, growing from the original 32,000 entries to 61,221, and the embedding and lm_head matrices of the original LLaMA2-7B model were expanded with mean-value initialization for the new entries.

#### 📝 Building the Tibetan tokenization model

- LLaMA2 uses the BPE (Byte Pair Encoding) algorithm. This project likewise uses SentencePiece's BPE implementation: training on close to 10 GB of raw Tibetan text yields a Tibetan tokenization model with a vocabulary size of 32,000 and a character coverage of 99.95%. The key parameters are listed in the table below (a reproduction sketch follows the table).

| Parameter | Value |
|------|----|
| Vocabulary size (`vocab_size`) | 32000 |
| Tokenization algorithm (`model_type`) | BPE |
| Digit splitting (`split_digits`) | True |
| Byte fallback (`byte_fallback`) | True |
| Maximum sentence length (`max_sentence_length`) | 5000 |
| Character coverage (`character_coverage`) | 0.9995 |
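The table above maps directly onto SentencePiece's training options. A minimal sketch of the equivalent call through the Python API is shown below; the input file name follows the command recorded in `Tokenization Model/Current Tokenization Model/训练命令.txt`.

```python
import sentencepiece as spm

# Produces Tibetan_bpe.model / Tibetan_bpe.vocab with the parameters listed above.
spm.SentencePieceTrainer.train(
    input="merged_training_data.txt",   # the ~10 GB raw Tibetan corpus
    model_prefix="Tibetan_bpe",
    vocab_size=32000,
    model_type="bpe",
    split_digits=True,
    byte_fallback=True,
    max_sentence_length=5000,
    character_coverage=0.9995,
)
```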
#### 📖 Expanding the vocabulary with Tibetan tokens

- About 30,000 additional Tibetan tokens were added to the original LLaMA2 vocabulary, which makes encoding and decoding Tibetan more efficient and improves LLaMA2's ability to model the language (a sketch of the merge-and-initialization step follows the links below).

- [Tibetan tokenization model used by TiLamb, which will no longer be used in the future](Tokenization%20Model/Legacy%20Tokenization%20Model/)

- [Latest Tibetan tokenization model, which will be used in the future](Tokenization%20Model/Current%20Tokenization%20Model)
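The expansion script itself is not shipped in this repository, so the snippet below is only a sketch of the step described above, assuming a merged tokenizer has already been built (as in the Chinese-LLaMA-Alpaca-2 recipe credited in the Acknowledgements): the embedding and lm_head matrices are resized and the new rows are filled with the mean of the original rows, which is one straightforward reading of "mean-value initialization". The checkpoint id and the merged-tokenizer path are placeholders.

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

# Placeholder ids/paths: the merged tokenizer is assumed to contain the original
# LLaMA2 vocabulary plus ~30,000 Tibetan pieces (61,221 entries in total).
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
new_tokenizer = LlamaTokenizer.from_pretrained("path/to/merged_tibetan_tokenizer")

old_vocab_size = model.get_input_embeddings().weight.shape[0]   # 32,000
model.resize_token_embeddings(len(new_tokenizer))               # 61,221 after merging

with torch.no_grad():
    for matrix in (model.get_input_embeddings().weight,          # embedding
                   model.get_output_embeddings().weight):        # lm_head
        mean_row = matrix[:old_vocab_size].mean(dim=0, keepdim=True)
        matrix[old_vocab_size:] = mean_row                       # mean-initialize new rows
```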
#### ⚔️ Tokenization before and after vocabulary expansion

- For the same Tibetan text, the outputs of the original LLaMA2 tokenizer and of the TiLamb tokenizer are compared in the table below (a reproduction sketch follows the table). Switching to the TiLamb tokenizer cuts the token count from 174 to 19, roughly a factor of nine, which greatly improves the model's encoding efficiency and inference speed on Tibetan. For a fixed context length, the maximum input/output text length and the amount of information the model can hold both grow by more than a factor of eight.

| Type | Length | Content |
|------------|------|------|
| Tibetan sentence | 30 | ཅ་དངོས་བླུགས་སྒམ་ཁ་ཕྱེ་ནས་ཞིབ་བཤེར་བྱས་རྗེས། འགག་སྒོ་ལས་ཁུངས་ཀྱི་འབྲེལ་ཡོད་མི་སྣས་ཐུམ་སྒམ་ནང་ཇའི་སྒམ་ཆུང་ཞིག་རྙེད་ཅིང་། <br> (After the suitcase was opened and inspected, customs staff found a small tea box inside the package.) |
| LLaMA2 tokenizer | 174 | '▁', '<0xE0>', '<0xBD>', '<0x85>', '་', 'ད', 'ང', 'ོ', 'ས', '་', 'བ', '<0xE0>', '<0xBE>', '<0xB3>', 'ུ', 'ག', 'ས', '་', 'ས', '<0xE0>', '<0xBE>', '<0x92>', 'མ', '་', '<0xE0>', '<0xBD>', '<0x81>', '་', '<0xE0>', '<0xBD>', '<0x95>', 'ྱ', 'ེ', '་', 'ན', 'ས', '་', '<0xE0>', '<0xBD>', '<0x9E>', 'ི', 'བ', '་', 'བ', '<0xE0>', '<0xBD>', '<0xA4>', 'ེ', 'ར', '་', 'བ', 'ྱ', 'ས', '་', 'ར', '<0xE0>', '<0xBE>', '<0x97>', 'ེ', 'ས', '<0xE0>', '<0xBC>', '<0x8D>', '▁', '<0xE0>', '<0xBD>', '<0xA0>', 'ག', 'ག', '་', 'ས', '<0xE0>', '<0xBE>', '<0x92>', 'ོ', '་', 'ལ', 'ས', '་', '<0xE0>', '<0xBD>', '<0x81>', 'ུ', 'ང', 'ས', '་', '<0xE0>', '<0xBD>', '<0x80>', 'ྱ', 'ི', '་', '<0xE0>', '<0xBD>', '<0xA0>', 'བ', '<0xE0>', '<0xBE>', '<0xB2>', 'ེ', 'ལ', '་', '<0xE0>', '<0xBD>', '<0xA1>', 'ོ', 'ད', '་', 'མ', 'ི', '་', 'ས', '<0xE0>', '<0xBE>', '<0xA3>', 'ས', '་', '<0xE0>', '<0xBD>', '<0x90>', 'ུ', 'མ', '་', 'ས', '<0xE0>', '<0xBE>', '<0x92>', 'མ', '་', 'ན', 'ང', '་', '<0xE0>', '<0xBD>', '<0x87>', '<0xE0>', '<0xBD>', '<0xA0>', 'ི', '་', 'ས', '<0xE0>', '<0xBE>', '<0x92>', 'མ', '་', '<0xE0>', '<0xBD>', '<0x86>', 'ུ', 'ང', '་', '<0xE0>', '<0xBD>', '<0x9E>', 'ི', 'ག', '་', 'ར', '<0xE0>', '<0xBE>', '<0x99>', 'ེ', 'ད', '་', '<0xE0>', '<0xBD>', '<0x85>', 'ི', 'ང', '་', '<0xE0>', '<0xBC>', '<0x8D>' |
| TiLamb tokenizer | 19 | '▁ཅ་', 'དངོས་', 'བླུགས་', 'སྒམ་', 'ཁ་ཕྱེ་ནས་', 'ཞིབ་བཤེར་', 'བྱས་རྗེས།', '▁', 'འགག་སྒོ་', 'ལས་ཁུངས་ཀྱི་', 'འབྲེལ་ཡོད་', 'མི་སྣས་', 'ཐུམ་', 'སྒམ་', 'ནང་', 'ཇའི་', 'སྒམ་ཆུང་', 'ཞིག་རྙེད་', 'ཅིང་།' |
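The comparison can be reproduced by encoding the example sentence with both tokenizers. In the sketch below, the TiLamb repository id is the one linked in this README, while `meta-llama/Llama-2-7b-hf` is assumed as the Hugging Face checkpoint of the original LLaMA2-7B.

```python
from transformers import AutoTokenizer

text = (
    "ཅ་དངོས་བླུགས་སྒམ་ཁ་ཕྱེ་ནས་ཞིབ་བཤེར་བྱས་རྗེས། "
    "འགག་སྒོ་ལས་ཁུངས་ཀྱི་འབྲེལ་ཡོད་མི་སྣས་ཐུམ་སྒམ་ནང་ཇའི་སྒམ་ཆུང་ཞིག་རྙེད་ཅིང་།"
)

llama2_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed checkpoint id
tilamb_tok = AutoTokenizer.from_pretrained("YoLo2000/TiLamb-7B")

print(len(llama2_tok.tokenize(text)))  # 174 with the original LLaMA2 vocabulary
print(len(tilamb_tok.tokenize(text)))  # 19 after Tibetan vocabulary expansion
```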
#### 🌟 Hyper-parameters for continued pre-training

The continued pre-training used the hyper-parameters below (an illustrative rendering of the LoRA-related rows follows the table).

| Parameter | Value | Parameter | Value |
|----------------------------------------|------------------------------------|----------------------------------------|---------------------------------------|
| `cutoff_len` | 1024 | `learning_rate` | 2e-4 |
| `finetuning_type` | lora | `num_train_epochs` | 1.0 |
| `per_device_train_batch_size` | 4 | `gradient_accumulation_steps` | 2 |
| `lr_scheduler_type` | cosine | `max_grad_norm` | 1.0 |
| `lora_rank` | 8 | `lora_dropout` | 0.1 |
| `lora_target` | q_proj, v_proj | `warmup_steps` | 0 |
| `additional_target` | embed_tokens, lm_head, norm | `fp16` | True |
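The parameter names above follow LLaMA-Factory's argument naming. Purely as an illustration, the LoRA-related rows correspond to a PEFT `LoraConfig` along the lines of the sketch below; this is an assumption about the configuration they imply, not the authors' actual training code (LLaMA-Factory builds the adapter internally).

```python
from peft import LoraConfig, TaskType

# Illustrative PEFT equivalent of the LoRA rows in the table above.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                                   # lora_rank
    lora_dropout=0.1,                                      # lora_dropout
    target_modules=["q_proj", "v_proj"],                   # lora_target
    modules_to_save=["embed_tokens", "lm_head", "norm"],   # additional_target, trained in full
)
```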
### Model Download

| Attribute | Description |
|----------------|------------------------------------------------------|
| Model name | TiLamb-7B |
| Model type | **Base model, not a chat model** |
| Parameters | 7B |
| Training objective | Causal-LM (CLM) |
| Training method | LoRA + full training of embeddings/lm_head |
| Base model | [LLaMA2-7B-base](https://github.com/facebookresearch/llama) |
| Training corpus | 26.43 GB of unlabeled general-domain Tibetan text |
| Vocabulary size | 61,221 |
| Language support | Tibetan |
| File size | 13.0 GB |
| **🤖ModelScope download link** | [https://modelscope.cn/models/YoLo2000/TiLamb-7B/summary](https://modelscope.cn/models/YoLo2000/TiLamb-7B/summary)|
| **🤗Hugging Face download link** | [https://huggingface.co/YoLo2000/TiLamb-7B](https://huggingface.co/YoLo2000/TiLamb-7B) |

> [!NOTE]
> [1] *The Tibetan tokenization model used to expand the original LLaMA2 vocabulary for TiLamb-7B is not the [tokenization model described above](#model-overview). At that time an earlier model with a vocabulary size of 30,000 and a character coverage of 99.95% was used; the [model described above](#model-overview) was trained more recently, is the one we currently consider best suited to our needs, and will be used in our next-generation Tibetan large language model.*<br>
> [2] *TiLamb-7B is a base model without supervised fine-tuning and therefore **has no chat capability**.*<br>
> [3] *For Tibetan dialogue and for adapting the model to [Tibetan NLP downstream tasks](#task-evaluation), we recommend fine-tuning with the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory/tree/main) framework.*<br>

### Task Evaluation

- For a range of Tibetan downstream tasks, fine-tuning datasets of a few thousand to several tens of thousands of examples each were built. The fine-tuned TiLamb was evaluated on seven downstream tasks: Tibetan news classification, Tibetan entity-relation classification, Tibetan machine reading comprehension, Tibetan word segmentation, Tibetan summarization, Tibetan question answering, and Tibetan question generation. On many metrics it improves substantially over traditional methods and other Tibetan pre-trained models.

- Fine-tuning uses the instruction template of the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory/tree/main) framework. An example record for the Tibetan word segmentation task is shown below: `instruction` describes the task, `input` is an optional supplement to the instruction, and `output` is the reply the model is expected to produce.

```json
{
  "instruction": "མཐུན་སྦྱོར་གྱི་འདུ་ཤེས་ཐོག་ནས་དཔལ་འབྱོར་དང་སྤྱི་ཚོགས་ཁྱོན་ཡོངས་ཡར་ཐོན་ལ་སྐུལ་མ་གཏོང་དགོས།",
  "input": "",
  "output": "མཐུན་སྦྱོར་/གྱི་/འདུ་ཤེས་/ཐོག་/ནས་/དཔལ་འབྱོར་/དང་/སྤྱི་ཚོགས་/ཁྱོན་ཡོངས་/ཡར་ཐོན་/ལ་/སྐུལ་མ་/གཏོང་/དགོས།/"
}
```

- Fine-tuning for all Tibetan NLP downstream tasks uses the hyper-parameters below (a sketch of loading a fine-tuned adapter for inference follows the table):

| Parameter | Value | Parameter | Value |
|----------------------------------------|--------------------|----------------------------------------|-------------|
| `cutoff_len` | 2048 | `learning_rate` | 2e-4 |
| `finetuning_type` | lora | `num_train_epochs` | 3.0 |
| `per_device_train_batch_size` | 4 | `gradient_accumulation_steps` | 2 |
| `lr_scheduler_type` | cosine | `max_grad_norm` | 1.0 |
| `lora_rank` | 8 | `lora_dropout` | 0.05 |
| `lora_target` | q_proj, v_proj | `fp16` | True |
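Once a task-specific adapter has been trained with these settings, inference follows the usual transformers + PEFT pattern. The sketch below loads the released TiLamb-7B checkpoint from Hugging Face; the adapter path is a placeholder, and feeding the bare instruction is a simplification, since in practice the same prompt template used during fine-tuning (LLaMA-Factory's) should be applied.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("YoLo2000/TiLamb-7B")
model = AutoModelForCausalLM.from_pretrained(
    "YoLo2000/TiLamb-7B",
    torch_dtype=torch.float16,   # ~13 GB checkpoint; fp16 keeps memory manageable
    device_map="auto",           # requires the accelerate package
)
# Placeholder path to a LoRA adapter fine-tuned for, e.g., Tibetan word segmentation.
model = PeftModel.from_pretrained(model, "path/to/tibetan_segmentation_lora")

instruction = "མཐུན་སྦྱོར་གྱི་འདུ་ཤེས་ཐོག་ནས་དཔལ་འབྱོར་དང་སྤྱི་ཚོགས་ཁྱོན་ཡོངས་ཡར་ཐོན་ལ་སྐུལ་མ་གཏོང་དགོས།"
inputs = tokenizer(instruction, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```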
#### Tibetan News Classification

This task uses the Tibetan News Classification Corpus released by the Natural Language Processing Laboratory of Fudan University. The dataset contains 9,204 samples over 12 categories, including politics, economics, education, tourism, environment, art, literature, and religion. It is split into training and test sets at a 9:1 ratio, and the training split is used to build the fine-tuning dataset. Evaluation metrics are Accuracy, Macro-Precision, Macro-Recall, and Macro-F1; results are shown below.

| Model | Accuracy(%) | Macro-F1(%) |
|---------------------|-------------|-------------|
| Transformer | 28.63 | 28.79 |
| CNN(syllable) | 61.51 | 57.34 |
| TextCNN | 61.71 | 61.53 |
| DPCNN | 62.91 | 61.17 |
| TextRCNN | 63.67 | 62.81 |
| Bert-base-Tibetan | - | 51.00 |
| TiBERT | 71.04 | 70.94 |
| CINO-base | 73.10 | 70.00 |
| TiKEM | 74.46 | 72.61 |
| **TiLamb+LoRA** | **78.85** | **77.45** |


#### Tibetan Entity-Relation Classification

To test TiLamb's ability to memorize and apply knowledge, we use a dataset of 6,433 triple-text aligned samples covering 11 relation types. Given two entities and the text passage containing them, the task is to predict the relation category between the two entities. The data is split into training and test sets at a 9:1 ratio, and the training split is used to build the fine-tuning dataset for this task. Evaluation metrics are Accuracy(%), Macro-P(%), Macro-R(%), and Macro-F1(%); results are shown below.

| Model | Accuracy(%) | Macro-P(%) | Macro-R(%) | Macro-F1(%) |
|---------------|-------------|------------|------------|-------------|
| FastText | 55.80 | 34.05 | 32.98 | 31.61 |
| DPCNN | 70.94 | 54.21 | 49.23 | 48.65 |
| TextCNN | 72.38 | 71.03 | 59.11 | 56.76 |
| TiBERT | 84.70 | 76.66 | 68.82 | 67.94 |
| CINO-base | 85.31 | 75.48 | 69.12 | 66.73 |
| MiLMO | 85.76 | 77.13 | 68.97 | 68.57 |
| TiKEM | 90.12 | 91.73 | 75.61 | 76.34 |
| **TiLamb+LoRA** | **95.98** | **97.14** | **88.98** | **91.60** |


#### Tibetan Machine Reading Comprehension

In machine reading comprehension, the model is given a passage and a question and must answer the question, which requires understanding the question and the context, reasoning over them, and producing a concrete answer. The Tibetan machine reading comprehension dataset TibetanQA, which contains 1,513 articles and 20,000 question-answer pairs, is used to evaluate the model's reading comprehension ability, with EM (exact match) and F1 as evaluation metrics. The data is split into training and test sets at an 8:2 ratio, and the training split is used to build the fine-tuning dataset for this task; results are shown below.

| Model | EM(%) | F1(%) |
|--------------|-------|-------|
| R-Net | 55.8 | 63.4 |
| BiDAF | 58.6 | 67.8 |
| QANet | 57.1 | 66.9 |
| TiBERT | 53.2 | 73.4 |
| TiLamb+LoRA | 46.6 | 77.4 |
| Ti-Reader | 67.9 | 77.4 |
| **TiKEM** | **69.4** | **80.1** |


#### Tibetan Word Segmentation

For a language like Tibetan, which is structurally complex and relatively low-resource, correct word segmentation is essential for downstream processing tasks such as semantic analysis, machine translation, and information retrieval. This task uses the dataset from the first Tibetan word segmentation evaluation organized by the Chinese Information Processing Society of China. After de-duplication the dataset contains 21,427 Tibetan short sentences; 1,000 are held out as the test set and the remaining 20,427 are used to build the fine-tuning dataset. Results are compared below with TIP-LAS, the winner of that first evaluation (an illustrative scoring sketch follows the table).

| Model | Precision(%) | Recall(%) | F1(%) |
|-----------------|--------------|-----------|-------|
| TIP-LAS | 93.14 | 92.17 | 92.66 |
| **TiLamb+LoRA** | **93.58** | **93.71** | **93.64** |
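The evaluation script is not part of this repository, so the following is only a sketch assuming standard span-based scoring for word segmentation: model outputs in the `/`-separated format shown in the fine-tuning example above are converted to word spans and compared against the gold segmentation.

```python
def seg_to_spans(segmented: str, sep: str = "/"):
    """Turn a '/'-separated segmentation into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for word in filter(None, segmented.split(sep)):
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans

def segmentation_prf(gold: str, predicted: str):
    """Span-level precision / recall / F1 between gold and predicted segmentations."""
    gold_spans, pred_spans = seg_to_spans(gold), seg_to_spans(predicted)
    correct = len(gold_spans & pred_spans)
    p = correct / len(pred_spans) if pred_spans else 0.0
    r = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```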
#### Tibetan Summarization

Text summarization, an important branch of natural language processing, reduces redundant information and thereby speeds up reading. 47,088 news articles with their summaries are used for fine-tuning, and 5,232 news articles with their summaries form the test set; in both datasets the news headline serves as a short summary. Results are shown below.

| Model | ROUGE-1(%) | ROUGE-2(%) | ROUGE-L(%) |
|-----------------------|------------|------------|------------|
| Unified model | 19.81 | 13.27 | 16.90 |
| CMPT (Ti-SUM) | 39.53 | 26.42 | 38.02 |
| CMPT (50,000 samples) | 49.16 | 33.43 | 48.66 |
| **TiLamb+LoRA** | **53.99** | **37.22** | **52.89** |


#### Tibetan Question Answering

This task uses TiconvQA, a Tibetan multi-turn dialogue dataset manually constructed by our group, which contains 2,120 Tibetan article paragraphs and 20,420 multi-turn question-answer pairs. For evaluation the dataset is split into training and test sets at an 8:2 ratio, with EM and F1 as the evaluation metrics. Results are shown below.

| Model | EM(%) | F1(%) |
|---------------|----------|----------|
| DrQA | 41.49 | 61.51 |
| TiBERT | 45.12 | 65.71 |
| **TiLamb+LoRA** | **45.28** | **72.84** |


#### Tibetan Question Generation

Question generation is a natural language generation task that takes a text passage and a target answer as input and automatically generates a question for that answer. Using TibetanQA, the Tibetan question-answering dataset also used for machine reading comprehension, the data is split into training and test sets at roughly a 9:1 ratio, with 17,762 examples used for fine-tuning and 1,976 for testing. All metric values are percentages; results are shown below.

| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L |
|-----------------|--------|--------|--------|--------|---------|
| S2S+ATT+CP | 29.99 | 20.14 | 13.90 | 9.59 | 31.45 |
| TiBERT | 35.48 | 28.60 | 24.51 | 21.30 | 40.04 |
| TiBERT+wh | **47.28** | 31.35 | 23.02 | 17.52 | 46.78 |
| **TiLamb+LoRA** | 44.60 | **35.24** | **28.88** | **24.47** | **50.42** |


### Acknowledgements

This project draws on and benefits greatly from [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory/tree/main) and [Chinese-LLaMA-Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/tree/main). We sincerely thank the authors of these projects for their work.


### Disclaimer

> [!IMPORTANT]
> - This project is built on the LLaMA2-7B model released by Meta; please strictly follow the LLaMA2-7B open-source license when using it.
> - If third-party code is involved, please comply with the relevant open-source licenses.
> - The accuracy of content generated by the model may be affected by computational factors, randomness, and so on; we therefore make no guarantee about the accuracy of model outputs and accept no liability for any losses arising from the use of these resources or their outputs.
> - If the models are used commercially, developers must comply with local laws and regulations and ensure the compliance of the model's output. This project assumes no responsibility for any products or services derived from it.
--------------------------------------------------------------------------------
/Tokenization Model/Current Tokenization Model/README.md:
--------------------------------------------------------------------------------
#### Notes

A Tibetan tokenization model with a vocabulary size of 32,000 and a character coverage of 99.95%, trained on close to 10 GB of raw Tibetan text. This tokenization model was **not used** in TiLamb; it will be used for vocabulary expansion in later updates of our Tibetan large language models.

A brief description of the important files in this directory:

- `Tibetan_bpe.model`: the model file of the Tibetan byte pair encoding (BPE) tokenization model, used to load the tokenizer.

- `Tibetan_bpe.vocab`: the vocabulary file that accompanies the BPE model, containing the pieces used by the model and their corresponding scores.

- `training_log2.txt`: the training log, recording the details of the training run.

- `训练命令.txt`: the command and parameters used to train the tokenization model.
--------------------------------------------------------------------------------
/Tokenization Model/Current Tokenization Model/Tibetan_bpe.model:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NLP-Learning/TiLamb/48bfae3c9a7920eba4215cb1904b34e01dd3c018/Tokenization Model/Current Tokenization Model/Tibetan_bpe.model
--------------------------------------------------------------------------------
/Tokenization Model/Current Tokenization Model/训练命令.txt:
--------------------------------------------------------------------------------
spm_train --input=merged_training_data.txt --model_prefix=Tibetan_bpe --vocab_size=32000 --model_type=bpe --split_digits=True --byte_fallback=True --max_sentence_length=5000 2>&1 | tee training_log2.txt
--------------------------------------------------------------------------------
/Tokenization Model/Legacy Tokenization Model/README.md:
--------------------------------------------------------------------------------
#### Notes

The Tibetan tokenization model used by TiLamb in the vocabulary-expansion step, with a vocabulary size of 30,000 and a character coverage of 99.95%. It will **no longer be used** in later updates of our Tibetan large language models.
--------------------------------------------------------------------------------
/Tokenization Model/Legacy Tokenization Model/tokenizer30000.model:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NLP-Learning/TiLamb/48bfae3c9a7920eba4215cb1904b34e01dd3c018/Tokenization Model/Legacy Tokenization Model/tokenizer30000.model
--------------------------------------------------------------------------------