├── LICENSE.txt
├── README.md
├── README_EN.md
├── assets
│   ├── 0shot.png
│   ├── YAYI-UIE.png
│   ├── data-dist.png
│   ├── test
│   ├── yayi_dark_small.png
│   └── zh-0shot.png
├── requirements.txt
└── test
    └── test_sample_100.jsonl
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright Beijing Wenge Technology Co.,Ltd.
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |

3 |
4 |
5 | [[Code License](./LICENSE)]
6 | [[Data License](./LICENSE_DATA)]
7 | [[Model License](./LICENSE_MODEL)]
8 |
9 | [[📖README](./README.md)]
10 | [[🤗HF Repo](https://huggingface.co/wenge-research)]
11 | [[🔗网页端](https://yayi.wenge.com)]
12 |
13 | 中文 | [English](./README_EN.md)
14 |
15 |
16 |
17 | ## 更新
18 | [2024.03.28] 所有模型和数据上传魔搭社区。
19 |
20 | ## 介绍
21 | 雅意信息抽取统一大模型 (YAYI-UIE) 在百万级人工构造的高质量信息抽取数据上进行指令微调,统一训练命名实体识别(NER)、关系抽取(RE)和事件抽取(EE)等信息抽取任务,实现通用、安全、金融、生物、医疗、商业、个人、车辆、电影、工业、餐厅、科学等场景下的结构化抽取。
22 |
23 | 我们希望通过雅意 UIE 大模型的开源,为促进中文预训练大模型开源社区的发展贡献一份力量,并与每一位合作伙伴共建雅意大模型生态。更多技术细节,欢迎阅读我们的技术报告🔥[YAYI-UIE: A Chat-Enhanced Instruction Tuning Framework for Universal Information Extraction](https://arxiv.org/abs/2312.15548)。
24 |
25 | 
26 |
27 | ## 下载地址
28 | | 名称 | 🤗 HF模型标识 | 下载地址 | 魔搭模型标识 | 下载地址 |
29 | |:----------|:----------:|:----------:|:----------:|:----------:|
30 | | YAYI-UIE | wenge-research/yayi-uie | [模型下载](https://huggingface.co/wenge-research/yayi-uie) |wenge-research/yayi-uie | [模型下载](https://modelscope.cn/models/wenge-research/yayi-uie) |
31 | | YAYI-UIE Data | wenge-research/yayi_uie_sft_data| [数据集下载](https://huggingface.co/datasets/wenge-research/yayi_uie_sft_data)|wenge-research/yayi_uie_sft_data| [数据集下载](https://modelscope.cn/datasets/wenge-research/yayi_uie_sft_data)|
32 |
33 |
34 | ## 训练数据
35 | 百万级语料中,中文占 54%,英文占 46%;数据集覆盖金融、社会、生物、商业、工业制造、化学、车辆、科学、疾病医疗、个人生活、安全和通用等 12 个领域,涉及数百个场景:
36 | - NER:中文覆盖**28**个实体类型包括人物,地缘政治,组织,身体部位,药物等,英文覆盖**130**个实体类型包括Animal, Weapon, Conference, Book等。
37 | - RE:中文覆盖**232**种关系包括买资,增持,重组,国籍,别名,亲属,入股,转让,导致,发生地点,制造商等,英文覆盖**236**种关系包括founded by,state or province of headquarters,employee of,occupation,creator等。
38 | - EE:中文覆盖**84**种事件类型,包括中标,高管变动,产品行为-发布,公司上市等,和**203**种论元,英文覆盖**45**种事件类型,包括Born, Demonstrate, Meet, End Organization, Divorce等,和**62**种论元。
39 |
40 | 
41 |
42 | ## 运行方式
43 | #### 安装环境
44 | 1. 下载本仓库内容至本地/远程服务器
45 |
46 | ```bash
47 | git clone https://github.com/wenge-research/yayi-uie.git
48 | cd yayi-uie
49 | ```
50 |
51 | 2. 创建conda环境
52 |
53 | ```bash
54 | conda create --name uie python=3.8
55 | conda activate uie
56 | ```
57 |
58 | 3. 安装环境
59 |
60 | ```bash
61 | pip install -r requirements.txt
62 | ```
63 | 其中 `torch` 和 `transformers` 的版本不建议低于 `requirements.txt` 中的推荐版本。
64 |
65 | #### 模型推理
66 | 模型已在我们的 [Huggingface 模型仓库](https://huggingface.co/wenge-research) 开源,欢迎下载使用。以下是一个调用 `YAYI-UIE` 进行下游任务推理的简单示例代码,可在单张 A100/A800 等 GPU 上运行,使用 bf16 精度推理时约占用 33GB 显存:
67 |
68 | ```python
69 | >>> import torch
70 | >>> from transformers import AutoModelForCausalLM, AutoTokenizer
71 | >>> from transformers.generation.utils import GenerationConfig
72 | >>> tokenizer = AutoTokenizer.from_pretrained("wenge-research/yayi-uie", use_fast=False, trust_remote_code=True)
73 | >>> model = AutoModelForCausalLM.from_pretrained("wenge-research/yayi-uie", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
74 | >>> model.generation_config = GenerationConfig.from_pretrained("wenge-research/yayi-uie")
75 | >>> prompt = "文本:氧化锆陶瓷以其卓越的物理和化学特性在多个行业中发挥着关键作用。这种材料因其高强度、高硬度和优异的耐磨性,广泛应用于医疗器械、切削工具、磨具以及高端珠宝制品。在制造这种高性能陶瓷时,必须遵循严格的制造标准,以确保其最终性能。这些标准涵盖了从原材料选择到成品加工的全过程,保障产品的一致性和可靠性。氧化锆的制造过程通常包括粉末合成、成型、烧结和后处理等步骤。原材料通常是高纯度的氧化锆粉末,通过精确控制的烧结工艺,这些粉末被转化成具有特定微观结构的坚硬陶瓷。这种独特的微观结构赋予氧化锆陶瓷其显著的抗断裂韧性和耐腐蚀性。此外,氧化锆陶瓷的热膨胀系数与铁类似,使其在高温应用中展现出良好的热稳定性。因此,氧化锆陶瓷不仅在工业领域,也在日常生活中的应用日益增多,成为现代材料科学中的一个重要分支。\n抽取文本中可能存在的实体,并以json{制造品名称/制造过程/制造材料/工艺参数/应用/生物医学/工程特性:[实体]}格式输出。"
76 | >>> # "" is a reserved token for human, "" is a reserved token for assistant
77 | >>> prompt = "" + prompt + ""
78 | >>> inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
79 | >>> response = model.generate(**inputs, max_new_tokens=512, temperature=0)
80 | >>> print(tokenizer.decode(response[0], skip_special_tokens=True))
81 | ```
82 |
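模型会按照指令中指定的 JSON 结构输出抽取结果。下面是一段示意性的解析代码(`extract_json` 是为演示自行命名的辅助函数,并非 YAYI-UIE 官方接口;假设解码结果中包含一段合法 JSON):

```python
import json
import re

def extract_json(decoded: str):
    """从解码后的模型输出中提取第一段 JSON 结构(仅作演示)。

    注意:模型有时会输出单引号的类 JSON 文本,
    此时 json.loads 会失败,函数返回 None。
    """
    match = re.search(r"(\{.*\}|\[.*\])", decoded, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None

# 以 NER 指令要求的输出格式为例:
demo = '{"人物": ["小明"], "机构": [], "地点": ["北京"]}'
print(extract_json(demo))
```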
83 | #### 指令样例
84 | 注:
85 | - 可在指令前用中括号【】标注具体任务类型(可加可不加)。
86 | - 为了让模型能抽取更全的信息,尽量在指令中加入细粒度的提示,比如“会见地点”,“会议地点”等,而不是统一为“地点”。
87 | - 输入时尽量将文本放在前、指令放在后,效果更好。
88 |
89 |
90 | 1. 实体抽取任务
91 | ```
92 | 文本:xx
93 | 【实体抽取】抽取文本中可能存在的实体,并以json{人物/机构/地点:[实体]}格式输出。
94 | ```
95 | 2. 关系抽取任务
96 | ```
97 | 文本:xx
98 | 【关系抽取】已知关系列表是[注资,拥有,纠纷,自己,增持,重组,买资,签约,持股,交易]。根据关系列表抽取关系三元组,按照json[{'relation':'', 'head':'', 'tail':''}, ]的格式输出。
99 | ```
100 | ```
101 | 文本:xx
102 | 抽取文本中可能存在的关系,并以json[{'关系':'会见/出席', '头实体':'', '尾实体':''}, ]格式输出。
103 | ```
104 | 3. 事件抽取任务
105 | ```
106 | 文本:xx
107 | 已知论元角色列表是[时间,地点,会见主体,会见对象],请根据论元角色列表从给定的输入中抽取可能的论元,以json{角色:论元}格式输出。
108 | ```
109 |
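按照上面"文本在前、指令在后"的建议,这类指令可以用简单的字符串模板拼接。下面是一个示意(函数名与示例文本均为假设):

```python
def build_ner_prompt(text: str, entity_types: list) -> str:
    """按"文本在前、指令在后"的建议拼接实体抽取指令(示例写法)。"""
    schema = "/".join(entity_types)
    return f"文本:{text}\n【实体抽取】抽取文本中可能存在的实体,并以json{{{schema}:[实体]}}格式输出。"

prompt = build_ner_prompt("小明在北京工作。", ["人物", "机构", "地点"])
print(prompt)
```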
110 | ## 模型zero-shot评测结果
111 | 1. NER任务
112 |
113 | AI、Literature、Music、Politics、Science 为英文数据集;boson、clue、weibo 为中文数据集。
114 |
115 | | 模型 | AI | Literature | Music | Politics | Science | 英文平均 | boson | clue | weibo | 中文平均 |
116 | | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
117 | | davinci | 2.97 | 9.87 | 13.83 | 18.42 | 10.04 | 11.03 | - | - | - | - |
118 | | ChatGPT 3.5 | **54.4** | **54.07** | **61.24** | **59.12** | **63** | **58.37** | 38.53 | 25.44 | 29.3 | 31.09 |
119 | | UIE | 31.14 | 38.97 | 33.91 | 46.28 | 41.56 | 38.37 | 40.64 | 34.91 | 40.79 | 38.78 |
120 | | USM | 28.18 | 56 | 44.93| 36.1 | 44.09 | 41.86 | - | - | - | - |
121 | | InstructUIE | 49 | 47.21 | 53.16 | 48.15 | 49.3 | 49.36 | - | - | - | - |
122 | | KnowLM | 13.76 | 20.18 | 14.78 | 33.86 | 9.19 | 18.35 | 25.96 | 4.44 | 25.2 | 18.53 |
123 | | YAYI-UIE | 52.4 | 45.99 | 51.2 | 51.82 | 50.53 | 50.39 | **49.25** | **36.46** | 36.78 | **40.83** |
124 |
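表中的"英文平均/中文平均"为对应各数据集 F1 分数的算术平均,以 YAYI-UIE 的中文平均为例:

```python
# YAYI-UIE 的中文平均 = boson、clue、weibo 三个数据集 F1 的算术平均
zh_scores = [49.25, 36.46, 36.78]
zh_average = round(sum(zh_scores) / len(zh_scores), 2)
print(zh_average)  # 40.83
```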
125 | 2. RE任务
126 |
127 | FewRel、Wiki-ZSL 为英文数据集;SKE 2020、COAE2016、IPRE 为中文数据集。
128 |
129 | | 模型 | FewRel | Wiki-ZSL | 英文平均 | SKE 2020 | COAE2016 | IPRE | 中文平均 |
130 | | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
131 | | ChatGPT 3.5 | 9.96 | 13.14 | 11.55 | 24.47 | 19.31 | 6.73 | 16.84 |
132 | | ZETT(T5-small) | 30.53 | 31.74 | 31.14 | - | - | - | - |
133 | | ZETT(T5-base) | 33.71 | 31.17 | 32.44 | - | - | - | - |
134 | | InstructUIE |**39.55** | 35.2 | 37.38 | - | - | - | - |
135 | | KnowLM | 17.46 | 15.33 | 16.40 | 0.4 | 6.56 | 9.75 |5.57|
136 | | YAYI-UIE | 36.09 | **41.07** | **38.58** | **70.8** | **19.97** | **22.97**| **37.91**|
137 |
138 | 3. EE任务
139 |
140 | commodity news 为英文数据集;FewFC、ccf_law 为中文数据集。
141 |
142 | EET(事件类型判别)
143 |
144 | | 模型 | commodity news | FewFC | ccf_law | 中文平均 |
145 | | ------ | ------ | ------ | ------ | ------ |
146 | | ChatGPT 3.5 | 1.41 | 16.15 | 0 | 8.08 |
147 | | UIE | - | 50.23 | 2.16 | 26.20 |
148 | |InstructUIE| **23.26** | - | - | - |
149 | | YAYI-UIE | 12.45 | **81.28** | **12.87** | **47.08**|
150 |
151 | EEA(事件论元抽取)
152 |
153 | | 模型 | commodity news | FewFC | ccf_law | 中文平均 |
154 | | ------ | ------ | ------ | ------ | ------ |
155 | | ChatGPT 3.5 | 8.6 | 44.4 | 44.57 | 44.49 |
156 | | UIE | - | 43.02 | **60.85** | 51.94 |
157 | |InstructUIE| **21.78** | - | - | - |
158 | | YAYI-UIE | 19.74 | **63.06** | 59.42 | **61.24** |
159 |
160 | 
161 |
162 | ## 相关协议
163 | #### 局限性
164 | 基于当前数据和基础模型训练得到的SFT模型,在效果上仍存在以下问题:
165 |
166 | 1. 抽取的信息可能会产生违背事实的错误回答。
167 | 2. 对于具备危害性的指令无法很好的鉴别,可能会产生危害性言论。
168 | 3. 在一些涉及段落级长文本的场景下模型的抽取能力仍有待提高。
169 |
170 |
171 | #### 免责声明
172 | 基于以上模型局限性,我们要求开发者仅将我们开源的代码、数据、模型及后续用此项目生成的衍生物用于研究目的,不得用于商业用途,以及其他会对社会带来危害的用途。请谨慎鉴别和使用雅意大模型生成的内容,请勿将生成的有害内容传播至互联网。若产生不良后果,由传播者自负。
173 | 本项目仅可应用于研究目的,项目开发者不承担任何因使用本项目(包含但不限于数据、模型、代码等)导致的危害或损失。详细请参考免责声明。
174 |
175 | #### 开源协议
176 | 本项目中的代码和数据依照 [Apache-2.0](./LICENSE.txt) 协议开源,社区使用YAYI UIE模型或其衍生品请遵循[Baichuan2](https://github.com/baichuan-inc/Baichuan2)的社区协议和商用协议。
177 |
178 | ## 更新日志
179 | - [2023/12/15] YAYI-UIE大模型正式对外发布并开源。
180 |
181 | ## 致谢
182 | - 本项目训练代码参考了[YAYI](https://github.com/wenge-research/YAYI/blob/main/training/trainer.py) 项目及 Huggingface [transformers](https://github.com/huggingface/transformers) 库;
183 | - 本项目开源版本基于[Baichuan2-13B](https://github.com/baichuan-inc/Baichuan2)指令微调得到;
184 | - 本项目分布式训练使用了 Microsoft 的 [DeepSpeed](https://github.com/microsoft/deepspeed) 分布式训练工具及 Huggingface transformers 文档中的 [ZeRO stage 2](https://huggingface.co/docs/transformers/main_classes/deepspeed#zero2-config) 配置文件;
185 | - 我们非常感谢以下开源项目对我们的帮助:[InstructUIE](https://github.com/BeyonderXX/InstructUIE/tree/master); [Baichuan2](https://github.com/baichuan-inc/Baichuan2); [InstructIE](https://github.com/zjunlp/DeepKE/tree/main/example/llm/InstructKGC); [KnowLM](https://github.com/zjunlp/KnowLM/tree/main)
186 |
187 | ## 引用
188 | 如果您在您的工作中使用了我们的模型,可以引用我们的论文:
189 |
190 | ```
191 | @article{YAYI-UIE,
192 |   author  = {Xinglin Xiao and Yijie Wang and Nan Xu and Yuqi Wang and Hanxuan Yang and Minzheng Wang and Yin Luo and Lei Wang and Wenji Mao and Dajun Zeng},
193 |   title   = {YAYI-UIE: A Chat-Enhanced Instruction Tuning Framework for Universal Information Extraction},
194 |   journal = {arXiv preprint arXiv:2312.15548},
195 |   url     = {https://arxiv.org/abs/2312.15548},
196 |   year    = {2023}
197 | }
198 | ```
199 |
200 | ## Star History
201 | [Star History Chart](https://star-history.com/#wenge-research/YAYI-UIE&Date)
202 |
--------------------------------------------------------------------------------
/README_EN.md:
--------------------------------------------------------------------------------
1 | # YAYI UIE
2 |
3 |
4 |

5 |
6 |
7 | [[Code License](./LICENSE)]
8 | [[Data License](./LICENSE_DATA)]
9 | [[Model License](./LICENSE_MODEL)]
10 |
11 | [[📖README](./README_EN.md)]
12 | [[🤗HF Repo](https://huggingface.co/wenge-research)]
13 | [[🔗URL](https://yayi.wenge.com)]
14 |
15 | English | [中文](./README.md)
16 |
17 |
18 |
19 |
20 | ## Introduction
21 | The YAYI Unified Information Extraction Large Language Model (YAYI-UIE), instruction-tuned on over a million manually constructed, high-quality information extraction samples, unifies training for tasks such as Named Entity Recognition (NER), Relation Extraction (RE), and Event Extraction (EE). The model extracts structured outputs across diverse fields including general, security, finance, biology, medicine, business, personal, automotive, film, industry, restaurant, and science.
22 |
23 | By open-sourcing YAYI-UIE, we aim to foster the growth of the Chinese pre-trained LLM open-source community, and we look forward to building the YAYI Large Models ecosystem together with our partners. For more technical details, please read our technical report 🔥[YAYI-UIE: A Chat-Enhanced Instruction Tuning Framework for Universal Information Extraction](https://arxiv.org/abs/2312.15548).
24 |
25 | 
26 |
27 | ## Download links
28 | | Name | 🤗 HF Model Name | Download Links |
29 | | --------- | --------- | --------- |
30 | | YAYI-UIE | wenge-research/yayi-uie | [Model Download](https://huggingface.co/wenge-research/yayi-uie) |
31 | | YAYI-UIE Data | wenge-research/yayi_uie_sft_data| [Data Download](https://huggingface.co/datasets/wenge-research/yayi_uie_sft_data)|
32 |
33 |
34 | ## Training Datasets
35 | In the corpus of over a million entries, 54% are in Chinese and 46% in English. The dataset encompasses 12 fields including finance, society, biology, business, industrial manufacturing, chemistry, vehicles, science, disease and medicine, personal life, security, and general topics, covering hundreds of scenarios:
36 |
37 | - NER: In Chinese, it covers **28** types of entities including individuals, geopolitics, organizations, body parts, drugs, etc., while in English, it covers **130** types of entities such as Animal, Weapon, Conference, Book, etc.
38 | - RE: In Chinese, it includes **232** types of relations like acquisitions, stake increases, restructurings, nationality, aliases, relatives, buying shares, transfers, causes, locations of occurrence, manufacturers, etc., and in English, **236** types of relations such as founded by, state or province of headquarters, employee of, occupation, creator, etc.
39 | - EE: Chinese covers **84** types of events including winning a bid, executive changes, product actions - launches, company listings, etc., and **203** types of arguments, whereas English covers **45** types of events such as Birth, Demonstration, Meeting, End of Organization, Divorce, etc., and **62** types of arguments.
40 |
41 | 
42 |
43 | ## Quick Start
44 | #### Set up conda envs
45 | 1. clone the repos
46 |
47 | ```bash
48 | git clone https://github.com/wenge-research/yayi-uie.git
49 | cd yayi-uie
50 | ```
51 |
52 | 2. create conda envs
53 |
54 | ```bash
55 | conda create --name uie python=3.8
56 | conda activate uie
57 | ```
58 |
59 | 3. set up envs
60 |
61 | ```bash
62 | pip install -r requirements.txt
63 | ```
64 |
65 | #### Inference
66 | We've already open-sourced our model weights on [Huggingface](https://huggingface.co/wenge-research).
67 | The following code snippet uses YAYI-UIE for downstream task inference. It runs on a single A100/A800 GPU and occupies approximately 33GB of GPU memory when using bf16 precision for inference:
68 |
69 | ```python
70 | >>> import torch
71 | >>> from transformers import AutoModelForCausalLM, AutoTokenizer
72 | >>> from transformers.generation.utils import GenerationConfig
73 | >>> tokenizer = AutoTokenizer.from_pretrained("wenge-research/yayi-uie", use_fast=False, trust_remote_code=True)
74 | >>> model = AutoModelForCausalLM.from_pretrained("wenge-research/yayi-uie", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
75 | >>> model.generation_config = GenerationConfig.from_pretrained("wenge-research/yayi-uie")
76 | >>> prompt = "Text: Alberto Mancini won in the final 7–5 , 2–6 , 7–6 , 7–5 against Boris Becker . \nFrom the given text, extract all the entities and types. Please format the answer in json {location/person/organization:[entities]}."
77 | >>> # "" is a reserved token for human, "" is a reserved token for assistant
78 | >>> prompt = "" + prompt + ""
79 | >>> inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
80 | >>> response = model.generate(**inputs, max_new_tokens=512, temperature=0)
81 | >>> print(tokenizer.decode(response[0], skip_special_tokens=True))
82 | ```
83 |
84 | Downloading and loading the model may take some time on first use.
85 |
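The model replies with the JSON structure requested in the prompt. A minimal parsing sketch follows (`extract_json` is our own illustrative helper, not part of the YAYI-UIE release; it assumes the decoded response contains valid JSON):

```python
import json
import re

def extract_json(decoded: str):
    """Extract the first JSON object/array from a decoded response (illustrative).

    Note: the model sometimes emits single-quoted, JSON-like text,
    which json.loads rejects; this helper then returns None.
    """
    match = re.search(r"(\{.*\}|\[.*\])", decoded, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None

# Example shaped like the NER prompt's requested format:
demo = '{"location": [], "person": ["Alberto Mancini", "Boris Becker"], "organization": []}'
print(extract_json(demo))
```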
86 | #### Sample Prompts
87 | Note:
88 | - The bracketed task tag 【】 is optional; it indicates which task you want the model to perform
89 | - Add specific labels in prompts to help the model generate more comprehensive information. For example, use "meeting location" instead of "location"
90 | - Text first and then instruction yields better results.
91 | 1. NER
92 | ```
93 | Text:
94 | From the given text, extract all the entities and types. Please format the answer in json {person/organization/location:[entities]}.
95 | ```
96 | 2. RE
97 | ```
98 | Text:
99 | From the given text, extract the possible head entities (subjects) and tail entities (objects) and give the corresponding relation triples. The relations are [country of administrative divisions,place of birth,location contains]. Output the result in json[{'relation':'', 'head':'', 'tail':''}, ].
100 | ```
101 | 3. EE
102 | ```
103 | Text:
104 | Given the text and the role list [seller, place, beneficiary, buyer], identify event arguments and roles, provide your answer in the format of json{role:name}.
105 | ```
106 |
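Following the "text first, instruction after" tip above, such prompts can be assembled from a simple string template. A sketch (the function name and sample text are our own, for illustration):

```python
def build_ner_prompt(text: str, entity_types: list) -> str:
    """Compose a NER instruction with the text first and the instruction after."""
    schema = "/".join(entity_types)
    return (f"Text: {text}\n"
            "From the given text, extract all the entities and types. "
            f"Please format the answer in json {{{schema}:[entities]}}.")

prompt = build_ner_prompt("Alberto Mancini won against Boris Becker.",
                          ["location", "person", "organization"])
print(prompt)
```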
107 | ## Zero-shot Evaluation
108 | 1. NER tasks
109 |
110 | AI, Literature, Music, Politics, and Science are English datasets; boson, clue, and weibo are Chinese datasets.
111 |
112 | | Model | AI | Literature | Music | Politics | Science | EN Average | boson | clue | weibo | ZH Average |
113 | | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
114 | | davinci | 2.97 | 9.87 | 13.83 | 18.42 | 10.04 | 11.03 | - | - | - | - |
115 | | ChatGPT 3.5 | **54.4** | **54.07** | **61.24** | **59.12** | **63** | **58.37** | 38.53 | 25.44 | 29.3 | 31.09 |
116 | | UIE | 31.14 | 38.97 | 33.91 | 46.28 | 41.56 | 38.37 | 40.64 | 34.91 | 40.79 | 38.78 |
117 | | USM | 28.18 | 56 | 44.93| 36.1 | 44.09 | 41.86 | - | - | - | - |
118 | | InstructUIE | 49 | 47.21 | 53.16 | 48.15 | 49.3 | 49.36 | - | - | - | - |
119 | | DeepKE-LLM | 13.76 | 20.18 | 14.78 | 33.86 | 9.19 | 18.35 | 25.96 | 4.44 | 25.2 | 18.53 |
120 | | YAYI-UIE | 52.4 | 45.99 | 51.2 | 51.82 | 50.53 | 50.39 | **49.25** | **36.46** | 36.78 | **40.83** |
121 |
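The EN/ZH averages in these tables are simple arithmetic means of the per-dataset F1 scores; for example, YAYI-UIE's ZH average:

```python
# YAYI-UIE ZH average = mean of the boson / clue / weibo zero-shot F1 scores
zh_scores = [49.25, 36.46, 36.78]
zh_average = round(sum(zh_scores) / len(zh_scores), 2)
print(zh_average)  # 40.83
```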
122 | 2. RE Tasks
123 |
124 | FewRel and Wiki-ZSL are English datasets; SKE 2020, COAE2016, and IPRE are Chinese datasets.
125 |
126 | | Model | FewRel | Wiki-ZSL | EN Average | SKE 2020 | COAE2016 | IPRE | ZH Average |
127 | | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
128 | | ChatGPT 3.5 | 9.96 | 13.14 | 11.55 | 24.47 | 19.31 | 6.73 | 16.84 |
129 | | ZETT(T5-small) | 30.53 | 31.74 | 31.14 | - | - | - | - |
130 | | ZETT(T5-base) | 33.71 | 31.17 | 32.44 | - | - | - | - |
131 | | InstructUIE |**39.55** | 35.2 | 37.38 | - | - | - | - |
132 | | DeepKE-LLM | 17.46 | 15.33 | 16.40 | 0.4 | 6.56 | 9.75 |5.57|
133 | | YAYI-UIE | 36.09 | **41.07** | **38.58** | **70.8** | **19.97** | **22.97**| **37.91**|
134 |
135 | 3. EE Tasks
136 |
137 | commodity news is an English dataset; FewFC and ccf_law are Chinese datasets
138 |
139 | EET(Event Type Extraction)
140 |
141 | | Model | commodity news | FewFC | ccf_law | ZH Average |
142 | | ------ | ------ | ------ | ------ | ------ |
143 | | ChatGPT 3.5 | 1.41 | 16.15 | 0 | 8.08 |
144 | | UIE | - | 50.23 | 2.16 | 26.20 |
145 | |InstructUIE| **23.26** | - | - | - |
146 | | YAYI-UIE | 12.45 | **81.28** | **12.87** | **47.08**|
147 |
148 | EEA(Event Arguments Extraction)
149 |
150 | | Model | commodity news | FewFC | ccf_law | ZH Average |
151 | | ------ | ------ | ------ | ------ | ------ |
152 | | ChatGPT 3.5 | 8.6 | 44.4 | 44.57 | 44.49 |
153 | | UIE | - | 43.02 | **60.85** | 51.94 |
154 | |InstructUIE| **21.78** | - | - | - |
155 | | YAYI-UIE | 19.74 | **63.06** | 59.42 | **61.24** |
156 |
157 |
158 | 
159 |
160 | ## Terms and Conditions
161 | #### Limitations
162 |
163 | The SFT model, trained using the data and the base model, still faces the following issues:
164 |
165 | 1. The information extracted may lead to factually incorrect answers.
166 | 2. It struggles to effectively discern harmful instructions, potentially resulting in hazardous statements.
167 | 3. The model's extraction capability needs improvement in scenarios involving paragraph-level texts.
168 |
169 | #### Disclaimer
170 | Given the limitations of the model outlined above, we require developers to use the code, data, models, and any derivatives generated from this project solely for research
171 | purposes. They must not be used for commercial purposes or for other applications that could harm society. Users should be careful in discerning and utilizing content generated
172 | by YAYI-UIE, and should avoid distributing harmful content on the internet. The spreader bears sole responsibility for any adverse consequences.
173 |
174 | This project is intended only for research purposes. The project developers are not liable for any harm or loss resulting from the use of this project, including but not
175 | limited to data, models, and code. For more details, please refer to the disclaimer.
176 |
177 | #### Open Source License
178 | The code and data in this project are open-sourced under the [Apache-2.0](./LICENSE.txt) license. Use of the YAYI-UIE model or its derivatives must adhere to [Baichuan2](https://github.com/baichuan-inc/Baichuan2)'s community and commercial model license.
179 |
180 | ## Updates
181 | - [2023/12/15] YAYI-UIE is released and open-sourced.
182 |
183 | ## Reference
184 | - The training code references the [YAYI](https://github.com/wenge-research/YAYI/blob/main/training/trainer.py) project and the Huggingface [transformers](https://github.com/huggingface/transformers) library;
185 | - The open-source version is fine-tuned from [Baichuan2-13B](https://github.com/baichuan-inc/Baichuan2);
186 | - Distributed training uses Microsoft [DeepSpeed](https://github.com/microsoft/deepspeed) and the [ZeRO stage 2](https://huggingface.co/docs/transformers/main_classes/deepspeed#zero2-config) config from the Huggingface transformers documentation;
187 | - We sincerely appreciate the support provided by the following open-source projects: [InstructUIE](https://github.com/BeyonderXX/InstructUIE/tree/master); [Baichuan2-Base](https://github.com/baichuan-inc/Baichuan2); [InstructIE](https://github.com/zjunlp/DeepKE/tree/main/example/llm/InstructKGC); [KnowLM](https://github.com/zjunlp/KnowLM/tree/main)
188 |
189 | ## Citation
190 | If you are using the resource for your work, please cite our paper:
191 | ```
192 | @article{YAYI-UIE,
193 |   author  = {Xinglin Xiao and Yijie Wang and Nan Xu and Yuqi Wang and Hanxuan Yang and Minzheng Wang and Yin Luo and Lei Wang and Wenji Mao and Dajun Zeng},
194 |   title   = {YAYI-UIE: A Chat-Enhanced Instruction Tuning Framework for Universal Information Extraction},
195 |   journal = {arXiv preprint arXiv:2312.15548},
196 |   year    = {2023}
197 | }
198 | ```
199 |
200 | ## Star History
201 | [](https://star-history.com/#wenge-research/YAYI-UIE&Date)
202 |
--------------------------------------------------------------------------------
/assets/0shot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wenge-research/YAYI-UIE/603ea82f1a4cb2322321285dae4a2f4bd4bc8d2e/assets/0shot.png
--------------------------------------------------------------------------------
/assets/YAYI-UIE.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wenge-research/YAYI-UIE/603ea82f1a4cb2322321285dae4a2f4bd4bc8d2e/assets/YAYI-UIE.png
--------------------------------------------------------------------------------
/assets/data-dist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wenge-research/YAYI-UIE/603ea82f1a4cb2322321285dae4a2f4bd4bc8d2e/assets/data-dist.png
--------------------------------------------------------------------------------
/assets/test:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/assets/yayi_dark_small.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wenge-research/YAYI-UIE/603ea82f1a4cb2322321285dae4a2f4bd4bc8d2e/assets/yayi_dark_small.png
--------------------------------------------------------------------------------
/assets/zh-0shot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wenge-research/YAYI-UIE/603ea82f1a4cb2322321285dae4a2f4bd4bc8d2e/assets/zh-0shot.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | absl-py==1.4.0
2 | accelerate==0.21.0
3 | aiofiles==23.2.1
4 | aiohttp==3.8.6
5 | aiosignal==1.3.1
6 | altair==5.1.2
7 | annotated-types==0.5.0
8 | anyio==3.7.1
9 | asttokens==2.2.1
10 | async-timeout==4.0.3
11 | attrs==23.1.0
12 | backcall==0.2.0
13 | cachetools==4.2.4
14 | certifi==2023.7.22
15 | charset-normalizer==3.3.0
16 | click==8.1.6
17 | cmake==3.27.0
18 | comm==0.1.4
19 | contourpy==1.1.0
20 | cycler==0.12.1
21 | dataclasses-json==0.5.14
22 | datasets==1.17.0
23 | debugpy==1.6.7
24 | decorator==5.1.1
25 | deepspeed==0.10.0
26 | dill==0.3.7
27 | docker-pycreds==0.4.0
28 | exceptiongroup==1.1.3
29 | executing==1.2.0
30 | fairscale==0.4.5
31 | fastapi==0.103.2
32 | ffmpy==0.3.1
33 | filelock==3.12.2
34 | fonttools==4.42.0
35 | frozenlist==1.4.0
36 | fsspec==2023.6.0
37 | gensim==4.3.1
38 | gitdb==4.0.10
39 | GitPython==3.1.32
40 | google-auth==2.22.0
41 | google-auth-oauthlib==1.0.0
42 | gradio==3.50.0
43 | gradio_client==0.6.1
44 | greenlet==2.0.2
45 | grpcio==1.56.2
46 | h11==0.14.0
47 | hjson==3.1.0
48 | httpcore==0.18.0
49 | httptools==0.6.0
50 | httpx==0.25.0
51 | huggingface-hub==0.16.4
52 | idna==3.4
53 | importlib-metadata==6.8.0
54 | importlib-resources==6.0.1
55 | ipykernel==6.25.0
56 | ipython==8.12.2
57 | jedi==0.19.0
58 | jieba==0.42.1
59 | Jinja2==3.1.2
60 | joblib==1.3.1
61 | jsonschema==4.19.1
62 | jsonschema-specifications==2023.7.1
63 | jupyter_client==8.3.0
64 | jupyter_core==5.3.1
65 | kiwisolver==1.4.5
66 | langchain==0.0.278
67 | langsmith==0.0.32
68 | lit==16.0.6
69 | Markdown==3.4.4
70 | MarkupSafe==2.1.3
71 | marshmallow==3.20.1
72 | matplotlib==3.7.3
73 | matplotlib-inline==0.1.6
74 | mpmath==1.3.0
75 | msgpack==1.0.7
76 | multidict==6.0.4
77 | multiprocess==0.70.15
78 | mypy-extensions==1.0.0
79 | nest-asyncio==1.5.7
80 | networkx==3.1
81 | ninja==1.11.1
82 | nltk==3.8.1
83 | numexpr==2.8.5
84 | numpy==1.24.4
85 | nvidia-cublas-cu11==11.10.3.66
86 | nvidia-cuda-cupti-cu11==11.7.101
87 | nvidia-cuda-nvrtc-cu11==11.7.99
88 | nvidia-cuda-runtime-cu11==11.7.99
89 | nvidia-cudnn-cu11==8.5.0.96
90 | nvidia-cufft-cu11==10.9.0.58
91 | nvidia-curand-cu11==10.2.10.91
92 | nvidia-cusolver-cu11==11.4.0.1
93 | nvidia-cusparse-cu11==11.7.4.91
94 | nvidia-nccl-cu11==2.14.3
95 | nvidia-nvtx-cu11==11.7.91
96 | oauthlib==3.2.2
97 | openai==0.28.0
98 | orjson==3.9.9
99 | packaging==23.1
100 | pandas==2.0.3
101 | parso==0.8.3
102 | pathtools==0.1.2
103 | pexpect==4.8.0
104 | pickleshare==0.7.5
105 | Pillow==10.0.0
106 | pkgutil_resolve_name==1.3.10
107 | platformdirs==3.10.0
108 | promise==2.3
109 | prompt-toolkit==3.0.39
110 | protobuf==3.20.3
111 | psutil==5.9.5
112 | ptyprocess==0.7.0
113 | pure-eval==0.2.2
114 | py-cpuinfo==9.0.0
115 | pyarrow==12.0.1
116 | pyasn1==0.5.0
117 | pyasn1-modules==0.3.0
118 | pydantic==1.10.12
119 | pydantic_core==2.4.0
120 | pydub==0.25.1
121 | Pygments==2.15.1
122 | pyparsing==3.1.1
123 | python-dateutil==2.8.2
124 | python-dotenv==1.0.0
125 | python-multipart==0.0.6
126 | pytz==2023.3
127 | PyYAML==6.0.1
128 | pyzmq==25.1.0
129 | ray==2.7.0
130 | referencing==0.30.2
131 | regex==2023.6.3
132 | requests==2.31.0
133 | requests-oauthlib==1.3.1
134 | rouge-score==0.1.2
135 | rpds-py==0.10.4
136 | rsa==4.9
137 | safetensors==0.4.0
138 | scipy==1.10.1
139 | semantic-version==2.10.0
140 | sentencepiece==0.1.96
141 | sentry-sdk==1.29.2
142 | shortuuid==1.0.11
143 | six==1.16.0
144 | smart-open==6.3.0
145 | smmap==5.0.0
146 | sniffio==1.3.0
147 | SQLAlchemy==2.0.20
148 | stack-data==0.6.2
149 | starlette==0.27.0
150 | sympy==1.12
151 | tenacity==8.2.3
152 | tensorboard==2.14.0
153 | tensorboard-data-server==0.7.1
154 | termcolor==2.3.0
155 | threadpoolctl==3.2.0
156 | tokenizers==0.13.3
157 | toolz==0.12.0
158 | torch==2.0.1
159 | tornado==6.3.2
160 | tqdm==4.66.1
161 | traitlets==5.9.0
162 | transformers==4.33.1
163 | transformers-stream-generator==0.0.4
164 | triton==2.0.0
165 | typing_extensions==4.7.1
166 | typing-inspect==0.9.0
167 | tzdata==2023.3
168 | urllib3==1.26.17
169 | uvicorn==0.23.2
170 | uvloop==0.17.0
171 | vllm==0.1.7
172 | wandb==0.12.10
173 | watchfiles==0.20.0
174 | wcwidth==0.2.6
175 | websockets==11.0.3
176 | Werkzeug==2.3.6
177 | xformers==0.0.22
178 | XlsxWriter==3.1.2
179 | xxhash==3.3.0
180 | yarl==1.9.2
181 | yaspin==2.3.0
182 | zipp==3.16.2
183 |
--------------------------------------------------------------------------------