├── img
│   ├── NLG.png
│   ├── rag.png
│   ├── sft.png
│   ├── cblue.png
│   ├── retrieve.png
│   └── zero-shot.png
├── README_zh.md
└── README.md
/img/NLG.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FreedomIntelligence/Huatuo-26M/HEAD/img/NLG.png
--------------------------------------------------------------------------------
/img/rag.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FreedomIntelligence/Huatuo-26M/HEAD/img/rag.png
--------------------------------------------------------------------------------
/img/sft.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FreedomIntelligence/Huatuo-26M/HEAD/img/sft.png
--------------------------------------------------------------------------------
/img/cblue.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FreedomIntelligence/Huatuo-26M/HEAD/img/cblue.png
--------------------------------------------------------------------------------
/img/retrieve.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FreedomIntelligence/Huatuo-26M/HEAD/img/retrieve.png
--------------------------------------------------------------------------------
/img/zero-shot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FreedomIntelligence/Huatuo-26M/HEAD/img/zero-shot.png
--------------------------------------------------------------------------------
/README_zh.md:
--------------------------------------------------------------------------------
1 | # Huatuo-26M Dataset
2 |
3 |
4 |
5 | 📃 Paper • 🤗 Huatuo-Lite • 🤗 huatuo_encyclopedia_qa • 🤗 knowledge_graph_qa • 🤗 huatuo_consultation_qa
6 |
中文 | English
7 |
8 |
9 |
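10 | ## 👩🏻⚕️Introduction
11 |
12 | - Huatuo-26M is the largest Chinese medical question-answering dataset to date. It contains more than 26 million high-quality medical Q&A pairs covering diseases, symptoms, treatments, drug information, and more.
13 | - Huatuo-Lite is a refined dataset derived from Huatuo-26M through multiple rounds of filtering and rewriting. It contains 180,000 high-quality medical Q&A pairs and adds two extra data dimensions: **hospital department** and **related diseases**.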
14 |
15 |
16 | ## 📚Data Content
17 |
18 | The Huatuo-26M dataset mainly consists of:
19 |
20 | - Online medical encyclopedias [huatuo_encyclopedia_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_encyclopedia_qa)
21 | - Medical knowledge graphs [huatuo_knowledge_graph_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_knowledge_graph_qa)
22 | - Public online medical Q&A forums (answers are provided as URLs) [huatuo_consultation_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_consultation_qa)
23 | - A streamlined version, Huatuo-Lite [Huatuo-Lite](https://huggingface.co/datasets/FreedomIntelligence/Huatuo26M-Lite)
24 |
25 |
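26 | Each Q&A pair in the dataset contains the following fields:
27 |
28 | - Question: description of the question
29 | - Answer: the answer from a doctor or expert
30 | - The Huatuo-Lite dataset additionally has **hospital department** and **related diseases** fields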
31 |
32 |
33 |
34 | Below is the Huatuo test set used in our paper, built by randomly sampling data from multiple sources.
35 |
36 | - Testdatasets: [huatuo26M-testdatasets](https://huggingface.co/datasets/FreedomIntelligence/huatuo26M-testdatasets)
37 |
38 |
39 |
40 | ## 🚀Quick Start
41 |
42 | To start using the Huatuo-26M dataset, follow the steps below:
43 |
44 | ```python
45 | import datasets
46 | # part 1
47 | knowledge_graph_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_knowledge_graph_qa')
48 | # part 2
49 | encyclopedia_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_encyclopedia_qa')
50 | # part 3 (only url)
51 | consultation_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_consultation_qa')
52 | # Huatuo-Lite
53 | lite = datasets.load_dataset('FreedomIntelligence/Huatuo26M-Lite')
54 |
55 | # testdatasets (6k)
56 | huatuo_testdatasets = datasets.load_dataset('FreedomIntelligence/huatuo26M-testdatasets')
57 | ```
58 |
59 |
60 |
61 | ## 👩🏻🔬Experiment Record
62 |
63 | ### Benchmark
64 |
65 | - Retrieval evaluation:
67 |
68 |
69 |
70 |
71 |
72 | - Answer generation evaluation:
73 |
75 |
76 |
77 |
78 |
79 |
80 | ### Application
81 |
82 | - Zero-shot transfer to other QA datasets:
83 |
85 |
86 |
87 |
88 |
89 |
90 | - As external knowledge for RAG:
91 |
93 |
94 |
95 |
96 |
97 |
98 |
99 | - As pre-training data for language models (LM):
100 |
102 |
103 |
104 |
105 |
106 |
107 | - As fine-tuning data for medical large language models (LLM):
109 |
110 |
111 |
112 |
113 | ## 🚁License
114 |
115 | The Huatuo-26M dataset is released under the Apache 2.0 license. Before using it, make sure you have read and agreed to the license terms.
116 |
117 |
118 |
119 | ## 📱Contact Us
120 |
121 | If you have any questions or need help, feel free to reach us by email ([xidongw@163.com](mailto:xidongw@163.com)) or in the Issues section.
122 |
123 | ------
124 |
125 |
126 |
127 | ## 😁Citation
128 |
129 | ```
130 | @misc{li2023huatuo26m,
131 | title={Huatuo-26M, a Large-scale Chinese Medical QA Dataset},
132 | author={Jianquan Li and Xidong Wang and Xiangbo Wu and Zhiyi Zhang and Xiaolong Xu and Jie Fu and Prayag Tiwari and Xiang Wan and Benyou Wang},
133 | year={2023},
134 | eprint={2305.01526},
135 | archivePrefix={arXiv},
136 | primaryClass={cs.CL}
137 | }
138 | ```
139 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Huatuo-26M
2 |
3 |
4 | 📃 Paper • 🤗 Huatuo-Lite • 🤗 huatuo_encyclopedia_qa • 🤗 knowledge_graph_qa • 🤗 huatuo_consultation_qa
5 |
中文 | English
6 |
7 |
8 | ## 👩🏻⚕Introduction
9 |
10 | - Huatuo-26M is currently the largest Chinese medical question-and-answer dataset. This dataset contains over 26 million high-quality medical Q&A pairs, covering various aspects such as diseases, symptoms, treatment methods, and drug information.
11 | - Huatuo-Lite is a refined dataset derived from Huatuo-26M through multiple rounds of filtering and rewriting. It contains 180,000 high-quality medical Q&A pairs and adds two extra data dimensions: **Hospital Department** and **Related Diseases**.
12 |
13 |
14 | ## 📚Data Content
15 |
16 | The Huatuo-26M dataset is collected and integrated from multiple sources, including:
17 |
18 | - Online Medical Encyclopedia [huatuo_encyclopedia_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_encyclopedia_qa)
19 | - Online Medical Knowledge Bases [huatuo_knowledge_graph_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_knowledge_graph_qa)
20 | - Online Medical Consultation Records (answers are provided as URLs) [huatuo_consultation_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_consultation_qa)
21 | - Streamlined version [Huatuo-Lite](https://huggingface.co/datasets/FreedomIntelligence/Huatuo26M-Lite)
22 |
23 |
24 | Each question-answer pair in the dataset contains the following fields:
25 |
26 | - questions: description of the question
27 | - answers: the answer from a doctor or expert
28 | - Huatuo-Lite dataset also includes **Hospital Department** and **Related Diseases** fields
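
As a rough illustration of the record layout described above, the sketch below models one Q&A pair in Python. The class name `QAPair` and the attribute names `department` and `disease` are hypothetical and chosen only for this example, and the sample strings are placeholders; the dataset cards on Hugging Face are the authority on actual column names and types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAPair:
    """One question-answer pair; class and attribute names are illustrative only."""
    question: str
    answer: str
    # Filled in only for Huatuo-Lite style records (hypothetical attribute names).
    department: Optional[str] = None
    disease: Optional[str] = None

    def is_lite(self) -> bool:
        # Treat a record as Huatuo-Lite style when both extra fields are present.
        return self.department is not None and self.disease is not None


# Placeholder records, not taken from the dataset.
pair = QAPair(question="example question", answer="example answer")
lite_pair = QAPair("example question", "example answer",
                   department="example department", disease="example disease")
```

Keeping the two extra fields optional lets the same type describe records from both the full dataset and Huatuo-Lite.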
29 |
30 |
31 | The test set used in the paper is listed below. It was built by randomly sampling data from multiple sources.
32 |
33 | - Testdatasets: [huatuo26M-testdatasets](https://huggingface.co/datasets/FreedomIntelligence/huatuo26M-testdatasets)
34 |
35 |
36 |
37 | ## 🤖Data Usage
38 |
39 | The Huatuo-26M dataset can be used for a variety of AI research and applications in the medical field, such as:
40 |
41 | - Natural Language Processing: Including but not limited to Q&A systems, text classification, sentiment analysis, etc.
42 | - Machine Learning model training: Such as disease prediction, personalized treatment recommendation, etc.
43 | - AI applications in the medical field: Such as intelligent diagnosis systems, medical consultation chatbots, etc.
44 |
45 |
46 | ## 🚀Quick Start
47 |
48 | To start using the Huatuo-26M dataset, you can follow the steps below:
49 |
50 | ```python
51 | import datasets
52 | # part 1
53 | knowledge_graph_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_knowledge_graph_qa')
54 | # part 2
55 | encyclopedia_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_encyclopedia_qa')
56 | # part 3 (only url)
57 | consultation_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_consultation_qa')
58 |
59 | # testdatasets (6k)
60 | huatuo_testdatasets = datasets.load_dataset('FreedomIntelligence/huatuo26M-testdatasets')
61 | ```
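
When no `split` argument is passed, `datasets.load_dataset` typically returns a `DatasetDict`, a dict keyed by split name. The sketch below shows one way to check how many rows each split holds. The helper name `split_sizes` is hypothetical, and a plain dict of made-up rows stands in for a loaded dataset so that the example runs offline.

```python
from typing import Dict, Mapping, Sized


def split_sizes(dataset_dict: Mapping[str, Sized]) -> Dict[str, int]:
    """Return the number of rows in each split of a DatasetDict-like mapping."""
    return {name: len(rows) for name, rows in dataset_dict.items()}


# Stand-in for a loaded dataset; the split name and rows are made up.
toy = {
    "train": [
        {"questions": "q1", "answers": "a1"},
        {"questions": "q2", "answers": "a2"},
    ]
}
print(split_sizes(toy))
```

The same helper should work on any of the objects loaded above, for example `split_sizes(encyclopedia_dataset)`, since each split supports `len()`.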
62 |
63 |
64 |
65 | ## 👩🏻🔬Experiment Record
66 |
67 | ### Benchmark
68 |
69 | - Retrieval Evaluation:
70 |
72 |
73 |
74 |
75 | - Answer Generation Evaluation:
76 |
78 |
79 |
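
This README does not state which metrics the retrieval benchmark reports. As an assumption for illustration only, the sketch below computes recall@k, a common retrieval metric: the fraction of queries whose gold answer appears among the top k retrieved candidates. The function name and the toy ids are hypothetical.

```python
from typing import Sequence


def recall_at_k(ranked_ids: Sequence[Sequence[int]],
                gold_ids: Sequence[int],
                k: int) -> float:
    """Fraction of queries whose gold answer id is among the top-k retrieved ids."""
    if not gold_ids:
        return 0.0
    hits = sum(
        1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k]
    )
    return hits / len(gold_ids)


# Toy rankings for three queries (best candidate first) and their gold ids.
ranked = [[3, 1, 2], [0, 2, 1], [1, 0, 3]]
gold = [1, 1, 2]
```

With these toy values, `recall_at_k(ranked, gold, 2)` counts only the first query as a hit, because the gold ids of the other two queries fall outside their top two candidates.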
80 |
81 | ### Application
82 |
83 | - Zero-shot transfer to other QA datasets:
84 |
86 |
87 |
88 |
89 |
90 | - As external knowledge for RAG:
91 |
93 |
94 |
95 |
96 |
97 | - As pre-training data for language model (LM):
98 |
100 |
101 |
102 |
103 |
104 | - As fine-tuning data for Medical LLM:
105 |
107 |
108 |
109 |
110 |
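
To make the RAG use case above more concrete, here is a minimal sketch of how stored Q&A pairs could be retrieved and prepended to a prompt. It is not the pipeline used in the paper: the character-overlap scoring is a deliberately crude stand-in for a real retriever, and the function names and prompt template are assumptions made for this example.

```python
from typing import List, Tuple


def char_overlap(a: str, b: str) -> int:
    """Number of distinct characters shared by two strings (a crude similarity)."""
    return len(set(a) & set(b))


def build_rag_prompt(query: str,
                     qa_pairs: List[Tuple[str, str]],
                     k: int = 2) -> str:
    """Prepend the k stored pairs whose question best overlaps the query."""
    ranked = sorted(
        qa_pairs, key=lambda qa: char_overlap(query, qa[0]), reverse=True
    )
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in ranked[:k])
    return f"{context}\n\nQ: {query}\nA:"


# Placeholder knowledge base; the strings are not taken from the dataset.
pairs = [("abc", "A1"), ("xyz", "A2")]
prompt = build_rag_prompt("abz", pairs, k=1)
```

In practice the overlap function would be swapped for a dense or sparse retriever, while the prompt-building step keeps the same shape.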
111 |
112 | ## 🚁License
113 |
114 | The Huatuo-26M dataset is licensed under Apache 2.0. Please make sure you have read and agreed to the license terms before using it.
115 |
116 |
117 | ## 📱Contact Us
118 |
119 | If you have any questions or need help, please feel free to ask us via email ([xidongw@163.com](mailto:xidongw@163.com)) or in the Issues section.
120 |
121 | ------
122 |
123 |
124 |
125 | ## 😁Citation
126 |
127 | ```
128 | @misc{li2023huatuo26m,
129 | title={Huatuo-26M, a Large-scale Chinese Medical QA Dataset},
130 | author={Jianquan Li and Xidong Wang and Xiangbo Wu and Zhiyi Zhang and Xiaolong Xu and Jie Fu and Prayag Tiwari and Xiang Wan and Benyou Wang},
131 | year={2023},
132 | eprint={2305.01526},
133 | archivePrefix={arXiv},
134 | primaryClass={cs.CL}
135 | }
136 | ```
137 |
--------------------------------------------------------------------------------