├── img
    ├── NLG.png
    ├── rag.png
    ├── sft.png
    ├── cblue.png
    ├── retrieve.png
    └── zero-shot.png
├── README_zh.md
└── README.md


/img/NLG.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FreedomIntelligence/Huatuo-26M/HEAD/img/NLG.png
--------------------------------------------------------------------------------
/img/rag.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FreedomIntelligence/Huatuo-26M/HEAD/img/rag.png
--------------------------------------------------------------------------------
/img/sft.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FreedomIntelligence/Huatuo-26M/HEAD/img/sft.png
--------------------------------------------------------------------------------
/img/cblue.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FreedomIntelligence/Huatuo-26M/HEAD/img/cblue.png
--------------------------------------------------------------------------------
/img/retrieve.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FreedomIntelligence/Huatuo-26M/HEAD/img/retrieve.png
--------------------------------------------------------------------------------
/img/zero-shot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FreedomIntelligence/Huatuo-26M/HEAD/img/zero-shot.png
--------------------------------------------------------------------------------
/README_zh.md:
--------------------------------------------------------------------------------
# Huatuo-26M Dataset

📃 Paper • 🤗 Huatuo-Lite • 🤗 huatuo_encyclopedia_qa • 🤗 knowledge_graph_qa • 🤗 huatuo_consultation_qa
中文 | English

## 👩🏻‍⚕️Introduction

- Huatuo-26M is the largest Chinese medical question-answering dataset to date. It contains over 26 million high-quality medical Q&A pairs covering diseases, symptoms, treatments, drug information, and more.
- Huatuo-Lite is a refined dataset derived from Huatuo-26M through multiple rounds of purification and rewriting. It contains 180,000 high-quality medical Q&A pairs and adds two data dimensions: **hospital department** and **related disease**.


## 📚Data Content

The Huatuo-26M dataset mainly comprises:

- Online medical encyclopedias [huatuo_encyclopedia_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_encyclopedia_qa)
- Medical knowledge graphs [huatuo_knowledge_graph_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_knowledge_graph_qa)
- Public online medical Q&A forums (answers provided as URLs) [huatuo_consultation_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_consultation_qa)
- Streamlined version, Huatuo-Lite [Huatuo-Lite](https://huggingface.co/datasets/FreedomIntelligence/Huatuo26M-Lite)


Each Q&A pair in the dataset contains the following fields:

- Question: the question text
- Answer: the answer from a doctor or expert
- The Huatuo-Lite dataset additionally has **hospital department** and **related disease** fields



Below is the Huatuo test set used in our paper, randomly sampled from the multiple sources.

- Testdatasets: [huatuo26M-testdatasets](https://huggingface.co/datasets/FreedomIntelligence/huatuo26M-testdatasets)



## 🚀Quick Start

To get started with the Huatuo-26M dataset, load its parts as follows:

```python
import datasets

# Part 1: knowledge-graph QA
knowledge_graph_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_knowledge_graph_qa')
# Part 2: encyclopedia QA
encyclopedia_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_encyclopedia_qa')
# Part 3: consultation QA (answers are URLs only)
consultation_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_consultation_qa')
# Huatuo-Lite
lite = datasets.load_dataset("FreedomIntelligence/Huatuo26M-Lite")

# Test datasets (6k)
huatuo_testdatasets = datasets.load_dataset('FreedomIntelligence/huatuo26M-testdatasets')
```



## 👩🏻‍🔬Experiment Record

### Benchmark

- Retrieval evaluation:

[Figure: retrieval evaluation results]

- Answer generation evaluation:

[Figure: answer generation evaluation results]

### Application

- Zero-shot transfer to other QA datasets:

[Figure: zero-shot transfer results]

- As external knowledge for RAG:

[Figure: RAG results]

- As pre-training data for language models (LMs):

[Figure: language model pre-training results]

- As fine-tuning data for medical large language models (LLMs):

[Figure: medical LLM fine-tuning results]

## 🚁License

The Huatuo-26M dataset is released under the Apache 2.0 license. Please make sure you have read and agreed to the license terms before using it.



## 📱Contact Us

If you have any questions or need help, feel free to contact us by email ([xidongw@163.com](mailto:xidongw@163.com)) or open an issue in the Issues section.

------



## 😁Citation

```
@misc{li2023huatuo26m,
      title={Huatuo-26M, a Large-scale Chinese Medical QA Dataset},
      author={Jianquan Li and Xidong Wang and Xiangbo Wu and Zhiyi Zhang and Xiaolong Xu and Jie Fu and Prayag Tiwari and Xiang Wan and Benyou Wang},
      year={2023},
      eprint={2305.01526},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Huatuo-26M

📃 Paper • 🤗 Huatuo-Lite • 🤗 huatuo_encyclopedia_qa • 🤗 knowledge_graph_qa • 🤗 huatuo_consultation_qa
中文 | English

## 👩🏻‍⚕Introduction

- Huatuo-26M is currently the largest Chinese medical question-answering dataset. It contains over 26 million high-quality medical Q&A pairs covering diseases, symptoms, treatment methods, drug information, and more.
- Huatuo-Lite is a refined and optimized dataset derived from Huatuo-26M through multiple rounds of purification and rewriting. It offers more data dimensions and higher data quality.


## 📚Data Content

The Huatuo-26M dataset is collected and integrated from multiple sources, including:

- Online Medical Encyclopedia [huatuo_encyclopedia_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_encyclopedia_qa)
- Online Medical Knowledge Bases [huatuo_knowledge_graph_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_knowledge_graph_qa)
- Online Medical Consultation Records (answers in the form of URLs) [huatuo_consultation_qa](https://huggingface.co/datasets/FreedomIntelligence/huatuo_consultation_qa)
- Streamlined version [Huatuo-Lite](https://huggingface.co/datasets/FreedomIntelligence/Huatuo26M-Lite)


Each question-answer pair in the dataset contains the following fields:

- questions: the question text
- answers: answers from doctors or experts
- The Huatuo-Lite dataset also includes **Hospital Department** and **Related Diseases** fields


The following is the Huatuo test set used in the paper, randomly sampled from the multiple sources.

- Testdatasets: [huatuo26M-testdatasets](https://huggingface.co/datasets/FreedomIntelligence/huatuo26M-testdatasets)



## 🤖Data Usage

The Huatuo-26M dataset can be used for a variety of AI research and applications in the medical field, such as:

- Natural language processing: including but not limited to Q&A systems, text classification, and sentiment analysis.
- Machine learning model training: such as disease prediction and personalized treatment recommendation.
- AI applications in the medical field: such as intelligent diagnosis systems and medical consultation chatbots.


## 🚀Quick Start

To start using the Huatuo-26M dataset, load each part as follows:

```python
import datasets

# Part 1: knowledge-graph QA
knowledge_graph_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_knowledge_graph_qa')
# Part 2: encyclopedia QA
encyclopedia_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_encyclopedia_qa')
# Part 3: consultation QA (answers are URLs only)
consultation_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_consultation_qa')

# Test datasets (6k)
huatuo_testdatasets = datasets.load_dataset('FreedomIntelligence/huatuo26M-testdatasets')
```



## 👩🏻‍🔬Experiment Record

### Benchmark

- Retrieval Evaluation:

[Figure: retrieval evaluation results]

- Answer Generation Evaluation:

[Figure: answer generation evaluation results]

### Application

- Zero-shot transfer to other QA datasets:

[Figure: zero-shot transfer results]

- As external knowledge for RAG:

[Figure: RAG results]

- As pre-training data for language models (LMs):

[Figure: language model pre-training results]

- As fine-tuning data for medical LLMs:

[Figure: medical LLM fine-tuning results]

## 🚁License

The Huatuo-26M dataset is licensed under Apache 2.0. Please make sure you have read and agreed to the license terms before using it.


## 📱Contact Us

If you have any questions or need help, please feel free to contact us by email ([xidongw@163.com](mailto:xidongw@163.com)) or open an issue in the Issues section.

------



## 😁Citation

```
@misc{li2023huatuo26m,
      title={Huatuo-26M, a Large-scale Chinese Medical QA Dataset},
      author={Jianquan Li and Xidong Wang and Xiangbo Wu and Zhiyi Zhang and Xiaolong Xu and Jie Fu and Prayag Tiwari and Xiang Wan and Benyou Wang},
      year={2023},
      eprint={2305.01526},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

--------------------------------------------------------------------------------