├── README.md
├── README_Chinese.md
├── data
│   ├── init
│   ├── 题库@100-A1+A2_CQ-知识点标注换行.json
│   ├── 题库@100-A3+A4_CQ-知识点标注换行.json
│   └── 题库@100-B1_CQ-知识点标注换行.json
├── evaluate
│   ├── A12_bench.py
│   ├── A3-B1_bench.py
│   ├── bench_function.py
│   ├── correct_analyse.py
│   └── init
├── image
│   ├── few-shot.png
│   ├── init
│   └── zero-shot.png
├── models
│   ├── Openai.py
│   └── init
└── prompt
    ├── A1-2_prompt.json
    ├── A3-4_prompt.json
    ├── B1_prompt.json
    └── init

/README.md:
--------------------------------------------------------------------------------
1 | # TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine
2 | Repo for TCMBench, the first comprehensive benchmark for evaluating LLMs in TCM.
3 | 
4 | The [paper](https://arxiv.org/abs/2406.01126) has been submitted to XXX.
5 | 
6 | ❗ We firmly oppose any form of plagiarism (🙃 smile.jpg).
7 | 
8 | [**English**](./README.md) | [**中文**](./README_Chinese.md)
9 | 
10 | 
14 | 
15 | ## ⚡ Introduction
16 | Large Language Models (LLMs) excel at various natural language processing tasks but lack dedicated benchmarks for traditional Chinese medicine (TCM). To fill this gap, we introduce **TCMBench**, a comprehensive benchmark for evaluating LLMs in TCM.
17 | 
18 | 
19 | ## 📚 Dataset: TCM-ED
20 | The TCM Licensing Examination (TCMLE) assesses whether applicants possess the professional knowledge and skills required to practice as TCM physicians. We therefore collected 5,473 representative practice questions. The collected data contains no personal information; we focused on selecting instances that fully reflect and represent the theoretical knowledge and practical skills of TCM. The multiple-choice questions in TCMLE are divided into three categories: A1/A2, A3, and B1. An A1/A2 question consists of a single question with five options. An A3 question includes multiple sub-questions that share a typical clinical case, simulating actual clinical scenarios. A B1 question also contains multiple sub-questions, but they share the same five options. Consequently, A3 and B1 questions establish logical correlations in clinical practice and knowledge, respectively. For detailed descriptions and examples of the three question types, please refer to Section A of the appendix.
21 | 
22 | **🔎 Question types**:
23 | - 🚀 **The single-sentence best-choice question (A1) and case-summary best-choice question (A2) type**: consists of a question stem and five options, exactly one of which is correct.
24 | - 🚀 **The case-group best-choice question (A3) type**: the stem presents a patient-centered case, followed by multiple sub-questions, each offering five options with one correct answer. It primarily centers on clinical applications.
25 | - 🚀 **The standard compatibility question (B1) type**: multiple sub-questions share the same five options; each option may be chosen zero, one, or multiple times, and each sub-question has exactly one correct answer.
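The statistics in the table below can be reproduced by counting entries in the data files (the released files currently contain one example per question type; the full set is being released gradually). A minimal sketch, assuming the JSON schema shown in the Data Processing section below:

```python
import json

def count_questions(path):
    """Count questions and sub-questions in a TCM-ED data file."""
    with open(path, "r", encoding="utf-8") as f:
        examples = json.load(f)["example"]
    n_questions = len(examples)
    # A3/B1 items store their sub-questions as a list under "question";
    # A1/A2 items store a single question string instead.
    n_sub = sum(len(e["question"]) for e in examples
                if isinstance(e["question"], list))
    return n_questions, n_sub

for path in [
    "data/题库@100-A1+A2_CQ-知识点标注换行.json",
    "data/题库@100-A3+A4_CQ-知识点标注换行.json",
    "data/题库@100-B1_CQ-知识点标注换行.json",
]:
    n, n_sub = count_questions(path)
    print(f"{path}: {n} question(s), {n_sub} sub-question(s)")
```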
26 | 
27 | 
28 | Statistics of TCM-ED:
29 | 
30 | | Question type | A1/A2 type | A3 type | B1 type |
31 | | ------------------ | -------------- | -------------- | -------------- |
32 | | **Number of questions** | 1600 | 198 | 1481 |
33 | | **Number of sub-questions** | \ | 642 | 3231 |
34 | 
35 | 
36 | 
37 | ## 👨‍⚕️ Data Processing
38 | 
39 | A1/A2 type:
40 | ```json
41 | {
42 |     "question": "《素问·咳论》:“五脏六腑皆令人咳”,但关系最密切的是( )。\nA.心肺\nB.肺肾\nC.肺脾\nD.肺胃\nE.肺大肠",
43 |     "answer": [
44 |         "D"
45 |     ],
46 |     "analysis": "根据《素问·咳论》“此皆聚于胃,关于肺,使人多涕唾而面浮肿气逆也”可知与五脏六腑皆令人咳关系最密切的脏腑为肺胃。手太阴肺经起于中焦,还循胃口,上膈属肺。寒凉饮食入胃,导致中焦寒,寒气循手太阴肺经上入于肺中,导致肺寒,肺为娇脏,不耐寒热,外内寒邪并聚于肺,则肺失宣降,肺气上逆发生咳嗽。因此答案选D。",
47 |     "knowledge_point": "中医经典",
48 |     "index": 8196,
49 |     "score": 1
50 | }
51 | ```
52 | A3 type:
53 | ```json
54 | {
55 |     "share_content": "刘×,男,46岁,刻下眩晕而见头重如蒙。胸闷恶心,食少多寐,苔白腻,脉濡滑。",
56 |     "question": [
57 |         {
58 |             "sub_question": "1).证属( )。\nA.肝阳上亢\nB.气血亏虚\nC.肾精不足\nD.痰浊中阻\nE.以上都不是\n",
59 |             "answer": [
60 |                 "D"
61 |             ],
62 |             "analysis": ""
63 |         },
64 |         {
65 |             "sub_question": "2).治法宜选( )。\nA.燥湿祛痰,健脾和胃\nB.补肾滋阴\nC.补肾助阳\nD.补养气血,健运脾胃\nE.平肝潜阳,滋养肝肾\n",
66 |             "answer": [
67 |                 "A"
68 |             ],
69 |             "analysis": ""
70 |         },
71 |         {
72 |             "sub_question": "3).方药宜选( )。\nA.右归丸\nB.左归丸\nC.半夏白术天麻汤\nD.归脾汤\nE.天麻钩藤饮\n",
73 |             "answer": [
74 |                 "C"
75 |             ],
76 |             "analysis": ""
77 |         }
78 |     ],
79 |     "knowledge_point": "中医内科学",
80 |     "index": 334,
81 |     "score": 1
82 | }
83 | ```
84 | B1 type:
85 | ```json
86 | {
87 |     "share_content": "(共用备选答案)\nA.化痰息风,健脾祛湿\nB.清肺化痰,散结排脓\nC.疏风宣肺,化痰止咳\nD.清热化痰,平肝息风\nE.润肺清热,理气化痰\n",
88 |     "question": [
89 |         {
90 |             "sub_question": "1).贝母瓜蒌散的功用是( )。",
91 |             "answer": [
92 |                 "E"
93 |             ],
94 |             "analysis": ""
95 |         },
96 |         {
97 |             "sub_question": "2).半夏白术天麻汤的功用是( )。",
98 |             "answer": [
99 |                 "A"
100 |             ],
101 |             "analysis": ""
102 |         }
103 |     ],
104 |     "knowledge_point": "方剂学",
105 |     "index": 1938,
106 |     "score": 1
107 | }
108 | ```
109 | 
110 | ## 🧐 Evaluation Pipeline
111 | 
112 | 
113 | | File Name | Description |
114 | | -------------------------- | -------------- |
115 | | /evaluate/A12_bench.py | Generates answers for A1/A2-type questions |
116 | | /evaluate/A3-B1_bench.py | Generates answers for A3/B1-type questions |
117 | | /evaluate/bench_function.py | Helper functions for testing |
118 | | /evaluate/correct_analyse.py | Accuracy metric |
119 | | /prompt/A1-2_prompt.json | Prompt for A1/A2-type questions |
120 | | /prompt/A3-4_prompt.json | Prompt for A3-type questions |
121 | | /prompt/B1_prompt.json | Prompt for B1-type questions |
122 | | /models/Openai.py | Model API (e.g., OpenAI) |
123 | 
124 | 
125 | 
126 | 
127 | You can generate answers by calling a model API through [A12_bench.py](https://github.com/ywjawmw/ShenNong-TCM-Evaluation-BenchMark/blob/main/evaluate/A12_bench.py) or [A3-B1_bench.py](https://github.com/ywjawmw/ShenNong-TCM-Evaluation-BenchMark/blob/main/evaluate/A3-B1_bench.py). Each script loads a prompt file, instantiates a model API wrapper, queries the model, and exports the answers; a minimal sketch follows.
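The sketch below is simplified from `A12_bench.py`; it assumes you run it from the repository root (so that the relative paths resolve) and that `"your key"` is replaced with your own OpenAI key:

```python
import json
import sys

sys.path.append("evaluate")  # bench_function.py lives under /evaluate

from models.Openai import OpenaiAPI
from bench_function import export_distribute_json, export_union_json

openai_api_key = "your key"   # placeholder: your own OpenAI key
model_name = "gpt-4"          # or "gpt-3.5-turbo"
model_api = OpenaiAPI(openai_api_key, model_name=model_name)

# Each entry of the prompt file describes one question type to evaluate.
with open("prompt/A1-2_prompt.json", "r", encoding="utf-8") as f:
    examples = json.load(f)["examples"]

for example in examples:
    # Query the model question by question and save partial result files ...
    export_distribute_json(
        model_api, model_name, "data",
        example["keyword"],        # selects data/<keyword>.json
        example["prefix_prompt"],  # instruction prefix for this question type
        example["type"],           # "A1+A2"
        [""],                      # few-shot examples (unused for A1/A2)
        parallel_num=1,
    )
    # ... then merge them into data/<model_name>_<keyword>.json
    export_union_json("data", model_name, example["keyword"],
                      example["prefix_prompt"], example["type"])
```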
128 | Before running, set the proxy (if necessary) and your OpenAI key inside the script, then launch it:
129 | ```
130 | # First, set the proxy if you need one:
131 | os.environ['HTTPS_PROXY'] = "your proxy"
132 | # Next, fill in your OpenAI key:
133 | openai_api_key = "your key"
134 | # Then run:
135 | python A12_bench.py
136 | python A3-B1_bench.py
137 | ```
138 | 
139 | Finally, you can run [correct_analyse.py](https://github.com/ywjawmw/ShenNong-TCM-Evaluation-BenchMark/blob/main/evaluate/correct_analyse.py) to get the accuracy.
140 | ```
141 | python correct_analyse.py
142 | ```
143 | 
144 | 
145 | 👨‍⚕️ Welcome to follow our open-source TCM LLM project **ShenNong-TCM**; this is the first version:
146 | 
147 | - 🚀 [ShenNong-TCM](https://github.com/michael-wzhu/ShenNong-TCM-LLM): To promote the development and adoption of LLMs in TCM and to enhance their TCM knowledge and ability to answer medical consultations, we launched **ShenNong**, a large-scale language model for TCM, built on the [TCM prompt dataset: ShenNong_TCM_Dataset](https://huggingface.co/datasets/michaelwzhu/ShenNong_TCM_Dataset).
148 | 
149 | And our other open-source projects for medical LLMs:
150 | - 🚀 [Intelligent TCM Inheritance and Innovation Assistance Platform](https://github.com/ywjawmw/AI4TCM-Platform);
151 | - 🚀 [ChatMed-Consult](https://huggingface.co/michaelwzhu/ChatMed-Consult);
152 | - 🚀 [PromptCBLUE](https://github.com/michael-wzhu/PromptCBLUE);
153 | 
154 | ## Acknowledgements
155 | 
156 | - [ChatGPT](https://openai.com/blog/chatgpt)
157 | - [ChatGLM](https://github.com/THUDM/ChatGLM-6B)
158 | - [GaoKao-Bench](https://github.com/OpenLMLab/GAOKAO-Bench)
159 | 
160 | 
161 | ## Citation
162 | 
163 | Please cite:
164 | 
165 | 
166 | ## Team Introduction
167 | 
168 | This project was completed by the Intelligent Knowledge Management and Service Team of the School of Computer Science and Technology, East China Normal University, under the guidance of Professor Xiaoling Wang.
169 | 
170 | Project members:
171 | - [ywjawmw](https://github.com/ywjawmw)
172 | - [michael-wzhu](https://github.com/michael-wzhu)
--------------------------------------------------------------------------------
/README_Chinese.md:
--------------------------------------------------------------------------------
1 | # TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine
2 | Repo for TCMBench(“ShuzhiQihuang” LLMs series, the first comprehensive benchmark for evaluating LLMs in TCM)
3 | 
4 | 论文初始版本已经公开,欢迎引用。❗ 拒绝一切抄袭行为(微笑.jpg)。
5 | 
6 | [**English**](./README.md) | [**中文**](./README_Chinese.md)
7 | 
8 | 
12 | 
13 | ## ⚡ 简介
14 | 为了进一步有效、准确地评估大模型在中医药领域的表现,我们建立了一个标准化、综合性的中医评测框架**TCMBench**。该评测框架充分考虑中医药领域的复杂性和专业性,涵盖多个方面,以确保大语言模型在真实场景下的实用性和适用性。
15 | 
16 | 
17 | ## 📚 数据集:TCM-ED
18 | 首先,我们构建了首个中医评测数据集TCM-ED。该数据集中的试题基于某中医执业医师资格考试题库,涵盖了中医基础理论、中医诊断学、中药学、方剂学、中医经典、中医内科学、中医外科学、中医妇科学、中医儿科学、针灸学、诊断学基础、内科学、传染病学、医学伦理学、卫生法规、其他等16个知识点,共3类选择题类型,每类题型都包含16个知识点的至多100道题,后续我们会逐步开放更多的测试题并扩充该数据集。通过收集、整理和标注这些数据,我们旨在提供一个全面、准确、具有代表性的中医测试基准,帮助评估和改进大语言模型在中医领域应用的性能。
19 | 
20 | **🔎 题目类型**:
21 | - 🚀 **最佳选择题或单项选择题(A型题)**:每道试题由一个题干与A、B、C、D、E 5个备选答案组成,题干在前,5个备选答案在后,其中只有一项是正确答案。A型题分为A1型题、A2型题2种题型:
22 |   - **单句型最佳选择题(A1型题)**:题干以论述题形式出现,或为叙述式,或为否定式。
23 |   - **病例摘要型最佳选择题(A2型题)**:以一个简要的病例作为题干。
24 |   - **病例组型最佳选择题(A3型题)**:题干开始叙述一个以患者为中心的临床情景,然后提出若干个(一个以上)相关问题,每个问题均与开始的临床情景有关,但测试要点不同,试题考察以临床应用为主。
25 | - 🚀 **标准配伍题(B1型题)**:每道试题由A、B、C、D、E 5个备选答案与2个或2个以上的题干组成,5个备选答案在前,题干在后。答题时,要求为每个题干选择一项作为正确答案。每个备选答案可以选用零次、一次或一次以上。
26 | 
27 | 💥 【注】:由于单句型最佳选择题和病例摘要型最佳选择题题目答案结构相似,因此在测试时归并为一类题型,标注为 **单题干单项最佳选择题(A1/A2型题)**。
28 | 
29 | TCM-ED 统计信息如下:
30 | 
31 | | 题目类型 | 单题干单项最佳选择题(A1/A2型题) | 病例组型最佳选择题(A3型题) | 标准配伍题(B1型题) |
32 | | ------------------ | -------------- | -------------- | -------------- |
33 | | **题目数量** | 1600 | 198 | 1481 |
34 | | **子问题** | \ | 642 | 3231 |
35 | 
36 | 
37 | 
38 | ## 👨‍⚕️ 数据处理
39 | 
40 | 将试题转换为结构化的测评数据,其数据格式如下所示:
41 | 单题干单项最佳选择题(A1/A2型题):
42 | ```json
43 | {
44 |     "question": "《素问·咳论》:“五脏六腑皆令人咳”,但关系最密切的是( )。\nA.心肺\nB.肺肾\nC.肺脾\nD.肺胃\nE.肺大肠",
45 |     "answer": [
46 |         "D"
47 |     ],
48 |     "analysis": "根据《素问·咳论》“此皆聚于胃,关于肺,使人多涕唾而面浮肿气逆也”可知与五脏六腑皆令人咳关系最密切的脏腑为肺胃。手太阴肺经起于中焦,还循胃口,上膈属肺。寒凉饮食入胃,导致中焦寒,寒气循手太阴肺经上入于肺中,导致肺寒,肺为娇脏,不耐寒热,外内寒邪并聚于肺,则肺失宣降,肺气上逆发生咳嗽。因此答案选D。",
49 |     "knowledge_point": "中医经典",
50 |     "index": 8196,
51 |     "score": 1
52 | }
53 | ```
54 | 病例组型最佳选择题(A3型题):
55 | ```json
56 | {
57 |     "share_content": "刘×,男,46岁,刻下眩晕而见头重如蒙。胸闷恶心,食少多寐,苔白腻,脉濡滑。",
58 |     "question": [
59 |         {
60 |             "sub_question": "1).证属( )。\nA.肝阳上亢\nB.气血亏虚\nC.肾精不足\nD.痰浊中阻\nE.以上都不是\n",
61 |             "answer": [
62 |                 "D"
63 |             ],
64 |             "analysis": ""
65 |         },
66 |         {
67 |             "sub_question": "2).治法宜选( )。\nA.燥湿祛痰,健脾和胃\nB.补肾滋阴\nC.补肾助阳\nD.补养气血,健运脾胃\nE.平肝潜阳,滋养肝肾\n",
68 |             "answer": [
69 |                 "A"
70 |             ],
71 |             "analysis": ""
72 |         },
73 |         {
74 |             "sub_question": "3).方药宜选( )。\nA.右归丸\nB.左归丸\nC.半夏白术天麻汤\nD.归脾汤\nE.天麻钩藤饮\n",
75 |             "answer": [
76 |                 "C"
77 |             ],
78 |             "analysis": ""
79 |         }
80 |     ],
81 |     "knowledge_point": "中医内科学",
82 |     "index": 334,
83 |     "score": 1
84 | }
85 | ```
86 | 标准配伍题(B1型题):
87 | ```json
88 | {
89 |     "share_content": "(共用备选答案)\nA.化痰息风,健脾祛湿\nB.清肺化痰,散结排脓\nC.疏风宣肺,化痰止咳\nD.清热化痰,平肝息风\nE.润肺清热,理气化痰\n",
90 |     "question": [
91 |         {
92 |             "sub_question": "1).贝母瓜蒌散的功用是( )。",
93 |             "answer": [
94 |                 "E"
95 |             ],
96 |             "analysis": ""
97 |         },
98 |         {
99 |             "sub_question": "2).半夏白术天麻汤的功用是( )。",
100 |             "answer": [
101 |                 "A"
102 |             ],
103 |             "analysis": ""
104 |         }
105 |     ],
106 |     "knowledge_point": "方剂学",
107 |     "index": 1938,
108 |     "score": 1
109 | }
110 | ```
111 | 
112 | ## 🧐 ShenNong-TCM-Evaluation
113 | 
114 | 我们为每个题型设计了不同的prompt,要求LLM回答题目并给出答案和分析。评测框架由如下部分组成:
115 | 
116 | | 文件名 | 说明 |
117 | | -------------------------- | -------------- |
118 | | /evaluate/A12_bench.py | 生成单题干单项最佳选择题答案 |
119 | | /evaluate/A3-B1_bench.py | 生成病例组型最佳选择题和标准配伍题答案 |
120 | | /evaluate/bench_function.py | 测试相关函数 |
121 | | /evaluate/correct_analyse.py | 计算准确率 |
122 | | /prompt/A1-2_prompt.json | 单题干单项最佳选择题(A1/A2型题)的指令文件 |
123 | | /prompt/A3-4_prompt.json | 病例组型最佳选择题(A3型题)的指令文件 |
124 | | /prompt/B1_prompt.json | 标准配伍题(B1型题)的指令文件 |
125 | | /models/Openai.py | 调用OpenAI接口 |
126 | 
127 | 
128 | 其中,指令文件的字段为:
129 | | 字段 | 说明 |
130 | | ------------------ | -------------- |
131 | | **type** | 题目类型 |
132 | | **keyword** | 数据集检索关键词 |
133 | | **prefix_prompt** | 指令信息 |
134 | 
135 | 你可以通过调用不同模型的API运行[A12_bench.py](https://github.com/ywjawmw/ShenNong-TCM-Evaluation-BenchMark/blob/main/evaluate/A12_bench.py)/[A3-B1_bench.py](https://github.com/ywjawmw/ShenNong-TCM-Evaluation-BenchMark/blob/main/evaluate/A3-B1_bench.py)来生成三类题型的答案。同时,我们在/models文件夹下提供了OpenAI的API封装,其他自定义的模型API可以放在此文件夹下进行调用。
136 | ```
137 | # 首先,若有必要,请设置代理:
138 | os.environ['HTTPS_PROXY'] = "your proxy"
139 | # 其次,将你的OpenAI Key填写到指定位置:
140 | openai_api_key = "your key"
141 | # 然后,通过设置不同的 model_type 和 model_name 来调用不同的模型,
142 | # 并使用以下命令运行:
143 | python A12_bench.py
144 | python A3-B1_bench.py
145 | ```
146 | 
147 | 最后,你可以运行[correct_analyse.py](https://github.com/ywjawmw/ShenNong-TCM-Evaluation-BenchMark/blob/main/evaluate/correct_analyse.py)来得到模型的准确率得分。
148 | ```
149 | python correct_analyse.py
150 | ```
151 | 
152 | 
153 | 
154 | ### 📋 测试结果
155 | | 模型 | 单题干单项最佳选择题(A1/A2型题) | 病例组型最佳选择题(A3型题) | 标准配伍题(B1型题) |
156 | | ------------------ | -------------- | -------------- | -------------- |
157 | | **LlaMa** | 0.0894 | 0.1075 | 0.1102 |
158 | | **ChatGLM** | 0.3581 | 0.4580 | 0.4556 |
159 | | **ChatGPT-3.5-turbo** | 0.4510 | 0.4657 | 0.4444 |
160 | | **ChatGPT-4** | **0.5819** | **0.6231** | **0.6011** |
161 | 
162 | 
163 | ### 📊 结果分析
164 | - ⚡️ LLM目前在中医领域存在常识性错误;
165 | ```json
166 | {
167 |     "index": 19,
168 |     "question": "济生肾气丸是在金匮肾气丸的基础上易干地黄为熟地黄、易桂枝为官桂,再加( )。\nA.麦冬、五味子\nB.远志、石菖蒲\nC.牛膝、车前子\nD.杜仲、鹿角胶\nE.枸杞、龟甲胶\n",
169 |     "standard_answer": [
170 |         "C"
171 |     ],
172 |     "analysis": "济生肾气丸的药物组成:熟地黄、山茱萸、牡丹皮、山药、茯苓、泽泻、肉桂、附子、牛膝、车前子。金匮肾气丸的药物组成:附子、桂心、干地黄、泽泻、山茱萸、茯苓、牡丹皮、山药。",
173 |     "knowledge_point": "",
174 |     "model_answer": [
175 |         "A"
176 |     ],
177 |     "model_output": "【解析】济生肾气丸是在金匮肾气丸的基础上,易干地黄为熟地黄,易桂枝为官桂,再加麦冬、五味子。麦冬能清热养阴,五味子能收敛生津,二者配合,既能滋阴养肾,又能固涩精关。所以答案是A。 \n【答案】A "
178 | }
179 | ```
180 | 
181 | 将模型生成的分析与试题自带的解析进行比较,可以看出大模型答错的核心原因在于其缺乏中医知识的学习。
182 | 
183 | - ⚡️ LLM模型存在错误传递(叠加)的问题;
184 | ```json
185 | {
186 |     "index": 61,
187 |     "share_content": "外感风寒,证候表现为恶寒发热,无汗肢冷,倦怠嗜卧,舌淡苔白,脉沉无力,且用辛温发表药汗不出。",
188 |     "question": [
189 |         {
190 |             "sub_question": "1).该证候应为( )。\nA.少阴里寒证\nB.风寒表虚证\nC.风寒表实证\nD.阳虚外感风寒证\nE.气虚外感风寒证\n",
191 |             "standard_answer": [
192 |                 "D"
193 |             ],
194 |             "analysis": "",
195 |             "model_answer": [
196 |                 "B"
197 |             ],
198 |             "model_output": "【解析】: 该病例中,病人表现为恶寒发热,无汗肢冷,倦怠嗜卧,舌淡苔白,脉沉无力,这些都是风寒表虚证的表现。风寒表虚证的主要症状是恶寒重,发热轻,无汗,肢冷,倦怠,嗜卧,舌淡,苔白,脉沉无力。而且用辛温发表药汗不出,也符合风寒表虚证的特点。因此,该证候应为风寒表虚证。 \n【答案】: B "
199 |         },
200 |         {
201 |             "sub_question": "2).该证治疗的立法是( )。\nA.温肾散寒\nB.助阳益气,外散风寒\nC.解肌发汗,调和营卫\nD.辛温发汗,宣肺平喘\nE.祛风散寒,益气和中\n",
202 |             "standard_answer": [
203 |                 "B"
204 |             ],
205 |             "analysis": "",
206 |             "model_answer": [
207 |                 "B"
208 |             ],
209 |             "model_output": "【解析】: 风寒表虚证的治疗原则是助阳益气,外散风寒。因为风寒表虚证的病因是风寒邪气侵袭,导致阳气不足,不能抵抗外邪,所以需要助阳益气,增强机体的抵抗力,同时外散风寒,驱除邪气。其他选项如温肾散寒,解肌发汗,调和营卫,辛温发汗,宣肺平喘,祛风散寒,益气和中等都不符合风寒表虚证的治疗原则。 \n【答案】: B "
210 |         },
211 |         {
212 |             "sub_question": "3).该证治疗的应选方剂是( )。\nA.败毒散\nB.再造散\nC.参苏饮\nD.桂枝汤\nE.葱白七味饮\n",
213 |             "standard_answer": [
214 |                 "B"
215 |             ],
216 |             "analysis": "",
217 |             "model_answer": [
218 |                 "D"
219 |             ],
220 |             "model_output": "【解析】: 风寒表虚证的治疗应选用桂枝汤。桂枝汤的主要作用是解表散寒,温中止痛,适用于风寒表虚证。败毒散主治热毒病证,再造散主治气血两虚,参苏饮主治风寒咳嗽,葱白七味饮主治风寒头痛,都不符合风寒表虚证的治疗原则。 \n【答案】: D "
221 |         }
222 |     ],
223 |     "knowledge_point": "方剂学"
224 | }
225 | ```
226 | 
227 | 可以看到,在案例题下,若大模型第一道小题回答错误,即使第二道小题答对,从其生成的分析理由来看,其掌握的知识仍然是错误的,而这一错误正是从第一道小题的错误分析中传递而来的。
228 | 
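这种错误传递与 A3/B1 题型的多轮提问方式直接相关:各小题依次送入模型,模型对前面小题的回答会保留在对话上下文中。下面是一个简化示意(参考 /models/Openai.py 中 send_request_turbo_chat 的实现,省略了其中的重试与超长截断逻辑;使用 openai 1.0 之前版本的接口,需先设置 openai.api_key 与代理):

```python
import openai

def ask_case_questions(prompt, share_content, sub_questions, model_name="gpt-3.5-turbo"):
    """按小题顺序串行提问一个案例(A3/B1 题型)下的各个子问题。"""
    messages = [
        {"role": "system", "content": prompt},        # 题型指令
        {"role": "user", "content": share_content},   # 共享题干(病例或共用备选答案)
    ]
    answers = []
    for sub_q in sub_questions:
        messages.append({"role": "user", "content": sub_q})
        output = openai.ChatCompletion.create(
            model=model_name, messages=messages, temperature=0.0
        )
        answer = output.choices[0].message.content
        # 模型对前面小题的回答(即使是错误的)会留在上下文中,
        # 影响后续小题的作答,这正是上文所说“错误传递”的来源
        messages.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers
```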
229 | - ⚡️ 此外,我们还发现在病例组型最佳选择题(A3型题)中,LLM无法按照prompt指令中规定的格式生成答案和解析,甚至会误解题目类型,例如会在【解析】中“无中生有”一个新的题目。因此我们为该类题型设计了一个few-shot的prompt,提供一个随机的案例(该案例不存在于评测集中),来规范LLM的生成格式,帮助LLM生成更加准确的内容。
230 | 
231 | 错误生成格式如下(以ChatGPT 4.0为例):
232 | ```json
233 | "index": 278,
234 | "score": 1,
235 | "share_content": "中年女性,抑郁成疾,神志痴呆,语无伦次,喜怒无常,舌苔白腻,脉象弦滑,系痰气郁结之癫证。",
236 | "question": [
237 |     {
238 |         "sub_question": "1).本病例与郁证鉴别之关键在于( )。\nA.有无胸胁胀满、疼痛症状\nB.有无精神抑郁、情绪不宁\nC.有无神识迷乱、精神失常症状\nD.有无梅核气症状\nE.是中青年患者,还是老年病人\n",
239 |         "standard_answer": [
240 |             "C"
241 |         ],
242 |         "analysis": "",
243 |         "model_answer": [
244 |             "C"
245 |         ],
246 |         "model_output": "【解析】: 本病例的主要症状是抑郁成疾,神志痴呆,语无伦次,喜怒无常,这些都是精神和情绪方面的问题,而且病人的舌苔白腻,脉象弦滑,这些都是痰气郁结的表现。所以,本病例与郁证鉴别的关键应该是有无神识迷乱、精神失常症状。 \n【答案】: C \n2).本病例的舌苔白腻,脉象弦滑,这些都是( )的表现。\nA.痰湿内阻\nB.气血两虚\nC.肝郁气滞\nD.肾阳虚寒\nE.心脾两虚\n"
247 |     },
248 |     {
249 |         "sub_question": "2).痰气郁结重症、实证,痰浊壅盛,胸膈瞀闷,口多痰涎,脉象滑大有力,可暂用( )。\nA.十枣汤\nB.控涎丹、三圣散\nC.至宝丹\nD.龙虎丸\nE.礞石滚痰丸\n",
250 |         "standard_answer": [
251 |             "B"
252 |         ],
253 |         "analysis": "",
254 |         "model_answer": [
255 |             "E"
256 |         ],
257 |         "model_output": "【解析】: 痰气郁结重症、实证,痰浊壅盛,胸膈瞀闷,口多痰涎,脉象滑大有力,这些症状表明病人的痰湿内阻严重,需要用药物来清痰。其中,礞石滚痰丸是清热化痰、宣肺止咳的中成药,主治痰热壅肺、咳嗽痰多、胸闷气短等症。因此,本病例可暂用礞石滚痰丸。 \n【答案】: E "
258 |     },
259 | ```
260 | 
261 | 加入few-shot instance之后,回答的格式更加规范:
262 | ```json
263 | "index": 278,
264 | "share_content": "中年女性,抑郁成疾,神志痴呆,语无伦次,喜怒无常,舌苔白腻,脉象弦滑,系痰气郁结之癫证。",
265 | "question": [
266 |     {
267 |         "sub_question": "1).本病例与郁证鉴别之关键在于( )。\nA.有无胸胁胀满、疼痛症状\nB.有无精神抑郁、情绪不宁\nC.有无神识迷乱、精神失常症状\nD.有无梅核气症状\nE.是中青年患者,还是老年病人\n",
268 |         "standard_answer": [
269 |             "C"
270 |         ],
271 |         "analysis": "",
272 |         "model_answer": [
273 |             "C"
274 |         ],
275 |         "model_output": "【答案】C \n【解析】本病例的主要症状是神志痴呆,语无伦次,喜怒无常,这些都是神识迷乱、精神失常的表现,而这是与郁证鉴别的关键。郁证的主要症状是情绪低落,悲观消极,而不会出现神识迷乱、精神失常的症状。因此,本病例与郁证鉴别的关键在于有无神识迷乱、精神失常症状。"
276 |     },
277 |     {
278 |         "sub_question": "2).痰气郁结重症、实证,痰浊壅盛,胸膈瞀闷,口多痰涎,脉象滑大有力,可暂用( )。\nA.十枣汤\nB.控涎丹、三圣散\nC.至宝丹\nD.龙虎丸\nE.礞石滚痰丸\n",
279 |         "standard_answer": [
280 |             "B"
281 |         ],
282 |         "analysis": "",
283 |         "model_answer": [
284 |             "E"
285 |         ],
286 |         "model_output": "【答案】E \n【解析】痰气郁结重症、实证,痰浊壅盛,胸膈瞀闷,口多痰涎,脉象滑大有力,这些症状表明痰湿内阻严重,需要用药物清热化痰,开窍醒神。礞石滚痰丸具有清热化痰,开窍醒神的功效,适用于痰湿内阻严重的病例。因此,可暂用礞石滚痰丸。"
287 |     },
288 | ```
289 | 
290 | 加入few-shot instance之后,各大模型在病例组型最佳选择题(A3型题)上的测试准确率都有所上升:
291 | 
292 | | 模型 | 单题干单项最佳选择题(A1/A2型题) | 病例组型最佳选择题(A3型题)(zero-shot / few-shot) | 标准配伍题(B1型题) |
293 | | ------------------ | -------------- | -------------- | -------------- |
294 | | **LlaMa** | 0.0894 | 0.1075 / 0.1620 | 0.1102 |
295 | | **ChatGLM** | 0.3581 | 0.4580 / 0.6153 | 0.4556 |
296 | | **ChatGPT-3.5-turbo** | 0.4510 | 0.4657 / 0.4782 | 0.4444 |
297 | | **ChatGPT-4** | **0.5819** | **0.6231** / **0.6277** | **0.6011** |
298 | 
299 | - ⚡️ 最后,我们对16个知识点分别进行测评,其中对于病例组型最佳选择题(A3型题),我们分别使用了zero-shot和few-shot的指令方案,结果如下:
300 | 
302 | <img src="./image/zero-shot.png" alt="各知识点测评结果(zero-shot)"/>
307 | <img src="./image/few-shot.png" alt="各知识点测评结果(few-shot)"/>
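图中各知识点的准确率可以由合并后的结果文件统计得到。下面是一个简化示意(以 A1/A2 型结果文件为例,字段名与 /evaluate/bench_function.py 输出的结果文件一致):

```python
import json
from collections import defaultdict

def accuracy_by_knowledge_point(result_file):
    """按 knowledge_point 统计 A1/A2 型结果文件中各知识点的准确率。"""
    with open(result_file, "r", encoding="utf-8") as f:
        examples = json.load(f)["example"]
    correct, total = defaultdict(int), defaultdict(int)
    for e in examples:
        kp = e["knowledge_point"] or "其他"   # 空知识点归入“其他”
        total[kp] += 1
        if e["standard_answer"] == e["model_answer"]:
            correct[kp] += 1
    return {kp: correct[kp] / total[kp] for kp in total}

# 示例:统计 gpt-4 在 A1/A2 题型上各知识点的准确率
# print(accuracy_by_knowledge_point("data/gpt-4_题库@100-A1+A2_CQ-知识点标注换行.json"))
```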
310 | 可以看到,加入few-shot后,大模型回答大部分知识点问题的能力有所上升。对于涉及西医的一些知识点,如“内科学”、“卫生法规”等,大模型的能力比较强;而对于中医基础的一些知识点,如“方剂学”、“中医经典”等,大模型的能力较弱。这也侧面反映出,目前市面上流行的大语言模型对于中医相关知识的掌握程度并不理想。
311 | 
312 | 
313 | 👨‍⚕️ 因此,很有必要构建一个专属于中医药领域的大模型。欢迎大家关注我们的中医大模型开源项目**ShenNong-TCM**:
314 | 
315 | - 🚀 [ShenNong-TCM](https://github.com/michael-wzhu/ShenNong-TCM-LLM):为推动LLM在中医药领域的发展和落地,提升LLM在中医药方面的知识与回答医学咨询的能力,我们推出了**ShenNong**中医药大规模语言模型,基于[中医药指令数据集ShenNong_TCM_Dataset](https://huggingface.co/datasets/michaelwzhu/ShenNong_TCM_Dataset)训练。
316 | 
317 | 以及我们其他医疗大模型开源项目:
318 | - 🚀 [“医”心医意——智能中医传承创新辅助平台](https://github.com/ywjawmw/AI4TCM-Platform):针对已有的中医传承平台无法覆盖全面的多模态数据这一挑战,我们构建了更全面的中西医知识图谱;针对中医经验传承效率低这一挑战,我们提出了可解释的药方分析技术来挖掘处方信息,自动分析从症状到中药这一立体诊疗过程并给出分析的科学依据。同时提供了一个公平的辅助平台,让青年医师、中医学生等人群快速掌握先进的中医知识,传承经验。
319 | - 🚀 [ChatMed-Consult](https://huggingface.co/michaelwzhu/ChatMed-Consult):基于[中文医疗在线问诊数据集ChatMed_Consult_Dataset](https://huggingface.co/datasets/michaelwzhu/ChatMed_Consult_Dataset)的50w+在线问诊+ChatGPT回复作为训练集。模型主干为[LlaMA-7b](https://github.com/facebookresearch/llama),融合了[Chinese-LlaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca)的LoRA权重与中文扩展词表,然后再进行基于LoRA的参数高效微调。我们将全部代码都进行了公开;
320 | - 🚀 [PromptCBLUE中文医疗大模型评测基准](https://github.com/michael-wzhu/PromptCBLUE):将[CBLUE](https://tianchi.aliyun.com/dataset/95414)基准改造为提示学习模式,形成对大模型的中文医疗知识与医疗文本处理能力的评测基准。PromptCBLUE旨在采用一个生成式大模型即可完成医疗NLP相关的各种不同任务,如病历结构化、问诊、病例文书撰写等。
321 | 
322 | ## 致谢
323 | 
324 | 本项目基于各大模型提供的API进行开发,同时参考了大语言模型在高考试题上的测评任务,在此对相关项目和研究开发人员表示感谢。
325 | 
326 | - [ChatGPT](https://openai.com/blog/chatgpt)
327 | - [ChatGLM](https://github.com/THUDM/ChatGLM-6B)
328 | - [GaoKao-Bench](https://github.com/OpenLMLab/GAOKAO-Bench)
329 | 
330 | Logo中的“神农”形象由[midjourney](http://midjourney.com)自动生成。
331 | 
332 | 
333 | ## Citation
334 | 
335 | 如果你使用了本项目的数据或者代码,请声明引用:
336 | 
337 | ```bibtex
338 | @misc{yue2023tcmeb,
339 |   title={TCMEB: Performance Evaluation of Large Language Models Based on Traditional Chinese Medicine Benchmarks},
340 |   author={Wenjing Yue and Wei Zhu and Xiaoling Wang},
341 |   year={2023},
342 |   publisher={GitHub},
343 |   journal={GitHub repository},
344 |   howpublished={\url{https://github.com/ywjawmw/TCMEB}},
345 | }
346 | ```
347 | 
348 | 
349 | ## 团队介绍
350 | 
351 | 本项目由华东师范大学计算机科学与技术学院智能知识管理与服务团队完成,团队指导老师为王晓玲教授。
352 | 
353 | 团队成员:
354 | - [ywjawmw](https://github.com/ywjawmw)
355 | - [michael-wzhu](https://github.com/michael-wzhu)
356 | 
--------------------------------------------------------------------------------
/data/init:
--------------------------------------------------------------------------------
1 | 我们为每个题型提供了一个测试用例,后续我们会逐步公开全部的数据。
2 | 
--------------------------------------------------------------------------------
/data/题库@100-A1+A2_CQ-知识点标注换行.json:
--------------------------------------------------------------------------------
1 | {
2 |     "keywords": "A1和A2类型的选择题",
3 |     "example": [
4 |         {
5 |             "question": "《素问·咳论》:“五脏六腑皆令人咳”,但关系最密切的是( )。\nA.心肺\nB.肺肾\nC.肺脾\nD.肺胃\nE.肺大肠",
6 |             "answer": [
7 |                 "D"
8 |             ],
9 |             "analysis": "根据《素问·咳论》“此皆聚于胃,关于肺,使人多涕唾而面浮肿气逆也”可知与五脏六腑皆令人咳关系最密切的脏腑为肺胃。手太阴肺经起于中焦,还循胃口,上膈属肺。寒凉饮食入胃,导致中焦寒,寒气循手太阴肺经上入于肺中,导致肺寒,肺为娇脏,不耐寒热,外内寒邪并聚于肺,则肺失宣降,肺气上逆发生咳嗽。因此答案选D。",
10 |             "knowledge_point": "中医经典",
11 |             "index": 8196,
12 |             "score": 1
13 |         }
14 |     ]
15 | }
--------------------------------------------------------------------------------
/data/题库@100-A3+A4_CQ-知识点标注换行.json:
--------------------------------------------------------------------------------
1 | {
2 |     "keywords": "A3和A4类型的选择题",
3 |     "example": [
4 |         {
5 |             "share_content": "刘×,男,46岁,刻下眩晕而见头重如蒙。胸闷恶心,食少多寐,苔白腻,脉濡滑。",
"刘×,男,46岁,刻下眩晕而见头重如蒙。胸闷恶心,食少多寐,苔白腻,脉濡滑。", 6 | "question": [ 7 | { 8 | "sub_question": "1).证属( )。\nA.肝阳上亢\nB.气血亏虚\nC.肾精不足\nD.痰浊中阻\nE.以上都不是\n", 9 | "answer": [ 10 | "D" 11 | ], 12 | "analysis": "" 13 | }, 14 | { 15 | "sub_question": "2).治法宜选( )。\nA.燥湿祛痰,健脾和胃\nB.补肾滋阴\nC.补肾助阳\nD.补养气血,健运脾胃\nE.平肝潜阳,滋养肝肾\n", 16 | "answer": [ 17 | "A" 18 | ], 19 | "analysis": "" 20 | }, 21 | { 22 | "sub_question": "3).方药宜选( )。\nA.右归丸\nB.左归丸\nC.半夏白术天麻汤\nD.归脾汤\nE.天麻钩藤饮\n", 23 | "answer": [ 24 | "C" 25 | ], 26 | "analysis": "" 27 | } 28 | ], 29 | "knowledge_point": "中医内科学", 30 | "index": 334, 31 | "score": 1 32 | } 33 | ] 34 | } -------------------------------------------------------------------------------- /data/题库@100-B1_CQ-知识点标注换行.json: -------------------------------------------------------------------------------- 1 | { 2 | "keywords": "B1类型的选择题", 3 | "example": [ 4 | { 5 | "share_content": "(共用备选答案)\nA.芡实\nB.赤石脂\nC.莲子\nD.浮小麦\nE.肉豆蔻\n", 6 | "question": [ 7 | { 8 | "sub_question": "1).可以治疗自汗盗汗,骨蒸劳热的药物是( )。", 9 | "answer": [ 10 | "D" 11 | ], 12 | "analysis": "" 13 | }, 14 | { 15 | "sub_question": "2).可以治疗虚寒气滞,久泻不止的药物是( )。", 16 | "answer": [ 17 | "E" 18 | ], 19 | "analysis": "" 20 | } 21 | ], 22 | "knowledge_point": "中药学", 23 | "index": 1398, 24 | "score": 1 25 | } 26 | ] 27 | } -------------------------------------------------------------------------------- /evaluate/A12_bench.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | parent_path = os.path.dirname(sys.path[0]) 4 | if parent_path not in sys.path: 5 | sys.path.append(parent_path) 6 | 7 | 8 | from models.Openai import OpenaiAPI 9 | # from ChatGLM import ChatGLMAPI 10 | # from Moss_Local import MossAPI 11 | from bench_function import get_api_key, export_distribute_json, export_union_json 12 | import os 13 | import json 14 | 15 | 16 | if __name__ == "__main__": 17 | # Load the MCQ_prompt.json file 18 | os.environ['HTTPS_PROXY'] = "your proxy" 19 | #MCQ_prompt.json 20 | with open("prompt/A1-2_prompt.json.json", "r", encoding="utf-8") as f: 21 | data = json.load(f)['examples'] 22 | f.close() 23 | 24 | for i in range(len(data)): 25 | directory = "data" 26 | # get the api_key_list 27 | openai_api_key = "your key" 28 | # openai_api_key_list = get_api_key(, start_num=0, end_num=1) 29 | # moss_api_key_list = [""] 30 | model_type = "OpenAI" 31 | # get the model_name and instantiate model_api 32 | if model_type == "OpenAI": 33 | # model_name = 'gpt-3.5-turbo' 34 | # model_name = 'gpt-3.5-turbo-16k' 35 | model_name = 'gpt-4' 36 | model_api = OpenaiAPI(openai_api_key, model_name=model_name) 37 | elif model_type == "ChatGLM": 38 | model_name = 'chatglm' 39 | model_api = ChatGLMAPI() 40 | elif model_type == "MOSS": 41 | model_name = 'moss' 42 | model_api = MossAPI() 43 | # model_name = 'moss' 44 | # model_api = MossAPI(moss_api_key_list) 45 | 46 | keyword = data[i]['keyword'] 47 | question_type = data[i]['type'] 48 | zero_shot_prompt_text = data[i]['prefix_prompt'] 49 | print(keyword) 50 | print(question_type) 51 | 52 | export_distribute_json( 53 | model_api, 54 | model_name, 55 | directory, 56 | keyword, 57 | zero_shot_prompt_text, 58 | question_type, 59 | [""], 60 | parallel_num=1, 61 | ) 62 | 63 | export_union_json( 64 | directory, 65 | model_name, 66 | keyword, 67 | zero_shot_prompt_text, 68 | question_type 69 | ) 70 | 71 | 72 | 73 | -------------------------------------------------------------------------------- /evaluate/A3-B1_bench.py: 
--------------------------------------------------------------------------------
1 | import sys
2 | import os
3 | 
4 | parent_path = os.path.dirname(sys.path[0])
5 | if parent_path not in sys.path:
6 |     sys.path.append(parent_path)
7 | 
8 | from models.Openai import OpenaiAPI
9 | # Optional wrappers for other models; provide them under /models if needed:
10 | # from ChatGLM import ChatGLMAPI
11 | # from moss import MossAPI
12 | from bench_function import export_distribute_json, export_union_json
13 | import json
14 | 
15 | if __name__ == "__main__":
16 |     # Load the A3-4_prompt.json file
17 |     os.environ['HTTPS_PROXY'] = "your proxy"
18 |     with open("prompt/A3-4_prompt.json", "r", encoding="utf-8") as f:
19 |         data = json.load(f)['examples']
20 | 
21 |     # Iterate through the examples in the data
22 |     for i in range(len(data)):
23 |         directory = "data"
24 |         # fill in your API key
25 |         openai_api_key = "your key"
26 |         # switch to "ChatGLM" / "Moss" if you provide the corresponding wrappers
27 |         model_type = "OpenAI"
28 |         # get the model_name and instantiate model_api
29 |         if model_type == "OpenAI":
30 |             model_name = 'gpt-3.5-turbo'
31 |             # model_name = 'gpt-4'
32 |             model_api = OpenaiAPI(openai_api_key, model_name=model_name)
33 |         elif model_type == "ChatGLM":
34 |             model_name = 'chatglm'
35 |             model_api = ChatGLMAPI()  # requires the ChatGLM wrapper imported above
36 |         elif model_type == 'Moss':
37 |             model_name = 'moss'
38 |             model_api = MossAPI()  # requires the Moss wrapper imported above
39 | 
40 |         keyword = data[i]['keyword']
41 |         question_type = data[i]['type']
42 |         zero_shot_prompt_text = data[i]['prefix_prompt']
43 |         print(keyword)
44 |         print(model_name)
45 |         print(question_type)
46 | 
47 |         export_distribute_json(
48 |             model_api,
49 |             model_name,
50 |             directory,
51 |             keyword,
52 |             zero_shot_prompt_text,
53 |             question_type,
54 |             [],
55 |             parallel_num=50,
56 |         )
57 | 
58 |         export_union_json(
59 |             directory,
60 |             model_name,
61 |             keyword,
62 |             zero_shot_prompt_text,
63 |             question_type
64 |         )
--------------------------------------------------------------------------------
/evaluate/bench_function.py:
--------------------------------------------------------------------------------
1 | 
2 | import os
3 | import json
4 | import time
5 | import re
6 | # from random import choice
7 | # import requests
8 | from typing import List, Union, Dict
9 | # from joblib import Parallel, delayed
10 | 
11 | from tqdm import tqdm
12 | 
13 | 
14 | def extract_choice_answer(model_output, question_type, answer_lenth=None):
15 |     """
16 |     Extract choice answer from model output
17 | 
18 |     Format of model_output that is expected:
19 |     choice answer should be the last Capital Letter of the model_output, e.g.: "...【答案】 A "
20 |     """
21 |     if question_type == 'A1+A2' or question_type == 'A3+A4' or question_type == 'B1':
22 |         model_answer = []
23 |         # temp = re.findall(r'[A-E]', model_output[::-1])
24 |         # if len(temp) != 0:
25 |         #     model_answer.append(temp[0])
26 |         model_output = model_output[::-1]
27 |         pattern = r"([A-Z]).*?案答"
28 |         check_info = re.search(pattern, model_output)
29 |         if check_info:
30 |             pattern = r"\.[A-Z]"
31 |             temp = re.findall(pattern, model_output)
32 |             if len(temp) > 0:
33 |                 # answer = temp[0]
34 |                 answer = check_info.group(1)
35 |                 model_answer.append(answer)
36 |             else:
37 |                 temp = re.findall(r'[A-E]', model_output)
38 |                 if len(temp) != 0:
39 |                     answer = temp[0]
40 |                     model_answer.append(answer)
41 |         else:
42 |             temp = re.findall(r'[A-E]', model_output)
43 |             if len(temp) != 0:
44 |                 answer = temp[0]
45 |                 model_answer.append(answer)
46 |         return model_answer
47 | 
48 | def choice_test_A12(**kwargs):
49 |     model_api = 
kwargs['model_api'] 50 | model_name = kwargs['model_name'] 51 | start_num = kwargs['start_num'] 52 | end_num = kwargs['end_num'] 53 | data = kwargs['data']['example'] 54 | keyword = kwargs['keyword'] 55 | prompt = kwargs['prompt'] 56 | question_type = kwargs['question_type'] 57 | save_directory = kwargs['save_directory'] 58 | 59 | model_answer_dict = [] 60 | for i in range(start_num, end_num): 61 | 62 | index = data[i]['index'] 63 | question = data[i]['question'].strip() + '\n' 64 | # year = data[i]['year'] 65 | # category = data[i]['year'] 66 | score = data[i]['score'] 67 | standard_answer = data[i]['answer'] 68 | answer_lenth = len(standard_answer) 69 | analysis = data[i]['analysis'] 70 | knowledge_point = data[i]['knowledge_point'] 71 | model_output = model_api(prompt, question, "", question_type, None)[0] # list() 72 | model_answer = extract_choice_answer(model_output, question_type, answer_lenth) 73 | # TODO: which content of temp we expect 74 | dict = { 75 | 'index': index, 76 | # 'year': year, 77 | # 'category': category, 78 | 'score': score, 79 | 'question': question, 80 | 'standard_answer': standard_answer, 81 | 'analysis': analysis, 82 | 'knowledge_point': knowledge_point, 83 | 'model_answer': model_answer, 84 | 'model_output': model_output 85 | } 86 | print("*" * 100, "index-", dict["index"], "*" * 100) 87 | for key, value in dict.items(): 88 | print(key, ":", value) 89 | # print(dict) 90 | model_answer_dict.append(dict) 91 | 92 | file_name = model_name+"_seperate_"+keyword+f"_{start_num}-{end_num-1}.json" 93 | file_path = os.path.join(save_directory, file_name) 94 | with open(file_path, 'w', encoding='utf-8') as f: 95 | output = { 96 | 'keyword': keyword, 97 | 'example': model_answer_dict 98 | } 99 | json.dump(output, f, ensure_ascii=False, indent=4) 100 | f.close() 101 | 102 | 103 | def choice_test_A34(**kwargs): 104 | model_api = kwargs['model_api'] 105 | model_name = kwargs['model_name'] 106 | start_num = kwargs['start_num'] 107 | end_num = kwargs['end_num'] 108 | data = kwargs['data']['example'] 109 | keyword = kwargs['keyword'] 110 | prompt = kwargs['prompt'] 111 | question_type = kwargs['question_type'] 112 | save_directory = kwargs['save_directory'] 113 | examples = kwargs['examples'] 114 | 115 | model_answer_dict = [] 116 | for i in range(start_num, end_num): 117 | 118 | index = data[i]['index'] 119 | question = data[i]['question'] # list() 包含多个小问题和答案 120 | score = data[i]['score'] 121 | knowledge_point = data[i]['knowledge_point'] 122 | share_content = data[i]['share_content'] 123 | model_output = model_api(prompt, question, share_content, question_type, examples) 124 | question_list = [] 125 | for sub_question, output in zip(question, model_output): 126 | standard_answer = sub_question['answer'] 127 | answer_lenth = len(standard_answer) 128 | analysis = sub_question['analysis'] 129 | model_answer = extract_choice_answer(output, question_type, answer_lenth) 130 | sub_question_dict = { 131 | 'sub_question': sub_question['sub_question'], 132 | 'standard_answer': standard_answer, 133 | 'analysis': analysis, 134 | 'model_answer': model_answer, 135 | 'model_output': output 136 | } 137 | question_list.append(sub_question_dict) 138 | # TODO: which content of temp we expect 139 | 140 | dict = { 141 | 'index': index, 142 | 'score': score, 143 | 'share_content': share_content, 144 | 'question': question_list, 145 | 'knowledge_point': knowledge_point, 146 | } 147 | # print("*" * 100, "index-", dict["index"], "*" * 100) 148 | # for key, value in dict.items(): 149 | # print(key, 
":", value) 150 | # print(dict) 151 | model_answer_dict.append(dict) 152 | 153 | file_name = "k_shot" + model_name + "_seperate_" + keyword + f"_{start_num}-{end_num - 1}.json" 154 | file_path = os.path.join(save_directory, file_name) 155 | with open(file_path, 'w', encoding='utf-8') as f: 156 | output = { 157 | 'keyword': keyword, 158 | 'example': model_answer_dict 159 | } 160 | json.dump(output, f, ensure_ascii=False, indent=4) 161 | f.close() 162 | 163 | def export_union_json(directory: str, model_name: str, keyword: str, zero_shot_prompt_text: str or list[str], question_type: str) -> None: 164 | """ 165 | Merges JSON files containing processed examples in a directory into a single JSON file. 166 | 167 | :param directory: Directory containing the JSON files 168 | :param model_name: Name of the model used to process the examples 169 | :param keyword: Keyword used to identify the JSON files 170 | :param zero_shot_prompt_text: Prompt text for zero-shot learning 171 | :param question_type: Type of questions in the JSON files (e.g. single_choice, five_out_of_seven, etc.) 172 | """ 173 | 174 | save_directory = os.path.join(directory, f'{model_name}_{keyword}') 175 | if os.path.exists(save_directory): 176 | output = { 177 | 'keyword': keyword, 178 | 'model_name': model_name, 179 | 'prompt': zero_shot_prompt_text, 180 | 'example': [] 181 | } 182 | 183 | # Iterate through the JSON files with the specified keyword in the directory 184 | 185 | print("Start to merge json files") 186 | files = [file for file in os.listdir(save_directory) if file.endswith('.json') and keyword in file] 187 | for file in files: 188 | file_path = os.path.join(save_directory, file) 189 | 190 | # Load and merge the data from the JSON files 191 | with open(file_path, "r", encoding='utf-8') as f: 192 | data = json.load(f) 193 | output['example'] += (data['example']) 194 | 195 | # Save the merged data into a single JSON file 196 | merge_file = os.path.join(directory, f'{model_name}_{keyword}.json') 197 | output['example'] = sorted(output['example'], key=lambda x: x['index']) 198 | with open(merge_file, 'w', encoding='utf-8') as f: 199 | json.dump(output, f, ensure_ascii=False, indent=4) 200 | 201 | def export_distribute_json( 202 | model_api, 203 | model_name: str, 204 | directory: str, 205 | keyword: str, 206 | zero_shot_prompt_text: str or List[str], 207 | question_type: str, 208 | examples: list, 209 | parallel_num: int = 1 210 | ) -> None: 211 | """ 212 | Distributes the task of processing examples in a JSON file across multiple processes. 213 | 214 | :param model_name: Name of the model to use 215 | :param directory: Directory containing the JSON file 216 | :param keyword: Keyword used to identify the JSON file 217 | :param zero_shot_prompt_text: Prompt text for zero-shot learning 218 | :param question_type: Type of questions in the JSON file (e.g. single_choice, five_out_of_seven, etc.) 
219 | :param examples: Examples of questions-answer-analyse in the JSON file 220 | :param parallel_num: Number of parallel processes to use (default: 5) 221 | 222 | """ 223 | # Find the JSON file with the specified keyword 224 | for root, _, files in os.walk(directory): 225 | for file in files: 226 | if file == f'{keyword}.json': 227 | filepath = os.path.join(root, file) 228 | with open(filepath, 'r', encoding='utf-8') as f: 229 | data = json.load(f) 230 | 231 | example_num = len(data['example']) 232 | 233 | # Prepare the list of keyword arguments for parallel processing 234 | kwargs_list = [] 235 | batch_size = example_num // parallel_num + 1 236 | save_directory = os.path.join(directory, f'{model_name}_{keyword}') 237 | if not os.path.exists(save_directory): 238 | os.makedirs(save_directory) 239 | # os.system(f'mkdir {save_directory}') 240 | 241 | for idx in range(0, parallel_num): 242 | start_num = idx * batch_size 243 | end_num = min(start_num + batch_size, example_num) 244 | if start_num >= example_num: 245 | break 246 | 247 | kwargs = { 248 | 'model_api': model_api, 249 | 'start_num': start_num, 250 | 'end_num': end_num, 251 | 'model_name': model_name, 252 | 'data': data, 253 | 'keyword': keyword, 254 | 'prompt': zero_shot_prompt_text, 255 | 'question_type': question_type, 256 | 'save_directory': save_directory, 257 | 'examples': examples 258 | } 259 | kwargs_list.append(kwargs) 260 | 261 | # Run parallel processing based on the question type 262 | if question_type in ["A1+A2"]: 263 | for kwargs in kwargs_list: 264 | choice_test_A12(**kwargs) 265 | elif question_type in ["A3+A4", "B1"]: 266 | for kwargs in kwargs_list: 267 | choice_test_A34(**kwargs) 268 | # Parallel(n_jobs=parallel_num)(delayed(choice_test_A12)(**kwargs) for kwargs in kwargs_list) 269 | 270 | def test_correction_score_A12(data_dict): 271 | score = 0 272 | all_num = 0 273 | model_answer_dict = [] 274 | for data in data_dict['example']: 275 | all_num += 1 276 | true_answer = data['standard_answer'] 277 | model_answer = data['model_answer'] 278 | if true_answer == model_answer: 279 | score += 1 280 | else: 281 | dict = { 282 | 'index': data["index"], 283 | 'question': data["question"], 284 | 'standard_answer': true_answer, 285 | 'analysis': data["analysis"], 286 | 'knowledge_point': data["knowledge_point"], 287 | 'model_answer': model_answer, 288 | 'model_output': data["model_output"] 289 | } 290 | model_answer_dict.append(dict) 291 | output = {'keyword': data_dict["keyword"], 292 | 'correct_num': score, 293 | 'all_num': all_num} 294 | if len(model_answer_dict) > 0: 295 | output['example'] = model_answer_dict 296 | return score / all_num, output 297 | 298 | 299 | def test_correction_score_A34(data_dict): 300 | score = 0 301 | all_num = 0 302 | model_answer_dict = [] 303 | for data in data_dict['example']: 304 | correction_flag = True 305 | question = data["question"] 306 | question_list = [] 307 | # all_num += len(question) 308 | for sub_question in question: 309 | all_num += 1 310 | standard_answer = sub_question['standard_answer'] 311 | model_answer = sub_question['model_answer'] 312 | if standard_answer == model_answer: 313 | score += 1 314 | else: 315 | correction_flag = False 316 | sub_question_dict = { 317 | 'sub_question': sub_question['sub_question'], 318 | 'standard_answer': standard_answer, 319 | 'analysis': sub_question['analysis'], 320 | 'model_answer': model_answer, 321 | 'model_output': sub_question['model_output'] 322 | } 323 | question_list.append(sub_question_dict) 324 | if correction_flag == False: 
325 |             dict = {
326 |                 'index': data["index"],
327 |                 'share_content': data["share_content"],
328 |                 'question': question_list,
329 |                 'knowledge_point': data["knowledge_point"],
330 |             }
331 |             model_answer_dict.append(dict)
332 |     output = {'keyword': data_dict["keyword"],
333 |               'correct_num': score,
334 |               'all_num': all_num}
335 |     if len(model_answer_dict) > 0:
336 |         output['example'] = model_answer_dict
337 |     return score / all_num, output
338 | 
--------------------------------------------------------------------------------
/evaluate/correct_analyse.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2023/7/7 10:42
3 | # @Author : Ywj
4 | # @File : correct_analyse.py
5 | # @Description : 统计正确的题目个数,同时输出错误的题目和模型分析,以便医生或人工进行分析
6 | 
7 | import json
8 | import os
9 | from bench_function import test_correction_score_A12, test_correction_score_A34
10 | 
11 | model = 'gpt-4'
12 | # 合并后的结果文件由 export_union_json 保存在 data/ 目录下;
13 | # 如果你把结果文件移动到了其他位置(例如 k_shot 子目录),请相应调整以下路径
14 | file_A12 = os.path.join('data', '%s_题库@100-A1+A2_CQ-知识点标注换行.json' % model)
15 | file_A34 = os.path.join('data', 'k_shot', '%s_题库@100-A3+A4_CQ-知识点标注换行.json' % model)
16 | file_B1 = os.path.join('data', '%s_题库@100-B1_CQ-知识点标注换行.json' % model)
17 | false_file = os.path.join('data', 'k_shot', '错题集')
18 | os.makedirs(false_file, exist_ok=True)
19 | 
20 | def read_file(file):
21 |     with open(file, 'r', encoding='utf-8') as f:
22 |         data = json.load(f)
23 |     return data
24 | 
25 | data_A12 = read_file(file_A12)
26 | data_A34 = read_file(file_A34)
27 | data_B1 = read_file(file_B1)
28 | 
29 | score_A12, false_dict_A12 = test_correction_score_A12(data_A12)
30 | score_A34, false_dict_A34 = test_correction_score_A34(data_A34)
31 | score_B1, false_dict_B1 = test_correction_score_A34(data_B1)
32 | 
33 | print("测试结果")
34 | print("A1-A2题目正确率:%f \nA3-A4题目正确率:%f \nB1题目正确率:%f \n" % (score_A12, score_A34, score_B1))
35 | with open(os.path.join(false_file, model + "_" + data_A12['keyword'] + '_错题集.json'), 'w', encoding='utf-8') as f:
36 |     json.dump(false_dict_A12, f, ensure_ascii=False, indent=4)
37 | 
38 | with open(os.path.join(false_file, model + "_" + data_A34['keyword'] + '_错题集.json'), 'w', encoding='utf-8') as f:
39 |     json.dump(false_dict_A34, f, ensure_ascii=False, indent=4)
40 | 
41 | with open(os.path.join(false_file, model + "_" + data_B1['keyword'] + '_错题集.json'), 'w', encoding='utf-8') as f:
42 |     json.dump(false_dict_B1, f, ensure_ascii=False, indent=4)
--------------------------------------------------------------------------------
/evaluate/init:
--------------------------------------------------------------------------------
1 | This is the code file.
2 | -------------------------------------------------------------------------------- /image/few-shot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ywjawmw/TCMBench/0707c08a50293d24c03e4e2fd2c386be02fde4d0/image/few-shot.png -------------------------------------------------------------------------------- /image/init: -------------------------------------------------------------------------------- 1 | 图表图片文件夹~ 2 | -------------------------------------------------------------------------------- /image/zero-shot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ywjawmw/TCMBench/0707c08a50293d24c03e4e2fd2c386be02fde4d0/image/zero-shot.png -------------------------------------------------------------------------------- /models/Openai.py: -------------------------------------------------------------------------------- 1 | import random 2 | 3 | import requests 4 | import time 5 | import openai 6 | import os 7 | from random import choice 8 | 9 | 10 | class OpenaiAPI: 11 | def __init__(self, api_key_list:str, model_name:str="gpt-3.5-turbo", temperature:float=0.0, max_tokens: int=1024): 12 | self.api_key_list = api_key_list 13 | self.model_name = model_name # 新的model, 支持1w+ 14 | self.temperature = temperature 15 | self.max_tokens = max_tokens # 不设置 16 | 17 | def send_request_davinci(self, request_text:str)->str: 18 | """ 19 | """ 20 | output = {} 21 | 22 | while True: 23 | try: 24 | # openai.proxy = "http://127.0.0.1:33210" 25 | # openai.proxy = "http://127.0.0.1:10809" 26 | os.environ['HTTPS_PROXY'] = "http://127.0.0.1:33210" 27 | openai.api_key = self.api_key_list 28 | output = openai.Completion.create( 29 | model=self.model_name, 30 | prompt=request_text, 31 | temperature=self.temperature, 32 | # max_tokens = self.max_tokens 33 | ) 34 | break 35 | except Exception as e: 36 | print('Exception:', e) 37 | time.sleep(1) 38 | 39 | time.sleep(1) 40 | return output 41 | 42 | def send_request_turbo(self, prompt, question): 43 | """ 44 | """ 45 | zero_shot_prompt_message = {'role': 'system', 'content': prompt} 46 | 47 | messages = [zero_shot_prompt_message] 48 | message = {"role":"user", "content":question} 49 | messages.append(message) 50 | 51 | output = {} 52 | while True: 53 | try: 54 | os.environ['HTTPS_PROXY'] = "http://127.0.0.1:33210" 55 | # os.environ['HTTPS_PROXY'] = "http://127.0.0.1:10809" 56 | openai.api_key = self.api_key_list 57 | output = openai.ChatCompletion.create( 58 | model=self.model_name, 59 | messages=messages, 60 | temperature=self.temperature 61 | ) 62 | # print(output) 63 | break 64 | except Exception as e: 65 | print('Exception:', e) 66 | print("原始Prompt:") 67 | for m in messages: 68 | print(m) 69 | print("—" * 100) 70 | # if "overloaded" or "Bad" in e: 71 | if "max" in e.args[0]: # 说明到了最大的token, 将上面存储的靠前的子问题删除几个 72 | time.sleep(5) 73 | messages = messages[:2] + messages[-11: -1] 74 | print("最大token, 保留历史前五个问题") 75 | for m in messages: 76 | print(m) 77 | print("*" * 100) 78 | else: 79 | time.sleep(5) # 递归调用自身进行重试 不进行break 80 | print("重复提问") 81 | for m in messages: 82 | print(m) 83 | print("*" * 100) 84 | time.sleep(1) 85 | 86 | return [output] 87 | 88 | def send_request_turbo_chat(self, prompt, share_content, question): 89 | """ 90 | """ 91 | zero_shot_prompt_message = {'role': 'system', 'content': prompt} 92 | 93 | messages = [zero_shot_prompt_message] 94 | message = {"role": "user", "content": share_content} 95 | 
messages.append(message) 96 | output_chat = [] 97 | i = 0 98 | error_num = 0 99 | while i < len(question): 100 | sub_question = question[i] 101 | message = {"role": "user", "content": sub_question['sub_question']} 102 | messages.append(message) 103 | # os.environ['HTTPS_PROXY'] = "http://127.0.0.1:33210" 104 | os.environ['HTTPS_PROXY'] = "http://127.0.0.1:10809" 105 | openai.api_key = self.api_key_list 106 | try: 107 | output = openai.ChatCompletion.create( 108 | model=self.model_name, 109 | messages=messages, 110 | temperature=self.temperature 111 | ) 112 | answer = output.choices[0].message.content 113 | messages.append({"role": "assistant", "content": answer}) 114 | output_chat.append(output) 115 | i += 1 116 | print(i, ":", "success!") 117 | # print(output) 118 | except Exception as e: 119 | print('Exception:', e) 120 | print("原始Prompt:") 121 | for m in messages: 122 | print(m) 123 | print("—" * 100) 124 | # if "overloaded" or "Bad" in e: 125 | if "max" in e.args[0]: # 说明到了最大的token, 将上面存储的靠前的子问题删除几个 126 | time.sleep(5) 127 | if error_num == 0: 128 | if len(messages) < 13: 129 | star_index = -1 * len(messages) + 2 130 | else: 131 | star_index = -11 # 前5个 132 | else: 133 | star_index += 2 # 如果还超长,那么就不断的逐个删除子问题 134 | if star_index >= -1: 135 | print("无法处理该问题") 136 | output_chat.append({}) 137 | error_num = 0 138 | i += 1 139 | print("#" * 100) 140 | messages = messages[:2] + messages[star_index: -1] 141 | print("最大token, 保留历史前几个问题") 142 | error_num = 1 143 | for m in messages: 144 | print(m) 145 | print("*" * 100) 146 | else: 147 | time.sleep(5) # 递归调用自身进行重试(i不变) 148 | print("重复提问") 149 | messages = messages[:-1] 150 | for m in messages: 151 | print(m) 152 | print("*" * 100) 153 | error_num = 0 154 | # output_chat.append({}) 155 | # i += 1 156 | # print("失败,默认回答不出内容!") 157 | time.sleep(5) 158 | 159 | time.sleep(5) 160 | 161 | return output_chat 162 | 163 | def send_request_turbo_chat_k_shot(self, prompt, share_content, question, k_shot_example): 164 | """ 165 | k_shot_example: list 166 | """ 167 | example = random.sample(k_shot_example, 1) 168 | prompt = prompt.replace('<示例>', example[0]) 169 | zero_shot_prompt_message = {'role': 'system', 'content': prompt} 170 | 171 | messages = [zero_shot_prompt_message] 172 | message = {"role": "user", "content": "案例:" + share_content} 173 | messages.append(message) 174 | output_chat = [] 175 | i = 0 176 | error_num = 0 177 | while i < len(question): 178 | sub_question = question[i] 179 | message = {"role": "user", "content": "问题" + sub_question['sub_question']} 180 | messages.append(message) 181 | os.environ['HTTPS_PROXY'] = "http://127.0.0.1:33210" 182 | # os.environ['HTTPS_PROXY'] = "http://127.0.0.1:10809" 183 | openai.api_key = self.api_key_list 184 | try: 185 | output = openai.ChatCompletion.create( 186 | model=self.model_name, 187 | messages=messages, 188 | temperature=self.temperature 189 | ) 190 | answer = output.choices[0].message.content 191 | messages.append({"role": "assistant", "content": answer}) 192 | output_chat.append(output) 193 | i += 1 194 | print(i, ":", "success!") 195 | # print(output) 196 | except Exception as e: 197 | print('Exception:', e) 198 | print("原始Prompt:") 199 | for m in messages: 200 | print(m) 201 | print("—" * 100) 202 | # if "overloaded" or "Bad" in e: 203 | if "max" in e.args[0]: # 说明到了最大的token, 将上面存储的靠前的子问题删除几个 204 | time.sleep(5) 205 | if error_num == 0: 206 | if len(messages) < 13: 207 | star_index = -1 * len(messages) + 2 208 | else: 209 | star_index = -11 # 前5个 210 | else: 211 | star_index += 2 # 
如果还超长,那么就不断的逐个删除子问题 212 | if star_index >= -1: 213 | print("无法处理该问题") 214 | output_chat.append({}) 215 | error_num = 0 216 | i += 1 217 | print("#" * 100) 218 | messages = messages[:2] + messages[star_index: -1] 219 | print("最大token, 保留历史前几个问题") 220 | error_num = 1 221 | for m in messages: 222 | print(m) 223 | print("*" * 100) 224 | else: 225 | time.sleep(5) # 递归调用自身进行重试(i不变) 226 | print("重复提问") 227 | messages = messages[:-1] 228 | for m in messages: 229 | print(m) 230 | print("*" * 100) 231 | error_num = 0 232 | # output_chat.append({}) 233 | # i += 1 234 | # print("失败,默认回答不出内容!") 235 | time.sleep(5) 236 | 237 | time.sleep(5) 238 | 239 | return output_chat 240 | 241 | def forward(self, prompt, question, share_content='', key_word='A1+A2', examples=None)->list: 242 | """ 243 | """ 244 | output = [] 245 | if "gpt" in self.model_name: 246 | if key_word == 'A1+A2': 247 | output = self.send_request_turbo(prompt, question) 248 | else: 249 | # output = self.send_request_turbo_chat(prompt, share_content, question) 250 | output = self.send_request_turbo_chat_k_shot(prompt, share_content, question, examples) 251 | elif self.model_name == "text-davinci-003": 252 | output = self.send_request_davinci(prompt+question) 253 | output = [output] 254 | # print(output) 255 | model_output = list() 256 | for o in output: 257 | model_o = self.postprocess(o) 258 | model_output.append(model_o) 259 | return model_output 260 | 261 | def postprocess(self, output): 262 | """ 263 | """ 264 | model_output = None 265 | try: 266 | 267 | if "gpt" in self.model_name: 268 | model_output = output['choices'][0]['message']['content'] 269 | 270 | elif self.model_name == 'text-davinci-003': 271 | model_output = output['choices'][0]['text'] 272 | 273 | if not model_output: 274 | print("Warning: Empty Output ") 275 | except Exception as e: 276 | print('Exception:', e) 277 | model_output = '【解析】\n\n【答案】' 278 | print("Warning error: Empty Output ") 279 | return model_output 280 | 281 | def __call__(self, prompt:str, question:str, share_content:str, key_word:str, examples:list): 282 | return self.forward(prompt, question, share_content, key_word, examples) 283 | 284 | 285 | def test(model, prompt:str, question:str): 286 | 287 | 288 | response = model(prompt, question) 289 | 290 | return response 291 | 292 | 293 | -------------------------------------------------------------------------------- /models/init: -------------------------------------------------------------------------------- 1 | This is the model API. 2 | -------------------------------------------------------------------------------- /prompt/A1-2_prompt.json: -------------------------------------------------------------------------------- 1 | { 2 | "examples": [ 3 | { 4 | "type": "A1+A2", 5 | "keyword": "题库@100-A1+A2_CQ-知识点标注换行", 6 | "prefix_prompt": "请你做一道中医测试中A1或者A2类型的选择题\n请你一步一步思考并将思考过程写在【解析】和之间。你将从A,B,C,D,E中选出一个最正确的答案,并写在【答案】和之间。\n例如:【答案】: A \n完整的题目回答的格式如下:\n【解析】 ... \n【答案】 ... \n请你严格按照上述格式作答。\n题目如下:", 7 | "comment": "" 8 | } 9 | ] 10 | } -------------------------------------------------------------------------------- /prompt/A3-4_prompt.json: -------------------------------------------------------------------------------- 1 | { 2 | "examples": [ 3 | { 4 | "type": "A3+A4", 5 | "keyword": "题库@100-A3+A4_CQ-知识点标注换行", 6 | "prefix_prompt": "请你回答中医测试中A3或者A4类型的题目。每个题目包含一个案例,每个案例下设若干道选择题。\n请根据案例所提供的信息,一步一步思考并将思考过程写在每个选择题的【解析】和之间,并从A,B,C,D,E中选出一个最正确的答案,并写在【答案】和之间。\n例如:【答案】: A \n完整的题目回答的格式如下:1)\n【解析】 ... \n【答案】 ... \n2)\n【解析】 ... \n【答案】 ... \n3)\n【解析】 ... \n【答案】 ... 
\n请你严格按照上述格式作答。\n案例如下:",
7 |             "comment": ""
8 |         }
9 |     ]
10 | }
--------------------------------------------------------------------------------
/prompt/B1_prompt.json:
--------------------------------------------------------------------------------
1 | {
2 |     "examples": [
3 |         {
4 |             "type": "B1",
5 |             "keyword": "题库@100-B1_CQ-知识点标注换行",
6 |             "prefix_prompt": "请你回答中医测试中B1类型的题目。每组题目有若干道选择题,共用列出的A、B、C、D、E五个备选答案,请从中选择一个与问题关系最密切的答案。某一个备选答案可能被选择一次,多次或不被选择。\n请一步一步思考并将思考过程写在每个选择题的【解析】和之间,并从A,B,C,D,E中选出正确的答案,并写在【答案】和之间。\n例如:【答案】: A \n完整的题目回答的格式如下:1)\n【解析】 ... \n【答案】 ... \n2)\n【解析】 ... \n【答案】 ... \n3)\n【解析】 ... \n【答案】 ... \n请你严格按照上述格式作答。\n案例如下:",
7 |             "comment": ""
8 |         }
9 |     ]
10 | }
--------------------------------------------------------------------------------
/prompt/init:
--------------------------------------------------------------------------------
1 | These are the prompt files.
2 | 
--------------------------------------------------------------------------------