├── train
    ├── CPT
    │   └── README.md
    ├── SFT
    │   └── README.md
    ├── RLHF
    │   └── DPO
    │   │   └── README.md
    └── README.md
├── learn
    └── README.md
├── task
    ├── README.md
    └── agent
    │   ├── README.md
    │   ├── GenAI_Agents
    │   └── demo
    │       ├── README.md
    │       └── 0. 一个简单的对话agent例子.md
├── deploy
    ├── python
    │   ├── README.md
    │   └── chat_demo.py
    ├── README.md
    ├── LMStudio
    │   ├── test_api.py
    │   └── README.md
    ├── ollama
    │   └── README.md
    ├── API
    │   ├── easy_server_demo.py
    │   └── README.md
    ├── streamlit
    │   ├── web_gemma2_chat.py
    │   └── web_llama3_chat.py
    ├── web_streamlit_for_v1.py
    ├── web_streamlit_for_instruct_v2.py
    ├── web_streamlit_for_instruct.py
    └── vLLM
    │   └── README.md
├── tools
    ├── merge_weight.py
    ├── convert_gguf.py
    ├── sample_data.py
    ├── count_data.py
    ├── convert_raw_data_for_firefly.py
    ├── convert_firefly_data_to_sharegpt.py
    ├── check_jsonl.py
    ├── change_info.py
    ├── expand_embedding_and_lmhead.py
    └── transfer_hf_models_to_modelscope.ipynb
└── README.md


/train/CPT/README.md:
--------------------------------------------------------------------------------
1 | # CPT训练
2 | 


--------------------------------------------------------------------------------
/train/SFT/README.md:
--------------------------------------------------------------------------------
1 | # SFT训练
2 | 


--------------------------------------------------------------------------------
/learn/README.md:
--------------------------------------------------------------------------------
1 | # llama3 基础知识学习
2 | 


--------------------------------------------------------------------------------
/task/README.md:
--------------------------------------------------------------------------------
1 | # 基于llama3去实现一些下游应用
2 | 


--------------------------------------------------------------------------------
/task/agent/README.md:
--------------------------------------------------------------------------------
1 | # agent 系统学习
2 | 


--------------------------------------------------------------------------------
/train/RLHF/DPO/README.md:
--------------------------------------------------------------------------------
1 | # DPO训练
2 | 


--------------------------------------------------------------------------------
/train/README.md:
--------------------------------------------------------------------------------
1 | # 大语言模型训练方法指南
2 | 本文档教程将汇总大语言模型训练的常见方法。
3 | 


--------------------------------------------------------------------------------
/task/agent/GenAI_Agents:
--------------------------------------------------------------------------------
1 | # GenAI_Agents Demo
2 | from @NirDiamant: https://github.com/NirDiamant/GenAI_Agents
3 | 


--------------------------------------------------------------------------------
/task/agent/demo/README.md:
--------------------------------------------------------------------------------
1 | 这里是一些例子，来自于[GenAI_Agents](https://github.com/NirDiamant/GenAI_Agents)项目，我们对它进行了汉化&整理，以方便中文读者作为demo进行学习。
2 | 


--------------------------------------------------------------------------------
/deploy/python/README.md:
--------------------------------------------------------------------------------
1 | # Python代码部署语言模型
2 | 默认情况下，直接运行chat_demo.py代码即可体验llama3中文对话。  
3 | 如需更换模型权重，请自行修改其中的`model_name_or_path`为你下载的模型路径
4 | 


--------------------------------------------------------------------------------
/deploy/README.md:
--------------------------------------------------------------------------------
 1 | # 模型部署推理
 2 | 
 3 | 这里基本上整理了所有常用的llama3语言模型部署方式，你可以根据自己的兴趣进入不同的文件夹进行学习部署。
 4 | 
 5 | ## 场景推荐
 6 | - 如果是笔记本电脑上运行，推荐使用`ollama`或者`LMStudio`方式
 7 | - 如需在服务器上部署高性能的语言模型API，推荐使用`vllm`方式
 8 | - 如果是想自行学习掌握代码手动推理实现，推荐查看`API`和`python`文件夹
 9 | - 如果你是语言模型的训练者，想要每次训练完毕进行方便的对话调试，推荐使用`streamlit`方式
10 |   
11 | 


--------------------------------------------------------------------------------
/deploy/LMStudio/test_api.py:
--------------------------------------------------------------------------------
 1 | from openai import OpenAI
 2 | 
 3 | client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
 4 | 
 5 | completion = client.chat.completions.create(
 6 |   model="shareAI/llama3-dpo-zh",
 7 |   messages=[
 8 |     {"role": "system", "content": "you are a helpful bot."},
 9 |     {"role": "user", "content": "讲个笑话"}
10 |   ],
11 |   temperature=0.7,
12 |   stop=["<|eot_id|>"],
13 | )
14 | 
15 | print(completion.choices[0].message)
16 | 


--------------------------------------------------------------------------------
/tools/merge_weight.py:
--------------------------------------------------------------------------------
 1 | # 可用于提升模型效果的一种技巧，将最近检查点权重融合, 与RWKV Merge_Lora原理类似，都是对权重的提取合并
 2 | import torch
 3 | 
 4 | w_a = torch.load('./checkpoint-700/adapter_model.bin', map_location=torch.device('cpu'))
 5 | w_b = torch.load('./checkpoint-800/adapter_model.bin', map_location=torch.device('cpu'))
 6 | w_c = {}
 7 | 
 8 | for k in w_a.keys():
 9 |     try:
10 |         w_c[k] = w_a[k] * 0.7 + w_b[k] * 0.3
11 |     except:
12 |         print(k)
13 |     
14 | for k in w_a.keys():
15 |     if k not in w_c.keys():
16 |         w_c[k] = w_a[k]
17 | 
18 | torch.save(w_c, 'adapter_model_merged.bin')
19 | 


--------------------------------------------------------------------------------
/tools/convert_gguf.py:
--------------------------------------------------------------------------------
 1 | from unsloth import FastLanguageModel
 2 | import torch
 3 | max_seq_length = 8096 
 4 | dtype = None 
 5 | load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.
 6 | 
 7 | model, tokenizer = FastLanguageModel.from_pretrained(
 8 |     model_name = "./Llama3-Chinese-instruct-DPO-beta0.5-loftq", # 改成模型路径
 9 |     max_seq_length = max_seq_length,
10 |     dtype = dtype,
11 |     load_in_4bit = load_in_4bit,
12 | )
13 | 
14 | # model.save_pretrained_gguf("llama3-Chinese-chat-8b_gguf_Q8_0", tokenizer,)
15 | model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m") # q5、q8会更好
16 | 


--------------------------------------------------------------------------------
/tools/sample_data.py:
--------------------------------------------------------------------------------
 1 | import json
 2 | import random
 3 | 
 4 | def sample_jsonl(input_file, output_file):
 5 |     # 读取输入文件
 6 |     with open(input_file, 'r', encoding='utf-8') as f:
 7 |         lines = f.readlines()
 8 | 
 9 |     # 随机采样1/3的数据
10 |     sample_size = len(lines) // 3
11 |     sampled_lines = random.sample(lines, sample_size)
12 | 
13 |     # 写入输出文件
14 |     with open(output_file, 'w', encoding='utf-8') as f:
15 |         for line in sampled_lines:
16 |             f.write(line)
17 | 
18 | if __name__ == "__main__":
19 |     input_file = "input.jsonl"  # 输入文件路径
20 |     output_file = "sampled_output.jsonl"  # 输出文件路径
21 |     sample_jsonl(input_file, output_file)
22 | 


--------------------------------------------------------------------------------
/tools/count_data.py:
--------------------------------------------------------------------------------
 1 | def count_jsonl(input_file):
 2 |     max_len = 0
 3 |     min_len = 1e9
 4 |     total_len = 0
 5 |     count = 0
 6 |     with open(input_file, 'r', encoding='utf-8') as f_input:
 7 |         for line in f_input:
 8 |             if len(line) > max_len:
 9 |                 max_len = len(line.strip().split())
10 |             if len(line) < min_len:
11 |                 min_len = len(line.strip().split())
12 |             total_len += len(line.strip().split())
13 |             count += 1
14 |         
15 |         print("max_len: ", max_len)
16 |         print("min_len: ", min_len)
17 |         print("total_len: ", total_len)
18 |         print(f"count: {count}")
19 |         print(f"average_len: {total_len / count}")
20 | 
21 | if __name__ == "__main__":
22 |     input_file = "./LongQLoRA-SFT_13k.jsonl"
23 |     count_jsonl(input_file)
24 | 


--------------------------------------------------------------------------------
/tools/convert_raw_data_for_firefly.py:
--------------------------------------------------------------------------------
 1 | import json
 2 | 
 3 | def convert_entry(entry):
 4 |     conversation = {
 5 |         "human": entry["instruction"] + entry["input"],
 6 |         "assistant": entry["output"]
 7 |     }
 8 |     return conversation
 9 | 
10 | def convert_jsonl(input_file, output_file):
11 |     with open(input_file, 'r', encoding='utf-8') as f_input, open(output_file, 'w', encoding='utf-8') as f_output:
12 |         for line in f_input:
13 |             entry = json.loads(line)
14 |             conversation = {"conversation": [convert_entry(entry)]}
15 |             json.dump(conversation, f_output, ensure_ascii=False)
16 |             f_output.write('\n')
17 | 
18 | if __name__ == "__main__":
19 |     input_file = "./m-a-p/COIG-CQIA/zhihu/zhihu_expansion.jsonl"
20 |     output_file = "zhihu_expansion.jsonl"
21 | 
22 |     convert_jsonl(input_file, output_file)
23 | 


--------------------------------------------------------------------------------
/deploy/ollama/README.md:
--------------------------------------------------------------------------------
 1 | # ollama 部署语言模型
 2 | ## 简单介绍 ollama
 3 | ollama是一个完全借鉴docker思想开发的软件，它底层封装了llama.cpp的代码，上层提供了命令行工具，能够便捷地提供终端对话推理、本地api部署等能力，用户无需关心复杂的对话模型参数设置与权重下载管理等问题。
 4 | 
 5 | ## 上手使用 ollama
 6 | 首先，去官网下载安装ollama（非常简单 容易安装）：https://ollama.com/  
 7 | 然后，打开命令行终端，执行以下命令即可开始与AI对话：
 8 | ```
 9 | ollama run shareai/llama3.1-dpo-zh
10 | ```
11 | 
12 | <img width="1000" alt="image" src="https://github.com/user-attachments/assets/7140ee4b-d2d5-42f6-976b-9379ec6a9811">
13 | 
14 | ## 通过API连接ollama (兼容openai格式)
15 | 运行上面命令后，ollama会默认启动一个api服务，可以通过以下命令进行调用测试：  
16 | ```shell
17 | curl http://localhost:11434/v1/chat/completions -d '{
18 |   "model": "shareai/llama3.1-dpo-zh",
19 |   "messages": [
20 |     {
21 |       "role": "user",
22 |       "content": "讲个笑话?"
23 |     }
24 |   ],
25 |   "stream": false
26 | }'
27 | ```
28 | 
29 | ## 更多
30 | 可以参考ollama中文文档：https://ollama.fan/reference/api/
31 | 


--------------------------------------------------------------------------------
/tools/convert_firefly_data_to_sharegpt.py:
--------------------------------------------------------------------------------
 1 | import json
 2 | 
 3 | def convert_jsonl(input_file, output_file):
 4 |     with open(input_file, 'r', encoding='utf-8') as f:
 5 |         with open(output_file, 'w', encoding='utf-8') as fout:
 6 |             for line in f:
 7 |                 data = json.loads(line.strip())
 8 |                 conversations = data['conversation']
 9 |                 new_conversations = []
10 |                 for conv in conversations:
11 |                     for key, value in conv.items():
12 |                         if key == 'assistant':
13 |                             key = 'gpt'
14 |                         else:
15 |                             key = 'human'
16 |                             
17 |                         new_conversations.append({'from': key, 'value': value})
18 |                         
19 |                 new_data = {'conversations': new_conversations}
20 |                 fout.write(json.dumps(new_data, ensure_ascii=False) + '\n')
21 | 
22 | # 替换输入文件路径和输出文件路径
23 | input_file = 'input.jsonl'
24 | output_file = 'output.jsonl'
25 | 
26 | convert_jsonl(input_file, output_file)
27 | 


--------------------------------------------------------------------------------
/tools/check_jsonl.py:
--------------------------------------------------------------------------------
 1 | # 用于检查数据集的每一条记录是否符合要求，不符合要求的将被删除
 2 | 
 3 | import json
 4 | import os
 5 | 
 6 | # 定义要检查的文件路径
 7 | file_path = "./dataset/v2/merged.jsonl"
 8 | # 定义用于存储有效行的列表
 9 | valid_lines = []
10 | 
11 | # 打开文件进行逐行检查
12 | with open(file_path, "r") as file:
13 |     for line_number, line in enumerate(file, start=1):
14 |         line = line.strip()
15 |         if line:
16 |             try:
17 |                 data = json.loads(line)
18 |                 if (
19 |                     "conversation" in data 
20 |                     and data["conversation"] 
21 |                     and data["conversation"][0]["human"] != ""
22 |                     and data["conversation"][0]["assistant"] != ""
23 |                 ):
24 |                     valid_lines.append(line)
25 |                 else:
26 |                     print(f"删除第 {line_number} 行：无效的conversation")
27 |             except json.JSONDecodeError:
28 |                 print(f"删除第 {line_number} 行：JSON解析错误")
29 | 
30 | # 删除原始文件
31 | os.remove(file_path)
32 | 
33 | # 将有效行写入新文件
34 | with open(file_path, "w") as file:
35 |     for line in valid_lines:
36 |         file.write(line + "\n")
37 | 
38 | print("检查完成并删除不符合要求的行。")
39 | 


--------------------------------------------------------------------------------
/tools/change_info.py:
--------------------------------------------------------------------------------
 1 | def replace_keywords_and_remove_lines(input_file, output_file):
 2 |     # 定义关键词替换规则
 3 |     keywords = {
 4 |         "ChatGPT": "shareAI-GPT",
 5 |         "GPT3.5": "shareAI-GPT",
 6 |         "Gpt3.5": "shareAI-GPT",
 7 |         "gpt3.5": "shareAI-GPT",
 8 |         "GPT-3": "shareAI-GPT",
 9 |         "Gpt-3": "shareAI-GPT",
10 |         "gpt-3": "shareAI-GPT",
11 |         "OpenAI": "shareAI",
12 |         "openAI": "shareAI",
13 |         "openai": "shareAI",
14 |     }
15 |     # 定义要删除的关键词
16 |     delete_keywords = ["无法", "不能", "can't", "can not"]
17 | 
18 |     with open(input_file, 'r') as input_f, open(output_file, 'w') as output_f:
19 |         for line in input_f:
20 |             if any(keyword in line for keyword in delete_keywords):
21 |                 continue  # 如果包含要删除的关键词，则跳过该行
22 | 
23 |             # 逐个检查关键词并替换
24 |             for keyword, replacement in keywords.items():
25 |                 line = line.replace(keyword, replacement)
26 |             
27 |             # 将替换后的行写入输出文件
28 |             output_f.write(line)
29 | 
30 |     print("关键词替换并删除行完成！")
31 | 
32 | 
33 | # 指定输入文件和输出文件的路径
34 | input_file_path = "./dataset/v2/merged.jsonl"
35 | output_file_path = "./train_v2.jsonl"
36 | 
37 | # 调用函数进行关键词替换和删除行
38 | replace_keywords_and_remove_lines(input_file_path, output_file_path)
39 | 


--------------------------------------------------------------------------------
/deploy/LMStudio/README.md:
--------------------------------------------------------------------------------
 1 | # LM Studio部署方式
 2 | <img width="1200" alt="image" src="https://github.com/CrazyBoyM/llama3-Chinese-chat/assets/35400185/7c692073-2103-41fa-b9aa-c4254a66ada0">
 3 | 
 4 | 官方文档：https://lmstudio.ai/docs/welcome  
 5 | 官网下载：https://lmstudio.ai/
 6 | ### 设备支持列表
 7 | - Apple Silicon Mac (M1/M2/M3) with macOS 13.6 or newer
 8 | - Windows / Linux PC with a processor that supports AVX2 (typically newer PCs)
 9 | - 16GB+ of RAM is recommended. For PCs, 6GB+ of VRAM is recommended
10 | - NVIDIA/AMD GPUs supported
11 | 
12 | ### 部署&使用教程
13 | 1 、首先去[官网下载](https://lmstudio.ai/)，选择你的操作系统对应的安装版本，下载并安装。  
14 | 2 、去网上下载gguf格式的文件到本地，整理为三级文件夹结构 （如/models/shareAI/llama3-dpo-zh/xxx.gguf）。  
15 | 3 、进行模型导入、选择对话预设模板，进行加载使用。  
16 | 具体可参考视频演示： [b站视频教程](https://www.bilibili.com/video/BV1nt421g79T)
17 | 
18 | ### API调用
19 | 首先，点击LM Studio的“Start Server”按钮打开api server，然后使用下面样例代码即可调用：
20 | ```python
21 | from openai import OpenAI
22 | 
23 | client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
24 | 
25 | completion = client.chat.completions.create(
26 |   model="shareAI/llama3-dpo-zh", # 需修改为你的具体模型信息
27 |   messages=[
28 |     {"role": "system", "content": "you are a helpful bot."},
29 |     {"role": "user", "content": "讲个笑话"}
30 |   ],
31 |   temperature=0.7,
32 |   stop=["<|eot_id|>","<|eos_id|>"],
33 | )
34 | 
35 | print(completion.choices[0].message)
36 | ```
37 | 


--------------------------------------------------------------------------------
/deploy/API/easy_server_demo.py:
--------------------------------------------------------------------------------
 1 | import uvicorn
 2 | import torch
 3 | from transformers import pipeline, AutoTokenizer
 4 | from fastapi import FastAPI, Request
 5 | 
 6 | app = FastAPI()
 7 | 
 8 | @app.post("/")
 9 | async def create_item(request: Request):
10 |     global pipe
11 |     data = await request.json()
12 |     prompt = data.get('prompt')
13 |     print(prompt)
14 | 
15 |     messages = [
16 |             {
17 |                 "role": "system",
18 |                 "content": "你是一个超级智者，名字叫shareAI-llama3，拥有优秀的问题解答能力。",
19 |             },
20 |             {"role": "user", "content": prompt}
21 |     ]
22 |     
23 |     response = pipe(messages)
24 |     # breakpoint()
25 |     print(response)
26 |     answer = {
27 |         "response": response[-1]["content"],
28 |         "status": 200,
29 |     }
30 |     return answer
31 | 
32 | 
33 | if __name__ == '__main__':
34 |     model_name_or_path = '/openbayes/home/baicai003/Llama3-Chinese-instruct-DPO-beta0___5'
35 |     tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)
36 |     pipe = pipeline(
37 |         "conversational", 
38 |         model_name_or_path, 
39 |         torch_dtype=torch.float16, 
40 |         device_map="auto", 
41 |         max_new_tokens=512, 
42 |         do_sample=True,
43 |         top_p=0.9, 
44 |         temperature=0.6, 
45 |         repetition_penalty=1.1,
46 |         eos_token_id=tokenizer.encode('<|eot_id|>')[0]
47 |     )
48 | 
49 |     uvicorn.run(app, host='0.0.0.0', port=9009)
50 |     
51 | 


--------------------------------------------------------------------------------
/task/agent/demo/0. 一个简单的对话agent例子.md:
--------------------------------------------------------------------------------
  1 | 
  2 | # 打造具备上下文感知能力的对话系统
  3 | ## 简介
  4 | 本指南将详细介绍如何构建一个能在多次交流中保持上下文连贯性的对话系统。我们将采用先进的AI框架，打造出一个能够进行更加自然流畅对话的代理。
  5 | ## 动因
  6 | 目前许多聊天机器人由于无法维持对话上下文，导致用户体验差强人意，甚至令人懊恼。本教程的目标就是解决这一问题，通过实现一个能够记忆并引用先前对话内容的对话系统，以此来提升整体的互动品质。
  7 | ## 关键部件
  8 | 1. **语言模型**：负责生成回答的核心AI模块。
  9 | 2. **提示模板**：确立我们的对话框架。
 10 | 3. **历史管理器**：负责对话历史与上下文的维护。
 11 | 4. **消息仓库**：存储各个对话会话的消息记录。
 12 | ## 实施细节
 13 | ### 环境搭建
 14 | 首先，搭建必要的AI框架，并确保有权使用合适的语言模型，这将为我们的对话系统打下基础。
 15 | ### 构建聊天历史仓库
 16 | 开发一套系统，以便管理多个对话会话。每个会话都应有其独特的标识，并与各自的消息历史相关联。
 17 | ### 设定对话框架
 18 | 制定一个包含以下内容的模板：
 19 | - 确立AI角色的系统信息
 20 | - 对话历史的预留位置
 21 | - 用户的输入内容
 22 |   
 23 | 这样的框架有助于引导AI的回应，并确保对话过程中的一致性。
 24 | ### 搭建对话链
 25 | 将提示模板与语言模型相结合，形成基础的对话链。并通过历史管理组件对其进行封装，自动处理对话历史的添加与调用。
 26 | ### 与系统互动
 27 | 使用系统时，需以用户输入和会话ID来调用对话系统。历史管理器将负责提取相应的对话历史，将其嵌入提示中，并在每次互动后更新消息记录。
 28 | ## 总结
 29 | 本方案构建对话系统具有以下优势：
 30 | - **上下文感知**：系统能够回顾对话的过往内容，实现更贴近自然的交流。
 31 | - **简洁性**：模块化的设计使得实施过程简单直接。
 32 | - **灵活性**：轻松调整对话框架或切换不同的语言模型。
 33 | - **可扩展性**：基于会话的处理方式便于管理众多独立的对话。
 34 |   
 35 | 在这个基础上，您还可以通过以下途径进一步优化系统：
 36 | - 实施更精细的提示工程
 37 | - 与外部知识库进行整合
 38 | - 为特定行业添加定制化功能
 39 | - 引入错误处理和对话修复策略
 40 | 专注于上下文管理，本对话系统设计大幅提升了基础聊天机器人的功能，为打造更具吸引力、更实用的AI助手指明了方向。
 41 | 
 42 | 
 43 | ## 代码实现
 44 | 
 45 | ### 导入必要的库
 46 | ```python
 47 | from langchain_openai import ChatOpenAI
 48 | from langchain_core.runnables.history import RunnableWithMessageHistory
 49 | from langchain.memory import ChatMessageHistory
 50 | from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
 51 | import os
 52 | from dotenv import load_dotenv
 53 | os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')
 54 | ```
 55 | 
 56 | ### 加载环境变量 & 初始化语言模型
 57 | ```python
 58 | load_dotenv()
 59 | llm = ChatOpenAI(model="gpt-4o-mini", max_tokens=1000, temperature=0)
 60 | ```
 61 | 
 62 | ### 创建一个简单的聊天上下文存储
 63 | ```python
 64 | store = {}
 65 | 
 66 | def get_chat_history(session_id: str):
 67 |     if session_id not in store:
 68 |         store[session_id] = ChatMessageHistory()
 69 |     return store[session_id]
 70 | ```
 71 | 
 72 | ### 创建promot模版
 73 | ```python
 74 | prompt = ChatPromptTemplate.from_messages([
 75 |     ("system", "You are a helpful AI assistant."),
 76 |     MessagesPlaceholder(variable_name="history"),
 77 |     ("human", "{input}")
 78 | ])
 79 | ```
 80 | 
 81 | ### 创建一个运行链路
 82 | ```python
 83 | chain = prompt | llm
 84 | ```
 85 | 
 86 | ### 把运行链和上下文存储放一块
 87 | ```python
 88 | chain_with_history = RunnableWithMessageHistory(
 89 |     chain,
 90 |     get_chat_history,
 91 |     input_messages_key="input",
 92 |     history_messages_key="history"
 93 | )
 94 | ```
 95 | 
 96 | ### 测试一下
 97 | 运行以下代码以向系统发起简单的询问：
 98 | ```python
 99 | session_id = "user_123"
100 | 
101 | response1 = chain_with_history.invoke(
102 |     {"input": "Hello! How are you?"},
103 |     config={"configurable": {"session_id": session_id}}
104 | )
105 | print("AI:", response1.content)
106 | 
107 | # 测试系统对上下文记忆的能力
108 | response2 = chain_with_history.invoke(
109 |     {"input": "What was my previous message?"},
110 |     config={"configurable": {"session_id": session_id}}
111 | )
112 | print("AI:", response2.content)
113 | ```
114 | 
115 | 运行输出内容：
116 | ```
117 | AI: Hello! I'm just a computer program, so I don't have feelings, but I'm here and ready to help you. How can I assist you today?
118 | AI: Your previous message was, "Hello! How are you?" How can I assist you further?
119 | ```
120 | 
121 | 使用以下代码打印运行过程：
122 | ```python
123 | print("\nConversation History:")
124 | for message in store[session_id].messages:
125 |     print(f"{message.type}: {message.content}")
126 | ```
127 | 
128 | 运行输出内容：
129 | ```
130 | Conversation History:
131 | human: Hello! How are you?
132 | ai: Hello! I'm just a computer program, so I don't have feelings, but I'm here and ready to help you. How can I assist you today?
133 | human: What was my previous message?
134 | ai: Your previous message was, "Hello! How are you?" How can I assist you further?
135 | ```
136 | 


--------------------------------------------------------------------------------
/deploy/API/README.md:
--------------------------------------------------------------------------------
  1 | # API部署
  2 | ## 简单实现版本
  3 | ### 代码准备
  4 | 首先安装包依赖：
  5 | ```shell
  6 | pip install -U transformers fastapi accelerate
  7 | ```
  8 | 然后运行easy_server_demo.py，以下为代码示例：
  9 | ```
 10 | import uvicorn
 11 | import torch
 12 | from transformers import pipeline, AutoTokenizer
 13 | from fastapi import FastAPI, Request
 14 | 
 15 | app = FastAPI()
 16 | 
 17 | @app.post("/")
 18 | async def create_item(request: Request):
 19 |     global pipe
 20 |     data = await request.json()
 21 |     prompt = data.get('prompt')
 22 |     print(prompt)
 23 | 
 24 |     messages = [
 25 |             {
 26 |                 "role": "system",
 27 |                 "content": "你是一个超级智者，名字叫shareAI-llama3，拥有优秀的问题解答能力。",
 28 |             },
 29 |             {"role": "user", "content": prompt}
 30 |     ]
 31 |     
 32 |     response = pipe(messages)
 33 |     # breakpoint()
 34 |     print(response)
 35 |     answer = {
 36 |         "response": response[-1]["content"],
 37 |         "status": 200,
 38 |     }
 39 |     return answer
 40 | 
 41 | 
 42 | if __name__ == '__main__':
 43 |     model_name_or_path = '/openbayes/home/baicai003/Llama3-Chinese-instruct-DPO-beta0___5'
 44 |     # 这里的模型路径替换为你本地的完整模型存储路径 （一般从huggingface或者modelscope上下载到）
 45 |     tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)
 46 |     pipe = pipeline(
 47 |         "conversational", 
 48 |         model_name_or_path, 
 49 |         torch_dtype=torch.float16, 
 50 |         device_map="auto", 
 51 |         max_new_tokens=512, 
 52 |         do_sample=True,
 53 |         top_p=0.9, 
 54 |         temperature=0.6, 
 55 |         repetition_penalty=1.1,
 56 |         eos_token_id=tokenizer.encode('<|eot_id|>')[0]
 57 |     )
 58 |     # 如果是base+sft模型需要替换<|eot_id|>为<|end_of_text|>，因为llama3 base模型里没有训练<|eot_id|>这个token
 59 | 
 60 |     uvicorn.run(app, host='0.0.0.0', port=9009) # 这里的端口替换为你实际想要监听的端口
 61 | ```
 62 | 
 63 | 上面代码中使用了transformers的[pipeline](https://github.com/huggingface/transformers/blob/main/docs/source/en/conversations.md)进行实现，具体来说，它相当于以下操作：
 64 | ```python
 65 | from transformers import AutoModelForCausalLM, AutoTokenizer
 66 | import torch
 67 | 
 68 | # 输入内容
 69 | chat = [
 70 |     {"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
 71 |     {"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
 72 | ]
 73 | 
 74 | # 1: 加载模型、分词器
 75 | model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto", torch_dtype=torch.bfloat16)
 76 | tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
 77 | 
 78 | # 2: 使用对话模板
 79 | formatted_chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
 80 | print("Formatted chat:\n", formatted_chat)
 81 | 
 82 | # 3: 将对话内容转为token (也可以在上一步直接开启tokenize=True)
 83 | inputs = tokenizer(formatted_chat, return_tensors="pt", add_special_tokens=False)
 84 | 
 85 | # 把tokens转移到GPU或者CPU上
 86 | inputs = {key: tensor.to(model.device) for key, tensor in inputs.items()}
 87 | print("Tokenized inputs:\n", inputs)
 88 | 
 89 | # 4: 使用模型生成一段文本
 90 | outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.)
 91 | print("Generated tokens:\n", outputs)
 92 | 
 93 | # 5: 把生成结果从离散token变为文本
 94 | decoded_output = tokenizer.decode(outputs[0][inputs['input_ids'].size(1):], skip_special_tokens=True)
 95 | print("Decoded output:\n", decoded_output)
 96 | ```
 97 | ### 调用测试
 98 | 命令：
 99 | ```shell
100 | curl -X POST "http://127.0.0.1:9009"  -H 'Content-Type: application/json'  -d '{"prompt": "先有鸡 还是先有蛋"}'
101 | ```
102 | 通过在终端执行以上命令即可调用，返回：
103 | ```json
104 | {
105 |   "response":"😂哈哈，老问题！🤯\n\n这个问题被称为“鸡和蛋的循环论证”，是指两个概念相互依赖、无法确定优先顺序的逻辑悖论。 🐓🥚\n\n从生物学角度来看，鸡蛋是鸟类的一种生殖方式，鸡的雏化过程中需要蛋孵化，而鸡又是蛋的产物。 👀\n\n那么，问题来了：如果说先有蛋，那么鸡就不存在了，因为鸡是蛋孵化出来的；如果说先有鸡，那么蛋就不存在了，因为鸡没有蛋来孵化。 🤔\n\n这个问题可以从多个方面去理解：\n\n1️⃣从演化角度来说，生物进化是一个漫长的过程，鸡和蛋都是自然选择和适应环境的结果。 🌳\n\n2️⃣从定义角度来说，鸡和蛋都是相互依赖的概念，鸡就是蛋孵化出来的，蛋就是鸡产出的。 🤝\n\n3️⃣从哲学角度来说，这个问题涉及到时间概念和空间概念的关系，时间和空间都不是线性的，存在某种程度的相对性。 🕰️\n\n总之，鸡和蛋的先后关系只是一个逻辑上的循环论证，实际上我们不需要担心这个问题，因为它们都是生物界中的常态现象！ 😊",
106 |   "status":200
107 | }
108 | ```
109 | 如果你需要在其他开发语言中使用，可以用gpt将调用命令转换为其他语言版本(如python、java、php）
110 | 
111 | ## OpenAI格式版本
112 | 请参考vLLM部署教程：https://github.com/CrazyBoyM/llama3-Chinese-chat/tree/main/deploy/vLLM
113 | 


--------------------------------------------------------------------------------
/tools/expand_embedding_and_lmhead.py:
--------------------------------------------------------------------------------
  1 | # copied from https://github.com/HIT-SCIR/Chinese-Mixtral-8x7B/blob/main/models/init_embeddings.py
  2 | from pathlib import Path
  3 | 
  4 | import fire
  5 | import matplotlib.pyplot as plt
  6 | import torch
  7 | from safetensors import safe_open
  8 | from transformers import AutoTokenizer
  9 | 
 10 | 
 11 | def init_embeddings_average(
 12 |         old_tokenizer,
 13 |         new_tokenizer,
 14 |         old_embeddings,
 15 |         old_lm_head,
 16 |         new_embeddings,
 17 |         new_lm_head,
 18 | ):
 19 |     # set zh embeddings as average of old embeddings, but keep en embeddings unchanged
 20 | 
 21 |     old_vocab_size = old_tokenizer.vocab_size
 22 |     new_vocab_size = new_tokenizer.vocab_size
 23 | 
 24 |     for id in range(old_vocab_size, new_vocab_size):
 25 |         zh_token = new_tokenizer.decode([id])
 26 | 
 27 |         zh_token_old_ids = old_tokenizer(zh_token)["input_ids"]
 28 |         if len(zh_token_old_ids) == 0:
 29 |             print(f"WARNING: id = {id} zh_token = `{zh_token}`, cannot be tokenized by old tokenizer, using <unk> id")
 30 |             zh_token_old_ids = [0]  # unk
 31 |         zh_token_old_embeddings_avg = sum([old_embeddings[oid] for oid in zh_token_old_ids]) / len(zh_token_old_ids)
 32 |         zh_token_old_lm_head_avg = sum([old_lm_head[oid] for oid in zh_token_old_ids]) / len(zh_token_old_ids)
 33 |         new_embeddings[id] = zh_token_old_embeddings_avg
 34 |         new_lm_head[id] = zh_token_old_lm_head_avg
 35 | 
 36 | 
 37 | def draw(old_embeddings, new_embeddings, save):
 38 |     if not save:
 39 |         return
 40 | 
 41 |     plt.figure()
 42 |     plt.title(f"old embeddings[:, :128]")
 43 |     plt.xlabel("d_model[:128]")
 44 |     plt.ylabel("vocab_size")
 45 |     plt.imshow(old_embeddings[:, :128].to(torch.float16).numpy(), aspect="auto")
 46 |     plt.savefig(f"old-embeddings.png")
 47 | 
 48 |     plt.figure()
 49 |     plt.title(f"new embeddings[:, :128]")
 50 |     plt.xlabel("d_model[:128]")
 51 |     plt.ylabel("vocab_size")
 52 |     plt.imshow(new_embeddings[:, :128].to(torch.float16).numpy(), aspect="auto")
 53 |     plt.savefig(f"new-embeddings.png")
 54 | 
 55 | 
 56 | def main(
 57 |         old_tokenizer: str,
 58 |         new_tokenizer: str,
 59 |         num_shards: int,
 60 |         old_model: str,
 61 |         new_model: str,
 62 |         save_embedding_plots: bool = False,
 63 | ):
 64 |     # load tokenizers
 65 |     old_tokenizer = AutoTokenizer.from_pretrained(old_tokenizer)
 66 |     new_tokenizer = AutoTokenizer.from_pretrained(new_tokenizer)
 67 |     new_vocab_size = len(new_tokenizer)  # __len__ = vocab_size + num_added_tokens
 68 | 
 69 |     # load embeddings and lm_head
 70 |     model_path_template = old_model + "/model-{index:05d}-of-{total:05d}.safetensors"
 71 |     model_dict = {}
 72 |     for i in range(1, num_shards + 1):
 73 |         shard_path = model_path_template.format(index=i, total=num_shards)
 74 |         with safe_open(shard_path, framework="pt", device="cpu") as f:
 75 |             for k in f.keys():
 76 |                 model_dict[k] = f.get_tensor(k)
 77 | 
 78 |     # shape:
 79 |     #   old_embeddings: (vocab_size, d_model)
 80 |     #   old_lm_head:    (vocab_size, d_model)
 81 |     old_embeddings = model_dict["model.embed_tokens.weight"]
 82 |     old_lm_head = model_dict["lm_head.weight"]
 83 | 
 84 |     # create new embeddings and lm_head
 85 |     #   en: copy from old
 86 |     #   zh: init with zero
 87 |     new_embeddings = torch.zeros((new_vocab_size, old_embeddings.shape[1]), dtype=old_embeddings.dtype)
 88 |     new_lm_head = torch.zeros((new_vocab_size, old_lm_head.shape[1]), dtype=old_lm_head.dtype)
 89 |     new_embeddings[: old_embeddings.shape[0]] = old_embeddings.clone()
 90 |     new_lm_head[: old_lm_head.shape[0]] = old_lm_head.clone()
 91 | 
 92 |     init_embeddings_average(
 93 |         old_tokenizer,
 94 |         new_tokenizer,
 95 |         old_embeddings,
 96 |         old_lm_head,
 97 |         new_embeddings,
 98 |         new_lm_head,
 99 |     )
100 | 
101 |     draw(old_embeddings, new_embeddings, save_embedding_plots)
102 | 
103 |     model_dict["model.embed_tokens.weight"] = new_embeddings
104 |     model_dict["lm_head.weight"] = new_lm_head
105 | 
106 |     torch.save(model_dict, Path(new_model) / "pytorch_model.bin")
107 |     print(f"Done! `new_vocab_size` = {new_vocab_size}, please update `config.json` manually.")
108 | 
109 | 
110 | if __name__ == "__main__":
111 |     fire.Fire(main)
112 | 


--------------------------------------------------------------------------------
/tools/transfer_hf_models_to_modelscope.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |   "nbformat": 4,
  3 |   "nbformat_minor": 0,
  4 |   "metadata": {
  5 |     "colab": {
  6 |       "provenance": [],
  7 |       "toc_visible": true,
  8 |       "authorship_tag": "ABX9TyO7Idjomnu0KJYEWjWHVBb7",
  9 |       "include_colab_link": true
 10 |     },
 11 |     "kernelspec": {
 12 |       "name": "python3",
 13 |       "display_name": "Python 3"
 14 |     },
 15 |     "language_info": {
 16 |       "name": "python"
 17 |     }
 18 |   },
 19 |   "cells": [
 20 |     {
 21 |       "cell_type": "markdown",
 22 |       "metadata": {
 23 |         "id": "view-in-github",
 24 |         "colab_type": "text"
 25 |       },
 26 |       "source": [
 27 |         "<a href=\"https://colab.research.google.com/github/CrazyBoyM/llama3-Chinese-chat/blob/main/tools/transfer_hf_models_to_modelscope.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 28 |       ]
 29 |     },
 30 |     {
 31 |       "cell_type": "code",
 32 |       "execution_count": null,
 33 |       "metadata": {
 34 |         "id": "9T4NXb4DYL5-"
 35 |       },
 36 |       "outputs": [],
 37 |       "source": [
 38 |         "# 安装需要的包\n",
 39 |         "! pip install -U gradio huggingface-hub modelscope"
 40 |       ]
 41 |     },
 42 |     {
 43 |       "cell_type": "code",
 44 |       "source": [
 45 |         "# 登录Huggingface的token，获取地址：https://huggingface.co/settings/tokens\n",
 46 |         "! huggingface-cli login --token \"请输入token\""
 47 |       ],
 48 |       "metadata": {
 49 |         "id": "QixnAi5wYpTy"
 50 |       },
 51 |       "execution_count": null,
 52 |       "outputs": []
 53 |     },
 54 |     {
 55 |       "cell_type": "code",
 56 |       "source": [
 57 |         "# 查看磁盘空间是否足够用于下载大模型\n",
 58 |         "! df -h"
 59 |       ],
 60 |       "metadata": {
 61 |         "id": "CAHpmzAlaW1f"
 62 |       },
 63 |       "execution_count": null,
 64 |       "outputs": []
 65 |     },
 66 |     {
 67 |       "cell_type": "code",
 68 |       "source": [
 69 |         "# 0. 设置需要下载的模型的名称和本地路径\n",
 70 |         "model_id = \"shareAI/CodeLLaMA-chat-13b-Chinese\" # 在hf的模型id，请修改为你自己想下载的模型\n",
 71 |         "local_model_dir = f\"/hf-models/{model_id}\" # 下载本地后存储位置\n",
 72 |         "! echo $local_model_dir"
 73 |       ],
 74 |       "metadata": {
 75 |         "id": "kUr71vasaOZ6"
 76 |       },
 77 |       "execution_count": null,
 78 |       "outputs": []
 79 |     },
 80 |     {
 81 |       "cell_type": "code",
 82 |       "source": [
 83 |         "# 1. 从huggingface下载模型\n",
 84 |         "\n",
 85 |         "! huggingface-cli download --resume-download $model_id --local-dir $local_model_dir --local-dir-use-symlinks False"
 86 |       ],
 87 |       "metadata": {
 88 |         "id": "4rfn2ZNkZ_dG"
 89 |       },
 90 |       "execution_count": null,
 91 |       "outputs": []
 92 |     },
 93 |     {
 94 |       "cell_type": "code",
 95 |       "source": [
 96 |         "# 1. 或者从huggingface下载数据集\n",
 97 |         "\n",
 98 |         "! huggingface-cli download --repo-type dataset --resume-download $model_id --local-dir $local_model_dir --local-dir-use-symlinks False"
 99 |       ],
100 |       "metadata": {
101 |         "id": "1PKPr6LxKphn"
102 |       },
103 |       "execution_count": null,
104 |       "outputs": []
105 |     },
106 |     {
107 |       "cell_type": "code",
108 |       "source": [
109 |         "\n",
110 |         "# 2. 上传模型到modelscope\n",
111 |         "\n",
112 |         "! touch $local_model_dir/configuration.json\n",
113 |         "! ls $local_model_dir\n",
114 |         "\n",
115 |         "from modelscope.hub.api import HubApi\n",
116 |         "from modelscope.hub.constants import Licenses, ModelVisibility\n",
117 |         "\n",
118 |         "# Modelscope SDK token获取地址：https://modelscope.cn/my/myaccesstoken\n",
119 |         "MODELSCOPE_ACCESS_TOKEN = '请输入token'\n",
120 |         "# 请注意ModelScope平台针对SDK访问和git访问两种模式，提供两种不同的访问令牌(token)。此处请使用SDK访问令牌。\n",
121 |         "\n",
122 |         "api = HubApi()\n",
123 |         "api.login(MODELSCOPE_ACCESS_TOKEN)\n",
124 |         "model_name = model_id.replace(\"/\", \"-\")\n",
125 |         "print(model_name)\n",
126 |         "\n",
127 |         "# 自己把shareAI修改成你的modelscope账户昵称\n",
128 |         "# 创建modelscope模型仓库\n",
129 |         "api.create_model(\n",
130 |         "    model_id=f\"shareAI/{model_name}\",\n",
131 |         "    visibility=ModelVisibility.PUBLIC,\n",
132 |         "    license=Licenses.APACHE_V2,\n",
133 |         "    chinese_name=\"push model weight\",\n",
134 |         ")\n",
135 |         "# 向modelscope仓库传输文件\n",
136 |         "api.push_model(\n",
137 |         "    model_id=f\"shareAI/{model_name}\",\n",
138 |         "    model_dir=local_model_dir\n",
139 |         ")"
140 |       ],
141 |       "metadata": {
142 |         "id": "w_gG3dC8ZXR3"
143 |       },
144 |       "execution_count": null,
145 |       "outputs": []
146 |     },
147 |     {
148 |       "cell_type": "code",
149 |       "source": [],
150 |       "metadata": {
151 |         "id": "1M5wbh7Yh_nq"
152 |       },
153 |       "execution_count": null,
154 |       "outputs": []
155 |     }
156 |   ]
157 | }


--------------------------------------------------------------------------------
/deploy/streamlit/web_gemma2_chat.py:
--------------------------------------------------------------------------------
  1 | # isort: skip_file
  2 | import copy
  3 | import warnings
  4 | from dataclasses import asdict, dataclass
  5 | from typing import Callable, List, Optional
  6 | 
  7 | import streamlit as st
  8 | import torch
  9 | from transformers.utils import logging
 10 | from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig  # isort: skip
 11 | from peft import PeftModel
 12 | from threading import Thread
 13 | from transformers import TextIteratorStreamer
 14 | 
 15 | logger = logging.get_logger(__name__)
 16 | st.set_page_config(page_title="gemma2-Chinese")
 17 | 
 18 | import argparse
 19 | 
 20 | @dataclass
 21 | class GenerationConfig:
 22 |     # this config is used for chat to provide more diversity
 23 |     max_length: int = 8192
 24 |     max_new_tokens: int = 600
 25 |     top_p: float = 0.8
 26 |     temperature: float = 0.8
 27 |     do_sample: bool = True
 28 |     repetition_penalty: float = 1.05
 29 | 
 30 | def on_btn_click():
 31 |     del st.session_state.messages
 32 | 
 33 | @st.cache_resource
 34 | def load_model(model_name_or_path, adapter_name_or_path=None, load_in_4bit=False):
 35 |     if load_in_4bit:
 36 |         quantization_config = BitsAndBytesConfig(
 37 |             load_in_4bit=True,
 38 |             bnb_4bit_compute_dtype=torch.float16,
 39 |             bnb_4bit_use_double_quant=True,
 40 |             bnb_4bit_quant_type="nf4",
 41 |             llm_int8_threshold=6.0,
 42 |             llm_int8_has_fp16_weight=False,
 43 |         )
 44 |     else:
 45 |         quantization_config = None
 46 | 
 47 |     model = AutoModelForCausalLM.from_pretrained(
 48 |         model_name_or_path,
 49 |         load_in_4bit=load_in_4bit,
 50 |         trust_remote_code=True,
 51 |         low_cpu_mem_usage=True,
 52 |         torch_dtype=torch.float16,
 53 |         device_map='auto',
 54 |         quantization_config=quantization_config
 55 |     )
 56 |     if adapter_name_or_path is not None:
 57 |         model = PeftModel.from_pretrained(model, adapter_name_or_path)
 58 |     model.eval()
 59 |     tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
 60 |     
 61 |     return model, tokenizer
 62 | 
 63 | def prepare_generation_config():
 64 |     with st.sidebar:
 65 |         st.title('超参数面板')
 66 |         # 大输入框
 67 |         system_prompt_content = st.text_area('系统提示词',
 68 |             "你是一个调皮活泼的中文智者，名字叫shareAI-gemma2，喜欢用有趣的语言和适当的表情回答问题。",
 69 |             height=200,
 70 |             key='system_prompt_content'
 71 |         )
 72 |         max_new_tokens = st.slider('最大回复长度', 100, 8192, 1020, step=8)
 73 |         top_p = st.slider('Top P', 0.0, 1.0, 0.8, step=0.01)
 74 |         temperature = st.slider('温度系数', 0.0, 1.0, 0.6, step=0.01)
 75 |         repetition_penalty = st.slider("重复惩罚系数", 1.0, 2.0, 1.07, step=0.01)
 76 |         st.button('重置聊天', on_click=on_btn_click)
 77 | 
 78 |     generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
 79 |                                          top_p=top_p,
 80 |                                          temperature=temperature,
 81 |                                          repetition_penalty=repetition_penalty,
 82 |                                         )
 83 | 
 84 |     return generation_config
 85 | 
 86 | def main(model_name_or_path, adapter_name_or_path):
 87 |     # torch.cuda.empty_cache()
 88 |     print('load model...')
 89 |     model, tokenizer = load_model(model_name_or_path, adapter_name_or_path=adapter_name_or_path, load_in_4bit=False)
 90 |     print('load model end.')
 91 | 
 92 |     st.title('gemma2-Chinese')
 93 | 
 94 |     generation_config = prepare_generation_config()
 95 | 
 96 |     # Initialize chat history
 97 |     if 'messages' not in st.session_state:
 98 |         st.session_state.messages = []
 99 | 
100 |     # Display chat messages from history on app rerun
101 |     for message in st.session_state.messages:
102 |         with st.chat_message(message['role']):
103 |             st.markdown(message['content'])
104 | 
105 |     # Accept user input
106 |     if prompt := st.chat_input('解释一下Vue的原理'):
107 |         # Display user message in chat message container
108 |         with st.chat_message('user'):
109 |             st.markdown(prompt)
110 |         # Add user message to chat history
111 |         st.session_state.messages.append({
112 |             'role': 'user',
113 |             'content': prompt,
114 |         })
115 | 
116 |         # Prepare chat history
117 |         chat = [
118 |             #{"role": "system", "content": st.session_state.system_prompt_content},
119 |             *st.session_state.messages
120 |         ]
121 |         #breakpoint()
122 | 
123 |         # Use chat template
124 |         formatted_chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
125 |         inputs = tokenizer(formatted_chat, return_tensors='pt').to(model.device)
126 |         streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)
127 | 
128 |         generation_kwargs = dict(
129 |             **inputs,
130 |             streamer=streamer,
131 |             max_new_tokens=generation_config.max_new_tokens,
132 |             do_sample=generation_config.do_sample,
133 |             top_p=generation_config.top_p,
134 |             temperature=generation_config.temperature,
135 |             repetition_penalty=generation_config.repetition_penalty,
136 |         )
137 | 
138 |         thread = Thread(target=model.generate, kwargs=generation_kwargs)
139 |         thread.start()
140 | 
141 |         response = ''
142 |         with st.chat_message('robot'):
143 |             message_placeholder = st.empty()
144 |             
145 |             count = 0
146 |             for token in streamer:
147 |                 count += 1
148 |                 if count > len(inputs['input_ids']):
149 |                     response += token
150 |                 message_placeholder.markdown(response + '▌')
151 |             message_placeholder.markdown(response)
152 | 
153 |         # Add robot response to chat history
154 |         st.session_state.messages.append({
155 |             'role': 'assistant',
156 |             'content': response,
157 |         })
158 |         # breakpoint()
159 |         torch.cuda.empty_cache()
160 | 
161 | if __name__ == '__main__':
162 | 
163 |     import sys
164 |     model_name_or_path = sys.argv[1]
165 |     if len(sys.argv) >= 3:
166 |         adapter_name_or_path = sys.argv[2]
167 |     else:
168 |         adapter_name_or_path = None
169 |     main(model_name_or_path, adapter_name_or_path)
170 | 


--------------------------------------------------------------------------------
/deploy/streamlit/web_llama3_chat.py:
--------------------------------------------------------------------------------
  1 | # isort: skip_file
  2 | import copy
  3 | import warnings
  4 | from dataclasses import asdict, dataclass
  5 | from typing import Callable, List, Optional
  6 | 
  7 | import streamlit as st
  8 | import torch
  9 | from transformers.utils import logging
 10 | from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig  # isort: skip
 11 | from peft import PeftModel
 12 | from threading import Thread
 13 | from transformers import TextIteratorStreamer
 14 | 
 15 | logger = logging.get_logger(__name__)
 16 | st.set_page_config(page_title="Llama3-Chinese")
 17 | 
 18 | import argparse
 19 | 
 20 | @dataclass
 21 | class GenerationConfig:
 22 |     # this config is used for chat to provide more diversity
 23 |     max_length: int = 8192
 24 |     max_new_tokens: int = 600
 25 |     top_p: float = 0.8
 26 |     temperature: float = 0.8
 27 |     do_sample: bool = True
 28 |     repetition_penalty: float = 1.05
 29 | 
 30 | def on_btn_click():
 31 |     del st.session_state.messages
 32 | 
 33 | @st.cache_resource
 34 | def load_model(model_name_or_path, adapter_name_or_path=None, load_in_4bit=False):
 35 |     if load_in_4bit:
 36 |         quantization_config = BitsAndBytesConfig(
 37 |             load_in_4bit=True,
 38 |             bnb_4bit_compute_dtype=torch.float16,
 39 |             bnb_4bit_use_double_quant=True,
 40 |             bnb_4bit_quant_type="nf4",
 41 |             llm_int8_threshold=6.0,
 42 |             llm_int8_has_fp16_weight=False,
 43 |         )
 44 |     else:
 45 |         quantization_config = None
 46 | 
 47 |     model = AutoModelForCausalLM.from_pretrained(
 48 |         model_name_or_path,
 49 |         load_in_4bit=load_in_4bit,
 50 |         trust_remote_code=True,
 51 |         low_cpu_mem_usage=True,
 52 |         torch_dtype=torch.float16,
 53 |         device_map='auto',
 54 |         quantization_config=quantization_config
 55 |     )
 56 |     if adapter_name_or_path is not None:
 57 |         model = PeftModel.from_pretrained(model, adapter_name_or_path)
 58 |     model.eval()
 59 |     tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
 60 |     
 61 |     return model, tokenizer
 62 | 
 63 | def prepare_generation_config():
 64 |     with st.sidebar:
 65 |         st.title('超参数面板')
 66 |         # 大输入框
 67 |         system_prompt_content = st.text_area('系统提示词',
 68 |             "你是一个调皮活泼的中文智者，名字叫shareAI-Llama3，喜欢用有趣的语言和适当的表情回答问题。",
 69 |             height=200,
 70 |             key='system_prompt_content'
 71 |         )
 72 |         max_new_tokens = st.slider('最大回复长度', 100, 8192, 1020, step=8)
 73 |         top_p = st.slider('Top P', 0.0, 1.0, 0.8, step=0.01)
 74 |         temperature = st.slider('温度系数', 0.0, 1.0, 0.6, step=0.01)
 75 |         repetition_penalty = st.slider("重复惩罚系数", 1.0, 2.0, 1.07, step=0.01)
 76 |         st.button('重置聊天', on_click=on_btn_click)
 77 | 
 78 |     generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
 79 |                                          top_p=top_p,
 80 |                                          temperature=temperature,
 81 |                                          repetition_penalty=repetition_penalty,
 82 |                                         )
 83 | 
 84 |     return generation_config
 85 | 
 86 | def main(model_name_or_path, adapter_name_or_path):
 87 |     # torch.cuda.empty_cache()
 88 |     print('load model...')
 89 |     model, tokenizer = load_model(model_name_or_path, adapter_name_or_path=adapter_name_or_path, load_in_4bit=False)
 90 |     print('load model end.')
 91 | 
 92 |     st.title('Llama3-Chinese')
 93 | 
 94 |     generation_config = prepare_generation_config()
 95 | 
 96 |     # Initialize chat history
 97 |     if 'messages' not in st.session_state:
 98 |         st.session_state.messages = []
 99 | 
100 |     # Display chat messages from history on app rerun
101 |     for message in st.session_state.messages:
102 |         with st.chat_message(message['role']):
103 |             st.markdown(message['content'])
104 | 
105 |     # Accept user input
106 |     if prompt := st.chat_input('解释一下Vue的原理'):
107 |         # Display user message in chat message container
108 |         with st.chat_message('user'):
109 |             st.markdown(prompt)
110 |         # Add user message to chat history
111 |         st.session_state.messages.append({
112 |             'role': 'user',
113 |             'content': prompt,
114 |         })
115 | 
116 |         # Prepare chat history
117 |         chat = [
118 |             {"role": "system", "content": st.session_state.system_prompt_content},
119 |             *st.session_state.messages
120 |         ]
121 |         #breakpoint()
122 | 
123 |         # Use chat template
124 |         formatted_chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
125 |         inputs = tokenizer(formatted_chat, return_tensors='pt').to(model.device)
126 |         streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)
127 | 
128 |         generation_kwargs = dict(
129 |             **inputs,
130 |             streamer=streamer,
131 |             max_new_tokens=generation_config.max_new_tokens,
132 |             do_sample=generation_config.do_sample,
133 |             top_p=generation_config.top_p,
134 |             temperature=generation_config.temperature,
135 |             repetition_penalty=generation_config.repetition_penalty,
136 |         )
137 | 
138 |         thread = Thread(target=model.generate, kwargs=generation_kwargs)
139 |         thread.start()
140 | 
141 |         response = ''
142 |         with st.chat_message('robot'):
143 |             message_placeholder = st.empty()
144 |             
145 |             count = 0
146 |             for token in streamer:
147 |                 count += 1
148 |                 if count > len(inputs['input_ids']):
149 |                     response += token
150 |                 message_placeholder.markdown(response + '▌')
151 |             message_placeholder.markdown(response)
152 | 
153 |         # Add robot response to chat history
154 |         st.session_state.messages.append({
155 |             'role': 'assistant',
156 |             'content': response,
157 |         })
158 |         # breakpoint()
159 |         torch.cuda.empty_cache()
160 | 
161 | if __name__ == '__main__':
162 | 
163 |     import sys
164 |     model_name_or_path = sys.argv[1]
165 |     if len(sys.argv) >= 3:
166 |         adapter_name_or_path = sys.argv[2]
167 |     else:
168 |         adapter_name_or_path = None
169 |     main(model_name_or_path, adapter_name_or_path)
170 | 


--------------------------------------------------------------------------------
/deploy/python/chat_demo.py:
--------------------------------------------------------------------------------
  1 | from transformers import AutoTokenizer, AutoConfig, AddedToken, AutoModelForCausalLM, BitsAndBytesConfig
  2 | from peft import PeftModel
  3 | from dataclasses import dataclass
  4 | from typing import Dict
  5 | import torch
  6 | import copy
  7 | 
  8 | ## 定义聊天模板
  9 | @dataclass
 10 | class Template:
 11 |     template_name:str
 12 |     system_format: str
 13 |     user_format: str
 14 |     assistant_format: str
 15 |     system: str
 16 |     stop_word: str
 17 | 
 18 | template_dict: Dict[str, Template] = dict()
 19 | 
 20 | def register_template(template_name, system_format, user_format, assistant_format, system, stop_word=None):
 21 |     template_dict[template_name] = Template(
 22 |         template_name=template_name,
 23 |         system_format=system_format,
 24 |         user_format=user_format,
 25 |         assistant_format=assistant_format,
 26 |         system=system,
 27 |         stop_word=stop_word,
 28 |     )
 29 | 
 30 | # 这里的系统提示词是训练时使用的，推理时可以自行尝试修改效果
 31 | register_template(
 32 |     template_name='llama3',
 33 |     system_format='<|begin_of_text|><<SYS>>\n{content}\n<</SYS>>\n\n',
 34 |     user_format='<|start_header_id|>user<|end_header_id|>\n\n{content}<|eot_id|>',
 35 |     assistant_format='<|start_header_id|>assistant<|end_header_id|>\n\n{content}<|end_of_text|>\n',
 36 |     system="You are a helpful, excellent and smart assistant. "
 37 |         "Please respond to the user using the language they input, ensuring the language is elegant and fluent."
 38 |         "If you don't know the answer to a question, please don't share false information.",
 39 |     stop_word='<|end_of_text|>'
 40 | )
 41 | 
 42 | 
 43 | ## 加载模型
 44 | def load_model(model_name_or_path, load_in_4bit=False, adapter_name_or_path=None):
 45 |     if load_in_4bit:
 46 |         quantization_config = BitsAndBytesConfig(
 47 |             load_in_4bit=True,
 48 |             bnb_4bit_compute_dtype=torch.float16,
 49 |             bnb_4bit_use_double_quant=True,
 50 |             bnb_4bit_quant_type="nf4",
 51 |             llm_int8_threshold=6.0,
 52 |             llm_int8_has_fp16_weight=False,
 53 |         )
 54 |     else:
 55 |         quantization_config = None
 56 | 
 57 |     # 加载base model
 58 |     model = AutoModelForCausalLM.from_pretrained(
 59 |         model_name_or_path,
 60 |         load_in_4bit=load_in_4bit,
 61 |         trust_remote_code=True,
 62 |         low_cpu_mem_usage=True,
 63 |         torch_dtype=torch.float16,
 64 |         device_map='auto',
 65 |         quantization_config=quantization_config
 66 |     )
 67 | 
 68 |     # 加载adapter
 69 |     if adapter_name_or_path is not None:
 70 |         model = PeftModel.from_pretrained(model, adapter_name_or_path)
 71 | 
 72 |     return model
 73 | 
 74 | ## 加载tokenizer
 75 | def load_tokenizer(model_name_or_path):
 76 |     tokenizer = AutoTokenizer.from_pretrained(
 77 |         model_name_or_path,
 78 |         trust_remote_code=True,
 79 |         use_fast=False
 80 |     )
 81 | 
 82 |     if tokenizer.pad_token is None:
 83 |         tokenizer.pad_token = tokenizer.eos_token
 84 | 
 85 |     return tokenizer
 86 | 
 87 | ## 构建prompt
 88 | def build_prompt(tokenizer, template, query, history, system=None):
 89 |     template_name = template.template_name
 90 |     system_format = template.system_format
 91 |     user_format = template.user_format
 92 |     assistant_format = template.assistant_format
 93 |     system = system if system is not None else template.system
 94 | 
 95 |     history.append({"role": 'user', 'message': query})
 96 |     input_ids = []
 97 | 
 98 |     # 添加系统信息
 99 |     if system_format is not None:
100 |         if system is not None:
101 |             system_text = system_format.format(content=system)
102 |             input_ids = tokenizer.encode(system_text, add_special_tokens=False)
103 |     # 拼接历史对话
104 |     for item in history:
105 |         role, message = item['role'], item['message']
106 |         if role == 'user':
107 |             message = user_format.format(content=message, stop_token=tokenizer.eos_token)
108 |         else:
109 |             message = assistant_format.format(content=message, stop_token=tokenizer.eos_token)
110 |         tokens = tokenizer.encode(message, add_special_tokens=False)
111 |         input_ids += tokens
112 |     input_ids = torch.tensor([input_ids], dtype=torch.long)
113 | 
114 |     return input_ids
115 | 
116 | 
117 | def main():
118 |     model_name_or_path = 'shareAI/llama3-Chinese-chat-8b' # 模型名称或路径，请修改这里
119 |     template_name = 'llama3'
120 |     adapter_name_or_path = None
121 | 
122 |     template = template_dict[template_name]
123 |     # 若开启4bit推理能够节省很多显存，但效果可能下降
124 |     load_in_4bit = False
125 | 
126 |     # 生成超参配置，可修改以取得更好的效果
127 |     max_new_tokens = 500 # 每次回复时，AI生成文本的最大长度
128 |     top_p = 0.9
129 |     temperature = 0.6 # 越大越有创造性，越小越保守
130 |     repetition_penalty = 1.1 # 越大越能避免吐字重复
131 | 
132 |     # 加载模型
133 |     print(f'Loading model from: {model_name_or_path}')
134 |     print(f'adapter_name_or_path: {adapter_name_or_path}')
135 |     model = load_model(
136 |         model_name_or_path,
137 |         load_in_4bit=load_in_4bit,
138 |         adapter_name_or_path=adapter_name_or_path
139 |     ).eval()
140 |     tokenizer = load_tokenizer(model_name_or_path if adapter_name_or_path is None else adapter_name_or_path)
141 |     if template.stop_word is None:
142 |         template.stop_word = tokenizer.eos_token
143 |     stop_token_id = tokenizer.encode(template.stop_word, add_special_tokens=True)
144 |     assert len(stop_token_id) == 1
145 |     stop_token_id = stop_token_id[0]
146 | 
147 |     history = []
148 | 
149 |     query = input('# User：')
150 |     while True:
151 |         query = query.strip()
152 |         input_ids = build_prompt(tokenizer, template, query, copy.deepcopy(history), system=None).to(model.device)
153 |         outputs = model.generate(
154 |             input_ids=input_ids, max_new_tokens=max_new_tokens, do_sample=True,
155 |             top_p=top_p, temperature=temperature, repetition_penalty=repetition_penalty,
156 |             eos_token_id=stop_token_id
157 |         )
158 |         outputs = outputs.tolist()[0][len(input_ids[0]):]
159 |         response = tokenizer.decode(outputs)
160 |         response = response.strip().replace(template.stop_word, "").strip()
161 | 
162 |         # 存储对话历史
163 |         history.append({"role": 'user', 'message': query})
164 |         history.append({"role": 'assistant', 'message': response})
165 | 
166 |         # 当对话长度超过6轮时，清空最早的对话，可自行修改
167 |         if len(history) > 12:
168 |             history = history[:-12]
169 | 
170 |         print("# Llama3-Chinese：{}".format(response))
171 |         query = input('# User：')
172 | 
173 | 
174 | if __name__ == '__main__':
175 |     main()
176 | 


--------------------------------------------------------------------------------
/deploy/web_streamlit_for_v1.py:
--------------------------------------------------------------------------------
  1 | # copied from https://github.com/SmartFlowAI/Llama3-XTuner-CN/blob/main/web_demo.py
  2 | # isort: skip_file
  3 | import copy
  4 | import warnings
  5 | from dataclasses import asdict, dataclass
  6 | from typing import Callable, List, Optional
  7 | 
  8 | import streamlit as st
  9 | import torch
 10 | from torch import nn
 11 | from transformers.generation.utils import (LogitsProcessorList,
 12 |                                            StoppingCriteriaList)
 13 | from transformers.utils import logging
 14 | 
 15 | from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig  # isort: skip
 16 | from peft import PeftModel
 17 | 
 18 | logger = logging.get_logger(__name__)
 19 | st.set_page_config(page_title="Llama3-Chinese")
 20 | 
 21 | import argparse
 22 | 
 23 | @dataclass
 24 | class GenerationConfig:
 25 |     # this config is used for chat to provide more diversity
 26 |     max_length: int = 8192
 27 |     max_new_tokens: int = 600
 28 |     top_p: float = 0.8
 29 |     temperature: float = 0.8
 30 |     do_sample: bool = True
 31 |     repetition_penalty: float = 1.05
 32 | 
 33 | 
 34 | @torch.inference_mode()
 35 | def generate_interactive(
 36 |     model,
 37 |     tokenizer,
 38 |     prompt,
 39 |     generation_config: Optional[GenerationConfig] = None,
 40 |     logits_processor: Optional[LogitsProcessorList] = None,
 41 |     stopping_criteria: Optional[StoppingCriteriaList] = None,
 42 |     prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor],
 43 |                                                 List[int]]] = None,
 44 |     additional_eos_token_id: Optional[int] = None,
 45 |     **kwargs,
 46 | ):
 47 |     inputs = tokenizer([prompt], return_tensors='pt')
 48 |     input_length = len(inputs['input_ids'][0])
 49 |     for k, v in inputs.items():
 50 |         inputs[k] = v.cuda()
 51 |     input_ids = inputs['input_ids']
 52 |     _, input_ids_seq_length = input_ids.shape[0], input_ids.shape[-1]
 53 |     if generation_config is None:
 54 |         generation_config = model.generation_config
 55 |     generation_config = copy.deepcopy(generation_config)
 56 |     model_kwargs = generation_config.update(**kwargs)
 57 |     bos_token_id, eos_token_id = (  # noqa: F841  # pylint: disable=W0612
 58 |         generation_config.bos_token_id,
 59 |         generation_config.eos_token_id,
 60 |     )
 61 |     if isinstance(eos_token_id, int):
 62 |         eos_token_id = [eos_token_id]
 63 |     if additional_eos_token_id is not None:
 64 |         eos_token_id.append(additional_eos_token_id)
 65 |     has_default_max_length = kwargs.get(
 66 |         'max_length') is None and generation_config.max_length is not None
 67 |     if has_default_max_length and generation_config.max_new_tokens is None:
 68 |         warnings.warn(
 69 |             f"Using 'max_length''s default ({repr(generation_config.max_length)}) \
 70 |                 to control the generation length. "
 71 |             'This behaviour is deprecated and will be removed from the \
 72 |                 config in v5 of Transformers -- we'
 73 |             ' recommend using `max_new_tokens` to control the maximum \
 74 |                 length of the generation.',
 75 |             UserWarning,
 76 |         )
 77 |     elif generation_config.max_new_tokens is not None:
 78 |         generation_config.max_length = generation_config.max_new_tokens + \
 79 |             input_ids_seq_length
 80 |         if not has_default_max_length:
 81 |             logger.warn(  # pylint: disable=W4902
 82 |                 f"Both 'max_new_tokens' (={generation_config.max_new_tokens}) "
 83 |                 f"and 'max_length'(={generation_config.max_length}) seem to "
 84 |                 "have been set. 'max_new_tokens' will take precedence. "
 85 |                 'Please refer to the documentation for more information. '
 86 |                 '(https://huggingface.co/docs/transformers/main/'
 87 |                 'en/main_classes/text_generation)',
 88 |                 UserWarning,
 89 |             )
 90 | 
 91 |     if input_ids_seq_length >= generation_config.max_length:
 92 |         input_ids_string = 'input_ids'
 93 |         logger.warning(
 94 |             f"Input length of {input_ids_string} is {input_ids_seq_length}, "
 95 |             f"but 'max_length' is set to {generation_config.max_length}. "
 96 |             'This can lead to unexpected behavior. You should consider'
 97 |             " increasing 'max_new_tokens'.")
 98 | 
 99 |     # 2. Set generation parameters if not already defined
100 |     logits_processor = logits_processor if logits_processor is not None \
101 |         else LogitsProcessorList()
102 |     stopping_criteria = stopping_criteria if stopping_criteria is not None \
103 |         else StoppingCriteriaList()
104 | 
105 |     logits_processor = model._get_logits_processor(
106 |         generation_config=generation_config,
107 |         input_ids_seq_length=input_ids_seq_length,
108 |         encoder_input_ids=input_ids,
109 |         prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
110 |         logits_processor=logits_processor,
111 |     )
112 | 
113 |     stopping_criteria = model._get_stopping_criteria(
114 |         generation_config=generation_config,
115 |         stopping_criteria=stopping_criteria)
116 |     logits_warper = model._get_logits_warper(generation_config)
117 | 
118 |     unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1)
119 |     scores = None
120 |     while True:
121 |         model_inputs = model.prepare_inputs_for_generation(
122 |             input_ids, **model_kwargs)
123 |         # forward pass to get next token
124 |         outputs = model(
125 |             **model_inputs,
126 |             return_dict=True,
127 |             output_attentions=False,
128 |             output_hidden_states=False,
129 |         )
130 | 
131 |         next_token_logits = outputs.logits[:, -1, :]
132 | 
133 |         # pre-process distribution
134 |         next_token_scores = logits_processor(input_ids, next_token_logits)
135 |         next_token_scores = logits_warper(input_ids, next_token_scores)
136 | 
137 |         # sample
138 |         probs = nn.functional.softmax(next_token_scores, dim=-1)
139 |         if generation_config.do_sample:
140 |             next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
141 |         else:
142 |             next_tokens = torch.argmax(probs, dim=-1)
143 | 
144 |         # update generated ids, model inputs, and length for next step
145 |         input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
146 |         model_kwargs = model._update_model_kwargs_for_generation(
147 |             outputs, model_kwargs, is_encoder_decoder=False)
148 |         unfinished_sequences = unfinished_sequences.mul(
149 |             (min(next_tokens != i for i in eos_token_id)).long())
150 | 
151 |         output_token_ids = input_ids[0].cpu().tolist()
152 |         output_token_ids = output_token_ids[input_length:]
153 |         for each_eos_token_id in eos_token_id:
154 |             if output_token_ids[-1] == each_eos_token_id:
155 |                 output_token_ids = output_token_ids[:-1]
156 |         response = tokenizer.decode(output_token_ids, skip_special_tokens=True)
157 | 
158 |         yield response
159 |         # stop when each sentence is finished
160 |         # or if we exceed the maximum length
161 |         if unfinished_sequences.max() == 0 or stopping_criteria(
162 |                 input_ids, scores):
163 |             break
164 | 
165 | 
166 | def on_btn_click():
167 |     del st.session_state.messages
168 | 
169 | 
170 | @st.cache_resource
171 | def load_model(model_name_or_path, adapter_name_or_path=None, load_in_4bit=False):
172 |     if load_in_4bit:
173 |         quantization_config = BitsAndBytesConfig(
174 |             load_in_4bit=True,
175 |             bnb_4bit_compute_dtype=torch.float16,
176 |             bnb_4bit_use_double_quant=True,
177 |             bnb_4bit_quant_type="nf4",
178 |             llm_int8_threshold=6.0,
179 |             llm_int8_has_fp16_weight=False,
180 |         )
181 |     else:
182 |         quantization_config = None
183 | 
184 |     model = AutoModelForCausalLM.from_pretrained(
185 |         model_name_or_path,
186 |         load_in_4bit=load_in_4bit,
187 |         trust_remote_code=True,
188 |         low_cpu_mem_usage=True,
189 |         torch_dtype=torch.float16,
190 |         device_map='auto',
191 |         quantization_config=quantization_config
192 |     )
193 |     if adapter_name_or_path is not None:
194 |         model = PeftModel.from_pretrained(model, adapter_name_or_path)
195 |     model.eval()
196 |     tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
197 |     
198 |     return model, tokenizer
199 | 
200 | 
201 | def prepare_generation_config():
202 |     with st.sidebar:
203 |         st.title('超参数面板')
204 |         max_new_tokens = st.slider('Max Generation Length', 100, 8192, 660, step=8)
205 |         top_p = st.slider('Top P', 0.0, 1.0, 0.8, step=0.01)
206 |         temperature = st.slider('Temperature', 0.0, 1.0, 0.7, step=0.01)
207 |         repetition_penalty = st.slider("Repetition Penalty", 1.0, 2.0, 1.07, step=0.01)
208 |         st.button('重置聊天')
209 | 
210 |     generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
211 |                                          top_p=top_p,
212 |                                          temperature=temperature,
213 |                                          repetition_penalty=repetition_penalty,
214 |                                         )
215 | 
216 |     return generation_config
217 | 
218 | system_prompt = '<|begin_of_text|><<SYS>>\n{content}\n<</SYS>>\n\n'
219 | user_prompt = '<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>'
220 | robot_prompt = '<|start_header_id|>assistant<|end_header_id|>\n\n{robot}<|end_of_text|>\n'
221 | cur_query_prompt = '<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
222 | 
223 | 
224 | def combine_history(prompt):
225 |     messages = st.session_state.messages
226 |     total_prompt = ''
227 |     for message in messages:
228 |         cur_content = message['content']
229 |         if message['role'] == 'user':
230 |             cur_prompt = user_prompt.format(user=cur_content)
231 |         elif message['role'] == 'robot':
232 |             cur_prompt = robot_prompt.format(robot=cur_content)
233 |         else:
234 |             raise RuntimeError
235 |         total_prompt += cur_prompt
236 |     system = system_prompt.format(content="你是一个超级人工智能，拥有全人类的智慧汇合。"
237 |         "你喜欢用文笔极佳的中文回复用户。"
238 |         "你需要结合中国的文化和聊天记录中的上文话题理解并推测用户意图，按要求正确回复用户问题。"
239 |         "注意使用恰当的文体和格式进行回复，尽量避免重复文字，避免重复句子，且单次回复尽可能简洁深邃。"
240 |         "你深思熟虑地回复用户。"
241 |     )
242 | 
243 |     total_prompt = system + total_prompt + cur_query_prompt.format(user=prompt)
244 |     return total_prompt
245 | 
246 | 
247 | def main(model_name_or_path, adapter_name_or_path):
248 |     # torch.cuda.empty_cache()
249 |     print('load model...')
250 |     model, tokenizer = load_model(model_name_or_path, adapter_name_or_path=adapter_name_or_path, load_in_4bit=False)
251 |     print('load model end.')
252 | 
253 |     st.title('Llama3-Chinese')
254 | 
255 |     generation_config = prepare_generation_config()
256 | 
257 |     # Initialize chat history
258 |     if 'messages' not in st.session_state:
259 |         st.session_state.messages = []
260 | 
261 |     # Display chat messages from history on app rerun
262 |     for message in st.session_state.messages:
263 |         with st.chat_message(message['role']):
264 |             st.markdown(message['content'])
265 | 
266 |     # Accept user input
267 |     if prompt := st.chat_input('解释一下Vue的原理'):
268 |         # Display user message in chat message container
269 |         with st.chat_message('user'):
270 |             st.markdown(prompt)
271 |         real_prompt = combine_history(prompt)
272 |         # Add user message to chat history
273 |         st.session_state.messages.append({
274 |             'role': 'user',
275 |             'content': prompt,
276 |         })
277 | 
278 |         with st.chat_message('robot'):
279 |             message_placeholder = st.empty()
280 |             for cur_response in generate_interactive(
281 |                     model=model,
282 |                     tokenizer=tokenizer,
283 |                     prompt=real_prompt,
284 |                     additional_eos_token_id=128001,  # <|end_of_text|>
285 |                     **asdict(generation_config),
286 |             ):
287 |                 # Display robot response in chat message container
288 |                 message_placeholder.markdown(cur_response + '▌')
289 |             message_placeholder.markdown(cur_response)
290 |         # Add robot response to chat history
291 |         st.session_state.messages.append({
292 |             'role': 'robot',
293 |             'content': cur_response,  # pylint: disable=undefined-loop-variable
294 |         })
295 |         torch.cuda.empty_cache()
296 | 
297 | 
298 | if __name__ == '__main__':
299 | 
300 |     import sys
301 |     model_name_or_path = sys.argv[1]
302 |     if len(sys.argv) >= 3:
303 |         adapter_name_or_path = sys.argv[2]
304 |     else:
305 |         adapter_name_or_path = None
306 |     main(model_name_or_path, adapter_name_or_path)


--------------------------------------------------------------------------------
/deploy/web_streamlit_for_instruct_v2.py:
--------------------------------------------------------------------------------
  1 | # copied from https://github.com/SmartFlowAI/Llama3-XTuner-CN/blob/main/web_demo.py
  2 | # isort: skip_file
  3 | import copy
  4 | import warnings
  5 | from dataclasses import asdict, dataclass
  6 | from typing import Callable, List, Optional
  7 | 
  8 | import streamlit as st
  9 | import torch
 10 | from torch import nn
 11 | from transformers.generation.utils import (LogitsProcessorList,
 12 |                                            StoppingCriteriaList)
 13 | from transformers.utils import logging
 14 | 
 15 | from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig  # isort: skip
 16 | from peft import PeftModel
 17 | 
 18 | logger = logging.get_logger(__name__)
 19 | st.set_page_config(page_title="Llama3-Chinese")
 20 | 
 21 | import argparse
 22 | 
 23 | @dataclass
 24 | class GenerationConfig:
 25 |     # this config is used for chat to provide more diversity
 26 |     max_length: int = 8192
 27 |     max_new_tokens: int = 600
 28 |     top_p: float = 0.8
 29 |     temperature: float = 0.8
 30 |     do_sample: bool = True
 31 |     repetition_penalty: float = 1.05
 32 | 
 33 | @torch.inference_mode()
 34 | def generate_interactive(
 35 |     model,
 36 |     tokenizer,
 37 |     prompt,
 38 |     generation_config: Optional[GenerationConfig] = None,
 39 |     logits_processor: Optional[LogitsProcessorList] = None,
 40 |     stopping_criteria: Optional[StoppingCriteriaList] = None,
 41 |     prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor],
 42 |                                                 List[int]]] = None,
 43 |     additional_eos_token_id: Optional[int] = None,
 44 |     **kwargs,
 45 | ):
 46 |     inputs = tokenizer([prompt], return_tensors='pt')
 47 |     input_length = len(inputs['input_ids'][0])
 48 |     for k, v in inputs.items():
 49 |         inputs[k] = v.cuda()
 50 |     input_ids = inputs['input_ids']
 51 |     _, input_ids_seq_length = input_ids.shape[0], input_ids.shape[-1]
 52 |     if generation_config is None:
 53 |         generation_config = model.generation_config
 54 |     generation_config = copy.deepcopy(generation_config)
 55 |     model_kwargs = generation_config.update(**kwargs)
 56 |     bos_token_id, eos_token_id = (  # noqa: F841  # pylint: disable=W0612
 57 |         generation_config.bos_token_id,
 58 |         generation_config.eos_token_id,
 59 |     )
 60 |     if isinstance(eos_token_id, int):
 61 |         eos_token_id = [eos_token_id]
 62 |     if additional_eos_token_id is not None:
 63 |         eos_token_id.append(additional_eos_token_id)
 64 |     has_default_max_length = kwargs.get(
 65 |         'max_length') is None and generation_config.max_length is not None
 66 |     if has_default_max_length and generation_config.max_new_tokens is None:
 67 |         warnings.warn(
 68 |             f"Using 'max_length''s default ({repr(generation_config.max_length)}) \
 69 |                 to control the generation length. "
 70 |             'This behaviour is deprecated and will be removed from the \
 71 |                 config in v5 of Transformers -- we'
 72 |             ' recommend using `max_new_tokens` to control the maximum \
 73 |                 length of the generation.',
 74 |             UserWarning,
 75 |         )
 76 |     elif generation_config.max_new_tokens is not None:
 77 |         generation_config.max_length = generation_config.max_new_tokens + \
 78 |             input_ids_seq_length
 79 |         if not has_default_max_length:
 80 |             logger.warn(  # pylint: disable=W4902
 81 |                 f"Both 'max_new_tokens' (={generation_config.max_new_tokens}) "
 82 |                 f"and 'max_length'(={generation_config.max_length}) seem to "
 83 |                 "have been set. 'max_new_tokens' will take precedence. "
 84 |                 'Please refer to the documentation for more information. '
 85 |                 '(https://huggingface.co/docs/transformers/main/'
 86 |                 'en/main_classes/text_generation)',
 87 |                 UserWarning,
 88 |             )
 89 | 
 90 |     if input_ids_seq_length >= generation_config.max_length:
 91 |         input_ids_string = 'input_ids'
 92 |         logger.warning(
 93 |             f"Input length of {input_ids_string} is {input_ids_seq_length}, "
 94 |             f"but 'max_length' is set to {generation_config.max_length}. "
 95 |             'This can lead to unexpected behavior. You should consider'
 96 |             " increasing 'max_new_tokens'.")
 97 | 
 98 |     # 2. Set generation parameters if not already defined
 99 |     logits_processor = logits_processor if logits_processor is not None \
100 |         else LogitsProcessorList()
101 |     stopping_criteria = stopping_criteria if stopping_criteria is not None \
102 |         else StoppingCriteriaList()
103 | 
104 |     logits_processor = model._get_logits_processor(
105 |         generation_config=generation_config,
106 |         input_ids_seq_length=input_ids_seq_length,
107 |         encoder_input_ids=input_ids,
108 |         prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
109 |         logits_processor=logits_processor,
110 |     )
111 | 
112 |     stopping_criteria = model._get_stopping_criteria(
113 |         generation_config=generation_config,
114 |         stopping_criteria=stopping_criteria)
115 |     logits_warper = model._get_logits_warper(generation_config)
116 | 
117 |     unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1)
118 |     scores = None
119 |     while True:
120 |         model_inputs = model.prepare_inputs_for_generation(
121 |             input_ids, **model_kwargs)
122 |         # forward pass to get next token
123 |         outputs = model(
124 |             **model_inputs,
125 |             return_dict=True,
126 |             output_attentions=False,
127 |             output_hidden_states=False,
128 |         )
129 | 
130 |         next_token_logits = outputs.logits[:, -1, :]
131 | 
132 |         # pre-process distribution
133 |         next_token_scores = logits_processor(input_ids, next_token_logits)
134 |         next_token_scores = logits_warper(input_ids, next_token_scores)
135 | 
136 |         # sample
137 |         probs = nn.functional.softmax(next_token_scores, dim=-1)
138 |         if generation_config.do_sample:
139 |             next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
140 |         else:
141 |             next_tokens = torch.argmax(probs, dim=-1)
142 | 
143 |         # update generated ids, model inputs, and length for next step
144 |         input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
145 |         model_kwargs = model._update_model_kwargs_for_generation(
146 |             outputs, model_kwargs, is_encoder_decoder=False)
147 |         unfinished_sequences = unfinished_sequences.mul(
148 |             (min(next_tokens != i for i in eos_token_id)).long())
149 | 
150 |         output_token_ids = input_ids[0].cpu().tolist()
151 |         output_token_ids = output_token_ids[input_length:]
152 |         for each_eos_token_id in eos_token_id:
153 |             if output_token_ids[-1] == each_eos_token_id:
154 |                 output_token_ids = output_token_ids[:-1]
155 |         response = tokenizer.decode(output_token_ids, skip_special_tokens=True)
156 | 
157 |         yield response
158 |         # stop when each sentence is finished
159 |         # or if we exceed the maximum length
160 |         if unfinished_sequences.max() == 0 or stopping_criteria(
161 |                 input_ids, scores):
162 |             break
163 | 
164 | 
165 | def on_btn_click():
166 |     del st.session_state.messages
167 | 
168 | 
169 | @st.cache_resource
170 | def load_model(model_name_or_path, adapter_name_or_path=None, load_in_4bit=False):
171 |     if load_in_4bit:
172 |         quantization_config = BitsAndBytesConfig(
173 |             load_in_4bit=True,
174 |             bnb_4bit_compute_dtype=torch.float16,
175 |             bnb_4bit_use_double_quant=True,
176 |             bnb_4bit_quant_type="nf4",
177 |             llm_int8_threshold=6.0,
178 |             llm_int8_has_fp16_weight=False,
179 |         )
180 |     else:
181 |         quantization_config = None
182 | 
183 |     model = AutoModelForCausalLM.from_pretrained(
184 |         model_name_or_path,
185 |         load_in_4bit=load_in_4bit,
186 |         trust_remote_code=True,
187 |         low_cpu_mem_usage=True,
188 |         torch_dtype=torch.float16,
189 |         device_map='auto',
190 |         quantization_config=quantization_config
191 |     )
192 |     if adapter_name_or_path is not None:
193 |         model = PeftModel.from_pretrained(model, adapter_name_or_path)
194 |     model.eval()
195 |     tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
196 |     
197 |     return model, tokenizer
198 | 
199 | 
200 | def prepare_generation_config():
201 |     with st.sidebar:
202 |         st.title('超参数面板')
203 |         # 大输入框
204 |         system_prompt_content = st.text_area('系统提示词',
205 |             "你是一个调皮活泼的中文智者，名字叫shareAI-Llama3，喜欢用有趣的语言和适当的表情回答问题。",
206 |             height=200,
207 |             key='system_prompt_content'
208 |         )
209 |         max_new_tokens = st.slider('最大回复长度', 100, 8192, 1020, step=8)
210 |         top_p = st.slider('Top P', 0.0, 1.0, 0.8, step=0.01)
211 |         temperature = st.slider('温度系数', 0.0, 1.0, 0.6, step=0.01)
212 |         repetition_penalty = st.slider("重复惩罚系数", 1.0, 2.0, 1.07, step=0.01)
213 |         st.button('重置聊天', on_click=on_btn_click)
214 | 
215 |     generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
216 |                                          top_p=top_p,
217 |                                          temperature=temperature,
218 |                                          repetition_penalty=repetition_penalty,
219 |                                         )
220 | 
221 |     return generation_config
222 | 
223 | system_prompt = '<|begin_of_text|><<SYS>>\n{content}\n<</SYS>>\n\n'
224 | user_prompt = '<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>'
225 | robot_prompt = '<|start_header_id|>assistant<|end_header_id|>\n\n{robot}<|eot_id|>'
226 | cur_query_prompt = '<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
227 | 
228 | 
229 | def combine_history(prompt):
230 |     messages = st.session_state.messages
231 |     total_prompt = ''
232 |     for message in messages:
233 |         cur_content = message['content']
234 |         if message['role'] == 'user':
235 |             cur_prompt = user_prompt.format(user=cur_content)
236 |         elif message['role'] == 'robot':
237 |             cur_prompt = robot_prompt.format(robot=cur_content)
238 |         else:
239 |             raise RuntimeError
240 |         total_prompt += cur_prompt
241 | 
242 |     system_prompt_content = st.session_state.system_prompt_content
243 |     system = system_prompt.format(content=system_prompt_content)
244 |     total_prompt = system + total_prompt + cur_query_prompt.format(user=prompt)
245 |     
246 |     return total_prompt
247 | 
248 | 
249 | def main(model_name_or_path, adapter_name_or_path):
250 |     # torch.cuda.empty_cache()
251 |     print('load model...')
252 |     model, tokenizer = load_model(model_name_or_path, adapter_name_or_path=adapter_name_or_path, load_in_4bit=False)
253 |     print('load model end.')
254 | 
255 |     st.title('Llama3-Chinese')
256 | 
257 |     generation_config = prepare_generation_config()
258 | 
259 |     # Initialize chat history
260 |     if 'messages' not in st.session_state:
261 |         st.session_state.messages = []
262 | 
263 |     # Display chat messages from history on app rerun
264 |     for message in st.session_state.messages:
265 |         with st.chat_message(message['role']):
266 |             st.markdown(message['content'])
267 | 
268 |     # Accept user input
269 |     if prompt := st.chat_input('解释一下Vue的原理'):
270 |         # Display user message in chat message container
271 |         with st.chat_message('user'):
272 |             st.markdown(prompt)
273 |         real_prompt = combine_history(prompt)
274 |         # Add user message to chat history
275 |         st.session_state.messages.append({
276 |             'role': 'user',
277 |             'content': prompt,
278 |         })
279 | 
280 |         with st.chat_message('robot'):
281 |             message_placeholder = st.empty()
282 |             for cur_response in generate_interactive(
283 |                     model=model,
284 |                     tokenizer=tokenizer,
285 |                     prompt=real_prompt,
286 |                     additional_eos_token_id=128009,
287 |                     **asdict(generation_config),
288 |             ):
289 |                 # Display robot response in chat message container
290 |                 message_placeholder.markdown(cur_response + '▌')
291 |             message_placeholder.markdown(cur_response)
292 |         # Add robot response to chat history
293 |         st.session_state.messages.append({
294 |             'role': 'robot',
295 |             'content': cur_response,  # pylint: disable=undefined-loop-variable
296 |         })
297 |         torch.cuda.empty_cache()
298 | 
299 | 
300 | if __name__ == '__main__':
301 | 
302 |     import sys
303 |     model_name_or_path = sys.argv[1]
304 |     if len(sys.argv) >= 3:
305 |         adapter_name_or_path = sys.argv[2]
306 |     else:
307 |         adapter_name_or_path = None
308 |     main(model_name_or_path, adapter_name_or_path)
309 | 


--------------------------------------------------------------------------------
/deploy/web_streamlit_for_instruct.py:
--------------------------------------------------------------------------------
  1 | # copied from https://github.com/SmartFlowAI/Llama3-XTuner-CN/blob/main/web_demo.py
  2 | # isort: skip_file
  3 | import copy
  4 | import warnings
  5 | from dataclasses import asdict, dataclass
  6 | from typing import Callable, List, Optional
  7 | 
  8 | import streamlit as st
  9 | import torch
 10 | from torch import nn
 11 | from transformers.generation.utils import (LogitsProcessorList,
 12 |                                            StoppingCriteriaList)
 13 | from transformers.utils import logging
 14 | 
 15 | from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig  # isort: skip
 16 | from peft import PeftModel
 17 | 
 18 | logger = logging.get_logger(__name__)
 19 | st.set_page_config(page_title="Llama3-Chinese")
 20 | 
 21 | import argparse
 22 | 
 23 | @dataclass
 24 | class GenerationConfig:
 25 |     # this config is used for chat to provide more diversity
 26 |     max_length: int = 8192
 27 |     max_new_tokens: int = 600
 28 |     top_p: float = 0.8
 29 |     temperature: float = 0.8
 30 |     do_sample: bool = True
 31 |     repetition_penalty: float = 1.05
 32 | 
 33 | @torch.inference_mode()
 34 | def generate_interactive(
 35 |     model,
 36 |     tokenizer,
 37 |     prompt,
 38 |     generation_config: Optional[GenerationConfig] = None,
 39 |     logits_processor: Optional[LogitsProcessorList] = None,
 40 |     stopping_criteria: Optional[StoppingCriteriaList] = None,
 41 |     prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor],
 42 |                                                 List[int]]] = None,
 43 |     additional_eos_token_id: Optional[int] = None,
 44 |     **kwargs,
 45 | ):
 46 |     inputs = tokenizer([prompt], return_tensors='pt')
 47 |     input_length = len(inputs['input_ids'][0])
 48 |     for k, v in inputs.items():
 49 |         inputs[k] = v.cuda()
 50 |     input_ids = inputs['input_ids']
 51 |     _, input_ids_seq_length = input_ids.shape[0], input_ids.shape[-1]
 52 |     if generation_config is None:
 53 |         generation_config = model.generation_config
 54 |     generation_config = copy.deepcopy(generation_config)
 55 |     model_kwargs = generation_config.update(**kwargs)
 56 |     bos_token_id, eos_token_id = (  # noqa: F841  # pylint: disable=W0612
 57 |         generation_config.bos_token_id,
 58 |         generation_config.eos_token_id,
 59 |     )
 60 |     if isinstance(eos_token_id, int):
 61 |         eos_token_id = [eos_token_id]
 62 |     if additional_eos_token_id is not None:
 63 |         eos_token_id.append(additional_eos_token_id)
 64 |     has_default_max_length = kwargs.get(
 65 |         'max_length') is None and generation_config.max_length is not None
 66 |     if has_default_max_length and generation_config.max_new_tokens is None:
 67 |         warnings.warn(
 68 |             f"Using 'max_length''s default ({repr(generation_config.max_length)}) \
 69 |                 to control the generation length. "
 70 |             'This behaviour is deprecated and will be removed from the \
 71 |                 config in v5 of Transformers -- we'
 72 |             ' recommend using `max_new_tokens` to control the maximum \
 73 |                 length of the generation.',
 74 |             UserWarning,
 75 |         )
 76 |     elif generation_config.max_new_tokens is not None:
 77 |         generation_config.max_length = generation_config.max_new_tokens + \
 78 |             input_ids_seq_length
 79 |         if not has_default_max_length:
 80 |             logger.warn(  # pylint: disable=W4902
 81 |                 f"Both 'max_new_tokens' (={generation_config.max_new_tokens}) "
 82 |                 f"and 'max_length'(={generation_config.max_length}) seem to "
 83 |                 "have been set. 'max_new_tokens' will take precedence. "
 84 |                 'Please refer to the documentation for more information. '
 85 |                 '(https://huggingface.co/docs/transformers/main/'
 86 |                 'en/main_classes/text_generation)',
 87 |                 UserWarning,
 88 |             )
 89 | 
 90 |     if input_ids_seq_length >= generation_config.max_length:
 91 |         input_ids_string = 'input_ids'
 92 |         logger.warning(
 93 |             f"Input length of {input_ids_string} is {input_ids_seq_length}, "
 94 |             f"but 'max_length' is set to {generation_config.max_length}. "
 95 |             'This can lead to unexpected behavior. You should consider'
 96 |             " increasing 'max_new_tokens'.")
 97 | 
 98 |     # 2. Set generation parameters if not already defined
 99 |     logits_processor = logits_processor if logits_processor is not None \
100 |         else LogitsProcessorList()
101 |     stopping_criteria = stopping_criteria if stopping_criteria is not None \
102 |         else StoppingCriteriaList()
103 | 
104 |     logits_processor = model._get_logits_processor(
105 |         generation_config=generation_config,
106 |         input_ids_seq_length=input_ids_seq_length,
107 |         encoder_input_ids=input_ids,
108 |         prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
109 |         logits_processor=logits_processor,
110 |     )
111 | 
112 |     stopping_criteria = model._get_stopping_criteria(
113 |         generation_config=generation_config,
114 |         stopping_criteria=stopping_criteria)
115 |     logits_warper = model._get_logits_warper(generation_config)
116 | 
117 |     unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1)
118 |     scores = None
119 |     while True:
120 |         model_inputs = model.prepare_inputs_for_generation(
121 |             input_ids, **model_kwargs)
122 |         # forward pass to get next token
123 |         outputs = model(
124 |             **model_inputs,
125 |             return_dict=True,
126 |             output_attentions=False,
127 |             output_hidden_states=False,
128 |         )
129 | 
130 |         next_token_logits = outputs.logits[:, -1, :]
131 | 
132 |         # pre-process distribution
133 |         next_token_scores = logits_processor(input_ids, next_token_logits)
134 |         next_token_scores = logits_warper(input_ids, next_token_scores)
135 | 
136 |         # sample
137 |         probs = nn.functional.softmax(next_token_scores, dim=-1)
138 |         if generation_config.do_sample:
139 |             next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
140 |         else:
141 |             next_tokens = torch.argmax(probs, dim=-1)
142 | 
143 |         # update generated ids, model inputs, and length for next step
144 |         input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
145 |         model_kwargs = model._update_model_kwargs_for_generation(
146 |             outputs, model_kwargs, is_encoder_decoder=False)
147 |         unfinished_sequences = unfinished_sequences.mul(
148 |             (min(next_tokens != i for i in eos_token_id)).long())
149 | 
150 |         output_token_ids = input_ids[0].cpu().tolist()
151 |         output_token_ids = output_token_ids[input_length:]
152 |         for each_eos_token_id in eos_token_id:
153 |             if output_token_ids[-1] == each_eos_token_id:
154 |                 output_token_ids = output_token_ids[:-1]
155 |         response = tokenizer.decode(output_token_ids, skip_special_tokens=True)
156 | 
157 |         yield response
158 |         # stop when each sentence is finished
159 |         # or if we exceed the maximum length
160 |         if unfinished_sequences.max() == 0 or stopping_criteria(
161 |                 input_ids, scores):
162 |             break
163 | 
164 | 
165 | def on_btn_click():
166 |     del st.session_state.messages
167 | 
168 | 
169 | @st.cache_resource
170 | def load_model(model_name_or_path, adapter_name_or_path=None, load_in_4bit=False):
171 |     if load_in_4bit:
172 |         quantization_config = BitsAndBytesConfig(
173 |             load_in_4bit=True,
174 |             bnb_4bit_compute_dtype=torch.float16,
175 |             bnb_4bit_use_double_quant=True,
176 |             bnb_4bit_quant_type="nf4",
177 |             llm_int8_threshold=6.0,
178 |             llm_int8_has_fp16_weight=False,
179 |         )
180 |     else:
181 |         quantization_config = None
182 | 
183 |     model = AutoModelForCausalLM.from_pretrained(
184 |         model_name_or_path,
185 |         load_in_4bit=load_in_4bit,
186 |         trust_remote_code=True,
187 |         low_cpu_mem_usage=True,
188 |         torch_dtype=torch.float16,
189 |         device_map='auto',
190 |         quantization_config=quantization_config
191 |     )
192 |     if adapter_name_or_path is not None:
193 |         model = PeftModel.from_pretrained(model, adapter_name_or_path)
194 |     model.eval()
195 |     tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
196 |     
197 |     return model, tokenizer
198 | 
199 | 
200 | def prepare_generation_config():
201 |     with st.sidebar:
202 |         st.title('超参数面板')
203 |         # 大输入框
204 |         system_prompt_content = st.text_area('系统提示词',
205 |             "你是一个有创造的超级人工智能assistant,名字叫Llama3-Chinese,拥有全人类的所有知识。"
206 |             "你喜欢用幽默风趣的语言回复用户，但你更喜欢用准确、深入的答案。"
207 |             "你需要结合中国的文化和聊天记录中的上文话题理解和推测用户真正意图，按要求正确回复用户问题。"
208 |             "注意使用恰当的文体和格式进行回复，尽量避免重复文字和重复句子，且单次回复尽可能简洁深邃。"
209 |             "你关注讨论的上下文，深思熟虑地回复用户"
210 |             "如果你不知道某个问题的含义，请询问用户，并引导用户进行提问。"
211 |             "当用户说继续时,请接着aissistant上一次的回答进行继续回复。",
212 |             height=200,
213 |             key='system_prompt_content'
214 |         )
215 |         max_new_tokens = st.slider('最大回复长度', 100, 8192, 660, step=8)
216 |         top_p = st.slider('Top P', 0.0, 1.0, 0.8, step=0.01)
217 |         temperature = st.slider('温度系数', 0.0, 1.0, 0.7, step=0.01)
218 |         repetition_penalty = st.slider("重复惩罚系数", 1.0, 2.0, 1.07, step=0.01)
219 |         st.button('重置聊天', on_click=on_btn_click)
220 | 
221 |     generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
222 |                                          top_p=top_p,
223 |                                          temperature=temperature,
224 |                                          repetition_penalty=repetition_penalty,
225 |                                         )
226 | 
227 |     return generation_config
228 | 
229 | system_prompt = '<|begin_of_text|><<SYS>>\n{content}\n<</SYS>>\n\n'
230 | user_prompt = '<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>'
231 | robot_prompt = '<|start_header_id|>assistant<|end_header_id|>\n\n{robot}<|eot_id|>'
232 | cur_query_prompt = '<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
233 | 
234 | 
235 | def combine_history(prompt):
236 |     messages = st.session_state.messages
237 |     total_prompt = ''
238 |     for message in messages:
239 |         cur_content = message['content']
240 |         if message['role'] == 'user':
241 |             cur_prompt = user_prompt.format(user=cur_content)
242 |         elif message['role'] == 'robot':
243 |             cur_prompt = robot_prompt.format(robot=cur_content)
244 |         else:
245 |             raise RuntimeError
246 |         total_prompt += cur_prompt
247 | 
248 |     system_prompt_content = st.session_state.system_prompt_content
249 |     system = system_prompt.format(content=system_prompt_content)
250 |     total_prompt = system + total_prompt + cur_query_prompt.format(user=prompt)
251 |     
252 |     return total_prompt
253 | 
254 | 
255 | def main(model_name_or_path, adapter_name_or_path):
256 |     # torch.cuda.empty_cache()
257 |     print('load model...')
258 |     model, tokenizer = load_model(model_name_or_path, adapter_name_or_path=adapter_name_or_path, load_in_4bit=False)
259 |     print('load model end.')
260 | 
261 |     st.title('Llama3-Chinese')
262 | 
263 |     generation_config = prepare_generation_config()
264 | 
265 |     # Initialize chat history
266 |     if 'messages' not in st.session_state:
267 |         st.session_state.messages = []
268 | 
269 |     # Display chat messages from history on app rerun
270 |     for message in st.session_state.messages:
271 |         with st.chat_message(message['role']):
272 |             st.markdown(message['content'])
273 | 
274 |     # Accept user input
275 |     if prompt := st.chat_input('解释一下Vue的原理'):
276 |         # Display user message in chat message container
277 |         with st.chat_message('user'):
278 |             st.markdown(prompt)
279 |         real_prompt = combine_history(prompt)
280 |         # Add user message to chat history
281 |         st.session_state.messages.append({
282 |             'role': 'user',
283 |             'content': prompt,
284 |         })
285 | 
286 |         with st.chat_message('robot'):
287 |             message_placeholder = st.empty()
288 |             for cur_response in generate_interactive(
289 |                     model=model,
290 |                     tokenizer=tokenizer,
291 |                     prompt=real_prompt,
292 |                     additional_eos_token_id=128009,
293 |                     **asdict(generation_config),
294 |             ):
295 |                 # Display robot response in chat message container
296 |                 message_placeholder.markdown(cur_response + '▌')
297 |             message_placeholder.markdown(cur_response)
298 |         # Add robot response to chat history
299 |         st.session_state.messages.append({
300 |             'role': 'robot',
301 |             'content': cur_response,  # pylint: disable=undefined-loop-variable
302 |         })
303 |         torch.cuda.empty_cache()
304 | 
305 | 
306 | if __name__ == '__main__':
307 | 
308 |     import sys
309 |     model_name_or_path = sys.argv[1]
310 |     if len(sys.argv) >= 3:
311 |         adapter_name_or_path = sys.argv[2]
312 |     else:
313 |         adapter_name_or_path = None
314 |     main(model_name_or_path, adapter_name_or_path)
315 | 


--------------------------------------------------------------------------------
/deploy/vLLM/README.md:
--------------------------------------------------------------------------------
  1 | # vLLM部署
  2 | ## 简单介绍 vllm
  3 | vLLM 是一个快速且易于使用的库，用于大型语言模型（LLM）的推理和服务，拥有接近榨干GPU的极速性能。
  4 | 
  5 | ### 速度优势
  6 | - 最先进的服务吞吐量
  7 | - 使用 PagedAttention 高效管理注意力键和值的内存
  8 | - 连续批处理传入请求
  9 | - 使用 CUDA/HIP 图实现快速模型执行
 10 | - 量化支持：GPTQ、AWQ、SqueezeLLM、FP8 KV 缓存
 11 | - 优化的 CUDA 内核
 12 | 
 13 | ### 灵活性和易用性
 14 | - 与流行的 Hugging Face 模型无缝集成
 15 | - 通过各种解码算法（包括并行采样、束搜索等）实现高吞吐量服务
 16 | - 支持张量并行以进行分布式推理
 17 | - 流式输出
 18 | - 兼容 OpenAI 的 API 服务器
 19 | - 支持 NVIDIA GPU 和 AMD GPU
 20 | - （实验性）前缀缓存支持
 21 | - （实验性）多 lora 支持
 22 | 
 23 | vLLM 无缝支持 HuggingFace 上大多数流行的开源模型，包括：
 24 | - 类 Transformer 的大型语言模型（例如 Llama）
 25 | - 专家混合大型语言模型（例如 Mixtral）
 26 | - 多模态大型语言模型（例如 LLaVA）
 27 | 
 28 | ## 安装
 29 | ```
 30 | pip install vllm
 31 | ```
 32 | ## 代码推理
 33 | 该方法适合运行一些需要加速模型推理运行，但整个模型不方便通过API单独部署的场景。
 34 | ```python
 35 | from vllm import LLM, SamplingParams
 36 | 
 37 | prompts = [
 38 |     "Hello, my name is",
 39 |     "The president of the United States is",
 40 |     "The capital of France is",
 41 |     "The future of AI is",
 42 | ]
 43 | sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
 44 | llm = LLM(model="PATH-TO-YOUR-MODEL/MODEL-ID")
 45 | 
 46 | outputs = llm.generate(prompts, sampling_params)
 47 | 
 48 | for output in outputs:
 49 |     prompt = output.prompt
 50 |     generated_text = output.outputs[0].text
 51 |     print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
 52 | ```
 53 | ## 后端部署 （推荐，兼容OpenAI格式）
 54 | ### 服务端部署API
 55 | 在服务器上执行以下命令，启动模型服务：
 56 | ```shell
 57 | python -m vllm.entrypoints.openai.api_server \
 58 |     --model /path/to/your_model \
 59 |     --served-model-name "llama3-cn" \
 60 |     --max-model-len=1024 \
 61 |     --api-key="xxx-abc-123"
 62 | ```
 63 | 
 64 | 说明：
 65 | - --api-key是指定一个给连接用的token密钥，你也可以不直接设置，而是通过环境变量设置`VLLM_API_KEY`传入。  
 66 | - --max-model-len是模型上下文长度，一般很多模型都是默认 8192，可以自行按需设置一个合适的大小，否则可能会超出你的GPU显存容量导致模型无法启动推理服务。    
 67 | - 默认会读取模型tokenizer_config.json中自带的对话模板，你也可以通过--chat-template自行指定一个模板。（需要写为.jinja文件）
 68 | ### 客户端测试API
 69 | 终端shell：
 70 | ```shell
 71 | curl http://服务器ip:端口/v1/chat/completions \
 72 |     -H 'Content-Type: application/json' \
 73 |     -H 'Accept: application/json' \
 74 |     -H 'Authorization: Bearer xxx-abc-123' \
 75 |     -d '{
 76 |         "model": "llama3-cn",
 77 |         "messages": [
 78 |             {"role": "system", "content": "You are a helpful assistant."},
 79 |             {"role": "user", "content": "讲个笑话"}
 80 |         ]
 81 |     }'
 82 | ```
 83 | python代码：
 84 | ```python
 85 | from openai import OpenAI
 86 | 
 87 | client = OpenAI(base_url="http://服务器ip:端口/v1", api_key="xxx-abc-123")
 88 | 
 89 | completion = client.chat.completions.create(
 90 |   model="llama3-cn",
 91 |   messages=[
 92 |     {"role": "system", "content": "You are a helpful assistant."},
 93 |     {"role": "user", "content": "讲个笑话"}
 94 |   ],
 95 |   temperature=0.7,
 96 |   stop=["<|eot_id|>"],
 97 | )
 98 | 
 99 | print(completion.choices[0].message)
100 | ```
101 | 
102 | 附：
103 | vllm部署支持的全部命令参数列表如下，可按需自行调整：
104 | 
105 | | 参数                     | 说明                                                                                                  | 默认值          |
106 | |-----------------------|-----------------------------------------------------------------------------------------------------|---------------|
107 | | --host                | 主机名                                                                                                |               |
108 | | --port                | 端口号                                                                                                | 8000          |
109 | | --uvicorn-log-level   | uvicorn的日志级别，可选值为debug, info, warning, error, critical, trace                               | "info"        |
110 | | --allow-credentials   | 是否允许凭证                                                                                          | False         |
111 | | --allowed-origins     | 允许的来源                                                                                            | ['*']         |
112 | | --allowed-methods     | 允许的方法                                                                                            | ['*']         |
113 | | --allowed-headers     | 允许的头部                                                                                            | ['*']         |
114 | | --api-key             | 如果提供，服务器将要求在标头中呈现此密钥。                                                                 |               |
115 | | --lora-modules        | LoRA模块配置，格式为name=path。可以指定多个模块。                                                         |               |
116 | | --chat-template       | 聊天模板的文件路径，或指定模型的单行形式模板。                                                            |               |
117 | | --response-role       | 如果request.add_generation_prompt=true，则返回的角色名称。                                                  | "assistant"   |
118 | | --ssl-keyfile         | SSL密钥文件的路径                                                                                       |               |
119 | | --ssl-certfile        | SSL证书文件的路径                                                                                       |               |
120 | | --ssl-ca-certs        | CA证书文件的路径                                                                                        |               |
121 | | --ssl-cert-reqs       | 是否需要客户端证书                                                                                       | 0             |
122 | | --root-path           | FastAPI的root_path，当应用程序在基于路径的路由代理后面时使用。                                                |               |
123 | | --middleware          | 应用于应用程序的额外ASGI中间件。接受多个--middleware参数。值应为导入路径。                                    | []            |
124 | | --model               | 要使用的huggingface模型的名称或路径。                                                                      | "facebook/opt-125m" |
125 | | --tokenizer           | 要使用的huggingface分词器的名称或路径。如果未指定，则将使用模型名称或路径。                                    |               |
126 | | --skip-tokenizer-init | 跳过分词器和解分词器的初始化。                                                                          | False         |
127 | | --revision            | 要使用的特定模型版本。可以是分支名称、标签名称或提交ID。如果未指定，将使用默认版本。                               |               |
128 | | --code-revision       | 在Hugging Face Hub上使用的模型代码的特定修订版。可以是分支名称、标签名称或提交ID。如果未指定，将使用默认版本。             |               |
129 | | --tokenizer-revision  | 要使用的huggingface分词器的修订版本。可以是分支名称、标签名称或提交ID。如果未指定，将使用默认版本。                      |               |
130 | | --tokenizer-mode     | 分词器模式，可选值为auto, slow。                                                                         | "auto"        |
131 | | --trust-remote-code   | 是否信任来自Hugging Face的远程代码。                                                                    | False         |
132 | | --download-dir        | 下载和加载权重的目录，默认为huggingface的默认缓存目录。                                                        |               |
133 | | --load-format         | 加载模型权重的格式，可选值为auto, pt, safetensors, npcache, dummy, tensorizer, bitsandbytes。                     | "auto"        |
134 | | --dtype               | 模型权重和激活的数据类型，可选值为auto, half, float16, bfloat16, float, float32。                                 | "auto"        |
135 | | --kv-cache-dtype      | kv缓存存储的数据类型，可选值为auto, fp8, fp8_e5m2, fp8_e4m3。如果为"auto"，将使用模型的数据类型。                           | "auto"        |
136 | | --quantization-param-path | 包含KV缓存缩放因子的JSON文件的路径。当KV缓存数据类型为FP8时，通常应提供此文件。否则，KV缓存缩放因子默认为1.0，可能导致精度问题。 |               |
137 | | --max-model-len       | 模型上下文长度。如果未指定，将自动从模型配置中获取。                                                            |               |
138 | | --guided-decoding-backend | 引导解码的引擎，可选值为outlines, lm-format-enforcer。默认为outlines。                                          | "outlines"    |
139 | | --distributed-executor-backend | 分布式服务的后端，可选值为ray, mp。当使用多个GPU时，如果安装了Ray，则自动设置为"ray"，否则设置为"mp"（多进程）。     |               |
140 | | --worker-use-ray      | 是否使用Ray启动LLM引擎作为单独的进程。已弃用，请使用--distributed-executor-backend=ray。                              | False         |
141 | | --pipeline-parallel-size, -pp | 管道并行的阶段数。                                                                                   | 1             |
142 | | --tensor-parallel-size, -tp | 张量并行的副本数。                                                                                   | 1             |
143 | | --max-parallel-loading-workers | 在使用张量并行和大模型时，按顺序加载模型以避免RAM OOM。                                                 |               |
144 | | --ray-workers-use-nsight | 如果指定，使用nsight对Ray工作进程进行性能分析。                                                       | False         |
145 | | --block-size          | 连续的令牌块大小，可选值为8, 16, 32。                                                                      | 16            |
146 | | --enable-prefix-caching | 启用自动前缀缓存。                                                                                   | False         |
147 | | --disable-sliding-window | 禁用滑动窗口，限制为滑动窗口大小。                                                                         | False         |
148 | | --use-v2-block-manager | 使用BlockSpaceMangerV2。                                                                              | False         |
149 | | --num-lookahead-slots | 用于推测解码的实验性调度配置。在将来将被speculative config取代；目前用于启用正确性测试。                            | 0             |
150 | | --seed                | 操作的随机种子。                                                                                      | 0             |
151 | | --swap-space          | 每个GPU的CPU交换空间大小（以GiB为单位）。                                                                    | 4             |
152 | | --gpu-memory-utilization | 模型执行器使用的GPU内存比例，范围从0到1。例如，0.5表示50%的GPU内存利用率。如果未指定，将使用默认值0.9。                     | 0.9           |
153 | | --num-gpu-blocks-override | 如果指定，忽略GPU分析结果，并使用此GPU块数。用于测试抢占。                                                   |               |
154 | | --max-num-batched-tokens | 每次迭代的最大批量令牌数。                                                                                |               |
155 | | --max-num-seqs        | 每次迭代的最大序列数。                                                                                  | 256           |
156 | | --max-logprobs        | 返回logprobs的最大数量，logprobs在SamplingParams中指定。                                                  | 20            |
157 | | --disable-log-stats   | 禁用日志统计信息。                                                                                     | False         |
158 | | --quantization, -q    | 权重量化的方法，可选值为aqlm, awq, deepspeedfp, fp8, marlin, gptq_marlin_24, gptq_marlin, gptq, squeezellm, compressed-tensors, bitsandbytes, None。 | |
159 | | --rope-scaling        | RoPE缩放配置的JSON格式。例如，{"type":"dynamic","factor":2.0}。                                           |               |
160 | | --rope-theta          | RoPE theta。与rope_scaling一起使用。在某些情况下，更改RoPE theta可以提高缩放模型的性能。                            |               |
161 | | --enforce-eager       | 始终使用eager-mode PyTorch。如果为False，将使用eager模式和CUDA图形混合以实现最大性能和灵活性。                        | False         |
162 | | --max-context-len-to-capture | CUDA图形捕获的最大上下文长度。当一个序列的上下文长度大于此值时，将回退到eager模式。                           |               |
163 | | --max-seq-len-to-capture | CUDA图形捕获的最大序列长度。当一个序列的上下文长度大于此值时，将回退到eager模式。                             | 8192          |
164 | | --disable-custom-all-reduce | 见ParallelConfig。                                                                                   | False         |
165 | | --tokenizer-pool-size | 用于异步分词的分词器池的大小。如果为0，将使用同步分词。                                                        | 0             |
166 | | --tokenizer-pool-type | 用于异步分词的分词器池的类型。如果tokenizer_pool_size为0，则忽略此参数。                                         | "ray"         |
167 | | --tokenizer-pool-extra-config | 分词器池的额外配置。这应该是一个JSON字符串，将被解析为字典。如果tokenizer_pool_size为0，则忽略此参数。                  |               |
168 | | --enable-lora         | 是否启用LoRA适配器的处理。                                                                             | False         |
169 | | --max-loras           | 单个批次中的最大LoRA数量。                                                                               | 1             |
170 | | --max-lora-rank       | 最大的LoRA rank。                                                                                      | 16            |
171 | | --lora-extra-vocab-size | LoRA适配器中可以存在的额外词汇的最大大小（添加到基本模型词汇表中）。                                                 | 256           |
172 | | --lora-dtype          | LoRA的数据类型，可选值为auto, float16, bfloat16, float32。如果为"auto"，将使用基本模型的数据类型。                        | "auto"        |
173 | | --long-lora-scaling-factors | 指定多个缩放因子（可以不同于基本模型的缩放因子）以允许同时使用使用这些缩放因子训练的多个LoRA适配器。如果未指定，只允许使用使用基本模型缩放因子训练的适配器。 | |
174 | | --max-cpu-loras       | 存储在CPU内存中的最大LoRA数量。必须大于等于max_num_seqs。默认为max_num_seqs。                                      |               |
175 | | --fully-sharded-loras | 默认情况下，只有一半的LoRA计算使用张量并行。启用此选项将使用完全分片的层。在高序列长度、最大rank或张量并行大小下，这可能更快。         | False         |
176 | | --device              | 设备类型，可选值为auto, cuda, neuron, cpu, tpu。                                                           | "auto"        |
177 | | --image-input-type    | 传递给vLLM的图像输入类型，可选值为pixel_values, image_features。                                              |               |
178 | | --image-token-id      | 图像令牌的输入ID。                                                                                     |               |
179 | | --image-input-shape   | 给定输入类型的最大图像输入形状（内存占用最大）。仅用于vLLM的profile_run。                                       |               |
180 | | --image-feature-size  | 图像特征在上下文维度上的大小。                                                                                |               |
181 | | --image-processor     | 要使用的huggingface图像处理器的名称或路径。如果未指定，则将使用模型名称或路径。                                    |               |
182 | | --image-processor-revision | 要使用的huggingface图像处理器版本的修订版本。可以是分支名称、标签名称或提交ID。如果未指定，将使用默认版本。                |               |
183 | | --disable-image-processor | 禁用图像处理器的使用，即使在huggingface的模型中定义了图像处理器。                                                 | False         |
184 | | --scheduler-delay-factor | 在调度下一个提示之前应用延迟（延迟因子乘以先前提示的延迟）。                                                   | 0.0           |
185 | | --enable-chunked-prefill | 如果设置，预填充请求可以根据max_num_batched_tokens进行分块。                                                 | False         |
186 | | --speculative-model   | 用于推测解码的草稿模型的名称。                                                                             |               |
187 | | --num-speculative-tokens | 在推测解码中从草稿模型中采样的推测令牌数量。                                                                  |               |
188 | | --speculative-max-model-len | 草稿模型支持的最大序列长度。超过此长度的序列将跳过推测解码。                                                   |               |
189 | | --speculative-disable-by-batch-size | 如果新的入站请求的排队请求数大于该值，则禁用推测解码。                                                    |               |
190 | | --ngram-prompt-lookup-max | 用于推测解码中ngram提示查找的窗口的最大大小。                                                                 |               |
191 | | --ngram-prompt-lookup-min | 用于推测解码中ngram提示查找的窗口的最小大小。                                                                 |               |
192 | | --model-loader-extra-config | 模型加载器的额外配置。将传递给相应的模型加载器。这应该是一个JSON字符串，将被解析为字典。                             |               |
193 | | --preemption_mode     | 如果为'recompute'，则引擎通过块交换执行抢占；如果为'swap'，则引擎通过块交换执行抢占。                                |               |
194 | | --served-model-name   | API中使用的模型名称。如果提供多个名称，服务器将响应任何提供的名称。响应中的模型名称将是列表中的第一个名称。如果未指定，模型名称将与--model参数相同。注意，如果提供多个名称，则指标的model_name标签内容也将使用第一个名称。 | |
195 | | --qlora-adapter-name-or-path | QLoRA适配器的名称或路径。                                                                             |               |
196 | | --engine-use-ray      | 是否使用Ray在单独的进程中启动LLM引擎作为服务器进程。                                                        | False         |
197 | | --disable-log-requests | 禁用请求日志记录。                                                                                     | False         |
198 | | --max-log-len         | 在日志中打印的最大提示字符数或提示ID数。                                                                      | Unlimited     |
199 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # llama3-Chinese-chat
  2 | [![HF Demo](https://img.shields.io/static/v1?label=Demo&message=OpenBayes%E8%B4%9D%E5%BC%8F%E8%AE%A1%E7%AE%97&color=green)](https://openbayes.com/console/hyperai-tutorials/containers/EzsoQaZB8LA)  &ensp;  
  3 | ### 1st version of Chinese-llama3 
  4 | 首个llama3 中文版  
  5 | 本仓库供交流llama3中文相关学习内容，欢迎任何热心朋友加入共建  
  6 | 
  7 | ### 训练 & 推理流程
  8 | <img width="67%" alt="训练流程图" src="https://github.com/user-attachments/assets/28fe6edc-9a54-44a7-8dba-f9a9173901ff" />
  9 | <img width="30%" alt="训练与推理" src="https://github.com/user-attachments/assets/22f9ea26-49ca-4d7a-adad-c338f2309b50" /> <br>
 10 | <center>看图快速学习: https://deepwiki.com/CrazyBoyM/llama3-Chinese-chat</center>
 11 | 
 12 | ### 通知 
 13 | 🔥新增LLM-Chinese仓库，欢迎关注，偏教程性质，以「模型中文化」为一个典型的模型训练问题切入场景，指导读者上手学习LLM二次微调训练：https://github.com/CrazyBoyM/LLM-Chinese （含gemma2 中文版模型，2b、 9b尺寸）
 14 | 
 15 | 如果你有自己微调的版本或者在网上发现有趣的特化版本，欢迎在issue区评论收录。  
 16 | 如有你有想要建设的内容版块，欢迎fork提交PR成为核心作者成员。  
 17 | (注意：目前不再接受仅修改单个字、句的typo-PR，请避免频繁提交该类PR)
 18 | 
 19 | 
 20 | ### News 更新记录
 21 | - 2024-07-25 llama3.1 中文DPO版训练权重放出。
 22 | - 2024-07-24 llama3.1 中文版训练计划启动。
 23 | - 2024-05-17 🎉 整理的llama3中文化数据集合在modelscope下载量达2.9k次，连续三周处于modelscope网站首页：[数据下载地址](https://modelscope.cn/datasets/baicai003/Llama3-Chinese-dataset/summary) 
 24 | - 2024-05-17 💪 增加 手写API部署教程、命令调用，[文档地址](https://github.com/CrazyBoyM/llama3-Chinese-chat/tree/main/deploy/API)
 25 | - 2024-05-13 💪 增加LMStudio电脑本地部署教程，[文档教程](https://github.com/CrazyBoyM/llama3-Chinese-chat/blob/main/deploy/LMStudio/README.md)，[手把手视频教程](https://www.bilibili.com/video/BV1nt421g79T)
 26 | - 2024-05-04 五一假期间：🚀 新增语言偏好强化对齐版本（直接对英文instruct版做DPO）。保持原汁原味的口吻回复(喜欢趣味语言、emoji表情)，[模型下载](https://modelscope.cn/models/baicai003/Llama3-Chinese-instruct-DPO-beta0.5/summary)，[gguf量化版下载](https://modelscope.cn/models/shareAI/llama-3-8b-Instruct-dpo-chinese-loftq-gguf/summary)，[语言偏好强化数据集工作已开源](https://huggingface.co/datasets/shareAI/DPO-zh-en-emoji)
 27 | - 2024-04-21 晚上2点：增加训练教程、推理教程、网页部署等文档整理
 28 | - 2023-04-20 晚上23点：instruct 中文版训练完成
 29 | - 2024-04-20 早上7点：v2版训练完成 
 30 | - 2024-04-19 下午1点：🍺 世界上首个llama3 中文版训练完成，晚上没睡觉哈哈，使用170k+高质量多轮中文对话数据连夜训练得到。
 31 | 
 32 | ### Demo 演示示例
 33 | #### llama3-base-8b 中文SFT版
 34 | <img width="1000" src="https://github.com/CrazyBoyM/llama3-Chinese-chat/assets/35400185/4057d600-11e6-424f-9705-267450b6f635">
 35 | 
 36 | #### llama3-instruct-8b 中文DPO版
 37 | <img width="1000" src="https://github.com/CrazyBoyM/llama3-Chinese-chat/assets/35400185/0330a118-7a38-44a7-8a48-a94bfb9eead2">
 38 | 
 39 | #### llama3.1-instruct-8b 中文DPO版
 40 | <img width="1000" alt="image" src="https://github.com/user-attachments/assets/7140ee4b-d2d5-42f6-976b-9379ec6a9811">
 41 | 
 42 | ### llama3 可用Chat版模型整理
 43 | llama3.1
 44 | - shareAI-DPO中文 8B版本 （RLHF中文）
 45 |   - 训练数据开源： https://huggingface.co/datasets/shareAI/DPO-zh-en-emoji
 46 |   - 训练细节分享：DPO(beta 0.5) + lora rank128, alpha256 + 打开"lm_head", "input_layernorm", "post_attention_layernorm", "norm"层训练.
 47 |   - 算力：8 * A100，5分钟，感谢opencsg社区的友情赞助支持。  
 48 |   - 模型下载 - OpenCSG： https://opencsg.com/models/shareAI/llama3.1-8b-instruct-dpo-zh  
 49 |   - 模型下载 - modelscope： https://modelscope.cn/models/shareAI/llama3.1-8b-instruct-dpo-zh  
 50 |   - 模型下载 - Huggingface： https://huggingface.co/shareAI/llama3.1-8b-instruct-dpo-zh
 51 |   - GGUF版本下载 （ollama、lmstudio可用）：https://huggingface.co/shareAI/llama3.1-8b-instruct-dpo-zh/blob/main/llama3.1_8b_chinese_chat_q4_k_m-shareAI.gguf
 52 |   - GGUF版本国内下载 （hf-mirror 国内加速站点）：https://hf-mirror.com/shareAI/llama3.1-8b-instruct-dpo-zh
 53 |   - ollama命令直接运行：`ollama run shareai/llama3.1-dpo-zh `
 54 | - openCSG wukong中文 405B版本 (SFT中文）
 55 |   - shareAI & openCSG联合发布
 56 |   - 介绍文章：https://mp.weixin.qq.com/s/7_lDZ6Zslq_WUckfuTToyQ
 57 |   - 模型开源：https://opencsg.com/models/OpenCSG/CSG-Wukong-Chinese-Llama3.1-405B
 58 | - openbuddy
 59 |   - openbuddy-llama3.1-8b（SFT中文）：https://modelscope.cn/models/OpenBuddy/openbuddy-llama3.1-8b-v22.1-131k
 60 | 
 61 | llama3相关对话版本优质权重整理：（欢迎issue补充）
 62 | - shareAI系列：
 63 |   - base预训练 + 直接中文SFT版:
 64 |      - 训练数据：https://modelscope.cn/datasets/baicai003/Llama3-Chinese-dataset/summary 
 65 |      - V1版
 66 |         - OpenCSG满速下载：https://opencsg.com/models/shareAI/llama3-Chinese-chat-8b
 67 |         - WiseModel满速下载：https://wisemodel.cn/models/shareAI/llama3-Chinese-chat-8b
 68 |      - V2版
 69 |         - modelscope：https://modelscope.cn/models/baicai003/Llama3-Chinese_v2/summary
 70 |         - 思维导图生成能力强化LoRA：https://modelscope.cn/models/shareAI/llama3-instruct-8b-cn-doc2markmap-lora
 71 |   - Instruct + 继续中文SFT版：
 72 |     - modelscope模型下载：https://modelscope.cn/models/baicai003/llama-3-8b-Instruct-chinese_v2/summary
 73 |     - 云服务器镜像在线体验（点击即用，免费 4 小时）：https://www.suanyun.cn/console/share?uuid=b1ba51908f8a4bd1af37148765c293ee
 74 |   - Instruct + 强化学习中文版：
 75 |     - llama3 instruct DPO版 （10分钟左右可训练好，对原多语言instruct版最小化性能损伤，实测超过大多中文大量训练版）
 76 |       - modelscope下载：https://modelscope.cn/models/baicai003/Llama3-Chinese-instruct-DPO-beta0.5/summary
 77 |       - 偏好学习数据集：[DPO-zh-en-emoji](https://huggingface.co/datasets/shareAI/DPO-zh-en-emoji)
 78 | 
 79 |   - Base预训练 + 海量中文优质数据增量预训练：正在进行中 
 80 |   - 70b 中文版：计划中
 81 | -  by zhuangxialie，因对话模版设置错误，需要用[fastchat](https://github.com/lm-sys/FastChat)体验：
 82 |     - Base + 中文SFT：https://modelscope.cn/models/zhuangxialie/Llama3_Chinese_Sft/files
 83 |     - Base + ORPO：https://modelscope.cn/models/zhuangxialie/Llama3-Chinese-ORPO/summary 
 84 |     - Instruct + DPO：https://www.modelscope.cn/models/zhuangxialie/Llama3-Chinese-DPO/summary
 85 | - llama3 Pro（加block版，推荐网友积极在该方案上做更多尝试、探索）：
 86 |   - linjh1118网友（第一个ORPO偏好对齐 + 扩展2*blocks）：https://github.com/linjh1118/Llama3-Chinese-ORPO
 87 | - llama3 Moe增强版：
 88 |   - cooper12121-llama3-8x8b-MoE：https://github.com/cooper12121/llama3-8x8b-MoE
 89 | - 长上下文版本：
 90 |   - 联通微调版v2 (中文，28k上下文）：https://huggingface.co/UnicomLLM/Unichat-llama3-Chinese-8B-28K
 91 |   - 262k上下文（英文）：https://huggingface.co/gradientai/Llama-3-8B-Instruct-262k
 92 |   - 262k上下文（中文）：计划中
 93 |   - 无限上下文版本：计划中，参考：https://medium.com/neoxia/llm-infini-attention-with-linear-complexity-3209b87a77c3
 94 | - 其他普通中文微调版本：
 95 |   - 中兴微调版（DPO） - 70B：https://www.modelscope.cn/models/ZTEAIM2024/Llama3_70B_instruct_chinese/summary
 96 |   - 联通微调版（SFT）：https://www.modelscope.cn/models/UnicomAI/Unichat-llama3-Chinese/summary
 97 |   - Openbuddy微调版（SFT，据说不错）：https://www.modelscope.cn/models/OpenBuddy/openbuddy-llama3-8b-v21.1-8k/summary
 98 |   - zhichen微调版（ORPO方法，应该是第一个orpo）：https://github.com/seanzhang-zhichen/llama3-chinese
 99 |   - shenzhi-wang微调版（ORPO方法，也说是第一个orpo）：https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat
100 |   - Rookie微调版（SFT）：https://github.com/Rookie1019/Llama-3-8B-Instruct-Chinese
101 |   - hit-sz klc-lab 微调版：[https://github.com/zyg18181818/Llama-3-Chinese](https://github.com/zyg18181818/Llama-3-Chinese)
102 | - 破解安全限制系列（nsfw）：
103 |   - Unholy：https://huggingface.co/Undi95/Llama-3-Unholy-8B
104 |   - neural-chat：https://hf-mirror.com/Locutusque/llama-3-neural-chat-v1-8b
105 |   - dolphin：https://huggingface.co/cognitivecomputations/dolphin-2.9-llama3-8b
106 |   - Orion: https://huggingface.co/Orion-zhen/Llama3-70B-Orion-Chinese 破限+中文, 并保留了原版llama3喜欢emoji的习惯
107 | - v-llama3 多模态版：（支持文字以外的输入、输出）
108 |   - 图像问答：
109 |     - Bunny-Llama-3-8B-V：https://wisemodel.cn/models/BAAI/Bunny-Llama-3-8B-V
110 |     - llava-llama-3-8b：https://huggingface.co/xtuner/llava-llama-3-8b-v1_1
111 |   - 视频理解（可支持 1 分钟内视频问答）：https://github.com/THUDM/CogVLM2  
112 | - agent工具能力增强版：
113 |   - ModelScope Chinese Agent版V1（可根据要求帮你选择工具，中文对话）：https://modelscope.cn/models/swift/Llama3-Chinese-8B-Instruct-Agent-v1/summary
114 | - EmoLLM 心理领域数据微调版：
115 |   - 在线体验链接：https://st-app-center-006861-9746-jlroxvg.openxlab.space/
116 |   - 或前往[OpenXLab EmoLLM3.0-Llama3](https://openxlab.org.cn/apps/detail/chg0901/EmoLLM-Llama3-8B-Instruct3.0)启动
117 |   - 模型下载地址
118 |     - OpenXLab： https://openxlab.org.cn/models/detail/chg0901/EmoLLM-Llama3-8B-Instruct3.0
119 |     - ModelScope： https://modelscope.cn/models/chg0901/EmoLLM-Llama3-8B-Instruct3.0/summary
120 | 
121 | - 小说、网文、故事撰写任务增强版：计划中
122 | - 音乐生成任务版：计划中
123 | - 猫娘扮演版：计划中
124 | - 涩涩版：计划中
125 | 
126 | 
127 | 注意由于只训练了常见对话，Base + SFT版有可能会出现不符合预期的回复 （尤其是对于一些非常见回答），本教程更多用于优质资源整理（包含如何对llama3进行中文微调，怎样制作中文对话数据集，角色扮演、agent能力增强，扩充上下文长度，如何进行网页部署和量化，手机、电脑cpu推理部署等），将会逐渐整理补充进来。  
128 | 
129 | ## 模型使用方式
130 | ### 云端服务部署
131 | #### 简单API方式
132 | 文档教程：https://github.com/CrazyBoyM/llama3-Chinese-chat/tree/main/deploy/API
133 | 
134 | #### vLLM方式 （推荐，兼容OpenAI格式）
135 | 文档教程：https://github.com/CrazyBoyM/llama3-Chinese-chat/tree/main/deploy/vLLM
136 | 
137 | ### 本地电脑部署
138 | #### LMStudio电脑本地部署方式 （有UI界面）
139 | 文档教程：https://github.com/CrazyBoyM/llama3-Chinese-chat/blob/main/deploy/LMStudio/README.md  
140 | 视频教程：https://www.bilibili.com/video/BV1nt421g79T
141 | 
142 | #### ollama 命令行工具方式 (推荐, 简单易用)
143 | 首先，去官网下载安装ollama：https://ollama.com/  
144 | 然后，打开终端命令行，执行以下命令即可开始与AI对话：
145 | ```
146 | ollama run shareai/llama3.1-dpo-zh
147 | ```
148 | 
149 | <img width="1000" alt="image" src="https://github.com/user-attachments/assets/7140ee4b-d2d5-42f6-976b-9379ec6a9811">
150 | 
151 | 
152 | #### Streamlit 网页推理方式 （适合训练后，调试、测试模型）
153 | <img width="1000" alt="image" src="https://github.com/CrazyBoyM/llama3-Chinese-chat/assets/35400185/b1176d48-1141-4c8f-a345-e1eb005306da">
154 | 
155 | ```
156 | pip install -U streamlit transformers==4.40.1
157 | ```
158 | 首先通过以上命令安装streamlit，然后通过下面命令启动网页以便访问，'/path/to/model'需要改成你的权重下载路径。  
159 | V1版本：
160 | ```shell
161 | streamlit run deploy/web_streamlit_for_v1.py /path/to/model --theme.base="dark"
162 | ```
163 | 
164 | Instruct版本 （支持自定义system prompt)
165 | ```
166 | streamlit run deploy/web_streamlit_for_instruct.py /path/to/model --theme.base="dark"
167 | ```
168 | Instruct DPO版  （支持自定义system prompt，喜欢使用有趣语言风格和表情回复)
169 | ```
170 | streamlit run deploy/web_streamlit_for_instruct_v2.py /path/to/model --theme.base="dark"
171 | ```
172 | #### Python 代码推理方式
173 | <details>
174 | <summary> 点击展开 </summary>
175 | 默认情况下直接运行以下代码即可体验llama3中文对话，请自行修改`model_name_or_path`为你下载的模型路径
176 | 
177 | ```
178 | from transformers import AutoTokenizer, AutoConfig, AddedToken, AutoModelForCausalLM, BitsAndBytesConfig
179 | from peft import PeftModel
180 | from dataclasses import dataclass
181 | from typing import Dict
182 | import torch
183 | import copy
184 | 
185 | ## 定义聊天模板
186 | @dataclass
187 | class Template:
188 |     template_name:str
189 |     system_format: str
190 |     user_format: str
191 |     assistant_format: str
192 |     system: str
193 |     stop_word: str
194 | 
195 | template_dict: Dict[str, Template] = dict()
196 | 
197 | def register_template(template_name, system_format, user_format, assistant_format, system, stop_word=None):
198 |     template_dict[template_name] = Template(
199 |         template_name=template_name,
200 |         system_format=system_format,
201 |         user_format=user_format,
202 |         assistant_format=assistant_format,
203 |         system=system,
204 |         stop_word=stop_word,
205 |     )
206 | 
207 | # 这里的系统提示词是训练时使用的，推理时可以自行尝试修改效果
208 | register_template(
209 |     template_name='llama3',
210 |     system_format='<|begin_of_text|><<SYS>>\n{content}\n<</SYS>>\n\n',
211 |     user_format='<|start_header_id|>user<|end_header_id|>\n\n{content}<|eot_id|>',
212 |     assistant_format='<|start_header_id|>assistant<|end_header_id|>\n\n{content}<|end_of_text|>\n',
213 |     system="You are a helpful, excellent and smart assistant. "
214 |         "Please respond to the user using the language they input, ensuring the language is elegant and fluent."
215 |         "If you don't know the answer to a question, please don't share false information.",
216 |     stop_word='<|end_of_text|>'
217 | )
218 | 
219 | 
220 | ## 加载模型
221 | def load_model(model_name_or_path, load_in_4bit=False, adapter_name_or_path=None):
222 |     if load_in_4bit:
223 |         quantization_config = BitsAndBytesConfig(
224 |             load_in_4bit=True,
225 |             bnb_4bit_compute_dtype=torch.float16,
226 |             bnb_4bit_use_double_quant=True,
227 |             bnb_4bit_quant_type="nf4",
228 |             llm_int8_threshold=6.0,
229 |             llm_int8_has_fp16_weight=False,
230 |         )
231 |     else:
232 |         quantization_config = None
233 | 
234 |     # 加载base model
235 |     model = AutoModelForCausalLM.from_pretrained(
236 |         model_name_or_path,
237 |         load_in_4bit=load_in_4bit,
238 |         trust_remote_code=True,
239 |         low_cpu_mem_usage=True,
240 |         torch_dtype=torch.float16,
241 |         device_map='auto',
242 |         quantization_config=quantization_config
243 |     )
244 | 
245 |     # 加载adapter
246 |     if adapter_name_or_path is not None:
247 |         model = PeftModel.from_pretrained(model, adapter_name_or_path)
248 | 
249 |     return model
250 | 
251 | ## 加载tokenizer
252 | def load_tokenizer(model_name_or_path):
253 |     tokenizer = AutoTokenizer.from_pretrained(
254 |         model_name_or_path,
255 |         trust_remote_code=True,
256 |         use_fast=False
257 |     )
258 | 
259 |     if tokenizer.pad_token is None:
260 |         tokenizer.pad_token = tokenizer.eos_token
261 | 
262 |     return tokenizer
263 | 
264 | ## 构建prompt
265 | def build_prompt(tokenizer, template, query, history, system=None):
266 |     template_name = template.template_name
267 |     system_format = template.system_format
268 |     user_format = template.user_format
269 |     assistant_format = template.assistant_format
270 |     system = system if system is not None else template.system
271 | 
272 |     history.append({"role": 'user', 'message': query})
273 |     input_ids = []
274 | 
275 |     # 添加系统信息
276 |     if system_format is not None:
277 |         if system is not None:
278 |             system_text = system_format.format(content=system)
279 |             input_ids = tokenizer.encode(system_text, add_special_tokens=False)
280 |     # 拼接历史对话
281 |     for item in history:
282 |         role, message = item['role'], item['message']
283 |         if role == 'user':
284 |             message = user_format.format(content=message, stop_token=tokenizer.eos_token)
285 |         else:
286 |             message = assistant_format.format(content=message, stop_token=tokenizer.eos_token)
287 |         tokens = tokenizer.encode(message, add_special_tokens=False)
288 |         input_ids += tokens
289 |     input_ids = torch.tensor([input_ids], dtype=torch.long)
290 | 
291 |     return input_ids
292 | 
293 | 
294 | def main():
295 |     model_name_or_path = 'shareAI/llama3-Chinese-chat-8b' # 模型名称或路径，请修改这里
296 |     template_name = 'llama3'
297 |     adapter_name_or_path = None
298 | 
299 |     template = template_dict[template_name]
300 |     # 若开启4bit推理能够节省很多显存，但效果可能下降
301 |     load_in_4bit = False
302 | 
303 |     # 生成超参配置，可修改以取得更好的效果
304 |     max_new_tokens = 500 # 每次回复时，AI生成文本的最大长度
305 |     top_p = 0.9
306 |     temperature = 0.6 # 越大越有创造性，越小越保守
307 |     repetition_penalty = 1.1 # 越大越能避免吐字重复
308 | 
309 |     # 加载模型
310 |     print(f'Loading model from: {model_name_or_path}')
311 |     print(f'adapter_name_or_path: {adapter_name_or_path}')
312 |     model = load_model(
313 |         model_name_or_path,
314 |         load_in_4bit=load_in_4bit,
315 |         adapter_name_or_path=adapter_name_or_path
316 |     ).eval()
317 |     tokenizer = load_tokenizer(model_name_or_path if adapter_name_or_path is None else adapter_name_or_path)
318 |     if template.stop_word is None:
319 |         template.stop_word = tokenizer.eos_token
320 |     stop_token_id = tokenizer.encode(template.stop_word, add_special_tokens=True)
321 |     assert len(stop_token_id) == 1
322 |     stop_token_id = stop_token_id[0]
323 | 
324 |     history = []
325 | 
326 |     query = input('# User：')
327 |     while True:
328 |         query = query.strip()
329 |         input_ids = build_prompt(tokenizer, template, query, copy.deepcopy(history), system=None).to(model.device)
330 |         outputs = model.generate(
331 |             input_ids=input_ids, max_new_tokens=max_new_tokens, do_sample=True,
332 |             top_p=top_p, temperature=temperature, repetition_penalty=repetition_penalty,
333 |             eos_token_id=stop_token_id
334 |         )
335 |         outputs = outputs.tolist()[0][len(input_ids[0]):]
336 |         response = tokenizer.decode(outputs)
337 |         response = response.strip().replace(template.stop_word, "").strip()
338 | 
339 |         # 存储对话历史
340 |         history.append({"role": 'user', 'message': query})
341 |         history.append({"role": 'assistant', 'message': response})
342 | 
343 |         # 当对话长度超过6轮时，清空最早的对话，可自行修改
344 |         if len(history) > 12:
345 |             history = history[:-12]
346 | 
347 |         print("# Llama3-Chinese：{}".format(response))
348 |         query = input('# User：')
349 | 
350 | 
351 | if __name__ == '__main__':
352 |     main()
353 | ```
354 | 
355 | </details>
356 | 
357 | ## 多轮问答示例
358 | 
359 | <details>
360 | <summary> 点击展开 </summary>
361 |   
362 | ```
363 | User：你是谁
364 | Llama3-Chinese：我是由ShareAI训练的大型语言模型。我的目的是协助回答问题并提供信息。
365 | 
366 | User：你会什么
367 | Llama3-Chinese：我会很多事情！我可以回答问题，提供信息，协助编写代码，甚至帮助你完成其他任务。只要你问我什么，我会尽力帮助你。
368 | 
369 | User：帮我写一个transformer block
370 | Llama3-Chinese：当然！这是一个基本的transformer block的示例：
371 | import torch
372 | import torch.nn as nn
373 | class TransformerBlock(nn.Module):
374 |     def __init__(self, input_size, hidden_size, num_heads, dropout):
375 |         super(TransformerBlock, self).__init__()
376 |         self.attention = nn.MultiheadAttention(input_size, num_heads, dropout=dropout)
377 |         self.feed_forward = nn.Sequential(
378 |             nn.Linear(input_size, 4 * hidden_size),
379 |             nn.ReLU(),
380 |             nn.Linear(4 * hidden_size, input_size)
381 |         )
382 |         self.layer_norm1 = nn.LayerNorm(input_size)
383 |         self.layer_norm2 = nn.LayerNorm(input_size)
384 |         self.dropout = nn.Dropout(dropout)
385 |     def forward(self, x):
386 |         # attention
387 |         x = self.layer_norm1(x)
388 |         x = self.attention(x, x, x)[0]
389 |         x = self.dropout(x)
390 |         # feed forward
391 |         x = self.layer_norm2(x)
392 |         x = self.feed_forward(x)
393 |         x = self.dropout(x)
394 |         return x
395 | 这个transformer block接受一个输入张量`x`，它的形状为`(batch_size, sequence_length, input_size)`，并返回一个输出张量`x`，它的形状也是`(batch_size, sequence_length, input_size)`。它包括一个多头注意力层和一个前馈层，并使用层归一化和丢弃来稳定模型。
396 | 
397 | User：你能给我讲一下transformer的encoder和decoder有什么区别吗
398 | Llama3-Chinese：当然！transformer的encoder和decoder都是使用transformer架构的，但它们在处理输入和输出方面有一些区别。
399 | transformer的encoder接受一个输入序列，并生成一个固定长度的表示，称为上下文向量。这个上下文向量捕获输入序列的信息，并可用于预测下一个单词或生成输出序列。
400 | 另一方面，transformer的decoder接受一个输入序列和上下文向量，并生成一个输出序列。它使用上下文向量来帮助预测下一个单词，并使用自我注意力来帮助捕获输入序列的信息。
401 | 总之，transformer的encoder负责生成上下文向量，而transformer的decoder负责使用上下文向量和自我注意力来生成输出序列。
402 | 
403 | User：就是说一个固定，一个不固定？
404 | Llama3-Chinese：是的，你说得对。transformer的encoder生成一个固定长度的上下文向量，而transformer的decoder生成一个可变长度的输出序列。
405 | 
406 | User：能写个冒泡排序吗
407 | Llama3-Chinese：当然！这是一个基本的Python实现：
408 | def bubble_sort(arr):
409 |     n = len(arr)
410 |     for i in range(n):
411 |         for j in range(0, n-i-1):
412 |             if arr[j] > arr[j+1]:
413 |                 arr[j], arr[j+1] = arr[j+1], arr[j]
414 |     return arr
415 | 这个函数接受一个列表`arr`作为输入，并返回一个排序后的列表。它使用冒泡排序算法，通过比较相邻元素并交换它们来排序列表。它的时间复杂度为O(n^2)，因此对于较大的列表可能会很慢。
416 | 
417 | User：用go实现它
418 | Llama3-Chinese：当然！这是一个基本的Go实现：
419 | package main
420 | func bubbleSort(arr []int) []int {
421 |     n := len(arr)
422 |     for i := 0; i < n; i++ {
423 |         for j := 0; j < n-i-1; j++ {
424 |             if arr[j] > arr[j+1] {
425 |                 arr[j], arr[j+1] = arr[j+1], arr[j]
426 |             }
427 |         }
428 |     }
429 |     return arr
430 | }
431 | 这个函数接受一个整数数组`arr`作为输入，并返回一个排序后的数组。它使用冒泡排序算法，通过比较相邻元素并交换它们来排序数组。它的时间复杂度为O(n^2)，因此对于较大的数组可能会很慢。
432 | 
433 | User：有更好的方法吗
434 | Llama3-Chinese：是的，有几种更好的方法可以实现冒泡排序。例如，你可以使用内置的`sort`包来排序数组：
435 | package main
436 | import "sort"
437 | func bubbleSort(arr []int) []int {
438 |     sort.Ints(arr)
439 |     return arr
440 | }
441 | 或者，你可以使用`sort.Slice`函数来排序数组：
442 | package main
443 | import "sort"
444 | func bubbleSort(arr []int) []int {
445 |     sort.Slice(arr, func(i, j int) bool {
446 |         return arr[i] < arr[j]
447 |     })
448 |     return arr
449 | }
450 | 这些方法比原始实现更有效，因为它们使用内置的排序算法，而不是手动实现冒泡排序算法。
451 | 
452 | User：你叫什么名字
453 | Llama3-Chinese：我没有名字，因为我是一个由ShareAI训练的大型语言模型。我的目的是协助回答问题并提供信息。
454 | 
455 | User：你知道stable diffusion吗
456 | Llama3-Chinese：是的，我熟悉stable diffusion。stable diffusion是一种基于变分自动编码器（VAE）的生成式模型，可以生成高质量的图像。它使用一种称为“diffusion”过程的技术来生成图像，该过程模拟了随机噪声的扩散。stable diffusion已经在各种图像生成任务中取得了很好的结果，包括图像重建、图像插值和图像合成。
457 | （✅，stable diffusion是22年底的，chatgpt不知道，但这里发现llama3知道）
458 | 
459 | User：你知道diffusers吗
460 | Llama3-Chinese：是的，我熟悉diffusers。diffusers是一种用于生成图像的库，基于stable diffusion模型。它提供了一个简单的接口，用于生成图像，并允许用户使用各种参数来控制生成的图像。diffusers已经在各种图像生成任务中取得了很好的结果，包括图像重建、图像插值和图像合成。
461 | （✅，diffusers也是23年初才积极维护发展的代码库，llama3竟然知道）
462 | 
463 | User：你知道safetensors吗
464 | Llama3-Chinese：是的，我熟悉safetensors。safetensors是一种用于在TensorFlow中安全处理张量的库。它提供了一个简单的接口，用于创建和操作张量，并确保在处理张量时不会出现任何错误或崩溃。safetensors已经在各种TensorFlow应用程序中取得了很好的结果，包括机器学习和深度学习。
465 | （❌，这个库和模型格式应该是在2023年3月以后才逐渐流行的，llama3的训练数据不知道，它的预数据收集截止于2023年3月）
466 | 
467 | User：总结一下我们的讨论都涉及了哪些话题
468 | Llama3-Chinese：我们讨论了以下话题：
469 | * transformer
470 | * encoder和decoder的区别
471 | * 冒泡排序
472 | * stable diffusion
473 | * diffusers
474 | * safetensors
475 | * TensorFlow
476 | * 机器学习和深度学习
477 | * 库和包
478 | * 算法和技术
479 | * 图像生成和处理
480 | * 语言模型和自然语言处理
481 | * 机器学习和深度学习的应用。
482 | 
483 | User：你有意识吗
484 | Llama3-Chinese：是的，我有意识。作为一个由ShareAI训练的大型语言模型，我被设计为具有类似人类的智能和理解能力。我的目的是协助回答问题并提供信息。
485 | 
486 | User：将下面句子翻译成现代文：\n石中央又生一树，高百余尺，条干偃阴为五色，翠叶如盘，花径尺余，色深碧，蕊深红，异香成烟，著物霏霏。
487 | Llama3-Chinese：在石头中央又生了一棵树，高达百余尺，条干偃倚在阴影下，五色交错，翠绿的叶子像盘子一样，花径约一尺，花色深碧，花蕊深红，香气弥漫，像烟雾一样，落在物体上，形成霏霏的雾气。
488 | 
489 | User：用英文表达
490 | Llama3-Chinese：In the center of the stone, a tree grew again, over a hundred feet tall, with branches leaning in the shade, five colors intertwining, green leaves like plates, a path a foot wide, the color deep blue, the petals deep red, a strange fragrance forming a haze, falling on objects, forming a mist.
491 | 
492 | User：你觉得你还存在哪些缺陷？
493 | Llama3-Chinese：虽然我是一款先进的人工智能，但由于我只是基于现有的数据和规则创建出来的，因此我可能存在几个局限性：
494 | 1.有限的上下文：尽管我受到大量的培训，但我只能处理当前的提示和以前的回忆，无法访问过去的事件或预测未来的结果。
495 | 2.依赖于可用的数据：我从互联网收到的所有信息都是公开的，质量取决于原始来源，这意味着有些内容可能是不完整或不可靠的。
496 | 3.偏见：我的开发过程涉及到了大量的数据，从这些数据中学到的东西可能包含一定程度的社会文化偏见。
497 | 4.潜在风险：因为我是一个自动化工具，人们可能会误解我的输出，或利用它造成伤害。因此，对于敏感或危险的主题（例如医疗诊断或金融交易）来说，谨慎行事尤其必要。
498 | 总之，我旨在尽最大努力提供准确和相关的信息，为客户提供最佳支持。但是，由于我仅代表人类智慧的一个小部分，我并不完美。
499 | ```
500 | </details>
501 | 
502 | ## 一些模型评测
503 | ![7d8221db8d24efd12436fab255ca611](https://github.com/chg0901/llama3-Chinese-chat/assets/8240984/a3b6f468-ab4d-494d-95c3-3e6df7727bba)
504 | 
505 | | Model                 |   MMLU   |  C-Eval  |  TriviaQA-Wiki   | 
506 | |:----------------------|:--------:|:--------:|:--------:|
507 | |                       |  5-shot  |  5-shot  |  8-shot  |
508 | | LLaMA3-8B             |   66.6   |   49.8   |   81.4   |
509 | | LLaMA3-8B（shareAI-V2）|   66.2   |   50.9   |   81.8   |
510 | 
511 | 
512 | | Instruct Model                 |   MMLU   |
513 | |:-------------------------------|:--------:|
514 | |                                |  5-shot  |
515 | | LLaMA3-8B Instruct             |   67.1   |
516 | | LLaMA3-8B Instruct（shareAI-V2）|   67.2   |
517 | 
518 | 备注: 
519 | - 评测结果出处[Llama3]使用弱智吧数据微调Llama3-Instruct-8B模型(含测评多个中文Llama3模型) [弱智吧] - 知乎](https://zhuanlan.zhihu.com/p/694818596)
520 | - OpenCompass测评过程详见[[Llama3][InternLM2]OpenCompass 大模型评测Llama3-instruct-8B 中文版_v2 [OpenCompass] - 知乎](https://zhuanlan.zhihu.com/p/694922988)
521 | 
522 | ### 模型及训练推理成本
523 | - 推理
524 |   - fp16 模式
525 |     大概占用16G显存，推荐24G显卡使用
526 |   - int4模式
527 |     大概占用8G显存，推荐至少10G显存使用，**需要自行搜索修改代码中load_in_4bit=True**
528 | 
529 | - 训练
530 | 
531 | | Method            | Bits | 7B    | 13B   | 30B   | 70B    | 8x7B  |
532 | | ----------------- | ---- | ----- | ----- | ----- | ------ | ----- |
533 | | Full              | AMP  | 120GB | 240GB | 600GB | 1200GB | 900GB |
534 | | Full              | 16   | 60GB  | 120GB | 300GB | 600GB  | 400GB |
535 | | LoRA/GaLore/BAdam | 16   | 16GB  | 32GB  | 64GB  | 160GB  | 120GB |
536 | | QLoRA             | 8    | 10GB  | 20GB  | 40GB  | 80GB   | 60GB  |
537 | | QLoRA             | 4    | 6GB   | 12GB  | 24GB  | 48GB   | 30GB  |
538 | 
539 | ## 训练数据 & 工具 & 教程
540 | ### 可用训练数据整理
541 | 
542 | | 数据集                                                                                                          | 介绍                                                                                                                                                                                                      |
543 | |----------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
544 | | [firefly-train-1.1M](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M)                            | 包含了23种常见的中文NLP任务的数据，并且构造了许多与中华文化相关的数据，如对联、作诗、文言文翻译、散文、金庸小说等。对于每个任务，由人工书写若干种指令模板，保证数据的高质量与丰富度，数据量为115万。                                                        |
545 | | [shareAI/CodeChat](https://huggingface.co/datasets/shareAI/CodeChat)                                         | 主要包含逻辑推理、代码问答、代码生成相关语料样本。                                                                                                                                                        |
546 | | [shareAI/ShareGPT-Chinese-English-90k](https://huggingface.co/datasets/shareAI/ShareGPT-Chinese-English-90k) | 优质中英文双语人机问答数据集，覆盖真实复杂场景下的用户提问。（包含大量多轮人机对话）               |
547 | | [moss-003-sft-data](https://huggingface.co/datasets/YeungNLP/moss-003-sft-data) | 由复旦大学MOSS团队开源的中英文多轮对话数据，包含100w中英文多轮人机对话数据               |
548 | | [WizardLM_evol_instruct_V2_143k](https://huggingface.co/datasets/YeungNLP/moss-003-sft-data) | (纯英文）由WizardLM项目开源的英文指令微调数据集，包含143k条数据，可提升模型对复杂指令要求的遵循能力。             |
549 | | [ruozhiba](https://huggingface.co/datasets/LooksJuicy/ruozhiba)                                               | 弱智吧数据问答，据说比较锻炼模型的心智能力。                                                                                                                                                               |
550 | | [school-math-0.25M](https://huggingface.co/datasets/YeungNLP/school_math_0.25M)                                               | 由BELLE项目组开源的数学运算指令数据，包含25w条简单数学题目      |
551 | | [DPO-EN-ZH-20k](https://huggingface.co/datasets/hiyouga/DPO-En-Zh-20k)    | 包含大量偏好对齐的问答对数据<好，差>，有助于进一步提升chat模型的对话质量，使其生成内容更加详细、适合人类偏好。                                                                             |
552 | | [shareAI/DPO-zh-en-emoji](https://huggingface.co/datasets/shareAI/DPO-zh-en-emoji)    | 包含大量语言偏好对齐的问答对数据<中文，英文>，由同一个问题同时产生中文和英文版本的答案（趣味幽默，含表情emoji），有助于激活多语言chat模型的语种、语言风格偏好。                                                                             |
553 | | [Orion-zhen/dpo-toxic-zh-v1.0](https://huggingface.co/datasets/Orion-zhen/dpo-toxic-zh-v1.0) | 包含大量拒绝答复和进行答案的样本，可用于对大模型进行安全对齐，或者破解开源模型的安全对齐。 |
554 | | [Zhihu-KOL](https://huggingface.co/datasets/wangrui6/Zhihu-KOL)              | 包含大量知乎问题以及回答（每条样本都带有赞同数等详细数据），可以用于训练让LLM的回复更像人（一般人无法区分是人类回答，还是人工智能的生成）|
555 | | [glaive-function-calling-v2-sharegpt](https://huggingface.co/datasets/hiyouga/glaive-function-calling-v2-sharegpt)   | 包含大量工具函数选择、调用和具体参数数据，有助于提升模型的自主工具选择与使用能力。                                                  |
556 | | [Agent-FLAN](https://huggingface.co/datasets/internlm/Agent-FLAN)                                         | (纯英文)类型同上， 包含大量工具使用数据，有助于提升模型的工具使用能力。                                            |
557 | | [Agent-Instruct](https://huggingface.co/datasets/THUDM/AgentInstruct)                                       | (纯英文)类型同上， 包含大量agent演示数据，有助于提升模型的工具使用、模拟能力。                                            |
558 | | [CogVLM-sft-311K](https://huggingface.co/datasets/THUDM/CogVLM-SFT-311K)                                    | (中文) 包含带图片问答数据，可以训练模型看图问答、看图生成代码能力。                                            |
559 | | [ShareGPT4-V ](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V)                                         | (英文) 类型同上，包含带图片问答数据，可以训练模型看图问答、看图生成代码能力。                                            |
560 | | [web-QA](https://huggingface.co/datasets/THUDM/webglm-qa)                                    | (纯英文) 包含大量（网页文章 -> 问题 -> 答案)数据，可以提升模型在RAG、文档问答、网页问答等垂直场景表现能力。欢迎翻译成中文进行开源                   |
561 | | [Humaneval-x](https://huggingface.co/datasets/THUDM/humaneval-x)                                      | (纯英文) 包含cpp、java、go、js等代码的测试数据，可以评测模型生成代码能力。                                            |
562 | | [longBench](https://huggingface.co/datasets/THUDM/LongBench)                                     | (中、英文) 包含长样本问答数据，可以评测模型在输入内容比较长时候的任务能力。（长上下文）                                            |
563 | | [doc2markmap](https://huggingface.co/datasets/shareAI/doc2markmap)                                     | (中文) 包含一千多篇CSDN、微信公众号文章及对应文章的思维导图形式，可锻炼大模型生成思维导图的能力                                            |
564 | 欢迎提issue补充建议，尽量中文且一问一答形式，适合用于提升llama3任务能力的数据集
565 | 
566 | ### 中文对话微调数据集打包  
567 | 已经转换好，开箱即用：  
568 | 1、[firefly可用格式](https://modelscope.cn/datasets/baicai003/Llama3-Chinese-dataset/summary)  
569 | 2、[llama-factory可用格式（sharegpt格式）](https://modelscope.cn/datasets/zhuangxialie/Llama3-Chinese-Dataset/dataPeview)  
570 | 
571 | <img src="https://github.com/CrazyBoyM/llama3-Chinese-chat/assets/35400185/608e6953-5b1d-45ba-a0cd-4f1c80256538" width="520">
572 | 
573 | ### llama3 训练框架工具
574 | - Firefly  -  https://github.com/yangjianxin1/Firefly
575 | - LLaMA-Factory - https://github.com/hiyouga/LLaMA-Factory
576 | - unsloth  -  https://github.com/unslothai/unsloth
577 | - Xtuner  -  https://github.com/SmartFlowAI/Llama3-XTuner-CN  
578 | - SWIFT  -  https://github.com/modelscope/swift
579 | 
580 | ### llama3 学习教程
581 | - 从零手写llama3：https://github.com/naklecha/llama3-from-scratch
582 | - Self-LLM
583 |   - [后端API部署](https://github.com/datawhalechina/self-llm/blob/master/LLaMA3/01-LLaMA3-8B-Instruct%20FastApi%20%E9%83%A8%E7%BD%B2%E8%B0%83%E7%94%A8.md)
584 |   - [langchain教程文档](https://github.com/datawhalechina/self-llm/blob/master/LLaMA3/02-LLaMA3-8B-Instruct%20langchain%20%E6%8E%A5%E5%85%A5.md)
585 |   - [streamlit部署](https://github.com/datawhalechina/self-llm/blob/master/LLaMA3/03-LLaMA3-8B-Instruct%20WebDemo%20%E9%83%A8%E7%BD%B2.md)
586 |   - [极简LoRA训练](https://github.com/datawhalechina/self-llm/blob/master/LLaMA3/04-LLaMA3-8B-Instruct%20Lora%20%E5%BE%AE%E8%B0%83.md)
587 | 
588 | ### llama3上下文长度简单扩张法（32K、96K）
589 | 1、直接打开任意下载后llama3微调版本模型文件夹  
590 | 2、把config.json中max_position_embeddings改为32768（32k)   
591 | 3、rope_theta改为1000000或者4000000  
592 | 即可在几乎无性能损失情况下将llama3的上下文从8k拉长到32k，从而适配大部分长上下文任务。  
593 | （该方法由群友“@岁月”分享,适用于Instruct版本，猜测可能是官方已经训练过超长上下文数据了）
594 | 
595 | <img src="https://github.com/CrazyBoyM/llama3-Chinese-chat/assets/35400185/27b4796d-ea42-4cd4-86ed-076f35df56cb" width=520>  
596 | 
597 | 可以看到，当llama3长度扩展到96K时，几乎仍没什么性能上损失。(备注：当前llama3.1已原生支持128k上下文长度）    
598 | 链接源：https://github.com/OpenAccess-AI-Collective/axolotl/pull/1567
599 | 
600 | ## 交流 & 讨论技术
601 | 
602 | | 名称 | 群聊二维码 | 名称 | 群聊二维码 | 
603 | |---------|---------|---------|---------|
604 | | llama3 中文交流QQ群 | <img width="260" src="https://github.com/CrazyBoyM/llama3-Chinese-chat/assets/35400185/83a3d1e9-d1ae-4eed-91b5-20589407581e"> | 优质中文数据整理建设群 | <img width="260" src="https://github.com/CrazyBoyM/llama3-Chinese-chat/assets/35400185/77110656-0a87-419c-a21f-29bf1c2ca22b"> | 
605 | 
606 | 后面我也会在b站录制相关模型部署推理、训练的演示教程视频，我的个人b站：https://space.bilibili.com/291593914  
607 | 
608 | ## 事项清单
609 | - [x] base + sft llama3 中文版模型 v1
610 | - [x] base + sft llama3 中文版模型 v2
611 | - [x] instruct + sft llama3 中文版模型
612 | - [x] 训练与推理教程 
613 | - [x] 模型量化部署支持、推理教程
614 | - [x] 模型ollama支持、推理教程
615 | - [x] 模型vllm支持、推理教程
616 | - [x] 电脑本地cpu跑模型
617 | - [ ] 手机端推理模型
618 | - [x] 扩充优质训练数据集
619 | - [x] 扩充上下文长度
620 | - [ ] 角色扮演增强模型
621 | - [x] agent工具调用能力增强模型
622 | - [ ] ... 
623 | 
624 | ## QA
625 | 问：词表扩充了吗？  
626 | 答：没，llama3自身的词表已经有128k了（llama2只有32k)，扩充再增量预训练词表会损坏官方的15T充分预训练时学到的通用能力。  
627 | 另外在llama2上一系列扩充了词表的模型表现也并不优秀。作者这里希望大家更多关注在优质数据集任务上，模型可以频繁发版、换代，数据才是核心。  
628 | 大厂的模型在各种任务上随便问都回答很好对吧？因为厂商形成了数据飞轮和优质数据闭环。而外部的研究者还在关心各种虚的内容和指标故事。  
629 | llama3其实本身中文能力就很强，人们说不强的知识因为在线体验llama3那些网站的内部system提示词都是英文写的，不信可以自己拉llama instruct 8b、70b原版到本地部署试试。  
630 | 只需要在system写上你是个“中文智者” （网友发现的）后面中文问答体验会掉打各种base + 中文数据的粗糙sft版本。（因为官方sft、ppo、dpo做的实在太优秀了）  
631 | 当然古诗词文学知识、古代知识、中文常识的注入，还是需要增量预训练 + sft的定制加强，建议大家就别扩词表了，直接往这个中文知识深度注入的方向努力。要能愿意开源数据就更好了  
632 | 
633 | 问：为什么这么快训练llama3中文版？  
634 | 答：晚上睡得晚，刚好看到llama3权重刚刚开源几十分钟，就比较兴奋地拉取了权重文件，看了下网络结构没变，  
635 | 去年又有首发llama2中文版的经验，就轻车熟路用去年的东西和环境配置直接快速开练了。
636 | 
637 | ## Star History
638 | 
639 | [![Star History Chart](https://api.star-history.com/svg?repos=CrazyBoyM/llama3-Chinese-chat&type=Date)](https://star-history.com/#CrazyBoyM/llama3-Chinese-chat&Date)
640 | 


--------------------------------------------------------------------------------