├── README.md
├── README_CN.md
├── assets
│   ├── LOGO.png
│   ├── cli_demo.gif
│   ├── modelscope_demo.png
│   └── modelscope_demo2.png
└── trtdemo
    ├── README.md
    ├── cli_chat.py
    └── cli_chat_entry.sh
/README.md: -------------------------------------------------------------------------------- 1 | # Index of CodeFuse Repositories 2 | 3 |

4 | 5 |

6 | 7 | 8 |
9 | 10 | [**简体中文**](./README_CN.md)|[**HuggingFace**](https://huggingface.co/codefuse-ai)|[**ModelScope**](https://modelscope.cn/organization/codefuse-ai) 11 | 12 |
13 | 14 | ## About This Repository 15 | 16 | This repository lists key projects and related demos about CodeFuse. 17 | 18 | ## About CodeFuse 19 | 20 | CodeFuse aims to develop Code Large Language Models (Code LLMs) to support and enhance full-lifecycle, AI-native software development, covering crucial stages such as design, requirements, coding, testing, building, deployment, operations, and insight analysis. Below is the overall framework of CodeFuse. 21 |

22 | 23 |

24 |
25 | 26 | 27 | ## Release Update 28 | **2024.08** [codefuse-ide](https://github.com/codefuse-ai/codefuse-ide): Released OpenSumi & CodeBlitz for the code IDE; [CGE](https://github.com/codefuse-ai/codefuse-CGE): Released the D2Coder-v1 embedding model for code search 29 | 30 | **2024.07** [D2LLM](https://github.com/codefuse-ai/D2LLM): Released the D2Coder-v1 embedding model for code search; [RepoFuse](https://github.com/codefuse-ai): Repository-Level Code Completion with Language Models with Fused Dual Context 31 | 32 | **2024.06** [Codefuse-ai pages](https://codefuse-ai.github.io) went live, [D2LLM](https://github.com/codefuse-ai/D2LLM) released features on Decomposed and Distilled Large Language Models for semantic search, and [MFTCoder](https://github.com/codefuse-ai/MFTCoder) released V0.4.2. See [Release & Next Release](https://github.com/codefuse-ai/codefuse/issues/16) for more details 33 | 34 | **2024.05** ModelCache released v0.2.1 with multimodal support, see [CodeFuse-ModelCache](https://github.com/codefuse-ai/CodeFuse-ModelCache). DevOps-Model now supports function calls, see [CodeFuse-DevOps-Model](https://github.com/codefuse-ai/CodeFuse-DevOps-Model). See [Release & Next Release](https://github.com/codefuse-ai/codefuse/issues/14) for more details 35 | 36 | **2024.04** CodeFuse-muAgent: a multi-agent framework; see [Release & Next Release](https://github.com/codefuse-ai/codefuse/issues/12) for more details 37 | 38 | ## List of CodeFuse Repositories 39 | 40 | We list repositories according to the lifecycle stages above. 41 | | Lifecycle Stage | Project Repository | Repo Description | Road Map | 42 | |:------------------------:|:-----------------:|:-------:|:------------------:| 43 | | Requirement & Design |[MFT-VLM](https://github.com/codefuse-ai/CodeFuse-MFT-VLM) | Instruction fine-tuning for vision-language tasks | | 44 | | Coding |[MFTCoder](https://github.com/codefuse-ai/MFTCoder) | Instruction-tuning framework | | 45 | | |[FasterTransformer4CodeFuse](https://github.com/codefuse-ai/FasterTransformer4CodeFuse) | FasterTransformer-based inference engine | | 46 | | |[CodeFuse-Eval](https://github.com/codefuse-ai/codefuse-evaluation)|Evaluation kits for CodeFuse | | 47 | | Test & Build |[TestAgent](https://github.com/codefuse-ai/Test-Agent) | TestGPT demo frontend | | 48 | | DevOps |[DevOps-Eval](https://github.com/codefuse-ai/codefuse-devops-eval)|Benchmark for DevOps| | 49 | | |[DevOps-Model](https://github.com/codefuse-ai/CodeFuse-DevOps-Model) |Index of DevOps models| | 50 | | Data Insight | NA | NA | | 51 | | Base |[ChatBot](https://github.com/codefuse-ai/codefuse-chatbot) |General chatbot frontend for CodeFuse | | 52 | | |[muAgent](https://github.com/codefuse-ai/CodeFuse-muAgent) | Multi-agent framework | | 53 | | |[ModelCache](https://github.com/codefuse-ai/CodeFuse-ModelCache) |Semantic cache for LLM serving | | 54 | | |[CodeFuse-Query](https://github.com/codefuse-ai/CodeFuse-Query)|Query-based code analysis engine | | 55 | | Others |[CoCA](https://github.com/codefuse-ai/Collinear-Constrained-Attention)|Collinear Constrained Attention | | 56 | | |[Awesome-Code-LLM](https://github.com/codefuse-ai/Awesome-Code-LLM)|Code-LLM survey| | 57 | | |This Repo |General introduction & index of CodeFuse repos| | 58 | 59 | ## List of CodeFuse Primary Released Models 60 | 61 | | Model Name | Short Description | Model Links | 62 | |:------------------------:|:-----------------:|:-----------------:| 63 | | CodeFuse-13B | Trained from scratch by CodeFuse | [HF](https://huggingface.co/codefuse-ai/CodeFuse-13B) ;
[MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-13B/summary) | 64 | | CodeFuse-CodeLLaMA-34B | Fine-tuning on CodeLLaMA-34B | [HF](https://huggingface.co/codefuse-ai/CodeFuse-CodeLlama-34B) ; [MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B/summary) | 65 | | ** CodeFuse-CodeLLaMA-34B-4bits | 4-bit quantized 34B model |[HF](https://huggingface.co/codefuse-ai/CodeFuse-CodeLlama-34B-4bits) ; [MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B-4bits/summary) | 66 | | CodeFuse-DeepSeek-33B | Fine-tuning on DeepSeek-Coder-33B | [HF](https://huggingface.co/codefuse-ai/CodeFuse-DeepSeek-33B) ; [MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-DeepSeek-33B/summary) | 67 | | ** CodeFuse-DeepSeek-33B-4bits | 4-bit quantized 33B model | [HF](https://huggingface.co/codefuse-ai/CodeFuse-DeepSeek-33B-4bits) ; [MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-DeepSeek-33B-4bits/summary) | 68 | | CodeFuse-VLM-14B | SoTA vision-language model | [HF](https://huggingface.co/codefuse-ai/CodeFuse-VLM-14B) ; [MS](https://modelscope.cn/models/ss41979310/CodeFuse-VLM-14B/summary) | 69 | 70 | `**` marks recommended models. 71 | 72 | ## Demos 73 | 74 | - Video demos: the Chinese version is below; an English version is in preparation. 75 | 76 | https://user-images.githubusercontent.com/103973989/267514150-21012a5d-652d-4aea-bcea-582e67855ad7.mp4 77 | 78 | - Online demo: you can try our CodeFuse-CodeLlama-34B model on ModelScope: [CodeFuse-CodeLlama34B-MFT-Demo](https://modelscope.cn/studios/codefuse-ai/CodeFuse-CodeLlama34B-MFT-Demo/summary) 79 | 80 | ![Online Demo Snapshot](assets/modelscope_demo2.png) 81 | 82 | - You can also install [CodeFuse-Chatbot](https://github.com/codefuse-ai/codefuse-chatbot) to test our models locally. 83 | 84 | ## How to get 85 | 86 | - [**HuggingFace**](https://huggingface.co/codefuse-ai). 87 | - [**ModelScope**](https://modelscope.cn/organization/codefuse-ai). 88 | - [**WiseModel**](https://wisemodel.cn/organization/codefuse-ai). 89 | - To train or fine-tune your own models, you can try our [**MFTCoder**](https://github.com/codefuse-ai/MFTCoder), which enables efficient fine-tuning for multi-task, multi-model, and multi-training-framework scenarios. 90 | 91 | ## Citation 92 | 93 | For more technical details about CodeFuse, please refer to our paper [MFTCoder](https://arxiv.org/abs/2311.02303). 94 | 95 | If you find our work useful or helpful for your R&D work, please feel free to cite our paper as follows. 96 | ``` 97 | @article{mftcoder2023, 98 | title={MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning}, 99 | author={Bingchang Liu and Chaoyu Chen and Cong Liao and Zi Gong and Huan Wang and Zhichao Lei and Ming Liang and Dajun Chen and Min Shen and Hailian Zhou and Hang Yu and Jianguo Li}, 100 | year={2023}, 101 | journal={arXiv preprint arXiv}, 102 | archivePrefix={arXiv}, 103 | eprint={2311.02303} 104 | } 105 | ``` 106 | 107 | -------------------------------------------------------------------------------- /README_CN.md: -------------------------------------------------------------------------------- 1 | # CodeFuse中文索引 2 | 3 |

4 | 5 |

6 | 7 | 8 |
9 | 10 | [ **English** ](./README.md)|[ **HuggingFace** ](https://huggingface.co/codefuse-ai) | [ **魔搭社区** ](https://modelscope.cn/organization/codefuse-ai) | [ **WiseModel** ](https://www.wisemodel.cn/organization/codefuse-ai) | [ **产品主页** ](https://codefuse.alipay.com) 11 | 12 |
13 | 14 | ## 关于本仓库 15 | 16 | 本仓库索引了CodeFuse项目的关键仓库、模型和演示例子. 17 | 18 | ## 关于CodeFuse 19 | CodeFuse的使命是开发专门设计用于支持整个软件开发生命周期的大型代码语言模型(Code LLMs),涵盖设计、需求、编码、测试、部署、运维等关键阶段。 20 | 我们致力于打造创新的解决方案,让软件开发者们在研发的过程中如丝般顺滑。下面是CodeFuse的整个框架。 21 |

22 | 23 |

24 |
25 | 26 | ## 版本更新 27 | **2024.08** [CGE](https://github.com/codefuse-ai/codefuse-CGE):基于D2LLM开源D2Coder-v1模型,用于代码向量检索;[codefuse-ide](https://github.com/codefuse-ai/codefuse-ide):一款云端的IDE,支持CodeFuse代码开发、分析、测例生成 28 | 29 | **2024.07** [D2LLM](https://github.com/codefuse-ai/D2LLM):基于D2LLM开源D2Coder-v1模型,用于代码向量检索;[RepoFuse](https://github.com/codefuse-ai):多仓库级别的代码补全 30 | 31 | **2024.06** [Codefuse-ai 官方主页](https://codefuse-ai.github.io)上线;[D2LLM](https://github.com/codefuse-ai/D2LLM)新增“分解与蒸馏大型语言模型用于语义搜索”相关能力;[MFTCoder](https://github.com/codefuse-ai/MFTCoder)发布V0.4.2版本。更多内容见[Release & Next Release](https://github.com/codefuse-ai/codefuse/issues/16) 32 | 33 | **2024.05** 大模型语义缓存ModelCache发布v0.2.1版本,支持多模态,见[CodeFuse-ModelCache](https://github.com/codefuse-ai/CodeFuse-ModelCache);DevOps-Model支持function call能力,见[CodeFuse-DevOps-Model](https://github.com/codefuse-ai/CodeFuse-DevOps-Model)。更多内容见[Release & Next Release](https://github.com/codefuse-ai/codefuse/issues/14) 34 | 35 | **2024.04** CodeFuse-muAgent:多智能体框架,更多内容见[Release & Next Release](https://github.com/codefuse-ai/codefuse/issues/12) 36 |
37 | 38 | ## CodeFuse仓库列表 39 | 40 | 我们按照上图软件生命周期的划分对仓库进行了组织。 41 | | 生命周期阶段 | 仓库名 | 仓库简介 | 技术路线 | 42 | |:------------------------:|:-----------------:|:-------:|:----------:| 43 | | 项目Copilot | NA | NA | | 44 | | 开发Copilot |[MFTCoder](https://github.com/codefuse-ai/MFTCoder) | CodeFuse独有的指令微调框架 | | 45 | | |[FasterTransformer4CodeFuse](https://github.com/codefuse-ai/FasterTransformer4CodeFuse) | 推理引擎| | 46 | | |[CodeFuse-Eval](https://github.com/codefuse-ai/codefuse-evaluation)|代码评估框架| | 47 | | 测试和构建Copilot |[TestAgent](https://github.com/codefuse-ai/Test-Agent) | TestGPT示例前端 | | 48 | | 运维Copilot |[DevOps-Eval](https://github.com/codefuse-ai/codefuse-devops-eval)|DevOps评测集和框架 | | 49 | | |[DevOps-Model](https://github.com/codefuse-ai/CodeFuse-DevOps-Model) |DevOps模型列表介绍 | | 50 | | 数据Copilot | NA | NA | | 51 | | 其他模块 |[ChatBot](https://github.com/codefuse-ai/codefuse-chatbot) |通用chatbot前端 | | 52 | | |[muAgent](https://github.com/codefuse-ai/CodeFuse-muAgent) | multi-agent框架 | | 53 | | |[CoCA](https://github.com/codefuse-ai/Collinear-Constrained-Attention)|共线约束注意力算法 | | 54 | | |[Awesome-Code-LLM](https://github.com/codefuse-ai/Awesome-Code-LLM)|代码大模型survey主页 | | 55 | | |[CodeFuse-Query](https://github.com/codefuse-ai/CodeFuse-Query)| 基于查询的代码分析引擎 | | 56 | | |你正在看的仓库 | CodeFuse通用介绍和索引 | | 57 | 58 | ## CodeFuse已发布主要模型索引 59 | 60 | | Model Name | Short Description | Model Links | 61 | |:------------------------:|:-----------------:|:-----------------:| 62 | | CodeFuse-13B | Trained from scratch by CodeFuse | [HF](https://huggingface.co/codefuse-ai/CodeFuse-13B) ; [MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-13B/summary) | 63 | | CodeFuse-CodeLLaMA-34B | Fine-tuning on CodeLLaMA-34B | [HF](https://huggingface.co/codefuse-ai/CodeFuse-CodeLlama-34B) ; [MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B/summary) | 64 | | ** CodeFuse-CodeLLaMA-34B-4bits | 4-bit quantized 34B model |[HF](https://huggingface.co/codefuse-ai/CodeFuse-CodeLlama-34B-4bits) ; [MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B-4bits/summary) | 65 | | CodeFuse-DeepSeek-33B | Fine-tuning on DeepSeek-Coder-33B | [HF](https://huggingface.co/codefuse-ai/CodeFuse-DeepSeek-33B) ; [MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-DeepSeek-33B/summary) | 66 | | ** CodeFuse-DeepSeek-33B-4bits | 4-bit quantized 33B model | [HF](https://huggingface.co/codefuse-ai/CodeFuse-DeepSeek-33B-4bits) ; [MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-DeepSeek-33B-4bits/summary) | 67 | | CodeFuse-VLM-14B | SoTA vision-language model | [HF](https://huggingface.co/codefuse-ai/CodeFuse-VLM-14B) ; [MS](https://modelscope.cn/models/ss41979310/CodeFuse-VLM-14B/summary) | 68 | 69 | 70 | 71 | ## 演示 72 | 73 | - 视频Demo:下面是中文版本,英文版准备中 74 | https://user-images.githubusercontent.com/103973989/267514150-21012a5d-652d-4aea-bcea-582e67855ad7.mp4 75 | 76 | - 在线版本:你可以在魔搭社区尝试我们的34B模型:[CodeFuse-CodeLlama34B-MFT-Demo](https://modelscope.cn/studios/codefuse-ai/CodeFuse-CodeLlama34B-MFT-Demo/summary) 77 | 78 | ![Online Demo Snapshot](assets/modelscope_demo2.png) 79 | 80 | - 离线版本:你也可以安装[CodeFuse-Chatbot](https://github.com/codefuse-ai/codefuse-chatbot),在本地尝试我们的模型。 81 | 82 | ## 如何获得 83 | 84 | - HF模型社区[**HuggingFace**](https://huggingface.co/codefuse-ai). 85 | - 魔搭社区[**ModelScope**](https://modelscope.cn/organization/codefuse-ai). 86 | - WiseModel社区[**WiseModel**](https://wisemodel.cn/organization/codefuse-ai). 
87 | - 对于自有或者自己感兴趣的模型,可以使用我们的[**MFTCoder**](https://github.com/codefuse-ai/MFTCoder)框架微调训练,它是一个支持多模型、多任务、多训练平台的微调框架。 88 | 89 | ## 参考文献 90 | 如果你觉得本项目对你有帮助,请引用下述论文: 91 | ``` 92 | @article{mftcoder2023, 93 | title={MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning}, 94 | author={Bingchang Liu and Chaoyu Chen and Cong Liao and Zi Gong and Huan Wang and Zhichao Lei and Ming Liang and Dajun Chen and Min Shen and Hailian Zhou and Hang Yu and Jianguo Li}, 95 | year={2023}, 96 | journal={arXiv preprint arXiv}, 97 | archivePrefix={arXiv}, 98 | eprint={2311.02303} 99 | } 100 | ``` 101 | -------------------------------------------------------------------------------- /assets/LOGO.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/codefuse-ai/codefuse/8903fc49b10f3653ee0ec2230fc7100a3a9037b6/assets/LOGO.png -------------------------------------------------------------------------------- /assets/cli_demo.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/codefuse-ai/codefuse/8903fc49b10f3653ee0ec2230fc7100a3a9037b6/assets/cli_demo.gif -------------------------------------------------------------------------------- /assets/modelscope_demo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/codefuse-ai/codefuse/8903fc49b10f3653ee0ec2230fc7100a3a9037b6/assets/modelscope_demo.png -------------------------------------------------------------------------------- /assets/modelscope_demo2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/codefuse-ai/codefuse/8903fc49b10f3653ee0ec2230fc7100a3a9037b6/assets/modelscope_demo2.png -------------------------------------------------------------------------------- /trtdemo/README.md: -------------------------------------------------------------------------------- 1 | ## CLI Demo for CodeFuse-CodeLLaMA-34B-4bits with TensorRT-LLM 2 | 3 | ### Introduction 4 | 5 | This CLI demo shows inference of CodeFuse-CodeLLaMA-34B-4bits with a TensorRT-LLM (referred to as trtllm below) engine. We provide the model weights of CodeFuse-CodeLLaMA-34B-4bits; please refer to [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.6.1/examples/llama) on GitHub to build the trtllm engine. Moreover, our build practice is coming soon. 6 | 7 | 8 | 9 | ### Usage 10 | 11 | The launch script is: 12 | ```bash 13 | engine_dir="" 14 | tokenizer_dir="" 15 | python cli_chat.py --engine_dir "${engine_dir}" \ 16 | --tokenizer_dir "${tokenizer_dir}" 17 | ``` 18 | `engine_dir` and `tokenizer_dir` are the directory paths of the trtllm engine and the tokenizer, respectively. In addition, `max_input_length` and `max_new_tokens` can be passed to cli_chat.py; both should be smaller than the corresponding parameters in the trtllm engine config to avoid runtime errors. A fuller example invocation is sketched at the end of this README. 19 | 20 | 21 | ### Scripts 22 | 23 | - Demo script: [cli_chat.py](cli_chat.py) 24 | - Launch script: [cli_chat_entry.sh](cli_chat_entry.sh) 25 | 26 | ### Demonstration 27 | 28 | ![Client Demo](../assets/cli_demo.gif) 29 | 30 | - You can also try to install the [CodeFuse-Chatbot](https://github.com/codefuse-ai/codefuse-chatbot) to test our models locally. 
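For reference, a fuller invocation might look like the sketch below. The directory values and the two length limits are placeholder assumptions for illustration; replace them with paths and limits that match your own trtllm engine build (both limits must stay below the corresponding values in the engine config).
```bash
# Hypothetical paths and limits -- adjust to your environment.
engine_dir="/path/to/trtllm_engine/CodeFuse-CodeLlama-34B-4bits"
tokenizer_dir="/path/to/CodeFuse-CodeLlama-34B-4bits"
python cli_chat.py --engine_dir "${engine_dir}" \
                   --tokenizer_dir "${tokenizer_dir}" \
                   --max_input_length 1024 \
                   --max_new_tokens 512
```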
31 | -------------------------------------------------------------------------------- /trtdemo/cli_chat.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | import tensorrt_llm 4 | import torch 5 | import platform 6 | import json 7 | 8 | import numpy as np 9 | from tensorrt_llm.runtime import ( 10 | ModelConfig, SamplingConfig, GenerationSession 11 | ) 12 | from tensorrt_llm.runtime.generation import Mapping 13 | from tensorrt_llm.quantization import QuantMode 14 | from typing import List, Union, Optional 15 | from pathlib import Path 16 | from transformers import (AutoTokenizer,PreTrainedTokenizer) 17 | 18 | now_dir = os.path.dirname(os.path.abspath(__file__)) 19 | 20 | EOS_TOKEN = 2 21 | PAD_TOKEN = 0 22 | 23 | from typing import List, Tuple 24 | 25 | def to_word_list_format(words_list: List[List[str]], tokenizer): 26 | 27 | flat_ids = [] 28 | offsets = [] 29 | for words in words_list: 30 | item_flat_ids = [] 31 | item_offsets = [] 32 | 33 | for word in words: 34 | ids = tokenizer.encode(word,add_special_tokens=False)[1:] 35 | 36 | if len(ids) == 0: 37 | continue 38 | 39 | item_flat_ids += ids 40 | item_offsets.append(len(ids)) 41 | 42 | flat_ids.append(np.array(item_flat_ids)) 43 | offsets.append(np.cumsum(np.array(item_offsets))) 44 | 45 | pad_to = max(1, max(len(ids) for ids in flat_ids)) 46 | 47 | for i, (ids, offs) in enumerate(zip(flat_ids, offsets)): 48 | flat_ids[i] = np.pad(ids, (0, pad_to - len(ids)), constant_values=0) 49 | offsets[i] = np.pad(offs, (0, pad_to - len(offs)), constant_values=-1) 50 | 51 | result = np.array([flat_ids, offsets], dtype="int32").transpose((1, 0, 2)) 52 | return torch.from_numpy(np.ascontiguousarray(result)) 53 | 54 | 55 | def make_context( 56 | tokenizer: PreTrainedTokenizer, 57 | query: str, 58 | history: List[Tuple[str, str]] = None, 59 | system: str = "", 60 | max_input_length: int = 2048, 61 | chat_format: str = "MFT", 62 | ): 63 | if history is None: 64 | history = [] 65 | 66 | if chat_format == "MFT": 67 | 68 | def _tokenize_str(role, content): 69 | role_dict = {"user": "human", "agent": "bot", "assistant": "bot", 70 | "": "bot", "": "human", 71 | "": "system","system": "system", 72 | "humnan": "human", "bot":"bot"} 73 | eo_dict = {"bot": tokenizer.eos_token, "human": "", "system": ""} 74 | if not content.endswith('\n'): 75 | content = content + '\n' 76 | return f"{role_dict[role]}\n{content}{eo_dict[role_dict[role]]}" 77 | 78 | system_text = _tokenize_str("system", system) 79 | raw_text = "" 80 | context_tokens = [] 81 | 82 | for turn_query, turn_response in reversed(history): 83 | query_text = _tokenize_str("user", turn_query) 84 | 85 | response_text = _tokenize_str( 86 | "assistant", turn_response 87 | ) 88 | prev_chat = query_text+response_text 89 | 90 | raw_text = prev_chat + raw_text 91 | 92 | 93 | 94 | raw_text = system_text + raw_text 95 | 96 | query_content = _tokenize_str('user', query) 97 | raw_text = raw_text+query_content+'bot' 98 | 99 | 100 | # truncate to max_input_length, truncate from the front 101 | return raw_text, tokenizer.encode(raw_text, add_special_tokens=False)[-max_input_length:] 102 | 103 | def get_engine_name(model, dtype, tp_size, pp_size, rank): 104 | if pp_size == 1: 105 | return '{}_{}_tp{}_rank{}.engine'.format(model, dtype, tp_size, rank) 106 | return '{}_{}_tp{}_pp{}_rank{}.engine'.format(model, dtype, tp_size, 107 | pp_size, rank) 108 | def _clear_screen(): 109 | if platform.system() == "Windows": 110 | os.system("cls") 111 | else: 112 | 
os.system("clear") 113 | 114 | def read_config(config_path: Path): 115 | with open(config_path, 'r') as f: 116 | config = json.load(f) 117 | use_gpt_attention_plugin = config['plugin_config']['gpt_attention_plugin'] 118 | remove_input_padding = config['plugin_config']['remove_input_padding'] 119 | dtype = config['builder_config']['precision'] 120 | tp_size = config['builder_config']['tensor_parallel'] 121 | pp_size = config['builder_config']['pipeline_parallel'] 122 | world_size = tp_size * pp_size 123 | 124 | 125 | num_heads = config['builder_config']['num_heads'] // tp_size 126 | hidden_size = config['builder_config']['hidden_size'] // tp_size 127 | vocab_size = config['builder_config']['vocab_size'] 128 | num_layers = config['builder_config']['num_layers'] 129 | num_kv_heads = config['builder_config'].get('num_kv_heads', num_heads) 130 | paged_kv_cache = config['plugin_config']['paged_kv_cache'] 131 | tokens_per_block = config['plugin_config']['tokens_per_block'] 132 | quant_mode = QuantMode(config['builder_config']['quant_mode']) 133 | if config['builder_config'].get('multi_query_mode', False): 134 | tensorrt_llm.logger.warning( 135 | "`multi_query_mode` config is deprecated. Please rebuild the engine." 136 | ) 137 | num_kv_heads = 1 138 | use_custom_all_reduce = config['plugin_config'].get('use_custom_all_reduce', 139 | False) 140 | 141 | model_config = ModelConfig(num_heads=num_heads, 142 | num_kv_heads=num_kv_heads, 143 | hidden_size=hidden_size, 144 | vocab_size=vocab_size, 145 | num_layers=num_layers, 146 | gpt_attention_plugin=use_gpt_attention_plugin, 147 | paged_kv_cache=paged_kv_cache, 148 | tokens_per_block=tokens_per_block, 149 | remove_input_padding=remove_input_padding, 150 | dtype=dtype, 151 | quant_mode=quant_mode, 152 | use_custom_all_reduce=use_custom_all_reduce) 153 | 154 | return model_config, tp_size, pp_size, dtype 155 | 156 | 157 | 158 | class LammaForCausalLMGenerationSession(GenerationSession): 159 | def __init__( 160 | self, 161 | model_config: ModelConfig, 162 | engine_buffer, 163 | mapping: Mapping, 164 | debug_mode=False, 165 | debug_tensors_to_save=None, 166 | cuda_graph_mode=False, 167 | stream: torch.cuda.Stream = None, 168 | ): 169 | super().__init__( 170 | model_config, 171 | engine_buffer, 172 | mapping, 173 | debug_mode, 174 | debug_tensors_to_save=debug_tensors_to_save, 175 | cuda_graph_mode=cuda_graph_mode, 176 | stream=stream 177 | ) 178 | self.stop_words_list=to_word_list_format([['human','bot']], tokenizer).cuda() 179 | 180 | def prepare_for_chat( 181 | self, 182 | tokenizer, 183 | input_text: Union[str, List[str]], 184 | system_text: str = "", 185 | history: list = None, 186 | max_input_length: Union[int, None] = None, 187 | ): 188 | 189 | if history is None: 190 | history = [] 191 | pad_id = tokenizer.pad_token_id 192 | # prepare for batch inference 193 | if not isinstance(input_text, list): 194 | batch_text = [input_text] 195 | else: 196 | batch_text = input_text 197 | if len(history) > 0 and len(history[0]) and len(history[0][0]) > 0 \ 198 | and not isinstance(history[0][0], list): 199 | history_list = [history] 200 | elif len(history) == 0: 201 | history_list = [[]] 202 | else: 203 | history_list = history 204 | input_ids = [] 205 | input_lengths = [] 206 | 207 | for line, history in zip(batch_text, history_list): 208 | # use make_content to generate prompt 209 | _, input_id_list = make_context( 210 | tokenizer=tokenizer, 211 | query=line, 212 | history=history, 213 | system=system_text, 214 | max_input_length=max_input_length, 215 | ) 216 | 
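            # input_id_list from make_context is the system/human/bot-formatted prompt, tokenized and left-truncated to max_input_length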
217 | # print("input_id_list len", len(input_id_list)) 218 | input_id = torch.from_numpy( 219 | np.array(input_id_list, dtype=np.int32) 220 | ).type(torch.int32).unsqueeze(0) 221 | input_ids.append(input_id) 222 | input_lengths.append(input_id.shape[-1]) 223 | max_length = max(input_lengths) 224 | # do padding, should move outside the profiling to prevent the overhead 225 | for i in range(len(input_ids)): 226 | pad_size = max_length - input_lengths[i] 227 | 228 | pad = torch.ones([1, pad_size]).type(torch.int32) * pad_id 229 | input_ids[i] = torch.cat( 230 | [torch.IntTensor(input_ids[i]), pad], axis=-1) 231 | input_ids = torch.cat(input_ids, axis=0).cuda() 232 | input_lengths = torch.IntTensor(input_lengths).type(torch.int32).cuda() 233 | return input_ids, input_lengths 234 | 235 | def generate( 236 | self, 237 | input_ids: torch.Tensor, 238 | input_lengths: torch.Tensor, 239 | sampling_config: SamplingConfig, 240 | max_new_tokens: int, 241 | runtime_rank: int = 0, 242 | stop_works_list: Optional[torch.Tensor] = None 243 | ): 244 | max_input_length = torch.max(input_lengths).item() 245 | 246 | # setup batch_size, max_input_length, max_output_len 247 | self.setup( 248 | batch_size=input_lengths.size(0), 249 | max_context_length=max_input_length, 250 | max_new_tokens=max_new_tokens 251 | ) 252 | output_ids = self.decode( 253 | input_ids, 254 | input_lengths, 255 | sampling_config, 256 | stop_words_list=self.stop_words_list 257 | ) 258 | with torch.no_grad(): 259 | torch.cuda.synchronize() 260 | if runtime_rank == 0: 261 | outputs = output_ids[:, 0, :] 262 | return outputs 263 | 264 | def chat_stream( 265 | self, 266 | tokenizer, 267 | sampling_config: SamplingConfig, 268 | input_text: Union[str, List[str]], 269 | max_input_length: Union[int, None], 270 | max_new_tokens: Union[int, None], 271 | system_text: str = "", 272 | history: list = None, 273 | runtime_rank: int = 0, 274 | ): 275 | input_ids, input_lengths = self.prepare_for_chat( 276 | tokenizer=tokenizer, 277 | input_text=input_text, 278 | system_text=system_text, 279 | history=history, 280 | max_input_length=max_input_length, 281 | ) 282 | max_input_length = torch.max(input_lengths).item() 283 | 284 | self.setup( 285 | batch_size=input_lengths.size(0), 286 | max_context_length=max_input_length, 287 | max_new_tokens=max_new_tokens 288 | ) 289 | with torch.no_grad(): 290 | chunk_lengths = max_input_length 291 | for output_ids in self.decode( 292 | input_ids, input_lengths, sampling_config, streaming=True, stop_words_list = self.stop_words_list 293 | ): 294 | torch.cuda.synchronize() 295 | 296 | if runtime_rank == 0: 297 | output_texts = [] 298 | for i in range(output_ids.size(0)): 299 | temp_ids = output_ids[i, 0, max_input_length:] 300 | temp_text = tokenizer.decode(temp_ids, skip_special_tokens=True) 301 | # check code is error 302 | if b"\xef\xbf\xbd" in temp_text.encode(): 303 | continue 304 | chunk_lengths += 1 305 | output_texts.append(temp_text) 306 | if len(output_texts) > 0: 307 | yield output_texts 308 | def parse_arguments(): 309 | parser = argparse.ArgumentParser() 310 | parser.add_argument('--max_new_tokens', type=int, default=512) 311 | parser.add_argument('--max_input_length', type=int, default=512) 312 | parser.add_argument('--log_level', type=str, default='error') 313 | parser.add_argument( 314 | '--engine_dir', 315 | type=str, 316 | default="", 317 | ) 318 | parser.add_argument( 319 | '--tokenizer_dir', 320 | type=str, 321 | default="", 322 | help="Directory containing the tokenizer.model." 
323 | ) 324 | return parser.parse_args() 325 | 326 | 327 | if __name__ == "__main__": 328 | # get model info 329 | args = parse_arguments() 330 | 331 | engine_dir = Path(args.engine_dir) 332 | config_path = engine_dir / 'config.json' 333 | model_config, tp_size, pp_size, dtype = read_config(config_path) 334 | 335 | world_size = tp_size * pp_size 336 | 337 | runtime_rank = tensorrt_llm.mpi_rank() 338 | runtime_mapping = tensorrt_llm.Mapping(world_size, 339 | runtime_rank, 340 | tp_size=tp_size, 341 | pp_size=pp_size) 342 | torch.cuda.set_device(runtime_rank % runtime_mapping.gpus_per_node) 343 | tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_dir, legacy=False, use_fast=False) 344 | tokenizer.pad_token_id=PAD_TOKEN 345 | tokenizer.end_token_id=EOS_TOKEN 346 | 347 | sampling_config = SamplingConfig(end_id=EOS_TOKEN, 348 | pad_id=PAD_TOKEN, 349 | num_beams=1, 350 | ) 351 | engine_name = get_engine_name('llama', dtype, tp_size, pp_size, 352 | runtime_rank) 353 | serialize_path = engine_dir / engine_name 354 | with open(serialize_path, 'rb') as f: 355 | engine_buffer = f.read() 356 | 357 | decoder = LammaForCausalLMGenerationSession( 358 | model_config, 359 | engine_buffer, 360 | runtime_mapping, 361 | ) 362 | history = [] 363 | response = '' 364 | print("Welcome :)") 365 | while True: 366 | input_text = input("User: ") 367 | if input_text in ["exit", "quit", "exit()", "quit()"]: 368 | break 369 | if input_text == 'clear': 370 | history = [] 371 | continue 372 | 373 | # print("Output: ", end='') 374 | 375 | response = "" 376 | for new_text in decoder.chat_stream( 377 | tokenizer=tokenizer, 378 | sampling_config=sampling_config, 379 | input_text=input_text, 380 | history=history, 381 | max_new_tokens=args.max_new_tokens, 382 | max_input_length=args.max_input_length 383 | ): 384 | 385 | _clear_screen() 386 | print(f"\nUser: {input_text}") 387 | print(f"\nCodeFuse-ChatBot: {new_text[0]}") 388 | response += new_text[0] 389 | print("") 390 | 391 | history.append((input_text, response)) 392 | # print(history) -------------------------------------------------------------------------------- /trtdemo/cli_chat_entry.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | engine_dir="" 3 | tokenizer_dir="" 4 | python cli_chat.py --engine_dir "${engine_dir}" \ 5 | --tokenizer_dir "${tokenizer_dir}"\ --------------------------------------------------------------------------------