├── README.md
├── README_CN.md
├── assets
    ├── LOGO.png
    ├── cli_demo.gif
    ├── modelscope_demo.png
    └── modelscope_demo2.png
└── trtdemo
    ├── README.md
    ├── cli_chat.py
    └── cli_chat_entry.sh


/README.md:
--------------------------------------------------------------------------------
  1 | # Index of CodeFuse Repositories
  2 | 
  3 | <p align="center">
  4 |   <img src="assets/LOGO.png" width="50%" />
  5 | </p>
  6 | 
  7 | 
  8 | <div align="center">
  9 | 
 10 | [**简体中文**](./README_CN.md)|[**HuggingFace**](https://huggingface.co/codefuse-ai)|[**ModelScope**](https://modelscope.cn/organization/codefuse-ai)
 11 | 
 12 | </div>
 13 | 
 14 | ## About This Repository
 15 | 
 16 | This repository lists key projects and related demos about CodeFuse. 
 17 | 
 18 | ## About CodeFuse
 19 | 
 20 | CodeFuse aims to develop Code Large Language Models (Code LLMs) to support and enhance full-lifecycle AI native sotware developing, covering crucial stages such as design requirements, coding, testing, building, deployment, operations, and insight analysis. Below is the overall framework of CodeFuse. 
 21 | <p align="center">
 22 |   <img src="https://github.com/codefuse-ai/.github/assets/82250814/9c8cd9f3-b1ca-43a1-9b0f-8f612b06753e" width="90%" />
 23 | </p>
 24 | <br/>
 25 | 
 26 | 
 27 | ## Release Update
 28 | ** 2024.08 **  [codefuse-ide](https://github.com/codefuse-ai/codefuse-ide):  Release opensumi&CodeBlitz for code ide; [CGE](https://github.com/codefuse-ai/codefuse-CGE) :  Release D2Coder-v1 Embedding model for code search
 29 | 
 30 | ** 2024.07 **  [D2LLM](https://github.com/codefuse-ai/D2LLM) :  Release D2Coder-v1 Embedding model for code search, [RepoFuse](https://github.com/codefuse-ai) :  Repository-Level Code Completion with Language Models with Fused Dual Context
 31 | 
 32 | ** 2024.06 **  [Codefuse-ai pages](https://codefuse-ai.github.io), [D2LLM](https://github.com/codefuse-ai/D2LLM) releases feature about Decomposed and Distilled Large Language Models for Semantic Search, [MFTCoder](https://github.com/codefuse-ai/MFTCoder) releases V0.4.2. More detail see [Release & Next Release](https://github.com/codefuse-ai/codefuse/issues/16)
 33 | 
 34 | ** 2024.05 **  ModelCache releases v0.2.1 with supporting multimodal, see [CodeFuse-ModelCache](https://github.com/codefuse-ai/CodeFuse-ModelCache). DevOps-Model support function call, see [CodeFuse-DevOps-Model](https://github.com/codefuse-ai/CodeFuse-DevOps-Model). More detail see [Release & Next Release](https://github.com/codefuse-ai/codefuse/issues/14)
 35 | 
 36 | ** 2024.04 ** CodeFuse-muAgent: a multi-agent framework, more detail see [Release & Next Release](https://github.com/codefuse-ai/codefuse/issues/12)
 37 | 
 38 | ## List of CodeFuse Repositories
 39 | 
 40 | We listed repositories according to the lifecycle above. 
 41 | | LifeCycle Stage               | Project Repository|  Repo-Description | Road Map |
 42 | |:------------------------:|:-----------------:|:-------:|:------------------:|
 43 | | Requirement & Design     |[MFT-VLM](https://github.com/codefuse-ai/CodeFuse-MFT-VLM)            | Instruction-fine-tuning for Vision-language tasks  |       |
 44 | | Coding        |[MFTCoder](https://github.com/codefuse-ai/MFTCoder) | Instruction-Tuning Framework  |      |
 45 | |                     |[FastTransformer4CodeFuse](https://github.com/codefuse-ai/FasterTransformer4CodeFuse) | FT based Inference Engine |       |
 46 | |                     |[CodeFuse-Eval](https://github.com/codefuse-ai/codefuse-evaluation)|Evaluation kits for CodeFuse |         |
 47 | | Test & Build   |[TestAgent](https://github.com/codefuse-ai/Test-Agent) | TestGPT demo frontend  |        |
 48 | | DevOps         |[DevOps-Eval](https://github.com/codefuse-ai/codefuse-devops-eval)|Benchmark for DevOps|         |      |
 49 | |                     |[DevOps-Model](https://github.com/codefuse-ai/CodeFuse-DevOps-Model) |index for DevOps  models|         |
 50 | | Data Insight       |    NA              |     NA   |       |
 51 | | Base              |[ChatBot](https://github.com/codefuse-ai/codefuse-chatbot) |General chatbot frontend for CodeFuse |       |
 52 | |                     |[muAgent](https://github.com/codefuse-ai/CodeFuse-muAgent) | multi-agent framework |       |
 53 | |                     |[ModelCache](https://github.com/codefuse-ai/CodeFuse-ModelCache) |Semantic Cache for LLM Serving  |       |
 54 | |                     |[CodeFuse-Query](https://github.com/codefuse-ai/CodeFuse-Query)|Query-Based Code Analysis Engine |       |
 55 | | Others                     |[CoCA](https://github.com/codefuse-ai/Collinear-Constrained-Attention)|Colinear Attention |       |
 56 | |                     |[Awesine-Code-LLM](https://github.com/codefuse-ai/Awesome-Code-LLM)|Code-LLM Survey|       |
 57 | |                     |This Repo |General Introduction & index of CodeFuse Repos|       |
 58 | 
 59 | ## List of CodeFuse Primary Released Models
 60 | 
 61 | | ModelName               | Short Description | Modele Linls | 
 62 | |:------------------------:|:-----------------:|:-----------------:|
 63 | | CodeFuse-13B     | Training from scratch by CodeFuse | [HF](https://huggingface.co/codefuse-ai/CodeFuse-13B) ; [MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-13B/summary)  | 
 64 | | CodeFuse-CodeLLaMA-34B    |    Finetuning on CodeLLaMA-34B  | [HF](https://huggingface.co/codefuse-ai/CodeFuse-CodeLlama-34B) ; [MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B/summary)  | 
 65 | | ** CodeFuse-CodeLLaMA-34B-4bits |  4bits quantized 34B model |[HF](CodeFuse-CodeLlama-34B-4bits) ; [MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B-4bits/summary)  | 
 66 | | CodeFuse-DeepSeek-33B | FineTuning on DeepSeek-Coder-33b | [HF](https://huggingface.co/codefuse-ai/CodeFuse-DeepSeek-33B) ; [MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-DeepSeek-33B/summary) |
 67 | | ** CodeFuse-DeepSeek-33B-4bits | 4-bit quantized 33B model | [HF](https://huggingface.co/codefuse-ai/CodeFuse-DeepSeek-33B-4bits) ; [MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-DeepSeek-33B-4bits/summary) |
 68 | | CodeFuse-VLM-14B | SoTA vision-language model | [HF](https://huggingface.co/codefuse-ai/CodeFuse-VLM-14B) ; [MS]([https://modelscope.cn/models/codefuse-ai/CodeFuse-Mixtral-8x7B/summary](https://modelscope.cn/models/ss41979310/CodeFuse-VLM-14B/summary)) |
 69 | 
 70 | ** -- recommended models;
 71 | 
 72 | ## Demos
 73 | 
 74 |  - Video demos: Chinese version at below, English version under preparation.
 75 | 
 76 |    https://user-images.githubusercontent.com/103973989/267514150-21012a5d-652d-4aea-bcea-582e67855ad7.mp4
 77 | 
 78 |  - Online Demo: You can try our CodeFuse-CodeLlama-34B model on ModelScope: [CodeFuse-CodeLlama34B-MFT-Demo](https://modelscope.cn/studios/codefuse-ai/CodeFuse-CodeLlama34B-MFT-Demo/summary)
 79 | 
 80 | ![Online Demo Snapshot](assets/modelscope_demo2.png)
 81 | 
 82 | - You can also try to install the [CodeFuse-Chatbot](https://github.com/codefuse-ai/codefuse-chatbot) to test our models locally. 
 83 | 
 84 | ## How to get
 85 | 
 86 | - [**HuggingFace**](https://huggingface.co/codefuse-ai).
 87 | - [**ModelScope**](https://modelscope.cn/organization/codefuse-ai).
 88 | - [**WiseModel**](https://wisemodel.cn/organization/codefuse-ai).
 89 | - Train or finetuning on your own models, you can try our [**MFTCoder**](https://github.com/codefuse-ai/MFTCoder), which enables efficient fine-tuning for multi-task, multi-model, and multi-training-framework scenarios.
 90 | 
 91 | ## Citation
 92 | 
 93 | For more technique details about CodeFuse, please refer to our paper [MFTCoder](https://arxiv.org/abs/2311.02303).
 94 | 
 95 | If you find our work useful or helpful for your R&D work, please feel free to cite our paper as follows.
 96 | ```
 97 | @article{mftcoder2023,
 98 |       title={MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning}, 
 99 |       author={Bingchang Liu and Chaoyu Chen and Cong Liao and Zi Gong and Huan Wang and Zhichao Lei and Ming Liang and Dajun Chen and Min Shen and Hailian Zhou and Hang Yu and Jianguo Li},
100 |       year={2023},
101 |       journal={arXiv preprint arXiv},
102 |       archivePrefix={arXiv},
103 |       eprint={2311.02303}
104 | }
105 | ```
106 | 
107 | 


--------------------------------------------------------------------------------
/README_CN.md:
--------------------------------------------------------------------------------
  1 | # CodeFuse中文索引
  2 | 
  3 | <p align="center">
  4 |   <img src="assets/LOGO.png" width="50%" />
  5 | </p>
  6 | 
  7 | 
  8 | <div align="center">
  9 | 
 10 | [ **English** ](./README.md)|[ **HuggingFace** ](https://huggingface.co/codefuse-ai) | [ **魔搭社区** ](https://modelscope.cn/organization/codefuse-ai) | [ **WiseModel** ](https://www.wisemodel.cn/organization/codefuse-ai) | [ **产品主页** ](https://codefuse.alipay.com) 
 11 | 
 12 | </div>
 13 | 
 14 | ## 关于本仓库
 15 | 
 16 | 本仓库索引了CodeFuse项目的关键仓库、模型和演示例子. 
 17 | 
 18 | ## 关于CodeFuse
 19 | CodeFuse的使命是开发专门设计用于支持整个软件开发生命周期的大型代码语言模型（Code LLMs），涵盖设计、需求、编码、测试、部署、运维等关键阶段。
 20 | 我们致力于打造创新的解决方案，让软件开发者们在研发的过程中如丝般顺滑。下面是CodeFuse的整个框架。
 21 | <p align="center">
 22 |   <img src="https://github.com/codefuse-ai/.github/assets/82250814/9c8cd9f3-b1ca-43a1-9b0f-8f612b06753e" width="90%" />
 23 | </p>
 24 | <br/>
 25 | 
 26 | ## 版本更新
 27 | ** 2024.07 ** [CGE](https://github.com/codefuse-ai/codefuse-CGE) :  基于D2LLM 开源 D2Coder-v1 模型用于代码向量检索; [codefuse-ide](https://github.com/codefuse-ai/codefuse-ide): 一款云端的IDE，支持codefuse代码开发、分析、测例生成
 28 | 
 29 | ** 2024.07 ** [D2LLM](https://github.com/codefuse-ai/D2LLM) :  基于D2LLM 开源 D2Coder-v1 模型用于代码向量检索, [RepoFuse](https://github.com/codefuse-ai) : 多仓库级别的代码补全
 30 | 
 31 | ** 2024.06 **  [Codefuse-ai 官方主页](https://codefuse-ai.github.io)上线, [D2LLM](https://github.com/codefuse-ai/D2LLM) 新增能力关于分解和蒸馏的大型语言模型用于语义搜索, [MFTCoder](https://github.com/codefuse-ai/MFTCoder) 放出 V0.4.2 版本. 更多内容见[Release & Next Release](https://github.com/codefuse-ai/codefuse/issues/16)
 32 | 
 33 | ** 2024.05 **  大模型语义缓存ModelCache v0.2.1版本支持多模态, 见 [CodeFuse-ModelCache](https://github.com/codefuse-ai/CodeFuse-ModelCache). DevOps-Model 支持function call能力，见 [CodeFuse-DevOps-Model](https://github.com/codefuse-ai/CodeFuse-DevOps-Model). , 更多内容见 [Release & Next Release](https://github.com/codefuse-ai/codefuse/issues/14)
 34 | 
 35 | ** 2024.04 ** CodeFuse-muAgent: 多智能体框架, 更多内容见 [Release & Next Release](https://github.com/codefuse-ai/codefuse/issues/12)
 36 | <br/>
 37 | 
 38 | ## CodeFuse仓库列表
 39 | 
 40 | 我们按照上图软件生命周期的划分对仓库进行了组织. 
 41 | | 生命周期阶段               | 仓库名 | 仓库简介 | 技术路线 |
 42 | |:------------------------:|:-----------------:|:-------:|:----------:|
 43 | | 项目Copilot     |    NA             |     NA  |       |
 44 | | 数据Copilot        |[MFTCoder](https://github.com/codefuse-ai/MFTCoder) | CodeFuse独有的指令微调框架 |      |
 45 | |                     |[FastTransformer4CodeFuse](https://github.com/codefuse-ai/FasterTransformer4CodeFuse) | 推理引擎|       |
 46 | |                     |[CodeFuse-Eval](https://github.com/codefuse-ai/codefuse-evaluation)|代码评估框架|         |
 47 | | 测试和构建Copilot  |[TestAgent](https://github.com/codefuse-ai/Test-Agent) | TestGPT示例前端  |        |
 48 | | 运维Copilot         |[DevOps-Eval](https://github.com/codefuse-ai/codefuse-devops-eval)|DevOps评测集和框架 |         |
 49 | |                     |[DevOps-Model](https://github.com/codefuse-ai/CodeFuse-DevOps-Model) |DevOps模型列表介绍 |         |
 50 | | 数据Copilot        |    NA              |     NA   |       |
 51 | | 其他模块   |[ChatBot](https://github.com/codefuse-ai/codefuse-chatbot) |通用chatbot前端 |       |
 52 | |                     |[muAgent](https://github.com/codefuse-ai/CodeFuse-muAgent) | multi-agent框架 |       |
 53 | |                     |[CoCA](https://github.com/codefuse-ai/Collinear-Constrained-Attention)|共线约束注意力算法 |       |
 54 | |                     |[Awesine-Code-LLM](https://github.com/codefuse-ai/Awesome-Code-LLM)|代码大模型survey主页 |       |
 55 | |                     |[CodeFuse-Query](https://github.com/codefuse-ai/CodeFuse-Query)| 基于查询的代码分析引擎 |       |
 56 | |                     |你正在看的仓库 | CodeFuse通用介绍和索引 |       |
 57 | 
 58 | ## CodeFuse已发布主要模型索引
 59 | 
 60 | | ModelName               | Short Description | Modele Linls | 
 61 | |:------------------------:|:-----------------:|:-----------------:|
 62 | | CodeFuse-13B     | Training from scratch by CodeFuse | [HF](https://huggingface.co/codefuse-ai/CodeFuse-13B) ; [MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-13B/summary)  | 
 63 | | CodeFuse-CodeLLaMA-34B    |    Finetuning on CodeLLaMA-34B  | [HF](https://huggingface.co/codefuse-ai/CodeFuse-CodeLlama-34B) ; [MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B/summary)  | 
 64 | | ** CodeFuse-CodeLLaMA-34B-4bits |  4bits quantized 34B model |[HF](CodeFuse-CodeLlama-34B-4bits) ; [MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B-4bits/summary)  | 
 65 | | CodeFuse-DeepSeek-33B | FineTuning on DeepSeek-Coder-33b | [HF](https://huggingface.co/codefuse-ai/CodeFuse-DeepSeek-33B) ; [MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-DeepSeek-33B/summary) |
 66 | | ** CodeFuse-DeepSeek-33B-4bits | 4-bit quantized 33B model | [HF](https://huggingface.co/codefuse-ai/CodeFuse-DeepSeek-33B-4bits) ; [MS](https://modelscope.cn/models/codefuse-ai/CodeFuse-DeepSeek-33B-4bits/summary) |
 67 | | CodeFuse-VLM-14B | SoTA vision-language model | [HF](https://huggingface.co/codefuse-ai/CodeFuse-VLM-14B) ; [MS]([https://modelscope.cn/models/codefuse-ai/CodeFuse-Mixtral-8x7B/summary](https://modelscope.cn/models/ss41979310/CodeFuse-VLM-14B/summary)) |
 68 |  
 69 | 
 70 | 
 71 | ## 演示
 72 | 
 73 |  - 视频Demo: 下面是中文版本, 英文版准备中
 74 |    https://user-images.githubusercontent.com/103973989/267514150-21012a5d-652d-4aea-bcea-582e67855ad7.mp4
 75 | 
 76 |  - 在线版本: 你可以在魔搭社区尝试我们的34B模型: [CodeFuse-CodeLlama34B-MFT-Demo](https://modelscope.cn/studios/codefuse-ai/CodeFuse-CodeLlama34B-MFT-Demo/summary)
 77 | 
 78 | ![Online Demo Snapshot](assets/modelscope_demo2.png)
 79 | 
 80 | - 离线版本：你也可以安装[CodeFuse-Chatbot](https://github.com/codefuse-ai/codefuse-chatbot)在本地尝试我们的模型.
 81 | 
 82 | ## 如何获得
 83 | 
 84 | - HF模型社区[**HuggingFace**](https://huggingface.co/codefuse-ai).
 85 | - 魔搭社区[**ModelScope**](https://modelscope.cn/organization/codefuse-ai).
 86 | - WiseModel社区[**WiseModel**](https://wisemodel.cn/organization/codefuse-ai).
 87 | - 对于自有或者自己感兴趣的模型，可以使用我们的[**MFTCoder**](https://github.com/codefuse-ai/MFTCoder)框架微调训练，它是一个支持多模型、多任务、多训练平台的微调框架.
 88 | 
 89 | ## 参考文献
 90 | 如果你觉得本项目对你有帮助，请引用下述论文:
 91 | ```
 92 | @article{mftcoder2023,
 93 |       title={MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning}, 
 94 |       author={Bingchang Liu and Chaoyu Chen and Cong Liao and Zi Gong and Huan Wang and Zhichao Lei and Ming Liang and Dajun Chen and Min Shen and Hailian Zhou and Hang Yu and Jianguo Li},
 95 |       year={2023},
 96 |       journal={arXiv preprint arXiv},
 97 |       archivePrefix={arXiv},
 98 |       eprint={2311.02303}
 99 | }
100 | ```
101 | 


--------------------------------------------------------------------------------
/assets/LOGO.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/codefuse-ai/codefuse/8903fc49b10f3653ee0ec2230fc7100a3a9037b6/assets/LOGO.png


--------------------------------------------------------------------------------
/assets/cli_demo.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/codefuse-ai/codefuse/8903fc49b10f3653ee0ec2230fc7100a3a9037b6/assets/cli_demo.gif


--------------------------------------------------------------------------------
/assets/modelscope_demo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/codefuse-ai/codefuse/8903fc49b10f3653ee0ec2230fc7100a3a9037b6/assets/modelscope_demo.png


--------------------------------------------------------------------------------
/assets/modelscope_demo2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/codefuse-ai/codefuse/8903fc49b10f3653ee0ec2230fc7100a3a9037b6/assets/modelscope_demo2.png


--------------------------------------------------------------------------------
/trtdemo/README.md:
--------------------------------------------------------------------------------
 1 | ## Cli-demo for CodeFuse-CodeLLaMA-34B-4bits with TensorRT-LLM
 2 | 
 3 | ### Introduction
 4 | 
 5 | This cli-demo aims to show the inference of CodeFuse-CodeLLaMA-34B-4bits utilizing TensorRT-LLM (refer trtllm in the following) engine. We offer the model weights of CodeFuse-CodeLLaMA-34B-4bits and please refer to [TensorRL-LLM ](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.6.1/examples/llama) on github to build trtllm engine. Moreover, our practice is coming soon.
 6 | 
 7 | 
 8 | 
 9 | ### Usage
10 | 
11 | The launch script is 
12 | ```bash
13 | engine_dir=""
14 | tokenizer_dir=""
15 | python cli_chat.py --engine_dir  "${engine_dir}"  \
16 |                     --tokenizer_dir "${tokenizer_dir}"
17 | ```
18 | `enging_dir` and  `tokenizer_dir` are the directory pathes for trtllm engine and tokenizer, respectively. Also, the `max_input_length` and `max_new_tokens`  can be passed to cli_chat.py, and both of them should be less than those corresponding parameters in the trtllm engine config to avoid runtime error.
19 | 
20 | 
21 | ### Scripts
22 | 
23 | - demo file [cli_chat.py](cli_chat.py)
24 | - launch script [cli_chat_entry.sh](cli_chat_entry.sh)
25 | 
26 | ### Demonstration
27 | 
28 | ![Client Demo](../assets/cli_demo.gif)
29 | 
30 | - You can also try to install the [CodeFuse-Chatbot](https://github.com/codefuse-ai/codefuse-chatbot) to test our models locally.
31 | 


--------------------------------------------------------------------------------
/trtdemo/cli_chat.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import argparse
  3 | import tensorrt_llm
  4 | import torch
  5 | import platform
  6 | import json
  7 | 
  8 | import numpy as np
  9 | from tensorrt_llm.runtime import (
 10 |     ModelConfig, SamplingConfig, GenerationSession
 11 | )
 12 | from tensorrt_llm.runtime.generation import Mapping
 13 | from tensorrt_llm.quantization import QuantMode
 14 | from typing import List, Union, Optional
 15 | from pathlib import Path
 16 | from transformers import (AutoTokenizer,PreTrainedTokenizer)
 17 | 
 18 | now_dir = os.path.dirname(os.path.abspath(__file__))
 19 | 
 20 | EOS_TOKEN = 2
 21 | PAD_TOKEN = 0
 22 | 
 23 | from typing import List, Tuple
 24 | 
 25 | def to_word_list_format(words_list: List[List[str]], tokenizer):
 26 | 
 27 |     flat_ids = []
 28 |     offsets = []
 29 |     for words in words_list:
 30 |         item_flat_ids = []
 31 |         item_offsets = []
 32 | 
 33 |         for word in words:
 34 |             ids = tokenizer.encode(word,add_special_tokens=False)[1:]
 35 | 
 36 |             if len(ids) == 0:
 37 |                 continue
 38 | 
 39 |             item_flat_ids += ids
 40 |             item_offsets.append(len(ids))
 41 | 
 42 |         flat_ids.append(np.array(item_flat_ids))
 43 |         offsets.append(np.cumsum(np.array(item_offsets)))
 44 | 
 45 |     pad_to = max(1, max(len(ids) for ids in flat_ids))
 46 | 
 47 |     for i, (ids, offs) in enumerate(zip(flat_ids, offsets)):
 48 |         flat_ids[i] = np.pad(ids, (0, pad_to - len(ids)), constant_values=0)
 49 |         offsets[i] = np.pad(offs, (0, pad_to - len(offs)), constant_values=-1)
 50 | 
 51 |     result = np.array([flat_ids, offsets], dtype="int32").transpose((1, 0, 2))
 52 |     return torch.from_numpy(np.ascontiguousarray(result))
 53 | 
 54 | 
 55 | def make_context(
 56 |     tokenizer: PreTrainedTokenizer,
 57 |     query: str,
 58 |     history: List[Tuple[str, str]] = None,
 59 |     system: str = "",
 60 |     max_input_length: int = 2048,
 61 |     chat_format: str = "MFT",
 62 | ):
 63 |     if history is None:
 64 |         history = []
 65 | 
 66 |     if chat_format == "MFT":
 67 | 
 68 |         def _tokenize_str(role, content):
 69 |             role_dict = {"user": "human", "agent": "bot", "assistant": "bot", 
 70 |                     "<bot>": "bot", "<human>": "human", 
 71 |                     "<system>": "system","system": "system",
 72 |                     "humnan": "human", "bot":"bot"}
 73 |             eo_dict = {"bot": tokenizer.eos_token, "human": "", "system": ""}    
 74 |             if not content.endswith('\n'):
 75 |                 content = content + '\n'
 76 |             return f"<s>{role_dict[role]}\n{content}{eo_dict[role_dict[role]]}"
 77 | 
 78 |         system_text = _tokenize_str("system", system)
 79 |         raw_text = ""
 80 |         context_tokens = []
 81 | 
 82 |         for turn_query, turn_response in reversed(history):
 83 |             query_text = _tokenize_str("user", turn_query)
 84 | 
 85 |             response_text = _tokenize_str(
 86 |                 "assistant", turn_response
 87 |             )
 88 |             prev_chat = query_text+response_text
 89 | 
 90 |             raw_text = prev_chat + raw_text
 91 | 
 92 | 
 93 |             
 94 |         raw_text = system_text + raw_text
 95 | 
 96 |         query_content = _tokenize_str('user', query)
 97 |         raw_text = raw_text+query_content+'<s>bot'
 98 |         
 99 |         
100 |     # truncate to max_input_length, truncate from the front
101 |     return raw_text, tokenizer.encode(raw_text, add_special_tokens=False)[-max_input_length:]
102 | 
103 | def get_engine_name(model, dtype, tp_size, pp_size, rank):
104 |     if pp_size == 1:
105 |         return '{}_{}_tp{}_rank{}.engine'.format(model, dtype, tp_size, rank)
106 |     return '{}_{}_tp{}_pp{}_rank{}.engine'.format(model, dtype, tp_size,
107 |                                                   pp_size, rank)
108 | def _clear_screen():
109 |     if platform.system() == "Windows":
110 |         os.system("cls")
111 |     else:
112 |         os.system("clear")
113 | 
114 | def read_config(config_path: Path):
115 |     with open(config_path, 'r') as f:
116 |         config = json.load(f)
117 |     use_gpt_attention_plugin = config['plugin_config']['gpt_attention_plugin']
118 |     remove_input_padding = config['plugin_config']['remove_input_padding']
119 |     dtype = config['builder_config']['precision']
120 |     tp_size = config['builder_config']['tensor_parallel']
121 |     pp_size = config['builder_config']['pipeline_parallel']
122 |     world_size = tp_size * pp_size
123 | 
124 |         
125 |     num_heads = config['builder_config']['num_heads'] // tp_size
126 |     hidden_size = config['builder_config']['hidden_size'] // tp_size
127 |     vocab_size = config['builder_config']['vocab_size']
128 |     num_layers = config['builder_config']['num_layers']
129 |     num_kv_heads = config['builder_config'].get('num_kv_heads', num_heads)
130 |     paged_kv_cache = config['plugin_config']['paged_kv_cache']
131 |     tokens_per_block = config['plugin_config']['tokens_per_block']
132 |     quant_mode = QuantMode(config['builder_config']['quant_mode'])
133 |     if config['builder_config'].get('multi_query_mode', False):
134 |         tensorrt_llm.logger.warning(
135 |             "`multi_query_mode` config is deprecated. Please rebuild the engine."
136 |         )
137 |         num_kv_heads = 1
138 |     use_custom_all_reduce = config['plugin_config'].get('use_custom_all_reduce',
139 |                                                         False)
140 | 
141 |     model_config = ModelConfig(num_heads=num_heads,
142 |                                num_kv_heads=num_kv_heads,
143 |                                hidden_size=hidden_size,
144 |                                vocab_size=vocab_size,
145 |                                num_layers=num_layers,
146 |                                gpt_attention_plugin=use_gpt_attention_plugin,
147 |                                paged_kv_cache=paged_kv_cache,
148 |                                tokens_per_block=tokens_per_block,
149 |                                remove_input_padding=remove_input_padding,
150 |                                dtype=dtype,
151 |                                quant_mode=quant_mode,
152 |                                use_custom_all_reduce=use_custom_all_reduce)
153 | 
154 |     return model_config, tp_size, pp_size, dtype
155 | 
156 | 
157 | 
158 | class LammaForCausalLMGenerationSession(GenerationSession):
159 |     def __init__(
160 |         self,
161 |         model_config: ModelConfig,
162 |         engine_buffer,
163 |         mapping: Mapping,
164 |         debug_mode=False,
165 |         debug_tensors_to_save=None,
166 |         cuda_graph_mode=False,
167 |         stream: torch.cuda.Stream = None,
168 |     ):
169 |         super().__init__(
170 |             model_config,
171 |             engine_buffer,
172 |             mapping,
173 |             debug_mode,
174 |             debug_tensors_to_save=debug_tensors_to_save,
175 |             cuda_graph_mode=cuda_graph_mode,
176 |             stream=stream
177 |         )
178 |         self.stop_words_list=to_word_list_format([['<s>human','<s>bot']], tokenizer).cuda()
179 | 
180 |     def prepare_for_chat(
181 |         self,
182 |         tokenizer,
183 |         input_text: Union[str, List[str]],
184 |         system_text: str = "",
185 |         history: list = None,
186 |         max_input_length: Union[int, None] = None,
187 |     ):
188 | 
189 |         if history is None:
190 |             history = []
191 |         pad_id = tokenizer.pad_token_id
192 |         # prepare for batch inference
193 |         if not isinstance(input_text, list):
194 |             batch_text = [input_text]
195 |         else:
196 |             batch_text = input_text
197 |         if len(history) > 0 and len(history[0]) and len(history[0][0]) > 0 \
198 |                 and not isinstance(history[0][0], list):
199 |             history_list = [history]
200 |         elif len(history) == 0:
201 |             history_list = [[]]
202 |         else:
203 |             history_list = history
204 |         input_ids = []
205 |         input_lengths = []
206 | 
207 |         for line, history in zip(batch_text, history_list):
208 |             # use make_content to generate prompt
209 |             _, input_id_list = make_context(
210 |                 tokenizer=tokenizer,
211 |                 query=line,
212 |                 history=history,
213 |                 system=system_text,
214 |                 max_input_length=max_input_length,
215 |             )
216 |             
217 |             # print("input_id_list len", len(input_id_list))
218 |             input_id = torch.from_numpy(
219 |                 np.array(input_id_list, dtype=np.int32)
220 |             ).type(torch.int32).unsqueeze(0)
221 |             input_ids.append(input_id)
222 |             input_lengths.append(input_id.shape[-1])
223 |         max_length = max(input_lengths)
224 |         # do padding, should move outside the profiling to prevent the overhead
225 |         for i in range(len(input_ids)):
226 |             pad_size = max_length - input_lengths[i]
227 | 
228 |             pad = torch.ones([1, pad_size]).type(torch.int32) * pad_id
229 |             input_ids[i] = torch.cat(
230 |                 [torch.IntTensor(input_ids[i]), pad], axis=-1)
231 |         input_ids = torch.cat(input_ids, axis=0).cuda()
232 |         input_lengths = torch.IntTensor(input_lengths).type(torch.int32).cuda()
233 |         return input_ids, input_lengths
234 |     
235 |     def generate(
236 |         self,
237 |         input_ids: torch.Tensor,
238 |         input_lengths: torch.Tensor,
239 |         sampling_config: SamplingConfig,
240 |         max_new_tokens: int,
241 |         runtime_rank: int = 0,
242 |         stop_works_list: Optional[torch.Tensor] = None
243 |     ):
244 |         max_input_length = torch.max(input_lengths).item()
245 | 
246 |         # setup batch_size, max_input_length, max_output_len
247 |         self.setup(
248 |             batch_size=input_lengths.size(0),
249 |             max_context_length=max_input_length,
250 |             max_new_tokens=max_new_tokens
251 |         )
252 |         output_ids = self.decode(
253 |             input_ids,
254 |             input_lengths,
255 |             sampling_config,
256 |             stop_words_list=self.stop_words_list
257 |         )
258 |         with torch.no_grad():
259 |             torch.cuda.synchronize()
260 |             if runtime_rank == 0:
261 |                 outputs = output_ids[:, 0, :]
262 |                 return outputs
263 | 
264 |     def chat_stream(
265 |         self,
266 |         tokenizer,
267 |         sampling_config: SamplingConfig,
268 |         input_text: Union[str, List[str]],
269 |         max_input_length: Union[int, None],
270 |         max_new_tokens: Union[int, None],
271 |         system_text: str = "",
272 |         history: list = None,
273 |         runtime_rank: int = 0,
274 |     ):
275 |         input_ids, input_lengths = self.prepare_for_chat(
276 |             tokenizer=tokenizer,
277 |             input_text=input_text,
278 |             system_text=system_text,
279 |             history=history,
280 |             max_input_length=max_input_length,
281 |         )
282 |         max_input_length = torch.max(input_lengths).item()
283 | 
284 |         self.setup(
285 |             batch_size=input_lengths.size(0),
286 |             max_context_length=max_input_length,
287 |             max_new_tokens=max_new_tokens
288 |         )
289 |         with torch.no_grad():
290 |             chunk_lengths = max_input_length
291 |             for output_ids in self.decode(
292 |                 input_ids, input_lengths, sampling_config, streaming=True, stop_words_list = self.stop_words_list
293 |             ):
294 |                 torch.cuda.synchronize()
295 |                 
296 |                 if runtime_rank == 0:
297 |                     output_texts = []
298 |                     for i in range(output_ids.size(0)):
299 |                         temp_ids = output_ids[i, 0, max_input_length:]
300 |                         temp_text = tokenizer.decode(temp_ids, skip_special_tokens=True)
301 |                         # check code is error
302 |                         if b"\xef\xbf\xbd" in temp_text.encode():
303 |                             continue
304 |                         chunk_lengths += 1
305 |                         output_texts.append(temp_text)
306 |                     if len(output_texts) > 0:
307 |                         yield output_texts
308 | def parse_arguments():
309 |     parser = argparse.ArgumentParser()
310 |     parser.add_argument('--max_new_tokens', type=int, default=512)
311 |     parser.add_argument('--max_input_length', type=int, default=512)
312 |     parser.add_argument('--log_level', type=str, default='error')
313 |     parser.add_argument(
314 |         '--engine_dir',
315 |         type=str,
316 |         default="",
317 |     )
318 |     parser.add_argument(
319 |         '--tokenizer_dir',
320 |         type=str,
321 |         default="",
322 |         help="Directory containing the tokenizer.model."
323 |     )
324 |     return parser.parse_args()
325 | 
326 | 
327 | if __name__ == "__main__":
328 |     # get model info
329 |     args = parse_arguments()
330 | 
331 |     engine_dir = Path(args.engine_dir)
332 |     config_path = engine_dir / 'config.json'
333 |     model_config, tp_size, pp_size, dtype = read_config(config_path)
334 | 
335 |     world_size = tp_size * pp_size
336 | 
337 |     runtime_rank = tensorrt_llm.mpi_rank()
338 |     runtime_mapping = tensorrt_llm.Mapping(world_size,
339 |                                            runtime_rank,
340 |                                            tp_size=tp_size,
341 |                                            pp_size=pp_size)
342 |     torch.cuda.set_device(runtime_rank % runtime_mapping.gpus_per_node)
343 |     tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_dir, legacy=False, use_fast=False)
344 |     tokenizer.pad_token_id=PAD_TOKEN
345 |     tokenizer.end_token_id=EOS_TOKEN
346 |     
347 |     sampling_config = SamplingConfig(end_id=EOS_TOKEN,
348 |                                      pad_id=PAD_TOKEN,
349 |                                      num_beams=1,
350 |                                      )
351 |     engine_name = get_engine_name('llama', dtype, tp_size, pp_size,
352 |                                   runtime_rank)
353 |     serialize_path = engine_dir / engine_name
354 |     with open(serialize_path, 'rb') as f:
355 |         engine_buffer = f.read()
356 | 
357 |     decoder = LammaForCausalLMGenerationSession(
358 |         model_config,
359 |         engine_buffer,
360 |         runtime_mapping,
361 |     )
362 |     history = []
363 |     response = ''
364 |     print("Welcome :)")
365 |     while True:
366 |         input_text = input("User: ")
367 |         if input_text in ["exit", "quit", "exit()", "quit()"]:
368 |             break
369 |         if input_text == 'clear':
370 |             history = []
371 |             continue
372 |         
373 |             # print("Output: ", end='')
374 | 
375 |         response = ""
376 |         for new_text in decoder.chat_stream(
377 |             tokenizer=tokenizer,
378 |             sampling_config=sampling_config,
379 |             input_text=input_text,
380 |             history=history,
381 |             max_new_tokens=args.max_new_tokens,
382 |             max_input_length=args.max_input_length
383 |         ):
384 |             
385 |             _clear_screen()
386 |             print(f"\nUser: {input_text}")
387 |             print(f"\nCodeFuse-ChatBot: {new_text[0]}")
388 |         response += new_text[0]
389 |         print("")
390 |         
391 |         history.append((input_text, response))
392 |         # print(history)


--------------------------------------------------------------------------------
/trtdemo/cli_chat_entry.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | engine_dir=""
3 | tokenizer_dir=""
4 | python cli_chat.py --engine_dir  "${engine_dir}"  \
5 |                     --tokenizer_dir "${tokenizer_dir}"\


--------------------------------------------------------------------------------