├── example
│   ├── __init__.py
│   ├── hello_word.py
│   ├── ollama_app_stream.py
│   └── ollama_app.py
├── ling_code
│   ├── __init__.py
│   ├── inference_script.py
│   └── deepseek.py
├── server
│   ├── __init__.py
│   └── ollama_server.py
├── .dockerignore
├── docs
│   └── easydeploy_modules_20241125.png
├── .gitignore
├── templates
│   ├── index.html
│   └── chat_page.html
├── docker-compose.yaml
├── Dockerfile
├── main.py
├── app.py
├── README_CN.md
├── README.md
└── LICENSE

/example/__init__.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
--------------------------------------------------------------------------------
/ling_code/__init__.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
--------------------------------------------------------------------------------
/server/__init__.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
--------------------------------------------------------------------------------
/.dockerignore:
--------------------------------------------------------------------------------
__pycache__
*.pyc
*.pyo
*.pyd
env/
venv/
--------------------------------------------------------------------------------
/docs/easydeploy_modules_20241125.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/codefuse-ai/EasyDeploy/main/docs/easydeploy_modules_20241125.png
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so
.idea/

# Distribution / packaging
.Python
fuhui_dev/
bakcup/
git-pre-push-hook-warn.log
--------------------------------------------------------------------------------
/templates/index.html:
--------------------------------------------------------------------------------
<!DOCTYPE html>
<html>
<body>
<p>This is a simple web page.</p>
</body>
</html>
--------------------------------------------------------------------------------
/docker-compose.yaml:
--------------------------------------------------------------------------------
# docker-compose.yaml
version: '3.8'
services:
  app:
    build: .
    container_name: app
    # Run the Ollama server in the background, load the model, then start the API.
    command: sh -c "ollama serve & ollama run llama3.2 && uvicorn app:app --host 0.0.0.0 --port 8000"
    ports:
      - "8000:8000"
    restart: always

volumes:
  ollama-storage:
--------------------------------------------------------------------------------
/example/hello_word.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
import json
import requests


def run():
    url = 'http://127.0.0.1:5000/'
    headers = {"Content-Type": "application/json"}
    res = requests.post(url, headers=headers)
    res_text = res.text
    print('res_text: {}'.format(res_text))


if __name__ == '__main__':
    run()
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
FROM python:3.10

ADD . /workspace/code-repo
WORKDIR /workspace/code-repo

RUN pip3 install fastapi uvicorn
RUN pip3 install requests
RUN pip3 install jinja2

ENV PYTHONPATH /workspace/code-repo

RUN apt-get update && apt-get install -y curl
RUN curl -fsSL https://ollama.com/install.sh | sh

ENV FLASK_RUN_HOST=0.0.0.0

EXPOSE 8000

# Only one CMD takes effect per image, so start Ollama, load the model, and launch the API in a single command.
CMD sh -c "ollama serve & ollama run llama3.2 && uvicorn app:app --host 0.0.0.0 --port 8000"
--------------------------------------------------------------------------------
/server/ollama_server.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
import requests
import json

url_generate = "http://127.0.0.1:11434/api/generate"


def get_response(url, data):
    response = requests.post(url, json=data)
    response_dict = json.loads(response.text)
    response_content = response_dict["response"]
    return response_content


data = {
    "model": "llama3.2",
    "prompt": "hello",
    "stream": False
}

res = get_response(url_generate, data)
print(res)
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
# This is a sample Python script.

# Press ⌃R to execute it or replace it with your code.
# Press Double ⇧ to search everywhere for classes, files, tool windows, actions, and settings.


def print_hi(name):
    # Use a breakpoint in the code line below to debug your script.
    print(f'Hi, {name}')  # Press ⌘F8 to toggle the breakpoint.


# Press the green button in the gutter to run the script.
if __name__ == '__main__':
    print_hi('PyCharm')

# See PyCharm help at https://www.jetbrains.com/help/pycharm/
--------------------------------------------------------------------------------
/example/ollama_app_stream.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
import json
import requests


# Send the request with stream=True so the output is returned as a stream.
url = 'http://127.0.0.1:8000/chat/completions'
prompt = 'hello'
model = 'llama3.2'
messages = [{"role": "user", "content": prompt}]
data = {'model': model, 'messages': messages, 'stream': True}
headers = {"Content-Type": "application/json"}

response = requests.post(url, headers=headers, data=json.dumps(data), stream=True)

resp = ''
for line in response.iter_lines():
    if not line:  # skip keep-alive blank lines
        continue
    data = line.decode('utf-8')
    data_dict = json.loads(data)
    text = data_dict['choices'][-1]['delta']['content']
    resp += text
    print('resp: {}'.format(resp))
--------------------------------------------------------------------------------
/example/ollama_app.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
import json
import requests


def run():
    url = 'http://127.0.0.1:8000/chat/completions'
    prompt = 'hello'
    model = 'llama3.2'
    messages = [{"role": "user", "content": prompt}]
    infer_param = {}
    data = {'engine': 'ollama', 'model': model, 'messages': messages, 'infer_param': infer_param}
    headers = {"Content-Type": "application/json"}
    response = requests.post(url, headers=headers, data=json.dumps(data))

    if response.status_code == 200:
        ans_dict = json.loads(response.text)
        print('data: {}'.format(ans_dict))


if __name__ == '__main__':
    run()
--------------------------------------------------------------------------------
/ling_code/inference_script.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
import os
from vllm import LLM
from vllm.sampling_params import SamplingParams

# os.environ['LD_LIBRARY_PATH'] = '/root/miniconda3/lib/python3.10/site-packages/nvidia/cublas/lib'
# model_path = '/mnt/modelops/models/Bailing_Code_MoE_Lite_4K_Chat_20250304_dpsk_gptq_int4'
model_path = '{your model path}'

enforce_eager = False

# Run on GPU
trust_remote_code = True
tensor_parallel_size = 1
gpu_memory_utilization = 0.80
max_model_len = 4096
max_tokens = 4096
model = LLM(model_path, trust_remote_code=trust_remote_code, tensor_parallel_size=tensor_parallel_size, enforce_eager=enforce_eager, gpu_memory_utilization=gpu_memory_utilization, max_model_len=max_model_len)
prompt = "
11 | 中文 | 12 | English 13 |
| 分类 | 功能名称 | 状态 | 描述 |
|---|---|---|---|
| API Service | 基于 OpenAI 的标准 API 规范 | ✅ | 服务接口遵循 OpenAI 规范,通过标准化 API 降低接入成本,用户可轻松集成功能,快速响应业务需求,专注于核心开发。 |
| | 阻塞式访问能力 | ✅ | 适用于需要完整性和准确性、需对结果进行整体校验或处理的任务,一次性获取完整输出。在整个过程中,用户需要等待,直至所有输出内容完全生成(参见表格下方的示例)。 |
| | 流式访问能力 | ✅ | 适用于对响应时间要求较高的实时应用,如代码补全、实时翻译或动态内容加载的场景。模型在生成过程中分段逐步传输内容,用户可在内容生成后立即接收和处理,无需等待全部完成,从而提升效率。 |
| | 高性能网关 | ⬜ | 高性能网关通过优化数据传输、采用先进负载均衡算法及高效资源管理,能有效应对高并发请求、降低延迟、提升响应速度。 |
| 多引擎支持 | Ollama | ✅ | Ollama 以易用和轻量著称,专注于高效稳定的大模型推理服务。其友好的 API 和简洁流畅的流程,使开发者能够轻松集成并快速部署应用。 |
| | vLLM | ✅ | vLLM 在内存管理和吞吐量上有显著优势,其通过优化存储和并行计算,显著提升推理速度和资源利用率,兼容多种硬件环境。vLLM 提供丰富的配置选项,用户可根据需求调整推理策略,适用于实时和企业级应用。 |
| | TensorRT-LLM | ⬜ | TensorRT-LLM (TensorRT for Large Language Models) 是 NVIDIA 开发的高性能、可扩展的推理优化库,专为大型语言模型(LLM)设计。 |
| Docker 部署能力 | 基于 Python 3.10 构建 Docker 镜像 | ✅ | 将大模型及其依赖打包为 Docker 镜像,确保运行环境一致,简化部署与配置。利用 Docker 的版本化构建和自动化部署,提高模型更新与迭代效率,加快从开发到生产落地的转化。 |
| Web UI 接入 | OpenUI 协议 | ⬜ | 丰富的 UI 开源协议便于用户整合多种组件,提升产品的定制性和扩展性。 |
| 更多核心功能 | ModelCache 语义缓存 | ⬜ | 通过缓存已生成的 QA Pair,使相似请求可实现毫秒级响应,提高模型推理的性能与效率。 |
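下面是一个最小的阻塞式调用示例(改编自仓库中的 example/ollama_app.py,假设服务已在本地 8000 端口启动,且已拉取 llama3.2 模型):

```python
# -*- coding: utf-8 -*-
# 阻塞式调用示例:一次性返回完整结果(请求格式与 example/ollama_app.py 保持一致)
import json
import requests

url = 'http://127.0.0.1:8000/chat/completions'
data = {
    'engine': 'ollama',
    'model': 'llama3.2',
    'messages': [{"role": "user", "content": "hello"}],
    'infer_param': {},
}
response = requests.post(url, headers={"Content-Type": "application/json"}, data=json.dumps(data))
if response.status_code == 200:
    print(json.loads(response.text))
```

若需要流式输出,可参考 example/ollama_app_stream.py,将请求中的 stream 设为 True。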
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
中文 | English
| Category | Function | Status | Description |
|---|---|---|---|
| API Service | OpenAI Standard API | ✅ | The service interface complies with OpenAI standards, minimizing integration costs through standardized APIs. It enables users to seamlessly integrate and maintain the system, swiftly respond to business requirements, and concentrate on core development. |
| | Blocking access capabilities | ✅ | Suitable for tasks requiring integrity and coherence, or for overall verification and processing of results; the complete output is obtained in a single response. Throughout the process, the user must wait until all output content has been fully generated. |
| | Streaming access capabilities | ✅ | Suitable for real-time applications with stringent response time requirements, such as code completion, real-time translation, or websites with dynamic content loading. The model transmits content incrementally during generation, enabling users to receive and process partial outputs immediately without waiting for full completion, thereby enhancing interactivity (see the example below the table). |
| | High-performance gateway | ⬜ | High-performance gateways effectively manage high-concurrency requests, reduce latency, and enhance response times by optimizing data transmission, employing advanced load balancing algorithms, and implementing efficient resource management. |
| Multi-engine Support | Ollama | ✅ | Ollama is known for its ease of use and lightweight footprint, focusing on efficient and stable large-model inference services. Its friendly API and streamlined workflow enable developers to integrate it easily and deploy applications quickly. |
| | vLLM | ✅ | vLLM exhibits significant advantages in memory management and throughput. By optimizing memory usage and parallel computation, it substantially enhances inference speed and resource efficiency, while maintaining compatibility with various hardware environments. vLLM offers a wide range of configuration options, allowing users to adjust inference strategies based on their needs. Its scalable architecture makes it suitable for both research and enterprise-level applications. |
| | TensorRT-LLM | ⬜ | TensorRT-LLM (TensorRT for Large Language Models) is a high-performance, scalable deep learning inference optimization library developed by NVIDIA, specifically designed for large language models (LLMs). |
| Docker Deployment Capability | Docker images built with Python 3.10 | ✅ | Packages the model and its dependencies into a Docker image, ensuring a consistent runtime environment and simplifying deployment and configuration. Docker's versioned builds and automated deployment improve the efficiency of model updates and iteration, accelerating the transition from development to production. |
| Web UI Integration | OpenUI protocol | ⬜ | The comprehensive UI open-source protocol facilitates users in integrating diverse components, enhancing product customizability and extensibility. |
| More Core Features | ModelCache semantic caching | ⬜ | By caching generated QA pairs, similar requests can achieve millisecond-level responses, enhancing the performance and efficiency of model inference. |
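A minimal streaming sketch, adapted from example/ollama_app_stream.py in this repository; it assumes the service is already listening on local port 8000 and that the llama3.2 model has been pulled:

```python
# -*- coding: utf-8 -*-
# Streaming example: consume the response incrementally as it is generated.
import json
import requests

url = 'http://127.0.0.1:8000/chat/completions'
data = {'model': 'llama3.2', 'messages': [{"role": "user", "content": "hello"}], 'stream': True}

response = requests.post(url, headers={"Content-Type": "application/json"},
                         data=json.dumps(data), stream=True)
resp = ''
for line in response.iter_lines():
    if not line:  # skip keep-alive blank lines
        continue
    chunk = json.loads(line.decode('utf-8'))
    resp += chunk['choices'][-1]['delta']['content']
    print('resp: {}'.format(resp))
```

For a blocking request that returns the full completion at once, see example/ollama_app.py.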