├── README.md
├── docs
│   ├── 4.2.1.3 Byzer-SQL 和大模型的整合.pdf
│   ├── 4.3.1 安装与配置 Byzer-SQL.pdf
│   └── deploy.md
├── image.png
└── openai_local_api.ipynb

/README.md:
--------------------------------------------------------------------------------
1 | 
2 | # 🚀 Super Analysis Deployment Guide
3 | 
4 | This guide walks you through deploying the Super Analysis system: installing the required components, configuring the services, and starting everything up.
5 | 
6 | ---
7 | 
8 | ## 📦 Install Super Analysis
9 | 
10 | 
11 | ```bash
12 | pip install -U auto-coder
13 | pip install -U super-analysis
14 | ```
15 | 
16 | ---
17 | 
18 | ## 🤖 Deploy the Deepseek Model Proxy
19 | 
20 | Before starting any other services, deploy the Deepseek model first:
21 | 
22 | ```bash
23 | byzerllm deploy --pretrained_model_type saas/openai \
24 | --cpus_per_worker 0.001 \
25 | --gpus_per_worker 0 \
26 | --worker_concurrency 1000 \
27 | --num_workers 1 \
28 | --infer_params saas.base_url="https://api.deepseek.com/v1" saas.api_key=${MODEL_DEEPSEEK_TOKEN} saas.model=deepseek-chat \
29 | --model deepseek_chat
30 | ```
31 | 
32 | Note: make sure the environment variable `MODEL_DEEPSEEK_TOKEN` is set.
33 | 
34 | ---
35 | 
36 | ## Architecture
37 | 
38 | ![Architecture diagram](./image.png)
39 | 
40 | ---
41 | 
42 | ## 🛠️ Deploy Byzer-SQL
43 | 
44 | Follow the [Byzer-SQL installation and configuration guide](./docs/4.3.1%20安装与配置%20Byzer-SQL.pdf) to complete the deployment; note in particular that the **manual** deployment approach is the simplest.
45 | Install the plugin as described in the [Byzer-SQL and LLM integration guide](./docs/4.2.1.3%20Byzer-SQL%20和大模型的整合.pdf), then register the `deepseek_chat` function.
46 | 
47 | > Byzer-SQL must be started from the conda environment in which super-analysis is installed.
48 | 
49 | Once Byzer-SQL is deployed, register an account named `hello`, then run the following in the Byzer-SQL console:
50 | 
51 | ```sql
52 | !byzerllm setup single;
53 | 
54 | run command as LLM.`` where
55 | action="infer"
56 | and reconnect="true"
57 | and pretrainedModelType="saas/openai"
58 | and udfName="deepseek_chat";
59 | ```
60 | 
61 | ---
62 | 
63 | ## Sample Data
64 | 
65 | Download the movies dataset: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/download?datasetVersionNumber=7
66 | 
67 | ---
68 | 
69 | > All of the commands below are run from the command line.
70 | 
71 | ## 📊 Data Preprocessing and Service Startup
72 | 
73 | 1. Extract the schema of the movies dataset:
74 | 
75 | ```bash
76 | super-analysis.convert --data_dir /Users/allwefantasy/data/movice --doc_dir /Users/allwefantasy/data/movice/schemas/
77 | ```
78 | 
79 | You can also add `--include-rows-num 5` to include a few sample rows in the generated schema docs, which helps the LLM build a better understanding of each table.
80 | 
81 | 
82 | 2. Start the schema-document knowledge base:
83 | 
84 | ```bash
85 | auto-coder.rag serve \
86 | --model deepseek_chat --index_filter_workers 100 \
87 | --tokenizer_path /Users/allwefantasy/Downloads/tokenizer.json \
88 | --doc_dir /Users/allwefantasy/data/movice/schemas/ \
89 | --port 8001
90 | ```
91 | 
92 | 3. Download the Byzer-SQL documentation and start the documentation knowledge base:
93 | 
94 | ```bash
95 | git clone https://github.com/allwefantasy/llm_friendly_packages
96 | 
97 | auto-coder.rag serve \
98 | --model deepseek_chat --index_filter_workers 100 \
99 | --tokenizer_path /Users/allwefantasy/Downloads/tokenizer.json \
100 | --doc_dir /Users/allwefantasy/projects/llm_friendly_packages/github.com/allwefantasy \
101 | --port 8002
102 | ```
103 | 
104 | 4. Start the OpenAI-compatible analysis server:
105 | 
106 | ```bash
107 | super-analysis.serve --served-model-name deepseek_chat --port 8000 \
108 | --schema-rag-base-url http://127.0.0.1:8001/v1 \
109 | --context-rag-base-url http://127.0.0.1:8002/v1 \
110 | --byzer-sql-url http://127.0.0.1:9003/run/script
111 | ```
112 | 
113 | You can use the `--sql-func-llm-model` option to specify a separate model just for SQL functions (for example, an extremely fast model). Note that, as before, this model must also be registered as a function in Byzer-SQL.
114 | 
115 | ---
116 | 
117 | The Super Analysis system is now fully deployed and running. You can start testing it and calling the API with the OpenAI SDK; see [openai_local_api.ipynb](./openai_local_api.ipynb) for concrete examples.
118 | 
119 | 
120 | 🎉 Congratulations! You have successfully deployed the Super Analysis system. If you run into any issues, consult the documentation or contact the support team.
--------------------------------------------------------------------------------
/docs/4.2.1.3 Byzer-SQL 和大模型的整合.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/allwefantasy/super-analysis-doc/8841be815fed2af8a0f6ffb9ef0d1654537eb77a/docs/4.2.1.3 Byzer-SQL 和大模型的整合.pdf
--------------------------------------------------------------------------------
/docs/4.3.1 安装与配置 Byzer-SQL.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/allwefantasy/super-analysis-doc/8841be815fed2af8a0f6ffb9ef0d1654537eb77a/docs/4.3.1 安装与配置 Byzer-SQL.pdf
--------------------------------------------------------------------------------
/docs/deploy.md:
--------------------------------------------------------------------------------
1 | 
2 | # 🚀 Super Analysis Deployment Guide
3 | 
4 | This guide walks you through deploying the Super Analysis system: installing the required components, configuring the services, and starting everything up.
5 | 
6 | ---
7 | 
8 | ## 📦 Install Super Analysis
9 | 
10 | 
11 | ```bash
12 | pip install -U auto-coder
13 | pip install super_analysis-xxxx-py3-none-any.whl
14 | ```
15 | 
16 | Replace `xxxx` with the actual version number.
17 | 
18 | ---
19 | 
20 | ## 🤖 Deploy the Deepseek Model Proxy
21 | 
22 | Before starting any other services, deploy the Deepseek model first:
23 | 
24 | ```bash
25 | byzerllm deploy --pretrained_model_type saas/openai \
26 | --cpus_per_worker 0.001 \
27 | --gpus_per_worker 0 \
28 | --worker_concurrency 1000 \
29 | --num_workers 1 \
30 | --infer_params saas.base_url="https://api.deepseek.com/v1" saas.api_key=${MODEL_DEEPSEEK_TOKEN} saas.model=deepseek-chat \
31 | --model deepseek_chat
32 | ```
33 | 
34 | Note: make sure the environment variable `MODEL_DEEPSEEK_TOKEN` is set.
35 | 
36 | ---
37 | 
38 | ## 🛠️ Deploy Byzer-SQL
39 | 
40 | Follow the [Byzer-SQL installation and configuration guide](./4.3.1%20安装与配置%20Byzer-SQL.pdf) to complete the deployment.
41 | Install the plugin as described in the [Byzer-SQL and LLM integration guide](./4.2.1.3%20Byzer-SQL%20和大模型的整合.pdf), then register the `deepseek_chat` function.
42 | 
43 | > Byzer-SQL must be started from the conda environment in which super-analysis is installed.
44 | 
45 | Once Byzer-SQL is deployed, register an account named `hello`, then run the following in the Byzer-SQL console:
46 | 
47 | ```sql
48 | !byzerllm setup single;
49 | 
50 | run command as LLM.`` where
51 | action="infer"
52 | and reconnect="true"
53 | and pretrainedModelType="saas/openai"
54 | and udfName="deepseek_chat";
55 | ```
56 | 
57 | ---
58 | 
59 | ## Sample Data
60 | 
61 | Download the movies dataset: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/download?datasetVersionNumber=7
62 | 
63 | ---
64 | 
65 | > All of the commands below are run from the command line.
66 | 
67 | ## 📊 Data Preprocessing and Service Startup
68 | 
69 | 1. Extract the schema of the movies dataset:
70 | 
71 | ```bash
72 | super-analysis.convert --data_dir /Users/allwefantasy/data/movice --doc_dir /Users/allwefantasy/data/movice/schemas/
73 | ```
74 | 
75 | You can also add `--include-rows-num 5` to include a few sample rows in the generated schema docs, which helps the LLM build a better understanding of each table.
76 | 
77 | 
78 | 2. Start the schema-document knowledge base:
79 | 
80 | ```bash
81 | auto-coder.rag serve \
82 | --model deepseek_chat --index_filter_workers 100 \
83 | --tokenizer_path /Users/allwefantasy/Downloads/tokenizer.json \
84 | --doc_dir /Users/allwefantasy/data/movice/schemas/ \
85 | --port 8001
86 | ```
87 | 
88 | 3. Download the Byzer-SQL documentation and start the documentation knowledge base:
89 | 
90 | ```bash
91 | git clone https://github.com/allwefantasy/llm_friendly_packages
92 | 
93 | auto-coder.rag serve \
94 | --model deepseek_chat --index_filter_workers 100 \
95 | --tokenizer_path /Users/allwefantasy/Downloads/tokenizer.json \
96 | --doc_dir /Users/allwefantasy/projects/llm_friendly_packages/github.com/allwefantasy \
97 | --port 8002
98 | ```
99 | 
100 | 4. Start the OpenAI-compatible analysis server:
101 | 
102 | ```bash
103 | super-analysis.serve --served-model-name deepseek_chat --port 8000 \
104 | --schema-rag-base-url http://127.0.0.1:8001/v1 \
105 | --context-rag-base-url http://127.0.0.1:8002/v1 \
106 | --byzer-sql-url http://127.0.0.1:9003/run/script
107 | ```
108 | 
109 | You can use the `--sql-func-llm-model` option to specify a separate model just for SQL functions (for example, an extremely fast model). Note that, as before, this model must also be registered as a function in Byzer-SQL.
110 | 
111 | ---
112 | 
113 | The Super Analysis system is now fully deployed and running. You can start testing it and calling the API with the OpenAI SDK; see [openai_local_api.ipynb](./openai_local_api.ipynb) for concrete examples.
114 | 
115 | 🎉 Congratulations! You have successfully deployed the Super Analysis system. If you run into any issues, consult the documentation or contact the support team.
--------------------------------------------------------------------------------
/image.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/allwefantasy/super-analysis-doc/8841be815fed2af8a0f6ffb9ef0d1654537eb77a/image.png
--------------------------------------------------------------------------------
/openai_local_api.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Using OpenAI SDK to Access Local API\n",
8 | "\n",
9 | "This notebook demonstrates how to use the OpenAI SDK to interact with a locally running API at http://127.0.0.1:8000/v1."
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 2,
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "# Import required libraries\n",
19 | "from openai import OpenAI\n",
20 | "import os"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 3,
26 | "metadata": {},
27 | "outputs": [],
28 | "source": [
29 | "# Configure the OpenAI client to use the local API\n",
30 | "client = OpenAI(\n",
31 | "    base_url=\"http://127.0.0.1:8000/v1\",\n",
32 | "    api_key=\"your-api-key-here\"  # Replace with your actual API key if required\n",
33 | ")"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 3,
39 | "metadata": {},
40 | "outputs": [
41 | {
42 | "name": "stdout",
43 | "output_type": "stream",
44 | "text": [
45 | "Available models:\n",
46 | "- \n"
47 | ]
48 | }
49 | ],
50 | "source": [
51 | "# Test the connection by listing available models\n",
52 | "models = client.models.list()\n",
53 | "print(\"Available models:\")\n",
54 | "for model in models:\n",
55 | "    print(f\"- {model.id}\")"
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": 4,
61 | "metadata": {},
62 | "outputs": [],
63 | "source": [
64 | "# `query` sends a natural-language question and returns the first row of the JSON result.\n",
65 | "# `gen_sql` returns the generated Byzer-SQL script instead of executing it.\n",
66 | "def query(q: str):\n",
67 | "    chat_completion = client.chat.completions.create(\n",
68 | "        model=\"deepseek_chat\",\n",
69 | "        messages=[\n",
70 | "            {\"role\": \"user\", \"content\": q}\n",
71 | "        ],\n",
72 | "    )\n",
73 | "    import json\n",
74 | "\n",
75 | "    v = json.loads(chat_completion.choices[0].message.content)\n",
76 | "    return v[\"data\"][0]\n",
77 | "\n",
78 | "def gen_sql(q: str):\n",
79 | "    chat_completion = client.chat.completions.create(\n",
80 | "        model=\"deepseek_chat\",\n",
81 | "        messages=[\n",
82 | "            {\"role\": \"user\", \"content\": q}\n",
83 | "        ],\n",
84 | "    )\n",
85 | "    v = chat_completion.choices[0].message.content\n",
86 | "    return v"
87 | ]
88 | },
89 | {
90 | "cell_type": "markdown",
91 | "metadata": {},
92 | "source": [
93 | "## Note\n",
94 | "In some of the cases below the LLM processes every record, and this dataset is fairly large. You should manually trim movies_metadata.csv down to roughly 1,000 rows, for testing and to avoid long waits."
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": 10,
100 | "metadata": {},
101 | "outputs": [
102 | {
103 | "name": "stdout",
104 | "output_type": "stream",
105 | "text": [
106 | "{'user_count': 270896}\n"
107 | ]
108 | }
109 | ],
110 | "source": [
111 | "print(query(\"帮我统计下打分的用户数量\"))"
112 | ]
113 | },
114 | {
115 | "cell_type": "code",
116 | "execution_count": 5,
117 | "metadata": {},
118 | "outputs": [
119 | {
120 | "name": "stdout",
121 | "output_type": "stream",
122 | "text": [
123 | "-- 加载 ratings.csv 文件\n",
124 | "load csv.`/Users/allwefantasy/data/movice/ratings.csv` where header=\"true\" as ratings_table;\n",
125 | "\n",
126 | "-- 统计打分的用户数量\n",
127 | "select count(distinct userId) as user_count from ratings_table as output;\n",
128 | "\n"
129 | ]
130 | }
131 | ],
132 | "source": [
133 | "print(gen_sql(\"帮我统计下打分的用户数量\"))"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": 9,
139 | "metadata": {},
140 | "outputs": [
141 | {
142 | "name": "stdout",
143 | "output_type": "stream",
144 | "text": [
145 | "{'user_count': 270896}\n"
146 | ]
147 | }
148 | ],
149 | "source": [
150 | "## Same question as above, but this time the prompt explicitly\n",
151 | "## names the table to use (the ratings table).\n",
152 | "print(query(\"帮我统计下打分的用户数量,使用 rating表 \"))"
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": 7,
158 | "metadata": {},
159 | "outputs": [
160 | {
161 | "name": "stdout",
162 | "output_type": "stream",
163 | "text": [
164 | "{'id': '862', 'title': 'Toy Story', 'overview': \"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.\"}\n"
165 | ]
166 | }
167 | ],
168 | "source": [
169 | "## Here the user's description is imprecise: it merely says the movie mentions Andy's toys,\n",
170 | "## while the actual overview reads: Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene.\n",
171 | "print(query(\"帮我找到描述包含了安迪玩具的那个电影。\"))"
172 | ]
173 | },
174 | {
175 | "cell_type": "code",
176 | "execution_count": 29,
177 | "metadata": {},
178 | "outputs": [
179 | {
180 | "name": "stdout",
181 | "output_type": "stream",
182 | "text": [
183 | "{'romance_count': 0}\n"
184 | ]
185 | }
186 | ],
187 | "source": [
188 | "## Note: the movie table only contains plot descriptions; there are no genre labels.\n",
189 | "## The system automatically uses the LLM to decide from each description whether a movie is a pure romance, then computes the count.\n",
190 | "print(query(\"根据电影的描述,统计下纯美爱情的电影数量。\"))\n"
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": 31,
196 | "metadata": {},
197 | "outputs": [
198 | {
199 | "name": "stdout",
200 | "output_type": "stream",
201 | "text": [
202 | "{'user_count': 3986}\n"
203 | ]
204 | }
205 | ],
206 | "source": [
207 | "## Here the system first runs NLP analysis over the overview field to find the target movies, then joins them against the ratings table to count the users who rated them.\n",
208 | "print(query(\"帮我找到包含了描述包含了安迪玩具的那个电影,然后给该电影打分的用户数量\"))"
209 | ]
210 | },
211 | {
212 | "cell_type": "code",
213 | "execution_count": null,
214 | "metadata": {},
215 | "outputs": [],
216 | "source": [
217 | "## Scratch cell: try your own question here.\n",
218 | "print(query(\"\"))"
219 | ]
220 | },
221 | {
222 | "cell_type": "code",
223 | "execution_count": 6,
224 | "metadata": {},
225 | "outputs": [
226 | {
227 | "name": "stdout",
228 | "output_type": "stream",
229 | "text": [
230 | "-- 加载包含电影描述的表格\n",
231 | "load csv.`/Users/allwefantasy/data/movice/movies_metadata.csv` where header=\"true\" as movies_metadata;\n",
232 | "\n",
233 | "-- 加载用户评分的表格\n",
234 | "load csv.`/Users/allwefantasy/data/movice/ratings.csv` where header=\"true\" as ratings;\n",
235 | "\n",
236 | "-- 使用大模型辅助函数处理电影描述,找出包含“安迪玩具”的电影\n",
237 | "select \n",
238 | "id, \n",
239 | "title, \n",
240 | "overview, \n",
241 | "llm_result(deepseek_chat(llm_param(map(\n",
242 | "              \"instruction\",llm_prompt('\n",
243 | "\n",
244 | "根据下面提供的信息,回答用户的问题。\n",
245 | "\n",
246 | "信息上下文:\n",
247 | "```\n",
248 | "{0}\n",
249 | "```\n",
250 | "\n",
251 | "用户的问题: 电影描述中是否包含“安迪玩具”?\n",
252 | "请只输出 “是” 或者 “否”\n",
253 | "',array(overview))\n",
254 | "\n",
255 | ")))) as contains_andy_toy\n",
256 | "from movies_metadata as filtered_movies;\n",
257 | "\n",
258 | "-- 过滤出包含“安迪玩具”的电影\n",
259 | "select id, title from filtered_movies where contains_andy_toy like '%是%' as andy_toy_movies;\n",
260 | "\n",
261 | "-- 统计给这些电影打分的用户数量\n",
262 | "select \n",
263 | "r.movieId, \n",
264 | "count(distinct r.userId) as user_count\n",
265 | "from ratings as r\n",
266 | "join andy_toy_movies as m on r.movieId = m.id\n",
267 | "group by r.movieId as user_rating_count;\n",
268 | "\n",
269 | "-- 输出结果\n",
270 | "select * from user_rating_count as output;\n",
271 | "\n"
272 | ]
273 | }
274 | ],
275 | "source": [
276 | "print(gen_sql(\"帮我找到包含了描述包含了安迪玩具的那个电影,然后给该电影打分的用户数量\"))"
277 | ]
278 | }
279 | ],
280 | "metadata": {
281 | "kernelspec": {
282 | "display_name": "Python 3",
283 | "language": "python",
284 | "name": "python3"
285 | },
286 | "language_info": {
287 | "codemirror_mode": {
288 | "name": "ipython",
289 | "version": 3
290 | },
291 | "file_extension": ".py",
292 | "mimetype": "text/x-python",
293 | "name": "python",
294 | "nbconvert_exporter": "python",
295 | "pygments_lexer": "ipython3",
296 | "version": "3.10.11"
297 | }
298 | },
299 | "nbformat": 4,
300 | "nbformat_minor": 4
301 | }
302 | 
--------------------------------------------------------------------------------