├── .gitignore ├── .pre-commit-config.yaml ├── README.ipynb ├── README.md ├── ci ├── aws.yaml ├── build.sh ├── gce.yaml ├── nb2py.py └── tests.sh ├── clear_cell_nums.py ├── configs ├── aws.yaml └── gce.yaml ├── containerfile ├── images ├── data_dashboard.png ├── data_llm.png ├── e2e_llm.png ├── loss.png ├── serve_dashboard.png ├── serve_llm.png └── train_dashboard.png └── lora_sft_ray.yaml /.gitignore: -------------------------------------------------------------------------------- 1 | # VSCode 2 | .vscode/ 3 | .idea 4 | 5 | # Byte-compiled / optimized / DLL files 6 | __pycache__/ 7 | *.py[cod] 8 | *$py.class 9 | 10 | # C extensions 11 | *.so 12 | 13 | # Distribution / packaging 14 | .Python 15 | build/ 16 | develop-eggs/ 17 | dist/ 18 | downloads/ 19 | eggs/ 20 | .eggs/ 21 | lib/ 22 | lib64/ 23 | parts/ 24 | sdist/ 25 | var/ 26 | wheels/ 27 | pip-wheel-metadata/ 28 | share/python-wheels/ 29 | *.egg-info/ 30 | .installed.cfg 31 | *.egg 32 | MANIFEST 33 | 34 | # PyInstaller 35 | *.manifest 36 | *.spec 37 | 38 | # Installer logs 39 | pip-log.txt 40 | pip-delete-this-directory.txt 41 | 42 | # Unit test / coverage reports 43 | htmlcov/ 44 | .tox/ 45 | .nox/ 46 | .coverage 47 | .coverage.* 48 | .cache 49 | nosetests.xml 50 | coverage.xml 51 | *.cover 52 | *.py,cover 53 | .hypothesis/ 54 | .pytest_cache/ 55 | 56 | # Flask: 57 | instance/ 58 | .webassets-cache 59 | 60 | # Scrapy: 61 | .scrapy 62 | 63 | # Sphinx 64 | docs/_build/ 65 | 66 | # PyBuilder 67 | target/ 68 | 69 | # IPython 70 | .ipynb_checkpoints 71 | profile_default/ 72 | ipython_config.py 73 | 74 | # pyenv 75 | .python-version 76 | 77 | # PEP 582 78 | __pypackages__/ 79 | 80 | # Celery 81 | celerybeat-schedule 82 | celerybeat.pid 83 | 84 | # Environment 85 | .env 86 | .venv 87 | env/ 88 | venv/ 89 | ENV/ 90 | env.bak/ 91 | venv.bak/ 92 | 93 | # mkdocs 94 | site/ 95 | 96 | # Airflow 97 | airflow/airflow.db 98 | 99 | # MacOS 100 | .DS_Store 101 | 102 | # Clean up 103 | .trash/ -------------------------------------------------------------------------------- /.pre-commit-config.yaml: -------------------------------------------------------------------------------- 1 | # See https://pre-commit.com for more information 2 | # See https://pre-commit.com/hooks.html for more hooks 3 | repos: 4 | # Standard pre-commit checks 5 | - repo: https://github.com/pre-commit/pre-commit-hooks 6 | rev: v5.0.0 7 | hooks: 8 | - id: trailing-whitespace 9 | - id: end-of-file-fixer 10 | - id: check-merge-conflict 11 | - id: check-yaml 12 | - id: check-added-large-files 13 | args: ['--maxkb=10000'] 14 | exclude: ".*\\.ipynb$" 15 | 16 | # Ruff — replaces black, isort, and flake8 17 | - repo: https://github.com/astral-sh/ruff-pre-commit 18 | rev: v0.11.8 19 | hooks: 20 | - id: ruff 21 | args: ["--fix"] 22 | 23 | # Custom local hook for clearing notebook cell numbers 24 | - repo: local 25 | hooks: 26 | - id: clear-execution-counts 27 | name: Clear execution counts in notebooks 28 | entry: python clear_cell_nums.py 29 | language: system 30 | files: \.ipynb$ 31 | -------------------------------------------------------------------------------- /README.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Entity Recognition with LLMs\n", 8 | "\n", 9 | "
\n", 10 | " \n", 11 | " \n", 12 | "
" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "This end-to-end tutorial **fine-tunes** an LLM to perform **batch inference** and **online serving** at scale. While entity recognition (NER) is the main task in this tutorial, you can easily extend these end-to-end workflows to any use case.\n", 20 | "\n", 21 | "\n", 22 | "\n", 23 | "**Note**: The intent of this tutorial is to show how you can use Ray to implement end-to-end LLM workflows that can extend to any use case, including multimodal.\n", 24 | "\n", 25 | "This tutorial uses the [Ray library](https://github.com/ray-project/ray) to implement these workflows, namely the LLM APIs:\n", 26 | "\n", 27 | "[`ray.data.llm`](https://docs.ray.io/en/latest/data/working-with-llms.html):\n", 28 | "- Batch inference over distributed datasets\n", 29 | "- Streaming and async execution for throughput\n", 30 | "- Built-in metrics and tracing, including observability\n", 31 | "- Zero-copy GPU data transfer\n", 32 | "- Composable with preprocessing and postprocessing steps\n", 33 | "\n", 34 | "[`ray.serve.llm`](https://docs.ray.io/en/latest/serve/llm/serving-llms.html):\n", 35 | "- Automatic scaling and load balancing\n", 36 | "- Unified multi-node multi-model deployment\n", 37 | "- Multi-LoRA support with shared base models\n", 38 | "- Deep integration with inference engines, vLLM to start\n", 39 | "- Composable multi-model LLM pipelines\n", 40 | "\n", 41 | "And all of these workloads come with all the observability views you need to debug and tune them to **maximize throughput/latency**." 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "## Set up\n", 49 | "\n", 50 | "### Compute\n", 51 | "This [Anyscale Workspace](https://docs.anyscale.com/platform/workspaces/) automatically provisions and autoscales the compute your workloads need. If you're not on Anyscale, then you need to provision the appropriate compute (L4) for this tutorial.\n", 52 | "\n", 53 | "\n", 54 | "\n", 55 | "### Dependencies\n", 56 | "Start by downloading the dependencies required for this tutorial. Notice in your [`containerfile`](https://raw.githubusercontent.com/anyscale/e2e-llm-workflows/refs/heads/main/containerfile) you have a base image [`anyscale/ray-llm:latest-py311-cu124`](https://hub.docker.com/layers/anyscale/ray-llm/latest-py311-cu124/images/sha256-5a1c55f7f416d2d2eb5f4cdd13afeda25d4f7383406cfee1f1f60da495d1b50f) followed by a list of pip packages. 
If you're not on [Anyscale](https://console.anyscale.com/), you can pull this Docker image yourself and install the dependencies.\n" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": {}, 63 | "outputs": [ 64 | { 65 | "name": "stdout", 66 | "output_type": "stream", 67 | "text": [ 68 | "README.ipynb ci\t\t configs\t images\n", 69 | "README.md clear_cell_nums.py containerfile lora_sft_ray.yaml\n" 70 | ] 71 | } 72 | ], 73 | "source": [ 74 | "!ls" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "name": "stdout", 84 | "output_type": "stream", 85 | "text": [ 86 | "\u001b[92mSuccessfully registered `ray, vllm` and 5 other packages to be installed on all cluster nodes.\u001b[0m\n", 87 | "\u001b[92mView and update dependencies here: https://console.anyscale.com/cld_kvedZWag2qA8i5BjxUevf5i7/prj_cz951f43jjdybtzkx1s5sjgz99/workspaces/expwrk_mp8cxvgle2yeumgcpu1yua2r3e?workspace-tab=dependencies\u001b[0m\n" 88 | ] 89 | } 90 | ], 91 | "source": [ 92 | "%%bash\n", 93 | "# Install dependencies\n", 94 | "pip install -q \\\n", 95 | " \"xgrammar==0.1.11\" \\\n", 96 | " \"pynvml==12.0.0\" \\\n", 97 | " \"hf_transfer==0.1.9\" \\\n", 98 | " \"tensorboard==2.19.0\" \\\n", 99 | " \"llamafactory@git+https://github.com/hiyouga/LLaMA-Factory.git@ac8c6fdd3ab7fb6372f231f238e6b8ba6a17eb16#egg=llamafactory\"" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "## Data ingestion" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "import json\n", 116 | "import textwrap\n", 117 | "from IPython.display import Code, display" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "Start by downloading the data from cloud storage to local shared storage. " 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": {}, 131 | "outputs": [ 132 | { 133 | "name": "stdout", 134 | "output_type": "stream", 135 | "text": [ 136 | "download: s3://viggo-ds/train.jsonl to ../../../mnt/cluster_storage/viggo/train.jsonl\n", 137 | "download: s3://viggo-ds/val.jsonl to ../../../mnt/cluster_storage/viggo/val.jsonl\n", 138 | "download: s3://viggo-ds/test.jsonl to ../../../mnt/cluster_storage/viggo/test.jsonl\n", 139 | "download: s3://viggo-ds/dataset_info.json to ../../../mnt/cluster_storage/viggo/dataset_info.json\n" 140 | ] 141 | } 142 | ], 143 | "source": [ 144 | "%%bash\n", 145 | "rm -rf /mnt/cluster_storage/viggo # clean up\n", 146 | "mkdir /mnt/cluster_storage/viggo\n", 147 | "wget https://viggo-ds.s3.amazonaws.com/train.jsonl -O /mnt/cluster_storage/viggo/train.jsonl\n", 148 | "wget https://viggo-ds.s3.amazonaws.com/val.jsonl -O /mnt/cluster_storage/viggo/val.jsonl\n", 149 | "wget https://viggo-ds.s3.amazonaws.com/test.jsonl -O /mnt/cluster_storage/viggo/test.jsonl\n", 150 | "wget https://viggo-ds.s3.amazonaws.com/dataset_info.json -O /mnt/cluster_storage/viggo/dataset_info.json" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [ 158 | { 159 | "name": "stdout", 160 | "output_type": "stream", 161 | "text": [ 162 | "{\n", 163 | " \"instruction\": \"Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. 
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']\",\n", 164 | " \"input\": \"Blizzard North is mostly an okay developer, but they released Diablo II for the Mac and so that pushes the game from okay to good in my view.\",\n", 165 | " \"output\": \"give_opinion(name[Diablo II], developer[Blizzard North], rating[good], has_mac_release[yes])\"\n", 166 | "}\n" 167 | ] 168 | } 169 | ], 170 | "source": [ 171 | "%%bash\n", 172 | "head -n 1 /mnt/cluster_storage/viggo/train.jsonl | python3 -m json.tool" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "metadata": {}, 179 | "outputs": [ 180 | { 181 | "name": "stdout", 182 | "output_type": "stream", 183 | "text": [ 184 | "Given a target sentence construct the underlying meaning representation of the\n", 185 | "input sentence as a single function with attributes and attribute values. This\n", 186 | "function should describe the target string accurately and the function must be\n", 187 | "one of the following ['inform', 'request', 'give_opinion', 'confirm',\n", 188 | "'verify_attribute', 'suggest', 'request_explanation', 'recommend',\n", 189 | "'request_attribute']. The attributes must be one of the following: ['name',\n", 190 | "'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres',\n", 191 | "'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam',\n", 192 | "'has_linux_release', 'has_mac_release', 'specifier']\n" 193 | ] 194 | } 195 | ], 196 | "source": [ 197 | "with open(\"/mnt/cluster_storage/viggo/train.jsonl\", \"r\") as fp:\n", 198 | " first_line = fp.readline()\n", 199 | " item = json.loads(first_line)\n", 200 | "system_content = item[\"instruction\"]\n", 201 | "print(textwrap.fill(system_content, width=80))" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "metadata": {}, 207 | "source": [ 208 | "You also have an info file that identifies the datasets and format (Alpaca and ShareGPT formats) to use for post training." 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": {}, 215 | "outputs": [ 216 | { 217 | "data": { 218 | "text/html": [ 219 | "
{\n",
 294 |        "    "viggo-train": {\n",
 295 |        "        "file_name": "/mnt/cluster_storage/viggo/train.jsonl",\n",
 296 |        "        "formatting": "alpaca",\n",
 297 |        "        "columns": {\n",
 298 |        "            "prompt": "instruction",\n",
 299 |        "            "query": "input",\n",
 300 |        "            "response": "output"\n",
 301 |        "        }\n",
 302 |        "    },\n",
 303 |        "    "viggo-val": {\n",
 304 |        "        "file_name": "/mnt/cluster_storage/viggo/val.jsonl",\n",
 305 |        "        "formatting": "alpaca",\n",
 306 |        "        "columns": {\n",
 307 |        "            "prompt": "instruction",\n",
 308 |        "            "query": "input",\n",
 309 |        "            "response": "output"\n",
 310 |        "        }\n",
 311 |        "    }\n",
 312 |        "}\n",
 313 |        "
\n" 314 | ], 315 | "text/latex": [ 316 | "\\begin{Verbatim}[commandchars=\\\\\\{\\}]\n", 317 | "\\PY{p}{\\PYZob{}}\n", 318 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}viggo\\PYZhy{}train\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{p}{\\PYZob{}}\n", 319 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}file\\PYZus{}name\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+s+s2}{\\PYZdq{}/mnt/cluster\\PYZus{}storage/viggo/train.jsonl\\PYZdq{}}\\PY{p}{,}\n", 320 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}formatting\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+s+s2}{\\PYZdq{}alpaca\\PYZdq{}}\\PY{p}{,}\n", 321 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}columns\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{p}{\\PYZob{}}\n", 322 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}prompt\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+s+s2}{\\PYZdq{}instruction\\PYZdq{}}\\PY{p}{,}\n", 323 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}query\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+s+s2}{\\PYZdq{}input\\PYZdq{}}\\PY{p}{,}\n", 324 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}response\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+s+s2}{\\PYZdq{}output\\PYZdq{}}\n", 325 | "\\PY{+w}{ }\\PY{p}{\\PYZcb{}}\n", 326 | "\\PY{+w}{ }\\PY{p}{\\PYZcb{},}\n", 327 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}viggo\\PYZhy{}val\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{p}{\\PYZob{}}\n", 328 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}file\\PYZus{}name\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+s+s2}{\\PYZdq{}/mnt/cluster\\PYZus{}storage/viggo/val.jsonl\\PYZdq{}}\\PY{p}{,}\n", 329 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}formatting\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+s+s2}{\\PYZdq{}alpaca\\PYZdq{}}\\PY{p}{,}\n", 330 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}columns\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{p}{\\PYZob{}}\n", 331 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}prompt\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+s+s2}{\\PYZdq{}instruction\\PYZdq{}}\\PY{p}{,}\n", 332 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}query\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+s+s2}{\\PYZdq{}input\\PYZdq{}}\\PY{p}{,}\n", 333 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}response\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+s+s2}{\\PYZdq{}output\\PYZdq{}}\n", 334 | "\\PY{+w}{ }\\PY{p}{\\PYZcb{}}\n", 335 | "\\PY{+w}{ }\\PY{p}{\\PYZcb{}}\n", 336 | "\\PY{p}{\\PYZcb{}}\n", 337 | "\\end{Verbatim}\n" 338 | ], 339 | "text/plain": [ 340 | "{\n", 341 | " \"viggo-train\": {\n", 342 | " \"file_name\": \"/mnt/cluster_storage/viggo/train.jsonl\",\n", 343 | " \"formatting\": \"alpaca\",\n", 344 | " \"columns\": {\n", 345 | " \"prompt\": \"instruction\",\n", 346 | " \"query\": \"input\",\n", 347 | " \"response\": \"output\"\n", 348 | " }\n", 349 | " },\n", 350 | " \"viggo-val\": {\n", 351 | " \"file_name\": \"/mnt/cluster_storage/viggo/val.jsonl\",\n", 352 | " \"formatting\": \"alpaca\",\n", 353 | " \"columns\": {\n", 354 | " \"prompt\": \"instruction\",\n", 355 | " \"query\": \"input\",\n", 356 | " \"response\": \"output\"\n", 357 | " }\n", 358 | " }\n", 359 | "}" 360 | ] 361 | }, 362 | "metadata": {}, 363 | "output_type": "display_data" 364 | } 365 | ], 366 | "source": [ 367 | "display(Code(filename=\"/mnt/cluster_storage/viggo/dataset_info.json\", language=\"json\"))" 368 | ] 369 | }, 370 | { 371 | "cell_type": "markdown", 372 | "metadata": {}, 373 | "source": [ 374 | "## Distributed fine-tuning" 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "metadata": {}, 380 | "source": [ 381 | "Use [Ray Train](https://docs.ray.io/en/latest/train/train.html) + [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) to perform multi-node training. Find the parameters for the training workload, post-training method, dataset location, train/val details, etc. 
in the `lora_sft_ray.yaml` config file. See the recipes for even more post-training methods, like SFT, pretraining, PPO, DPO, KTO, etc. [on GitHub](https://github.com/hiyouga/LLaMA-Factory/tree/main/examples).\n",
    "\n",
    "**Note**: Ray also supports using other tools like [axolotl](https://axolotl-ai-cloud.github.io/axolotl/docs/ray-integration.html) or even [Ray Train + HF Accelerate + FSDP/DeepSpeed](https://docs.ray.io/en/latest/train/huggingface-accelerate.html) directly for complete control of your post-training workloads.\n",
    "\n",
    ""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `config`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from pathlib import Path\n",
    "import yaml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "
### model\n",
 489 |        "model_name_or_path: Qwen/Qwen2.5-7B-Instruct\n",
 490 |        "trust_remote_code: true\n",
 491 |        "\n",
 492 |        "### method\n",
 493 |        "stage: sft\n",
 494 |        "do_train: true\n",
 495 |        "finetuning_type: lora\n",
 496 |        "lora_rank: 8\n",
 497 |        "lora_target: all\n",
 498 |        "\n",
 499 |        "### dataset\n",
 500 |        "dataset: viggo-train\n",
 501 |        "dataset_dir: /mnt/cluster_storage/viggo  # shared storage workers have access to\n",
 502 |        "template: qwen\n",
 503 |        "cutoff_len: 2048\n",
 504 |        "max_samples: 1000\n",
 505 |        "overwrite_cache: true\n",
 506 |        "preprocessing_num_workers: 16\n",
 507 |        "dataloader_num_workers: 4\n",
 508 |        "\n",
 509 |        "### output\n",
 510 |        "output_dir: /mnt/cluster_storage/viggo/outputs  # should be somewhere workers have access to (ex. s3, nfs)\n",
 511 |        "logging_steps: 10\n",
 512 |        "save_steps: 500\n",
 513 |        "plot_loss: true\n",
 514 |        "overwrite_output_dir: true\n",
 515 |        "save_only_model: false\n",
 516 |        "\n",
 517 |        "### ray\n",
 518 |        "ray_run_name: lora_sft_ray\n",
 519 |        "ray_storage_path: /mnt/cluster_storage/viggo/saves  # should be somewhere workers have access to (ex. s3, nfs)\n",
 520 |        "ray_num_workers: 4\n",
 521 |        "resources_per_worker:\n",
 522 |        "  GPU: 1\n",
 523 |        "  anyscale/accelerator_shape:4xL4: 0.001  # Use this to specify a specific node shape,\n",
 524 |        "  # accelerator_type:L4: 1           # Or use this to simply specify a GPU type.\n",
 525 |        "  # see https://docs.ray.io/en/master/ray-core/accelerator-types.html#accelerator-types for a full list of accelerator types\n",
 526 |        "placement_strategy: PACK\n",
 527 |        "\n",
 528 |        "### train\n",
 529 |        "per_device_train_batch_size: 1\n",
 530 |        "gradient_accumulation_steps: 8\n",
 531 |        "learning_rate: 1.0e-4\n",
 532 |        "num_train_epochs: 5.0\n",
 533 |        "lr_scheduler_type: cosine\n",
 534 |        "warmup_ratio: 0.1\n",
 535 |        "bf16: true\n",
 536 |        "ddp_timeout: 180000000\n",
 537 |        "resume_from_checkpoint: null\n",
 538 |        "\n",
 539 |        "### eval\n",
 540 |        "eval_dataset: viggo-val  # uses same dataset_dir as training data\n",
 541 |        "# val_size: 0.1  # only if using part of training data for validation\n",
 542 |        "per_device_eval_batch_size: 1\n",
 543 |        "eval_strategy: steps\n",
 544 |        "eval_steps: 500\n",
 545 |        "
\n" 546 | ], 547 | "text/latex": [ 548 | "\\begin{Verbatim}[commandchars=\\\\\\{\\}]\n", 549 | "\\PY{c+c1}{\\PYZsh{}\\PYZsh{}\\PYZsh{} model}\n", 550 | "\\PY{n+nt}{model\\PYZus{}name\\PYZus{}or\\PYZus{}path}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{Qwen/Qwen2.5\\PYZhy{}7B\\PYZhy{}Instruct}\n", 551 | "\\PY{n+nt}{trust\\PYZus{}remote\\PYZus{}code}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{true}\n", 552 | "\n", 553 | "\\PY{c+c1}{\\PYZsh{}\\PYZsh{}\\PYZsh{} method}\n", 554 | "\\PY{n+nt}{stage}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{sft}\n", 555 | "\\PY{n+nt}{do\\PYZus{}train}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{true}\n", 556 | "\\PY{n+nt}{finetuning\\PYZus{}type}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{lora}\n", 557 | "\\PY{n+nt}{lora\\PYZus{}rank}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{8}\n", 558 | "\\PY{n+nt}{lora\\PYZus{}target}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{all}\n", 559 | "\n", 560 | "\\PY{c+c1}{\\PYZsh{}\\PYZsh{}\\PYZsh{} dataset}\n", 561 | "\\PY{n+nt}{dataset}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{viggo\\PYZhy{}train}\n", 562 | "\\PY{n+nt}{dataset\\PYZus{}dir}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{/mnt/cluster\\PYZus{}storage/viggo}\\PY{+w}{ }\\PY{c+c1}{\\PYZsh{} shared storage workers have access to}\n", 563 | "\\PY{n+nt}{template}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{qwen}\n", 564 | "\\PY{n+nt}{cutoff\\PYZus{}len}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{2048}\n", 565 | "\\PY{n+nt}{max\\PYZus{}samples}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{1000}\n", 566 | "\\PY{n+nt}{overwrite\\PYZus{}cache}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{true}\n", 567 | "\\PY{n+nt}{preprocessing\\PYZus{}num\\PYZus{}workers}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{16}\n", 568 | "\\PY{n+nt}{dataloader\\PYZus{}num\\PYZus{}workers}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{4}\n", 569 | "\n", 570 | "\\PY{c+c1}{\\PYZsh{}\\PYZsh{}\\PYZsh{} output}\n", 571 | "\\PY{n+nt}{output\\PYZus{}dir}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{/mnt/cluster\\PYZus{}storage/viggo/outputs}\\PY{+w}{ }\\PY{c+c1}{\\PYZsh{} should be somewhere workers have access to (ex. s3, nfs)}\n", 572 | "\\PY{n+nt}{logging\\PYZus{}steps}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{10}\n", 573 | "\\PY{n+nt}{save\\PYZus{}steps}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{500}\n", 574 | "\\PY{n+nt}{plot\\PYZus{}loss}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{true}\n", 575 | "\\PY{n+nt}{overwrite\\PYZus{}output\\PYZus{}dir}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{true}\n", 576 | "\\PY{n+nt}{save\\PYZus{}only\\PYZus{}model}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{false}\n", 577 | "\n", 578 | "\\PY{c+c1}{\\PYZsh{}\\PYZsh{}\\PYZsh{} ray}\n", 579 | "\\PY{n+nt}{ray\\PYZus{}run\\PYZus{}name}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{lora\\PYZus{}sft\\PYZus{}ray}\n", 580 | "\\PY{n+nt}{ray\\PYZus{}storage\\PYZus{}path}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{/mnt/cluster\\PYZus{}storage/viggo/saves}\\PY{+w}{ }\\PY{c+c1}{\\PYZsh{} should be somewhere workers have access to (ex. 
s3, nfs)}\n", 581 | "\\PY{n+nt}{ray\\PYZus{}num\\PYZus{}workers}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{4}\n", 582 | "\\PY{n+nt}{resources\\PYZus{}per\\PYZus{}worker}\\PY{p}{:}\n", 583 | "\\PY{+w}{ }\\PY{n+nt}{GPU}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{1}\n", 584 | "\\PY{+w}{ }\\PY{n+nt}{anyscale/accelerator\\PYZus{}shape:4xL4}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{0.001}\\PY{+w}{ }\\PY{c+c1}{\\PYZsh{} Use this to specify a specific node shape,}\n", 585 | "\\PY{+w}{ }\\PY{c+c1}{\\PYZsh{} accelerator\\PYZus{}type:L4: 1 \\PYZsh{} Or use this to simply specify a GPU type.}\n", 586 | "\\PY{+w}{ }\\PY{c+c1}{\\PYZsh{} see https://docs.ray.io/en/master/ray\\PYZhy{}core/accelerator\\PYZhy{}types.html\\PYZsh{}accelerator\\PYZhy{}types for a full list of accelerator types}\n", 587 | "\\PY{n+nt}{placement\\PYZus{}strategy}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{PACK}\n", 588 | "\n", 589 | "\\PY{c+c1}{\\PYZsh{}\\PYZsh{}\\PYZsh{} train}\n", 590 | "\\PY{n+nt}{per\\PYZus{}device\\PYZus{}train\\PYZus{}batch\\PYZus{}size}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{1}\n", 591 | "\\PY{n+nt}{gradient\\PYZus{}accumulation\\PYZus{}steps}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{8}\n", 592 | "\\PY{n+nt}{learning\\PYZus{}rate}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{1.0e\\PYZhy{}4}\n", 593 | "\\PY{n+nt}{num\\PYZus{}train\\PYZus{}epochs}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{5.0}\n", 594 | "\\PY{n+nt}{lr\\PYZus{}scheduler\\PYZus{}type}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{cosine}\n", 595 | "\\PY{n+nt}{warmup\\PYZus{}ratio}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{0.1}\n", 596 | "\\PY{n+nt}{bf16}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{true}\n", 597 | "\\PY{n+nt}{ddp\\PYZus{}timeout}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{180000000}\n", 598 | "\\PY{n+nt}{resume\\PYZus{}from\\PYZus{}checkpoint}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{null}\n", 599 | "\n", 600 | "\\PY{c+c1}{\\PYZsh{}\\PYZsh{}\\PYZsh{} eval}\n", 601 | "\\PY{n+nt}{eval\\PYZus{}dataset}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{viggo\\PYZhy{}val}\\PY{+w}{ }\\PY{c+c1}{\\PYZsh{} uses same dataset\\PYZus{}dir as training data}\n", 602 | "\\PY{c+c1}{\\PYZsh{} val\\PYZus{}size: 0.1 \\PYZsh{} only if using part of training data for validation}\n", 603 | "\\PY{n+nt}{per\\PYZus{}device\\PYZus{}eval\\PYZus{}batch\\PYZus{}size}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{1}\n", 604 | "\\PY{n+nt}{eval\\PYZus{}strategy}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{steps}\n", 605 | "\\PY{n+nt}{eval\\PYZus{}steps}\\PY{p}{:}\\PY{+w}{ }\\PY{l+lScalar+lScalarPlain}{500}\n", 606 | "\\end{Verbatim}\n" 607 | ], 608 | "text/plain": [ 609 | "### model\n", 610 | "model_name_or_path: Qwen/Qwen2.5-7B-Instruct\n", 611 | "trust_remote_code: true\n", 612 | "\n", 613 | "### method\n", 614 | "stage: sft\n", 615 | "do_train: true\n", 616 | "finetuning_type: lora\n", 617 | "lora_rank: 8\n", 618 | "lora_target: all\n", 619 | "\n", 620 | "### dataset\n", 621 | "dataset: viggo-train\n", 622 | "dataset_dir: /mnt/cluster_storage/viggo # shared storage workers have access to\n", 623 | "template: qwen\n", 624 | "cutoff_len: 2048\n", 625 | "max_samples: 1000\n", 626 | "overwrite_cache: true\n", 627 | "preprocessing_num_workers: 16\n", 628 | "dataloader_num_workers: 4\n", 629 | "\n", 630 | "### output\n", 631 | "output_dir: /mnt/cluster_storage/viggo/outputs # should be somewhere workers have access to (ex. 
s3, nfs)\n", 632 | "logging_steps: 10\n", 633 | "save_steps: 500\n", 634 | "plot_loss: true\n", 635 | "overwrite_output_dir: true\n", 636 | "save_only_model: false\n", 637 | "\n", 638 | "### ray\n", 639 | "ray_run_name: lora_sft_ray\n", 640 | "ray_storage_path: /mnt/cluster_storage/viggo/saves # should be somewhere workers have access to (ex. s3, nfs)\n", 641 | "ray_num_workers: 4\n", 642 | "resources_per_worker:\n", 643 | " GPU: 1\n", 644 | " anyscale/accelerator_shape:4xL4: 0.001 # Use this to specify a specific node shape,\n", 645 | " # accelerator_type:L4: 1 # Or use this to simply specify a GPU type.\n", 646 | " # see https://docs.ray.io/en/master/ray-core/accelerator-types.html#accelerator-types for a full list of accelerator types\n", 647 | "placement_strategy: PACK\n", 648 | "\n", 649 | "### train\n", 650 | "per_device_train_batch_size: 1\n", 651 | "gradient_accumulation_steps: 8\n", 652 | "learning_rate: 1.0e-4\n", 653 | "num_train_epochs: 5.0\n", 654 | "lr_scheduler_type: cosine\n", 655 | "warmup_ratio: 0.1\n", 656 | "bf16: true\n", 657 | "ddp_timeout: 180000000\n", 658 | "resume_from_checkpoint: null\n", 659 | "\n", 660 | "### eval\n", 661 | "eval_dataset: viggo-val # uses same dataset_dir as training data\n", 662 | "# val_size: 0.1 # only if using part of training data for validation\n", 663 | "per_device_eval_batch_size: 1\n", 664 | "eval_strategy: steps\n", 665 | "eval_steps: 500" 666 | ] 667 | }, 668 | "metadata": {}, 669 | "output_type": "display_data" 670 | } 671 | ], 672 | "source": [ 673 | "display(Code(filename=\"lora_sft_ray.yaml\", language=\"yaml\"))" 674 | ] 675 | }, 676 | { 677 | "cell_type": "code", 678 | "execution_count": null, 679 | "metadata": {}, 680 | "outputs": [ 681 | { 682 | "name": "stdout", 683 | "output_type": "stream", 684 | "text": [ 685 | "Qwen/Qwen2.5-7B-Instruct\n" 686 | ] 687 | } 688 | ], 689 | "source": [ 690 | "model_id = \"ft-model\" # call it whatever you want\n", 691 | "model_source = yaml.safe_load(open(\"lora_sft_ray.yaml\"))[\"model_name_or_path\"] # HF model ID, S3 mirror config, or GCS mirror config\n", 692 | "print (model_source)" 693 | ] 694 | }, 695 | { 696 | "cell_type": "markdown", 697 | "metadata": {}, 698 | "source": [ 699 | "### Multi-node training" 700 | ] 701 | }, 702 | { 703 | "cell_type": "markdown", 704 | "metadata": {}, 705 | "source": [ 706 | "Use Ray Train + LlamaFactory to perform the mult-node train loop." 707 | ] 708 | }, 709 | { 710 | "cell_type": "markdown", 711 | "metadata": {}, 712 | "source": [ 713 | "
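As a rough sketch of what launching with `USE_RAY=1` sets up under the hood (illustrative only, not LLaMA-Factory's actual integration code), the `### ray` block in `lora_sft_ray.yaml` maps onto Ray Train's standard `TorchTrainer` and `ScalingConfig`:

```python
# A minimal sketch, assuming Ray Train's public TorchTrainer API; LLaMA-Factory
# builds the equivalent of this for you from the YAML when USE_RAY=1 is set.
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config: dict):
    # Each worker would build the model, dataloaders, and run the HF Trainer-based
    # SFT loop here; the body is omitted in this sketch.
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"args": "parsed lora_sft_ray.yaml contents"},  # placeholder
    scaling_config=ScalingConfig(
        num_workers=4,                    # ray_num_workers
        use_gpu=True,
        resources_per_worker={"GPU": 1},  # resources_per_worker
        placement_strategy="PACK",        # placement_strategy
    ),
    run_config=RunConfig(
        name="lora_sft_ray",                              # ray_run_name
        storage_path="/mnt/cluster_storage/viggo/saves",  # ray_storage_path
    ),
)
result = trainer.fit()
```

Expressing the run this way is what unlocks the Ray Train capabilities listed next.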
Ray Train \n", 714 | "\n", 715 | "Using [Ray Train](https://docs.ray.io/en/latest/train/train.html) has several advantages:\n", 716 | "- it automatically handles **multi-node, multi-GPU** setup with no manual SSH setup or `hostfile` configs. \n", 717 | "- you can define **per-worker fractional resource requirements**, for example, 2 CPUs and 0.5 GPU per worker.\n", 718 | "- you can run on **heterogeneous machines** and scale flexibly, for example, CPU for preprocessing and GPU for training.\n", 719 | "- it has built-in **fault tolerance** through retry of failed workers, and continue from last checkpoint.\n", 720 | "- it supports Data Parallel, Model Parallel, Parameter Server, and even custom strategies.\n", 721 | "- [Ray Compiled graphs](https://docs.ray.io/en/latest/ray-core/compiled-graph/ray-compiled-graph.html) allow you to even define different parallelism for jointly optimizing multiple models. Megatron, DeepSpeed, and similar frameworks only allow for one global setting.\n", 722 | "\n", 723 | "[RayTurbo Train](https://docs.anyscale.com/rayturbo/rayturbo-train) offers even more improvement to the price-performance ratio, performance monitoring, and more:\n", 724 | "- **elastic training** to scale to a dynamic number of workers, and continue training on fewer resources, even on spot instances.\n", 725 | "- **purpose-built dashboard** designed to streamline the debugging of Ray Train workloads:\n", 726 | " - Monitoring: View the status of training runs and train workers.\n", 727 | " - Metrics: See insights on training throughput and training system operation time.\n", 728 | " - Profiling: Investigate bottlenecks, hangs, or errors from individual training worker processes.\n", 729 | "\n", 730 | "" 731 | ] 732 | }, 733 | { 734 | "cell_type": "code", 735 | "execution_count": null, 736 | "metadata": {}, 737 | "outputs": [ 738 | { 739 | "name": "stdout", 740 | "output_type": "stream", 741 | "text": [ 742 | "\n", 743 | "\n", 744 | "Training started with configuration:\n", 745 | " ╭──────────────────────────────────────────────────────────────────────────────────────────────────────╮\n", 746 | " │ Training config │\n", 747 | " ├──────────────────────────────────────────────────────────────────────────────────────────────────────┤\n", 748 | " │ train_loop_config/args/bf16 True │\n", 749 | " │ train_loop_config/args/cutoff_len 2048 │\n", 750 | " │ train_loop_config/args/dataloader_num_workers 4 │\n", 751 | " │ train_loop_config/args/dataset viggo-train │\n", 752 | " │ train_loop_config/args/dataset_dir ...ter_storage/viggo │\n", 753 | " │ train_loop_config/args/ddp_timeout 180000000 │\n", 754 | " │ train_loop_config/args/do_train True │\n", 755 | " │ train_loop_config/args/eval_dataset viggo-val │\n", 756 | " │ train_loop_config/args/eval_steps 500 │\n", 757 | " │ train_loop_config/args/eval_strategy steps │\n", 758 | " │ train_loop_config/args/finetuning_type lora │\n", 759 | " │ train_loop_config/args/gradient_accumulation_steps 8 │\n", 760 | " │ train_loop_config/args/learning_rate 0.0001 │\n", 761 | " │ train_loop_config/args/logging_steps 10 │\n", 762 | " │ train_loop_config/args/lora_rank 8 │\n", 763 | " │ train_loop_config/args/lora_target all │\n", 764 | " │ train_loop_config/args/lr_scheduler_type cosine │\n", 765 | " │ train_loop_config/args/max_samples 1000 │\n", 766 | " │ train_loop_config/args/model_name_or_path ...en2.5-7B-Instruct │\n", 767 | " │ train_loop_config/args/num_train_epochs 5.0 │\n", 768 | " │ train_loop_config/args/output_dir ...age/viggo/outputs │\n", 769 | " │ 
train_loop_config/args/overwrite_cache True │\n", 770 | " │ train_loop_config/args/overwrite_output_dir True │\n", 771 | " │ train_loop_config/args/per_device_eval_batch_size 1 │\n", 772 | " │ train_loop_config/args/per_device_train_batch_size 1 │\n", 773 | " │ train_loop_config/args/placement_strategy PACK │\n", 774 | " │ train_loop_config/args/plot_loss True │\n", 775 | " │ train_loop_config/args/preprocessing_num_workers 16 │\n", 776 | " │ train_loop_config/args/ray_num_workers 4 │\n", 777 | " │ train_loop_config/args/ray_run_name lora_sft_ray │\n", 778 | " │ train_loop_config/args/ray_storage_path ...orage/viggo/saves │\n", 779 | " │ train_loop_config/args/resources_per_worker/GPU 1 │\n", 780 | " │ train_loop_config/args/resources_per_worker/anyscale/accelerator_shape:4xA10G 1 │\n", 781 | " │ train_loop_config/args/resume_from_checkpoint │\n", 782 | " │ train_loop_config/args/save_only_model False │\n", 783 | " │ train_loop_config/args/save_steps 500 │\n", 784 | " │ train_loop_config/args/stage sft │\n", 785 | " │ train_loop_config/args/template qwen │\n", 786 | " │ train_loop_config/args/trust_remote_code True │\n", 787 | " │ train_loop_config/args/warmup_ratio 0.1 │\n", 788 | " │ train_loop_config/callbacks ... 0x7e1262910e10>] │\n", 789 | " ╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯\n", 790 | "\n", 791 | " 100%|██████████| 155/155 [07:12<00:00, 2.85s/it][INFO|trainer.py:3942] 2025-04-11 14:57:59,207 >> Saving model checkpoint to /mnt/cluster_storage/viggo/outputs/checkpoint-155\n", 792 | " \n", 793 | " Training finished iteration 1 at 2025-04-11 14:58:02. Total running time: 10min 24s\n", 794 | " ╭─────────────────────────────────────────╮\n", 795 | " │ Training result │\n", 796 | " ├─────────────────────────────────────────┤\n", 797 | " │ checkpoint_dir_name checkpoint_000000 │\n", 798 | " │ time_this_iter_s 521.83827 │\n", 799 | " │ time_total_s 521.83827 │\n", 800 | " │ training_iteration 1 │\n", 801 | " │ epoch 4.704 │\n", 802 | " │ grad_norm 0.14288 │\n", 803 | " │ learning_rate 0. │\n", 804 | " │ loss 0.0065 │\n", 805 | " │ step 150 │\n", 806 | " ╰─────────────────────────────────────────╯\n", 807 | " Training saved a checkpoint for iteration 1 at: (local)/mnt/cluster_storage/viggo/saves/lora_sft_ray/TorchTrainer_95d16_00000_0_2025-04-11_14-47-37/checkpoint_000000\n", 808 | "\n", 809 | "\n", 810 | "\n" 811 | ] 812 | } 813 | ], 814 | "source": [ 815 | "%%bash\n", 816 | "# Run multi-node distributed fine-tuning workload\n", 817 | "USE_RAY=1 llamafactory-cli train lora_sft_ray.yaml" 818 | ] 819 | }, 820 | { 821 | "cell_type": "code", 822 | "execution_count": null, 823 | "metadata": {}, 824 | "outputs": [ 825 | { 826 | "data": { 827 | "text/html": [ 828 | "
{\n",
 903 |        "    "epoch": 4.864,\n",
 904 |        "    "eval_viggo-val_loss": 0.13618840277194977,\n",
 905 |        "    "eval_viggo-val_runtime": 20.2797,\n",
 906 |        "    "eval_viggo-val_samples_per_second": 35.208,\n",
 907 |        "    "eval_viggo-val_steps_per_second": 8.827,\n",
 908 |        "    "total_flos": 4.843098686147789e+16,\n",
 909 |        "    "train_loss": 0.2079355036479331,\n",
 910 |        "    "train_runtime": 437.2951,\n",
 911 |        "    "train_samples_per_second": 11.434,\n",
 912 |        "    "train_steps_per_second": 0.354\n",
 913 |        "}\n",
 914 |        "
\n" 915 | ], 916 | "text/latex": [ 917 | "\\begin{Verbatim}[commandchars=\\\\\\{\\}]\n", 918 | "\\PY{p}{\\PYZob{}}\n", 919 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}epoch\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+m+mf}{4.864}\\PY{p}{,}\n", 920 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}eval\\PYZus{}viggo\\PYZhy{}val\\PYZus{}loss\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+m+mf}{0.13618840277194977}\\PY{p}{,}\n", 921 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}eval\\PYZus{}viggo\\PYZhy{}val\\PYZus{}runtime\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+m+mf}{20.2797}\\PY{p}{,}\n", 922 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}eval\\PYZus{}viggo\\PYZhy{}val\\PYZus{}samples\\PYZus{}per\\PYZus{}second\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+m+mf}{35.208}\\PY{p}{,}\n", 923 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}eval\\PYZus{}viggo\\PYZhy{}val\\PYZus{}steps\\PYZus{}per\\PYZus{}second\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+m+mf}{8.827}\\PY{p}{,}\n", 924 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}total\\PYZus{}flos\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+m+mf}{4.843098686147789e+16}\\PY{p}{,}\n", 925 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}train\\PYZus{}loss\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+m+mf}{0.2079355036479331}\\PY{p}{,}\n", 926 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}train\\PYZus{}runtime\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+m+mf}{437.2951}\\PY{p}{,}\n", 927 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}train\\PYZus{}samples\\PYZus{}per\\PYZus{}second\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+m+mf}{11.434}\\PY{p}{,}\n", 928 | "\\PY{+w}{ }\\PY{n+nt}{\\PYZdq{}train\\PYZus{}steps\\PYZus{}per\\PYZus{}second\\PYZdq{}}\\PY{p}{:}\\PY{+w}{ }\\PY{l+m+mf}{0.354}\n", 929 | "\\PY{p}{\\PYZcb{}}\n", 930 | "\\end{Verbatim}\n" 931 | ], 932 | "text/plain": [ 933 | "{\n", 934 | " \"epoch\": 4.864,\n", 935 | " \"eval_viggo-val_loss\": 0.13618840277194977,\n", 936 | " \"eval_viggo-val_runtime\": 20.2797,\n", 937 | " \"eval_viggo-val_samples_per_second\": 35.208,\n", 938 | " \"eval_viggo-val_steps_per_second\": 8.827,\n", 939 | " \"total_flos\": 4.843098686147789e+16,\n", 940 | " \"train_loss\": 0.2079355036479331,\n", 941 | " \"train_runtime\": 437.2951,\n", 942 | " \"train_samples_per_second\": 11.434,\n", 943 | " \"train_steps_per_second\": 0.354\n", 944 | "}" 945 | ] 946 | }, 947 | "metadata": {}, 948 | "output_type": "display_data" 949 | } 950 | ], 951 | "source": [ 952 | "display(Code(filename=\"/mnt/cluster_storage/viggo/outputs/all_results.json\", language=\"json\"))" 953 | ] 954 | }, 955 | { 956 | "cell_type": "markdown", 957 | "metadata": { 958 | "vscode": { 959 | "languageId": "bat" 960 | } 961 | }, 962 | "source": [ 963 | "" 964 | ] 965 | }, 966 | { 967 | "cell_type": "markdown", 968 | "metadata": {}, 969 | "source": [ 970 | "### Observability" 971 | ] 972 | }, 973 | { 974 | "cell_type": "markdown", 975 | "metadata": {}, 976 | "source": [ 977 | "
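Besides the dashboards described next, the trainer also writes its metrics to plain JSON files under `output_dir`, which you can inspect without any UI. A small sketch: `all_results.json` is the file displayed above, while `trainer_state.json` is the standard Hugging Face Trainer state file and its exact location can vary with your checkpointing settings.

```python
import json

# Final aggregate metrics (the same file displayed above).
with open("/mnt/cluster_storage/viggo/outputs/all_results.json") as f:
    print(json.dumps(json.load(f), indent=2))

# Per-step history (loss, learning rate, eval metrics), if the state file is present.
try:
    with open("/mnt/cluster_storage/viggo/outputs/trainer_state.json") as f:
        for record in json.load(f).get("log_history", [])[-5:]:
            print(record)
except FileNotFoundError:
    pass  # depending on the trainer setup, this file may live in a checkpoint directory instead
```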
🔎 Monitoring and debugging with Ray \n", 978 | "\n", 979 | "\n", 980 | "OSS Ray offers an extensive [observability suite](https://docs.ray.io/en/latest/ray-observability/index.html) with logs and an observability dashboard that you can use to monitor and debug. The dashboard includes a lot of different components such as:\n", 981 | "\n", 982 | "- memory, utilization, etc., of the tasks running in the [cluster](https://docs.ray.io/en/latest/ray-observability/getting-started.html#dash-node-view)\n", 983 | "\n", 984 | "\n", 985 | "\n", 986 | "- views to see all running tasks, utilization across instance types, autoscaling, etc.\n", 987 | "\n", 988 | "\n" 989 | ] 990 | }, 991 | { 992 | "cell_type": "markdown", 993 | "metadata": {}, 994 | "source": [ 995 | "
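You can also take a quick programmatic snapshot of the cluster from a notebook or job to complement the dashboard views:

```python
import ray

# Attach to the running cluster; this is a no-op if Ray is already initialized in this session.
ray.init(ignore_reinit_error=True)

print(ray.cluster_resources())    # total CPUs, GPUs, memory, and custom resources
print(ray.available_resources())  # what's currently free across the cluster
```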
🔎➕➕ Monitoring and debugging on Anyscale \n", 996 | "\n", 997 | "OSS Ray comes with an extensive observability suite, and Anyscale takes it many steps further to make monitoring and debugging your workloads even easier and faster with:\n", 998 | "\n", 999 | "- [unified log viewer](https://docs.anyscale.com/monitoring/accessing-logs/) to see logs from *all* driver and worker processes\n", 1000 | "- Ray workload specific dashboard, like Data, Train, etc., that can breakdown the tasks. For example, you can observe the preceding training workload live through the Train specific Ray Workloads dashboard:\n", 1001 | "\n", 1002 | "\n", 1003 | "\n", 1004 | "\n" 1005 | ] 1006 | }, 1007 | { 1008 | "cell_type": "markdown", 1009 | "metadata": {}, 1010 | "source": [ 1011 | "### Save to cloud storage" 1012 | ] 1013 | }, 1014 | { 1015 | "cell_type": "markdown", 1016 | "metadata": {}, 1017 | "source": [ 1018 | "
🗂️ Storage on Anyscale \n", 1019 | "\n", 1020 | "You can always store to data inside [any storage buckets](https://docs.anyscale.com/configuration/storage/#private-storage-buckets) but Anyscale offers a [default storage bucket](https://docs.anyscale.com/configuration/storage/#anyscale-default-storage-bucket) to make things even easier. You also have plenty of other [storage options](https://docs.anyscale.com/configuration/storage/) as well, shared at the cluster, user, and cloud levels." 1021 | ] 1022 | }, 1023 | { 1024 | "cell_type": "code", 1025 | "execution_count": null, 1026 | "metadata": {}, 1027 | "outputs": [ 1028 | { 1029 | "name": "stdout", 1030 | "output_type": "stream", 1031 | "text": [ 1032 | "s3://anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage\n" 1033 | ] 1034 | } 1035 | ], 1036 | "source": [ 1037 | "%%bash\n", 1038 | "# Anyscale default storage bucket.\n", 1039 | "echo $ANYSCALE_ARTIFACT_STORAGE" 1040 | ] 1041 | }, 1042 | { 1043 | "cell_type": "code", 1044 | "execution_count": null, 1045 | "metadata": {}, 1046 | "outputs": [], 1047 | "source": [ 1048 | "%%bash\n", 1049 | "# Save fine-tuning artifacts to cloud storage.\n", 1050 | "STORAGE_PATH=\"$ANYSCALE_ARTIFACT_STORAGE/viggo\"\n", 1051 | "LOCAL_OUTPUTS_PATH=\"/mnt/cluster_storage/viggo/outputs\"\n", 1052 | "LOCAL_SAVES_PATH=\"/mnt/cluster_storage/viggo/saves\"\n", 1053 | "\n", 1054 | "# AWS S3 operations.\n", 1055 | "if [[ \"$STORAGE_PATH\" == s3://* ]]; then\n", 1056 | " if aws s3 ls \"$STORAGE_PATH\" > /dev/null 2>&1; then\n", 1057 | " aws s3 rm \"$STORAGE_PATH\" --recursive --quiet\n", 1058 | " fi\n", 1059 | " aws s3 cp \"$LOCAL_OUTPUTS_PATH\" \"$STORAGE_PATH/outputs\" --recursive --quiet\n", 1060 | " aws s3 cp \"$LOCAL_SAVES_PATH\" \"$STORAGE_PATH/saves\" --recursive --quiet\n", 1061 | "\n", 1062 | "# Google Cloud Storage operations.\n", 1063 | "elif [[ \"$STORAGE_PATH\" == gs://* ]]; then\n", 1064 | " if gsutil ls \"$STORAGE_PATH\" > /dev/null 2>&1; then\n", 1065 | " gsutil -m -q rm -r \"$STORAGE_PATH\"\n", 1066 | " fi\n", 1067 | " gsutil -m -q cp -r \"$LOCAL_OUTPUTS_PATH\" \"$STORAGE_PATH/outputs\"\n", 1068 | " gsutil -m -q cp -r \"$LOCAL_SAVES_PATH\" \"$STORAGE_PATH/saves\"\n", 1069 | "\n", 1070 | "else\n", 1071 | " echo \"Unsupported storage protocol: $STORAGE_PATH\"\n", 1072 | " exit 1\n", 1073 | "fi" 1074 | ] 1075 | }, 1076 | { 1077 | "cell_type": "code", 1078 | "execution_count": null, 1079 | "metadata": {}, 1080 | "outputs": [ 1081 | { 1082 | "name": "stdout", 1083 | "output_type": "stream", 1084 | "text": [ 1085 | "TorchTrainer_95d16_00000_0_2025-04-11_14-47-37\n", 1086 | "TorchTrainer_f9e4e_00000_0_2025-04-11_12-41-34\n", 1087 | "basic-variant-state-2025-04-11_12-41-34.json\n", 1088 | "basic-variant-state-2025-04-11_14-47-37.json\n", 1089 | "experiment_state-2025-04-11_12-41-34.json\n", 1090 | "experiment_state-2025-04-11_14-47-37.json\n", 1091 | "trainer.pkl\n", 1092 | "tuner.pkl\n" 1093 | ] 1094 | } 1095 | ], 1096 | "source": [ 1097 | "%%bash\n", 1098 | "ls /mnt/cluster_storage/viggo/saves/lora_sft_ray" 1099 | ] 1100 | }, 1101 | { 1102 | "cell_type": "code", 1103 | "execution_count": null, 1104 | "metadata": {}, 1105 | "outputs": [ 1106 | { 1107 | "name": "stdout", 1108 | "output_type": "stream", 1109 | "text": [ 1110 | "/mnt/cluster_storage/viggo/saves/lora_sft_ray/TorchTrainer_95d16_00000_0_2025-04-11_14-47-37/checkpoint_000000/checkpoint\n", 1111 | 
"s3://anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/viggo/saves/lora_sft_ray/TorchTrainer_95d16_00000_0_2025-04-11_14-47-37/checkpoint_000000/checkpoint\n", 1112 | "s3://anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/viggo/saves/lora_sft_ray/TorchTrainer_95d16_00000_0_2025-04-11_14-47-37/checkpoint_000000\n", 1113 | "checkpoint\n" 1114 | ] 1115 | } 1116 | ], 1117 | "source": [ 1118 | "# LoRA paths.\n", 1119 | "save_dir = Path(\"/mnt/cluster_storage/viggo/saves/lora_sft_ray\")\n", 1120 | "trainer_dirs = [d for d in save_dir.iterdir() if d.name.startswith(\"TorchTrainer_\") and d.is_dir()]\n", 1121 | "latest_trainer = max(trainer_dirs, key=lambda d: d.stat().st_mtime, default=None)\n", 1122 | "lora_path = f\"{latest_trainer}/checkpoint_000000/checkpoint\"\n", 1123 | "cloud_lora_path = os.path.join(os.getenv(\"ANYSCALE_ARTIFACT_STORAGE\"), lora_path.split(\"/mnt/cluster_storage/\")[-1])\n", 1124 | "dynamic_lora_path, lora_id = cloud_lora_path.rsplit(\"/\", 1)\n", 1125 | "print (lora_path)\n", 1126 | "print (cloud_lora_path)\n", 1127 | "print (dynamic_lora_path)\n", 1128 | "print (lora_id)" 1129 | ] 1130 | }, 1131 | { 1132 | "cell_type": "code", 1133 | "execution_count": null, 1134 | "metadata": {}, 1135 | "outputs": [ 1136 | { 1137 | "name": "stdout", 1138 | "output_type": "stream", 1139 | "text": [ 1140 | "README.md\n", 1141 | "adapter_config.json\n", 1142 | "adapter_model.safetensors\n", 1143 | "added_tokens.json\n", 1144 | "merges.txt\n", 1145 | "optimizer.pt\n", 1146 | "rng_state_0.pth\n", 1147 | "rng_state_1.pth\n", 1148 | "rng_state_2.pth\n", 1149 | "rng_state_3.pth\n", 1150 | "scheduler.pt\n", 1151 | "special_tokens_map.json\n", 1152 | "tokenizer.json\n", 1153 | "tokenizer_config.json\n", 1154 | "trainer_state.json\n", 1155 | "training_args.bin\n", 1156 | "vocab.json\n" 1157 | ] 1158 | } 1159 | ], 1160 | "source": [ 1161 | "%%bash -s \"$lora_path\"\n", 1162 | "ls $1" 1163 | ] 1164 | }, 1165 | { 1166 | "cell_type": "markdown", 1167 | "metadata": {}, 1168 | "source": [ 1169 | "## Batch inference \n", 1170 | "[`Overview`](https://docs.ray.io/en/latest/data/working-with-llms.html) | [`API reference`](https://docs.ray.io/en/latest/data/api/llm.html)" 1171 | ] 1172 | }, 1173 | { 1174 | "cell_type": "markdown", 1175 | "metadata": {}, 1176 | "source": [ 1177 | "The `ray.data.llm` module integrates with key large language model (LLM) inference engines and deployed models to enable LLM batch inference. 
These LLM modules use [Ray Data](https://docs.ray.io/en/latest/data/data.html) under the hood, which makes it extremely easy to distribute workloads but also ensures that they happen:\n", 1178 | "- **efficiently**: minimizing CPU/GPU idle time with heterogeneous resource scheduling.\n", 1179 | "- **at scale**: with streaming execution to petabyte-scale datasets, especially when [working with LLMs](https://docs.ray.io/en/latest/data/working-with-llms.html).\n", 1180 | "- **reliably** by checkpointing processes, especially when running workloads on spot instances with on-demand fallback.\n", 1181 | "- **flexibly**: connecting to data from any source, applying transformations, and saving to any format and location for your next workload.\n", 1182 | "\n", 1183 | "\n", 1184 | "\n", 1185 | "[RayTurbo Data](https://docs.anyscale.com/rayturbo/rayturbo-data) has more features on top of Ray Data:\n", 1186 | "- **accelerated metadata fetching** to improve reading first time from large datasets \n", 1187 | "- **optimized autoscaling** where Jobs can kick off before waiting for the entire cluster to start\n", 1188 | "- **high reliability** where entire failed jobs, like head node, cluster, uncaptured exceptions, etc., can resume from checkpoints. OSS Ray can only recover from worker node failures." 1189 | ] 1190 | }, 1191 | { 1192 | "cell_type": "markdown", 1193 | "metadata": {}, 1194 | "source": [ 1195 | "Start by defining the [vLLM engine processor config](https://docs.ray.io/en/latest/data/api/doc/ray.data.llm.vLLMEngineProcessorConfig.html#ray.data.llm.vLLMEngineProcessorConfig) where you can select the model to use and the [engine behavior](https://docs.vllm.ai/en/stable/serving/engine_args.html). The model can come from [Hugging Face (HF) Hub](https://huggingface.co/models) or a local model path `/path/to/your/model`. 
Anyscale supports GPTQ, GGUF, or LoRA model formats.\n", 1196 | "\n", 1197 | "" 1198 | ] 1199 | }, 1200 | { 1201 | "cell_type": "markdown", 1202 | "metadata": {}, 1203 | "source": [ 1204 | "### vLLM engine processor" 1205 | ] 1206 | }, 1207 | { 1208 | "cell_type": "code", 1209 | "execution_count": null, 1210 | "metadata": {}, 1211 | "outputs": [ 1212 | { 1213 | "name": "stdout", 1214 | "output_type": "stream", 1215 | "text": [ 1216 | "INFO 04-11 14:58:40 __init__.py:194] No platform detected, vLLM is running on UnspecifiedPlatform\n" 1217 | ] 1218 | } 1219 | ], 1220 | "source": [ 1221 | "import os\n", 1222 | "import ray\n", 1223 | "from ray.data.llm import vLLMEngineProcessorConfig" 1224 | ] 1225 | }, 1226 | { 1227 | "cell_type": "code", 1228 | "execution_count": null, 1229 | "metadata": {}, 1230 | "outputs": [], 1231 | "source": [ 1232 | "config = vLLMEngineProcessorConfig(\n", 1233 | " model_source=model_source,\n", 1234 | " runtime_env={\n", 1235 | " \"env_vars\": {\n", 1236 | " \"VLLM_USE_V1\": \"0\", # v1 doesn't support lora adapters yet\n", 1237 | " # \"HF_TOKEN\": os.environ.get(\"HF_TOKEN\"),\n", 1238 | " },\n", 1239 | " },\n", 1240 | " engine_kwargs={\n", 1241 | " \"enable_lora\": True,\n", 1242 | " \"max_lora_rank\": 8,\n", 1243 | " \"max_loras\": 1,\n", 1244 | " \"pipeline_parallel_size\": 1,\n", 1245 | " \"tensor_parallel_size\": 1,\n", 1246 | " \"enable_prefix_caching\": True,\n", 1247 | " \"enable_chunked_prefill\": True,\n", 1248 | " \"max_num_batched_tokens\": 4096,\n", 1249 | " \"max_model_len\": 4096, # or increase KV cache size\n", 1250 | " # complete list: https://docs.vllm.ai/en/stable/serving/engine_args.html\n", 1251 | " },\n", 1252 | " concurrency=1,\n", 1253 | " batch_size=16,\n", 1254 | " accelerator_type=\"L4\",\n", 1255 | ")" 1256 | ] 1257 | }, 1258 | { 1259 | "cell_type": "markdown", 1260 | "metadata": {}, 1261 | "source": [ 1262 | "### LLM processor" 1263 | ] 1264 | }, 1265 | { 1266 | "cell_type": "markdown", 1267 | "metadata": {}, 1268 | "source": [ 1269 | "Next, pass the config to an [LLM processor](https://docs.ray.io/en/master/data/api/doc/ray.data.llm.build_llm_processor.html#ray.data.llm.build_llm_processor) where you can define the preprocessing and postprocessing steps around inference. With your base model defined in the processor config, you can define the LoRA adapter layers as part of the preprocessing step of the LLM processor itself." 1270 | ] 1271 | }, 1272 | { 1273 | "cell_type": "code", 1274 | "execution_count": null, 1275 | "metadata": {}, 1276 | "outputs": [], 1277 | "source": [ 1278 | "from ray.data.llm import build_llm_processor" 1279 | ] 1280 | }, 1281 | { 1282 | "cell_type": "code", 1283 | "execution_count": null, 1284 | "metadata": {}, 1285 | "outputs": [ 1286 | { 1287 | "name": "stderr", 1288 | "output_type": "stream", 1289 | "text": [ 1290 | "2025-04-11 14:58:40,942\tINFO worker.py:1660 -- Connecting to existing Ray cluster at address: 10.0.51.51:6379...\n", 1291 | "2025-04-11 14:58:40,953\tINFO worker.py:1843 -- Connected to Ray cluster. 
View the dashboard at \u001b[1m\u001b[32mhttps://session-zt5t77xa58pyp3uy28glg2g24d.i.anyscaleuserdata.com \u001b[39m\u001b[22m\n", 1292 | "2025-04-11 14:58:40,960\tINFO packaging.py:367 -- Pushing file package 'gcs://_ray_pkg_e71d58b4dc01d065456a9fc0325ee2682e13de88.zip' (2.16MiB) to Ray cluster...\n", 1293 | "2025-04-11 14:58:40,969\tINFO packaging.py:380 -- Successfully pushed file package 'gcs://_ray_pkg_e71d58b4dc01d065456a9fc0325ee2682e13de88.zip'.\n" 1294 | ] 1295 | }, 1296 | { 1297 | "data": { 1298 | "application/vnd.jupyter.widget-view+json": { 1299 | "model_id": "a9171027a5a249ff801e77f763506f67", 1300 | "version_major": 2, 1301 | "version_minor": 0 1302 | }, 1303 | "text/plain": [ 1304 | "config.json: 0%| | 0.00/663 [00:00system\n", 1386 | "Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']<|im_end|>\n", 1387 | "<|im_start|>user\n", 1388 | "Do you have a favorite ESRB content rating?<|im_end|>\n", 1389 | "<|im_start|>assistant\n", 1390 | "\",\n", 1391 | " \"prompt_token_ids\": [151644, \"...\", 198],\n", 1392 | " \"request_id\": 94,\n", 1393 | " \"time_taken_llm\": 6.028705836999961,\n", 1394 | " \"generated_output\": \"request(esrb)\"\n", 1395 | "}\n", 1396 | "\n", 1397 | "\n" 1398 | ] 1399 | } 1400 | ], 1401 | "source": [ 1402 | "# Evaluation on test dataset\n", 1403 | "ds = ray.data.read_json(\"/mnt/cluster_storage/viggo/test.jsonl\") # complete list: https://docs.ray.io/en/latest/data/api/input_output.html\n", 1404 | "ds = processor(ds)\n", 1405 | "results = ds.take_all()\n", 1406 | "results[0]" 1407 | ] 1408 | }, 1409 | { 1410 | "cell_type": "code", 1411 | "execution_count": null, 1412 | "metadata": {}, 1413 | "outputs": [ 1414 | { 1415 | "data": { 1416 | "text/plain": [ 1417 | "0.6879039704524469" 1418 | ] 1419 | }, 1420 | "execution_count": null, 1421 | "metadata": {}, 1422 | "output_type": "execute_result" 1423 | } 1424 | ], 1425 | "source": [ 1426 | "# Exact match (strict!)\n", 1427 | "matches = 0\n", 1428 | "for item in results:\n", 1429 | " if item[\"output\"] == item[\"generated_output\"]:\n", 1430 | " matches += 1\n", 1431 | "matches / float(len(results))" 1432 | ] 1433 | }, 1434 | { 1435 | "cell_type": "markdown", 1436 | "metadata": {}, 1437 | "source": [ 1438 | "**Note**: The objective of fine-tuning here isn't to create the most performant model but to show that you can leverage it for downstream workloads, like batch inference and online serving at scale. However, you can increase `num_train_epochs` if you want to." 1439 | ] 1440 | }, 1441 | { 1442 | "cell_type": "markdown", 1443 | "metadata": {}, 1444 | "source": [ 1445 | "Observe the individual steps in the batch inference workload through the Anyscale Ray Data dashboard:\n", 1446 | "\n", 1447 | "" 1448 | ] 1449 | }, 1450 | { 1451 | "cell_type": "markdown", 1452 | "metadata": {}, 1453 | "source": [ 1454 | "
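Exact string match (used above) is a strict metric for this task. For a more forgiving signal, you can also check whether the generated meaning representation at least calls the correct function, using the `results` list computed above:

```python
# Lenient check: compare only the function name, i.e. the token before the first "(".
def function_name(mr: str) -> str:
    return mr.split("(", 1)[0].strip()

fn_matches = sum(
    function_name(item["output"]) == function_name(item["generated_output"])
    for item in results
)
print(fn_matches / float(len(results)))
```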
\n", 1455 | "\n", 1456 | "💡 For more advanced guides on topics like optimized model loading, multi-LoRA, OpenAI-compatible endpoints, etc., see [more examples](https://docs.ray.io/en/latest/data/working-with-llms.html) and the [API reference](https://docs.ray.io/en/latest/data/api/llm.html).\n", 1457 | "\n", 1458 | "
" 1459 | ] 1460 | }, 1461 | { 1462 | "cell_type": "markdown", 1463 | "metadata": {}, 1464 | "source": [ 1465 | "## Online serving\n", 1466 | "[`Overview`](https://docs.ray.io/en/latest/serve/llm/serving-llms.html) | [`API reference`](https://docs.ray.io/en/latest/serve/api/index.html#llm-api)" 1467 | ] 1468 | }, 1469 | { 1470 | "cell_type": "markdown", 1471 | "metadata": {}, 1472 | "source": [ 1473 | "\n", 1474 | "\n", 1475 | "`ray.serve.llm` APIs allow users to deploy multiple LLM models together with a familiar Ray Serve API, while providing compatibility with the OpenAI API.\n", 1476 | "\n", 1477 | "\n", 1478 | "\n", 1479 | "Ray Serve LLM is designed with the following features:\n", 1480 | "- Automatic scaling and load balancing\n", 1481 | "- Unified multi-node multi-model deployment\n", 1482 | "- OpenAI compatibility\n", 1483 | "- Multi-LoRA support with shared base models\n", 1484 | "- Deep integration with inference engines, vLLM to start\n", 1485 | "- Composable multi-model LLM pipelines\n", 1486 | "\n", 1487 | "[RayTurbo Serve](https://docs.anyscale.com/rayturbo/rayturbo-serve) on Anyscale has more features on top of Ray Serve:\n", 1488 | "- **fast autoscaling and model loading** to get services up and running even faster: [5x improvements](https://www.anyscale.com/blog/autoscale-large-ai-models-faster) even for LLMs\n", 1489 | "- 54% **higher QPS** and up-to 3x **streaming tokens per second** for high traffic serving use-cases\n", 1490 | "- **replica compaction** into fewer nodes where possible to reduce resource fragmentation and improve hardware utilization\n", 1491 | "- **zero-downtime** [incremental rollouts](https://docs.anyscale.com/platform/services/update-a-service/#resource-constrained-updates) so your service is never interrupted\n", 1492 | "- [**different environments**](https://docs.anyscale.com/platform/services/multi-app/#multiple-applications-in-different-containers) for each service in a multi-serve application\n", 1493 | "- **multi availability-zone** aware scheduling of Ray Serve replicas to provide higher redundancy to availability zone failures\n" 1494 | ] 1495 | }, 1496 | { 1497 | "cell_type": "markdown", 1498 | "metadata": {}, 1499 | "source": [ 1500 | "### LLM serve config" 1501 | ] 1502 | }, 1503 | { 1504 | "cell_type": "code", 1505 | "execution_count": null, 1506 | "metadata": {}, 1507 | "outputs": [], 1508 | "source": [ 1509 | "import os\n", 1510 | "from openai import OpenAI # to use openai api format\n", 1511 | "from ray import serve\n", 1512 | "from ray.serve.llm import LLMConfig, build_openai_app" 1513 | ] 1514 | }, 1515 | { 1516 | "cell_type": "markdown", 1517 | "metadata": {}, 1518 | "source": [ 1519 | "Define an [LLM config](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.LLMConfig.html#ray.serve.llm.LLMConfig) where you can define where the model comes from, it's [autoscaling behavior](https://docs.ray.io/en/latest/serve/autoscaling-guide.html#serve-autoscaling), what hardware to use and [engine arguments](https://docs.vllm.ai/en/stable/serving/engine_args.html)." 
1520 | ] 1521 | }, 1522 | { 1523 | "cell_type": "code", 1524 | "execution_count": null, 1525 | "metadata": {}, 1526 | "outputs": [], 1527 | "source": [ 1528 | "# Define config.\n", 1529 | "llm_config = LLMConfig(\n", 1530 | " model_loading_config={\n", 1531 | " \"model_id\": model_id,\n", 1532 | " \"model_source\": model_source\n", 1533 | " },\n", 1534 | " lora_config={ # REMOVE this section if you're only using a base model.\n", 1535 | " \"dynamic_lora_loading_path\": dynamic_lora_path,\n", 1536 | " \"max_num_adapters_per_replica\": 16, # You only have 1.\n", 1537 | " },\n", 1538 | " # runtime_env={\"env_vars\": {\"HF_TOKEN\": os.environ.get(\"HF_TOKEN\")}},\n", 1539 | " deployment_config={\n", 1540 | " \"autoscaling_config\": {\n", 1541 | " \"min_replicas\": 1,\n", 1542 | " \"max_replicas\": 2,\n", 1543 | " # complete list: https://docs.ray.io/en/latest/serve/autoscaling-guide.html#serve-autoscaling\n", 1544 | " }\n", 1545 | " },\n", 1546 | " accelerator_type=\"L4\",\n", 1547 | " engine_kwargs={\n", 1548 | " \"max_model_len\": 4096, # Or increase KV cache size.\n", 1549 | " \"tensor_parallel_size\": 1,\n", 1550 | " \"enable_lora\": True,\n", 1551 | " # complete list: https://docs.vllm.ai/en/stable/serving/engine_args.html\n", 1552 | " },\n", 1553 | ")" 1554 | ] 1555 | }, 1556 | { 1557 | "cell_type": "markdown", 1558 | "metadata": {}, 1559 | "source": [ 1560 | "Now deploy the LLM config as an application. And because this application is all built on top of [Ray Serve](https://docs.ray.io/en/latest/serve/index.html), you can have advanced service logic around composing models together, deploying multiple applications, model multiplexing, observability, etc." 1561 | ] 1562 | }, 1563 | { 1564 | "cell_type": "code", 1565 | "execution_count": null, 1566 | "metadata": {}, 1567 | "outputs": [ 1568 | { 1569 | "name": "stdout", 1570 | "output_type": "stream", 1571 | "text": [ 1572 | "DeploymentHandle(deployment='LLMRouter')\n" 1573 | ] 1574 | } 1575 | ], 1576 | "source": [ 1577 | "# Deploy.\n", 1578 | "app = build_openai_app({\"llm_configs\": [llm_config]})\n", 1579 | "serve.run(app)" 1580 | ] 1581 | }, 1582 | { 1583 | "cell_type": "markdown", 1584 | "metadata": {}, 1585 | "source": [ 1586 | "### Service request" 1587 | ] 1588 | }, 1589 | { 1590 | "cell_type": "code", 1591 | "execution_count": null, 1592 | "metadata": {}, 1593 | "outputs": [ 1594 | { 1595 | "name": "stdout", 1596 | "output_type": "stream", 1597 | "text": [ 1598 | "\n", 1599 | "\n", 1600 | "Avg prompt throughput: 20.3 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.\n", 1601 | "\n", 1602 | "_opinion(name[Diablo II], developer[Blizzard North], rating[good], has_mac_release[yes])\n", 1603 | "\n", 1604 | "\n" 1605 | ] 1606 | } 1607 | ], 1608 | "source": [ 1609 | "# Initialize client.\n", 1610 | "client = OpenAI(base_url=\"http://localhost:8000/v1\", api_key=\"fake-key\")\n", 1611 | "response = client.chat.completions.create(\n", 1612 | " model=f\"{model_id}:{lora_id}\",\n", 1613 | " messages=[\n", 1614 | " {\"role\": \"system\", \"content\": \"Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. 
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']\"},\n", 1615 | " {\"role\": \"user\", \"content\": \"Blizzard North is mostly an okay developer, but they released Diablo II for the Mac and so that pushes the game from okay to good in my view.\"},\n", 1616 | " ],\n", 1617 | " stream=True\n", 1618 | ")\n", 1619 | "for chunk in response:\n", 1620 | " if chunk.choices[0].delta.content is not None:\n", 1621 | " print(chunk.choices[0].delta.content, end=\"\", flush=True)" 1622 | ] 1623 | }, 1624 | { 1625 | "cell_type": "markdown", 1626 | "metadata": {}, 1627 | "source": [ 1628 | "And of course, you can observe the running service, the deployments, and metrics like QPS, latency, etc., through the [Ray Dashboard](https://docs.ray.io/en/latest/ray-observability/getting-started.html)'s [Serve view](https://docs.ray.io/en/latest/ray-observability/getting-started.html#dash-serve-view):\n", 1629 | "\n", 1630 | "" 1631 | ] 1632 | }, 1633 | { 1634 | "cell_type": "markdown", 1635 | "metadata": {}, 1636 | "source": [ 1637 | "
\n", 1638 | "\n", 1639 | "💡 See [more examples](https://docs.ray.io/en/latest/serve/llm/overview.html) and the [API reference](https://docs.ray.io/en/latest/serve/llm/api.html) for advanced guides on topics like structured outputs (like JSON), vision LMs, multi-LoRA on shared base models, using other inference engines (like `sglang`), fast model loading, etc.\n", 1640 | "\n", 1641 | "
" 1642 | ] 1643 | }, 1644 | { 1645 | "cell_type": "markdown", 1646 | "metadata": {}, 1647 | "source": [ 1648 | "## Production\n", 1649 | "\n", 1650 | "Seamlessly integrate with your existing CI/CD pipelines by leveraging the Anyscale [CLI](https://docs.anyscale.com/reference/quickstart-cli) or [SDK](https://docs.anyscale.com/reference/quickstart-sdk) to run [reliable batch jobs](https://docs.anyscale.com/platform/jobs) and deploy [highly available services](https://docs.anyscale.com/platform/services). Given you've been developing in an environment that's almost identical to production with a multi-node cluster, this integration should drastically speed up your dev to prod velocity.\n", 1651 | "\n", 1652 | "\n", 1653 | "\n", 1654 | "### Jobs\n", 1655 | "\n", 1656 | "[Anyscale Jobs](https://docs.anyscale.com/platform/jobs/) ([API ref](https://docs.anyscale.com/reference/job-api/)) allows you to execute discrete workloads in production such as batch inference, embeddings generation, or model fine-tuning.\n", 1657 | "- [define and manage](https://docs.anyscale.com/platform/jobs/manage-jobs) your Jobs in many different ways, like CLI and Python SDK\n", 1658 | "- set up [queues](https://docs.anyscale.com/platform/jobs/job-queues) and [schedules](https://docs.anyscale.com/platform/jobs/schedules)\n", 1659 | "- set up all the [observability, alerting, etc.](https://docs.anyscale.com/platform/jobs/monitoring-and-debugging) around your Jobs\n", 1660 | "\n", 1661 | "\n", 1662 | "\n", 1663 | "### Services\n", 1664 | "\n", 1665 | "[Anyscale Services](https://docs.anyscale.com/platform/services/) ([API ref](https://docs.anyscale.com/reference/service-api/)) offers an extremely fault tolerant, scalable, and optimized way to serve your Ray Serve applications:\n", 1666 | "- you can [rollout and update](https://docs.anyscale.com/platform/services/update-a-service) services with canary deployment with zero-downtime upgrades\n", 1667 | "- [monitor](https://docs.anyscale.com/platform/services/monitoring) your Services through a dedicated Service page, unified log viewer, tracing, set up alerts, etc.\n", 1668 | "- scale a service (`num_replicas=auto`) and utilize replica compaction to consolidate nodes that are fractionally utilized\n", 1669 | "- [head node fault tolerance](https://docs.anyscale.com/platform/services/production-best-practices#head-node-ft) because OSS Ray recovers from failed workers and replicas but not head node crashes\n", 1670 | "- serving [multiple applications](https://docs.anyscale.com/platform/services/multi-app) in a single Service\n", 1671 | "\n", 1672 | "\n" 1673 | ] 1674 | }, 1675 | { 1676 | "cell_type": "code", 1677 | "execution_count": null, 1678 | "metadata": {}, 1679 | "outputs": [], 1680 | "source": [ 1681 | "%%bash\n", 1682 | "# clean up\n", 1683 | "rm -rf /mnt/cluster_storage/viggo\n", 1684 | "STORAGE_PATH=\"$ANYSCALE_ARTIFACT_STORAGE/viggo\"\n", 1685 | "if [[ \"$STORAGE_PATH\" == s3://* ]]; then\n", 1686 | " aws s3 rm \"$STORAGE_PATH\" --recursive --quiet\n", 1687 | "elif [[ \"$STORAGE_PATH\" == gs://* ]]; then\n", 1688 | " gsutil -m -q rm -r \"$STORAGE_PATH\"\n", 1689 | "fi" 1690 | ] 1691 | } 1692 | ], 1693 | "metadata": { 1694 | "kernelspec": { 1695 | "display_name": "base", 1696 | "language": "python", 1697 | "name": "python3" 1698 | }, 1699 | "language_info": { 1700 | "codemirror_mode": { 1701 | "name": "ipython", 1702 | "version": 3 1703 | }, 1704 | "file_extension": ".py", 1705 | "mimetype": "text/x-python", 1706 | "name": "python", 1707 | "nbconvert_exporter": 
"python", 1708 | "pygments_lexer": "ipython3", 1709 | "version": "3.11.11" 1710 | } 1711 | }, 1712 | "nbformat": 4, 1713 | "nbformat_minor": 2 1714 | } 1715 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Entity Recognition with LLMs 2 | 3 |
4 |   5 |   6 |
7 | 8 | This end-to-end tutorial **fine-tunes** an LLM to perform **batch inference** and **online serving** at scale. While entity recognition (NER) is the main task in this tutorial, you can easily extend these end-to-end workflows to any use case. 9 | 10 | 11 | 12 | **Note**: The intent of this tutorial is to show how you can use Ray to implement end-to-end LLM workflows that can extend to any use case, including multimodal. 13 | 14 | This tutorial uses the [Ray library](https://github.com/ray-project/ray) to implement these workflows, namely the LLM APIs: 15 | 16 | [`ray.data.llm`](https://docs.ray.io/en/latest/data/working-with-llms.html): 17 | - Batch inference over distributed datasets 18 | - Streaming and async execution for throughput 19 | - Built-in metrics and tracing, including observability 20 | - Zero-copy GPU data transfer 21 | - Composable with preprocessing and postprocessing steps 22 | 23 | [`ray.serve.llm`](https://docs.ray.io/en/latest/serve/llm/serving-llms.html): 24 | - Automatic scaling and load balancing 25 | - Unified multi-node multi-model deployment 26 | - Multi-LoRA support with shared base models 27 | - Deep integration with inference engines, vLLM to start 28 | - Composable multi-model LLM pipelines 29 | 30 | And all of these workloads come with all the observability views you need to debug and tune them to **maximize throughput/latency**. 31 | 32 | ## Set up 33 | 34 | ### Compute 35 | This [Anyscale Workspace](https://docs.anyscale.com/platform/workspaces/) automatically provisions and autoscales the compute your workloads need. If you're not on Anyscale, then you need to provision the appropriate compute (L4) for this tutorial. 36 | 37 | 38 | 39 | ### Dependencies 40 | Start by downloading the dependencies required for this tutorial. Notice in your [`containerfile`](https://raw.githubusercontent.com/anyscale/e2e-llm-workflows/refs/heads/main/containerfile) you have a base image [`anyscale/ray-llm:latest-py311-cu124`](https://hub.docker.com/layers/anyscale/ray-llm/latest-py311-cu124/images/sha256-5a1c55f7f416d2d2eb5f4cdd13afeda25d4f7383406cfee1f1f60da495d1b50f) followed by a list of pip packages. If you're not on [Anyscale](https://console.anyscale.com/), you can pull this Docker image yourself and install the dependencies. 41 | 42 | 43 | 44 | ```bash 45 | %%bash 46 | # Install dependencies 47 | pip install -q \ 48 | "xgrammar==0.1.11" \ 49 | "pynvml==12.0.0" \ 50 | "hf_transfer==0.1.9" \ 51 | "tensorboard==2.19.0" \ 52 | "llamafactory@git+https://github.com/hiyouga/LLaMA-Factory.git@ac8c6fdd3ab7fb6372f231f238e6b8ba6a17eb16#egg=llamafactory" 53 | ``` 54 | 55 | Successfully registered `ray, vllm` and 5 other packages to be installed on all cluster nodes. 56 | View and update dependencies here: https://console.anyscale.com/cld_kvedZWag2qA8i5BjxUevf5i7/prj_cz951f43jjdybtzkx1s5sjgz99/workspaces/expwrk_mp8cxvgle2yeumgcpu1yua2r3e?workspace-tab=dependencies 57 | 58 | 59 | ## Data ingestion 60 | 61 | 62 | ```python 63 | import json 64 | import textwrap 65 | from IPython.display import Code, Image, display 66 | ``` 67 | 68 | Start by downloading the data from cloud storage to local shared storage. 
69 | 70 | 71 | ```bash 72 | %%bash 73 | rm -rf /mnt/cluster_storage/viggo # clean up 74 | mkdir /mnt/cluster_storage/viggo 75 | wget https://viggo-ds.s3.amazonaws.com/train.jsonl -O /mnt/cluster_storage/viggo/train.jsonl 76 | wget https://viggo-ds.s3.amazonaws.com/val.jsonl -O /mnt/cluster_storage/viggo/val.jsonl 77 | wget https://viggo-ds.s3.amazonaws.com/test.jsonl -O /mnt/cluster_storage/viggo/test.jsonl 78 | wget https://viggo-ds.s3.amazonaws.com/dataset_info.json -O /mnt/cluster_storage/viggo/dataset_info.json 79 | ``` 80 | 81 | download: s3://viggo-ds/train.jsonl to ../../../mnt/cluster_storage/viggo/train.jsonl 82 | download: s3://viggo-ds/val.jsonl to ../../../mnt/cluster_storage/viggo/val.jsonl 83 | download: s3://viggo-ds/test.jsonl to ../../../mnt/cluster_storage/viggo/test.jsonl 84 | download: s3://viggo-ds/dataset_info.json to ../../../mnt/cluster_storage/viggo/dataset_info.json 85 | 86 | 87 | 88 | ```bash 89 | %%bash 90 | head -n 1 /mnt/cluster_storage/viggo/train.jsonl | python3 -m json.tool 91 | ``` 92 | 93 | { 94 | "instruction": "Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']", 95 | "input": "Blizzard North is mostly an okay developer, but they released Diablo II for the Mac and so that pushes the game from okay to good in my view.", 96 | "output": "give_opinion(name[Diablo II], developer[Blizzard North], rating[good], has_mac_release[yes])" 97 | } 98 | 99 | 100 | 101 | ```python 102 | with open("/mnt/cluster_storage/viggo/train.jsonl", "r") as fp: 103 | first_line = fp.readline() 104 | item = json.loads(first_line) 105 | system_content = item["instruction"] 106 | print(textwrap.fill(system_content, width=80)) 107 | ``` 108 | 109 | Given a target sentence construct the underlying meaning representation of the 110 | input sentence as a single function with attributes and attribute values. This 111 | function should describe the target string accurately and the function must be 112 | one of the following ['inform', 'request', 'give_opinion', 'confirm', 113 | 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 114 | 'request_attribute']. The attributes must be one of the following: ['name', 115 | 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 116 | 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 117 | 'has_linux_release', 'has_mac_release', 'specifier'] 118 | 119 | 120 | You also have an info file that identifies the datasets and format (Alpaca and ShareGPT formats) to use for post training. 121 | 122 | 123 | ```python 124 | display(Code(filename="/mnt/cluster_storage/viggo/dataset_info.json", language="json")) 125 | ``` 126 | 127 | 128 |
{
 203 |     "viggo-train": {
 204 |         "file_name": "/mnt/cluster_storage/viggo/train.jsonl",
 205 |         "formatting": "alpaca",
 206 |         "columns": {
 207 |             "prompt": "instruction",
 208 |             "query": "input",
 209 |             "response": "output"
 210 |         }
 211 |     },
 212 |     "viggo-val": {
 213 |         "file_name": "/mnt/cluster_storage/viggo/val.jsonl",
 214 |         "formatting": "alpaca",
 215 |         "columns": {
 216 |             "prompt": "instruction",
 217 |             "query": "input",
 218 |             "response": "output"
 219 |         }
 220 |     }
 221 | }
 222 | 
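The `columns` mapping above is what tells LLaMA-Factory how to turn each Alpaca-style record into a prompt/response pair for supervised fine-tuning. The following is a minimal, illustrative sketch of that mapping applied to the first training record; the `to_prompt_response` helper is hypothetical (LLaMA-Factory does this internally and also applies the `qwen` chat template on top).

```python
# Illustrative only: how the "alpaca" column mapping translates a raw record
# into the prompt/response pair used for supervised fine-tuning.
import json

columns = {"prompt": "instruction", "query": "input", "response": "output"}

def to_prompt_response(example: dict) -> tuple[str, str]:
    # Join the mapped prompt and query columns; the response column is the target.
    prompt = example[columns["prompt"]]
    query = example.get(columns["query"], "")
    response = example[columns["response"]]
    return (f"{prompt}\n\n{query}" if query else prompt), response

with open("/mnt/cluster_storage/viggo/train.jsonl") as f:
    first_example = json.loads(f.readline())

prompt, response = to_prompt_response(first_example)
print(prompt[-80:])   # tail of the instruction + input
print(response)       # e.g., give_opinion(name[Diablo II], ...)
```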
223 | 224 | 225 | 226 | ## Distributed fine-tuning 227 | 228 | Use [Ray Train](https://docs.ray.io/en/latest/train/train.html) + [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) to perform multi-node training. Find the parameters for the training workload, post-training method, dataset location, train/val details, etc. in the `lora_sft_ray.yaml` config file. See the recipes for even more post-training methods, like SFT, pretraining, PPO, DPO, KTO, etc. [on GitHub](https://github.com/hiyouga/LLaMA-Factory/tree/main/examples). 229 | 230 | **Note**: Ray also supports using other tools like [axolotl](https://axolotl-ai-cloud.github.io/axolotl/docs/ray-integration.html) or even [Ray Train + HF Accelerate + FSDP/DeepSpeed](https://docs.ray.io/en/latest/train/huggingface-accelerate.html) directly for complete control of your post-training workloads. 231 | 232 | 233 | 234 | ### `config` 235 | 236 | 237 | ```python 238 | import os 239 | from pathlib import Path 240 | import yaml 241 | ``` 242 | 243 | 244 | ```python 245 | display(Code(filename="lora_sft_ray.yaml", language="yaml")) 246 | ``` 247 | 248 | 249 |
### model
 324 | model_name_or_path: Qwen/Qwen2.5-7B-Instruct
 325 | trust_remote_code: true
 326 | 
 327 | ### method
 328 | stage: sft
 329 | do_train: true
 330 | finetuning_type: lora
 331 | lora_rank: 8
 332 | lora_target: all
 333 | 
 334 | ### dataset
 335 | dataset: viggo-train
 336 | dataset_dir: /mnt/cluster_storage/viggo  # shared storage workers have access to
 337 | template: qwen
 338 | cutoff_len: 2048
 339 | max_samples: 1000
 340 | overwrite_cache: true
 341 | preprocessing_num_workers: 16
 342 | dataloader_num_workers: 4
 343 | 
 344 | ### output
 345 | output_dir: /mnt/cluster_storage/viggo/outputs  # should be somewhere workers have access to (ex. s3, nfs)
 346 | logging_steps: 10
 347 | save_steps: 500
 348 | plot_loss: true
 349 | overwrite_output_dir: true
 350 | save_only_model: false
 351 | 
 352 | ### ray
 353 | ray_run_name: lora_sft_ray
 354 | ray_storage_path: /mnt/cluster_storage/viggo/saves  # should be somewhere workers have access to (ex. s3, nfs)
 355 | ray_num_workers: 4
 356 | resources_per_worker:
 357 |   GPU: 1
 358 |   anyscale/accelerator_shape:4xL4: 0.001  # Use this to specify a specific node shape,
 359 |   # accelerator_type:L4: 1           # Or use this to simply specify a GPU type.
 360 |   # see https://docs.ray.io/en/master/ray-core/accelerator-types.html#accelerator-types for a full list of accelerator types
 361 | placement_strategy: PACK
 362 | 
 363 | ### train
 364 | per_device_train_batch_size: 1
 365 | gradient_accumulation_steps: 8
 366 | learning_rate: 1.0e-4
 367 | num_train_epochs: 5.0
 368 | lr_scheduler_type: cosine
 369 | warmup_ratio: 0.1
 370 | bf16: true
 371 | ddp_timeout: 180000000
 372 | resume_from_checkpoint: null
 373 | 
 374 | ### eval
 375 | eval_dataset: viggo-val  # uses same dataset_dir as training data
 376 | # val_size: 0.1  # only if using part of training data for validation
 377 | per_device_eval_batch_size: 1
 378 | eval_strategy: steps
 379 | eval_steps: 500
 380 | 
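If you want to try a variant of this config, for example a quick smoke test with fewer epochs, you don't have to hand-edit the file; you can derive one programmatically with the same `yaml` module imported below. A minimal sketch; the `lora_sft_ray_smoke.yaml` filename and the overridden values are arbitrary examples, not recommendations.

```python
# Minimal sketch: derive a config variant programmatically before launching
# `llamafactory-cli train`. Filename and overrides are arbitrary examples.
import yaml

with open("lora_sft_ray.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["num_train_epochs"] = 1.0                                   # quick smoke-test run
cfg["output_dir"] = "/mnt/cluster_storage/viggo/outputs_smoke"  # keep runs separate

with open("lora_sft_ray_smoke.yaml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)

print(cfg["model_name_or_path"], cfg["num_train_epochs"])
```

You would then point the training command below at the new file, for example `USE_RAY=1 llamafactory-cli train lora_sft_ray_smoke.yaml`.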
381 | 382 | 383 | 384 | 385 | ```python 386 | model_id = "ft-model" # call it whatever you want 387 | model_source = yaml.safe_load(open("lora_sft_ray.yaml"))["model_name_or_path"] # HF model ID, S3 mirror config, or GCS mirror config 388 | print (model_source) 389 | ``` 390 | 391 | Qwen/Qwen2.5-7B-Instruct 392 | 393 | 394 | ### Multi-node training 395 | 396 | Use Ray Train + LLaMA-Factory to perform the multi-node training loop. 397 | 398 |
Ray Train 399 | 400 | Using [Ray Train](https://docs.ray.io/en/latest/train/train.html) has several advantages: 401 | - it automatically handles **multi-node, multi-GPU** setup with no manual SSH setup or `hostfile` configs. 402 | - you can define **per-worker fractional resource requirements**, for example, 2 CPUs and 0.5 GPU per worker. 403 | - you can run on **heterogeneous machines** and scale flexibly, for example, CPU for preprocessing and GPU for training. 404 | - it has built-in **fault tolerance** through retry of failed workers, and continue from last checkpoint. 405 | - it supports Data Parallel, Model Parallel, Parameter Server, and even custom strategies. 406 | - [Ray Compiled graphs](https://docs.ray.io/en/latest/ray-core/compiled-graph/ray-compiled-graph.html) allow you to even define different parallelism for jointly optimizing multiple models. Megatron, DeepSpeed, and similar frameworks only allow for one global setting. 407 | 408 | [RayTurbo Train](https://docs.anyscale.com/rayturbo/rayturbo-train) offers even more improvement to the price-performance ratio, performance monitoring, and more: 409 | - **elastic training** to scale to a dynamic number of workers, and continue training on fewer resources, even on spot instances. 410 | - **purpose-built dashboard** designed to streamline the debugging of Ray Train workloads: 411 | - Monitoring: View the status of training runs and train workers. 412 | - Metrics: See insights on training throughput and training system operation time. 413 | - Profiling: Investigate bottlenecks, hangs, or errors from individual training worker processes. 414 | 415 | 416 | 417 | 418 | ```bash 419 | %%bash 420 | # Run multi-node distributed fine-tuning workload 421 | USE_RAY=1 llamafactory-cli train lora_sft_ray.yaml 422 | ``` 423 | 424 | 425 | 426 | Training started with configuration: 427 | ╭──────────────────────────────────────────────────────────────────────────────────────────────────────╮ 428 | │ Training config │ 429 | ├──────────────────────────────────────────────────────────────────────────────────────────────────────┤ 430 | │ train_loop_config/args/bf16 True │ 431 | │ train_loop_config/args/cutoff_len 2048 │ 432 | │ train_loop_config/args/dataloader_num_workers 4 │ 433 | │ train_loop_config/args/dataset viggo-train │ 434 | │ train_loop_config/args/dataset_dir ...ter_storage/viggo │ 435 | │ train_loop_config/args/ddp_timeout 180000000 │ 436 | │ train_loop_config/args/do_train True │ 437 | │ train_loop_config/args/eval_dataset viggo-val │ 438 | │ train_loop_config/args/eval_steps 500 │ 439 | │ train_loop_config/args/eval_strategy steps │ 440 | │ train_loop_config/args/finetuning_type lora │ 441 | │ train_loop_config/args/gradient_accumulation_steps 8 │ 442 | │ train_loop_config/args/learning_rate 0.0001 │ 443 | │ train_loop_config/args/logging_steps 10 │ 444 | │ train_loop_config/args/lora_rank 8 │ 445 | │ train_loop_config/args/lora_target all │ 446 | │ train_loop_config/args/lr_scheduler_type cosine │ 447 | │ train_loop_config/args/max_samples 1000 │ 448 | │ train_loop_config/args/model_name_or_path ...en2.5-7B-Instruct │ 449 | │ train_loop_config/args/num_train_epochs 5.0 │ 450 | │ train_loop_config/args/output_dir ...age/viggo/outputs │ 451 | │ train_loop_config/args/overwrite_cache True │ 452 | │ train_loop_config/args/overwrite_output_dir True │ 453 | │ train_loop_config/args/per_device_eval_batch_size 1 │ 454 | │ train_loop_config/args/per_device_train_batch_size 1 │ 455 | │ train_loop_config/args/placement_strategy PACK │ 456 | │ 
train_loop_config/args/plot_loss True │ 457 | │ train_loop_config/args/preprocessing_num_workers 16 │ 458 | │ train_loop_config/args/ray_num_workers 4 │ 459 | │ train_loop_config/args/ray_run_name lora_sft_ray │ 460 | │ train_loop_config/args/ray_storage_path ...orage/viggo/saves │ 461 | │ train_loop_config/args/resources_per_worker/GPU 1 │ 462 | │ train_loop_config/args/resources_per_worker/anyscale/accelerator_shape:4xL4 1 │ 463 | │ train_loop_config/args/resume_from_checkpoint │ 464 | │ train_loop_config/args/save_only_model False │ 465 | │ train_loop_config/args/save_steps 500 │ 466 | │ train_loop_config/args/stage sft │ 467 | │ train_loop_config/args/template qwen │ 468 | │ train_loop_config/args/trust_remote_code True │ 469 | │ train_loop_config/args/warmup_ratio 0.1 │ 470 | │ train_loop_config/callbacks ... 0x7e1262910e10>] │ 471 | ╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯ 472 | 473 | 100%|██████████| 155/155 [07:12<00:00, 2.85s/it][INFO|trainer.py:3942] 2025-04-11 14:57:59,207 >> Saving model checkpoint to /mnt/cluster_storage/viggo/outputs/checkpoint-155 474 | 475 | Training finished iteration 1 at 2025-04-11 14:58:02. Total running time: 10min 24s 476 | ╭─────────────────────────────────────────╮ 477 | │ Training result │ 478 | ├─────────────────────────────────────────┤ 479 | │ checkpoint_dir_name checkpoint_000000 │ 480 | │ time_this_iter_s 521.83827 │ 481 | │ time_total_s 521.83827 │ 482 | │ training_iteration 1 │ 483 | │ epoch 4.704 │ 484 | │ grad_norm 0.14288 │ 485 | │ learning_rate 0. │ 486 | │ loss 0.0065 │ 487 | │ step 150 │ 488 | ╰─────────────────────────────────────────╯ 489 | Training saved a checkpoint for iteration 1 at: (local)/mnt/cluster_storage/viggo/saves/lora_sft_ray/TorchTrainer_95d16_00000_0_2025-04-11_14-47-37/checkpoint_000000 490 | 491 | 492 | 493 | 494 | 495 | 496 | ```python 497 | display(Code(filename="/mnt/cluster_storage/viggo/outputs/all_results.json", language="json")) 498 | ``` 499 | 500 | 501 |
{
 576 |     "epoch": 4.864,
 577 |     "eval_viggo-val_loss": 0.13618840277194977,
 578 |     "eval_viggo-val_runtime": 20.2797,
 579 |     "eval_viggo-val_samples_per_second": 35.208,
 580 |     "eval_viggo-val_steps_per_second": 8.827,
 581 |     "total_flos": 4.843098686147789e+16,
 582 |     "train_loss": 0.2079355036479331,
 583 |     "train_runtime": 437.2951,
 584 |     "train_samples_per_second": 11.434,
 585 |     "train_steps_per_second": 0.354
 586 | }
 587 | 
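If you want to consume these metrics programmatically rather than just render them, for example to gate a CI job on validation loss, the following is a minimal sketch using only the standard library; the `0.2` threshold is an arbitrary example value, not a recommendation.

```python
# Minimal sketch: read the final metrics LLaMA-Factory wrote to output_dir.
# The 0.2 threshold is an arbitrary example value.
import json

with open("/mnt/cluster_storage/viggo/outputs/all_results.json") as f:
    metrics = json.load(f)

print(f"train_loss: {metrics['train_loss']:.4f}")
print(f"eval_loss:  {metrics['eval_viggo-val_loss']:.4f}")

assert metrics["eval_viggo-val_loss"] < 0.2, "validation loss above example threshold"
```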
588 | 589 | 590 | 591 | 592 | 593 | ### Observability 594 | 595 |
🔎 Monitoring and debugging with Ray 596 | 597 | 598 | OSS Ray offers an extensive [observability suite](https://docs.ray.io/en/latest/ray-observability/index.html) with logs and an observability dashboard that you can use to monitor and debug. The dashboard includes a lot of different components such as: 599 | 600 | - memory, utilization, etc., of the tasks running in the [cluster](https://docs.ray.io/en/latest/ray-observability/getting-started.html#dash-node-view) 601 | 602 | 603 | 604 | - views to see all running tasks, utilization across instance types, autoscaling, etc. 605 | 606 | 607 | 608 | 609 |
🔎➕➕ Monitoring and debugging on Anyscale 610 | 611 | OSS Ray comes with an extensive observability suite, and Anyscale takes it many steps further to make monitoring and debugging your workloads even easier and faster with: 612 | 613 | - [unified log viewer](https://docs.anyscale.com/monitoring/accessing-logs/) to see logs from *all* driver and worker processes 614 | - Ray workload-specific dashboards, like Data, Train, etc., that break down the tasks. For example, you can observe the preceding training workload live through the Train-specific Ray Workloads dashboard: 615 | 616 | 617 | 618 | 619 | 620 | 621 | ### Save to cloud storage 622 | 623 |
🗂️ Storage on Anyscale 624 | 625 | You can always store data inside [any storage buckets](https://docs.anyscale.com/configuration/storage/#private-storage-buckets) but Anyscale offers a [default storage bucket](https://docs.anyscale.com/configuration/storage/#anyscale-default-storage-bucket) to make things even easier. You also have plenty of other [storage options](https://docs.anyscale.com/configuration/storage/), shared at the cluster, user, and cloud levels. 626 | 627 | 628 | ```bash 629 | %%bash 630 | # Anyscale default storage bucket. 631 | echo $ANYSCALE_ARTIFACT_STORAGE 632 | ``` 633 | 634 | s3://anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage 635 | 636 | 637 | 638 | ```bash 639 | %%bash 640 | # Save fine-tuning artifacts to cloud storage. 641 | STORAGE_PATH="$ANYSCALE_ARTIFACT_STORAGE/viggo" 642 | LOCAL_OUTPUTS_PATH="/mnt/cluster_storage/viggo/outputs" 643 | LOCAL_SAVES_PATH="/mnt/cluster_storage/viggo/saves" 644 | 645 | # AWS S3 operations. 646 | if [[ "$STORAGE_PATH" == s3://* ]]; then 647 | if aws s3 ls "$STORAGE_PATH" > /dev/null 2>&1; then 648 | aws s3 rm "$STORAGE_PATH" --recursive --quiet 649 | fi 650 | aws s3 cp "$LOCAL_OUTPUTS_PATH" "$STORAGE_PATH/outputs" --recursive --quiet 651 | aws s3 cp "$LOCAL_SAVES_PATH" "$STORAGE_PATH/saves" --recursive --quiet 652 | 653 | # Google Cloud Storage operations. 654 | elif [[ "$STORAGE_PATH" == gs://* ]]; then 655 | if gsutil ls "$STORAGE_PATH" > /dev/null 2>&1; then 656 | gsutil -m -q rm -r "$STORAGE_PATH" 657 | fi 658 | gsutil -m -q cp -r "$LOCAL_OUTPUTS_PATH" "$STORAGE_PATH/outputs" 659 | gsutil -m -q cp -r "$LOCAL_SAVES_PATH" "$STORAGE_PATH/saves" 660 | 661 | else 662 | echo "Unsupported storage protocol: $STORAGE_PATH" 663 | exit 1 664 | fi 665 | ``` 666 | 667 | 668 | ```bash 669 | %%bash 670 | ls /mnt/cluster_storage/viggo/saves/lora_sft_ray 671 | ``` 672 | 673 | TorchTrainer_95d16_00000_0_2025-04-11_14-47-37 674 | TorchTrainer_f9e4e_00000_0_2025-04-11_12-41-34 675 | basic-variant-state-2025-04-11_12-41-34.json 676 | basic-variant-state-2025-04-11_14-47-37.json 677 | experiment_state-2025-04-11_12-41-34.json 678 | experiment_state-2025-04-11_14-47-37.json 679 | trainer.pkl 680 | tuner.pkl 681 | 682 | 683 | 684 | ```python 685 | # LoRA paths.
686 | save_dir = Path("/mnt/cluster_storage/viggo/saves/lora_sft_ray") 687 | trainer_dirs = [d for d in save_dir.iterdir() if d.name.startswith("TorchTrainer_") and d.is_dir()] 688 | latest_trainer = max(trainer_dirs, key=lambda d: d.stat().st_mtime, default=None) 689 | lora_path = f"{latest_trainer}/checkpoint_000000/checkpoint" 690 | cloud_lora_path = os.path.join(os.getenv("ANYSCALE_ARTIFACT_STORAGE"), lora_path.split("/mnt/cluster_storage/")[-1]) 691 | dynamic_lora_path, lora_id = cloud_lora_path.rsplit("/", 1) 692 | print (lora_path) 693 | print (cloud_lora_path) 694 | print (dynamic_lora_path) 695 | print (lora_id) 696 | ``` 697 | 698 | /mnt/cluster_storage/viggo/saves/lora_sft_ray/TorchTrainer_95d16_00000_0_2025-04-11_14-47-37/checkpoint_000000/checkpoint 699 | s3://anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/viggo/saves/lora_sft_ray/TorchTrainer_95d16_00000_0_2025-04-11_14-47-37/checkpoint_000000/checkpoint 700 | s3://anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/viggo/saves/lora_sft_ray/TorchTrainer_95d16_00000_0_2025-04-11_14-47-37/checkpoint_000000 701 | checkpoint 702 | 703 | 704 | 705 | ```bash 706 | %%bash -s "$lora_path" 707 | ls $1 708 | ``` 709 | 710 | README.md 711 | adapter_config.json 712 | adapter_model.safetensors 713 | added_tokens.json 714 | merges.txt 715 | optimizer.pt 716 | rng_state_0.pth 717 | rng_state_1.pth 718 | rng_state_2.pth 719 | rng_state_3.pth 720 | scheduler.pt 721 | special_tokens_map.json 722 | tokenizer.json 723 | tokenizer_config.json 724 | trainer_state.json 725 | training_args.bin 726 | vocab.json 727 | 728 | 729 | ## Batch inference 730 | [`Overview`](https://docs.ray.io/en/latest/data/working-with-llms.html) | [`API reference`](https://docs.ray.io/en/latest/data/api/llm.html) 731 | 732 | The `ray.data.llm` module integrates with key large language model (LLM) inference engines and deployed models to enable LLM batch inference. These LLM modules use [Ray Data](https://docs.ray.io/en/latest/data/data.html) under the hood, which makes it extremely easy to distribute workloads but also ensures that they happen: 733 | - **efficiently**: minimizing CPU/GPU idle time with heterogeneous resource scheduling. 734 | - **at scale**: with streaming execution to petabyte-scale datasets, especially when [working with LLMs](https://docs.ray.io/en/latest/data/working-with-llms.html). 735 | - **reliably** by checkpointing processes, especially when running workloads on spot instances with on-demand fallback. 736 | - **flexibly**: connecting to data from any source, applying transformations, and saving to any format and location for your next workload. 737 | 738 | 739 | 740 | [RayTurbo Data](https://docs.anyscale.com/rayturbo/rayturbo-data) has more features on top of Ray Data: 741 | - **accelerated metadata fetching** to improve reading first time from large datasets 742 | - **optimized autoscaling** where Jobs can kick off before waiting for the entire cluster to start 743 | - **high reliability** where entire failed jobs, like head node, cluster, uncaptured exceptions, etc., can resume from checkpoints. OSS Ray can only recover from worker node failures. 
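Because `ray.data.llm` runs on Ray Data, the dataset you feed it supports ordinary Ray Data operations before and after the LLM stage. A minimal sketch over the test split downloaded earlier; the `input_len` column is just an illustrative transformation, not part of the tutorial's pipeline.

```python
# Minimal sketch: plain Ray Data operations on the same JSONL that the LLM
# processor consumes below. The input_len column is illustrative only.
import ray

ds = ray.data.read_json("/mnt/cluster_storage/viggo/test.jsonl")
ds = ds.map(lambda row: {**row, "input_len": len(row["input"])})  # cheap CPU-side step
print(ds.count())            # number of test examples
print(ds.take(1)[0].keys())  # instruction / input / output / input_len
```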
744 | 745 | Start by defining the [vLLM engine processor config](https://docs.ray.io/en/latest/data/api/doc/ray.data.llm.vLLMEngineProcessorConfig.html#ray.data.llm.vLLMEngineProcessorConfig) where you can select the model to use and the [engine behavior](https://docs.vllm.ai/en/stable/serving/engine_args.html). The model can come from [Hugging Face (HF) Hub](https://huggingface.co/models) or a local model path `/path/to/your/model`. Anyscale supports GPTQ, GGUF, or LoRA model formats. 746 | 747 | 748 | 749 | ### vLLM engine processor 750 | 751 | 752 | ```python 753 | import os 754 | import ray 755 | from ray.data.llm import vLLMEngineProcessorConfig 756 | ``` 757 | 758 | INFO 04-11 14:58:40 __init__.py:194] No platform detected, vLLM is running on UnspecifiedPlatform 759 | 760 | 761 | 762 | ```python 763 | config = vLLMEngineProcessorConfig( 764 | model_source=model_source, 765 | runtime_env={ 766 | "env_vars": { 767 | "VLLM_USE_V1": "0", # v1 doesn't support lora adapters yet 768 | # "HF_TOKEN": os.environ.get("HF_TOKEN"), 769 | }, 770 | }, 771 | engine_kwargs={ 772 | "enable_lora": True, 773 | "max_lora_rank": 8, 774 | "max_loras": 1, 775 | "pipeline_parallel_size": 1, 776 | "tensor_parallel_size": 1, 777 | "enable_prefix_caching": True, 778 | "enable_chunked_prefill": True, 779 | "max_num_batched_tokens": 4096, 780 | "max_model_len": 4096, # or increase KV cache size 781 | # complete list: https://docs.vllm.ai/en/stable/serving/engine_args.html 782 | }, 783 | concurrency=1, 784 | batch_size=16, 785 | accelerator_type="L4", 786 | ) 787 | ``` 788 | 789 | ### LLM processor 790 | 791 | Next, pass the config to an [LLM processor](https://docs.ray.io/en/master/data/api/doc/ray.data.llm.build_llm_processor.html#ray.data.llm.build_llm_processor) where you can define the preprocessing and postprocessing steps around inference. With your base model defined in the processor config, you can define the LoRA adapter layers as part of the preprocessing step of the LLM processor itself. 792 | 793 | 794 | ```python 795 | from ray.data.llm import build_llm_processor 796 | ``` 797 | 798 | 799 | ```python 800 | processor = build_llm_processor( 801 | config, 802 | preprocess=lambda row: dict( 803 | model=lora_path, # REMOVE this line if doing inference with just the base model 804 | messages=[ 805 | {"role": "system", "content": system_content}, 806 | {"role": "user", "content": row["input"]} 807 | ], 808 | sampling_params={ 809 | "temperature": 0.3, 810 | "max_tokens": 250, 811 | # complete list: https://docs.vllm.ai/en/stable/api/inference_params.html 812 | }, 813 | ), 814 | postprocess=lambda row: { 815 | **row, # all contents 816 | "generated_output": row["generated_text"], 817 | # add additional outputs 818 | }, 819 | ) 820 | ``` 821 | 822 | 2025-04-11 14:58:40,942 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 10.0.51.51:6379... 823 | 2025-04-11 14:58:40,953 INFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at https://session-zt5t77xa58pyp3uy28glg2g24d.i.anyscaleuserdata.com  824 | 2025-04-11 14:58:40,960 INFO packaging.py:367 -- Pushing file package 'gcs://_ray_pkg_e71d58b4dc01d065456a9fc0325ee2682e13de88.zip' (2.16MiB) to Ray cluster... 825 | 2025-04-11 14:58:40,969 INFO packaging.py:380 -- Successfully pushed file package 'gcs://_ray_pkg_e71d58b4dc01d065456a9fc0325ee2682e13de88.zip'. 
826 | 827 | 828 | 829 | config.json: 0%| | 0.00/663 [00:00system 880 | Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']<|im_end|> 881 | <|im_start|>user 882 | Do you have a favorite ESRB content rating?<|im_end|> 883 | <|im_start|>assistant 884 | ", 885 | "prompt_token_ids": [151644, "...", 198], 886 | "request_id": 94, 887 | "time_taken_llm": 6.028705836999961, 888 | "generated_output": "request(esrb)" 889 | } 890 | 891 | 892 | 893 | 894 | 895 | ```python 896 | # Exact match (strict!) 897 | matches = 0 898 | for item in results: 899 | if item["output"] == item["generated_output"]: 900 | matches += 1 901 | matches / float(len(results)) 902 | ``` 903 | 904 | 905 | 906 | 907 | 0.6879039704524469 908 | 909 | 910 | 911 | **Note**: The objective of fine-tuning here isn't to create the most performant model but to show that you can leverage it for downstream workloads, like batch inference and online serving at scale. However, you can increase `num_train_epochs` if you want to. 912 | 913 | Observe the individual steps in the batch inference workload through the Anyscale Ray Data dashboard: 914 | 915 | 916 | 917 |
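Back on the evaluation itself: exact string match is deliberately strict. If you want to see where the errors come from, you can separate "predicted the right function" from "predicted every attribute". The following is a minimal, illustrative sketch that parses the `fn(attr[value], ...)` shape with a regex; it isn't part of the tutorial's metric.

```python
# Illustrative sketch: break the strict exact-match metric into function-level
# and attribute-level accuracy by parsing the "fn(attr[value], ...)" shape.
import re

def parse_mr(text: str):
    match = re.match(r"\s*(\w+)\((.*)\)\s*$", text.strip())
    if not match:
        return None, frozenset()
    fn, body = match.groups()
    attrs = frozenset(m.group(0) for m in re.finditer(r"\w+\[[^\]]*\]", body))
    return fn, attrs

fn_hits = attr_hits = 0
for item in results:  # `results` from the evaluation cell above
    fn_true, attrs_true = parse_mr(item["output"])
    fn_pred, attrs_pred = parse_mr(item["generated_output"])
    fn_hits += fn_true == fn_pred
    attr_hits += attrs_true == attrs_pred

print(f"function accuracy:  {fn_hits / len(results):.3f}")
print(f"attribute accuracy: {attr_hits / len(results):.3f}")
```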
918 | 919 | 💡 For more advanced guides on topics like optimized model loading, multi-LoRA, OpenAI-compatible endpoints, etc., see [more examples](https://docs.ray.io/en/latest/data/working-with-llms.html) and the [API reference](https://docs.ray.io/en/latest/data/api/llm.html). 920 | 921 |
922 | 923 | ## Online serving 924 | [`Overview`](https://docs.ray.io/en/latest/serve/llm/serving-llms.html) | [`API reference`](https://docs.ray.io/en/latest/serve/api/index.html#llm-api) 925 | 926 | 927 | 928 | `ray.serve.llm` APIs allow users to deploy multiple LLM models together with a familiar Ray Serve API, while providing compatibility with the OpenAI API. 929 | 930 | 931 | 932 | Ray Serve LLM is designed with the following features: 933 | - Automatic scaling and load balancing 934 | - Unified multi-node multi-model deployment 935 | - OpenAI compatibility 936 | - Multi-LoRA support with shared base models 937 | - Deep integration with inference engines, vLLM to start 938 | - Composable multi-model LLM pipelines 939 | 940 | [RayTurbo Serve](https://docs.anyscale.com/rayturbo/rayturbo-serve) on Anyscale has more features on top of Ray Serve: 941 | - **fast autoscaling and model loading** to get services up and running even faster: [5x improvements](https://www.anyscale.com/blog/autoscale-large-ai-models-faster) even for LLMs 942 | - 54% **higher QPS** and up-to 3x **streaming tokens per second** for high traffic serving use-cases 943 | - **replica compaction** into fewer nodes where possible to reduce resource fragmentation and improve hardware utilization 944 | - **zero-downtime** [incremental rollouts](https://docs.anyscale.com/platform/services/update-a-service/#resource-constrained-updates) so your service is never interrupted 945 | - [**different environments**](https://docs.anyscale.com/platform/services/multi-app/#multiple-applications-in-different-containers) for each service in a multi-serve application 946 | - **multi availability-zone** aware scheduling of Ray Serve replicas to provide higher redundancy to availability zone failures 947 | 948 | 949 | ### LLM serve config 950 | 951 | 952 | ```python 953 | import os 954 | from openai import OpenAI # to use openai api format 955 | from ray import serve 956 | from ray.serve.llm import LLMConfig, build_openai_app 957 | ``` 958 | 959 | Define an [LLM config](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.LLMConfig.html#ray.serve.llm.LLMConfig) where you can define where the model comes from, it's [autoscaling behavior](https://docs.ray.io/en/latest/serve/autoscaling-guide.html#serve-autoscaling), what hardware to use and [engine arguments](https://docs.vllm.ai/en/stable/serving/engine_args.html). 960 | 961 | 962 | ```python 963 | # Define config. 964 | llm_config = LLMConfig( 965 | model_loading_config={ 966 | "model_id": model_id, 967 | "model_source": model_source 968 | }, 969 | lora_config={ # REMOVE this section if you're only using a base model. 970 | "dynamic_lora_loading_path": dynamic_lora_path, 971 | "max_num_adapters_per_replica": 16, # You only have 1. 972 | }, 973 | # runtime_env={"env_vars": {"HF_TOKEN": os.environ.get("HF_TOKEN")}}, 974 | deployment_config={ 975 | "autoscaling_config": { 976 | "min_replicas": 1, 977 | "max_replicas": 2, 978 | # complete list: https://docs.ray.io/en/latest/serve/autoscaling-guide.html#serve-autoscaling 979 | } 980 | }, 981 | accelerator_type="L4", 982 | engine_kwargs={ 983 | "max_model_len": 4096, # Or increase KV cache size. 984 | "tensor_parallel_size": 1, 985 | "enable_lora": True, 986 | # complete list: https://docs.vllm.ai/en/stable/serving/engine_args.html 987 | }, 988 | ) 989 | ``` 990 | 991 | Now deploy the LLM config as an application. 
And because this application is all built on top of [Ray Serve](https://docs.ray.io/en/latest/serve/index.html), you can have advanced service logic around composing models together, deploying multiple applications, model multiplexing, observability, etc. 992 | 993 | 994 | ```python 995 | # Deploy. 996 | app = build_openai_app({"llm_configs": [llm_config]}) 997 | serve.run(app) 998 | ``` 999 | 1000 | DeploymentHandle(deployment='LLMRouter') 1001 | 1002 | 1003 | ### Service request 1004 | 1005 | 1006 | ```python 1007 | # Initialize client. 1008 | client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key") 1009 | response = client.chat.completions.create( 1010 | model=f"{model_id}:{lora_id}", 1011 | messages=[ 1012 | {"role": "system", "content": "Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']"}, 1013 | {"role": "user", "content": "Blizzard North is mostly an okay developer, but they released Diablo II for the Mac and so that pushes the game from okay to good in my view."}, 1014 | ], 1015 | stream=True 1016 | ) 1017 | for chunk in response: 1018 | if chunk.choices[0].delta.content is not None: 1019 | print(chunk.choices[0].delta.content, end="", flush=True) 1020 | ``` 1021 | 1022 | 1023 | 1024 | Avg prompt throughput: 20.3 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%. 1025 | 1026 | _opinion(name[Diablo II], developer[Blizzard North], rating[good], has_mac_release[yes]) 1027 | 1028 | 1029 | 1030 | 1031 | And of course, you can observe the running service, the deployments, and metrics like QPS, latency, etc., through the [Ray Dashboard](https://docs.ray.io/en/latest/ray-observability/getting-started.html)'s [Serve view](https://docs.ray.io/en/latest/ray-observability/getting-started.html#dash-serve-view): 1032 | 1033 | 1034 | 1035 |
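Because the app exposes OpenAI-compatible routes, you're also not tied to the `openai` client; any HTTP client can call the same `/v1/chat/completions` endpoint. A minimal non-streaming sketch, assuming the `requests` package is available in your environment and reusing `model_id`, `lora_id`, and `system_content` from above:

```python
# Minimal sketch: call the OpenAI-compatible route directly over HTTP.
# Assumes `requests` is installed; reuses model_id, lora_id, system_content.
import requests

payload = {
    "model": f"{model_id}:{lora_id}",
    "messages": [
        {"role": "system", "content": system_content},
        {"role": "user", "content": "Do you have a favorite ESRB content rating?"},
    ],
    "temperature": 0.3,
}
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer fake-key"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```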
1036 | 1037 | 💡 See [more examples](https://docs.ray.io/en/latest/serve/llm/overview.html) and the [API reference](https://docs.ray.io/en/latest/serve/llm/api.html) for advanced guides on topics like structured outputs (like JSON), vision LMs, multi-LoRA on shared base models, using other inference engines (like `sglang`), fast model loading, etc. 1038 | 1039 |
1040 | 1041 | ```python 1042 | # Shutdown the service 1043 | serve.shutdown() 1044 | ``` 1045 | 1046 | ## Production 1047 | 1048 | Seamlessly integrate with your existing CI/CD pipelines by leveraging the Anyscale [CLI](https://docs.anyscale.com/reference/quickstart-cli) or [SDK](https://docs.anyscale.com/reference/quickstart-sdk) to run [reliable batch jobs](https://docs.anyscale.com/platform/jobs) and deploy [highly available services](https://docs.anyscale.com/platform/services). Given you've been developing in an environment that's almost identical to production with a multi-node cluster, this integration should drastically speed up your dev to prod velocity. 1049 | 1050 | 1051 | 1052 | ### Jobs 1053 | 1054 | [Anyscale Jobs](https://docs.anyscale.com/platform/jobs/) ([API ref](https://docs.anyscale.com/reference/job-api/)) allows you to execute discrete workloads in production such as batch inference, embeddings generation, or model fine-tuning. 1055 | - [define and manage](https://docs.anyscale.com/platform/jobs/manage-jobs) your Jobs in many different ways, like CLI and Python SDK 1056 | - set up [queues](https://docs.anyscale.com/platform/jobs/job-queues) and [schedules](https://docs.anyscale.com/platform/jobs/schedules) 1057 | - set up all the [observability, alerting, etc.](https://docs.anyscale.com/platform/jobs/monitoring-and-debugging) around your Jobs 1058 | 1059 | 1060 | 1061 | ### Services 1062 | 1063 | [Anyscale Services](https://docs.anyscale.com/platform/services/) ([API ref](https://docs.anyscale.com/reference/service-api/)) offers an extremely fault tolerant, scalable, and optimized way to serve your Ray Serve applications: 1064 | - you can [rollout and update](https://docs.anyscale.com/platform/services/update-a-service) services with canary deployment with zero-downtime upgrades 1065 | - [monitor](https://docs.anyscale.com/platform/services/monitoring) your Services through a dedicated Service page, unified log viewer, tracing, set up alerts, etc. 
1066 | - scale a service (`num_replicas=auto`) and utilize replica compaction to consolidate nodes that are fractionally utilized 1067 | - [head node fault tolerance](https://docs.anyscale.com/platform/services/production-best-practices#head-node-ft) because OSS Ray recovers from failed workers and replicas but not head node crashes 1068 | - serving [multiple applications](https://docs.anyscale.com/platform/services/multi-app) in a single Service 1069 | 1070 | 1071 | 1072 | 1073 | 1074 | ```bash 1075 | %%bash 1076 | # clean up 1077 | rm -rf /mnt/cluster_storage/viggo 1078 | STORAGE_PATH="$ANYSCALE_ARTIFACT_STORAGE/viggo" 1079 | if [[ "$STORAGE_PATH" == s3://* ]]; then 1080 | aws s3 rm "$STORAGE_PATH" --recursive --quiet 1081 | elif [[ "$STORAGE_PATH" == gs://* ]]; then 1082 | gsutil -m -q rm -r "$STORAGE_PATH" 1083 | fi 1084 | ``` 1085 | -------------------------------------------------------------------------------- /ci/aws.yaml: -------------------------------------------------------------------------------- 1 | cloud_id: {{env["ANYSCALE_CLOUD_ID"]}} 2 | region: us-west-2 3 | 4 | # Head node 5 | head_node_type: 6 | name: head 7 | instance_type: m5.2xlarge 8 | resources: 9 | cpu: 8 10 | 11 | # Worker nodes 12 | auto_select_worker_config: true 13 | flags: 14 | allow-cross-zone-autoscaling: true 15 | -------------------------------------------------------------------------------- /ci/build.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -exo pipefail 4 | 5 | # Will use lockfile instead later 6 | # pip3 install --no-cache-dir -r https://raw.githubusercontent.com/anyscale/e2e-llm-workflows/refs/heads/main/lockfile.txt 7 | 8 | # Install Python dependencies 9 | pip3 install --no-cache-dir \ 10 | "xgrammar==0.1.11" \ 11 | "pynvml==12.0.0" \ 12 | "hf_transfer==0.1.9" \ 13 | "tensorboard==2.19.0" \ 14 | "git+https://github.com/hiyouga/LLaMA-Factory.git@ac8c6fdd3ab7fb6372f231f238e6b8ba6a17eb16#egg=llamafactory" 15 | 16 | 17 | # Env vars 18 | export HF_HUB_ENABLE_HF_TRANSFER=1 19 | -------------------------------------------------------------------------------- /ci/gce.yaml: -------------------------------------------------------------------------------- 1 | cloud_id: {{env["ANYSCALE_CLOUD_ID"]}} 2 | region: us-central1 3 | 4 | # Head node 5 | head_node_type: 6 | name: head 7 | instance_type: n2-standard-8 8 | resources: 9 | cpu: 8 10 | 11 | # Worker nodes 12 | auto_select_worker_config: true 13 | flags: 14 | allow-cross-zone-autoscaling: true 15 | -------------------------------------------------------------------------------- /ci/nb2py.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import argparse 3 | import nbformat 4 | 5 | 6 | def convert_notebook(input_path: str, output_path: str) -> None: 7 | """ 8 | Read a Jupyter notebook and write a Python script, converting all %%bash 9 | cells and IPython "!" commands into subprocess.run calls that raise on error. 
10 | """ 11 | nb = nbformat.read(input_path, as_version=4) 12 | with open(output_path, "w") as out: 13 | for cell in nb.cells: 14 | if cell.cell_type != "code": 15 | continue 16 | 17 | lines = cell.source.splitlines() 18 | # Detect a %%bash cell 19 | if lines and lines[0].strip().startswith("%%bash"): 20 | bash_script = "\n".join(lines[1:]).rstrip() 21 | out.write("import subprocess\n") 22 | out.write( 23 | f"subprocess.run(r'''{bash_script}''',\n" 24 | " shell=True,\n" 25 | " check=True,\n" 26 | " executable='/bin/bash')\n\n" 27 | ) 28 | else: 29 | # Detect any IPython '!' shell commands in code lines 30 | has_bang = any(line.lstrip().startswith("!") for line in lines) 31 | if has_bang: 32 | out.write("import subprocess\n") 33 | for line in lines: 34 | stripped = line.lstrip() 35 | if stripped.startswith("!"): 36 | cmd = stripped[1:].lstrip() 37 | out.write( 38 | f"subprocess.run(r'''{cmd}''',\n" 39 | " shell=True,\n" 40 | " check=True,\n" 41 | " executable='/bin/bash')\n" 42 | ) 43 | else: 44 | out.write(line.rstrip() + "\n") 45 | out.write("\n") 46 | else: 47 | # Regular Python cell: dump as-is 48 | out.write(cell.source.rstrip() + "\n\n") 49 | 50 | 51 | def main() -> None: 52 | parser = argparse.ArgumentParser( 53 | description="Convert a Jupyter notebook to a Python script, preserving bash cells and '!' commands as subprocess calls." 54 | ) 55 | parser.add_argument("input_nb", help="Path to the input .ipynb file") 56 | parser.add_argument("output_py", help="Path for the output .py script") 57 | args = parser.parse_args() 58 | convert_notebook(args.input_nb, args.output_py) 59 | 60 | 61 | if __name__ == "__main__": 62 | main() 63 | -------------------------------------------------------------------------------- /ci/tests.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # don't use nbcovert or jupytext unless you're willing 4 | # to check each subprocess unit and validate that errors 5 | # aren't being consumed/hidden 6 | python ci/nb2py.py README.ipynb README.py # convert notebook to script 7 | python README.py # run generated script 8 | rm README.py # remove the generated script 9 | -------------------------------------------------------------------------------- /clear_cell_nums.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | 3 | import nbformat 4 | 5 | 6 | def clear_execution_numbers(nb_path): 7 | with open(nb_path, "r", encoding="utf-8") as f: 8 | nb = nbformat.read(f, as_version=4) 9 | for cell in nb["cells"]: 10 | if cell["cell_type"] == "code": 11 | cell["execution_count"] = None 12 | for output in cell["outputs"]: 13 | if "execution_count" in output: 14 | output["execution_count"] = None 15 | with open(nb_path, "w", encoding="utf-8") as f: 16 | nbformat.write(nb, f) 17 | 18 | 19 | if __name__ == "__main__": 20 | NOTEBOOK_DIR = Path(__file__).parent 21 | notebook_fps = list(NOTEBOOK_DIR.glob("**/*.ipynb")) 22 | for fp in notebook_fps: 23 | clear_execution_numbers(fp) 24 | -------------------------------------------------------------------------------- /configs/aws.yaml: -------------------------------------------------------------------------------- 1 | # Head node 2 | head_node_type: 3 | name: head 4 | instance_type: m5.2xlarge 5 | resources: 6 | cpu: 8 7 | 8 | # Worker nodes 9 | auto_select_worker_config: true 10 | flags: 11 | allow-cross-zone-autoscaling: true 12 | -------------------------------------------------------------------------------- 
/configs/gce.yaml: -------------------------------------------------------------------------------- 1 | # Head node 2 | head_node_type: 3 | name: head 4 | instance_type: n2-standard-8 5 | resources: 6 | cpu: 8 7 | 8 | # Worker nodes 9 | auto_select_worker_config: true 10 | flags: 11 | allow-cross-zone-autoscaling: true 12 | -------------------------------------------------------------------------------- /containerfile: -------------------------------------------------------------------------------- 1 | FROM anyscale/ray-llm:2.44.1-py311-cu124 2 | 3 | RUN python3 -m pip install --no-cache-dir \ 4 | "xgrammar==0.1.11" \ 5 | "pynvml==12.0.0" \ 6 | "hf_transfer==0.1.9" \ 7 | "tensorboard==2.19.0" \ 8 | "git+https://github.com/hiyouga/LLaMA-Factory.git@ac8c6fdd3ab7fb6372f231f238e6b8ba6a17eb16#egg=llamafactory" 9 | 10 | # Fast upload/download 11 | ENV HF_HUB_ENABLE_HF_TRANSFER=1 12 | -------------------------------------------------------------------------------- /images/data_dashboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/anyscale/e2e-llm-workflows/a111b7547ea6b7af467183b5236efd8a841d50f7/images/data_dashboard.png -------------------------------------------------------------------------------- /images/data_llm.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/anyscale/e2e-llm-workflows/a111b7547ea6b7af467183b5236efd8a841d50f7/images/data_llm.png -------------------------------------------------------------------------------- /images/e2e_llm.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/anyscale/e2e-llm-workflows/a111b7547ea6b7af467183b5236efd8a841d50f7/images/e2e_llm.png -------------------------------------------------------------------------------- /images/loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/anyscale/e2e-llm-workflows/a111b7547ea6b7af467183b5236efd8a841d50f7/images/loss.png -------------------------------------------------------------------------------- /images/serve_dashboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/anyscale/e2e-llm-workflows/a111b7547ea6b7af467183b5236efd8a841d50f7/images/serve_dashboard.png -------------------------------------------------------------------------------- /images/serve_llm.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/anyscale/e2e-llm-workflows/a111b7547ea6b7af467183b5236efd8a841d50f7/images/serve_llm.png -------------------------------------------------------------------------------- /images/train_dashboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/anyscale/e2e-llm-workflows/a111b7547ea6b7af467183b5236efd8a841d50f7/images/train_dashboard.png -------------------------------------------------------------------------------- /lora_sft_ray.yaml: -------------------------------------------------------------------------------- 1 | ### model 2 | model_name_or_path: Qwen/Qwen2.5-7B-Instruct 3 | trust_remote_code: true 4 | 5 | ### method 6 | stage: sft 7 | do_train: true 8 | finetuning_type: lora 9 | lora_rank: 8 10 | lora_target: all 11 | 12 | ### dataset 13 | dataset: viggo-train 14 | dataset_dir: /mnt/cluster_storage/viggo # 
shared storage workers have access to 15 | template: qwen 16 | cutoff_len: 2048 17 | max_samples: 1000 18 | overwrite_cache: true 19 | preprocessing_num_workers: 16 20 | dataloader_num_workers: 4 21 | 22 | ### output 23 | output_dir: /mnt/cluster_storage/viggo/outputs # should be somewhere workers have access to (ex. s3, nfs) 24 | logging_steps: 10 25 | save_steps: 500 26 | plot_loss: true 27 | overwrite_output_dir: true 28 | save_only_model: false 29 | 30 | ### ray 31 | ray_run_name: lora_sft_ray 32 | ray_storage_path: /mnt/cluster_storage/viggo/saves # should be somewhere workers have access to (ex. s3, nfs) 33 | ray_num_workers: 4 34 | resources_per_worker: 35 | GPU: 1 36 | anyscale/accelerator_shape:4xL4: 0.001 # Use this to specify a specific node shape, 37 | # accelerator_type:L4: 1 # Or use this to simply specify a GPU type. 38 | # see https://docs.ray.io/en/master/ray-core/accelerator-types.html#accelerator-types for a full list of accelerator types 39 | placement_strategy: PACK 40 | 41 | ### train 42 | per_device_train_batch_size: 1 43 | gradient_accumulation_steps: 8 44 | learning_rate: 1.0e-4 45 | num_train_epochs: 3.0 46 | lr_scheduler_type: cosine 47 | warmup_ratio: 0.1 48 | bf16: true 49 | ddp_timeout: 180000000 50 | resume_from_checkpoint: null 51 | 52 | ### eval 53 | eval_dataset: viggo-val # uses same dataset_dir as training data 54 | # val_size: 0.1 # only if using part of training data for validation 55 | per_device_eval_batch_size: 1 56 | eval_strategy: steps 57 | eval_steps: 500 58 | --------------------------------------------------------------------------------