\n",
996 | "\n",
997 | "OSS Ray comes with an extensive observability suite, and Anyscale takes it many steps further to make monitoring and debugging your workloads even easier and faster with:\n",
998 | "\n",
999 | "- [unified log viewer](https://docs.anyscale.com/monitoring/accessing-logs/) to see logs from *all* driver and worker processes\n",
1000 | "- Ray workload specific dashboard, like Data, Train, etc., that can breakdown the tasks. For example, you can observe the preceding training workload live through the Train specific Ray Workloads dashboard:\n",
1001 | "\n",
1002 | "
\n",
1003 | "\n",
1004 | "\n"
1005 | ]
1006 | },
1007 | {
1008 | "cell_type": "markdown",
1009 | "metadata": {},
1010 | "source": [
1011 | "### Save to cloud storage"
1012 | ]
1013 | },
1014 | {
1015 | "cell_type": "markdown",
1016 | "metadata": {},
1017 | "source": [
1018 | "
🗂️ Storage on Anyscale \n",
1019 | "\n",
1020 | "You can always store to data inside [any storage buckets](https://docs.anyscale.com/configuration/storage/#private-storage-buckets) but Anyscale offers a [default storage bucket](https://docs.anyscale.com/configuration/storage/#anyscale-default-storage-bucket) to make things even easier. You also have plenty of other [storage options](https://docs.anyscale.com/configuration/storage/) as well, shared at the cluster, user, and cloud levels."
1021 | ]
1022 | },
1023 | {
1024 | "cell_type": "code",
1025 | "execution_count": null,
1026 | "metadata": {},
1027 | "outputs": [
1028 | {
1029 | "name": "stdout",
1030 | "output_type": "stream",
1031 | "text": [
1032 | "s3://anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage\n"
1033 | ]
1034 | }
1035 | ],
1036 | "source": [
1037 | "%%bash\n",
1038 | "# Anyscale default storage bucket.\n",
1039 | "echo $ANYSCALE_ARTIFACT_STORAGE"
1040 | ]
1041 | },
1042 | {
1043 | "cell_type": "code",
1044 | "execution_count": null,
1045 | "metadata": {},
1046 | "outputs": [],
1047 | "source": [
1048 | "%%bash\n",
1049 | "# Save fine-tuning artifacts to cloud storage.\n",
1050 | "STORAGE_PATH=\"$ANYSCALE_ARTIFACT_STORAGE/viggo\"\n",
1051 | "LOCAL_OUTPUTS_PATH=\"/mnt/cluster_storage/viggo/outputs\"\n",
1052 | "LOCAL_SAVES_PATH=\"/mnt/cluster_storage/viggo/saves\"\n",
1053 | "\n",
1054 | "# AWS S3 operations.\n",
1055 | "if [[ \"$STORAGE_PATH\" == s3://* ]]; then\n",
1056 | " if aws s3 ls \"$STORAGE_PATH\" > /dev/null 2>&1; then\n",
1057 | " aws s3 rm \"$STORAGE_PATH\" --recursive --quiet\n",
1058 | " fi\n",
1059 | " aws s3 cp \"$LOCAL_OUTPUTS_PATH\" \"$STORAGE_PATH/outputs\" --recursive --quiet\n",
1060 | " aws s3 cp \"$LOCAL_SAVES_PATH\" \"$STORAGE_PATH/saves\" --recursive --quiet\n",
1061 | "\n",
1062 | "# Google Cloud Storage operations.\n",
1063 | "elif [[ \"$STORAGE_PATH\" == gs://* ]]; then\n",
1064 | " if gsutil ls \"$STORAGE_PATH\" > /dev/null 2>&1; then\n",
1065 | " gsutil -m -q rm -r \"$STORAGE_PATH\"\n",
1066 | " fi\n",
1067 | " gsutil -m -q cp -r \"$LOCAL_OUTPUTS_PATH\" \"$STORAGE_PATH/outputs\"\n",
1068 | " gsutil -m -q cp -r \"$LOCAL_SAVES_PATH\" \"$STORAGE_PATH/saves\"\n",
1069 | "\n",
1070 | "else\n",
1071 | " echo \"Unsupported storage protocol: $STORAGE_PATH\"\n",
1072 | " exit 1\n",
1073 | "fi"
1074 | ]
1075 | },
1076 | {
1077 | "cell_type": "code",
1078 | "execution_count": null,
1079 | "metadata": {},
1080 | "outputs": [
1081 | {
1082 | "name": "stdout",
1083 | "output_type": "stream",
1084 | "text": [
1085 | "TorchTrainer_95d16_00000_0_2025-04-11_14-47-37\n",
1086 | "TorchTrainer_f9e4e_00000_0_2025-04-11_12-41-34\n",
1087 | "basic-variant-state-2025-04-11_12-41-34.json\n",
1088 | "basic-variant-state-2025-04-11_14-47-37.json\n",
1089 | "experiment_state-2025-04-11_12-41-34.json\n",
1090 | "experiment_state-2025-04-11_14-47-37.json\n",
1091 | "trainer.pkl\n",
1092 | "tuner.pkl\n"
1093 | ]
1094 | }
1095 | ],
1096 | "source": [
1097 | "%%bash\n",
1098 | "ls /mnt/cluster_storage/viggo/saves/lora_sft_ray"
1099 | ]
1100 | },
1101 | {
1102 | "cell_type": "code",
1103 | "execution_count": null,
1104 | "metadata": {},
1105 | "outputs": [
1106 | {
1107 | "name": "stdout",
1108 | "output_type": "stream",
1109 | "text": [
1110 | "/mnt/cluster_storage/viggo/saves/lora_sft_ray/TorchTrainer_95d16_00000_0_2025-04-11_14-47-37/checkpoint_000000/checkpoint\n",
1111 | "s3://anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/viggo/saves/lora_sft_ray/TorchTrainer_95d16_00000_0_2025-04-11_14-47-37/checkpoint_000000/checkpoint\n",
1112 | "s3://anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/viggo/saves/lora_sft_ray/TorchTrainer_95d16_00000_0_2025-04-11_14-47-37/checkpoint_000000\n",
1113 | "checkpoint\n"
1114 | ]
1115 | }
1116 | ],
1117 | "source": [
1118 | "# LoRA paths.\n",
1119 | "save_dir = Path(\"/mnt/cluster_storage/viggo/saves/lora_sft_ray\")\n",
1120 | "trainer_dirs = [d for d in save_dir.iterdir() if d.name.startswith(\"TorchTrainer_\") and d.is_dir()]\n",
1121 | "latest_trainer = max(trainer_dirs, key=lambda d: d.stat().st_mtime, default=None)\n",
1122 | "lora_path = f\"{latest_trainer}/checkpoint_000000/checkpoint\"\n",
1123 | "cloud_lora_path = os.path.join(os.getenv(\"ANYSCALE_ARTIFACT_STORAGE\"), lora_path.split(\"/mnt/cluster_storage/\")[-1])\n",
1124 | "dynamic_lora_path, lora_id = cloud_lora_path.rsplit(\"/\", 1)\n",
1125 | "print (lora_path)\n",
1126 | "print (cloud_lora_path)\n",
1127 | "print (dynamic_lora_path)\n",
1128 | "print (lora_id)"
1129 | ]
1130 | },
1131 | {
1132 | "cell_type": "code",
1133 | "execution_count": null,
1134 | "metadata": {},
1135 | "outputs": [
1136 | {
1137 | "name": "stdout",
1138 | "output_type": "stream",
1139 | "text": [
1140 | "README.md\n",
1141 | "adapter_config.json\n",
1142 | "adapter_model.safetensors\n",
1143 | "added_tokens.json\n",
1144 | "merges.txt\n",
1145 | "optimizer.pt\n",
1146 | "rng_state_0.pth\n",
1147 | "rng_state_1.pth\n",
1148 | "rng_state_2.pth\n",
1149 | "rng_state_3.pth\n",
1150 | "scheduler.pt\n",
1151 | "special_tokens_map.json\n",
1152 | "tokenizer.json\n",
1153 | "tokenizer_config.json\n",
1154 | "trainer_state.json\n",
1155 | "training_args.bin\n",
1156 | "vocab.json\n"
1157 | ]
1158 | }
1159 | ],
1160 | "source": [
1161 | "%%bash -s \"$lora_path\"\n",
1162 | "ls $1"
1163 | ]
1164 | },
1165 | {
1166 | "cell_type": "markdown",
1167 | "metadata": {},
1168 | "source": [
1169 | "## Batch inference \n",
1170 | "[`Overview`](https://docs.ray.io/en/latest/data/working-with-llms.html) | [`API reference`](https://docs.ray.io/en/latest/data/api/llm.html)"
1171 | ]
1172 | },
1173 | {
1174 | "cell_type": "markdown",
1175 | "metadata": {},
1176 | "source": [
1177 | "The `ray.data.llm` module integrates with key large language model (LLM) inference engines and deployed models to enable LLM batch inference. These LLM modules use [Ray Data](https://docs.ray.io/en/latest/data/data.html) under the hood, which makes it extremely easy to distribute workloads but also ensures that they happen:\n",
1178 | "- **efficiently**: minimizing CPU/GPU idle time with heterogeneous resource scheduling.\n",
1179 | "- **at scale**: with streaming execution to petabyte-scale datasets, especially when [working with LLMs](https://docs.ray.io/en/latest/data/working-with-llms.html).\n",
1180 | "- **reliably** by checkpointing processes, especially when running workloads on spot instances with on-demand fallback.\n",
1181 | "- **flexibly**: connecting to data from any source, applying transformations, and saving to any format and location for your next workload.\n",
1182 | "\n",
1183 | "

\n",
1184 | "\n",
1185 | "[RayTurbo Data](https://docs.anyscale.com/rayturbo/rayturbo-data) has more features on top of Ray Data:\n",
1186 | "- **accelerated metadata fetching** to improve reading first time from large datasets \n",
1187 | "- **optimized autoscaling** where Jobs can kick off before waiting for the entire cluster to start\n",
1188 | "- **high reliability** where entire failed jobs, like head node, cluster, uncaptured exceptions, etc., can resume from checkpoints. OSS Ray can only recover from worker node failures."
1189 | ]
1190 | },
1191 | {
1192 | "cell_type": "markdown",
1193 | "metadata": {},
1194 | "source": [
1195 | "Start by defining the [vLLM engine processor config](https://docs.ray.io/en/latest/data/api/doc/ray.data.llm.vLLMEngineProcessorConfig.html#ray.data.llm.vLLMEngineProcessorConfig) where you can select the model to use and the [engine behavior](https://docs.vllm.ai/en/stable/serving/engine_args.html). The model can come from [Hugging Face (HF) Hub](https://huggingface.co/models) or a local model path `/path/to/your/model`. Anyscale supports GPTQ, GGUF, or LoRA model formats.\n",
1196 | "\n",
1197 | "

"
1198 | ]
1199 | },
1200 | {
1201 | "cell_type": "markdown",
1202 | "metadata": {},
1203 | "source": [
1204 | "### vLLM engine processor"
1205 | ]
1206 | },
1207 | {
1208 | "cell_type": "code",
1209 | "execution_count": null,
1210 | "metadata": {},
1211 | "outputs": [
1212 | {
1213 | "name": "stdout",
1214 | "output_type": "stream",
1215 | "text": [
1216 | "INFO 04-11 14:58:40 __init__.py:194] No platform detected, vLLM is running on UnspecifiedPlatform\n"
1217 | ]
1218 | }
1219 | ],
1220 | "source": [
1221 | "import os\n",
1222 | "import ray\n",
1223 | "from ray.data.llm import vLLMEngineProcessorConfig"
1224 | ]
1225 | },
1226 | {
1227 | "cell_type": "code",
1228 | "execution_count": null,
1229 | "metadata": {},
1230 | "outputs": [],
1231 | "source": [
1232 | "config = vLLMEngineProcessorConfig(\n",
1233 | " model_source=model_source,\n",
1234 | " runtime_env={\n",
1235 | " \"env_vars\": {\n",
1236 | " \"VLLM_USE_V1\": \"0\", # v1 doesn't support lora adapters yet\n",
1237 | " # \"HF_TOKEN\": os.environ.get(\"HF_TOKEN\"),\n",
1238 | " },\n",
1239 | " },\n",
1240 | " engine_kwargs={\n",
1241 | " \"enable_lora\": True,\n",
1242 | " \"max_lora_rank\": 8,\n",
1243 | " \"max_loras\": 1,\n",
1244 | " \"pipeline_parallel_size\": 1,\n",
1245 | " \"tensor_parallel_size\": 1,\n",
1246 | " \"enable_prefix_caching\": True,\n",
1247 | " \"enable_chunked_prefill\": True,\n",
1248 | " \"max_num_batched_tokens\": 4096,\n",
1249 | " \"max_model_len\": 4096, # or increase KV cache size\n",
1250 | " # complete list: https://docs.vllm.ai/en/stable/serving/engine_args.html\n",
1251 | " },\n",
1252 | " concurrency=1,\n",
1253 | " batch_size=16,\n",
1254 | " accelerator_type=\"L4\",\n",
1255 | ")"
1256 | ]
1257 | },
1258 | {
1259 | "cell_type": "markdown",
1260 | "metadata": {},
1261 | "source": [
1262 | "### LLM processor"
1263 | ]
1264 | },
1265 | {
1266 | "cell_type": "markdown",
1267 | "metadata": {},
1268 | "source": [
1269 | "Next, pass the config to an [LLM processor](https://docs.ray.io/en/master/data/api/doc/ray.data.llm.build_llm_processor.html#ray.data.llm.build_llm_processor) where you can define the preprocessing and postprocessing steps around inference. With your base model defined in the processor config, you can define the LoRA adapter layers as part of the preprocessing step of the LLM processor itself."
1270 | ]
1271 | },
1272 | {
1273 | "cell_type": "code",
1274 | "execution_count": null,
1275 | "metadata": {},
1276 | "outputs": [],
1277 | "source": [
1278 | "from ray.data.llm import build_llm_processor"
1279 | ]
1280 | },
1281 | {
1282 | "cell_type": "code",
1283 | "execution_count": null,
1284 | "metadata": {},
1285 | "outputs": [
1286 | {
1287 | "name": "stderr",
1288 | "output_type": "stream",
1289 | "text": [
1290 | "2025-04-11 14:58:40,942\tINFO worker.py:1660 -- Connecting to existing Ray cluster at address: 10.0.51.51:6379...\n",
1291 | "2025-04-11 14:58:40,953\tINFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at \u001b[1m\u001b[32mhttps://session-zt5t77xa58pyp3uy28glg2g24d.i.anyscaleuserdata.com \u001b[39m\u001b[22m\n",
1292 | "2025-04-11 14:58:40,960\tINFO packaging.py:367 -- Pushing file package 'gcs://_ray_pkg_e71d58b4dc01d065456a9fc0325ee2682e13de88.zip' (2.16MiB) to Ray cluster...\n",
1293 | "2025-04-11 14:58:40,969\tINFO packaging.py:380 -- Successfully pushed file package 'gcs://_ray_pkg_e71d58b4dc01d065456a9fc0325ee2682e13de88.zip'.\n"
1294 | ]
1295 | },
1296 | {
1297 | "data": {
1298 | "application/vnd.jupyter.widget-view+json": {
1299 | "model_id": "a9171027a5a249ff801e77f763506f67",
1300 | "version_major": 2,
1301 | "version_minor": 0
1302 | },
1303 | "text/plain": [
1304 | "config.json: 0%| | 0.00/663 [00:00, ?B/s]"
1305 | ]
1306 | },
1307 | "metadata": {},
1308 | "output_type": "display_data"
1309 | },
1310 | {
1311 | "name": "stdout",
1312 | "output_type": "stream",
1313 | "text": [
1314 | "\u001b[36m(pid=51260)\u001b[0m INFO 04-11 14:58:47 __init__.py:194] No platform detected, vLLM is running on UnspecifiedPlatform\n"
1315 | ]
1316 | }
1317 | ],
1318 | "source": [
1319 | "processor = build_llm_processor(\n",
1320 | " config,\n",
1321 | " preprocess=lambda row: dict(\n",
1322 | " model=lora_path, # REMOVE this line if doing inference with just the base model\n",
1323 | " messages=[\n",
1324 | " {\"role\": \"system\", \"content\": system_content},\n",
1325 | " {\"role\": \"user\", \"content\": row[\"input\"]}\n",
1326 | " ],\n",
1327 | " sampling_params={\n",
1328 | " \"temperature\": 0.3,\n",
1329 | " \"max_tokens\": 250,\n",
1330 | " # complete list: https://docs.vllm.ai/en/stable/api/inference_params.html\n",
1331 | " },\n",
1332 | " ),\n",
1333 | " postprocess=lambda row: {\n",
1334 | " **row, # all contents\n",
1335 | " \"generated_output\": row[\"generated_text\"],\n",
1336 | " # add additional outputs\n",
1337 | " },\n",
1338 | ")"
1339 | ]
1340 | },
1341 | {
1342 | "cell_type": "code",
1343 | "execution_count": null,
1344 | "metadata": {},
1345 | "outputs": [
1346 | {
1347 | "name": "stdout",
1348 | "output_type": "stream",
1349 | "text": [
1350 | "\n",
1351 | "\n",
1352 | "{\n",
1353 | " \"batch_uuid\": \"d7a6b5341cbf4986bb7506ff277cc9cf\",\n",
1354 | " \"embeddings\": null,\n",
1355 | " \"generated_text\": \"request(esrb)\",\n",
1356 | " \"generated_tokens\": [2035, 50236, 10681, 8, 151645],\n",
1357 | " \"input\": \"Do you have a favorite ESRB content rating?\",\n",
1358 | " \"instruction\": \"Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']\",\n",
1359 | " \"messages\": [\n",
1360 | " {\n",
1361 | " \"content\": \"Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']\",\n",
1362 | " \"role\": \"system\"\n",
1363 | " },\n",
1364 | " {\n",
1365 | " \"content\": \"Do you have a favorite ESRB content rating?\",\n",
1366 | " \"role\": \"user\"\n",
1367 | " }\n",
1368 | " ],\n",
1369 | " \"metrics\": {\n",
1370 | " \"arrival_time\": 1744408857.148983,\n",
1371 | " \"finished_time\": 1744408863.09091,\n",
1372 | " \"first_scheduled_time\": 1744408859.130259,\n",
1373 | " \"first_token_time\": 1744408862.7087252,\n",
1374 | " \"last_token_time\": 1744408863.089174,\n",
1375 | " \"model_execute_time\": null,\n",
1376 | " \"model_forward_time\": null,\n",
1377 | " \"scheduler_time\": 0.04162892400017881,\n",
1378 | " \"time_in_queue\": 1.981276035308838\n",
1379 | " },\n",
1380 | " \"model\": \"/mnt/cluster_storage/viggo/saves/lora_sft_ray/TorchTrainer_95d16_00000_0_2025-04-11_14-47-37/checkpoint_000000/checkpoint\",\n",
1381 | " \"num_generated_tokens\": 5,\n",
1382 | " \"num_input_tokens\": 164,\n",
1383 | " \"output\": \"request_attribute(esrb[])\",\n",
1384 | " \"params\": \"SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.3, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=250, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None)\",\n",
1385 | " \"prompt\": \"<|im_start|>system\n",
1386 | "Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']<|im_end|>\n",
1387 | "<|im_start|>user\n",
1388 | "Do you have a favorite ESRB content rating?<|im_end|>\n",
1389 | "<|im_start|>assistant\n",
1390 | "\",\n",
1391 | " \"prompt_token_ids\": [151644, \"...\", 198],\n",
1392 | " \"request_id\": 94,\n",
1393 | " \"time_taken_llm\": 6.028705836999961,\n",
1394 | " \"generated_output\": \"request(esrb)\"\n",
1395 | "}\n",
1396 | "\n",
1397 | "\n"
1398 | ]
1399 | }
1400 | ],
1401 | "source": [
1402 | "# Evaluation on test dataset\n",
1403 | "ds = ray.data.read_json(\"/mnt/cluster_storage/viggo/test.jsonl\") # complete list: https://docs.ray.io/en/latest/data/api/input_output.html\n",
1404 | "ds = processor(ds)\n",
1405 | "results = ds.take_all()\n",
1406 | "results[0]"
1407 | ]
1408 | },
1409 | {
1410 | "cell_type": "code",
1411 | "execution_count": null,
1412 | "metadata": {},
1413 | "outputs": [
1414 | {
1415 | "data": {
1416 | "text/plain": [
1417 | "0.6879039704524469"
1418 | ]
1419 | },
1420 | "execution_count": null,
1421 | "metadata": {},
1422 | "output_type": "execute_result"
1423 | }
1424 | ],
1425 | "source": [
1426 | "# Exact match (strict!)\n",
1427 | "matches = 0\n",
1428 | "for item in results:\n",
1429 | " if item[\"output\"] == item[\"generated_output\"]:\n",
1430 | " matches += 1\n",
1431 | "matches / float(len(results))"
1432 | ]
1433 | },
1434 | {
1435 | "cell_type": "markdown",
1436 | "metadata": {},
1437 | "source": [
1438 | "**Note**: The objective of fine-tuning here isn't to create the most performant model but to show that you can leverage it for downstream workloads, like batch inference and online serving at scale. However, you can increase `num_train_epochs` if you want to."
1439 | ]
1440 | },
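  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the processor's output is a regular Ray Dataset, you can also persist the predictions for your next workload instead of only collecting them in memory. A minimal sketch, assuming you want to write Parquet files under the Anyscale artifact storage path (the `viggo/batch_results` subdirectory is illustrative):\n",
    "\n",
    "```python\n",
    "# Persist the collected predictions as Parquet for downstream workloads.\n",
    "results_path = os.path.join(os.getenv(\"ANYSCALE_ARTIFACT_STORAGE\"), \"viggo/batch_results\")\n",
    "ray.data.from_items(results).write_parquet(results_path)\n",
    "```"
   ]
  },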
1441 | {
1442 | "cell_type": "markdown",
1443 | "metadata": {},
1444 | "source": [
1445 | "Observe the individual steps in the batch inference workload through the Anyscale Ray Data dashboard:\n",
1446 | "\n",
1447 | "

"
1448 | ]
1449 | },
1450 | {
1451 | "cell_type": "markdown",
1452 | "metadata": {},
1453 | "source": [
1454 | "
\n",
1455 | "\n",
1456 | "💡 For more advanced guides on topics like optimized model loading, multi-LoRA, OpenAI-compatible endpoints, etc., see [more examples](https://docs.ray.io/en/latest/data/working-with-llms.html) and the [API reference](https://docs.ray.io/en/latest/data/api/llm.html).\n",
1457 | "\n",
1458 | "
"
1459 | ]
1460 | },
1461 | {
1462 | "cell_type": "markdown",
1463 | "metadata": {},
1464 | "source": [
1465 | "## Online serving\n",
1466 | "[`Overview`](https://docs.ray.io/en/latest/serve/llm/serving-llms.html) | [`API reference`](https://docs.ray.io/en/latest/serve/api/index.html#llm-api)"
1467 | ]
1468 | },
1469 | {
1470 | "cell_type": "markdown",
1471 | "metadata": {},
1472 | "source": [
1473 | "

\n",
1474 | "\n",
1475 | "`ray.serve.llm` APIs allow users to deploy multiple LLM models together with a familiar Ray Serve API, while providing compatibility with the OpenAI API.\n",
1476 | "\n",
1477 | "

\n",
1478 | "\n",
1479 | "Ray Serve LLM is designed with the following features:\n",
1480 | "- Automatic scaling and load balancing\n",
1481 | "- Unified multi-node multi-model deployment\n",
1482 | "- OpenAI compatibility\n",
1483 | "- Multi-LoRA support with shared base models\n",
1484 | "- Deep integration with inference engines, vLLM to start\n",
1485 | "- Composable multi-model LLM pipelines\n",
1486 | "\n",
1487 | "[RayTurbo Serve](https://docs.anyscale.com/rayturbo/rayturbo-serve) on Anyscale has more features on top of Ray Serve:\n",
1488 | "- **fast autoscaling and model loading** to get services up and running even faster: [5x improvements](https://www.anyscale.com/blog/autoscale-large-ai-models-faster) even for LLMs\n",
1489 | "- 54% **higher QPS** and up-to 3x **streaming tokens per second** for high traffic serving use-cases\n",
1490 | "- **replica compaction** into fewer nodes where possible to reduce resource fragmentation and improve hardware utilization\n",
1491 | "- **zero-downtime** [incremental rollouts](https://docs.anyscale.com/platform/services/update-a-service/#resource-constrained-updates) so your service is never interrupted\n",
1492 | "- [**different environments**](https://docs.anyscale.com/platform/services/multi-app/#multiple-applications-in-different-containers) for each service in a multi-serve application\n",
1493 | "- **multi availability-zone** aware scheduling of Ray Serve replicas to provide higher redundancy to availability zone failures\n"
1494 | ]
1495 | },
1496 | {
1497 | "cell_type": "markdown",
1498 | "metadata": {},
1499 | "source": [
1500 | "### LLM serve config"
1501 | ]
1502 | },
1503 | {
1504 | "cell_type": "code",
1505 | "execution_count": null,
1506 | "metadata": {},
1507 | "outputs": [],
1508 | "source": [
1509 | "import os\n",
1510 | "from openai import OpenAI # to use openai api format\n",
1511 | "from ray import serve\n",
1512 | "from ray.serve.llm import LLMConfig, build_openai_app"
1513 | ]
1514 | },
1515 | {
1516 | "cell_type": "markdown",
1517 | "metadata": {},
1518 | "source": [
1519 | "Define an [LLM config](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.LLMConfig.html#ray.serve.llm.LLMConfig) where you can define where the model comes from, it's [autoscaling behavior](https://docs.ray.io/en/latest/serve/autoscaling-guide.html#serve-autoscaling), what hardware to use and [engine arguments](https://docs.vllm.ai/en/stable/serving/engine_args.html)."
1520 | ]
1521 | },
1522 | {
1523 | "cell_type": "code",
1524 | "execution_count": null,
1525 | "metadata": {},
1526 | "outputs": [],
1527 | "source": [
1528 | "# Define config.\n",
1529 | "llm_config = LLMConfig(\n",
1530 | " model_loading_config={\n",
1531 | " \"model_id\": model_id,\n",
1532 | " \"model_source\": model_source\n",
1533 | " },\n",
1534 | " lora_config={ # REMOVE this section if you're only using a base model.\n",
1535 | " \"dynamic_lora_loading_path\": dynamic_lora_path,\n",
1536 | " \"max_num_adapters_per_replica\": 16, # You only have 1.\n",
1537 | " },\n",
1538 | " # runtime_env={\"env_vars\": {\"HF_TOKEN\": os.environ.get(\"HF_TOKEN\")}},\n",
1539 | " deployment_config={\n",
1540 | " \"autoscaling_config\": {\n",
1541 | " \"min_replicas\": 1,\n",
1542 | " \"max_replicas\": 2,\n",
1543 | " # complete list: https://docs.ray.io/en/latest/serve/autoscaling-guide.html#serve-autoscaling\n",
1544 | " }\n",
1545 | " },\n",
1546 | " accelerator_type=\"L4\",\n",
1547 | " engine_kwargs={\n",
1548 | " \"max_model_len\": 4096, # Or increase KV cache size.\n",
1549 | " \"tensor_parallel_size\": 1,\n",
1550 | " \"enable_lora\": True,\n",
1551 | " # complete list: https://docs.vllm.ai/en/stable/serving/engine_args.html\n",
1552 | " },\n",
1553 | ")"
1554 | ]
1555 | },
1556 | {
1557 | "cell_type": "markdown",
1558 | "metadata": {},
1559 | "source": [
1560 | "Now deploy the LLM config as an application. And because this application is all built on top of [Ray Serve](https://docs.ray.io/en/latest/serve/index.html), you can have advanced service logic around composing models together, deploying multiple applications, model multiplexing, observability, etc."
1561 | ]
1562 | },
1563 | {
1564 | "cell_type": "code",
1565 | "execution_count": null,
1566 | "metadata": {},
1567 | "outputs": [
1568 | {
1569 | "name": "stdout",
1570 | "output_type": "stream",
1571 | "text": [
1572 | "DeploymentHandle(deployment='LLMRouter')\n"
1573 | ]
1574 | }
1575 | ],
1576 | "source": [
1577 | "# Deploy.\n",
1578 | "app = build_openai_app({\"llm_configs\": [llm_config]})\n",
1579 | "serve.run(app)"
1580 | ]
1581 | },
1582 | {
1583 | "cell_type": "markdown",
1584 | "metadata": {},
1585 | "source": [
1586 | "### Service request"
1587 | ]
1588 | },
1589 | {
1590 | "cell_type": "code",
1591 | "execution_count": null,
1592 | "metadata": {},
1593 | "outputs": [
1594 | {
1595 | "name": "stdout",
1596 | "output_type": "stream",
1597 | "text": [
1598 | "\n",
1599 | "\n",
1600 | "Avg prompt throughput: 20.3 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.\n",
1601 | "\n",
1602 | "_opinion(name[Diablo II], developer[Blizzard North], rating[good], has_mac_release[yes])\n",
1603 | "\n",
1604 | "\n"
1605 | ]
1606 | }
1607 | ],
1608 | "source": [
1609 | "# Initialize client.\n",
1610 | "client = OpenAI(base_url=\"http://localhost:8000/v1\", api_key=\"fake-key\")\n",
1611 | "response = client.chat.completions.create(\n",
1612 | " model=f\"{model_id}:{lora_id}\",\n",
1613 | " messages=[\n",
1614 | " {\"role\": \"system\", \"content\": \"Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']\"},\n",
1615 | " {\"role\": \"user\", \"content\": \"Blizzard North is mostly an okay developer, but they released Diablo II for the Mac and so that pushes the game from okay to good in my view.\"},\n",
1616 | " ],\n",
1617 | " stream=True\n",
1618 | ")\n",
1619 | "for chunk in response:\n",
1620 | " if chunk.choices[0].delta.content is not None:\n",
1621 | " print(chunk.choices[0].delta.content, end=\"\", flush=True)"
1622 | ]
1623 | },
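  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `model` field selects what to query: `f\"{model_id}:{lora_id}\"` routes the request to your LoRA adapter, which Ray Serve loads on demand from `dynamic_lora_loading_path`, while passing just `model_id` targets the shared base model. A minimal sketch of a non-streaming request against the base model for comparison:\n",
    "\n",
    "```python\n",
    "# Same client, but target the shared base model (no \":{lora_id}\" suffix) without streaming.\n",
    "base_response = client.chat.completions.create(\n",
    "    model=model_id,\n",
    "    messages=[\n",
    "        {\"role\": \"system\", \"content\": system_content},\n",
    "        {\"role\": \"user\", \"content\": \"Do you have a favorite ESRB content rating?\"},\n",
    "    ],\n",
    ")\n",
    "print(base_response.choices[0].message.content)\n",
    "```"
   ]
  },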
1624 | {
1625 | "cell_type": "markdown",
1626 | "metadata": {},
1627 | "source": [
1628 | "And of course, you can observe the running service, the deployments, and metrics like QPS, latency, etc., through the [Ray Dashboard](https://docs.ray.io/en/latest/ray-observability/getting-started.html)'s [Serve view](https://docs.ray.io/en/latest/ray-observability/getting-started.html#dash-serve-view):\n",
1629 | "\n",
1630 | "

"
1631 | ]
1632 | },
1633 | {
1634 | "cell_type": "markdown",
1635 | "metadata": {},
1636 | "source": [
1637 | "
\n",
1638 | "\n",
1639 | "💡 See [more examples](https://docs.ray.io/en/latest/serve/llm/overview.html) and the [API reference](https://docs.ray.io/en/latest/serve/llm/api.html) for advanced guides on topics like structured outputs (like JSON), vision LMs, multi-LoRA on shared base models, using other inference engines (like `sglang`), fast model loading, etc.\n",
1640 | "\n",
1641 | "
"
1642 | ]
1643 | },
1644 | {
1645 | "cell_type": "markdown",
1646 | "metadata": {},
1647 | "source": [
1648 | "## Production\n",
1649 | "\n",
1650 | "Seamlessly integrate with your existing CI/CD pipelines by leveraging the Anyscale [CLI](https://docs.anyscale.com/reference/quickstart-cli) or [SDK](https://docs.anyscale.com/reference/quickstart-sdk) to run [reliable batch jobs](https://docs.anyscale.com/platform/jobs) and deploy [highly available services](https://docs.anyscale.com/platform/services). Given you've been developing in an environment that's almost identical to production with a multi-node cluster, this integration should drastically speed up your dev to prod velocity.\n",
1651 | "\n",
1652 | "

\n",
1653 | "\n",
1654 | "### Jobs\n",
1655 | "\n",
1656 | "[Anyscale Jobs](https://docs.anyscale.com/platform/jobs/) ([API ref](https://docs.anyscale.com/reference/job-api/)) allows you to execute discrete workloads in production such as batch inference, embeddings generation, or model fine-tuning.\n",
1657 | "- [define and manage](https://docs.anyscale.com/platform/jobs/manage-jobs) your Jobs in many different ways, like CLI and Python SDK\n",
1658 | "- set up [queues](https://docs.anyscale.com/platform/jobs/job-queues) and [schedules](https://docs.anyscale.com/platform/jobs/schedules)\n",
1659 | "- set up all the [observability, alerting, etc.](https://docs.anyscale.com/platform/jobs/monitoring-and-debugging) around your Jobs\n",
1660 | "\n",
1661 | "

\n",
1662 | "\n",
1663 | "### Services\n",
1664 | "\n",
1665 | "[Anyscale Services](https://docs.anyscale.com/platform/services/) ([API ref](https://docs.anyscale.com/reference/service-api/)) offers an extremely fault tolerant, scalable, and optimized way to serve your Ray Serve applications:\n",
1666 | "- you can [rollout and update](https://docs.anyscale.com/platform/services/update-a-service) services with canary deployment with zero-downtime upgrades\n",
1667 | "- [monitor](https://docs.anyscale.com/platform/services/monitoring) your Services through a dedicated Service page, unified log viewer, tracing, set up alerts, etc.\n",
1668 | "- scale a service (`num_replicas=auto`) and utilize replica compaction to consolidate nodes that are fractionally utilized\n",
1669 | "- [head node fault tolerance](https://docs.anyscale.com/platform/services/production-best-practices#head-node-ft) because OSS Ray recovers from failed workers and replicas but not head node crashes\n",
1670 | "- serving [multiple applications](https://docs.anyscale.com/platform/services/multi-app) in a single Service\n",
1671 | "\n",
1672 | "

\n"
1673 | ]
1674 | },
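  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For example, you could package the fine-tuning step from this notebook as an Anyscale Job and the serving step as an Anyscale Service. A minimal sketch, assuming the Anyscale CLI is installed and authenticated; the file names are placeholders and the exact flags and config fields can vary by CLI version, so check the Jobs and Services references above:\n",
    "\n",
    "```bash\n",
    "# Submit the fine-tuning workload as a Job. job.yaml is a placeholder config that would define\n",
    "# the name, image, compute config, and the `USE_RAY=1 llamafactory-cli train lora_sft_ray.yaml` entrypoint.\n",
    "anyscale job submit --config-file job.yaml\n",
    "\n",
    "# Deploy the Ray Serve LLM application as a Service. serve_app.py is a placeholder module that\n",
    "# would hold the build_openai_app code from the online serving section.\n",
    "anyscale service deploy serve_app:app --name viggo-llm-service\n",
    "```"
   ]
  },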
1675 | {
1676 | "cell_type": "code",
1677 | "execution_count": null,
1678 | "metadata": {},
1679 | "outputs": [],
1680 | "source": [
1681 | "%%bash\n",
1682 | "# clean up\n",
1683 | "rm -rf /mnt/cluster_storage/viggo\n",
1684 | "STORAGE_PATH=\"$ANYSCALE_ARTIFACT_STORAGE/viggo\"\n",
1685 | "if [[ \"$STORAGE_PATH\" == s3://* ]]; then\n",
1686 | " aws s3 rm \"$STORAGE_PATH\" --recursive --quiet\n",
1687 | "elif [[ \"$STORAGE_PATH\" == gs://* ]]; then\n",
1688 | " gsutil -m -q rm -r \"$STORAGE_PATH\"\n",
1689 | "fi"
1690 | ]
1691 | }
1692 | ],
1693 | "metadata": {
1694 | "kernelspec": {
1695 | "display_name": "base",
1696 | "language": "python",
1697 | "name": "python3"
1698 | },
1699 | "language_info": {
1700 | "codemirror_mode": {
1701 | "name": "ipython",
1702 | "version": 3
1703 | },
1704 | "file_extension": ".py",
1705 | "mimetype": "text/x-python",
1706 | "name": "python",
1707 | "nbconvert_exporter": "python",
1708 | "pygments_lexer": "ipython3",
1709 | "version": "3.11.11"
1710 | }
1711 | },
1712 | "nbformat": 4,
1713 | "nbformat_minor": 2
1714 | }
1715 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Entity Recognition with LLMs
2 |
3 |
4 |

5 |

6 |
7 |
8 | This end-to-end tutorial **fine-tunes** an LLM and then uses it for **batch inference** and **online serving** at scale. While entity recognition (NER) is the main task in this tutorial, you can easily extend these end-to-end workflows to any use case.
9 |
10 |

11 |
12 | **Note**: The intent of this tutorial is to show how you can use Ray to implement end-to-end LLM workflows that can extend to any use case, including multimodal.
13 |
14 | This tutorial uses the [Ray library](https://github.com/ray-project/ray) to implement these workflows, namely the LLM APIs:
15 |
16 | [`ray.data.llm`](https://docs.ray.io/en/latest/data/working-with-llms.html):
17 | - Batch inference over distributed datasets
18 | - Streaming and async execution for throughput
19 | - Built-in metrics and tracing, including observability
20 | - Zero-copy GPU data transfer
21 | - Composable with preprocessing and postprocessing steps
22 |
23 | [`ray.serve.llm`](https://docs.ray.io/en/latest/serve/llm/serving-llms.html):
24 | - Automatic scaling and load balancing
25 | - Unified multi-node multi-model deployment
26 | - Multi-LoRA support with shared base models
27 | - Deep integration with inference engines, vLLM to start
28 | - Composable multi-model LLM pipelines
29 |
30 | And all of these workloads come with all the observability views you need to debug and tune them to **maximize throughput and minimize latency**.
31 |
32 | ## Set up
33 |
34 | ### Compute
35 | This [Anyscale Workspace](https://docs.anyscale.com/platform/workspaces/) automatically provisions and autoscales the compute your workloads need. If you're not on Anyscale, then you need to provision the appropriate compute (L4) for this tutorial.
36 |
37 |

38 |
39 | ### Dependencies
40 | Start by downloading the dependencies required for this tutorial. Notice in your [`containerfile`](https://raw.githubusercontent.com/anyscale/e2e-llm-workflows/refs/heads/main/containerfile) you have a base image [`anyscale/ray-llm:latest-py311-cu124`](https://hub.docker.com/layers/anyscale/ray-llm/latest-py311-cu124/images/sha256-5a1c55f7f416d2d2eb5f4cdd13afeda25d4f7383406cfee1f1f60da495d1b50f) followed by a list of pip packages. If you're not on [Anyscale](https://console.anyscale.com/), you can pull this Docker image yourself and install the dependencies.
41 |
42 |
43 |
44 | ```bash
45 | %%bash
46 | # Install dependencies
47 | pip install -q \
48 | "xgrammar==0.1.11" \
49 | "pynvml==12.0.0" \
50 | "hf_transfer==0.1.9" \
51 | "tensorboard==2.19.0" \
52 | "llamafactory@git+https://github.com/hiyouga/LLaMA-Factory.git@ac8c6fdd3ab7fb6372f231f238e6b8ba6a17eb16#egg=llamafactory"
53 | ```
54 |
55 | Successfully registered `ray, vllm` and 5 other packages to be installed on all cluster nodes.
56 | View and update dependencies here: https://console.anyscale.com/cld_kvedZWag2qA8i5BjxUevf5i7/prj_cz951f43jjdybtzkx1s5sjgz99/workspaces/expwrk_mp8cxvgle2yeumgcpu1yua2r3e?workspace-tab=dependencies
57 |
58 |
59 | ## Data ingestion
60 |
61 |
62 | ```python
63 | import json
64 | import textwrap
65 | from IPython.display import Code, Image, display
66 | ```
67 |
68 | Start by downloading the data from cloud storage to local shared storage.
69 |
70 |
71 | ```bash
72 | %%bash
73 | rm -rf /mnt/cluster_storage/viggo # clean up
74 | mkdir /mnt/cluster_storage/viggo
75 | wget https://viggo-ds.s3.amazonaws.com/train.jsonl -O /mnt/cluster_storage/viggo/train.jsonl
76 | wget https://viggo-ds.s3.amazonaws.com/val.jsonl -O /mnt/cluster_storage/viggo/val.jsonl
77 | wget https://viggo-ds.s3.amazonaws.com/test.jsonl -O /mnt/cluster_storage/viggo/test.jsonl
78 | wget https://viggo-ds.s3.amazonaws.com/dataset_info.json -O /mnt/cluster_storage/viggo/dataset_info.json
79 | ```
80 |
81 | download: s3://viggo-ds/train.jsonl to ../../../mnt/cluster_storage/viggo/train.jsonl
82 | download: s3://viggo-ds/val.jsonl to ../../../mnt/cluster_storage/viggo/val.jsonl
83 | download: s3://viggo-ds/test.jsonl to ../../../mnt/cluster_storage/viggo/test.jsonl
84 | download: s3://viggo-ds/dataset_info.json to ../../../mnt/cluster_storage/viggo/dataset_info.json
85 |
86 |
87 |
88 | ```bash
89 | %%bash
90 | head -n 1 /mnt/cluster_storage/viggo/train.jsonl | python3 -m json.tool
91 | ```
92 |
93 | {
94 | "instruction": "Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']",
95 | "input": "Blizzard North is mostly an okay developer, but they released Diablo II for the Mac and so that pushes the game from okay to good in my view.",
96 | "output": "give_opinion(name[Diablo II], developer[Blizzard North], rating[good], has_mac_release[yes])"
97 | }
98 |
99 |
100 |
101 | ```python
102 | with open("/mnt/cluster_storage/viggo/train.jsonl", "r") as fp:
103 | first_line = fp.readline()
104 | item = json.loads(first_line)
105 | system_content = item["instruction"]
106 | print(textwrap.fill(system_content, width=80))
107 | ```
108 |
109 | Given a target sentence construct the underlying meaning representation of the
110 | input sentence as a single function with attributes and attribute values. This
111 | function should describe the target string accurately and the function must be
112 | one of the following ['inform', 'request', 'give_opinion', 'confirm',
113 | 'verify_attribute', 'suggest', 'request_explanation', 'recommend',
114 | 'request_attribute']. The attributes must be one of the following: ['name',
115 | 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres',
116 | 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam',
117 | 'has_linux_release', 'has_mac_release', 'specifier']
118 |
119 |
120 | You also have an info file that identifies the datasets and their format (Alpaca or ShareGPT) to use for post-training.
121 |
122 |
123 | ```python
124 | display(Code(filename="/mnt/cluster_storage/viggo/dataset_info.json", language="json"))
125 | ```
126 |
127 |
128 |
{
203 | "viggo-train": {
204 | "file_name": "/mnt/cluster_storage/viggo/train.jsonl",
205 | "formatting": "alpaca",
206 | "columns": {
207 | "prompt": "instruction",
208 | "query": "input",
209 | "response": "output"
210 | }
211 | },
212 | "viggo-val": {
213 | "file_name": "/mnt/cluster_storage/viggo/val.jsonl",
214 | "formatting": "alpaca",
215 | "columns": {
216 | "prompt": "instruction",
217 | "query": "input",
218 | "response": "output"
219 | }
220 | }
221 | }
222 |
223 |
224 |
225 |
226 | ## Distributed fine-tuning
227 |
228 | Use [Ray Train](https://docs.ray.io/en/latest/train/train.html) + [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) to perform multi-node training. Find the parameters for the training workload, post-training method, dataset location, train/val details, etc. in the `lora_sft_ray.yaml` config file. See the recipes for even more post-training methods, like SFT, pretraining, PPO, DPO, KTO, etc. [on GitHub](https://github.com/hiyouga/LLaMA-Factory/tree/main/examples).
229 |
230 | **Note**: Ray also supports using other tools like [axolotl](https://axolotl-ai-cloud.github.io/axolotl/docs/ray-integration.html) or even [Ray Train + HF Accelerate + FSDP/DeepSpeed](https://docs.ray.io/en/latest/train/huggingface-accelerate.html) directly for complete control of your post-training workloads.
231 |
232 |

233 |
234 | ### `config`
235 |
236 |
237 | ```python
238 | import os
239 | from pathlib import Path
240 | import yaml
241 | ```
242 |
243 |
244 | ```python
245 | display(Code(filename="lora_sft_ray.yaml", language="yaml"))
246 | ```
247 |
248 |
249 |
### model
324 | model_name_or_path: Qwen/Qwen2.5-7B-Instruct
325 | trust_remote_code: true
326 |
327 | ### method
328 | stage: sft
329 | do_train: true
330 | finetuning_type: lora
331 | lora_rank: 8
332 | lora_target: all
333 |
334 | ### dataset
335 | dataset: viggo-train
336 | dataset_dir: /mnt/cluster_storage/viggo # shared storage workers have access to
337 | template: qwen
338 | cutoff_len: 2048
339 | max_samples: 1000
340 | overwrite_cache: true
341 | preprocessing_num_workers: 16
342 | dataloader_num_workers: 4
343 |
344 | ### output
345 | output_dir: /mnt/cluster_storage/viggo/outputs # should be somewhere workers have access to (ex. s3, nfs)
346 | logging_steps: 10
347 | save_steps: 500
348 | plot_loss: true
349 | overwrite_output_dir: true
350 | save_only_model: false
351 |
352 | ### ray
353 | ray_run_name: lora_sft_ray
354 | ray_storage_path: /mnt/cluster_storage/viggo/saves # should be somewhere workers have access to (ex. s3, nfs)
355 | ray_num_workers: 4
356 | resources_per_worker:
357 | GPU: 1
358 | anyscale/accelerator_shape:4xL4: 0.001 # Use this to specify a specific node shape,
359 | # accelerator_type:L4: 1 # Or use this to simply specify a GPU type.
360 | # see https://docs.ray.io/en/master/ray-core/accelerator-types.html#accelerator-types for a full list of accelerator types
361 | placement_strategy: PACK
362 |
363 | ### train
364 | per_device_train_batch_size: 1
365 | gradient_accumulation_steps: 8
366 | learning_rate: 1.0e-4
367 | num_train_epochs: 5.0
368 | lr_scheduler_type: cosine
369 | warmup_ratio: 0.1
370 | bf16: true
371 | ddp_timeout: 180000000
372 | resume_from_checkpoint: null
373 |
374 | ### eval
375 | eval_dataset: viggo-val # uses same dataset_dir as training data
376 | # val_size: 0.1 # only if using part of training data for validation
377 | per_device_eval_batch_size: 1
378 | eval_strategy: steps
379 | eval_steps: 500
380 |
381 |
382 |
383 |
384 |
385 | ```python
386 | model_id = "ft-model" # call it whatever you want
387 | model_source = yaml.safe_load(open("lora_sft_ray.yaml"))["model_name_or_path"] # HF model ID, S3 mirror config, or GCS mirror config
388 | print (model_source)
389 | ```
390 |
391 | Qwen/Qwen2.5-7B-Instruct
392 |
393 |
394 | ### Multi-node training
395 |
396 | Use Ray Train + LLaMA-Factory to perform the multi-node training loop.
397 |
398 | **Ray Train**
399 |
400 | Using [Ray Train](https://docs.ray.io/en/latest/train/train.html) has several advantages:
401 | - it automatically handles **multi-node, multi-GPU** setup with no manual SSH setup or `hostfile` configs.
402 | - you can define **per-worker fractional resource requirements**, for example, 2 CPUs and 0.5 GPU per worker.
403 | - you can run on **heterogeneous machines** and scale flexibly, for example, CPU for preprocessing and GPU for training.
404 | - it has built-in **fault tolerance** through retries of failed workers and resumption from the last checkpoint.
405 | - it supports Data Parallel, Model Parallel, Parameter Server, and even custom strategies.
406 | - [Ray Compiled Graphs](https://docs.ray.io/en/latest/ray-core/compiled-graph/ray-compiled-graph.html) even allow you to define different parallelism strategies for jointly optimizing multiple models, whereas Megatron, DeepSpeed, and similar frameworks only allow one global setting. The sketch below shows how a few of these knobs appear in the Ray Train API.
407 |
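The knobs above map onto Ray Train's API directly. A minimal sketch of a `TorchTrainer` with per-worker resources and fault tolerance configured (the `train_func` body is a placeholder; when you launch LLaMA-Factory with `USE_RAY=1`, it builds a trainer like this for you from the `ray` section of the YAML config):

```python
from ray.train import FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    ...  # per-worker training loop (provided by LLaMA-Factory in this tutorial)

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=4,                                  # one Ray Train worker per GPU
        use_gpu=True,
        resources_per_worker={"CPU": 2, "GPU": 1},      # fractional values also work
    ),
    run_config=RunConfig(
        storage_path="/mnt/cluster_storage/viggo/saves",  # shared storage for checkpoints
        failure_config=FailureConfig(max_failures=3),     # retry failed workers
    ),
)
result = trainer.fit()
```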
408 | [RayTurbo Train](https://docs.anyscale.com/rayturbo/rayturbo-train) offers even more improvement to the price-performance ratio, performance monitoring, and more:
409 | - **elastic training** to scale to a dynamic number of workers, and continue training on fewer resources, even on spot instances.
410 | - **purpose-built dashboard** designed to streamline the debugging of Ray Train workloads:
411 | - Monitoring: View the status of training runs and train workers.
412 | - Metrics: See insights on training throughput and training system operation time.
413 | - Profiling: Investigate bottlenecks, hangs, or errors from individual training worker processes.
414 |
415 |

416 |
417 |
418 | ```bash
419 | %%bash
420 | # Run multi-node distributed fine-tuning workload
421 | USE_RAY=1 llamafactory-cli train lora_sft_ray.yaml
422 | ```
423 |
424 |
425 |
426 | Training started with configuration:
427 | ╭──────────────────────────────────────────────────────────────────────────────────────────────────────╮
428 | │ Training config │
429 | ├──────────────────────────────────────────────────────────────────────────────────────────────────────┤
430 | │ train_loop_config/args/bf16 True │
431 | │ train_loop_config/args/cutoff_len 2048 │
432 | │ train_loop_config/args/dataloader_num_workers 4 │
433 | │ train_loop_config/args/dataset viggo-train │
434 | │ train_loop_config/args/dataset_dir ...ter_storage/viggo │
435 | │ train_loop_config/args/ddp_timeout 180000000 │
436 | │ train_loop_config/args/do_train True │
437 | │ train_loop_config/args/eval_dataset viggo-val │
438 | │ train_loop_config/args/eval_steps 500 │
439 | │ train_loop_config/args/eval_strategy steps │
440 | │ train_loop_config/args/finetuning_type lora │
441 | │ train_loop_config/args/gradient_accumulation_steps 8 │
442 | │ train_loop_config/args/learning_rate 0.0001 │
443 | │ train_loop_config/args/logging_steps 10 │
444 | │ train_loop_config/args/lora_rank 8 │
445 | │ train_loop_config/args/lora_target all │
446 | │ train_loop_config/args/lr_scheduler_type cosine │
447 | │ train_loop_config/args/max_samples 1000 │
448 | │ train_loop_config/args/model_name_or_path ...en2.5-7B-Instruct │
449 | │ train_loop_config/args/num_train_epochs 5.0 │
450 | │ train_loop_config/args/output_dir ...age/viggo/outputs │
451 | │ train_loop_config/args/overwrite_cache True │
452 | │ train_loop_config/args/overwrite_output_dir True │
453 | │ train_loop_config/args/per_device_eval_batch_size 1 │
454 | │ train_loop_config/args/per_device_train_batch_size 1 │
455 | │ train_loop_config/args/placement_strategy PACK │
456 | │ train_loop_config/args/plot_loss True │
457 | │ train_loop_config/args/preprocessing_num_workers 16 │
458 | │ train_loop_config/args/ray_num_workers 4 │
459 | │ train_loop_config/args/ray_run_name lora_sft_ray │
460 | │ train_loop_config/args/ray_storage_path ...orage/viggo/saves │
461 | │ train_loop_config/args/resources_per_worker/GPU 1 │
462 | │ train_loop_config/args/resources_per_worker/anyscale/accelerator_shape:4xL4 1 │
463 | │ train_loop_config/args/resume_from_checkpoint │
464 | │ train_loop_config/args/save_only_model False │
465 | │ train_loop_config/args/save_steps 500 │
466 | │ train_loop_config/args/stage sft │
467 | │ train_loop_config/args/template qwen │
468 | │ train_loop_config/args/trust_remote_code True │
469 | │ train_loop_config/args/warmup_ratio 0.1 │
470 | │ train_loop_config/callbacks ... 0x7e1262910e10>] │
471 | ╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯
472 |
473 | 100%|██████████| 155/155 [07:12<00:00, 2.85s/it][INFO|trainer.py:3942] 2025-04-11 14:57:59,207 >> Saving model checkpoint to /mnt/cluster_storage/viggo/outputs/checkpoint-155
474 |
475 | Training finished iteration 1 at 2025-04-11 14:58:02. Total running time: 10min 24s
476 | ╭─────────────────────────────────────────╮
477 | │ Training result │
478 | ├─────────────────────────────────────────┤
479 | │ checkpoint_dir_name checkpoint_000000 │
480 | │ time_this_iter_s 521.83827 │
481 | │ time_total_s 521.83827 │
482 | │ training_iteration 1 │
483 | │ epoch 4.704 │
484 | │ grad_norm 0.14288 │
485 | │ learning_rate 0. │
486 | │ loss 0.0065 │
487 | │ step 150 │
488 | ╰─────────────────────────────────────────╯
489 | Training saved a checkpoint for iteration 1 at: (local)/mnt/cluster_storage/viggo/saves/lora_sft_ray/TorchTrainer_95d16_00000_0_2025-04-11_14-47-37/checkpoint_000000
490 |
491 |
492 |
493 |
494 |
495 |
496 | ```python
497 | display(Code(filename="/mnt/cluster_storage/viggo/outputs/all_results.json", language="json"))
498 | ```
499 |
500 |
501 |
{
576 | "epoch": 4.864,
577 | "eval_viggo-val_loss": 0.13618840277194977,
578 | "eval_viggo-val_runtime": 20.2797,
579 | "eval_viggo-val_samples_per_second": 35.208,
580 | "eval_viggo-val_steps_per_second": 8.827,
581 | "total_flos": 4.843098686147789e+16,
582 | "train_loss": 0.2079355036479331,
583 | "train_runtime": 437.2951,
584 | "train_samples_per_second": 11.434,
585 | "train_steps_per_second": 0.354
586 | }
587 |
588 |
589 |
590 |
591 |

592 |
593 | ### Observability
594 |
595 | **🔎 Monitoring and debugging with Ray**
596 |
597 |
598 | OSS Ray offers an extensive [observability suite](https://docs.ray.io/en/latest/ray-observability/index.html) with logs and an observability dashboard that you can use to monitor and debug. The dashboard includes a lot of different components such as:
599 |
600 | - memory, utilization, etc., of the tasks running in the [cluster](https://docs.ray.io/en/latest/ray-observability/getting-started.html#dash-node-view)
601 |
602 |

603 |
604 | - views to see all running tasks, utilization across instance types, autoscaling, etc.
605 |
606 |

607 |
608 |
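You can also pull the same cluster-level information programmatically from the driver, which is handy in scripts and tests. A small sketch using core Ray APIs:

```python
import ray

ray.init(ignore_reinit_error=True)  # attaches to the running cluster on Anyscale

print(ray.cluster_resources())      # total CPUs, GPUs, and memory in the cluster
print(ray.available_resources())    # what's currently free
```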
609 | **🔎➕➕ Monitoring and debugging on Anyscale**
610 |
611 | OSS Ray comes with an extensive observability suite, and Anyscale takes it many steps further to make monitoring and debugging your workloads even easier and faster with:
612 |
613 | - [unified log viewer](https://docs.anyscale.com/monitoring/accessing-logs/) to see logs from *all* driver and worker processes
614 | - Ray workload-specific dashboards, like Data and Train, that break down the workload into its tasks. For example, you can observe the preceding training workload live through the Train-specific Ray Workloads dashboard:
615 |
616 |

617 |
618 |
619 |
620 |
621 | ### Save to cloud storage
622 |
623 | **🗂️ Storage on Anyscale**
624 |
625 | You can always store data in [any storage bucket](https://docs.anyscale.com/configuration/storage/#private-storage-buckets), but Anyscale offers a [default storage bucket](https://docs.anyscale.com/configuration/storage/#anyscale-default-storage-bucket) to make things even easier. You also have plenty of other [storage options](https://docs.anyscale.com/configuration/storage/), shared at the cluster, user, and cloud levels.
626 |
627 |
628 | ```bash
629 | %%bash
630 | # Anyscale default storage bucket.
631 | echo $ANYSCALE_ARTIFACT_STORAGE
632 | ```
633 |
634 | s3://anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage
635 |
636 |
637 |
638 | ```bash
639 | %%bash
640 | # Save fine-tuning artifacts to cloud storage.
641 | STORAGE_PATH="$ANYSCALE_ARTIFACT_STORAGE/viggo"
642 | LOCAL_OUTPUTS_PATH="/mnt/cluster_storage/viggo/outputs"
643 | LOCAL_SAVES_PATH="/mnt/cluster_storage/viggo/saves"
644 |
645 | # AWS S3 operations.
646 | if [[ "$STORAGE_PATH" == s3://* ]]; then
647 | if aws s3 ls "$STORAGE_PATH" > /dev/null 2>&1; then
648 | aws s3 rm "$STORAGE_PATH" --recursive --quiet
649 | fi
650 | aws s3 cp "$LOCAL_OUTPUTS_PATH" "$STORAGE_PATH/outputs" --recursive --quiet
651 | aws s3 cp "$LOCAL_SAVES_PATH" "$STORAGE_PATH/saves" --recursive --quiet
652 |
653 | # Google Cloud Storage operations.
654 | elif [[ "$STORAGE_PATH" == gs://* ]]; then
655 | if gsutil ls "$STORAGE_PATH" > /dev/null 2>&1; then
656 | gsutil -m -q rm -r "$STORAGE_PATH"
657 | fi
658 | gsutil -m -q cp -r "$LOCAL_OUTPUTS_PATH" "$STORAGE_PATH/outputs"
659 | gsutil -m -q cp -r "$LOCAL_SAVES_PATH" "$STORAGE_PATH/saves"
660 |
661 | else
662 | echo "Unsupported storage protocol: $STORAGE_PATH"
663 | exit 1
664 | fi
665 | ```
666 |
667 |
668 | ```bash
669 | %%bash
670 | ls /mnt/cluster_storage/viggo/saves/lora_sft_ray
671 | ```
672 |
673 | TorchTrainer_95d16_00000_0_2025-04-11_14-47-37
674 | TorchTrainer_f9e4e_00000_0_2025-04-11_12-41-34
675 | basic-variant-state-2025-04-11_12-41-34.json
676 | basic-variant-state-2025-04-11_14-47-37.json
677 | experiment_state-2025-04-11_12-41-34.json
678 | experiment_state-2025-04-11_14-47-37.json
679 | trainer.pkl
680 | tuner.pkl
681 |
682 |
683 |
684 | ```python
685 | # LoRA paths.
686 | save_dir = Path("/mnt/cluster_storage/viggo/saves/lora_sft_ray")
687 | trainer_dirs = [d for d in save_dir.iterdir() if d.name.startswith("TorchTrainer_") and d.is_dir()]
688 | latest_trainer = max(trainer_dirs, key=lambda d: d.stat().st_mtime, default=None)
689 | lora_path = f"{latest_trainer}/checkpoint_000000/checkpoint"
690 | cloud_lora_path = os.path.join(os.getenv("ANYSCALE_ARTIFACT_STORAGE"), lora_path.split("/mnt/cluster_storage/")[-1])
691 | dynamic_lora_path, lora_id = cloud_lora_path.rsplit("/", 1)
692 | print (lora_path)
693 | print (cloud_lora_path)
694 | print (dynamic_lora_path)
695 | print (lora_id)
696 | ```
697 |
698 | /mnt/cluster_storage/viggo/saves/lora_sft_ray/TorchTrainer_95d16_00000_0_2025-04-11_14-47-37/checkpoint_000000/checkpoint
699 | s3://anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/viggo/saves/lora_sft_ray/TorchTrainer_95d16_00000_0_2025-04-11_14-47-37/checkpoint_000000/checkpoint
700 | s3://anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/viggo/saves/lora_sft_ray/TorchTrainer_95d16_00000_0_2025-04-11_14-47-37/checkpoint_000000
701 | checkpoint
702 |
703 |
704 |
705 | ```bash
706 | %%bash -s "$lora_path"
707 | ls $1
708 | ```
709 |
710 | README.md
711 | adapter_config.json
712 | adapter_model.safetensors
713 | added_tokens.json
714 | merges.txt
715 | optimizer.pt
716 | rng_state_0.pth
717 | rng_state_1.pth
718 | rng_state_2.pth
719 | rng_state_3.pth
720 | scheduler.pt
721 | special_tokens_map.json
722 | tokenizer.json
723 | tokenizer_config.json
724 | trainer_state.json
725 | training_args.bin
726 | vocab.json
727 |
728 |
729 | ## Batch inference
730 | [`Overview`](https://docs.ray.io/en/latest/data/working-with-llms.html) | [`API reference`](https://docs.ray.io/en/latest/data/api/llm.html)
731 |
732 | The `ray.data.llm` module integrates with key large language model (LLM) inference engines and deployed models to enable LLM batch inference. These LLM modules use [Ray Data](https://docs.ray.io/en/latest/data/data.html) under the hood, which not only makes it extremely easy to distribute workloads but also ensures that they run:
733 | - **efficiently**: minimizing CPU/GPU idle time with heterogeneous resource scheduling.
734 | - **at scale**: with streaming execution to petabyte-scale datasets, especially when [working with LLMs](https://docs.ray.io/en/latest/data/working-with-llms.html).
735 | - **reliably** by checkpointing processes, especially when running workloads on spot instances with on-demand fallback.
736 | - **flexibly**: connecting to data from any source, applying transformations, and saving to any format and location for your next workload.
737 |
738 |

739 |
740 | [RayTurbo Data](https://docs.anyscale.com/rayturbo/rayturbo-data) has more features on top of Ray Data:
741 | - **accelerated metadata fetching** to speed up the first read from large datasets
742 | - **optimized autoscaling**, where Jobs can start before the entire cluster is ready
743 | - **high reliability**, where entire failed jobs (head node failures, cluster failures, uncaptured exceptions, etc.) can resume from checkpoints; OSS Ray can only recover from worker node failures.
744 |
745 | Start by defining the [vLLM engine processor config](https://docs.ray.io/en/latest/data/api/doc/ray.data.llm.vLLMEngineProcessorConfig.html#ray.data.llm.vLLMEngineProcessorConfig) where you can select the model to use and configure the [engine behavior](https://docs.vllm.ai/en/stable/serving/engine_args.html). The model can come from [Hugging Face (HF) Hub](https://huggingface.co/models) or a local model path `/path/to/your/model`. Anyscale supports GPTQ, GGUF, or LoRA model formats.
746 |
747 |

748 |
749 | ### vLLM engine processor
750 |
751 |
752 | ```python
753 | import os
754 | import ray
755 | from ray.data.llm import vLLMEngineProcessorConfig
756 | ```
757 |
758 | INFO 04-11 14:58:40 __init__.py:194] No platform detected, vLLM is running on UnspecifiedPlatform
759 |
760 |
761 |
762 | ```python
763 | config = vLLMEngineProcessorConfig(
764 | model_source=model_source,
765 | runtime_env={
766 | "env_vars": {
767 | "VLLM_USE_V1": "0", # v1 doesn't support lora adapters yet
768 | # "HF_TOKEN": os.environ.get("HF_TOKEN"),
769 | },
770 | },
771 | engine_kwargs={
772 | "enable_lora": True,
773 | "max_lora_rank": 8,
774 | "max_loras": 1,
775 | "pipeline_parallel_size": 1,
776 | "tensor_parallel_size": 1,
777 | "enable_prefix_caching": True,
778 | "enable_chunked_prefill": True,
779 | "max_num_batched_tokens": 4096,
780 | "max_model_len": 4096, # or increase KV cache size
781 | # complete list: https://docs.vllm.ai/en/stable/serving/engine_args.html
782 | },
783 | concurrency=1,
784 | batch_size=16,
785 | accelerator_type="L4",
786 | )
787 | ```
788 |
789 | ### LLM processor
790 |
791 | Next, pass the config to an [LLM processor](https://docs.ray.io/en/master/data/api/doc/ray.data.llm.build_llm_processor.html#ray.data.llm.build_llm_processor) where you can define the preprocessing and postprocessing steps around inference. With your base model defined in the processor config, you can define the LoRA adapter layers as part of the preprocessing step of the LLM processor itself.
792 |
793 |
794 | ```python
795 | from ray.data.llm import build_llm_processor
796 | ```
797 |
798 |
799 | ```python
800 | processor = build_llm_processor(
801 | config,
802 | preprocess=lambda row: dict(
803 | model=lora_path, # REMOVE this line if doing inference with just the base model
804 | messages=[
805 | {"role": "system", "content": system_content},
806 | {"role": "user", "content": row["input"]}
807 | ],
808 | sampling_params={
809 | "temperature": 0.3,
810 | "max_tokens": 250,
811 | # complete list: https://docs.vllm.ai/en/stable/api/inference_params.html
812 | },
813 | ),
814 | postprocess=lambda row: {
815 | **row, # all contents
816 | "generated_output": row["generated_text"],
817 | # add additional outputs
818 | },
819 | )
820 | ```
821 |
822 | 2025-04-11 14:58:40,942 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 10.0.51.51:6379...
823 | 2025-04-11 14:58:40,953 INFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at https://session-zt5t77xa58pyp3uy28glg2g24d.i.anyscaleuserdata.com
824 | 2025-04-11 14:58:40,960 INFO packaging.py:367 -- Pushing file package 'gcs://_ray_pkg_e71d58b4dc01d065456a9fc0325ee2682e13de88.zip' (2.16MiB) to Ray cluster...
825 | 2025-04-11 14:58:40,969 INFO packaging.py:380 -- Successfully pushed file package 'gcs://_ray_pkg_e71d58b4dc01d065456a9fc0325ee2682e13de88.zip'.
826 |
827 |
828 |
829 | config.json: 0%| | 0.00/663 [00:00, ?B/s]
830 |
831 |
832 | (pid=51260) INFO 04-11 14:58:47 __init__.py:194] No platform detected, vLLM is running on UnspecifiedPlatform
833 |
834 |
835 |
836 | ```python
837 | # Evaluation on test dataset
838 | ds = ray.data.read_json("/mnt/cluster_storage/viggo/test.jsonl") # complete list: https://docs.ray.io/en/latest/data/api/input_output.html
839 | ds = processor(ds)
840 | results = ds.take_all()
841 | results[0]
842 | ```
843 |
844 |
845 |
846 | {
847 | "batch_uuid": "d7a6b5341cbf4986bb7506ff277cc9cf",
848 | "embeddings": null,
849 | "generated_text": "request(esrb)",
850 | "generated_tokens": [2035, 50236, 10681, 8, 151645],
851 | "input": "Do you have a favorite ESRB content rating?",
852 | "instruction": "Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']",
853 | "messages": [
854 | {
855 | "content": "Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']",
856 | "role": "system"
857 | },
858 | {
859 | "content": "Do you have a favorite ESRB content rating?",
860 | "role": "user"
861 | }
862 | ],
863 | "metrics": {
864 | "arrival_time": 1744408857.148983,
865 | "finished_time": 1744408863.09091,
866 | "first_scheduled_time": 1744408859.130259,
867 | "first_token_time": 1744408862.7087252,
868 | "last_token_time": 1744408863.089174,
869 | "model_execute_time": null,
870 | "model_forward_time": null,
871 | "scheduler_time": 0.04162892400017881,
872 | "time_in_queue": 1.981276035308838
873 | },
874 | "model": "/mnt/cluster_storage/viggo/saves/lora_sft_ray/TorchTrainer_95d16_00000_0_2025-04-11_14-47-37/checkpoint_000000/checkpoint",
875 | "num_generated_tokens": 5,
876 | "num_input_tokens": 164,
877 | "output": "request_attribute(esrb[])",
878 | "params": "SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.3, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=250, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None)",
879 | "prompt": "<|im_start|>system
880 | Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']<|im_end|>
881 | <|im_start|>user
882 | Do you have a favorite ESRB content rating?<|im_end|>
883 | <|im_start|>assistant
884 | ",
885 | "prompt_token_ids": [151644, "...", 198],
886 | "request_id": 94,
887 | "time_taken_llm": 6.028705836999961,
888 | "generated_output": "request(esrb)"
889 | }
890 |
891 |
892 |
893 |
894 |
895 | ```python
896 | # Exact match (strict!)
897 | matches = 0
898 | for item in results:
899 | if item["output"] == item["generated_output"]:
900 | matches += 1
901 | matches / float(len(results))
902 | ```
903 |
904 |
905 |
906 |
907 | 0.6879039704524469
908 |
909 |
910 |
911 | **Note**: The objective of fine-tuning here isn't to create the most performant model but to show that you can leverage it for downstream workloads, like batch inference and online serving, at scale. However, you can increase `num_train_epochs` if you want to train for longer.
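
Beyond strict exact match, you can also compare just the predicted function names, which is a more forgiving signal of whether the model learned the output structure. Here's a minimal sketch over the same `results` list; the regex-based parsing is an illustrative assumption, not part of the original workload:

```python
import re

def fn_name(meaning_representation: str) -> str:
    """Extract the leading function name, for example 'request_attribute'
    from 'request_attribute(esrb[])'. Returns '' if no function form is found."""
    match = re.match(r"\s*([a-z_]+)\s*\(", meaning_representation)
    return match.group(1) if match else ""

fn_matches = sum(
    fn_name(item["output"]) == fn_name(item["generated_output"]) for item in results
)
fn_matches / float(len(results))
```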
912 |
913 | Observe the individual steps in the batch inference workload through the Anyscale Ray Data dashboard:
914 |
915 |

916 |
917 |
918 |
919 | 💡 For more advanced guides on topics like optimized model loading, multi-LoRA, OpenAI-compatible endpoints, etc., see [more examples](https://docs.ray.io/en/latest/data/working-with-llms.html) and the [API reference](https://docs.ray.io/en/latest/data/api/llm.html).
920 |
921 |
922 |
923 | ## Online serving
924 | [`Overview`](https://docs.ray.io/en/latest/serve/llm/serving-llms.html) | [`API reference`](https://docs.ray.io/en/latest/serve/api/index.html#llm-api)
925 |
926 |

927 |
928 | `ray.serve.llm` APIs allow users to deploy multiple LLM models together with a familiar Ray Serve API, while providing compatibility with the OpenAI API.
929 |
930 |

931 |
932 | Ray Serve LLM is designed with the following features:
933 | - Automatic scaling and load balancing
934 | - Unified multi-node multi-model deployment
935 | - OpenAI compatibility
936 | - Multi-LoRA support with shared base models
937 | - Deep integration with inference engines, starting with vLLM
938 | - Composable multi-model LLM pipelines
939 |
940 | [RayTurbo Serve](https://docs.anyscale.com/rayturbo/rayturbo-serve) on Anyscale has more features on top of Ray Serve:
941 | - **fast autoscaling and model loading** to get services up and running even faster: [5x improvements](https://www.anyscale.com/blog/autoscale-large-ai-models-faster) even for LLMs
942 | - 54% **higher QPS** and up to 3x **streaming tokens per second** for high-traffic serving use cases
943 | - **replica compaction** into fewer nodes where possible to reduce resource fragmentation and improve hardware utilization
944 | - **zero-downtime** [incremental rollouts](https://docs.anyscale.com/platform/services/update-a-service/#resource-constrained-updates) so your service is never interrupted
945 | - [**different environments**](https://docs.anyscale.com/platform/services/multi-app/#multiple-applications-in-different-containers) for each service in a multi-serve application
946 | - **multi-availability-zone** aware scheduling of Ray Serve replicas to provide higher redundancy against availability-zone failures
947 |
948 |
949 | ### LLM serve config
950 |
951 |
952 | ```python
953 | import os
954 | from openai import OpenAI # to use openai api format
955 | from ray import serve
956 | from ray.serve.llm import LLMConfig, build_openai_app
957 | ```
958 |
959 | Define an [LLM config](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.LLMConfig.html#ray.serve.llm.LLMConfig) that specifies where the model comes from, its [autoscaling behavior](https://docs.ray.io/en/latest/serve/autoscaling-guide.html#serve-autoscaling), what hardware to use, and the [engine arguments](https://docs.vllm.ai/en/stable/serving/engine_args.html).
960 |
961 |
962 | ```python
963 | # Define config.
964 | llm_config = LLMConfig(
965 | model_loading_config={
966 | "model_id": model_id,
967 | "model_source": model_source
968 | },
969 | lora_config={ # REMOVE this section if you're only using a base model.
970 | "dynamic_lora_loading_path": dynamic_lora_path,
971 | "max_num_adapters_per_replica": 16, # You only have 1.
972 | },
973 | # runtime_env={"env_vars": {"HF_TOKEN": os.environ.get("HF_TOKEN")}},
974 | deployment_config={
975 | "autoscaling_config": {
976 | "min_replicas": 1,
977 | "max_replicas": 2,
978 | # complete list: https://docs.ray.io/en/latest/serve/autoscaling-guide.html#serve-autoscaling
979 | }
980 | },
981 | accelerator_type="L4",
982 | engine_kwargs={
983 | "max_model_len": 4096, # Or increase KV cache size.
984 | "tensor_parallel_size": 1,
985 | "enable_lora": True,
986 | # complete list: https://docs.vllm.ai/en/stable/serving/engine_args.html
987 | },
988 | )
989 | ```
990 |
991 | Now deploy the LLM config as an application. Because this application is built on top of [Ray Serve](https://docs.ray.io/en/latest/serve/index.html), you can add advanced service logic for composing models, deploying multiple applications, model multiplexing, observability, and more.
992 |
993 |
994 | ```python
995 | # Deploy.
996 | app = build_openai_app({"llm_configs": [llm_config]})
997 | serve.run(app)
998 | ```
999 |
1000 | DeploymentHandle(deployment='LLMRouter')
1001 |
1002 |
1003 | ### Service request
1004 |
1005 |
1006 | ```python
1007 | # Initialize client.
1008 | client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")
1009 | response = client.chat.completions.create(
1010 | model=f"{model_id}:{lora_id}",
1011 | messages=[
1012 | {"role": "system", "content": "Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']"},
1013 | {"role": "user", "content": "Blizzard North is mostly an okay developer, but they released Diablo II for the Mac and so that pushes the game from okay to good in my view."},
1014 | ],
1015 | stream=True
1016 | )
1017 | for chunk in response:
1018 | if chunk.choices[0].delta.content is not None:
1019 | print(chunk.choices[0].delta.content, end="", flush=True)
1020 | ```
1021 |
1022 |
1023 |
1024 | Avg prompt throughput: 20.3 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
1025 |
1026 | _opinion(name[Diablo II], developer[Blizzard North], rating[good], has_mac_release[yes])
1027 |
1028 |
1029 |
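
Because the router is OpenAI-compatible, you can also list what the service is currently exposing (the base model ID plus any registered LoRA adapters) through the standard models endpoint. A small sketch reusing the `client` from above:

```python
# List the models the service exposes (base model plus LoRA adapters, if any).
for model in client.models.list():
    print(model.id)
```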
1030 |
1031 | And of course, you can observe the running service, the deployments, and metrics like QPS, latency, etc., through the [Ray Dashboard](https://docs.ray.io/en/latest/ray-observability/getting-started.html)'s [Serve view](https://docs.ray.io/en/latest/ray-observability/getting-started.html#dash-serve-view):
1032 |
1033 |

1034 |
1035 |
1036 |
1037 | 💡 See [more examples](https://docs.ray.io/en/latest/serve/llm/overview.html) and the [API reference](https://docs.ray.io/en/latest/serve/llm/api.html) for advanced guides on topics like structured outputs (like JSON), vision LMs, multi-LoRA on shared base models, using other inference engines (like `sglang`), fast model loading, etc.
1038 |
1039 |
1040 |
1041 | ```python
1042 | # Shutdown the service
1043 | serve.shutdown()
1044 | ```
1045 |
1046 | ## Production
1047 |
1048 | Seamlessly integrate with your existing CI/CD pipelines by leveraging the Anyscale [CLI](https://docs.anyscale.com/reference/quickstart-cli) or [SDK](https://docs.anyscale.com/reference/quickstart-sdk) to run [reliable batch jobs](https://docs.anyscale.com/platform/jobs) and deploy [highly available services](https://docs.anyscale.com/platform/services). Because you've been developing in a multi-node cluster environment that's almost identical to production, this integration should drastically speed up your dev-to-prod velocity.
1049 |
1050 |

1051 |
1052 | ### Jobs
1053 |
1054 | [Anyscale Jobs](https://docs.anyscale.com/platform/jobs/) ([API ref](https://docs.anyscale.com/reference/job-api/)) allows you to execute discrete workloads in production such as batch inference, embeddings generation, or model fine-tuning.
1055 | - [define and manage](https://docs.anyscale.com/platform/jobs/manage-jobs) your Jobs in many different ways, like the CLI and Python SDK (see the sketch after this list)
1056 | - set up [queues](https://docs.anyscale.com/platform/jobs/job-queues) and [schedules](https://docs.anyscale.com/platform/jobs/schedules)
1057 | - set up all the [observability, alerting, etc.](https://docs.anyscale.com/platform/jobs/monitoring-and-debugging) around your Jobs
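
As a minimal sketch of submitting a Job programmatically (assuming the Anyscale Python SDK's Job API linked above, and a hypothetical `batch_inference.py` entrypoint in the current directory):

```python
import anyscale
from anyscale.job.models import JobConfig

# Hypothetical entrypoint script; swap in your own workload.
config = JobConfig(
    name="viggo-batch-inference",
    entrypoint="python batch_inference.py",
    working_dir=".",
)

anyscale.job.submit(config)                      # submit the Job to Anyscale
anyscale.job.wait(name="viggo-batch-inference")  # block until it finishes
```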
1058 |
1059 |

1060 |
1061 | ### Services
1062 |
1063 | [Anyscale Services](https://docs.anyscale.com/platform/services/) ([API ref](https://docs.anyscale.com/reference/service-api/)) offers an extremely fault-tolerant, scalable, and optimized way to serve your Ray Serve applications (a minimal deploy sketch follows this list):
1064 | - you can [roll out and update](https://docs.anyscale.com/platform/services/update-a-service) services with canary deployments and zero-downtime upgrades
1065 | - [monitor](https://docs.anyscale.com/platform/services/monitoring) your Services through a dedicated Service page, unified log viewer, tracing, alerting, etc.
1066 | - scale a service (`num_replicas=auto`) and utilize replica compaction to consolidate nodes that are fractionally utilized
1067 | - [head node fault tolerance](https://docs.anyscale.com/platform/services/production-best-practices#head-node-ft) because OSS Ray recovers from failed workers and replicas but not head node crashes
1068 | - serving [multiple applications](https://docs.anyscale.com/platform/services/multi-app) in a single Service
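
A minimal deploy sketch with the SDK's Service API (the `serve_viggo:app` import path is hypothetical and would point at a module that builds the `ray.serve.llm` app shown earlier; exact `ServiceConfig` fields can vary across SDK versions, so check the API reference linked above):

```python
import anyscale
from anyscale.service.models import ServiceConfig

# Hypothetical import path to a module that calls build_openai_app(...) and exposes `app`.
config = ServiceConfig(
    name="viggo-llm-service",
    applications=[{"import_path": "serve_viggo:app"}],
    working_dir=".",
)

anyscale.service.deploy(config)  # rolls out a new version of the service
```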
1069 |
1070 |

1071 |
1072 |
1073 |
1074 | ```bash
1075 | %%bash
1076 | # clean up
1077 | rm -rf /mnt/cluster_storage/viggo
1078 | STORAGE_PATH="$ANYSCALE_ARTIFACT_STORAGE/viggo"
1079 | if [[ "$STORAGE_PATH" == s3://* ]]; then
1080 | aws s3 rm "$STORAGE_PATH" --recursive --quiet
1081 | elif [[ "$STORAGE_PATH" == gs://* ]]; then
1082 | gsutil -m -q rm -r "$STORAGE_PATH"
1083 | fi
1084 | ```
1085 |
--------------------------------------------------------------------------------
/ci/aws.yaml:
--------------------------------------------------------------------------------
1 | cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
2 | region: us-west-2
3 |
4 | # Head node
5 | head_node_type:
6 | name: head
7 | instance_type: m5.2xlarge
8 | resources:
9 | cpu: 8
10 |
11 | # Worker nodes
12 | auto_select_worker_config: true
13 | flags:
14 | allow-cross-zone-autoscaling: true
15 |
--------------------------------------------------------------------------------
/ci/build.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | set -exo pipefail
4 |
5 | # Will use lockfile instead later
6 | # pip3 install --no-cache-dir -r https://raw.githubusercontent.com/anyscale/e2e-llm-workflows/refs/heads/main/lockfile.txt
7 |
8 | # Install Python dependencies
9 | pip3 install --no-cache-dir \
10 | "xgrammar==0.1.11" \
11 | "pynvml==12.0.0" \
12 | "hf_transfer==0.1.9" \
13 | "tensorboard==2.19.0" \
14 | "git+https://github.com/hiyouga/LLaMA-Factory.git@ac8c6fdd3ab7fb6372f231f238e6b8ba6a17eb16#egg=llamafactory"
15 |
16 |
17 | # Env vars
18 | export HF_HUB_ENABLE_HF_TRANSFER=1
19 |
--------------------------------------------------------------------------------
/ci/gce.yaml:
--------------------------------------------------------------------------------
1 | cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
2 | region: us-central1
3 |
4 | # Head node
5 | head_node_type:
6 | name: head
7 | instance_type: n2-standard-8
8 | resources:
9 | cpu: 8
10 |
11 | # Worker nodes
12 | auto_select_worker_config: true
13 | flags:
14 | allow-cross-zone-autoscaling: true
15 |
--------------------------------------------------------------------------------
/ci/nb2py.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import argparse
3 | import nbformat
4 |
5 |
6 | def convert_notebook(input_path: str, output_path: str) -> None:
7 | """
8 | Read a Jupyter notebook and write a Python script, converting all %%bash
9 | cells and IPython "!" commands into subprocess.run calls that raise on error.
10 | """
11 | nb = nbformat.read(input_path, as_version=4)
12 | with open(output_path, "w") as out:
13 | for cell in nb.cells:
14 | if cell.cell_type != "code":
15 | continue
16 |
17 | lines = cell.source.splitlines()
18 | # Detect a %%bash cell
19 | if lines and lines[0].strip().startswith("%%bash"):
20 | bash_script = "\n".join(lines[1:]).rstrip()
21 | out.write("import subprocess\n")
22 | out.write(
23 | f"subprocess.run(r'''{bash_script}''',\n"
24 | " shell=True,\n"
25 | " check=True,\n"
26 | " executable='/bin/bash')\n\n"
27 | )
28 | else:
29 | # Detect any IPython '!' shell commands in code lines
30 | has_bang = any(line.lstrip().startswith("!") for line in lines)
31 | if has_bang:
32 | out.write("import subprocess\n")
33 | for line in lines:
34 | stripped = line.lstrip()
35 | if stripped.startswith("!"):
36 | cmd = stripped[1:].lstrip()
37 | out.write(
38 | f"subprocess.run(r'''{cmd}''',\n"
39 | " shell=True,\n"
40 | " check=True,\n"
41 | " executable='/bin/bash')\n"
42 | )
43 | else:
44 | out.write(line.rstrip() + "\n")
45 | out.write("\n")
46 | else:
47 | # Regular Python cell: dump as-is
48 | out.write(cell.source.rstrip() + "\n\n")
49 |
50 |
51 | def main() -> None:
52 | parser = argparse.ArgumentParser(
53 | description="Convert a Jupyter notebook to a Python script, preserving bash cells and '!' commands as subprocess calls."
54 | )
55 | parser.add_argument("input_nb", help="Path to the input .ipynb file")
56 | parser.add_argument("output_py", help="Path for the output .py script")
57 | args = parser.parse_args()
58 | convert_notebook(args.input_nb, args.output_py)
59 |
60 |
61 | if __name__ == "__main__":
62 | main()
63 |
--------------------------------------------------------------------------------
/ci/tests.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # don't use nbconvert or jupytext unless you're willing
4 | # to check each subprocess unit and validate that errors
5 | # aren't being consumed/hidden
6 | python ci/nb2py.py README.ipynb README.py # convert notebook to script
7 | python README.py # run generated script
8 | rm README.py # remove the generated script
9 |
--------------------------------------------------------------------------------
/clear_cell_nums.py:
--------------------------------------------------------------------------------
1 | from pathlib import Path
2 |
3 | import nbformat
4 |
5 |
6 | def clear_execution_numbers(nb_path):
7 | with open(nb_path, "r", encoding="utf-8") as f:
8 | nb = nbformat.read(f, as_version=4)
9 | for cell in nb["cells"]:
10 | if cell["cell_type"] == "code":
11 | cell["execution_count"] = None
12 | for output in cell["outputs"]:
13 | if "execution_count" in output:
14 | output["execution_count"] = None
15 | with open(nb_path, "w", encoding="utf-8") as f:
16 | nbformat.write(nb, f)
17 |
18 |
19 | if __name__ == "__main__":
20 | NOTEBOOK_DIR = Path(__file__).parent
21 | notebook_fps = list(NOTEBOOK_DIR.glob("**/*.ipynb"))
22 | for fp in notebook_fps:
23 | clear_execution_numbers(fp)
24 |
--------------------------------------------------------------------------------
/configs/aws.yaml:
--------------------------------------------------------------------------------
1 | # Head node
2 | head_node_type:
3 | name: head
4 | instance_type: m5.2xlarge
5 | resources:
6 | cpu: 8
7 |
8 | # Worker nodes
9 | auto_select_worker_config: true
10 | flags:
11 | allow-cross-zone-autoscaling: true
12 |
--------------------------------------------------------------------------------
/configs/gce.yaml:
--------------------------------------------------------------------------------
1 | # Head node
2 | head_node_type:
3 | name: head
4 | instance_type: n2-standard-8
5 | resources:
6 | cpu: 8
7 |
8 | # Worker nodes
9 | auto_select_worker_config: true
10 | flags:
11 | allow-cross-zone-autoscaling: true
12 |
--------------------------------------------------------------------------------
/containerfile:
--------------------------------------------------------------------------------
1 | FROM anyscale/ray-llm:2.44.1-py311-cu124
2 |
3 | RUN python3 -m pip install --no-cache-dir \
4 | "xgrammar==0.1.11" \
5 | "pynvml==12.0.0" \
6 | "hf_transfer==0.1.9" \
7 | "tensorboard==2.19.0" \
8 | "git+https://github.com/hiyouga/LLaMA-Factory.git@ac8c6fdd3ab7fb6372f231f238e6b8ba6a17eb16#egg=llamafactory"
9 |
10 | # Fast upload/download
11 | ENV HF_HUB_ENABLE_HF_TRANSFER=1
12 |
--------------------------------------------------------------------------------
/images/data_dashboard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anyscale/e2e-llm-workflows/a111b7547ea6b7af467183b5236efd8a841d50f7/images/data_dashboard.png
--------------------------------------------------------------------------------
/images/data_llm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anyscale/e2e-llm-workflows/a111b7547ea6b7af467183b5236efd8a841d50f7/images/data_llm.png
--------------------------------------------------------------------------------
/images/e2e_llm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anyscale/e2e-llm-workflows/a111b7547ea6b7af467183b5236efd8a841d50f7/images/e2e_llm.png
--------------------------------------------------------------------------------
/images/loss.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anyscale/e2e-llm-workflows/a111b7547ea6b7af467183b5236efd8a841d50f7/images/loss.png
--------------------------------------------------------------------------------
/images/serve_dashboard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anyscale/e2e-llm-workflows/a111b7547ea6b7af467183b5236efd8a841d50f7/images/serve_dashboard.png
--------------------------------------------------------------------------------
/images/serve_llm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anyscale/e2e-llm-workflows/a111b7547ea6b7af467183b5236efd8a841d50f7/images/serve_llm.png
--------------------------------------------------------------------------------
/images/train_dashboard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anyscale/e2e-llm-workflows/a111b7547ea6b7af467183b5236efd8a841d50f7/images/train_dashboard.png
--------------------------------------------------------------------------------
/lora_sft_ray.yaml:
--------------------------------------------------------------------------------
1 | ### model
2 | model_name_or_path: Qwen/Qwen2.5-7B-Instruct
3 | trust_remote_code: true
4 |
5 | ### method
6 | stage: sft
7 | do_train: true
8 | finetuning_type: lora
9 | lora_rank: 8
10 | lora_target: all
11 |
12 | ### dataset
13 | dataset: viggo-train
14 | dataset_dir: /mnt/cluster_storage/viggo # shared storage workers have access to
15 | template: qwen
16 | cutoff_len: 2048
17 | max_samples: 1000
18 | overwrite_cache: true
19 | preprocessing_num_workers: 16
20 | dataloader_num_workers: 4
21 |
22 | ### output
23 | output_dir: /mnt/cluster_storage/viggo/outputs # should be somewhere workers have access to (ex. s3, nfs)
24 | logging_steps: 10
25 | save_steps: 500
26 | plot_loss: true
27 | overwrite_output_dir: true
28 | save_only_model: false
29 |
30 | ### ray
31 | ray_run_name: lora_sft_ray
32 | ray_storage_path: /mnt/cluster_storage/viggo/saves # should be somewhere workers have access to (ex. s3, nfs)
33 | ray_num_workers: 4
34 | resources_per_worker:
35 | GPU: 1
36 | anyscale/accelerator_shape:4xL4: 0.001 # Use this to specify a specific node shape,
37 | # accelerator_type:L4: 1 # Or use this to simply specify a GPU type.
38 | # see https://docs.ray.io/en/master/ray-core/accelerator-types.html#accelerator-types for a full list of accelerator types
39 | placement_strategy: PACK
40 |
41 | ### train
42 | per_device_train_batch_size: 1
43 | gradient_accumulation_steps: 8
44 | learning_rate: 1.0e-4
45 | num_train_epochs: 3.0
46 | lr_scheduler_type: cosine
47 | warmup_ratio: 0.1
48 | bf16: true
49 | ddp_timeout: 180000000
50 | resume_from_checkpoint: null
51 |
52 | ### eval
53 | eval_dataset: viggo-val # uses same dataset_dir as training data
54 | # val_size: 0.1 # only if using part of training data for validation
55 | per_device_eval_batch_size: 1
56 | eval_strategy: steps
57 | eval_steps: 500
58 |
--------------------------------------------------------------------------------