├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── Flan-T5-XXL-multiple-nodes-training-and-deploy-on-SageMaker ├── deploy │ └── deploy-flan-t5.ipynb └── fine-tuning │ ├── DeepSpeed-Flan-T5-on-Sagemaker-multiple-nodes.ipynb │ └── src │ ├── T5_configz_and_code │ ├── configs │ │ └── ds_flan_t5_z3_config_bf16.json │ └── scripts │ │ ├── run_seq2seq_deepspeed.py │ │ └── torch_launch.sh │ ├── requirements.txt │ └── start.py ├── LICENSE └── README.md /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute to. 
As our projects use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /Flan-T5-XXL-multiple-nodes-training-and-deploy-on-SageMaker/deploy/deploy-flan-t5.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "d245b130-4eaf-4fa1-b7f8-7eb5e1e9a7eb", 6 | "metadata": { 7 | "tags": [] 8 | }, 9 | "source": [ 10 | "# Deploy Flan-T5 XXL on SageMaker\n", 11 | "\n", 12 | "Now we will deploy the model, which was trained on SageMaker with DeepSpeed on multiple nodes, to a SageMaker real-time endpoint." 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "id": "ea489d40-d76e-4f19-a8b0-5755f7b1f23d", 19 | "metadata": { 20 | "tags": [] 21 | }, 22 | "outputs": [], 23 | "source": [ 24 | "import sagemaker\n", 25 | "import boto3\n", 26 | "\n", 27 | "sess = sagemaker.Session()\n", 28 | "role = sagemaker.get_execution_role()\n", 29 | "\n", 30 | "print(f\"sagemaker role arn: {role}\")\n", 31 | "print(f\"sagemaker bucket: {sess.default_bucket()}\")\n", 32 | "print(f\"sagemaker session region: {sess.boto_region_name}\")\n" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "id": "7b79706d-a346-4072-878e-4e1205e4a4f5", 38 | "metadata": { 39 | "tags": [] 40 | }, 41 | "source": [ 42 | "We trained Flan-T5-XXL, and the model is saved in BF16 format. We will use Hugging Face Accelerate to speed up the model inference. " 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": null, 48 | "id": "b30d0c87-d3f7-4689-829e-d6b02fe9f60f", 49 | "metadata": { 50 | "tags": [] 51 | }, 52 | "outputs": [], 53 | "source": [ 54 | "!mkdir deploy_code" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "id": "572ed109-920f-4d6f-8215-e9f3e95e9718", 61 | "metadata": { 62 | "tags": [] 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "%%writefile deploy_code/requirements.txt\n", 67 | "accelerate==0.16.0\n", 68 | "transformers==4.26.0\n", 69 | "bitsandbytes==0.37.0" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "id": "1fd6d42e-beb4-4632-a3f2-cb2f483e560e", 75 | "metadata": { 76 | "tags": [] 77 | }, 78 | "source": [ 79 | "Now we configure the serving properties. We use Hugging Face Accelerate to speed up the model inference (set \"engine\" to \"Python\"). We will deploy on a g5.48xlarge instance (which has 8 GPUs), so option.tensor_parallel_degree is set to 8. 
Finally, please configure option.s3url to your model assets' S3 path; the trailing '/' is required (such as s3://your_bucket/flan-t5-xxl/model/)." 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "id": "3d366acc-b7cc-492b-952b-42db3d6ac75c", 86 | "metadata": { 87 | "tags": [] 88 | }, 89 | "outputs": [], 90 | "source": [ 91 | "%%writefile deploy_code/serving.properties\n", 92 | "engine=Python\n", 93 | "option.tensor_parallel_degree=8\n", 94 | "option.s3url=s3://your_bucket/flan-t5-xxl/model/" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "id": "7e35c899-9c1a-443b-89b0-6f4ed27593c5", 101 | "metadata": { 102 | "tags": [] 103 | }, 104 | "outputs": [], 105 | "source": [ 106 | "%%writefile deploy_code/model.py\n", 107 | "from djl_python import Input, Output\n", 108 | "import torch\n", 109 | "import logging\n", 110 | "import math\n", 111 | "import os\n", 112 | "from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer\n", 113 | "\n", 114 | "\n", 115 | "def load_model(properties):\n", 116 | " tensor_parallel = properties[\"tensor_parallel_degree\"]\n", 117 | " model_location = properties['model_dir']\n", 118 | " if \"model_id\" in properties:\n", 119 | " model_location = properties['model_id']\n", 120 | " logging.info(f\"Loading model in {model_location}\")\n", 121 | " \n", 122 | " tokenizer = AutoTokenizer.from_pretrained(model_location)\n", 123 | " \n", 124 | " model = AutoModelForSeq2SeqLM.from_pretrained(\n", 125 | " model_location, \n", 126 | " device_map=\"balanced_low_0\", \n", 127 | " #load_in_8bit=True\n", 128 | " )\n", 129 | " model.requires_grad_(False)\n", 130 | " model.eval()\n", 131 | " \n", 132 | " return model, tokenizer\n", 133 | "\n", 134 | "\n", 135 | "model = None\n", 136 | "tokenizer = None\n", 137 | "generator = None\n", 138 | "\n", 139 | "\n", 140 | "def handle(inputs: Input):\n", 141 | " global model, tokenizer\n", 142 | " if not model:\n", 143 | " model, tokenizer = load_model(inputs.get_properties())\n", 144 | "\n", 145 | " if inputs.is_empty():\n", 146 | " return None\n", 147 | " data = inputs.get_as_json()\n", 148 | " \n", 149 | " input_sentences = data[\"inputs\"]\n", 150 | " params = data.get(\"parameters\")  # .get() avoids a KeyError when \"parameters\" is omitted\n", 151 | " \n", 152 | " # preprocess\n", 153 | " input_ids = tokenizer(input_sentences, return_tensors=\"pt\").input_ids\n", 154 | " # pass inputs with all kwargs in data\n", 155 | " if params is not None:\n", 156 | " outputs = model.generate(input_ids, **params)\n", 157 | " else:\n", 158 | " outputs = model.generate(input_ids)\n", 159 | "\n", 160 | " # postprocess the prediction\n", 161 | " prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)\n", 162 | " \n", 163 | " result = {\"outputs\": prediction}\n", 164 | " return Output().add_as_json(result)" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "id": "23a3125b-fddf-458e-ad21-9fa61d923606", 170 | "metadata": {}, 171 | "source": [ 172 | "We will use the LMI (Large Model Inference) container on SageMaker to serve the LLM. The container wires the serving.properties entries into the `properties` dict that our `load_model` function receives, as illustrated below."
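, "\n", "\n", "A rough illustration of that wiring (the local download path below is hypothetical; the container manages the real one):\n", "\n", "```python\n", "# what load_model() effectively receives at startup, given the serving.properties above\n", "properties = {\n", "    \"tensor_parallel_degree\": \"8\",             # from option.tensor_parallel_degree\n", "    \"model_dir\": \"/path/to/downloaded/model\",  # populated from the contents of option.s3url\n", "}\n", "model, tokenizer = load_model(properties)\n", "```"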
173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "id": "879f42da-c2dd-47a3-96ad-43dde83fdfd5", 179 | "metadata": { 180 | "tags": [] 181 | }, 182 | "outputs": [], 183 | "source": [ 184 | "import sagemaker\n", 185 | "\n", 186 | "sess = sagemaker.Session()\n", 187 | "region = sess.boto_region_name\n", 188 | "\n", 189 | "inference_image_uri = (\n", 190 | " f\"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.21.0-deepspeed0.8.0-cu117\"\n", 191 | ")\n", 192 | "\n", 193 | "print(f\"Image going to be used is ---- > {inference_image_uri}\")" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "id": "7a7070fe-3a46-4f28-976b-33da12a1880f", 200 | "metadata": { 201 | "tags": [] 202 | }, 203 | "outputs": [], 204 | "source": [ 205 | "!rm -f model-liang.tar.gz\n", 206 | "!tar czvf model-liang.tar.gz -C deploy_code ." 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "id": "265902c9-6406-45b7-b75f-2c1c5d86c65b", 213 | "metadata": { 214 | "tags": [] 215 | }, 216 | "outputs": [], 217 | "source": [ 218 | "s3_code_prefix = 'code_flan_t5_LMI_liang'\n", 219 | "bucket = sess.default_bucket() \n", 220 | "s3_code_artifact = sess.upload_data(\"model-liang.tar.gz\", bucket, s3_code_prefix)\n", 221 | "print(f\"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}\")" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "id": "f0d7c064-964d-4c09-8592-2806e6325ed0", 228 | "metadata": { 229 | "tags": [] 230 | }, 231 | "outputs": [], 232 | "source": [ 233 | "from sagemaker.utils import name_from_base\n", 234 | "import boto3\n", 235 | "sm_client = boto3.client(\"sagemaker\")\n", 236 | "smr_client = boto3.client(\"sagemaker-runtime\")\n", 237 | "\n", 238 | "model_name = name_from_base(f\"flan-t5-xxl-accelerate-LMI\")\n", 239 | "print(model_name)\n", 240 | "print(f\"Image going to be used is ---- > {inference_image_uri}\")\n", 241 | "\n", 242 | "create_model_response = sm_client.create_model(\n", 243 | " ModelName=model_name,\n", 244 | " ExecutionRoleArn=role,\n", 245 | " PrimaryContainer={\n", 246 | " \"Image\": inference_image_uri,\n", 247 | " \"ModelDataUrl\": s3_code_artifact\n", 248 | " },\n", 249 | " \n", 250 | ")\n", 251 | "model_arn = create_model_response[\"ModelArn\"]\n", 252 | "\n", 253 | "print(f\"Created Model: {model_arn}\")" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": null, 259 | "id": "76f48151-f61d-4e27-8671-12108a9a882c", 260 | "metadata": { 261 | "tags": [] 262 | }, 263 | "outputs": [], 264 | "source": [ 265 | "endpoint_config_name = f\"{model_name}-config-88\"\n", 266 | "endpoint_name = f\"{model_name}-endpoint\"\n", 267 | "\n", 268 | "endpoint_config_response = sm_client.create_endpoint_config(\n", 269 | " EndpointConfigName=endpoint_config_name,\n", 270 | " ProductionVariants=[\n", 271 | " {\n", 272 | " \"VariantName\": \"variant1\",\n", 273 | " \"ModelName\": model_name,\n", 274 | " \"InstanceType\": \"ml.g5.48xlarge\",\n", 275 | " \"InitialInstanceCount\": 1,\n", 276 | " #\"ModelDataDownloadTimeoutInSeconds\": 2400,\n", 277 | " \"ContainerStartupHealthCheckTimeoutInSeconds\": 2400,\n", 278 | " },\n", 279 | " ],\n", 280 | ")\n", 281 | "endpoint_config_response" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": null, 287 | "id": "04c1882e-32c3-41c6-befd-ad9907d0b340", 288 | "metadata": { 289 | "tags": [] 290 | }, 291 | "outputs": [], 292 | "source": [ 293 | "create_endpoint_response 
= sm_client.create_endpoint(\n", 294 | " EndpointName=f\"{endpoint_name}\", EndpointConfigName=endpoint_config_name\n", 295 | ")\n", 296 | "print(f\"Created Endpoint: {create_endpoint_response['EndpointArn']}\")" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": null, 302 | "id": "4c9520f0-c04b-4f59-8eab-9a5b287112a7", 303 | "metadata": { 304 | "tags": [] 305 | }, 306 | "outputs": [], 307 | "source": [ 308 | "import time\n", 309 | "\n", 310 | "resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", 311 | "status = resp[\"EndpointStatus\"]\n", 312 | "print(\"Status: \" + status)\n", 313 | "\n", 314 | "while status == \"Creating\":\n", 315 | " time.sleep(60)\n", 316 | " resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", 317 | " status = resp[\"EndpointStatus\"]\n", 318 | " print(\"Status: \" + status)\n", 319 | "\n", 320 | "print(\"Arn: \" + resp[\"EndpointArn\"])\n", 321 | "print(\"Status: \" + status)" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "id": "d96b9462-a2d8-4cda-a8cd-e068992f3dee", 327 | "metadata": {}, 328 | "source": [ 329 | "Use the low-level boto3 API to invoke the endpoint and generate text." 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "id": "3900294a-bea9-481d-a822-e72b69ce18fc", 336 | "metadata": { 337 | "tags": [] 338 | }, 339 | "outputs": [], 340 | "source": [ 341 | "%%time\n", 342 | "import json\n", 343 | "import boto3\n", 344 | "\n", 345 | "smr_client = boto3.client(\"sagemaker-runtime\")\n", 346 | "\n", 347 | "prompts = \"\"\"Summarize the following news article:\n", 348 | "Peter and Elizabeth took a taxi to attend the night party in the city. While at the party, Elizabeth collapsed and was rushed to the hospital.\n", 349 | "Since she was diagnosed with a brain injury, the doctor told Peter to stay beside her until she gets well. 
Therefore, Peter stayed with her at the hospital for 3 days without leaving.\n", 350 | "Summary:\n", 351 | "\"\"\"\n", 352 | "\n", 353 | "parameters = {\n", 354 | " #\"early_stopping\": True,\n", 355 | " #\"length_penalty\": 2.0,\n", 356 | " \"max_new_tokens\": 50,\n", 357 | " \"temperature\": 0,\n", 358 | " \"min_length\": 10,\n", 359 | " \"no_repeat_ngram_size\": 2,\n", 360 | "}\n", 361 | "\n", 362 | "\n", 363 | "response_model = smr_client.invoke_endpoint(\n", 364 | " EndpointName=endpoint_name,\n", 365 | " Body=json.dumps(\n", 366 | " {\n", 367 | " \"inputs\": prompts,\n", 368 | " \"parameters\": parameters\n", 369 | " }\n", 370 | " ),\n", 371 | " ContentType=\"application/json\",\n", 372 | " )\n", 373 | "\n", 374 | "response_model['Body'].read().decode('utf8')" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "id": "0d5acf45-1187-4cce-a57f-9cc499891f52", 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [] 384 | } 385 | ], 386 | "metadata": { 387 | "availableInstances": [ 388 | { 389 | "_defaultOrder": 0, 390 | "_isFastLaunch": true, 391 | "category": "General purpose", 392 | "gpuNum": 0, 393 | "memoryGiB": 4, 394 | "name": "ml.t3.medium", 395 | "vcpuNum": 2 396 | }, 397 | { 398 | "_defaultOrder": 1, 399 | "_isFastLaunch": false, 400 | "category": "General purpose", 401 | "gpuNum": 0, 402 | "memoryGiB": 8, 403 | "name": "ml.t3.large", 404 | "vcpuNum": 2 405 | }, 406 | { 407 | "_defaultOrder": 2, 408 | "_isFastLaunch": false, 409 | "category": "General purpose", 410 | "gpuNum": 0, 411 | "memoryGiB": 16, 412 | "name": "ml.t3.xlarge", 413 | "vcpuNum": 4 414 | }, 415 | { 416 | "_defaultOrder": 3, 417 | "_isFastLaunch": false, 418 | "category": "General purpose", 419 | "gpuNum": 0, 420 | "memoryGiB": 32, 421 | "name": "ml.t3.2xlarge", 422 | "vcpuNum": 8 423 | }, 424 | { 425 | "_defaultOrder": 4, 426 | "_isFastLaunch": true, 427 | "category": "General purpose", 428 | "gpuNum": 0, 429 | "memoryGiB": 8, 430 | "name": "ml.m5.large", 431 | "vcpuNum": 2 432 | }, 433 | { 434 | "_defaultOrder": 5, 435 | "_isFastLaunch": false, 436 | "category": "General purpose", 437 | "gpuNum": 0, 438 | "memoryGiB": 16, 439 | "name": "ml.m5.xlarge", 440 | "vcpuNum": 4 441 | }, 442 | { 443 | "_defaultOrder": 6, 444 | "_isFastLaunch": false, 445 | "category": "General purpose", 446 | "gpuNum": 0, 447 | "memoryGiB": 32, 448 | "name": "ml.m5.2xlarge", 449 | "vcpuNum": 8 450 | }, 451 | { 452 | "_defaultOrder": 7, 453 | "_isFastLaunch": false, 454 | "category": "General purpose", 455 | "gpuNum": 0, 456 | "memoryGiB": 64, 457 | "name": "ml.m5.4xlarge", 458 | "vcpuNum": 16 459 | }, 460 | { 461 | "_defaultOrder": 8, 462 | "_isFastLaunch": false, 463 | "category": "General purpose", 464 | "gpuNum": 0, 465 | "memoryGiB": 128, 466 | "name": "ml.m5.8xlarge", 467 | "vcpuNum": 32 468 | }, 469 | { 470 | "_defaultOrder": 9, 471 | "_isFastLaunch": false, 472 | "category": "General purpose", 473 | "gpuNum": 0, 474 | "memoryGiB": 192, 475 | "name": "ml.m5.12xlarge", 476 | "vcpuNum": 48 477 | }, 478 | { 479 | "_defaultOrder": 10, 480 | "_isFastLaunch": false, 481 | "category": "General purpose", 482 | "gpuNum": 0, 483 | "memoryGiB": 256, 484 | "name": "ml.m5.16xlarge", 485 | "vcpuNum": 64 486 | }, 487 | { 488 | "_defaultOrder": 11, 489 | "_isFastLaunch": false, 490 | "category": "General purpose", 491 | "gpuNum": 0, 492 | "memoryGiB": 384, 493 | "name": "ml.m5.24xlarge", 494 | "vcpuNum": 96 495 | }, 496 | { 497 | "_defaultOrder": 12, 498 | "_isFastLaunch": false, 499 | "category": 
"General purpose", 500 | "gpuNum": 0, 501 | "memoryGiB": 8, 502 | "name": "ml.m5d.large", 503 | "vcpuNum": 2 504 | }, 505 | { 506 | "_defaultOrder": 13, 507 | "_isFastLaunch": false, 508 | "category": "General purpose", 509 | "gpuNum": 0, 510 | "memoryGiB": 16, 511 | "name": "ml.m5d.xlarge", 512 | "vcpuNum": 4 513 | }, 514 | { 515 | "_defaultOrder": 14, 516 | "_isFastLaunch": false, 517 | "category": "General purpose", 518 | "gpuNum": 0, 519 | "memoryGiB": 32, 520 | "name": "ml.m5d.2xlarge", 521 | "vcpuNum": 8 522 | }, 523 | { 524 | "_defaultOrder": 15, 525 | "_isFastLaunch": false, 526 | "category": "General purpose", 527 | "gpuNum": 0, 528 | "memoryGiB": 64, 529 | "name": "ml.m5d.4xlarge", 530 | "vcpuNum": 16 531 | }, 532 | { 533 | "_defaultOrder": 16, 534 | "_isFastLaunch": false, 535 | "category": "General purpose", 536 | "gpuNum": 0, 537 | "memoryGiB": 128, 538 | "name": "ml.m5d.8xlarge", 539 | "vcpuNum": 32 540 | }, 541 | { 542 | "_defaultOrder": 17, 543 | "_isFastLaunch": false, 544 | "category": "General purpose", 545 | "gpuNum": 0, 546 | "memoryGiB": 192, 547 | "name": "ml.m5d.12xlarge", 548 | "vcpuNum": 48 549 | }, 550 | { 551 | "_defaultOrder": 18, 552 | "_isFastLaunch": false, 553 | "category": "General purpose", 554 | "gpuNum": 0, 555 | "memoryGiB": 256, 556 | "name": "ml.m5d.16xlarge", 557 | "vcpuNum": 64 558 | }, 559 | { 560 | "_defaultOrder": 19, 561 | "_isFastLaunch": false, 562 | "category": "General purpose", 563 | "gpuNum": 0, 564 | "memoryGiB": 384, 565 | "name": "ml.m5d.24xlarge", 566 | "vcpuNum": 96 567 | }, 568 | { 569 | "_defaultOrder": 20, 570 | "_isFastLaunch": true, 571 | "category": "Compute optimized", 572 | "gpuNum": 0, 573 | "memoryGiB": 4, 574 | "name": "ml.c5.large", 575 | "vcpuNum": 2 576 | }, 577 | { 578 | "_defaultOrder": 21, 579 | "_isFastLaunch": false, 580 | "category": "Compute optimized", 581 | "gpuNum": 0, 582 | "memoryGiB": 8, 583 | "name": "ml.c5.xlarge", 584 | "vcpuNum": 4 585 | }, 586 | { 587 | "_defaultOrder": 22, 588 | "_isFastLaunch": false, 589 | "category": "Compute optimized", 590 | "gpuNum": 0, 591 | "memoryGiB": 16, 592 | "name": "ml.c5.2xlarge", 593 | "vcpuNum": 8 594 | }, 595 | { 596 | "_defaultOrder": 23, 597 | "_isFastLaunch": false, 598 | "category": "Compute optimized", 599 | "gpuNum": 0, 600 | "memoryGiB": 32, 601 | "name": "ml.c5.4xlarge", 602 | "vcpuNum": 16 603 | }, 604 | { 605 | "_defaultOrder": 24, 606 | "_isFastLaunch": false, 607 | "category": "Compute optimized", 608 | "gpuNum": 0, 609 | "memoryGiB": 72, 610 | "name": "ml.c5.9xlarge", 611 | "vcpuNum": 36 612 | }, 613 | { 614 | "_defaultOrder": 25, 615 | "_isFastLaunch": false, 616 | "category": "Compute optimized", 617 | "gpuNum": 0, 618 | "memoryGiB": 96, 619 | "name": "ml.c5.12xlarge", 620 | "vcpuNum": 48 621 | }, 622 | { 623 | "_defaultOrder": 26, 624 | "_isFastLaunch": false, 625 | "category": "Compute optimized", 626 | "gpuNum": 0, 627 | "memoryGiB": 144, 628 | "name": "ml.c5.18xlarge", 629 | "vcpuNum": 72 630 | }, 631 | { 632 | "_defaultOrder": 27, 633 | "_isFastLaunch": false, 634 | "category": "Compute optimized", 635 | "gpuNum": 0, 636 | "memoryGiB": 192, 637 | "name": "ml.c5.24xlarge", 638 | "vcpuNum": 96 639 | }, 640 | { 641 | "_defaultOrder": 28, 642 | "_isFastLaunch": true, 643 | "category": "Accelerated computing", 644 | "gpuNum": 1, 645 | "memoryGiB": 16, 646 | "name": "ml.g4dn.xlarge", 647 | "vcpuNum": 4 648 | }, 649 | { 650 | "_defaultOrder": 29, 651 | "_isFastLaunch": false, 652 | "category": "Accelerated computing", 653 | "gpuNum": 1, 654 | 
"memoryGiB": 32, 655 | "name": "ml.g4dn.2xlarge", 656 | "vcpuNum": 8 657 | }, 658 | { 659 | "_defaultOrder": 30, 660 | "_isFastLaunch": false, 661 | "category": "Accelerated computing", 662 | "gpuNum": 1, 663 | "memoryGiB": 64, 664 | "name": "ml.g4dn.4xlarge", 665 | "vcpuNum": 16 666 | }, 667 | { 668 | "_defaultOrder": 31, 669 | "_isFastLaunch": false, 670 | "category": "Accelerated computing", 671 | "gpuNum": 1, 672 | "memoryGiB": 128, 673 | "name": "ml.g4dn.8xlarge", 674 | "vcpuNum": 32 675 | }, 676 | { 677 | "_defaultOrder": 32, 678 | "_isFastLaunch": false, 679 | "category": "Accelerated computing", 680 | "gpuNum": 4, 681 | "memoryGiB": 192, 682 | "name": "ml.g4dn.12xlarge", 683 | "vcpuNum": 48 684 | }, 685 | { 686 | "_defaultOrder": 33, 687 | "_isFastLaunch": false, 688 | "category": "Accelerated computing", 689 | "gpuNum": 1, 690 | "memoryGiB": 256, 691 | "name": "ml.g4dn.16xlarge", 692 | "vcpuNum": 64 693 | }, 694 | { 695 | "_defaultOrder": 34, 696 | "_isFastLaunch": false, 697 | "category": "Accelerated computing", 698 | "gpuNum": 1, 699 | "memoryGiB": 61, 700 | "name": "ml.p3.2xlarge", 701 | "vcpuNum": 8 702 | }, 703 | { 704 | "_defaultOrder": 35, 705 | "_isFastLaunch": false, 706 | "category": "Accelerated computing", 707 | "gpuNum": 4, 708 | "memoryGiB": 244, 709 | "name": "ml.p3.8xlarge", 710 | "vcpuNum": 32 711 | }, 712 | { 713 | "_defaultOrder": 36, 714 | "_isFastLaunch": false, 715 | "category": "Accelerated computing", 716 | "gpuNum": 8, 717 | "memoryGiB": 488, 718 | "name": "ml.p3.16xlarge", 719 | "vcpuNum": 64 720 | }, 721 | { 722 | "_defaultOrder": 37, 723 | "_isFastLaunch": false, 724 | "category": "Accelerated computing", 725 | "gpuNum": 8, 726 | "memoryGiB": 768, 727 | "name": "ml.p3dn.24xlarge", 728 | "vcpuNum": 96 729 | }, 730 | { 731 | "_defaultOrder": 38, 732 | "_isFastLaunch": false, 733 | "category": "Memory Optimized", 734 | "gpuNum": 0, 735 | "memoryGiB": 16, 736 | "name": "ml.r5.large", 737 | "vcpuNum": 2 738 | }, 739 | { 740 | "_defaultOrder": 39, 741 | "_isFastLaunch": false, 742 | "category": "Memory Optimized", 743 | "gpuNum": 0, 744 | "memoryGiB": 32, 745 | "name": "ml.r5.xlarge", 746 | "vcpuNum": 4 747 | }, 748 | { 749 | "_defaultOrder": 40, 750 | "_isFastLaunch": false, 751 | "category": "Memory Optimized", 752 | "gpuNum": 0, 753 | "memoryGiB": 64, 754 | "name": "ml.r5.2xlarge", 755 | "vcpuNum": 8 756 | }, 757 | { 758 | "_defaultOrder": 41, 759 | "_isFastLaunch": false, 760 | "category": "Memory Optimized", 761 | "gpuNum": 0, 762 | "memoryGiB": 128, 763 | "name": "ml.r5.4xlarge", 764 | "vcpuNum": 16 765 | }, 766 | { 767 | "_defaultOrder": 42, 768 | "_isFastLaunch": false, 769 | "category": "Memory Optimized", 770 | "gpuNum": 0, 771 | "memoryGiB": 256, 772 | "name": "ml.r5.8xlarge", 773 | "vcpuNum": 32 774 | }, 775 | { 776 | "_defaultOrder": 43, 777 | "_isFastLaunch": false, 778 | "category": "Memory Optimized", 779 | "gpuNum": 0, 780 | "memoryGiB": 384, 781 | "name": "ml.r5.12xlarge", 782 | "vcpuNum": 48 783 | }, 784 | { 785 | "_defaultOrder": 44, 786 | "_isFastLaunch": false, 787 | "category": "Memory Optimized", 788 | "gpuNum": 0, 789 | "memoryGiB": 512, 790 | "name": "ml.r5.16xlarge", 791 | "vcpuNum": 64 792 | }, 793 | { 794 | "_defaultOrder": 45, 795 | "_isFastLaunch": false, 796 | "category": "Memory Optimized", 797 | "gpuNum": 0, 798 | "memoryGiB": 768, 799 | "name": "ml.r5.24xlarge", 800 | "vcpuNum": 96 801 | }, 802 | { 803 | "_defaultOrder": 46, 804 | "_isFastLaunch": false, 805 | "category": "Accelerated computing", 806 | "gpuNum": 1, 807 | 
"memoryGiB": 16, 808 | "name": "ml.g5.xlarge", 809 | "vcpuNum": 4 810 | }, 811 | { 812 | "_defaultOrder": 47, 813 | "_isFastLaunch": false, 814 | "category": "Accelerated computing", 815 | "gpuNum": 1, 816 | "memoryGiB": 32, 817 | "name": "ml.g5.2xlarge", 818 | "vcpuNum": 8 819 | }, 820 | { 821 | "_defaultOrder": 48, 822 | "_isFastLaunch": false, 823 | "category": "Accelerated computing", 824 | "gpuNum": 1, 825 | "memoryGiB": 64, 826 | "name": "ml.g5.4xlarge", 827 | "vcpuNum": 16 828 | }, 829 | { 830 | "_defaultOrder": 49, 831 | "_isFastLaunch": false, 832 | "category": "Accelerated computing", 833 | "gpuNum": 1, 834 | "memoryGiB": 128, 835 | "name": "ml.g5.8xlarge", 836 | "vcpuNum": 32 837 | }, 838 | { 839 | "_defaultOrder": 50, 840 | "_isFastLaunch": false, 841 | "category": "Accelerated computing", 842 | "gpuNum": 1, 843 | "memoryGiB": 256, 844 | "name": "ml.g5.16xlarge", 845 | "vcpuNum": 64 846 | }, 847 | { 848 | "_defaultOrder": 51, 849 | "_isFastLaunch": false, 850 | "category": "Accelerated computing", 851 | "gpuNum": 4, 852 | "memoryGiB": 192, 853 | "name": "ml.g5.12xlarge", 854 | "vcpuNum": 48 855 | }, 856 | { 857 | "_defaultOrder": 52, 858 | "_isFastLaunch": false, 859 | "category": "Accelerated computing", 860 | "gpuNum": 4, 861 | "memoryGiB": 384, 862 | "name": "ml.g5.24xlarge", 863 | "vcpuNum": 96 864 | }, 865 | { 866 | "_defaultOrder": 53, 867 | "_isFastLaunch": false, 868 | "category": "Accelerated computing", 869 | "gpuNum": 8, 870 | "memoryGiB": 768, 871 | "name": "ml.g5.48xlarge", 872 | "vcpuNum": 192 873 | } 874 | ], 875 | "instance_type": "ml.m5.large", 876 | "kernelspec": { 877 | "display_name": "Python 3 (Data Science)", 878 | "language": "python", 879 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" 880 | }, 881 | "language_info": { 882 | "codemirror_mode": { 883 | "name": "ipython", 884 | "version": 3 885 | }, 886 | "file_extension": ".py", 887 | "mimetype": "text/x-python", 888 | "name": "python", 889 | "nbconvert_exporter": "python", 890 | "pygments_lexer": "ipython3", 891 | "version": "3.7.10" 892 | } 893 | }, 894 | "nbformat": 4, 895 | "nbformat_minor": 5 896 | } 897 | -------------------------------------------------------------------------------- /Flan-T5-XXL-multiple-nodes-training-and-deploy-on-SageMaker/fine-tuning/DeepSpeed-Flan-T5-on-Sagemaker-multiple-nodes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Fine-tune FLAN-T5 XXL on Multiple nodes using DeepSpeed on Amazon SageMaker \n", 8 | "\n", 9 | "FLAN-T5 is an enhanced version of T5 that has been fine-tuned in a mixture of tasks, or simple words, a better T5 model in any aspect. FLAN-T5 outperforms T5 by double-digit improvements for the same number of parameters.\n", 10 | "\n", 11 | "This repo will show how to fine-tune FLAN-T5 XXL(11B) on multiple nodes using [DeepSpeed ZeRO](https://www.deepspeed.ai/tutorials/zero/) on Amazon SageMaker. And the repo is tested successfully on Data Science image and Python 3 kernel of Sagemaker studio with ml.m5.large kernel gateway instance in us-east-1 region.\n", 12 | "\n", 13 | "It is structured as follows:\n", 14 | "1. process dataset and upload to S3\n", 15 | "2. prepare training script and deepspeed launcher\n", 16 | "3. 
Fine-tune FLAN-T5 XXL on Amazon SageMaker\n", 17 | "\n", 18 | "Before we start, let’s install the required libraries and make sure we have the correct permissions to access S3." 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "metadata": { 25 | "tags": [] 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "!pip install \"transformers==4.26.0\" \"datasets[s3]==2.9.0\" sagemaker --upgrade" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "If you are going to use Sagemaker in a local environment. You need access to an IAM Role with the required permissions for Sagemaker. You can find [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) more about it.\n", 37 | "\n" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "metadata": { 44 | "tags": [] 45 | }, 46 | "outputs": [], 47 | "source": [ 48 | "import sagemaker\n", 49 | "from sagemaker import get_execution_role\n", 50 | "import boto3\n", 51 | "\n", 52 | "sess = sagemaker.Session()\n", 53 | "role = get_execution_role()\n", 54 | "\n", 55 | "print(f\"sagemaker role arn: {role}\")\n", 56 | "print(f\"sagemaker bucket: {sess.default_bucket()}\")\n", 57 | "print(f\"sagemaker session region: {sess.boto_region_name}\")" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "## 1. process dataset and upload to S3\n", 65 | "\n", 66 | "We prepare a dataset on the [CNN Dailymail Dataset](https://huggingface.co/datasets/cnn_dailymail). \n" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": { 73 | "tags": [] 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "# experiment config\n", 78 | "model_id = \"google/flan-t5-xxl\" # Hugging Face Model Id\n", 79 | "dataset_id = \"cnn_dailymail\" # Hugging Face Dataset Id\n", 80 | "dataset_config = \"3.0.0\" # config/verison of the dataset\n", 81 | "save_dataset_path = \"data\" # local path to save processed dataset\n", 82 | "text_column = \"article\" # column of input text is\n", 83 | "summary_column = \"highlights\" # column of the output text \n", 84 | "# custom instruct prompt start\n", 85 | "prompt_template = f\"Summarize the following news article:\\n{{input}}\\nSummary:\\n\"" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "We process (tokenize) the dataset, upload to s3 and pass it into our managed Training job." 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": { 99 | "tags": [] 100 | }, 101 | "outputs": [], 102 | "source": [ 103 | "from datasets import load_dataset\n", 104 | "from transformers import AutoTokenizer\n", 105 | "import numpy as np \n", 106 | "\n", 107 | "dataset = load_dataset(dataset_id,name=dataset_config)\n", 108 | "tokenizer = AutoTokenizer.from_pretrained(model_id)\n", 109 | "\n", 110 | "print(f\"Train dataset size: {len(dataset['train'])}\")\n", 111 | "print(f\"Test dataset size: {len(dataset['test'])}\")\n", 112 | "\n", 113 | "# Train dataset size: 287113\n", 114 | "# Test dataset size: 11490" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "We defined a `prompt_template` in our config, which we will use to construct an instruct prompt for better performance of our model. Our `prompt_template` has a “fixed” start and end, and our document is in the middle. 
This means we need to ensure that the “fixed” template parts plus the document do not exceed the max length of the model. Therefore we calculate the max length of our document, which we will later use for padding and truncation." 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": null, 127 | "metadata": { 128 | "tags": [] 129 | }, 130 | "outputs": [], 131 | "source": [ 132 | "prompt_length = len(tokenizer(prompt_template.format(input=\"\"))[\"input_ids\"])\n", 133 | "max_sample_length = tokenizer.model_max_length - prompt_length\n", 134 | "print(f\"Prompt length: {prompt_length}\")\n", 135 | "print(f\"Max input length: {max_sample_length}\")\n", 136 | "\n", 137 | "# Prompt length: 12\n", 138 | "# Max input length: 500" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "We now know that our documents can be up to 500 tokens long and still fit our `prompt_template` correctly. In addition to our input, we need a better understanding of our “target” sequence length, i.e., how long the summaries in our dataset are. Therefore we iterate over the dataset and calculate the max input length (capped at 500) and the max target length (this takes a few minutes)." 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": { 152 | "tags": [] 153 | }, 154 | "outputs": [], 155 | "source": [ 156 | "from datasets import concatenate_datasets\n", 157 | "import numpy as np\n", 158 | "\n", 159 | "# The maximum total input sequence length after tokenization. \n", 160 | "# Sequences longer than this will be truncated, sequences shorter will be padded.\n", 161 | "tokenized_inputs = concatenate_datasets([dataset[\"train\"], dataset[\"test\"]]).map(lambda x: tokenizer(x[text_column], truncation=True), batched=True, remove_columns=[text_column, summary_column])\n", 162 | "max_source_length = max([len(x) for x in tokenized_inputs[\"input_ids\"]])\n", 163 | "max_source_length = min(max_source_length, max_sample_length)\n", 164 | "print(f\"Max source length: {max_source_length}\")\n", 165 | "\n", 166 | "# The maximum total sequence length for target text after tokenization. \n", 167 | "# Sequences longer than this will be truncated, sequences shorter will be padded.\n", 168 | "tokenized_targets = concatenate_datasets([dataset[\"train\"], dataset[\"test\"]]).map(lambda x: tokenizer(x[summary_column], truncation=True), batched=True, remove_columns=[text_column, summary_column])\n", 169 | "target_lengths = [len(x) for x in tokenized_targets[\"input_ids\"]]\n", 170 | "# use 95th percentile as max target length\n", 171 | "max_target_length = int(np.percentile(target_lengths, 95))\n", 172 | "print(f\"Max target length: {max_target_length}\")" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "We now have everything needed to process our dataset." 
180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "metadata": { 186 | "tags": [] 187 | }, 188 | "outputs": [], 189 | "source": [ 190 | "def preprocess_function(sample, padding=\"max_length\"):\n", 191 | " # create the prompted inputs\n", 192 | " inputs = [prompt_template.format(input=item) for item in sample[text_column]]\n", 193 | "\n", 194 | " # tokenize inputs\n", 195 | " model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)\n", 196 | "\n", 197 | " # Tokenize targets with the `text_target` keyword argument\n", 198 | " labels = tokenizer(text_target=sample[summary_column], max_length=max_target_length, padding=padding, truncation=True)\n", 199 | "\n", 200 | " # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore\n", 201 | " # padding in the loss.\n", 202 | " if padding == \"max_length\":\n", 203 | " labels[\"input_ids\"] = [\n", 204 | " [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels[\"input_ids\"]\n", 205 | " ]\n", 206 | "\n", 207 | " model_inputs[\"labels\"] = labels[\"input_ids\"]\n", 208 | " return model_inputs\n", 209 | "\n", 210 | "# process dataset\n", 211 | "tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=list(dataset[\"train\"].features))" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": {}, 217 | "source": [ 218 | "After we have processed the datasets, we are going to use the new [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) to upload our dataset to S3. We are using the `sess.default_bucket()`, adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script." 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "metadata": { 225 | "tags": [] 226 | }, 227 | "outputs": [], 228 | "source": [ 229 | "# save train_dataset to s3\n", 230 | "training_input_path = f's3://{sess.default_bucket()}/processed-404/{dataset_id}/train'\n", 231 | "tokenized_dataset[\"train\"].save_to_disk(training_input_path)\n", 232 | "\n", 233 | "# save test_dataset to s3\n", 234 | "test_input_path = f's3://{sess.default_bucket()}/processed-404/{dataset_id}/test'\n", 235 | "tokenized_dataset[\"test\"].save_to_disk(test_input_path)\n", 236 | "\n", 237 | "\n", 238 | "print(\"uploaded data to:\")\n", 239 | "print(f\"training dataset to: {training_input_path}\")\n", 240 | "print(f\"test dataset to: {test_input_path}\")" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "## 2. prepare training script and deepspeed launcher\n", 248 | "\n", 249 | "Here we use torch.distributed.launch to launch DeepSpeed on multiple nodes. First, we use start.py to configure some environment variables and invoke the shell script torch_launch.sh. Second, torch_launch.sh configures all of the parameters required for both torch.distributed.launch and the training script run_seq2seq_deepspeed.py.\n", 250 | "In addition, we create a DeepSpeed config file named ds_flan_t5_z3_config_bf16.json to configure our training setup. \n", 251 | "\n", 252 | "We are going to use p4d.24xlarge AWS EC2 instances, each with 8x NVIDIA A100 40GB GPUs. This means we can leverage `bf16`, which reduces the memory footprint of the model by almost 2x and allows us to train efficiently without offloading.
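\n\nAs a rough sketch only (the actual start.py and torch_launch.sh live under src/ and are not reproduced here; the port is illustrative), each node runs launch logic along these lines:\n\n```python\nimport json\nimport os\nimport subprocess\n\n# SageMaker exposes the cluster layout through SM_HOSTS / SM_CURRENT_HOST\nhosts = json.loads(os.environ[\"SM_HOSTS\"])\ncurrent_host = os.environ[\"SM_CURRENT_HOST\"]\n\n# the first node acts as the rendezvous master for torch.distributed.launch\nos.environ[\"MASTER_ADDR\"] = hosts[0]\nos.environ[\"MASTER_PORT\"] = \"7777\"  # illustrative port\nos.environ[\"NODE_INDEX\"] = str(hosts.index(current_host))\n\n# torch_launch.sh then assembles a command along the lines of:\n#   python -m torch.distributed.launch --nproc_per_node 8\n#     --nnodes $NODE_NUMBER --node_rank $NODE_INDEX\n#     --master_addr $MASTER_ADDR --master_port $MASTER_PORT\n#     run_seq2seq_deepspeed.py --deepspeed_config ds_flan_t5_z3_config_bf16.json ...\nsubprocess.run([\"bash\", \"./T5_configz_and_code/scripts/torch_launch.sh\"], check=True)\n```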
\n" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "## 3. Fine-tune FLAN-T5 XXL on Amazon SageMaker\n", 260 | "\n" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "In order to create a sagemaker training job we need an `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. The Estimator manages the infrastructure use. \n", 268 | "SagMaker takes care of starting and managing all the required ec2 instances for us, provides the correct huggingface container, uploads the provided scripts and downloads the data from our S3 bucket into the container at /opt/ml/input/data. Then, it starts the training job by running." 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": { 275 | "tags": [] 276 | }, 277 | "outputs": [], 278 | "source": [ 279 | "import time\n", 280 | "from sagemaker.huggingface import HuggingFace\n", 281 | "from sagemaker import get_execution_role\n", 282 | "\n", 283 | "role = get_execution_role()\n", 284 | "# define Training Job Name \n", 285 | "job_name = f'huggingface-flan-t5-deepspeed-{time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.localtime())}'\n", 286 | "#define the model s3 path which will store your trained model asset\n", 287 | "#Note: you should use your real s3 path to configure model_s3_path\n", 288 | "model_s3_path='s3://your_bucket/flan-t5-xxl-4102xx668899-liangaws/model/'\n", 289 | "\n", 290 | "instance_count = 2\n", 291 | "#define the enviroment variables for your scripts.\n", 292 | "environment = {'NODE_NUMBER':str(instance_count),\n", 293 | " 'FI_PROVIDER': 'efa',\n", 294 | " 'NCCL_PROTO': 'simple',\n", 295 | " 'FI_EFA_USE_DEVICE_RDMA': '1',\n", 296 | " 'NCCL_DEBUG': 'INFO',\n", 297 | " 'MODEL_S3_PATH': model_s3_path\n", 298 | "}\n", 299 | "\n", 300 | "# create the Estimator\n", 301 | "huggingface_estimator = HuggingFace(\n", 302 | " entry_point = 'start.py', # user endpoint script\n", 303 | " source_dir = 'src', # directory which includes all the files needed for training\n", 304 | " instance_type = 'ml.p4d.24xlarge', # instances type used for the training job\n", 305 | " instance_count = instance_count, # the number of instances used for training\n", 306 | " base_job_name = job_name, # the name of the training job\n", 307 | " role = role, # Iam role used in training job to access AWS ressources, e.g. S3\n", 308 | " transformers_version = '4.17', # the transformers version used in the training job\n", 309 | " pytorch_version = '1.10', # the pytorch_version version used in the training job\n", 310 | " py_version = 'py38', # the python version used in the training job\n", 311 | " environment = environment,\n", 312 | ")" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | "We created our `HuggingFace` estimator including the `start.py` as `entry_point` . We can now start our training job, with the `.fit()` method passing our S3 path to the training script." 
320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": null, 325 | "metadata": { 326 | "tags": [] 327 | }, 328 | "outputs": [], 329 | "source": [ 330 | "# define a data input dictionary with our uploaded s3 uris\n", 331 | "#Here we set test_input_path for both the training channel and the test channel to quickly verify the whole training procedure.\n", 332 | "data = {\n", 333 | " 'training': test_input_path,\n", 334 | " 'test': test_input_path\n", 335 | "}\n", 336 | "\n", 337 | "# start the training job with our uploaded datasets as input\n", 338 | "huggingface_estimator.fit(data, wait=True)" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": null, 344 | "metadata": {}, 345 | "outputs": [], 346 | "source": [] 347 | } 348 | ], 349 | "metadata": { 350 | "availableInstances": [ 351 | { 352 | "_defaultOrder": 0, 353 | "_isFastLaunch": true, 354 | "category": "General purpose", 355 | "gpuNum": 0, 356 | "memoryGiB": 4, 357 | "name": "ml.t3.medium", 358 | "vcpuNum": 2 359 | }, 360 | { 361 | "_defaultOrder": 1, 362 | "_isFastLaunch": false, 363 | "category": "General purpose", 364 | "gpuNum": 0, 365 | "memoryGiB": 8, 366 | "name": "ml.t3.large", 367 | "vcpuNum": 2 368 | }, 369 | { 370 | "_defaultOrder": 2, 371 | "_isFastLaunch": false, 372 | "category": "General purpose", 373 | "gpuNum": 0, 374 | "memoryGiB": 16, 375 | "name": "ml.t3.xlarge", 376 | "vcpuNum": 4 377 | }, 378 | { 379 | "_defaultOrder": 3, 380 | "_isFastLaunch": false, 381 | "category": "General purpose", 382 | "gpuNum": 0, 383 | "memoryGiB": 32, 384 | "name": "ml.t3.2xlarge", 385 | "vcpuNum": 8 386 | }, 387 | { 388 | "_defaultOrder": 4, 389 | "_isFastLaunch": true, 390 | "category": "General purpose", 391 | "gpuNum": 0, 392 | "memoryGiB": 8, 393 | "name": "ml.m5.large", 394 | "vcpuNum": 2 395 | }, 396 | { 397 | "_defaultOrder": 5, 398 | "_isFastLaunch": false, 399 | "category": "General purpose", 400 | "gpuNum": 0, 401 | "memoryGiB": 16, 402 | "name": "ml.m5.xlarge", 403 | "vcpuNum": 4 404 | }, 405 | { 406 | "_defaultOrder": 6, 407 | "_isFastLaunch": false, 408 | "category": "General purpose", 409 | "gpuNum": 0, 410 | "memoryGiB": 32, 411 | "name": "ml.m5.2xlarge", 412 | "vcpuNum": 8 413 | }, 414 | { 415 | "_defaultOrder": 7, 416 | "_isFastLaunch": false, 417 | "category": "General purpose", 418 | "gpuNum": 0, 419 | "memoryGiB": 64, 420 | "name": "ml.m5.4xlarge", 421 | "vcpuNum": 16 422 | }, 423 | { 424 | "_defaultOrder": 8, 425 | "_isFastLaunch": false, 426 | "category": "General purpose", 427 | "gpuNum": 0, 428 | "memoryGiB": 128, 429 | "name": "ml.m5.8xlarge", 430 | "vcpuNum": 32 431 | }, 432 | { 433 | "_defaultOrder": 9, 434 | "_isFastLaunch": false, 435 | "category": "General purpose", 436 | "gpuNum": 0, 437 | "memoryGiB": 192, 438 | "name": "ml.m5.12xlarge", 439 | "vcpuNum": 48 440 | }, 441 | { 442 | "_defaultOrder": 10, 443 | "_isFastLaunch": false, 444 | "category": "General purpose", 445 | "gpuNum": 0, 446 | "memoryGiB": 256, 447 | "name": "ml.m5.16xlarge", 448 | "vcpuNum": 64 449 | }, 450 | { 451 | "_defaultOrder": 11, 452 | "_isFastLaunch": false, 453 | "category": "General purpose", 454 | "gpuNum": 0, 455 | "memoryGiB": 384, 456 | "name": "ml.m5.24xlarge", 457 | "vcpuNum": 96 458 | }, 459 | { 460 | "_defaultOrder": 12, 461 | "_isFastLaunch": false, 462 | "category": "General purpose", 463 | "gpuNum": 0, 464 | "memoryGiB": 8, 465 | "name": "ml.m5d.large", 466 | "vcpuNum": 2 467 | }, 468 | { 469 | "_defaultOrder": 13, 470 | "_isFastLaunch": false, 471 | "category": "General purpose", 
472 | "gpuNum": 0, 473 | "memoryGiB": 16, 474 | "name": "ml.m5d.xlarge", 475 | "vcpuNum": 4 476 | }, 477 | { 478 | "_defaultOrder": 14, 479 | "_isFastLaunch": false, 480 | "category": "General purpose", 481 | "gpuNum": 0, 482 | "memoryGiB": 32, 483 | "name": "ml.m5d.2xlarge", 484 | "vcpuNum": 8 485 | }, 486 | { 487 | "_defaultOrder": 15, 488 | "_isFastLaunch": false, 489 | "category": "General purpose", 490 | "gpuNum": 0, 491 | "memoryGiB": 64, 492 | "name": "ml.m5d.4xlarge", 493 | "vcpuNum": 16 494 | }, 495 | { 496 | "_defaultOrder": 16, 497 | "_isFastLaunch": false, 498 | "category": "General purpose", 499 | "gpuNum": 0, 500 | "memoryGiB": 128, 501 | "name": "ml.m5d.8xlarge", 502 | "vcpuNum": 32 503 | }, 504 | { 505 | "_defaultOrder": 17, 506 | "_isFastLaunch": false, 507 | "category": "General purpose", 508 | "gpuNum": 0, 509 | "memoryGiB": 192, 510 | "name": "ml.m5d.12xlarge", 511 | "vcpuNum": 48 512 | }, 513 | { 514 | "_defaultOrder": 18, 515 | "_isFastLaunch": false, 516 | "category": "General purpose", 517 | "gpuNum": 0, 518 | "memoryGiB": 256, 519 | "name": "ml.m5d.16xlarge", 520 | "vcpuNum": 64 521 | }, 522 | { 523 | "_defaultOrder": 19, 524 | "_isFastLaunch": false, 525 | "category": "General purpose", 526 | "gpuNum": 0, 527 | "memoryGiB": 384, 528 | "name": "ml.m5d.24xlarge", 529 | "vcpuNum": 96 530 | }, 531 | { 532 | "_defaultOrder": 20, 533 | "_isFastLaunch": true, 534 | "category": "Compute optimized", 535 | "gpuNum": 0, 536 | "memoryGiB": 4, 537 | "name": "ml.c5.large", 538 | "vcpuNum": 2 539 | }, 540 | { 541 | "_defaultOrder": 21, 542 | "_isFastLaunch": false, 543 | "category": "Compute optimized", 544 | "gpuNum": 0, 545 | "memoryGiB": 8, 546 | "name": "ml.c5.xlarge", 547 | "vcpuNum": 4 548 | }, 549 | { 550 | "_defaultOrder": 22, 551 | "_isFastLaunch": false, 552 | "category": "Compute optimized", 553 | "gpuNum": 0, 554 | "memoryGiB": 16, 555 | "name": "ml.c5.2xlarge", 556 | "vcpuNum": 8 557 | }, 558 | { 559 | "_defaultOrder": 23, 560 | "_isFastLaunch": false, 561 | "category": "Compute optimized", 562 | "gpuNum": 0, 563 | "memoryGiB": 32, 564 | "name": "ml.c5.4xlarge", 565 | "vcpuNum": 16 566 | }, 567 | { 568 | "_defaultOrder": 24, 569 | "_isFastLaunch": false, 570 | "category": "Compute optimized", 571 | "gpuNum": 0, 572 | "memoryGiB": 72, 573 | "name": "ml.c5.9xlarge", 574 | "vcpuNum": 36 575 | }, 576 | { 577 | "_defaultOrder": 25, 578 | "_isFastLaunch": false, 579 | "category": "Compute optimized", 580 | "gpuNum": 0, 581 | "memoryGiB": 96, 582 | "name": "ml.c5.12xlarge", 583 | "vcpuNum": 48 584 | }, 585 | { 586 | "_defaultOrder": 26, 587 | "_isFastLaunch": false, 588 | "category": "Compute optimized", 589 | "gpuNum": 0, 590 | "memoryGiB": 144, 591 | "name": "ml.c5.18xlarge", 592 | "vcpuNum": 72 593 | }, 594 | { 595 | "_defaultOrder": 27, 596 | "_isFastLaunch": false, 597 | "category": "Compute optimized", 598 | "gpuNum": 0, 599 | "memoryGiB": 192, 600 | "name": "ml.c5.24xlarge", 601 | "vcpuNum": 96 602 | }, 603 | { 604 | "_defaultOrder": 28, 605 | "_isFastLaunch": true, 606 | "category": "Accelerated computing", 607 | "gpuNum": 1, 608 | "memoryGiB": 16, 609 | "name": "ml.g4dn.xlarge", 610 | "vcpuNum": 4 611 | }, 612 | { 613 | "_defaultOrder": 29, 614 | "_isFastLaunch": false, 615 | "category": "Accelerated computing", 616 | "gpuNum": 1, 617 | "memoryGiB": 32, 618 | "name": "ml.g4dn.2xlarge", 619 | "vcpuNum": 8 620 | }, 621 | { 622 | "_defaultOrder": 30, 623 | "_isFastLaunch": false, 624 | "category": "Accelerated computing", 625 | "gpuNum": 1, 626 | "memoryGiB": 64, 627 
| "name": "ml.g4dn.4xlarge", 628 | "vcpuNum": 16 629 | }, 630 | { 631 | "_defaultOrder": 31, 632 | "_isFastLaunch": false, 633 | "category": "Accelerated computing", 634 | "gpuNum": 1, 635 | "memoryGiB": 128, 636 | "name": "ml.g4dn.8xlarge", 637 | "vcpuNum": 32 638 | }, 639 | { 640 | "_defaultOrder": 32, 641 | "_isFastLaunch": false, 642 | "category": "Accelerated computing", 643 | "gpuNum": 4, 644 | "memoryGiB": 192, 645 | "name": "ml.g4dn.12xlarge", 646 | "vcpuNum": 48 647 | }, 648 | { 649 | "_defaultOrder": 33, 650 | "_isFastLaunch": false, 651 | "category": "Accelerated computing", 652 | "gpuNum": 1, 653 | "memoryGiB": 256, 654 | "name": "ml.g4dn.16xlarge", 655 | "vcpuNum": 64 656 | }, 657 | { 658 | "_defaultOrder": 34, 659 | "_isFastLaunch": false, 660 | "category": "Accelerated computing", 661 | "gpuNum": 1, 662 | "memoryGiB": 61, 663 | "name": "ml.p3.2xlarge", 664 | "vcpuNum": 8 665 | }, 666 | { 667 | "_defaultOrder": 35, 668 | "_isFastLaunch": false, 669 | "category": "Accelerated computing", 670 | "gpuNum": 4, 671 | "memoryGiB": 244, 672 | "name": "ml.p3.8xlarge", 673 | "vcpuNum": 32 674 | }, 675 | { 676 | "_defaultOrder": 36, 677 | "_isFastLaunch": false, 678 | "category": "Accelerated computing", 679 | "gpuNum": 8, 680 | "memoryGiB": 488, 681 | "name": "ml.p3.16xlarge", 682 | "vcpuNum": 64 683 | }, 684 | { 685 | "_defaultOrder": 37, 686 | "_isFastLaunch": false, 687 | "category": "Accelerated computing", 688 | "gpuNum": 8, 689 | "memoryGiB": 768, 690 | "name": "ml.p3dn.24xlarge", 691 | "vcpuNum": 96 692 | }, 693 | { 694 | "_defaultOrder": 38, 695 | "_isFastLaunch": false, 696 | "category": "Memory Optimized", 697 | "gpuNum": 0, 698 | "memoryGiB": 16, 699 | "name": "ml.r5.large", 700 | "vcpuNum": 2 701 | }, 702 | { 703 | "_defaultOrder": 39, 704 | "_isFastLaunch": false, 705 | "category": "Memory Optimized", 706 | "gpuNum": 0, 707 | "memoryGiB": 32, 708 | "name": "ml.r5.xlarge", 709 | "vcpuNum": 4 710 | }, 711 | { 712 | "_defaultOrder": 40, 713 | "_isFastLaunch": false, 714 | "category": "Memory Optimized", 715 | "gpuNum": 0, 716 | "memoryGiB": 64, 717 | "name": "ml.r5.2xlarge", 718 | "vcpuNum": 8 719 | }, 720 | { 721 | "_defaultOrder": 41, 722 | "_isFastLaunch": false, 723 | "category": "Memory Optimized", 724 | "gpuNum": 0, 725 | "memoryGiB": 128, 726 | "name": "ml.r5.4xlarge", 727 | "vcpuNum": 16 728 | }, 729 | { 730 | "_defaultOrder": 42, 731 | "_isFastLaunch": false, 732 | "category": "Memory Optimized", 733 | "gpuNum": 0, 734 | "memoryGiB": 256, 735 | "name": "ml.r5.8xlarge", 736 | "vcpuNum": 32 737 | }, 738 | { 739 | "_defaultOrder": 43, 740 | "_isFastLaunch": false, 741 | "category": "Memory Optimized", 742 | "gpuNum": 0, 743 | "memoryGiB": 384, 744 | "name": "ml.r5.12xlarge", 745 | "vcpuNum": 48 746 | }, 747 | { 748 | "_defaultOrder": 44, 749 | "_isFastLaunch": false, 750 | "category": "Memory Optimized", 751 | "gpuNum": 0, 752 | "memoryGiB": 512, 753 | "name": "ml.r5.16xlarge", 754 | "vcpuNum": 64 755 | }, 756 | { 757 | "_defaultOrder": 45, 758 | "_isFastLaunch": false, 759 | "category": "Memory Optimized", 760 | "gpuNum": 0, 761 | "memoryGiB": 768, 762 | "name": "ml.r5.24xlarge", 763 | "vcpuNum": 96 764 | }, 765 | { 766 | "_defaultOrder": 46, 767 | "_isFastLaunch": false, 768 | "category": "Accelerated computing", 769 | "gpuNum": 1, 770 | "memoryGiB": 16, 771 | "name": "ml.g5.xlarge", 772 | "vcpuNum": 4 773 | }, 774 | { 775 | "_defaultOrder": 47, 776 | "_isFastLaunch": false, 777 | "category": "Accelerated computing", 778 | "gpuNum": 1, 779 | "memoryGiB": 32, 780 | 
"name": "ml.g5.2xlarge", 781 | "vcpuNum": 8 782 | }, 783 | { 784 | "_defaultOrder": 48, 785 | "_isFastLaunch": false, 786 | "category": "Accelerated computing", 787 | "gpuNum": 1, 788 | "memoryGiB": 64, 789 | "name": "ml.g5.4xlarge", 790 | "vcpuNum": 16 791 | }, 792 | { 793 | "_defaultOrder": 49, 794 | "_isFastLaunch": false, 795 | "category": "Accelerated computing", 796 | "gpuNum": 1, 797 | "memoryGiB": 128, 798 | "name": "ml.g5.8xlarge", 799 | "vcpuNum": 32 800 | }, 801 | { 802 | "_defaultOrder": 50, 803 | "_isFastLaunch": false, 804 | "category": "Accelerated computing", 805 | "gpuNum": 1, 806 | "memoryGiB": 256, 807 | "name": "ml.g5.16xlarge", 808 | "vcpuNum": 64 809 | }, 810 | { 811 | "_defaultOrder": 51, 812 | "_isFastLaunch": false, 813 | "category": "Accelerated computing", 814 | "gpuNum": 4, 815 | "memoryGiB": 192, 816 | "name": "ml.g5.12xlarge", 817 | "vcpuNum": 48 818 | }, 819 | { 820 | "_defaultOrder": 52, 821 | "_isFastLaunch": false, 822 | "category": "Accelerated computing", 823 | "gpuNum": 4, 824 | "memoryGiB": 384, 825 | "name": "ml.g5.24xlarge", 826 | "vcpuNum": 96 827 | }, 828 | { 829 | "_defaultOrder": 53, 830 | "_isFastLaunch": false, 831 | "category": "Accelerated computing", 832 | "gpuNum": 8, 833 | "memoryGiB": 768, 834 | "name": "ml.g5.48xlarge", 835 | "vcpuNum": 192 836 | } 837 | ], 838 | "instance_type": "ml.m5.large", 839 | "kernelspec": { 840 | "display_name": "Python 3 (Data Science)", 841 | "language": "python", 842 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" 843 | }, 844 | "language_info": { 845 | "codemirror_mode": { 846 | "name": "ipython", 847 | "version": 3 848 | }, 849 | "file_extension": ".py", 850 | "mimetype": "text/x-python", 851 | "name": "python", 852 | "nbconvert_exporter": "python", 853 | "pygments_lexer": "ipython3", 854 | "version": "3.7.10" 855 | }, 856 | "vscode": { 857 | "interpreter": { 858 | "hash": "2d58e898dde0263bc564c6968b04150abacfd33eed9b19aaa8e45c040360e146" 859 | } 860 | } 861 | }, 862 | "nbformat": 4, 863 | "nbformat_minor": 4 864 | } 865 | -------------------------------------------------------------------------------- /Flan-T5-XXL-multiple-nodes-training-and-deploy-on-SageMaker/fine-tuning/src/T5_configz_and_code/configs/ds_flan_t5_z3_config_bf16.json: -------------------------------------------------------------------------------- 1 | { 2 | "bf16": { 3 | "enabled": "auto" 4 | }, 5 | "optimizer": { 6 | "type": "AdamW", 7 | "params": { 8 | "lr": "auto", 9 | "betas": "auto", 10 | "eps": "auto", 11 | "weight_decay": "auto" 12 | } 13 | }, 14 | "scheduler": { 15 | "type": "WarmupLR", 16 | "params": { 17 | "warmup_min_lr": "auto", 18 | "warmup_max_lr": "auto", 19 | "warmup_num_steps": "auto" 20 | } 21 | }, 22 | "zero_optimization": { 23 | "stage": 3, 24 | "overlap_comm": true, 25 | "contiguous_gradients": true, 26 | "sub_group_size": 1e9, 27 | "reduce_bucket_size": "auto", 28 | "stage3_prefetch_bucket_size": "auto", 29 | "stage3_param_persistence_threshold": "auto", 30 | "stage3_max_live_parameters": 1e9, 31 | "stage3_max_reuse_distance": 1e9, 32 | "stage3_gather_16bit_weights_on_model_save": true 33 | }, 34 | "gradient_accumulation_steps": "auto", 35 | "gradient_clipping": "auto", 36 | "steps_per_print": 2000, 37 | "train_batch_size": "auto", 38 | "train_micro_batch_size_per_gpu": "auto", 39 | "wall_clock_breakdown": false 40 | } -------------------------------------------------------------------------------- 
/Flan-T5-XXL-multiple-nodes-training-and-deploy-on-SageMaker/fine-tuning/src/T5_configz_and_code/scripts/run_seq2seq_deepspeed.py: -------------------------------------------------------------------------------- 1 | # *************************************************************************************** 2 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. * 3 | # * 4 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this * 5 | # software and associated documentation files (the "Software"), to deal in the Software * 6 | # without restriction, including without limitation the rights to use, copy, modify, * 7 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to * 8 | # permit persons to whom the Software is furnished to do so. * 9 | # * 10 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, * 11 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A * 12 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT * 13 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION * 14 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE * 15 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. * 16 | # *************************************************************************************** 17 | 18 | 19 | import os 20 | import argparse 21 | import numpy as np 22 | from transformers import ( 23 | AutoModelForSeq2SeqLM, 24 | DataCollatorForSeq2Seq, 25 | AutoTokenizer, 26 | set_seed, 27 | ) 28 | 29 | from datasets import load_from_disk 30 | import torch 31 | import torch.distributed as dist 32 | import evaluate 33 | 34 | import deepspeed 35 | from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments 36 | 37 | import nltk 38 | 39 | 40 | def postprocess_text(preds, labels): 41 | preds = [pred.strip() for pred in preds] 42 | labels = [label.strip() for label in labels] 43 | 44 | # rougeLSum expects newline after each sentence 45 | preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds] 46 | labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels] 47 | 48 | return preds, labels 49 | 50 | def parse_args(): 51 | """Parse the arguments.""" 52 | parser = argparse.ArgumentParser() 53 | # add model id and dataset path argument 54 | parser.add_argument("--model_id", type=str, default="google/flan-t5-xl", help="Model id to use for training.") 55 | parser.add_argument("--train_dataset_path", type=str, help="Path to the processed dataset stored by SageMaker.") 56 | parser.add_argument("--test_dataset_path", type=str, help="Path to the processed dataset stored by SageMaker.") 57 | # add training hyperparameters for epochs, batch size, learning rate, and seed 58 | parser.add_argument("--epochs", type=int, default=3, help="Number of epochs to train for.") 59 | parser.add_argument("--per_device_train_batch_size", type=int, default=8, help="Batch size to use for training.") 60 | parser.add_argument("--per_device_eval_batch_size", type=int, default=8, help="Batch size to use for testing.") 61 | parser.add_argument("--generation_max_length", type=int, default=140, help="Maximum length to use for generation") 62 | parser.add_argument("--generation_num_beams", type=int, default=4, help="Number of beams to use for generation.") 63 | parser.add_argument("--learning_rate", type=float, default=3e-3, help="Learning rate to use for training.") 64 | 
64 |     parser.add_argument("--seed", type=int, default=42, help="Seed to use for training.")
65 |     parser.add_argument("--gradient_checkpointing", type=bool, default=True, help="Whether to use gradient checkpointing.")  # note: argparse's type=bool treats any non-empty string as True, so rely on the default here
66 |     parser.add_argument(
67 |         "--bf16",
68 |         type=bool,
69 |         default=torch.cuda.get_device_capability()[0] >= 8,  # bf16 requires Ampere (sm_80) or newer GPUs
70 |         help="Whether to use bf16.",
71 |     )
72 | 
73 |     # Include DeepSpeed configuration arguments
74 |     parser = deepspeed.add_config_arguments(parser)
75 |     args = parser.parse_known_args()
76 |     return args
77 | 
78 | 
79 | def training_function(args):
80 |     # set seed
81 |     set_seed(args.seed)
82 | 
83 |     # load the datasets from disk and the tokenizer
84 |     train_dataset = load_from_disk(args.train_dataset_path)
85 |     eval_dataset = load_from_disk(args.test_dataset_path)
86 |     tokenizer = AutoTokenizer.from_pretrained(args.model_id)
87 |     # load the model from the hub
88 |     model = AutoModelForSeq2SeqLM.from_pretrained(
89 |         args.model_id,
90 |         use_cache=not args.gradient_checkpointing,  # caching is incompatible with gradient checkpointing
91 |         cache_dir="/tmp/input/"  # on instances with large local storage such as p4d.24xlarge, /tmp has enough space to cache the model files
92 |     )
93 | 
94 |     # we want to ignore the tokenizer's pad token in the loss
95 |     label_pad_token_id = -100
96 |     # Data collator
97 |     data_collator = DataCollatorForSeq2Seq(
98 |         tokenizer, model=model, label_pad_token_id=label_pad_token_id, pad_to_multiple_of=8
99 |     )
100 | 
101 |     # Define the compute metrics function
102 |     def compute_metrics(eval_preds):
103 |         preds, labels = eval_preds
104 |         if isinstance(preds, tuple):
105 |             preds = preds[0]
106 |         decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
107 |         # Replace -100 in the labels as we can't decode them.
108 |         labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
109 |         decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
110 | 
111 |         # Some simple post-processing
112 |         decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
113 | 
114 |         metric = evaluate.load("rouge")
115 |         result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
116 |         result = {k: round(v * 100, 4) for k, v in result.items()}
117 |         prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
118 |         result["gen_len"] = np.mean(prediction_lens)
119 |         return result
120 | 
121 |     # Define training args
122 |     # If you just want to save the best model weights, you can set output_dir to a temporary path such as '/tmp' on p4d.24xlarge.
123 |     # If you want to keep every checkpoint produced during training, set output_dir to the checkpoint local path instead (this slows down multi-node training: SageMaker uploads checkpoints to S3 in near real time, which consumes network bandwidth and hurts the communication efficiency between the nodes in the cluster).
124 |     output_dir = '/tmp'
125 |     training_args = Seq2SeqTrainingArguments(
126 |         output_dir=output_dir,
127 |         per_device_train_batch_size=args.per_device_train_batch_size,
128 |         per_device_eval_batch_size=args.per_device_eval_batch_size,
129 |         predict_with_generate=True,
130 |         generation_max_length=args.generation_max_length,
131 |         generation_num_beams=args.generation_num_beams,
132 |         fp16=False,  # T5 overflows with fp16
133 |         bf16=args.bf16,  # Use BF16 if available
134 |         learning_rate=args.learning_rate,
135 |         num_train_epochs=args.epochs,
136 |         max_steps=80,  # note: max_steps overrides num_train_epochs; remove it to train for the full number of epochs
137 |         deepspeed=args.deepspeed_config,
138 |         save_on_each_node=True,  # By default, DeepSpeed expects a multi-node environment to use shared storage. If that is not the case and each node can only see its local filesystem, you need to set this parameter to true.
139 |         gradient_checkpointing=args.gradient_checkpointing,
140 |         # logging & evaluation strategies
141 |         logging_dir=f"{output_dir}/logs",
142 |         logging_strategy="steps",
143 |         logging_steps=40,
144 |         evaluation_strategy="steps",
145 |         save_strategy="no",
146 |         eval_steps=60,
147 |         save_total_limit=2,
148 |         load_best_model_at_end=False,  # must be set to False for DeepSpeed multi-node training
149 |     )
150 | 
151 |     # Create the Trainer instance
152 |     trainer = Seq2SeqTrainer(
153 |         model=model,
154 |         args=training_args,
155 |         train_dataset=train_dataset,
156 |         eval_dataset=eval_dataset,
157 |         data_collator=data_collator,
158 |         # compute_metrics=compute_metrics,  # with compute_metrics enabled, the evaluation procedure is very slow, so it is commented out here
159 |     )
160 | 
161 |     # Start training
162 |     trainer.train()
163 | 
164 |     # We now save the model assets to an intermediate path.
165 |     # Note: please do not save the model into /opt/ml/model, because SageMaker will tar and compress everything under /opt/ml/model, which takes a long time for an LLM.
166 |     print("------saving model!-----")
167 |     save_model_dir = '/tmp/output/asset/'
168 |     tokenizer.save_pretrained(save_model_dir)
169 |     trainer.save_model(save_model_dir)
170 |     print("------model is saved!-----")
171 | 
172 |     # Note: only the rank 0 process uploads the trained model assets to S3, via the s5cmd command.
173 |     WORLD_RANK = int(os.environ['RANK'])
174 |     if WORLD_RANK == 0:
175 |         os.system("./T5_configz_and_code/scripts/s5cmd sync {0} {1}".format(save_model_dir, os.environ['MODEL_S3_PATH']))
176 | 
177 |     # Note: the barrier syncs every rank and ensures that rank 0 has uploaded the model assets successfully.
178 |     torch.distributed.barrier()
179 | 
180 | def main():
181 |     # Note: the "_" is needed because parse_args() returns a tuple.
182 |     args, _ = parse_args()
183 | 
184 |     # Environment variables set by torch.distributed.launch
185 |     LOCAL_RANK = int(os.environ['LOCAL_RANK'])
186 |     WORLD_SIZE = int(os.environ['WORLD_SIZE'])
187 |     WORLD_RANK = int(os.environ['RANK'])
188 | 
189 |     dist.init_process_group(backend='nccl', rank=WORLD_RANK, world_size=WORLD_SIZE)
190 | 
191 |     if LOCAL_RANK != 0:
192 |         print("---------local rank {0}".format(LOCAL_RANK))
193 |     else:
194 |         print("------download and unzip nltk punkt for local rank 0!-----")
195 |         nltk.download("punkt", quiet=True)
196 | 
197 |     # Note: the barrier ensures that only local rank 0 downloads and unzips punkt; without it the training job may fail.
198 |     torch.distributed.barrier()
199 | 
200 |     training_function(args)
201 | 
202 | 
203 | if __name__ == "__main__":
204 |     main()
205 | 
--------------------------------------------------------------------------------
/Flan-T5-XXL-multiple-nodes-training-and-deploy-on-SageMaker/fine-tuning/src/T5_configz_and_code/scripts/torch_launch.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | #Please change the folder "T5_configz_and_code" to the folder that contains your config files and main code.
 4 | WORKING_DIR=/opt/ml/code/T5_configz_and_code
 5 | SM_WORKING_DIR=/opt/ml/model
 6 | 
 7 | #Information about the multi-node cluster.
 8 | MASTER_HOST=$SM_MASTER
 9 | MASTER_ADDR=$SM_MASTER_ADDR
10 | MASTER_PORT="23456"
11 | NNODES="$NODE_NUMBER"
12 | NODE_RANK="$NODE_INDEX"
13 | 
14 | #Configure the distributed arguments for torch.distributed.launch.
15 | GPUS_PER_NODE="$SM_NUM_GPUS"
16 | DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE \
17 |                   --nnodes $NNODES --node_rank $NODE_RANK \
18 |                   --master_addr $MASTER_ADDR \
19 |                   --master_port $MASTER_PORT"
20 | 
21 | SAVE_PATH="${SM_WORKING_DIR}/results"
22 | LOG_FILE="${SAVE_PATH}/log.txt"
23 | 
24 | #Set the path of your DeepSpeed config file.
25 | DS_CONFIG="${WORKING_DIR}/configs/ds_flan_t5_z3_config_bf16.json"
26 | 
27 | #Configure the training parameters according to your model and dataset.
28 | #Note: set train_dataset_path and test_dataset_path according to your input data channel names.
29 | EPOCHS=1
30 | model_id="google/flan-t5-xxl"
31 | train_dataset_path='/opt/ml/input/data/training'
32 | test_dataset_path='/opt/ml/input/data/test'
33 | learning_rate=0.0001
34 | generation_max_length=150
35 | per_device_train_batch_size=1
36 | per_device_eval_batch_size=8
37 | 
38 | OPTS=""
39 | OPTS+=" --per_device_eval_batch_size ${per_device_eval_batch_size}"
40 | OPTS+=" --per_device_train_batch_size ${per_device_train_batch_size}"
41 | OPTS+=" --generation_max_length ${generation_max_length}"
42 | OPTS+=" --test_dataset_path ${test_dataset_path}"
43 | OPTS+=" --model_id ${model_id}"
44 | OPTS+=" --train_dataset_path ${train_dataset_path}"
45 | OPTS+=" --distributed-backend nccl"
46 | OPTS+=" --learning_rate ${learning_rate}"
47 | OPTS+=" --deepspeed"
48 | OPTS+=" --deepspeed_config ${DS_CONFIG}"
49 | OPTS+=" --epochs ${EPOCHS}"
50 | 
51 | CMD="python -m torch.distributed.launch ${DISTRIBUTED_ARGS} ${WORKING_DIR}/scripts/run_seq2seq_deepspeed.py ${OPTS}"
52 | 
53 | echo ${CMD}
54 | mkdir -p ${SAVE_PATH}
55 | ${CMD} 2>&1 | tee ${SAVE_PATH}/train_log
56 | 
--------------------------------------------------------------------------------
/Flan-T5-XXL-multiple-nodes-training-and-deploy-on-SageMaker/fine-tuning/src/requirements.txt:
--------------------------------------------------------------------------------
 1 | transformers==4.26.0
 2 | datasets==2.9.0
 3 | accelerate==0.16.0
 4 | evaluate==0.4.0
 5 | deepspeed==0.8.0
 6 | ninja
 7 | rouge-score
 8 | nltk
 9 | py7zr
10 | 
--------------------------------------------------------------------------------
/Flan-T5-XXL-multiple-nodes-training-and-deploy-on-SageMaker/fine-tuning/src/start.py:
--------------------------------------------------------------------------------
 1 | # ***************************************************************************************
 2 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. *
 3 | # *
 4 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this *
 5 | # software and associated documentation files (the "Software"), to deal in the Software *
 6 | # without restriction, including without limitation the rights to use, copy, modify, *
 7 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to *
 8 | # permit persons to whom the Software is furnished to do so. *
 9 | # *
10 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, *
11 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A *
12 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT *
13 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION *
14 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE *
15 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. *
16 | # ***************************************************************************************
17 | 
18 | import os
19 | import json
20 | import socket
21 | 
22 | if __name__ == "__main__":
23 | 
24 |     hosts = json.loads(os.environ['SM_HOSTS'])
25 |     current_host = os.environ['SM_CURRENT_HOST']
26 |     host_rank = int(hosts.index(current_host))
27 | 
28 |     # Resolve the IP address of the master node in the SageMaker multi-node training cluster.
29 |     master = json.loads(os.environ['SM_TRAINING_ENV'])['master_hostname']
30 |     master_addr = socket.gethostbyname(master)
31 | 
32 |     os.environ['NODE_INDEX'] = str(host_rank)
33 |     os.environ['SM_MASTER'] = str(master)
34 |     os.environ['SM_MASTER_ADDR'] = str(master_addr)
35 |     os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'
36 | 
37 |     # Invoke the torch launcher shell script.
38 |     # Note: we use the PyTorch launcher to launch DeepSpeed for multi-node training.
39 |     # Note: we use s5cmd to speed up uploading the model assets to S3.
40 |     os.system("chmod +x ./T5_configz_and_code/scripts/torch_launch.sh")
41 |     os.system("chmod +x ./T5_configz_and_code/scripts/s5cmd")
42 |     os.system("/bin/bash -c ./T5_configz_and_code/scripts/torch_launch.sh")
43 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT No Attribution
 2 | 
 3 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of
 6 | this software and associated documentation files (the "Software"), to deal in
 7 | the Software without restriction, including without limitation the rights to
 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
 9 | the Software, and to permit persons to whom the Software is furnished to do so.
10 | 
11 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
12 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
13 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
14 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
15 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
16 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
17 | 
18 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Training an LLM on Amazon SageMaker across multiple nodes with DeepSpeed
 2 | 
 3 | This repo shows the complete code for:
 4 | 1. Fine-tuning an LLM with DeepSpeed on SageMaker across multiple nodes.
 5 | 2. Deploying the model trained in step 1 on SageMaker.
 6 | 
 7 | ## Prerequisites:
 8 | 
 9 | a. Download the "s5cmd" command and uncompress it (for example: curl -L https://github.com/peak/s5cmd/releases/download/v2.0.0/s5cmd_2.0.0_Linux-64bit.tar.gz | tar -xz).
10 | 
11 | b. Clone this repo.
12 | 
13 | c. Move the "s5cmd" binary to the path "Flan-T5-XXL-multiple-nodes-training-and-deploy-on-SageMaker/fine-tuning/src/T5_configz_and_code/scripts/".
14 | 
15 | The repo was tested successfully on the Data Science image and Python 3 kernel of SageMaker Studio with an ml.m5.large kernel gateway instance in the us-east-1 region. (If you encounter a kernel restarting issue while preparing the dataset in DeepSpeed-Flan-T5-on-Sagemaker-multiple-nodes.ipynb, I suggest shutting down the kernel gateway instance and re-executing the notebook.)
16 | 
17 | ## Fine-tuning an LLM such as Flan-T5-XXL
18 | 
19 | We use torch.distributed.launch + DeepSpeed + the Hugging Face Trainer API to fine-tune Flan-T5-XXL on AWS SageMaker across multiple nodes (set the environment variable "NODE_NUMBER" to 1 to reuse the same code for multi-GPU training on a single node). You can follow the folder structure, prepare your training script, and configure the related parameters in the torch_launch.sh script. If you use the HF high-level Trainer API to train a CausalLM (such as GPT-J) or a Seq2SeqLM (such as T5), very little code needs to be modified.
20 | 
21 | More about these files: start.py, the user entry point, sets some environment variables (such as the master's IP address) and invokes the torch_launch.sh script. Most parameters (both training parameters and torch distributed launcher parameters) should be configured in torch_launch.sh, which finally invokes your training Python script. You can also use requirements.txt to install the related Python libraries.
22 | 
23 | Currently, the code uses the Hugging Face API to download the model assets from the HF model hub. If your SageMaker training job hits a timeout while downloading from the hub, just restart the training job to retry. Alternatively, you can download the model assets from HF separately and upload them directly to Amazon S3 (do not tar or compress the files). Then, in your training script, use s5cmd to download them from S3 on local rank 0 only, with torch.distributed.barrier() to sync up all ranks (please refer to https://github.com/yuhuiaws/finetuning-and-deploying-llama-on-Sagemaker/blob/main/finetuning-llama-by-deepspeed/train.py); this is much faster than downloading through the HF API. A sketch of the pattern follows.
24 | 
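A minimal sketch of that download pattern (the bucket/prefix s3://your-bucket/flan-t5-xxl/, the local directory, and the location of the s5cmd binary are placeholder assumptions; it also assumes the process group has already been initialized and that the launcher has set LOCAL_RANK):

    import os
    import torch.distributed as dist
    from transformers import AutoModelForSeq2SeqLM

    model_dir = "/tmp/model_assets/"  # local path with enough disk space (e.g. on p4d.24xlarge)
    if int(os.environ["LOCAL_RANK"]) == 0:
        # Only one process per node downloads; s5cmd parallelizes the S3 transfer.
        os.system(f"./s5cmd sync s3://your-bucket/flan-t5-xxl/* {model_dir}")
    dist.barrier()  # the other ranks block here until the download has finished
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

The barrier is what makes this safe: every other rank waits until rank 0 has fully populated the local directory before loading from it.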
25 | The built-in SageMaker Hugging Face training container may have changed over time, which can cause DeepSpeed training on SageMaker to fail. The workaround is as follows:
26 | 
27 | a. Change the requirements.txt as follows:
28 | 
29 |     transformers==4.28.1
30 |     datasets
31 |     sentencepiece
32 |     accelerate
33 |     evaluate
34 |     deepspeed==0.9.2
35 |     ninja
36 |     rouge-score
37 |     bitsandbytes
38 | 
39 | As for the transformers version, you should probably upgrade it to 4.36.0 because of the Deserialization of Untrusted Data vulnerability in earlier releases.
40 | 
41 | b. Switch from the SageMaker Hugging Face training container to the SageMaker PyTorch 1.12 training container:
42 | 
43 |     from sagemaker.pytorch import PyTorch
44 |     estimator = PyTorch(entry_point='start.py',
45 |                         source_dir='src',
46 |                         instance_type='ml.p4d.24xlarge',
47 |                         instance_count=2,
48 |                         role=role,
49 |                         base_job_name=job_name,
50 |                         keep_alive_period_in_seconds=1800,
51 |                         framework_version='1.12.0',
52 |                         py_version='py38',
53 |                         environment=environment,
54 |                         disable_profiler=True,
55 |                         debugger_hook_config=False)
56 | 
57 | 
58 | ### Some useful tips:
59 | 
60 | 1. This repo uses the open source "s5cmd" binary: after saving the model to the container's local path, use the "s5cmd" command to speed up uploading the model assets to S3 (do not tar or compress the assets, just upload them directly).
61 | 2. When training an LLM with DeepSpeed ZeRO stage 2 on multiple nodes in SageMaker, the job may hang until the NCCL communication times out. When that happens, check the GPU memory utilization of the training instances in Amazon CloudWatch. In my experiment the GPU memory was almost full (although no OOM occurred), which can be a signal to switch to ZeRO stage 3 (the issue disappeared when I switched to stage 3).
62 | 3. By default, DeepSpeed expects that a multi-node environment uses shared storage. If this is not the case and each node can only see its local filesystem, you need to set the parameter "save_on_each_node" of the Seq2SeqTrainingArguments or TrainingArguments API to true (in this repo I didn't use a shared data store such as EFS to save the model, so I set "save_on_each_node" to true).
63 | 4. When training on multiple GPUs with DeepSpeed, if the parameter "stage3_gather_16bit_weights_on_model_save" in the DeepSpeed config file is set to false, no pytorch_model.bin will be generated at the end. You can use the zero_to_fp32.py script (located in the saved model assets path) to convert the DeepSpeed ZeRO sharded checkpoints into an fp32 PyTorch model bin on a SageMaker notebook instance or in SageMaker Studio; the procedure consumes a lot of memory and time (see the sketch at the end of this README). If "stage3_gather_16bit_weights_on_model_save" is set to true, pytorch_model.bin will be generated at the end ("stage3_gather_16bit_weights_on_model_save" enables fp16 weight consolidation when the model gets saved; with large models and multiple GPUs this is an expensive operation in terms of both memory and speed). So how should you configure "stage3_gather_16bit_weights_on_model_save"? It is up to you; I set it to true if the training speed does not drop significantly.
64 | 5. When you use DeepSpeed multi-node training and set the parameter "load_best_model_at_end" (from the Seq2SeqTrainingArguments or TrainingArguments API) to true, an error may occur when the training procedure finishes. The error looks like the following:
65 | 
66 |     Could not locate the best model at /tmp/checkpoint-60/pytorch_model.bin, if you are running
67 |     distributed training on multiple nodes, you should activate `--save_on_each_node`.
68 | 
69 | In fact, I had already set the parameter "save_on_each_node" to true (my environment: transformers 4.26.0, PyTorch 1.10, Python 3.8). My fix is to save only the final model and set "load_best_model_at_end" to false, which avoids the issue.
70 | 
71 | 6. If you just want to save the best model weights, you can set the parameter "output_dir" (from the Seq2SeqTrainingArguments or TrainingArguments API) to a temporary path such as '/tmp' on p4d.24xlarge ("/tmp" has enough disk space). If you want to keep all of the checkpoints produced during training, set output_dir to the checkpoint local path instead (this slows down multi-node training: SageMaker uploads the checkpoints to S3 in near real time, which consumes network bandwidth and hurts the communication efficiency between the nodes in the cluster).
72 | 7. When the parameter "compute_metrics" of the Trainer or Seq2SeqTrainer API is used, the evaluation procedure is very slow. So if you just want the whole training process to run through, you can comment out "compute_metrics".
73 | 8. When your training script downloads something from the internet (such as nltk.download("punkt")), you should ensure that only one process per node (local rank 0) downloads the files; otherwise the training job may fail with errors like these:
74 | 
75 |     Traceback (most recent call last):
76 |       File "/opt/ml/code/T5_configz_and_code/scripts/run_seq2seq_deepspeed.py", line 26, in 
77 |         nltk.download("punkt", quiet=True)
78 |       File "/opt/conda/lib/python3.8/site-packages/nltk/downloader.py", line 777, in download
79 |         for msg in self.incr_download(info_or_id, download_dir, force):
80 |       File "/opt/conda/lib/python3.8/site-packages/nltk/downloader.py", line 642, in incr_download
81 |         yield from self._download_package(info, download_dir, force)
82 |       File "/opt/conda/lib/python3.8/site-packages/nltk/downloader.py", line 699, in _download_package
83 |         os.makedirs(download_dir)
84 |       File "/opt/conda/lib/python3.8/os.py", line 223, in makedirs
85 |         mkdir(name, mode)
86 |     FileExistsError: [Errno 17] File exists: '/root/nltk_data'
87 |     [nltk_data] [Errno 2] No such file or directory:
88 |     [nltk_data]     '/root/nltk_data/tokenizers/punkt.zip'
89 |     [nltk_data]     Error with downloaded zip file
90 |     [nltk_data] [Errno 2] No such file or directory:
91 |     [nltk_data]     '/root/nltk_data/tokenizers/punkt.zip'
92 |     Downloading builder script: 0%| | 0.00/6.27k [00:00
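The fix used in this repo (see main() in run_seq2seq_deepspeed.py) boils down to the following guard pattern. A minimal sketch, assuming the process group has already been initialized and that the launcher has set the LOCAL_RANK environment variable:

    import os
    import nltk
    import torch.distributed as dist

    if int(os.environ["LOCAL_RANK"]) == 0:
        nltk.download("punkt", quiet=True)  # exactly one download per node
    dist.barrier()  # the other ranks wait here until the download has finished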
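Regarding tip 4 above, here is a minimal sketch of the checkpoint conversion using the helper that DeepSpeed exposes for the same logic. The checkpoint path is hypothetical, and the signature is the one from the DeepSpeed versions pinned in this repo (0.8.x/0.9.x); on those versions you can equivalently run the saved script directly from inside the checkpoint directory as "python zero_to_fp32.py . pytorch_model.bin":

    # Consolidate the sharded ZeRO-3 checkpoints into a single fp32 pytorch_model.bin.
    # Expect this to need several times the model size in CPU memory for an 11B model.
    from deepspeed.utils.zero_to_fp32 import convert_zero_checkpoint_to_fp32_state_dict

    convert_zero_checkpoint_to_fp32_state_dict(
        checkpoint_dir="/tmp/checkpoint-80",                    # hypothetical path containing the ZeRO shards
        output_file="/tmp/checkpoint-80/pytorch_model.bin",     # consolidated fp32 weights
    )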