├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── Flan-T5-XXL-multiple-nodes-training-and-deploy-on-SageMaker ├── deploy │ └── deploy-flan-t5.ipynb └── fine-tuning │ ├── DeepSpeed-Flan-T5-on-Sagemaker-multiple-nodes.ipynb │ └── src │ ├── T5_configz_and_code │ ├── configs │ │ └── ds_flan_t5_z3_config_bf16.json │ └── scripts │ │ ├── run_seq2seq_deepspeed.py │ │ └── torch_launch.sh │ ├── requirements.txt │ └── start.py ├── LICENSE └── README.md /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute to. 
As our projects use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /Flan-T5-XXL-multiple-nodes-training-and-deploy-on-SageMaker/deploy/deploy-flan-t5.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "d245b130-4eaf-4fa1-b7f8-7eb5e1e9a7eb", 6 | "metadata": { 7 | "tags": [] 8 | }, 9 | "source": [ 10 | "# Deploy Flan-T5 XXL on SageMaker\n", 11 | "\n", 12 | "Now we will deploy the model, which was trained on SageMaker with DeepSpeed on multiple nodes, to a SageMaker real-time endpoint." 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "id": "ea489d40-d76e-4f19-a8b0-5755f7b1f23d", 19 | "metadata": { 20 | "tags": [] 21 | }, 22 | "outputs": [], 23 | "source": [ 24 | "import sagemaker\n", 25 | "import boto3\n", 26 | "\n", 27 | "sess = sagemaker.Session()\n", 28 | "role = sagemaker.get_execution_role()\n", 29 | "\n", 30 | "print(f\"sagemaker role arn: {role}\")\n", 31 | "print(f\"sagemaker bucket: {sess.default_bucket()}\")\n", 32 | "print(f\"sagemaker session region: {sess.boto_region_name}\")\n" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "id": "7b79706d-a346-4072-878e-4e1205e4a4f5", 38 | "metadata": { 39 | "tags": [] 40 | }, 41 | "source": [ 42 | "We trained Flan-T5-XXL, and the model is saved in BF16 format. We will use Hugging Face Accelerate to speed up the model inference. " 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": null, 48 | "id": "b30d0c87-d3f7-4689-829e-d6b02fe9f60f", 49 | "metadata": { 50 | "tags": [] 51 | }, 52 | "outputs": [], 53 | "source": [ 54 | "!mkdir deploy_code" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "id": "572ed109-920f-4d6f-8215-e9f3e95e9718", 61 | "metadata": { 62 | "tags": [] 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "%%writefile deploy_code/requirements.txt\n", 67 | "accelerate==0.16.0\n", 68 | "transformers==4.26.0\n", 69 | "bitsandbytes==0.37.0" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "id": "1fd6d42e-beb4-4632-a3f2-cb2f483e560e", 75 | "metadata": { 76 | "tags": [] 77 | }, 78 | "source": [ 79 | "Now we configure the serving properties. We use Hugging Face Accelerate to speed up the model inference (set \"engine\" to \"Python\"). We will deploy on a g5.48xlarge instance (which has 8 GPUs), so option.tensor_parallel_degree is set to 8. 
Finally, please configure option.s3url to your model assets' S3 path; the trailing '/' is required (such as s3://your_bucket/flan-t5-xxl/model/)." 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "id": "3d366acc-b7cc-492b-952b-42db3d6ac75c", 86 | "metadata": { 87 | "tags": [] 88 | }, 89 | "outputs": [], 90 | "source": [ 91 | "%%writefile deploy_code/serving.properties\n", 92 | "engine=Python\n", 93 | "option.tensor_parallel_degree=8\n", 94 | "option.s3url=s3://your_bucket/flan-t5-xxl/model/" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "id": "7e35c899-9c1a-443b-89b0-6f4ed27593c5", 101 | "metadata": { 102 | "tags": [] 103 | }, 104 | "outputs": [], 105 | "source": [ 106 | "%%writefile deploy_code/model.py\n", 107 | "from djl_python import Input, Output\n", 108 | "import torch\n", 109 | "import logging\n", 110 | "import math\n", 111 | "import os\n", 112 | "from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer\n", 113 | "\n", 114 | "\n", 115 | "def load_model(properties):\n", 116 | " tensor_parallel = properties[\"tensor_parallel_degree\"]\n", 117 | " model_location = properties['model_dir']\n", 118 | " if \"model_id\" in properties:\n", 119 | " model_location = properties['model_id']\n", 120 | " logging.info(f\"Loading model in {model_location}\")\n", 121 | " \n", 122 | " tokenizer = AutoTokenizer.from_pretrained(model_location)\n", 123 | " \n", 124 | " model = AutoModelForSeq2SeqLM.from_pretrained(\n", 125 | " model_location, \n", 126 | " device_map=\"balanced_low_0\", \n", 127 | " #load_in_8bit=True\n", 128 | " )\n", 129 | " model.requires_grad_(False)\n", 130 | " model.eval()\n", 131 | " \n", 132 | " return model, tokenizer\n", 133 | "\n", 134 | "\n", 135 | "model = None\n", 136 | "tokenizer = None\n", 137 | "generator = None\n", 138 | "\n", 139 | "\n", 140 | "def handle(inputs: Input):\n", 141 | " global model, tokenizer\n", 142 | " if not model:\n", 143 | " model, tokenizer = load_model(inputs.get_properties())\n", 144 | "\n", 145 | " if inputs.is_empty():\n", 146 | " return None\n", 147 | " data = inputs.get_as_json()\n", 148 | " \n", 149 | " input_sentences = data[\"inputs\"]\n", 150 | " params = data.get(\"parameters\")  # .get() avoids a KeyError when \"parameters\" is omitted\n", 151 | " \n", 152 | " # preprocess\n", 153 | " input_ids = tokenizer(input_sentences, return_tensors=\"pt\").input_ids\n", 154 | " # pass inputs with all kwargs in data\n", 155 | " if params is not None:\n", 156 | " outputs = model.generate(input_ids, **params)\n", 157 | " else:\n", 158 | " outputs = model.generate(input_ids)\n", 159 | "\n", 160 | " # postprocess the prediction\n", 161 | " prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)\n", 162 | " \n", 163 | " result = {\"outputs\": prediction}\n", 164 | " return Output().add_as_json(result)" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "id": "23a3125b-fddf-458e-ad21-9fa61d923606", 170 | "metadata": {}, 171 | "source": [ 172 | "We will use the LMI (Large Model Inference) container on SageMaker to serve the LLM. The container wires the serving.properties entries into the `properties` dict that our `load_model` function receives, as illustrated below."
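, "\n", "\n", "A rough illustration of that wiring (the local download path below is hypothetical; the container manages the real one):\n", "\n", "```python\n", "# what load_model() effectively receives at startup, given the serving.properties above\n", "properties = {\n", "    \"tensor_parallel_degree\": \"8\",             # from option.tensor_parallel_degree\n", "    \"model_dir\": \"/path/to/downloaded/model\",  # populated from the contents of option.s3url\n", "}\n", "model, tokenizer = load_model(properties)\n", "```"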
173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "id": "879f42da-c2dd-47a3-96ad-43dde83fdfd5", 179 | "metadata": { 180 | "tags": [] 181 | }, 182 | "outputs": [], 183 | "source": [ 184 | "import sagemaker\n", 185 | "\n", 186 | "sess = sagemaker.Session()\n", 187 | "region = sess.boto_region_name\n", 188 | "\n", 189 | "inference_image_uri = (\n", 190 | " f\"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.21.0-deepspeed0.8.0-cu117\"\n", 191 | ")\n", 192 | "\n", 193 | "print(f\"Image going to be used is ---- > {inference_image_uri}\")" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "id": "7a7070fe-3a46-4f28-976b-33da12a1880f", 200 | "metadata": { 201 | "tags": [] 202 | }, 203 | "outputs": [], 204 | "source": [ 205 | "!rm -f model-liang.tar.gz\n", 206 | "!tar czvf model-liang.tar.gz -C deploy_code ." 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "id": "265902c9-6406-45b7-b75f-2c1c5d86c65b", 213 | "metadata": { 214 | "tags": [] 215 | }, 216 | "outputs": [], 217 | "source": [ 218 | "s3_code_prefix = 'code_flan_t5_LMI_liang'\n", 219 | "bucket = sess.default_bucket() \n", 220 | "s3_code_artifact = sess.upload_data(\"model-liang.tar.gz\", bucket, s3_code_prefix)\n", 221 | "print(f\"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}\")" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "id": "f0d7c064-964d-4c09-8592-2806e6325ed0", 228 | "metadata": { 229 | "tags": [] 230 | }, 231 | "outputs": [], 232 | "source": [ 233 | "from sagemaker.utils import name_from_base\n", 234 | "import boto3\n", 235 | "sm_client = boto3.client(\"sagemaker\")\n", 236 | "smr_client = boto3.client(\"sagemaker-runtime\")\n", 237 | "\n", 238 | "model_name = name_from_base(f\"flan-t5-xxl-accelerate-LMI\")\n", 239 | "print(model_name)\n", 240 | "print(f\"Image going to be used is ---- > {inference_image_uri}\")\n", 241 | "\n", 242 | "create_model_response = sm_client.create_model(\n", 243 | " ModelName=model_name,\n", 244 | " ExecutionRoleArn=role,\n", 245 | " PrimaryContainer={\n", 246 | " \"Image\": inference_image_uri,\n", 247 | " \"ModelDataUrl\": s3_code_artifact\n", 248 | " },\n", 249 | " \n", 250 | ")\n", 251 | "model_arn = create_model_response[\"ModelArn\"]\n", 252 | "\n", 253 | "print(f\"Created Model: {model_arn}\")" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": null, 259 | "id": "76f48151-f61d-4e27-8671-12108a9a882c", 260 | "metadata": { 261 | "tags": [] 262 | }, 263 | "outputs": [], 264 | "source": [ 265 | "endpoint_config_name = f\"{model_name}-config-88\"\n", 266 | "endpoint_name = f\"{model_name}-endpoint\"\n", 267 | "\n", 268 | "endpoint_config_response = sm_client.create_endpoint_config(\n", 269 | " EndpointConfigName=endpoint_config_name,\n", 270 | " ProductionVariants=[\n", 271 | " {\n", 272 | " \"VariantName\": \"variant1\",\n", 273 | " \"ModelName\": model_name,\n", 274 | " \"InstanceType\": \"ml.g5.48xlarge\",\n", 275 | " \"InitialInstanceCount\": 1,\n", 276 | " #\"ModelDataDownloadTimeoutInSeconds\": 2400,\n", 277 | " \"ContainerStartupHealthCheckTimeoutInSeconds\": 2400,\n", 278 | " },\n", 279 | " ],\n", 280 | ")\n", 281 | "endpoint_config_response" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": null, 287 | "id": "04c1882e-32c3-41c6-befd-ad9907d0b340", 288 | "metadata": { 289 | "tags": [] 290 | }, 291 | "outputs": [], 292 | "source": [ 293 | "create_endpoint_response 
= sm_client.create_endpoint(\n", 294 | " EndpointName=f\"{endpoint_name}\", EndpointConfigName=endpoint_config_name\n", 295 | ")\n", 296 | "print(f\"Created Endpoint: {create_endpoint_response['EndpointArn']}\")" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": null, 302 | "id": "4c9520f0-c04b-4f59-8eab-9a5b287112a7", 303 | "metadata": { 304 | "tags": [] 305 | }, 306 | "outputs": [], 307 | "source": [ 308 | "import time\n", 309 | "\n", 310 | "resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", 311 | "status = resp[\"EndpointStatus\"]\n", 312 | "print(\"Status: \" + status)\n", 313 | "\n", 314 | "while status == \"Creating\":\n", 315 | " time.sleep(60)\n", 316 | " resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", 317 | " status = resp[\"EndpointStatus\"]\n", 318 | " print(\"Status: \" + status)\n", 319 | "\n", 320 | "print(\"Arn: \" + resp[\"EndpointArn\"])\n", 321 | "print(\"Status: \" + status)" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "id": "d96b9462-a2d8-4cda-a8cd-e068992f3dee", 327 | "metadata": {}, 328 | "source": [ 329 | "Use the low-level boto3 API to invoke the endpoint and generate text." 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "id": "3900294a-bea9-481d-a822-e72b69ce18fc", 336 | "metadata": { 337 | "tags": [] 338 | }, 339 | "outputs": [], 340 | "source": [ 341 | "%%time\n", 342 | "import json\n", 343 | "import boto3\n", 344 | "\n", 345 | "smr_client = boto3.client(\"sagemaker-runtime\")\n", 346 | "\n", 347 | "prompts = \"\"\"Summarize the following news article:\n", 348 | "Peter and Elizabeth took a taxi to attend the night party in the city. While at the party, Elizabeth collapsed and was rushed to the hospital.\n", 349 | "Since she was diagnosed with a brain injury, the doctor told Peter to stay beside her until she gets well. 
Therefore, Peter stayed with her at the hospital for 3 days without leaving.\n", 350 | "Summary:\n", 351 | "\"\"\"\n", 352 | "\n", 353 | "parameters = {\n", 354 | " #\"early_stopping\": True,\n", 355 | " #\"length_penalty\": 2.0,\n", 356 | " \"max_new_tokens\": 50,\n", 357 | " \"temperature\": 0,\n", 358 | " \"min_length\": 10,\n", 359 | " \"no_repeat_ngram_size\": 2,\n", 360 | "}\n", 361 | "\n", 362 | "\n", 363 | "response_model = smr_client.invoke_endpoint(\n", 364 | " EndpointName=endpoint_name,\n", 365 | " Body=json.dumps(\n", 366 | " {\n", 367 | " \"inputs\": prompts,\n", 368 | " \"parameters\": parameters\n", 369 | " }\n", 370 | " ),\n", 371 | " ContentType=\"application/json\",\n", 372 | " )\n", 373 | "\n", 374 | "response_model['Body'].read().decode('utf8')" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "id": "0d5acf45-1187-4cce-a57f-9cc499891f52", 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [] 384 | } 385 | ], 386 | "metadata": { 387 | "availableInstances": [ 388 | { 389 | "_defaultOrder": 0, 390 | "_isFastLaunch": true, 391 | "category": "General purpose", 392 | "gpuNum": 0, 393 | "memoryGiB": 4, 394 | "name": "ml.t3.medium", 395 | "vcpuNum": 2 396 | }, 397 | { 398 | "_defaultOrder": 1, 399 | "_isFastLaunch": false, 400 | "category": "General purpose", 401 | "gpuNum": 0, 402 | "memoryGiB": 8, 403 | "name": "ml.t3.large", 404 | "vcpuNum": 2 405 | }, 406 | { 407 | "_defaultOrder": 2, 408 | "_isFastLaunch": false, 409 | "category": "General purpose", 410 | "gpuNum": 0, 411 | "memoryGiB": 16, 412 | "name": "ml.t3.xlarge", 413 | "vcpuNum": 4 414 | }, 415 | { 416 | "_defaultOrder": 3, 417 | "_isFastLaunch": false, 418 | "category": "General purpose", 419 | "gpuNum": 0, 420 | "memoryGiB": 32, 421 | "name": "ml.t3.2xlarge", 422 | "vcpuNum": 8 423 | }, 424 | { 425 | "_defaultOrder": 4, 426 | "_isFastLaunch": true, 427 | "category": "General purpose", 428 | "gpuNum": 0, 429 | "memoryGiB": 8, 430 | "name": "ml.m5.large", 431 | "vcpuNum": 2 432 | }, 433 | { 434 | "_defaultOrder": 5, 435 | "_isFastLaunch": false, 436 | "category": "General purpose", 437 | "gpuNum": 0, 438 | "memoryGiB": 16, 439 | "name": "ml.m5.xlarge", 440 | "vcpuNum": 4 441 | }, 442 | { 443 | "_defaultOrder": 6, 444 | "_isFastLaunch": false, 445 | "category": "General purpose", 446 | "gpuNum": 0, 447 | "memoryGiB": 32, 448 | "name": "ml.m5.2xlarge", 449 | "vcpuNum": 8 450 | }, 451 | { 452 | "_defaultOrder": 7, 453 | "_isFastLaunch": false, 454 | "category": "General purpose", 455 | "gpuNum": 0, 456 | "memoryGiB": 64, 457 | "name": "ml.m5.4xlarge", 458 | "vcpuNum": 16 459 | }, 460 | { 461 | "_defaultOrder": 8, 462 | "_isFastLaunch": false, 463 | "category": "General purpose", 464 | "gpuNum": 0, 465 | "memoryGiB": 128, 466 | "name": "ml.m5.8xlarge", 467 | "vcpuNum": 32 468 | }, 469 | { 470 | "_defaultOrder": 9, 471 | "_isFastLaunch": false, 472 | "category": "General purpose", 473 | "gpuNum": 0, 474 | "memoryGiB": 192, 475 | "name": "ml.m5.12xlarge", 476 | "vcpuNum": 48 477 | }, 478 | { 479 | "_defaultOrder": 10, 480 | "_isFastLaunch": false, 481 | "category": "General purpose", 482 | "gpuNum": 0, 483 | "memoryGiB": 256, 484 | "name": "ml.m5.16xlarge", 485 | "vcpuNum": 64 486 | }, 487 | { 488 | "_defaultOrder": 11, 489 | "_isFastLaunch": false, 490 | "category": "General purpose", 491 | "gpuNum": 0, 492 | "memoryGiB": 384, 493 | "name": "ml.m5.24xlarge", 494 | "vcpuNum": 96 495 | }, 496 | { 497 | "_defaultOrder": 12, 498 | "_isFastLaunch": false, 499 | "category": 
"General purpose", 500 | "gpuNum": 0, 501 | "memoryGiB": 8, 502 | "name": "ml.m5d.large", 503 | "vcpuNum": 2 504 | }, 505 | { 506 | "_defaultOrder": 13, 507 | "_isFastLaunch": false, 508 | "category": "General purpose", 509 | "gpuNum": 0, 510 | "memoryGiB": 16, 511 | "name": "ml.m5d.xlarge", 512 | "vcpuNum": 4 513 | }, 514 | { 515 | "_defaultOrder": 14, 516 | "_isFastLaunch": false, 517 | "category": "General purpose", 518 | "gpuNum": 0, 519 | "memoryGiB": 32, 520 | "name": "ml.m5d.2xlarge", 521 | "vcpuNum": 8 522 | }, 523 | { 524 | "_defaultOrder": 15, 525 | "_isFastLaunch": false, 526 | "category": "General purpose", 527 | "gpuNum": 0, 528 | "memoryGiB": 64, 529 | "name": "ml.m5d.4xlarge", 530 | "vcpuNum": 16 531 | }, 532 | { 533 | "_defaultOrder": 16, 534 | "_isFastLaunch": false, 535 | "category": "General purpose", 536 | "gpuNum": 0, 537 | "memoryGiB": 128, 538 | "name": "ml.m5d.8xlarge", 539 | "vcpuNum": 32 540 | }, 541 | { 542 | "_defaultOrder": 17, 543 | "_isFastLaunch": false, 544 | "category": "General purpose", 545 | "gpuNum": 0, 546 | "memoryGiB": 192, 547 | "name": "ml.m5d.12xlarge", 548 | "vcpuNum": 48 549 | }, 550 | { 551 | "_defaultOrder": 18, 552 | "_isFastLaunch": false, 553 | "category": "General purpose", 554 | "gpuNum": 0, 555 | "memoryGiB": 256, 556 | "name": "ml.m5d.16xlarge", 557 | "vcpuNum": 64 558 | }, 559 | { 560 | "_defaultOrder": 19, 561 | "_isFastLaunch": false, 562 | "category": "General purpose", 563 | "gpuNum": 0, 564 | "memoryGiB": 384, 565 | "name": "ml.m5d.24xlarge", 566 | "vcpuNum": 96 567 | }, 568 | { 569 | "_defaultOrder": 20, 570 | "_isFastLaunch": true, 571 | "category": "Compute optimized", 572 | "gpuNum": 0, 573 | "memoryGiB": 4, 574 | "name": "ml.c5.large", 575 | "vcpuNum": 2 576 | }, 577 | { 578 | "_defaultOrder": 21, 579 | "_isFastLaunch": false, 580 | "category": "Compute optimized", 581 | "gpuNum": 0, 582 | "memoryGiB": 8, 583 | "name": "ml.c5.xlarge", 584 | "vcpuNum": 4 585 | }, 586 | { 587 | "_defaultOrder": 22, 588 | "_isFastLaunch": false, 589 | "category": "Compute optimized", 590 | "gpuNum": 0, 591 | "memoryGiB": 16, 592 | "name": "ml.c5.2xlarge", 593 | "vcpuNum": 8 594 | }, 595 | { 596 | "_defaultOrder": 23, 597 | "_isFastLaunch": false, 598 | "category": "Compute optimized", 599 | "gpuNum": 0, 600 | "memoryGiB": 32, 601 | "name": "ml.c5.4xlarge", 602 | "vcpuNum": 16 603 | }, 604 | { 605 | "_defaultOrder": 24, 606 | "_isFastLaunch": false, 607 | "category": "Compute optimized", 608 | "gpuNum": 0, 609 | "memoryGiB": 72, 610 | "name": "ml.c5.9xlarge", 611 | "vcpuNum": 36 612 | }, 613 | { 614 | "_defaultOrder": 25, 615 | "_isFastLaunch": false, 616 | "category": "Compute optimized", 617 | "gpuNum": 0, 618 | "memoryGiB": 96, 619 | "name": "ml.c5.12xlarge", 620 | "vcpuNum": 48 621 | }, 622 | { 623 | "_defaultOrder": 26, 624 | "_isFastLaunch": false, 625 | "category": "Compute optimized", 626 | "gpuNum": 0, 627 | "memoryGiB": 144, 628 | "name": "ml.c5.18xlarge", 629 | "vcpuNum": 72 630 | }, 631 | { 632 | "_defaultOrder": 27, 633 | "_isFastLaunch": false, 634 | "category": "Compute optimized", 635 | "gpuNum": 0, 636 | "memoryGiB": 192, 637 | "name": "ml.c5.24xlarge", 638 | "vcpuNum": 96 639 | }, 640 | { 641 | "_defaultOrder": 28, 642 | "_isFastLaunch": true, 643 | "category": "Accelerated computing", 644 | "gpuNum": 1, 645 | "memoryGiB": 16, 646 | "name": "ml.g4dn.xlarge", 647 | "vcpuNum": 4 648 | }, 649 | { 650 | "_defaultOrder": 29, 651 | "_isFastLaunch": false, 652 | "category": "Accelerated computing", 653 | "gpuNum": 1, 654 | 
"memoryGiB": 32, 655 | "name": "ml.g4dn.2xlarge", 656 | "vcpuNum": 8 657 | }, 658 | { 659 | "_defaultOrder": 30, 660 | "_isFastLaunch": false, 661 | "category": "Accelerated computing", 662 | "gpuNum": 1, 663 | "memoryGiB": 64, 664 | "name": "ml.g4dn.4xlarge", 665 | "vcpuNum": 16 666 | }, 667 | { 668 | "_defaultOrder": 31, 669 | "_isFastLaunch": false, 670 | "category": "Accelerated computing", 671 | "gpuNum": 1, 672 | "memoryGiB": 128, 673 | "name": "ml.g4dn.8xlarge", 674 | "vcpuNum": 32 675 | }, 676 | { 677 | "_defaultOrder": 32, 678 | "_isFastLaunch": false, 679 | "category": "Accelerated computing", 680 | "gpuNum": 4, 681 | "memoryGiB": 192, 682 | "name": "ml.g4dn.12xlarge", 683 | "vcpuNum": 48 684 | }, 685 | { 686 | "_defaultOrder": 33, 687 | "_isFastLaunch": false, 688 | "category": "Accelerated computing", 689 | "gpuNum": 1, 690 | "memoryGiB": 256, 691 | "name": "ml.g4dn.16xlarge", 692 | "vcpuNum": 64 693 | }, 694 | { 695 | "_defaultOrder": 34, 696 | "_isFastLaunch": false, 697 | "category": "Accelerated computing", 698 | "gpuNum": 1, 699 | "memoryGiB": 61, 700 | "name": "ml.p3.2xlarge", 701 | "vcpuNum": 8 702 | }, 703 | { 704 | "_defaultOrder": 35, 705 | "_isFastLaunch": false, 706 | "category": "Accelerated computing", 707 | "gpuNum": 4, 708 | "memoryGiB": 244, 709 | "name": "ml.p3.8xlarge", 710 | "vcpuNum": 32 711 | }, 712 | { 713 | "_defaultOrder": 36, 714 | "_isFastLaunch": false, 715 | "category": "Accelerated computing", 716 | "gpuNum": 8, 717 | "memoryGiB": 488, 718 | "name": "ml.p3.16xlarge", 719 | "vcpuNum": 64 720 | }, 721 | { 722 | "_defaultOrder": 37, 723 | "_isFastLaunch": false, 724 | "category": "Accelerated computing", 725 | "gpuNum": 8, 726 | "memoryGiB": 768, 727 | "name": "ml.p3dn.24xlarge", 728 | "vcpuNum": 96 729 | }, 730 | { 731 | "_defaultOrder": 38, 732 | "_isFastLaunch": false, 733 | "category": "Memory Optimized", 734 | "gpuNum": 0, 735 | "memoryGiB": 16, 736 | "name": "ml.r5.large", 737 | "vcpuNum": 2 738 | }, 739 | { 740 | "_defaultOrder": 39, 741 | "_isFastLaunch": false, 742 | "category": "Memory Optimized", 743 | "gpuNum": 0, 744 | "memoryGiB": 32, 745 | "name": "ml.r5.xlarge", 746 | "vcpuNum": 4 747 | }, 748 | { 749 | "_defaultOrder": 40, 750 | "_isFastLaunch": false, 751 | "category": "Memory Optimized", 752 | "gpuNum": 0, 753 | "memoryGiB": 64, 754 | "name": "ml.r5.2xlarge", 755 | "vcpuNum": 8 756 | }, 757 | { 758 | "_defaultOrder": 41, 759 | "_isFastLaunch": false, 760 | "category": "Memory Optimized", 761 | "gpuNum": 0, 762 | "memoryGiB": 128, 763 | "name": "ml.r5.4xlarge", 764 | "vcpuNum": 16 765 | }, 766 | { 767 | "_defaultOrder": 42, 768 | "_isFastLaunch": false, 769 | "category": "Memory Optimized", 770 | "gpuNum": 0, 771 | "memoryGiB": 256, 772 | "name": "ml.r5.8xlarge", 773 | "vcpuNum": 32 774 | }, 775 | { 776 | "_defaultOrder": 43, 777 | "_isFastLaunch": false, 778 | "category": "Memory Optimized", 779 | "gpuNum": 0, 780 | "memoryGiB": 384, 781 | "name": "ml.r5.12xlarge", 782 | "vcpuNum": 48 783 | }, 784 | { 785 | "_defaultOrder": 44, 786 | "_isFastLaunch": false, 787 | "category": "Memory Optimized", 788 | "gpuNum": 0, 789 | "memoryGiB": 512, 790 | "name": "ml.r5.16xlarge", 791 | "vcpuNum": 64 792 | }, 793 | { 794 | "_defaultOrder": 45, 795 | "_isFastLaunch": false, 796 | "category": "Memory Optimized", 797 | "gpuNum": 0, 798 | "memoryGiB": 768, 799 | "name": "ml.r5.24xlarge", 800 | "vcpuNum": 96 801 | }, 802 | { 803 | "_defaultOrder": 46, 804 | "_isFastLaunch": false, 805 | "category": "Accelerated computing", 806 | "gpuNum": 1, 807 | 
"memoryGiB": 16, 808 | "name": "ml.g5.xlarge", 809 | "vcpuNum": 4 810 | }, 811 | { 812 | "_defaultOrder": 47, 813 | "_isFastLaunch": false, 814 | "category": "Accelerated computing", 815 | "gpuNum": 1, 816 | "memoryGiB": 32, 817 | "name": "ml.g5.2xlarge", 818 | "vcpuNum": 8 819 | }, 820 | { 821 | "_defaultOrder": 48, 822 | "_isFastLaunch": false, 823 | "category": "Accelerated computing", 824 | "gpuNum": 1, 825 | "memoryGiB": 64, 826 | "name": "ml.g5.4xlarge", 827 | "vcpuNum": 16 828 | }, 829 | { 830 | "_defaultOrder": 49, 831 | "_isFastLaunch": false, 832 | "category": "Accelerated computing", 833 | "gpuNum": 1, 834 | "memoryGiB": 128, 835 | "name": "ml.g5.8xlarge", 836 | "vcpuNum": 32 837 | }, 838 | { 839 | "_defaultOrder": 50, 840 | "_isFastLaunch": false, 841 | "category": "Accelerated computing", 842 | "gpuNum": 1, 843 | "memoryGiB": 256, 844 | "name": "ml.g5.16xlarge", 845 | "vcpuNum": 64 846 | }, 847 | { 848 | "_defaultOrder": 51, 849 | "_isFastLaunch": false, 850 | "category": "Accelerated computing", 851 | "gpuNum": 4, 852 | "memoryGiB": 192, 853 | "name": "ml.g5.12xlarge", 854 | "vcpuNum": 48 855 | }, 856 | { 857 | "_defaultOrder": 52, 858 | "_isFastLaunch": false, 859 | "category": "Accelerated computing", 860 | "gpuNum": 4, 861 | "memoryGiB": 384, 862 | "name": "ml.g5.24xlarge", 863 | "vcpuNum": 96 864 | }, 865 | { 866 | "_defaultOrder": 53, 867 | "_isFastLaunch": false, 868 | "category": "Accelerated computing", 869 | "gpuNum": 8, 870 | "memoryGiB": 768, 871 | "name": "ml.g5.48xlarge", 872 | "vcpuNum": 192 873 | } 874 | ], 875 | "instance_type": "ml.m5.large", 876 | "kernelspec": { 877 | "display_name": "Python 3 (Data Science)", 878 | "language": "python", 879 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" 880 | }, 881 | "language_info": { 882 | "codemirror_mode": { 883 | "name": "ipython", 884 | "version": 3 885 | }, 886 | "file_extension": ".py", 887 | "mimetype": "text/x-python", 888 | "name": "python", 889 | "nbconvert_exporter": "python", 890 | "pygments_lexer": "ipython3", 891 | "version": "3.7.10" 892 | } 893 | }, 894 | "nbformat": 4, 895 | "nbformat_minor": 5 896 | } 897 | -------------------------------------------------------------------------------- /Flan-T5-XXL-multiple-nodes-training-and-deploy-on-SageMaker/fine-tuning/DeepSpeed-Flan-T5-on-Sagemaker-multiple-nodes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Fine-tune FLAN-T5 XXL on Multiple nodes using DeepSpeed on Amazon SageMaker \n", 8 | "\n", 9 | "FLAN-T5 is an enhanced version of T5 that has been fine-tuned in a mixture of tasks, or simple words, a better T5 model in any aspect. FLAN-T5 outperforms T5 by double-digit improvements for the same number of parameters.\n", 10 | "\n", 11 | "This repo will show how to fine-tune FLAN-T5 XXL(11B) on multiple nodes using [DeepSpeed ZeRO](https://www.deepspeed.ai/tutorials/zero/) on Amazon SageMaker. And the repo is tested successfully on Data Science image and Python 3 kernel of Sagemaker studio with ml.m5.large kernel gateway instance in us-east-1 region.\n", 12 | "\n", 13 | "It is structured as follows:\n", 14 | "1. process dataset and upload to S3\n", 15 | "2. prepare training script and deepspeed launcher\n", 16 | "3. 
Fine-tune FLAN-T5 XXL on Amazon SageMaker\n", 17 | "\n", 18 | "Before we start, let’s install the required libraries and make sure we have the correct permissions to access S3." 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "metadata": { 25 | "tags": [] 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "!pip install \"transformers==4.26.0\" \"datasets[s3]==2.9.0\" sagemaker --upgrade" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "If you are going to use Sagemaker in a local environment. You need access to an IAM Role with the required permissions for Sagemaker. You can find [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) more about it.\n", 37 | "\n" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "metadata": { 44 | "tags": [] 45 | }, 46 | "outputs": [], 47 | "source": [ 48 | "import sagemaker\n", 49 | "from sagemaker import get_execution_role\n", 50 | "import boto3\n", 51 | "\n", 52 | "sess = sagemaker.Session()\n", 53 | "role = get_execution_role()\n", 54 | "\n", 55 | "print(f\"sagemaker role arn: {role}\")\n", 56 | "print(f\"sagemaker bucket: {sess.default_bucket()}\")\n", 57 | "print(f\"sagemaker session region: {sess.boto_region_name}\")" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "## 1. process dataset and upload to S3\n", 65 | "\n", 66 | "We prepare a dataset on the [CNN Dailymail Dataset](https://huggingface.co/datasets/cnn_dailymail). \n" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": { 73 | "tags": [] 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "# experiment config\n", 78 | "model_id = \"google/flan-t5-xxl\" # Hugging Face Model Id\n", 79 | "dataset_id = \"cnn_dailymail\" # Hugging Face Dataset Id\n", 80 | "dataset_config = \"3.0.0\" # config/verison of the dataset\n", 81 | "save_dataset_path = \"data\" # local path to save processed dataset\n", 82 | "text_column = \"article\" # column of input text is\n", 83 | "summary_column = \"highlights\" # column of the output text \n", 84 | "# custom instruct prompt start\n", 85 | "prompt_template = f\"Summarize the following news article:\\n{{input}}\\nSummary:\\n\"" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "We process (tokenize) the dataset, upload to s3 and pass it into our managed Training job." 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": { 99 | "tags": [] 100 | }, 101 | "outputs": [], 102 | "source": [ 103 | "from datasets import load_dataset\n", 104 | "from transformers import AutoTokenizer\n", 105 | "import numpy as np \n", 106 | "\n", 107 | "dataset = load_dataset(dataset_id,name=dataset_config)\n", 108 | "tokenizer = AutoTokenizer.from_pretrained(model_id)\n", 109 | "\n", 110 | "print(f\"Train dataset size: {len(dataset['train'])}\")\n", 111 | "print(f\"Test dataset size: {len(dataset['test'])}\")\n", 112 | "\n", 113 | "# Train dataset size: 287113\n", 114 | "# Test dataset size: 11490" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "We defined a `prompt_template` in our config, which we will use to construct an instruct prompt for better performance of our model. Our `prompt_template` has a “fixed” start and end, and our document is in the middle. 
This means we need to ensure that the “fixed” template parts plus the document do not exceed the max length of the model. Therefore we calculate the max length of our document, which we will later use for padding and truncation." 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": null, 127 | "metadata": { 128 | "tags": [] 129 | }, 130 | "outputs": [], 131 | "source": [ 132 | "prompt_length = len(tokenizer(prompt_template.format(input=\"\"))[\"input_ids\"])\n", 133 | "max_sample_length = tokenizer.model_max_length - prompt_length\n", 134 | "print(f\"Prompt length: {prompt_length}\")\n", 135 | "print(f\"Max input length: {max_sample_length}\")\n", 136 | "\n", 137 | "# Prompt length: 12\n", 138 | "# Max input length: 500" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "We now know that our documents can be up to 500 tokens long and still fit our `prompt_template` correctly. In addition to our input, we need a better understanding of our “target” sequence length, i.e., how long the summaries in our dataset are. Therefore we iterate over the dataset and calculate the max input length (capped at 500) and the max target length (this takes a few minutes)." 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": { 152 | "tags": [] 153 | }, 154 | "outputs": [], 155 | "source": [ 156 | "from datasets import concatenate_datasets\n", 157 | "import numpy as np\n", 158 | "\n", 159 | "# The maximum total input sequence length after tokenization. \n", 160 | "# Sequences longer than this will be truncated, sequences shorter will be padded.\n", 161 | "tokenized_inputs = concatenate_datasets([dataset[\"train\"], dataset[\"test\"]]).map(lambda x: tokenizer(x[text_column], truncation=True), batched=True, remove_columns=[text_column, summary_column])\n", 162 | "max_source_length = max([len(x) for x in tokenized_inputs[\"input_ids\"]])\n", 163 | "max_source_length = min(max_source_length, max_sample_length)\n", 164 | "print(f\"Max source length: {max_source_length}\")\n", 165 | "\n", 166 | "# The maximum total sequence length for target text after tokenization. \n", 167 | "# Sequences longer than this will be truncated, sequences shorter will be padded.\n", 168 | "tokenized_targets = concatenate_datasets([dataset[\"train\"], dataset[\"test\"]]).map(lambda x: tokenizer(x[summary_column], truncation=True), batched=True, remove_columns=[text_column, summary_column])\n", 169 | "target_lengths = [len(x) for x in tokenized_targets[\"input_ids\"]]\n", 170 | "# use 95th percentile as max target length\n", 171 | "max_target_length = int(np.percentile(target_lengths, 95))\n", 172 | "print(f\"Max target length: {max_target_length}\")" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "We now have everything needed to process our dataset." 
180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "metadata": { 186 | "tags": [] 187 | }, 188 | "outputs": [], 189 | "source": [ 190 | "def preprocess_function(sample, padding=\"max_length\"):\n", 191 | " # create the prompted inputs\n", 192 | " inputs = [prompt_template.format(input=item) for item in sample[text_column]]\n", 193 | "\n", 194 | " # tokenize inputs\n", 195 | " model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)\n", 196 | "\n", 197 | " # Tokenize targets with the `text_target` keyword argument\n", 198 | " labels = tokenizer(text_target=sample[summary_column], max_length=max_target_length, padding=padding, truncation=True)\n", 199 | "\n", 200 | " # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore\n", 201 | " # padding in the loss.\n", 202 | " if padding == \"max_length\":\n", 203 | " labels[\"input_ids\"] = [\n", 204 | " [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels[\"input_ids\"]\n", 205 | " ]\n", 206 | "\n", 207 | " model_inputs[\"labels\"] = labels[\"input_ids\"]\n", 208 | " return model_inputs\n", 209 | "\n", 210 | "# process dataset\n", 211 | "tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=list(dataset[\"train\"].features))" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": {}, 217 | "source": [ 218 | "After we have processed the datasets, we are going to use the new [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) to upload our dataset to S3. We are using the `sess.default_bucket()`, adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script." 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "metadata": { 225 | "tags": [] 226 | }, 227 | "outputs": [], 228 | "source": [ 229 | "# save train_dataset to s3\n", 230 | "training_input_path = f's3://{sess.default_bucket()}/processed-404/{dataset_id}/train'\n", 231 | "tokenized_dataset[\"train\"].save_to_disk(training_input_path)\n", 232 | "\n", 233 | "# save test_dataset to s3\n", 234 | "test_input_path = f's3://{sess.default_bucket()}/processed-404/{dataset_id}/test'\n", 235 | "tokenized_dataset[\"test\"].save_to_disk(test_input_path)\n", 236 | "\n", 237 | "\n", 238 | "print(\"uploaded data to:\")\n", 239 | "print(f\"training dataset to: {training_input_path}\")\n", 240 | "print(f\"test dataset to: {test_input_path}\")" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "## 2. prepare training script and deepspeed launcher\n", 248 | "\n", 249 | "Here we use torch.distributed.launch to launch DeepSpeed on multiple nodes. First, we use start.py to configure some environment variables and invoke the shell script torch_launch.sh. Second, torch_launch.sh configures all of the parameters required for both torch.distributed.launch and the training script run_seq2seq_deepspeed.py.\n", 250 | "In addition, we create a DeepSpeed config file named ds_flan_t5_z3_config_bf16.json to configure our training setup. \n", 251 | "\n", 252 | "We are going to use p4d.24xlarge AWS EC2 instances, each with 8x NVIDIA A100 40GB GPUs. This means we can leverage `bf16`, which reduces the memory footprint of the model by almost 2x and allows us to train efficiently without offloading.
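\n\nAs a rough sketch only (the actual start.py and torch_launch.sh live under src/ and are not reproduced here; the port is illustrative), each node runs launch logic along these lines:\n\n```python\nimport json\nimport os\nimport subprocess\n\n# SageMaker exposes the cluster layout through SM_HOSTS / SM_CURRENT_HOST\nhosts = json.loads(os.environ[\"SM_HOSTS\"])\ncurrent_host = os.environ[\"SM_CURRENT_HOST\"]\n\n# the first node acts as the rendezvous master for torch.distributed.launch\nos.environ[\"MASTER_ADDR\"] = hosts[0]\nos.environ[\"MASTER_PORT\"] = \"7777\"  # illustrative port\nos.environ[\"NODE_INDEX\"] = str(hosts.index(current_host))\n\n# torch_launch.sh then assembles a command along the lines of:\n#   python -m torch.distributed.launch --nproc_per_node 8\n#     --nnodes $NODE_NUMBER --node_rank $NODE_INDEX\n#     --master_addr $MASTER_ADDR --master_port $MASTER_PORT\n#     run_seq2seq_deepspeed.py --deepspeed_config ds_flan_t5_z3_config_bf16.json ...\nsubprocess.run([\"bash\", \"./T5_configz_and_code/scripts/torch_launch.sh\"], check=True)\n```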
\n" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "## 3. Fine-tune FLAN-T5 XXL on Amazon SageMaker\n", 260 | "\n" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "In order to create a sagemaker training job we need an `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. The Estimator manages the infrastructure use. \n", 268 | "SagMaker takes care of starting and managing all the required ec2 instances for us, provides the correct huggingface container, uploads the provided scripts and downloads the data from our S3 bucket into the container at /opt/ml/input/data. Then, it starts the training job by running." 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": { 275 | "tags": [] 276 | }, 277 | "outputs": [], 278 | "source": [ 279 | "import time\n", 280 | "from sagemaker.huggingface import HuggingFace\n", 281 | "from sagemaker import get_execution_role\n", 282 | "\n", 283 | "role = get_execution_role()\n", 284 | "# define Training Job Name \n", 285 | "job_name = f'huggingface-flan-t5-deepspeed-{time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.localtime())}'\n", 286 | "#define the model s3 path which will store your trained model asset\n", 287 | "#Note: you should use your real s3 path to configure model_s3_path\n", 288 | "model_s3_path='s3://your_bucket/flan-t5-xxl-4102xx668899-liangaws/model/'\n", 289 | "\n", 290 | "instance_count = 2\n", 291 | "#define the enviroment variables for your scripts.\n", 292 | "environment = {'NODE_NUMBER':str(instance_count),\n", 293 | " 'FI_PROVIDER': 'efa',\n", 294 | " 'NCCL_PROTO': 'simple',\n", 295 | " 'FI_EFA_USE_DEVICE_RDMA': '1',\n", 296 | " 'NCCL_DEBUG': 'INFO',\n", 297 | " 'MODEL_S3_PATH': model_s3_path\n", 298 | "}\n", 299 | "\n", 300 | "# create the Estimator\n", 301 | "huggingface_estimator = HuggingFace(\n", 302 | " entry_point = 'start.py', # user endpoint script\n", 303 | " source_dir = 'src', # directory which includes all the files needed for training\n", 304 | " instance_type = 'ml.p4d.24xlarge', # instances type used for the training job\n", 305 | " instance_count = instance_count, # the number of instances used for training\n", 306 | " base_job_name = job_name, # the name of the training job\n", 307 | " role = role, # Iam role used in training job to access AWS ressources, e.g. S3\n", 308 | " transformers_version = '4.17', # the transformers version used in the training job\n", 309 | " pytorch_version = '1.10', # the pytorch_version version used in the training job\n", 310 | " py_version = 'py38', # the python version used in the training job\n", 311 | " environment = environment,\n", 312 | ")" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | "We created our `HuggingFace` estimator including the `start.py` as `entry_point` . We can now start our training job, with the `.fit()` method passing our S3 path to the training script." 
320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": null, 325 | "metadata": { 326 | "tags": [] 327 | }, 328 | "outputs": [], 329 | "source": [ 330 | "# define a data input dictionary with our uploaded s3 uris\n", 331 | "#Here we set test_input_path for both the training channel and the test channel to quickly verify the whole training procedure.\n", 332 | "data = {\n", 333 | " 'training': test_input_path,\n", 334 | " 'test': test_input_path\n", 335 | "}\n", 336 | "\n", 337 | "# start the training job with our uploaded datasets as input\n", 338 | "huggingface_estimator.fit(data, wait=True)" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": null, 344 | "metadata": {}, 345 | "outputs": [], 346 | "source": [] 347 | } 348 | ], 349 | "metadata": { 350 | "availableInstances": [ 351 | { 352 | "_defaultOrder": 0, 353 | "_isFastLaunch": true, 354 | "category": "General purpose", 355 | "gpuNum": 0, 356 | "memoryGiB": 4, 357 | "name": "ml.t3.medium", 358 | "vcpuNum": 2 359 | }, 360 | { 361 | "_defaultOrder": 1, 362 | "_isFastLaunch": false, 363 | "category": "General purpose", 364 | "gpuNum": 0, 365 | "memoryGiB": 8, 366 | "name": "ml.t3.large", 367 | "vcpuNum": 2 368 | }, 369 | { 370 | "_defaultOrder": 2, 371 | "_isFastLaunch": false, 372 | "category": "General purpose", 373 | "gpuNum": 0, 374 | "memoryGiB": 16, 375 | "name": "ml.t3.xlarge", 376 | "vcpuNum": 4 377 | }, 378 | { 379 | "_defaultOrder": 3, 380 | "_isFastLaunch": false, 381 | "category": "General purpose", 382 | "gpuNum": 0, 383 | "memoryGiB": 32, 384 | "name": "ml.t3.2xlarge", 385 | "vcpuNum": 8 386 | }, 387 | { 388 | "_defaultOrder": 4, 389 | "_isFastLaunch": true, 390 | "category": "General purpose", 391 | "gpuNum": 0, 392 | "memoryGiB": 8, 393 | "name": "ml.m5.large", 394 | "vcpuNum": 2 395 | }, 396 | { 397 | "_defaultOrder": 5, 398 | "_isFastLaunch": false, 399 | "category": "General purpose", 400 | "gpuNum": 0, 401 | "memoryGiB": 16, 402 | "name": "ml.m5.xlarge", 403 | "vcpuNum": 4 404 | }, 405 | { 406 | "_defaultOrder": 6, 407 | "_isFastLaunch": false, 408 | "category": "General purpose", 409 | "gpuNum": 0, 410 | "memoryGiB": 32, 411 | "name": "ml.m5.2xlarge", 412 | "vcpuNum": 8 413 | }, 414 | { 415 | "_defaultOrder": 7, 416 | "_isFastLaunch": false, 417 | "category": "General purpose", 418 | "gpuNum": 0, 419 | "memoryGiB": 64, 420 | "name": "ml.m5.4xlarge", 421 | "vcpuNum": 16 422 | }, 423 | { 424 | "_defaultOrder": 8, 425 | "_isFastLaunch": false, 426 | "category": "General purpose", 427 | "gpuNum": 0, 428 | "memoryGiB": 128, 429 | "name": "ml.m5.8xlarge", 430 | "vcpuNum": 32 431 | }, 432 | { 433 | "_defaultOrder": 9, 434 | "_isFastLaunch": false, 435 | "category": "General purpose", 436 | "gpuNum": 0, 437 | "memoryGiB": 192, 438 | "name": "ml.m5.12xlarge", 439 | "vcpuNum": 48 440 | }, 441 | { 442 | "_defaultOrder": 10, 443 | "_isFastLaunch": false, 444 | "category": "General purpose", 445 | "gpuNum": 0, 446 | "memoryGiB": 256, 447 | "name": "ml.m5.16xlarge", 448 | "vcpuNum": 64 449 | }, 450 | { 451 | "_defaultOrder": 11, 452 | "_isFastLaunch": false, 453 | "category": "General purpose", 454 | "gpuNum": 0, 455 | "memoryGiB": 384, 456 | "name": "ml.m5.24xlarge", 457 | "vcpuNum": 96 458 | }, 459 | { 460 | "_defaultOrder": 12, 461 | "_isFastLaunch": false, 462 | "category": "General purpose", 463 | "gpuNum": 0, 464 | "memoryGiB": 8, 465 | "name": "ml.m5d.large", 466 | "vcpuNum": 2 467 | }, 468 | { 469 | "_defaultOrder": 13, 470 | "_isFastLaunch": false, 471 | "category": "General purpose", 
472 | "gpuNum": 0, 473 | "memoryGiB": 16, 474 | "name": "ml.m5d.xlarge", 475 | "vcpuNum": 4 476 | }, 477 | { 478 | "_defaultOrder": 14, 479 | "_isFastLaunch": false, 480 | "category": "General purpose", 481 | "gpuNum": 0, 482 | "memoryGiB": 32, 483 | "name": "ml.m5d.2xlarge", 484 | "vcpuNum": 8 485 | }, 486 | { 487 | "_defaultOrder": 15, 488 | "_isFastLaunch": false, 489 | "category": "General purpose", 490 | "gpuNum": 0, 491 | "memoryGiB": 64, 492 | "name": "ml.m5d.4xlarge", 493 | "vcpuNum": 16 494 | }, 495 | { 496 | "_defaultOrder": 16, 497 | "_isFastLaunch": false, 498 | "category": "General purpose", 499 | "gpuNum": 0, 500 | "memoryGiB": 128, 501 | "name": "ml.m5d.8xlarge", 502 | "vcpuNum": 32 503 | }, 504 | { 505 | "_defaultOrder": 17, 506 | "_isFastLaunch": false, 507 | "category": "General purpose", 508 | "gpuNum": 0, 509 | "memoryGiB": 192, 510 | "name": "ml.m5d.12xlarge", 511 | "vcpuNum": 48 512 | }, 513 | { 514 | "_defaultOrder": 18, 515 | "_isFastLaunch": false, 516 | "category": "General purpose", 517 | "gpuNum": 0, 518 | "memoryGiB": 256, 519 | "name": "ml.m5d.16xlarge", 520 | "vcpuNum": 64 521 | }, 522 | { 523 | "_defaultOrder": 19, 524 | "_isFastLaunch": false, 525 | "category": "General purpose", 526 | "gpuNum": 0, 527 | "memoryGiB": 384, 528 | "name": "ml.m5d.24xlarge", 529 | "vcpuNum": 96 530 | }, 531 | { 532 | "_defaultOrder": 20, 533 | "_isFastLaunch": true, 534 | "category": "Compute optimized", 535 | "gpuNum": 0, 536 | "memoryGiB": 4, 537 | "name": "ml.c5.large", 538 | "vcpuNum": 2 539 | }, 540 | { 541 | "_defaultOrder": 21, 542 | "_isFastLaunch": false, 543 | "category": "Compute optimized", 544 | "gpuNum": 0, 545 | "memoryGiB": 8, 546 | "name": "ml.c5.xlarge", 547 | "vcpuNum": 4 548 | }, 549 | { 550 | "_defaultOrder": 22, 551 | "_isFastLaunch": false, 552 | "category": "Compute optimized", 553 | "gpuNum": 0, 554 | "memoryGiB": 16, 555 | "name": "ml.c5.2xlarge", 556 | "vcpuNum": 8 557 | }, 558 | { 559 | "_defaultOrder": 23, 560 | "_isFastLaunch": false, 561 | "category": "Compute optimized", 562 | "gpuNum": 0, 563 | "memoryGiB": 32, 564 | "name": "ml.c5.4xlarge", 565 | "vcpuNum": 16 566 | }, 567 | { 568 | "_defaultOrder": 24, 569 | "_isFastLaunch": false, 570 | "category": "Compute optimized", 571 | "gpuNum": 0, 572 | "memoryGiB": 72, 573 | "name": "ml.c5.9xlarge", 574 | "vcpuNum": 36 575 | }, 576 | { 577 | "_defaultOrder": 25, 578 | "_isFastLaunch": false, 579 | "category": "Compute optimized", 580 | "gpuNum": 0, 581 | "memoryGiB": 96, 582 | "name": "ml.c5.12xlarge", 583 | "vcpuNum": 48 584 | }, 585 | { 586 | "_defaultOrder": 26, 587 | "_isFastLaunch": false, 588 | "category": "Compute optimized", 589 | "gpuNum": 0, 590 | "memoryGiB": 144, 591 | "name": "ml.c5.18xlarge", 592 | "vcpuNum": 72 593 | }, 594 | { 595 | "_defaultOrder": 27, 596 | "_isFastLaunch": false, 597 | "category": "Compute optimized", 598 | "gpuNum": 0, 599 | "memoryGiB": 192, 600 | "name": "ml.c5.24xlarge", 601 | "vcpuNum": 96 602 | }, 603 | { 604 | "_defaultOrder": 28, 605 | "_isFastLaunch": true, 606 | "category": "Accelerated computing", 607 | "gpuNum": 1, 608 | "memoryGiB": 16, 609 | "name": "ml.g4dn.xlarge", 610 | "vcpuNum": 4 611 | }, 612 | { 613 | "_defaultOrder": 29, 614 | "_isFastLaunch": false, 615 | "category": "Accelerated computing", 616 | "gpuNum": 1, 617 | "memoryGiB": 32, 618 | "name": "ml.g4dn.2xlarge", 619 | "vcpuNum": 8 620 | }, 621 | { 622 | "_defaultOrder": 30, 623 | "_isFastLaunch": false, 624 | "category": "Accelerated computing", 625 | "gpuNum": 1, 626 | "memoryGiB": 64, 627 
| "name": "ml.g4dn.4xlarge", 628 | "vcpuNum": 16 629 | }, 630 | { 631 | "_defaultOrder": 31, 632 | "_isFastLaunch": false, 633 | "category": "Accelerated computing", 634 | "gpuNum": 1, 635 | "memoryGiB": 128, 636 | "name": "ml.g4dn.8xlarge", 637 | "vcpuNum": 32 638 | }, 639 | { 640 | "_defaultOrder": 32, 641 | "_isFastLaunch": false, 642 | "category": "Accelerated computing", 643 | "gpuNum": 4, 644 | "memoryGiB": 192, 645 | "name": "ml.g4dn.12xlarge", 646 | "vcpuNum": 48 647 | }, 648 | { 649 | "_defaultOrder": 33, 650 | "_isFastLaunch": false, 651 | "category": "Accelerated computing", 652 | "gpuNum": 1, 653 | "memoryGiB": 256, 654 | "name": "ml.g4dn.16xlarge", 655 | "vcpuNum": 64 656 | }, 657 | { 658 | "_defaultOrder": 34, 659 | "_isFastLaunch": false, 660 | "category": "Accelerated computing", 661 | "gpuNum": 1, 662 | "memoryGiB": 61, 663 | "name": "ml.p3.2xlarge", 664 | "vcpuNum": 8 665 | }, 666 | { 667 | "_defaultOrder": 35, 668 | "_isFastLaunch": false, 669 | "category": "Accelerated computing", 670 | "gpuNum": 4, 671 | "memoryGiB": 244, 672 | "name": "ml.p3.8xlarge", 673 | "vcpuNum": 32 674 | }, 675 | { 676 | "_defaultOrder": 36, 677 | "_isFastLaunch": false, 678 | "category": "Accelerated computing", 679 | "gpuNum": 8, 680 | "memoryGiB": 488, 681 | "name": "ml.p3.16xlarge", 682 | "vcpuNum": 64 683 | }, 684 | { 685 | "_defaultOrder": 37, 686 | "_isFastLaunch": false, 687 | "category": "Accelerated computing", 688 | "gpuNum": 8, 689 | "memoryGiB": 768, 690 | "name": "ml.p3dn.24xlarge", 691 | "vcpuNum": 96 692 | }, 693 | { 694 | "_defaultOrder": 38, 695 | "_isFastLaunch": false, 696 | "category": "Memory Optimized", 697 | "gpuNum": 0, 698 | "memoryGiB": 16, 699 | "name": "ml.r5.large", 700 | "vcpuNum": 2 701 | }, 702 | { 703 | "_defaultOrder": 39, 704 | "_isFastLaunch": false, 705 | "category": "Memory Optimized", 706 | "gpuNum": 0, 707 | "memoryGiB": 32, 708 | "name": "ml.r5.xlarge", 709 | "vcpuNum": 4 710 | }, 711 | { 712 | "_defaultOrder": 40, 713 | "_isFastLaunch": false, 714 | "category": "Memory Optimized", 715 | "gpuNum": 0, 716 | "memoryGiB": 64, 717 | "name": "ml.r5.2xlarge", 718 | "vcpuNum": 8 719 | }, 720 | { 721 | "_defaultOrder": 41, 722 | "_isFastLaunch": false, 723 | "category": "Memory Optimized", 724 | "gpuNum": 0, 725 | "memoryGiB": 128, 726 | "name": "ml.r5.4xlarge", 727 | "vcpuNum": 16 728 | }, 729 | { 730 | "_defaultOrder": 42, 731 | "_isFastLaunch": false, 732 | "category": "Memory Optimized", 733 | "gpuNum": 0, 734 | "memoryGiB": 256, 735 | "name": "ml.r5.8xlarge", 736 | "vcpuNum": 32 737 | }, 738 | { 739 | "_defaultOrder": 43, 740 | "_isFastLaunch": false, 741 | "category": "Memory Optimized", 742 | "gpuNum": 0, 743 | "memoryGiB": 384, 744 | "name": "ml.r5.12xlarge", 745 | "vcpuNum": 48 746 | }, 747 | { 748 | "_defaultOrder": 44, 749 | "_isFastLaunch": false, 750 | "category": "Memory Optimized", 751 | "gpuNum": 0, 752 | "memoryGiB": 512, 753 | "name": "ml.r5.16xlarge", 754 | "vcpuNum": 64 755 | }, 756 | { 757 | "_defaultOrder": 45, 758 | "_isFastLaunch": false, 759 | "category": "Memory Optimized", 760 | "gpuNum": 0, 761 | "memoryGiB": 768, 762 | "name": "ml.r5.24xlarge", 763 | "vcpuNum": 96 764 | }, 765 | { 766 | "_defaultOrder": 46, 767 | "_isFastLaunch": false, 768 | "category": "Accelerated computing", 769 | "gpuNum": 1, 770 | "memoryGiB": 16, 771 | "name": "ml.g5.xlarge", 772 | "vcpuNum": 4 773 | }, 774 | { 775 | "_defaultOrder": 47, 776 | "_isFastLaunch": false, 777 | "category": "Accelerated computing", 778 | "gpuNum": 1, 779 | "memoryGiB": 32, 780 | 
"name": "ml.g5.2xlarge", 781 | "vcpuNum": 8 782 | }, 783 | { 784 | "_defaultOrder": 48, 785 | "_isFastLaunch": false, 786 | "category": "Accelerated computing", 787 | "gpuNum": 1, 788 | "memoryGiB": 64, 789 | "name": "ml.g5.4xlarge", 790 | "vcpuNum": 16 791 | }, 792 | { 793 | "_defaultOrder": 49, 794 | "_isFastLaunch": false, 795 | "category": "Accelerated computing", 796 | "gpuNum": 1, 797 | "memoryGiB": 128, 798 | "name": "ml.g5.8xlarge", 799 | "vcpuNum": 32 800 | }, 801 | { 802 | "_defaultOrder": 50, 803 | "_isFastLaunch": false, 804 | "category": "Accelerated computing", 805 | "gpuNum": 1, 806 | "memoryGiB": 256, 807 | "name": "ml.g5.16xlarge", 808 | "vcpuNum": 64 809 | }, 810 | { 811 | "_defaultOrder": 51, 812 | "_isFastLaunch": false, 813 | "category": "Accelerated computing", 814 | "gpuNum": 4, 815 | "memoryGiB": 192, 816 | "name": "ml.g5.12xlarge", 817 | "vcpuNum": 48 818 | }, 819 | { 820 | "_defaultOrder": 52, 821 | "_isFastLaunch": false, 822 | "category": "Accelerated computing", 823 | "gpuNum": 4, 824 | "memoryGiB": 384, 825 | "name": "ml.g5.24xlarge", 826 | "vcpuNum": 96 827 | }, 828 | { 829 | "_defaultOrder": 53, 830 | "_isFastLaunch": false, 831 | "category": "Accelerated computing", 832 | "gpuNum": 8, 833 | "memoryGiB": 768, 834 | "name": "ml.g5.48xlarge", 835 | "vcpuNum": 192 836 | } 837 | ], 838 | "instance_type": "ml.m5.large", 839 | "kernelspec": { 840 | "display_name": "Python 3 (Data Science)", 841 | "language": "python", 842 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" 843 | }, 844 | "language_info": { 845 | "codemirror_mode": { 846 | "name": "ipython", 847 | "version": 3 848 | }, 849 | "file_extension": ".py", 850 | "mimetype": "text/x-python", 851 | "name": "python", 852 | "nbconvert_exporter": "python", 853 | "pygments_lexer": "ipython3", 854 | "version": "3.7.10" 855 | }, 856 | "vscode": { 857 | "interpreter": { 858 | "hash": "2d58e898dde0263bc564c6968b04150abacfd33eed9b19aaa8e45c040360e146" 859 | } 860 | } 861 | }, 862 | "nbformat": 4, 863 | "nbformat_minor": 4 864 | } 865 | -------------------------------------------------------------------------------- /Flan-T5-XXL-multiple-nodes-training-and-deploy-on-SageMaker/fine-tuning/src/T5_configz_and_code/configs/ds_flan_t5_z3_config_bf16.json: -------------------------------------------------------------------------------- 1 | { 2 | "bf16": { 3 | "enabled": "auto" 4 | }, 5 | "optimizer": { 6 | "type": "AdamW", 7 | "params": { 8 | "lr": "auto", 9 | "betas": "auto", 10 | "eps": "auto", 11 | "weight_decay": "auto" 12 | } 13 | }, 14 | "scheduler": { 15 | "type": "WarmupLR", 16 | "params": { 17 | "warmup_min_lr": "auto", 18 | "warmup_max_lr": "auto", 19 | "warmup_num_steps": "auto" 20 | } 21 | }, 22 | "zero_optimization": { 23 | "stage": 3, 24 | "overlap_comm": true, 25 | "contiguous_gradients": true, 26 | "sub_group_size": 1e9, 27 | "reduce_bucket_size": "auto", 28 | "stage3_prefetch_bucket_size": "auto", 29 | "stage3_param_persistence_threshold": "auto", 30 | "stage3_max_live_parameters": 1e9, 31 | "stage3_max_reuse_distance": 1e9, 32 | "stage3_gather_16bit_weights_on_model_save": true 33 | }, 34 | "gradient_accumulation_steps": "auto", 35 | "gradient_clipping": "auto", 36 | "steps_per_print": 2000, 37 | "train_batch_size": "auto", 38 | "train_micro_batch_size_per_gpu": "auto", 39 | "wall_clock_breakdown": false 40 | } -------------------------------------------------------------------------------- 
/Flan-T5-XXL-multiple-nodes-training-and-deploy-on-SageMaker/fine-tuning/src/T5_configz_and_code/scripts/run_seq2seq_deepspeed.py: -------------------------------------------------------------------------------- 1 | # *************************************************************************************** 2 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. * 3 | # * 4 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this * 5 | # software and associated documentation files (the "Software"), to deal in the Software * 6 | # without restriction, including without limitation the rights to use, copy, modify, * 7 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to * 8 | # permit persons to whom the Software is furnished to do so. * 9 | # * 10 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, * 11 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A * 12 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT * 13 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION * 14 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE * 15 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. * 16 | # *************************************************************************************** 17 | 18 | 19 | import os 20 | import argparse 21 | import numpy as np 22 | from transformers import ( 23 | AutoModelForSeq2SeqLM, 24 | DataCollatorForSeq2Seq, 25 | AutoTokenizer, 26 | set_seed, 27 | ) 28 | 29 | from datasets import load_from_disk 30 | import torch 31 | import torch.distributed as dist 32 | import evaluate 33 | 34 | import deepspeed 35 | from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments 36 | 37 | import nltk 38 | 39 | 40 | def postprocess_text(preds, labels): 41 | preds = [pred.strip() for pred in preds] 42 | labels = [label.strip() for label in labels] 43 | 44 | # rougeLSum expects newline after each sentence 45 | preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds] 46 | labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels] 47 | 48 | return preds, labels 49 | 50 | def parse_args(): 51 | """Parse the arguments.""" 52 | parser = argparse.ArgumentParser() 53 | # add model id and dataset path argument 54 | parser.add_argument("--model_id", type=str, default="google/flan-t5-xl", help="Model id to use for training.") 55 | parser.add_argument("--train_dataset_path", type=str, help="Path to the processed dataset stored by SageMaker.") 56 | parser.add_argument("--test_dataset_path", type=str, help="Path to the processed dataset stored by SageMaker.") 57 | # add training hyperparameters for epochs, batch size, learning rate, and seed 58 | parser.add_argument("--epochs", type=int, default=3, help="Number of epochs to train for.") 59 | parser.add_argument("--per_device_train_batch_size", type=int, default=8, help="Batch size to use for training.") 60 | parser.add_argument("--per_device_eval_batch_size", type=int, default=8, help="Batch size to use for testing.") 61 | parser.add_argument("--generation_max_length", type=int, default=140, help="Maximum length to use for generation") 62 | parser.add_argument("--generation_num_beams", type=int, default=4, help="Number of beams to use for generation.") 63 | parser.add_argument("--learning_rate", type=float, default=3e-3, help="Learning rate to use for training.") 64 | 
64 |     parser.add_argument("--seed", type=int, default=42, help="Seed to use for training.")
65 |     parser.add_argument("--gradient_checkpointing", type=bool, default=True, help="Whether to use gradient checkpointing.")  # note: argparse's type=bool treats any non-empty string as True, so rely on the default here
66 |     parser.add_argument(
67 |         "--bf16",
68 |         type=bool,
69 |         default=torch.cuda.get_device_capability()[0] >= 8,  # bf16 requires Ampere (sm_80) or newer GPUs
70 |         help="Whether to use bf16.",
71 |     )
72 | 
73 |     # Include DeepSpeed configuration arguments
74 |     parser = deepspeed.add_config_arguments(parser)
75 |     args = parser.parse_known_args()
76 |     return args
77 | 
78 | 
79 | def training_function(args):
80 |     # set seed
81 |     set_seed(args.seed)
82 | 
83 |     # load the datasets from disk and the tokenizer
84 |     train_dataset = load_from_disk(args.train_dataset_path)
85 |     eval_dataset = load_from_disk(args.test_dataset_path)
86 |     tokenizer = AutoTokenizer.from_pretrained(args.model_id)
87 |     # load the model from the hub
88 |     model = AutoModelForSeq2SeqLM.from_pretrained(
89 |         args.model_id,
90 |         use_cache=not args.gradient_checkpointing,  # caching is incompatible with gradient checkpointing
91 |         cache_dir="/tmp/input/"  # on instances with large local storage such as p4d.24xlarge, /tmp has enough space to cache the model files
92 |     )
93 | 
94 |     # we want to ignore the tokenizer's pad token in the loss
95 |     label_pad_token_id = -100
96 |     # Data collator
97 |     data_collator = DataCollatorForSeq2Seq(
98 |         tokenizer, model=model, label_pad_token_id=label_pad_token_id, pad_to_multiple_of=8
99 |     )
100 | 
101 |     # Define the compute metrics function
102 |     def compute_metrics(eval_preds):
103 |         preds, labels = eval_preds
104 |         if isinstance(preds, tuple):
105 |             preds = preds[0]
106 |         decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
107 |         # Replace -100 in the labels as we can't decode them.
108 |         labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
109 |         decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
110 | 
111 |         # Some simple post-processing
112 |         decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
113 | 
114 |         metric = evaluate.load("rouge")
115 |         result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
116 |         result = {k: round(v * 100, 4) for k, v in result.items()}
117 |         prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
118 |         result["gen_len"] = np.mean(prediction_lens)
119 |         return result
120 | 
121 |     # Define training args
122 |     # If you just want to save the best model weights, you can set output_dir to a temporary path such as '/tmp' on p4d.24xlarge.
123 |     # If you want to keep every checkpoint produced during training, set output_dir to the checkpoint local path instead (this slows down multi-node training: SageMaker uploads checkpoints to S3 in near real time, which consumes network bandwidth and hurts the communication efficiency between the nodes in the cluster).
124 |     output_dir = '/tmp'
125 |     training_args = Seq2SeqTrainingArguments(
126 |         output_dir=output_dir,
127 |         per_device_train_batch_size=args.per_device_train_batch_size,
128 |         per_device_eval_batch_size=args.per_device_eval_batch_size,
129 |         predict_with_generate=True,
130 |         generation_max_length=args.generation_max_length,
131 |         generation_num_beams=args.generation_num_beams,
132 |         fp16=False,  # T5 overflows with fp16
133 |         bf16=args.bf16,  # Use BF16 if available
134 |         learning_rate=args.learning_rate,
135 |         num_train_epochs=args.epochs,
136 |         max_steps=80,  # note: max_steps overrides num_train_epochs; remove it to train for the full number of epochs
137 |         deepspeed=args.deepspeed_config,
138 |         save_on_each_node=True,  # By default, DeepSpeed expects a multi-node environment to use shared storage. If that is not the case and each node can only see its local filesystem, you need to set this parameter to true.
139 |         gradient_checkpointing=args.gradient_checkpointing,
140 |         # logging & evaluation strategies
141 |         logging_dir=f"{output_dir}/logs",
142 |         logging_strategy="steps",
143 |         logging_steps=40,
144 |         evaluation_strategy="steps",
145 |         save_strategy="no",
146 |         eval_steps=60,
147 |         save_total_limit=2,
148 |         load_best_model_at_end=False,  # must be set to False for DeepSpeed multi-node training
149 |     )
150 | 
151 |     # Create the Trainer instance
152 |     trainer = Seq2SeqTrainer(
153 |         model=model,
154 |         args=training_args,
155 |         train_dataset=train_dataset,
156 |         eval_dataset=eval_dataset,
157 |         data_collator=data_collator,
158 |         # compute_metrics=compute_metrics,  # with compute_metrics enabled, the evaluation procedure is very slow, so it is commented out here
159 |     )
160 | 
161 |     # Start training
162 |     trainer.train()
163 | 
164 |     # We now save the model assets to an intermediate path.
165 |     # Note: please do not save the model into /opt/ml/model, because SageMaker will tar and compress everything under /opt/ml/model, which takes a long time for an LLM.
166 |     print("------saving model!-----")
167 |     save_model_dir = '/tmp/output/asset/'
168 |     tokenizer.save_pretrained(save_model_dir)
169 |     trainer.save_model(save_model_dir)
170 |     print("------model is saved!-----")
171 | 
172 |     # Note: only the rank 0 process uploads the trained model assets to S3, via the s5cmd command.
173 |     WORLD_RANK = int(os.environ['RANK'])
174 |     if WORLD_RANK == 0:
175 |         os.system("./T5_configz_and_code/scripts/s5cmd sync {0} {1}".format(save_model_dir, os.environ['MODEL_S3_PATH']))
176 | 
177 |     # Note: the barrier syncs every rank and ensures that rank 0 has uploaded the model assets successfully.
178 |     torch.distributed.barrier()
179 | 
180 | def main():
181 |     # Note: the "_" is needed because parse_args() returns a tuple.
182 |     args, _ = parse_args()
183 | 
184 |     # Environment variables set by torch.distributed.launch
185 |     LOCAL_RANK = int(os.environ['LOCAL_RANK'])
186 |     WORLD_SIZE = int(os.environ['WORLD_SIZE'])
187 |     WORLD_RANK = int(os.environ['RANK'])
188 | 
189 |     dist.init_process_group(backend='nccl', rank=WORLD_RANK, world_size=WORLD_SIZE)
190 | 
191 |     if LOCAL_RANK != 0:
192 |         print("---------local rank {0}".format(LOCAL_RANK))
193 |     else:
194 |         print("------download and unzip nltk punkt for local rank 0!-----")
195 |         nltk.download("punkt", quiet=True)
196 | 
197 |     # Note: the barrier ensures that only local rank 0 downloads and unzips punkt; without it the training job may fail.
198 |     torch.distributed.barrier()
199 | 
200 |     training_function(args)
201 | 
202 | 
203 | if __name__ == "__main__":
204 |     main()
205 | 
--------------------------------------------------------------------------------
/Flan-T5-XXL-multiple-nodes-training-and-deploy-on-SageMaker/fine-tuning/src/T5_configz_and_code/scripts/torch_launch.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | #Please change the folder "T5_configz_and_code" to the folder that contains your config files and main code.
 4 | WORKING_DIR=/opt/ml/code/T5_configz_and_code
 5 | SM_WORKING_DIR=/opt/ml/model
 6 | 
 7 | #Information about the multi-node cluster.
 8 | MASTER_HOST=$SM_MASTER
 9 | MASTER_ADDR=$SM_MASTER_ADDR
10 | MASTER_PORT="23456"
11 | NNODES="$NODE_NUMBER"
12 | NODE_RANK="$NODE_INDEX"
13 | 
14 | #Configure the distributed arguments for torch.distributed.launch.
15 | GPUS_PER_NODE="$SM_NUM_GPUS"
16 | DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE \
17 |                   --nnodes $NNODES --node_rank $NODE_RANK \
18 |                   --master_addr $MASTER_ADDR \
19 |                   --master_port $MASTER_PORT"
20 | 
21 | SAVE_PATH="${SM_WORKING_DIR}/results"
22 | LOG_FILE="${SAVE_PATH}/log.txt"
23 | 
24 | #Set the path of your DeepSpeed config file.
25 | DS_CONFIG="${WORKING_DIR}/configs/ds_flan_t5_z3_config_bf16.json"
26 | 
27 | #Configure the training parameters according to your model and dataset.
28 | #Note: set train_dataset_path and test_dataset_path according to your input data channel names.
29 | EPOCHS=1
30 | model_id="google/flan-t5-xxl"
31 | train_dataset_path='/opt/ml/input/data/training'
32 | test_dataset_path='/opt/ml/input/data/test'
33 | learning_rate=0.0001
34 | generation_max_length=150
35 | per_device_train_batch_size=1
36 | per_device_eval_batch_size=8
37 | 
38 | OPTS=""
39 | OPTS+=" --per_device_eval_batch_size ${per_device_eval_batch_size}"
40 | OPTS+=" --per_device_train_batch_size ${per_device_train_batch_size}"
41 | OPTS+=" --generation_max_length ${generation_max_length}"
42 | OPTS+=" --test_dataset_path ${test_dataset_path}"
43 | OPTS+=" --model_id ${model_id}"
44 | OPTS+=" --train_dataset_path ${train_dataset_path}"
45 | OPTS+=" --distributed-backend nccl"
46 | OPTS+=" --learning_rate ${learning_rate}"
47 | OPTS+=" --deepspeed"
48 | OPTS+=" --deepspeed_config ${DS_CONFIG}"
49 | OPTS+=" --epochs ${EPOCHS}"
50 | 
51 | CMD="python -m torch.distributed.launch ${DISTRIBUTED_ARGS} ${WORKING_DIR}/scripts/run_seq2seq_deepspeed.py ${OPTS}"
52 | 
53 | echo ${CMD}
54 | mkdir -p ${SAVE_PATH}
55 | ${CMD} 2>&1 | tee ${SAVE_PATH}/train_log
56 | 
--------------------------------------------------------------------------------
/Flan-T5-XXL-multiple-nodes-training-and-deploy-on-SageMaker/fine-tuning/src/requirements.txt:
--------------------------------------------------------------------------------
 1 | transformers==4.26.0
 2 | datasets==2.9.0
 3 | accelerate==0.16.0
 4 | evaluate==0.4.0
 5 | deepspeed==0.8.0
 6 | ninja
 7 | rouge-score
 8 | nltk
 9 | py7zr
10 | 
--------------------------------------------------------------------------------
/Flan-T5-XXL-multiple-nodes-training-and-deploy-on-SageMaker/fine-tuning/src/start.py:
--------------------------------------------------------------------------------
 1 | # ***************************************************************************************
 2 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. *
 3 | # *
 4 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this *
 5 | # software and associated documentation files (the "Software"), to deal in the Software *
 6 | # without restriction, including without limitation the rights to use, copy, modify, *
 7 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to *
 8 | # permit persons to whom the Software is furnished to do so. *
 9 | # *
10 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, *
11 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A *
12 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT *
13 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION *
14 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE *
15 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. *
16 | # ***************************************************************************************
17 | 
18 | import os
19 | import json
20 | import socket
21 | 
22 | if __name__ == "__main__":
23 | 
24 |     hosts = json.loads(os.environ['SM_HOSTS'])
25 |     current_host = os.environ['SM_CURRENT_HOST']
26 |     host_rank = int(hosts.index(current_host))
27 | 
28 |     # Resolve the IP address of the master node in the SageMaker multi-node training cluster.
29 |     master = json.loads(os.environ['SM_TRAINING_ENV'])['master_hostname']
30 |     master_addr = socket.gethostbyname(master)
31 | 
32 |     os.environ['NODE_INDEX'] = str(host_rank)
33 |     os.environ['SM_MASTER'] = str(master)
34 |     os.environ['SM_MASTER_ADDR'] = str(master_addr)
35 |     os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'
36 | 
37 |     # Invoke the torch launcher shell script.
38 |     # Note: we use the PyTorch launcher to launch DeepSpeed for multi-node training.
39 |     # Note: we use s5cmd to speed up uploading the model assets to S3.
40 |     os.system("chmod +x ./T5_configz_and_code/scripts/torch_launch.sh")
41 |     os.system("chmod +x ./T5_configz_and_code/scripts/s5cmd")
42 |     os.system("/bin/bash -c ./T5_configz_and_code/scripts/torch_launch.sh")
43 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT No Attribution
 2 | 
 3 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of
 6 | this software and associated documentation files (the "Software"), to deal in
 7 | the Software without restriction, including without limitation the rights to
 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
 9 | the Software, and to permit persons to whom the Software is furnished to do so.
10 | 
11 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
12 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
13 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
14 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
15 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
16 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
17 | 
18 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Training an LLM on Amazon SageMaker across multiple nodes with DeepSpeed
 2 | 
 3 | This repo shows the complete code for:
 4 | 1. Fine-tuning an LLM with DeepSpeed on SageMaker across multiple nodes.
 5 | 2. Deploying the model trained in step 1 on SageMaker.
 6 | 
 7 | ## Prerequisites:
 8 | 
 9 | a. Download the "s5cmd" command and uncompress it (for example: curl -L https://github.com/peak/s5cmd/releases/download/v2.0.0/s5cmd_2.0.0_Linux-64bit.tar.gz | tar -xz).
10 | 
11 | b. Clone this repo.
12 | 
13 | c. Move the "s5cmd" binary to the path "Flan-T5-XXL-multiple-nodes-training-and-deploy-on-SageMaker/fine-tuning/src/T5_configz_and_code/scripts/".
14 | 
15 | The repo was tested successfully on the Data Science image and Python 3 kernel of SageMaker Studio with an ml.m5.large kernel gateway instance in the us-east-1 region. (If you encounter a kernel restarting issue while preparing the dataset in DeepSpeed-Flan-T5-on-Sagemaker-multiple-nodes.ipynb, I suggest shutting down the kernel gateway instance and re-executing the notebook.)
16 | 
17 | ## Fine-tuning an LLM such as Flan-T5-XXL
18 | 
19 | We use torch.distributed.launch + DeepSpeed + the Hugging Face Trainer API to fine-tune Flan-T5-XXL on AWS SageMaker across multiple nodes (set the environment variable "NODE_NUMBER" to 1 to reuse the same code for multi-GPU training on a single node). You can follow the folder structure, prepare your training script, and configure the related parameters in the torch_launch.sh script. If you use the HF high-level Trainer API to train a CausalLM (such as GPT-J) or a Seq2SeqLM (such as T5), very little code needs to be modified.
20 | 
21 | More about these files: start.py, the user entry point, sets some environment variables (such as the master's IP address) and invokes the torch_launch.sh script. Most parameters (both training parameters and torch distributed launcher parameters) should be configured in torch_launch.sh, which finally invokes your training Python script. You can also use requirements.txt to install the related Python libraries.
22 | 
23 | Currently, the code uses the Hugging Face API to download the model assets from the HF model hub. If your SageMaker training job hits a timeout while downloading from the hub, just restart the training job to retry. Alternatively, you can download the model assets from HF separately and upload them directly to Amazon S3 (do not tar or compress the files). Then, in your training script, use s5cmd to download them from S3 on local rank 0 only, with torch.distributed.barrier() to sync up all ranks (please refer to https://github.com/yuhuiaws/finetuning-and-deploying-llama-on-Sagemaker/blob/main/finetuning-llama-by-deepspeed/train.py); this is much faster than downloading through the HF API. A sketch of the pattern follows.
24 | 
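A minimal sketch of that download pattern (the bucket/prefix s3://your-bucket/flan-t5-xxl/, the local directory, and the location of the s5cmd binary are placeholder assumptions; it also assumes the process group has already been initialized and that the launcher has set LOCAL_RANK):

    import os
    import torch.distributed as dist
    from transformers import AutoModelForSeq2SeqLM

    model_dir = "/tmp/model_assets/"  # local path with enough disk space (e.g. on p4d.24xlarge)
    if int(os.environ["LOCAL_RANK"]) == 0:
        # Only one process per node downloads; s5cmd parallelizes the S3 transfer.
        os.system(f"./s5cmd sync s3://your-bucket/flan-t5-xxl/* {model_dir}")
    dist.barrier()  # the other ranks block here until the download has finished
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

The barrier is what makes this safe: every other rank waits until rank 0 has fully populated the local directory before loading from it.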
25 | The built-in SageMaker Hugging Face training container may have changed over time, which can cause DeepSpeed training on SageMaker to fail. The workaround is as follows:
26 | 
27 | a. Change the requirements.txt as follows:
28 | 
29 |     transformers==4.28.1
30 |     datasets
31 |     sentencepiece
32 |     accelerate
33 |     evaluate
34 |     deepspeed==0.9.2
35 |     ninja
36 |     rouge-score
37 |     bitsandbytes
38 | 
39 | As for the transformers version, you should probably upgrade it to 4.36.0 because of the Deserialization of Untrusted Data vulnerability in earlier releases.
40 | 
41 | b. Switch from the SageMaker Hugging Face training container to the SageMaker PyTorch 1.12 training container:
42 | 
43 |     from sagemaker.pytorch import PyTorch
44 |     estimator = PyTorch(entry_point='start.py',
45 |                         source_dir='src',
46 |                         instance_type='ml.p4d.24xlarge',
47 |                         instance_count=2,
48 |                         role=role,
49 |                         base_job_name=job_name,
50 |                         keep_alive_period_in_seconds=1800,
51 |                         framework_version='1.12.0',
52 |                         py_version='py38',
53 |                         environment=environment,
54 |                         disable_profiler=True,
55 |                         debugger_hook_config=False)
56 | 
57 | 
58 | ### Some useful tips:
59 | 
60 | 1. This repo uses the open source "s5cmd" binary: after saving the model to the container's local path, use the "s5cmd" command to speed up uploading the model assets to S3 (do not tar or compress the assets, just upload them directly).
61 | 2. When training an LLM with DeepSpeed ZeRO stage 2 on multiple nodes in SageMaker, the job may hang until the NCCL communication times out. When that happens, check the GPU memory utilization of the training instances in Amazon CloudWatch. In my experiment the GPU memory was almost full (although no OOM occurred), which can be a signal to switch to ZeRO stage 3 (the issue disappeared when I switched to stage 3).
62 | 3. By default, DeepSpeed expects that a multi-node environment uses shared storage. If this is not the case and each node can only see its local filesystem, you need to set the parameter "save_on_each_node" of the Seq2SeqTrainingArguments or TrainingArguments API to true (in this repo I didn't use a shared data store such as EFS to save the model, so I set "save_on_each_node" to true).
63 | 4. When training on multiple GPUs with DeepSpeed, if the parameter "stage3_gather_16bit_weights_on_model_save" in the DeepSpeed config file is set to false, no pytorch_model.bin will be generated at the end. You can use the zero_to_fp32.py script (located in the saved model assets path) to convert the DeepSpeed ZeRO sharded checkpoints into an fp32 PyTorch model bin on a SageMaker notebook instance or in SageMaker Studio; the procedure consumes a lot of memory and time (see the sketch at the end of this README). If "stage3_gather_16bit_weights_on_model_save" is set to true, pytorch_model.bin will be generated at the end ("stage3_gather_16bit_weights_on_model_save" enables fp16 weight consolidation when the model gets saved; with large models and multiple GPUs this is an expensive operation in terms of both memory and speed). So how should you configure "stage3_gather_16bit_weights_on_model_save"? It is up to you; I set it to true if the training speed does not drop significantly.
64 | 5. When you use DeepSpeed multi-node training and set the parameter "load_best_model_at_end" (from the Seq2SeqTrainingArguments or TrainingArguments API) to true, an error may occur when the training procedure finishes. The error looks like the following:
65 | 
66 |     Could not locate the best model at /tmp/checkpoint-60/pytorch_model.bin, if you are running
67 |     distributed training on multiple nodes, you should activate `--save_on_each_node`.
68 | 
69 | In fact, I had already set the parameter "save_on_each_node" to true (my environment: transformers 4.26.0, PyTorch 1.10, Python 3.8). My fix is to save only the final model and set "load_best_model_at_end" to false, which avoids the issue.
70 | 
71 | 6. If you just want to save the best model weights, you can set the parameter "output_dir" (from the Seq2SeqTrainingArguments or TrainingArguments API) to a temporary path such as '/tmp' on p4d.24xlarge ("/tmp" has enough disk space). If you want to keep all of the checkpoints produced during training, set output_dir to the checkpoint local path instead (this slows down multi-node training: SageMaker uploads the checkpoints to S3 in near real time, which consumes network bandwidth and hurts the communication efficiency between the nodes in the cluster).
72 | 7. When the parameter "compute_metrics" of the Trainer or Seq2SeqTrainer API is used, the evaluation procedure is very slow. So if you just want the whole training process to run through, you can comment out "compute_metrics".
73 | 8. When your training script downloads something from the internet (such as nltk.download("punkt")), you should ensure that only one process per node (local rank 0) downloads the files; otherwise the training job may fail with errors like these:
74 | 
75 |     Traceback (most recent call last):
76 |       File "/opt/ml/code/T5_configz_and_code/scripts/run_seq2seq_deepspeed.py", line 26, in 
77 |         nltk.download("punkt", quiet=True)
78 |       File "/opt/conda/lib/python3.8/site-packages/nltk/downloader.py", line 777, in download
79 |         for msg in self.incr_download(info_or_id, download_dir, force):
80 |       File "/opt/conda/lib/python3.8/site-packages/nltk/downloader.py", line 642, in incr_download
81 |         yield from self._download_package(info, download_dir, force)
82 |       File "/opt/conda/lib/python3.8/site-packages/nltk/downloader.py", line 699, in _download_package
83 |         os.makedirs(download_dir)
84 |       File "/opt/conda/lib/python3.8/os.py", line 223, in makedirs
85 |         mkdir(name, mode)
86 |     FileExistsError: [Errno 17] File exists: '/root/nltk_data'
87 |     [nltk_data] [Errno 2] No such file or directory:
88 |     [nltk_data]     '/root/nltk_data/tokenizers/punkt.zip'
89 |     [nltk_data]     Error with downloaded zip file
90 |     [nltk_data] [Errno 2] No such file or directory:
91 |     [nltk_data]     '/root/nltk_data/tokenizers/punkt.zip'
92 |     Downloading builder script: 0%| | 0.00/6.27k [00:00
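The fix used in this repo (see main() in run_seq2seq_deepspeed.py) boils down to the following guard pattern. A minimal sketch, assuming the process group has already been initialized and that the launcher has set the LOCAL_RANK environment variable:

    import os
    import nltk
    import torch.distributed as dist

    if int(os.environ["LOCAL_RANK"]) == 0:
        nltk.download("punkt", quiet=True)  # exactly one download per node
    dist.barrier()  # the other ranks wait here until the download has finished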
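Regarding tip 4 above, here is a minimal sketch of the checkpoint conversion using the helper that DeepSpeed exposes for the same logic. The checkpoint path is hypothetical, and the signature is the one from the DeepSpeed versions pinned in this repo (0.8.x/0.9.x); on those versions you can equivalently run the saved script directly from inside the checkpoint directory as "python zero_to_fp32.py . pytorch_model.bin":

    # Consolidate the sharded ZeRO-3 checkpoints into a single fp32 pytorch_model.bin.
    # Expect this to need several times the model size in CPU memory for an 11B model.
    from deepspeed.utils.zero_to_fp32 import convert_zero_checkpoint_to_fp32_state_dict

    convert_zero_checkpoint_to_fp32_state_dict(
        checkpoint_dir="/tmp/checkpoint-80",                    # hypothetical path containing the ZeRO shards
        output_file="/tmp/checkpoint-80/pytorch_model.bin",     # consolidated fp32 weights
    )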